CAPTURING MEANING IN PRODUCT REVIEWS WITH CHARACTER-LEVEL GENERATIVE TEXT MODELS

Zachary C. Lipton ...
4 EXPERIMENTS

All experiments are executed with a custom recurrent neural network library written in Python, using Theano (Bergstra et al., 2010) for GPU acceleration. Our networks use 2 hidden layers with 1024 nodes per layer. During training, examples are processed in mini-batches, and we update weights with RMSprop (Tieleman and Hinton, 2012). To assemble batches, we concatenate all reviews in the training set, delimiting them with STR and EOS tokens. We split this string into mini-batches of size 256 and again split each mini-batch into segments of sequence length 200. Furthermore, LSTM state is preserved across batches during training. To combat exploding gradients, we clip the elements of each gradient at ±5. We found that the concatenated-input model trains faster if we first train an unsupervised character-level generative RNN to convergence and then transplant its weights to initialize the concatenated-input RNN. We implement two nets in this fashion: one using the star rating scaled to [-1, 1] as x_aux, and a second using a one-hot encoding of 5 beer categories as x_aux.
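A minimal sketch of this batching scheme follows; the helper name make_batches, the char_to_idx mapping, and the delimiter byte values are our assumptions for illustration, not the paper's actual code.

```python
import numpy as np

# Hypothetical delimiter tokens and vocabulary mapping; the paper does
# not specify its encoding details, so these are assumptions.
STR, EOS = "\x02", "\x03"

def make_batches(reviews, char_to_idx, batch_size=256, seq_len=200):
    """Concatenate all reviews into one string delimited by STR/EOS,
    split it into `batch_size` parallel character streams, then cut
    each stream into `seq_len`-character segments for truncated BPTT.
    LSTM state can then be carried across consecutive segments."""
    text = "".join(STR + r + EOS for r in reviews)
    data = np.array([char_to_idx[c] for c in text], dtype=np.int32)
    stream_len = len(data) // batch_size
    # Arrange as (batch_size, stream_len) so that segment t+1 continues
    # exactly where segment t left off in every row of the batch.
    data = data[: batch_size * stream_len].reshape(batch_size, stream_len)
    for start in range(0, stream_len - seq_len, seq_len):
        x = data[:, start : start + seq_len]
        y = data[:, start + 1 : start + seq_len + 1]  # next-char targets
        yield x, y

# Element-wise gradient clipping at +/-5, applied before each update:
# grad = np.clip(grad, -5.0, 5.0)
```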
Figure 3: (top) Probability of each category and (bottom) most likely star rating as each letter is encountered. The RNN learns Budweiser is a lager and that stouts and porters are heavy. It learns to tilt positive by the ‘c’ in ‘excellent’ and that the ‘f’ in ‘awful’ reveals negative sentiment.
4.1 GENERATING TEXT
Running the concatenated-input RNN in generative mode and conditioning upon a 5-star rating, we produce a decidedly positive review:
STRPoured from a 12oz bottle into a pint glass. A: Pours a deep brown color with a thin tan head. The aroma is of coffee, chocolate, and coffee. The taste is of roasted malts, coffee, chocolate, and coffee. The finish is slightly sweet and smooth with a light bitterness and a light bitterness that lingers on the palate. The finish is slightly bitter and dry. Mouthfeel is medium bodied with a good amount of carbonation. The alcohol is well hidden. Drinkability is good. I could drink this all day long. I would love to try this one again and again. EOS

Conditioning on the “Fruit / Vegetable Beer” category, the model generates a commensurately botanical review; interestingly, the user “Mikeygrootia” does not exist in the dataset.
STRThanks to Mikeygrootia for the opportunity to try this one. A: Poured a nice deep copper with a one finger head that disappears quickly. Some lacing. S: A very strong smelling beer. Some corn and grain, some apple and lemon peel. Taste: A very sweet berry flavor with a little bit of a spice to it. I am not sure what to expect from this beer. This stuff is a good summer beer. I could drink this all day long. Not a bad one for me to recommend this beer. EOS

For more examples of generated text, please see Appendix A and Appendix B.
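In sketch form, generation runs the trained network one character at a time, feeding the fixed auxiliary vector at every step. The step function, index arguments, and temperature handling below are our assumptions, not the paper's actual interface.

```python
import numpy as np

def generate_review(step_fn, idx_to_char, str_idx, eos_idx,
                    x_aux, temperature=1.0, max_len=2000):
    """Sample one review from a trained concatenated-input RNN.
    `step_fn(char_idx, x_aux, state) -> (probs, state)` stands in for one
    forward step of the trained network; the auxiliary vector x_aux
    (scaled star rating or one-hot category) is supplied at every step."""
    state, idx, chars = None, str_idx, []
    for _ in range(max_len):
        probs, state = step_fn(idx, x_aux, state)
        # Temperature < 1 makes sampling more conservative.
        logits = np.log(probs + 1e-12) / temperature
        p = np.exp(logits - logits.max())
        idx = np.random.choice(len(p), p=p / p.sum())
        if idx == eos_idx:
            break
        chars.append(idx_to_char[idx])
    return "".join(chars)

# e.g., condition on a 5-star rating scaled to [-1, 1]:
# review = generate_review(step_fn, idx_to_char, str_idx, eos_idx,
#                          x_aux=np.array([1.0]))
```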
4.2 PREDICTING SENTIMENT AND CATEGORY ONE CHARACTER AT A TIME
In addition to running the model to generate output, we take example sentences from unseen reviews and plot the rating which gives the sentence maximum likelihood as each character is encountered (Figure 3). We can also plot the network's perception of item category, using each category's prior and the review's likelihood to infer posterior probabilities after reading each character. These visualizations demonstrate that by the “d” in “Budweiser”, our model recognizes a “lager”. Similarly, reading the “f” in “awful”, the network seems to comprehend that the beer is “awful” and not “awesome” (Figure 3). See Appendices C and D for more examples.
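A sketch of this per-character inference, assuming a hypothetical log_lik_fn that scores a prefix under a given category; in practice one would run the RNN once per category and accumulate per-character log-probabilities rather than re-scoring every prefix.

```python
import numpy as np

def category_posteriors(log_lik_fn, text, log_priors):
    """Posterior over item categories after each character, by Bayes'
    rule: P(cat | prefix) is proportional to P(prefix | cat) * P(cat).
    `log_lik_fn(prefix, cat)` stands in for the RNN's log-likelihood of
    the prefix when run with category `cat` as the auxiliary input;
    `log_priors` is an ndarray of log prior probabilities."""
    posteriors = []
    for t in range(1, len(text) + 1):
        scores = np.array([log_lik_fn(text[:t], c)
                           for c in range(len(log_priors))]) + log_priors
        scores -= scores.max()          # stabilize before exponentiating
        p = np.exp(scores)
        posteriors.append(p / p.sum())
    return np.array(posteriors)         # shape: (len(text), n_categories)
```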
To verify that the argmax over many settings of the rating is reasonable, we plot the log likelihood after the final character is processed, over a range of fine-grained values for the rating (1.0, 1.1, etc.). These plots show that the log likelihood tends to be smooth and monotonic for sentences with unambiguous sentiment, e.g., “Mindblowing experience”, while it is smooth with a peak in the middle when sentiment is ambiguous, e.g., “not the best, not the worst” (Figure 4). We also find that the model understands nonlinear dynamics of negation and can handle simple spelling mistakes, as seen in Appendices E and D.

Figure 4: (a) “Mindblowing experience.” (b) “Tastes watered down.” (c) “Not the best, not worst.” Log likelihood of the review for many settings of the rating. This tends to be smooth and monotonic for unambiguous sentences. When the sentiment is less extreme, the peak is centered.
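The corresponding rating sweep might look as follows, with log_lik_fn again standing in for the trained RNN scoring a full sentence under a candidate rating supplied as auxiliary input.

```python
import numpy as np

def most_likely_rating(log_lik_fn, sentence, lo=1.0, hi=5.0, step=0.1):
    """Sweep fine-grained ratings (1.0, 1.1, ..., 5.0), score the
    sentence's log-likelihood under each, and return the argmax.
    `log_lik_fn(sentence, rating)` stands in for the RNN run with the
    (scaled) rating as its auxiliary input."""
    ratings = np.arange(lo, hi + step / 2, step)
    scores = np.array([log_lik_fn(sentence, r) for r in ratings])
    return ratings[int(scores.argmax())], scores
```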
4.3 CLASSIFICATION RESULTS
While our motivation is to produce a character-level generative model, running it in reverse as a classifier proved an effective way to objectively gauge what the model knows. To investigate this capability more thoroughly, we compared it to a word-level tf-idf n-gram multinomial logistic regression (LR) model using the top 10,000 n-grams. Our model achieves a classification accuracy of 89.9%, while LR achieves 93.4% (Table 1). Both models make the majority of their mistakes confusing Russian Imperial Stouts for American Porters, which is not surprising because a stout is a sub-type of porter. If we collapse these two into one category, the RNN achieves 94.7% accuracy while LR achieves 96.5%. While the reverse model does not yet eclipse a state-of-the-art classifier, it was trained at the character level and was not optimized to minimize classification error or tuned with attention to generalization error. In this light, the results appear to warrant a deeper exploration of this capability. Please see Appendix F for detailed classification results. We also ran the model in reverse to classify reviews as positive (≥ 4.0 stars) or negative (≤ 2.0 stars), achieving an AUC of .88 on a balanced test set of 1000 examples.
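For reference, a sketch of such a word-level baseline using scikit-learn; the exact n-gram range and solver settings used in the paper are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# tf-idf over the top 10,000 n-grams feeding a multinomial logistic
# regression, as in the LR baseline described above.
baseline = make_pipeline(
    TfidfVectorizer(max_features=10000, ngram_range=(1, 2)),
    LogisticRegression(multi_class="multinomial", solver="lbfgs"),
)
# baseline.fit(train_reviews, train_categories)
# accuracy = baseline.score(test_reviews, test_categories)
```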
5 RELATED WORK

The prospect of capturing meaning in character-level text has long captivated neural network researchers. In the seminal work, “Finding Structure in Time”, Elman (1990) speculated, “one can ask whether the notion ‘word’ (or something which maps on to this concept) could emerge as a consequence of learning the sequential structure of letter sequences that form words and sentences (but in which word boundaries are not marked).” In that work, an ‘Elman RNN’ was trained with 5 input nodes, 5 output nodes, and a single hidden layer of 20 nodes, each with a corresponding context unit, to predict the next character in a sequence. At each step, the network received a binary encoding (not one-hot) of a character and tried to predict the next character's binary encoding. Elman plots the error of the net character by character, showing that it is typically high at the onset of words but decreases as it becomes clear what each word is. While these nets do not possess the size or capabilities of large modern LSTM networks trained on GPUs, that work laid the foundation for much of our research.

Subsequently, Sutskever et al. (2011) introduced the model of text generation on which we build. In that paper, the authors generate text resembling Wikipedia articles and New York Times articles. They sanity-check the model by showing that it can perform a debagging task, in which it unscrambles bag-of-words representations of sentences by determining which unscrambling has the highest likelihood. Also relevant to our work is Zhang and LeCun (2015), which trains a strictly discriminative model of text at the character level using convolutional neural networks (LeCun et al., 1989; 1998). Demonstrating success on both English and Chinese language datasets, their models achieve high accuracy on a number of classification tasks.
Related works generating sequences in a supervised fashion generally follow the pattern of Sutskever et al. (2014), which uses a word-level encoder-decoder RNN to map sequences onto sequences. Their system for machine translation demonstrated that a recurrent neural network can compete with state-of-the-art machine translation systems absent any hard-coded notion of language (beyond that of words). Several papers followed up on this idea, extending it to image captioning by swapping the encoder RNN for a convolutional neural network (Mao et al., 2014; Vinyals et al., 2015; Karpathy and Fei-Fei, 2014).
5.1 KEY DIFFERENCES AND CONTRIBUTIONS
RNNs have previously been used to generate text at the character level, and to generate text in a supervised fashion at the word level. However, to our knowledge, this is the first work to demonstrate that an RNN can generate relevant text at the character level. Further, while Sutskever et al. (2011) demonstrate the use of a character-level RNN as a scoring mechanism, to our knowledge this is the first paper to use such a scoring mechanism to infer labels, simultaneously learning to generate text and to perform supervised tasks like multiclass classification with high accuracy. Our work is not the first to demonstrate a character-level classifier, as Zhang and LeCun (2015) offered such an approach. However, while their model is strictly discriminative, our model's main purpose is to generate text, a capability not present in their approach. Further, while we present a preliminary exploration of ways that our generative model can be used as a classifier, we do not train it directly to minimize classification or generalization error; rather, we use the classifier interpretation to validate that the generative model is in fact modeling the auxiliary information meaningfully.
6 CONCLUSION

In this work, we demonstrate the first character-level recurrent neural network to generate relevant text conditioned on auxiliary input. This work is also, to our knowledge, the first to generate coherent product reviews conditioned upon data such as rating and item category. Our quantitative and qualitative analysis shows that our model can accurately perform sentiment analysis and model item category. While this capability is intriguing, much work remains to investigate whether such an approach can be competitive with state-of-the-art word-level classifiers. The model learns nonlinear dynamics of negation and appears to respond intelligently to a wide vocabulary despite lacking any a priori notion of words.
We believe that this is only the beginning of this line of research. Next steps include extending our work to the more complex domain of individual items and users. Given users with extensive historical feedback in a review community and a set of frequently reviewed items, we would like to take a previously unseen (user, item) pair and generate a review that plausibly reflects the user's tastes and writing style as well as the item's attributes. We also envision an architecture by which our concatenated-input network could be paired with a neural network encoder, to leverage the strengths of both the encoder-decoder approach and our approach. Details of this proposed model are included in Appendix G.
7 ACKNOWLEDGEMENTS

Zachary C. Lipton's research is funded by the UCSD Division of Biomedical Informatics, via NIH/NLM training grant T15LM011271. Sharad Vikram's research is supported in part by NSF grant CNS-1446912. We would like to thank Professor Charles Elkan for his mentorship. We gratefully acknowledge the NVIDIA Corporation, whose hardware donation program furnished us with a Tesla K40 GPU, making our research possible.
REFERENCES

Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: A CPU and GPU math compiler in Python. In Proceedings of the Python for Scientific Computing Conference (SciPy), 2010.
Jeffrey L. Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990.
Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10):2451–2471, 2000.
Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. arXiv preprint arXiv:1412.2306, 2014.
Yann LeCun, Bernhard Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne Hubbard, and Lawrence D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
Zachary C. Lipton, John Berkowitz, and Charles Elkan. A critical review of recurrent neural networks for sequence learning. arXiv preprint arXiv:1506.00019, 2015.
Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, and Alan Yuille. Deep captioning with multimodal recurrent neural networks (m-RNN). arXiv preprint arXiv:1412.6632, 2014.
Julian John McAuley and Jure Leskovec. From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews. In Proceedings of the 22nd international conference on World Wide Web, pages 897–908. International World Wide Web Conferences Steering Committee, 2013.
John R. Surber and Mark Schroeder. Effect of prior domain knowledge and headings on processing of informative text. Contemporary Educational Psychology, 32(3):485–498, 2007. URL http://www.sciencedirect.
Ilya Sutskever, James Martens, and Geoffrey E. Hinton. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1017–1024, 2011.
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 3104–3112, 2014.

Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5—RMSprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

Xiang Zhang and Yann LeCun. Text understanding from scratch. arXiv preprint arXiv:1502.01710, 2015.