Under review as a conference paper at ICLR 2016
CAPTURING MEANING IN PRODUCT REVIEWS WITH
CHARACTER-LEVEL GENERATIVE TEXT MODELS
Zachary C. Lipton∗
Computer Science and Engineering
University of California, San Diego
La Jolla, CA 92093, USA
firstname.lastname@example.org

Sharad Vikram†
Computer Science and Engineering
University of California, San Diego
La Jolla, CA 92093, USA
email@example.com

Julian McAuley‡
Computer Science and Engineering
University of California, San Diego
La Jolla, CA 92093, USA
firstname.lastname@example.org

ABSTRACT

We present a character-level recurrent neural network that generates relevant and coherent text given auxiliary information such as a sentiment or topic.¹ Using a simple input replication strategy, we preserve the signal of auxiliary input across wider sequence intervals than can feasibly be trained by back-propagation through time. Our main results center on a large corpus of 1.5 million beer reviews from BeerAdvocate. In generative mode, our network produces reviews on command, tailored to a star rating or item category. The generative model can also run in reverse, performing classification with surprising accuracy. Performance of the reverse model provides a straightforward way to determine what the generative model knows without relying too heavily on subjective analysis. Given a review, the model can accurately determine the corresponding rating and infer the beer's category (IPA, Stout, etc.). We exploit this capability, tracking perceived sentiment and class membership as each character in a review is processed. Quantitative and qualitative empirical evaluations demonstrate that the model captures meaning and learns nonlinear dynamics in text, such as the effect of negation on sentiment, despite possessing no a priori notion of words. Because the model operates at the character level, it handles misspellings, slang, and large vocabularies without any machinery explicitly dedicated to the purpose.
1 INTRODUCTION

Our work is motivated by an interest in product recommendation. Currently, recommender systems assist users in navigating an unprecedented selection of items, personalizing services to a diverse set of users with distinct individual tastes. Typical approaches surface items that a customer is likely to purchase or rate highly, providing a basic set of primitives for building functioning internet applications. Our goal is to create richer user experiences, not only recommending products but generating descriptive text. For example, engaged users may wish to know what precisely their impression of an item is expected to be, not simply whether the item will warrant a thumbs up or thumbs down. Consumer reviews can address this issue to some extent, but large volumes of reviews are difficult to sift through, especially if a user is interested in some niche aspect. Our fundamental goal is to resolve this issue by building systems that can both generate contextually appropriate descriptions and infer items from descriptive text.
∗ Author website: http://zacklipton.com
† Author website: http://www.sharadvikram.com
‡ Author website: http://cseweb.ucsd.edu/~jmcauley/
¹ Live web demonstration of rating- and category-based review generation: http://deepx.ucsd.edu/beermind

Figure 1: Our generative model runs in reverse, inferring ratings and categories given reviews without any a priori notion of words.
Character-level Recurrent Neural Networks (RNNs) have a remarkable ability to generate coherent text (Sutskever et al., 2011), appearing to hallucinate passages that plausibly resemble a training corpus. In contrast to word-level models, they do not suffer from computational costs that scale with the size of the input or output vocabularies. This property is alluring, as product reviews draw upon an enormous vocabulary. Our work focuses on reviews scraped from Beer Advocate (McAuley and Leskovec, 2013). This corpus contains over 60,000 distinct product names alone, in addition to standard vocabulary, slang, jargon, punctuation, and misspellings.
Character-level LSTMs powerfully demonstrate the ability of RNNs to model sequences on multiple time scales simultaneously, i.e., they learn to form words, to form sentences, to generate paragraphs of appropriate length, etc. To our knowledge, all previous character-level generative models are unsupervised. However, our goal is to generate character-level text in a supervised fashion, conditioning upon auxiliary input such as an item's rating or category². Such conditioning of sequential output has been performed successfully with word-level models, for tasks including machine translation (Sutskever et al., 2014), image captioning (Vinyals et al., 2015; Karpathy and Fei-Fei, 2014;
Mao et al., 2014), and even video captioning (Venugopalan et al., 2014). However, despite the aforementioned virtues of character-level models, no prior work, to our knowledge, has successfully trained them in such a supervised fashion³.
Most supervised approaches to word-level generative text models follow the encoder-decoder approach popularized by Sutskever et al. (2014). Some auxiliary input, which might be a sentence or an image, is encoded by an encoder model as a fixed-length vector. This vector becomes the initial input to a decoder model, which then outputs, at each sequence step, a probability distribution predicting the next word. During training, weights are updated to give high likelihood to the sequences encountered in the training data. When generating output, words are sampled from each predicted distribution and passed as input at the subsequent sequence step. This approach successfully produces coherent and relevant sentences, but is generally limited to short outputs, typically single sentences of fewer than 10 words, as the model gradually 'forgets' the auxiliary input.
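For concreteness, the decoder half of this approach might look like the following minimal numpy sketch. The vanilla RNN cell, sizes, and parameter names are illustrative assumptions, not any particular published implementation; the point is that the auxiliary encoding enters only as the initial state.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, HIDDEN = 1000, 128                            # illustrative sizes

# Hypothetical decoder parameters; a real system would learn these.
E = rng.normal(scale=0.01, size=(VOCAB, HIDDEN))     # word embeddings
W_hh = rng.normal(scale=0.01, size=(HIDDEN, HIDDEN))
W_xh = rng.normal(scale=0.01, size=(HIDDEN, HIDDEN))
W_hy = rng.normal(scale=0.01, size=(VOCAB, HIDDEN))

def decode(h_enc, max_len=10):
    """Decoder half of an encoder-decoder model (sketch).

    The encoding h_enc of the auxiliary input sets only the initial
    state, so its signal must survive every later recurrent update --
    the gradual 'forgetting' discussed above.
    """
    h, tok, words = h_enc, 0, []
    for _ in range(max_len):
        h = np.tanh(W_hh @ h + W_xh @ E[tok])   # recurrent update
        logits = W_hy @ h
        p = np.exp(logits - logits.max())
        p /= p.sum()                            # softmax over vocabulary
        tok = int(rng.choice(VOCAB, p=p))       # sample the next word
        words.append(tok)
    return words
```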
However, to model longer passages of text (such as reviews), and to do so at the character level, we must produce much longer sequences than seem practically trainable with an encoder-decoder approach. To overcome these challenges, we present an alternative modeling strategy. At each sequence step t, we concatenate the auxiliary input vector x_aux with the character representation x_char^(t), using the resulting vector x^(t) = [x_char^(t); x_aux] to train an otherwise standard generative RNN model. It might seem redundant to replicate x_aux at each sequence step, but by providing it, we eliminate pressure on the model to memorize it. Instead, all computation can focus on modeling the text and its interaction with the auxiliary input.
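A minimal sketch of this input construction, assuming a one-hot character encoding and a small illustrative auxiliary vector (sizes are ours, not the paper's):

```python
import numpy as np

N_CHARS, AUX_DIM = 100, 5        # illustrative sizes

def make_input(char_index, x_aux):
    """Build x^(t) = [x_char^(t); x_aux] for one sequence step."""
    x_char = np.zeros(N_CHARS)
    x_char[char_index] = 1.0                 # one-hot character at step t
    return np.concatenate([x_char, x_aux])   # same x_aux at every step

x_aux = np.array([4.0, 4.5, 4.0, 4.5, 5.0])  # e.g., five review ratings
x_t = make_input(42, x_aux)
```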
In this paper, we implement the concatenated input model, demonstrating its efficacy at both review generation and traditional supervised learning tasks. In generative mode, our model produces convincing reviews, tailored to a star rating and category. We present a live web demonstration of this capability (http://deepx.ucsd.edu/beermind). This generative model can also run in reverse, performing classification with surprising accuracy (Figure 1). The purpose of this model is to generate text, but we find that classification accuracy of the reverse model provides an objective way to assess what the model has learned. An empirical evaluation shows that our model can accurately classify previously unseen reviews as positive or negative and determine which of 5 beer categories is being described, despite operating at the character level and not being optimized directly to minimize classification error. Our exploratory analysis also reveals that the model implicitly learns a large vocabulary and can effectively model nonlinear dynamics, like the effect of negation. Plotting the inferred rating as each character is encountered for many sentences (Figure 1) shows that the model infers ratings quickly and anticipates words after reading particularly informative characters.

² We use auxiliary input to differentiate the "context" input from the character representation passed in at each sequence step.
³ By supervised, we mean the output sequence depends upon some auxiliary input.
2 THE BEER ADVOCATE DATASET
We focus on data scraped from Beer Advocate as originally collected and described by McAuley and Leskovec (2013). Beer Advocate is a large online review community boasting 1,586,614 reviews of 66,051 distinct items, composed by 33,387 users. Each review is accompanied by several numerical ratings, corresponding to "appearance", "aroma", "palate", "taste", and the user's "overall" impression. The reviews are also annotated with the item's category. For our experiments on ratings-based generation and classification, we select 250,000 reviews for training, focusing on the most active users and most popular items. For our experiments on generating reviews conditioned on item category, we select a subset of 150,000 reviews, 30,000 each from 5 of the top categories, namely "American IPA", "Russian Imperial Stout", "American Porter", "Fruit/Vegetable Beer", and "American Adjunct Lager". From both datasets, we hold out 10% of reviews for testing.
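As an illustration, the category-balanced subset and holdout might be built as follows; the `reviews` list and its "category" key are assumed stand-ins for the scraped data, not the authors' actual pipeline.

```python
import random

random.seed(0)

CATEGORIES = ["American IPA", "Russian Imperial Stout", "American Porter",
              "Fruit/Vegetable Beer", "American Adjunct Lager"]

def balanced_split(reviews, per_category=30000, test_frac=0.10):
    """Draw 30,000 reviews per category, then hold out 10% for testing."""
    subset = []
    for cat in CATEGORIES:
        pool = [r for r in reviews if r["category"] == cat]
        subset.extend(random.sample(pool, per_category))
    random.shuffle(subset)
    n_test = int(len(subset) * test_frac)
    return subset[n_test:], subset[:n_test]   # train, test
```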
3 RECURRENT NEURAL NETWORK METHODOLOGY
Figure 2: (a) Standard generative RNN; (b) encoder-decoder RNN; (c) concatenated input RNN.
3.1 GENERATIVE RECURRENT NEURAL NETWORKS

Before introducing our contributions, we review the generative RNN model of Sutskever et al. (2011; 2014) on which we build. A generative RNN is trained to predict the next token in a sequence, i.e., ŷ^(t) = x^(t+1), given all inputs up to that point (x^(1), ..., x^(t)). Thus input and output strings are equivalent but for a one-token shift (Figure 2a). The output layer is fully connected with softmax activation, ensuring that outputs specify a distribution. Cross-entropy is the loss function during training.
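In code, this training criterion amounts to the average cross-entropy between the predicted distributions and the one-step-shifted input; a minimal sketch, with shapes assumed for illustration:

```python
import numpy as np

def next_char_loss(logits, targets):
    """Average cross-entropy of next-character predictions (sketch).

    `logits` has shape (T, vocab): one row of unnormalized scores per
    sequence step. `targets[t]` is the index of the true next character
    x^(t+1). Names and shapes are illustrative, not the paper's code.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)   # stability
    log_z = np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    log_probs = shifted - log_z                            # log softmax
    return -log_probs[np.arange(len(targets)), targets].mean()
```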
Once trained, the model is run in generative mode by sampling stochastically from the distribution output at each sequence step, given some starting token and state. Passing the sampled output as the subsequent input, we generate another output conditioned on the first prediction, and can continue in this manner to produce arbitrarily long sequences. Sampling can be done directly according to softmax outputs, but it is also common to sharpen the distribution by setting a temperature ≤ 1, analogous to the so-named parameter in a Boltzmann distribution. Applied to text, generative models trained in this fashion produce surprisingly coherent passages that appear to reflect the characteristics of the training corpus. They can also be used to continue passages given some starting tokens.
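A minimal sketch of temperature-controlled sampling (dividing logits by a temperature ≤ 1 is the standard sharpening trick; the function name is ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_char(logits, temperature=1.0):
    """Sample the next character; temperatures below 1 sharpen the
    softmax toward the most likely characters, as described above."""
    scaled = (logits - logits.max()) / temperature   # stable softmax(l/T)
    p = np.exp(scaled)
    p /= p.sum()
    return int(rng.choice(len(p), p=p))
```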
3.2 CONCATENATED INPUT RECURRENT NEURAL NETWORKS
Our goal is to generate text in a supervised fashion, conditioned on an auxiliary input x_aux. This has been done at the word level with encoder-decoder models (Figure 2b), in which the auxiliary input is encoded and passed as the initial state to a decoder, which then must preserve this input signal
across many sequence steps (Sutskever et al., 2014; Karpathy and Fei-Fei, 2014). Such models have successfully produced (short) image captions, but seem impractical for generating full reviews at the character level, because the signal from x_aux must survive for hundreds of sequence steps.
We take inspiration from an analogy to human text generation. Given only a topic and told to speak at length, a person is apt to meander and ramble; given a subject to stare at, it is far easier to remain focused. The value of reiterating high-level material is borne out by one study (Surber and Schroeder, 2007), which showed that repetitive subject headings in textbooks resulted in faster learning, less rereading, and more accurate answers to high-level questions.
Thus we propose a simple architecture in which input x_aux is concatenated with the character representation x_char^(t). Given this new input x^(t) = [x_char^(t); x_aux], we can train the model precisely as with the standard generative RNN (Figure 2c). At train time, x_aux is a feature of the training set.
At predict time, we fix some x_aux, concatenating it with each character sampled from ŷ^(t). One might reasonably note that this replicated input information is redundant. However, since it is fixed over the course of the review, we see no reason to require the model to transmit this signal across hundreds of time steps. By replicating x_aux at each input, we free the model to focus on learning the complex interaction between the auxiliary input and language, rather than memorizing the input.
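Putting the pieces together, predict-time generation with a fixed x_aux might look like the sketch below; the vanilla RNN cell, random stand-in weights, and sizes are illustrative assumptions in place of the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
N_CHARS, AUX_DIM, HIDDEN = 100, 5, 256          # illustrative sizes

# Random stand-ins for trained parameters.
W_xh = rng.normal(scale=0.01, size=(HIDDEN, N_CHARS + AUX_DIM))
W_hh = rng.normal(scale=0.01, size=(HIDDEN, HIDDEN))
W_hy = rng.normal(scale=0.01, size=(N_CHARS, HIDDEN))

def generate(x_aux, start_char=0, n_steps=500, temperature=0.7):
    """Sample a character sequence with x_aux fixed at every input."""
    h = np.zeros(HIDDEN)
    c, out = start_char, []
    for _ in range(n_steps):
        x_char = np.zeros(N_CHARS)
        x_char[c] = 1.0
        x = np.concatenate([x_char, x_aux])     # replicate x_aux each step
        h = np.tanh(W_xh @ x + W_hh @ h)        # vanilla RNN cell (sketch)
        logits = W_hy @ h
        p = np.exp((logits - logits.max()) / temperature)
        p /= p.sum()
        c = int(rng.choice(N_CHARS, p=p))       # sampled char becomes input
        out.append(c)
    return out
```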
3.3 WEIGHT TRANSPLANTATION
Models with even modestly sized auxiliary input representations are considerably harder to train than a typical unsupervised character model. To overcome this problem, we first train a character model to convergence. Then we transplant these weights into a concatenated input model, initializing the extra weights (between the input layer and the first hidden layer) to zero. Zero initialization is not problematic here because symmetry in the hidden layers is already broken. Thus we guarantee that the model will achieve a strictly lower loss than a character model, saving (days of) repeated training. This scheme bears some resemblance to the pre-training common in the computer vision community (Yosinski et al., 2014). Here, instead of new output weights, we train new input weights.
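A sketch of the transplantation step under assumed shapes (a random array stands in for the converged character model's input weights):

```python
import numpy as np

N_CHARS, AUX_DIM, HIDDEN = 100, 5, 256          # illustrative sizes

# Stand-in for the converged character model's input weights.
W_xh_char = np.random.default_rng(0).normal(size=(HIDDEN, N_CHARS))

# Transplant: copy trained weights, zero-initialize the new columns.
W_xh_full = np.zeros((HIDDEN, N_CHARS + AUX_DIM))
W_xh_full[:, :N_CHARS] = W_xh_char
# With the x_aux columns at zero, the concatenated-input model initially
# computes exactly what the converged character model computed; symmetry
# in the hidden layers is already broken, so the zero block trains fine.
```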
3.4 RUNNING THE MODEL IN REVERSE
Many common document classification models, like tf-idf logistic regression, maximize the likelihood of the training labels given the text. Given our generative model, we can instead produce a predictor by reversing the order of inference, that is, by maximizing the likelihood of the text given a classification. The relationship between these two tasks, P(x_aux | Review) and P(Review | x_aux), follows from Bayes' rule: our model predicts the conditional probability P(Review | x_aux) of an entire review given some x_aux (such as a star rating). The normalizing term can be disregarded in determining the most probable rating, and when the classes are balanced, as they are in our test cases, the prior also vanishes from the decision rule, leaving P(x_aux | Review) ∝ P(Review | x_aux).
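Concretely, the reverse-mode decision rule reduces to an argmax over per-class log-likelihoods; in the sketch below, `char_log_probs` is an assumed helper exposing the trained model's per-character log-probabilities, and the rating encoding is hypothetical.

```python
import numpy as np

CANDIDATE_RATINGS = [1.0, 2.0, 3.0, 4.0, 5.0]

def classify(review, char_log_probs):
    """Pick the rating under which the review is most likely (sketch).

    `char_log_probs(review, x_aux)` is an assumed helper returning the
    model's per-character log-probabilities for `review` given x_aux.
    """
    scores = [np.sum(char_log_probs(review, np.array([r])))
              for r in CANDIDATE_RATINGS]
    # With balanced classes the prior drops out of Bayes' rule, so the
    # likelihood argmax is also the posterior argmax.
    return CANDIDATE_RATINGS[int(np.argmax(scores))]
```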