# recurrent neural networks for sentiment analysis .recurrent neural networks for sentiment analysis

Post on 16-Feb-2019

214 views

Embed Size (px)

TRANSCRIPT

Recurrent Neural Networks for Sentiment Analysis

Joaqun Ruales

Columbia Universityjar2262@columbia.edu

Abstract

We compare LSTM Recurrent Neural Net-works to other algorithms for classify-ing the sentiment of movie reviews. Wethen repurpose RNN training to learningsentiment-aware word vectors useful forword retrieval and visualization.

1 Introduction

Recurrent Neural Networks (RNNs. Described inmore detail in Section 3.1) have proven to be suc-cessful in many natural language tasks, includ-ing in learning language models that outperformn-grams (Mikolov, Karafiat et al., 2011), and inachieving the state-of-the-art in speech recogni-tion (Hannun et al., 2014).

Furthermore, Recursive Neural Networks1anetwork structure similar in spirit to RecurrentNeural Networks but that, unlike RNNs, usesa tree topology instead of a chain topology forits time-stepshas been successfully used forstate-of-the-art binary sentiment classification af-ter training on a sentiment treebank (Socher et al.,2013).

Motivated by these results, we use RecurrentNeural Networks for the natural language task ofpredicting the binary sentiment of movie reviews.

2 Problem Formulation and Dataset

Our focus is the binary sentiment classification ofmovie reviews. Given the text of a polar moviereview, we predict whether it is a positive or neg-ative review, where positive means the reviewergave the movie a star rating of 7/10 or higher, andnegative means a rating of 4/10 or lower.

Later, we tackle the problem of learningsentiment-aware vector representations of words,

1We avoid referring to these as RNNs to avoid confu-sion

and of using these for visualization and word re-trieval.

We use a subset of the Large Movie ReviewDataset (Maas et al., 2011). The dataset consistsof 50000 polar movie reviews, 25000 for train-ing and 25000 for testing, from the IMDb internetmovie database. Due to our limited computationalresources, for each experiment we chose a wordcount limit and trained on reviews with at mostthat number of words. For testing we used 500 re-views with arbitrary length chosen at random fromthe larger test set.

3 Background

3.1 RNNs

Recurrent Neural Networks (RNNs) are a type ofneural network that contains directed loops. Theseloops represent the propagation of activations tofuture inputs in a sequence. Once trained, insteadof accepting a single vector input x as a testingexample, an RNN can accept a sequence of vectorinputs (x1, . . . , xT ) for arbitrary, variable valuesof T , where each x

t

has the same dimension. Inour specific application, each x

t

is a vector rep-resentation of a word, and (x1, . . . , xT ) is a se-quence of words in a movie review. As shown inFigure 1, we can imagine unrolling an RNN sothat each of these inputs x

t

has a copy of a networkthat shares the same weights as the other copies.For every looping edge in the network, we con-nect the edge to the corresponding node in x

t+1snetwork, thus creating a chain, which breaks anyloops and allows us to use the standard backpropa-gation techniques of feedforward neural networks.

3.2 LSTM Units

One of the known instabilities of the standardRNN is the vanishing gradient problem (Pascanuet al., 2012). The problem involves the quick de-crease in the amount of activation passed into sub-sequent time steps. This limits how far into the

Figure 1: An RNN as a looping graph (left), andas a sequence of time steps (right).

CHAPTER 4. LONG SHORT-TERM MEMORY 33

Figure 4.2: LSTM memory block with one cell. The three gates are nonlin-ear summation units that collect activations from inside and outside the block,

and control the activation of the cell via multiplications (small black circles).

The input and output gates multiply the input and output of the cell while the

forget gate multiplies the cells previous state. No activation function is applied

within the cell. The gate activation function f is usually the logistic sigmoid,

so that the gate activations are between 0 (gate closed) and 1 (gate open). The

cell input and output activation functions (g and h) are usually tanh or lo-

gistic sigmoid, though in some cases h is the identity function. The weighted

peephole connections from the cell to the gates are shown with dashed lines.

All other connections within the block are unweighted (or equivalently, have a

fixed weight of 1.0). The only outputs from the block to the rest of the network

emanate from the output gate multiplication.

Figure 2: LSTM Unit (Graves, 2012)

past the network can remember. In order to al-leviate this problem, Long Short Term Memory(LSTM) units were created to replace normal re-current nodes (Hochreiter et al., 1997). A dia-gram of one of these units is displayed in Figure2. These units introduce a variety of gates thatregulate the propagation of activations along thenetwork. This, in turn, allows a network to learnwhen to ignore a new input, when to remember apast hidden state, and when to emit a nonzero out-put.

4 Sentiment Classification Experiments

Below, we explain the procedure for each of ourexperiments for sentiment classification.

4.1 LSTM

We performed our LSTM experiments by extend-ing the deeplearning.net LSTM sentiment clas-sification tutorial code, which uses the PythonTheano symbolic mathematical library.

The network takes as input the ordered se-quence of T words that make up a movie review,and it performs 4 steps to predict the movies sen-timentprojection, LSTM, mean pooling, and lo-gistic regression.

1. Projection:

x

t

= Rw

t

where wt

is the tth word of the currentreview represented as a one-hot 10000-dimensional vector unique to each word (anyword not in the 9999 most common words inthe data is treated as the same word, UNK,to prevent sparseness), and R is a weight ma-trix that projects each word into to a 128-dimensional space. Here R is learned duringtraining.

2. LSTM unit:

C

t

= tanh(W

c

x

t

+ U

c

h

t1 + bc)

i

t

= logistic(W

i

x

t

+ U

i

h

t1 + bi)

f

t

= logistic(W

f

x

t

+ U

f

h

t1 + bf

)

C

t

= i

t

Ct

+ f

t

C

t1

o

t

= logistic(W

o

x

t

+ U

o

h

t1 + bo)

h

t

= o

t

tanh(Ct

)

where Ct

is the output value of the recur-rent unit as would be computed by a clas-sical RNN, i

t

is the LSTM input gate, ft

is the LSTM forget gate, Ct

is the valueof the internal LSTM memory state cre-ated by blending the past state (C

t1) andthe proposed state ( C

t

), ot

is the outputgate, and h

t

is the final output of theLSTM unit after passing through the outputgate. The symbol represents element-wise multiplication. The parameters that arelearned during training are the weight matri-ces W

i

, U

i

,W

c

, U

c

,W

f

, U

f

,W

o

, U

o

, and thebias vectors b

i

, b

c

, b

f

, b

o

.

3. Mean pooling:

h =

1

T

TX

t=1

h

t

4. And finally, logistic regression:

y = softmax(W

y

h + by

)

where weight matrix Wy

and bias vec-tor b

y

are learned during training. yis a 2D vector where the first entry isp(negative|w1 . . . wT ), and the second entryis p(positive|w1 . . . wT ). While we are us-ing the softmax function, this formulation isequivalent to logistic regression.

Figure 3: Word count histogram for training andtesting data. Word counts ranged from 11 to 2820.

The network is trained using stochastic gradientdescent with the AdaDelta update rule, which usesan adaptive per-dimension learning rate that doesnot require manual annealing.

Due to the prohibitively long time it takes totrain this network on all of the reviews in our train-ing set, for our experiments we trained this net-work using word count limits of 100, 150, and300. We saw a rapid increase in running time asthe limit increased since both number of examplesand average example word count grows. Figure 3shows the distribution of word counts for the fulltraining and validation set for reference.

4.2 Variations on LSTM

4.2.1 Regular RNN

For this experiment, we replaced all LSTM unitsin the above network with classical RNN units tosee if the addition of LSTM brought an advantageto this specific problem.

4.2.2 Non-Forgetting LSTM

For this experiment, we modified the LSTM byreplacing the forget gate with a vector of all 1s,always preserving the previous state when incor-porating a new example. This modification repre-sents the LSTM unit as it was first described, be-fore (Gers et al., 2000) introduced the forget gate.

4.2.3 LSTM with Max Pooling

For this experiment, we used max pooling insteadof mean pooling of our LSTM unit outputs, setting

h = max

t=1...Th

t

4.3 SVM

We trained a linear SVM on the data using Bagof Words features for each movie review. Figure 4shows our training and validation errors for severalvalues o

Recommended