Deep Learning and Natural Language Processing

Deep Learning and Natural Language Processing — Preferred Infrastructure, Inc., Product Division

Uploaded by preferred-infrastructure-preferred-networks on 12-Jul-2015

TRANSCRIPT

  • Deep Learning and Natural Language Processing

    Preferred Infrastructure

  • [Self-introduction slide; recoverable keywords: Sedue, Sedue Predictor; remaining bullet text not recoverable]

  • [Timeline slide: (2013 GW), (2013/12) Sedue, (2014/12); recoverable keywords: Hadoop, GB-scale data, MongoDB; remaining text not recoverable]

  • [Agenda; recoverable keywords: Deep Learning, NLP, Word embedding, "Deep?"]

  • [Agenda, repeated]

  • [Deep Learning background; recoverable keywords: Facebook, Google, Baidu, Yahoo, Twitter and Deep Learning]

  • [Deep Learning; recoverable keyword: Deep Neural Network (DNN)]

  • [Slide text not recoverable]

  • A single artificial neuron computes y = f( Σ_i w_i x_i ): the inputs x1, x2, ..., xn are weighted, summed, and passed through a nonlinearity f to produce the output y. [Diagram: inputs x1 ... xn feeding one unit with output y]
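
As a concrete illustration of the formula above, here is a minimal NumPy sketch of a single artificial neuron; the weight values, inputs, and the choice of a sigmoid nonlinearity are illustrative assumptions, not values from the slides.

```python
import numpy as np

def neuron(x, w, f=lambda z: 1.0 / (1.0 + np.exp(-z))):
    """Single artificial neuron: y = f(sum_i w_i * x_i)."""
    return f(np.dot(w, x))

x = np.array([0.5, -1.0, 2.0])   # inputs x1..xn (illustrative)
w = np.array([0.8,  0.2, -0.4])  # weights w1..wn (illustrative)
print(neuron(x, w))              # output y
```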

  • A feed-forward neural network with one hidden layer: inputs x1 ... xn, hidden units h1 ... hl, outputs y1 ... ym. [Diagram: fully connected layers x -> h -> y]

  • The same network with 2 hidden layers: inputs x1 ... xn, two hidden layers of units h1 ... hl, outputs y1 ... ym. [Diagram: x -> h(1) -> h(2) -> y]

  • [2-hidden-layer network, continued] Representative deep architectures: Deep Belief Network, Convolutional Neural Network, Stacked Auto-Encoder, Recursive Neural Network, Recurrent Neural Network.
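
A minimal NumPy sketch of the 2-hidden-layer feed-forward network sketched above; the layer sizes, tanh hidden units, and softmax output are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mlp_forward(x, params):
    """Forward pass x -> h1 -> h2 -> y through two tanh hidden layers."""
    (W1, b1), (W2, b2), (W3, b3) = params
    h1 = np.tanh(W1 @ x + b1)
    h2 = np.tanh(W2 @ h1 + b2)
    return softmax(W3 @ h2 + b3)

# Illustrative dimensions: n inputs, l hidden units per layer, m outputs.
n, l, m = 4, 8, 3
params = [(rng.normal(size=(l, n)) * 0.1, np.zeros(l)),
          (rng.normal(size=(l, l)) * 0.1, np.zeros(l)),
          (rng.normal(size=(m, l)) * 0.1, np.zeros(m))]
print(mlp_forward(rng.normal(size=n), params))  # class probabilities y1..ym
```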

  • Deep Learning: Multi-Task. The input x1 ... xn and the shared hidden layers h1 ... hl are common to all tasks, while the final layers are specific to Task 1 and Task 2. [Diagram: shared input/hidden layers branching into task-specific output layers]
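
A minimal NumPy sketch of this hard-parameter-sharing setup: one shared hidden layer feeds two task-specific output layers. All sizes and the tanh/softmax choices are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

n, l = 4, 8                      # input size, shared hidden size (illustrative)
W_shared = rng.normal(size=(l, n)) * 0.1
heads = {                        # one output layer per task (illustrative sizes)
    "task1": rng.normal(size=(5, l)) * 0.1,
    "task2": rng.normal(size=(3, l)) * 0.1,
}

def multi_task_forward(x):
    h = np.tanh(W_shared @ x)                             # shared representation
    return {t: softmax(W @ h) for t, W in heads.items()}  # task-specific outputs

print(multi_task_forward(rng.normal(size=n)))
```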

  • Deep Learning: Multi-Modal. Inputs from different modalities pass through their own hidden layers h1 ... hl and are combined for a single task. [Diagram: two modality-specific stacks merging into one task output]

  • [Agenda, repeated]

  • Natural Language Processing (almost) from Scratch (Collobert et al., 2011); recoverable keywords: semantic role labeling, Deep Learning, multi-task learning.

  • [Training data; recoverable keywords: Wikipedia, Reuters RCV1, "50"]

    [Figure 2 residue (Collobert et al.): Sentence approach network. Pipeline: Input Sentence ("The cat sat on the mat", features w11 ... wKN) -> Lookup Tables LT_W1 ... LT_WK (dimension d, with padding) -> Convolution (M1, n1_hu units) -> Max Over Time -> Linear (M2, n2_hu units) -> HardTanh -> Linear (M3, n3_hu = #tags).]

    Figure 2: Sentence approach network.

    which says if a word is in a gazetteer or not. Another common practice is to introduce some basic pre-processing, such as word-stemming or dealing with upper and lower case. In this latter option, the word would then be represented by three discrete features: its lower case stemmed root, its lower case ending, and a capitalization feature.

    Generally speaking, we can consider a word as represented by K discrete features w ∈ D^1 × ... × D^K, where D^k is the dictionary for the k-th feature. We associate to each feature a lookup table LT_{W^k}(·), with parameters W^k ∈ R^{d^k_wrd × |D^k|}, where d^k_wrd ∈ N is a user-specified
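
To make the lookup-table idea concrete, here is a small NumPy sketch that maps each word in a window to the concatenation of its feature embeddings, roughly in the spirit of the LT_{W^k} tables described above. The toy vocabulary, the choice of two features (lower-cased word and capitalization), and the embedding sizes are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dictionaries: feature 1 = lower-cased word, feature 2 = capitalization.
words = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4, "<pad>": 5}
caps  = {"lower": 0, "upper": 1, "title": 2}

d_word, d_caps = 50, 5                       # user-specified embedding sizes
LT_word = rng.normal(size=(len(words), d_word)) * 0.1
LT_caps = rng.normal(size=(len(caps), d_caps)) * 0.1

def word_features(token):
    cap = "title" if token.istitle() else ("upper" if token.isupper() else "lower")
    w = LT_word[words.get(token.lower(), words["<pad>"])]
    c = LT_caps[caps[cap]]
    return np.concatenate([w, c])            # one (d_word + d_caps) vector per word

def window_features(tokens, center, size=5):
    half = size // 2
    padded = ["<pad>"] * half + tokens + ["<pad>"] * half
    window = padded[center:center + size]    # words around the center position
    return np.concatenate([word_features(t) for t in window])

print(window_features("The cat sat on the mat".split(), center=1).shape)  # (275,)
```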


  • Natural Language Processing (almost) from Scratch

    france 454  | jesus 1973 | xbox 6909   | reddish 11724 | scratched 29869 | megabits 87025
    austria     | god        | amiga       | greenish      | nailed          | octets
    belgium     | sati       | playstation | bluish        | smashed         | mb/s
    germany     | christ     | msx         | pinkish       | punched         | bit/s
    italy       | satan      | ipod        | purplish      | popped          | baud
    greece      | kali       | sega        | brownish      | crimped         | carats
    sweden      | indra      | psNUMBER    | greyish       | scraped         | kbit/s
    norway      | vishnu     | hd          | grayish       | screwed         | megahertz
    europe      | ananda     | dreamcast   | whitish       | sectioned       | megapixels
    hungary     | parvati    | geforce     | silvery       | slashed         | gbit/s
    switzerland | grace      | capcom      | yellowish     | ripped          | amperes

    Table 6: Word embeddings in the word lookup table of the language model neural network LM1 trained with a dictionary of size 100,000. For each column the queried word is followed by its index in the dictionary (higher means more rare) and its 10 nearest neighbors (using the Euclidean metric, which was chosen arbitrarily).

    Very long training times make such strategies necessary for the foreseeable future: if we had been given computers ten times faster, we probably would have found uses for datasets ten times bigger. Of course this process makes it very difficult to characterize the learning procedure. However we can characterize the final product.

    In the following subsections, we report results obtained with two trained language models. The results achieved by these two models are representative of those achieved by networks trained on the full corpuses.

    Language model LM1 has a window size dwin = 11 and a hidden layer with n1_hu = 100 units. The embedding layers were dimensioned like those of the supervised networks (Table 4). Model LM1 was trained on our first English corpus (Wikipedia) using successive dictionaries composed of the 5,000, 10,000, 30,000, 50,000 and finally 100,000 most common WSJ words. The total training time was about four weeks.

    Language model LM2 has the same dimensions. It was initialized with the embeddings of LM1, and trained for an additional three weeks on our second English corpus (Wikipedia + Reuters) using a dictionary size of 130,000 words.

    4.4 Embeddings

    Both networks produce much more appealing word embeddings than in Section 3.4. Table 6 shows the ten nearest neighbours of a few randomly chosen query words for the LM1 model. The syntactic and semantic properties of the neighbours are clearly related to those of the query word. These results are far more satisfactory than those reported in Table 6 for embeddings obtained using purely supervised training of the benchmark NLP tasks.
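
As a small illustration of the nearest-neighbour queries behind Table 6, here is a NumPy sketch that returns the k closest words to a query under the Euclidean metric. The tiny vocabulary and the random embedding matrix are assumptions for the example; in practice a trained lookup table such as LM1's would be used.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["france", "austria", "belgium", "germany", "xbox", "amiga", "playstation"]
word2id = {w: i for i, w in enumerate(vocab)}
E = rng.normal(size=(len(vocab), 50))        # embedding matrix (illustrative)

def nearest_neighbors(query, k=3):
    q = E[word2id[query]]
    dists = np.linalg.norm(E - q, axis=1)    # Euclidean distance to every word
    order = np.argsort(dists)
    return [vocab[i] for i in order if vocab[i] != query][:k]

print(nearest_neighbors("france"))
```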


  • [Window Approach; recoverable keywords: Time Delay Networks, Window Approach]

    [Figure 2 (Collobert et al.): Sentence approach network; same figure and excerpt as shown above.]

  • Natural Language Processing (almost) from Scratch

    [Figure 3 residue: Lookup Tables LT_W1 ... LT_WK -> Linear (M1, n1_hu units) -> HardTanh -> task-specific Linear layers M2(t1) and M2(t2), with n2_hu,(t1) = #tags and n2_hu,(t2) = #tags for Task 1 and Task 2.]

    Figure 3: Example of multitasking with NN. Task 1 and Task 2 are two tasks trained with the architecture presented in Figure 1. Lookup tables as well as the first hidden layer are shared. The last layer is task specific. The principle is the same with more than two tasks.

    Approach            POS (PWA)   CHUNK (F1)   NER (F1)
    Benchmark Systems   97.24       94.29        89.31
    NN+STC+LM2          97.20       93.63        88.67
    NN+STC+LM2+MTL      97.22       94.10        88.62

    Table 8: Effect of multi-tasking on our neural architectures. We trained POS, CHUNK and NER in a MTL way. As a baseline, we show previous results of our system trained separately on each task. We also report benchmark systems performance for comparison.

    fi is the classifier for the i-th task with parameters wi and vi. Notations Φ(x) and Ψ(x) represent engineered features for the pattern x. Matrix Θ maps the Φ(x) features into a low dimensional subspace common across all tasks. Each task is trained using its own examples without a joint labelling requirement. The learning procedure alternates the optimization of wi and vi for each task, and the optimization of Θ to minimize the average loss for all examples in all tasks. The authors also consider auxiliary unsupervised tasks for predicting substructures. They report excellent results on several tasks, including POS and NER.


  • [Slide citing (Qi, et al. 2013); recoverable keyword: state-of-the-art]

  • Word Vector (Mikolov, et al. 2013)

    Table 1: Examples of five types of semantic and nine types of syntactic questions in the Semantic-Syntactic Word Relationship test set.

    Type of relationship    | Word Pair 1            | Word Pair 2
    Common capital city     | Athens Greece          | Oslo Norway
    All capital cities      | Astana Kazakhstan      | Harare Zimbabwe
    Currency                | Angola kwanza          | Iran rial
    City-in-state           | Chicago Illinois       | Stockton California
    Man-Woman               | brother sister         | grandson granddaughter
    Adjective to adverb     | apparent apparently    | rapid rapidly
    Opposite                | possibly impossibly    | ethical unethical
    Comparative             | great greater          | tough tougher
    Superlative             | easy easiest           | lucky luckiest
    Present Participle      | think thinking         | read reading
    Nationality adjective   | Switzerland Swiss      | Cambodia Cambodian
    Past tense              | walking walked         | swimming swam
    Plural nouns            | mouse mice             | dollar dollars
    Plural verbs            | work works             | speak speaks

    4.1 Task Description

    To measure quality of the word vectors, we define a comprehensive test set that contains five types of semantic questions, and nine types of syntactic questions. Two examples from each category are shown in Table 1. Overall, there are 8869 semantic and 10675 syntactic questions. The questions in each category were created in two steps: first, a list of similar word pairs was created manually. Then, a large list of questions is formed by connecting two word pairs. For example, we made a list of 68 large American cities and the states they belong to, and formed about 2.5K questions by picking two word pairs at random. We have included in our test set only single token words, thus multi-word entities are not present (such as New York).

    We evaluate the overall accuracy for all question types, and for each question type separately (semantic, syntactic). A question is assumed to be correctly answered only if the closest word to the vector computed using the above method is exactly the same as the correct word in the question; synonyms are thus counted as mistakes. This also means that reaching 100% accuracy is likely to be impossible, as the current models do not have any input information about word morphology. However, we believe that usefulness of the word vectors for certain applications should be positively correlated with this accuracy metric. Further progress can be achieved by incorporating information about the structure of words, especially for the syntactic questions.

    4.2 Maximization of Accuracy

    We have used a Google News corpus for training the word vectors. This corpus contains about 6B tokens. We have restricted the vocabulary size to 1 million most frequent words. Clearly, we are facing a time constrained optimization problem, as it can be expected that both using more data and higher dimensional word vectors will improve the accuracy. To estimate the best choice of model architecture for obtaining as good as possible results quickly, we have first evaluated models trained on subsets of the training data, with vocabulary restricted to the most frequent 30k words. The results using the CBOW architecture with different choice of word vector dimensionality and increasing amount of the training data are shown in Table 2.

    It can be seen that after some point, adding more dimensions or adding more training data provides diminishing improvements. So, we have to increase both vector dimensionality and the amount of the training data together. While this observation might seem trivial, it must be noted that it is currently popular to train word vectors on relatively large amounts of data, but with insufficient size


  • Skip-gram (and CBOW)

    [Figure 1 residue: CBOW takes the context words w(t-2), w(t-1), w(t+1), w(t+2) through a SUM projection to predict w(t); Skip-gram takes w(t) through a projection to predict w(t-2), w(t-1), w(t+1), w(t+2).]

    Figure 1: New model architectures. The CBOW architecture predicts the current word based on the context, and the Skip-gram predicts surrounding words given the current word.

    R words from the future of the current word as correct labels. This will require us to do R × 2 word classifications, with the current word as input, and each of the R + R words as output. In the following experiments, we use C = 10.
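
A minimal sketch of how such (input word, context word) training pairs can be generated for the Skip-gram objective described above, with the context radius R drawn at random up to C for each position. The whitespace tokenizer and the toy sentence are assumptions for the example.

```python
import random

def skipgram_pairs(tokens, C=10, seed=0):
    """Yield (current word, context word) classification pairs, R*2 per position."""
    rng = random.Random(seed)
    pairs = []
    for t, center in enumerate(tokens):
        R = rng.randint(1, C)                      # context radius for this position
        for off in range(-R, R + 1):
            if off != 0 and 0 <= t + off < len(tokens):
                pairs.append((center, tokens[t + off]))
    return pairs

print(skipgram_pairs("the cat sat on the mat".split(), C=2))
```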

    4 Results

    To compare the quality of different versions of word vectors, previous papers typically use a table showing example words and their most similar words, and understand them intuitively. Although it is easy to show that the word France is similar to Italy and perhaps some other countries, it is much more challenging when subjecting those vectors to a more complex similarity task, as follows. We follow a previous observation that there can be many different types of similarities between words; for example, the word big is similar to bigger in the same sense that small is similar to smaller. An example of another type of relationship can be the word pairs big - biggest and small - smallest [20]. We further denote two pairs of words with the same relationship as a question, as we can ask: "What is the word that is similar to small in the same sense as biggest is similar to big?"

    Somewhat surprisingly, these questions can be answered by performing simple algebraic operations with the vector representation of words. To find a word that is similar to small in the same sense as biggest is similar to big, we can simply compute vector X = vector("biggest") - vector("big") + vector("small"). Then, we search in the vector space for the word closest to X measured by cosine distance, and use it as the answer to the question (we discard the input question words during this search). When the word vectors are well trained, it is possible to find the correct answer (word smallest) using this method.

    Finally, we found that when we train high dimensional word vectors on a large amount of data, the resulting vectors can be used to answer very subtle semantic relationships between words, such as a city and the country it belongs to, e.g. France is to Paris as Germany is to Berlin. Word vectors with such semantic relationships could be used to improve many existing NLP applications, such as machine translation, information retrieval and question answering systems, and may enable other future applications yet to be invented.
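
A NumPy sketch of the analogy query just described, vector("biggest") - vector("big") + vector("small"), answered by cosine similarity over the vocabulary. The tiny random embedding table is an assumption for the example; with trained vectors the nearest word would be "smallest".

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["big", "bigger", "biggest", "small", "smaller", "smallest"]
word2id = {w: i for i, w in enumerate(vocab)}
E = rng.normal(size=(len(vocab), 50))          # word vectors (illustrative, untrained)

def analogy(a, b, c):
    """Return the word closest (cosine) to vector(a) - vector(b) + vector(c)."""
    x = E[word2id[a]] - E[word2id[b]] + E[word2id[c]]
    sims = E @ x / (np.linalg.norm(E, axis=1) * np.linalg.norm(x))
    for i in np.argsort(-sims):                 # best match, skipping the inputs
        if vocab[i] not in (a, b, c):
            return vocab[i]

print(analogy("biggest", "big", "small"))
```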


  • Bilingual Word Embeddings (Zou, et al. 2013); remaining slide text not recoverable.

    Table 1: Results on Chinese Semantic Similarity

    Method                          Sp. Corr. (×100)   K. Tau (×100)
    Prior work (Jin and Wu, 2012)   -                  5.0
    Tf-idf:
      Naive tf-idf                  41.5               28.7
      Pruned tf-idf                 46.7               32.3
    Word Embeddings:
      Align-Init                    52.9               37.6
      Mono-trained                  59.3               42.1
      Biling-trained                60.8               43.3

    ...organizers of SemEval-2012 Task 4. This test-set contains 297 Chinese word pairs with similarity scores estimated by humans.

    The results for semantic similarity are shown in Table 1. We show two evaluation metrics: Spearman Correlation and Kendall's Tau. For both, bilingual embeddings trained with the combined objective defined by Equation 5 perform best. For pruned tf-idf, we follow Reisinger et al. (2010) and Huang et al. (2012) and count word co-occurrences in a 10-word window. We use the best results from a range of pruning and feature thresholds to compare against our method. The bilingual and monolingual trained embeddings4 out-perform pruned tf-idf by 14.1 and 12.6 Spearman Correlation (×100), respectively. Further, they out-perform embeddings initialized from alignment by 7.9 and 6.4. Both our tf-idf implementation and the word embeddings have significantly higher Kendall's Tau values compared to prior work (Jin and Wu, 2012). We verified Tau calculations with original submissions provided by the authors.

    4.2 Named Entity Recognition

    We perform NER experiments on OntoNotes (v4.0) (Hovy et al., 2006) to validate the quality of the Chinese word embeddings. Our experimental set-up is the same as Wang et al. (2013). With embeddings, we build a naive feed-forward neural network (Collobert et al., 2008) with 2000 hidden neurons and a sliding window of five words. This naive setting, without sequence modeling or sophisticated

    4 Due to variations caused by online minibatch L-BFGS, we take embeddings from five random points out of the last 10^5 minibatch iterations, and average their semantic similarity results.

    Table 2: Results on Named Entity Recognition

    Embeddings       Prec.   Rec.   F1     Improve
    Align-Init       0.34    0.52   0.41   -
    Mono-trained     0.54    0.62   0.58   0.17
    Biling-trained   0.48    0.55   0.52   0.11

    Table 3: Vector Matching Alignment AER (lower is better)

    Embeddings       Prec.   Rec.   AER
    Mono-trained     0.27    0.32   0.71
    Biling-trained   0.37    0.45   0.59

    joint optimization, is not competitive with state-of-the-art (Wang et al., 2013). Table 2 shows that the bilingual embeddings obtain a 0.11 F1 improvement, lagging monolingual, but significantly better than Align-Init (as in Section 3.2.1) on the NER task.

    4.3 Vector matching alignment

    Translation equivalence of the bilingual embeddings is evaluated by naive word alignment to match word embeddings by cosine distance.5 The Alignment Error Rates (AER) reported in Table 3 suggest that bilingual training using Equation 5 produces embeddings with better translation equivalence compared to those produced by monolingual training.

    4.4 Phrase-based machine translation

    Our experiments are performed using the Stanford Phrasal phrase-based machine translation system (Cer et al., 2010). In addition to NIST08 training data, we perform phrase extraction, filtering and phrase table learning with additional data from GALE MT evaluations in the past 5 years. In turn, our baseline is established at 30.01 BLEU and is reasonably competitive relative to NIST08 results. We use NIST06 as the tuning set6, and apply Minimum Error Rate Training (MERT) (Och, 2003) to tune the decoder.

    In the phrase-based MT system, we add one feature to bilingual phrase-pairs. For each phrase, the word embeddings are averaged to obtain a feature vector. If a word is not found in the vocabulary, we disregard it and assume it is not in the phrase; if no

    5 This is evaluated on 10,000 randomly selected sentence pairs from the MT training set.
    6 Updated to clarify the decoder tuning procedure.

  • Mapping

    Single-prototype English embeddings by Huang et al. (2012) are used to initialize Chinese embeddings. The initialization readily provides a set (Align-Init) of benchmark embeddings in experiments (Section 4), and ensures translation equivalence in the embeddings at the start of training.

    3.2.2 Bilingual training

    Using the alignment counts, we form alignment matrices A_en→zh and A_zh→en. For A_en→zh, each row corresponds to a Chinese word, and each column an English word. An element a_ij is first assigned the counts of when the i-th Chinese word is aligned with the j-th English word in parallel text. After assignments, each row is normalized such that it sums to one. The matrix A_zh→en is defined similarly. Denote the set of Chinese word embeddings as V_zh, with each row a word embedding, and the set of English word embeddings as V_en. With the two alignment matrices, we define the Translation Equivalence Objective:

        J_TEO-en→zh = || V_zh − A_en→zh V_en ||²   (3)
        J_TEO-zh→en = || V_en − A_zh→en V_zh ||²   (4)

    We optimize for a combined objective during training. For the Chinese embeddings we optimize for:

        J_CO-zh + λ J_TEO-en→zh   (5)

    For the English embeddings we optimize for:

        J_CO-en + λ J_TEO-zh→en   (6)

    During bilingual training, we chose the value of λ such that convergence is achieved for both J_CO and J_TEO. A small validation set of word similarities from (Jin and Wu, 2012) is used to ensure the embeddings have reasonable semantics.2

    In the next sections, bilingual trained embeddings refer to those initialized with MT alignments and trained with the objective defined by Equation 5. Monolingual trained embeddings refer to those initialized by alignment but trained without J_TEO-en→zh.

    2 In our experiments, λ = 50.
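
A small NumPy sketch of the Translation Equivalence Objective in Equations 3 and 5 as reconstructed above: rows of A_en→zh hold normalized alignment counts, and the penalty is the squared Frobenius norm of V_zh − A_en→zh·V_en. The tiny vocabulary sizes, random matrices, and λ = 50 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n_zh, n_en, d = 6, 5, 50                 # vocabulary sizes and embedding dim (toy)
V_zh = rng.normal(size=(n_zh, d))        # Chinese embeddings, one row per word
V_en = rng.normal(size=(n_en, d))        # English embeddings

counts = rng.integers(0, 5, size=(n_zh, n_en)).astype(float)  # alignment counts
counts[counts.sum(axis=1) == 0, 0] = 1.0                      # avoid empty rows
A_en2zh = counts / counts.sum(axis=1, keepdims=True)          # row-normalized

def J_TEO(V_tgt, A, V_src):
    """|| V_tgt - A V_src ||^2 (squared Frobenius norm), as in Equation 3."""
    diff = V_tgt - A @ V_src
    return float((diff ** 2).sum())

lam = 50.0                               # weight on the TEO term (illustrative)
J_CO_zh = 0.0                            # monolingual objective, omitted in this sketch
print(J_CO_zh + lam * J_TEO(V_zh, A_en2zh, V_en))   # combined objective, Equation 5
```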

    3.3 Curriculum training

    We train 100k-vocabulary word embeddings using curriculum training (Turian et al., 2010) with Equation 5. For each curriculum, we sort the vocabulary by frequency and segment the vocabulary by a band-size taken from {5k, 10k, 25k, 50k}. Separate bands of the vocabulary are trained in parallel using minibatch L-BFGS on the Chinese Gigaword corpus3. We train 100,000 iterations for each curriculum, and the entire 100k vocabulary is trained for 500,000 iterations. The process takes approximately 19 days on an eight-core machine. We show a visualization of learned embeddings overlaid with English in Figure 1. The two-dimensional vectors for this visualization are obtained with t-SNE (van der Maaten and Hinton, 2008). To make the figure comprehensible, subsets of Chinese words are provided with reference translations in boxes with green borders. Words across the two languages are positioned by the semantic relationships implied by their embeddings.

    Figure 1: Overlaid bilingual embeddings: English words are plotted in yellow boxes, and Chinese words in green; reference translations to English are provided in boxes with green borders directly below the original word.

    4 Experiments

    4.1 Semantic Similarity

    We evaluate the Mandarin Chinese embeddings with the semantic similarity test-set provided by the organizers...

    3 Fifth Edition. LDC catalog number LDC2011T13. We only exclude cna_cmn, the Traditional Chinese segment of the corpus.

  • Recursive Neural Network (Socher, et al. 2013); recoverable keywords: N-gram, Pos/Neg.

  • Recursive Neural Network (Socher, et al. 2013)

    Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank

    Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng and Christopher Potts

    Stanford University, Stanford, CA 94305, [email protected], {aperelyg,jcchuang,ang}@cs.stanford.edu, {jeaneis,manning,cgpotts}@stanford.edu

    Abstract

    Semantic word spaces have been very useful but cannot express the meaning of longer phrases in a principled way. Further progress towards understanding compositionality in tasks such as sentiment detection requires richer supervised training and evaluation resources and more powerful models of composition. To remedy this, we introduce a Sentiment Treebank. It includes fine grained sentiment labels for 215,154 phrases in the parse trees of 11,855 sentences and presents new challenges for sentiment compositionality. To address them, we introduce the Recursive Neural Tensor Network. When trained on the new treebank, this model outperforms all previous methods on several metrics. It pushes the state of the art in single sentence positive/negative classification from 80% up to 85.4%. The accuracy of predicting fine-grained sentiment labels for all phrases reaches 80.7%, an improvement of 9.7% over bag of features baselines. Lastly, it is the only model that can accurately capture the effects of negation and its scope at various tree levels for both positive and negative phrases.

    1 Introduction

    Semantic vector spaces for single words have been widely used as features (Turney and Pantel, 2010). Because they cannot capture the meaning of longer phrases properly, compositionality in semantic vector spaces has recently received a lot of attention (Mitchell and Lapata, 2010; Socher et al., 2010; Zanzotto et al., 2010; Yessenalina and Cardie, 2011; Socher et al., 2012; Grefenstette et al., 2013). However, progress is held back by the current lack of large and labeled compositionality resources and

    [Figure 1 residue: example parse tree of the sentence "This film does n't care about cleverness , wit or any other kind of intelligent humor ." with a sentiment label (−, 0, +) at every node.]

    Figure 1: Example of the Recursive Neural Tensor Network accurately predicting 5 sentiment classes, very negative to very positive (−−, −, 0, +, ++), at every node of a parse tree and capturing the negation and its scope in this sentence.

    models to accurately capture the underlying phenomena presented in such data. To address this need, we introduce the Stanford Sentiment Treebank and a powerful Recursive Neural Tensor Network that can accurately predict the compositional semantic effects present in this new corpus.

    The Stanford Sentiment Treebank is the first corpus with fully labeled parse trees that allows for a complete analysis of the compositional effects of sentiment in language. The corpus is based on the dataset introduced by Pang and Lee (2005) and consists of 11,855 single sentences extracted from movie reviews. It was parsed with the Stanford parser (Klein and Manning, 2003) and includes a total of 215,154 unique phrases from those parse trees, each annotated by 3 human judges. This new dataset allows us to analyze the intricacies of sentiment and to capture complex linguistic phenomena. Fig. 1 shows one of the many examples with clear compositional structure. The granularity and size of

    Pos/Neg

  • Recursive Neural Network (Socher, et al. 2013)

    children of p2:

        δ_p2,down = (W^T δ_p2,com + S) ⊗ f'([a; p1]),

    where we define

        S = Σ_{k=1}^{d} δ_p2,com^k ( V^[k] + (V^[k])^T ) [a; p1]

    The children of p2 will then each take half of this vector and add their own softmax error message for the complete δ. In particular, we have

        δ_p1,com = δ_p1,s + δ_p2,down[d+1 : 2d],

    where δ_p2,down[d+1 : 2d] indicates that p1 is the right child of p2 and hence takes the 2nd half of the error; for the final word vector derivative for a, it will be δ_p2,down[1 : d]. The full derivative for slice V^[k] for this trigram tree then is the sum at each node:

        ∂E/∂V^[k] = ∂E^p2/∂V^[k] + δ_p1,com^k [b; c] [b; c]^T,

    and similarly for W. For this nonconvex optimization we use AdaGrad (Duchi et al., 2011) which converges in less than 3 hours to a local optimum.
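
For context, the quantities above are gradients of the RNTN composition step. Below is a minimal NumPy sketch of that forward composition, p = f([a; b]^T V^[1:d] [a; b] + W [a; b]), which these derivatives backpropagate through; the dimensions and random parameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                        # word/phrase vector dimension (toy)
V = rng.normal(size=(d, 2 * d, 2 * d)) * 0.1 # tensor with d slices V[k]
W = rng.normal(size=(d, 2 * d)) * 0.1        # standard RNN composition matrix

def rntn_compose(x, y, f=np.tanh):
    """Compose two child vectors x, y into a parent vector p."""
    xy = np.concatenate([x, y])              # [x; y]
    tensor_term = np.array([xy @ V[k] @ xy for k in range(d)])
    return f(tensor_term + W @ xy)

a, b, c = (rng.normal(size=d) for _ in range(3))  # leaves of the trigram tree
p1 = rntn_compose(b, c)                      # p1 composed from leaves b and c
p2 = rntn_compose(a, p1)                     # p2 composed from a and p1
print(p2)
```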

    5 Experiments

    We include two types of analyses. The first type includes several large quantitative evaluations on the test set. The second type focuses on two linguistic phenomena that are important in sentiment.

    For all models, we use the dev set and cross-validate over regularization of the weights, word vector size as well as learning rate and minibatch size for AdaGrad. Optimal performance for all models was achieved at word vector sizes between 25 and 35 dimensions and batch sizes between 20 and 30. Performance decreased at larger or smaller vector and batch sizes. This indicates that the RNTN does not outperform the standard RNN due to simply having more parameters. The MV-RNN has orders of magnitudes more parameters than any other model due to the word matrices. The RNTN would usually achieve its best performance on the dev set after training for 3 - 5 hours. Initial experiments

    Model      Fine-grained All   Fine-grained Root   Pos/Neg All   Pos/Neg Root
    NB         67.2               41.0                82.6          81.8
    SVM        64.3               40.7                84.6          79.4
    BiNB       71.0               41.9                82.7          83.1
    VecAvg     73.3               32.7                85.1          80.1
    RNN        79.0               43.2                86.1          82.4
    MV-RNN     78.7               44.4                86.8          82.9
    RNTN       80.7               45.7                87.6          85.4

    Table 1: Accuracy for fine grained (5-class) and binary predictions at the sentence level (root) and for all nodes.

    showed that the recursive models worked significantly worse (over 5% drop in accuracy) when no nonlinearity was used. We use f = tanh in all experiments.

    We compare to commonly used methods that use bag of words features with Naive Bayes and SVMs, as well as Naive Bayes with bag of bigram features. We abbreviate these with NB, SVM and biNB. We also compare to a model that averages neural word vectors and ignores word order (VecAvg).

    The sentences in the treebank were split into a train (8544), dev (1101) and test split (2210) and these splits are made available with the data release. We also analyze performance on only positive and negative sentences, ignoring the neutral class. This filters about 20% of the data with the three sets having 6920/872/1821 sentences.

    5.1 Fine-grained Sentiment For All Phrases

    The main novel experiment and evaluation metric analyze the accuracy of fine-grained sentiment classification for all phrases. Fig. 2 showed that a fine grained classification into 5 classes is a reasonable approximation to capture most of the data variation. Fig. 6 shows the result on this new corpus. The RNTN gets the highest performance, followed by the MV-RNN and RNN. The recursive models work very well on shorter phrases, where negation and composition are important, while bag of features baselines perform well only with longer sentences. The RNTN accuracy upper bounds other models at most n-gram lengths.

    Table 1 (left) shows the overall accuracy numbers for fine grained prediction at all phrase lengths and full sentences.

  • [Agenda, repeated]

  • CRF vs. Deep Learning for NER (Wang, Manning 2013); recoverable keywords: (Collobert 2011), CRF, Sentence-Level Likelihood Neural Network (Collobert 2011).

  • CRF vs. Deep Learning for NER (Wang, Manning 2013): CRF + [remaining slide text not recoverable]

    Dataset   CRF P   CRF R   CRF F1   SLNN P   SLNN R   SLNN F1
    CoNLLd    90.9    90.4    90.7     89.3     89.7     89.5
    CoNLLt    85.4    84.7    85.0     83.3     83.9     83.6
    ACE       81.0    74.2    77.4     80.9     74.0     77.3
    MUC       72.5    74.5    73.5     71.1     74.1     72.6
    Chunk     93.7    93.5    93.6     93.3     93.3     93.3

    Table 1: Results of CRF versus SLNN, over discrete feature space. CoNLLd stands for the CoNLL development set, and CoNLLt is the test set. Best F1 score on each dataset is highlighted in bold.

    5.1 Results of Discrete Representation

    The first question we address is the following: in the high-dimensional discrete feature space, would the non-linear architecture in the SLNN model help it to outperform the CRF?

    Results from Table 1 suggest that the SLNN does not seem to benefit from the non-linear architecture on either the NER or Syntactic Chunking tasks. In particular, on the CoNLL and MUC datasets, the SLNN resulted in a 1% performance drop, which is significant for NER. The specific statistical properties of this dataset that lead to the performance drop are hard to determine, but we believe it is partially because the SLNN has a much harder non-convex optimization problem to solve: on this small dataset, the SLNN with 300 hidden units generates a shocking number of 100 million parameters (437905 features times 300 hidden dimensions), due to the high dimensionality of the input feature space.

    To further illustrate this point, we also compared the CRF model with its Linear Neural Network (LNN) extension, which has exactly the same number of parameters as the SLNN but does not include the non-linear activation layer. Although this model is identical in representational power to the CRF as we discussed in Section 2, the optimization problem here is no longer convex (Ando and Zhang, 2005). To see why, consider applying a linear scaling transformation to the input layer parameter matrix, and apply the inverse scaling to the output layer matrix. The resulting model has exactly the same function values. We can see from Table 2 that there is indeed a performance drop with the LNN model as well, likely due to difficulty with optimization. By comparing the results of LNN and SLNN, we see that the addition of a non-linear activation layer in SLNN does not seem to help, but in fact further decreases

    Dataset   CRF P   CRF R   CRF F1   LNN P   LNN R   LNN F1
    CoNLLd    90.9    90.4    90.7     89.5    90.6    90.0
    CoNLLt    85.4    84.7    85.0     83.1    84.7    83.9
    ACE       81.0    74.2    77.4     80.7    74.3    77.3
    MUC       72.5    74.5    73.5     72.3    75.2    73.7
    Chunk     93.7    93.5    93.6     93.1    93.2    93.2

    Table 2: Results of CRF versus LNN, over discrete feature space.

    performance in all cases except Syntactic Chunking.

    A distinct characteristic of NLP data is its high dimensionality. The vocabulary size of a decent sized text corpus is already in the tens of thousands, and bigram statistics are usually an order of magnitude larger. These basic information units are typically very informative, and there is not much structure in them to be explored. Although some studies argue that non-linear neural nets suffer less from the curse of dimensionality (Attali and Pages, 1997; Bengio and Bengio, 2000; Pitkow, 2012), counter arguments have been offered (Camastra, 2003; Verleysen et al., 2003). The empirical results from our experiment seem to support the latter. Similar results have also been found in other NLP applications such as Text Classification. Joachims concluded in his seminal work: "non-linear SVMs do not provide any advantage for text classification using the standard kernels" (Joachims, 2004, p. 115). If we compare the learning curves of CRF and SLNN (Figure 2), where we vary the amount of binary features available in the model by random sub-sampling, we can further observe that SLNNs enjoy a small performance advantage in lower dimensional space (when less than 30% of features are used), but are quickly outpaced by CRFs in higher dimensional space as more features become available.

    Another point of consideration is whether there

    [Figure 2 residue: learning curves for SLNN and CRF; x-axis 0.2 to 1.0, y-axis F1 from 70 to 90.]

    Figure 2: The learning curve of SLNN vs. CRF on CoNLL-03 dev set, with respect to the percentage of discrete features used (i.e., size of input dimension). Y-axis is the F1 score (out of 100), and X-axis is the percentage of features used.

    Dataset   CRF P   CRF R   CRF F1   SLNN P   SLNN R   SLNN F1
    CoNLLd    80.7    78.7    79.7     86.1     87.1     86.6
    CoNLLt    76.4    75.5    76.0     79.8     81.7     80.7
    ACE       71.5    71.1    71.3     75.8     74.1     75.0
    MUC       65.3    74.0    69.4     65.7     76.8     70.8

    Table 3: Results of CRF versus SLNN, over continuous space feature representations.

    input space through the interactions of latent variables (Liang et al., 2008), and it is unclear how much additional gain we would get by explicitly modeling the non-linearity in local inputs.

    5.2 Results of Distributional Representation

    For the next experiment, we replace the discrete input features with a continuous space representation by looking up the embedding of each word, and concatenate the embeddings of a five word window centered around the current position. Four binary features are also appended to each word embedding to capture capitalization patterns, as described in Collobert et al. (2011). Results of the CRF and SLNN under this setting for the NER task are shown in Table 3.

    With a continuous space representation, the SLNN model works significantly better than a CRF, by as much as 7% on the CoNLL development set, and 3.7% on the ACE dataset. This suggests that there exist statistical dependencies within this low-dimensional (300) data that cannot be effectively captured by linear transformations, but can be modeled in the non-linear neural nets. This perhaps coincides with the large performance

    Model             CoNLLd   CoNLLt   ACE    MUC
    CRF discrete      90.7     85.0     77.4   73.5
    CRF join          92.4     87.7     82.2   81.1
    SLNN continuous   86.6     80.7     75.0   70.8
    SLNN join         91.9     87.1     81.2   79.7

    Table 4: Results of CRF and SLNN when word embeddings are appended to the discrete features. Numbers shown are F1 scores.

    improvements observed from neural nets in handwritten digit recognition datasets as well (Peng et al., 2009; Do and Artieres, 2010), where dimensionality is also relatively low.

    5.3 Combine Discrete and Distributional Features

    When we join word embeddings with discrete features, we see further performance improvements, especially in the out-of-domain datasets. The results are shown in Table 4.

    A similar effect was also observed in Turian et al. (2010). The performance of both the CRF and SLNN increases by similar relative amounts, but the CRF model maintains a lead in overall absolute performance.

    6 Conclusion

    We carefully compared and analyzed the non-linear neural networks used in Collobert et al. (2011) and the widely adopted CRF, and revealed their close relationship. Through extensive experiments on NER and Syntactic Chunking, we have shown that non-linear architectures are effective in low dimensional continuous input spaces, but that they are not better suited for conventional high-dimensional discrete input spaces. Furthermore, both linear and non-linear models benefit greatly from the combination of continuous and discrete features, especially for out-of-domain datasets. This finding confirms earlier results that distributional representations can be used to achieve better generalization.

    Acknowledgments

    The authors would like to thank the three anonymous reviewers and acknowledge the support of the DARPA Broad Operational Language Translation (BOLT) program through IBM. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the view of DARPA or the US government.

  • Paragraph Vector (Le, et al. 2014); recoverable keywords: Word Vector, Paragraph.

    Distributed Representations of Sentences and Documents

    example, powerful and strong are close to each other, whereas powerful and Paris are more distant. The difference between word vectors also carries meaning. For example, the word vectors can be used to answer analogy questions using simple vector algebra: King - man + woman = Queen (Mikolov et al., 2013d). It is also possible to learn a linear matrix to translate words and phrases between languages (Mikolov et al., 2013b).

    These properties make word vectors attractive for many natural language processing tasks such as language modeling (Bengio et al., 2006; Mikolov, 2012), natural language understanding (Collobert & Weston, 2008; Zhila et al., 2013), statistical machine translation (Mikolov et al., 2013b; Zou et al., 2013), image understanding (Frome et al., 2013) and relational extraction (Socher et al., 2013a).

    2.2. Paragraph Vector: A distributed memory model

    Our approach for learning paragraph vectors is inspired by the methods for learning the word vectors. The inspiration is that the word vectors are asked to contribute to a prediction task about the next word in the sentence. So despite the fact that the word vectors are initialized randomly, they can eventually capture semantics as an indirect result of the prediction task. We will use this idea in our paragraph vectors in a similar manner. The paragraph vectors are also asked to contribute to the prediction task of the next word given many contexts sampled from the paragraph.

    In our Paragraph Vector framework (see Figure 2), every paragraph is mapped to a unique vector, represented by a column in matrix D, and every word is also mapped to a unique vector, represented by a column in matrix W. The paragraph vector and word vectors are averaged or concatenated to predict the next word in a context. In the experiments, we use concatenation as the method to combine the vectors.

    More formally, the only change in this model compared to the word vector framework is in equation 1, where h is constructed from W and D.

    The paragraph token can be thought of as another word. It acts as a memory that remembers what is missing from the current context, or the topic of the paragraph. For this reason, we often call this model the Distributed Memory Model of Paragraph Vectors (PV-DM).

    The contexts are fixed-length and sampled from a sliding window over the paragraph. The paragraph vector is shared across all contexts generated from the same paragraph but not across paragraphs. The word vector matrix W, however, is shared across paragraphs. I.e., the vector for powerful is the same for all paragraphs.

    The paragraph vectors and word vectors are trained using stochastic gradient descent and the gradient is obtained via backpropagation. At every step of stochastic gradient descent, one can sample a fixed-length context from a random paragraph, compute the error gradient from the network in Figure 2 and use the gradient to update the parameters in our model.

    At prediction time, one needs to perform an inference step to compute the paragraph vector for a new paragraph. This is also obtained by gradient descent. In this step, the parameters for the rest of the model, the word vectors W and the softmax weights, are fixed.

    Suppose that there are N paragraphs in the corpus, M words in the vocabulary, and we want to learn paragraph vectors such that each paragraph is mapped to p dimensions and each word is mapped to q dimensions; then the model has a total of N × p + M × q parameters (excluding the softmax parameters). Even though the number of parameters can be large when N is large, the updates during training are typically sparse and thus efficient.

    Figure 2. A framework for learning paragraph vectors. This framework is similar to the framework presented in Figure 1; the only change is the additional paragraph token that is mapped to a vector via matrix D. In this model, the concatenation or average of this vector with a context of three words is used to predict the fourth word. The paragraph vector represents the missing information from the current context and can act as a memory of the topic of the paragraph.

    After being trained, the paragraph vectors can be used as features for the paragraph (e.g., in lieu of or in addition to bag-of-words). We can feed these features directly to conventional machine learning techniques such as logistic regression, support vector machines or K-means.

    In summary, the algorithm itself has two key stages: 1) training to get word vectors W, softmax weights U, b and paragraph vectors D on already seen paragraphs; and 2) the inference stage to get paragraph vectors D for new paragraphs (never seen before) by adding more columns in D and gradient descending on D while holding W, U, b fixed. We use D to make a prediction about some particular labels using a standard classifier, e.g., logistic regression.
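
As a practical illustration of the two stages just summarized (training on seen paragraphs, then inferring vectors for new ones), here is a short sketch using the gensim library's Doc2Vec (4.x API) as a stand-in implementation of PV-DM; the toy corpus and the hyperparameter values are illustrative assumptions, not the paper's setup.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus: one TaggedDocument per paragraph (illustrative).
corpus = [
    TaggedDocument(words="the movie was wonderful and moving".split(), tags=["d0"]),
    TaggedDocument(words="a dull plot and wooden acting".split(), tags=["d1"]),
    TaggedDocument(words="great performances and a clever script".split(), tags=["d2"]),
]

# dm=1 selects the distributed-memory (PV-DM) variant.
model = Doc2Vec(corpus, vector_size=50, window=4, min_count=1, dm=1, epochs=50)

train_vec = model.dv["d0"]                       # paragraph vector learned in training
new_vec = model.infer_vector("an unexpectedly moving film".split())  # inference stage
print(train_vec.shape, new_vec.shape)
```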

  • Paragraph Vector (Le, et al. 2014): Paragraph Matrix and Word vectors; recoverable keywords: Paragraph Matrix, Support Vector Machine, K-means.

  • Stanford Sentiment Treebank Dataset


    Tasks and Baselines: In (Socher et al., 2013b), the authors propose two ways of benchmarking. First, one could consider a 5-way fine-grained classification task where the labels are {Very Negative, Negative, Neutral, Positive, Very Positive} or a 2-way coarse-grained classification task where the labels are {Negative, Positive}. The other axis of variation is in terms of whether we should label the entire sentence or all phrases in the sentence. In this work we only consider labeling the full sentences.

    Socher et al. (Socher et al., 2013b) apply several methods to this dataset and find that their Recursive Neural Tensor Network works much better than bag-of-words models. It can be argued that this is because movie reviews are often short and compositionality plays an important role in deciding whether the review is positive or negative, as well as similarity between words does given the rather tiny size of the training set.

    Experimental protocols: We follow the experimental protocols as described in (Socher et al., 2013b). To make use of the available labeled data, in our model, each subphrase is treated as an independent sentence and we learn the representations for all the subphrases in the training set.

    After learning the vector representations for training sentences and their subphrases, we feed them to a logistic regression to learn a predictor of the movie rating.

    At test time, we freeze the vector representation for each word, and learn the representations for the sentences using gradient descent. Once the vector representations for the test sentences are learned, we feed them through the logistic regression to predict the movie rating.

    In our experiments, we cross validate the window size using the validation set, and the optimal window size is 8. The vector presented to the classifier is a concatenation of two vectors, one from PV-DBOW and one from PV-DM. In PV-DBOW, the learned vector representations have 400 dimensions. In PV-DM, the learned vector representations have 400 dimensions for both words and paragraphs. To predict the 8-th word, we concatenate the paragraph vectors and 7 word vectors. Special characters such as ,.!? are treated as a normal word. If the paragraph has less than 9 words, we pre-pad with a special NULL word symbol.

    Results: We report the error rates of different methods in Table 1. The first highlight of this Table is that bag-of-words or bag-of-n-grams models (NB, SVM, BiNB) perform poorly. Simply averaging the word vectors (in a bag-of-words fashion) does not improve the results. This is because bag-of-words models do not consider how each sentence is composed (e.g., word ordering) and therefore fail to recognize many sophisticated linguistic phenomena, for instance sarcasm. The results also show that

    Table 1. The performance of our method compared to other approaches on the Stanford Sentiment Treebank dataset. The error rates of other methods are reported in (Socher et al., 2013b).

    Model                                                    Error rate (Pos/Neg)   Error rate (Fine-grained)
    Naive Bayes (Socher et al., 2013b)                       18.2%                  59.0%
    SVMs (Socher et al., 2013b)                              20.6%                  59.3%
    Bigram Naive Bayes (Socher et al., 2013b)                16.9%                  58.1%
    Word Vector Averaging (Socher et al., 2013b)             19.9%                  67.3%
    Recursive Neural Network (Socher et al., 2013b)          17.6%                  56.8%
    Matrix Vector-RNN (Socher et al., 2013b)                 17.1%                  55.6%
    Recursive Neural Tensor Network (Socher et al., 2013b)   14.6%                  54.3%
    Paragraph Vector                                         12.2%                  51.3%

    more advanced methods (such as Recursive Neural Network (Socher et al., 2013b)), which require parsing and take into account the compositionality, perform much better.

    Our method performs better than all these baselines, e.g., recursive networks, despite the fact that it does not require parsing. On the coarse-grained classification task, our method has an absolute improvement of 2.4% in terms of error rates. This translates to a 16% relative improvement.

    3.2. Beyond One Sentence: Sentiment Analysis with IMDB dataset

    Some of the previous techniques only work on sentences, but not paragraphs/documents with several sentences. For instance, Recursive Neural Tensor Network (Socher et al., 2013b) is based on the parsing over each sentence and it is unclear how to combine the representations over many sentences. Such techniques therefore are restricted to work on sentences but not paragraphs or documents.

    Our method does not require parsing, thus it can produce a representation for a long document consisting of many sentences. This advantage makes our method more general than some of the other approaches. The following experiment on the IMDB dataset demonstrates this advantage.

    Dataset: The IMDB dataset was first proposed by Maas et al. (Maas et al., 2011) as a benchmark for sentiment analysis. The dataset consists of 100,000 movie reviews taken from IMDB. One key aspect of this dataset is that each movie review has several sentences.

    The 100,000 movie reviews are divided into three datasets:

    [Slide annotations: Recursive Neural Network; Positive/Negative; Fine-grained: Very Negative, Negative, Neutral, Positive, Very Positive]

  • IMDB dataset


    25,000 labeled training instances, 25,000 labeled test instances and 50,000 unlabeled training instances. There are two types of labels: Positive and Negative. These labels are balanced in both the training and the test set. The dataset can be downloaded at http://ai.Stanford.edu/amaas/data/sentiment/index.html

    Experimental protocols: We learn the word vectors and paragraph vectors using 75,000 training documents (25,000 labeled and 50,000 unlabeled instances). The paragraph vectors for the 25,000 labeled instances are then fed through a neural network with one hidden layer with 50 units and a logistic classifier to learn to predict the sentiment.1

    At test time, given a test sentence, we again freeze the rest of the network and learn the paragraph vectors for the test reviews by gradient descent. Once the vectors are learned, we feed them through the neural network to predict the sentiment of the reviews.

    The hyperparameters of our paragraph vector model are selected in the same manner as in the previous task. In particular, we cross validate the window size, and the optimal window size is 10 words. The vector presented to the classifier is a concatenation of two vectors, one from PV-DBOW and one from PV-DM. In PV-DBOW, the learned vector representations have 400 dimensions. In PV-DM, the learned vector representations have 400 dimensions for both words and documents. To predict the 10-th word, we concatenate the paragraph vectors and word vectors. Special characters such as ,.!? are treated as a normal word. If the document has less than 9 words, we pre-pad with a special NULL word symbol.

    Results: The results of Paragraph Vector and other baselines are reported in Table 2. As can be seen from the Table, for long documents, bag-of-words models perform quite well and it is difficult to improve upon them using word vectors. The most significant improvement happened in 2012 in the work of (Dahl et al., 2012) where they combine a Restricted Boltzmann Machines model with bag-of-words. The combination of the two models yields an improvement of approximately 1.5% in terms of error rates.

    Another significant improvement comes from the work of (Wang & Manning, 2012). Among many variations they tried, NBSVM on bigram features works the best and yields a considerable improvement of 2% in terms of the error rate.

    The method described in this paper is the only approach that goes significantly beyond the barrier of 10% error rate. It achieves 7.42% which is another 1.3% absolute improvement (or 15% relative improvement) over the best previous result of (Wang & Manning, 2012).

    1 In our experiments, the neural network did perform better than a linear logistic classifier in this task.

    Table 2. The performance of Paragraph Vector compared to other approaches on the IMDB dataset. The error rates of other methods are reported in (Wang & Manning, 2012).

    Model                                      Error rate
    BoW (bnc) (Maas et al., 2011)              12.20%
    BoW (btc) (Maas et al., 2011)              11.77%
    LDA (Maas et al., 2011)                    32.58%
    Full+BoW (Maas et al., 2011)               11.67%
    Full+Unlabeled+BoW (Maas et al., 2011)     11.11%
    WRRBM (Dahl et al., 2012)                  12.58%
    WRRBM + BoW (bnc) (Dahl et al., 2012)      10.77%
    MNB-uni (Wang & Manning, 2012)             16.45%
    MNB-bi (Wang & Manning, 2012)              13.41%
    SVM-uni (Wang & Manning, 2012)             13.05%
    SVM-bi (Wang & Manning, 2012)              10.84%
    NBSVM-uni (Wang & Manning, 2012)           11.71%
    NBSVM-bi (Wang & Manning, 2012)            8.78%
    Paragraph Vector                           7.42%

    3.3. Information Retrieval with Paragraph Vectors

    We turn our attention to an information retrieval task which requires fixed-length representations of paragraphs.

    Here, we have a dataset of paragraphs in the first 10 results returned by a search engine given each of the 1,000,000 most popular queries. Each of these paragraphs is also known as a "snippet" which summarizes the content of a web page and how a web page matches the query.

    From such a collection, we derive a new dataset to test vector representations of paragraphs. For each query, we create a triplet of paragraphs: the two paragraphs are results of the same query, whereas the third paragraph is a randomly sampled paragraph from the rest of the collection (returned as the result of a different query). Our goal is to identify which of the three paragraphs are results of the same query. To achieve this, we will use paragraph vectors and compute the distances between the paragraphs. A better representation is one that achieves a small distance for pairs of paragraphs of the same query and a large distance for pairs of paragraphs of different queries (a small sketch of this triplet comparison follows the excerpt below).

    Here is a sample of three paragraphs, where the first paragraph should be closer to the second paragraph than the third paragraph:

    Paragraph 1: calls from ( 000 ) 000 - 0000 . 3913 calls reported from this number . according to 4 reports the identity of this caller is american airlines .

    Restricted Boltzmann Machine
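
A minimal sketch of the triplet evaluation described above: given three paragraph vectors, check whether the two results of the same query are closer to each other than either is to the random paragraph. The random vectors stand in for real paragraph vectors, e.g. ones inferred with a trained Paragraph Vector model.

```python
import numpy as np

rng = np.random.default_rng(0)

def dist(u, v):
    return np.linalg.norm(u - v)              # distance between paragraph vectors

def triplet_correct(p1, p2, p3):
    """True if the same-query pair (p1, p2) is closer than either is to p3."""
    return dist(p1, p2) < min(dist(p1, p3), dist(p2, p3))

# p1, p2: results of the same query; p3: a randomly sampled paragraph (illustrative).
p1, p2, p3 = (rng.normal(size=400) for _ in range(3))
print(triplet_correct(p1, p2, p3))
```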

  • [Summary slide; recoverable keywords: Deep Learning; POS, NER, Chunking, SRL, WS; Word Embedding; Word embedding + (...); Deep Learning / "Deep?"]