Page 1

Neural Networks for Natural Language Processing

Tomas Mikolov, Facebook
Talk at AI Summit, Vienna, 2017

Page 2

Introduction

• Neural networks have recently been applied to many important tasks that involve natural language:

• Web Search

• Advertisement

• Spam filters

• Machine translation

• Among the most frequently used concepts are:

• Distributed representations: word2vec, unsupervised pre-training

• Weight sharing: recurrent and convolutional networks

Page 3

Distributed representations of words

• Vector representation of words computed using neural networks

• Linguistic regularities in the word vector space

• Word2vec, fastText

Page 4

Word vectors

• Each word is associated with a real-valued vector in N-dimensional space (usually N = 50 – 1000)

• The word vectors have similar properties to word classes (similar words have similar vector representations)

• Often computed using various types of neural networks
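As a toy illustration of what "similar words have similar vectors" means, here is a minimal numpy sketch; the tiny vocabulary and hand-made 4-dimensional vectors are invented for illustration only (real models use 50 – 1000 dimensions learned from data):

```python
import numpy as np

# Made-up toy vectors; trained models would assign similar vectors to similar words.
vectors = {
    "cat":   np.array([0.9, 0.1, 0.0, 0.3]),
    "dog":   np.array([0.8, 0.2, 0.1, 0.3]),
    "paris": np.array([0.0, 0.9, 0.8, 0.1]),
}

def cosine(u, v):
    """Cosine similarity, the usual way of comparing word vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(vectors["cat"], vectors["dog"]))    # high: similar words
print(cosine(vectors["cat"], vectors["paris"]))  # low: unrelated words
```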

Page 5

Word vectors

• These word vectors can be subsequently used as features in many NLP tasks (Collobert et al, 2011)

• As word vectors can be trained on huge text datasets, they provide generalization for systems trained with a limited amount of supervised data

Page 6

Word vectors

• Many neural architectures were proposed for training the word vectors, usually using several hidden layers

• We need some way to compare word vectors trained using different architectures

Page 7

Linguistic regularities in vector space

• We can do a nearest neighbor search around the result of the vector operation “king – man + woman” and obtain “queen”

Linguistic regularities in continuous space word representations (Mikolov et al, 2013)
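A minimal sketch of this analogy query using the gensim library (gensim is not mentioned in the talk, and the vector file name is a placeholder for any pre-trained model in word2vec format):

```python
from gensim.models import KeyedVectors

# Load pre-trained vectors in word2vec format; "vectors.bin" is a placeholder path.
kv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

# Nearest neighbor of the vector king - man + woman.
print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# With good vectors, the top result is expected to be ('queen', ...).
```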

Page 8

Trained word vectors – visualization using PCA
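The slide shows a 2-D PCA projection of trained word vectors (the figure itself is not reproduced in this transcript). A rough sketch of how such a plot can be produced with scikit-learn and matplotlib; the random placeholder vectors below stand in for vectors loaded from a trained model:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Placeholder vectors; in practice these would come from a trained model.
rng = np.random.default_rng(0)
words = ["king", "queen", "man", "woman", "paris", "france"]
vectors = {w: rng.normal(size=100) for w in words}

X = np.stack([vectors[w] for w in words])
xy = PCA(n_components=2).fit_transform(X)   # project to 2 dimensions for plotting

plt.scatter(xy[:, 0], xy[:, 1])
for (x, y), w in zip(xy, words):
    plt.annotate(w, (x, y))
plt.show()
```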

Page 9

Word vectors – datasets for evaluation

Word-based dataset, almost 20K questions, focuses on both syntax and semantics:

• Athens:Greece Oslo: ___

• Angola:kwanza Iran: ___

• brother:sister grandson: ___

• possibly:impossibly ethical: ___

• walking:walked swimming: ___

Efficient estimation of word representations in vector space (Mikolov et al, 2013)
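A hedged sketch of how accuracy on such a question set can be measured with the gensim analogy query shown above; the file names are placeholders, and the questions file is assumed to contain one "a b c d" analogy per line:

```python
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)   # placeholder path

correct = total = 0
with open("questions.txt") as f:             # placeholder: one "a b c d" analogy per line
    for line in f:
        a, b, c, d = line.split()
        if not all(w in kv for w in (a, b, c, d)):
            continue                         # skip questions with out-of-vocabulary words
        predicted = kv.most_similar(positive=[b, c], negative=[a], topn=1)[0][0]
        total += 1
        correct += predicted == d
print(f"accuracy: {correct / total:.1%} on {total} questions")
```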

Page 10

Word vectors – Bengio architecture

• Neural net based word vectors were traditionally trained as part of neural network language models (Bengio et al, 2003)

• Training on <1M words took days

Page 11

Word vectors – word2vec architectures

• Continuous Bag-of-Words model (CBOW): adds inputs from words within a short window to predict the current word

• No hidden layer, no matrix multiplications

• The weights for different positions are shared

• Usually thousands of times faster to train than Bengio’s architecture
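A minimal numpy sketch of the CBOW forward pass described above; the vocabulary size, dimensionality, random weights and word indices are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
V, dim = 10_000, 300                           # vocabulary size, vector dimensionality
W_in = rng.normal(scale=0.01, size=(V, dim))   # input vectors, shared across window positions
W_out = rng.normal(scale=0.01, size=(V, dim))  # output vectors

context_ids = [17, 42, 256, 1024]              # indices of words within the short window
h = W_in[context_ids].mean(axis=0)             # "add" (average) the context vectors: no hidden layer

scores = W_out @ h                             # one score per vocabulary word
probs = np.exp(scores - scores.max())
probs /= probs.sum()                           # softmax over the full vocabulary (the expensive part)
print(probs.argmax())                          # index predicted for the current (middle) word
```

The softmax over the whole vocabulary is exactly what the hierarchical softmax and negative sampling tricks on the training slide avoid.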

Page 12

Word vectors – word2vec architectures

• Skip-gram model: predicts surrounding words using the current word

• Performance similar to the CBOW model
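For readers who want to try both architectures, a hedged sketch using the gensim library (not part of the talk); sg=0 selects CBOW, sg=1 selects skip-gram, and the two-sentence corpus is only a placeholder for a large text collection:

```python
from gensim.models import Word2Vec

# Placeholder corpus: an iterable of tokenized sentences. Real training data
# should be much larger (millions to billions of words).
sentences = [["the", "king", "rules", "the", "country"],
             ["the", "queen", "rules", "the", "country"]]

# vector_size is called "size" in older gensim versions.
cbow = Word2Vec(sentences, vector_size=100, window=5, sg=0, min_count=1)       # CBOW
skipgram = Word2Vec(sentences, vector_size=100, window=5, sg=1, min_count=1)   # skip-gram

print(cbow.wv["king"][:5])                      # first dimensions of a trained vector
print(skipgram.wv.most_similar("king", topn=2))
```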

Page 13

Word vectors – training

• Stochastic gradient descent + backpropagation: after every prediction, the weights are slightly changed to minimize error

• Efficient solutions to the very large softmax – its size equals the vocabulary size, which can easily be on the order of millions (too many outputs to evaluate):

• Hierarchical softmax: class-based approach

• Negative sampling: binary logistic loss against a few sampled negative words instead of the full softmax
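A minimal numpy sketch of one negative-sampling update for a skip-gram (word, context) pair, to make the idea concrete; the sizes, learning rate and sampled indices are illustrative, and details such as the unigram noise distribution are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
V, dim, lr, k = 10_000, 300, 0.025, 5          # vocabulary, dimensionality, learning rate, negatives
W_in = rng.normal(scale=0.01, size=(V, dim))
W_out = rng.normal(scale=0.01, size=(V, dim))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

center, context = 42, 1337                     # current word and one word from its window
negatives = [int(n) for n in rng.integers(0, V, size=k)]   # k sampled "noise" words

v = W_in[center].copy()
grad_v = np.zeros(dim)
for idx, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
    g = lr * (label - sigmoid(W_out[idx] @ v))  # gradient of the binary logistic loss
    grad_v += g * W_out[idx]                    # accumulate gradient for the input vector
    W_out[idx] += g * v                         # update the output (context) vector
W_in[center] += grad_v                          # only k+1 output vectors touched, not all V
```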

Page 14

Word vectors – comparison of performance (2013)


• Google 20K questions dataset (word-based, both syntax and semantics)

• Almost all models are trained on different datasets

• More data, bigger models => better results

Page 15

Word vectors – real examples of analogies

Page 16

Beyond word2vec:

• We can further improve the accuracy of word vectors by adding subword information

• Can be achieved by using character n-grams as additional inputs

• Can form representations for out-of-vocabulary words
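A small sketch of the character n-gram idea; the < and > boundary markers and the 3 – 6 n-gram range follow the fastText paper, while the hashing/bucketing used in the real implementation is omitted:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word, with < and > marking the word boundaries."""
    w = f"<{word}>"
    grams = [w[i:i + n] for n in range(n_min, n_max + 1) for i in range(len(w) - n + 1)]
    return grams + [w]            # the full word (with boundaries) is kept as one extra unit

print(char_ngrams("where", 3, 3))
# ['<wh', 'whe', 'her', 'ere', 're>', '<where>']

# A word vector is the sum of the vectors of its character n-grams, so even an
# out-of-vocabulary word still gets a representation from the n-grams it shares
# with known words.
```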

Page 17

Beyond word2vec:

• Subword information helps especially for low-resource domains and morphologically rich languages

• Enriching Word Vectors with Subword Information (Bojanowski, Grave, Joulin, Mikolov, 2017)

• https://github.com/facebookresearch/fastText


                    word2vec    fastText [+subwords]
Czech analogies     40%         53%  [+13%]
German analogies    56%         59%  [+3%]
English analogies   75%         77%  [+2%]

Page 18

Beyond word2vec:

• Comparison of word2vec & fastText with GloVe (a popular variant of word2vec from the Stanford NLP group):

• Models available at fasttext.cc

• Paper with details will be published later this year


                                     Semantic   Syntactic   Total
GloVe: Wikipedia + Gigaword 300d     78%        67%         72%
Word2vec: Wikipedia + News 300d      91%        84%         87%
fastText: Wikipedia + News 300d      88%        88%         88%

Page 19

Beyond word2vec:

• fastText also implements a supervised mode: like the CBOW architecture, but predicting a label instead of the middle word

• Comparable accuracy to deep learning models (bidirectional LSTMs, CNNs); trains in seconds where deep learning models require days to weeks and expensive GPUs

• Bag of tricks for efficient text classification (Joulin, Grave, Bojanowski, Mikolov, 2017)

• Can use pre-trained features to build classifiers when only a limited number of training examples is available
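A hedged sketch of the supervised mode using the fastText Python bindings; the file names are placeholders, and each line of the training file starts with a __label__ prefix followed by the text:

```python
import fasttext

# train.txt (placeholder), one example per line, e.g.:
#   __label__positive I loved this movie
#   __label__negative waste of time
model = fasttext.train_supervised(input="train.txt", epoch=5, lr=0.5, wordNgrams=2)

print(model.predict("this was surprisingly good"))   # (('__label__positive',), array([...]))
print(model.test("test.txt"))                        # (N, precision@1, recall@1)
```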

Page 20

Distributed word representations: summary

• Simple, scalable models are the most useful: the best accuracy is achieved by the biggest models trained on the largest datasets

• Unsupervised pre-training can help to train more robust classifiers, or to provide more insight into the data

• The fastText library implements both unsupervised learning and supervised text classification, and achieves state-of-the-art performance on many important tasks

Page 21

Recurrent Networks

• Rebirth of recurrent networks and how it happened

• Language modeling

• Speech Recognition

• Machine Translation

Page 22

Simple RNN Architecture

• Input layer, hidden layer with recurrent connections, and the output layer

• In theory, the hidden layer can learn to represent unlimited memory

• Also called the Elman network (Finding structure in time, Elman, 1990)
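A minimal numpy sketch of one step of this simple (Elman) recurrent network; layer sizes and random weights are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 50, 100, 50                   # input, hidden and output layer sizes
W_xh = rng.normal(scale=0.1, size=(n_hid, n_in))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(n_hid, n_hid))  # the recurrent connections
W_hy = rng.normal(scale=0.1, size=(n_out, n_hid))  # hidden -> output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def step(x, h_prev):
    """One time step: the hidden state carries information about the whole history."""
    h = np.tanh(W_xh @ x + W_hh @ h_prev)
    y = softmax(W_hy @ h)
    return h, y

h = np.zeros(n_hid)
for x in rng.normal(size=(10, n_in)):              # a toy input sequence of 10 vectors
    h, y = step(x, h)
```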

Page 23

Brief History of Recurrent Nets – 1990s - 2010

• After the initial excitement, recurrent nets vanished from mainstream research

• Despite being theoretically powerful models, RNNs were mostly considered too unstable to train (Bengio et al., 1994)

• The LSTM RNN (a more complex architecture) had been developed but was not widely used: a complex model, difficult to understand and implement, and not competitive with simpler techniques at the time

Page 24

Brief History of Recurrent Nets – 2010 - today

• In 2010 - 2012, it was shown that RNNs can significantly improve the state of the art in:

• language modeling

• machine translation

• data compression

• speech recognition

• RNNLM toolkit was published (used at Microsoft Research, Google, IBM, Facebook, Yandex, …)

• The key novel trick in RNNLM was trivial: clipping gradients to prevent training instability
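The clipping trick itself is tiny; a sketch of clipping by value and by norm in numpy (the thresholds are illustrative, and the original RNNLM reportedly truncated individual gradient components):

```python
import numpy as np

def clip_by_value(grad, limit=15.0):
    """Element-wise clipping: any component beyond the limit is truncated."""
    return np.clip(grad, -limit, limit)

def clip_by_norm(grad, max_norm=5.0):
    """Rescale the whole gradient if its norm exceeds the threshold."""
    norm = np.linalg.norm(grad)
    return grad * (max_norm / norm) if norm > max_norm else grad

grad = np.array([0.3, -42.0, 7.1])    # a gradient that "exploded" in one component
print(clip_by_value(grad))            # [  0.3 -15.    7.1]
print(clip_by_norm(grad))
```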

Page 25

Brief History of RNNLMs – 2010 - today

• 21% - 24% reduction in WER on the standard Wall Street Journal ASR setup

• The biggest models are the best

Page 26

Brief History of RNNLMs – 2010 - today

• Improvement from RNNLM over n-gram models increases with more data!

Page 27

Brief History of RNNLMs – 2010 - today

• Breakthrough result in 2011: 11% WER reduction over a large system from IBM (NIST RT04)

• Ensemble of big RNNLM models trained on a lot of data

Page 28

Brief History of RNNLMs – 2010 - today

• Up to 3 BLEU points of improvement in machine translation by rescoring n-best lists with RNNs was achieved as early as 2010

• 6% better compression ratio than the best PAQ algorithm in 2011

• Chomsky’s famous example of the inadequacy of the probabilistic view of language (“colorless green ideas sleep furiously”) was broken

Page 29

Brief History of RNNLMs – 2010 - today

• RNNs became much more accessible through open-source toolkits:

• Theano

• Torch

• TensorFlow

• …

• Training on GPUs allowed further scaling up (billions of words, thousands of hidden neurons)

Page 30

Recurrent Nets Today

• Widely applied:

• ASR (both acoustic and language models)

• MT (language & translation & alignment models, joint models, end-to-end)

• Many NLP applications

• Video modeling, handwriting recognition, user intent prediction, …

• There is a trend towards more end-to-end systems that replace the previous technology instead of complementing it

Page 31

Recurrent Nets Today

• Downside: for many problems RNNs are too powerful; models are becoming unnecessarily complex

• Trained models are hard to analyze, and spectacular (and incorrect) claims are overused: “It’s like the brain™!”

• Complex RNN architectures are popular (LSTM, GRU), though there are simpler tricks to add longer short-term memory to RNNs (Learning longer memory in recurrent neural networks, 2014)
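One such trick from the cited paper is to add “slow” context units with a fixed decay next to the standard hidden layer. A rough numpy sketch under that reading; all sizes, weights and the decay value are illustrative, and the exact formulation in the paper differs in details:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_ctx = 50, 100, 40
alpha = 0.95                                     # fixed slow decay of the context units
B = rng.normal(scale=0.1, size=(n_ctx, n_in))    # input -> context
A = rng.normal(scale=0.1, size=(n_hid, n_in))    # input -> hidden
R = rng.normal(scale=0.1, size=(n_hid, n_hid))   # recurrent hidden weights
P = rng.normal(scale=0.1, size=(n_hid, n_ctx))   # context -> hidden

def step(x, h_prev, s_prev):
    s = (1 - alpha) * (B @ x) + alpha * s_prev   # slowly changing context: longer memory
    h = np.tanh(P @ s + A @ x + R @ h_prev)      # standard fast hidden state
    return h, s

h, s = np.zeros(n_hid), np.zeros(n_ctx)
for x in rng.normal(size=(20, n_in)):            # toy input sequence
    h, s = step(x, h, s)
```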

Page 32

Recurrent Nets Summary

• RNNs can cluster similar sequences and thus generalize better than many older techniques

• N-grams were state of the art in NLP for over 30 years and were finally beaten on a number of important tasks by recurrent networks

• The reasons for the paradigm shift:

• Solved instability of training by clipping gradients

• Open-source tools, public datasets, easy reproducibility

• State-of-the-art results on benchmarks from academia and industry, big improvements
