Page 1

Neural Networks for Natural Language Processing

Tomas Mikolov, Facebook
Talk at AI Summit, Vienna, 2017

Page 2

Introduction

• Neural networks have recently been applied to many important tasks that involve natural language:

• Web Search

• Advertisement

• Spam filters

• Machine translation

• Among the most frequently used concepts are:

• Distributed representations: word2vec, unsupervised pre-training

• Weight sharing: recurrent and convolutional networks

Page 3

Distributed representations of words

• Vector representation of words computed using neural networks

• Linguistic regularities in the word vector space

• Word2vec, fastText

Page 4

Word vectors

• Each word is associated with a real-valued vector in N-dimensional space (usually N = 50 – 1000)

• The word vectors have similar properties to word classes (similar words have similar vector representations)

• Often computed using various types of neural networks
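As a toy illustration of what "similar words have similar vectors" means, here is a minimal numpy sketch; the tiny vocabulary and hand-made 4-dimensional vectors are invented for illustration only (real models use 50 – 1000 dimensions learned from data):

```python
import numpy as np

# Made-up toy vectors; trained models would assign similar vectors to similar words.
vectors = {
    "cat":   np.array([0.9, 0.1, 0.0, 0.3]),
    "dog":   np.array([0.8, 0.2, 0.1, 0.3]),
    "paris": np.array([0.0, 0.9, 0.8, 0.1]),
}

def cosine(u, v):
    """Cosine similarity, the usual way of comparing word vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(vectors["cat"], vectors["dog"]))    # high: similar words
print(cosine(vectors["cat"], vectors["paris"]))  # low: unrelated words
```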

Page 5

Word vectors

• These word vectors can be subsequently used as features in many NLP tasks (Collobert et al, 2011)

• As word vectors can be trained on huge text datasets, they provide generalization for systems trained with a limited amount of supervised data

Page 6

Word vectors

• Many neural architectures were proposed for training the word vectors, usually using several hidden layers

• We need some way to compare word vectors trained using different architectures

Page 7

Linguistic regularities in vector space

• We can do a nearest neighbor search around the result of the vector operation “king – man + woman” and obtain “queen”

Linguistic regularities in continuous space word representations (Mikolov et al, 2013)
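A minimal sketch of this analogy query using the gensim library (gensim is not mentioned in the talk, and the vector file name is a placeholder for any pre-trained model in word2vec format):

```python
from gensim.models import KeyedVectors

# Load pre-trained vectors in word2vec format; "vectors.bin" is a placeholder path.
kv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

# Nearest neighbor of the vector king - man + woman.
print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# With good vectors, the top result is expected to be ('queen', ...).
```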

Page 8

Trained word vectors – visualization using PCA
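The slide shows a 2-D PCA projection of trained word vectors (the figure itself is not reproduced in this transcript). A rough sketch of how such a plot can be produced with scikit-learn and matplotlib; the random placeholder vectors below stand in for vectors loaded from a trained model:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Placeholder vectors; in practice these would come from a trained model.
rng = np.random.default_rng(0)
words = ["king", "queen", "man", "woman", "paris", "france"]
vectors = {w: rng.normal(size=100) for w in words}

X = np.stack([vectors[w] for w in words])
xy = PCA(n_components=2).fit_transform(X)   # project to 2 dimensions for plotting

plt.scatter(xy[:, 0], xy[:, 1])
for (x, y), w in zip(xy, words):
    plt.annotate(w, (x, y))
plt.show()
```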

Page 9

Word vectors – datasets for evaluation

Word-based dataset, almost 20K questions, focuses on both syntax and semantics:

• Athens:Greece Oslo: ___

• Angola:kwanza Iran: ___

• brother:sister grandson: ___

• possibly:impossibly ethical: ___

• walking:walked swimming: ___

Efficient estimation of word representations in vector space (Mikolov et al, 2013)
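A hedged sketch of how accuracy on such a question set can be measured with the gensim analogy query shown above; the file names are placeholders, and the questions file is assumed to contain one "a b c d" analogy per line:

```python
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)   # placeholder path

correct = total = 0
with open("questions.txt") as f:             # placeholder: one "a b c d" analogy per line
    for line in f:
        a, b, c, d = line.split()
        if not all(w in kv for w in (a, b, c, d)):
            continue                         # skip questions with out-of-vocabulary words
        predicted = kv.most_similar(positive=[b, c], negative=[a], topn=1)[0][0]
        total += 1
        correct += predicted == d
print(f"accuracy: {correct / total:.1%} on {total} questions")
```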

Page 10

Word vectors – Bengio architecture

• Neural net based word vectors were traditionally trained as part of neural network language models (Bengio et al, 2003)

• Training on <1M words took days

Page 11

Word vectors – word2vec architectures

• Continuous Bag-of-Words model (CBOW): adds inputs from words within a short window to predict the current word

• No hidden layer, no matrix multiplications

• The weights for different positions are shared

• Usually thousands of times faster to train than Bengio’s architecture
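A minimal numpy sketch of the CBOW forward pass described above; the vocabulary size, dimensionality, random weights and word indices are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
V, dim = 10_000, 300                           # vocabulary size, vector dimensionality
W_in = rng.normal(scale=0.01, size=(V, dim))   # input vectors, shared across window positions
W_out = rng.normal(scale=0.01, size=(V, dim))  # output vectors

context_ids = [17, 42, 256, 1024]              # indices of words within the short window
h = W_in[context_ids].mean(axis=0)             # "add" (average) the context vectors: no hidden layer

scores = W_out @ h                             # one score per vocabulary word
probs = np.exp(scores - scores.max())
probs /= probs.sum()                           # softmax over the full vocabulary (the expensive part)
print(probs.argmax())                          # index predicted for the current (middle) word
```

The softmax over the whole vocabulary is exactly what the hierarchical softmax and negative sampling tricks on the training slide avoid.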

Page 12

Word vectors – word2vec architectures

• Skip-gram model: predicts surrounding words using the current word

• Performance similar to the CBOW model
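For readers who want to try both architectures, a hedged sketch using the gensim library (not part of the talk); sg=0 selects CBOW, sg=1 selects skip-gram, and the two-sentence corpus is only a placeholder for a large text collection:

```python
from gensim.models import Word2Vec

# Placeholder corpus: an iterable of tokenized sentences. Real training data
# should be much larger (millions to billions of words).
sentences = [["the", "king", "rules", "the", "country"],
             ["the", "queen", "rules", "the", "country"]]

# vector_size is called "size" in older gensim versions.
cbow = Word2Vec(sentences, vector_size=100, window=5, sg=0, min_count=1)       # CBOW
skipgram = Word2Vec(sentences, vector_size=100, window=5, sg=1, min_count=1)   # skip-gram

print(cbow.wv["king"][:5])                      # first dimensions of a trained vector
print(skipgram.wv.most_similar("king", topn=2))
```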

Page 13

Word vectors – training

• Stochastic gradient descent + backpropagation: after every prediction, the weights are slightly changed to minimize error

• Efficient solutions to the very large softmax – its size equals the vocabulary size, which can easily be on the order of millions (too many outputs to evaluate):

• Hierarchical softmax: class-based approach

• Negative sampling: binary logistic loss against a few sampled negative words instead of the full softmax
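A minimal numpy sketch of one negative-sampling update for a skip-gram (word, context) pair, to make the idea concrete; the sizes, learning rate and sampled indices are illustrative, and details such as the unigram noise distribution are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
V, dim, lr, k = 10_000, 300, 0.025, 5          # vocabulary, dimensionality, learning rate, negatives
W_in = rng.normal(scale=0.01, size=(V, dim))
W_out = rng.normal(scale=0.01, size=(V, dim))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

center, context = 42, 1337                     # current word and one word from its window
negatives = [int(n) for n in rng.integers(0, V, size=k)]   # k sampled "noise" words

v = W_in[center].copy()
grad_v = np.zeros(dim)
for idx, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
    g = lr * (label - sigmoid(W_out[idx] @ v))  # gradient of the binary logistic loss
    grad_v += g * W_out[idx]                    # accumulate gradient for the input vector
    W_out[idx] += g * v                         # update the output (context) vector
W_in[center] += grad_v                          # only k+1 output vectors touched, not all V
```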

Page 14

Word vectors – comparison of performance (2013)


• Google 20K questions dataset (word-based, both syntax and semantics)

• Almost all models are trained on different datasets

• More data, bigger models => better results

Page 15

Word vectors – real examples of analogies

Page 16

Beyond word2vec:

• We can further improve the accuracy of word vectors by adding subword information

• Can be achieved by using character n-grams as additional inputs

• Can form representations for out-of-vocabulary words
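A small sketch of the character n-gram idea; the < and > boundary markers and the 3 – 6 n-gram range follow the fastText paper, while the hashing/bucketing used in the real implementation is omitted:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word, with < and > marking the word boundaries."""
    w = f"<{word}>"
    grams = [w[i:i + n] for n in range(n_min, n_max + 1) for i in range(len(w) - n + 1)]
    return grams + [w]            # the full word (with boundaries) is kept as one extra unit

print(char_ngrams("where", 3, 3))
# ['<wh', 'whe', 'her', 'ere', 're>', '<where>']

# A word vector is the sum of the vectors of its character n-grams, so even an
# out-of-vocabulary word still gets a representation from the n-grams it shares
# with known words.
```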

Page 17

Beyond word2vec:

• Subword information helps especially for low-resource domains and morphologically rich languages

• Enriching Word Vectors with Subword Information (Bojanowski, Grave, Joulin, Mikolov, 2017)

• https://github.com/facebookresearch/fastText


                    word2vec    fastText [+subwords]
Czech analogies     40%         53%  [+13%]
German analogies    56%         59%  [+3%]
English analogies   75%         77%  [+2%]

Page 18

Beyond word2vec:

• Comparison of word2vec & fastText with GloVe (a popular variant of word2vec from the Stanford NLP group):

• Models available at fasttext.cc

• Paper with details will be published later this year


                                     Semantic   Syntactic   Total
GloVe: Wikipedia + Gigaword 300d     78%        67%         72%
Word2vec: Wikipedia + News 300d      91%        84%         87%
fastText: Wikipedia + News 300d      88%        88%         88%

Page 19

Beyond word2vec:

• fastText also implements a supervised mode: like the CBOW architecture, but predicting a label instead of the middle word

• Comparable accuracy to deep learning models (bidirectional LSTMs, CNNs); trains in seconds where deep learning models require days to weeks and expensive GPUs

• Bag of tricks for efficient text classification (Joulin, Grave, Bojanowski, Mikolov, 2017)

• Can use pre-trained features to build classifiers when only a limited number of training examples is available
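A hedged sketch of the supervised mode using the fastText Python bindings; the file names are placeholders, and each line of the training file starts with a __label__ prefix followed by the text:

```python
import fasttext

# train.txt (placeholder), one example per line, e.g.:
#   __label__positive I loved this movie
#   __label__negative waste of time
model = fasttext.train_supervised(input="train.txt", epoch=5, lr=0.5, wordNgrams=2)

print(model.predict("this was surprisingly good"))   # (('__label__positive',), array([...]))
print(model.test("test.txt"))                        # (N, precision@1, recall@1)
```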

Page 20

Distributed word representations: summary

• Simple, scalable models are the most useful: the best accuracy is achieved by the biggest models trained on the largest datasets

• Unsupervised pre-training can help to train more robust classifiers, or to provide more insight into the data

• The fastText library implements both unsupervised learning and supervised text classification, and achieves state-of-the-art performance on many important tasks

Page 21

Recurrent Networks

• Rebirth of recurrent networks and how it happened

• Language modeling

• Speech Recognition

• Machine Translation

Page 22

Simple RNN Architecture

• Input layer, hidden layer with recurrent connections, and the output layer

• In theory, the hidden layer can learn to represent unlimited memory

• Also called the Elman network (Finding structure in time, Elman, 1990)
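A minimal numpy sketch of one step of this simple (Elman) recurrent network; layer sizes and random weights are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 50, 100, 50                   # input, hidden and output layer sizes
W_xh = rng.normal(scale=0.1, size=(n_hid, n_in))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(n_hid, n_hid))  # the recurrent connections
W_hy = rng.normal(scale=0.1, size=(n_out, n_hid))  # hidden -> output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def step(x, h_prev):
    """One time step: the hidden state carries information about the whole history."""
    h = np.tanh(W_xh @ x + W_hh @ h_prev)
    y = softmax(W_hy @ h)
    return h, y

h = np.zeros(n_hid)
for x in rng.normal(size=(10, n_in)):              # a toy input sequence of 10 vectors
    h, y = step(x, h)
```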

Page 23

Brief History of Recurrent Nets – 1990s - 2010

• After the initial excitement, recurrent nets vanished from mainstream research

• Despite being theoretically powerful models, RNNs were mostly considered too unstable to train (Bengio et al., 1994)

• The LSTM RNN (a more complex architecture) had been developed but was not widely used: a complex model, difficult to understand and implement, and not competitive with simpler techniques at the time

Page 24

Brief History of Recurrent Nets – 2010 - today

• In 2010 - 2012, it was shown that RNNs can significantly improve the state of the art in:

• language modeling

• machine translation

• data compression

• speech recognition

• RNNLM toolkit was published (used at Microsoft Research, Google, IBM, Facebook, Yandex, …)

• The key novel trick in RNNLM was trivial: clipping gradients to prevent training instability
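The clipping trick itself is tiny; a sketch of clipping by value and by norm in numpy (the thresholds are illustrative, and the original RNNLM reportedly truncated individual gradient components):

```python
import numpy as np

def clip_by_value(grad, limit=15.0):
    """Element-wise clipping: any component beyond the limit is truncated."""
    return np.clip(grad, -limit, limit)

def clip_by_norm(grad, max_norm=5.0):
    """Rescale the whole gradient if its norm exceeds the threshold."""
    norm = np.linalg.norm(grad)
    return grad * (max_norm / norm) if norm > max_norm else grad

grad = np.array([0.3, -42.0, 7.1])    # a gradient that "exploded" in one component
print(clip_by_value(grad))            # [  0.3 -15.    7.1]
print(clip_by_norm(grad))
```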

Page 25

Brief History of RNNLMs – 2010 - today

• 21% - 24% reduction in WER on the standard Wall Street Journal ASR setup

• The biggest models are the best

Page 26

Brief History of RNNLMs – 2010 - today

• Improvement from RNNLM over n-gram models increases with more data!

Page 27

Brief History of RNNLMs – 2010 - today

• Breakthrough result in 2011: 11% WER reduction over a large system from IBM (NIST RT04)

• Ensemble of big RNNLM models trained on a lot of data

Page 28

Brief History of RNNLMs – 2010 - today

• Up to 3 BLEU points of improvement in machine translation by rescoring n-best lists with RNNs was achieved as early as 2010

• 6% better compression ratio than the best PAQ algorithm in 2011

• Chomsky’s famous example of the inadequacy of the probabilistic view of language (“colorless green ideas sleep furiously”) was broken

Page 29

Brief History of RNNLMs – 2010 - today

• RNNs became much more accessible through open-source toolkits:

• Theano

• Torch

• TensorFlow

• …

• Training on GPUs allowed further scaling up (billions of words, thousands of hidden neurons)

Page 30

Recurrent Nets Today

• Widely applied:

• ASR (both acoustic and language models)

• MT (language & translation & alignment models, joint models, end-to-end)

• Many NLP applications

• Video modeling, handwriting recognition, user intent prediction, …

• There is a trend towards more end-to-end systems that replace the previous technology instead of complementing it

Page 31

Recurrent Nets Today

• Downside: for many problems RNNs are too powerful; models are becoming unnecessarily complex

• Trained models are hard to analyze, and spectacular (and incorrect) claims are overused: “It’s like the brain™!”

• Complex RNN architectures are popular (LSTM, GRU), though there are simpler tricks to add longer short-term memory to RNNs (Learning longer memory in recurrent neural networks, 2014)
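One such trick from the cited paper is to add “slow” context units with a fixed decay next to the standard hidden layer. A rough numpy sketch under that reading; all sizes, weights and the decay value are illustrative, and the exact formulation in the paper differs in details:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_ctx = 50, 100, 40
alpha = 0.95                                     # fixed slow decay of the context units
B = rng.normal(scale=0.1, size=(n_ctx, n_in))    # input -> context
A = rng.normal(scale=0.1, size=(n_hid, n_in))    # input -> hidden
R = rng.normal(scale=0.1, size=(n_hid, n_hid))   # recurrent hidden weights
P = rng.normal(scale=0.1, size=(n_hid, n_ctx))   # context -> hidden

def step(x, h_prev, s_prev):
    s = (1 - alpha) * (B @ x) + alpha * s_prev   # slowly changing context: longer memory
    h = np.tanh(P @ s + A @ x + R @ h_prev)      # standard fast hidden state
    return h, s

h, s = np.zeros(n_hid), np.zeros(n_ctx)
for x in rng.normal(size=(20, n_in)):            # toy input sequence
    h, s = step(x, h, s)
```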

Page 32

Recurrent Nets Summary

• RNNs can cluster similar sequences and thus generalize better than many older techniques

• N-grams were state of the art in NLP for over 30 years and were finally beaten on a number of important tasks by recurrent networks

• The reasons for the paradigm shift:

• Solved instability of training by clipping gradients

• Open-source tools, public datasets, easy reproducibility

• State-of-the-art results on benchmarks from academia and industry, big improvements
