
Recurrent Networks and Beyond

Tomas Mikolov, Facebook
Neu-IR Workshop, Pisa, Italy, 2016


Goals of this talk

Explain the recent success of recurrent networks

Better understand the concept of (longer) short-term memory

Explore limitations of recurrent networks

Discuss what needs to be done to build machines that can understand language

Tomas Mikolov, Facebook, 2016


Brief History of Recurrent Nets 80s & 90s

Recurrent network architectures were very popular in the 80s and early 90s (Elman, Jordan, Mozer, Hopfield, the Parallel Distributed Processing group, ...)

The main idea is very attractive: to re-use parameters and computation (usually over time)

Simple RNN Architecture

Input layer, hidden layer with recurrent connections, and the output layer

In theory, the hidden layer can learn to represent unlimited memory

Also called the Elman network (Finding Structure in Time, Elman 1990)

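The Elman architecture can be sketched in a few lines of NumPy. The sizes, random initialization, and the sigmoid/softmax choices below are illustrative assumptions, not details taken from the talk:

```python
import numpy as np

def elman_step(x, h_prev, W_xh, W_hh, W_hy):
    """One forward step of a simple (Elman) recurrent network:
    the hidden state is computed from the current input and the
    previous hidden state, then projected to an output distribution."""
    h = 1.0 / (1.0 + np.exp(-(W_xh @ x + W_hh @ h_prev)))  # sigmoid hidden layer
    z = W_hy @ h
    y = np.exp(z - z.max())
    y /= y.sum()                                           # softmax output layer
    return h, y

# Toy sizes (illustrative): 5-symbol vocabulary, 8 hidden units
rng = np.random.default_rng(0)
V, H = 5, 8
W_xh = rng.normal(0.0, 0.1, (H, V))
W_hh = rng.normal(0.0, 0.1, (H, H))
W_hy = rng.normal(0.0, 0.1, (V, H))

x = np.zeros(V)
x[2] = 1.0               # one-hot encoding of the current input symbol
h = np.zeros(H)          # initial hidden state
h, y = elman_step(x, h, W_xh, W_hh, W_hy)
```

Unrolling `elman_step` over a sequence and backpropagating through the unrolled graph gives the backpropagation-through-time training procedure discussed later in the talk.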


Brief History of Recurrent Nets 90s - 2010

After the initial excitement, recurrent nets vanished from mainstream research

Despite being theoretically powerful models, RNNs were mostly considered too unstable to train

Some success was achieved at IDSIA with the Long Short-Term Memory (LSTM) RNN architecture, but this model was too complex for others to reproduce easily


Brief History of Recurrent Nets 2010 - today

In 2010, it was shown that RNNs can significantly improve the state of the art in language modeling, machine translation, data compression and speech recognition (including a strong commercial speech recognizer from IBM)

The RNNLM toolkit was published to allow researchers to reproduce the results and extend the techniques

The key novel trick in RNNLM was trivial: clip the gradients to prevent training instability


Brief History of Recurrent Nets 2010 - today

21% - 24% reduction of WER on the Wall Street Journal setup

Brief History of Recurrent Nets 2010 - today

Improvement from RNNLM over n-gram increases with more data!

Brief History of Recurrent Nets 2010 - today

Breakthrough result in 2011: 11% WER reduction over a large system from IBM

Ensemble of big RNNLM models trained on a lot of data

Brief History of Recurrent Nets 2010 - today

RNNs became much more accessible through open-source implementations in general ML toolkits:
Theano
Torch
PyBrain
TensorFlow


Recurrent Nets Today

Widely applied:
ASR (both acoustic and language models)
MT (language & translation & alignment models, joint models)
Many NLP applications
Video modeling, handwriting recognition, user intent prediction, ...

Downside: for many problems RNNs are too powerful, models are becoming unnecessarily complex

Often, complicated RNN architectures are preferred for the wrong reasons (easier to get a paper published and attract attention)

Longer short-term memory in simple RNNs

How to add longer memory to RNNs without unnecessary complexity

Paper: Learning Longer Memory in Recurrent Neural Networks (Mikolov, Joulin, Chopra, Mathieu, Ranzato, ICLR Workshop 2015)


Recurrent Network: Elman Architecture


Simple Recurrent Net Problems

The backpropagation through time algorithm + stochastic gradient descent is commonly used for training (Rumelhart et al, 1985)

Gradients can either vanish or explode (Hochreiter, 1991; Bengio et al., 1994)


Simple Recurrent Net: Exploding Gradients

The gradients explode rarely, but this can have disastrous effects

A simple hack is to clip the gradients to stay within some range

This prevents exponential growth (which would otherwise lead to a giant step in the weight update)

One can also normalize the gradients, or discard the weight updates that are too big

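Both tricks mentioned above fit in a few lines. The threshold values here are arbitrary illustrative assumptions, not the ones used in RNNLM:

```python
import numpy as np

def clip_elementwise(grad, threshold=1.0):
    """Clip each gradient component into [-threshold, threshold], so a rare
    exploded gradient cannot cause a giant weight update."""
    return np.clip(grad, -threshold, threshold)

def clip_by_norm(grad, max_norm=5.0):
    """Alternative: rescale the whole gradient when its norm is too large,
    which preserves the gradient's direction."""
    norm = np.linalg.norm(grad)
    return grad * (max_norm / norm) if norm > max_norm else grad

g = np.array([0.5, -30.0, 2.0])   # a gradient with one exploded component
g_clip = clip_elementwise(g)      # every component now lies in [-1, 1]
g_norm = clip_by_norm(g)          # same direction as g, norm reduced to 5.0
```

Discarding updates whose gradient norm exceeds some bound, as mentioned above, would simply mean skipping the weight update whenever `np.linalg.norm(grad)` is too large.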

Simple Recurrent Net: Vanishing Gradients

Most of the time, the gradients quickly vanish (after 5-10 steps of backpropagation through time)

This may not be a problem of SGD, but of the architecture of the SRN


Simple Recurrent Net: Vanishing Gradients

What recurrent architecture would be easier to train to capture longer-term patterns?

Instead of a fully connected recurrent matrix, we can use an architecture where each neuron is connected only to the input and to itself

This is an old idea (Jordan, 1987; Mozer, 1989)

Combination of both ideas: Elman + Mozer

Part of the hidden layer is fully connected, part is diagonal (self-connections)

Can be seen as an RNN with two hidden layers

Or as an RNN with a partially diagonal recurrent matrix (+ linear hidden units)

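A minimal sketch of a forward step of this combined architecture: one fully connected (fast) block and one diagonal, self-connected (slow) block. The matrix names, toy sizes, and decay constant below are illustrative assumptions rather than the exact formulation in the paper:

```python
import numpy as np

def scrn_step(x, h_prev, s_prev, A, R, P, B, alpha=0.95):
    """One forward step of an Elman + Mozer combination (sketch).

    Slow (diagonal, self-connected, linear) context units:
        s_t = (1 - alpha) * B x_t + alpha * s_{t-1}
    Fast (fully connected) Elman-style units, which also see the slow state:
        h_t = sigmoid(A x_t + R h_{t-1} + P s_t)
    """
    s = (1.0 - alpha) * (B @ x) + alpha * s_prev
    h = 1.0 / (1.0 + np.exp(-(A @ x + R @ h_prev + P @ s)))
    return h, s

# Toy sizes (illustrative): 5 inputs, 8 fast units, 3 slow units
rng = np.random.default_rng(1)
n_in, n_fast, n_slow = 5, 8, 3
A = rng.normal(0.0, 0.1, (n_fast, n_in))
R = rng.normal(0.0, 0.1, (n_fast, n_fast))
P = rng.normal(0.0, 0.1, (n_fast, n_slow))
B = rng.normal(0.0, 0.1, (n_slow, n_in))

x = np.zeros(n_in)
x[0] = 1.0                         # one-hot input
h, s = scrn_step(x, np.zeros(n_fast), np.zeros(n_slow), A, R, P, B)
```

Because `alpha` is close to 1, the slow state changes only slightly at each step, which is what lets it retain a longer history than the fast units.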


Structurally Constrained Recurrent Net

Because we constrain the architecture of the SRN, we further denote the model as Structurally Constrained Recurrent Net (SCRN)

An alternative name is slow recurrent nets, as the state of the diagonal layer changes slowly

Q: Wouldn't it be enough to initialize the recurrent matrix to be diagonal?
A: No. This would degrade back to a normal RNN and not learn longer memory.


Results

Language modeling experiments: Penn Treebank, Text8

Longer memory in language models is commonly called cache / topic

Comparison to Long Short-Term Memory RNNs (a currently popular but quite complicated architecture that can learn longer-term patterns)

Datasets & code: (link is in the paper)


Results: Penn Treebank language modeling

Gain from SCRN / LSTM over the simpler recurrent net is similar to the gain from cache
LSTM has 3 gates for each hidden unit, and thus 4x more parameters need to be accessed during training for the given hidden layer size (=> slower to train)
SCRN with 100 fully connected and 40 self-connected neurons is only slightly more expensive to train than SRN

MODEL            # HIDDEN UNITS         PERPLEXITY
N-gram           -                      141
N-gram + cache   -                      125
SRN              100                    129
LSTM             100 (x4 parameters)    115
SCRN             100 + 40               115

Results: Text8

Text8: Wikipedia text (~17M words), much stronger effect from cache
Big gain for both SCRN & LSTM over SRN
For small models, SCRN seems to be superior (simpler architecture, better accuracy, faster training, fewer parameters)

MODEL            # HIDDEN UNITS         PERPLEXITY
N-gram           -                      309
N-gram + cache   -                      229
SRN              100                    245
LSTM             100 (x4 parameters)    193
SCRN             100 + 80               184

Results: Text8

With 500 hidden units, LSTM is slightly better in perplexity (3%) than SCRN, but it also has many more parameters

MODEL            # HIDDEN UNITS         PERPLEXITY
N-gram           -                      309
N-gram + cache   -                      229
SRN              500                    184
LSTM             500 (x4 parameters)    156
SCRN             500 + 80               161

Discussion of Results

SCRN accumulates longer history in the slow hidden layer: the same as an exponentially decaying cache model

Empirically, LSTM performance correlates strongly with cache (weighted bag-of-words)

For very large (~infinite) training sets, SCRN seems to be the preferable architecture: it is computationally very cheap
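The "exponentially decaying cache" view of the slow layer can be made concrete with a tiny sketch; the decay constant and the unigram (bag-of-words) representation here are illustrative assumptions:

```python
import numpy as np

def decaying_cache(word_ids, vocab_size, decay=0.95):
    """Exponentially decaying bag-of-words: every word's weight shrinks by
    `decay` at each step, so recent words dominate the representation,
    much like the slow hidden layer of SCRN accumulates history."""
    cache = np.zeros(vocab_size)
    for w in word_ids:
        cache *= decay          # old evidence fades geometrically
        cache[w] += 1.0 - decay # current word is added with fresh weight
    return cache

# Word 1 occurs twice and recently, so it outweighs the older word 0
c = decaying_cache([0, 1, 1, 2], vocab_size=4)
```

This is the same recurrence as the slow state update, which is why the gains from SCRN and from an explicit cache model overlap so strongly.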

Conclusion

Simple tricks can overcome the vanishing and exploding gradient problems

The state of the recurrent layer can represent longer short-term memory, but not true long-term memory (across millions of time steps)

To represent true long-term memory, we may need to develop models with the ability to grow in size (modify their own structure)

Beyond Deep Learning

Going beyond: what can RNNs and deep networks not model efficiently?

Surprisingly simple patterns! For example, memorization of a variable-length sequence of symbols

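To make the memorization task concrete, here is a sketch of a data generator for it. The delimiter convention and zero-padding scheme are my assumptions, not a specification from the talk:

```python
import numpy as np

def make_copy_example(rng, min_len=1, max_len=10, n_symbols=4):
    """One instance of the variable-length memorization (copy) task:
    the model reads a random symbol sequence followed by a delimiter,
    then must reproduce the sequence. Symbols are 0..n_symbols-1,
    the delimiter is n_symbols, and 0 is used as padding in the targets."""
    length = int(rng.integers(min_len, max_len + 1))
    seq = rng.integers(0, n_symbols, size=length)
    delimiter = n_symbols
    # Input: sequence, delimiter, then silence while the model must recall
    inputs = np.concatenate([seq, [delimiter], np.zeros(length, dtype=int)])
    # Target: nothing to predict until the delimiter has been read
    targets = np.concatenate([np.zeros(length + 1, dtype=int), seq])
    return inputs, targets

rng = np.random.default_rng(42)
x_seq, y_seq = make_copy_example(rng)
```

An SRN or LSTM must carry the whole sequence across the delimiter in its fixed-size state, which is exactly where such models struggle as the sequence length grows.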



Beyond Deep Learning: Algorithmic Patterns

Among the myriad of complex tasks that are currently not solvable, which ones should we focus on?

We need to set an ambitious end goal, and define a roadmap for how to achieve it step by step



A Roadmap towards Machine Intelligence

Tomas Mikolov, Armand Joulin and Marco Baroni

Ultimate Goal for Communication-based AI

Can do almost anything:
Machine that helps students to understand homework
Help researchers find relevant information
Write programs
Help scientists with tasks that are currently too demanding (would require hundreds of years of work to solve)


The Roadmap

We describe a minimal set of components we think the intelligent machine will consist of

Then, an approach to construct the machine

And the requirements for the machine to be scalable

Components of Intelligent Machines

Ability to communicate

Motivation component

Learning skills (further requires long-term memory), i.e. the ability to modify itself to adapt to new problems

Components of Framework

To build and develop intelligent machines, we need:

An environment that can teach the machine basic communication skills and learning strategies