Transcript
Page 1

Recurrent Networks and Beyond

Tomas Mikolov, Facebook
Neu-IR Workshop, Pisa, Italy 2016

Page 2

Goals of this talk

• Explain recent success of recurrent networks

• Better understand the concept of (longer) short term memory

• Explore limitations of recurrent networks

• Discuss what needs to be done to build machines that can understand language

Page 3

Brief History of Recurrent Nets – 80’s & 90’s

• Recurrent network architectures were very popular in the 80’s and early 90’s (Elman, Jordan, Mozer, Hopfield, Parallel Distributed Processing group, …)

• The main idea is very attractive: to re-use parameters and computation (usually over time)

Page 4

Simple RNN Architecture

• Input layer, hidden layer with recurrent connections, and the output layer

• In theory, the hidden layer can learn to represent unlimited memory

• Also called the Elman network (Finding structure in time, Elman 1990)

Page 5

Brief History of Recurrent Nets – 90’s - 2010

• After the initial excitement, recurrent nets vanished from mainstream research

• Despite being theoretically powerful models, RNNs were mostly considered too unstable to train

• Some success was achieved at IDSIA with the Long Short Term Memory RNN architecture, but this model was too complex for others to reproduce easily

Page 6

Brief History of Recurrent Nets – 2010 - today

• In 2010, it was shown that RNNs can significantly improve the state of the art in language modeling, machine translation, data compression and speech recognition (including a strong commercial speech recognizer from IBM)

• The RNNLM toolkit was published to allow researchers to reproduce the results and extend the techniques

• The key novel trick in RNNLM was trivial: to clip gradients to prevent instability of training

Page 7

Brief History of Recurrent Nets – 2010 - today

• 21% - 24% reduction of WER on the Wall Street Journal setup

Page 8

Brief History of Recurrent Nets – 2010 - today

• Improvement from RNNLM over n-gram increases with more data!

Page 9

Brief History of Recurrent Nets – 2010 - today

• Breakthrough result in 2011: 11% WER reduction over a large system from IBM
• Ensemble of big RNNLM models trained on a lot of data

Page 10

Brief History of Recurrent Nets – 2010 - today

• RNNs became much more accessible through open-source implementations in general ML toolkits:
  • Theano
  • Torch
  • PyBrain
  • TensorFlow
  • …

Page 11

Recurrent Nets Today

• Widely applied:
  • ASR (both acoustic and language models)
  • MT (language & translation & alignment models, joint models)
  • Many NLP applications
  • Video modeling, handwriting recognition, user intent prediction, …

• Downside: for many problems RNNs are too powerful, models are becoming unnecessarily complex

• Often, complicated RNN architectures are preferred for the wrong reasons (easier to get a paper published and attract attention)

Page 12

Longer short term memory in simple RNNs

• How to add longer memory to RNNs without unnecessary complexity

• Paper: Learning Longer Memory in Recurrent Neural Networks (Mikolov, Joulin, Chopra, Mathieu, Ranzato, ICLR Workshop 2015)

Page 13

Recurrent Network – Elman Architecture

• Also known as Simple Recurrent Network (SRN)

• Input layer x(t), hidden layer h(t), output y(t)

• Weight matrices A (input → hidden), R (recurrent), U (hidden → output)

Page 14

Recurrent Network – Elman Architecture

• Input layer x(t), hidden layer h(t), output y(t)
• Weight matrices A, R, U

h(t) = sigmoid(A x(t) + R h(t-1))
y(t) = f(U h(t)), where f is the softmax function
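
For concreteness, here is a minimal numpy sketch of one forward step of this architecture; the code is not from the slides, and the matrix names A, R, U simply follow the reconstruction above.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def srn_step(x_t, h_prev, A, R, U):
    """One forward step of a simple (Elman) recurrent network.

    x_t    : input vector at time t
    h_prev : hidden state from time t-1
    A, R, U: input-to-hidden, recurrent, and hidden-to-output matrices
    """
    h_t = 1.0 / (1.0 + np.exp(-(A @ x_t + R @ h_prev)))  # sigmoid hidden layer
    y_t = softmax(U @ h_t)                                # distribution over the next word
    return h_t, y_t
```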

Page 15

Simple Recurrent Net Problems

• Backpropagation through time algorithm + stochastic gradient descent is commonly used for training (Rumelhart et al, 1985)

• Gradients can either vanish or explode (Hochreiter 1991; Bengio 1994)
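
A tiny numerical illustration (not from the slides, with made-up sizes) of why the gradients tend to vanish: each step of backpropagation through time multiplies the error signal by the recurrent Jacobian, and when its norm stays below 1 the gradient shrinks exponentially with the number of steps.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 50
R = rng.normal(scale=0.1, size=(hidden, hidden))  # recurrent weight matrix
grad = rng.normal(size=hidden)                    # error signal at the last time step

for step in range(1, 11):
    # one BPTT step: multiply by R^T and by the sigmoid derivative (at most 0.25)
    grad = R.T @ grad * 0.25
    print(step, np.linalg.norm(grad))             # the norm collapses towards 0
```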

Page 16

Simple Recurrent Net: Exploding Gradients

• The gradients explode rarely, but this can have disastrous effects

• Simple “hack” is to clip gradients to stay within some range

• This prevents exponential growth (which would otherwise lead to a giant step in the weight update)

• One can also normalize the gradients, or discard the weight updates that are too big
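
A minimal sketch of the tricks listed above (element-wise clipping and norm rescaling); the threshold values are illustrative, not the ones used in RNNLM.

```python
import numpy as np

def clip_gradient(grad, threshold=5.0):
    """Element-wise clipping: keep every gradient component in [-threshold, threshold]."""
    return np.clip(grad, -threshold, threshold)

def rescale_gradient(grad, max_norm=5.0):
    """Alternative: rescale the whole gradient so that its norm never exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad
```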

Page 17

Simple Recurrent Net: Vanishing Gradients

• Most of the time, the gradients quickly vanish (after 5-10 steps of backpropagation through time)

• This may not be a problem of SGD, but of the architecture of the SRN

Page 18

Simple Recurrent Net: Vanishing Gradients

• What recurrent architecture would be easier to train to capture longer term patterns?

• Instead of a fully connected recurrent matrix, we can use an architecture where each neuron is connected only to the input and to itself

• Old idea (Jordan 1987; Mozer 1989)

Page 19

Combination of both ideas: Elman + Mozer

• Part of the hidden layer is fully connected, part is diagonal (self-connections)

• Can be seen as an RNN with two hidden layers

• Or as an RNN with a partially diagonal recurrent matrix (+ linear hidden units)

Page 20

Combination of both ideas: Elman + Mozer

• The self-connection weight α can be learned, or kept fixed close to 1 (we used 0.95)

• The matrix connecting the slow (self-connected) units to the hidden layer is optional (usually helps a bit)
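
A minimal sketch of one SCRN forward step, assuming the parameterization of the ICLR workshop paper (slow context units s(t) with a fixed self-connection alpha, and matrices B, P, A, R, U, V); the exact variable names are an assumption, not taken from the slides.

```python
import numpy as np

def scrn_step(x_t, h_prev, s_prev, A, R, B, P, U, V, alpha=0.95):
    """One forward step of the Structurally Constrained Recurrent Net.

    s_t : "slow" context units with fixed self-connections (alpha close to 1)
    h_t : standard sigmoid hidden units with fully connected recurrence
    """
    s_t = (1.0 - alpha) * (B @ x_t) + alpha * s_prev               # slowly changing state
    h_t = 1.0 / (1.0 + np.exp(-(P @ s_t + A @ x_t + R @ h_prev)))  # fast hidden state
    y_t = U @ h_t + V @ s_t                                        # pre-softmax output scores
    return h_t, s_t, y_t
```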

Page 21

Structurally Constrained Recurrent Net

• Because we constrain the architecture of the SRN, we further denote the model as the Structurally Constrained Recurrent Net (SCRN)

• An alternative name is “slow recurrent nets”, as the state of the diagonal layer changes slowly

Q: Wouldn’t it be enough to initialize the recurrent matrix to be diagonal?
A: No. This would degrade back to a normal RNN and not learn longer memory.

Page 22

Results

• Language modeling experiments: Penn Treebank, Text8

• Longer memory in language models is commonly called cache / topic

• Comparison to Long Short Term Memory RNNs (a currently popular but quite complicated architecture that can learn longer term patterns)

• Datasets & code: http://github.com/facebook/SCRNNs (link is in the paper)

Page 23

Results: Penn Treebank language modeling

• Gain from SCRN / LSTM over the simpler recurrent net is similar to the gain from the cache
• LSTM has 3 gates for each hidden unit, and thus 4x more parameters need to be accessed during training for a given hidden layer size (=> slower to train)
• SCRN with 100 fully connected and 40 self-connected neurons is only slightly more expensive to train than the SRN

MODEL            # hidden units         Perplexity
N-gram           -                      141
N-gram + cache   -                      125
SRN              100                    129
LSTM             100 (x4 parameters)    115
SCRN             100 + 40               115

Page 24

Results: Text8

• Text8: Wikipedia text (~17M words), much stronger effect from the cache
• Big gain for both SCRN & LSTM over SRN
• For small models, SCRN seems to be superior (simpler architecture, better accuracy, faster training – fewer parameters)

MODEL            # hidden units         Perplexity
N-gram           -                      309
N-gram + cache   -                      229
SRN              100                    245
LSTM             100 (x4 parameters)    193
SCRN             100 + 80               184

Page 25

Results: Text8

• With 500 hidden units, LSTM is slightly better in perplexity (3%) than SCRN, but it also has many more parameters

MODEL            # hidden units         Perplexity
N-gram           -                      309
N-gram + cache   -                      229
SRN              500                    184
LSTM             500 (x4 parameters)    156
SCRN             500 + 80               161

Page 26

Discussion of Results

• SCRN accumulates longer history in the “slow” hidden layer: the same as an exponentially decaying cache model

• Empirically, LSTM performance correlates strongly with cache (weighted bag-of-words)

• For very large (~infinite) training sets, SCRN seems to be the preferable architecture: it is computationally very cheap
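
To make the comparison with the cache concrete, here is a toy sketch of an exponentially decaying unigram cache (the decay and interpolation weights are illustrative assumptions).

```python
from collections import defaultdict

def cache_scores(history, decay=0.99):
    """Exponentially decaying bag-of-words cache: recent words get higher weight."""
    if not history:
        return {}
    scores = defaultdict(float)
    weight = 1.0
    for word in reversed(history):   # most recent word first
        scores[word] += weight
        weight *= decay
    total = sum(scores.values())
    return {w: s / total for w, s in scores.items()}

# Usage: interpolate with a base language model probability, e.g.
# p(w | history) = 0.9 * p_rnn(w) + 0.1 * cache_scores(history).get(w, 0.0)
```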

Page 27

Conclusion

• Simple tricks can overcome the vanishing and exploding gradient problems

• The state of the recurrent layer can represent longer short term memory, but not true long term memory (across millions of time steps)

• To represent true long term memory, we may need to develop models with the ability to grow in size (modify their own structure)

Page 28

Beyond Deep Learning

• Going beyond: what can RNNs and deep networks not model efficiently?

• Surprisingly simple patterns! For example, memorization of variable-length sequences of symbols

Page 29

Beyond Deep Learning: Algorithmic Patterns

• Many complex patterns have a short, finite description length in natural language (or in any Turing-complete computational system)

• We call such patterns Algorithmic patterns

• Examples of algorithmic patterns: a^n b^n sequences, sequence memorization, addition of numbers learned from examples

• These patterns often cannot be learned with standard deep learning techniques

Page 30

Beyond Deep Learning: Algorithmic Patterns

• Among the myriad of complex tasks that are currently not solvable, which ones should we focus on?

• We need to set an ambitious end goal, and define a roadmap for how to achieve it step-by-step

Page 31

A Roadmap towards Machine Intelligence

Tomas Mikolov, Armand Joulin and Marco Baroni

Page 32

Ultimate Goal for Communication-based AI

Can do almost anything:
• A machine that helps students to understand homework
• Help researchers to find relevant information
• Write programs
• Help scientists in tasks that are currently too demanding (would require hundreds of years of work to solve)

Page 33

The Roadmap

• We describe a minimal set of components we think the intelligent machine will consist of

• Then, an approach to construct the machine

• And the requirements for the machine to be scalable

Page 34

Components of Intelligent machines

• Ability to communicate

• Motivation component

• Learning skills (further requires long-term memory), i.e. the ability to modify itself to adapt to new problems

Page 35

Components of Framework

To build and develop intelligent machines, we need:

• An environment that can teach the machine basic communication skills and learning strategies

• Communication channels

• Rewards

• Incremental structure

Page 36

The need for new tasks: simulated environment

• There is no existing dataset known to us that would allow us to teach the machine communication skills

• Careful design of the tasks, including how quickly the complexity grows, seems essential for success:
  • If we add complexity too quickly, even a correctly implemented intelligent machine can fail to learn
  • By adding complexity too slowly, we may miss the final goals

Page 37

High-level description of the environment

Simulated environment:
• Learner
• Teacher
• Rewards

Scaling up:
• More complex tasks, fewer examples, less supervision
• Communication with real humans
• Real input signals (internet)

Page 38

Simulated environment - agents

• Environment: a simple script-based reactive agent that produces signals for the learner and represents the world

• Learner: the intelligent machine, which receives an input signal and a reward signal and produces an output signal to maximize the average incoming reward

• Teacher: specifies tasks for the Learner; first based on scripts, later to be replaced by human users

Page 39

Simulated environment - communication

• Both Teacher and Environment write to the Learner’s input channel

• Learner’s output channel influences its behavior in the Environment, and can be used for communication with the Teacher

• Rewards are also part of the IO channels

Page 40

Visualization for better understanding

• Example of input / output streams and visualization:

Page 41

How to scale up: fast learners

• It is essential to develop a fast learner: we can easily build a machine today that will “solve” simple tasks in the simulated world using a myriad of trials, but this will not scale to complex problems

• In general, showing the Learner a new type of behavior and guiding it through a few tasks should be enough for it to generalize to similar tasks later

• There should be less and less need for direct supervision through rewards

Page 42

How to scale up: adding humans

• A Learner capable of fast learning can start communicating with human experts (us) who will teach it novel behavior

• Later, a pre-trained Learner with basic communication skills can be used by human non-experts

Page 43

How to scale up: adding real world

• The Learner can gain access to the internet through its IO channels

• This can be done by teaching the Learner how to form a query in its output stream

Page 44

The need for new techniques

Certain trivial patterns are nowadays hard to learn:
• The a^n b^n context free language is out of scope for standard RNNs
• Sequence memorization breaks LSTM RNNs

• We show this in a recent paper Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets

Page 45

Scalability

For the machine to scale to more complex problems, we need:
• Long-term memory
• A (Turing-) complete and efficient computational model
• Incremental, compositional learning
• Fast learning from a small number of examples
• A decreasing amount of supervision through rewards

• Further discussed in: A Roadmap towards Machine Intelligence, http://arxiv.org/abs/1511.08130

Page 46

Some steps forward: Stack RNNs (Joulin & Mikolov, 2015)

• Simple RNN extended with a long term memory module that the neural net learns to control

• The idea itself is very old (from 80’s – 90’s)

• Our version is very simple and learns patterns with complexity far exceeding what was shown before (though still very toyish): much less supervision, scales to more complex tasks

Page 47

Stack RNN

• Learns algorithms from examples
• Add structured memory to RNN:
  • Trainable [read/write]
  • Unbounded
• Actions: PUSH / POP / NO-OP
• Examples of memory structures: stacks, lists, queues, tapes, grids, …
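
A simplified sketch of how such a trainable stack can be made differentiable: the PUSH / POP / NO-OP actions are mixed according to their probabilities instead of being executed discretely. A fixed-size array is used here for brevity (the actual Stack RNN memory is unbounded), and the exact parameterization in the paper may differ.

```python
import numpy as np

def stack_update(stack, action_probs, push_value):
    """Soft update of a continuous stack.

    stack        : 1-D array, stack[0] is the top element
    action_probs : probabilities of (PUSH, POP, NO-OP), e.g. from a softmax
    push_value   : scalar the controller proposes to push
    """
    p_push, p_pop, p_noop = action_probs
    prev = np.append(stack, 0.0)   # pad so reading "below the bottom" gives 0
    new = np.zeros_like(stack)
    new[0] = p_push * push_value + p_pop * prev[1] + p_noop * prev[0]
    for i in range(1, len(stack)):
        new[i] = p_push * prev[i - 1] + p_pop * prev[i + 1] + p_noop * prev[i]
    return new
```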

Page 48

Algorithmic Patterns

• Examples of simple algorithmic patterns generated by short programs (grammars)

• The goal is to learn these patterns in an unsupervised way, just by observing the example sequences

Page 49

Algorithmic Patterns - Counting

• Performance on simple counting tasks
• An RNN with a sigmoidal activation function cannot count
• Stack-RNN and LSTM can count
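
As an illustration, a toy generator for one such counting pattern (a^n b^n is used here as an assumed example): predicting exactly where the b's end requires counting the a's.

```python
import random

def generate_anbn(max_n=20):
    """Generate one a^n b^n example; the model must count the a's to predict
    when the b's stop."""
    n = random.randint(1, max_n)
    return "a" * n + "b" * n

# A training stream: concatenate examples and train the network on next-symbol prediction
stream = "".join(generate_anbn() for _ in range(5))
```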

Page 50

Algorithmic Patterns - Sequences

• Sequence memorization and binary addition are out of scope for LSTM
• The expandable memory of stacks allows the model to learn the solution

Page 51

Binary Addition

• No supervision in training, just prediction
• Learns to: store digits, when to produce output, carry

Page 52

Stack RNNs: summary

The good:
• Turing-complete model of computation (with >= 2 stacks)
• Learns some algorithmic patterns
• Has long term memory
• Simple model that works for some problems that break RNNs and LSTMs
• Reproducible: https://github.com/facebook/Stack-RNN

The bad:
• The long term memory is used only to store partial computation (i.e. learned skills are not stored there yet)
• Does not seem to be a good model for incremental learning
• Stacks do not seem to be a very general choice for the topology of the memory

Page 53

Conclusion

To achieve true artificial intelligence, we need:
• An AI-complete goal
• A new set of tasks
• To develop new techniques
• To motivate more people to address these problems

