
GRID-LSTM MAS.S63, Eric Chu


Page 1:

GRID-LSTM

MAS.S63, Eric Chu

Page 2:
Page 3:

Neural Networks++

Memory

Attention

Bayesian

Extremely deep

Page 4:

An LSTM Recap

Motivation: problems with vanilla RNNs

Vanishing gradients due to repeated non-linearities

Harder to capture longer-term interactions

Solution:

“Long” “short-term” memory cells, controlled by gates that allow information to pass unmodified over many timesteps

Page 5:

An LSTM Recap

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Page 6:

An LSTM Recap

Forget gate: which parts of memory vector to delete

Input gate: which parts of memory vector to update

Content gate: what should the memory vector be updated with

Output gate: what gets read from new memory into hidden vector
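The four gates above can be sketched as a single LSTM timestep in NumPy. This is a minimal illustration, not the slides' own code; the weight names (`W_f`, `W_i`, `W_c`, `W_o`) and sizes are assumptions for the example.

```python
# Minimal LSTM cell step illustrating the four gates (illustrative weights).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W_f, W_i, W_c, W_o):
    """One timestep: x is the input, h_prev the hidden vector, c_prev the memory vector."""
    z = np.concatenate([h_prev, x])   # shared input to all gates
    f = sigmoid(W_f @ z)              # forget gate: which parts of memory to delete
    i = sigmoid(W_i @ z)              # input gate: which parts of memory to update
    g = np.tanh(W_c @ z)              # content gate: candidate update values
    o = sigmoid(W_o @ z)              # output gate: what gets read into the hidden vector
    c = f * c_prev + i * g            # new memory vector
    h = o * np.tanh(c)                # new hidden vector
    return h, c

# Tiny usage example with random weights
rng = np.random.default_rng(0)
n_h, n_x = 4, 3
W = [rng.standard_normal((n_h, n_h + n_x)) for _ in range(4)]
h, c = lstm_step(rng.standard_normal(n_x), np.zeros(n_h), np.zeros(n_h), *W)
```

Note that the memory update `c = f * c_prev + i * g` is additive, which is what lets information pass unmodified over many timesteps when `f` saturates near 1.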

Page 7:

An LSTM Recap

Page 8:

An LSTM Recap

Page 9:

An LSTM Recap: Stacked LSTM

Page 10:

Grid-LSTM: Motivation

Like a stacked LSTM, but with LSTM unit connections along the depth dimension as well as the temporal dimension

Page 11:

Grid-LSTM: Motivation

Page 12:

Grid-LSTM: 1D

1D Grid-LSTM = feedforward NN with LSTM cells instead of transfer functions such as tanh and ReLU

Very closely related to Highway Networks
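One layer of this idea might be sketched as follows: the "recurrence" runs layer-to-layer rather than timestep-to-timestep, so the incoming hidden vector plays the role the input plays in a temporal LSTM. Everything here (the function name `grid1d_layer`, the weight names) is a hypothetical sketch, not code from the slides.

```python
# Sketch of one 1D Grid-LSTM layer: a feedforward layer whose non-linearity
# is an LSTM cell acting along the depth dimension (illustrative weights).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def grid1d_layer(h, m, W_f, W_i, W_c, W_o):
    """Transform hidden vector h and memory vector m one step along depth."""
    f = sigmoid(W_f @ h)
    i = sigmoid(W_i @ h)
    g = np.tanh(W_c @ h)
    o = sigmoid(W_o @ h)
    m_next = f * m + i * g          # memory carried up through the layers
    h_next = o * np.tanh(m_next)
    return h_next, m_next

# Usage: one depth step with random weights
rng = np.random.default_rng(0)
n_h = 4
Ws = [rng.standard_normal((n_h, n_h)) for _ in range(4)]
h1, m1 = grid1d_layer(rng.standard_normal(n_h), np.zeros(n_h), *Ws)
```

The highway-network connection is visible in the update `m_next = f * m + i * g`: the forget and input gates play roughly the role of a highway layer's carry and transform gates.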

Page 13:

Grid-LSTM: 3D

3D Grid-LSTM = Multidimensional LSTM, but again with LSTM cells in the depth dimension

A 2D Multidimensional RNN has 2 hidden vectors instead of 1

Page 14:

Grid-LSTM: All together now

An N-D Grid-LSTM has N inputs and N outputs at each LSTM block
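A block with N inputs and N outputs might be sketched like this: each dimension brings its own (hidden, memory) pair, all incoming hidden vectors are concatenated into a shared input, and each dimension applies its own LSTM transform to produce its outgoing pair. The weight layout and names are assumptions made for the illustration.

```python
# Sketch of an N-D Grid-LSTM block: N (hidden, memory) pairs in,
# N (hidden, memory) pairs out, one LSTM transform per dimension.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def grid_block(hs, ms, weights):
    """hs, ms: lists of N hidden/memory vectors; weights: one dict per
    dimension with gate matrices of shape (n_h, N * n_h)."""
    H = np.concatenate(hs)                 # shared input built from all N hiddens
    out_h, out_m = [], []
    for m, W in zip(ms, weights):          # each dimension gates independently
        f = sigmoid(W["f"] @ H)
        i = sigmoid(W["i"] @ H)
        g = np.tanh(W["c"] @ H)
        o = sigmoid(W["o"] @ H)
        m_new = f * m + i * g
        out_m.append(m_new)
        out_h.append(o * np.tanh(m_new))
    return out_h, out_m

# Usage: a 2-D block (e.g. time + depth) with hidden size 4
rng = np.random.default_rng(1)
N, n_h = 2, 4
weights = [{k: rng.standard_normal((n_h, N * n_h)) for k in "fico"} for _ in range(N)]
hs, ms = grid_block([rng.standard_normal(n_h)] * N, [np.zeros(n_h)] * N, weights)
```

With N = 1 this reduces to the feedforward case above; with N = 2 one dimension can be time and the other depth, recovering the stacked-but-gated picture from the motivation slides.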

Page 15:

Relation to Attention

LSTM: “The mechanism also acts as a memory and implicit attention system, whereby the signal from some input x_i can be written to the memory vector and attended to in parts across multiple steps by being retrieved one part at a time.” - Quoc Le

Grid-LSTM: “Another interpretation of the attention model is that it allows an O(T) computation per prediction step. So the model itself has O(T²) total computation (assuming the lengths of input and output sequences are roughly the same). With this interpretation, an alternative approach to the attention model is to lay out the input and output sequences in a grid structure to allow O(T²) computation. This idea is called Grid-LSTM” - Quoc Le

Page 16:

Experiment

Task: Character prediction

3-layer stacked LSTM vs. 3-layer stacked Grid-LSTM

Page 17:

Future Work

Application to speech recognition, which uses stacked RNNs on spectrograms

Start with 2D Grid-LSTM

Can also try 3D Grid-LSTM

Machine translation

3D Grid-LSTM instead of an encoder-decoder network