Paper reading notes: "Gated Feedback Recurrent Neural Networks"

Posted on 21-Aug-2015

TRANSCRIPT

1. Gated Feedback Recurrent Neural Networks
Matsuo lab. paper reading session, Jul. 17, 2015
School of Engineering, The University of Tokyo
Hiroki Kurotaki, kurotaki@weblab.t.u-tokyo.ac.jp

2. Contents
Paper information / Introduction / Related Works / Proposed Methods / Experiments, Results and Analysis / Conclusion

4. Paper Information
Gated Feedback Recurrent Neural Networks
Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, Yoshua Bengio
Dept. IRO, Université de Montréal; CIFAR Senior Fellow
Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), pp. 2067-2075
First submitted to arXiv.org on 9 Feb 2015
Cited by 9 (Google Scholar, Jul. 17, 2015)
http://jmlr.org/proceedings/papers/v37/chung15.html

6. Introduction 1/3
They propose a novel recurrent neural network (RNN) architecture, the Gated Feedback RNN (GF-RNN).
The GF-RNN allows connections from upper layers to lower layers, and controls those signals with global gating units.
(Each circle represents a layer consisting of recurrent units, e.g. LSTM cells.)

7. Introduction 2/3
The proposed GF-RNN outperforms the baseline methods on these tasks:
1. Character-level language modeling (given a subsequence of structured data, predict the rest of the characters).

8. Introduction 3/3
2. Python program evaluation (predict a script's execution result from the input, given as a raw character sequence). ([Zaremba 2014], Figure 1)

10. Related works (Unit): Long Short-Term Memory
An LSTM cell is just a neuron, but one that decides when to memorize, forget and expose its content value. ([Zaremba 2014], Figure 1; the notation in the figure differs slightly from this paper's.)

11. Related works (Unit): Gated Recurrent Unit (Cho et al. 2014)
Like the LSTM, a GRU adaptively resets (forgets) or updates (inputs) its memory content. Unlike the LSTM, it has no output gate, and it balances adaptively between the previous and new memory contents. ([Cho 2014], Figure 2)

12. Related works (Architecture): Conventional Stacked RNN
Each circle represents a layer consisting of many recurrent units. Several hidden recurrent layers are stacked to model and capture the hierarchical structure between short- and long-term dependencies.

13. Related works (Architecture): Clockwork RNN
The i-th hidden module is updated only at the rate of 2^(i-1). Neurons in a faster module i are connected to neurons in a slower module j only if the clock period T_i < T_j. ([Koutnik 2014], Figure 1)

15. Proposed Method: Gated Feedback RNN
Generalizes the Clockwork RNN in both connectivity and update rate. Signals flow back from the upper recurrent layers into the lower layers. The network adaptively controls when to connect each pair of layers with "global reset gates" (the small bullets on the edges of the diagram).
h*_{t-1}: the concatenation of all the hidden states from the previous timestep (t-1)
g^{i->j}: the global reset gate from layer i at timestep (t-1) to layer j at timestep t

16. Proposed Method: GF-RNN with LSTM units
The global reset gates are used only when computing the new memory state. ([Zaremba 2014], Figure 1; the notation in the figure differs slightly from this paper's.)

17. Proposed Method: GF-RNN with GRU units
The global reset gates are used only when computing the new memory state. ([Cho 2014], Figure 2)

19. Experiment: Tasks (Language Modeling)
Given a subsequence of structured data, predict the rest of the characters.

20. Experiment: Tasks (Language Modeling)
Hutter dataset: English Wikipedia, containing 100 MBytes of characters, which include Latin alphabets, non-Latin alphabets, XML markup and special characters.
Training set: the first 90 MBytes. Validation set: the next 5 MBytes. Test set: the last 10 MBytes.
Performance measure: the average number of bits per character (BPC).

21. Experiment: Models (Language Modeling)
3 RNN architectures: single, (conventional) stacked, gated-feedback.
3 recurrent units: tanh, LSTM, Gated Recurrent Unit (GRU).
The number of parameters is constrained to be roughly 1000.
Details: RMSProp with momentum; 100 epochs; learning rate 0.001 (GRU, LSTM) and 5x10^(-5) (tanh); momentum coefficient 0.9; each update uses a minibatch of 100 subsequences of length 100 each.

22. Experiment: Results and Analysis (Language Modeling)
GF-RNN works well together with GRU and LSTM units, but fails to improve performance with tanh units. GF-RNN with LSTM is better than the non-gated variant (the bottom row of the table).

23. Experiment: Results and Analysis (Language Modeling)
The stacked LSTM failed to close the XML tags in both trials. The GF-LSTM, however, succeeded in closing both of them, which shows that it learned the structure of XML tags.

24. Experiment: Additional Results (Language Modeling)
They trained another GF-RNN with LSTM units and a larger number of parameters, and obtained comparable results. (They write that this beats the previously reported best results, but there is a non-RNN work that achieved 1.278.)

25. Experiment: Tasks (Python Program Evaluation)
Input: a Python program ending with a print statement (41 symbols). Output: the result of the print statement (13 symbols). The scripts used in this task include addition, multiplication, subtraction, for-loops, variable assignment, logical comparison and if-else statements. Both the input and the output are sequences of characters. Nesting: [1, 5]; length: [1, 1^10]. ([Zaremba 2014], Figure 1)

26. Experiment: Models (Python Program Evaluation)
RNN encoder-decoder approach, previously used for translation tasks. Encoder RNN: the hidden state of the encoder RNN is unfolded for 50 timesteps. Decoder RNN: the initial hidden state is initialized with the last hidden state of the encoder RNN.
Details: GRU & LSTM units, with and without gated feedback; 3 hidden layers for each of the encoder and decoder RNNs; each hidden layer contains 230 units (GRU) or 200 units (LSTM); mixed curriculum strategy [Zaremba '14]; Adam [Kingma '14]; minibatches of 128 sequences; 30 epochs. ([Cho 2014], Figure 1)

27. Experiment: Results & Analysis (Python Program Evaluation)
From the 3rd column onward, GF-RNN is better on almost all target scripts.

29. Conclusion
They proposed a novel architecture for deep stacked RNNs which uses gated-feedback connections between different layers. The proposed method outperformed previous results on the tasks of character-level language modeling and Python program evaluation. The gated-feedback architecture is faster and better in performance than the standard stacked RNN, even with the same amount of capacity. A more thorough investigation into the interaction between the gated-feedback connections and the role of the recurrent activation function is required in the future, because the proposed architecture works poorly with the tanh activation function.

30. References
[Cho 2014] Cho, Kyunghyun, van Merriënboer, Bart, Gulcehre, Caglar, Bougares, Fethi, Schwenk, Holger, and Bengio, Yoshua. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
[Koutnik 2014] Koutnik, Jan, Greff, Klaus, Gomez, Faustino, and Schmidhuber, Jürgen. A Clockwork RNN. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014), 2014.
[Schmidhuber 1992] Schmidhuber, Jürgen. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234-242, 1992.
[Stollenga 2014] Stollenga, Marijn F., Masci, Jonathan, Gomez, Faustino, and Schmidhuber, Jürgen. Deep networks with internal selective attention through feedback connections. In Advances in Neural Information Processing Systems, pp. 3545-3553, 2014.
[Zaremba 2014] Zaremba, Wojciech and Sutskever, Ilya. Learning to execute. arXiv preprint arXiv:1410.4615, 2014.
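As a supplement to slide 10, the LSTM cell's memorize/forget/expose behaviour can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the weight names in `p` are hypothetical, and biases and peephole connections are omitted for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM timestep. p holds hypothetical weight matrices:
    W_* act on the input x, U_* on the previous hidden state."""
    i = sigmoid(p["W_i"] @ x + p["U_i"] @ h_prev)        # input gate: when to memorize
    f = sigmoid(p["W_f"] @ x + p["U_f"] @ h_prev)        # forget gate: when to forget
    o = sigmoid(p["W_o"] @ x + p["U_o"] @ h_prev)        # output gate: when to expose
    c_tilde = np.tanh(p["W_c"] @ x + p["U_c"] @ h_prev)  # candidate memory content
    c = f * c_prev + i * c_tilde                         # memory cell update
    h = o * np.tanh(c)                                   # exposed hidden state
    return h, c

# toy usage: 3-dim input, 4-dim hidden state
rng = np.random.default_rng(0)
p = {k: rng.standard_normal((4, 3) if k.startswith("W") else (4, 4)) * 0.1
     for k in ["W_i", "U_i", "W_f", "U_f", "W_o", "U_o", "W_c", "U_c"]}
h, c = lstm_step(rng.standard_normal(3), np.zeros(4), np.zeros(4), p)
```

Because h is the output gate times a tanh, its entries always stay strictly inside (-1, 1), while the cell state c is unbounded over many steps.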
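Slide 11's GRU update can likewise be sketched in NumPy. Again a minimal, self-contained illustration rather than the authors' code; the weight names in `params` are hypothetical and biases are omitted. Note there is no output gate: the update gate z interpolates directly between the previous and the new memory content.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, params):
    """One GRU timestep: reset gate r, update gate z, candidate h_tilde.
    params maps hypothetical names W_* (input) / U_* (recurrent) to matrices."""
    r = sigmoid(params["W_r"] @ x + params["U_r"] @ h_prev)              # reset (forget) gate
    z = sigmoid(params["W_z"] @ x + params["U_z"] @ h_prev)              # update (input) gate
    h_tilde = np.tanh(params["W_h"] @ x + params["U_h"] @ (r * h_prev))  # candidate content
    return (1.0 - z) * h_prev + z * h_tilde                              # adaptive balance

# toy usage: 3-dim input, 4-dim hidden state
rng = np.random.default_rng(0)
params = {k: rng.standard_normal((4, 3) if k.startswith("W") else (4, 4)) * 0.1
          for k in ["W_r", "U_r", "W_z", "U_z", "W_h", "U_h"]}
h = gru_step(rng.standard_normal(3), np.zeros(4), params)
```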
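The gated-feedback recurrence of slide 15 (shown here for plain tanh units) can be sketched as below. This is an illustrative reconstruction under stated assumptions, not the authors' implementation: all variable names are hypothetical, biases are omitted, and the scalar global reset gate g^{i->j} is computed from the layer's bottom-up input together with h*_{t-1}, the concatenation of all previous hidden states.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gf_rnn_step(x, h_prev, W, U, w_g, u_g):
    """One timestep of a gated-feedback RNN with tanh units.
    h_prev: list of previous hidden states, one per layer.
    The scalar gate g = sigmoid(w_g[i][j] . input_of_layer_j + u_g[i][j] . h_star)
    scales the feedback from layer i (timestep t-1) into layer j (timestep t)."""
    h_star = np.concatenate(h_prev)          # all hidden states of timestep t-1
    L = len(h_prev)
    h_new, below = [], x                     # the bottom layer sees the input x_t
    for j in range(L):
        acc = W[j] @ below                   # bottom-up connection
        for i in range(L):                   # gated feedback from every layer i
            g = sigmoid(w_g[i][j] @ below + u_g[i][j] @ h_star)
            acc = acc + g * (U[i][j] @ h_prev[i])
        h_j = np.tanh(acc)
        h_new.append(h_j)
        below = h_j
    return h_new

# toy usage: 2 layers of 4 units, 3-dim input
rng = np.random.default_rng(1)
L, d, dx = 2, 4, 3
W = [rng.standard_normal((d, dx)) * 0.1] + [rng.standard_normal((d, d)) * 0.1 for _ in range(L - 1)]
U = [[rng.standard_normal((d, d)) * 0.1 for _ in range(L)] for _ in range(L)]
w_g = [[rng.standard_normal(dx if j == 0 else d) for j in range(L)] for _ in range(L)]
u_g = [[rng.standard_normal(L * d) for _ in range(L)] for _ in range(L)]
h = gf_rnn_step(rng.standard_normal(dx), [np.zeros(d) for _ in range(L)], W, U, w_g, u_g)
```

Setting every gate to 1 recovers a fully connected stacked RNN, and zeroing the upper-to-lower gates recovers the conventional stacked RNN, which is the sense in which the architecture generalizes both.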
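Slide 20's performance measure, bits per character (BPC), is the average negative log2-probability the model assigns to each character of the test sequence; lower is better. A minimal sketch (the `probs` argument is a hypothetical array of per-character model probabilities):

```python
import numpy as np

def bits_per_character(probs):
    """Average number of bits per character: mean of -log2 p(c_t | c_<t)."""
    return float(np.mean(-np.log2(probs)))

# a model that always assigns probability 0.5 needs exactly 1 bit per character
print(bits_per_character([0.5, 0.5, 0.5]))  # -> 1.0
```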