EDUNEX ITB
TRANSCRIPT
IF4074 Weeks 9-15
• Week 9 (18 October 2021): LSTM + RNN architectures + Major Assignment (Tubes) 2
• Week 10 (25 October 2021): RNN exercises + BPTT
• Week 11 (1 November 2021): Guest lecture (sharing on ML applications at Gojek)
• Week 12 (8 November 2021): RNN lab session
• Week 13 (15 November 2021): Feature Engineering 1 / experiment design assignment
• Week 14 (22 November 2021): Quiz 2
• Week 15 (29 November 2021): Feature Engineering lab session 2
04 LSTM: What & Why
Pembelajaran Mesin Lanjut (Advanced Machine Learning)
Masayu Leylia Khodra ([email protected])
KK IF - Teknik Informatika - STEI ITB
Module 4: Recurrent Neural Network
Long Short-Term Memory (LSTM): Why
https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21
h_t = σ(U·x_t + W·h_{t-1} + b_xh)
y_t = σ(V·h_t + b_hy)

RNN: the long-term dependency problem
[Figure: an RNN cell with input x_t, previous state h_{t-1}, new state h_t, and weight matrices U, W, V]
• Suffers from short-term memory (forward propagation).
• Suffers from the vanishing gradient problem (backward propagation). RNNs typically fail to learn dependencies longer than 5-10 time steps; in the worst case, this may completely stop the network from training further.
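The 5-10 time-step figure can be motivated with simple arithmetic: the sigmoid derivative σ'(x) = σ(x)(1 − σ(x)) never exceeds 0.25, so each backward step through a saturating sigmoid can shrink the gradient by a factor of 4 or more. A minimal sketch of this bound (not part of the lecture):

```python
# Sketch: geometric upper bound on the gradient factor after k time steps.
# The sigmoid derivative sigma'(x) = sigma(x) * (1 - sigma(x)) peaks at 0.25,
# so BPTT through k saturating sigmoid steps scales the gradient by at most 0.25^k.
def gradient_bound(k):
    return 0.25 ** k

for k in (1, 5, 10):
    print(k, gradient_bound(k))  # by k = 10 the bound is already below 1e-6
```

This is why gradients from early timesteps barely influence the weight updates in a plain RNN.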
Long Short-Term Memory (LSTM): What
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
LSTMs are explicitly designed to avoid the long-term dependency problem.
Introduced by Hochreiter & Schmidhuber (1997)
LSTM is a special kind of RNN; the difference lies in the operations within the LSTM's cells. In an RNN, the repeating module has a very simple structure; in an LSTM, the repeating module contains four interacting layers.
LSTM: Cell State & Gates

Cell State
• acts as the "memory" of the network
• acts as a transport highway that carries relevant information throughout the processing of the sequence

Forget Gate
• decides what information should be thrown away or kept
• values closer to 0 mean forget; values closer to 1 mean keep

Input Gate
• decides what information from the current step is relevant to add
• updates the cell state from the hidden state and the current input

Output Gate
• decides what the next hidden state should be
• the hidden state contains information on previous inputs and is also used for predictions
https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21
Forget Gate
https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21
f_t = σ(U_f·x_t + W_f·h_{t-1} + b_f)

A value of 1 represents "completely keep this," while a 0 represents "completely get rid of this."

[Figure: forget gate computing f_t from x_t and h_{t-1}, applied to C_{t-1}]
Input Gate
https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21
i_t = σ(U_i·x_t + W_i·h_{t-1} + b_i)
C̃_t = tanh(U_c·x_t + W_c·h_{t-1} + b_c)

[Figure: input gate computing i_t and candidate C̃_t from x_t and h_{t-1}]
Cell State
https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21
C_t = f_t ∘ C_{t-1} + i_t ∘ C̃_t

[Figure: cell state update combining C_{t-1} (scaled by f_t) with the candidate C̃_t (scaled by i_t)]
Output Gate
https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21
o_t = σ(U_o·x_t + W_o·h_{t-1} + b_o)
h_t = o_t ∘ tanh(C_t)

[Figure: output gate computing o_t from x_t and h_{t-1}, producing the new hidden state h_t]
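Putting the forget, input, cell-state, and output equations together, one LSTM step can be sketched in pure Python. The weights in the usage example below are arbitrary placeholders, not values from the lecture:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, U, W, b):
    """One LSTM step for scalar input; U, W, b are dicts keyed by gate f/i/c/o."""
    f = sigmoid(U['f'] * x + W['f'] * h_prev + b['f'])          # forget gate
    i = sigmoid(U['i'] * x + W['i'] * h_prev + b['i'])          # input gate
    c_tilde = math.tanh(U['c'] * x + W['c'] * h_prev + b['c'])  # candidate values
    c = f * c_prev + i * c_tilde                                # new cell state
    o = sigmoid(U['o'] * x + W['o'] * h_prev + b['o'])          # output gate
    h = o * math.tanh(c)                                        # new hidden state
    return h, c

# Usage with placeholder weights (0.5) and biases (0.0) on a 2-step sequence:
U = {g: 0.5 for g in 'fico'}
W = {g: 0.5 for g in 'fico'}
b = {g: 0.0 for g in 'fico'}
h, c = 0.0, 0.0
for x in (1.0, -1.0):
    h, c = lstm_step(x, h, c, U, W, b)
print(h, c)  # |h| < 1 always, because h = o * tanh(c) with o in (0, 1)
```

The forward-propagation example on the next slides applies exactly these six operations, just with a 2-dimensional input.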
LSTM Forward Propagation: Example
https://medium.com/@aidangomez/let-s-do-this-f9b699de31d9
Architecture: 2-dimensional input x, one LSTM unit h, with parameters U_f, U_i, U_c, U_o, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o.

Training data:
A1   A2   Target
1    2    0.5
0.5  3    1.25
…

f_t = σ(U_f·x_t + W_f·h_{t-1} + b_f)
i_t = σ(U_i·x_t + W_i·h_{t-1} + b_i)
C̃_t = tanh(U_c·x_t + W_c·h_{t-1} + b_c)
C_t = f_t ∘ C_{t-1} + i_t ∘ C̃_t
o_t = σ(U_o·x_t + W_o·h_{t-1} + b_o)
h_t = o_t ∘ tanh(C_t)

Initial weights:
U_f = [0.700, 0.450]   W_f = 0.100   b_f = 0.150
U_i = [0.950, 0.800]   W_i = 0.800   b_i = 0.650
U_c = [0.450, 0.250]   W_c = 0.150   b_c = 0.200
U_o = [0.600, 0.400]   W_o = 0.250   b_o = 0.100
h_0 = 0, C_0 = 0
Computing ht and ct : Timestep t1
t1 = <1, 2>, with h_0 = 0 and C_0 = 0. Using the LSTM equations above:

gate   U·x_t   W·h_{t-1}+b   net     activation
f      1.600   0.150         1.750   f_1 = 0.852
i      2.550   0.650         3.200   i_1 = 0.961
c̃      0.950   0.200         1.150   C̃_1 = 0.818
o      1.400   0.100         1.500   o_1 = 0.818

C_1 = 0.786, h_1 = 0.536
https://medium.com/@aidangomez/let-s-do-this-f9b699de31d9
Computing ht and ct : Timestep t2
t2 = <0.5, 3>, with h_1 = 0.536 and C_1 = 0.786:

gate   U·x_t   W·h_{t-1}+b   net     activation
f      1.700   0.204         1.904   f_2 = 0.870
i      2.875   1.079         3.954   i_2 = 0.981
c̃      0.975   0.280         1.255   C̃_2 = 0.850
o      1.500   0.234         1.734   o_2 = 0.850

C_2 = 1.518, h_2 = 0.772
https://medium.com/@aidangomez/let-s-do-this-f9b699de31d9
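Both timesteps can be reproduced with a short pure-Python sketch of the forward equations, using the weights and inputs from these slides (function and variable names are mine):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Weights and biases from the slides: one LSTM unit, 2-dimensional input.
U = {'f': (0.70, 0.45), 'i': (0.95, 0.80), 'c': (0.45, 0.25), 'o': (0.60, 0.40)}
W = {'f': 0.10, 'i': 0.80, 'c': 0.15, 'o': 0.25}
b = {'f': 0.15, 'i': 0.65, 'c': 0.20, 'o': 0.10}

def lstm_step(x, h_prev, c_prev):
    def net(g):  # U_g . x_t + W_g * h_{t-1} + b_g
        return U[g][0] * x[0] + U[g][1] * x[1] + W[g] * h_prev + b[g]
    f = sigmoid(net('f'))
    i = sigmoid(net('i'))
    c_tilde = math.tanh(net('c'))
    c = f * c_prev + i * c_tilde
    o = sigmoid(net('o'))
    return o * math.tanh(c), c

states = []
h, c = 0.0, 0.0
for x in [(1.0, 2.0), (0.5, 3.0)]:
    h, c = lstm_step(x, h, c)
    states.append((round(h, 3), round(c, 3)))
print(states)  # [(0.536, 0.786), (0.772, 1.518)] -- matches the slides
```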
Implementing LSTM on Keras: Many to One
from keras import Sequential
from keras.layers import LSTM, Dense
model = Sequential()
model.add(LSTM(10, input_shape=(50, 1)))  # 10 units; processes 50x1 sequences
model.add(Dense(1, activation='linear'))  # linear output: a regression problem
https://towardsdatascience.com/a-comprehensive-guide-to-working-with-recurrent-neural-networks-in-keras-f3b2d5e2fa7f
[Figure: x (1) → h (10) → y (1), with weight matrices U, W, V]
# predicts Amazon stock closing prices; LSTM over 50 timesteps
Number of Parameters
[Figure: x (1) → h (10) → y (1), with weight matrices U, W, V]

Total parameters = (1+10+1)*4*10 + (10+1)*1 = 491
A simple RNN with the same layout has 131 parameters.
U: matrix of hidden neurons × (input dimension + 1)
W: matrix of hidden neurons × hidden neurons
V: matrix of output neurons × (hidden neurons + 1)
Total parameters for an LSTM with n units, m-dimensional input, and k-dimensional output = (m+n+1)*4*n + (n+1)*k
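The counting rule can be sanity-checked in a few lines; the helper names below are mine, for illustration:

```python
def lstm_params(m, n, k):
    """LSTM layer with n units on m-dimensional input, plus a Dense head with
    k outputs. Each of the 4 gates has n x m input weights, n x n recurrent
    weights and n biases: (m + n + 1) * 4 * n; the head adds (n + 1) * k."""
    return (m + n + 1) * 4 * n + (n + 1) * k

def simple_rnn_params(m, n, k):
    """Same network with a SimpleRNN layer: one gate instead of four."""
    return (m + n + 1) * n + (n + 1) * k

print(lstm_params(1, 10, 1))        # 491, as on the slide
print(simple_rnn_params(1, 10, 1))  # 131
```

The factor 4 is the only difference between the two formulas: an LSTM keeps one U, W, b triple per gate.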
RNN → LSTM → GRU → ReGU
• 1985: Recurrent nets
• 1997: LSTM, Bi-RNN
• 2014: GRU
• 2017: Residual LSTM
• 2019: Residual Gated Unit (ReGU)
GRU: no cell state, 2 gates
ReGU: shortcut connections
Summary
• LSTMs avoid the long-term dependency problem
• LSTMs have a cell state and 3 gates (forget, input, output)
• Computing h_t and C_t
03 RNN Architecture
General Architecture
[Figure: a stacked RNN with input x (i), hidden layers h_1 (j) through h_h (k), and output y (m); each hidden layer has input weights U and recurrent weights W, with V at the output, unfolded over n timesteps]
return_sequences = True/False
Architecture
• fixed-size input vector x_t
• fixed-size output vector o_t
• RNN state s_t
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
• One to many: image captioning
• Many to one: text classification
• Many to many: machine translation, video frame classification, POS tagging
One to Many: Image Captioning
CNN Encoder (Inception) - RNN Decoder (LSTM) (Vinyals et al., 2014)
Many to One: Text Classification
https://www.oreilly.com/learning/perform-sentiment-analysis-with-lstms-using-tensorflow
Many to Many: Sequence Tagging
https://www.depends-on-the-definition.com/guide-sequence-tagging-neural-networks-python/
The input is a sequence of words, and the output is the sequence of POS tags, one for each word.
Many to Many: Machine Translation
http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
• Machine translation: the input is a sequence of words in the source language (e.g., German); the output is a sequence of words in the target language (e.g., English).
• A key difference is that the output only starts after the complete input has been seen, because the first word of the translated sentence may require information captured from the complete input sequence.
Implementing RNN on Keras: Many to One
from keras import Sequential
from keras.layers import SimpleRNN, Dense
model = Sequential()
model.add(SimpleRNN(10, input_shape=(50, 1)))  # simple recurrent layer: 10 neurons, processes 50x1 sequences
model.add(Dense(1, activation='linear'))  # linear output because this is a regression problem
https://towardsdatascience.com/a-comprehensive-guide-to-working-with-recurrent-neural-networks-in-keras-f3b2d5e2fa7f
[Figure: x (1) → h (10) → y (1), with weight matrices U, W, V]
# predicts Amazon stock closing prices; RNN over 50 timesteps
Number of Parameters
[Figure: x (1) → h (10) → y (1), with weight matrices U, W, V]

Total parameters = (1+10+1)*10 + (10+1)*1 = 131
Simple RNN:
U: matrix of hidden neurons × (input dimension + 1)
W: matrix of hidden neurons × hidden neurons
V: matrix of output neurons × (hidden neurons + 1)
Number of Parameters: Example 2

model = Sequential()  # initialize model
model.add(SimpleRNN(64, input_shape=(50, 1), return_sequences=True))  # 64 neurons
model.add(SimpleRNN(32, return_sequences=True))  # 32 neurons
model.add(SimpleRNN(16))  # 16 neurons
model.add(Dense(8, activation='tanh'))
model.add(Dense(1, activation='linear'))

Total parameters = 8257
= (1+64+1)*64 = 4224
+ (64+32+1)*32 = 3104
+ (32+16+1)*16 = 784
+ (16+1)*8 = 136
+ (8+1)*1 = 9
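The same counting rule, applied layer by layer, reproduces the breakdown above (helper names are mine):

```python
def simple_rnn_layer(input_dim, units):
    # (input_dim + units + 1) * units: input weights + recurrent weights + biases
    return (input_dim + units + 1) * units

def dense_layer(input_dim, units):
    return (input_dim + 1) * units

layers = [
    simple_rnn_layer(1, 64),   # 4224
    simple_rnn_layer(64, 32),  # 3104
    simple_rnn_layer(32, 16),  # 784
    dense_layer(16, 8),        # 136
    dense_layer(8, 1),         # 9
]
print(sum(layers))  # 8257
```

Note that the sequence length (50) never enters the count: recurrent weights are shared across timesteps.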
Bidirectional RNNs
• In many applications we want to output a prediction y(t) that may depend on the whole input sequence, e.g., co-articulation in speech recognition, right-hand neighbors in POS tagging, etc.
• Bidirectional RNNs combine an RNN that moves forward through time, beginning from the start of the sequence, with another RNN that moves backward through time, beginning from the end of the sequence.
https://www.cs.toronto.edu/~tingwuwang/rnn_tutorial.pdf
Bidirectional RNNs for Information Extraction
https://www.depends-on-the-definition.com/sequence-tagging-lstm-crf/
Summary
• Architectures: 1-to-n, n-to-1, n-to-n
• Number of parameters
• RNN
• Bidirectional RNN
05 Backpropagation Through Time
Backpropagation Through Time (BPTT)
1. Forward pass: get the current output for the sequence.
2. Backward pass: compute δgates_t, δx_t, Δout_{t-1}, δU, δW, δb.
3. Update weights: w_new = w_old − η·δw_old.

The BPTT learning algorithm is an extension of standard backpropagation that performs gradient descent on an unfolded network.
Example
Unfold a one-unit LSTM with 2-dimensional input x and parameters U_f, U_i, U_c, U_o, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o:

x_1 = [1, 2] → h_1 = 0.536 (target 0.5)
x_2 = [0.5, 3] → h_2 = 0.772 (target 1.25)

U (rows f, i, c, o):
0.700  0.450
0.950  0.800
0.450  0.250
0.600  0.400
W = [0.100, 0.800, 0.150, 0.250]
b = [0.150, 0.650, 0.200, 0.100]
LSTM: Backward Propagation Timestep t
Forward:
f_t = σ(U_f·x_t + W_f·h_{t-1} + b_f)
i_t = σ(U_i·x_t + W_i·h_{t-1} + b_i)
C̃_t = tanh(U_c·x_t + W_c·h_{t-1} + b_c)
C_t = f_t ∘ C_{t-1} + i_t ∘ C̃_t
o_t = σ(U_o·x_t + W_o·h_{t-1} + b_o)
h_t = o_t ∘ tanh(C_t)

Backward:
δout_t = Δ_t + Δout_t   (Δ_t = ∂E/∂h_t; Δout_t comes from timestep t+1)
δC_t = δout_t ∘ o_t ∘ (1 − tanh²(C_t)) + δC_{t+1} ∘ f_{t+1}
δC̃_t = δC_t ∘ i_t ∘ (1 − C̃_t²)
δi_t = δC_t ∘ C̃_t ∘ i_t ∘ (1 − i_t)
δf_t = δC_t ∘ C_{t-1} ∘ f_t ∘ (1 − f_t)
δo_t = δout_t ∘ tanh(C_t) ∘ o_t ∘ (1 − o_t)
δx_t = Uᵀ·δgates_t
Δout_{t-1} = Wᵀ·δgates_t
where δgates_t = [δf_t, δi_t, δC̃_t, δo_t]ᵀ
Computing δgates_t for timestep t = 2

At the last timestep: Δout_t = 0, f_{t+1} = 0, δC_{t+1} = 0.

E = ½(target − h)², so Δ_t = ∂E/∂h = −(target − h) = h − target

t2: ∂E/∂h = 0.772 − 1.25 = −0.478 → δout_2 = −0.478 + 0 = −0.478
δC_2 = −0.478 · 0.850 · (1 − tanh²(1.518)) + 0 · 0 = −0.071
δf_2 = −0.071 · 0.786 · 0.870 · (1 − 0.870) = −0.006
δi_2 = −0.071 · 0.850 · 0.981 · (1 − 0.981) = −0.001
δC̃_2 = −0.071 · 0.981 · (1 − 0.850²) = −0.019
δo_2 = −0.478 · tanh(1.518) · 0.850 · (1 − 0.850) = −0.055

δgates_2 = [−0.006, −0.001, −0.019, −0.055]ᵀ
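These δ-values can be reproduced end to end in pure Python: rerun the forward pass with the slides' weights, then apply the backward equations at the last timestep. A self-contained sketch (function and variable names are mine):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Weights and biases from the forward-propagation example.
U = {'f': (0.70, 0.45), 'i': (0.95, 0.80), 'c': (0.45, 0.25), 'o': (0.60, 0.40)}
W = {'f': 0.10, 'i': 0.80, 'c': 0.15, 'o': 0.25}
b = {'f': 0.15, 'i': 0.65, 'c': 0.20, 'o': 0.10}

def step(x, h_prev, c_prev):
    net = lambda g: U[g][0] * x[0] + U[g][1] * x[1] + W[g] * h_prev + b[g]
    f, i = sigmoid(net('f')), sigmoid(net('i'))
    c_tilde = math.tanh(net('c'))
    c = f * c_prev + i * c_tilde
    o = sigmoid(net('o'))
    return f, i, c_tilde, c, o, o * math.tanh(c)

# Forward pass over the two timesteps.
_, _, _, c1, _, h1 = step((1.0, 2.0), 0.0, 0.0)
f2, i2, ct2, c2, o2, h2 = step((0.5, 3.0), h1, c1)

# Backward pass at the last timestep: delta_out_t = 0, dC_{t+1} = 0, f_{t+1} = 0.
d_out2 = h2 - 1.25                            # dE/dh = h - target
dC2 = d_out2 * o2 * (1 - math.tanh(c2) ** 2)
df2 = dC2 * c1 * f2 * (1 - f2)
di2 = dC2 * ct2 * i2 * (1 - i2)
dct2 = dC2 * i2 * (1 - ct2 ** 2)
do2 = d_out2 * math.tanh(c2) * o2 * (1 - o2)

print([round(v, 3) for v in (df2, di2, dct2, do2)])
# [-0.006, -0.001, -0.019, -0.055] -- matches delta_gates_2 on the slide
```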
Computing δx_2 and Δout_1 for timestep t = 2

δx_t = Uᵀ·δgates_t
Δout_{t-1} = Wᵀ·δgates_t

U (rows f, i, c, o):
0.700  0.450
0.950  0.800
0.450  0.250
0.600  0.400
δgates_2 = [−0.006, −0.001, −0.019, −0.055]ᵀ
δx_2 = Uᵀ·δgates_2 = [−0.047, −0.030]

W = [0.100, 0.800, 0.150, 0.250]
Δout_1 = Wᵀ·δgates_2 = −0.018
Computing for timestep t = 1: Δout_1 = −0.018

Δ_1 = 0.536 − 0.5 = 0.036 → δout_1 = 0.036 − 0.018 = 0.018
δC_1 = 0.018 · 0.818 · (1 − tanh²(0.786)) + (−0.071) · 0.870 = −0.053
δf_1 = 0 (since C_0 = 0)
δi_1 = −0.0017
δC̃_1 = −0.017
δo_1 = 0.0018

δgates_1 = [0.0000, −0.0017, −0.0170, 0.0018]ᵀ
δx_1 = Uᵀ·δgates_1 = [−0.0082, −0.0049]
Δout_0 = Wᵀ·δgates_1 = −0.0035
Computing δU, δW, δb

δU = Σ_{t=1..2} δgates_t · x_tᵀ
   = [0.0000, −0.0017, −0.0170, 0.0018]ᵀ·[1, 2] + [−0.006, −0.001, −0.019, −0.055]ᵀ·[0.5, 3]

δW = Σ δgates_{t+1} · h_t = [−0.006, −0.001, −0.019, −0.055]ᵀ · 0.536
   (only δgates_2·h_1 contributes, since h_0 = 0)

δb = Σ_{t=1..2} δgates_t

δU:
−0.0032  −0.0189
−0.0022  −0.0067
−0.0267  −0.0922
−0.0259  −0.1626

δW: [−0.0034, −0.0006, −0.0104, −0.0297]
δb: [−0.00631, −0.00277, −0.03641, −0.05362]
Update Weights (η = 0.1): w_new = w_old − η·δw_old

δU:
−0.0032  −0.0189
−0.0022  −0.0067
−0.0267  −0.0922
−0.0259  −0.1626
δW: [−0.0034, −0.0006, −0.0104, −0.0297]
δb: [−0.00631, −0.00277, −0.03641, −0.05362]

U_old (rows f, i, c, o):
0.700  0.450
0.950  0.800
0.450  0.250
0.600  0.400
U_new:
0.7003  0.4519
0.9502  0.8007
0.4527  0.2592
0.6026  0.4163

W_old = [0.100, 0.800, 0.150, 0.250] → W_new = [0.1003, 0.8001, 0.1510, 0.2530]
b_old = [0.1500, 0.6500, 0.2000, 0.1000] → b_new = [0.1506, 0.6503, 0.2036, 0.1054]
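The gradient-descent update can be checked for the W row in a couple of lines (pure Python, values from the slides):

```python
eta = 0.1  # learning rate from the slide

W_old = [0.100, 0.800, 0.150, 0.250]       # W_f, W_i, W_c, W_o
dW = [-0.0034, -0.0006, -0.0104, -0.0297]  # delta-W from the slide

# w_new = w_old - eta * delta-w, applied elementwise
W_new = [round(w - eta * d, 4) for w, d in zip(W_old, dW)]
print(W_new)  # [0.1003, 0.8001, 0.151, 0.253] -- matches W_new on the slide
```

The same elementwise rule produces U_new and b_new from δU and δb.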
Truncated BPTT
https://deeplearning4j.org/docs/latest/deeplearning4j-nn-recurrent
Truncated BPTT was developed in order to reduce the computational complexity of each parameter update in a recurrent neural network.
Summary
• Backpropagation through time for LSTM
• Truncated BPTT