
Page 1: Recurrent Neural Networks - GitHub Pages

Recurrent Neural Networks - The most widely used sequential model

Hao Dong

2019, Peking University

1

Page 2: Recurrent Neural Networks - GitHub Pages

• Motivation

• Before We Begin: Word Representation

• Sequential Data

• Vanilla Recurrent Neural Network

• Long Short-Term Memory

• Time-series Applications

2

Recurrent Neural Networks

Page 3: Recurrent Neural Networks - GitHub Pages

Motivation

3

Page 4: Recurrent Neural Networks - GitHub Pages

Motivation

• Both the Multi-Layer Perceptron and the Convolutional Neural Network take one data sample as the input and produce one result. They are categorised as feed-forward neural networks (FNNs) because they only pass data layer by layer and obtain one output for one input (e.g., an image as input with a class label as output).

• Many time-series data sets, such as language, video, and biosignals, cannot fit into this framework.

• The recurrent neural network (RNN) is a deep learning architecture designed for processing time-series data.

4

[Figure: a feed-forward neural network maps a single input to a single output, e.g. an image to the label "Cat" with probability 0.97. A sequential task such as mapping "one minus two plus three" to "the answer is two" requires modelling the whole sequence, e.g. factorising the joint probability as

p(y, x_1, x_2, x_3) = p(y) p(x_1 | y) p(x_2 | y, x_1) p(x_3 | y, x_1, x_2) ]

Page 5: Recurrent Neural Networks - GitHub Pages

Motivation

5

LanguageTranslation

Chatbot

ImageCaptioning

Anti-spam SignalAnalysis

VideoAnalysis

SentenceGeneration

Page 6: Recurrent Neural Networks - GitHub Pages

Before We Begin: Word Representation

6

Page 7: Recurrent Neural Networks - GitHub Pages

Before We Begin: Word Representation

• One-hot Vector

7

• A recurrent neural network receives words one by one, but first let us see how a word is represented in a computer.

• Simple method 1: One-hot vector

"Deep Learning is very useful" — with a 5-word vocabulary, each word becomes a 5-dimensional one-hot vector, e.g. [0 0 0 0 1], [0 0 0 1 0], [0 0 1 0 0], [0 1 0 0 0], [1 0 0 0 0]

Disadvantages:

• A large vocabulary leads to the "Curse of Dimensionality".

• All word representations are independent of each other, and the vectors are too sparse!

Word representation
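A minimal sketch of one-hot encoding, assuming a toy vocabulary built from the example sentence above (the vocabulary order and the helper name `one_hot` are illustrative, not from the slides):

```python
import numpy as np

# Toy vocabulary taken from the example sentence on the slide.
vocab = ["useful", "very", "is", "Learning", "Deep"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word, vocab_size=len(vocab)):
    """Return a one-hot vector with a single 1 at the word's index."""
    v = np.zeros(vocab_size, dtype=np.float32)
    v[word_to_id[word]] = 1.0
    return v

for w in "Deep Learning is very useful".split():
    print(w, one_hot(w))
```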

Page 8: Recurrent Neural Networks - GitHub Pages

Before We Begin: Word Representation

• Bag of Words

8

• Simple method 2: Bag of Words uses word frequencies to represent the sentence.

Disadvantages:

• A large vocabulary leads to the "Curse of Dimensionality".

• The information about word positions is lost.

“we like TensorLayer, do we?”

Word Frequency

we 2

like 1

TensorLayer 1

do 1

[2, 1, 1, 1]

Sentence (text) representation
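A minimal bag-of-words sketch for the example sentence above, assuming simple whitespace tokenisation with punctuation stripped (the helper below is illustrative):

```python
from collections import Counter

def bag_of_words(sentence):
    """Count word frequencies; punctuation is stripped for simplicity."""
    tokens = [t.strip(",?.!") for t in sentence.split()]
    counts = Counter(tokens)
    vocab = list(counts.keys())          # ["we", "like", "TensorLayer", "do"]
    vector = [counts[w] for w in vocab]  # [2, 1, 1, 1]
    return vocab, vector

print(bag_of_words("we like TensorLayer, do we?"))
```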

Page 9: Recurrent Neural Networks - GitHub Pages

Before We Begin: Word Representation

• Word Embedding

9

• Represent a word using a vector of floating-point numbers.

hellohi

sheheit

one

twothreebeijing

hangzhou london

Wordrepresentationspace

“hello” [0.32, 0.98, 0.12, 0.54 ….]

Wordembeddingvector

Similar words are grouped together

Page 10: Recurrent Neural Networks - GitHub Pages

Before We Begin: Word Representation

• Word Embedding

10

• Given a vocabulary of 5 words, we can build a word embedding table in which each word has 3 feature values.

[Figure: one-hot vector × word embedding table = word embedding vector]

In practice, we do not multiply the one-hot vector by the word embedding table; to save time, we directly use the word ID as the row index to look up the embedding vector in the table.

Each word has a unique ID, which is its row index in the table:
"---" → 0, "deep" → 1, "learning" → 2, "is" → 3, "popular" → 4

We need to find a “good” table!
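A minimal sketch of the lookup described above, assuming a random 5×3 embedding table (the table values and word IDs are illustrative):

```python
import numpy as np

vocab = ["---", "deep", "learning", "is", "popular"]        # word ID = row index
table = np.random.randn(len(vocab), 3).astype(np.float32)   # 5 words x 3 features

def one_hot(i, n):
    v = np.zeros(n, dtype=np.float32)
    v[i] = 1.0
    return v

word_id = vocab.index("learning")                   # 2
via_matmul = one_hot(word_id, len(vocab)) @ table   # one-hot vector x table
via_lookup = table[word_id]                         # direct row indexing
assert np.allclose(via_matmul, via_lookup)          # both give the same embedding
print(via_lookup)
```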

Page 11: Recurrent Neural Networks - GitHub Pages

Before We Begin: Word Representation

• Ideal Word Embedding

11

• Low dimension == high-level features to represent the words
• Contains semantic information about the words

Similar words, such as "cat"-"tiger" and "water"-"liquid", should have similar word representations.

Semantic information allows semantic operations:

King − Man + Woman = Queen
Paris − France + Italy = Rome

The features in the word embedding carry information such as "gender" and "location".
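A minimal sketch of such a semantic operation with toy vectors, assuming an embedding table is already available and using cosine similarity to find the nearest word (all vectors below are invented for illustration):

```python
import numpy as np

# Toy 2-D embeddings, invented for illustration: dim 0 ~ "royalty", dim 1 ~ "gender".
emb = {
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.9, -0.8]),
    "man":   np.array([0.1, 0.8]),
    "woman": np.array([0.1, -0.8]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

query = emb["king"] - emb["man"] + emb["woman"]
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(query, emb[w]))
print(best)  # "queen"
```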

Page 12: Recurrent Neural Networks - GitHub Pages

Before We Begin: Word Representation

• Learn Word Embedding

12

• Use the text document as the training data without any labels.

• Find similar words by comparing the context.

• Existing algorithms learn the word embedding table by reading a large text document to find the patterns, which is one type of self-supervised learning.

• This is a blue bird
• This is a yellow bird
• This is a red bird
• This is a green bird

As the "color" words appear in similar positions in the sentences, we can group the "color" words together.

Page 13: Recurrent Neural Networks - GitHub Pages

Before We Begin: Word Representation

• Word2Vec

13

• Google 2013

• Word2Vec = Skip-Gram/CBOW + Negative Sampling

Page 14: Recurrent Neural Networks - GitHub Pages

Before We Begin: Word Representation

• Word2Vec

14

• Continuous Bag-of-Words (CBOW): predicts the middle word using the context.
• In "a blue bird on the tree", the context of "bird" is ["a", "blue", "on", "the", "tree"].

[Figure: in "this is a blue bird on the tree", the context words "a", "blue", "on", "the" are fed through a shared embedding table and combined to predict the middle word "bird"; CBOW maximizes the probability of the middle word. In this example we only consider 4 words on the left and right (2 on each side), i.e., the window size is 2.]

Note: there is only one word embedding table here; different inputs reuse the same table.
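A minimal CBOW forward pass, assuming averaged context embeddings and a softmax over the vocabulary (the shapes, random weights, and function names are illustrative; a real implementation would train these parameters):

```python
import numpy as np

vocab = ["this", "is", "a", "blue", "bird", "on", "the", "tree"]
V, D = len(vocab), 16
rng = np.random.default_rng(0)
embed = rng.normal(size=(V, D))        # shared input embedding table
out_w = rng.normal(size=(D, V))        # output projection

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def cbow_probs(context_words):
    """Average the context embeddings, then predict a distribution over the vocabulary."""
    ids = [vocab.index(w) for w in context_words]
    h = embed[ids].mean(axis=0)        # combine the context (shared table reused)
    return softmax(h @ out_w)

p = cbow_probs(["a", "blue", "on", "the"])
print(vocab[p.argmax()], p.max())      # untrained, so the prediction is arbitrary
```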

Page 15: Recurrent Neural Networks - GitHub Pages

Before We Begin: Word Representation

• Word2Vec

15

• Skip-Gram (SG) is the opposite of CBOW, but serves the same purpose.
• CBOW predicts the middle word using the context, while SG predicts the context using the middle word. In "a blue bird on the tree", the input of SG is "bird" and the outputs are ["a", "blue", "on", "the", "tree"].

[Figure: in "this is a blue bird on the tree", the middle word "bird" is used to predict the context words "a", "blue", "on", "the"; SG maximizes the probability of the context.]

Page 16: Recurrent Neural Networks - GitHub Pages

Before We Begin: Word Representation

• Word2Vec

16

[Figure: Skip-Gram with sigmoid outputs. The embedded input word "bird" is scored against every word in the vocabulary ("a", "blue", "on", "the", "tree", others); each score passes through a sigmoid, so context words receive probabilities close to 1 (e.g. 0.99, 0.98, 0.97) and non-context words close to 0 (e.g. 0.01, 0.00).]

• Skip-Gram (SG) is the opposite of CBOW, but serves the same purpose: CBOW predicts the middle word using the context, while SG predicts the context using the middle word. In "a blue bird on the tree", the input of SG is "bird" and the outputs are ["a", "blue", "on", "the", "tree"].

Page 17: Recurrent Neural Networks - GitHub Pages

Before We Begin: Word Representation

• Word2Vec

17

• Noise-Contrastive Estimation (NCE)

• Skip-Gram has multiple target outputs, so we use Sigmoid instead of Softmax. Each word in the vocabulary is treated as either a positive or a negative sample, and we classify each word independently.

• A large vocabulary leads to a large computational cost. We use Negative Sampling to speed up the computation of the loss function by randomly sampling N negative samples from the vocabulary.

• This method is called Noise-Contrastive Estimation:

E = −( Σ_{p ∈ pos} log y_p + Σ_{n ∈ neg} log(1 − y_n) )

(the first sum runs over the positive samples; the second runs over the N randomly selected negative samples)
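A minimal sketch of this loss for one skip-gram pair, assuming dot-product scores passed through a sigmoid and uniform negative sampling (the parameter names and the sampling scheme are simplified for illustration; Word2Vec actually samples negatives from a smoothed unigram distribution):

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, N = 1000, 64, 5                         # vocab size, embedding dim, negatives per pair
in_emb = rng.normal(scale=0.1, size=(V, D))   # input (center) embeddings
out_emb = rng.normal(scale=0.1, size=(V, D))  # output (context) embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(center_id, context_id):
    """Binary classification: the true context word vs. N random negatives."""
    neg_ids = rng.integers(0, V, size=N)      # simplified uniform negative sampling
    h = in_emb[center_id]
    y_pos = sigmoid(h @ out_emb[context_id])  # should be pushed towards 1
    y_neg = sigmoid(out_emb[neg_ids] @ h)     # should be pushed towards 0
    return -(np.log(y_pos) + np.log(1.0 - y_neg).sum())

print(negative_sampling_loss(center_id=3, context_id=17))
```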

Page 18: Recurrent Neural Networks - GitHub Pages

Sequential Data

18

Page 19: Recurrent Neural Networks - GitHub Pages

Sequential Data

19

Feed-forward Neural Networks: non-time-series problems such as image classification and object detection

Input

Hidden state

Output

Page 20: Recurrent Neural Networks - GitHub Pages

Sequential Data

20

One data input and many data outputs in order

Input

Hidden state

Output

Image captioning: input an image to generate a sentence

Page 21: Recurrent Neural Networks - GitHub Pages

Sequential Data

21

Multiple data inputs and a single data output

Input

Hidden state

Output

Sentence sentiment classification: input a sentence in order and output the probability of happiness.

Page 22: Recurrent Neural Networks - GitHub Pages

Sequential Data

22

Multiple data inputs and multiple data outputs

Input

Hidden state

Output

Language translation: input the entire sentence to the model before starting to generate the translated sentence

asynchronous (Seq2Seq)

Page 23: Recurrent Neural Networks - GitHub Pages

Sequential Data

23

Multiple data inputs and multiple data outputs

synchronous (simultaneous)

Input

Hidden state

Output

Weather prediction: input the information to the model at every time-step and output the predicted weather condition.

Page 24: Recurrent Neural Networks - GitHub Pages

Sequential Data

24

Recurrent Neural Nets
Store and process the temporal information

Input

Hidden state

Output

Page 25: Recurrent Neural Networks - GitHub Pages

Vanilla Recurrent Neural Network

25

Page 26: Recurrent Neural Networks - GitHub Pages

Vanilla Recurrent Neural Network

• Hidden Vector (State)

26

X_t: the input at time t

The hidden vector at time t is an input at time t + 1

The only difference between an RNN and an FNN: feed the information of the previous time-step to the next step.

Unrolled

The hidden vector at time t

Page 27: Recurrent Neural Networks - GitHub Pages

Vanilla Recurrent Neural Network

• Processing Temporal Information

27

I am from China, so I speak ________ (Chinese). The words are the inputs X_0, X_1, ..., X_7.

h_t = σ_h(W_h x_t + U_h h_{t−1} + b_h)
y_t = σ_y(W_y h_t + b_y)

W, U, b: weight matrices and bias vectors; σ_h, σ_y: activation functions, σ_h is tanh; y_t: the output at time t

h_0 is a zero vector; h_{t−1} contains the information of the previous time-step.

Pass information to the future

Unrolled
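A minimal NumPy sketch of the two equations above for one unrolled sequence, assuming tanh for σ_h and an identity output activation (shapes and random initial values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
D_in, D_h, D_out, T = 8, 16, 4, 5            # input/hidden/output sizes, sequence length
W_h = rng.normal(scale=0.1, size=(D_h, D_in))
U_h = rng.normal(scale=0.1, size=(D_h, D_h))
b_h = np.zeros(D_h)
W_y = rng.normal(scale=0.1, size=(D_out, D_h))
b_y = np.zeros(D_out)

def rnn_forward(xs):
    """xs: list of T input vectors. Returns the outputs y_1..y_T."""
    h = np.zeros(D_h)                        # h_0 is a zero vector
    ys = []
    for x in xs:
        h = np.tanh(W_h @ x + U_h @ h + b_h) # h_t = sigma_h(W_h x_t + U_h h_{t-1} + b_h)
        ys.append(W_y @ h + b_y)             # y_t = sigma_y(W_y h_t + b_y), identity here
    return ys

outputs = rnn_forward([rng.normal(size=D_in) for _ in range(T)])
print(len(outputs), outputs[-1].shape)       # 5 (4,)
```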

Page 28: Recurrent Neural Networks - GitHub Pages

Vanilla Recurrent Neural Network

• Limitation: Long-Term Dependency Problem

28

It is difficult to maintain information over a long time span.

I am from China, and I have lived in the UK and US for 10 years; my mother tongue is ________?

Page 29: Recurrent Neural Networks - GitHub Pages

Long Short-Term Memory

29

Page 30: Recurrent Neural Networks - GitHub Pages

Long Short-Term Memory

• Designed to solve the Long-Term Dependency Problem

30

[Figure: Vanilla RNN cell vs. LSTM cell]

Page 31: Recurrent Neural Networks - GitHub Pages

Long Short-Term Memory

• Gate Function

31

[Figure: element-wise gating. Input vector [0.53, −0.32, 0.72, 0.45, 1.23, −0.24] ⊙ gate vector [0.01, 0.99, 0.98, 0.04, 0.98, 0.02] = output vector [0.0053, −0.3168, 0.7056, 0.018, 1.2054, −0.0048].]

• An RNN has a hidden vector; an LSTM has both a hidden vector and a cell vector.

• The values in the gate vector vary from 0 to 1: 0 means "closed" and 1 means "open."

• The gate "filters" the information by multiplying the input vector and the gate vector element-wise.

Page 32: Recurrent Neural Networks - GitHub Pages

Long Short-Term Memory

• Forget Gate

32

Cell vector

• Compute the forget gate vector:

f_t = sigmoid([h_{t−1}, x_t] W_f + b_f)

[h_{t−1}, x_t] concatenates the two vectors

Page 33: Recurrent Neural Networks - GitHub Pages

Long Short-Term Memory

• Input Gate

33

• Compute the input gate vector: i_t = sigmoid([h_{t−1}, x_t] W_i + b_i)

• Compute the information vector: C̃_t = tanh([h_{t−1}, x_t] W_C + b_C)

• Compute the new cell vector: C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t
  (the first term forgets previous information; the second inputs new information)

Page 34: Recurrent Neural Networks - GitHub Pages

Long Short-Term Memory

• Output Gate

34

• Compute the output gate vector: o_t = sigmoid([h_{t−1}, x_t] W_o + b_o)

• Compute the new hidden vector: h_t = o_t ⊙ tanh(C_t)
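A minimal NumPy sketch of one LSTM step following the gate equations above, with the concatenation [h_{t−1}, x_t] written explicitly (the weight shapes and random initialisation are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
D_in, D_h = 8, 16
Dz = D_h + D_in                                   # size of [h_{t-1}, x_t]
W_f, W_i, W_c, W_o = (rng.normal(scale=0.1, size=(Dz, D_h)) for _ in range(4))
b_f, b_i, b_c, b_o = (np.zeros(D_h) for _ in range(4))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])             # [h_{t-1}, x_t]
    f_t = sigmoid(z @ W_f + b_f)                  # forget gate
    i_t = sigmoid(z @ W_i + b_i)                  # input gate
    c_tilde = np.tanh(z @ W_c + b_c)              # information vector
    c_t = f_t * c_prev + i_t * c_tilde            # new cell vector
    o_t = sigmoid(z @ W_o + b_o)                  # output gate
    h_t = o_t * np.tanh(c_t)                      # new hidden vector
    return h_t, c_t

h, c = np.zeros(D_h), np.zeros(D_h)
for _ in range(5):                                # unroll over a 5-step sequence
    h, c = lstm_step(rng.normal(size=D_in), h, c)
print(h.shape, c.shape)                           # (16,) (16,)
```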

Page 35: Recurrent Neural Networks - GitHub Pages

Long Short-Term Memory

• Questions

35

• Could we use ReLU to replace Sigmoid in the gate function?

• Why do we use tanh rather than Sigmoid when inputting information into a vector?

Page 36: Recurrent Neural Networks - GitHub Pages

Long Short-Term Memory

• Variants of LSTM

36

• LSTM was invented in 1997, and several variants of LSTM exist, including the Gated Recurrent Unit (GRU). However, Greff et al. [1] analysed eight LSTM variants on three representative tasks (speech recognition, handwriting recognition, and polyphonic music modelling) and summarised the results of 5,400 experimental runs (representing 15 years of CPU time). This review suggests that none of the LSTM variants provides a significant improvement over the standard LSTM.

• The Gated Recurrent Unit (GRU) does not have a cell state, which reduces the computational cost and memory usage compared with the LSTM.

[1] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, "LSTM: A search space odyssey," IEEE TNNLS, 2017.

Page 37: Recurrent Neural Networks - GitHub Pages

Time-series Applications

42

Page 38: Recurrent Neural Networks - GitHub Pages

Time-series Applications

• Many-to-one: Sentence Sentiment Classification

43

• Use the last output to compute the loss.

• Stack a fully connected layer with Softmax on top of the hidden vector.

[Figure: the words "I", "Feel", "Bad" are fed in one by one; the last hidden vector goes through the fully connected + Softmax layer, giving Positive 0.12 / Negative 0.88.]
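A minimal sketch of this many-to-one setup: run a recurrent cell over the sequence, keep only the last hidden vector, and stack a dense + Softmax classifier on top (all shapes, random weights, and word vectors below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
D_in, D_h, n_classes = 8, 16, 2
W_h = rng.normal(scale=0.1, size=(D_h, D_in))
U_h = rng.normal(scale=0.1, size=(D_h, D_h))
W_fc = rng.normal(scale=0.1, size=(n_classes, D_h))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classify_sentence(word_vectors):
    """Many-to-one: run the RNN over the sequence, use only the last hidden vector."""
    h = np.zeros(D_h)
    for x in word_vectors:
        h = np.tanh(W_h @ x + U_h @ h)
    return softmax(W_fc @ h)          # e.g. [P(positive), P(negative)]

sentence = [rng.normal(size=D_in) for _ in range(3)]   # "I", "Feel", "Bad" as vectors
print(classify_sentence(sentence))
```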

Page 39: Recurrent Neural Networks - GitHub Pages

Time-series Applications

• One-to-many: Image Captioning

44

[Figure: given an image, the model generates "This", "is", "a", "bird", <EOS>, feeding each output word back in as the next input.]

• Use the output of every time-step as the input of its next step.

• Terminate the process when the output is a special token for end of sentence (EOS).

• Use all outputs to compute the loss, e.g., the averaged cross-entropy of all outputs.
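A minimal sketch of this decoding loop, assuming a toy recurrent step function and greedy word selection; the vocabulary, the random weights, the `step` helper, and the way the image feature initialises the hidden state are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<EOS>", "This", "is", "a", "bird"]
V, D = len(vocab), 16
embed = rng.normal(scale=0.1, size=(V, D))       # word embedding table
W_h = rng.normal(scale=0.1, size=(D, D))
U_h = rng.normal(scale=0.1, size=(D, D))
W_out = rng.normal(scale=0.1, size=(V, D))

def step(x, h):
    h = np.tanh(W_h @ x + U_h @ h)
    return h, W_out @ h                          # new hidden state, scores over vocab

def caption(image_feature, max_len=10):
    h, x = image_feature, embed[0]               # image feature initialises the state
    words = []
    for _ in range(max_len):
        h, scores = step(x, h)
        word_id = int(scores.argmax())           # greedy: pick the most likely word
        if vocab[word_id] == "<EOS>":            # stop at the end-of-sentence token
            break
        words.append(vocab[word_id])
        x = embed[word_id]                       # feed the output back as next input
    return " ".join(words)

print(caption(rng.normal(size=D)))               # untrained, so the output is arbitrary
```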

Page 40: Recurrent Neural Networks - GitHub Pages

Time-series Applications

45

• Synchronous Many-to-Many: Traffic Counting

• During training, a pre-defined sequence length is required to compute the loss.

• During testing, input data samples one by one.

[Figure: outputs 5, 6, 7, 6 are produced at the four time-steps; sequence length = 4.]

Page 41: Recurrent Neural Networks - GitHub Pages

Time-series Applications

• Synchronous Many-to-Many: Text Generation/Language Modelling

46

[Figure: input sequence "this is a bird"; target sequence "is a bird <EOS>".]

During testing, input "this" and the model generates the rest of the sentence.

• The target output of each step is equal to the input of its next step.
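A minimal sketch of how the training pairs are built by shifting the token sequence by one step (the helper name is illustrative):

```python
def make_lm_pairs(tokens):
    """Language-model training data: the target at step t is the input at step t+1."""
    inputs = tokens[:-1]
    targets = tokens[1:]
    return list(zip(inputs, targets))

print(make_lm_pairs(["this", "is", "a", "bird", "<EOS>"]))
# [('this', 'is'), ('is', 'a'), ('a', 'bird'), ('bird', '<EOS>')]
```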

Page 42: Recurrent Neural Networks - GitHub Pages

Time-series Applications

• Asynchronous Many-to-Many (Seq2Seq): Chatbot

47

[Figure: the encoder reads "How are you"; the decoder then generates "I'm fine <EOS>".]

• Encode the input sequential data before starting to output the sequential result.

Page 43: Recurrent Neural Networks - GitHub Pages

Time-series Applications

• Asynchronous Many-to-Many: Chatbot

48

[Figure: encoder input: "How are you"; decoder input: "<START> I'm fine"; target output: "I'm fine <EOS>".]

• Encode the sequential input data before starting to output the result.

• During training, add an EOS token to the target output and a START token to the decoder input.
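A minimal sketch of preparing one training example for this encoder-decoder setup with teacher forcing; the "<START>"/"<EOS>" tokens match the slide, while the function name and whitespace tokenisation are illustrative:

```python
def make_seq2seq_example(question, answer):
    """Teacher forcing: the decoder sees the previous target token as its input."""
    encoder_input = question.split()
    decoder_input = ["<START>"] + answer.split()
    target_output = answer.split() + ["<EOS>"]
    return encoder_input, decoder_input, target_output

enc, dec, tgt = make_seq2seq_example("How are you", "I'm fine")
print(enc)  # ['How', 'are', 'you']
print(dec)  # ['<START>', "I'm", 'fine']
print(tgt)  # ["I'm", 'fine', '<EOS>']
```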

Page 44: Recurrent Neural Networks - GitHub Pages

Summary

49

Page 45: Recurrent Neural Networks - GitHub Pages

• Motivation
  • Time-series data

• Before We Begin: Word Representation
  • One-hot vector, BOW, word embedding, Word2Vec, CBOW, Skip-Gram, negative sampling, NCE

• Sequential Data
  • One-to-many, many-to-one, asynchronous many-to-many, synchronous many-to-many

• Vanilla Recurrent Neural Network
  • Hidden vector (state), long-term dependency problem

• Long Short-Term Memory
  • Cell vector (state), forget gate, input gate, output gate

• Time-series Applications
  • One-to-many, many-to-one, asynchronous many-to-many, synchronous many-to-many
  • Details of training and testing (inference)

50

Recurrent Neural Networks

Page 46: Recurrent Neural Networks - GitHub Pages

51

Recurrent Neural Networks

• References

• Word2Vec Parameter Learning Explained. Xin Rong. arXiv, 2016.

• Deep Learning, NLP, and Representations. Christopher Olah (colah's blog), 2014.

• Natural Language Processing with Deep Learning. Christopher Manning, Richard Socher. Stanford University, 2017.

Page 47: Recurrent Neural Networks - GitHub Pages

Thanks

52