www.thalesgroup.com OPEN

Deep recurrent neural networks for Sequence Learning in Spark

YVES MABIALA

This document may not be reproduced, modified, adapted, published, translated, in any way, in whole or in part or disclosed to a third party without the prior written consent of Thales - © Thales 2015 All rights reserved.

Outline

▌Thales & Big Data

▌On the difficulty of Sequence Learning

▌Deep Learning for Sequence Learning

▌Spark implementation of Deep Learning

▌Use cases

- Predictive maintenance
- NLP

Thales & Big Data

Thales systems produce a huge quantity of data

- Transportation systems (ticketing, supervision, …)
- Security (radar traces, network logs, …)
- Satellite (photos, videos, …)

which is often

- Massive
- Heterogeneous
- Extremely dynamic

and where understanding the dynamics of the monitored phenomena is mandatory ⇒ Sequence Learning

What is sequence learning?

Sequence learning refers to a set of ML tasks where a model has to deal with sequences as input, produce sequences as output, or both

Goal: understand the dynamics of a sequence to

- Classify
- Predict
- Model

Typical applications

Text
- Classify texts (sentiment analysis)
- Generate textual descriptions of images (image captioning)

Video
- Video classification

Speech
- Speech to text

How is it typically handled?

Taking the dynamics into account is difficult

Often people do not bother
- E.g. text analysis using bag of words (one-hot encoding)
– Problem for certain tasks such as sentiment classification (the order of the words is important)

Or use popular statistical approaches
- (Hidden) Markov model for prediction (and classification)
– Short term dependency (order 1): $P(X_k \mid X_{k-1}, \dots, X_1) = P(X_k \mid X_{k-1})$
- Autoregressive approaches for time series forecasting

Bag of words example

                        The  is  chair  red  young  cat  on  a
The chair is red        1    1   1      1    0      0    0   0
The cat is young        1    1   0      0    1      1    0   0
The cat is on a chair   1    1   1      0    0      1    1   1
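
The bag-of-words encoding above can be sketched in a few lines of Python. This is only an illustration: the function name `bag_of_words` and the lower-casing of words are assumptions made here, not something taken from the slides. It shows why word order is lost: two sentences with the same words in a different order get the same vector.

```python
# Vocabulary in the column order used on the slide (lower-cased here, an assumption).
VOCAB = ["the", "is", "chair", "red", "young", "cat", "on", "a"]


def bag_of_words(sentence, vocab=VOCAB):
    """Return a binary presence vector: 1 if the vocabulary word occurs in the sentence."""
    words = set(sentence.lower().split())
    return [1 if w in words else 0 for w in vocab]


sentences = ["The chair is red", "The cat is young", "The cat is on a chair"]
vectors = {s: bag_of_words(s) for s in sentences}

# Word order is lost: a reordered sentence maps to the same vector.
same = bag_of_words("The cat is on a chair") == bag_of_words("a chair is on the cat")
```

Here `same` is true even though the two sentences mean different things, which is the limitation that matters for tasks such as sentiment classification.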

Link with artificial neural networks?

Artificial neural networks are statistical models inspired by the brain
- Transform the input by applying (non-linear) functions at each layer
- More layers mean more capabilities (hidden layers: Deep Learning)

Set of transformation and activation operations
- Affine: $Y = f(W^t X)$, with sigmoid activation $f(x) = \frac{1}{1 + e^{-x}}$ or tanh activation $f(x) = \tanh(x)$
- Convolutional: apply a spatial convolution on the 1D/2D input (signal, image, …)
– Learns spatial features used for classification or prediction (mostly on images/videos)
- Recurrent: learn dependencies between successive observations (features related to the dynamics)

Objective

Find the best weights W to minimize the difference between the predicted output and the desired one (using the back-propagation algorithm)

[Figure: network with an input layer, hidden layers and an output layer]

Recurrent Neural Network basics

Artificial neural networks with one or more recurrent layers

- Classical neural network: $Y_k = f(W^t X_k)$
- Recurrent neural network: $Y_k = f(W^t X_k + H Y_{k-1})$

[Figure: a recurrent network unrolled through time, with inputs $X_{k-3}, X_{k-2}, X_{k-1}, X_k$ and outputs $Y_{k-3}, Y_{k-2}, Y_{k-1}, Y_k$]

Able to cope with varying size sequences either at the input or at the output

- One to many (fixed size input, sequence output), e.g. image captioning
- Many to many (sequence input to sequence output), e.g. speech to text
- Many to one (sequence input to fixed size output), e.g. text classification
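
As a rough illustration of the recurrence $Y_k = f(W^t X_k + H Y_{k-1})$, here is a minimal pure-Python sketch. The helper names (`matvec`, `rnn_forward`), the toy weights, the choice of tanh for $f$ and the zero initial state are all assumptions made for the example. It is not the Spark implementation presented in this deck.

```python
import math


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def rnn_forward(xs, W_t, H, f=math.tanh):
    """Run Y_k = f(W^t X_k + H Y_{k-1}) over a sequence, starting from a zero state.

    W_t is the input weight matrix already transposed (hidden x input),
    H is the recurrent weight matrix (hidden x hidden).
    """
    y = [0.0] * len(H)
    outputs = []
    for x in xs:
        pre = [a + b for a, b in zip(matvec(W_t, x), matvec(H, y))]
        y = [f(p) for p in pre]
        outputs.append(y)
    return outputs


# Hypothetical toy weights: 2 inputs, 2 hidden units.
W_t = [[0.5, -0.3], [0.1, 0.8]]
H = [[0.2, 0.0], [0.0, 0.2]]

short_seq = [[1.0, 0.0], [0.0, 1.0]]
long_seq = short_seq + [[1.0, 1.0], [0.5, 0.5]]

# Many to one: keep only the last state, whatever the sequence length.
last_short = rnn_forward(short_seq, W_t, H)[-1]
last_long = rnn_forward(long_seq, W_t, H)[-1]

# Many to many: keep one output per time step.
all_outputs = rnn_forward(long_seq, W_t, H)
```

The same weights handle sequences of length 2 and 4, which is what lets a recurrent layer cope with varying sequence sizes. Setting `H` to zero removes the dependence on $Y_{k-1}$ and gives back the classical network $Y_k = f(W^t X_k)$.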

On the difficulty of training recurrent networks

RNNs are (were) known to be difficult to train

More weights and more computational steps
- More computationally expensive (accelerator needed for matrix operations: BLAS or GPU)
- More data needed to converge (scalability over Big Data architectures: Spark)
– Theano, TensorFlow, Caffe do not have distributed versions

Unable to learn long range dependencies (Graves et al. 2014)
- At a given time t, the RNN does not remember observations from the distant past ⇒ New RNN architectures with memory preservation (more context)

[Figure: LSTM and GRU cells]
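
The slides name the GRU without giving its equations. The sketch below uses one common formulation (update gate `z`, reset gate `r`, candidate state, biases omitted). Conventions differ between papers on whether `z` or `1 - z` multiplies the old state, so treat the parameter names and this convention as assumptions. The point is only to show how a gate can preserve memory over many steps.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def gru_step(x, h, p):
    """One GRU update. p holds six weight matrices (biases omitted for brevity)."""
    # Update gate z: how much of the state is overwritten.
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], x), matvec(p["Uz"], h))]
    # Reset gate r: how much of the previous state feeds the candidate.
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], x), matvec(p["Ur"], h))]
    rh = [ri * hi for ri, hi in zip(r, h)]
    # Candidate state computed from the input and the reset-scaled state.
    c = [math.tanh(a + b) for a, b in zip(matvec(p["Wh"], x), matvec(p["Uh"], rh))]
    # Interpolate between the old state and the candidate.
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, c)]


# Hypothetical 1-unit cell whose update gate is almost closed (z close to 0).
zero = [[0.0]]
params = {"Wz": [[-50.0]], "Uz": zero, "Wr": zero, "Ur": zero, "Wh": [[1.0]], "Uh": zero}

h = [0.7]
for _ in range(100):
    h = gru_step([1.0], h, params)
```

With the update gate nearly closed, the state written before the loop is still essentially intact after 100 steps. A simple recurrent layer, which rewrites its state at every step, cannot do this.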

Recurrent neural networks in Spark

Spark implementation of DL algorithms (data parallel)

All the needed blocks
- Affine, convolutional, recurrent layers (Simple and GRU)
- SGD, RMSProp, AdaDelta optimizers
- Sigmoid, tanh, ReLU activations

CPU (and GPU backend)

Fully compatible with the existing DL library in Spark ML

Performance

On a 6-node cluster (CPU)
- 5.46 average speedup (some communication overhead)
– About the same speedup as the MLP in Spark ML

[Figure: the driver broadcasts the model to Workers 1, 2 and 3 (1); the workers send the resulting gradients back to the driver (2)]
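
The broadcast/gradient loop in the diagram can be imitated without Spark. The sketch below is plain Python and uses no Spark API: the "workers" are just list partitions, the model is a single weight `w`, and the way the driver combines gradients (an average weighted by partition size) is an assumption, since the slides do not say how the aggregation is done.

```python
def gradient(w, batch):
    """Gradient of the mean squared error of y = w * x on one data partition."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def data_parallel_step(w, partitions, lr):
    # (1) The driver broadcasts the current model w to every worker.
    # (2) Each worker computes a gradient on its own partition.
    grads = [gradient(w, part) for part in partitions]
    sizes = [len(part) for part in partitions]
    # The driver combines the gradients, weighted by partition size, and updates.
    g = sum(gr * n for gr, n in zip(grads, sizes)) / sum(sizes)
    return w - lr * g


# Hypothetical data following y = 3x, split over three "workers".
data = [(float(x), 3.0 * x) for x in range(1, 10)]
partitions = [data[0:3], data[3:6], data[6:9]]

w = 0.0
for _ in range(200):
    w = data_parallel_step(w, partitions, lr=0.01)
```

Because the size-weighted average of the per-partition gradients equals the gradient over the whole dataset, each step matches a single-machine step. The cost of shipping the model and the gradients at every step is the kind of communication overhead that keeps the measured speedup below the node count.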

Use case 1 : predictive maintenance (1)

Context

Thales and its clients build systems in different domains
- Transportation (ticketing, controlling), Defense (radar), Satellites

Need better and more accurate maintenance services
- From planned maintenance (every x days) to alert-based maintenance
- From expert detection to automatic failure prediction
- From whole subsystem changes to more localized repairs

Goal

Detect early signs of a (sub)system failure using data coming from sensors monitoring the health of a system (HUMS)

Use case 1 : predictive maintenance (2)

Example on a real system

20 sensors (20 values every 5 minutes), label (failure or not)

Take 3 hours of data and predict the probability of failure in the next hour (fully customizable)

Learning using MLlib
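
With one reading every 5 minutes, 3 hours of history is 36 time steps of 20 values and the next hour is 12 steps. One possible way to turn a sensor log into training samples is sketched below. The function name `make_windows`, the toy log and the labelling rule (a window is positive if any failure occurs within the horizon) are assumptions, not details taken from the real system.

```python
STEP_MINUTES = 5
N_SENSORS = 20
HISTORY_STEPS = 3 * 60 // STEP_MINUTES   # 3 hours of readings -> 36 steps
HORIZON_STEPS = 60 // STEP_MINUTES       # next hour -> 12 steps


def make_windows(readings, failures, history=HISTORY_STEPS, horizon=HORIZON_STEPS):
    """Build (sequence, label) pairs from a time-ordered log.

    readings: list of per-step sensor vectors; failures: list of 0/1 flags per step.
    The label is 1 if any failure occurs in the `horizon` steps after the window.
    """
    samples = []
    for end in range(history, len(readings) - horizon + 1):
        window = readings[end - history:end]
        label = int(any(failures[end:end + horizon]))
        samples.append((window, label))
    return samples


# Hypothetical log: 60 steps (5 hours) of constant readings, one failure at step 50.
readings = [[0.0] * N_SENSORS for _ in range(60)]
failures = [0] * 60
failures[50] = 1
samples = make_windows(readings, failures)
```

Each sample is a 36 x 20 sequence with a binary label, the shape a many-to-one recurrent network expects. Changing `history` and `horizon` is what makes the setup customizable.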

Use case 1 : predictive maintenance (3)

Recurrent net learning

Impact of recurrent nets

Logistic regression
- 70% detection with 70% accuracy

Recurrent Neural Network
- 85% detection with 75% accuracy

Use case 2 : Sentiment analysis (1)

Context

Social network analysis application developed at Thales (Twitter, Facebook, blogs, forums)
- Analyze both the content of the texts and the relations (texts, actors)

Multiple (big data) analyses
- Actor community detection
- Text clustering (themes)
- …

Focus on

Sentiment analysis on the collected texts
- Classify texts based on their sentiment

Use case 2 : Sentiment analysis (2)

Learning dataset

Sentiment140 + Kaggle challenge (1.5M labeled tweets)
- 50% positives, 50% negatives

Compare bag of words + traditional classifiers (Naïve Bayes, SVM, logistic regression) versus RNN

Use case 2 : Sentiment analysis (3)

Results

Training size  NB    SVM   Log Reg  Neural Net (perceptron)  RNN (GRU)
100            61.4  58.4  58.4     55.6                     NA
1 000          70.6  70.6  70.6     70.8                     68.1
10 000         75.4  75.1  75.4     76.1                     72.3
100 000        78.1  76.6  76.9     78.5                     79.2
700 000        80    78.3  78.3     80                       84.1

[Figure: results for NB, SVM, Log Reg, Neural Net and RNN (GRU) at 100, 1 000, 10 000, 100 000 and 700 000 examples; vertical axis from 40 to 90]

The end…

THANK YOU !
