Deep Learning for NLP

@ODSC

Thomas Delteil https://www.linkedin.com/in/thomasdelteil

Miguel Fierro @miguelgfierro

https://miguelgfierro.com

ODSC 2016 London – Thomas Delteil linkedin.com/in/thomasdelteil & Miguel Fierro @miguelgfierro

NLP

NLP with CNN

Operationalization


Interaction between computers

and human language


NLP

- Machine translation
- OCR
- Q&A
- Sentiment Analysis
- Speech Recognition
- T2S
- Topic Modelling
- Information Retrieval
- Natural Language Understanding
- Document Classification


£1.3T value of company data (source: IDC, 2014)

10% of organizations expect to commercialise their data by 2020 (source: Gartner, 2016)


8.4 PB of information per second as of 2020 (source: business2community, 2016)

70% of companies use customer feedback (source: business2community, 2016)


TOPIC 1: Spaghetti, Milk, Eating, Broccoli

TOPIC 2: Kitten, Puppy, Hamster, Eating

… my favourite dish is spaghetti …

… the cute hamster is eating broccoli …

… I love kittens …


Generative models joint distribution

source: https://en.wikipedia.org/wiki/Hidden_Markov_model


Conditional models conditional distribution

source: John Lafferty, Andrew McCallum, Fernando C.N. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. ICML, 2001.
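As a hedged illustration of the two families named above (the notation is an assumption of this note, not taken from the slides): with labels y_1 … y_T and observations x_1 … x_T, a hidden Markov model factorizes the joint distribution, while a linear-chain conditional random field models the conditional distribution directly.

```latex
% Generative (HMM): joint distribution.
% p(y_1 | y_0) is read as the initial state distribution.
p(x, y) = \prod_{t=1}^{T} p(y_t \mid y_{t-1}) \, p(x_t \mid y_t)

% Conditional (linear-chain CRF): conditional distribution.
% f_k are feature functions, \lambda_k their weights, Z(x) the normalizer.
p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \Big)
```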


source: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
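A minimal sketch of tf-idf weighting, assuming the common variant tf = count / document length and idf = log(N / document frequency); the slide does not say which variant was used, and the helper name `tf_idf` is hypothetical. The toy documents reuse the example sentences from the topic slide.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    "my favourite dish is spaghetti".split(),
    "the cute hamster is eating broccoli".split(),
    "i love kittens".split(),
]
weights = tf_idf(docs)
```

A term such as "is", which occurs in two of the three documents, ends up with a lower weight than "spaghetti", which occurs in only one.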


Bag of n-grams instead of bag of words

source: A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification, 2016

https://arxiv.org/abs/1607.01759
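A small sketch of what "bag of n-grams" means in practice, assuming whitespace tokenization and n up to 2; the function name `bag_of_ngrams` is made up for this example and is not the fastText API.

```python
from collections import Counter


def bag_of_ngrams(tokens, n_max=2):
    """Count every contiguous n-gram of length 1..n_max."""
    bag = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            bag[" ".join(tokens[i:i + n])] += 1
    return bag


bag = bag_of_ngrams("the cute hamster is eating broccoli".split())
```

Unlike a plain bag of words, the bigram "cute hamster" is kept as its own feature, so some local word order survives.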


Feature generation

Great performance

Needs GPUs and lots of data


wait, wait, wait…

What makes deep learning

deep?

[Figure: shallow network: input → hidden → output]


[Figure: deep network: input → hidden → hidden → hidden → output]


source: R. Rojas: Neural Networks, Springer-Verlag, Berlin, 1996


source: https://en.wikipedia.org/wiki/Maxima_and_minima


[Figure: network with input, hidden, and output layers shown across time steps ti, ti+1, ti+2, ti+3]


number of layers


source: https://en.wikipedia.org/wiki/Long_short-term_memory


[Figure: Input image → Convolution → Pooling → Convolution → Pooling → Fully connected → Fully connected → Output predictions (7)]


Sharpening filter

Laplacian filter

Sobel x-axis filter
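A minimal sketch of how such filters are applied, assuming the standard 3x3 kernels usually associated with these three names (the slide shows only the resulting images, so the exact kernel values are an assumption). The helper `convolve2d` is hypothetical and computes a "valid" cross-correlation, which is what CNN layers compute in practice.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Commonly used kernel values (assumed, not read off the slide).
SHARPEN = [[0, -1, 0], [-1, 5, -1], [0, -1, 0]]
LAPLACIAN = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]

# Toy 4x4 image with a vertical edge between columns 1 and 2.
edge_image = [[10, 10, 200, 200] for _ in range(4)]
# Toy 4x4 image with constant intensity.
flat_image = [[7] * 4 for _ in range(4)]

edges = convolve2d(edge_image, SOBEL_X)
flat_laplacian = convolve2d(flat_image, LAPLACIAN)
flat_sharpened = convolve2d(flat_image, SHARPEN)
```

The Sobel kernel responds strongly to the vertical edge, the Laplacian gives zero on a constant region, and the sharpening kernel leaves a constant region unchanged.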


Max pooling with 2x2 kernel and stride of 2x2
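A short sketch of max pooling with a 2x2 kernel and a stride of 2 in each direction; the function name and the toy 4x4 input are illustrative assumptions.

```python
def max_pool2d(image, size=2, stride=2):
    """Keep the maximum of each size x size window."""
    out_h = (len(image) - size) // stride + 1
    out_w = (len(image[0]) - size) // stride + 1
    return [
        [
            max(image[r * stride + i][c * stride + j]
                for i in range(size) for j in range(size))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


image = [
    [1, 3, 2, 4],
    [5, 6, 1, 0],
    [7, 2, 9, 8],
    [0, 1, 3, 4],
]
pooled = max_pool2d(image)
```

Each spatial dimension is halved: the 4x4 input becomes a 2x2 output that holds the largest value of each block.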


input hidden output


Softmax, ReLU, tanh
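A minimal sketch of the two element-wise activations in this list, applied to a few arbitrary sample values. Softmax is left out here because it normalizes a whole vector rather than acting on one value at a time.

```python
import math


def relu(x):
    """Rectified linear unit: negative inputs become 0."""
    return max(0.0, x)


values = [-1.0, -0.4, 0.0, 3.2]
relu_out = [relu(v) for v in values]
# tanh squashes every input into the open interval (-1, 1).
tanh_out = [math.tanh(v) for v in values]
```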


When I read some of the rules

for speaking the English

language correctly, I think any

fool can make a rule, and every

fool will mind it

Henry David Thoreau

?

122 122 112 90 5 10 21

121 122 112 11 6 11 21

120 118 6 10 11 12 23

118 4 6 5 23 23 23

4 6 1 23 23 21 23

4 5 20 24 23 21 23

source: Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, Li Deng, Gerald Penn, and Dong Yu. Convolutional Neural Networks for Speech Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2014


        O D S C - U K   N L P
space   0 0 0 0 0 0 0 1 0 0 0
-       0 0 0 0 1 0 0 0 0 0 0
.       0 0 0 0 0 0 0 0 0 0 0
A       0 0 0 0 0 0 0 0 0 0 0
B       0 0 0 0 0 0 0 0 0 0 0
C       0 0 0 1 0 0 0 0 0 0 0
D       0 1 0 0 0 0 0 0 0 0 0
E       0 0 0 0 0 0 0 0 0 0 0
F       0 0 0 0 0 0 0 0 0 0 0
G       0 0 0 0 0 0 0 0 0 0 0
H       0 0 0 0 0 0 0 0 0 0 0
I       0 0 0 0 0 0 0 0 0 0 0
J       0 0 0 0 0 0 0 0 0 0 0
K       0 0 0 0 0 0 1 0 0 0 0
L       0 0 0 0 0 0 0 0 0 1 0
M       0 0 0 0 0 0 0 0 0 0 0
N       0 0 0 0 0 0 0 0 1 0 0
O       1 0 0 0 0 0 0 0 0 0 0
P       0 0 0 0 0 0 0 0 0 0 1
Q       0 0 0 0 0 0 0 0 0 0 0
R       0 0 0 0 0 0 0 0 0 0 0
S       0 0 1 0 0 0 0 0 0 0 0
T       0 0 0 0 0 0 0 0 0 0 0
U       0 0 0 0 0 1 0 0 0 0 0
V       0 0 0 0 0 0 0 0 0 0 0
W       0 0 0 0 0 0 0 0 0 0 0
X       0 0 0 0 0 0 0 0 0 0 0
Y       0 0 0 0 0 0 0 0 0 0 0
Z       0 0 0 0 0 0 0 0 0 0 0

One-hot encoding over a vocabulary of characters.

Encoding:

Text = “ODSC-UK NLP”

Vocab: [ ‘ ’, ‘-’, ‘.’, ‘A’, ‘B’, ‘C’, …, ‘Z’ ]
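A minimal sketch of this character-level one-hot encoding. It assumes the elided part of the vocabulary is simply the remaining uppercase letters, giving 29 symbols; the function name `one_hot` is hypothetical.

```python
import string

# Space, hyphen, full stop, then A..Z (assumed reading of the elided vocab).
VOCAB = [" ", "-", "."] + list(string.ascii_uppercase)


def one_hot(text, vocab=VOCAB):
    """Return a len(vocab) x len(text) matrix of 0/1 values."""
    index = {ch: i for i, ch in enumerate(vocab)}
    matrix = [[0] * len(text) for _ in vocab]
    for col, ch in enumerate(text):
        row = index.get(ch)
        if row is not None:  # characters outside the vocab stay all-zero
            matrix[row][col] = 1
    return matrix


matrix = one_hot("ODSC-UK NLP")
```

Each column holds exactly one 1, in the row of the character at that position, which matches the table shown on the slide.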


For images:

For text:

- Humans to rephrase the examples
- Synonyms
- Similar semantic meaning
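A hedged sketch of the synonym option for text. The tiny `THESAURUS` dictionary and the `augment` helper are hypothetical; a real pipeline would draw synonyms from a proper lexical resource, and the slides do not describe the replacement policy.

```python
import random

# Hypothetical toy thesaurus, for illustration only.
THESAURUS = {
    "cute": ["adorable", "sweet"],
    "love": ["adore"],
    "favourite": ["preferred"],
}


def augment(sentence, thesaurus=THESAURUS, p=0.5, rng=None):
    """Replace each word that has synonyms with probability p."""
    rng = rng or random.Random(0)
    words = []
    for word in sentence.split():
        synonyms = thesaurus.get(word)
        if synonyms and rng.random() < p:
            words.append(rng.choice(synonyms))
        else:
            words.append(word)
    return " ".join(words)


original = "the cute hamster is eating broccoli"
augmented = augment(original, p=1.0)
unchanged = augment(original, p=0.0)
```

The augmented sentence keeps its label and length; only words with a thesaurus entry can change.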


source: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. NIPS 2015

http://arxiv.org/abs/1509.01626

[Same one-hot matrix as above for “ODSC-UK NLP”, with the character columns continuing up to index 1013; the remaining columns are elided (…) on the slide]


Convolution: 69x1014x1 -> 1x1008x256 (256 feature maps x 1008 positions)

       0     1     2     3     4     …   1007
0      6.4   1.1   3.2   0.1   -0.4  …   3.1
…      …     …     …     …     …     …   …
255    1.2   3.4   -1    1.2   3.2   …   -1


Activation (negative values set to 0): 1x1008x256 -> 1x1008x256

Before:

       0     1     2     3     4     …   1007
0      6.4   1.1   3.2   0.1   -0.4  …   3.1
…      …     …     …     …     …     …   …
255    1.2   3.4   -1    1.2   3.2   …   -1

After:

       0     1     2     3     4     …   1007
0      6.4   1.1   3.2   0.1   0     …   3.1
…      …     …     …     …     …     …   …
255    1.2   3.4   0     1.2   3.2   …   0


Max-pooling: 1x1008x256 -> 1x336x256 (256 feature maps x 336 positions)

Before:

       0     1     2     3     4     …   1007
0      6.4   1.1   3.2   0.1   0     …   3.1
…      …     …     …     …     …     …   …
255    1.2   3.4   0     1.2   3.2   …   0

After:

       0     1     …   335
0      6.4   0.1   …   …
…      …     …     …   …
255    3.4   3.2   …   …


Convolution: 1x336x256 -> 1x330x256 (256 feature maps x 330 positions)

Before:

       0     1     …   335
0      6.4   0.1   …   …
…      …     …     …   …
255    3.4   3.2   …   …

After:

       0     1     …   329
0      -2.4  3.2   …   …
…      …     …     …   …
255    …     …     …   …


1x330x256 <- after 2 convolutions (7x1/1) and 1 max-pooling (3x1/3)

1x110x256 <- 1 max-pooling (3x1/3)

1x102x256 <- 4 convolutions (3x1/1)

1x34x256 <- 1 max-pooling (3x1/3)
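The sequence lengths above follow from the usual no-padding formula, output = (input - kernel) // stride + 1. The sketch below recomputes the chain from the 1014-character input; the layer list is read off the slides, while the helper `out_len` is a hypothetical name.

```python
def out_len(length, kernel, stride):
    """Output length of a 1D convolution or pooling without padding."""
    return (length - kernel) // stride + 1


# (description, kernel, stride) in the order given on the slides.
LAYERS = [
    ("conv 7x1/1", 7, 1),
    ("max-pool 3x1/3", 3, 3),
    ("conv 7x1/1", 7, 1),
    ("max-pool 3x1/3", 3, 3),
    ("conv 3x1/1", 3, 1),
    ("conv 3x1/1", 3, 1),
    ("conv 3x1/1", 3, 1),
    ("conv 3x1/1", 3, 1),
    ("max-pool 3x1/3", 3, 3),
]

lengths = [1014]
for _, kernel, stride in LAYERS:
    lengths.append(out_len(lengths[-1], kernel, stride))

# 256 feature maps per position are flattened into a single vector.
flattened_size = lengths[-1] * 256
```

The final length of 34 positions times 256 feature maps gives the 8704 values that feed the fully connected layers.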


Flatten: 1x34x256 -> 8704x1x1 (256 rows x 34 positions, read row by row)

       0     1     …   33
0      6.4   0.1   …   …
1      2.1   24.9  …   …
…      …     …     …   …
255    …     …     …   9.9

index  value
0      6.4
1      0.1
…      …
34     2.1
35     24.9
…      …
8703   9.9


Fully connected layer: 8704x1x1 -> 1024x1x1 (1024 neurons, k = 0 … 1023)

$$f_k(X) = \sum_{i=0}^{8703} w_{ki} \, x_i + b_k$$

Input:

index  value
0      6.4
1      0.1
…      …
8703   9.9

Output:

index  value
0      8.7
1      -2.1
…      …
1023   32.1
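A minimal sketch of the fully connected computation, f_k(X) = sum_i w_ki * x_i + b_k, on a deliberately tiny layer (3 inputs, 2 neurons) with made-up weights; the real layer maps 8704 inputs to 1024 neurons. The function name `dense` is an assumption.

```python
def dense(x, weights, biases):
    """One output per neuron k: dot(weights[k], x) + biases[k]."""
    return [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weights, biases)
    ]


x = [6.4, 0.1, 9.9]                 # toy input vector
weights = [[1, 0, 2], [0, 1, 0]]    # made-up weights, one row per neuron
biases = [0.5, -1.0]                # made-up biases
out = dense(x, weights, biases)

# Parameter count of the real layer: one weight per (input, neuron) pair
# plus one bias per neuron.
n_params = 8704 * 1024 + 1024
```

The parameter count shows why this layer dominates the model size: almost nine million weights sit between the flattened features and the first 1024 neurons.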


Fully connected layer: 1024x1x1 -> 1024x1x1 (1024 neurons, k = 0 … 1023)

$$f_k(X) = \sum_{i=0}^{1023} w_{ki} \, x_i + b_k$$

Input:

index  value
0      6.4
1      0.1
…      …
1023   9.9

Output:

index  value
0      8.7
1      -2.1
…      …
1023   32.1

Label in figure: ignored


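Output layer: 1024x1x1 -> Nx1x1 (N neurons), followed by Softmax

Layer output:

index  value
0      2.7
1      0.1
…      …
N-1    12.5

Label in figure: ignored

After Softmax (Nx1x1):

index  value
0      0.1
1      0.01
…      …
N-1    0.8

$$\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=0}^{N-1} e^{z_j}}$$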
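A short sketch of the softmax formula above. Subtracting the maximum before exponentiating is a standard numerical-stability trick that leaves the result unchanged; the three scores are arbitrary sample values, not the model's real outputs.

```python
import math


def softmax(z):
    """Turn a vector of scores into probabilities that sum to 1."""
    m = max(z)  # shift for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.7, 0.1, 12.5]
probs = softmax(scores)
```

The largest score receives the largest probability, and every entry stays strictly positive.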


• MXNet using Python bindings

• Training on Azure N-Series, on Tesla K80 GPU

• 3 days of training on 2.5M examples for sentiment polarity


Amazon Review Polarity dataset (1.8M training, 200k testing):

- Crepe model + thesaurus augmentation: 95.07%

- TFIDF + n-grams: 91.64%

AG’s news corpus dataset (4 classes, 120k training, 7.6k testing):

- Crepe model + thesaurus augmentation: 85.20%

- TFIDF + n-grams: 92.36%

CNNs are no silver bullet, but they perform best on very large datasets


source: Alexis Conneau, Holger Schwenk, Loïc Barrault, Yann LeCun. Very Deep Convolutional Networks for Natural Language Processing, 2016




source: Sergey Ioffe and Christian Szegedy Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015.

https://arxiv.org/pdf/1502.03167v3


6.4 1.1 3.2 0.1 0 5 3.1 10 21 3.1 0.2 1.8 0 1

6.4 3.2 5 10 21
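The slide gives no caption for these two rows, but they are consistent with k-max pooling with k = 5: keep the five largest values of the sequence, in their original order. The sketch below is written under that assumption; the function name `k_max_pooling` is hypothetical.

```python
def k_max_pooling(values, k):
    """Keep the k largest values, preserving their original order."""
    # Indices of the k largest values, then restored to sequence order.
    top = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    return [values[i] for i in sorted(top)]


sequence = [6.4, 1.1, 3.2, 0.1, 0, 5, 3.1, 10, 21, 3.1, 0.2, 1.8, 0, 1]
pooled = k_max_pooling(sequence, k=5)
```

Unlike fixed-window max pooling, the output length is always k, whatever the length of the input.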


source: A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification. 2016



NLP APIs from major cloud providers and market places

- Language detection

- Sentiment Analysis

- Topic detection

- Translation

- Content moderation

- Text to speech

- Speech to text

- Intent modelling


+

Scalable

Managed

Pay per use pricing

Documentation and sample code

-

Generic solutions

Limited customizability

Performance

Latency

Limited batch processing


[Diagram: Model Development on a Single Machine, with Training Data, Testing Data, and Sample Production Data]


Data pipeline ?

Retraining ?

Scalability ?

Real time / Batch scoring ?

Multiple team / frameworks ?


Production


[Diagram: Training Data → Training instance(s) (GPU) → Serialized model → Scoring instances (CPU) x4, over an Orchestration Layer (CI/CD / Job scheduling / Monitoring)]
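A toy sketch of the hand-off shown in this architecture: one process trains and serializes a model, and separate scoring processes load the serialized artifact and predict. The word-count "model", the JSON format, and the function names are all stand-ins chosen for illustration; the actual system serializes a trained network.

```python
import json
from collections import Counter


def train(examples):
    """Toy training job: count words per label."""
    model = {}
    for text, label in examples:
        model.setdefault(label, Counter()).update(text.lower().split())
    return {label: dict(counts) for label, counts in model.items()}


def score(model, text):
    """Toy scoring: pick the label whose word counts match best."""
    words = text.lower().split()
    return max(model, key=lambda label: sum(model[label].get(w, 0) for w in words))


# Training instance: fit the model and serialize it.
serialized = json.dumps(train([
    ("great product love it", "positive"),
    ("terrible waste of money", "negative"),
]))

# Scoring instance: load the serialized model and predict.
scoring_model = json.loads(serialized)
prediction = score(scoring_model, "love this great phone")
```

Keeping the serialized model as the only interface between the two sides is what lets training run on GPU machines while scoring scales out on cheaper CPU instances.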


http://osdcwebappdeeplearning.azurewebsites.net/



+

Auto-scale and load balancing

Managed

Domain specific training data

Latency

-

Pricing less flexible

Deployment pipeline to monitor

Performance


The code of this application is published at:

https://github.com/ilkarman/Bangalore_Sentiment

Part of our code is based on:

https://github.com/zhangxiangxiao/Crepe

Attribution of some images:

• http://morguefile.com

• https://unsplash.com

• Ana Corrales Photography

• http://wikipedia.org

Amazon dataset citation:

• J. McAuley, C. Targett, J. Shi, A. van den Hengel. Image-based recommendations on styles and substitutes. SIGIR, 2015.

• J. McAuley, R. Pandey, J. Leskovec. Inferring networks of substitutable and complementary products. Knowledge Discovery and Data Mining, 2015.

Open Data Science Conference London, 8 & 9 October, 2016

© 2016 Microsoft Corporation. All rights reserved
