Deep Learning for NLP

Thomas Delteil https://www.linkedin.com/in/thomasdelteil

Miguel Fierro @miguelgfierro

https://miguelgfierro.com

ODSC 2016 London – Thomas Delteil linkedin.com/in/thomasdelteil & Miguel Fierro @miguelgfierro

NLP

NLP with CNN

Operationalization

Interaction between computers and human language

Machine translation

OCR

Sentiment Analysis

Speech Recognition

T2S

Topic Modelling

Information Retrieval

Natural Language Understanding

Document Classification

£1.3T value of company data (source: IDC, 2014)

10% of organizations expect to commercialise their data by 2020 (source: Gartner, 2016)

8.4PB of information per second as of 2020 (source: business2community, 2016)

70% of companies use customer feedback (source: business2community, 2016)

TOPIC 1: Spaghetti, Milk, Eating, Broccoli

TOPIC 2: Kitten, Puppy, Hamster, Eating

… my favourite dish is spaghetti …

… the cute hamster is eating broccoli …

… I love kittens …

Generative models: joint distribution

source: https://en.wikipedia.org/wiki/Hidden_Markov_model

Conditional models: conditional distribution

source: John Lafferty, Andrew McCallum, Fernando C.N. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. ICML, 2001.


source: https://en.wikipedia.org/wiki/Tf%E2%80%93idf

Bag of n-grams instead of bag of words

source: A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification, 2016
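As a rough illustration of the two ideas above (TF-IDF weighting and a bag of n-grams instead of a bag of words), here is a minimal standard-library sketch. The function names `ngrams` and `tfidf` are hypothetical, and the weighting (tf = count / document length, idf = log(N / df)) is just one common variant, not necessarily the one behind the TF-IDF results in this talk. The three documents are the example sentences from the topic slide.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=2):
    """Return all word n-grams of length 1..n_max as space-joined strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def tfidf(docs, n_max=2):
    """Compute one {n-gram: weight} dict per document.

    Assumed variant: tf = count / number of n-grams in the document,
    idf = log(number of documents / document frequency).
    """
    bags = [Counter(ngrams(doc.lower().split(), n_max)) for doc in docs]
    # Iterating a Counter yields each distinct n-gram once per document.
    df = Counter(term for bag in bags for term in bag)
    n_docs = len(docs)
    weights = []
    for bag in bags:
        total = sum(bag.values())
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in bag.items()
        })
    return weights


docs = [
    "my favourite dish is spaghetti",
    "the cute hamster is eating broccoli",
    "I love kittens",
]
weights = tfidf(docs)
# N-grams shared by several documents (here "is") get a lower weight
# than n-grams that occur in a single document (here "spaghetti").
for term, weight in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term!r}: {weight:.3f}")
```

Bigrams such as "favourite dish" become features of their own, which keeps some local word order that a plain bag of words would lose.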

Needs GPUs and lots of data

Great performance

Feature generation

wait, wait, wait…

What makes deep learning

input hidden output

input hidden hidden hidden output

source: R. Rojas: Neural Networks, Springer-Verlag, Berlin, 1996

source: https://en.wikipedia.org/wiki/Maxima_and_minima

[Figure: network unrolled over time steps ti, ti+1, ti+2, ti+3; labels: hidden, output, number of layers]

source: https://en.wikipedia.org/wiki/Long_short-term_memory

Input image → Convolution → Pooling → Convolution → Pooling → Fully connected → Output predictions

Sharpening filter

Laplacian filter

Sobel x-axis filter

Max pooling with 2x2 kernel and stride of 2x2
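The filters and the pooling step named above can be sketched in plain Python. The kernel values below are common textbook choices for sharpening, Laplacian and Sobel x-axis filters, assumed here because the slide images are not available; the 4x4 image is hypothetical. As in most deep learning libraries, the "convolution" is implemented as cross-correlation (the kernel is not flipped).

```python
def conv2d(image, kernel):
    """'Valid' 2D filtering: slide the kernel over the image, no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool2d(image, size=2, stride=2):
    """Max pooling with a size x size window moved by `stride` on both axes."""
    return [
        [
            max(image[r + i][c + j] for i in range(size) for j in range(size))
            for c in range(0, len(image[0]) - size + 1, stride)
        ]
        for r in range(0, len(image) - size + 1, stride)
    ]


# Assumed textbook kernels (not taken from the slides).
SHARPEN = [[0, -1, 0], [-1, 5, -1], [0, -1, 0]]
LAPLACIAN = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]

# Hypothetical 4x4 image with a vertical edge between columns 1 and 2.
image = [
    [10, 10, 50, 50],
    [10, 10, 50, 50],
    [10, 10, 50, 50],
    [10, 10, 50, 50],
]

print("sharpen  :", conv2d(image, SHARPEN))
print("laplacian:", conv2d(image, LAPLACIAN))
print("sobel x  :", conv2d(image, SOBEL_X))
print("max pool :", max_pool2d(image, size=2, stride=2))
```

The Sobel x-axis filter responds strongly to the vertical edge, while 2x2 max pooling with stride 2 halves each spatial dimension and keeps the largest value of every window.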

input hidden output

Softmax, ReLU, tanh

When I read some of the rules for speaking the English language correctly, I think any fool can make a rule, and every fool will mind it.

Henry David Thoreau

?

122 122 112 90 5 10 21
121 122 112 11 6 11 21
120 118 6 10 11 12 23
118 4 6 5 23 23 23
4 6 1 23 23 21 23
4 5 20 24 23 21 23

Classification

source: Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, Li Deng, Gerald Penn, and Dong Yu. Convolutional Neural Networks for Speech Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2014

Columns (characters of the text): O D S C - U K (space) N L P

space 0 0 0 0 0 0 0 1 0 0 0
-     0 0 0 0 1 0 0 0 0 0 0
.     0 0 0 0 0 0 0 0 0 0 0
A     0 0 0 0 0 0 0 0 0 0 0
B     0 0 0 0 0 0 0 0 0 0 0
C     0 0 0 1 0 0 0 0 0 0 0
D     0 1 0 0 0 0 0 0 0 0 0
E     0 0 0 0 0 0 0 0 0 0 0
F     0 0 0 0 0 0 0 0 0 0 0
G     0 0 0 0 0 0 0 0 0 0 0
H     0 0 0 0 0 0 0 0 0 0 0
I     0 0 0 0 0 0 0 0 0 0 0
J     0 0 0 0 0 0 0 0 0 0 0
K     0 0 0 0 0 0 1 0 0 0 0
L     0 0 0 0 0 0 0 0 0 1 0
M     0 0 0 0 0 0 0 0 0 0 0
N     0 0 0 0 0 0 0 0 1 0 0
O     1 0 0 0 0 0 0 0 0 0 0
P     0 0 0 0 0 0 0 0 0 0 1
Q     0 0 0 0 0 0 0 0 0 0 0
R     0 0 0 0 0 0 0 0 0 0 0
S     0 0 1 0 0 0 0 0 0 0 0
T     0 0 0 0 0 0 0 0 0 0 0
U     0 0 0 0 0 1 0 0 0 0 0
V     0 0 0 0 0 0 0 0 0 0 0
W     0 0 0 0 0 0 0 0 0 0 0
X     0 0 0 0 0 0 0 0 0 0 0
Y     0 0 0 0 0 0 0 0 0 0 0
Z     0 0 0 0 0 0 0 0 0 0 0

One-hot encoding over a vocabulary of characters.

Encoding:

Text = "ODSC-UK NLP"

Vocab: [' ', '-', '.', 'A', 'B', 'C', …, 'Z']
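The encoding described above can be reproduced with a few lines of Python. This is a minimal sketch: the vocabulary is assumed to be the 29 rows shown in the table (space, '-', '.', then A to Z), and `one_hot` is a hypothetical helper name. Leaving out-of-vocabulary characters as all-zero columns is also an assumption of this sketch.

```python
import string

# Assumed vocabulary: the rows of the table (space, '-', '.', 'A'..'Z').
VOCAB = [" ", "-", "."] + list(string.ascii_uppercase)
CHAR_TO_ROW = {ch: row for row, ch in enumerate(VOCAB)}


def one_hot(text, char_to_row=CHAR_TO_ROW):
    """Return a len(vocab) x len(text) matrix; column j encodes character j.

    Characters that are not in the vocabulary stay as all-zero columns.
    """
    matrix = [[0] * len(text) for _ in range(len(char_to_row))]
    for col, ch in enumerate(text):
        row = char_to_row.get(ch)
        if row is not None:
            matrix[row][col] = 1
    return matrix


encoded = one_hot("ODSC-UK NLP")
for ch, row in zip(VOCAB, encoded):
    label = "space" if ch == " " else ch
    print(f"{label:<5}", *row)
```

Each column contains exactly one 1, in the row of the character at that position, which gives the sparse matrix shown in the table.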

For images:

For text:

Humans to rephrase the examples

Synonyms

Similar semantic meaning

source: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. NIPS 2015

One-hot matrix as above (rows: space, -, ., A … Z; columns: O D S C - U K (space) N L P), with the columns continuing up to index 1013.

Feature maps after the first convolution (rows 0 … 255, columns 0 … 1007):

      0    1    2    3    4     …  1007
0     6.4  1.1  3.2  0.1  -0.4  …  3.1
…     …    …    …    …    …     …  …
255   1.2  3.4  -1   1.2  3.2   …  -1

69x1014x1 → 1x1008x256

x 1008

Before (negative values present):

      0    1    2    3    4     …  1007
0     6.4  1.1  3.2  0.1  -0.4  …  3.1
…     …    …    …    …    …     …  …
255   1.2  3.4  -1   1.2  3.2   …  -1

After (negative values set to 0):

      0    1    2    3    4     …  1007
0     6.4  1.1  3.2  0.1  0     …  3.1
…     …    …    …    …    …     …  …
255   1.2  3.4  0    1.2  3.2   …  0

1x1008x256
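The two tables above differ only in that negative entries become 0, which matches the ReLU activation listed earlier in the deck (reading this step as ReLU is an assumption). A minimal sketch, using the visible values of rows 0 and 255:

```python
def relu(values):
    """Element-wise ReLU: max(0, v) for every value."""
    return [max(0.0, v) for v in values]


# Visible values from the slide tables (elided columns omitted).
row_0 = [6.4, 1.1, 3.2, 0.1, -0.4, 3.1]
row_255 = [1.2, 3.4, -1, 1.2, 3.2, -1]

print(relu(row_0))
print(relu(row_255))
```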

Max pooling input (1x1008x256):

      0    1    2    3    4    …  …  …  1007
0     6.4  1.1  3.2  0.1  0    …  …  …  3.1
…     …    …    …    …    …    …  …  …  …
255   1.2  3.4  0    1.2  3.2  …  …  …  0

Max pooling output (1x336x256):

      0    1    …  335
0     6.4  0.1  …  …
…     …    …    …  …
255   3.4  3.2

x 336 x 256
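The step from 1008 to 336 columns is consistent with max pooling over non-overlapping windows of 3 values (3x1/3 in the notation used later in the deck). A minimal sketch follows; `max_pool1d` is a hypothetical helper, and the last value of the sample row is a made-up filler because the slide elides it.

```python
def max_pool1d(row, size=3, stride=3):
    """Keep the maximum of each window of `size` values, moving by `stride`."""
    return [max(row[i:i + size]) for i in range(0, len(row) - size + 1, stride)]


# Row 255 from the slide; the sixth value (0.5) is a hypothetical filler.
row_255 = [1.2, 3.4, 0, 1.2, 3.2, 0.5]
print(max_pool1d(row_255))

# Length check: 1008 positions pooled 3 by 3.
print(len(max_pool1d([0.0] * 1008)))
```

The first two pooled values of row 255 come out as 3.4 and 3.2, as in the output table.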

Convolution input (1x336x256):

      0    1    2  3  4  5  6  7  8  …  335
0     6.4  0.1  …  …  …  …  …  …  …  …  …
…     …    …    …  …  …  …  …  …  …  …  …
255   3.4  3.2  …  …  …  …  …  …  …  …  …

Convolution output (1x330x256):

      0     1    2  3  4  5  6  …  329
0     -2.4  3.2  …  …  …  …  …  …  …
…     …     …    …  …  …  …  …  …  …
255   …     …    …  …  …  …  …  …  …

x 256 x 330

1x330x256 <- after 2 convolutions (7x1/1) and 1 max-pooling (3x1/3)

1x110x256 <- 1 max-pooling (3x1/3)

1x102x256 <- 4 convolutions (3x1/1)

1x34x256 <- 1 max-pooling (3x1/3)
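The sequence lengths listed above follow from the usual output-size formula for unpadded convolution and pooling, out = (in - kernel) // stride + 1. The sketch below walks the layer stack from the 1014-character input; the layer list is reconstructed from the annotations above, and `out_len` is a hypothetical helper.

```python
def out_len(length, kernel, stride):
    """Output length of an unpadded 1D convolution or pooling layer."""
    return (length - kernel) // stride + 1


# (name, kernel, stride), following the "KxW/S" annotations in the slides.
LAYERS = (
    [("conv 7x1/1", 7, 1), ("max-pool 3x1/3", 3, 3)]
    + [("conv 7x1/1", 7, 1), ("max-pool 3x1/3", 3, 3)]
    + [("conv 3x1/1", 3, 1)] * 4
    + [("max-pool 3x1/3", 3, 3)]
)

length = 1014  # input width, from 69x1014x1
lengths = []
for name, kernel, stride in LAYERS:
    length = out_len(length, kernel, stride)
    lengths.append(length)
    print(f"{name:<15} -> 1x{length}x256")

# Flattening 34 positions x 256 feature maps gives the 8704-long vector.
print("flattened size:", lengths[-1] * 256)
```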

Feature maps before flattening (1x34x256):

      0    1     2  3  4  5  6  7  8  …  33
0     6.4  0.1   …  …  …  …  …  …  …  …  …
1     2.1  24.9  …  …  …  …  …  …  …  …  …
…     …    …     …  …  …  …  …  …  …  …  …
255   …    …     …  …  …  …  …  …  …  …  9.9

Flattened vector (8704x1x1):

…     …
35    2.1
36    24.9
…     …
8703  9.9

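Input vector (8704x1x1):

…     …
8703  9.9

x 1024

$$f_k(X) = \sum_{i=0}^{n-1} w_{ki}\, x_i + b_k$$

where n is the number of inputs (8704 here).

Output vector (1024x1x1):

1     -2.1
…     …
1023  32.1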
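The formula above is a fully connected layer: each output unit k is a weighted sum of all inputs plus a bias. A minimal sketch with tiny, hypothetical sizes and weights standing in for the 8704 inputs and 1024 outputs of the slide:

```python
def dense(x, weights, biases):
    """Compute f_k(X) = sum_i w_ki * x_i + b_k for every output unit k.

    `weights` holds one row of input weights per output unit.
    """
    return [
        sum(w_ki * x_i for w_ki, x_i in zip(w_k, x)) + b_k
        for w_k, b_k in zip(weights, biases)
    ]


# Hypothetical example: 3 inputs, 2 output units.
x = [1.0, 2.0, 3.0]
weights = [
    [1.0, 0.0, -1.0],  # unit 0
    [0.5, 0.5, 0.5],   # unit 1
]
biases = [0.0, 1.0]

print(dense(x, weights, biases))
```

In the network described here, the same computation maps the 8704 flattened values to 1024 units, so the weight matrix alone has 8704 x 1024 entries.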

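Input vector (1024x1x1):

…     …
1023  9.9

x 1024

$$f_k(X) = \sum_{i=0}^{n-1} w_{ki}\, x_i + b_k$$

where n is the number of inputs (1024 here).

Output vector (1024x1x1):

1     -2.1
…     …
1023  32.1

ignored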

Input vector (1024x1x1):

…     …
1023  9.9

x N

Output vector (Nx1x1):

…     …
N-1   12.5

ignored

Softmax

1     0.01
…     …
N-1   0.8

$$\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=0}^{N-1} e^{z_j}}$$
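The softmax formula above turns the N raw class scores into probabilities that sum to 1. A minimal sketch; subtracting the maximum score before exponentiating is a standard numerical-stability trick added here and not shown on the slide, and the scores are hypothetical.

```python
import math


def softmax(z):
    """Return exp(z_i) / sum_j exp(z_j) for every score z_i.

    Subtracting max(z) leaves the result unchanged but avoids overflow.
    """
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


scores = [1.0, 2.0, 3.0]  # hypothetical outputs of the last layer
probs = softmax(scores)
print([round(p, 3) for p in probs])
print(sum(probs))
```

The predicted class is the index with the highest probability, which is also the index of the highest raw score.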

• MXNet using python bindings

• Training on Azure N-Series, on Tesla K80 GPU

• 3 days of training on 2.5M examples for sentiment polarity

Amazon Review Polarity dataset (1.8M training, 200k testing):

- Crepe model + thesaurus augmentation: 95.07%

- TFIDF + n-grams: 91.64%

AG’s news corpus dataset (4 classes, 120k training, 7.6k testing):

- Crepe model + thesaurus augmentation: 85.20%

- TFIDF + n-grams: 92.36%

CNNs are no silver bullet, but they perform best on very large datasets

source: Alexis Conneau, Holger Schwenk, Loïc Barrault, Yann Lecun. Very Deep Convolutional Networks for Natural Language Processing, 2016

source: Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015.

6.4 1.1 3.2 0.1 0 5 3.1 10 21 3.1 0.2 1.8 0 1

6.4 3.2 5 10 21
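The shorter row above contains the five largest values of the longer sequence, kept in their original order. That matches k-max pooling with k = 5, a pooling variant used in the Conneau et al. paper cited above; reading the slide this way is an assumption. A minimal sketch, with `k_max_pool` as a hypothetical helper:

```python
def k_max_pool(values, k):
    """Keep the k largest values, preserving their original order."""
    # Indices of the k largest values (ties resolved by position).
    top = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    return [values[i] for i in sorted(top)]


sequence = [6.4, 1.1, 3.2, 0.1, 0, 5, 3.1, 10, 21, 3.1, 0.2, 1.8, 0, 1]
print(k_max_pool(sequence, 5))
```

Unlike fixed-window max pooling, the output length is always k, whatever the length of the input sequence.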

source: A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification. 2016

NLP APIs from major cloud providers and market places

- Language detection

- Sentiment Analysis

- Topic detection

- Translation

- Content moderation

- Text to speech

- Speech to text

- Intent modelling

+

Scalable

Managed

Pay per use pricing

Documentation and sample code

-

Generic solutions

Limited customizability

Performance

Latency

Limited batch processing

Single Machine

Training Data Testing Data

Sample Production Data

Model Development

Data pipeline?

Retraining?

Scalability?

Real time / Batch scoring?

Multiple teams / frameworks?

[Diagram labels: Production; Training; Training instance(s); Serialized; Scoring instance (x4); Orchestration Layer (CI/CD / Job scheduling / Monitoring)]

+

Auto-scale and load balancing

Managed

Domain specific training data

Latency

-

Pricing less flexible

Deployment pipeline to monitor

Performance


The code of this application is published at:

https://github.com/ilkarman/Bangalore_Senti

Part of our code is based on:

https://github.com/zhangxiangxiao/Crepe

Attribution of some images:

• http://morguefile.com

• https://unsplash.com

• Ana Corrales Photography

• http://wikipedia.org

Amazon dataset citation:

• J. McAuley, C. Targett, J. Shi, A. van den Hengel. Image-based recommendations on styles and substitutes. SIGIR, 2015.

• J. McAuley, R. Pandey, J. Leskovec. Inferring networks of substitutable and complementary products. Knowledge Discovery and Data Mining, 2015.

Open Data Science Conference London, 8 & 9 October, 2016
