Deep Neural Networks Are Our Friends — lxmls.it.pt/2016/deep-neural-networks-are-our-friends.pdf


Deep Neural Networks Are Our Friends

Wang Ling

Outline

● Part I - Neural Networks are our friends
○ Numbers are our friends
○ Operators are our friends
○ Functions are our friends
○ Parameters are our friends
○ Cost Functions are our friends
○ Optimizers are our friends
○ Gradients are our friends
○ Computation Graphs are our friends

● Part II - Into Deep Learning
○ Nonlinear Neural Models
○ Multilayer Perceptrons
○ Using Discrete Variables
○ Example Applications

Numbers are our friends

Numbers are our friends

Abby Cadabby

How many apples does Abby have?

Numbers are our friends
● Types of Numbers:
○ Integers: 5
○ Rationals: 1/2
○ Reals: 1.4e10 ...

Operators are our friends

4

Bert

Operators are our friends

41

Bert

If Abby has 4 apples, and gives Bert 1 apple, how many apples will

Abby have?

Operators are our friends

3 1

Bert

Operators are our friends
● Arithmetic Operators
○ Addition: 23 + 12 = 35
○ Subtraction: 31 - 15 = 16
○ Multiplication: 4 x 5 = 20
○ Division: 20 / 5 = 4

Functions are our friends

41

Functions are our friends

4

5?

1

If Bert always returns 3 bananas for each apple, how many bananas will

Abby receive for 2 apples?

Functions are our friends

y = 3x

● Input, x - Number of Apples given by Abby

Functions are our friends

y = 3x

● Input, x - Number of Apples given by Abby

● Output, y - Number of Bananas received by Abby
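The mapping above is just a function of one input; as a minimal sketch (the function name `bananas` is ours, not from the slides):

```python
# y = 3x: Bert returns 3 bananas for each apple Abby gives him.
def bananas(x):
    return 3 * x
```

So bananas(1) gives 3, matching the worked example on the next slides.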

Functions are our friends

4

5?

1

y = 3x

Functions are our friends

4

5?

1

y = 3x , x =1

Functions are our friends

4

53

1

y = 3x, x = 1 → y = 3

Functions are our friends
y = 3x

Functions are our friends
y = 3x

Cookie Monster

Functions are our friends
y = 3x → y = ??

Functions are our friends
y = ??

0

1

Functions are our friendsy = ??

0

1

16

5

Functions are our friendsy = ??

0

1

16

5

20

6

Functions are our friendsy = ??

0

1

16

5

20

6

?

3

If Abby gives Cookie Monster 3 apples, how many bananas

does she get?

Parameters are our friends

y = 3x + 1

● Input
● Output

Parameters are our friends

y = wx + b

● Input
● Output
● Parameters

Input - Fixed, comes from data
Parameters - Need to be estimated

Parameters are our friends
y = wx + b

0

1

16

5

20

6

?

3

Data

Parameters are our friendsy = wx + b

0

1

16

5

20

6

?

3

Parameters are our friendsy = wx + b

?

3

x y

1 0

5 16

6 20

Parameters are our friends

y = wx + bx y

1 0

5 16

6 20

Data Model

Parameters are our friends

y = wx + bx y

1 0

5 16

6 20

Data Model

How to find the parameters w and b?

Parameters are our friends

y = wx + bx y

1 0

5 16

6 20

Data Model

Model Candidate 1

x y ŷ
1 0 1
5 16 5
6 20 6

y = 1x + 0

Parameters are our friends

y = wx + bx y

1 0

5 16

6 20

Data ModelModel

Candidate 1x y ŷ

1 0 1

5 16 5

6 20 6

Model Candidate 2 x y ŷ

1 0 4

5 16 12

6 20 14

y = 1x + 0

y = 2x + 2

Parameters are our friends

y = wx + bx y

1 0

5 16

6 20

Data ModelModel

Candidate 1x y ŷ

1 0 1

5 16 5

6 20 6

Model Candidate 2 x y ŷ

1 0 4

5 16 12

6 20 14

y = 1x + 0

y = 2x + 2

Which one is better?

Cost functions are our friends

yn = wxn + b

n x y

0 1 0

1 5 16

2 6 20

Data ModelModel

Candidate 1x y ŷ

1 0 1

5 16 5

6 20 6

Model Candidate 2 x y ŷ

1 0 4

5 16 12

6 20 14

y = 1x + 0

y = 2x + 2

Cost functions are our friends

yn = wxn + bn x y

0 1 0

1 5 16

2 6 20

Data ModelModel

Candidate 1x y ŷ

1 0 1

5 16 5

6 20 6

Model Candidate 2 x y ŷ

1 0 4

5 16 12

6 20 14

y = 1x + 0

y = 2x + 2

Cost

C(w,b) = ∑n∈{0,1,2} (yn - ŷn)²

Cost functions are our friends

yn = wxn + bn x y

0 1 0

1 5 16

2 6 20

Data ModelModel

Candidate 1

Model Candidate 2 x y ŷ

1 0 4

5 16 12

6 20 14

y = 1x + 0

y = 2x + 2

Cost

C(w,b) = ∑n∈{0,1,2} (yn - ŷn)²

n x y ŷ (y-ŷ)²
0 1 0 1 1
1 5 16 5
2 6 20 6

Cost functions are our friends

yn = wxn + bn x y

0 1 0

1 5 16

2 6 20

Data ModelModel

Candidate 1

Model Candidate 2 x y ŷ

1 0 4

5 16 12

6 20 14

y = 1x + 0

y = 2x + 2

Cost

C(w,b) = ∑n∈{0,1,2} (yn - ŷn)²

n x y ŷ (y-ŷ)²
0 1 0 1 1
1 5 16 5 121
2 6 20 6

Cost functions are our friends

yn = wxn + bn x y

0 1 0

1 5 16

2 6 20

Data ModelModel

Candidate 1

Model Candidate 2 x y ŷ

1 0 4

5 16 12

6 20 14

y = 1x + 0

y = 2x + 2

Cost

C(w,b) = ∑n∈{0,1,2} (yn - ŷn)²

n x y ŷ (y-ŷ)²
0 1 0 1 1
1 5 16 5 121
2 6 20 6 196

Cost functions are our friends

yn = wxn + bn x y

0 1 0

1 5 16

2 6 20

Data ModelModel

Candidate 1

Model Candidate 2 x y ŷ

1 0 4

5 16 12

6 20 14

y = 1x + 0

y = 2x + 2

Cost

C(w,b) = ∑n∈{0,1,2} (yn - ŷn)²

n x y ŷ (y-ŷ)²
0 1 0 1 1
1 5 16 5 121
2 6 20 6 196

C(1,0) = 318

Cost functions are our friends

yn = wxn + bn x y

0 1 0

1 5 16

2 6 20

Data ModelModel

Candidate 1

n x y ŷ (y-ŷ)²

0 1 0 1 1

1 5 16 5 121

2 6 20 6 196

Model Candidate 2

y = 1x + 0

y = 2x + 2

Cost

C(w,b) = ∑n∈{0,1,2} (yn - ŷn)²

C(1,0) = 318

n x y ŷ (y-ŷ)²
0 1 0 4 16
1 5 16 12 16
2 6 20 14 36

C(2,2) = 68
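The comparison above can be checked with a few lines of code; `cost` is our name for C(w,b), a sketch assuming the slide's three data points:

```python
# C(w, b) = sum over the data of (y - yhat)^2, with yhat = w*x + b
data = [(1, 0), (5, 16), (6, 20)]  # (x, y) pairs from the slides

def cost(w, b):
    return sum((y - (w * x + b)) ** 2 for x, y in data)
```

cost(1, 0) is 318 and cost(2, 2) is 68, so Candidate 2 is the better model.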

Cost functions are our friends

yn = wxn + bn x y

0 1 0

1 5 16

2 6 20

Data ModelModel

Candidate 1

Model Candidate 2

y = 1x + 0

y = 2x + 2

Cost

C(w,b) = ∑n∈{0,1,2} (yn - ŷn)²

318

68

C(1,0)

C(2,2)

Cost functions are our friends

yn = wxn + bn x y

0 1 0

1 5 16

2 6 20

Data Model

Cost

C(w,b) = ∑n∈{0,1,2} (yn - ŷn)²

Cost functions are our friends

yn = wxn + bn x y

0 1 0

1 5 16

2 6 20

Data Model

Cost

C(w,b) = ∑n∈{0,1,2} (yn - ŷn)²

How to find the parameters w and b?

Optimizers are our friends

yn = wxn + bn x y

0 1 0

1 5 16

2 6 20

Data Model

Cost

C(w,b) = ∑n∈{0,1,2} (yn - ŷn)²

Optimizer

arg min C(w,b)w,b∈[-∞,∞]

Optimizers are our friendsOptimizer

arg min C(w,b)w,b∈[-∞,∞]

w

b

Optimizers are our friendsOptimizer

arg min C(w,b)w,b∈[-∞,∞]

w0,b0 = 2,2 : C(w0,b0) = 68

w

b

y = wx + b

Optimizers are our friendsOptimizer

arg min C(w,b)w,b∈[-∞,∞]

w0,b0 = 2,2 : C(w0,b0) = 68

w

b

2

2

68

y = wx + b

Optimizers are our friendsOptimizer

arg min C(w,b)w,b∈[-∞,∞]

w0,b0 = 2,2 : C(w0,b0) = 68w1,b1 = 3,2 : C(w1,b1) = ?

w

b

y = wx + b

Optimizers are our friendsOptimizer

arg min C(w,b)w,b∈[-∞,∞]

w0,b0 = 2,2 : C(w0,b0) = 68w1,b1 = 3,2 : C(w1,b1) = 26

n x y ŷ (y-ŷ)²

0 1 0 5 25

1 5 16 17 1

2 6 20 20 0

C(3,2) 26

w

b

2

y = wx + b

Optimizers are our friendsOptimizer

arg min C(w,b)w,b∈[-∞,∞]

w0,b0 = 2,2 : C(w0,b0) = 68w1,b1 = 3,2 : C(w1,b1) = 26

n x y ŷ (y-ŷ)

0 1 0 5 25

1 5 16 17 1

2 6 20 20 0

C(3,2) 26

w

b

2

y = wx + b

Optimizers are our friendsOptimizer

arg min C(w,b)w,b∈[-∞,∞]

w1,b1 = 3,2 : C(w1,b1) = 26w2,b2 = 4,2 : C(w2,b2) = ??

w

b

y = wx + b

Optimizers are our friendsOptimizer

arg min C(w,b)w,b∈[-∞,∞]

w1,b1 = 3,2 : C(w1,b1) = 26w2,b2 = 4,2 : C(w2,b2) = 136

w

b

n x y ŷ (y-ŷ)²

0 1 0 6 36

1 5 16 22 64

2 6 20 26 36

C(4,2) 136

2

y = wx + b

Optimizers are our friendsOptimizer

arg min C(w,b)w,b∈[-∞,∞]

w1,b1 = 3,2 : C(w1,b1) = 26

w

b

y = wx + b

Optimizers are our friendsOptimizer

arg min C(w,b)w,b∈[-∞,∞]

w1,b1 = 3,2 : C(w1,b1) = 26w2,b2 = 3,3 : C(w2,b2) = 41

w

b

n x y ŷ (y-ŷ)²

0 1 0 6 36

1 5 16 18 4

2 6 20 21 1

C(3,3) 41

2

y = wx + b

Optimizers are our friendsOptimizer

arg min C(w,b)w,b∈[-∞,∞]

w1,b1 = 3,2 : C(w1,b1) = 26

w

b

y = wx + b

Optimizers are our friendsOptimizer

arg min C(w,b)w,b∈[-∞,∞]

w1,b1 = 3,2 : C(w1,b1) = 26w2,b2 = 3,1 : C(w2,b2) = 17

w

b

n x y ŷ (y-ŷ)²

0 1 0 4 16

1 5 16 16 0

2 6 20 19 1

C(3,1) 17

2

y = wx + b

Optimizers are our friendsOptimizer

arg min C(w,b)w,b∈[-∞,∞]

w2,b2 = 3,1 : C(w2,b2) = 17

w

b

y = wx + b

Optimizers are our friendsOptimizer

arg min C(w,b)w,b∈[-∞,∞]

w2,b2 = 3,1 : C(w2,b2) = 17

w

b

w3,b3 = 3,0 : C(w3,b3) = 13

n x y ŷ (y-ŷ)²

0 1 0 3 9

1 5 16 15 1

2 6 20 18 4

C(3,0) 13

2

y = wx + b

Optimizers are our friendsOptimizer

arg min C(w,b)w,b∈[-∞,∞]

w

b

w3,b3 = 3,0 : C(w3,b3) = 13

y = wx + b

Optimizers are our friendsOptimizer

arg min C(w,b)w,b∈[-∞,∞]

w

b

w3,b3 = 3,0 : C(w3,b3) = 13w4,b4 = 3,-1 : C(w4,b4) = 17

n x y ŷ (y-ŷ)²

0 1 0 2 4

1 5 16 14 4

2 6 20 17 9

C(3,-1) 17

2

y = wx + b

Optimizers are our friendsOptimizer

arg min C(w,b)w,b∈[-∞,∞]

w

b

w3,b3 = 3,0 : C(w3,b3) = 13w4,b4 = 2,0 : C(w4,b4) = 104

n x y ŷ (y-ŷ)²

0 1 0 2 4

1 5 16 10 36

2 6 20 12 64

C(2,0) 104

2

y = wx + b

Optimizers are our friendsOptimizer

arg min C(w,b)w,b∈[-∞,∞]

w

b

w3,b3 = 3,0 : C(w3,b3) = 13
w4,b4 = 4,0 : C(w4,b4) = 48

n x y ŷ (y-ŷ)²
0 1 0 4 16
1 5 16 20 16
2 6 20 24 16

C(4,0) = 48

y = wx + b

Optimizers are our friendsOptimizer

arg min C(w,b)w,b∈[-∞,∞]

w

b

w3,b3 = 3,0 : C(w3,b3) = 13

y = wx + b

Optimizers are our friendsOptimizer

arg min C(w,b)w,b∈[-∞,∞]

w

b

w?,b? = 4,-2 : C(w?,b?) = ??

y = wx + b

Optimizers are our friendsOptimizer

arg min C(w,b)w,b∈[-∞,∞]

w

b

n x y ŷ (y-ŷ)²

0 1 0 2 4

1 5 16 18 4

2 6 20 22 4

C(4,-2) 12

2

w?,b? = 4,-2 : C(w?,b?) = 12

y = wx + b

Optimizers are our friendsOptimizer

arg min C(w,b)w,b∈[-∞,∞]

w

b

w3,b3 = 3,0 : C(w3,b3) = 13

y = wx + b

Optimizers are our friendsOptimizer

arg min C(w,b)w,b∈[-∞,∞]

w

b

w3,b3 = 3,0 : C(w3,b3) = 13

Search Problem

y = wx + b

Optimizers are our friendsOptimizer

arg min C(w,b)w,b∈[-∞,∞]

w

b

w3,b3 = 3,0 : C(w3,b3) = 13w4,b4 = 3.01,0 : C(w4,b4) = 12.82

n x y ŷ (y-ŷ)²

0 1 0 3.01 9.06

1 5 16 15.01 0.98

2 6 20 18.01 3.96

C(3.01,0) 12.82

2

y = wx + b

Optimizers are our friendsOptimizer

arg min C(w,b)w,b∈[-∞,∞]

w

b

w*,b* = 4,-2 : C(w*,b*) = 12

y = wx + b

Optimizers are our friendsOptimizer

arg min C(w,b)w,b∈[-∞,∞]

w

b

w*,b* = 4,-2 : C(w*,b*) = 12

y = wx + b

Optimizers are our friendsOptimizer

arg min C(w,b)w,b∈[-∞,∞]

w

b

w*,b* = 4,-4 : C(w*,b*) = 0

y = wx + b
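Treating the optimizer as a search problem, the simplest (and most expensive) strategy is brute force over a grid; a sketch, with the grid bounds our own choice:

```python
data = [(1, 0), (5, 16), (6, 20)]

def cost(w, b):
    return sum((y - (w * x + b)) ** 2 for x, y in data)

# exhaustive search over an integer grid for arg min C(w, b)
best = min(((w, b) for w in range(-10, 11) for b in range(-10, 11)),
           key=lambda p: cost(*p))
```

The search finds (4, -4) with cost 0, the minimum shown on the slide.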

Gradients are our friendsOptimizer

arg min C(w,b)w,b∈[-∞,∞]

w

b

Should be used sparingly

y = wx + b

Gradients are our friendsOptimizer

arg min C(w,b)w,b∈[-∞,∞]

w

b

y = wx + b

w0,b0 = 2,2 : C(w0,b0) = 68

2

2

68

Gradients are our friendsOptimizer

arg min C(w,b)w,b∈[-∞,∞]

w

b

y = wx + b

w0,b0 = 2,2 : C(w0,b0) = 68

2

2

68

hw = 1

Gradients are our friendsOptimizer

arg min C(w,b)w,b∈[-∞,∞]

w

b

y = wx + b

w0,b0 = 2,2 : C(w0,b0) = 68

2

2

68

hw = 1
C(w0+hw,b0) = C(3,2) = 26

Gradients are our friendsOptimizer

arg min C(w,b)w,b∈[-∞,∞]

w

b

y = wx + b

w0,b0 = 2,2 : C(w0,b0) = 68

2

2

68

hw = 1
C(w0+hw,b0) = C(3,2) = 26

rw = (C(w0+hw,b0) - C(w0,b0)) / hw = (C(3,2) - C(2,2)) / 1 = -42

Gradients are our friendsOptimizer

arg min C(w,b)w,b∈[-∞,∞]

w

b

y = wx + b

w0,b0 = 2,2 : C(w0,b0) = 68

2

2

68

hw = 1, r = -42
hw = 0.1, r = -98
hw = 0.01, r = -104
hw = 0.001, r = -104

Gradients are our friendsOptimizer

arg min C(w,b)w,b∈[-∞,∞]

w

b

y = wx + b

w0,b0 = 2,2 : C(w0,b0) = 68

2

2

68

hw = 1, r = -42
hw = 0.1, r = -98
hw = 0.01, r = -104
hw = 0.001, r = -104

hw → 0, r = ∂C/∂w (w0,b0)
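The finite-difference ratio r can be computed directly; a sketch (`finite_diff_w` is our name):

```python
data = [(1, 0), (5, 16), (6, 20)]

def cost(w, b):
    return sum((y - (w * x + b)) ** 2 for x, y in data)

def finite_diff_w(w, b, hw):
    # r = (C(w + hw, b) - C(w, b)) / hw, which tends to dC/dw as hw -> 0
    return (cost(w + hw, b) - cost(w, b)) / hw
```

At (w0, b0) = (2, 2), hw = 1 gives -42, and shrinking hw approaches the true derivative -104.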

Gradients are our friendsOptimizer

arg min C(w,b)w,b∈[-∞,∞]

w

b

y = wx + b

w0,b0 = 2,2 : C(w0,b0) = 68

2

2

68

∂C/∂w = ∂/∂w ∑n (ŷn - yn)²

Gradients are our friendsOptimizer

arg min C(w,b)w,b∈[-∞,∞]

w

b

y = wx + b

w0,b0 = 2,2 : C(w0,b0) = 68

2

2

68

∂C/∂w = ∂/∂w ∑n (ŷn - yn)² = ∑n 2(ŷn - yn)xn

Gradients are our friendsOptimizer

arg min C(w,b)w,b∈[-∞,∞]

w0,b0 = 2,2 : C(w0,b0) = 68

∂C/∂w = ∂/∂w ∑n (ŷn - yn)² = ∑n 2(ŷn - yn)xn

hw → 0, rw = ∂C/∂w (w0,b0) = -104

n x y ŷ (ŷ-y) 2(ŷ-y)x
0 1 0 4 4 8
1 5 16 12 -4 -40
2 6 20 14 -6 -72

Gradients are our friendsOptimizer

arg min C(w,b)w,b∈[-∞,∞]

w

b

y = wx + b

w0,b0 = 2,2 : C(w0,b0) = 68

2

2

68

∂C/∂w = ∂/∂w ∑n (ŷn - yn)² = ∑n 2(ŷn - yn)xn

∂C/∂b = ∂/∂b ∑n (ŷn - yn)² = ∑n 2(ŷn - yn)

Gradients are our friendsOptimizer

arg min C(w,b)w,b∈[-∞,∞]

w0,b0 = 2,2 : C(w0,b0) = 68

hw → 0, rw = ∂C/∂w (w0,b0) = -104

n x y ŷ (ŷ-y) 2(ŷ-y)
0 1 0 4 4 8
1 5 16 12 -4 -8
2 6 20 14 -6 -12

hb → 0, rb = ∂C/∂b (w0,b0) = -12

Gradients are our friendsOptimizer

arg min C(w,b)w,b∈[-∞,∞]

w0,b0 = 2,2 : C(w0,b0) = 68

hw → 0, rw = ∂C/∂w (w0,b0) = -104

hb → 0, rb = ∂C/∂b (w0,b0) = -12

w

b

y = wx + b

2

w1 = w0 - η rw

b1 = b0 - η rb    (η → Learning Rate)
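One full gradient-descent loop, as a sketch; the learning rate 0.01 and the iteration count are our choices, not from the slides:

```python
data = [(1, 0), (5, 16), (6, 20)]
w, b = 2.0, 2.0      # w0, b0 from the slides
lr = 0.01            # learning rate

for _ in range(20000):
    # analytic gradients of C(w, b) = sum (w*x + b - y)^2
    gw = sum(2 * (w * x + b - y) * x for x, y in data)
    gb = sum(2 * (w * x + b - y) for x, y in data)
    w -= lr * gw
    b -= lr * gb
```

The loop converges to w = 4, b = -4, the exact fit y = 4x - 4 found on the next slide.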

Gradients are our friendsy = 4x-4

Data

0

1

16

5

20

6

?

3

Gradients are our friendsy = 4x-4

Data

0

1

16

5

20

6

8

3

Computation Graphs are our friends

C(w,b) = ∑n∈{0,1,2} (yn - ŷn)²

∂C/∂w = ∂/∂w ∑n (ŷn - yn)² = ∑n 2(ŷn - yn)xn

∂C/∂b = ∂/∂b ∑n (ŷn - yn)² = ∑n 2(ŷn - yn)

y = wx + b

Easy!

Computation Graphs are our friends

Harder!

y = wx + b + tanh(yx + b)²

Computation Graphs are our friends

Computation Graphs can

compute gradients for you!

y = wx + b + tanh(yx + b)²

Computation Graphs are our friends

C(w,b) = ∑n∈{0,1,2} (yn - ŷn)²

∂C

∂w=

∂∑(ŷn-yn)

∂wn = ∑-2(ŷn-yn)xn

n

∂C

∂b=

∂∑(ŷn-yn) 2

∂bn = ∑-2(ŷn-yn)

n

y = wx + b

2

Computation Graphs are our friends

C(w,b) = ∑(yn-ŷn)n∈{0,1,2}

2

∂C

∂w=

∂(ŷn-yn)

∂ynn

= ∑-2(ŷn-yn)xn n

2

= ∑-2(ŷn-yn) n

y = wx + b

∂yn

∂w

2

∂C

∂b=

∂(ŷn-yn)

∂ynn

∂yn

∂b∑

Computation Graphs are our friends

C(w,b) = ∑(yn-ŷn)n∈{0,1,2}

2

∂C

∂w=

∂(ŷn-yn)

∂ynn

2

y = wx + b

∂yn

∂w

2

∂C

∂b=

∂(ŷn-yn)

∂ynn ∂b∑ ∂yn

Computation Graphs are our friends

C(w,b) = ∑(yn-ŷn)n∈{0,1,2}

2

∂C

∂w=

∂(ŷn-yn)

∂ynn

2

y = o + bo = wx

∂yn

∂w

2

∂C

∂b=

∂(ŷn-yn)

∂ynn ∂b∑ ∂yn

Computation Graphs are our friends

C(w,b) = ∑n∈{0,1,2} cn

∂C

∂w=

∂ynn

2

c = d²
d = y - ŷ
y = o + b
o = wx

∂yn

∂w

2

∂C

∂b=

∂(ŷn-yn)

∂ynn ∂b∑ ∂yn

2

∂(ŷn-yn)

Computation Graphs are our friends

C(w,b) = ∑n∈{0,1,2} cn

∂C/∂w = ∑n (∂cn/∂dn)(∂dn/∂yn)(∂yn/∂on)(∂on/∂w)

c = d²
d = y - ŷ
y = o + b
o = wx

∂C/∂b = ∑n (∂cn/∂dn)(∂dn/∂yn)(∂yn/∂b)

Computation Graphs are our friends

C(w,b) = ∑cnn∈{0,1,2}

∂C

∂w=

∂cn

∂dnn

c = dd = y - ŷy = o + bo = wx

∂on

∂w∑

∂C

∂b

2

∂dn

∂yn

∂yn

∂on

= ∂cn

∂dnn

∑ ∂dn

∂yn

∂yn

∂b

Computation Graphs are our friends

C(w,b) = ∑cnn∈{0,1,2}

∂C

∂w=

∂cn

∂dnn

c = dd = y - ŷy = o + bo = wx

∂on

∂w∑

∂C

∂b

2

∂dn

∂yn

∂yn

∂on

= ∂cn

∂dnn

∑ ∂dn

∂yn

∂yn

∂b

Power 2

Sub

Add

Product

Sub

Computation Graphs are our friends

C(w,b) = ∑cnn∈{0,1,2}

∂C

∂w=

∂cn

∂dnn

c = dd = y - ŷy = o + bo = wx

∂on

∂w∑

∂C

∂b

2

∂dn

∂yn

∂yn

∂on

= ∂cn

∂dnn

∑ ∂dn

∂yn

∂yn

∂b

Power 2

Sub

Add

Product

forward(x,y) → z
backward(x,y,dz) → dx,dy

Sub

Computation Graphs are our friends

C(w,b) = ∑cnn∈{0,1,2}

∂C

∂w=

∂cn

∂dnn

c = dd = y - ŷy = o + bo = wx

∂on

∂w∑

∂C

∂b

2

∂dn

∂yn

∂yn

∂on

= ∂cn

∂dnn

∑ ∂dn

∂yn

∂yn

∂b

Power 2

Sub

Add

Product

forward(x,y) : return x - y
backward(x,y,dz) : return dz, -dz

Sub
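In code, one graph operation is a pair of methods with exactly the slide's signature; a sketch for Sub:

```python
class Sub:
    @staticmethod
    def forward(x, y):
        return x - y

    @staticmethod
    def backward(x, y, dz):
        # dz is the gradient arriving from above; by the chain rule
        # d(x - y)/dx = 1 and d(x - y)/dy = -1
        return dz, -dz
```

The same pattern covers the Add, Product, and Power 2 boxes in the graph.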

Computation Graphs are our friends

C(w,b) = ∑cnn∈{0,1,2}

∂C

∂w=

∂cn

∂dnn

c = dd = y - ŷy = o + bo = wx

∂on

∂w∑

∂C

∂b

2

∂dn

∂yn

∂yn

∂on

= ∂cn

∂dnn

∑ ∂dn

∂yn

∂yn

∂b

Power 2

Sub

Add

Product

forward(x,y) : return x - y
backward(x,y,dz) : return dz, -dz

Sub

Computation Graphs are our friends

C(w,b) = ∑cnn∈{0,1,2}

∂C

∂w=

∂cn

∂dnn

c = dd = y - ŷy = o + bo = wx

∂on

∂w∑

∂C

∂b

2

∂dn

∂yn

∂yn

∂on

= ∂cn

∂dnn

∑ ∂dn

∂yn

∂yn

∂b

Power 2

Sub

Add

Product

forward(x,y) : return x - y
backward(x,y,dz) : return 1, -1

Sub ∂dn

∂ŷn

Computation Graphs are our friends

C(w,b) = ∑cnn∈{0,1,2}

∂C

∂w=

∂cn

∂dnn

c = dd = y - ŷy = o + bo = wx

∂on

∂w∑

∂C

∂b

2

∂dn

∂yn

∂yn

∂on

= ∂cn

∂dnn

∑ ∂dn

∂yn

∂yn

∂b

Power 2

Sub

Add

Product

o

w x

Product

Computation Graphs are our friends

C(w,b) = ∑cnn∈{0,1,2}

∂C

∂w=

∂cn

∂dnn

c = dd = y - ŷ

∂on

∂w∑

∂C

∂b

2

∂dn

∂yn

∂yn

∂on

= ∂cn

∂dnn

∑ ∂dn

∂yn

∂yn

∂b

Power 2

Sub

o

w x

Product

b

Add

y

Computation Graphs are our friends

C(w,b) = ∑cnn∈{0,1,2}

∂C

∂w=

∂cn

∂dnn

∂on

∂w∑

∂C

∂b

∂dn

∂yn

∂yn

∂on

= ∂cn

∂dnn

∑ ∂dn

∂yn

∂yn

∂b

Power 2

Sub

o

w x

Product

b

Add

y ŷ

d c

Computation Graphs are our friends

C(w,b) = ∑cnn∈{0}

∂C

∂w=

∂cn

∂dnn

∂on

∂w∑

∂C

∂b

∂dn

∂yn

∂yn

∂on

= ∂cn

∂dnn

∑ ∂dn

∂yn

∂yn

∂b

Power 2

Sub

o

w x

Product

b

Add

y ŷ

d c Id C

Computation Graphs are our friends

C(w,b) = ∑cnn∈{0}

∂C

∂w=

∂cn

∂dnn

∂on

∂w∑

∂C

∂b

∂dn

∂yn

∂yn

∂on

= ∂cn

∂dnn

∑ ∂dn

∂yn

∂yn

∂b

Power 2

Sub

o

w x

Product

b

Add

y ŷ

d c Id C

Input

Computation Graphs are our friends

C(w,b) = ∑cnn∈{0}

∂C

∂w=

∂cn

∂dnn

∂on

∂w∑

∂C

∂b

∂dn

∂yn

∂yn

∂on

= ∂cn

∂dnn

∑ ∂dn

∂yn

∂yn

∂b

Power 2

Sub

o

w x

Product

b

Add

y ŷ

d c Id C

Input

Parameters

Computation Graphs are our friendsPower 2

Sub

o

w x

Product

b

Add

y ŷ

d c Id C

16

52

2

Forward:
1 - Initialize inputs

Computation Graphs are our friendsPower 2

Sub

o

w x

Product

b

Add

y ŷ

d c Id C

16

52

2

Forward:
1 - Initialize inputs
2 - Initialize variables

Variables

Computation Graphs are our friendsPower 2

Sub

o

w x

Product

b

Add

y ŷ

d c Id C

16

52

2

Forward:1-Initialize inputs2-Initialize variables

Variables

2 values: x and dx

0,0

0,0

0,00,0 0,0

Computation Graphs are our friendsPower 2

Sub

o

w x

Product

b

Add

y ŷ

d c Id C

16

52

2

Forward:
1 - Initialize inputs
2 - Initialize variables
3 - Topologically sort variables

0,0

0,0

0,00,0 0,0

Computation Graphs are our friendsPower 2

Sub

o

w x

Product

b

Add

y ŷ

d c Id C

16

52

2

Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables

0,0

0,0

0,00,0 0,0

1st

2nd

3rd 4th 5th

Computation Graphs are our friendsPower 2

Sub

o

w x

Product

b

Add

y ŷ

d c Id C

16

52

2

Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables

10,0

0,0

0,00,0 0,0

1st

2nd

3rd4th 5th

Computation Graphs are our friendsPower 2

Sub

o

w x

Product

b

Add

y ŷ

d c Id C

16

52

2

Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables

10,0

12,0

0,00,0 0,0

1st

2nd

3rd4th 5th

Computation Graphs are our friendsPower 2

Sub

o

w x

Product

b

Add

y ŷ

d c Id C

16

52

2

Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables

0,0

0,0

0,00,0 0,0

1st

2nd

3rd4th 5th

Computation Graphs are our friendsPower 2

Sub

o

w x

Product

b

Add

y ŷ

d c Id C

16

52

2

Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables

0,0

2,0

0,00,0 0,0

1st

2nd

3rd4th 5th

Computation Graphs are our friendsPower 2

Sub

o

w x

Product

b

Add

y ŷ

d c Id C

16

52

2

Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables

10,0

2,0

0,00,0 0,0

1st

2nd

3rd4th 5th

Computation Graphs are our friendsPower 2

Sub

o

w x

Product

b

Add

y ŷ

d c Id C

16

52

2

Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables

0,0

0,0

0,00,0 0,0

Computation Graphs are our friendsPower 2

Sub

o

Add

y

d c Id CForward:

1-Initialize inputs2-Initialize variables3-Topological Sort variables

0,0

0,0

0,00,0 0,0

Computation Graphs are our friends

o

y

d c CForward:

1-Initialize inputs2-Initialize variables3-Topological Sort variables

0,0

0,0

0,00,0 0,0

1st

2nd

3rd4th 5th

Computation Graphs are our friendsPower 2

Sub

o

Add

y

d c Add CForward:

1-Initialize inputs2-Initialize variables3-Topological Sort variables

0,0

0,0

0,00,0 0,0

g0,0

Add

s 0,0

Computation Graphs are our friends

o

y

d c CForward:

1-Initialize inputs2-Initialize variables3-Topological Sort variables

0,0

0,0

0,00,0 0,0

g0,0

s 0,0

1st

2nd

3rd

4th

5th 6th 7th

Computation Graphs are our friendsPower 2

Sub

o

w x

Product

b

Add

y ŷ

d c Id C

16

52

2

Forward:
1 - Initialize inputs
2 - Initialize variables
3 - Topologically sort variables
4 - For each variable in topological order, run the forward method of all operations that link to them

0,0

0,0

0,00,0 0,0

1st

2nd

3rd

4th 5th

Computation Graphs are our friendsPower 2

Sub

o

w x

Product

b

Add

y ŷ

d c Id C

16

52

2

Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables4-For each variable in topological

order, run the forward method of all operations that link to them

10,0

0,0

0,00,0 0,0

1st

2nd

3rd

4th 5th

Computation Graphs are our friendsPower 2

Sub

o

w x

Product

b

Add

y ŷ

d c Id C

16

52

2

Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables4-For each variable in topological

order, run the forward method of all operations that link to them

10,0

12,0

0,00,0 0,0

1st

2nd

3rd

4th 5th

Computation Graphs are our friendsPower 2

Sub

o

w x

Product

b

Add

y ŷ

d c Id C

16

52

2

Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables4-For each variable in topological

order, run the forward method of all operations that link to them

10,0

12,0

-4,00,0 0,0

1st

2nd

3rd

4th 5th

Computation Graphs are our friendsPower 2

Sub

o

w x

Product

b

Add

y ŷ

d c Id C

16

52

2

Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables4-For each variable in topological

order, run the forward method of all operations that link to them

10,0

12,0

-4,016,0 0,0

1st

2nd

3rd

4th 5th

Computation Graphs are our friendsPower 2

Sub

o

w x

Product

b

Add

y ŷ

d c Id C

16

52

2

Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables4-For each variable in topological

order, run the forward method of all operations that link to them

10,0

12,0

-4,016,0

1st

2nd

3rd

4th 5th16,0

Computation Graphs are our friendsPower 2

Sub

o

w x

Product

b

Add

y ŷ

d c Id C

16

52

2

Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables4-For each variable in topological

order, run the forward method of all operations that link to them

5-Set gradients to final variables

10,0

12,0

-4,016,0

1st

2nd

3rd

4th 5th16,1

Computation Graphs are our friendsPower 2

Sub

o

w x

Product

b

Add

y ŷ

d c Id C

16

52

2

Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables4-For each variable in topological

order, run the forward method of all operations that link to them (Forward)

5-Set gradients to final variables6-run the operations backward method

in reverse order (Backward)10,0

12,0

-4,016,0

1st

2nd

3rd

4th 5th16,1

C = c ⇒ ∂C/∂c = 1

dc = dC · ∂C/∂c

Computation Graphs are our friendsPower 2

Sub

o

w x

Product

b

Add

y ŷ

d c Id C

16

52

2

Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables4-For each variable in topological

order, run the forward method of all operations that link to them (Forward)

5-Set gradients to final variables6-run the operations backward method

in reverse order (Backward)10,0

12,0

-4,016,1

1st

2nd

3rd

4th 5th16,1

∂C

∂c C=c =1

dc = dC ∂C

∂c

Computation Graphs are our friendsPower 2

Sub

o

w x

Product

b

Add

y ŷ

d c Id C

16

52

2

Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables4-For each variable in topological

order, run the forward method of all operations that link to them (Forward)

5-Set gradients to final variables6-run the operations backward method

in reverse order (Backward)10,0

12,0

-4,016,1

1st

2nd

3rd

4th 5th16,1

c = d² ⇒ ∂c/∂d = 2d

dd = dc · ∂c/∂d

Computation Graphs are our friendsPower 2

Sub

o

w x

Product

b

Add

y ŷ

d c Id C

16

52

2

Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables4-For each variable in topological

order, run the forward method of all operations that link to them (Forward)

5-Set gradients to final variables6-run the operations backward method

in reverse order (Backward)10,0

12,0

-4,016,1

1st

2nd

3rd

4th 5th16,1

c = d2

dd = dc ∂c

∂d

∂c

∂d= 2 x -4

Computation Graphs are our friendsPower 2

Sub

o

w x

Product

b

Add

y ŷ

d c Id C

16

52

2

Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables4-For each variable in topological

order, run the forward method of all operations that link to them (Forward)

5-Set gradients to final variables6-run the operations backward method

in reverse order (Backward)10,0

12,0

-4,016,1

1st

2nd

3rd

4th 5th16,1

c = d2

dd = dc ∂c

∂d

∂c

∂d= -8

Computation Graphs are our friendsPower 2

Sub

o

w x

Product

b

Add

y ŷ

d c Id C

16

52

2

Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables4-For each variable in topological

order, run the forward method of all operations that link to them (Forward)

5-Set gradients to final variables6-run the operations backward method

in reverse order (Backward)10,0

12,0

-4,-816,1

1st

2nd

3rd

4th 5th16,1

c = d2

dd = dc ∂c

∂d

∂c

∂d= -8

Computation Graphs are our friendsPower 2

Sub

o

w x

Product

b

Add

y ŷ

d c Id C

16

52

2

Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables4-For each variable in topological

order, run the forward method of all operations that link to them (Forward)

5-Set gradients to final variables6-run the operations backward method

in reverse order (Backward)10,0

12,0

-4,-816,1

1st

2nd

3rd

4th 5th16,1

d = y - ŷ ∂d

∂y= 1

Computation Graphs are our friendsPower 2

Sub

o

w x

Product

b

Add

y ŷ

d c Id C

16

52

2

Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables4-For each variable in topological

order, run the forward method of all operations that link to them (Forward)

5-Set gradients to final variables6-run the operations backward method

in reverse order (Backward)10,0

12,-8

-4,-816,1

1st

2nd

3rd

4th 5th16,1

d = y - ŷ ∂d

∂y= 1

dy = dd ∂d

∂y

Computation Graphs are our friendsPower 2

Sub

o

w x

Product

b

Add

y ŷ

d c Id C

16

52

2

Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables4-For each variable in topological

order, run the forward method of all operations that link to them (Forward)

5-Set gradients to final variables6-run the operations backward method

in reverse order (Backward)10,-8

12,-8

-4,-816,1

1st

2nd

3rd

4th 5th16,1

y = o + b

∂y

∂o= 1

do = dy ∂y

∂o

Computation Graphs are our friendsPower 2

Sub

o

w x

Product

b

Add

y ŷ

d c Id C

16

52

2

Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables4-For each variable in topological

order, run the forward method of all operations that link to them (Forward)

5-Set gradients to final variables6-run the operations backward method

in reverse order (Backward)10,-8

12,-8

-4,-816,1

1st

2nd

3rd

4th 5th16,1

y = o + b

∂y

∂o= 1

∂y

∂b= 1

bt+1 = b - dy ∂y

∂b

Computation Graphs are our friendsPower 2

Sub

o

w x

Product

b

Add

y ŷ

d c Id C

16

52

2

Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables4-For each variable in topological

order, run the forward method of all operations that link to them (Forward)

5-Set gradients to final variables6-run the operations backward method

in reverse order (Backward)10,-8

12,-8

-4,-816,1

1st

2nd

3rd

4th 5th16,1

y = o + b

∂y

∂o= 1

∂y

∂b= 1

bt+1 = b - dy ∂y

∂b

Computation Graphs are our friendsPower 2

Sub

o

w x

Product

b

Add

y ŷ

d c Id C

16

52

2

Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables4-For each variable in topological

order, run the forward method of all operations that link to them (Forward)

5-Set gradients to final variables6-run the operations backward method

in reverse order (Backward)10,-8

12,-8

-4,-816,1

1st

2nd

3rd

4th 5th16,1

y = o + b

∂y

∂o= 1

∂y

∂b= 1

bt+1 = b - ∂c

∂d

∂d∂y

∂y∂b

∂C

∂c

Computation Graphs are our friendsPower 2

Sub

o

w x

Product

b

Add

y ŷ

d c Id C

16

52

2

Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables4-For each variable in topological

order, run the forward method of all operations that link to them (Forward)

5-Set gradients to final variables6-run the operations backward method

in reverse order (Backward)10,-8

12,-8

-4,-816,1

1st

2nd

3rd

4th 5th16,1

y = o + b

∂y

∂o= 1

∂y

∂b= 1

bt+1 = b - ∂C

∂b

Computation Graphs are our friendsPower 2

Sub

o

w x

Product

b

Add

y ŷ

d c Id C

16

52

2

Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables4-For each variable in topological

order, run the forward method of all operations that link to them (Forward)

5-Set gradients to final variables6-run the operations backward method

in reverse order (Backward)10,-8

12,-8

-4,-816,1

1st

2nd

3rd

4th 5th16,1

o = wx

∂o

∂w= x

wt+1 = w - do ∂o

∂w

Computation Graphs are our friendsPower 2

Sub

o

w x

Product

b

Add

y ŷ

d c Id C

16

52.8

2.2

Forward:
1 - Initialize inputs
2 - Initialize variables
3 - Topologically sort variables
4 - For each variable in topological order, run the forward method of all operations that link to them (Forward)
5 - Set gradients on the final variables
6 - Run the operations' backward methods in reverse order (Backward)
7 - Update parameters

10,-8

12,-8

-4,-816,1

1st

2nd

3rd

4th 5th16,1

o = wx

∂o

∂w= x

wt+1 = w - do ∂o

∂w

Computation Graphs are our friendsPower 2

Sub

o

w x

Product

b

Add

y ŷ

d c Id C

16

52.8

2.210,-8

12,-8

-4,-816,1 16,1

o = wx

∂o

∂w= x

wt+1 = w - do ∂o

∂w
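The whole forward/backward pass for one data point (x = 5, target ŷ = 16, w = b = 2) can be written out by hand; this sketch reproduces the values annotated on the graph:

```python
w, b = 2.0, 2.0
x, yhat = 5.0, 16.0   # input and target

# forward pass: o = w*x, y = o + b, d = y - yhat, c = d^2
o = w * x             # 10
y = o + b             # 12
d = y - yhat          # -4
c = d ** 2            # 16, the cost C for this example

# backward pass, in reverse topological order
dc = 1.0              # dC/dc, since C = c
dd = dc * 2 * d       # dc/dd = 2d
dy = dd * 1.0         # dd/dy = 1
do = dy * 1.0         # dy/do = 1
db = dy * 1.0         # dy/db = 1
dw = do * x           # do/dw = x
```

This gives dw = -40 and db = -8, the gradients used in the parameter update.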

Existing Tools:
- Tensorflow ( https://www.tensorflow.org )
- Torch ( https://github.com/torch/nn )
- CNN ( https://github.com/clab/cnn )
- JNN ( https://github.com/wlin12/JNN )
- Theano ( http://deeplearning.net/software/theano/ )

Into Deep Learning

Nonlinear Neural Modelsy = 4x-4

Data

0

1

16

5

20

6

?

3

Nonlinear Neural Models

Data

0

1

16

5

20

6

?

3

There is a limit to the number of bananas I can give you

Nonlinear Neural Models

n x y

0 1 0

1 5 16

2 6 20

Data

x

y y = 4x-4

Nonlinear Neural Models

n x y

0 1 0

1 5 16

2 6 20

3 9 20

4 11 20

Data

x

y y = 4x-4

Nonlinear Neural Models

n x y

0 1 0

1 5 16

2 6 20

3 9 20

4 11 20

Data

x

y y = 2x+3

Model Problem

Nonlinear Neural Models

n x y

0 1 0

1 5 16

2 6 20

3 9 20

4 11 20

Data

x

y y = 2x+3

Model Problem

Underfitting

Nonlinear Neural Models

n x y

0 1 0

1 5 16

2 6 20

3 9 20

4 11 20

Data

x

y y = ???

Can we learn arbitrary functions?

Nonlinear Neural Models

y = (w1x + b1)s1 + (w2x+b2)s2

Use different linear functions depending on the value of x?

Nonlinear Neural Models

y = (w1x + b1)s1 + (w2x+b2)s2
s1 = 1 if x < 6 and 0 otherwise
s2 = 1 if x >= 6 and 0 otherwise

Nonlinear Neural Models

y = (w1x + b1)s1 + (w2x+b2)s2

n x y

0 1 0

1 5 16

2 6 20

3 9 20

4 11 20

Data

y = (4x - 4)s1 + (0x+20)s2

s1 = 1 if x < 6 and 0 otherwise
s2 = 1 if x >= 6 and 0 otherwise
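With hard 0/1 switches, the piecewise model can be checked directly; a sketch:

```python
# y = (4x - 4)*s1 + (0x + 20)*s2 with s1 = [x < 6], s2 = [x >= 6]
def model(x):
    s1 = 1 if x < 6 else 0
    s2 = 1 if x >= 6 else 0
    return (4 * x - 4) * s1 + (0 * x + 20) * s2
```

It reproduces every point in the data table.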

Nonlinear Neural Models

s = σ(wx + b)

σ(t) = 1 / (1 + e^(-t))

Nonlinear Neural Models

s = (1000x)

x = 0.1 then (1000x) = 1

x = -0.1 then (1000x) = 0
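A note of caution when trying this: with arguments like ±3000, a naive 1/(1 + e^(-t)) overflows in floating point, so a numerically stable sigmoid is needed; a sketch:

```python
import math

def sigmoid(t):
    # stable in both tails: never calls exp on a large positive argument
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    z = math.exp(t)
    return z / (1.0 + z)
```

sigmoid(1000 * 0.1) is 1 and sigmoid(1000 * -0.1) is 0 to machine precision, as the slide claims.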


Nonlinear Neural Models

s = σ(1000x - 6000)

x = 6.1 then σ(1000x - 6000) ≈ 1

x = 5.9 then σ(1000x - 6000) ≈ 0

Nonlinear Neural Models

y = (w1x + b1)s1 + (w2x+b2)s2

s1 = σ(w3x + b3)
s2 = σ(w4x + b4)

Nonlinear Neural Models

n x y

0 1 0

1 5 16

2 6 20

3 9 20

4 11 20

Data y = (4x - 4)s1 + (0x+20)s2

s1 = σ(-1000x + 6000)
s2 = σ(1000x - 6000)

Nonlinear Neural Models

n x y

0 1 0

1 5 16

2 6 20

3 9 20

4 11 20

Data y = (4x - 4)s1 + (0x+20)s2

s1 = σ(-1000x + 6000)
s2 = σ(1000x - 6000)

Nonlinear Neural Models

n x y

0 1 0

1 5 16

2 6 20

3 9 20

4 11 20

Data y = (16)s1 + (0x+20)s2

s1 = (-1000x + 6000)s2 = (1000x - 6000)

Nonlinear Neural Models

n x y

0 1 0

1 5 16

2 6 20

3 9 20

4 11 20

Data y = (16)s1 + (20)s2

s1 = (-1000x + 6000)s2 = (1000x - 6000)

Nonlinear Neural Models

n x y

0 1 0

1 5 16

2 6 20

3 9 20

4 11 20

Data y = (16)s1 + (20)s2

s1 = (1000)s2 = (1000x - 6000)

Nonlinear Neural Models

n x y

0 1 0

1 5 16

2 6 20

3 9 20

4 11 20

Data y = (16)s1 + (20)s2

s1 = (1000)s2 = (-1000)

Nonlinear Neural Models

n x y

0 1 0

1 5 16

2 6 20

3 9 20

4 11 20

Data y = (16)1 + (20)0

s1 = (1000)s2 = (-1000)

n x y

0 1 0

1 5 16

2 6 20

3 9 20

4 11 20

Data y = 16

s1 = σ(1000)
s2 = σ(-1000)

Nonlinear Neural Models

n x y

0 1 0

1 5 16

2 6 20

3 9 20

4 11 20

Data y = (4x - 4)s1 + (0x+20)s2

s1 = σ(-1000x + 6000)
s2 = σ(1000x - 6000)

Nonlinear Neural Models

n x y

0 1 0

1 5 16

2 6 20

3 9 20

4 11 20

Data y = (32)s1 + (0x+20)s2

s1 = σ(-1000x + 6000)
s2 = σ(1000x - 6000)

Nonlinear Neural Models

n x y

0 1 0

1 5 16

2 6 20

3 9 20

4 11 20

Data y = (32)s1 + (20)s2

s1 = σ(-1000x + 6000)
s2 = σ(1000x - 6000)

Nonlinear Neural Models

n x y

0 1 0

1 5 16

2 6 20

3 9 20

4 11 20

Data y = (32)s1 + (20)s2

s1 = σ(-3000)
s2 = σ(1000x - 6000)

Nonlinear Neural Models

n x y

0 1 0

1 5 16

2 6 20

3 9 20

4 11 20

Data y = (32)s1 + (20)s2

s1 = σ(-3000)
s2 = σ(3000)

Nonlinear Neural Models

n x y

0 1 0

1 5 16

2 6 20

3 9 20

4 11 20

Data y = (32)0 + (20)1

s1 = σ(-3000)
s2 = σ(3000)

Nonlinear Neural Models

n x y

0 1 0

1 5 16

2 6 20

3 9 20

4 11 20

Data y = 20

s1 = σ(-3000)
s2 = σ(3000)

Nonlinear Neural Models

Data

0

1

16

5

20

6

?

3

If you give me too many apples, I will give them to...

Nonlinear Neural Models

Data

0

1

16

5

20

6

?

3

Count Von Count

Nonlinear Neural Models

Multilayer Perceptrons

n x y

0 1 0

1 5 16

2 6 20

3 9 20

4 11 20

Data

x

y y = (4x - 4)s1 + (0x+20)s2

n x y

0 1 0

1 5 16

2 6 20

3 9 20

4 11 20

5 15 1

6 19 1

Data

x

y y = (4x - 4)s1 + (0x+20)s2

Multilayer Perceptrons

n x y

0 1 0

1 5 16

2 6 20

3 9 20

4 11 20

5 15 1

6 19 1

Data

y = (4x - 4)s1 + (0x+20)s2 + (0x+1)s3

s1 = σ(-1000x + 6000)
s2 = ????
s3 = σ(1000x - 15000)

Multilayer Perceptrons

n x y

0 1 0

1 5 16

2 6 20

3 9 20

4 11 20

5 15 1

6 19 1

Data

y = (4x - 4)s1 + (0x+20)s2 + (0x+1)s3

s1 = σ(-1000x + 6000)
s2 = not s1 and not s3
s3 = σ(1000x - 15000)

Multilayer Perceptrons

y = (w1x + b1)s1 + (w2x+b2)s2 + (w3x+b3)s3

s1 = σ(w4x + b4)
s2 = σ(w5s1 + w6s3 + b5)
s3 = σ(w7x + b6)

Multilayer Perceptrons

y = (w1x + b1)s1 + (w2x+b2)s2 + (w3x+b3)s3

s1 = σ(w4x + b4)
s2 = σ(w5s1 + w6s3 + b5)
s3 = σ(w7x + b6)

Layer 1 Perceptron

Layer 1 Perceptron

Multilayer Perceptrons

y = (w1x + b1)s1 + (w2x+b2)s2 + (w3x+b3)s3

s1 = σ(w4x + b4)
s2 = σ(w5s1 + w6s3 + b5)
s3 = σ(w7x + b6)

Layer 2 Perceptron

Layer 1 Perceptron

Layer 1 Perceptron

Multilayer Perceptrons

n x y

0 1 0

1 5 16

2 6 20

3 9 20

4 11 20

5 15 1

6 19 1

Data

y = (4x - 4)s1 + (0x+20)s2 + (0x+1)s3

s1 = σ(-1000x + 6000)
s2 = not s1 and not s3
s3 = σ(1000x - 15000)

Multilayer Perceptrons

n x y

0 1 0

1 5 16

2 6 20

3 9 20

4 11 20

5 15 1

6 19 1

Data

y = (4x - 4)s1 + (0x+20)s2 + (0x+1)s3

s1 = σ(-1000x + 6000)
s2 = σ(-1000s1 - 1000s3 + 500)
s3 = σ(1000x - 15000)

Multilayer Perceptrons

n x y

0 1 0

1 5 16

2 6 20

3 9 20

4 11 20

5 15 1

6 19 1

Data

y = (40)s1 + (20)s2 + (1)s3

s1 = σ(-1000x + 6000)
s2 = σ(-1000s1 - 1000s3 + 500)
s3 = σ(1000x - 15000)

Multilayer Perceptrons

n x y

0 1 0

1 5 16

2 6 20

3 9 20

4 11 20

5 15 1

6 19 1

Data

y = (40)s1 + (20)s2 + (1)s3

s1 = σ(-5000) = 0
s2 = σ(-1000s1 - 1000s3 + 500)
s3 = σ(-4000) = 0

Multilayer Perceptrons

n x y

0 1 0

1 5 16

2 6 20

3 9 20

4 11 20

5 15 1

6 19 1

Data

y = (40)s1 + (20)s2 + (1)s3

s1 = σ(-5000) = 0
s2 = σ(-1000s1 - 1000s3 + 500)
s3 = σ(-4000) = 0

Multilayer Perceptrons

n x y

0 1 0

1 5 16

2 6 20

3 9 20

4 11 20

5 15 1

6 19 1

Data

y = (40)s1 + (20)s2 + (1)s3

s1 = σ(-5000) = 0
s2 = σ(500)
s3 = σ(-4000) = 0

Multilayer Perceptrons

n x y

0 1 0

1 5 16

2 6 20

3 9 20

4 11 20

5 15 1

6 19 1

Data

y = (40)s1 + (20)s2 + (1)s3

s1 = σ(-5000) = 0
s2 = σ(500) = 1
s3 = σ(-4000) = 0

Multilayer Perceptrons

n x y

0 1 0

1 5 16

2 6 20

3 9 20

4 11 20

5 15 1

6 19 1

Data

y = (40)0 + (20)1 + (1)0

s1 = σ(-5000) = 0
s2 = σ(500) = 1
s3 = σ(-4000) = 0

Multilayer Perceptrons

n x y

0 1 0

1 5 16

2 6 20

3 9 20

4 11 20

5 15 1

6 19 1

Data

y = 20

s1 = σ(-5000) = 0
s2 = σ(500) = 1
s3 = σ(-4000) = 0

Multilayer Perceptrons

n x y

0 1 0

1 5 16

2 6 20

3 9 20

4 11 20

5 15 1

6 19 1

Data

y = (4x - 4)s1 + (0x+20)s2 + (0x+1)s3

s1 = σ(-1000x + 6000)
s2 = σ(-1000s1 - 1000s3 + 500)
s3 = σ(1000x - 15000)

Multilayer Perceptrons

n x y

0 1 0

1 5 16

2 6 20

3 9 20

4 11 20

5 15 1

6 19 1

Data

y = (72)s1 + (20)s2 + (1)s3

s1 = σ(-1000x + 6000)
s2 = σ(-1000s1 - 1000s3 + 500)
s3 = σ(1000x - 15000)

Multilayer Perceptrons

n x y

0 1 0

1 5 16

2 6 20

3 9 20

4 11 20

5 15 1

6 19 1

Data

y = (72)s1 + (20)s2 + (1)s3

s1 = σ(-13000) = 0
s2 = σ(-1000s1 - 1000s3 + 500)
s3 = σ(4000) = 1

Multilayer Perceptrons

n x y

0 1 0

1 5 16

2 6 20

3 9 20

4 11 20

5 15 1

6 19 1

Data

y = (72)s1 + (20)s2 + (1)s3

s1 = σ(-13000) = 0
s2 = σ(0 - 1000 + 500)
s3 = σ(4000) = 1

Multilayer Perceptrons

n x y

0 1 0

1 5 16

2 6 20

3 9 20

4 11 20

5 15 1

6 19 1

Data

y = (72)s1 + (20)s2 + (1)s3

s1 = σ(-13000) = 0
s2 = σ(-500) = 0
s3 = σ(4000) = 1

Multilayer Perceptrons

n x y

0 1 0

1 5 16

2 6 20

3 9 20

4 11 20

5 15 1

6 19 1

Data

y = (72)0 + (20)0 + (1)1

s1 = σ(-13000) = 0
s2 = σ(-500) = 0
s3 = σ(4000) = 1

Multilayer Perceptrons

n x y

0 1 0

1 5 16

2 6 20

3 9 20

4 11 20

5 15 1

6 19 1

Data

y = 1

s1 = σ(-13000) = 0
s2 = σ(-500) = 0
s3 = σ(4000) = 1

Multilayer Perceptrons

n x y

0 1 0

1 5 16

2 6 20

3 9 20

4 11 20

5 15 1

6 19 1

Data

x

y = (4x - 4)s1 + (0x+20)s2 + (0x+1)s3
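The fitted model walked through above can be evaluated directly. A sketch using the slide's weights (note that x = 15 sits exactly on the s3 threshold, so the fit at that boundary point is only approximate):

```python
import math

def sigmoid(t):
    # numerically stable logistic function
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def mlp(x):
    # gates: s1 fires for x < 6, s3 for x > 15, s2 when neither fires
    s1 = sigmoid(-1000 * x + 6000)
    s3 = sigmoid(1000 * x - 15000)
    s2 = sigmoid(-1000 * s1 - 1000 * s3 + 500)
    return (4 * x - 4) * s1 + 20 * s2 + 1 * s3

for x, y in [(1, 0), (5, 16), (6, 20), (9, 20), (11, 20), (19, 1)]:
    print(x, round(mlp(x), 3))
```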

Multilayer Perceptrons

y = (w1x + b1)s1 + (w2x+b2)s2 + (w3x+b3)s3

s1 = σ(w4x + b4)
s2 = σ(w5s1 + w6s3 + b5)
s3 = σ(w7x + b6)

Layer 2 Perceptron

Layer 1 Perceptron

Layer 1 Perceptron

Multilayer Perceptrons

y = (w1x + b1)s1 + (w2x+b2)s2 + (w3x+b3)s3

s1 = σ(w4x + b4)
s2 = σ(w5s1 + w6s3 + b5)
s3 = σ(w7x + b6)

x

s1

s3

s2

w4x

b4

Multilayer Perceptrons

y = (w1x + b1)s1 + (w2x+b2)s2 + (w3x+b3)s3

s1 = σ(w4x + b4)
s2 = σ(w5s1 + w6s3 + b5)
s3 = σ(w7x + b6)

x

s2

w4x

b4

w7x

b5

s1

s3

Multilayer Perceptrons

y = (w1x + b1)s1 + (w2x+b2)s2 + (w3x+b3)s3

s1 = σ(w4x + b4)
s2 = σ(w5s1 + w6s3 + b5)
s3 = σ(w7x + b6)

x

s2

s1

s3

w6s3w5s1

b5

Multilayer Perceptrons

y = (w1x + b1)s1 + (w2x+b2)s2 + (w3x+b3)s3

x

s2

s1

s3x < 6 x > 15

!(x > 15) & !(x < 6)

Multilayer Perceptrons

y = (w1x + b1)s1 + (w2x+b2)s2 + (w3x+b3)s3

x

s2

s1

s3x < 6 x > 15

x∈[6,15]

Multilayer Perceptrons

x

s2

s1

s3x < 6 x > 15

x∈[6,15]

s4

x∈]-∞,6] & ]15,∞]

Multilayer Perceptrons

x

s5

s1

s2x < 6 x > 15

x∈[6,15]

s3 x > 2

s4 x < 3

s7

s6

s7

x∈]-∞,6] & ]15,∞] x∈[2,15] x∈[2,3]

Multilayer Perceptrons

x

s5

s1

s2x < 6 x > 15

x∈[6,15]

s3 x > 2

s4 x < 3

s7

s6

s7

x∈]-∞,6] & ]15,∞] x∈[2,15] x∈[2,3]

Input

Layer 1 (Input Features)

Layer 2 (And and Or Combinations)

Multilayer Perceptrons

x

s5

s1

s2x < 6 x > 15

x∈[6,15]

s3 x > 2

s4 x < 3

s7

s6

s7

x∈]-∞,6] & ]15,∞] x∈[2,15] x∈[2,3]

Input

Layer 1 (Input Features)

Layer 2 (And and Or Combinations)

And(s1,s2) = σ(1000s1 + 1000s2 - 1500)
Or(s1,s2) = σ(1000s1 + 1000s2 - 500)
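The And/Or constructions can be verified against their truth tables. A sketch (again with my own stable sigmoid helper):

```python
import math

def sigmoid(t):
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def And(a, b):
    # fires only when both inputs are ~1: 1000 + 1000 - 1500 = 500 > 0
    return sigmoid(1000 * a + 1000 * b - 1500)

def Or(a, b):
    # fires when at least one input is ~1: 1000 - 500 = 500 > 0
    return sigmoid(1000 * a + 1000 * b - 500)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, round(And(a, b)), round(Or(a, b)))
```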

Multilayer Perceptrons

x

s5

s1

s2

s3

s4

s7

s6

s7

Input

Layer 1 (Input Features)

Layer 2 (And and Or Combinations)

Layer 3 (Xor Combinations)

s8

s9

sa

sb

Multilayer Perceptrons

x

s5

s1

s2

s3

s4

s7

s6

s7

Input

Layer 1 (Input Features)

Layer 2 (And and Or Combinations)

Layer 3 (Xor Combinations)

s8

s9

sa

sb

Xor(s1,s2) = Or(And(s1,!s2), And(!s1,s2))
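The Xor composition can be checked the same way. The `Not` gate below is my own assumption (built with the same steep-sigmoid construction; it is not given on the slides):

```python
import math

def sigmoid(t):
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def And(a, b):
    return sigmoid(1000 * a + 1000 * b - 1500)

def Or(a, b):
    return sigmoid(1000 * a + 1000 * b - 500)

def Not(a):
    # hypothetical negation gate, same construction: fires when a is ~0
    return sigmoid(-1000 * a + 500)

def Xor(a, b):
    # Xor(s1,s2) = Or(And(s1,!s2), And(!s1,s2)), as on the slide
    return Or(And(a, Not(b)), And(Not(a), b))

for a in (0, 1):
    for b in (0, 1):
        print(a, b, round(Xor(a, b)))
```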

Multilayer Perceptrons

x

s5

s1

s2

s3

s4

s7

s6

s7

Input

Layer 1 (Input Features)

Layer 2 (And and Or Combinations)

Layer 3 (Xor Combinations)

s8

s9

sa

sb

Xor(s1,s2) = Or(s5, s6)

Multilayer Perceptrons

n x y

0 1 0

1 5 16

2 6 20

3 9 20

4 11 20

5 15 1

6 19 1

Data

x

y

Universal approximator

Multilayer Perceptrons

n x y

0 1 0

1 5 16

2 6 20

3 9 20

4 11 20

5 15 1

6 19 1

Data

x

y

but...

Multilayer Perceptrons

n x y

0 1 0

1 5 16

2 6 20

3 9 20

4 11 20

5 15 1

6 19 1

Data

x

y

No guarantee that the best function will be found

Multilayer Perceptrons

x

s5

s1

s2x > 1 x < 2

x∈]-∞,1]

s3 x < 5

s4 x < 6

s7

s6

x∈[5,6[ x∈[6,∞]

n x y

0 1 0

1 5 16

2 6 20

y

Multilayer Perceptrons

x

s5

s1

s2x > 1 x < 2

x∈]-∞,1]

s3 x < 5

s4 x < 6

s7

s6

x∈[5,6[ x∈[6,∞]

n x y

0 1 0

1 5 16

2 6 20

y = 0s5 + 16s6 + 20s7

y

Multilayer Perceptrons

x

s5

s1

s2x > 1 x < 2

x∈]-∞,1]

s3 x < 5

s4 x < 6

s7

s6

x∈[5,6[ x∈[6,∞]

n x y

0 1 0

1 5 16

2 6 20

y

y = 0s5 + 16s6 + 20s7

Multilayer Perceptrons

x

s5

s1

s2x > 1 x < 2

x∈]-∞,1]

s3 x < 5

s4 x < 6

s7

s6

x∈[5,6[ x∈[6,∞]

n x y

0 1 0

1 5 16

2 6 20

y

y = 0s5 + 16s6 + 20s7

Multilayer Perceptrons

x

s5

s1

s2x > 1 x < 2

x∈]-∞,1]

s3 x < 5

s4 x < 6

s7

s6

x∈[5,6[ x∈[6,∞]

n x y

0 1 0

1 5 16

2 6 20

Overfitting

y = 0s5 + 16s6 + 20s7

Multilayer Perceptrons

y

Model Problem

Task Complexity

Model Complexity

Multilayer Perceptrons

Task Complexity

Model Complexity

Underfitting

Multilayer Perceptrons

Task Complexity

Model Complexity

Overfitting

Underfitting

Multilayer Perceptrons

Task Complexity

Model Complexity

Overfitting

Underfitting

Happy Zone

Multilayer Perceptrons

Task Complexity

Model Complexity

Overfitting

Underfitting

Happy Zone

Linear Regression · MLP 1 Layer · MLP 2 Layer · MLP 3 Layer

Multilayer Perceptrons

Task Complexity

Model Complexity

Overfitting

Underfitting

Happy Zone

Linear Regression · Linear Regression + more features

Multilayer Perceptrons

Task Complexity

Model Complexity

Overfitting

Underfitting

Happy Zone

Linear Regression · MLP 1 Layer · MLP 2 Layer · MLP 3 Layer

Multilayer Perceptrons

Task Complexity

Model Complexity

Overfitting

Underfitting

Happy Zone

Linear Regression · MLP 1 Layer · MLP 2 Layer · MLP 3 Layer

Sentiment analysis

Multilayer Perceptrons

Task Complexity

Model Complexity

Overfitting

Underfitting

Happy Zone

Linear Regression · MLP 1 Layer · MLP 2 Layer · MLP 3 Layer

Sentiment analysis

Machine Translation

Multilayer Perceptrons

Task Complexity

Model Complexity

Overfitting

Underfitting

Happy Zone

Data

Multilayer Perceptrons

Task Complexity

Model Complexity

Overfitting

Underfitting

Happy Zone

Data

Multilayer Perceptrons

Task Complexity

Model Complexity

Overfitting

Underfitting

Happy Zone

Data

Multilayer Perceptrons

n x y

0 1 0

1 5 16

2 6 20

y y

Multilayer Perceptrons

n x y

0 1 0

1 5 16

2 6 20

3 2 4

y y

Multilayer Perceptrons

n x y

0 1 0

1 5 16

2 6 20

3 2 4

y y

Multilayer Perceptrons

Task Complexity

Model Complexity

Overfitting

Underfitting

Happy Zone

Model Bias

Multilayer Perceptrons

Task Complexity

Model Complexity

Overfitting

Underfitting

Happy Zone

Model Bias:
L1 & L2 Regularization
Stochastic Dropout (Srivastava et al, 2014)
Model Structure (CNNs, RNNs)

Multilayer Perceptrons

Regularization

C(w,b) = ∑n (yn - ŷn)² + β(w² + b²)

β = Regularization constant
n ∈ {0,1,2}
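The regularized cost can be sketched for the simple linear model y = wx + b, using the first three data points from the deck (the squared L2 penalty form is assumed):

```python
import numpy as np

def cost(w, b, xs, ys, beta):
    # squared error plus an L2 penalty that favors small weights
    preds = w * xs + b
    return np.sum((ys - preds) ** 2) + beta * (w ** 2 + b ** 2)

xs = np.array([1.0, 5.0, 6.0])
ys = np.array([0.0, 16.0, 20.0])
print(cost(4.0, -4.0, xs, ys, beta=0.0))  # perfect fit: error term is 0
print(cost(4.0, -4.0, xs, ys, beta=0.1))  # same fit, but weights are penalized
```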

Multilayer Perceptrons

x

s5

s1

s2x > 1 x < 2

x∈]-∞,1]

s3 x < 5

s4 x < 6

s7

s6

x∈[5,6[ x∈[6,∞]

y

Regularization

Multilayer Perceptrons

x

s5

s1

s2x > 1 nothing

x∈]-∞,1]

s3 nothing

s4 x < 6

s7

s6

nothing x∈[6,∞]

y

Regularization

Multilayer Perceptrons

x

s5

s1

s2x > 1 nothing

x∈]-∞,1]

s3 nothing

s4 x < 6

s7

s6

nothing x∈[6,∞]

y

Regularization

Find solutions that require less effort

Multilayer Perceptrons

x

s5

s1

s2x > 1 x < 2

x∈]-∞,1]

s3 x < 5

s4 x < 6

s7

s6

x∈[5,6[ x∈[6,∞]

y

Stochastic Dropout (Srivastava et al, 2014)

Multilayer Perceptrons

Stochastic Dropout (Srivastava et al, 2014)

x

s5

s1

s2x > 1 0

x∈]-∞,1]

s3 x < 5

s4 x < 6

s7

s6

0 0

y

Multilayer Perceptrons

Stochastic Dropout (Srivastava et al, 2014)

x

s5

s1

s2x > 1 0

x∈]-∞,1]

s3 x < 5

s4 x < 6

s7

s6

0 0

y

Find robust models
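Dropout can be sketched as a random mask over the hidden units. This uses the common "inverted dropout" rescaling, which is my choice here rather than something stated on the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p=0.5, train=True):
    # during training, zero each hidden unit with probability p and
    # rescale the survivors so the expected activation is unchanged
    if not train:
        return h
    mask = rng.random(h.shape) >= p
    return h * mask / (1.0 - p)

h = np.ones(8)
print(dropout(h, p=0.5))        # some units zeroed, survivors scaled to 2.0
print(dropout(h, train=False))  # at test time the layer is untouched
```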

Multilayer Perceptrons

Model Structure

Weighted sum of linear functions VS MLP

y = (w1x + b1)s1 + (w2x+b2)s2 + (w3x+b3)s3

Multilayer Perceptrons

Model Structure

Weighted sum of linear functions VS MLP

y = (w1x + b1)s1 + (w2x+b2)s2 + (w3x+b3)s3

Convolutional Vs RNNs

Multilayer Perceptrons

s1 = σ(w4x + b4)
s2 = σ(w5s1 + w6s3 + b5)
s3 = σ(w7x + b6)

x

s2

s1

s3

w6s3w5s1

b5

Representation

Multilayer Perceptrons

s1 = σ(W3x + b3)
s2 = σ(W4s1 + b4)

Representation

s1

s2

2

1

1xx

s2

s1

s3

Multilayer Perceptrons

Representation

s1

s2

1000

1000

100x

s1 = σ(Ws2 + b)

Multilayer Perceptrons

Representation

s1

s2

1000

1000

100x

s1 = σ(Ws2 + b)

TensorFlow Code

s1 = tf.matmul(x, W1) + b1

s1 = tf.nn.sigmoid(s1)

s2 = tf.matmul(s1, W2) + b2

s2 = tf.nn.sigmoid(s2)
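The TF1-style lines above assume `x`, `W1`, `b1`, `W2`, `b2` are already defined. An equivalent plain-NumPy forward pass, with toy layer sizes of my own choosing, looks like this:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# toy sizes standing in for the 100 -> 1000 -> 1000 layers sketched above
x = rng.standard_normal((1, 100))
W1, b1 = 0.1 * rng.standard_normal((100, 1000)), np.zeros(1000)
W2, b2 = 0.1 * rng.standard_normal((1000, 1000)), np.zeros(1000)

s1 = sigmoid(x @ W1 + b1)   # s1 = sigma(W1 x + b1)
s2 = sigmoid(s1 @ W2 + b2)  # s2 = sigma(W2 s1 + b2)
print(s2.shape)  # (1, 1000)
```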

Multilayer Perceptrons

Using Discrete Variables

Data

0

1

16

5

20

6

?

3

Using Discrete Variables

Data

0

1

16

5

20

6

?

3

?

Using Discrete Variables

x

s5

s1

s2

s3

s4

s7

s6

y

Number of fruit to offer

Number of fruit received

Using Discrete Variables

x

y

Number of fruit to offer

Number of fruit received

s1

s2

Using Discrete Variables

x

y

Number of fruit to offer

uType of fruit to offer

v Number of fruit receivedType of fruit received

s1

s2

Using Discrete Variables

x

y

Number of fruit to offer

uType of fruit to offer

v Number of fruit receivedType of fruit received

s1

s2

u∈{Apple, Banana, Coconut}

v∈{Apple, Banana, Coconut}

Using Discrete Variables

Lookup Tables

e1 e2 e3 e4

Apple 0.1 -0.4 0.2 0.5

Banana 0.4 1.4 -1.0 0.1

Coconut 1.1 0.9 1.1 0.5

u

V = 3

Using Discrete Variables

Lookup Tables

e1 e2 e3 e4

Apple 0.1 -0.4 0.2 0.5

Banana 0.4 1.4 -1.0 0.1

Coconut 1.1 0.9 1.1 0.5

u

V = 3

Using Discrete Variables

Lookup Tables

e1 e2 e3 e4

Apple 0.1 -0.4 0.2 0.5

Banana 0.4 1.4 -1.0 0.1

Coconut 1.1 0.9 1.1 0.5

u

Embedding for u Size = 4

V = 3

Using Discrete Variables

Lookup Tables

e1 e2 e3 e4

Apple 0.1 -0.4 0.2 0.5

Banana 0.4 1.4 -1.0 0.1

Coconut 1.1 0.9 1.1 0.5

u

Embedding for u

Banana

Size = 4

V = 3

Using Discrete Variables

Lookup Tables

e1 e2 e3 e4

0 0.1 -0.4 0.2 0.5

1 0.4 1.4 -1.0 0.1

2 1.1 0.9 1.1 0.5

u

Embedding for u

1

Size = 4

V = 3

Using Discrete Variables

Lookup Tables

u

Embedding for u

1

Lookup

Size = 4
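A lookup table is just row selection in an embedding matrix. A sketch with the V = 3, size-4 table from the slides:

```python
import numpy as np

# lookup table from the slide: V = 3 fruit types, embedding size 4
E = np.array([
    [0.1, -0.4,  0.2, 0.5],   # Apple   (index 0)
    [0.4,  1.4, -1.0, 0.1],   # Banana  (index 1)
    [1.1,  0.9,  1.1, 0.5],   # Coconut (index 2)
])

vocab = {"Apple": 0, "Banana": 1, "Coconut": 2}

# looking up "Banana" selects row 1
e_u = E[vocab["Banana"]]
print(e_u)  # [ 0.4  1.4 -1.   0.1]
```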

Using Discrete Variables

x

y

Number of fruit to offer

uType of fruit to offer

v Number of fruit receivedType of fruit received

s1

s2

u∈{Apple, Banana, Coconut}

v∈{Apple, Banana, Coconut}

eu

Lookup

Using Discrete Variables

Softmax

V = 3

Apple Banana Coconut

w1 0.1 -0.4 0.2

w2 0.4 1.4 -1.0

w3 1.1 0.9 1.1

w4 1.3 0.1 0.4

Using Discrete Variables

Softmax

Input vector: Size = 4

V = 3

Apple Banana Coconut

w1 0.1 -0.4 0.2

w2 0.4 1.4 -1.0

w3 1.1 0.9 1.1

w4 1.3 0.1 0.4

Using Discrete Variables

Softmax

Input vector Size = 4

logits Size = V

V = 3

Apple Banana Coconut

w1 0.1 -0.4 0.2

w2 0.4 1.4 -1.0

w3 1.1 0.9 1.1

w4 1.3 0.1 0.4

Using Discrete Variables

Softmax

Input Vector

Logits

V = 3

Apple Banana Coconut

w1 0.1 -0.4 0.2

w2 0.4 1.4 -1.0

w3 1.1 0.9 1.1

w4 1.3 0.1 0.4

s1

s2

s3

s4

d1

d2

d3

1 -1 -2

Using Discrete Variables

Softmax

Input Vector

Logits

V = 3

Apple Banana Coconut

w1 0.1 -0.4 0.2

w2 0.4 1.4 -1.0

w3 1.1 0.9 1.1

w4 1.3 0.1 0.4

s1

s2

s3

s4

d1

d2

d3

1 -1 -2

p1

p2

p3

0.84 0.11 0.04

Using Discrete Variables

Softmax

Input Vector

Logits

V = 3

Apple Banana Coconut

w1 0.1 -0.4 0.2

w2 0.4 1.4 -1.0

w3 1.1 0.9 1.1

w4 1.3 0.1 0.4

s1

s2

s3

s4

d1

d2

d3

1 -1 -2

p1

p2

p3

0.84 0.11 0.04

Apple
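The softmax step above can be reproduced with the slide's logits (1, -1, -2); note the third probability rounds to 0.04:

```python
import numpy as np

def softmax(d):
    # subtract the max logit for numerical stability; output sums to 1
    e = np.exp(d - np.max(d))
    return e / e.sum()

# logits from the slide for Apple, Banana, Coconut
p = softmax(np.array([1.0, -1.0, -2.0]))
print(np.round(p, 2))  # [0.84 0.11 0.04]
print(["Apple", "Banana", "Coconut"][int(np.argmax(p))])  # Apple
```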

Using Discrete Variables

x

y

Number of fruit to offer

uType of fruit to offer

v Number of fruit receivedType of fruit received

s1

s2

u∈{Apple, Banana, Coconut}

v∈{Apple, Banana, Coconut}

eu

Softmax

Lookup

Using Discrete Variables

x

y

Number of fruit to offer

uType of fruit to offer

v Number of fruit receivedType of fruit received

s1

s2

u∈{Apple, Banana, Coconut}

v∈{Apple, Banana, Coconut}

eu

Softmax

Lookup

Example Applications

Window-based Tagging (Collobert et al, 2011)

Abby likes to eat apples and bananas

NNP VBZ TO VB NNS CC NNS

Example Applications

Window-based Tagging (Collobert et al, 2011)

Abby likes to eat apples and bananas

e-2 e-1 e-0 e1 e2

Example Applications

Window-based Tagging (Collobert et al, 2011)

Abby likes to eat apples and bananas

e-2 e-1 e-0 e1 e2 Word Embeddings

Non-Linear Layer 1s1

s2 Non-Linear Layer 2

Example Applications

Window-based Tagging (Collobert et al, 2011)

Abby likes to eat apples and bananas

e-2 e-1 e-0 e1 e2 Word Embeddings

Non-Linear Layer 1s1

s2 Non-Linear Layer 2

VB Softmax

Example Applications

Window-based Tagging (Collobert et al, 2011)

Abby likes to eat apples and bananas

e-2 e-1 e-0 e1 e2 Word Embeddings

Non-Linear Layer 1s1

s2 Non-Linear Layer 2

VB Softmax

Example Applications

Window-based Tagging (Collobert et al, 2011)
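The window-based tagger can be sketched end to end: look up the five word embeddings around the target word, concatenate them, apply two nonlinear layers, and softmax over the tag set. All sizes and word ids below are toy values of my own, not the dimensions used by Collobert et al:

```python
import numpy as np

rng = np.random.default_rng(0)

V, d, H, n_tags = 10, 4, 8, 5          # toy vocabulary/embedding/tag sizes
E = rng.standard_normal((V, d))        # word embeddings
W1 = rng.standard_normal((5 * d, H))   # layer 1 over a 5-word window
W2 = rng.standard_normal((H, n_tags))  # layer 2 -> tag logits

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def tag(window_ids):
    # concatenate the embeddings of the 5 words around the target word
    h = np.concatenate([E[i] for i in window_ids])
    s1 = sigmoid(h @ W1)
    return softmax(s1 @ W2)  # distribution over tags for the center word

p = tag([3, 1, 4, 1, 5])  # hypothetical word ids for a 5-word window
print(p.shape, round(float(p.sum()), 6))
```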

Example Applications

Translation Rescoring (Devlin et al, 2014)

Abby likes to eat apples and bananas

Example Applications

Translation Rescoring (Devlin et al, 2014)

Abby likes to eat apples and bananas

Context

Predict

Example Applications

Translation Rescoring (Devlin et al, 2014)

Abby likes to eat apples and bananas

e-4 e-3 e-2 e-1

s1

s2

Softmax

Example Applications

Translation Rescoring (Devlin et al, 2014)

Abby likes to eat apples and bananas

0.2<s>

Example Applications

Translation Rescoring (Devlin et al, 2014)

Abby likes to eat apples and bananas

0.2 0.1

Example Applications

Translation Rescoring (Devlin et al, 2014)

Abby likes to eat apples and bananas

0.2 0.1 0.3

Example Applications

Translation Rescoring (Devlin et al, 2014)

Abby likes to eat apples and bananas

0.2 0.1 0.3 0.5 0.7 0.4 0.2

0.000378
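The rescoring step multiplies the model's per-word probabilities into one sentence score; the per-word values shown on the slides are illustrative. A minimal sketch with hypothetical probabilities:

```python
import math

def sentence_score(word_probs):
    # a language model scores a sentence as the product of per-word probabilities
    return math.prod(word_probs)

# hypothetical per-word probabilities for a three-word sentence
print(sentence_score([0.2, 0.1, 0.3]))
```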

Example Applications

Translation Rescoring (Devlin et al, 2014)

Abby likes to eat apples and bananas 0.000378

Abby dislikes to drink apples and bananas 0.00012

John does to eat coconuts and bananas 0.00003

Example Applications

Translation Rescoring (Devlin et al, 2014)

Abby likes to eat apples and bananas

Context

Predict

Translation

Source

Abby gosta de comer macas e bananas

Example Applications

Translation Rescoring (Devlin et al, 2014)

Abby likes to eat apples and bananas

Translation

macas

e-4 e-3 e-2 e-1

s1

s2

f-1

Example Applications

Translation Rescoring (Devlin et al, 2014)

Translation Score (BLEU) Arabic - English Chinese - English

Best Rescored System 52.8 34.7

1st OpenMT12 49.5 32.6

Hierarchical 43.4 30.1

Deep Neural Networks are our friends?

Convolutional Neural Network

Deep Neural Networks are our friends?

Convolutional Neural Network

x1 x2 x3 x4

x5 x6 x7 x8

x9 x10 x11 x12

x13 x14 x15 x16

4x4 image

Deep Neural Networks are our friends?

Convolutional Neural Network

x1 x2 x3 x4

x5 x6 x7 x8

x9 x10 x11 x12

x13 x14 x15 x16

4x4 image

z1

x1

x2

...

x11

z1

w9

w1

Deep Neural Networks are our friends?

Convolutional Neural Network

x1 x2 x3 x4

x5 x6 x7 x8

x9 x10 x11 x12

x13 x14 x15 x16

4x4 image

z1 z2

x2

x3

...

x12

z1

w1

w9

Deep Neural Networks are our friends?

Convolutional Neural Network

x1 x2 x3 x4

x5 x6 x7 x8

x9 x10 x11 x12

x13 x14 x15 x16

4x4 image

z1 z2

z3 z4

Deep Neural Networks are our friends?

Convolutional Neural Network

x1 x2 x3 x4

x5 x6 x7 x8

x9 x10 x11 x12

x13 x14 x15 x16

4x4 image

z1 z2

z3 z4

z1

z2

z3

z4

y Is this a cat?
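The slides above can be sketched as code: one 3x3 filter with shared weights slides over the 4x4 image, producing the 2x2 feature map z1..z4, which a final unit turns into the "is this a cat?" score. The random weights here are placeholders, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

x = rng.standard_normal((4, 4))  # the 4x4 input image
w = rng.standard_normal((3, 3))  # one 3x3 filter, shared at every position

# sliding a 3x3 filter over a 4x4 image gives a 2x2 output map (z1..z4)
z = np.array([[sigmoid(np.sum(x[i:i + 3, j:j + 3] * w)) for j in range(2)]
              for i in range(2)])

# classifier on top of the flattened 2x2 feature map
v = rng.standard_normal(4)
y = sigmoid(z.flatten() @ v)
print(z.shape, float(y))
```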
