Basic Math for Neural Networks

Xiaodong Cui

IBM T. J. Watson Research Center, Yorktown Heights, NY 10598

EECS 6894: Deep Learning for Computer Vision, Speech, and Language
Columbia University, Spring 2017


Outline

• Feedforward neural networks

• Forward propagation

• Neural networks as universal approximators

• Back propagation

• Jacobian

• Vanishing gradient problem



Feedforward Neural Networks

[Figure: a feedforward network with an input layer x, hidden layers with pre-activations a and activations z = h(a), bias units +1, and an output layer]

a_i^{(l)} = \sum_j w_{ij}^{(l)} z_j^{(l-1)} + b_i^{(l)}

z_i^{(l)} = h_i^{(l)}\big(a_i^{(l)}\big)
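To make the forward pass concrete, here is a minimal plain-Lua sketch (no Torch dependency) of these two equations for a single layer; the layer sizes, the weights W, the biases b, and the choice of a sigmoid h are illustrative assumptions, not something the slide prescribes.

-- forward pass of one fully connected layer:
-- a_i = sum_j W[i][j] * z_prev[j] + b[i],  z_i = h(a_i)
local function sigmoid(a) return 1.0 / (1.0 + math.exp(-a)) end

local function layer_forward(W, b, z_prev, h)
  local a, z = {}, {}
  for i = 1, #W do
    local s = b[i]
    for j = 1, #z_prev do
      s = s + W[i][j] * z_prev[j]
    end
    a[i] = s          -- pre-activation a_i^(l)
    z[i] = h(s)       -- activation z_i^(l) = h(a_i^(l))
  end
  return a, z
end

-- toy example: 2 inputs -> 3 hidden units
local x = {0.5, -1.0}
local W = {{0.1, 0.2}, {-0.3, 0.4}, {0.5, -0.6}}
local b = {0.0, 0.1, -0.1}
local a, z = layer_forward(W, b, x, sigmoid)
for i = 1, #z do
  print(string.format("a[%d] = %.4f  z[%d] = %.4f", i, a[i], i, z[i]))
end

Stacking several such calls, each feeding its z into the next layer, gives the full forward propagation.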


Activation Functions – Sigmoid

also known as the logistic function

h(a) = \frac{1}{1 + e^{-a}}

h'(a) = \frac{e^{-a}}{(1 + e^{-a})^2} = h(a)\big[1 - h(a)\big]

[Plot: sigmoid f(a) and its derivative df/da for a ∈ [−10, 10]; f ranges over (0, 1)]

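As a quick sanity check of the identity h'(a) = h(a)[1 − h(a)], here is a small plain-Lua sketch comparing it against a symmetric finite difference; the test points and step size are arbitrary choices, and the same pattern applies to the tanh and ReLU derivatives on the next slides.

-- compare the analytic sigmoid derivative h(a)*(1 - h(a))
-- with a symmetric finite difference (h(a+eps) - h(a-eps)) / (2*eps)
local function sigmoid(a) return 1.0 / (1.0 + math.exp(-a)) end

local eps = 1e-5
for _, a in ipairs({-4.0, -1.0, 0.0, 2.0, 6.0}) do
  local analytic = sigmoid(a) * (1.0 - sigmoid(a))
  local numeric  = (sigmoid(a + eps) - sigmoid(a - eps)) / (2.0 * eps)
  print(string.format("a = %5.1f  analytic = %.6f  numeric = %.6f", a, analytic, numeric))
end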


Activation Functions – Hyperbolic Tangent (tanh)

h(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}}

h'(a) = \frac{4}{(e^{a} + e^{-a})^2} = 1 - \big[h(a)\big]^2

[Plot: tanh f(a) and its derivative df/da for a ∈ [−5, 5]; f ranges over (−1, 1)]



Activation Functions – Rectified Linear Unit (ReLU)

h(a) = \max(0, a) = \begin{cases} a & \text{if } a > 0, \\ 0 & \text{if } a \le 0. \end{cases}

h'(a) = \begin{cases} 1 & \text{if } a > 0, \\ 0 & \text{if } a \le 0. \end{cases}

[Plot: ReLU f(a) and its derivative df/da for a ∈ [−3, 3]]



Activation Functions – Softmax

also known as the normalized exponential function, a generalization of the logistic function to multiple classes.

h(a_k) = \frac{e^{a_k}}{\sum_j e^{a_j}}

\frac{\partial h(a_k)}{\partial a_j} = \frac{\partial h(a_k)}{\partial \log h(a_k)} \, \frac{\partial \log h(a_k)}{\partial a_j} = h(a_k)\big(\delta_{kj} - h(a_j)\big)

Why "softmax"?

\max\{x_1, \cdots, x_n\} \le \log\big(e^{x_1} + \cdots + e^{x_n}\big) \le \max\{x_1, \cdots, x_n\} + \log n

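A minimal plain-Lua sketch of the softmax follows; subtracting the maximum logit before exponentiating is a standard numerical-stabilization trick (it leaves the result unchanged, as the log-sum-exp bound above suggests) and is an implementation choice of the sketch, not something the slide prescribes.

-- softmax h(a_k) = exp(a_k) / sum_j exp(a_j), computed with the
-- usual max-subtraction trick so that large logits do not overflow
local function softmax(a)
  local m = -math.huge
  for _, v in ipairs(a) do m = math.max(m, v) end
  local exps, s = {}, 0.0
  for k, v in ipairs(a) do
    exps[k] = math.exp(v - m)   -- exp(a_k - max_j a_j)
    s = s + exps[k]
  end
  for k = 1, #exps do exps[k] = exps[k] / s end
  return exps
end

local y = softmax({2.0, 1.0, 0.1})
print(y[1], y[2], y[3])         -- probabilities, summing to 1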


Loss Functions

• Regression
• Classification



Regression: Least Square Error

Let \hat{y}_{nk} and y_{nk} be the network output and the target, respectively. For regression, the typical loss function is the least square error (LSE):

L(\hat{y}, y) = \frac{1}{2} \sum_n \big(\hat{y}_n - y_n\big)^2 = \frac{1}{2} \sum_n \sum_k \big(\hat{y}_{nk} - y_{nk}\big)^2

where n is the sample index and k is the dimension index. The derivative of LSE with respect to each dimension of each sample:

\frac{\partial L(\hat{y}, y)}{\partial \hat{y}_{nk}} = \hat{y}_{nk} - y_{nk}

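A short plain-Lua sketch of the LSE loss and its gradient for a single (vector-valued) sample; the toy outputs and targets are made up for illustration.

-- LSE for one sample: L = 1/2 * sum_k (yhat_k - y_k)^2,  dL/dyhat_k = yhat_k - y_k
local function lse(yhat, y)
  local loss, grad = 0.0, {}
  for k = 1, #y do
    local d = yhat[k] - y[k]
    loss = loss + 0.5 * d * d
    grad[k] = d
  end
  return loss, grad
end

local loss, grad = lse({0.9, -0.2, 0.4}, {1.0, 0.0, 0.5})
print(loss, grad[1], grad[2], grad[3])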


Classification: Cross-Entropy (CE)

Suppose there are two (discrete) probability distributions p and q; the "distance" between the two distributions can be measured by the Kullback-Leibler divergence:

D_{KL}(p \| q) = \sum_k p_k \log \frac{p_k}{q_k}

For classification, the targets y_n are given:

y_n = \{y_{n1}, \cdots, y_{nk}, \cdots, y_{nK}\}

The predictions after the softmax layer are posterior probabilities after normalization:

\hat{y}_n = \{\hat{y}_{n1}, \cdots, \hat{y}_{nk}, \cdots, \hat{y}_{nK}\}

Now measure the "distance" between these two distributions over all samples:

\sum_n D_{KL}(y_n \| \hat{y}_n) = \sum_n \sum_k y_{nk} \log \frac{y_{nk}}{\hat{y}_{nk}} = \sum_n \sum_k y_{nk} \log y_{nk} - \sum_n \sum_k y_{nk} \log \hat{y}_{nk} = -\sum_n \sum_k y_{nk} \log \hat{y}_{nk} + C

where C collects the terms that do not depend on the predictions.



Classification: Cross-Entropy (CE)

Cross-entropy as the loss function:

L(\hat{y}, y) = -\sum_n \sum_k y_{nk} \log \hat{y}_{nk}

Typically, targets are given in the form of 1-of-K coding

y_n = \{0, \cdots, 1, \cdots, 0\}

where

y_{nk} = \begin{cases} 1 & \text{if } k = k_n, \\ 0 & \text{otherwise} \end{cases}

and k_n is the index of the correct class for sample n.

It follows that the cross-entropy has an even simpler form:

L(\hat{y}, y) = -\sum_n \log \hat{y}_{n k_n}

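With 1-of-K targets, the cross-entropy reduces to picking out the log-probability of the correct class. Here is a minimal plain-Lua sketch over a toy mini-batch; the probabilities and labels are purely illustrative.

-- cross-entropy with 1-of-K targets: L = - sum_n log yhat[n][k_n]
local function cross_entropy(yhat, labels)
  local loss = 0.0
  for n = 1, #labels do
    loss = loss - math.log(yhat[n][labels[n]])
  end
  return loss
end

local yhat   = {{0.7, 0.2, 0.1}, {0.1, 0.1, 0.8}}  -- softmax outputs, 2 samples
local labels = {1, 3}                              -- correct class index k_n per sample
print(cross_entropy(yhat, labels))                 -- -(log 0.7 + log 0.8)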


Classification: Cross-Entropy (CE)

Derivative of the composite function of cross-entropy and softmax:

L(\hat{y}, y) = \sum_n E_n = \sum_n \Big( -\sum_i y_{ni} \log \hat{y}_{ni} \Big), \qquad \hat{y}_{ni} = \frac{e^{a_{ni}}}{\sum_j e^{a_{nj}}}

\frac{\partial E_n}{\partial a_{nk}} = \frac{\partial}{\partial a_{nk}} \Big( -\sum_i y_{ni} \log \hat{y}_{ni} \Big) = -\sum_i y_{ni} \frac{\partial \log \hat{y}_{ni}}{\partial a_{nk}}

= -\sum_i y_{ni} \frac{\partial \log \hat{y}_{ni}}{\partial \hat{y}_{ni}} \, \frac{\partial \hat{y}_{ni}}{\partial a_{nk}} = -\sum_i \frac{y_{ni}}{\hat{y}_{ni}} \, \frac{\partial \hat{y}_{ni}}{\partial \log \hat{y}_{ni}} \, \frac{\partial \log \hat{y}_{ni}}{\partial a_{nk}}

= -\sum_i \frac{y_{ni}}{\hat{y}_{ni}} \cdot \hat{y}_{ni} \cdot \big(\delta_{ik} - \hat{y}_{nk}\big) = -\sum_i y_{ni} \big(\delta_{ik} - \hat{y}_{nk}\big)

= -\sum_i y_{ni} \delta_{ik} + \sum_i y_{ni} \hat{y}_{nk} = -y_{nk} + \hat{y}_{nk} \Big( \sum_i y_{ni} \Big) = \hat{y}_{nk} - y_{nk}

where the last step uses the fact that the targets sum to one, \sum_i y_{ni} = 1.

What is the derivative for an arbitrary differentiable function? – left as an exercise.

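The punchline of this derivation is that, for a softmax output layer with a cross-entropy loss, the error at the output is simply ŷ − y. A plain-Lua sketch for one sample (reusing a softmax like the one sketched earlier; the logits and label are illustrative):

-- gradient of CE + softmax w.r.t. the logits: dE/da_k = yhat_k - y_k
local function softmax(a)
  local m = -math.huge
  for _, v in ipairs(a) do m = math.max(m, v) end
  local e, s = {}, 0.0
  for k, v in ipairs(a) do e[k] = math.exp(v - m); s = s + e[k] end
  for k = 1, #e do e[k] = e[k] / s end
  return e
end

local logits = {1.5, -0.3, 0.2}
local label  = 2                       -- 1-of-K target: y = {0, 1, 0}
local yhat   = softmax(logits)
local grad   = {}
for k = 1, #logits do
  grad[k] = yhat[k] - (k == label and 1 or 0)
end
print(grad[1], grad[2], grad[3])       -- components sum to 0, since both yhat and y sum to 1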


A DNN Example In Torch

Algorithm 1: Definition of a DNN with 3 hidden layers

require 'nn'
require 'cunn'

function create_model()
    -- MODEL
    local n_inputs  = 460
    local n_outputs = 3000
    local n_hidden  = 1024
    local model = nn.Sequential()
    model:add(nn.Linear(n_inputs, n_hidden))
    model:add(nn.Sigmoid())
    model:add(nn.Linear(n_hidden, n_hidden))
    model:add(nn.Sigmoid())
    model:add(nn.Linear(n_hidden, n_hidden))
    model:add(nn.Sigmoid())
    model:add(nn.Linear(n_hidden, n_outputs))
    model:add(nn.LogSoftMax())

    -- LOSS FUNCTION
    local criterion = nn.ClassNLLCriterion()

    return model:cuda(), criterion:cuda()
end

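For context, one SGD step with this model might look like the sketch below. It is not part of the slides; it assumes the standard Torch7 nn/cunn API (with the requires from the listing above already loaded) and uses a made-up batch of random features and class labels.

-- one illustrative training step with the model/criterion defined above
local model, criterion = create_model()

local batch_size = 128
local inputs  = torch.Tensor(batch_size, 460):normal(0, 1):cuda()  -- fake features
local targets = torch.Tensor(batch_size):random(1, 3000):cuda()    -- fake class labels

model:zeroGradParameters()
local log_probs = model:forward(inputs)                 -- log-softmax outputs
local loss      = criterion:forward(log_probs, targets)
local grad_out  = criterion:backward(log_probs, targets)
model:backward(inputs, grad_out)
model:updateParameters(0.01)                            -- plain SGD, learning rate 0.01
print("negative log-likelihood:", loss)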


Neural Networks As Universal Approximators

Universal Approximation Theorem [1][2][3]: Let ϕ(·) be a nonconstant, bounded, and monotonically increasing continuous function. Let I_n represent the n-dimensional unit cube [0, 1]^n and C(I_n) be the space of continuous functions on I_n. Then, given any function f ∈ C(I_n) and ϵ > 0, there exist an integer N and real constants a_i, b_i ∈ R and w_i ∈ R^n, where i = 1, 2, · · · , N, such that one may define

F(x) = \sum_{i=1}^{N} a_i \, \phi\big(w_i^T x + b_i\big)

as an approximate realization of the function f (where f is independent of ϕ), such that

|F(x) - f(x)| < \epsilon

for all x ∈ I_n.

[1] Cybenko, G. (1989). "Approximation by Superpositions of a Sigmoidal Function," Math. Control Signals Systems, 2, 303-314.
[2] Hornik, K., Stinchcombe, M., and White, H. (1989). "Multilayer Feedforward Networks are Universal Approximators," Neural Networks, 2(5), 359-366.
[3] Hornik, K. (1991). "Approximation Capabilities of Multilayer Feedforward Networks," Neural Networks, 4(2), 251-257.



Neural Networks As Universal Classifiers

Extend f(x) to decision functions of the form [1]

f(x) = j \quad \text{iff} \quad x \in P_j, \qquad j = 1, 2, \cdots, K

where the P_j partition A_n into K disjoint measurable subsets and A_n is a compact subset of R^n.

Arbitrary decision regions can be arbitrarily well approximated by continuous feedforward neural networks with only a single internal, hidden layer and any continuous sigmoidal nonlinearity!

[1] Cybenko, G. (1989). "Approximation by Superpositions of a Sigmoidal Function," Math. Control Signals Systems, 2, 303-314.
[2] Huang, W. Y. and Lippmann, R. P. (1988). "Neural Nets and Traditional Classifiers," in Neural Information Processing Systems (Denver 1987), D. Z. Anderson, Editor. American Institute of Physics, New York, 387-396.



Back Propagation

[Figure: a chain of units j → i → k, with weight w_{ij} feeding unit i and weight w_{ki} feeding unit k, leading to the loss L]

How to compute the gradient of the loss function with respect to a weight w_{ij}?

\frac{\partial L}{\partial w_{ij}}

D. E. Rumelhart, G. E. Hinton and R. J. Williams (1986). "Learning representations by back-propagating errors," Nature, vol. 323, 533-536.



Back Propagation

a_i^{(l)} = \sum_j w_{ij}^{(l)} z_j^{(l-1)}, \qquad z_i^{(l)} = h_i^{(l)}\big(a_i^{(l)}\big)

\frac{\partial L}{\partial w_{ij}^{(l)}} = \frac{\partial L}{\partial a_i^{(l)}} \, \frac{\partial a_i^{(l)}}{\partial w_{ij}^{(l)}}

It follows that

\frac{\partial a_i^{(l)}}{\partial w_{ij}^{(l)}} = z_j^{(l-1)} \qquad \text{and define} \qquad \delta_i^{(l)} \triangleq \frac{\partial L}{\partial a_i^{(l)}}

The δ_i are often referred to as "errors". By the chain rule,

\delta_i^{(l)} = \frac{\partial L}{\partial a_i^{(l)}} = \sum_k \frac{\partial L}{\partial a_k^{(l+1)}} \, \frac{\partial a_k^{(l+1)}}{\partial a_i^{(l)}} = \sum_k \delta_k^{(l+1)} \, \frac{\partial a_k^{(l+1)}}{\partial a_i^{(l)}}



Back Propagation

a_k^{(l+1)} = \sum_i w_{ki}^{(l+1)} h\big(a_i^{(l)}\big)

therefore

\frac{\partial a_k^{(l+1)}}{\partial a_i^{(l)}} = w_{ki}^{(l+1)} h'\big(a_i^{(l)}\big)

It follows that

\delta_i^{(l)} = h'\big(a_i^{(l)}\big) \sum_k w_{ki}^{(l+1)} \delta_k^{(l+1)}

which tells us that the errors can be evaluated recursively.



Back Propagation: An Example

• A neural network with multiple layers
• Softmax output layer
• Cross-entropy loss function

• Forward propagation: push the input x_n through the network to get the activations of the hidden layers

  a_i^{(l)} = \sum_j w_{ij}^{(l)} z_j^{(l-1)}, \qquad z_i^{(l)} = h_i^{(l)}\big(a_i^{(l)}\big)

• Back propagation (see the sketch below):
  – start from the output layer

    \delta_k^{(L)} = \hat{y}_{nk} - y_{nk}

  – back-propagate the errors from the output layer all the way down to the input layer

    \delta_i^{(l)} = h'\big(a_i^{(l)}\big) \sum_k w_{ki}^{(l+1)} \delta_k^{(l+1)}

  – evaluate the gradients

    \frac{\partial L}{\partial w_{ij}^{(l)}} = \delta_i^{(l)} z_j^{(l-1)}

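Below is a minimal plain-Lua sketch of this recipe for a single sample and a single sigmoid hidden layer with a softmax/cross-entropy output; the network sizes, weights, input, and label are made-up illustrations. It follows the three steps above: forward pass, the δ recursion, then the gradients.

-- backprop for one sample: x -> sigmoid hidden layer -> softmax output, CE loss
local function sigmoid(a) return 1.0 / (1.0 + math.exp(-a)) end

local function matvec(W, v)                 -- a_i = sum_j W[i][j] * v[j]
  local a = {}
  for i = 1, #W do
    local s = 0.0
    for j = 1, #v do s = s + W[i][j] * v[j] end
    a[i] = s
  end
  return a
end

local function softmax(a)
  local m = -math.huge
  for _, v in ipairs(a) do m = math.max(m, v) end
  local e, s = {}, 0.0
  for k, v in ipairs(a) do e[k] = math.exp(v - m); s = s + e[k] end
  for k = 1, #e do e[k] = e[k] / s end
  return e
end

-- toy data: 2 inputs, 3 hidden units, 2 classes, target class 1
local x, label = {0.5, -1.0}, 1
local W1 = {{0.1, 0.2}, {-0.3, 0.4}, {0.5, -0.6}}
local W2 = {{0.7, -0.1, 0.2}, {-0.4, 0.3, 0.1}}

-- forward propagation
local a1 = matvec(W1, x)
local z1 = {}
for i = 1, #a1 do z1[i] = sigmoid(a1[i]) end
local a2   = matvec(W2, z1)
local yhat = softmax(a2)

-- output-layer errors: delta2_k = yhat_k - y_k
local delta2 = {}
for k = 1, #yhat do delta2[k] = yhat[k] - (k == label and 1 or 0) end

-- back-propagate: delta1_i = h'(a1_i) * sum_k W2[k][i] * delta2_k
local delta1 = {}
for i = 1, #a1 do
  local s = 0.0
  for k = 1, #delta2 do s = s + W2[k][i] * delta2[k] end
  delta1[i] = z1[i] * (1.0 - z1[i]) * s
end

-- gradients: dL/dW2[k][i] = delta2_k * z1_i,  dL/dW1[i][j] = delta1_i * x_j
local gW2, gW1 = {}, {}
for k = 1, #delta2 do
  gW2[k] = {}
  for i = 1, #z1 do gW2[k][i] = delta2[k] * z1[i] end
end
for i = 1, #delta1 do
  gW1[i] = {}
  for j = 1, #x do gW1[i][j] = delta1[i] * x[j] end
end
print(string.format("dL/dW2[1][1] = %.6f  dL/dW1[1][1] = %.6f", gW2[1][1], gW1[1][1]))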


Gradient Checking

In practice, when you implement the gradient of a network or a module, one way to check the correctness of your implementation is gradient checking with a (symmetric) finite difference:

\frac{\partial L}{\partial w_{ij}} \approx \frac{L(w_{ij} + \epsilon) - L(w_{ij} - \epsilon)}{2\epsilon}

Check the "closeness" of the two gradients (your own implementation and the one from the above difference):

\frac{\lVert g_1 - g_2 \rVert}{\lVert g_1 + g_2 \rVert}

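A plain-Lua sketch of this check for a single scalar parameter, using a simple quadratic loss as a stand-in for a real network; the step size and the example function are arbitrary choices for illustration.

-- symmetric finite-difference gradient check for one parameter w
local function grad_check(loss_fn, analytic_grad, w, eps)
  eps = eps or 1e-5
  local numeric = (loss_fn(w + eps) - loss_fn(w - eps)) / (2.0 * eps)
  -- relative closeness, as on the slide: ||g1 - g2|| / ||g1 + g2||
  local rel = math.abs(analytic_grad - numeric) / math.abs(analytic_grad + numeric)
  return numeric, rel
end

-- toy example: L(w) = (w - 3)^2, so dL/dw = 2*(w - 3)
local w = 1.5
local loss_fn  = function(v) return (v - 3.0)^2 end
local analytic = 2.0 * (w - 3.0)
local numeric, rel = grad_check(loss_fn, analytic, w)
print(string.format("analytic = %.6f  numeric = %.6f  relative diff = %.2e", analytic, numeric, rel))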


Jacobian Matrix

[Figure: a (sub-)network mapping inputs x_i to outputs y_k]

• measure the sensitivity of the outputs with respect to the inputs of a (sub-)network

• can be used as a modular operation for error backprop in a larger network



Jacobian Matrix

The Jacobian matrix can be computed using a similar back-propagation procedure.

\frac{\partial y_k}{\partial x_i} = \sum_l \frac{\partial y_k}{\partial a_l} \, \frac{\partial a_l}{\partial x_i}

where the a_l are the activations of the units that have immediate connections with the inputs x_i. Since \partial a_l / \partial x_i = w_{li},

\frac{\partial y_k}{\partial x_i} = \sum_l w_{li} \, \frac{\partial y_k}{\partial a_l}

Analogous to the error propagation, define

\delta_{kl} \triangleq \frac{\partial y_k}{\partial a_l}

and it can be computed recursively:

\delta_{kl} = \frac{\partial y_k}{\partial a_l} = \sum_j \frac{\partial y_k}{\partial a_j} \, \frac{\partial a_j}{\partial a_l} = \sum_j \frac{\partial y_k}{\partial a_j} \big[ w_{jl} \, h'(a_l) \big] = \sum_j \delta_{kj} \big[ w_{jl} \, h'(a_l) \big]

where j sweeps over all the units connected to l through weights w_{jl}.

What is the Jacobian matrix of the softmax outputs with respect to the inputs? – left as an exercise.

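To see what the Jacobian looks like numerically, here is a plain-Lua sketch that estimates ∂y_k/∂x_i by symmetric finite differences for a tiny one-layer sigmoid network; the weights and input are illustrative, and the analytic route via the δ_{kl} recursion above would give the same matrix.

-- numeric Jacobian J[k][i] = dy_k / dx_i for y = sigmoid(W x)
local function sigmoid(a) return 1.0 / (1.0 + math.exp(-a)) end

local function net(W, x)                       -- y_k = sigmoid(sum_i W[k][i] * x[i])
  local y = {}
  for k = 1, #W do
    local s = 0.0
    for i = 1, #x do s = s + W[k][i] * x[i] end
    y[k] = sigmoid(s)
  end
  return y
end

local function numeric_jacobian(W, x, eps)
  eps = eps or 1e-5
  local J = {}
  for k = 1, #W do J[k] = {} end
  for i = 1, #x do
    local xp, xm = {}, {}
    for j = 1, #x do xp[j], xm[j] = x[j], x[j] end
    xp[i], xm[i] = x[i] + eps, x[i] - eps
    local yp, ym = net(W, xp), net(W, xm)
    for k = 1, #W do J[k][i] = (yp[k] - ym[k]) / (2.0 * eps) end
  end
  return J
end

local W = {{0.2, -0.5}, {0.7, 0.1}, {-0.3, 0.4}}
local x = {1.0, -2.0}
local J = numeric_jacobian(W, x)
print(string.format("J[1][1] = %.6f  J[3][2] = %.6f", J[1][1], J[3][2]))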


Vanishing Gradients

function    nonlinearity                              gradient
sigmoid     x = 1 / (1 + e^{-a})                      ∂L/∂a = x(1 − x) · ∂L/∂x
tanh        x = (e^{a} − e^{-a}) / (e^{a} + e^{-a})   ∂L/∂a = (1 − x^2) · ∂L/∂x
softsign    x = a / (1 + |a|)                         ∂L/∂a = (1 − |x|)^2 · ∂L/∂x

What’s the problem???

|x| \le 1 \;\Rightarrow\; \Big|\frac{\partial L}{\partial a}\Big| \le \Big|\frac{\partial L}{\partial x}\Big|

Nonlinearity causes vanishing gradients.
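A small plain-Lua sketch of the effect: repeatedly multiplying a back-propagated gradient by the sigmoid derivative h'(a) = h(a)(1 − h(a)) ≤ 0.25 shrinks it geometrically with depth. The pre-activation value, the depth, and the unit weights are illustrative assumptions.

-- how a gradient shrinks as it passes back through stacked sigmoid units
local function sigmoid(a) return 1.0 / (1.0 + math.exp(-a)) end

local grad = 1.0          -- dL/dx at the top
local a    = 1.0          -- assume each unit sees pre-activation a = 1.0, weight 1.0
for layer = 1, 10 do
  local h = sigmoid(a)
  grad = grad * h * (1.0 - h)          -- dL/da = x(1 - x) * dL/dx, with unit weight
  print(string.format("after %2d sigmoid layers: |gradient| = %.3e", layer, grad))
end
-- h'(a) <= 0.25 for the sigmoid, so the gradient decays at least as fast as 0.25^depth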