TRANSCRIPT
Lecture 17: Neural Networks and Deep Learning
Jack Lanchantin
Dr. Yanjun Qi
UVA CS 6316 / CS 4501-004: Machine Learning
Fall 2016
Outline:
Neurons
1-Layer Neural Network
Multi-layer Neural Network
Loss Functions
Backpropagation
Nonlinearity Functions
NNs in Practice
[Roadmap figure: a multi-layer network mapping input x = (x1, x2, x3) through weights W1, W2, w3 to output ŷ, built from the logistic function e^(wx+b) / (1 + e^(wx+b)).]
Logistic Regression
Sigmoid function (aka logistic, logit, “S”, soft-step):
P(Y=1|x) = e^z / (1 + e^z)
Expanded Logistic Regression
[Figure: inputs x1, x2, x3 (p = 3) are multiplied by weights w1, w2, w3, summed with bias b1 (summing function), and passed through the sigmoid function to give ŷ = P(Y=1|x,w). This whole unit is a “neuron”.]
z = w^T x + b        (w^T: 1 x p, x: p x 1, z: 1 x 1)
ŷ = sigmoid(z) = e^z / (1 + e^z)
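Not from the slides — a minimal numpy sketch of the expanded logistic regression above, with p = 3 as in the figure (the input, weight, and bias values are made up for illustration):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))   # same value as e^z / (1 + e^z)

    x = np.array([0.5, -1.0, 2.0])        # input x1, x2, x3 (hypothetical values)
    w = np.array([0.1, 0.4, -0.3])        # weights w1, w2, w3
    b = 0.2                               # bias b1

    z = w @ x + b                         # summing function: z = w^T x + b
    y_hat = sigmoid(z)                    # ŷ = P(Y=1 | x, w)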
Neurons
(Figure from http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf)
Neuron
[Figure: inputs x1, x2, x3 are multiplied by weights w1, w2, w3 and summed to give z, which passes through the sigmoid to give ŷ.]
z = w^T x        (w^T: 1 x p, x: p x 1, z: 1 x 1)
ŷ = sigmoid(z) = e^z / (1 + e^z)
From here on, we leave out the bias for simplicity.
“Block View” of a Neuron
[Figure: input x → dot-product block (* w, a parameterized block) → z → sigmoid block → output ŷ.]
z = w^T x        (w^T: 1 x p, x: p x 1, z: 1 x 1)
ŷ = sigmoid(z) = e^z / (1 + e^z)
Neuron Representation
[Figure: the same neuron drawn two ways — as a summation followed by a sigmoid, and as a block view x → (* w) → ŷ.]
The linear transformation and the nonlinearity together are typically considered a single neuron.
Outline (next section: 1-Layer Neural Network)
1-Layer Neural Network (with 4 neurons)
[Figure: input x = (x1, x2, x3) feeds four summing units in parallel (p = 3 inputs, d = 4 neurons); one layer = a linear transformation (matrix W applied to vector x) followed by an element-wise sigmoid, giving outputs ŷ1, ŷ2, ŷ3, ŷ4.]
z = W^T x        (W^T: d x p, x: p x 1, z: d x 1)
ŷ = sigmoid(z) = e^z / (1 + e^z)        (element-wise on vector z; ŷ: d x 1)
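A minimal numpy sketch of this layer (not from the slides), with p = 3 inputs and d = 4 neurons as in the figure; the weight values are random placeholders:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    p, d = 3, 4
    rng = np.random.default_rng(0)
    W = rng.normal(size=(p, d))        # W: p x d, so W^T is d x p
    x = np.array([0.5, -1.0, 2.0])     # x: p x 1

    z = W.T @ x                        # z: d x 1
    y_hat = sigmoid(z)                 # element-wise sigmoid, giving ŷ1..ŷ4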
“Block View” of a Neural Network
[Figure: input x → dot-product block (* W, a parameterized block) → z → sigmoid block → output ŷ.]
W is now a matrix and z is now a vector:
z = W^T x        (W^T: d x p, x: p x 1, z: d x 1)
ŷ = sigmoid(z) = e^z / (1 + e^z)        (element-wise; ŷ: d x 1)
Outline (next section: Multi-layer Neural Network)
Multi-Layer Neural Network (Multi-Layer Perceptron (MLP) Network)
[Figure: a 2-layer NN — input x = (x1, x2, x3), a hidden layer with weights W1, and an output layer with weights w2 producing ŷ. The weight subscript represents the layer number.]
Multi-Layer Neural Network (MLP)
[Figure: a 3-layer NN — input x, a 1st hidden layer (weights W1), a 2nd hidden layer (weights W2), and an output layer (weights w3) producing ŷ.]
Multi-Layer Neural Network (MLP)
z1 = W1^T x      h1 = sigmoid(z1)      (hidden layer 1 output)
z2 = W2^T h1     h2 = sigmoid(z2)      (hidden layer 2 output)
z3 = w3^T h2     ŷ = sigmoid(z3)
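A sketch of this three-layer forward pass in numpy (only the structure matches the slide; the layer sizes and weight values are illustrative):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(3, 4))   # input (p = 3) -> 1st hidden layer (4 units)
    W2 = rng.normal(size=(4, 4))   # 1st hidden -> 2nd hidden layer
    w3 = rng.normal(size=(4,))     # 2nd hidden -> single output

    x  = np.array([0.5, -1.0, 2.0])
    h1 = sigmoid(W1.T @ x)         # z1 = W1^T x,  h1 = sigmoid(z1)
    h2 = sigmoid(W2.T @ h1)        # z2 = W2^T h1, h2 = sigmoid(z2)
    y_hat = sigmoid(w3 @ h2)       # z3 = w3^T h2, ŷ = sigmoid(z3)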
Multi-Class Output MLP
[Figure: the same 3-layer network, but with multiple output units, so the output-layer weights form a matrix W3.]
z1 = W1^T x      h1 = sigmoid(z1)
z2 = W2^T h1     h2 = sigmoid(z2)
z3 = W3^T h2     ŷ = sigmoid(z3)
“Block View” of an MLP
[Figure: x → (*W1) → z1 → sigmoid → h1 (1st hidden layer) → (*W2) → z2 → sigmoid → h2 (2nd hidden layer) → (*W3) → z3 → sigmoid → ŷ (output layer).]
“Deep” Neural Networks (i.e. > 1 hidden layer)
Researchers have successfully used 1000 layers to train an object classifier.
Outline (next section: Loss Functions)
[Roadmap figure: the network maps input x through weights W1, W2, w3 to the output ŷ = P(y=1|X,W), which is fed into a loss E(ŷ).]
Binary Classification Loss
[Figure: the network produces ŷ = P(y=1|X,W) from input x; y is the true output. This example is for a single sample x.]
E = loss = -log P(Y = y | X = x) = -y log(ŷ) - (1 - y) log(1 - ŷ)
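A sketch of this loss in numpy (single sample; ŷ could come from any of the forward passes above — the example numbers are made up):

    import numpy as np

    def binary_cross_entropy(y, y_hat, eps=1e-12):
        # E = -y log(ŷ) - (1 - y) log(1 - ŷ); eps guards against log(0)
        y_hat = np.clip(y_hat, eps, 1 - eps)
        return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

    binary_cross_entropy(y=1, y_hat=0.8)   # ≈ 0.223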
Regression Loss
[Figure: the network’s output unit is a plain summation (no sigmoid), producing a real-valued ŷ; y is the true output.]
E = loss = (1/2)(y - ŷ)^2
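The corresponding squared-error loss, as a one-line sketch in the same style:

    def regression_loss(y, y_hat):
        # E = 1/2 (y - ŷ)^2
        return 0.5 * (y - y_hat) ** 2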
Multi-Class Classification Loss
[Figure: the network has K = 3 output units producing ŷ1, ŷ2, ŷ3.]
“Softmax” function: a normalizing function which converts each class output to a probability, ŷ_i = P(ŷ_i = 1 | x).
E = loss = -Σ_{j=1..K} y_j ln ŷ_j        (y_j is “0” for all except the true class)
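A numpy sketch of the softmax and this loss with K = 3 (the raw scores and the true class are made-up values):

    import numpy as np

    def softmax(z):
        z = z - z.max()                  # subtract the max for numerical stability
        e = np.exp(z)
        return e / e.sum()

    z = np.array([2.0, 1.0, -1.0])       # raw class scores (hypothetical)
    y = np.array([1, 0, 0])              # one-hot true label: "0" for all except the true class
    y_hat = softmax(z)                   # ŷ_i = P(class i | x)
    E = -np.sum(y * np.log(y_hat))       # E = -Σ_j y_j ln ŷ_j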
Outline (next section: Backpropagation)
Training Neural Networks
How do we learn the optimal weights W_L for our task?
● Gradient descent: W_L(t+1) = W_L(t) - η ∂E/∂W_L(t)
(LeCun et al., Efficient BackProp, 1998)
But how do we get the gradients of the lower layers?
● Backpropagation!
○ Repeated application of the chain rule of calculus
○ Locally minimizes the objective
○ Requires all “blocks” of the network to be differentiable
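As a sketch, the update for one layer, assuming the gradient dE_dW has already been computed by backpropagation (lr plays the role of the learning rate η):

    def sgd_step(W, dE_dW, lr=0.1):
        # W(t+1) = W(t) - η * ∂E/∂W(t)
        return W - lr * dE_dW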
Backpropagation Intro
[A sequence of slides works through a small example of computing gradients on a simple computational graph; figures from http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf.]
The gradient tells us: by increasing x by a scale of 1, we decrease f by a scale of 4.
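A minimal sketch of the kind of worked example these slides use; the cs231n lecture walks through f(x, y, z) = (x + y)·z with x = -2, y = 5, z = -4, so treat the exact numbers as an assumption:

    # forward pass
    x, y, z = -2.0, 5.0, -4.0
    q = x + y          # q = 3
    f = q * z          # f = -12

    # backward pass (chain rule)
    df_dq = z          # ∂f/∂q = z = -4
    df_dz = q          # ∂f/∂z = q = 3
    df_dx = df_dq * 1  # ∂f/∂x = ∂f/∂q * ∂q/∂x = -4
    df_dy = df_dq * 1  # ∂f/∂y = ∂f/∂q * ∂q/∂y = -4
    # df_dx = -4: increasing x by 1 decreases f by about 4, as stated above.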
Backpropagation (binary classification example)
Example on a 1-hidden-layer NN for binary classification:
[Figure: input x → (*W1) → z1 → sigmoid → h1 → (*w2) → z2 → sigmoid → ŷ = P(y=1|X,W).]
[A sequence of slides derives the gradients for this network step by step: the loss E is written down, and gradient descent requires ∂E/∂w2 and ∂E/∂W1 (“need to find these!”). Each is obtained by exploiting the chain rule, working backward from the loss through ŷ, z2, h1, and z1, and reusing quantities that were already computed for later layers.]
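Not on the slides verbatim — a numpy sketch of the gradients this derivation produces, assuming the binary cross-entropy loss from earlier and the network x → (*W1) → sigmoid → h1 → (*w2) → sigmoid → ŷ (all values are made up):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    x  = np.array([0.5, -1.0, 2.0])          # single sample (hypothetical values)
    y  = 1.0                                 # true label
    W1 = rng.normal(size=(3, 4))
    w2 = rng.normal(size=(4,))

    # forward
    z1 = W1.T @ x;  h1 = sigmoid(z1)
    z2 = w2 @ h1;   y_hat = sigmoid(z2)

    # backward (chain rule), using the standard result ∂E/∂z2 = ŷ - y
    dE_dz2 = y_hat - y                       # sigmoid output + cross-entropy loss
    dE_dw2 = dE_dz2 * h1                     # ∂E/∂w2
    dE_dh1 = dE_dz2 * w2                     # gradient passed back through the dot product
    dE_dz1 = dE_dh1 * h1 * (1 - h1)          # through the sigmoid: σ'(z) = σ(z)(1 - σ(z))
    dE_dW1 = np.outer(x, dE_dz1)             # ∂E/∂W1, same shape as W1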
“Local-ness” of Backpropagation
[Figure: a single block f with input x and output y. On the forward pass the block computes its activations; on the backward pass it multiplies the gradient arriving from above by its “local gradients” to produce the gradients passed back to its inputs. From http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf.]
Example: Sigmoid Block
sigmoid(x) = σ(x) = 1 / (1 + e^(-x)), whose local gradient is dσ(x)/dx = σ(x)(1 - σ(x)).
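As a sketch, a sigmoid “block” that caches its output on the forward pass and applies the local gradient σ(x)(1 - σ(x)) on the backward pass (the class name is my own, not from the slides):

    import numpy as np

    class SigmoidBlock:
        def forward(self, x):
            self.out = 1.0 / (1.0 + np.exp(-x))   # σ(x), cached for the backward pass
            return self.out

        def backward(self, grad_out):
            # chain rule: incoming gradient times the local gradient σ(x)(1 - σ(x))
            return grad_out * self.out * (1.0 - self.out)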
Deep Learning = Concatenation of Differentiable Parameterized Layers (linear & nonlinearity functions)
[Figure: x → (*W1) → z1 → sigmoid → h1 (1st hidden layer) → (*W2) → z2 → sigmoid → h2 (2nd hidden layer) → (*W3) → z3 → sigmoid → ŷ (output layer).]
We want to find the optimal weights W to minimize some loss function E!
Backprop Whiteboard Demo
[Figure: a 2-input, 2-hidden-unit, 1-output network with weights w1..w6, biases b1, b2, b3, and blocks f1..f4.]
z1 = x1·w1 + x2·w3 + b1
z2 = x1·w2 + x2·w4 + b2
h1 = exp(z1) / (1 + exp(z1))
h2 = exp(z2) / (1 + exp(z2))
ŷ = h1·w5 + h2·w6 + b3
E = (y - ŷ)^2
Update rule: w(t+1) = w(t) - η ∂E/∂w(t), so we need ∂E/∂w = ?
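A numpy sketch of one gradient-descent step on exactly this little network; the input, target, and initial weight values are made up, and the gradient lines follow from applying the chain rule to the equations above:

    import numpy as np

    sig = lambda z: 1.0 / (1.0 + np.exp(-z))

    # hypothetical inputs, target, and initial weights
    x1, x2, y = 1.0, -2.0, 0.5
    w1, w2, w3, w4, w5, w6 = 0.1, -0.2, 0.3, 0.4, -0.5, 0.6
    b1, b2, b3 = 0.0, 0.0, 0.0
    lr = 0.1

    # forward pass (same equations as the slide)
    z1 = x1*w1 + x2*w3 + b1
    z2 = x1*w2 + x2*w4 + b2
    h1, h2 = sig(z1), sig(z2)
    y_hat = h1*w5 + h2*w6 + b3
    E = (y - y_hat)**2

    # backward pass (chain rule)
    dE_dyhat = 2.0 * (y_hat - y)
    dE_dw5, dE_dw6, dE_db3 = dE_dyhat*h1, dE_dyhat*h2, dE_dyhat
    dE_dz1 = dE_dyhat * w5 * h1 * (1 - h1)
    dE_dz2 = dE_dyhat * w6 * h2 * (1 - h2)
    dE_dw1, dE_dw3, dE_db1 = dE_dz1*x1, dE_dz1*x2, dE_dz1
    dE_dw2, dE_dw4, dE_db2 = dE_dz2*x1, dE_dz2*x2, dE_dz2

    # gradient-descent update, e.g. for w5:
    w5 = w5 - lr * dE_dw5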
Outline (next section: Nonlinearity Functions)
Nonlinearity Functions (i.e. transfer or activation functions)
[Figure: the neuron again — inputs x1, x2, x3 are multiplied by weights w1, w2, w3 (x·w) and summed with bias b1 (summing function) to give z, which then passes through the sigmoid function. The nonlinearity is this block after the summation.]
Nonlinearity Functions (aka transfer or activation functions)
[Table from https://en.wikipedia.org/wiki/Activation_function#Comparison_of_activation_functions listing activation functions by name, plot, equation, and derivative w.r.t. x; one of them (likely the ReLU) is annotated “usually works best in practice”.]
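A sketch of three common entries from that kind of table, written as functions with their derivatives w.r.t. x:

    import numpy as np

    # sigmoid: σ(x) = 1 / (1 + e^-x),  derivative σ'(x) = σ(x)(1 - σ(x))
    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def d_sigmoid(x):
        s = sigmoid(x)
        return s * (1.0 - s)

    # tanh: derivative 1 - tanh(x)^2
    def d_tanh(x):
        return 1.0 - np.tanh(x) ** 2

    # ReLU: max(0, x), derivative 1 for x > 0 and 0 otherwise
    def relu(x):
        return np.maximum(0.0, x)

    def d_relu(x):
        return (x > 0).astype(float)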
Outline (next section: NNs in Practice)
Neural Net Pipeline
[Figure: the network x → W1 → W2 → w3 → ŷ, feeding a loss E.]
1. Initialize weights
2. For each batch of input x samples S:
   a. Run the network “forward” on S to compute outputs and loss
   b. Run the network “backward” using outputs and loss to compute gradients
   c. Update weights using SGD (or a similar method)
3. Repeat step 2 until loss convergence
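A minimal sketch of this pipeline for a 1-layer network with the squared-error loss; the dataset, shapes, and hyperparameters here are illustrative, not from the slides:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))                # toy dataset: 100 samples, p = 3
    y = (X[:, 0] + X[:, 1] > 0).astype(float)    # toy binary targets
    W = rng.normal(size=(3,)) * 0.01             # 1. initialize weights
    lr, batch_size = 0.5, 10

    for epoch in range(100):                     # 3. repeat until convergence
        for i in range(0, len(X), batch_size):   # 2. for each batch S
            xb, yb = X[i:i+batch_size], y[i:i+batch_size]
            y_hat = sigmoid(xb @ W)              # 2a. forward: outputs and loss
            loss = 0.5 * np.mean((yb - y_hat) ** 2)
            # 2b. backward: gradient of the loss w.r.t. W
            dE_dW = xb.T @ ((y_hat - yb) * y_hat * (1 - y_hat)) / len(xb)
            W -= lr * dE_dW                      # 2c. SGD update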
Non-Convexity of Neural Nets
In very high dimensions, there exist many local minima which are about the same.
(Pascanu et al., On the saddle point problem for non-convex optimization, 2014)
Building Deep Neural Nets
[Figure: larger networks are built by composing differentiable blocks f with inputs x and outputs y; from http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf.]
“GoogLeNet” for Object Classification
[Figure: the GoogLeNet architecture.]
Block Example Implementation
[Figure: example code for a layer implemented as a forward/backward block; from http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf.]
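The slide’s own code is not reproduced here; as a sketch in the same spirit, a dot-product block with a forward and a backward method (the class and variable names are my own, not from the slide):

    import numpy as np

    class DotProductBlock:
        """A '* W' block: forward computes z = W^T x, backward returns local gradients."""
        def __init__(self, W):
            self.W = W

        def forward(self, x):
            self.x = x                      # cache the input for the backward pass
            return self.W.T @ x

        def backward(self, grad_z):
            dx = self.W @ grad_z            # gradient w.r.t. the input, passed downstream
            dW = np.outer(self.x, grad_z)   # gradient w.r.t. the weights, used for the update
            return dx, dW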
Advantage of Neural Nets
As long as it’s fully differentiable, we can train the model to automatically learn features for us.
Advanced Deep Learning Models: Convolutional Neural Networks & Recurrent Neural Networks
(Most slides from http://cs231n.stanford.edu/)

Convolutional Neural Networks (aka CNNs and ConvNets)
Challenges in Visual Recognition
[Figures illustrating why visual recognition is difficult.]
Problems with “Fully Connected Networks” on Images
For a 1000x1000 image with 1M hidden units, where each hidden neuron connects to every pixel, we would need on the order of 10^12 parameters.
Spatial correlation is local, so instead connect units locally.
How do we deal with multiple dimensions in the input? Length, height, channel (R, G, B).
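The arithmetic behind that parameter count, as a quick sketch (the 10x10 local window size below is an assumption for illustration):

    pixels = 1000 * 1000          # 1000x1000 image
    hidden_units = 10**6          # 1M hidden units

    fully_connected = pixels * hidden_units        # 10**12 parameters
    locally_connected = (10 * 10) * hidden_units   # 10**8 if each unit sees only a 10x10 patch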
Convolutional Layer
[Several slides of figures introduce the convolutional layer and the convolution operation: a small filter is slid over the input, computing a dot product with each local patch.]
Neuron View of Convolutional Layer
[In the neuron view, each output neuron is connected to only a local region of the input, and the weights are shared across spatial positions.]
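Not from the slides — a minimal numpy sketch of the convolution operation a convolutional layer performs (single channel, no padding, stride 1; the image and filter values are toy data):

    import numpy as np

    def conv2d(image, kernel):
        """Slide `kernel` over `image`, taking a dot product at each position."""
        H, W = image.shape
        kH, kW = kernel.shape
        out = np.zeros((H - kH + 1, W - kW + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                patch = image[i:i+kH, j:j+kW]
                out[i, j] = np.sum(patch * kernel)   # dot product with the local patch
        return out

    image = np.arange(36, dtype=float).reshape(6, 6)  # toy 6x6 "image"
    kernel = np.ones((3, 3)) / 9.0                    # 3x3 averaging filter
    activation_map = conv2d(image, kernel)            # 4x4 output map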
Convolutional Neural Networks
[Several slides of figures showing full ConvNet architectures; see also http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/]
Pooling Layer (Max Pooling example)
[Figures: pooling downsamples each activation map; max pooling takes the maximum over each local region.]
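And a corresponding sketch of 2x2 max pooling with stride 2 (again a toy implementation, not from the slides):

    import numpy as np

    def max_pool2d(x, size=2):
        H, W = x.shape
        out = np.zeros((H // size, W // size))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = x[i*size:(i+1)*size, j*size:(j+1)*size].max()
        return out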
History of ConvNets
1998: Gradient-based learning applied to document recognition [LeCun, Bottou, Bengio, Haffner]
2012: ImageNet Classification with Deep Convolutional Neural Networks [Krizhevsky, Sutskever, Hinton]
Recurrent Neural Networks
Standard “Feed-Forward” Neural Network
[Figures: a standard feed-forward network maps a fixed-size input through hidden layers to a fixed-size output in a single pass.]
Recurrent Neural Networks (RNNs)
RNNs can handle sequences.
[Figure: a traditional “feed forward” neural network (input → hidden → output) compared with a recurrent neural network, whose hidden state feeds back into itself so it can predict a vector at each timestep.]
Recurrent Neural Networks
[Figures: the hidden state h_t is computed from the current input and the previous hidden state h_t-1.]
“Vanilla” Recurrent Neural Network
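A sketch of one step of the vanilla recurrent update; the tanh form below is the standard one used in the cs231n lecture this section follows, so treat the exact parameterization and names as assumptions:

    import numpy as np

    def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy):
        # new hidden state from the current input and the previous hidden state
        h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev)
        y_t = W_hy @ h_t          # predict a vector at each timestep
        return h_t, y_t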
Character Level Language Model with an RNN
[A sequence of slides walks through training an RNN to predict the next character in a string: each character is fed in one at a time and the network is trained to give high probability to the correct next character.]
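A sketch of how such a character-level model consumes one character and scores the next; the vocabulary, hidden size, and weight values are illustrative, not from the slides:

    import numpy as np

    vocab = ['h', 'e', 'l', 'o']                    # toy character vocabulary
    V, H = len(vocab), 8

    rng = np.random.default_rng(0)
    W_xh = rng.normal(size=(H, V)) * 0.01
    W_hh = rng.normal(size=(H, H)) * 0.01
    W_hy = rng.normal(size=(V, H)) * 0.01

    x_t = np.zeros(V); x_t[vocab.index('h')] = 1.0  # one-hot input character 'h'
    h_prev = np.zeros(H)

    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev)       # update the hidden state
    scores = W_hy @ h_t                             # one score per possible next character
    probs = np.exp(scores) / np.exp(scores).sum()   # softmax over the vocabulary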
Example: Generating Shakespeare with RNNs
Example: Generating C Code with RNNs
[Figures show sample text generated by character-level RNNs.]
Long Short Term Memory Networks (LSTMs)
Recurrent networks suffer from the “vanishing gradient problem”:
● They aren’t able to model long term dependencies in sequences
LSTMs use “gating units” to learn when to remember.
[Figures: an RNN (input → hidden → output) and an LSTM with gating units.]
RNNs and CNNs Together