TRANSCRIPT
Lecture 17: Neural Networks and Deep Learning
Jack Lanchantin
Dr. Yanjun Qi
UVA CS 6316 / CS 4501-004: Machine Learning
Fall 2016
Outline:
Neurons
1-Layer Neural Network
Multi-layer Neural Network
Loss Functions
Backpropagation
Nonlinearity Functions
NNs in Practice
[Roadmap figure: a multi-layer network mapping input x = (x1, x2, x3) through weights W1, W2, w3 to output ŷ, built from the logistic function e^(wx+b) / (1 + e^(wx+b)).]
Logistic Regression
Sigmoid function (aka logistic, logit, “S”, soft-step):
P(Y=1|x) = e^z / (1 + e^z)
Expanded Logistic Regression
[Figure: inputs x1, x2, x3 (p = 3) are multiplied by weights w1, w2, w3, summed with bias b1 (summing function), and passed through the sigmoid function to give ŷ = P(Y=1|x,w). This whole unit is a “neuron”.]
z = w^T x + b        (w^T: 1 x p, x: p x 1, z: 1 x 1)
ŷ = sigmoid(z) = e^z / (1 + e^z)
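Not from the slides — a minimal numpy sketch of the expanded logistic regression above, with p = 3 as in the figure (the input, weight, and bias values are made up for illustration):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))   # same value as e^z / (1 + e^z)

    x = np.array([0.5, -1.0, 2.0])        # input x1, x2, x3 (hypothetical values)
    w = np.array([0.1, 0.4, -0.3])        # weights w1, w2, w3
    b = 0.2                               # bias b1

    z = w @ x + b                         # summing function: z = w^T x + b
    y_hat = sigmoid(z)                    # ŷ = P(Y=1 | x, w)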
Neurons
(Figure from http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf)
Neuron
[Figure: inputs x1, x2, x3 are multiplied by weights w1, w2, w3 and summed to give z, which passes through the sigmoid to give ŷ.]
z = w^T x        (w^T: 1 x p, x: p x 1, z: 1 x 1)
ŷ = sigmoid(z) = e^z / (1 + e^z)
From here on, we leave out the bias for simplicity.
“Block View” of a Neuron
[Figure: input x → dot-product block (* w, a parameterized block) → z → sigmoid block → output ŷ.]
z = w^T x        (w^T: 1 x p, x: p x 1, z: 1 x 1)
ŷ = sigmoid(z) = e^z / (1 + e^z)
Neuron Representation
[Figure: the same neuron drawn two ways — as a summation followed by a sigmoid, and as a block view x → (* w) → ŷ.]
The linear transformation and the nonlinearity together are typically considered a single neuron.
Outline (next section: 1-Layer Neural Network)
1-Layer Neural Network (with 4 neurons)
[Figure: input x = (x1, x2, x3) feeds four summing units in parallel (p = 3 inputs, d = 4 neurons); one layer = a linear transformation (matrix W applied to vector x) followed by an element-wise sigmoid, giving outputs ŷ1, ŷ2, ŷ3, ŷ4.]
z = W^T x        (W^T: d x p, x: p x 1, z: d x 1)
ŷ = sigmoid(z) = e^z / (1 + e^z)        (element-wise on vector z; ŷ: d x 1)
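A minimal numpy sketch of this layer (not from the slides), with p = 3 inputs and d = 4 neurons as in the figure; the weight values are random placeholders:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    p, d = 3, 4
    rng = np.random.default_rng(0)
    W = rng.normal(size=(p, d))        # W: p x d, so W^T is d x p
    x = np.array([0.5, -1.0, 2.0])     # x: p x 1

    z = W.T @ x                        # z: d x 1
    y_hat = sigmoid(z)                 # element-wise sigmoid, giving ŷ1..ŷ4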
“Block View” of a Neural Network
[Figure: input x → dot-product block (* W, a parameterized block) → z → sigmoid block → output ŷ.]
W is now a matrix and z is now a vector:
z = W^T x        (W^T: d x p, x: p x 1, z: d x 1)
ŷ = sigmoid(z) = e^z / (1 + e^z)        (element-wise; ŷ: d x 1)
Outline (next section: Multi-layer Neural Network)
Multi-Layer Neural Network (Multi-Layer Perceptron (MLP) Network)
[Figure: a 2-layer NN — input x = (x1, x2, x3), a hidden layer with weights W1, and an output layer with weights w2 producing ŷ. The weight subscript represents the layer number.]
Multi-Layer Neural Network (MLP)
[Figure: a 3-layer NN — input x, a 1st hidden layer (weights W1), a 2nd hidden layer (weights W2), and an output layer (weights w3) producing ŷ.]
Multi-Layer Neural Network (MLP)
z1 = W1^T x      h1 = sigmoid(z1)      (hidden layer 1 output)
z2 = W2^T h1     h2 = sigmoid(z2)      (hidden layer 2 output)
z3 = w3^T h2     ŷ = sigmoid(z3)
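A sketch of this three-layer forward pass in numpy (only the structure matches the slide; the layer sizes and weight values are illustrative):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(3, 4))   # input (p = 3) -> 1st hidden layer (4 units)
    W2 = rng.normal(size=(4, 4))   # 1st hidden -> 2nd hidden layer
    w3 = rng.normal(size=(4,))     # 2nd hidden -> single output

    x  = np.array([0.5, -1.0, 2.0])
    h1 = sigmoid(W1.T @ x)         # z1 = W1^T x,  h1 = sigmoid(z1)
    h2 = sigmoid(W2.T @ h1)        # z2 = W2^T h1, h2 = sigmoid(z2)
    y_hat = sigmoid(w3 @ h2)       # z3 = w3^T h2, ŷ = sigmoid(z3)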
Multi-Class Output MLP
[Figure: the same 3-layer network, but with multiple output units, so the output-layer weights form a matrix W3.]
z1 = W1^T x      h1 = sigmoid(z1)
z2 = W2^T h1     h2 = sigmoid(z2)
z3 = W3^T h2     ŷ = sigmoid(z3)
“Block View” of an MLP
[Figure: x → (*W1) → z1 → sigmoid → h1 (1st hidden layer) → (*W2) → z2 → sigmoid → h2 (2nd hidden layer) → (*W3) → z3 → sigmoid → ŷ (output layer).]
“Deep” Neural Networks (i.e. > 1 hidden layer)
Researchers have successfully used 1000 layers to train an object classifier.
Outline (next section: Loss Functions)
[Roadmap figure: the network maps input x through weights W1, W2, w3 to the output ŷ = P(y=1|X,W), which is fed into a loss E(ŷ).]
Binary Classification Loss
[Figure: the network produces ŷ = P(y=1|X,W) from input x; y is the true output. This example is for a single sample x.]
E = loss = -log P(Y = y | X = x) = -y log(ŷ) - (1 - y) log(1 - ŷ)
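A sketch of this loss in numpy (single sample; ŷ could come from any of the forward passes above — the example numbers are made up):

    import numpy as np

    def binary_cross_entropy(y, y_hat, eps=1e-12):
        # E = -y log(ŷ) - (1 - y) log(1 - ŷ); eps guards against log(0)
        y_hat = np.clip(y_hat, eps, 1 - eps)
        return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

    binary_cross_entropy(y=1, y_hat=0.8)   # ≈ 0.223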
Regression Loss
[Figure: the network’s output unit is a plain summation (no sigmoid), producing a real-valued ŷ; y is the true output.]
E = loss = (1/2)(y - ŷ)^2
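The corresponding squared-error loss, as a one-line sketch in the same style:

    def regression_loss(y, y_hat):
        # E = 1/2 (y - ŷ)^2
        return 0.5 * (y - y_hat) ** 2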
Multi-Class Classification Loss
[Figure: the network has K = 3 output units producing ŷ1, ŷ2, ŷ3.]
“Softmax” function: a normalizing function which converts each class output to a probability, ŷ_i = P(ŷ_i = 1 | x).
E = loss = -Σ_{j=1..K} y_j ln ŷ_j        (y_j is “0” for all except the true class)
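A numpy sketch of the softmax and this loss with K = 3 (the raw scores and the true class are made-up values):

    import numpy as np

    def softmax(z):
        z = z - z.max()                  # subtract the max for numerical stability
        e = np.exp(z)
        return e / e.sum()

    z = np.array([2.0, 1.0, -1.0])       # raw class scores (hypothetical)
    y = np.array([1, 0, 0])              # one-hot true label: "0" for all except the true class
    y_hat = softmax(z)                   # ŷ_i = P(class i | x)
    E = -np.sum(y * np.log(y_hat))       # E = -Σ_j y_j ln ŷ_j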
Outline (next section: Backpropagation)
Training Neural Networks
How do we learn the optimal weights W_L for our task?
● Gradient descent: W_L(t+1) = W_L(t) - η ∂E/∂W_L(t)
(LeCun et al., Efficient BackProp, 1998)
But how do we get the gradients of the lower layers?
● Backpropagation!
○ Repeated application of the chain rule of calculus
○ Locally minimizes the objective
○ Requires all “blocks” of the network to be differentiable
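As a sketch, the update for one layer, assuming the gradient dE_dW has already been computed by backpropagation (lr plays the role of the learning rate η):

    def sgd_step(W, dE_dW, lr=0.1):
        # W(t+1) = W(t) - η * ∂E/∂W(t)
        return W - lr * dE_dW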
Backpropagation Intro
[A sequence of slides works through a small example of computing gradients on a simple computational graph; figures from http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf.]
The gradient tells us: by increasing x by a scale of 1, we decrease f by a scale of 4.
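A minimal sketch of the kind of worked example these slides use; the cs231n lecture walks through f(x, y, z) = (x + y)·z with x = -2, y = 5, z = -4, so treat the exact numbers as an assumption:

    # forward pass
    x, y, z = -2.0, 5.0, -4.0
    q = x + y          # q = 3
    f = q * z          # f = -12

    # backward pass (chain rule)
    df_dq = z          # ∂f/∂q = z = -4
    df_dz = q          # ∂f/∂z = q = 3
    df_dx = df_dq * 1  # ∂f/∂x = ∂f/∂q * ∂q/∂x = -4
    df_dy = df_dq * 1  # ∂f/∂y = ∂f/∂q * ∂q/∂y = -4
    # df_dx = -4: increasing x by 1 decreases f by about 4, as stated above.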
Backpropagation (binary classification example)
Example on a 1-hidden-layer NN for binary classification:
[Figure: input x → (*W1) → z1 → sigmoid → h1 → (*w2) → z2 → sigmoid → ŷ = P(y=1|X,W).]
[A sequence of slides derives the gradients for this network step by step: the loss E is written down, and gradient descent requires ∂E/∂w2 and ∂E/∂W1 (“need to find these!”). Each is obtained by exploiting the chain rule, working backward from the loss through ŷ, z2, h1, and z1, and reusing quantities that were already computed for later layers.]
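Not on the slides verbatim — a numpy sketch of the gradients this derivation produces, assuming the binary cross-entropy loss from earlier and the network x → (*W1) → sigmoid → h1 → (*w2) → sigmoid → ŷ (all values are made up):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    x  = np.array([0.5, -1.0, 2.0])          # single sample (hypothetical values)
    y  = 1.0                                 # true label
    W1 = rng.normal(size=(3, 4))
    w2 = rng.normal(size=(4,))

    # forward
    z1 = W1.T @ x;  h1 = sigmoid(z1)
    z2 = w2 @ h1;   y_hat = sigmoid(z2)

    # backward (chain rule), using the standard result ∂E/∂z2 = ŷ - y
    dE_dz2 = y_hat - y                       # sigmoid output + cross-entropy loss
    dE_dw2 = dE_dz2 * h1                     # ∂E/∂w2
    dE_dh1 = dE_dz2 * w2                     # gradient passed back through the dot product
    dE_dz1 = dE_dh1 * h1 * (1 - h1)          # through the sigmoid: σ'(z) = σ(z)(1 - σ(z))
    dE_dW1 = np.outer(x, dE_dz1)             # ∂E/∂W1, same shape as W1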
“Local-ness” of Backpropagation
[Figure: a single block f with input x and output y. On the forward pass the block computes its activations; on the backward pass it multiplies the gradient arriving from above by its “local gradients” to produce the gradients passed back to its inputs. From http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf.]
Example: Sigmoid Block
sigmoid(x) = σ(x) = 1 / (1 + e^(-x)), whose local gradient is dσ(x)/dx = σ(x)(1 - σ(x)).
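As a sketch, a sigmoid “block” that caches its output on the forward pass and applies the local gradient σ(x)(1 - σ(x)) on the backward pass (the class name is my own, not from the slides):

    import numpy as np

    class SigmoidBlock:
        def forward(self, x):
            self.out = 1.0 / (1.0 + np.exp(-x))   # σ(x), cached for the backward pass
            return self.out

        def backward(self, grad_out):
            # chain rule: incoming gradient times the local gradient σ(x)(1 - σ(x))
            return grad_out * self.out * (1.0 - self.out)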
Deep Learning = Concatenation of Differentiable Parameterized Layers (linear & nonlinearity functions)
[Figure: x → (*W1) → z1 → sigmoid → h1 (1st hidden layer) → (*W2) → z2 → sigmoid → h2 (2nd hidden layer) → (*W3) → z3 → sigmoid → ŷ (output layer).]
We want to find the optimal weights W to minimize some loss function E!
Backprop Whiteboard Demo
[Figure: a 2-input, 2-hidden-unit, 1-output network with weights w1..w6, biases b1, b2, b3, and blocks f1..f4.]
z1 = x1·w1 + x2·w3 + b1
z2 = x1·w2 + x2·w4 + b2
h1 = exp(z1) / (1 + exp(z1))
h2 = exp(z2) / (1 + exp(z2))
ŷ = h1·w5 + h2·w6 + b3
E = (y - ŷ)^2
Update rule: w(t+1) = w(t) - η ∂E/∂w(t), so we need ∂E/∂w = ?
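A numpy sketch of one gradient-descent step on exactly this little network; the input, target, and initial weight values are made up, and the gradient lines follow from applying the chain rule to the equations above:

    import numpy as np

    sig = lambda z: 1.0 / (1.0 + np.exp(-z))

    # hypothetical inputs, target, and initial weights
    x1, x2, y = 1.0, -2.0, 0.5
    w1, w2, w3, w4, w5, w6 = 0.1, -0.2, 0.3, 0.4, -0.5, 0.6
    b1, b2, b3 = 0.0, 0.0, 0.0
    lr = 0.1

    # forward pass (same equations as the slide)
    z1 = x1*w1 + x2*w3 + b1
    z2 = x1*w2 + x2*w4 + b2
    h1, h2 = sig(z1), sig(z2)
    y_hat = h1*w5 + h2*w6 + b3
    E = (y - y_hat)**2

    # backward pass (chain rule)
    dE_dyhat = 2.0 * (y_hat - y)
    dE_dw5, dE_dw6, dE_db3 = dE_dyhat*h1, dE_dyhat*h2, dE_dyhat
    dE_dz1 = dE_dyhat * w5 * h1 * (1 - h1)
    dE_dz2 = dE_dyhat * w6 * h2 * (1 - h2)
    dE_dw1, dE_dw3, dE_db1 = dE_dz1*x1, dE_dz1*x2, dE_dz1
    dE_dw2, dE_dw4, dE_db2 = dE_dz2*x1, dE_dz2*x2, dE_dz2

    # gradient-descent update, e.g. for w5:
    w5 = w5 - lr * dE_dw5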
Outline (next section: Nonlinearity Functions)
Nonlinearity Functions (i.e. transfer or activation functions)
[Figure: the neuron again — inputs x1, x2, x3 are multiplied by weights w1, w2, w3 (x·w) and summed with bias b1 (summing function) to give z, which then passes through the sigmoid function. The nonlinearity is this block after the summation.]
Nonlinearity Functions (aka transfer or activation functions)
[Table from https://en.wikipedia.org/wiki/Activation_function#Comparison_of_activation_functions listing activation functions by name, plot, equation, and derivative w.r.t. x; one of them (likely the ReLU) is annotated “usually works best in practice”.]
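A sketch of three common entries from that kind of table, written as functions with their derivatives w.r.t. x:

    import numpy as np

    # sigmoid: σ(x) = 1 / (1 + e^-x),  derivative σ'(x) = σ(x)(1 - σ(x))
    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def d_sigmoid(x):
        s = sigmoid(x)
        return s * (1.0 - s)

    # tanh: derivative 1 - tanh(x)^2
    def d_tanh(x):
        return 1.0 - np.tanh(x) ** 2

    # ReLU: max(0, x), derivative 1 for x > 0 and 0 otherwise
    def relu(x):
        return np.maximum(0.0, x)

    def d_relu(x):
        return (x > 0).astype(float)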
Outline (next section: NNs in Practice)
Neural Net Pipeline
[Figure: the network x → W1 → W2 → w3 → ŷ, feeding a loss E.]
1. Initialize weights
2. For each batch of input x samples S:
   a. Run the network “forward” on S to compute outputs and loss
   b. Run the network “backward” using outputs and loss to compute gradients
   c. Update weights using SGD (or a similar method)
3. Repeat step 2 until loss convergence
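A minimal sketch of this pipeline for a 1-layer network with the squared-error loss; the dataset, shapes, and hyperparameters here are illustrative, not from the slides:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))                # toy dataset: 100 samples, p = 3
    y = (X[:, 0] + X[:, 1] > 0).astype(float)    # toy binary targets
    W = rng.normal(size=(3,)) * 0.01             # 1. initialize weights
    lr, batch_size = 0.5, 10

    for epoch in range(100):                     # 3. repeat until convergence
        for i in range(0, len(X), batch_size):   # 2. for each batch S
            xb, yb = X[i:i+batch_size], y[i:i+batch_size]
            y_hat = sigmoid(xb @ W)              # 2a. forward: outputs and loss
            loss = 0.5 * np.mean((yb - y_hat) ** 2)
            # 2b. backward: gradient of the loss w.r.t. W
            dE_dW = xb.T @ ((y_hat - yb) * y_hat * (1 - y_hat)) / len(xb)
            W -= lr * dE_dW                      # 2c. SGD update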
Non-Convexity of Neural Nets
In very high dimensions, there exist many local minima which are about the same.
(Pascanu et al., On the saddle point problem for non-convex optimization, 2014)
Building Deep Neural Nets
[Figure: larger networks are built by composing differentiable blocks f with inputs x and outputs y; from http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf.]
“GoogLeNet” for Object Classification
[Figure: the GoogLeNet architecture.]
Block Example Implementation
[Figure: example code for a layer implemented as a forward/backward block; from http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf.]
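The slide’s own code is not reproduced here; as a sketch in the same spirit, a dot-product block with a forward and a backward method (the class and variable names are my own, not from the slide):

    import numpy as np

    class DotProductBlock:
        """A '* W' block: forward computes z = W^T x, backward returns local gradients."""
        def __init__(self, W):
            self.W = W

        def forward(self, x):
            self.x = x                      # cache the input for the backward pass
            return self.W.T @ x

        def backward(self, grad_z):
            dx = self.W @ grad_z            # gradient w.r.t. the input, passed downstream
            dW = np.outer(self.x, grad_z)   # gradient w.r.t. the weights, used for the update
            return dx, dW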
Advantage of Neural Nets
As long as it’s fully differentiable, we can train the model to automatically learn features for us.
Advanced Deep Learning Models: Convolutional Neural Networks & Recurrent Neural Networks
(Most slides from http://cs231n.stanford.edu/)

Convolutional Neural Networks (aka CNNs and ConvNets)
Challenges in Visual Recognition
[Figures illustrating why visual recognition is difficult.]
Problems with “Fully Connected Networks” on Images
For a 1000x1000 image with 1M hidden units, where each hidden neuron connects to every pixel, we would need on the order of 10^12 parameters.
Spatial correlation is local, so instead connect units locally.
How do we deal with multiple dimensions in the input? Length, height, channel (R, G, B).
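The arithmetic behind that parameter count, as a quick sketch (the 10x10 local window size below is an assumption for illustration):

    pixels = 1000 * 1000          # 1000x1000 image
    hidden_units = 10**6          # 1M hidden units

    fully_connected = pixels * hidden_units        # 10**12 parameters
    locally_connected = (10 * 10) * hidden_units   # 10**8 if each unit sees only a 10x10 patch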
Convolutional Layer
[Several slides of figures introduce the convolutional layer and the convolution operation: a small filter is slid over the input, computing a dot product with each local patch.]
Neuron View of Convolutional Layer
[In the neuron view, each output neuron is connected to only a local region of the input, and the weights are shared across spatial positions.]
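Not from the slides — a minimal numpy sketch of the convolution operation a convolutional layer performs (single channel, no padding, stride 1; the image and filter values are toy data):

    import numpy as np

    def conv2d(image, kernel):
        """Slide `kernel` over `image`, taking a dot product at each position."""
        H, W = image.shape
        kH, kW = kernel.shape
        out = np.zeros((H - kH + 1, W - kW + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                patch = image[i:i+kH, j:j+kW]
                out[i, j] = np.sum(patch * kernel)   # dot product with the local patch
        return out

    image = np.arange(36, dtype=float).reshape(6, 6)  # toy 6x6 "image"
    kernel = np.ones((3, 3)) / 9.0                    # 3x3 averaging filter
    activation_map = conv2d(image, kernel)            # 4x4 output map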
Convolutional Neural Networks
[Several slides of figures showing full ConvNet architectures; see also http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/]
Pooling Layer (Max Pooling example)
[Figures: pooling downsamples each activation map; max pooling takes the maximum over each local region.]
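And a corresponding sketch of 2x2 max pooling with stride 2 (again a toy implementation, not from the slides):

    import numpy as np

    def max_pool2d(x, size=2):
        H, W = x.shape
        out = np.zeros((H // size, W // size))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = x[i*size:(i+1)*size, j*size:(j+1)*size].max()
        return out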
History of ConvNets
1998: Gradient-based learning applied to document recognition [LeCun, Bottou, Bengio, Haffner]
2012: ImageNet Classification with Deep Convolutional Neural Networks [Krizhevsky, Sutskever, Hinton]
Recurrent Neural Networks
Standard “Feed-Forward” Neural Network
[Figures: a standard feed-forward network maps a fixed-size input through hidden layers to a fixed-size output in a single pass.]
Recurrent Neural Networks (RNNs)
RNNs can handle sequences.
[Figure: a traditional “feed forward” neural network (input → hidden → output) compared with a recurrent neural network, whose hidden state feeds back into itself so it can predict a vector at each timestep.]
Recurrent Neural Networks
[Figures: the hidden state h_t is computed from the current input and the previous hidden state h_t-1.]
“Vanilla” Recurrent Neural Network
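A sketch of one step of the vanilla recurrent update; the tanh form below is the standard one used in the cs231n lecture this section follows, so treat the exact parameterization and names as assumptions:

    import numpy as np

    def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy):
        # new hidden state from the current input and the previous hidden state
        h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev)
        y_t = W_hy @ h_t          # predict a vector at each timestep
        return h_t, y_t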
Character Level Language Model with an RNN
[A sequence of slides walks through training an RNN to predict the next character in a string: each character is fed in one at a time and the network is trained to give high probability to the correct next character.]
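A sketch of how such a character-level model consumes one character and scores the next; the vocabulary, hidden size, and weight values are illustrative, not from the slides:

    import numpy as np

    vocab = ['h', 'e', 'l', 'o']                    # toy character vocabulary
    V, H = len(vocab), 8

    rng = np.random.default_rng(0)
    W_xh = rng.normal(size=(H, V)) * 0.01
    W_hh = rng.normal(size=(H, H)) * 0.01
    W_hy = rng.normal(size=(V, H)) * 0.01

    x_t = np.zeros(V); x_t[vocab.index('h')] = 1.0  # one-hot input character 'h'
    h_prev = np.zeros(H)

    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev)       # update the hidden state
    scores = W_hy @ h_t                             # one score per possible next character
    probs = np.exp(scores) / np.exp(scores).sum()   # softmax over the vocabulary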
Example: Generating Shakespeare with RNNs
Example: Generating C Code with RNNs
[Figures show sample text generated by character-level RNNs.]
Long Short Term Memory Networks (LSTMs)
Recurrent networks suffer from the “vanishing gradient problem”:
● They aren’t able to model long term dependencies in sequences
LSTMs use “gating units” to learn when to remember.
[Figures: an RNN (input → hidden → output) and an LSTM with gating units.]
RNNs and CNNs Together