
Page 1:

Artificial Neural Networks: Intro

CSC411: Machine Learning and Data Mining, Winter 2018

Michael Guerzhoy and Lisa Zhang

“Making Connections” by Filomena Booth (2013)

Slides from Andrew Ng, Geoffrey Hinton, and Tom Mitchell


Page 2:

Non-Linear Decision Surfaces

[Figure: two-class data plotted on axes x1 and x2]

• There is no linear decision boundary


Page 3:

Car Classification

[Figure: training images labeled "Cars" and "Not a car"; at test time, a new image: "What is this?"]

Page 4:

You see this: [an image of a car]

But the camera sees this: [a grid of pixel intensity values]

Page 5:

[Figure: raw images labeled "Cars" and "Non-cars" are mapped to points in (pixel 1, pixel 2) space and fed to a learning algorithm]

Page 6:

[Figure: the cars and non-cars from the previous slide plotted as points in (pixel 1, pixel 2) space]

Page 7:

[Figure: cars and non-cars plotted by their (pixel 1, pixel 2) intensities]

50 × 50 pixel images → 2500 pixels (7500 if RGB)

$x = \left( \text{pixel 1 intensity}, \text{pixel 2 intensity}, \ldots, \text{pixel 2500 intensity} \right)^T$

Quadratic features ($x_i x_j$): ≈ 3 million features
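A quick sanity check on those counts (a sketch; the ≈3 million figure counts all pairwise products $x_i x_j$ of the 2500 grayscale intensities):

```python
from math import comb

n_pixels = 50 * 50                   # 2500 grayscale features
n_rgb = 3 * n_pixels                 # 7500 features if RGB
# quadratic features: all products x_i * x_j with i <= j
n_quadratic = comb(n_pixels, 2) + n_pixels
print(n_pixels, n_rgb, n_quadratic)  # 2500 7500 3126250 (~3 million)
```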

Page 8:

Simple Non-Linear Classification Example

[Figure: data plotted on axes x1 and x2 that requires a non-linear decision boundary]

Page 9:

Inspiration: The Brain


Page 10:

Inspiration: The Brain


Page 11:

Linear Neuron

[Figure: a linear neuron with inputs $x_1, x_2, x_3$, weights $w_1, w_2, w_3$, and bias $w_0$; its output is $w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3$]
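In code, the linear neuron in the figure is just a dot product plus a bias (a minimal sketch; the weights here are made up):

```python
import numpy as np

def linear_neuron(x, w, w0):
    """w0 + w1*x1 + w2*x2 + w3*x3, i.e. a dot product plus a bias."""
    return w0 + np.dot(w, x)

x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.25, 0.1])
print(linear_neuron(x, w, w0=0.2))  # 0.2 + 0.5 - 0.5 + 0.3 = 0.5
```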

Page 12:

Linear Neuron: Cost Function

• There are many possible choices. The one used for linear regression is

$\sum_{i=1}^{m} \left( y^{(i)} - w^T x^{(i)} \right)^2$

• Can minimize this using gradient descent to obtain the best weights w for the training set
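A minimal gradient-descent sketch for this cost (illustrative only; the data, learning rate, and step count are made up, and the bias $w_0$ is handled by a leading column of ones):

```python
import numpy as np

def fit_linear_neuron(X, y, lr=0.01, steps=2000):
    """Minimize sum_i (y_i - w . x_i)^2 by gradient descent.

    X: (m, n) inputs with a leading column of ones standing in for the bias."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y)   # gradient of the squared-error cost
        w -= lr * grad / len(y)        # divide by m so lr doesn't depend on dataset size
    return w

# recover y ~ 1 + 2x from noisy data
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
X = np.column_stack([np.ones_like(x), x])
y = 1 + 2 * x + 0.01 * rng.standard_normal(100)
print(fit_linear_neuron(X, y))  # ~ [1. 2.]
```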

Page 13:

Logistic Neuron

[Figure: a logistic neuron with inputs $x_1, x_2, x_3$, weights $w_1, w_2, w_3$, and bias $w_0$; its output is $\sigma(w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3)$, where $\sigma(t) = \frac{1}{1 + \exp(-t)}$]
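The corresponding code changes only by the sigmoid (a sketch mirroring the figure, with made-up weights):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logistic_neuron(x, w, w0):
    """sigma(w0 + w . x): an output in (0, 1), usable as P(y = 1 | x)."""
    return sigmoid(w0 + np.dot(w, x))

print(logistic_neuron(np.array([1.0, 2.0, 3.0]),
                      np.array([0.5, -0.25, 0.1]), w0=0.2))  # sigma(0.5) ~ 0.62
```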

Page 14:

Logistic Neuron: Cost Function

• Could use the quadratic cost function again

• Could use the “log-loss” function to make the neuron perform logistic regression

$-\sum_{i=1}^{m} \left[ y^{(i)} \log \frac{1}{1 + \exp\left(-w^T x^{(i)}\right)} + \left(1 - y^{(i)}\right) \log \frac{\exp\left(-w^T x^{(i)}\right)}{1 + \exp\left(-w^T x^{(i)}\right)} \right]$

(Note: we derived this cost function by saying we want to maximize the likelihood of the data under a certain model, but there's nothing stopping us from just making up a loss function)
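Since $\sigma(w^T x) = \frac{1}{1+\exp(-w^T x)}$ and $1 - \sigma(w^T x) = \frac{\exp(-w^T x)}{1+\exp(-w^T x)}$, this loss is straightforward to compute (a sketch; X is assumed to carry the bias as a leading column of ones):

```python
import numpy as np

def log_loss(w, X, Y):
    """Negative log-likelihood of labels Y in {0, 1} under the logistic model."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))             # P(y = 1 | x) for each row of X
    return -np.sum(Y * np.log(p) + (1 - Y) * np.log(1 - p))
```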

Page 15:

Logistic Regression Cost Function: Another Look

• $\mathrm{Cost}(h_w(x), y) = \begin{cases} -\log h_w(x) & y = 1 \\ -\log\left(1 - h_w(x)\right) & y = 0 \end{cases}$

• If y = 1, we want the cost to be small if $h_w(x)$ is close to 1 and large if $h_w(x)$ is close to 0
  • $-\log(t)$ is 0 at t = 1 and goes to infinity as t → 0

• If y = 0, we want the cost to be small if $h_w(x)$ is close to 0 and large if $h_w(x)$ is close to 1

• Note: $0 < \sigma(t) < 1$, where $\sigma(t) = \frac{1}{1 + \exp(-t)}$

Page 16:

Multilayer Neural Networks

• $h_{i,j} = g\left( W_{i,j}\, x \right) = g\left( \sum_k W_{i,j,k}\, x_k \right)$

• $x_0 = 1$ always

• $W_{i,j,0}$ is the "bias"

• g is the activation function
  • Could be $g(t) = t$
  • Could be $g(t) = \sigma(t)$

• Nobody uses those anymore…

[Figure: a network with input units $x_1, x_2, x_3$, hidden units $h_{i,j}$, and output units $o_1, o_2$]

Page 17:

Why do we need activation functions?
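One way to see the answer (a sketch): with $g(t) = t$, stacking layers buys nothing, because a composition of linear maps is itself a linear map.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((4, 3)), rng.standard_normal((2, 4))
x = rng.standard_normal(3)

deep = W2 @ (W1 @ x)        # two "layers" with identity activation...
shallow = (W2 @ W1) @ x     # ...equal one linear layer with weights W2 @ W1
print(np.allclose(deep, shallow))  # True
```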


Page 18:

Activation functions?
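For reference, the two activation functions used in the exercise on the next slide (a sketch):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))   # squashes any input into (0, 1)

def relu(t):
    return np.maximum(0.0, t)         # zero for t < 0, identity for t >= 0
```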


Page 19:

Exercise:

• Hand code a neural network to compute:
  • AND
  • XOR

• Use only sigmoid activations

• Use only ReLU activations

(one possible solution is sketched below)
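A possible hand-coded solution (a sketch; the weights are hand-picked and many other settings work; here AND uses a single sigmoid neuron and XOR uses ReLU hidden units):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def relu(t):
    return np.maximum(0.0, t)

def and_sigmoid(x1, x2):
    """AND with one sigmoid neuron: large weights saturate sigma."""
    return sigmoid(-30 + 20 * x1 + 20 * x2)   # ~0 unless both inputs are 1

def xor_relu(x1, x2):
    """XOR needs a hidden layer; here, two ReLU units."""
    h1 = relu(x1 + x2)        # counts active inputs
    h2 = relu(x1 + x2 - 1)    # fires only when both inputs are active
    return h1 - 2 * h2        # 0, 1, 1, 0 on the four inputs

for a in (0, 1):
    for b in (0, 1):
        print(a, b, round(and_sigmoid(a, b)), xor_relu(a, b))
```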

Page 20:

Multilayer Neural Network: Speech Recognition Example (multi-class classification)


Page 21:

Universal Approximator

• Neural networks with at least one hidden layer (and enough neurons) are universal approximators
  • They can represent any (smooth) function

• The capacity (the ability to represent different functions) increases with more hidden layers and more neurons

• Why go deeper? One hidden layer might need a lot of neurons; deeper, narrower networks can be more compact

Page 22:

Computation in Neural Networks

• Forward pass
  • Making predictions: plug in the input x, get the output y

• Backward pass
  • Compute the gradient of the cost function with respect to the weights

Page 23:

Multilayer Neural Network for Classification:

[Figure: the input vector (x) feeds a hidden layer, which feeds the outputs $o_1, o_2, o_3$]

$o_i$ is large if the probability that the correct class is i is high

A possible cost function:

$C(o, y) = \sum_{i=1}^{m} \left\| y^{(i)} - o^{(i)} \right\|^2$

where the $y^{(i)}$'s and $o^{(i)}$'s are encoded using one-hot encoding

[Figure: the same network with hidden units $h_1, \ldots, h_5$ and example weight labels $W^{(1,1,1)}$, $W^{(1,3,5)}$, $W^{(2,1,1)}$, $W^{(2,3,3)}$]
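A sketch of this cost with one-hot targets (the example values are made up):

```python
import numpy as np

def squared_error_cost(O, Y):
    """Sum of squared distances between outputs and one-hot targets.

    O, Y: arrays of shape (m, n_classes); each row of Y is one-hot."""
    return np.sum((Y - O) ** 2)

Y = np.array([[1, 0, 0], [0, 0, 1]], dtype=float)   # true classes: 0 and 2
O = np.array([[0.8, 0.1, 0.1], [0.2, 0.2, 0.6]])    # network outputs
print(squared_error_cost(O, Y))  # 0.06 + 0.24 = 0.30
```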

Page 24:

Forward Pass (vectorized)

$\mathbf{h} = g\left( \left( W^{(1)} \right)^T \mathbf{x} + b^{(1)} \right)$

$\mathbf{o} = g\left( \left( W^{(2)} \right)^T \mathbf{h} + b^{(2)} \right)$

…etc., if there are more layers

[Figure: the network with input units $x_1, x_2, x_3$, hidden units $h_{i,j}$, and output units $o_1, o_2$]
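A vectorized forward pass in a few lines (a sketch; the weight shapes follow the $W^T$ convention above, and the layer sizes are made up):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def forward(x, layers, g=sigmoid):
    """Apply a = g(W.T @ a + b) for each layer in turn."""
    a = x
    for W, b in layers:
        a = g(W.T @ a + b)
    return a

rng = np.random.default_rng(0)
layers = [(rng.standard_normal((3, 4)), np.zeros(4)),   # 3 inputs -> 4 hidden
          (rng.standard_normal((4, 2)), np.zeros(2))]   # 4 hidden -> 2 outputs
print(forward(np.array([1.0, 2.0, 3.0]), layers))
```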

Page 25:

Backwards Pass (training)

Need to find

$W = \arg\min_W \sum_{i=1}^{m} \mathrm{loss}\left( \mathbf{o}^{(i)}, \mathbf{y}^{(i)} \right)$

where:
• $\mathbf{o}^{(i)}$ is the output of the neural network
• $\mathbf{y}^{(i)}$ is the ground truth
• W is all the weights in the neural network
• loss is a continuous loss function

Use gradient descent to find a good 𝑊
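In code, this is the familiar gradient-descent loop over all the weights at once (a sketch; `grad_C` is a hypothetical function returning $\partial C / \partial W$, which the next slides show how to compute):

```python
def gradient_descent(W, grad_C, lr=0.01, steps=1000):
    """Repeatedly step the weights W against the gradient of the cost."""
    for _ in range(steps):
        W = W - lr * grad_C(W)   # grad_C is hypothetical: it comes from backprop
    return W
```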

Page 26:

But how to compute gradient?

• To optimize the weights / parameters of the neural network, we need to compute the gradient of the cost function $C(\mathbf{o}, \mathbf{y}) = \sum_{i=1}^{m} \mathrm{loss}\left( \mathbf{o}^{(i)}, \mathbf{y}^{(i)} \right)$ with respect to every weight in the neural network

• Need to compute, for every layer and weight (l, j, i): $\frac{\partial C}{\partial W^{(l,j,i)}}$

• How to do this? How to do this efficiently?

Page 27:

Review: Chain Rule

• Univariate Chain Rule

$\frac{d}{dt}\, g(f(t)) = \frac{dg}{df} \cdot \frac{df}{dt}$

• Multivariate Chain Rule

$\frac{\partial g}{\partial x} = \sum_i \frac{\partial g}{\partial f_i} \frac{\partial f_i}{\partial x}$

[Figure: computation graph in which x feeds into $f_1(x)$, $f_2(x)$, $f_3(x)$, which all feed into $g(f_1(x), f_2(x), f_3(x))$]
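A quick numerical check of the multivariate chain rule on the graph above, using $g(f_1, f_2, f_3) = f_1 f_2 + f_3$ with $f_i(x) = x^i$ (made-up functions for illustration):

```python
def g_of_x(x):
    f1, f2, f3 = x, x**2, x**3
    return f1 * f2 + f3

x, eps = 1.5, 1e-6
numeric = (g_of_x(x + eps) - g_of_x(x - eps)) / (2 * eps)
# chain rule: dg/dx = sum_i (dg/df_i)(df_i/dx)
analytic = (x**2) * 1 + x * (2 * x) + 1 * (3 * x**2)
print(numeric, analytic)   # both 13.5 (= 6 * x**2 at x = 1.5)
```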

Page 28:

Gradient of Single Weight (last layer)

• We need the partial derivatives of the cost function C(o, y) w.r.t. all the W's and b's.

• $o_i = g\left( \sum_j W^{(2,j,i)} h_j + b^{(2,i)} \right)$

• Let $z_i = \sum_j W^{(2,j,i)} h_j + b^{(2,i)}$, so that $o_i = g(z_i)$

• The partial derivative of C(o, y) with respect to $W^{(2,j,i)}$, all evaluated at (x, y, W, b, h, o):

$\frac{\partial C}{\partial W^{(2,j,i)}} = \frac{\partial o_i}{\partial W^{(2,j,i)}} \frac{\partial C}{\partial o_i} = \frac{\partial z_i}{\partial W^{(2,j,i)}} \frac{\partial g}{\partial z_i} \frac{\partial C}{\partial o_i} = h_j \frac{\partial g}{\partial z_i} \frac{\partial C}{\partial o_i} = h_j\, g'(z_i)\, \frac{\partial}{\partial o_i} C(o, y)$

[Figure: the network with the cost C computed from outputs $o_1, o_2$; the last-layer weight $W^{(2,2,1)}$ is highlighted]

Page 29:

Gradient of Single Weight (last layer)

• For example, if we use:
  • sigmoid activation: $g = \sigma$, with $\sigma'(t) = \sigma(t)\left(1 - \sigma(t)\right)$
  • MSE loss: $C(\mathbf{o}, \mathbf{y}) = \sum_{i=1}^{N} (o_i - y_i)^2$

• Then

$\frac{\partial C}{\partial W^{(2,j,i)}} = h_j\, g'(z_i)\, \frac{\partial C}{\partial o_i} = h_j\, g(z_i)\left(1 - g(z_i)\right) \cdot 2(o_i - y_i) = h_j\, o_i (1 - o_i) \cdot 2(o_i - y_i)$
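This formula is easy to verify with a finite-difference check (a sketch; biases are omitted and the sizes are made up):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)
h = rng.standard_normal(4)        # hidden activations
W = rng.standard_normal((4, 2))   # last-layer weights
y = np.array([1.0, 0.0])

def cost(W):
    o = sigmoid(h @ W)
    return np.sum((o - y) ** 2)

# the slide's formula: dC/dW[j, i] = h_j * o_i * (1 - o_i) * 2 * (o_i - y_i)
o = sigmoid(h @ W)
analytic = np.outer(h, o * (1 - o) * 2 * (o - y))

# compare one entry against a centered finite difference
(j, i), eps = (1, 0), 1e-6
Wp, Wm = W.copy(), W.copy()
Wp[j, i] += eps
Wm[j, i] -= eps
numeric = (cost(Wp) - cost(Wm)) / (2 * eps)
print(np.isclose(analytic[j, i], numeric))  # True
```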

Page 30:

Vectorization

• For a single weight, we had:

$\frac{\partial C}{\partial W^{(2,j,i)}} = h_j\, o_i (1 - o_i)\, 2(o_i - y_i)$

• Vectorizing, we get

$\frac{\partial C}{\partial \mathbf{W}^{(2)}} = 2\,\mathbf{h} \left( \mathbf{o} \odot (1 - \mathbf{o}) \odot (\mathbf{o} - \mathbf{y}) \right)^T$

where ⊙ denotes elementwise multiplication

• Note this is for sigmoid activation and square loss

Page 31:

What about earlier layers?

[Figure: the input vector (x), hidden layer $h_1, \ldots, h_5$, and outputs $o_1, o_2, o_3$, with weight labels $W^{(1,1,1)}$, $W^{(1,3,5)}$, $W^{(2,1,1)}$, $W^{(2,3,3)}$]

Use multivariate chain rule:

$\frac{\partial C}{\partial h_i} = \sum_k \frac{\partial C}{\partial o_k} \frac{\partial o_k}{\partial h_i}$

$\frac{\partial C}{\partial W^{(1,j,i)}} = \frac{\partial C}{\partial h_i} \frac{\partial h_i}{\partial W^{(1,j,i)}}$

Page 32:

Backpropagation

[Figure: a deeper network with an earlier hidden layer hidB (weights $W^{(1,\cdot,\cdot)}$ in, e.g. $W^{(1,3,5)}$) feeding a later hidden layer hidA, which feeds the outputs]

$\frac{\partial C}{\partial hidB_i} = \sum_k \frac{\partial C}{\partial hidA_k} \frac{\partial hidA_k}{\partial hidB_i}$

$\frac{\partial C}{\partial W^{(1,j,i)}} = \frac{\partial C}{\partial hidB_i} \frac{\partial hidB_i}{\partial W^{(1,j,i)}}$

Page 33:

[Figure: the input vector feeds through the hidden layers to the outputs. Compare the outputs with the correct answer to get an error signal; back-propagate the error signal to get derivatives for learning]

Page 34:

Training Summary

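Putting the forward pass, backpropagation, and gradient descent together (a minimal sketch: one hidden layer, sigmoid activations, square loss, trained on XOR; all sizes and hyperparameters are made up):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# XOR data; rows of X are examples, so the forward pass uses X @ W,
# the row-wise counterpart of the W^T x convention on the slides
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((2, 4)), np.zeros(4)   # 2 inputs -> 4 hidden
W2, b2 = rng.standard_normal((4, 1)), np.zeros(1)   # 4 hidden -> 1 output
lr = 1.0

for _ in range(5000):
    # forward pass
    H = sigmoid(X @ W1 + b1)
    O = sigmoid(H @ W2 + b2)
    # backward pass: chain rule, output layer first
    dZ2 = 2 * (O - Y) * O * (1 - O)        # dC/dz at the output
    dW2, db2 = H.T @ dZ2, dZ2.sum(axis=0)
    dZ1 = (dZ2 @ W2.T) * H * (1 - H)       # dC/dz at the hidden layer
    dW1, db1 = X.T @ dZ1, dZ1.sum(axis=0)
    # gradient descent step
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(O.round(2))  # typically close to [[0], [1], [1], [0]]
```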

Page 35:

Why is training neural networks so hard?

• Hard to optimize:
  • Not convex
  • Local minima, saddle points, etc.

• Can take a long time to train

• Architecture choices:
  • How many layers? How many units in each layer?
  • What activation function to use?

• Choice of optimizer:
  • We talked about gradient descent, but there are techniques that improve upon gradient descent

Page 36:

Demo

http://playground.tensorflow.org/
