
CS 4803 / 7643: Deep Learning

Zsolt Kira, Georgia Tech

Topics:
– Neural Networks
– Computing Gradients
– (Finish) manual differentiation

Administrivia
• PS1/HW1 out
• Instructor office hours: 10-11am Mondays, CODA 11th floor, conference room C1106 Briarcliff
  – This week only: Thursday 11am (same room)
• Start thinking about project topics/teams

(C) Dhruv Batra & Zsolt Kira 2

Do the Readings!

(C) Dhruv Batra & Zsolt Kira 3

Recap from last time

(C) Dhruv Batra & Zsolt Kira 4

(Before) Linear score function: $f = Wx$

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Neural networks: without the brain stuff

6

(Before) Linear score function: $f = Wx$

(Now) 2-layer Neural Network: $f = W_2 \max(0, W_1 x)$, or 3-layer Neural Network: $f = W_3 \max(0, W_2 \max(0, W_1 x))$

Neural networks: without the brain stuff

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

● Not really how neurons work (they are more complex)
● This is the same thing as the linear score function: a weighted sum of the inputs (plus a bias)
● Started with a binary output (0/1), depending on whether the sum is larger than 0 (or a threshold, if no bias is used)
● Still want to add a non-linearity to increase representational power
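To make the bullets above concrete, here is a minimal NumPy sketch (not from the slides; the weights and inputs are made up for illustration) of the binary-threshold unit and its sigmoid variant:

```python
import numpy as np

def threshold_neuron(x, w, b):
    """Perceptron-style unit: fire (1) if the weighted sum plus bias exceeds 0."""
    return 1.0 if np.dot(w, x) + b > 0 else 0.0

def sigmoid_neuron(x, w, b):
    """Smooth variant: squash the weighted sum into (0, 1) with a sigmoid."""
    z = np.dot(w, x) + b
    return 1.0 / (1.0 + np.exp(-z))

# illustrative values (not from the slides)
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.2, 0.4, -0.1])
b = 0.1
print(threshold_neuron(x, w, b), sigmoid_neuron(x, w, b))
```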

7

[Figure: a biological neuron: dendrites, cell body, axon, presynaptic terminal. Impulses are carried toward the cell body (along the dendrites) and away from the cell body (along the axon). This image by Felipe Perucho is licensed under CC-BY 3.0.]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

8

[Figure: the same neuron diagram, now paired with its artificial counterpart: inputs are weighted, summed in the cell body, and passed through a sigmoid activation function. This image by Felipe Perucho is licensed under CC-BY 3.0.]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Activation functions: Sigmoid, tanh, ReLU, Leaky ReLU, Maxout, ELU

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

With ReLU, this is exactly the formulation we showed before.
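A minimal NumPy sketch of several of the activation functions listed above. The leak slope of 0.01 and alpha = 1.0 for ELU are common defaults assumed here, not values from the slides; Maxout is omitted since it takes the max over several linear projections rather than transforming a single input.

```python
import numpy as np

def sigmoid(x):             return 1.0 / (1.0 + np.exp(-x))
def tanh(x):                return np.tanh(x)
def relu(x):                return np.maximum(0.0, x)
def leaky_relu(x, a=0.01):  return np.where(x > 0, x, a * x)
def elu(x, alpha=1.0):      return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.linspace(-3, 3, 7)
for f in (sigmoid, tanh, relu, leaky_relu, elu):
    print(f.__name__, np.round(f(x), 3))
```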

“Fully-connected” layers
“2-layer Neural Net”, or “1-hidden-layer Neural Net”
“3-layer Neural Net”, or “2-hidden-layer Neural Net”

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Neural networks: Architectures

Representational Power:
• Any continuous function with 1 hidden layer
• Any function with 2 hidden layers
• This doesn’t say how many nodes we’ll need!

11

(Before) Linear score function: $f = Wx$

(Now) 2-layer Neural Network: $f = W_2 \max(0, W_1 x)$, or 3-layer Neural Network: $f = W_3 \max(0, W_2 \max(0, W_1 x))$

Equivalent to the math viewpoint!

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

[Figure: network diagram x → W1 → h → W2 → s, with dimensions 3072 → 100 → 10.]
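A minimal NumPy sketch of the forward pass in the diagram: a 3072-dimensional input (e.g. a flattened 32×32×3 CIFAR-10 image), a 100-unit hidden layer, and 10 class scores. Random weights stand in for learned ones, and biases are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = 0.01 * rng.standard_normal((100, 3072))   # first-layer weights
W2 = 0.01 * rng.standard_normal((10, 100))     # second-layer weights
x = rng.standard_normal(3072)                  # flattened input image

h = np.maximum(0.0, W1 @ x)   # hidden layer: ReLU(W1 x)
s = W2 @ h                    # class scores: W2 h
print(h.shape, s.shape)       # (100,) (10,)
```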

Strategy: Follow the slope

In 1 dimension, the derivative of a function is
$\frac{df(x)}{dx} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$

In multiple dimensions, the gradient is the vector of partial derivatives along each dimension.

The slope in any direction is the dot product of that direction with the gradient. The direction of steepest descent is the negative gradient.

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Gradient Descent Pseudocode:

for i in {0, …, num_epochs}:
    for x, y in data:
        ŷ = SM(Wx)        # softmax over the scores
        L = CE(y, ŷ)      # cross-entropy loss
        W := W − α · ∂L/∂W

? ? ?  ← how do we compute this gradient term?

Some design decisions:
• How many examples should we use to calculate the gradient per iteration?
• What should α (the learning rate) be? Should it be constant throughout?
• How many epochs should we run?

The full sum is expensive when N is large!

Approximate the sum using a minibatch of examples; 32 / 64 / 128 are common.

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Stochastic Gradient Descent (SGD)
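A minimal NumPy sketch of the pseudocode above for a linear softmax classifier trained with minibatch SGD. The random data, batch size of 64, and learning rate are illustrative placeholders, not values from the course.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
N, D, C = 1000, 3072, 10                          # examples, features, classes
X = rng.standard_normal((N, D))
y = rng.integers(0, C, size=N)
W = 0.01 * rng.standard_normal((D, C))
alpha, batch_size = 1e-3, 64

for epoch in range(5):
    for start in range(0, N, batch_size):
        xb, yb = X[start:start + batch_size], y[start:start + batch_size]
        p = softmax(xb @ W)                       # ŷ = SM(Wx)
        loss = -np.log(p[np.arange(len(yb)), yb]).mean()   # L = CE(y, ŷ)
        dscores = p.copy()
        dscores[np.arange(len(yb)), yb] -= 1.0    # per-example dL/dscores for softmax + CE
        dW = xb.T @ dscores / len(yb)             # dL/dW, averaged over the minibatch
        W -= alpha * dW                           # W := W − α · dL/dW
```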

How do we compute gradients?
• Analytic or “Manual” Differentiation
• Symbolic Differentiation
• Numerical Differentiation
• Automatic Differentiation
  – Forward mode AD
  – Reverse mode AD (aka “backprop”)

(C) Dhruv Batra & Zsolt Kira 15

current W:
[0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …]   loss 1.25347

W + h (first dim):
[0.34 + 0.0001, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …]   loss 1.25322

gradient dW:
[-2.5, ?, ?, ?, ?, ?, ?, ?, ?, …]
(1.25322 − 1.25347) / 0.0001 = −2.5

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
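A minimal NumPy sketch of the finite-difference procedure shown above: perturb one coordinate of W by h, recompute the loss, and divide the change by h. The toy quadratic loss is a stand-in; the slide uses the classifier's loss.

```python
import numpy as np

def numerical_gradient(loss_fn, W, h=1e-4):
    """Approximate dL/dW one coordinate at a time (slow; useful only as a check)."""
    grad = np.zeros_like(W)
    base = loss_fn(W)
    for idx in np.ndindex(W.shape):
        W_plus = W.copy()
        W_plus[idx] += h
        grad[idx] = (loss_fn(W_plus) - base) / h   # e.g. (1.25322 - 1.25347) / 0.0001 = -2.5
    return grad

# toy example: quadratic loss, whose analytic gradient is 2W
W = np.array([0.34, -1.11, 0.78])
print(numerical_gradient(lambda w: np.sum(w ** 2), W))   # ≈ [0.68, -2.22, 1.56]
```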

Manually Computing Gradients

17

• Our function: linear with sigmoid activation
• Definition of sigmoid: $\sigma(t) = \frac{1}{1+e^{-t}}$
• Sigmoid has a nice derivative: $\sigma'(t) = \sigma(t)\,(1-\sigma(t))$
• Start with our loss function and calculate its derivative: 2 · (inside) · (derivative of the inside)
• The derivative of y (a constant!) vanishes, leaving a negative sign on the rest
• Move the negative out, include the derivative of g (i.e. g′), and multiply by the derivative of the inside
• Define some terms
• Write the result with the new terms and substitute in the derivative of the sigmoid
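A sketch of the derivation these bullets describe, assuming a squared-error loss on a single example with a sigmoid applied to a linear score (the slide's exact loss and notation may differ):

```latex
\begin{align*}
\sigma(t) &= \frac{1}{1 + e^{-t}},
\qquad
\sigma'(t) = \sigma(t)\bigl(1 - \sigma(t)\bigr) \\[4pt]
L(w) &= \bigl(y - g(w^\top x)\bigr)^2, \qquad g = \sigma \\[4pt]
\frac{\partial L}{\partial w}
  &= 2\bigl(y - g(w^\top x)\bigr)\,
     \frac{\partial}{\partial w}\bigl(y - g(w^\top x)\bigr)
   = -2\bigl(y - g(w^\top x)\bigr)\, g'(w^\top x)\, x \\[4pt]
\text{with } a = w^\top x,\ \hat{y} = g(a):
\quad
\frac{\partial L}{\partial w}
  &= -2\,(y - \hat{y})\,\hat{y}\,(1 - \hat{y})\, x
\end{align*}
```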

Plan for Today
• Decomposing a function
• Backpropagation algorithm (with example)
• Math
  – Function composition
  – Vectors
  – Tensors

(C) Dhruv Batra & Zsolt Kira 18

How to Simplify?
• Calculating gradients for large functions is complicated
• Idea: decompose the function and compute local gradients for each part!

(C) Dhruv Batra & Zsolt Kira 19

Computational Graphs
• Notation

(C) Dhruv Batra & Zsolt Kira 20

Example

(C) Dhruv Batra & Zsolt Kira 21

[Figure: example computational graph over inputs x1 and x2, built from *, +, and sin( ) nodes.]

[Figure: computational graph for a linear classifier: scores s = W·x (a * node), hinge loss on the scores, regularization R(W), and a + node combining them into the total loss L.]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Computational Graph

Logistic Regression as a Cascade

(C) Dhruv Batra and Zsolt Kira 23

Given a library of simple functions, compose them into a complicated function.

Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
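A minimal NumPy sketch of the idea: logistic regression written as a cascade of simple modules, each exposing a forward step and a backward step that applies its local gradient to an upstream gradient. The module names and interface are illustrative, not from the slides.

```python
import numpy as np

class Linear:                       # u = w . x
    def __init__(self, w): self.w = w
    def forward(self, x):
        self.x = x
        return self.w @ x
    def backward(self, dup):        # dup = dL/du (upstream gradient)
        return dup * self.x         # dL/dw = dL/du * x

class Sigmoid:                      # p = 1 / (1 + e^{-u})
    def forward(self, u):
        self.p = 1.0 / (1.0 + np.exp(-u))
        return self.p
    def backward(self, dup):
        return dup * self.p * (1.0 - self.p)

class NLL:                          # L = -[y log p + (1 - y) log(1 - p)]
    def forward(self, p, y):
        self.p, self.y = p, y
        return -(y * np.log(p) + (1 - y) * np.log(1 - p))
    def backward(self):
        return -(self.y / self.p) + (1 - self.y) / (1 - self.p)

# forward pass, then backward pass (chain rule, module by module)
x, y = np.array([1.0, -2.0, 0.5]), 1.0
lin, sig, loss = Linear(np.array([0.1, 0.2, -0.3])), Sigmoid(), NLL()
L = loss.forward(sig.forward(lin.forward(x)), y)
dw = lin.backward(sig.backward(loss.backward()))   # dL/dw
print(L, dw)
```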

[Figure: computational graph of a convolutional network (AlexNet), from the input image through the weights to the loss.]

Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission. 

Convolutional network (AlexNet)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Figure reproduced with permission from a Twitter post by Andrej Karpathy.

[Figure: computational graph of a Neural Turing Machine, from the input image to the loss.]

Neural Turing Machine

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

How do we compute gradients?
• Analytic or “Manual” Differentiation
• Symbolic Differentiation
• Numerical Differentiation
• Automatic Differentiation
  – Forward mode AD
  – Reverse mode AD (aka “backprop”)

(C) Dhruv Batra & Zsolt Kira 26

Any DAG of differentiable modules is allowed!

Slide Credit: Marc'Aurelio Ranzato · (C) Dhruv Batra & Zsolt Kira 27

Computational Graph

Directed Acyclic Graphs (DAGs)
• Exactly what the name suggests
  – Directed edges
  – No (directed) cycles
  – Underlying undirected cycles are okay

(C) Dhruv Batra 28

Directed Acyclic Graphs (DAGs)
• Concept: Topological Ordering

(C) Dhruv Batra 29
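A minimal sketch of topological ordering via depth-first search on a toy graph (not from the slides). Forward-prop evaluates a computational graph's nodes in such an order, and back-prop visits them in reverse.

```python
def topological_order(graph):
    """graph: dict mapping node -> list of children (directed edges)."""
    order, visited = [], set()

    def dfs(node):
        if node in visited:
            return
        visited.add(node)
        for child in graph.get(node, []):
            dfs(child)
        order.append(node)            # append only after all descendants

    for node in graph:
        dfs(node)
    return order[::-1]                # reverse post-order = topological order

# toy DAG: x -> mul -> add -> loss, w -> mul, b -> add
graph = {"x": ["mul"], "w": ["mul"], "mul": ["add"], "b": ["add"], "add": ["loss"], "loss": []}
print(topological_order(graph))       # parents always appear before their children
```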

Directed Acyclic Graphs (DAGs)

(C) Dhruv Batra 30

Key Computation: Forward-Prop

(C) Dhruv Batra & Zsolt Kira 31 · Slide Credit: Marc'Aurelio Ranzato, Yann LeCun

Key Computation: Back-Prop

(C) Dhruv Batra & Zsolt Kira 32 · Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
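A sketch of the key computation these two slides refer to, written for a single module with input h_in, parameters w, and output h_out = f(h_in, w) (the notation on the slides may differ):

```latex
\[
h^{\text{out}} = f(h^{\text{in}}, w)
\qquad \text{(forward prop)}
\]
\[
\frac{\partial L}{\partial h^{\text{in}}}
  = \frac{\partial L}{\partial h^{\text{out}}}\,
    \frac{\partial h^{\text{out}}}{\partial h^{\text{in}}},
\qquad
\frac{\partial L}{\partial w}
  = \frac{\partial L}{\partial h^{\text{out}}}\,
    \frac{\partial h^{\text{out}}}{\partial w}
\qquad \text{(back prop: upstream gradient $\times$ local gradient)}
\]
```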

How to Simplify?
• Calculating gradients for large functions is complicated
• Idea: decompose the function and compute local gradients for each part!
• We will use the chain rule to calculate the gradients
  – We will receive an upstream gradient from the layer after us
  – We will then use it to compute local gradients via the chain rule

(C) Dhruv Batra & Zsolt Kira 33

Neural Network Training
• Step 1: Compute loss on mini-batch [F-Pass]
• Step 2: Compute gradients wrt parameters [B-Pass]
• Step 3: Use gradients to update parameters

(C) Dhruv Batra & Zsolt Kira 34-40 · Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
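A minimal NumPy sketch of the three steps for the 2-layer ReLU network from earlier. A squared-error loss and random data are used purely so the example is self-contained; the gradients are derived by hand with the chain rule, which is exactly the computation backprop automates.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 3072))            # one mini-batch of inputs
T = rng.standard_normal((64, 10))              # illustrative regression targets
W1 = 0.01 * rng.standard_normal((3072, 100))
W2 = 0.01 * rng.standard_normal((100, 10))
alpha = 1e-3

for step in range(10):
    # Step 1: compute loss on mini-batch [F-Pass]
    H = np.maximum(0.0, X @ W1)                # hidden activations, ReLU(X W1)
    S = H @ W2                                 # scores
    loss = np.mean((S - T) ** 2)               # mean squared error (for illustration)

    # Step 2: compute gradients wrt parameters [B-Pass]
    dS = 2.0 * (S - T) / S.size                # dL/dS
    dW2 = H.T @ dS                             # chain rule through S = H W2
    dH = dS @ W2.T
    dH[H <= 0] = 0.0                           # backprop through the ReLU
    dW1 = X.T @ dH

    # Step 3: use gradients to update parameters
    W1 -= alpha * dW1
    W2 -= alpha * dW2
```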

Backpropagation: a simple example (slides 41-56, worked one node at a time)

e.g. x = -2, y = 5, z = -4

Want: the gradient of the output with respect to each input.

Chain rule: downstream gradient = upstream gradient × local gradient

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
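A minimal worked sketch assuming the standard CS231n example for these inputs, f(x, y, z) = (x + y) · z; the slides may present it differently, but the chain-rule pattern (upstream × local) is the same.

```python
# Forward pass: f(x, y, z) = (x + y) * z
x, y, z = -2.0, 5.0, -4.0
q = x + y            # q = 3
f = q * z            # f = -12

# Backward pass: chain rule, downstream = upstream * local
df_df = 1.0                  # gradient of f with respect to itself
df_dq = z * df_df            # mul gate local gradient: d(q*z)/dq = z  -> -4
df_dz = q * df_df            # mul gate local gradient: d(q*z)/dz = q  ->  3
df_dx = 1.0 * df_dq          # add gate local gradient: dq/dx = 1      -> -4
df_dy = 1.0 * df_dq          # add gate local gradient: dq/dy = 1      -> -4
print(f, df_dx, df_dy, df_dz)   # -12.0 -4.0 -4.0 3.0
```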

Patterns in backward flow
• add gate: gradient distributor
• Q: What is a max gate? A: gradient router
• Q: What is a mul gate? A: gradient switcher

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
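A minimal Python sketch of the three patterns above, with each gate's backward rule mapping an upstream gradient to gradients on its two inputs:

```python
def add_backward(upstream):
    """Add gate distributes the upstream gradient unchanged to both inputs."""
    return upstream, upstream

def max_backward(upstream, a, b):
    """Max gate routes the full upstream gradient to whichever input was larger."""
    return (upstream, 0.0) if a >= b else (0.0, upstream)

def mul_backward(upstream, a, b):
    """Mul gate 'switches' the inputs: each input's gradient is scaled by the other input."""
    return upstream * b, upstream * a

print(add_backward(2.0))             # (2.0, 2.0)
print(max_backward(2.0, 4.0, 1.0))   # (2.0, 0.0)
print(mul_backward(2.0, 3.0, -1.0))  # (-2.0, 6.0)
```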

(C) Dhruv Batra and Zsolt Kira 64

Summary
• We will have a composed non-linear function as our model
  – Several portions will have parameters
• We will use (stochastic / mini-batch) gradient descent with a loss function to define our objective
• Rather than analytically deriving gradients for a complex function, we will modularize the computation
  – Backpropagation = Gradient Descent + Chain Rule
• Now:
  – Work through the mathematical view
  – Vectors, matrices, and tensors
  – Next time: can the computer do this for us automatically?
• Read:
  – https://explained.ai/matrix-calculus/index.html
  – https://www.cc.gatech.edu/classes/AY2020/cs7643_fall/slides/L5_gradients_notes.pdf

Matrix/Vector Derivatives Notation
• Read:
  – https://explained.ai/matrix-calculus/index.html
  – https://www.cc.gatech.edu/classes/AY2020/cs7643_fall/slides/L5_gradients_notes.pdf
• Matrix/Vector Derivatives Notation
• Vector Derivative Example
• Extension to Tensors
• Chain Rule: Composite Functions
  – Scalar Case
  – Vector Case
  – Jacobian view
  – Graphical view
  – Tensors
• Logistic Regression Derivatives

(C) Dhruv Batra & Zsolt Kira 65
