# DL Class 1 - Go Deep or Go Home


• Louis Monier (@louis_monier)

Deep Learning Basics: Go Deep or Go Home

Class 1 - Q1 - 2016

• Supervised Learning

[Figure: a training set of labeled images (cat, hat, dog, cat, mug) is used to train a model. Given a new image, the model outputs a prediction, e.g. "dog (89%)". Training takes days; testing takes milliseconds.]

• Single Neuron to Neural Networks

• Single Neuron

[Figure: a single neuron. Inputs (e.g. xpos and ypos) are multiplied by weights w1 and w2 and summed together with a bias w3; this linear combination is passed through a nonlinearity to produce a score and a choice, e.g. linear combination 3.4906 → score 0.9981 → choice "red".]
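A minimal sketch of such a neuron in NumPy, assuming a sigmoid nonlinearity; the input and weight values below are illustrative, not the slide's:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative values: two inputs, two weights, one bias.
x = np.array([3.45, 1.21])      # inputs, e.g. (xpos, ypos)
w = np.array([1.59, -0.314])    # weights w1, w2
b = 2.65                        # bias w3

z = np.dot(w, x) + b            # linear combination
score = sigmoid(z)              # nonlinearity
print(z, score)                 # a high score -> choose "red"
```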

• More Neurons, More Layers: Deep Learning

[Figure: a feedforward network. The input layer (input1, input2, input3) feeds hidden layers that detect increasingly abstract features (circle, iris, eye, face), ending in an output layer: "89% likely to be a face".]

• How to find the best values for the weights?

[Figure: the error plotted as a function of a parameter w, with two points a and b on the curve.]

Define error $= |\text{expected} - \text{computed}|^2$.

Find the parameters that minimize the average error.

The gradient of the error with respect to w, $\partial E / \partial w$, is the slope of the error curve at the current w.

Update: take a step downhill, $w \leftarrow w - \eta \, \frac{\partial E}{\partial w}$, where $\eta$ is the learning rate.
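A minimal gradient-descent sketch in Python, assuming a 1-D squared error with a hand-coded derivative (the trivial model "prediction = w" is purely illustrative):

```python
# E(w) = (expected - w)^2, minimized at w = expected.
def grad(w, expected=2.0):
    return -2.0 * (expected - w)   # dE/dw

w, eta = -3.0, 0.1                 # starting weight, learning rate
for step in range(100):
    w -= eta * grad(w)             # take a step downhill
print(w)                           # converges toward 2.0
```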

• Egg carton in 1 million dimensions

[Figure: a bumpy "egg carton" error surface. Gradient descent tries to get to one of the low points; any of them will do.]

• Backpropagation: Assigning Blame

[Figure: a small network (inputs input1, input2, input3; two layers, each with weights w1, w2, w3 and a bias b) producing a prediction p, compared against the label.]

With label = 1 and prediction p = 0.3, the loss is 0.49, i.e. $(1 - 0.3)^2$.

1) For the loss to go down by 0.1, p must go up by 0.05. For p to go up by 0.05, w1 must go up by 0.09. For p to go up by 0.05, w2 must go down by 0.12. ...

In another case, with prediction p = 0.89:

2) For the loss to go down by 0.1, p must go down by 0.01.

3) For p to go down by 0.01, w2 must go up by 0.04.

Backpropagation works backwards from the loss, assigning each weight its share of the blame.
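A minimal numeric illustration of blame assignment via the chain rule, assuming a single sigmoid neuron with squared-error loss (all values are illustrative, not the slide's):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 0.8])     # inputs
w = np.array([0.1, 0.4, -0.3])     # weights
b, label = 0.2, 1.0

z = np.dot(w, x) + b
p = sigmoid(z)                     # prediction
loss = (label - p) ** 2

# Chain rule: dloss/dw_i = dloss/dp * dp/dz * dz/dw_i
dloss_dp = -2.0 * (label - p)
dp_dz = p * (1.0 - p)
dloss_dw = dloss_dp * dp_dz * x    # one "blame" value per weight
dloss_db = dloss_dp * dp_dz
print(loss, dloss_dw, dloss_db)
```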

• Stochastic Gradient Descent (SGD) and Minibatch

It's too expensive to compute the gradient on all inputs before taking a single step. We prefer a quick approximation.

Use a small random sample of inputs (a minibatch) to compute the gradient.

Apply the gradient to all the weights.

Welcome to SGD, much more efficient than regular GD!
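A minimal minibatch-SGD loop in NumPy, assuming a linear model with squared-error loss (the data, sizes, and learning rate are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                # 1000 examples, 3 features
true_w = np.array([1.5, -2.0, 0.7])
y = X @ true_w + 0.1 * rng.normal(size=1000)  # noisy targets

w = np.zeros(3)
eta, batch_size = 0.05, 32
for step in range(500):
    idx = rng.integers(0, len(X), size=batch_size)    # small random sample
    Xb, yb = X[idx], y[idx]
    grad = 2.0 * Xb.T @ (Xb @ w - yb) / batch_size    # gradient on the minibatch only
    w -= eta * grad                                   # apply to all the weights
print(w)   # close to true_w
```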

• Usually things would get intense at this point...

Chain rule

Partial derivatives

Jacobian matrix

Hessian

Sum over paths

Credit: Richard Socher, Stanford class CS224d

• Learning a new representation

Credit: http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

• Training Data

• Train. Validate. Test.

Training set: what the model learns from. One full pass over it is one epoch; shuffle it between epochs!

Validation set: held-out data used to check progress and tune choices during training.

Cross-validation: split the data into folds (1, 2, 3, 4) and rotate which fold is held out.

[Figure: four rotations of the folds 1 2 3 4, each round holding out a different fold.]

Testing set: never ever, ever, ever, ever, ever, ever use it for training!
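A minimal sketch of the three-way split, assuming scikit-learn's train_test_split (the percentages are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)   # toy data
y = np.arange(1000) % 2

# Carve off the test set first, then split the rest into train/validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25)

# X_test is touched exactly once, at the very end. Never train on it!
```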

• Tensors

• Tensors are not scary

Number: 0D

Vector: 1D

Matrix: 2D

Tensor: 3D, 4D, ... just an array of numbers, of any dimension

This image is a 3264 x 2448 x 3 tensor: each pixel holds three numbers, e.g. [r=0.45 g=0.84 b=0.76].
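A quick NumPy illustration of these shapes (the image dimensions mirror the slide's example):

```python
import numpy as np

number = np.array(4.2)              # 0D: shape ()
vector = np.array([1.0, 2.0, 3.0])  # 1D: shape (3,)
matrix = np.zeros((2, 3))           # 2D: shape (2, 3)
image  = np.zeros((3264, 2448, 3))  # 3D: height x width x RGB
image[0, 0] = [0.45, 0.84, 0.76]    # one pixel: [r, g, b]
print(number.ndim, vector.ndim, matrix.ndim, image.ndim)   # 0 1 2 3
```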

• Different types of networks / layers

• Fully-Connected Networks

Per layer, every input connects to every neuron.

[Figure: the same feedforward network as before: inputs (input1, input2, input3) flow through hidden layers (circle, iris, eye, face) to the output, "89% likely to be a face".]
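A minimal fully-connected network in Keras (the workshop's library); this assumes the Keras 2 API, and the layer sizes are illustrative:

```python
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(16, activation='relu', input_dim=3))  # hidden layer 1: every input feeds every neuron
model.add(Dense(16, activation='relu'))               # hidden layer 2
model.add(Dense(1, activation='sigmoid'))             # output: e.g. probability of "face"
model.compile(optimizer='sgd', loss='binary_crossentropy')
model.summary()
```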

• Recurrent Neural Networks (RNN)

Appropriate when the inputs are sequences. We'll cover them in detail when we work on text.

[Figure: an RNN unrolled over time steps t=0 through t=4, reading the sentence "this is not a pipe" one word per step.]

• Convolutional Neural Networks (ConvNets)

Appropriate for image tasks, but not limited to them. We'll cover them in detail in the next session.

• Hyperparameters

Activations: a zoo of nonlinear functions.

Initializations: the distribution of the initial weights. Not all zeros.

Objectives: comparing a prediction to the truth.

Regularizers: forcing the function we learn to remain simple.

...and many more.

• Activations

Nonlinear functions:

• Sigmoid, tanh, hard tanh and softsign

• ReLU (Rectified Linear Unit) and Leaky ReLU

• Softplus and Exponential Linear Unit (ELU)
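Minimal NumPy definitions of the activations named above (the leak and ELU alpha values are common defaults, not from the slide):

```python
import numpy as np

def sigmoid(x):    return 1.0 / (1.0 + np.exp(-x))
def tanh(x):       return np.tanh(x)
def hard_tanh(x):  return np.clip(x, -1.0, 1.0)
def softsign(x):   return x / (1.0 + np.abs(x))
def relu(x):       return np.maximum(0.0, x)
def leaky_relu(x, alpha=0.01): return np.where(x > 0, x, alpha * x)
def softplus(x):   return np.log1p(np.exp(x))
def elu(x, alpha=1.0): return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))
```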

• Softmax

| Raw scores | Probabilities (after SOFTMAX) |
| --- | --- |
| x1 = 6.78 | dog = 0.92536 |
| x2 = 4.25 | cat = 0.07371 |
| x3 = -0.16 | car = 0.00089 |
| x4 = -3.5 | hat = 0.00003 |
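A minimal NumPy softmax that reproduces the slide's numbers (subtracting the max is a standard numerical-stability trick, not part of the math):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract max for numerical stability
    return e / e.sum()

scores = np.array([6.78, 4.25, -0.16, -3.5])   # dog, cat, car, hat
print(softmax(scores))   # -> [0.92536, 0.07371, 0.00089, 0.00003]
```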

• Optimizers

• Various algorithms for driving Gradient Descent

Tricks to speed up SGD. Learn the learning rate.

• Cost/Loss/Objective Functions


• Cross-entropy Loss

| Raw scores | Probabilities (after SOFTMAX) | Loss if this is the label |
| --- | --- | --- |
| x1 = 6.78 | dog = 0.92536 | dog: loss = 0.078 |
| x2 = 4.25 | cat = 0.07371 | cat: loss = 2.607 |
| x3 = -0.16 | car = 0.00089 | car: loss = 7.024 |
| x4 = -3.5 | hat = 0.00003 | hat: loss = 10.414 |

The predicted probabilities are compared against the label (the truth). The cross-entropy loss is $-\ln p(\text{label})$: a confident correct prediction costs almost nothing, a confident wrong one costs a lot.
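A one-line check of those numbers in NumPy (cross-entropy with a one-hot label reduces to $-\ln p$ of the true class):

```python
import numpy as np

probs = np.array([0.92536, 0.07371, 0.00089, 0.00003])   # dog, cat, car, hat
for name, p in zip(['dog', 'cat', 'car', 'hat'], probs):
    print(name, -np.log(p))   # 0.078, 2.607, 7.024, 10.414
```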

• Regularization: Preventing Overfitting

• Overfitting

[Figure: three fits of the same data.]

Very simple: might not predict well.

Just right?

Overfitting: rote memorization. Will not generalize well.

• Regularization: Avoiding Overfitting

Idea: keep the learned function simple by constraining the weights.

Loss = Error(prediction, truth) + L(keep the solution simple)

L2 makes all parameters medium-sized.

L1 drives many parameters to zero.

L1+L2 is sometimes used.
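In Keras this is one argument per layer; a sketch assuming the Keras 2 regularizers API (the 0.01 strengths are illustrative):

```python
from keras.models import Sequential
from keras.layers import Dense
from keras.regularizers import l1, l2, l1_l2

model = Sequential()
model.add(Dense(16, activation='relu', input_dim=3,
                kernel_regularizer=l2(0.01)))    # L2: keep weights medium-sized
model.add(Dense(16, activation='relu',
                kernel_regularizer=l1(0.01)))    # L1: push many weights to zero
model.add(Dense(1, activation='sigmoid',
                kernel_regularizer=l1_l2(l1=0.01, l2=0.01)))   # L1+L2 combined
```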

• Regularization: Dropout

At every step during training, ignore the output of a random fraction p of the neurons.

p = 0.5 is a good default.

[Figure: the network (input1, input2, input3) with a random subset of neurons dropped.]
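Keras ships a Dropout layer for exactly this; a sketch assuming the Keras 2 API, with illustrative layer sizes:

```python
from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(16, activation='relu', input_dim=3))
model.add(Dropout(0.5))   # ignore half of this layer's outputs at each training step
model.add(Dense(1, activation='sigmoid'))
# Dropout is only active during training; at test time all neurons are used.
```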

• One Last Bit of Wisdom

• Workshop: Keras & Addition RNN (https://github.com/holbertonschool/deep-learning/tree/master/Class%20%230)