DL Class 1 - Go Deep or Go Home


  • Louis Monier
    @louis_monier
    https://www.linkedin.com/in/louismonier

    Gregory Renard
    @redo
    https://www.linkedin.com/in/gregoryrenard

    Deep Learning Basics: Go Deep or Go Home

    Class 1 - Q1 - 2016

  • Supervised Learning

    Figure: a training set of labeled images (cat, hat, dog, cat, mug); given a new image, the model predicts "dog (89%)".

    Training takes days; testing (prediction) takes milliseconds.

  • Single Neuron to Neural Networks

  • Single Neuron

    Figure: a single neuron. The inputs xpos and ypos are multiplied by weights w1 and w2 and summed with a bias w3 (the linear combination), then passed through a nonlinearity to produce a score and a choice.

    Example values from the slide: inputs 3.45 and 1.21, weights -0.314 and 1.59, bias 2.65; linear combination 3.4906; score after the nonlinearity 0.9981, read as "red".
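
    A minimal sketch of this computation in Python; the pairing of the slide's numbers into inputs, weights and bias is inferred, and tanh is assumed as the nonlinearity because it reproduces the 0.9981 score:

    ```python
    import numpy as np

    # One neuron: weighted sum of the inputs plus a bias, then a nonlinearity.
    xpos, ypos = 3.45, 1.21            # inputs
    w1, w2, w3 = -0.314, 1.59, 2.65    # weights and bias (w3), as read off the slide

    score = w1 * xpos + w2 * ypos + w3 # linear combination -> 3.4906
    choice = np.tanh(score)            # nonlinearity       -> 0.9981 ("red")

    print(score, choice)
    ```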

  • More Neurons, More Layers: Deep Learning

    Figure: an input layer (input1, input2, input3), several hidden layers, and an output layer. Early layers detect simple features (circle, iris), later layers combine them into more complex ones (eye, face); the output reads "89% likely to be a face".
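
    A sketch of such a stack of layers as a forward pass in numpy (layer sizes and the random weights are illustrative, not from the slides):

    ```python
    import numpy as np

    def layer(x, W, b):
        """One layer: linear combination of the inputs, then a nonlinearity."""
        return np.tanh(W @ x + b)

    np.random.seed(0)
    sizes = [3, 5, 5, 1]    # input1..input3 -> two hidden layers -> one output
    params = [(np.random.randn(m, n), np.random.randn(m))
              for n, m in zip(sizes, sizes[1:])]

    x = np.array([0.2, -1.0, 0.5])      # input1, input2, input3
    for W, b in params:
        x = layer(x, W, b)
    print(x)                            # output-layer activation
    ```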

  • How to find the best values for the weights?

    Figure: the error plotted as a function of a parameter w, with two points a and b on the curve.

    Define error = |expected - computed|²

    Find the parameters that minimize the average error.

    The gradient of the error with respect to w tells us which way is uphill.

    Update: w ← w - η · ∂error/∂w, i.e. take a step downhill.
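
    A one-parameter sketch of this update rule (the target value, input and learning rate are made up for illustration):

    ```python
    # Gradient descent on error(w) = (expected - computed)^2 with computed = w * x.
    expected, x = 2.0, 1.5
    w, lr = 0.0, 0.1                            # initial weight and learning rate

    for step in range(50):
        computed = w * x
        grad = -2 * (expected - computed) * x   # d(error)/dw
        w = w - lr * grad                       # take a step downhill

    print(w, w * x)                             # w * x is now close to `expected`
    ```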

  • Egg carton in 1 million dimensions

    Figure: the error surface looks like an egg carton in a million dimensions; gradient descent tries to reach one of the low points ("get to here, or here").

  • Backpropagation: Assigning Blame

    Figure: a small network with inputs input1, input2, input3 and parameters w1, w2, w3, b in each layer; label = 1, prediction p = 0.3, loss = 0.49 (a second prediction shown is p = 0.89).

    1) For the loss to go down by 0.1, p must go up by 0.05. For p to go up by 0.05, w1 must go up by 0.09. For p to go up by 0.05, w2 must go down by 0.12. ...

    2) For the loss to go down by 0.1, p must go down by 0.01.

    3) For p to go down by 0.01, w2 must go up by 0.04.
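
    A minimal numpy sketch of assigning blame with the chain rule on a single sigmoid neuron; the inputs and weights are hypothetical, chosen so that p ≈ 0.3 and loss ≈ 0.49 as on the slide:

    ```python
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x = np.array([1.0, 2.0, -1.0])     # input1, input2, input3 (hypothetical)
    w = np.array([0.1, -0.2, 0.3])     # w1, w2, w3
    b = -0.25                          # bias
    label = 1.0

    p = sigmoid(w @ x + b)             # prediction, ~0.30
    loss = (label - p) ** 2            # ~0.49

    # Chain rule: d(loss)/dw_i = d(loss)/dp * dp/dz * dz/dw_i
    dloss_dp = -2 * (label - p)
    dp_dz = p * (1 - p)
    dloss_dw = dloss_dp * dp_dz * x    # blame assigned to each weight
    dloss_db = dloss_dp * dp_dz        # blame assigned to the bias

    print(p, loss, dloss_dw, dloss_db)
    ```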

  • Stochastic Gradient Descent (SGD) and Minibatch

    It's too expensive to compute the gradient on all inputs just to take one step.

    We prefer a quick approximation.

    Use a small random sample of inputs (minibatch) to compute the gradient.

    Apply the gradient to all the weights.

    Welcome to SGD, much more efficient than regular GD!
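
    A minibatch-SGD sketch on a toy least-squares problem (the data, batch size and learning rate are all hypothetical):

    ```python
    import numpy as np

    np.random.seed(0)
    N, D = 10000, 5
    X = np.random.randn(N, D)
    true_w = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
    y = X @ true_w + 0.1 * np.random.randn(N)

    w = np.zeros(D)
    lr, batch_size = 0.1, 32

    for step in range(1000):
        idx = np.random.randint(0, N, batch_size)        # small random sample of inputs
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size     # gradient on the minibatch only
        w -= lr * grad                                   # apply it to all the weights

    print(w)                                             # close to true_w
    ```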

  • Usually things would get intense at this point...

    Chain rule

    Partial derivatives

    Jacobian matrix

    Hessian

    Local gradient

    Sum over paths

    Credit: Richard Socher, Stanford class cs224d

  • Learning a new representation

    Credit: http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

  • Training Data

  • Train. Validate. Test.

    Training set

    Validation set

    Cross-validation

    Testing set

    Figure: cross-validation rotates which of the four folds (1, 2, 3, 4) is held out for validation; shuffle the data between passes. One full pass over the training data is one epoch.

    Never ever, ever, ever, ever, ever, ever use the testing set for training!
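
    A sketch of such a split (the 80/10/10 proportions are a common choice, not from the slides):

    ```python
    import numpy as np

    data = np.arange(1000)                 # stand-in for a labeled dataset
    np.random.shuffle(data)                # shuffle!

    n = len(data)
    train = data[: int(0.8 * n)]
    valid = data[int(0.8 * n): int(0.9 * n)]
    test  = data[int(0.9 * n):]

    for epoch in range(5):                 # one pass over the training set = one epoch
        np.random.shuffle(train)
        # ... train on `train`, tune hyperparameters on `valid` ...

    # ... report final results on `test`, exactly once, never training on it ...
    ```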

  • Tensors

  • Tensors are not scary

    Number: 0D

    Vector: 1D

    Matrix: 2D

    Tensor: 3D, 4D, ... any array of numbers

    This image is a 3264 x 2448 x 3 tensor; each pixel is a triple of color values such as [r=0.45 g=0.84 b=0.76].
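
    The same idea in numpy (the array shapes simply mirror the slide):

    ```python
    import numpy as np

    number = np.array(3.14)                  # 0D tensor
    vector = np.array([1.0, 2.0, 3.0])       # 1D tensor
    matrix = np.ones((2, 3))                 # 2D tensor

    image = np.zeros((3264, 2448, 3))        # 3D tensor: 3264 x 2448 x 3 color channels
    image[0, 0] = [0.45, 0.84, 0.76]         # one pixel: [r, g, b]

    print(number.ndim, vector.ndim, matrix.ndim, image.shape)
    ```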

  • Different types of networks / layers

  • Fully-Connected Networks

    Per layer, every input connects to every neuron.

    Figure: the same face-detection network as before: input layer (input1, input2, input3), hidden layers, output layer; "89% likely to be a face".
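
    A minimal fully-connected model sketched with Keras (the library used in the workshop); the layer sizes, activations and optimizer are illustrative choices, not taken from the slides:

    ```python
    from keras.models import Sequential
    from keras.layers import Dense

    model = Sequential()
    model.add(Dense(64, activation='relu', input_dim=3))   # input1..input3
    model.add(Dense(64, activation='relu'))                # hidden layers
    model.add(Dense(1, activation='sigmoid'))              # output, e.g. P(face)

    model.compile(optimizer='sgd', loss='binary_crossentropy', metrics=['accuracy'])
    # model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10)
    ```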

  • Recurrent Neural Networks (RNN)

    Appropriate when inputs are sequences.

    We'll cover them in detail when we work on text.

    Figure: the sentence "this is not a pipe" fed in one word per time step (t=0, t=1, t=2, t=3, t=4), and the same sequence reversed ("pipe a not is this").
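
    For a first taste, a recurrent layer in Keras consuming a 5-step sequence (shapes and sizes are illustrative; the text sessions cover RNNs properly):

    ```python
    from keras.models import Sequential
    from keras.layers import SimpleRNN, Dense

    model = Sequential()
    model.add(SimpleRNN(32, input_shape=(5, 10)))   # 5 time steps, 10 features each
    model.add(Dense(1, activation='sigmoid'))
    model.compile(optimizer='sgd', loss='binary_crossentropy')
    ```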

  • Convolutional Neural Networks (ConvNets)

    Appropriate for image tasks, but not limited to them.

    We'll cover them in detail in the next session.

  • Hyperparameters

    Activations: a zoo of nonlinear functions

    Initializations: Distribution of initial weights. Not all zeros.

    Optimizers: driving the gradient descent

    Objectives: comparing a prediction to the truth

    Regularizers: forcing the function we learn to remain simple

    ...and many more

  • Activations: Nonlinear Functions

  • Sigmoid, tanh, hard tanh and softsign

  • ReLU (Rectified Linear Unit) and Leaky ReLU

  • Softplus and Exponential Linear Unit (ELU)
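
    These activations written out in numpy (the leak and alpha values are common defaults, not from the slides):

    ```python
    import numpy as np

    def sigmoid(x):   return 1.0 / (1.0 + np.exp(-x))
    def tanh(x):      return np.tanh(x)
    def hard_tanh(x): return np.clip(x, -1.0, 1.0)
    def softsign(x):  return x / (1.0 + np.abs(x))
    def relu(x):      return np.maximum(0.0, x)
    def leaky_relu(x, alpha=0.01): return np.where(x > 0, x, alpha * x)
    def softplus(x):  return np.log1p(np.exp(x))
    def elu(x, alpha=1.0): return np.where(x > 0, x, alpha * (np.exp(x) - 1))
    ```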

  • Softmax

    Raw scores → SOFTMAX → Probabilities

    x1 = 6.78   →   dog = 0.92536
    x2 = 4.25   →   cat = 0.07371
    x3 = -0.16  →   car = 0.00089
    x4 = -3.5   →   hat = 0.00003
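
    The softmax function itself, reproducing the numbers above:

    ```python
    import numpy as np

    def softmax(scores):
        """Turn raw scores into probabilities that sum to 1."""
        e = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
        return e / e.sum()

    scores = np.array([6.78, 4.25, -0.16, -3.5])   # dog, cat, car, hat
    print(softmax(scores))                         # ~[0.92536, 0.07371, 0.00089, 0.00003]
    ```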

  • Optimizers

  • Various algorithms for driving Gradient Descent

    Tricks to speed up SGD.

    Learn the learning rate.

    Credit: Alec Radford
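
    One of the simplest such tricks, momentum, as a numpy sketch (the learning rate and momentum values are typical defaults, not from the slides):

    ```python
    import numpy as np

    def sgd_momentum(w, grad_fn, lr=0.01, momentum=0.9, steps=200):
        v = np.zeros_like(w)
        for _ in range(steps):
            v = momentum * v - lr * grad_fn(w)   # keep a memory of the direction of travel
            w = w + v
        return w

    # Example: minimize f(w) = ||w||^2, whose gradient is 2w; w ends up near 0.
    print(sgd_momentum(np.array([5.0, -3.0]), lambda w: 2 * w))
    ```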

  • Cost/Loss/Objective Functions

    Figure: how different is the prediction from the truth?

  • Cross-entropy Loss

    Raw scores → SOFTMAX → Probabilities → Loss if that class is the label (truth)

    x1 = 6.78   →   dog = 0.92536   →   dog: loss = 0.078
    x2 = 4.25   →   cat = 0.07371   →   cat: loss = 2.607
    x3 = -0.16  →   car = 0.00089   →   car: loss = 7.024
    x4 = -3.5   →   hat = 0.00003   →   hat: loss = 10.414
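
    The cross-entropy loss is the negative log of the probability given to the true label; this reproduces the numbers above:

    ```python
    import numpy as np

    def cross_entropy(probs, label_index):
        """Loss = -log(probability assigned to the true label)."""
        return -np.log(probs[label_index])

    probs = np.array([0.92536, 0.07371, 0.00089, 0.00003])   # dog, cat, car, hat
    for i, name in enumerate(["dog", "cat", "car", "hat"]):
        print(name, cross_entropy(probs, i))   # 0.078, 2.607, 7.024, 10.414
    ```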

  • Regularization: Preventing Overfitting

  • Overfitting

    Figure: three fits to the same data points.

    Very simple: might not predict well.

    Just right?

    Overfitting: rote memorization; will not generalize well.

  • Regularization: Avoiding Overfitting

    Idea: keep the functions simple by constraining the weights.

    Loss = Error(prediction, truth) + L(keep solution simple)

    L2 makes all parameters medium-size

    L1 kills many parameters (drives them to exactly zero).

    L1+L2 sometimes used.
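
    Penalizing the weights in the loss, as a sketch (the lambda values are illustrative):

    ```python
    import numpy as np

    def regularized_loss(error, w, l1=0.0, l2=0.0):
        """Loss = Error(prediction, truth) + penalty that keeps the solution simple."""
        return error + l1 * np.sum(np.abs(w)) + l2 * np.sum(w ** 2)

    w = np.array([0.5, -1.2, 3.0])
    print(regularized_loss(error=0.25, w=w, l2=0.01))            # L2: all weights medium-size
    print(regularized_loss(error=0.25, w=w, l1=0.01))            # L1: pushes weights to zero
    print(regularized_loss(error=0.25, w=w, l1=0.01, l2=0.01))   # L1+L2
    ```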

  • Regularization: Dropout

    At every step during training, ignore the output of a fraction p of the neurons.

    p = 0.5 is a good default.

    Figure: the network with some neurons dropped out (inputs input1, input2, input3).
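
    A sketch of a dropout mask in numpy; the rescaling by 1/(1-p) ("inverted dropout") is a common convention, not something stated on the slide:

    ```python
    import numpy as np

    def dropout(activations, p=0.5, training=True):
        """During training, zero out each neuron's output with probability p."""
        if not training:
            return activations
        mask = np.random.rand(*activations.shape) >= p
        return activations * mask / (1.0 - p)

    h = np.array([0.3, 1.2, -0.7, 0.9, 0.1, 2.0])
    print(dropout(h, p=0.5))   # roughly half the outputs are ignored
    ```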

  • One Last Bit of Wisdom

  • Workshop: Keras & Addition RNN

    https://github.com/holbertonschool/deep-learning/tree/master/Class%20%230
