TRANSCRIPT
The Art Of Backpropagation
and other Bedtime Deep Learning Stories
Jennifer Prendki, @WalmartLabs
Why this talk?
• Deep Learning can solve many problems
• Deep Learning is trendy
• Deep Learning is applied in many different industries
Everybody is using it, or wants to use it
• But many people use Deep Learning as a black box
• There is no consistent theory regarding architecture building
Context: Neural Nets, Forward & Backward Feeds
• Back to the basics: what are Artificial Neural Nets?
The combination of:
• a training method
• an optimization method
A 2-phase cycle:
• propagation
• weight update
Deep Learning Glossary
• Input: the first layer (what is fed to the algorithm, the initial data columns)
• Output: what we want to compute (can be more than one value)
• Hidden layers: the neurons for the intermediate steps
• Forward propagation: propagation of a training pattern's input through the neural network in order to generate the network's output value(s)
• Backward propagation: propagation of the output activations back through the neural net, using the training pattern's target, in order to generate the deltas
• Deltas: the differences between the targeted and actual output values of all output and hidden neurons
• Weight update: the process of multiplying the output delta and input activation to compute the gradient of the weight
• Learning rate: the ratio of the weight's gradient that is subtracted from the weight
Backpropagation Algorithm
• Propagation
  - Forward propagation of a training pattern's input through the neural network in order to generate the network's output value(s).
  - Backward propagation of the propagation's output activations through the neural network, using the training pattern's target, in order to generate the deltas.
• Weight update
  - The weight's output delta and input activation are multiplied to find the gradient of the weight.
  - The weight is updated according to the learning rate.
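The two-phase cycle above can be sketched in NumPy. This is a minimal illustration, not the speaker's code: the layer sizes, the sigmoid activation, the squared-error deltas, and the single training pattern are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy training pattern: 3 inputs -> 1 target output (assumed for illustration)
x = rng.normal(size=(3, 1))    # input column vector
t = np.array([[1.0]])          # training pattern target

# Starting weights (random initialization)
W1 = rng.normal(size=(4, 3))   # hidden layer: 4 neurons
W2 = rng.normal(size=(1, 4))   # output layer: 1 neuron

lr = 0.5                       # learning rate

# --- Forward propagation: generate the network's output value ---
h = sigmoid(W1 @ x)            # hidden activations
y = sigmoid(W2 @ h)            # output value

# --- Backward propagation: generate the deltas ---
delta_out = (y - t) * y * (1 - y)              # output delta
delta_hid = (W2.T @ delta_out) * h * (1 - h)   # hidden deltas

# --- Weight update: output delta x input activation = weight gradient,
# --- and the learning rate scales how much of it is subtracted ---
W2 -= lr * (delta_out @ h.T)
W1 -= lr * (delta_hid @ x.T)
```

Running the forward pass again after the update moves the output closer to the target, which is the whole point of the cycle.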
Backpropagation Algorithm
Backpropagation can be explained through the "Shoe Lace" analogy:
- Too little tension = not enough constraining, too loose (unsatisfactory model)
- Too much tension = too much constraint (overtraining), taking too much time (slow process), higher likelihood of breaking (non-convergence)
- Pulling more on one lace than the other = discomfort (bias)
Learning Rate
• Learning rate definition: the ratio of the weight's gradient that is subtracted from the weight
• Learning rate = trade-off:
  - Large ratios => fast training
  - Lower ratios => accurate training
• Question: How do you choose the learning rate?
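The trade-off can be seen on a toy objective. This hedged sketch minimizes f(w) = w^2 (gradient 2w) by repeated weight updates; the three rates are arbitrary choices, not recommendations from the talk.

```python
def descend(lr, steps=20, w=1.0):
    """Run `steps` weight updates on f(w) = w**2, whose gradient is 2w."""
    for _ in range(steps):
        w -= lr * 2 * w   # weight update: subtract learning rate * gradient
    return w

small = descend(0.01)   # too small: after 20 steps, still far from the minimum at 0
good  = descend(0.3)    # moderate: converges quickly toward 0
big   = descend(1.1)    # too large: each step overshoots, so w diverges
```

In practice the "right" rate depends on the loss surface, which is why the question above has no universal answer.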
Activation Function
• Backpropagation & Supervised Learning: backpropagation is used in a supervised context
• Backpropagation requires the activation function to be differentiable
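The differentiability requirement is concrete: the backward pass multiplies deltas by the activation's derivative. A small sketch with the sigmoid, whose derivative has the closed form s(z) * (1 - s(z)); a hard step function, by contrast, has a zero derivative almost everywhere, so no gradient would flow through it.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    # Closed-form derivative used during backpropagation;
    # it is defined everywhere and peaks at 0.25 when z = 0.
    s = sigmoid(z)
    return s * (1.0 - s)
```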
Vanishing Gradient
• What is a vanishing gradient? The case where the gradients propagated back through the network shrink toward 0, so some weights stop being updated
Lessons:
- Starting points for the weights matter (training can fall into a non-optimal minimum)
- Large architectures make it harder to control
- Expensive memory-wise (and useless)
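Why gradients vanish can be shown in a few lines: each sigmoid layer multiplies the backpropagated delta by a derivative of at most 0.25, so the gradient shrinks geometrically with depth. The 20-layer chain and the choice of z = 0 are illustrative assumptions (z = 0 is the sigmoid's best case).

```python
import math

def sigmoid_prime(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)   # at most 0.25, reached at z = 0

grad = 1.0
for layer in range(20):          # backpropagate through a 20-layer chain
    grad *= sigmoid_prime(0.0)   # even the best-case factor is 0.25

# grad is now 0.25**20, on the order of 1e-13:
# the earliest layers receive essentially no update signal
```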
Let’s Recap: What Is Hard/Tricky with DL?
• What decisions need to be made to build a DL model?

ARCHITECTURE
• Overall architecture (RNN, etc.)
• Number of layers
• Number of neurons

MODEL
• Learning rate
• Loss function
• Activation function
• Starting weights

DATA
• Number of inputs
• Number of outputs
• Amount of Data

• Conclusion
  - Architecture building is sketchy and empirical
  - Experimentation takes time and memory