TRANSCRIPT
The Art Of Backpropagation
and other Bedtime Deep Learning Stories
Jennifer Prendki, @WalmartLabs
Why this talk?
• Deep Learning can solve many problems
• Deep Learning is trendy
• Deep Learning is applied in many different industries
Everybody is using it, or wants to use it
• But many people use Deep Learning as a black box
• There is no consistent theory regarding architecture building
Context: Neural Nets, Forward & Backward Feeds
• Back to the basics: what are Artificial Neural Nets?
The combination of:
• a training method
• an optimization method
A 2-phase cycle:
• propagation
• weight update
Deep Learning Glossary
• Input: the first layer (what is fed to the algorithm, the initial data columns)
• Output: what we want to compute (can be more than one value)
• Hidden layers: the neurons for the intermediate steps
• Forward propagation: propagation of a training pattern's input through the neural network in order to generate the network's output value(s)
• Backward propagation: propagation of the output activations back through the neural net, using the training pattern's target, in order to generate the deltas
• Deltas: the differences between the targeted and actual output values of all output and hidden neurons
• Weight update: the process of multiplying the output delta and input activation to compute the gradient of the weight
• Learning rate: the ratio of the weight's gradient that is subtracted from the weight
Backpropagation Algorithm
• Propagation
  - Forward propagation of a training pattern's input through the neural network in order to generate the network's output value(s).
  - Backward propagation of the propagation's output activations through the neural network, using the training pattern's target, in order to generate the deltas.
• Weight update
  - The weight's output delta and input activation are multiplied to find the gradient of the weight.
  - The weight is updated according to the learning rate.
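The two-phase cycle above can be sketched in NumPy. This is a minimal illustration, not the speaker's code: the layer sizes, the sigmoid activation, the squared-error deltas, and the single training pattern are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy training pattern: 3 inputs -> 1 target output (assumed for illustration)
x = rng.normal(size=(3, 1))    # input column vector
t = np.array([[1.0]])          # training pattern target

# Starting weights (random initialization)
W1 = rng.normal(size=(4, 3))   # hidden layer: 4 neurons
W2 = rng.normal(size=(1, 4))   # output layer: 1 neuron

lr = 0.5                       # learning rate

# --- Forward propagation: generate the network's output value ---
h = sigmoid(W1 @ x)            # hidden activations
y = sigmoid(W2 @ h)            # output value

# --- Backward propagation: generate the deltas ---
delta_out = (y - t) * y * (1 - y)              # output delta
delta_hid = (W2.T @ delta_out) * h * (1 - h)   # hidden deltas

# --- Weight update: output delta x input activation = weight gradient,
# --- and the learning rate scales how much of it is subtracted ---
W2 -= lr * (delta_out @ h.T)
W1 -= lr * (delta_hid @ x.T)
```

Running the forward pass again after the update moves the output closer to the target, which is the whole point of the cycle.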
Backpropagation Algorithm
Backpropagation can be explained through the "Shoe Lace" analogy:
- Too little tension = not enough constraining, too loose (unsatisfactory model)
- Too much tension = too much constraint (overtraining), taking too much time (slow process), higher likelihood of breaking (non-convergence)
- Pulling more on one lace than the other = discomfort (bias)
Learning Rate
• Learning rate definition: the ratio of the weight's gradient that is subtracted from the weight
• Learning rate = trade-off:
  - Large ratios => fast training
  - Lower ratios => accurate training
• Question: How do you choose the learning rate?
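The trade-off can be seen on a toy objective. This hedged sketch minimizes f(w) = w^2 (gradient 2w) by repeated weight updates; the three rates are arbitrary choices, not recommendations from the talk.

```python
def descend(lr, steps=20, w=1.0):
    """Run `steps` weight updates on f(w) = w**2, whose gradient is 2w."""
    for _ in range(steps):
        w -= lr * 2 * w   # weight update: subtract learning rate * gradient
    return w

small = descend(0.01)   # too small: after 20 steps, still far from the minimum at 0
good  = descend(0.3)    # moderate: converges quickly toward 0
big   = descend(1.1)    # too large: each step overshoots, so w diverges
```

In practice the "right" rate depends on the loss surface, which is why the question above has no universal answer.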
Activation Function
• Backpropagation & Supervised Learning: backpropagation is used in a supervised context
• Backpropagation requires the activation function to be differentiable
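The differentiability requirement is concrete: the backward pass multiplies deltas by the activation's derivative. A small sketch with the sigmoid, whose derivative has the closed form s(z) * (1 - s(z)); a hard step function, by contrast, has a zero derivative almost everywhere, so no gradient would flow through it.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    # Closed-form derivative used during backpropagation;
    # it is defined everywhere and peaks at 0.25 when z = 0.
    s = sigmoid(z)
    return s * (1.0 - s)
```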
Vanishing Gradient
• What is a vanishing gradient? The case where the gradients propagated back through the network shrink toward 0, so some weights stop being updated
Lessons:
- Starting points for the weights matter (training can fall into a non-optimal minimum)
- Large architectures make it harder to control
- Expensive memory-wise (and useless)
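Why gradients vanish can be shown in a few lines: each sigmoid layer multiplies the backpropagated delta by a derivative of at most 0.25, so the gradient shrinks geometrically with depth. The 20-layer chain and the choice of z = 0 are illustrative assumptions (z = 0 is the sigmoid's best case).

```python
import math

def sigmoid_prime(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)   # at most 0.25, reached at z = 0

grad = 1.0
for layer in range(20):          # backpropagate through a 20-layer chain
    grad *= sigmoid_prime(0.0)   # even the best-case factor is 0.25

# grad is now 0.25**20, on the order of 1e-13:
# the earliest layers receive essentially no update signal
```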
Let’s Recap: What Is Hard/Tricky with DL?
• What decisions need to be made to build a DL model?

ARCHITECTURE
• Overall architecture (RNN, etc.)
• Number of layers
• Number of neurons

MODEL
• Learning rate
• Loss function
• Activation function
• Starting weights

DATA
• Number of inputs
• Number of outputs
• Amount of Data

• Conclusion
  - Architecture building is sketchy and empirical
  - Experimentation takes time and memory