Tutorial: Deep Learning Implementations and Frameworks
Seiya Tokui*, Kenta Oono*, Atsunori Kanemura+, Toshihiro Kamishima+
*Preferred Networks, Inc. (PFN), {tokui,oono}@preferred.jp
+National Institute of Advanced Industrial Science and Technology (AIST)
[email protected], [email protected] DLIF Tutorial @ PAKDD2016
Introduction
Atsunori Kanemura, AIST, Japan
Objective
• Get into deep learning research and practice
• 1) Learn the building blocks that are common to most deep learning frameworks
  – Review key technologies
• 2) Understand the differences between the various implementations
  – How specific DL frameworks differ
  – Useful for deciding which framework to start with
• Not a coding how-to (although coding examples will be given)
Target audience
• Want to use neural networks
• Want to model neural network architectures for practical problems
• Expected background:
  – Basics of computer science and numerical computation
  – General machine learning terminology (in particular around supervised learning)
  – Basic knowledge or practice of neural networks (recommended)
  – Basic knowledge of the Python programming language (recommended)
Overview
• 1st session (8:30 – 10:00)
  – Introduction (AK)
  – Basics of neural networks (AK)
  – Common design of neural network implementations (KO)
• 2nd session (10:30 – 12:30)
  – Differences of deep learning frameworks (ST)
  – Coding examples of frameworks (KO & ST)
  – Conclusion (ST)
Frameworks to be (and not to be) explained
• Explained in depth, with coding examples
  – Chainer – Python
  – Keras – Python
  – TensorFlow – Python
• Also compared
  – Torch.nn – Lua
  – Theano – Python
  – Caffe – C++ & Python & MATLAB
  – MXNet – many languages
  – autograd – Python & Lua
• Others, not explained
  – Cloud computing, MATLAB toolboxes, DL4J, H2O, CNTK
  – Wrappers: Lasagne, Blocks, skflow
  – TensorBoard, DIGITS (only mentioned by name)
Basics of Neural Networks
Atsunori Kanemura, AIST, Japan
Artificial neural networks
• Biologically inspired
  – A biological neuron is a nonlinear unit connected with synapses at the dendrites (input) and the axon (output)
• A building block for pattern recognition systems (and more)
Why neural networks?
• Superior performance
  – Image recognition
    • ImageNet Large Scale Visual Recognition Challenge (ILSVRC) – exceeds human performance
  – Playing games
    • AlphaGo – has defeated human experts
• Extended to other problems
  – Images and text
    • Show & Tell – generates text from images via intermediate representations ("embeddings")
  – Learning artist styles
  – Many others (translation, speech recognition, …)
The technical core of NNs
• Layered processing with
  – linear transformation (a.k.a. matrix multiplication, affine transformation)
  – + nonlinear operation (a.k.a. activation function)
• Adapts to data
Mathematical model for a neuron
• Compare the inner product of the input and the weights (parameters) with a threshold (a NumPy sketch follows the formula below)
  – Plasticity of the neuron = the change of the parameters w and b
[Figure: a single neuron with inputs x_1, …, x_D, weights w, threshold b, and output y; f is a nonlinear transform]

y = f\Big(\sum_{d=1}^{D} w_d x_d - b\Big) = f(\mathbf{w}^\top \mathbf{x} - b)
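A minimal NumPy sketch of this single-neuron model, assuming a logistic sigmoid as the nonlinear transform f (the slides leave f generic); the input values are arbitrary toy numbers:

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid, one common choice for the nonlinearity f."""
    return 1.0 / (1.0 + np.exp(-a))

def neuron(x, w, b):
    """Single neuron: y = f(w^T x - b)."""
    return sigmoid(np.dot(w, x) - b)

# Toy input with D = 3 features (values are arbitrary examples)
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.3, 0.8, -0.2])
b = 0.1
print(neuron(x, w, b))
```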
Generalized linear discriminant
• Generalized linear discriminant f(w^T x)
  – f(·): nonlinear transformation
  – ⇒ Logistic (classical), probit, etc. (a sketch contrasting the two links follows the formulas below)
f(\mathbf{w}^\top \mathbf{x}) \;\gtrless\; b\,?

y = f\Big(\sum_{d=1}^{D} w_d x_d - b\Big) = f(\mathbf{w}^\top \mathbf{x} - b), \qquad
y_n = \begin{cases} 1 & (x_n \text{ is positive}) \\ 0 & (x_n \text{ is negative}) \end{cases}
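For illustration only, a tiny sketch contrasting the two links named above, logistic and probit; it assumes SciPy is available for the Gaussian CDF:

```python
import numpy as np
from scipy.stats import norm

def logistic(a):
    """Classical logistic link."""
    return 1.0 / (1.0 + np.exp(-a))

def probit(a):
    """Probit link: standard Gaussian CDF."""
    return norm.cdf(a)

a = np.linspace(-3, 3, 7)
print(logistic(a))
print(probit(a))
```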
Learning with loss minimization
• Learn from many samples
• Binary output
• Define the loss function
• Minimize J to learn (estimate) the parameters (a NumPy sketch follows the formulas below)
\{x_n, y^*_n\}_{n=1}^{N}, \qquad
y^*_n = \begin{cases} 1 & (x_n \text{ is positive}) \\ 0 & (x_n \text{ is negative}) \end{cases}

J(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\big(f(\mathbf{w}^\top x_n) - y^*_n\big)^2 \quad \text{(squared error)}

\mathbf{w}^* = \arg\min_{\mathbf{w}} J(\mathbf{w})
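A minimal NumPy sketch of the squared-error loss J(w) for the 1-layer model, again assuming a logistic sigmoid for f and arbitrary toy data:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def loss(w, X, y_true):
    """Squared-error loss J(w) = 1/2 * sum_n (f(w^T x_n) - y*_n)^2.

    X has shape (N, D); y_true has shape (N,) with 0/1 labels.
    """
    pred = sigmoid(X @ w)
    return 0.5 * np.sum((pred - y_true) ** 2)

# Toy data: N = 4 samples, D = 2 features (values are arbitrary)
X = np.array([[1.0, 2.0], [0.5, -1.0], [-2.0, 0.3], [1.5, 1.0]])
y_true = np.array([1.0, 1.0, 0.0, 1.0])
w = np.zeros(2)
print(loss(w, X, y_true))
```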
Neural networks
• Multi-layered (a forward-pass sketch follows the formulas below)
• Minimize the loss to learn the parameters
※ f works element-wise

\mathbf{y}^{1} = f_1(W^{10}\mathbf{x}), \quad
\mathbf{y}^{2} = f_2(W^{21}\mathbf{y}^{1}), \quad
\mathbf{y}^{3} = f_3(W^{32}\mathbf{y}^{2}), \quad \ldots, \quad
\mathbf{y}^{L} = f_L(W^{L(L-1)}\mathbf{y}^{L-1})

J(\{W\}) = \frac{1}{2}\sum_{n=1}^{N}\big(\mathbf{y}^{L}(x_n) - y^*_n\big)^2
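A minimal sketch of the stacked forward pass, assuming a logistic sigmoid for every f_l and arbitrary layer sizes with randomly initialized weight matrices:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, weights):
    """Stacked forward pass: y^l = f_l(W^{l(l-1)} y^{l-1}), with y^0 = x."""
    y = x
    for W in weights:
        y = sigmoid(W @ y)
    return y

# Toy 3-layer network with layer sizes 4 -> 5 -> 3 -> 1 (arbitrary choices)
rng = np.random.RandomState(0)
weights = [rng.randn(5, 4), rng.randn(3, 5), rng.randn(1, 3)]
x = rng.randn(4)
print(forward(x, weights))
```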
Gradient descent
• The gradient of the loss for the 1-layer model is
• The update rule (a NumPy sketch of the update loop follows the formulas below)
\nabla_{\mathbf{w}} J(\mathbf{w})
= \frac{1}{2}\sum_{n=1}^{N} \nabla_{\mathbf{w}}\big(f(\mathbf{w}^\top x_n) - y^*_n\big)^2
= \sum_{n=1}^{N}\big(f(\mathbf{w}^\top x_n) - y^*_n\big)\,\nabla_{\mathbf{w}} f(\mathbf{w}^\top x_n)
= \sum_{n=1}^{N}\big(f(\mathbf{w}^\top x_n) - y^*_n\big)\,f(\mathbf{w}^\top x_n)\big(1 - f(\mathbf{w}^\top x_n)\big)\,x_n

\mathbf{w} \leftarrow \mathbf{w} - r\,\nabla_{\mathbf{w}} J(\mathbf{w})
= \mathbf{w} - r\sum_{n=1}^{N} h(x_n, \mathbf{w})\,x_n \qquad \text{($r$ is a constant learning rate)}

h(x_n, \mathbf{w}) \stackrel{\mathrm{def}}{=} \big(f(\mathbf{w}^\top x_n) - y^*_n\big)\,f(\mathbf{w}^\top x_n)\big(1 - f(\mathbf{w}^\top x_n)\big)
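A minimal NumPy sketch of the resulting full-batch gradient-descent loop; it assumes the logistic sigmoid (so that f' = f(1 - f), as used in the derivation above) and an arbitrary learning rate and toy data:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def grad(w, X, y_true):
    """Gradient of J(w) = 1/2 * sum_n (f(w^T x_n) - y*_n)^2 for logistic f."""
    pred = sigmoid(X @ w)                      # f(w^T x_n) for all n
    h = (pred - y_true) * pred * (1.0 - pred)  # h(x_n, w)
    return X.T @ h                             # sum_n h(x_n, w) x_n

# Toy data and full-batch gradient descent (learning rate r is arbitrary)
rng = np.random.RandomState(0)
X = rng.randn(100, 3)
y_true = (X[:, 0] + X[:, 1] > 0).astype(float)
w, r = np.zeros(3), 0.1
for step in range(200):
    w -= r * grad(w, X, y_true)
print(w)
```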
Backprop
• Use the chain rule to derive the gradient
• E.g., the 2-layer case (a NumPy sketch of the manual gradient computation follows the formulas below)
  – ⇒ Calculate the gradient recursively from the top layer down to the bottom layer
• Cf. gradient vanishing, ReLU
\mathbf{y}^{1}_{n} = f(W^{10} x_n), \qquad y^{2}_{n} = f(\mathbf{w}^{21} \cdot \mathbf{y}^{1}_{n})

J(W^{10}, \mathbf{w}^{21}) = \frac{1}{2}\sum_{n}\big(y^{2}_{n} - y^*_n\big)^2

\frac{\partial J}{\partial W^{10}_{kl}}
= \sum_{n,i} \frac{\partial J}{\partial y^{1}_{ni}}\,\frac{\partial y^{1}_{ni}}{\partial W^{10}_{kl}}
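A minimal NumPy sketch of the 2-layer gradient, applying the chain rule by hand; the logistic sigmoid for f and the toy data are assumptions for illustration:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_2layer(W10, w21, X, y_true):
    """Gradients of J = 1/2 * sum_n (y2_n - y*_n)^2 for a 2-layer net."""
    y1 = sigmoid(X @ W10.T)                         # (N, H): hidden activations
    y2 = sigmoid(y1 @ w21)                          # (N,): scalar outputs
    delta2 = (y2 - y_true) * y2 * (1 - y2)          # dJ/d(pre-activation of layer 2)
    grad_w21 = y1.T @ delta2                        # dJ/dw21
    delta1 = np.outer(delta2, w21) * y1 * (1 - y1)  # dJ/d(pre-activation of layer 1)
    grad_W10 = delta1.T @ X                         # dJ/dW10
    return grad_W10, grad_w21

rng = np.random.RandomState(0)
X = rng.randn(10, 4)
y_true = rng.randint(0, 2, size=10).astype(float)
W10, w21 = rng.randn(3, 4), rng.randn(3)
gW10, gw21 = backprop_2layer(W10, w21, X, y_true)
print(gW10.shape, gw21.shape)
```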
Automatic differentiation
• The math for backprop is obvious (but tedious) once the NN architecture has been defined
• Gradients can be calculated automatically after defining the NN model
• This is called automatic differentiation (a general concept that makes use of the chain rule); a toy sketch follows below
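As a toy illustration of the idea (not any particular framework's implementation), a scalar reverse-mode differentiation sketch: each operation records its local derivatives, and one backward pass applies the chain rule automatically:

```python
import math

class Var:
    """Minimal scalar reverse-mode AD node: a value plus links to its inputs."""
    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        self._parents = parents  # list of (parent_node, local_gradient)

    def backward(self, seed=1.0):
        self.grad += seed
        for parent, local_grad in self._parents:
            parent.backward(seed * local_grad)

def add(a, b):
    return Var(a.value + b.value, [(a, 1.0), (b, 1.0)])

def mul(a, b):
    return Var(a.value * b.value, [(a, b.value), (b, a.value)])

def sigmoid(a):
    s = 1.0 / (1.0 + math.exp(-a.value))
    return Var(s, [(a, s * (1.0 - s))])

# y = sigmoid(w * x + b); dy/dw and dy/db follow from one backward pass
w, x, b = Var(0.5), Var(2.0), Var(-1.0)
y = sigmoid(add(mul(w, x), b))
y.backward()
print(y.value, w.grad, b.grad)
```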
Parameter update
• Gradient Descent (GD)
• Stochastic Gradient Descent (SGD)
  – Take several samples (say, 128) from the dataset (a mini-batch) and estimate the gradient from them (a sketch follows the formulas below)
  – Theoretically motivated as the Robbins–Monro algorithm
• From SGD to general gradient-based algorithms
  – Adam, AdaGrad, etc.
  – Use momentum and other techniques
\mathbf{w} \leftarrow \mathbf{w} - r\,\nabla_{\mathbf{w}} J(\mathbf{w})
= \mathbf{w} - r\sum_{n=1}^{N} h(x_n, \mathbf{w})\,x_n

h(x_n, \mathbf{w}) \stackrel{\mathrm{def}}{=} \big(f(\mathbf{w}^\top x_n) - y^*_n\big)\,f(\mathbf{w}^\top x_n)\big(1 - f(\mathbf{w}^\top x_n)\big)
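A minimal NumPy sketch of mini-batch SGD on the same 1-layer model; the batch size, learning rate, and toy data are arbitrary choices:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def grad(w, X, y):
    """Gradient of the squared-error loss for the 1-layer logistic model."""
    p = sigmoid(X @ w)
    return X.T @ ((p - y) * p * (1.0 - p))

def sgd(X, y, r=0.1, batch_size=128, epochs=10, seed=0):
    """Mini-batch SGD: estimate the gradient from a random subset at each step."""
    rng = np.random.RandomState(seed)
    n, d = X.shape
    w = np.zeros(d)
    for epoch in range(epochs):
        order = rng.permutation(n)               # reshuffle once per epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            w -= r * grad(w, X[idx], y[idx])     # gradient on the mini-batch only
    return w

rng = np.random.RandomState(1)
X = rng.randn(1000, 3)
y = (X[:, 0] - X[:, 2] > 0).astype(float)
print(sgd(X, y))
```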
Overfitting and generalization error
• The goal of learning is to decrease the generalization error, which is the error on previously unseen data
• Having a low error on the data at hand is not enough (and can even be harmful)
  – We can achieve 0% error by memorizing all the examples in the training data
  – Complicated models (i.e., NNs with many parameters and layers) can achieve this (if the learning algorithm is clever enough)
Training procedure
• Avoid overfitting
• Split the data into two parts (a sketch of the split follows below)
  – Training dataset
    • We optimize the parameters using this training dataset
  – Validation dataset
    • We evaluate the performance of the learned NN on this validation dataset
• Optional: test error
  – If you want to estimate the generalization error, use a three-way split of the data and use the last part, the test dataset, to measure the generalization error
[Figure: the available data is split into a training part and a validation part]
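A minimal NumPy sketch of such a random two-way split; the 80/20 ratio is an arbitrary example:

```python
import numpy as np

def train_val_split(X, y, val_ratio=0.2, seed=0):
    """Randomly split (X, y) into training and validation subsets."""
    rng = np.random.RandomState(seed)
    n = X.shape[0]
    order = rng.permutation(n)
    n_val = int(n * val_ratio)
    val_idx, train_idx = order[:n_val], order[n_val:]
    return (X[train_idx], y[train_idx]), (X[val_idx], y[val_idx])

X = np.arange(20).reshape(10, 2).astype(float)
y = np.arange(10) % 2
(train_X, train_y), (val_X, val_y) = train_val_split(X, y)
print(train_X.shape, val_X.shape)
```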
Extra topics implemented by most of the frameworks
• Weight initialization
  – Random
  – Pretraining
  – Transfer from another trained network
• Techniques for avoiding overfitting (a dropout sketch follows below)
  – Dropout
  – Batch normalization
  – ResNet
• Convolution
• Visualization
  – Deconvolution
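As one illustration from the list above, a minimal sketch of (inverted) dropout at training time with an arbitrary drop probability; actual frameworks ship their own implementations of this and the other techniques:

```python
import numpy as np

def dropout(y, drop_prob=0.5, rng=np.random):
    """Inverted dropout: randomly zero units and rescale the survivors.

    Applied only at training time; at test time the layer output is used as-is.
    """
    mask = (rng.uniform(size=y.shape) >= drop_prob).astype(y.dtype)
    return y * mask / (1.0 - drop_prob)

y = np.ones(10)
print(dropout(y, drop_prob=0.5))
```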
Summary of this part
• Neural networks are computational models that stack neurons, i.e., non-linear computational units
• The gradients of the loss w.r.t. the parameters are calculated recursively from the top layer to the bottom layer by backprop
• Care must be taken to avoid overfitting, by following validation procedures