
Page 1: PAKDD2016 Tutorial DLIF: Introduction and Basics

Tutorial: Deep Learning Implementations and Frameworks

Seiya Tokui*, Kenta Oono*, Atsunori Kanemura+, Toshihiro Kamishima+

*Preferred Networks, Inc. (PFN): {tokui,oono}@preferred.jp

+National Institute of Advanced Industrial Science and Technology (AIST): [email protected], [email protected]

2016-04-19, DLIF Tutorial @ PAKDD2016

Page 2: PAKDD2016 Tutorial DLIF: Introduction and Basics

Introduction

Atsunori Kanemura, AIST, Japan


Page 3: PAKDD2016 Tutorial DLIF: Introduction and Basics

Objective

•  Get into deep learning research and practice

•  1) Learn the building blocks that are common to most deep learning frameworks
   –  Review the key technologies

•  2) Understand the differences between the various implementations
   –  How specific DL frameworks differ
   –  Useful for deciding which framework to start with

•  Not about coding know-how (although coding examples will be given)


Page 4: PAKDD2016 Tutorial DLIF: Introduction and Basics

Target audience

•  Want to use neural networks
•  Want to model neural network architectures for practical problems
•  Expected background:
   –  Basics of computer science and numerical computation
   –  General machine learning terminology (in particular around supervised learning)
   –  Basic knowledge of or practice with neural networks (recommended)
   –  Basic knowledge of the Python programming language (recommended)


Page 5: PAKDD2016 Tutorial DLIF: Introduction and Basics

Overview

•  1st session (8:30 – 10:00)
   –  Introduction (AK)
   –  Basics of neural networks (AK)
   –  Common design of neural network implementations (KO)

•  2nd session (10:30 – 12:30)
   –  Differences between deep learning frameworks (ST)
   –  Coding examples of frameworks (KO & ST)
   –  Conclusion (ST)


Page 6: PAKDD2016 Tutorial DLIF: Introduction and Basics

Frameworks to be (and not to be) explained

•  Explained in depth, with coding examples
   –  Chainer (Python)
   –  Keras (Python)
   –  TensorFlow (Python)

•  Also compared
   –  Torch.nn (Lua)
   –  Theano (Python)
   –  Caffe (C++, Python, MATLAB)
   –  MXNet (many languages)
   –  autograd (Python, Lua)

•  Others, not explained
   –  Cloud computing, MATLAB toolboxes, DL4J, H2O, CNTK
   –  Wrappers: Lasagne, Blocks, skflow
   –  TensorBoard, DIGITS (mentioned by name only)


Page 7: PAKDD2016 Tutorial DLIF: Introduction and Basics

Basics of Neural Networks

Atsunori Kanemura, AIST, Japan


Page 8: PAKDD2016 Tutorial DLIF: Introduction and Basics

Artificial neural networks

•  Biologically inspired
   –  A biological neuron is a nonlinear unit connected through synapses at the dendrites (input) and the axon (output)

•  A building block for pattern recognition systems (and more)


Page 9: PAKDD2016 Tutorial DLIF: Introduction and Basics

Why neural networks?

•  Superior performance
   –  Image recognition
      •  ImageNet Large Scale Visual Recognition Challenge (ILSVRC): exceeds human-level performance
   –  Playing games
      •  AlphaGo: has defeated human experts

•  Extended to other problems
   –  Images and text
      •  Show & Tell: generates text from images via intermediate representations ("embeddings")
   –  Learning artistic styles
   –  Many others (translation, speech recognition, …)


Page 10: PAKDD2016 Tutorial DLIF: Introduction and Basics

Technical internals of NNs

•  Layered processing: a linear transformation (i.e., matrix multiplication or affine transformation) followed by a nonlinear operation (the activation function)

•  Adapts to data


Page 11: PAKDD2016 Tutorial DLIF: Introduction and Basics

Mathematical model of a neuron

•  Compare the product of the input and the weights (parameters) with a threshold
   –  Plasticity of the neuron = change of the parameters (w and b)

[Figure: a neuron with inputs x1, …, xD, weights w, threshold b, a summation unit Σ, and a nonlinear transform f producing the output y]

$$ y = f\Big(\sum_{d=1}^{D} w_d x_d - b\Big) = f(\mathbf{w}^\top \mathbf{x} - b) $$

(f: nonlinear transform)
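To make the formula concrete, here is a minimal NumPy sketch of a single neuron; the logistic (sigmoid) choice of f and the toy values of x, w, and b are illustrative assumptions, not something specified on the slide.

```python
import numpy as np

def f(a):
    # Logistic (sigmoid) nonlinearity used as the activation f
    return 1.0 / (1.0 + np.exp(-a))

def neuron(x, w, b):
    # y = f(w^T x - b): weighted sum of the inputs compared against the threshold b
    return f(np.dot(w, x) - b)

x = np.array([0.5, -1.2, 3.0])   # D = 3 inputs
w = np.array([0.8, 0.1, -0.4])   # weights (parameters)
b = 0.2                          # threshold
print(neuron(x, w, b))
```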

Page 12: PAKDD2016 Tutorial DLIF: Introduction and Basics

Generalized linear discriminant

•  Generalized linear discriminant
   –  f(·): nonlinear transformation
   –  ⇒ Logistic (classical), probit, etc.


$$ y = f\Big(\sum_{d=1}^{D} w_d x_d - b\Big) = f(\mathbf{w}^\top \mathbf{x} - b), \qquad
y_n = \begin{cases} 1 & (x_n \text{ is positive}) \\ 0 & (x_n \text{ is negative}) \end{cases} $$
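A small sketch of how the logistic choice of f turns the unit into a binary discriminant; the 0.5 decision threshold, the helper name predict_label, and the toy weights are illustrative assumptions.

```python
import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

def predict_label(x, w, b):
    # f(w^T x - b) lies in (0, 1); threshold it at 0.5 to get the binary label y_n
    return 1 if logistic(np.dot(w, x) - b) >= 0.5 else 0

w = np.array([1.0, -2.0])
b = 0.0
print(predict_label(np.array([3.0, 1.0]), w, b))   # positive side of the hyperplane -> 1
print(predict_label(np.array([-1.0, 2.0]), w, b))  # negative side -> 0
```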

Page 13: PAKDD2016 Tutorial DLIF: Introduction and Basics

Learning with loss minimization

•  Learn from many samples
•  Binary output
•  Define the loss function
•  Minimize J to learn (estimate) the parameters


Training samples: $\{x_n, y^*_n\}_{n=1}^{N}$, with binary targets
$$ y^*_n = \begin{cases} 1 & (x_n \text{ is positive}) \\ 0 & (x_n \text{ is negative}) \end{cases} $$

Loss function (squared error):
$$ J(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \big(f(\mathbf{w}^\top \mathbf{x}_n) - y^*_n\big)^2 $$

Parameter estimation:
$$ \mathbf{w}^* = \operatorname*{argmin}_{\mathbf{w}} J(\mathbf{w}) $$
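A minimal NumPy sketch of the squared-error loss above over a tiny synthetic dataset; the logistic f, the omission of the bias b, and the toy data are assumptions made for brevity.

```python
import numpy as np

def f(a):
    return 1.0 / (1.0 + np.exp(-a))

def loss(w, X, y_star):
    # J(w) = 1/2 * sum_n (f(w^T x_n) - y*_n)^2
    pred = f(X.dot(w))                 # predictions, shape (N,)
    return 0.5 * np.sum((pred - y_star) ** 2)

X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.5, 0.5]])  # N = 3 samples, D = 2
y_star = np.array([1.0, 1.0, 0.0])                    # binary targets
w = np.zeros(2)
print(loss(w, X, y_star))
```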

Page 14: PAKDD2016 Tutorial DLIF: Introduction and Basics

Neural networks

•  Multi-layered
•  Minimize the loss to learn the parameters


$$ \mathbf{y}^1 = f_1(W^{10}\mathbf{x}), \quad
\mathbf{y}^2 = f_2(W^{21}\mathbf{y}^1), \quad
\mathbf{y}^3 = f_3(W^{32}\mathbf{y}^2), \quad \ldots, \quad
\mathbf{y}^L = f_L(W^{L(L-1)}\mathbf{y}^{L-1}) $$

※ f works element-wise

$$ J(\{W\}) = \frac{1}{2}\sum_{n=1}^{N} \big(\mathbf{y}^L(\mathbf{x}_n) - y^*_n\big)^2 $$
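A minimal NumPy sketch of the layered forward computation y^l = f_l(W^{l(l-1)} y^{l-1}); the layer sizes, the shared sigmoid activation, and the random weights are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    # Applied element-wise, as noted above
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.RandomState(0)
# Weight matrices W^{10}, W^{21}, W^{32} for layer sizes 4 -> 5 -> 3 -> 1
Ws = [rng.randn(5, 4), rng.randn(3, 5), rng.randn(1, 3)]

def forward(x, Ws):
    y = x
    for W in Ws:
        y = sigmoid(W.dot(y))  # y^l = f_l(W^{l(l-1)} y^{l-1})
    return y

x = rng.randn(4)
print(forward(x, Ws))
```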

Page 15: PAKDD2016 Tutorial DLIF: Introduction and Basics

Gradient descent

•  The gradient of the loss for the 1-layer model (with the logistic f, for which f′ = f(1 − f)) is:

•  The update rule:


$$ \nabla_{\mathbf{w}} J(\mathbf{w})
 = \frac{1}{2}\sum_{n=1}^{N} \nabla_{\mathbf{w}} \big(f(\mathbf{w}^\top\mathbf{x}_n) - y^*_n\big)^2
 = \sum_{n=1}^{N} \big(f(\mathbf{w}^\top\mathbf{x}_n) - y^*_n\big)\,\nabla_{\mathbf{w}} f(\mathbf{w}^\top\mathbf{x}_n)
 = \sum_{n=1}^{N} \big(f(\mathbf{w}^\top\mathbf{x}_n) - y^*_n\big)\,f(\mathbf{w}^\top\mathbf{x}_n)\big(1 - f(\mathbf{w}^\top\mathbf{x}_n)\big)\,\mathbf{x}_n $$

Update rule (r is a constant learning rate):
$$ \mathbf{w} \leftarrow \mathbf{w} - r\,\nabla_{\mathbf{w}} J(\mathbf{w})
 = \mathbf{w} - r\sum_{n=1}^{N} h(\mathbf{x}_n, \mathbf{w})\,\mathbf{x}_n, \qquad
 h(\mathbf{x}_n, \mathbf{w}) \overset{\mathrm{def}}{=} \big(f(\mathbf{w}^\top\mathbf{x}_n) - y^*_n\big)\,f(\mathbf{w}^\top\mathbf{x}_n)\big(1 - f(\mathbf{w}^\top\mathbf{x}_n)\big) $$
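A minimal NumPy sketch of the full-batch update rule above for the 1-layer logistic model; the learning rate, iteration count, and toy data are arbitrary assumptions.

```python
import numpy as np

def f(a):
    return 1.0 / (1.0 + np.exp(-a))

def grad_J(w, X, y_star):
    # sum_n (f(w^T x_n) - y*_n) * f(w^T x_n) * (1 - f(w^T x_n)) * x_n
    p = f(X.dot(w))
    h = (p - y_star) * p * (1.0 - p)
    return X.T.dot(h)

X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.5, 0.5], [-2.0, -1.0]])
y_star = np.array([1.0, 1.0, 0.0, 0.0])
w = np.zeros(2)
r = 0.5                                   # constant learning rate
for _ in range(100):
    w = w - r * grad_J(w, X, y_star)      # w <- w - r * grad J(w)
print(w)
```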

Page 16: PAKDD2016 Tutorial DLIF: Introduction and Basics

Backprop

•  Use the chain rule to derive the gradient
•  E.g., the 2-layer case
   –  ⇒ Calculate the gradient recursively from the top layer down to the bottom layer
•  Cf. vanishing gradients, ReLU

2-layer model:
$$ \mathbf{y}^1_n = f(W^{10}\mathbf{x}_n), \qquad y^2_n = f(\mathbf{w}^{21} \cdot \mathbf{y}^1_n) $$

Loss:
$$ J(W^{10}, \mathbf{w}^{21}) = \frac{1}{2}\sum_{n} \big(y^2_n - y^*_n\big)^2 $$

Chain rule for the bottom-layer weights:
$$ \frac{\partial J}{\partial W^{10}_{kl}} = \sum_{n,i} \frac{\partial J}{\partial y^1_{ni}} \, \frac{\partial y^1_{ni}}{\partial W^{10}_{kl}} $$
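A minimal NumPy sketch of the 2-layer chain rule above, propagating the error from the top layer down to obtain dJ/dW10; the sigmoid activation, the tiny shapes, and the random data are illustrative assumptions.

```python
import numpy as np

def f(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.RandomState(0)
N, D, H = 4, 3, 5                 # samples, input dim, hidden units
X = rng.randn(N, D)
y_star = rng.randint(0, 2, size=N).astype(float)
W10 = rng.randn(H, D)
w21 = rng.randn(H)

# Forward pass
y1 = f(X.dot(W10.T))              # (N, H): y1_n = f(W10 x_n)
y2 = f(y1.dot(w21))               # (N,):   y2_n = f(w21 . y1_n)

# Backward pass (chain rule, top layer first)
delta2 = (y2 - y_star) * y2 * (1 - y2)             # error at the top pre-activation
grad_w21 = y1.T.dot(delta2)                        # dJ/dw21
delta1 = (delta2[:, None] * w21) * y1 * (1 - y1)   # error at the bottom pre-activation
grad_W10 = delta1.T.dot(X)                         # dJ/dW10, shape (H, D)
print(grad_W10)
```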

Page 17: PAKDD2016 Tutorial DLIF: Introduction and Basics

Automatic Differentiation

•  The math for backprop is straightforward (but tedious) once the NN architecture has been defined

•  The gradients can be computed automatically after the NN model is defined

•  This is called automatic differentiation (which is a general concept that makes use of the chain rule)
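As a concrete illustration, here is a small sketch using Chainer (one of the frameworks covered later). The model and values are arbitrary assumptions; the point is only that calling backward() fills in the gradients via the chain rule, with no hand-derived math. Treat it as a sketch of the idea rather than the tutorial's own code.

```python
import numpy as np
from chainer import Variable
import chainer.functions as F

# A single logistic unit written with Chainer Variables
x = Variable(np.array([0.5, -1.2, 3.0], dtype=np.float32))  # input
w = Variable(np.array([0.8, 0.1, -0.4], dtype=np.float32))  # parameters
t = Variable(np.array(1.0, dtype=np.float32))                # target

y = F.sigmoid(F.sum(w * x))          # forward pass: y = f(w^T x)
loss = 0.5 * (y - t) * (y - t)       # squared-error loss
loss.backward()                      # automatic differentiation via the chain rule
print(w.grad)                        # dJ/dw, no hand-derived gradient needed
```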


Page 18: PAKDD2016 Tutorial DLIF: Introduction and Basics

Parameter update

•  Gradient Descent (GD)

•  Stochastic Gradient Descent (SGD)
   –  Take several samples (say, 128) from the dataset (a mini-batch) and estimate the gradient from them (see the sketch below)
   –  Theoretically motivated as the Robbins-Monro algorithm

•  From SGD to general gradient-based algorithms
   –  Adam, AdaGrad, etc.
   –  Use momentum and other techniques


GD update (repeated from the previous slide):
$$ \mathbf{w} \leftarrow \mathbf{w} - r\,\nabla_{\mathbf{w}} J(\mathbf{w})
 = \mathbf{w} - r\sum_{n=1}^{N} h(\mathbf{x}_n, \mathbf{w})\,\mathbf{x}_n, \qquad
 h(\mathbf{x}_n, \mathbf{w}) \overset{\mathrm{def}}{=} \big(f(\mathbf{w}^\top\mathbf{x}_n) - y^*_n\big)\,f(\mathbf{w}^\top\mathbf{x}_n)\big(1 - f(\mathbf{w}^\top\mathbf{x}_n)\big) $$
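A minimal NumPy sketch of the mini-batch variant: at each step a random subset of the data is used to estimate the gradient. The batch size of 2 (instead of a typical 128), the toy data, and the learning rate are illustrative assumptions.

```python
import numpy as np

def f(a):
    return 1.0 / (1.0 + np.exp(-a))

def grad_J(w, X, y_star):
    # Same per-sample gradient as before, summed over the given (mini-)batch
    p = f(X.dot(w))
    h = (p - y_star) * p * (1.0 - p)
    return X.T.dot(h)

rng = np.random.RandomState(0)
X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.5, 0.5], [-2.0, -1.0]])
y_star = np.array([1.0, 1.0, 0.0, 0.0])
w = np.zeros(2)
r = 0.5
batch_size = 2
for _ in range(200):
    idx = rng.choice(len(X), size=batch_size, replace=False)  # draw a mini-batch
    w = w - r * grad_J(w, X[idx], y_star[idx])                # SGD update
print(w)
```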

Page 19: PAKDD2016 Tutorial DLIF: Introduction and Basics

Overfitting and generalization error

•  The goal of learning is to decrease the generalization error, which is the error on previously unseen data

•  Having a low error on the data at hand is not enough (it can even be harmful)
   –  We can achieve 0% error simply by memorizing all the examples in the training data
   –  Complicated models (i.e., NNs with many parameters and layers) can achieve this (if the learning algorithm is clever enough)


Page 20: PAKDD2016 Tutorial DLIF: Introduction and Basics

Training procedure

•  Avoid overfitting
•  Split the data into two parts (see the sketch below)
   –  Training dataset
      •  We optimize the parameters using this training dataset
   –  Validation dataset
      •  We evaluate the performance of the learned NN on this validation dataset

•  Optional: test error
   –  If you want to estimate the generalization error, split the data three ways and use the last part, the test dataset, to measure the generalization error


[Figure: the available data is split into a training part and a validation part]
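A minimal NumPy sketch of the two-way split: shuffle the available data, then hold out a fraction for validation. The 80/20 ratio and the random synthetic data are illustrative assumptions.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(100, 5)           # available data: 100 samples, 5 features
y = rng.randint(0, 2, size=100)

perm = rng.permutation(len(X))  # shuffle before splitting
n_train = int(0.8 * len(X))     # e.g. 80% training / 20% validation
train_idx, valid_idx = perm[:n_train], perm[n_train:]

X_train, y_train = X[train_idx], y[train_idx]   # used to optimize the parameters
X_valid, y_valid = X[valid_idx], y[valid_idx]   # used only to evaluate the learned NN
print(X_train.shape, X_valid.shape)
```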

Page 21: PAKDD2016 Tutorial DLIF: Introduction and Basics

Extra topics implemented by most frameworks

•  Weight initialization
   –  Random
   –  Pretraining
   –  Transfer from another trained network

•  Techniques for avoiding overfitting
   –  Dropout
   –  Batch normalization
   –  ResNet

•  Convolution

•  Visualization
   –  Deconvolution

Page 22: PAKDD2016 Tutorial DLIF: Introduction and Basics

Summary of this part

•  Neural networks are computational models that stack neurons, i.e., non-linear computational units

•  The gradients of the loss w.r.t. the parameters are calculated recursively from the top layer to the bottom layer by backprop

•  Care must be taken to avoid overfitting by following validation procedures
