Introduction to Chainer: A Flexible Framework for Deep Learning 20150618 PFI/PFN Weekly Seminar Seiya Tokui (Preferred Networks)

Upload: seiya-tokui

Post on 29-Jul-2015





Page 1: Introduction to Chainer: A Flexible Framework for Deep Learning

Introduction to Chainer: A Flexible Framework for Deep Learning

2015-06-18 PFI/PFN Weekly Seminar
Seiya Tokui (Preferred Networks)

• Seiya Tokui, @beam2d (Twitter, GitHub)
• Researcher at Preferred Networks
• Main focus: machine learning
  – Learning to Hash (master's degree)
  – Deep Learning, Representation Learning (current focus)


A Powerful, Flexible, and Intuitive Framework of Neural Networks

Today I will introduce:

• The features of Chainer
• How to use Chainer
• Some planned features
• (Slides in English, talk in Japanese)

Chainer: The Concept


Chainer is a framework for neural networks

• Official site:
• Repository:
• Provided as a Python library (PyPI: chainer)
• Main features
  – Powerful: supports CUDA and multi-GPU computation
  – Flexible: supports almost arbitrary architectures
  – Intuitive: forward prop can be written as regular Python code

Elements of a neural network framework

• Multi-dimensional array implementations
• Layer implementations
  – Called by various names (layers, modules, blocks, primitives, etc.)
  – The smallest units of automatic differentiation
  – Contain forward and backward implementations
• Optimizer implementations
• Other components (data loading scheme, training loop, etc.)
  – These are also very important, though Chainer does not currently provide abstractions for them (future work)


Forward prop / Backprop

• Forward prop is how we want to process the input data
• Backprop computes the gradient of the forward computation (the loss) with respect to the learnable parameters
• Given the backward procedures of all layers, backprop can be written as their combination (a.k.a. reverse-mode automatic differentiation)

[Figure: input → hidden → output; the output and the ground truth feed the loss function]
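To make "backprop as a combination of layer backward procedures" concrete, here is a minimal sketch in plain Python. The scalar network, the function names `forward` and `backward`, and all numeric values are assumptions chosen for illustration; none of this is Chainer code.

```python
import math


def forward(x, t, w1, w2):
    """Forward prop: input -> hidden -> output -> loss.

    Returns the loss and the intermediate values needed by backprop.
    """
    a = w1 * x            # first linear layer (scalar weight)
    h = math.tanh(a)      # hidden activation
    y = w2 * h            # second linear layer
    loss = (y - t) ** 2   # squared error against the ground truth
    return loss, (x, h, y)


def backward(t, w2, cache):
    """Backprop: apply each layer's backward procedure in reverse order."""
    x, h, y = cache
    gy = 2.0 * (y - t)        # d loss / d y
    gw2 = gy * h              # d loss / d w2
    gh = gy * w2              # d loss / d h
    ga = gh * (1.0 - h * h)   # tanh'(a) = 1 - tanh(a)^2
    gw1 = ga * x              # d loss / d w1
    return gw1, gw2


# Hypothetical input, target and parameters.
x, t, w1, w2 = 0.5, 1.0, 0.8, -1.2
loss, cache = forward(x, t, w1, w2)
gw1, gw2 = backward(t, w2, cache)
```

Each line of `backward` is the backward procedure of one forward line, visited in reverse order. That is all reverse-mode automatic differentiation does; a framework only automates the bookkeeping.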



Backprop Implementation Paradigm (1)

Define-and-Run

• First, a computational graph is constructed. Then it is repeatedly fed with minibatches to do the forward/backward computation
• The computational graph can be seen as a program, and the forward/backward computation is done by its interpreter
    ◆ Caffe: the program is written in Prototxt
    ◆ Torch: the program is constructed by Lua scripts
    ◆ Theano-based frameworks: the program is constructed by Python scripts
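The "graph as a program, framework as its interpreter" view can be sketched in a few lines of plain Python. The graph format, the op names and the helpers `run_forward` and `run_backward` are hypothetical and do not correspond to the API of Caffe, Torch or Theano.

```python
# The "program": a fixed list of (output, op, inputs), built before any data is seen.
GRAPH = [
    ("a", "mul", ("w", "x")),
    ("y", "add", ("a", "b")),
    ("loss", "square", ("y",)),
]


def run_forward(graph, values):
    """Interpret the graph on one set of input values."""
    values = dict(values)
    for out, op, ins in graph:
        if op == "mul":
            values[out] = values[ins[0]] * values[ins[1]]
        elif op == "add":
            values[out] = values[ins[0]] + values[ins[1]]
        elif op == "square":
            values[out] = values[ins[0]] ** 2
        else:
            raise ValueError("unknown op: %s" % op)
    return values


def run_backward(graph, values, output="loss"):
    """Interpret the same graph in reverse, accumulating gradients."""
    grads = {output: 1.0}
    for out, op, ins in reversed(graph):
        g = grads.get(out, 0.0)
        if op == "mul":
            p, q = ins
            grads[p] = grads.get(p, 0.0) + g * values[q]
            grads[q] = grads.get(q, 0.0) + g * values[p]
        elif op == "add":
            for name in ins:
                grads[name] = grads.get(name, 0.0) + g
        elif op == "square":
            (p,) = ins
            grads[p] = grads.get(p, 0.0) + g * 2.0 * values[p]
    return grads


# The same fixed graph is fed with one input after another.
params = {"w": 2.0, "b": -1.0}
for x in [1.0, 3.0]:
    values = run_forward(GRAPH, dict(params, x=x))
    grads = run_backward(GRAPH, values)
```

Note that `GRAPH` cannot change between inputs, and any control flow would have to become a new op that the interpreter understands.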

Backprop Implementation Paradigm (2)

Define-and-Run (cont.)

• Pros
  – (Almost) no need for memory management
  – The computational graph can be implicitly optimized (cf. Theano)
• Cons
  – The program is fixed within the training loop
  – The interpreter must be capable of defining various forward computations, including control-flow statements like if and for
    ◆ Theano has dedicated functions for them (ifelse and scan), which are unintuitive and not Pythonic
  – Network definition is hard to debug, since an error occurs in the forward computation, far from the network definition

Backprop Implementation Paradigm (3)

Define-by-Run

• The forward computation is written as regular program code with special variables and operators. Executing it performs the forward computation and constructs the graph at the same time (just by storing the order of operations)
• The graph is used for the backward computation
• This paradigm enables us to use arbitrary control-flow statements in the forward computation
  – No need for a mini-language and its interpreter
• It also makes the forward computation intuitive and easy to debug
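A minimal sketch of the Define-by-Run idea, assuming a toy scalar `Var` class and an explicit `tape` list (both hypothetical, not Chainer's implementation): running the forward code records the order of operations, and backprop replays that record in reverse. The `while` loop is ordinary Python whose trip count depends on the data.

```python
class Var(object):
    """Toy scalar variable with a gradient slot (illustration only)."""

    def __init__(self, value):
        self.value = value
        self.grad = 0.0


def mul(a, b, tape):
    """Multiply two Vars and remember how to backprop through the product."""
    out = Var(a.value * b.value)

    def backward_step():
        a.grad += out.grad * b.value
        b.grad += out.grad * a.value

    tape.append(backward_step)  # graph construction = storing the order of operations
    return out


def forward(x, w, tape):
    """Regular Python code: the number of multiplications depends on the data."""
    h = x
    steps = 0
    while abs(h.value) < 10.0:
        h = mul(h, w, tape)
        steps += 1
    return h, steps


def backward(output, tape):
    """Replay the recorded operations in reverse order."""
    output.grad = 1.0
    for step in reversed(tape):
        step()


tape = []
x, w = Var(1.5), Var(2.0)
y, steps = forward(x, w, tape)
backward(y, tape)
```

With these inputs the loop runs three times, so `y` equals `x * w**3`. A different input would record a different graph, with no special loop construct needed.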

Backprop Implementation Paradigm (4)

Define-by-Run (cont.)

• The computational graph can be modified within each iteration
• Example: Truncated BPTT (BackProp Through Time)
  – BPTT: backprop on a recurrent net
  – Truncated BPTT: truncate the backprop at some time point
  – Truncation is one type of modification of the computational graph


Features of Chainer

• Define-by-Run scheme
  – Forward computation can contain any Python code
    ◆ if-else, for-else, break, continue, try-except-finally, list, dict, class, etc.
  – Users can modify the graph within the loop
    ◆ E.g. truncation can be done by unchain_backward (which unchains the graph backward from some variable)
    ◆ See the tutorial on recurrent nets
• Predefined functions
• Supports GPU(s) via PyCUDA

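Example: Training a multi-layer perceptron in one page

Full code is in the tutorial and the example directory.

# Imports assumed by this snippet (Chainer v1 API)
from chainer import FunctionSet, Variable, optimizers
import chainer.functions as F

# Model definition
model = FunctionSet(
    l1=F.Linear(784, 100),
    l2=F.Linear(100, 100),
    l3=F.Linear(100, 10))
opt = optimizers.SGD()
opt.setup(model.collect_parameters())  # register parameters and gradients

# Forward computation
def forward(x, t):
    h1 = F.relu(model.l1(x))
    h2 = F.relu(model.l2(h1))
    y = model.l3(h2)
    return F.softmax_cross_entropy(y, t)

# Training loop
for epoch in xrange(n_epoch):
    for i in xrange(0, N, batchsize):
        x = Variable(...)  # minibatch of inputs
        t = Variable(...)  # minibatch of labels
        opt.zero_grads()   # clear the gradients
        loss = forward(x, t)
        loss.backward()    # backprop through the recorded graph
        opt.update()       # apply one SGD step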


Example: Recurrent net language model in one page

Full code is in the tutorial and the example directory.

# Model definition
model = FunctionSet(
    emb=F.EmbedID(1000, 100),
    x2h=F.Linear(100, 50),
    h2h=F.Linear(50, 50),
    h2y=F.Linear(50, 1000))
opt = optimizers.SGD()
opt.setup(model.collect_parameters())  # register parameters and gradients

# Forward computation of one step
def fwd1step(h, w, t):
    x = F.tanh(model.emb(w))
    h = F.tanh(model.x2h(x) + model.h2h(h))
    y = model.h2y(h)
    return h, F.softmax_cross_entropy(y, t)

# Full RNN forward computation
def forward(seq):
    h = Variable(...)  # init state
    loss = 0
    for curw, nextw in zip(seq, seq[1:]):
        x = Variable(curw)
        t = Variable(nextw)
        h, new_loss = fwd1step(h, x, t)
        loss += new_loss
    return loss

Chainer: How to Use It


Install Chainer

• Prepare a Python 2.7 environment with pip
  – (Pyenv+)Anaconda is recommended
• Install Chainer with: pip install chainer
• If you want to use GPU(s):
  – Install CUDA and the corresponding NVIDIA driver
  – Install the dependent packages with: pip install chainer-cuda-deps
  – You may have to update the six package: pip install -U six

Run the MNIST example (quick start)

• Requires scikit-learn: pip install scikit-learn
• Clone the Chainer repository: git clone
• Go to the example directory at examples/mnist
• Then, run python
  – Run on GPU by passing --gpu=0
• Other examples can be executed similarly (some need manual preparation of datasets)

Read the documents

• Read the documents at
• They include:
  – Tutorial
  – Reference manual
• All features covered in this talk are introduced in the tutorial, so please try it if you want to know the details.

Basic concepts (1)

• Essential part of Chainer: Variable and Function
• Variable is a wrapper of n-dimensional arrays (ndarray and GPUArray)
• Function is an operation on Variables
  – Function application is memorized by the returned Variable(s)
  – All operations for which you want to backprop must be done by Functions on Variables
• Making a Variable object is simple: just pass an array
    x = chainer.Variable(numpy.ndarray(...))
  – The array is stored in the data attribute (x.data)

Basic concepts (2)

• Example of the computational graph construction
    x = chainer.Variable(...)
    y = chainer.Variable(...)
    z = x**2 + 2*x*y + y
• Gradient of z(x, y) can be computed by z.backward()
• Results are stored in x.grad and y.grad

[Figure: computational graph of z, built from the nodes _ ** 2, 2 * _, _ * _, and two _ + _ nodes]

Actually, Split nodes are automatically inserted (they accumulate the gradients on backprop)
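The expression z = x**2 + 2*x*y + y can be traced by hand with a toy scalar class. The `Variable` class below is a hypothetical stand-in written for this sketch, not `chainer.Variable`, and the input values 3 and 4 are arbitrary. It shows how gradients from the several uses of x are summed, which is the job of the Split nodes.

```python
class Variable(object):
    """Toy scalar stand-in for a framework variable (illustration only)."""

    def __init__(self, data, inputs=()):
        self.data = data
        self.grad = 0.0
        self.inputs = inputs  # pairs of (input Variable, local derivative)

    @staticmethod
    def _wrap(v):
        return v if isinstance(v, Variable) else Variable(float(v))

    def __add__(self, other):
        other = Variable._wrap(other)
        return Variable(self.data + other.data, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        other = Variable._wrap(other)
        return Variable(self.data * other.data,
                        ((self, other.data), (other, self.data)))

    __radd__ = __add__
    __rmul__ = __mul__

    def __pow__(self, k):
        return Variable(self.data ** k, ((self, k * self.data ** (k - 1)),))

    def backward(self):
        # Order the nodes so that every node comes after all of its inputs.
        order, seen = [], set()

        def visit(v):
            if id(v) not in seen:
                seen.add(id(v))
                for parent, _ in v.inputs:
                    visit(parent)
                order.append(v)

        visit(self)
        self.grad = 1.0
        # Go from the output back to the inputs, accumulating with +=.
        for v in reversed(order):
            for parent, local in v.inputs:
                parent.grad += v.grad * local


x = Variable(3.0)
y = Variable(4.0)
z = x**2 + 2*x*y + y
z.backward()
# Analytically: dz/dx = 2x + 2y and dz/dy = 2x + 1.
```

Because x appears in two terms, `x.grad` receives two contributions that are added together, matching the analytic derivative 2x + 2y.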

Basic concepts (3)

• Chainer provides many functions in the chainer.functions subpackage
  – This package is often abbreviated to F
• Parameterized functions are provided as classes
  – Linear, Convolution2D, EmbedID, PReLU, BatchNormalization, etc.
  – Their instances should be shared across all iterations
• Non-parameterized functions are provided as Python functions
  – Activation functions, pooling, array manipulation, etc.

Basic concepts (4)

• Use FunctionSet to manage parameterized functions
  – It is an object with Function attributes
  – Easy to migrate functions onto GPU devices
  – Easy to collect parameters and gradients (collect_parameters)
• Use Optimizer for numerical optimization
  – Major algorithms are provided: SGD, MomentumSGD, AdaGrad, RMSprop, ADADELTA, Adam
  – Some parameter/gradient manipulations are done via this class: weight decay, gradient clipping, etc.

Easy to debug!

• If the forward computation has a bug, then an error occurs immediately at the appropriate line of the forward definition
• Example
  – This code has an inconsistency in the array sizes:
    x = Variable(np.ndarray((3, 4), dtype=np.float32))
    y = Variable(np.ndarray((3, 3), dtype=np.float32))
    a = x ** 2 + x
    b = a + y * 2  # ← an exception is raised at this line
    c = b + x * 2
  – Since an exception is raised at the appropriate line, we can easily find the cause of the bug (this is one big difference from Define-and-Run frameworks)

Graph manipulation (1)

• Backward unchaining: y.unchain_backward()
  – It purges the nodes backward from y
  – It is useful for implementing truncated BPTT (see the PTB example)

[Figure: the chain x → f → y → g → z becomes y → g → z after unchaining]
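The effect of backward unchaining can be mimicked on the chain x → f → y → g → z with a toy `Node` class. The class, the concrete choices f(x) = x² and g(y) = 3y, and the simplified single-link `unchain_backward` are assumptions for illustration; they do not reproduce Chainer's internals.

```python
class Node(object):
    """Toy scalar variable; `creator` links it to the input that produced it."""

    def __init__(self, value, creator=None):
        self.value = value
        self.grad = 0.0
        self.creator = creator  # None, or (input Node, local derivative)

    def unchain_backward(self):
        # Forget how this node was computed, so that backprop stops here.
        self.creator = None


def f(x):
    """y = x^2"""
    return Node(x.value ** 2, (x, 2.0 * x.value))


def g(y):
    """z = 3 * y"""
    return Node(3.0 * y.value, (y, 3.0))


def backward(node):
    """Follow the creator links from the output toward the inputs."""
    node.grad = 1.0
    while node.creator is not None:
        parent, local = node.creator
        parent.grad = node.grad * local
        node = parent


# Full chain: the gradient reaches x.
x_full = Node(2.0)
z_full = g(f(x_full))
backward(z_full)

# Truncated chain: unchaining at y stops backprop before f.
x = Node(2.0)
y = f(x)
z = g(y)
y.unchain_backward()
backward(z)
```

After truncation `y.grad` is still computed but `x.grad` stays at zero. In truncated BPTT the same cut is applied to the recurrent state every fixed number of time steps.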


Graph manipulation (2)

• Volatile variables: x = Variable(..., volatile=True)
  – A volatile variable does not build a graph
  – Volatility can be accessed directly via x.volatile
    x = Variable(..., volatile=True)
    y = f(x)
    y.volatile = False
    z = g(y)

[Figure: chain x → f → y → g → z]

Example: Training a multi-layer perceptron in one page

Note: F = chainer.functions

# Model definition
model = FunctionSet(
    l1=F.Linear(784, 100),
    l2=F.Linear(100, 100),
    l3=F.Linear(100, 10))
opt = optimizers.SGD()
opt.setup(model.collect_parameters())  # register parameters and gradients

# Forward computation
def forward(x, t):
    h1 = F.relu(model.l1(x))
    h2 = F.relu(model.l2(h1))
    y = model.l3(h2)
    return F.softmax_cross_entropy(y, t)

# Training loop
for epoch in xrange(n_epoch):
    for i in xrange(0, N, batchsize):
        x = Variable(...)  # minibatch of inputs
        t = Variable(...)  # minibatch of labels
        opt.zero_grads()   # clear the gradients
        loss = forward(x, t)
        loss.backward()    # backprop through the recorded graph
        opt.update()       # apply one SGD step

Example: Recurrent net language model in one page

# Model definition
model = FunctionSet(
    emb=F.EmbedID(1000, 100),
    x2h=F.Linear(100, 50),
    h2h=F.Linear(50, 50),
    h2y=F.Linear(50, 1000))
opt = optimizers.SGD()
opt.setup(model.collect_parameters())  # register parameters and gradients

# Forward computation of one step
def fwd1step(h, w, t):
    x = F.tanh(model.emb(w))
    h = F.tanh(model.x2h(x) + model.h2h(h))
    y = model.h2y(h)
    return h, F.softmax_cross_entropy(y, t)

# Full RNN forward computation
def forward(seq):
    h = Variable(...)  # init state
    loss = 0
    for curw, nextw in zip(seq, seq[1:]):
        x = Variable(curw)
        t = Variable(nextw)
        h, new_loss = fwd1step(h, x, t)
        loss += new_loss
    return loss

CUDA support (1)

• Chainer supports CUDA computation
• Installation
  – Install CUDA 6.5+
  – Install CUDA-related packages with: pip install chainer-cuda-deps
    ◆ The build of PyCUDA may fail if you installed CUDA into a non-standard path. In that case, you have to install PyCUDA from source with an appropriate configuration.

CUDA support (2)

• Call cuda.init() before any CUDA-related operations
• Convert a numpy.ndarray into a GPUArray with chainer.cuda.to_gpu
    data_gpu = chainer.cuda.to_gpu(data_cpu)
• A GPUArray object can be passed to the Variable constructor
    x = Variable(data_gpu)
• Most functions support GPU Variables
  – Parameterized functions must be sent to the GPU beforehand by Function.to_gpu or FunctionSet.to_gpu
• Extract the results to host memory with chainer.cuda.to_cpu
• All examples support CUDA (pass --gpu=N, where N is the GPU ID)

MLP example for CUDA

# Model definition
model = FunctionSet(
    l1=F.Linear(784, 100),
    l2=F.Linear(100, 100),
    l3=F.Linear(100, 10)).to_gpu()  # send the parameters to the GPU
opt = optimizers.SGD()
opt.setup(model.collect_parameters())  # register parameters and gradients

# Forward computation
def forward(x, t):
    h1 = F.relu(model.l1(x))
    h2 = F.relu(model.l2(h1))
    y = model.l3(h2)
    return F.softmax_cross_entropy(y, t)

# Training loop (to_gpu here is chainer.cuda.to_gpu)
for epoch in xrange(n_epoch):
    for i in xrange(0, N, batchsize):
        x = Variable(to_gpu(...))  # minibatch of inputs on the GPU
        t = Variable(to_gpu(...))  # minibatch of labels on the GPU
        opt.zero_grads()
        loss = forward(x, t)
        loss.backward()
        opt.update()

CUDA support (3)

• Chainer also supports computation on multiple GPUs (easily!)
• Model parallel
  – Send FunctionSets to the appropriate devices (to_gpu accepts a GPU ID)
      model_0 = FunctionSet(...).to_gpu(0)
      model_1 = FunctionSet(...).to_gpu(1)
  – Copy Variable objects across GPUs with the copy function
      x_1 = F.copy(x_0, 1)
    ◆ This copy is tracked by the computational graph, so you don't need to deal with it on backprop

CUDA support (4)

• Chainer also supports computation on multiple GPUs
• Data parallel
  – FunctionSet can be copied by copy.copy
      model = FunctionSet(...)
      model_0 = copy.copy(model).to_gpu(0)
      model_1 = model.to_gpu(1)
  – Set up the optimizer only for the master model
      opt.setup(model_0.collect_parameters())
  – After data-parallel gradient computation, gather the gradients
      opt.accumulate_grads(model_1.gradients)
  – After the update, share the parameters across model copies
      model_1.copy_parameters_from(model_0.parameters)
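The data-parallel recipe (compute gradients on each copy, gather them on the master, update, then share the parameters) can be sketched without any GPU. The linear model, the helper names `grad_sum` and `sgd_update`, the data and the learning rate are all assumptions for illustration; plain Python lists play the roles of model_0 and model_1.

```python
def grad_sum(params, batch):
    """Summed gradient of 0.5 * (w * x + b - t)^2 over a batch of (x, t) pairs."""
    w, b = params
    gw = gb = 0.0
    for x, t in batch:
        err = w * x + b - t
        gw += err * x
        gb += err
    return [gw, gb]


def sgd_update(params, grads, lr):
    """One plain SGD step."""
    return [p - lr * g for p, g in zip(params, grads)]


batch = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 4.0)]
master = [0.5, 0.0]      # plays the role of model_0
replica = list(master)   # plays the role of model_1

# Each copy computes gradients on its own half of the minibatch.
g0 = grad_sum(master, batch[:2])
g1 = grad_sum(replica, batch[2:])

# Gather: add the replica's gradients onto the master's (cf. accumulate_grads).
total = [a + b for a, b in zip(g0, g1)]

# Update the master only, then share the new parameters (cf. copy_parameters_from).
master = sgd_update(master, total, lr=0.01)
replica = list(master)
```

Since the gradient of a summed loss is the sum of the per-example gradients, this gives the same update as processing the whole minibatch on one device, up to floating-point rounding.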

Model Zoo support (in the near future)

• Model Zoo is a place where pretrained models are registered
  – Provided by the BVLC Caffe team
  – It contains the Caffe reference models
• We are planning to support the Caffe reference models in three weeks (the next minor release)
  – Current design (it may be changed):
      f = CaffeFunction('path/to/model.caffemodel')
      x, t = Variable(...), Variable(...)
      y = f(inputs={'data': x, 'label': t}, outputs=['loss'])
  – It emulates Caffe networks with Chainer's functions

Note: development process

• Schedule
  – We are planning to release updates biweekly
  – Updates are classified into three groups
    ◆ Revision: bug fixes and updates that do not add or modify interfaces
    ◆ Minor: updates that add or modify interfaces without breaking backward compatibility
    ◆ Major: updates that are not backward-compatible
• We are using the GitHub-flow process
• We welcome your PRs!
  – Please send them to the master branch

Wrap up

• Chainer is a powerful, flexible, and intuitive framework of neural networks in Python
• It is based on the Define-by-Run scheme, which makes it intuitive and flexible
• Chainer is a very young and immature project
  – Its development started in mid-April (just two months ago)
  – We will add many functionalities (especially more functions)
  – We may add some abstraction of whole learning processes