Deep Learning on Spark, by Satyendra Rana (files.meetup.com/18532292/Deep Learning on Spark.pdf)


Page 1: Learning is about acquiring the ability

1

Page 2

Learning is about acquiring the ability to discriminate.

Memorization

Overfitting

Underfitting

Generalization

© Satyendra Rana 2

Training examples (x1, x2 → y):

2, 2 → 4
2, 3 → 5
2, 4 → 6
2, 5 → 7

Queries: 2, 3 → ?   2, 7 → ?   3, 4 → ?

The same task with noise in the outputs:

2, 2 → 4.05
2, 3 → 4.98
4, 2 → 5.95
2, 5 → 7.06

Queries: 2, 3 → ?   2, 7 → ?   3, 4 → ?
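One way to read the first table: all of its rows are consistent with a simple additive rule, and generalization means recovering that rule rather than memorizing the rows. A minimal sketch (the rule y = x1 + x2 is my reading of the table, not stated on the slide):

```scala
// The noise-free rows are all consistent with the rule y = x1 + x2.
// A model that generalizes learns the rule and can answer unseen queries;
// a model that memorizes can only replay the four training rows.
val train = Seq((2, 2, 4), (2, 3, 5), (2, 4, 6), (2, 5, 7))
def rule(x1: Int, x2: Int): Int = x1 + x2

// Every training row matches the rule...
assert(train.forall { case (a, b, y) => rule(a, b) == y })
// ...and the rule generalizes to the unseen queries:
println(Seq((2, 7), (3, 4)).map { case (a, b) => rule(a, b) }) // List(9, 7)
```

The noisy table makes the same point harder: an exact-recall model would reproduce the noise, while a generalizing model should still land near the additive rule.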

Page 3

Machine Learning

Data In, Wisdom Out


?

Square Boxes

Thin Rectangular Boxes

Round Boxes

Q1: Which type of box should we look for?

Q2: Having picked up the box type, how do we find the right box?

Computational Architecture

Learning Method

Page 4

Deep (Machine) Learning

Data In, Wisdom Out


?

Type of box?

Right box?

Computational Architecture

Learning Method

Discrimination Ability?

Finer Discrimination (Non-linearity)

Network of Neurons (aka Neural Network or NN)

Page 5

Natural Language Generation

Page 6

Machine Translation

Page 7

Automatic Image Captioning

Page 8

Automatic Colorization of Grayscale Images

Input Image

Automatically Colorized

Ground-Truth

Source: Nvidia news

Page 9

Ping Pong Playing Robot

Source: Omron Automation Lab, Kyoto, Japan

Page 10

Deep learning for the sight-impaired (and also for the sight-endowed)

Page 11

Neurons and Synapses

Synapses: Adult Brain – 100 Trillion; Infant Brain – 1 Quadrillion

Page 12

Model of a Neuron & Artificial Neural Networks

[Figure: model of a neuron; the inputs enter the unit through connection weights w0 through w4]

Hyper-parameters: number of layers; type and number of neurons in each layer

Parameters: weights (one for each connection)
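The neuron model in the figure can be sketched in a few lines of Scala: multiply each input by its connection weight, sum, and squash through an activation (the sigmoid and the bias term are my illustrative choices, not from the slide):

```scala
// A single artificial neuron: weighted sum of inputs through an activation.
// The weights (and bias) are the parameters learned during training; how
// many such neurons exist, and in how many layers, are hyper-parameters.
def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

def neuron(inputs: Seq[Double], weights: Seq[Double], bias: Double): Double = {
  val z = inputs.zip(weights).map { case (x, w) => x * w }.sum + bias
  sigmoid(z)
}

val y = neuron(Seq(1.0, 0.5, -0.5, 2.0), Seq(0.1, 0.2, 0.3, 0.4), 0.05)
// The sigmoid squashes the output into the interval (0, 1).
```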

Page 13

Multi-layered Neural Network

Synapse Scale: Typical NNs – 1-10 Million; Google Brain – 1 Billion; More Recent Ones – 10 Billion

Given a fixed number of neurons, spreading them in more layers (deep structure) is more effective than in fewer layers (shallow structure).

Given a fixed number of layers, more neurons are better than fewer.

Deep Neural Networks are powerful, but they must also be trainable to be useful.

Different kinds of Deep Neural Networks: Feed-Forward NNs, Recurrent NNs, Recursive NNs, Convolutional NNs

Page 14

How does a Neural Network Learn?

The learning problem is to find, among all possible choices of parameter values, the combination that gives the most accurate (minimum-error) output, on average, over all possible inputs.
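"Most accurate on average" is usually made concrete with a loss function; a minimal version (the choice of mean squared error is illustrative, not from the slide):

```scala
// Mean squared error: the average of squared differences between the
// network's outputs and the desired outputs. Learning means choosing the
// parameters that minimize this average over the training inputs.
def mse(pred: Seq[Double], target: Seq[Double]): Double =
  pred.zip(target).map { case (p, t) => (p - t) * (p - t) }.sum / pred.length

// On the noisy addition data, even the true rule leaves a small residual:
val err = mse(Seq(4.05, 4.98), Seq(4.0, 5.0))
```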

Page 15

Feed Forward Neural Network (FNN)

[Figure: a feed-forward network; weight matrices W1, W2, W3 connect the layers, and the output feeds a loss function]

Credit Assignment Problem: Which modifiable components of a learning system are responsible for its success or failure? How can I modify the responsible components to improve the system?

How do I change the weights (parameters) to make the NN exhibit desired behavior?

Supervised Learning

Page 16

Passing the Buck Example: Fine Tuning a Sales Team Performance

[Figure: the same feed-forward network, with the error from the loss function flowing backwards through the layers]

Backward Propagation: propagating the error backwards from layer to layer, so that each layer can tweak its weights to account for its share of the responsibility.

(direction, amount)

Page 17

Forward Pass:

X_1 → … → X_{n-3} → F_{n-2}(X_{n-3}, W_{n-2}) → X_{n-2} → F_{n-1}(X_{n-2}, W_{n-1}) → X_{n-1} → F_n(X_{n-1}, W_n) → X_n → C(X_n, Y) → E

(Y is the desired output; C is the cost; E is the resulting error.)

Backward Pass (each step yields a direction for adjusting that layer's weights):

Direction_n = Direction obtained as DF(X_n, C(X_n, Y))
Direction_{n-1} = Direction_n * DF(X_{n-1}, F_n(X_{n-1}, W_n))
Direction_{n-2} = Direction_{n-1} * DF(X_{n-2}, F_{n-1}(X_{n-2}, W_{n-1}))
Direction_{n-3} = Direction_{n-2} * DF(X_{n-3}, F_{n-2}(X_{n-3}, W_{n-2}))

Stochastic Gradient Descent (SGD)
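The forward pass, backward pass, and SGD update can be demonstrated end-to-end on the noisy addition data from the earlier slide. A toy sketch with a single linear "layer" (the model form and learning rate are my choices for illustration):

```scala
// Fit y ≈ w1*x1 + w2*x2 on the noisy table by stochastic gradient descent:
// for each example, compute the error (forward pass), then step each weight
// against its gradient (backward pass).
val data = Seq((2.0, 2.0, 4.05), (2.0, 3.0, 4.98), (4.0, 2.0, 5.95), (2.0, 5.0, 7.06))
var w1 = 0.0
var w2 = 0.0
val lr = 0.01 // learning rate: the step size down the error surface

for (epoch <- 1 to 500; (x1, x2, y) <- data) {
  val err = w1 * x1 + w2 * x2 - y // forward pass: prediction minus target
  w1 -= lr * err * x1             // gradient of the squared error w.r.t. w1
  w2 -= lr * err * x2             // gradient of the squared error w.r.t. w2
}
// Both weights end up near 1, recovering y ≈ x1 + x2 despite the noise.
```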

Page 18

Climbing down Mountains with Zero Gravity

[Figure: descending an error landscape; "You are here" marks the current parameters, "Base camp" the minimum]

Steepest Descent

Learning rate

Epoch

Page 19

What changed since the 80’s?

Early NN Activity (roughly 1970-1990): slow computers, small data sets, training issues

Deep NN Activity (2010 onward): faster computers, big data

Big Data & Deep Learning Symbiosis

Page 20

Reaching Saturation Point in Learning

I don’t want to learn anymore.

Page 21

Vanishing (or Unstable) Gradient Problem (the gradient at a layer involves the multiplication of gradients at previous layers)

What is the fix?

1. Random initialization of weights
2. Pre-training of layers
3. Choice of activation function, e.g. the Rectified Linear Unit (ReLU)
4. Don't use SGD
5. LSTM
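The ReLU fix can be seen numerically: backpropagation multiplies each layer's local gradient, and the sigmoid's derivative never exceeds 0.25, so deep chains shrink toward zero, while ReLU's derivative is exactly 1 for positive inputs. A small sketch (the 10-layer depth is arbitrary):

```scala
// The sigmoid derivative peaks at 0.25 (at z = 0), so a chain of 10 layers
// scales the backpropagated signal by at most 0.25^10 (about 1e-6), even in
// the best case. ReLU's derivative is 1 for positive inputs, so no decay.
def sigmoidDeriv(z: Double): Double = {
  val s = 1.0 / (1.0 + math.exp(-z))
  s * (1.0 - s)
}
def reluDeriv(z: Double): Double = if (z > 0) 1.0 else 0.0

val layers = 10
val sigmoidChain = math.pow(sigmoidDeriv(0.0), layers) // vanishes
val reluChain    = math.pow(reluDeriv(1.0), layers)    // stays at 1.0
```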

Page 22

Implementation of Deep Learning: it's all about scaling

1. Implementing a neuron
2. Implementing a layer
3. Composing layers (building the network)
4. Implementing a training (learning) iteration, aka epoch
5. Learning hyper-parameters

Page 23

Implementation of Neuron / Layer

Neuron Abstraction

Layer Abstraction

Fast Matrix / Tensor Computation Libraries

• Exploiting multi-threaded multi-core architectures

• GPU Acceleration
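A layer is, computationally, a matrix-vector product followed by an element-wise activation, which is why the implementation cost reduces to fast BLAS/tensor routines. A pure-Scala sketch (ReLU chosen for illustration; real code would hand the product to a BLAS or GPU library):

```scala
// One layer = weight matrix times input vector, then an activation (ReLU).
// Each row of the matrix holds the incoming weights of one neuron.
def layer(w: Array[Array[Double]], x: Array[Double]): Array[Double] =
  w.map { row =>
    val z = row.zip(x).map { case (wi, xi) => wi * xi }.sum
    math.max(0.0, z) // ReLU activation
  }

val weights = Array(Array(1.0, -1.0), Array(0.5, 0.5))
val out = layer(weights, Array(2.0, 1.0)) // Array(1.0, 1.5)
```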

[Figure: single-node architecture with shared memory across the cores; the GPU-accelerated variant pairs host memory with GPU memory on each of several nodes; the layer abstraction supplies the activation functions and loss functions]

Page 24

Composing Layers / Building a Neural Network

1. Specifying layer composition (network specification)

SparkML

val mlp = new MultilayerPerceptronClassifier()
  .setLayers(Array(784, 300, 100, 10))
  .setBlockSize(128)

SparkNet

val netparams = NetParams(
  RDDLayer("data", shape = List(batchsize, 1, 28, 28)),
  RDDLayer("label", shape = List(batchsize, 1)),
  ConvLayer("conv1", List("data"), Kernel = (5, 5), numFilters = 20),
  PoolLayer("pool1", List("conv1"), pool = Max, kernel = (2, 2), stride = (2, 2)),
  ConvLayer("conv2", List("pool1"), Kernel = (5, 5), numFilters = 50),
  PoolLayer("pool2", List("conv2"), pool = Max, kernel = (2, 2), stride = (2, 2)),
  LinearLayer("ip1", List("pool2"), numOutputs = 500),
  ActivationLayer("relu1", List("ip1"), activation = ReLU),
  LinearLayer("ip2", List("relu1"), numOutputs = 10),
  SoftmaxWithLoss("loss", List("ip2", "label"))
)

2. Allocating layers to nodes

Page 25

Speeding up the Training Iteration: Distributed Implementation of SGD

[Figure: a Master coordinating Executors 1 … n, each backed by BLAS; in iteration k every executor holds the parameters Wk]

Step 1: Get parameters Wk from the Master
Step 2: Compute the gradient
Step 3: Send the gradients to the Master
Step 4: The Master computes Wk+1 from the gradients

BLAS: Basic Linear Algebra Subprograms, used in Spark through netlib-java
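The four steps above can be simulated without Spark: each "executor" computes a gradient on its own data partition, and the "master" averages the gradients and updates the parameters. A minimal single-parameter sketch (the quadratic loss is my illustrative choice):

```scala
// Simulated master/executor SGD iteration: executors hold data partitions,
// the master holds the parameter w and averages the executors' gradients.
val partitions = Seq(Seq(1.0, 2.0), Seq(3.0, 4.0)) // two executors' data
var w = 0.0 // parameter W_k, held at the master

// Gradient of the mean squared loss (w - x)^2 over one partition.
def gradient(w: Double, shard: Seq[Double]): Double =
  shard.map(x => 2.0 * (w - x)).sum / shard.size

for (k <- 1 to 100) {
  val grads = partitions.map(s => gradient(w, s)) // Step 2, on each executor
  w -= 0.1 * grads.sum / grads.size               // Steps 3-4: aggregate, update
}
// w converges to 2.5, the mean of all the data, minimizing the combined loss.
```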

Page 26

MultilayerPerceptronClassifier() in Spark ML

Scala Code

val digits: DataFrame = sqlContext.read.format("libsvm").load("/data/mnist")

val mlp = new MultilayerPerceptronClassifier()
  .setLayers(Array(784, 300, 100, 10))
  .setBlockSize(128)

val model = mlp.fit(digits)

Features (input)

Classes (output)

Hidden layer with 300 neurons

Hidden layer with 100 neurons

Page 27

SparkNet: Training Deep Networks in Spark

[Figure: a Master coordinating Executors 1-4; each executor runs Caffe on a GPU over its own data shard (Data Shards 1-4)]

1. Broadcast the model parameters to the executors
2. Each executor runs SGD on a mini-batch for a fixed time / number of iterations
3. Each executor sends its parameters to the master
4. The master receives the parameters from the executors
5. It averages them to get the new parameters
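The difference from the previous scheme is that executors exchange parameters, not per-step gradients: each runs several local SGD steps, then the master averages the resulting models. A toy simulation (again one parameter and an illustrative quadratic loss of my choosing):

```scala
// SparkNet-style model averaging: broadcast w, let each executor take a few
// local SGD steps on its shard, then average the returned parameters.
val shards = Seq(Seq(1.0, 2.0), Seq(4.0, 5.0))
var w = 0.0
for (round <- 1 to 50) {
  val locals = shards.map { shard =>   // step 1: each executor starts from w
    var wl = w
    for (_ <- 1 to 5; x <- shard)      // step 2: local SGD on the shard
      wl -= 0.05 * 2.0 * (wl - x)
    wl                                 // step 3: parameters back to master
  }
  w = locals.sum / locals.size         // steps 4-5: average into the new model
}
// w settles near 3.0, the mean of all the data across both shards.
```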

Page 28

Distributed Cross Validation with Spark

[Figure: Model #1, Model #2, and Model #3 training in parallel; the best model is selected]
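Cross-validation parallelizes naturally: each candidate hyper-parameter setting trains independently, so the trainings can run on separate nodes and the best-scoring model wins. A sequential sketch (the data, candidate learning rates, and scoring are invented for illustration):

```scala
// Train one model per candidate learning rate, score each on held-out data,
// keep the best. On a cluster, each training would run on its own node.
val trainData = Seq((1.0, 2.0), (2.0, 4.0), (3.0, 6.0)) // samples of y = 2x
val heldOut   = Seq((4.0, 8.0))

def trainModel(lr: Double): Double = {
  var w = 0.0
  for (_ <- 1 to 100; (x, y) <- trainData) w -= lr * (w * x - y) * x
  w
}
// Squared error on the held-out data: lower is better.
def score(w: Double): Double =
  heldOut.map { case (x, y) => (w * x - y) * (w * x - y) }.sum

val best = Seq(0.001, 0.01, 0.05).map(trainModel).minBy(score)
// The winning candidate recovers w ≈ 2, the true slope.
```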

Page 29

Apache SINGA: A General Distributed Deep Learning Platform

Page 30

Why “Deep Learning” on Spark?

• "Sorry, I don't have a GPU / GPU cluster": a 3-to-5 node Spark cluster can be as fast as a GPU

• Most of my application and data reside on a Spark cluster: integrate model training with existing data-processing pipelines

• High-throughput loading and pre-processing of data, and the ability to keep data in memory between operations

• Hyper-parameter learning

• Poor man's deep learning

• It's simply fun …
