Transcript: Deep Learning on Spark
Source: meetupfiles.meetup.com/18532292/deep learning on...
Learning is about acquiring the ability to discriminate.
Memorization
Overfitting
Underfitting
Generalization
x1   x2   y
 2    2   4
 2    3   5
 2    4   6
 2    5   7
 2    3   ?
 2    7   ?
 3    4   ?

x1   x2   y
 2    2   4.05
 2    3   4.98
 4    2   5.95
 2    5   7.06
 2    3   ?
 2    7   ?
 3    4   ?

Noise
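A natural reading of these two exercises (the rule itself is never stated on the slide) is that the learner must infer the underlying pattern from examples: in the first table the outputs follow the rule exactly, in the second only approximately.

y = x1 + x2                        (first table: exact)
y = x1 + x2 + ε,  ε small noise    (second table: noisy)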
Machine Learning
Data In Wisdom Out
?
Square Boxes
Thin Rectangular Boxes
Round Boxes
Q1: Which type of box should we look for?
Q2: Having picked a box type, how do we find the right box?
Computational Architecture
Learning Method
Deep (Machine) Learning
Data In Wisdom Out
?
Type of box?
Right box?
Computational Architecture
Learning Method
Discrimination Ability?
Finer Discrimination (Non-linearity)
Network of Neurons (aka Neural Network or NN)
Natural Language Generation
Machine Translation
Automatic Image Captioning
Automatic Colorization of Grayscale Images
[Figure: an input grayscale image, the automatically colorized result, and the ground truth. Source: Nvidia news]
Ping Pong Playing Robot
Source: Omron Automation Lab, Kyoto, Japan
Deep learning for the sight-impaired (and also for the sight-endowed)
Neurons and Synapses
Synapses: Adult Brain – 100 Trillion; Infant Brain – 1 Quadrillion
Model of a Neuron & Artificial Neural Networks
[Diagram: a single neuron receiving inputs through weighted connections w0, w1, w2, w3, w4]
Hyper-parameters:
• number of layers
• type & number of neurons in each layer
Parameters:
• weights (one for each connection)
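To make the neuron abstraction concrete, here is a minimal Scala sketch (the sigmoid activation and all names are illustrative assumptions, not taken from the slides): the neuron computes a weighted sum of its inputs and passes it through an activation function.

// A single artificial neuron: weighted sum of inputs, then an activation.
// Sigmoid is an illustrative choice of activation function.
case class Neuron(weights: Array[Double], bias: Double) {
  private def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))
  def output(inputs: Array[Double]): Double = {
    val z = weights.zip(inputs).map { case (w, x) => w * x }.sum + bias
    sigmoid(z)
  }
}

// Example: a neuron with five weights (w0..w4), as in the diagram above.
val n = Neuron(Array(0.1, -0.2, 0.4, 0.3, 0.05), bias = 0.0)
val y = n.output(Array(1.0, 0.0, 1.0, 1.0, 0.0))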
Multi-layered Neural Network
Synapse Scale: Typical NNs – 1-10 Million; Google Brain – 1 Billion; More Recent Ones – 10 Billion
Given a fixed number of neurons, spreading them over more layers (a deep structure) is more effective than over fewer layers (a shallow structure).
Given a fixed number of layers, more neurons per layer is better than fewer.
Deep Neural Networks are powerful, but they must also be trainable to be useful.
Different kinds of Deep Neural Networks:
• Feed-Forward NNs
• Recurrent NNs
• Recursive NNs
• Convolutional NNs
How does a Neural Network Learn? Parameters
The learning problem is to find the best combination of parameter values, among all possible choices, that on average gives the most accurate (minimum-error) output over all possible inputs.
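Written as a formula (a standard formulation; the slide gives the words but not the notation), learning picks the weights W that minimize the average loss over the N training pairs (x_i, y_i):

W* = argmin over W of (1/N) · Σ_{i=1..N} Loss(f(x_i; W), y_i)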
Feed Forward Neural Network (FNN)
[Diagram: a feed-forward network whose connections are labeled with weights W1ij, W2ij, and W3i1 across three layers; a Loss Function is applied to the Output]
Credit Assignment Problem: Which modifiable components of a learning system are responsible for its success or failure? How can I modify the responsible components to improve the system?
How do I change the weights (parameters) to make the NN exhibit desired behavior?
Supervised Learning
Passing the Buck Example: Fine-Tuning a Sales Team's Performance
[Diagram: the same three-layer network, with weights W1ij, W2ij, W3i1, a Loss Function, and the Output]
Backward Propagation: propagating the error backwards from layer to layer, so that each layer can tweak its weights to account for its share of the responsibility. Each weight adjustment is a (direction, amount) pair.
[Diagram: a chain of layers computing X_n-2 = F_n-2(X_n-3, W_n-2), X_n-1 = F_n-1(X_n-2, W_n-1), and X_n = F_n(X_n-1, W_n), starting from the input X_1; a cost function C(X_n, Y) compares the output X_n against the target Y to produce the error E. The forward pass computes the X values; the backward pass computes the Directions.]

Direction_n   = DF(X_n, C(X_n, Y))
Direction_n-1 = Direction_n   * DF(X_n-1, F_n(X_n-1, W_n))
Direction_n-2 = Direction_n-1 * DF(X_n-2, F_n-1(X_n-2, W_n-1))
Direction_n-3 = Direction_n-2 * DF(X_n-3, F_n-2(X_n-3, W_n-2))

Here DF(X, F) denotes the derivative of F with respect to X: each layer's Direction is the Direction from the layer above multiplied by the local derivative, which is exactly the chain rule.
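A scalar Scala sketch of this forward/backward pass (the layer function F(x, w) = x * w and all names are illustrative assumptions, chosen so the derivatives are easy to read):

// Layer k computes X_k = F_k(X_{k-1}, W_k); here every F(x, w) = x * w,
// whose derivative with respect to x is simply w.
val weights = Array(0.5, -1.2, 0.8)
def f(x: Double, w: Double): Double = x * w
def df(x: Double, w: Double): Double = w // DF(X, F): derivative of F w.r.t. X

// Forward pass: compute X0 -> X1 -> X2 -> X3, remembering every X.
val x0 = 2.0
val xs = weights.scanLeft(x0)(f) // xs(k) = X_k

// Cost C(Xn, Y) = 0.5 * (Xn - Y)^2, so Direction_n = Xn - Y.
val target = 1.0
var direction = xs.last - target

// Backward pass: Direction_{k-1} = Direction_k * DF(X_{k-1}, F_k).
for (k <- weights.indices.reverse) {
  val gradW = direction * xs(k) // dE/dW_k = Direction_k * X_{k-1}
  println(s"layer ${k + 1}: dE/dW = $gradW")
  direction *= df(xs(k), weights(k)) // pass the buck to the layer below
}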
Stochastic Gradient Descent (SGD)
© Satyendra Rana 18
Climbing down Mountains with Zero Gravity
[Diagram: "You are here" near the summit, "Base camp" in the valley; SGD repeatedly steps downhill]
Steepest Descent
Learning rate
Epoch
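The picture corresponds to the standard SGD update (a standard formulation; the slide names the ingredients but not the formula):

W_k+1 = W_k - η · Direction_k

where Direction_k is the gradient produced by the backward pass, the learning rate η sets the step size, and one epoch is one full pass over the training data.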
What changed since the 80’s?
[Timeline: Early NN Activity roughly 1970-1990; Deep NN Activity from 2010 onward]
Then:
• Slow Computers
• Small Data Sets
• Training Issues
Now:
• Faster Computers
• Big Data
Big Data & Deep Learning Symbiosis
Reaching Saturation Point in Learning
I don’t want to learn anymore.
Vanishing (or Unstable) Gradient Problem (the gradient at a layer involves the multiplication of gradients at previous layers)
What is the fix?
1. Random Initialization of Weights
2. Pre-Training of Layers
3. Choice of activation function
   • Rectified Linear Unit (ReLU)
4. Don't use SGD
5. LSTM
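Why fix 3, the choice of activation function, matters (a minimal illustration; the sigmoid comparison is mine, not spelled out on the slide): the sigmoid's derivative never exceeds 0.25, so a product of many such derivatives shrinks toward zero, while ReLU's derivative is exactly 1 for positive inputs.

// Sketch: product of 20 layer derivatives, sigmoid vs. ReLU.
def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))
def dSigmoid(z: Double): Double = sigmoid(z) * (1.0 - sigmoid(z)) // <= 0.25
def dRelu(z: Double): Double = if (z > 0) 1.0 else 0.0

val sigProduct  = math.pow(dSigmoid(1.0), 20) // ~ 7e-15: the gradient vanishes
val reluProduct = math.pow(dRelu(1.0), 20)    // 1.0: the gradient survives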
Implementation of Deep Learning: It's all about scaling
1. Implementing a Neuron
2. Implementing a Layer
3. Composing Layers (Building the network)
4. Implementing a Training (Learning) Iteration, aka epoch
5. Learning Hyper-parameters
Implementation of Neuron / Layer
Neuron Abstraction
Layer Abstraction
Fast Matrix / Tensor Computation Libraries
• Exploiting multi-threaded multi-core architectures
• GPU Acceleration
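In practice the layer, not the neuron, is the unit of implementation: all of a layer's neurons are evaluated in one matrix-vector product, which is what these libraries accelerate. A minimal sketch using the Breeze linear-algebra library (an illustrative implementation, not the one used by Spark ML):

import breeze.linalg.{DenseMatrix, DenseVector}
import breeze.numerics.sigmoid

// One layer = one matrix multiply: row i of `weights` holds neuron i's
// weights, so a single BLAS call evaluates every neuron in the layer.
case class DenseLayer(weights: DenseMatrix[Double], bias: DenseVector[Double]) {
  def forward(input: DenseVector[Double]): DenseVector[Double] =
    sigmoid((weights * input) + bias)
}

// Example: a layer of 3 neurons over 4 inputs.
val layer = DenseLayer(DenseMatrix.rand[Double](3, 4), DenseVector.zeros[Double](3))
val out = layer.forward(DenseVector(1.0, 0.5, -0.3, 2.0))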
[Diagram: two single-node architectures: one with CPU nodes (Node 1, Node 2, Node 3) over shared memory, and a GPU-accelerated variant with per-GPU memory; the layer implementation supplies the Activation Functions and Loss Functions]
Composing Layers / Building a Neural Network
1. Specifying Layer Composition (network specification)
SparkML
val mlp = new MultilayerPerceptronClassifier()
  .setLayers(Array(784, 300, 100, 10))
  .setBlockSize(128)
SparkNet
val netparams = NetParams(
  RDDLayer("data", shape=List(batchsize, 1, 28, 28)),
  RDDLayer("label", shape=List(batchsize, 1)),
  ConvLayer("conv1", List("data"), kernel=(5,5), numFilters=20),
  PoolLayer("pool1", List("conv1"), pool=Max, kernel=(2,2), stride=(2,2)),
  ConvLayer("conv2", List("pool1"), kernel=(5,5), numFilters=50),
  PoolLayer("pool2", List("conv2"), pool=Max, kernel=(2,2), stride=(2,2)),
  LinearLayer("ip1", List("pool2"), numOutputs=500),
  ActivationLayer("relu1", List("ip1"), activation=ReLU),
  LinearLayer("ip2", List("relu1"), numOutputs=10),
  SoftmaxWithLoss("loss", List("ip2", "label"))
)
2. Allocating layers to nodes
Speeding up the Training Iteration
Distributed Implementation of SGD
[Diagram: a Master and Executors 1..n, each executor backed by BLAS; in iteration k the parameters Wk flow out to the executors and gradients flow back, producing Wk+1]
Step 1: Each executor gets the parameters Wk from the Master
Step 2: Each executor computes a gradient on its data
Step 3: Executors send their gradients to the Master
Step 4: The Master computes Wk+1 from the gradients
BLAS: Basic Linear Algebra Subprograms, used in Spark through NetLib-java
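A minimal sketch of this loop on a Spark RDD (the linear model, its gradient, and the tiny inlined data set are illustrative assumptions; a real implementation mini-batches and optimizes heavily):

import org.apache.spark.sql.SparkSession

// Gradient of a squared-error linear model on one example (illustrative).
def gradient(w: Array[Double], x: Array[Double], y: Double): Array[Double] = {
  val err = x.zip(w).map { case (xi, wi) => xi * wi }.sum - y
  x.map(_ * err)
}

val spark = SparkSession.builder.appName("sgd-sketch").getOrCreate()
// Toy shard-able data, consistent with true weights (1, 2, 3).
val data = spark.sparkContext.parallelize(Seq(
  (Array(1.0, 2.0, 3.0), 14.0),
  (Array(2.0, 0.0, 1.0), 5.0),
  (Array(0.0, 1.0, 1.0), 5.0)))

var w = Array.fill(3)(0.0) // Wk lives on the master (driver)
val eta = 0.01             // learning rate

for (k <- 1 to 100) {
  val wB = spark.sparkContext.broadcast(w)                 // Step 1: ship Wk out
  val (gradSum, count) = data
    .map { case (x, y) => (gradient(wB.value, x, y), 1L) } // Step 2: local gradients
    .reduce { case ((g1, n1), (g2, n2)) =>                 // Step 3: sum at the master
      (g1.zip(g2).map { case (a, b) => a + b }, n1 + n2) }
  w = w.zip(gradSum).map { case (wi, gi) => wi - eta * gi / count } // Step 4: Wk+1
}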
MultilayerPerceptronClassifier() in Spark ML
Scala Code
val digits: DataFrame = sqlContext.read.format("libsvm").load("/data/mnist")
val mlp = new MultilayerPerceptronClassifier()
  .setLayers(Array(784, 300, 100, 10))
  .setBlockSize(128)
val model = mlp.fit(digits)
In the layer array: 784 features (input), a hidden layer with 300 neurons, a hidden layer with 100 neurons, and 10 classes (output).
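Once fitted, the model can score new data. A hedged continuation (standard Spark ML calls, not shown on the slide; the train/test split and the "accuracy" metric, available since Spark 2.0, are my additions):

import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Hold out a test set, score it, and measure accuracy.
val Array(train, test) = digits.randomSplit(Array(0.8, 0.2), seed = 42)
val fitted = mlp.fit(train)
val predictions = fitted.transform(test) // adds a "prediction" column
val accuracy = new MulticlassClassificationEvaluator()
  .setMetricName("accuracy")
  .evaluate(predictions)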
SparkNet: Training Deep Networks in Spark
[Diagram: a Master and Executors 1-4, each executor running Caffe on its own GPU and reading its own Data Shard 1-4]
1. Master broadcasts the model parameters
2. Each executor runs SGD on a mini-batch for a fixed time / number of iterations
3. Each executor sends its parameters to the master
4. Master receives the parameters from the executors
5. Master averages them to get the new parameters
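Note the contrast with the gradient-aggregating loop shown earlier: SparkNet ships whole parameter vectors after many local SGD steps and averages them, trading a little statistical efficiency for far less communication. A minimal sketch of the averaging step (names and values are illustrative):

// Average the parameter vectors returned by the executors (step 5).
def average(models: Seq[Array[Double]]): Array[Double] = {
  val n = models.length.toDouble
  models.reduce((a, b) => a.zip(b).map { case (x, y) => x + y }).map(_ / n)
}

// e.g. parameters from four executors after their local SGD runs:
val fromExecutors = Seq(Array(0.9, 2.1), Array(1.1, 1.9),
                        Array(1.0, 2.0), Array(0.8, 2.2))
val wNext = average(fromExecutors) // Array(0.95, 2.05): the next broadcast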
Distributed Cross Validation
[Diagram: Model #1, Model #2, and Model #3 training in parallel; the Best Model is selected]
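In Spark ML this kind of parallel model selection maps onto ParamGridBuilder plus CrossValidator (standard API; the candidate layer configurations below are illustrative choices):

import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Try several architectures and keep the model that cross-validates best.
val grid = new ParamGridBuilder()
  .addGrid(mlp.layers, Array(
    Array(784, 300, 100, 10),
    Array(784, 100, 10),
    Array(784, 500, 10)))
  .build()

val cv = new CrossValidator()
  .setEstimator(mlp)
  .setEvaluator(new MulticlassClassificationEvaluator().setMetricName("accuracy"))
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)

val best = cv.fit(digits).bestModel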
Apache SINGA: A General Distributed Deep Learning Platform
Why “Deep Learning” on Spark?
• Sorry, I don't have a GPU / GPU cluster: a 3-to-5 node Spark cluster can be as fast as a GPU.
• Most of my application and data resides on a Spark cluster: integrate model training with the existing data-processing pipelines.
• High-throughput loading and pre-processing of data, and the ability to keep data in memory between operations.
• Hyper-parameter learning.
• Poor man's deep learning.
• It's simply fun ...