Optimizing Machine Learning Workloads on Intel® Platforms
Colfax International — colfaxresearch.com
November 2016
© Colfax International, 2013–2016

Page 1:

Optimizing Machine Learning Workloads on Intel® Platforms
Colfax International — colfaxresearch.com
November 2016

Page 2:

Disclaimer

While best efforts have been used in preparing this training, Colfax International makes no representations or warranties of any kind and assumes no liabilities of any kind with respect to the accuracy or completeness of the contents, and specifically disclaims any implied warranties of merchantability or fitness for a particular purpose. The publisher shall not be held liable or responsible to any person or entity with respect to any loss or incidental or consequential damages caused, or alleged to have been caused, directly or indirectly, by the information or programs contained herein. No warranty may be created or extended by sales representatives or written sales materials.

Page 3:

Colfax Research

http://colfaxresearch.com/

Page 4:

§2. Code Modernization

Page 5:

What is Code Modernization?

Code Modernization: optimizing software to better utilize features available in modern computer architectures.

▷ Scalar Tuning: what goes on in the pipeline?
▷ Threading: do cores cooperate efficiently?
▷ Vectorization: is SIMD parallelism used well?
▷ Memory: is cache usage maximized, or RAM access streamlined?
▷ Communication: can coordination in a distributed or heterogeneous system be improved?

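To make the vectorization item concrete, here is a small illustrative sketch (ours, not from the deck): a loop with unit-stride access and no loop-carried dependence, which a compiler can map onto SIMD lanes. The `#pragma omp simd` hint is honored when OpenMP is enabled and ignored otherwise; the function name is hypothetical.

```cpp
#include <cstddef>

// SAXPY-style loop: unit-stride access and independent iterations make it
// straightforward for the compiler to vectorize.
void scale_add(const float* a, const float* b, float* out,
               float alpha, std::size_t n) {
#pragma omp simd
  for (std::size_t i = 0; i < n; i++)
    out[i] = alpha * a[i] + b[i];
}
```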

Page 6:

Case Study: VGG-Net on Torch

[Chart: Optimization of NeuralTalk2, performance in images/s across the stages Original, Intel Compiler+MKL, Middleware Changes, User Code Changes, Parallel Strategy, MCDRAM as Cache. Intel® Xeon® processor E5-2650 v4 (2 sockets): 0.91, 1.5, 11, 15, 25 images/s. Intel® Xeon Phi™ processor 7210 (KNL): 5.7, 10, 21, 28 images/s. Reported speedups: 28x and 55x. See the Colfax Research summary paper.]

Page 7:

Intel Python Performance

[Chart: Intel Python on Knights Landing processors (N=5000), relative performance (CPython + SciPy = 1.0):

Benchmark                    | CPython, SciPy | CPython, NumPy | Intel Python, SciPy
LU Decomposition             | 1.0            | 3.5            | 29.0
Cholesky Decomposition       | 1.0            | 3.6            | 17.0
Singular Value Decomposition | 1.0            | 1.1            | 8.3
DGEMM                        | 1.0            | 7.0            | 154.0]

Portal: software.intel.com/intel-distribution-for-python. See also: CR paper.

Page 8:

Three Approaches

High Level Approach: use high-level libraries that are pre-optimized for modern architectures.
▷ IntelCaffe, TensorFlow, Scikit-learn, etc.

Low Level Approach: apply code modernization techniques to frameworks/applications.
▷ Colfax Research website, HOW series, Intel Modern Code page, etc.

Middle Ground Approach: integrate pre-optimized kernels into frameworks/applications.
▷ Intel® MKL DNN primitives, Intel® MKL-DNN, Intel® DAAL, etc.

Page 9:

§3. The High Level Approach

Page 10:

Intel Libraries for Machine Learning

[Charts: BVLC Caffe vs. IntelCaffe, forward/backward performance at minibatch 64, on an Intel® Xeon Phi™ processor and a Broadwell-generation Intel® Xeon® processor. LeNet (Cifar10): 0.15k and 0.75k img/s with BVLC vs. 13.27k and 25.16k img/s with Intel. VGG16 (ImageNet): 0.91 and 3.82 img/s with BVLC vs. 54.40 and 28.57 img/s with Intel.]

Page 11:

References for Intel Machine Learning Libraries

▷ Intel MKL (https://software.intel.com/en-us/intel-mkl)
▷ Intel MKL-DNN (https://github.com/01org/MKL-DNN)
▷ IntelCaffe (https://github.com/intel/caffe)
▷ Intel Theano (https://github.com/intel/theano)
▷ Intel DAAL (https://software.intel.com/en-us/intel-daal)
▷ Intel Torch (https://github.com/xhzhao/Optimized-Torch)
▷ Intel Python (https://software.intel.com/en-us/intel-distribution-for-python)
  • Scikit-learn, NumPy, SciPy, etc.
▷ And more coming...
  • TensorFlow, CNTK, etc.

Page 12:

Intel Distribution for Python

[Diagram: packages in the Intel Distribution for Python, such as SciPy and Caffe, are accelerated through the Intel Math Kernel Library and Intel DAAL.]

Portal: software.intel.com/intel-distribution-for-python. See also: CR paper.

Page 13:

§4. Low Level Approach

Page 14:

Optimization Areas

▷ Scalar Tuning: what goes on in the pipeline?
▷ Threading: do cores cooperate efficiently?
▷ Vectorization: is SIMD parallelism used well?
▷ Memory: is cache usage maximized, or RAM access streamlined?
▷ Communication: can coordination in a distributed or heterogeneous system be improved?

Page 15:

Case Study: VGG-Net on Torch

[Chart: Optimization of NeuralTalk2, performance in images/s across the stages Original, Intel Compiler+MKL, Middleware Changes, User Code Changes, Parallel Strategy, MCDRAM as Cache. Intel® Xeon® processor E5-2650 v4 (2 sockets): 0.91, 1.5, 11, 15, 25 images/s. Intel® Xeon Phi™ processor 7210 (KNL): 5.7, 10, 21, 28 images/s. Reported speedups: 28x and 55x. See the Colfax Research summary paper.]

Page 16:

Base Torch Performance

[Chart: computed performance (64 threads), images/s vs. batch count (10–60 images).]

Time spent by layer:
▷ ReLU: 66%
▷ Conv: 30%
▷ MaxPool: 3%
▷ Other: <1%

Page 17:

Performance After ReLU Optimization

[Chart: images/s vs. batch count (10–60 images), Original (64 threads) vs. ReLU optimized (64 threads). The ReLU layer itself: 160x boost.]

Time spent by layer:
▷ ReLU: 1%
▷ Conv: 85%
▷ MaxPool: 11%
▷ Other: 3%
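The deck does not reproduce the optimized ReLU code. As a hedged sketch of the kind of rewrite involved (illustrative only, not the actual patch), a branch-free, in-place loop over a contiguous buffer vectorizes and parallelizes well, in contrast to per-element dispatch through generic apply machinery:

```cpp
#include <algorithm>
#include <cstddef>

// In-place ReLU over a contiguous buffer. A simple, branch-free loop like
// this is easy for the compiler to vectorize and to split across threads.
void relu_inplace(float* x, std::size_t n) {
  for (std::size_t i = 0; i < n; i++)
    x[i] = std::max(x[i], 0.0f);
}
```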

Page 18:

FALCON paper

https://colfaxresearch.com/falcon-library/

Page 19:

Learn More

Page 20:

Colfax Research

http://colfaxresearch.com/

Page 21:

→ HowSeries.com

Page 22:

§5. The Middle Ground Approach

Page 23:

Intel MKL and Intel MKL-DNN

Slide credit: Intel Corp.

Page 24:

Stand-alone Example: Convolution

// Creating MKL DNN primitive object
dnnPrimitive_t convFwd;
dnnConvolutionCreateForward_F32(&convFwd, NULL, dnnAlgorithmConvolutionDirect,
                                dim, input_dims, output_dims, filter_dims,
                                conv_strides, padding, dnnBorderZeros);

// Creating the needed data buffers
void* conv_res[dnnResourceNumber];
conv_res[dnnResourceSrc]    = (void*) input;
conv_res[dnnResourceFilter] = (void*) filter;
conv_res[dnnResourceDst]    = (void*) output;

// Execute the workload
dnnExecute_F32(convFwd, conv_res);

For more: Intel MKL documentation on DNN primitives
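For readers who want the semantics behind the primitive, here is an unoptimized reference convolution in plain C++ (single channel, stride 1, no padding); the MKL primitive computes this kind of result in an optimized, layout-aware form. All names below are ours, not the MKL API's.

```cpp
#include <cstddef>

// Reference forward convolution (cross-correlation, as in CNNs) for one
// channel, stride 1, no padding. Output size is (H-K+1) x (W-K+1).
void conv2d_naive(const float* input, std::size_t H, std::size_t W,
                  const float* filter, std::size_t K,
                  float* output) {
  const std::size_t OH = H - K + 1, OW = W - K + 1;
  for (std::size_t oy = 0; oy < OH; oy++)
    for (std::size_t ox = 0; ox < OW; ox++) {
      float acc = 0.0f;
      for (std::size_t ky = 0; ky < K; ky++)
        for (std::size_t kx = 0; kx < K; kx++)
          acc += input[(oy + ky) * W + (ox + kx)] * filter[ky * K + kx];
      output[oy * OW + ox] = acc;
    }
}
```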


Page 25:

Example Integration: IntelCaffe

GitHub link: https://github.com/intel/caffe/
Example layer implementations: caffe/src/caffe/layers/mkl_*.cpp

// Grabbing parameters from Caffe layers
PoolingParameter pool_param = this->layer_param_.pooling_param();
channels_ = bottom[0]->channels();
height_   = bottom[0]->height();
width_    = bottom[0]->width();
num_      = bottom[0]->num();
// ... //
kernel_h_ = pool_param.kernel_h(); kernel_w_ = pool_param.kernel_w();
// ..... //

// Creating the math kernel from these parameters
status = dnnPoolingCreateForward<Dtype>( /* ... */ );

Page 26:

§6. Distributed Memory Computation

Page 27:

"FLOPs Are Cheap"?

Theoretical estimates, Intel® Xeon® E5-2697 v3 processor:

Performance = 28 cores × 2.7 GHz × (256/64) vector lanes × 2 (FMA) × 2 FPUs ≈ 1.2 TFLOP/s
Required data rate = 1.2 TFLOP/s × 8 bytes ≈ 10 TB/s
OPA max bandwidth = 12.5 GB/s ≈ 0.01 TB/s
Ratio = 10 / 0.01 ≈ 1000 FLOPs per data element transferred

In short, the difficulty of distributed computation: in the time it takes to transfer one data element, a processor can perform on the order of a thousand operations on it.
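The arithmetic above can be checked mechanically. A minimal sketch (the exact quotient is about 774; the slide works with rounded figures, 10 TB/s over 0.01 TB/s, to get roughly 1000):

```cpp
// Reproducing the back-of-the-envelope estimate: peak FLOP rate divided by
// the rate at which 8-byte elements cross the fabric.
double flops_per_element_transferred() {
  const double cores = 28.0, clock_hz = 2.7e9;
  const double vector_lanes = 256.0 / 64.0;   // 256-bit vectors, 64-bit data
  const double fma = 2.0, fpus = 2.0;
  const double peak_flops = cores * clock_hz * vector_lanes * fma * fpus;
  const double fabric_bytes_per_s = 12.5e9;   // Omni-Path, ~0.01 TB/s
  const double elements_per_s = fabric_bytes_per_s / 8.0;
  return peak_flops / elements_per_s;         // ~774, order of 1000
}
```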


Page 28:

Distributed Computation for Neural Networks

[Diagram: two distribution schemes, each with node 1 and node 2 running the Forward → Backward → Loss → Update cycle. Data Parallel: nodes gather gradients from one another; the gradient is transferred but not the data. Model Parallel: nodes exchange partial results; the data is transferred but not the gradient.]
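As an illustrative, MPI-free sketch of the data-parallel scheme (all names are ours): each node computes a gradient on its own data shard, the master averages the gathered gradients and applies one update, and the resulting weights are shared back with every node.

```cpp
#include <cstddef>
#include <vector>

// Serial simulation of one data-parallel step: gather per-node gradients,
// average them on the master, apply an SGD update, and return the weights
// that would then be broadcast back to every node.
std::vector<float> data_parallel_step(
    const std::vector<std::vector<float>>& node_gradients,
    std::vector<float> weights, float learning_rate) {
  std::vector<float> total(weights.size(), 0.0f);
  for (const auto& grad : node_gradients)          // "gather gradients"
    for (std::size_t i = 0; i < grad.size(); i++)
      total[i] += grad[i];
  const float scale = learning_rate / node_gradients.size();
  for (std::size_t i = 0; i < weights.size(); i++) // update on the master
    weights[i] -= scale * total[i];
  return weights;                                  // "broadcast" result
}
```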

Page 29:

Caffe Scaling

Source: Intel® Corporation ("Caffe* Training on Multi-node Distributed-memory Systems Based on Intel® Xeon® Processor E5 Family").

Page 30:

Machine Learning Framework: Intel® DAAL

Page 31:

Algorithms in DAAL

Analysis:
- Low Order Moments
- Quantile
- Correlation and Variance
- Cosine Distance Matrix
- Correlation Distance Matrix
- K-Means Clustering
- Principal Component Analysis
- Cholesky Decomposition
- Singular Value Decomposition
- QR Decomposition
- Expectation-Maximization
- Multivariate Outlier Detection
- Univariate Outlier Detection
- Association Rules
- Kernel Functions
- Quality Metrics

Training & prediction:
- Regression: Linear/Ridge Regression
- Classification: Naive Bayes Classifier, Boosting, SVM, Multi-Class Classifier
- Neural Networks

Portal: DAAL page. See also: intro article, CR papers.

Page 32:

Algorithms in DAAL

[Diagram: three DAAL computation modes. Batch Mode: a full computation over one data set yields the final result. Online Mode: successive data blocks feed partial computations that are folded into the final result. Distributed Mode: partial computations on separate data sets are combined into the final result.]

Portal: DAAL page. See also: intro article, CR papers.
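The online mode can be illustrated with a trivial statistic (the class below is our sketch, not DAAL's API): partial per-block computations are accumulated and finalized once, matching what a single batch pass over the whole data set would produce.

```cpp
#include <cstddef>
#include <vector>

// Online-mode sketch: each data block contributes a partial computation;
// finalize() folds the partial results into the final answer (the mean).
struct OnlineMean {
  double sum = 0.0;
  std::size_t count = 0;

  void partial_compute(const std::vector<double>& block) {
    for (double x : block) { sum += x; ++count; }
  }
  double finalize() const { return sum / static_cast<double>(count); }
};
```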

Page 33:

Communication Framework: MPI

Page 34:

Structure of MPI Applications: Hello World

#include "mpi.h"
#include <cstdio>

int main (int argc, char *argv[]) {
  MPI_Init (&argc, &argv);                 // Initialize MPI environment
  int rank, size, namelen;
  char name[MPI_MAX_PROCESSOR_NAME];
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);   // ID of current process
  MPI_Get_processor_name (name, &namelen); // Hostname of node
  MPI_Comm_size (MPI_COMM_WORLD, &size);   // Number of processes
  printf ("Hello World from rank %d running on %s!\n", rank, name);
  if (rank == 0) printf("MPI World size = %d processes\n", size);
  MPI_Finalize ();                         // Terminate MPI environment
}

Page 35:

Collective Communication: Gather

int MPI_Gather(void *sendbuf, int sendcnt, MPI_Datatype sendtype,
               void *recvbuf, int recvcnt, MPI_Datatype recvtype,
               int root, MPI_Comm comm);

[Diagram: Gather. Each of four sender processes holds its own data; the receiver (root) collects every piece into a single buffer.]

Page 36:

Collective Communication: Broadcast

int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype,
              int root, MPI_Comm comm);

[Diagram: Broadcast. The sender (root) holds the data; after the call, every receiver has a copy.]

Page 37:

Implementation

Page 38:

Example Distributed Image Processing: DAAL

▷ Algorithm <step1Local> is responsible for the forward/backward propagation.

training::Distributed<step1Local> local_net;   // local net algorithm
local_net.compute();                           // forward/backward
part_res = local_net.getPartialResult();       // getting partial result
local_net.input.get(training::inputModel)
    ->setWeightsAndBiases(wb);                 // update the weights/biases

▷ Algorithm <step2Master> is responsible for accumulating the gradient.

training::Distributed<step2Master> master_net; // master net algorithm
master_net.input.add(training::partialResults, // add partial result
                     0, part_res);
master_net.compute();                          // accumulate gradients
wbModel = master_net.getPartialResult()        // get current model
    ->get(training::resultFromMaster)
    ->get(training::model);
wb = wbModel->getWeightsAndBiases();           // extract weights/biases

Page 39:

Example Distributed Image Processing (Part 1)

// Computation part of the node with the master net
// Local forward and backward propagation
local_net.compute();
part_res[master_node_id] = local_net.getPartialResult();

// ... Code to store the result into a buffer (char *) ... //

// Send the result to the master node
MPI_Gather(....);

// ... Code to reconstruct the partial results from the buffer ... //

// Accumulate the partial results from all nodes
for (int i = 0; i < num_nodes; i++)
  master_net.input.add(training::partialResults, i, part_res[i]);
master_net.compute();

Page 40:

Example Distributed Image Processing (Part 2)

// ... Continuing on the master compute ... //

// Extract the weights/bias from the master net
training::ModelPtr wbModel = master_net.getPartialResult()
    ->get(training::resultFromMaster)
    ->get(training::model);
NumericTablePtr wb = wbModel->getWeightsAndBiases();

// ... Code to store weights/bias into a buffer (char*) ... //

// Broadcast the weights/bias to all nodes //
MPI_Bcast(.....);

// ... Code to reconstruct the weights/bias from the buffer ... //

// Update the weights on the local node
local_net.input.get(training::inputModel)->setWeightsAndBiases(wb);

Page 41:

Parallel Efficiency

[Chart: parallel efficiency of distributed LeNet training vs. number of nodes (1 to 4), against theoretical linear scaling: 93% at 2 nodes, 91% at 3 nodes, 87% at 4 nodes.]

Further performance optimizations and model parallelism are coming soon...
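Parallel efficiency here is speedup divided by node count; for example, a 1.86x speedup on 2 nodes gives 93% efficiency. A minimal sketch:

```cpp
// Parallel efficiency: how close a measured speedup comes to linear scaling.
double parallel_efficiency(double speedup, int nodes) {
  return speedup / static_cast<double>(nodes);
}
```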


Page 42:

§7. Final Words

Page 43:

Colfax Research

http://colfaxresearch.com/

Page 44:

Thank you for your Attention!

Join us at Booth #2407 at SC16!
