TRANSCRIPT
Optimizing Machine Learning Workloads on Intel® Platforms
Colfax International (colfaxresearch.com)
November 2016
colfaxresearch.com/ Welcome © Colfax International, 2013–2016
Disclaimer

While best efforts have been used in preparing this training, Colfax International makes no representations or warranties of any kind and assumes no liabilities of any kind with respect to the accuracy or completeness of the contents, and specifically disclaims any implied warranties of merchantability or fitness for a particular purpose. The publisher shall not be held liable or responsible to any person or entity with respect to any loss or incidental or consequential damages caused, or alleged to have been caused, directly or indirectly, by the information or programs contained herein. No warranty may be created or extended by sales representatives or written sales materials.
Colfax Research
http://colfaxresearch.com/
§2. Code Modernization
What is Code Modernization?

Code Modernization: optimizing software to better utilize features available in modern computer architectures.

▷ Scalar tuning: what goes on in the pipeline?
▷ Threading: do cores cooperate efficiently?
▷ Vectorization: is SIMD parallelism used well?
▷ Memory: is cache usage maximized, or RAM access streamlined?
▷ Communication: can coordination in a distributed or heterogeneous system be improved?
Case Study: VGG-Net on Torch

[Figure: Optimization of NeuralTalk2, performance in images/s after successive optimization steps (Original, Intel Compiler+MKL, Middleware Changes, User Code Changes, Parallel Strategy, MCDRAM as Cache).
Intel® Xeon® processor E5-2650 v4 (2 sockets): 0.91, 1.5, 11, 15, 25 images/s (28x overall).
Intel® Xeon Phi™ processor 7210 (KNL): 5.7, 10, 21, 28 images/s (55x overall).]

See the Colfax Research summary paper.
Intel Python Performance

[Figure: Intel Python on Knights Landing processors (N=5000). Relative performance, with CPython+SciPy as the 1.0 baseline:

                              CPython, SciPy   CPython, NumPy   Intel Python, SciPy
  LU Decomposition                  1.0              3.5               29.0
  Cholesky Decomposition            1.0              3.6               17.0
  Singular Value Decomposition      1.0              1.1                8.3
  DGEMM                             1.0              7.0              154.0 ]

Portal: software.intel.com/intel-distribution-for-python. See also: CR paper.
Three Approaches

High Level Approach: use high-level libraries that are pre-optimized for modern architectures.
▷ IntelCaffe, TensorFlow, Scikit-learn, etc.

Low Level Approach: apply code modernization techniques to frameworks/applications.
▷ Colfax Research website, HOW series, Intel Modern Code page, etc.

Middle Ground Approach: integrate pre-optimized kernels into frameworks/applications.
▷ Intel® MKL DNN primitives, Intel® DAAL, etc.
§3. The High Level Approach
Intel Libraries for Machine Learning

[Figure: Caffe forward/backward performance with minibatch 64, BVLC Caffe vs. IntelCaffe.

  LeNet (CIFAR-10), k img/s:          BVLC     Intel
    Intel Xeon Phi processor          0.15k    13.27k
    Broadwell Intel Xeon processor    0.75k    25.16k

  VGG-16 (ImageNet), img/s:           BVLC     Intel
    Intel Xeon Phi processor          0.91     54.40
    Broadwell Intel Xeon processor    3.82     28.57 ]
References for Intel Machine Learning Libraries

▷ Intel MKL (https://software.intel.com/en-us/intel-mkl)
▷ Intel® MKL-DNN (https://github.com/01org/MKL-DNN)
▷ IntelCaffe (https://github.com/intel/caffe)
▷ Intel Theano (https://github.com/intel/theano)
▷ Intel DAAL (https://software.intel.com/en-us/intel-daal)
▷ Intel Torch (https://github.com/xhzhao/Optimized-Torch)
▷ Intel Python (https://software.intel.com/en-us/intel-distribution-for-python)
  • Scikit-learn, NumPy, SciPy, etc.
▷ And more coming...
  • TensorFlow, CNTK, etc.
Intel Distribution for Python

[Diagram: packages such as SciPy and Caffe sit on top of the Intel Distribution for Python, which in turn draws on the Intel Math Kernel Library and Intel DAAL.]

Portal: software.intel.com/intel-distribution-for-python. See also: CR paper.
§4. Low Level Approach
Optimization Areas

▷ Scalar tuning: what goes on in the pipeline?
▷ Threading: do cores cooperate efficiently?
▷ Vectorization: is SIMD parallelism used well?
▷ Memory: is cache usage maximized, or RAM access streamlined?
▷ Communication: can coordination in a distributed or heterogeneous system be improved?
Case Study: VGG-Net on Torch

[Figure: the NeuralTalk2 optimization chart shown earlier: images/s after each step (Original, Intel Compiler+MKL, Middleware Changes, User Code Changes, Parallel Strategy, MCDRAM as Cache); 28x overall on the 2-socket Intel® Xeon® E5-2650 v4, 55x overall on the Intel® Xeon Phi™ 7210 (KNL).]

See the Colfax Research summary paper.
Base Torch Performance

[Figure: computed performance at 64 threads, images/s vs. batch count (10 to 60 images).]

Time spent by layer:
▷ ReLU: 66%
▷ Conv: 30%
▷ MaxPool: 3%
▷ Other: <1%
Performance After ReLU Optimization

[Figure: images/s vs. batch count (10 to 60 images) for the original and ReLU-optimized code, both at 64 threads. The ReLU optimization gives that layer a roughly 160x boost.]

Time spent by layer:
▷ ReLU: 1%
▷ Conv: 85%
▷ MaxPool: 11%
▷ Other: 3%
FALCON Paper

https://colfaxresearch.com/falcon-library/
Learn More
Colfax Research

http://colfaxresearch.com/

→ HowSeries.com
§5. The Middle Ground Approach
Intel MKL and Intel MKL-DNN

Slide credit: Intel Corp.
Stand-alone Example: Convolution

// Create the MKL DNN primitive object
dnnPrimitive_t convFwd;
dnnConvolutionCreateForward_F32(&convFwd, NULL, dnnAlgorithmConvolutionDirect,
                                dim, input_dims, output_dims, filter_dims,
                                conv_strides, padding, dnnBorderZeros);

// Create the needed data buffers
void* conv_res[dnnResourceNumber];
conv_res[dnnResourceSrc]    = (void*) input;
conv_res[dnnResourceFilter] = (void*) filter;
conv_res[dnnResourceDst]    = (void*) output;

// Execute the workload
dnnExecute_F32(convFwd, conv_res);
For more: Intel MKL documentation on DNN primitives
Example Integration: IntelCaffe

GitHub link: https://github.com/intel/caffe/
Example layer implementations: caffe/src/caffe/layers/mkl_*.cpp

// Grab parameters from the Caffe layer
PoolingParameter pool_param = this->layer_param_.pooling_param();
channels_ = bottom[0]->channels();
height_   = bottom[0]->height();
width_    = bottom[0]->width();
num_      = bottom[0]->num();
// ... //
kernel_h_ = pool_param.kernel_h(); kernel_w_ = pool_param.kernel_w();
// ... //

// Create the math kernel from these parameters
status = dnnPoolingCreateForward<Dtype>( /* ... */ );
§6. Distributed Memory Computation
"FLOPs Are Cheap"?
Theoretical estimates, Intel® Xeon® E5-2697 v3 processor:

Performance = 28 cores × 2.7 GHz × (256/64) vector lanes × 2 (FMA) × 2 FPUs ≈ 1.2 TFLOP/s
Required data rate = 1.2 TFLOP/s × 8 bytes ≈ 10 TB/s
OPA max bandwidth = 12.5 GB/s ≈ 0.01 TB/s
Ratio = 10 / 0.01 ≈ 1000 FLOPs per data element transferred

In short, the difficulty of distributed computation: in the time it takes to transfer one data element, a processor can perform thousands of operations on it.
Distributed Computation for Neural Networks

[Diagram: two parallelization schemes across node 1 and node 2. In the data-parallel scheme, each node runs the full forward/backward/loss/update cycle on its own data, and gradients are gathered across nodes. In the model-parallel scheme, the model is split across nodes and partial results are exchanged during the forward/backward passes.

Data parallel: gradients are transferred, but not data. Model parallel: data is transferred, but not gradients.]
Caffe Scaling

Source: Intel® Corporation, "Caffe* Training on Multi-node Distributed-memory Systems Based on Intel® Xeon® Processor E5 Family".
Machine Learning Framework: Intel® DAAL
Algorithms in DAAL

Analysis:
- Low Order Moments
- Quantile
- Correlation and Variance
- Cosine Distance Matrix
- Correlation Distance Matrix
- K-Means Clustering
- Principal Component Analysis
- Cholesky Decomposition
- Singular Value Decomposition
- QR Decomposition
- Expectation-Maximization
- Multivariate Outlier Detection
- Univariate Outlier Detection
- Association Rules
- Kernel Functions
- Quality Metrics

Training & Prediction:
- Regression
  - Linear/Ridge Regression
- Classification
  - Naive Bayes Classifier
  - Boosting
  - SVM
  - Multi-Class Classifier
- Neural Networks

Portal: DAAL page. See also: intro article, CR papers.
Algorithms in DAAL

[Diagram: three computation modes.
- Batch mode: the full computation runs on one complete data set and produces the final result.
- Online mode: partial computations run on successive chunks of the data set and are combined into the final result.
- Distributed mode: partial computations run on separate data sets (e.g., on different nodes), and their partial results are combined into the final result.]

Portal: DAAL page. See also: intro article, CR papers.
Communication Framework: MPI
Structure of MPI Applications: Hello World

#include "mpi.h"
#include <cstdio>
int main (int argc, char *argv[]) {
  MPI_Init (&argc, &argv);                  // Initialize MPI environment
  int rank, size, namelen;
  char name[MPI_MAX_PROCESSOR_NAME];
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);    // ID of current process
  MPI_Get_processor_name (name, &namelen);  // Hostname of node
  MPI_Comm_size (MPI_COMM_WORLD, &size);    // Number of processes
  printf ("Hello World from rank %d running on %s!\n", rank, name);
  if (rank == 0) printf("MPI World size = %d processes\n", size);
  MPI_Finalize ();                          // Terminate MPI environment
}
Collective Communication: Gather

int MPI_Gather(void *sendbuf, int sendcnt, MPI_Datatype sendtype,
               void *recvbuf, int recvcnt, MPI_Datatype recvtype,
               int root, MPI_Comm comm);

[Diagram: in a gather, every rank sends its own data, and the root rank receives all of the data.]
Collective Communication: Broadcast

int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype,
              int root, MPI_Comm comm);

[Diagram: in a broadcast, the root rank sends its data, and every other rank receives a copy.]
Implementation
Example Distributed Image Processing: DAAL

▷ Algorithm <step1Local> is responsible for the forward/backward propagation.

training::Distributed<step1Local> local_net;    // local net algorithm
local_net.compute();                            // forward/backward
part_res = local_net.getPartialResult();        // get the partial result
local_net.input.get(training::inputModel)
    ->setWeightsAndBiases(wb);                  // update the weights/biases

▷ Algorithm <step2Master> is responsible for accumulating the gradient.

training::Distributed<step2Master> master_net;  // master net algorithm
master_net.input.add(training::partialResults,  // add a partial result
                     0, part_res);
master_net.compute();                           // accumulate gradients
wbModel = master_net.getPartialResult()         // get the current model
    ->get(training::resultFromMaster)
    ->get(training::model);
wb = wbModel->getWeightsAndBiases();            // extract weights/biases
Example Distributed Image Processing (Part 1)

// Computation part of the node with the master net
// Local forward and backward propagation
local_net.compute();
part_res[master_node_id] = local_net.getPartialResult();

// ... Code to store the result into a buffer (char *) ... //

// Send the result to the master node
MPI_Gather(....);

// ... Code to reconstruct the partial results from the buffer ... //

// Accumulate the partial results from all nodes
for (int i = 0; i < num_nodes; i++)
  master_net.input.add(training::partialResults, i, part_res[i]);
master_net.compute();
Example Distributed Image Processing (Part 2)

// ... Continuing on the master compute ... //

// Extract the weights/biases from the master net
training::ModelPtr wbModel = master_net.getPartialResult()
    ->get(training::resultFromMaster)
    ->get(training::model);
NumericTablePtr wb = wbModel->getWeightsAndBiases();

// ... Code to store the weights/biases into a buffer (char *) ... //

// Broadcast the weights/biases to all nodes //
MPI_Bcast(.....);

// ... Code to reconstruct the weights/biases from the buffer ... //

// Update the weights on the local node
local_net.input.get(training::inputModel)->setWeightsAndBiases(wb);
Parallel Efficiency

[Figure: speedup vs. number of nodes (1 to 4) for distributed LeNet training, compared against theoretical linear scaling. Parallel efficiency: 93% on 2 nodes, 91% on 3 nodes, 87% on 4 nodes.]

Further performance optimizations and model parallelism are coming soon...
§7. Final Words
Colfax Research

http://colfaxresearch.com/

Thank you for your attention!
Join us at Booth #2407 at SC16!