TRANSCRIPT
Optimizing Machine Learning Workloads on Intel® Platforms
Colfax International (colfaxresearch.com)
November 2016
colfaxresearch.com/ Welcome © Colfax International, 2013–2016
Disclaimer

While best efforts have been used in preparing this training, Colfax International makes no representations or warranties of any kind and assumes no liabilities of any kind with respect to the accuracy or completeness of the contents, and specifically disclaims any implied warranties of merchantability or fitness for a particular purpose. The publisher shall not be held liable or responsible to any person or entity with respect to any loss or incidental or consequential damages caused, or alleged to have been caused, directly or indirectly, by the information or programs contained herein. No warranty may be created or extended by sales representatives or written sales materials.
Colfax Research
http://colfaxresearch.com/
§2. Code Modernization
What is Code Modernization?

Code Modernization: optimizing software to better utilize features available in modern computer architectures.

▷ Scalar tuning: what goes on in the pipeline?
▷ Threading: do cores cooperate efficiently?
▷ Vectorization: is SIMD parallelism used well?
▷ Memory: is cache usage maximized, or RAM access streamlined?
▷ Communication: can coordination in a distributed or heterogeneous system be improved?
Case Study: VGG-Net on Torch

[Figure: Optimization of NeuralTalk2, performance in images/s after successive optimization steps (Original, Intel Compiler+MKL, Middleware Changes, User Code Changes, Parallel Strategy, MCDRAM as Cache).
Intel® Xeon® processor E5-2650 v4 (2 sockets): 0.91, 1.5, 11, 15, 25 images/s (28x overall).
Intel® Xeon Phi™ processor 7210 (KNL): 5.7, 10, 21, 28 images/s (55x overall).]

See the Colfax Research summary paper.
Intel Python Performance

[Figure: Intel Python on Knights Landing processors (N=5000). Relative performance, with CPython+SciPy as the 1.0 baseline:

                              CPython, SciPy   CPython, NumPy   Intel Python, SciPy
  LU Decomposition                  1.0              3.5               29.0
  Cholesky Decomposition            1.0              3.6               17.0
  Singular Value Decomposition      1.0              1.1                8.3
  DGEMM                             1.0              7.0              154.0 ]

Portal: software.intel.com/intel-distribution-for-python. See also: CR paper.
Three Approaches

High Level Approach: use high-level libraries that are pre-optimized for modern architectures.
▷ IntelCaffe, TensorFlow, Scikit-learn, etc.

Low Level Approach: apply code modernization techniques to frameworks/applications.
▷ Colfax Research website, HOW series, Intel Modern Code page, etc.

Middle Ground Approach: integrate pre-optimized kernels into frameworks/applications.
▷ Intel® MKL DNN primitives, Intel® DAAL, etc.
§3. The High Level Approach
Intel Libraries for Machine Learning

[Figure: Caffe forward/backward performance with minibatch 64, BVLC Caffe vs. IntelCaffe.

  LeNet (CIFAR-10), k img/s:          BVLC     Intel
    Intel Xeon Phi processor          0.15k    13.27k
    Broadwell Intel Xeon processor    0.75k    25.16k

  VGG-16 (ImageNet), img/s:           BVLC     Intel
    Intel Xeon Phi processor          0.91     54.40
    Broadwell Intel Xeon processor    3.82     28.57 ]
References for Intel Machine Learning Libraries

▷ Intel MKL (https://software.intel.com/en-us/intel-mkl)
▷ Intel® MKL-DNN (https://github.com/01org/MKL-DNN)
▷ IntelCaffe (https://github.com/intel/caffe)
▷ Intel Theano (https://github.com/intel/theano)
▷ Intel DAAL (https://software.intel.com/en-us/intel-daal)
▷ Intel Torch (https://github.com/xhzhao/Optimized-Torch)
▷ Intel Python (https://software.intel.com/en-us/intel-distribution-for-python)
  • Scikit-learn, NumPy, SciPy, etc.
▷ And more coming...
  • TensorFlow, CNTK, etc.
Intel Distribution for Python

[Diagram: packages such as SciPy and Caffe sit on top of the Intel Distribution for Python, which in turn draws on the Intel Math Kernel Library and Intel DAAL.]

Portal: software.intel.com/intel-distribution-for-python. See also: CR paper.
§4. Low Level Approach
Optimization Areas

▷ Scalar tuning: what goes on in the pipeline?
▷ Threading: do cores cooperate efficiently?
▷ Vectorization: is SIMD parallelism used well?
▷ Memory: is cache usage maximized, or RAM access streamlined?
▷ Communication: can coordination in a distributed or heterogeneous system be improved?
Case Study: VGG-Net on Torch

[Figure: the NeuralTalk2 optimization chart shown earlier: images/s after each step (Original, Intel Compiler+MKL, Middleware Changes, User Code Changes, Parallel Strategy, MCDRAM as Cache); 28x overall on the 2-socket Intel® Xeon® E5-2650 v4, 55x overall on the Intel® Xeon Phi™ 7210 (KNL).]

See the Colfax Research summary paper.
Base Torch Performance

[Figure: computed performance at 64 threads, images/s vs. batch count (10 to 60 images).]

Time spent by layer:
▷ ReLU: 66%
▷ Conv: 30%
▷ MaxPool: 3%
▷ Other: <1%
Performance After ReLU Optimization

[Figure: images/s vs. batch count (10 to 60 images) for the original and ReLU-optimized code, both at 64 threads. The ReLU optimization gives that layer a roughly 160x boost.]

Time spent by layer:
▷ ReLU: 1%
▷ Conv: 85%
▷ MaxPool: 11%
▷ Other: 3%
FALCON Paper

https://colfaxresearch.com/falcon-library/
Learn More
Colfax Research

http://colfaxresearch.com/

→ HowSeries.com
§5. The Middle Ground Approach
Intel MKL and Intel MKL-DNN

Slide credit: Intel Corp.
Stand-alone Example: Convolution

// Create the MKL DNN primitive object
dnnPrimitive_t convFwd;
dnnConvolutionCreateForward_F32(&convFwd, NULL, dnnAlgorithmConvolutionDirect,
                                dim, input_dims, output_dims, filter_dims,
                                conv_strides, padding, dnnBorderZeros);

// Create the needed data buffers
void* conv_res[dnnResourceNumber];
conv_res[dnnResourceSrc]    = (void*) input;
conv_res[dnnResourceFilter] = (void*) filter;
conv_res[dnnResourceDst]    = (void*) output;

// Execute the workload
dnnExecute_F32(convFwd, conv_res);
For more: Intel MKL documentation on DNN primitives
Example Integration: IntelCaffe

GitHub link: https://github.com/intel/caffe/
Example layer implementations: caffe/src/caffe/layers/mkl_*.cpp

// Grab parameters from the Caffe layer
PoolingParameter pool_param = this->layer_param_.pooling_param();
channels_ = bottom[0]->channels();
height_   = bottom[0]->height();
width_    = bottom[0]->width();
num_      = bottom[0]->num();
// ... //
kernel_h_ = pool_param.kernel_h(); kernel_w_ = pool_param.kernel_w();
// ... //

// Create the math kernel from these parameters
status = dnnPoolingCreateForward<Dtype>( /* ... */ );
§6. Distributed Memory Computation
"FLOPs Are Cheap"?
Theoretical estimates, Intel® Xeon® E5-2697 v3 processor:

Performance = 28 cores × 2.7 GHz × (256/64) vector lanes × 2 (FMA) × 2 FPUs ≈ 1.2 TFLOP/s
Required data rate = 1.2 TFLOP/s × 8 bytes ≈ 10 TB/s
OPA max bandwidth = 12.5 GB/s ≈ 0.01 TB/s
Ratio = 10 / 0.01 ≈ 1000 FLOPs per data element transferred

In short, the difficulty of distributed computation: in the time it takes to transfer one data element, a processor can perform thousands of operations on it.
Distributed Computation for Neural Networks

[Diagram: two parallelization schemes across node 1 and node 2. In the data-parallel scheme, each node runs the full forward/backward/loss/update cycle on its own data, and gradients are gathered across nodes. In the model-parallel scheme, the model is split across nodes and partial results are exchanged during the forward/backward passes.

Data parallel: gradients are transferred, but not data. Model parallel: data is transferred, but not gradients.]
Caffe Scaling

Source: Intel® Corporation, "Caffe* Training on Multi-node Distributed-memory Systems Based on Intel® Xeon® Processor E5 Family".
Machine Learning Framework: Intel® DAAL
Algorithms in DAAL

Analysis:
- Low Order Moments
- Quantile
- Correlation and Variance
- Cosine Distance Matrix
- Correlation Distance Matrix
- K-Means Clustering
- Principal Component Analysis
- Cholesky Decomposition
- Singular Value Decomposition
- QR Decomposition
- Expectation-Maximization
- Multivariate Outlier Detection
- Univariate Outlier Detection
- Association Rules
- Kernel Functions
- Quality Metrics

Training & Prediction:
- Regression
  - Linear/Ridge Regression
- Classification
  - Naive Bayes Classifier
  - Boosting
  - SVM
  - Multi-Class Classifier
- Neural Networks

Portal: DAAL page. See also: intro article, CR papers.
Algorithms in DAAL

[Diagram: three computation modes.
- Batch mode: the full computation runs on one complete data set and produces the final result.
- Online mode: partial computations run on successive chunks of the data set and are combined into the final result.
- Distributed mode: partial computations run on separate data sets (e.g., on different nodes), and their partial results are combined into the final result.]

Portal: DAAL page. See also: intro article, CR papers.
Communication Framework: MPI
Structure of MPI Applications: Hello World

#include "mpi.h"
#include <cstdio>
int main (int argc, char *argv[]) {
  MPI_Init (&argc, &argv);                  // Initialize MPI environment
  int rank, size, namelen;
  char name[MPI_MAX_PROCESSOR_NAME];
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);    // ID of current process
  MPI_Get_processor_name (name, &namelen);  // Hostname of node
  MPI_Comm_size (MPI_COMM_WORLD, &size);    // Number of processes
  printf ("Hello World from rank %d running on %s!\n", rank, name);
  if (rank == 0) printf("MPI World size = %d processes\n", size);
  MPI_Finalize ();                          // Terminate MPI environment
}
Collective Communication: Gather

int MPI_Gather(void *sendbuf, int sendcnt, MPI_Datatype sendtype,
               void *recvbuf, int recvcnt, MPI_Datatype recvtype,
               int root, MPI_Comm comm);

[Diagram: in a gather, every rank sends its own data, and the root rank receives all of the data.]
Collective Communication: Broadcast

int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype,
              int root, MPI_Comm comm);

[Diagram: in a broadcast, the root rank sends its data, and every other rank receives a copy.]
Implementation
Example Distributed Image Processing: DAAL

▷ Algorithm <step1Local> is responsible for the forward/backward propagation.

training::Distributed<step1Local> local_net;    // local net algorithm
local_net.compute();                            // forward/backward
part_res = local_net.getPartialResult();        // get the partial result
local_net.input.get(training::inputModel)
    ->setWeightsAndBiases(wb);                  // update the weights/biases

▷ Algorithm <step2Master> is responsible for accumulating the gradient.

training::Distributed<step2Master> master_net;  // master net algorithm
master_net.input.add(training::partialResults,  // add a partial result
                     0, part_res);
master_net.compute();                           // accumulate gradients
wbModel = master_net.getPartialResult()         // get the current model
    ->get(training::resultFromMaster)
    ->get(training::model);
wb = wbModel->getWeightsAndBiases();            // extract weights/biases
Example Distributed Image Processing (Part 1)

// Computation part of the node with the master net
// Local forward and backward propagation
local_net.compute();
part_res[master_node_id] = local_net.getPartialResult();

// ... Code to store the result into a buffer (char *) ... //

// Send the result to the master node
MPI_Gather(....);

// ... Code to reconstruct the partial results from the buffer ... //

// Accumulate the partial results from all nodes
for (int i = 0; i < num_nodes; i++)
  master_net.input.add(training::partialResults, i, part_res[i]);
master_net.compute();
Example Distributed Image Processing (Part 2)

// ... Continuing on the master compute ... //

// Extract the weights/biases from the master net
training::ModelPtr wbModel = master_net.getPartialResult()
    ->get(training::resultFromMaster)
    ->get(training::model);
NumericTablePtr wb = wbModel->getWeightsAndBiases();

// ... Code to store the weights/biases into a buffer (char *) ... //

// Broadcast the weights/biases to all nodes //
MPI_Bcast(.....);

// ... Code to reconstruct the weights/biases from the buffer ... //

// Update the weights on the local node
local_net.input.get(training::inputModel)->setWeightsAndBiases(wb);
Parallel Efficiency

[Figure: speedup vs. number of nodes (1 to 4) for distributed LeNet training, compared against theoretical linear scaling. Parallel efficiency: 93% on 2 nodes, 91% on 3 nodes, 87% on 4 nodes.]

Further performance optimizations and model parallelism are coming soon...
§7. Final Words
Colfax Research

http://colfaxresearch.com/

Thank you for your attention!
Join us at Booth #2407 at SC16!