TRANSCRIPT
Deep Convolutional Network Evaluation on the Intel Xeon Phi
Gaurav Raina MSc Graduation Project
5-1-2016
Cameras are ubiquitous 1
Vision processing on mobile devices
• Currently most processing is done off-line
• High compute and energy demands
• Move to edge processing
2
Motivation
• Convolutional neural nets are very generic (support many vision tasks):
  • Traffic sign detection
  • Pedestrian detection
  • Face detection
• Accelerate with a power-efficient core
3
Problem statement
“Efficiently parallelize a Convolutional Neural Network on a highly-parallel power efficient processor platform”
4
You are here:
5
Overview
1. ConvNet algorithm
2. Optimization Approach
3. Mapping on the Core i7
4. Mapping on the Xeon Phi
5. Conclusion & Future Work
6
Overview
1. Convolution Network (ConvNet) algorithm
2. Optimization Approach
3. Mapping on the Core i7
4. Mapping on the Xeon Phi
5. Conclusion & Future Work
7
Introduction Neural Networks
• Artificial neuron model
8
Convolution example
9
Image credit: deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution
Speed sign detection application 10
Image courtesy: Maurice Peemen
ConvNet Application in action 11
Video courtesy: Maurice Peemen
ConvNet Code Structure
for (r = 0; r < 6; r++) {                  // "r": output feature maps (6)
    for (m = 0; m < YL1; m++) {
        for (n = 0; n < XL1; n++) {        // "m,n": neuron outputs
            acc = bias[r];
            for (k = 0; k < 6; k++) {      // "k,l": 6x6 convolution kernel
                for (l = 0; l < 6; l++) {
                    acc = acc + in_layer[m + k][n + l] * weight[r][k][l];
                }
            }
            index = saturate_shift(acc);   // 10-bit fixed-point format
            output_layer[r][m][n] = fixact[index];  // fixact: sigmoid activation LUT
        }
    }
}
12
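The loop nest above can be made runnable as a plain-C sketch. The layer sizes, the scaling shift, and the activation table below are illustrative stand-ins, not the trained network's values:

```c
#include <stdint.h>

enum { R = 6, Y = 8, X = 8, K = 6 };    /* hypothetical layer dimensions */

static int16_t fixact[1024];            /* activation LUT (sigmoid in the real code) */

/* Saturate a 32-bit accumulator into a 10-bit LUT index (illustrative scaling). */
static int saturate_shift(int32_t acc) {
    int32_t idx = acc >> 2;
    if (idx < 0)    idx = 0;
    if (idx > 1023) idx = 1023;
    return (int)idx;
}

static void convnet_layer(const uint8_t in[Y + K - 1][X + K - 1],
                          const int16_t w[R][K][K],
                          const int32_t bias[R],
                          int16_t out[R][Y][X]) {
    for (int r = 0; r < R; r++)                   /* output feature maps */
        for (int m = 0; m < Y; m++)
            for (int n = 0; n < X; n++) {         /* neuron outputs */
                int32_t acc = bias[r];
                for (int k = 0; k < K; k++)       /* compute: 6x6 MACs */
                    for (int l = 0; l < K; l++)
                        acc += in[m + k][n + l] * w[r][k][l];
                out[r][m][n] = fixact[saturate_shift(acc)];   /* store */
            }
}
```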
Overview
1. ConvNet algorithm
2. Optimization Approach
3. Mapping on the Core i7
4. Mapping on the Xeon Phi
5. Conclusion & Future Work
13
Optimization Approach
• Methodology:
  • Test on the Core i7 (Haswell, AVX2)
  • Move to the Xeon Phi (Knights Corner, IMCI)
• Steps:
  1. Loop unrolling
  2. Vectorization using SIMD intrinsics (DLP)
     − Fused Multiply-Add instruction
  3. Parallelization using OpenMP (TLP)
14
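Step 1 can be illustrated on the 6-tap inner kernel loop. A sketch with hypothetical buffers, showing how unrolling removes loop overhead and exposes six independent multiplies to the scheduler:

```c
#include <stdint.h>

/* Innermost 6-tap multiply-accumulate, rolled. */
static int32_t mac6(const uint8_t *in, const int16_t *w, int32_t acc) {
    for (int l = 0; l < 6; l++)
        acc += in[l] * w[l];
    return acc;
}

/* The same computation fully unrolled: no loop counter or branch per tap,
 * and the independent multiplies expose instruction-level parallelism. */
static int32_t mac6_unrolled(const uint8_t *in, const int16_t *w, int32_t acc) {
    acc += in[0] * w[0];
    acc += in[1] * w[1];
    acc += in[2] * w[2];
    acc += in[3] * w[3];
    acc += in[4] * w[4];
    acc += in[5] * w[5];
    return acc;
}
```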
SIMD Vectorization example
15
Courtesy: www.kernel.org
Intel MIC Programming models 16
Credit: Dr. Volker Weinberg, Introduction into Intel Xeon Phi Programming LRZ, 28.4.2015
Overview
1. ConvNet algorithm
2. Optimization Approach
3. Mapping on the Core i7
4. Mapping on the Xeon Phi
5. Conclusion & Future Work
17
Roofline Model 18
[Figure: attainable GFLOP/s (log scale, 0.5 to 256) versus actual FLOP/Byte ratio (1/8 to 16). The sloped part of the roofline is the memory-bandwidth bound (slope = BW); the horizontal part is the peak-performance bound. Each kernel (Kernel 1, Kernel 2) sits at its FLOP/Byte ratio, and its attainable performance (the Y coordinate) is bounded by the roofline above that point.]
Intel Core i7
• Intel Core i7 @ 3.5 GHz
• Haswell micro-architecture
• AVX2 vector instructions
  − 256-bit vectors
19
Multiply Accumulate intrinsic – AVX2
20
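The AVX2 multiply-accumulate intrinsic `_mm256_madd_epi16` multiplies sixteen 16-bit pairs and adds adjacent products into eight 32-bit lanes. A portable scalar model of that semantics, covering one 256-bit register's worth of data:

```c
#include <stdint.h>

/* Scalar model of AVX2 _mm256_madd_epi16: for each of 8 output lanes,
 * multiply two adjacent int16 pairs and sum the products into an int32. */
static void madd_epi16_model(const int16_t a[16], const int16_t b[16],
                             int32_t out[8]) {
    for (int i = 0; i < 8; i++)
        out[i] = (int32_t)a[2 * i]     * b[2 * i]
               + (int32_t)a[2 * i + 1] * b[2 * i + 1];
}
```

The widening to 32 bits matters for the ConvNet kernel: the 16-bit products of inputs and weights would otherwise overflow the accumulator.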
Calculation of Ops/Byte
• acc += in_layer[i] * weight[j]
• Intrinsics used:
  • add(acc, madd(in_layer, weight))
• Bytes loaded:
  • in_layer[i] – 1 byte
  • weight[j] – 2 bytes
• Operational intensity:
  • 2 ops / 3 bytes = 0.67 Ops/Byte
21
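The 0.67 figure follows directly from the inner loop: each multiply-accumulate performs 2 ops against 3 bytes loaded. As a trivial check:

```c
/* Operational intensity of the MAC kernel: ops per byte moved. */
static double conv_op_intensity(void) {
    const double ops_per_mac   = 2.0;        /* one multiply + one add      */
    const double bytes_per_mac = 1.0 + 2.0;  /* 1-byte input, 2-byte weight */
    return ops_per_mac / bytes_per_mac;      /* ~0.67 Ops/Byte              */
}
```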
Speedup after SIMD intrinsics

Relative to non-vectorized code:
• Intel C Compiler: Layer 1 – 4.7x, Layer 2 – 5.7x, Layer 3 – 4.88x, overall CNN – 5.6x
• GCC: Layer 1 – 4.7x, Layer 2 – 6.8x, Layer 3 – 6.7x, overall CNN – 6.3x

Relative to auto-vectorized code:
• ICC: Layer 1 – 4.9x, Layer 2 – 11.3x, Layer 3 – 4.8x, overall CNN – 5x
• GCC: same per layer, overall CNN – 6.3x

22
Roofline - Core i7 - manual v/s auto 23
[Figure: single-core SIMD ops roofline for the Intel i7 5930K @ 3.5 GHz, performance in Giga Ops/s versus operational intensity in Ops/Byte. Ceilings: 56 Gops/s vector ops, 224 GB/s L1 cache read BW, 112 GB/s L1 cache write BW, 68 GB/s BW to DDR RAM, 16.6 GB/s STREAM BW. At 0.67 Ops/Byte: Layer 3 hand-optimized reaches 35.54 Gops/s, the complete hand-optimized CNN 32.46 Gops/s, and the auto-vectorized CNN only 5.13 Gops/s. Series plotted: Layers 1–3 hand-optimized (gcc/icc), complete CNN hand-optimized (gcc), auto-vectorized (gcc), and no-vectorization (gcc).]
Overview
1. ConvNet algorithm
2. Optimization Approach
3. Mapping on the Intel Core i7
4. Mapping on the Xeon Phi
5. Conclusion & Future Work
24
Intel Xeon Phi
• Knights Corner
  • Initial Many Core Instructions (IMCI)
  • 57–61 cores
• Knights Landing
  • AVX-512
25
Credit: Intel
Intel Xeon Phi
Intel Many Integrated Core Architecture 26
Credit: http://semiaccurate.com/2012/08/28/ intel-details-knights-corner-architecture-at-long-last/
Core Architecture Overview
• 60+ in-order, low-power IA cores
• Bi-directional ring interconnect
• Two pipelines (u & v)
  • Scalar unit based on the Pentium
  • 512-bit SIMD vector processing unit
• 4 hardware threads per core
• Coherent 512 KB L2 cache per core
27
Image courtesy: Intel PRACE MIC Summer School, July 2013, CINECA Ref. pg. 18-19, section 2.1.2 Xeon Phi Co-proc system software devs guide
28
Going from Core i7 to Xeon Phi (AVX to KNC)
29
Going from Core i7 to Xeon Phi (AVX to IMCI)
• AVX2 madd() maps to IMCI fmadd()
• acc = acc + in_layer[m,n,l] x weight[r,k,l]
30
Fused Multiply-Add on Xeon Phi
31
Intrinsics Kernel implementation
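The KNC vector unit executes a fused multiply-add over sixteen 32-bit lanes per instruction (`_mm512_fmadd_ps` at the intrinsics level). A portable scalar model of the vectorized accumulation pattern, with the final horizontal reduction back to one scalar accumulator:

```c
/* Scalar model of one 512-bit fmadd: acc[i] += in[i] * w[i] across 16 lanes. */
static void fmadd16_model(float acc[16], const float in[16], const float w[16]) {
    for (int i = 0; i < 16; i++)
        acc[i] = in[i] * w[i] + acc[i];
}

/* Horizontal reduction of the 16 partial sums into a single accumulator,
 * done once after the inner loop rather than per element. */
static float hsum16(const float acc[16]) {
    float s = 0.0f;
    for (int i = 0; i < 16; i++)
        s += acc[i];
    return s;
}
```

Keeping 16 independent partial sums and reducing once at the end is what lets the FMA pipeline stay full; reducing inside the loop would serialize it.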
Speedup after SIMD intrinsics

Relative to non-vectorized code (Intel C Compiler):
• Layer 1 – 5.7x, Layer 2 – 10.2x, Layer 3 – 12.4x, overall CNN – 11x
• ~0.75 frames per second on one core
  − 57 cores => ~43 FPS

Relative to auto-vectorized code (ICC):
• Layer 1 – 5.6x, Layer 2 – 6.3x, Layer 3 – 10.7x, overall CNN – 9.2x

32
Roofline – Xeon Phi 33
[Figure: single-core roofline for the Xeon Phi @ 1.1 GHz, performance in GFLOP/s versus operational intensity in FLOP/Byte. Ceilings: 35.2 GFLOP/s vector compute, 0.48 GFLOP/s scalar compute, 70.4 GB/s L1 cache BW, 35.2 GB/s L2 cache BW, 5.8 GB/s STREAM BW to DDR RAM.]
Roofline – Xeon Phi 34
[Figure: the same single-core Xeon Phi @ 1.1 GHz roofline with per-layer data points added: Layers 1–3, hand-optimized versus auto-vectorized.]
Roofline – Xeon Phi - Complete 35
[Figure: the same single-core Xeon Phi @ 1.1 GHz roofline for the complete CNN. At 0.67 FLOP/Byte, the hand-optimized version reaches 1.5626 GFLOP/s versus 0.1702 GFLOP/s auto-vectorized.]
Demo
• Speed sign application running on:
  • The Core i7
  • The Xeon Phi
36
Overview
1. ConvNet algorithm
2. Optimization Approach
3. Mapping on the Core i7
4. Mapping on the Xeon Phi
5. Conclusion & Future Work
37
You are here:
38
Conclusion
• Contribution: single-core speedups from hand-vectorization
  • Core i7 – 6.3x
  • Xeon Phi – 11x
• Design trade-offs:
  • Developer time v/s optimized code
  • Architecture-specific intrinsics v/s generic OpenMP
39
Future Work: OpenMP number of threads
• Varying the number of threads per core:
  • 1T x 57 cores = 57T
  • 4T x 57 cores = 228T
• Varying thread distribution over cores:
  • KMP_AFFINITY (environment variable)
• Splitting work using OpenMP directives:
  • #pragma omp for
40
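The work-splitting direction above can be sketched with a minimal OpenMP loop (a hypothetical image-sum kernel; thread count and placement are then chosen at run time via OMP_NUM_THREADS and KMP_AFFINITY, not in the code):

```c
#include <stddef.h>

/* Split the row loop across threads; the reduction clause gives each
 * thread a private partial sum that OpenMP combines at the end. */
static long image_sum(const int *img, size_t rows, size_t cols) {
    long total = 0;
    #pragma omp parallel for reduction(+:total)
    for (long r = 0; r < (long)rows; r++)
        for (size_t c = 0; c < cols; c++)
            total += img[(size_t)r * cols + c];
    return total;
}
```

Compiled without OpenMP support the pragma is ignored and the loop runs serially with the same result, which makes this pattern easy to retrofit.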
Test code on Xeon Phi 41
• Baseline – simulate diffusion of a solute through a volume of liquid
• OpenMP scaling
  • #pragma omp for collapse(2)
• Vectorization
  • #pragma simd

                    Baseline    OpenMP Scaling  Vectorization  Peeling
Elapsed time (s):   5605.027    127.616         17.767         15.619
FLOPS (MFLOPS):     254.991     11199.45        80442.24       91506.41
Throughput (GB/s):  0.235       10.338          74.254         84.467

Credit: Jeffers, James, and James Reinders. Intel Xeon Phi Coprocessor High-Performance Programming. Newnes, 2013.
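A hedged sketch of the two directives named on the slide, applied to a diffusion-like stencil (grid size and coefficients here are illustrative, not the book's benchmark): `collapse(2)` fuses the two outer loops so all (y, x) iterations are shared among threads, and the inner loop body is the vectorization target.

```c
/* One Jacobi-style diffusion sweep: each interior cell becomes the
 * average of its four neighbours. */
static void diffuse_step(const float *in, float *out, int ny, int nx) {
    #pragma omp parallel for collapse(2)
    for (int y = 1; y < ny - 1; y++)
        for (int x = 1; x < nx - 1; x++) {
            /* this arithmetic is what the slide's #pragma simd vectorizes */
            out[y * nx + x] = 0.25f * (in[(y - 1) * nx + x] +
                                       in[(y + 1) * nx + x] +
                                       in[y * nx + x - 1] +
                                       in[y * nx + x + 1]);
        }
}
```

On the Phi, collapsing matters because 228 threads need many more iterations than a short outer loop alone provides.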
Thank You.
Questions?