Deep Convolutional Network Evaluation on the Intel Xeon Phi Gaurav Raina MSc Graduation Project 5-1-2016


Page 1: Presentation Thesis - Convolutional net on the Xeon Phi using SIMD - Gaurav Raina small

Deep Convolutional Network Evaluation on the Intel Xeon Phi

Gaurav Raina MSc Graduation Project

5-1-2016

Presenter
Presentation Notes
Factoid: Google's self-driving cars have autonomously driven more than 1 million miles on city streets. Every second, a Google self-driving car collects an incredible 1 GB of visual data. They can recognize pedestrians, speed signs, cyclists giving hand signals, and many other objects.
Page 2:

Cameras are ubiquitous 1

Presenter
Presentation Notes
Spy drones, mobile/wearable platforms, Internet of Things devices with cameras.
Page 3:

Vision processing on mobile devices

• Currently most processing is done off-line
• High compute and energy demands
• Trend: move to edge processing

2

Presenter
Presentation Notes
Edge vs. cloud processing for vision: the trend is to move to on-device processing.
Page 4:

Motivation

• Convolutional neural nets are very generic (support many vision tasks):
  • Traffic sign detection
  • Pedestrian detection
  • Face detection

• Accelerate with a power-efficient core

3

Presenter
Presentation Notes
This object-recognition job has been done in the past in a variety of ways. The most common approach is rule-based: search the image for pre-defined shapes, feed the locations of the shapes into a classifier, and give the classifier rules about what combinations of shapes might represent a human. In abstract tests such systems can perform adequately. But on real-world images with changes in position, orientation, lighting, and noise level, they are often inadequate.
Page 5:

Problem statement

“Efficiently parallelize a Convolutional Neural Network on a highly-parallel power efficient processor platform”

4

Page 6:

You are here:

5

Page 7:

Overview

1. ConvNet algorithm
2. Optimization Approach
3. Mapping on the Core i7
4. Mapping on the Xeon Phi
5. Conclusion & Future Work

6

Page 8:

Overview

1. Convolutional Network (ConvNet) algorithm
2. Optimization Approach
3. Mapping on the Core i7
4. Mapping on the Xeon Phi
5. Conclusion & Future Work

7

Page 9:

Introduction: Neural Networks

• Artificial neuron model

8

Presenter
Presentation Notes
(Quick) Weighted sum of inputs: y = w0*x0 + w1*x1 + ... + wn*xn, followed by a threshold function, i.e., output 1 if y > 0, -1 if y <= 0.
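The weighted-sum-plus-threshold neuron described in the note above can be sketched in plain C as follows (the function name and array sizes are illustrative, not from the thesis):

```c
#include <stddef.h>

/* Artificial neuron as on the slide: weighted sum of inputs,
 * y = w0*x0 + w1*x1 + ... + wn*xn, followed by a sign
 * threshold: output 1 if y > 0, else -1. */
static int neuron(const float *x, const float *w, size_t n)
{
    float y = 0.0f;
    for (size_t i = 0; i < n; i++)
        y += w[i] * x[i];          /* weighted sum */
    return (y > 0.0f) ? 1 : -1;   /* threshold activation */
}
```

For example, inputs {1, 2, 3} with weights {1, -1, 1} give y = 2 and thus output 1.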
Page 10:

Convolution example

9

Image credit: deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution
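The convolution example boils down to sliding a small kernel over the input and taking a weighted sum at each position. A minimal plain-C sketch of a 2D "valid" convolution (sizes and names are illustrative, not thesis code):

```c
/* Naive 2D "valid" convolution: slide a K x K kernel over an
 * h x w input; each output pixel is the weighted sum of the
 * K x K input window under the kernel. Output size is
 * (h - K + 1) x (w - K + 1). */
#define K 2

static void conv2d_valid(const float *in, int h, int w,
                         const float kern[K][K], float *out)
{
    for (int y = 0; y <= h - K; y++)
        for (int x = 0; x <= w - K; x++) {
            float acc = 0.0f;
            for (int i = 0; i < K; i++)
                for (int j = 0; j < K; j++)
                    acc += in[(y + i) * w + (x + j)] * kern[i][j];
            out[y * (w - K + 1) + x] = acc;
        }
}
```

With a 3x3 input of 1..9 and an all-ones 2x2 kernel, the four outputs are the four window sums: 12, 16, 24, 28.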

Page 11:

Speed sign detection application 10

Image courtesy: Maurice Peemen

Presenter
Presentation Notes
CNNs are state-of-the-art machine learning algorithms inspired by the visual cortex of animals. Applications: traffic sign detection, pedestrian detection, face detection, handwriting recognition.
Page 12:

ConvNet Application in action 11

Video courtesy: Maurice Peemen

https://youtu.be/kkha3sPoU70
Page 13:

ConvNet Code Structure

for (0 < r < 6) {
    acc = bias[r];
    for (0 < m < YL1) {
        for (0 < n < XL1) {
            for (0 < k < 6) {
                for (0 < l < 6) {
                    acc = acc + in_layer[m,n,l] * weight[r,k,l];
                }
            }
            index = saturate_shift(acc);        // 10-bit fixed-point format
            output_layer[r,m,n] = fixact[index];
        }
    }
}

"r" = output feature maps (6); "k*l" = 6*6 convolution kernel; "n" = neuron outputs; fixact = sigmoid activation function

12

Compute

Store

Presenter
Presentation Notes
Note: this is not true saturation but clamping to upper and lower limits. With subsampling, the input index is in_layer[m*2, n*2, l].
Page 14:

Overview

1. ConvNet algorithm
2. Optimization Approach
3. Mapping on the Core i7
4. Mapping on the Xeon Phi
5. Conclusion & Future Work

13

Page 15:

Optimization Approach

• Methodology:
  • Test on Core i7 (Haswell – AVX2)
  • Move to Xeon Phi (Knights Corner – IMCI)

• Steps:
  1. Loop unrolling
  2. Vectorization using SIMD intrinsics (DLP)
     − Fused Multiply-Add instruction
  3. Parallelization using OpenMP (TLP)

14

1 core

Many-core

Presenter
Presentation Notes
Loop unrolling reduces execution time through increased DLP and reduced branching overhead, at the cost of larger code size. The same programming models are supported on both platforms.
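Step 1, loop unrolling, can be sketched as below: the 6-iteration multiply-accumulate loop from the ConvNet kernel is unrolled by hand, trading code size for fewer branches (function names are illustrative, not the thesis code):

```c
/* Multiply-accumulate over a 6-element kernel row.
 * Rolled version: 6 iterations, 6 loop-condition branches. */
static int mac6_rolled(const int *in, const int *w)
{
    int acc = 0;
    for (int l = 0; l < 6; l++)
        acc += in[l] * w[l];
    return acc;
}

/* Fully unrolled version: no loop overhead, and the independent
 * products give the compiler more ILP to schedule, at the cost
 * of larger code size. */
static int mac6_unrolled(const int *in, const int *w)
{
    return in[0] * w[0] + in[1] * w[1] + in[2] * w[2]
         + in[3] * w[3] + in[4] * w[4] + in[5] * w[5];
}
```

Both versions compute the same result; unrolling only changes how the work is expressed to the compiler.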
Page 16:

SIMD Vectorization example

15

Courtesy: www.kernel.org

Page 17:

Intel MIC Programming models 16

Credit: Dr. Volker Weinberg, Introduction into Intel Xeon Phi Programming LRZ, 28.4.2015

Presenter
Presentation Notes
Fully utilizing a single core: vectorization. Splitting work among cores: parallelization.
Page 18:

Overview

1. ConvNet algorithm
2. Optimization Approach
3. Mapping on the Core i7
4. Mapping on the Xeon Phi
5. Conclusion & Future Work

17

Page 19:

Roofline Model 18

[Roofline plot: attainable GFLOP/s vs. actual FLOP/Byte ratio, both on log scales. The horizontal line is the performance roofline (its y-coordinate is peak performance); the sloped line is the processor bandwidth roofline (slope = BW). Each kernel (Kernel 1, Kernel 2) sits at its FLOP/Byte ratio, and the roofline directly above it is that kernel's performance bound.]

Presenter
Presentation Notes
In-core optimizations (horizontal ceilings) Bandwidth optimizations (diagonal ceilings) Memory traffic optimizations (vertical walls)
Page 20:

Intel Core i7

• Intel Core i7 @ 3.5 GHz
• Haswell micro-architecture
• AVX2 vector instructions
  − 256-bit vectors

19

Page 21:

Multiply Accumulate intrinsic – AVX2

20

Page 22:

Calculation of Ops/Byte

• acc += in_layer[i] * weight[j]

• Intrinsics used:
  • add(acc, madd(in_layer, weight))

• Bytes loaded:
  • in_layer[i] – 1 byte
  • weight[j] – 2 bytes

• Operational intensity:
  • 2 ops / 3 bytes ≈ 0.67 Ops/Byte

21
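The roofline model from slide 19 reduces to attainable performance = min(compute ceiling, bandwidth × operational intensity). A small sketch in C; the specific numbers in the usage note below (56 GOps/s, 16.6 GB/s, 224 GB/s) are the Core i7 ceilings quoted on the next roofline slide:

```c
/* Roofline model: attainable performance is capped by either the
 * compute ceiling or by memory bandwidth times operational
 * intensity, whichever is lower. */
static double roofline_gops(double peak_gops, double bw_gbytes,
                            double ops_per_byte)
{
    double mem_bound = bw_gbytes * ops_per_byte;
    return (mem_bound < peak_gops) ? mem_bound : peak_gops;
}
```

With the CNN kernel's 0.67 Ops/Byte and 16.6 GB/s STREAM bandwidth, the memory-bound limit is about 11.1 GOps/s, well below the 56 GOps/s vector ceiling; streaming from L1 cache at 224 GB/s, the same kernel becomes compute-bound at 56 GOps/s.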

Page 23:

Speedup after SIMD intrinsics

• Speedup w.r.t. non-vectorized code:
  • Intel C Compiler (ICC):
    • Layer 1 – 4.7x
    • Layer 2 – 5.7x
    • Layer 3 – 4.88x
    • Overall CNN – 5.6x
  • GCC:
    • Layer 1 – 4.7x
    • Layer 2 – 6.8x
    • Layer 3 – 6.7x
    • Overall CNN – 6.3x

22

• Speedup w.r.t. auto-vectorized code:
  • ICC:
    • Layer 1 – 4.9x
    • Layer 2 – 11.3x
    • Layer 3 – 4.8x
    • Overall CNN – 5x
  • GCC:
    • Layers 1–3 – same
    • Overall CNN – 6.3x

Presenter
Presentation Notes
ICC generates better vector code. ICC timings: L1: (not given); L2: 46.06 → 8.055 ms; L3: 244 → 50 ms; complete: 319 → 64 ms. GCC timings: L1: 20.38 → 4.33 ms; L2: 57.48 → 8.37 ms; L3: 323 → 48 ms; complete: 403 → 64 ms.
Page 24:

Roofline - Core i7 - manual v/s auto 23

[Roofline plot: single-core SIMD ops roofline, Intel i7-5930K @ 3.5 GHz; performance (Giga Ops/s) vs. operational intensity (Ops/Byte). Ceilings: 56 GOps/s vector ops, 224 GB/s L1 cache read BW, 112 GB/s L1 cache write BW, 68 GB/s BW to DDR RAM, 16.6 GB/s STREAM BW. At 0.67 Ops/Byte: Layer 3 hand-optimized 35.54 GOps/s, complete CNN hand-optimized 32.46 GOps/s, complete CNN auto-vectorized 5.13 GOps/s. Series plotted: Layers 1–3 hand-optimized (gcc/icc), complete CNN hand-optimized (gcc), complete CNN auto-vectorized (gcc), complete CNN no-vectorization (gcc).]

Presenter
Presentation Notes
Analysis: Hits compute roofline before the combined L1 Read+Write BW
Page 25:

Overview

1. ConvNet algorithm
2. Optimization Approach
3. Mapping on the Intel Core i7
4. Mapping on the Xeon Phi
5. Conclusion & Future Work

24

Page 26:

Intel Xeon Phi

• Knights Corner
  • Initial Many Core Instructions (IMCI)
  • 57–61 cores
• Knights Landing
  • AVX-512

25

Credit: Intel

Page 27:

Intel Xeon Phi

26 Intel Many Integrated Core Architecture

Credit: http://semiaccurate.com/2012/08/28/intel-details-knights-corner-architecture-at-long-last/

Presenter
Presentation Notes
MIC
Page 28:

Core Architecture Overview

• 60+ in-order, low-power IA cores
• Bi-directional ring interconnect
• Two pipelines (u & v)
• Scalar unit based on the Pentium
• 512-bit SIMD Vector Processing Unit
• 4 hardware threads
• Coherent 512 KB L2 cache per core

27

Image courtesy: Intel PRACE MIC Summer School, July 2013, CINECA. Ref.: pg. 18-19, section 2.1.2, Xeon Phi coprocessor system software developer's guide.

Presenter
Presentation Notes
Scalar unit: dual issue with scalar instructions, pipelined one-per-clock scalar throughput. VPU: most VPU instructions have a latency of 4 cycles and a throughput of 1 cycle. Because of the simple instruction decoder, two threads per core are needed to achieve full compute potential for the same instruction (ref. pg. 18, software developer's guide).
Page 29:

28

Going from Core i7 to Xeon Phi (AVX to KNC)

Page 30:

29

Going from Core i7 to Xeon Phi (AVX to IMCI)

madd() → fmadd()

• acc = acc + in_layer[m,n,l] * weight[r,k,l]

Presenter
Presentation Notes
2^8 * 2^16 = 2^24, so we need 24 bits and have 32 bits available. Lucky!
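The bit-width argument in the note above can be made concrete: an 8-bit input times a 16-bit weight needs at most 24 bits, so every product fits in a 32-bit accumulator with headroom to spare. A hedged sketch (types and the function name are illustrative):

```c
#include <stdint.h>

/* Fixed-point multiply-accumulate: an 8-bit input times a 16-bit
 * weight is at most a 24-bit value (2^8 * 2^16 = 2^24), so a
 * 32-bit accumulator holds the product with 8 bits of headroom
 * for summing many terms before overflow is possible. */
static int32_t mac_fixed(uint8_t in, int16_t w, int32_t acc)
{
    return acc + (int32_t)in * (int32_t)w;  /* widen before multiply */
}
```

Even the largest product, 255 * 32767 = 8355585, is well below 2^24 = 16777216.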
Page 31:

30

Fused Multiply-Add on Xeon Phi

Page 32:

31

Intrinsics Kernel implementation

Page 33:

Speedup after SIMD intrinsics

• Speedup w.r.t. non-vectorized code (Intel C Compiler):
  • Layer 1 – 5.7x
  • Layer 2 – 10.2x
  • Layer 3 – 12.4x
  • Overall CNN – 11x
  • ~0.75 frames per second
    − 57 cores => ~43 FPS

32

• Speedup w.r.t. auto-vectorized code (ICC):
  • Layer 1 – 5.6x
  • Layer 2 – 6.3x
  • Layer 3 – 10.7x
  • Overall CNN – 9.2x

Presenter
Presentation Notes
ICC generates better vector code for MIC, hence the relative speedup is less. A single core takes 1.3197 seconds/frame. fmadd (Fused Multiply-Add), vectorized over N, M, L. IMCI fmadd() vs. AVX2 madd(): 16 operands, 32-bit vs. 16-bit. Core i7 AVX2 gives 4.9x; why 4.9x rather than ~4x when IMCI has 2x wider vectors? The intrinsics implementation uses half of the potential vectorization.
Page 34:

Roofline – Xeon Phi 33

[Roofline plot: single-core roofline, Xeon Phi @ 1.1 GHz; performance (GFLOP/s) vs. operational intensity (FLOP/Byte). Ceilings: 35.2 GFLOP/s vector compute, 0.48 GFLOP/s scalar compute, 70.4 GB/s L1 cache BW, 35.2 GB/s L2 cache BW, 5.8 GB/s STREAM BW to DDR RAM.]

Page 35:

Roofline – Xeon Phi 34

[Roofline plot: same single-core Xeon Phi @ 1.1 GHz roofline and ceilings as the previous slide (35.2 GFLOP/s vector, 0.48 GFLOP/s scalar, 70.4 GB/s L1 BW, 35.2 GB/s L2 BW, 5.8 GB/s STREAM BW), with per-layer data points: Layers 1–3, hand-optimized vs. auto-vectorized.]

Page 36:

Roofline – Xeon Phi - Complete 35

[Roofline plot: single-core Xeon Phi @ 1.1 GHz roofline (35.2 GFLOP/s vector ceiling, 0.48 GFLOP/s scalar ceiling, 70.4 GB/s L1 cache BW, 35.2 GB/s L2 cache BW, 5.8 GB/s BW to DDR RAM). At 0.67 FLOP/Byte: complete CNN hand-optimized 1.5626 GFLOP/s, complete CNN auto-vectorized 0.1702 GFLOP/s.]

Page 37:

Demo

• Speed sign application running on:
  • The Core i7
  • The Xeon Phi

36

Page 38:

Overview

1. ConvNet algorithm
2. Optimization Approach
3. Mapping on the Core i7
4. Mapping on the Xeon Phi
5. Conclusion & Future Work

37

Page 39:

You are here:

38

Page 40:

Conclusion

• Contribution:
  • Core i7 – 6.3x speedup
  • Xeon Phi – 11x speedup

• Design trade-offs:
  • Developer time vs. optimized code
  • Architecture-specific intrinsics vs. generic OpenMP

39

Page 41:

Future Work: OpenMP number of threads

• Varying number of threads per core:
  • 1T x 57 cores = 57T
  • 4T x 57 cores = 228T

• Varying thread distribution on cores:
  • KMP_AFFINITY (environment variable)

• Splitting work using OpenMP directives:
  • #pragma omp for

40

Presenter
Presentation Notes
4 thread-count settings x 3 affinity types = 12 combinations
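The experiments listed on this slide (thread counts, affinity, work splitting) can be sketched in C as below. The loop body is an illustrative stand-in for per-feature-map CNN work; thread count and placement come from the environment, not the code:

```c
/* Split a loop's iterations across threads with OpenMP.
 * Thread count and placement are controlled externally, e.g.:
 *   OMP_NUM_THREADS=228      (4 threads x 57 cores)
 *   KMP_AFFINITY=scatter     (or compact, balanced)
 * Without -fopenmp the pragma is ignored and the loop runs
 * serially, so the result is identical either way. */
static void scale_rows(double *out, const double *in, int rows)
{
    #pragma omp parallel for
    for (int r = 0; r < rows; r++)
        out[r] = 2.0 * in[r];   /* stand-in for per-map CNN work */
}
```

Because each iteration is independent, `#pragma omp for` can divide the iteration space among however many threads the environment provides.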
Page 42:

41

                    Baseline    OpenMP Scaling   Vectorization   Peeling
Elapsed time (s):   5605.027    127.616          17.767          15.619
FLOPS (MFlops):     254.991     11199.45         80442.24        91506.41
Throughput (GB/s):  0.235       10.338           74.254          84.467

Test code on Xeon Phi

• Baseline – simulate diffusion of a solute through a volume of liquid
• OpenMP scaling:
  • #pragma omp for collapse(2)
• Vectorization:
  • #pragma simd

Credit: Jeffers, James, and James Reinders. Intel Xeon Phi coprocessor high-performance programming. Newnes, 2013.

Presenter
Presentation Notes
370x overall improvement. Nested for loops with multiply-add operations. collapse(2): tells the compiler to collapse the next two loops (z and y) into one loop and then apply the OpenMP "omp for" work-division mechanism. Conceptually, the loop changes to a single loop that executes as for(yz = 0; yz < ny*nx; ++yz). This lets each thread be assigned larger chunks of data to process more calculations, and therefore allows more efficiency on each pass through the loop. #pragma simd: requests the compiler to vectorize the loop regardless of potential dependencies or other potential constraints.
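The collapse-plus-vectorize pattern described in the note above can be sketched like this. The stencil below is a simplified stand-in for the book's diffusion kernel, and it uses the portable `omp simd` construct rather than Intel's older `#pragma simd` from the slide:

```c
#define NZ 4
#define NY 4
#define NX 8

/* Simplified diffusion-style sweep: average each cell with its
 * x-neighbours. collapse(2) merges the z and y loops into one
 * iteration space of NZ*NY chunks for OpenMP to divide among
 * threads; the inner x loop is unit-stride and a candidate for
 * vectorization. Without -fopenmp the pragmas are ignored and
 * the code runs serially with the same result. */
static void sweep(float out[NZ][NY][NX], const float in[NZ][NY][NX])
{
    #pragma omp parallel for collapse(2)
    for (int z = 0; z < NZ; z++)
        for (int y = 0; y < NY; y++) {
            #pragma omp simd
            for (int x = 1; x < NX - 1; x++)
                out[z][y][x] = (in[z][y][x - 1] + in[z][y][x]
                              + in[z][y][x + 1]) / 3.0f;
        }
}
```

With a uniform input field the sweep is a fixed point: every interior output cell equals the input value.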
Page 43:

Thank You.

Questions?

Presenter
Presentation Notes
Factoid: Every second, a Google self-driving car collects an incredible 1 GB of visual data. It can recognize pedestrians, speed signs, cyclists giving hand signals, and many other objects. Google cars have autonomously driven more than 1 million miles on city streets.