
Page 1:

Heterogeneous Computing In HPC and Deep Learning

Qing Zhang, [email protected], HPC Application R&D Manager, Inspur Group

Chief Architect, Inspur-Intel China Parallel Computing Joint Lab

Page 2:

Moore's Law was proposed on April 19, 1965

50 years of Moore's Law:

• 1965-2005: single-core CPU era, driven by frequency improvements
• 2006-now: multi-core CPU era
• 2012-now: many-core MIC era
• 2008-now: many-core GPU era

Page 3:

• Homogeneous computing
  – x86 multi-core CPU: Intel CPU, AMD CPU
  – Non-x86 multi-core CPU: Power, ShenWei
• Heterogeneous computing
  – CPU+GPU
  – CPU+MIC
  – CPU+FPGA

"…parallel computing is the future…" (Bill Dally, Chief Scientist & Sr. VP of Research, NVIDIA)

Page 4:

Multi-core CPU Parallel Computing

• The basis and starting point of parallel computing

• Application performance can be sped up by more than 10X (a host-side sketch follows after the chart below)
  – Data parallelism within a single core
  – Multi-core parallelism with multiple threads
  – Multi-node parallelism in a cluster

[Chart: speedup ratio, 2-way Intel Xeon E5620 4-core CPU vs. 1 CPU core, across seven applications: 18, 16.7, 12.54, 12.08, 10.46, 9.8, 9.16]
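To ground the first two bullets, here is a hedged host-side sketch (not from the slides): OpenMP spreads the loop across CPU cores, and the simd hint exposes the per-core data parallelism that the vector units exploit. Multi-node parallelism would add MPI on top, as sketched later for the heterogeneous case. The file contains no device code and compiles as CUDA host code with nvcc, or with any OpenMP-enabled C++ compiler; all sizes are illustrative.

```cuda
// Hedged sketch of single-node CPU parallelism (illustrative): multi-core
// parallelism via OpenMP threads plus per-core data parallelism (SIMD).
// Build: nvcc -Xcompiler -fopenmp cpu_parallel.cu -o cpu_parallel
//    or: g++ -fopenmp -O2 -x c++ cpu_parallel.cu -o cpu_parallel
#include <cstdio>
#include <vector>
#include <omp.h>

int main() {
    const int n = 1 << 24;
    std::vector<float> a(n, 1.0f), b(n, 2.0f);
    double sum = 0.0;

    // Threads across cores; 'simd' asks the compiler to vectorize each thread's chunk.
    #pragma omp parallel for simd reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += a[i] * b[i];

    printf("dot product = %.1f using up to %d OpenMP threads\n", sum, omp_get_max_threads());
    return 0;
}
```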

Page 5:

The Challenges of CPU Parallel Computing

• Performance
  – CPU: ~500 GFlops
  – GPU: ~2900 GFlops
• Performance per Watt
  – CPU: ~3.5 GFlops/W
  – GPU: ~9.6 GFlops/W
• Memory bandwidth
  – CPU: ~50 GB/s
  – GPU: 480 GB/s
• Scalability for big applications
  – Data communication across many nodes

Page 6:

Top500 Development Trend

• Analysis of the Top 10 (Jun 2015)
  – Heterogeneous systems account for 40% of the top 10 systems
• Analysis of the last 16 Top500 lists
  – The proportion of heterogeneous systems keeps increasing
• CPU+GPU/MIC heterogeneous computing has matured

[Chart: number of GPU- and MIC-accelerated systems in the Top500, Jun 2010 to Jun 2015. GPU: 4, 11, 14, 37, 55, 53, 43, 41, 48, 53, 53. MIC: 0, 0, 0, 0, 1, 7, 12, 13, 19, 25, 37.]

Page 7:

HPC Development Trend

• HPC applications
  – Traditional HPC applications: scientific computing
  – HPC + Big Data applications: deep learning
• HPC system architecture
  – Cluster
  – Cluster + Hadoop
• HPC parallel technology
  – CPU parallelism
  – CPU+GPU / CPU+MIC / CPU+FPGA parallelism

Page 8:

Heterogeneous Computing & HPC

• Heterogeneous computing will be a development trend of HPC

• Mainstream heterogeneous computing architectures
  – CPU+GPU
  – CPU+MIC
• Emerging heterogeneous computing architecture
  – CPU+FPGA


Page 9:

GPU & MIC Heterogeneous Computing

• Highly parallel computing
  – Many-core architecture
  – Hundreds to tens of thousands of parallel threads
• High computing throughput
  – Guaranteed by high-bandwidth, low-latency memory
• Suitable for compute-intensive, highly parallel, SIMD applications
• Good at matrix computing (see the sketch below)
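To make the "tens of thousands of threads" point concrete, here is a minimal CUDA sketch (illustrative, not from the slides): one kernel launch creates one lightweight thread per output row of a dense matrix-vector product, the kind of regular, data-parallel work GPU and MIC handle well. All sizes and names are assumptions for the example.

```cuda
// Minimal CUDA sketch (illustrative): one thread per output row of y = A * x.
// Compile with: nvcc matvec.cu -o matvec
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void matvec(const float* A, const float* x, float* y, int rows, int cols) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index = row index
    if (row < rows) {
        float sum = 0.0f;
        for (int c = 0; c < cols; ++c)
            sum += A[row * cols + c] * x[c];
        y[row] = sum;
    }
}

int main() {
    const int rows = 32768, cols = 1024;               // 32768 rows -> 32768 threads
    std::vector<float> hA((size_t)rows * cols, 1.0f), hx(cols, 1.0f), hy(rows);

    float *dA, *dx, *dy;
    cudaMalloc(&dA, hA.size() * sizeof(float));
    cudaMalloc(&dx, hx.size() * sizeof(float));
    cudaMalloc(&dy, hy.size() * sizeof(float));
    cudaMemcpy(dA, hA.data(), hA.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dx, hx.data(), hx.size() * sizeof(float), cudaMemcpyHostToDevice);

    // Fine-grained parallelism: tens of thousands of threads in a single launch.
    int threadsPerBlock = 256;
    int blocks = (rows + threadsPerBlock - 1) / threadsPerBlock;
    matvec<<<blocks, threadsPerBlock>>>(dA, dx, dy, rows, cols);

    cudaMemcpy(hy.data(), dy, hy.size() * sizeof(float), cudaMemcpyDeviceToHost);
    printf("y[0] = %f (expected %d)\n", hy[0], cols);

    cudaFree(dA); cudaFree(dx); cudaFree(dy);
    return 0;
}
```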


Page 10:

CPU+GPU/MIC Parallel Architecture

• MPI process parallelism (across nodes)
• MPI process / OpenMP thread parallelism (within a node)
• OpenMP / CUDA thread parallelism (on MIC / GPU)
• Data parallelism (vectorization): 512-bit (MIC), 256-bit (CPU)

A combined sketch of how these layers compose follows below.
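The sketch below (illustrative, not from the slides) shows one common way to stack these layers: one MPI rank per node or per GPU, OpenMP threads on the host CPU, and a CUDA kernel supplying the fine-grained thread parallelism; vectorization is left to the compiler and the SIMT hardware. The kernel, sizes, and build line are assumptions for the example.

```cuda
// Hedged sketch of the hybrid MPI + OpenMP + CUDA structure (illustrative only).
// Build (one possibility, depends on the MPI install):
//   nvcc hybrid.cu -Xcompiler -fopenmp -lmpi -o hybrid
#include <cstdio>
#include <vector>
#include <mpi.h>
#include <omp.h>
#include <cuda_runtime.h>

// Fine-grained CUDA thread parallelism: one thread per array element.
__global__ void scale(float* data, float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= alpha;
}

int main(int argc, char** argv) {
    // Level 1: MPI process parallelism (typically one rank per node or per GPU).
    MPI_Init(&argc, &argv);
    int rank = 0, nranks = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    // Bind this rank to one of the node's GPUs.
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev > 0) cudaSetDevice(rank % ndev);

    const int n = 1 << 20;                      // this rank's share of the data
    std::vector<float> host(n);

    // Level 2: OpenMP thread parallelism on the host CPU (here: preparing data).
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) host[i] = (float)(rank + 1);

    // Level 3: CUDA thread parallelism on the accelerator.
    float* dev = nullptr;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(dev, 2.0f, n);
    cudaMemcpy(host.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    // Combine partial results across ranks with MPI.
    double local = host[0], global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0) printf("sum of first elements over %d ranks = %f\n", nranks, global);

    MPI_Finalize();
    return 0;
}
```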

Page 11:

Design of GPU/MIC Parallel Algorithm

• Fine-grained parallel design
  – GPU: tens of thousands of parallel threads
  – MIC: 240 parallel threads
• SIMD and data mapping design
  – GPU: SIMT
  – MIC: 512-bit vectorization
• Data blocking design
  – GPU: 4-12 GB device memory
  – MIC: 6-16 GB device memory
• On-chip memory access design (see the tiling sketch below)
  – GPU: shared/constant/texture memory, L1/L2 cache
  – MIC: L1/L2 cache
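On-chip memory access is where much of the GPU tuning effort goes. Below is a hedged sketch (not from the slides) of the standard shared-memory tiling pattern: each thread block stages a tile of the inputs into fast on-chip shared memory so that each global-memory load is reused many times. The tile size, matrix size, and kernel name are assumptions for the example.

```cuda
// Hedged sketch of shared-memory tiling (illustrative): a tiled matrix multiply
// where each block stages TILE x TILE sub-matrices into on-chip shared memory.
// Compile with: nvcc tiled_mm.cu -o tiled_mm
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define TILE 16

__global__ void matmul_tiled(const float* A, const float* B, float* C, int n) {
    __shared__ float sA[TILE][TILE];   // on-chip shared memory (per block)
    __shared__ float sB[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // Each thread loads one element of the A and B tiles from global memory.
        sA[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        sB[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();               // tile fully loaded before use

        for (int k = 0; k < TILE; ++k) // reuse the tile TILE times from fast memory
            acc += sA[threadIdx.y][k] * sB[k][threadIdx.x];
        __syncthreads();               // done with this tile before overwriting it
    }
    C[row * n + col] = acc;
}

int main() {
    const int n = 512;                 // n must be a multiple of TILE in this sketch
    size_t bytes = (size_t)n * n * sizeof(float);
    std::vector<float> hA(n * n, 1.0f), hB(n * n, 1.0f), hC(n * n);

    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid(n / TILE, n / TILE);
    matmul_tiled<<<grid, block>>>(dA, dB, dC, n);

    cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);
    printf("C[0] = %f (expected %d)\n", hC[0], n);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```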

Page 12:

Heterogeneous Computing at Inspur

• Heterogeneous computing joint labs
  – Inspur-Nvidia Cloud Supercomputing Innovation Center
  – Inspur-Intel China Parallel Computing Joint Lab
• Heterogeneous computing technology
  – CPU+GPU / CPU+MIC / CPU+FPGA
• Application research directions
  – Traditional HPC applications
  – HPC + Big Data applications

Page 13:

GPU Apps for Traditional HPC


Field | Application | Clients | Speed-up ratio | Platform
Life Science | Blastn | Beijing Institute of Genomics | 35X (kernel) | 1 GPU / 1 CPU core
Life Science | ET | Institute of Biophysics, CAS | 48X | 1 GPU / 1 CPU core
CFD | LBM_LES | — | 100X | 1 GPU / 1 CPU core
Oil & Gas | RNA | BGP | 8X | 24 GPU nodes / 24 CPU nodes
Oil & Gas | PSTM | | 5X | 6 GPU nodes / 6 CPU nodes
Oil & Gas | Scandip | | 9X | 4 GPU + 2 CPU / 2 CPU

• GPU apps for traditional HPC in China
  – Large applications in supercomputing centers
  – Scientific computing in universities and CAS
  – Industrial computing at BGP and SGRI

Page 14:

MIC Apps for Traditional HPC

[Chart: Inspur-Intel joint lab MIC application performance; speed-up ratios of 1, 1.92, 2.47, 5.25, 1.75, 3.32, 2, and 5.02]

One MIC delivers roughly 1.5x to 5.25x the performance of a 16-core CPU

Page 15:

GPU Apps for Deep Learning

• Internet applications using deep learning in China
  – Classification and clustering
  – Speech recognition: offline training and online recognition (DNN, RNN)
  – Image recognition, e.g. face and object recognition (CNN)
  – Advertising and product recommendation
  – Enterprise security
  – Video/audio detection and analysis

Field | Application | Clients | Speed-up ratio | Platform
Internet | Caffe | Qihoo | 10.7X | 16 GPU / 1 GPU
Internet | DNN | IFLYTEK | 13X | 16 GPU / 1 GPU
Internet | K-means | Qihoo | 35X | 1 GPU / 1 CPU core
Internet | Neural Network | Qihoo | 270X | 4 GPU / 1 CPU core

Page 16:

The Challenges of GPU/MIC Computing

• Design of fine-grained parallel algorithms
• Porting and tuning effort
• Performance scalability of large-scale computing
• Power bottleneck
  – 200-300 W, independent power supply required
• Memory capacity and bandwidth bottlenecks
  – GPU/MIC device memory capacity
  – PCIe bandwidth between CPU and GPU/MIC (see the overlap sketch below)
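The PCIe bottleneck is usually attacked by splitting the data into chunks and overlapping host-device transfers with kernel execution using CUDA streams and pinned host memory. The sketch below is illustrative, not from the slides; the chunk count, sizes, and kernel are placeholders.

```cuda
// Hedged sketch (illustrative): hiding PCIe transfer time behind computation
// by processing the data in chunks on multiple CUDA streams with pinned memory.
// Compile with: nvcc overlap.cu -o overlap
#include <cstdio>
#include <cuda_runtime.h>

__global__ void work(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f + 1.0f;       // stand-in for real computation
}

int main() {
    const int nChunks = 4;
    const int chunk   = 1 << 22;                // elements per chunk
    const size_t chunkBytes = chunk * sizeof(float);

    // Pinned (page-locked) host memory is required for truly asynchronous PCIe copies.
    float* h = nullptr;
    cudaMallocHost(&h, nChunks * chunkBytes);
    for (int i = 0; i < nChunks * chunk; ++i) h[i] = 1.0f;

    float* d = nullptr;
    cudaMalloc(&d, nChunks * chunkBytes);

    cudaStream_t streams[nChunks];
    for (int s = 0; s < nChunks; ++s) cudaStreamCreate(&streams[s]);

    // Each chunk: H2D copy, kernel, D2H copy, all queued on its own stream so the
    // copy engines and the compute engine can run concurrently across chunks.
    for (int s = 0; s < nChunks; ++s) {
        float* hPtr = h + (size_t)s * chunk;
        float* dPtr = d + (size_t)s * chunk;
        cudaMemcpyAsync(dPtr, hPtr, chunkBytes, cudaMemcpyHostToDevice, streams[s]);
        work<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(dPtr, chunk);
        cudaMemcpyAsync(hPtr, dPtr, chunkBytes, cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();

    printf("h[0] = %f (expected 3.0)\n", h[0]);
    for (int s = 0; s < nChunks; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```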

Application characteristics that map poorly onto GPU/MIC:

Application characteristic | Example application | Conflicting GPU & MIC characteristic
Low F/B | Grapes (weather forecast) | Low memory bandwidth per core
Low degree of parallelism | VASP (materials physics) | Many cores
Logical computing | Tomographic inversion (oil & gas) | SIMD
Unable to block data | BWA (life science) | Memory capacity (4-16 GB)
Frequent data I/O | Real-time applications | PCIe bottleneck

Page 17:

Next Generation GPU/MIC Architecture

• GPU (Pascal)
  – High performance per Watt: 12 GFlops/W
  – NVLink: 120 GB/s
• MIC (KNL)
  – Self-hosted, no host CPU required
  – Large memory: avoids the memory capacity bottleneck
  – High bandwidth: avoids the PCIe bottleneck
  – High performance: 6 TFlops single precision, 3 TFlops double precision

Page 18:

CPU+FPGA Heterogeneous Computing

• High performance per Watt
  – 20 GFlops/W
• Supports more parallel modes
  – Data parallelism
  – Task parallelism
  – Based on pipeline parallelism
• High density
  – Supports more FPGAs
• Supports OpenCL, easy to program
  – Xilinx, Altera
• Intel Xeon+FPGA: the FPGA communicates with the CPU over QPI

Page 19:

FPGA Apps for Traditional HPC

• Tool: UCLA CMOST (supports OpenCL)
• NAMD: Inspur in cooperation with UCLA
  – Speed-up: 8X
  – Energy gain: 120X

Data provided by UCLA

Page 20:

FPGA Apps for Deep Learning

Data provided by Baidu

Page 21:

FPGA for IDC Applications

• Inspur is researching an FPGA-based IDC application in cooperation with an internet customer, Altera, and Intel
  – Deep learning: DNN
  – Used for online speech processing
  – Runs in an IDC (Internet Data Center)
  – Platform
    • Altera FPGA
    • Intel Xeon-FPGA SDP

Page 22:

Conclusion

• Heterogeneous computing will be a trend in HPC
• Heterogeneous computing
  – CPU+GPU
  – CPU+MIC
  – CPU+FPGA
• GPU and MIC are the most mature heterogeneous computing solutions; more and more traditional HPC and deep learning applications will use them.
• FPGA addresses the performance-per-Watt and density challenges and will become widely used in HPC, deep learning, and IDC applications.
