Page 1: James Lin - HPC Advisory Council › events › 2013 › china-workshop › ...OpenMP Accelerator Model,” IWOMP13 Missing Features of OpenMP4.0 • Multiple Device Support • Combined

Early Experience with OpenACC and OpenMP4 on π, the supercomputer of SJTU

Visiting Associate

Satoshi MATSUOKA Laboratory

Tokyo Institute of Technology

James Lin

Vice Director

Center for High Performance Computing

Shanghai Jiao Tong University

HPC Advisory 2013@Guilin

Outline

1. π, the supercomputer of SJTU

2. People behind π

3. Research on π

– Background

– OpenACC

– OpenMP for Accelerators

4. Summary

π, the supercomputer of SJTU

• No. 1 among China MOE universities, No. 158 on the TOP500 list in June 2013

– Plan to upgrade some accelerators in 2015, with V2.0 around 2017

• Specification

– Type: INSPUR TS10000

– Performance: 830 Intel SNB E5-2670 + 100 NVIDIA Kepler K20 + 10 Intel Xeon Phi 5110P = 196.2/262.6 TFlops

– Nodes: Intel EPSD JP/WP

– Interconnect: Mellanox InfiniBand FDR 56 Gbps, 648 ports

– Parallel File Storage: DDN SFA12K 720 TB with Lustre

– SSD: 80 Intel SSDs, 400 GB each

– Cooling System: Rittal Liquid

• Applications

– Open Source: Gromacs, LAMMPS…

– In-house: PIC, NS3D, DSMC…

Milestones in Year 2013

[Timeline figure: Apr, June, Oct. Milestones shown: Assembled, Early Access Program, Submit LINPACK to TOP500, Production]

Why did we build a GPU-Phi cluster?

• Mostly based on academic considerations; we are quite confident that the GPUs and Xeon Phis can be fully used

– Large-scale GPU supercomputers: Titan, TSUBAME 2.5

– Large-scale Xeon Phi supercomputers: Tianhe-2, Stampede

• Accelerators could be a path to Exascale

• More applications are ready for GPU in this generation

• We will keep an open mind about the next generation: Maxwell and Knights Landing

Single source code base for GPU and Xeon Phi?

• Low Level: OpenCL

• High Level: Directive-based Programming

• Just like Java for Windows and Linux in the early days

Outline

1. π, the supercomputer of SJTU

2. People behind π

3. Research on π

– Background

– OpenACC

– OpenMP for Accelerators

4. Summary

Staff

Adjuncts

User Committee

Outline

1. π, the supercomputer of SJTU

2. People behind π

3. Research on π

– Background

– OpenACC

– OpenMP for Accelerators

4. Summary

Current Research Topics on π

• Directive-based Programming on Accelerators

– Ninja gap [1]

• Application Performance Evaluation and Optimization on Accelerators

– Particle in Cell

– Molecular Dynamics

[1] N. Satish, C. Kim, J. Chhugani, H. Saito, R. Krishnaiyer, M. Smelyanskiy, M. Girkar, and P. Dubey, “Can traditional programming bridge the Ninja performance gap for parallel computing applications?,” ISCA 2012

Why Directive-based Programming?

• There are many reasons to use directive-based programming, but for me the main one is that it keeps the code readable and maintainable from the application developers' point of view

[Figure: application developers write Version 1; CUDA experts port it to CUDA Version 1, which is unmaintainable to application developers; Version 2 is based on the developers' own version]

Directive-based Programming for Accelerators [1]

• Standards

– OpenACC

– OpenMP >= 4.0

• Products

– PGI Accelerator

– HMPP

• Research projects

– R-Stream

– HiCUDA

– OpenMPC/OpenMP for accelerators

[1] S. Lee and J. S. Vetter, “Early evaluation of directive-based GPU programming models for productive exascale computing,” in Proc. International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2012, pp. 1–11.

Outline

1. π, the supercomputer of SJTU

2. People behind π

3. Research on π

– Background

– OpenACC

– OpenMP for Accelerators

4. Summary

OpenACC

• The standard evolves much faster than OpenMP and MPI: a new major version roughly every 1.5 years

– Maybe too fast for compiler vendors and application developers to keep up

• NVIDIA proposes a "Fork and Merge" model for OpenACC to OpenMP

Standard / Version   1.0         2.0         3.0         4.0
OpenACC              Nov 2011    July 2013   --          --
OpenMP-Fortran       Oct 1997    Nov 2000    May 2008    July 2013
MPI                  June 1994   July 1997   Sept 2012   --

EPCC OpenACC Benchmark Suite

• Version 1.0 was released this summer

– Developed by Nick Johnson of EPCC

– Source code is available on GitHub

• Divided into 3 sections

– Level A benchmarks mainly measure the speed of operation of various OpenACC functions

– Level B benchmarks measure the performance of some BLAS-type kernels

– Level C benchmarks are real application codes

• Himeno

• 27stencil

• Le2d

http://www.epcc.ed.ac.uk/research/computing/performance-characterisation-and-benchmarking/epcc-openacc-benchmark-suite

Preliminary Results

1. In most cases, CUDA on the K20 runs faster than OpenCL on the MIC

2. The exceptions are Gaussian and Himeno

Roofline Model [1]

• We apply the Roofline model here to explore whether arithmetic intensity is related to performance portability

• Which kinds of applications can achieve good performance portability?

[1] S. Williams, A. Waterman, and D. Patterson, “Roofline: an insightful visual performance model for multicore architectures,” Commun. ACM, vol. 52, no. 4, p. 65, Apr. 2009.

Intensity of Benchmark

Level A     SpMV          GEMM          GESUMM   2MM          3MM           ATAX         BICG         MVT
Intensity   0.06          4.41~122.04   0.24     3.28~62.35   6.10~101.67   0.49         0.49         0.49

Level B     SYRK          SYR2K         2DConv   3DConv       COR           COV          Pathfinder
Intensity   5.28~135.96   4.97~192.02   0.708    1.244        2.77~50.87    2.00~51.15   0.09

Level C     Hotspot       Gaussian      27Stencil   Himeno    Le2d
Intensity   0.50          0.24          1.47        1.78      1.63
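As a rough illustration of how these intensities feed the Roofline model, the sketch below computes the attainable performance min(peak, intensity × bandwidth) and the ridge point of a device. It assumes the intensities are in Flops per byte and uses the single-precision peak and memory bandwidth figures from the hardware slide later in this talk; the function names `roofline_gflops` and `ridge_point` are hypothetical and not from the talk.

```c
#include <assert.h>

/* Attainable performance in GFlops under the Roofline model:
 * the smaller of the compute roof and the memory roof.
 * Assumption: intensity is given in Flops per byte of memory traffic. */
double roofline_gflops(double peak_gflops, double bandwidth_gbs,
                       double intensity_flops_per_byte)
{
    assert(peak_gflops > 0.0 && bandwidth_gbs > 0.0);
    assert(intensity_flops_per_byte >= 0.0);
    double memory_roof = intensity_flops_per_byte * bandwidth_gbs;
    return memory_roof < peak_gflops ? memory_roof : peak_gflops;
}

/* Ridge point: the intensity above which a kernel is compute bound
 * rather than memory bound on the given device. */
double ridge_point(double peak_gflops, double bandwidth_gbs)
{
    assert(peak_gflops > 0.0 && bandwidth_gbs > 0.0);
    return peak_gflops / bandwidth_gbs;
}
```

With the Tesla K20m figures (3520 GFlops, 208 GB/s), `roofline_gflops(3520.0, 208.0, 1.78)` bounds Himeno at about 370 GFlops, while the Xeon Phi 5110P figures (2022 GFlops, 320 GB/s) give about 570 GFlops. Under the same assumptions the ridge points are about 16.9 Flops/byte for the K20m and 6.3 Flops/byte for the Xeon Phi, so most Level C codes sit on the memory-bound side of both devices.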

Roofline Model Analysis

Outline

1. π, the supercomputer of SJTU

2. People behind π

3. Research on π

– Background

– OpenACC

– OpenMP for Accelerators

4. Summary

OpenMP 4.0 for Accelerators

• Released in July 2013, it supports directive-based programming on accelerators such as GPUs and Xeon Phi

• Directives for

– Parallelism: target/parallel

– Data: target data

– Two levels of parallelism: teams/distribute

• A standard supported by Intel; NVIDIA support is planned
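To make the three groups of directives concrete, here is a minimal AXPY sketch that uses one directive from each group. It is an illustration under assumptions, not code from the talk: the function name `axpy_target` is hypothetical, and the constructs are deliberately written separately rather than as one combined directive. Built without OpenMP support the pragmas are ignored and the loop runs serially on the host.

```c
#include <assert.h>
#include <stddef.h>

/* y = a*x + y, offloaded with OpenMP 4.0 accelerator directives. */
void axpy_target(int n, float a, const float *x, float *y)
{
    assert(n >= 0 && x != NULL && y != NULL);

    /* Data: keep x and y resident on the device for the whole region. */
    #pragma omp target data map(to: x[0:n]) map(tofrom: y[0:n])
    {
        /* Parallelism: run the loop on the device. The arrays are already
         * present, so these map clauses do not trigger another copy. */
        #pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])
        /* Two levels: teams/distribute splits iterations across teams,
         * parallel for shares each team's chunk among its threads. */
        #pragma omp teams
        #pragma omp distribute parallel for
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }
}
```

On a GPU the teams level typically maps to thread blocks and the parallel level to threads within a block, which is why the standard exposes both.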

Code Example: Jacobi Kernel with OpenMP4.0
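The kernel itself is not reproduced in this transcript. The following is a minimal sketch of what a Jacobi kernel can look like with the OpenMP 4.0 accelerator directives; it is not the code shown in the talk. The function name `jacobi_sweeps`, the row-major grid layout, the 5-point stencil and the fixed number of sweeps (instead of a convergence test) are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Runs `iters` Jacobi sweeps on an n x m row-major grid `u`.
 * Boundary values stay fixed; `unew` is scratch space of the same size. */
void jacobi_sweeps(int n, int m, int iters, float *u, float *unew)
{
    assert(n >= 3 && m >= 3 && iters >= 0 && u != NULL && unew != NULL);

    /* Map the grid once so it stays on the device across all sweeps. */
    #pragma omp target data map(tofrom: u[0:n*m]) map(alloc: unew[0:n*m])
    for (int it = 0; it < iters; ++it) {
        /* Each interior point becomes the average of its four neighbours. */
        #pragma omp target map(tofrom: u[0:n*m]) map(alloc: unew[0:n*m])
        #pragma omp teams
        #pragma omp distribute parallel for
        for (int i = 1; i < n - 1; ++i)
            for (int j = 1; j < m - 1; ++j)
                unew[i*m + j] = 0.25f * (u[(i-1)*m + j] + u[(i+1)*m + j]
                                       + u[i*m + j - 1] + u[i*m + j + 1]);

        /* Copy the interior back so the next sweep reads the new values. */
        #pragma omp target map(tofrom: u[0:n*m]) map(alloc: unew[0:n*m])
        #pragma omp teams
        #pragma omp distribute parallel for
        for (int i = 1; i < n - 1; ++i)
            for (int j = 1; j < m - 1; ++j)
                u[i*m + j] = unew[i*m + j];
    }
}
```

Enclosing the sweep loop in `target data` is the main design choice: without it, each `target` region would move the whole grid between host and device on every sweep.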

HOMP, OpenMP compiler for CUDA

• Developed by LLNL, it is built on ROSE [1], a source-to-source compiler

• It is an early research implementation [2] of OpenMP 4.0 on GPU for CUDA 5.0

• Currently supports C/C++ only

[1] http://rosecompiler.org

[2] C. Liao, Y. Yan, B. R. de Supinski, D. J. Quinlan, and B. Chapman, “Early Experiences with the OpenMP Accelerator Model,” IWOMP 2013

Missing Features of OpenMP4.0

• Multiple Device Support

• Combined Constructs Separate

• No-Middle-Sync

• Array Sections

• Global Barrier

• Mapping Nested Loops

• Mapped Data Reuse

C. Liao, Y. Yan, B. R. de Supinski, D. J. Quinlan, and B. Chapman, “Early Experiences with the OpenMP Accelerator Model,” IWOMP 2013

Experiment Methodology

[Figure: OpenMP 4.0 test cases (AXPY, Jacobi, MM) built with HOMP-ROSE for the Kepler K20m and with the Intel Compiler for the MIC 5110P]

Test cases

• AXPY

• Jacobi

• Matrix Multiplication

Software used in π

• HOMP

• CUDA 5.0

• Intel Compiler

Hardware used in π

                    Xeon E5-2670   Xeon Phi 5110P       Tesla K20m
Performance (SP)    333 GFlops     2022 GFlops          3520 GFlops
Memory Bandwidth    51.2 GB/s      320 GB/s (ECC off)   208 GB/s (ECC off)
Memory Size         ---            8 GB                 5 GB
Number of cores     8              60/61                2496
Clock Speed         2.6 GHz        1.053 GHz            0.706 GHz

Anatomy of a GPU/Phi Node in π

Outline

1. π, the supercomputer of SJTU

2. People behind π

3. Research on π

– Background

– OpenACC

– OpenMP for Accelerators

4. Summary

Summary

• π, the supercomputer of SJTU, is a GPU-Phi hybrid cluster

• Directive-based programming approaches, such as OpenACC and OpenMP 4.0, have the potential to let a single source run on both GPU and Xeon Phi

Contact Information

Website http://hpc.sjtu.edu.cn

Blog http://ccoe.sjtu.edu.cn

Weibo http://e.weibo.com/sjtuhpc

Email [email protected]

http://icsc2014.sjtu.edu.cn/

May 7-9, 2014 @SJTU

Pre-conference GPU tutorial, May 5-6, 2014

• Parallel Programming tutorial, V.Morozov

• GPU and Beyond, L.Grinberg

• Hybrid/GPU Programming, TH Tang