2018/11/16

NEC SX Aurora TSUBASA migration workshop

CAU Kiel

Dr. Jens-Olaf Beismann

Senior Benchmarking Analyst

NEC Deutschland GmbH

Agenda

10:00 – 10:30 SX Aurora TSUBASA: technology overview

10:30 – 11:30 Migration to SX-Aurora TSUBASA: compilers, libraries, …

11:30 – 12:00 RZ Kiel: overview

Lunch break

13:00 – 17:00 Hands-on session: porting, run-time, performance

SX Aurora TSUBASA

Technology overview

4 © NEC Deutschland GmbH 2018

• Dedicated vector processor
• High memory bandwidth
• Commodity processors
• De facto standard x86/Linux environment

Brand-new Vector Supercomputer

1.22TB/s / processor, 150GB/s / core

Fortran/C/C++ programming, OpenMP

Automatic vectorization/parallelization

High sustained performance on x86/Linux environment

TSUBASA: meaning “wing” in Japanese

Strategy

[Diagram: the user's applications, libraries and tools sit on a Linux OS environment, which runs on the Vector Engine together with x86 peripherals. The open Linux environment (Linux assets) is combined with high performance: VE high performance on x86/Linux.]

Architecture

Aurora Architecture

• x86 node + Vector Engine (VE)
• VE capability is provided in an x86/Linux environment

[Diagram: an x86 server (x86 node) running Linux OS, with VE OS, hosts the Vector Engine. SX-Aurora TSUBASA form factors: desktop tower, rack-mount servers, supercomputer.]

Inherited and Changed

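[Diagram comparing previous SX systems with SX-Aurora TSUBASA. Previous SX: the application runs on Super-UX; the processor cores (SPU and VPU), memory and storage belong to one system. SX-Aurora TSUBASA: the application runs under Linux with VE OS; the processor cores (SPU and VPU) and memory are on the VE (Vector Engine), while x86 and storage are on the VH (Vector Host).]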

GPGPU and VE

GPGPU Architecture

x86 (memory) and GPGPU (memory), connected by PCIe

The application (AP) runs on the x86 host under its OS and calls CUDA functions on the GPGPU: exec, data transmission, result transmission, exit

Frequent PCIe transmission

Disadvantages:
• PCIe bottleneck
• Small memory
• Programming difficulty

Aurora Architecture

x86 (memory) and VE (memory), connected by PCIe

Whole AP is executed on VE: exec starts processing and exit ends processing; OS functions (I/O, etc.) are handled by the OS on the x86 side

Advantages:
• Avoiding PCIe bottleneck
• Larger memory
• Standard language

Processor

VE1.0 Spec.

cores/CPU           8
core performance    ~307 GF (DP), ~614 GF (SP)
CPU performance     ~2.45 TF (DP), ~4.91 TF (SP)
cache capacity      16 MB shared
memory bandwidth    1.22 TB/s
memory capacity     24, 48 GB

[Diagram labels: cores at 307 GF each, 2.45 TF in total (@1.6 GHz); 0.4 TB/s; 3 TB/s; software-controllable cache, 16 MB; 1.22 TB/s; HBM2 memory x 6.]

Product

Processor

■ Developed by NEC
■ World’s highest memory bandwidth

Card

Products of SX-Aurora TSUBASA

Configurations: 1 VE, 2 VE, 4 VE, 8 VE

Models: A100 Tower, A300 Server, A500 DLC Supercomputer

Core Architecture

[Diagram: one core consists of a Scalar Processing Unit (SPU) and a vector unit drawn as 32 identical pipeline sets, each containing VFMA0, VFMA1, VFMA2, ALU0, ALU1 and DIV.]

Memory bandwidth: 1.22 TB/s / processor (avg. 150 GB/s / core)

400 GB/s / core

Single core peak performance: 268.8 GF = 32 flops/cycle x 2 (FMA) x 3 x 1.4 GHz

CAU Kiel : 17.2 TF
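
The peak figures on this slide and on the processor slide follow from simple arithmetic. The short Python sketch below reproduces them. The function and variable names are illustrative, and the reading of the CAU Kiel figure as 64 cores at 1.4 GHz (for example eight 8-core VEs) is an assumption; the slides do not state the configuration.

```python
def core_peak_gflops(clock_ghz, flops_per_cycle=32, flops_per_fma=2, fma_units=3):
    """Peak GFLOPS of one VE core: 32 flops/cycle x 2 (FMA) x 3 x clock in GHz."""
    return flops_per_cycle * flops_per_fma * fma_units * clock_ghz

core_14 = core_peak_gflops(1.4)   # single-core peak at 1.4 GHz (268.8 GF on the slide)
core_16 = core_peak_gflops(1.6)   # single-core peak at 1.6 GHz (~307 GF on the slide)
ve_16_tf = 8 * core_16 / 1000     # 8 cores per VE at 1.6 GHz, in TF

# Assumption: 64 cores at 1.4 GHz, e.g. 8 VEs with 8 cores each.
kiel_tf = 64 * core_14 / 1000

# 1.22 TB/s shared by 8 cores gives the "Ave. 150GB/s / core" figure.
bandwidth_per_core_gbs = 1220 / 8

print(f"core @1.4 GHz : {core_14:.1f} GF")
print(f"core @1.6 GHz : {core_16:.1f} GF")
print(f"VE   @1.6 GHz : {ve_16_tf:.2f} TF")
print(f"64 cores      : {kiel_tf:.1f} TF")
print(f"bandwidth/core: {bandwidth_per_core_gbs:.1f} GB/s")
```

Under that assumption the 64-core total comes out at about 17.2 TF, which matches the number quoted for CAU Kiel.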

Characteristics

[Chart: positioning of Xeon®, GPGPU and Vector Engine. Axes: memory bandwidth / processor (standard to high spec.) and language specification (standard to special).]

Fundamental Benchmarks

• STREAM: VE delivers the highest sustained memory bandwidth per node
• HPL: VE provides competitive FLOPS capability

[Charts: HPL / node and STREAM / node]

• VE provides sustained HPL performance in the same range as SKL/KNL
• VE provides the highest memory bandwidth

Performance/Price

• High price competitiveness
  - The highest sustained STREAM performance / price
  - Competitive sustained HPL performance / price

[Charts: HPL / price and STREAM / price]

• VE provides sustained HPL performance/price in the same range as Intel products
• VE provides the highest memory bandwidth/price

HPCG

• VE provides high HPCG performance per node and per price
• HPL and STREAM are the two bookends of the benchmark spectrum, and HPCG sits between them

[Charts: HPCG / node and HPCG / price as performance ratios of Aurora and SKL, each annotated "3x"; performance characteristics ranging from performance bound to memory bandwidth bound.]

SX Aurora TSUBASA

User environment

Usability

Programming Environment

Vector Cross Compiler: automatic vectorization, automatic parallelization

Fortran: F2003, F2008 (partially)
C: C11
C++: C++14
OpenMP: OpenMP 4.5
MPI: MPI 3.1

$ vi sample.c
$ ncc sample.c

Execution Environment

$ a.out

[Diagram: execution, with VE and VH]

Compilers

Cross compilers :

nfort

ncc

nc++

Tools :

nld, nar, nranlib, …

MPI wrappers :

mpinfort

mpincc

mpinc++

Programming Environment

• NEC supports the latest language standards along with GNU compatibility

▌C/C++
• ISO/IEC 9899:2011 (aka C11)
• ISO/IEC 14882:2014 (aka C++14)

▌Fortran
• ISO/IEC 1539-1:2004 (aka Fortran 2003)
• ISO/IEC 1539-1:2010 (aka Fortran 2008)

▌OpenMP
• Version 4.5

▌Libraries
• libc
• MPI Version 3.1 (fully tuned for Aurora architecture)
• Numeric libraries (BLAS, FFT, LAPACK, etc.)

▌Tools
• GNU Profiler (gprof)
• GNU Debugger (gdb), Eclipse Parallel Tools Platform (PTP)
• FtraceViewer / PROGINF

Options

Previous SX       SX-Aurora TSUBASA

-Caopt            -O4
-Chopt            -O3
-Cvopt            -O2
-Cvsafe           -O1
-Cnoopt           -O0
-Omove            -fmove-loop-invariants-unsafe
-Onomovediv       -fmove-loop-invariants
-Onomove          -fno-move-loop-invariants
-Popenmp          -fopenmp
-pi,auto          -finline-functions    # no cross-file inlining
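
When porting build scripts, the option pairs listed above can be applied mechanically. The following Python sketch is one possible way to do that, assuming the left-hand entries are options of the previous SX compilers and the right-hand entries their SX-Aurora TSUBASA counterparts. The helper name translate_flags and the sample flag string are hypothetical; this is not an NEC tool.

```python
# Option pairs copied from the table above (left: old option, right: new option).
OPTION_MAP = {
    "-Caopt": "-O4",
    "-Chopt": "-O3",
    "-Cvopt": "-O2",
    "-Cvsafe": "-O1",
    "-Cnoopt": "-O0",
    "-Omove": "-fmove-loop-invariants-unsafe",
    "-Onomovediv": "-fmove-loop-invariants",
    "-Onomove": "-fno-move-loop-invariants",
    "-Popenmp": "-fopenmp",
    "-pi,auto": "-finline-functions",  # note: no cross-file inlining
}

def translate_flags(flags):
    """Replace options found in the table; pass everything else through unchanged."""
    return [OPTION_MAP.get(flag, flag) for flag in flags]

# Hypothetical legacy flag string taken from a makefile.
old_flags = "-Chopt -Popenmp -Onomove -I./include".split()
print(" ".join(translate_flags(old_flags)))
```

Options that are not in the table are left untouched, so the result still needs a manual review.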

Options

Compiler diagnostics / listings

-fdiag-vector=0|1|2      # more or less detailed vectorization diagnostics

-fdiag-parallel=0|1|2

-fdiag-inline=0|1|2

-report-all              # get both diagnostics and formatted listing in .L file

Default type size

-fdefault-integer=4|8

-fdefault-real=4|8

-fdefault-double=8|16

Cache usage

-mretain-[all|list-vector|none]

Directives

!CDIR …            !NEC$ …

nodep              ivdep
expand=n           unroll(n)
move               move_unsafe
nomovediv          move
nomove             nomove
outerunroll=n      outerloop_unroll(n)
unroll=n           unroll(n)

• Directive conversion tool: nfdirconv

Libraries

▌NEC Library provides a wide variety of functions
• NEC library is fully tuned for Aurora architecture

                                     NEC Lib   MKL
BLAS                                 ✓         ✓
LAPACK                               ✓         ✓
ScaLAPACK                            ✓         ✓
FFT                                  ✓         ✓
Random number generators             ✓         ✓
Direct sparse solvers                ✓         ✓
Iterative sparse solvers             ✓         ✓
Functions for Statistics             ✓         ✓
Spline functions                     ✓         ✓
Special functions                    ✓
Approximation and Interpolation      ✓
Numerical Differentials/Integrals    ✓
Roots of Equations                   ✓
Time series analysis                 ✓
Sorting and ranking                  ✓

UNIX system function interface

To use extension subroutines such as GETARG, FLUSH, ABORT, …, compile with

-use F90_UNIX[,F90_UNIX_ENV,…]

See Fortran Users’ Guide 8.2 for details

Endianness

SX-Aurora TSUBASA is little-endian! (Former SX systems were big-endian.)

export VE_FORT_UFMTENDIAN=10,11 (ALL)

Sets the unit numbers of unformatted files that are to be treated as big-endian. When the value is ALL, the setting applies to all unit numbers. Two or more unit numbers can be specified, separated by commas.

GNU Fortran extension: convert specifier

open(10, file='test.dat', form='unformatted', &
     convert='big_endian')

• Non-standard Fortran, but supported by nfort
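
To see why unformatted files written on a former SX system need this treatment, it helps to look at the raw bytes. The Python sketch below is only an illustration of byte order using the standard struct module; it does not read Fortran record structure and is independent of the NEC tools.

```python
import struct

value = 1.0

# The same 64-bit floating-point number in both byte orders.
little = struct.pack("<d", value)   # little-endian, as on SX-Aurora TSUBASA and x86
big = struct.pack(">d", value)      # big-endian, as written by former SX systems

print("little-endian bytes:", little.hex())
print("big-endian bytes   :", big.hex())

# Interpreting big-endian bytes as little-endian gives a wrong value without any error.
misread = struct.unpack("<d", big)[0]
correct = struct.unpack(">d", big)[0]

print("misread:", misread)
print("correct:", correct)
```

The misread value is a valid floating-point number, so nothing fails at read time; this is why wrong endianness usually shows up as nonsensical data rather than as an I/O error.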

Correctness

Run-time errors…?

Compile with -traceback

export VE_TRACEBACK=FULL|ALL

reduce optimization

-fcheck=bounds (all)

Stack/heap initialization

-minit-stack=zero|nan

export VE_INIT_HEAP=ZERO|NAN

-mno-vector-fma

export VE_FPE_ENABLE=(DIV,FOF,FUF,INV,INE)

(debugger)

Debugger

[Screenshot of the debugger GUI. Panes: process sets, functions, standard output, standard error output, source code, variables, stack trace.]

Eclipse Parallel Tools Platform (PTP) VE plugin provides a GUI debugging environment

PROGINF performance information

Compile with "-proginf"

export VE_PROGINF=YES/DETAIL

******** Program Information ********

Real Time (sec) : 8783.021690

User Time (sec) : 8753.527852

Vector Time (sec) : 4959.702777

Inst. Count : 8018493848355

V. Inst. Count : 1081598389267

V. Element Count : 221178430822266

V. Load Element Count : 51426036073697

FLOP Count : 140842692526140

MOPS : 29663.826745

MOPS (Real) : 29444.837925

MFLOPS : 16155.280253

MFLOPS (Real) : 16036.016282

A. V. Length : 204.492197

V. Op. Ratio (%) : 97.317633

L1 Cache Miss (sec) : 801.540473

CPU Port Conf. (sec) : 2.435367

V. Arith. Exec. (sec) : 2355.046188

V. Load Exec. (sec) : 2346.286974

VLD LLC Hit Element Ratio (%) : 79.566662

Power Throttling (sec) : 0.000000

Thermal Throttling (sec) : 0.000000

Memory Size Used (MB) : 10956.000000

Start Time (date) : Tue Nov 6 13:05:18 2018 CET

End Time (date) : Tue Nov 6 15:31:41 2018 CET
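
Some of the reported values can be cross-checked from the raw counters in the listing. The Python sketch below recomputes the average vector length and the share of user time spent in vector instructions. The variable names are illustrative, and the formulas are the straightforward ratios assumed here, not an official definition taken from the NEC documentation.

```python
# Counter values copied from the PROGINF output above.
user_time = 8753.527852              # User Time (sec)
vector_time = 4959.702777            # Vector Time (sec)
v_inst_count = 1081598389267         # V. Inst. Count
v_element_count = 221178430822266    # V. Element Count

# Average vector length: elements processed per vector instruction.
avg_vector_length = v_element_count / v_inst_count

# Share of the user time spent executing vector instructions, in percent.
vector_time_share = 100.0 * vector_time / user_time

print(f"A. V. Length      : {avg_vector_length:.6f}")
print(f"Vector time share : {vector_time_share:.1f} %")
```

The recomputed average vector length agrees with the reported "A. V. Length" of about 204.49, and roughly 57 % of the user time is vector time in this run.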

configure

autoconf: https://github.com/SX-Aurora/autoconf-helper

configure command:

./configure CC=ncc CXX=nc++ FC=nfort F90=nfort \

AR=nar LD=nld AS=nas --host=ve-nec-linux

CMake toolchain (example): https://github.com/SX-Aurora/CMake-toolchain-file

Documentation

Available at www.rz.uni-kiel.de/de/angebote/hiperf/nec-sx-aurora-tsubasa

Documentation

▌Official Documentation: http://www.nec.com/en/global/prod/hpc/aurora/document

▌Official Software: http://www.nec.com/en/global/prod/hpc/aurora/ve-software

▌Open Source Software: https://github.com/SX-Aurora

Aurora Forum Website

Aurora Forum community website

Visit https://www.hpc.nec and join our quest

Better communication through BBS, let’s discuss openly!

[Forum sections shown: Roadmap, request, Porting, Evaluation Report, Tuning]

About posting: please like good posts

• Post text and arbitrary files (Word, PPT, PDF, pictures, movies, software, etc.); there is no limitation on the file type.

Let’s develop a better Aurora together!