Slide 1: Concepts of Parallel Computing
George Mozdzynski
ECMWF, March 2012
Slide 2: Outline
What is parallel computing?
Why do we need it?
Types of computer
Parallel Computers today
Challenges in parallel computing
Parallel Programming Languages
OpenMP and Message Passing
Terminology
Slide 3: What is Parallel Computing?
The simultaneous use of more than one processor or computer to solve a problem.
Slide 4: Why do we need Parallel Computing?
Serial computing is too slow
Need for large amounts of memory not accessible by a single processor
Slide 5
An IFS T2047L149 forecast model takes about 5000 seconds wall time for a 10 day forecast using 128 nodes of an IBM Power6 cluster.
How long would this model take using a fast PC with sufficient memory? (e.g. a dual core Dell desktop)
Slide 6
Answer: about 1 year. This PC would also need ~2000 GBYTES of memory (4 GB is usual).
1 year is too long for a 10 day forecast!
5000 seconds is also too long.
See www.spec.org for CPU performance data (e.g. SPECfp2006)
Slide 7: Some Terminology
Hardware:
CPU = Core = Processor = PE (Processing Element)
Socket = a chip with 1 or more cores (typically 2 or 4 today); more correctly, the socket is what the chip fits into

Software:
Process (Unix/Linux) = Task (IBM)
MPI = Message Passing Interface, standard for programming processes (tasks) on systems with distributed memory
OpenMP = standard for shared memory programming (threads)
Thread = some code / unit of work that can be scheduled (threads are cheap to start/stop compared with processes/tasks)
User Threads = tasks * (threads per task)
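To make the process/task vocabulary concrete, here is a minimal MPI sketch in Fortran (not from the original slides): each task simply reports its rank. It assumes an MPI library and the standard mpi module are available.

! Minimal MPI sketch: each task (process) reports its rank.
program hello_tasks
  use mpi
  implicit none
  integer :: ierr, rank, ntasks
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)    ! this task's id
  call MPI_Comm_size(MPI_COMM_WORLD, ntasks, ierr)  ! total number of tasks
  print '(A,I0,A,I0)', 'Hello from task ', rank, ' of ', ntasks
  call MPI_Finalize(ierr)
end program hello_tasks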
Slide 8: Amdahl's Law
Wall Time = S + P / N_CPUs

IFS Operational Forecast Model (T1279L91, 2 days, Power6):
Serial S = 114 secs, Parallel P = 1591806 secs
(calculated using Excel's LINEST function)
Amdahl's Law (formal, named after Gene Amdahl): if F is the fraction of a calculation that is sequential, and (1-F) is the fraction that can be parallelised, then the maximum speedup that can be achieved by using N processors is 1/(F + (1-F)/N).
User Threads   Actual Wall Time (secs)
   1024             1675.7
   1536             1138.0
   2048              899.9
   2560              725.1
   3072              619.7
   3584              555.8
   3840              533.3
   4096              518.8
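Plugging the fitted S and P values into the Wall Time = S + P/N model reproduces, approximately, the calculated wall times, speedups and efficiencies shown on the next slide. A minimal Fortran sketch (not part of the original slides; the numbers are taken from the fit above):

! Evaluate the fitted Amdahl model Wall Time(N) = S + P/N for the
! T1279L91 2-day forecast, using S = 114 s and P = 1591806 s.
program amdahl_fit
  implicit none
  integer, parameter :: nthreads(8) = (/1024,1536,2048,2560,3072,3584,3840,4096/)
  real(8) :: s, p, t1, tn, speedup, eff
  integer :: i
  s  = 114.0d0        ! serial seconds
  p  = 1591806.0d0    ! parallelisable seconds
  t1 = s + p          ! wall time on one "user thread"
  do i = 1, size(nthreads)
     tn      = s + p / real(nthreads(i), 8)
     speedup = t1 / tn
     eff     = 100.0d0 * speedup / real(nthreads(i), 8)
     print '(I6,F10.1,F10.1,F8.1)', nthreads(i), tn, speedup, eff
  end do
end program amdahl_fit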
Slide 9: Power6 SpeedUp and Efficiency (T1279L91 model, 2 day forecast, CY36R4)
Fitted model (secs): parallel = 1591806, serial = 114

User Threads   Actual Wall Time   Calculated Wall Time   Calculated SpeedUp   Calculated Efficiency %
      1                                1587754
   1024             1675.7                1668                  950                 92.8
   1536             1138.0                1150                 1399                 91.1
   2048              899.9                 891                 1769                 86.4
   2560              725.1                 736                 2195                 85.8
   3072              619.7                 632                 2569                 83.6
   3584              555.8                 558                 2864                 79.9
   3840              533.3                 528                 2985                 77.7
   4096              518.8                 502                 3068                 74.9

10 day forecast ~ 45 min
Slide 10: IFS speedup on Power6
[Chart: speedup against user threads (0 to 13312) for the T2047L149 and T1279L91 models, with the ideal speedup line for comparison.]
IFS world record set on 10 March 2012: a T2047L137 model ran on a CRAY XE6 (HECToR) using 53,248 cores (user threads).
Slide 11: Measuring Performance
Wall Clock
Floating point operations per second (FLOPS or FLOP/S)
- Peak (hardware), Sustained (application)
SI prefixes
- Mega   Mflops   10**6
- Giga   Gflops   10**9
- Tera   Tflops   10**12   ECMWF: 2 * 156 Tflops peak (P6)
- Peta   Pflops   10**15   2008-2010 (early systems)
- Exa, Zetta, Yotta
Instructions per second, Mips, etc.
Transactions per second (databases)
Slide 12
CRAY-2 (1985, $20M, 2 Gflop/s peak, 2 GB memory, 2 x 600 MB disk)
Slide 13
George's Antique Home PC (2005, 700, 6 Gflop/s peak, 2 GB memory, 250 GB disk)
Comparing with CRAY2
Similar performance
10,000X less expensive
200X more disk space
5,000X less power
1000X less volume
In 2012 you can buy a PC with 2-3 times the performance for about 400 (2 cores, 4 GB, 500 GB disk).
Slide 14: Types of Parallel Computer
P = Processor, M = Memory, S = Switch
[Diagram: Shared Memory (several processors attached to one memory) vs Distributed Memory (each processor with its own memory, connected by a switch).]
Slide 15: IBM Cluster (Distributed + Shared memory)
P = Processor, M = Memory, S = Switch
[Diagram: two shared-memory nodes, each with several processors sharing one memory, connected by a switch.]
Slide 16: IBM Power6 Clusters at ECMWF
This is just one of the TWO identical clusters.
Slide 17: ... and the world's fastest and largest supercomputer, the Fujitsu K computer
705,024 Sparc64 processor cores
Slide 18: ECMWF supercomputers
1979   CRAY 1A                              Vector
       CRAY XMP-2, XMP-4, YMP-8, C90-16     Vector + Shared Memory Parallel
1996   Fujitsu VPP700, Fujitsu VPP5000      Vector + MPI Parallel
2002+  IBM Cluster (P4, 5, 6, 7)            Scalar + MPI + Shared Memory Parallel
Slide 19: ECMWF's first Supercomputer
CRAY-1 (1979)
Slide 20: Types of Processor

      DO J=1,1000
        A(J) = B(J) + C
      ENDDO

Scalar processor (single instruction processes one element):
      LOAD B(J)
      FADD C
      STORE A(J)
      INCR J
      TEST

Vector processor (single instruction processes many elements):
      LOADV B -> V1
      FADDV V1,C -> V2
      STOREV V2 -> A
Slide 21: Parallel Computers Today
Fujitsu K-Computer (Sparc)
IBM BlueGene
IBM RoadRunner / Cell (PS3)
Cray XT6, XE6, AMD Opteron
Hitachi, Opteron
HP 3000, Xeon
IBM Power6 (e.g. ECMWF)
Fujitsu, SPARC
NEC SX8, SX9
SGI, Xeon
Sun, Opteron
Bull, Xeon
[Chart: systems placed on axes from less general purpose to more general purpose against performance; higher numbers of cores => less memory per core.]
Slide 22: The TOP500 project
Started in 1993
Top 500 sites reported
Report produced twice a year
- EUROPE in JUNE
- USA in NOV
Performance based on the LINPACK benchmark, dominated by matrix multiply (DGEMM)
HPC Challenge Benchmark
http://www.top500.org/
Slide 24: ECMWF in Top 500
Rmax: Tflop/sec achieved with the LINPACK Benchmark
Rpeak: Peak hardware Tflop/sec (that will never be reached!)
[Table: ECMWF systems in the list, with TFlops and kW columns.]
In the June 2012 Top 500 list ECMWF expects to have 2 Power7 clusters, EACH with ~24000 cores.
Slide 30: Why is Matrix Multiply (DGEMM) so efficient?
Vector: VL is the vector register length; the inner kernel does VL FMAs for (VL + 1) loads, so FMAs ~= LDs.
Scalar / cache: with an m x n register block, (m * n) + (m + n) < # registers; the kernel does m * n FMAs for only m + n loads, so FMAs >> LDs.
[Image: NVIDIA Tesla C1060 GPU]
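The register-blocked kernel described above lives inside the BLAS routine DGEMM itself; application code just calls the standard interface. A minimal Fortran sketch (not from the slides), assuming any BLAS library is linked in:

! Call the standard BLAS routine DGEMM to compute C = alpha*A*B + beta*C.
program dgemm_demo
  implicit none
  integer, parameter :: n = 512
  real(8) :: a(n,n), b(n,n), c(n,n)
  call random_number(a)
  call random_number(b)
  c = 0.0d0
  ! 'N','N' = no transposes; leading dimensions are all n for square matrices
  call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
  print *, 'C(1,1) =', c(1,1)
end program dgemm_demo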
Slide 31: GPU programming
GPU = Graphics Processing Unit
Programmed using CUDA, OpenCL
High performance, low power, but challenging to programme for large applications; separate memory, GPU/CPU interface
Expect GPU technology to be more easily useable on future HPCs
http://gpgpu.org/developer
See GPU talks from the ECMWF HPC workshop (final slide):
- Mark Govett (NOAA Earth System Research Laboratory), "Using GPUs to run weather prediction models"
- Tom Henderson (NOAA Earth System Research Laboratory), "Progress on the GPU parallelization and optimization of the NIM global weather model"
- Dave Norton (The Portland Group), "Accelerating weather models with GPGPU's"
Slide 32: Key Architectural Features of a Supercomputer
CPU: performance
Memory: latency / bandwidth
Interconnect: latency / bandwidth
Parallel file-system: performance
... a balancing act to achieve good sustained performance
Slide 33: What performance do Meteorological Applications achieve?
Vector computers
- About 20 to 30 percent of peak performance (single node)
- Relatively more expensive
- Also have front-end scalar nodes (compiling, post-processing)
Scalar computers
- About 5 to 10 percent of peak performance
- Relatively less expensive
Both vector and scalar computers are being used in Met/NWP Centres around the world.
Is it harder to parallelize than vectorize?
- Vectorization is mainly a compiler responsibility
- Parallelization is mainly the user's responsibility
Slide 34: Challenges in parallel computing
Parallel computers
- Have ever increasing processors, memory, performance, but
- Need more space (new computer halls = $)
- Need more power (MWs = $)
Parallel computers require/produce a lot of data (I/O)
- Require parallel file systems (GPFS, Lustre) + archive store
Applications need to scale to increasing numbers of processors; problem areas are
- Load imbalance, serial sections, global communications
Debugging parallel applications (totalview, ddt)
We are going to be using more processors in the future!
- More cores per socket, little/no clock speed improvements
Slide 35: Parallel Programming Languages?
OpenMP
- directive based
- support for Fortran 90/95/2003 and C/C++
- shared memory programming only
- http://www.openmp.org
PGAS Languages (Partitioned Global Address Space)
- UPC, CAF, Titanium, Co-array Fortran (F2008)
- one programming model for inter and intra node parallelism
MPI is not a programming language!
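As an illustration of the directive-based style, a minimal OpenMP sketch in Fortran (not from the slides): the loop iterations are divided among the threads of one shared-memory node.

! Minimal OpenMP sketch: parallelise a loop across the threads of one node.
! Compile with the compiler's OpenMP flag (e.g. -fopenmp or -qsmp=omp).
program openmp_demo
  use omp_lib
  implicit none
  integer, parameter :: n = 1000000
  real(8) :: a(n), b(n), c
  integer :: j
  b = 1.0d0
  c = 2.0d0
!$OMP PARALLEL DO PRIVATE(j) SHARED(a, b, c)
  do j = 1, n
     a(j) = b(j) + c
  end do
!$OMP END PARALLEL DO
  print *, 'threads available:', omp_get_max_threads(), ' a(n) =', a(n)
end program openmp_demo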
Slide 36: Most Parallel Programmers use ...
Fortran 90/95/2003, C/C++ with MPI for communicating between tasks (processes)
- works for applications running on shared and distributed memory systems
Fortran 90/95/2003, C/C++ with OpenMP
- for applications whose performance needs are satisfied by a single node (shared memory)
Hybrid combination of MPI/OpenMP
- ECMWF's IFS uses this approach (see the sketch below)
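A minimal sketch of the hybrid approach (not the actual IFS code): MPI tasks are placed across nodes and OpenMP threads run within each task, so user threads = tasks * (threads per task) as defined on the terminology slide.

! Hybrid MPI + OpenMP sketch: tasks across nodes, threads within a task.
program hybrid_demo
  use mpi
  use omp_lib
  implicit none
  integer :: ierr, rank, ntasks, nthreads
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, ntasks, ierr)
  nthreads = omp_get_max_threads()
  if (rank == 0) print '(A,I0,A,I0,A,I0)', 'tasks=', ntasks, &
       ' threads/task=', nthreads, ' user threads=', ntasks*nthreads
!$OMP PARALLEL
  ! each thread of each task would work on its own part of the grid here
!$OMP END PARALLEL
  call MPI_Finalize(ierr)
end program hybrid_demo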
Slide 37: More Parallel Computing (Terminology)
Cache, cache line
Domain decomposition
Halo, halo exchange (see the sketch after this list)
Load imbalance
Synchronization
Barrier
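A minimal 1-D halo exchange sketch in MPI Fortran (not from the slides): each task owns a slab with one halo point on each side and swaps halos with its left and right neighbours.

! 1-D halo exchange: swap boundary points with left/right neighbour tasks.
program halo_demo
  use mpi
  implicit none
  integer, parameter :: nloc = 100
  real(8) :: u(0:nloc+1)                  ! interior 1..nloc plus two halo points
  integer :: ierr, rank, ntasks, left, right
  integer :: status(MPI_STATUS_SIZE)
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, ntasks, ierr)
  left  = rank - 1; if (left  < 0)       left  = MPI_PROC_NULL
  right = rank + 1; if (right == ntasks) right = MPI_PROC_NULL
  u = real(rank, 8)
  ! send last interior point to the right, receive left halo from the left
  call MPI_Sendrecv(u(nloc), 1, MPI_DOUBLE_PRECISION, right, 0, &
                    u(0),    1, MPI_DOUBLE_PRECISION, left,  0, &
                    MPI_COMM_WORLD, status, ierr)
  ! send first interior point to the left, receive right halo from the right
  call MPI_Sendrecv(u(1),      1, MPI_DOUBLE_PRECISION, left,  1, &
                    u(nloc+1), 1, MPI_DOUBLE_PRECISION, right, 1, &
                    MPI_COMM_WORLD, status, ierr)
  call MPI_Finalize(ierr)
end program halo_demo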
Slide 38: Cache
P = Processor, C = Cache, M = Memory
[Diagram: one processor with a single cache in front of memory, and two processors each with a private C1 cache sharing a C2 cache in front of memory.]
Slide 39: IBM Power architecture (3 levels of $)
[Diagram: four pairs of processors; within each pair every processor has a private C1 cache and the pair shares a C2 cache; all pairs share a C3 cache in front of memory.]
Slide 41: IFS Grid-Point Calculations (an example of blocking for cache)
      DO J=1, NGPTOT, NPROMA
         CALL GP_CALCS
      ENDDO

      SUB GP_CALCS
         DO I=1,NPROMA
            ...
         ENDDO
      END

U(NGPTOT,NLEV), where NGPTOT = NLAT * NLON and NLEV = vertical levels
Lots of work, independent for each J
[Diagram: the NLAT x NLON grid is swept in NPROMA-sized blocks; short blocks suit scalar/cache processors, long blocks suit vector processors.]
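A runnable sketch of the blocking pattern above (array and routine names are illustrative, not the real IFS code): grid columns are processed NPROMA at a time so the working set stays in cache, and independent blocks can also be shared out over OpenMP threads.

! NPROMA blocking sketch (illustrative names, not the real IFS routines):
! process grid points in cache-sized blocks, one block per loop iteration,
! and give independent blocks to different OpenMP threads.
program nproma_blocking
  implicit none
  integer, parameter :: nlev = 91, ngptot = 20000, nproma = 32
  real(8) :: u(ngptot, nlev)
  integer :: jstart, jlen
  u = 0.0d0
!$OMP PARALLEL DO PRIVATE(jstart, jlen)
  do jstart = 1, ngptot, nproma
     jlen = min(nproma, ngptot - jstart + 1)   ! last block may be short
     call gp_calcs(u(jstart:jstart+jlen-1, :), jlen)
  end do
!$OMP END PARALLEL DO
  print *, 'sum =', sum(u)
contains
  subroutine gp_calcs(ublock, npts)
    integer, intent(in) :: npts
    real(8), intent(inout) :: ublock(npts, nlev)
    integer :: i, k
    do k = 1, nlev
       do i = 1, npts                          ! work on one block of points
          ublock(i, k) = ublock(i, k) + 1.0d0
       end do
    end do
  end subroutine gp_calcs
end program nproma_blocking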
Slide 42: Grid point space blocking for Cache (Power5)
RAPS9 FC T799L91, 192 tasks x 4 threads
[Chart: wall time in seconds (roughly 200 to 550) against grid space blocking factor NPROMA (1 to 1000, log scale); the minimum marks the optimal trade-off between cache use and subroutine call overhead.]
Slide 43: T799 FC 192x4 (10 runs)
[Chart: wall time in seconds (roughly 226 to 246) against NPROMA (20 to 60) for 10 runs.]
Slide 44: T799 FC 192x4 (average)
[Chart: percent of wall time (0% to 7%) against NPROMA (20 to 60), averaged over the runs.]
Slide 45: TL799 1024 tasks 2D partitioning
2D partitioning results in a non-optimal Semi-Lagrangian comms requirement at the poles and the equator!
Square shaped partitions are better than rectangular shaped partitions.
[Diagram: an MPI task partition with the departure, mid-point and arrival points of a semi-Lagrangian trajectory marked.]
Slide 46: eq_regions algorithm
Slide 48: eq_regions partitioning T799 1024 tasks
N_REGIONS( 1)= 1
N_REGIONS( 2)= 7
N_REGIONS( 3)= 13
N_REGIONS( 4)= 19
N_REGIONS( 5)= 25
N_REGIONS( 6)= 31
N_REGIONS( 7)= 35
N_REGIONS( 8)= 41
N_REGIONS( 9)= 45
N_REGIONS(10)= 48
N_REGIONS(11)= 52
N_REGIONS(12)= 54
N_REGIONS(13)= 56
N_REGIONS(14)= 56
N_REGIONS(15)= 58
N_REGIONS(16)= 56
N_REGIONS(17)= 56
N_REGIONS(18)= 54
N_REGIONS(19)= 52
N_REGIONS(20)= 48
N_REGIONS(21)= 45
N_REGIONS(22)= 41
N_REGIONS(23)= 35
N_REGIONS(24)= 31
N_REGIONS(25)= 25
N_REGIONS(26)= 19
N_REGIONS(27)= 13
N_REGIONS(28)= 7
N_REGIONS(29)= 1
Slide 49: IFS physics computational imbalance (T799L91, 384 tasks)
~11% imbalance in physics, ~5% imbalance (total)
Slide 50
http://en.wikipedia.org/wiki/Parallel_computing