Slide 1: Concepts of Parallel Computing
George Mozdzynski
ECMWF, March 2012
Slide 2: Outline
What is parallel computing?
Why do we need it?
Types of computer
Parallel Computers today
Challenges in parallel computing
Parallel Programming Languages
OpenMP and Message Passing
Terminology
Slide 3: What is Parallel Computing?
The simultaneous use of more than one processor or computer to solve a problem.
Slide 4: Why do we need Parallel Computing?
Serial computing is too slow
Need for large amounts of memory not accessible by a single processor
Slide 5
An IFS T2047L149 forecast model takes about 5000 seconds wall time for a 10 day forecast using 128 nodes of an IBM Power6 cluster.
How long would this model take using a fast PC with sufficient memory? (e.g. a dual core Dell desktop)
Slide 6
Answer: about 1 year. This PC would also need ~2000 GBYTES of memory (4 GB is usual).
1 year is too long for a 10 day forecast!
5000 seconds is also too long.
See www.spec.org for CPU performance data (e.g. SPECfp2006)
Slide 7: Some Terminology
Hardware:
CPU = Core = Processor = PE (Processing Element)
Socket = a chip with 1 or more cores (typically 2 or 4 today); more correctly, the socket is what the chip fits into

Software:
Process (Unix/Linux) = Task (IBM)
MPI = Message Passing Interface, standard for programming processes (tasks) on systems with distributed memory
OpenMP = standard for shared memory programming (threads)
Thread = some code / unit of work that can be scheduled (threads are cheap to start/stop compared with processes/tasks)
User Threads = tasks * (threads per task)
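To make the process/task vocabulary concrete, here is a minimal MPI sketch in Fortran (not from the original slides): each task simply reports its rank. It assumes an MPI library and the standard mpi module are available.

! Minimal MPI sketch: each task (process) reports its rank.
program hello_tasks
  use mpi
  implicit none
  integer :: ierr, rank, ntasks
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)    ! this task's id
  call MPI_Comm_size(MPI_COMM_WORLD, ntasks, ierr)  ! total number of tasks
  print '(A,I0,A,I0)', 'Hello from task ', rank, ' of ', ntasks
  call MPI_Finalize(ierr)
end program hello_tasks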
Slide 8: Amdahl's Law
Wall Time = S + P / N_CPUs

IFS Operational Forecast Model (T1279L91, 2 days, Power6):
Serial S = 114 secs, Parallel P = 1591806 secs
(calculated using Excel's LINEST function)
Amdahl's Law (formal, named after Gene Amdahl): if F is the fraction of a calculation that is sequential, and (1-F) is the fraction that can be parallelised, then the maximum speedup that can be achieved by using N processors is 1/(F + (1-F)/N).
User Threads   Actual Wall Time (secs)
   1024             1675.7
   1536             1138.0
   2048              899.9
   2560              725.1
   3072              619.7
   3584              555.8
   3840              533.3
   4096              518.8
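Plugging the fitted S and P values into the Wall Time = S + P/N model reproduces, approximately, the calculated wall times, speedups and efficiencies shown on the next slide. A minimal Fortran sketch (not part of the original slides; the numbers are taken from the fit above):

! Evaluate the fitted Amdahl model Wall Time(N) = S + P/N for the
! T1279L91 2-day forecast, using S = 114 s and P = 1591806 s.
program amdahl_fit
  implicit none
  integer, parameter :: nthreads(8) = (/1024,1536,2048,2560,3072,3584,3840,4096/)
  real(8) :: s, p, t1, tn, speedup, eff
  integer :: i
  s  = 114.0d0        ! serial seconds
  p  = 1591806.0d0    ! parallelisable seconds
  t1 = s + p          ! wall time on one "user thread"
  do i = 1, size(nthreads)
     tn      = s + p / real(nthreads(i), 8)
     speedup = t1 / tn
     eff     = 100.0d0 * speedup / real(nthreads(i), 8)
     print '(I6,F10.1,F10.1,F8.1)', nthreads(i), tn, speedup, eff
  end do
end program amdahl_fit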
Slide 9: Power6 SpeedUp and Efficiency (T1279L91 model, 2 day forecast, CY36R4)
Fitted model (secs): parallel = 1591806, serial = 114

User Threads   Actual Wall Time   Calculated Wall Time   Calculated SpeedUp   Calculated Efficiency %
      1                                1587754
   1024             1675.7                1668                  950                 92.8
   1536             1138.0                1150                 1399                 91.1
   2048              899.9                 891                 1769                 86.4
   2560              725.1                 736                 2195                 85.8
   3072              619.7                 632                 2569                 83.6
   3584              555.8                 558                 2864                 79.9
   3840              533.3                 528                 2985                 77.7
   4096              518.8                 502                 3068                 74.9

10 day forecast ~ 45 min
Slide 10: IFS speedup on Power6
[Chart: speedup against user threads (0 to 13312) for the T2047L149 and T1279L91 models, with the ideal speedup line for comparison.]
IFS world record set on 10 March 2012: a T2047L137 model ran on a CRAY XE6 (HECToR) using 53,248 cores (user threads).
Slide 11: Measuring Performance
Wall Clock
Floating point operations per second (FLOPS or FLOP/S)
- Peak (hardware), Sustained (application)
SI prefixes
- Mega   Mflops   10**6
- Giga   Gflops   10**9
- Tera   Tflops   10**12   ECMWF: 2 * 156 Tflops peak (P6)
- Peta   Pflops   10**15   2008-2010 (early systems)
- Exa, Zetta, Yotta
Instructions per second, Mips, etc.
Transactions per second (databases)
Slide 12
CRAY-2 (1985, $20M, 2 Gflop/s peak, 2 GB memory, 2 x 600 MB disk)
Slide 13
George's Antique Home PC (2005, 700, 6 Gflop/s peak, 2 GB memory, 250 GB disk)
Comparing with CRAY2
Similar performance
10,000X less expensive
200X more disk space
5,000X less power
1000X less volume
In 2012 you can buy a PC with 2-3 times the performance for about 400 (2 cores, 4 GB, 500 GB disk).
Slide 14: Types of Parallel Computer
P = Processor, M = Memory, S = Switch
[Diagram: Shared Memory (several processors attached to one memory) vs Distributed Memory (each processor with its own memory, connected by a switch).]
Slide 15: IBM Cluster (Distributed + Shared memory)
P = Processor, M = Memory, S = Switch
[Diagram: two shared-memory nodes, each with several processors sharing one memory, connected by a switch.]
Slide 16: IBM Power6 Clusters at ECMWF
This is just one of the TWO identical clusters.
Slide 17: ... and the world's fastest and largest supercomputer, the Fujitsu K computer
705,024 Sparc64 processor cores
Slide 18: ECMWF supercomputers
1979   CRAY 1A                              Vector
       CRAY XMP-2, XMP-4, YMP-8, C90-16     Vector + Shared Memory Parallel
1996   Fujitsu VPP700, Fujitsu VPP5000      Vector + MPI Parallel
2002+  IBM Cluster (P4, 5, 6, 7)            Scalar + MPI + Shared Memory Parallel
Slide 19: ECMWF's first Supercomputer
CRAY-1 (1979)
Slide 20: Types of Processor

      DO J=1,1000
        A(J) = B(J) + C
      ENDDO

Scalar processor (single instruction processes one element):
      LOAD B(J)
      FADD C
      STORE A(J)
      INCR J
      TEST

Vector processor (single instruction processes many elements):
      LOADV B -> V1
      FADDV V1,C -> V2
      STOREV V2 -> A
Slide 21: Parallel Computers Today
Fujitsu K-Computer (Sparc)
IBM BlueGene
IBM RoadRunner / Cell (PS3)
Cray XT6, XE6, AMD Opteron
Hitachi, Opteron
HP 3000, Xeon
IBM Power6 (e.g. ECMWF)
Fujitsu, SPARC
NEC SX8, SX9
SGI, Xeon
Sun, Opteron
Bull, Xeon
[Chart: systems placed on axes from less general purpose to more general purpose against performance; higher numbers of cores => less memory per core.]
Slide 22: The TOP500 project
Started in 1993
Top 500 sites reported
Report produced twice a year
- EUROPE in JUNE
- USA in NOV
Performance based on the LINPACK benchmark, dominated by matrix multiply (DGEMM)
HPC Challenge Benchmark
http://www.top500.org/
Slide 24: ECMWF in Top 500
Rmax: Tflop/sec achieved with the LINPACK Benchmark
Rpeak: Peak hardware Tflop/sec (that will never be reached!)
[Table: ECMWF systems in the list, with TFlops and kW columns.]
In the June 2012 Top 500 list ECMWF expects to have 2 Power7 clusters, EACH with ~24000 cores.
Slide 30: Why is Matrix Multiply (DGEMM) so efficient?
Vector: VL is the vector register length; the inner kernel does VL FMAs for (VL + 1) loads, so FMAs ~= LDs.
Scalar / cache: with an m x n register block, (m * n) + (m + n) < # registers; the kernel does m * n FMAs for only m + n loads, so FMAs >> LDs.
[Image: NVIDIA Tesla C1060 GPU]
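The register-blocked kernel described above lives inside the BLAS routine DGEMM itself; application code just calls the standard interface. A minimal Fortran sketch (not from the slides), assuming any BLAS library is linked in:

! Call the standard BLAS routine DGEMM to compute C = alpha*A*B + beta*C.
program dgemm_demo
  implicit none
  integer, parameter :: n = 512
  real(8) :: a(n,n), b(n,n), c(n,n)
  call random_number(a)
  call random_number(b)
  c = 0.0d0
  ! 'N','N' = no transposes; leading dimensions are all n for square matrices
  call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
  print *, 'C(1,1) =', c(1,1)
end program dgemm_demo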
Slide 31: GPU programming
GPU = Graphics Processing Unit
Programmed using CUDA, OpenCL
High performance, low power, but challenging to programme for large applications; separate memory, GPU/CPU interface
Expect GPU technology to be more easily useable on future HPCs
http://gpgpu.org/developer
See GPU talks from the ECMWF HPC workshop (final slide):
- Mark Govett (NOAA Earth System Research Laboratory), "Using GPUs to run weather prediction models"
- Tom Henderson (NOAA Earth System Research Laboratory), "Progress on the GPU parallelization and optimization of the NIM global weather model"
- Dave Norton (The Portland Group), "Accelerating weather models with GPGPU's"
Slide 32: Key Architectural Features of a Supercomputer
CPU: performance
Memory: latency / bandwidth
Interconnect: latency / bandwidth
Parallel file-system: performance
... a balancing act to achieve good sustained performance
Slide 33: What performance do Meteorological Applications achieve?
Vector computers
- About 20 to 30 percent of peak performance (single node)
- Relatively more expensive
- Also have front-end scalar nodes (compiling, post-processing)
Scalar computers
- About 5 to 10 percent of peak performance
- Relatively less expensive
Both vector and scalar computers are being used in Met/NWP Centres around the world.
Is it harder to parallelize than vectorize?
- Vectorization is mainly a compiler responsibility
- Parallelization is mainly the user's responsibility
Slide 34: Challenges in parallel computing
Parallel computers
- Have ever increasing processors, memory, performance, but
- Need more space (new computer halls = $)
- Need more power (MWs = $)
Parallel computers require/produce a lot of data (I/O)
- Require parallel file systems (GPFS, Lustre) + archive store
Applications need to scale to increasing numbers of processors; problem areas are
- Load imbalance, serial sections, global communications
Debugging parallel applications (totalview, ddt)
We are going to be using more processors in the future!
- More cores per socket, little/no clock speed improvements
Slide 35: Parallel Programming Languages?
OpenMP
- directive based
- support for Fortran 90/95/2003 and C/C++
- shared memory programming only
- http://www.openmp.org
PGAS Languages (Partitioned Global Address Space)
- UPC, CAF, Titanium, Co-array Fortran (F2008)
- one programming model for inter and intra node parallelism
MPI is not a programming language!
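As an illustration of the directive-based style, a minimal OpenMP sketch in Fortran (not from the slides): the loop iterations are divided among the threads of one shared-memory node.

! Minimal OpenMP sketch: parallelise a loop across the threads of one node.
! Compile with the compiler's OpenMP flag (e.g. -fopenmp or -qsmp=omp).
program openmp_demo
  use omp_lib
  implicit none
  integer, parameter :: n = 1000000
  real(8) :: a(n), b(n), c
  integer :: j
  b = 1.0d0
  c = 2.0d0
!$OMP PARALLEL DO PRIVATE(j) SHARED(a, b, c)
  do j = 1, n
     a(j) = b(j) + c
  end do
!$OMP END PARALLEL DO
  print *, 'threads available:', omp_get_max_threads(), ' a(n) =', a(n)
end program openmp_demo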
Slide 36: Most Parallel Programmers use ...
Fortran 90/95/2003, C/C++ with MPI for communicating between tasks (processes)
- works for applications running on shared and distributed memory systems
Fortran 90/95/2003, C/C++ with OpenMP
- for applications whose performance needs are satisfied by a single node (shared memory)
Hybrid combination of MPI/OpenMP
- ECMWF's IFS uses this approach (see the sketch below)
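A minimal sketch of the hybrid approach (not the actual IFS code): MPI tasks are placed across nodes and OpenMP threads run within each task, so user threads = tasks * (threads per task) as defined on the terminology slide.

! Hybrid MPI + OpenMP sketch: tasks across nodes, threads within a task.
program hybrid_demo
  use mpi
  use omp_lib
  implicit none
  integer :: ierr, rank, ntasks, nthreads
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, ntasks, ierr)
  nthreads = omp_get_max_threads()
  if (rank == 0) print '(A,I0,A,I0,A,I0)', 'tasks=', ntasks, &
       ' threads/task=', nthreads, ' user threads=', ntasks*nthreads
!$OMP PARALLEL
  ! each thread of each task would work on its own part of the grid here
!$OMP END PARALLEL
  call MPI_Finalize(ierr)
end program hybrid_demo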
Slide 37: More Parallel Computing (Terminology)
Cache, cache line
Domain decomposition
Halo, halo exchange (see the sketch after this list)
Load imbalance
Synchronization
Barrier
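A minimal 1-D halo exchange sketch in MPI Fortran (not from the slides): each task owns a slab with one halo point on each side and swaps halos with its left and right neighbours.

! 1-D halo exchange: swap boundary points with left/right neighbour tasks.
program halo_demo
  use mpi
  implicit none
  integer, parameter :: nloc = 100
  real(8) :: u(0:nloc+1)                  ! interior 1..nloc plus two halo points
  integer :: ierr, rank, ntasks, left, right
  integer :: status(MPI_STATUS_SIZE)
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, ntasks, ierr)
  left  = rank - 1; if (left  < 0)       left  = MPI_PROC_NULL
  right = rank + 1; if (right == ntasks) right = MPI_PROC_NULL
  u = real(rank, 8)
  ! send last interior point to the right, receive left halo from the left
  call MPI_Sendrecv(u(nloc), 1, MPI_DOUBLE_PRECISION, right, 0, &
                    u(0),    1, MPI_DOUBLE_PRECISION, left,  0, &
                    MPI_COMM_WORLD, status, ierr)
  ! send first interior point to the left, receive right halo from the right
  call MPI_Sendrecv(u(1),      1, MPI_DOUBLE_PRECISION, left,  1, &
                    u(nloc+1), 1, MPI_DOUBLE_PRECISION, right, 1, &
                    MPI_COMM_WORLD, status, ierr)
  call MPI_Finalize(ierr)
end program halo_demo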
Slide 38: Cache
P = Processor, C = Cache, M = Memory
[Diagram: one processor with a single cache in front of memory, and two processors each with a private C1 cache sharing a C2 cache in front of memory.]
Slide 39: IBM Power architecture (3 levels of $)
[Diagram: four pairs of processors; within each pair every processor has a private C1 cache and the pair shares a C2 cache; all pairs share a C3 cache in front of memory.]
Slide 41: IFS Grid-Point Calculations (an example of blocking for cache)
      DO J=1, NGPTOT, NPROMA
         CALL GP_CALCS
      ENDDO

      SUB GP_CALCS
         DO I=1,NPROMA
            ...
         ENDDO
      END

U(NGPTOT,NLEV), where NGPTOT = NLAT * NLON and NLEV = vertical levels
Lots of work, independent for each J
[Diagram: the NLAT x NLON grid is swept in NPROMA-sized blocks; short blocks suit scalar/cache processors, long blocks suit vector processors.]
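A runnable sketch of the blocking pattern above (array and routine names are illustrative, not the real IFS code): grid columns are processed NPROMA at a time so the working set stays in cache, and independent blocks can also be shared out over OpenMP threads.

! NPROMA blocking sketch (illustrative names, not the real IFS routines):
! process grid points in cache-sized blocks, one block per loop iteration,
! and give independent blocks to different OpenMP threads.
program nproma_blocking
  implicit none
  integer, parameter :: nlev = 91, ngptot = 20000, nproma = 32
  real(8) :: u(ngptot, nlev)
  integer :: jstart, jlen
  u = 0.0d0
!$OMP PARALLEL DO PRIVATE(jstart, jlen)
  do jstart = 1, ngptot, nproma
     jlen = min(nproma, ngptot - jstart + 1)   ! last block may be short
     call gp_calcs(u(jstart:jstart+jlen-1, :), jlen)
  end do
!$OMP END PARALLEL DO
  print *, 'sum =', sum(u)
contains
  subroutine gp_calcs(ublock, npts)
    integer, intent(in) :: npts
    real(8), intent(inout) :: ublock(npts, nlev)
    integer :: i, k
    do k = 1, nlev
       do i = 1, npts                          ! work on one block of points
          ublock(i, k) = ublock(i, k) + 1.0d0
       end do
    end do
  end subroutine gp_calcs
end program nproma_blocking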
Slide 42: Grid point space blocking for Cache (Power5)
RAPS9 FC T799L91, 192 tasks x 4 threads
[Chart: wall time in seconds (roughly 200 to 550) against grid space blocking factor NPROMA (1 to 1000, log scale); the minimum marks the optimal trade-off between cache use and subroutine call overhead.]
Slide 43: T799 FC 192x4 (10 runs)
[Chart: wall time in seconds (roughly 226 to 246) against NPROMA (20 to 60) for 10 runs.]
Slide 44: T799 FC 192x4 (average)
[Chart: percent of wall time (0% to 7%) against NPROMA (20 to 60), averaged over the runs.]
Slide 45: TL799 1024 tasks 2D partitioning
2D partitioning results in a non-optimal Semi-Lagrangian comms requirement at the poles and the equator!
Square shaped partitions are better than rectangular shaped partitions.
[Diagram: an MPI task partition with the departure, mid-point and arrival points of a semi-Lagrangian trajectory marked.]
Slide 46: eq_regions algorithm
Slide 48: eq_regions partitioning T799 1024 tasks
N_REGIONS( 1)= 1
N_REGIONS( 2)= 7
N_REGIONS( 3)= 13
N_REGIONS( 4)= 19
N_REGIONS( 5)= 25
N_REGIONS( 6)= 31
N_REGIONS( 7)= 35
N_REGIONS( 8)= 41
N_REGIONS( 9)= 45
N_REGIONS(10)= 48
N_REGIONS(11)= 52
N_REGIONS(12)= 54
N_REGIONS(13)= 56
N_REGIONS(14)= 56
N_REGIONS(15)= 58
N_REGIONS(16)= 56
N_REGIONS(17)= 56
N_REGIONS(18)= 54
N_REGIONS(19)= 52
N_REGIONS(20)= 48
N_REGIONS(21)= 45
N_REGIONS(22)= 41
N_REGIONS(23)= 35
N_REGIONS(24)= 31
N_REGIONS(25)= 25
N_REGIONS(26)= 19
N_REGIONS(27)= 13
N_REGIONS(28)= 7
N_REGIONS(29)= 1
Slide 49: IFS physics computational imbalance (T799L91, 384 tasks)
~11% imbalance in physics, ~5% imbalance (total)
Slide 50
http://en.wikipedia.org/wiki/Parallel_computing