TRANSCRIPT

  • Slide 1: Concepts of Parallel Computing

    George Mozdzynski

    March 2012

  • Slide 2: Outline

    What is parallel computing?
    Why do we need it?
    Types of computer
    Parallel Computers today
    Challenges in parallel computing
    Parallel Programming Languages
    OpenMP and Message Passing
    Terminology

  • Slide 3: What is Parallel Computing?

    The simultaneous use of more than one processor or computer to solve a
    problem.

  • Slide 4: Why do we need Parallel Computing?

    Serial computing is too slow.
    Need for large amounts of memory not accessible by a single processor.

  • Slide 5

    An IFS T2047L149 forecast model takes about 5000 seconds wall time for a
    10 day forecast using 128 nodes of an IBM Power6 cluster.

    How long would this model take using a fast PC with sufficient memory?
    (e.g. a dual core Dell desktop)

  • Slide 6

    Ans. About 1 year! This PC would also need ~2000 GBYTES of memory
    (4 GB is usual).

    1 year is too long for a 10 day forecast!
    5000 seconds is also too long.

    See www.spec.org for CPU performance data (e.g. SPECfp2006)
    http://www.spec.org/
  • Slide 7: Some Terminology

    Hardware:
    CPU = Core = Processor = PE (Processing Element)
    Socket = a chip with 1 or more cores (typically 2 or 4 today); more
    correctly, the socket is what the chip fits into.

    Software:
    Process (Unix/Linux) = Task (IBM)
    MPI = Message Passing Interface, standard for programming processes
    (tasks) on systems with distributed memory
    OpenMP = standard for shared memory programming (threads)
    Thread = some code / unit of work that can be scheduled (threads are
    cheap to start/stop compared with processes/tasks)
    User Threads = tasks * (threads per task)
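    As an illustration of the terminology above (a minimal sketch, not from
    the original slides; the program and format are arbitrary), here is a
    hybrid MPI + OpenMP "hello" in Fortran. Each MPI task (process) starts a
    team of OpenMP threads, so the total number of user threads is
    tasks * (threads per task).

      ! Illustrative only: one line of output per user thread.
      program hybrid_hello
         use mpi
         use omp_lib
         implicit none
         integer :: ierr, rank, ntasks, provided

         ! Ask for an MPI library that tolerates threaded callers
         call MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)
         call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
         call MPI_Comm_size(MPI_COMM_WORLD, ntasks, ierr)

      !$OMP PARALLEL
         print '(A,I0,A,I0,A,I0,A,I0)', 'task ', rank, ' of ', ntasks, &
               ', thread ', omp_get_thread_num(), ' of ', omp_get_num_threads()
      !$OMP END PARALLEL

         call MPI_Finalize(ierr)
      end program hybrid_hello

    Run with, e.g., 2 tasks of 4 threads each (mpirun -np 2 with
    OMP_NUM_THREADS=4) and it prints 8 lines, i.e. 8 user threads.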

  • Slide 8: Amdahl's Law

    Wall Time = S + P/N_CPUs

    IFS Operational Forecast Model (T1279L91, 2 days, Power6):
    Serial   S = 114 secs
    Parallel P = 1591806 secs
    (calculated using Excel's LINEST function)

    Amdahl's Law (formal), named after Gene Amdahl: if F is the fraction of
    a calculation that is sequential, and (1-F) is the fraction that can be
    parallelised, then the maximum speedup that can be achieved by using N
    processors is 1/(F + (1-F)/N).

    User Threads    Actual Wall Time (secs)
            1024    1675.7
            1536    1138.0
            2048     899.9
            2560     725.1
            3072     619.7
            3584     555.8
            3840     533.3
            4096     518.8
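    A quick worked check (not on the original slide) using the fitted values
    S = 114 s and P = 1591806 s:

        T(N) = S + \frac{P}{N} = 114 + \frac{1591806}{1024} \approx 1668\ \text{s},

    which matches the "calculated wall time" for 1024 user threads on the
    next slide. In Amdahl's form the sequential fraction is

        F = \frac{S}{S+P} = \frac{114}{1591920} \approx 7.2\times10^{-5},

    so the speedup is bounded by 1/F \approx 1.4\times10^{4} however many
    processors are used.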

  • Slide 9: Power6 SpeedUp and Efficiency (T1279L91 model, 2 day forecast, CY36R4)

    Fit: parallel = 1591806, serial = 114

    User Threads   Actual Wall Time   Calculated Wall Time   Calculated SpeedUp   Calculated Efficiency %
               1                                   1587754
            1024             1675.7                   1668                  950                  92.8
            1536             1138.0                   1150                 1399                  91.1
            2048              899.9                    891                 1769                  86.4
            2560              725.1                    736                 2195                  85.8
            3072              619.7                    632                 2569                  83.6
            3584              555.8                    558                 2864                  79.9
            3840              533.3                    528                 2985                  77.7
            4096              518.8                    502                 3068                  74.9

    10 day forecast ~ 45 min
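    For reference (not spelled out on the slide), the derived columns appear
    consistent with the usual definitions:

        \text{SpeedUp}(N) \approx \frac{S+P}{T_{\text{actual}}(N)}, \qquad
        \text{Efficiency}(N) = \frac{\text{SpeedUp}(N)}{N},

    e.g. 1591920 / 1675.7 \approx 950 and 950 / 1024 \approx 92.8\% for the
    1024-thread row.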

  • Slide 10: IFS speedup on Power6

    [Plot: speedup against user threads (0 to ~13312), with an "ideal" line
    and curves for the T2047L149 and T1279L91 models.]

    IFS world record set on 10 March 2012: a T2047L137 model ran on a CRAY
    XE6 (HECToR) using 53,248 cores (user threads).

  • Slide 11: Measuring Performance

    Wall Clock
    Floating point operations per second (FLOPS or FLOP/S)
    - Peak (Hardware), Sustained (Application)
    SI prefixes
    - Mega  Mflops  10**6
    - Giga  Gflops  10**9
    - Tera  Tflops  10**12    ECMWF: 2 * 156 Tflops peak (P6)
    - Peta  Pflops  10**15    2008-2010 (early systems)
    - Exa, Zetta, Yotta
    Instructions per second, Mips, etc.
    Transactions per second (Databases)
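    To make the difference between peak and sustained flop/s concrete, here
    is a minimal timing sketch (an illustration, not an official benchmark;
    the loop, sizes and names are arbitrary): it times a simple multiply-add
    loop with SYSTEM_CLOCK and reports Gflop/s, which on a real machine
    comes out well below the hardware peak.

      program sustained_flops
         implicit none
         integer, parameter :: n = 5000000, nrep = 100
         real(8), allocatable :: x(:), y(:)
         real(8) :: a, secs
         integer(8) :: t0, t1, rate
         integer :: i, irep
         allocate(x(n), y(n))
         x = 1.0d0;  y = 2.0d0;  a = 1.0000001d0
         call system_clock(t0, rate)
         do irep = 1, nrep
            do i = 1, n
               y(i) = y(i) + a*x(i)      ! 2 floating point operations per element
            end do
         end do
         call system_clock(t1)
         secs = dble(t1 - t0) / dble(rate)
         print *, 'sustained Gflop/s ~', 2.0d0*dble(n)*dble(nrep) / secs / 1.0d9
         print *, 'y(1) =', y(1)         ! use the result so the loop is not optimised away
      end program sustained_flops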

  • Slide 12: CRAY2 (1985, $20M, 2 Gflop/s peak, 2 GB Memory, 2 x 600 MB Disk)

  • Slide 13: George's Antique Home PC (2005, 700, 6 Gflop/s peak, 2 GB Memory, 250 GB Disk)

    Comparing with the CRAY2:
    Similar performance
    10,000X less expensive
    200X more disk space
    5,000X less power
    1000X less volume

    In 2012 you can buy a PC with 2-3 times the performance for about 400
    (2 core, 4 GB, 500 GB disk).

  • Slide 14: Types of Parallel Computer

    [Diagrams; P = Processor, M = Memory, S = Switch]

    Shared Memory: several processors (P) attached to a single memory (M).
    Distributed Memory: each processor has its own memory, and the
    processor-memory pairs are connected by a switch (S).

  • Slide 15: IBM Cluster (Distributed + Shared memory)

    [Diagram; P = Processor, M = Memory, S = Switch: each node contains
    several processors sharing one memory, and the nodes are connected by
    a switch.]

  • Slide 16: IBM Power6 Clusters at ECMWF

    [Photo] This is just one of the TWO identical clusters.

  • Slide 17: ...and the world's fastest and largest supercomputer, the Fujitsu K computer

    705,024 Sparc64 processor cores

  • Slide 18: ECMWF supercomputers

    1979   CRAY 1A                     Vector
           CRAY XMP-2
           CRAY XMP-4                  Vector + Shared Memory Parallel
           CRAY YMP-8
           CRAY C90-16
    1996   Fujitsu VPP700              Vector + MPI Parallel
           Fujitsu VPP5000
    2002+  IBM Cluster (P4,5,6,7)      Scalar + MPI + Shared Memory Parallel

  • Slide 19: ECMWF's first Supercomputer, the CRAY-1 (1979)

  • Slide 20: Types of Processor

        DO J=1,1000
          A(J) = B(J) + C
        ENDDO

    SCALAR PROCESSOR            VECTOR PROCESSOR
      LOAD B(J)                   LOADV  B -> V1
      FADD C                      FADDV  V1,C -> V2
      STORE A(J)                  STOREV V2 -> A
      INCR J
      TEST

    Scalar: a single instruction processes one element.
    Vector: a single instruction processes many elements.

  • Slide 21: Parallel Computers Today

    Fujitsu K-Computer (Sparc)
    IBM BlueGene
    IBM RoadRunner / Cell (PS3)
    Cray XT6, XE6 (AMD Opteron)
    Hitachi (Opteron)
    HP 3000 (Xeon)
    IBM Power6 (e.g. ECMWF)
    Fujitsu (SPARC)
    NEC SX8, SX9
    SGI (Xeon)
    Sun (Opteron)
    Bull (Xeon)

    [Chart: the systems are arranged from less general purpose to more
    general purpose against performance; higher numbers of cores => less
    memory per core.]

  • Slide 22: The TOP500 project

    Started in 1993; the top 500 sites are reported.
    Report produced twice a year: EUROPE in JUNE, USA in NOV.
    Performance based on the LINPACK benchmark, dominated by matrix
    multiply (DGEMM).
    HPC Challenge Benchmark

    http://www.top500.org/

  • Slide 23

  • Slide 24: ECMWF in the Top 500

    R max: Tflop/sec achieved with the LINPACK Benchmark
    R peak: peak hardware Tflop/sec (that will never be reached!)

    [Chart: TFlops and KW for ECMWF systems in the Top 500 lists.]

    In the June 2012 Top 500 list ECMWF expect to have 2 Power7 clusters,
    EACH with ~24000 cores.

  • Slide 25

  • Slide 26

  • Slide 27

  • Slide 28

  • Slide 29

  • Slide 30: Why is Matrix Multiply (DGEMM) so efficient?

    VECTOR (VL is the vector register length): a VL x 1 column update does
    VL FMAs for (VL + 1) LDs, so FMAs ~= LDs.

    SCALAR / CACHE: an m x n block of results held in registers does
    m * n FMAs for only m + n LDs, provided (m * n) + (m + n) < # registers,
    so FMAs >> LDs.

    [Photo: NVIDIA Tesla C1060 GPU]
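    As an illustration (a sketch under the slide's assumptions, not ECMWF or
    library code; the routine name and 4 x 2 block shape are arbitrary), the
    Fortran kernel below holds a 4 x 2 block of C in registers: each trip of
    the k loop does m*n = 8 multiply-adds against only m+n = 6 loads.

      ! Hypothetical 4x2 register-blocked DGEMM inner kernel (illustration only).
      ! Updates C(i:i+3, j:j+1) += A(i:i+3, 1:K) * B(1:K, j:j+1).
      subroutine dgemm_4x2_block(A, B, C, lda, ldb, ldc, K, i, j)
         implicit none
         integer, intent(in) :: lda, ldb, ldc, K, i, j
         real(8), intent(in)    :: A(lda,*), B(ldb,*)
         real(8), intent(inout) :: C(ldc,*)
         real(8) :: c11, c21, c31, c41, c12, c22, c32, c42
         real(8) :: a1, a2, a3, a4, b1, b2
         integer :: kk
         c11 = C(i  ,j);   c21 = C(i+1,j);   c31 = C(i+2,j);   c41 = C(i+3,j)
         c12 = C(i  ,j+1); c22 = C(i+1,j+1); c32 = C(i+2,j+1); c42 = C(i+3,j+1)
         do kk = 1, K
            a1 = A(i  ,kk); a2 = A(i+1,kk)        ! m = 4 loads from A
            a3 = A(i+2,kk); a4 = A(i+3,kk)
            b1 = B(kk,j);   b2 = B(kk,j+1)        ! n = 2 loads from B
            c11 = c11 + a1*b1; c21 = c21 + a2*b1  ! m*n = 8 fused multiply-adds,
            c31 = c31 + a3*b1; c41 = c41 + a4*b1  ! all on values already in registers
            c12 = c12 + a1*b2; c22 = c22 + a2*b2
            c32 = c32 + a3*b2; c42 = c42 + a4*b2
         end do
         C(i  ,j)   = c11; C(i+1,j)   = c21; C(i+2,j)   = c31; C(i+3,j)   = c41
         C(i  ,j+1) = c12; C(i+1,j+1) = c22; C(i+2,j+1) = c32; C(i+3,j+1) = c42
      end subroutine dgemm_4x2_block

    Real DGEMM libraries (and the LINPACK benchmark built on them) add cache
    blocking on top of this register blocking, which is why they run close
    to peak.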

  • Slide 31: GPU programming

    GPU = Graphics Processing Unit
    Programmed using CUDA, OpenCL.
    High performance, low power, but challenging to program for large
    applications; separate memory, GPU/CPU interface.
    Expect GPU technology to be more easily usable on future HPCs.

    http://gpgpu.org/developer

    See GPU talks from the ECMWF HPC workshop (final slide):
    Mark Govett (NOAA Earth System Research Laboratory), "Using GPUs to run
    weather prediction models"
    Tom Henderson (NOAA Earth System Research Laboratory), "Progress on the
    GPU parallelization and optimization of the NIM global weather model"
    Dave Norton (The Portland Group), "Accelerating weather models with GPGPUs"

  • Slide 32: Key Architectural Features of a Supercomputer

    CPU performance
    Memory latency / bandwidth
    Interconnect latency / bandwidth
    Parallel file-system performance

    ...a balancing act to achieve good sustained performance.

  • Slide 33: What performance do Meteorological Applications achieve?

    Vector computers
    - About 20 to 30 percent of peak performance (single node)
    - Relatively more expensive
    - Also have front-end scalar nodes (compiling, post-processing)

    Scalar computers
    - About 5 to 10 percent of peak performance
    - Relatively less expensive

    Both vector and scalar computers are being used in Met/NWP centres
    around the world.

    Is it harder to parallelize than vectorize?
    - Vectorization is mainly a compiler responsibility
    - Parallelization is mainly the user's responsibility

  • Slide 34: Challenges in parallel computing

    Parallel computers
    - have ever increasing processors, memory, performance, but
    - need more space (new computer halls = $)
    - need more power (MWs = $)

    Parallel computers require/produce a lot of data (I/O)
    - require parallel file systems (GPFS, Lustre) + archive store

    Applications need to scale to increasing numbers of processors;
    problem areas are
    - load imbalance, serial sections, global communications

    Debugging parallel applications (totalview, ddt)

    We are going to be using more processors in the future!
    More cores per socket, little/no clock speed improvements.

  • Slide 35: Parallel Programming Languages?

    OpenMP
    - directive based
    - support for Fortran 90/95/2003 and C/C++
    - shared memory programming only
    - http://www.openmp.org

    PGAS Languages (Partitioned Global Address Space)
    - UPC, CAF, Titanium, Co-array Fortran (F2008)
    - one programming model for inter and intra node parallelism
      (see the co-array sketch after this slide)

    MPI is not a programming language!
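    A minimal co-array sketch (an illustration, not from the slides; the
    program is arbitrary): with Fortran 2008 co-arrays each image keeps its
    own copy of x, and image 1 reads the other images' copies directly, so
    the same partitioned-global-address-space code covers both intra-node
    and inter-node parallelism.

      program caf_sketch
         implicit none
         integer :: x[*]          ! co-array: one instance on every image
         integer :: i, total
         x = this_image()         ! each image stores its own number (1..num_images)
         sync all                 ! make all writes visible before remote reads
         if (this_image() == 1) then
            total = 0
            do i = 1, num_images()
               total = total + x[i]   ! one-sided remote read, no explicit messages
            end do
            print *, 'sum of image numbers =', total
         end if
      end program caf_sketch

    It needs a compiler with co-array support; the number of images then
    plays the role that the number of MPI tasks plays elsewhere in these
    slides.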

  • Slide 36: Most Parallel Programmers use...

    Fortran 90/95/2003, C/C++ with MPI for communicating between tasks
    (processes)
    - works for applications running on shared and distributed memory systems

    Fortran 90/95/2003, C/C++ with OpenMP
    - for applications whose performance needs are satisfied by a single
      node (shared memory)

    Hybrid combination of MPI/OpenMP
    - ECMWF's IFS uses this approach

  • Slide 37: More Parallel Computing (Terminology)

    Cache, cache line
    Domain decomposition
    Halo, halo exchange (see the sketch after this list)
    Load imbalance
    Synchronization
    Barrier
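    As an illustration of a halo exchange (a minimal sketch assuming a 1-D
    domain decomposition; the routine and argument names are hypothetical,
    not IFS code): each task owns n interior points plus one halo cell at
    each end, filled from its neighbours with MPI_Sendrecv.

      subroutine halo_exchange(u, n, left, right, comm)
         use mpi
         implicit none
         integer, intent(in) :: n, left, right, comm  ! neighbour ranks (MPI_PROC_NULL at the domain ends)
         real(8), intent(inout) :: u(0:n+1)           ! u(0) and u(n+1) are halo cells
         integer :: ierr, status(MPI_STATUS_SIZE)

         ! send my rightmost interior point to the right, receive my left halo
         call MPI_Sendrecv(u(n), 1, MPI_DOUBLE_PRECISION, right, 0, &
                           u(0), 1, MPI_DOUBLE_PRECISION, left,  0, &
                           comm, status, ierr)
         ! send my leftmost interior point to the left, receive my right halo
         call MPI_Sendrecv(u(1),   1, MPI_DOUBLE_PRECISION, left,  1, &
                           u(n+1), 1, MPI_DOUBLE_PRECISION, right, 1, &
                           comm, status, ierr)
      end subroutine halo_exchange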

  • Slide 38: Cache

    [Diagram; P = Processor, C = Cache, M = Memory: a single processor with
    one cache in front of memory, and a two-processor case where each
    processor has its own level-1 cache (C1) and both share a level-2
    cache (C2) in front of memory.]
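    A small illustration (not from the slides) of why caches and cache lines
    matter to the programmer: Fortran stores arrays column-major, so running
    the inner loop over the first index walks memory contiguously, one cache
    line at a time, instead of striding through memory.

      program loop_order
         implicit none
         integer, parameter :: n = 2000
         real(8) :: a(n,n)
         integer :: i, j
         do j = 1, n          ! outer loop over columns
            do i = 1, n       ! inner loop over contiguous elements (cache friendly)
               a(i,j) = dble(i + j)
            end do
         end do
         print *, a(n,n)
      end program loop_order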

  • Slide 39: IBM Power architecture (3 levels of $)

    [Diagram: pairs of processors (P) each have private level-1 caches (C1)
    and share a level-2 cache (C2); several such pairs share a level-3
    cache (C3) in front of memory.]

  • Slide 40

  • Slide 41: IFS Grid-Point Calculations (an example of blocking for cache)

        DO J=1, NGPTOT, NPROMA
          CALL GP_CALCS
        ENDDO

        SUBROUTINE GP_CALCS
          DO I=1,NPROMA
          ENDDO
        END

    [Figure: the grid-point array U(NGPTOT,NLEV), where NGPTOT = NLAT * NLON
    and NLEV = vertical levels, with block layouts labelled "Scalar" and
    "Vector".]

    Lots of work, independent for each J.
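    A runnable sketch of the same blocking idea (an assumption about the
    details; gp_calcs, the grid sizes and the NPROMA value are illustrative,
    not the real IFS code): work over NGPTOT grid points is done in chunks
    of NPROMA so each chunk fits in cache, and because the chunks are
    independent the outer loop can be threaded with OpenMP.

      program nproma_blocking
         implicit none
         integer, parameter :: NLON = 80, NLAT = 40, NLEV = 19
         integer, parameter :: NGPTOT = NLON*NLAT      ! total grid points
         integer, parameter :: NPROMA = 48             ! cache-blocking length
         real(8) :: u(NGPTOT, NLEV)
         integer :: j, jlen
         u = 1.0d0
      !$OMP PARALLEL DO PRIVATE(jlen)                  ! blocks are independent
         do j = 1, NGPTOT, NPROMA
            jlen = min(NPROMA, NGPTOT - j + 1)
            call gp_calcs(u(j:j+jlen-1, :), jlen, NLEV)
         end do
      !$OMP END PARALLEL DO
         print *, 'done, u(1,1) =', u(1,1)
      contains
         subroutine gp_calcs(ublock, kproma, klev)
            integer, intent(in) :: kproma, klev
            real(8), intent(inout) :: ublock(kproma, klev)
            integer :: i, k
            do k = 1, klev
               do i = 1, kproma        ! inner loop over one cache-sized block
                  ublock(i, k) = 0.5d0*ublock(i, k) + 1.0d0
               end do
            end do
         end subroutine gp_calcs
      end program nproma_blocking

    NPROMA is a tunable: too small and subroutine-call overhead dominates;
    too large and the working set no longer fits in cache, which is what the
    plots on the next slides show.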

  • Slide 42: Grid point space blocking for Cache (Power5)

    RAPS9 FC T799L91, 192 tasks x 4 threads

    [Plot: wall time in seconds (roughly 200 to 550 s) against grid space
    blocking factor NPROMA (1 to 1000), annotated "optimal use of cache /
    subroutine call overhead".]

  • Slide 43: T799 FC 192x4 (10 runs)

    [Plot: wall time in seconds (about 226 to 246 s) against NPROMA
    (20 to 60) for the 10 runs.]

  • Slide 44: T799 FC 192x4 (average)

    [Plot: percent wall time (0% to 7%) against NPROMA (20 to 60).]

  • Slide 45: TL799, 1024 tasks, 2D partitioning

    2D partitioning results in a non-optimal Semi-Lagrangian comms
    requirement at the poles and equator!

    Square shaped partitions are better than rectangular shaped partitions.

    [Figure: an MPI task partition with the departure, mid-point and
    arrival points of a semi-Lagrangian trajectory marked.]

  • Slide 46: eq_regions algorithm

  • Slide 47

  • Slide 48: eq_regions partitioning, T799, 1024 tasks

    N_REGIONS( 1)= 1

    N_REGIONS( 2)= 7

    N_REGIONS( 3)= 13

    N_REGIONS( 4)= 19

    N_REGIONS( 5)= 25

    N_REGIONS( 6)= 31

    N_REGIONS( 7)= 35

    N_REGIONS( 8)= 41

    N_REGIONS( 9)= 45

    N_REGIONS(10)= 48

    N_REGIONS(11)= 52

    N_REGIONS(12)= 54

    N_REGIONS(13)= 56

N_REGIONS(14)= 56
    N_REGIONS(15)= 58

    N_REGIONS(16)= 56

    N_REGIONS(17)= 56

    N_REGIONS(18)= 54

    N_REGIONS(19)= 52

    N_REGIONS(20)= 48

    N_REGIONS(21)= 45

    N_REGIONS(22)= 41

N_REGIONS(23)= 35
    N_REGIONS(24)= 31

    N_REGIONS(25)= 25

    N_REGIONS(26)= 19

    N_REGIONS(27)= 13

    N_REGIONS(28)= 7

    N_REGIONS(29)= 1

(The 29 bands contain 1 + 7 + 13 + ... + 13 + 7 + 1 = 1024 regions in
    total, one per MPI task.)

  • Slide 49: IFS physics computational imbalance (T799L91, 384 tasks)

    ~11% imbalance in physics, ~5% imbalance (total)

  • Slide 50

    http://en.wikipedia.org/wiki/Parallel_computing

  • Slide 51

  • Slide 52