
    Introduction to High Performance Computing

    M. D. Jones, Ph.D.

Center for Computational Research, University at Buffalo

    State University of New York

    HPC-I 2012


    HPC Historical Highlights (Very) Basic Terminology

    HPC

    HPC = High Performance Computing

and by High Performance we mean what? Unfortunately, HPC is a slippery term - let's construct a more careful definition ...


    HPC Historical Highlights (Very) Basic Terminology

    High Performance is Quite Relative

    Consider the (rather unfair) comparison ...

The Cray-1 Supercomputer, circa 1976 (first commercially successful vector machine)

Weight - about 2400 kg, Cost - roughly 8M$ (US)
Performance - about 160 million floating point operations per second (MFlop/s)


    HPC Historical Highlights (Very) Basic Terminology

A modern day PC, circa 2012 (say running a 3rd generation quad-core 3 GHz Intel i7 CPU)

Weight - roughly 5 kg, Cost - 1000$ (US)
Performance - about 48 billion floating point operations per second (96 GFlop/s)

So that is roughly an increase of 600 in peak performance, and a whopping reduction of 8000 (not accounting for inflation, which would be about another factor of 4 or so) in price.

    We will come back to this notion of peak performance later ...


    HPC Historical Highlights Basic Terminology

    Attributes of HPC

Something that everyone knows what it means, yet somehow manages to escape concrete definition. Most will likely agree with the following key attributes:

Problems in HPC require significant computational resources (where significant, in this case, can be taken to be more resources than are generally available to the average person)

Data sets require large (in the same sense as significant above) amounts of storage and/or processing

Needs to collectively operate on a (possibly distributed) network


    HPC Historical Highlights Basic Terminology

    Operational Definition of HPC

    Definition

(HPC): HPC requires substantially (more than an order of magnitude) more computational resources than are available on current workstations, and typically requires concurrent (parallel) computation.


    HPC Historical Highlights Basic Terminology

    Supercomputers/Supercomputing

The term Supercomputing is usually taken to be the top-end of HPC:

Massive computing at the highest possible speeds

Usually reserved for the very fastest (or elite) HPC systems

An old definition - any computer costing more than ten million dollars - which is now quite obsolete

The Top500 list has been used to rank such systems since 1993, but we will come back to that ...


    HPC Historical Highlights Basic Terminology

    Peak Performance

    Definition

(Peak Performance): The theoretical maximum performance (usually measured in terms of 64-bit (double precision, on most architectures) floating point operations per second, or Flop/s) achievable by a computing system.

Typically one simply adds up the floating point capabilities in the most simplistic way

MFlop/s, GFlop/s, TFlop/s, and coming soon to a facility near you, PFlop/s (in the usual metric usage).


    HPC Historical Highlights Basic Terminology

    Peak Performance Example

    Example

Intel has transitioned from the netburst core architecture (2003-2006) to the core/core2/i7 - the netburst processors could do a maximum of 2 floating point operations per clock tick, so a 3 GHz netburst core could do 6 GFlop/s as its theoretical peak performance (TPP). The first generation i7 processors doubled the floating point capacity (4 Flop/s per clock tick) and the second and third generation quadrupled it (8 Flop/s per clock tick), so each core of a 3 GHz generation 3 (Ivy Bridge) i7 would have a TPP of 48 GFlop/s.
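As a quick sanity check of the bookkeeping above, here is a minimal sketch (not from the original slides): TPP is just flops per clock tick times clock rate times the number of cores summed. The defaults below are the netburst case quoted in the example; other cases follow by changing the inputs.

```c
/* Minimal sketch: theoretical peak performance as flops/clock-tick x clock x cores.
 * Defaults are the netburst case from the example (2 Flop/s per tick at 3 GHz
 * on one core -> 6 GFlop/s). */
#include <stdio.h>

int main(void)
{
    double flops_per_tick = 2.0;  /* floating point operations per clock tick */
    double clock_ghz      = 3.0;  /* clock frequency, GHz                     */
    int    cores          = 1;    /* number of cores being summed             */

    printf("TPP = %.1f GFlop/s\n", flops_per_tick * clock_ghz * cores);
    return 0;
}
```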


    HPC Historical Highlights Basic Terminology

    Significance of TPP

    So what about the theoretical peak performance (TPP)?

    Guaranteed "not-to-exceed" limit on what you can hope to achieve

In fact, very few applications exceed a few percent

Careful tuning might get you to 10-20%, unless you have an algorithm dominated entirely by easily predictable data access patterns and simple computation

Gives you a relatively robust way to compare very different platforms

Behold, TPP as the computational equivalent of Mt. Everest:

Bandwidth of memory to cache/processor is limited
Conflicts in cache/memory access
Communication time dominates computation time
Like Mt. Everest, you can kill yourself trying to achieve TPP ...


HPC Historical Highlights The Early (Dark?) Ages (pre-silicon)

Experimental/Special Purpose

1939 Atanasoff & Berry build prototype electronic/digital computer at Iowa State
1941 Konrad Zuse completes Z3, first functional programmable electromechanical digital computer
1943 Bletchley Park operates Colossus, computer based on vacuum tubes, by Turing, Flowers, and Newman
1946 ENIAC, Eckert and Mauchly, at the University of Pennsylvania
1951 UNIVAC I (also designed by Eckert and Mauchly), produced by Remington Rand, delivered to US Census Bureau
1952 ILLIAC I (based on Eckert, Mauchly, and von Neumann design), first electronic computer built and housed at a University


    HPC Historical Highlights The Middle Ages (Vector Ascendancy)

    The Cray Years

1962 Control Data Corp. introduces the first commercially successful supercomputer, the CDC 6600, designed by Seymour Cray. TPP of 9 MFlop/s
1967 Texas Instruments Advanced Scientific Computer, similar to CDC 6600, includes vector processing instructions
1968 CDC 6800, Cray's redesign of the 6600, remains fastest supercomputer until the mid 1970s. TPP of about 40 MFlop/s.
1976 Cray Research Inc.'s Cray-1 starts the vector revolution. TPP of about 250 MFlop/s.
1982 Cray X-MP, Cray's first multiprocessor computer, original 2 processor design had a TPP of 400 MFlop/s, included shared memory access between processors.


    HPC Historical Highlights The Modern Age (Rise of the Machines)


    Clusters Take Over

1993 Cray introduces the T3D - an MPP (Massively Parallel Processing) system based on 32-2048 DEC Alpha (21064 RISC, 150MHz) processors and a proprietary 3D torus interconnect
1997 ASCI Red at Sandia delivers the first TFlop/s (on the Linpack benchmark) using Intel Pentium Pro processors and a custom interconnect.
2002 NEC's Earth Simulator is a cluster of 640 vector supercomputers, delivers 35.61 TFlop/s on the Linpack benchmark.
2005 IBM's Blue Gene systems regain top rankings (again, according to Linpack) using massive numbers of embedded processors and communication sub-systems (more later), each running a stripped down Linux-based operating system.


Clusters Take Over (cont'd)

2008 IBM deploys a hybrid system of Opteron nodes coupled with Cell processors interconnected via Infiniband, achieves the first sustained PFlop/s on the Top500 ...
2010 Tianhe-1A at the National Supercomputing Center in Tianjin, China, a mix of 14,336 Intel Xeon X5670 processors (86,016 cores) and 7168 Nvidia Tesla M2050 general purpose GPUs, custom (Arch) interconnect, 2.566 PFlop/s
2011 K computer, at RIKEN in Kobe, Japan, 68544 2.0GHz 8-core Sparc64 VIIIfx processors (548,352 cores), custom (Tofu) interconnect, 8.162 PFlop/s
2012 Sequoia, IBM Blue Gene Q at LLNL, 1,572,864 1.6GHz PowerPC A2 cores, custom BG interconnect, 16.32 PFlop/s


    Current Trends in HPC Commodity is King


    Off-The-Shelf Supercomputers

The last decade has seen so-called COTS (Commodity, Off-The-Shelf Systems) replace, well, almost everything else, at the heart of HPC/Supercomputing.

Starting in 1993, Donald Becker and Thomas Sterling (at the Center of Excellence in Space Data and Information Sciences, or CESDIS) at NASA/Goddard proposed and prototyped a COTS system based on 16 Intel DX4 processors (100MHz) and channel bonded 10Mbit/s Ethernet. This project was named after the epic Norse hero, Beowulf.

    HPC then underwent a remarkable transformation.


    Beowulf Slays Grendel

In this case, Grendel consisted of the cadre of specialized manufacturers that supplied HPC hardware. Following is an (abbreviated, I am quite sure) list of vendors that either went completely out of business, or were consumed by competitors:

    Control Data Corp.

    Cray (eaten by SGI, then TERA, now reborn)

    Convex

    DEC (Digital Equipment Corp., eaten by Compaq, in turn by HP)

    Kendall Square Research

    Maspar

    Meiko

    SGI (recently twice deceased)

    Sun (recently absorbed by Oracle)

    Thinking Machines Corp.


    And the Sword ...

The tool by which these companies were driven out of business, of course, is the commodity CPU of Intel (and its primary competitor, AMD).

Commodity CPUs are as cheap as dirt (well, ok, sand) compared to their proprietary foes (be they RISC or vector).

Commodity networking has taken flight as well, making it also rather hard to compete on the interconnect front.

The end result of these clusters of low-cost systems is that the per-CPU cost can be one (if not two) orders of magnitude less than their proprietary counterparts.


    The Top500 List

One measure of HPC/supercomputer performance is the LINPACK benchmark, which has been used since 1993 to rank the world's most powerful supercomputers.

www.top500.org

The LINPACK benchmark was introduced by Jack Dongarra:

solves a dense system of linear equations, Ax = b

for different problem sizes N, get the maximum achieved performance Rmax for the problem size Nmax

must conform to the standard operation count for LU factorization with partial pivoting (forbids fast matrix multiply methods like Strassen's Method), i.e. (2/3)N^3 + O(N^2)

a version is available for distributed memory machines (also known as the "high-performance Linpack benchmark")
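For a feel for what that operation count implies, here is a minimal sketch (not from the slides) that turns the (2/3)N^3 leading term into a rough run time; the problem size N and sustained rate Rmax below are made-up illustrative values, not figures from any list entry.

```c
/* Minimal sketch: rough LINPACK run-time estimate from the
 * (2/3)N^3 + O(N^2) operation count quoted above. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double N    = 1.0e6;    /* hypothetical problem size Nmax       */
    double Rmax = 1.0e15;   /* hypothetical sustained rate, Flop/s  */

    double flops   = (2.0 / 3.0) * pow(N, 3.0);   /* leading-order count */
    double seconds = flops / Rmax;

    printf("~%.2e flops, ~%.1f hours at Rmax\n", flops, seconds / 3600.0);
    return 0;
}
```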


    Current Trends in HPC Trends in the Top500


    Rmax in the Top500

[Figure: #1 Rmax in the Top500 list, 1993-present - log-scale Rmax (GFlop/s, 10^0 to 10^7) vs. list date (1993-06 through 2012-06), with the #1 system labeled at each point: CM-5 (TMC), VPP500 (Fujitsu), XP/S140 (Intel), SR2201 and CP-PACS (Hitachi), ASCI-R (Intel), ASCI-W (IBM/SP2), NES (NEC/SX6), BlueGene/L (IBM), Roadrunner (IBM), Jaguar (Cray XT5), Tianhe-1A (Intel/NV), K (Fujitsu/SPARC), and BlueGene/Q (IBM).]


    Trends in Rmax

[Figure: the same #1 Rmax vs. Top500 list date plot as on the previous slide.]

One can certainly argue with using this single benchmark (LINPACK) as the supercomputer ranking, but it is quite instructive as to historical trends in HPC.

    Note the growth in Rmax

    What about the trends in processor architecture?


    Trends in Processor Family


    Trends in Processor Family

[Figure: number of systems by processor family in recent Top500 lists (2006-06, 2007-06, 2008-06, 2012-06). Families shown: Power (IBM), Cray, Alpha, PA-RISC (HP), Intel/IA-32, NEC, Sparc (Sun), Intel/IA-64, Intel/EM64T, and AMD/x86_64.]


    Current Trends in HPC And the Future is ...


    Current Processor Trends

Figure: Intel processors according to clock frequency and transistor count. (H. Sutter, Dr. Dobb's Journal 30(3), March 2005).


Processor frequency has reached a plateau due to power and design requirements

Moore's Law of transistor density doubling every 18-24 months is still in effect, however

More transistor logic is being devoted to innovations in multi-core processors

Will become a significant software issue - how do you harness the computational power in all of those cores?


    The Power Envelope

[Figure 2 from "The UltraSPARC T1 Processor - Power Efficient Throughput Computing" (December 2005): power density (W/cm^2) over time, 1985-2007, for processors including the 386, 486, Pentium, Pentium 2, Pentium 3, Pentium 4, Itanium, Power PC, Alpha 21064/21164/21264, AMD K6/K7/x86-64, HP PA, MIPS, SPARC, SuperSPARC, SPARC64, and UltraSPARC T1, with reference lines at ~15 W/cm^2 (hot plate) and ~125 W/cm^2 (nuclear reactor surface flux).]

    Figure: Processor power expenditures, courtesy Sun Microsystems.


    The Power Envelope

[Figure: the same power density (W/cm^2) over time plot as on the previous slide.]

Around 2003, Intel & AMD both reached close to the limit of air cooling for their processors - why?

Power in a CPU is proportional to V^2 f C:

Increasing frequency, f, also requires increasing voltage, V - highly nonlinear

Increasing core count will increase capacitance, C, but only linearly

Behold, the multi-core era is born, but by necessity rather than choice ...
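As a rough illustration (not from the slides) of why the industry chose cores over clock rate, here is a minimal sketch of the P ~ V^2 f C argument; the 1.3x voltage increase assumed for a doubled clock is an illustrative guess, not a measured figure.

```c
/* Minimal sketch of the P ~ V^2 * f * C argument: doubling the clock
 * (and the voltage it is assumed to need) vs. doubling the core count
 * (which roughly doubles switched capacitance C). */
#include <stdio.h>

int main(void)
{
    double V = 1.0, f = 1.0, C = 1.0;   /* baseline, arbitrary units */
    double base   = V * V * f * C;
    double faster = (1.3 * V) * (1.3 * V) * (2.0 * f) * C;  /* 2x clock */
    double wider  = V * V * f * (2.0 * C);                  /* 2x cores */

    printf("baseline %.2f, 2x clock %.2f, 2x cores %.2f\n", base, faster, wider);
    return 0;
}
```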


    Future Speculation

Hot trend in HPC - special-purpose processors. Recognize this one?


Cell single-chip multi-processor (IBM/Sony/Toshiba), debuted in 2005/2006

3.2GHz PowerPC RISC as the Power Processing Element (PPE)

8 Synergistic Processing Elements (SPEs) act as vector processing elements (hardware accelerators)

Low power, high performance on specialized applications (230 GFlop/s, in single-precision)

#1 on the Top500 list in 2008-2009 was a mix of Opterons (~7000) and Cells (~13000)


    Current Multicore Offerings

Short table of current multi-core processors ...

Manufacturer                  Arch        Cores
Sun/Oracle "Niagara" T2       SPARC64     16 (1.4 GHz)
IBM POWER7                    POWER       4-8 (3-4.5 GHz)
IBM PowerPC A2                POWER       18 (1.6 GHz)
Intel "Nehalem EP/Westmere"   x86_64      4/6 (1.86-3.2 GHz)
Intel "Nehalem EX/Beckton"    x86_64      4/6/8 (1.86-2.66 GHz)
Intel "Sandy Bridge EP/E5"    x86_64      4/6/8 (1.8-3.6 GHz)
AMD "Shanghai"                x86_64      4 (2.1-3.1 GHz)
AMD "Istanbul"                x86_64      6 (2.1-2.8 GHz)
AMD "Magny-Cours"             x86_64      8/12 (1.8-2.4 GHz)
IBM/Sony/Toshiba Cell         PPC+8xSPE   1+8 (3.2 GHz)
SciCortex                     MIPS64      6 (500 MHz)
Nvidia "Tesla" C2050          -           448 (1.15 GHz)


    All is Parallel

Well, at least pretty much all of the hardware is parallel:

Right now it is difficult to buy a computer that has only a single processor - even laptops have multiple cores.

GPUs have a great many processing elements (Cell has 9, NVIDIA and ATI offerings have hundreds).

This increasing processor core count trend is going to continue for a while.

What about the software that can actually concurrently use all of these hardware resources?

Software is lagging behind the hardware.

Some specialized parallel libraries for multicore systems.

Some APIs for harnessing multiple cores (OpenMP) and multiple machines (MPI).

Generally the software picture is one of tediously mapping applications to new architectures and machines.


    Who Uses HPC?

    Users of HPC/parallel computation, as reflected in the last top500 list:


    Demand for HPC More Speed!


    Why Use Parallel Computation?

Note that, in the modern era, inevitably HPC = Parallel (or concurrent) Computation. The driving forces behind this are pretty simple - the desire is:

Solve my problem faster, i.e. I want the answer now (and who doesn't want that?)

I want to solve a bigger problem than I (or anyone else, for that matter) have ever before been able to tackle, and do so in a reasonable time (generally reasonable = within a graduate student's time to graduate!)


    A Concrete Example

Well, more of a discrete example, actually. Let's consider the gravitational N-body problem.

    Example

Using classical gravitation, we have a very simple (but long-ranged) force/potential. For each of N bodies, the resulting force is computed from the other N - 1 bodies, thereby requiring N^2 force calculations per step. If a galaxy consists of approximately 10^12 such bodies, and even the best algorithm for computing requires N log2 N calculations, that means 10^12 ln(10^12)/ln(2) calculations. If each calculation takes 1 μs, that is 40 x 10^6 seconds per step. That is about 1.3 CPU-years per step. Ouch!
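The arithmetic in that example is easy to check; the following minimal sketch (not from the slides) just multiplies out N log2 N force calculations at 1 microsecond apiece.

```c
/* Minimal sketch of the N-body cost estimate above:
 * N log2(N) force calculations per step at 1 microsecond each. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double N        = 1.0e12;                  /* bodies in the galaxy    */
    double calcs    = N * log(N) / log(2.0);   /* N log2 N (~4e13)        */
    double per_calc = 1.0e-6;                  /* seconds per calculation */

    double seconds = calcs * per_calc;
    printf("%.1e s per step, or %.2f CPU-years per step\n",
           seconds, seconds / (365.25 * 24.0 * 3600.0));
    return 0;
}
```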


    A Size Example

On the other side, suppose that we need to diagonalize (or invert) a dense matrix being used in, say, an eigenproblem derived from some engineering or multiphysics problem.

Example

In this problem, we want to increase the resolution to capture the essential underlying behavior of the physical process being modeled. So we determine that we need a matrix on the order of, say, 400000 x 400000. Simply to store this matrix, in 64-bit representation, requires 1.28 x 10^12 Bytes of memory, or ~1200 GBytes. We could fit this onto a cluster with, say, 10^3 nodes, each having 4 GBytes of memory, by distributing the matrix across the individual memories of each cluster node.
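Again the numbers are quick to verify; here is a minimal sketch (not from the slides) of the storage arithmetic for an n x n dense matrix of 8-byte values spread over nodes with a fixed amount of memory each.

```c
/* Minimal sketch of the storage arithmetic in the example above. */
#include <stdio.h>

int main(void)
{
    double n           = 4.0e5;    /* matrix order from the example */
    double bytes       = n * n * 8.0;              /* 64-bit elements */
    double gbytes      = bytes / (1024.0 * 1024.0 * 1024.0);
    double node_mem_gb = 4.0;      /* memory per node, GBytes       */
    int    nodes       = 1000;     /* ~10^3 nodes                   */

    printf("matrix: %.2e bytes (~%.0f GBytes); cluster: %.0f GBytes total\n",
           bytes, gbytes, nodes * node_mem_gb);
    return 0;
}
```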


    So

    So the lessons to take from the preceding examples should be clear:

It is the nature of the research (and commercial) enterprise to always be trying to accomplish larger tasks with ever increasing speed, or efficiency. For that matter it is human nature.

These examples, while specific, apply quite generally - do you see the connection between the first example and general molecular modeling? The second example and finite element (or finite difference) approaches to differential equations? How about web searching?


    Demand for HPC More Speed!

    Dictatorship of Data Storage


Very strong limitations are imposed by the hierarchy of data/memory storage, schematically:

[Schematic: performance vs. problem size, stepping down as the working set grows from "fits in registers" to "fits in Level 1 cache", "fits in Level 2 cache", "(Level 3 cache)", "fits in main memory", "fits on disk", and finally "need a bigger machine!"]

Note that the performance scale shown is likely logarithmic. Ever-decreasing speed of memory as you get further from the processor(s) is inevitable, often called the memory wall, which we will discuss later in more detail.


    Demand for HPC More Speed!

    What is Old is New Again ...


Interestingly enough, computer processors have not changed very much - consider the following picture:

[Diagram: the stored-program (von Neumann) design - a CPU containing a control unit and an arithmetic logic unit (with accumulator), connected to memory and to input and output.]

Look familiar? Pretty much every modern processor follows this design ...

John von Neumann, the famous polymath, described this concept of the stored-program computer in 1946!

It follows directly from Alan Turing's concept of the Turing machine from 1936


    Demand for HPC More Speed!

    Cache & Memory Hierarchy


Almost all modern processors try to improve performance by keeping the most oft-used data close to the CPU(s), using a hierarchy of caches:

[Diagram: memory hierarchy - L1 instruction and data caches, L2, (L3), TLB, physical memory, and virtual memory; performance decreases and size increases moving away from the CPU.]

Note in particular that as the caches increase in size, they decrease in performance (bandwidth and latency).


    Demand for HPC More Speed!

    CPU/Memory Imbalance


It has been noted that CPU performance has been increasing at roughly a rate of 80% per year (aside: note Moore's observation was for transistor counts), but memory performance has lagged behind considerably - perhaps 7% per year [1].

Has led to a significant imbalance between CPU and memory performance

Explains the preceding cache hierarchy figure pretty well!

Also explains why getting even 10% of the theoretical peak performance out of a CPU can be a considerable achievement!

[1] See, for example, Sally A. McKee, "Reflections on the Memory Wall," CF'04: Proceedings of the 1st Conference on Computing Frontiers (Ischia, Italy, 2004), p. 162.


    Demand for HPC More Speed!

    CPU/DRAM Performance Gap


Figure: D. Patterson et al., "A case for intelligent RAM," IEEE Micro 17, 34 (1997).


    Demand for HPC More Speed!

    Machine Balance


The notion of machine balance (or work/memory ratio) attempts to quantify, in at least a simple way, this distinction between CPU and memory:

    W_M = (number of floating-point operations) / (number of memory references).

Larger W_M is better, given the relatively slow speed of memory.

If the maximum bandwidth to memory is bw_M, then a given algorithm has a maximum performance of W_M x bw_M.

The STREAM benchmark was developed to establish a baseline for machine balance (J. D. McCalpin, "Memory Bandwidth and Machine Balance in Current High Performance Computers," IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, December 1995).
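To make the W_M x bw_M bound concrete, here is a minimal sketch (not from the slides) for a STREAM-triad-style kernel a[i] = b[i] + s*c[i]; the 50 GB/s memory bandwidth used below is an assumed figure, not a measurement of any particular machine.

```c
/* Minimal sketch: machine balance W_M for a triad-style kernel
 * (2 flops and 3 eight-byte memory references per iteration) and the
 * resulting bandwidth-limited performance bound W_M * bw_M. */
#include <stdio.h>

int main(void)
{
    double flops_per_iter = 2.0;      /* one multiply + one add        */
    double refs_per_iter  = 3.0;      /* two loads + one store         */
    double bytes_per_ref  = 8.0;      /* 64-bit operands               */
    double bw_bytes       = 50.0e9;   /* assumed memory bandwidth, B/s */

    double wm       = flops_per_iter / refs_per_iter;
    double refs_sec = bw_bytes / bytes_per_ref;   /* references per second */

    printf("W_M = %.2f, bound = %.1f GFlop/s\n", wm, wm * refs_sec / 1.0e9);
    return 0;
}
```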


    Demand for HPC More Speed!

    Behold the Power of ... Google


Google is a fine example of the power of parallel computation. Did you already know:

Google uses very large Linux clusters as their platform for their search engine?

Exactly how many is closely guarded, but experts estimate from 100,000 to 300,000 servers.

Modifications to the Linux kernel to allow for customization - fault tolerance, redundancy, etc.

A customized cluster filesystem (the googleFS) to accelerate their search and indexing algorithms, and the scalable programming model (MapReduce) to go with it

research.google.com


    Demand for HPC Inherent Limitations

    Scaling Concepts


By scaling we typically mean the relative performance of a parallel vs. serial implementation:

Definition (Scaling): the Speedup Factor, S(p), is given by

    S(p) = (sequential execution time, using the optimal implementation) / (parallel execution time using p processors)

    so, in the ideal case, S(p) = p.


    Demand for HPC Inherent Limitations

    Parallel Efficiency


Using t_S as the (best) sequential execution time, we note that

    S(p) = t_S / t_p(p) <= p,

since t_p(p) >= t_S / p is a lower bound on the parallel time, and the parallel efficiency is given by

    E(p) = S(p) / p = t_S / (p t_p).
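As a small worked illustration (not from the slides, and with made-up timings), speedup and efficiency follow directly from measured sequential and parallel run times:

```c
/* Minimal sketch: speedup S(p) and parallel efficiency E(p) = S(p)/p
 * from (made-up) sequential and parallel run times. */
#include <stdio.h>

int main(void)
{
    double t_seq = 100.0;   /* best sequential time, seconds */
    double t_par = 6.6;     /* time on p processors, seconds */
    int    p     = 16;

    double S = t_seq / t_par;
    printf("S(%d) = %.1f, E(%d) = %.2f\n", p, S, p, S / p);
    return 0;
}
```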


    Demand for HPC Inherent Limitations

    Inherent Limitations in Parallel Speedup


Limitations on the maximum speedup:

A fraction of the code, f, cannot be made to execute in parallel

Parallel overhead (communication, duplication costs)

Using this serial fraction, f, we can note that

    t_p >= f t_S + (1 - f) t_S / p,


    Demand for HPC Inherent Limitations

Amdahl's Law


This simplification for t_p leads directly to Amdahl's Law.

Definition (Amdahl's Law):

    S(p) <= t_S / (f t_S + (1 - f) t_S / p) = p / (1 + f (p - 1)).

G. M. Amdahl, "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities," AFIPS Conference Proceedings 30 (AFIPS Press, Atlantic City, NJ) 483-485, 1967.


    Demand for HPC Inherent Limitations

Implications of Amdahl's Law


The implications of Amdahl's law are pretty straightforward:

The limit as p -> ∞:

    lim S(p) <= 1/f.

If the serial fraction is 5%, the maximum parallel speedup is only 20.
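A minimal sketch (not from the slides) that evaluates the Amdahl bound for that 5% serial fraction shows how quickly the curve flattens toward 1/f = 20:

```c
/* Minimal sketch: the Amdahl bound S(p) <= p / (1 + f(p-1)) for f = 0.05. */
#include <stdio.h>

int main(void)
{
    double f = 0.05;                           /* serial fraction */
    int    procs[] = { 1, 8, 64, 512, 4096 };

    for (int i = 0; i < 5; i++) {
        int p = procs[i];
        printf("p = %4d  S_max = %6.2f\n", p, p / (1.0 + f * (p - 1)));
    }
    return 0;   /* the bound approaches 1/f = 20 as p grows */
}
```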


    Demand for HPC Inherent Limitations

Implications of Amdahl's Law (cont'd)


[Figure: Amdahl's Law - maximum parallel speedup MAX(S(p)) vs. number of processors p (1 to 256) for serial fractions f = 0.001, 0.01, 0.02, 0.05, and 0.1.]


    Demand for HPC Inherent Limitations

    A Practical Example


Let t_p = t_comm + t_comp, where t_comm is the time spent in communication between parallel processes (unavoidable overhead) and t_comp is the time spent in (parallelizable) computation. Then

    t_1 = p t_comp,

and

    S(p) = t_1 / t_p = p t_comp / (t_comm + t_comp) = p [1 + t_comm / t_comp]^(-1).

The point here is that it is critical to minimize communication time relative to the time spent doing computation (recurring theme).
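A minimal sketch (not from the slides, with made-up ratios) of that expression makes the point numerically:

```c
/* Minimal sketch: S(p) = p / (1 + t_comm/t_comp) for a few ratios of
 * communication time to computation time. */
#include <stdio.h>

int main(void)
{
    int    p        = 64;
    double ratios[] = { 0.0, 0.01, 0.1, 1.0 };   /* t_comm / t_comp */

    for (int i = 0; i < 4; i++)
        printf("t_comm/t_comp = %.2f  ->  S(%d) = %5.1f\n",
               ratios[i], p, p / (1.0 + ratios[i]));
    return 0;
}
```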


    Demand for HPC Inherent Limitations

Defeating Amdahl's Law


There are ways to work around the serious implications of Amdahl's law:

We assumed that the problem size was fixed, which is (very) often not the case.

Now consider a case where the problem size is allowed to vary

Assume now that the problem size is chosen such that the parallel run time t_p is held constant


    Demand for HPC Inherent Limitations

Gustafson's Law


Now let t_p be held constant as p is increased:

    S_s(p) = t_S / t_p
           = (f t_S + (1 - f) t_S) / t_p
           = (f t_S + (1 - f) t_S) / (f t_S + (1 - f) t_S / p)
           = p / (1 - (1 - p) f)
           ~ p + p (1 - p) f + ...

Another way of looking at this is that the serial fraction becomes negligible as the problem size is scaled. Actually that is a pretty good definition of a scalable code ...

J. L. Gustafson, "Reevaluating Amdahl's Law," Comm. ACM 31(5), 532-533 (1988).
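To illustrate the "serial fraction becomes negligible as the problem size is scaled" remark, here is a minimal sketch (not from the slides) that assumes, purely for illustration, a code whose serial work grows like N while its parallelizable work grows like N^2, so f(N) shrinks as the problem grows:

```c
/* Minimal sketch: if serial work ~ N and parallel work ~ N^2, then
 * f(N) = N/(N + N^2) shrinks with N and the speedup bound
 * p/(1 + (p-1)f) climbs back toward p for larger problems. */
#include <stdio.h>

int main(void)
{
    int    p       = 256;
    double sizes[] = { 1.0e3, 1.0e4, 1.0e5, 1.0e6 };

    for (int i = 0; i < 4; i++) {
        double N = sizes[i];
        double f = N / (N + N * N);            /* serial fraction at size N */
        double S = p / (1.0 + (p - 1) * f);    /* speedup bound             */
        printf("N = %8.0f  f = %.2e  S(%d) = %6.1f\n", N, f, p, S);
    }
    return 0;
}
```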


    Demand for HPC Inherent Limitations

    Scalability


    Definition

(Scalable): An algorithm is scalable if there is a minimal nonzero efficiency as p -> ∞ and the problem size is allowed to take on any value

I like this (equivalent) one better:

Definition

(Scalable): For a scalable code the sequential fraction becomes negligible as the problem size (and the number of processors) grows


    Parallel Architectures Taxonomies

Flynn's Taxonomy


Based on the flow of information (both data and instructions) in a computer:

SISD  Single Instruction, Single Data - quite old-fashioned
SIMD  Single Instruction, Multiple Data - data parallel
MISD  Multiple Instruction, Single Data - never implemented
MIMD  Multiple Instruction, Multiple Data - task parallel

Many taxonomies are possible ...


    Parallel Architectures Taxonomies

    A More Intuitive Parallel Taxonomy


    A somewhat more intuitive illustration of parallel taxonomy:

[Diagram: a 2x2 layout with the instruction system (single vs. multiple) on one axis and the memory system (single vs. multiple) on the other, placing Scalar and Vector processors and SMP and DM (SIMD & MIMD) machines in the quadrants.]


    Parallel Architectures Taxonomies

    SISD


Sequential (non-parallel) instruction flow

Predictable and deterministic


    Parallel Architectures Taxonomies

    SIMD


Single (typically assembly) instruction operates on different data in a given clock tick.

Requires predictable data access patterns - "vectors" that are contiguous in memory.

Almost all modern processors use some form - for Intel, SSE (Streaming SIMD Extensions), AVX (Advanced Vector Extensions)
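A minimal sketch (not from the slides) of the kind of loop these extensions target: the data access is contiguous and predictable, so a vectorizing compiler (e.g. with -O3) will typically emit SSE/AVX instructions that process several elements per clock tick.

```c
/* Minimal sketch: a contiguous, easily vectorized loop of the kind
 * SIMD units (SSE/AVX) are built for. */
#include <stdio.h>

#define N 1024

int main(void)
{
    static double a[N], b[N], c[N];

    for (int i = 0; i < N; i++) {   /* set up the operands */
        b[i] = (double)i;
        c[i] = 2.0 * (double)i;
    }

    for (int i = 0; i < N; i++)     /* candidate for SIMD vectorization */
        a[i] = b[i] + c[i];

    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}
```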


    Parallel Architectures Taxonomies

    MISD


    Multiple instructions applied independently on same data.

Theoretical rather than practical architecture - few implementations


    Parallel Architectures Taxonomies

    MIMD


    Multiple instructions applied independently on different data.

Very flexible - can be asynchronous, non-deterministic

Most modern supercomputers follow the MIMD design, albeit with SIMD components/sub-systems


    Parallel Architectures Taxonomies

    Data/Task Parallel


    Where is the parallelism in your code?

data parallel - the code has data that can be distributed over and acted upon by multiple processing units (think OpenMP and HPF)

task parallel - the tasks performed by the code can be distributed over multiple processors, often with some need for inter-process communication (think MPI, PVM, etc.)
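As a minimal sketch (not from the slides) of the data-parallel style, the OpenMP directive below distributes the loop iterations, and the data they touch, across the available cores; compile with an OpenMP-capable compiler (e.g. gcc -fopenmp).

```c
/* Minimal sketch: a data-parallel loop using OpenMP.  Each thread works
 * on its own share of the iterations; the reduction combines partial sums. */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double x[N];
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        x[i] = 0.5 * (double)i;
        sum += x[i];
    }

    printf("max threads: %d, sum = %e\n", omp_get_max_threads(), sum);
    return 0;
}
```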


    Parallel Architectures Shared Memory Multiprocessors

    SMP


[Diagram: several CPUs, each with its own cache, all attached to a single pool of high-speed shared memory.]

Each processor has equal access to the pool of shared global memory (contrast with non-uniform, or NUMA)

Cache coherency is a serious issue

Quickly limits scalability due to bottlenecks

Examples: Sun Enterprise Series, Intel/AMD x86/x86_64 bus-based SMPs (small), IBM POWER nodes


    Parallel Architectures Distributed Memory Machines

    Distributed Memory


[Diagram: several nodes, each a CPU with its own cache and local memory, connected by a high-speed network.]

Single processors with a hierarchy of memory, interconnected through an external network

Examples: rather uncommon now - older clusters of uniprocessor nodes.


    Parallel Architectures Distributed Shared Memory

    DSM


[Diagram: several SMP nodes, each with two CPUs (with caches) sharing a local memory, connected by a high-speed network.]

Some processors have truly shared memory, the rest is accessed in a non-uniform (NUMA, or non-uniform memory access) way through the network

Examples: clusters of SMPs, SGI Origin/Altix Series (the OS makes it look like SM).


    Parallel Architectures Distributed Shared Memory

    Data Parallel


    Also known as fine-grained parallel.

Same set of instructions runs simultaneously on different pieces of data.

Typical example - parallel execution of DO/for loops in FORTRAN/C codes.

    Examples of directive-based data-parallel APIs: HPF, OpenMP.


    Parallel Architectures Distributed Shared Memory

    Task Parallel


    Also known as coarse-grained parallel.

Multiple code segments are run concurrently on different processors.

Generally only works well when the code fragments are independent (otherwise the communication cost may be prohibitive).

    Typical example - subroutine calls in FORTRAN codes.

    Examples of message-passing task parallel APIs: PVM, MPI.


    Parallel Architectures Distributed Shared Memory

    SIMD


    Processors that incorporate SIMD vector-like features are widespread:

    packs N (often a power of 2) operations into a single instruction

included in processors from Intel and AMD (streaming SIMD extensions, or SSE)

allow the processor to speed calculation by extensive prefetching of multiple pieces of data which will be subjected to the same set of operations.


    Parallel Architectures Distributed Shared Memory

    SPMD vs. MPMD


All the architectures that we will consider follow the MIMD design model, and we can denote our general approach as Multiple Program, Multiple Data. A typical example consists of two programs in a master-worker arrangement. SPMD is another such structure, in which the same code is executed by all processes, but on different data (e.g. in the master-worker example, the code would have sections for the master, and sections for the workers).


    Parallel Architectures Scalar and Vector Processing

    Vector Processing


A vector processor is one in which entire vectors can be added together with a single (vector) instruction, rather than simply two elements in the case of scalar processors. This approach can be quite advantageous:

Vector processors may have to fetch and decode far fewer instructions (less overhead)

More (contiguous in memory) data can be moved simultaneously

Keeps the processor fed with data rather than waiting on memory operations


    Parallel Architectures Examples of Cluster "Nodes"

    Typical Cluster Compute Nodes


HPC clusters can be built out of just about any type of system, depending on your budget. Some common (and a few less common) options:

Commodity PCs (workstations) - typically networked together using Ethernet

Commodity Servers (2-12 processor cores/server) - often with (advanced) third party networks in addition to, or in place of, Ethernet

Example

CCR's U2 has 256 nodes with 8x 2.26 GHz Nehalem cores/server, 372 nodes with 12x 2.40GHz Westmere cores/server and 20 "fat" nodes with 32 cores/server (some Intel, some AMD), all with a QDR Infiniband interconnect for applications and gigabit Ethernet for storage/systems operation.


    Parallel Architectures Examples of Cluster "Nodes"

Typical Cluster Compute Nodes (cont'd)


Constellations of High-End Servers (>= 16 processors/server)

NCAR's Bluefire, a system of 128 32-processor IBM p575 servers (Power 6+) with an Infiniband interconnect

NASA's Columbia, a system of 20 512-processor SGI Altix 3700s (Itanium2) with an Infiniband interconnect


    Parallel Architectures Cluster Interconnects

    Typical Cluster Networks


By far the most common interconnect for HPC is some form of Ethernet (TCP/IP), but there are other, more highly performing, options:

Myrinet, www.myrinet.com - Myrinet is now available at 10Gbit/s, more common at 2Gbit/s.

Infiniband, www.infinibandta.org, www.openib.org - SDR IB 4x is 10Gb/s, DDR 8x is 20Gb/s, and QDR 16x is 40Gb/s.

Proprietary/Custom - customized network stacks that offer low latency and high bandwidth, becoming less common with commodity networks like Ethernet and Infiniband


    Parallel Architectures Cluster Interconnects

    Cluster Network Evolution

    Using the Top500 list for analyzing historical trends, we can look at the


interconnect family:

[Figure: Top500 list interconnect family for the 1998-06 list - number of entries using HIPPI, Proprietary, Fat Tree, Fast Ethernet, SP Switch, Cray Crossbar, or N/A (N/A = single node!).]


    Parallel Architectures Cluster Interconnects

    Cluster Network Evolution

    ... and note the trend now favors the commodity networks:


[Figure: Top500 list interconnect family - number of systems in the 2006-06, 2008-06, and 2012-06 lists using Fireplane, RapidArray, Mixed, Cray, NUMAlink, Crossbar, Quadrics, Proprietary, Infiniband, SP Switch, Myrinet, and Gigabit Ethernet.]


    Parallel Architectures Dedicated Clusters - Examples

    The Cluster Sandbox


    Pictured below is an example of a dedicated compute cluster:


The preceding cluster example was the first incarnation of CCR's U2. It placed number 57 on the June 2006 Top500 list, using only the 768 Myrinet connected compute nodes. U2 has been substantially updated since then, but in several stages, such that it is now relatively heterogeneous:

    http://www.ccr.buffalo.edu/support/research_facilities/u2.html

    which is typical for University-based HPC facilities.


    Parallel Architectures Dedicated Clusters - Examples

    Current State of U2

    CCR Computing Resources

    Schematic Layout/Diagram


[Schematic: current (2012-08, after many upgrades) state of U2 and related CCR resources.

U2 general purpose cluster:
Dell SC1425 servers (256 total): 2x Xeon 3.0GHz 2-cores each, 2GB RAM, 1x80GB SATA HD, Myrinet 2G + 1Gb Ethernet
IBM dx360M2 servers (128 total): 2x Xeon L5520 2.26GHz 8-cores each, 24GB RAM, 2x160GB SATA HD, QDR Infiniband + 1Gb Ethernet
Dell C6100 servers (128 total): 2x Xeon L5630 2.13GHz 8-cores each, 24GB RAM, 2x500GB SATA HD, QDR Infiniband + 1Gb Ethernet
Dell C6100 servers (372 total): 2x Xeon E5645 2.40GHz 12-cores each, 48GB RAM, 2x500GB SATA HD, QDR Infiniband + 1Gb Ethernet

GPU cluster:
Dell C6100 servers (32 total): 2x Xeon X5650 2.66GHz 12-cores each, 48GB RAM, 2x500GB SATA HD, 1x100GB SSD, QDR Infiniband + 1Gb Ethernet
Dell C410x PCIe expansion with Nvidia Tesla M2050 GPUs (64 total)

"Fat nodes" large memory cluster ("fat" node servers, 19 total, QDR Infiniband + 1Gb Ethernet):
Quantity 1 (Dell R910): 4x Xeon X7650 2.0GHz, 32 cores each, 2x500GB SATA HD, 14x100GB SSD, 256GB RAM
Quantity 8 (Dell R910): 4x Xeon E7-4830 2.13GHz, 32 cores each, 4x1TB SATA HD, 256GB RAM
Quantity 8 (Dell C6145): 4x Opteron 6132HE 2.2GHz, 32 cores each, 4x1TB SATA HD, 256GB RAM
Quantity 2 (Dell R910 + IBM x3850X5): 4x Xeon E7-4830 2.13GHz, 32 cores each, 4x1TB SATA HD, 512GB RAM

Networking: leaf Ethernet switches (1Gb/node, 32-48 nodes each, 2x 10Gb uplinks per leaf) into an Arista 7508 10Gb Ethernet core switch; 1 Gb/s, 10 Gb/s, and future 10Gb/s links shown.

General and project storage resources: Panasas PAS8 parallel storage (215TB, /panasas/scratch) and Isilon IQ36000x storage arrays (325TB, /home, /projects).

Data-intensive computing resources: Netezza/IBM TwinFin DISC and XtremeData dbX 1016 data appliance.

Also shown: faculty/research group clusters (chemistry/physics/proteomics).]


    Parallel Architectures Dedicated Clusters - Examples

    A Very Different Cluster Example


Now let's consider Japan's Earth Simulator, which dominated the Top500 List from 2002-2004:

640 8-processor SX-8 (vector) SMPs (peak of 8GFlop/s per processor)

    10 TB of Memory

    Custom crossbar interconnect

    700 TB disk + 1.2 PB Mass Storage

    Reportedly consumes about 12MW of power ...


    Parallel Architectures Dedicated Clusters - Examples

A Very Different Cluster Example (cont'd)

Japan's Earth Simulator:



    Parallel Architectures Dedicated Clusters - Examples

A Very Different Cluster Example (cont'd)

Japan's Earth Simulator:



    Parallel Architectures Dedicated Clusters - Examples

    IBM Blue Gene


LLNL's BG/L system from 2007-11, and a single rack of BG/L from SC06 (photo on left courtesy IBM, right IBM/Wikipedia).


    Parallel Architectures Dedicated Clusters - Examples

    Highly Engineered Supercluster

Now consider IBM's BlueGene, holding 4 of the top 10 spots in the 2008-06 Top500 list:


Building block is a node consisting of 2 700 MHz PowerPC 440 processors, 1024MB RAM

Nodes are configured on 5 different networks:

3D Torus (3D mesh where each node is connected to 6 nearest-neighbors) for general message passing (well suited for decomposition problems), each link 1.4 Gb/s bidirectional, or 2.1 GB/s aggregate for each node. Worst case hardware latency is 6.4 μs.

Collective network (also does some point-to-point), latency of only 2.5 μs for a full 65,536 node BG/L, each link 2.8 Gb/s bidirectional

I/O network, with one dedicated I/O node for every 64 compute nodes (gigabit Ethernet)

Global interrupt network, can do a full system hardware barrier in 1.5 μs (fast Ethernet)

Diagnostic & control network (JTAG/IEEE 1149.1)


The OS kernel on the compute nodes is a customized runtime kernel that provides no scheduling or context switching (said to be written in ~5000 lines of C++), one thread/CPU; the I/O nodes run a customized Linux kernel

    Native MPI is customized port of MPICH2


Figure courtesy IBM; total power requirements of the 65,536 node system are 1.5MW, or 20kW/rack.


    Figure courtesy IBM, IBM J. Res. Devel. 49, No. 2/3 (2005).

Development of the BG architecture has continued with BG/P (2007, chip upgrade to 4 PPC450 processors/chip), and BG/Q (ETA 2012, 16 cores/chip).


    2008/2009 #1 in Top500 - Roadrunner


IBM's Roadrunner (photo courtesy IBM), #1 entry in the 2008-06 Top500 list, 1.026 PFlop/s ...


    2008/2009 #1 in Top500 - Roadrunner


    CU from Roadrunner, hybrid Opteron/Cell architecture ...


    2008/2009 #1 in Top500 - Roadrunner


Roadrunner was the first PFlop/s system on the Top500; it consumes 2.3MW of power.


    2009-2010 #1 in Top500 - Jaguar


Housed at the National Center for Computational Sciences at Oak Ridge National Laboratory, Jaguar is a Cray XT5 system with 37,376 6-core AMD Opteron processors (224,256 cores total), and Cray's proprietary interconnect. Power usage 7MW.


    2010-11 #1 in Top500 - Tianhe-1A


Tianhe-1A was installed at the National Supercomputer Center in Tianjin, China, and debuted in the #1 slot of the Top500 list in 2010-11. Details on its specifications are still sketchy:

14,336 Intel Xeon processors

7,168 Nvidia Tesla M2050 GPUs

2048 NUDT FT1000 heterogeneous processors

    Proprietary interconnect "Arch"

    4.04MW Power usage


    2011 #1 in Top500 - K Computer at RIKEN



    2012-06 #1 Sequoia, IBM BG/Q


IBM deployed BG/Q (3rd in the BlueGene line) in 2012. The PowerPC A2 chip shown above has 18 cores total: 16 for computation, 1 for the operating system (RHEL), and 1 extra. 8 Flop/s per clock tick. SoC design, 1.6GHz core frequency at 55W (204.8 GFlop/s). Water cooled.


    2012-06 #1 Sequoia, IBM BG/Q


Each rack holds 1024 compute cards, or 16384 cores total. The Sequoia system at LLNL is targeted to have 96 racks total, or 1,572,864 cores (and consumes about 7.9MW of power).

Conclusions

    Lessons Learned

Moore's Law (observation, really) continues - lithography in semiconductor manufacturing advances


Sequential processor speeds are not advancing (or only in a much more limited way):

Thermal wall - cannot cool them fast enough

Memory wall - cannot feed the processor data fast enough

ILP (instruction level parallelism) wall - limitations on speculative compilation

What happens to all of these new transistors?

Make processors smaller

Make L2 caches bigger

Add lots and lots more cores

The future? Massive numbers of multicore servers (8/16/32/... cores per server) harnessed in a distributed fashion
