class01_introduction to hpc.pdf
TRANSCRIPT
-
7/29/2019 class01_Introduction to HPC.pdf
1/92
Introduction to High Performance Computing
M. D. Jones, Ph.D.
Center for Computational Research, University at Buffalo
State University of New York
HPC-I 2012
M. D. Jones, Ph.D. (CCR/UB) Introduction to High Performance Computing HPC-I 2012 1 / 97
HPC Historical Highlights (Very) Basic Terminology
HPC
HPC = High Performance Computing
and by High Performance we mean what? Unfortunately, HPC is a slippery term - let's construct a more careful definition ...
High Performance is Quite Relative
Consider the (rather unfair) comparison ...
The Cray-1 supercomputer, circa 1976 (the first commercially successful vector machine):
Weight - about 2400 kg; Cost - roughly $8M (US); Performance - about 160 million floating point operations per second (160 MFlop/s)
A modern-day PC, circa 2012 (say, running a 3rd-generation quad-core 3 GHz Intel i7 CPU):
Weight - roughly 5 kg; Cost - $1000 (US); Performance - about 96 billion floating point operations per second (96 GFlop/s)
So that is roughly a factor of 600 increase in peak performance, and a whopping factor of 8000 reduction in price (not accounting for inflation, which would add about another factor of 4 or so).
We will come back to this notion of peak performance later ...
Attributes of HPC
Something that everyone knows what it means, yet somehow manages to escape concrete definition. Most will likely agree with the following key attributes:
Problems in HPC require significant computational resources (where significant, in this case, can be taken to mean more resources than are generally available to the average person)
Data sets require large (in the same sense as significant above) amounts of storage and/or processing
Needs to collectively operate on a (possibly distributed) network
Operational Definition of HPC
Definition
(HPC): HPC requires substantially (more than an order of magnitude) more computational resources than are available on current workstations, and typically requires concurrent (parallel) computation
Supercomputers/Supercomputing
The term Supercomputing is usually taken to mean the top end of HPC:
Massive computing at the highest possible speeds
Usually reserved for the very fastest (or elite) HPC systems
An old definition - any computer costing more than ten million dollars - is now quite obsolete
The Top500 list has been used to rank such systems since 1993, but we will come back to that ...
Peak Performance
Definition
(Peak Performance): The theoretical maximum performance (usually measured in terms of 64-bit (double precision, on most architectures) floating point operations per second, or Flop/s) achievable by a computing system.
Typically one simply adds up the floating point capabilities in the most simplistic way
MFlop/s, GFlop/s, TFlop/s, and coming soon to a facility near you, PFlop/s (in the usual metric usage).
Peak Performance Example
Example
Intel has transitioned from the NetBurst core architecture (2003-2006) to the core/core2/i7 - the NetBurst processors could do a maximum of 2 floating point operations per clock tick, so a 3 GHz NetBurst core could do 6 GFlop/s as its theoretical peak performance (TPP). The first generation i7 processors doubled the floating point capacity (4 Flop/s per clock tick) and the second and third generations quadrupled it (8 Flop/s per clock tick), so a quad-core 3 GHz generation 3 (Ivy Bridge) i7 would have a TPP of 96 GFlop/s (24 GFlop/s per core).
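The peak-performance arithmetic in this example can be sketched in a few lines (the flops-per-tick values are the ones quoted above, not a general lookup table):

```python
def peak_gflops(clock_ghz, flops_per_tick, cores=1):
    """Theoretical peak performance (TPP) in GFlop/s:
    clock rate x floating point operations per clock tick x cores."""
    return clock_ghz * flops_per_tick * cores

# 3 GHz NetBurst core: 2 Flop/s per clock tick
print(peak_gflops(3.0, 2))           # 6.0
# Quad-core 3 GHz Ivy Bridge i7: 8 Flop/s per clock tick per core
print(peak_gflops(3.0, 8, cores=4))  # 96.0
```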
Significance of TPP
So what about the theoretical peak performance (TPP)?
Guaranteed "not-to-exceed" limit on what you can hope to achieve
In fact, very few applications exceed a few percent. Careful tuning might get you to 10-20%, unless you have an algorithm dominated entirely by easily predictable data patterns and simple computation
Gives you a relatively robust way to compare very different platforms
Behold, TPP as the computational equivalent of Mt. Everest:
Bandwidth of memory to cache/processor is limited
Conflicts in cache/memory access
Communication time dominates computation time
Like Mt. Everest, you can kill yourself trying to achieve TPP ...
HPC Historical Highlights The Early (Dark?) Ages (pre-silicon)
Experimental/Special Purpose
1939 Atanasoff & Berry build a prototype electronic/digital computer at Iowa State
1941 Konrad Zuse completes the Z3, the first functional programmable electromechanical digital computer
1943 Bletchley Park operates Colossus, a computer based on vacuum tubes, by Turing, Flowers, and Newman
1946 ENIAC, by Eckert and Mauchly, at the University of Pennsylvania
1951 UNIVAC I (also designed by Eckert and Mauchly), produced by Remington Rand, delivered to the US Census Bureau
1952 ILLIAC I (based on an Eckert, Mauchly, and von Neumann design), the first electronic computer built and housed at a university
HPC Historical Highlights The Middle Ages (Vector Ascendancy)
The Cray Years
1962 Control Data Corp. introduces the first commercially successful supercomputer, the CDC 6600, designed by Seymour Cray. TPP of 9 MFlop/s
1967 Texas Instruments Advanced Scientific Computer, similar to the CDC 6600, includes vector processing instructions
1968 CDC 7600, Cray's redesign of the 6600, remains the fastest supercomputer until the mid 1970s. TPP of about 40 MFlop/s.
1976 Cray Research Inc.'s Cray-1 starts the vector revolution. TPP of about 250 MFlop/s.
1982 Cray X-MP, Cray's first multiprocessor computer; the original 2-processor design had a TPP of 400 MFlop/s and included shared memory access between processors.
HPC Historical Highlights The Modern Age (Rise of the Machines)
Clusters Take Over
1993 Cray introduces the T3D - an MPP (Massively Parallel Processing) system based on 32-2048 DEC Alpha (21064 RISC, 150 MHz) processors and a proprietary 3D torus interconnect
1997 ASCI Red at Sandia delivers the first TFlop/s (on the Linpack benchmark) using Intel Pentium Pro processors and a custom interconnect.
2002 NEC's Earth Simulator, a cluster of 640 vector supercomputers, delivers 35.61 TFlop/s on the Linpack benchmark.
2005 IBM's Blue Gene systems regain the top rankings (again, according to Linpack) using massive numbers of embedded processors and communication sub-systems (more later), each running a stripped-down Linux-based operating system.
Clusters Take Over (cont'd)
2008 IBM deploys a hybrid system of Opteron nodes coupled with Cell processors interconnected via Infiniband; achieves the first sustained PFlop/s on the top500 ...
2010 Tianhe-1A at the National Supercomputing Center in Tianjin, China; a mix of 14,336 Intel Xeon X5670 processors (86,016 cores) and 7168 Nvidia Tesla M2050 general purpose GPUs, custom (Arch) interconnect, 2.566 PFlop/s
2011 K computer, at RIKEN in Kobe, Japan; 68,544 2.0 GHz 8-core SPARC64 VIIIfx processors (548,352 cores), custom (Tofu) interconnect, 8.162 PFlop/s
2012 Sequoia, IBM Blue Gene/Q at LLNL; 1,572,864 1.6 GHz PowerPC A2 cores, custom BG interconnect, 16.32 PFlop/s
Current Trends in HPC Commodity is King
Off-The-Shelf Supercomputers
The last decade has seen so-called COTS (Commodity, Off-The-Shelf) systems replace, well, almost everything else at the heart of HPC/Supercomputing.
Starting in 1993, Donald Becker and Thomas Sterling (at the Center of Excellence in Space Data and Information Sciences, or CESDIS, at NASA/Goddard) proposed and prototyped a COTS system based on 16 Intel DX4 processors (100 MHz) and channel-bonded 10 Mbit/s Ethernet. This project was named after the epic Norse hero, Beowulf.
HPC then underwent a remarkable transformation.
Beowulf Slays Grendel
In this case, Grendel consisted of the cadre of specialized manufacturers that supplied HPC hardware. Following is an (abbreviated, I am quite sure) list of vendors that either went completely out of business, or were consumed by competitors:
Control Data Corp.
Cray (eaten by SGI, then TERA, now reborn)
Convex
DEC (Digital Equipment Corp., eaten by Compaq, in turn by HP)
Kendall Square Research
Maspar
Meiko
SGI (recently twice deceased)
Sun (recently absorbed by Oracle)
Thinking Machines Corp.
And the Sword ...
The tool by which these companies were driven out of business, of course, is the commodity CPU of Intel (and its primary competitor, AMD).
Commodity CPUs are as cheap as dirt (well, ok, sand) compared to their proprietary foes (be they RISC or vector). Commodity networking has taken flight as well, making it also rather hard to compete on the interconnect front.
The end result of these clusters of low-cost systems is that the per-CPU cost can be one (if not two) orders of magnitude less than that of their proprietary counterparts.
The Top500 List
One measure of HPC/supercomputer performance is the LINPACK benchmark, which has been used since 1993 to rank the world's most powerful supercomputers.
www.top500.org
The LINPACK benchmark was introduced by Jack Dongarra:
solves a dense system of linear equations, Ax = b
for different problem sizes N, get the maximum achieved performance Rmax for the problem size Nmax
must conform to the standard operation count for LU factorization with partial pivoting (forbidding fast matrix multiply methods like Strassen's method), i.e. (2/3)N^3 + O(N^2) operations
a version is available for distributed memory machines (also known as the "high-performance Linpack" benchmark)
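The standard operation count above is exactly what turns a timed solve into a GFlop/s figure (a sketch; the function names and the timed run are illustrative, not part of the benchmark itself):

```python
def linpack_flops(n):
    """Standard LINPACK operation count for solving Ax = b via LU
    factorization with partial pivoting: (2/3)N^3 + O(N^2)."""
    return (2.0 / 3.0) * n**3 + 2.0 * n**2

def achieved_gflops(n, seconds):
    """Achieved performance for a size-n solve completed in `seconds`."""
    return linpack_flops(n) / seconds / 1.0e9

# Hypothetical run: an N = 10,000 system solved in 25 seconds
print(round(achieved_gflops(10_000, 25.0), 1))  # 26.7 GFlop/s
```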
Current Trends in HPC Trends in the Top500
Rmax in the Top500
[Figure: #1 Rmax in the Top500 list, 1993-present. Log-scale Rmax (GFlop/s, 10^0 to 10^7) vs. list date (1993-06 through 2012-06); the top system progresses from the CM-5 (TMC) through VPP500 (Fujitsu), XP/S140 (Intel), SR2201 and CP-PACS (Hitachi), ASCI-R (Intel), ASCI-W (IBM/SP2), NES (NEC/SX6), BlueGene/L (IBM), Roadrunner (IBM), Jaguar (Cray XT5), Tianhe-1A (Intel/NV), K (Fujitsu/SPARC), and BlueGene/Q (IBM).]
Trends in Rmax
[Figure: #1 Rmax in the Top500 list, 1993-present, repeated from the previous slide.]
One can certainly argue with using this single benchmark (LINPACK) as the supercomputer ranking, but it is quite instructive as to historical trends in HPC.
Note the growth in Rmax
What about the trends in processor architecture?
Trends in Processor Family
[Figure: number of systems by processor family in recent Top500 lists (2006-06, 2007-06, 2008-06, 2012-06). Families shown: Power (IBM), Cray, Alpha, PA-RISC (HP), Intel/IA-32, NEC, Sparc (Sun), Intel/IA-64, Intel/EM64T, and AMD/x86_64; by the 2012-06 list the x86_64 families dominate.]
Current Trends in HPC And the Future is ...
Current Processor Trends
Figure: Intel processors according to clock frequency and transistor count. (H. Sutter, Dr. Dobb's Journal 30(3), March 2005).
Processor frequency has reached a plateau due to power and design requirements
Moore's Law of transistor density doubling every 18-24 months is still in effect, however
More transistor logic is being devoted to innovations in multi-core processors
This will become a significant software issue - how do you harness the computational power in all of those cores?
The Power Envelope
[Chart data omitted: power density (W/cm^2, 0-80 scale) vs. year (1985-2007) for the 386, 486, Pentium/Pentium 2/Pentium 3/Pentium 4, PowerPC, Alpha 21064/21164/21264, Itanium, AMD K6/K7/x86-64, HP PA, MIPS, SPARC, SuperSPARC, SPARC64, and UltraSPARC T1, with reference lines at ~15 W/cm^2 (hot plate) and ~125 W/cm^2 (nuclear reactor surface flux). Source: "The UltraSPARC T1 Processor - Power Efficient Throughput Computing," December 2005, Figure 2.]
Figure: Processor power expenditures, courtesy Sun Microsystems.
The Power Envelope
Around 2003, Intel & AMD both reached close to the limit of air cooling for their processors - why?
Power in a CPU is proportional to V^2 f C:
Increasing the frequency, f, also requires increasing the voltage, V - highly nonlinear
Increasing the core count will increase the capacitance, C, but only linearly
Behold, the multi-core era is born, but by necessity rather than choice ...
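The V^2 f C scaling can be made concrete with a toy comparison (a sketch; the assumption that a 20% clock boost needs ~10% more voltage is illustrative, not a measured figure):

```python
def cpu_power(voltage, freq, capacitance):
    """Dynamic CPU power model: P proportional to V^2 * f * C."""
    return voltage**2 * freq * capacitance

base = cpu_power(1.0, 1.0, 1.0)
# Option 1: 20% higher clock, assumed (illustratively) to need 10% more voltage
faster = cpu_power(1.1, 1.2, 1.0)
# Option 2: double the cores (roughly doubling C) at the same clock and voltage
dual = cpu_power(1.0, 1.0, 2.0)

print(round(faster / base, 3))  # 1.452: ~45% more power for 20% more speed
print(dual / base)              # 2.0: power scales only linearly with cores
```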
Future Speculation
Hot trend in HPC - special-purpose processors. Recognize this one?
The Cell single-chip multi-processor (IBM/Sony/Toshiba) debuted in 2005/2006:
3.2 GHz PowerPC RISC core as the Power Processing Element (PPE)
8 Synergistic Processing Elements (SPEs) act as vector processing elements (hardware accelerators)
Low power, high performance on specialized applications (230 GFlop/s, in single precision)
#1 on the top500 list in 2008-2009 was a mix of Opterons (~7000) and Cells (~13000)
Current Multicore Offerings
Short table of current multi-core processors ...

Manufacturer                  Arch        Cores
Sun/Oracle "Niagara" T2       SPARC64     16 (1.4 GHz)
IBM POWER7                    POWER       4-8 (3-4.5 GHz)
IBM PowerPC A2                POWER       18 (1.6 GHz)
Intel "Nehalem EP/Westmere"   x86_64      4/6 (1.86-3.2 GHz)
Intel "Nehalem EX/Beckton"    x86_64      4/6/8 (1.86-2.66 GHz)
Intel "Sandy Bridge EP/E5"    x86_64      4/6/8 (1.8-3.6 GHz)
AMD "Shanghai"                x86_64      4 (2.1-3.1 GHz)
AMD "Istanbul"                x86_64      6 (2.1-2.8 GHz)
AMD "Magny-Cours"             x86_64      8/12 (1.8-2.4 GHz)
IBM/Sony/Toshiba Cell         PPC+8xSPE   1+8 (3.2 GHz)
SiCortex                      MIPS64      6 (500 MHz)
Nvidia "Tesla" C2050          GPU         448 (1.15 GHz)
All is Parallel
Well, at least pretty much all of the hardware is parallel:
Right now it is difficult to buy a computer that has only a single processor - even laptops have multiple cores.
GPUs have a great many processing elements (Cell has 9, NVIDIA and ATI offerings have hundreds).
This trend of increasing processor core counts is going to continue for a while.
What about the software that can actually concurrently use all of these hardware resources?
Software is lagging behind the hardware.
Some specialized parallel libraries exist for multicore systems.
Some APIs exist for harnessing multiple cores (OpenMP) and multiple machines (MPI).
Generally the software picture is one of tediously mapping applications to new architectures and machines.
Who Uses HPC?
Users of HPC/parallel computation, as reflected in the last top500 list:
Demand for HPC More Speed!
Why Use Parallel Computation?
Note that, in the modern era, HPC inevitably = Parallel (or concurrent) Computation. The driving forces behind this are pretty simple - the desires are:
Solve my problem faster, i.e. I want the answer now (and who doesn't want that?)
I want to solve a bigger problem than I (or anyone else, for that matter) have ever before been able to tackle, and do so in a reasonable time (generally, reasonable = within a graduate student's time to graduate!)
A Concrete Example
Well, more of a discrete example, actually. Let's consider the gravitational N-body problem.
Example
Using classical gravitation, we have a very simple (but long-ranged) force/potential. For each of N bodies, the resulting force is computed from the other N-1 bodies, thereby requiring ~N^2 force calculations per step. If a galaxy consists of approximately 10^12 such bodies, and even the best algorithm for computing the forces requires ~N log2 N calculations, that means ~10^12 ln(10^12)/ln(2) calculations per step. If each calculation takes ~1 microsecond, that is ~40 x 10^6 seconds per step. That is about 1.3 CPU-years per step. Ouch!
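A quick sanity check of the arithmetic (a sketch; the 1-microsecond-per-calculation figure is the one assumed in the example):

```python
import math

N = 1.0e12                     # bodies in the galaxy
calcs = N * math.log2(N)       # ~N log2 N force calculations per step
seconds = calcs * 1.0e-6       # at ~1 microsecond per calculation
years = seconds / (365.25 * 24 * 3600)

print(f"{calcs:.2e} calculations per step")  # ~3.99e+13
print(f"{seconds:.2e} seconds per step")     # ~3.99e+07, i.e. ~40 million
print(f"{years:.2f} CPU-years per step")     # ~1.26
```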
A Size Example
On the other side, suppose that we need to diagonalize (or invert) a dense matrix being used in, say, an eigenproblem derived from some engineering or multiphysics problem.
Example
In this problem, we want to increase the resolution to capture the essential underlying behavior of the physical process being modeled. So we determine that we need a matrix of order, say, 400000 (i.e. 400000 x 400000). Simply to store this matrix, in 64-bit representation, requires 1.28 x 10^12 Bytes of memory, or ~1200 GBytes. We could fit this onto a cluster with, say, 10^3 nodes, each having 4 GBytes of memory, by distributing the matrix across the individual memories of each cluster node.
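The storage arithmetic can be checked directly:

```python
n = 400_000                 # matrix order (n x n, dense)
bytes_needed = n * n * 8    # 64-bit (8-byte) floating point elements
gbytes = bytes_needed / 2**30

nodes, node_mem_gb = 1000, 4
print(f"{bytes_needed:.2e} bytes = {gbytes:.0f} GBytes")
print(gbytes <= nodes * node_mem_gb)  # True: the distributed memories suffice
```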
So
So the lessons to take from the preceding examples should be clear:
It is the nature of the research (and commercial) enterprise to always be trying to accomplish larger tasks with ever increasing speed, or efficiency. For that matter, it is human nature.
These examples, while specific, apply quite generally - do you see the connection between the first example and general molecular modeling? The second example and finite element (or finite difference) approaches to differential equations? How about web searching?
Dictatorship of Data Storage
Very strong limitations are imposed by the hierarchy of data/memory storage, schematically:
[Figure: performance vs. problem size, stepping down at each boundary: fits in registers, fits in Level 1 cache, fits in Level 2 cache, (Level 3 cache), fits in main memory, fits on disk, need a bigger machine!]
Note that the performance scale shown is likely logarithmic. The ever-decreasing speed of memory as you get further from the processor(s) is inevitable, often called the memory wall, which we will discuss later in more detail.
What is Old is New Again ...
Interestingly enough, computer processors have not changed very much - consider the following picture:
[Figure: the von Neumann architecture - a CPU containing a control unit and an arithmetic logic unit (with accumulator), connected to memory, input, and output.]
Look familiar? Pretty much every modern processor follows this design ...
John von Neumann, the famous polymath, described this concept of the stored-program computer in 1946!
It follows directly from Alan Turing's concept of the Turing machine from 1936
Cache & Memory Hierarchy
Almost all modern processors try to improve performance by keeping the most oft-used data close to the CPU(s), using a hierarchy of caches:
[Figure: cache hierarchy - L1 instruction and data caches, L2, (L3), TLB, physical memory, virtual memory; size increases and performance decreases moving away from the CPU.]
Note in particular that as the caches increase in size, they decrease in performance (bandwidth and latency).
CPU/Memory Imbalance
It has been noted that CPU performance has been increasing at roughly a rate of 80% per year (aside: note that Moore's observation was for transistor counts), but memory performance has lagged behind considerably - perhaps 7% per year[1].
Has led to a significant imbalance between CPU and memory performance
Explains the preceding cache hierarchy figure pretty well!
Also explains why getting even 10% of the theoretical peak performance out of a CPU can be a considerable achievement!
[1] see, for example, Sally A. McKee, "Reflections on the Memory Wall," CF'04: Proceedings of the 1st Conference on Computing Frontiers (Ischia, Italy, 2004), p. 162.
CPU/DRAM Performance Gap
Figure: D. Patterson et al., "A case for intelligent RAM," IEEE Micro 17, 34 (1997).
Machine Balance
The notion of machine balance (or work/memory ratio) attempts to quantify, in at least a simple way, this distinction between CPU and memory:
    W_M = (number of floating-point operations) / (number of memory references).
A larger W_M is better, given the relatively slow speed of memory.
If the maximum bandwidth to memory is bw_M, then a given algorithm has a maximum performance of W_M x bw_M.
The STREAM benchmark was developed to establish a baseline for
machine balance (J. D. McCalpin, "Memory Bandwidth and MachineBalance in Current High Performance Computers", IEEE ComputerSociety Technical Committee on Computer Architecture (TCCA)Newsletter, December 1995).
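As a sketch of how the W_M x bw_M bound is used (the kernel and bandwidth numbers below are hypothetical illustrative values, not measurements):

```python
def max_mflops(wm, refs_per_sec):
    """Performance ceiling: machine balance W_M (flops per memory
    reference) times the achievable memory reference rate."""
    return wm * refs_per_sec / 1.0e6

# Hypothetical daxpy-like kernel: 2 flops per 3 memory references
wm = 2.0 / 3.0
# Hypothetical memory system sustaining 1e9 references per second
print(max_mflops(wm, 1.0e9))  # ~666.7 MFlop/s, regardless of the CPU's TPP
```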
Behold the Power of ... Google
Google is a fine example of the power of parallel computation. Did you already know:
Google uses very large Linux clusters as the platform for their search engine?
Exactly how many is closely guarded, but experts estimate from 100,000 to 300,000 servers.
Modifications to the Linux kernel allow for customization - fault tolerance, redundancy, etc.
A customized cluster filesystem (the googleFS) accelerates their search and indexing algorithms, along with the scalable programming model (MapReduce) to go with it
research.google.com
Demand for HPC Inherent Limitations
Scaling Concepts
By scaling we typically mean the relative performance of a parallel vs. serial implementation:
Definition (Scaling): the Speedup Factor, S(p), is given by
    S(p) = (sequential execution time, using the optimal implementation) / (parallel execution time using p processors),
so, in the ideal case, S(p) = p.
Parallel Efficiency
Using t_S as the (best) sequential execution time, we note that
    S(p) = t_S / t_p(p) <= p,
as an upper bound, and the parallel efficiency is given by
    E(p) = S(p) / p = t_S / (p t_p(p)).
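These definitions in code (a sketch, with made-up timings):

```python
def speedup(t_serial, t_parallel):
    """S(p) = sequential time / parallel time."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, p):
    """E(p) = S(p) / p; 1.0 means ideal scaling."""
    return speedup(t_serial, t_parallel) / p

# Hypothetical: a 100 s serial job runs in 15 s on 8 processors
print(round(speedup(100.0, 15.0), 2))        # 6.67
print(round(efficiency(100.0, 15.0, 8), 2))  # 0.83
```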
Inherent Limitations in Parallel Speedup
Limitations on the maximum speedup:
A fraction of the code, f, cannot be made to execute in parallel
Parallel overhead (communication, duplication costs)
Using this serial fraction, f, we can note that
    t_p = f t_S + (1 - f) t_S / p,
Amdahl's Law
This simplification for t_p leads directly to Amdahl's Law,
Definition (Amdahl's Law):
    S(p) = t_S / (f t_S + (1 - f) t_S / p) = p / (1 + f (p - 1)).
G. M. Amdahl, "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities," AFIPS Conference Proceedings 30 (AFIPS Press, Atlantic City, NJ) 483-485, 1967.
Implications of Amdahl's Law
The implications of Amdahl's law are pretty straightforward:
The limit as p -> infinity:
    lim_{p -> infinity} S(p) = 1 / f.
If the serial fraction is 5%, the maximum parallel speedup is only 20.
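Amdahl's law and its p -> infinity limit, in a few lines:

```python
def amdahl_speedup(p, f):
    """Amdahl's Law: S(p) = p / (1 + f*(p - 1)), serial fraction f."""
    return p / (1 + f * (p - 1))

f = 0.05
for p in (4, 16, 64, 1024):
    print(p, round(amdahl_speedup(p, f), 2))
print("limit 1/f =", 1 / f)  # 20.0: the ceiling no processor count can beat
```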
Implications of Amdahl's Law (cont'd)
[Figure: Amdahl's Law - maximum parallel speedup MAX(S(p)) vs. number of processors p (1-256) for serial fractions f = 0.001, 0.01, 0.02, 0.05, and 0.1; each curve saturates at 1/f.]
A Practical Example
Let t_p = t_comm + t_comp, where t_comm is the time spent in communication between parallel processes (unavoidable overhead) and t_comp is the time spent in (parallelizable) computation. The serial run time is then

t_1 = p t_comp,

and

S(p) = t_1 / t_p = p t_comp / (t_comm + t_comp) = p / (1 + t_comm/t_comp).

The point here is that it is critical to minimize the ratio of time spent in communication to time spent doing computation (a recurring theme).
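To make the ratio's effect concrete, a toy calculation (the function name is our own; t_comm and t_comp are the quantities defined above):

```python
def speedup(p, t_comm, t_comp):
    """S(p) = p * t_comp / (t_comm + t_comp) = p / (1 + t_comm/t_comp),
    assuming the serial run time is t1 = p * t_comp."""
    return p * t_comp / (t_comm + t_comp)

# Shrinking the communication-to-computation ratio recovers most of the
# lost speedup on p = 64 processors:
for t_comm in (10.0, 1.0, 0.1):
    print(round(speedup(64, t_comm, 10.0), 1))   # 32.0, then 58.2, then 63.4
```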
Defeating Amdahl's Law
There are ways to work around the serious implications of Amdahl's law:
We assumed that the problem size was fixed, which is (very) often not the case.
Now consider a case where the problem size is allowed to vary:
Assume that the problem size is scaled such that the parallel run time, t_p, is held constant.
Gustafson's Law
Now let t_p be held constant as p is increased (the problem size grows with p). Writing t_p = f t_p + (1 - f) t_p, the same scaled problem would take a serial machine a time

t_1 = f t_p + (1 - f) t_p p,

so the scaled speedup is

S_s(p) = t_1 / t_p = f + (1 - f) p = p + (1 - p) f.
Another way of looking at this is that the serial fraction becomes negligible as the problem size is scaled. Actually, that is a pretty good definition of a scalable code ...
J. L. Gustafson, "Reevaluating Amdahl's Law," Comm. ACM 31(5), 532-533 (1988).
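The contrast with the fixed-size bound can be sketched in a few lines of Python (function name ours, not from the slides):

```python
def gustafson_speedup(f, p):
    """Scaled speedup S_s(p) = p + (1 - p) * f for serial fraction f
    (measured on the parallel system) on p processors (Gustafson's Law)."""
    return p + (1 - p) * f

# Unlike Amdahl's fixed-size bound (at most 1/f = 20 for f = 0.05),
# the scaled speedup keeps growing with p:
for p in (16, 256, 4096):
    print(p, gustafson_speedup(0.05, p))
```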
Scalability
Definition (Scalable): An algorithm is scalable if there is a minimal nonzero efficiency as p → ∞ and the problem size is allowed to take on any value.

I like this (equivalent) one better:

Definition (Scalable): For a scalable code, the sequential fraction becomes negligible as the problem size (and the number of processors) grows.
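The two definitions can be checked numerically against Amdahl's and Gustafson's laws; a small sketch (the function names and the 5% serial fraction are our own choices):

```python
def amdahl_efficiency(f, p):
    """Fixed problem size: E(p) = S(p)/p = 1 / (1 + f*(p - 1)),
    which tends to 0 as p grows -- not scalable."""
    return 1.0 / (1.0 + f * (p - 1))

def scaled_efficiency(f, p):
    """Scaled problem size (Gustafson): E(p) = (p + (1 - p)*f) / p,
    which tends to the nonzero floor 1 - f -- scalable in the sense above."""
    return (p + (1 - p) * f) / p

# Fixed-size efficiency collapses; scaled efficiency stays near 1 - f:
for p in (16, 256, 4096):
    print(p, round(amdahl_efficiency(0.05, p), 3),
             round(scaled_efficiency(0.05, p), 3))
```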
Parallel Architectures Taxonomies
Flynn's Taxonomy
Based on the flow of information (both data and instructions) in a computer:
SISD - Single Instruction, Single Data: quite old-fashioned
SIMD - Single Instruction, Multiple Data: data parallel
MISD - Multiple Instruction, Single Data: never implemented
MIMD - Multiple Instruction, Multiple Data: task parallel
Many taxonomies are possible ...
A More Intuitive Parallel Taxonomy
A somewhat more intuitive illustration of parallel taxonomy:

[Figure: a 2x2 grid with the instruction system (single/multiple) on one axis and the memory system (single/multiple) on the other, placing scalar, vector, SMP, and DM (SIMD & MIMD) machines in the quadrants.]
SISD
Sequential (non-parallel) instruction flow.
Predictable and deterministic.
SIMD
A single (typically assembly) instruction operates on different data in a given clock tick.
Requires predictable data access patterns - "vectors" that are contiguous in memory.
Almost all modern processors use some form - for Intel, SSE (Streaming SIMD Extensions) and AVX (Advanced Vector Extensions).
MISD
Multiple instructions applied independently to the same data.
A theoretical rather than practical architecture - few implementations.
MIMD
Multiple instructions applied independently to different data.
Very flexible - can be asynchronous and non-deterministic.
Most modern supercomputers follow the MIMD design, albeit with SIMD components/sub-systems.
Data/Task Parallel
Where is the parallelism in your code?
data parallel - the code has data that can be distributed over and acted upon by multiple processing units (think OpenMP and HPF)
task parallel - the tasks performed by the code can be distributed over multiple processors, often with some need for inter-process communication (think MPI, PVM, etc.)
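The distinction can be sketched in plain Python - a toy illustration using threads, not tied to any of the APIs named above (all names are ours):

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    # The single operation applied element-wise in the data-parallel case.
    return x * x

def data_parallel(data, workers=4):
    """Data parallel: every worker applies the SAME function to a
    different piece of the data (the OpenMP/HPF style)."""
    with ThreadPoolExecutor(max_workers=workers) as ex:
        return list(ex.map(square, data))

def task_parallel(data):
    """Task parallel: independent tasks run concurrently on the same
    data (the MPI/PVM style, minus the actual message passing)."""
    tasks = [min, max, sum]
    with ThreadPoolExecutor(max_workers=len(tasks)) as ex:
        futures = [ex.submit(task, data) for task in tasks]
        return [f.result() for f in futures]

print(data_parallel(range(5)))         # [0, 1, 4, 9, 16]
print(task_parallel([3, 1, 4, 1, 5]))  # [1, 5, 14]
```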
Parallel Architectures Shared Memory Multiprocessors
SMP
[Figure: an SMP - several CPUs, each with its own cache, sharing a single pool of high-speed memory.]
Each processor has equal access to the pool of shared global memory (contrast with non-uniform, or NUMA).
Cache coherency is a serious issue.
Quickly limits scalability due to bottlenecks.
Examples: Sun Enterprise Series, Intel/AMD x86/x86_64 bus-based SMPs (small), IBM POWER nodes.
Parallel Architectures Distributed Memory Machines
Distributed Memory
[Figure: distributed memory - uniprocessor nodes, each with its own cache and local memory, interconnected through a high-speed network.]
Single processors with a hierarchy of memory, interconnected through an external network.
Examples: rather uncommon now - older clusters of uniprocessor nodes.
Parallel Architectures Distributed Shared Memory
DSM
[Figure: distributed shared memory - SMP nodes (multiple CPUs with caches sharing local memory) interconnected through a high-speed network.]
Some processors have truly shared memory; the rest is accessed in a non-uniform (NUMA, or non-uniform memory access) way through the network.
Examples: clusters of SMPs, SGI Origin/Altix series (the OS makes it look like shared memory).
Data Parallel
Also known as fine-grained parallelism.
The same set of instructions runs simultaneously on different pieces of data.
Typical example - parallel execution of DO/for loops in FORTRAN/C codes.
Examples of directive-based data-parallel APIs: HPF, OpenMP.
Task Parallel
Also known as coarse-grained parallelism.
Multiple code segments are run concurrently on different processors.
Generally only works well when the code fragments are independent (otherwise the communication cost may be prohibitive).
Typical example - subroutine calls in FORTRAN codes.
Examples of message-passing task-parallel APIs: PVM, MPI.
SIMD
Processors that incorporate SIMD vector-like features are widespread:
packs N (often a power of 2) operations into a single instruction
included in processors from Intel and AMD (Streaming SIMD Extensions, or SSE)
allows the processor to speed calculation by extensive prefetching of multiple pieces of data which will be subjected to the same set of operations.
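Conceptually, SIMD packs N scalar operations into one instruction. A pure-Python sketch of a 4-wide vector add (illustrative only - real SSE/AVX registers do this in hardware; the width and function name are our own):

```python
VECTOR_WIDTH = 4  # e.g., 4 single-precision lanes in a 128-bit SSE register

def simd_add(a, b):
    """Add two equal-length arrays one 'vector register' at a time.

    Each loop iteration stands in for a single SIMD instruction operating
    on VECTOR_WIDTH contiguous elements; a scalar loop would need one
    instruction per element.
    """
    assert len(a) == len(b)
    out = []
    for i in range(0, len(a), VECTOR_WIDTH):
        # One "instruction": element-wise add of a whole lane group.
        out.extend(x + y for x, y in zip(a[i:i + VECTOR_WIDTH],
                                         b[i:i + VECTOR_WIDTH]))
    return out

print(simd_add([1, 2, 3, 4, 5, 6, 7, 8], [10] * 8))
# [11, 12, 13, 14, 15, 16, 17, 18]
```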
SPMD vs. MPMD
All the architectures that we will consider follow the MIMD design model, and we can denote our general approach as Multiple Program, Multiple Data (MPMD). A typical example consists of two programs in a master-worker arrangement. SPMD (Single Program, Multiple Data) is another such structure, in which the same code is executed by all processes, but on different data (e.g. in the master-worker example, the code would have sections for the master and sections for the workers).
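A serial Python sketch of the SPMD idea (the ranks and the final reduction are simulated; in a real code each rank would be a separate MPI process, and all names here are our own):

```python
def partial_sum_of_squares(rank, nprocs, data):
    """The one program every 'process' runs: each rank works on its own
    slice of the globally distributed data (len(data) divisible by
    nprocs is assumed, for simplicity)."""
    chunk = len(data) // nprocs
    share = data[rank * chunk:(rank + 1) * chunk]
    return sum(x * x for x in share)

def spmd_run(nprocs, data):
    # Serially simulate nprocs copies of the same program; in a real MPI
    # code each call would be a separate process, and the final sum a
    # reduction (e.g. MPI_Reduce) onto one rank.
    partials = [partial_sum_of_squares(r, nprocs, data)
                for r in range(nprocs)]
    return sum(partials)

print(spmd_run(4, list(range(8))))   # 140, the sum of squares 0..7
```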
Parallel Architectures Scalar and Vector Processing
Vector Processing
A vector processor is one in which entire vectors can be added together with a single (vector) instruction, rather than just two elements as in the case of scalar processors. This approach can be quite advantageous:
Vector processors may have to fetch and decode far fewer instructions (less overhead).
More (contiguous-in-memory) data can be moved simultaneously.
Keeps the processor fed with data rather than waiting on memory operations.
Parallel Architectures Examples of Cluster "Nodes"
Typical Cluster Compute Nodes
HPC clusters can be built out of just about any type of system, depending on your budget. Some common (and a few less common) options:
Commodity PCs (workstations) - typically networked together using Ethernet
Commodity servers (2-12 processor cores/server) - often with (advanced) third-party networks in addition to, or in place of, Ethernet

Example: CCR's U2 has 256 nodes with 8x 2.26 GHz Nehalem cores/server, 372 nodes with 12x 2.40 GHz Westmere cores/server, and 20 "fat" nodes with 32 cores/server (some Intel, some AMD), all with a QDR InfiniBand interconnect for applications and gigabit Ethernet for storage/systems operation.
Typical Cluster Compute Nodes (cont'd)
Constellations of high-end servers (16 or more processors/server):
NCAR's Bluefire - a system of 128 32-processor IBM p575 servers (POWER6+) with an InfiniBand interconnect
NASA's Columbia - a system of 20 512-processor SGI Altix 3700s (Itanium 2) with an InfiniBand interconnect
Parallel Architectures Cluster Interconnects
Typical Cluster Networks
By far the most common interconnect for HPC is some form of Ethernet (TCP/IP), but there are other, higher-performing options:
Myrinet, www.myrinet.com - now available at 10 Gbit/s, more common at 2 Gbit/s
InfiniBand, www.infinibandta.org, www.openib.org - SDR 4x is 10 Gb/s, DDR 4x is 20 Gb/s, and QDR 4x is 40 Gb/s
Proprietary/Custom - customized network stacks that offer low latency and high bandwidth, becoming less common with commodity networks like Ethernet and InfiniBand
Cluster Network Evolution
Using the Top500 list for analyzing historical trends, we can look at the interconnect family:

[Figure: Top500 list interconnect family for the 1998-06 list - number of entries per interconnect: HIPPI 1, Proprietary 1, Fat Tree 48, Fast Ethernet 57, SP Switch 69, Cray Crossbar 154, N/A 206 (N/A = single node!).]
Cluster Network Evolution
... and note the trend now favors the commodity networks:
[Figure: Top500 list interconnect family for the 2006-06, 2008-06, and 2012-06 lists (Fireplane, RapidArray, Mixed, Cray, NUMAlink, Crossbar, Quadrics, Proprietary, InfiniBand, SP Switch, Myrinet, Gigabit Ethernet): InfiniBand and Gigabit Ethernet dominate the later lists, while the specialty networks fade.]
Parallel Architectures Dedicated Clusters - Examples
The Cluster Sandbox
Pictured below is an example of a dedicated compute cluster:
The preceding cluster example was the first incarnation of CCR's U2. It placed number 57 on the June 2006 Top500 list, using only the 768 Myrinet-connected compute nodes. U2 has been substantially updated since then, but in several stages, such that it is now relatively heterogeneous:
http://www.ccr.buffalo.edu/support/research_facilities/u2.html
which is typical for university-based HPC facilities.
Current State of U2
CCR Computing Resources
Schematic Layout/Diagram
Current (2012-08) state of U2 after many upgrades. [Schematic, revised 2012-08:]

U2 General Purpose Cluster:
Dell SC1425 servers (256 total) - 2x Xeon 3.0 GHz (2 cores/server); 2 GB RAM; 1x 80 GB SATA HD; Myrinet 2G + 1 Gb Ethernet
IBM dx360 M2 servers (128 total) - 2x Xeon L5520 2.26 GHz (8 cores/server); 24 GB RAM; 2x 160 GB SATA HD; QDR InfiniBand + 1 Gb Ethernet
Dell C6100 servers (372 total) - 2x Xeon E5645 2.40 GHz (12 cores/server); 48 GB RAM; 2x 500 GB SATA HD; QDR InfiniBand + 1 Gb Ethernet
Dell C6100 servers (128 total) - 2x Xeon L5630 2.13 GHz (8 cores/server); 24 GB RAM; 2x 500 GB SATA HD; QDR InfiniBand + 1 Gb Ethernet

GPU Cluster (chemistry/physics/proteomics):
Dell C6100 servers (32 total) - 2x Xeon X5650 2.66 GHz (12 cores/server); 48 GB RAM; 2x 500 GB SATA HD, 1x 100 GB SSD; QDR InfiniBand + 1 Gb Ethernet
Dell C410x PCIe expansion chassis with Nvidia Tesla M2050 GPUs (64 total)

"Fat Nodes" Large Memory Cluster - "fat" node servers (19 total), all with QDR InfiniBand + 1 Gb Ethernet:
Quantity 1 (Dell R910) - 4x Xeon X7650 2.0 GHz (32 cores/server); 2x 500 GB SATA HD, 14x 100 GB SSD; 256 GB RAM
Quantity 8 (Dell R910) - 4x Xeon E7-4830 2.13 GHz (32 cores/server); 4x 1 TB SATA HD; 256 GB RAM
Quantity 8 (Dell C6145) - 4x Opteron 6132HE 2.2 GHz (32 cores/server); 4x 1 TB SATA HD; 256 GB RAM
Quantity 2 (Dell R910 + IBM x3850 X5) - 4x Xeon E7-4830 2.13 GHz (32 cores/server); 4x 1 TB SATA HD; 512 GB RAM

General and Project Storage Resources:
Panasas PAS8 parallel storage - 215 TB, /panasas/scratch
Isilon IQ36000x storage arrays - 325 TB, /home, /projects

Networking:
Leaf Ethernet switches - 1 Gb/node (32-48 nodes/switch), 2x 10 Gb uplinks to the core
Arista 7508 10 Gb Ethernet core switch

Data-Intensive Computing Resources:
Netezza/IBM TwinFin DISC
XtremeData dbX 1016 data appliance

Also attached: Faculty/Research Group Clusters.
A Very Different Cluster Example
Now let's consider Japan's Earth Simulator, which dominated the Top500 list from 2002-2004:
640 8-processor NEC SX-6 (vector) SMPs (peak of 8 GFlop/s per processor)
10 TB of memory
Custom crossbar interconnect
700 TB disk + 1.2 PB mass storage
Reportedly consumes about 12 MW of power ...
A Very Different Cluster Example (cont'd)
Japan's Earth Simulator:
A Very Different Cluster Example (cont'd)
Japan's Earth Simulator:
IBM Blue Gene
LLNL's BG/L system from 2007-11, and a single rack of BG/L from SC06 (photo on left courtesy IBM, right IBM/Wikipedia).
Highly Engineered Supercluster
Now consider IBM's BlueGene, holding 4 of the top 10 spots in the
2008-06 Top500 list:
The building block is a node consisting of 2 700 MHz PowerPC 440 processors and 1024 MB RAM.
Nodes are configured on 5 different networks:
3D torus (a 3D mesh where each node is connected to its 6 nearest neighbors) for general message passing (well suited for decomposition problems); each link is 1.4 Gb/s bidirectional, or 2.1 GB/s aggregate per node. Worst-case hardware latency is 6.4 μs.
Collective network (also does some point-to-point); latency of only 2.5 μs for a full 65,536-node BG/L; each link 2.8 Gb/s bidirectional.
I/O network, with one dedicated I/O node for every 64 compute nodes (gigabit Ethernet).
Global interrupt network; can do a full-system hardware barrier in 1.5 μs (fast Ethernet).
Diagnostic & control network (JTAG/IEEE 1149.1).
The OS kernel on the compute nodes is a customized runtime kernel that provides no scheduling or context switching (said to be written in 5000 lines of C++), one thread/CPU; the I/O nodes run a customized Linux kernel.
The native MPI is a customized port of MPICH2.
Figure courtesy IBM; the total power requirement of the 65,536-node system is 1.5 MW, or 20 kW/rack.
Figure courtesy IBM, IBM J. Res. Devel. 49, No. 2/3 (2005).
Development of the BG architecture has continued with BG/P (2007, chip upgrade to 4 PPC450 processors/chip) and BG/Q (ETA 2012, 16 cores/chip).
2008/2009 #1 in Top500 - Roadrunner
IBM's Roadrunner (photo courtesy IBM), the #1 entry in the 2008-06 Top500 list at 1.026 PFlop/s ...
2008/2009 #1 in Top500 - Roadrunner
A CU (connected unit) from Roadrunner, hybrid Opteron/Cell architecture ...
2008/2009 #1 in Top500 - Roadrunner
Roadrunner was the first PFlop/s system on the Top500, and consumes 2.3 MW of power.
2009-2010 #1 in Top500 - Jaguar
Housed at the National Center for Computational Sciences at Oak Ridge National Laboratory, Jaguar is a Cray XT5 system with 37,376 6-core AMD Opteron processors (224,256 cores total) and Cray's proprietary interconnect. Power usage: 7 MW.
2010-11 #1 in Top500 - Tianhe-1A
Tianhe-1A was installed at the National Supercomputer Center in Tianjin, China, and debuted in the #1 slot of the Top500 list in 2010-11. Details on its specifications are still sketchy:
14,336 Intel Xeon processors
7,168 Nvidia Tesla M2050 GPUs
2,048 NUDT FT1000 heterogeneous processors
Proprietary interconnect, "Arch"
4.04 MW power usage
2011 #1 in Top500 - K Computer at RIKEN
2012-06 #1 Sequoia, IBM BG/Q
IBM deployed BG/Q (3rd in the BlueGene line) in 2012. The PowerPC A2 chip shown above has 18 cores total: 16 for computation, 1 for the operating system (RHEL), and 1 extra. 8 Flops per clock tick per core. SoC design, 1.6 GHz core frequency at 55 W (204.8 GFlop/s). Water cooled.
2012-06 #1 Sequoia, IBM BG/Q
Each rack holds 1,024 compute cards, or 16,384 cores total. The Sequoia system at LLNL is targeted to have 96 racks total, or 1,572,864 cores (and consumes about 7.9 MW of power).
Conclusions
Lessons Learned
Moore's Law (observation, really) continues - lithography in
semiconductor manufacturing advances.
Sequential processor speeds are not advancing (or only in a much more limited way):
Thermal wall - cannot cool them fast enough
Memory wall - cannot feed the processor data fast enough
ILP (instruction-level parallelism) wall - limitations on speculative compilation
What happens to all of these new transistors?
Make processors smaller
Make L2 caches bigger
Add lots and lots more cores
The future? Massive numbers of multicore servers (8/16/32/... cores per server) harnessed in a distributed fashion.