Introduction to Supercomputers, Architectures and High Performance Computing

Amit Majumdar, Scientific Computing Applications Group, SDSC
With contributions from: Tim Kaiser, Dmitry Pekurovsky, Mahidhar Tatineni, Ross Walker
Topics
1. Intro to Parallel Computing
2. Parallel Machines
3. Programming Parallel Computers
4. Supercomputer Centers and Rankings
5. SDSC Parallel Machines
6. Allocations on NSF Supercomputers
7. One Application Example – Turbulence
First Topic – Intro to Parallel Computing
• What is parallel computing
• Why do parallel computing
• Real life scenario
• Types of parallelism
• Limits of parallel computing
• When do you do parallel computing
What is Parallel Computing?
• Consider your favorite computational application
  – One processor can give me results in N hours
  – Why not use N processors – and get the results in just one hour?
The concept is simple:
Parallelism = applying multiple processors to a single problem
Parallel computing is computing by committee
• Parallel computing: the use of multiple computers or processors working together on a common task.
  – Each processor works on its section of the problem
  – Processors are allowed to exchange information with other processors
[Figure: grid of the problem to be solved in the x–y plane, divided into four areas; CPU #1–CPU #4 each work on one area and exchange boundary data with their neighbors.]
Why Do Parallel Computing?

• Limits of single CPU computing
  – Available memory
  – Performance/speed
• Parallel computing allows us to:
  – Solve problems that don’t fit in a single CPU’s memory space
  – Solve problems that can’t be solved in a reasonable time
• We can run…
  – Larger problems
  – Faster
  – More cases
  – Simulations at finer resolution
  – Models of physical phenomena that are more realistic
Parallel Computing – Real Life Scenario

• Stacking or reshelving of a set of library books
• Assume books are organized into shelves, and shelves are grouped into bays
• A single worker can only work at a certain rate
• We can speed it up by employing multiple workers
• What is the best strategy?
  o The simple way is to divide the total books equally among workers. Each worker stacks the books one at a time. Each worker must walk all over the library.
  o An alternate way is to assign a fixed, disjoint set of bays to each worker. Each worker is assigned an equal number of books arbitrarily. Workers stack books in their own bays or pass each book to the worker responsible for the bay it belongs to.
Parallel Computing – Real Life Scenario

• Parallel processing allows a task to be accomplished faster by dividing the work into a set of subtasks assigned to multiple workers.
• Assigning a set of books to workers is task partitioning. Passing books to each other is an example of communication between subtasks.
• Some problems may be completely serial, e.g. digging a post hole; these are poorly suited to parallel processing.
• Not all problems are equally amenable to parallel processing.
Weather Forecasting

The atmosphere is modeled by dividing it into three-dimensional regions or cells, 1 mile x 1 mile x 1 mile – about 500 x 10^6 cells. The calculations for each cell are repeated many times to model the passage of time.

About 200 floating point operations per cell per time step, or 10^11 floating point operations per time step.

A 10-day forecast with 10-minute resolution => ~1.5x10^14 flop.

On a machine sustaining 100 Mflops (10^8 flops/sec) this would take:
  1.5x10^14 flop / 10^8 flops = ~17 days

On a machine sustaining 1.7 Tflops it would take:
  1.5x10^14 flop / 1.7x10^12 flops = ~2 minutes
Other Examples

• Vehicle design and dynamics
• Analysis of protein structures
• Human genome work
• Quantum chromodynamics
• Astrophysics
• Earthquake wave propagation
• Molecular dynamics
• Climate, ocean modeling
• CFD
• Imaging and rendering
• Petroleum exploration
• Nuclear reactor, weapon design
• Database query
• Ozone layer monitoring
• Natural language understanding
• Study of chemical phenomena
• And many other scientific and industrial simulations
Types of Parallelism: Two Extremes

• Data parallel
  – Each processor performs the same task on different data
  – Example: grid problems
• Task parallel
  – Each processor performs a different task
  – Example: signal processing
• Most applications fall somewhere on the continuum between these two extremes
Typical Data Parallel Program

• Example: integrate a 2-D propagation problem

Starting partial differential equation:

  ∂f/∂t = D ∂²f/∂x² + B ∂²f/∂y²

Finite difference approximation:

  (f_{i,j}^{n+1} − f_{i,j}^n)/Δt = D (f_{i+1,j}^n − 2 f_{i,j}^n + f_{i−1,j}^n)/Δx² + B (f_{i,j+1}^n − 2 f_{i,j}^n + f_{i,j−1}^n)/Δy²
[Figure: the x–y grid is partitioned among eight processing elements, PE #0 through PE #7.]
Basics of Data Parallel Programming

One code will run on 2 CPUs. The program has an array of data to be operated on by the 2 CPUs, so the array is split into two parts.

program:
  ...
  if CPU=a then
    low_limit=1
    upper_limit=50
  elseif CPU=b then
    low_limit=51
    upper_limit=100
  end if
  do I = low_limit, upper_limit
    work on A(I)
  end do
  ...
end program

CPU A executes:

program:
  ...
  low_limit=1
  upper_limit=50
  do I = low_limit, upper_limit
    work on A(I)
  end do
  ...
end program

CPU B executes:

program:
  ...
  low_limit=51
  upper_limit=100
  do I = low_limit, upper_limit
    work on A(I)
  end do
  ...
end program
Typical Task Parallel Application

• Example: signal processing
• Use one processor for each task
• Can use more processors if one is overloaded

DATA → Normalize Task → FFT Task → Multiply Task → Inverse FFT Task
Basics of Task Parallel Programming

• One code will run on 2 CPUs
• The program has 2 tasks (a and b) to be done by the 2 CPUs

program.f:
  ...
  initialize
  ...
  if CPU=a then
    do task a
  elseif CPU=b then
    do task b
  end if
  ...
end program

CPU A executes:

program.f:
  ...
  initialize
  ...
  do task a
  ...
end program

CPU B executes:

program.f:
  ...
  initialize
  ...
  do task b
  ...
end program
How Your Problem Affects Parallelism
• The nature of your problem constrains how successful parallelization can be
• Consider your problem in terms of
  – When data is used, and how
  – How much computation is involved, and when
• Importance of problem architectures
  – Perfectly parallel
  – Fully synchronous
Perfect Parallelism

• Scenario: seismic imaging problem
  – The same application is run on data from many distinct physical sites
  – Concurrency comes from having multiple data sets processed at once
  – Could be done on independent machines (if the data can be made available)
• This is the simplest style of problem
• Key characteristic: calculations for each data set are independent
  – Could divide/replicate data into files and run as independent serial jobs
  – (also called “job-level parallelism”)

[Figure: a seismic imaging application independently turns Site A–D data into Site A–D images.]
Fully Synchronous Parallelism

• Scenario: atmospheric dynamics problem
  – Data models an atmospheric layer; highly interdependent in horizontal layers
  – The same operation is applied in parallel to multiple data
  – Concurrency comes from handling large amounts of data at once
• Key characteristic: each operation is performed on all (or most) data
  – Operations/decisions depend on results of previous operations
• Potential problems
  – Serial bottlenecks force other processors to “wait”

[Figure: an atmospheric modeling application transforms the initial atmospheric partitions into the resulting partitions in lockstep.]
Limits of Parallel Computing
• Theoretical upper limits
  – Amdahl’s Law
• Practical limits
  – Load balancing
  – Non-computational sections (I/O, system ops, etc.)
• Other considerations
  – Time to re-write code
Theoretical Upper Limits to Performance
• All parallel programs contain:
  – Serial sections
  – Parallel sections
• Serial sections – where work is duplicated or no useful work is done (waiting for others) – limit the parallel effectiveness
  • A lot of serial computation gives bad speedup
  • No serial work “allows” perfect speedup
• Speedup is the ratio of the time required to run a code on one processor to the time required to run the same code on multiple (N) processors – Amdahl’s Law states this formally
Amdahl’s Law

• Amdahl’s Law places a strict limit on the speedup that can be realized by using multiple processors.
  – Effect of multiple processors on run time:

      tN = (fp/N + fs) * t1

  – Effect of multiple processors on speedup (S = t1/tN):

      S = 1 / (fs + fp/N)

  – Where
    • fs = serial fraction of code
    • fp = parallel fraction of code (fs + fp = 1)
    • N = number of processors
    • tN = time to run on N processors
    • t1 = time to run on one processor
It takes only a small fraction of serial content in a code to degrade the parallel performance.
[Figure: Illustration of Amdahl’s Law – speedup vs. number of processors (up to 250) for fp = 1.000, 0.999, 0.990, and 0.900; speedup saturates quickly as fp drops below 1.]
Amdahl’s Law provides a theoretical upper limit on parallel speedup assuming that there are no costs for communications. In reality, communications will result in a further degradation of performance.
[Figure: Amdahl’s Law vs. reality – speedup vs. number of processors for fp = 0.99; the measured curve falls increasingly below the Amdahl’s Law prediction.]
Practical Limits: Amdahl’s Law vs. Reality
• In reality, Amdahl’s Law is limited by many things:
  – Communications
  – I/O
  – Load balancing (waiting)
  – Scheduling (shared processors or memory)
When do you do parallel computing

• Writing effective parallel applications is difficult
  – Communication can limit parallel efficiency
  – Serial time can dominate
  – Load balance is important
• Is it worth your time to rewrite your application?
  – Do the CPU requirements justify parallelization?
  – Will the code be used just once?
Parallelism Carries a Price Tag

• Parallel programming
  – Involves a learning curve
  – Is effort-intensive
• Parallel computing environments can be complex
  – Don’t respond to many serial debugging and tuning techniques

Will the investment of your time be worth it?
Test the “Preconditions for Parallelism”
• According to experienced parallel programmers:– no green Don’t even consider it– one or more red Parallelism may cost you more than you gain– all green You need the power of parallelism (but there are no guarantees)
Pre-conditions (columns: frequency of use, execution time, resolution needs):

• Positive pre-condition (green): thousands of times between changes; days or weeks; must significantly increase resolution or complexity
• Possible pre-condition: dozens of times between changes; 4-8 hours; want to increase to some extent
• Negative pre-condition (red): only a few times between changes; minutes; current resolution/complexity already more than needed
Second Topic – Parallel Machines
• Simplistic architecture
• Types of parallel machines
• Network topology
• Parallel computing terminology
Simplistic Architecture
[Diagram: many CPU–MEM pairs; the processors and their memories are connected through an interconnect network.]
Processor Related Terms
• RISC: Reduced Instruction Set Computer
• PIPELINE: technique in which multiple instructions are overlapped in execution
• SUPERSCALAR: multiple instructions per clock period
Network Interconnect Related Terms
• LATENCY: how long does it take to start sending a "message"? Units are generally microseconds nowadays.
• BANDWIDTH: what data rate can be sustained once the message is started? Units are bytes/sec, Mbytes/sec, Gbytes/sec, etc.
• TOPOLOGY: what is the actual ‘shape’ of the interconnect? Are the nodes connected by a 2D mesh? A ring? Something more elaborate?
Memory/Cache Related Terms
[Diagram: CPU connected to main memory through a cache.]
CACHE : Cache is the level of memory hierarchy between the CPU and main memory. Cache is much smaller than main memory and hence there is mapping of data from main memory to cache.
Memory/Cache Related Terms

• ICACHE: instruction cache
• DCACHE (L1): data cache closest to registers
• SCACHE (L2): secondary data cache
  – Data from SCACHE has to go through DCACHE to registers
  – SCACHE is larger than DCACHE
• L3 cache
• TLB: translation-lookaside buffer; keeps addresses of pages (blocks of memory) in main memory that have recently been accessed
Memory/Cache Related Terms (cont.)
[Diagram: memory hierarchy from the CPU down through L1 cache, L2/L3 cache, DRAM, and the file system; speed increases toward the CPU, while size increases and cost per bit decreases away from it.]
Memory/Cache Related Terms (cont.)
• The data cache was designed with two key concepts in mind
  – Spatial locality
    • When an element is referenced, its neighbors will be referenced too
    • Cache lines are fetched together
    • Work on consecutive data elements in the same cache line
  – Temporal locality
    • When an element is referenced, it might be referenced again soon
    • Arrange code so that data in cache is reused as often as possible
Types of Parallel Machines

• Flynn's taxonomy has commonly been used to classify parallel computers into one of four basic types:
  – Single instruction, single data (SISD): single scalar processor
  – Single instruction, multiple data (SIMD): Thinking Machines CM-2
  – Multiple instruction, single data (MISD): various special purpose machines
  – Multiple instruction, multiple data (MIMD): nearly all parallel machines
• Since the MIMD model “won”, a much more useful way to classify modern parallel computers is by their memory model
  – Shared memory
  – Distributed memory
[Diagram: processors (P) sharing a single memory over a bus.]
Shared and Distributed Memory

Shared memory – single address space. All processors have access to a pool of shared memory. (Examples: CRAY T90, SGI Altix.) Methods of memory access:
  – Bus
  – Crossbar

Distributed memory – each processor has its own local memory. Must do message passing to exchange data between processors. (Examples: CRAY T3E, XT; IBM Power; Sun and other vendor-made machines.)

[Diagram: distributed memory – each processor (P) with its own memory (M), connected to the others through a network.]
Styles of Shared Memory: UMA and NUMA

Uniform memory access (UMA): each processor has uniform access to memory. Also known as symmetric multiprocessors (SMPs).

[Diagram: processors sharing one memory over a single bus.]

Non-uniform memory access (NUMA): time for memory access depends on the location of the data. Local access is faster than non-local access. Easier to scale than SMPs. (Examples: HP-Convex Exemplar, SGI Altix.)

[Diagram: two bus-connected processor/memory groups joined by a secondary bus.]
UMA - Memory Access Problems

• Conventional wisdom is that these systems do not scale well
  – Bus-based systems can become saturated
  – Fast large crossbars are expensive
• Cache coherence problem
  – Copies of a variable can be present in multiple caches
  – A write by one processor may not become visible to others
  – They'll keep accessing the stale value in their caches
  – Need to take actions to ensure visibility, i.e. cache coherence
Machines
• T90, C90, YMP, XMP, SV1, SV2
• SGI Origin (sort of)
• HP-Exemplar (sort of)
• Various Suns
• Various Wintel boxes
• Most desktop Macintoshes
• Not new:
  – BBN GP 1000 Butterfly
  – VAX 780
Programming methodologies

• Standard Fortran or C, and let the compiler do it for you
• Directives can give hints to the compiler (OpenMP)
• Libraries
• Thread-like methods
  – Explicitly start multiple tasks
  – Each is given its own section of memory
  – Use shared variables for communication
• Message passing can also be used, but is not common
Distributed shared memory (NUMA)

• Consists of N processors and a global address space
  – All processors can see all memory
  – Each processor has some amount of local memory
  – Access to the memory of other processors is slower
• Non-Uniform Memory Access

[Diagram: two bus-connected processor/memory groups joined by a secondary bus with its own memory.]
• Easier to build because of slower access to remote memory
• Similar cache problems
• Code writers should be aware of data distribution
  – Load balance
  – Minimize access of "far" memory
Programming methodologies

• Same as shared memory:
• Standard Fortran or C, and let the compiler do it for you
• Directives can give hints to the compiler (OpenMP)
• Libraries
• Thread-like methods
  – Explicitly start multiple tasks
  – Each is given its own section of memory
  – Use shared variables for communication
• Message passing can also be used
Machines
• SGI Origin, Altix
• HP-Exemplar
Distributed Memory

• Each of N processors has its own memory
• Memory is not shared
• Communication occurs using messages

[Diagram: processors (P), each with local memory (M), connected by a network.]
Programming methodology

• Mostly message passing using MPI
• Data distribution languages
  – Simulate a global name space
  – Examples:
    • High Performance Fortran
    • Split-C
    • Co-array Fortran
Hybrid machines

• SMP nodes (clumps) with an interconnect between clumps
• Machines
  – Cray XT3/4
  – IBM Power4/Power5
  – Sun, other vendor machines
• Programming
  – SMP methods on clumps, or message passing
  – Message passing between all processors

[Diagram: two bus-based SMP nodes, each with processors and memory, joined by an interconnect.]

Currently: multi-socket, multi-core
Network Topology

• Custom
  – Many manufacturers offer custom interconnects (Myrinet, Quadrics, Colony, Federation, Cray, SGI)
• Off the shelf
  – Infiniband
  – Ethernet
  – ATM
  – HIPPI
  – Fibre Channel
  – FDDI
Types of interconnects

• Fully connected
• N-dimensional array and ring or torus
  – Paragon
  – Cray XT3/4
• Crossbar
  – IBM SP (8 nodes)
• Hypercube
  – nCUBE
• Trees, CLOS
  – Meiko CS-2, TACC Ranger (Sun machine)
• Combinations of some of the above
  – IBM SP (crossbar and fully connected up to 80 nodes)
  – IBM SP (fat tree for > 80 nodes)

Wrapping an array's ends around produces a torus.
Parallel Computing Terminology

• Bandwidth – number of bits that can be transmitted in unit time, given as bits/sec, bytes/sec, Mbytes/sec.
• Network latency – time to make a message transfer through the network.
• Message latency or startup time – time to send a zero-length message; essentially the software and hardware overhead in sending a message.
• Communication time – total time to send a message, including software overhead, interface delays, and the actual transmission time.
• Bisection width of a network – number of links (or sometimes wires) that must be cut to divide the network into two equal parts. Can provide a lower bound for messages in a parallel algorithm.
Communication Time Modeling

• Tcomm = Nmsg * Tmsg
  – Nmsg = number of non-overlapping messages
  – Tmsg = time for one point-to-point communication
• Tmsg = ts + tw * L
  – L = length of message (e.g., in words)
  – ts = latency, or startup time (size independent)
  – tw = asymptotic time per word (1/BW)
Performance and Scalability Terms

• Efficiency: measure of the fraction of time for which a processor is usefully employed. Defined as the ratio of speedup to the number of processors: E = S/N.
• Amdahl’s law: discussed before.
• Scalability: an algorithm is scalable if the level of parallelism increases at least linearly with the problem size. An architecture is scalable if it continues to yield the same performance per processor, albeit on a larger problem size, as the number of processors increases. Algorithm and architecture scalability are important since they allow a user to solve larger problems in the same amount of time by buying a parallel computer with more processors.
Performance and Scalability Terms

• Superlinear speedup: in practice a speedup greater than N (on N processors) is called superlinear speedup.
• This is observed due to:
  – A non-optimal sequential algorithm
  – The sequential problem may not fit in one processor’s main memory and may require slow secondary storage, whereas on multiple processors the problem fits in the main memory of the N processors
Sources of Parallel Overhead

• Interprocessor communication: time to transfer data between processors is usually the most significant source of parallel processing overhead.
• Load imbalance: in some parallel applications it is impossible to distribute the subtask workload equally to each processor. At some point all but one processor might be done and waiting for the last processor to complete.
• Extra computation: sometimes the best sequential algorithm is not easily parallelizable, and one is forced to use a parallel algorithm based on a poorer but easily parallelizable sequential algorithm. Sometimes repetitive work is done on each of the N processors instead of send/recv, which leads to extra computation.
CPU Performance Comparison
             # of floating point operations in a program
TFLOPS = ------------------------------------------------------
             execution time in seconds * 10^12

• TFLOPS (trillions of floating point operations per second) depend on the machine and on the program (the same program running on different computers would execute a different number of instructions, but the same number of FP operations)
• TFLOPS is also not a consistent and useful measure of performance because
  – The set of FP operations is not consistent across machines, e.g. some have divide instructions, some don’t
  – A TFLOPS rating for a single program cannot be generalized to establish a single performance metric for a computer

Just when we thought we understood TFLOPs, the Petaflop is almost here.
CPU Performance Comparison
• Execution time is the principal measure of performance
• Unlike execution time, it is tempting to characterize a machine with a single MIPS or MFLOPS rating without naming the program, specifying the I/O, or describing the versions of the OS and compilers
Capability Computing

• The full power of a machine is used for a given scientific problem, utilizing CPUs, memory, interconnect, and I/O performance
• Enables the solution of problems that cannot otherwise be solved in a reasonable period of time – the figure of merit is time to solution
• E.g. moving from a two-dimensional to a three-dimensional simulation, using finer grids, or using more realistic models

Capacity Computing

• Modest problems are tackled, often simultaneously, on a machine, each with less demanding requirements
• Smaller or cheaper systems are used for capacity computing, where smaller problems are solved
• Parametric studies, or exploring design alternatives
• The main figure of merit is sustained performance per unit cost
• Today’s capability computing can be tomorrow’s capacity computing
Strong Scaling

• For a fixed problem size, how does the time to solution vary with the number of processors?
• Run a fixed-size problem and plot the speedup
• When the scaling of parallel codes is discussed, it is normally strong scaling that is being referred to

Weak Scaling

• How the time to solution varies with processor count for a fixed problem size per processor
• Interesting for O(N) algorithms, where perfect weak scaling is a constant time to solution, independent of processor count
• Deviations from this indicate that either
  – The algorithm is not truly O(N), or
  – The overhead due to parallelism is increasing, or both
Third Topic – Programming Parallel Computers
• Programming single-processor systems is (relatively) easy due to:
  – A single thread of execution
  – A single address space
• Programming shared memory systems can benefit from the single address space
• Programming distributed memory systems can be difficult due to multiple address spaces and the need to access remote data
Single Program, Multiple Data (SPMD)
• SPMD: the dominant programming model for shared and distributed memory machines.
  – One source code is written
  – The code can have conditional execution based on which processor is executing the copy
  – All copies of the code are started simultaneously, and communicate and synchronize with each other periodically
SPMD Programming Model
[Diagram: one source.c; identical copies run on Processor 0 through Processor 3.]
Shared Memory vs. Distributed Memory
• Tools can be developed to make any system appear to look like a different kind of system
  – Distributed memory systems can be programmed as if they have shared memory, and vice versa
  – Such tools do not produce the most efficient code, but might enable portability
• HOWEVER, the most natural way to program any machine is to use tools & languages that express the algorithm explicitly for the architecture.
Shared Memory Programming: OpenMP
• Shared memory systems (SMPs, cc-NUMAs) have a single address space:
  – Applications can be developed in which loop iterations (with no dependencies) are executed by different processors
  – Shared memory codes are mostly data parallel, ‘SIMD’ kinds of codes
  – OpenMP is a good standard for shared memory programming (compiler directives)
  – Vendors offer native compiler directives
Accessing Shared Variables
• If multiple processors want to write to a shared variable at the same time, there may be conflicts. With shared variable X in memory, processes 1 and 2 each:
  1) read X
  2) compute X+1
  3) write X
Each process ends up with its own X+1 value, and one of the updates is lost.
• The programmer, language, and/or architecture must provide ways of resolving conflicts
OpenMP Example #1: Parallel Loop

!$OMP PARALLEL DO
do i=1,128
   b(i) = a(i) + c(i)
end do
!$OMP END PARALLEL DO

• The first directive specifies that the loop immediately following should be executed in parallel. The second directive specifies the end of the parallel section (optional).
• For codes that spend the majority of their time executing the content of simple loops, the PARALLEL DO directive can result in significant parallel performance.
OpenMP Example #2: Private Variables

!$OMP PARALLEL DO SHARED(A,B,C,N) PRIVATE(I,TEMP)
do I=1,N
   TEMP = A(I)/B(I)
   C(I) = TEMP + SQRT(TEMP)
end do
!$OMP END PARALLEL DO

• In this loop, each processor needs its own private copy of the variable TEMP. If TEMP were shared, the result would be unpredictable since multiple processors would be writing to the same memory location.
Distributed Memory Programming: MPI
• Distributed memory systems have separate address spaces for each processor
  – Local memory is accessed faster than remote memory
  – Data must be manually decomposed
  – MPI is the standard for distributed memory programming (a library of subprogram calls)
  – Older message passing libraries include PVM and P4; all vendors have native libraries such as SHMEM (T3E) and LAPI (IBM)
MPI Example #1

• Every MPI program needs these:

#include <mpi.h>  /* the mpi include file */

/* Initialize MPI */
ierr = MPI_Init(&argc, &argv);

/* How many total PEs are there? */
ierr = MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

/* What node am I (what is my rank)? */
ierr = MPI_Comm_rank(MPI_COMM_WORLD, &myid);
...
ierr = MPI_Finalize();
MPI Example #2

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int myid, numprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    /* print out my rank and this run's PE size */
    printf("Hello from %d\n", myid);
    printf("Numprocs is %d\n", numprocs);
    MPI_Finalize();
    return 0;
}
Fourth Topic – Supercomputer Centers and Rankings
• DOE National Labs – LANL, LLNL, Sandia, etc.
• DOE Office of Science Labs – ORNL, NERSC, BNL, etc.
• DOD, NASA supercomputer centers
• NSF supercomputer centers for academic users
  – San Diego Supercomputer Center (UCSD)
  – National Center for Supercomputing Applications (UIUC)
  – Pittsburgh Supercomputing Center (Pittsburgh)
  – Texas Advanced Computing Center
  – Indiana
  – Purdue
  – ANL-Chicago
  – ORNL
  – LSU
  – NCAR
TeraGrid: Integrating NSF Cyberinfrastructure

TeraGrid is a facility that integrates computational, information, and analysis resources at the San Diego Supercomputer Center, the Texas Advanced Computing Center, the University of Chicago / Argonne National Laboratory, the National Center for Supercomputing Applications, Purdue University, Indiana University, Oak Ridge National Laboratory, the Pittsburgh Supercomputing Center, LSU, and the National Center for Atmospheric Research.

[Map: TeraGrid sites – SDSC, TACC, UC/ANL, NCSA, ORNL, PU, IU, PSC, NCAR, LSU – plus partner sites such as Caltech, USC-ISI, Utah, Iowa, Cornell, Buffalo, UNC-RENCI, and Wisc.]
Measure of Supercomputers

• Top 500 list (HPL code performance)
  – Is one of the measures, but not the measure
  – Japan’s Earth Simulator (NEC) was on top for 3 years: 40 TFLOPs peak, 35 TFLOPs on HPL (87% of peak)
• In Nov 2005 the LLNL IBM BlueGene reached the top spot: ~65,000 nodes, 280 TFLOPs on HPL, 367 TFLOPs peak (currently 596 TFLOPs peak and 478 TFLOPs on HPL – 80% of peak)
  – First 100 TFLOP and 200 TFLOP sustained on a real application
• June 2008: RoadRunner at LANL achieves 1 PFLOPs on HPL (1.375 PFLOPs peak; 73% of peak)
• New HPCC benchmarks
• Many others – NAS, NERSC, NSF, DOD TI06, etc.
• The ultimate measure is the usefulness of a center for you – enabling better or new science through simulations on balanced machines
Other Benchmarks
• HPCC – High Performance Computing Challenge benchmarks – no rankings
• NSF benchmarks – HPCC, SPIO, and applications: WRF, OOCORE, GAMESS, MILC, PARATEC, HOMME (these are changing; new ones are being considered)
• DoD HPCMP – TI0X benchmarks
Kiviat diagrams plot machine balance along axes such as: floating point performance, processor-to-memory bandwidth, inter-processor communication of small messages, inter-processor communication of large messages, latency and bandwidth for simultaneous communication patterns, and total communication capacity of the network.
Fifth Topic – SDSC Parallel Machines
SDSC: Data-intensive Computing for the TeraGrid
40 TF compute, 2.5 PB disk, 25 PB archive

• DataStar: IBM Power4+, 15.6 TFlops
• TeraGrid Linux Cluster: IBM/Intel IA-64, 4.4 TFlops
• Archival systems: 25 PB capacity (~5 PB used)
• Storage Area Network disk: 2500 TB, Sun F15K disk server
• BlueGene Data: IBM PowerPC, 17.1 TFlops
• OnDemand Cluster: Dell/Intel, 2.4 TFlops
DataStar is a powerful compute resource well-suited to “extreme I/O” applications

• Peak speed 15.6 TFlops
• IBM Power4+ processors (2528 total)
• Hybrid of 2 node types, all on a single switch
  – 272 8-way p655 nodes:
    • 176 with 1.5 GHz processors, 16 GB/node (2 GB/proc)
    • 96 with 1.7 GHz processors, 32 GB/node (4 GB/proc)
  – 11 32-way p690 nodes: 1.7 GHz, 64-256 GB/node (2-8 GB/proc)
• Federation switch: ~6 us latency, ~1.4 GB/sec point-to-point bandwidth
• At 283 nodes, ours is one of the largest IBM Federation switches
• All nodes are direct-attached to high-performance SAN disk: 3.8 GB/sec write, 2.0 GB/sec read to GPFS
• GPFS now has 115 TB capacity
• ~700 TB of gpfs-wan across NCSA, ANL
• Will be retired in October 2008 for national users

Due to consistent high demand, in FY05 we added 96 1.7 GHz/32 GB p655 nodes and increased GPFS storage from 60 to 125 TB:
  – Enables 2048-processor capability jobs
  – ~50% more throughput capacity
  – More GPFS capacity and bandwidth
SDSC’s three-rack BlueGene/L system
BG/L System Overview: Novel, massively parallel system from IBM

• Full system installed at LLNL from 4Q04 to 3Q05; addition in 2007
  – 106,496 nodes (212,992 cores)
  – Each node is two low-power PowerPC processors plus memory
  – Compact footprint with very high processor density
  – Slow processors and modest memory per processor
  – Very high peak speed of 596 Tflop/s
  – #1 in the top500 until June 2008 – Linpack speed of 478 Tflop/s
  – Two applications have run at over 100 (2005) and 200+ (2006) Tflop/s
• Many BG/L systems in the US and outside
• Now there are BG/P machines – ranked #3, #6, and #9 on the top500
• Need to select apps carefully
  – Must scale (at least weakly) to many processors (because they’re slow)
  – Must fit in limited memory
BG/L packaging hierarchy:

• Chip (2 processors): 2.8/5.6 GF/s, 4 MB
• Compute card (2 chips, 2x1x1): 5.6/11.2 GF/s, 0.5 GB DDR
• Node board (32 chips, 4x4x2; 16 compute cards): 90/180 GF/s, 8 GB DDR
• Cabinet (32 node boards, 8x8x16): 2.9/5.7 TF/s, 256 GB DDR
• System (64 cabinets, 64x32x32): 180/360 TF/s, 16 TB DDR
SDSC was the first academic institution with an IBM Blue Gene system

• The SDSC rack has the maximum ratio of I/O to compute nodes, at 1:8 (LLNL’s is 1:64). Each of the 128 I/O nodes in a rack has a 1 Gbps Ethernet connection => 16 GBps/rack potential.
• SDSC procured a 1-rack system 12/04, used initially for code evaluation and benchmarking; production 10/05. Now SDSC has 3 racks. (The LLNL system initially had 64 racks.)
BG/L System Overview:SDSC’s 3-rack system
• 3072 compute nodes and 384 I/O nodes (each with 2 processors)
  – Most I/O-rich configuration possible (8:1 compute:I/O node ratio)
  – Identical hardware in each node type, with different networks wired
  – Compute nodes connected to: torus, tree, global interrupt, and JTAG
  – I/O nodes connected to: tree, global interrupt, Gigabit Ethernet, and JTAG
  – IBM network: 4 us latency, 0.16 GB/sec point-to-point bandwidth
  – I/O rates of 3.4 GB/s for writes and 2.7 GB/s for reads achieved on GPFS-WAN
• Two half racks (also confusingly called midplanes) per rack
  – Connected via link chips
• Front-end nodes (2 B80s, each with 4 Power3 processors, plus 1 Power5 node)
• Service node (Power 275 with 2 Power4+ processors)
• Two parallel file systems using GPFS
  – Shared /gpfs-wan serviced by 58 NSD nodes (each with 2 IA-64s)
  – Local /bggpfs serviced by 12 NSD nodes (each with 2 IA-64s)
BG System Overview: Processor Chip (1)
[Block diagram of the BG/L system-on-a-chip: two 440 CPUs (one acting as I/O processor), each with a "double FPU" and 32k/32k L1 caches; per-processor L2 buffers feeding a multiported shared SRAM buffer; a shared L3 directory (with ECC) for the 4 MB embedded DRAM, usable as L3 cache or memory; a DDR controller with ECC driving 144-bit-wide 512 MB DDR; and network interfaces for the torus (6 links out and 6 in, each at 1.4 Gbit/s), tree (3 out and 3 in, each at 2.8 Gbit/s), global interrupt, Gigabit Ethernet, and JTAG access.]
BG System Overview: Processor Chip (2) (= system-on-a-chip)
• Two 700-MHz PowerPC 440 processors
– Each with two floating-point units
– Each with 32-kB L1 data caches that are not coherent
– 4 flops per processor clock peak (= 2.8 Gflop/s per processor)
– Two 8-B loads or stores per processor clock peak in L1 (= 11.2 GB/s per processor)
• Shared 2-kB L2 cache (or prefetch buffer)
• Shared 4-MB L3 cache
• Five network controllers (though not all are wired to each node)
– 3-D torus (for point-to-point MPI operations: 175 MB/s nominal x 6 links x 2 ways)
– Tree (for most collective MPI operations: 350 MB/s nominal x 3 links x 2 ways)
– Global interrupt (for MPI_Barrier: low latency)
– Gigabit Ethernet (for I/O)
– JTAG (for machine control)
• Memory controller for 512 MB of off-chip, shared memory
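The per-node bandwidth figures quoted above multiply out as follows; a sketch checking the slide's numbers:

```python
# Aggregate nominal network bandwidth per node (values from the slide).
torus_bw = 175e6 * 6 * 2   # 175 MB/s x 6 links x 2 ways = 2.1 GB/s
tree_bw  = 350e6 * 3 * 2   # 350 MB/s x 3 links x 2 ways = 2.1 GB/s

# L1 bandwidth per processor: two 8-byte loads/stores per 700 MHz clock.
l1_bw = 2 * 8 * 700e6      # = 11.2 GB/s, matching the slide

print(torus_bw / 1e9, tree_bw / 1e9, l1_bw / 1e9)  # 2.1 2.1 11.2
```

Interestingly, torus and tree come out to the same aggregate per node; the tree trades fewer links for double the per-link rate, which suits collective reductions.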
Sixth Topic – Allocations on NSF Supercomputer Centers
• http://www.teragrid.org/userinfo/getting_started.php?level=new_to_teragrid
• Development Allocation Committee (DAC) awards up to 30,000 CPU-hours and/or some TBs of disk (these amounts are going up)
• Larger allocations awarded through merit-review of proposals by panel of computational scientists
• UC Academic Associates
– Special program for UC campuses
– www.sdsc.edu/user_services/aap
Medium and large allocations
• MRAC: requests of 10,001-500,000 SUs, reviewed quarterly
• LRAC: requests of more than 500,000 SUs, reviewed twice per year
• Requests can span all NSF-supported resource providers
• Multi-year requests and awards are possible
New: Storage Allocations
• SDSC is now making disk storage and database resources available via the merit-review process
– SDSC Collections Disk Space: >200 TB of network-accessible disk for data collections
– TeraGrid GPFS-WAN: many 100s of TB of parallel file system attached to TG computers; a portion is available for long-term storage allocations
– SDSC Database: dedicated disk/hardware for high-performance databases (Oracle, DB2, MySQL)
And all this will cost you…
…absolutely nothing. $0, plus the time to write your proposal.
Seventh Topic – One Application Example – Turbulence
Turbulence using Direct Numerical Simulation (DNS)
Evolution of Computers and DNS
1D Decomposition: can use N processors for an N³ grid
2D Decomposition: can use N² processors for an N³ grid
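The local sub-grid shapes under the two decompositions can be sketched as follows; the function names are illustrative, not taken from any particular DNS code:

```python
# How an N^3 grid is split among ranks for a parallel FFT.
def slab_shape(N, P):
    """1D (slab) decomposition: each of P <= N ranks owns N/P full planes."""
    assert N % P == 0 and P <= N
    return (N // P, N, N)

def pencil_shape(N, p1, p2):
    """2D (pencil) decomposition: p1*p2 <= N^2 ranks each own an
    (N/p1) x (N/p2) x N pencil of the grid."""
    assert N % p1 == 0 and N % p2 == 0 and p1 * p2 <= N * N
    return (N // p1, N // p2, N)

print(slab_shape(8, 4))       # (2, 8, 8)
print(pencil_shape(8, 4, 4))  # (2, 2, 8)
```

The slab form keeps two full dimensions local (so 1D FFTs along them need no communication), but caps the processor count at N; pencils keep only one dimension local, but scale to N² ranks.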
Communication
Global communication: traditionally a serious challenge for scaling applications to large node counts.
• 1D decomposition: 1 all-to-all exchange involving P processors
• 2D decomposition: 2 all-to-all exchanges within p1 groups of p2 processors each (p1 x p2 = P)
• Which is better? Most of the time 1D wins. But again: it can't be scaled beyond P = N.
• Crucial parameter is bisection bandwidth
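A rough sketch of per-rank message counts for the all-to-all exchanges described above (the helper functions are illustrative and ignore message sizes and latency/bandwidth tradeoffs):

```python
# Messages each rank sends during the transpose step.
def messages_per_rank_1d(P):
    """1D: one all-to-all over all P ranks."""
    return P - 1

def messages_per_rank_2d(p1, p2):
    """2D: one all-to-all within the rank's row (p2 ranks) and one
    within its column (p1 ranks) of the p1 x p2 process grid."""
    return (p2 - 1) + (p1 - 1)

print(messages_per_rank_1d(4096))    # 4095
print(messages_per_rank_2d(64, 64))  # 126
```

The 2D scheme sends far fewer (but proportionally larger) messages per rank, which is part of why it remains viable at processor counts where P exceeds N and the 1D scheme cannot run at all.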
Performance: towards 4096³