Introduction to Supercomputers, Architectures and High Performance Computing

Amit Majumdar, Scientific Computing Applications Group, SDSC
With contributions from: Tim Kaiser, Dmitry Pekurovsky, Mahidhar Tatineni, Ross Walker
Topics
1. Intro to Parallel Computing
2. Parallel Machines
3. Programming Parallel Computers
4. Supercomputer Centers and Rankings
5. SDSC Parallel Machines
6. Allocations on NSF Supercomputers
7. One Application Example – Turbulence
First Topic – Intro to Parallel Computing
• What is parallel computing
• Why do parallel computing
• Real life scenario
• Types of parallelism
• Limits of parallel computing
• When do you do parallel computing
What is Parallel Computing?
• Consider your favorite computational application
  – One processor can give me results in N hours
  – Why not use N processors – and get the results in just one hour?
The concept is simple:
Parallelism = applying multiple processors to a single problem
Parallel computing is computing by committee
• Parallel computing: the use of multiple computers or processors working together on a common task.
  – Each processor works on its section of the problem
  – Processors are allowed to exchange information with other processors
[Figure: grid of the problem to be solved in the x–y plane, divided into four areas; CPU #1–CPU #4 each work on one area and exchange boundary data with their neighbors.]
Why Do Parallel Computing?

• Limits of single CPU computing
  – Available memory
  – Performance/speed
• Parallel computing allows us to:
  – Solve problems that don’t fit in a single CPU’s memory space
  – Solve problems that can’t be solved in a reasonable time
• We can run…
  – Larger problems
  – Faster
  – More cases
  – Simulations at finer resolution
  – Models of physical phenomena that are more realistic
Parallel Computing – Real Life Scenario

• Stacking or reshelving of a set of library books
• Assume books are organized into shelves, and shelves are grouped into bays
• A single worker can only work at a certain rate
• We can speed it up by employing multiple workers
• What is the best strategy?
  o The simple way is to divide the total books equally among workers. Each worker stacks the books one at a time. Each worker must walk all over the library.
  o An alternate way is to assign a fixed, disjoint set of bays to each worker. Each worker is assigned an equal number of books arbitrarily. Workers stack books in their own bays or pass each book to the worker responsible for the bay it belongs to.
Parallel Computing – Real Life Scenario

• Parallel processing allows a task to be accomplished faster by dividing the work into a set of subtasks assigned to multiple workers.
• Assigning a set of books to workers is task partitioning. Passing books to each other is an example of communication between subtasks.
• Some problems may be completely serial, e.g. digging a post hole; these are poorly suited to parallel processing.
• Not all problems are equally amenable to parallel processing.
Weather Forecasting

The atmosphere is modeled by dividing it into three-dimensional regions or cells, 1 mile x 1 mile x 1 mile – about 500 x 10^6 cells. The calculations for each cell are repeated many times to model the passage of time.

About 200 floating point operations per cell per time step, or 10^11 floating point operations per time step.

A 10-day forecast with 10-minute resolution => ~1.5x10^14 flop.

On a machine sustaining 100 Mflops (10^8 flops/sec) this would take:
  1.5x10^14 flop / 10^8 flops = ~17 days

On a machine sustaining 1.7 Tflops it would take:
  1.5x10^14 flop / 1.7x10^12 flops = ~2 minutes
Other Examples

• Vehicle design and dynamics
• Analysis of protein structures
• Human genome work
• Quantum chromodynamics
• Astrophysics
• Earthquake wave propagation
• Molecular dynamics
• Climate, ocean modeling
• CFD
• Imaging and rendering
• Petroleum exploration
• Nuclear reactor, weapon design
• Database query
• Ozone layer monitoring
• Natural language understanding
• Study of chemical phenomena
• And many other scientific and industrial simulations
Types of Parallelism: Two Extremes

• Data parallel
  – Each processor performs the same task on different data
  – Example: grid problems
• Task parallel
  – Each processor performs a different task
  – Example: signal processing
• Most applications fall somewhere on the continuum between these two extremes
Typical Data Parallel Program

• Example: integrate a 2-D propagation problem

Starting partial differential equation:

  ∂f/∂t = D ∂²f/∂x² + B ∂²f/∂y²

Finite difference approximation:

  (f_{i,j}^{n+1} − f_{i,j}^n)/Δt = D (f_{i+1,j}^n − 2 f_{i,j}^n + f_{i−1,j}^n)/Δx² + B (f_{i,j+1}^n − 2 f_{i,j}^n + f_{i,j−1}^n)/Δy²
[Figure: the x–y grid is partitioned among eight processing elements, PE #0 through PE #7.]
Basics of Data Parallel Programming

One code will run on 2 CPUs. The program has an array of data to be operated on by the 2 CPUs, so the array is split into two parts.

program:
  ...
  if CPU=a then
    low_limit=1
    upper_limit=50
  elseif CPU=b then
    low_limit=51
    upper_limit=100
  end if
  do I = low_limit, upper_limit
    work on A(I)
  end do
  ...
end program

CPU A executes:

program:
  ...
  low_limit=1
  upper_limit=50
  do I = low_limit, upper_limit
    work on A(I)
  end do
  ...
end program

CPU B executes:

program:
  ...
  low_limit=51
  upper_limit=100
  do I = low_limit, upper_limit
    work on A(I)
  end do
  ...
end program
Typical Task Parallel Application

• Example: signal processing
• Use one processor for each task
• Can use more processors if one is overloaded

DATA → Normalize Task → FFT Task → Multiply Task → Inverse FFT Task
Basics of Task Parallel Programming

• One code will run on 2 CPUs
• The program has 2 tasks (a and b) to be done by the 2 CPUs

program.f:
  ...
  initialize
  ...
  if CPU=a then
    do task a
  elseif CPU=b then
    do task b
  end if
  ...
end program

CPU A executes:

program.f:
  ...
  initialize
  ...
  do task a
  ...
end program

CPU B executes:

program.f:
  ...
  initialize
  ...
  do task b
  ...
end program
How Your Problem Affects Parallelism
• The nature of your problem constrains how successful parallelization can be
• Consider your problem in terms of
  – When data is used, and how
  – How much computation is involved, and when
• Importance of problem architectures
  – Perfectly parallel
  – Fully synchronous
Perfect Parallelism

• Scenario: seismic imaging problem
  – The same application is run on data from many distinct physical sites
  – Concurrency comes from having multiple data sets processed at once
  – Could be done on independent machines (if the data can be made available)
• This is the simplest style of problem
• Key characteristic: calculations for each data set are independent
  – Could divide/replicate data into files and run as independent serial jobs
  – (also called “job-level parallelism”)

[Figure: a seismic imaging application independently turns Site A–D data into Site A–D images.]
Fully Synchronous Parallelism

• Scenario: atmospheric dynamics problem
  – Data models an atmospheric layer; highly interdependent in horizontal layers
  – The same operation is applied in parallel to multiple data
  – Concurrency comes from handling large amounts of data at once
• Key characteristic: each operation is performed on all (or most) data
  – Operations/decisions depend on results of previous operations
• Potential problems
  – Serial bottlenecks force other processors to “wait”

[Figure: an atmospheric modeling application transforms the initial atmospheric partitions into the resulting partitions in lockstep.]
Limits of Parallel Computing
• Theoretical upper limits
  – Amdahl’s Law
• Practical limits
  – Load balancing
  – Non-computational sections (I/O, system ops, etc.)
• Other considerations
  – Time to re-write code
Theoretical Upper Limits to Performance
• All parallel programs contain:
  – Serial sections
  – Parallel sections
• Serial sections – where work is duplicated or no useful work is done (waiting for others) – limit the parallel effectiveness
  • A lot of serial computation gives bad speedup
  • No serial work “allows” perfect speedup
• Speedup is the ratio of the time required to run a code on one processor to the time required to run the same code on multiple (N) processors – Amdahl’s Law states this formally
Amdahl’s Law

• Amdahl’s Law places a strict limit on the speedup that can be realized by using multiple processors.
  – Effect of multiple processors on run time:

      tN = (fp/N + fs) * t1

  – Effect of multiple processors on speedup (S = t1/tN):

      S = 1 / (fs + fp/N)

  – Where
    • fs = serial fraction of code
    • fp = parallel fraction of code (fs + fp = 1)
    • N = number of processors
    • tN = time to run on N processors
    • t1 = time to run on one processor
It takes only a small fraction of serial content in a code to degrade the parallel performance.
[Figure: Illustration of Amdahl’s Law – speedup vs. number of processors (up to 250) for fp = 1.000, 0.999, 0.990, and 0.900; speedup saturates quickly as fp drops below 1.]
Amdahl’s Law provides a theoretical upper limit on parallel speedup assuming that there are no costs for communications. In reality, communications will result in a further degradation of performance.
[Figure: Amdahl’s Law vs. reality – speedup vs. number of processors for fp = 0.99; the measured curve falls increasingly below the Amdahl’s Law prediction.]
Practical Limits: Amdahl’s Law vs. Reality
• In reality, Amdahl’s Law is limited by many things:
  – Communications
  – I/O
  – Load balancing (waiting)
  – Scheduling (shared processors or memory)
When do you do parallel computing

• Writing effective parallel applications is difficult
  – Communication can limit parallel efficiency
  – Serial time can dominate
  – Load balance is important
• Is it worth your time to rewrite your application?
  – Do the CPU requirements justify parallelization?
  – Will the code be used just once?
Parallelism Carries a Price Tag

• Parallel programming
  – Involves a learning curve
  – Is effort-intensive
• Parallel computing environments can be complex
  – Don’t respond to many serial debugging and tuning techniques

Will the investment of your time be worth it?
Test the “Preconditions for Parallelism”
• According to experienced parallel programmers:– no green Don’t even consider it– one or more red Parallelism may cost you more than you gain– all green You need the power of parallelism (but there are no guarantees)
Pre-conditions (columns: frequency of use, execution time, resolution needs):

• Positive pre-condition (green): thousands of times between changes; days or weeks; must significantly increase resolution or complexity
• Possible pre-condition: dozens of times between changes; 4-8 hours; want to increase to some extent
• Negative pre-condition (red): only a few times between changes; minutes; current resolution/complexity already more than needed
Second Topic – Parallel Machines
• Simplistic architecture
• Types of parallel machines
• Network topology
• Parallel computing terminology
Simplistic Architecture
[Diagram: many CPU–MEM pairs; the processors and their memories are connected through an interconnect network.]
Processor Related Terms
• RISC: Reduced Instruction Set Computer
• PIPELINE: technique in which multiple instructions are overlapped in execution
• SUPERSCALAR: multiple instructions per clock period
Network Interconnect Related Terms
• LATENCY: how long does it take to start sending a "message"? Units are generally microseconds nowadays.
• BANDWIDTH: what data rate can be sustained once the message is started? Units are bytes/sec, Mbytes/sec, Gbytes/sec, etc.
• TOPOLOGY: what is the actual ‘shape’ of the interconnect? Are the nodes connected by a 2D mesh? A ring? Something more elaborate?
Memory/Cache Related Terms
[Diagram: CPU connected to main memory through a cache.]
CACHE : Cache is the level of memory hierarchy between the CPU and main memory. Cache is much smaller than main memory and hence there is mapping of data from main memory to cache.
Memory/Cache Related Terms

• ICACHE: instruction cache
• DCACHE (L1): data cache closest to registers
• SCACHE (L2): secondary data cache
  – Data from SCACHE has to go through DCACHE to registers
  – SCACHE is larger than DCACHE
• L3 cache
• TLB: translation-lookaside buffer; keeps addresses of pages (blocks of memory) in main memory that have recently been accessed
Memory/Cache Related Terms (cont.)
[Diagram: memory hierarchy from the CPU down through L1 cache, L2/L3 cache, DRAM, and the file system; speed increases toward the CPU, while size increases and cost per bit decreases away from it.]
Memory/Cache Related Terms (cont.)
• The data cache was designed with two key concepts in mind
  – Spatial locality
    • When an element is referenced, its neighbors will be referenced too
    • Cache lines are fetched together
    • Work on consecutive data elements in the same cache line
  – Temporal locality
    • When an element is referenced, it might be referenced again soon
    • Arrange code so that data in cache is reused as often as possible
Types of Parallel Machines

• Flynn's taxonomy has commonly been used to classify parallel computers into one of four basic types:
  – Single instruction, single data (SISD): single scalar processor
  – Single instruction, multiple data (SIMD): Thinking Machines CM-2
  – Multiple instruction, single data (MISD): various special purpose machines
  – Multiple instruction, multiple data (MIMD): nearly all parallel machines
• Since the MIMD model “won”, a much more useful way to classify modern parallel computers is by their memory model
  – Shared memory
  – Distributed memory
[Diagram: processors (P) sharing a single memory over a bus.]
Shared and Distributed Memory

Shared memory – single address space. All processors have access to a pool of shared memory. (Examples: CRAY T90, SGI Altix.) Methods of memory access:
  – Bus
  – Crossbar

Distributed memory – each processor has its own local memory. Must do message passing to exchange data between processors. (Examples: CRAY T3E, XT; IBM Power; Sun and other vendor-made machines.)

[Diagram: distributed memory – each processor (P) with its own memory (M), connected to the others through a network.]
Styles of Shared Memory: UMA and NUMA

Uniform memory access (UMA): each processor has uniform access to memory. Also known as symmetric multiprocessors (SMPs).

[Diagram: processors sharing one memory over a single bus.]

Non-uniform memory access (NUMA): time for memory access depends on the location of the data. Local access is faster than non-local access. Easier to scale than SMPs. (Examples: HP-Convex Exemplar, SGI Altix.)

[Diagram: two bus-connected processor/memory groups joined by a secondary bus.]
UMA - Memory Access Problems

• Conventional wisdom is that these systems do not scale well
  – Bus-based systems can become saturated
  – Fast large crossbars are expensive
• Cache coherence problem
  – Copies of a variable can be present in multiple caches
  – A write by one processor may not become visible to others
  – They'll keep accessing the stale value in their caches
  – Need to take actions to ensure visibility, i.e. cache coherence
Machines
• T90, C90, YMP, XMP, SV1, SV2
• SGI Origin (sort of)
• HP-Exemplar (sort of)
• Various Suns
• Various Wintel boxes
• Most desktop Macintoshes
• Not new:
  – BBN GP 1000 Butterfly
  – VAX 780
Programming methodologies

• Standard Fortran or C, and let the compiler do it for you
• Directives can give hints to the compiler (OpenMP)
• Libraries
• Thread-like methods
  – Explicitly start multiple tasks
  – Each is given its own section of memory
  – Use shared variables for communication
• Message passing can also be used, but is not common
Distributed shared memory (NUMA)

• Consists of N processors and a global address space
  – All processors can see all memory
  – Each processor has some amount of local memory
  – Access to the memory of other processors is slower
• Non-Uniform Memory Access

[Diagram: two bus-connected processor/memory groups joined by a secondary bus with its own memory.]
• Easier to build because of slower access to remote memory
• Similar cache problems
• Code writers should be aware of data distribution
  – Load balance
  – Minimize access of "far" memory
Programming methodologies

• Same as shared memory:
• Standard Fortran or C, and let the compiler do it for you
• Directives can give hints to the compiler (OpenMP)
• Libraries
• Thread-like methods
  – Explicitly start multiple tasks
  – Each is given its own section of memory
  – Use shared variables for communication
• Message passing can also be used
Machines
• SGI Origin, Altix
• HP-Exemplar
Distributed Memory

• Each of N processors has its own memory
• Memory is not shared
• Communication occurs using messages

[Diagram: processors (P), each with local memory (M), connected by a network.]
Programming methodology

• Mostly message passing using MPI
• Data distribution languages
  – Simulate a global name space
  – Examples:
    • High Performance Fortran
    • Split-C
    • Co-array Fortran
Hybrid machines

• SMP nodes (clumps) with an interconnect between clumps
• Machines
  – Cray XT3/4
  – IBM Power4/Power5
  – Sun, other vendor machines
• Programming
  – SMP methods on clumps, or message passing
  – Message passing between all processors

[Diagram: two bus-based SMP nodes, each with processors and memory, joined by an interconnect.]

Currently: multi-socket, multi-core
Network Topology

• Custom
  – Many manufacturers offer custom interconnects (Myrinet, Quadrics, Colony, Federation, Cray, SGI)
• Off the shelf
  – Infiniband
  – Ethernet
  – ATM
  – HIPPI
  – Fibre Channel
  – FDDI
Types of interconnects

• Fully connected
• N-dimensional array and ring or torus
  – Paragon
  – Cray XT3/4
• Crossbar
  – IBM SP (8 nodes)
• Hypercube
  – nCUBE
• Trees, CLOS
  – Meiko CS-2, TACC Ranger (Sun machine)
• Combinations of some of the above
  – IBM SP (crossbar and fully connected up to 80 nodes)
  – IBM SP (fat tree for > 80 nodes)

Wrapping an array's ends around produces a torus.
Parallel Computing Terminology

• Bandwidth – number of bits that can be transmitted in unit time, given as bits/sec, bytes/sec, Mbytes/sec.
• Network latency – time to make a message transfer through the network.
• Message latency or startup time – time to send a zero-length message; essentially the software and hardware overhead in sending a message.
• Communication time – total time to send a message, including software overhead, interface delays, and the actual transmission time.
• Bisection width of a network – number of links (or sometimes wires) that must be cut to divide the network into two equal parts. Can provide a lower bound for messages in a parallel algorithm.
Communication Time Modeling

• Tcomm = Nmsg * Tmsg
  – Nmsg = number of non-overlapping messages
  – Tmsg = time for one point-to-point communication
• Tmsg = ts + tw * L
  – L = length of message (e.g., in words)
  – ts = latency, or startup time (size independent)
  – tw = asymptotic time per word (1/BW)
Performance and Scalability Terms

• Efficiency: measure of the fraction of time for which a processor is usefully employed. Defined as the ratio of speedup to the number of processors: E = S/N.
• Amdahl’s law: discussed before.
• Scalability: an algorithm is scalable if the level of parallelism increases at least linearly with the problem size. An architecture is scalable if it continues to yield the same performance per processor, albeit on a larger problem size, as the number of processors increases. Algorithm and architecture scalability are important since they allow a user to solve larger problems in the same amount of time by buying a parallel computer with more processors.
Performance and Scalability Terms

• Superlinear speedup: in practice a speedup greater than N (on N processors) is called superlinear speedup.
• This is observed due to:
  – A non-optimal sequential algorithm
  – The sequential problem may not fit in one processor’s main memory and may require slow secondary storage, whereas on multiple processors the problem fits in the main memory of the N processors
Sources of Parallel Overhead

• Interprocessor communication: time to transfer data between processors is usually the most significant source of parallel processing overhead.
• Load imbalance: in some parallel applications it is impossible to distribute the subtask workload equally to each processor. At some point all but one processor might be done and waiting for the last processor to complete.
• Extra computation: sometimes the best sequential algorithm is not easily parallelizable, and one is forced to use a parallel algorithm based on a poorer but easily parallelizable sequential algorithm. Sometimes repetitive work is done on each of the N processors instead of send/recv, which leads to extra computation.
CPU Performance Comparison
             # of floating point operations in a program
TFLOPS = ------------------------------------------------------
             execution time in seconds * 10^12

• TFLOPS (trillions of floating point operations per second) depend on the machine and on the program (the same program running on different computers would execute a different number of instructions, but the same number of FP operations)
• TFLOPS is also not a consistent and useful measure of performance because
  – The set of FP operations is not consistent across machines, e.g. some have divide instructions, some don’t
  – A TFLOPS rating for a single program cannot be generalized to establish a single performance metric for a computer

Just when we thought we understood TFLOPs, the Petaflop is almost here.
CPU Performance Comparison
• Execution time is the principal measure of performance
• Unlike execution time, it is tempting to characterize a machine with a single MIPS or MFLOPS rating without naming the program, specifying the I/O, or describing the versions of the OS and compilers
Capability Computing

• The full power of a machine is used for a given scientific problem, utilizing CPUs, memory, interconnect, and I/O performance
• Enables the solution of problems that cannot otherwise be solved in a reasonable period of time – the figure of merit is time to solution
• E.g. moving from a two-dimensional to a three-dimensional simulation, using finer grids, or using more realistic models

Capacity Computing

• Modest problems are tackled, often simultaneously, on a machine, each with less demanding requirements
• Smaller or cheaper systems are used for capacity computing, where smaller problems are solved
• Parametric studies, or exploring design alternatives
• The main figure of merit is sustained performance per unit cost
• Today’s capability computing can be tomorrow’s capacity computing
Strong Scaling

• For a fixed problem size, how does the time to solution vary with the number of processors?
• Run a fixed-size problem and plot the speedup
• When the scaling of parallel codes is discussed, it is normally strong scaling that is being referred to

Weak Scaling

• How the time to solution varies with processor count for a fixed problem size per processor
• Interesting for O(N) algorithms, where perfect weak scaling is a constant time to solution, independent of processor count
• Deviations from this indicate that either
  – The algorithm is not truly O(N), or
  – The overhead due to parallelism is increasing, or both
Third Topic – Programming Parallel Computers
• Programming single-processor systems is (relatively) easy due to:
  – A single thread of execution
  – A single address space
• Programming shared memory systems can benefit from the single address space
• Programming distributed memory systems can be difficult due to multiple address spaces and the need to access remote data
Single Program, Multiple Data (SPMD)
• SPMD: the dominant programming model for shared and distributed memory machines.
  – One source code is written
  – The code can have conditional execution based on which processor is executing the copy
  – All copies of the code are started simultaneously, and communicate and synchronize with each other periodically
SPMD Programming Model
[Diagram: one source.c; identical copies run on Processor 0 through Processor 3.]
Shared Memory vs. Distributed Memory
• Tools can be developed to make any system appear to look like a different kind of system
  – Distributed memory systems can be programmed as if they have shared memory, and vice versa
  – Such tools do not produce the most efficient code, but might enable portability
• HOWEVER, the most natural way to program any machine is to use tools & languages that express the algorithm explicitly for the architecture.
Shared Memory Programming: OpenMP
• Shared memory systems (SMPs, cc-NUMAs) have a single address space:
  – Applications can be developed in which loop iterations (with no dependencies) are executed by different processors
  – Shared memory codes are mostly data parallel, ‘SIMD’ kinds of codes
  – OpenMP is a good standard for shared memory programming (compiler directives)
  – Vendors offer native compiler directives
Accessing Shared Variables
• If multiple processors want to write to a shared variable at the same time, there may be conflicts. With shared variable X in memory, processes 1 and 2 each:
  1) read X
  2) compute X+1
  3) write X
Each process ends up with its own X+1 value, and one of the updates is lost.
• The programmer, language, and/or architecture must provide ways of resolving conflicts
OpenMP Example #1: Parallel Loop

!$OMP PARALLEL DO
do i=1,128
   b(i) = a(i) + c(i)
end do
!$OMP END PARALLEL DO

• The first directive specifies that the loop immediately following should be executed in parallel. The second directive specifies the end of the parallel section (optional).
• For codes that spend the majority of their time executing the content of simple loops, the PARALLEL DO directive can result in significant parallel performance.
OpenMP Example #2: Private Variables

!$OMP PARALLEL DO SHARED(A,B,C,N) PRIVATE(I,TEMP)
do I=1,N
   TEMP = A(I)/B(I)
   C(I) = TEMP + SQRT(TEMP)
end do
!$OMP END PARALLEL DO

• In this loop, each processor needs its own private copy of the variable TEMP. If TEMP were shared, the result would be unpredictable since multiple processors would be writing to the same memory location.
Distributed Memory Programming: MPI
• Distributed memory systems have separate address spaces for each processor
  – Local memory is accessed faster than remote memory
  – Data must be manually decomposed
  – MPI is the standard for distributed memory programming (a library of subprogram calls)
  – Older message passing libraries include PVM and P4; all vendors have native libraries such as SHMEM (T3E) and LAPI (IBM)
MPI Example #1

• Every MPI program needs these:

#include <mpi.h>  /* the mpi include file */

/* Initialize MPI */
ierr = MPI_Init(&argc, &argv);

/* How many total PEs are there? */
ierr = MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

/* What node am I (what is my rank)? */
ierr = MPI_Comm_rank(MPI_COMM_WORLD, &myid);
...
ierr = MPI_Finalize();
MPI Example #2

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int myid, numprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    /* print out my rank and this run's PE size */
    printf("Hello from %d\n", myid);
    printf("Numprocs is %d\n", numprocs);
    MPI_Finalize();
    return 0;
}
Fourth Topic – Supercomputer Centers and Rankings
• DOE National Labs – LANL, LLNL, Sandia, etc.
• DOE Office of Science Labs – ORNL, NERSC, BNL, etc.
• DOD, NASA supercomputer centers
• NSF supercomputer centers for academic users
  – San Diego Supercomputer Center (UCSD)
  – National Center for Supercomputing Applications (UIUC)
  – Pittsburgh Supercomputing Center (Pittsburgh)
  – Texas Advanced Computing Center
  – Indiana
  – Purdue
  – ANL-Chicago
  – ORNL
  – LSU
  – NCAR
TeraGrid: Integrating NSF Cyberinfrastructure

TeraGrid is a facility that integrates computational, information, and analysis resources at the San Diego Supercomputer Center, the Texas Advanced Computing Center, the University of Chicago / Argonne National Laboratory, the National Center for Supercomputing Applications, Purdue University, Indiana University, Oak Ridge National Laboratory, the Pittsburgh Supercomputing Center, LSU, and the National Center for Atmospheric Research.

[Map: TeraGrid sites – SDSC, TACC, UC/ANL, NCSA, ORNL, PU, IU, PSC, NCAR, LSU – plus partner sites such as Caltech, USC-ISI, Utah, Iowa, Cornell, Buffalo, UNC-RENCI, and Wisc.]
Measure of Supercomputers

• Top 500 list (HPL code performance)
  – Is one of the measures, but not the measure
  – Japan’s Earth Simulator (NEC) was on top for 3 years: 40 TFLOPs peak, 35 TFLOPs on HPL (87% of peak)
• In Nov 2005 the LLNL IBM BlueGene reached the top spot: ~65,000 nodes, 280 TFLOPs on HPL, 367 TFLOPs peak (currently 596 TFLOPs peak and 478 TFLOPs on HPL – 80% of peak)
  – First 100 TFLOP and 200 TFLOP sustained on a real application
• June 2008: RoadRunner at LANL achieves 1 PFLOPs on HPL (1.375 PFLOPs peak; 73% of peak)
• New HPCC benchmarks
• Many others – NAS, NERSC, NSF, DOD TI06, etc.
• The ultimate measure is the usefulness of a center for you – enabling better or new science through simulations on balanced machines
Other Benchmarks
• HPCC – High Performance Computing Challenge benchmarks – no rankings
• NSF benchmarks – HPCC, SPIO, and applications: WRF, OOCORE, GAMESS, MILC, PARATEC, HOMME (these are changing; new ones are being considered)
• DoD HPCMP – TI0X benchmarks
Kiviat diagrams plot machine balance along axes such as: floating point performance, processor-to-memory bandwidth, inter-processor communication of small messages, inter-processor communication of large messages, latency and bandwidth for simultaneous communication patterns, and total communication capacity of the network.
Fifth Topic – SDSC Parallel Machines
SDSC: Data-intensive Computing for the TeraGrid
40 TF compute, 2.5 PB disk, 25 PB archive

• DataStar: IBM Power4+, 15.6 TFlops
• TeraGrid Linux Cluster: IBM/Intel IA-64, 4.4 TFlops
• Archival systems: 25 PB capacity (~5 PB used)
• Storage Area Network disk: 2500 TB, Sun F15K disk server
• BlueGene Data: IBM PowerPC, 17.1 TFlops
• OnDemand Cluster: Dell/Intel, 2.4 TFlops
DataStar is a powerful compute resource well-suited to “extreme I/O” applications

• Peak speed 15.6 TFlops
• IBM Power4+ processors (2528 total)
• Hybrid of 2 node types, all on a single switch
  – 272 8-way p655 nodes:
    • 176 with 1.5 GHz processors, 16 GB/node (2 GB/proc)
    • 96 with 1.7 GHz processors, 32 GB/node (4 GB/proc)
  – 11 32-way p690 nodes: 1.7 GHz, 64-256 GB/node (2-8 GB/proc)
• Federation switch: ~6 us latency, ~1.4 GB/sec point-to-point bandwidth
• At 283 nodes, ours is one of the largest IBM Federation switches
• All nodes are direct-attached to high-performance SAN disk: 3.8 GB/sec write, 2.0 GB/sec read to GPFS
• GPFS now has 115 TB capacity
• ~700 TB of gpfs-wan across NCSA, ANL
• Will be retired in October 2008 for national users

Due to consistent high demand, in FY05 we added 96 1.7 GHz/32 GB p655 nodes and increased GPFS storage from 60 to 125 TB:
  – Enables 2048-processor capability jobs
  – ~50% more throughput capacity
  – More GPFS capacity and bandwidth
SDSC’s three-rack BlueGene/L system
BG/L System Overview: Novel, massively parallel system from IBM

• Full system installed at LLNL from 4Q04 to 3Q05; addition in 2007
  – 106,496 nodes (212,992 cores)
  – Each node is two low-power PowerPC processors plus memory
  – Compact footprint with very high processor density
  – Slow processors and modest memory per processor
  – Very high peak speed of 596 Tflop/s
  – #1 in the top500 until June 2008 – Linpack speed of 478 Tflop/s
  – Two applications have run at over 100 (2005) and 200+ (2006) Tflop/s
• Many BG/L systems in the US and outside
• Now there are BG/P machines – ranked #3, #6, and #9 on the top500
• Need to select apps carefully
  – Must scale (at least weakly) to many processors (because they’re slow)
  – Must fit in limited memory
BG/L packaging hierarchy:

• Chip (2 processors): 2.8/5.6 GF/s, 4 MB
• Compute card (2 chips, 2x1x1): 5.6/11.2 GF/s, 0.5 GB DDR
• Node board (32 chips, 4x4x2; 16 compute cards): 90/180 GF/s, 8 GB DDR
• Cabinet (32 node boards, 8x8x16): 2.9/5.7 TF/s, 256 GB DDR
• System (64 cabinets, 64x32x32): 180/360 TF/s, 16 TB DDR
SDSC was the first academic institution with an IBM Blue Gene system

• The SDSC rack has the maximum ratio of I/O to compute nodes, at 1:8 (LLNL’s is 1:64). Each of the 128 I/O nodes in a rack has a 1 Gbps Ethernet connection => 16 GBps/rack potential.
• SDSC procured a 1-rack system 12/04, used initially for code evaluation and benchmarking; production 10/05. Now SDSC has 3 racks. (The LLNL system initially had 64 racks.)
BG/L System Overview:SDSC’s 3-rack system
• 3072 compute nodes and 384 I/O nodes (each with 2 processors)
  – Most I/O-rich configuration possible (8:1 compute:I/O node ratio)
  – Identical hardware in each node type, with different networks wired
  – Compute nodes connected to: torus, tree, global interrupt, and JTAG
  – I/O nodes connected to: tree, global interrupt, Gigabit Ethernet, and JTAG
  – IBM network: 4 us latency, 0.16 GB/sec point-to-point bandwidth
  – I/O rates of 3.4 GB/s for writes and 2.7 GB/s for reads achieved on GPFS-WAN
• Two half racks (also confusingly called midplanes) per rack
  – Connected via link chips
• Front-end nodes (2 B80s, each with 4 Power3 processors, plus 1 Power5 node)
• Service node (Power 275 with 2 Power4+ processors)
• Two parallel file systems using GPFS
  – Shared /gpfs-wan serviced by 58 NSD nodes (each with 2 IA-64s)
  – Local /bggpfs serviced by 12 NSD nodes (each with 2 IA-64s)
BG System Overview: Processor Chip (1)
[Block diagram of the BG/L system-on-a-chip: two 440 CPUs (one acting as I/O processor), each with a "double FPU" and 32k/32k L1 caches; per-processor L2 buffers feeding a multiported shared SRAM buffer; a shared L3 directory (with ECC) for the 4 MB embedded DRAM, usable as L3 cache or memory; a DDR controller with ECC driving 144-bit-wide 512 MB DDR; and network interfaces for the torus (6 links out and 6 in, each at 1.4 Gbit/s), tree (3 out and 3 in, each at 2.8 Gbit/s), global interrupt, Gigabit Ethernet, and JTAG access.]
BG System Overview: Processor Chip (2) (= system-on-a-chip)
• Two 700-MHz PowerPC 440 processors
– Each with two floating-point units
– Each with 32-kB L1 data caches that are not coherent
– 4 flops per processor clock peak (= 2.8 Gflop/s per processor)
– Two 8-B loads or stores per processor clock peak in L1 (= 11.2 GB/s per processor)
• Shared 2-kB L2 cache (or prefetch buffer)
• Shared 4-MB L3 cache
• Five network controllers (though not all are wired to each node)
– 3-D torus (for point-to-point MPI operations: 175 MB/s nominal x 6 links x 2 ways)
– Tree (for most collective MPI operations: 350 MB/s nominal x 3 links x 2 ways)
– Global interrupt (for MPI_Barrier: low latency)
– Gigabit Ethernet (for I/O)
– JTAG (for machine control)
• Memory controller for 512 MB of off-chip, shared memory
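The per-node bandwidth figures quoted above multiply out as follows; a sketch checking the slide's numbers:

```python
# Aggregate nominal network bandwidth per node (values from the slide).
torus_bw = 175e6 * 6 * 2   # 175 MB/s x 6 links x 2 ways = 2.1 GB/s
tree_bw  = 350e6 * 3 * 2   # 350 MB/s x 3 links x 2 ways = 2.1 GB/s

# L1 bandwidth per processor: two 8-byte loads/stores per 700 MHz clock.
l1_bw = 2 * 8 * 700e6      # = 11.2 GB/s, matching the slide

print(torus_bw / 1e9, tree_bw / 1e9, l1_bw / 1e9)  # 2.1 2.1 11.2
```

Interestingly, torus and tree come out to the same aggregate per node; the tree trades fewer links for double the per-link rate, which suits collective reductions.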
Sixth Topic – Allocations on NSF Supercomputer Centers
• http://www.teragrid.org/userinfo/getting_started.php?level=new_to_teragrid
• Development Allocation Committee (DAC) awards up to 30,000 CPU-hours and/or some TBs of disk (these amounts are going up)
• Larger allocations awarded through merit-review of proposals by panel of computational scientists
• UC Academic Associates
– Special program for UC campuses
– www.sdsc.edu/user_services/aap
Medium and large allocations
• MRAC: requests of 10,001-500,000 SUs, reviewed quarterly
• LRAC: requests of more than 500,000 SUs, reviewed twice per year
• Requests can span all NSF-supported resource providers
• Multi-year requests and awards are possible
New: Storage Allocations
• SDSC is now making disk storage and database resources available via the merit-review process
– SDSC Collections Disk Space: >200 TB of network-accessible disk for data collections
– TeraGrid GPFS-WAN: many 100s of TB of parallel file system attached to TG computers; a portion is available for long-term storage allocations
– SDSC Database: dedicated disk/hardware for high-performance databases (Oracle, DB2, MySQL)
And all this will cost you…
…absolutely nothing. $0, plus the time to write your proposal.
Seventh Topic – One Application Example – Turbulence
Turbulence using Direct Numerical Simulation (DNS)
Evolution of Computers and DNS
1D Decomposition: can use N processors for an N³ grid
2D Decomposition: can use N² processors for an N³ grid
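The local sub-grid shapes under the two decompositions can be sketched as follows; the function names are illustrative, not taken from any particular DNS code:

```python
# How an N^3 grid is split among ranks for a parallel FFT.
def slab_shape(N, P):
    """1D (slab) decomposition: each of P <= N ranks owns N/P full planes."""
    assert N % P == 0 and P <= N
    return (N // P, N, N)

def pencil_shape(N, p1, p2):
    """2D (pencil) decomposition: p1*p2 <= N^2 ranks each own an
    (N/p1) x (N/p2) x N pencil of the grid."""
    assert N % p1 == 0 and N % p2 == 0 and p1 * p2 <= N * N
    return (N // p1, N // p2, N)

print(slab_shape(8, 4))       # (2, 8, 8)
print(pencil_shape(8, 4, 4))  # (2, 2, 8)
```

The slab form keeps two full dimensions local (so 1D FFTs along them need no communication), but caps the processor count at N; pencils keep only one dimension local, but scale to N² ranks.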
Communication
Global communication: traditionally a serious challenge for scaling applications to large node counts.
• 1D decomposition: 1 all-to-all exchange involving P processors
• 2D decomposition: 2 all-to-all exchanges within p1 groups of p2 processors each (p1 x p2 = P)
• Which is better? Most of the time 1D wins. But again: it can't be scaled beyond P = N.
• Crucial parameter is bisection bandwidth
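A rough sketch of per-rank message counts for the all-to-all exchanges described above (the helper functions are illustrative and ignore message sizes and latency/bandwidth tradeoffs):

```python
# Messages each rank sends during the transpose step.
def messages_per_rank_1d(P):
    """1D: one all-to-all over all P ranks."""
    return P - 1

def messages_per_rank_2d(p1, p2):
    """2D: one all-to-all within the rank's row (p2 ranks) and one
    within its column (p1 ranks) of the p1 x p2 process grid."""
    return (p2 - 1) + (p1 - 1)

print(messages_per_rank_1d(4096))    # 4095
print(messages_per_rank_2d(64, 64))  # 126
```

The 2D scheme sends far fewer (but proportionally larger) messages per rank, which is part of why it remains viable at processor counts where P exceeds N and the 1D scheme cannot run at all.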
Performance: towards 4096³