TRANSCRIPT
Computational Nanotechnology and Molecular Engineering Workshop, Caltech, Pasadena, January 5–17, 2004.
Andrés Jaramillo-Botero
Building a Beowulf Class “Supercomputer”
Andrés Jaramillo-Botero, Pontificia Universidad Javeriana, Cali, Colombia
Outline
- Working Definitions
- Why Parallel Computing?
- Parallel Computing: Interconnect Topologies, Programming Models, and Metrics
- Beowulf Class Computer: Motivation and Overview
- Linux-based Beowulf
  - Hardware Inventory for a home-grown Beowulf Parallel Computer Cluster
  - Communication Software and Interconnect Operations
  - Practical Aspects: Software, Benchmarks, Compilers
ASCI White 512 / 8192, LLNL
Working Definitions: Architectures
- General Computing Cluster
  - An ensemble of interconnected computing systems, each capable of standalone operation
- PC Cluster
  - Set of independent computers
    - COTS
    - Capable of full independent operation
    - Employed individually for standalone mainstream workloads / applications
    - Uniprocessor or SMP nodes
  - Supervised within a single administrative domain as a single system
  - An interconnection network
    - COTS
    - LAN or SAN or multiple separate network structures
    - Dedicated to cluster nodes and separate from the external environment
Source: Thomas Sterling
Working Definitions: Architectures
- NOWs
  - Network of Workstations (UCB ’95)
- Constellations
  - A Cluster of Clusters
    - An ensemble of N nodes, each comprising p computing elements
    - The p elements are tightly bound shared memory (e.g., SMP, DSM)
    - The N nodes are loosely coupled, i.e., distributed memory
    - p is greater than N
    - The distinction is which layer gives us the most power through parallelism
- MPPs (Massively Parallel Processors)
  - Built with specialized (costly) networks by vendors, with the intent of being used as a parallel computer
ASCI Blue Mountain (LANL)
- SGI Origin 2000: 48 SMP computers
- 128 processors each @ 250 MHz
- 6144 processors total
- 4 × 32 4-way SMP nodes
- 1.608/3.072 TF peak performance
- 1.5 TB memory
- 75 TB disk
10,000 square feet of floor space, 1.6 MW of power, 530 tons of cooling capability, 384 cabinets to house the 6144 CPUs, 48 cabinets for the metarouters, 96 cabinets for the disks, 8 cabinets for the 36 HiPPI switches, and ~476 miles of fiber cable.
http://www.lanl.gov/projects /asci/bluemtn/ASCI_fly.pdf
NOW 1997
- 100+ UltraSPARC
- 128 MB memory, 2 × 2 GB disks, Ethernet, Myrinet
- Largest Myrinet in the world
- First cluster on the TOP500
Working Definitions: Applications
- Embarrassingly Parallel (rare):
  - Little or no dependence between individual calculations
  - Extremely parallel
  - Good scalability (low network dependence)
  - E.g. Monte Carlo simulations, particle physics, and cryptography
- Block-Level Parallel:
  - Computational domain can be partitioned across several nodes
  - Each node solves its own computational domain and shares results for the edges of its segments with neighboring nodes
  - Common type of parallel application (uses message passing)
  - Scalability tied to performance of the network infrastructure
  - E.g. ScaLAPACK, molecular dynamics
Source: Doug Johnson, OSC
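The embarrassingly parallel pattern above can be sketched in a few lines. This is a hypothetical illustration (the function names are ours, not from the slides): Monte Carlo estimation of pi, where each worker draws its samples independently and the only communication is the final reduction of hit counts.

```python
# Hypothetical sketch of an embarrassingly parallel workload: Monte Carlo
# estimation of pi. Workers share no data while computing; the only
# communication is the final sum of per-worker hit counts.
import random
from multiprocessing import Pool

def count_hits(args):
    """Count random points that fall inside the unit quarter-circle."""
    n_samples, seed = args
    rng = random.Random(seed)  # independent stream per worker
    hits = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

def parallel_pi(n_samples=400_000, n_workers=4):
    per_worker = n_samples // n_workers
    with Pool(n_workers) as pool:
        # No dependence between tasks -> near-linear scaling.
        hits = pool.map(count_hits,
                        [(per_worker, seed) for seed in range(n_workers)])
    return 4.0 * sum(hits) / (per_worker * n_workers)

if __name__ == "__main__":
    print(parallel_pi())
```

Because the tasks never exchange intermediate results, scalability depends almost entirely on the compute nodes, not on the network — exactly the "low network dependence" property named above.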
Working Definitions: Applications
- Loop-Level Parallelism:
  - Inner or intermediate loops may be run in parallel (threads)
  - Amenable to parallelism using compiler directives such as OpenMP (appropriate for vector computers)
  - Shared memory required, hence runs better on SMPs
  - Not scalable to a large number of processors
  - E.g. POSIX threads
- Multi-Level Parallel:
  - Hybrid block-level and loop-level
  - Mostly independent blocks which can be calculated using message passing, and each can be further parallelized at loop level using SMP
  - Limited scalability (proportional to the number of blocks)
  - E.g. multi-grid Navier-Stokes solver
- Serial: NO parallelism exploited
Source: Doug Johnson, OSC
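The block-level pattern described above (each node solving its own sub-domain and exchanging edge values with neighbors) can be sketched as follows. This is a hypothetical illustration with made-up function names; the two "nodes" are simulated serially to keep it short, whereas a real Beowulf code would do the halo exchange with message passing.

```python
# Hypothetical sketch of block-level parallelism: a 1-D Jacobi stencil with
# the domain split across two "nodes". Before each sweep, every node sends
# the edge (halo) cell of its block to its neighbor. The nodes are simulated
# serially here; in a real cluster code the exchange is a send/receive pair.

def jacobi_sweep(block, left_halo, right_halo):
    """One stencil sweep over a block, given halo values from neighbors."""
    padded = [left_halo] + block + [right_halo]
    return [(padded[i - 1] + padded[i + 1]) / 2.0
            for i in range(1, len(padded) - 1)]

def run_decomposed(grid, sweeps):
    """Two-block domain decomposition with halo exchange each sweep."""
    mid = len(grid) // 2
    left, right = grid[:mid], grid[mid:]
    for _ in range(sweeps):
        # "Communication": each node shares its edge cell with its neighbor.
        left_edge, right_edge = left[-1], right[0]
        left = jacobi_sweep(left, 0.0, right_edge)    # domain boundary = 0
        right = jacobi_sweep(right, left_edge, 1.0)   # domain boundary = 1
    return left + right

def run_single(grid, sweeps):
    """Reference: the same sweeps on one undivided domain."""
    for _ in range(sweeps):
        grid = jacobi_sweep(grid, 0.0, 1.0)
    return grid
```

The decomposed run reproduces the single-domain run exactly, because the halo exchange supplies each block with the same neighbor values the undivided sweep would see; per-sweep communication volume (one cell per shared edge) is what ties scalability to the network, as the slide notes.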
Working Definitions: Types of Parallelism
- Data Parallelism
  - The same task executing simultaneously over multiple sub-regions of the same data.
  - Characteristics
    - Takes advantage of the large accumulated memory capacity in a parallel machine, making it easier to code,
    - Different processors execute the same function over different sub-regions of the same data space,
    - Less flexibility: the problem must be naturally (embarrassingly) parallel,
    - Requires domain decomposition of the problem.
Working Definitions: Types of Parallelism
- Functional Parallelism
  - Different tasks executed simultaneously over different data spaces.
  - Characteristics
    - Natural programming scheme for programmers with modular programming skills,
    - Defining scalable functions grows harder as the number of processors increases,
    - Load balancing and synchronization become important issues,
    - Different processors execute different tasks,
    - Many applications implement some form of each, in spite of their apparently complementary nature.
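The contrast between the two schemes can be shown side by side. This is a hypothetical sketch (names and data are ours): data parallelism runs one function over partitions of one data set, while functional parallelism runs different functions, each on its own data.

```python
# Hypothetical sketch contrasting the two types of parallelism on the same
# inputs. Data parallelism: ONE task over sub-regions of one data set.
# Functional parallelism: DIFFERENT tasks, dispatched concurrently.
from concurrent.futures import ThreadPoolExecutor

def mean(xs):
    return sum(xs) / len(xs)

def span(xs):
    return max(xs) - min(xs)

data = list(range(1, 9))  # the shared problem data

with ThreadPoolExecutor(max_workers=2) as pool:
    # Data parallelism: same task (sum) over two halves of the same data,
    # followed by a combine step over the partial results.
    partial_sums = list(pool.map(sum, [data[:4], data[4:]]))
    data_parallel_total = sum(partial_sums)

    # Functional parallelism: different tasks submitted concurrently.
    mean_future = pool.submit(mean, data)
    span_future = pool.submit(span, data)
    results = (mean_future.result(), span_future.result())
```

The data-parallel half needs a domain decomposition and a combine step; the functional half instead raises the load-balancing question the slide mentions, since `mean` and `span` need not cost the same.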
Working Definitions: Taxonomy
- MIMD parallel computers:
  - MPPs: Massively Parallel Processors (MP)
  - NOWs: Networks of Workstations (MP – Clusters)
  - SMPs: Symmetric Multiprocessing (SM)
  - DSMs: Distributed Shared Memory (hybrid MP and SM)
- Special cases: NUMA and cc-NUMA architectures
[Figure: Flynn-Johnson taxonomy]
Working Definitions: Microprocessor Architectures
- CISC
  - Complex Instruction Set Computer
- RISC
  - Reduced Instruction Set Computer
- VLIW
  - Very Long Instruction Word Computer
Why Parallel Computing?
- Solve bigger problems faster
- Opportunities in Molecular Simulations
  - Increased Fidelity of Molecular Models
    - Bond energies critical for describing many chemical phenomena
    - Accuracy of calculated bond energies increased significantly since 1997
  - Converged Molecular Calculations
  - Increased Molecular Size
    - Large-scale systems (atomistic)
    - Long-term simulations for phenomena to occur
    - Need potentials to model nanoscale processes
    - Little data from experiment, need accurate calculations
[Plot: error in calculated bond energies (kcal/mol, log scale from 1 to 100) vs. year, 1970–2000]
Source: Thom H. Dunning, Jr., Joint Institute for Computational Sciences, Oak Ridge National Laboratory, Oak Ridge, Tennessee
Why Parallel Computing?
1. Computer simulations are far cheaper and faster than physical experiments.
2. Computers can solve a much wider range of problems than specific laboratory equipment can.
3. Computational approaches are only limited by computer speed and memory capacity, while physical experiments have many practical constraints.

Good Parallel Candidate Applications:
- Predictive Modeling and Simulations
- Engineering Design and Automation
- Energy Resources Exploration
- Medical, Military and Basic Research
- Visualization
Why Don't Parallel Programs Perform as Intended?
1. Type of Problem (50%)
   - Degree of parallelism, mapping onto the multiprocessor system
2. Algorithm Construction (40%)
   - Inefficient algorithms that do not exploit the natural concurrency of a problem
   - Bad load balancing (CPU utilization)
   - Unnecessary communication (overhead)
   - Sequential bottleneck (algorithm reorganization)
   - Bad scheduling (time reorganization, synchronization)
3. Middleware, Language and Compiler Choice
   - Inefficient distribution and coordination of tasks
   - High inter-processor communication latency due to inefficient middleware
4. Operating System
   - Inefficient internal scheduler, file systems and memory allocation/de-allocation
5. Hardware
   - Idle processors due to conflicts with shared resources
   - Slow interconnects
Parallel Computing: Interconnect Topologies
- Bus: a single shared data path
  - Pros: simplicity
  - Cons: does not scale well, resource contention, fixed bandwidth
- Mesh:
  - 2-D array of processors
    - Each processor has a connection to 4 neighbors
  - Torus / wraparound mesh
    - Processors on the edges of the mesh are connected
Parallel Computing: Interconnect Topologies
- Hypercube:
  - A d-dimensional hypercube has p = 2^d processors
  - Each processor is directly connected to d other processors
  - The shortest path between a pair of processors is at most d
- Switch based:
  - m×n switches
  - E.g. star topology, shuffle
  - Omega network
    - log2(p) stages
    - Connects any processor to any memory
- Others: ring
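The hypercube properties above follow directly from the Gray-coded node labels and can be checked in a few lines. This is a hypothetical sketch (the helper names are ours): neighbors differ in exactly one bit, and the shortest path between two nodes is the Hamming distance of their labels.

```python
# Hypothetical sketch of the hypercube properties quoted above: in a
# d-dimensional hypercube the p = 2**d nodes carry d-bit labels, each node's
# d neighbors differ from it in exactly one bit, and the shortest path
# between two nodes is their Hamming distance (at most d).

def neighbors(node, d):
    """The d direct neighbors of a node: flip one label bit at a time."""
    return [node ^ (1 << k) for k in range(d)]

def hops(a, b):
    """Shortest-path length = Hamming distance of the two labels."""
    return bin(a ^ b).count("1")

d = 3                      # 3-D cube -> p = 8 nodes, labels 000..111
p = 2 ** d
assert all(len(neighbors(n, d)) == d for n in range(p))
# The network diameter equals d (e.g., 000 to 111 takes 3 hops).
assert max(hops(a, b) for a in range(p) for b in range(p)) == d
```

Routing is equally simple: correcting the differing bits one at a time walks a shortest path, which is why the hypercube combines low diameter with only log2(p) ports per node.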
[Figure: 3-D and 4-D hypercubes with Gray-coded node labels 000–111; switch-based network diagram]
Parallel Computing: Important Interconnect Parameters
1. Network Diameter
   The longest of the shortest paths between various pairs of nodes (should be small to minimize latency). More important with store-and-forward routing than with wormhole routing (node relaying).
2. Bisection Bandwidth (BB)
   The smallest number (total capacity) of links that need to be cut in order to divide the network into two sub-networks of half the size. (Need full bisection bandwidth.) A small BB limits the rate of data transfer between the two halves of the network, thus affecting the performance of communication-intensive algorithms.
3. Vertex or Node Degree
   The number of communication ports required of each node, which should be a constant, independent of network size, if the architecture is to be readily scalable to larger sizes. Directly impacts cost.
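The first two parameters can be computed directly for a concrete topology. Here is a hypothetical sketch (helper names are ours) for a simple ring: diameter via breadth-first search, and the links crossing one balanced cut of the nodes.

```python
# Hypothetical sketch: diameter and bisection width of an N-node ring.
# Diameter comes from breadth-first search over all sources; bisection width
# counts the links crossing a cut that splits the nodes into equal halves.
# (For a ring, the contiguous half/half cut shown is already the minimum.)
from collections import deque

def ring_links(n):
    """Unit-capacity links of an n-node ring: 0-1, 1-2, ..., (n-1)-0."""
    return [(i, (i + 1) % n) for i in range(n)]

def diameter(n, links):
    """Longest of the shortest paths, by BFS from every node."""
    adj = {i: [] for i in range(n)}
    for a, b in links:
        adj[a].append(b)
        adj[b].append(a)
    worst = 0
    for src in range(n):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        worst = max(worst, max(dist.values()))
    return worst

def bisection_width(n, links):
    """Links crossing the contiguous cut {0..n/2-1} vs {n/2..n-1}."""
    half = set(range(n // 2))
    return sum((a in half) != (b in half) for a, b in links)

# An 8-node ring: diameter n/2 = 4 hops, and any balanced cut severs only
# 2 links -- a low BB that throttles half-to-half traffic.
assert diameter(8, ring_links(8)) == 4
assert bisection_width(8, ring_links(8)) == 2
```

With per-link bandwidths instead of unit capacities, summing them over the minimizing cut gives the bisection bandwidth the slide defines.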
Source: Richard Morrison, Cluster Computing
Cray T3E
Full BB: nodes of any two halves can communicate at full speed with each other.
[Figure: example network whose bisection bandwidth is BB = 3]

BB = min over partitions of the N nodes into equal halves N1, N2 (|N1| = |N2| = N/2) of Σ B_i, summed over the links i that cross the cut.
Parallel Computing: Programming Paradigms
- Message Passing (Beowulfs!)
  - MPI (Message Passing Interface)
  - PVM (Parallel Virtual Machine)
  - Characteristics:
    - Hard to program
    - More control over work sharing and data distribution
    - Code portability
    - Optimized libraries available
- Shared Memory
  - Alternatives for programming:
    - Threads (light processes),
    - Parallel programming languages (HPF),
    - Preprocessor compiler directives to declare and manipulate shared data (OpenMP),
    - Library routines with an existing sequential language,
    - A sequential programming language and a parallelizing compiler.
  - Characteristics:
    - Simple code
    - Synchronization required
    - Less exploitation of data locality
    - Poor portability
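MPI itself is a C/Fortran library, but the message-passing model it embodies can be sketched in Python with multiprocessing pipes. This is a hypothetical illustration (names and the scatter/reduce analogy are ours): no shared data, and the programmer explicitly controls distribution with sends and blocking receives.

```python
# Hypothetical sketch of the message-passing model using multiprocessing
# pipes: explicit send/receive, no shared memory, programmer-controlled
# data distribution -- loosely analogous to MPI scatter + reduce.
from multiprocessing import Process, Pipe

def worker(conn):
    chunk = conn.recv()        # blocking receive, like MPI_Recv
    conn.send(sum(chunk))      # explicit send of the partial result
    conn.close()

def scatter_reduce(data, n_workers=2):
    """Scatter equal chunks to workers, reduce their partial sums."""
    size = len(data) // n_workers   # assumes n_workers divides len(data)
    pipes, procs = [], []
    for rank in range(n_workers):
        parent, child = Pipe()
        p = Process(target=worker, args=(child,))
        p.start()
        parent.send(data[rank * size:(rank + 1) * size])  # "scatter"
        pipes.append(parent)
        procs.append(p)
    total = sum(conn.recv() for conn in pipes)            # "reduce"
    for p in procs:
        p.join()
    return total

if __name__ == "__main__":
    print(scatter_reduce(list(range(100))))
```

Note how every data movement is written out by hand — the "hard to program, more control" trade-off listed above, and the paradigm Beowulf clusters are built around.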
Parallel Performance Metrics: Speedup and Efficiency
- Assuming:
  - Ts = best sequential algorithm run time, i.e. time to calculate all operations on a single processor.
  - Tp = parallel algorithm run time, i.e. time to calculate the same quantity of work distributed among p processors of the same type.
- Speedup: ratio of sequential processing time to parallel processing time.

  Speedup = Ts / Tp

- Efficiency (η): ratio between the speedup factor and the number of processors:

  η = Speedup / p = Ts / (p · Tp) ≤ 1

  Tp = Ts / p  ∴  η = 1 (ideal loading)
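The two metrics translate directly into code. A minimal sketch, with hypothetical helper names:

```python
# The two metrics above as assumed helper functions (names are ours).
def speedup(t_serial, t_parallel):
    """S = Ts / Tp: how many times faster the parallel run is."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, p):
    """eta = S / p = Ts / (p * Tp); 1.0 means ideal loading."""
    return speedup(t_serial, t_parallel) / p

# Ideal case: Tp = Ts / p gives eta = 1.
assert efficiency(100.0, 25.0, 4) == 1.0
# Realistic case: overhead makes Tp > Ts / p, so eta < 1.
assert round(efficiency(100.0, 30.0, 4), 3) == 0.833
```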
Parallel Performance Metrics: Redundancy, Utilization and Quality
- Redundancy (Rp): the ratio between the total number of operations Op executed in performing some computation with p processors and the number of operations Os required to execute the same computation with a single processor. Rp is related to the time lost because of overhead, and is always larger than 1.

  Rp = Op / Os

- Utilization (Up): the ratio between the actual number of operations Op and the number of operations that could be performed with p processors in Tp time units:

  Up = Op / (p · Tp)

- Quality (Qp): the quality factor of a parallel computation using p processors is:

  Qp = Ts³ / (p · Tp² · Op)
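Using the slide's symbols (Os/Op for operation counts, Ts/Tp for run times), the three metrics can be sketched as hypothetical helpers:

```python
# The three metrics above as assumed helper functions (names are ours).
def redundancy(ops_parallel, ops_serial):
    """Rp = Op / Os >= 1: extra operations caused by parallel overhead."""
    return ops_parallel / ops_serial

def utilization(ops_parallel, p, t_parallel):
    """Up = Op / (p * Tp): fraction of available op-slots actually used."""
    return ops_parallel / (p * t_parallel)

def quality(t_serial, t_parallel, p, ops_parallel):
    """Qp = Ts**3 / (p * Tp**2 * Op)."""
    return t_serial ** 3 / (p * t_parallel ** 2 * ops_parallel)

# Example (made-up numbers): p = 4, Ts = 100, Tp = 30, Os = 100, Op = 110.
assert redundancy(110, 100) == 1.1
assert round(utilization(110, 4, 30.0), 4) == 0.9167
assert round(quality(100.0, 30.0, 4, 110), 3) == 2.525
```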
Parallel Performance Metrics: Isoefficiency
- Efficiency (in terms of p)
  - s: problem size, p: processors
  - w(s): workload
  - h(s,p): communication overhead

  η_p = w(s) / (w(s) + h(s,p)) = 1 / (1 + h(s,p)/w(s))

  - As p grows, communication overhead h(s,p) increases and efficiency η_p decreases.
  - For growing s, w(s) usually increases much faster than h(s,p).
  - An increase of w(s) may outweigh the increase of h(s,p) for a growing processor number p.
- Question: for growing p, how fast must s grow for efficiency to remain constant?
  w(s) should grow in proportion to h(s,p):

  w(s) = [η_p / (1 − η_p)] · h(s,p) = C · h(s,p)

- Isoefficiency function: f_η(p) = C · h(s,p). If the workload w(s) grows as fast as f_η(p), constant efficiency can be maintained.
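A small numerical check of the isoefficiency idea, using a hypothetical overhead model (the model and names are ours, not from the slides): with h(s,p) = p·log2(p) and w(s) = s, efficiency stays constant if s grows like C·p·log2(p).

```python
# Hypothetical numerical check of isoefficiency: eta = w / (w + h) stays
# constant when the workload grows in proportion to the overhead.
import math

def eta(w, h):
    """Efficiency as defined above: eta_p = w / (w + h)."""
    return w / (w + h)

def overhead(s, p):
    """Assumed overhead model h(s, p) = p * log2(p)."""
    return p * math.log2(p)

C = 4.0
for p in (2, 4, 8, 16):
    s = C * p * math.log2(p)        # grow the problem with the machine ...
    # ... and efficiency holds at C / (C + 1) = 0.8 for every p.
    assert abs(eta(s, overhead(s, p)) - C / (C + 1)) < 1e-12

# Fixed problem size instead: efficiency decays as p grows.
assert eta(64, overhead(64, 16)) < eta(64, overhead(64, 4))
```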
Parallel Performance Metrics: Amdahl's Law (1967)

  Speedup = 1 / (S + P/p)

  - P: parallel fraction
  - S = 1 − P: serial fraction
  - p: number of processors

- For a certain problem size, the speedup curve tends to flatten with an increasing number of processors, because communication and synchronization costs tend to increase.
- As p → ∞, Speedup → 1 / (1 − P) = 1/S.
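The flattening is easy to see numerically. A minimal sketch with a hypothetical helper name:

```python
# Amdahl's law as an assumed helper function (name is ours): with serial
# fraction S = 1 - P, speedup = 1 / (S + P/p), bounded above by 1/S.
def amdahl_speedup(parallel_fraction, p):
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / p)

# 90% parallel code: the curve flattens toward the 1/(1-P) = 10x ceiling.
assert round(amdahl_speedup(0.9, 10), 2) == 5.26
assert round(amdahl_speedup(0.9, 100), 2) == 9.17
assert round(amdahl_speedup(0.9, 10_000), 2) == 9.99
```

Going from 100 to 10,000 processors buys less than 1x extra speedup here, which is why the serial fraction, not the processor count, dominates at scale.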
Parallel Performance Metrics: Scalability
- Scalability: a parallel system's capacity to increase or maintain speedup as the number of processors grows.
  - Typically when p is increased, η will normally decrease (due to the increase in parallelization overheads). But often when we increase N (problem size), the ratio of communication (C) to computation (R) decreases, leading to improved η.
  - For an algorithm to be scalable, η1 ≤ η2 must hold for any two different problem sizes, N1 < N2, running on p1 < p2.
  - Where γ is the communication-to-computation ratio:

    t_p = t_s/p + t_overhead = (t_s/p)(1 + γ)   →   η = t_s / (p · t_p) = 1 / (1 + γ)

    η1 = η2  →  1 / (1 + γ1) = 1 / (1 + γ2)  →  γ1 = γ2

  - Thus, for η to remain the same while increasing p, we must be able to find a larger N for which γ remains the same.
Beowulf Class Computer: Brief History
- Thomas Sterling and Donald Becker, CESDIS (Center of Excellence in Space Data and Information Sciences), Goddard Space Flight Center, Greenbelt, MD
- Summer 1994: built an experimental cluster
- Called their cluster Beowulf, after the Old English epic poem (composed before the 11th century) about a legendary Scandinavian warrior
- Genesis: combines low-cost x86, Ethernet, Linux, clustered architecture and MPI programming
- Beowulf adds Ethernet drivers, channel bonding, advanced topologies, applications, and administration tools
- Rapidly sustained GFLOPS performance at low cost
NASA Beowulf Project

1st Beowulf: Wiglaf (1994)
- 16 Intel 80486 DX4 @ 100 MHz
- VESA local bus
- 256 MB memory
- 6.4 GB of disk
- Dual 10base-T Ethernet
- 72 MFlops sustained
- ~US$40K

Hrothgar (1995)
- 16 Intel Pentium @ 100 MHz
- PCI motherboard (Triton)
- 256 KB synchronous cache
- 32 MB – 1 GB memory
- 6.4 GB of disk
- 100base-T Ethernet (hub)
- 240 MFlops sustained
- ~US$46K

Hyglac (1996, Caltech)
- 16 Intel Pentium Pro @ 200 MHz
- PCI motherboard
- 2 GB memory
- 49.6 GB of disk
- 100base-T switch
- 1.25 GFlops sustained
- ~US$50K
History Highlights of Clusters
- 1957 – SAGE by IBM & MIT-LL for Air Force NORAD
- 1976 – Ethernet
- 1984 – Cluster of 160 Apollo workstations by NSA
- 1985 – M31 Andromeda by DEC, 32 VAX 11/750
- 1986 – Production Condor cluster operational
- 1990 – PVM released
- 1993 – First NOW workstation cluster at UC Berkeley
- 1993 – Myrinet introduced
- 1994 – First Beowulf PC cluster at NASA Goddard
- 1994 – MPI standard
- 1996 – >1 Gflops
- 1997 – Gordon Bell Prize for Price-Performance
- 1997 – >10 Gflops
- 1998 – Avalon by LANL on Top500 list
- 1999 – >100 Gflops
- 2000 – Compaq and PSC awarded 5 Tflops by NSF
Beowulf Class Computer: Accomplishments
- An experimental platform for parallel computing
- Established vision of low-cost HPC
- Demonstrated effectiveness of PC clusters
- Provided networking/managing software in Linux
- Introduced mass storage with PVFS
- Achieved >10 Gflops performance (3rd gen.)
- Gordon Bell Prize for Price-Performance
- Conveyed findings to broad community
- Tutorials and books
- Provided design standard to rally community
- Spin-off of Scyld Computing Corp.

Naegling at Caltech CACR
Beowulf Cluster Computing with Linux, Second Edition, by Thomas Sterling (editor), et al.
Why Beowulf? As if we needed reasons…
- Brings high-end computing to a broad range of problems
  - Scalable performance to the high end
- Order-of-magnitude price-performance advantage
  - Commodity enabled
  - No long development lead times (it's here!)
- Low vulnerability to vendor-specific decisions
  - Standardized hardware and logical interfaces,
  - Multiple vendor offerings within the class,
  - Replaceable subsystem components from different vendors,
  - Companies are ephemeral; Beowulfs are forever
- Long-term, multi-generation system architecture class
- Industry-wide, non-proprietary software environment
  - Full-function system software common across the class, thanks to Linux
- Rapid-response technology tracking (requirement responsive)
- Just-in-place, just-in-time user-driven configuration (meets the low expectations set by MPPs)
- Provides extreme flexibility of configuration and operation
- Greatly increases accessibility of parallel computing
Issues You Must Consider
- Significant node-to-node latency (limited by cost and software)
- Bandwidth limitations (getting better fast, still limited by cost)
- Finite wait times in receiving processors
- Synchronization takes time (data coherence)
- Load imbalance
- Complexity and costs of planning and administering tasks
- Finite communication times (not overlapped with run time)
- Heterogeneous scalability
Why Beowulf? Stats from Top 500 List
The 22nd TOP500 List was introduced during the Supercomputer Conference (SC2003) in Phoenix, AZ.
Why Beowulf? Highlights from Top 500 List
- 208 systems are now labeled as clusters, up from 149. This makes clustered systems the most common architecture in the TOP500.
- The number of systems exceeding the 1 TFlop/s mark running the Linpack benchmark jumped from 59 to 131.
- Entry level is now 403.4 GFlop/s, compared to 245.1 GFlop/s six months ago.
- The entry point for the top 100 moved from 708 GFlop/s to 1.142 TFlop/s.
- A total of 189 systems are now using Intel processors. Six months ago there were 119 Intel-based systems on the list, and one year ago only 56.
- With respect to the number of systems, Hewlett-Packard topped IBM again by a small margin. HP is at 165 systems (up from 159) and IBM is at 159 systems (up one system) installed. SGI is again third with 41 systems, down from 54.
- The Cray X1 system appears on the list with 10 installations, with the largest X1 listed at #19.
Source: http://www.top500.org/lists/2003/11/trends.php
Why Beowulf? Top 10
(Rmax, Rpeak in GFlop/s; Nmax, nhalf are Linpack problem sizes)

Rank | Site | Computer / Processors | Manufacturer | Country/Year | Inst. type | Rmax | Rpeak | Nmax | nhalf
1 | Earth Simulator Center | Earth-Simulator (NEC Vector, SX6) / 5120 | NEC | Japan/2002 | Research | 35860 | 40960 | 1.1E+10 | 266240
2 | Los Alamos National Laboratory | ASCI Q - AlphaServer SC45, 1.25 GHz / 8192 (AlphaServer Cluster) | HP | United States/2002 | Research | 13880 | 20480 | 633000 | 225000
3 | Virginia Tech | X, 1100 Dual 2.0 GHz Apple G5 / Mellanox Infiniband 4X / Cisco GigE / 2200 (NOW - PowerPC, G5 Cluster) | Self-made | United States/2003 | Academic | 10280 | 17600 | 520000 | 152000
4 | NCSA | Tungsten, PowerEdge 1750, P4 Xeon 3.06 GHz, Myrinet / 2500 (Dell Cluster) | Dell | United States/2003 | Academic | 9819 | 15300 | 630000 | —
5 | Pacific Northwest National Laboratory | Mpp2, Integrity rx2600 Itanium2 1.5 GHz, Quadrics / 1936 (HP Cluster) | HP | United States/2003 | Research | 8633 | 11616 | 835000 | 140000
Why Beowulf? Top 10 … continued

Rank | Site | Computer / Processors | Manufacturer | Country/Year | Inst. type | Rmax | Rpeak | Nmax | nhalf
6 | Los Alamos National Laboratory | Lightning, Opteron 2 GHz, Myrinet / 2816 (NOW - AMD) | Linux Networx | United States/2003 | Research | 8051 | 11264 | 761160 | 109208
7 | Lawrence Livermore National Laboratory | MCR Linux Cluster Xeon 2.4 GHz - Quadrics / 2304 (NOW - Intel Pentium) | Linux Networx/Quadrics | United States/2002 | Research | 7634 | 11060 | 350000 | 75000
8 | Lawrence Livermore National Laboratory | ASCI White, SP Power3 375 MHz high node / 8192 (IBM SP) | IBM | United States/2000 | Research | 7304 | 12288 | 640000 | —
9 | NERSC/LBNL | Seaborg, SP Power3 375 MHz 16-way high node / 6656 (IBM SP) | IBM | United States/2002 | Research | 7304 | 9984 | 640000 | —
10 | Lawrence Livermore National Laboratory | xSeries Cluster Xeon 2.4 GHz - Quadrics / 1920 (IBM Cluster) | IBM/Quadrics | United States/2003 | Research | 6586 | 9216 | 425000 | 90000
Why Beowulf? Highlights from the Top 10:
• The Earth Simulator, built by NEC, remains the unchallenged #1.
• ASCI Q at Los Alamos is still #2 at 13.88 TFlop/s.
• The third system ever to exceed the 10 TFlop/s mark is Virginia Tech's X, measured at 10.28 TFlop/s. It is based on Apple G5 building blocks, is often referred to as the 'SuperMac', and uses a Mellanox network.
• The fourth system is also a cluster. The Tungsten cluster at NCSA is a Dell PowerEdge-based system using a Myrinet interconnect. It just missed the 10 TFlop/s mark with a measured 9.82 TFlop/s.
• The list of clusters in the TOP10 continues with the upgraded Itanium2-based Hewlett-Packard system, located at DOE's Pacific Northwest National Laboratory, which uses a Quadrics interconnect.
• #6 is the first system in the TOP500 based on AMD's Opteron chip. It was installed by Linux Networx at the Los Alamos National Laboratory and also uses a Myrinet interconnect.
• The list of cluster systems in the TOP10 has grown impressively to seven systems. The Earth Simulator and the two IBM SP systems at Lawrence Livermore and Lawrence Berkeley national labs are the other three systems.
Source: http://www.top500.org/lists/2003/11/trends.php
Why Beowulf? Japanese Earth Simulator
• World's Most Powerful Computer ($)
– 640 nodes x 8 vector processors per node = 5,120 processors
– 8 GFlops peak per processor = 40.96 teraflops peak
– 10.24 terabytes of memory
– 640 x 640 single-stage crossbar switch
• Performance
– LinPack benchmark: 35.86 teraflops
– Atmospheric Global Circulation Model: 26.58 teraflops (T1279L96)
– Plasma Simulation Code (IMPACT-3D): 14.9 TFlops on 512 nodes
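The peak and sustained figures above are easy to sanity-check; a minimal sketch (the variable names are ours, not from the slide):

```python
# Back-of-the-envelope check of the Earth Simulator figures above.
nodes = 640
procs_per_node = 8
gflops_per_proc = 8.0

peak_tflops = nodes * procs_per_node * gflops_per_proc / 1000.0
print(peak_tflops)  # 40.96 TFlops peak = 640 x 8 x 8 GFlops

linpack_tflops = 35.86
efficiency = linpack_tflops / peak_tflops
print(round(efficiency, 3))  # ~0.875: LinPack sustains ~87.5% of peak
```

Sustaining close to 90% of peak on Linpack is a hallmark of the vector architecture; commodity clusters typically sustain a considerably lower fraction.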
Why Beowulf? Top 10 Reasons why YOU should own a Beowulf
1. You are here
2. You are in need of pain
3. Or you don't have enough
4. You have spare $
5. You need your "supercomputer" to grow with you $ (heterogeneous)
6. You are a computer scientist with a strong background in Unix
7. You have "embarrassingly parallel" applications
8. You have pre-existing parallel (message passing) applications
9. You can formulate your application in a message passing model
10. You are limited by existing performance (speed, memory, etc.)
Objective: Design a Cost-Effective Supercomputer with COTS
1. Performance Requirements – What performance is desired? What performance is required? What application(s) will be running on the cluster, and what is the intended purpose?
2. Hardware – What hardware will be required, or is available, to build the cluster computer?
3. Operating Systems – What operating systems are available for cluster computing, and what are their benefits? Focus: Linux.
4. Middleware – Between the parallel application and the operating system we need a way of load balancing, and of managing processes, the system itself, and users.
5. Applications – What software is available for cluster computing, and how can applications be developed?
Performance Factors
• What performance do you need in your Beowulf cluster, and how do you achieve it?
– Suggestion: benchmark your application (or part of it)
• Several system factors affect performance:
– Number of processors
– CPU speed
– Memory size and access times
– Cache size
– Storage architecture and total storage capacity (access, transfer)
– Interconnect (bandwidth, latency)
Why Benchmarks?
• They are instruments to quantify system performance.
• They provide a method for enabling comparative analysis.
• In HPC, benchmarks are used not only to evaluate cluster performance, but also to measure the scalability of the compute nodes for computation- and/or communication-intensive applications.
• Benchmark results can provide relevant insight for determining which systems will best suit the needs of particular applications.
The Linux Benchmarking Project: http://www.tux.org/bench/
Application- and Architecture-Specific HPC Benchmarks
• Application specific
– focus on the performance of different types of parallel applications
– test the HPC system as a whole
– provide users with multiple aspects of the system performance
• Architecture specific, targeting subsystems like:
– the memory subsystem,
– the processor's integer arithmetic and floating point units,
– the interconnect communication subsystem.
• What to do …
– Identify benchmarks that share similar characteristics with, and are comparable to, the user application.
– Although the end application would be the ideal benchmark for that particular system, several industry-standard HPC benchmarks can also be used.
Classes of Benchmark
• Serial Benchmarks
– For assessing the performance of a single-processor machine by measuring the performance rate (Mflop/s), memory bandwidth (MB/s), and the execution time of the application.
• Parallel Benchmarks
– For measuring the performance of the inter-processor communications by measuring the latency and bandwidth of the parallel communications, and the time the application takes to execute.
• Serial and parallel benchmarks are subdivided into 2 categories:
1. Kernels: to measure basic machine parameters
2. Real applications: to evaluate the performance of the machine as a whole (for that particular application)
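As a concrete illustration of the rate metric used by serial kernel benchmarks (a sketch of our own, not one of the standard suites), Mflop/s is simply floating-point operations divided by wall-clock time:

```python
import time

def daxpy(a, x, y):
    # y <- a*x + y: the classic Linpack-style kernel,
    # 2 floating-point operations (multiply + add) per element
    return [a * xi + yi for xi, yi in zip(x, y)]

def mflops(n=200_000, reps=10):
    a, x, y = 2.0, [1.0] * n, [3.0] * n
    t0 = time.perf_counter()
    for _ in range(reps):
        y = daxpy(a, x, y)
    elapsed = time.perf_counter() - t0
    flops = 2.0 * n * reps          # total floating-point operations
    return flops / elapsed / 1e6    # rate in Mflop/s

print(f"{mflops():.1f} Mflop/s")
```

A real kernel benchmark (LFK, Linpack) does this in C or Fortran; interpreted Python rates fall far below the hardware's capability, but the metric is computed the same way.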
Some Standard Benchmarks for HPC
• Kernels and Industry Standard Benchmarks
– Livermore Fortran Kernel (LFK) (http://www.llnl.gov/asci_benchmarks/asci/limited/lfk/asci_lfk.html)
– Linpack. Solves a dense matrix. (http://www.netlib.org/benchmark/hpl/)
– nbench (http://www.byte.com/bmark/faqbmark.htm)
– NPB (NAS Parallel Benchmarks), renamed to PBN in 1999 (http://www.nas.nasa.gov/Research/Tasks/pbn.html)
– Stream (and Stream OpenMP). http://www.cs.virginia.edu/stream/
• Communications
– Pallas MPI Benchmark (PMB) (http://www.pallas.com/e/products/pmb/)
– EFF_BW, a spin-off from PMB
• Applications (Chemistry)
– GAMESS-UK (http://www.dl.ac.uk/CFS/benchmarks/gamess_uk.html)
– DL_POLY
Benchmarks from the Standard Performance Evaluation Corporation
• SPEChpc2002 features:
– Derived from real HPC applications and application practices; measures the overall performance of high-end computer systems, including the computer's processors (CPUs), the interconnection system (shared or distributed memory), the compilers, the MPI and/or OpenMP parallel library implementation, and the input/output system.
– Parallelism supported: serial, OpenMP, MPI, or combined MPI-OpenMP
– Architectures supported: shared memory, distributed memory, clusters
– IO and communication included in the benchmarks
– Implemented under SPEC tools for building applications, running with different data sets, verifying results, and measuring runtime.
• SPEC CHEM2002
– Based on a quantum chemistry application called GAMESS; has performance metrics SPECchemM2002 and SPECchemS2002.
• CPU2000
– SPECint2000, SPECfp2000, SPECint_base2000, SPECint_rate2000, etc. Integer and floating point CPU-intensive benchmarks.

http://www.specbench.org/
The Beowulf Performance Suite (BPS)
• Developed by Paralogic (http://www.plogic.com/bps/)
– A graphical front end to a series of commonly used performance tests for parallel computers. It was designed explicitly for the Beowulf class of computers running Linux, and the suite is packaged as an easy-to-install .rpm file.
– Can be run from the command line by invoking bps, or with the GUI by invoking xbps.
– Generates all test results and graphs (using gnuplot) in .html format, so documentation can be maintained online.
Beowulf Basic Architecture
SMP or Cluster? … (figure: the Brahma cluster logo at Duke)
Design: Layered Model of a Beowulf Computer Cluster
1. CPU Architectures and Hardware Platforms
2. Network Hardware and Communication Protocols
3. Operating System (Linux)
4. Middleware (Single System Image)
5. High Level Programming Languages (C, Fortran)
6. Parallel Applications and Utilities (for HPC)
1. CPU Architectures and Hardware Platforms
Execution Nodes: Choice of Processor
• Two (major) families:
1. Intel x86 [32-bit] compatible (e.g., Pentium)
– Considered commodity systems because there are multiple sources (Intel, AMD, Cyrix); obviously ubiquitous
– Best integer performance
– Available native software
– New Itanium (64-bit)
2. Compaq Alpha systems [64-bit, formerly DEC]
– Performance winner (higher bandwidth to memory and network, best floating point performance)
– Hard to source at a good price
• Others: PowerPC, Sun SPARC
– Limited support and accompanying software distributions available for these architectures

RULE OF THUMB: choose the second-to-last generation ($)
Processors: Intel Compatible, things to look for …
• Core frequency
– Watch out! Different architectures deliver different performance at the same clock
• Front Side Bus (FSB) speed (up to 800 MHz)
• Integer performance (http://www.spec.org/)
• Floating point performance (http://www.spec.org/)
• Hyper-Threading Technology
– Multiple software threads on each processor in the system
• Internal cache
– L1: Execution Trace Cache
– L2: Advanced Transfer Cache
– L3: for large data sets
• Price-performance; some comparison charts available at http://www.tomshardware.com/
– The Pentium 4 1.8 GHz costs 78% more than the 1.7 GHz model
– The Pentium III 1.1 GHz costs 75% more than the Pentium III 1 GHz
• SMP support
Performance: Memory and Locality of Reference
• The success of a memory hierarchy rests upon assumptions that are critical to achieving the appearance of a large, fast memory.
• There are three components of locality of reference, which coexist in an active process:
– Temporal – A tendency for a process to reference in the near future the instructions or data referenced in the recent past. Program constructs that lead to this behavior are loops, temporary variables, and process stacks.
– Spatial – A tendency for a process to reference instructions or data at locations in the virtual address space near that of the last reference.
– Sequentiality – The principle that if the last reference was r_j(t), then there is a likelihood that the next reference is to the immediate successor of the element r_j(t).
• It should be noted that each process exhibits an individual characteristic with respect to the three types of locality.
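Spatial locality can be made concrete with a small example of our own (not from the slides): traversing a row-major 2-D array in row order touches memory sequentially, while column order strides across rows. In a C implementation the row-order walk is typically much faster thanks to cache-line reuse; in interpreted Python the cache effect is muted, so only the access patterns and results are checked here:

```python
import time

N = 500
grid = [[1.0] * N for _ in range(N)]  # stored as a list of rows (row-major)

def sum_row_order(g):
    # Sequential access within each row: good spatial locality
    return sum(v for row in g for v in row)

def sum_col_order(g):
    # Strided access down each column: poor spatial locality
    n = len(g)
    return sum(g[i][j] for j in range(n) for i in range(n))

t0 = time.perf_counter(); s_row = sum_row_order(grid); t1 = time.perf_counter()
s_col = sum_col_order(grid); t2 = time.perf_counter()
print(s_row == s_col)  # same result either way; only the traversal order differs
```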
Processors: Intel vs. AMD
SOURCE: http://www.tomshardware.com/
H J Curnow and B A Wichmann, "A Synthetic Benchmark", Computer Journal Vol 19, No 1 1976
Processors: Intel vs AMD

System | SPECint95 | SPECfp95 | GAMESS-UK time (s) | SPECfp95 performance (rel.) | GAMESS-UK (rel.)
Athlon K7/500 | 22.8 | 18.7 | 26.9 | 1 | 1
Athlon K7/600 | 27.2 | 21.6 | 17.2 | 1.16 | 1.56
Athlon K7/650 | 29.3 | 22.6 | 14.4 | 1.21 | 1.87
Athlon K7/850 | 36.9 | 27.4 | 11.5 | 1.47 | 2.34
Athlon K7/1000 | 42.9 | 29.4 | 11.1 | 1.57 | 2.42
PIII/733 | 35.7 | 28.3 | 13.0 | 3.07 | 2.86
PIII/750 | 36.5 | 28.3 | — | 3.08 | 2.86
PIII/800 | 38.4 | 28.9 | 14.9 | 3.14 | 2.5
PIII/866 | 40.4 | 30.1 | — | 3.27 | —
PIII/1000 | 46.8 | 32.2 | — | 3.5 | —
(Relative columns are indexed to the PII/300 or the K7/500.)

These programs deliver their results indexed to a standardized baseline system (a 40-MHz Sun SparcStation 10) scoring 1.0. In other words, if the SPECint95 result is 5.0, then the tested system is five times faster at integer tasks than the baseline system. SPEC95 has been replaced by CPU2000 (SPECint2000 and SPECfp2000).
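The GAMESS-UK relative-performance figures can be recomputed directly from the runtimes in the table, as relative = baseline runtime / system runtime, with the Athlon K7/500 (26.9 s) as baseline:

```python
# Recompute the GAMESS-UK relative-performance column from the table above.
baseline_s = 26.9  # Athlon K7/500 runtime in seconds
times_s = {"K7/600": 17.2, "K7/650": 14.4, "K7/850": 11.5, "K7/1000": 11.1}
for system, t in times_s.items():
    print(system, round(baseline_s / t, 2))
# K7/600 -> 1.56, K7/650 -> 1.87, K7/850 -> 2.34, K7/1000 -> 2.42
```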
Processors: Intel/AMD
(figure: dedicated bandwidth vs. shared bandwidth)
Processors Price-Performance: Intel/AMD

Intel (@2 GHz) | Price | AMD (@2 GHz) | Price | %
P4 | $189 | Athlon | $89 | 212
P4 | $183 | Athlon XP | $204 | 89
Xeon | $136 | Athlon MP | $123 | 110
Itanium | $1,500 | Opteron | $700 | 214
Source: http://www.pricewatch.com/ 12/21/2003

• Athlon MPs have 3 FPUs and a higher peak FLOP rate
• The P4 with the highest clock rate beats out the Athlon MP with the highest clock rate in real FLOP rate
• Athlon MPs: higher real FLOP per $, more popular
• P4 supports the SSE2 instruction set, which performs SIMD operations on double precision data (2 x 64-bit)
• Athlon MP supports only SSE, for single precision data (4 x 32-bit)
Source: http://www.tomshardware.com
Motherboards: Single-Dual-Quad?
• Motherboard performance depends heavily on the
– Chipset ($)
– Support for SMP (independent data paths for processors, power)
– Maximum operating frequency
– Layout
• Symmetric Multiprocessing (SMP)
– Good for multithreaded applications (but MP libraries are not thread safe)
– AMD's support for SMP is recent; a nice change for Opteron
– The performance increase might not be worth the price
• Memory type and capacity (large enough to avoid disk swapping)
• Overclocking ($)

An important part of increasing performance has to do with chipsets and memory technology. Advertising continues to give you very little of the crucial information that you need in order to be able to evaluate the performance of a motherboard in a complete system. It's not just factors like the speed of the FSB or the memory that play a role here; chipsets can operate at very different speeds even with memory types that are seemingly the same.

About memory:
• Choose grade-A manufacturers: Micron Technology, Rambus, PNY, Kingston, Corsair, LG, Hyundai, Mushkin, and Samsung
• SIMM (EDO) requires pairs; DIMM (SDRAM and DDR) doesn't
• DDR RAM transfers data on both clock edges, doubling the effective frequency
• For P4, use DDR over RDRAM ($)

Source: http://www.hardwareanalysis.com/content/reviews/article/1511.4/
Storage: Hard Disks
(figure: IBM 34GXP drive)
• General criteria:
– Access time = command overhead time + seek time + settle time + latency
– Latency = 30000 / spindle speed (ms, with spindle speed in RPM)
– Rotational speed
– Buffer (cache)
– Size (the smaller the better)
– Capacity
– Benchmark: Bonnie++ (http://www.coker.com.au/bonnie++/)
• Central and individual nodes
• Disk type options:
– IDE/EIDE
· Cheaper
· Seek times: 8.0–12 ms
· Built-in controllers
– SCSI
· Good expansion capability: RAID (Redundant Array of Inexpensive Disks)
· Seek times: <8.5 ms
· Some motherboards have built-in controllers
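The latency formula above is just half of one spindle revolution, expressed in milliseconds:

```python
def rotational_latency_ms(rpm):
    # Average rotational latency = half a revolution:
    # (60000 ms per minute) / rpm / 2 = 30000 / rpm
    return 30000.0 / rpm

for rpm in (5400, 7200, 10000, 15000):
    print(rpm, "RPM ->", round(rotational_latency_ms(rpm), 2), "ms")
# e.g. 7200 RPM -> 4.17 ms, 15000 RPM -> 2.0 ms
```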
Execution Nodes
• Rack mount, or desktop/tower? The choice depends on space, cost, and migration/reusability:
– Rack-mount solutions save space, but at a higher cost (ideal for large systems)
– Desktop or tower cases are cheap, but consume more power (heat) and are messy and big (space)
– If you go for the rack, you are stuck with the hardware for its entire life (difficult to scale given changing form factors)
– If you go for standard cases, you can push the cluster hardware to office or lab use after a short period and update the cluster
• Power distribution (neat, but not necessary for small systems)
– Highly desirable to have network-addressable power distribution units
– Can remotely power-cycle compute nodes
– Instrumented, which helps determine power needs
(figure: power distribution unit with Ethernet port and power sockets; $1,555 for a 16-port Sentry R016-1-1-1PT)

Don't install the cluster in your office (loud, messy and hot)
Design: Layered Model of a Beowulf Computer Cluster
1. CPU Architectures and Hardware Platforms
2. Network Hardware and Communication Protocols
3. Operating System (Linux)
4. Middleware (Single System Image)
5. High Level Programming Languages
6. Parallel Applications and Utilities
Interconnection Network
Interconnect: Selection Criteria
n Some design considerations for the selection of a node interconnect in a Beowulf cluster:
– Linux support: yes/no; kernel driver or library (kernel drivers are preferred)
– Maximum bandwidth: the higher the better
– Minimum latency: the lower the better (small network diameter)
– Available as: single-vendor / multiple-vendor hardware
– Interface port/bus used: high performance, included as a standard node port, with bandwidth matched to the dedicated node network fabric
– Network structure: bus / switched / topology
– Cost per machine connected: the lower the better
– Scalability
– PRICE!
n Switched, full-duplex Ethernet is the most commonly used network in Beowulf systems, and gives almost the same performance as a fully meshed network (at significantly lower cost, thanks to decreasing prices for high-speed Ethernet). Switched Ethernet provides dedicated bandwidth between any two nodes connected to the switch. If higher inter-node bandwidth is required, we can use channel bonding ('ifenslave', *NASA) to connect multiple channels of Ethernet to each node.
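As a rough sketch of the channel-bonding idea above (interface names and the address are hypothetical, and the exact steps depend on the kernel's bonding module version):

```shell
# Hypothetical two-channel bonding setup, run as root on each node.
# Assumes the Linux bonding module and the 'ifenslave' tool are installed,
# and that eth0 and eth1 are the two NICs dedicated to the cluster network.
modprobe bonding mode=0                            # 0 = balance-rr: stripe packets round-robin
ifconfig bond0 192.168.1.10 netmask 255.255.255.0 up
ifenslave bond0 eth0 eth1                          # enslave both NICs to the bond0 master
```

With both channels bonded, large messages see close to twice the single-link bandwidth, at the cost of a second NIC and switch port per node.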
Node Interconnection Technologies Supported in Linux

COTS
n CAPERS (Cable Adapter for Parallel Execution and Rapid Synchronization)
n 10Mb Ethernet
n 100Mb Ethernet (Fast Ethernet)
n 1000Mb Ethernet (Gigabit Ethernet)
n 10G Ethernet (10 Gigabit Ethernet)
n PLIP (Parallel Line Interface Protocol)
n SLIP (Serial Line Interface Protocol)
n USB (Universal Serial Bus)

Vendor Specific
n Myrinet (http://www.myri.com/)
n Parastation (http://wwwipd.ira.uka.de/parastation)
n Quadrics
n ArcNet (token-based protocol, http://www.arcnet.com/)
n ATM, Asynchronous Transfer Mode (http://lrcwww.epfl.ch/linux-atm/)
n SCSI (Small Computer Systems Interconnect)
n SHRIMP (Scalable, High-Performance, Really Inexpensive Multi-Processor)
Network Hardware: COTSNetwork Hardware: COTS
$5$5BusBusUSBUSBN.A.N.A.12Mb/s12Mb/sKernel driversKernel driversUSBUSB
$2$2Cable between Cable between 2 nodes2 nodes
RS232CRS232C1,0000 1,0000 µµss0.1Mb/s0.1Mb/sKernel driversKernel driversSLIPSLIP
$2$2Cable between Cable between 2 nodes2 nodes
SPPSPP1,0000 1,0000 µµss1.2Mb/s1.2Mb/sKernel driversKernel driversPLIPPLIP
N.A.N.A.Switch or Switch or FDRsFDRsPCIPCI--XXN.A.N.A.10,000Mb/s10,000Mb/sKernel driversKernel drivers10Gb Ethernet10Gb Ethernet
$2,500*$2,500*Switch or Switch or FDRsFDRsPCIPCI300 300 µµss1,000Mb/s1,000Mb/sKernel driversKernel drivers1,000Mb 1,000Mb EthernetEthernet
$400 *$400 *Switch, hub or Switch, hub or hublesshubless busbus
PCIPCI80 80 µµss100Mb/s100Mb/sKernel driversKernel drivers100Mb Ethernet100Mb Ethernet
$150 * $150 * ($100 ($100 hublesshubless))
Switch, hub or Switch, hub or hublesshubless busbus
PCIPCI100 100 µµss10Mb/s10Mb/sKernel driversKernel drivers10Mb Ethernet10Mb Ethernet
$2$2Cable between Cable between 2 nodes2 nodes
SPPSPP2 2 µµss1.2Mb/s1.2Mb/sAPI LibraryAPI LibraryCAPERSCAPERS
Cost per Node Cost per Node Network Network StructureStructure
Interface Interface port/bus usedport/bus used
Minimum Minimum LatencyLatency
Maximum Maximum BandwidthBandwidth
Linux SupportLinux SupportTechnologyTechnology
*multiple vendor hardwareFDR: Full Duplex RepeatersFDR: Full Duplex Repeaters
Before you buy consult latest list of supported drivers @ http://www.scyld.com/network_index.html
Network Hardware: Vendor*Network Hardware: Vendor*
$1,479*$1,479*SwitchedSwitched hubshubsPCIPCI5 5 µµs *MPIs *MPI350MB/s350MB/s--900MB/s*1900MB/s*1
KernelKernel driversdrivers andandlibrarieslibraries
QuadricsQuadricsQsNetII*QsNetII*
N.A.N.A.Mesh backplane Mesh backplane *ala Paragon*ala Paragon
EISAEISA5 5 µµss180Mb/s180Mb/sUserUser--level memory level memory mapped interfacemapped interface
SHRIMPSHRIMP
N.A.N.A.Inter node bus Inter node bus sharing SCSI devicessharing SCSI devices
PCI, EISA, ISAPCI, EISA, ISAN.A.N.A.5Mb/s 5Mb/s –– 20Mb/s20Mb/sKernel driversKernel driversSCSISCSI
$3,000$3,000Switched hubsSwitched hubsPCIPCI120 120 µµss155Mb/s 155Mb/s *(1,200Mb/s)*(1,200Mb/s)
Kernel driversKernel driversATMATM
$200$200UnswitchedUnswitched hub or hub or bus *ringbus *ring
ISAISA1,000 1,000 µµss2.5Mb/s2.5Mb/sKernel driversKernel driversArcnetArcnet
>$1,000>$1,000HublessHubless meshmeshPCIPCI2 2 µµss125Mb/s125Mb/sHAL or socket HAL or socket librarylibrary
ParastationParastation
$1,200+$1,200+Switched hubsSwitched hubsPCIPCI6.36.3 µµss248MB/s hd 248MB/s hd --489MB/s fd*1489MB/s fd*1
LibraryLibraryMyrinetMyrinetPCIPCI--XX
Cost per Cost per Node Node
Network StructureNetwork StructureInterface Interface port/bus usedport/bus used
Minimum Minimum LatencyLatency
Maximum Maximum BandwidthBandwidth
Linux SupportLinux SupportTechnologyTechnology
* Currently Supporting Linux
Sustained
What you need to know about the Network Interconnect
n Don't buy a hub!
n Store-and-Forward
– Copies incoming packets to memory
– Delivers a packet when it can arbitrate the transfer across the switch
n Non-Blocking
– Can process incoming packets without storing them
n Buy a non-blocking switch with the second-to-largest port density available ($); it will permit future scaling of the Beowulf
Example Configuration ($)

| Part                                                                                                      | No. | US$   | Total    |
|-----------------------------------------------------------------------------------------------------------|-----|-------|----------|
| 24-port switch Cisco 2950SX-24                                                                            | 1   | 1,479 | 1,479    |
| Intel SE7505VB2 Dual Xeon motherboard (1 CPU, 8MB video, Ethernet 100/1000, Serial Ultra ATA, SCSI opt.)  | 17  | 668   | 11,356   |
| PC cases (300W)                                                                                           | 17  | 20    | 340      |
| RAM DIMM DDR266 256MB                                                                                     | 18  | 56.93 | 1,024.7  |
| 10/100Mb Ethernet adapter (3Com 3C905C)                                                                   | 18  | 10    | 180      |
| HD SCSI ST150176LC 10Krpm 100GB 8.2ms (otherwise use Ultra ATA)                                           | 17  | 50    | 850      |
| 1.44MB FD                                                                                                 | 17  | 7     | 119      |
| SVGA adapter GeForce FX 5900 128MB                                                                        | 1   | 179   | 179      |
| CDROM 56X                                                                                                 | 1   | 18    | 18       |
| Keyboard                                                                                                  | 1   | 10    | 10       |
| 21" Multisync monitor SONY FDCPD-G520                                                                     | 1   | 399   | 399      |
| Monitor/keyboard switch boxes, keyboard and video extensions, network cables (CAT5)                       |     |       | 250      |
| GRAND TOTAL                                                                                               |     |       | 16,204.7 |

Source: http://www.pricewatch.com, 12/22/2003
Example Configuration… on the cheap sideExample Configuration… on the cheap side
1700170010010017171.5MHz P41.5MHz P4
4,9184,918GRAND TOTALGRAND TOTAL
250250Monitor/Keyboard switch boxes, keyboard and video extensions, neMonitor/Keyboard switch boxes, keyboard and video extensions, network twork cables (CAT5),cables (CAT5),
1501501501501117” 17” MultisyncMultisync monitormonitor
1010101011KeyboardKeyboard
1818181811CDROM 56XCDROM 56X
2525252511SVGA Adapter 64MSVGA Adapter 64M
1191197717171.44MB FD1.44MB FD
6466463838171720GB EIDE HDD20GB EIDE HDD
909055181810/100Mb Ethernet Adapter10/100Mb Ethernet Adapter
68068020203434128MB SDRAM DIMM128MB SDRAM DIMM
34034020201717PC cases (300W)PC cases (300W)
85085050501717MotherboardMotherboard
40404040112424--port switchport switch
TotalTotalUS$US$No.No.PartPart
Source: http://www.pricewatch.com/12/22/2003
Design: Layered Model of a Beowulf Computer Cluster
1. CPU Architectures and Hardware Platforms
2. Network Hardware and Communication Protocols
3. Operating System (Linux)
4. Middleware (Single System Image)
5. High Level Programming Languages
6. Parallel Applications and Utilities
Linux (not proprietary and free)
n Strictly speaking, Linux (introduced by Linus Torvalds in 1991) is an operating system kernel:
– Controls all devices
– Manages resources
– Schedules user processes
n The kernel, together with software from the GNU project and others, forms a usable operating system. Very robust.
n Supports different network protocols, full SMP since v2.1, POSIX compliance, true multitasking, virtual memory, shared libraries, demand loading, …
n Runs on all Intel x86s, Alpha, PPC, Sparc, Motorola 68k, MIPS, ARM, HP-PA RISC, and more
n It is UNIX-like, but not UNIX: a rewrite based on published POSIX standards
n Most common distribution for Beowulf: Red Hat (http://www.redhat.com)
– Includes RPM (Red Hat Package Manager)
– Others: Debian, Slackware, …
Design: Layered Model of a Beowulf Computer Cluster
1. CPU Architectures and Hardware Platforms
2. Network Hardware and Communication Protocols
3. Operating System (Linux)
4. Middleware (Single System Image)
5. High Level Programming Languages
6. Parallel Applications and Utilities
Middleware
n A software layer added on top of the operating system to provide what is known as a Single System Image (SSI).
n Provides uniform access to different nodes on a cluster, regardless of the operating system running on a particular node.
n Is responsible for providing high availability, by means of load balancing and responding to failures in individual components.
Middleware: Desirable Objectives for Cluster Services and Functions
1. Single Entry Point – the user logs onto the cluster rather than onto an individual computer.
2. Single File Hierarchy – the user sees a single file directory hierarchy under the same root directory.
3. Single Control Point – there is a default workstation used for cluster management and control, usually known as the server.
4. Single Virtual Networking – any node can access any other point in the cluster, even though the actual cluster configuration may consist of multiple interconnected networks.
5. Single Memory Space – distributed shared memory enables programs to share variables.
6. Single Job-Management System – under a cluster job scheduler, a user can submit a job without specifying the host computer to execute it.
7. Single User Interface – a common graphic interface supports all users, regardless of the workstation from which they enter the cluster.
8. Single I/O Space – any node can remotely access any I/O peripheral or disk device without knowledge of its physical location.
9. Single Process Space – a uniform process-identification scheme is used; a process on any node can create or communicate with any other process on a remote node.
10. Checkpointing – periodically saves the process state and intermediate computing results, to allow rollback recovery after a failure.
11. Process Migration – enables load balancing.
Source: Richard Morrison, Cluster Computing
Middleware: Requirements for System Software and Tools for HPC
ü Robust node OS (easy configuration, installation, boot)
n Parallel programming API
n Application package development
n Parallel file systems
n System administration and management
n Job and process scheduling
n Parallel debugging and performance monitoring
n Checkpoint and restart
Middleware …
n Parallel Programming API
– Beowulfs are independent computers connected via a communication network, and need to pass messages
– Need to ensure a degree of portability across different nodes (e.g. 32-bit and 64-bit machines)
– PVM (ORNL) and MPI (a standard)
– MPI provides more functionality (controlled by the MPI Forum)
– PVM contains fault-tolerance features
– OpenMP (compiler)
n Supports multi-platform shared-memory programming. OpenMP is a portable, scalable model that gives shared-memory programmers a simple and flexible interface for developing parallel applications for platforms ranging from the desktop to the supercomputer.
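To make the message-passing workflow concrete, here is a minimal sketch of writing, compiling, and launching an MPI program on a finished cluster. It assumes an MPI implementation (e.g. MPICH or LAM) is installed and that a `machines` file listing the node hostnames exists; the file and program names are hypothetical:

```shell
# Write a minimal MPI "hello" program (each rank reports its identity).
cat > hello.c <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);                  /* start the MPI runtime        */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this process's rank (id)     */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of processes    */
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
EOF

mpicc -O2 -o hello hello.c                       # mpicc wraps cc with MPI flags/libs
mpirun -np 4 -machinefile machines ./hello       # launch 4 ranks across the nodes
```

The same binary runs unchanged whether the four ranks land on one SMP node or on four separate nodes; MPI hides the transport.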
MPI: Message Passing Interface
Why MPI over PVM?
1. MPI has several freely available, quality implementations
2. MPI defines a 3rd-party profiling mechanism
3. MPI has full asynchronous communication
4. MPI groups are solid, efficient, and deterministic
5. MPI efficiently manages message buffers
6. MPI synchronization protects the user from 3rd-party software
7. MPI can efficiently program MPPs and clusters
8. MPI is highly portable
9. MPI is formally specified
10. MPI is a standard

MPI-2
• Parallel I/O
• Remote memory operations (put, get)
• Dynamic process management
• Support for threads (POSIX)
Middleware …
n Application Development Packages
– Large base of serial code
– Abstraction vs. efficiency problem
– Parallel programs must last longer than parallel machines
– Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering, Ian T. Foster, http://www-unix.mcs.anl.gov/dbpp/
n Parallel file systems (Parallel Virtual File System, PVFS and PVFS2)
– Clemson University, 1993
– Objective: high-throughput file system
– Strategy:
n Exploit parallelism of bandwidth
n Provide a user interface so that applications can make powerful requests, such as a large collection of non-contiguous data with a single request for multidimensional data sets; allow applications direct access to servers without going through the kernel
– Characteristics:
n N clients and N servers
n A single file is spread across multiple disks and nodes, and accessed by multiple tasks
n The actual distribution of a file is configurable on a file-by-file basis
http://www.parl.clemson.edu/pvfs2/
Middleware …
n System administration and management
– Aspen Beowulf Cluster Management Software (ABC) http://www.aspsys.com/software/abc/
– Scyld http://www.scyld.com/products.html
– Ganglia (distributed monitoring and execution system) http://ganglia.sourceforge.net/
n Job and process scheduling
– Condor
– Maui (SMP-enabled) http://supercluster.org/maui/
– PBS (OpenPBS), a workload management system http://www.openpbs.org/
n Coordinates resource-utilization policy and user job requirements
n Multiple users, multiple jobs, multiple nodes
n Functionality:
– Manages parallel job execution (MPI, MPL, PVM, HPF)
– Interactive and batch cross-system scheduling
– Security and access-control lists
– Dynamic distribution and automatic load-leveling of workload
– Job and user accounting
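As an illustration of scheduling without naming hosts, a minimal OpenPBS batch script might look like the sketch below; the job name, resource counts, and program name are hypothetical:

```shell
#!/bin/sh
# Hypothetical OpenPBS batch script, submitted with:  qsub job.pbs
#PBS -N md_run                  # job name (hypothetical)
#PBS -l nodes=4:ppn=2           # request 4 nodes, 2 processors per node
#PBS -l walltime=02:00:00       # wall-clock limit for the job
#PBS -j oe                      # merge stdout and stderr into one file

cd $PBS_O_WORKDIR               # PBS starts jobs in $HOME by default
# $PBS_NODEFILE lists the hosts PBS allocated to this job
mpirun -np 8 -machinefile $PBS_NODEFILE ./md_run
```

The scheduler, not the user, decides which 4 of the cluster's nodes run the job, which is exactly the "single job-management system" objective listed earlier.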
Middleware …
n Cluster Software – Solvers (Libraries)
– BLAS (Basic Linear Algebra Subprograms)
n High-quality "building block" routines for performing basic vector and matrix operations. http://www.netlib.org/blas/
– FFTW
n A C subroutine library for computing the Discrete Fourier Transform (DFT) in one or more dimensions, of both real and complex data, and of arbitrary input size. http://www.fftw.org/
– LMPI
n A library for post-mortem analysis of the communication behavior of parallel MPI programs. http://www.lrz-muenchen.de/services/software/parallel/lmpi/
– METIS & ParMETIS
n A family of programs for partitioning unstructured graphs and hypergraphs and computing fill-reducing orderings of sparse matrices. http://www-users.cs.umn.edu/~karypis/metis/
– MpCCI
n Mesh-based parallel Code Coupling Interface, a code-coupling interface for multidisciplinary applications. http://www.mpcci.org/
– NetCDF
n network Common Data Form: an interface for array-oriented data access and a library that provides an implementation of the interface. http://www.unidata.ucar.edu/packages/netcdf/
– Numerical Python
n Adds a fast, compact, multidimensional array language facility to Python. http://www.pfdubois.com/numpy/
Middleware …
n Cluster Software – Solvers (Libraries) … continued
– PARASOL
n ParaSol is a parallel discrete-event simulation system that supports optimistic and adaptive synchronization methods. http://www.cs.purdue.edu/research/PaCS/parasol.html
– PETSc
n A suite of data structures and routines for the scalable (parallel) solution of scientific applications modeled by partial differential equations; employs MPI. http://www-fp.mcs.anl.gov/petsc/
– PLAPACK (Parallel Linear Algebra Package)
n Coding parallel algorithms is generally regarded as a formidable task; PLAPACK is an infrastructure for coding such algorithms at a high level of abstraction. http://www.cs.utexas.edu/users/plapack/
– PSPASES (Parallel SPArse Symmetric dirEct Solver)
n A high-performance, scalable, parallel, MPI-based library for solving linear systems of equations involving sparse symmetric positive definite matrices. http://www-users.cs.umn.edu/~mjoshi/pspases/
– ScaLAPACK (Scalable LAPACK)
n Includes a subset of LAPACK routines redesigned for distributed-memory MIMD parallel computers. It is currently written in an SPMD style using message passing for interprocessor communication. http://www.netlib.org/scalapack/scalapack_home.html
– VTK (Visualization ToolKit)
n An open-source, freely available software system for 3D computer graphics, image processing, and visualization, used by thousands of researchers and developers around the world. http://public.kitware.com/VTK/
Beowulf Toolkits (free)
n OSCAR (Open Source Cluster Application Resource)
– www.openclustergroup.org
n ROCKS
– http://www.rocksclusters.org
– http://rocks.npaci.edu