TRANSCRIPT
Computational Nanotechnology and Molecular Engineering Workshop, Caltech, Pasadena, January 5–17, 2004.
Andrés Jaramillo-Botero
Building a Beowulf Class “Supercomputer”
Andrés Jaramillo-Botero, Pontificia Universidad Javeriana, Cali, Colombia
Outline
- Working Definitions
- Why Parallel Computing?
- Parallel Computing: Interconnect Topologies, Programming Models, and Metrics
- Beowulf Class Computer: Motivation and Overview
- Linux-based Beowulf
  - Hardware Inventory for a home-grown Beowulf Parallel Computer Cluster
  - Communication Software and Interconnect Operations
  - Practical Aspects: Software, Benchmarks, Compilers
ASCI White 512 / 8192, LLNL
Working Definitions: Architectures
- General Computing Cluster
  - An ensemble of interconnected computing systems, each capable of standalone operation
- PC Cluster
  - Set of independent computers
    - COTS
    - Capable of full independent operation
    - Employed individually for standalone mainstream workloads / applications
    - Uniprocessor or SMP nodes
  - Supervised within a single administrative domain as a single system
  - An interconnection network
    - COTS
    - LAN or SAN or multiple separate network structures
    - Dedicated to cluster nodes and separate from the external environment
Source: Thomas Sterling
Working Definitions: Architectures
- NOWs
  - Network of Workstations (UCB ’95)
- Constellations
  - A Cluster of Clusters
    - An ensemble of N nodes, each comprising p computing elements
    - The p elements are tightly bound shared memory (e.g., SMP, DSM)
    - The N nodes are loosely coupled, i.e., distributed memory
    - p is greater than N
    - The distinction is which layer gives us the most power through parallelism
- MPPs (Massively Parallel Processors)
  - Built with specialized (costly) networks by vendors, with the intent of being used as a parallel computer
ASCI Blue Mountain (LANL)
- SGI Origin 2000: 48 SMP computers
- 128 processors each @ 250 MHz
- 6144 processors total
- 4 × 32 4-way SMP nodes
- 1.608/3.072 TF peak performance
- 1.5 TB memory
- 75 TB disk
10,000 square feet of floor space, 1.6 MW of power, 530 tons of cooling capability, 384 cabinets to house the 6144 CPUs, 48 cabinets for the metarouters, 96 cabinets for the disks, 8 cabinets for the 36 HiPPI switches, and ~476 miles of fiber cable.
http://www.lanl.gov/projects /asci/bluemtn/ASCI_fly.pdf
NOW 1997
- 100+ UltraSPARC
- 128 MB memory, 2 × 2 GB disks, Ethernet, Myrinet
- Largest Myrinet in the world
- First cluster on the TOP500
Working Definitions: Applications
- Embarrassingly Parallel (rare):
  - Little or no dependence between individual calculations
  - Extremely parallel
  - Good scalability (low network dependence)
  - E.g. Monte Carlo simulations, particle physics, and cryptography
- Block-Level Parallel:
  - Computational domain can be partitioned across several nodes
  - Each node solves its own computational domain and shares results for the edges of its segments with neighboring nodes
  - Common type of parallel application (uses message passing)
  - Scalability tied to performance of the network infrastructure
  - E.g. ScaLAPACK, molecular dynamics
Source: Doug Johnson, OSC
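The embarrassingly parallel pattern above can be sketched in a few lines. This is a hypothetical illustration (the function names are ours, not from the slides): Monte Carlo estimation of pi, where each worker draws its samples independently and the only communication is the final reduction of hit counts.

```python
# Hypothetical sketch of an embarrassingly parallel workload: Monte Carlo
# estimation of pi. Workers share no data while computing; the only
# communication is the final sum of per-worker hit counts.
import random
from multiprocessing import Pool

def count_hits(args):
    """Count random points that fall inside the unit quarter-circle."""
    n_samples, seed = args
    rng = random.Random(seed)  # independent stream per worker
    hits = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

def parallel_pi(n_samples=400_000, n_workers=4):
    per_worker = n_samples // n_workers
    with Pool(n_workers) as pool:
        # No dependence between tasks -> near-linear scaling.
        hits = pool.map(count_hits,
                        [(per_worker, seed) for seed in range(n_workers)])
    return 4.0 * sum(hits) / (per_worker * n_workers)

if __name__ == "__main__":
    print(parallel_pi())
```

Because the tasks never exchange intermediate results, scalability depends almost entirely on the compute nodes, not on the network — exactly the "low network dependence" property named above.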
Working Definitions: Applications
- Loop-Level Parallelism:
  - Inner or intermediate loops may be run in parallel (threads)
  - Amenable to parallelism using compiler directives such as OpenMP (appropriate for vector computers)
  - Shared memory required, hence runs better on SMPs
  - Not scalable to a large number of processors
  - E.g. POSIX threads
- Multi-Level Parallel:
  - Hybrid block-level and loop-level
  - Mostly independent blocks which can be calculated using message passing, and each can be further parallelized at loop level using SMP
  - Limited scalability (proportional to the number of blocks)
  - E.g. multi-grid Navier-Stokes solver
- Serial: NO parallelism exploited
Source: Doug Johnson, OSC
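The block-level pattern described above (each node solving its own sub-domain and exchanging edge values with neighbors) can be sketched as follows. This is a hypothetical illustration with made-up function names; the two "nodes" are simulated serially to keep it short, whereas a real Beowulf code would do the halo exchange with message passing.

```python
# Hypothetical sketch of block-level parallelism: a 1-D Jacobi stencil with
# the domain split across two "nodes". Before each sweep, every node sends
# the edge (halo) cell of its block to its neighbor. The nodes are simulated
# serially here; in a real cluster code the exchange is a send/receive pair.

def jacobi_sweep(block, left_halo, right_halo):
    """One stencil sweep over a block, given halo values from neighbors."""
    padded = [left_halo] + block + [right_halo]
    return [(padded[i - 1] + padded[i + 1]) / 2.0
            for i in range(1, len(padded) - 1)]

def run_decomposed(grid, sweeps):
    """Two-block domain decomposition with halo exchange each sweep."""
    mid = len(grid) // 2
    left, right = grid[:mid], grid[mid:]
    for _ in range(sweeps):
        # "Communication": each node shares its edge cell with its neighbor.
        left_edge, right_edge = left[-1], right[0]
        left = jacobi_sweep(left, 0.0, right_edge)    # domain boundary = 0
        right = jacobi_sweep(right, left_edge, 1.0)   # domain boundary = 1
    return left + right

def run_single(grid, sweeps):
    """Reference: the same sweeps on one undivided domain."""
    for _ in range(sweeps):
        grid = jacobi_sweep(grid, 0.0, 1.0)
    return grid
```

The decomposed run reproduces the single-domain run exactly, because the halo exchange supplies each block with the same neighbor values the undivided sweep would see; per-sweep communication volume (one cell per shared edge) is what ties scalability to the network, as the slide notes.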
Working Definitions: Types of Parallelism
- Data Parallelism
  - The same task executing simultaneously over multiple sub-regions of the same data.
  - Characteristics
    - Takes advantage of the large accumulated memory capacity in a parallel machine, making it easier to code,
    - Different processors execute the same function over different sub-regions of the same data space,
    - Less flexibility: the problem must be naturally (embarrassingly) parallel,
    - Requires domain decomposition of the problem.
Working Definitions: Types of Parallelism
- Functional Parallelism
  - Different tasks executed simultaneously over different data spaces.
  - Characteristics
    - Natural programming scheme for programmers with modular programming skills,
    - Defining scalable functions grows harder as the number of processors increases,
    - Load balancing and synchronization become important issues,
    - Different processors execute different tasks,
    - Many applications implement some form of each, in spite of their apparently complementary nature.
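The contrast between the two schemes can be shown side by side. This is a hypothetical sketch (names and data are ours): data parallelism runs one function over partitions of one data set, while functional parallelism runs different functions, each on its own data.

```python
# Hypothetical sketch contrasting the two types of parallelism on the same
# inputs. Data parallelism: ONE task over sub-regions of one data set.
# Functional parallelism: DIFFERENT tasks, dispatched concurrently.
from concurrent.futures import ThreadPoolExecutor

def mean(xs):
    return sum(xs) / len(xs)

def span(xs):
    return max(xs) - min(xs)

data = list(range(1, 9))  # the shared problem data

with ThreadPoolExecutor(max_workers=2) as pool:
    # Data parallelism: same task (sum) over two halves of the same data,
    # followed by a combine step over the partial results.
    partial_sums = list(pool.map(sum, [data[:4], data[4:]]))
    data_parallel_total = sum(partial_sums)

    # Functional parallelism: different tasks submitted concurrently.
    mean_future = pool.submit(mean, data)
    span_future = pool.submit(span, data)
    results = (mean_future.result(), span_future.result())
```

The data-parallel half needs a domain decomposition and a combine step; the functional half instead raises the load-balancing question the slide mentions, since `mean` and `span` need not cost the same.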
Working Definitions: Taxonomy
- MIMD parallel computers:
  - MPPs: Massively Parallel Processors (MP)
  - NOWs: Networks of Workstations (MP – Clusters)
  - SMPs: Symmetric Multiprocessing (SM)
  - DSMs: Distributed Shared Memory (hybrid MP and SM)
- Special cases: NUMA and cc-NUMA architectures
[Figure: Flynn-Johnson taxonomy]
Working Definitions: Microprocessor Architectures
- CISC
  - Complex Instruction Set Computer
- RISC
  - Reduced Instruction Set Computer
- VLIW
  - Very Long Instruction Word Computer
Why Parallel Computing?
- Solve bigger problems faster
- Opportunities in Molecular Simulations
  - Increased Fidelity of Molecular Models
    - Bond energies critical for describing many chemical phenomena
    - Accuracy of calculated bond energies increased significantly since 1997
  - Converged Molecular Calculations
  - Increased Molecular Size
    - Large-scale systems (atomistic)
    - Long-term simulations for phenomena to occur
    - Need potentials to model nanoscale processes
    - Little data from experiment, need accurate calculations
[Plot: error in calculated bond energies (kcal/mol, log scale from 1 to 100) vs. year, 1970–2000]
Source: Thom H. Dunning, Jr., Joint Institute for Computational Sciences, Oak Ridge National Laboratory, Oak Ridge, Tennessee
Why Parallel Computing?
1. Computer simulations are far cheaper and faster than physical experiments.
2. Computers can solve a much wider range of problems than specific laboratory equipment can.
3. Computational approaches are only limited by computer speed and memory capacity, while physical experiments have many practical constraints.

Good Parallel Candidate Applications:
- Predictive Modeling and Simulations
- Engineering Design and Automation
- Energy Resources Exploration
- Medical, Military and Basic Research
- Visualization
Why Don't Parallel Programs Perform as Intended?
1. Type of Problem (50%)
   - Degree of parallelism, mapping onto the multiprocessor system
2. Algorithm Construction (40%)
   - Inefficient algorithms that do not exploit the natural concurrency of a problem
   - Bad load balancing (CPU utilization)
   - Unnecessary communication (overhead)
   - Sequential bottleneck (algorithm reorganization)
   - Bad scheduling (time reorganization, synchronization)
3. Middleware, Language and Compiler Choice
   - Inefficient distribution and coordination of tasks
   - High inter-processor communication latency due to inefficient middleware
4. Operating System
   - Inefficient internal scheduler, file systems and memory allocation/de-allocation
5. Hardware
   - Idle processors due to conflicts with shared resources
   - Slow interconnects
Parallel Computing: Interconnect Topologies
- Bus: a single shared data path
  - Pros: simplicity
  - Cons: does not scale well, resource contention, fixed bandwidth
- Mesh:
  - 2-D array of processors
    - Each processor has a connection to 4 neighbors
  - Torus / wraparound mesh
    - Processors on the edges of the mesh are connected
Parallel Computing: Interconnect Topologies
- Hypercube:
  - A d-dimensional hypercube has p = 2^d processors
  - Each processor is directly connected to d other processors
  - The shortest path between a pair of processors is at most d
- Switch based:
  - m×n switches
  - E.g. star topology, shuffle
  - Omega network
    - log2(p) stages
    - Connects any processor to any memory
- Others: ring
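The hypercube properties above follow directly from the Gray-coded node labels and can be checked in a few lines. This is a hypothetical sketch (the helper names are ours): neighbors differ in exactly one bit, and the shortest path between two nodes is the Hamming distance of their labels.

```python
# Hypothetical sketch of the hypercube properties quoted above: in a
# d-dimensional hypercube the p = 2**d nodes carry d-bit labels, each node's
# d neighbors differ from it in exactly one bit, and the shortest path
# between two nodes is their Hamming distance (at most d).

def neighbors(node, d):
    """The d direct neighbors of a node: flip one label bit at a time."""
    return [node ^ (1 << k) for k in range(d)]

def hops(a, b):
    """Shortest-path length = Hamming distance of the two labels."""
    return bin(a ^ b).count("1")

d = 3                      # 3-D cube -> p = 8 nodes, labels 000..111
p = 2 ** d
assert all(len(neighbors(n, d)) == d for n in range(p))
# The network diameter equals d (e.g., 000 to 111 takes 3 hops).
assert max(hops(a, b) for a in range(p) for b in range(p)) == d
```

Routing is equally simple: correcting the differing bits one at a time walks a shortest path, which is why the hypercube combines low diameter with only log2(p) ports per node.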
[Figure: 3-D and 4-D hypercubes with Gray-coded node labels 000–111; switch-based network diagram]
Parallel Computing: Important Interconnect Parameters
1. Network Diameter
   The longest of the shortest paths between various pairs of nodes (should be small to minimize latency). More important with store-and-forward routing than with wormhole routing (node relaying).
2. Bisection Bandwidth (BB)
   The smallest number (total capacity) of links that need to be cut in order to divide the network into two sub-networks of half the size. (Need full bisection bandwidth.) A small BB limits the rate of data transfer between the two halves of the network, thus affecting the performance of communication-intensive algorithms.
3. Vertex or Node Degree
   The number of communication ports required of each node, which should be a constant, independent of network size, if the architecture is to be readily scalable to larger sizes. Directly impacts cost.
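The first two parameters can be computed directly for a concrete topology. Here is a hypothetical sketch (helper names are ours) for a simple ring: diameter via breadth-first search, and the links crossing one balanced cut of the nodes.

```python
# Hypothetical sketch: diameter and bisection width of an N-node ring.
# Diameter comes from breadth-first search over all sources; bisection width
# counts the links crossing a cut that splits the nodes into equal halves.
# (For a ring, the contiguous half/half cut shown is already the minimum.)
from collections import deque

def ring_links(n):
    """Unit-capacity links of an n-node ring: 0-1, 1-2, ..., (n-1)-0."""
    return [(i, (i + 1) % n) for i in range(n)]

def diameter(n, links):
    """Longest of the shortest paths, by BFS from every node."""
    adj = {i: [] for i in range(n)}
    for a, b in links:
        adj[a].append(b)
        adj[b].append(a)
    worst = 0
    for src in range(n):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        worst = max(worst, max(dist.values()))
    return worst

def bisection_width(n, links):
    """Links crossing the contiguous cut {0..n/2-1} vs {n/2..n-1}."""
    half = set(range(n // 2))
    return sum((a in half) != (b in half) for a, b in links)

# An 8-node ring: diameter n/2 = 4 hops, and any balanced cut severs only
# 2 links -- a low BB that throttles half-to-half traffic.
assert diameter(8, ring_links(8)) == 4
assert bisection_width(8, ring_links(8)) == 2
```

With per-link bandwidths instead of unit capacities, summing them over the minimizing cut gives the bisection bandwidth the slide defines.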
Source: Richard Morrison, Cluster Computing
Cray T3E
Full BB: nodes of any two halves can communicate at full speed with each other.
[Figure: example network whose bisection bandwidth is BB = 3]

BB = min over partitions of the N nodes into equal halves N1, N2 (|N1| = |N2| = N/2) of Σ B_i, summed over the links i that cross the cut.
Parallel Computing: Programming Paradigms
- Message Passing (Beowulfs!)
  - MPI (Message Passing Interface)
  - PVM (Parallel Virtual Machine)
  - Characteristics:
    - Hard to program
    - More control over work sharing and data distribution
    - Code portability
    - Optimized libraries available
- Shared Memory
  - Alternatives for programming:
    - Threads (light processes),
    - Parallel programming languages (HPF),
    - Preprocessor compiler directives to declare and manipulate shared data (OpenMP),
    - Library routines with an existing sequential language,
    - A sequential programming language and a parallelizing compiler.
  - Characteristics:
    - Simple code
    - Synchronization required
    - Less exploitation of data locality
    - Poor portability
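MPI itself is a C/Fortran library, but the message-passing model it embodies can be sketched in Python with multiprocessing pipes. This is a hypothetical illustration (names and the scatter/reduce analogy are ours): no shared data, and the programmer explicitly controls distribution with sends and blocking receives.

```python
# Hypothetical sketch of the message-passing model using multiprocessing
# pipes: explicit send/receive, no shared memory, programmer-controlled
# data distribution -- loosely analogous to MPI scatter + reduce.
from multiprocessing import Process, Pipe

def worker(conn):
    chunk = conn.recv()        # blocking receive, like MPI_Recv
    conn.send(sum(chunk))      # explicit send of the partial result
    conn.close()

def scatter_reduce(data, n_workers=2):
    """Scatter equal chunks to workers, reduce their partial sums."""
    size = len(data) // n_workers   # assumes n_workers divides len(data)
    pipes, procs = [], []
    for rank in range(n_workers):
        parent, child = Pipe()
        p = Process(target=worker, args=(child,))
        p.start()
        parent.send(data[rank * size:(rank + 1) * size])  # "scatter"
        pipes.append(parent)
        procs.append(p)
    total = sum(conn.recv() for conn in pipes)            # "reduce"
    for p in procs:
        p.join()
    return total

if __name__ == "__main__":
    print(scatter_reduce(list(range(100))))
```

Note how every data movement is written out by hand — the "hard to program, more control" trade-off listed above, and the paradigm Beowulf clusters are built around.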
Parallel Performance Metrics: Speedup and Efficiency
- Assuming:
  - Ts = best sequential algorithm run time, i.e. time to calculate all operations on a single processor.
  - Tp = parallel algorithm run time, i.e. time to calculate the same quantity of work distributed among p processors of the same type.
- Speedup: ratio of sequential processing time to parallel processing time.

  Speedup = Ts / Tp

- Efficiency (η): ratio between the speedup factor and the number of processors:

  η = Speedup / p = Ts / (p · Tp) ≤ 1

  Tp = Ts / p  ∴  η = 1 (ideal loading)
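The two metrics translate directly into code. A minimal sketch, with hypothetical helper names:

```python
# The two metrics above as assumed helper functions (names are ours).
def speedup(t_serial, t_parallel):
    """S = Ts / Tp: how many times faster the parallel run is."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, p):
    """eta = S / p = Ts / (p * Tp); 1.0 means ideal loading."""
    return speedup(t_serial, t_parallel) / p

# Ideal case: Tp = Ts / p gives eta = 1.
assert efficiency(100.0, 25.0, 4) == 1.0
# Realistic case: overhead makes Tp > Ts / p, so eta < 1.
assert round(efficiency(100.0, 30.0, 4), 3) == 0.833
```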
Parallel Performance Metrics: Redundancy, Utilization and Quality
- Redundancy (Rp): the ratio between the total number of operations Op executed in performing some computation with p processors and the number of operations Os required to execute the same computation with a single processor. Rp is related to the time lost because of overhead, and is always larger than 1.

  Rp = Op / Os

- Utilization (Up): the ratio between the actual number of operations Op and the number of operations that could be performed with p processors in Tp time units:

  Up = Op / (p · Tp)

- Quality (Qp): the quality factor of a parallel computation using p processors is:

  Qp = Ts³ / (p · Tp² · Op)
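Using the slide's symbols (Os/Op for operation counts, Ts/Tp for run times), the three metrics can be sketched as hypothetical helpers:

```python
# The three metrics above as assumed helper functions (names are ours).
def redundancy(ops_parallel, ops_serial):
    """Rp = Op / Os >= 1: extra operations caused by parallel overhead."""
    return ops_parallel / ops_serial

def utilization(ops_parallel, p, t_parallel):
    """Up = Op / (p * Tp): fraction of available op-slots actually used."""
    return ops_parallel / (p * t_parallel)

def quality(t_serial, t_parallel, p, ops_parallel):
    """Qp = Ts**3 / (p * Tp**2 * Op)."""
    return t_serial ** 3 / (p * t_parallel ** 2 * ops_parallel)

# Example (made-up numbers): p = 4, Ts = 100, Tp = 30, Os = 100, Op = 110.
assert redundancy(110, 100) == 1.1
assert round(utilization(110, 4, 30.0), 4) == 0.9167
assert round(quality(100.0, 30.0, 4, 110), 3) == 2.525
```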
Parallel Performance Metrics: Isoefficiency
- Efficiency (in terms of p)
  - s: problem size, p: processors
  - w(s): workload
  - h(s,p): communication overhead

  η_p = w(s) / (w(s) + h(s,p)) = 1 / (1 + h(s,p)/w(s))

  - As p grows, communication overhead h(s,p) increases and efficiency η_p decreases.
  - For growing s, w(s) usually increases much faster than h(s,p).
  - An increase of w(s) may outweigh the increase of h(s,p) for a growing processor number p.
- Question: for growing p, how fast must s grow for efficiency to remain constant?
  w(s) should grow in proportion to h(s,p):

  w(s) = [η_p / (1 − η_p)] · h(s,p) = C · h(s,p)

- Isoefficiency function: f_η(p) = C · h(s,p). If the workload w(s) grows as fast as f_η(p), constant efficiency can be maintained.
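A small numerical check of the isoefficiency idea, using a hypothetical overhead model (the model and names are ours, not from the slides): with h(s,p) = p·log2(p) and w(s) = s, efficiency stays constant if s grows like C·p·log2(p).

```python
# Hypothetical numerical check of isoefficiency: eta = w / (w + h) stays
# constant when the workload grows in proportion to the overhead.
import math

def eta(w, h):
    """Efficiency as defined above: eta_p = w / (w + h)."""
    return w / (w + h)

def overhead(s, p):
    """Assumed overhead model h(s, p) = p * log2(p)."""
    return p * math.log2(p)

C = 4.0
for p in (2, 4, 8, 16):
    s = C * p * math.log2(p)        # grow the problem with the machine ...
    # ... and efficiency holds at C / (C + 1) = 0.8 for every p.
    assert abs(eta(s, overhead(s, p)) - C / (C + 1)) < 1e-12

# Fixed problem size instead: efficiency decays as p grows.
assert eta(64, overhead(64, 16)) < eta(64, overhead(64, 4))
```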
Parallel Performance Metrics: Amdahl's Law (1967)

  Speedup = 1 / (S + P/p)

  - P: parallel fraction
  - S = 1 − P: serial fraction
  - p: number of processors

- For a certain problem size, the speedup curve tends to flatten with an increasing number of processors, because communication and synchronization costs tend to increase.
- As p → ∞, Speedup → 1 / (1 − P) = 1/S.
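The flattening is easy to see numerically. A minimal sketch with a hypothetical helper name:

```python
# Amdahl's law as an assumed helper function (name is ours): with serial
# fraction S = 1 - P, speedup = 1 / (S + P/p), bounded above by 1/S.
def amdahl_speedup(parallel_fraction, p):
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / p)

# 90% parallel code: the curve flattens toward the 1/(1-P) = 10x ceiling.
assert round(amdahl_speedup(0.9, 10), 2) == 5.26
assert round(amdahl_speedup(0.9, 100), 2) == 9.17
assert round(amdahl_speedup(0.9, 10_000), 2) == 9.99
```

Going from 100 to 10,000 processors buys less than 1x extra speedup here, which is why the serial fraction, not the processor count, dominates at scale.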
Parallel Performance Metrics: Scalability
- Scalability: a parallel system's capacity to increase or maintain speedup as the number of processors grows.
  - Typically when p is increased, η will normally decrease (due to the increase in parallelization overheads). But often when we increase N (problem size), the ratio of communication (C) to computation (R) decreases, leading to improved η.
  - For an algorithm to be scalable, η1 ≤ η2 must hold for any two different problem sizes, N1 < N2, running on p1 < p2.
  - Where γ is the communication-to-computation ratio:

    t_p = t_s/p + t_overhead = (t_s/p)(1 + γ)   →   η = t_s / (p · t_p) = 1 / (1 + γ)

    η1 = η2  →  1 / (1 + γ1) = 1 / (1 + γ2)  →  γ1 = γ2

  - Thus, for η to remain the same while increasing p, we must be able to find a larger N for which γ remains the same.
Beowulf Class Computer: Brief History
- Thomas Sterling and Donald Becker, CESDIS (Center of Excellence in Space Data and Information Sciences), Goddard Space Flight Center, Greenbelt, MD
- Summer 1994: built an experimental cluster
- Called their cluster Beowulf, after the Old English epic poem (composed before the 11th century) about a legendary Scandinavian warrior
- Genesis: combines low-cost x86, Ethernet, Linux, clustered architecture and MPI programming
- Beowulf adds Ethernet drivers, channel bonding, advanced topologies, applications, and administration tools
- Rapidly sustained GFLOPS performance at low cost
NASA Beowulf Project

1st Beowulf: Wiglaf (1994)
- 16 Intel 80486 DX4 @ 100 MHz
- VESA local bus
- 256 MB memory
- 6.4 GB of disk
- Dual 10base-T Ethernet
- 72 MFlops sustained
- ~US$40K

Hrothgar (1995)
- 16 Intel Pentium @ 100 MHz
- PCI motherboard (Triton)
- 256 KB synchronous cache
- 32 MB – 1 GB memory
- 6.4 GB of disk
- 100base-T Ethernet (hub)
- 240 MFlops sustained
- ~US$46K

Hyglac (1996, Caltech)
- 16 Intel Pentium Pro @ 200 MHz
- PCI motherboard
- 2 GB memory
- 49.6 GB of disk
- 100base-T switch
- 1.25 GFlops sustained
- ~US$50K
History Highlights of Clusters
- 1957 – SAGE by IBM & MIT-LL for Air Force NORAD
- 1976 – Ethernet
- 1984 – Cluster of 160 Apollo workstations by NSA
- 1985 – M31 Andromeda by DEC, 32 VAX 11/750
- 1986 – Production Condor cluster operational
- 1990 – PVM released
- 1993 – First NOW workstation cluster at UC Berkeley
- 1993 – Myrinet introduced
- 1994 – First Beowulf PC cluster at NASA Goddard
- 1994 – MPI standard
- 1996 – >1 Gflops
- 1997 – Gordon Bell Prize for Price-Performance
- 1997 – >10 Gflops
- 1998 – Avalon by LANL on Top500 list
- 1999 – >100 Gflops
- 2000 – Compaq and PSC awarded 5 Tflops by NSF
Beowulf Class Computer: Accomplishments
- An experimental platform for parallel computing
- Established vision of low-cost HPC
- Demonstrated effectiveness of PC clusters
- Provided networking/managing software in Linux
- Introduced mass storage with PVFS
- Achieved >10 Gflops performance (3rd gen.)
- Gordon Bell Prize for Price-Performance
- Conveyed findings to broad community
- Tutorials and books
- Provided design standard to rally community
- Spin-off of Scyld Computing Corp.

Naegling at Caltech CACR
Beowulf Cluster Computing with Linux, Second Edition, by Thomas Sterling (editor), et al.
Why Beowulf? As if we needed reasons…
- Brings high-end computing to a broad range of problems
  - Scalable performance to the high end
- Order-of-magnitude price-performance advantage
  - Commodity enabled
  - No long development lead times (it's here!)
- Low vulnerability to vendor-specific decisions
  - Standardized hardware and logical interfaces,
  - Multiple vendor offerings within the class,
  - Replaceable subsystem components from different vendors,
  - Companies are ephemeral; Beowulfs are forever
- Long-term, multi-generation system architecture class
- Industry-wide, non-proprietary software environment
  - Full-function system software common across the class, thanks to Linux
- Rapid-response technology tracking (requirement responsive)
- Just-in-place, just-in-time user-driven configuration (meets the low expectations set by MPPs)
- Provides extreme flexibility of configuration and operation
- Greatly increases accessibility of parallel computing
Issues You Must Consider
- Significant node-to-node latency (limited by cost and software)
- Bandwidth limitations (getting better fast, still limited by cost)
- Finite wait times in receiving processors
- Synchronization takes time (data coherence)
- Load imbalance
- Complexity and costs of planning and administering tasks
- Finite communication times (not overlapped with run time)
- Heterogeneous scalability
Why Beowulf? Stats from Top 500 List
The 22nd TOP500 List was introduced during the Supercomputer Conference (SC2003) in Phoenix, AZ.
Why Beowulf? Highlights from Top 500 List
- 208 systems are now labeled as clusters, up from 149. This makes clustered systems the most common architecture in the TOP500.
- The number of systems exceeding the 1 TFlop/s mark running the Linpack benchmark jumped from 59 to 131.
- Entry level is now 403.4 GFlop/s, compared to 245.1 GFlop/s six months ago.
- The entry point for the top 100 moved from 708 GFlop/s to 1.142 TFlop/s.
- A total of 189 systems are now using Intel processors. Six months ago there were 119 Intel-based systems on the list, and one year ago only 56.
- With respect to the number of systems, Hewlett-Packard topped IBM again by a small margin. HP is at 165 systems (up from 159) and IBM is at 159 systems (up one system) installed. SGI is again third with 41 systems, down from 54.
- The Cray X1 system appears on the list with 10 installations, with the largest X1 listed at #19.
Source: http://www.top500.org/lists/2003/11/trends.php
Why Beowulf? Top 10
(Rmax, Rpeak in GFlop/s; Nmax, nhalf are Linpack problem sizes)

Rank | Site | Computer / Processors | Manufacturer | Country/Year | Inst. type | Rmax | Rpeak | Nmax | nhalf
1 | Earth Simulator Center | Earth-Simulator (NEC Vector, SX6) / 5120 | NEC | Japan/2002 | Research | 35860 | 40960 | 1.1E+10 | 266240
2 | Los Alamos National Laboratory | ASCI Q - AlphaServer SC45, 1.25 GHz / 8192 (AlphaServer Cluster) | HP | United States/2002 | Research | 13880 | 20480 | 633000 | 225000
3 | Virginia Tech | X, 1100 Dual 2.0 GHz Apple G5 / Mellanox Infiniband 4X / Cisco GigE / 2200 (NOW - PowerPC, G5 Cluster) | Self-made | United States/2003 | Academic | 10280 | 17600 | 520000 | 152000
4 | NCSA | Tungsten, PowerEdge 1750, P4 Xeon 3.06 GHz, Myrinet / 2500 (Dell Cluster) | Dell | United States/2003 | Academic | 9819 | 15300 | 630000 | —
5 | Pacific Northwest National Laboratory | Mpp2, Integrity rx2600 Itanium2 1.5 GHz, Quadrics / 1936 (HP Cluster) | HP | United States/2003 | Research | 8633 | 11616 | 835000 | 140000
Why Beowulf? Top 10 … continued

Rank | Site | Computer / Processors | Manufacturer | Country/Year | Inst. type | Rmax | Rpeak | Nmax | nhalf
6 | Los Alamos National Laboratory | Lightning, Opteron 2 GHz, Myrinet / 2816 (NOW - AMD) | Linux Networx | United States/2003 | Research | 8051 | 11264 | 761160 | 109208
7 | Lawrence Livermore National Laboratory | MCR Linux Cluster Xeon 2.4 GHz - Quadrics / 2304 (NOW - Intel Pentium) | Linux Networx/Quadrics | United States/2002 | Research | 7634 | 11060 | 350000 | 75000
8 | Lawrence Livermore National Laboratory | ASCI White, SP Power3 375 MHz high node / 8192 (IBM SP) | IBM | United States/2000 | Research | 7304 | 12288 | 640000 | —
9 | NERSC/LBNL | Seaborg, SP Power3 375 MHz 16-way high node / 6656 (IBM SP) | IBM | United States/2002 | Research | 7304 | 9984 | 640000 | —
10 | Lawrence Livermore National Laboratory | xSeries Cluster Xeon 2.4 GHz - Quadrics / 1920 (IBM Cluster) | IBM/Quadrics | United States/2003 | Research | 6586 | 9216 | 425000 | 90000
Why Beowulf? Highlights from the Top 10:
• The Earth Simulator, built by NEC, remains the unchallenged #1.
• ASCI Q at Los Alamos is still #2 at 13.88 TFlop/s.
• The third system ever to exceed the 10 TFlop/s mark is Virginia Tech's X, measured at 10.28 TFlop/s. It is based on Apple G5 building blocks, is often referred to as the 'SuperMac', and uses a Mellanox network.
• The fourth system is also a cluster. The Tungsten cluster at NCSA is a Dell PowerEdge-based system using a Myrinet interconnect. It just missed the 10 TFlop/s mark with a measured 9.82 TFlop/s.
• The list of clusters in the TOP10 continues with the upgraded Itanium2-based Hewlett-Packard system, located at DOE's Pacific Northwest National Laboratory, which uses a Quadrics interconnect.
• #6 is the first system in the TOP500 based on AMD's Opteron chip. It was installed by Linux Networx at the Los Alamos National Laboratory and also uses a Myrinet interconnect.
• The list of cluster systems in the TOP10 has grown impressively to seven systems. The Earth Simulator and the two IBM SP systems at Lawrence Livermore and Lawrence Berkeley national labs are the other three systems.
Source: http://www.top500.org/lists/2003/11/trends.php
Why Beowulf? Japanese Earth Simulator
• World's Most Powerful Computer ($)
– 640 nodes x 8 vector processors per node = 5,120 processors
– 8 GFlops peak per processor = 40.96 teraflops peak
– 10.24 terabytes of memory
– 640 x 640 single-stage crossbar switch
• Performance
– LinPack benchmark: 35.86 teraflops
– Atmospheric Global Circulation Model: 26.58 teraflops (T1279L96)
– Plasma Simulation Code (IMPACT-3D): 14.9 TFlops on 512 nodes
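The peak and sustained figures above are easy to sanity-check; a minimal sketch (the variable names are ours, not from the slide):

```python
# Back-of-the-envelope check of the Earth Simulator figures above.
nodes = 640
procs_per_node = 8
gflops_per_proc = 8.0

peak_tflops = nodes * procs_per_node * gflops_per_proc / 1000.0
print(peak_tflops)  # 40.96 TFlops peak = 640 x 8 x 8 GFlops

linpack_tflops = 35.86
efficiency = linpack_tflops / peak_tflops
print(round(efficiency, 3))  # ~0.875: LinPack sustains ~87.5% of peak
```

Sustaining close to 90% of peak on Linpack is a hallmark of the vector architecture; commodity clusters typically sustain a considerably lower fraction.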
Why Beowulf? Top 10 Reasons why YOU should own a Beowulf
1. You are here
2. You are in need of pain
3. Or you don't have enough
4. You have spare $
5. You need your "supercomputer" to grow with you $ (heterogeneous)
6. You are a computer scientist with a strong background in Unix
7. You have "embarrassingly parallel" applications
8. You have pre-existing parallel (message passing) applications
9. You can formulate your application in a message passing model
10. You are limited by existing performance (speed, memory, etc.)
Objective: Design a Cost-Effective Supercomputer with COTS
1. Performance Requirements – What performance is desired? What performance is required? What application(s) will be running on the cluster, and what is the intended purpose?
2. Hardware – What hardware will be required, or is available, to build the cluster computer?
3. Operating Systems – What operating systems are available for cluster computing, and what are their benefits? Focus: Linux.
4. Middleware – Between the parallel application and the operating system we need a way of load balancing, and of managing processes, the system itself, and users.
5. Applications – What software is available for cluster computing, and how can applications be developed?
Performance Factors
• What performance do you need in your Beowulf cluster, and how do you achieve it?
– Suggestion: benchmark your application (or part of it)
• Several system factors affect performance:
– Number of processors
– CPU speed
– Memory size and access times
– Cache size
– Storage architecture and total storage capacity (access, transfer)
– Interconnect (bandwidth, latency)
Why Benchmarks?
• They are instruments to quantify system performance.
• They provide a method for enabling comparative analysis.
• In HPC, benchmarks are used not only to evaluate cluster performance, but also to measure the scalability of the compute nodes for computation- and/or communication-intensive applications.
• Benchmark results can provide relevant insight for determining which systems will best suit the needs of particular applications.
The Linux Benchmarking Project: http://www.tux.org/bench/
Application- and Architecture-Specific HPC Benchmarks
• Application specific
– focus on the performance of different types of parallel applications
– test the HPC system as a whole
– provide users with multiple aspects of the system performance
• Architecture specific, targeting subsystems like:
– the memory subsystem,
– the processor's integer arithmetic and floating point units,
– the interconnect communication subsystem.
• What to do …
– Identify benchmarks that share similar characteristics with, and are comparable to, the user application.
– Although the end application would be the ideal benchmark for that particular system, several industry-standard HPC benchmarks can also be used.
Classes of Benchmark
• Serial Benchmarks
– For assessing the performance of a single-processor machine by measuring the performance rate (Mflop/s), memory bandwidth (MB/s), and the execution time of the application.
• Parallel Benchmarks
– For measuring the performance of the inter-processor communications by measuring the latency and bandwidth of the parallel communications, and the time the application takes to execute.
• Serial and parallel benchmarks are subdivided into 2 categories:
1. Kernels: to measure basic machine parameters
2. Real applications: to evaluate the performance of the machine as a whole (for that particular application)
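As a concrete illustration of the rate metric used by serial kernel benchmarks (a sketch of our own, not one of the standard suites), Mflop/s is simply floating-point operations divided by wall-clock time:

```python
import time

def daxpy(a, x, y):
    # y <- a*x + y: the classic Linpack-style kernel,
    # 2 floating-point operations (multiply + add) per element
    return [a * xi + yi for xi, yi in zip(x, y)]

def mflops(n=200_000, reps=10):
    a, x, y = 2.0, [1.0] * n, [3.0] * n
    t0 = time.perf_counter()
    for _ in range(reps):
        y = daxpy(a, x, y)
    elapsed = time.perf_counter() - t0
    flops = 2.0 * n * reps          # total floating-point operations
    return flops / elapsed / 1e6    # rate in Mflop/s

print(f"{mflops():.1f} Mflop/s")
```

A real kernel benchmark (LFK, Linpack) does this in C or Fortran; interpreted Python rates fall far below the hardware's capability, but the metric is computed the same way.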
Some Standard Benchmarks for HPC
• Kernels and Industry Standard Benchmarks
– Livermore Fortran Kernel (LFK) (http://www.llnl.gov/asci_benchmarks/asci/limited/lfk/asci_lfk.html)
– Linpack. Solves a dense matrix. (http://www.netlib.org/benchmark/hpl/)
– nbench (http://www.byte.com/bmark/faqbmark.htm)
– NPB (NAS Parallel Benchmarks), renamed to PBN in 1999 (http://www.nas.nasa.gov/Research/Tasks/pbn.html)
– Stream (and Stream OpenMP). http://www.cs.virginia.edu/stream/
• Communications
– Pallas MPI Benchmark (PMB) (http://www.pallas.com/e/products/pmb/)
– EFF_BW, a spin-off from PMB
• Applications (Chemistry)
– GAMESS-UK (http://www.dl.ac.uk/CFS/benchmarks/gamess_uk.html)
– DL_POLY
Benchmarks from the Standard Performance Evaluation Corporation
• SPEChpc2002 features:
– Derived from real HPC applications and application practices; measures the overall performance of high-end computer systems, including the computer's processors (CPUs), the interconnection system (shared or distributed memory), the compilers, the MPI and/or OpenMP parallel library implementation, and the input/output system.
– Parallelism supported: serial, OpenMP, MPI, or combined MPI-OpenMP
– Architectures supported: shared memory, distributed memory, clusters
– IO and communication included in the benchmarks
– Implemented under SPEC tools for building applications, running with different data sets, verifying results, and measuring runtime.
• SPEC CHEM2002
– Based on a quantum chemistry application called GAMESS; has performance metrics SPECchemM2002 and SPECchemS2002.
• CPU2000
– SPECint2000, SPECfp2000, SPECint_base2000, SPECint_rate2000, etc. Integer and floating point CPU-intensive benchmarks.

http://www.specbench.org/
The Beowulf Performance Suite (BPS)
• Developed by Paralogic (http://www.plogic.com/bps/)
– A graphical front end to a series of commonly used performance tests for parallel computers. It was designed explicitly for the Beowulf class of computers running Linux, and the suite is packaged as an easy-to-install .rpm file.
– Can be run from the command line by invoking bps, or with the GUI by invoking xbps.
– Generates all test results and graphs (using gnuplot) in .html format, so documentation can be maintained online.
Beowulf Basic Architecture
SMP or Cluster? … (figure: the Brahma cluster logo at Duke)
Design: Layered Model of a Beowulf Computer Cluster
1. CPU Architectures and Hardware Platforms
2. Network Hardware and Communication Protocols
3. Operating System (Linux)
4. Middleware (Single System Image)
5. High Level Programming Languages (C, Fortran)
6. Parallel Applications and Utilities (for HPC)
1. CPU Architectures and Hardware Platforms
Execution Nodes: Choice of Processor
• Two (major) families:
1. Intel x86 [32-bit] compatible (e.g., Pentium)
– Considered commodity systems because there are multiple sources (Intel, AMD, Cyrix); obviously ubiquitous
– Best integer performance
– Available native software
– New Itanium (64-bit)
2. Compaq Alpha systems [64-bit, formerly DEC]
– Performance winner (higher bandwidth to memory and network, best floating point performance)
– Hard to source at a good price
• Others: PowerPC, Sun SPARC
– Limited support and accompanying software distributions available for these architectures

RULE OF THUMB: choose the second-to-last generation ($)
Processors: Intel Compatible, things to look for …
• Core frequency
– Watch out! Different architectures deliver different performance at the same clock
• Front Side Bus (FSB) speed (up to 800 MHz)
• Integer performance (http://www.spec.org/)
• Floating point performance (http://www.spec.org/)
• Hyper-Threading Technology
– Multiple software threads on each processor in the system
• Internal cache
– L1: Execution Trace Cache
– L2: Advanced Transfer Cache
– L3: for large data sets
• Price-performance; some comparison charts available at http://www.tomshardware.com/
– The Pentium 4 1.8 GHz costs 78% more than the 1.7 GHz model
– The Pentium III 1.1 GHz costs 75% more than the Pentium III 1 GHz
• SMP support
Performance: Memory and Locality of Reference
• The success of a memory hierarchy rests upon assumptions that are critical to achieving the appearance of a large, fast memory.
• There are three components of locality of reference, which coexist in an active process:
– Temporal – A tendency for a process to reference in the near future the instructions or data referenced in the recent past. Program constructs that lead to this behavior are loops, temporary variables, and process stacks.
– Spatial – A tendency for a process to reference instructions or data at locations in the virtual address space near that of the last reference.
– Sequentiality – The principle that if the last reference was r_j(t), then there is a likelihood that the next reference is to the immediate successor of the element r_j(t).
• It should be noted that each process exhibits an individual characteristic with respect to the three types of locality.
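Spatial locality can be made concrete with a small example of our own (not from the slides): traversing a row-major 2-D array in row order touches memory sequentially, while column order strides across rows. In a C implementation the row-order walk is typically much faster thanks to cache-line reuse; in interpreted Python the cache effect is muted, so only the access patterns and results are checked here:

```python
import time

N = 500
grid = [[1.0] * N for _ in range(N)]  # stored as a list of rows (row-major)

def sum_row_order(g):
    # Sequential access within each row: good spatial locality
    return sum(v for row in g for v in row)

def sum_col_order(g):
    # Strided access down each column: poor spatial locality
    n = len(g)
    return sum(g[i][j] for j in range(n) for i in range(n))

t0 = time.perf_counter(); s_row = sum_row_order(grid); t1 = time.perf_counter()
s_col = sum_col_order(grid); t2 = time.perf_counter()
print(s_row == s_col)  # same result either way; only the traversal order differs
```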
Processors: Intel vs. AMD
SOURCE: http://www.tomshardware.com/
H J Curnow and B A Wichmann, "A Synthetic Benchmark", Computer Journal Vol 19, No 1 1976
Processors: Intel vs AMD

System | SPECint95 | SPECfp95 | GAMESS-UK time (s) | SPECfp95 performance (rel.) | GAMESS-UK (rel.)
Athlon K7/500 | 22.8 | 18.7 | 26.9 | 1 | 1
Athlon K7/600 | 27.2 | 21.6 | 17.2 | 1.16 | 1.56
Athlon K7/650 | 29.3 | 22.6 | 14.4 | 1.21 | 1.87
Athlon K7/850 | 36.9 | 27.4 | 11.5 | 1.47 | 2.34
Athlon K7/1000 | 42.9 | 29.4 | 11.1 | 1.57 | 2.42
PIII/733 | 35.7 | 28.3 | 13.0 | 3.07 | 2.86
PIII/750 | 36.5 | 28.3 | — | 3.08 | 2.86
PIII/800 | 38.4 | 28.9 | 14.9 | 3.14 | 2.5
PIII/866 | 40.4 | 30.1 | — | 3.27 | —
PIII/1000 | 46.8 | 32.2 | — | 3.5 | —
(Relative columns are indexed to the PII/300 or the K7/500.)

These programs deliver their results indexed to a standardized baseline system (a 40-MHz Sun SparcStation 10) scoring 1.0. In other words, if the SPECint95 result is 5.0, then the tested system is five times faster at integer tasks than the baseline system. SPEC95 has been replaced by CPU2000 (SPECint2000 and SPECfp2000).
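The GAMESS-UK relative-performance figures can be recomputed directly from the runtimes in the table, as relative = baseline runtime / system runtime, with the Athlon K7/500 (26.9 s) as baseline:

```python
# Recompute the GAMESS-UK relative-performance column from the table above.
baseline_s = 26.9  # Athlon K7/500 runtime in seconds
times_s = {"K7/600": 17.2, "K7/650": 14.4, "K7/850": 11.5, "K7/1000": 11.1}
for system, t in times_s.items():
    print(system, round(baseline_s / t, 2))
# K7/600 -> 1.56, K7/650 -> 1.87, K7/850 -> 2.34, K7/1000 -> 2.42
```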
Processors: Intel/AMD
(figure: dedicated bandwidth vs. shared bandwidth)
Processors Price-Performance: Intel/AMD

Intel (@2 GHz) | Price | AMD (@2 GHz) | Price | %
P4 | $189 | Athlon | $89 | 212
P4 | $183 | Athlon XP | $204 | 89
Xeon | $136 | Athlon MP | $123 | 110
Itanium | $1,500 | Opteron | $700 | 214
Source: http://www.pricewatch.com/ 12/21/2003

• Athlon MPs have 3 FPUs and a higher peak FLOP rate
• The P4 with the highest clock rate beats out the Athlon MP with the highest clock rate in real FLOP rate
• Athlon MPs: higher real FLOP per $, more popular
• P4 supports the SSE2 instruction set, which performs SIMD operations on double precision data (2 x 64-bit)
• Athlon MP supports only SSE, for single precision data (4 x 32-bit)
Source: http://www.tomshardware.com
Motherboards: Single-Dual-Quad?
• Motherboard performance depends heavily on the
– Chipset ($)
– Support for SMP (independent data paths for processors, power)
– Maximum operating frequency
– Layout
• Symmetric Multiprocessing (SMP)
– Good for multithreaded applications (but MP libraries are not thread safe)
– AMD's support for SMP is recent; a nice change for Opteron
– The performance increase might not be worth the price
• Memory type and capacity (large enough to avoid disk swapping)
• Overclocking ($)

An important part of increasing performance has to do with chipsets and memory technology. Advertising continues to give you very little of the crucial information that you need in order to be able to evaluate the performance of a motherboard in a complete system. It's not just factors like the speed of the FSB or the memory that play a role here; chipsets can operate at very different speeds even with memory types that are seemingly the same.

About memory:
• Choose grade-A manufacturers: Micron Technology, Rambus, PNY, Kingston, Corsair, LG, Hyundai, Mushkin, and Samsung
• SIMM (EDO) requires pairs; DIMM (SDRAM and DDR) doesn't
• DDR RAM transfers data on both clock edges, doubling the effective frequency
• For P4, use DDR over RDRAM ($)

Source: http://www.hardwareanalysis.com/content/reviews/article/1511.4/
Storage: Hard Disks
(figure: IBM 34GXP drive)
• General criteria:
– Access time = command overhead time + seek time + settle time + latency
– Latency = 30000 / spindle speed (ms, with spindle speed in RPM)
– Rotational speed
– Buffer (cache)
– Size (the smaller the better)
– Capacity
– Benchmark: Bonnie++ (http://www.coker.com.au/bonnie++/)
• Central and individual nodes
• Disk type options:
– IDE/EIDE
· Cheaper
· Seek times: 8.0–12 ms
· Built-in controllers
– SCSI
· Good expansion capability: RAID (Redundant Array of Inexpensive Disks)
· Seek times: <8.5 ms
· Some motherboards have built-in controllers
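The latency formula above is just half of one spindle revolution, expressed in milliseconds:

```python
def rotational_latency_ms(rpm):
    # Average rotational latency = half a revolution:
    # (60000 ms per minute) / rpm / 2 = 30000 / rpm
    return 30000.0 / rpm

for rpm in (5400, 7200, 10000, 15000):
    print(rpm, "RPM ->", round(rotational_latency_ms(rpm), 2), "ms")
# e.g. 7200 RPM -> 4.17 ms, 15000 RPM -> 2.0 ms
```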
Execution Nodes
• Rack mount, or desktop/tower? The choice depends on space, cost, and migration/reusability:
– Rack-mount solutions save space, but at a higher cost (ideal for large systems)
– Desktop or tower cases are cheap, but consume more power (heat) and are messy and big (space)
– If you go for the rack, you are stuck with the hardware for its entire life (difficult to scale given changing form factors)
– If you go for standard cases, you can push the cluster hardware to office or lab use after a short period and update the cluster
• Power distribution (neat, but not necessary for small systems)
– Highly desirable to have network-addressable power distribution units
– Can remotely power-cycle compute nodes
– Instrumented, which helps determine power needs
(figure: power distribution unit with Ethernet port and power sockets; $1,555 for a 16-port Sentry R016-1-1-1PT)

Don't install the cluster in your office (loud, messy and hot)
Design: Layered Model of a Beowulf Computer Cluster
1. CPU Architectures and Hardware Platforms
2. Network Hardware and Communication Protocols
3. Operating System (Linux)
4. Middleware (Single System Image)
5. High Level Programming Languages
6. Parallel Applications and Utilities
Interconnection Network
Interconnect: Selection Criteria
n Some design considerations for the selection of a node interconnect in a Beowulf cluster:
– Linux support: yes/no; kernel driver or library (kernel drivers are preferred)
– Maximum bandwidth: the higher the better
– Minimum latency: the lower the better (small network diameter)
– Available as: single-vendor / multiple-vendor hardware
– Interface port/bus used: high performance, included as a standard node port, with bandwidth matched to the dedicated node network fabric
– Network structure: bus / switched / topology
– Cost per machine connected: the lower the better
– Scalability
– PRICE!
n Switched, full-duplex Ethernet is the most commonly used network in Beowulf systems, and gives almost the same performance as a fully meshed network (at significantly lower cost, thanks to decreasing prices for high-speed Ethernet). Switched Ethernet provides dedicated bandwidth between any two nodes connected to the switch. If higher inter-node bandwidth is required, we can use channel bonding ('ifenslave', *NASA) to connect multiple channels of Ethernet to each node.
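As a rough sketch of the channel-bonding idea above (interface names and the address are hypothetical, and the exact steps depend on the kernel's bonding module version):

```shell
# Hypothetical two-channel bonding setup, run as root on each node.
# Assumes the Linux bonding module and the 'ifenslave' tool are installed,
# and that eth0 and eth1 are the two NICs dedicated to the cluster network.
modprobe bonding mode=0                            # 0 = balance-rr: stripe packets round-robin
ifconfig bond0 192.168.1.10 netmask 255.255.255.0 up
ifenslave bond0 eth0 eth1                          # enslave both NICs to the bond0 master
```

With both channels bonded, large messages see close to twice the single-link bandwidth, at the cost of a second NIC and switch port per node.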
Node Interconnection Technologies Supported in Linux

COTS
n CAPERS (Cable Adapter for Parallel Execution and Rapid Synchronization)
n 10Mb Ethernet
n 100Mb Ethernet (Fast Ethernet)
n 1000Mb Ethernet (Gigabit Ethernet)
n 10G Ethernet (10 Gigabit Ethernet)
n PLIP (Parallel Line Interface Protocol)
n SLIP (Serial Line Interface Protocol)
n USB (Universal Serial Bus)

Vendor Specific
n Myrinet (http://www.myri.com/)
n Parastation (http://wwwipd.ira.uka.de/parastation)
n Quadrics
n ArcNet (token-based protocol, http://www.arcnet.com/)
n ATM, Asynchronous Transfer Mode (http://lrcwww.epfl.ch/linux-atm/)
n SCSI (Small Computer Systems Interconnect)
n SHRIMP (Scalable, High-Performance, Really Inexpensive Multi-Processor)
Network Hardware: COTSNetwork Hardware: COTS
$5$5BusBusUSBUSBN.A.N.A.12Mb/s12Mb/sKernel driversKernel driversUSBUSB
$2$2Cable between Cable between 2 nodes2 nodes
RS232CRS232C1,0000 1,0000 µµss0.1Mb/s0.1Mb/sKernel driversKernel driversSLIPSLIP
$2$2Cable between Cable between 2 nodes2 nodes
SPPSPP1,0000 1,0000 µµss1.2Mb/s1.2Mb/sKernel driversKernel driversPLIPPLIP
N.A.N.A.Switch or Switch or FDRsFDRsPCIPCI--XXN.A.N.A.10,000Mb/s10,000Mb/sKernel driversKernel drivers10Gb Ethernet10Gb Ethernet
$2,500*$2,500*Switch or Switch or FDRsFDRsPCIPCI300 300 µµss1,000Mb/s1,000Mb/sKernel driversKernel drivers1,000Mb 1,000Mb EthernetEthernet
$400 *$400 *Switch, hub or Switch, hub or hublesshubless busbus
PCIPCI80 80 µµss100Mb/s100Mb/sKernel driversKernel drivers100Mb Ethernet100Mb Ethernet
$150 * $150 * ($100 ($100 hublesshubless))
Switch, hub or Switch, hub or hublesshubless busbus
PCIPCI100 100 µµss10Mb/s10Mb/sKernel driversKernel drivers10Mb Ethernet10Mb Ethernet
$2$2Cable between Cable between 2 nodes2 nodes
SPPSPP2 2 µµss1.2Mb/s1.2Mb/sAPI LibraryAPI LibraryCAPERSCAPERS
Cost per Node Cost per Node Network Network StructureStructure
Interface Interface port/bus usedport/bus used
Minimum Minimum LatencyLatency
Maximum Maximum BandwidthBandwidth
Linux SupportLinux SupportTechnologyTechnology
*multiple vendor hardwareFDR: Full Duplex RepeatersFDR: Full Duplex Repeaters
Before you buy consult latest list of supported drivers @ http://www.scyld.com/network_index.html
Network Hardware: Vendor*Network Hardware: Vendor*
$1,479*$1,479*SwitchedSwitched hubshubsPCIPCI5 5 µµs *MPIs *MPI350MB/s350MB/s--900MB/s*1900MB/s*1
KernelKernel driversdrivers andandlibrarieslibraries
QuadricsQuadricsQsNetII*QsNetII*
N.A.N.A.Mesh backplane Mesh backplane *ala Paragon*ala Paragon
EISAEISA5 5 µµss180Mb/s180Mb/sUserUser--level memory level memory mapped interfacemapped interface
SHRIMPSHRIMP
N.A.N.A.Inter node bus Inter node bus sharing SCSI devicessharing SCSI devices
PCI, EISA, ISAPCI, EISA, ISAN.A.N.A.5Mb/s 5Mb/s –– 20Mb/s20Mb/sKernel driversKernel driversSCSISCSI
$3,000$3,000Switched hubsSwitched hubsPCIPCI120 120 µµss155Mb/s 155Mb/s *(1,200Mb/s)*(1,200Mb/s)
Kernel driversKernel driversATMATM
$200$200UnswitchedUnswitched hub or hub or bus *ringbus *ring
ISAISA1,000 1,000 µµss2.5Mb/s2.5Mb/sKernel driversKernel driversArcnetArcnet
>$1,000>$1,000HublessHubless meshmeshPCIPCI2 2 µµss125Mb/s125Mb/sHAL or socket HAL or socket librarylibrary
ParastationParastation
$1,200+$1,200+Switched hubsSwitched hubsPCIPCI6.36.3 µµss248MB/s hd 248MB/s hd --489MB/s fd*1489MB/s fd*1
LibraryLibraryMyrinetMyrinetPCIPCI--XX
Cost per Cost per Node Node
Network StructureNetwork StructureInterface Interface port/bus usedport/bus used
Minimum Minimum LatencyLatency
Maximum Maximum BandwidthBandwidth
Linux SupportLinux SupportTechnologyTechnology
* Currently Supporting Linux
Sustained
What you need to know about the Network Interconnect
n Don't buy a hub!
n Store-and-Forward
– Copies incoming packets to memory
– Delivers a packet when it can arbitrate the transfer across the switch
n Non-Blocking
– Can process incoming packets without storing them
n Buy a non-blocking switch with the second-to-largest port density available ($); it will permit future scaling of the Beowulf
Example Configuration ($)

| Part                                                                                                      | No. | US$   | Total    |
|-----------------------------------------------------------------------------------------------------------|-----|-------|----------|
| 24-port switch Cisco 2950SX-24                                                                            | 1   | 1,479 | 1,479    |
| Intel SE7505VB2 Dual Xeon motherboard (1 CPU, 8MB video, Ethernet 100/1000, Serial Ultra ATA, SCSI opt.)  | 17  | 668   | 11,356   |
| PC cases (300W)                                                                                           | 17  | 20    | 340      |
| RAM DIMM DDR266 256MB                                                                                     | 18  | 56.93 | 1,024.7  |
| 10/100Mb Ethernet adapter (3Com 3C905C)                                                                   | 18  | 10    | 180      |
| HD SCSI ST150176LC 10Krpm 100GB 8.2ms (otherwise use Ultra ATA)                                           | 17  | 50    | 850      |
| 1.44MB FD                                                                                                 | 17  | 7     | 119      |
| SVGA adapter GeForce FX 5900 128MB                                                                        | 1   | 179   | 179      |
| CDROM 56X                                                                                                 | 1   | 18    | 18       |
| Keyboard                                                                                                  | 1   | 10    | 10       |
| 21" Multisync monitor SONY FDCPD-G520                                                                     | 1   | 399   | 399      |
| Monitor/keyboard switch boxes, keyboard and video extensions, network cables (CAT5)                       |     |       | 250      |
| GRAND TOTAL                                                                                               |     |       | 16,204.7 |

Source: http://www.pricewatch.com, 12/22/2003
Example Configuration… on the cheap sideExample Configuration… on the cheap side
1700170010010017171.5MHz P41.5MHz P4
4,9184,918GRAND TOTALGRAND TOTAL
250250Monitor/Keyboard switch boxes, keyboard and video extensions, neMonitor/Keyboard switch boxes, keyboard and video extensions, network twork cables (CAT5),cables (CAT5),
1501501501501117” 17” MultisyncMultisync monitormonitor
1010101011KeyboardKeyboard
1818181811CDROM 56XCDROM 56X
2525252511SVGA Adapter 64MSVGA Adapter 64M
1191197717171.44MB FD1.44MB FD
6466463838171720GB EIDE HDD20GB EIDE HDD
909055181810/100Mb Ethernet Adapter10/100Mb Ethernet Adapter
68068020203434128MB SDRAM DIMM128MB SDRAM DIMM
34034020201717PC cases (300W)PC cases (300W)
85085050501717MotherboardMotherboard
40404040112424--port switchport switch
TotalTotalUS$US$No.No.PartPart
Source: http://www.pricewatch.com/12/22/2003
Design: Layered Model of a Beowulf Computer Cluster
1. CPU Architectures and Hardware Platforms
2. Network Hardware and Communication Protocols
3. Operating System (Linux)
4. Middleware (Single System Image)
5. High Level Programming Languages
6. Parallel Applications and Utilities
Linux (not proprietary and free)
n Strictly speaking, Linux (introduced by Linus Torvalds in 1991) is an operating system kernel:
– Controls all devices
– Manages resources
– Schedules user processes
n The kernel, together with software from the GNU project and others, forms a usable operating system. Very robust.
n Supports different network protocols, full SMP since v2.1, POSIX compliance, true multitasking, virtual memory, shared libraries, demand loading, …
n Runs on all Intel x86s, Alpha, PPC, Sparc, Motorola 68k, MIPS, ARM, HP-PA RISC, and more
n It is UNIX-like, but not UNIX: a rewrite based on published POSIX standards
n Most common distribution for Beowulf: Red Hat (http://www.redhat.com)
– Includes RPM (Red Hat Package Manager)
– Others: Debian, Slackware, …
Design: Layered Model of a Beowulf Computer Cluster
1. CPU Architectures and Hardware Platforms
2. Network Hardware and Communication Protocols
3. Operating System (Linux)
4. Middleware (Single System Image)
5. High Level Programming Languages
6. Parallel Applications and Utilities
Middleware
n A software layer added on top of the operating system to provide what is known as a Single System Image (SSI).
n Provides uniform access to different nodes on a cluster, regardless of the operating system running on a particular node.
n Is responsible for providing high availability, by means of load balancing and responding to failures in individual components.
Middleware: Desirable Objectives for Cluster Services and Functions
1. Single Entry Point – the user logs onto the cluster rather than onto an individual computer.
2. Single File Hierarchy – the user sees a single file directory hierarchy under the same root directory.
3. Single Control Point – there is a default workstation used for cluster management and control, usually known as the server.
4. Single Virtual Networking – any node can access any other point in the cluster, even though the actual cluster configuration may consist of multiple interconnected networks.
5. Single Memory Space – distributed shared memory enables programs to share variables.
6. Single Job-Management System – under a cluster job scheduler, a user can submit a job without specifying the host computer to execute it.
7. Single User Interface – a common graphic interface supports all users, regardless of the workstation from which they enter the cluster.
8. Single I/O Space – any node can remotely access any I/O peripheral or disk device without knowledge of its physical location.
9. Single Process Space – a uniform process-identification scheme is used; a process on any node can create or communicate with any other process on a remote node.
10. Checkpointing – periodically saves the process state and intermediate computing results, to allow rollback recovery after a failure.
11. Process Migration – enables load balancing.
Source: Richard Morrison, Cluster Computing
Middleware: Requirements for System Software and Tools for HPC
ü Robust node OS (easy configuration, installation, boot)
n Parallel programming API
n Application package development
n Parallel file systems
n System administration and management
n Job and process scheduling
n Parallel debugging and performance monitoring
n Checkpoint and restart
Middleware …
n Parallel Programming API
– Beowulfs are independent computers connected via a communication network, and need to pass messages
– Need to ensure a degree of portability across different nodes (e.g. 32-bit and 64-bit machines)
– PVM (ORNL) and MPI (a standard)
– MPI provides more functionality (controlled by the MPI Forum)
– PVM contains fault-tolerance features
– OpenMP (compiler)
n Supports multi-platform shared-memory programming. OpenMP is a portable, scalable model that gives shared-memory programmers a simple and flexible interface for developing parallel applications for platforms ranging from the desktop to the supercomputer.
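To make the message-passing workflow concrete, here is a minimal sketch of writing, compiling, and launching an MPI program on a finished cluster. It assumes an MPI implementation (e.g. MPICH or LAM) is installed and that a `machines` file listing the node hostnames exists; the file and program names are hypothetical:

```shell
# Write a minimal MPI "hello" program (each rank reports its identity).
cat > hello.c <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);                  /* start the MPI runtime        */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this process's rank (id)     */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of processes    */
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
EOF

mpicc -O2 -o hello hello.c                       # mpicc wraps cc with MPI flags/libs
mpirun -np 4 -machinefile machines ./hello       # launch 4 ranks across the nodes
```

The same binary runs unchanged whether the four ranks land on one SMP node or on four separate nodes; MPI hides the transport.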
MPI: Message Passing Interface
Why MPI over PVM?
1. MPI has several freely available, quality implementations
2. MPI defines a 3rd-party profiling mechanism
3. MPI has full asynchronous communication
4. MPI groups are solid, efficient, and deterministic
5. MPI efficiently manages message buffers
6. MPI synchronization protects the user from 3rd-party software
7. MPI can efficiently program MPPs and clusters
8. MPI is highly portable
9. MPI is formally specified
10. MPI is a standard

MPI-2
• Parallel I/O
• Remote memory operations (put, get)
• Dynamic process management
• Support for threads (POSIX)
Middleware …
n Application Development Packages
– Large base of serial code
– Abstraction vs. efficiency problem
– Parallel programs must last longer than parallel machines
– Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering, Ian T. Foster, http://www-unix.mcs.anl.gov/dbpp/
n Parallel file systems (Parallel Virtual File System, PVFS and PVFS2)
– Clemson University, 1993
– Objective: high-throughput file system
– Strategy:
n Exploit parallelism of bandwidth
n Provide a user interface so that applications can make powerful requests, such as a large collection of non-contiguous data with a single request for multidimensional data sets; allow applications direct access to servers without going through the kernel
– Characteristics:
n N clients and N servers
n A single file is spread across multiple disks and nodes, and accessed by multiple tasks
n The actual distribution of a file is configurable on a file-by-file basis
http://www.parl.clemson.edu/pvfs2/
Middleware …
n System administration and management
– Aspen Beowulf Cluster Management Software (ABC) http://www.aspsys.com/software/abc/
– Scyld http://www.scyld.com/products.html
– Ganglia (distributed monitoring and execution system) http://ganglia.sourceforge.net/
n Job and process scheduling
– Condor
– Maui (SMP-enabled) http://supercluster.org/maui/
– PBS (OpenPBS), a workload management system http://www.openpbs.org/
n Coordinates resource-utilization policy and user job requirements
n Multiple users, multiple jobs, multiple nodes
n Functionality:
– Manages parallel job execution (MPI, MPL, PVM, HPF)
– Interactive and batch cross-system scheduling
– Security and access-control lists
– Dynamic distribution and automatic load-leveling of workload
– Job and user accounting
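As an illustration of scheduling without naming hosts, a minimal OpenPBS batch script might look like the sketch below; the job name, resource counts, and program name are hypothetical:

```shell
#!/bin/sh
# Hypothetical OpenPBS batch script, submitted with:  qsub job.pbs
#PBS -N md_run                  # job name (hypothetical)
#PBS -l nodes=4:ppn=2           # request 4 nodes, 2 processors per node
#PBS -l walltime=02:00:00       # wall-clock limit for the job
#PBS -j oe                      # merge stdout and stderr into one file

cd $PBS_O_WORKDIR               # PBS starts jobs in $HOME by default
# $PBS_NODEFILE lists the hosts PBS allocated to this job
mpirun -np 8 -machinefile $PBS_NODEFILE ./md_run
```

The scheduler, not the user, decides which 4 of the cluster's nodes run the job, which is exactly the "single job-management system" objective listed earlier.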
Middleware …
n Cluster Software – Solvers (Libraries)
– BLAS (Basic Linear Algebra Subprograms)
n High-quality "building block" routines for performing basic vector and matrix operations. http://www.netlib.org/blas/
– FFTW
n A C subroutine library for computing the Discrete Fourier Transform (DFT) in one or more dimensions, of both real and complex data, and of arbitrary input size. http://www.fftw.org/
– LMPI
n A library for post-mortem analysis of the communication behavior of parallel MPI programs. http://www.lrz-muenchen.de/services/software/parallel/lmpi/
– METIS & ParMETIS
n A family of programs for partitioning unstructured graphs and hypergraphs and computing fill-reducing orderings of sparse matrices. http://www-users.cs.umn.edu/~karypis/metis/
– MpCCI
n Mesh-based parallel Code Coupling Interface, a code-coupling interface for multidisciplinary applications. http://www.mpcci.org/
– NetCDF
n network Common Data Form: an interface for array-oriented data access and a library that provides an implementation of the interface. http://www.unidata.ucar.edu/packages/netcdf/
– Numerical Python
n Adds a fast, compact, multidimensional array language facility to Python. http://www.pfdubois.com/numpy/
Middleware …
n Cluster Software – Solvers (Libraries) … continued
– PARASOL
n ParaSol is a parallel discrete-event simulation system that supports optimistic and adaptive synchronization methods. http://www.cs.purdue.edu/research/PaCS/parasol.html
– PETSc
n A suite of data structures and routines for the scalable (parallel) solution of scientific applications modeled by partial differential equations; employs MPI. http://www-fp.mcs.anl.gov/petsc/
– PLAPACK (Parallel Linear Algebra Package)
n Coding parallel algorithms is generally regarded as a formidable task; PLAPACK is an infrastructure for coding such algorithms at a high level of abstraction. http://www.cs.utexas.edu/users/plapack/
– PSPASES (Parallel SPArse Symmetric dirEct Solver)
n A high-performance, scalable, parallel, MPI-based library for solving linear systems of equations involving sparse symmetric positive definite matrices. http://www-users.cs.umn.edu/~mjoshi/pspases/
– ScaLAPACK (Scalable LAPACK)
n Includes a subset of LAPACK routines redesigned for distributed-memory MIMD parallel computers. It is currently written in an SPMD style using message passing for interprocessor communication. http://www.netlib.org/scalapack/scalapack_home.html
– VTK (Visualization ToolKit)
n An open-source, freely available software system for 3D computer graphics, image processing, and visualization, used by thousands of researchers and developers around the world. http://public.kitware.com/VTK/
Beowulf Toolkits (free)
n OSCAR (Open Source Cluster Application Resource)
– www.openclustergroup.org
n ROCKS
– http://www.rocksclusters.org
– http://rocks.npaci.edu