Mateti Cluster Intro
TRANSCRIPT
8/2/2019
-
Introduction to Cluster Computing
Prabhaker Mateti
Wright State University
Dayton, Ohio, USA
-
Overview
High performance computing
High throughput computing
NOW, HPC, and HTC
Parallel algorithms
Software technologies
-
Parallel Computing
Traditional supercomputers
  SIMD, MIMD, pipelines
  Tightly coupled shared memory
  Bus level connections
  Expensive to buy and to maintain
Cooperating networks of computers
-
Traditional SupercomputersTraditional Supercomputers
Very high starting costVery high starting costExpensive hardwareExpensive hardware
Expensive softwareExpensive software
High maintenanceHigh maintenance
Expensive to upgradeExpensive to upgrade
-
Traditional Supercomputers
No one is predicting their demise, but ...
-
Computational Grids
are the future
-
Computational Grids
Grids are persistent environments that enable software applications to integrate instruments, displays, computational and information resources that are managed by diverse organizations in widespread locations.
-
Computational Grids
Individual nodes can be supercomputers, or NOW
High availability
Accommodate peak usage
LAN : Internet :: NOW : Grid
-
NOW Computing
Workstation
Network
Operating System
Cooperation
Distributed+Parallel Programs
-
Workstation Operating System
Authenticated users
Protection of resources
Multiple processes
Preemptive scheduling
Virtual memory
Hierarchical file systems
Network centric
-
Network
Ethernet
  10 Mbps: obsolete
  100 Mbps: almost obsolete
  1000 Mbps: standard
Protocols
  TCP/IP
-
Cooperation
Workstations are personal
Use by others
  slows you down
  increases privacy risks
  decreases security
Willing to share
Willing to trust
-
Distributed ProgramsDistributed Programs
Spatially distributed programsSpatially distributed programsA part here, a part there, A part here, a part there,
ParallelParallel
SynergySynergy Temporally distributed programsTemporally distributed programs Finish the work of your great grand fatherFinish the work of your great grand father
Compute half today, half tomorrowCompute half today, half tomorrow
Combine the results at the endCombine the results at the endMigratory programsMigratory programs Have computation, will travelHave computation, will travel
-
SPMD
Single program, multiple data
Contrast with SIMD
Same program runs on multiple nodes
  May or may not be lock-step
  Nodes may be of different speeds
Barrier synchronization
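The SPMD idea can be sketched in a few lines. This is an illustrative Python sketch, not from the slides: the rank count and the sum-of-squares workload are made up. Every process runs the same program on its own slice of the data, and a barrier keeps ranks running at different speeds in step before the results are combined.

```python
import multiprocessing as mp

def worker(rank, barrier, results):
    # Same program on every "node": each rank squares its own data slice.
    data = range(rank * 4, rank * 4 + 4)          # this rank's portion
    results[rank] = sum(x * x for x in data)      # local computation
    barrier.wait()                                # barrier synchronization

def spmd_sum_of_squares(nprocs=4):
    barrier = mp.Barrier(nprocs)
    manager = mp.Manager()
    results = manager.dict()
    procs = [mp.Process(target=worker, args=(r, barrier, results))
             for r in range(nprocs)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # All ranks have passed the barrier; combine the partial results.
    return sum(results[r] for r in range(nprocs))

if __name__ == "__main__":
    print(spmd_sum_of_squares())   # sum of squares of 0..15 -> 1240
```

The ranks need not run in lock-step; the barrier only forces them to meet at one point, which is the weaker guarantee the slide contrasts with SIMD.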
-
Conceptual Bases of Distributed+Parallel Programs
Spatially distributed programs
  Message passing
Temporally distributed programs
  Shared memory
Migratory programs
  Serialization of data and programs
-
(Ordinary) Shared Memory
Simultaneous read/write access
  Read : read
  Read : write
  Write : write
Semantics not clean
  Even when all processes are on the same processor
Mutual exclusion
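A small illustrative sketch (not from the slides) of why simultaneous write : write access has unclean semantics and how mutual exclusion restores order: `counter += 1` is a read-modify-write, so two threads interleaving inside it can lose updates; the lock serializes the critical section.

```python
import threading

counter = 0
lock = threading.Lock()

def increment(n):
    global counter
    for _ in range(n):
        # Without the lock, 'counter += 1' (a read, an add, a write) can
        # interleave between threads and lose updates: write : write.
        with lock:
            counter += 1

threads = [threading.Thread(target=increment, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)   # -> 400000 with mutual exclusion; may be less without it
```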
-
Distributed Shared Memory
Simultaneous read/write access by spatially distributed processors
An abstraction layer implemented on top of message passing primitives
Semantics not so clean
-
Conceptual Bases for Migratory Programs
Same CPU architecture
  x86, PowerPC, MIPS, SPARC, ..., JVM
Same OS + environment
Be able to checkpoint, suspend, and then resume computation without loss of progress
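A toy sketch of checkpoint/suspend/resume, assuming Python's pickle for the serialization of data that migratory programs need; the checkpoint file name and the running-sum workload are hypothetical. Progress is saved periodically, so a killed run resumes from the last checkpoint without loss of progress.

```python
import os
import pickle

STATE = "checkpoint.pkl"    # hypothetical checkpoint file name

def long_sum(n):
    """Sum 0..n-1, checkpointing progress so the run can be resumed."""
    if os.path.exists(STATE):                 # resume: restore saved state
        with open(STATE, "rb") as f:
            i, total = pickle.load(f)
    else:                                     # fresh start
        i, total = 0, 0
    while i < n:
        total += i
        i += 1
        if i % 1000 == 0:                     # periodically serialize state
            with open(STATE, "wb") as f:
                pickle.dump((i, total), f)
    if os.path.exists(STATE):
        os.remove(STATE)                      # finished: drop the checkpoint
    return total

if __name__ == "__main__":
    print(long_sum(10_000))   # -> 49995000
```

A real migratory system (e.g., Condor, discussed later) checkpoints the whole process image transparently; this sketch only shows the save-and-resume idea at the application level.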
-
Clusters of WorkstationsClusters of Workstations
Inexpensive alternative to traditionalInexpensive alternative to traditionalsupercomputerssupercomputers
High availabilityHigh availabilityLower down timeLower down timeEasier accessEasier access
Development platform with productionDevelopment platform with production
runs on traditional supercomputersruns on traditional supercomputers
-
May 2008 Mateti-Everything-About-Linux
Cluster Characteristics
Commodity off-the-shelf hardware
Networked
Common home directories
Open source software and OS
Support message passing programming
Batch scheduling of jobs
Process migration
-
Why are Linux Clusters Good?Why are Linux Clusters Good?
Low initial implementation costLow initial implementation cost Inexpensive PCsInexpensive PCs
Standard components and NetworksStandard components and NetworksFree Software: Linux, GNU, MPI, PVMFree Software: Linux, GNU, MPI, PVM
Scalability: can grow and shrinkScalability: can grow and shrink
Familiar technology, easy for user toFamiliar technology, easy for user toadopt the approach, use and maintainadopt the approach, use and maintain
system.system.
-
Example Clusters
July 1999
1000 nodes
Used for genetic algorithm research by John Koza, Stanford University
www.genetic-programming.com/
-
Largest Cluster System
IBM BlueGene, 2007
DOE/NNSA/LLNL
Memory: 73,728 GB
OS: CNK/SLES 9
Interconnect: proprietary
PowerPC 440
106,496 nodes
478.2 TeraFLOPS on LINPACK
-
OS Share of Top 500

OS        Count    Share    Rmax (GF)   Rpeak (GF)   Processors
Linux       426   85.20%      4897046      7956758       970790
Windows       6    1.20%        47495        86797        12112
Unix         30    6.00%       408378       519178        73532
BSD           2    0.40%        44783        50176         5696
Mixed        34    6.80%      1540037      1900361       580693
MacOS         2    0.40%        28430        44816         5272
Totals      500  100.00%      6966169     10558086      1648095

http://www.top500.org/stats/list/30/osfam  Nov 2007
-
8/2/2019 Mate Ti Cluster Intro
27/43
2727
Development of Distributed+Parallel Programs
New code + algorithms
Old programs rewritten in new languages that have distributed and parallel primitives
Parallelize legacy code
-
New Programming Languages
With distributed and parallel primitives
  Functional languages
  Logic languages
  Data flow languages
-
Parallel Programming Languages
based on the shared-memory model
based on the distributed-memory model
parallel object-oriented languages
parallel functional programming languages
concurrent logic languages
-
Condor
Cooperating workstations: come and go.
Migratory programs
  Checkpointing
  Remote I/O
Resource matching
http://www.cs.wisc.edu/condor/
-
Portable Batch System (PBS)
Prepare a .cmd file
  naming the program and its arguments
  properties of the job
  the needed resources
Submit .cmd to the PBS Job Server: qsub command
Routing and Scheduling: the Job Server
  examines .cmd details to route the job to an execution queue
  allocates one or more cluster nodes to the job
  communicates with the Execution Servers (mom's) on the cluster to determine the current state of the nodes
  when all of the needed nodes are allocated, passes the .cmd on to the Execution Server on the first node allocated (the "mother superior")
Execution Server
  will login on the first node as the submitting user and run the .cmd file in the user's home directory
  runs an installation-defined prologue script
  gathers the job's output to the standard output and standard error
  runs an installation-defined epilogue script
  delivers stdout and stderr to the user
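A hypothetical .cmd file of the kind described above. The job name, resource requests, and program are illustrative, not from the slides; the `#PBS` directive syntax is the standard PBS one.

```shell
#!/bin/sh
#PBS -N myjob                 # job property: the job's name
#PBS -l nodes=2:ppn=4         # needed resources: 2 nodes, 4 processors each
#PBS -l walltime=01:00:00     # needed resources: 1 hour wall-clock limit
#PBS -j oe                    # merge stderr into stdout

cd "$PBS_O_WORKDIR"           # PBS starts the job in the user's home directory
./my_program arg1 arg2        # the program and its arguments
```

Submitted with `qsub myjob.cmd`; the Job Server then routes and schedules it, and the merged output is delivered back to the user when the job ends.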
-
TORQUE, an open source PBS
Tera-scale Open-source Resource and QUEue manager (TORQUE) enhances OpenPBS
Fault tolerance
  Additional failure conditions checked/handled
  Node health check script support
Scheduling interface
Scalability
  Significantly improved server-to-MOM communication model
  Ability to handle larger clusters (over 15 TF / 2,500 processors)
  Ability to handle larger jobs (over 2,000 processors)
  Ability to support larger server messages
Logging
http://www.supercluster.org/projects/torque/
-
OpenMP for shared memory
Distributed shared memory API
User gives hints as directives to the compiler
http://www.openmp.org
-
Message Passing Libraries
Programmer is responsible for initial data distribution, synchronization, and sending and receiving information
Parallel Virtual Machine (PVM)
Message Passing Interface (MPI)
Bulk Synchronous Parallel model (BSP)
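A minimal illustration of the programmer's responsibilities listed above, using Python's multiprocessing pipes in place of PVM/MPI (the function names and the summing workload are made up): the parent explicitly distributes the initial data, each worker explicitly receives, computes, and sends back, and the joins provide the synchronization.

```python
import multiprocessing as mp

def worker(conn):
    # Receive this worker's portion of the data (explicit distribution),
    # compute on it, and send the partial result back.
    chunk = conn.recv()
    conn.send(sum(chunk))
    conn.close()

def scatter_sum(data, nworkers=2):
    size = len(data) // nworkers
    pipes, procs = [], []
    for i in range(nworkers):
        parent, child = mp.Pipe()
        p = mp.Process(target=worker, args=(child,))
        p.start()
        parent.send(data[i * size:(i + 1) * size])   # initial data distribution
        pipes.append(parent)
        procs.append(p)
    total = sum(conn.recv() for conn in pipes)       # gather partial results
    for p in procs:
        p.join()                                     # synchronization
    return total

if __name__ == "__main__":
    print(scatter_sum(list(range(10))))   # 0+1+...+9 -> 45
```

PVM and MPI provide the same send/receive/gather pattern across machines, plus collectives and heterogeneity handling that this single-host sketch omits.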
-
BSP: Bulk Synchronous Parallel model
Divides computation into supersteps
In each superstep a processor can work on local data and send messages.
At the end of the superstep, a barrier synchronization takes place and all processors receive the messages which were sent in the previous superstep.
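The superstep cycle can be sketched with threads, per-processor message queues, and a barrier. This is an illustrative sketch, not a real BSP library; the processor count and the pass-to-the-right workload are made up. Each processor computes on local data and sends one message in superstep 1; only after the barrier, in superstep 2, does it read the message that was sent to it.

```python
import queue
import threading

NPROCS = 3
inboxes = [queue.Queue() for _ in range(NPROCS)]   # one message box per processor
barrier = threading.Barrier(NPROCS)
results = [0] * NPROCS

def processor(pid):
    # Superstep 1: work on local data, send a message to the right neighbor.
    local = pid + 1
    inboxes[(pid + 1) % NPROCS].put(local)
    barrier.wait()        # barrier: superstep 1 ends for everyone at once
    # Superstep 2: messages sent in the previous superstep are now available.
    received = inboxes[pid].get()
    results[pid] = local + received

threads = [threading.Thread(target=processor, args=(i,)) for i in range(NPROCS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)   # -> [4, 3, 5]
```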
-
BSP LibraryBSP Library
Small number of subroutines to implementSmall number of subroutines to implementprocess creation,process creation,
remote data access, andremote data access, and
bulk synchronization.bulk synchronization.
Linked to C, Fortran, programsLinked to C, Fortran, programs
-
BSP: Bulk Synchronous Parallel model
http://www.bsp-worldwide.org/
Book: Rob H. Bisseling, Parallel Scientific Computation: A Structured Approach using BSP and MPI, Oxford University Press, 2004, 324 pages, ISBN 0-19-852939-2.
-
PVM and MPI
Message passing primitives
Can be embedded in many existing programming languages
Architecturally portable
Open-sourced implementations
-
Parallel Virtual Machine (PVM)
PVM enables a heterogeneous collection of networked computers to be used as a single large parallel computer.
Older than MPI
Large scientific/engineering user community
http://www.csm.ornl.gov/pvm/
-
Message Passing Interface (MPI)
http://www-unix.mcs.anl.gov/mpi/
MPI-2.0: http://www.mpi-forum.org/docs/
MPICH: www.mcs.anl.gov/mpi/mpich/ by Argonne National Laboratory and Mississippi State University
LAM: http://www.lam-mpi.org/
Open MPI: http://www.open-mpi.org/
-
Kernels Etc Mods for ClustersKernels Etc Mods for Clusters
Dynamic load balancingDynamic load balancing Transparent process-migrationTransparent process-migration Kernel ModsKernel Mods
http://openmosix.sourceforge.net/http://openmosix.sourceforge.net/ http://kerrighed.org/http://kerrighed.org/
http://openssi.org/http://openssi.org/ http://ci-linux.sourceforge.net/http://ci-linux.sourceforge.net/
CLuster Membership Subsystem ("CLMS") andCLuster Membership Subsystem ("CLMS") and Internode Communication SubsystemInternode Communication Subsystem
http://www.gluster.org/http://www.gluster.org/ GlusterFS: Clustered File Storage of peta bytes.GlusterFS: Clustered File Storage of peta bytes. GlusterHPC: High Performance Compute ClustersGlusterHPC: High Performance Compute Clusters
http://boinc.berkeley.edu/http://boinc.berkeley.edu/ Open-source software for volunteer computing and grid computingOpen-source software for volunteer computing and grid computing
Condor clustersCondor clusters
http://openmosix.sourceforge.net/http://openmosix.sourceforge.net/http://kerrighed.org/http://kerrighed.org/http://www.gluster.org/http://www.gluster.org/http://boinc.berkeley.edu/http://boinc.berkeley.edu/http://boinc.berkeley.edu/http://www.gluster.org/http://kerrighed.org/http://openmosix.sourceforge.net/ -
More Information on Clusters
http://www.ieeetfcc.org/ IEEE Task Force on Cluster Computing
http://lcic.org/ a central repository of links and information regarding Linux clustering, in all its forms
www.beowulf.org resources for clusters built on commodity hardware deploying Linux OS and open source software
http://linuxclusters.com/ authoritative resource for information on Linux Compute Clusters and Linux High Availability Clusters
http://www.linuxclustersinstitute.org/ provides education and advanced technical training for the deployment and use of Linux-based computing clusters to the high-performance computing community worldwide
-
References
Cluster Hardware Setup: http://www.phy.duke.edu/~rgb/Beowulf/beowulf_book/beowulf_book.pdf
PVM: http://www.csm.ornl.gov/pvm/
MPI: http://www.open-mpi.org/
Condor: http://www.cs.wisc.edu/condor/