
  • Slide 1/43

    Introduction to Cluster Computing

    Prabhaker Mateti
    Wright State University
    Dayton, Ohio, USA

  • Slide 2/43

    Overview

    High performance computing
    High throughput computing
    NOW, HPC, and HTC
    Parallel algorithms
    Software technologies


  • Slide 4/43

    Parallel Computing

    Traditional supercomputers
    SIMD, MIMD, pipelines
    Tightly coupled shared memory
    Bus level connections
    Expensive to buy and to maintain
    Cooperating networks of computers


  • Slide 6/43

    Traditional Supercomputers

    Very high starting cost
    Expensive hardware
    Expensive software
    High maintenance
    Expensive to upgrade

  • Slide 7/43

    Traditional Supercomputers

    No one is predicting their demise, but ...

  • Slide 8/43

    Computational Grids

    ... are the future.

  • Slide 9/43

    Computational Grids

    Grids are persistent environments that enable software applications to
    integrate instruments, displays, computational and information resources
    that are managed by diverse organizations in widespread locations.

  • Slide 10/43

    Computational Grids

    Individual nodes can be supercomputers, or NOW
    High availability
    Accommodate peak usage
    LAN : Internet :: NOW : Grid

  • Slide 11/43

    NOW Computing

    Workstation
    Network
    Operating System
    Cooperation
    Distributed+Parallel Programs

  • Slide 12/43

    Workstation Operating System

    Authenticated users
    Protection of resources
    Multiple processes
    Preemptive scheduling
    Virtual memory
    Hierarchical file systems
    Network centric

  • Slide 13/43

    Network

    Ethernet
      10 Mbps: obsolete
      100 Mbps: almost obsolete
      1000 Mbps: standard
    Protocols: TCP/IP

  • Slide 14/43

    Cooperation

    Workstations are personal
    Use by others
      slows you down
      increases privacy risks
      decreases security
    Willing to share
    Willing to trust

  • Slide 15/43

    Distributed Programs

    Spatially distributed programs
      A part here, a part there
      Parallel
      Synergy
    Temporally distributed programs
      Finish the work of your great-grandfather
      Compute half today, half tomorrow
      Combine the results at the end
    Migratory programs
      Have computation, will travel

  • Slide 16/43

    SPMD

    Single program, multiple data
    Contrast with SIMD
    Same program runs on multiple nodes
    May or may not be lock-step
    Nodes may be of different speeds
    Barrier synchronization
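
    A minimal SPMD sketch in C using MPI (discussed later in these slides):
    every node runs the same program, behavior differs only by rank, and a
    barrier keeps the nodes in step. The file and variable names are
    illustrative, not from the slides.

        /* spmd.c -- same program on every node, rank-dependent work */
        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char *argv[])
        {
            int rank, nprocs;
            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which node am I?      */
            MPI_Comm_size(MPI_COMM_WORLD, &nprocs); /* how many nodes total? */

            /* Each node works on its own slice of the data at its own speed. */
            printf("node %d of %d: working on my slice\n", rank, nprocs);

            /* Barrier synchronization: no node proceeds until all arrive here. */
            MPI_Barrier(MPI_COMM_WORLD);

            if (rank == 0)
                printf("all %d nodes reached the barrier\n", nprocs);

            MPI_Finalize();
            return 0;
        }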

  • Slide 17/43

    Conceptual Bases of Distributed+Parallel Programs

    Spatially distributed programs: message passing
    Temporally distributed programs: shared memory
    Migratory programs: serialization of data and programs

  • Slide 18/43

    (Ordinary) Shared Memory

    Simultaneous read/write access
      read : read
      read : write
      write : write
    Semantics not clean
      even when all processes are on the same processor
    Mutual exclusion
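
    A small sketch of mutual exclusion over ordinary shared memory with POSIX
    threads; the counter and thread count are illustrative. Without the lock,
    the write : write interleavings have no clean semantics, as the slide notes.

        /* mutex.c -- compile with: cc -pthread mutex.c */
        #include <pthread.h>
        #include <stdio.h>

        static long counter = 0;
        static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

        static void *worker(void *arg)
        {
            for (int i = 0; i < 100000; i++) {
                pthread_mutex_lock(&lock);   /* enter critical section          */
                counter++;                   /* write : write races prevented   */
                pthread_mutex_unlock(&lock); /* leave critical section          */
            }
            return NULL;
        }

        int main(void)
        {
            pthread_t t[4];
            for (int i = 0; i < 4; i++)
                pthread_create(&t[i], NULL, worker, NULL);
            for (int i = 0; i < 4; i++)
                pthread_join(t[i], NULL);
            printf("counter = %ld\n", counter); /* always 400000 with the lock */
            return 0;
        }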

  • Slide 19/43

    Distributed Shared Memory

    Simultaneous read/write access by spatially distributed processors
    Abstraction layer of an implementation built from message passing primitives
    Semantics not so clean

  • Slide 20/43

    Conceptual Bases for Migratory Programs

    Same CPU architecture: x86, PowerPC, MIPS, SPARC, JVM
    Same OS + environment
    Be able to checkpoint, suspend, and then resume computation
    without loss of progress

  • Slide 21/43

    Clusters of Workstations

    Inexpensive alternative to traditional supercomputers
    High availability
    Lower down time
    Easier access
    Development platform with production runs on traditional supercomputers

  • Slide 22/43

    Cluster Characteristics

    Commodity off-the-shelf hardware
    Networked
    Common home directories
    Open source software and OS
    Support message passing programming
    Batch scheduling of jobs
    Process migration

  • Slide 23/43

    Why are Linux Clusters Good?

    Low initial implementation cost
      Inexpensive PCs
      Standard components and networks
      Free software: Linux, GNU, MPI, PVM
    Scalability: can grow and shrink
    Familiar technology, easy for users to adopt, use, and maintain the system

  • Slide 24/43

    Example Clusters

    July 1999
    1000 nodes
    Used for genetic algorithm research by John Koza, Stanford University
    http://www.genetic-programming.com/
  • Slide 25/43

    Largest Cluster System

    IBM BlueGene, 2007
    DOE/NNSA/LLNL
    Memory: 73728 GB
    OS: CNK/SLES 9
    Interconnect: proprietary
    PowerPC 440
    106,496 nodes
    478.2 TeraFLOPS on LINPACK

  • Slide 26/43

    OS Share of Top 500

    OS       Count  Share    Rmax (GF)  Rpeak (GF)  Processors
    Linux    426    85.20%   4897046    7956758     970790
    Windows  6      1.20%    47495      86797       12112
    Unix     30     6.00%    408378     519178      73532
    BSD      2      0.40%    44783      50176       5696
    Mixed    34     6.80%    1540037    1900361     580693
    MacOS    2      0.40%    28430      44816       5272
    Totals   500    100%     6966169    10558086    1648095

    Source: http://www.top500.org/stats/list/30/osfam (Nov 2007)
  • Slide 27/43

    Development of Distributed+Parallel Programs

    New code + algorithms
    Old programs rewritten in new languages that have distributed and parallel primitives
    Parallelize legacy code

  • Slide 28/43

    New Programming Languages

    With distributed and parallel primitives
    Functional languages
    Logic languages
    Data flow languages

  • Slide 29/43

    Parallel Programming Languages

    based on the shared-memory model
    based on the distributed-memory model
    parallel object-oriented languages
    parallel functional programming languages
    concurrent logic languages

  • Slide 30/43

    Condor

    Cooperating workstations: come and go
    Migratory programs
    Checkpointing
    Remote IO
    Resource matching
    http://www.cs.wisc.edu/condor/
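
    A hypothetical Condor submit description file, sketching how a job might be
    handed to Condor for the resource matching described above; all file names
    are illustrative. Submitting it with condor_submit queues the job; the
    classic "standard" universe (not shown) is what enabled checkpointing and
    migration.

        # hello.sub -- hypothetical Condor submit description file
        universe   = vanilla
        executable = hello
        arguments  = world
        output     = hello.out
        error      = hello.err
        log        = hello.log
        queue

    Submitted with: condor_submit hello.sub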
  • Slide 31/43

    Portable Batch System (PBS)

    Prepare a .cmd file
      naming the program and its arguments
      properties of the job
      the needed resources
    Submit the .cmd file to the PBS Job Server with the qsub command
    Routing and scheduling: the Job Server
      examines the .cmd details to route the job to an execution queue
      allocates one or more cluster nodes to the job
      communicates with the Execution Servers (MOMs) on the cluster to determine the current state of the nodes
      when all of the needed nodes are allocated, passes the .cmd on to the Execution Server on the first node allocated (the "mother superior")
    Execution Server
      logs in on the first node as the submitting user and runs the .cmd file in the user's home directory
      runs an installation-defined prologue script
      gathers the job's output to the standard output and standard error
      runs an installation-defined epilogue script
      delivers stdout and stderr to the user
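
    For illustration, a hypothetical .cmd file using common PBS directives;
    the job name, resource values, and program name are made up, not from the
    slides.

        # myjob.cmd -- hypothetical PBS job file
        #PBS -N myjob                  # job name (a property of the job)
        #PBS -l nodes=4:ppn=2          # needed resources: 4 nodes, 2 processors each
        #PBS -l walltime=01:00:00      # needed resources: at most one hour
        #PBS -o myjob.out              # stdout delivered here
        #PBS -e myjob.err              # stderr delivered here
        cd $PBS_O_WORKDIR              # run from the directory qsub was issued in
        ./myprogram arg1 arg2          # the program and its arguments

    Submitted with: qsub myjob.cmd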

  • Slide 32/43

    TORQUE, an open source PBS

    Tera-scale Open-source Resource and QUEue manager (TORQUE) enhances OpenPBS
    Fault tolerance
      additional failure conditions checked/handled
      node health check script support
    Scheduling interface
    Scalability
      significantly improved server-to-MOM communication model
      ability to handle larger clusters (over 15 TF / 2,500 processors)
      ability to handle larger jobs (over 2000 processors)
      ability to support larger server messages
    Logging
    http://www.supercluster.org/projects/torque/

  • Slide 33/43

    OpenMP for Shared Memory

    Distributed shared memory API
    User gives hints as directives to the compiler
    http://www.openmp.org
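
    A short C sketch of the directive style: the pragmas are the hints the
    slide refers to; the array and loop bounds are illustrative. Compile with
    an OpenMP-capable compiler, e.g. gcc -fopenmp.

        /* omp_sum.c -- parallel loop and reduction via compiler directives */
        #include <omp.h>
        #include <stdio.h>

        int main(void)
        {
            const int n = 1000000;
            static double a[1000000];
            double sum = 0.0;

            /* Hint: iterations are independent, run them in parallel. */
            #pragma omp parallel for
            for (int i = 0; i < n; i++)
                a[i] = i * 0.5;

            /* Hint: combine the per-thread partial sums at the end. */
            #pragma omp parallel for reduction(+:sum)
            for (int i = 0; i < n; i++)
                sum += a[i];

            printf("sum = %f (up to %d threads)\n", sum, omp_get_max_threads());
            return 0;
        }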

  • Slide 34/43

    Message Passing Libraries

    Programmer is responsible for initial data distribution, synchronization,
    and sending and receiving information
    Parallel Virtual Machine (PVM)
    Message Passing Interface (MPI)
    Bulk Synchronous Parallel model (BSP)

  • Slide 35/43

    BSP: Bulk Synchronous Parallel Model

    Divides computation into supersteps
    In each superstep a processor can work on local data and send messages
    At the end of the superstep, a barrier synchronization takes place and all
    processors receive the messages which were sent in the previous superstep

  • Slide 36/43

    BSP Library

    Small number of subroutines to implement
      process creation,
      remote data access, and
      bulk synchronization
    Linked to C and Fortran programs
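
    A sketch of one superstep written against the BSPlib C interface
    (process creation with bsp_begin, remote data access with bsp_put, bulk
    synchronization with bsp_sync). The neighbour-exchange computation is made
    up for illustration; treat the details as an assumption, not a full program.

        /* bsp_ring.c -- each process sends a value to its right neighbour */
        #include <bsp.h>
        #include <stdio.h>

        int main(void)
        {
            bsp_begin(bsp_nprocs());           /* process creation               */
            int p = bsp_pid(), n = bsp_nprocs();

            int local = p * p;                 /* superstep: local computation   */
            int left  = 0;
            bsp_push_reg(&left, sizeof(int));  /* make `left` remotely writable  */
            bsp_sync();

            /* remote data access: deposit my value at my right neighbour */
            bsp_put((p + 1) % n, &local, &left, 0, sizeof(int));

            bsp_sync();                        /* barrier: messages arrive here  */
            printf("proc %d received %d from its left neighbour\n", p, left);

            bsp_pop_reg(&left);
            bsp_end();
            return 0;
        }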

  • Slide 37/43

    BSP: Bulk Synchronous Parallel Model

    http://www.bsp-worldwide.org/
    Book: Rob H. Bisseling, Parallel Scientific Computation: A Structured
    Approach using BSP and MPI, Oxford University Press, 2004,
    324 pages, ISBN 0-19-852939-2

  • Slide 38/43

    PVM and MPI

    Message passing primitives
    Can be embedded in many existing programming languages
    Architecturally portable
    Open-sourced implementations

  • Slide 39/43

    Parallel Virtual Machine (PVM)

    PVM enables a heterogeneous collection of networked computers to be used
    as a single large parallel computer
    Older than MPI
    Large scientific/engineering user community
    http://www.csm.ornl.gov/pvm/
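
    A condensed master/worker sketch using the PVM 3 C interface; the spawned
    executable name "a.out", the message tag, and the data value are
    illustrative, and error checking is omitted.

        /* pvm_hello.c -- master spawns one worker and sends it an int */
        #include <pvm3.h>
        #include <stdio.h>

        int main(void)
        {
            int mytid  = pvm_mytid();          /* enroll this process in PVM    */
            int parent = pvm_parent();

            if (parent == PvmNoParent) {       /* master: spawn one worker copy */
                int child, data = 42;
                pvm_spawn("a.out", (char **)0, PvmTaskDefault, "", 1, &child);
                pvm_initsend(PvmDataDefault);  /* pack and send a message       */
                pvm_pkint(&data, 1, 1);
                pvm_send(child, 1);            /* message tag 1                 */
            } else {                           /* worker: receive from master   */
                int data;
                pvm_recv(parent, 1);
                pvm_upkint(&data, 1, 1);
                printf("worker %x got %d\n", mytid, data);
            }
            pvm_exit();
            return 0;
        }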
  • Slide 40/43

    Message Passing Interface (MPI)

    http://www-unix.mcs.anl.gov/mpi/
    MPI-2.0: http://www.mpi-forum.org/docs/
    MPICH: http://www.mcs.anl.gov/mpi/mpich/ by Argonne National Laboratory
    and Mississippi State University
    LAM: http://www.lam-mpi.org/
    http://www.open-mpi.org/
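
    A minimal point-to-point example in C: the programmer explicitly sends and
    receives the data, as noted on the Message Passing Libraries slide. The
    value and tag are illustrative; run with at least two processes
    (e.g. mpirun -np 2 ./a.out), using MPICH, LAM/MPI, or Open MPI.

        /* mpi_sendrecv.c -- rank 0 sends one int to rank 1 */
        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char *argv[])
        {
            int rank, value;
            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            if (rank == 0) {
                value = 42;
                /* send one int to rank 1 with message tag 0 */
                MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            } else if (rank == 1) {
                MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                printf("rank 1 received %d from rank 0\n", value);
            }

            MPI_Finalize();
            return 0;
        }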
  • Slide 41/43

    Kernels Etc. Mods for Clusters

    Dynamic load balancing
    Transparent process migration
    Kernel mods
      http://openmosix.sourceforge.net/
      http://kerrighed.org/
      http://openssi.org/
      http://ci-linux.sourceforge.net/
        CLuster Membership Subsystem ("CLMS") and Internode Communication Subsystem
    http://www.gluster.org/
      GlusterFS: clustered file storage of petabytes
      GlusterHPC: high performance compute clusters
    http://boinc.berkeley.edu/
      open-source software for volunteer computing and grid computing
    Condor clusters
  • Slide 42/43

    More Information on Clusters

    http://www.ieeetfcc.org/ IEEE Task Force on Cluster Computing
    http://lcic.org/ a central repository of links and information regarding
    Linux clustering, in all its forms
    http://www.beowulf.org/ resources for clusters built on commodity hardware
    deploying Linux OS and open source software
    http://linuxclusters.com/ authoritative resource for information on Linux
    Compute Clusters and Linux High Availability Clusters
    http://www.linuxclustersinstitute.org/ provides education and advanced
    technical training for the deployment and use of Linux-based computing
    clusters to the high-performance computing community worldwide
  • Slide 43/43

    References

    Cluster Hardware Setup: http://www.phy.duke.edu/~rgb/Beowulf/beowulf_book/beowulf_book.pdf
    PVM: http://www.csm.ornl.gov/pvm/
    MPI: http://www.open-mpi.org/
    Condor: http://www.cs.wisc.edu/condor/