Beowulf Cluster Computing with Linux

Scientific and Engineering Computation
Janusz Kowalik, editor

Data-Parallel Programming on MIMD Computers, Philip J. Hatcher and Michael J. Quinn, 1991
Unstructured Scientific Computation on Scalable Multiprocessors, edited by Piyush Mehrotra, Joel Saltz, and Robert Voigt, 1992
Parallel Computational Fluid Dynamics: Implementation and Results, edited by Horst D. Simon, 1992
Enterprise Integration Modeling: Proceedings of the First International Conference, edited by Charles J. Petrie, Jr., 1992
The High Performance Fortran Handbook, Charles H. Koelbel, David B. Loveman, Robert S. Schreiber, Guy L. Steele Jr., and Mary E. Zosel, 1994
PVM: Parallel Virtual Machine - A Users' Guide and Tutorial for Network Parallel Computing, Al Geist, Adam Beguelin, Jack Dongarra, Weicheng Jiang, Bob Manchek, and Vaidy Sunderam, 1994
Practical Parallel Programming, Gregory V. Wilson, 1995
Enabling Technologies for Petaflops Computing, Thomas Sterling, Paul Messina, and Paul H. Smith, 1995
An Introduction to High-Performance Scientific Computing, Lloyd D. Fosdick, Elizabeth R. Jessup, Carolyn J. C. Schauble, and Gitta Domik, 1995
Parallel Programming Using C++, edited by Gregory V. Wilson and Paul Lu, 1996
Using PLAPACK: Parallel Linear Algebra Package, Robert A. van de Geijn, 1997
Fortran 95 Handbook, Jeanne C. Adams, Walter S. Brainerd, Jeanne T. Martin, Brian T. Smith, and Jerrold L. Wagener, 1997
MPI - The Complete Reference: Volume 1, The MPI Core, Marc Snir, Steve Otto, Steven Huss-Lederman, David Walker, and Jack Dongarra, 1998
MPI - The Complete Reference: Volume 2, The MPI-2 Extensions, William Gropp, Steven Huss-Lederman, Andrew Lumsdaine, Ewing Lusk, Bill Nitzberg, William Saphir, and Marc Snir, 1998
A Programmer's Guide to ZPL, Lawrence Snyder, 1999
How to Build a Beowulf, Thomas L. Sterling, John Salmon, Donald J. Becker, and Daniel F. Savarese, 1999
Using MPI: Portable Parallel Programming with the Message-Passing Interface, second edition, William Gropp, Ewing Lusk, and Anthony Skjellum, 1999
Using MPI-2: Advanced Features of the Message-Passing Interface, William Gropp, Ewing Lusk, and Rajeev Thakur, 1999
Beowulf Cluster Computing with Linux, Thomas Sterling, 2001
Beowulf Cluster Computing with Windows, Thomas Sterling, 2001

Beowulf Cluster Computing with Linux
Thomas Sterling

The MIT Press
Cambridge, Massachusetts
London, England

(c) 2002 Massachusetts Institute of Technology

All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher.

This book was set in LaTeX by the author and was printed and bound in the United States of America.

Library of Congress Control Number 2001095383
ISBN: 0262692740

Disclaimer: Some images in the original version of this book are not available for inclusion in the eBook.
Dedicated with respect and appreciation to the memory of Seymour R. Cray, 1925-1996

Contents

Series Foreword
Foreword
Preface

1 Introduction, Thomas Sterling

Part I: Enabling Technologies

2 An Overview of Cluster Computing, Thomas Sterling
3 Node Hardware, Thomas Sterling
4 Linux, Peter H. Beckman
5 Network Hardware, Thomas Sterling
6 Network Software, Thomas Sterling
7 Setting Up Clusters: Installation and Configuration, Thomas Sterling and Daniel Savarese
8 How Fast Is My Beowulf?, David Bailey

Part II: Parallel Programming

9 Parallel Programming with MPI, William Gropp and Ewing Lusk
10 Advanced Topics in MPI Programming, William Gropp and Ewing Lusk
11 Parallel Programming with PVM, Al Geist and Stephen Scott
12 Fault-Tolerant and Adaptive Programs with PVM, Al Geist and Jim Kohl

Part III: Managing Clusters

13 Cluster Workload Management, James Patton Jones, David Lifka, Bill Nitzberg, and Todd Tannenbaum
14 Condor: A Distributed Job Scheduler, Todd Tannenbaum, Derek Wright, Karen Miller, and Miron Livny
15 Maui Scheduler: A Multifunction Cluster Scheduler, David B. Jackson
16 PBS: Portable Batch System, James Patton Jones
17 PVFS: Parallel Virtual File System, Walt Ligon and Rob Ross
18 Chiba City: The Argonne Scalable Cluster, Remy Evard
19 Conclusions, Thomas Sterling

A Glossary of Terms
B Annotated Reading List
C Annotated URLs
References
Index

Series Foreword

The world of modern computing potentially offers many helpful methods and tools to scientists and engineers, but the fast pace of change in computer hardware, software, and algorithms often makes practical use of the
newest computing technology difficult. The Scientific and Engineering Computation series focuses on rapid advances in computing technologies, with the aim of facilitating transfer of these technologies to applications in science and engineering. It will include books on theories, methods, and original applications in such areas as parallelism, large-scale simulations, time-critical computing, computer-aided design and engineering, use of computers in manufacturing, visualization of scientific data, and human-machine interface technology.

The series is intended to help scientists and engineers understand the current world of advanced computation and to anticipate future developments that will affect their computing environments and open up new capabilities and modes of computation.

This volume in the series describes the increasingly successful distributed/parallel system called Beowulf. A Beowulf is a cluster of PCs interconnected by network technology and employing the message-passing model for parallel computation. Key advantages of this approach are high performance for low price, system scalability, and rapid adjustment to new technological advances.

This book includes how to build, program, and operate a Beowulf system based on the Linux operating system. A companion volume in the series provides the same information for Beowulf clusters based on the Microsoft Windows operating system.

Beowulf hardware, operating system software, programming approaches and libraries, and machine management software are all covered here. The book can be used as an academic textbook as well as a practical guide for designing, implementing, and operating a Beowulf for those in science and industry who need a powerful system but are reluctant to purchase an expensive massively parallel processor or vector computer.

Janusz S. Kowalik

Foreword

We know two things about progress in parallel programming:

1. Like nearly all technology, progress comes when effort is headed in a common, focused direction with technologists competing and sharing results.

2. Parallel programming remains very difficult and should be avoided if at all possible. This argues for a single environment and for someone else to do the programming through built-in parallel function (e.g., databases, vigorous applications sharing, and an applications market).

After 20 years of false starts and dead ends in high-performance computer architecture, the way is now clear: Beowulf clusters are becoming the platform for many scientific, engineering, and commercial applications. Cray-style supercomputers from Japan are still used for legacy or unpartitionable applications code; but this is a shrinking fraction of supercomputing because such architectures aren't scalable or affordable. But if the code cannot be ported or partitioned, vector supercomputers at larger centers are required. Likewise, the Top500 share of proprietary MPPs[1] (massively parallel processors), SMPs (shared memory, multiple vector processors), and DSMs (distributed shared memory) that came from the decade-long government-sponsored hunt for the scalable computer is declining. Unfortunately, the architectural diversity created by the hunt assured that a standard platform and programming model could not form. Each platform had low volume and huge software development costs and a lock-in to that vendor.

Just two generations ago based on Moore's law (1995[2]), a plethora of vector supercomputers, nonscalable multiprocessors, and MPP clusters built from proprietary nodes and networks formed the market. That made me realize the error of an earlier prediction that these exotic shared-memory machines were supercomputing's inevitable future. At the time, several promising commercial off-the-shelf (COTS) technology clusters using standard microprocessors and networks were beginning to be built.
Wisconsin's Condor to harvest workstation cycles and Berkeley's NOW (network of workstations) were my favorites. They provided one to two orders of magnitude improvement in performance/price over the proprietary systems, including their higher operational overhead.

Footnote 1: MPPs are a proprietary variant of clusters or multicomputers. "Multicomputers" is the name Allen Newell and I coined in our 1971 book, Computer Structures, to characterize a single computer system comprising connected computers that communicate with one another via message passing (versus via shared memory). In the 2001 list of the world's Top500 computers, all except a few shared-memory vector and distributed shared-memory computers are multicomputers. "Massive" has been proposed as the name for clusters over 1,000 computers.

Footnote 2: G. Bell, "1995 Observations on Supercomputing Alternatives: Did the MPP Bandwagon Lead to a Cul-de-Sac?", Communications of the ACM 39, no. 3 (March 1996): 11-15.

In the past five years, the "Beowulf way" has emerged. It developed and integrated a programming environment that operates on scalable clusters built on commodity parts, typically based on Intel but sometimes based on Alphas or PowerPCs. It also leveraged a vendor-neutral operating system (Linux) and helped mature tools such as GNU, MPI, PVM, Condor, and various schedulers. The introduction of Windows Beowulf leverages the large software base, for example, applications, office and visualization tools, and clustered SQL databases.

Beowulf's lower price and standardization attracted a large user community to a common software base. Beowulf follows the personal computer cycle of innovation: platform availability attracts applications; applications attract users; user demand attracts platform competition and more applications; lower prices come with volume and competition. Concurrently, proprietary platforms become less attractive because they lack software, and hence live in niche markets.

Beowulf is the hardware vendor's worst nightmare: there is little profit in Beowulf clusters of commodity nodes and switches. By using COTS PCs, networks, free Linux/GNU-based operating systems and tools, or Windows, Beowulf enables any group to buy and build its own supercomputer. Once the movement achieved critical mass, the world tipped to this new computing paradigm. No amount of government effort to prop up the ailing domestic industry, and no amount of industry lobbying, could reverse that trend. Today, traditional vector supercomputer companies are gone from the United States, and they are a vanity business in Japan, with less than 10% of the Top500 being vector processors. Clusters beat vector supercomputers, even though about eight scalar microprocessors are still needed to equal the power of a vector processor.

The Beowulf movement unified the cluster community and changed the course of technical computing by "commoditizing" it. Beowulf enabled users to have a common platform and programming model independent of proprietary processors, interconnects, storage, or software base. An applications base, as well as an industry based on many low-cost "killer" microprocessors, is finally forming.

You are the cause of this revolution, but there's still much to be done! There is cause for concern, however. Beowulf is successful because it is a common base with critical mass. There will be considerable pressure to create Linux/Beowulf dialects (e.g., 64-bit flavor and various vendor binary dialects), which will fragment the community, user attention span, training, and applications, just as proprietary-platform Unix dialects sprang from hardware vendors to differentiate and lock in users.
The community must balance this pseudo- and incremental innovation against standardization, because standardization is what gives the Beowulf its huge advantage.

Having described the inevitable appearance of Linux/Beowulf dialects, and the associated pitfalls, I am strongly advocating Windows Beowulf. Instead of fragmenting the community, Windows Beowulf will significantly increase the Beowulf community. A Windows version will support the large community of people who want the Windows tools, layered software, and development style. Already, most users of large systems operate a heterogeneous system that runs both, with Windows (supplying a large scalable database) and desktop Visual-X programming tools. Furthermore, competition will improve both. Finally, the big gain will come from cross-fertilization of .NET capabilities, which are leading the way to the truly distributed computing that has been promised for two decades.

Beowulf Becomes a Contender

In the mid-1980s an NSF supercomputing centers program was established in response to Digital's VAX minicomputers.[3] Although the performance gap between the VAX and a Cray could be as large as 100,[4] the performance per price was usually the reverse: VAX gave much more bang for the buck. VAXen soon became the dominant computer for researchers. Scientists were able to own and operate their own computers and get more computing resources with their own VAXen, including those that were operated as the first clusters. The supercomputer centers were used primarily to run jobs that were too large for these personal or departmental systems.

In 1983 ARPA launched the Scalable Computing Initiative to fund over a score of research projects to design, build, and buy scalable, parallel computers. Many of these were centered on the idea of the emerging "killer microprocessor." Over forty startups were funded with venture capital and our tax dollars to build different parallel computers. All of these efforts failed. (I estimate these efforts cost between one and three billion dollars, plus at least double that in user programming that is best written off as training.) The vast funding of all the different species, which varied only superficially, guaranteed little progress and no applications market. The user community did, however, manage to defensively create lowest common denominator standards to enable programs to run across the wide array of varying architectures.

Footnote 3: The VAX 780 was introduced in 1978.

Footnote 4: VAXen lacked the ability to get 5-20 times the performance that a large, shared Cray provided for single problems.

In 1987, the National Science Foundation's new computing directorate established the goal of achieving parallelism of 100X by the year 2000. The goal got two extreme responses: Don Knuth and Ken Thompson said that parallel programming was too hard and that we shouldn't focus on it; and others felt the goal should be 1,000,000X! Everyone else either ignored the call or went along quietly for the funding. This call was accompanied by an offer (by me) of yearly prizes to reward those who achieved extraordinary parallelism, performance, and performance/price. In 1988, three researchers at Sandia obtained parallelism of 600X on a 1000-node system, while indicating that 1000X was possible with more memory. The announcement of their achievement galvanized others, and the Gordon Bell prizes continue, with gains of 100% nearly every year.

Interestingly, a factor of 1000 scaling seems to continue to be the limit for most scalable applications, but 20-100X is more common. In fact, at least half of the Top500 systems have fewer than 100 processors! Of course, the parallelism is determined largely by the fact that researchers are budget limited and have only smaller machines costing $1,000-$3,000 per node, or parallelism of < 100.
If the nodes are in a center, then the per node cost is multiplied by at least 10, giving an upper limit of 1000-10,000 nodes per system. If the nodes are vector processors, the number of processors is divided by 8-10 and the per node price raised by 100X.

In 1993, Tom Sterling and Don Becker led a small project within NASA to build a gigaflops workstation costing under $50,000. The so-called Beowulf project was outside the main parallel-processing research community: it was based instead on commodity and COTS technology and publicly available software. The Beowulf project succeeded: a 16-node, $40,000 cluster built from Intel 486 computers ran in 1994. In 1997, a Beowulf cluster won the Gordon Bell Prize for performance/price. The recipe for building one's own Beowulf was presented in a book by Sterling et al. in 1999.[5] By the year 2000, several thousand-node computers were operating. In June 2001, 33 Beowulfs were in the Top500 supercomputer list (www.top500.org). Today, in the year 2001, technical high schools can buy and assemble a supercomputer from parts available at the corner computer store.

Footnote 5: T. Sterling, J. Salmon, D. J. Becker, and D. V. Savarese, How to Build a Beowulf: A Guide to the Implementation and Application of PC Clusters, MIT Press, Cambridge, MA, 1999.

Beowulfs formed a do-it-yourself cluster computing community using commodity microprocessors, local area network Ethernet switches, Linux (and now Windows 2000), and tools that have evolved from the user community. This vendor-neutral platform used the MPI message-based programming model that scales with additional processors, disks, and networking.

Beowulf's success goes beyond the creation of an "open source" model for the scientific software community. It utilizes the two decades of attempts by the parallel processing community to apply these mutant multicomputers to a variety of applications. Nearly all of these efforts, like Beowulf, have been created by researchers outside the traditional funding stream (i.e., "pull" versus "push" research). Included among these efforts are the following:

- Operating system primitives in Linux and GNU tools that support the platform and networking hardware to provide basic functions
- Message Passing Interface (MPI) programming model
- Various parallel programming paradigms, including Linda, the Parallel Virtual Machine (PVM), and Fortran dialects
- Parallel file systems, awaiting transparent database technology
- Monitoring, debugging, and control tools
- Scheduling and resource management (e.g., Wisconsin's Condor, the Maui scheduler)
- Higher-level libraries (e.g., LINPACK, BLAS)

Challenges of Do-It-Yourself Supercomputers

Will the supercomputer centers' role change in light of personal Beowulfs? Beowulfs are even more appealing than VAXen because of their ubiquity, scalability, and extraordinary performance/price. A supercomputer center user usually gets no more than 64-128 nodes[6] for a single problem, comparable to the size that researchers have or can build up in their labs. At a minimum, centers will be forced to rethink and redefine their role.

An interesting scenario arises when Gigabit and 10 Gigabit Ethernets become the de facto LAN. As network speed and latency increase more rapidly than processing, message passing looks like memory access, making data equally accessible to all nodes within a local area. These match the speed of the next-generation Internet. This would mean any LAN-based collection of PCs would become a de facto Beowulf! Beowulfs and Grid computing technologies will become more closely [...]

Footnote 6: At a large center with over 600 processors, the following was observed: 65% of the users were assigned < 16 processors; 24% [...]

[...] > 0, then
    send the name of the processor to the process with rank 0
Else
    print the name of this processor
    for each rank,
        receive the name of the processor and print it
Endif

This program is shown in Figure 9.3.
The new MPI calls are to MPI_Send and MPI_Recv and to MPI_Get_processor_name. The latter is a convenient way to get the name of the processor on which a process is running. MPI_Send and MPI_Recv can be understood by stepping back and considering the two requirements that must be satisfied to communicate data between two processes:

1. Describe the data to be sent or the location in which to receive the data.
2. Describe the destination (for a send) or the source (for a receive) of the data.

In addition, MPI provides a way to tag messages and to discover information about the size and source of the message. We will discuss each of these in turn.

Describing the Data Buffer. A data buffer typically is described by an address and a length, such as (a, 100), where a is a pointer to 100 bytes of data. For example, the Unix write call describes the data to be written with an address and length (along with a file descriptor). MPI generalizes this to provide two additional capabilities: describing noncontiguous regions of data and describing data so that it can be communicated between processors with different data representations. To do this, MPI uses three values to describe a data buffer: the address, the (MPI) datatype, and the number or count of the items of that datatype. For example, a buffer containing four C ints is described by the triple (a, 4, MPI_INT). There are predefined MPI datatypes for all of the basic datatypes defined in C, Fortran, and C++. The most common datatypes are shown in Table 9.1.

    #include "mpi.h"
    #include <stdio.h>
    #include <string.h>

    int main( int argc, char *argv[] )
    {
        int numprocs, myrank, namelen, i;
        char processor_name[MPI_MAX_PROCESSOR_NAME];
        char greeting[MPI_MAX_PROCESSOR_NAME + 80];
        MPI_Status status;

        MPI_Init( &argc, &argv );
        MPI_Comm_size( MPI_COMM_WORLD, &numprocs );
        MPI_Comm_rank( MPI_COMM_WORLD, &myrank );
        MPI_Get_processor_name( processor_name, &namelen );
        sprintf( greeting, "Hello, world, from process %d of %d on %s",
                 myrank, numprocs, processor_name );
        if ( myrank == 0 ) {
            printf( "%s\n", greeting );
            for ( i = 1; i < numprocs; i++ ) {
                MPI_Recv( greeting, sizeof( greeting ), MPI_CHAR,
                          i, 1, MPI_COMM_WORLD, &status );
                printf( "%s\n", greeting );
            }
        }
        else {
            MPI_Send( greeting, strlen( greeting ) + 1, MPI_CHAR,
                      0, 1, MPI_COMM_WORLD );
        }
        MPI_Finalize( );
        return( 0 );
    }

Figure 9.3: A more complex "Hello World" program in MPI. Only process 0 writes to stdout; each process sends a message to process 0.

    C type     MPI type      Fortran type        MPI type
    int        MPI_INT       INTEGER             MPI_INTEGER
    double     MPI_DOUBLE    DOUBLE PRECISION    MPI_DOUBLE_PRECISION
    float      MPI_FLOAT     REAL                MPI_REAL
    long       MPI_LONG
    char       MPI_CHAR      CHARACTER           MPI_CHARACTER
                             LOGICAL             MPI_LOGICAL
               MPI_BYTE                          MPI_BYTE

Table 9.1: The most common MPI datatypes. C and Fortran types on the same row are often but not always the same type. The type MPI_BYTE is used for raw data bytes and does not correspond to any particular datatype. The C++ MPI datatypes have the same name as the C datatypes, but without the MPI_ prefix, for example, MPI::INT.

Describing the Destination or Source. The destination or source is specified by using the rank of the process. MPI generalizes the notion of destination and source rank by making the rank relative to a group of processes. This group may be a subset of the original group of processes. Allowing subsets of processes and using relative ranks make it easier to use MPI to write component-oriented software (more on this in Section 10.4). The MPI object that defines a group of processes (and a special communication context that will be discussed in Section 10.4) is called a communicator. Thus, sources and destinations are given by two parameters: a rank and a communicator. The communicator MPI_COMM_WORLD is predefined and contains all of the processes started by mpirun or mpiexec.
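The (address, count, datatype) description above also covers noncontiguous data, through datatypes that a program builds at run time. The sketch below is not one of the book's figures; it is a minimal, hypothetical example that uses the standard MPI_Type_vector routine to send one column of a row-major C matrix without copying it. The array shape, destination rank, and tag are made up for the illustration.

    #include "mpi.h"

    #define NROWS 8
    #define NCOLS 10

    /* Send column `col` of an NROWS x NCOLS row-major matrix to rank 1.
       In row-major storage the column elements are NCOLS doubles apart,
       so a strided (vector) datatype describes them in place.          */
    void send_column( double a[NROWS][NCOLS], int col )
    {
        MPI_Datatype column_t;

        /* NROWS blocks of 1 element each, separated by a stride of NCOLS */
        MPI_Type_vector( NROWS, 1, NCOLS, MPI_DOUBLE, &column_t );
        MPI_Type_commit( &column_t );

        /* The buffer is still an (address, count, datatype) triple:
           one item of the derived type, starting at &a[0][col].      */
        MPI_Send( &a[0][col], 1, column_t, 1, 0, MPI_COMM_WORLD );

        MPI_Type_free( &column_t );
    }

The receiving process is free to describe the same data differently, for example as NROWS contiguous items of type MPI_DOUBLE. Chapter 10 (Section 10.3.2) returns to communicating noncontiguous data in more detail.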
As a source, the special value MPI_ANY_SOURCE may be used to indicate that the message may be received from any rank of the MPI processes in this MPI program.

Selecting among Messages. The extra argument for MPI_Send is a nonnegative integer tag value. This tag allows a program to send one extra number with the data. MPI_Recv can use this value either to select which message to receive (by specifying a specific tag value) or to use the tag to convey extra data (by specifying the wildcard value MPI_ANY_TAG). In the latter case, the tag value of the received message is stored in the status argument (this is the last parameter to MPI_Recv in the C binding). This is a structure in C, an integer array in Fortran, and a class in C++. The tag and rank of the sending process can be accessed by referring to the appropriate element of status as shown in Table 9.2.

              C                     Fortran                C++
    source    status.MPI_SOURCE     status(MPI_SOURCE)     status.Get_source()
    tag       status.MPI_TAG        status(MPI_TAG)        status.Get_tag()

Table 9.2: Accessing the source and tag after an MPI_Recv.

Determining the Amount of Data Received. The amount of data received can be found by using the routine MPI_Get_count. For example,

    MPI_Get_count( &status, MPI_CHAR, &num_chars );

returns in num_chars the number of characters sent. The second argument should be the same MPI datatype that was used to receive the message. (Since many applications do not need this information, the use of a routine allows the implementation to avoid computing num_chars unless the user needs the value.)

Our example provides a maximum-sized buffer in the receive. It is also possible to find the amount of memory needed to receive a message by using MPI_Probe, as shown in Figure 9.4.

    char *greeting;
    int num_chars, src;
    MPI_Status status;
    ...
    MPI_Probe( MPI_ANY_SOURCE, 1, MPI_COMM_WORLD, &status );
    MPI_Get_count( &status, MPI_CHAR, &num_chars );
    greeting = (char *)malloc( num_chars );
    src      = status.MPI_SOURCE;
    MPI_Recv( greeting, num_chars, MPI_CHAR,
              src, 1, MPI_COMM_WORLD, &status );

Figure 9.4: Using MPI_Probe to find the size of a message before receiving it.

MPI guarantees that messages are ordered and that an MPI_Recv after an MPI_Probe will receive the message that the probe returned information on as long as the same message selection criteria (source rank, communicator, and message tag) are used. Note that in this example, the source for the MPI_Recv is specified as status.MPI_SOURCE, not MPI_ANY_SOURCE, to ensure that the message received is the same as the one about which MPI_Probe returned information.

9.2 Manager/Worker Example

We now begin a series of examples illustrating approaches to parallel computations that accomplish useful work. While each parallel application is unique, a number of paradigms have emerged as widely applicable, and many parallel algorithms are variations on these patterns.

One of the most universal is the "manager/worker" or "task parallelism" approach. The idea is that the work that needs to be done can be divided by a manager into separate pieces and the pieces can be assigned to individual worker processes. Thus the manager executes a different algorithm from that of the workers, but all of the workers execute the same algorithm. Most implementations of MPI (including MPICH) allow MPI processes to be running different programs (executable files), but it is often convenient (and in some cases required) to combine the manager and worker code into a single program with the structure shown in Figure 9.5.

    #include "mpi.h"

    int main( int argc, char *argv[] )
    {
        int numprocs, myrank;

        MPI_Init( &argc, &argv );
        MPI_Comm_size( MPI_COMM_WORLD, &numprocs );
        MPI_Comm_rank( MPI_COMM_WORLD, &myrank );
        if ( myrank == 0 )          /* manager process */
            manager_code ( numprocs );
        else                        /* worker process */
            worker_code ( );
        MPI_Finalize( );
        return 0;
    }

Figure 9.5: Framework of the matrix-vector multiply program.
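For reference, a program with this structure is compiled and launched with whatever wrapper tools the local MPI installation provides. The two commands below show only a typical MPICH-style session with made-up file names; the exact commands and flags differ between implementations, and Section 9.6 covers MPICH in particular.

    mpicc -o mvmult mvmult.c     (wrapper compiler supplied by the MPI installation)
    mpiexec -n 4 ./mvmult        (one manager, rank 0, plus three workers)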
MostimplementationsofMPI (including MPICH) allow MPI processes to be running dierent programs (ex-ecutableles), butitisoftenconvenient(andinsomecasesrequired)tocombinethemanagerandworkercodeintoasingleprogramwiththestructureshowninFigure9.5.#include "mpi.h"int main( int argc, char *argv[] ){int numprocs, myrank;MPI_Init( &argc, &argv );MPI_Comm_size( MPI_COMM_WORLD, &numprocs );MPI_Comm_rank( MPI_COMM_WORLD, &myrank );if ( myrank == 0 ) /* manager process */manager_code ( numprocs );else /* worker process */worker_code ( );MPI_Finalize( );return 0;}Figure9.5Frameworkofthematrix-vectormultiplyprogram.Sometimestheworkcanbeevenlydividedintoexactlyasmanypiecesasthereareworkers, butamoreexibleapproachistohavethemanagerkeepapool ofunits of work larger than the number of workers, and assign new work dynamically170 Chapter9to workers as they complete their tasks and send their results back to the manager.This approach, called self-scheduling, works well in the presence of tasks of varyingsizesand/orworkersofvaryingspeeds.Weillustratethistechniquewithaparallel programtomultiplyamatrixbyavector. (A Fortran version of this same program can be found in [13].)This programisnot aparticularlygoodwaytocarryoutthisoperation, butitillustratestheapproach and is simple enough to be shown in its entirety. The program multipliesasquarematrixabyavectorbandstorestheresultinc. Theunitsofworkaretheindividualdotproductsoftherowsofawiththevectorb. Thusthemanager,codeforwhichisshowninFigure9.6, startsbyinitializinga. Themanagerthensends out initial units of work, one row to each worker. We use the MPI tag on eachsuchmessagetoencodetherownumberwearesending. Sincerownumbersstartat0butwewishtoreserve0asatagwiththespecialmeaningofnomoreworktodo,wesetthetagtoonegreaterthantherownumber. Whenaworkersendsbackadotproduct,westoreitintheappropriateplaceincandsendthatworkeranother row to work on. Once all the rows have been assigned, workers completingataskaresentanomoreworkmessage,indicatedbyamessagewithtag0.ThecodefortheworkerpartoftheprogramisshowninFigure9.7. Aworkerinitializesb, receivesarowof ainamessage, computesthedotproductof thatrow and the vectorb, and then returns the answer to the manager, again using thetagtoidentifytherow. Aworkerrepeatsthisuntilitreceivesthenomoreworkmessage,identiedbyitstagof0.This program requires at least two processes to run: one manager and one worker.Unfortunately,addingmoreworkersisunlikelytomakethejobgofaster. Wecananalyzethecostofcomputationandcommunicationmathematicallyandseewhathappensasweincreasethenumberofworkers. Increasingthenumberofworkerswill decrease the amount of computationdone byeachworker, andsince theyworkinparallel,thisshoulddecreasetotalelapsedtime. Ontheotherhand,moreworkersmeanmorecommunication, andthecostof communicatinganumberisusuallymuchgreaterthanthecostofanarithmetical operationonit. Thestudyofhowthetotaltimeforaparallelalgorithmisaectedbychangesinthenumberofprocesses,theproblemsize,andthespeedoftheprocessorandcommunicationnetworkiscalledscalabilityanalysis. Weanalyzethematrix-vectorprogramasasimpleexample.First, let us compute the number of oating-point operations. For a matrix of sizen, we have to compute n dot products, each of which requires n multiplications andn1 additions. Thus the number of oating-point operations is n(n+(n1)) =n(2n1) = 2n2n. If Tcalc is the time it takes a processor to do one oating-pointParallelProgrammingwithMPI 171#define SIZE 1000#define MIN( x, y ) ((x) < (y) ? 
x : y)void manager_code( int numprocs ){double a[SIZE][SIZE], c[SIZE];int i, j, sender, row, numsent = 0;double dotp;MPI_Status status;/* (arbitrary) initialization of a */for (i = 0; i < SIZE; i++ )for ( j = 0; j < SIZE; j++ )a[i][j] = ( double ) j;for ( i = 1; i < MIN( numprocs, SIZE ); i++ ) {MPI_Send( a[i-1], SIZE, MPI_DOUBLE, i, i, MPI_COMM_WORLD );numsent++;}/* receive dot products back from workers */for ( i = 0; i < SIZE; i++ ) {MPI_Recv( &dotp, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,MPI_COMM_WORLD, &status );sender = status.MPI_SOURCE;row = status.MPI_TAG - 1;c[row] = dotp;/* send another row back to this worker if there is one */if ( numsent < SIZE ) {MPI_Send( a[numsent], SIZE, MPI_DOUBLE, sender,numsent + 1, MPI_COMM_WORLD );numsent++;}else /* no more work */MPI_Send( MPI_BOTTOM, 0, MPI_DOUBLE, sender, 0,MPI_COMM_WORLD );}}Figure9.6Thematrix-vectormultiplyprogram,managercode.172 Chapter9void worker_code( void ){double b[SIZE], c[SIZE];int i, row, myrank;double dotp;MPI_Status status;for ( i = 0; i < SIZE; i++ ) /* (arbitrary) b initialization */b[i] = 1.0;MPI_Comm_rank( MPI_COMM_WORLD, &myrank );if ( myrank 0 ) {row = status.MPI_TAG - 1;dotp = 0.0;for ( i = 0; i < SIZE; i++ )dotp += c[i] * b[i];MPI_Send( &dotp, 1, MPI_DOUBLE, 0, row + 1,MPI_COMM_WORLD );MPI_Recv( c, SIZE, MPI_DOUBLE, 0, MPI_ANY_TAG,MPI_COMM_WORLD, &status );}}}Figure9.7Thematrix-vectormultiplyprogram,workercode.operation,thenthetotalcomputationtimeis(2n2n) Tcalc. Next,wecomputethe number of communications, dened as sending one oating-point number. (Weignore for this simple analysis the eect of message lengths.)Leaving aside the costofcommunicatingb(perhapsitiscomputedlocallyinaprecedingstep), wehavetosendeachrowofaandreceivebackonedotproductanswer. Sothenumberofoating-point numbers communicated is (nn) +n = n2+n. If Tcommis the timeto communicate one number, we get (n2+n) Tcommfor the total communicationtime. Thustheratioofcommunicationtimetocomputationtimeis

n2+ n2n2n

TcommTcalc

.ParallelProgrammingwithMPI 173Inmanycomputationstheratioofcommunicationtocomputationcanbereducedalmostto0bymakingtheproblemsizelarger. Ouranalysisshowsthatthisisnotthecasehere. Asngetslarger, thetermontheleftapproaches12. Thuswecanexpect communication costs to prevent this algorithm from showing good speedups,evenonlargeproblemsizes.Thesituationisbetterinthecaseofmatrix-matrixmultiplication,whichcouldbecarriedoutbyasimilaralgorithm. Wewouldreplacethevectorsbandcbymatrices, sendtheentirematrixbtotheworkersatthebeginningofthecompu-tation,andthenhandouttherowsofaasworkunits,justasbefore. Theworkerswouldcomputeanentirerowoftheproduct,consistingofthedotproductsoftherowofawithallofthecolumnofb,andthenreturnarowofctothemanager.Let us now do the scalability analysis for the matrix-matrix multiplication. Againweignoretheinitial communicationof b. Thenumberof operationsforonedotproduct is n+(n+1) as before, and the total number of dot products calculated isn2. Thusthetotalnumberofoperationsisn2(2n 1) = 2n3n2. Thenumberofnumberscommunicatedhasgoneupto(n n) +(n n) = 2n2. Sotheratioofcommunicationtimetocomputationtimehasbecome

2n22n3n2

TcommTcalc

which does tend to 0 as n gets larger. Thus, for large matrices the communication costs play less of a role.

Two other difficulties with this algorithm might occur as we increase the size of the problem and the number of workers. The first is that as messages get longer, the workers waste more time waiting for the next row to arrive. A solution to this problem is to "double buffer" the distribution of work, having the manager send two rows to each worker to begin with, so that a worker always has some work to do while waiting for the next row to arrive.

Another difficulty for larger numbers of processes can be that the manager can become overloaded so that it cannot assign work in a timely manner. This problem can most easily be addressed by increasing the size of the work unit, but in some cases it is necessary to parallelize the manager task itself, with multiple managers handling subpools of work units.

A more subtle problem has to do with fairness: ensuring that all worker processes are fairly serviced by the manager. MPI provides several ways to ensure fairness; see [13, Section 7.1.4].
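As a rough illustration of what the two ratios mean in practice, suppose that communicating one floating-point number costs about one hundred times as much as one floating-point operation; this figure is an assumption chosen for the example, not a measurement taken from the book. Then

\[
\frac{n^2 + n}{2n^2 - n}\cdot\frac{T_{\mathrm{comm}}}{T_{\mathrm{calc}}}
\;\approx\; \frac{1}{2}\cdot 100 = 50,
\qquad
\frac{2n^2}{2n^3 - n^2}\cdot\frac{T_{\mathrm{comm}}}{T_{\mathrm{calc}}}
\;\approx\; \frac{1}{n}\cdot 100 = 0.1 \quad (n = 1000).
\]

That is, the matrix-vector version spends on the order of fifty times as long communicating as computing no matter how large n becomes, while the matrix-matrix version spends only about a tenth of its time communicating once the matrices are reasonably large.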
9.3 Two-Dimensional Jacobi Example with One-Dimensional Decomposition

A common use of parallel computers in scientific computation is to approximate the solution of a partial differential equation (PDE). One of the most common PDEs, at least in textbooks, is the Poisson equation (here shown in two dimensions):

\[
\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = f(x,y) \quad \text{in } \Omega \qquad (9.3.1)
\]
\[
u = g(x,y) \quad \text{on } \partial\Omega \qquad (9.3.2)
\]

This equation is used to describe many physical phenomena, including fluid flow and electrostatics. The equation has two parts: a differential equation applied everywhere within a domain Omega (9.3.1) and a specification of the value of the unknown u along the boundary of Omega (the notation dOmega means the boundary of Omega). For example, if this equation is used to model the equilibrium distribution of temperature inside a region, the boundary condition g(x, y) specifies the applied temperature along the boundary, f(x, y) is zero, and u(x, y) is the temperature within the region. To simplify the rest of this example, we will consider only a simple domain Omega consisting of a square (see Figure 9.8).

To compute an approximation to u(x, y), we must first reduce the problem to finite size. We cannot determine the value of u everywhere; instead, we will approximate u at a finite number of points (x_i, y_j) in the domain, where x_i = i h and y_j = j h. (Of course, we can define a value for u at other points in the domain by interpolating from these values that we determine, but the approximation is defined by the value of u at the points (x_i, y_j).) These points are shown as black disks in Figure 9.8. Because of this regular spacing, the points are said to make up a regular mesh. At each of these points, we approximate the partial derivatives with finite differences. For example,

\[
\frac{\partial^2 u}{\partial x^2}(x_i, y_j) \approx
\frac{u(x_{i+1}, y_j) - 2u(x_i, y_j) + u(x_{i-1}, y_j)}{h^2}.
\]

If we now let u_{i,j} stand for our approximation to the solution of Equation 9.3.1 at the point (x_i, y_j), we have the following set of simultaneous linear equations for the values of u:

\[
\frac{u_{i+1,j} - 2u_{i,j} + u_{i-1,j}}{h^2} +
\frac{u_{i,j+1} - 2u_{i,j} + u_{i,j-1}}{h^2} = f(x_i, y_j). \qquad (9.3.3)
\]

Figure 9.8: Domain and 9 x 9 computational mesh for approximating the solution to the Poisson problem. (The original figure labels three process strips, rank = 0, rank = 1, and rank = 2, and marks a mesh point i,j together with its neighbors i+1,j; i-1,j; i,j+1; and i,j-1.)

For values of u along the boundary (e.g., at x = 0 or y = 1), the value of the boundary condition g is used. If h = 1/(n + 1) (so there are n x n points in the interior of the mesh), this gives us n^2 simultaneous linear equations to solve.

Many methods can be used to solve these equations. In fact, if you have this particular problem, you should use one of the numerical libraries described in Table 10.1. In this section, we describe a very simple (and inefficient) algorithm because, from a parallel computing perspective, it illustrates how to program more effective and general methods. The method that we use is called the Jacobi method for solving systems of linear equations. The Jacobi method computes successive approximations to the solution of Equation 9.3.3 by rewriting the equation as follows:

\[
u_{i+1,j} - 2u_{i,j} + u_{i-1,j} + u_{i,j+1} - 2u_{i,j} + u_{i,j-1} = h^2 f(x_i, y_j)
\]
\[
u_{i,j} = \frac{1}{4}\left( u_{i+1,j} + u_{i-1,j} + u_{i,j+1} + u_{i,j-1} - h^2 f_{i,j} \right). \qquad (9.3.4)
\]

Each step in the Jacobi iteration computes a new approximation u^{N+1}_{i,j} in terms of the surrounding values of u^N:

\[
u^{N+1}_{i,j} = \frac{1}{4}\left( u^N_{i+1,j} + u^N_{i-1,j} + u^N_{i,j+1} + u^N_{i,j-1} - h^2 f_{i,j} \right). \qquad (9.3.5)
\]

This is our algorithm for computing the approximation to the solution of the Poisson problem. We emphasize that the Jacobi method is a poor numerical method but that the same communication patterns apply to many finite difference, volume, or element discretizations solved by iterative techniques.

In the uniprocessor version of this algorithm, the solution u is represented by a two-dimensional array u[max_n][max_n], and the iteration is written as follows:

    double u[NX+2][NY+2], u_new[NX+2][NY+2], f[NX+2][NY+2];
    int i, j;
    ...
    for (i=1;i
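The loop above is cut off mid-statement in this copy. What follows is a minimal sketch of how the uniprocessor sweep presumably continues, written directly from Equation 9.3.5; the mesh spacing h does not appear in the declarations shown and is assumed to be defined elsewhere, so treat this as an illustration rather than the book's exact listing.

    /* one Jacobi sweep: update each interior point from the previous
       iterate u, writing the new values into u_new (Equation 9.3.5) */
    for (i = 1; i <= NX; i++)
        for (j = 1; j <= NY; j++)
            u_new[i][j] = 0.25 * (u[i+1][j] + u[i-1][j] +
                                  u[i][j+1] + u[i][j-1] -
                                  h * h * f[i][j]);

    /* copy u_new back into u (or swap the roles of the two arrays)
       and repeat until the difference between iterates is small    */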