TRANSCRIPT
Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2017
Lecture 26:
The Future of High-Performance Computing
Comparing Two Large-Scale Systems
⬛ Oakridge Titan
  ▪ Monolithic supercomputer (3rd fastest in the world)
  ▪ Designed for compute-intensive applications
⬛ Google Data Center
  ▪ Servers to support millions of customers
  ▪ Designed for data collection, storage, and analysis
Computing Landscape
[Figure: computing landscape plotted by data intensity (vertical) vs. computational intensity (horizontal). It spans personal computing, internet-scale computing (web search, mapping / directions, language translation, video streaming), cloud services, and modeling & simulation-driven science & engineering (traditional supercomputing), with the Google Data Center and Oakridge Titan marking the two extremes.]
Supercomputing Landscape
[Figure: the same data-intensity vs. computational-intensity landscape, highlighting the modeling & simulation-driven science & engineering corner occupied by Oakridge Titan.]
Supercomputer Applications
⬛ Simulation-Based Modeling
  ▪ System structure + initial conditions + transition behavior
  ▪ Discretize time and space
  ▪ Run simulation to see what happens
⬛ Requirements
  ▪ Model accurately reflects the actual system
  ▪ Simulation faithfully captures the model
⬛ Example domains: science, industrial products, public health
Titan Hardware
⬛ Each Node
  ▪ AMD 16-core processor
  ▪ NVIDIA graphics processing unit
  ▪ 38 GB DRAM
  ▪ No disk drive
⬛ Overall
  ▪ 7 MW, $200M
[Diagram: nodes 1 through 18,688, each pairing a CPU with a GPU, connected by a local network.]
Titan Node Structure: CPU
⬛ CPU
  ▪ 16 cores sharing common memory
  ▪ Supports multithreaded programming
  ▪ ~0.16 × 10^12 floating-point operations per second (FLOPS) peak performance
[Diagram: 16 cores sharing DRAM memory.]
Titan Node Structure: GPU
⬛ Kepler GPU
  ▪ 14 multiprocessors
  ▪ Each with 12 groups of 16 stream processors: 14 × 12 × 16 = 2688
  ▪ Single-instruction, multiple-data parallelism: a single instruction controls all processors in a group
  ▪ 4.0 × 10^12 FLOPS peak performance (see the back-of-envelope check below)
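A quick check on where the peak figure comes from. The ~732 MHz core clock and 2 FLOPs per cycle per stream processor (one fused multiply-add) are assumptions about the Kepler K20X, not stated on the slide:

```python
# Back-of-envelope peak FLOPS for the Kepler GPU described above.
stream_processors = 14 * 12 * 16      # = 2688, as on the slide
clock_hz = 732e6                      # assumed K20X core clock
flops_per_cycle = 2                   # one fused multiply-add per core per cycle
print(stream_processors * clock_hz * flops_per_cycle)   # ~3.9e12, matching the quoted 4.0e12
```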
Titan Programming: Principle
⬛ Solving a Problem Over a Grid
  ▪ E.g., a finite-element system
  ▪ Simulate operation over time
⬛ Bulk Synchronous Model
  ▪ Partition into regions: p regions for a p-node machine
  ▪ Map one region per processor
Titan Programming: Principle (cont.)
⬛ Bulk Synchronous Model
  ▪ Map one region per processor
  ▪ Alternate between two phases (see the sketch below):
    ▪ All nodes compute the behavior of their region (performed on the GPUs)
    ▪ All nodes communicate values at the boundaries
[Diagram: processors P1-P5 alternating compute and communicate phases.]
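A minimal sketch of one bulk-synchronous step using mpi4py, with the grid split into one region per MPI rank along one dimension. The region size, step count, and the trivial compute_region() update are made up for illustration; this is not the actual Titan application code, which would drive CUDA kernels and OpenMP threads inside each node:

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

region = np.zeros((64, 64))           # this node's partition of the grid
halo_left = np.empty(64)              # boundary values received from neighbors
halo_right = np.empty(64)
left, right = (rank - 1) % size, (rank + 1) % size

def compute_region(r):
    # placeholder for the real physics update (run on the GPU on Titan)
    r[:, 1:-1] = 0.5 * (r[:, :-2] + r[:, 2:])

for step in range(100):
    compute_region(region)            # phase 1: all nodes compute their region

    # phase 2: all nodes exchange boundary values with neighbors.
    # Sendrecv pairs each send with a receive so the ring does not deadlock.
    comm.Sendrecv(np.ascontiguousarray(region[:, -1]), dest=right,
                  recvbuf=halo_left, source=left)
    comm.Sendrecv(np.ascontiguousarray(region[:, 0]), dest=left,
                  recvbuf=halo_right, source=right)
```

Launched with something like `mpiexec -n 4 python step.py`; every rank advances in lockstep because no one can start the next compute phase until its boundary exchange completes.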
Bulk Synchronous Performance
⬛ Limited by the performance of the slowest processor (formalized below)
⬛ Strive to keep perfectly balanced
  ▪ Engineer hardware to be highly reliable
  ▪ Tune software to make it as regular as possible
  ▪ Eliminate "noise": operating system events, extraneous network activity
[Diagram: processors P1-P5 alternating compute and communicate phases; each communicate phase waits for the slowest processor to finish computing.]
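One way to state the constraint (my notation, not from the slide): because every node waits at each communication phase, the time per step is set by the slowest node,

```latex
T_{\text{step}} \;=\; \max_{i} T^{\text{compute}}_{i} \;+\; T^{\text{comm}}
```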
Titan Programming: Reality
⬛ System Level
  ▪ Message Passing Interface (MPI) supports node computation, synchronization, and communication
⬛ Node Level
  ▪ OpenMP supports thread-level operation of the node CPU
  ▪ CUDA programming environment for the GPUs
  ▪ Performance degrades quickly without perfect balance among memories and processors
⬛ Result
  ▪ A single program is a complex combination of multiple programming paradigms
  ▪ Tends to be optimized for a specific hardware configuration
MPI Fault Tolerance
⬛ Checkpoint
  ▪ Periodically store the state of all processes
  ▪ Significant I/O traffic
⬛ Restore
  ▪ When a failure occurs, reset state to that of the last checkpoint
  ▪ All intervening computation is wasted
⬛ Performance Scaling
  ▪ Very sensitive to the number of failing components
(See the checkpoint/restore sketch below.)
[Diagram: P1-P5 compute & communicate, checkpoint, compute & communicate; a restore rewinds to the checkpoint, discarding the wasted computation.]
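A minimal sketch of the checkpoint/restore idea. File name, interval, and the placeholder step function are illustrative, not MPI's actual fault-tolerance machinery; the point is that state is written periodically and a restart rewinds to the last saved step:

```python
import os
import pickle

CKPT_FILE = "checkpoint.pkl"
CHECKPOINT_EVERY = 100

def compute_and_communicate(state):
    # placeholder for one bulk-synchronous step
    return state + 1

def save_checkpoint(step, state):
    with open(CKPT_FILE, "wb") as f:      # significant I/O traffic at scale
        pickle.dump((step, state), f)

def load_checkpoint():
    if os.path.exists(CKPT_FILE):         # restore: rewind to the last checkpoint
        with open(CKPT_FILE, "rb") as f:
            return pickle.load(f)
    return 0, 0                           # fresh start

step, state = load_checkpoint()
while step < 10_000:
    state = compute_and_communicate(state)
    step += 1
    if step % CHECKPOINT_EVERY == 0:
        save_checkpoint(step, state)      # everything since the last save is lost on failure
```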
Supercomputer Programming Model
⬛ Program on top of bare hardware
⬛ Performance
  ▪ Low-level programming to maximize node performance
  ▪ Keep everything globally synchronized and balanced
⬛ Reliability
  ▪ A single failure causes major delay
  ▪ Engineer hardware to minimize failures
[Diagram: application programs and software packages sit on a machine-dependent programming model, directly on the hardware.]
Data-Intensive Computing Landscape
[Figure: the data-intensity vs. computational-intensity landscape, highlighting the data-intensive side (personal computing, cloud services, and internet-scale computing with web search, mapping / directions, language translation, and video streaming), anchored by the Google Data Center.]
Internet Computing
⬛ Web Search
  ▪ Aggregate text data from across the WWW
  ▪ No definition of correct operation
  ▪ Does not need real-time updating
⬛ Mapping Services
  ▪ Huge amount of (relatively) static data
  ▪ Each customer requires individualized computation
⬛ Online Documents
  ▪ Must be stored reliably
  ▪ Must support real-time updating
  ▪ (Relatively) small data volumes
Other Data-Intensive Computing Applications
⬛ Wal-Mart
  ▪ 267 million items/day, sold at 6,000 stores
  ▪ HP built them a 4 PB data warehouse
  ▪ Mine the data to manage the supply chain, understand market trends, and formulate pricing strategies
⬛ LSST
  ▪ Chilean telescope will scan the entire sky every 3 days
  ▪ A 3.2-gigapixel digital camera
  ▪ Generates 30 TB/day of image data
Data-Intensive Application Characteristics
⬛ Diverse Classes of Data
  ▪ Structured & unstructured
  ▪ High & low integrity requirements
⬛ Diverse Computing Needs
  ▪ Localized & global processing
  ▪ Numerical & non-numerical
  ▪ Real-time & batch processing
Google Data Centers
⬛ The Dalles, Oregon
  ▪ Hydroelectric power @ 2¢/kWh
  ▪ 50 megawatts: enough to power 60,000 homes
⬛ Engineered for low cost, modularity, and power efficiency
⬛ Container: 1,160 server nodes, 250 kW
Google Cluster
⬛ Typically 1,000-2,000 nodes
⬛ Node Contains
  ▪ 2 multicore CPUs
  ▪ 2 disk drives
  ▪ DRAM
[Diagram: nodes 1 through n connected by a local network.]
Hadoop Project
⬛ File system with files distributed across nodes
  ▪ Stores multiple copies of each file (typically 3)
  ▪ If one node fails, data is still available
  ▪ Logically, any node has access to any file; may need to fetch it across the network
⬛ Map/Reduce programming environment
  ▪ Software manages execution of tasks on nodes
[Diagram: nodes 1 through n connected by a local network.]
Map/Reduce Operation
⬛ Characteristics
  ▪ Computation broken into many short-lived tasks: mapping, reducing (see the word-count sketch below)
  ▪ Tasks mapped onto processors dynamically
  ▪ Use disk storage to hold intermediate results
⬛ Strengths
  ▪ Flexibility in placement, scheduling, and load balancing
  ▪ Can access large data sets
⬛ Weaknesses
  ▪ Higher overhead
  ▪ Lower raw performance
[Diagram: successive waves of map and reduce tasks.]
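A minimal word count expressed as map and reduce tasks. Plain Python stands in for the many short-lived tasks a Hadoop scheduler would place dynamically across nodes; the function names are illustrative, not Hadoop's actual API:

```python
from collections import defaultdict

def map_task(document):
    # emit (word, 1) pairs for one input split
    return [(word, 1) for word in document.split()]

def reduce_task(word, counts):
    return word, sum(counts)

documents = ["the quick brown fox", "the lazy dog", "the fox"]

intermediate = defaultdict(list)
for doc in documents:                        # map phase: many independent tasks
    for word, count in map_task(doc):
        intermediate[word].append(count)     # shuffle: group values by key
                                             # (held on disk in real Hadoop)
results = [reduce_task(w, c) for w, c in intermediate.items()]   # reduce phase
print(sorted(results))
```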
Map/Reduce Fault Tolerance
⬛ Data Integrity
  ▪ Store multiple copies of each file
  ▪ Including the intermediate results of each map/reduce stage: continuous checkpointing
⬛ Recovering from Failure
  ▪ Simply recompute the lost result: localized effect
  ▪ Dynamic scheduler keeps all processors busy
⬛ Use software to build a reliable system on top of unreliable hardware
[Diagram: successive waves of map and reduce tasks.]
Cluster Programming Model
⬛ Application programs written in terms of high-level operations on data
⬛ Runtime system controls scheduling, load balancing, ...
⬛ Scaling Challenges
  ▪ Centralized scheduler forms a bottleneck
  ▪ Copying to/from disk is very costly
  ▪ Hard to limit data movement, a significant performance factor
[Diagram: application programs on top of a machine-independent programming model and runtime system, above the hardware.]
Recent Programming Systems
⬛ Spark Project
  ▪ At U.C. Berkeley
  ▪ Has grown a large open-source community
  ▪ (See the sketch below.)
⬛ GraphLab
  ▪ Started as a project at CMU by Carlos Guestrin
  ▪ Environment for describing machine-learning algorithms
  ▪ Sparse matrix structure described by a graph
  ▪ Computation based on updating node values
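The same word count written for Spark, as a sketch that assumes a local PySpark installation and an input file named documents.txt. The runtime keeps the intermediate RDDs in memory and handles scheduling and re-execution on failure:

```python
from pyspark import SparkContext

sc = SparkContext("local", "wordcount")
counts = (sc.textFile("documents.txt")            # one record per line
            .flatMap(lambda line: line.split())   # words
            .map(lambda word: (word, 1))          # (word, 1) pairs
            .reduceByKey(lambda a, b: a + b))     # sum counts per word
print(counts.collect())
```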
Computing Landscape Trends
[Figure: the landscape again; the annotation "mixing simulation with data analysis" marks traditional supercomputing (modeling & simulation-driven science & engineering) moving toward higher data intensity.]
Combining Simulation with Real Data
⬛ Limitations
  ▪ Simulation alone: hard to know whether the model is correct
  ▪ Data alone: hard to understand causality and "what if"
⬛ Combination
  ▪ Check and adjust the model during simulation
Real-Time Analytics
⬛ Millennium XXL Simulation (2010)
  ▪ 3 × 10^9 particles
  ▪ Simulation run of 9.3 days on 12,228 cores
  ▪ 700 TB total data generated
  ▪ Saved at only 4 time points: 70 TB
  ▪ Large-scale simulations generate large data sets
⬛ What If?
  ▪ Could perform data analysis while the simulation is running
[Diagram: simulation engine coupled to an analytic engine.]
Computing Landscape Trends
[Figure: the landscape again; the annotation "sophisticated data analysis" marks internet-scale computing (Google Data Center) moving toward higher computational intensity.]
Example Analytic Applications
⬛ Classifier: image → description (e.g., Microsoft Project Adam)
⬛ Transducer: English text → German text
Data Analysis with Deep Neural Networks
⬛ Task
  ▪ Compute the classification of a set of input signals
⬛ Training
  ▪ Use many training samples of the form input / desired output
  ▪ Compute weights that minimize classification error
⬛ Operation
  ▪ Propagate signals from input to output (see the sketch below)
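A minimal NumPy sketch of "propagate signals from input to output" for a small fully connected classifier. The layer sizes and random weights are made up; training would adjust the weights to minimize classification error:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = [rng.standard_normal((784, 256)) * 0.01,
           rng.standard_normal((256, 10)) * 0.01]
biases = [np.zeros(256), np.zeros(10)]

def classify(x):
    for W, b in zip(weights, biases):
        x = np.maximum(x @ W + b, 0)   # linear layer followed by ReLU
    return int(np.argmax(x))           # index of the strongest output signal

print(classify(rng.standard_normal(784)))
```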
Training DNNs
⬛ Characteristics
  ▪ Iterative numerical algorithm
  ▪ Regular data organization
⬛ Project Adam Training
  ▪ 2B connections
  ▪ 15M images
  ▪ 62 machines
  ▪ 10 days
[Figure: training effort grows as model size × training data volume.]
Trends
[Figure: both trends on one landscape: traditional supercomputing mixing simulation with real-world data, and the Google Data Center adding sophisticated data analysis. Convergence?]
Challenges for Convergence
⬛ Supercomputers
  ▪ Hardware: customized, optimized for reliability
  ▪ Run-time system: source of "noise", static scheduling
  ▪ Application programming: low-level, processor-centric model
⬛ Data Center Clusters
  ▪ Hardware: consumer grade, optimized for low cost
  ▪ Run-time system: provides reliability, dynamic allocation
  ▪ Application programming: high-level, data-centric model
Summary: Computation/Data Convergence
⬛ Two Important Classes of Large-Scale Computing
  ▪ Computationally intensive supercomputing
  ▪ Data-intensive processing: internet companies + many other applications
⬛ Followed Different Evolutionary Paths
  ▪ Supercomputers: get maximum performance from the available hardware
  ▪ Data center clusters: maximize cost/performance over a variety of data-centric tasks
  ▪ Yielded different approaches to hardware, runtime systems, and application programming
⬛ A Convergence Would Have Important Benefits
  ▪ Computational and data-intensive applications
  ▪ But it is not clear how to do it
World's Fastest Machines
⬛ Top500 Ranking: High-Performance LINPACK (HPL)
  ▪ Benchmark: solve an N × N linear system
  ▪ Some variant of Gaussian elimination
  ▪ 2/3 N^3 + O(N^2) operations (see the helper below)
  ▪ Vendor can choose N to give the best performance (in FLOPS)
⬛ Alternative: High-Performance Conjugate Gradient (HPCG)
  ▪ Solve a sparse linear system (≤ 27 nonzeros/row)
  ▪ Iterative method
  ▪ Higher communication/compute ratio
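A back-of-envelope helper based on the operation count above: the sustained FLOP rate for an HPL run that factors an N × N system in a given time. The example numbers are hypothetical; the official benchmark reports this figure itself:

```python
def hpl_flop_rate(n, seconds):
    ops = (2.0 / 3.0) * n**3          # 2/3 N^3 + O(N^2) operations
    return ops / seconds

# e.g., a hypothetical N = 10^7 run finishing in one hour:
print(hpl_flop_rate(1e7, 3600.0) / 1e15, "PFLOPS")
```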
Sunway TaihuLight
⬛ Wuxi, China
  ▪ Operational 2016
⬛ Machine
  ▪ Total machine has 40,960 processor chips
  ▪ Each processor chip contains 256 compute cores + 4 management cores
  ▪ Each core has a 4-wide SIMD vector unit: 8 FLOPS/clock cycle
⬛ Performance
  ▪ HPL: 93.0 PF (world's top)
  ▪ HPCG: 0.37 PF
  ▪ 15.4 MW
  ▪ 1.31 PB DRAM
⬛ Ratios (big is better; recomputed in the snippet below)
  ▪ GigaFLOPS/Watt: 6.0
  ▪ Bytes/FLOP: 0.014
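The two ratios follow directly from the raw figures on this slide; the same arithmetic applies to the machines on the next two slides:

```python
hpl = 93.0e15          # HPL: 93.0 PF
power = 15.4e6         # 15.4 MW
dram = 1.31e15         # 1.31 PB

print(hpl / power / 1e9)   # ~6.0  GigaFLOPS/Watt
print(dram / hpl)          # ~0.014 Bytes/FLOP
```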
Tianhe-2
⬛ Guangzhou, China
  ▪ Operational 2013
⬛ Machine
  ▪ Total machine has 16,000 nodes
  ▪ Each with 2 Intel Xeons + 3 Intel Xeon Phis
⬛ Performance
  ▪ HPL: 33.9 PF
  ▪ HPCG: 0.58 PF (world's best)
  ▪ 17.8 MW
  ▪ 1.02 PB DRAM
⬛ Ratios (big is better)
  ▪ GigaFLOPS/Watt: 1.9
  ▪ Bytes/FLOP: 0.030
Titan
⬛ Oak Ridge, TN
  ▪ Operational 2012
⬛ Machine
  ▪ Total machine has 18,688 nodes
  ▪ Each with a 16-core Opteron + Tesla K20X GPU
⬛ Performance
  ▪ HPL: 17.6 PF
  ▪ HPCG: 0.32 PF
  ▪ 8.2 MW
  ▪ 0.71 PB DRAM
⬛ Ratios (big is better)
  ▪ GigaFLOPS/Watt: 2.2
  ▪ Bytes/FLOP: 0.040
How Powerful is a Titan Node?
⬛ Titan Node
  ▪ CPU: Opteron 6274; Nov. 2011, 32 nm technology; 2.2 GHz; 16 cores (no hyperthreading); 16 MB L3 cache; 32 GB DRAM
  ▪ GPU: Kepler K20X; Feb. 2013, 28 nm; CUDA capability 3.5; 3.9 TF peak (SP)
⬛ GHC Machine
  ▪ CPU: Xeon E5-1660; June 2016, 14 nm technology; 3.2 GHz; 8 cores (2× hyperthreaded); 20 MB L3 cache; 32 GB DRAM
  ▪ GPU: GeForce GTX 1080; May 2016, 16 nm; CUDA capability 6.0; 8.2 TF peak (SP)
Performance of Top 500 Machines
⬛ From a presentation by Jack Dongarra
⬛ Machines are far off peak when performing HPCG
What Lies Ahead
⬛ DOE CORAL Program
  ▪ Announced Nov. 2014
  ▪ Delivery in 2018
⬛ Vendor #1
  ▪ IBM + NVIDIA + Mellanox
  ▪ 3,400 nodes
  ▪ 10 MW
  ▪ 150-300 PF peak
⬛ Vendor #2
  ▪ Intel + Cray
  ▪ ~50,000 nodes (Xeon Phis)
  ▪ 13 MW
  ▪ >180 PF peak
Moore's Law
⬛ Basis for ever-increasing computer power
⬛ We've come to expect it will continue
Challenges to Moore's Law: Technical
⬛ Must continue to shrink feature sizes
  ▪ Approaching atomic scale
  ▪ 2022: transistors with 4 nm feature size
  ▪ Si lattice spacing: 0.54 nm
⬛ Difficulties
  ▪ Lithography at such small dimensions
  ▪ Statistical variations among devices
Challenges to Moore's Law: Economic
⬛ Growing Capital Costs
  ▪ State-of-the-art fab line: ~$20B
  ▪ Must have very high volumes to amortize the investment
  ▪ Has led to major consolidations
Dennard Scaling
⬛ Due to Robert Dennard, IBM, 1974
⬛ Quantifies the benefits of Moore's Law
⬛ How to Shrink an IC Process
  ▪ Reduce horizontal and vertical dimensions by k
  ▪ Reduce voltage by k
⬛ Outcomes
  ▪ Devices/chip increase by k^2
  ▪ Clock frequency increases by k
  ▪ Power/chip constant (derivation sketched below)
⬛ Significance
  ▪ Increased capacity and performance
  ▪ No increase in power
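A sketch of why power per chip stays constant under these scaling rules, using standard dynamic-power reasoning that is not spelled out on the slide: shrinking dimensions by k reduces device capacitance to C/k and voltage to V/k, while frequency rises to kf, so

```latex
P_{\text{device}} \propto C V^{2} f
  \;\longrightarrow\; \frac{C}{k}\cdot\frac{V^{2}}{k^{2}}\cdot k f
  \;=\; \frac{C V^{2} f}{k^{2}},
\qquad
P_{\text{chip}} \propto k^{2} \cdot \frac{C V^{2} f}{k^{2}} \;=\; C V^{2} f .
```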
End of Dennard Scaling
⬛ What Happened?
  ▪ Can't drop voltage below ~1 V
  ▪ Reached the limit of power/chip in 2004
  ▪ More logic on chip (Moore's Law), but can't make it run faster
  ▪ Response has been to increase cores/chip
Research Challenges
⬛ Supercomputers
  ▪ Can they be made more dynamic and adaptive? A requirement for future scalability
  ▪ Can they be made easier to program? Abstract, machine-independent programming models
⬛ Data-Intensive Computing
  ▪ Can it be adapted to provide better computational performance?
  ▪ Can it make better use of data locality? A performance- and power-limiting factor
⬛ Technology/Economic
  ▪ What will we do when Moore's Law comes to an end for CMOS?
  ▪ How can we ensure a stable manufacturing environment?