
Parallel Computer Architecture and Programming
CMU 15-418/15-618, Spring 2017

Lecture 26: The Future of High-Performance Computing


Comparing Two Large-Scale Systems

⬛ Oak Ridge Titan
▪ Monolithic supercomputer (3rd fastest in the world)
▪ Designed for compute-intensive applications

⬛ Google Data Center
■ Servers to support millions of customers
■ Designed for data collection, storage, and analysis


Computing Landscape

[Chart: data intensity vs. computational intensity. Regions: Internet-Scale Computing (Google Data Center: web search, mapping/directions, language translation, video streaming), Cloud Services, Personal Computing, and Traditional Supercomputing (Oak Ridge Titan: modeling & simulation-driven science & engineering).]


Supercomputing Landscape

[Chart: the supercomputing corner of the landscape: modeling & simulation-driven science & engineering on the Oak Ridge Titan (high computational intensity, low data intensity).]


Supercomputer Applications

⬛ Simulation-Based Modeling
▪ System structure + initial conditions + transition behavior
▪ Discretize time and space
▪ Run simulation to see what happens

⬛ Requirements
▪ Model accurately reflects actual system
▪ Simulation faithfully captures model

[Example images: science, industrial products, public health]
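As a concrete illustration of "discretize time and space, run simulation to see what happens", here is a minimal sketch (not from the lecture) of an explicit finite-difference time-stepper for 1D heat diffusion. The grid size, time step, and diffusion constant are arbitrary illustrative choices.

import numpy as np

# Minimal sketch: discretize space into a 1D grid and step the heat
# equation du/dt = alpha * d2u/dx2 forward in time (explicit Euler).
# All parameters are illustrative, not taken from the lecture.
nx, nt = 100, 500              # grid points, time steps
dx, dt, alpha = 1.0, 0.1, 1.0
assert alpha * dt / dx**2 <= 0.5   # stability condition for the explicit scheme

u = np.zeros(nx)
u[nx // 2] = 100.0             # initial condition: a hot spot in the middle

for _ in range(nt):
    # Transition behavior: each interior cell relaxes toward its neighbors.
    u[1:-1] += alpha * dt / dx**2 * (u[2:] - 2.0 * u[1:-1] + u[:-2])

print("peak value after simulation:", u.max())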


Titan Hardware

⬛ Each Node
▪ AMD 16-core processor
▪ nVidia graphics processing unit
▪ 38 GB DRAM
▪ No disk drive

⬛ Overall
▪ 7 MW, $200M

[Diagram: Nodes 1 through 18,688, each pairing a CPU with a GPU, connected by a local network]


Titan Node Structure: CPU

⬛ CPU
▪ 16 cores sharing common memory
▪ Supports multithreaded programming
▪ ~0.16 × 10^12 floating-point operations per second (FLOPS) peak performance

[Diagram: 16 cores sharing DRAM memory]


Titan Node Structure: GPU

⬛ Kepler GPU
▪ 14 multiprocessors
▪ Each with 12 groups of 16 stream processors
▪ 14 × 12 × 16 = 2,688 stream processors total
▪ Single-Instruction, Multiple-Data parallelism
▪ Single instruction controls all processors in a group
▪ 4.0 × 10^12 FLOPS peak performance
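Putting the per-node figures together with the 18,688-node count from the Titan Hardware slide gives a rough sense of the aggregate scale. A quick back-of-the-envelope check, using only numbers quoted on these slides:

# Aggregate totals computed from the per-node figures above.
nodes = 18_688
cpu_cores_per_node = 16
stream_procs_per_gpu = 14 * 12 * 16        # multiprocessors x groups x stream processors
dram_per_node_gb = 38

print("CPU cores:", nodes * cpu_cores_per_node)               # 299,008
print("GPU stream processors:", nodes * stream_procs_per_gpu) # ~50.2 million
print("Total DRAM (PB): %.2f" % (nodes * dram_per_node_gb / 1e6))  # ~0.71 PB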


Titan Programming: Principle

⬛ Solving Problem Over Grid
▪ E.g., finite-element system
▪ Simulate operation over time

⬛ Bulk Synchronous Model
▪ Partition into regions
▪ p regions for p-node machine
▪ Map region per processor


Titan Programming: Principle (cont)

⬛ Bulk Synchronous Model
▪ Map region per processor
▪ Alternate (see the sketch below):
▪ All nodes compute behavior of their region (performed on GPUs)
▪ All nodes communicate values at the boundaries

[Diagram: P1–P5 alternating compute phases and communicate phases]
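To make the compute/communicate alternation concrete, here is a minimal bulk-synchronous sketch using mpi4py (my choice for illustration; Titan codes would typically use MPI from C/C++, with CUDA kernels doing the compute phase). Each rank owns one region of a 1D grid, updates it locally, then exchanges boundary (halo) values with its neighbors; the per-step message exchange is what keeps all ranks in lockstep.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Each rank owns one region of a 1D grid, plus one ghost cell on each side.
n_local = 1000
u = np.zeros(n_local + 2)
u[1:-1] = rank                      # arbitrary initial condition

left = rank - 1 if rank > 0 else MPI.PROC_NULL
right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

for step in range(100):
    # Compute phase: update interior cells (on Titan this would run on the GPU).
    u[1:-1] = 0.5 * (u[:-2] + u[2:])

    # Communicate phase: exchange boundary values with neighboring ranks.
    comm.Sendrecv(sendbuf=u[1:2], dest=left, recvbuf=u[-1:], source=right)
    comm.Sendrecv(sendbuf=u[-2:-1], dest=right, recvbuf=u[0:1], source=left)
    # The exchange implicitly synchronizes the ranks each step.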


Bulk Synchronous Performance

▪ Limited by performance of the slowest processor

⬛ Strive to keep perfectly balanced
▪ Engineer hardware to be highly reliable
▪ Tune software to make it as regular as possible
▪ Eliminate “noise”
▪ Operating system events
▪ Extraneous network activity

[Diagram: P1–P5 compute/communicate timeline; every communicate phase waits for the slowest processor]


Titan Programming: Reality

⬛ System Level
▪ Message Passing Interface (MPI) supports node computation, synchronization, and communication

⬛ Node Level
▪ OpenMP supports thread-level operation of the node CPU
▪ CUDA programming environment for the GPUs
▪ Performance degrades quickly without perfect balance among memories and processors

⬛ Result
▪ A single program is a complex combination of multiple programming paradigms
▪ Tends to be optimized for a specific hardware configuration


MPI Fault Tolerance

⬛ Checkpoint
▪ Periodically store state of all processes
▪ Significant I/O traffic

⬛ Restore
▪ When failure occurs
▪ Reset state to that of last checkpoint
▪ All intervening computation wasted

⬛ Performance Scaling
▪ Very sensitive to number of failing components

[Diagram: P1–P5 compute & communicate, checkpoint, compute & communicate; a failure triggers a restore, and the computation since the last checkpoint is wasted]
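A minimal sketch of the checkpoint/restore idea (illustrative only; a real MPI application would checkpoint every rank's state to a parallel file system and restart all ranks together). The file name and checkpoint interval are hypothetical.

import os, pickle

CKPT = "state.ckpt"            # hypothetical checkpoint file
CHECKPOINT_EVERY = 100         # steps between checkpoints (arbitrary)

def save_checkpoint(step, state):
    # Checkpoint: periodically store the state (significant I/O traffic).
    with open(CKPT, "wb") as f:
        pickle.dump((step, state), f)

def load_checkpoint():
    # Restore: reset to the last checkpoint; work done since then is wasted.
    with open(CKPT, "rb") as f:
        return pickle.load(f)

step, state = load_checkpoint() if os.path.exists(CKPT) else (0, {"x": 0.0})

while step < 10_000:
    state["x"] += 1.0                      # stand-in for compute & communicate
    step += 1
    if step % CHECKPOINT_EVERY == 0:
        save_checkpoint(step, state)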


Supercomputer Programming Model

▪ Program on top of bare hardware

⬛ Performance
▪ Low-level programming to maximize node performance
▪ Keep everything globally synchronized and balanced

⬛ Reliability
▪ Single failure causes major delay
▪ Engineer hardware to minimize failures

[Stack diagram: Application Programs → Software Packages → Machine-Dependent Programming Model → Hardware]


Data-Intensive Computing Landscape

[Chart: the data-intensive side of the landscape: Internet-Scale Computing (Google Data Center: web search, mapping/directions, language translation, video streaming), Cloud Services, and Personal Computing.]


Internet Computing

⬛ Web Search
▪ Aggregate text data from across the WWW
▪ No definition of correct operation
▪ Do not need real-time updating

⬛ Mapping Services
▪ Huge amount of (relatively) static data
▪ Each customer requires individualized computation

⬛ Online Documents
■ Must be stored reliably
■ Must support real-time updating
■ (Relatively) small data volumes


Other Data-Intensive Computing Applications

⬛ Wal-Mart
▪ 267 million items/day, sold at 6,000 stores
▪ HP built them a 4 PB data warehouse
▪ Mine data to manage supply chain, understand market trends, formulate pricing strategies

⬛ LSST
▪ Chilean telescope will scan the entire sky every 3 days
▪ A 3.2-gigapixel digital camera
▪ Generates 30 TB/day of image data


Data-Intensive Application Characteristics

⬛ Diverse Classes of Data
▪ Structured & unstructured
▪ High & low integrity requirements

⬛ Diverse Computing Needs
▪ Localized & global processing
▪ Numerical & non-numerical
▪ Real-time & batch processing


Google Data Centers

⬛ The Dalles, Oregon
▪ Hydroelectric power @ 2¢/kW-hr
▪ 50 megawatts
▪ Enough to power 60,000 homes
■ Engineered for low cost, modularity & power efficiency
■ Container: 1,160 server nodes, 250 kW


Google Cluster

▪ Typically 1,000–2,000 nodes

⬛ Node Contains
▪ 2 multicore CPUs
▪ 2 disk drives
▪ DRAM

[Diagram: Nodes 1 through n, each with a CPU, connected by a local network]


Hadoop Project

⬛ File system with files distributed across nodes
▪ Store multiple copies of each file (typically 3)
▪ If one node fails, data is still available
▪ Logically, any node has access to any file
▪ May need to fetch across the network

⬛ Map/Reduce programming environment
▪ Software manages execution of tasks on nodes

[Diagram: Nodes 1 through n connected by a local network]
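A toy sketch of the replication idea (not Hadoop's actual placement policy, which also spreads copies across racks): each block is assigned to three distinct nodes, so any single node failure leaves the data available.

import random

NUM_NODES = 1000               # illustrative cluster size
REPLICAS = 3                   # typical replication factor

def place_block(block_id, num_nodes=NUM_NODES, replicas=REPLICAS):
    # Choose `replicas` distinct nodes to hold copies of this block.
    rng = random.Random(block_id)          # deterministic per block id
    return rng.sample(range(num_nodes), replicas)

def readable_after_failure(placement, failed_node):
    # A block stays readable as long as at least one replica survives.
    return any(node != failed_node for node in placement)

placement = place_block("file1-block0")
print("replicas on nodes:", placement)
print("still readable if node %d fails: %s"
      % (placement[0], readable_after_failure(placement, placement[0])))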


Map/Reduce Operation

⬛ Characteristics
▪ Computation broken into many short-lived tasks
▪ Mapping, reducing
▪ Tasks mapped onto processors dynamically
▪ Use disk storage to hold intermediate results

⬛ Strengths
▪ Flexibility in placement, scheduling, and load balancing
▪ Can access large data sets

⬛ Weaknesses
▪ Higher overhead
▪ Lower raw performance

[Diagram: successive Map and Reduce stages of a Map/Reduce job]
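The canonical Map/Reduce illustration is word count. Below is a minimal in-process sketch of the map, shuffle, and reduce phases; it omits the distribution, disk-based intermediate storage, and dynamic scheduling that Hadoop provides, and only shows the shape of the computation.

from collections import defaultdict

def map_phase(document):
    # Map task: emit (key, value) pairs; here (word, 1) for each word.
    for word in document.split():
        yield word.lower(), 1

def reduce_phase(key, values):
    # Reduce task: combine all values that share a key.
    return key, sum(values)

documents = ["the future of high performance computing",
             "high performance computing at scale"]

# Shuffle: group intermediate pairs by key (Hadoop does this via disk and network).
groups = defaultdict(list)
for doc in documents:
    for key, value in map_phase(doc):
        groups[key].append(value)

counts = dict(reduce_phase(k, v) for k, v in groups.items())
print(counts["computing"])     # -> 2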


Map/Reduce Fault Tolerance

⬛ Data Integrity
▪ Store multiple copies of each file
▪ Including intermediate results of each Map/Reduce
▪ Continuous checkpointing

⬛ Recovering from Failure
▪ Simply recompute the lost result
▪ Localized effect
▪ Dynamic scheduler keeps all processors busy

⬛ Use software to build a reliable system on top of unreliable hardware

[Diagram: successive Map and Reduce stages of a Map/Reduce job]


Cluster Programming Model

▪ Application programs written in terms of high-level operations on data
▪ Runtime system controls scheduling, load balancing, ...

⬛ Scaling Challenges
▪ Centralized scheduler forms a bottleneck
▪ Copying to/from disk very costly
▪ Hard to limit data movement
▪ Significant performance factor

[Stack diagram: Application Programs → Machine-Independent Programming Model → Runtime System → Hardware]


Recent Programming Systems

⬛ Spark Project
▪ At U.C. Berkeley
▪ Grown to have a large open-source community

⬛ GraphLab
▪ Started as a project at CMU by Carlos Guestrin
▪ Environment for describing machine-learning algorithms
▪ Sparse matrix structure described by a graph
▪ Computation based on updating of node values
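For flavor, here is a minimal PySpark word count in the "high-level operations on data" style; the input path is hypothetical, and the runtime system decides where the map and reduce tasks actually run (and recomputes lost partitions after a failure).

from pyspark import SparkContext

sc = SparkContext(appName="WordCount")

# Word count expressed as high-level operations on a distributed dataset (RDD).
# "hdfs:///data/docs.txt" is a hypothetical input path.
counts = (sc.textFile("hdfs:///data/docs.txt")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

print(counts.take(10))
sc.stop()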


Computing Landscape Trends

[Chart: Traditional Supercomputing (modeling & simulation-driven science & engineering) moving toward higher data intensity: mixing simulation with data analysis.]


Combining Simulation with Real Data

⬛ Limitations
▪ Simulation alone: hard to know if the model is correct
▪ Data alone: hard to understand causality & “what if”

⬛ Combination
▪ Check and adjust the model during simulation


Real-Time Analytics

⬛ Millennium XXL Simulation (2010)
▪ 3 × 10^9 particles
▪ Simulation run of 9.3 days on 12,228 cores
▪ 700 TB total data generated
▪ Saved at only 4 time points (70 TB)
▪ Large-scale simulations generate large data sets

⬛ What If?
▪ Could perform data analysis while the simulation is running

[Diagram: simulation engine coupled to an analytic engine]
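A minimal sketch of the "analyze while the simulation runs" (in-situ analytics) idea; both the simulation step and the analysis are stand-ins. Instead of saving a handful of full snapshots, the analytic engine consumes each time step as it is produced and keeps only small summaries.

import numpy as np

def simulation_step(state, rng):
    # Stand-in for one time step of a particle simulation.
    return state + 0.01 * rng.standard_normal(state.shape)

def analyze(step, state):
    # Stand-in analytic engine: keep a small summary instead of raw snapshots.
    return {"step": step, "mean": float(state.mean()), "spread": float(state.std())}

rng = np.random.default_rng(0)
state = rng.standard_normal((100_000, 3))     # toy "particles"

summaries = []
for step in range(100):
    state = simulation_step(state, rng)
    summaries.append(analyze(step, state))    # in-situ analysis, every step

print(summaries[-1])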


Computing Landscape Trends

[Chart: Internet-Scale Computing (Google Data Center) moving toward higher computational intensity: sophisticated data analysis.]


Example Analytic Applications

[Diagrams: a classifier maps an image to a description (Microsoft Project Adam); a transducer maps English text to German text]


Data Analysis with Deep Neural Networks

⬛ Task
▪ Compute classification of a set of input signals

⬛ Training
■ Use many training samples of the form input / desired output
■ Compute weights that minimize classification error

⬛ Operation
■ Propagate signals from input to output
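A minimal NumPy sketch of "compute weights that minimize classification error" for a tiny single-layer network on made-up data. Real DNNs such as Project Adam have billions of connections and train across many machines, but the loop has the same shape: propagate signals forward, then adjust weights to reduce the error.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))             # 200 training samples, 5 input signals
y = (X @ np.array([1.0, -2.0, 0.5, 0.0, 1.0]) > 0).astype(float)   # made-up labels

w, b, lr = np.zeros(5), 0.0, 0.5              # weights, bias, learning rate

for epoch in range(200):
    # Operation: propagate signals from input to output.
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    # Training: gradient descent on the cross-entropy classification error.
    w -= lr * (X.T @ (p - y)) / len(y)
    b -= lr * float(np.mean(p - y))

print("training accuracy: %.2f" % np.mean((p > 0.5) == (y > 0.5)))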


DNN Application Example

⬛ Facebook DeepFace Architecture


Training DNNs

⬛ Characteristics
▪ Iterative numerical algorithm
▪ Regular data organization

⬛ Project Adam Training
■ 2B connections
■ 15M images
■ 62 machines
■ 10 days

[Figure: Model Size × Training Data ➔ Training Effort]


Trends

[Chart: Traditional Supercomputing (modeling & simulation-driven science & engineering) mixing simulation with real-world data; Internet-Scale Computing (Google Data Center) adding sophisticated data analysis. Convergence?]


Challenges for Convergence

⬛ Supercomputers
■ Hardware: customized, optimized for reliability
■ Run-time system: source of “noise”, static scheduling
■ Application programming: low-level, processor-centric model

⬛ Data Center Clusters
■ Hardware: consumer grade, optimized for low cost
■ Run-time system: provides reliability, dynamic allocation
■ Application programming: high-level, data-centric model


Summary: Computation/Data Convergence

⬛ Two Important Classes of Large-Scale Computing
▪ Computationally intensive supercomputing
▪ Data-intensive processing
▪ Internet companies + many other applications

⬛ Followed Different Evolutionary Paths
▪ Supercomputers: get maximum performance from available hardware
▪ Data center clusters: maximize cost/performance over a variety of data-centric tasks
▪ Yielded different approaches to hardware, runtime systems, and application programming

⬛ A Convergence Would Have Important Benefits
▪ Computational and data-intensive applications
▪ But not clear how to do it


GETTING TO EXASCALE


World’s Fastest Machines

⬛ Top500 Ranking: High-Performance LINPACK (HPL)
▪ Benchmark: solve an N × N linear system
▪ Some variant of Gaussian elimination
▪ (2/3)N^3 + O(N^2) operations
▪ Vendor can choose N to give best performance (in FLOPS)

⬛ Alternative: High-Performance Conjugate Gradient (HPCG)
▪ Solve a sparse linear system (≤ 27 nonzeros/row)
▪ Iterative method
▪ Higher communication/compute ratio
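The FLOPS reported for an HPL run follow directly from the operation count: roughly (2/3)N^3 operations divided by the elapsed time. A quick illustrative calculation (the N and runtime below are made up, not taken from any ranked machine):

# FLOPS for an HPL run ~ (2/3) N^3 operations / elapsed time.
N = 10_000_000             # problem dimension chosen by the vendor (made up)
seconds = 8 * 3600         # elapsed wall-clock time (made up)

operations = (2.0 / 3.0) * N**3
flops = operations / seconds
print("sustained PFLOPS: %.1f" % (flops / 1e15))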


Sunway TaihuLight

⬛ Wuxi, China
▪ Operational 2016

⬛ Machine
▪ Total machine has 40,960 processor chips
▪ Each processor chip contains 256 compute cores + 4 management cores
▪ Each has a 4-wide SIMD vector unit
▪ 8 FLOPS/clock cycle

⬛ Performance
▪ HPL: 93.0 PF (world’s top)
▪ HPCG: 0.37 PF
▪ 15.4 MW
▪ 1.31 PB DRAM

⬛ Ratios (big is better)
▪ GigaFLOPS/Watt: 6.0
▪ Bytes/FLOP: 0.014


Tianhe-2

⬛ Guangzhou, China
▪ Operational 2013

⬛ Machine
▪ Total machine has 16,000 nodes
▪ Each with 2 Intel Xeons + 3 Intel Xeon Phis

⬛ Performance
▪ HPL: 33.9 PF
▪ HPCG: 0.58 PF (world’s best)
▪ 17.8 MW
▪ 1.02 PB DRAM

⬛ Ratios (big is better)
▪ GigaFLOPS/Watt: 1.9
▪ Bytes/FLOP: 0.030


Titan

⬛ Oak Ridge, TN
▪ Operational 2012

⬛ Machine
▪ Total machine has 18,688 nodes
▪ Each with a 16-core Opteron + Tesla K20X GPU

⬛ Performance
▪ HPL: 17.6 PF
▪ HPCG: 0.32 PF
▪ 8.2 MW
▪ 0.71 PB DRAM

⬛ Ratios (big is better)
▪ GigaFLOPS/Watt: 2.2
▪ Bytes/FLOP: 0.040
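The two ratios quoted on the last three slides are simple quotients of the HPL performance, power, and DRAM figures; recomputing them makes the definitions explicit (the results match the slides up to rounding).

# GigaFLOPS/Watt and Bytes/FLOP, recomputed from the quoted figures.
machines = {
    #              HPL (PF)  power (MW)  DRAM (PB)
    "TaihuLight": (93.0,     15.4,       1.31),
    "Tianhe-2":   (33.9,     17.8,       1.02),
    "Titan":      (17.6,      8.2,       0.71),
}

for name, (hpl_pf, mw, dram_pb) in machines.items():
    gflops_per_watt = (hpl_pf * 1e15) / (mw * 1e6) / 1e9
    bytes_per_flop = (dram_pb * 1e15) / (hpl_pf * 1e15)
    print("%-10s  %.2f GFLOPS/W  %.3f Bytes/FLOP"
          % (name, gflops_per_watt, bytes_per_flop))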


How Powerful is a Titan Node?

⬛ Titan
▪ CPU: Opteron 6274 (Nov. 2011, 32 nm technology); 2.2 GHz; 16 cores (no hyperthreading); 16 MB L3 cache; 32 GB DRAM
▪ GPU: Kepler K20X (Feb. 2013, 28 nm); CUDA capability 3.5; 3.9 TF peak (SP)

⬛ GHC Machine
▪ CPU: Xeon E5-1660 (June 2016, 14 nm technology); 3.2 GHz; 8 cores (2× hyperthreaded); 20 MB L3 cache; 32 GB DRAM
▪ GPU: GeForce GTX 1080 (May 2016, 16 nm); CUDA capability 6.0; 8.2 TF peak (SP)


Performance of Top 500 Machines

⬛ From a presentation by Jack Dongarra
⬛ Machines are far off peak when performing HPCG


What Lies Ahead

⬛ DOE CORAL Program
▪ Announced Nov. 2014
▪ Delivery in 2018

⬛ Vendor #1
▪ IBM + nVidia + Mellanox
▪ 3,400 nodes
▪ 10 MW
▪ 150–300 PF peak

⬛ Vendor #2
▪ Intel + Cray
▪ ~50,000 nodes (Xeon Phis)
▪ 13 MW
▪ >180 PF peak


TECHNOLOGY CHALLENGES


Moore’s Law

▪ Basis for ever-increasing computer power
▪ We’ve come to expect it will continue


Challenges to Moore’s Law: Technical

▪ Must continue to shrink feature sizes
▪ Approaching atomic scale

⬛ Difficulties
▪ Lithography at such small dimensions
▪ Statistical variations among devices

• 2022: transistors with 4 nm feature size
• Si lattice spacing: 0.54 nm


Challenges to Moore’s Law: Economic

⬛ Growing Capital Costs
▪ State-of-the-art fab line: ~$20B
▪ Must have very high volumes to amortize investment
▪ Has led to major consolidations


Dennard Scaling

▪ Due to Robert Dennard, IBM, 1974
▪ Quantifies the benefits of Moore’s Law

⬛ How to Shrink an IC Process
▪ Reduce horizontal and vertical dimensions by k
▪ Reduce voltage by k

⬛ Outcomes
▪ Devices/chip increase by k^2
▪ Clock frequency increases by k
▪ Power/chip constant

⬛ Significance
▪ Increased capacity and performance
▪ No increase in power
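The "power/chip constant" outcome follows from the switching-power relation P ≈ C·V^2·f per device; the standard scaling algebra is not spelled out on the slide, so here is a quick numeric check.

# Dennard scaling check: per-device switching power P ~ C * V^2 * f.
# Shrinking dimensions and voltage by k scales C by 1/k, V by 1/k, f by k.
k = 1.4                          # illustrative per-generation scaling factor

capacitance = 1.0 / k            # relative to the previous generation
voltage     = 1.0 / k
frequency   = k
devices     = k ** 2             # devices per chip grow with density

power_per_device = capacitance * voltage**2 * frequency   # = 1/k^2
power_per_chip   = power_per_device * devices             # = 1.0 (constant)
print("power per device: %.3f" % power_per_device)
print("power per chip:   %.3f" % power_per_chip)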


End of Dennard Scaling

⬛ What Happened?
▪ Can’t drop voltage below ~1 V
▪ Reached limit of power/chip in 2004
▪ More logic on chip (Moore’s Law), but can’t make them run faster
▪ Response has been to increase cores/chip


Research Challenges

⬛ Supercomputers
▪ Can they be made more dynamic and adaptive?
▪ Requirement for future scalability
▪ Can they be made easier to program?
▪ Abstract, machine-independent programming models

⬛ Data-Intensive Computing
▪ Can they be adapted to provide better computational performance?
▪ Can they make better use of data locality?
▪ Performance- & power-limiting factor

⬛ Technology / Economic
▪ What will we do when Moore’s Law comes to an end for CMOS?
▪ How can we ensure a stable manufacturing environment?