
MPI, Dataflow, Streaming: Messaging for Diverse Requirements

Argonne National Lab, Chicago, Illinois; Geoffrey Fox, September 25, 2017

Indiana University, Department of Intelligent Systems Engineering; gcf@indiana.edu, http://www.dsc.soic.indiana.edu/, http://spidal.org/

Work with Judy Qiu, Shantenu Jha, Supun Kamburugamuve, Kannan Govindarajan, Pulasthi Wickramasinghe

9/26/17 1

25 years of MPI Symposium

Abstract: MPI, Dataflow, Streaming: Messaging for Diverse Requirements

• We look at messaging needed in a variety of parallel, distributed, cloud, and edge computing applications.

• We compare technology approaches in MPI, Asynchronous Many-Task systems, Apache NiFi, Heron, Kafka, OpenWhisk, Pregel, Spark and Flink, event-driven simulations (HLA), and Microsoft Naiad.

• We suggest an event-triggered dataflow polymorphic runtime with implementations that trade off performance, fault tolerance, and usability.

• Integrate Parallel Computing, Big Data, and Grids.

9/26/17 2

• MPI is wonderful (and impossible to beat?) for closely coupled parallel computing, but

• There are many other regimes where parallel computing and/or message passing are essential.

• Application domains where other/higher-level concepts are successful or necessary.
• Internet of Things and Edge Computing are growing in importance.
• Use of public clouds is increasing rapidly.

• Clouds are becoming diverse, with subsystems containing GPUs, FPGAs, high-performance networks, storage, memory…

• Rich software stacks:
• HPC (High Performance Computing) for parallel computing, less used than (?)
• the Apache Big Data Software Stack (ABDS), including some edge computing (streaming data).

• A lot of confusion coming from different communities (database, distributed, parallel computing, machine learning, computational/data science) investigating similar ideas with little knowledge exchange and mixed-up requirements.

Motivating Remarks

9/26/17 3

• On general principles, parallel and distributed computing have different requirements, even if they sometimes offer similar functionality.

• The Apache stack ABDS typically uses distributed computing concepts.
• For example, the Reduce operation is different in MPI (Harp) and Spark.

• Large-scale simulation requirements are well understood.
• Big Data requirements are not clear, but there are a few key use types:

1) Pleasingly parallel processing (including local machine learning, LML), as of different tweets from different users, with perhaps a MapReduce style of statistics and visualizations; possibly streaming.

2) Database model with queries, again supported by MapReduce for horizontal scaling.
3) Global Machine Learning (GML) with a single job using multiple nodes as classic parallel computing.
4) Deep Learning certainly needs HPC; possibly only multiple small systems.

• Current workloads stress 1) and 2) and are suited to current clouds and to ABDS (with no HPC).
• This explains why Spark, with poor GML performance, is so successful and why it can ignore MPI.

Requirements

9/26/17 4

HPC Runtime versus ABDS Distributed Computing Model on Data Analytics

Hadoop writes to disk and is slowest; Spark and Flink spawn many processes and do not support AllReduce directly; MPI does in-place combined reduce/broadcast and is fastest.

Need a polymorphic Reduction capability choosing the best implementation. Use HPC architecture with a mutable model and immutable data. (A minimal sketch of the two reduction styles follows.)
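To make the contrast concrete, here is a minimal sketch (assuming mpi4py and numpy are installed and the script is launched with mpiexec; the 4-element model is hypothetical) of MPI's single in-place AllReduce collective next to the reduce-then-broadcast pattern that dataflow engines effectively perform:

```python
# Minimal sketch: MPI's combined reduce/broadcast (Allreduce) versus a
# dataflow-style reduce to a driver followed by a broadcast.
# Assumes mpi4py and numpy; run with e.g. `mpiexec -n 4 python reduce_sketch.py`.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

partial = np.full(4, comm.Get_rank(), dtype='d')  # each rank's partial result (hypothetical)
total = np.empty(4, dtype='d')

# MPI style: one in-place collective; every rank ends up with the global sum.
comm.Allreduce(partial, total, op=MPI.SUM)

# Dataflow style: reduce to a single "driver" (rank 0), then broadcast the result.
reduced = comm.reduce(partial, op=MPI.SUM, root=0)
result = comm.bcast(reduced, root=0)
```

The polymorphic reduction idea above is to expose one Reduce abstraction and let the runtime pick whichever of these implementations fits the circumstances.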

9/26/17 5

Multidimensional Scaling: 3 Nested Parallel Sections

[Charts: MDS execution time on 16 nodes with 20 processes per node, for a varying number of points; and MDS execution time with 32,000 points on a varying number of nodes, each node running 20 parallel tasks.]

9/26/17 6

[Chart series: Flink, Spark, MPI.]

MPI is a factor of 20-200 faster than Spark/Flink.

Implementing Twister2 to support a Grid linked to an HPC Cloud

[Diagrams: Centralized HPC Cloud + IoT Devices; Centralized HPC Cloud + Edge = Fog + IoT Devices; the HPC Cloud can be federated (Cloud, HPC, and Fog layers).]

9/26/17 7

• Cloud-owner-provided cloud-native platform for
• event-driven applications which
• scale up and down instantly and automatically, and
• charge for actual usage at a millisecond granularity.


[Diagram: Bare Metal, PaaS Container, Orchestrators, IaaS, Serverless.]

GridSolve, Neos were FaaS.

9/26/17 8

See review: http://dx.doi.org/10.13140/RG.2.2.15007.87206

Serverless (server hidden) computing is attractive to the user: "No server is easier to manage than no server."

Twister2: "Next Generation Grid-Edge-HPC Cloud"
• Original 2010 Twister paper was a particular approach to Map-Collective iterative processing for machine learning.

• Re-engineer current Apache Big Data software systems as a toolkit, with MPI as an option.
• Base on Apache Heron as most modern and "neutral" on controversial issues.

• Support a serverless (cloud-native) dataflow event-driven HPC-FaaS (microservice) framework running across application and geographic domains.

• Support all types of data analysis, from GML to edge computing.

• Build on cloud best practice but use HPC wherever possible to get high performance.
• Smoothly support current paradigms: Naiad, Hadoop, Spark, Flink, Storm, Heron, MPI…
• Use interoperable common abstractions but multiple polymorphic implementations.

• i.e., do not require a single runtime.

• Focus on the runtime, but this implies an HPC-FaaS programming and execution model.
• This describes a next-generation Grid based on data and edge devices, not computing as in the original Grid.
See long paper http://dsc.soic.indiana.edu/publications/Twister2.pdf

9/26/17 9

Communication (Messaging) Models
• MPI Gold Standard: tightly synchronized applications.

• Efficient communications (µs latency) with use of advanced hardware.
• In-place communications and computations (process scope for state).

• Basic (coarse-grain) dataflow: model a computation as a graph.
• Nodes do computations, with Tasks as the computations, and edges are asynchronous communications.

• A computation is activated when its input data dependencies are satisfied.

• Streaming dataflow: Pub-Sub with data partitioned into streams.
• Streams are unbounded, ordered data tuples.
• Order of events is important; group data into time windows.

• Machine Learning dataflow: iterative computations.
• There is both Model and Data, but only communicate the model.
• Collective communication operations such as AllReduce and AllGather (no differential operators in Big Data problems).

• Can use in-place MPI-style communication (see the sketch below).
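As an illustration of the machine-learning dataflow pattern, here is a minimal sketch (mpi4py and numpy assumed; the data, model size, and update rule are hypothetical) in which the data stays partitioned on each rank and only the model is communicated, once per iteration, with a collective:

```python
# Minimal sketch of the ML dataflow pattern: partitioned data is never moved;
# only the small model update is AllReduced each iteration.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rng = np.random.default_rng(comm.Get_rank())

local_data = rng.random((10_000, 8))   # this rank's data partition (never communicated)
model = np.zeros(8)                    # replicated model

for _ in range(10):
    # Local computation against the local partition (hypothetical update rule).
    local_update = local_data.mean(axis=0) - model
    global_update = np.empty_like(local_update)
    # Only the model-sized update crosses the network (in-place MPI-style collective).
    comm.Allreduce(local_update, global_update, op=MPI.SUM)
    model += global_update / comm.Get_size()
```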

[Diagram: dataflow graph with nodes labeled S, W, and G.]

9/26/17 10

Core SPIDAL Parallel HPC Library with Collectives Used
• DA-MDS: Rotate, AllReduce, Broadcast
• Directed Force Dimension Reduction: AllGather, AllReduce

• Irregular DAVS Clustering: Partial Rotate, AllReduce, Broadcast

• DA Semimetric Clustering: Rotate, AllReduce, Broadcast

• K-means: AllReduce, Broadcast, AllGather (DAAL)
• SVM: AllReduce, AllGather
• SubGraph Mining: AllGather, AllReduce
• Latent Dirichlet Allocation: Rotate, AllReduce
• Matrix Factorization (SGD): Rotate (DAAL)
• Recommender System (ALS): Rotate (DAAL)
• Singular Value Decomposition (SVD): AllGather (DAAL)

• QR Decomposition (QR): Reduce, Broadcast (DAAL)
• Neural Network: AllReduce (DAAL)
• Covariance: AllReduce (DAAL)
• Low Order Moments: Reduce (DAAL)
• Naive Bayes: Reduce (DAAL)
• Linear Regression: Reduce (DAAL)
• Ridge Regression: Reduce (DAAL)
• Multi-class Logistic Regression: Regroup, Rotate, AllGather

• Random Forest: AllReduce
• Principal Component Analysis (PCA): AllReduce (DAAL)

DAAL implies integration with the Intel DAAL Optimized Data Analytics Library (runs on KNL!).

9/26/17 11

Coordination Points
• There are, in many approaches, "coordination points" that can be implicit or explicit.

• Twister2 makes coordination points an important (first-class) concept.
• Dataflow nodes in Heron, Flink, Spark, Naiad; we call these fine-grain dataflow.
• Issuance of a collective communication command in MPI.
• Start and end of a parallel section in OpenMP.
• End of a job; we call these coarse-grain dataflow nodes, and these are seen in workflow systems such as Pegasus, Taverna, Kepler, and NiFi (from Apache).

• Twister2 will allow users to specify the existence of a named coordination point and allow actions to be initiated:

• Produce an RDD-style dataset from user-specified data.
• Launch new tasks as in Heron, Flink, Spark, Naiad.
• Change execution model as in an OpenMP parallel section.

9/26/17 12

NiFi Workflow with Coarse-Grain Coordination

9/26/17 13

K-means and Dataflow

9/26/17 14

[Diagram: Dataflow for K-means. DataSet&lt;Points&gt; and DataSet&lt;InitialCentroids&gt; feed Map (nearest centroid calculation), followed by Reduce (update centroids), producing DataSet&lt;UpdatedCentroids&gt;, with a Broadcast of the centroids to the Map stage.]
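A minimal sketch of this K-means dataflow in plain numpy (numpy assumed; point counts and cluster count are hypothetical): the assignment step plays the Map, the centroid update plays the Reduce, and reusing the updated centroids in the next iteration stands in for the Broadcast.

```python
# Map = nearest-centroid assignment, Reduce = centroid update,
# Broadcast = feeding UpdatedCentroids back into the next iteration.
import numpy as np

rng = np.random.default_rng(0)
points = rng.random((1000, 2))      # DataSet<Points>
centroids = points[:4].copy()       # DataSet<InitialCentroids>

for _ in range(10):
    # Map: assign each point to its nearest centroid.
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    nearest = distances.argmin(axis=1)
    # Reduce: recompute each centroid as the mean of its assigned points.
    centroids = np.array([
        points[nearest == k].mean(axis=0) if np.any(nearest == k) else centroids[k]
        for k in range(len(centroids))
    ])                              # DataSet<UpdatedCentroids>
```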

[Diagram: a full K-means job (Maps, Reduce, Iterate) followed by another job. Coarse-grain workflow nodes sit between jobs; internal execution (iteration) nodes provide fine-grain coordination, using either dataflow communication or HPC communication; these boundaries are the "coordination points".]

Handling of State
• State is a key issue and is handled differently in different systems.
• MPI, Naiad, Storm, and Heron have long-running tasks that preserve state.

• MPI tasks stop at the end of the job.
• Naiad, Storm, Heron tasks change at (fine-grain) dataflow nodes, but all tasks run forever.

• Spark and Flink tasks stop and refresh at dataflow nodes but preserve some state as RDDs/datasets using in-memory databases.

• All systems agree on actions at coarse-grain dataflow (at the job level); state is kept only by exchanging data.

9/26/17 15

Fault Tolerance and State
• A similar form of checkpointing mechanism is already used in HPC and Big Data,

• although HPC is informal, as it doesn't typically specify a dataflow graph.
• Flink and Spark do better than MPI due to use of database technologies; MPI is a bit harder due to richer state, but there is an obvious integrated model using RDD-type snapshots of MPI-style jobs.

• Checkpoint after each stage of the dataflow graph.
• Natural synchronization point.
• Let's allow the user to choose when to checkpoint (not every stage).
• Save state as the user specifies; Spark just saves Model state, which is insufficient for complex algorithms (a minimal sketch of user-chosen checkpointing follows).
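A minimal sketch of such user-chosen checkpointing (mpi4py and numpy assumed; the loop, state, file names, and checkpoint interval are hypothetical illustrations, not the Twister2 API): each rank saves state richer than the model alone at selected coordination points rather than at every stage.

```python
# Each rank checkpoints its own state (more than just the model) at
# user-chosen stages of the iteration, using a barrier as the natural
# synchronization point.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

model = np.zeros(8)
history = []                                   # extra per-rank state beyond the model

for iteration in range(100):
    # ... one stage of the dataflow graph: local compute plus collectives ...
    history.append(float(model.sum()))
    if iteration % 25 == 0:                    # user decides which stages to checkpoint
        comm.Barrier()                         # natural synchronization point
        np.savez(f"checkpoint_rank{rank}_iter{iteration}.npz",
                 iteration=iteration, model=model, history=np.array(history))
```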

9/26/17 16

Twister2 Components I

9/26/17 17

Columns: Component | Implementation | Comments

Area: Distributed Data API
• Relaxed Distributed data set | Similar to Spark RDD | ETL-type data applications; streaming; backup for fault tolerance
• Streaming | Pub-Sub and Spouts as in Heron | API to pub-sub messages
• Data access | Access common data sources including files, connecting to message brokers, etc. | All the above applications can use this base functionality
• Distributed Shared Memory | Similar to PGAS | Machine learning such as graph algorithms

Area: Task API (FaaS API, Function as a Service)
• Dynamic Task Scheduling | Dynamic scheduling as in AMT | Some machine learning; FaaS
• Static Task Scheduling | Static scheduling as in Flink & Heron | Streaming; ETL data pipelines
• Task Execution | Thread-based execution as seen in Spark, Flink, Naiad, OpenMP | Look at hybrid MPI/thread support available
• Task Graph | Twister2 Tasks similar to Naiad and Heron |
• Streaming and FaaS Events | Heron, OpenWhisk, Kafka/RabbitMQ | Classic streaming; scaling of FaaS needs research
• Elasticity | OpenWhisk | Needs experimentation
• Task migration | Monitoring of tasks and migrating tasks for better resource utilization |

Twister2 Components II

9/26/17 18

Columns: Component | Implementation | Comments

Area: Communication API
• Messages | Heron | This is user level and could map to multiple communications
• Dataflow Communication | Fine-grain Twister2 dataflow communications (MPI, TCP, and RMA); coarse-grain dataflow from NiFi, Kepler? | Streaming; machine learning; ETL data pipelines
• BSP Communication (Map-Collective) | MPI-style communication, Harp | Machine learning

Area: Execution Model
• Architecture | Spark, Flink | Container/Processes/Tasks = Threads
• Job Submit API, Resource Scheduler | Pluggable architecture for any resource scheduler (Yarn, Mesos, Slurm) | All the above applications need this base functionality
• Dataflow graph analyzer & optimizer | Flink; Spark is dynamic and implicit |
• Coordination Points: specification and actions | Research based on MPI, Spark, Flink, NiFi (Kepler) | Synchronization point; backup to datasets; refresh tasks

Area: Security
• Storage, messaging, execution | Research | Crosses all components

Summary of MPI in an HPC Cloud + Edge + Grid Environment

• We suggest the value of an event-driven computing model built around Cloud and HPC and spanning batch, streaming, and edge applications.

• Highly parallel on the cloud; possibly sequential at the edge.
• We have done a preliminary analysis of the different runtimes of MPI, Hadoop, Spark, Flink, Storm, Heron, Naiad, and HPC Asynchronous Many-Task (AMT).

• There are different technologies for different circumstances, but they can be unified by high-level abstractions such as communication collectives.

• Obviously MPI is best for parallel computing (by definition).

• Apache systems use dataflow communication, which is natural for distributed systems but inevitably slow for classic parallel computing.

• No standard dataflow library (why?). Add dataflow primitives in MPI-4?

• MPI could adopt some of the tools of Big Data, as in Coordination Points (dataflow nodes) and state management with RDDs (datasets).

9/26/17 19
