Big Data Meets HPC – Exploiting HPC Technologies for Accelerating Big Data Processing
TRANSCRIPT
Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Keynote Talk at HPCAC-Switzerland (Mar 2016)
HPCAC-Switzerland (Mar '16) – Network Based Computing Laboratory
Introduction to Big Data Applications and Analytics
• Big Data has become one of the most important elements of business analytics
• Provides groundbreaking opportunities for enterprise information management and decision making
• The amount of data is exploding; companies are capturing and digitizing more information than ever
• The rate of information growth appears to be exceeding Moore's Law
4V Characteristics of Big Data
• Commonly accepted 3 V's of Big Data: Volume, Velocity, Variety
  (Michael Stonebraker: Big Data Means at Least Three Different Things, http://www.nist.gov/itl/ssd/is/upload/NIST-stonebraker.pdf)
• 4/5 V's of Big Data – 3 V's + Veracity, Value
Courtesy: http://api.ning.com/files/tRHkwQN7s-Xz5cxylXG004GLGJdjoPd6bVfVBwvgu*F5MwDDUCiHHdmBW-JTEz0cfJjGurJucBMTkIUNdL3jcZT8IPfNWfN9/dv1.jpg
Data Generation in Internet Services and Applications
• Web pages (content, graph)
• Clicks (ad, page, social)
• Users (OpenID, FB Connect, etc.)
• E-mails (Hotmail, Y! Mail, Gmail, etc.)
• Photos, movies (Flickr, YouTube, Video, etc.)
• Cookies / tracking info (see Ghostery)
• Installed apps (Android Market, App Store, etc.)
• Location (Latitude, Loopt, Foursquare, Google Now, etc.)
• User-generated content (Wikipedia & co, etc.)
• Ads (display, text, DoubleClick, Yahoo, etc.)
• Comments (Discuss, Facebook, etc.)
• Reviews (Yelp, Y! Local, etc.)
• Social connections (LinkedIn, Facebook, etc.)
• Purchase decisions (Netflix, Amazon, etc.)
• Instant messages (YIM, Skype, GTalk, etc.)
• Search terms (Google, Bing, etc.)
• News articles (BBC, NY Times, Y! News, etc.)
• Blog posts (Tumblr, WordPress, etc.)
• Microblogs (Twitter, Jaiku, Meme, etc.)
• Link sharing (Facebook, Delicious, Buzz, etc.)
Number of apps in the Apple App Store, Android Market, Blackberry, and Windows Phone (2013):
• Android Market: <1200K
• Apple App Store: ~1000K
Courtesy: http://dazeinfo.com/2014/07/10/apple-inc-aapl-ios-google-inc-goog-android-growth-mobile-ecosystem-2014/
Velocity of Big Data – How Much Data Is Generated Every Minute on the Internet?
The global Internet population grew 18.5% from 2013 to 2015 and now represents 3.2 billion people.
Courtesy: https://www.domo.com/blog/2015/08/data-never-sleeps-3-0/
Not Only in Internet Services – Big Data in Scientific Domains
• Scientific data management, analysis, and visualization
• Application examples
  – Climate modeling
  – Combustion
  – Fusion
  – Astrophysics
  – Bioinformatics
• Data-intensive tasks
  – Run large-scale simulations on supercomputers
  – Dump data on parallel storage systems
  – Collect experimental / observational data
  – Move experimental / observational data to analysis sites
  – Visual analytics – help understand data visually
Typical Solutions or Architectures for Big Data Analytics
• Hadoop: http://hadoop.apache.org
  – The most popular framework for Big Data analytics
  – HDFS, MapReduce, HBase, RPC, Hive, Pig, ZooKeeper, Mahout, etc.
• Spark: http://spark-project.org
  – Provides primitives for in-memory cluster computing; jobs can load data into memory and query it repeatedly
• Storm: http://storm-project.net
  – A distributed real-time computation system for real-time analytics, online machine learning, continuous computation, etc.
• S4: http://incubator.apache.org/s4
  – A distributed system for processing continuous unbounded streams of data
• GraphLab: http://graphlab.org
  – Consists of a core C++ GraphLab API and a collection of high-performance machine learning and data mining toolkits built on top of the GraphLab API
• Web 2.0: RDBMS + Memcached (http://memcached.org)
  – Memcached: a high-performance, distributed memory object caching system
Big Data Processing with Hadoop Components
• Major components included in this tutorial:
  – MapReduce (Batch)
  – HBase (Query)
  – HDFS (Storage)
  – RPC (Inter-process communication)
• Underlying Hadoop Distributed File System (HDFS) used by both MapReduce and HBase
• Model scales, but the high amount of communication during intermediate phases can be further optimized
[Figure: Hadoop framework stack – user applications over MapReduce and HBase, over Hadoop Common (RPC), over HDFS]
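To make the three MapReduce phases concrete, here is a minimal, framework-free sketch of the model in plain Python (no Hadoop involved); the shuffle step in the middle is the communication-heavy phase the talk targets for optimization.

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit (word, 1) pairs, as a word-count Mapper would
    for line in records:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group values by key -- in a real cluster this is the
    # all-to-all network transfer between map and reduce tasks
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values
    return {key: sum(values) for key, values in groups.items()}

records = ["big data meets hpc", "big data processing"]
counts = reduce_phase(shuffle_phase(map_phase(records)))
print(counts["big"])   # 2
print(counts["hpc"])   # 1
```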
Spark Architecture Overview
• An in-memory data-processing framework
  – Iterative machine learning jobs
  – Interactive data analytics
  – Scala-based implementation
  – Standalone, YARN, Mesos
• Scalable and communication intensive
  – Wide dependencies between Resilient Distributed Datasets (RDDs)
  – MapReduce-like shuffle operations to repartition RDDs
  – Sockets-based communication
[Figure: SparkContext on the master coordinates worker nodes; HDFS driver and ZooKeeper support the cluster]
http://spark.apache.org
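The "load data into memory and query it repeatedly" idea can be sketched with a toy lazy-evaluation class. This is not Spark's API, just an illustration of lazy lineage plus an in-memory cache (all names here are invented for the example).

```python
class ToyRDD:
    """Toy stand-in for an RDD: a lazy transformation chain whose
    result can be cached in memory and queried repeatedly."""

    def __init__(self, compute):
        self._compute = compute      # lineage: how to (re)build the data
        self._cache = None           # in-memory copy after .cache()

    def map(self, fn):
        return ToyRDD(lambda: [fn(x) for x in self._materialize()])

    def filter(self, pred):
        return ToyRDD(lambda: [x for x in self._materialize() if pred(x)])

    def cache(self):
        self._cache = self._materialize()
        return self

    def _materialize(self):
        return self._cache if self._cache is not None else self._compute()

    def collect(self):
        return self._materialize()

# Iterative jobs re-read 'data' from memory instead of recomputing it
data = ToyRDD(lambda: list(range(10))).map(lambda x: x * x).cache()
evens = data.filter(lambda x: x % 2 == 0).collect()
print(evens)  # [0, 4, 16, 36, 64]
```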
Memcached Architecture
• Distributed caching layer
  – Allows aggregating spare memory from multiple nodes
  – General purpose
• Typically used to cache database queries and results of API calls
• Scalable model, but typical usage is very network intensive
[Figure: many nodes (main memory, CPUs, SSD, HDD) connected by high-performance networks]
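The way a Memcached client aggregates memory across nodes is client-side key sharding. A minimal sketch, assuming a hypothetical three-server pool (real clients typically use consistent hashing rather than the plain modular hash shown here):

```python
import hashlib

# Hypothetical server names; real deployments list host:port pairs
SERVERS = ["cache01", "cache02", "cache03"]

def server_for(key):
    # Clients shard keys across servers, so the cluster's spare
    # memory behaves like one large cache
    digest = hashlib.md5(key.encode()).hexdigest()
    return SERVERS[int(digest, 16) % len(SERVERS)]

# Each node contributes its memory as one shard of the cache
stores = {s: {} for s in SERVERS}

def cache_set(key, value):
    stores[server_for(key)][key] = value

def cache_get(key):
    return stores[server_for(key)].get(key)

cache_set("user:42", {"name": "Ada"})
assert cache_get("user:42") == {"name": "Ada"}
# A miss would fall through to the database in a real deployment
assert cache_get("user:99") is None
```

Every get/set is a network round trip, which is why the slide calls typical usage "very network intensive."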
Data Management and Processing on Modern Clusters
• Substantial impact on designing and utilizing data management and processing systems in multiple tiers
  – Front-end data accessing and serving (online)
    • Memcached + DB (e.g., MySQL), HBase
  – Back-end data analytics (offline)
    • HDFS, MapReduce, Spark
[Figure: Internet-facing front-end tier (web servers, Memcached + DB (MySQL), NoSQL DB (HBase)) for data accessing and serving; back-end tier (HDFS, MapReduce, Spark) for data analytics apps/jobs]
Open Standard InfiniBand Networking Technology
• Introduced in Oct 2000
• High-performance data transfer
  – Interprocessor communication and I/O
  – Low latency (<1.0 microsec), high bandwidth (up to 12.5 GigaBytes/sec -> 100 Gbps), and low CPU utilization (5-10%)
• Multiple operations
  – Send/Recv
  – RDMA Read/Write
  – Atomic operations (very unique)
    • High-performance and scalable implementations of distributed locks, semaphores, collective communication operations
• Leading to big changes in designing
  – HPC clusters
  – File systems
  – Cloud computing systems
  – Grid computing systems
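To see why RDMA atomics enable distributed locks, consider compare-and-swap on a word in a remote node's memory: any client can attempt the swap without involving the server's CPU. The sketch below simulates only the semantics in Python (real code would use the C verbs API, e.g. atomic work requests); the class and threshold names are invented for illustration.

```python
import threading

class RemoteWord:
    """Simulates a 64-bit word in a server's registered memory that
    clients update with RDMA atomic compare-and-swap."""

    def __init__(self):
        self.value = 0
        self._mu = threading.Lock()  # stands in for NIC-side atomicity

    def compare_and_swap(self, expected, new):
        with self._mu:
            old = self.value
            if old == expected:
                self.value = new
            return old  # RDMA CAS returns the prior value

# A distributed spinlock: 0 = free, nonzero = holder's client id
lock_word = RemoteWord()

def try_acquire(client_id):
    return lock_word.compare_and_swap(0, client_id) == 0

def release(client_id):
    lock_word.compare_and_swap(client_id, 0)

assert try_acquire(7)          # client 7 wins the lock
assert not try_acquire(9)      # client 9 must retry
release(7)
assert try_acquire(9)          # now client 9 succeeds
```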
Interconnects and Protocols in OpenFabrics Stack
[Figure: application/middleware interface (Sockets or Verbs) mapped to protocol, adapter, and switch for each option – kernel TCP/IP over Ethernet (1/10/40/100 GigE); hardware-offload TCP/IP (10/40 GigE-TOE); IPoIB over InfiniBand; user-space RSockets over InfiniBand; SDP over InfiniBand; user-space iWARP over Ethernet; user-space RDMA with RoCE over Ethernet switches; user-space RDMA with native IB verbs over InfiniBand]
How Can HPC Clusters with High-Performance Interconnect and Storage Architectures Benefit Big Data Applications?
Bring HPC and Big Data processing into a "convergent trajectory"!
• What are the major bottlenecks in current Big Data processing middleware (e.g., Hadoop, Spark, and Memcached)?
• Can the bottlenecks be alleviated with new designs by taking advantage of HPC technologies?
• Can RDMA-enabled high-performance interconnects benefit Big Data processing?
• Can HPC clusters with high-performance storage systems (e.g., SSD, parallel file systems) benefit Big Data applications?
• How much performance benefit can be achieved through enhanced designs?
• How to design benchmarks for evaluating the performance of Big Data middleware on HPC clusters?
Designing Communication and I/O Libraries for Big Data Systems: Challenges
[Figure: layered stack – Applications; Big Data middleware (HDFS, MapReduce, HBase, Spark and Memcached); programming models (Sockets – other protocols? upper-level changes?); communication and I/O library (point-to-point communication, threaded models and synchronization, virtualization, I/O and file systems, QoS, fault-tolerance, benchmarks); networking technologies (InfiniBand, 1/10/40/100 GigE and intelligent NICs); commodity computing system architectures (multi- and many-core architectures and accelerators); storage technologies (HDD, SSD, and NVMe-SSD)]
Can Big Data Processing Systems be Designed with High-Performance Networks and Protocols?
• Sockets not designed for high performance
  – Stream semantics often mismatch for upper layers
  – Zero-copy not available for non-blocking sockets
• Current design: Application -> Sockets -> 1/10/40/100 GigE network
• Our approach: Application -> OSU Design -> Verbs interface -> 10/40/100 GigE or InfiniBand
The High-Performance Big Data (HiBD) Project
• RDMA for Apache Spark
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
  – Plugins for Apache, Hortonworks (HDP) and Cloudera (CDH) Hadoop distributions
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• RDMA for Memcached (RDMA-Memcached)
• OSU HiBD-Benchmarks (OHB)
  – HDFS and Memcached micro-benchmarks
• http://hibd.cse.ohio-state.edu
• Users base: 155 organizations from 20 countries
• More than 15,500 downloads from the project site
• RDMA for Apache HBase and Impala (upcoming)
Available for InfiniBand and RoCE
RDMA for Apache Hadoop 2.x Distribution
• High-performance design of Hadoop over RDMA-enabled interconnects
  – High-performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs level for HDFS, MapReduce, and RPC components
  – Enhanced HDFS with in-memory and heterogeneous storage
  – High-performance design of MapReduce over Lustre
  – Plugin-based architecture supporting RDMA-based designs for Apache Hadoop, CDH and HDP
  – Easily configurable for different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre) and different protocols (native InfiniBand, RoCE, and IPoIB)
• Current release: 0.9.9
  – Based on Apache Hadoop 2.7.1
  – Compliant with Apache Hadoop 2.7.1, HDP 2.3.0.0 and CDH 5.6.0 APIs and applications
  – Tested with
    • Mellanox InfiniBand adapters (DDR, QDR and FDR)
    • RoCE support with Mellanox adapters
    • Various multi-core platforms
    • Different file systems with disks and SSDs, and Lustre
  – http://hibd.cse.ohio-state.edu
RDMA for Apache Spark Distribution
• High-performance design of Spark over RDMA-enabled interconnects
  – High-performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs level for Spark
  – RDMA-based data shuffle and SEDA-based shuffle architecture
  – Non-blocking and chunk-based data transfer
  – Easily configurable for different protocols (native InfiniBand, RoCE, and IPoIB)
• Current release: 0.9.1
  – Based on Apache Spark 1.5.1
  – Tested with
    • Mellanox InfiniBand adapters (DDR, QDR and FDR)
    • RoCE support with Mellanox adapters
    • Various multi-core platforms
    • RAM disks, SSDs, and HDD
  – http://hibd.cse.ohio-state.edu
RDMA for Memcached Distribution
• High-performance design of Memcached over RDMA-enabled interconnects
  – High-performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs level for Memcached and libMemcached components
  – High-performance design of SSD-assisted hybrid memory
  – Easily configurable for native InfiniBand, RoCE and the traditional sockets-based support (Ethernet and InfiniBand with IPoIB)
• Current release: 0.9.4
  – Based on Memcached 1.4.24 and libMemcached 1.0.18
  – Compliant with libMemcached APIs and applications
  – Tested with
    • Mellanox InfiniBand adapters (DDR, QDR and FDR)
    • RoCE support with Mellanox adapters
    • Various multi-core platforms
    • SSD
  – http://hibd.cse.ohio-state.edu
OSU HiBD Micro-Benchmark (OHB) Suite – HDFS & Memcached
• Micro-benchmarks for Hadoop Distributed File System (HDFS)
  – Sequential Write Latency (SWL) Benchmark, Sequential Read Latency (SRL) Benchmark, Random Read Latency (RRL) Benchmark, Sequential Write Throughput (SWT) Benchmark, Sequential Read Throughput (SRT) Benchmark
  – Support benchmarking of
    • Apache Hadoop 1.x and 2.x HDFS, Hortonworks Data Platform (HDP) HDFS, Cloudera Distribution of Hadoop (CDH) HDFS
• Micro-benchmarks for Memcached
  – Get Benchmark, Set Benchmark, and Mixed Get/Set Benchmark
• Current release: 0.8
• http://hibd.cse.ohio-state.edu
Different Modes of RDMA for Apache Hadoop 2.x
• HHH: Heterogeneous storage devices with hybrid replication schemes are supported in this mode of operation, for better fault-tolerance as well as performance. This mode is enabled by default in the package.
• HHH-M: A high-performance in-memory based setup has been introduced in this package that can be utilized to perform all I/O operations in-memory and obtain as much performance benefit as possible.
• HHH-L: With parallel file systems integrated, HHH-L mode can take advantage of the Lustre available in the cluster.
• MapReduce over Lustre, with/without local disks: Besides HDFS-based solutions, this package also provides support to run MapReduce jobs on top of Lustre alone. Here, two different modes are introduced: with local disks and without local disks.
• Running with Slurm and PBS: Supports deploying RDMA for Apache Hadoop 2.x with Slurm and PBS in different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre).
Acceleration Case Studies and Performance Evaluation
• RDMA-based designs and performance evaluation
  – HDFS
  – MapReduce
  – RPC
  – HBase
  – Spark
  – Memcached (basic, hybrid and non-blocking APIs)
  – HDFS + Memcached-based burst buffer
  – OSU HiBD Benchmarks (OHB)
Design Overview of HDFS with RDMA
• Enables high-performance RDMA communication, while supporting traditional socket interface
• JNI layer bridges Java-based HDFS with communication library written in native code
• Design features
  – RDMA-based HDFS write
  – RDMA-based HDFS replication
  – Parallel replication support
  – On-demand connection setup
  – InfiniBand/RoCE support
[Figure: Applications -> HDFS (write and others) -> Java socket interface over 1/10/40/100 GigE / IPoIB network, or Java Native Interface (JNI) -> OSU design -> verbs -> RDMA-capable networks (IB, iWARP, RoCE, ...)]
N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda, High Performance RDMA-Based Design of HDFS over InfiniBand, Supercomputing (SC), Nov 2012
N. Islam, X. Lu, W. Rahman, and D. K. Panda, SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in RDMA-Enhanced HDFS, HPDC '14, June 2014
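The "RDMA path alongside the traditional socket path" idea boils down to a transport-selection strategy. A minimal sketch of that pattern (class names like `RdmaTransport` and `choose_transport` are illustrative, not the OSU implementation):

```python
class SocketTransport:
    name = "sockets"
    def write_block(self, block):
        return f"sent {len(block)} bytes via Java sockets"

class RdmaTransport:
    name = "rdma"
    def __init__(self):
        # The real design crosses a JNI bridge into a native verbs
        # library; connections are set up on demand per peer
        self.connections = {}
    def write_block(self, block, peer="datanode-1"):
        self.connections.setdefault(peer, "connected")  # on-demand setup
        return f"sent {len(block)} bytes via RDMA to {peer}"

def choose_transport(rdma_available):
    # Keep the traditional socket path while preferring verbs if present
    return RdmaTransport() if rdma_available else SocketTransport()

t = choose_transport(rdma_available=True)
msg = t.write_block(b"x" * 1024)
assert "RDMA" in msg
assert t.connections == {"datanode-1": "connected"}
```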
Enhanced HDFS with In-Memory and Heterogeneous Storage (Triple-H)
• Design features
  – Three modes
    • Default (HHH)
    • In-memory (HHH-M)
    • Lustre-integrated (HHH-L)
  – Policies to efficiently utilize the heterogeneous storage devices
    • RAM, SSD, HDD, Lustre
  – Eviction/promotion based on data usage pattern
  – Hybrid replication
  – Lustre-integrated mode:
    • Lustre-based fault-tolerance
[Figure: applications over data placement policies (hybrid replication, eviction/promotion) across RAM disk, SSD, HDD, and Lustre]
N. Islam, X. Lu, M. W. Rahman, D. Shankar, and D. K. Panda, Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters with Heterogeneous Storage Architecture, CCGrid '15, May 2015
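The eviction/promotion idea across heterogeneous tiers can be sketched with a small LRU policy: hot blocks live in RAM, cold blocks trickle down the tier list, and accessed blocks are promoted back. The tier names and capacities below are illustrative assumptions, not the actual Triple-H policy.

```python
from collections import OrderedDict

TIERS = ["ram", "ssd", "hdd"]            # fastest to slowest
CAPACITY = {"ram": 2, "ssd": 3, "hdd": 100}
store = {t: OrderedDict() for t in TIERS}

def put(block_id, data, tier="ram"):
    store[tier][block_id] = data
    store[tier].move_to_end(block_id)
    if len(store[tier]) > CAPACITY[tier]:
        victim, vdata = store[tier].popitem(last=False)  # evict LRU block
        put(victim, vdata, TIERS[TIERS.index(tier) + 1]) # push down a tier

def get(block_id):
    for tier in TIERS:
        if block_id in store[tier]:
            data = store[tier].pop(block_id)
            put(block_id, data, "ram")   # promote on access
            return tier, data
    return None, None

for i in range(4):
    put(f"blk{i}", b"...")
assert list(store["ram"]) == ["blk2", "blk3"]  # hottest blocks in RAM
tier_hit, _ = get("blk0")
assert tier_hit == "ssd"                       # was evicted, now promoted
assert "blk0" in store["ram"]
```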
Design Overview of MapReduce with RDMA
• Enables high-performance RDMA communication, while supporting traditional socket interface
• JNI layer bridges Java-based MapReduce with communication library written in native code
• Design features
  – RDMA-based shuffle
  – Prefetching and caching of map output
  – Efficient shuffle algorithms
  – In-memory merge
  – On-demand shuffle adjustment
  – Advanced overlapping
    • map, shuffle, and merge
    • shuffle, merge, and reduce
  – On-demand connection setup
  – InfiniBand/RoCE support
[Figure: Applications -> MapReduce (JobTracker, TaskTracker, Map, Reduce) -> Java socket interface over 1/10/40/100 GigE / IPoIB network, or JNI -> OSU design -> verbs -> RDMA-capable networks (IB, iWARP, RoCE, ...)]
M. W. Rahman, X. Lu, N. S. Islam, and D. K. Panda, HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects, ICS, June 2014
Advanced Overlapping among Different Phases
• A hybrid approach to achieve maximum possible overlapping in MapReduce across all phases, compared to other approaches
  – Efficient shuffle algorithms
  – Dynamic and efficient switching
  – On-demand shuffle adjustment
[Figure: default architecture vs. enhanced overlapping with in-memory merge vs. advanced hybrid overlapping]
M. W. Rahman, X. Lu, N. S. Islam, and D. K. Panda, HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects, ICS, June 2014
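The shuffle/merge overlapping idea can be sketched as a producer/consumer pipeline: while one thread is still fetching map outputs, another merges the chunks already received instead of waiting for the full shuffle to finish. A toy sketch (queues stand in for the RDMA transfers):

```python
import queue
import threading

fetched = queue.Queue()
merged = []

def shuffle_fetch(map_outputs):
    for chunk in map_outputs:        # e.g., chunks arriving over RDMA
        fetched.put(sorted(chunk))
    fetched.put(None)                # end-of-shuffle marker

def inmemory_merge():
    # Overlap: merge each chunk as it lands, concurrently with fetching
    while (chunk := fetched.get()) is not None:
        merged.extend(chunk)
    merged.sort()                    # final merge before reduce

map_outputs = [[5, 1], [4, 2], [6, 3]]
t1 = threading.Thread(target=shuffle_fetch, args=(map_outputs,))
t2 = threading.Thread(target=inmemory_merge)
t1.start(); t2.start(); t1.join(); t2.join()
assert merged == [1, 2, 3, 4, 5, 6]
```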
Design Overview of Hadoop RPC with RDMA
• Enables high-performance RDMA communication, while supporting traditional socket interface
• JNI layer bridges Java-based RPC with communication library written in native code
• Design features
  – JVM-bypassed buffer management
  – RDMA or send/recv based adaptive communication
  – Intelligent buffer allocation and adjustment for serialization
  – On-demand connection setup
  – InfiniBand/RoCE support
[Figure: Applications -> Hadoop RPC; default path via Java socket interface over 1/10/40/100 GigE / IPoIB network; our design via JNI -> OSU design -> verbs -> RDMA-capable networks (IB, iWARP, RoCE, ...)]
X. Lu, N. Islam, M. W. Rahman, J. Jose, H. Subramoni, H. Wang, and D. K. Panda, High-Performance Design of Hadoop RPC with RDMA over InfiniBand, Int'l Conference on Parallel Processing (ICPP '13), October 2013.
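"RDMA or send/recv based adaptive communication" typically means picking the transfer mode per message: small RPC messages go over pre-posted send/recv buffers, large payloads over RDMA. A sketch of that decision; the 8 KB threshold is an illustrative assumption, not the value used in the paper:

```python
EAGER_LIMIT = 8 * 1024   # hypothetical small-message threshold
log = []

def send_eager(payload):
    # Small message: copy into a pre-posted send/recv bounce buffer
    log.append(("send_recv", len(payload)))

def send_rdma(payload):
    # Large message: zero-copy RDMA write from the user buffer
    log.append(("rdma_write", len(payload)))

def rpc_send(payload):
    (send_eager if len(payload) <= EAGER_LIMIT else send_rdma)(payload)

rpc_send(b"heartbeat")        # tiny control message
rpc_send(b"x" * (1 << 20))    # 1 MB serialized response
assert log == [("send_recv", 9), ("rdma_write", 1048576)]
```

The trade-off: eager send/recv avoids rendezvous overhead for small messages, while RDMA avoids copies for large ones.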
Performance Benefits – RandomWriter & TeraGen in TACC-Stampede
• Cluster with 32 nodes with a total of 128 maps
• RandomWriter: 3-4x improvement over IPoIB for 80-120 GB file size (execution time reduced by 3x)
• TeraGen: 4-5x improvement over IPoIB for 80-120 GB file size (execution time reduced by 4x)
[Charts: execution time (s) vs. data size (80-120 GB) for RandomWriter and TeraGen, comparing IPoIB (FDR) with the OSU RDMA design]
Performance Benefits – Sort & TeraSort in TACC-Stampede
• Sort with single HDD per node: 40-52% improvement over IPoIB for 80-120 GB data (reduced by 52%); cluster with 32 nodes with a total of 128 maps and 64 reduces
• TeraSort with single HDD per node: 42-44% improvement over IPoIB for 80-120 GB data (reduced by 44%); cluster with 32 nodes with a total of 128 maps and 57 reduces
[Charts: execution time (s) vs. data size (80-120 GB), IPoIB (FDR) vs. OSU-IB (FDR)]
Evaluation of HHH and HHH-L with Applications
• MR-MSPolygraph on OSU RI with 1,000 maps
  – HHH-L reduces the execution time by 79% over Lustre, 30% over HDFS
• CloudBurst on TACC Stampede
  – With HHH: 19% improvement over HDFS (60.24 s with HDFS (FDR) vs. 48.3 s with HHH (FDR))
[Chart: execution time (s) vs. concurrent maps per host (4, 6, 8) for HDFS, Lustre, and HHH-L; reduced by 79%]
Evaluation with Spark on SDSC Gordon (HHH vs. Tachyon/Alluxio)
• For 200 GB TeraGen on 32 nodes
  – Spark-TeraGen: HHH has 2.4x improvement over Tachyon; 2.3x over HDFS-IPoIB (QDR)
  – Spark-TeraSort: HHH has 25.2% improvement over Tachyon; 17% over HDFS-IPoIB (QDR)
[Charts: execution time (s) vs. cluster size : data size (8:50, 16:100, 32:200 GB) for TeraGen and TeraSort, comparing IPoIB (QDR), Tachyon, and OSU-IB (QDR); TeraGen reduced by 2.4x, TeraSort by 25.2%]
N. Islam, M. W. Rahman, X. Lu, D. Shankar, and D. K. Panda, Performance Characterization and Acceleration of In-Memory File Systems for Hadoop and Spark Applications on HPC Clusters, IEEE BigData '15, October 2015
Design Overview of Shuffle Strategies for MapReduce over Lustre
• Design features
  – Two shuffle approaches
    • Lustre read based shuffle
    • RDMA based shuffle
  – Hybrid shuffle algorithm to take benefit from both shuffle approaches
  – Dynamically adapts to the better shuffle approach for each shuffle request, based on profiling values for each Lustre read operation
  – In-memory merge and overlapping of different phases are kept similar to the RDMA-enhanced MapReduce design
[Figure: Map1/Map2/Map3 write to the intermediate data directory on Lustre; Reduce1/Reduce2 obtain map output via Lustre read or RDMA, then in-memory merge/sort and reduce]
M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, High Performance Design of YARN MapReduce on Modern HPC Clusters with Lustre and RDMA, IPDPS, May 2015
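The hybrid decision can be sketched as: keep a running profile of Lustre read latency and, per shuffle request, pick whichever path the profile says is currently cheaper. The threshold and latency figures below are illustrative assumptions, not the paper's profiling model.

```python
lustre_samples = []   # profiled latencies of recent Lustre reads (ms)

def record_lustre_read(latency_ms):
    lustre_samples.append(latency_ms)

def choose_shuffle(rdma_latency_ms=2.0):
    # Pick the better approach for this shuffle request
    if not lustre_samples:
        return "lustre_read"          # no profile yet; try the file system
    avg = sum(lustre_samples) / len(lustre_samples)
    return "lustre_read" if avg < rdma_latency_ms else "rdma"

assert choose_shuffle() == "lustre_read"
record_lustre_read(1.2)               # Lustre fast and uncontended
assert choose_shuffle() == "lustre_read"
record_lustre_read(9.5)               # contention pushes latency up
record_lustre_read(8.7)
assert choose_shuffle() == "rdma"     # adapt: fetch over the network
```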
Performance Improvement of MapReduce over Lustre on TACC-Stampede
• Local disk is used as the intermediate data directory
• For 500 GB Sort in 64 nodes: 44% improvement over IPoIB (FDR)
• For 640 GB Sort in 128 nodes: 48% improvement over IPoIB (FDR)
[Charts: job execution time (sec) vs. data size (300-500 GB), and vs. cluster size (4-128 nodes, 20-640 GB), IPoIB (FDR) vs. OSU-IB (FDR); reduced by 44% and 48%]
M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, MapReduce over Lustre: Can RDMA-based Approach Benefit?, Euro-Par, August 2014.
Case Study – Performance Improvement of MapReduce over Lustre on SDSC-Gordon
• Lustre is used as the intermediate data directory
• For 80 GB Sort in 8 nodes: 34% improvement over IPoIB (QDR)
• For 120 GB TeraSort in 16 nodes: 25% improvement over IPoIB (QDR)
[Charts: job execution time (sec) vs. data size (GB), comparing IPoIB (QDR), OSU-Lustre-Read (QDR), OSU-RDMA-IB (QDR), and OSU-Hybrid-IB (QDR); reduced by 34% (Sort) and 25% (TeraSort)]
HBase-RDMA Design Overview
• Enables high-performance RDMA communication, while supporting traditional socket interface
• JNI layer bridges Java-based HBase with communication library written in native code
[Figure: Applications -> HBase -> Java socket interface over 1/10/40/100 GigE / IPoIB networks, or JNI -> OSU-IB design -> IB verbs -> RDMA-capable networks (IB, iWARP, RoCE, ...)]
HBase – YCSB Read-Write Workload
• HBase Get latency (QDR, 10 GigE)
  – 64 clients: 2.0 ms; 128 clients: 3.5 ms
  – 42% improvement over IPoIB for 128 clients
• HBase Put latency (QDR, 10 GigE)
  – 64 clients: 1.9 ms; 128 clients: 3.5 ms
  – 40% improvement over IPoIB for 128 clients
[Charts: read and write latency (us) vs. number of clients (8-128), comparing OSU-IB (QDR), IPoIB (QDR), and 10 GigE]
J. Huang, X. Ouyang, J. Jose, M. W. Rahman, H. Wang, M. Luo, H. Subramoni, Chet Murthy, and D. K. Panda, High-Performance Design of HBase with RDMA over InfiniBand, IPDPS '12
HBase – YCSB Get Latency and Throughput on SDSC-Comet
• HBase Get average latency (FDR)
  – 4 client threads: 38 us
  – 59% improvement over IPoIB for 4 client threads
• HBase Get total throughput
  – 4 client threads: 102K ops/sec
  – 2.4x improvement over IPoIB for 4 client threads
[Charts: average latency (ms) and total throughput (K ops/sec) vs. number of client threads (1-4), IPoIB (FDR) vs. OSU-IB (FDR); 59% latency reduction, 2.4x throughput]
Design Overview of Spark with RDMA
• Enables high-performance RDMA communication, while supporting traditional socket interface
• JNI layer bridges Scala-based Spark with communication library written in native code
• Design features
  – RDMA based shuffle
  – SEDA-based plugins
  – Dynamic connection management and sharing
  – Non-blocking data transfer
  – Off-JVM-heap buffer management
  – InfiniBand/RoCE support
[Figure: Spark applications (Scala/Java/Python) and tasks over the BlockManager / BlockFetcherIterator; shuffle server and shuffle fetcher implemented as Java NIO (default), Netty (optional), or RDMA (plug-in); Java sockets over 1/10 Gig Ethernet / IPoIB (QDR/FDR) network, or the RDMA-based shuffle engine (Java/JNI) over native InfiniBand (QDR/FDR)]
X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, Int'l Symposium on High Performance Interconnects (HotI '14), August 2014
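A SEDA (staged event-driven architecture) shuffle splits the work into stages connected by queues, so each stage proceeds concurrently on different blocks. A toy two-stage sketch of the pattern (stage functions and names are invented for illustration):

```python
import queue
import threading

def stage(worker, inbox, outbox):
    # One SEDA stage: its own thread draining its own queue
    def run():
        while (item := inbox.get()) is not None:
            outbox.put(worker(item))
        outbox.put(None)                 # propagate shutdown marker
    t = threading.Thread(target=run)
    t.start()
    return t

recv_q, decode_q, done_q = queue.Queue(), queue.Queue(), queue.Queue()
threads = [
    stage(lambda b: b.decode(), recv_q, decode_q),    # decode stage
    stage(lambda s: ("stored", s), decode_q, done_q)  # store stage
]
for block in [b"shuffle-0", b"shuffle-1"]:
    recv_q.put(block)                    # blocks "arriving" off the wire
recv_q.put(None)

results = []
while (r := done_q.get()) is not None:
    results.append(r)
for t in threads:
    t.join()
assert results == [("stored", "shuffle-0"), ("stored", "shuffle-1")]
```

The queues decouple stages, which is what lets receive, decode, and store overlap instead of serializing per block.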
Performance Evaluation on SDSC Comet – SortBy/GroupBy
• InfiniBand FDR, SSD, 64 worker nodes, 1536 cores (1536M 1536R)
• RDMA-based design for Spark 1.5.1
• RDMA vs. IPoIB with 1536 concurrent tasks, single SSD per node
  – SortBy: total time reduced by up to 80% over IPoIB (56 Gbps)
  – GroupBy: total time reduced by up to 57% over IPoIB (56 Gbps)
[Charts: total time (sec) vs. data size (64-256 GB) for SortBy and GroupBy tests on 64 worker nodes / 1536 cores, IPoIB vs. RDMA; 80% and 57% reductions]
Performance Evaluation on SDSC Comet – HiBench Sort/TeraSort
• InfiniBand FDR, SSD, 64 worker nodes, 1536 cores (1536M 1536R)
• RDMA-based design for Spark 1.5.1
• RDMA vs. IPoIB with 1536 concurrent tasks, single SSD per node
  – Sort: total time reduced by 38% over IPoIB (56 Gbps)
  – TeraSort: total time reduced by 15% over IPoIB (56 Gbps)
[Charts: total time (sec) vs. data size (64-256 GB) for Sort and TeraSort on 64 worker nodes / 1536 cores, IPoIB vs. RDMA; 38% and 15% reductions]
Performance Evaluation on SDSC Comet – HiBench PageRank
• InfiniBand FDR, SSD, 32/64 worker nodes, 768/1536 cores (768/1536M 768/1536R)
• RDMA-based design for Spark 1.5.1
• RDMA vs. IPoIB with 768/1536 concurrent tasks, single SSD per node
  – 32 nodes / 768 cores: total time reduced by 37% over IPoIB (56 Gbps)
  – 64 nodes / 1536 cores: total time reduced by 43% over IPoIB (56 Gbps)
[Charts: total time (sec) vs. data size (Huge, BigData, Gigantic) for PageRank on 32 and 64 worker nodes, IPoIB vs. RDMA; 37% and 43% reductions]
Memcached-RDMA Design
• Server and client perform a negotiation protocol
  – Master thread assigns clients to the appropriate worker thread
• Once a client is assigned a verbs worker thread, it can communicate directly and is "bound" to that thread
• All other Memcached data structures are shared among RDMA and sockets worker threads
• Memcached server can serve both socket and verbs clients simultaneously
• Memcached applications need not be modified; the verbs interface is used if available
[Figure: sockets clients and RDMA clients contact the master thread, which binds them to sockets worker threads or verbs worker threads; all workers operate on shared data (memory slabs, items, ...)]
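The master-thread dispatch described above can be sketched as follows; worker and state names are illustrative, and the negotiation is reduced to a boolean for brevity.

```python
import itertools

shared_data = {}                     # slabs/items shared by all workers

class Worker:
    def __init__(self, kind, wid):
        self.kind, self.wid = kind, wid
        self.clients = []
    def handle_set(self, key, value):
        shared_data[key] = value     # every worker sees the same store

verbs_workers = [Worker("verbs", i) for i in range(2)]
sockets_workers = [Worker("sockets", i) for i in range(2)]
rr = {"verbs": itertools.cycle(verbs_workers),
      "sockets": itertools.cycle(sockets_workers)}

def master_accept(client_id, wants_rdma):
    # Negotiation outcome decides the pool; round-robin within it.
    # After this, the client talks to its worker directly ("bound").
    pool = "verbs" if wants_rdma else "sockets"
    worker = next(rr[pool])
    worker.clients.append(client_id)
    return worker

w1 = master_accept("clientA", wants_rdma=True)
w2 = master_accept("clientB", wants_rdma=False)
w1.handle_set("k", 1)
assert w1.kind == "verbs" and w2.kind == "sockets"
assert shared_data["k"] == 1         # data structures are shared
```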
RDMA-Memcached Performance (FDR Interconnect)
• Memcached Get latency
  – 4 bytes: OSU-IB 2.84 us; IPoIB 75.53 us
  – 2K bytes: OSU-IB 4.49 us; IPoIB 123.42 us
  – Latency reduced by nearly 20x
• Memcached throughput (4 bytes)
  – 4080 clients: OSU-IB 556K ops/sec, IPoIB 233K ops/sec
  – Nearly 2x improvement in throughput
• Experiments on TACC Stampede (Intel Sandy Bridge cluster, IB: FDR)
[Charts: Get latency (us) vs. message size (1 B - 4 KB), and throughput (thousands of transactions per second) vs. number of clients (16-4080), OSU-IB (FDR) vs. IPoIB (FDR)]
Micro-benchmark Evaluation for OLDP Workloads
• Illustration with read-cache-read access pattern using modified mysqlslap load testing tool
• Memcached-RDMA can
  – improve query latency by up to 66% over IPoIB (32 Gbps)
  – improve throughput by up to 69% over IPoIB (32 Gbps)
[Charts: latency (sec) and throughput (Kq/s) vs. number of clients (64-400), Memcached-IPoIB (32 Gbps) vs. Memcached-RDMA (32 Gbps); latency reduced by 66%]
D. Shankar, X. Lu, J. Jose, M. W. Rahman, N. Islam, and D. K. Panda, Can RDMA Benefit On-Line Data Processing Workloads with Memcached and MySQL, ISPASS '15
Performance Benefits of Hybrid Memcached (Memory + SSD) on SDSC-Gordon
• ohb_memlat & ohb_memthr latency & throughput micro-benchmarks
• Memcached-RDMA can
  – improve query latency by up to 70% over IPoIB (32 Gbps)
  – improve throughput by up to 2x over IPoIB (32 Gbps)
  – incur no overhead in hybrid mode when all data fits in memory
[Charts: average latency (us) vs. message size, and throughput (million trans/sec) vs. number of clients (64-1024), comparing IPoIB (32 Gbps), RDMA-Mem (32 Gbps), and RDMA-Hybrid (32 Gbps); 2x throughput]
Performance Evaluation on IB FDR + SATA/NVMe SSDs
• Memcached latency test with Zipf distribution; server with 1 GB memory, 32 KB key-value pair size; total size of data accessed is 1 GB (when data fits in memory) and 1.5 GB (when data does not fit in memory)
• When data fits in memory: RDMA-Mem/Hybrid gives 5x improvement over IPoIB-Mem
• When data does not fit in memory: RDMA-Hybrid gives 2x-2.5x over IPoIB/RDMA-Mem
[Chart: Set/Get latency (us) for IPoIB-Mem, RDMA-Mem, RDMA-Hybrid-SATA, and RDMA-Hybrid-NVMe, for data that fits and does not fit in memory; bars broken down into slab allocation (SSD write), cache check + load (SSD read), cache update, server response, client wait, and miss penalty]
Accelerating Hybrid Memcached with RDMA, Non-blocking Extensions and SSDs
• RDMA-accelerated communication for Memcached Get/Set
• Hybrid 'RAM + SSD' slab management for higher data retention
• Non-blocking API extensions
  – memcached_(iset/iget/bset/bget/test/wait)
  – Achieve near in-memory speeds while hiding bottlenecks of network and SSD I/O
  – Ability to exploit communication/computation overlap
  – Optional buffer re-use guarantees
• Adaptive slab manager with different I/O schemes for higher throughput
[Figure: client's libMemcached library issues blocking and non-blocking API requests/replies through the RDMA-enhanced communication library to the server's hybrid slab manager (RAM + SSD)]
D. Shankar, X. Lu, N. S. Islam, M. W. Rahman, and D. K. Panda, High-Performance Hybrid Key-Value Store on Modern Clusters with RDMA Interconnects and SSDs: Non-blocking Extensions, Designs, and Benefits, IPDPS, May 2016
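The iset/iget + test/wait semantics are issue-then-overlap-then-complete. The sketch below mimics only those semantics with a thread pool (the real extensions overlap RDMA and SSD I/O in native code; the function bodies here are invented stand-ins).

```python
from concurrent.futures import ThreadPoolExecutor
import time

pool = ThreadPoolExecutor(max_workers=4)
store = {}

def slow_backend_set(key, value):
    time.sleep(0.01)                 # stands in for network + SSD latency
    store[key] = value
    return True

def memcached_iset(key, value):
    # Non-blocking set: returns immediately with a completion handle
    return pool.submit(slow_backend_set, key, value)

def memcached_test(handle):
    # Poll: has the operation completed?
    return handle.done()

def memcached_wait(handle):
    # Block until the operation completes
    return handle.result()

h = memcached_iset("k", 42)
overlap_result = sum(range(1000))    # useful work while the set is in flight
assert memcached_wait(h) is True
assert store["k"] == 42
pool.shutdown(wait=True)
```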
Performance Evaluation with Non-Blocking Memcached API
• Data does not fit in memory: non-blocking Memcached Set/Get API extensions can achieve
  – >16x latency improvement vs. blocking API over RDMA-Hybrid/RDMA-Mem w/ penalty
  – >2.5x throughput improvement vs. blocking API over default/optimized RDMA-Hybrid
• Data fits in memory: non-blocking extensions perform similar to RDMA-Mem/RDMA-Hybrid, and >3.6x improvement over IPoIB-Mem
[Chart: average Set/Get latency (us) for IPoIB-Mem, RDMA-Mem, H-RDMA-Def, H-RDMA-Opt-Block, H-RDMA-Opt-NonB-i, and H-RDMA-Opt-NonB-b; bars broken down into miss penalty (backend DB access overhead), client wait, server response, cache update, cache check + load (memory and/or SSD read), and slab allocation (w/ SSD write on out-of-memory). H = hybrid Memcached over SATA SSD; Opt = adaptive slab manager; Block = default blocking API; NonB-i = non-blocking iset/iget API; NonB-b = non-blocking bset/bget API w/ buffer re-use guarantee]
• Design Features
– Memcached-based burst-buffer system
• Hides latency of parallel file system access
• Read from local storage and Memcached
– Data locality achieved by writing data to local storage
– Different approaches of integration with parallel file system to guarantee fault-tolerance
Accelerating I/O Performance of Big Data Analytics through RDMA-based Key-Value Store
[Figure: Architecture of the Memcached-based burst buffer system — Map/Reduce tasks in the application write through an I/O forwarding module to the DataNode's local disk (data locality) and to the burst buffer, which is integrated with Lustre for fault-tolerance.]
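A minimal sketch of the design features listed above, assuming a hypothetical file layout: writes land on local storage (for data locality) and in a Memcached-like cache (for fast re-reads), a flush step copies data to a directory standing in for the parallel file system (fault-tolerance), and reads try the cache first before falling back. This is an illustration of the data path only, not the actual HDFS/RDMA implementation.

```python
import os
import shutil
import tempfile

class BurstBuffer:
    """Sketch of the burst-buffer write/read path (hypothetical layout)."""
    def __init__(self, local_dir, pfs_dir):
        self.cache = {}              # stands in for Memcached (RAM)
        self.local_dir = local_dir   # local disk: gives data locality
        self.pfs_dir = pfs_dir       # stands in for Lustre
    def write(self, name, payload):
        path = os.path.join(self.local_dir, name)
        with open(path, "wb") as f:  # 1) local write for locality
            f.write(payload)
        self.cache[name] = payload   # 2) cached for fast re-reads
        return path
    def flush(self, name):
        # 3) copy to the parallel FS so data survives node failure
        shutil.copy(os.path.join(self.local_dir, name),
                    os.path.join(self.pfs_dir, name))
    def read(self, name):
        if name in self.cache:       # fast path: Memcached hit
            return self.cache[name]
        for d in (self.local_dir, self.pfs_dir):  # local disk, then Lustre
            path = os.path.join(d, name)
            if os.path.exists(path):
                with open(path, "rb") as f:
                    return f.read()
        raise FileNotFoundError(name)

local, pfs = tempfile.mkdtemp(), tempfile.mkdtemp()
bb = BurstBuffer(local, pfs)
bb.write("part-0", b"intermediate data")
bb.flush("part-0")
bb.cache.clear()                     # simulate cache eviction
print(bb.read("part-0"))             # → b'intermediate data' (from local disk)
```

The "different approaches of integration" bullet corresponds to when `flush` runs (synchronously, lazily, or in the background), which trades write latency against the fault-tolerance window.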
Evaluation with PUMA Workloads
Gains on OSU RI with our approach (Mem-bb) on 24 nodes:
• SequenceCount: 34.5% over Lustre, 40% over HDFS
• RankedInvertedIndex: 27.3% over Lustre, 48.3% over HDFS
• HistogramRating: 17% over Lustre, 7% over HDFS
[Figure: Execution time (s) of the SeqCount, RankedInvIndex, and HistoRating workloads under HDFS (32 Gbps), Lustre (32 Gbps), and Mem-bb (32 Gbps), annotated with improvements of 40%, 48.3%, and 17%, respectively.]
N. S. Islam, D. Shankar, X. Lu, M. W. Rahman, and D. K. Panda, Accelerating I/O Performance of Big Data Analytics with RDMA-based Key-Value Store, ICPP '15, September 2015
• RDMA-based Designs and Performance Evaluation
– HDFS
– MapReduce
– RPC
– HBase
– Spark
– Memcached (Basic, Hybrid and Non-blocking APIs)
– HDFS + Memcached-based Burst Buffer
– OSU HiBD Benchmarks (OHB)
Acceleration Case Studies and Performance Evaluation
• The current benchmarks provide some insight into overall performance behavior
• However, they do not provide any information to the designer/developer on:
– What is happening at the lower layer?
– Where are the benefits coming from?
– Which design is leading to benefits or bottlenecks?
– Which component in the design needs to be changed, and what will its impact be?
– Can performance gain/loss at the lower layer be correlated to the performance gain/loss observed at the upper layer?
Are the Current Benchmarks Sufficient for Big Data?
[Figure: Layered stack diagram — Applications over Big Data Middleware (HDFS, MapReduce, HBase, Spark and Memcached), over a Communication and I/O Library (point-to-point communication, threaded models and synchronization, virtualization, I/O and file systems, QoS, fault-tolerance), Programming Models (Sockets; other protocols?), Networking Technologies (InfiniBand, 1/10/40/100 GigE and intelligent NICs), RDMA protocols, Commodity Computing System Architectures (multi- and many-core architectures and accelerators), and Storage Technologies (HDD, SSD, and NVMe-SSD). Current benchmarks cover only the upper layers; no benchmarks exist for the lower layers — can the two be correlated?]
Challenges in Benchmarking of RDMA-based Designs
• A comprehensive suite of benchmarks to
– Compare performance of different MPI libraries on various networks and systems
– Validate low-level functionalities
– Provide insights into the underlying MPI-level designs
• Started with basic send-recv (MPI-1) micro-benchmarks for latency, bandwidth and bi-directional bandwidth
• Extended later to
– MPI-2 one-sided
– Collectives
– GPU-aware data movement
– OpenSHMEM (point-to-point and collectives)
– UPC
• Has become an industry standard
• Extensively used for design/development of MPI libraries, performance comparison of MPI libraries, and even in procurement of large-scale systems
• Available from http://mvapich.cse.ohio-state.edu/benchmarks
• Available in an integrated manner with the MVAPICH2 stack
OSU MPI Micro-Benchmarks (OMB) Suite
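The basic send-recv latency benchmark mentioned above follows the classic ping-pong pattern: run untimed warm-up iterations, time many round trips, and report half the average round-trip time as the one-way latency. The sketch below reproduces that measurement loop over a local socket pair in Python; it is an illustration of the pattern, not OMB itself (osu_latency is written in C against MPI_Send/MPI_Recv).

```python
import socket
import threading
import time

MSG_SIZE, WARMUP, ITERS = 64, 10, 100

def recv_exact(sock, n):
    """Receive exactly n bytes, handling partial reads."""
    buf = b""
    while len(buf) < n:
        buf += sock.recv(n - len(buf))
    return buf

def echo_server(conn):
    # Peer side of the ping-pong: receive each message, send it right back.
    for _ in range(WARMUP + ITERS):
        conn.sendall(recv_exact(conn, MSG_SIZE))

a, b = socket.socketpair()
t = threading.Thread(target=echo_server, args=(b,))
t.start()

msg = b"x" * MSG_SIZE
for _ in range(WARMUP):          # warm-up iterations are not timed
    a.sendall(msg)
    recv_exact(a, MSG_SIZE)

t0 = time.perf_counter()
for _ in range(ITERS):           # timed ping-pong loop
    a.sendall(msg)
    reply = recv_exact(a, MSG_SIZE)
elapsed = time.perf_counter() - t0
t.join()
a.close()
b.close()

# One-way latency = half the average round-trip time.
latency_us = elapsed / ITERS / 2 * 1e6
print(f"{MSG_SIZE} bytes: {latency_us:.2f} us")
```

The same skeleton generalizes to the bandwidth test (send a window of messages before waiting for an acknowledgement) and to sweeping over message sizes, which is how the suite produces its familiar latency/bandwidth curves.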
[Figure: The same layered stack diagram, now annotated with Applications-Level Benchmarks at the top and Micro-Benchmarks at the bottom, indicating an iterative process between the two levels.]
Iterative Process – Requires Deeper Investigation and Design for Benchmarking Next Generation Big Data Systems and Applications
• HDFS Benchmarks
– Sequential Write Latency (SWL) Benchmark
– Sequential Read Latency (SRL) Benchmark
– Random Read Latency (RRL) Benchmark
– Sequential Write Throughput (SWT) Benchmark
– Sequential Read Throughput (SRT) Benchmark
• Memcached Benchmarks
– Get Benchmark
– Set Benchmark
– Mixed Get/Set Benchmark
• Available as a part of OHB 0.8
OSU HiBD Benchmarks (OHB)
N. S. Islam, X. Lu, M. W. Rahman, J. Jose, and D. K. Panda, A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters, Int'l Workshop on Big Data Benchmarking (WBDB '12), December 2012
D. Shankar, X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, A Micro-Benchmark Suite for Evaluating Hadoop MapReduce on High-Performance Networks, BPOE-5, 2014
X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, A Micro-Benchmark Suite for Evaluating Hadoop RPC on High-Performance Networks, Int'l Workshop on Big Data Benchmarking (WBDB '13), July 2013
To be Released
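The core of a Mixed Get/Set benchmark like the one listed above is a driver that pre-populates the store, issues operations in a configurable get/set ratio, and reports per-operation latency. A hedged Python sketch of such a driver follows; the `get_fraction` parameter, key count, and dict-backed store are illustrative stand-ins (OHB drives a real Memcached server over the network).

```python
import random
import statistics
import time

def mixed_benchmark(store, n_ops=1000, get_fraction=0.9,
                    value=b"v" * 64, seed=1):
    """Issue a get/set mix against `store`; return mean latency per op (s)."""
    rng = random.Random(seed)
    keys = [f"key-{i}" for i in range(100)]
    for k in keys:                       # pre-populate so gets always hit
        store[k] = value
    latencies = []
    for _ in range(n_ops):
        k = rng.choice(keys)
        t0 = time.perf_counter()
        if rng.random() < get_fraction:
            _ = store[k]                 # Get
        else:
            store[k] = value             # Set
        latencies.append(time.perf_counter() - t0)
    return statistics.mean(latencies)

# 90% gets / 10% sets, a common read-heavy caching mix.
avg = mixed_benchmark({}, get_fraction=0.9)
print(f"avg latency: {avg * 1e6:.2f} us/op")
```

Sweeping `get_fraction` from 0.0 (pure Set) to 1.0 (pure Get) reproduces the separate Set and Get benchmarks as end points of the same driver.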
• Upcoming releases of RDMA-enhanced packages will support
– HBase
– Impala
• Upcoming releases of OSU HiBD Micro-Benchmarks (OHB) will support
– MapReduce
– RPC
• Advanced designs with upper-level changes and optimizations
– Memcached with Non-blocking API
– HDFS + Memcached-based Burst Buffer
On-going and Future Plans of OSU High Performance Big Data (HiBD) Project
• Discussed challenges in accelerating Hadoop, Spark and Memcached
• Presented initial designs to take advantage of InfiniBand/RDMA for HDFS, MapReduce, RPC, Spark, and Memcached
• Results are promising
• Many other open issues remain to be solved
• Will enable the Big Data community to take advantage of modern HPC technologies to carry out their analytics in a fast and scalable manner
Concluding Remarks
Funding Acknowledgments
Funding Support by
Equipment Support by
Personnel Acknowledgments
Current Students: A. Augustine (M.S.), A. Awan (Ph.D.), S. Chakraborthy (Ph.D.), C.-H. Chu (Ph.D.), N. Islam (Ph.D.), K. Kulkarni (M.S.), M. Li (Ph.D.), M. Rahman (Ph.D.), D. Shankar (Ph.D.), A. Venkatesh (Ph.D.), J. Zhang (Ph.D.)
Past Students: P. Balaji (Ph.D.), S. Bhagvat (M.S.), A. Bhat (M.S.), D. Buntinas (Ph.D.), L. Chai (Ph.D.), B. Chandrasekharan (M.S.), N. Dandapanthula (M.S.), V. Dhanraj (M.S.), T. Gangadharappa (M.S.), K. Gopalakrishnan (M.S.), W. Huang (Ph.D.), W. Jiang (M.S.), J. Jose (Ph.D.), K. Kandalla (Ph.D.), S. Kini (M.S.), M. Koop (Ph.D.), S. Krishnamoorthy (M.S.), R. Kumar (M.S.), P. Lai (M.S.), J. Liu (Ph.D.), M. Luo (Ph.D.), A. Mamidala (Ph.D.), G. Marsh (M.S.), V. Meshram (M.S.), A. Moody (M.S.), S. Naravula (Ph.D.), R. Noronha (Ph.D.), X. Ouyang (Ph.D.), S. Pai (M.S.), S. Potluri (Ph.D.), R. Rajachandrasekar (Ph.D.), G. Santhanaraman (Ph.D.), A. Singh (Ph.D.), J. Sridhar (M.S.), H. Subramoni (Ph.D.), S. Sur (Ph.D.), K. Vaidyanathan (Ph.D.), A. Vishnu (Ph.D.), J. Wu (Ph.D.), W. Yu (Ph.D.)
Current Research Scientists: H. Subramoni, X. Lu
Current Senior Research Associate: K. Hamidouche
Past Research Scientist: S. Sur
Current Post-Docs: J. Lin, D. Banerjee
Past Post-Docs: H. Wang, X. Besseron, H.-W. Jin, M. Luo, E. Mancini, S. Marcarelli, J. Vienne
Current Programmer: J. Perkins
Past Programmer: D. Bureddy
Current Research Specialist: M. Arnold
Second International Workshop on High-Performance Big Data Computing (HPBDC)
HPBDC 2016 will be held with the IEEE International Parallel and Distributed Processing Symposium (IPDPS 2016), Chicago, Illinois, USA, May 27th, 2016
Keynote Talk: Dr. Chaitanya Baru, Senior Advisor for Data Science, National Science Foundation (NSF); Distinguished Scientist, San Diego Supercomputer Center (SDSC)
Panel Moderator: Jianfeng Zhan (ICT/CAS)
Panel Topic: Merge or Split: Mutual Influence between Big Data and HPC Techniques
Six Regular Research Papers and Two Short Research Papers
http://web.cse.ohio-state.edu/~luxi/hpbdc2016
HPBDC 2015 was held in conjunction with ICDCS '15
http://web.cse.ohio-state.edu/~luxi/hpbdc2015
Thank You!
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/