D2.4 The NEXTGenIO Architecture
WP2 "Architecture"

Project Acronym: NEXTGenIO
Project Name: Next Generation I/O for Exascale
Project Number: 671591
Instrument: Research and Innovation Action
Thematic Priority: FETHPC-1-2014 HPC Core Technologies, Programming Environments and Algorithms for Extreme Parallelism and Extreme Data Applications
Due date: 30/09/2017
Submission date: 29/09/2017
Project start date: 1st October 2015
Project duration: 36 months
Deliverable lead organisation: UEDIN
Status: Final
Version: 1.0
Author(s): Adrian Jackson, Iakovos Panourgias (UEDIN), Bernhard Homölle (FUJITSU), Alberto Miranda, Ramon Nou, Javier Conejero (BSC), Marcelo Cintra (INTEL), Simon Smart, Antonino Bonanni (ECMWF), Holger Brunst, Christian Herold, Muhammad Sarim Zafar (TUD)
Reviewer(s): Hans-Christian Hoppe (INTEL), Michèle Weiland (UEDIN), Bernhard Homölle (FUJITSU)
Dissemination level: PU (Public)
Copyright © 2017 NEXTGenIO Consortium Partners
Version History

Version  Date      Comments, Changes, Status                                                            Authors & contributors
0.1      14/08/17  Initial draft created                                                                Adrian Jackson (UEDIN)
0.2      01/09/17  Review draft finished                                                                Adrian Jackson (UEDIN)
0.3      04/09/17  Fixed layout and incorporated Christian's changes                                    Adrian Jackson (UEDIN), Christian Herold (TUD)
0.4      12/09/17  Update of systemware architectures, correcting some mistakes and re-ordering sections  Adrian Jackson (UEDIN)
1.0      22/09/17  Incorporating review comments                                                        Adrian Jackson, Michèle Weiland (UEDIN), Bernhard Homölle (FUJITSU), Hans-Christian Hoppe, Marcelo Cintra (INTEL)
Table of Contents

1 Executive Summary  7
2 Introduction  8
  2.1 Glossary  8
3 Non-volatile Memory  12
  3.1 SCM modes of operation  13
  3.2 Non-volatile memory software ecosystem  14
  3.3 Alternatives to SCM  14
4 Requirements  16
5 Hardware Architecture  17
  5.1 Hardware building blocks  17
    5.1.1 Main server node  17
    5.1.2 Ethernet switch  19
    5.1.3 Network switch  19
    5.1.4 Login nodes  19
    5.1.5 Boot nodes  19
    5.1.6 Gateway nodes  20
  5.2 Prototype considerations  20
    5.2.1 RDMA options for the prototype  20
    5.2.2 Applicability of the architecture without access to SCM  21
  5.3 System configuration  21
  5.4 ExaFLOP vision  23
    5.4.1 Scalability to an ExaFLOP and beyond  24
    5.4.2 Scalability into 100 PFLOP/s  24
6 Systemware Architecture  25
  6.1 Systemware Overview  25
  6.2 Login Nodes  26
    6.2.1 Communication Libraries  27
    6.2.2 Compilers  27
    6.2.3 I/O Libraries  28
    6.2.4 Java  28
    6.2.5 Job Launcher  28
    6.2.6 Job Scheduler  28
    6.2.7 Local Filesystem  29
    6.2.8 Meta Scheduler  29
    6.2.9 Object Store  29
    6.2.10 Operating System  29
    6.2.11 Performance Monitoring  29
    6.2.12 Performance Visualisation and Debugging Tools  30
    6.2.13 Python  30
    6.2.14 Scheduling Tools  30
    6.2.15 Scientific Libraries  30
    6.2.16 Task-based programming model  31
  6.3 Gateway Nodes  31
    6.3.1 Operating System  31
  6.4 Boot Nodes  31
    6.4.1 Operating System  32
    6.4.2 Management/Monitoring Software  32
    6.4.3 PXE-Boot Server  32
    6.4.4 DHCP Server  33
    6.4.5 DNS Server  33
    6.4.6 Network Management  33
    6.4.7 Local Filesystem  33
  6.5 Complete Compute Nodes  33
    6.5.1 Automatic Check-pointer  34
    6.5.2 Communication Libraries  34
    6.5.3 Data Movers  35
    6.5.4 Data Scheduler  35
    6.5.5 I/O Libraries  35
    6.5.6 Java  35
    6.5.7 Job Scheduler  35
    6.5.8 Management/Monitoring Software  35
    6.5.9 Multi-node SCM Filesystem  36
    6.5.10 Object Store  36
    6.5.11 Operating System  36
    6.5.12 Performance Monitoring  36
    6.5.13 Persistent Memory Access Library  37
    6.5.14 Python  37
    6.5.15 SCM Driver  37
    6.5.16 SCM Local Filesystem  37
    6.5.17 Task-based Runtime Environment  37
    6.5.18 Volatile Memory Access Library  38
7 Usage Patterns  39
  7.1 General Purpose HPC system using multi-node SCM filesystem  39
    7.1.1 Description  39
    7.1.2 Components  39
    7.1.3 Data lifecycle  40
  7.2 General Purpose HPC system improving throughput  40
    7.2.1 Description  40
    7.2.2 Components  40
    7.2.3 Data lifecycle  40
  7.3 Resilience in specialist HPC system  41
    7.3.1 Description  41
    7.3.2 Components  41
    7.3.3 Data lifecycle  41
  7.4 Large Memory  42
    7.4.1 Description  42
    7.4.2 Components  42
    7.4.3 Data lifecycle  42
  7.5 Workflows  42
    7.5.1 Description  42
    7.5.2 Components  43
    7.5.3 Data lifecycle  43
  7.6 Data Analytics  43
    7.6.1 Description  43
    7.6.2 Components  44
    7.6.3 Data lifecycle  44
  7.7 Live process pausing and migration  44
    7.7.1 Description  44
    7.7.2 Components  45
    7.7.3 Data lifecycle  45
  7.8 Object Store  45
    7.8.1 Description  45
    7.8.2 Components  45
    7.8.3 Data lifecycle  46
  7.9 Weather forecasting component overview  46
    7.9.1 Description  46
    7.9.2 Components  46
    7.9.3 Data lifecycle  47
  7.10 Distributed workflow management  47
    7.10.1 Description  47
    7.10.2 Components  47
    7.10.3 Data lifecycle  47
8 Conclusions  49
9 References  50
10 Appendix  51
Table of Figures

Figure 1: One-level memory (1LM)  13
Figure 2: Two-level memory (2LM)  14
Figure 3: pmem.io software architecture (source: Red Hat)  14
Figure 4: Main node DRAM/NVM (SCM) configuration  18
Figure 5: Unified NEXTGenIO server node (source: FUJITSU)  18
Figure 6: Server node block diagram  19
Figure 7: Local and RDMA SCM latency with hardware RDMA  20
Figure 8: Local and remote SCMs with RDMA emulation  21
Figure 9: The NEXTGenIO prototype rack  22
Figure 10: Prototype rack details (NOTE: capacities/sizes of memory and other hardware component specifications are based on current expectations, rather than exact specifications of the prototype)  22
Figure 11: Scaling the NEXTGenIO architecture beyond one rack  23
Figure 12: Intra-rack vs. inter-rack connectivity  24
Figure 13: Systemware architecture  26
Figure 14: Login node systemware components  27
Figure 15: Boot node systemware components  32
Figure 16: Complete compute node systemware components  34
Figure 17: General Purpose HPC system using multi-node SCM filesystem  51
Figure 18: Resilience in specialist HPC system  52
Figure 19: Large memory  53
Figure 20: Workflows  54

Table of Tables

Table 1: NEXTGenIO scalability (3D torus projections)  23
1 Executive Summary

NEXTGenIO is developing a prototype high performance computing (HPC) and high performance data analytics (HPDA) system that integrates byte-addressable storage class memory (SCM) into a standard compute cluster to provide greatly increased I/O performance for computational simulation and data analytics tasks.

To enable us to develop a prototype that can be used by a wide range of computational simulation applications and data analytics tasks, we have undertaken a requirements-driven design process to create hardware and software architectures for the system. These architectures both outline the components and integration of the prototype system, and define our vision of what is required to integrate and exploit SCM to enable a generation of Exascale systems with sufficient I/O performance to ensure a wide range of workloads can be supported.

The hardware architecture, which is designed to scale up to an ExaFLOP system, uses high performance processors coupled with SCM in NVRAM (non-volatile random access memory) form, traditional DRAM memory, and an Omni-Path high performance network, to provide a set of complete compute nodes that can undertake both high performance computation and high performance data analysis.

The systemware (system software), which supports the hardware in the system, will enable parallel I/O using the SCM technology, provide a multi-node filesystem for users to exploit, enable use of object storage techniques, and provide automatic check-pointing if desired by a user. These features, along with other systemware components, will enable the system to support traditional parallel applications with high efficiency, as well as newer computing modes such as high performance data analytics.

The architectures defined in this document are also demonstrated with descriptions of relevant use cases that illustrate which systemware components will be used to undertake various actions on the hardware by user applications.
2 Introduction

The NEXTGenIO project is developing a prototype HPC system that utilises byte-addressable SCM hardware to provide greatly improved I/O performance for HPC and HPDA applications. A key part of developing the prototype is the design of the different components of the system.

This document outlines the architectures we have designed for the NEXTGenIO prototype and future Exascale HPC/HPDA systems, building on a detailed requirements capture undertaken in the project.

Our aim in this process has been to design a prototype system that can be used to evaluate and exploit SCM for large scale computations and data analysis, and to design a hardware and software architecture that could be exploited to integrate SCM with Exascale-sized HPC and HPDA systems.

NEXTGenIO is building a hardware prototype using Intel's 3D XPoint™ memory, and we have designed a software architecture that is applicable to other instances of SCM once they become available. The functionality outlined is sufficiently generic that it can exploit a range of different hardware to provide the persistent and high performance storage functionality that 3D XPoint™ offers.

The remainder of this document outlines the key features of SCM that we are exploiting in NEXTGenIO, the general requirements that we have identified for our architectures, and the hardware and software architectures that we have designed. It concludes with a discussion of common usage patterns we envisage will be used to exploit the performance benefits SCM can bring.
2.1 Glossary

1LM: 1-level memory. This represents the mode where DRAM and SCM are used as separate random-access memory systems, each with its own memory address space. This means that it is possible to address all the DRAM and SCM from an application, resulting in an available memory space of the total DRAM plus the total SCM installed in a node. 1LM allows persistent memory instructions to be issued.
2LM: 2-level memory. This represents the mode where the DRAM is used as a transparent cache for the SCM in the system. Applications do not see the DRAM memory address space, only the SCM memory address space. This means the total memory available to applications is the size of the total SCM memory in the node. 2LM cannot issue persistent memory instructions; it can only be used in load/store (volatile) mode.
1GE: 1 Gigabit Ethernet
10GE: 10 Gigabit Ethernet
100G: 100 Gigabit (dual simplex)
3D XPoint™: Intel's emerging SCM technology that will be available in NVMe SSD and NVDIMM forms
AEP: Apache Pass, code name for Intel NVDIMMs based on 3D XPoint™ memory
API: Application programming interface
CPU: Central processing unit
CCN: Complete compute node
DDR4: Double data rate DRAM access protocol (version 4)
DHCP: Dynamic Host Configuration Protocol
DIMM: Dual in-line memory module, the mainstream pluggable form factor for DRAM
DMA: Direct memory access (using a hardware state machine to copy data)
DNS: Domain Name System, a standard Internet address resolution system
DP: Dual processor
DPA: DIMM physical address
DP Xeon: DP Intel® Xeon® platform
DRAM: Dynamic random access memory
DMTF: Distributed Management Task Force, an industry standards body for computer system management topics
ECMWF: European Centre for Medium-Range Weather Forecasts
ExaFLOP: 10^18 FLOP/s
FLOP: Floating point operation (in our context 64-bit / double precision)
Gb, GB: Gigabyte (1024^3 bytes, JEDEC standard)
GIOPS: 10^9 IOPS
GPU: Graphical processing unit
HDD: Hard disk drive. Primarily used to refer to spinning disks
HPC: High Performance Computing
IB: InfiniBand
IB-EDR: IB "enhanced data rate" (100 Gb/s per link)
IFT card: Internal faceplate transition card (interfaces internal iOPA cables coming from CPUs to the server faceplate)
IFS: Integrated Forecasting System, a simulation framework from ECMWF
InfiniBand: Industry standard HPC communication network
I/O: Input/output
iOPA: Integrated OPA (with the HFI on the CPU package, as opposed to having it on a PCIe card)
IOPS: I/O operations per second
IPMI: Intelligent Platform Management Interface
JEDEC: JEDEC Solid State Technology Association, formerly known as the Joint Electron Device Engineering Council; governs DRAM specifications
Lustre: Parallel distributed filesystem; open source
LNET: Lustre network protocol
MIOPS: 10^6 IOPS
MPI: Message Passing Interface, standard for inter-process communication heavily used in parallel computing
MW: Megawatt (10^6 Watts)
NAND flash: Mainstream NVM used in SSDs today
NUMA: Non-uniform memory access
NVM: Non-volatile memory (generic category, used in SSDs and SCM)
NVMe: NVM Express, a transport interface for PCIe attached SSDs
NVMF: NVM over fabrics, the remote version of NVMe
NVML: NVM library, at [5]
NVRAM: Non-volatile RAM (DIMM based). Implemented by NVDIMMs, one kind of NVM, that supports both non-volatile RAM access and persistent storage
NVDIMM: Non-volatile memory in DIMM form factor that supports both non-volatile RAM access and persistent storage. Intel NVDIMM, also known as AEP.
OEM: Original equipment manufacturer
Omni-Path: High-performance interconnect fabric developed by Intel for use in HPC systems
OPA: Omni-Path
O/S: Operating system
PB: Petabyte (1024^5 bytes, JEDEC standard)
PCI-Express: Peripheral Component Interconnect Express. Communication bus between processors and expansion cards or peripheral devices.
PCIe: PCI-Express
Persistent: Data that is retained beyond power failure
PFLOP: Peta-FLOP, 10^15 FLOP/s
PMEM, pmem: Persistent memory, another name for SCM
POSIX: Portable Operating System Interface
POSIX filesystem: A filesystem that adheres to the POSIX standard, specifically for file and directory operations.
PXE: Pre-boot Execution Environment, run before the O/S is booted
QPI: QuickPath Interconnect
QuickPath Interconnect: Intel term for the communication link between Intel® Xeon® CPUs on a DP system
RAM: Random access memory
ramdisk: DRAM-based block storage emulation
RDMA: Remote DMA, DMA over a network
SCM: Byte-addressable storage class memory, referring to memory technologies that can bridge the huge performance gap between DRAM and HDDs while providing large data capacities and persistence of data
SNIA: Storage Networking Industry Association
SNMP: Simple Network Management Protocol
SSD: Solid state disk drive
SHDD: Solid state HDD, using an NVM buffer inside an HDD
TB: Terabyte (1024^4 bytes, JEDEC standard)
TFTP: Trivial File Transfer Protocol
TOP500: TOP500 supercomputer list, http://www.top500.org
U: Rack height unit (1.75")
Xeon®: Intel's brand name for x86 server processors
Xeon Phi™: Intel's many-core x86 line of CPUs for HPC systems
3 Non-volatile Memory

The aim of the NEXTGenIO project is to exploit a new generation of memory technologies that we see bringing great performance and functionality benefits for future computer systems. Non-volatile memory, memory that retains any data stored within it even when it does not have power, has been exploited by consumer electronics and computer systems for many years. The flash memory cards used in cameras and mobile phones are an example of such hardware, used for data storage. More recently, flash memory has been used for high performance I/O in the form of Solid State Disk (SSD) drives, providing higher bandwidth and lower latency than traditional Hard Disk Drives (HDD), but typically with lower capacities.

Whilst flash memory can provide fast I/O performance for computer systems, there are some drawbacks. It has limited endurance when compared with HDD technology, limiting the number of modifications of memory cells and thus the effective lifetime of the flash storage. It is often also more expensive than other storage technologies. Nevertheless, SSD storage, and enterprise-level SSD drives in particular, are heavily used for I/O intensive functionality in large scale computer systems.
Byte-addressable storage class memory takes a new generation of non-volatile memory that is directly accessible via CPU load/store operations, has higher durability than standard flash memory, and higher read and write performance, and packages it in the same form factor (i.e. the same connector and size) as the main memory used in computers (DRAM). This allows the memory to be installed and used alongside DRAM-based main memory, controlled by the same memory controller. This means that applications running on the system can access it directly as if it were main memory, including true random data access at byte or cache line granularity. This is very different from block-based flash storage, which requires intensive I/O operations for such data access.
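The difference between the two access models can be sketched with a memory-mapped file standing in for an SCM region. This is a hedged illustration only: the file path and page size below are invented for the example, and a real SCM region would be exposed through a DAX-capable filesystem on NVDIMM hardware rather than an ordinary file.

```python
import mmap
import os
import tempfile

# An ordinary file stands in for an SCM region in this sketch.
path = os.path.join(tempfile.mkdtemp(), "scm_region")
with open(path, "wb") as f:
    f.truncate(4096)  # reserve one page

# Byte-granular access: map the region and update a single byte in
# place, the way a CPU store instruction would update SCM directly.
with open(path, "r+b") as f:
    region = mmap.mmap(f.fileno(), 4096)
    region[100] = 0x2A   # store one byte
    value = region[100]  # load it back
    region.close()

# The block-based equivalent: read a whole block through the I/O
# stack, modify one byte, and write the whole block back.
with open(path, "r+b") as f:
    block = bytearray(f.read(4096))
    block[101] = 0x2B
    f.seek(0)
    f.write(block)
```

The mapped region lets the application address individual bytes directly, whereas the block path incurs a full read-modify-write cycle, with its attendant system calls, even for a one-byte change.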
The performance of byte-addressable SCM is projected to be lower than that of main memory (with a latency of ~5-10x that of DDR4 memory when connected to the same memory channels), but much faster than SSDs or HDDs. It is also projected to be of much larger capacity than DRAM, around 2-5x denser (i.e. 2-5x more capacity in the same form factor).

This new class of memory offers very large memory capacity for servers, long-term very high performance persistent storage within the NVRAM memory space of the servers, and the ability to undertake I/O (reading and writing data) in a new way. Direct access from applications to individual bytes of data in the SCM is very different from the block-oriented way I/O is currently implemented.

SCM has the potential to enable synchronous, byte-level I/O, moving away from the asynchronous block-based file I/O applications currently rely on. In current asynchronous I/O, the driver software issues an I/O command by handing over an I/O request into a queue to an active controller. Processing of that command may happen after some delay if there are other commands already being processed.

Next, the I/O command is actively started (as with storage read/write or networking transmit), or it is recognised as a receiving descriptor for an asynchronous I/O command coming from an external source (e.g. a networking receive). Once the I/O operation is finished by the controller, the driver is notified (usually by an interrupt) to complete the operation in software.

With SCM providing much lower latencies than external disk drives, this traditional I/O model, using interrupts, is inefficient because of the overhead of context switches between user and kernel mode (which can typically take thousands of CPU instructions).

Furthermore, with SCM it becomes possible to implement remote access to data stored in the memory using RDMA technology over a suitable interconnect. Using high performance networks can enable access to data stored in SCM in remote nodes faster than accessing local high performance SSDs via traditional I/O interfaces and stacks inside a node.
Therefore, it is possible to use SCM to greatly improve I/O performance within a server, increase the memory capacity of a server, or provide a remote data store with high performance access for a group of servers to share. Such storage hardware can also be scaled up by adding more SCM memory in a server, or adding more nodes to the remote data store, allowing the I/O performance of a system to scale as required.
However, if SCM is provisioned in the servers in a compute system, there must be software support for managing data within the SCM. This includes moving data as required for the jobs running on the system, and providing the functionality to let applications run on any server and still utilise the SCM for fast I/O and storage (i.e. applications should be able to access SCM in remote nodes if the system is configured with SCM only in a subset of all nodes).

As SCM is persistent, it also has the potential to be used for resiliency, providing backup for data from active applications, or providing long-term storage for databases or data stores required by a range of applications. With support from the systemware, servers could be enabled to handle power loss without experiencing data loss, efficiently and transparently recovering from power failure and resuming applications from their latest running state, and maintaining data with little overhead in terms of performance.
3.1 SCM modes of operation
Ongoing developments in non-volatile / storage class memory have impacted the discussion of memory hierarchies. A common model that has been proposed includes the ability to configure main memory and SCM in two different modes: 1-level memory and 2-level memory [2].
1-level memory, or 1LM, has main memory (DRAM) and NVRAM as two separate memory spaces, both accessible by applications. This is very similar to the Flat Mode [3] configuration of the high bandwidth, on-package MCDRAM in current Intel® Xeon Phi™ processors (code name "Knights Landing" or KNL). The DRAM will be handled via standard memory APIs such as malloc and represent the O/S visible main memory size. The NVRAM will be managed by standard file system APIs such as mmap and present the non-volatile part of the system memory. Both allow direct CPU load/store operation. In order to take advantage of SCM in 1LM mode, the systemware or applications have to be adapted to use these two distinct address spaces.
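The 1LM programming model described above (a named file mapped into the address space, then accessed with direct loads and stores) can be sketched with an ordinary memory-mapped file. This is only an illustration of the access pattern: here a plain temporary file stands in for a DAX-mounted NVRAM region, and Python's `mmap` stands in for a persistent memory library such as libpmem.

```python
import mmap, os, tempfile

# Sketch of 1LM-style NVRAM access. On real SCM hardware the file would live
# on a DAX filesystem and flushes would use CPU cache-line instructions; an
# ordinary file and msync-based flush() stand in for that here.
path = os.path.join(tempfile.mkdtemp(), "nvram_region")
with open(path, "wb") as f:
    f.truncate(4096)                      # reserve a 4 KiB "SCM" region

with open(path, "r+b") as f:
    region = mmap.mmap(f.fileno(), 4096)  # map the region into our address space
    region[0:5] = b"hello"                # direct store-style byte access
    region.flush()                        # make the stores persistent
    region.close()
```

After the mapping is flushed, the bytes survive in the backing file, mirroring how data written through a 1LM mapping would survive in NVRAM.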
Figure 1: One-level memory (1LM).
2-level memory, or 2LM, configures DRAM as a cache in front of the NVRAM. Applications only see the memory space of the SCM; data being used is stored in DRAM, and moved to SCM when no longer immediately required by the memory controller (as in standard CPU caches). This is very similar to the Cache Mode [3] configuration of MCDRAM on KNL processors.
This mode of operation does not require applications to be altered to exploit the capacity of SCM, and aims to give memory access performance at main memory speeds whilst providing the large memory space of SCM. However, how well the main memory cache performs will depend on the specific memory requirements and access pattern of a given application. Furthermore, persistence of the NVRAM contents can no longer be guaranteed, due to the volatile DRAM cache in front of the NVRAM, so the non-volatile characteristics of SCM are not exploited.
Figure 2: Two-level memory (2LM).
3.2 Non-volatile memory software ecosystem
In recent years the Storage Networking Industry Association (SNIA) has been working on a software architecture for SCM with persistent load/store access. These operating system independent concepts have now materialised into the Linux pmem.io [5] library.
This approach re-uses the naming scheme of files as traditional persistent entities and maps the SCM regions into the address space of a process. Once the mapping has been done, the file descriptor is no longer needed and can be closed.
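The specific point that the descriptor becomes redundant once the mapping exists can be demonstrated with standard `mmap` semantics, which the pmem.io approach builds on. This is a sketch with an ordinary temporary file, not real persistent memory: the file name identifies the persistent region, but after mapping, the open descriptor can be closed and the mapping remains valid.

```python
import mmap, os, tempfile

# The file name identifies the persistent region; once mapped, the
# descriptor itself is no longer needed (an ordinary file stands in
# for an SCM region here).
path = os.path.join(tempfile.mkdtemp(), "pmem_file")
with open(path, "wb") as f:
    f.truncate(1024)

f = open(path, "r+b")
region = mmap.mmap(f.fileno(), 1024)
f.close()                     # close the descriptor immediately after mapping
region[0:4] = b"data"         # the mapping is still valid for load/store access
region.flush()                # persist the stores to the backing region
region.close()
```

The same pattern is what lets pmem.io-style applications keep only the mapping, rather than an open file handle, for the lifetime of the persistent region.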
Figure 3: pmem.io software architecture (source: Red Hat).
3.3 Alternatives to SCM
There are existing technological solutions that offer similar functionality to SCM and that can also be exploited for high performance I/O. One example is NVMe devices: SSDs that are attached to the PCIe bus and support the NVM Express interface. Indeed, Intel already has a line of NVMe devices on the market that use 3D XPoint™ memory technology, called Intel® Optane™ [6]. Other vendors have a large range of NVMe devices on the market, most of them based on different variations of Flash technology.
NVMe devices have the potential to provide byte-level storage access, using the pmem.io libraries. A file can be opened and presented as a memory space for an application, and then can be used directly as memory by that application.
NVMe devices offer the potential to remove the overhead of file access (i.e. data access through file reads and writes) when performing I/O and enable the development of applications that exploit SCM functionality. However, given that NVMe devices are connected via the PCIe bus, and have a disk controller on the device through which access is managed, NVMe devices do not provide the same level of performance that SCM offers. Indeed, as these devices still use block-based data access, fine grained memory operations can require whole blocks of data to be loaded or stored to the device, rather than individual bytes.
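The cost of block-based access for fine grained operations is easy to quantify: updating a single small value through a block device forces a whole block transfer. The 4 KiB block size below is an assumption (a typical NVMe block size, and the I/O granularity used elsewhere in this document); byte-addressable SCM would move only the bytes actually touched.

```python
# Write amplification when a block device is used for fine grained updates.
# Block size is an assumed typical value; the payload is a single 8-byte word.
block_size = 4096          # bytes moved per device operation
payload = 8                # bytes the application actually wants to update
amplification = block_size / payload
print(amplification)       # 512x more data moved than strictly needed
```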
Memory mapped files are another example of SCM-like functionality using existing technology. Memory mapped files allow I/O storage to be accessed as if it were memory, through the virtual memory manager. However, the data still resides on a traditional storage device and it can therefore only be accessed at the performance rate that this storage device allows.
Memory mapped files allow some of the block storage costs to be avoided (i.e. those costs caused by the asynchronous data access and interrupts), but are limited to the size of DRAM available to an application, and have coarser grained persistency (requiring full blocks to be flushed to disk to ensure persistency), with much higher persistency costs than byte-addressable SCM.
4 Requirements
To design the hardware and systemware architectures that will enable us to construct a prototype system exploiting SCM for HPC and HPDA applications, we identified and examined the usage scenarios for such a system and used them to generate a set of requirements.
Whilst detailed reporting of those requirements is not the aim of this document, we will outline the general requirements we considered key when designing the system:
1. The system should be as close as possible to a standard, production, large scale compute system.
2. Applications must be able to run on the system without changes, other than possibly requiring recompilation, and be able to use the SCM to provide performance benefits.
3. It must be possible to modify applications to directly use the SCM and exploit the full performance the hardware provides.
4. The hardware and systemware must be able to support a heterogeneous SCM environment. We consider that systems where some nodes have SCM and some do not may be desirable for a range of users or system providers; we therefore want our hardware and systemware architectures to be able to support such configurations.
5. The system must be able to support multiple users and multiple applications running at the same time.
6. The system must be flexible enough to allow any application to run on any compute node.
7. It should be possible for multiple applications to share a single compute node, or for user jobs to specify that exclusive node access is required.
8. The system must support the sharing of data between applications or user jobs. In particular we recognise that workflows of applications are an important use case for SCM in such systems.
9. The system should support I/O from applications using non-filesystem based approaches. One candidate could be an object store.
10. The system should provide data access to SCM between compute nodes to allow applications to access remote data and to allow the system to manage data in SCM.
11. The architectures should enable the design of an ExaFLOP range system with an I/O performance sufficient to ensure I/O is not a performance limiter when scaling applications on such a system.
12. The system should allow different storage targets to be configured for a given application. This requires software being configured for the set of compute nodes allocated to a given user job.
13. The system must support profiling and debugging tools to enable users to understand the performance and behaviour of their programs.
5 Hardware Architecture
Developing a prototype platform to test and evaluate the performance and functionality benefits of SCM is one of the key objectives of the NEXTGenIO project. The SCM technology we are using, 3D XPoint™ memory, is a new type of memory that requires specific hardware support. 3D XPoint™ memory cannot simply be installed in an existing system.
However, one of the major benefits of SCM (NVRAM) is that it comes in the same hardware form factor as standard memory (DRAM, DDR4 for 3D XPoint™). This means we can design and build a system that can take both standard memory and SCM in the same node and provide both to users.
SCM does require specific processor support, primarily because memory controllers in modern computing systems are directly integrated into the processor itself. This means that if we want to use a new memory technology, we need processors with support for that new memory in their memory controllers. For instance, for 3D XPoint™, Intel has announced a new generation of Intel® Xeon® processors with the code name "Cascade Lake" for 2018 release that will include such support.
Furthermore, as we are combining two types of memory in the servers (or nodes) in the system, we need to ensure that there is capacity to install sufficient amounts of standard memory and SCM. As we are looking to provide a prototype system where large scale computational simulation and data analytics tasks can be undertaken, we need to provide sufficient hardware capacity to install sufficiently large amounts of standard DRAM memory and SCM at the same time. In other words, we need a system that has space for a large number of memory modules, or DIMMs, to be installed.
The NEXTGenIO prototype system is designed to provide very high SCM capacities and I/O bandwidth, combining 3D XPoint™ technology with Intel's new Omni-Path interconnect to provide extremely high I/O performance for applications.
A single NEXTGenIO rack holds up to 32 Xeon® servers, Complete Compute Nodes (CCNs), connected by the Omni-Path network, with two login nodes to support local compilation, visualisation, and rack management. Furthermore, two boot servers are provided to store and serve the operating system and software required by the CCNs, and two gateway servers to enable access to an external filesystem for long term storage, and for performance comparison with the I/O provided by the SCM.
The architecture is designed with scalability in mind, facilitating the connection of NEXTGenIO racks into larger systems, reaching well into the ExaFLOP range. We enable software to access local SCM almost as fast as DRAM, and to access any other CCN's SCM in the cluster faster than accessing a fast storage device (i.e. a fast SSD) inside the node itself.
This SCM focussed system architecture allows fundamentally new designs for scalable HPC software and provides the functionality required to make data centre power consumption, rather than the system architecture, the final scaling limiter.
Using future processor and network generations, we expect an ExaFLOP configuration based on the NEXTGenIO architecture to become practical within five years.
5.1 Hardware building blocks
5.1.1 Main server node
Since NEXTGenIO is centred on using SCM as a new, ultra-fast storage layer between memory (DRAM) and disk, we have to understand the goals and constraints on how to configure DRAM and NVM in the servers being used.
As shown in Figure 4, the NEXTGenIO server node is a dual processor node and supports SCM. As we are looking for high performance nodes, we need to have all 12 memory channels (as provided by Intel's current "Skylake" generation of Intel® Xeon® processors and announced for "Cascade Lake") populated with some DRAM, to enable utilising the total available DRAM bandwidth in the node.
Figure 4: Main node DRAM/NVM (SCM) configuration.
Since we are also interested in maximum SCM bandwidth and capacity, we also need to populate all 12 memory channels in the node with SCM. Together, that makes 12 DRAM DIMMs plus 12 NVDIMMs, pairing up in each channel. Of course, devices on the same channel share the available DDR4 channel's bandwidth.
The server hardware is implemented in a 1U form factor (Figure 5), which is a good compromise between configurability and density.
Figure 5: Unified NEXTGenIO server node (source: FUJITSU).
Figure 6 shows the NEXTGenIO server node implementation block diagram. While most details are not relevant here, some are important. In total, there are three PCIe x16 slots available for add-in cards in the system. We will develop support for processors with integrated network connection, i.e. one channel coming off each CPU; this currently requires a PCIe slot to carry an "IFT Card", holding the modules to receive external cables (shown on the upper right in the picture). Consequently, that leaves two PCIe x16 slots for other I/O cards.
Figure 6: Server node block diagram.
5.1.2 Ethernet switch
For system management and general-purpose inter-node communication, two Ethernet networks are included: 1GE for the management LAN and slow data connections, and 10GE to connect to the data centre LAN as well as fast connections within the rack.
5.1.3 Network switch
The hardware architecture supports the integration of most common high performance networks to provide the fast, high bandwidth interconnect between the nodes in the system. This includes InfiniBand based networks, fast Ethernet, and Omni-Path Architecture, connected via PCI Express or (for Omni-Path) integrated into a multi-chip CPU package.
One feature that is necessary for a system to support all the project's requirements is remote data access over the network. RDMA offers the potential for high performance access to remote SCM, which is important to enable some of the use cases we are considering.
5.1.4 Login nodes
As with any HPC cluster, we need to support login nodes to access the system, perform compilations and remote visualisation. Login nodes have different requirements to the compute nodes, and as such will be provisioned with high memory and processing capabilities, along with local storage to enable fast compilation of applications.
This local storage can be provided by SCM in the nodes or local SSDs, depending on the cost and availability of these components. SCM could provide the benefit of a large memory space for the login nodes (i.e. if it supported 2LM functionality).
5.1.5 Boot nodes
To enable booting of the compute nodes via the network, for easy administration of the operating system, and to install software on the compute nodes, we provide a number of boot nodes. These have 1GE network connections to the compute servers, connected through 10GE switches (to reduce the risk of becoming saturated during system initialisation).
Boot nodes typically have low requirements on compute performance, i.e. even a mainstream single CPU is expected to be sufficient. The storage capacity requirements are also moderate. However, to avoid bottlenecks during boot storms, SSDs or SSHDs (HDDs with an internal NAND flash buffer) seem to be appropriate.
5.1.6 Gateway nodes
If an Omni-Path network is used as the main interconnect within the system, and we need to connect the system to an external Lustre filesystem based on InfiniBand, we will need gateway servers that bridge between the two interconnects. Intel has techniques for building such gateways using standard Intel® Xeon® servers and the Linux operating system.
5.2 Prototype considerations
The choice of network for the prototype is between InfiniBand and Omni-Path based solutions. Whilst InfiniBand offers hardware RDMA, Omni-Path provides the potential for a network integrated onto the processor (integrated OPA or iOPA). For the NEXTGenIO prototype we decided to evaluate the integrated network functionality of Omni-Path.
5.2.1 RDMA options for the prototype
There are a number of ways we can provide remote data access in the prototype system. If we utilised an InfiniBand based network we could enable RDMA (supported by the network adapter hardware), providing very low latency (1 microsecond or less) and high bandwidth access, even with small I/O operations. This hardware RDMA also does not use much of the compute resource in a node, as the network adapter has its own processing hardware to support the RDMA operations.
Figure 7: Local and RDMA SCM latency with hardware RDMA.
However, the current generation of Omni-Path network adapters does not support hardware accelerated RDMA (future generations of Omni-Path are scheduled to have such support). In this scenario, parts of the RDMA operations need to be executed by the host CPU. Assuming an optimised RDMA emulation target using polling, we expect a remote SCM block read/write latency of around 4 to 5 microseconds (Figure 8). Even though this is significantly higher than with hardware accelerated RDMA, it will still be in the range for effective use of synchronous I/O.
Figure 8: Local and remote SCM with RDMA emulation.
5.2.2 Applicability of the architecture without access to SCM
Whilst this hardware architecture is designed around SCM, it is more generally applicable. For instance, local (in-node) SCM could be replaced by local non-volatile disk drives, using NVM Express (NVMe) devices. These provide byte level access to persistent memory from applications, and are compatible with the pmem.io library functionality. It would also be possible to replace in-node SCM with remote NVMe devices, using NVM over Fabrics (NVMF).
NVMe latency and bandwidth will not be at the same level of performance as the SCM we will use in our prototype architecture (i.e. 3D XPoint™ memory), with best sustained NVMe latencies for current technology being around 100-300 µs, and bandwidth of around 2.5-3.5 GB/s (for a x4 PCIe connection), compared to 300-500 ns and 8-10 GB/s for the SCM (per DIMM). The NVMF latency will be higher than with remote SCM accessed by RDMA. We expect around 10-20 µs (read/write) with 3D XPoint™ SSDs (Optane) and around 10 µs (read/write) with an SCM based NVMF target.
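The latency gap implied by these figures is worth making explicit; all numbers below are taken directly from the ranges quoted above.

```python
# Latency ratios implied by the figures quoted above.
nvme_lat_us = (100.0, 300.0)   # best sustained NVMe latency range, in µs
scm_lat_us = (0.3, 0.5)        # 300-500 ns SCM latency, expressed in µs
ratio_low = nvme_lat_us[0] / scm_lat_us[1]    # most favourable case for NVMe
ratio_high = nvme_lat_us[1] / scm_lat_us[0]   # least favourable case for NVMe
print(round(ratio_low), round(ratio_high))    # SCM is ~200x to ~1000x lower latency
```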
5.3 System configuration
The NEXTGenIO prototype rack provides approximately 100 TB NVRAM capacity based on Intel's 3D XPoint™ technology [1]. It is accessible from any (and to any) CCN in less than 5 microseconds, with very high I/O bandwidth (almost 400 MIOPs).
The rack holds up to 32 Intel® Xeon® servers, also known as complete compute nodes (CCNs), connected by Intel® Omni-Path Architecture fabric. Two login nodes support local compilation, visualisation, and also rack management. Two boot servers enable network boot, and two gateways connect to an external file system. The node configuration details and network connectivity are shown in Figure 10.
Figure 9: The NEXTGenIO prototype rack.
Figure 10: Prototype rack details (NOTE: capacities/sizes of memory and other hardware component specifications are based on current expectations, rather than exact specifications of the prototype).
Assuming 2 to 4 TFLOPS per node, the NEXTGenIO rack with 32 nodes achieves 64-128 TFLOPS and approximately 384 MIOPS, i.e. around 1.5 TB/s local block storage. For the purpose of the prototype this scale is sufficient, but one key aspect of this work is to show how this could scale into the ExaFLOP range.
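These rack-level figures follow directly from the per-node assumptions; a quick check (using the 4 kB block size assumed throughout this chapter) confirms the arithmetic.

```python
# Checking the rack-level figures quoted above.
nodes, tflops_min, tflops_max = 32, 2, 4
rack_tflops = (nodes * tflops_min, nodes * tflops_max)   # (64, 128) TFLOPS
miops, block_bytes = 384e6, 4096                         # 384 MIOPS at 4 kB blocks
local_bw_tb_s = miops * block_bytes / 1e12               # ~1.57 TB/s, "around 1.5 TB/s"
print(rack_tflops, round(local_bw_tb_s, 2))
```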
5.4 ExaFLOP vision
Since the architecture supports nodes using the two integrated network ports of the CPUs, we have two PCIe x16 slots available per node. Using two single-channel network cards, we can extend the single rack system to many racks, but still preserve the same hardware/software architecture. It would just add a second level of "slower" connectivity, i.e. between racks. "Slower" in this context means higher latency and lower bandwidth per node, but still keeping the concept of synchronous I/O, at least up to a range of 10 to 20 microseconds latency.
With a 5-hop network we can link 864 racks with 50 to 100 PFLOPs, 83 PB NVM, and 332 GIOPs:
Figure 11: Scaling the NEXTGenIO architecture beyond one rack.
With cubical 3D torus configurations, we can project scaling to the ExaFLOP range, still using the same NEXTGenIO hardware/software architecture (Table 1).
Total # Nodes (Intel DP Skylake) | PFLOP (Min Estimate) | PFLOP (Max Estimate) | Total SCM Capacity (PB) | Total SCM I/O B/W (TB/s) | Total Power (MW)
768     | 1.5 | 3     | 2     | 36     | 0.4
3,072   | 6   | 12    | 9     | 144    | 1.5
24,576  | 48  | 96    | 72    | 1,152  | 12
82,944  | 162 | 324   | 243   | 3,888  | 41
196,608 | 384 | 768   | 576   | 9,216  | 98
384,000 | 750 | 1,500 | 1,125 | 18,000 | 192

Table 1: NEXTGenIO scalability (3D torus projections).
At 750 to 1,500 PFLOPs, the total SCM capacity would be 1.1 Exabyte. The total I/O bandwidth would be 18 PB/s, which corresponds to the speed of approximately 6 million current day PCIe-based SSDs.
The FLOP to I/O bandwidth ratio would be roughly 1:50, i.e. at a similar level to the FLOP to memory bandwidth ratio that HPC systems aspire to.
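A back-of-envelope check of these two claims, taking the largest configuration from Table 1 and assuming ~3 GB/s per current PCIe SSD (an assumed figure consistent with the NVMe bandwidth range quoted in Section 5.2.2): the FLOP-per-byte ratio spans roughly 1:42 to 1:83 across the min/max PFLOP estimates, i.e. of the order of the quoted 1:50.

```python
# Back-of-envelope checks on the ExaFLOP-range figures above.
total_bw_pb_s = 18.0                              # total SCM I/O bandwidth, PB/s
ssd_gb_s = 3.0                                    # assumed per current PCIe SSD
ssd_millions = total_bw_pb_s * 1e6 / ssd_gb_s / 1e6
print(ssd_millions)                               # ~6 million SSDs
ratio_min = 750 / total_bw_pb_s                   # FLOPs per byte, min estimate
ratio_max = 1500 / total_bw_pb_s                  # FLOPs per byte, max estimate
print(round(ratio_min), round(ratio_max))         # ~42 to ~83, order of 1:50
```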
The power needed for a 1 ExaFLOP system based on current Intel processors would be impractical: almost 200 MW. However, based on future CPUs and interconnects, an ExaFLOP configuration within an affordable 20-30 MW should become feasible early in the next decade.
5.4.1 Scalability to an ExaFLOP and beyond
Technically, the OPA link layer can address more than 10 million nodes [4]. Such a cluster size would be far beyond any practical power limit (5 GW). Early in the next decade, however, a dual processor (DP) node may have 50 TFLOPs or more in the same power envelope. Then, 50,000 nodes would achieve 2.5 ExaFLOPs in a much more practical power budget of 25 MW.
5.4.2 Scalability into 100 PFLOP/s
According to Intel, a 5-hop OPA cluster can link up to 27,648 nodes. With our NEXTGenIO rack and a single OPA inter-rack link per node, this would lead to the following configuration (see also Figure 12):
- 864 racks with 27,648 NEXTGenIO nodes
- 50 to 100 PFLOPs
- 83 PB of SCM with 332 GIOPs @ 4 kB = 1.36 PB/s local I/O
- 14 MW power (assuming 500 W/node including infrastructure)
The inter-rack bandwidth would be dependent on the network topology. Not knowing the configuration details, we estimate the bandwidth per node will be lower than the 2x100G available in our prototype rack. The latency, however, would be similar. Assuming 1 microsecond or less round-trip per hop, the MPI latency should stay below 5 microseconds, and remote SCM via RDMA emulation below 10 microseconds. This would enable synchronous remote SCM I/O storage access, but at somewhat lower efficiency than within the racks.
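The configuration figures in the list above can be verified directly from the stated assumptions (32 nodes per rack, 4 kB blocks, 500 W per node):

```python
# Checking the 100 PFLOP/s configuration figures quoted above.
racks, nodes_per_rack = 864, 32
nodes = racks * nodes_per_rack               # 27,648 nodes in total
giops, block_bytes = 332e9, 4096             # 332 GIOPs at 4 kB blocks
local_io_pb_s = giops * block_bytes / 1e15   # ~1.36 PB/s local I/O
power_mw = nodes * 500 / 1e6                 # ~13.8 MW, i.e. "14 MW" at 500 W/node
print(nodes, round(local_io_pb_s, 2), round(power_mw, 1))
```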
Figure 12: Intra-rack vs. inter-rack connectivity.
Since the bandwidth per node may be limited by these inter-rack bandwidth restrictions, the software should be aware of this extra network level. The scheduler and applications should therefore ensure that inter-rack accesses are less frequent than intra-rack accesses. This is very similar to today's NUMA memory access optimisation within compute nodes.
6 Systemware Architecture
The systemware for the NEXTGenIO architecture includes all the software that runs on the system (excluding user applications). As described in Section 5, there are four main components in the prototype:
• Login nodes
• Boot/Management nodes
• Gateway nodes
• Complete compute nodes
For large scale versions of the system we are designing, we would expect there to be a fifth type of node, Job Launcher/Scheduler nodes, which would run the systemware components that schedule and run user jobs, and collect performance data from the system as a whole. For full-scale production systems, it would not be desirable to run the scheduling and monitoring systems from the login nodes, as user behaviour could compromise the scheduling components and interrupt the running of jobs on the system, and the small number of login nodes might be overwhelmed by the load created by monitoring, orchestrating and scheduling a very large system. Furthermore, separating out the login and job scheduling/launching nodes allows the enforcement of different levels of security on those components, thus facilitating a secure system overall. However, given the scale of our prototype, this functionality has been assigned to the login nodes instead of requiring a separate set of nodes.
There are also other hardware components that are important to the prototype system we are developing, but are not part of the core systemware architecture, specifically:
• Network switches: these will require management functionality in our prototype system, but they are not part of what we consider the core NEXTGenIO systemware, as they are simply a component of the network they belong to. We will use standard components for the selected network fabric.
We have added these additional components to the systemware diagrams below, but they are not described in the following sections; we view them as standalone components that are not key to the architectures we are designing in this project (i.e. they could be replaced with other components and still provide the same functionality to our system).
6.1 Systemware Overview
The systemware provides the software functionality necessary to fulfil the requirements we have identified. It provides a number of different sets of functionality, related to different ways to exploit SCM.
Briefly, from a user's perspective the system enables the following scenarios:
1. Users can request systemware components to load/store data in SCM prior to a job starting, or after a job has completed. This can be thought of as similar to current burst buffer technology. This can address the requirement for users to be able to exploit SCM without changing their applications.
2. Users can directly exploit SCM by modifying their applications to implement direct memory access and management. This offers users the ability to exploit the best performance SCM can provide, but requires application developers to undertake the task of programming for SCM themselves, and ensure they are using it in an efficient manner.
3. Provide a filesystem built on the SCM in the CCNs. This allows users to exploit SCM for I/O operations without having to fundamentally change how I/O is implemented in their applications. However, it does not enable the benefit of moving away from filesystem operations that SCM can provide.
4. Provide an object store that exploits the SCM to enable users to explore different mechanisms for storing and accessing data from their applications.
5. Enable the sharing of data between applications through SCM. For example, this may be sharing data between different components of the same computational workflow, or the sharing of a common dataset between a group of users.
6. Ensure data access is restricted to those authorised to access that data, and enable deletion or encryption of data to make sure those access restrictions are maintained.
7. Provision of profiling and debugging tools to allow application developers to understand the performance characteristics of SCM, investigate how their applications are utilising it, and identify any bugs that occur during development.
8. Efficient check-pointing for applications, if requested by users.
9. Provide both 1LM and 2LM modes if they are supported by the SCM hardware.
10. Enable or disable systemware components as required for a given user application, to reduce the performance impact of the systemware on a running application if it is not using those systemware components.
We have not designed the systemware to support the following functionality:
1. Automatic copying or mirroring of application data between nodes without a job explicitly requesting it.
The systemware architecture we have defined is outlined in the following diagram. We will describe it in more detail in the following sections. Each distinct type of node in the system has its own part in the systemware architecture.
Figure 13: Systemware architecture.
6.2 Login Nodes
The primary purpose of the login nodes is to provide resources for users to perform data movement, source code compilation, job submission, and job management. Users will be able to log in to these nodes, and will have access to a local filesystem on the login nodes as well as to a global parallel filesystem, which is also accessible from the CCNs.
The systemware components for the login nodes are shown in Figure 14, which is a subsection of the overall systemware architecture shown in Figure 13.
Figure 14: Login node systemware components.
6.2.1 Communication Libraries
Compatible MPI libraries/functionality are essential to support the HPC applications that will be using the system. As many parallel applications use MPI-I/O functionality, the MPI library should support at least the MPI-2.0 APIs and functionality, and MPI-I/O should be optimised to work with SCM.
6.2.2 Compilers
Applications need to be compiled for particular hardware platforms. Compute nodes for HPC systems generally are not set up for direct user access or for compilation, so login nodes are required on an HPC system to let users build/compile their applications, and launch their jobs through the batch system.
Compilation can require reading and writing of large numbers of small data files, which can be slow on parallel filesystems. SCM could speed up compilation by providing a faster filesystem for this workload.
HPC applications are generally C, C++, or Fortran programs. As such it is essential to have support for those three languages in the compilers on the system. There may be varying levels of standards usage and compliance, particularly in the C++ and Fortran codes (i.e. C++11 or Fortran 2015 features could be used by some codes), so it is necessary to understand the language support for C++ and Fortran that the available compilers provide.
HPC applications generally use some parallel functionality, primarily MPI and OpenMP (although other parallel functionality is used, such as OpenACC, OpenCL, CAF or UPC). Therefore the compiler also needs to support linking to communication libraries for MPI (and other distributed memory parallelisation approaches) based programs, and to node-local parallelisation libraries/functionality such as OpenMP.
6.2.3 I/O Libraries
Many applications use I/O libraries, such as HDF5 or NetCDF, to simplify I/O for particular application areas, to provide metadata for datasets, or to integrate data with other applications/tools.
I/O libraries may be built on top of other I/O libraries (i.e. HDF5 may build on MPI-I/O for parallel functionality, and NetCDF on top of HDF5).
6.2.4 Java
Java is a general-purpose, concurrent, class-based and object-oriented computer programming language which has gained considerable popularity, and consequently is being widely adopted within the scientific community. The latest Java version is 8.
Some new task based programming environments, such as COMPSs, are implemented in Java; consequently, Java support must be provided within the login nodes in order to enable such environments there (e.g. to submit a PyCOMPSs application to the queuing system and set up the environment within the assigned CCNs).
6.2.5 Job Launcher
For performance and security reasons the CCNs cannot be accessed directly by users. Jobs are submitted using a job scheduler which takes input from the user (via a job scheduling/batch script) specifying the application to be run, the resources to be used, and other data that is required for successful running of the job. The batch system then schedules and runs the job on the CCNs when resources are available, following defined scheduling policies.
Often the application is actually run by a job launcher, a program specified by the user in the batch script which starts the parallel application on the CCNs that the batch system has allocated to the job. The job launcher generally is part of the parallel libraries on the system, specifically the MPI libraries and associated ecosystem.
6.2.6 Job Scheduler
A job scheduler is responsible for allocating nodes to run parallel jobs. In order to provide this basic functionality, a job scheduler must:
• Provide authentication and authorisation for users and groups
• Provide accounting and historical record keeping
• Implement several scheduling algorithms
• Allow interactions by providing a set of helper applications (and a programmatic API)
• Provide a backup job scheduler for a fault-tolerant configuration
• Allocate and run jobs in a timely manner
• Manage and track data associated with user jobs
The job scheduler must be aware of SCM resources in the system, allow users to specify SCM requirements and data movement requirements, understand the configuration of CCNs, and allow users to specify configuration requirements of CCNs for their jobs.
The job scheduler will communicate with, and control, job scheduling and data scheduling components on the CCNs, to enable the system to run in as efficient a manner as possible and ensure user data is secure and safe.
There will be multiple scheduling components in the job scheduler, including components that can schedule jobs based on data location or energy characteristics (the data aware job scheduler and the energy aware job scheduler).
6.2.7 Local Filesystem
For performance reasons, users of HPC systems should have access to both a high performance parallel filesystem (/work), such as Lustre, as well as a standard backed-up NFS filesystem (/home).
CCNs may have access only to the high performance filesystem, while login nodes have access to both, so that the user can securely store data on a backed-up system while not putting unnecessary strain on the high performance filesystem.
Compilation can create a large number of small files, especially when compiling large programs, and this can have a significant impact on the performance of parallel filesystems. Having a more local filesystem on the login nodes, especially one that can be backed up if necessary, is essential to enable efficient compilation and high performance applications on the CCNs.
6.2.8 Meta Scheduler
The meta scheduler coordinates the scheduling decisions between the Data Aware Job Scheduler and the Energy Aware Job Scheduler. For example, instead of moving data it can schedule jobs where the data is located, or decide to stage-in and stage-out data during the execution of the job if there is spare network capacity. This functionality will use the job scheduler, and through it the data schedulers on CCNs, providing dynamic, adaptive scheduling to maximise the efficient use of the machine.
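The data-aware placement decision described above can be sketched as a toy function: given a map of which nodes already hold a job's dataset in SCM, prefer those nodes so that no stage-in is needed. This is purely illustrative; the node names, the data-location map, and the function itself are invented for this sketch and are not the project's actual scheduler interface.

```python
# Toy sketch of data-aware placement: prefer free nodes that already hold
# the job's dataset in their SCM, so no data movement is required.
# All names here (nodes, datasets, the function) are hypothetical.
def place_job(job_data, node_data, free_nodes, needed):
    """Return the chosen nodes and the number of stage-ins they imply."""
    holding = [n for n in free_nodes if job_data in node_data.get(n, set())]
    others = [n for n in free_nodes if n not in holding]
    chosen = (holding + others)[:needed]
    moves = sum(1 for n in chosen if n not in holding)  # stage-ins required
    return chosen, moves

node_data = {"ccn01": {"setA"}, "ccn02": {"setA"}, "ccn03": set()}
chosen, moves = place_job("setA", node_data, ["ccn01", "ccn02", "ccn03"], 2)
print(chosen, moves)  # the two nodes already holding setA; zero stage-ins
```

A real meta scheduler would weigh such placement decisions against energy characteristics and spare network capacity, as described above, rather than considering data location alone.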
6.2.9 Object Store
Any object store we have running in the system may require tools on the login nodes for users to access, query, update, and remove data from the object store. Therefore, we should have the ability to run object store access software on the login nodes.
Furthermore, some object stores have management and control functionality that can be used to administer and monitor the object store. The login node would be the appropriate location for that functionality as well.
6.2.10 Operating System
The operating system on the login nodes requires the functionality to provide external access to the system and to undertake compilations and visualisation/graphical operations. Furthermore, the meta and job schedulers will run on the login nodes.
The login nodes may have their own local boot media as well as the potential to PXE-boot from the boot node. Besides standard functions, the O/S must support the CPUs that support SCM, a GPU, as well as the high performance network interface.
6.2.11 Performance Monitoring
Studying system and application performance requires a performance-monitoring component. These performance monitoring interfaces will be responsible for performance data collection during the runtime of an HPC application. This covers both data about the application, and data on the actions of the systemware.
In many cases the performance monitoring infrastructure is integrated into, or attached to, the applications to be monitored. For certain performance aspects, e.g. the scheduling of jobs, it is necessary to provide a self-contained monitoring component that runs independently as part of the systemware.
6.2.12 Performance Visualisation and Debugging Tools
Performance profiling is statistical analysis to study the overall performance of HPC applications. Performance tracing allows the recording of individual events from an application. The resulting trace file enables in-depth analysis of the application's runtime behaviour and is typically visualised using a time series. In contrast, debugging tools provide access to an application while it is running to address correctness issues as they happen.
All three forms of performance tracking require a sophisticated graphical interface and varioussystemwarecomponentstoprovidetheirservice.InthisprojectthetoolsAllineaMap,Vampir/Score-P,andAllineaDDTwillbedeployed.Together,theyprovidecomprehensivein-depthanalysisaidingtoresolveerrors,bottlenecksandperformoptimisations.
Vampir/Score-Pisaframeworkforperformanceanalysis,whichenablesdeveloperstoquicklystudytheapplicationbehaviouratafinegrainlevelandperformin-depthroot-causeanalysis.Thislevelofdetailisachievedbymeansofrecordingperformanceeventsandstoringthemtothetracefile.Animportant featureofVampir is the intuitiveandgraphical representationofdetailedperformanceeventsonatimeaxisandaggregatedprofiles.
These tools differ from the monitoring component of Section 6.2.11 in that they are used byapplicationdevelopersduringcodedevelopmentandoptimisation, rather than thedatacollectioncomponentsofperformancemonitoring/debugging.
6.2.13 Python
Python is a free, interpreted, object-oriented, high-level programming language with dynamic semantics. It is characterised by enabling rapid application development, as well as being used as a scripting or glue language to connect existing components together. It is becoming widely adopted within the scientific community due to its quick learning and development process, as well as for the existing scientific libraries, such as NumPy and SciPy.
In order to run Python applications, a Python interpreter is needed on the login nodes. Thus, in order to provide support for scientific applications, some libraries may be of interest (e.g. NumPy or SciPy).
Given that the applications that will be executed in NEXTGenIO that take advantage of PyCOMPSs are Python applications, we need Python support on the login nodes.
6.2.14 Scheduling Tools
Scheduling tools are used to interact with the job scheduler (either by invoking a command/application or programmatically using a well-defined API). Interactions with a job scheduler include:
• Submitting a batch script or single application for a parallel run
• Transmitting a file to nodes allocated for a particular job
• Querying the scheduler about the current state of jobs/nodes
• Cancelling a running or pending job
• Getting/updating the state of jobs/nodes
• Displaying accounting/historical data about nodes/jobs/users
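To make these interactions concrete, the sketch below builds the corresponding command lines in the style of a SLURM-like scheduler. The command names (sbatch, sbcast, squeue, scancel) are SLURM's, used here purely as a stand-in; this document does not fix a particular scheduler implementation.

```python
# Illustrative only: builds (but does not execute) SLURM-style command
# lines for the scheduler interactions listed above.

def submit(batch_script):
    """Command line to submit a batch script for a parallel run."""
    return ["sbatch", batch_script]

def broadcast(src, dest):
    """Command line to transmit a file to the nodes allocated to a job."""
    return ["sbcast", src, dest]

def query(user):
    """Command line to query the scheduler for a user's job states."""
    return ["squeue", "-u", user]

def cancel(job_id):
    """Command line to cancel a running or pending job."""
    return ["scancel", str(job_id)]
```

In practice these would be executed via a subprocess call, or replaced entirely by programmatic calls against the scheduler's API.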
6.2.15 Scientific Libraries
Many of the operations carried out in scientific applications can be composed from standard components. These have been developed and optimised extensively by domain specialists as well as hardware and compiler vendors. High-performance computational libraries are often optimised for the particular systems they are to be used on.
6.2.16 Task-based Programming Model
New programming models for the exploitation of large scale HPC systems aim to tackle programmability and performance for very large numbers of program processes or threads. One such approach is to use a task-based programming model, where developers specify the tasks required for their application and the runtime environment of the programming model distributes and runs these tasks in as efficient a way as possible.
One example of such a model is PyCOMPSs, which is the Python binding of COMP Superscalar (COMPSs), a task-based programming model that aims to ease the development of applications for distributed infrastructures.
It offers a simple programming model based on sequential development: a PyCOMPSs application is a plain sequential Python script with annotations that define tasks and their data dependencies. In the model, the user is mainly responsible for (i) identifying the functions to be executed as asynchronous parallel tasks and (ii) annotating them with a standard Python decorator.
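The annotation style can be illustrated with a self-contained sketch. Note that this is not PyCOMPSs itself: the @task decorator below only mimics the shape of the real one (which registers the function with the COMPSs runtime for asynchronous, dependency-tracked execution) by logging invocations and running each task eagerly.

```python
import functools

# Self-contained mock of the PyCOMPSs annotation style. The real runtime
# would enqueue each call as an asynchronous task and track the data
# dependencies between them; this stand-in records invocations and runs
# them immediately.

TASK_LOG = []

def task(returns=None):
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            TASK_LOG.append(fn.__name__)   # the runtime would enqueue here
            return fn(*args, **kwargs)     # eager execution stands in for scheduling
        return wrapper
    return decorate

@task(returns=int)
def increment(value):
    return value + 1

@task(returns=int)
def add(a, b):
    return a + b

# A PyCOMPSs-style script: plain sequential Python, with tasks identified
# purely by their decorators.
x = increment(1)
y = increment(2)
total = add(x, y)   # depends on the results of both previous tasks
```

In real PyCOMPSs, x and y would be futures resolved by the runtime, and the dependency of add on both increment calls would be detected and enforced automatically.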
The login nodes must have all the components of the task-based programming model required for the development of applications and the submission/management of jobs based on this technology.
6.3 Gateway Nodes
The primary purpose of the gateway nodes is to provide access to external storage systems. The nodes provide high bandwidth gateway functionality from IB (InfiniBand) to OPA (Omni-Path) and vice versa.
6.3.1 Operating System
The operating system on the Gateway nodes requires the functionality to provide external access to storage systems. For the prototype system, the bridging software running on the Gateway nodes will be the LNET Router software from Intel. Besides standard functions, this component must support the LNET router software, the new processors to be used in the prototype, and the network interfaces required to bridge between the main compute network and the external storage system network.
6.4 Boot Nodes
The boot nodes provide management and software provisioning functionality for the system. They will provide the CCNs with software images or downloads to enable CCNs to operate without local storage.
They will only be accessible by system administrators. The boot node systemware components are outlined in Figure 15.
Figure 15: Boot node systemware components
6.4.1 Operating System
The operating system on the Boot nodes requires the functionality to host the Pre-boot Execution Environment (PXE) to provide boot service via Trivial File Transfer Protocol (TFTP) to other nodes.
The boot nodes will also offer Domain Name System (DNS) and Dynamic Host Configuration Protocol (DHCP) services within the prototype domain. The boot node will host management and monitoring software and is the central hub of the NEXTGenIO prototype management system.
This includes CCN, Ethernet network and Omni-Path Fabric management. The boot nodes will be used for management access to all other nodes, the Ethernet switches and the Omni-Path switches. Therefore the operating system will need to support the management, monitoring, DNS, DHCP, and PXE-Boot software that will be run on it.
6.4.2 Management/Monitoring Software
The CCN and login nodes within the system require some software to manage their operation, start-up and shutdown. This includes monitoring the health of individual nodes, provisioning software images to boot compute nodes from, and managing the lifecycle of the whole system. Several different interfaces to the board management controllers (BMC) on each node are available.
The traditional Intelligent Platform Management Interface (IPMI) over LAN, Simple Network Management Protocol (SNMP) or the new RESTful Redfish (DMTF) can be used to retrieve and set management data and functions via a remote out-of-band (Management LAN) interface.
6.4.3 PXE-Boot Server
The boot server provides functionality for CCNs, or login nodes, to boot via the network rather than having an operating system installed locally. This, in turn, means that the CCNs can operate without local storage/disks attached.
6.4.4 DHCP Server
Dynamic Host Configuration Protocol (DHCP) is a client/server protocol that automatically provides a host with its IP address and other related configuration information such as the subnet mask and default gateway.
This component provides IP addresses on the normal network for the CCN, login, and boot nodes in the system.
6.4.5 DNS Server
The DNS server is a computer that provides a name service for a set of computers on a network. A DNS server runs special-purpose networking software, and contains a database of network names and addresses for other hosts.
6.4.6 Network Management
Software that configures and monitors the switches and networks used in the system is essential to ensure the correct running of the prototype. This will involve managing all of the networks we will deploy in the system through the management network.
6.4.7 Local Filesystem
The services on the boot node, such as the PXE-Boot service, need a place to store data (such as the boot images).
6.5 Complete Compute Nodes
The CCNs provide the computational resource for the system. They will not be directly accessible by users; user jobs will be run through scheduling components installed on the nodes.
There will be a large number of CCNs in the system and they will all have the same software configuration(s). It is expected that users will be able to reconfigure CCNs for different modes of hardware operation (rebooting between different SCM configuration modes), and potentially to enable or disable different systemware components (such as the object store or SCM filesystem if those are not in use).
Therefore, whilst the defined systemware architecture for the CCNs will be uniform, different components may be active or enabled on different CCNs at any given time, depending on the jobs that are being run on the system.
The NEXTGenIO architectures are also designed to allow some level of heterogeneity in the configuration of CCNs, with the potential to either have the same amount of SCM across all nodes, or to have systems where some nodes have no SCM. This is to allow systems to be deployed where SCM is provided in a subset of the CCNs, with the benefits of SCM available across the system using the functionality of the systemware NEXTGenIO is designing and implementing.
Figure 16: Complete compute node systemware components
6.5.1 Automatic Check-pointer
Applications that run for long periods of time (over a few hours) generally have to check-point (write some data to a persistent storage system, like a filesystem, that can be used to restart the application if necessary) to cope with hardware failures and job queue limits.
Currently, applications perform their own check-pointing; however, with SCM on the node it could be possible for the automatic check-pointing systemware component to periodically check-point whole application instances into SCM, removing the need for applications to check-point themselves.
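A minimal sketch of the idea is shown below, with a file in a temporary directory standing in for an SCM region, and an explicit state dictionary standing in for a whole application image. All names are illustrative; a real check-pointer would snapshot the application transparently rather than relying on the application exposing its state.

```python
import os
import pickle
import tempfile

# Sketch of periodic check-pointing with restart-on-failure semantics.

def save_checkpoint(path, state):
    # Write to a side file, then rename: the checkpoint is always either
    # the previous complete snapshot or the new one, never a torn write.
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)

def run(ckpt_path, steps, interval):
    """Resume from a checkpoint if one exists, otherwise start fresh."""
    if os.path.exists(ckpt_path):
        state = load_checkpoint(ckpt_path)
    else:
        state = {"step": 0, "acc": 0}
    while state["step"] < steps:
        state["acc"] += state["step"]       # the "computation"
        state["step"] += 1
        if state["step"] % interval == 0:   # periodic, instead of timer-driven
            save_checkpoint(ckpt_path, state)
    return state

ckpt = os.path.join(tempfile.mkdtemp(), "app.ckpt")
first = run(ckpt, steps=10, interval=2)     # first run, checkpointing as it goes
resumed = run(ckpt, steps=20, interval=2)   # "restart": picks up from step 10
```

The second call finds the checkpoint and continues from step 10 rather than recomputing from scratch, which is exactly the restart behaviour the automatic check-pointer would provide after a node failure.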
6.5.2 Communication Libraries
Communication libraries must be available on the CCNs for use by HPC and HPDA applications if they have been compiled dynamically (i.e. the operating system looks for the library object files when the application runs rather than including them at compile time).
6.5.3 Data Movers
Data movers are used by the data scheduler to undertake specific data operations. This could be copying data between different filesystems, from local to remote CCNs, or between different data storage targets (for instance between an object store and a filesystem).
There will be different data movers for different operations (e.g. one for moving files around, another for moving data to a remote CCN).
6.5.4 Data Scheduler
The data scheduler can migrate data between different hardware levels within a single node, independently from an application, and between different nodes in the system, specifically to and from SCM on other nodes over the high performance network (probably using RDMA). This functionality can be requested by other software components (including applications).
The data scheduler also keeps track of data and what hardware that data is located in. On the CCN, the data scheduler component provides the local functionality to move data between storage levels (e.g. filesystem, SCM, DRAM) as instructed by the higher level scheduling component (the job scheduler on the CCN in coordination with the master job scheduler).
The data scheduler is primarily associated with moving data for a given job or application; this may be staging data in prior to a job starting, or staging data out after a job has finished.
The data scheduler is also responsible for handling remote data accesses.
6.5.5 I/O Libraries
Many applications use I/O libraries, such as HDF5 or NetCDF, to simplify I/O for particular application areas, to provide metadata for datasets, or to integrate data with other applications/tools.
I/O libraries may be built on top of other I/O libraries (e.g. HDF5 may build on MPI-I/O for parallel functionality, and NetCDF on top of HDF5).
6.5.6 Java
See subsection 6.2.4.
6.5.7 Job Scheduler
A job scheduler agent running on the CCNs is responsible for running an application which is part of a parallel job. In order to provide this basic functionality, a job scheduler agent must:
• Be able to update the current configuration environment of a node;
• Execute a parallel application;
• Monitor all tasks running on a node;
• Kill/cancel tasks running on a node;
• Communicate with the master job scheduler (in order to provide updates);
• Accept commands from the master job scheduler;
• Be able to act if communication with the master job scheduler is lost;
• Be able to communicate with the data scheduler to understand the current state of data in the system;
• Pass commands to the data scheduler.
6.5.8 Management/Monitoring Software
The compute and login nodes within the system require some software to manage their operation, start-up and shutdown. This includes monitoring the health of individual nodes, provisioning software images to boot CCNs from, and managing the lifecycle of the whole system.
This component runs on each of the CCNs and is used by the management/monitoring software components on the boot nodes to undertake management or monitoring operations on the CCNs.
6.5.9 Multi-node SCM Filesystem
The multi-node SCM Filesystem provides a lightweight POSIX filesystem that has two main goals: first, to hide the complexity of the different levels of local storage present in a computing node (i.e. DRAM, SCM and high performance parallel filesystem) by aggregating them into a virtual storage device and, secondly, to allow computing nodes to export SCM regions so that they can be used transparently by other nodes, thus allowing for a larger storage capacity and the improved efficiency of collaborative, multi-node I/O operations.
The rationale behind this software component is that there are many legacy HPC applications which may be hard or costly to modify to use SCM technology; it would be desirable that they could at least get some of the benefit of the NEXTGenIO architecture without any modifications.
Additionally, the multi-node SCM Filesystem offers the following opportunities and advantages: (1) policies can be added to automatically control the level at which data is stored (DRAM, SCM or high performance parallel filesystem) and when it should be migrated; (2) the opportunity to distribute I/O between the SCM of a set of nodes and keep the results in a "cooperative burst buffer", which could be reused between jobs or written to long-term storage when the cluster load is appropriate; and (3) offering an API to schedulers and applications so that they can better communicate their needs to the filesystem.
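Opportunity (1) can be sketched as a simple placement policy. The level names and decision rules below are hypothetical, chosen only to show where such a policy layer would sit in the filesystem:

```python
# Illustrative placement policy: pick a storage level for a piece of data
# based on its size, the free capacity of each level, and whether the data
# is "hot" (frequently accessed). Thresholds and names are hypothetical.

def place(data_size, dram_free, scm_free, hot=False):
    """Return the storage level for a piece of data: hot data that fits
    goes to DRAM, warm data that fits goes to SCM, and everything else
    falls through to the external parallel filesystem."""
    if hot and data_size <= dram_free:
        return "DRAM"
    if data_size <= scm_free:
        return "SCM"
    return "parallel-filesystem"
```

A migration policy would run the same decision periodically, demoting data that has gone cold from DRAM to SCM and eventually to the parallel filesystem.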
6.5.10 Object Store
Recently some HPC applications have started using object stores to manage data instead of, or as a complement to, traditional filesystems. Object stores provide applications with variable granularity data in the form of individually identified objects instead of files in a directory hierarchy. Such object stores appear to be a good match to distributed storage, especially one that can provide user-level byte-granularity access (as SCM technology offers).
Intel have been working on one such object store, DAOS. Another example is dataClay, a data store developed by BSC that allows accessing persistent objects as if they were in memory. dataClay can store functions or code snippets, as well as data, that can be used by the PyCOMPSs task-based programming system.
6.5.11 Operating System
The operating system on the CCNs must be configured to allow high performance computations and to support new non-volatile main memory functionality. It must allow the implementation of new storage paradigms such as direct load/store access of persistent main memory, without page caches, but also distributed shared storage based on SCM via a high-speed, low latency, high bandwidth interconnect.
To avoid traditional local storage devices such as HDD or SSD, the O/S must support a memory disk to boot straight from the boot loader on the boot node. As well as the standard functions, the O/S must support the SCM hardware, processors that can support SCM, and the network interface used for the main compute network.
6.5.12 Performance Monitoring
This performance monitoring component is responsible for collecting performance information from an application execution. In the first case, the performance monitoring component attaches itself to the execution of the application. Different states of the program under observation are inspected by periodically interrupting the running program. Examples would be setitimer() or overflow triggers of hardware counters that fire periodically to observe the application execution.
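The setitimer() technique described above can be sketched in a few lines: a timer periodically interrupts the running program, and the handler records where execution currently is. A real monitor would capture the full call path and hardware counter values at each interrupt; this sketch records only the name of the interrupted function.

```python
import signal
import time

# Interrupt-driven sampling: setitimer() raises SIGALRM at a fixed
# interval, and the handler records an observation of the running program.

samples = []

def on_sample(signum, frame):
    # 'frame' identifies where the program was interrupted; this is the
    # basis for call-path profiling.
    samples.append(frame.f_code.co_name if frame else "?")

signal.signal(signal.SIGALRM, on_sample)
signal.setitimer(signal.ITIMER_REAL, 0.01, 0.01)   # fire every 10 ms

def busy_work(deadline):
    x = 0.0
    while time.monotonic() < deadline:
        x += 1.0
    return x

busy_work(time.monotonic() + 0.2)                  # observed for ~200 ms
signal.setitimer(signal.ITIMER_REAL, 0.0)          # stop sampling
```

Over the 200 ms run, roughly twenty samples accumulate, almost all attributed to busy_work; aggregating such samples over a long run yields the statistical profile described in Section 6.2.12.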
In order to inspect the execution state of an application, the call path and information from the hardware performance counters are utilised. Call paths provide information regarding the functions and regions within the application, whereas performance counters provide information regarding hardware related activities. The performance monitoring component directly attaches itself to the interfaces which can provide information regarding the application execution. These interfaces include PAPI, hwloc, netloc, procfs and NVML.
In some cases, this component will query hardware resources on the node. For this purpose, systemware monitoring interfaces, such as interfaces to the scheduler, are used. The scheduling API provides a monitoring mechanism to interpret the overall resource consumption of the application on a node.
Moreover, the scheduler will also be responsible for managing the optimal utilisation of SCM resources, requiring it to maintain up to date information on the status of SCM. Consequently, the information collected by the scheduler can be passed to the performance monitoring component.
6.5.13 Persistent Memory Access Library
One way to provide simpler and reusable SCM access and management methods is to use a library suite, which can then be used by both user applications and systemware. Among the core expected functionality, such libraries should provide mechanisms for applications or systemware to create and manage persistent memory regions and their keys, including across power cycles.
The library should also provide mechanisms to facilitate storage of persistent data, such as mechanisms for reliable transactional updates. Optionally, the libraries could provide methods for manipulation of common higher-level data structures in persistent memory, such as lists and trees, and even simple key-value stores.
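An interface sketch of such a library suite is shown below. The names (create_region, attach, transaction) are hypothetical, loosely modelled on what existing persistent-memory suites such as Intel's PMDK provide, and a dictionary of bytearrays stands in for SCM regions that would survive power cycles.

```python
import contextlib

# Interface sketch only: keyed persistent regions plus all-or-nothing
# transactional updates, the two core capabilities described above.

_regions = {}

def create_region(key, size):
    """Create a named persistent region and return it."""
    _regions[key] = bytearray(size)
    return _regions[key]

def attach(key):
    """Re-attach to an existing region by key, e.g. after a restart."""
    return _regions[key]

@contextlib.contextmanager
def transaction(region):
    """All-or-nothing update: roll back to the undo copy on failure."""
    undo = bytes(region)           # a real library logs only touched ranges
    try:
        yield region
    except Exception:
        region[:] = undo           # rollback: the partial update vanishes
        raise

buf = create_region("results", 8)
with transaction(buf):
    buf[0:4] = b"done"             # committed atomically

try:
    with transaction(buf):
        buf[0:4] = b"oops"
        raise RuntimeError("simulated crash mid-update")
except RuntimeError:
    pass                           # the torn write was rolled back
```

After the simulated failure the region still reads "done", which is the consistency guarantee a transactional persistent-memory update must provide across real crashes and power cycles.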
6.5.14 Python
See subsection 6.2.13.
6.5.15 SCM Driver
As with any hardware device, the SCM must be exposed to the rest of the system and the O/S through a specific device driver. We expect this to be provided by the SCM vendor (in this case Intel). We do not expect to need to change the device driver.
6.5.16 SCM Local Filesystem
One of the most straightforward methods to exploit SCM for persistent storage in each compute node is to use a local filesystem. While any filesystem can be used with the SCM, in order to exploit the byte-level load/store access to SCM it is necessary that the local filesystem supports the DAX extensions. These extensions allow users to bypass all the traditional block oriented structures in the I/O stack, such as the page cache.
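The load/store access pattern that DAX enables can be illustrated with mmap. On a filesystem mounted with the dax option, the same calls map SCM directly into the process address space with no page cache in between; the sketch below uses an ordinary temporary file as a stand-in.

```python
import mmap
import os
import tempfile

# Load/store-style access to a file-backed region. On a DAX mount this
# mapping would be direct to SCM, bypassing the page cache entirely.

path = os.path.join(tempfile.mkdtemp(), "region.dat")
size = mmap.PAGESIZE

with open(path, "wb") as f:
    f.truncate(size)                  # reserve the region

f = open(path, "r+b")
m = mmap.mmap(f.fileno(), size)
m[0:5] = b"hello"                     # a store instruction, not a write() syscall
m.flush()                             # msync; on DAX this maps to CPU cache flushes
m.close()
f.close()

with open(path, "rb") as f:
    stored = f.read(5)                # the stores are durably visible
```

The key point is that the application's updates are plain memory stores; the only explicit I/O-stack involvement is the flush needed to make them durable.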
6.5.17 Task-based Runtime Environment
Task-based programming models generally require a runtime component that executes and controls tasks associated with a task-based program on compute nodes.
The PyCOMPSs runtime manages the user application by exploiting the inherent concurrency of the script, automatically detecting and enforcing the data dependencies between tasks and spawning these tasks onto the available resources, which in this scenario are considered able to handle persistent objects across the infrastructure.
6.5.18 Volatile Memory Access Library
While the system can transparently provide large volatile memory to applications on top of combined DRAM and SCM (in 2LM mode), some applications may wish to manage their own volatile data and DRAM.
One example would be applications explicitly storing frequently accessed data in DRAM and less frequently accessed data in NVRAM, but not requiring persistency of the data in NVRAM. Another example would be applications implementing a software cache using NVRAM for data mirrored in external storage devices. In such cases, a mechanism is required to allow applications to specify where (either DRAM or NVRAM) memory should be allocated.
7 Usage Patterns
To allow a more complete understanding of how a system developed from our architectures can be used, we outline some of the possible usage scenarios we have developed during the requirements gathering and architecture design processes in the project.
The following sections provide details on the functionality that the hardware and systemware can offer applications. We build on the use cases, and associated usage scenarios, we have identified to outline the systemware and hardware components used by a given application, and the lifecycle of the data in those components.
For each usage scenario, we describe which systemware components are active, how data flows through these components and the user application, where data resides, and the state of the system and data after the application has finished. Sequence diagrams that illustrate the data lifecycle can be found in the Appendix (see page 51).
7.1 General Purpose HPC system using multi-node SCM filesystem
7.1.1 Description
Many HPC systems host a wide range of jobs, from small single node jobs to full system jobs. For the UK national HPC service that UEDIN operates we see a breakdown of jobs that is roughly as follows:
• Breakdown by compute time
  o 15% < 96 cores
  o 80% 96–12,000 cores
  o 5% > 12,000 cores
• Breakdown by number of jobs submitted
  o 70% < 96 cores
  o 29% 96–12,000 cores
  o 1% > 12,000 cores
As there is a wide range of jobs on the system, we see a wide range of I/O patterns, from applications that do very little I/O to those that are dominated by I/O. We estimate the I/O cost is less than 10% for the majority of applications. Most applications will perform check-pointing to cope with the maximum runtime limit on the system.
Such a system generally uses a parallel filesystem and there are no disks in the compute nodes. All I/O is done to the parallel filesystem over the high performance communication network fabric used for inter-process communications.
As this is a general purpose usage scenario there is potential for applications running on a system like this to use all the different components available in the NEXTGenIO systemware and hardware architecture. However, for this example we will restrict ourselves to applications that have not been modified, are simply re-compiled, and write their data to files in a standard filesystem. Traditionally such applications would use the parallel filesystem attached to an HPC system to perform their I/O, would be run using a batch system, and would operate within the resource limits placed on them by that batch system.
We assume that applications use the multi-node SCM filesystem for I/O whilst the application is running, and the external high performance filesystem for data storage after an application has finished.
For this usage scenario we are using SCM in 1LM mode.
7.1.2 Components
• Job Scheduler
• Multi-node SCM filesystem
• I/O Libraries
• Communication Libraries
7.1.3 Data lifecycle
1. Scheduling requests submitted.
   a. At this point the executable to be run is either stored on the external high performance filesystem or on the local filesystem on the login nodes, and the data for the application is stored on the external high performance filesystem.
2. Job scheduler allocates resources for an application.
   a. The job scheduler ensures that nodes are in 1LM mode and that the multi-node SCM filesystem component is operational on the nodes being used by this application.
3. Once the nodes that the job has been allocated to are available, the data scheduler copies input data from the external high performance filesystem to the multi-node SCM filesystem.
   a. This step is optional; an application could read its input data directly from the external high performance filesystem, but using the multi-node SCM filesystem will deliver better performance.
4. Job Launcher starts the user application on the CCN.
5. Application reads data from the multi-node SCM filesystem.
6. Application writes data to the multi-node SCM filesystem.
7. Application finishes.
8. Data Scheduler is triggered and moves data from the multi-node SCM filesystem to the external high performance filesystem.
7.2 General Purpose HPC system improving throughput
7.2.1 Description
This is a general purpose usage scenario, meaning there is potential for applications to use all the different components available in the NEXTGenIO systemware and hardware architecture. However, for this example we are considering how to use the NEXTGenIO systemware to optimise the overall utilisation or energy efficiency of the system.
We are assuming that CCNs are in varied states, with various data loaded onto SCM on different CCNs. The schedulers are used by the meta scheduler to ensure that efficient use is made of the whole machine or of particular resources that are of interest (e.g. total energy consumed, power limits, or time for throughput of jobs).
7.2.2 Components
• Data Aware Job Scheduler
• Energy Aware Job Scheduler
• Meta Scheduler
• Multi-node SCM filesystem
• I/O Libraries
• Communication Libraries
7.2.3 Data lifecycle
1. Scheduling requests submitted.
   a. At this point the executable to be run is either stored on the external high performance filesystem or on the local filesystem on the login nodes, and the data for the application may be on the external filesystem or on SCM somewhere within the system.
2. Meta scheduler is informed of the job scheduling requests and undertakes an analysis of what resources would be most efficient to use for this job, taking into account the constraints assigned to the meta scheduler.
3. Meta scheduler generates job and data scheduler instructions.
4. Job scheduler allocates resources for an application.
5. Once the nodes that the job has been allocated to are available, the data scheduler copies input data from the external high performance filesystem to the multi-node SCM filesystem.
   a. This step is optional; the data may already be on-node or may still be on the external filesystem.
6. Job Launcher starts the user application on the CCN.
7. Application reads data from the multi-node SCM filesystem.
8. Application writes data to the multi-node SCM filesystem.
9. Application finishes.
10. Data Scheduler may be triggered to move data from SCM storage to the external high performance filesystem.
7.3 Resilience in specialist HPC system
7.3.1 Description
Another category of HPC system is those that are used for a single academic or industrial discipline. This type of system can still be very large, but will typically run a smaller number of longer jobs, and each job will be larger than is usually seen on a shared HPC system. Check-pointing is critical for the safe operation of jobs to ensure that any hardware failure does not result in losing a large job.
Applications currently manually check-point their running state and datasets to enable restart on failure of jobs or on jobs reaching batch system time limits. SCM can allow for O/S-level check-pointing, rather than application level check-pointing, with the O/S triggering a dump of an application to SCM memory for future use independently of the application itself. It can also allow for applications to check-point to SCM manually.
For this usage scenario, we are using the SCM for check-pointing of running applications to ensure that applications can be recovered after hardware failures. In this context, we are using the SCM in 1LM mode, although 2LM mode with an App-Direct region is also possible.
7.3.2 Components
• Job Scheduler
• Automatic Check-pointer
• Communication Libraries
• I/O Libraries
7.3.3 Data lifecycle
1. Job scheduling request submitted.
   a. Scheduling request includes automatic check-pointing requirements.
2. Job scheduler allocates resources for an application.
3. Once the nodes the job has been allocated to are available, the job launcher starts the user application.
4. The user application notifies the automatic check-pointer that check-pointing is required, and at what frequency.
   a. Some applications may notify the automatic check-pointer when check-pointing can occur.
   b. Others may let the check-pointing occur as decided by the check-pointer.
5. Two scenarios are now possible.
   a. Application completes successfully.
      i. Scheduler notifies the automatic check-pointer that check-points are no longer required.
      ii. Check-pointer removes check-points.
   b. Application failed for some reason.
      i. Re-run job submitted with a check-point ID provided to resume the run.
      ii. Application runs to completion as above.
If check-point data was written to the multi-node SCM filesystem instead of straight to SCM, then it is possible that jobs could be restarted without the failed nodes, provided the multi-node SCM filesystem has provided data backup/replication/migration functionality to ensure the check-point data is available on multiple nodes, not just the node it was written on.
7.4 Large Memory
7.4.1 Description
Applications such as those in bio-informatics often rely on large scale shared memory systems to enable very large data sets to be loaded and processed. For these applications, reading and writing data from/to storage systems takes a long time, so reducing this cost is important.
For this example, we are considering using the SCM as a very large memory space to perform in-memory processing of full datasets, reading the data into memory once and then writing the data back to the high performance filesystem at the end of the computation. In this context, we are using the SCM in 2LM mode and a single job will only use one node in the system.
7.4.2 Components
• Job Scheduler
• I/O Libraries
7.4.3 Data lifecycle
1. Job scheduling request submitted.
   a. Scheduling request includes information that 2LM mode is required.
2. Job scheduler allocates a node for the application.
3. Once the node that the job has been allocated to is available, the job scheduler ensures it is in 2LM mode.
   a. Either this is simply selecting a node already in that mode, or;
   b. The scheduler initiates a node reboot as required to get the node into 2LM mode.
4. The job scheduler starts the user application on the node.
5. The user application loads data from the external high performance filesystem into memory.
6. The memory controller hardware loads data into SCM and uses the DRAM as a cache for that SCM.
7. Application processes data.
8. Application writes data back to the external high performance filesystem.
9. Application finishes and the job scheduler releases the node used.
7.5 Workflows
7.5.1 Description
Jobs from an increasing number of scientific areas are often characterised by a workflow/job chaining approach, where one application may read data, then produce output/write data, which is in turn used by another application. Reading and writing data to be passed between applications can take a long time, therefore reducing this cost is important.
In this context we are using SCM in 1LM mode.
7.5.2 Components
• Job Scheduler
• I/O Libraries
• Data Scheduler
• Data Movers
7.5.3 Data lifecycle
1. Job scheduling request submitted.
   a. Scheduling request includes information that 1LM mode is required.
2. Job scheduler allocates nodes for the application.
   a. This may include restarting nodes to ensure they are in 1LM mode.
3. The job scheduler starts the first user application in the workflow on the nodes.
4. The user application loads data from the external high performance filesystem into memory.
   a. This may be SCM or DRAM.
5. Application processes data.
6. Application writes data to SCM.
7. Application finishes and informs the Data Scheduler of data to be retained.
8. Data Scheduler takes control of the data.
   a. This ensures data is not lost once the owning application finishes.
   b. The Data Scheduler or Scheduling Daemons will generate a key for future applications to use, to ensure the security of that data, and will return that key to the finishing application or user script.
9. Next application in the user workflow submitted to the job scheduler with notification that it requires previously written data.
   a. The application or script will use the previously generated key to take control of the data stored in SCM.
10. Job scheduler allocates nodes for the application.
    a. It is likely these will be the same nodes as those used by the previous application, but may not be. If they are different nodes, the job and data schedulers are responsible for ensuring workflow data is available when the next workflow application starts.
11. The job scheduler starts the next user application in the workflow on the nodes allocated to that application.
12. Application processes data.
13. Application writes data to SCM.
14. Application finishes and informs the Data Scheduler of data to be retained.
15. Data Scheduler takes control of the data.
    a. This ensures data is not lost once the owning application finishes.
    b. The Data Scheduler or Scheduling Daemons will generate a key for future applications to use, to ensure the security of that data, and will return that key to the finishing application or user script.
16. Steps 9–15 repeated as required.
17. Final application in the workflow finishes.
    a. Data Scheduler uses Data Movers to write the data to be stored back to the external high performance filesystem.
    b. Nodes released by the job scheduler.
7.6 Data Analytics
7.6.1 Description
Data analytics usage cases generally have very large data sets that are analysed often, and the analysis performs small amounts of work per data point. There are therefore large numbers of data reads, and small amounts of compute or data writes. Often the main overhead comes from loading the data prior to analysis, or from converting the data into a format that can be queried, rather than from the analysis or querying itself.
Once data has been loaded it is often queried or analysed many times. Persistence of large data sets in SCM is beneficial as it enables data to be available after reboots and hardware faults. Persistence of the analysis data (i.e. individual analysis runs) is not important; if an individual analysis run is lost, it can be re-run.
7.6.2 Components
• Job Scheduler
7.6.3 Datalifecycle
1. RequestsubmittedtojobschedulerfordatasettobeloadedintoSCMonCCNsfromthehighperformancefilesystem.
2. The job scheduler instructs specific CCNs data schedulers to load the specified data in to theappropriateplaceinthesystem.
a. ThiscouldbelocalSCMonindividualnodes.b. Thiscouldbethemulti-nodeSCMfilesystem.c. Thiscouldtheobjectstore.
3. ThedataschedulerloadsthedataandstoresareferencetowhatdatahasbeenstoredandonwhichCCN.
4. A user submits a request to the job scheduler to run an application.
5. Job scheduler allocates resources for an application.
   a. The scheduling takes into account the data required by the job when allocating resources, by liaising with the data scheduler.
   b. If the data is not present on any resources, or the requested resources do not match where the data is available, then the job is rejected.
6. Once the nodes that the job has been allocated are available, the job launcher starts the user application.
7. The application uses the data available on the nodes.
8. The application writes any data it produces back to the high performance filesystem.
9. The application finishes.
10. Compute resources are freed up for use.
11. Steps 4-10 are repeated as required.
    a. Multiple jobs can run using the pre-loaded data at the same time as required.
12. Once the data is no longer required on the system, the job scheduler is notified that the data can be removed.
13. The job scheduler notifies the data schedulers that they can remove the data.
14. The data scheduler removes the data.
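The data-aware allocation in steps 2-5 can be sketched as follows. This is an illustrative sketch only, not the NEXTGenIO interfaces: all class and method names are hypothetical, and a dictionary stands in for the data scheduler's placement records.

```python
class DataScheduler:
    """Tracks which dataset replicas each CCN's SCM currently holds (step 3)."""
    def __init__(self):
        self.placement = {}  # dataset name -> set of node ids holding it

    def record_load(self, dataset, node):
        self.placement.setdefault(dataset, set()).add(node)

    def nodes_holding(self, dataset):
        return self.placement.get(dataset, set())


class JobScheduler:
    def __init__(self, data_scheduler):
        self.data_scheduler = data_scheduler

    def allocate(self, dataset, requested_nodes):
        """Steps 5a/5b: allocate only if the data is on the requested nodes."""
        available = self.data_scheduler.nodes_holding(dataset)
        if not set(requested_nodes) <= available:
            return None  # job rejected: data not where it was requested
        return list(requested_nodes)


ds = DataScheduler()
ds.record_load("climate-2017", "ccn01")
ds.record_load("climate-2017", "ccn02")
js = JobScheduler(ds)

print(js.allocate("climate-2017", ["ccn01"]))   # ['ccn01']
print(js.allocate("climate-2017", ["ccn03"]))   # None (rejected, step 5b)
```

Because the job scheduler liaises with the data scheduler before allocation, multiple jobs can safely share the same pre-loaded dataset (step 11a) without re-reading it from the external filesystem.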
7.7 Live process pausing and migration
7.7.1 Description
Batch systems for HPC are designed to allow a range of job lengths and sizes to be run. However, if there are urgent or large scale jobs waiting in the batch queue, the scheduler must allocate resources for these future jobs and block those resources until the jobs are ready to run (this is often called backfilling/queue draining).
SCM gives the possibility to pause live jobs by check-pointing their states from DRAM to SCM to free up resources to run urgent jobs quickly. This cuts down the backfilling/draining time for large scale or urgent jobs. It is also possible for applications to be migrated between nodes through RDMA access to SCM.
To achieve this functionality, the automatic check-pointer and job scheduler will work together to migrate a running application into a stored state in SCM, run a new application, and then restore the application from the check-point once the resources are once again free.
This relies on having a portion of the SCM available, regardless of whether 1LM or 2LM mode is being used, to ensure that there is sufficient space for a running application to be check-pointed.
7.7.2 Components
• Job Scheduler
• Automatic Check-pointer
• Job Launcher
7.7.3 Data lifecycle
1. Application is running on a set of CCNs.
2. A high priority job is submitted to the job scheduler.
3. Job scheduler allocates a set of nodes to the high priority job.
4. Job scheduler notifies the automatic check-pointer on each of the allocated nodes to check-point the jobs.
5. Check-pointer takes a check-point and stores it in SCM.
6. Check-pointer notifies the job scheduler that the check-point has been taken and where it is.
7. Job scheduler kills the running application(s).
8. Job launcher starts the high priority application.
9. High priority application runs to completion.
10. Job scheduler reallocates the nodes to the previous jobs and resubmits their job scripts with restore from automatic check-pointing enabled.
11. Application(s) restart.
12. Job scheduler notifies the automatic check-pointer that the checkpoint data can be removed.
13. Application(s) run to completion.
14. Nodes are released for future use.
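The pause-and-restore cycle above can be sketched minimally as follows. A Python dictionary stands in for the SCM region reserved for check-points; the function names are illustrative, not the project's check-pointer interface.

```python
scm_checkpoints = {}   # stands in for the SCM area reserved for check-points

def checkpoint(job_id, dram_state):
    """Steps 4-6: copy the job's DRAM state into SCM and report a handle."""
    scm_checkpoints[job_id] = dict(dram_state)
    return job_id

def restore(job_id):
    """Steps 10-12: the resubmitted job restarts from the stored state,
    after which the checkpoint data is removed."""
    return scm_checkpoints.pop(job_id)

# A long-running job is paused so a high-priority job can use its nodes...
running_state = {"iteration": 1500, "residual": 3.2e-6}
handle = checkpoint("job-42", running_state)

# ...the high-priority application runs to completion here (steps 8-9)...

resumed = restore(handle)
print(resumed["iteration"])   # 1500
```

On the real system the check-point sits in byte-addressable SCM, so the DRAM-to-SCM copy avoids the external filesystem entirely, which is what makes the pause fast enough to shorten backfilling/draining times.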
7.8 Object Store
7.8.1 Description
Object stores are a data storage approach that manages data as objects, in contrast to traditional storage architectures such as file systems, where data is managed as a file hierarchy, and block storage, where data is dealt with in chunks called blocks. As object stores provide storage that can mirror standard data access in most applications, they offer the potential for fine grained, low overhead, data storage and access.
Matching the low overheads of object stores, compared to file access, with the high performance of SCM, compared to traditional I/O devices, offers extremely high I/O performance and new ways of operating for HPC and HPDA applications.
This scenario involves an application that has been modified to use an object store, running on a system with SCM, with at least part of the SCM available for direct access from the object store.
7.8.2 Components
• Object store
• Job Scheduler
7.8.3 Data lifecycle
1. Object store service is initiated through the job scheduler.
2. Job is submitted requesting nodes with object store enabled.
3. Job scheduler checks which nodes have object store available.
4. Job scheduler allocates required nodes.
   a. Job submission fails if the node request cannot be matched.
5. Job scheduler runs the job on the required nodes.
6. Job runs and uses the object store.
7. Job finishes.
8. Steps 3-7 are repeated as required.
9. Object store service is closed through the job scheduler.
10. Data scheduler(s) can persist the object store to the external high performance filesystem if required.
    a. If not persisted, the object store is lost at the end of a job.
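The lifecycle above can be sketched with a toy object store: objects live in (SCM-backed) memory while the service runs and are lost at the end of the job unless explicitly persisted (step 10). The class, its methods, and the JSON file standing in for the external high performance filesystem are all illustrative assumptions, not the NEXTGenIO object store API.

```python
import json
import os
import tempfile

class ObjectStore:
    """Toy key-value object store; real SCM-backed stores avoid serialisation."""
    def __init__(self):
        self.objects = {}

    def put(self, key, value):
        self.objects[key] = value

    def get(self, key):
        return self.objects[key]

    def persist(self, path):
        """Step 10: data scheduler writes the store to the external filesystem."""
        with open(path, "w") as f:
            json.dump(self.objects, f)

# Step 6: a job stores fine grained objects rather than whole files.
store = ObjectStore()
store.put("result/step-1", [0.1, 0.2, 0.3])

# Step 10: persist before the service is closed, otherwise data is lost (10a).
path = os.path.join(tempfile.mkdtemp(), "store.json")
store.persist(path)

with open(path) as f:
    print(json.load(f)["result/step-1"])   # [0.1, 0.2, 0.3]
```

The key design point is in step 10a: persistence is an explicit, optional action by the data scheduler, so jobs that only need scratch objects pay no external I/O cost at all.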
7.9 Weather forecasting component overview
7.9.1 Description
The operational workflow at ECMWF is complex, and involves many (approx. 10,000) tasks tied together by an external control system, with substantial inter-stage data flows.
Within NEXTGenIO, the focus is on adapting the flow of data between the various components, to avoid data transfers to and from the high-performance filesystem within the time-critical pathway. This is likely to involve some form of object store, on nodes containing SCM, which can retain the relevant data in memory throughout the critical time periods. How this functionality fits into the overall workflow is yet to be explored, and in particular it is not clear to what extent this data storage should be collocated with other operational components.
The overall weather forecasting workflow, although it is purely an application, expects to operate with substantial reserved resources and very short queue times, and thus to have a very high level of control over how the overall workflow operates on the machine. As a result, some of the components appear to have systemware scope, despite being part of the application rather than the systemware.
There are two pseudo-systemware components that are part of the ECMWF operational workflow:
• Workflow Controller: ECMWF uses a custom workflow controller that runs continuously on a node outside the HPC system. It interrogates and submits jobs to the scheduler, and receives status updates directly from the HPC jobs over the IP network.
• Hot Object Store: A domain specific multi-node object store for meteorological data that can be addressed using the same requests as the existing indexed data stores at ECMWF. This may, or may not, be collocated with other application components, or operate as its own task on the HPC prototype. It makes use of the Persistent Memory Access Library to make use of the large memory capacity available on SCM. It communicates with other HPC tasks over the high-performance network.
Within this workflow, SCM will be explicitly managed in 1LM mode.
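Explicitly managed SCM in 1LM mode is typically exposed to applications as a memory-mapped file on a DAX-mounted filesystem. The sketch below illustrates that access pattern using an ordinary temporary file standing in for the SCM device; a real implementation would go through the Persistent Memory Access Library (pmem.io) rather than plain mmap, and the "GRIB2" payload is only an example.

```python
import mmap
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "scm_region")
size = 4096

# Create and size the backing "SCM" region (on the prototype, a DAX file).
with open(path, "wb") as f:
    f.truncate(size)

# Map it into the address space and write a field directly, as
# load/store access to byte-addressable SCM would.
with open(path, "r+b") as f:
    mem = mmap.mmap(f.fileno(), size)
    mem[0:5] = b"GRIB2"   # e.g. the start of an encoded output field record
    mem.flush()           # on real SCM: flush CPU caches to the persistence domain
    mem.close()

# The data survives the mapping (and, on real SCM, a power cycle).
with open(path, "rb") as f:
    print(f.read(5))      # b'GRIB2'
```

The important contrast with 2LM mode is that here the application chooses exactly which data lives in SCM, which is what lets the Hot Object Store retain fields in memory throughout the time-critical window.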
7.9.2 Components
• Job Scheduler
• Persistent Memory Access Library
• Communication Libraries (MPI)
• High-performance filesystem
• I/O libraries (GRIB)
7.9.3 Data lifecycle
Note the description above of the non-systemware components called the "workflow controller" and "hot object store".
1. Data arrives from a wide variety of external sources on an ongoing basis and is pre-processed before being stored on the high-performance parallel filesystem.
2. During the time-critical window, the Workflow Controller monitors the status of the operational tasks, ensuring that the correct tasks are submitted at the correct time.
3. At the start of the time-critical window, the pre-processed data is read from the high-performance parallel filesystem into the 52 primary computational tasks.
4. At intervals, output fields are stored in the Hot Object Store, storing the data in SCM using the Persistent Memory Access Library. In a non-time-critical manner, this data may also be duplicated onto the high performance filesystem at this time.
5. Post-processing processes are triggered at the appropriate moment by the workflow controller. They retrieve the relevant data from the Hot Object Store over the high-performance network.
6. The output of post-processing is stored on the high-performance filesystem for later retrieval.
7. Data archiving tasks, outside the time-critical pathway, retrieve meteorological input data and output fields from the high performance filesystem and the hot object store for external archiving.
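The Hot Object Store described above is addressed by the same kind of key-value requests as ECMWF's existing indexed data stores. The sketch below illustrates that request style for steps 4 and 5; the class, its methods, and the key fields used ("parameter", "step") are illustrative examples, not the actual ECMWF request schema.

```python
class HotObjectStore:
    """Toy keyed field store; real storage would sit in SCM via the
    Persistent Memory Access Library and span multiple nodes."""
    def __init__(self):
        self.fields = {}

    def archive(self, key, field):
        """Step 4: computational tasks store output fields keyed by metadata."""
        self.fields[frozenset(key.items())] = field

    def retrieve(self, key):
        """Step 5: post-processing retrieves data with the same kind of request."""
        return self.fields[frozenset(key.items())]

hos = HotObjectStore()

# A primary computational task archives a 2-metre temperature field at step 6.
hos.archive({"parameter": "2t", "step": 6}, b"<encoded GRIB field>")

# Post-processing asks for it by metadata; key order does not matter.
print(hos.retrieve({"step": 6, "parameter": "2t"}))   # b'<encoded GRIB field>'
```

Because retrieval is by metadata rather than by file path, the time-critical pathway never touches the parallel filesystem; duplication to disk (step 4) happens off the critical path.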
7.10 Distributed workflow management
7.10.1 Description
PyCOMPSs is the Python binding of COMP Superscalar (COMPSs), a task-based programming model that aims to ease the development of applications for distributed infrastructures.
PyCOMPSs manages the user application and interacts with dataClay in order to handle persistent objects across the infrastructure. In this model, the user is mainly responsible for (i) identifying the functions to be executed as asynchronous parallel tasks, (ii) annotating them with a standard Python decorator, and (iii) determining the persistent objects.
An example application for this use case is the K-means algorithm, or a life science (or similar) application that uses the K-means algorithm in its analysis with persistent objects.
7.10.2 Components
• Task-based programming model and environment: PyCOMPSs
• Object store: dataClay
7.10.3 Data lifecycle
1. User requests the execution of the application with a set of parameters using COMPSs launching scripts.
   a. At this point, the executable can be on the external high performance filesystem or on the local filesystem in the login nodes.
   b. If a dataset is requested, it is stored on the external high performance filesystem.
   c. The user requests to use dataClay in combination with PyCOMPSs (this step may be optional for other applications).
2. COMPSs launching scripts create a job template which is submitted to the job scheduler.
3. The job scheduler allocates resources for PyCOMPSs.
4. Once the nodes that the job scheduler has allocated are available, COMPSs starts and configures the required processes within each node in order to set up a master-worker structure.
   a. The first node of the allocation will be considered the master and the rest of the nodes will be considered workers.
   b. If the user has requested to use dataClay, this deployment process also triggers the dataClay deployment before starting the user application execution.
5. When the entire deployment has finished, PyCOMPSs starts the user application execution on the assigned CCNs.
6. Dataset source:
   a. Option 1: the application creates a dataset and makes it persistent within dataClay;
   b. Option 2: the dataset is already within dataClay.
7. The application starts computing in order to achieve the result.
   a. The tasks are deployed across the CCNs.
   b. Each task can interact with dataClay if it has to read or write data.
8. The application finishes.
9. Data scheduler(s) can persist the object store to the external high performance filesystem if required.
   a. If not persisted, the object store is lost at the end of a job.
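The user's side of this model (annotating functions as tasks, as in the K-means example above) can be sketched as follows. A locally defined no-op decorator stands in for PyCOMPSs's @task decorator so that the example runs sequentially and self-contained; under PyCOMPSs the same annotated calls would become asynchronous parallel tasks scheduled across the worker nodes, and the fragment data could be persistent dataClay objects.

```python
def task(func):
    """Stand-in for the PyCOMPSs task decorator (executes synchronously here)."""
    return func

@task
def assign_fragment(points, centres):
    """One K-means fragment: assign each 1-D point to its nearest centre."""
    assignment = []
    for p in points:
        nearest = min(range(len(centres)), key=lambda i: abs(p - centres[i]))
        assignment.append(nearest)
    return assignment

# Each fragment of the dataset becomes one task deployed across the CCNs
# (step 7a); the runtime, not the user, decides where each task runs.
fragment = [0.1, 0.9, 2.1, 1.8]
print(assign_fragment(fragment, [0.0, 2.0]))   # [0, 0, 1, 1]
```

The division of labour matches the description: the user only marks the functions and the persistent objects, while COMPSs handles deployment, scheduling, and data movement.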
8 Conclusions
We have outlined the hardware and systemware architectures we have designed to exploit Storage Class Memory for HPC and HPDA applications. We have demonstrated that such an architecture has the potential to scale up to Exascale-class systems in the near future, with the potential for extremely high I/O performance and memory capacity.
Efficiently utilising SCM for computational simulation and data analytics applications will either require application modifications, to enable applications to directly target and manage SCM, or intelligent systemware to support applications using SCM for I/O, memory, and check-pointing.
We have described a range of systemware components that the NEXTGenIO project is developing and deploying to evaluate SCM for large scale applications. We will be evaluating this systemware, along with direct access applications and the raw performance of SCM, using a prototype HPC system that NEXTGenIO is developing.
9 References
[1] "3D XPoint™ Technology Revolutionizes Storage Memory", http://www.intel.com/content/www/us/en/architecture-and-technology/3D-XPoint™-technology-animation.html
[2] https://www.google.com/patents/US20150178204
[3] https://colfaxresearch.com/knl-mcdram/
[4] Intel Omni-Path Whitepaper from Hot Interconnects 2015: https://plan.seek.intel.com/LandingPage-IntelFabricsWebinarSeries-Omni-PathWhitePaper-3850
[5] pmem.io: http://pmem.io/
[6] https://www.intel.com/content/dam/www/public/us/en/documents/product-briefs/optane-ssd-dc-p4800x-brief.pdf
10 Appendix
The following pages contain sequence diagrams for the usage patterns described in Section 7.
Figure 17: General Purpose HPC system using multi-node SCM filesystem.