Lawrence Berkeley National Laboratory Recent Work

Title: Storage 2020: A Vision for the Future of HPC Storage
Permalink: https://escholarship.org/uc/item/744479dp
Authors: Lockwood, GK; Hazen, D; Koziol, Q; et al.
Publication Date: 2017-10-20
Peer reviewed
Storage 2020: A Vision for the Future of HPC Storage

Glenn K. Lockwood, Damian Hazen, Quincey Koziol, Shane Canon, Katie Antypas, Jan Balewski, Nicholas Balthaser, Wahid Bhimji, James Botts, Jeff Broughton, Tina L. Butler, Gregory F. Butler, Ravi Cheema, Christopher Daley, Tina Declerck, Lisa Gerhardt, Wayne E. Hurlbert, Kristy A. Kallback-Rose, Stephen Leak, Jason Lee, Rei Lee, Jialin Liu, Kirill Lozinskiy, David Paul, Prabhat, Cory Snavely, Jay Srinivasan, Tavia Stone Gibbins, Nicholas J. Wright

National Energy Research Scientific Computing Center
Lawrence Berkeley National Laboratory
Berkeley, CA 94720

Report No. LBNL-2001072

November 2017

This work was supported by the Director, Office of Science, Office of Advanced Scientific Computing Research of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
This document was prepared as an account of work sponsored by the United States Government. While this document is believed to contain correct information, neither the United States Government nor any agency thereof, nor the Regents of the University of California, nor any of their employees, makes any warranty, express or implied, or assumes any legal responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by its trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof, or the Regents of the University of California. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof or the Regents of the University of California.
Table of Contents

1. Introduction
2. NERSC Storage Hierarchy
   2.1. Current Storage Infrastructure at NERSC
   2.2. Workflow-based Model for Storage
3. Requirements
   3.1. Current I/O Patterns
   3.2. NERSC-9 Requirements
   3.3. DOE Exascale Requirements Reviews
   3.4. Emerging Applications and Use Cases
   3.5. Operational Requirements
      3.5.1. Reliability, Durability, Longevity, and Disaster Recovery
      3.5.2. Space management and curation features
      3.5.3. Availability
   3.6. Gaps and Challenges
      3.6.1. Tiering
      3.6.2. Data Movement
      3.6.3. Data Curation
      3.6.4. Workload Diversity
      3.6.5. Storage System Software
      3.6.6. Hardware Concerns
      3.6.7. POSIX and Middleware
4. Technology Landscape and Trends
   4.1. Hardware
      4.1.1. Magnetic Disk
      4.1.2. Solid-State Storage
      4.1.3. Storage Class Memory and Nonvolatile RAM
      4.1.4. Magnetic Tape
      4.1.5. Storage System Design
   4.2. Software
      4.2.1. Non-POSIX Storage System Software
      4.2.2. Application Interfaces and Middleware
5. Next Steps
   5.1. Vision for the Future
   5.2. Strategy
      5.2.1. Near Term (2017–2020)
      5.2.2. Long Term (2020–2025)
      5.2.3. Opportunities to Innovate and Contribute
6. Conclusion
Executive Summary

The explosive growth in data over the next five years that will accompany exascale simulations and new experimental detectors will enable new data-driven science across virtually every domain. At the same time, new nonvolatile storage technologies will enter the market in volume and upend long-held principles used to design the storage hierarchy. The disruption that these forces will bring to bear on high-performance computing (HPC) will also create significant opportunities to innovate and accelerate scientific discovery. To ensure that NERSC fully capitalizes on these opportunities, we have developed a comprehensive vision for the future of storage in HPC and identified short- and long-term strategic goals to effectively realize this vision. This report presents the results of this effort and offers a blueprint for designing a storage infrastructure for supporting HPC through 2025 and beyond.
At a high level, a broad survey of scientific workflows and user requirements reviews identified four logical tiers of data storage with different performance, capacity, shareability, and manageability requirements:

• Temporary storage, which contains data being actively used by simulation and data analysis applications over the course of hours to days.
• Campaign storage, which contains data being actively used by larger workflows and science projects over the course of weeks to months.
• Community storage, which contains larger datasets that are shared among different projects within a scientific community over the course of years.
• Forever storage, which contains high-value or irreplaceable datasets indefinitely.
These four tiers do not neatly map to the physical storage hierarchy deployed at NERSC today, but over the next several years, NERSC will use tactical deployments to closely align storage resources with these requirements. By 2020, our aim is to accommodate Temporary storage data and much of the Campaign storage data on a single, flash-based storage system that is tightly integrated with the NERSC-9 compute platform that will be deployed that year. Simultaneously, the disk-based Community and tape-based Forever tiers will be more closely coupled and will provide a single, seamless user interface that will simplify the management of long-lived data for both users and center staff. These tiers will be implemented off-platform to enable them to grow in response to user needs and persist beyond the lifetime of the NERSC-9 compute system.

By 2025, the nonvolatile media underpinning the converged Temporary/Campaign storage tier will expose extreme performance and scalability through a high-performance object interface. Users who want a familiar POSIX file system interface to access data on this system will use POSIX middleware that provides compatibility at the cost of performance. Similarly, the off-platform Community/Forever tiers will converge into a single mass storage system by 2025, and data access will occur through industry-standard object storage interfaces that more naturally map to the use patterns of long-lived data. Today's file system interfaces and custom HPSS client software will be alternate access modes, but the underlying storage system will transparently combine the economics of tape and the accessibility of disk into one seamless data repository.

The transition from file systems to object stores as exascale becomes widespread in 2025 will require users to change their applications or adopt I/O middleware that abstracts away the interface changes. Ensuring that users, applications, and workflows will be ready for this transition will require immediate investment in testbeds that incorporate both new nonvolatile storage technologies and advanced object storage software systems that effectively use them. These testbeds will also provide a foundation on
which a new class of data management tools can be built to leverage the flexibility of user-defined object-level metadata.
As the DOE Office of Science's mission computing facility, NERSC will follow this roadmap and deploy these new storage technologies to continue delivering storage resources that meet the needs of its broad user community. NERSC's diversity of workflows encompasses significant portions of open science workloads as well, and the findings presented in this report are also intended to be a blueprint for how the evolving storage landscape can be best utilized by the greater HPC community. Executing the strategy presented here will ensure that emerging I/O technologies will be both applicable to and effective in enabling scientific discovery through extreme-scale simulation and data analysis in the coming decade.
1. Introduction

The National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory is the mission scientific computing facility for the Office of Science (SC) in the U.S. Department of Energy (DOE). As one of the largest facilities in the world devoted to providing computational resources and expertise for basic scientific research, NERSC is a world leader in accelerating scientific discovery through high performance computing (HPC) and data analysis. Storage systems play a critical role in supporting NERSC's mission by enabling the retention and dissemination of science data used and produced at the center. Over the past 10 years, the total volume of data stored at NERSC has increased from 3.5 PiB to 146 PiB and continues to grow at an annual rate of 30%, driven by a 1000x increase in system performance and a 100x increase in system memory. In addition, there has been dramatic growth in experimental and observational data, and experimental facilities such as the Large Synoptic Survey Telescope (LSST)1 and Linac Coherent Light Source (LCLS)2 are increasingly turning to NERSC to meet their data analysis and storage requirements.
As these data requirements continue to grow, the technologies underpinning traditional storage in HPC are rapidly transforming. Solid-state drives are now being integrated into HPC systems as a new tier of high-performance storage, shifting the role of magnetic disk media away from performance, and tape revenues are on a slow decline. Economic drivers coming from cloud and hyperscale datacenter providers are altering the mass storage ecosystem as well, rapidly advancing the state of the art in object-based storage systems over POSIX-based parallel file systems. In addition to these changing tides, non-volatile storage-class memory (SCM) is emerging as an extremely high-performance, low-latency medium whose role in the storage hierarchy remains the subject of intense research. The combination of these factors broadens the design space of future storage systems, creating new opportunities for innovation while simultaneously introducing new uncertainties.

To clarify how the evolving storage requirements of the NERSC user community can be best met given the storage technology landscape over the next ten years, we present here a detailed analysis of NERSC users' data requirements and relevant hardware, middleware, and software technologies and trends. From this we propose a reference storage architecture that addresses the increasing data demands from external experimental facilities, data science, and other emerging workloads while continuing to support the needs of traditional HPC users. We enumerate the requirements of longer-term storage resources that enable publication, collaboration, and curation over multiple years.

We lay out a roadmap for the center to deploy storage resources that best serve NERSC users in 2020 and identify the actions required to realize this strategy. We then describe the evolution of storage systems beyond 2020 and how advances in storage hardware and innovation within DOE and in industry will impact our long-term storage strategy through 2025. With this roadmap and long-term strategy, we identify areas where NERSC is positioned to provide leadership in storage in the coming decade to ensure our users are able to make the most productive use of all relevant storage technologies. Because of the NERSC workload's diversity across scientific domains, this analysis and the
1 Ivezić, Z. et al. 2011. Large Synoptic Survey Telescope (LSST) Science Requirements Document. https://docushare.lsst.org/docushare/dsweb/Get/LPM-17. Accessed September 11, 2017.
2 2016. LCLS Data Analysis Strategy. https://portal.slac.stanford.edu/sites/lcls_public/Documents/LCLSDataAnalysisStrategy.pdf. Accessed September 11, 2017.
reference storage architecture should be relevant to HPC storage planning outside of NERSC and the DOE.
2. NERSC Storage Hierarchy

NERSC has more than 6,000 active users with more than 700 active projects that span a broad range of science disciplines, such as materials science, astrophysics, bioinformatics, and climate science. The diversity of workflows at NERSC results in a wide range of I/O patterns, data volumes, and retention requirements; for example, a number of projects use data from experimental and observational facilities as part of their workflow and need high-capacity storage at NERSC to ingest observational data that is transferred over the wide-area network. A growing number of projects also combine modeling and simulation with experimental or observational data, which is increasing the complexity of workflows and the demand for storage resources accessible from both extreme-scale compute systems and the wide-area network. To meet these diverse needs, NERSC maintains different tiers of storage, each optimized for a different balance of performance, capacity, and manageability.
2.1. Current Storage Infrastructure at NERSC

As of 2017, the NERSC storage hierarchy consists of a 1.6 PiB flash-based burst buffer, a 27 PiB Lustre scratch file system built using hard disk drives (HDDs), a 10.7 PiB disk-based project file system that provides medium-term storage, and a 130 PiB enterprise tape-based archive. These tiers, depicted schematically in Figure 1, vary in capacity, performance, reliability, and data management policies.
Figure 1. Storage hierarchy at NERSC in 2017.
The top two tiers (burst buffer and scratch) are optimized for performance and provide sufficient capacity to support typical active workloads in the system. These storage systems are either actively purged or require users to request resources as part of their job. They are advertised as scratch space and managed as more volatile and less robust resources, and users are encouraged to save critical data and results to the other tiers. The disk-based scratch tier is currently implemented using the Lustre parallel file system, and the burst buffer currently uses Cray's DataWarp file system and infrastructure.
The project and archive tiers are optimized for capacity and durability, but they still provide sufficient performance to allow users to move data effectively in and out. These tiers are not actively purged but are instead managed via quotas. The project tier is disk-based and runs IBM's Spectrum Scale parallel file system (previously known as GPFS), while the archive tier uses a combination of disk and tape that is managed by the HPSS software, developed by a collaboration between DOE labs and IBM.
Reliability and manageability are major concerns for the project and archive tiers since they are often the repositories for users' most critical data. Data stored in these systems are critical to supporting the scientific process itself, since scientific results must be maintained for long periods of time and are often shared throughout the community via data portals3 associated with these storage systems. Consequently, the storage software technologies used for these tiers must be highly robust. These tiers must also be able to grow over time to allow external projects to sponsor additional space to meet mission or science requirements. For example, various experimental projects such as STAR4 and ALICE,5 along with experimental facilities such as the ALS6 and JGI,7 have augmented NERSC's project file system to store their data. This contrasts sharply with the burst buffer and scratch tiers, which are typically designed specifically to meet the needs of the computational platform with which they are procured.
2.2. Workflow-based Model for Storage

In preparation for NERSC's next major system, to be deployed in 2020, and as part of the Alliance for Application Performance at Extreme Scale (APEX),8 the NERSC division of Lawrence Berkeley National Laboratory, Los Alamos National Laboratory (LANL), and Sandia National Laboratory (SNL) surveyed their users' scientific workflows to inform the technical requirements for the procurement of the NERSC-9 and Crossroads systems. The results of this analysis, summarized in the APEX Workflows white paper,9 present the data movement between different stages of workflows as workflow diagrams to help reason about system architecture; an example of such a diagram is shown in Figure 2. The vertical axis captures the required retention time for the data inputs and outputs and is a major contributor to storage system capacity requirements. The vertical axis also speaks to the performance requirements of each tier, as data that is generated (and deleted) more frequently will require higher performance than those data products that are generated much less frequently.
3 ALS Data and Simulation Portal. https://spot.nersc.gov/. Accessed September 4, 2017.
4 Adams, J. et al. 2005. Experimental and theoretical challenges in the search for the quark–gluon plasma: The STAR Collaboration's critical assessment of the evidence from RHIC collisions. Nuclear Physics A. 757, 1–2 (Aug. 2005), 102–183.
5 Aamodt, K. et al. 2008. The ALICE experiment at the CERN LHC. Journal of Instrumentation. 3, 8 (Aug. 2008), S08002.
6 Advanced Light Source. https://als.lbl.gov/. Accessed September 3, 2017.
7 DOE Joint Genome Institute: A DOE Office of Science User Facility of Lawrence Berkeley National Laboratory. https://jgi.doe.gov/. Accessed September 3, 2017.
8 Alliance for Application Performance at Extreme Scale. http://www.lanl.gov/projects/apex/. Accessed April 30, 2017.
9 APEX Workflows. http://www.nersc.gov/assets/apex-workflows-v2.pdf. Accessed April 30, 2017.
Figure 2. Data motion and retention in an archetypal simulation science pipeline. From the APEX Workflows white paper.10
Overall, this study found commonality across DOE in compute and storage requirements, and it presented a taxonomy of workflows' storage requirements in the form of three logical storage tiers: Temporary, Campaign, and Forever:

• Temporary storage, used for the duration of a single workflow instance, stores and delivers working sets, checkpoints, and job outputs. It is the highest performing storage resource, and as such is typically tightly coupled to the compute system.
• Campaign storage, used for the duration of a project or allocation, enables collaboration within a group of researchers, provides space for postprocessing and input sets for subsequent runs, and facilitates data curation for later publication or movement to longer-term storage. It requires greater capacity but less performance than the Temporary storage tier.
• Forever storage, used for long-term storage, acts as a repository for high-value data that is irreplaceable or prohibitively expensive to reproduce. It will contain raw datasets, often too large to store in other resources, and may also store golden datasets that are of wider value to scientific communities. Its performance requirements are lower than Campaign storage's, but it must be able to reliably hold years' or decades' worth of data.
In addition to these three tiers formalized in the APEX Workflows document, there are additional design criteria that are critical to NERSC's users: the ability to ingest and store data from remote instruments, the availability of access controls for publishing and sharing, and the ability to efficiently index, search, and describe datasets. Thus, we also identify a fourth resource, Community storage, that is optimized
10 2016. APEX Workflows Whitepaper. http://www.nersc.gov/assets/apex-workflows-v2.pdf. Accessed April 30, 2017.
to ingest data from experimental and observational facilities, share data with researchers at other centers, and facilitate the curation of data.
Figure 3 summarizes the functionality of these four logical tiers in terms of their balance of capacity and performance and how much optimization is invested in making their contents searchable, shareable, and otherwise easily curated.
Figure 3. Functional view of storage tiers.
While Figure 3 depicts a functional view of storage, Figure 4 shows how the functional model maps to the NERSC resources shown in Figure 1.
Figure 4. Mapping between the functional model and actual storage resources available at NERSC.
As is clear in this diagram, the storage resources provided by NERSC today do not precisely align with the four logical tiers we have identified. However, with the understanding that four logical tiers need not necessarily map to four physical storage resources, this model serves as a sound approach to defining the design optima and goals for future physical storage resources.
3. Requirements

As previously indicated, the NERSC workload is evolving as a result of a variety of scientific and technological changes. To ensure that future compute and storage resources will meet these evolving needs, we draw on a variety of requirements studies that include current workloads, the APEX Workflows white paper,11 the DOE Exascale Requirements Reviews,12 and NERSC staff experiences.
3.1. Current I/O Patterns

Examining current user and application I/O behavior targeting scratch file systems (the Temporary storage tier) at NERSC shows that the volume of data read from and written to these scratch file systems is approximately equal, as shown in Figure 5. This is likely due to a balance between checkpoint-heavy workloads (many write-heavy checkpoint operations for each read-heavy restart operation), common experimental and simulation datasets being re-read multiple times, and write-once, read-once intermediate files generated by scientific workflows, as noted in Figure 2.
Figure 5. Weekly I/O read and write volumes on NERSC Edison's scratch1 and scratch2 Lustre file systems. The overall annual average read/write ratio is 11/9.
This analysis indicates that Temporary storage needs to provide balanced read and write capabilities and that storage media, APIs, or access semantics that emphasize one over the other would not be
11 APEX Workflows. http://www.nersc.gov/assets/apex-workflows-v2.pdf. Accessed April 30, 2017.
12 DOE Exascale Requirements Review. http://www.exascaleage.org/. Accessed August 31, 2017.
suitable for the NERSC workload. In addition, the Temporary and Campaign storage tiers should be strongly coupled to streamline data motion of hot datasets between the working space and a storage resource that facilitates data management over the course of the larger scientific study.
As shown in Figure 6, NERSC applications also use a variety of POSIX metadata calls within Temporary and Campaign storage systems, with the vast majority being opens, closes, and stats. It is therefore essential that the Temporary storage resource's system software implement these calls in a highly scalable fashion; for example, calculating the size of a file that is striped across hundreds of storage servers must be efficient, and allowing users to obtain file handles by which they can access their stored data must incur minimal latency.
Figure 6. Distribution of metadata operation counts on NERSC Edison's scratch1 and scratch2 Lustre file systems from June 2016 to June 2017.
Intuitively, accesses to Forever storage should skew toward writes, but this is not pronounced at NERSC; 24% of data written to the archive is recalled at some point. In fact, some archived data shows a high skew toward reads as a result of science communities continually accessing large datasets. The net result is that NERSC's archive read-to-write ratio is remarkably balanced, with reads accounting for 40% of system I/O. Given that NERSC's Forever tier is magnetic tape, and tape reads are more difficult to manage and are slower due to volume mount and seek latencies on linear access media, we conclude that this is a result born out of necessity rather than best use of the system or the desires of users. Re-reads would be better served from lower-latency Campaign or Community storage layers if capacity allowed.
With sufficiently sized Community and Campaign storage tiers, Forever storage should be optimized for high-performance write capabilities rather than read performance, as read duty is mainly fulfilled by Community and Campaign storage resources. However, as shown in Figure 4 (which depicts the reality at NERSC), this is a system design point rather than a statement of how the current storage systems work. The discrepancy is driven in large part by the fact that tape is still the most cost-effective mass storage medium on a dollars-per-bit basis.
The coupling between Forever and Community storage can be looser than that between Temporary and Campaign, as the data in Forever and Community space is principally static. Community storage should be sized such
that data does not migrate frequently to and from Forever storage. Because of the difficulty of interacting with tape, Community storage needs to be large enough that it effectively eliminates repeated re-reads from the Forever tier. When evaluating effective technologies, it helps that POSIX I/O operations are much simpler in the Community and Forever space: they are mainly composed of put/get/stat on whole files, with other operations to create and maintain directory hierarchies and very little else. Such a write-once, read-many (WORM) workload is an area where inexpensive capacity storage systems without full POSIX compliance could be deployed; for example, the object storage systems used extensively in the cloud and hyperscale markets are specifically optimized for WORM I/O.
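The whole-file put/get/stat pattern described above maps naturally onto an object interface. The following sketch is purely illustrative (the class, its method names, and its write-once policy are our own invention, not any vendor's API); it shows how a WORM contract can be enforced at the interface level rather than with POSIX locking machinery:

```python
# Illustrative sketch of a write-once, read-many (WORM) object interface.
# The class and method names are hypothetical, not any product's API.

class WormObjectStore:
    def __init__(self):
        self._objects = {}  # object name -> immutable bytes

    def put(self, name: str, data: bytes) -> None:
        """Store a whole object exactly once; overwrites are rejected."""
        if name in self._objects:
            raise PermissionError(f"{name} already exists (WORM policy)")
        self._objects[name] = bytes(data)

    def get(self, name: str) -> bytes:
        """Retrieve a whole object; partial in-place updates do not exist."""
        return self._objects[name]

    def stat(self, name: str) -> dict:
        """Return lightweight metadata without transferring the object."""
        return {"name": name, "size": len(self._objects[name])}


store = WormObjectStore()
store.put("run42/checkpoint.h5", b"\x00" * 1024)
print(store.stat("run42/checkpoint.h5")["size"])  # 1024
```

Because objects are immutable once written, a system with this contract never needs the byte-range consistency and locking semantics that full POSIX requires, which is a large part of why capacity-optimized object stores can be inexpensive.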
As science teams move from using small numbers of applications in their research to more complex interactions between many applications, scientific workflows are expected to become the dominant mode of operation at NERSC. The compute concurrency of these workflows is diverse and can be extremely low for image or other instrument-analysis workflows. These data-oriented workflows are anticipated to grow more in throughput than in problem size by 2020, and because many constituent applications do not strong-scale well, the increased concurrency of NERSC's future systems will be utilized by bundling multiple workflow pipelines into a single job.13 Unlike the scaling behavior of traditional simulation science applications, this will demand scalable metadata performance from the storage system as each node processes larger numbers of files concurrently.14
Figure 7. Percentage of data generated by NERSC workflows that will be retained in Forever storage.
A key finding of the APEX Workflows study was that NERSC users want to save a significant fraction of the data files used and produced by their workflows for a long time, perhaps indefinitely. Figure 7 shows the percentage of I/O generated by the surveyed NERSC workflows that is saved forever. Even if users
13 Daley, C.S. et al. 2015. Analyses of Scientific Workflows for Effective Use of Future Architectures. Proceedings of the 6th International Workshop on Big Data Analytics: Challenges, and Opportunities (BDAC-15) (Austin, TX, 2015).
14 Daley, C.S. et al. 2016. Performance Characterization of Scientific Workflows for the Optimal Use of Burst Buffers. Proceedings of the Workshop on Workflows in Support of Large-Scale Science (WORKS 2016) (Salt Lake City, 2016), 69–73.
are able to make use of in-situ or in-transit analytics to reduce data movement during workflow execution, a large fraction of the generated data is irreducible and must be retained long-term.
Thus, in-flight analytics are not magic bullets that can be relied upon to stem the increasing volumes of data being generated by scientific workflows, and we are rapidly approaching the need for O(exabyte) capacity storage unless NERSC users re-architect their workflows to save less data. Extrapolating the historic 45% annual growth rate of NERSC's current archive system alone predicts 1 exabyte of user data by 2022. Given the aforementioned observation that NERSC users are currently using the archive as both Community and Forever storage, effectively balancing the capacity of Community storage relative to Forever storage indicates the need for hundreds of petabytes of capacity in the Community storage tier by 2023.
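As a back-of-envelope check on that extrapolation, compound growth from the figures quoted earlier in this report (a roughly 130 PiB archive in 2017, growing 45% per year) puts the archive on the order of an exabyte in the 2022-2023 timeframe. This is our own sketch of the arithmetic, not an official projection:

```python
# Compound-growth extrapolation of the NERSC archive using the report's
# figures: ~130 PiB stored in 2017, growing ~45% per year.
def projected_archive_pib(year, base_pib=130.0, base_year=2017, growth=0.45):
    """Projected archive capacity in PiB for a given calendar year."""
    return base_pib * (1.0 + growth) ** (year - base_year)

for year in range(2017, 2024):
    print(year, f"{projected_archive_pib(year):7.0f} PiB")
# The projection reaches roughly 830 PiB by 2022 and crosses
# 1 EiB (1024 PiB) during 2023, consistent with the report's
# order-of-magnitude estimate of an exabyte of user data.
```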
The findings presented above indicate two corollaries:

• Campaign storage is, in a sense, "cold" Temporary storage, and Community storage is "hot" Forever storage.
• The data stored in Temporary/Campaign storage serves the goals of individual research projects and their users, while data in Community/Forever storage may be of interest to broader scientific communities and many research projects.

These suggest a broad dichotomy between Temporary/Campaign storage and Community/Forever storage in both their data retention times and the breadth of users they serve. It follows that Temporary/Campaign storage is best implemented close to specific compute systems to emphasize high-performance analysis and access by a small cohort of users. Conversely, Community/Forever storage is best maintained closer to the wide-area network and more centrally within a facility to emphasize sharing and broad access by larger user communities.
From these user requirements, several key design criteria become apparent. The Temporary and Campaign tiers should be closely coupled and provide balanced read/write performance and scalable metadata to support the NERSC workload. The Community and Forever tiers do not need such tight coupling, but they should be sized such that most read activity targets data that is stored in the Community tier rather than Forever storage. This would allow Community storage to make use of technologies optimal for WORM workloads, leaving Forever storage for highly valuable but cold data.
3.2. NERSC-9 Requirements

In 2020, NERSC plans to deploy its NERSC-9 system, which is targeted to increase the processing capability of the center by 4-5x over the NERSC-8 system, Cori. With the potential for dramatic data growth as emerging areas in the data sciences mature, this increase in computing capability is expected to be accompanied by at least a proportional increase in the rate and volume of data generation within NERSC. The NERSC-9 system will include platform storage that is explicitly designed to:
"[retain]allapplicationinput,output,andworkingdatafor12weeks(84days),estimatedata
minimumof36%ofbaselinesystemmemory[3PiB]perday."15
15 APEX 2020 Technical Requirements Document for Crossroads and NERSC-9 Systems. http://www.lanl.gov/projects/apex/request-for-proposal.php. Accessed April 30, 2017.
as well as deliver sufficient performance to absorb checkpointing. The technical requirements for the NERSC-9 system were specified such that platform-integrated storage will fulfill the role of Temporary storage and a portion of Campaign.
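Taken at face value, the quoted requirement implies a sizable capacity floor for this platform storage. The arithmetic below is our own back-of-envelope reading of the requirement, assuming the full 84-day retention window at the stated minimum daily ingest, not a figure from the procurement itself:

```python
# Rough lower bound on NERSC-9 platform storage capacity implied by the
# quoted APEX requirement (our own arithmetic, not a procurement figure).
retention_days = 84        # 12 weeks of retained application data
daily_ingest_pib = 3.0     # 36% of baseline system memory per day (~3 PiB)

min_capacity_pib = retention_days * daily_ingest_pib
print(f"{min_capacity_pib:.0f} PiB")  # 252 PiB
```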
While the utility and capability of this platform storage will be well-defined in the 2020 timeframe, it is designed to retain data for only 84 days. Therefore, users and projects that wish to retain data long-term must store it on alternate, longer-term storage resources that fall outside of the NERSC-9 procurement. However, in the NERSC-9 storage technical requirements, vendors were given the flexibility to respond with innovative solutions surrounding features that are more relevant to longer-term data management, including background data integrity verification, detailed monitoring of storage performance and utilization, fast metadata traversal, and connectivity to external file systems and other data sources. As a result, the NERSC-9 procurement could be used as a vehicle for procuring and satisfying the requirements of the Campaign, Community, and Forever storage tiers as well.
3.3. DOE Exascale Requirements Reviews

The DOE Advanced Scientific Computing Research (ASCR) program has conducted a number of requirements-gathering efforts with other DOE SC programs to ensure that the exascale systems to be fielded in 2021-2023 are aligned with the mission needs of each DOE SC program office. These efforts build on a long history of engagement with the scientific community that helps drive future system requirements and architectures, going back to NERSC's Greenbook¹⁶ review in 2002 and extending to the recent DOE Exascale Requirements Reviews.¹⁷ The output of these efforts directs the planning and acquisition strategies for NERSC, the Leadership Computing Facilities, and ESnet.
These comprehensive reports span a broad range of areas, including computational requirements, software and middleware needs, networking, data management, and data analysis. Some of the common data and storage requirements that emerged from those efforts that are relevant to NERSC's storage strategy are as follows:
1. Many of the program offices anticipate exabyte-scale storage needs in the coming decade, with many projects generating and processing hundreds of terabytes of data today and projecting 10-50x growth during that decade. Multiple projects are predicting 100-petabyte or greater datasets in the 2025 timeframe. These use cases underline the need for cost-effective, capacity-optimized Community and Forever storage.
2. There is an increasing need to integrate observational and simulation data in workflows that require data to be co-located for effective analysis. This is, in part, a direct result of typical observational and simulation results now surpassing the analysis capabilities of computing systems at users' home institutions. This will drive the need to improve data movement tools, increase storage capacity, and provide high-bandwidth, wide-area networking connectivity. This speaks to the need for effective integration between all storage tiers to minimize the complexity of data movement during workflows.
3. Data management needs to extend beyond NERSC to the wide-area network, as other compute and experimental facilities integrate more closely with NERSC. External connectivity requirements are also being driven by a growing demand to share common, curated datasets with the wider community, driving the need for a robust Community storage resource.
16. Greenbook – Needs and Directions in High-Performance Computing for the Office of Science. https://www.nersc.gov/assets/For-Users/DOEGreenbook.pdf. Accessed April 27, 2017.
17. DOE Exascale Requirements Review. http://www.exascaleage.org/. Accessed August 31, 2017.
4. Users have a strong need for integrated data tracking and provenance within the storage system. This includes expanded capabilities around metadata storage, searching and querying, and event triggering. These are features that are principal to the Campaign and Community storage tiers.
5. There is a transition from individual large-scale simulations toward ensembles, uncertainty quantification, and more complex workflows that must connect and integrate simulation and analysis. This shift towards ensemble workflows will require that Campaign storage simplify data management across large projects and the other tiers.
6. The dramatic growth in data storage demands is accompanied by a desire to apply new forms of data analysis and analytics, including machine learning, to effectively process the massive amounts of data resulting from experimental sources, extreme-scale simulation, and uncertainty quantification. This aligns with the observation at NERSC that Temporary storage must deliver balanced read and write performance.
All divisions within DOE SC anticipate that the dramatic increase in their computational requirements will drive similarly dramatic increases in their data storage and management requirements. Simply providing high-capacity, high-bandwidth storage will no longer satisfy the broad range of requirements that arise from the aforementioned shift toward workflow-oriented processing and experimental analysis. Rather, future storage systems will have to deliver low latency (high IOPS), rich metadata facilities, and external connectivity, in addition to high parallel I/O bandwidth. These user requirements reinforce the need to treat storage infrastructure design as a multi-dimensional problem and support the approach described in Section 2.2.
3.4. Emerging Applications and Use Cases

A growing number of domain sciences need to leverage the capabilities of HPC systems, yet have data requirements that contrast with those of traditional HPC workloads. Many of these emerging data workloads are driven by machine learning and other data analytics techniques that rely on workflow frameworks (e.g., Apache Spark), analytics packages (e.g., Caffe and TensorFlow), and domain-specific libraries that traditionally have not been used in HPC. These analysis tools often exhibit I/O patterns that perform poorly on HPC systems as a result of their genesis in cloud environments, and while individual analytics tools can be refactored for use on HPC systems, the field of data science is evolving rapidly and independently of the HPC community. The next set of popular tools may exhibit the same deleterious I/O behavior and poor out-of-box performance, and they will need to be adapted to HPC environments because of their prioritization of productivity and their momentum in the larger data analytics community.
Many of these emerging application areas are associated with observational and experimental facilities that are already generating large volumes of data, and, as highlighted in Section 3.3, their projected growth rates are staggering. For example, NERSC is collaborating with the Linac Coherent Light Source (LCLS) to enable real-time analysis of data generated by high-speed, high-resolution instruments. These instruments currently generate hundreds of megabytes per second of data but are projected to generate tens to hundreds of gigabytes per second of data with future upgrades. Instruments at the National Center for Electron Microscopy,¹⁸ the Advanced Photon Source,¹⁹ the Spallation Neutron Source,²⁰ and elsewhere project similar increases. These facilities also often run 24x7 for months at a time, so availability and reliability of the compute, storage, and network resources supporting these workflows is critical. Given the fact that researchers are often allocated very limited time on these instruments, providing continuity of storage and computing resources, even through system maintenance periods, is important.

18. National Center for Electron Microscopy (NCEM). http://foundry.lbl.gov/facilities/ncem/. Accessed September 11, 2017.
19. Advanced Photon Source. https://www1.aps.anl.gov/. Accessed September 11, 2017.
20. Spallation Neutron Source. https://neutrons.ornl.gov/sns. Accessed September 11, 2017.
Direct interactions between NERSC staff and the staff and users from a number of experimental facilities and projects have revealed several key storage requirements. There will be a need to transfer hundreds of GB/sec from the wide-area network directly to a durable storage resource such as Campaign or Forever storage in a reliable way. This translates to a need for high availability and accessibility of data on these tiers through maintenance, software upgrades, and storage expansion. Furthermore, predictable I/O performance for both data and metadata accesses is critical for co-scheduling experimental and computational resources, and providing quality-of-service controls is highly desirable across all storage tiers.
3.5. Operational Requirements

User requirements reviews and other surveys define many design criteria for the storage system architecture such as I/O performance and data manageability, but operational considerations and data lifecycle management needs give rise to additional requirements that are not directly user-facing. These operational requirements are especially critical for the Community and Forever storage resources, which will retain long-lived data. Data on these resources will routinely outlast the four- to five-year lifespan of individual compute platforms and must be available across all compute systems and accompanying edge services at the center.
As discussed in Section 2.1, the role of Community storage at NERSC is currently fulfilled by the project file system, which has been in existence for more than 10 years. Forever storage is fulfilled by the HPSS-based archive and has been managed for more than 20 years. Dozens of NERSC staff have accumulated hundreds of years of direct experience managing long-lived HPC storage systems, contributing to community best practices and working with peers at other DOE HPC facilities. They have identified critical attributes needed to maintain and run these systems effectively. These operational requirements can be organized into three general categories, described in Sections 3.5.1-3.5.3.
3.5.1. Reliability, Durability, Longevity, and Disaster Recovery

Because Community and Forever storage are expressly designed to store valuable data, ensuring that the data is highly resistant to corruption, available even in the presence of component failures, and can be quickly restored in the event of a disaster is paramount. Although virtually all mass storage systems make assurances about these features, it is important to note the effort required by storage system operators to exercise these features in practice. This effort has a direct effect on the staffing levels required to support the storage system as it increases in capacity and may be of critical importance to ensure the minimal downtime during outages required by the emerging applications and use cases discussed in Section 3.4.
Required features include:
• Highly durable hardware and software. For the archive, tape media has offered not only cost-effective capacity but additional durability assurance because the data is offline. This makes it far less prone to data corruption due to software error, as evidenced by a 2011 software-induced disaster at a leading hyperscale provider.²¹
• High degree of reliability and integrity for data at rest and in motion. This may be addressed by mechanisms like T10 DIF and data checksumming and is critical to preventing silent data corruption, as evidenced by a 2013 hardware-related data corruption issue within Internet2.²²
• Ability to shrink, grow, and migrate data "live," as capacity is increased or reconfigured. This is an essential feature for repacking old data to new, higher-capacity media. It also enables NERSC to allow large experiments and other data-intensive users to purchase additional storage to be co-located with their compute resources.
• Ability to mount storage resources across different compute and login systems and over tens or hundreds of thousands of client nodes. This is important for all tiers but particularly essential for the Campaign and Community storage tiers, which must interface with a diversity of environments to ingest experimental data and share datasets.
• Flexible support for a variety of high-performance networks. This allows the storage to continue to be compatible as the center's network and compute technologies evolve with changing user requirements and emerging technologies.
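The data-integrity requirement above can be illustrated with a minimal sketch of write-time checksumming and detect-on-read verification. This is not how HPSS or T10 DIF is implemented; the function names and sidecar-file scheme are hypothetical, chosen only to show the detect-on-read idea.

```python
# Minimal sketch: record a digest when data is written, verify it on read so
# that silent corruption is detected rather than silently propagated.
import hashlib
import json

def write_with_checksum(path, data):
    """Store data alongside a SHA-256 digest recorded at write time."""
    with open(path, "wb") as f:
        f.write(data)
    digest = hashlib.sha256(data).hexdigest()
    with open(path + ".sha256", "w") as f:
        json.dump({"sha256": digest}, f)

def read_and_verify(path):
    """Re-read the data and fail loudly if it no longer matches its digest."""
    with open(path, "rb") as f:
        data = f.read()
    with open(path + ".sha256") as f:
        expected = json.load(f)["sha256"]
    if hashlib.sha256(data).hexdigest() != expected:
        raise IOError(f"silent corruption detected in {path}")
    return data
```

Production systems perform this per block inside the storage stack (and, for T10 DIF, in the drive itself) rather than per file in user space, but the end-to-end verify-on-read principle is the same.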
3.5.2. Space Management and Curation Features

Effectively managing storage resource utilization reduces storage costs and improves quality of service. While management features such as supporting user- and group-level quotas are supported by virtually all storage systems, it can be an inflexible and opaque approach if users do not have the ability to determine what data they have. Giving users and administrators the ability to determine which datasets are consuming the most space and where these large datasets are located simplifies their data management overhead. Required space management and curation features include:
• Flexible methods to track usage and to specify and enforce limits (e.g., user quotas, tree quotas, etc.). This allows users and operators to make more informed decisions about which data can or should be deleted to ensure fair share of storage resources.
• Methods to quickly walk the storage resource namespace. In addition to helping inform space management decisions, understanding the distribution of file or object sizes, access frequencies, and other metadata informs policy decisions and system performance optimization.
• Ability to manage hardware that has different characteristics (bandwidth, capacity, IOPS) within the same system. This allows the storage system to grow along independent dimensions (e.g., performance and capacity) and is of increasing importance with emerging NAND and SCM media.
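The namespace-walk requirement above can be sketched as a small survey that bins files by size to show how capacity is distributed. A production tool would scan file system metadata servers directly rather than using `os.walk`; the `size_histogram` helper below is hypothetical and purely illustrative.

```python
# Sketch: summarize the distribution of file sizes under a directory tree by
# power-of-two bucket, the kind of summary that informs space management and
# policy decisions (e.g., "most capacity is in a few huge files").
import math
import os
from collections import Counter

def size_histogram(root):
    """Count files by power-of-two size bucket (bucket n covers [2^(n-1), 2^n))."""
    buckets = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                size = os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                continue  # file vanished mid-walk; skip it
            bucket = 0 if size == 0 else int(math.log2(size)) + 1
            buckets[bucket] += 1
    return buckets
```

At the scale of billions of files, the serial walk shown here is exactly the bottleneck the bullet above calls out, which is why fast metadata traversal appears as an explicit requirement.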
3.5.3. Availability

Maintaining the highest possible availability of storage resources is essential to operating a supercomputing center; an entire center can be rendered offline if its storage systems are offline. Furthermore, the need to maintain extreme availability and minimize maintenance outages only becomes greater as experimental facilities become coupled to HPC facilities; as described in Section 3.4, storage system downtime can severely impact the ability of a user of an experimental facility to do research.

21. Treynor, B. 2011. Gmail back soon for everyone. https://gmail.googleblog.com/2011/02/gmail-back-soon-for-everyone.html. Accessed September 4, 2017.
22. Foster, I. 2013. Globus Online ensures research data integrity. https://www.globus.org/blog/globus-online-ensures-research-data-integrity. Accessed September 4, 2017.

As such, we have identified the following operational requirements to ensure maximum availability:
• Strong support for live updates, rolling upgrades, live configuration changes, etc. This minimizes the need to take the system offline, especially for extended periods of time, and speaks directly to the requirement of maintaining high availability during maintenance.
• Support for centralized management and monitoring. This improves operational efficiency and reduces downtime by decreasing the amount of effort required for storage engineers to manage multiple tiers of highly distributed storage.
• Ability to recover cleanly from faults or failures with minimal cleanup and manual intervention. As with previous operational requirements, this is directly tied to reducing downtime and staffing requirements.
3.6. Gaps and Challenges

While the current storage hierarchy described in Section 2 has served NERSC well, contrasting it with the requirements stated in this section reveals some shortcomings in its overall architecture, the deployed technology, and its ease of use. If these gaps are not addressed, they will be further aggravated by technology trends and emerging user needs.
3.6.1. Tiering

The number of layers in the hierarchy is driven by cost optimizations to provide fast, high-performance storage to support running simulations and analysis (Temporary storage); high capacity to support longer-term projects (Campaign/Community storage); and archiving data to support the scientific process (Community/Forever storage). Tiered storage adds complexity for users and staff, and the lack of automated data movement between tiers is a significant burden to NERSC users. Each layer of the storage hierarchy is a complex, independent system that requires expertise to manage, and collapsing tiers would simplify storage administration for NERSC and reduce data management complexity for users.
3.6.2. Data Movement

At present, moving data between NERSC's Temporary and Campaign/Community storage tiers is relatively frictionless, as they both provide a POSIX file system interface. Movement in and out of Forever storage is more challenging because it requires users to interact with custom client software similar to FTP or UNIX tar. The fact that data resides on tape (which introduces volume mount latencies that may span several minutes and linear read or write access restrictions), plus the fact that data may be scattered over many different tape cartridges, adds to the difficulty. Providing a common interface for all tiers, whether it be file-based or object-based, would streamline data movement and simplify the task of building more productive user interfaces to manage data movement.
3.6.3. Data Curation

Integrated search and discovery tools are lacking at all levels of the storage hierarchy today. This is more problematic for Community and Forever storage, where significant quantities of data are resident for years or decades. These tiers often serve as shared data repositories for multiple projects over a long period of time, and the individual owner or steward of a dataset may change over the course of a project.
To address these issues, large projects have built their own data catalogs that are completely external to the NERSC storage resources. Some, such as JAMO,²³ are focused narrowly on cataloging and data movement, while others, including those developed by the ATLAS²⁴ experiment at the Large Hadron Collider and by the Advanced Light Source²⁵ at Berkeley Lab, include web presentation and workflow features. Although we do not intend to define a metadata schema for all NERSC users, having a common set of metadata features across all tiers on which user communities can build their domain-specific cataloging systems would simplify data management and curation as NERSC's storage hierarchy continues to evolve over the next 10 years.
3.6.4. Workload Diversity

The span of NERSC user workloads is broad and, consequently, the scale and distribution of file characteristics and I/O patterns varies greatly. As discussed in Section 3.1, simulations running at scale often write very large checkpoints that stress the entire data path from interconnect to media. At the other end of the spectrum, many experimentally driven projects run many low-concurrency jobs over large collections of smaller files. This can stress the metadata service and the storage system's ability to efficiently handle high volumes of small I/O operations, which has knock-on effects on other users of the file system. Providing a means to distribute metadata over multiple storage servers is an essential requirement, and features that allow more intelligent partitioning of metadata on the basis of users, projects, or arbitrary data properties would benefit quality of service.
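One simple form of the project-based metadata partitioning described above is a stable hash from project name to metadata server, so that one project's small-file storm lands on a single server rather than degrading the whole namespace. The function below is a hypothetical sketch, not a feature of any particular file system.

```python
# Sketch: deterministically assign each project's metadata to one of N
# metadata servers via a stable hash of the project name.
import hashlib

def metadata_server_for(project, num_servers):
    """Map a project name to a metadata server index in [0, num_servers)."""
    digest = hashlib.md5(project.encode()).hexdigest()
    return int(digest, 16) % num_servers
```

Real implementations (e.g., distributed metadata in modern parallel file systems) are far more sophisticated, handling rebalancing and hot directories, but the isolation benefit comes from the same partitioning idea.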
3.6.5. Storage System Software

Usability and manageability gaps exist across the storage system software used across all of NERSC's current storage tiers. For example, the Lustre-based scratch file system deployed as part of the Cori system's Temporary storage tier provides no straightforward way to add additional storage capacity or rebalance data across Lustre object storage targets. Lustre's management tools are also relatively immature; aside from Intel's now-unsupported Intel Manager for Lustre software,²⁶ there is no single-pane file system management interface for Lustre, and the majority of available tools are ad hoc scripts contributed by the community.
NERSC's Spectrum Scale-based project file system has its own set of challenges. Maintenance operations, such as file system integrity checks that require the file system to be taken offline for an extended period, work directly against the high availability requirements identified in Section 3.4. Furthermore, Spectrum Scale is a proprietary, closed-source system with annual licensing costs, and much recent development effort at IBM has gone into supporting requirements driven by enterprise, not HPC, needs.
The Forever tier, implemented using HPSS, is engineered to present a POSIX-compliant interface despite a simple put/get interface being sufficient for nearly all use cases. This POSIX compliance adds significant complexity to the software, yet the user interface into this tier is through custom client software. A file system interface to the archive, either through integration with Spectrum Scale or FUSE, is possible, but the underlying tape storage can make operations that are unremarkable in a disk-based file system extremely inefficient and time-consuming without careful planning.

23. New Metadata Organizer and Archive Streamlines JGI Data Management. http://www.nersc.gov/news-publications/nersc-news/nersc-center-news/2013/new-metadata-organizer-and-archive-streamlines-jgi-data-management. Accessed March 6, 2017.
24. PDSF data disk summary. http://portal.nersc.gov/atlas_diskstat. Accessed March 6, 2017.
25. Deslippe, J. et al. 2014. Workflow Management for Real-Time Analysis of Lightsource Experiments. 9th Workshop on Workflows in Support of Large-Scale Science. (Nov. 2014), 31–40.
26. Damkroger, T. 2017. A New Path with Lustre. http://intel.cmail20.com/t/ViewEmail/d/C316287F828160FA/5FC4DCCCE8C49BF9F6A1C87C670A6B9F. Accessed April 20, 2017.
3.6.6. Hardware Concerns

While all of the disk-based storage systems are architected for reliability with enterprise-class RAID and redundancy, the demand for storage capacity is now being satisfied with more, not simply larger, disks. This has a significant effect on the overall reliability of a storage system and its characteristic mean time to data loss, and the extreme-scale storage industry is transitioning from block-based parity within each failure domain (e.g., RAID6) to highly distributed, object-level erasure coding across shelves, racks, and even data centers. File systems built upon block-based storage cannot make use of these advances in erasure coding despite the nature of magnetic disks effectively requiring it for resilience in the future, so moving the Campaign and Community storage tiers towards technologies that balance parity and resilience more effectively will be essential.
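The trade-off above can be made concrete with a toy comparison of redundancy overhead versus failure tolerance. The 8+2 and 16+4 geometries below are illustrative examples only, not recommendations from the report.

```python
# Toy comparison: a classic 8+2 RAID6 group versus a wider 16+4 object-level
# erasure code. Both spend the same fraction of raw capacity on redundancy,
# but the wider code tolerates twice as many concurrent device failures.

def overhead(data_units, parity_units):
    """Fraction of raw capacity spent on redundancy."""
    return parity_units / (data_units + parity_units)

raid6 = (8, 2)      # survives any 2 failures within the group
erasure = (16, 4)   # survives any 4 failures across its placement domain

print(f"RAID6 8+2 overhead: {overhead(*raid6):.0%}, tolerates {raid6[1]} failures")
print(f"EC 16+4   overhead: {overhead(*erasure):.0%}, tolerates {erasure[1]} failures")
```

Because wide erasure codes can also spread their units across shelves or racks, the same capacity overhead buys protection against larger failure domains, which is the advantage block-based designs cannot easily capture.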
3.6.7. POSIX and Middleware

For decades, the POSIX I/O standard²⁷ has stood the test of time as the canonical way to access storage devices. However, advances in software scalability and hardware performance have strained the appropriateness of the existing standard and its semantics. Either revisions to the standard or entirely new performance-optimized standards would be valuable for future applications to deal with emerging high-performance storage technologies.
Further, a great deal of I/O middleware, such as HDF5, PnetCDF, and ADIOS, is tuned to operating with traditional memory-to-disk I/O endpoints. This middleware provides great value to application developers by isolating users from the vagaries of extracting peak performance from the underlying storage system, but it will need to be updated to handle the transition to a multi-tiered I/O configuration. Prefetching data from scratch or project into a burst buffer and migrating changes back again, support for asynchronous I/O operations, and other improvements to leverage new technologies are needed to continue supporting user requirements.
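The stage-in pattern described above can be sketched with a hypothetical helper that asynchronously copies input files from a capacity tier into a burst buffer before compute begins. Real middleware implements this below the application with far more care (placement, striping, partial staging); the names and approach here are illustrative only.

```python
# Sketch: asynchronous stage-in of inputs from a capacity tier (e.g., scratch
# or project) into a fast tier (burst buffer), overlapping the copies with
# application startup. Stage-out would mirror this in the other direction.
import shutil
from concurrent.futures import ThreadPoolExecutor

def stage(src, dst_dir):
    """Copy one file between tiers; returns the destination path."""
    return shutil.copy2(src, dst_dir)

def prefetch(files, burst_buffer_dir, workers=4):
    """Start asynchronous stage-in of all inputs; returns futures to wait on."""
    pool = ThreadPoolExecutor(max_workers=workers)
    return [pool.submit(stage, f, burst_buffer_dir) for f in files]
```

An application (or middleware layer acting on its behalf) would call `prefetch` at job start, do unrelated setup work, and only block on the futures immediately before the staged data is needed.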
4. Technology Landscape and Trends

Having identified both the functional requirements of a future storage infrastructure at NERSC and the requirements coming from users, experimental facilities, and operators, we now present hardware and software technologies that are or will be available to implement the Temporary, Campaign, Community, and Forever tiers over the next decade.
4.1. Hardware

Although the HPC industry has historically been a significant driver of mass storage hardware, the emergence of cloud and other hyperscale service providers has had a dramatic effect on the storage industry and its roadmaps for storage media. These economic forces, combined with the impending scaling limits of some physical media and the emergence of entirely new forms of others, are causing rapid and significant changes in the future landscape of storage hardware.

27. 2009. International Standard - Information Technology Portable Operating System Interface (POSIX) Base Specifications, Issue 7. ISO/IEC/IEEE 9945:2009(E). (Sep. 2009), 1–3880.
4.1.1. Magnetic Disk

Magnetic disk is transitioning from a medium designed for both capacity and bandwidth into one solely for capacity as a result of two factors:
• Magnetic storage media is reaching a physical limit on how small individual magnetic domains on the disk surface can be.
• High-performance NAND is proliferating, satisfying storage performance requirements and disincentivizing innovation towards better magnetic disk performance.
Because I/O performance scales with only the square root of the bit density on rotating media, the disparity between disk capacity and performance is only expected to widen.
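The square-root relationship above can be made concrete with illustrative numbers: capacity tracks areal density directly, while sequential bandwidth tracks only its square root, so the time to read (or rebuild) an entire drive grows with every density generation.

```python
# Illustrative consequence of bandwidth scaling with sqrt(areal density):
# a 10x density gain yields 10x capacity but only ~3.2x bandwidth, so a
# full-drive read takes ~3.2x longer than before.
import math

density_gain = 10.0
capacity_gain = density_gain                  # capacity scales with density
bandwidth_gain = math.sqrt(density_gain)      # ~3.16x
full_drive_read_time_gain = capacity_gain / bandwidth_gain  # ~3.16x

print(f"bandwidth gain:       {bandwidth_gain:.2f}x")
print(f"full-drive read time: {full_drive_read_time_gain:.2f}x longer")
```

This widening gap is also why rebuild times, and hence the erasure-coding concerns in Section 3.6.6, worsen as drives grow.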
That said, there are a number of capacity-focused improvements on the magnetic disk roadmaps of vendors and industry consortia. As shown in Figure 8, there are technology improvements that are projected to deliver a 10x increase in areal density over the next 10 years.
FIGURE 8. Projected areal density improvements for magnetic disk storage technology. Based on projections from Seagate²⁸ and ATSC.²⁹ Perpendicular magnetic recording (PMR) is the standard technology of today.
The modest 10% areal density (AD) improvement from two-dimensional magnetic recording (TDMR)³⁰ is likely to reach the enterprise market in the near term, and heat-assisted magnetic recording (HAMR) and bit-patterned magnetic recording (BPM)³¹ promise to deliver more aggressive increases in bit density in the longer term. However, both HAMR and BPM represent largely new recording techniques rather than small refinements to existing approaches, and there is a nontrivial risk that HAMR will not be a commercially or economically viable option in 2020.

28. Anderson, D. 2016. Whither Hard Disk Archives? 32nd International Conference on Massive Storage Systems and Technology. (May 2016).
29. 2016 ATSC Technology Roadmap. http://idema.org/?page_id=5868. Accessed September 3, 2017.
Thus, it is more likely that vendors will continue increasing the per-drive storage capacity by relying on refinements to shingling (e.g., via TDMR) and increasing platter counts. These two approaches will result in high-capacity drives with reduced write performance, flat read performance, and slightly increased power consumption. While suitable for the WORM workloads prolific in enterprise applications and content distribution networks, the evolution of spinning disk media is moving away from the balanced read-write workloads described in Section 3.1 and common to scientific computing in general.
4.1.2. Solid-State Storage

NAND-based solid-state storage devices (flash) have become a growing presence in HPC in the form of node-local scratch storage³² and centralized burst buffers³³ designed to reach a better performance-per-bit than magnetic disk media. As demand for flash media continues to increase, driven by both mobile electronics and hyperscale markets, the lower power consumption and high performance of flash are expected to continue to push magnetic disk into lower-performance roles.
The low power consumption and high bit density of flash make it an attractive archival medium. Although the cost-per-bit of flash is still significantly higher than that of magnetic disk and tape, the cost-per-bit of flash storage can be reduced by sacrificing performance and endurance. Hyperscale consumers (e.g., Facebook³⁴) are driving the development of quad-level cell (QLC) flash as a low-power, high-density medium for WORM and archival storage, and the first QLC NAND products have recently been announced by vendors including Samsung³⁵ and Toshiba.³⁶ By the 2020 timeframe, it is entirely conceivable that QLC flash may find a role alongside higher-performance, higher-endurance MLC and TLC NAND in tiered, all-flash storage systems.
The cost-per-bit of flash is also expected to drop precipitously before 2020 as the global NAND manufacturing industry completes the process of converting 2D (planar) NAND fabrication plants to 3D NAND. This will likely push prices for performance flash below $0.10 per GB, encroaching on a market traditionally held by magnetic disk.³⁷ Advances in 3D NAND fabrication technology, driven by healthy competition in the marketplace, will allow 3D NAND to scale well beyond 2020 as well; approaches such as string stacking are expected to allow areal densities of flash to scale to at least 5-10x the state of the art today.

30. Victora, R.H. et al. 2012. Two-Dimensional Magnetic Recording at 10 Tbits/in². IEEE Transactions on Magnetics. 48, 5 (May 2012), 1697–1703.
31. Albrecht, T.R. et al. 2015. Bit-Patterned Magnetic Recording: Theory, Media Fabrication, and Recording Performance. IEEE Transactions on Magnetics. 51, 5 (May 2015), 1–42.
32. Strande, S.M. et al. 2012. Gordon: design, performance, and experiences deploying and supporting a data intensive supercomputer. Proceedings of the 1st Conference of the Extreme Science and Engineering Discovery Environment - XSEDE (Chicago, 2012), 1.
33. Bhimji, W. et al. 2016. Accelerating Science with the NERSC Burst Buffer Early User Program. Proceedings of the 2016 Cray User Group (London, 2016).
34. Rao, V. 2016. "How We Use Flash at Facebook: Tiered Solid State Storage." Flash Memory Summit 2016. (August 2016).
35. Elliot, J. 2017. "Advancements in SSD and 3D NAND Reshaping Storage Market." Flash Memory Summit 2017. (August 2017).
36. Toshiba Develops World's First QLC BiCS FLASH 3D Memory with 4-Bit-Per-Cell Technology. https://toshiba.semicon-storage.com/us/company/taec/news/2017/06/memory-20170627-1.html. Accessed September 9, 2017.
37. Handy, J. Flash Market Current & Future. Flash Memory Summit 2017. (August 2017).
The NVMe over Fabrics (NVMeoF) protocol is a rapidly evolving standard that enables block-level access to NVMe devices over any network fabric that supports remote direct memory access (RDMA), including InfiniBand and Intel Omni-Path. In combination with RDMA fabrics whose bandwidth and performance align with the performance of NAND, NVMeoF is expected to enable fabric-attached NVMe devices as a viable, high-performance, disaggregated storage architecture.
Furthermore, it is technologically feasible to use NVMeoF to transfer block-based data to remote targets without CPU intervention and without copying blocks through host memory. Although such a feature requires extensive hardware support and driver compatibility between NVMe devices and RDMA-enabled network interfaces, it has the potential to enable hyperconverged node designs for HPC that do not suffer from I/O-induced jitter. Although such zero-jitter architectures are in key vendors' roadmaps, it is important to stress that these solutions remain unproven in production environments. Furthermore, block-level data transfer will still require storage system software to run on top of NVMeoF, which is not jitter-free.
A complementary technology is the Storage Performance Development Kit (SPDK),³⁸ which is an emerging set of libraries that provide a mechanism for applications to perform I/O to NVMe and NVMeoF devices entirely in user space. This significantly reduces the I/O latency of interacting with flash media by completely removing the need for data to transit the system kernel, and it is one of several efforts to provide a completely new interface to storage media that exposes the full capabilities of the hardware. SPDK is not widely used in production storage systems at present, but it is an instrumental component in future products, including DAOS.³⁹
4.1.3. Storage Class Memory and Nonvolatile RAM

Storage class memory (SCM) technologies, which include Intel/Micron's 3D XPoint, are on the horizon and promise to deliver nonvolatile and byte-addressable storage whose performance lies somewhere between today's DRAM and NAND. Although such technologies deliver higher performance and durability than NAND, the significantly higher cost per bit (and therefore lower capacity) renders SCM a pure performance technology that is likely to be integrated into larger, flash-based storage systems to remediate the software overheads incurred by processes such as data journaling. While SCM will undeniably play a role in storage systems in the 2020 timeframe, it is likely to first appear as highly integrated components within a larger storage system. This is analogous to how flash was first integrated into enterprise storage as extensions of traditional RAM-based cache tiers, such as in ZFS's ZIL/L2ARC.⁴⁰
There is opportunity for SCM to be directly used by users and applications in the form of byte-addressable nonvolatile storage with extremely low latency, but the consistency semantics of reading and writing data from a global storage resource with a load/store interface present a number of new challenges that remain a subject of intense research.⁴¹ Of note, the NVM Library⁴² is an emerging interface for persistent memory that preserves most of the low-latency benefits of SCM and flash by enabling user-space I/O directly to such devices through key-value, block, and other semantics. Although the NVM Library is currently being used to develop experimental storage services on SCM,⁴³ libraries and applications that can make direct use of the byte-addressability of SCM are unlikely to be production-ready by 2020.

38. Storage Performance Development Kit. http://www.spdk.io/. Accessed September 10, 2017.
39. Paciucci, G. HPC Storage Trends. HPC Advisory Council Swiss Conference. (April 2017).
40. Leventhal, A. 2008. Flash storage memory. Communications of the ACM. 51, 7 (Jul. 2008), 47.
4.1.4. Magnetic Tape

LTO and enterprise magnetic tape media have a comfortable technological runway because they capitalize on the investments made toward improving magnetic disk media. Furthermore, state-of-the-art magnetic tape technology typically comes to market five or more years after the same technology reached the magnetic disk market, giving the tape industry a healthy lead time in the event that magnetic disk reaches any fundamental barriers to improvement.
As a consequence of tape technology trailing disk technology, though, the roadmap for magnetic tape is driven by economics, not technology. Taking LTO tape (which holds a vast majority share of the magnetic tape market) as an example, tape revenue has been steadily decreasing despite steadily increasing volumes of capacity shipped, as shown in Figure 9.
FIGURE 9. Annual revenue and exabytes shipped of LTO tape media. Data from Fontana and Decad.⁴⁴
41. Chowdhury, M. and Rangaswami, R. 2017. Native OS Support for Persistent Memory with Regions. Proceedings of the 2017 International Conference on Massive Storage Systems and Technologies. (2017).
42. pmem.io: NVM Library. http://pmem.io/nvml/. Accessed September 9, 2017.
43. Carns, P. et al. 2016. Enabling NVM for Data-Intensive Scientific Services. 4th Workshop on Interactions of NVM/Flash with Operating Systems and Workloads (INFLOW '16) (Savannah, GA, 2016).
44. Fontana, R., Decad, G. 2016. Storage Media Overview: Historic Perspectives. 32nd International Conference on Massive Storage Systems and Technology. (May 2016).
In addition, the diversity of the tape manufacturing market has shrunk dramatically over the last decade: as of 2014, only Sony and Fujifilm continue to manufacture magnetic tape media, and as of 2017, IBM remains the only vendor to develop tape drives and cartridges. As a direct consequence of the steady decline of tape revenue and market competition, it is likely that the rate of innovation in magnetic tape will decelerate relative to magnetic disk. The perceptible effects of this decline are less certain, though, and it is not clear if the cost advantages of tape for archival storage will be surpassed by another medium in the next five to ten years.
If one assumes that data generation rates are ultimately bounded by the available capacity being produced, and that the majority of storage capacity is provided by magnetic disk (as evidenced in Figure 10), then the deceleration of magnetic tape capacity shipments relative to magnetic disk presents a significant risk: a constant investment in disk-based storage will require an increasing investment in tape-based storage to maintain a constant ratio of disk to tape. Thus, while tape remains cost-effective for archival in the near term, it is unlikely to be the optimal long-term solution. However, the cross-over point is not imminent, and it is not clear that this point will occur before 2025.
FIGURE 10. Annual exabytes of storage media shipped. Data from Fontana and Decad.⁴⁵
The low cost-per-bit of tape, combined with its minimal power consumption as an offline storage medium, continues to make it an attractive archival storage technology in the short term. Given the uncertainties outlined above, though, tracking the economics of the tape market and following vendor roadmaps are essential for longer-term planning.
4.1.5. Storage System Design

Storage system architectures in 2020 will be shaped by the technological developments outlined in this section in several key ways:
45. Fontana, R. and Decad, G. 2016. Storage Media Overview: Historic Perspectives. 32nd International Conference on Massive Storage Systems and Technology. (May 2016).
1. NAND devices will stratify into performance-oriented, high-endurance MLC/TLC and low-performance, high-capacity QLC, both of which consume less power and possess higher bit density than magnetic disk.
2. Magnetic disk media will disappear from performance-critical data paths and become a capacity-only medium.
3. Magnetic tape, which has historically been a capacity-only medium, has an uncertain future as its revenues drop. However, dramatic shifts in the economics of tape are unlikely to manifest before 2020-2025.
Based on these technological and economic trends, the role of these different media will also evolve:
1. MLC/TLC NAND will replace magnetic disk in all performance-critical applications, and QLC NAND will begin to supplant magnetic disk in many WORM (write-once, read-many) application areas.
2. Magnetic disk will begin to eat away at the most performance-sensitive applications of magnetic tape, including hot archive and replicated tape.
3. Magnetic tape's role in the data center will continue to shrink toward deep-archive applications as QLC NAND and magnetic disk approach it in cost.
4.2. Software

Beyond the changes coming in the hardware realm, there are many improvements and additions needed in extreme-scale storage and I/O software as well. The increasing difficulty of scaling POSIX-based parallel file systems to extreme scales is becoming a significant impediment, and, as discussed in Section 4.1.3, new software interfaces are a requirement for making optimal use of emerging low-latency storage hardware. Because these new non-POSIX interfaces are optimized for performance over usability, though, I/O middleware will become more important as a bridge across the semantic gap between the I/O operations that scientific applications demand and the I/O operations supported by the underlying storage system.
4.2.1. Non-POSIX Storage System Software

The stateful, file-based nature of POSIX I/O, combined with its prescriptive metadata schema and strong consistency semantics, makes it difficult to scale POSIX-based file systems to the extreme levels of parallelism anticipated for exascale systems. Object stores, initially driven by the extreme-scale I/O needs of cloud providers, eschew POSIX I/O semantics in favor of stateless put/get operations and immutable data objects. By exposing these I/O primitives directly to applications, they provide a much more scalable foundation on which more feature-rich storage services and systems can be built.
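The put/get semantics described above can be illustrated with a minimal sketch. The class below is a toy in-memory store, not the API of DAOS or Ceph; it shows how content-addressed, immutable objects make reads and writes stateless, with no open handles, byte offsets, or locks for clients to coordinate.

```python
import hashlib

class ObjectStore:
    """Toy in-memory sketch of stateless put/get object semantics.
    All names here are illustrative, not a real product's API."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        # Content-addressed: the key is derived from the data itself,
        # which makes objects immutable by construction.
        oid = hashlib.sha256(data).hexdigest()
        self._objects[oid] = data
        return oid

    def get(self, oid: str) -> bytes:
        # Stateless read: no open file handle, no seek offset, no locks.
        return self._objects[oid]

store = ObjectStore()
oid = store.put(b"simulation checkpoint, step 42")
assert store.get(oid) == b"simulation checkpoint, step 42"
```

Because every operation is self-contained, many clients can issue puts and gets concurrently without the distributed lock traffic that a POSIX file system would require.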
As a result, we expect to see scalable object-based storage systems, such as DAOS [46] or Ceph [47], take on a more prominent role in HPC systems in the near future. POSIX file-based interaction will still be an option for users' source code, configuration files, and input decks, but this POSIX interface will be implemented as middleware atop a native object interface rather than being the lowest-level user interface to storage. As POSIX moves from a native interface to a middleware layer, we anticipate that the hardware advances described in Section 4.1 will drive a gradual replacement of parallel file systems with object stores for both performance and capacity without requiring immediate, disruptive changes to user applications.
46. Gorda, B. 2015. DAOS: An Architecture for Exascale Storage. 31st International Conference on Massive Storage Systems and Technology. (May 2015).
47. Weil, S.A. et al. 2006. Ceph: A Scalable, High-Performance Distributed File System. Proceedings of the USENIX Symposium on Operating Systems Design and Implementation. (2006).
4.2.2. Application Interfaces and Middleware

As POSIX evolves into middleware, we also expect a greater percentage of the application community to move to other I/O middleware packages like HDF5 [48] and ADIOS [49]. This shift allows application teams to use more semantically meaningful APIs (e.g., storing a whole array rather than manually serializing data structures) and to benefit from the effort and experience of the middleware packages' developers. The increasing adoption of I/O middleware packages will also insulate applications from the underlying shift away from current POSIX consistency semantics, allowing them to automatically gain the benefits of new hardware without having to interact directly with the storage system's native API.
Increased storage of observational data and a push toward improved reproducibility of science results also lead to a need to store provenance information for all data, as identified in Section 3.3. Enhancing I/O middleware to automatically add provenance to application data will go a long way toward improving the current wild-west conditions of data curation by providing always-available, queryable information on the storage system. These data curation improvements will add to the momentum for a long-lived Community/Forever storage that is independent of Temporary/Campaign storage.
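The two ideas in this subsection, semantically meaningful array I/O and automatic provenance capture, can be combined in one hypothetical sketch. The function and attribute names below are invented for illustration; the point is that the middleware, not the user, records who wrote the data and when.

```python
import os, struct, time

def write_array(name, values, catalog):
    """Hypothetical middleware call: store a whole float array under a
    name, attaching provenance attributes automatically."""
    payload = struct.pack(f"{len(values)}d", *values)  # middleware serializes
    catalog[name] = {
        "data": payload,
        "attrs": {  # provenance recorded without any user effort
            "creator": os.environ.get("USER", "unknown"),
            "created": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
            "nelems": len(values),
        },
    }

def read_array(name, catalog):
    """Return the array plus its queryable provenance attributes."""
    entry = catalog[name]
    n = entry["attrs"]["nelems"]
    return list(struct.unpack(f"{n}d", entry["data"])), entry["attrs"]

catalog = {}
write_array("temperature", [290.0, 291.5, 293.2], catalog)
values, attrs = read_array("temperature", catalog)
assert values == [290.0, 291.5, 293.2]
```

In a real middleware package such as HDF5, the equivalent would be datasets annotated with attributes; the always-available provenance then becomes queryable directly from the storage system.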
5. Next Steps

As discussed in previous sections, the diversity of NERSC's workload will continue to drive NERSC's storage requirements in several different dimensions. File system performance must be measured not only in bandwidth but also in metadata performance, latency, and variability. Partnerships with experimental facilities and the continued growth of data science workloads will also add new data retention requirements in terms of both durability and manageability. In addition, the size of NERSC-9 will demand new levels of scalability and resilience. These requirements drive our vision for the future and our strategy for getting there.
5.1. Vision for the Future

While every HPC user desires a single high-performance, high-capacity, and highly durable storage system, cost will continue to require tiered storage at HPC centers. As has been the case for the past two decades, HPC will continue to deploy storage systems built from enterprise components whose economics are now driven largely by consumer and cloud markets. In the 2020-2025 timeframe, the most notable shift will be the move in platform storage away from HDDs and toward higher-performance but economical nonvolatile memory technologies.
The massive disk-based parallel file system, which has served the HPC community for more than two decades, will see its role diminished. It will no longer be the high-bandwidth resource used for all job I/O, as emerging storage technologies expressly built for NVM, such as Intel DAOS [50], IBM's burst buffer [51], and Cray DataWarp [52], become the principal interface to on-platform storage. For off-platform storage, cost-effective and scalable solutions such as object stores will begin to replace it. On-platform Temporary/Campaign storage will almost certainly be built entirely out of performance NAND and SCM, while off-platform Community/Forever storage will be a mix of QLC NAND, magnetic disk, and tape in a combination dictated by cost, technological evolution, and performance/capacity balance.

48. Folk, M. et al. 2011. An overview of the HDF5 technology suite and its applications. Proceedings of the EDBT/ICDT 2011 Workshop on Array Databases. (2011).
49. Lofstead, J. et al. 2009. Adaptable, metadata rich IO methods for portable high performance IO. 2009 IEEE International Symposium on Parallel & Distributed Processing (May 2009), 1-10.
50. Gorda, B. 2015. DAOS: An Architecture for Exascale Storage. 31st International Conference on Massive Storage Systems and Technology. (May 2015).
51. Goldstone, R. 2016. The Road to Coral…and Beyond. HPC Advisory Council Stanford Conference. (February 2016).
AnincreasingnumberofscientificapplicationswillinteractwithstoragethroughanI/Omiddlewarelayer,allowinghighlyscalablestorage(whichprovidesPOSIXcomplianceasanoption,notadefault)totransparentlyserveasthebackingstore.Nonvolatilememorywillmakeinroadsthroughoutthestoragehierarchy,andasitdoes,storagesoftwarewillbereengineeredtowringoutperformancebottlenecksthatappearwhenlatenciesarenolongerdominatedbythephysicalcharacteristicsofdiskdrives.Wearebeginningtoseethisintheformoflow-latency,user-spaceI/OlibrariessuchasMercury53andtheNVMLibrary,54andthistrendtowardoptimizingsoftwareforlowlatencywillbecomearequirementtomatchthelowlatencyofemergingnonvolatilememorytechnologies.
Archival storage software, one of the last vestiges of purpose-built system software for HPC, will be radically impacted by software innovations from cloud providers. The same put/get interfaces used to store data in cloud services such as Amazon S3 also suffice for storage in the onsite archive, and the archive will provide access via these standard object APIs, including S3 and Swift. For long-term storage, the lines may well be blurred between data that resides within the local facility and data that resides offsite, either in a commercial cloud or at another open science center. Data replication currently offered by commercial object stores and cloud providers, including attributes to guarantee geographical separation, will become part of the archival software suite.
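A minimal sketch of the S3-style archive interface described above, assuming a hypothetical client whose method names mirror the S3 convention (put_object/get_object) and a toy replication policy that writes one copy per geographically separate site:

```python
class ArchiveClient:
    """Toy sketch of an archive fronted by an S3-style put/get API.
    The replication logic is purely illustrative."""

    def __init__(self, sites=("onsite", "offsite")):
        # One replica per site models the geographic-separation
        # attributes mentioned in the text.
        self._sites = {site: {} for site in sites}

    def put_object(self, Bucket, Key, Body):
        # A single put fans out to every configured site.
        for replicas in self._sites.values():
            replicas[(Bucket, Key)] = Body

    def get_object(self, Bucket, Key):
        # Any surviving replica can satisfy the read.
        for replicas in self._sites.values():
            if (Bucket, Key) in replicas:
                return replicas[(Bucket, Key)]
        raise KeyError((Bucket, Key))

archive = ArchiveClient()
archive.put_object(Bucket="archive", Key="run042/output.h5", Body=b"archival data")
assert archive.get_object(Bucket="archive", Key="run042/output.h5") == b"archival data"
```

Because the interface is the same whether a replica lives on local tape, hyperscale disk, or a commercial cloud, the archive can change media behind the API without disrupting users.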
Throughout the HPC storage stack, there will be an emphasis on ease of movement between storage tiers. A new set of standards-based APIs for interacting with the performance, capacity, and archival tiers will help with adoption and portability, and efforts are already underway within DOE and among vendors to develop these APIs. Job-scheduling software will be able to move data between all tiers as part of a run, with resource managers including Slurm, Torque, and PBS Pro already beginning to support this. The combination of standard APIs and scheduler-moderated data motion will enable users to steer jobs and marshal data between tiers more expressively. This rich, procedural interface will ensure that data is in the correct place as different workflow stages ingest, manipulate, and store data in different ways.
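Scheduler-moderated data motion can be sketched as a toy job driver, loosely modeled on burst-buffer stage-in/stage-out directives (e.g., DataWarp's #DW stage_in). The dictionary-based tiers and directive names here are illustrative, not a real resource manager's API.

```python
def run_job(job, tiers):
    """The resource manager, not the application, moves data between
    tiers around a run: stage in before, stage out after."""
    perf, capacity = tiers["performance"], tiers["capacity"]
    for path in job.get("stage_in", []):     # before the job starts
        perf[path] = capacity[path]
    job["run"](perf)                         # compute sees only the fast tier
    for path in job.get("stage_out", []):    # after the job completes
        capacity[path] = perf.pop(path)

tiers = {"performance": {}, "capacity": {"inputs/mesh.dat": b"mesh"}}

def compute(perf):
    # The application reads and writes only the performance tier.
    perf["results/out.dat"] = perf["inputs/mesh.dat"] + b" processed"

run_job({"stage_in": ["inputs/mesh.dat"],
         "stage_out": ["results/out.dat"],
         "run": compute}, tiers)
assert tiers["capacity"]["results/out.dat"] == b"mesh processed"
```

The key design point is that the staging directives travel with the job, so the scheduler can overlap data motion with queue wait time and guarantee the data is in place before compute begins.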
The hierarchical file system of today will be only one of a number of views through which users can interact with their data. Alternate views of data, searchable by user-defined attributes associated with the data, are a feature of today's cloud-based storage that will find its way into the HPC space. There are a handful of efforts to provide rich metadata capabilities atop existing parallel file systems, but they are implemented as an external software layer and have seen limited adoption in production HPC. We anticipate that search and discovery based on user-defined metadata will be integrated directly into the storage system, and this will catalyze broader user adoption and provide a more stable foundation on which domain-specific metadata catalogs can be developed.
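A sketch of search over user-defined metadata attributes, the kind of capability the text anticipates being integrated into the storage system. The attribute names, file names, and index structure are all invented for illustration.

```python
def search(index, **criteria):
    """Return the names of all stored objects whose user-defined
    attributes match every given key=value criterion."""
    return sorted(name for name, attrs in index.items()
                  if all(attrs.get(k) == v for k, v in criteria.items()))

# A toy metadata index: object name -> user-defined attributes.
index = {
    "shot_1021.h5": {"instrument": "LCLS", "year": 2017, "calibrated": True},
    "shot_1022.h5": {"instrument": "LCLS", "year": 2017, "calibrated": False},
    "sim_0007.h5":  {"instrument": None,   "year": 2016, "calibrated": True},
}

# Discovery by attribute rather than by directory path.
assert search(index, instrument="LCLS", calibrated=True) == ["shot_1021.h5"]
```

When such an index lives inside the storage system rather than in an external layer, it stays consistent with the data automatically, which is what makes domain-specific catalogs built on top of it trustworthy.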
52. Henseler, D. et al. 2016. Architecture and Design of Cray DataWarp. Proceedings of the 2016 Cray User Group (London, 2016).
53. Soumagne, J. et al. 2013. Mercury: Enabling remote procedure call for high-performance computing. 2013 IEEE International Conference on Cluster Computing (CLUSTER) (Sep. 2013), 1-8.
54. pmem.io: NVM Library. http://pmem.io/nvml/. Accessed September 9, 2017.
Although the high-bandwidth Temporary tier will continue to be purchased with the supercomputer, Community and Forever storage will be best managed as separate resources owing to the longevity of the data they will store. By decoupling these longer-term tiers' refresh cadences from the compute systems' procurement cycles, we will be able to deploy the most feature-rich storage resources the market offers, integrate new technology over time, and realize the cost benefits of purchasing storage only when it needs to be deployed.
5.2. Strategy

The changes required to realize this vision for the future of storage in HPC will require innovations that involve hardware vendors, software and middleware developers, and the larger research community. The following strategy, divided into near-term (present day through 2020) and long-term (2020-2025) targets, strives to ensure a smooth transition for NERSC users and to identify areas where NERSC leadership and community engagement would be most beneficial. The evolution of the storage hierarchy during this period is summarized in Figure 11.
FIGURE 11. EVOLUTION OF THE NERSC STORAGE HIERARCHY BETWEEN TODAY AND 2025.
In the following sections, we detail the actions required to realize this evolution.
5.2.1. Near Term (2017–2020)

The most significant change to the storage hierarchy in the 2020 timeframe will be a collapse of the burst buffer and disk-based scratch file system back into a single high-performance, modest-capacity tier. Through the highly successful Burst Buffer Early User Program at NERSC [55] and ongoing production use of the burst buffer on Cori, solid-state media has demonstrated its viability for Temporary storage, and a single-tier, all-flash platform storage system would simplify data management for users without sacrificing substantial functionality. Given the trends of the NAND industry discussed in Section 4.1, this should be economically viable as well.
55. Bhimji, W. et al. 2016. Accelerating Science with the NERSC Burst Buffer Early User Program. Proceedings of the 2016 Cray User Group (London, 2016).
In addition to this all-flash platform-integrated tier, a disk-based, POSIX-compatible storage system will also need to exist during this time period to satisfy the needs of the colder portions of the Campaign tier and the hotter portions of the Community tier. Unlike the NERSC project file system of today, though, this tier will be optimized for capacity and manageability, not performance. It will meet the needs of data that must be retained beyond the lifetime of NERSC-9's temporary tier, such as high-value experimental observations, community-curated datasets, and other emerging use cases outlined in Sections 3.3 and 3.4. This capacity-optimized tier will present a familiar file system interface to support existing data management and transfer tools, but it will also provide access via more future-looking, object-based APIs to allow users to begin transitioning applications to put/get semantics.
The 2020 Campaign/Community storage will also satisfy many of the operational requirements discussed in Section 3.5. NERSC presently relies on key storage manageability features, including metadata replication, dynamic storage resizing, snapshotting, and enforcement of project-based quotas. The 2020 Campaign/Community storage system will expand upon these manageability features and provide a foundation for developing additional system monitoring and management tools for the future. It will also serve as the basis for future data curation tools and interfaces that NERSC will provide to users, and it will support features that facilitate object and file metadata searches and queries.
Due to the different performance, capacity, and feature requirements of this 2020 Campaign/Community tier, it will be acquired and managed as a resource that is independent of system platform storage through the 2020 timeframe. Unlike compute, storage is not a resource that is fully utilized as soon as it arrives, and incremental growth guided by user needs and center policy will take advantage of the expected 10%-30% annual reduction in cost per bit and allow economical resale of extra storage to projects that need it. This planned growth allows us to adopt new storage and network technologies incrementally, deploy novel solutions earlier, and increase NERSC's agility to innovate on the new storage techniques and technologies described in Section 4.
The 2020 Forever storage will remain predominantly tape-based due to tape's economic advantages. Tape technology will continue to be more cost-effective than disk through 2020, and transitioning an exabyte of data (or more) to a new storage medium would require significant capital investment and time. There may be opportunities to explore alternative archive media, but there are no truly compelling options in the near term. Other key technologies that may become technologically viable for archive, such as low-durability NAND [56] or hyperscale disk-based object stores, will still not be cost-competitive with tape by 2020.
NERSC will undoubtedly continue to deploy tape-based storage beyond 2020, but it is unlikely that tape's economic scaling rates will continue. Although NERSC's Forever storage has been treated as a limitless data store for users in the past, the economics of the tape market are making this an unsustainable policy. We have already begun to take steps to sharpen the focus of the NERSC archive, resulting in a 10% reduction in its size, and further refinements will be made based on close monitoring of the tape market.
The sum of these findings drives us toward the storage hierarchy for NERSC in 2020 shown in Figure 12.
56. Peglar, R. 2016. Innovations in Non-Volatile Memory: 3D NAND and its Implications. 32nd International Conference on Massive Storage Systems and Technologies (2016).
FIGURE 12. TARGET THREE-TIER STORAGE HIERARCHY FOR NERSC IN 2020.
To meet these near-term requirements and evolve the storage hierarchy toward this design, several critical actions must be taken before 2020:
1. The present NERSC project file system must be expanded significantly to reflect its role relative to the platform-integrated Temporary/Campaign tiers on Cori and NERSC-9. Because this storage system is optimized for manageability, accessibility, and usability, its capacity should reflect users' desire to store the bulk of their working data on it; the aim is for a capacity of 2-3x that of the performance tier. This is in contrast with today's hierarchy, where users store data for as long as possible on the performance tier (before the data gets purged) and then move it to the Forever tier.
2. Investments must be made toward fully utilizing the data management features present in NERSC's project file system and archive. Building new data management tools that unify these tiers will be essential; this includes improving accessibility (via new interfaces such as industry-standard object APIs) and introspection (via expanded indexing, monitoring, and characterization capabilities).
3. Given that the project file system will hold the Community tier, we expect decelerated growth for the tape-based archive. Policies and stricter quotas may be necessary to ensure that maintaining Forever storage is economically sustainable.
The result of these efforts will be a single high-performance, platform-integrated storage system that satisfies the role of Temporary storage and some very hot Campaign storage; a high-capacity but scalable and manageable storage system that satisfies the role of Campaign and Community storage; and a closely integrated, high-capacity, high-durability storage system that satisfies the role of very cold Community storage and Forever storage.
5.2.2. Long Term (2020–2025)

The next evolutionary step beyond the 2020 storage architecture will aim to transform the closely integrated Community and Forever storage systems into a single Community/Forever tier for long-term data retention, curation, and sharing. This results in a two-tier storage hierarchy, as shown in Figure 13.
FIGURE 13. TARGET TWO-TIER STORAGE HIERARCHY FOR NERSC IN 2025.
As with the 2020 storage infrastructure, the platform-integrated tier will emphasize performance first. It will provide a native interface that delivers extreme performance through asynchronous I/O, relaxed consistency semantics, and a user-space client implementation [57]. Users will still be able to access this tier through a familiar POSIX interface implemented as middleware, but this file-based API will not deliver the full performance of the underlying NAND- and SCM-based hardware. Rather, applications that require extreme performance will have to either use I/O middleware that supports the native interface or restructure their I/O to use the native interface directly. Given the disruptive nature of such a change, the semantics of this new API should be well defined by 2020, and experimental systems must be available to allow users to begin testing and modernizing their application I/O.
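The asynchronous, relaxed-consistency behavior of such a native interface can be sketched with a hypothetical API: puts return immediately, and completion is only guaranteed at an explicit synchronization point, unlike POSIX's strong write-then-read visibility. The class and method names below are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

class AsyncStore:
    """Toy sketch of an asynchronous, relaxed-consistency native
    interface; not the API of any real product."""

    def __init__(self):
        self._objects = {}
        self._pool = ThreadPoolExecutor(max_workers=4)

    def put_async(self, key, data):
        # Returns immediately; the write completes in the background.
        # Readers are not guaranteed to see it until the future resolves.
        return self._pool.submit(self._objects.__setitem__, key, data)

    def get(self, key):
        return self._objects[key]

store = AsyncStore()
futures = [store.put_async(f"ckpt/{i}", b"x" * 8) for i in range(8)]
for f in futures:   # explicit synchronization point, akin to a commit/fence
    f.result()
assert all(store.get(f"ckpt/{i}") == b"x" * 8 for i in range(8))
```

Shifting the synchronization burden to an explicit, batched operation is what lets such interfaces hide latency and exploit the parallelism of NAND- and SCM-based hardware.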
At the Community and Forever tiers, preparing for a transition away from established solutions like tape-based storage and HPSS toward object-storage solutions backed by shingled disk or archival NAND will require a careful assessment of the potential replacement technologies and production hardening. As a point of reference, DOE has invested decades in the development of HPSS to meet its mission needs, but adopting off-the-shelf technologies (such as open-source or commercial object-storage solutions) will pay future dividends by aligning our approach to mass storage with those of the cloud and hyperscale communities. Moving users to an object-based interface for the archive will allow us to transparently migrate away from tape-based media should tape continue to decline. However, building these bridges requires connecting users with these technologies, and ensuring that they meet user and operational requirements will require investment on the part of NERSC and the HPC community.
Preparing the NERSC storage hierarchy to transition into this long-term vision by 2025 requires additional actions within the next five years:
1. The NERSC Data Archive mission must be redefined to align its growth trajectory with the long-term target capacities and investments so that the transition to 2025 is seamless. This will involve engagement with those users whose data needs will exceed storage capacity projections, and it will involve developing software and infrastructure to assist users in managing and migrating their data.

57. See the discussion of Mercury, NVML, DAOS, and other software interfaces in Sections 4.1.3, 4.2.1, and 5.1.
2. Test platforms must be fielded to explore new I/O paradigms, including performance-oriented object stores and software systems capable of effectively utilizing next-generation nonvolatile memory technologies. This will allow NERSC to establish a credible understanding of how difficult a future transition to such systems would be for our users and to develop tools that address those difficulties. Such a system would also inform the return on investment users can expect from this effort and maintain our understanding of these technologies' maturity.
3. We must develop the tools and infrastructure that allow the performance/projects and campaign/archive tiers to collapse. For example, many components of DAOS would glue together the performance aspects of DAOS' asynchronous object interface with a lower-performance but higher-durability flash layer. Similarly, a software technology such as IBM's GHI would have to be proven out to integrate a GPFS-based campaign tier with an HPSS-based archive tier.
5.2.3. Opportunities to Innovate and Contribute

NERSC is uniquely positioned to lead a transition to this storage architecture because of its broad user base, deep understanding of user requirements, and proven ability to partner with application developers in code modernization efforts. As such, our role in leading a transition to future storage technologies is centered on two key areas:
1. Driving requirements that will steer emerging software, middleware, and hardware technologies in a direction that will be broadly accessible and useful across all segments of the HPC and scientific computing markets.
2. Demonstrating and hardening emerging software, middleware, and hardware technologies in extreme-scale but highly diverse workload environments that span traditional high-performance simulation, high-throughput experimental data processing and synthesis, and machine learning-driven data analytics at scale.
Ultimately, leading the ground-up design of novel storage systems or defining new storage paradigms at the bleeding edge of computational science is not within the NERSC mission. Rather, our expertise lies in understanding how such radical changes will affect each of the scientific domain areas' workflows at all scales, and this is where NERSC's leadership will be essential to ensuring that emerging I/O technologies will be viable and sustainable as they mature into the broader HPC ecosystem. This contribution is essential to helping new storage systems and APIs meet their full potential by broadening user adoption. Opportunities to drive requirements are manifold, and we categorize them as being in software, middleware, and hardware.
At the software level, NERSC's broad user base serves as a unique sounding board for emerging I/O APIs and software technologies. The NERSC Burst Buffer Early User Program has been an exemplar of how well NERSC is suited to proving out new storage systems, new modes of user-defined configuration, and new mechanisms of data access. The program provided the vendor with continuous feedback about how different user communities wanted to interact with flash storage, and it both drove the burst buffer's design and demonstrated its viability to the greater HPC community. Not only did this work strengthen the burst buffer software (much to the benefit of the user community and the vendor), it demonstrated that software-defined storage and flash-based file systems are viable technologies for the future. This effort is augmented now by the Tiered Storage Working Group, a partnership of DOE labs and burst buffer vendors, to define standards-based APIs for interacting with future multi-tier storage platforms.
It is critical that NERSC continue to invest in partnering with storage software providers to ensure that our users' needs are represented in designs. The strategic importance of this cannot be overstated as the HPC industry begins to explore radically new alternatives to the traditional parallel file system and as the enterprise industry drives object-based archival solutions into the HPC space. Failing to engage both software vendors and users in exploring new storage paradigms presents a significant risk that these storage solutions will evolve in directions not suitable for the broad user community and that compute- and data-intensive computing will bifurcate at the storage layer.
The middleware level represents an ideal area where NERSC should lead in bridging the gap between rapidly changing storage hardware and the diversity of user applications that change much more slowly. A case in point was a recent demonstration of using the HDF5 middleware to interface directly with DAOS [58]; because a significant amount of NERSC data is stored as HDF5, a substantial amount of the work required to port applications to entirely new I/O APIs and paradigms can be done in the middleware layer, effectively enabling broad adoption at only a modest investment from NERSC. Given the broad and increasing use of I/O middleware in HPC, this investment would be of significant benefit to the greater HPC community as well.
It is therefore essential that we continue to engage with the broad user community to transition applications to use I/O middleware where appropriate. Furthermore, we must continue close engagement with middleware developers to ensure that the features essential to users, including metadata, provenance tracking, and ease of use, guide the development of this middleware. Failure to invest in this will hold open a gap between today's applications and the native interfaces of non-POSIX storage systems, reducing the performance and scalability benefits offered by new nonvolatile hardware.
At the hardware level, NERSC has begun an effort to integrate the monitoring of its storage tiers into a holistic understanding of emerging I/O demands, and continuing this work will provide critical feedback to vendors. For example, monitoring the workloads and wear rates on Cori's burst buffer identified that HPC workloads would benefit greatly from multi-stream support in SSD firmware [59], and ongoing vendor engagement and sharing of endurance data has found that HPC workloads would be better served by trading high write endurance for added capacity on enterprise SSDs. Furthermore, these monitoring efforts are improving the performance, reliability, and usability of NERSC's storage systems by establishing detailed baseline behavior and maintaining relationships with vendors that facilitate rapid diagnosis, resolution, and improvement when aberrations arise.
Tracking NERSC production workload telemetry, curating and contextualizing it, sharing it with the larger vendor and research community, and actively maintaining productive engagements with vendors and researchers have provided significant returns for NERSC and the larger HPC community. In the absence of NERSC investment, the evolution of new storage technologies may be shaped by boutique workloads and the enterprise market. This would result in an overall loss of value in future generations of NVM, network technologies, and SCM.
58. Breitenfeld, M.S. et al. 2016. Use of a new I/O stack for extreme-scale systems in scientific applications. Proceedings of the 1st Joint International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems (2016).
59. Han, J. et al. 2017. Accelerating a Burst Buffer via User-Level I/O Isolation. 2017 IEEE International Conference on Cluster Computing (CLUSTER) (2017), 245-255.
6. Conclusion

The increased amount of data generated at experimental facilities and the prevalence of high-speed network connections between their instruments and centers such as NERSC point to an explosive increase in the volume of experimental data stored at computing sites. This, combined with the massive increase of data produced by exascale computations, requires rethinking the HPC storage hierarchy to maintain acceptable performance and cost. We have established four logical tiers of data storage based on required performance, capacity, shareability, and manageability and mapped these logical tiers to physical storage systems based on the prevalent trends in storage technologies.
In the short term, collapsing platform-integrated, high-performance, flash-based storage systems into a single tier that satisfies the requirements of Temporary and hot Campaign storage is feasible and desirable to simplify I/O for scientific workflows and data management. Moving the colder, disk-based Campaign/Community and tape-based Forever storage tiers into a more closely integrated group of systems is also tractable by 2020 and positions NERSC for a two-tier storage hierarchy in 2025.
This two-tiered 2025 storage system establishes a converged Temporary/Campaign storage system and a Community/Forever storage system, allowing NERSC to separately optimize extreme I/O performance from the orthogonal needs of long-lived, high-value community datasets. This transition will be critical to meeting the needs of NERSC users with the best available storage technologies in both 2020 and 2025, and immediate investments in software, middleware, and hardware technologies are necessary to achieve the benefits foreseen from that transition.
As the principal provider of HPC services to the DOE Office of Science, NERSC will deploy these new storage technologies while continuing to provide fast and reliable storage resources that meet the needs of our broad spectrum of users. The diversity of workflows and unique datasets that rely on NERSC's computational and storage resources puts NERSC in a strong position to understand how the changing storage landscape will affect the scientific domain areas' workflows at all scales. Executing the strategy presented in this document will ensure that emerging I/O technologies will be viable and sustainable solutions for meeting the needs of the DOE Office of Science as well as the broader HPC community.