responsive storage: home automation for research data ... · pbs, slurm, … lambda functions ......
TRANSCRIPT
ResponsiveStorage:HomeAutomationfor
ResearchDataManagement
RyanChard,KyleChard,JasonAlt,DilworthY.Parkinson,SteveTuecke,andIanFosterArgonneNationalLab,UniversityofChicago,andLawrenceBerkeleyNationalLab
TheProblem• Datagenerationratesareexploding
• Complexanalyticsprocesses
• Thedatalifecycleofteninvolvesmultipleorganisations,machines,andpeople
Thiscreatesasignificantstrainonresearchers
• Bestmanagementpractices(cataloguing,sharing,purging,etc.)canbeoverlooked
• Usefuldatamaybelost,siloed,orforgotten
RIPPLE:AprototyperesponsivestoragesolutionTransformstaticdatagraveyardsintoactive,responsivestoragedevices
• Automatedatamanagementprocessesandenforcebestpractices
• Reliableevent-drivenexecution
• Usersfocused:simpleif-trigger-then-actionrules• Accessibletoallendusers,notjustadminsandexpertusers
• Userscansetdatamanagementpoliciesandthenforgetaboutthem
• Combinerulesintoflowstocontrolend-to-enddatatransformations
• Passivelywaitsforfilesystemevents(verylittleoverhead)
• Filesystemagnostic– worksonbothedgeandleadershipplatforms
RIPPLEArchitecture(updated)Agent:
• Sits locally on the machine
• Detects & filters filesystem events
• Facilitates execution of actions
• Can receive new recipes
Service:
• Serverless architecture
• Lambda functions process events
• Orchestrates execution of actions
• Records and manages execution of flows
RippleRunner
Ripple Agent
SQLite
Filesystem
Docker,PBS,
SLURM,…
Lambda Functions
Monitor
Observers
Ripple Cloud API
Process
SQS Queue of events
RIPPLEAgentResponsiblefordetectingandreportingeventsofinterest
Filesystemagnostic– usesanappropriatemonitorfortheFS• LeveragesPythonWatchdogobservers
• inotify,polling,kqueue,etc.
• GlobusTransferAPIdetectsglobus events(transfer,create,delete)
RulesareretrievedfromthecloudserviceandstoredinanSQLitedatabase
Hybridfilteringmodel:• Localmonitorcheckseventsagainstactiverules• Iftheymatch,theyarereportedtothecloudforprocessing
Ripple Agent
SQLite
Filesystem
ProcessMonitor
Observers
RIPPLERunnerAbstractsexecutionenvironmentsandallowsjobsubmission/statuschecksviaAPI
HasaUUIDandpollsforactions– rulescaninvokeactionsonanyrunner
Canbedeployedalmostanywhere:LocallyinitiateDockercontainers,singularityexeccommands,andsubprocesses toactonlocalfiles(metadataextraction,dispatchjobs,etc.)
Cloudrunner(backedbyLambdafunctions)performscloudfunctions:Globustransfers,createsharedendpoints,sendemails,invokeotherLambdafunctionsetc.ThisfunctionsanAPIgatewayexposingtherunnerAPIandproxying requeststhroughtoLambdas
HPCsystemsemployarunnerforexposedbatchsubmission(currentlyjustSLURM)
RippleRunner
Docker,PBS,
SLURM,…
RIPPLECloudService• Gateway API exposes Ripple service
• Get rules
• Report events
• Update event status
• API either proxies Lambda functions (get rules) or inserts payload into SQS queue.
• Once an event reaches the SQS, it should not be lost
• SQS queue reports to SNS topic, triggering Lambda functions to pull from the queue• Dead letter queue after 3 processing failures
• CloudWatch timer triggers “cleanup” and “checkup” functions to process events still on the queue and outstanding jobs
Lambda Functions
Ripple Cloud API
SQS Queue of events
RIPPLERulesIFTTT-inspiredprogrammingmodel:
Triggers describewheretheeventiscomingfrom(filesystemcreateevents)andtheconditionstomatch(/path/to/monitor/.*.h5)
Actions describewhatservicetouse(e.g.,globus transfer)andargumentsforprocessing(source/dest endpoints).
EventDetectionGoal:beabletomonitorHPCstorageworkload(>3milevents/day)
Inotify vspolling
Create/touch/delete10,000filesandrecordeventreportingduration(20ktotal)
Machines:• Laptop• c4.xlargeinstance• Edisonloginnode(gpfs)
FilteringoverheadGoal:Determineoverheadcausedbyfilteringeventslocally
Measuredifferencesinevent/seconddetection
Filteringrequiresmatchingdirectorypathandfileextension
Pollingisoddasitonlypollsonceeverysecond
LambdaPerformanceGoal:Understandlambdaperformancefordifferenttasks
ColdvsWarmedfunctions
Actions:• Globustransfer• SMSemail• DynamoDB insert/query
TransfersrequireahandshakewiththeGlobusservice,whichalsocommunicateswiththeendpoints
UseCase:LargeSynopticSurveyTelescopeDevelopedarepresentativetestbedoftheLSSTstoragerequirements
• Automaticallypropagatedatabetweenstoragetiersandfacilities
• InvokeDockercontainerstoextractmetadataandmaintainafilecatalog
• Compressandarchivefiles
• Recoverdeleted/corruptedfileswhendeleteandmodificationeventsoccurCustodial Store
(Chile)
Archive: ANL’s Sparrow
Archiver
Landing
Magnetic
Forwarder
File Catalog
File Catalog
Custodial Store (NCSA)
Landing
Magnetic
Archive
metadataminidgzip
catalog....
1.
2.3.
4.
6.
7.
UseCase:AdvancedLightSourceDeployedRippleonanALSandNERSCmachinetoautomatedataanalysis
• AtALS: DetectnewheartbeatbeamlinedataandinitiatetransfertoNERSC
• AtNERSC: Extractmetadata,createsbatchfile,dispatchanalysisjobto
Edisonqueue,detectresultandtransferbacktoALS
• AtALS: createasharedendpoint,notifycollaboratorsofresultviaemail
NewUseCases
• AutomatedmetadataextractionandingestionintoGlobusSearch• UsessingularityandApacheTika toextractmetadata
• Metadataiswrappedintogmeta (json)documentsandingestedintosearch
• Offlinefeedbackmechanismforworkflows
• Researcherswantahumanqualitycontrolcomponent
• HaveRipplesendsubsetsofdatatoresearchersviaemailtocheckit
• Triggeractionsbasedoncontentofreplymessages
SummaryEvent-drivenautomationofdatamanagementpractices
Userfocused
Monitoringagentagnostictounderlyingfilesystem
Serverless eventprocessingandactionorchestration
FutureWorkMoreusecases!
Morerunners
Scalable&highperformanceeventmonitorsforleadershipresources
Programmingmodelforevent-baseddatamanagement
IntegrationwithGlobus