Post on 24-Jul-2020


Responsive Storage: Home Automation for Research Data Management

Ryan Chard, Kyle Chard, Jason Alt, Dilworth Y. Parkinson, Steve Tuecke, and Ian Foster

Argonne National Lab, University of Chicago, and Lawrence Berkeley National Lab

The Problem

• Data generation rates are exploding

• Complex analytics processes

• The data lifecycle often involves multiple organisations, machines, and people

This creates a significant strain on researchers

• Best management practices (cataloguing, sharing, purging, etc.) can be overlooked

• Useful data may be lost, siloed, or forgotten

RIPPLE: A prototype responsive storage solution

Transform static data graveyards into active, responsive storage devices

• Automate data management processes and enforce best practices

• Reliable event-driven execution

• User focused: simple if-trigger-then-action rules

• Accessible to all end users, not just admins and expert users

• Users can set data management policies and then forget about them

• Combine rules into flows to control end-to-end data transformations

• Passively waits for filesystem events (very little overhead)

• Filesystem agnostic – works on both edge and leadership platforms

RIPPLE Architecture (updated)

Agent:

• Sits locally on the machine

• Detects & filters filesystem events

• Facilitates execution of actions

• Can receive new recipes

Service:

• Serverless architecture

• Lambda functions process events

• Orchestrates execution of actions

• Records and manages execution of flows

[Figure: architecture diagram. Labels: Ripple Agent (Filesystem, Observers, Monitor, Process, SQLite); Ripple Runner (Docker, PBS, SLURM, …); Ripple Cloud API; SQS Queue of events; Lambda Functions]

RIPPLE Agent

Responsible for detecting and reporting events of interest

Filesystem agnostic – uses an appropriate monitor for the FS

• Leverages Python Watchdog observers

• inotify, polling, kqueue, etc.

• Globus Transfer API detects Globus events (transfer, create, delete)

Rules are retrieved from the cloud service and stored in an SQLite database

Hybrid filtering model:

• Local monitor checks events against active rules

• If they match, they are reported to the cloud for processing

[Figure: Ripple Agent diagram. Labels: Filesystem, Observers, Monitor, Process, SQLite]
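The hybrid filtering model above can be sketched in a few lines. This is only an illustration under stated assumptions: the table schema, the column names, and the example paths are hypothetical (the slides say only that rules live in an SQLite database), and the event list is hard-coded where a real agent would receive events from its observers.

```python
import re
import sqlite3

# Hypothetical schema: the slides do not describe how rules are stored.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rules (id INTEGER PRIMARY KEY, event_type TEXT, "
    "path_regex TEXT, active INTEGER)"
)
conn.executemany(
    "INSERT INTO rules (event_type, path_regex, active) VALUES (?, ?, ?)",
    [
        ("created", r"/data/beamline/.*\.h5", 1),
        ("deleted", r"/data/archive/.*", 1),
        ("created", r"/data/tmp/.*", 0),  # inactive rule, never matches
    ],
)


def matching_rules(conn, event_type, path):
    """Return ids of active rules whose trigger matches this event."""
    rows = conn.execute(
        "SELECT id, path_regex FROM rules "
        "WHERE active = 1 AND event_type = ? ORDER BY id",
        (event_type,),
    ).fetchall()
    return [rule_id for rule_id, pattern in rows if re.fullmatch(pattern, path)]


def filter_events(conn, events):
    """Keep only events matching at least one active rule.

    Only these events would be reported to the cloud service.
    """
    to_report = []
    for event_type, path in events:
        rule_ids = matching_rules(conn, event_type, path)
        if rule_ids:
            to_report.append({"type": event_type, "path": path, "rules": rule_ids})
    return to_report


events = [
    ("created", "/data/beamline/scan_001.h5"),
    ("created", "/data/beamline/notes.txt"),
    ("deleted", "/data/archive/old.tar"),
    ("created", "/data/tmp/scratch.h5"),
]
reported = filter_events(conn, events)
```

Of the four events, only the first and third match an active rule, so only those two would leave the machine.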

RIPPLE Runner

Abstracts execution environments and allows job submission/status checks via API

Has a UUID and polls for actions – rules can invoke actions on any runner

Can be deployed almost anywhere:

Locally initiate Docker containers, singularity exec commands, and subprocesses to act on local files (metadata extraction, dispatch jobs, etc.)

Cloud runner (backed by Lambda functions) performs cloud functions: Globus transfers, create shared endpoints, send emails, invoke other Lambda functions, etc. This functions as an API gateway exposing the runner API and proxying requests through to Lambdas

HPC systems employ a runner for exposed batch submission (currently just SLURM)

[Figure: Ripple Runner diagram. Labels: Docker, PBS, SLURM, …]
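A local runner of the kind described above might look roughly like the following. This is a sketch, not the RIPPLE implementation: the `Runner` class, the `pending_actions` store (standing in for the cloud service the runner would poll), and the action fields are all hypothetical. It shows only the UUID-addressed polling pattern and subprocess execution.

```python
import subprocess
import sys
import uuid
from collections import defaultdict, deque

# Stand-in for the cloud service: pending actions keyed by runner UUID.
pending_actions = defaultdict(deque)


class Runner:
    """Hypothetical local runner that executes actions as subprocesses."""

    def __init__(self):
        self.runner_id = str(uuid.uuid4())
        self.statuses = {}

    def poll_once(self):
        """Fetch and run every action currently queued for this runner."""
        queue = pending_actions[self.runner_id]
        while queue:
            action = queue.popleft()
            result = subprocess.run(
                action["command"], capture_output=True, text=True, check=False
            )
            self.statuses[action["id"]] = {
                "state": "done" if result.returncode == 0 else "failed",
                "stdout": result.stdout.strip(),
            }


runner = Runner()
# A rule can address any runner by its UUID.
pending_actions[runner.runner_id].append(
    {"id": "a1", "command": [sys.executable, "-c", "print('metadata extracted')"]}
)
pending_actions[runner.runner_id].append(
    {"id": "a2", "command": [sys.executable, "-c", "import sys; sys.exit(2)"]}
)
runner.poll_once()
```

The same interface could in principle wrap a container launch or a batch submission command in place of the Python subprocess used here.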

RIPPLE Cloud Service

• Gateway API exposes Ripple service

• Get rules

• Report events

• Update event status

• API either proxies Lambda functions (get rules) or inserts payload into SQS queue.

• Once an event reaches the SQS, it should not be lost

• SQS queue reports to SNS topic, triggering Lambda functions to pull from the queue

• Dead letter queue after 3 processing failures

• CloudWatch timer triggers “cleanup” and “checkup” functions to process events still on the queue and outstanding jobs

[Figure: cloud service diagram. Labels: Ripple Cloud API, SQS Queue of events, Lambda Functions]
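The retry and dead letter behaviour described above can be illustrated with an in-memory simulation. This does not call AWS: the `drain` function, the message layout, and the handler are hypothetical, and the only value taken from the slides is the limit of 3 processing failures. In SQS itself, a comparable limit is configured through the queue's redrive policy (`maxReceiveCount`).

```python
from collections import deque

MAX_RECEIVES = 3  # from the slide: dead letter queue after 3 processing failures


def drain(queue, handler, max_receives=MAX_RECEIVES):
    """Process messages until the queue is empty.

    A message whose handler raises goes back on the queue. Once it has been
    received max_receives times without success, it moves to the dead letter
    list instead of being retried again.
    """
    processed, dead_letters = [], []
    while queue:
        message = queue.popleft()
        message["receive_count"] += 1
        try:
            handler(message["body"])
            processed.append(message["body"])
        except Exception:
            if message["receive_count"] >= max_receives:
                dead_letters.append(message["body"])
            else:
                queue.append(message)
    return processed, dead_letters


attempts = {}


def flaky_handler(body):
    """Hypothetical handler: one event always fails, one succeeds on retry."""
    attempts[body] = attempts.get(body, 0) + 1
    if body == "bad-event":
        raise ValueError("cannot process")
    if body == "slow-event" and attempts[body] < 2:
        raise TimeoutError("try again")


queue = deque(
    {"body": body, "receive_count": 0}
    for body in ["ok-event", "slow-event", "bad-event"]
)
processed, dead_letters = drain(queue, flaky_handler)
```

Here the event that fails once is retried and succeeds, while the event that always fails is attempted three times and then set aside, so nothing is silently dropped.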

RIPPLE Rules

IFTTT-inspired programming model:

Triggers describe where the event is coming from (filesystem create events) and the conditions to match (/path/to/monitor/.*.h5)

Actions describe what service to use (e.g., Globus transfer) and arguments for processing (source/dest endpoints).
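To make the trigger/action split concrete, here is one way such a rule could be written down and evaluated. The field names, the endpoint placeholders, and the `fires` and `plan_action` helpers are assumptions made for illustration; the slides do not show the actual rule format. The path condition is the one given on the slide, treated here as a regular expression.

```python
import re

# Hypothetical rule document; field names are assumptions, not RIPPLE's format.
rule = {
    "trigger": {
        "source": "filesystem",
        "event": "created",
        # Condition as written on the slide. As a regex, the "." before "h5"
        # matches any character, so "\.h5" would be the stricter form.
        "path": "/path/to/monitor/.*.h5",
    },
    "action": {
        "service": "globus",
        "type": "transfer",
        "args": {
            "source_endpoint": "SOURCE-ENDPOINT",  # placeholder value
            "dest_endpoint": "DEST-ENDPOINT",  # placeholder value
        },
    },
}


def fires(rule, event):
    """True when the event comes from the right source and meets the conditions."""
    trigger = rule["trigger"]
    return (
        event["source"] == trigger["source"]
        and event["event"] == trigger["event"]
        and re.fullmatch(trigger["path"], event["path"]) is not None
    )


def plan_action(rule, event):
    """Return the action request a matching event would produce, else None."""
    if not fires(rule, event):
        return None
    action = rule["action"]
    return {
        "service": action["service"],
        "type": action["type"],
        "args": {**action["args"], "path": event["path"]},
    }


created = {"source": "filesystem", "event": "created",
           "path": "/path/to/monitor/run1.h5"}
modified = {"source": "filesystem", "event": "modified",
            "path": "/path/to/monitor/run1.h5"}
elsewhere = {"source": "filesystem", "event": "created",
             "path": "/somewhere/else/run1.h5"}

request = plan_action(rule, created)
```

Only the create event under the monitored path yields an action request; the modify event and the event from another directory yield nothing.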

Event Detection

Goal: be able to monitor HPC storage workload (>3 mil events/day)

Inotify vs polling

Create/touch/delete 10,000 files and record event reporting duration (20k total)

Machines:

• Laptop

• c4.xlarge instance

• Edison login node (gpfs)

Filtering overhead

Goal: Determine overhead caused by filtering events locally

Measure differences in event/second detection

Filtering requires matching directory path and file extension

Polling is odd as it only polls once every second
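As a rough illustration of the measurement, the sketch below times a directory-and-extension filter over synthetic paths and compares it with the sustained rate implied by the 3 million events/day goal (3,000,000 / 86,400 s, about 34.7 events/second). It is not the experiment from the slides: the paths are generated in memory, nothing touches a filesystem, and the function names are hypothetical, so it captures only the cost of the matching step.

```python
import os
import time

TARGET_EVENTS_PER_DAY = 3_000_000
target_rate = TARGET_EVENTS_PER_DAY / (24 * 60 * 60)  # about 34.7 events/second


def passes_filter(path, directory, extension):
    """Match on directory path and file extension, as the slide describes."""
    return os.path.dirname(path) == directory and path.endswith(extension)


def events_per_second(paths, use_filter):
    """Return (events kept, events examined per second) for one pass."""
    start = time.perf_counter()
    kept = 0
    for path in paths:
        if not use_filter or passes_filter(path, "/data/in", ".h5"):
            kept += 1
    elapsed = time.perf_counter() - start
    return kept, len(paths) / max(elapsed, 1e-9)


# Synthetic workload: 20,000 paths, half of which match the filter.
paths = [f"/data/in/file_{i}.h5" for i in range(10_000)]
paths += [f"/data/in/file_{i}.tmp" for i in range(10_000)]

kept_all, rate_unfiltered = events_per_second(paths, use_filter=False)
kept_h5, rate_filtered = events_per_second(paths, use_filter=True)

print(f"target: {target_rate:.1f} events/s")
print(f"unfiltered: {rate_unfiltered:,.0f} events/s, kept {kept_all}")
print(f"filtered:   {rate_filtered:,.0f} events/s, kept {kept_h5}")
```

The measured rates depend on the machine, so no expected numbers are given here; the quantity of interest is the gap between the filtered and unfiltered rates.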

Lambda Performance

Goal: Understand Lambda performance for different tasks

Cold vs warmed functions

Actions:

• Globus transfer

• SMS email

• DynamoDB insert/query

Transfers require a handshake with the Globus service, which also communicates with the endpoints

Use Case: Large Synoptic Survey Telescope

Developed a representative testbed of the LSST storage requirements

• Automatically propagate data between storage tiers and facilities

• Invoke Docker containers to extract metadata and maintain a file catalog

• Compress and archive files

• Recover deleted/corrupted files when delete and modification events occur

[Figure: LSST testbed diagram with numbered steps. Labels: Custodial Store (Chile); Custodial Store (NCSA); Landing; Magnetic; Archive; File Catalog; Archiver; Forwarder; Archive: ANL's Sparrow; metadata; minid; gzip; catalog]

Use Case: Advanced Light Source

Deployed Ripple on an ALS and NERSC machine to automate data analysis

• At ALS: Detect new heartbeat beamline data and initiate transfer to NERSC

• At NERSC: Extract metadata, create batch file, dispatch analysis job to Edison queue, detect result and transfer back to ALS

• At ALS: Create a shared endpoint, notify collaborators of result via email
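The three stages above form a flow in which each action produces the event that triggers the next rule. A minimal way to model that chaining is sketched below. The event names, action names, and the `run_flow` helper are invented for illustration and are not taken from the deployment; the NERSC stage is split into two steps because the slide describes it as two reactions (to arriving data and to the analysis result).

```python
# Hypothetical flow: each step reacts to one event and emits the next one.
FLOW = [
    {"site": "ALS", "on": "new_beamline_data",
     "do": "transfer_to_nersc", "emits": "data_at_nersc"},
    {"site": "NERSC", "on": "data_at_nersc",
     "do": "extract_metadata_and_submit_job", "emits": "result_ready"},
    {"site": "NERSC", "on": "result_ready",
     "do": "transfer_to_als", "emits": "result_at_als"},
    {"site": "ALS", "on": "result_at_als",
     "do": "share_and_notify", "emits": None},
]


def run_flow(flow, first_event):
    """Follow the chain of rules started by first_event.

    Returns the ordered list of (site, action) pairs that would run.
    """
    steps_by_event = {step["on"]: step for step in flow}
    trace = []
    event = first_event
    while event in steps_by_event:
        step = steps_by_event[event]
        trace.append((step["site"], step["do"]))
        event = step["emits"]
    return trace


trace = run_flow(FLOW, "new_beamline_data")
for site, action in trace:
    print(f"{site}: {action}")
```

An event with no matching rule, or a final step that emits nothing, ends the chain.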

New Use Cases

• Automated metadata extraction and ingestion into Globus Search

• Uses singularity and Apache Tika to extract metadata

• Metadata is wrapped into gmeta (json) documents and ingested into search

• Offline feedback mechanism for workflows

• Researchers want a human quality control component

• Have Ripple send subsets of data to researchers via email to check it

• Trigger actions based on content of reply messages

Summary

Event-driven automation of data management practices

User focused

Monitoring agent agnostic to underlying filesystem

Serverless event processing and action orchestration

Future Work

More use cases!

More runners

Scalable & high performance event monitors for leadership resources

Programming model for event-based data management

Integration with Globus
