TRANSCRIPT
Building the Trident Scientific Workflow Workbench for Data Management in the Cloud
Roger Barga (MSR), Yogesh Simmhan, Ed Lazowska, Alex Szalay, and Catharine van Ingen
Trident Project Objectives
Demonstrate that a commercial workflow management system can be used to implement scientific workflow.
Offer this system as an open source accelerator:
Write once, deploy and run anywhere;
Abstract parallelism (HPC and many core);
Automatic provenance capture, for both workflow and results;
Costing model for estimating resources required;
Integrated data storage and access, in particular cloud computing;
Reproducible research.
Develop this in the context of real eScience applications: make sure we solve a real problem for actual project(s).
And this is where things started to get interesting...
Research Questions
Role of workflow in data intensive eScience
Explore architectural patterns/best practices
Scalability, Fault Tolerance, Provenance
Reference architecture, to handle data from creation/capture to curated reference data, and serve as a platform for research
Scientific Workflow for Oceanography
Workflow and the Neptune Array
Workflow is a bridge between the underwater sensor array (instrument) and the end users.
Features
Allow human interaction with instruments;
Create ‘on demand’ visualizations of ocean processes;
Store data for long term time-series studies;
Deployed instruments will change regularly, as will the analysis;
Facilitate automated, routine “survey campaigns”;
Support automated event detection and reaction (sketched below);
User able to access through web (or custom client software);
Best effort for most workflows is acceptable.
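As a rough illustration of the "automated event detection and reaction" feature, the sketch below polls a simulated sensor reading and launches an on-demand visualization workflow when a threshold is crossed. Everything here is hypothetical: the threshold, read_sensor, and launch_workflow are stand-ins, not Trident or Neptune APIs.

```python
import random
import time

THRESHOLD = 2.5  # hypothetical anomaly threshold (made-up units)

def read_sensor(sensor_id):
    # Stand-in for the Neptune array feed; real readings would come off the wire.
    return random.gauss(1.0, 1.0)

def launch_workflow(name, **params):
    # Stand-in for submitting an on-demand workflow to the Trident runtime.
    print(f"launching {name} with {params}")
    return f"job-{random.randint(1000, 9999)}"

def monitor(sensor_id, polls=10, poll_seconds=0):
    for _ in range(polls):
        reading = read_sensor(sensor_id)
        if reading > THRESHOLD:
            # Event detected: react by launching an on-demand visualization workflow.
            launch_workflow("OceanEventVisualization",
                            sensor=sensor_id, trigger_value=reading)
        time.sleep(poll_seconds)

monitor("CTD-7")
```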
Pan-STARRS Sky Survey
One of the largest visible light telescopes.
4 unit telescopes acting as one; 1 Gigapixel per telescope.
Surveys the entire visible universe once per week.
Catalog solar system, moving objects/asteroids.
ps1sc.org: UHawaii, Johns Hopkins, …
Pan-STARRS Highlights
30 TB of processed data/year; ~1 PB of raw data.
5 billion objects; 100 million detections/week.
Updated every week.
SQL Server 2008 for storing detections.
Distributed over spatially partitioned databases (sketched below); replicated for fault tolerance.
Windows 2008 HPC Cluster schedules workflows and monitors the system.
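The slides only say that detections are spread over spatially partitioned, replicated databases; the actual partitioning and replica-placement scheme is not given. A purely hypothetical sketch of the idea, routing a detection by declination band to one of 16 slice databases and listing its replicas:

```python
# Hypothetical routing of a detection to a spatial slice and its replicas.
# The real Pan-STARRS partitioning scheme is not specified here; this only
# illustrates "spatially partitioned, replicated" using declination bands.
N_SLICES = 16
REPLICAS_PER_SLICE = 3  # assumption, consistent with "up to 2 replicas out of 3 can fail"

def slice_for(dec_degrees):
    """Map a declination in [-90, +90] to one of N_SLICES bands."""
    band = int((dec_degrees + 90.0) / 180.0 * N_SLICES)
    return min(band, N_SLICES - 1)

def replica_databases(dec_degrees):
    s = slice_for(dec_degrees)
    return [f"SliceDB_{s + 1:02d}_copy{r}" for r in range(REPLICAS_PER_SLICE)]

print(replica_databases(12.3))
# -> ['SliceDB_10_copy0', 'SliceDB_10_copy1', 'SliceDB_10_copy2']
```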
Pan-STARRS Data Flow
[Figure: Pan-STARRS data flow. CSV batches from the Image Processing Pipeline (IPP) land in a shared data store, pass through load-merge machines 1-6 into slice databases S1-S16 holding L1/L2 data, with hot and warm copies, and are served through the main databases and the distributed view.]
Pan-STARRS Workflows
Telescope → CSV files → Image Processing Pipeline (IPP) → CSV files.
Load Workflows bulk-load the CSV files into Load DBs.
Merge Workflows merge the Load DBs into Cold Slice DBs 1 and 2.
Flip Workflows publish the merged slices as Warm and Hot Slice DBs 1 and 2 (sketched below).
Distributed Views span the Hot Slice DBs and are queried through the CASJobs Query Service and per-user MyDB databases.
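A minimal sketch of the flip idea, assuming a catalog that records which physical database each production slice currently serves as hot and warm; after a successful merge and copy, the flip repoints both entries at the new version so the distributed view sees fresh data without downtime. The catalog structure and names are hypothetical.

```python
# Hypothetical flip: after a merge, publish the new slice version as both the
# hot and warm copies served by the distributed view.
catalog = {"Slice01": {"hot": "Slice01_v41", "warm": "Slice01_v41"}}

def flip(slice_name, new_version):
    """Repoint the hot and warm entries at the freshly merged/copied slice DB."""
    catalog[slice_name]["hot"] = new_version
    catalog[slice_name]["warm"] = new_version
    print(f"{slice_name}: distributed view now serves {new_version}")

flip("Slice01", "Slice01_v42")
```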
The Pan-STARRS Science Cloud
[Figure: the Pan-STARRS science cloud. Behind the cloud, Data Valet workflows run on the admin and load-merge machines; user-facing services run on the production machines for the data consumers (astronomers), who issue queries and workflows. Data flows in one direction, from the data creators toward the data consumers, except for error recovery: validation exception notifications and the slice fault recovery workflow.]
Pan-STARRS Architecture
Workflow is just a member of the orchestra
Workflow and Pan-STARRS
Workflow carries out the data loading and merging
Features
Support scheduling of workflows for nightly load and merge;
Offer only controlled (protected) access to the workflow system;
Workflows are tested, hardened and seldom change; not a unit of reuse or knowledge sharing;
Fault tolerance: ensure recovery and cleanup from faults;
Assign clean up workflows to undo state changes (see the sketch below);
Provenance as a record of state changes (system management);
Performance monitoring and logging for diagnostics;
Must “play well” in a distributed system;
Provide ground truth for the state of the system.
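One way to read "assign clean up workflows to undo state changes": each forward activity registers a compensating action, a fault triggers the cleanup workflow that replays those compensations in reverse, and every state change is logged as provenance. The sketch below is a hypothetical illustration of that pattern, not the project's actual activities.

```python
# Hypothetical compensation pattern: each forward activity registers an undo
# action; a fault triggers the cleanup workflow (reverse replay) and every
# state change is recorded as provenance.
provenance_log = []

def log(event):
    provenance_log.append(event)

def run_with_cleanup(steps):
    """steps: list of (name, do_fn, undo_fn). Undo completed steps if one fails."""
    done = []
    try:
        for name, do, undo in steps:
            do()
            log(f"{name}: done")
            done.append((name, undo))
    except Exception as exc:
        log(f"fault: {exc!r}; running cleanup workflow")
        for name, undo in reversed(done):
            undo()
            log(f"{name}: undone")
        raise

def failing_bulk_load():
    raise RuntimeError("bad checksum")   # simulated fault in the second activity

steps = [
    ("CreateLoadDB", lambda: None, lambda: None),
    ("BulkLoadCSV", failing_bulk_load, lambda: None),
]
try:
    run_with_cleanup(steps)
except RuntimeError:
    pass
print(provenance_log)
# ['CreateLoadDB: done', "fault: RuntimeError('bad checksum'); running cleanup workflow",
#  'CreateLoadDB: undone']
```

The same log is what "provenance as a record of state changes" points at: it is the ground truth consulted during recovery.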
Other Partner Applications
Data Creation, Ingest to End Use
Scientific data from sensors and instruments: time series, spatially distributed.
Need to be ingested before use.
Go from Level 1 to Level 2 data.
Potentially large, continuous stream of data.
A variety of end users (consumers) of this data.
Workflows shepherd raw bits from instruments to usable data in databases in the Cloud
Users in an eScience Ecosystem
[Figure: roles and interactions around shared data products. Producers upload new data; Data Valets reject/fix or accept it and publish new data products; publishers and curators handle data corrections; consumers query the published data products and download results.]
Generalized Architecture: GrayWulf
[Figure: shared compute resources and a shared queryable data store, with configuration management and health/performance monitoring. Valet workflows feed the Data Valet queryable data store; user workflows operate over the user queryable data store and user storage; operator, Data Valet, and user interfaces sit on top, with separate data-flow and control-flow paths.]
PanSTARRS Workflows

Workflow | Executes on | Repeats | Parallelism
Load Preamble | Admin machine | per Load DB | no limits
Load | LM machines | per Load DB | 1/2/4/8 per LM machine (TBD)
Merge Preamble | Admin machine | once only | one only
Merge | LM machines | per Cold Slice DB | one per LM machine
Cold Main Merge Preamble | Admin machine | once only | one only
Cold Main Merge | one of the LM machines | once only | one only
Copy Preamble | Admin machine | once only | one only
Copy | Slice machines and head machine (for the Cold Main) | per Cold Slice DB and Cold Main DB | one tuple of (slice machine, LM machine) active at a time; Main DB copy before any co-located Cold Slice DB on the LM machine
Flip Preamble | Admin machine | once only | one only
Flip | Admin machine | once per hot/warm | one only
CasJob Poller | Admin machine | once per hot/warm | one only
Garbage Collector | Admin machine (assumes we can delete remotely) | per hot/warm slice | one only
Garbage Collector | Admin machine (assumes we can delete remotely) | weekly |

Also defined: Cold/Hot Slice Initiation, Cold Main Initialization, Cold Slice Recovery, Cold Main Recovery, and Hot/Warm Slice Recovery workflows.
PS Load & Merge Workflows
Load workflow:
Start → sanity check of network files, manifest, checksum → validate CSV file and table schema → determine the affine Slice Cold DB for the CSV batch → create and register an empty Load DB from a template → for each CSV file in the batch: BULK LOAD the CSV file into a table, then perform CSV file/table validation → perform Load DB/batch validation → End.
On failure: detect the load fault, launch recovery operations, notify the admin.

Merge workflow:
Start → determine “merge worthy” Load DBs and Slice Cold DBs → for each partition in the Slice Cold DB: switch OUT the slice partition to temp; UNION ALL over the Slice and Load DBs into temp, filtered on the partition bound; post-partition load validation; switch IN temp to the slice partition → slice column recalculations and updates → post-slice load validation → End (the per-partition step is sketched below).
On failure: detect the merge fault, launch recovery operations, notify the admin.
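A greatly simplified, hypothetical rendering of the per-partition merge step, with ordinary Python lists standing in for the SQL Server SWITCH OUT / UNION ALL / SWITCH IN operations and an invented validation rule:

```python
# Hypothetical per-partition merge step: "switch out" the live partition,
# rebuild it from the slice and load rows, validate, and "switch in" the result.
class MergeFault(Exception):
    pass

def merge_partition(slice_rows, load_rows, bounds):
    """Rebuild one slice partition from existing rows plus newly loaded rows."""
    lo, hi = bounds
    # UNION ALL over slice and load rows, filtered on the partition bound.
    temp = [r for r in slice_rows + load_rows if lo <= r["objid"] < hi]
    # Post-partition load validation (invented rule: no duplicate object ids).
    if len({r["objid"] for r in temp}) != len(temp):
        raise MergeFault(f"duplicate objid in partition {bounds}")
    # "Switch in" the rebuilt partition.
    return sorted(temp, key=lambda r: r["objid"])

slice_rows = [{"objid": 10}, {"objid": 42}]
load_rows = [{"objid": 17}, {"objid": 99}]   # objid 99 lies outside this partition's bounds
print(merge_partition(slice_rows, load_rows, bounds=(0, 50)))
# [{'objid': 10}, {'objid': 17}, {'objid': 42}]
```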
System State Matters…
Monitor the state of the system: data-centric and process-centric views.
What is the load/merge state of each database in the system?
What are the active workflows in the system?
Drill down into actions performed: on a particular database to date, or by a particular workflow.
Provenance for the Data Valet
Provenance drivers for Pan-STARRS:
Need a way to monitor the state of the system (databases and workflows);
Need a way to recover from error states.
Database states are modeled as a state transition diagram; workflows cause transitions from one state to another (a sketch follows after the diagrams below).
Provenance forms an intelligent system log.
Load DB State Diagram / Slice DB State Diagram
[Figures: state transition diagrams for the Load DBs and Slice DBs.]
Load DB main path: Deployed → (load started) → LoadInProgress → (load succeeds) → Loaded (not validated) → (validation begins) → ValidationInProgress → (validation succeeds) → MergeWorthy → (pre-merge validation succeeds) → MergeInProgress → (merge completes) → Merged → ReaperWorthy.
Exception states and transitions: load fails; contents initialization fails; validation fails; pre-merge validation fails; Hold, released by human action; AnalysisHold (ex-LoadDB); Corrupt (from any of the above, when a query fails catastrophically due to database corruption); Missing (entry in the CSM database but not on the machine, when the monitor fails to locate the database due to confusion, a bug, or loss of the machine); Deleted.
Slice DB diagram additions: the flip of a Cold Slice DB to both the Hot and Warm Slice DBs completes after the merge completes; a corrupt or lost Cold Slice DB causes a merge or flip failure until the Cold Slice DB is repaired.
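A small sketch of the idea that database states form a state machine, workflows drive the transitions, and the provenance store doubles as the intelligent system log. Only the main Load DB path is encoded; the event names are paraphrased from the diagram's transition labels and everything else is hypothetical.

```python
# Hypothetical state machine for a Load DB: workflows request transitions, and
# every accepted transition is appended to a provenance log used for recovery.
TRANSITIONS = {
    ("Deployed", "load started"): "LoadInProgress",
    ("LoadInProgress", "load succeeds"): "Loaded (not validated)",
    ("Loaded (not validated)", "validation begins"): "ValidationInProgress",
    ("ValidationInProgress", "validation succeeds"): "MergeWorthy",
    ("MergeWorthy", "pre-merge validation succeeds"): "MergeInProgress",
    ("MergeInProgress", "merge completes"): "Merged",
}

provenance_log = []

def transition(db, state, event, workflow):
    new_state = TRANSITIONS.get((state, event))
    if new_state is None:
        raise ValueError(f"{workflow}: illegal transition from {state} on '{event}'")
    provenance_log.append((db, state, event, new_state, workflow))
    return new_state

state = "Deployed"
for event, wf in [("load started", "LoadWorkflow"),
                  ("load succeeds", "LoadWorkflow"),
                  ("validation begins", "LoadWorkflow"),
                  ("validation succeeds", "LoadWorkflow"),
                  ("pre-merge validation succeeds", "MergeWorkflow"),
                  ("merge completes", "MergeWorkflow")]:
    state = transition("LoadDB_0423", state, event, wf)
print(state)                 # Merged
print(len(provenance_log))   # 6 recorded state changes
```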
Fault Recovery
Faults are just another state. Pan-STARRS aims to support two degrees of failure: up to 2 replicas out of 3 can fail and still be recovered.
Provenance logs need to identify the type and location of a failure;
Verification of fault paths;
Attribution of failure to human error, infrastructure failure, or data error;
Global view of system state during a fault.
Provenance Data Model
Fine-grained workflow activities: each activity does one task, which eases failure recovery.
Capture inputs and outputs from each workflow/activity.
Relational/XML model for storing provenance: a generic model supports complex .NET types and identifies stateful data in parameters.
Build a relational view over the data states.
Domain-specific view encodes semantic knowledge in the view query.
Reliability of Provenance System
Fault recovery depends on provenance; missing provenance can leave the system unstable after a fault.
Provenance collection is synchronous.
Provenance events are published using reliable (durable) messaging, guaranteeing that each event will eventually be delivered (a minimal sketch follows).
Provenance is reliably persisted.
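To illustrate the durable-messaging guarantee, here is a minimal, hypothetical sketch that journals each provenance event to disk before the workflow proceeds and later drains the journal into the provenance store; the real system relied on durable messaging infrastructure rather than a hand-rolled journal like this.

```python
# Hypothetical durable publish: each provenance event is journaled and fsync'd
# before the workflow proceeds, then drained to the provenance store; events
# written before a crash are still delivered on the next drain.
import json
import os

JOURNAL = "provenance_journal.log"

def publish(event):
    with open(JOURNAL, "a") as f:
        f.write(json.dumps(event) + "\n")
        f.flush()
        os.fsync(f.fileno())   # durable before we return to the workflow

def drain(store):
    """Deliver journaled events to the provenance store, then clear the journal."""
    if not os.path.exists(JOURNAL):
        return
    with open(JOURNAL) as f:
        for line in f:
            store.append(json.loads(line))
    os.remove(JOURNAL)

publish({"db": "LoadDB_0423", "event": "load succeeds", "workflow": "LoadWorkflow"})
store = []
drain(store)
print(store)
```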
Trident Workbench
Trident Logical Architecture
[Figure: layered architecture.]
Workbench: scientific workflows, visualization, design of workflow packages.
Runtime: Windows Workflow Foundation and the Trident runtime services (service and activity registry, provenance, fault tolerance, WinHPC scheduling, monitoring service, runtime).
Tools: workflow monitor, administration console, workflow launcher.
Community: archiving, web portal.
Data access: Data Object Model (database agnostic abstraction, sketched below) over SQL Server, SSDS Cloud DB, S3, …
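The data access layer is described only as a database agnostic abstraction over SQL Server, SSDS cloud databases, S3, and similar stores. A hypothetical sketch of that shape in Python (the actual Data Object Model is a .NET API; the interface and names below are illustrative):

```python
# Hypothetical data object model: workflows program against one interface and
# the backing store (SQL Server, cloud DB, S3, ...) is chosen at deployment.
from abc import ABC, abstractmethod

class DataStore(ABC):
    @abstractmethod
    def put(self, key: str, blob: bytes) -> None: ...
    @abstractmethod
    def get(self, key: str) -> bytes: ...

class InMemoryStore(DataStore):
    """Stand-in provider; real providers would wrap SQL Server, SSDS, or S3."""
    def __init__(self):
        self._data = {}
    def put(self, key, blob):
        self._data[key] = blob
    def get(self, key):
        return self._data[key]

def save_workflow_output(store: DataStore, run_id: str, payload: bytes) -> None:
    # Activities only see DataStore, so local and cloud deployments share code.
    store.put(f"runs/{run_id}/output", payload)

store = InMemoryStore()
save_workflow_output(store, "2009-05-12-001", b"results")
print(store.get("runs/2009-05-12-001/output"))
```

The point of the abstraction is that activities are written against the interface once and the backing store is chosen at deployment time, which is what allows the same workflow to run against a local SQL Server or a cloud store.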
Research Questions
Role of workflow in data intensive eScience: the Data Valet.
Explore architectural patterns/best practices: scalability, fault tolerance and provenance implemented through workflow patterns.
Reference architecture, to handle data from creation/capture to curated reference data, and serve as a platform for research: the GrayWulf reference architecture.
Data Acquisition: field sensor deployments and operations; field campaigns measuring site properties.
Data Assembly: “raw” data includes sensor output, data downloaded from agency or collaboration web sites, and papers (especially for ancillary data).
Discovery and Browsing: “raw” data browsing for discovery (do I have enough data in the right places?), cleaning (does the data look obviously wrong?), and lightweight science via browsing.
Science Exploration: “science variables” and data summaries for hypothesis testing and early exploration. Like discovery and browsing, but variables are computed via gap filling, unit conversions, or simple equations.
Domain Specific Analyses: “science variables” combined with models, other specialized code, or statistics for deep science understanding.
Scientific Output: scientific results via packages such as MatLab or R; special rendering packages such as ArcGIS.
Archive: data and analysis methodology stored for data reuse, or for repeating an analysis.