TRANSCRIPT
Building the Trident Scientific Workflow Workbench for Data Management in the Cloud
Roger Barga (MSR), Yogesh Simmhan, Ed Lazowska, Alex Szalay, and Catharine van Ingen
Trident Project Objectives
Demonstrate that a commercial workflow management system can be used to implement scientific workflow.
Offer this system as an open source accelerator:
Write once, deploy and run anywhere;
Abstract parallelism (HPC and many core);
Automatic provenance capture, for both workflow and results;
Costing model for estimating resources required;
Integrated data storage and access, in particular cloud computing;
Reproducible research.
Develop this in the context of real eScience applications: make sure we solve a real problem for actual project(s).
And this is where things started to get interesting...
Research Questions
Role of workflow in data intensive eScience
Explore architectural patterns/best practices
Scalability, Fault Tolerance, Provenance
Reference architecture, to handle data from creation/capture to curated reference data, and serve as a platform for research
Scientific Workflow for Oceanography
Workflow and the Neptune Array
Workflow is a bridge between the underwater sensor array (instrument) and the end users.
Features
Allow human interaction with instruments;
Create ‘on demand’ visualizations of ocean processes;
Store data for long term time-series studies;
Deployed instruments will change regularly, as will the analysis;
Facilitate automated, routine “survey campaigns”;
Support automated event detection and reaction (sketched below);
User able to access through web (or custom client software);
Best effort for most workflows is acceptable.
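As a rough illustration of the "automated event detection and reaction" feature, the sketch below polls a simulated sensor reading and launches an on-demand visualization workflow when a threshold is crossed. Everything here is hypothetical: the threshold, read_sensor, and launch_workflow are stand-ins, not Trident or Neptune APIs.

```python
import random
import time

THRESHOLD = 2.5  # hypothetical anomaly threshold (made-up units)

def read_sensor(sensor_id):
    # Stand-in for the Neptune array feed; real readings would come off the wire.
    return random.gauss(1.0, 1.0)

def launch_workflow(name, **params):
    # Stand-in for submitting an on-demand workflow to the Trident runtime.
    print(f"launching {name} with {params}")
    return f"job-{random.randint(1000, 9999)}"

def monitor(sensor_id, polls=10, poll_seconds=0):
    for _ in range(polls):
        reading = read_sensor(sensor_id)
        if reading > THRESHOLD:
            # Event detected: react by launching an on-demand visualization workflow.
            launch_workflow("OceanEventVisualization",
                            sensor=sensor_id, trigger_value=reading)
        time.sleep(poll_seconds)

monitor("CTD-7")
```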
Pan-STARRS Sky Survey
One of the largest visible light telescopes.
4 unit telescopes acting as one; 1 Gigapixel per telescope.
Surveys the entire visible universe once per week.
Catalog solar system, moving objects/asteroids.
ps1sc.org: UHawaii, Johns Hopkins, …
Pan-STARRS Highlights
30 TB of processed data/year; ~1 PB of raw data.
5 billion objects; 100 million detections/week.
Updated every week.
SQL Server 2008 for storing detections.
Distributed over spatially partitioned databases (sketched below); replicated for fault tolerance.
Windows 2008 HPC Cluster schedules workflows and monitors the system.
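The slides only say that detections are spread over spatially partitioned, replicated databases; the actual partitioning and replica-placement scheme is not given. A purely hypothetical sketch of the idea, routing a detection by declination band to one of 16 slice databases and listing its replicas:

```python
# Hypothetical routing of a detection to a spatial slice and its replicas.
# The real Pan-STARRS partitioning scheme is not specified here; this only
# illustrates "spatially partitioned, replicated" using declination bands.
N_SLICES = 16
REPLICAS_PER_SLICE = 3  # assumption, consistent with "up to 2 replicas out of 3 can fail"

def slice_for(dec_degrees):
    """Map a declination in [-90, +90] to one of N_SLICES bands."""
    band = int((dec_degrees + 90.0) / 180.0 * N_SLICES)
    return min(band, N_SLICES - 1)

def replica_databases(dec_degrees):
    s = slice_for(dec_degrees)
    return [f"SliceDB_{s + 1:02d}_copy{r}" for r in range(REPLICAS_PER_SLICE)]

print(replica_databases(12.3))
# -> ['SliceDB_10_copy0', 'SliceDB_10_copy1', 'SliceDB_10_copy2']
```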
Pan-STARRS Data Flow
[Figure: Pan-STARRS data flow. CSV batches from the Image Processing Pipeline (IPP) land in a shared data store, pass through load-merge machines 1-6 into slice databases S1-S16 holding L1/L2 data, with hot and warm copies, and are served through the main databases and the distributed view.]
Pan-STARRS Workflows
Telescope → CSV files → Image Processing Pipeline (IPP) → CSV files.
Load Workflows bulk-load the CSV files into Load DBs.
Merge Workflows merge the Load DBs into Cold Slice DBs 1 and 2.
Flip Workflows publish the merged slices as Warm and Hot Slice DBs 1 and 2 (sketched below).
Distributed Views span the Hot Slice DBs and are queried through the CASJobs Query Service and per-user MyDB databases.
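A minimal sketch of the flip idea, assuming a catalog that records which physical database each production slice currently serves as hot and warm; after a successful merge and copy, the flip repoints both entries at the new version so the distributed view sees fresh data without downtime. The catalog structure and names are hypothetical.

```python
# Hypothetical flip: after a merge, publish the new slice version as both the
# hot and warm copies served by the distributed view.
catalog = {"Slice01": {"hot": "Slice01_v41", "warm": "Slice01_v41"}}

def flip(slice_name, new_version):
    """Repoint the hot and warm entries at the freshly merged/copied slice DB."""
    catalog[slice_name]["hot"] = new_version
    catalog[slice_name]["warm"] = new_version
    print(f"{slice_name}: distributed view now serves {new_version}")

flip("Slice01", "Slice01_v42")
```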
The Pan-STARRS Science Cloud
[Figure: the Pan-STARRS science cloud. Behind the cloud, Data Valet workflows run on the admin and load-merge machines; user-facing services run on the production machines for the data consumers (astronomers), who issue queries and workflows. Data flows in one direction, from the data creators toward the data consumers, except for error recovery: validation exception notifications and the slice fault recovery workflow.]
Pan-STARRS Architecture
Workflow is just a member of the orchestra
Workflow and Pan-STARRS
Workflow carries out the data loading and merging
Features
Support scheduling of workflows for nightly load and merge;
Offer only controlled (protected) access to the workflow system;
Workflows are tested, hardened and seldom change; not a unit of reuse or knowledge sharing;
Fault tolerance: ensure recovery and cleanup from faults;
Assign clean up workflows to undo state changes (see the sketch below);
Provenance as a record of state changes (system management);
Performance monitoring and logging for diagnostics;
Must “play well” in a distributed system;
Provide ground truth for the state of the system.
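One way to read "assign clean up workflows to undo state changes": each forward activity registers a compensating action, a fault triggers the cleanup workflow that replays those compensations in reverse, and every state change is logged as provenance. The sketch below is a hypothetical illustration of that pattern, not the project's actual activities.

```python
# Hypothetical compensation pattern: each forward activity registers an undo
# action; a fault triggers the cleanup workflow (reverse replay) and every
# state change is recorded as provenance.
provenance_log = []

def log(event):
    provenance_log.append(event)

def run_with_cleanup(steps):
    """steps: list of (name, do_fn, undo_fn). Undo completed steps if one fails."""
    done = []
    try:
        for name, do, undo in steps:
            do()
            log(f"{name}: done")
            done.append((name, undo))
    except Exception as exc:
        log(f"fault: {exc!r}; running cleanup workflow")
        for name, undo in reversed(done):
            undo()
            log(f"{name}: undone")
        raise

def failing_bulk_load():
    raise RuntimeError("bad checksum")   # simulated fault in the second activity

steps = [
    ("CreateLoadDB", lambda: None, lambda: None),
    ("BulkLoadCSV", failing_bulk_load, lambda: None),
]
try:
    run_with_cleanup(steps)
except RuntimeError:
    pass
print(provenance_log)
# ['CreateLoadDB: done', "fault: RuntimeError('bad checksum'); running cleanup workflow",
#  'CreateLoadDB: undone']
```

The same log is what "provenance as a record of state changes" points at: it is the ground truth consulted during recovery.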
Other Partner Applications
Data Creation, Ingest to End Use
Scientific data from sensors and instruments: time series, spatially distributed.
Need to be ingested before use.
Go from Level 1 to Level 2 data.
Potentially large, continuous stream of data.
A variety of end users (consumers) of this data.
Workflows shepherd raw bits from instruments to usable data in databases in the Cloud
Users in an eScience Ecosystem
[Figure: roles and interactions around shared data products. Producers upload new data; Data Valets reject/fix or accept it and publish new data products; publishers and curators handle data corrections; consumers query the published data products and download results.]
Generalized Architecture: GrayWulf
[Figure: shared compute resources and a shared queryable data store, with configuration management and health/performance monitoring. Valet workflows feed the Data Valet queryable data store; user workflows operate over the user queryable data store and user storage; operator, Data Valet, and user interfaces sit on top, with separate data-flow and control-flow paths.]
PanSTARRS Workflows

Workflow | Executes on | Repeats | Parallelism
Load Preamble | Admin machine | per Load DB | no limits
Load | LM machines | per Load DB | 1/2/4/8 per LM machine (TBD)
Merge Preamble | Admin machine | once only | one only
Merge | LM machines | per Cold Slice DB | one per LM machine
Cold Main Merge Preamble | Admin machine | once only | one only
Cold Main Merge | one of the LM machines | once only | one only
Copy Preamble | Admin machine | once only | one only
Copy | Slice machines and head machine (for the Cold Main) | per Cold Slice DB and Cold Main DB | one tuple of (slice machine, LM machine) active at a time; Main DB copy before any co-located Cold Slice DB on the LM machine
Flip Preamble | Admin machine | once only | one only
Flip | Admin machine | once per hot/warm | one only
CasJob Poller | Admin machine | once per hot/warm | one only
Garbage Collector | Admin machine (assumes we can delete remotely) | per hot/warm slice | one only
Garbage Collector | Admin machine (assumes we can delete remotely) | weekly |

Also defined: Cold/Hot Slice Initiation, Cold Main Initialization, Cold Slice Recovery, Cold Main Recovery, and Hot/Warm Slice Recovery workflows.
PS Load & Merge Workflows
Load workflow:
Start → sanity check of network files, manifest, checksum → validate CSV file and table schema → determine the affine Slice Cold DB for the CSV batch → create and register an empty Load DB from a template → for each CSV file in the batch: BULK LOAD the CSV file into a table, then perform CSV file/table validation → perform Load DB/batch validation → End.
On failure: detect the load fault, launch recovery operations, notify the admin.

Merge workflow:
Start → determine “merge worthy” Load DBs and Slice Cold DBs → for each partition in the Slice Cold DB: switch OUT the slice partition to temp; UNION ALL over the Slice and Load DBs into temp, filtered on the partition bound; post-partition load validation; switch IN temp to the slice partition → slice column recalculations and updates → post-slice load validation → End (the per-partition step is sketched below).
On failure: detect the merge fault, launch recovery operations, notify the admin.
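A greatly simplified, hypothetical rendering of the per-partition merge step, with ordinary Python lists standing in for the SQL Server SWITCH OUT / UNION ALL / SWITCH IN operations and an invented validation rule:

```python
# Hypothetical per-partition merge step: "switch out" the live partition,
# rebuild it from the slice and load rows, validate, and "switch in" the result.
class MergeFault(Exception):
    pass

def merge_partition(slice_rows, load_rows, bounds):
    """Rebuild one slice partition from existing rows plus newly loaded rows."""
    lo, hi = bounds
    # UNION ALL over slice and load rows, filtered on the partition bound.
    temp = [r for r in slice_rows + load_rows if lo <= r["objid"] < hi]
    # Post-partition load validation (invented rule: no duplicate object ids).
    if len({r["objid"] for r in temp}) != len(temp):
        raise MergeFault(f"duplicate objid in partition {bounds}")
    # "Switch in" the rebuilt partition.
    return sorted(temp, key=lambda r: r["objid"])

slice_rows = [{"objid": 10}, {"objid": 42}]
load_rows = [{"objid": 17}, {"objid": 99}]   # objid 99 lies outside this partition's bounds
print(merge_partition(slice_rows, load_rows, bounds=(0, 50)))
# [{'objid': 10}, {'objid': 17}, {'objid': 42}]
```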
System State Matters…
Monitor the state of the system: data-centric and process-centric views.
What is the load/merge state of each database in the system?
What are the active workflows in the system?
Drill down into actions performed: on a particular database to date, or by a particular workflow.
Provenance for the Data Valet
Provenance drivers for Pan-STARRS:
Need a way to monitor the state of the system (databases and workflows);
Need a way to recover from error states.
Database states are modeled as a state transition diagram; workflows cause transitions from one state to another (a sketch follows after the diagrams below).
Provenance forms an intelligent system log.
Load DB State Diagram / Slice DB State Diagram
[Figures: state transition diagrams for the Load DBs and Slice DBs.]
Load DB main path: Deployed → (load started) → LoadInProgress → (load succeeds) → Loaded (not validated) → (validation begins) → ValidationInProgress → (validation succeeds) → MergeWorthy → (pre-merge validation succeeds) → MergeInProgress → (merge completes) → Merged → ReaperWorthy.
Exception states and transitions: load fails; contents initialization fails; validation fails; pre-merge validation fails; Hold, released by human action; AnalysisHold (ex-LoadDB); Corrupt (from any of the above, when a query fails catastrophically due to database corruption); Missing (entry in the CSM database but not on the machine, when the monitor fails to locate the database due to confusion, a bug, or loss of the machine); Deleted.
Slice DB diagram additions: the flip of a Cold Slice DB to both the Hot and Warm Slice DBs completes after the merge completes; a corrupt or lost Cold Slice DB causes a merge or flip failure until the Cold Slice DB is repaired.
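A small sketch of the idea that database states form a state machine, workflows drive the transitions, and the provenance store doubles as the intelligent system log. Only the main Load DB path is encoded; the event names are paraphrased from the diagram's transition labels and everything else is hypothetical.

```python
# Hypothetical state machine for a Load DB: workflows request transitions, and
# every accepted transition is appended to a provenance log used for recovery.
TRANSITIONS = {
    ("Deployed", "load started"): "LoadInProgress",
    ("LoadInProgress", "load succeeds"): "Loaded (not validated)",
    ("Loaded (not validated)", "validation begins"): "ValidationInProgress",
    ("ValidationInProgress", "validation succeeds"): "MergeWorthy",
    ("MergeWorthy", "pre-merge validation succeeds"): "MergeInProgress",
    ("MergeInProgress", "merge completes"): "Merged",
}

provenance_log = []

def transition(db, state, event, workflow):
    new_state = TRANSITIONS.get((state, event))
    if new_state is None:
        raise ValueError(f"{workflow}: illegal transition from {state} on '{event}'")
    provenance_log.append((db, state, event, new_state, workflow))
    return new_state

state = "Deployed"
for event, wf in [("load started", "LoadWorkflow"),
                  ("load succeeds", "LoadWorkflow"),
                  ("validation begins", "LoadWorkflow"),
                  ("validation succeeds", "LoadWorkflow"),
                  ("pre-merge validation succeeds", "MergeWorkflow"),
                  ("merge completes", "MergeWorkflow")]:
    state = transition("LoadDB_0423", state, event, wf)
print(state)                 # Merged
print(len(provenance_log))   # 6 recorded state changes
```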
Fault Recovery
Faults are just another state. Pan-STARRS aims to support two degrees of failure: up to 2 replicas out of 3 can fail and still be recovered.
Provenance logs need to identify the type and location of a failure;
Verification of fault paths;
Attribution of failure to human error, infrastructure failure, or data error;
Global view of system state during a fault.
Provenance Data Model
Fine-grained workflow activities: each activity does one task, which eases failure recovery.
Capture inputs and outputs from each workflow/activity.
Relational/XML model for storing provenance: a generic model supports complex .NET types and identifies stateful data in parameters.
Build a relational view over the data states.
Domain-specific view encodes semantic knowledge in the view query.
Reliability of Provenance System
Fault recovery depends on provenance; missing provenance can leave the system unstable after a fault.
Provenance collection is synchronous.
Provenance events are published using reliable (durable) messaging, guaranteeing that each event will eventually be delivered (a minimal sketch follows).
Provenance is reliably persisted.
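To illustrate the durable-messaging guarantee, here is a minimal, hypothetical sketch that journals each provenance event to disk before the workflow proceeds and later drains the journal into the provenance store; the real system relied on durable messaging infrastructure rather than a hand-rolled journal like this.

```python
# Hypothetical durable publish: each provenance event is journaled and fsync'd
# before the workflow proceeds, then drained to the provenance store; events
# written before a crash are still delivered on the next drain.
import json
import os

JOURNAL = "provenance_journal.log"

def publish(event):
    with open(JOURNAL, "a") as f:
        f.write(json.dumps(event) + "\n")
        f.flush()
        os.fsync(f.fileno())   # durable before we return to the workflow

def drain(store):
    """Deliver journaled events to the provenance store, then clear the journal."""
    if not os.path.exists(JOURNAL):
        return
    with open(JOURNAL) as f:
        for line in f:
            store.append(json.loads(line))
    os.remove(JOURNAL)

publish({"db": "LoadDB_0423", "event": "load succeeds", "workflow": "LoadWorkflow"})
store = []
drain(store)
print(store)
```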
Trident Workbench
Trident Logical Architecture
[Figure: layered architecture.]
Workbench: scientific workflows, visualization, design of workflow packages.
Runtime: Windows Workflow Foundation and the Trident runtime services (service and activity registry, provenance, fault tolerance, WinHPC scheduling, monitoring service, runtime).
Tools: workflow monitor, administration console, workflow launcher.
Community: archiving, web portal.
Data access: Data Object Model (database agnostic abstraction, sketched below) over SQL Server, SSDS Cloud DB, S3, …
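The data access layer is described only as a database agnostic abstraction over SQL Server, SSDS cloud databases, S3, and similar stores. A hypothetical sketch of that shape in Python (the actual Data Object Model is a .NET API; the interface and names below are illustrative):

```python
# Hypothetical data object model: workflows program against one interface and
# the backing store (SQL Server, cloud DB, S3, ...) is chosen at deployment.
from abc import ABC, abstractmethod

class DataStore(ABC):
    @abstractmethod
    def put(self, key: str, blob: bytes) -> None: ...
    @abstractmethod
    def get(self, key: str) -> bytes: ...

class InMemoryStore(DataStore):
    """Stand-in provider; real providers would wrap SQL Server, SSDS, or S3."""
    def __init__(self):
        self._data = {}
    def put(self, key, blob):
        self._data[key] = blob
    def get(self, key):
        return self._data[key]

def save_workflow_output(store: DataStore, run_id: str, payload: bytes) -> None:
    # Activities only see DataStore, so local and cloud deployments share code.
    store.put(f"runs/{run_id}/output", payload)

store = InMemoryStore()
save_workflow_output(store, "2009-05-12-001", b"results")
print(store.get("runs/2009-05-12-001/output"))
```

The point of the abstraction is that activities are written against the interface once and the backing store is chosen at deployment time, which is what allows the same workflow to run against a local SQL Server or a cloud store.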
Research Questions
Role of workflow in data intensive eScience: the Data Valet.
Explore architectural patterns/best practices: scalability, fault tolerance and provenance implemented through workflow patterns.
Reference architecture, to handle data from creation/capture to curated reference data, and serve as a platform for research: the GrayWulf reference architecture.
Data Acquisition: field sensor deployments and operations; field campaigns measuring site properties.
Data Assembly: “raw” data includes sensor output, data downloaded from agency or collaboration web sites, and papers (especially for ancillary data).
Discovery and Browsing: “raw” data browsing for discovery (do I have enough data in the right places?), cleaning (does the data look obviously wrong?), and lightweight science via browsing.
Science Exploration: “science variables” and data summaries for hypothesis testing and early exploration. Like discovery and browsing, but variables are computed via gap filling, unit conversions, or simple equations.
Domain Specific Analyses: “science variables” combined with models, other specialized code, or statistics for deep science understanding.
Scientific Output: scientific results via packages such as MatLab or R; special rendering packages such as ArcGIS.
Archive: data and analysis methodology stored for data reuse, or for repeating an analysis.