streamflow - programming model for data streaming in scientific workflows

Streamflow - Programming Model for Data Streaming in Scientific Workflows Chathura Herath


Post on 23-Feb-2016


Page 1: Streamflow - Programming Model for Data Streaming in Scientific Workflows

Streamflow - Programming Model for Data Streaming in Scientific Workflows

Chathura Herath

Page 2: Streamflow - Programming Model for Data Streaming in Scientific Workflows

Outline

Background
Motivation
Approach
Architecture
Programming Model
Domain application

Page 3: Streamflow - Programming Model for Data Streaming in Scientific Workflows

Background

Scientific workflows are a good programming model for scientific computing.

Scientific domains have high volumes of data.

Most of the data come from sensors, catalogs and other experiments.

Most data sources are data streams or can be modeled as streams.

Page 4: Streamflow - Programming Model for Data Streaming in Scientific Workflows

Motivation

Huge data sources require preprocessing, mining and scaling down of data volumes.

Compute resources are limited relative to the scale of the data.

Currently, experts determine which data sets contain the interesting data.

Preserve the workflow programming model for the user. Users are familiar with DAG execution.

Define workflow patterns for use as new workflow semantics that can capture data streams.

Goal
◦ Real-time data mining, filtering and preprocessing
◦ Data-driven reactive workflow systems
◦ Feedback systems

Page 5: Streamflow - Programming Model for Data Streaming in Scientific Workflows

Data to Information

[Figure labels: Data Storage, Supercomputing, Information Rate, Data Rate]

Page 6: Streamflow - Programming Model for Data Streaming in Scientific Workflows

Data to Information

[Figure labels: Data Storage, Supercomputing, Information Rate, Data Rate, Scientific workflow, Stream Mining]

Page 7: Streamflow - Programming Model for Data Streaming in Scientific Workflows

Streamflow

[Figure labels: Data Storage, Supercomputing, Information Rate, Data Rate, Streamflow]

Page 8: Streamflow - Programming Model for Data Streaming in Scientific Workflows

Why Workflow Streaming?

Most scientific workflows are static.

A considerable segment of the data for scientific workflows is produced by scientific sensors.

Sensor data tend to behave as repeating data streams.

Is it possible to provide a programming abstraction to capture data search and filtration?

Page 9: Streamflow - Programming Model for Data Streaming in Scientific Workflows

Possible approaches

Completely decoupled systems, where the workflows and the data mining are separate.
◦ Data mining rules or queries would produce outputs which may get refined again and again.
◦ Some interesting event would launch the workflow.
◦ It may lose the insight and abstraction provided by the workflows.
◦ The data mining itself may have complex data and control dependencies.

Pure workflow approach
◦ Workflow languages are not designed for streaming.

Page 10: Streamflow - Programming Model for Data Streaming in Scientific Workflows

Stream Integration Approach

Complex Event Processing (CEP) system
◦ Interacts with the streams
◦ Filters and bundles data
◦ Publishes input datasets to workflows

Workflow system
◦ Handles the scientific computations
◦ Gets invoked when a dataset of a specified nature is published to the CEP system

[Figure labels: Resources, Streamflow Semantics, StreamBase Workflow, Streamflow Composer, Esper]
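The division of labor described above can be sketched as a small publish/subscribe loop. This is only an illustration in Python, not the Streamflow implementation: `CepEngine`, `subscribe`, `publish`, `forecast_workflow` and the `reflectivity` field are hypothetical names chosen for the example.

```python
from typing import Callable, Dict, List, Tuple

Event = Dict[str, float]
Predicate = Callable[[List[Event]], bool]
Workflow = Callable[[List[Event]], object]


class CepEngine:
    """Toy stand-in for a CEP system (hypothetical, not Esper or StreamBase).

    It receives bundled datasets and invokes every workflow whose
    predicate matches the published dataset.
    """

    def __init__(self) -> None:
        self._subscriptions: List[Tuple[Predicate, Workflow]] = []
        self.results: List[object] = []

    def subscribe(self, predicate: Predicate, workflow: Workflow) -> None:
        # Register a workflow to run when a matching dataset is published.
        self._subscriptions.append((predicate, workflow))

    def publish(self, dataset: List[Event]) -> None:
        # Deliver the dataset to each matching workflow and keep its output.
        for predicate, workflow in self._subscriptions:
            if predicate(dataset):
                self.results.append(workflow(dataset))


def forecast_workflow(dataset: List[Event]) -> str:
    # Placeholder for the scientific computation.
    return f"forecast over {len(dataset)} events"


engine = CepEngine()
engine.subscribe(
    lambda ds: any(e["reflectivity"] > 3.5 for e in ds),
    forecast_workflow,
)
engine.publish([{"reflectivity": 1.0}])  # no match, nothing runs
engine.publish([{"reflectivity": 4.2}, {"reflectivity": 2.0}])  # runs the workflow
```

After the two `publish` calls, `engine.results` holds a single entry, because only the second dataset satisfies the subscription predicate.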

Page 11: Streamflow - Programming Model for Data Streaming in Scientific Workflows

STREAMing workFLOWS - Streamflows

Streamflows are an enhancement of workflows to handle data streams.

They allow the complex experimental logic to be encapsulated using scientific workflows.

They allow the management of large streams of data with stream mining.

They provide a programming model similar to workflow composition to handle streams.

[Figure labels: Workflow, Streamflow]

Page 12: Streamflow - Programming Model for Data Streaming in Scientific Workflows

Stream Integration

select * from DataminedRUCDATA(reflectivity > 3.5).win:time_batch(1 hour)
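Read informally, the query keeps only events whose reflectivity exceeds 3.5 and releases them in one-hour batches. The Python sketch below approximates that behavior for illustration; it does not use the Esper API, and it simplifies the window rules (windows start at the first matching event, and empty windows are skipped). The `timestamp` field and the function name `time_batch` are assumptions made for the example.

```python
from typing import Dict, Iterable, Iterator, List, Optional

Event = Dict[str, float]  # assumed keys: "timestamp" (seconds), "reflectivity"


def time_batch(events: Iterable[Event], threshold: float = 3.5,
               window_seconds: float = 3600.0) -> Iterator[List[Event]]:
    """Yield lists of events whose reflectivity exceeds `threshold`,
    grouped into consecutive windows of `window_seconds`."""
    batch: List[Event] = []
    window_start: Optional[float] = None
    for event in events:  # events are assumed to arrive in timestamp order
        if event["reflectivity"] <= threshold:
            continue  # filter step: drop uninteresting events
        if window_start is None:
            window_start = event["timestamp"]
        # Close every window that ended before this event arrived.
        while event["timestamp"] >= window_start + window_seconds:
            if batch:
                yield batch
                batch = []
            window_start += window_seconds
        batch.append(event)
    if batch:
        yield batch  # flush the last, possibly partial, window


stream = [
    {"timestamp": 0, "reflectivity": 4.0},
    {"timestamp": 1800, "reflectivity": 2.0},
    {"timestamp": 3000, "reflectivity": 5.1},
    {"timestamp": 4000, "reflectivity": 3.9},
]
batches = list(time_batch(stream))
```

Here the first batch contains the events at 0 s and 3000 s, and the second contains the event at 4000 s; the event at 1800 s is filtered out.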

Page 13: Streamflow - Programming Model for Data Streaming in Scientific Workflows

Workflow Semantics

Conventional SOA components can be used as they are.

Workflow components may change behavior based on the input data or stream.

Filter nodes change the "cardinality" of the output stream.

Aggregator nodes aggregate data over a window.

Generator nodes interface external streams to the Streamflow.
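The three node kinds above can be illustrated with plain Python generators. This is a sketch under assumptions, not the Streamflow node API: the function names are hypothetical, and the aggregator uses a count-based window because the slide does not say how the window is defined.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def generator_node(source: Iterable[T]) -> Iterator[T]:
    """Bridge an external stream into the streamflow, one event at a time."""
    for event in source:
        yield event


def filter_node(stream: Iterable[T], keep: Callable[[T], bool]) -> Iterator[T]:
    """Pass through only matching events, so the output stream has
    at most as many events as the input stream."""
    return (event for event in stream if keep(event))


def aggregator_node(stream: Iterable[T], window: int) -> Iterator[List[T]]:
    """Collect events into count-based windows of `window` items."""
    bucket: List[T] = []
    for event in stream:
        bucket.append(event)
        if len(bucket) == window:
            yield bucket
            bucket = []
    if bucket:
        yield bucket  # emit the trailing partial window


readings = [1.2, 4.0, 3.7, 0.5, 6.1, 3.6, 9.9]
pipeline = aggregator_node(
    filter_node(generator_node(readings), lambda r: r > 3.5),
    window=2,
)
windows = list(pipeline)
```

The filter reduces seven readings to five, and the aggregator then groups them into windows of two, with one trailing partial window.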

Page 14: Streamflow - Programming Model for Data Streaming in Scientific Workflows

Programming model

Join semantics
◦ Constant inputs need to be matched to streams.

Inputs are streamed into the workflow from the stream engine.

Outputs are published back by stream sinks and may be used for feedback.
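One possible reading of the join semantics is that each stream event is paired with the constant inputs to form one workflow invocation, and the output goes to a sink. The sketch below illustrates that reading only; `run_streamflow`, `scale` and the list used as a sink are hypothetical and are not taken from the Streamflow system.

```python
from typing import Any, Callable, Dict, Iterable, List


def run_streamflow(
    stream: Iterable[Dict[str, Any]],
    constants: Dict[str, Any],
    workflow: Callable[..., Any],
    sink: List[Any],
) -> None:
    """Invoke `workflow` once per stream event, joining the event's fields
    with the constant inputs, and publish each output to `sink`."""
    for event in stream:
        inputs = {**constants, **event}  # stream values win on key clashes
        sink.append(workflow(**inputs))


def scale(reading: float, factor: float) -> float:
    # Hypothetical workflow body: one streamed input, one constant input.
    return reading * factor


published: List[float] = []
run_streamflow(
    stream=[{"reading": 1.0}, {"reading": 2.5}],
    constants={"factor": 2.0},
    workflow=scale,
    sink=published,
)
```

A feedback system could consume `published` as the source of another stream, which is why the sink is kept separate from the workflow function.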

Page 15: Streamflow - Programming Model for Data Streaming in Scientific Workflows

Evaluation

Deployment overhead
◦ The extra overhead is Θ(1) because the workflow is flat.
◦ The extra overhead is comparable to normal workflow deployment because new workflows may need to be deployed.

Runtime latency
◦ Latency from an event arriving at the framework to its delivery to the workflow.

Page 16: Streamflow - Programming Model for Data Streaming in Scientific Workflows

Evaluation

Page 17: Streamflow - Programming Model for Data Streaming in Scientific Workflows

Domains

Meteorology
Astronomy

[Figure labels: On-Demand Grid Computing, Streaming Observations, Storms Forming, Forecast Model, Data Mining, Astronomy, Meteorology]

Page 18: Streamflow - Programming Model for Data Streaming in Scientific Workflows

Related work

B. Biornstad. A workflow approach to stream processing. PhD thesis, Computer Science Department, ETH Zurich.

Y. Liu, N. Vijayakumar, and B. Plale. Stream processing in data-driven computational science. In Proceedings of the 7th IEEE/ACM International Conference on Grid Computing, pages 160–167. IEEE Computer Society, Washington, DC, USA, 2006.

J. Buck, S. Ha, E. Lee, and D. Messerschmitt. Ptolemy: A framework for simulating and prototyping heterogeneous systems. International Journal of Computer Simulation, 4(2):155–182, 1994.

DataTurbine

Y. Cai et al. MAIDS: Mining Alarming Incidents from Data Streams. Automated Learning Group, NCSA, University of Illinois at Urbana-Champaign, USA.

Page 19: Streamflow - Programming Model for Data Streaming in Scientific Workflows

Future work

Develop a formal model for the workflow semantics.

Event order guarantees.

How to handle missing streams.

Provenance for data streams.

Page 20: Streamflow - Programming Model for Data Streaming in Scientific Workflows

Questions?