
Post on 30-Aug-2018









Scientific Data & Workflow Engineering
Preliminary Notes from the Cyberinfrastructure Trenches

Bertram Ludäscher

Associate Professor
Dept. of Computer Science & Genome Center
University of California, Davis

Fellow
San Diego Supercomputer Center
University of California, San Diego

Scientific Data & WF Engineering, B. Ludäscher, Nov. 15th 2004


• Introduction: CI Sample Architectures
• Scientific Data Integration
• Scientific Workflow Management
• Links & Crystallization Points
• Summary


Science Environment for Ecological Knowledge (SEEK) Overview

• Domain Science Driver
  – Ecology (LTER), biodiversity, …
• Analysis & Modeling System
  – Design & execution of ecological models & analysis (“scientific workflows”)
  – {application,upper}-ware: Kepler system
• Semantic Mediation System
  – Data integration of hard-to-relate sources and processes
  – Semantic types and ontologies
  – upper middleware: Sparrow Toolkit
• EcoGrid
  – Access to ecology data and tools
  – {middle,under}-ware: unified API to SRB/MCAT, MetaCat, DiGIR, … datasets

Sample CS problem [DILS’04]

Common CI Infrastructure Pieces

• Other CI projects (e.g. GEON, …) have similar service-oriented architectures:
  – Seamless and uniform data access (“Data-Grid”)
    • data & metadata registry
  – Distributed and high-performance computing platform (“Compute-Grid”)
    • service registry
  – Federated, integrated, mediated databases
    • often use of semantic extensions (e.g. ontologies)
  – User-friendly workbench / problem-solving environment
    • scientific workflows
• Add to this sensors, observing systems, …


… Example: Realtime Environment for Analytical Processing (REAP vision)

The Great Unified System

• Many engineering and CS challenges!
  … we’ll see some …
• Our focus:
  – Scientific data integration
    • How to associate, mediate, integrate complex scientific data?
  – Scientific workflows
    • How to devise larger scientific workflows for process automation from individual components (e.g. web services)?
• Disclaimer:
  … often scratching the surface; see references & research literature for details …


• Introduction: CI Sample Architectures
• Scientific Data Integration
• Scientific Workflow Management
• Links & Crystallization Points
• Summary

An Online Shopper’s Information Integration Problem

El Cheapo: “Where can I get the cheapest copy (including shipping cost) of Wittgenstein’s Tractatus Logico-Philosophicus within a week?”

? Information Integration

“One-World” Mediation
Sources: A1books.com, half.com, barnes&noble.com

Mediator (virtual DB) (vs. data warehouse)
NOTE: non-trivial data engineering challenges!


A Home Buyer’s Information Integration Problem

What houses for sale under $500k have at least 2 bathrooms, 2 bedrooms, a nearby school ranking in the upper third, in a neighborhood with below-average crime rate and diverse population?

? Information Integration

Sources: Realtor, Demographics, School Rankings, Crime Stats


A Neuroscientist’s Information Integration Problem

What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents?

? Information Integration

Sources: protein localization (NCMIR), sequence info (CaPROT)

“Complex Multiple-Worlds”

Biomedical Informatics Research Network http://

Inter-source links:
• unclear for the non-scientists
• hard for the scientist


Interoperability & Integration Challenges

• System aspects: “Grid” Middleware
  – distributed data & computing, SOA
  – web services, WSDL/SOAP, WSRF, OGSA, …
  – sources = functions, files, data sets, …
• Syntax & Structure: (XML-Based) Data Mediators
  – wrapping, restructuring
  – (XML) queries and views
  – sources = (XML) databases
• Semantics: Model-Based/Semantic Mediators
  – conceptual models and declarative views
  – Knowledge Representation: ontologies, description logics (RDF(S), OWL, ...)
  – sources = knowledge bases (DB+CMs+ICs)
• Synthesis: Scientific Workflow Design & Execution
  – Composition of declarative and procedural components into larger workflows
  – (re)sources = services, processes, actors, …

Reconciling S5 heterogeneities: “gluing” together resources, bridging information and knowledge gaps computationally


Information Integration Challenges: S4 Heterogeneities

• System aspects
  – platforms, devices, data & service distribution, APIs, protocols, …
  Grid middleware technologies
    + e.g. single sign-on, platform independence, transparent use of remote resources, …
• Syntax & Structure
  – heterogeneous data formats (one for each tool ...)
  – heterogeneous data models (RDBs, ORDBs, OODBs, XMLDBs, flat files, …)
  – heterogeneous schemas (one for each DB ...)
  Database mediation technologies
    + XML-based data exchange, integrated views, transparent query rewriting, …
• Semantics
  – descriptive metadata, different terminologies, “hidden” semantics (context), implicit assumptions, …
  Knowledge representation & semantic mediation technologies
    + “smart” data discovery & integration
    + e.g. ask about X (‘mafic’); find data about Y (‘diorite’); be happy anyways!

Information Integration Challenges: S5 Heterogeneities

• Synthesis of applications, analysis tools, data & query components, … into “scientific workflows”
  – How to make use of these wonderful things & put them together to solve a scientist’s problem?

Scientific Problem Solving Environments (PSEs)

Portals, Workbench (“scientist’s view”)
  + ontology-enhanced data registration, discovery, manipulation
  + creation and registration of new data products from existing ones, …

Scientific Workflow System (“engineer’s view”)
  + for designing, re-engineering, deploying analysis pipelines and scientific workflows; a tool to make new tools …
  + e.g., creation of new datasets from existing ones, dataset registration, …


Information Integration from a Database Perspective

• Information Integration Problem
  – Given: data sources S1, ..., Sk (databases, web sites, ...) and user questions Q1, ..., Qn that can, in principle, be answered using the information in the Si
  – Find: the answers to Q1, ..., Qn
• The Database Perspective: source = “database”
  ⇒ Si has a schema (relational, XML, OO, ...)
  ⇒ Si can be queried
  ⇒ define virtual (or materialized) integrated (or global) view G over local sources S1, ..., Sk using database query languages (SQL, XQuery, ...)
  ⇒ questions become queries Qi against G(S1, ..., Sk)
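The database perspective can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the two sources `S1` and `S2`, their schemas, the `Offer` record, and the prices are invented, and a real mediator would define G in SQL or XQuery rather than in Python.

```python
from typing import Callable, Iterable, List, NamedTuple

class Offer(NamedTuple):
    """One tuple of the integrated (global) view G."""
    seller: str
    title: str
    price: float
    shipping: float

# Hypothetical local sources S1 and S2, each with its own schema.
S1 = [{"title": "Tractatus", "usd": 9.50, "ship_usd": 3.00}]
S2 = [("Tractatus", 8.00, 5.50), ("Investigations", 12.00, 4.00)]

def G() -> Iterable[Offer]:
    """Virtual integrated view G(S1, S2): computed on demand, not stored."""
    for row in S1:
        yield Offer("S1", row["title"], row["usd"], row["ship_usd"])
    for title, price, shipping in S2:
        yield Offer("S2", title, price, shipping)

def query(predicate: Callable[[Offer], bool]) -> List[Offer]:
    """A user question becomes a query against G, not against each Si."""
    return [offer for offer in G() if predicate(offer)]

# "Cheapest copy including shipping" asked once, against the global view.
cheapest = min(query(lambda o: o.title == "Tractatus"),
               key=lambda o: o.price + o.shipping)
print(cheapest)
```

Because `G` is a generator, the view is virtual; wrapping it in `list(G())` once would correspond to the materialized (warehouse) alternative.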

Standard (XML-Based) Mediator Architecture

[Figure: the mediator exposes an integrated global (XML) view G, given by an integrated view definition G(..) ← S1(..) … Sk(..); each source exports an (XML) view through a wrapper (web services as wrapper APIs).]

1. Query Q(G(S1, ..., Sk))
2. Query rewriting
3. Q1, Q2, Q3
4. {answers(Q1)}, {answers(Q2)}, {answers(Q3)}
5. Post-processing
6. {answers(Q)}
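The six numbered steps can be traced in a small Python sketch. This is only an illustration under stated assumptions: the wrappers, their data, and the keyword-style queries are hypothetical, and the "rewriting" is deliberately trivial (every source receives the same sub-query).

```python
from typing import Callable, Dict, List

# Hypothetical wrapped sources: each wrapper exposes its source as a function
# from a sub-query (here just a keyword) to a list of answer records.
def wrapper_s1(keyword: str) -> List[dict]:
    data = [{"name": "diorite", "state": "MT"}, {"name": "basalt", "state": "NM"}]
    return [r for r in data if keyword in r["name"]]

def wrapper_s2(keyword: str) -> List[dict]:
    data = [{"name": "diorite", "state": "NM"}]
    return [r for r in data if keyword in r["name"]]

WRAPPERS: Dict[str, Callable[[str], List[dict]]] = {"S1": wrapper_s1, "S2": wrapper_s2}

def rewrite(q: str) -> Dict[str, str]:
    """Step 2: rewrite the global query Q into one sub-query Qi per source.
    Here the rewriting is trivial: every source gets the same keyword."""
    return {source: q for source in WRAPPERS}

def post_process(partial: Dict[str, List[dict]]) -> List[dict]:
    """Step 5: merge the partial answers, tag their provenance, and sort."""
    merged = [dict(r, source=s) for s, rows in partial.items() for r in rows]
    return sorted(merged, key=lambda r: (r["name"], r["state"]))

def mediate(q: str) -> List[dict]:
    subqueries = rewrite(q)                                         # steps 1-2
    partial = {s: WRAPPERS[s](qi) for s, qi in subqueries.items()}  # steps 3-4
    return post_process(partial)                                    # steps 5-6

answers = mediate("diorite")
print(answers)
```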


Query Planning in Data Integration

• Given:
  – Declarative user query Q: answer(…) ← … G ...
  – … & { G ← … S … } global-as-view (GAV)
  – … & { S ← … G … } local-as-view (LAV)
  – … & { ic(…) ← … S … G … } integrity constraints (ICs)
• Find:
  – equivalent (or minimal containing, maximal contained) query plan Q’: answer(…) ← … S …
  query rewriting (logical/calculus, algebraic, physical levels)
• Results:
  – A variety of results/algorithms; depending on classes of queries, views, and ICs: P, NP, …, undecidable
  – hot research area in core CS (database community)
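As a small worked illustration of the two view styles, assume a hypothetical global relation Offer(title, price) and hypothetical sources S1, S2, S3; none of these names come from the slides.

```latex
% GAV: each global relation is defined as a view over the sources.
\mathit{Offer}(t, p) \leftarrow S_1(t, p) \\
\mathit{Offer}(t, p) \leftarrow S_2(t, p, s)

% LAV: each source is described as a view over the global schema.
S_3(t) \leftarrow \mathit{Offer}(t, p),\ p < 10

% A user query against the global schema.
\mathit{answer}(t) \leftarrow \mathit{Offer}(t, p),\ p < 5

% GAV rewriting by unfolding the view definitions into the query.
\mathit{answer}(t) \leftarrow S_1(t, p),\ p < 5 \\
\mathit{answer}(t) \leftarrow S_2(t, p, s),\ p < 5
```

Under GAV the plan is obtained by unfolding. Under LAV, S3 alone only guarantees p < 10, so it cannot contribute to a contained rewriting of this query; finding such rewritings in general is the harder "answering queries using views" problem.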

Scientific Data Integration using Semantic Extensions

Example: Geologic Map Integration

• Given:
  – Geologic maps from different state geological surveys (shapefiles w/ different data schemas)
  – Different ontologies:
    • Geologic age ontology (e.g. USGS)
    • Rock classification ontologies:
      – Multiple hierarchies (chemical, fabric, texture, genesis) from Geological Survey of Canada (GSC)
      – Single hierarchy from British Geological Survey (BGS)
• Problem:
  – Support uniform queries across all maps
  – … using different ontologies
  – Support registration w/ ontology A, querying w/ ontology B


Schema Integration (“registering” local schemas to the global schema)

[Figure: local map schemas (New Mexico, Montana E., Montana West) registered to the global schema; example values shown: “andesitic sandstone”, “Livingston formation”.]

Multihierarchical Rock Classification “Ontology” (Taxonomies) for “Thematic Queries” (GSC)






Ontology-Enabled Application Example: Geologic Map Integration

Show formations where AGE = ‘Paleozoic’ (without age ontology)

Show formations where AGE = ‘Paleozoic’ (with age ontology)

+/- a few hundred million years
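The difference between querying with and without the age ontology can be sketched in Python. The hierarchy below is an assumed fragment of a geologic time scale, and the formation records `F1` to `F4` are invented; the actual system's ontology and data are not shown in the slides.

```python
from typing import Dict, List, Set

# Assumed fragment of a geologic age hierarchy (age -> subdivisions).
SUBDIVISIONS: Dict[str, List[str]] = {
    "Paleozoic": ["Cambrian", "Ordovician", "Silurian",
                  "Devonian", "Carboniferous", "Permian"],
    "Carboniferous": ["Mississippian", "Pennsylvanian"],
}

def expand(age: str) -> Set[str]:
    """Return the age term together with all of its sub-ages."""
    terms = {age}
    for child in SUBDIVISIONS.get(age, []):
        terms |= expand(child)
    return terms

# Hypothetical map records; AGE labels differ between state surveys.
FORMATIONS = [
    {"name": "F1", "age": "Paleozoic"},
    {"name": "F2", "age": "Devonian"},
    {"name": "F3", "age": "Pennsylvanian"},
    {"name": "F4", "age": "Cretaceous"},
]

def formations_where_age(age: str, use_ontology: bool) -> List[str]:
    """Exact string match without the ontology, subsumption match with it."""
    accepted = expand(age) if use_ontology else {age}
    return [f["name"] for f in FORMATIONS if f["age"] in accepted]

without_ontology = formations_where_age("Paleozoic", use_ontology=False)
with_ontology = formations_where_age("Paleozoic", use_ontology=True)
print(without_ontology, with_ontology)
```

Without the ontology only literal ‘Paleozoic’ labels match; with it, records labeled with any subdivision of the Paleozoic are found as well.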




Querying by Geologic Age …

Querying by Geologic Age: Results

Querying by Chemical Composition … (GSC)


Semantic Mediation (via “semantic registration” of schemas and ontology articulations)

• Schema elements and/or data values are associated with concept expressions from the target ontology
  – conceptual queries “through” the ontology
• Articulation ontology
  – source registration to A, querying through B
• Semantic mediation: query rewriting w/ ontologies

[Figure: Ontology A and Ontology B linked by ontology articulations]
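A minimal Python sketch of "register to A, query through B" follows. All concept names, the articulation, and the source values are hypothetical, and articulations are simplified to a plain lookup table rather than OWL axioms.

```python
from typing import Dict, List, Set

# Hypothetical ontology A (used when registering sources): child -> parent.
A_PARENT: Dict[str, str] = {"A:diorite": "A:plutonic", "A:granite": "A:plutonic"}

# Hypothetical articulation: concept of ontology B -> corresponding concepts of A.
ARTICULATION: Dict[str, List[str]] = {"B:intrusive_rock": ["A:plutonic"]}

# Semantic registration: local data values of each source mapped to concepts of A.
REGISTRATION: Dict[str, Dict[str, str]] = {
    "map_MT": {"dio": "A:diorite", "ss": "A:sandstone"},
    "map_NM": {"Granite": "A:granite"},
}

def descendants_or_self(concept: str) -> Set[str]:
    """All concepts of A at or below the given concept."""
    result = {concept}
    for child, parent in A_PARENT.items():
        if parent == concept:
            result |= descendants_or_self(child)
    return result

def rewrite(b_concept: str) -> Dict[str, Set[str]]:
    """Rewrite a conceptual query over B into per-source sets of local values."""
    a_concepts: Set[str] = set()
    for a in ARTICULATION.get(b_concept, []):
        a_concepts |= descendants_or_self(a)
    return {
        source: {value for value, concept in mapping.items() if concept in a_concepts}
        for source, mapping in REGISTRATION.items()
    }

plan = rewrite("B:intrusive_rock")
print(plan)
```

The resulting plan lists, per source, the local values to select on, so a query phrased in B's vocabulary reaches data that was only ever registered against A.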




Different views on State Geological Maps

Sedimentary Rocks: BGS Ontology

Sedimentary Rocks: GSC Ontology

Implementation in OWL: Not only “for the machine” …

Source Contextualization through Ontology Refinement

In addition to registering (“hanging off”) data relative to existing concepts, a source may also refine the mediator’s domain map ...
⇒ sources can register new concepts at the mediator ...


• Introduction: CI Sample Architectures
• Scientific Data Integration
• Scientific Workflow Management
• Links & Crystallization Points
• Summary

What is a Scientific Workflow (SWF)?

• Goals:
  – automate a scientist’s repetitive data management and analysis tasks
  – typical phases:
    • data access, scheduling, generation, transformation, aggregation, analysis, visualization
  design, test, share, deploy, execute, reuse, … SWFs


Promoter Identification Workflow

Source: Matt Coleman (LLNL)

Source: NIH BIRN (Jeffrey Grethe, UCSD)


Ecology: GARP Analysis Pipeline for Invasive Species Prediction

[Figure: GARP analysis pipeline. Labeled data products: species presence & absence points (native range); species presence & absence points (invasion area) (a); environmental layers (native range) (b); environmental layers (invasion area) (b); integrated layers (native range) (c); integrated layers (invasion area) (c); training sample; test sample (d); GARP rule set; native range prediction map (f); invasion area prediction map (f); model quality parameter (g); selected prediction maps (h); archive to EcoGrid.]

Source: NSF SEEK (Deana Pennington et al., UNM)

Commercial & Open Source Scientific “Workflow” (well, Dataflow) Systems

Kensington Discovery Edition from InforSense



SCIRun: Problem Solving Environments for Large-Scale Scientific Computing

• SCIRun: PSE for interactive construction, debugging, and steering of large-scale scientific computations
• Component model, based on generalized dataflow programming

Steve Parker


Ptolemy II

Source: Edward Lee et al.

Why Ptolemy II (and thus KEPLER)?

• Ptolemy II Objective:
  – “The focus is on assembly of concurrent components. The key underlying principle in the project is the use of well-defined models of computation that govern the interaction between components. A major problem area being addressed is the use of heterogeneous mixtures of models of computation.”
• Dataflow Process Networks w/ natural support for abstraction, pipelining (streaming), actor-orientation, actor reuse
• User-Orientation
  – Workflow design & exec console (Vergil GUI)
  – “Application/Glue-Ware”
    • excellent modeling and design support
    • run-time support, monitoring, …
    • not a middle-/underware (we use someone else’s, e.g. Globus, SRB, …)
    • but middle-/underware is conveniently accessible through actors!
• PRAGMATICS
  – Ptolemy II is mature, continuously extended & improved, well-documented (500+pp)
  – open source system
  – Ptolemy II folks actively participate in KEPLER


KEPLER/CSP: Contributors, Sponsors, Projects
(or loosely coupled Communicating Sequential Persons ;-)

Ilkay Altintas: SDM, Resurgence
Kim Baldridge: Resurgence, NMI
Chad Berkley: SEEK
Shawn Bowers: SEEK
Terence Critchlow: SDM
Tobin Fricke: ROADNet
Jeffrey Grethe: BIRN
Christopher H. Brooks: Ptolemy II
Zhengang Cheng: SDM
Dan Higgins: SEEK
Efrat Jaeger: GEON
Matt Jones: SEEK
Werner Krebs: EOL
Edward A. Lee: Ptolemy II
Kai Lin: GEON
Bertram Ludaescher: SEEK, GEON, SDM, ROADNet, BIRN
Mark Miller: EOL
Steve Mock: NMI
Steve Neuendorffer: Ptolemy II
Jing Tao: SEEK
Mladen Vouk: SDM
Xiaowen Xin: SDM
Yang Zhao: Ptolemy II
Bing Zhu: SEEK
…

KEPLER: An Open Collaboration

• Initiated by members from NSF SEEK and DOE SDM/SPA; now several other projects (GEON, Ptolemy II, EOL, Resurgence/NMI, …)
• Open Source (BSD-style license)
• Intensive Communications:
  – Web-archived mailing lists
  – IRC (!)
• Co-development:
  – via shared CVS repository
  – joining as a new co-developer (currently):
    • get a CVS account (read-only)
    • local development + contribution via existing KEPLER member
    • be voted “in” as a member/co-developer
• Software & social engineering
  – How to better accommodate new groups/communities?
  – How to better accommodate different usage/contribution models (core dev … special purpose extender … user)?


Ptolemy II/KEPLER GUI (Vergil)

• “Directors” define the component interaction & execution semantics
• Large, polymorphic component (“Actors”) and Directors libraries (drag & drop)

KEPLER/Ptolemy II GUI refined

• Ontology-based actor (service) and dataset search
• Result Display


Web Services Actors (WS Harvester)

• “Minute-made” (MM) WS-based application integration
• Similarly: MM workflow design & sharing w/o implemented components

Some Recent Actor Additions


An “early” example: Promoter Identification

SSDBM, AD 2003

• Scientist models application as a “workflow” of connected components (“actors”)
• If all components exist, the workflow can be automated/executed
• Different directors can be used to pick appropriate execution model (often “pipelined” execution: PN director)
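The separation between actors (what is computed) and directors (how execution proceeds) can be sketched in plain Python. This does not use the Ptolemy II or Kepler API: the `Actor` type, the two toy actors, and both director functions are assumptions, and the "pipelined" director is a single-threaded generator pipeline rather than a true process network with concurrent processes.

```python
from typing import Callable, Iterable, Iterator, List

Actor = Callable[[str], str]   # an actor consumes one token and produces one

def strip(token: str) -> str:
    return token.strip()

def upper(token: str) -> str:
    return token.upper()

WORKFLOW: List[Actor] = [strip, upper]   # a linear workflow of connected actors

def sequential_director(workflow: List[Actor], tokens: Iterable[str]) -> List[str]:
    """Each actor processes the whole input before the next actor starts."""
    data = list(tokens)
    for actor in workflow:
        data = [actor(t) for t in data]
    return data

def pipelined_director(workflow: List[Actor], tokens: Iterable[str]) -> Iterator[str]:
    """Tokens stream through the chain of actors one at a time."""
    stream: Iterator[str] = iter(tokens)
    for actor in workflow:
        stream = map(actor, stream)   # lazy: nothing runs until consumed
    return stream

tokens = [" acgt ", " ttga"]
batch = sequential_director(WORKFLOW, tokens)
streamed = list(pipelined_director(WORKFLOW, tokens))
print(batch, streamed)
```

The same workflow definition yields the same results under both directors; only the execution model changes, which is the point of keeping the two concerns apart.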

Reengineering a Geoscientist’s Mineral Classification Workflow


Job Management (here: NIMROD)

• Job management infrastructure in place
• Results database: under development
• Goal: 1000’s of GAMESS jobs (quantum mechanics) – Fall/Winter ’04

Rapid Web Service-based Prototyping (Here: ROADNet Command & Control Services for LOOKING Kick-Off Mtg)

Source: Ilkay Altintas, SDM, NLADR
ROADNet: Vernon, Orcutt et al
Web services: Tony Fountain et al

in KEPLER (w/ editable script)

Source: Dan Higgins, Kepler/SEEK


in KEPLER (interactive session)

Source: Dan Higgins, Kepler/SEEK

Blurring Design (ToDo) and Execution


Scientific Workflow Challenges

• Typical Features
  – data-intensive and/or compute-intensive
  – plumbing-intensive (consecutive web services won’t fit)
  – dataflow-oriented
  – distributed (remote data, remote processing)
  – user-interaction “in the middle”, …
  – … vs. (C-z; bg; fg)-ing (“detach” and reconnect)
  – advanced programming constructs (map(f), zip, takewhile, …)
  – logging, provenance, “registering back” (intermediate)


[Figure annotations on the original workflow design:]
• hand-crafted control solution; also: forces sequential execution!
• designed to fit
• hand-crafted Web-service actor
• Complex backward control-flow
• No data transformations available



A Scientific Workflow Problem: Solved (Computer Scientist’s view)

• Solution based on declarative, functional dataflow process network (= also a data streaming …
• Higher-order constructs: map(f)
  ⇒ no control-flow spaghetti
  ⇒ data-intensive apps
  ⇒ free concurrent execution
  ⇒ free type checking
  ⇒ automatic support to go from piw(GeneId) to PIW := map(piw) over [GeneId]

[Figure annotations:]
• Powerful type checking
• Generic, declarative “programming”
• Generic data transformation actors
• Forward-only, abstractable sub-workflow piw(GeneId)

Promoter Identification Workflow Redesigned

map(GenbankWS)
Input: {“NM_001924”, “NM020375”}
Output: {“CAGT…AATATGAC”, “GGGGA…CAAAGA”}
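How map lifts a one-item service to a collection can be sketched as follows. The lookup table stands in for the GenBank web service, whose real interface is not shown in the slides; the IDs and sequences are made up, and `lift` is a hypothetical helper name.

```python
from typing import Callable, Dict, List

# Stand-in for the GenBank web service: the real service and its signature
# are not given in the slides, so this lookup table is invented.
FAKE_GENBANK: Dict[str, str] = {"ID_A": "ACGT", "ID_B": "GGCA"}

def genbank_ws(gene_id: str) -> str:
    """Hypothetical single-item service: one accession ID in, one sequence out."""
    return FAKE_GENBANK[gene_id]

def lift(service: Callable[[str], str]) -> Callable[[List[str]], List[str]]:
    """map(f): turn a one-item service into one that works on lists."""
    return lambda items: [service(item) for item in items]

map_genbank_ws = lift(genbank_ws)
sequences = map_genbank_ws(["ID_A", "ID_B"])
print(sequences)
```

The workflow designer writes the single-gene step once; iteration over the input collection comes from `lift` instead of hand-built control-flow actors.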


A Research Problem: Optimization by Rewriting

• Example: PIW as a declarative, referentially transparent functional process
  ⇒ optimization via functional rewriting possible
  e.g. map(f o g) = map(f) o map(g)
• Technical report & PIW specification in Haskell

[Figure annotations:]
• map(f o g) instead of map(f) o map(g)
• Combination of map and zip
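The rewriting law map(f o g) = map(f) o map(g) can be checked on a toy example in Python. The two functions below are illustrative stand-ins and not steps of the actual PIW; the law holds because both are pure (referentially transparent).

```python
from typing import Callable, List

def compose(f: Callable[[str], str], g: Callable[[str], str]) -> Callable[[str], str]:
    """Function composition: (f o g)(x) = f(g(x))."""
    return lambda x: f(g(x))

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def complement(seq: str) -> str:
    """Base-wise complement of a DNA string."""
    return seq.translate(_COMPLEMENT)

def reverse(seq: str) -> str:
    return seq[::-1]

seqs: List[str] = ["ACGT", "GGCA"]

# map(f) o map(g): two stages with an intermediate list between them.
intermediate = [reverse(s) for s in seqs]
two_stage = [complement(s) for s in intermediate]

# map(f o g): one fused stage, no intermediate collection.
fused = [compose(complement, reverse)(s) for s in seqs]

print(two_stage, fused)
```

In a distributed workflow the fused form avoids shipping the intermediate collection between two actors, which is the kind of optimization a rewriting engine could apply automatically.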

A KR+DI+Scientific Workflow Problem

• Services can be semantically compatible, but structurally incompatible

[Figure: a source port Ps and a target port Pt with a desired connection between them. Labels: SemanticType Ps, SemanticType Pt, StructuralType Ps, StructuralType Pt, Desired Connection, δ(Ps), (≺), Ontologies (OWL).]

Source: [Bowers-Ludaescher, DILS’04]


Ontology-Informed Data Transformation (“Structure-Shim”)

[Figure: a source port Ps and a target port Pt with a desired connection between them. Labels: SemanticType Ps, SemanticType Pt, Compatible (⊑), StructuralType Ps, StructuralType Pt, Registration Mapping (Output), Registration Mapping (Input), Generate δ(Ps), Ontologies (OWL).]

Source: [Bowers-Ludaescher, DILS’04]
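A much-simplified sketch of generating a structure shim δ from registration mappings is given below. It is not the algorithm of [Bowers-Ludaescher, DILS’04]: field and concept names are hypothetical, records are flat dictionaries, and concepts must match exactly, whereas the slides call for subsumption (⊑) over OWL ontologies.

```python
from typing import Any, Callable, Dict

Record = Dict[str, Any]

# Registration mappings: structural field -> ontology concept (names hypothetical).
OUTPUT_REGISTRATION: Dict[str, str] = {"sp": "Species", "lat": "Latitude", "lon": "Longitude"}
INPUT_REGISTRATION: Dict[str, str] = {"taxon": "Species", "y": "Latitude", "x": "Longitude"}

def generate_shim(out_reg: Dict[str, str],
                  in_reg: Dict[str, str]) -> Callable[[Record], Record]:
    """Generate delta: map source fields to target fields sharing a concept."""
    concept_to_source = {concept: field for field, concept in out_reg.items()}
    missing = [c for c in in_reg.values() if c not in concept_to_source]
    if missing:
        raise ValueError(f"no source field registered for concepts: {missing}")
    field_map = {target: concept_to_source[concept]
                 for target, concept in in_reg.items()}
    return lambda record: {target: record[source]
                           for target, source in field_map.items()}

delta = generate_shim(OUTPUT_REGISTRATION, INPUT_REGISTRATION)
converted = delta({"sp": "Lynx rufus", "lat": 38.5, "lon": -121.7})
print(converted)
```

The two ports never agree on field names; the shared ontology concepts are what allow the connection to be made without a hand-written adapter.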

• Introduction: CI Sample Architectures
• Scientific Data Integration
• Scientific Workflow Management
• Links & Crystallization Points
• Summary


Link-Up & Crystallization Points

• Shared (Domain) Science Vision, Goals
  – NVO, SCEC, Human Genome Project, …
• Technology Waves
  – XML, web services, WSRF, Semantic Web (OWL), Portlets, …
• Standards for data exchange, metadata, data access protocols, …
  – GML, EML, netCDF, HDF, …, ADN, …, DODS/OpenDAP, …
  – Organizations: W3C, GGF, …
• Community ontologies
  – GO (Gene Ontology), ecoinformatics, seismology, geochemistry, …
  – … from Saulus to Paulus …
• Shared Community Tools and Tool Co-Development
  – SRB, Globus, …, Kepler, …


Shared Science Vision, Goals: SCEC/CME

Simulation of Seismic Wave Propagation of a Magnitude 7.7 Earthquake on San Andreas Fault

– PIs: Thomas Jordan, Bernard Minster, Reagan Moore, Carl Kesselman
– Simulation
  • 240 processors for 5 days
  • 47 terabytes of data generated
– SDSC SAC project optimized code on DataStar parallel computer (both MPI I/O management and checkpointing)
– Future simulation: increasing resolution by a factor of 2 implies 1 PB of simulation results, 1000 processors for 20 days

Southern California Earthquake Center / Community Modeling Environment Project

Source: Reagan Moore, SDSC
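
As a rough sanity check on the figures quoted for this slide, the sketch below compares the current and future runs in processor-days and data volume. The numbers are taken from the slide; the interpretation is an assumption that the slide does not state: doubling the resolution in three spatial dimensions and halving the time step would multiply the work by about 2^4 = 16. 1 PB is taken as 1000 TB.

```python
# Figures quoted on the slide.
current_procs, current_days, current_tb = 240, 5, 47
future_procs, future_days = 1000, 20
future_tb = 1000  # 1 PB, taken here as 1000 TB

current_proc_days = current_procs * current_days
future_proc_days = future_procs * future_days

compute_ratio = future_proc_days / current_proc_days
data_ratio = future_tb / current_tb

# Assumption (not from the slide): 2x resolution in 3 space dimensions
# plus a 2x finer time step gives roughly 2**4 times the work.
assumed_scaling = 2 ** 4

print(f"processor-days: {current_proc_days} -> {future_proc_days}")
print(f"compute ratio:  {compute_ratio:.2f}")
print(f"data ratio:     {data_ratio:.2f}")
print(f"assumed 2^4 scaling: {assumed_scaling}")
```

Under that assumption, the quoted jump in processor-days (about 17x) is close to the 16x estimate, while the quoted data volume grows somewhat faster (about 21x).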



Example: NVO Community Processes

- created standard data encoding format (FITS image format)
- made accessible common digital holdings (sky survey images)
- defined Uniform Content Descriptors (common metadata attributes)
- created standard services (standard access mechanisms to catalogs and surveys)
- created digital library (manage derived data products)
- created portals (for combining services interactively)
- created processing pipelines (for automated processing)
- created preservation environment

• And… found a new brown dwarf!

Source: Reagan Moore, SDSC


Semantic Mediation “Waterfall” …

[Figure: waterfall diagram; surviving labels: Semantic Data, Service Annotation; Resource Discovery]

Source: Shawn Bowers, SEEK AHM’04



GEON Dataset Generation & Registration (a co-development in KEPLER)

[Figure: workflow screenshot; surviving labels: Xiaowen (SDM); Edward et al. (Ptolemy); Yang (Ptolemy); SQL database access (JDBC); Matt, Chad, Dan et al. (SEEK)]

% Makefile
$> ant run


KEPLER as a Melting Pot …

• A grass-roots project
– Needed a coalition of the (really!) willing
• Intra-project links
– e.g. in SEEK: AMS SMS EcoGrid
• Inter-project links
– SEEK ITR, GEON ITR, ROADNet ITRs, DOE SciDAC SDM, Ptolemy II, NIH BIRN (coming we hope …), UK eScience myGrid, …
• Inter-technology links
– Globus, SRB, JDBC, web services, soaplab services, command line tools, R, GRASS, XSLT, …
• Interdisciplinary links
– CS, IT, domain sciences, … (recently: usability engineer)




• Introduction: CI Sample Architectures
• Scientific Data Integration
• Scientific Workflow Management
• Links & Crystallization Points
• Summary


Summary/Lessons Learned

• Eat your own dog-food (or at least try…)
– start using your own (CI) tools early!
• Collaboration tools
– CVS repositories (+cvsview, webcvs)
– Mailing lists (e.g. mailman googlified)
– Bugzilla (detailed tracking of tech. issues & bugs)
– WIKI (community authored web resource, e.g. high-level tech. issues)
• Where is the XYZ repository/registry?
– EcoGrid (SEEK) registry, GEON registry, KEPLER actor & datasets repository, …
– UDDI what?
• CI “Melting Pots”:
– SDSC, NCEAS, LTER, NLADR (w/ NCSA), KU Specify, …
– Genome Center@UC Davis (moving in …)



Q & A


Further Reading

under review – available upon request from



Related Publications

• Semantic Data Registration and Integration
  • On Integrating Scientific Resources through Semantic Registration, S. Bowers, K. Lin, and B. Ludäscher, 16th International Conference on Scientific and Statistical Database Management (SSDBM'04), 21-23 June 2004, Santorini Island, Greece.
  • A System for Semantic Integration of Geologic Maps via Ontologies, K. Lin and B. Ludäscher. In Semantic Web Technologies for Searching and Retrieving Scientific Data (SCISW), Sanibel Island, Florida, 2003.
  • Towards a Generic Framework for Semantic Registration of Scientific Data, S. Bowers and B. Ludäscher. In Semantic Web Technologies for Searching and Retrieving Scientific Data (SCISW), Sanibel Island, Florida, 2003.
  • The Role of XML in Mediated Data Integration Systems with Examples from Geological (Map) Data Interoperability, B. Brodaric, B. Ludäscher, and K. Lin. In Geological Society of America (GSA) Annual Meeting, volume 35(6), November 2003.
  • Semantic Mediation Services in Geologic Data Integration: A Case Study from the GEON Grid, K. Lin, B. Ludäscher, B. Brodaric, D. Seber, C. Baru, and K. A. Sinha. In Geological Society of America (GSA) Annual Meeting, volume 35(6), November 2003.

• Query Planning and Rewriting
  • Processing First-Order Queries under Limited Access Patterns, Alan Nash and B. Ludäscher, Proc. 23rd ACM Symposium on Principles of Database Systems (PODS'04), Paris, France, June 2004.
  • Processing Unions of Conjunctive Queries with Negation under Limited Access Patterns, Alan Nash and B. Ludäscher, 9th Intl. Conference on Extending Database Technology (EDBT'04), Heraklion, Crete, Greece, March 2004, LNCS 2992.
  • Web Service Composition Through Declarative Queries: The Case of Conjunctive Queries with Union and Negation, B. Ludäscher and Alan Nash. Research abstract (poster), 20th Intl. Conference on Data Engineering (ICDE'04), Boston, IEEE Computer Society, April 2004.


Related Publications

• Scientific Workflows
  • Kepler: An Extensible System for Design and Execution of Scientific Workflows, I. Altintas, C. Berkley, E. Jaeger, M. Jones, B. Ludäscher, S. Mock, 16th International Conference on Scientific and Statistical Database Management (SSDBM'04), 21-23 June 2004, Santorini Island, Greece.
  • Kepler: Towards a Grid-Enabled System for Scientific Workflows, Ilkay Altintas, Chad Berkley, Efrat Jaeger, Matthew Jones, Bertram Ludäscher, Steve Mock, Workflow in Grid Systems (GGF10), Berlin, March 9th, 2004.
  • An Ontology-Driven Framework for Data Transformation in Scientific Workflows, S. Bowers and B. Ludäscher, Intl. Workshop on Data Integration in the Life Sciences (DILS'04), March 25-26, 2004, Leipzig, Germany, LNCS 2994.
  • A Web Service Composition and Deployment Framework for Scientific Workflows, I. Altintas, E. Jaeger, K. Lin, B. Ludäscher, A. Memon. In the 2nd Intl. Conference on Web Services (ICWS), San Diego, California, July 2004.
