an introduction to taverna workflows - bioinformatics · 2013-06-28 · an introduction to taverna...


An Introduction to Taverna Workflows

Katy Wolstencroft

myGrid

University of Manchester

What is myGrid?

• myGrid is a suite of components to support in silico experiments in biology

• Taverna workbench = myGrid user interface

• Originally designed to support bioinformatics

• Expanded into new areas:
– Chemoinformatics
– Health Informatics
– Medical Imaging
– Integrative Biology

• Open source – and always will be


History

EPSRC funded UK eScience Program Pilot Project

OMII Open Middleware Infrastructure Institute

• University of Manchester (myGrid) joined with the Universities of Edinburgh (OGSA-DAI) and Southampton (OMII phase 1) in March 2006

• OMII-UK aims to provide software and support to enable a sustained future for the UK e-Science community and its international collaborators.

• A guarantee of development and support


The Life Science Community

In silico Biology is an open Community

• Open access to data

• Open access to resources

• Open access to tools

• Open access to applications

Global in silico biological research

The Community Problems

• Everything is Distributed
– Data, Resources and Scientists

• Heterogeneous data

• Very few standards – I/O formats, data representation, annotation

– Everything is a string!

Integration of data and interoperability of resources is difficult


Lots of Resources

NAR 2007 – 968 databases

Traditional Bioinformatics

12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt
12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt
12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg
12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga
12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc
12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa
12781 taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa
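Before a numbered listing like the one above can be pasted into another tool, the position numbers and spacing usually have to be stripped by hand. A minimal Python sketch of that clean-up step follows. The function name `parse_numbered_sequence` is illustrative only, and the sketch assumes the listing contains nothing but position numbers and lower-case nucleotide letters.

```python
import re


def parse_numbered_sequence(text: str) -> str:
    """Strip position numbers and whitespace from a numbered sequence listing."""
    # Keep only runs of nucleotide letters; digits and spaces are layout.
    return "".join(re.findall(r"[acgtn]+", text.lower()))


# First two rows of the listing shown on the slide.
listing = """
12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
"""

sequence = parse_numbered_sequence(listing)
print(len(sequence), sequence[:10])
```

Each row holds 60 bases, which is why consecutive position numbers differ by 60.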


Cutting and Pasting

• Advantages:
– Low Technology on both server and client side
– Very Robust: Hard to break.
– Data Integration happens along the way

• Disadvantages:
– Time Consuming (and painful!)
  • Can be repeated rarely
  • Limited to small data sets.
– Error Prone:
  • Poor repeatability

How do you do this for a genome/proteome/metabolome of information?

Pipeline Programming

• Advantages
– Repeatable
– Allows automation
– Quick, reliable, efficient

• Disadvantages
– Requires programming skills
– Difficult to modify
– Requires local tool and database installation
– Requires tool and database maintenance!!!


What we want as a solution

A system that:

• Allows automation

• Allows easy repetition, verification and sharing of experiments

• Works on distributed resources

• Requires few programming skills

• Runs on a local desktop / laptop

myGrid as a solution

myGrid allows the automated orchestration of in silico experiments over distributed resources from the scientist’s desktop

Built on computer science technologies of:

• Web services

• Workflows

• Semantic web technologies


Web Services

Web services support machine-to-machine interaction over a network. Note: NOT the same as services on the web

Web services are:

– a technology and standard for exposing code / databases with an API that can be consumed by a third party remotely

– accompanied by a description of how to interact with them

They are:

• Self-contained

• Self-describing

• Modular

• Platform independent

Workflows

– General technique for describing and enacting a process
– Describes what you want to do, not how you want to do it
– High level description of the experiment

[Diagram: example workflow built from a RepeatMasker web service, a GenScan web service and a Blast web service]
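To make "what you want to do, not how" concrete, here is a minimal Python sketch in which a workflow is just an ordered list of steps. It is not how Taverna represents workflows. The three functions are local stand-ins for the RepeatMasker, GenScan and Blast services named in the diagram, their behaviour is invented, and the order in which they are chained is an assumption.

```python
from typing import Callable, List

Step = Callable[[str], str]


def mask_repeats(sequence: str) -> str:
    # Stand-in for a RepeatMasker service: mask a made-up repeat motif.
    return sequence.replace("ATAT", "NNNN")


def predict_gene(sequence: str) -> str:
    # Stand-in for a GenScan service: pretend the gene is the first 6 bases.
    return sequence[:6]


def search_similar(gene: str) -> str:
    # Stand-in for a Blast service: return a fake hit label.
    return f"hit-for-{gene}"


def run_workflow(steps: List[Step], data: str) -> str:
    """Feed the output of each step into the next one."""
    for step in steps:
        data = step(data)
    return data


# The workflow itself is only a description: which steps, in which order.
workflow: List[Step] = [mask_repeats, predict_gene, search_similar]
result = run_workflow(workflow, "GGCATATCC")
print(result)
```

Swapping one service for another means editing the list, not the code that runs it.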


Workflows

Workflow language specifies how bioinformatics processes fit together.

High level workflow diagram separated from any lower level coding – you don’t have to be a coder to build workflows.

Workflow is a kind of script or protocol that you configure when you run it.

Easier to explain, share, relocate, reuse and repurpose.

Workflow <=> Model

Workflow is the integrator of knowledge

The METHODS section of a scientific publication

Workflow Advantages

• Automation
– Capturing processes in an explicit manner
– Tedium! Computers don’t get bored/distracted/hungry/impatient!
– Saves repeated time and effort
• Modification, maintenance, substitution and personalisation
• Easy to share, explain, relocate, reuse and build
• Releases Scientists/Bioinformaticians to do other work
• Record
– Provenance: what the data is like, where it came from, its quality
– Management of data (LSID - Life Science Identifiers)


Different Workflow Systems

• Kepler

• Triana

• DiscoveryNet

• Taverna

• Geodise

• Pegasus

• Pipeline Pilot

Each has differences in action, language, access restrictions, subject areas

Taverna Workflow Components

Scufl: Simple Conceptual Unified Flow Language
Taverna: Writing, running workflows & examining results
SOAPLAB: Makes applications available

[Diagram: SOAPLAB wraps any application as a web service; web service e.g. DDBJ BLAST]


An Open World

• Open domain services and resources.
• Taverna accesses 3000+ services
• Third party – we don’t own them – we didn’t build them
• All the major providers
– NCBI, DDBJ, EBI …
• Enforce NO common data model.
• Quality Web Services considered desirable

Adding your own web services

• SoapLab
– http://www.ebi.ac.uk/soaplab/
• Java API Consumer
– e.g. import Java API of libSBML as workflow components


Services Landscape

Shield the Scientist – Bury the Complexity

[Diagram: the Taverna Workbench and other applications sit on the Scufl (Simple Conceptual Unified Flow Language) model and workflow execution. The workflow enactor calls a processor for each kind of service: plain web service, Soaplab, local Java app, enactor, BioMOBY, WSRF, BioMART, Styx client, R package, ...]


What can you do with myGrid?

• ~37000 downloads
• Users worldwide: US, Singapore, UK, Europe, Australia
• Systems biology
• Proteomics
• Gene/protein annotation
• Microarray data analysis
• Medical image analysis
• Heart simulations
• High throughput screening
• Genotype/Phenotype studies
• Health Informatics
• Astronomy
• Chemoinformatics
• Data integration

http://www.genomics.liv.ac.uk/tryps/trypsindex.html

Trypanosomiasis in Africa

Andy Brass, Steve Kemp, Paul Fisher


Trypanosomiasis Study

• A form of Sleeping sickness in cattle – Known as n’gana

• Caused by Trypanosoma brucei

• Can we breed cattle resistant to n’gana infection?

• What are the causes of the differences between resistant and susceptible strains?

Trypanosomiasis Study

Understanding Phenotype

• Comparing resistant vs susceptible strains – Microarrays

Understanding Genotype

• Mapping quantitative traits – Classical genetics QTL

Need to access microarray data, genomic sequence information, pathway databases AND integrate the results


Microarray + QTL

Phenotypic response investigated using microarray in form of expressed genes or evidence provided through QTL mapping

Genotype Phenotype

Genes captured in microarray experiment and present in QTL region

Key:

A – Retrieve genes in QTL region

B – Annotate genes with external database IDs

C – Cross-reference IDs with KEGG gene IDs

D – Retrieve microarray data from MaxD database

E – For each KEGG gene get the pathways it’s involved in

F – For each pathway get a description of what it does

G – For each KEGG gene get a description of what it does
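Steps A to G above can be read as a chain of lookups. The Python sketch below walks that chain over toy tables. It is not the actual Taverna workflow: every identifier, table name and description in it is invented for illustration, and the remote resources (QTL region, MaxD, KEGG) are replaced by in-memory dictionaries.

```python
# Toy lookup tables standing in for the remote resources (all values invented).
QTL_GENES = {"qtl1": ["geneA", "geneB", "geneC"]}              # step A
EXTERNAL_IDS = {"geneA": "X1", "geneB": "X2", "geneC": "X3"}   # step B
KEGG_IDS = {"X1": "kegg:1", "X2": "kegg:2"}                    # step C (X3 has no KEGG id)
MICROARRAY_GENES = {"geneA", "geneC"}                          # step D
PATHWAYS = {"kegg:1": ["path:01"], "kegg:2": ["path:02"]}      # step E
PATHWAY_DESC = {"path:01": "toy pathway one", "path:02": "toy pathway two"}  # step F
GENE_DESC = {"kegg:1": "toy gene one", "kegg:2": "toy gene two"}             # step G


def candidate_pathways(qtl: str) -> list:
    """Report pathways for genes found both in the QTL region and on the microarray."""
    report = []
    for gene in QTL_GENES[qtl]:                      # A: genes in the QTL region
        if gene not in MICROARRAY_GENES:             # D: keep genes seen in both
            continue
        kegg_id = KEGG_IDS.get(EXTERNAL_IDS[gene])   # B, C: map to a KEGG gene id
        if kegg_id is None:
            continue
        for pathway in PATHWAYS.get(kegg_id, []):    # E: pathways for that gene
            report.append({
                "gene": gene,
                "gene_description": GENE_DESC[kegg_id],         # G
                "pathway": pathway,
                "pathway_description": PATHWAY_DESC[pathway],   # F
            })
    return report


results = candidate_pathways("qtl1")
for row in results:
    print(row)
```

Every gene goes through the same lookups, so nothing is filtered out early by hand.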


Results

• Identified a pathway whose correlated gene (Daxx) is believed to play a role in trypanosomiasis resistance.

• Manual analysis on the microarray and QTL data had failed to identify this gene as a candidate.

Why was the Workflow Approach Successful?

• Workflow analysed each piece of data systematically
– Eliminated the user bias and premature filtering of datasets and results that lead to single-sided, expert-driven hypotheses

• The size of the QTL and amount of the microarray data made a manual approach impractical

• Workflows capture exactly where data came from and how it was analysed

• Workflow output produced a manageable amount of data for the biologists to interpret and verify
– “make sense of this data” -> “does this make sense?”


Trichuris muris (mouse whipworm) infection

(parasite model of the human parasite Trichuris trichiura)

• Identified the biological pathways involved in sex dependence in the mouse model, previously believed to be involved in the ability of mice to expel the parasite.

• Manual experimentation: Two year study of candidate genes, processes unidentified

• Workflows: trypanosomiasis cattle experiment was reused without change.

• Analysis of the resulting data by a biologist found the processes in a couple of days.

Joanne Pennock, Richard Grencis

University of Manchester

Workflow Reuse – Workflows are Scientific Protocols – Share them!

Addison’s Disease

SNP design

Protein annotation

Microarray analysis


A workflow marketplace

A Practical Guide to Building and Managing in silico Experiments


Semantic Web Technologies

• myGrid built on Web Services, Workflows AND semantic web technologies

• Semantic web technologies are used to:
– Find appropriate services during workflow design
– Find similar workflows for reuse and repurposing
– Record the process and outcome of an experiment, in context
  -> the experimental provenance

Finding Services

There are over 3000 distributed services. How do we find an appropriate one?

Find services by their function instead of their name

• We need to annotate services by their functions.

• The services might be distributed, but a registry of service descriptions can be central and queried


Feta Semantic Discovery

• Feta is the myGrid component that can query the service annotations and find services

Questions we can ask:

Find me all the services that perform a multiple sequence alignment and accept protein sequences in FASTA format as input

[Diagram: the myGrid ontology comprises an upper level ontology, task ontology, informatics ontology, molecular biology ontology, bioinformatics ontology and web service ontology, linked by "specialises" and "contributes to" relations.]

[Diagram: example concepts: sequence, biological_sequence, protein_sequence, nucleotide_sequence, DNA_sequence, protein_structure_feature. Example services: Similarity Search Service, BLAST service, BLASTp service, InterProScan service.]
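Finding services by function rather than by name can be sketched as a query over annotated service descriptions. The Python below is a rough illustration and not Feta's actual interface. The concept names come from the diagram, but the parent/child nesting between them is an assumption, and the registry entries and service names are hypothetical.

```python
from typing import Optional

# Assumed "is a kind of" links between the concepts named in the diagram.
PARENT = {
    "biological_sequence": "sequence",
    "protein_sequence": "biological_sequence",
    "nucleotide_sequence": "biological_sequence",
    "DNA_sequence": "nucleotide_sequence",
}

# Hypothetical central registry of service annotations.
REGISTRY = [
    {"name": "align_proteins", "task": "multiple_sequence_alignment",
     "input": "protein_sequence", "format": "FASTA"},
    {"name": "align_any", "task": "multiple_sequence_alignment",
     "input": "biological_sequence", "format": "FASTA"},
    {"name": "align_dna", "task": "multiple_sequence_alignment",
     "input": "DNA_sequence", "format": "FASTA"},
    {"name": "search_proteins", "task": "similarity_search",
     "input": "protein_sequence", "format": "FASTA"},
]


def is_a(concept: Optional[str], ancestor: str) -> bool:
    """True if concept equals ancestor or specialises it."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = PARENT.get(concept)
    return False


def find_services(task: str, input_type: str, fmt: str) -> list:
    """Return names of services that perform the task and accept the input."""
    return [
        s["name"] for s in REGISTRY
        if s["task"] == task and s["format"] == fmt and is_a(input_type, s["input"])
    ]


matches = find_services("multiple_sequence_alignment", "protein_sequence", "FASTA")
print(matches)
```

A service annotated as accepting any biological sequence also matches a protein query, which a plain name search would miss.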


Annotations

• Feta has been available for over a year

• Only just been included in the release

• Need critical mass of service annotations before release

• By demonstrating the use of service annotation, we aim to encourage service providers to provide the annotations in the future

• Annotation experiments with users and domain experts

• Domain expert annotations much better
– We now have a domain expert for full-time service annotation

Data Management

• Workflows can generate vast amounts of data - how can we manage and track it?

• We need to manage
– data AND
– metadata AND
– experiment provenance

• Workflow experiments may consist of many workflows of the same, or different experiments.

• Scientists need to check back over past results, compare workflow runs and share workflow runs with colleagues


From which Ensembl gene does pathway mmu004620 come?
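A question like this can be answered if each workflow run records which output was derived from which input. The Python sketch below walks such records backwards. It does not show myGrid's provenance store: the record layout, step names and gene identifiers are invented, and only the pathway id mmu004620 is taken from the slide.

```python
# Each record: (input item, output item, workflow step that produced the output).
# All identifiers except the pathway id are placeholders.
PROVENANCE = [
    ("ensembl:GENE_1", "kegg_gene:K1", "cross_reference"),
    ("kegg_gene:K1", "pathway:mmu004620", "get_pathways"),
    ("ensembl:GENE_2", "kegg_gene:K2", "cross_reference"),
    ("kegg_gene:K2", "pathway:other", "get_pathways"),
]


def trace_back(item: str, prefix: str) -> list:
    """Walk derivation links backwards and collect ancestors with the given prefix."""
    found, stack, seen = set(), [item], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        for source, output, _step in PROVENANCE:
            if output == current:
                if source.startswith(prefix):
                    found.add(source)
                stack.append(source)  # keep walking towards the original inputs
    return sorted(found)


origin = trace_back("pathway:mmu004620", "ensembl:")
print(origin)
```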

Advanced Provenance Features

• Smart re-running
• Experiment mining
• Cross experiment mining


Conclusions

Web services and workflows are powerful technologies for in silico science

– automation

– high throughput experiments

– systematic analysis

– Interoperability of distributed resources

Contact Us

• Taverna development is user-driven

• Please tell us what you would like to see via the mailing lists:
– Taverna-Users and Taverna-Hackers

• Download software and find out more at:

http://www.mygrid.org.uk

http://taverna.sourceforge.net


myGrid acknowledgements

Carole Goble, Norman Paton, Robert Stevens, Anil Wipat, David De Roure, Steve Pettifer

• OMII-UK Tom Oinn, Katy Wolstencroft, Daniele Turi, June Finch, Stuart Owen, David Withers, Stian Soiland, Franck Tanoh, Matthew Gamble, Alan Williams, Ian Dunlop

• Research Martin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, Antoon Goderis, Alastair Hampshire, Qiuwei Yu, Wang Kaixuan.

• Current contributors Matthew Pocock, James Marsh, Khalid Belhajjame, PsyGrid project, Bergen people, EMBRACE people.

• User Advocates and their bosses Simon Pearce, Claire Jennings, Hannah Tipney, May Tassabehji, Andy Brass, Paul Fisher, Peter Li, Simon Hubbard, Tracy Craddock, Doug Kell, Marco Roos, Matthew Pocock, Mark Wilkinson

• Past Contributors Matthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizaukaus, Kevin Glover, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Juri Papay, Savas Parastatidis, Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Victor Tan, Paul Watson, and Chris Wroe.

• Industrial Dennis Quan, Sean Martin, Michael Niemi (IBM), Chimatica.

• Funding EPSRC, Wellcome Trust.

Changes to Scientific Practice

– Systematic and comprehensive automation
  • Eliminated user bias and premature filtering of datasets and results leading to single sided, expert-driven hypotheses

– Dry people hypothesise, wet people validate
  • “make sense of this data” -> “does this make sense?”

– Workflow factories
  • Different dataset, different result

– Workflow market

– Accurate provenance