
Doing it again: Workflows and Ontologies Supporting Science

Phillip Lord

Frank Gibson

Newcastle University

Outline

• Describe the background problem

• Introduce distributed services, workflows, eScience and (a bit of) ontologies.

• CARMEN

• Provenance

• Can we repeat an experiment?

Data-intensive bioinformatics

ID   MURA_BACSU STANDARD; PRT; 429 AA.
DE   PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE
DE   (EC 2.5.1.7) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE
DE   ENOLPYRUVYL TRANSFERASE) (EPT).
GN   MURA OR MURZ.
OS   BACILLUS SUBTILIS.
OC   BACTERIA; FIRMICUTES; BACILLUS/CLOSTRIDIUM GROUP; BACILLACEAE;
OC   BACILLUS.
KW   PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
FT   ACT_SITE 116 116 BINDS PEP (BY SIMILARITY).
FT   CONFLICT 374 374 S -> A (IN REF. 3).
SQ   SEQUENCE 429 AA; 46016 MW; 02018C5C CRC32;
     MEKLNIAGGD SLNGTVHISG AKNSAVALIP ATILANSEVT IEGLPEISDI ETLRDLLKEI
     GGNVHFENGE MVVDPTSMIS MPLPNGKVKK LRASYYLMGA MLGRFKQAVI GLPGGCHLGP
     RPIDQHIKGF EALGAEVTNE QGAIYLRAER LRGARIYLDV VSVGATINIM LAAVLAEGKT
     IIENAAKEPE IIDVATLLTS MGAKIKGAGT NVIRIDGVKE LHGCKHTIIP DRIEAGTFMI

Around the world in 80 days

• Biology is still largely a cottage industry

• On a global stage

Websites everywhere

12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt
12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt
12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg
12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga
12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc
12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa
12781 taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa

WBS Workflows:

[Figure: workflow diagram. The labels that survive extraction are listed below; each service is shown with the annotation it produces where the pairing is clear.]

• Input chain: GenBank Accession No → GenBank Entry → Seqret → Nucleotide seq (Fasta)

• Services on the nucleotide sequence:

– GenScan: Coding sequence
– sixpack: 6 ORFs
– prettyseq: Translation/sequence file. Good for records and publications
– restrict: Restriction enzyme map
– cpgreport: CpG Island locations and %
– RepeatMasker: Repetitive elements
– ncbiBlastWrapper: Blastn Vs nr, est databases.
– transeq: Amino Acid translation

• Services on the translated sequence:

– epestfind: Identifies PEST seq
– pscan: Identifies FingerPRINTS
– pepstats: MW, length, charge, pI, etc
– pepcoil: Predicts Coiled-coil regions
– SignalP, TargetP, PSORTII: Predicts cellular location
– InterPro, PFAM, Prosite, Smart: Identifies functional and structural domains/motifs
– Pepwindow? Octanol?: Hydrophobic regions
– ncbiBlastWrapper: tblastn Vs nr, est, est_mouse, est_human databases. Blastp Vs nr

• Other labels: START; ORFs; URL inc GB identifier; Query nucleotide sequence; Sort for appropriate Sequences only

• Colour key:

– Pink: Outputs/inputs of a service
– Purple: Tailor-made services
– Green: Emboss soaplab services
– Yellow: Manchester soaplab services
– Grey: Unknowns

myGrid is an EPSRC funded UK eScience Program Pilot Project

Particular thanks to the other members of the Taverna project, http://taverna.sf.net


Web Services

Web services support machine-to-machine interaction over a network. Note: NOT the same as services on the web.

Web services are:

– a technology and standard for exposing code / databases with an API that can be consumed by a third party remotely.
– self-describing: a web service describes how to interact with it.

They are:

• Self-contained

• Self-describing

• Modular

• Platform independent
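As a rough illustration of "self-describing", the sketch below exposes one function as a service and then asks the service which methods it offers before calling one. It uses XML-RPC introspection from the Python standard library as a stand-in; this is an assumption made for brevity and is not the SOAP/WSDL stack behind the Soaplab services mentioned in these slides. The function `gc_fraction` and the helper names are hypothetical.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


def gc_fraction(sequence):
    """Return the fraction of G and C bases in a nucleotide sequence."""
    s = sequence.upper()
    return (s.count("G") + s.count("C")) / len(s)


def start_service():
    # Port 0 asks the operating system for any free local port.
    server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
    # Introspection adds system.listMethods and system.methodHelp,
    # which is what lets a client discover the interface remotely.
    server.register_introspection_functions()
    server.register_function(gc_fraction)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def describe_and_call(url, sequence):
    client = ServerProxy(url)
    methods = client.system.listMethods()
    help_text = client.system.methodHelp("gc_fraction")
    return methods, help_text, client.gc_fraction(sequence)


server = start_service()
host, port = server.server_address
methods, help_text, value = describe_and_call(
    f"http://{host}:{port}", "acatttctac"
)
server.shutdown()
server.server_close()
```

The client never imports `gc_fraction`; everything it learns about the service comes over the network, which is the property the slide is pointing at.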

Workflow language specifies how bioinformatics processes fit together.

High level workflow diagram separated from any lower level coding – you don’t have to be a coder to build workflows.

Workflow is a kind of script or protocol that you configure when you run it.

Easier to explain, share, relocate, reuse and repurpose.

The METHODS section of a scientific publication
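To make "specifies how processes fit together" concrete, here is a minimal sketch of a workflow as a set of named steps with declared inputs, run in dependency order. It is not Taverna's workflow language; the step functions are hypothetical local stand-ins for remote services, and `run_workflow` is a name invented for this example.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def transcribe(dna):
    # Stand-in for a remote service: DNA to RNA.
    return dna.upper().replace("T", "U")


def gc_fraction(dna):
    s = dna.upper()
    return (s.count("G") + s.count("C")) / len(s)


def report(rna, gc):
    return f"{len(rna)} nt, GC={gc:.2f}"


def run_workflow(steps, inputs):
    """Run each step once all of its upstream values are available.

    steps maps an output name to (function, [names of its inputs]).
    """
    values = dict(inputs)
    graph = {name: set(deps) for name, (_, deps) in steps.items()}
    for name in TopologicalSorter(graph).static_order():
        if name in values:
            continue  # a workflow input, nothing to compute
        func, deps = steps[name]
        values[name] = func(*(values[d] for d in deps))
    return values


# The wiring is data, separate from the code of each step.
steps = {
    "rna": (transcribe, ["dna"]),
    "gc": (gc_fraction, ["dna"]),
    "report": (report, ["rna", "gc"]),
}
result = run_workflow(steps, {"dna": "acatttctac"})
```

Because the wiring lives in the `steps` table rather than in the step code, the same table can be shared, rerun with a different input, or repurposed by swapping one step.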

Workflows

The Taverna Workbench

http://taverna.sourceforge.net

http://www.mygrid.org.uk


Workflows

• Automating away cutting and pasting.

• Helps to deal with distribution of data.

• myGrid and Taverna built on the open nature of bioinformatics.

• Can we adapt the same approach to another discipline?

CARMEN: Code, Analysis, Repository and Modelling for e-Neuroscience

www.carmen.org.uk

Engineering and Physical Sciences Research Council (http://www.epsrc.ac.uk/)

Consortium & Profile

Stirling

St. Andrews

Newcastle

York

Sheffield

Cambridge

Imperial

Plymouth

Warwick

Leicester

Manchester

• $10M over 4 years

• 20 Investigators

• Commenced 1st October 2006

Industry & Associates

http://www.upenn.edu/

Virtual Laboratory for Neurophysiology

• Enabling sharing and collaborative exploitation of data, analysis code and expertise that are not physically collocated

http://www.case.edu/artsci/ribms/spiketrain.html

Potential Barriers

• Technical

– Multiple proprietary formats
– No standardised metadata
– Volume of data to be analysed

• Cultural

– Multiple communities acting independently
– Concerns about implications of sharing

Comparing to bioinformatics

• Cottage industry

• Global distribution

• Need to share

• But….

Age and Impact.

No sequences!

• DNA and Protein sequence form a core datatype for bioinformatics

• It’s simple to structure and to store, and it is of high-value

• Initially, there wasn’t much of it, and textual metadata was fine.

• Many people built tools over it, for transforming and manipulating.

The need for clear metadata

• Most neuroscience data is relatively simple in structure

• But often contextually complex

• Sometimes associated with behavioural features

Neuroscience spike data

• The raw data is just a waveform

• But what is the experiment for?

• What stimulus is the organism/tissue receiving?

• Even, which channel is which?

• The data sets being produced are (reasonably) large (tens of Gb, or 1 Tb in three months)
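The questions above hint at what a metadata record for a recording would need to carry. The following is only a sketch of that idea: the class, its field names and the completeness check are hypothetical and do not describe the CARMEN metadata schema.

```python
from dataclasses import dataclass, field


@dataclass
class SpikeRecordingMetadata:
    """Context that the raw waveform alone does not provide."""

    purpose: str = ""      # what is the experiment for?
    preparation: str = ""  # which organism or tissue?
    stimulus: str = ""     # what stimulus is it receiving?
    channels: dict = field(default_factory=dict)  # channel number -> description

    def missing_context(self):
        """Return the names of the fields that are still unanswered."""
        missing = []
        for name in ("purpose", "preparation", "stimulus"):
            if not getattr(self, name).strip():
                missing.append(name)
        if not self.channels:
            missing.append("channels")
        return missing


# Illustrative values only.
partial = SpikeRecordingMetadata(
    purpose="response to visual stimulation",
    preparation="retina, in vitro",
    channels={0: "electrode A1"},
)
unanswered = partial.missing_context()
```

A repository could refuse an upload, or simply flag it, while `missing_context()` is non-empty.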

Data Sharing in bioinformatics

• Data Sharing was an early tradition in biology.

• Gene patenting, NDAs and the like came as quite a surprise

• Many political battles were fought, culminating in the Clinton/Blair statement

Data Sharing in Neurosciences

• The data is easy to structure, but the metadata is not

– There is, therefore, less point to sharing data

• Many neuroscientists come from a medical background

– This tends to be more of a hierarchical, secretive profession – all worried about getting sued.

• A lot of neuroscientists use invasive, live animal experiments

– Security is more than a passing concern.

The difference in neuroscience

• Less data sharing tradition

• No rich ecosystem of tools

• Higher barrier to entry for metadata

• Larger datasets

Virtual Laboratory Node

[Figure: architecture of a CARMEN Virtual Laboratory Node. Labels recovered from the diagram:]

• Components: Data; Metadata; Compute Cluster on which Services are Dynamically Deployed; Web Portals; Rich Clients; Security; Workflow Enactment Engine; Registry; Service Repository

• Callouts:

– Search for Data & Analysis Code
– Raw Signal Data Search & Visualisation
– Deployment of Data & Analysis Code in Processes
– Raw & Derived Data File Store
– Security Policies Controlling Access to Data & Code
– Structured Metadata Store Enabling Search & Annotation
– Analysis & Model Code Store

Development Timeline

• Metadata (April 2008): Structured metadata allowing data and analysis code to be described and searched

• Data and Scripting Support (April 2008): Support for extended range of data formats and scripting languages

• Security (April 2008): Security allowing access to data and analysis code to be controlled

• Provenance (July 2008): Provenance of analysis and modelling processes leading to scientific results

• CARMEN v1.0 (October 2008): Release of CARMEN v1.0. Virtual laboratory nodes open to the CARMEN consortium

• CARMEN v2.0 (October 2009): Release of CARMEN v2.0. Virtual laboratory nodes “networked”

Virtual Laboratory Infrastructure

[Figure: two Virtual Laboratory Nodes side by side, each repeating the single-node architecture (Data, Metadata, deployed services, Web Portals, Rich Clients, Security, Workflow Enactment Engine, Service Repository).]

Networked Nodes at Newcastle and York.

More planned …

Vision – Global Laboratory

[Figure: six Virtual Laboratory Nodes networked together, each repeating the single-node architecture.]

Some Unexpected Advantages

• Big problem with bioinformatics services

• Over time they tend to disappear

• CARMEN keeps services and data together

• This means we should be able to rerun analyses later.

• We should be able to store provenance
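One way to picture "storing provenance" when services and data sit together is a record that pins down exactly which data and which service version produced a result. The sketch below is an assumption for illustration: the record layout, the field names and the example service name are hypothetical, not the CARMEN provenance model.

```python
import hashlib
from datetime import datetime, timezone


def fingerprint(data: bytes) -> str:
    """Content hash, so a later run can check it sees the same bytes."""
    return hashlib.sha256(data).hexdigest()


def provenance_record(service, version, parameters, input_data, output_data):
    return {
        "service": service,
        "service_version": version,
        "parameters": dict(parameters),
        "input_sha256": fingerprint(input_data),
        "output_sha256": fingerprint(output_data),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


# Hypothetical service name, version and data, for illustration only.
raw = b"0.1 0.4 0.2 0.9 0.3"
spikes = b"3"
record = provenance_record(
    "spike_detector", "1.2", {"threshold": 0.8}, raw, spikes
)
```

If the service later disappears or changes, the hashes and version in the record at least show that a rerun is no longer the same experiment.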

What is Provenance

What does it mean to rerun an experiment?

• Replicability: one scientist should be able to repeat another’s experiment, under equivalent conditions, at a different time.

• Rerunability: a scientist should be able to apply an equivalent technique under new circumstances.

• The addition of services into this mix complicates the issue.

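[Figure: two diagrams relating replicability and rerunability to what has changed between runs.]

• First diagram, one axis from Old Data to New Data: Replicability at the Old Data end, Rerunability at the New Data end.

• Second diagram, two axes: Old Data / New Data against Old Services / New Services, with Replicability and Rerunability at opposite ends.

• Questions attached to the second diagram:

– Is the specification of what happened actually right?
– Has the state of the world advanced since previously?
– Has the world changed, in a comparable way?
– Has the service changed in a comparable way?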
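The distinction between repeating and rerunning can be sketched as a comparison of two runs along the two axes the slide names, data and services. The labels returned below are this example's own wording, and `Run` and `classify` are hypothetical names, not part of any CARMEN or Taverna API.

```python
from typing import NamedTuple


class Run(NamedTuple):
    data_id: str             # e.g. a content hash of the input data
    service_versions: tuple  # ((service name, version), ...)


def classify(original: Run, later: Run) -> str:
    same_data = original.data_id == later.data_id
    same_services = original.service_versions == later.service_versions
    if same_data and same_services:
        return "replication"
    if same_data:
        return "replication, but services changed"
    if same_services:
        return "rerun on new data"
    return "rerun on new data, and services changed"


first = Run("sha256:aaa", (("spike_detector", "1.2"),))
again = Run("sha256:aaa", (("spike_detector", "1.2"),))
upgraded = Run("sha256:aaa", (("spike_detector", "2.0"),))
fresh = Run("sha256:bbb", (("spike_detector", "1.2"),))
```

Only the first case is settled mechanically; in the other three, someone still has to judge whether the change is "comparable" in the sense the slide asks about.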

[Figure labels: the kinds of user in the diagram]

• Error-Prone Neuroscientist

• Eager Neuroscientist

• Neuroscientist comparing to existing work

• Tool Builder

There is a difficulty

• There is less tradition of data sharing

• The tendency to want to control data is much stronger

• If we want to data mine, we have to cope with “data is mine”

• If we have many different repositories, this needs to be supported computationally

[Figure: six Virtual Laboratory Nodes networked together, as in the Global Laboratory vision.]

An Example: Licensing

• Computationally amenable licenses are available

• Take, for example, Creative Commons
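A license is "computationally amenable" when software can decide from its code alone whether an intended use is allowed. The sketch below does this for the four Creative Commons license elements (BY, NC, ND, SA). The functions `parse_cc` and `permits` are hypothetical, the check is deliberately coarse, and it does not handle CC0 or represent legal advice.

```python
CC_ELEMENTS = {
    "BY": "attribution required",
    "NC": "non-commercial use only",
    "ND": "no derivative works",
    "SA": "derivatives must be shared alike",
}


def parse_cc(code):
    """Turn e.g. 'CC BY-NC-SA 4.0' into {'BY', 'NC', 'SA'}."""
    tokens = code.upper().split()
    if len(tokens) < 2 or tokens[0] != "CC":
        raise ValueError(f"not a recognised CC license code: {code!r}")
    elements = set(tokens[1].split("-"))
    unknown = elements - set(CC_ELEMENTS)
    if unknown:
        raise ValueError(f"unknown license elements: {sorted(unknown)}")
    return elements


def permits(code, commercial=False, derivative=False):
    """Coarse check of an intended use against the license elements."""
    elements = parse_cc(code)
    if commercial and "NC" in elements:
        return False
    if derivative and "ND" in elements:
        return False
    return True


def obligations(code):
    """Human-readable conditions attached to the license."""
    return sorted(CC_ELEMENTS[e] for e in parse_cc(code))
```

With many repositories, a data-mining job could filter datasets with `permits(...)` before touching them, which is the kind of computational support the previous slide asks for.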

Conclusions

• Automated workflows have been applied very successfully in bioinformatics.

• But applying these directly to neuroinformatics is a different issue.

• Technology has to fit the domain.

• We are investigating metadata for describing neuroinformatics

myGrid acknowledgements

Carole Goble, Norman Paton, Robert Stevens, Anil Wipat, David De Roure, Steve Pettifer

• OMII-UK Tom Oinn, Katy Wolstencroft, Daniele Turi, June Finch, Stuart Owen, David Withers, Stian Soiland, Franck Tanoh, Matthew Gamble.

• Research Martin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, Antoon Goderis, Alastair Hampshire, Qiuwei Yu, Wang Kaixuan.

• Current contributors Matthew Pocock, James Marsh, Khalid Belhajjame, PsyGrid project, Bergen people, EMBRACE people.

• User Advocates and their bosses Simon Pearce, Claire Jennings, Hannah Tipney, May Tassabehji, Andy Brass, Paul Fisher, Peter Li, Simon Hubbard, Tracy Craddock, Doug Kell.

• Past Contributors Matthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizaukaus, Kevin Glover, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Juri Papay, Savas Parastatidis, Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Victor Tan, Paul Watson, and Chris Wroe.

• Industrial Dennis Quan, Sean Martin, Michael Niemi (IBM), Chimatica.

• Funding EPSRC, Wellcome Trust.

Acknowledgements

Professor Colin Ingram, Professor Jim Austin, Professor Leslie Smith, Professor Paul Watson, Dr. Stuart Baker, Professor Roman Borisyuk, Dr. Stephen Eglen, Professor Jianfeng Feng, Dr. Kevin Gurney, Dr. Tom Jackson, Dr. Marcus Kaiser, Dr. Phillip Lord, Dr. Paul Overton, Dr. Stefano Panzeri, Dr. Rodrigio Quian Quiroga, Dr. Simon Schultz, Dr. Evelyne Sernagor, Dr. V. Anne Smith, Dr. Tom Smulders, Professor Miles Whittington, Christoph Echtermeyer, Martyn Fletcher, Frank Gibson, Mark Jessop, Dr. Bojian Liang, Juan Martinez-Gomez, Dr. Chris Mountford, Agah Ogungboye, Georgios Pitsilis, Dr. Daniel Swan

University of St Andrews

The University of Sheffield

http://www.le.ac.uk/

http://www.york.ac.uk/
