Post on 15-Jan-2016
Doing it again: Workflows and Ontologies Supporting Science
Phillip Lord
Frank Gibson
Newcastle University
Outline
• Describe the background problem
• Introduce distributed services, workflows, eScience and (a bit of) ontologies.
• CARMEN
• Provenance
• Can we repeat an experiment?
Data-intensive bioinformatics
ID   MURA_BACSU STANDARD; PRT; 429 AA.
DE   PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE
DE   (EC 2.5.1.7) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE
DE   ENOLPYRUVYL TRANSFERASE) (EPT).
GN   MURA OR MURZ.
OS   BACILLUS SUBTILIS.
OC   BACTERIA; FIRMICUTES; BACILLUS/CLOSTRIDIUM GROUP; BACILLACEAE;
OC   BACILLUS.
KW   PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
FT   ACT_SITE 116 116 BINDS PEP (BY SIMILARITY).
FT   CONFLICT 374 374 S -> A (IN REF. 3).
SQ   SEQUENCE 429 AA; 46016 MW; 02018C5C CRC32;
     MEKLNIAGGD SLNGTVHISG AKNSAVALIP ATILANSEVT IEGLPEISDI ETLRDLLKEI
     GGNVHFENGE MVVDPTSMIS MPLPNGKVKK LRASYYLMGA MLGRFKQAVI GLPGGCHLGP
     RPIDQHIKGF EALGAEVTNE QGAIYLRAER LRGARIYLDV VSVGATINIM LAAVLAEGKT
     IIENAAKEPE IIDVATLLTS MGAKIKGAGT NVIRIDGVKE LHGCKHTIIP DRIEAGTFMI
Around the world in 80 days
• Biology is still largely a cottage industry
• On a global stage
Websites everywhere
12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt
12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt
12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg
12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga
12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc
12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa
12781 taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa
WBS Workflows
[Workflow diagram. Recoverable labels:]
• START: GenBank Accession No; GenBank Entry; Seqret; Nucleotide seq (Fasta); URL inc GB identifier
• Nucleotide sequence services
– GenScan: Coding sequence
– sixpack: 6 ORFs
– restrict: Restriction enzyme map
– cpgreport: CpG Island locations and %
– RepeatMasker: Repetitive elements
– prettyseq: Translation/sequence file. Good for records and publications
– ncbiBlastWrapper: Blastn Vs nr, est databases
– transeq: Amino Acid translation
• Amino acid sequence services
– epestfind: Identifies PEST seq
– pscan: Identifies FingerPRINTS
– pepstats: MW, length, charge, pI, etc
– pepcoil: Predicts Coiled-coil regions
– SignalP, TargetP, PSORTII: Predicts cellular location
– InterPro, PFAM, Prosite, Smart: Identifies functional and structural domains/motifs
– Pepwindow? Octanol?: Hydrophobic regions
– ncbiBlastWrapper: tblastn Vs nr, est, est_mouse, est_human databases. Blastp Vs nr
• Other labels: ORFs; RepeatMasker; Query nucleotide sequence; Sort for appropriate Sequences only
• Colour key
– Pink: Outputs/inputs of a service
– Purple: Tailor-made services
– Green: Emboss soaplab services
– Yellow: Manchester soaplab services
– Grey: Unknowns
myGrid is an EPSRC funded UK eScience Program Pilot Project
Particular thanks to the other members of the Taverna project, http://taverna.sf.net
Web Services
Web services support machine-to-machine interaction over a network. Note: NOT the same as services on the web
Web services are:
– a technology and standard for exposing code / databases with an API that can be consumed remotely by a third party
– accompanied by a description of how to interact with them
They are:
• Self-contained
• Self-describing
• Modular
• Platform independent
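To make "self-describing" and "platform independent" concrete, here is a minimal sketch that models a service as a piece of code plus a machine-readable description, with JSON text going in and JSON text coming out. The names `Service`, `describe`, `invoke` and `revcomp` are hypothetical, no network is involved, and this is not WSDL/SOAP or any other real web service standard.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Service:
    """Toy stand-in for a web service: code plus a description of its interface."""
    name: str
    inputs: Dict[str, str]    # parameter name -> type label
    outputs: Dict[str, str]   # result name -> type label
    func: Callable[..., dict]

    def describe(self) -> str:
        """Return a machine-readable description (the 'self-describing' part)."""
        return json.dumps(
            {"name": self.name, "inputs": self.inputs, "outputs": self.outputs}
        )

    def invoke(self, request: str) -> str:
        """Take a JSON request and return a JSON response, as a remote client sees it."""
        args = json.loads(request)
        missing = set(self.inputs) - set(args)
        if missing:
            return json.dumps({"error": f"missing inputs: {sorted(missing)}"})
        return json.dumps(self.func(**args))


def reverse_complement(sequence: str) -> dict:
    """Reverse-complement a DNA sequence written with the letters a, c, g, t."""
    pairs = {"a": "t", "t": "a", "c": "g", "g": "c"}
    return {"sequence": "".join(pairs[base] for base in reversed(sequence.lower()))}


revcomp = Service(
    "revcomp", {"sequence": "string"}, {"sequence": "string"}, reverse_complement
)

# A client needs only the description and the JSON messages, not the code itself.
description = json.loads(revcomp.describe())
response = json.loads(revcomp.invoke(json.dumps({"sequence": "acattt"})))
print(description["inputs"], response)
```

Because both the description and the messages are plain text, a client written in any language on any platform could consume the service in the same way.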
Workflow language specifies how bioinformatics processes fit together.
High level workflow diagram separated from any lower level coding – you don’t have to be a coder to build workflows.
Workflow is a kind of script or protocol that you configure when you run it.
Easier to explain, share, relocate, reuse and repurpose.
The METHODS section of a scientific publication
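The idea that a workflow is a configured protocol rather than code can be sketched as follows: the workflow is plain data naming services and how their inputs are wired, and a small enactment function runs it. Everything here (`SERVICES`, `WORKFLOW`, `enact`, the accession `TOY0001` and its sequence) is a hypothetical illustration, not Taverna's workflow language or any real service.

```python
from typing import Callable, Dict, List, Tuple


# Hypothetical local stand-ins for remote services.
def fetch_sequence(accession: str) -> str:
    # A real workflow would query a sequence database; this uses a fixed toy record.
    return {"TOY0001": "atggcctaa"}[accession]


def transcribe(dna: str) -> str:
    """Convert a DNA sequence to RNA by replacing t with u."""
    return dna.replace("t", "u")


def gc_content(dna: str) -> float:
    """Fraction of bases that are g or c."""
    return (dna.count("g") + dna.count("c")) / len(dna)


SERVICES: Dict[str, Callable] = {
    "fetch_sequence": fetch_sequence,
    "transcribe": transcribe,
    "gc_content": gc_content,
}

# The workflow is configuration: (step name, service name, input source).
# An input source is either "input:<name>" or the name of an earlier step.
WORKFLOW: List[Tuple[str, str, str]] = [
    ("entry", "fetch_sequence", "input:accession"),
    ("rna", "transcribe", "entry"),
    ("gc", "gc_content", "entry"),
]


def enact(workflow: List[Tuple[str, str, str]], inputs: Dict[str, str]) -> Dict[str, object]:
    """Run each step in order, feeding it a workflow input or an earlier result."""
    results: Dict[str, object] = {}
    for step, service, source in workflow:
        if source.startswith("input:"):
            value = inputs[source[len("input:"):]]
        else:
            value = results[source]
        results[step] = SERVICES[service](value)
    return results


results = enact(WORKFLOW, {"accession": "TOY0001"})
print(results)
```

Repurposing the workflow means editing the `WORKFLOW` list or the inputs, not the service code, which is what makes it easy to share and reuse.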
Workflows
The Taverna Workbench
http://taverna.sourceforge.net
http://www.mygrid.org.uk
Workflows
• Automating away cutting and pasting.
• Helps to deal with distribution of data.
• myGrid and Taverna built on the open nature of bioinformatics.
• Can we adapt the same approach to another discipline?
CARMEN
Code, Analysis, Repository and Modelling for e-Neuroscience
www.carmen.org.uk
Engineering and Physical Sciences Research Council
Consortium & Profile
Stirling
St. Andrews
Newcastle
York
Sheffield
Cambridge
Imperial
Plymouth
Warwick
Leicester
Manchester
• $10M over 4 years
• 20 Investigators
• Commenced 1st October 2006
Industry & Associates
Virtual Laboratory for Neurophysiology
• Enabling sharing and collaborative exploitation of data, analysis code and expertise that are not physically collocated
Potential Barriers
• Technical
– Multiple proprietary formats
– No standardised metadata
– Volume of data to be analysed
• Cultural
– Multiple communities acting independently
– Concerns about implications of sharing
Comparing to bioinformatics
• Cottage industry
• Global distribution
• Need to share
• But….
Age and Impact.
No sequences!
• DNA and Protein sequence form a core datatype for bioinformatics
• It’s simple to structure and to store, and it is of high-value
• Initially, there wasn’t much of it, and textual metadata was fine.
• Many people built tools over it, for transforming and manipulating.
The need for clear metadata
• Most neuroscience data is relatively simple in structure
• But often contextually complex
• Sometimes associated with behavioural features
Neuroscience spike data
• The raw data is just a waveform
• But what is the experiment for?
• What stimulus is the organism/tissue receiving?
• Even, which channel is which?
• The data sets being produced are (reasonably) large (tens of GB, or 1 TB in three months)
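The questions above (what the experiment is for, what stimulus was applied, which channel is which) suggest the kind of record that would have to travel with a raw waveform. The sketch below is one possible shape for such a record; the class `SpikeRecordingMetadata`, its field names and the example values are all hypothetical and do not describe CARMEN's actual metadata schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class SpikeRecordingMetadata:
    """Context needed to interpret a raw waveform (hypothetical fields)."""
    experiment_aim: str                 # what the experiment is for
    preparation: str                    # organism or tissue recorded from
    stimulus: str                       # what the organism/tissue was receiving
    sampling_rate_hz: float
    channel_map: Dict[int, str]         # channel number -> electrode label
    behaviour_notes: List[str] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """Names of required fields that were left empty."""
        required = ["experiment_aim", "preparation", "stimulus", "channel_map"]
        return [name for name in required if not getattr(self, name)]


complete = SpikeRecordingMetadata(
    experiment_aim="example aim (illustrative only)",
    preparation="example tissue preparation",
    stimulus="example stimulus description",
    sampling_rate_hz=25000.0,
    channel_map={0: "electrode A1", 1: "electrode A2"},
)

# A waveform uploaded with no context: the structure is fine, the meaning is lost.
bare = SpikeRecordingMetadata(
    experiment_aim="example aim (illustrative only)",
    preparation="example tissue preparation",
    stimulus="",
    sampling_rate_hz=25000.0,
    channel_map={},
)

print(json.dumps(asdict(complete), indent=2))
print(bare.missing_fields())
```

A repository could refuse or flag uploads for which `missing_fields()` is non-empty, so that the waveform never becomes separated from its context.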
Data Sharing in bioinformatics
• Data Sharing was an early tradition in biology.
• Gene patenting, NDAs and the like came as quite a surprise
• Many political battles were fought, culminating in the Clinton/Blair statement
Data Sharing in Neurosciences
• The data is easy to structure, but the metadata is not
– There is, therefore, less point to sharing data
• Many neuroscientists come from a medical background
– This tends to be a more hierarchical, secretive profession, with everyone worried about getting sued
• A lot of neuroscientists use invasive, live animal experiments
– Security is more than a passing concern
The difference in neuroscience
• Less data sharing tradition
• No rich ecosystem of tools
• Higher barrier to entry for metadata
• Larger datasets
Virtual Laboratory Node
[Architecture diagram. Components: Data; Metadata; Compute Cluster on which Services are Dynamically Deployed; Web Portal; Rich Clients; Security; Workflow Enactment Engine; Registry; Service Repository]
• Search for Data & Analysis Code
• Raw Signal Data Search & Visualisation
• Deployment of Data & Analysis Code in Processes
• Raw & Derived Data File Store
• Security Policies Controlling Access to Data & Code
• Structured Metadata Store Enabling Search & Annotation
• Analysis & Model Code Store
CARMEN: Development Timeline
• Metadata (April 2008): structured metadata allowing data and analysis code to be described and searched
• Data and Scripting Support (April 2008): support for extended range of data formats and scripting languages
• Security (April 2008): security allowing access to data and analysis code to be controlled
• Provenance (July 2008): provenance of analysis and modelling processes leading to scientific results
• CARMEN v1.0 (October 2008): release of CARMEN v1.0; virtual laboratory nodes open to the CARMEN consortium
• CARMEN v2.0 (October 2009): release of CARMEN v2.0; virtual laboratory nodes “networked”
Virtual Laboratory Infrastructure
[Diagram: two Virtual Laboratory Nodes, each with Data, Metadata, Compute Cluster on which Services are Dynamically Deployed, Web Portal, Rich Clients, Security, Workflow Enactment Engine, Registry and Service Repository]
Networked Nodes at Newcastle and York.
More planned …
Vision – Global Laboratory
[Diagram: six Virtual Laboratory Nodes, each with Data, Metadata, Compute Cluster on which Services are Dynamically Deployed, Web Portal, Rich Clients, Security, Workflow Enactment Engine, Registry and Service Repository]
Some Unexpected Advantages
• Big problem with bioinformatics services
• Over time they tend to disappear
• CARMEN keeps services and data together
• This means we should be able to rerun analyses later.
• We should be able to store provenance
What is Provenance
What does it mean to rerun an experiment?
• Replicability: one scientist should be able to repeat another’s experiment, under equivalent conditions, at a different time.
• Rerunability: a scientist should be able to apply an equivalent technique under new circumstances.
• The addition of services into this mix complicates the issue.
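One way to make the distinction computable is to store, for every run, a provenance record of which workflow ran, on which data, and with which service versions. The sketch below compares two such records; the names `RunRecord` and `compare`, the labels returned, and the rule that "same data and same service versions" counts as a replication are assumptions made for illustration, not CARMEN's provenance model.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict


def digest(data: bytes) -> str:
    """Fingerprint input data so two runs can be checked for identical inputs."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class RunRecord:
    """Minimal provenance for one workflow run (hypothetical fields)."""
    workflow_id: str
    input_digest: str
    service_versions: Dict[str, str]   # service name -> version string


def compare(original: RunRecord, later: RunRecord) -> str:
    """Classify a later run relative to an original one."""
    if original.workflow_id != later.workflow_id:
        return "different experiment"
    same_data = original.input_digest == later.input_digest
    same_services = original.service_versions == later.service_versions
    if same_data and same_services:
        # Equivalent conditions at a different time.
        return "replication"
    changed = []
    if not same_data:
        changed.append("data")
    if not same_services:
        changed.append("services")
    # An equivalent technique under new circumstances.
    return "rerun (changed: " + ", ".join(changed) + ")"


first = RunRecord("spike-sort", digest(b"recording-1"), {"detector": "1.0"})
repeat = RunRecord("spike-sort", digest(b"recording-1"), {"detector": "1.0"})
new_data = RunRecord("spike-sort", digest(b"recording-2"), {"detector": "1.0"})
new_service = RunRecord("spike-sort", digest(b"recording-1"), {"detector": "2.0"})

for run in (repeat, new_data, new_service):
    print(compare(first, run))
```

Keeping services and data in the same place is what makes it feasible to fill in `service_versions` reliably; with remote third-party services, the version in use at the time is often unknown.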
[Diagram: Replicability and Rerunability plotted against two axes, Old Data / New Data and Old Services / New Services]
• Is the specification of what happened actually right?
• Has the state of the world advanced since previously?
• Has the world changed, in a comparable way?
• Has the service changed in a comparable way?
[Diagram labels: Error-Prone Neuroscientist; Eager Neuroscientist; Neuroscientist comparing to existing work; Tool Builder]
There is a difficulty
• There is less tradition of data sharing
• The tendency to want to control data is much larger
• If we want to data mine, we have to cope with “data is mine”
• If we have many different repositories, this needs to be supported computationally
[Diagram: six Virtual Laboratory Nodes, each with Data, Metadata, Compute Cluster on which Services are Dynamically Deployed, Web Portal, Rich Clients, Security, Workflow Enactment Engine, Registry and Service Repository]
An Example: Licensing
• Computationally amenable licenses are available
• Take, for example, Creative Commons
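As a rough sketch of what "computationally amenable" could mean, the code below parses a Creative Commons short code into its standard elements (BY, NC, ND, SA) and answers a simple question about an intended use. The names `CCLicence`, `parse_cc` and `permits` are hypothetical, version numbers and CC0 are not handled, and the check is a deliberate simplification of the licence terms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CCLicence:
    """The four standard Creative Commons licence elements."""
    attribution: bool       # BY
    non_commercial: bool    # NC
    no_derivatives: bool    # ND
    share_alike: bool       # SA


def parse_cc(code: str) -> CCLicence:
    """Parse a short code such as 'CC BY-NC-SA' into its elements."""
    text = code.upper().strip()
    if text.startswith("CC"):
        text = text[2:]
    elements = set(text.replace(" ", "-").split("-")) - {""}
    unknown = elements - {"BY", "NC", "ND", "SA"}
    if unknown:
        raise ValueError(f"unrecognised licence elements: {sorted(unknown)}")
    return CCLicence(
        attribution="BY" in elements,
        non_commercial="NC" in elements,
        no_derivatives="ND" in elements,
        share_alike="SA" in elements,
    )


def permits(licence: CCLicence, commercial: bool, derivative: bool) -> bool:
    """Simplified check: may the data be used in this way at all?"""
    if commercial and licence.non_commercial:
        return False
    if derivative and licence.no_derivatives:
        return False
    return True


shared = parse_cc("CC BY-NC-SA")
print(permits(shared, commercial=False, derivative=True))
print(permits(shared, commercial=True, derivative=False))
```

A repository holding data from many labs could run a check like this before including a data set in a combined analysis, rather than relying on each user to read every licence.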
Conclusions
• Automated workflows have been applied very successfully in bioinformatics.
• But applying these directly to neuroinformatics is a different issue.
• Technology has to fit the domain.
• We are investigating metadata for describing neuroinformatics
myGrid acknowledgements
Carole Goble, Norman Paton, Robert Stevens, Anil Wipat, David De Roure, Steve Pettifer
• OMII-UK Tom Oinn, Katy Wolstencroft, Daniele Turi, June Finch, Stuart Owen, David Withers, Stian Soiland, Franck Tanoh, Matthew Gamble.
• Research Martin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, Antoon Goderis, Alastair Hampshire, Qiuwei Yu, Wang Kaixuan.
• Current contributors Matthew Pocock, James Marsh, Khalid Belhajjame, PsyGrid project, Bergen people, EMBRACE people.
• User Advocates and their bosses Simon Pearce, Claire Jennings, Hannah Tipney, May Tassabehji, Andy Brass, Paul Fisher, Peter Li, Simon Hubbard, Tracy Craddock, Doug Kell.
• Past Contributors Matthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizaukaus, Kevin Glover, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Juri Papay, Savas Parastatidis, Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Victor Tan, Paul Watson, and Chris Wroe.
• Industrial Dennis Quan, Sean Martin, Michael Niemi (IBM), Chimatica.
• Funding EPSRC, Wellcome Trust.
Acknowledgements
Professor Colin Ingram, Professor Jim Austin, Professor Leslie Smith, Professor Paul Watson, Dr. Stuart Baker, Professor Roman Borisyuk, Dr. Stephen Eglen, Professor Jianfeng Feng, Dr. Kevin Gurney, Dr. Tom Jackson, Dr. Marcus Kaiser, Dr. Phillip Lord, Dr. Paul Overton, Dr. Stefano Panzeri, Dr. Rodrigio Quian Quiroga, Dr. Simon Schultz, Dr. Evelyne Sernagor, Dr. V. Anne Smith, Dr. Tom Smulders, Professor Miles Whittington, Christoph Echtermeyer, Martyn Fletcher, Frank Gibson, Mark Jessop, Dr. Bojian Liang, Juan Martinez-Gomez, Dr. Chris Mountford, Agah Ogungboye, Georgios Pitsilis, Dr. Daniel Swan