data integration, web services and workflow management


Page 1: Data integration, web services and workflow management

P. Romano, Tutorial BITS2005 1

Data integration, web services and workflow management

Paolo Romano

National Cancer Research Institute, Genova

([email protected])


Summary

Information and data integration
Web Services
CABRI and TP53 databases
Implementation of Web Services (Soaplab)
Workflow management
Demo: execution of workflows with Taverna


Information in biology

Biomedical research produces an increasing quantity of new information

Some domains, like genomics and proteomics, contribute to huge databases

Emerging domains, like mutation and variation analysis, polymorphisms, metabolism, and technologies, e.g., microarrays, will contribute even larger amounts of data


Information in biology

EMBL Data Library 74 (Mar 2003):
o Sequences: 23,234,788, Bases: 30,356,786,718

EMBL Data Library 81 (Dec 2004):
o Sequences: 40,696,839, Bases: 44,285,259,441
o WGS sequences: 5,408,558, Bases: 34,986,041,399

EMBL Data Library 82 (Mar 2005):
o Sequences: 43,246,005, Bases: 46,927,070,905
o WGS sequences: 6,228,397, Bases: 38,207,643,477
o Size: 7.3% more vs 81 (3 months), 112.9% more vs 74 (24 months)
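The slide does not say how the two growth percentages were computed. One reading that reproduces both figures is growth in the total number of entries (sequences plus WGS sequences), assuming release 74 had no separate WGS entries. That reading is an inference from the numbers, not something stated on the slide. A minimal Python sketch of the arithmetic:

```python
# Entry counts copied from the slide (EMBL Data Library releases).
# Assumption: release 74 lists no WGS sequences, so its WGS count is taken as 0.
releases = {
    74: {"sequences": 23_234_788, "wgs_sequences": 0},
    81: {"sequences": 40_696_839, "wgs_sequences": 5_408_558},
    82: {"sequences": 43_246_005, "wgs_sequences": 6_228_397},
}


def total_entries(release):
    """Total number of entries in a release: sequences plus WGS sequences."""
    counts = releases[release]
    return counts["sequences"] + counts["wgs_sequences"]


def growth_percent(new, old):
    """Percentage growth in total entries from release `old` to release `new`."""
    return 100.0 * (total_entries(new) - total_entries(old)) / total_entries(old)


growth_vs_81 = round(growth_percent(82, 81), 1)
growth_vs_74 = round(growth_percent(82, 74), 1)
print(growth_vs_81, growth_vs_74)
```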


Heterogeneity of databanks

Only a few databanks are managed in an almost homogeneous way by EBI, NCBI, DDBJ (sequence)

Many databanks are created by small groups or single researchers

Secondary databases are of high quality (good and extended annotation, quality control)

Many databases are highly specialized, e.g. by gene, organism, disease, mutation, etc…

Databanks are distributed: different DBMS, data structures, information, semantics, distribution methods


Software

Specialist software is essential for almost all analyses in molecular biology:
o Sequence analysis, secondary and tertiary protein structure prediction, gene prediction, molecular evolution, etc…

Software must interoperate with databases:
o Databases as input for software
o Results as new data to record and analyze


Goals of the integration

Integration is needed in order to:

o Achieve a better and wider view of all available information

o Carry out analyses and/or searches involving multiple databases and software tools automatically

o Perform analyses involving large data sets

o Carry out real data mining


Integration longevity

Integration needs stability:
o Standardization…
o Good domain knowledge
o Well defined data
o Well defined goals

Integration fears:
o Heterogeneity of data and systems
o Uncertain domain knowledge
o Fast evolution of data
o Highly specialized data
o Lack of predefined, clear goals
o Originality, experimentalism (“let me see if this works”)


Integration of biological information

In biology:

o Goals and needs of researchers evolve very quickly according to new theories and discoveries

o Pre-analysis and reorganization of the data are very difficult, because data and related knowledge vary continuously

o Complexity of information makes it difficult to design data models which can be valid for different domains and over time


Integration methods

Explicit (reciprocal) links (xrefs)
Implicit links (e.g., names)
Common contents (vocabularies)
Shared data models and schemas
Ontologies
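The difference between the first two methods can be illustrated with a small Python sketch. All records, identifiers and field names below are invented for illustration and do not come from any real databank. Explicit links are read directly from a cross-reference field, while implicit links have to be inferred by matching names after normalization.

```python
# Hypothetical records from two databanks (invented for illustration only).
catalogue = [
    {"id": "CAT:001", "name": "HeLa", "xrefs": ["LIT:42"]},
    {"id": "CAT:002", "name": "K-562", "xrefs": []},
]
literature = [
    {"id": "LIT:42", "mentions": "HeLa"},
    {"id": "LIT:43", "mentions": "k-562 "},
]


def explicit_links(records):
    """Explicit links: pairs stored in the records as cross-references."""
    return {(r["id"], target) for r in records for target in r["xrefs"]}


def implicit_links(records, others):
    """Implicit links: pairs inferred by comparing normalized names."""
    by_name = {}
    for other in others:
        key = other["mentions"].strip().lower()
        by_name.setdefault(key, []).append(other["id"])
    return {
        (r["id"], other_id)
        for r in records
        for other_id in by_name.get(r["name"].strip().lower(), [])
    }


print(explicit_links(catalogue))
print(implicit_links(catalogue, literature))
```

Name matching finds a link that no cross-reference records, but it depends on the normalization rule, which is why shared vocabularies and ontologies appear further down the list.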


Web Services

XML-based network services
Use standard protocols (SOAP, HTTP)
Standards available for their discovery and identification (UDDI), description (WSDL) and composition (WSFL)
Allow software applications to access data “intelligently”: identification of contents, interpretation of semantic information
Metadata needed
Web Services implemented by many institutes and service nodes (EBI, NCBI, ...)
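To make "XML-based" concrete, here is a minimal Python sketch that builds a SOAP 1.1 request envelope with the standard library. The envelope namespace is the standard SOAP 1.1 one. The service namespace, the operation name `searchByName` and its parameter are hypothetical and do not describe any real service.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:catalogue"  # hypothetical service namespace


def build_soap_request(operation, params):
    """Return a SOAP 1.1 envelope whose body calls `operation` with `params`."""
    ET.register_namespace("soapenv", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        # Each parameter becomes a child element of the operation element.
        ET.SubElement(call, name).text = str(value)
    return ET.tostring(envelope, encoding="unicode")


# Hypothetical operation and parameter, for illustration only.
request_xml = build_soap_request("searchByName", {"name": "HeLa"})
print(request_xml)
```

In practice this XML would be sent as the body of an HTTP POST, and a SOAP toolkit would usually generate it from the service's WSDL description.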


WSDL: the description

Web Services Description Language (WSDL)

Standard for the description of Web Services
Defines the location, access methods and detailed description of a service
Covers both abstract functionality and practical details
WSDL Binding: implementation for SOAP, HTTP, MIME
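The split between abstract functionality and practical details can be seen by reading a description programmatically. The sketch below parses a stripped-down WSDL 1.1 fragment. The fragment is invented for illustration, omits messages and types, and is not a complete or validated description. It extracts the abstract operation names and the concrete endpoint address.

```python
import xml.etree.ElementTree as ET

NS = {
    "wsdl": "http://schemas.xmlsoap.org/wsdl/",       # WSDL 1.1 namespace
    "soap": "http://schemas.xmlsoap.org/wsdl/soap/",  # WSDL SOAP binding namespace
}

# Hypothetical, simplified description (names and address are invented).
SAMPLE_WSDL = """<definitions xmlns="http://schemas.xmlsoap.org/wsdl/"
    xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/"
    xmlns:tns="urn:example:catalogue"
    name="CatalogueService" targetNamespace="urn:example:catalogue">
  <portType name="CataloguePort">
    <operation name="searchByName"/>
    <operation name="getById"/>
  </portType>
  <binding name="CatalogueSoapBinding" type="tns:CataloguePort">
    <soap:binding style="rpc" transport="http://schemas.xmlsoap.org/soap/http"/>
  </binding>
  <service name="CatalogueService">
    <port name="CataloguePort" binding="tns:CatalogueSoapBinding">
      <soap:address location="http://example.org/soap/catalogue"/>
    </port>
  </service>
</definitions>"""


def describe(wsdl_text):
    """Return the abstract operations and the concrete endpoint of a WSDL text."""
    root = ET.fromstring(wsdl_text)
    operations = [
        op.get("name")
        for op in root.findall("wsdl:portType/wsdl:operation", NS)
    ]
    address = root.find("wsdl:service/wsdl:port/soap:address", NS)
    return {"operations": operations, "location": address.get("location")}


summary = describe(SAMPLE_WSDL)
print(summary)
```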


CABRI: Objectives

Common Access to Biological Resources and Information

Setting Quality Management Guidelines
Distributing biological resources of the highest quality
Integrating searches and access to catalogues
Ad hoc search (CABRI Simple Search)
Shopping cart (pre-ordering facility)


CABRI: Partners and resources

Partners:
o BCCM, CABI, CBS, CIP, DSMZ, ECACC, ICLC, NCCB, NCIMB (culture collections)
o IST, CERDIC (ICT)

Resources:
o Microorganisms (bacteria, yeasts, fungi strains)
o Animal cells (animal and human cell lines, hybridomas, HLA typed B lines)
o Plasmids, phages, viruses, DNA probes
o Overall, more than 110,000 biological resources


CABRI: SRS

Reasons why:
o Manages heterogeneous databases
o Flat file format
o Simple and effective interface
o Internal and external links
o Link operator
o Easily expandable (new databases)
o Flexibility in creation of indexes


CABRI: data structure

For each material, three data sets are identified:

Minimum Data Set (MDS): essential data, needed to identify individual resources

Recommended Data Set (RDS): all data that are useful to describe individual resources

Full Data Set (FDS): all data available on the resources


CABRI: data structure

For each item of information, data input and authentication guidelines are defined, including:

Detailed textual description of the information
In-house reference lists of terms and controlled vocabularies
Predefined syntaxes (e.g., Literature, scientific names)


CABRI: Name field

Field: Name

Description: Full scientific and most recent name of the strain. It includes:
o Genus name and species epithet
o Subspecies
o Pathovar
o Authors of the name
o Year of valid publication or validation
o Approbation of the name

Input process: Enter the full scientific name as given by the depositor and confirmed (or changed) by the collection. Names of the authors of the name, year of valid publication or validation, and approbation are included after a comma.

Values for approbation:
o AL = approved list, cf. IJSB 1980
o VL = validation list, in IJSB after 1980
o VP = validly published, paper in IJSB after 1980

Reference list: DSMZ list of bacterial names

Required for MDS


CABRI: Reference paper field

Field: Reference paper

Description: Original paper [if available]

Input process: New entries:

JournalTitle Year; Volume(issue): beginning page#-ending page#

The journal title is abbreviated following international standard rules (ISSN).

Abbreviations are written without dots. Authors and title of the article are not mentioned.

The reference can be followed by the PubMed ID enclosed within square brackets as follows:

[PMID: 1234567], where '1234567' is the PubMed ID of the paper

Required for MDS
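A predefined syntax like this one can be checked automatically at data input. The Python sketch below is one possible validator, not the check CABRI actually uses. It assumes single spaces exactly as in the pattern above, a four-digit year, a numeric volume and an optional issue. The sample reference is made up for illustration.

```python
import re

# Assumed reading of: JournalTitle Year; Volume(issue): first page-last page [PMID: n]
REFERENCE_RE = re.compile(
    r"^(?P<journal>[^.;]+?) "                   # abbreviated title, written without dots
    r"(?P<year>\d{4}); "                        # four-digit year, then "; "
    r"(?P<volume>\d+)(?:\((?P<issue>[^)]+)\))?: "  # volume, optional (issue), then ": "
    r"(?P<first_page>\d+)-(?P<last_page>\d+)"   # page range
    r"(?: \[PMID: (?P<pmid>\d+)\])?$"           # optional PubMed ID in square brackets
)


def parse_reference(text):
    """Return the parts of a reference as a dict, or None if the syntax is wrong."""
    match = REFERENCE_RE.match(text.strip())
    return match.groupdict() if match else None


# Made-up example; the PMID is the placeholder value used in the guideline.
good = parse_reference("Int J Cancer 1999; 83(2): 123-130 [PMID: 1234567]")
bad = parse_reference("Int. J. Cancer 1999; 83(2): 123-130")  # dots are not allowed
print(good)
print(bad)
```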


CABRI: integration

For each material:
o Common data structure and syntax
o Integrated searches/results through SRS

For each catalogue:
o SRS and HTML links to reference dbs (media, synonyms, hazard, etc…)

For many catalogues:
o Explicit links to Medline, EMBL, plasmid maps


IARC TP53 database

IARC TP53 Mutation Database: http://www.iarc.fr/p53/
o Release 9: 19,809 somatic mutations, 1,769 papers
o Information: mutation, source, patient’s life style
o Vocabularies and standardized annotations
o On-line queries imply human interaction

SRS implementation of the TP53 Database: http://srs.o2i.it/srs71/
o SRS based service
o Definition of an ad hoc DTD
o XML based data interchange
o Improved automated accessibility


CABRI and TP53 Web Services

Implementing web services that allow:
o The retrieval of information from CABRI and TP53 databases by using remote calls to SRS
o The possibility of including such services in complex workflows

Reproducing current behaviour:
o Search by name, identifier and free text (CABRI)
o Search by interesting properties (TP53)
o Combine results
o Integrate data with other sources by using IDs/common terms

Two types of services:
o Search for a specific feature and return ID
o Search for an ID and return full record (or predefined sections)
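The two service types compose naturally: the IDs returned by the first feed the second. The Python sketch below shows that pattern against an in-memory stand-in for a remote catalogue. The records, identifiers and function names are hypothetical, and no real CABRI or TP53 service is called.

```python
# Hypothetical in-memory stand-in for a remote catalogue (invented data).
RECORDS = {
    "X0001": {"name": "HeLa", "species": "human"},
    "X0002": {"name": "HeLa S3", "species": "human"},
    "X0003": {"name": "Vero", "species": "monkey"},
}


def search_ids_by_name(name):
    """Type 1 service: search for a feature and return matching IDs only."""
    wanted = name.lower()
    return sorted(
        record_id
        for record_id, record in RECORDS.items()
        if wanted in record["name"].lower()
    )


def get_record_by_id(record_id, fields=None):
    """Type 2 service: take an ID and return the full record or selected sections."""
    record = RECORDS[record_id]
    if fields is None:
        return dict(record)
    return {field: record[field] for field in fields}


def search_and_fetch(name, fields=None):
    """Chain the two services: search first, then fetch each hit."""
    return [get_record_by_id(i, fields) for i in search_ids_by_name(name)]


print(search_ids_by_name("hela"))
print(search_and_fetch("vero", fields=["name"]))
```

Keeping the two steps separate lets a workflow filter, merge or cross-reference the ID lists before any full record is retrieved.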


Soaplab: SOAP-based Analysis Web Service

“Soaplab is a set of Web Services providing a programatic access to some applications on remote computers. It is often referred to as an Analysis (Web) Service” (Martin Senger, EBI).

It allows for the implementation of Web Services offering access to:
o local command-line applications
o EMBOSS
o contents of ordinary web pages (GowLab)

Requirements:
o Apache Tomcat servlet engine and Axis SOAP toolkit
o Java
o Perl
o MySQL



Soaplab

    appl: getCellLineIdsByName [
      documentation: "Get cell lines by name from CABRI human and animal cell lines catalogues (see www.cabri.org)"
      groups: "CABRI"
      nonemboss: "Y"
      comment: "launcher get"
      supplier: "http://www.cabri.org/CABRI/srs-bin/wgetz"
      comment: "method [{$libs}-nam:'$name'] -ascii"
    ]

    string: libs [ parameter: "Y" ]
    string: name [ parameter: "Y" ]

    outfile: result [ ]


Soaplab

    appl: getCellLineIdsByProperty [
      documentation: "Get cell lines by properties (all text) from CABRI human and animal cell lines catalogues (see www.cabri.org)"
      groups: "CABRI"
      nonemboss: "Y"
      comment: "launcher get"
      supplier: "http://www.cabri.org/CABRI/srs-bin/wgetz"
      comment: "method [{$libs}-all:'$text'] -ascii"
    ]

    string: libs [ parameter: "Y" ]
    string: text [ parameter: "Y" ]

    outfile: ids [ ]


Soaplab

    appl: getCellLinesById [
      documentation: "Get cell lines by Id from CABRI human and animal cell lines catalogues (see www.cabri.org)"
      groups: "CABRI"
      nonemboss: "Y"
      comment: "launcher get"
      supplier: "http://www.cabri.org/CABRI/srs-bin/wgetz"
      comment: "method -e [{$libs}:'$id'] -ascii"
    ]

    string: libs [ parameter: "Y" ]
    string: id [ parameter: "Y" ]

    outfile: result [ ]
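In the three definitions above, the "method" comment acts as a query template in which `$libs`, `$name`, `$text` and `$id` are replaced by the declared string parameters. The Python sketch below imitates that substitution. It is not Soaplab code. The library names are placeholders, the braces are assumed to be part of the SRS query and are kept as they are, and the percent-encoding of the query into a URL is an assumption about how the request might be sent.

```python
import re
from urllib.parse import quote, unquote

SUPPLIER = "http://www.cabri.org/CABRI/srs-bin/wgetz"  # supplier URL from the definitions

# Method templates copied from the three definitions above.
METHODS = {
    "getCellLineIdsByName": "[{$libs}-nam:'$name'] -ascii",
    "getCellLineIdsByProperty": "[{$libs}-all:'$text'] -ascii",
    "getCellLinesById": "-e [{$libs}:'$id'] -ascii",
}


def expand(template, params):
    """Replace every $parameter in the template with its value."""
    return re.sub(r"\$(\w+)", lambda match: params[match.group(1)], template)


def build_url(service, params):
    """Assumed URL form: supplier URL, '?', then the percent-encoded query."""
    query = expand(METHODS[service], params)
    return SUPPLIER + "?" + quote(query, safe="")


# LIB_A and LIB_B are placeholder library names, not real CABRI catalogues.
query = expand(METHODS["getCellLineIdsByName"], {"libs": "LIB_A LIB_B", "name": "hela"})
url = build_url("getCellLinesById", {"libs": "LIB_A", "id": "X0001"})
print(query)
print(url)
```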


Workflow management

“A computerized facilitation or automation of a business process, in whole or part” (Workflow Management Coalition).

Main goal is:
o the implementation of data analysis processes in standardized environments

Main advantages relate to:
o effectiveness: being an automatic procedure, it frees bio-scientists from repetitive interactions with the web and it supports good practice,
o reproducibility: analysis can be replicated over time,
o reusability: intermediate results can be reused,
o traceability: the workflow is carried out in a transparent analysis environment where data provenance can be checked and/or controlled.
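Reproducibility and traceability can be illustrated with a toy workflow runner in Python. This is a sketch of the idea only, not how any particular workflow system works. The step names and input data are invented, and provenance is reduced to a short digest of each step's input and output.

```python
import hashlib
import json


def digest(value):
    """Short, stable fingerprint of a JSON-serializable value."""
    encoded = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()[:12]


def run_workflow(steps, initial_input):
    """Run the steps in order and record a provenance entry for each one."""
    provenance = []
    data = initial_input
    for name, func in steps:
        result = func(data)
        provenance.append({
            "step": name,
            "input_digest": digest(data),
            "output_digest": digest(result),
        })
        data = result
    return data, provenance


# Hypothetical two-step analysis on invented input.
steps = [
    ("normalise_names", lambda names: [n.strip().lower() for n in names]),
    ("remove_duplicates", lambda names: sorted(set(names))),
]
result, provenance = run_workflow(steps, [" HeLa", "hela ", "Vero"])
print(result)
for entry in provenance:
    print(entry)
```

Running the same steps on the same input yields the same digests, which is a simple check that an analysis was replicated. The chain of digests shows which output fed which step.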


Workflow management

Workflow management software:
o Biopipe, an add-on to bioperl,
o GPipe, an extension of the Pise interface,
o Taverna (EBI), a component of the myGrid platform,
o Wildfire (Bioinformatics Institute, Singapore),
o Pipeline Pilot (SciTegic).


Workflow management

Taverna Workbench:
o constructs complex analysis workflows
o accesses both remote and local processors
o defines alternative processors
o runs workflows
o visualizes the results
o includes a bioinformatics data ontology

Requirements: Java, Windows or Linux


Workflow management

WSDL services
o Web Service Description Language (WSDL) file: adds WSDL based service nodes

Soaplab servers
o Soaplab server: adds a list of Soaplab provided services

Biomoby registries
o Moby Central repository: determines hosts and their services

Workflows
o XScufl definition file: adds the workflow as a node and processors as child nodes

Biomart databases
o Biomart data warehouse: adds all available data sets

Local processors
o Simple list/string processors, constant values, beanshell scripts


Demo: workflows for CABRI dbs


Demo: workflows for TP53 dbs


Some acknowledgements…

This work has partially been supported by the Italian Ministry for Education, University and Research (MIUR), project “Oncology over Internet” (2002 – 2005)

I wish to thank my colleagues:

Domenico Marra (TP53 databases and Soaplab),

Federico Malusa (CABRI databases),

Francesca Piersigilli (CABRI databases)


…and an announcement!

Workflows management: new abilities for the biological information overflow

October 5 - 7, 2005, University of Naples

Naples, Italy

Workshop NETTAB 2005: http://www.nettab.org/2005/

Take a brochure!