Page 1: Data Access & Integration in the ISPIDER Proteomics Grid

N. Martin – A. Poulovassilis – L. Zamboulis
{nigel,ap,lucas}@dcs.bbk.ac.uk

Page 2: Project Details

• Members
  • Birkbeck College
  • European Bioinformatics Institute
  • University of Manchester
  • University College London

Page 3: Problem Definition

• Biological repositories
  • In separate locations: interoperability problems
  • Rapidly updated/modified/evolved
  • Overlapping data
  • Need processing power

Pages 4–8: ISPIDER Objectives

[Five slides of figures; no text in the transcript]

Page 9: Middleware (1/2)

• myGrid: collection of services/components allowing high-level integration of data/applications
  • Taverna Workbench
• AutoMed heterogeneous data integration system
• OGSA-DAI
  • OGSA-DQP

Page 10: Middleware (2/2)

• AutoMed toolkit: heterogeneous data integration system, developed by Birkbeck College/Imperial College
  • Subsumes traditional data integration approaches
  • Handles various data models – easily extensible
  • Virtual/materialised/hybrid integration
  • Data warehousing tools
  • Schema evolution

Page 11: GAV & LAV Approaches

• Global-As-View (GAV) approach: describe global schema (GS) constructs with view definitions over local schema (LSi) constructs

• Local-As-View (LAV) approach: describe LSi constructs with view definitions over GS constructs

[Figure: a Global Schema linked by view definitions to three Local Schemas, over RDB, XML file and RDF sources]

Page 12: GAV Example

• student(id,name,left,degree) = [{x,y,z,w} | ⟨x,y,z,w,_⟩ ∈ ug ∧ ¬⟨x,_,_,_,_⟩ ∈ phd ∨ ⟨x,y,z,w,_⟩ ∈ phd ∧ w = ‘phd’]

• monitors(sno,id) = [{x,y} | ⟨x,_,_,_,y⟩ ∈ ug ∧ ¬⟨x,_,_,_,_⟩ ∈ phd ∨ ⟨x,y⟩ ∈ supervises]

• staff(sno,sname,dept) = [{x,y,z} | ⟨x,y,z,w,_⟩ ∈ tutor ∧ ¬⟨x,_,_⟩ ∈ supervisor ∨ ⟨x,y,z⟩ ∈ supervisor]
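As an illustration, the student view can be read as a query over the local relations. The Python sketch below is one possible reading, assuming that ug and phd are sets of 5-tuples whose first four positions are (id, name, left, degree), and that a student who appears in phd is taken from phd rather than from ug. The sample rows are invented for the example.

```python
# Hypothetical local relations; the column layouts and rows are assumptions.
ug = {
    (1, "Ann", 2005, "bsc", 10),
    (2, "Bob", 2006, "ba", 11),
}
phd = {
    (2, "Bob", 2009, "phd", 12),
    (3, "Cy", 2010, "phd", 12),
}


def student(ug, phd):
    """GAV-style view: the global relation is computed from the local ones."""
    phd_ids = {row[0] for row in phd}
    # Undergraduates who do not also appear in phd.
    from_ug = {(x, y, z, w) for (x, y, z, w, _) in ug if x not in phd_ids}
    # PhD students, restricted to rows whose degree is 'phd'.
    from_phd = {(x, y, z, w) for (x, y, z, w, _) in phd if w == "phd"}
    return from_ug | from_phd


print(sorted(student(ug, phd)))
```

Under GAV, a query on student is answered by unfolding it into this definition, so the local relations are only read when the global relation is queried.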

Page 13: Both-As-View (BAV)

• Schema transformation approach

• For each pair (LSi,GS): incrementally modify LSi/GS to match GS/LSi

[Figure: a Global Schema linked by transformation pathways to three Local Schemas, over RDB, XML file and RDF sources]

Page 14: BAV Example

• Transformation pathway consists of primitive transformations
• Pathway contains both GAV & LAV definitions
• Transformations are automatically reversible
• Metadata in AutoMed Repository

Pathway S1 → Sg:
S1 → I1: add(C1,q1)
I1 → I2: add(C2,q2)
I2 → I3: add(C3,q3)
I3 → I4: add(C4,q4)
I4 → I5: rename(C5,C6)
I5 → I6: delete(C7,q5)
I6 → Sg: delete(C9,q6)

Reverse pathway Sg → S1:
Sg → I6: add(C9,q6)
I6 → I5: add(C7,q5)
I5 → I4: rename(C6,C5)
I4 → I3: delete(C4,q4)
I3 → I2: delete(C3,q3)
I2 → I1: delete(C2,q2)
I1 → S1: delete(C1,q1)
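Automatic reversibility can be sketched in a few lines of Python. The tuple encoding of a primitive transformation used here, (operation, arg1, arg2), is an assumption made for the example and is not the AutoMed API. Each add becomes a delete with the same arguments, each delete becomes an add, a rename swaps its arguments, and the steps are applied in the opposite order.

```python
def reverse_step(step):
    """Return the inverse of one primitive transformation."""
    op, a, b = step
    if op == "add":
        return ("delete", a, b)
    if op == "delete":
        return ("add", a, b)
    if op == "rename":
        return ("rename", b, a)
    raise ValueError(f"unknown primitive: {op}")


def reverse_pathway(pathway):
    """Invert every step and apply the steps in the opposite order."""
    return [reverse_step(step) for step in reversed(pathway)]


# The S1 -> Sg pathway from the slide, in the assumed tuple encoding.
s1_to_sg = [
    ("add", "C1", "q1"), ("add", "C2", "q2"), ("add", "C3", "q3"),
    ("add", "C4", "q4"), ("rename", "C5", "C6"),
    ("delete", "C7", "q5"), ("delete", "C9", "q6"),
]
sg_to_s1 = reverse_pathway(s1_to_sg)
```

Reversing twice gives back the original pathway, which is why only one direction needs to be stored.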

Page 15: Schema Evolution Example

• No need to redefine pathway

• Instead, simply describe the evolution as a pathway

• Automatic in most cases
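One way to picture this, assuming pathways are reversible lists of primitive steps as on the BAV slide: if a source schema S evolves to S', a pathway from S' to GS can be obtained by undoing the evolution and then reusing the existing pathway S → GS. The encoding and function names below are hypothetical, and the sketch ignores any follow-up changes that might be wanted on the global schema.

```python
INVERSE = {"add": "delete", "delete": "add"}


def reverse_pathway(pathway):
    """Invert each (operation, arg1, arg2) step, last step first."""
    out = []
    for op, a, b in reversed(pathway):
        if op == "rename":
            out.append(("rename", b, a))
        else:
            out.append((INVERSE[op], a, b))
    return out


def evolve_source(old_to_global, evolution):
    """Pathway S' -> GS: undo the evolution S -> S', then reuse S -> GS."""
    return reverse_pathway(evolution) + old_to_global


old_to_global = [("add", "C1", "q1"), ("rename", "C5", "C6")]
evolution = [("add", "C8", "q8")]  # hypothetical change to the source schema
new_to_global = evolve_source(old_to_global, evolution)
```

The existing pathway is kept unchanged; only the description of the evolution is new.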

Page 16: Interoperability

• Sources wrapped with OGSA-DAI
• AutoMed wrappers extract source metadata
• Integration using AutoMed
• Queries submitted:
  • Reformulated using AutoMed metadata
  • Submitted to OGSA-DQP

[Figure: architecture diagram. Components: AutoMed Query Processor, Global AutoMed Schema, AutoMed Metadata Repository, AutoMed DQP wrapper, Distributed Query Processor, OGSA-DQP QDQS, three OGSA-DQP QES, three OGSA-DAI GDS, three AutoMed DAI wrappers (AutoMed Wrappers) with their AutoMed Schemas, and the source DBs including Pedro. Arrows: IQL query, OQL query, OQL result, IQL result, transformation pathways.]

Page 17: Schema extraction

• AutoMed wrapper requests the schema of the data source using an OGSA-DAI service

• The service replies with the source schema encoded in XML

• The AutoMed wrapper creates the corresponding schema in the AutoMed repository

[Figure: the AutoMed DAI wrapper sends a schema request to the OGSA-DAI GDS over the DB, receives a response document, and creates the AutoMed Schema]
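The three steps can be sketched in Python as follows. The XML layout of the response document, the dictionary used as a stand-in for the AutoMed repository, and the name extract_schema are all assumptions made for illustration; the real OGSA-DAI response format and AutoMed wrapper API may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical response document; the real OGSA-DAI format may differ.
RESPONSE = """
<schema>
  <table name="protein">
    <column name="accession" type="VARCHAR"/>
    <column name="sequence" type="TEXT"/>
  </table>
  <table name="peptide">
    <column name="id" type="INTEGER"/>
  </table>
</schema>
"""


def extract_schema(xml_text):
    """Map each table name to its list of (column name, type) pairs."""
    root = ET.fromstring(xml_text.strip())
    return {
        table.get("name"): [
            (col.get("name"), col.get("type")) for col in table.findall("column")
        ]
        for table in root.findall("table")
    }


repository = {}  # stand-in for the AutoMed repository
repository["source1"] = extract_schema(RESPONSE)
```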

Page 18: Query Processing

• Query is submitted to AutoMed’s GQP:
  • Reformulated
  • Optimised
• AutoMed-DQP Wrapper:
  • IQL → OQL
  • Submits OQL to OGSA-DQP
  • OQL result → IQL result

[Figure: same architecture diagram as on the Interoperability slide]

Page 19: Query Processing

• OGSA-DQP:
  • Evaluates OQL query using QES
  • Sends OQL result back to AutoMed DQP Wrapper: OQL result → IQL result

[Figure: same architecture diagram as on the Interoperability slide]
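The order of the query-processing stages can be summarised in a short Python sketch. Every stage is passed in as a plain function; these are placeholders for the example and do not correspond to actual AutoMed or OGSA-DQP calls.

```python
def process_query(iql_query, reformulate, optimise, iql_to_oql,
                  submit_to_dqp, oql_to_iql):
    """Run one global query through the stages listed on the slides."""
    query = reformulate(iql_query)        # rewrite using AutoMed metadata
    query = optimise(query)
    oql_query = iql_to_oql(query)         # wrapper: IQL -> OQL
    oql_result = submit_to_dqp(oql_query) # OGSA-DQP evaluates the OQL query
    return oql_to_iql(oql_result)         # wrapper: OQL result -> IQL result


# Stub stages that only record the order in which they run.
trace = []


def stage(name):
    def run(value):
        trace.append(name)
        return value
    return run


result = process_query(
    "q",
    stage("reformulate"), stage("optimise"), stage("iql_to_oql"),
    stage("submit_to_dqp"), stage("oql_to_iql"),
)
```

Passing the stages as arguments keeps the sketch independent of any particular query language or service binding.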

Page 20: Future Work

• Workflow integration
  • AutoMed toolkit
  • Taverna Workbench
• Integration of services with XML input/output
• LAV mappings from XML to one or more RDFS ontologies

[Figure: source XML format, target XML format and RDFS ontology; Step 1 (manual), Step 2 (automatic)]

Page 21: Future Work

• AutoMed extensions:
  • Web/Grid Services for AutoMed
• Data warehousing
  • Materialised/hybrid integration
• Data provenance
• Incremental view maintenance
• Schema evolution

Page 22: Summary

• ISPIDER aims to:
  • Create an integrated platform of proteomic resources
  • Use existing resources – produce new ones
  • Create clients for querying, visualisation, etc.
• ISPIDER is using:
  • myGrid – middleware for in silico experiments in biology
  • OGSA-DQP – service-based distributed query processor
  • AutoMed – heterogeneous data integration system

Page 23: Project Members

• Birkbeck College
  • Nigel Martin
  • Alex Poulovassilis
  • Lucas Zamboulis (R.A.)
  • Hao Fan (former R.A.)
• European Bioinformatics Institute
  • Rolf Apweiler
  • Henning Hermjakob
  • Weimin Zhu
  • Chris Taylor
  • Phil Jones
  • Nisha Vinod
• University of Manchester
  • Simon Hubbard
  • Steve Oliver
  • Suzanne Embury
  • Norman Paton
  • Carol Goble
  • Robert Stevens
  • Khalid Belhajjame (R.A.)
  • Jennifer Siepen (R.A.)
• U.C.L.
  • David Jones
  • Christine Orengo
  • Melissa Pentony (R.A.)