Data Integration in Bioinformatics Using OGSA-DAI

The BioDA Project

Shirley Crompton, Brian Matthews (CCLRC)

Alex Gray, Andrew Jones,

Richard White (Cardiff University)

Overview

• Bioinformatics Data Access and Integration Requirements

– Generic

• BioDA Workshop and Questionnaire

– BDWorld-specific

• OGSA-DAI exemplar

The BioDA Project

• Independent Evaluation of OGSA-DAI
– the suitability of the software in its present form
– how to leverage OGSA-DAI in the bioinformatics GRID
• OGSA-DAI Product Improvement
– Feedback to the DAIT Team
• Knowledge Dissemination
– Evaluation Report
– Publications/Presentations
– Workshop on OGSA-DAI for the bioinformatics eResearch community

Bioinformatics

The application and development of computing and mathematics to the management, analysis and understanding of data to solve biological questions.

Attwood, TK and Parry-Smith, DJ, 1999

Data Management

Data Analysis

Grid Computing

“... flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions and resources…”

Foster, Kesselman and Tuecke, 2001

1st BioDA Workshop

• Objectives
– to examine the bioinformatics community’s needs for data access and integration (DAI) on the grid, and
– to explore the application of OGSA-DAI, middleware developed expressly to address the DAI requirements of eScience projects

The BioDA Survey

[Figure: Mean Scores by Requirement Categories (adjusted by the number of questions within each category). Horizontal axis: Requirement Category. Vertical axis: Mean Score, scale 0 to 5.]

The Results

17 key requirements; those at the top of the list include:

• schema integration

• schema mapping

• mixed language query

• complex join across databases

• provenance data

• flexible resource discovery

• RDF database access

The BioDA Exemplar

The BioDiversity World

• Create a GRID-based problem-solving environment.
• Enable collaborative exploration and analysis of global biodiversity patterns, using workflows and rich data sources from around the world.
• Example applications include modelling species distributions against climate change, conservation prioritisation, and linking evolutionary changes to past climates.

BDWorld (Source: BDWorld)

[Figure: BDWorld architecture diagram. Labels shown: User; Local tools; Problem Solving Environment user interface; Problem Solving Environment (Broker agents, Facilitator agents, Presentation agents); BDGrid; Ontology; Metadata; Intelligent links; Resource & analytic tool descriptions; Maintenance tools; Taxonomic index (Species 2000 & ITIS Catalogue of Life); Thematic data source; Abiotic data source; Analytic tool (shown twice); Proxy (repeated); GSD (repeated).]

BDWorld Data Resources: Key Issues

• geographically distributed and autonomous
– heterogeneous in structure and data standards
– mainly read via HTTP/XML protocols using custom wrappers
• SQL queries are limited to the EBI EMBL store and BDWorld cache databases
• potentially resource-intensive to harvest
– a single taxon name may resolve into a large number of ‘accepted’ taxon names
– same query repeated on different data collections
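
The harvesting cost noted above can be illustrated with a minimal sketch: if one submitted name resolves to several accepted names, and each must be queried against each data collection, the number of requests is the product of the two. All names, data and the `resolve` helper below are hypothetical stand-ins, not part of BDWorld.

```python
from itertools import product

def plan_harvest(submitted_name, resolve, collections):
    """List every (accepted name, collection) request a harvest would issue."""
    # Sorting a set removes duplicate names and gives a stable order.
    accepted = sorted(set(resolve(submitted_name)))
    return list(product(accepted, collections))

# Hypothetical lookup table standing in for a taxonomic index.
ACCEPTED_NAMES = {"name-x": ["accepted-a", "accepted-b", "accepted-c"]}

def resolve(name):
    """Return the accepted names for a submitted name (empty if unknown)."""
    return ACCEPTED_NAMES.get(name, [])

requests = plan_harvest("name-x", resolve, ["collection-1", "collection-2"])
print(len(requests))  # 3 accepted names x 2 collections
```

Planning the requests up front like this makes the fan-out visible before any remote resource is contacted.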

Resource Wrapping (Source: BDWorld)

[Figure: resource wrapping diagram. Labels shown: User; Workflow enactment engine; BDWorld-GRID Interface (BGI) with BGI API (shown twice); The GRID; Wrapper; Remote Resource.]

Implications for BioDA

• abstraction layer (BGI): proprietary invocation mechanism
– InvokeOperation(ResourceHandler, Operation, XmlDataCollection)
• prepared search statements defined in the individual data resource wrappers
• BGI protocols use BDW communication objects; search parameters and results are passed as XmlDataCollection
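
A minimal sketch of how such an invocation layer might look is given below. Only the names `InvokeOperation`, `XmlDataCollection` and the idea of prepared searches held in wrappers come from the slide; the Python signatures, the `Wrapper` class, the `REGISTRY` lookup and the `searchByName` operation are assumptions made purely for illustration and do not reflect the real BGI API.

```python
from dataclasses import dataclass, field

@dataclass
class XmlDataCollection:
    """Stand-in for the BDW communication object: a list of XML strings."""
    documents: list = field(default_factory=list)

class Wrapper:
    """Hypothetical resource wrapper holding its prepared search statements."""
    def __init__(self, prepared):
        self.prepared = prepared  # operation name -> callable

    def run(self, operation, params):
        if operation not in self.prepared:
            raise KeyError(f"unsupported operation: {operation}")
        return self.prepared[operation](params)

# Hypothetical mapping from resource handles to wrappers.
REGISTRY = {}

def invoke_operation(resource_handle, operation, params):
    """Dispatch an operation to the wrapper registered for a resource."""
    return REGISTRY[resource_handle].run(operation, params)

def search_by_name(params):
    """Toy prepared search: echo the first parameter inside a <hit> element."""
    return XmlDataCollection([f"<hit source='demo'>{params.documents[0]}</hit>"])

REGISTRY["demo-resource"] = Wrapper({"searchByName": search_by_name})
reply = invoke_operation("demo-resource", "searchByName",
                         XmlDataCollection(["<name>x</name>"]))
print(reply.documents)
```

The point of the sketch is that callers never build queries themselves: they can only name an operation the wrapper already defines, and both inputs and outputs travel in the same container type.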

BioDA Exemplar

• Two main possibilities within BDW:

1. Augment BGI to support the inclusion of queries in workflows, to be sent directly to OGSA-DAI-enabled databases.
• Distributed query processing facilities could assist in planning the execution and distribution of data-orientated parts of a workflow. (For the current status of OGSA-DQP see Section 4.)
– Very major revision to BDW protocols; also,
– many resources of interest are simply not exposed as databases.

2. Provide facilities within individual wrappers that benefit from OGSA-DAI.

OGSA-DAI Prototype (What we’d have liked)

[Figure: intended prototype data flow. Labels shown: OGSA-DAI Client; OGSA-DAI R5 GDS; BDWQueryActivity; Wrapper Module; Wrapper (three shown); Web DBs; deliverFromURL(xsl); deliverFromURL(url); XSLTransform; deliverToURL/GFTP.]

Steps shown in the figure:
1. BGI InvokeOperation()
2. Create GDS and query
3. Invoke wrapper
4. Query
5. Download URL
6. Download url
7. url
8. XSL transform to BDW format
9. To WF unit

Key Issues encountered

• Complex client-side coding to orchestrate the application flow
– requires several GDS perform requests…
• Difficult to synchronise
– remote web databases have different response times (or no response at all!)
• Different data transformation series are applicable to different data resources
• BDW protocols specify that data is returned as a BDW XmlDataCollection object
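
The synchronisation problem can be sketched in a few lines: query every source in parallel, wait up to a fixed deadline, and report which sources answered, failed or never responded. This is a generic Python illustration using the standard `concurrent.futures` module, not the OGSA-DAI client code; the three source functions and the timeout value are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

def query_all(sources, query, timeout_s):
    """Query every source in parallel; give up on stragglers after timeout_s."""
    pool = ThreadPoolExecutor(max_workers=max(1, len(sources)))
    futures = {pool.submit(fn, query): name for name, fn in sources.items()}
    done, not_done = wait(futures, timeout=timeout_s)
    results, failed = {}, []
    for future in done:
        name = futures[future]
        if future.exception() is None:
            results[name] = future.result()
        else:
            failed.append(name)
    timed_out = sorted(futures[f] for f in not_done)
    pool.shutdown(wait=False)  # return now; late answers are simply ignored
    return results, sorted(failed), timed_out

# Hypothetical sources standing in for remote web databases.
def fast_source(query):
    return f"fast:{query}"

def slow_source(query):
    time.sleep(0.5)  # answers only after the deadline used below
    return f"slow:{query}"

def broken_source(query):
    raise ConnectionError("no response")

SOURCES = {"fast": fast_source, "slow": slow_source, "broken": broken_source}
print(query_all(SOURCES, "q", timeout_s=0.1))
```

Note that Python threads cannot be forcibly stopped, so a source that truly never returns would still hold a worker thread; a real client would also need per-request network timeouts.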

OGSA-DAI Prototype (What we ended up doing)

[Figure: implemented prototype data flow. Labels shown: OGSA-DAI Client; OGSA-DAI R5 GDS; BDWQueryActivity; Wrapper Module; Wrapper (three shown); Web DBs; Cache File.]

Steps shown in the figure:
1. BGI InvokeOperation()
2. Create GDS and query
3. Invoke wrapper/s
4. Query, transform
5. Write cache file
6. return XmlRemoteData
7. return XmlDataCollection
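
Steps 3 to 7 of the implemented flow can be sketched as follows, under stated assumptions: wrappers are plain callables that query and transform in one go, their combined output is written to a cache file, a small pointer object is returned in place of the data, and the caller later dereferences it into a collection. `XmlRemoteData` and `XmlDataCollection` are names from the slide, but their fields, the file layout and the split of work between the two functions are invented here for illustration.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path

@dataclass
class XmlRemoteData:
    """Stand-in: a pointer to cached results rather than the results themselves."""
    location: str

@dataclass
class XmlDataCollection:
    """Stand-in for the BDW result container: a list of XML strings."""
    documents: list

def run_wrappers(wrappers, query, cache_dir):
    """Steps 3-6: invoke wrappers, write one cache file, return a pointer."""
    records = []
    for wrapper in wrappers:
        records.extend(wrapper(query))  # each wrapper queries and transforms
    cache_file = Path(cache_dir) / "results.xml"
    cache_file.write_text("\n".join(records), encoding="utf-8")
    return XmlRemoteData(location=str(cache_file))

def load_collection(remote):
    """Step 7: turn the cached file back into an XmlDataCollection."""
    text = Path(remote.location).read_text(encoding="utf-8")
    return XmlDataCollection(documents=text.splitlines())

# Hypothetical wrappers, one per web database.
def wrapper_a(query):
    return [f"<record source='a'>{query}</record>"]

def wrapper_b(query):
    return [f"<record source='b'>{query}</record>"]

with tempfile.TemporaryDirectory() as tmp:
    pointer = run_wrappers([wrapper_a, wrapper_b], "name-x", tmp)
    collection = load_collection(pointer)
print(collection.documents)
```

Moving the transformation into the wrappers and passing back only a pointer keeps the client to a single request, at the cost of doing less of the work through the data service itself.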

Conclusion

• Highlighted key bioinformatics eScience project requirements for OGSA-DAI
– support for a metadata-driven two-step access to data and data integration…
• Reviewed BDWorld DAI requirements
– uniform access to disparate, heterogeneous data resources
• including anonymous access to web information systems
• Reviewed the BDWorld OGSA-DAI exemplar and issues encountered