Chado and interoperability. Chris Mungall, BDGP; Pinglei Zhou, FlyBase-Harvard

Chado and interoperability

Chris Mungall, BDGP
Pinglei Zhou, FlyBase-Harvard

Databases and applications

How do we get databases and applications speaking to one another?

[Diagram: applications written in Java and Perl on one side; the Chado database (SQL) with its modules seq, cv, rad, genetic, phylo and pub on the other; a "?" between them.]

Databases and applications

Generic database interfaces only solve part of the problem

They let us embed SQL inside application code

[Diagram: a Java application reaches Chado through JDBC and the PostgreSQL driver (method calls in, row objects out); a Perl application reaches it through DBI and the PostgreSQL driver (method calls in, Perl arrays out). Chado modules shown: seq, cv, rad, genetic, phylo, pub.]

Why this alone isn’t enough

• Interfacing applications to databases is a tricky business…

• Issue: Language mismatch
• Issue: Data structure mismatch
• Issue: Repetitive code
• Issue: No centralized domain logic

Language mismatch

    // SQL is embedded in Java as an opaque string; the compiler cannot check it.
    String sql = "SELECT srcfeature_id, fmax, fmin "
               + "FROM featureloc "
               + "WHERE feature_id = " + featId;
    try {
        Statement s = conn.createStatement();
        ResultSet rs = s.executeQuery(sql);
        if (rs.next()) {  // guard against an empty result set
            sourceFeatureId = rs.getInt("srcfeature_id");
            fmin = rs.getInt("fmin");
            fmax = rs.getInt("fmax");
        }
    } catch (SQLException sqle) {
        System.err.println(this.getClass() + ": SQLException retrieving feature loc"
                           + " for feature_id = " + featId);
        sqle.printStackTrace(System.err);
    }

Data structure mismatch

• Different formalisms
  – Databases: relations, set-theoretic, relational algebra
  – Applications: classes and structs, pointers, programs

Repetitive code

• Database fetch pattern
  – construct, ask, transform, repeat, stitch

• Example: fetching gene models
  – fetch genes
  – fetch transcripts
  – fetch exons, polypeptides
  – fetch ancillary data (props, cvs, pubs, syns, etc)

• Optimisation is difficult
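The fetch pattern above can be sketched as follows. This is only an illustration under stated assumptions: the two tables are a heavily simplified stand-in for Chado's `feature` and `feature_relationship` (the type is stored as a plain string, and the `partof` relationship name is made up for the example), the database is in-memory SQLite rather than PostgreSQL, and the function names are hypothetical.

```python
import sqlite3

# Simplified stand-in for two Chado tables (assumption: real Chado has many
# more columns, and the type is a foreign key to cvterm, not a string).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feature (feature_id INTEGER PRIMARY KEY, uniquename TEXT, type TEXT);
CREATE TABLE feature_relationship (subject_id INTEGER, object_id INTEGER, type TEXT);
INSERT INTO feature VALUES (1, 'CG3312', 'gene'), (2, 'CG3312-RA', 'mRNA'),
                           (3, 'CG3312:1', 'exon'), (4, 'CG3312:2', 'exon');
INSERT INTO feature_relationship VALUES (2, 1, 'producedby'),
                                        (3, 2, 'partof'), (4, 2, 'partof');
""")

def fetch_children(conn, parent_id):
    """One round of construct, ask, transform for the parts of a feature."""
    sql = ("SELECT f.feature_id, f.uniquename, f.type "
           "FROM feature f JOIN feature_relationship r "
           "ON r.subject_id = f.feature_id WHERE r.object_id = ? "
           "ORDER BY f.feature_id")                       # construct
    rows = conn.execute(sql, (parent_id,)).fetchall()     # ask
    return [{"id": i, "name": n, "type": t, "parts": []}  # transform
            for i, n, t in rows]

def fetch_gene_model(conn, uniquename):
    """Repeat the fetch per level and stitch the results into one tree."""
    row = conn.execute("SELECT feature_id, uniquename, type FROM feature "
                       "WHERE uniquename = ?", (uniquename,)).fetchone()
    gene = {"id": row[0], "name": row[1], "type": row[2], "parts": []}
    stack = [gene]
    while stack:                                          # repeat
        node = stack.pop()
        node["parts"] = fetch_children(conn, node["id"])  # stitch
        stack.extend(node["parts"])
    return gene

gene = fetch_gene_model(conn, "CG3312")
```

Note that this sketch issues one query per node, which is the kind of cost that makes such hand-written fetch code hard to optimise.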

No centralized domain logic

• Examples of domain logic:
  – project a feature onto a virtual contig
  – revcomp or translate a sequence
  – search by ontology term
  – delete a gene model

• Domain logic should be reusable by different applications
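Reverse-complementing a sequence, one of the examples listed above, is a small piece of domain logic that tends to be rewritten in every application. A minimal sketch, assuming plain DNA with only the four unambiguous bases (the function name is illustrative, not part of any Chado API):

```python
# Complement table for unambiguous DNA bases, in both cases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def revcomp(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]
```

Characters outside the table (for example N) pass through unchanged; a shared implementation would also need to handle the IUPAC ambiguity codes.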

A solution: Object Oriented APIs

Different Perl applications share the same API

Different schemas can be added by writing adapters

[Diagram: two Perl applications exchange method calls and domain objects with a shared API. Behind the API, a chado adapter reaches the chado database through DBI and its driver, and a biosql adapter reaches the biosql database the same way.]

How do OO APIs help?

• Issue: Language mismatch
  – Separation of interface from implementation
• Issue: Data structure mismatch
  – API talks objects
  – adapters hide and deal with conversion
• Issue: Repetitive code
  – code centralized in both API and adapter
• Issue: No centralized domain logic
  – object model encapsulates domain logic
  – object model can be used independently of database
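The points above can be illustrated with a small adapter sketch. Everything here is hypothetical: the class names, the `Gene` domain object and the in-memory dictionaries standing in for database handles are assumptions made for the example, not the API of any existing library.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class Gene:
    """Domain object handed to applications; knows nothing about any schema."""
    name: str
    length: int

class GeneAdapter(ABC):
    """The shared API: applications code against this interface only."""
    @abstractmethod
    def get_gene(self, name: str) -> Gene: ...

class ChadoAdapter(GeneAdapter):
    """Hypothetical adapter; rows mimic featureloc-style fmin/fmax columns."""
    def __init__(self, rows):
        self._rows = rows  # stand-in for a database handle
    def get_gene(self, name):
        row = self._rows[name]
        return Gene(name=name, length=row["fmax"] - row["fmin"])

class BioSQLAdapter(GeneAdapter):
    """Hypothetical adapter over a differently shaped store."""
    def __init__(self, rows):
        self._rows = rows
    def get_gene(self, name):
        start, end = self._rows[name]  # 1-based inclusive in this stand-in
        return Gene(name=name, length=end - start + 1)

def describe(adapter: GeneAdapter, name: str) -> str:
    """Application code: identical whichever adapter is plugged in."""
    gene = adapter.get_gene(name)
    return f"{gene.name}: {gene.length} bp"

chado = ChadoAdapter({"CG3312": {"fmin": 100, "fmax": 350}})
biosql = BioSQLAdapter({"CG3312": (101, 350)})
```

The coordinate conversion lives in the adapters, so `describe` never sees the difference between the two stores.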

How do OO APIs hinder?

• Writing or generating adapters
  – brittle, difficult to maintain
• Restrictive
  – canned parameterized queries vs query language
• Application language bound
  – very difficult to use a Perl API from Java
• Application bound
  – sometimes generic, but often limited to one application
• Opaque domain logic

XML can help

• XML is application-language neutral
• XML can be used to specify:
  – data
  – transactions
  – queries and query constraints
• XML can be used within both application languages and specialized XML languages
  – XPath
  – XSLT
  – XQuery

XML middleware

Database XML as lingua franca

Generic SQL to XML mapper

OO APIs can be implemented on top of XML layer

[Diagram: applications in Perl, Java or any other language send query params to the middleware interface and receive chado xml back. The implementation shown is in Perl and reaches chado through DBI and the database driver.]

XML middleware

Implementation can be any language

[Diagram: applications in Perl, Java or any other language send query params to the middleware interface and receive chado xml back. Here the implementation is in Java and reaches chado through JDBC and the database driver.]

Mapping with XORT

• XORT is a specification of how to map XML to the relational model
  – generic: independent of chado and biology
• XML::XORT is a Perl implementation of the XORT specification
  – Other implementations possible
• DBIx::DBStag implements XORT xml->db
• Application language agnostic
  – Easily wrapped for other languages

Highlights

• Proposal: XML mapping specification for Chado

• Tools

• Real Case

XORT Mapping

• Elements
  – Table
  – Column (except DB-specific values, e.g. the primary key in the Chado schema, which is not visible in XML)
• Attributes
  – few and generic: transaction and reference control
• Element nesting
  – column within table
  – joined table within table (joining column is implicit)
  – foreign key table within foreign key column
• Modules
  – No module distinctions in ChadoXML
• Limitations of DTD
  – Cardinality, NULLness, data type
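The element and nesting rules above can be sketched in a few lines. This is not XML::XORT itself, only an illustration of the mapping idea; the function name `row_to_xml` and the way foreign keys are passed in are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

def row_to_xml(table, row, primary_key, foreign_keys=None):
    """Map one row to XML: table -> element, column -> child element.

    The primary key is DB-specific, so it is left out. A foreign key column
    wraps an element for the row it points at (passed in as a nested spec).
    """
    foreign_keys = foreign_keys or {}
    elem = ET.Element(table)
    for column, value in row.items():
        if column == primary_key:
            continue
        child = ET.SubElement(elem, column)
        if column in foreign_keys:
            child.append(row_to_xml(*foreign_keys[column]))
        else:
            child.text = str(value)
    return elem

# Hypothetical rows; the numeric ids are arbitrary.
cv_row = ("cv", {"cv_id": 7, "name": "Sequence Ontology"}, "cv_id")
cvterm = row_to_xml("cvterm",
                    {"cvterm_id": 42, "cv_id": 7, "name": "exon"},
                    primary_key="cvterm_id",
                    foreign_keys={"cv_id": cv_row})
xml_text = ET.tostring(cvterm, encoding="unicode")
```

Because the numeric keys never appear in the output, the resulting XML can be loaded into a different database instance where the same rows carry different ids.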

Transactions and Operations

• Lookup
  – Select only
• Insert
  – Insert explicitly
• Delete
  – Unique identifier with unique key(s)
  – One record per operation
• Update
  – Two elements
  – Unique identifier with unique key(s)
  – One record per operation
• Force
  – Combination of lookup, insert and update (if the lookup finds nothing, then insert, else update)
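The force operation described above can be sketched as follows. This is an illustration of the semantics only, not the XORT loader: it assumes an in-memory SQLite table loosely modelled on Chado's `cvterm`, with the unique key reduced to the single column `name`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for Chado's cvterm table (assumption: the unique key
# is reduced to a single column, name).
conn.execute("CREATE TABLE cvterm (cvterm_id INTEGER PRIMARY KEY, "
             "name TEXT UNIQUE, definition TEXT)")

def force(conn, name, definition):
    """Lookup by unique key; insert if absent, otherwise update in place."""
    row = conn.execute("SELECT cvterm_id FROM cvterm WHERE name = ?",
                       (name,)).fetchone()                       # lookup
    if row is None:
        cur = conn.execute("INSERT INTO cvterm (name, definition) "
                           "VALUES (?, ?)", (name, definition))  # insert
        return "insert", cur.lastrowid
    conn.execute("UPDATE cvterm SET definition = ? WHERE cvterm_id = ?",
                 (definition, row[0]))                           # update
    return "update", row[0]

first = force(conn, "exon", "draft definition")
second = force(conn, "exon", "revised definition")
```

Calling `force` twice with the same key leaves a single row, which is what makes the operation safe to replay when a file is loaded more than once.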

Referencing Objects

• By global accession
  – Format: dbname:accession[.version]
  – Only for dbxref, feature ?, cvterm ?
• By a pre-defined local id
  – Allows reference to objects in same file
  – Need not be in DB
  – Can be any symbol
• By lookup using unique key value(s)
  – Object can be in file or DB
• Implicitly, using foreign key to identify information in the related link table
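A reader of such references has to split the global accession format given above. A minimal sketch, assuming that the version (when present) is a run of digits after the last dot and that the database name contains no colon; both are assumptions, since the slide gives only the pattern dbname:accession[.version]. The database name `SomeDB` in the usage note is made up.

```python
import re
from typing import NamedTuple, Optional

class Accession(NamedTuple):
    dbname: str
    accession: str
    version: Optional[str]

# dbname, a colon, the accession, then an optional ".version" of digits.
_PATTERN = re.compile(
    r"^(?P<dbname>[^:]+):(?P<accession>.+?)(?:\.(?P<version>\d+))?$")

def parse_accession(text: str) -> Accession:
    """Split a dbname:accession[.version] string into its parts."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a dbname:accession[.version] string: {text!r}")
    return Accession(**match.groupdict())
```

For example, `parse_accession("Gadfly:CG3312-RA")` yields a record with no version, while `parse_accession("SomeDB:ABC123.2")` separates the version `2` from the accession.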

Object Reference By Global Accession

<feature>
  <uniquename>CG3123</uniquename>
  <type_id>gene</type_id>
  <feature_relationship>
    <subject_id>Gadfly:CG3123-RA:1</subject_id>
    <type_id>producedby</type_id>
  </feature_relationship>
  ……
</feature>

Object Reference By Local ID

<cv id="SO">
  <name>Sequence Ontology</name>
</cv>

<cvterm id="exon">
  <cv_id>SO</cv_id>
  <name>exon</name>
</cvterm>

<feature>
  <type_id>exon</type_id>
  …
</feature>
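How a reader might resolve such local ids can be sketched with the standard library XML parser. This is an assumption-laden illustration rather than the XORT loader: the `<chado>` wrapper element is added only so the fragment parses as one document, and treating every leaf element whose name ends in `_id` as a potential reference is a simplification.

```python
import xml.etree.ElementTree as ET

CHADOXML = """
<chado>
  <cv id="SO"><name>Sequence Ontology</name></cv>
  <cvterm id="exon"><cv_id>SO</cv_id><name>exon</name></cvterm>
  <feature><type_id>exon</type_id></feature>
</chado>
"""

def resolve_local_ids(root):
    """Return {element: target} for leaf text that names a local id in the file."""
    by_id = {e.get("id"): e for e in root.iter() if e.get("id") is not None}
    links = {}
    for elem in root.iter():
        text = (elem.text or "").strip()
        # Only columns ending in _id are treated as references (assumption).
        if elem.tag.endswith("_id") and len(elem) == 0 and text in by_id:
            links[elem] = by_id[text]
    return links

root = ET.fromstring(CHADOXML)
links = resolve_local_ids(root)
resolved = {src.tag: dst.tag for src, dst in links.items()}
```

Here `cv_id` resolves to the `cv` element and `type_id` to the `cvterm` element, while the `<name>exon</name>` column is left alone even though its text matches a local id.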

Object Reference By Key Value(s)

<feature>
  <type_id>
    <cvterm>
      <cv_id>
        <cv>
          <name>Sequence Ontology</name>
        </cv>
      </cv_id>
      <name>exon</name>
    </cvterm>
  </type_id>
  …
</feature>

ChadoXML Example

<cv id="SO">
  <name>Sequence Ontology</name>
</cv>
<feature op="lookup" id="CG3312">
  <uniquename>CG3312</uniquename>
  <type_id>
    <cvterm>
      <name>gene</name>
      <cv_id>SO</cv_id>
    </cvterm>
  </type_id>
  <organism_id>…</organism_id>
  <feature_relationship>
    <subject_id>Gadfly:CG3312-RA</subject_id>
    <type_id>producedby</type_id>
  </feature_relationship>
</feature>

Schema-Driven Tools

• DTD Generator: DDL->DTD
• Validator
  – DB not connected
    – Syntax verification: legal XML, correct element nesting
    – Some semantic verification: NULLness, cardinality, local ID reference
  – DB connected: reference validation
• Loader: XML->DB
• Dumper: DB->XML
  – Driven by XML "dumpspec"
• XORTDiff: diff two XORT XML files

DumpSpec Driven Dumper

• Default behavior: given an object class and ID, dump all direct values and link tables, with refs to foreign keys.
• Non-default behavior specified by XML dumpspecs using same DTD with a few additions:
  – attribute dump = all | cols | select | none
  – attribute test = yes | no
  – element _sql
  – element _appdata
• Workaround with views, _sql
• Current use cases:
  – Dump a gene for a gene detail page
  – Dump a scaffold for Apollo

DumpSpec Sample

<feature dump="all">
  <uniquename test="yes">CG3312</uniquename>
  <!-- get all mRNAs of this gene -->
  <feature_relationship dump="all">
    <subject_id test="yes">
      <feature>
        <type_id><cvterm><name>mRNA</name></cvterm></type_id>
      </feature>
    </subject_id>
    <subject_id>
      <feature dump="all">
        <!-- get all exons of those mRNAs -->
        <feature_relationship dump="all">
          <subject_id test="yes">
            <feature>
              <type_id><cvterm><name>exon</name></cvterm></type_id>
            </feature>
          </subject_id>
          <subject_id>
            <feature dump="all"/>
          </subject_id>
        </feature_relationship>
      </feature>
    </subject_id>
  </feature_relationship>
</feature>
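One way to read a dumpspec like the sample is to walk it and separate the elements that constrain the dump from those that request output. The sketch below does only that; reading test="yes" as "constraint" follows the comments in the sample and is an interpretation, not a statement of how the XORT dumper is implemented. The shortened dumpspec and the function name `outline` are made up for the example.

```python
import xml.etree.ElementTree as ET

DUMPSPEC = """
<feature dump="all">
  <uniquename test="yes">CG3312</uniquename>
  <feature_relationship dump="all">
    <subject_id test="yes">
      <feature><type_id><cvterm><name>mRNA</name></cvterm></type_id></feature>
    </subject_id>
    <subject_id><feature dump="all"/></subject_id>
  </feature_relationship>
</feature>
"""

def outline(elem, path=""):
    """Yield (path, role) for every element carrying a dump or test attribute."""
    here = f"{path}/{elem.tag}"
    if elem.get("test") == "yes":
        yield here, "constraint"
    elif elem.get("dump") is not None:
        yield here, f"dump={elem.get('dump')}"
    for child in elem:
        yield from outline(child, here)

roles = list(outline(ET.fromstring(DUMPSPEC)))
```

The resulting list shows at a glance which parts of the spec filter rows (the gene's uniquename, the mRNA type) and which parts ask for data to be written out.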

Use Case 1: Chado <-> Apollo Interaction

[Diagram: a round trip between Chado and Apollo. Chado is dumped by the XML Dumper (driven by DumpSpec1) to ChadoXML, which GameBridge converts to GAME XML for Apollo; on the way back GameBridge converts GAME XML to ChadoXML, which the XML Loader + validator writes to Chado. Also labelled: DDL, XML Schema, by-product.]

Use Case 2: Gene Page Dataflow

[Diagram: Chado -> XML Dumper (driven by DumpSpec2) -> ChadoXML -> acode]

To Do Lists

• External Object reference

• Dump with auto-generated XML Schema

• Human-friendly output

Resources

• Today’s slides

• XORT package http://www.gmod.org

• Protocol draft submitted to Current Protocols in Bioinformatics

• Using Chado to Store Genome Annotation Data

XORT Key points

• Application language-neutral
  – reusable from within multiple languages and applications

• Where does the domain logic live?
• Unlike objects, XML does not have ‘behaviour’
  – One solution: ChadoXML Services
  – Another solution: Inside the DBMS

ChadoXML Services

[Diagram: an application sends query params to the XORT interface (XML::XORT, which reaches chado through DBI and the database driver) and receives chado xml. The ChadoXML services interface takes chado xml and query params and returns chado xml or other xml. Services shown: format converters and dumpers (XSLT); sequence domain logic (C, Perl); ontology services (Java, XQuery, Lisp, Prolog); comp analysis logic (Java, Perl).]

ChadoXML Services

Can be independent of DB

[Diagram: an application talks directly to the ChadoXML services interface, passing chado xml and query params and receiving chado xml or other xml. Services shown: format converters and dumpers (XSLT); sequence domain logic (C, Perl); ontology services (Java, XQuery, Lisp, Prolog); comp analysis logic (Java, Perl).]

DB Functions API

Implementation inside database

[Diagram: an application makes db function calls through JDBC/DBI and the PostgreSQL driver and receives sql result objects. Inside Chado, the DB Functions API is implemented with PL/pgSQL functions and a sequence library written in C.]

DB Functions API

Existing functions

[Diagram: the DB Functions API sits inside Chado, implemented as PL/pgSQL functions.]

• cv module
  – get_all_subject_ids(cvterm_id int);
  – get_all_object_ids(cvterm_id int);
  – fill_cvtermpath(cv_id int);
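The names of the cv functions above suggest closure queries over the ontology graph. The sketch below shows that idea in plain Python; it is an assumption about what a function such as get_all_subject_ids computes, not its PL/pgSQL source, and the edge list is invented for the example.

```python
from collections import defaultdict

def all_subject_ids(relationships, object_id):
    """Return every term reachable by following subject -> object edges back
    from object_id, i.e. the transitive closure below one term."""
    children = defaultdict(set)
    for subject, obj in relationships:
        children[obj].add(subject)
    seen, stack = set(), [object_id]
    while stack:
        for subject in children[stack.pop()]:
            if subject not in seen:  # guards against cycles and repeats
                seen.add(subject)
                stack.append(subject)
    return seen

# Hypothetical (subject_id, object_id) pairs: 2 and 3 are kinds of 1,
# and 4 is a kind of 2.
edges = [(2, 1), (3, 1), (4, 2)]
```

Precomputing such closures and storing them in a path table is one way to make "search by ontology term" a simple join instead of a recursive walk at query time.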

DB Functions API

Existing functions

[Diagram: the DB Functions API sits inside Chado, implemented as PL/pgSQL functions.]

• sequence module
  – get_sub_feature_ids(feature_id int)
  – featuresplice(fmin int, fmax int)
  – get_subsequence(srcfeature_id int, fmin int, fmax int, strand int)
  – next_uniquename()

Putting it together: storing ontologies in chado

[Diagram: Protégé produces OWL, which owl_to_oboxml.xsl converts to Obo XML; OBO-Edit also feeds Obo XML. oboxml_to_chadoxml.xsl converts Obo XML to ChadoXML, which XML::XORT or DBIx::DBStag loads into the chado cv module; fill_cvtermpath() is then run.]

Benefits

• Issue: Language mismatch
  – XORT dumpspecs and SQL functions are a more natural fit for application languages
• Issue: Data structure mismatch
  – XML maps naturally to objects and structs
• Issue: Repetitive code
  – XORT dumpspecs centralize db-fetch code
  – XORT loader centralizes db-store code
• Issue: No centralized domain logic
  – domain logic can be encoded in:
    – PostgreSQL functions and triggers
    – ChadoXML services

Other issues

• Speed?
  – chained transformations may be slower (-)
  – generic code is often slower (-)
  – single point for optimization (+)
• Verbosity
  – inevitable with a normalized database
  – reduced with XORT macros
• Portability
  – XORT highly portable (+)
  – PostgreSQL functions must be manually ported to different DBMSs (-)

Current plans

• XORT wrappers
• Improving efficiency
• Documentation
• Extend PostgreSQL function repertoire
• More ChadoXML XSLTs
• ChadoXML adapters
  – CGL
  – Apollo
  – BioPerl - Bio::{Seq,Search,Tree,..}IO::chadoxml

Conclusions

• ChadoXML
  – a common GMOD format
  – converted to other formats with XSLTs
• XORT
  – centralises database interoperation logic
• PostgreSQL functions
  – useful for certain kinds of domain logic
• Object APIs
  – still required by many applications
  – can be layered on top of XORT if so desired

Thanks to…

• Richard Bruskiewich
• Scott Cain
• Allen Day
• Karen Eilbeck
• Dave Emmert
• William Gelbart
• Mark Gibson
• Don Gilbert
• Aubrey de Grey
• Nomi Harris
• Stan Letovsky
• Suzanna Lewis
• Aaron Mackey
• Sima Misra
• Emmanuel Mongin
• Simon Prochnik
• Gerald Rubin
• Susan Russo
• ShengQiang Shu
• Chris Smith
• Frank Smutniak
• Lincoln Stein
• Colin Wiel
• Mark Yandell
• Peili Zhang
• Mark Zythovicz

Nothing to see here, move along…..

• random deleted slides follow……

How do we develop an OO API?

• Ad-hoc
  – eg: GO API, GadFly API, Ensembl API, ApolloChadoJDBC
  – Object model and db can be developed independently
  – hand tuned
• Generic
  – Class::DBI, JDO
  – efficiency issues?

GBrowse API

GBrowse can connect to different databases via the same API

APIs are typically associated with an object model

[Diagram: GBrowse (Perl) sits on top of a common API. Labels shown behind it: Bio::DB::DAS, Chado (reached via DBI and SQL), BioPerl Bio-DB-GFF and BioSQL.]

Choice of XML Datamodel

• XML is a different formalism from the relational model
• Should we:
  – design an XML datamodel from scratch?
  – adopt existing XML datamodel?
  – generate XML datamodel from relational model?

Chado XML

• XML datamodel derived automatically from relational schema
  – mapping uses XORT specification
• XML tightly coupled to relational schema
  – benefits & advantages..

Data retrieval: XORT Dumpspecs

• Examples: chado gene model dumpspec

Data storage

• Static or transactional