(near term) develop database requirements to yield schema and interfaces
1. (near term) Develop Database Requirements to Yield Schema and Interfaces
2. MoBIoS: Database Management for Data in Metric Spaces
Daniel P. Miranker
Univ. of Texas
What we know for sure: Exploit Commodity Architecture
DB
Curating New Content
Computing Grid
Web App Server
External Data/DB Sources
Users
Repository Schema and Interface Definitions
Issue:
• Database organization and data interchange should be addressed simultaneously
• Once established, difficult to change
Best to get this right the first time.
What we know for sure:
DB Schema
Curating New Content
Computing Grid
Web App Server
1. Data transfer: XML & Nexus files
2. Curate (manage quality)
Users
Both 1 & 2 impact the schema (data provenance)
XML and Bioinformatics
• Taxonomic Markup Language (TML)
• PhyloML
• BEAST: Bayesian Evolutionary Analysis Sampling Trees
• AGAVE: Architecture for Genomic Annotation, Visualization and Exchange
Answers Start with a Requirements Analysis
• Who
• What
• Why
• How
“Use cases”: specific examples of what is to be accomplished
A Head Start
Requirements of Phylogenetic Databases (with Nakhleh, Barbancon, Piel & Donoghue) [BIBE ’03]
• Did a requirements analysis
• Proof of concept for a correctly normalized database schema
1 evolutionary (tree)-edge = 1 row in the database
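A minimal sketch of the "one edge = one row" idea, using Python's built-in sqlite3 module. The table name `edge`, its columns, and the helper names are hypothetical and chosen only for illustration; they are not the schema from the BIBE ’03 paper.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database holding one small tree, one edge per row."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE edge (
            tree_id       INTEGER NOT NULL,
            parent_id     INTEGER NOT NULL,
            child_id      INTEGER NOT NULL,
            child_label   TEXT,   -- taxon name for leaves, NULL for internal nodes
            branch_length REAL,
            PRIMARY KEY (tree_id, child_id)
        )
        """
    )
    # Tree 1 is ((A,B),C): node 0 is the root, node 1 is the internal node.
    rows = [
        (1, 0, 1, None, 0.5),
        (1, 0, 4, "C", 1.2),
        (1, 1, 2, "A", 0.3),
        (1, 1, 3, "B", 0.4),
    ]
    conn.executemany("INSERT INTO edge VALUES (?, ?, ?, ?, ?)", rows)
    return conn

def leaves_under(conn, tree_id, node_id):
    """Return the sorted leaf labels in the subtree rooted at node_id."""
    sql = """
        WITH RECURSIVE sub(node) AS (
            SELECT ?
            UNION ALL
            SELECT e.child_id
            FROM edge e JOIN sub s ON e.parent_id = s.node
            WHERE e.tree_id = ?
        )
        SELECT e.child_label
        FROM edge e JOIN sub s ON e.child_id = s.node
        WHERE e.tree_id = ? AND e.child_label IS NOT NULL
        ORDER BY e.child_label
    """
    return [row[0] for row in conn.execute(sql, (node_id, tree_id, tree_id))]

if __name__ == "__main__":
    conn = build_demo_db()
    print(leaves_under(conn, 1, 1))
```

Because every edge is an ordinary row, clade membership becomes a recursive query rather than a parse of a stored tree file.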
Who is interested in using Phylogenies?
• Casual Users
• Visualization
• Study Development
• Super-tree algorithms
• Simulation Studies
• Parameter Derivation
• Comparative Genomics
Super-Tree Algorithms Use-Cases
Construct phylogenies by assembling existing studies
Collect those studies by:
• Determine minimum spanning clade for a set of taxa
• Find all phylogenies sufficiently similar to a given phylogeny
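The first use case can be sketched as follows, assuming the minimum spanning clade is the subtree rooted at the deepest common ancestor of the given taxa. The child-to-parent dictionary and the function names are hypothetical, not part of any existing tool.

```python
def path_to_root(parent, node):
    """Return the list of nodes from node up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def minimum_spanning_clade(parent, taxa):
    """Return (ancestor, leaves) for the smallest clade containing all taxa.

    parent maps each non-root node to its parent node.
    """
    taxa = list(taxa)
    if not taxa:
        raise ValueError("at least one taxon is required")
    # Keep only the ancestors shared by every taxon, in leaf-to-root order.
    common = path_to_root(parent, taxa[0])
    for taxon in taxa[1:]:
        on_path = set(path_to_root(parent, taxon))
        common = [n for n in common if n in on_path]
    ancestor = common[0]  # deepest shared ancestor
    # Invert the parent map, then collect the leaves below the ancestor.
    children = {}
    for child, par in parent.items():
        children.setdefault(par, []).append(child)
    stack, leaves = [ancestor], []
    while stack:
        node = stack.pop()
        kids = children.get(node, [])
        if kids:
            stack.extend(kids)
        else:
            leaves.append(node)
    return ancestor, sorted(leaves)

if __name__ == "__main__":
    # Tree (((A,B),C),D) with internal nodes n1, n2 and root.
    parent = {"A": "n1", "B": "n1", "n1": "n2", "C": "n2", "n2": "root", "D": "root"}
    print(minimum_spanning_clade(parent, ["A", "C"]))
```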
Requirements of Phylogenetic Databases
The MoBIoS Project
Molecular Biological Information System
Daniel P. Miranker
University of Texas
MoBIoS: A Simple Idea
Organize the Storage Manager Around Metric Space Indexing
• Relational Databases: B+ trees, 1 dimensional
• Spatial Databases: R & K-D trees, 2 & 3 dimensions
• Metric Databases: VP, M & GNAT trees, no dimensions or very high dimensions
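To make metric-space indexing concrete, here is a minimal vantage-point (VP) tree sketch with a range query. It is a textbook-style illustration, not the MoBIoS storage manager; the class and function names are hypothetical, and Hamming distance on equal-length strings stands in for a biological metric.

```python
import random

class VPNode:
    """One vantage point, its median radius, and the two subtrees."""
    def __init__(self, point, radius, inside, outside):
        self.point = point
        self.radius = radius
        self.inside = inside    # points with d(vp, p) <= radius
        self.outside = outside  # points with d(vp, p) > radius

def build_vp_tree(points, dist, rng=None):
    """Recursively partition points around randomly chosen vantage points."""
    if rng is None:
        rng = random.Random(0)
    points = list(points)
    if not points:
        return None
    vp = points.pop(rng.randrange(len(points)))
    if not points:
        return VPNode(vp, 0, None, None)
    dists = [dist(vp, p) for p in points]
    radius = sorted(dists)[len(dists) // 2]  # median distance to the vantage point
    inside = [p for p, d in zip(points, dists) if d <= radius]
    outside = [p for p, d in zip(points, dists) if d > radius]
    return VPNode(vp, radius,
                  build_vp_tree(inside, dist, rng),
                  build_vp_tree(outside, dist, rng))

def range_search(node, query, r, dist, out=None):
    """Collect every indexed point p with dist(query, p) <= r."""
    if out is None:
        out = []
    if node is None:
        return out
    d = dist(query, node.point)
    if d <= r:
        out.append(node.point)
    # Triangle inequality: a match p satisfies d - r <= dist(vp, p) <= d + r,
    # so a subtree is visited only if its distance range can overlap that band.
    if d - r <= node.radius:
        range_search(node.inside, query, r, dist, out)
    if d + r > node.radius:
        range_search(node.outside, query, r, dist, out)
    return out

def hamming(a, b):
    """Number of mismatching positions; assumes equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

if __name__ == "__main__":
    words = [a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"]
    tree = build_vp_tree(words, hamming)
    print(sorted(range_search(tree, "ACG", 1, hamming)))
```

The only property the index relies on is the triangle inequality, which is why no coordinates or dimensions are needed.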
Biological queries conducted with sequential scans.
• Sequence (BLAST)
• Phylogenies (Tree of Life)
• Mass Spectra (Proteomics)
• Ligand Docking (Rational Drug Design)
A Metric Space is
• a pair, M = (D, d), where
• D is a set of points
• d is a [metric] distance function with the following properties:
– d(x, y) = d(y, x) (symmetry)
– d(x, y) > 0 for x ≠ y, d(x, x) = 0 (non-negativity)
– d(x, y) <= d(x, z) + d(z, y) (triangle inequality)
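The three properties can be checked by brute force on a finite sample. The sketch below does so for Levenshtein edit distance, which is a metric on strings; the function names are hypothetical, and passing on a sample does not prove that a function is a metric in general.

```python
from itertools import product

def edit_distance(a, b):
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]

def violated_axiom(points, d):
    """Return the name of the first axiom violated on the sample, or None."""
    for x, y in product(points, repeat=2):
        if d(x, y) != d(y, x):
            return "symmetry"
        if d(x, y) < 0 or (d(x, y) == 0) != (x == y):
            return "non-negativity"
    for x, y, z in product(points, repeat=3):
        if d(x, y) > d(x, z) + d(z, y):
            return "triangle inequality"
    return None

if __name__ == "__main__":
    sample = ["AACCGGTTT", "TATCGAAA", "AGGCCTA", ""]
    print(violated_axiom(sample, edit_distance))
    # Squared difference is symmetric and non-negative but not a metric.
    print(violated_axiom([0, 1, 2], lambda x, y: (x - y) ** 2))
```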
Can Biology Be Modeled by Metrics?
• Already metrics re:
– Phylogenetic trees
– Ligand docking
• First Biologically Effective Metric Model of Amino Acid Substitution [Xu & Miranker 03]. In effect, precisely the phylogenetic relationships among sequences are exploited to form a database index.
• Metrics for proteomic mass-spectra underway
MoBIoS Architecture (Molecular Biological Information System)
phylogenies
First Application (with Randy Linder)
Compared:
{entire Arabidopsis genome} x {“entire” rice genome}
To determine conserved pairs of primer pairs, in O(m log n).
Will repeat the study again soon, faster.
When biological data is put into an RDBMS
• Primary data is stored in text or blob fields
– Annotations may be relational
• Data retrieval
– Filter DB, sequential dump, O(n), to utilities
• E.g. BLAST, TreeBASE, Sequest
Organism   Function   Sequence (BLOB)
Yeast      membrane   AACCGGTTT
Yeast      mitosis    TATCGAAA
E. coli    membrane   AGGCCTA
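The retrieval pattern described here (filter on relational annotations, then scan every returned sequence) can be sketched with sqlite3 and the three rows from the table. The table name `gene`, the column names, and the motif search standing in for an external utility are assumptions made for illustration.

```python
import sqlite3

def build_gene_db():
    """Load the three example rows, storing each sequence as an opaque BLOB."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE gene (organism TEXT, func TEXT, sequence BLOB)")
    conn.executemany(
        "INSERT INTO gene VALUES (?, ?, ?)",
        [
            ("Yeast", "membrane", b"AACCGGTTT"),
            ("Yeast", "mitosis", b"TATCGAAA"),
            ("E. coli", "membrane", b"AGGCCTA"),
        ],
    )
    return conn

def scan_for_motif(conn, func, motif):
    """Filter relationally on the annotation, then scan each sequence in turn.

    The database cannot index inside the BLOB, so the scan is O(n) in the
    number of rows that survive the filter.
    """
    rows = conn.execute(
        "SELECT organism, sequence FROM gene WHERE func = ? ORDER BY organism",
        (func,),
    )
    hits = []
    for organism, blob in rows:
        seq = bytes(blob).decode("ascii")
        if motif in seq:
            hits.append((organism, seq))
    return hits

if __name__ == "__main__":
    conn = build_gene_db()
    print(scan_for_motif(conn, "membrane", "GGT"))
```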
Homework: Due tomorrow morning
1. Who are you (generically)?
2. Use case involving the database
Don’t know: A General Web Service
DB Schema
Curating New Content
Computing Grid
Web App Server
ToL Infrastructure @ SDSC
Computing Grid