a search engine for phylogenetic tree databases - d. fernándes-baca
TRANSCRIPT
A search engine for phylogenetic tree databases
David Fernández-BacaJoint work with Mukul Bansal, Duhong Chen (Computer Science, ISU) and J. Gordon Burleigh (NESCent)
PhyloFinder
http://pilin.cs.iastate.edu/phylofinder/
Outline
1. Introduction2. PhyloFinder queries3. Implementation4. Future directions5. Acknowledgements
Outline
1. Introduction2. PhyloFinder queries3. Implementation4. Future directions5. Acknowledgements
Issues in Phylogenetic Databases Taxonomic consistency
Species may appear in multiple trees by different but synonymous names.
Homonyms Misspellings
Querying capability Storage/representation Exploiting classification trees (e.g., NCBI)
Clustering capabilities Distance measures
Aggregation (synthesis) capabilities Supertrees
Visualization
Classification trees and phylogenies
Exploiting taxonomic classifications The leaves in phylogenetic trees may
represent different taxonomic levels A classification tree can allows us to
locate trees that contain a taxon, as well as its descendants or ancestors. E.g., a “Pinaceae" query would identify
trees that contain “Pinus thunbergii” or “Abies alba”
TreeBASE (Piel, Donoghue, & Sanderson, 1996)
TreeBASE capabilities
Search by taxon author citation study accession number matrix accession number structure (topology)
Tree surfing
TreeBASE limitations
Taxonomic name consistency Querying
Few options Does not exploit classification
Can’t identify ancestors/descendants
Visualization Clustering and aggregation (supertrees)
PhyloFinder
A search engine for tree databases Not a database
Allows powerful phylogenetic queries Handles synonymous taxonomic names (via TBMap) Handles misspellings. Exploits taxonomic classification Offers precise options for identifying different types of
subtrees and metrics for identifying similar trees. Provides a visualization tool with links to GenBank and
TBMap. Fast
Efficient storage and filtering
PhyloFinder Design
Uses simple but powerful techniques Inverted index for filtering Nested-set representation of trees
Least common ancestor queries directly on database
Off-the-shelf spell-checking technology Can be used with any phylogenetic
database E.g., PhyLoTA browser However, set-up is not (yet) automatic
Outline
1. Introduction2. Queries3. Storage and querying4. Acknowledgements
PhyloFinder Queries
Taxonomic queries involve a single taxon or set of taxa.
Phylogenetic queries take as input a phylogenetic tree Locate trees that match it in some specified
way.
Taxonomic Queries
1. Contains: Given a list of taxa, return all trees that contain all or any of these names. Similar to Boolean “AND” and “OR” searches. Automatically searches for synonymous taxa
2. Related: Given a taxon, find all trees involving it or any of its descendants in the NCBI taxonomy. E.g., if the query taxon is “birds", identify all trees
that contain bird taxa.
3. Pathlength: Given a pair of taxa, return all trees containing them, along with the distance between them in each tree.
Taxonomic Queries: Contains
Phylogenetic Queries
Tree mining: Given a query tree Q, find the database trees that exhibit Q in some way. Options: Return the trees that have Q as an embedded
subtree. Return the trees that refine Q.
Similarity: Given a query tree Q and a specified similarity measure, return trees in database ranked by decreasing similarity from Q. Requires at least 3 taxon overlap
Phylogenetic queries: Notation 1 T(A) is the minimal subtree of T that
contains the leaves in A.
a b c d e f g
Phylogenetic queries: Notation 1 T(A) is the minimal subtree of T that
contains the leaves in A.
a b c d e f g
Phylogenetic queries: Notation 1 T(A) is the minimal subtree of T that
contains the leaves in A.
a b c d e f g
Phylogenetic queries: Notation 2 T|A is obtained from T(A) by suppressing all
internal nodes that have only one child.
a b c d e f g
Phylogenetic queries
Let Q be a query tree with leaf set A. Q is an embedded subtree of T if and
only if it is identical to T|A. Q is refined by T (T refines Q) if T|A is
a refinement of Q.
Phylogenetic queries
Phylogenetic queries: Embedded
Phylogenetic queries: Refined by
Phylogenetic queries: Refined by
Q embedded in T Q refined by T
Phylogenetic queries: Embedded
Similarity queries
Return trees ranked by a similarity score Score is a percentage between 0 and 100%
reflecting how similar query tree is to candidate tree.
PhyloFinder’s similarity measures: Robinson-Foulds (RF) similarity Least common ancestor (LCA) similarity
Score takes degree of taxon overlap into account.
Outline
1. Introduction2. PhyloFinder queries3. Implementation4. Future directions5. Acknowledgements
System architecture
Least Common Ancestors (LCAs)
a b c d e f g
Least Common Ancestors (LCAs)
a b c d e f g
LCA(b,e)
Storage: Nested intervals
Ancestor/descendant relationship is easy to determine The between predicate defines subtrees
LCAs are easily computed Find common ancestor with largest Node_ID
a b dc e f
(1,10)
(2,9)
(3,5)
(10,10)(4,4) (5,5)(6,6)
(7,9)
(8,8)(9,9)
(Node_ID,RMD_ID)
Storage: Inverted index
For each taxon, store a list of all trees that contain it.
Easy to find trees containing any or all elements in a list of taxa Used as a filter
Cornus
Spigelia
Hedera
1 2 3 5 8 13 21 34
2 4 8 16 32 64128
13 16
Building the inverted indexI. Input trees:
1: (((man,pan),gorilla),pongo), 2. (((human, coprinus),cryptomonas),zea_mays), . . . ,N: (((dogs,homo_sapiens),pig),lambs)
II. Convert trees into lists of taxa:
man pan gorilla pongo, human coprinus . . . . . .
III. Synonymy preprocessing: Replace names by TBMap name clusters:
tc1 tc2 tc3 tc4 , tc1 tc5 . . . . . .
IV. Build index consisting of (i) dictionary (mapping of taxon names to name clusters) and (ii) postings (lists of tree IDs).
tc1 1 2 3 4
tc2 1
Schema
Query Processing: Outline
Consultinverted index
Q: Candidate trees:
Results:
Compare against Q using LCA queries
Implementing Phylogenetic Queries Idea: Use LCA queries to compare ancestor-
descendant relationships in Q with those in T.
M(x) and M(y) have the same relationship in T
as x and y have in Q Q can be embedded in T.
Advantage: Database trees need not be read into main memory.
Implementing Taxonomic Queries Use Boolean (union/intersection)
operations on the inverted index Example: Querying for “birds"
1. Find all bird species in the database trees using the NCBI taxonomy tree.
2. Use inverted index to retrieves the tree ID lists for each bird species.
3. Return the union of these lists.
Tree visualization
Tree visualization
Other tree visualization tools are available: Hillis, Heath, & St. John 2005. Syst. Biol. 54: 471-
482. Sanderson 2006. Bioinformatics 22: 1004-1006. Zmasek & Eddy. 2001. Bioinformatics 17: 383-384.
We developed our own to avoid plug-ins, and easily highlight query results and provide outlinks to
GenBank and TBMap.
Spelling
Suggestions come from TreeBASE and NCBI Uses GNU Aspell
Modified to handle special characters (`-', `&', '.') and compound words.
Outline
1. Introduction2. PhyloFinder queries3. Implementation4. Future directions5. Acknowledgements
Under construction
Unrooted trees Supertree methods
MRP, MRF, MMC Desktop version Automatic update Suggestions?
Outline
1. Introduction2. PhyloFinder queries3. Implementation4. Future directions5. Acknowledgements
Thanks to
Rod Page for TBMap Bill Piel for TreeBASE data Mike Sanderson Oliver Eulenstein National Science Foundation (grant EF-
0334832)