A search engine for phylogenetic tree databases - D. Fernández-Baca

A search engine for phylogenetic tree databases David Fernández-Baca Joint work with Mukul Bansal, Duhong Chen (Computer Science, ISU) and J. Gordon Burleigh (NESCent)

Upload: roderic-page

Post on 10-May-2015


TRANSCRIPT

Page 1: A search engine for phylogenetic tree databases - D. Fernández-Baca

A search engine for phylogenetic tree databases

David Fernández-Baca
Joint work with Mukul Bansal, Duhong Chen (Computer Science, ISU) and J. Gordon Burleigh (NESCent)

PhyloFinder

http://pilin.cs.iastate.edu/phylofinder/

Outline

1. Introduction
2. PhyloFinder queries
3. Implementation
4. Future directions
5. Acknowledgements

Issues in Phylogenetic Databases

Taxonomic consistency
  Species may appear in multiple trees by different but synonymous names.
  Homonyms
  Misspellings
Querying capability
  Storage/representation
  Exploiting classification trees (e.g., NCBI)
Clustering capabilities
  Distance measures
Aggregation (synthesis) capabilities
  Supertrees
Visualization

Classification trees and phylogenies

Exploiting taxonomic classifications

The leaves in phylogenetic trees may represent different taxonomic levels.
A classification tree allows us to locate trees that contain a taxon, as well as its descendants or ancestors.
  E.g., a “Pinaceae” query would identify trees that contain “Pinus thunbergii” or “Abies alba”.

TreeBASE (Piel, Donoghue, & Sanderson, 1996)

TreeBASE capabilities

Search by
  taxon
  author
  citation
  study accession number
  matrix accession number
  structure (topology)

Tree surfing

TreeBASE limitations

Taxonomic name consistency
Querying
  Few options
  Does not exploit classification
    Can’t identify ancestors/descendants
Visualization
Clustering and aggregation (supertrees)

PhyloFinder

A search engine for tree databases
  Not a database
Allows powerful phylogenetic queries
  Handles synonymous taxonomic names (via TBMap)
  Handles misspellings
  Exploits taxonomic classification
  Offers precise options for identifying different types of subtrees and metrics for identifying similar trees
  Provides a visualization tool with links to GenBank and TBMap
Fast
  Efficient storage and filtering

PhyloFinder Design

Uses simple but powerful techniques
  Inverted index for filtering
  Nested-set representation of trees
    Least common ancestor queries directly on database
  Off-the-shelf spell-checking technology
Can be used with any phylogenetic database
  E.g., PhyLoTA browser
  However, set-up is not (yet) automatic

Outline

1. Introduction
2. Queries
3. Storage and querying
4. Acknowledgements

PhyloFinder Queries

Taxonomic queries involve a single taxon or set of taxa.

Phylogenetic queries take as input a phylogenetic tree.
  Locate trees that match it in some specified way.

Taxonomic Queries

1. Contains: Given a list of taxa, return all trees that contain all or any of these names.
   Similar to Boolean “AND” and “OR” searches.
   Automatically searches for synonymous taxa.

2. Related: Given a taxon, find all trees involving it or any of its descendants in the NCBI taxonomy.
   E.g., if the query taxon is “birds”, identify all trees that contain bird taxa.

3. Pathlength: Given a pair of taxa, return all trees containing them, along with the distance between them in each tree.
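
As a rough illustration of the Pathlength query, the sketch below computes the distance between two taxa in a single tree. It assumes that "distance" means the number of edges on the path between the two leaves, which the slide does not specify. The nested-tuple tree representation and the names `root_paths` and `path_length` are illustrative choices, not PhyloFinder's.

```python
def root_paths(tree, prefix=()):
    """Map each leaf name to its path from the root, as a tuple of child indexes."""
    if isinstance(tree, str):           # a leaf is a taxon name
        return {tree: prefix}
    paths = {}
    for i, child in enumerate(tree):
        paths.update(root_paths(child, prefix + (i,)))
    return paths


def path_length(tree, a, b):
    """Number of edges between leaves a and b, or None if either is missing."""
    paths = root_paths(tree)
    if a not in paths or b not in paths:
        return None
    pa, pb = paths[a], paths[b]
    shared = 0                          # common prefix length = depth of the LCA
    while shared < min(len(pa), len(pb)) and pa[shared] == pb[shared]:
        shared += 1
    return (len(pa) - shared) + (len(pb) - shared)


# Hypothetical example tree: (((a,b),c),(d,e))
tree = ((("a", "b"), "c"), ("d", "e"))
print(path_length(tree, "a", "b"))      # siblings: 2 edges
print(path_length(tree, "a", "d"))      # path through the root: 5 edges
```

A full Pathlength query would apply `path_length` to every tree that contains both taxa and report one distance per tree.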

Taxonomic Queries: Contains

Phylogenetic Queries

Tree mining: Given a query tree Q, find the database trees that exhibit Q in some way. Options:
  Return the trees that have Q as an embedded subtree.
  Return the trees that refine Q.

Similarity: Given a query tree Q and a specified similarity measure, return the trees in the database ranked by decreasing similarity to Q.
  Requires an overlap of at least 3 taxa.

Phylogenetic queries: Notation 1

T(A) is the minimal subtree of T that contains the leaves in A.

[Figure: example tree with leaves a, b, c, d, e, f, g]

Phylogenetic queries: Notation 2

T|A is obtained from T(A) by suppressing all internal nodes that have only one child.

[Figure: example tree with leaves a, b, c, d, e, f, g]

Phylogenetic queries

Let Q be a query tree with leaf set A.
  Q is an embedded subtree of T if and only if it is identical to T|A.
  Q is refined by T (T refines Q) if T|A is a refinement of Q.
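
To make the definitions concrete, here is a minimal sketch of T|A and of the two tests. It assumes rooted trees written as nested Python tuples with taxon names as leaves, and it compares trees through their clusters (the leaf set below each node). The function names are illustrative, and this in-memory version is not how PhyloFinder evaluates the queries.

```python
def leaves(tree):
    """Set of leaf names in a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def restrict(tree, keep):
    """Return T|A: keep only leaves in `keep`, suppressing one-child nodes."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)


def clusters(tree):
    """Leaf sets below every node; rooted trees on the same leaves match iff these do."""
    if isinstance(tree, str):
        return {frozenset([tree])}
    result = {frozenset(leaves(tree))}
    for child in tree:
        result |= clusters(child)
    return result


def is_embedded(q, t):
    """Q is an embedded subtree of T: Q is identical to T|A."""
    a = leaves(q)
    return a <= leaves(t) and clusters(q) == clusters(restrict(t, a))


def is_refined_by(q, t):
    """T refines Q: every cluster of Q also appears in T|A."""
    a = leaves(q)
    return a <= leaves(t) and clusters(q) <= clusters(restrict(t, a))


# Hypothetical database tree on the leaves a..g used in the figures
T = ((("a", "b"), "c"), (("d", "e"), ("f", "g")))
print(restrict(T, {"a", "b", "e"}))        # (('a', 'b'), 'e')
print(is_embedded((("a", "b"), "e"), T))   # True
print(is_refined_by(("a", "b", "e"), T))   # True: T|A resolves the fan
```

Under these definitions every embedded query is also refined, but an unresolved query such as `("a", "b", "e")` is refined by T without being embedded in it.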

Phylogenetic queries: Embedded and Refined by

[Figure slides illustrating the two cases: Q embedded in T; Q refined by T]

Similarity queries

Return trees ranked by a similarity score
  Score is a percentage between 0 and 100% reflecting how similar the query tree is to the candidate tree.

PhyloFinder’s similarity measures:
  Robinson-Foulds (RF) similarity
  Least common ancestor (LCA) similarity

Score takes degree of taxon overlap into account.

System architecture

Least Common Ancestors (LCAs)

[Figure: example tree with leaves a, b, c, d, e, f, g, with LCA(b,e) labeled]

Storage: Nested intervals

Ancestor/descendant relationship is easy to determine
  The between predicate defines subtrees
LCAs are easily computed
  Find common ancestor with largest Node_ID

[Figure: tree with leaves a, b, c, d, e, f; every node carries a (Node_ID, RMD_ID) pair: (1,10), (2,9), (3,5), (4,4), (5,5), (6,6), (7,9), (8,8), (9,9), (10,10)]
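
The labeling can be reproduced with a preorder numbering in which RMD_ID is the Node_ID of a node's rightmost descendant. The sketch below is an in-memory illustration only. It assumes the figure's tree is `((("a","b"),"c",("d","e")),"f")`, which is consistent with the ten pairs shown but is inferred rather than stated. In a database, the same tests would presumably be a `BETWEEN` predicate and a maximum over Node_ID. All function names are hypothetical.

```python
def label(tree):
    """Number nodes in preorder; return ({Node_ID: RMD_ID}, {leaf name: Node_ID})."""
    rmd, leaf_ids = {}, {}

    def visit(node):
        node_id = len(rmd) + 1
        rmd[node_id] = node_id           # placeholder until descendants are numbered
        if isinstance(node, str):
            leaf_ids[node] = node_id
        else:
            for child in node:
                visit(child)
        rmd[node_id] = len(rmd)          # largest ID assigned so far = rightmost descendant

    visit(tree)
    return rmd, leaf_ids


def is_ancestor(rmd, x, y):
    """True if x is y or an ancestor of y: y lies between x's Node_ID and RMD_ID."""
    return x <= y <= rmd[x]


def lca(rmd, u, v):
    """Least common ancestor: the common ancestor with the largest Node_ID."""
    return max(x for x in rmd if is_ancestor(rmd, x, u) and is_ancestor(rmd, x, v))


tree = ((("a", "b"), "c", ("d", "e")), "f")   # assumed shape of the figure's tree
rmd, leaf = label(tree)
print(sorted(rmd.items()))               # (1, 10), (2, 9), (3, 5), (4, 4), ...
print(lca(rmd, leaf["a"], leaf["e"]))    # 2
```

The linear scan in `lca` keeps the sketch short; the point is that both tests need only the two integers stored per node.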

Storage: Inverted index

For each taxon, store a list of all trees that contain it.
  Easy to find trees containing any or all elements in a list of taxa
  Used as a filter

Cornus:   1 2 3 5 8 13 21 34
Spigelia: 2 4 8 16 32 64 128
Hedera:   13 16
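
With postings stored as sets, the "any" and "all" lookups reduce to union and intersection. The sketch below uses the three posting lists from the slide, assuming each list belongs to the taxon shown beside it. The function names are illustrative.

```python
# Posting lists from the slide: taxon -> IDs of the trees that contain it
postings = {
    "Cornus":   {1, 2, 3, 5, 8, 13, 21, 34},
    "Spigelia": {2, 4, 8, 16, 32, 64, 128},
    "Hedera":   {13, 16},
}


def trees_with_all(postings, taxa):
    """Trees containing every listed taxon (Boolean AND)."""
    lists = [postings.get(t, set()) for t in taxa]
    return set.intersection(*lists) if lists else set()


def trees_with_any(postings, taxa):
    """Trees containing at least one listed taxon (Boolean OR)."""
    return set().union(*(postings.get(t, set()) for t in taxa))


print(sorted(trees_with_all(postings, ["Cornus", "Spigelia"])))   # [2, 8]
print(sorted(trees_with_any(postings, ["Hedera", "Spigelia"])))
```

Used as a filter, the result is only a candidate list: the topology of each candidate tree still has to be checked against the query.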

Building the inverted index

I. Input trees:

1: (((man,pan),gorilla),pongo), 2: (((human,coprinus),cryptomonas),zea_mays), . . . , N: (((dogs,homo_sapiens),pig),lambs)

II. Convert trees into lists of taxa:

man pan gorilla pongo, human coprinus . . . . . .

III. Synonymy preprocessing: Replace names by TBMap name clusters:

tc1 tc2 tc3 tc4, tc1 tc5 . . . . . .

IV. Build index consisting of (i) dictionary (mapping of taxon names to name clusters) and (ii) postings (lists of tree IDs).

tc1: 1 2 3 4

tc2: 1
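
Steps I to IV can be sketched in a few lines. The example assumes simple Newick strings without branch lengths, quoted names, or internal node labels. The `name_cluster` dictionary is a hypothetical stand-in for TBMap, and mapping `homo_sapiens` to `tc1` is an assumption made for illustration. The slide's tree N is numbered 3 here.

```python
import re
from collections import defaultdict


def taxa_in(newick):
    """Step II: extract leaf names from a simple Newick string."""
    return re.findall(r"[^(),;\s]+", newick)


def build_index(trees, name_cluster):
    """Steps III and IV: map names to clusters, then collect tree IDs per cluster."""
    postings = defaultdict(set)
    for tree_id, newick in trees.items():
        for name in taxa_in(newick):
            cluster = name_cluster.get(name, name)   # unmapped names form their own cluster
            postings[cluster].add(tree_id)
    return {cluster: sorted(ids) for cluster, ids in postings.items()}


# Step I: input trees from the slide
trees = {
    1: "(((man,pan),gorilla),pongo)",
    2: "(((human,coprinus),cryptomonas),zea_mays)",
    3: "(((dogs,homo_sapiens),pig),lambs)",
}
# Hypothetical dictionary of name clusters (stand-in for TBMap)
name_cluster = {
    "man": "tc1", "human": "tc1", "homo_sapiens": "tc1",
    "pan": "tc2", "gorilla": "tc3", "pongo": "tc4", "coprinus": "tc5",
}

index = build_index(trees, name_cluster)
print(index["tc1"])   # [1, 2, 3]
print(index["tc2"])   # [1]
```

Because all three names for humans share one cluster, a query for any of them retrieves the same posting list.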

Schema

Query Processing: Outline

1. Consult the inverted index with the query Q to obtain the candidate trees.

2. Compare each candidate tree against Q using LCA queries to obtain the results.

Implementing Phylogenetic Queries

Idea: Use LCA queries to compare ancestor-descendant relationships in Q with those in T.
  If M(x) and M(y) have the same relationship in T as x and y have in Q, then Q can be embedded in T.

Advantage: Database trees need not be read into main memory.
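
One simple way to compare relationships using nothing but LCA queries is to check every triple of query leaves: for rooted trees, Q equals T|A exactly when each triple has the same shape in both. The sketch below is a brute-force stand-in built on that fact, not PhyloFinder's actual procedure. It works in memory on nested tuples, whereas the slide's point is that the LCA queries run directly on the database. All names are hypothetical.

```python
from itertools import combinations


def index_tree(tree):
    """Number the nodes of a nested-tuple tree; return (parent, depth, leaf) maps."""
    parent, depth, leaf = {}, {}, {}

    def visit(node, par, d):
        node_id = len(parent)
        parent[node_id], depth[node_id] = par, d
        if isinstance(node, str):
            leaf[node] = node_id
        else:
            for child in node:
                visit(child, node_id, d + 1)

    visit(tree, None, 0)
    return parent, depth, leaf


def lca(parent, depth, u, v):
    """Move the deeper node upward until both meet."""
    while u != v:
        if depth[u] >= depth[v]:
            u = parent[u]
        else:
            v = parent[v]
    return u


def triple_shape(index, a, b, c):
    """Pair of leaves that group together below the third, or None for a fan."""
    parent, depth, leaf = index
    lca_depth = {
        pair: depth[lca(parent, depth, leaf[pair[0]], leaf[pair[1]])]
        for pair in ((a, b), (a, c), (b, c))
    }
    deepest = max(lca_depth.values())
    closest = [pair for pair, d in lca_depth.items() if d == deepest]
    return closest[0] if len(closest) == 1 else None


def compare(q, t):
    """Return (embedded, refined) for query tree q against tree t."""
    qi, ti = index_tree(q), index_tree(t)
    taxa = sorted(qi[2])
    if any(name not in ti[2] for name in taxa):
        return False, False
    shapes = [(triple_shape(qi, *tr), triple_shape(ti, *tr))
              for tr in combinations(taxa, 3)]
    embedded = all(sq == st for sq, st in shapes)
    refined = all(sq is None or sq == st for sq, st in shapes)
    return embedded, refined


T = ((("a", "b"), "c"), ("d", ("e", "f")))        # hypothetical database tree
print(compare((("a", "b"), ("d", "e")), T))       # (True, True)
print(compare(("a", "b", ("d", "e")), T))         # (False, True)
```

Checking all triples costs cubic time in the number of query taxa, so this is only meant to show which comparisons an LCA-based test relies on.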

Implementing Taxonomic Queries

Use Boolean (union/intersection) operations on the inverted index.

Example: Querying for “birds”

1. Find all bird species in the database trees using the NCBI taxonomy tree.

2. Use the inverted index to retrieve the tree ID lists for each bird species.

3. Return the union of these lists.
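
The three steps above can be sketched as follows. The `taxonomy` fragment and the tree IDs in `postings` are made up for illustration. A real deployment would read the classification from the NCBI taxonomy and the postings from the inverted index.

```python
# Hypothetical fragment of a classification tree: taxon -> child taxa
taxonomy = {
    "Aves": ["Galliformes", "Struthioniformes"],
    "Galliformes": ["Gallus gallus"],
    "Struthioniformes": ["Struthio camelus"],
}
# Hypothetical inverted index: taxon -> IDs of the trees that contain it
postings = {
    "Gallus gallus": {3, 7},
    "Struthio camelus": {7, 12},
    "Homo sapiens": {1, 3},
}


def descendants(taxonomy, taxon):
    """Step 1: the taxon itself plus everything below it in the classification."""
    found, stack = set(), [taxon]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(taxonomy.get(current, []))
    return found


def related(taxonomy, postings, taxon):
    """Steps 2 and 3: union of the tree-ID lists of every descendant in the index."""
    result = set()
    for name in descendants(taxonomy, taxon):
        result |= postings.get(name, set())
    return result


print(sorted(related(taxonomy, postings, "Aves")))   # [3, 7, 12]
```

Tree 1 is not returned because it contains no taxon below "Aves", while tree 3 is returned even though it also contains a non-bird taxon.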

Tree visualization

Other tree visualization tools are available:
  Hillis, Heath, & St. John 2005. Syst. Biol. 54: 471-482.
  Sanderson 2006. Bioinformatics 22: 1004-1006.
  Zmasek & Eddy 2001. Bioinformatics 17: 383-384.

We developed our own to avoid plug-ins, easily highlight query results, and provide outlinks to GenBank and TBMap.

Spelling

Suggestions come from TreeBASE and NCBI
Uses GNU Aspell
  Modified to handle special characters (`-', `&', `.') and compound words.

Under construction

Unrooted trees
Supertree methods
  MRP, MRF, MMC
Desktop version
Automatic update
Suggestions?

Thanks to

Rod Page for TBMap
Bill Piel for TreeBASE data
Mike Sanderson
Oliver Eulenstein
National Science Foundation (grant EF-0334832)