waiting time and seed selection for homology search
DESCRIPTION
Waiting Time and Seed Selection for Homology Search. Gary Benson Department of Computer Science Department of Biology Graduate Program in Bioinformatics Boston University. Classroom Experiment – Estimating Waiting Time. - PowerPoint PPT PresentationTRANSCRIPT
Waiting Time and Seed Selection for Homology Search
Gary BensonDepartment of Computer Science
Department of Biology
Graduate Program in Bioinformatics
Boston University
Classroom Experiment – Estimating Waiting Time
• Each student tosses a coin until the first occurrence of 3 heads in a row. This is one trial. If ten tosses do not produce three heads in a row, stop and begin a new trial.
• Repeat trials, if necessary, so that there are 100 trials across all the students in the class.
• For each trial, record in a table which toss (if any) gives the first occurrence of three heads in a row.
Example
1 2 3 4 5 6 7 8 9 10
Trial 1: H T H H T H H H 8
Trial 2: T T H H H 5
Trial 3: H T T H H T H H H 9
Trial 4: H T T H H T H T H H None
Firstoccurrence
Fill in Table
N = 3 4 5 6 7 8 9 10 None Trial 1 X Trial 2 X Trial 3 X Trial 4 X Total in 4 trials
0 0 1 0 0 1 1 0 1
Class Results
Class Trial Record N = 3 4 5 6 7 8 9 10 None
Total in 100 trials
Probability Total/100
Cumulative Probability
ACGTTCGACTACGTTCGACT
ACGTTCGGCTACGTTCGACT
AAGTTCGACT ACGTTCGGCT
ACCTTCGGCTAAGTTCGACT
AAGTTCGACA ACCTTCGGCT
Molecular evolution
Speciation
Nucleotide substitution
Nucleotide substitution
Human Mouse
Evolutionary time
Biologists Ask:
I have an unstudied human gene sequence
AAGTTCGACA
What other genes (from other organisms)
are similar to this one?
Why do they ask that question?
Genes that have similar sequences often:
1. produce proteins that have similar function
2. are regulated in similar ways
So a gene that has already been studied can often provide information about an unknown gene.
How we answer the question:
Compare the human gene to
every sequence
in a database of known genomes
How BIG is the database?
Combined length of all sequences
is greater than
1011 = 100,000,000,000 nucleotides.
http://www.ncbi.nlm.nih.gov/
http://www.ncbi.nlm.nih.gov/
How long does it take to search?
The best algorithm (not the fastest) for detecting similarity takes time proportional to the product of the lengths of the sequence and the database.
Assume the unstudied gene sequence is
102 = 100 nucleotides and the database is
1011 = 100,000,000,000 nucleotides.
Searching the database
102 • 1011 = 1013 nucleotide comparisons.
Good desktop computers now run at
3 gigahertz = 3 billion = 3 • 109 comparisons per second
So comparing one sequence to the database takes
1013 / 3 • 109 = 3333 seconds = 55 minutes
Searching the database
102 • 1011 = 1013 nucleotide comparisons.
Good desktop computers now run at
3 gigahertz = 3 billion = 3 • 109 comparisons per second
So comparing one sequence to the database takes
1013 / 3 • 109 = 3333 seconds = 55 minutes
UGH!!!
A faster way
Take small “words” from A and find matches in the database first:
AAGTTCGACA
AAGTT AGTTC GTTCG … CGACA
Then only do comparisons where there is a “hit”.
Example
Sequence: AAGTTCGACA
Targets: AAGTT AGTTC GTTCG … CGACA
Database:
ACTGGTTCAAATGGCGCATGCAAAAGTTGGCTGATTTGCATGACGTACCCTGAGACCTCGGAATTCTAGCTTGCGAAGTAATACGATACCGTACGTTGCCGACATACGGTACGTCGTCTACGTACGTACGCCTACGCTACGTACCTTCGGCTTTTCATGGCAGCGATCGTACTCCTCTAGTTCCTGACTGACTAC…
Example
Sequence: AAGTTCGACA
Targets: AAGTT AGTTC GTTCG … CGACA
Database:
ACTGGTTCAAATGGCGCATGCAAAAGTTGGCTGATTTGCATGACGTACCCTGAGACCTCGGAATTCTAGCTTGCGAAGTAATACGATACCGTACGTTGCCGACATACGGTACGTCGTCTACGTACGTACGCCTACGCTACGTACCTTCGGCTTTTCATGGCAGCGATCGTACTCCTCTAGTTCCTGACTGACTAC…
But we missed the mouse gene!
What size should the “words” be?
• Too long and we miss similar sequences
This reduces sensitivity.
• Too small and there are “hits” everywhere.
This reduces specificity.
We need to balance these two issues.
Computing sensitivity
We need to answer the following type of question:
Suppose a gene sequence of length 100 has a match in the database and the two sequences are expected to be identical at 80% of their positions. If we search with “words” of length 5, how often will we find the match?
Computing sensitivity
We model the two sequences with coin tosses, where a match is considered a head and a mismatch (mutation) is considered a tail.
Example:AAGTTCGACAACCTTCGGCTHTTHHHHTHT
Computing sensitivity
We model the two sequences with coin tosses, where a match is considered a head and a mismatch (mutation) is considered a tail.
Example:AAGTTCGACAACCTTCGGCTHTTHHHHTHT
We are interested in at least one occurrence of five heads if we use small words of length five.
Waiting Time
Our question is a classic waiting time problem. It asks, for coin toss sequences of length n, and with probability of heads = P(H), what is the probability of tossing k heads in a row at least once?
For our question,
n = 100,
P(H) = 0.8
k = 5
Answer
99.99% -- Which means that using small words of length 5, we will never miss similar sequences of length 100.
In fact, the sequences have to be as short as 23 nucleotides before we will miss 5%.
And at 9 nucleotides we miss 41% of similarities.
AAGTTCGACAACCTTCGGCT
But wait! Maybe the “words” can be modified
It turns out, that words composed of letters with don’t care spaces produce better results.
AAGTTCGACA
AA*TTC AG*TCG GT*CGA … TC*ACA
But wait! Maybe the “words” can be modified
It turns out, that words composed of letters with don’t care spaces produce better results.
AAGTTCGACA
AA*TTC AG*TCG GT*CGA … TC*ACA
Current research involves finding the best word shapes.