Waiting Time and Seed Selection for Homology Search

Gary Benson, Department of Computer Science, Department of Biology, Graduate Program in Bioinformatics, Boston University

Post on 02-Jan-2016

TRANSCRIPT

Page 1: Waiting Time and Seed Selection for Homology Search

Waiting Time and Seed Selection for Homology Search

Gary Benson

Department of Computer Science

Department of Biology

Graduate Program in Bioinformatics

Boston University

Page 2: Waiting Time and Seed Selection for Homology Search

Classroom Experiment – Estimating Waiting Time

• Each student tosses a coin until the first occurrence of 3 heads in a row. This is one trial. If ten tosses do not produce three heads in a row, stop and begin a new trial.

• Repeat trials, if necessary, so that there are 100 trials across all the students in the class.

• For each trial, record in a table which toss (if any) gives the first occurrence of three heads in a row.
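The classroom procedure can also be mimicked on a computer. The following is a minimal sketch, assuming a fair coin; the function names `first_occurrence` and `run_trials` and the fixed random seed are illustrative choices, not part of the original slides.

```python
import random

def first_occurrence(tosses, k=3):
    """Return the 1-based toss number that completes the first run of
    k heads, or None if no such run occurs."""
    run = 0
    for i, toss in enumerate(tosses, start=1):
        run = run + 1 if toss == "H" else 0
        if run == k:
            return i
    return None

def run_trials(num_trials=100, max_tosses=10, k=3, seed=0):
    """Simulate the classroom experiment and tally first occurrences.

    All max_tosses tosses are generated for every trial; recording only
    the first run of k heads is equivalent to stopping at that toss.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    tally = {n: 0 for n in range(k, max_tosses + 1)}
    tally[None] = 0            # trials with no run of k heads
    for _ in range(num_trials):
        tosses = [rng.choice("HT") for _ in range(max_tosses)]
        tally[first_occurrence(tosses, k)] += 1
    return tally

if __name__ == "__main__":
    for n, count in run_trials().items():
        print(n, count)
```

Each key of the returned tally corresponds to one column of the class table, with `None` standing for trials that never produced three heads in a row.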

Page 3: Waiting Time and Seed Selection for Homology Search

Example

Toss:     1  2  3  4  5  6  7  8  9  10   First occurrence
Trial 1:  H  T  H  H  T  H  H  H          8
Trial 2:  T  T  H  H  H                   5
Trial 3:  H  T  T  H  H  T  H  H  H       9
Trial 4:  H  T  T  H  H  T  H  T  H  H    None

Page 4: Waiting Time and Seed Selection for Homology Search

Fill in Table

N =                 3  4  5  6  7  8  9  10  None
Trial 1                            X
Trial 2                   X
Trial 3                               X
Trial 4                                      X
Total in 4 trials   0  0  1  0  0  1  1  0   1

Page 5: Waiting Time and Seed Selection for Homology Search

Class Results

Class Trial Record

N =                       3  4  5  6  7  8  9  10  None
Total in 100 trials
Probability (Total/100)
Cumulative probability
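The last two rows of the table follow directly from the totals. As a small sketch, the helper below (the name `summarize` is an assumption for illustration) divides each total by the number of trials and then accumulates the result. It is shown here with the four-trial example counts, since the class totals are not part of the transcript.

```python
from itertools import accumulate

def summarize(counts, total_trials):
    """Turn per-column totals into probabilities and cumulative probabilities."""
    probs = [c / total_trials for c in counts]
    cumulative = list(accumulate(probs))
    return probs, cumulative

# Totals for N = 3..10 from the four example trials (the "None" column is
# left out because it is not a waiting time).
counts = [0, 0, 1, 0, 0, 1, 1, 0]
probs, cumulative = summarize(counts, total_trials=4)

if __name__ == "__main__":
    for n, p, c in zip(range(3, 11), probs, cumulative):
        print(f"N={n:2d}  probability={p:.2f}  cumulative={c:.2f}")
```

The final cumulative value is the fraction of trials in which three heads in a row appeared within ten tosses.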

Page 6: Waiting Time and Seed Selection for Homology Search

Molecular evolution

[Figure: after speciation, two copies of the ancestral sequence ACGTTCGACT accumulate nucleotide substitutions along an evolutionary time axis.]

Human lineage: ACGTTCGACT, AAGTTCGACT, AAGTTCGACA

Mouse lineage: ACGTTCGACT, ACGTTCGGCT, ACCTTCGGCT

Page 7: Waiting Time and Seed Selection for Homology Search

Biologists Ask:

I have an unstudied human gene sequence:

AAGTTCGACA

What other genes (from other organisms) are similar to this one?

Page 8: Waiting Time and Seed Selection for Homology Search

Why do they ask that question?

Genes that have similar sequences often:

1. produce proteins that have similar function

2. are regulated in similar ways

So a gene that has already been studied can often provide information about an unknown gene.

Page 9: Waiting Time and Seed Selection for Homology Search

How we answer the question:

Compare the human gene to every sequence in a database of known genomes.

Page 10: Waiting Time and Seed Selection for Homology Search

How BIG is the database?

The combined length of all sequences is greater than 10^11 = 100,000,000,000 nucleotides.

Page 11: Waiting Time and Seed Selection for Homology Search

http://www.ncbi.nlm.nih.gov/

Page 12: Waiting Time and Seed Selection for Homology Search

http://www.ncbi.nlm.nih.gov/

Page 13: Waiting Time and Seed Selection for Homology Search
Page 14: Waiting Time and Seed Selection for Homology Search

How long does it take to search?

The best algorithm (not the fastest) for detecting similarity takes time proportional to the product of the lengths of the sequence and the database.

Assume the unstudied gene sequence is 10^2 = 100 nucleotides and the database is 10^11 = 100,000,000,000 nucleotides.

Page 15: Waiting Time and Seed Selection for Homology Search

Searching the database

10^2 • 10^11 = 10^13 nucleotide comparisons.

Good desktop computers now run at 3 gigahertz. At one comparison per clock cycle, that is 3 billion = 3 • 10^9 comparisons per second.

So comparing one sequence to the database takes

10^13 / (3 • 10^9) ≈ 3333 seconds ≈ 55 minutes
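The estimate can be reproduced in a few lines. This is only a sketch of the arithmetic on the slide; the assumption of one comparison per clock cycle is a simplification, and the variable names are illustrative.

```python
query_len = 10**2      # length of the unstudied gene sequence
db_len = 10**11        # combined length of the database
rate = 3 * 10**9       # comparisons per second (assumes one per cycle at 3 GHz)

comparisons = query_len * db_len   # time is proportional to the product
seconds = comparisons / rate
minutes = seconds / 60

if __name__ == "__main__":
    print(f"{comparisons:.0e} comparisons, {seconds:.0f} s, {minutes:.1f} min")
```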

Page 16: Waiting Time and Seed Selection for Homology Search

UGH!!!

Page 17: Waiting Time and Seed Selection for Homology Search

A faster way

Take small “words” from the query sequence and find matches in the database first:

AAGTTCGACA

AAGTT AGTTC GTTCG … CGACA

Then only do comparisons where there is a “hit”.
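The word list above is the set of all overlapping substrings of a fixed length. A minimal sketch, assuming a helper named `words` (not from the slides):

```python
def words(seq, w):
    """Return every overlapping word of length w in seq, left to right."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]

if __name__ == "__main__":
    print(words("AAGTTCGACA", 5))
```

A sequence of length 10 yields 10 - 5 + 1 = 6 words of length 5, including the three the slide elides.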

Page 18: Waiting Time and Seed Selection for Homology Search

Example

Sequence: AAGTTCGACA

Targets: AAGTT AGTTC GTTCG … CGACA

Database:

ACTGGTTCAAATGGCGCATGCAAAAGTTGGCTGATTTGCATGACGTACCCTGAGACCTCGGAATTCTAGCTTGCGAAGTAATACGATACCGTACGTTGCCGACATACGGTACGTCGTCTACGTACGTACGCCTACGCTACGTACCTTCGGCTTTTCATGGCAGCGATCGTACTCCTCTAGTTCCTGACTGACTAC…

Page 19: Waiting Time and Seed Selection for Homology Search

But we missed the mouse gene!
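The miss can be checked directly. The sketch below indexes every 5-letter word of the database excerpt, looks up the query words, and then compares the word sets of the human and mouse sequences. `DATABASE` holds only the part of the database string visible on the slide (the slide truncates it), and the helper names are assumptions for illustration.

```python
from collections import defaultdict

# Visible part of the database string from the slide; the rest is elided there.
DATABASE = (
    "ACTGGTTCAAATGGCGCATGCAAAAGTTGGCTGATTTGCATGACGTACCCTGAGACCTCGGAATTCTAGC"
    "TTGCGAAGTAATACGATACCGTACGTTGCCGACATACGGTACGTCGTCTACGTACGTACGCCTACGCTAC"
    "GTACCTTCGGCTTTTCATGGCAGCGATCGTACTCCTCTAGTTCCTGACTGACTAC"
)
HUMAN = "AAGTTCGACA"
MOUSE = "ACCTTCGGCT"

def words(seq, w):
    """All overlapping words of length w in seq."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]

def build_index(db, w):
    """Map each word of length w to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(db) - w + 1):
        index[db[i:i + w]].append(i)
    return index

def find_hits(query, index, w):
    """Return (word, database position) pairs for every exact word match."""
    return [(word, pos) for word in words(query, w) for pos in index.get(word, [])]

index = build_index(DATABASE, 5)
hits = find_hits(HUMAN, index, 5)
shared = set(words(HUMAN, 5)) & set(words(MOUSE, 5))

if __name__ == "__main__":
    print("hits:", hits)
    print("mouse gene starts at:", DATABASE.find(MOUSE))
    print("words shared by human and mouse:", shared)
```

The mouse sequence is present in the database, yet it shares no 5-letter word with the human sequence, so none of the hits point to it.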

Page 20: Waiting Time and Seed Selection for Homology Search

What size should the “words” be?

• Too long and we miss similar sequences. This reduces sensitivity.

• Too small and there are “hits” everywhere. This reduces specificity.

We need to balance these two issues.
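One rough way to see the specificity side of the trade-off is to count how many chance hits a single word would get. The sketch below assumes a simplified model that is not in the slides: the database is treated as uniformly random, independent nucleotides, so a given word of length w is expected to occur about (N - w + 1) / 4^w times in a database of length N. The function name is illustrative.

```python
def expected_random_hits(db_len, w):
    """Expected chance occurrences of one specific word of length w,
    assuming uniformly random, independent nucleotides (a simplification)."""
    return (db_len - w + 1) / 4**w

if __name__ == "__main__":
    for w in (5, 8, 11, 16):
        print(w, f"{expected_random_hits(10**11, w):.3g}")
```

Under this model every extra letter cuts the expected number of chance hits by a factor of about four, which is why short words produce hits everywhere.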

Page 21: Waiting Time and Seed Selection for Homology Search

Computing sensitivity

We need to answer the following type of question:

Suppose a gene sequence of length 100 has a match in the database and the two sequences are expected to be identical at 80% of their positions. If we search with “words” of length 5, how often will we find the match?

Page 22: Waiting Time and Seed Selection for Homology Search

Computing sensitivity

We model the two sequences with coin tosses, where a match is considered a head and a mismatch (mutation) is considered a tail.

Example:

AAGTTCGACA
ACCTTCGGCT
HTTHHHHTHT

Page 23: Waiting Time and Seed Selection for Homology Search

Computing sensitivity

With words of length five, we are interested in at least one occurrence of five heads in a row.
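The coin-toss encoding is easy to reproduce. The sketch below (function names are assumptions for illustration) converts two aligned sequences of equal length into a string of H and T and reports the longest run of heads, which decides whether a word of a given length can hit.

```python
def match_string(a, b):
    """Encode an ungapped alignment: H for a match, T for a mismatch."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return "".join("H" if x == y else "T" for x, y in zip(a, b))

def longest_head_run(tosses):
    """Length of the longest run of consecutive H characters."""
    return max((len(run) for run in tosses.split("T")), default=0)

if __name__ == "__main__":
    tosses = match_string("AAGTTCGACA", "ACCTTCGGCT")
    print(tosses, longest_head_run(tosses))
```

For the human and mouse example the longest run is four heads, one short of what a word of length five needs.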

Page 24: Waiting Time and Seed Selection for Homology Search

Waiting Time

Our question is a classic waiting time problem. It asks, for coin toss sequences of length n, and with probability of heads = P(H), what is the probability of tossing k heads in a row at least once?

For our question,

n = 100,

P(H) = 0.8

k = 5

Page 25: Waiting Time and Seed Selection for Homology Search

Answer

99.99%. This means that, using small words of length 5, we will almost never miss similar sequences of length 100.

In fact, the sequences have to be as short as 23 nucleotides before we miss 5% of them.

And at 9 nucleotides we miss 41% of similarities.

AAGTTCGACA
ACCTTCGGCT
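These figures can be checked with a small dynamic program. The sketch below is one standard way to compute the probability of at least one run of k heads in n independent tosses; it is not necessarily the method used to produce the slide. It tracks the probability of each current run length and moves probability mass into `found` whenever a run reaches k. The function name `prob_run` is an assumption.

```python
def prob_run(n, p, k):
    """Probability that n independent tosses with P(H) = p contain at
    least one run of k consecutive heads."""
    # state[j] = probability that no run of k has occurred yet and the
    # current trailing run of heads has length j (0 <= j < k).
    state = [0.0] * k
    state[0] = 1.0
    found = 0.0
    for _ in range(n):
        new = [0.0] * k
        new[0] = (1 - p) * sum(state)      # a tail resets the run
        for j in range(k - 1):
            new[j + 1] = p * state[j]      # a head extends the run
        found += p * state[k - 1]          # a head completes a run of k
        state = new
    return found

if __name__ == "__main__":
    for n in (100, 23, 9):
        print(n, f"found {prob_run(n, 0.8, 5):.6f}", f"missed {1 - prob_run(n, 0.8, 5):.6f}")
```

With P(H) = 0.8 and k = 5, the missed fraction is about 41% at n = 9 and about 5% at n = 23, in line with the numbers above.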

Page 26: Waiting Time and Seed Selection for Homology Search

But wait! Maybe the “words” can be modified

It turns out that words composed of letters with “don’t care” spaces produce better results.

AAGTTCGACA

AA*TTC AG*TCG GT*CGA … TC*ACA
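The spaced words above can be generated from a word shape. In the sketch below the shape is written as a string such as "11*111", where 1 marks a position that must match and * marks a don't-care position. This notation and the function names are assumptions for illustration. The second function tests whether a shape hits anywhere in an H/T match string.

```python
def spaced_words(seq, shape):
    """Return every window of seq with don't-care positions masked by '*'."""
    span = len(shape)
    result = []
    for i in range(len(seq) - span + 1):
        window = seq[i:i + span]
        result.append("".join(c if s == "1" else "*" for c, s in zip(window, shape)))
    return result

def spaced_hit(tosses, shape):
    """True if some window of the H/T string has H at every '1' position."""
    span = len(shape)
    return any(
        all(t == "H" for t, s in zip(tosses[i:i + span], shape) if s == "1")
        for i in range(len(tosses) - span + 1)
    )

if __name__ == "__main__":
    print(spaced_words("AAGTTCGACA", "11*111"))
    print(spaced_hit("HHTHHH", "11*111"))
```

A shape made only of 1s reduces to the contiguous words used earlier, so the same functions cover both cases.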

Page 27: Waiting Time and Seed Selection for Homology Search

Current research involves finding the best word shapes.