Page 1: Pattern Discovery and Recognition for Genetic Regulation

Pattern Discovery and Recognition for Genetic Regulation

Tim Bailey

UQ Maths and IMB

Page 2: Pattern Discovery and Recognition for Genetic Regulation

Research Goals

We are studying algorithms for discovering regulatory elements in DNA. Our research includes:

Developing fast and accurate methods for computing the statistics of random alignments

Discovering regulatory elements in the upstream regions of orthologous genes

Page 3: Pattern Discovery and Recognition for Genetic Regulation

Recent Work

Developed a new way of computing statistics for DNA regulatory motif scores

Participated in the evaluation of most extant motif discovery algorithms

Studied prediction of subcellular localization

Studied prediction of accessible protein surface area

Developing algorithms for motif discovery in sets of orthologous sequences

Page 4: Pattern Discovery and Recognition for Genetic Regulation

Collaborations

Algorithm evaluation: Martin Tompa (University of Washington)

Protein accessible surface area: Zheng Yuan (IMB)

Subcellular localization: Rohan Teasdale, Melissa Davis (IMB)

Page 5: Pattern Discovery and Recognition for Genetic Regulation

Computing the statistics of random alignments

Knowing the statistical significance of motifs makes it possible to distinguish “real” motifs from patterns that can be explained by chance.

Computing motif significance is therefore critical to any motif discovery approach.

Page 6: Pattern Discovery and Recognition for Genetic Regulation

Measuring the goodness of DNA regulatory motifs: IC

[Figure: Sequences → Alignment → Counts ($n_{ij}$) → Frequencies ($f_{ij} = n_{ij}/N$) → Information Content ($IC = IC_1 + \dots + IC_w$). The alignment has rows 1 … N and columns 1 … w; the counts are indexed by i and j.]

Sequences

5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7

5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4

5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6

5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4

5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1

5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2

5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3

Alignment

1 GACATCGAAA

2 GCACTTCGGC

GAGTCATTAC

GTAAATTGTC

CCACAGTCCG

N TGTGAAGCAC
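The pipeline on this slide (alignment, counts, frequencies, information content) can be sketched in a few lines. This is a minimal illustration, not the authors' code. It assumes that the per-column IC is the relative entropy of the column frequencies against a background distribution, in nats, and that the background is uniform; the slide states neither. The names `SITES`, `BACKGROUND`, `column_ic` and `motif_ic` are hypothetical.

```python
import math
from collections import Counter

# Aligned sites as listed on the slide (rows 1 ... N of the alignment).
SITES = [
    "GACATCGAAA",
    "GCACTTCGGC",
    "GAGTCATTAC",
    "GTAAATTGTC",
    "CCACAGTCCG",
    "TGTGAAGCAC",
]

# Assumption: uniform background; the slide does not give one.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def column_ic(column, background=BACKGROUND):
    """Relative entropy (nats) of one alignment column versus the background."""
    n_sites = len(column)
    ic = 0.0
    for letter, n_ij in Counter(column).items():
        f_ij = n_ij / n_sites  # frequency = count / N
        ic += f_ij * math.log(f_ij / background[letter])
    return ic


def motif_ic(sites, background=BACKGROUND):
    """Return the list of column ICs; the motif IC is their sum."""
    columns = ["".join(col) for col in zip(*sites)]
    return [column_ic(col, background) for col in columns]


column_ics = motif_ic(SITES)
total_ic = sum(column_ics)  # IC = IC_1 + ... + IC_w
```

A perfectly conserved column reaches the maximum of ln 4 under the uniform background, and a column that matches the background exactly scores 0.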

Page 7: Pattern Discovery and Recognition for Genetic Regulation

POP: product of IC p-values

IC is the sum of the information contents of the motif columns.

POP is an alternative measure of motif quality: the product of the p-values of the column information contents.
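To make the definition concrete, the sketch below computes a column p-value by brute force and multiplies the column p-values in log space. It is an illustration under stated assumptions, not the method used in the talk: the column IC is taken to be relative entropy against a uniform background, the null model is a column of N independent background letters, and the p-value is obtained by enumerating every possible count vector, which is only practical for small N. All function names are hypothetical.

```python
import math


def compositions(total, parts):
    """Yield every tuple of `parts` non-negative integers summing to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest


def ic_from_counts(counts, probs):
    """Relative entropy (nats) of the count vector versus background probs."""
    n = sum(counts)
    return sum((c / n) * math.log((c / n) / p)
               for c, p in zip(counts, probs) if c > 0)


def multinomial_prob(counts, probs):
    """Probability of drawing exactly these letter counts from the background."""
    coef = math.factorial(sum(counts))
    for c in counts:
        coef //= math.factorial(c)  # stays an exact integer at every step
    prob = float(coef)
    for c, p in zip(counts, probs):
        prob *= p ** c
    return prob


def column_pvalue(counts, probs=(0.25, 0.25, 0.25, 0.25), eps=1e-12):
    """P(IC of a random column >= IC of the observed column), by enumeration."""
    observed = ic_from_counts(counts, probs)
    return sum(multinomial_prob(c, probs)
               for c in compositions(sum(counts), len(probs))
               if ic_from_counts(c, probs) >= observed - eps)


def log_pop(count_columns, probs=(0.25, 0.25, 0.25, 0.25)):
    """Natural log of POP: the sum of the log column p-values."""
    return sum(math.log(column_pvalue(c, probs)) for c in count_columns)
```

For example, `column_pvalue((6, 0, 0, 0))` is the probability that six random letters are all identical, 4 × 0.25^6. Working with the log of the product avoids underflow when many small p-values are multiplied.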

Page 8: Pattern Discovery and Recognition for Genetic Regulation

Statistics of IC scores

A large deviation method for computing the distribution of the IC of random alignments is known (Hertz and Stormo, Bioinformatics, 15:563-577, 1999).

The time to compute the p-value of one IC score is $O(N^2)$.

MEME computes $O(w^2 N)$ IC scores per motif, so the total time, $O(w^2 N) \times O(N^2) = O(w^2 N^3)$, is prohibitive.

POP p-values can be computed efficiently.

Page 9: Pattern Discovery and Recognition for Genetic Regulation

Discovering regulatory elements in orthologous genes

De novo discovery of most known regulatory elements in yeast has been demonstrated using four closely related yeast genomes (Kellis et al., Nature 423:241-254, 2003).

We are exploring the possibility of extending their approach to the human genome using orthologous genes from mouse.

Page 10: Pattern Discovery and Recognition for Genetic Regulation

Speedup using POP statistic

Page 11: Pattern Discovery and Recognition for Genetic Regulation

Evaluation of motif discovery algorithms

Eighteen motif discovery algorithms were evaluated on DNA regulatory motifs in four organisms.

Each algorithm was run by experts in that particular algorithm.

The ability of each algorithm to discover motifs in sets of DNA sequences was measured.

Page 12: Pattern Discovery and Recognition for Genetic Regulation

Performance of Motif Discovery Algorithms Finding Regulatory Motifs

[Figure: bar chart, "nCC categorized by species". Vertical axis: combined nCC, from -0.05 to 0.45. Categories: whole set, fly, human, mouse, yeast.]
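The chart reports nCC without defining it. Assuming nCC is the nucleotide-level correlation coefficient commonly used to score motif predictions, it can be computed from the overlap between known and predicted site positions as sketched below. The function name, the use of position sets, and the choice to return 0 when the coefficient is undefined are all assumptions made for this illustration.

```python
import math


def ncc(known, predicted, length):
    """Nucleotide-level correlation coefficient.

    known, predicted: sets of nucleotide positions covered by sites.
    length: total number of nucleotides in the data set.
    """
    tp = len(known & predicted)   # positions in both known and predicted sites
    fp = len(predicted - known)   # predicted but not known
    fn = len(known - predicted)   # known but not predicted
    tn = length - tp - fp - fn    # in neither
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumption: treat the undefined case as 0
    return (tp * tn - fn * fp) / denom


known_sites = set(range(10, 20))
perfect = ncc(known_sites, set(range(10, 20)), 100)
disjoint = ncc(known_sites, set(range(30, 40)), 100)
```

A perfect prediction scores 1, and a prediction that misses every known site scores slightly below 0, which is consistent with an axis that extends to negative values.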

Page 13: Pattern Discovery and Recognition for Genetic Regulation

Conservation of known regulatory elements in sets of orthologous genes

[Figure: two panels, "Human vs. Mouse" and "Four yeast species", each comparing regulatory elements with background sequences.]

Source: Liu et al., Genome Res 14:451-458, 2004.

Page 14: Pattern Discovery and Recognition for Genetic Regulation

Large-scale discovery of human regulatory elements

Compared with yeast, regulatory elements make up less of human intergenic DNA (3% vs. 15%).

The relative difference in conservation rate (window percent identity) between regulatory elements and background sequence is higher for human and mouse than among the four yeast species.

Large-scale motif discovery should be possible using human and mouse orthologous genes.
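The conservation measure named above, window percent identity, can be illustrated with a short sketch. It assumes two sequences that are already aligned to the same length, a sliding window of fixed width, and that a gap character never counts as a match; the slides do not specify the window size or the gap handling, so these are placeholders.

```python
def window_percent_identity(seq_a, seq_b, window):
    """Percent identity in each sliding window of two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Assumption: a gap ('-') in either sequence is never an identity.
    matches = [a == b and a != "-" for a, b in zip(seq_a, seq_b)]
    return [100.0 * sum(matches[start:start + window]) / window
            for start in range(len(matches) - window + 1)]


identities = window_percent_identity("ACGTACGT", "ACGTTCGT", 4)
```

Comparing the distribution of these window values inside known regulatory elements with the distribution in background sequence gives the kind of contrast described on the slide.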

Page 15: Pattern Discovery and Recognition for Genetic Regulation

Estimating the POP p-value correction factor parameters

To estimate the correction factor parameters we:

estimate the right tail of the distribution using a convolution method,

fit the (non-linear) correction function to the tail of the distribution using a least squares approach.

The CPU time per motif to compute POP p-values is negligible once the correction factor parameters are known.
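The first step, estimating a tail by convolution, can be sketched as follows. This is only an illustration of the general idea: the score is taken to be the sum over w independent columns of -ln(column p-value), each column p-value can take only a few discrete values, and the distribution of the sum is built by repeated convolution. The attainable p-values used here are made up for the example, and the second step (fitting the correction function) is not shown because its form is not given in the transcript.

```python
import math
from collections import defaultdict


def pvalue_pmf(attainable):
    """Map attainable p-values of a discrete statistic to a pmf over -ln(p).

    Under the null, P(p-value == p_k) = p_k - p_(k+1), where p_(k+1) is the
    next smaller attainable p-value (0 after the smallest one).
    """
    ps = sorted(attainable, reverse=True)
    pmf = {}
    for k, p in enumerate(ps):
        next_smaller = ps[k + 1] if k + 1 < len(ps) else 0.0
        pmf[-math.log(p)] = p - next_smaller
    return pmf


def convolve(dist_a, dist_b, ndigits=6):
    """Distribution of the sum of two independent discrete variables."""
    out = defaultdict(float)
    for xa, pa in dist_a.items():
        for xb, pb in dist_b.items():
            out[round(xa + xb, ndigits)] += pa * pb  # merge near-equal sums
    return dict(out)


def sum_distribution(column_dist, w):
    """Distribution of the sum over w independent, identical columns."""
    total = {0.0: 1.0}
    for _ in range(w):
        total = convolve(total, column_dist)
    return total


def right_tail(dist, score, tol=1e-6):
    """P(sum >= score)."""
    return sum(p for x, p in dist.items() if x >= score - tol)


# Hypothetical attainable column p-values, chosen only for illustration.
column = pvalue_pmf([1.0, 0.5, 0.1])
two_columns = sum_distribution(column, 2)
tail_at_005 = right_tail(two_columns, -math.log(0.05))
```

In this toy case the exact tail probability for a product of 0.05 is 0.09, which shows how a discrete distribution can give a p-value quite different from a continuous approximation.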

Page 16: Pattern Discovery and Recognition for Genetic Regulation

Correction factor for POP p-values

The p-value of a POP score, p, is roughly:

[equation not preserved in the transcript]

Because of the discrete nature of IC p-values, it is necessary to correct the POP p-values.

Empirically, the p-value error for POP, p, letting x = ln(p), is about

[equation not preserved in the transcript]

where a and b are parameters that must be estimated.
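For orientation, there is a standard closed form for the p-value of a product of w independent p-values when each is continuous and uniform on (0, 1): P(product ≤ p) = p · Σ_{i=0}^{w-1} (-ln p)^i / i!. Whether this is the approximation shown on the slide is an assumption; the sketch below only implements that textbook formula, and the function name is hypothetical. The correction with parameters a and b is not reproduced, since its form is not available.

```python
import math


def product_pvalue_continuous(p, w):
    """P(product of w independent Uniform(0,1) p-values <= p)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    x = -math.log(p)
    term = 1.0   # (x ** i) / i!, starting at i = 0
    total = 1.0
    for i in range(1, w):
        term *= x / i
        total += term
    return p * total


two_column_example = product_pvalue_continuous(0.01, 2)
```

With w = 1 the function returns p unchanged, and with w = 2 and p = 0.01 it returns 0.01 · (1 + ln 100). Because IC p-values are discrete, the true p-value can differ from this continuous value, which is the motivation given above for a correction factor.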

Page 17: Pattern Discovery and Recognition for Genetic Regulation

CPU time per motif using LD method to compute p-values

w=16

Page 18: Pattern Discovery and Recognition for Genetic Regulation

CPU time to estimate correction factor parameters

w=16