Pattern Discovery and Recognition for Genetic Regulation
Tim Bailey
UQ Maths and IMB
Research Goals
We are studying algorithms for discovering regulatory elements in DNA. Our research includes:
Developing fast and accurate methods for computing the statistics of random alignments
Discovering regulatory elements in the upstream regions of orthologous genes
Recent Work
Developed a new way of computing statistics for DNA regulatory motif scores
Participated in the evaluation of most extant motif discovery algorithms
Studied prediction of subcellular localization
Studied prediction of protein accessible surface area
Developing algorithms for motif discovery in sets of orthologous sequences
Collaborations
Algorithm evaluation: Martin Tompa (University of Washington)
Protein accessible surface area: Zheng Yuan (IMB)
Subcellular localization: Rohan Teasdale, Melissa Davis (IMB)
Computing the statistics of random alignments
Knowing the statistical significance of motifs makes it possible to distinguish “real” motifs from patterns that can be explained by chance.
Computing motif significance is therefore critical to any motif discovery approach.
Measuring the goodness of DNA regulatory motifs: IC
[Figure: from sequences to alignment, counts, frequencies, and information content]
Sequences
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3
Alignment (columns 1 2 … w)
1 GACATCGAAA
2 GCACTTCGGC
  GAGTCATTAC
  GTAAATTGTC
  CCACAGTCCG
N TGTGAAGCAC
Counts: n_ij
Frequencies: f_ij = n_ij / N
Information Content: IC = IC_1 + … + IC_w
POP: product of IC p-values
IC is the sum of the information contents of the motif columns.
POP is an alternative measure of motif quality: the product of the p-values of the column information contents.
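The two scores can be illustrated with a minimal Python sketch applied to the six aligned sites shown on the slide. Several details are assumptions not stated in the slides: a uniform background (0.25 per base), information content measured in nats as sum_i f_ij ln(f_ij / 0.25), and a column p-value computed by enumerating every possible count vector for a random column. All function names are hypothetical.

```python
import math

# The six aligned sites shown on the slide.
SITES = ["GACATCGAAA", "GCACTTCGGC", "GAGTCATTAC",
         "GTAAATTGTC", "CCACAGTCCG", "TGTGAAGCAC"]
ALPHABET = "ACGT"
BACKGROUND = {b: 0.25 for b in ALPHABET}  # assumption: uniform background


def column_ic(counts, n):
    """Information content (nats) of one column given its base counts."""
    ic = 0.0
    for base, c in zip(ALPHABET, counts):
        if c:
            f = c / n
            ic += f * math.log(f / BACKGROUND[base])
    return ic


def count_vectors(n, k=4):
    """Yield every tuple of k non-negative integers that sums to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in count_vectors(n - first, k - 1):
            yield (first,) + rest


def multinomial_prob(counts, n):
    """Probability of a count vector for a random column of n bases."""
    coef = math.factorial(n)
    prob = 1.0
    for base, c in zip(ALPHABET, counts):
        coef //= math.factorial(c)
        prob *= BACKGROUND[base] ** c
    return coef * prob


def column_pvalue(counts, n):
    """P(IC of a random column >= observed IC), by full enumeration."""
    observed = column_ic(counts, n)
    return sum(multinomial_prob(v, n) for v in count_vectors(n)
               if column_ic(v, n) >= observed - 1e-12)


def ic_and_pop(sites):
    """Return (IC, POP): sum of column ICs and product of column p-values."""
    n = len(sites)
    ic_total, pop = 0.0, 1.0
    for column in zip(*sites):
        counts = tuple(column.count(b) for b in ALPHABET)
        ic_total += column_ic(counts, n)
        pop *= column_pvalue(counts, n)
    return ic_total, pop


if __name__ == "__main__":
    ic, pop = ic_and_pop(SITES)
    print(f"IC = {ic:.3f} nats, POP = {pop:.3e}")
```

Full enumeration is only practical for a handful of sites; it is used here to keep the definitions visible, not as a suggestion of how the p-values are computed in practice.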
Statistics of IC scores
A large deviation method for computing the distribution of IC of random alignments is known (Hertz and Stormo, Bioinformatics, 15:563-577, 1999).
Time to compute the p-value of one IC score is O(N^2).
MEME computes O(w^2 N) IC scores per motif, so the total time, O(w^2 N^3), is prohibitive.
POP p-values can be computed efficiently.
Discovering regulatory elements in orthologous genes
De novo discovery of most known regulatory elements in yeast has been demonstrated using four closely related yeast genomes (Kellis et al., Nature 423:241-254, 2003).
We are exploring the possibility of extending their approach to the human genome using orthologous genes from mouse.
Speedup using POP statistic
Evaluation of motif discovery algorithms
Eighteen motif discovery algorithms were evaluated on DNA regulatory motifs in four organisms.
Each algorithm was run by experts in that particular algorithm.
The ability of the algorithm to discover motifs in sets of DNA sequences was measured.
Performance of Motif Discovery Algorithms Finding Regulatory Motifs
[Figure: nCC categorized by species. Vertical axis: combined nCC, from -0.05 to 0.45. Categories: whole set, fly, human, mouse, yeast.]
Conservation of known regulatory elements in sets of orthologous genes
[Figure: two panels, "Human vs. Mouse" and "Four yeast species", each comparing regulatory elements with background sequences.]
Source: Liu et al., Genome Res 14:451-458, 2004.
Large-scale discovery of human regulatory elements
Compared with yeast, regulatory elements make up less of human intergenic DNA (3% vs. 15%).
The relative difference in conservation rate (window percent identity) between human and mouse regulatory elements and background sequence is higher than among the four yeast species.
Large-scale motif discovery should be possible using human and mouse orthologous genes.
Estimating the POP p-value correction factor parameters
To estimate the correction factor parameters we:
estimate the right tail of the distribution using a convolution method,
fit the (non-linear) correction function to the tail of the distribution using a least squares approach.
The CPU time per motif to compute POP p-values is negligible once the correction factor parameters are known.
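The slides do not say what exactly is convolved, so the following Python sketch rests on an assumption: that the columns of a random alignment are independent, so ln(POP) is a sum of per-column ln p-values and its distribution is the w-fold convolution of the single-column distribution. A uniform background and exhaustive enumeration of column counts are further assumptions, and all function names are hypothetical. The least squares fit is not shown because the correction function itself is not given here.

```python
import math
from collections import defaultdict

ALPHABET_SIZE = 4   # DNA
BACKGROUND = 0.25   # assumption: uniform base frequency


def count_vectors(n, k=ALPHABET_SIZE):
    """Yield every tuple of k non-negative integers that sums to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in count_vectors(n - first, k - 1):
            yield (first,) + rest


def column_ic(counts, n):
    """Information content (nats) of a column under the uniform background."""
    return sum((c / n) * math.log((c / n) / BACKGROUND) for c in counts if c)


def multinomial_prob(counts, n):
    """Probability of a count vector for a random column of n bases."""
    coef = math.factorial(n)
    for c in counts:
        coef //= math.factorial(c)
    return coef * BACKGROUND ** n


def column_pvalue_distribution(n):
    """Return {column p-value: probability} for one random column."""
    by_ic = defaultdict(float)
    for v in count_vectors(n):
        # Sort the counts so permutations give bit-identical IC values.
        ic = round(column_ic(tuple(sorted(v)), n), 12)
        by_ic[ic] += multinomial_prob(v, n)
    dist, tail = {}, 0.0
    for ic in sorted(by_ic, reverse=True):  # highest IC first
        tail += by_ic[ic]                   # tail = P(IC >= ic)
        dist[tail] = by_ic[ic]
    return dist


def convolve_log_pvalues(dist, w):
    """Distribution of ln(POP), the sum of w independent column ln p-values."""
    total = {0.0: 1.0}
    for _ in range(w):
        nxt = defaultdict(float)
        for s, ps in total.items():
            for p, q in dist.items():
                nxt[round(s + math.log(p), 9)] += ps * q
        total = dict(nxt)
    return total


def pop_pvalue(dist_sum, log_pop):
    """P(ln POP of a random alignment <= log_pop)."""
    return sum(q for s, q in dist_sum.items() if s <= log_pop + 1e-6)


if __name__ == "__main__":
    dist = column_pvalue_distribution(6)
    conv = convolve_log_pvalues(dist, 3)
    print(pop_pvalue(conv, math.log(1e-4)))
```

The number of distinct sums grows quickly with w, which is one reason a practical method would only track the tail of the distribution, for example on a binned grid.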
Correction factor for POP p-values
The p-value of a POP score, p, is roughly:
Because of the discrete nature of IC p-values, it is necessary to correct the POP p-values.
Empirically, the p-value error for POP, p, letting x = ln(p), is about
where a and b are parameters that must be estimated.
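The formulas on this slide did not survive in the transcript. As a point of reference only, a standard result for continuous p-values is that the product of w independent Uniform(0,1) values is at most p with probability p * sum_{i=0}^{w-1} (-ln p)^i / i!. The Python sketch below computes that quantity. Whether this is the approximation the slide showed is an assumption, and the function name is hypothetical. The correction function with parameters a and b is not reproduced because its form is not given here.

```python
import math


def product_pvalue(p, w):
    """P(product of w independent Uniform(0,1) values <= p).

    Holds for continuous p-values; column IC p-values are discrete,
    which is why the slide says a correction is needed.
    """
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    x = -math.log(p)
    term = 1.0    # x**i / i! for i = 0
    total = 1.0
    for i in range(1, w):
        term *= x / i
        total += term
    return p * total


if __name__ == "__main__":
    # Hypothetical POP score for a motif of width 16.
    print(product_pvalue(1e-20, 16))
```

For w = 1 the function returns p itself, and for w = 2 it reduces to p(1 - ln p), which gives a quick sanity check on the formula.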
CPU time per motif using the LD (large deviation) method to compute p-values (w = 16)
CPU time to estimate correction factor parameters (w = 16)