TITLE OF PRESENTATION Board of Scientific Counselors January 2007 Your Name

Upload: constance-lewis

Post on 05-Jan-2016


TRANSCRIPT

Page 1: TITLE OF PRESENTATION Board of Scientific Counselors January 2007 Your Name

TITLE OF PRESENTATION

Board of Scientific Counselors January 2007

Your Name

Page 5:

COMPARATIVE GENOMICS

Manolis Kellis

Board of Scientific Counselors

January 2007

Page 6:

TTATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATACATATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTCAGTAATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTCCGTGCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACTAGCTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATGATAATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAAAAGCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAATTGTTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAATTCTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGGATTTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGATTTTGATATGCTTTGCGCCGTCAAAGTTTTGAACGAGAAAAATCCATCCATTACCTTAATAAATGCTGATCCCAAATTTGCTCAAAGGAAGTTCGATTTGCCGTTGGACGGTTCTTATGTCACAATTGATCCTTCTGTGTCGGACTGGTCTAATTACTTTAAATGTGGTCTCCATGTTGCTCACTCTTTTCTAAAGAAACTTGCACCGGAAAGGTTTGCCAGTGCTCCTCTGGCCGGGCTGCAAGTCTTCTGTGAGGGTGATGTACCAACTGGCAGTGGATTGTCTTCTTCGGCCGCATTCATTTGTGCCGTTGCTTTAGCTGTTGTTAAAGCGAATATGGGCCCTGGTTATCATATGTCCAAGCAAAATTTAATGCGTATTACGGTCGTTGCAGAACATTATGTTGGTGTTAACAATGGCGGTATGGATCAGGCTGCCTCTGTTTGCGGTGAGGAAGATCATGCTCTATACGTTGAGTTCAAACCGCAGTTGAAGGCTACTCCGTTTAAATTTCCGCAATTAAAAAACCATGAAATTAGCTTTGTTATTGCGAACACCCTTGTTGTATCTAACAAGTTTGAAACCGCCCCAACCAACTATAATTTAAGAGTGGTAGAAGTCACTACAGCTGCAAATGTTTTAGCTGCCACGTACGGTGTTGTTTTACTTTCTGGAAAAGAAGGATCGAGCACGAATAAAGGTAATCTAAGAGATTTCATGAACGTTTATTATGCCAGATATCACAACATTTCCACACCCTGGAACGGCGATATTGAATCCGGCATCGAACGGTTAACAAAGATGCTAGTACTAGTTGAAGAGTCTCTCGCCAATAAGAAACAGGGCTTTAGTGTTGACGATGTCGCACAATCCTTGAATTGTTCTCGCGAAGAATTCACAAGAGACTACTTAACAACATCTCCAGTGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAATCTTTAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATGAACGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCT
TGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATCATATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAAAAGAAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCAGCATTGGGCAGCTGTCTATATGAATTATAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAACTTTAGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTAAAAAGAGATTGCCGTCTTGAAACTTTTTGTCCTTTTTTTTTTCCGGGGACTCTACGAGAACCCTTTGTCCTACTGATTAATTTTGTACTGAATTTGGACAATTCAGATTTTAGTAGACAAGCGCGAGGAGGAAAAGAAATGACAGAAAAATTCCGATGGACAAGAAGATAGGAAAAAAAAAAAGCTTTCACCGATTTCCTAGACCGGAAAAAAGTCGTATGACATCAGAATGAAAAATTTTCAAGTTAGACAAGGACAAAATCAGGACAAATTGTAAAGATATAATAAACTATTTGATTCAGCGCCAATTTGCCCTTTTCCATTTTCCATTAAATCTCTGTTCTCTCTTACTTATATGATGATTAGGTATCATCTGTATAAAACTCCTTTCTTAATTTCACTCTAAAGCATACCCCATAGAGAAGATCTTTCGGTTCGAAGACATTCCTACGCATAATAAGAATAGGAGGGAATAATGCCAGACAATCTATCATTACATTTAAGCGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAAAGAGTATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATACAGCTCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTACAACCAGGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATATCAACACTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCGTTGGTGCTTTTGTTGTTTTGGCCTCTAGAGTTGGATCTGCTTATCATTTGTCATTCCCTATATCATCTAGAGCATCATTCGGTATTTTCTTCTCTTTATGGCCCGTTATTAACAGAGTCGTCATGGCCATCGTTTGGTATAGTGTCCAAGCTTATATTGCGGCAACTCCCGTATCATTAATGCTGAAATCTATCTTTGGAAAAGATTTACAA

Page 7:

[Same genome sequence as on page 6, annotated with two kinds of functional elements]

Genes: encode proteins

Regulatory motifs: control gene expression

Page 8:

The power of comparative genomics

[Figure: species trees for 32 mammals (human, mouse, rat, chimp, dog), 12 flies, 9 yeasts, 8 Candida]

• Comparative genomics reveals selection
– Functional elements mostly conserved
– Non-functional regions mostly diverged

Functional regions stand out

• Comparative genomics reveals function
– Each type of function under unique constraints (proteins, RNA, motifs each evolve differently)
– Discover them by their distinct evolutionary patterns

Evolutionary signatures for each type of element

Page 9:

Comparative genomics leads to…

1. Genome interpretation: the building blocks
– Decode the human genome
– Discover all functional elements

2. Cell circuitry: the interconnections
– Discover all control constructs
– Regulatory network properties

3. Evolutionary innovation: the dynamics
– Emergence of new functions
– Genome and network duplication

Page 10:

Distinguishing genes from non-coding regions

Dmel TGTTCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC

Dsec TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC

Dsim TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC

Dyak TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGCCTTCTACCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC

Dere TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-CTTAGCCATGCGGAGTGCCTCCTGCCATTGCCGTGCGGGCGAGCATGT---GGCTCCAGCATCTTT

Dana TGTCCATAAATAAA-----TCTACAACATTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGACCGTTCATG------CGGCCGTGA---GGCTCCATCATCTTA

Dpse TGTCCATAAATGAA-----TTTACAACATTTAGCTG-CTTAGCCAGGCGGAATGGCGCCGTCCGTTCCCGTGCATACGCCCGTGG---GGCTCCATCATTTTC

Dper TGTCCATAAATGAA-----TTTACAACATTTAGCTG-CTTAGCCAGGCGGAATGCCGCCGTCCGTTCCCGTGCATACGCCCGTGG---GGCTCCATTATTTTC

Dwil TGTTCATAAATGAA-----TTTACAACACTTAACTGAGTTAGCCAAGCCGAGTGCCGCCGGCCATTAGTATGCAAACGACCATGG---GGTTCCATTATCTTC

Dmoj TGATTATAAACGTAATGCTTTTATAACAATTAGCTG-GTTAGCCAAGCCGAGTGGCGCC------TGCCGTGCGTACGCCCCTGTCCCGGCTCCATCAGCTTT

Dvir TGTTTATAAAATTAATTCTTTTAAAACAATTAGCTG-GTTAGCCAGGCGGAATGGCGCC------GTCCGTGCGTGCGGCTCTGGCCCGGCTCCATCAGCTTC

Dgri TGTCTATAAAAATAATTCTTTTATGACACTTAACTG-ATTAGCCAGGCAGAGTGTCGCC------TGCCATGGGCACGACCCTGGCCGGGTTCCATCAGCTTT

***** * * ** *** *** *** ******* ** ** ** * * ** * ** ** ** ** **** * **

• Protein-coding genes have specific evolutionary constraints
– Gaps are multiples of three (preserve amino acid translation)
– Mutations are largely 3-periodic (silent codon substitutions)
– Specific triplets exchanged more frequently (conservative substitutions)
– Conservation boundaries are sharp (pinpoint individual splicing signals)

• Encode as ‘evolutionary signatures’
– Computational test for each of them
– Combine and score systematically

[Alignment annotations: splice, frame-shifting indels, periodic mutations, synonymous substitutions]
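The first two constraints listed above can each be turned into a small computational test. The sketch below is a minimal illustration and not the method used in the talk: the function names and the toy sequences are assumptions made for this example. It checks whether gaps come in multiples of three and tallies mismatches by codon position, assuming the reference sequence starts in frame.

```python
from typing import List


def gap_lengths(aligned: str) -> List[int]:
    """Return the length of every run of '-' in one aligned sequence."""
    runs, current = [], 0
    for ch in aligned:
        if ch == "-":
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def frame_preserving_fraction(aligned: str) -> float:
    """Fraction of gap runs whose length is a multiple of three."""
    runs = gap_lengths(aligned)
    if not runs:
        return 1.0  # no gaps at all: nothing breaks the reading frame
    return sum(1 for r in runs if r % 3 == 0) / len(runs)


def mismatch_counts_by_phase(ref: str, other: str) -> List[int]:
    """Count mismatches at codon positions 0, 1, 2 of the reference.

    Assumes the first ungapped reference base is codon position 0.
    Columns where either sequence has a gap are not counted as mismatches.
    """
    counts = [0, 0, 0]
    pos = 0  # index along the ungapped reference
    for a, b in zip(ref, other):
        if a == "-":
            continue
        if b != "-" and a != b:
            counts[pos % 3] += 1
        pos += 1
    return counts


if __name__ == "__main__":
    # Toy sequences (hypothetical, not taken from the alignment above).
    print(frame_preserving_fraction("ATG---GCCTGA"))                # 1.0
    print(frame_preserving_fraction("ATG-GCC--TGA"))                # 0.0
    print(mismatch_counts_by_phase("ATGGCCAAATGA", "ATGGCTAAGTGA"))  # [0, 0, 2]
```

Mismatches concentrated in the third codon position, together with frame-preserving gaps, are the kind of pattern a coding-region score would reward. A real scoring scheme would also weigh which triplets are exchanged.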

Page 11:

Power of evolutionary signatures

Signatures much more precise than level of conservation
– Before: parsing a genome into high-conservation / low-conservation
– Now: parse into protein-coding conservation / RNA-like / motif-like, etc.

Probabilistic framework

Hidden Markov Models (HMMs)
– Generative model; learn emission and transition probabilities
– Easy to train, hard to integrate long-range signals

Conditional Random Fields (CRFs)
– Discriminative dual of HMMs; learn weights on features
– Easy to integrate diverse signals; gradient ascent for training
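To make the HMM side of this framework concrete, here is a minimal sketch of Viterbi decoding for a two-state model that labels alignment columns as coding or intergenic. The states, the two observation symbols ("C" for a substitution pattern typical of coding regions, "I" for one typical of intergenic regions) and all probabilities are hypothetical values chosen for illustration, not parameters from the talk.

```python
import math
from typing import Dict, List, Sequence


def viterbi(
    obs: Sequence[str],
    states: Sequence[str],
    start_p: Dict[str, float],
    trans_p: Dict[str, Dict[str, float]],
    emit_p: Dict[str, Dict[str, float]],
) -> List[str]:
    """Return the most probable state path for obs (log-space Viterbi)."""
    if not obs:
        return []
    # Best log-probability of any path ending in each state at the first column.
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    backpointers = []
    for symbol in obs[1:]:
        prev = scores[-1]
        column, pointers = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: prev[p] + math.log(trans_p[p][s]))
            column[s] = (
                prev[best_prev]
                + math.log(trans_p[best_prev][s])
                + math.log(emit_p[s][symbol])
            )
            pointers[s] = best_prev
        scores.append(column)
        backpointers.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical parameters: states are sticky, emissions favor the matching symbol.
STATES = ("coding", "intergenic")
START = {"coding": 0.5, "intergenic": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "intergenic": 0.1},
    "intergenic": {"coding": 0.1, "intergenic": 0.9},
}
EMIT = {
    "coding": {"C": 0.8, "I": 0.2},
    "intergenic": {"C": 0.2, "I": 0.8},
}

if __name__ == "__main__":
    print(viterbi("IIICCCCCIII", STATES, START, TRANS, EMIT))
```

A CRF would replace the emission and transition probabilities with learned weights on arbitrary features of the alignment, which is what makes diverse signals easier to combine.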

Page 12:

Known genes stand out

[Figure legend: substitutions typical of protein-coding regions; substitutions typical of intergenic regions]

Page 13:

Ability to identify subtle events

[Figure: previously-annotated start codon (ATG) and newly-identified start codon (ATG)]

• Translation start corrected for 200 genes

[Figure: protein-coding conservation up to the stop codon; continued protein-coding conservation in the read-through region up to the 2nd stop codon; no more conservation after it]

• Hundreds of read-through regions identified
• New mechanism of post-transcriptional control. Many questions remain.
• Enriched in brain proteins, ion channels. Under ADAR control.

Page 14:

Summary: Revisiting fly genome annotation

• Towards a revised genome annotation
– Curation: FlyBase integrates prediction with cDNA, protein, literature
– Experimentation: BDGP large-scale functional validation of novel exons

• High-accuracy reannotation
– Ability to detect small genes & exons (sen | pre | spe at 40 AA: 95|99|99%; at 20 AA: 87|96|99%)
– Detect subtle events: sequencing errors, start/stop and splice site changes
– Recognize unusual gene structures: read-through, uORFs, RNA editing

[Figure: species shown: D. melanog., D. simulans, D. erecta, D. persimilis (…); gene counts shown: 454 genes, 800 genes, 668 genes, 12,000 genes; categories shown: Confirmed, Dubious, Novel, Refined]

Powerful approach for comprehensive genome annotation

Page 15:

Comparative genomics

1. Genome interpretation: the building blocks
– Decode the human genome
– Discover all functional elements

2. Cell circuitry: the interconnections
– Discover all control constructs
– Regulatory network properties

3. Evolutionary innovation: the dynamics
– Emergence of new functions
– Genome and network duplication

Page 16:

The regulatory code

• Multiple levels of regulation
– Temporal and spatial regulation, disease, development
– Chromatin, pre- / post-transcriptional, splicing, translational

• Combinatorial coding of individual motifs
– The core: a relatively small number of regulatory motifs
– Regions: diverse motif combinations specify diverse functions

• Regulatory motifs
– Summarize information across thousands of sites
  • Distinguish: regulatory motifs vs. motif instances
– Challenging to discover
  • Small (6-8 nucleotides), subtle (frequent degenerate positions), dispersed (act at a distance), diverse (sequence composition)

[Figure labels: enhancer regions, promoter motifs, 5’-UTR, 3’-UTR, splicing signals, motifs at RNA level]

Page 17:

Regulatory motif discovery

Study known motifs

Derive conservation rules

Discover novel motifs

Page 18:

Known motifs are preferentially conserved

• In multi-species alignments, known motifs appear as conservation islands
– Conserved biology: conserved regulatory code, same words are functional
– Preferential conservation: stand out from surrounding nucleotides
– Good signal for identifying individual instances of known motifs

• Need additional power for motif discovery
– Conservation is not limited to the exact binding site, so additional bases would be found
– Weakly constrained positions can diverge, so real motifs would be missed
– How do we discover motifs de novo? Use a basic property of regulatory motifs: evaluate genome-wide conservation over thousands of instances

[Figure: multi-species alignment around a conserved motif instance; track labels: Gabpa, Errα; species: Human, Dog, Mouse, Rat]

Errα
human CGGGTAGGCCTGGCCGAAAATCTCTCCCGCGCGCCTGACCTTGGGTTGCCCCAGCCAGGC
dog   CAGGC---CCGGGCTGCAGACCTGCCCTGAGGGAATGACCTTGGGCGGCCGCAGCGGGGC
mouse --------------CACAAGCCTGTGGCGCGC-CGTGACCTTGGGCTGCCCCAGGCGGGC
rat   --------------CACAAGTTTCTC---TGC-CCTGACCTTGGGTTGCCCCAGGCGAG-
      * * * ********** *** *** *
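The idea of evaluating conservation over many instances can be sketched on the four-species alignment from this slide. The code below is an illustrative assumption, not the talk's scoring method: it finds each occurrence of a motif in the reference species and counts how many are identical in every aligned species. The 8-mer TGACCTTG is chosen here only because it lies inside the fully conserved block of the alignment. Searching the gapped reference string directly is a simplification, so a motif interrupted by gap columns would be missed.

```python
from typing import Dict, Tuple

# Alignment rows copied from the slide (gaps shown as '-').
ALIGNMENT: Dict[str, str] = {
    "human": "CGGGTAGGCCTGGCCGAAAATCTCTCCCGCGCGCCTGACCTTGGGTTGCCCCAGCCAGGC",
    "dog":   "CAGGC---CCGGGCTGCAGACCTGCCCTGAGGGAATGACCTTGGGCGGCCGCAGCGGGGC",
    "mouse": "--------------CACAAGCCTGTGGCGCGC-CGTGACCTTGGGCTGCCCCAGGCGGGC",
    "rat":   "--------------CACAAGTTTCTC---TGC-CCTGACCTTGGGTTGCCCCAGGCGAG-",
}


def motif_conservation(
    alignment: Dict[str, str], reference: str, motif: str
) -> Tuple[int, int]:
    """Return (instances in reference, instances conserved in all species)."""
    ref = alignment[reference]
    total = conserved = 0
    start = ref.find(motif)
    while start != -1:
        end = start + len(motif)
        total += 1
        # Conserved means every species shows the same bases in the same columns.
        if all(seq[start:end] == motif for seq in alignment.values()):
            conserved += 1
        start = ref.find(motif, start + 1)  # allow overlapping instances
    return total, conserved


if __name__ == "__main__":
    total, conserved = motif_conservation(ALIGNMENT, "human", "TGACCTTG")
    print(f"TGACCTTG: {conserved} of {total} human instances conserved")
```

Run genome-wide, the same two counts per candidate motif give a conservation rate that can be compared against the rate expected for random sequences of similar composition.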

Page 19:

Systematically discover regulatory motifs

Consensus MCS Matches to known Expression enrichment Promoters Enhancers
1 CTAATTAAA 65.6 engrailed (en) 25.4 2
2 TTKCAATTAA 57.3 reversed-polarity (repo) 5.8 4.2
3 WATTRATTK 54.9 araucan (ara) 11.7 2.6
4 AAATTTATGCK 54.4 paired (prd) 4.5 16.5
5 GCAATAAA 51 ventral veins lacking (vvl) 13.2 0.3
6 DTAATTTRYNR 46.7 Ultrabithorax (Ubx) 16 3.3
7 TGATTAAT 45.7 apterous (ap) 7.1 1.7
8 YMATTAAAA 43.1 abdominal A (abd-A) 7 2.2
9 AAACNNGTT 41.2 20.1 4.3
10 RATTKAATT 40 3.9 0.7
11 GCACGTGT 39.5 fushi tarazu (ftz) 17.9
12 AACASCTG 38.8 broad-Z3 (br-Z3) 10.7
13 AATTRMATTA 38.2 19.5 1.2
14 TATGCWAAT 37.8 5.8 2
15 TAATTATG 37.5 Antennapedia (Antp) 14.1 5.4
16 CATNAATCA 36.9 1.8 1.7
17 TTACATAA 36.9 5.4
18 RTAAATCAA 36.3 3.2 2.8
19 AATKNMATTT 36 3.6 0
20 ATGTCAAHT 35.6 2.4 4.6
21 ATAAAYAAA 35.5 57.2 -0.5
22 YYAATCAAA 33.9 5.3 0.6
23 WTTTTATG 33.8 Abdominal B (Abd-B) 6.3 6
24 TTTYMATTA 33.6 extradenticle (exd) 6.7 1.7
25 TGTMAATA 33.2 8.9 1.6
26 TAAYGAG 33.1 4.7 2.7
27 AAAKTGA 32.9 7.6 0.3
28 AAANNAAA 32.9 449.7 0.8
29 RTAAWTTAT 32.9 gooseberry-neuro (gsb-n) 11 0.8
30 TTATTTAYR 32.9 Deformed (Dfd) 30.7
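The consensus sequences in this table use IUPAC degenerate nucleotide codes (for example K = G or T, W = A or T, N = any base). As a minimal sketch of how such a consensus can be matched against sequence, the code below translates it into a regular expression and reports all match positions, including overlapping ones. The helper names and the test sequence are hypothetical; only the IUPAC code table is standard.

```python
import re
from typing import List

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def consensus_to_regex(consensus: str) -> str:
    """Translate an IUPAC consensus such as 'TTKCAATTAA' into a regex string."""
    return "".join(IUPAC[base] for base in consensus.upper())


def find_instances(consensus: str, sequence: str) -> List[int]:
    """Return 0-based start positions of all (possibly overlapping) matches."""
    # A lookahead lets re.finditer report overlapping instances.
    pattern = re.compile("(?=" + consensus_to_regex(consensus) + ")")
    return [m.start() for m in pattern.finditer(sequence.upper())]


if __name__ == "__main__":
    # Hypothetical sequence containing two variants of motif 2 (TTKCAATTAA).
    seq = "GGTTGCAATTAACCTTTCAATTAAGG"
    print(find_instances("TTKCAATTAA", seq))  # [2, 14]
```

This scans one strand only; a genome-wide scan would normally also search the reverse complement.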

Page 20:

Functional clustering of motifs and tissues

Page 21:

Motif discovery in human enhancer regions

[Figure panels: chromatin signatures of enhancer regions (74 enhancers, 208 promoters; marks: H3K4me3, RNAPII, p300, H3K4me1); motif signatures of enhancer regions]

• Can identify 40% of enhancers with 50 motifs
– 3X enrichment (vs. 15% of intergenic regions)

• Motif combinations further improve performance
– 5X enrichment for top 30 motif combinations
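The 3X figure on this slide appears to be the ratio of the two percentages it quotes. A minimal sketch, assuming enrichment here means the fraction of enhancers identified divided by the fraction of intergenic regions identified by the same motifs (the function name is hypothetical):

```python
def fold_enrichment(foreground_fraction: float, background_fraction: float) -> float:
    """Ratio of the hit rate in the regions of interest to the background hit rate."""
    if background_fraction <= 0:
        raise ValueError("background fraction must be positive")
    return foreground_fraction / background_fraction


if __name__ == "__main__":
    # 40% of enhancers vs. 15% of intergenic regions, as quoted on the slide.
    print(round(fold_enrichment(0.40, 0.15), 2))  # 2.67, i.e. roughly 3X
```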

Page 22:

Evolutionary signatures for microRNA genes

• Genome-wide discovery of miRNAs
– 41 novel miRNA genes. Rediscover 81% of known (61 of 74). Reject 4 dubious.
– 454 sequencing of small RNAs confirms 27 of 41 novel miRNAs (66%).

• Genomic properties
– Introns of known genes, including several transcription factors
– Genomic clustering of known and novel miRNAs: poly-cistronic precursors
– Two ‘dubious’ protein-coding genes are in fact miRNAs

Improved annotation of miRNA genes

Page 23:

Functional properties of microRNA targets

• Refine annotation of known miRNA genes
– Start adjustments suggested by the evolutionary signatures, confirmed by sequencing
– Small change in start (+2 nucleotides) implies great change in target spectrum (>95%)

• miRNA targets
– Novel miRNAs include many novel families: distinct groupings of genes
– Targets of novel miRNAs show large overlap with targets of known miRNAs: denser miRNA network

• miR-10* as a master Hox regulator
– For three genes, both miRNA+ and miRNA* seem functional by evolution and sequencing
– For miR-10, the star shows stronger signal, more sequencing reads, more predicted targets
– Both miR-10+ and miR-10* target several Hox genes, more than any other miRNA
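A short sketch shows why a 2-nucleotide change in the annotated start can change the target spectrum so much. Target recognition is commonly modeled through the seed, nucleotides 2-8 of the mature miRNA, so moving the start moves the seed. The sequence below is hypothetical and the helper names are assumptions made for this example.

```python
# Complement table for RNA bases.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def seed(mature: str, start_shift: int = 0) -> str:
    """Return the 7-nt seed (positions 2-8, 1-based) after shifting the start."""
    first = start_shift + 1  # 0-based index of nucleotide 2
    return mature[first:first + 7]


def target_site(seed_seq: str) -> str:
    """Return the mRNA site complementary to the seed (reverse complement)."""
    return seed_seq.translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    # Hypothetical sequence around a mature miRNA start, for illustration only.
    mirna = "AUGGCAUUCGAUCCGUAAGCUA"
    for shift in (0, 2):
        s = seed(mirna, shift)
        print(f"start +{shift}: seed {s}, target site {target_site(s)}")
```

With the start at +0 the seed is UGGCAUU; at +2 it becomes GCAUUCG. The two seeds pair with different 7-mers, so their predicted target sets would be expected to overlap very little.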

Page 24:

Comparative genomics

1. Genome interpretation: the building blocks
– Decode the human genome
– Discover all functional elements

2. Cell circuitry: the interconnections
– Discover all control constructs
– Regulatory network properties

3. Evolutionary innovation: the dynamics
– Emergence of new functions
– Genome and network duplication

Page 25:

Resolving power in mammals, flies, fungi

10 mammals
• Neutral: 2.57 subs/site (opp: 0.62; 32sps: 4.87)
• Coding: 1.16 subs/site
• Detect: 6-mer at FP 10^-6

12 flies
• Neutral: 4.13 subs/site
• Coding: 1.65 subs/site
• Detect: 6-mer at 10^-11

17 yeasts (9 Yeasts, 8 Candida)
• Neutral: 15.5 subs/site (Yeast: 6.5; Candida: 6.5)
• Coding: 7.91 subs/site
• Detect: 3-mer at 10^-21

[Figure: species trees with scale bars of 0.3, 0.1 and 0.8 sub/site; yeast tree labels: Post-duplication, Pre-dup, Diploid, Haploid; repeated "P" markers]
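The slide does not say which model produced its detection thresholds. As a rough sketch under a stated assumption, suppose substitutions at a neutral site follow a Poisson process, so that a site shows no substitution across a tree of total branch length T with probability e^(-T), and a k-mer is perfectly conserved by chance with probability e^(-kT). The function name is hypothetical.

```python
import math


def log10_chance_conservation(neutral_branch_length: float, k: int) -> float:
    """log10 of the probability that k neutral sites are all unchanged.

    Assumes independent sites and Poisson-distributed substitutions, so the
    probability is exp(-k * T) for total branch length T (subs/site).
    """
    return -k * neutral_branch_length / math.log(10)


if __name__ == "__main__":
    # Neutral branch lengths and k-mer sizes as quoted on the slide.
    for label, branch_length, k in [
        ("mammals", 2.57, 6),
        ("flies", 4.13, 6),
        ("fungi", 15.5, 3),
    ]:
        exponent = log10_chance_conservation(branch_length, k)
        print(f"{label}: {k}-mer conserved by chance with p ~ 10^{exponent:.1f}")
```

Under this assumption the exponents come out near -6.7, -10.8 and -20.2. These are of the same order as the slide's 10^-6, 10^-11 and 10^-21 but not identical, so the talk presumably used a somewhat different model.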