CS273a, Spring 2007, Lecture 11: Whole-genome motif discovery

Upload: rosamund-watts

Post on 28-Dec-2015

Page 1: CS273a, Spring 2007, Lecture 11 Whole-genome motif discovery

CS273a, Spring 2007, Lecture 11

Whole-genome motif discovery

Page 2: CS273a, Spring 2007, Lecture 11 Whole-genome motif discovery

Challenges in Computational Biology

[Diagram labels; the numbers are the topic numbers shown on the slide]

DNA: TCATGCTATTCGTGATAATGAGGATATTTATCATATTTATGATTT
RNA transcript

• 4 Genome Assembly
• Gene Finding
• Regulatory motif discovery
• Database lookup
• 9 Gene expression analysis
• Sequence alignment
• 7 Evolutionary Theory
• 10 Cluster discovery
• Gibbs sampling
• 12 Protein network analysis
• 14 Emerging network properties
• 13 Regulatory network inference
• Comparative Genomics
• RNA folding

Page 3: CS273a, Spring 2007, Lecture 11 Whole-genome motif discovery

TTATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATACATATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTCAGTAATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTCCGTGCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACTAGCTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATGATAATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAAAAGCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAATTGTTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAATTCTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGGATTTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGATTTTGATATGCTTTGCGCCGTCAAAGTTTTGAACGATGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAATCTTTAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATGAACGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATCATATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAAAAGAAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCAGCATTGGGCAGCTGTCTATATGAATTAGTCAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAACTTTAGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAG...TTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTTTCCTACGCA
TAATAAGAATAGGAGGGAATATCAAGCCAGACAATCTATCATTACATTTAAGCGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAAAGAGTATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATACAGCTCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTACAACCAGGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATATCAACACTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCGTTGGTGCTTTTGTTGTTTTGGCCTCTAGAGTTGGATCTGCTTATCATTTGTCATTCCCTATATCATCTAGAGCATCATTCGGTATTTTCTTCTCTTTATGGCCCGTTATTAACAGAGTCGTCATGGCCATCGTTTGGTATAGTGTCCAAGCTTATATTGCGGCAACTCCCGTATCATTAATGCTGAAATCTATCTTTGGAAAAGATTTACAATGATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAAT

Page 4: CS273a, Spring 2007, Lecture 11 Whole-genome motif discovery

[Same genomic sequence as on Page 3, annotated with the following regions:]

Promoter motifs

3’ UTR motifs

Exons

Introns

Page 5: CS273a, Spring 2007, Lecture 11 Whole-genome motif discovery

Comparing genomes reveals functional elements

• Ultra-conserved elements

• Protein-coding genes

• Short regulatory motifs

Page 6: CS273a, Spring 2007, Lecture 11 Whole-genome motif discovery

Regulatory Motif Discovery

[Figure: the GAL1 promoter, with Gal4 sites (CGG...CCG) and a Mig1 site (CCCCW) upstream of the coding sequence ATGACTAAATCTCATTCAGAAGAAGTGA]

• Gene regulation
– Genes are turned on / off in response to changing environments
– Gene regulatory logic is controlled by sequence motifs
– Specialized proteins (transcription factors) recognize motifs

• What makes motif discovery hard?
– Motifs are short (6-8 bp) and usually degenerate
– Act at variable distances upstream (or downstream) of target gene
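A short, degenerate, gapped pattern such as the Gal4 site can be located with a plain scan. The sketch below is illustrative only: the helper names `motif_to_regex` and `find_sites`, the `-11-` gap notation parser, and the toy sequences are assumptions, not part of the lecture.

```python
import re

# IUPAC nucleotide codes used on the slides, written as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "M": "[AC]", "K": "[GT]",
    "S": "[CG]", "W": "[AT]", "N": "[ACGT]",
}

def motif_to_regex(motif):
    """Compile a motif such as 'CGG-11-CCG' or 'CCCCW' into a regex.

    '-11-' is read as a gap of 11 arbitrary bases (assumed notation).
    A lookahead is used so that overlapping sites are all reported.
    """
    parts = []
    for token in re.findall(r"[A-Za-z]|-\d+-", motif):
        if token.startswith("-"):
            parts.append("[ACGT]{%d}" % int(token.strip("-")))
        else:
            parts.append(IUPAC[token.upper()])
    return re.compile("(?=(" + "".join(parts) + "))")

def find_sites(motif, sequence):
    """Return the 0-based start of every match of motif in sequence."""
    return [m.start() for m in motif_to_regex(motif).finditer(sequence)]

# Toy sequence: one Gal4-like site starting at index 2.
toy = "TT" + "CGG" + "A" * 11 + "CCG" + "TT"
print(find_sites("CGG-11-CCG", toy))
print(find_sites("CCCCW", "GGCCCCAGG"))
```

Because the pattern is so short and permissive, a scan like this returns many chance matches in a real genome, which is the difficulty the slide points to.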

Page 7: CS273a, Spring 2007, Lecture 11 Whole-genome motif discovery

Regulatory Motif Discovery

Study known motifs

Derive conservation rules

Discover novel motifs

Page 8: CS273a, Spring 2007, Lecture 11 Whole-genome motif discovery

Known motifs are preferentially conserved

Is this enough to discover motifs? No.

Page 9: CS273a, Spring 2007, Lecture 11 Whole-genome motif discovery

Known motifs are preferentially conserved

human CTCTTAATGGTACACGTTCTGCCT----AAGTAGCCTAGACGCTCCCGTGCGCCC-GGGG
dog   CTCTTA-CGGGGCACATTCTGCTTTCAACAGTGGGGCAGACGGTCCCGCGCGCCCCAAGG
mouse GTCTTAGGAGGCT-CGATCGCC---------------------GCCTGCATTATT-----
rat   GTCTTAGTTGGCCACGACCTGC---------------------TCATGCATAATT-----
       *****   *    *   *  *                      *  *

human CGGGTAGGCCTGGCCGAAAATCTCTCCCGCGCGCCTGACCTTGGGTTGCCCCAGCCAGGC
dog   CAGGC---CCGGGCTGCAGACCTGCCCTGAGGGAATGACCTTGGGCGGCCGCAGCGGGGC
mouse --------------CACAAGCCTGTGGCGCGC-CGTGACCTTGGGCTGCCCCAGGCGGGC
rat   --------------CACAAGTTTCTC---TGC-CCTGACCTTGGGTTGCCCCAGGCGAG-
                       *    *       *    **********  *** ***    *

human TGCGGGCCCGAGACCCCCG-------------------GGCCTCCCTGCCCCCCGCGCCG
dog   CGCGGGCCCAGGCCCCCCTCCCTCCCTCCCTCCCTCCCTCCCTCCCTGCCCCCCGGACCG
mouse TGCAGGCTCACCACCCCGTCTTTTCT---------------------GCTTTTCGAGTCG
rat   -GCATACACCCCGCCTTTTTTTTTTTTTT---------TTTTTTTTTGCCGTTCAAG-AG
       **   * *    **                                **    *     *

Highlighted motif instances: Gabpa, Err

Is this enough to discover motifs? No.
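The `*` row under an alignment marks columns that are identical in all four species. A minimal sketch of that bookkeeping follows, using the second alignment block from the slide; the function name `conserved_columns` is an assumption, and the leading gaps of the mouse and rat rows are written as `"-" * 14` for readability.

```python
def conserved_columns(rows):
    """Return a string with '*' under columns identical in every row.

    A column containing a gap ('-') is never counted as conserved.
    """
    marks = []
    for column in zip(*rows):
        same = len(set(column)) == 1 and column[0] != "-"
        marks.append("*" if same else " ")
    return "".join(marks)

# Second alignment block from the slide: human, dog, mouse, rat.
block = [
    "CGGGTAGGCCTGGCCGAAAATCTCTCCCGCGCGCCTGACCTTGGGTTGCCCCAGCCAGGC",
    "CAGGC---CCGGGCTGCAGACCTGCCCTGAGGGAATGACCTTGGGCGGCCGCAGCGGGGC",
    "-" * 14 + "CACAAGCCTGTGGCGCGC-CGTGACCTTGGGCTGCCCCAGGCGGGC",
    "-" * 14 + "CACAAGTTTCTC---TGC-CCTGACCTTGGGTTGCCCCAGGCGAG-",
]

marks = conserved_columns(block)
site = block[0].find("TGACCTTG")  # the ERR-like motif listed later in the lecture
print(site, repr(marks[site:site + 8]))
```

A motif instance can then be called conserved when every column it covers carries a `*`; whether the lecture used exactly this all-or-nothing rule is not stated on the slide.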

Page 10: CS273a, Spring 2007, Lecture 11 Whole-genome motif discovery

Known motifs are frequently conserved

• Across the human promoter regions, the Err motif:
– appears 434 times
– is conserved 162 times

[Figure: Err motif instances aligned across Human, Dog, Mouse, Rat]

Conservation rate: 37%

• Compare to random control motifs
– Conservation rate of control motifs: 6.8%
– Err enrichment: 5.4-fold
– Err p-value < 10^-50 (25 standard deviations under binomial)

Motif Conservation Score (MCS)
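The Err numbers above can be rechecked with a few lines of arithmetic. This is a sketch under one assumption: the score is read as a z-score under a binomial model with the control rate as the success probability, which is how the "25 standard deviations under binomial" annotation is interpreted here. The function name `conservation_z` is hypothetical.

```python
import math

def conservation_z(conserved, total, control_rate):
    """Standard deviations by which the conserved count exceeds its expectation
    under Binomial(total, control_rate)."""
    expected = total * control_rate
    sd = math.sqrt(total * control_rate * (1 - control_rate))
    return (conserved - expected) / sd

# Values quoted on the slide for the Err motif in human promoters.
err_total, err_conserved, control_rate = 434, 162, 0.068

rate = err_conserved / err_total
enrichment = rate / control_rate
z = conservation_z(err_conserved, err_total, control_rate)

print(f"conservation rate: {rate:.1%}")   # about 37%
print(f"enrichment: {enrichment:.2f}-fold")
print(f"z-score: {z:.1f}")                # about 25 standard deviations
```

With the unrounded rate the enrichment comes out near 5.5-fold; dividing the rounded 37% by 6.8% gives the slide's 5.4-fold.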

Page 11: CS273a, Spring 2007, Lecture 11 Whole-genome motif discovery

MCS distribution of all 6-mers shows excess conservation

– High scoring patterns include known motifs
– Excess specific to promoters and 3’-UTRs (not introns)
– For MCS > 6, estimate 97% specificity

[Figure: two panels plotting motif density against Motif Conservation Score (MCS)]

Select motifs with MCS > 6.0, cluster

Page 12: CS273a, Spring 2007, Lecture 11 Whole-genome motif discovery

Hill-climbing in sequence space

• Seed selection
– Three mini-motif conservation criteria (CC1, CC2, CC3)

• Motif extension
– Non-random conservation of neighbors

• Motif collapsing
– Merge neighbors using hierarchical clustering, avg-max-linkage

• Re-scoring complex motifs
– Motif conservation score for full motifs (MCS)

Page 13: CS273a, Spring 2007, Lecture 11 Whole-genome motif discovery

Test 1: Intergenic conservation

[Figure: conserved count vs. total count for mini-motifs; labeled point: CGG-11-CCG]

Page 14: CS273a, Spring 2007, Lecture 11 Whole-genome motif discovery

Test 1: Selecting mini-motifs

• Estimate basal rate of conservation
– Expected conservation rate at the evolutionary distances observed
– Average conservation rate of non-outlier mini-motifs

• Score conservation of mini-motif
– k: conserved motif occurrences
– n: total motif occurrences
– r: basal conservation rate
– Evaluate binomial probability of observing k successes out of n trials

$$P(k) = \binom{n}{k}\, p^{k} (1-p)^{n-k}, \qquad p = r$$

• Assign z-score to each mini-motif
– Bulk of distribution is symmetric
– Estimate specificity as (R-L)/R
– Select cutoff: 5.0 sigma
– 1190 mini-motifs, 97.5% non-random

[Figure labels: Conservation rate r, N; Binomial score, Right tail, Left tail, Specificity, Cutoff]
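The two scoring steps on this slide can be sketched as follows. Everything here is an assumption layered on the slide's definitions: `binom_upper_tail` sums the binomial probability from k up to n, and `tail_specificity` reads R and L as the numbers of scores beyond the cutoff in the right and left tails, so that the left tail estimates how many right-tail hits arise by chance.

```python
import math

def log_binom_pmf(k, n, r):
    """log P(X = k) for X ~ Binomial(n, r), with 0 < r < 1."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(r) + (n - k) * math.log(1 - r))

def binom_upper_tail(k, n, r):
    """P(X >= k): chance of seeing at least k conserved occurrences out of n."""
    return sum(math.exp(log_binom_pmf(i, n, r)) for i in range(k, n + 1))

def tail_specificity(z_scores, cutoff=5.0):
    """Estimate specificity as (R - L) / R at a symmetric z-score cutoff.

    R = scores at or above +cutoff, L = scores at or below -cutoff.
    Returns None when nothing passes the cutoff.
    """
    right = sum(1 for z in z_scores if z >= cutoff)
    left = sum(1 for z in z_scores if z <= -cutoff)
    if right == 0:
        return None
    return (right - left) / right

# Hypothetical z-scores for seven mini-motifs.
scores = [6.1, 5.5, 7.0, -5.2, 0.3, 1.0, 5.9]
print(tail_specificity(scores, cutoff=5.0))
print(binom_upper_tail(10, 10, 0.5))
```

Working in log space keeps the binomial terms finite for the large n that genome-wide counts produce.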

Page 15: CS273a, Spring 2007, Lecture 11 Whole-genome motif discovery

Test 2: Intergenic vs. Coding

[Figure: intergenic conservation vs. coding conservation; labeled point: CGG-11-CCG; labeled region: Higher Conservation in Genes]

Page 16: CS273a, Spring 2007, Lecture 11 Whole-genome motif discovery

Test 3: Upstream vs. Downstream

[Figure: upstream conservation vs. downstream conservation; labeled point: CGG-11-CCG; labeled regions: Most Patterns, Downstream motifs?]

Page 17: CS273a, Spring 2007, Lecture 11 Whole-genome motif discovery

Constructing full motifs

[Diagram: mini-motifs passing Test 1, Test 2, and Test 3 are each extended and collapsed, then merged into full motifs; example motif logos omitted]

• 2,000 Mini-motifs
• Extend
• Collapse
• Merge
• 72 Full motifs

Page 18: CS273a, Spring 2007, Lecture 11 Whole-genome motif discovery

Extending mini-motifs

• Separate conserved and non-conserved instances

[Figure: instances of a gapped mini-motif split into a causal set and a random set, with the flanking bases of each set shown; counts labeled N1, N2, M1, M2]

• Find maximally discriminating neighborhood

• Evaluate non-randomness of neighborhood
– chi-square contingency test on [N1,M1], [N2,M2]
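A chi-square test on a 2x2 table needs only the four counts. The sketch below assumes, since the slide does not spell out the layout, that N1 and M1 count one set's instances with and without a candidate neighboring base, and N2 and M2 count the same for the other set. The function name `chi_square_2x2` is hypothetical and no continuity correction is applied.

```python
import math

def chi_square_2x2(n1, m1, n2, m2):
    """Chi-square statistic and p-value (1 degree of freedom) for [[n1, m1], [n2, m2]]."""
    total = n1 + m1 + n2 + m2
    row1, row2 = n1 + m1, n2 + m2
    col1, col2 = n1 + n2, m1 + m2
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one count")
    stat = total * (n1 * m2 - m1 * n2) ** 2 / (row1 * row2 * col1 * col2)
    # For one degree of freedom the survival function reduces to erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts: a flanking base seen in 45 of 60 instances of one set
# and in 50 of 140 instances of the other.
stat, p = chi_square_2x2(45, 15, 50, 90)
print(round(stat, 2), p)
```

A small p-value suggests the neighboring position is distributed differently in the two sets, which is the signal used to extend the mini-motif.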

Page 19: CS273a, Spring 2007, Lecture 11 Whole-genome motif discovery

Systematically test candidate patterns

[Diagram: All potential motifs, Evaluate MCS, Cluster similar motifs; example pattern with degenerate positions (R, Y, S, W) and a central gap]

• Enumerate
– Length between 6 and 15 nt, allow central gap
– 11 letter alphabet (A C G T, 2-fold codes, N)

• Score
– Compute binomial score (conserved vs. total)
– Select MCS > 6.0 (specificity 97%)

• Cluster
– Sequence similarity
– Overlapping occurrences

Result: 174 motifs in promoters, 106 motifs in 3’ UTRs
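The enumeration step can be sketched as below. The 11 letters are the four bases, the six two-fold IUPAC codes, and N, as listed on the slide; representing the central gap as a run of N's, and the function names, are assumptions made for illustration.

```python
from itertools import product

BASES = "ACGT"
TWO_FOLD = "RYMKSW"                 # each code stands for two bases
ALPHABET = BASES + TWO_FOLD + "N"   # 11 letters in total

def ungapped_patterns(length):
    """Yield every pattern of the given length over the 11-letter alphabet."""
    for letters in product(ALPHABET, repeat=length):
        yield "".join(letters)

def gapped_patterns(half_length, gap):
    """Yield patterns made of two halves separated by `gap` unconstrained positions."""
    for left in ungapped_patterns(half_length):
        for right in ungapped_patterns(half_length):
            yield left + "N" * gap + right

def search_space_size(min_len=6, max_len=15):
    """Number of ungapped patterns with lengths in [min_len, max_len]."""
    return sum(len(ALPHABET) ** n for n in range(min_len, max_len + 1))

print(len(ALPHABET))
print(search_space_size(6, 6))
print(next(gapped_patterns(3, 11)))
```

The count grows by a factor of 11 per added position, so lazy generators are used here rather than building lists.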

Are these real?

Page 20: CS273a, Spring 2007, Lecture 11 Whole-genome motif discovery

Functions of discovered motifs

Page 21: CS273a, Spring 2007, Lecture 11 Whole-genome motif discovery

Evidence of motif function

• Promoter motifs:
(1) Comparison to known motifs
(2) Distance from TSS
(3) Expression enrichment

[Diagram: gene structure with Promoter (174 motifs), ATG, Stop, 3’-UTR (106 motifs)]

Page 22: CS273a, Spring 2007, Lecture 11 Whole-genome motif discovery

(1) Promoter motifs match known TF binding sites

• Compare discovered motifs to TRANSFAC database of 125 known motifs

MCS   Discovered motif  Factor   Known motif
46.8  GGGCGGR           SP-1     GGGCGGG
34.7  GCCATnTTg         YY1      GCCATnTT
32.7  CACGTG            MYC      SCACGTG
31.2  GATTGGY           NF-Y     YSATTGGYY
30.8  TGAnTCA           AP-1     CTGASTCA
29.7  GGGAGGRR          MAZ      GGGGAGGG
29.5  TGACGTMR          CREB     TGACGTMA
26.0  CGGCCATYK         NF-MUE1  CGGCCATCT
25.0  TGACCTTG          ERR?     TGACCTTG
22.6  CCGGAARY          ELK-1    CCGGAART
19.8  SCGGAAGY          GABP     VCCGGAAG
17.9  CATTTCCK          STAT1    CAnTTCCS
14.9  TTGTTT            SRY      KTWGTTT
14.6  TATAAA            TBP      TATAAATW
14.2  RTAAACA           FOXO1    RWAAACAA
13.9  SMGGAAGT          PEA3     MGGAWGT
12.6  YYATTGTT          SOX-5    ATTGTT
12.5  TCACGTG           SREBP-1  ATCACGTGAY
12.4  YATGYAAAT         OCTAMER  ATGCAAATnA
12.2  GGGnnTTTCC        P65      GGGRATTTCC
11.9  TGACGTGK          ATF6     TGACGTGG
11.7  TTAYRTAA          E4BP4    RTTACRTAAY
11.0  CCAWWnAAGG        SRF      CCAWATAWGGM
10.7  TAAWWATAG         MEF-2    YTAAAWATAGCY

• 55% of TRANSFAC motifs match discovered motifs
• 45% of discovered motifs match TRANSFAC motifs
(only 2% of control sequences match TRANSFAC motifs)
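The slide does not say how a discovered motif was judged to match a known one. As a stand-in, the sketch below slides one IUPAC string along the other and accepts the longest ungapped overlap in which every aligned pair of codes shares at least one base. The function names and the `min_overlap` threshold are assumptions, not the lecture's criterion.

```python
# IUPAC codes mapped to the bases they allow.
IUPAC_SETS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def compatible(a, b):
    """True when two IUPAC codes share at least one base."""
    return bool(set(IUPAC_SETS[a.upper()]) & set(IUPAC_SETS[b.upper()]))

def best_overlap(discovered, known, min_overlap=5):
    """Return (offset, length) of the longest fully compatible ungapped overlap.

    offset is the index in `known` aligned with discovered[0] (may be negative).
    Returns None when no overlap of at least min_overlap positions is compatible.
    """
    best = None
    first = -(len(discovered) - min_overlap)
    last = len(known) - min_overlap
    for offset in range(first, last + 1):
        start_k, start_d = max(0, offset), max(0, -offset)
        length = min(len(known) - start_k, len(discovered) - start_d)
        if length < min_overlap:
            continue
        pairs = zip(discovered[start_d:start_d + length], known[start_k:start_k + length])
        if all(compatible(a, b) for a, b in pairs):
            if best is None or length > best[1]:
                best = (offset, length)
    return best

print(best_overlap("CACGTG", "SCACGTG"))   # MYC row of the table
print(best_overlap("GGGCGGR", "GGGCGGG"))  # SP-1 row of the table
print(best_overlap("TATAAA", "GGGCGGG"))   # unrelated pair
```

A permissive rule like this over-calls matches between short or highly degenerate motifs, so a real comparison would also need a control such as the one the slide reports for random sequences.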

Page 23: CS273a, Spring 2007, Lecture 11 Whole-genome motif discovery

(2) Promoter motifs show preferred distance to TSS

32% of discovered motifs show strong positional bias

[Figure: two panels, "Conserved motif sites in all four species" and "Motif instances in human"; rows: each of 174 discovered motifs; x-axis: distance from TSS; labels: Motif 8, Motif 4, -81, -63]

• Discovered motifs occur preferentially within 200 bp of Transcription Start Site
• Individual motifs show strong peaks regardless of conservation

Page 24: CS273a, Spring 2007, Lecture 11 Whole-genome motif discovery

(3) Promoter motifs enriched in specific tissues

70% of motifs show significant enrichment in at least one tissue

[Figure: tissue enrichment shown for two groups of motifs, labeled "New motifs" and "Known TFs"]

Page 25: CS273a, Spring 2007, Lecture 11 Whole-genome motif discovery

Summary for promoter motifs

Rank  Discovered Motif  Known TF motif  Tissue Enrichment  Distance bias

1 RCGCAnGCGY NRF-1 Yes Yes

2 CACGTG MYC Yes Yes

3 SCGGAAGY ELK-1 Yes Yes

4 ACTAYRnnnCCCR Yes Yes

5 GATTGGY NF-Y Yes Yes

6 GGGCGGR SP1 Yes Yes

7 TGAnTCA AP-1 Yes

8 TMTCGCGAnR Yes Yes

9 TGAYRTCA ATF3 Yes Yes

10 GCCATnTTG YY1 Yes

11 MGGAAGTG GABP Yes Yes

12 CAGGTG E12 Yes

13 CTTTGT LEF1 Yes

14 TGACGTCA ATF3 Yes Yes

15 CAGCTG AP-4 Yes

16 RYTTCCTG C-ETS-2 Yes Yes

17 AACTTT IRF1(*) Yes

18 TCAnnTGAY SREBP-1 Yes Yes

19 GKCGCn(7)TGAYG Yes Yes

20 GTGACGY E4F1 Yes Yes

21 GGAAnCGGAAnY Yes Yes

22 TGCGCAnK Yes Yes

23 TAATTA CHX10 Yes

24 GGGAGGRR MAZ Yes

25 TGACCTY ERRA Yes

• 174 promoter motifs
– 70 match known TF motifs
– 115 expression enrichment
– 60 show positional bias
– 75% have evidence

• Control sequences
– < 2% match known TF motifs
– < 5% expression enrichment
– < 3% show positional bias
– < 7% false positives

Most discovered motifs are likely to be functional

New motifs (no known TF motif in the table): ranks 4, 8, 19, 21, 22

Page 26: CS273a, Spring 2007, Lecture 11 Whole-genome motif discovery

Summary of Promoter Motifs

Page 27: CS273a, Spring 2007, Lecture 11 Whole-genome motif discovery

Similar analysis in 5% most conserved regions in human

12-22 bp long motifs

Page 28: CS273a, Spring 2007, Lecture 11 Whole-genome motif discovery

Similar analysis in 5% most conserved regions in human

Page 29: CS273a, Spring 2007, Lecture 11 Whole-genome motif discovery

Overview of Motif Discovery Algorithms

Page 30: CS273a, Spring 2007, Lecture 11 Whole-genome motif discovery

Motif Representation

Motif instances:
GTATAA
CTATAA
GTCTTA
ATATAC
GTAATA
TTGTAC
GTATTA
GTATTC
ATCTAA

• PSSM
• Consensus: GTATAA
• IUPAC: GTATAM
• Complex Dependency Graphical Models
• Nonparametric – Graph or Bag of Words
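The first two representations can be derived directly from the nine instances on the slide. The sketch below builds per-column counts, the consensus, and a log-odds PSSM; the pseudocount of 1, the uniform background of 0.25, and all function names are assumptions chosen for illustration.

```python
import math
from collections import Counter

# The nine motif instances shown on the slide.
INSTANCES = ["GTATAA", "CTATAA", "GTCTTA", "ATATAC", "GTAATA",
             "TTGTAC", "GTATTA", "GTATTC", "ATCTAA"]

def column_counts(instances):
    """Count the bases observed at each position of equal-length instances."""
    if len({len(s) for s in instances}) != 1:
        raise ValueError("all instances must have the same length")
    return [Counter(column) for column in zip(*instances)]

def consensus(counts):
    """Most frequent base at each position."""
    return "".join(c.most_common(1)[0][0] for c in counts)

def pssm(counts, pseudocount=1.0, background=0.25):
    """Log2-odds score of each base at each position against a uniform background."""
    matrix = []
    for c in counts:
        total = sum(c.values()) + 4 * pseudocount
        matrix.append({b: math.log2((c[b] + pseudocount) / total / background)
                       for b in "ACGT"})
    return matrix

def score(matrix, site):
    """Sum of per-position scores for a candidate site of the motif's length."""
    return sum(column[base] for column, base in zip(matrix, site))

counts = column_counts(INSTANCES)
matrix = pssm(counts)
print(consensus(counts))  # GTATAA, matching the slide's consensus
print(score(matrix, "GTATAA") > score(matrix, "CCCCCC"))
```

The consensus keeps one base per column, while the PSSM keeps the full column distribution, so a PSSM can still score sites that differ from the consensus at weakly constrained positions.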

Page 31: CS273a, Spring 2007, Lecture 11 Whole-genome motif discovery

Motif Representation – Pairwise Dependencies

Complex Dependency Graphical Models

Page 32: CS273a, Spring 2007, Lecture 11 Whole-genome motif discovery

Motif Representation – MotifScan

GTATAA
CTATAA
TTGTAC
GTCTTA
GTAATA
ATATAC
GTATTA
GTATTC
ATCTAA

Page 33: CS273a, Spring 2007, Lecture 11 Whole-genome motif discovery

Page 34: CS273a, Spring 2007, Lecture 11 Whole-genome motif discovery

Page 35: CS273a, Spring 2007, Lecture 11 Whole-genome motif discovery
