CS273a, Spring 2007, Lecture 11
Whole-genome motif discovery
Challenges in Computational Biology
[Figure: overview diagram linking DNA (TCATGCTATTCGTGATAATGAGGATATTTATCATATTTATGATTT) and RNA transcript to the topics: Genome Assembly (4), Gene Finding, Regulatory motif discovery, Database lookup, Gene expression analysis (9), Sequence alignment, Evolutionary Theory (7), Cluster discovery (10), Gibbs sampling, Protein network analysis (12), Regulatory network inference (13), Emerging network properties (14), Comparative Genomics, RNA folding]
TTATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATACATATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTCAGTAATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTCCGTGCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACTAGCTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATGATAATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAAAAGCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAATTGTTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAATTCTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGGATTTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGATTTTGATATGCTTTGCGCCGTCAAAGTTTTGAACGATGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAATCTTTAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATGAACGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATCATATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAAAAGAAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCAGCATTGGGCAGCTGTCTATATGAATTAGTCAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAACTTTAGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAG...TTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTTTCCTACGCA
TAATAAGAATAGGAGGGAATATCAAGCCAGACAATCTATCATTACATTTAAGCGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAAAGAGTATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATACAGCTCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTACAACCAGGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATATCAACACTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCGTTGGTGCTTTTGTTGTTTTGGCCTCTAGAGTTGGATCTGCTTATCATTTGTCATTCCCTATATCATCTAGAGCATCATTCGGTATTTTCTTCTCTTTATGGCCCGTTATTAACAGAGTCGTCATGGCCATCGTTTGGTATAGTGTCCAAGCTTATATTGCGGCAACTCCCGTATCATTAATGCTGAAATCTATCTTTGGAAAAGATTTACAATGATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAAT
[Figure: the same sequence, annotated with promoter motifs, 3’ UTR motifs, exons, and introns]
Comparing genomes reveals functional elements
• Ultra-conserved elements
• Protein-coding genes
• Short regulatory motifs
Regulatory Motif Discovery
[Figure: start of the GAL1 gene (ATGACTAAATCTCATTCAGAAGAAGTGA) with upstream Gal4 and Mig1 binding sites; site labels: CCCCWCGG CCG and CGG CCG]
• Gene regulation
– Genes are turned on / off in response to changing environments
– Gene regulatory logic is controlled by sequence motifs
– Specialized proteins (transcription factors) recognize motifs
• What makes motif discovery hard?
– Motifs are short (6-8 bp) and usually degenerate
– Act at variable distances upstream (or downstream) of target gene
Regulatory Motif Discovery
Study known motifs
Derive conservation rules
Discover novel motifs
Known motifs are preferentially conserved
Is this enough to discover motifs? No.
human CTCTTAATGGTACACGTTCTGCCT----AAGTAGCCTAGACGCTCCCGTGCGCCC-GGGG
dog   CTCTTA-CGGGGCACATTCTGCTTTCAACAGTGGGGCAGACGGTCCCGCGCGCCCCAAGG
mouse GTCTTAGGAGGCT-CGATCGCC---------------------GCCTGCATTATT-----
rat   GTCTTAGTTGGCCACGACCTGC---------------------TCATGCATAATT-----
       *****   *    *   *  *                      *  *

human CGGGTAGGCCTGGCCGAAAATCTCTCCCGCGCGCCTGACCTTGGGTTGCCCCAGCCAGGC
dog   CAGGC---CCGGGCTGCAGACCTGCCCTGAGGGAATGACCTTGGGCGGCCGCAGCGGGGC
mouse --------------CACAAGCCTGTGGCGCGC-CGTGACCTTGGGCTGCCCCAGGCGGGC
rat   --------------CACAAGTTTCTC---TGC-CCTGACCTTGGGTTGCCCCAGGCGAG-
                       *    *       *    **********  *** ***    *

human TGCGGGCCCGAGACCCCCG-------------------GGCCTCCCTGCCCCCCGCGCCG
dog   CGCGGGCCCAGGCCCCCCTCCCTCCCTCCCTCCCTCCCTCCCTCCCTGCCCCCCGGACCG
mouse TGCAGGCTCACCACCCCGTCTTTTCT---------------------GCTTTTCGAGTCG
rat   -GCATACACCCCGCCTTTTTTTTTTTTTT---------TTTTTTTTTGCCGTTCAAG-AG
       **   * *    **                                **    *     *
[Figure: the same four-species alignment, with the Gabpa and Err motif instances highlighted]
Known motifs are frequently conserved
• Across the human promoter regions, the Err motif:
– appears 434 times
– is conserved 162 times
– Conservation rate: 37%
[Figure: Err motif instances aligned across Human, Dog, Mouse, Rat]
• Compare to random control motifs
– Conservation rate of control motifs: 6.8%
– Err enrichment: 5.4-fold
– Err p-value < 10^-50 (25 standard deviations under binomial)
Motif Conservation Score (MCS)
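The Err numbers on this slide can be checked with a few lines of arithmetic. The sketch below assumes the "standard deviations under binomial" figure is the usual normal-approximation z-score, with the control conservation rate as the binomial success probability; the function name is illustrative and not from the lecture.

```python
import math

def conservation_z_score(k, n, r):
    """Z-score of k conserved instances out of n under Binomial(n, r).

    Assumption: the slide's 'standard deviations under binomial' is this
    normal-approximation z-score.
    """
    expected = n * r
    sd = math.sqrt(n * r * (1.0 - r))
    return (k - expected) / sd

# Numbers from the slide: Err appears 434 times, 162 conserved, control rate 6.8%.
n_total, k_conserved, control_rate = 434, 162, 0.068

rate = k_conserved / n_total
enrichment = rate / control_rate
z = conservation_z_score(k_conserved, n_total, control_rate)

print(f"conservation rate: {rate:.1%}")
print(f"enrichment over controls: {enrichment:.2f}-fold")
print(f"z-score: {z:.1f}")
```

Under these assumptions the rate comes out near 37%, the enrichment close to the slide's 5.4-fold, and the z-score close to 25.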
MCS distribution of all 6-mers shows excess conservation
– High-scoring patterns include known motifs
– Excess specific to promoters and 3’-UTRs (not introns)
– For MCS > 6, estimate 97% specificity
[Figure: two plots of motif density versus Motif Conservation Score (MCS)]
Select motifs with MCS > 6.0, cluster
Hill-climbing in sequence space
• Seed selection
– Three mini-motif conservation criteria (CC1, CC2, CC3)
• Motif extension
– Non-random conservation of neighbors
• Motif collapsing
– Merge neighbors using hierarchical clustering, avg-max-linkage
• Re-scoring complex motifs
– Motif conservation score for full motifs (MCS)
Test 1: Intergenic conservation
[Figure: plot of conserved count versus total count for mini-motifs; CGG-11-CCG is labeled]
Test 1: Selecting mini-motifs
• Estimate basal rate of conservation
– Expected conservation rate at the evolutionary distances observed
– Average conservation rate of non-outlier mini-motifs
• Score conservation of mini-motif
– k: conserved motif occurrences
– n: total motif occurrences
– r: basal conservation rate
– Evaluate binomial probability of observing k successes out of n trials
• Assign z-score to each mini-motif
– Bulk of distribution is symmetric
– Estimate specificity as (R-L)/R
– Select cutoff: 5.0 sigma
– 1190 mini-motifs, 97.5% non-random
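[Figure: plot with axes Conservation rate and N, with r marked]
Binomial score: $P(k) = \binom{n}{k} p^{k} (1-p)^{n-k}$, where p is the basal conservation rate r
[Figure: distribution of binomial scores with the left tail, right tail, and cutoff marked; specificity is estimated from the two tails]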
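The scoring steps above can be sketched as follows. This is a minimal illustration, assuming R and L are the numbers of mini-motifs whose z-scores fall beyond +cutoff and -cutoff respectively; the function names and the toy z-score list are hypothetical and not from the lecture.

```python
import math

def binom_pmf(k, n, r):
    """Probability of exactly k conserved occurrences out of n at basal rate r."""
    return math.comb(n, k) * r**k * (1.0 - r) ** (n - k)

def binom_upper_tail(k, n, r):
    """Probability of observing k or more conserved occurrences out of n."""
    return sum(binom_pmf(i, n, r) for i in range(k, n + 1))

def z_score(k, n, r):
    """Normal-approximation z-score of k successes under Binomial(n, r)."""
    return (k - n * r) / math.sqrt(n * r * (1.0 - r))

def estimated_specificity(z_scores, cutoff):
    """Estimate specificity as (R - L) / R.

    Assumption: R counts scores at or above +cutoff, L counts scores at or
    below -cutoff, and the left tail stands in for the noise expected on the right.
    """
    right = sum(1 for z in z_scores if z >= cutoff)
    left = sum(1 for z in z_scores if z <= -cutoff)
    if right == 0:
        return 0.0
    return (right - left) / right

# Err example from the earlier slide: 162 conserved out of 434, control rate 6.8%.
err_tail = binom_upper_tail(162, 434, 0.068)

# Hypothetical z-scores, for illustration only.
toy_scores = [-5.5, -1.0, 0.2, 0.9, 5.2, 6.1, 7.4, 8.0]
toy_specificity = estimated_specificity(toy_scores, cutoff=5.0)
```

With the toy scores, four values lie beyond +5.0 and one beyond -5.0, so the estimate is (4 - 1) / 4 = 0.75. The exact tail probability for the Err counts falls well below the 10^-50 bound quoted on the earlier slide.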
Test 2: Intergenic vs. Coding
[Figure: plot of intergenic conservation versus coding conservation for mini-motifs; CGG-11-CCG is labeled, along with a region marked Higher Conservation in Genes]
Test 3: Upstream vs. Downstream
[Figure: plot of upstream conservation versus downstream conservation for mini-motifs; labels: CGG-11-CCG, Most Patterns, Downstream motifs?]
Constructing full motifs
[Figure: pipeline from 2,000 mini-motifs to 72 full motifs. The mini-motifs selected by Test 1, Test 2, and Test 3 are each extended and collapsed, and the three results are merged into the full motifs.]
Extending mini-motifs
• Separate conserved and non-conserved instances
[Figure: causal set (CT A C GA, 6) versus random set (CT x x GA, 6), with flanking positions R G W and Y H S; counts labeled N1, N2, M1, M2]
• Find maximally discriminating neighborhood
• Evaluate non-randomness of neighborhood
– Chi-square contingency test on [N1,M1], [N2,M2]
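A chi-square contingency test on a 2x2 table can be sketched as below. The reading of the table is an assumption: N1 and M1 are taken as counts with and without a candidate neighboring base among conserved instances, and N2 and M2 as the same counts among non-conserved instances. The function name is illustrative, and no continuity correction is applied.

```python
import math

def chi_square_2x2(n1, m1, n2, m2):
    """Pearson chi-square statistic and p-value for the table [[n1, m1], [n2, m2]].

    Uses the closed form for 2x2 tables. With 1 degree of freedom the
    chi-square survival function equals erfc(sqrt(x / 2)).
    """
    row1, row2 = n1 + m1, n2 + m2
    col1, col2 = n1 + n2, m1 + m2
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column must have a non-zero total")
    total = row1 + row2
    chi2 = total * (n1 * m2 - m1 * n2) ** 2 / (row1 * row2 * col1 * col2)
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical counts: the neighbor base is seen in 30 of 40 conserved
# instances but only 10 of 40 non-conserved instances.
chi2, p = chi_square_2x2(30, 10, 10, 30)
```

For these made-up counts the statistic is 20.0, which would mark the neighboring position as non-randomly associated with conservation.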
Systematically test candidate patterns
All potential motifs
Evaluate MCS
Cluster similar motifs
[Figure: example pattern with a central gap, GT C A GT R R Y gap S W]
174 motifs in promoters
106 motifs in 3’ UTRs
• Enumerate
– Length between 6 and 15 nt, allow central gap
– 11-letter alphabet (A C G T, 2-fold codes, N)
• Score
– Compute binomial score (conserved vs. total)
– Select MCS > 6.0 (specificity 97%)
• Cluster
– Sequence similarity
– Overlapping occurrences
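The 11-letter alphabet can be made concrete with a small matcher. This sketch assumes the 2-fold codes are the six standard IUPAC two-base symbols (R, Y, S, W, K, M); the helper names are hypothetical, and scanning one strand of a single sequence stands in for the genome-wide counting described on the slide.

```python
import re

# A, C, G, T, the six 2-fold IUPAC codes, and N: 11 letters in total.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "N": "ACGT",
}

def pattern_to_regex(pattern):
    """Compile a degenerate pattern into a regex that also finds overlapping hits."""
    parts = []
    for symbol in pattern.upper():
        bases = IUPAC[symbol]
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    # The lookahead consumes no characters, so overlapping matches are reported.
    return re.compile("(?=(" + "".join(parts) + "))")

def find_occurrences(pattern, sequence):
    """Return 0-based start positions of every match of pattern in sequence."""
    regex = pattern_to_regex(pattern)
    return [m.start() for m in regex.finditer(sequence.upper())]

# Human row of the second alignment block shown earlier (ungapped).
human = "CGGGTAGGCCTGGCCGAAAATCTCTCCCGCGCGCCTGACCTTGGGTTGCCCCAGCCAGGC"
hits = find_occurrences("TGACCTY", human)
```

A central gap can be written as a run of N symbols, for example CGG followed by eleven N and then CCG.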
Are these real?
Functions of discovered motifs
Evidence of motif function
• Promoter motifs: (1) Comparison to known motifs
(2) Distance from TSS
(3) Expression enrichment
[Figure: gene diagram with the promoter upstream of the ATG (174 motifs) and the 3’-UTR after the Stop codon (106 motifs)]
(1) Promoter motifs match known TF binding sites
• Compare discovered motifs to TRANSFAC database of 125 known motifs

MCS   Discovered motif  Factor    Known motif
46.8  GGGCGGR           SP-1      GGGCGGG
34.7  GCCATnTTg         YY1       GCCATnTT
32.7  CACGTG            MYC       SCACGTG
31.2  GATTGGY           NF-Y      YSATTGGYY
30.8  TGAnTCA           AP-1      CTGASTCA
29.7  GGGAGGRR          MAZ       GGGGAGGG
29.5  TGACGTMR          CREB      TGACGTMA
26.0  CGGCCATYK         NF-MUE1   CGGCCATCT
25.0  TGACCTTG          ERR?      TGACCTTG
22.6  CCGGAARY          ELK-1     CCGGAART
19.8  SCGGAAGY          GABP      VCCGGAAG
17.9  CATTTCCK          STAT1     CAnTTCCS
14.9  TTGTTT            SRY       KTWGTTT
14.6  TATAAA            TBP       TATAAATW
14.2  RTAAACA           FOXO1     RWAAACAA
13.9  SMGGAAGT          PEA3      MGGAWGT
12.6  YYATTGTT          SOX-5     ATTGTT
12.5  TCACGTG           SREBP-1   ATCACGTGAY
12.4  YATGYAAAT         OCTAMER   ATGCAAATnA
12.2  GGGnnTTTCC        P65       GGGRATTTCC
11.9  TGACGTGK          ATF6      TGACGTGG
11.7  TTAYRTAA          E4BP4     RTTACRTAAY
11.0  CCAWWnAAGG        SRF       CCAWATAWGGM
10.7  TAAWWATAG         MEF-2     YTAAAWATAGCY

55% of TRANSFAC motifs match discovered motifs
45% of discovered motifs match TRANSFAC motifs
(only 2% of control sequences match TRANSFAC motifs)
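One simple way to compare a discovered consensus with a known one is to slide the two against each other and ask whether every overlapping position shares at least one base. The slides do not state the matching criterion actually used, so the rule below, the minimum overlap of 6, and the function names are all assumptions made for illustration.

```python
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "V": "ACG", "N": "ACGT",
}

def compatible(a, b):
    """True if two IUPAC symbols share at least one base."""
    return bool(set(IUPAC[a]) & set(IUPAC[b]))

def best_overlap(motif_a, motif_b, min_overlap=6):
    """Longest ungapped overlap in which every aligned pair is compatible.

    Returns (offset, overlap_length), where offset is the position in
    motif_a at which motif_b starts, or None if no overlap qualifies.
    """
    a, b = motif_a.upper(), motif_b.upper()
    best = None
    for offset in range(-(len(b) - 1), len(a)):
        start = max(0, offset)
        end = min(len(a), offset + len(b))
        overlap = end - start
        if overlap < min_overlap:
            continue
        if all(compatible(a[i], b[i - offset]) for i in range(start, end)):
            if best is None or overlap > best[1]:
                best = (offset, overlap)
    return best

sp1 = best_overlap("GGGCGGR", "GGGCGGG")   # discovered vs. SP-1
myc = best_overlap("CACGTG", "SCACGTG")    # discovered vs. MYC
none = best_overlap("TATAAA", "GGGCGGG")   # unrelated pair
```

Under this rule the SP-1 pair aligns over all 7 positions at offset 0, the MYC pair over 6 positions with the known motif starting one base earlier, and the unrelated pair has no qualifying overlap.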
(2) Promoter motifs show preferred distance to TSS
32% of discovered motifs show strong positional bias
[Figure: two panels, "Conserved motif sites in all four species" and "Motif instances in human"; rows are each of the 174 discovered motifs, x-axis is distance from TSS; labels: Motif 8, Motif 4, -81, -63]
• Discovered motifs occur preferentially within 200 bp of the Transcription Start Site
• Individual motifs show strong peaks regardless of conservation
(3) Promoter motifs enriched in specific tissues
70% of motifs show significant enrichment in at least one tissue
[Figure: tissue enrichment of motifs, grouped into new motifs and known TFs]
Summary for promoter motifs
Rank  Discovered motif  Known TF motif  Tissue enrichment  Distance bias
1 RCGCAnGCGY NRF-1 Yes Yes
2 CACGTG MYC Yes Yes
3 SCGGAAGY ELK-1 Yes Yes
4 ACTAYRnnnCCCR Yes Yes
5 GATTGGY NF-Y Yes Yes
6 GGGCGGR SP1 Yes Yes
7 TGAnTCA AP-1 Yes
8 TMTCGCGAnR Yes Yes
9 TGAYRTCA ATF3 Yes Yes
10 GCCATnTTG YY1 Yes
11 MGGAAGTG GABP Yes Yes
12 CAGGTG E12 Yes
13 CTTTGT LEF1 Yes
14 TGACGTCA ATF3 Yes Yes
15 CAGCTG AP-4 Yes
16 RYTTCCTG C-ETS-2 Yes Yes
17 AACTTT IRF1(*) Yes
18 TCAnnTGAY SREBP-1 Yes Yes
19 GKCGCn(7)TGAYG Yes Yes
20 GTGACGY E4F1 Yes Yes
21 GGAAnCGGAAnY Yes Yes
22 TGCGCAnK Yes Yes
23 TAATTA CHX10 Yes
24 GGGAGGRR MAZ Yes
25 TGACCTY ERRA Yes
• 174 promoter motifs
– 70 match known TF motifs
– 115 show expression enrichment
– 60 show positional bias
– 75% have evidence
• Control sequences
– < 2% match known TF motifs
– < 5% expression enrichment
– < 3% show positional bias
– < 7% false positives
Most discovered motifs are likely to be functional
[Table annotation: five of the listed motifs are marked New]
Summary of Promoter Motifs
Similar analysis in 5% most conserved regions in human
12-22 bp long motifs
Overview of Motif Discovery Algorithms
Motif Representation
Example sites:
GTATAA
CTATAA
GTCTTA
ATATAC
GTAATA
TTGTAC
GTATTA
GTATTC
ATCTAA
• PSSM
• Consensus: GTATAA
• IUPAC: GTATAM
• Complex Dependency Graphical Models
• Nonparametric – Graph or Bag of Words
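The first two representations can be derived directly from the nine example sites on this slide. The sketch below builds a per-position count matrix, reads off the consensus, and converts the counts to log-odds scores. The pseudocount of 1 and the uniform 0.25 background are assumptions, and the function names are illustrative.

```python
import math
from collections import Counter

# The nine example sites shown on the slide.
SITES = ["GTATAA", "CTATAA", "GTCTTA", "ATATAC", "GTAATA",
         "TTGTAC", "GTATTA", "GTATTC", "ATCTAA"]
BASES = "ACGT"

def count_matrix(sites):
    """One Counter of base counts per alignment column."""
    return [Counter(column) for column in zip(*sites)]

def consensus(counts):
    """Most frequent base at each position."""
    return "".join(max(BASES, key=lambda b: c[b]) for c in counts)

def pssm(counts, pseudocount=1.0, background=0.25):
    """Log2-odds score per base and position (assumed uniform background)."""
    matrix = []
    for c in counts:
        total = sum(c.values()) + 4 * pseudocount
        matrix.append({
            b: math.log2((c[b] + pseudocount) / total / background)
            for b in BASES
        })
    return matrix

def score(matrix, site):
    """Sum of per-position log-odds for a candidate site of matching length."""
    return sum(column[base] for column, base in zip(matrix, site))

counts = count_matrix(SITES)
matrix = pssm(counts)
best = consensus(counts)
```

The consensus computed this way is GTATAA, matching the slide. The last column contains only A and C, which is why the IUPAC form ends in M.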
Motif Representation – Pairwise Dependencies
Complex Dependency Graphical Models
Motif Representation – MotifScan
GTATAA
CTATAA
TTGTAC
GTCTTA
GTAATA
ATATAC
GTATTA
GTATTC
ATCTAA