DARK MATTER IN THE GENOME
Shin-Han Shiu
Plant Biology / Genetics / EEBB

Post on 21-Dec-2015

Page 1: DARK MATTER IN THE GENOME Shin-Han Shiu Plant Biology / Genetics / EEBB


Page 2
Page 3

Cell, nucleus, and chromosomes

Page 4

DNA

A G G C G T A G A G A G A T C C T T G A T

T C C G C A A C T C T C A A G G A A C A A

Page 5

DNA and Genome

How many A's, T's, G's, and C's are there in the human genome?

3,038,000,000 letters

A sizable book, say, the most recent Harry Potter book

~1,516,000 characters in 758 pages*

The book of our life

1,519,000 pages

~2,000 copies of the Deathly Hallows

How fast can you read?

Say, 1 book/day: it would take about 5.5 years. No vacation, no social life, no going anywhere else.
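The book comparison can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted on this slide (3,038,000,000 letters; ~1,516,000 characters in 758 pages). The function name and the one-book-per-day reading pace are illustrative assumptions.

```python
# Figures quoted on the slide.
GENOME_LETTERS = 3_038_000_000   # letters in the human genome
BOOK_CHARS = 1_516_000           # approximate characters in the book
BOOK_PAGES = 758                 # pages in the book


def genome_as_books(letters=GENOME_LETTERS, chars=BOOK_CHARS, pages=BOOK_PAGES):
    """Return (number of books, total pages, years at one book per day)."""
    books = letters / chars
    total_pages = books * pages
    years = books / 365          # assumed pace: one book per day, every day
    return books, total_pages, years


books, total_pages, years = genome_as_books()
print(f"{books:,.0f} books, {total_pages:,.0f} pages, {years:.1f} years")
```

Changing the assumed pace (for example, one book per week) only changes the divisor in the `years` line.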

Page 6

TTGGCTATCCTTTATATTTTAAGGGTTATTAGGATATTTTTTATTATGACTACATGGGATAAATGTTTAAAAAAAATAAAAAAAAACCTTTCTACGTTTGAGTATAAGACGTGGATAAAGCCTATCCATGTGGAGCAAAATAGTAACTTATTCACAGTTTACTGTAACAATGAATATTTCAAAAAACATATAAAATCTAAGTATGGAAATCTTATTTTATCAACAATCCAAGAGTGTCATGGTAATGATTTAATTATTGAATATTCTAATAAAAAATTCTCTGGCGAAAAAATTACTGAGGTTATCACAGCTGGACCACAAGCTAATTTTTTTAGCACAACAAGTGTTGAGATAAAAGATGAATCAGAAGATACAAAAGTAGTACAAGAACCTAAAATATCAAAGAAGTCTAATAGTAAAGACTTTTCTTCATCACAAGAGTTATTCGGTTTTGACGAAGCTATGCTAATTACAGCAAAAGAAGATGAGGAATACTCTTTTGGTTTACCGTTAAAAGAAAAATATGTTTTTGATAGTTTTGTTGTTGGAGATGCTAACAAAATTGCTAGAGCAGCGGCTATGCAGGTATCGATAAATCCAGGTAAATTACATAACCCTTTATTCATTTATGGTGGTAGTGGTTTAGGTAAAACTCACTTAATGCAAGCAATAGGTAATCATGCAAGAGAAGTTAATCCTAATGCCAAAATTATTTATACAAATTCAGAACAATTTATTAAAGATTATGTAAATTCTATTCGTTTACAAGATCAAGATGAGTTTCAAAGAGTTTATAGATCTGCGGATATACTTTTGATTGATGATATTCAATTTATCGCTGGTAAAGAGGGTACTGCTCAGGAGTTTTTCCATACTTTTAATGCATTGTATGAAAATGGTAAACAGATAATTCTAACTAGTGATAAGTATCCAAATGAAATAGAAGGGCTTGAAGAAAGACTAGTTTCGCGTTTTGGTTATGGTTTAACAGTTTCTGTTGATATGCCAGATTTAGAAACCAGAATTGCTATCTTGCTCAAAAAAGCTCATGATTTAGGTCAGAAATTACCTAACGAAACAGCAGCTTTTATTGCTGAGAATGTACGTACTAATGTCAGAGAACTAGAAGGTGCTCTAAATAGGGTTCTTACTACCTCTAAATTTAATCATAAAGATCCTACTATCGAAGTAGCACAAGCTTGCTTAAGAGATGTTATAAAAATACAAGAAAAGAAAGTAAAAATAGATAATATCCAAAAGGTTGTTGCTGATTTTTATAGAATCAGGGTAAAAGATTTAACTTCTAATCAAAGAAGTAGAAATATAGCTAGACCAAGACAGATAGCAATGAGTTTAGCACGTGAACTAACATCACATAGTTTGCCAGAAATAGGCAATGCTTTTGGTGGTAGAGACCATACGACAGTTATGCATGCTGTCAAAGCTATAACTAAATTAAGACAAAGCAATACTTCAATATCGGATGATTATGAGTTGCTTTTAAATAAAATTTCTCGTTAAATAAAATTAGTAACTTTATCAAAGGGGTTTTAAAAAATGAATTTTGTACTAAATAGAGATGACTTACTAAAGCCTTTGCAATCTATGCTCTCAGTTGCAAATAGTAAGAGTACAATGCCTTTATTATCATGTATCTTATTTGATATTGATAATAATAATCTCAAAATTACGGCTTCGGATCTTGATACAGAGATATCATGCAATATAGCAGTTAGTTGTAACACAACTATTAAGTTAGCATTAAATGCTGACAAAATTTATAACATTGTCAGAAGCTTAAATGAAAATTCAATGATTGATTTTAGAATTAATGAAAATAAGGTAACTATTGTTTCTAATAATAGTACTTTTAACCTTATATCACTAAATGCTGACAACTATCCTCTTATTGATAGTAATATCAATGAGCAAGCAAGTTTTGATCTTTCTCAACAAGATTTTCATCATATTATTTCAAAAGTAGATTTCTCAATGGCTAA
TGATGATACTCGATATTTCTTAAATGGGATGTTTTGGGAAATCAACGCAAATCTACTAAGAGCAGTATCTACAGATGGTCATAGAATGTCTATCACAGAGGCTATAATTGATAGTAAAGTGTTAGATAGTGCTTCTCAGTCGATAATTCCAAAAAAAGCGATTTTAGAGCTTAAAAAGATAGTTGGCAAAACAGAAGAAAATATCAAAATTTGTCTTGGCAAAAATTATCTAAAAGCGATTTTTGGTAATTATGCTTTTATATCAAAGCTTATAGATGGTCGCTATCCTGATTACCAAAAAGTAATCCCTAAAAATAATACAAAACTATTAGCAGTTGATAAGCAGTTTTTCAAAAATTCATTATTAAGAACATCAATACTTGCTAATGATAAATATAAAGGTGTTCGTCTTAACATATCTCAAAATCAATTACTTCTATCAGCTAATAACCCTGATAATGAAAAGGCTGAAGATAAAATCGAAGTTCAATATAATGATCAACCAATGGAAATTTGTTTTAATTACAAATATCTTTTGGATATTATAAATGTACTTAGTGAAGAAACTATGTCTATCTACCTTGATAATCCAAATATGAGTGCTTTAGTTAAAGATGAGAAAGATAATAGTTTGTTTATTATTATGCCAATGAAAATTTAAGTAATAAGTAGTTTTAGGAAATAACTATTTTTATAAGCCTTTTGGAATGAATAATAAAGCAATAAAAAAAGGTATGCATAAAAACATTATATAGAAAGCTGGGATTAGATAATTTCCAGTAGTAGTAATTAATAAAGTCATAAGAAAGGCAACAGTACCTCCAAATAAAGAAACGCTTATATTAAAACTTATTCCAAAACCAGTATTTCTATTTCTAACAGGAAATAGCCCTGCCGTATTTGCAAATATAGGTCCTATTACAGCACCACTTATAATAGCAAGAGAAAAAATAGCTATAGATACTAATTGATGATTTTTTATAATAATTTGGTATATTGGTAAAACAGCTATAAATAAGACTATACAAGAATACATCAGAACTTTTTTACCACCAATTCTATCAGCAATATATCCAAATATAATTGAAGAAAGCATTAATACTATAGTTAATCCGAGAGTATTTTGTTAAACACATAAGAGATAGCATCACAAAATTTTTCAAAACTATTATTCACTTTTCTAAATATTTTTTTAAAGTTAGCCCAAACCTTTTCAATAGGATTTAAATCTGGAGAATACGGAGGTAGATATAATATTTGTACATCAAATTTATTGGCTATTTCAATCAGCTTAGAGGATTTATGGAAACTAGCATTATCCATTACTATAGTAGTTTTAGGTTTTAATGATGGGCATAAGTGTTCCTCAAACCATTGATTAAAAATTTCAGTATTGGTATATCCACTGTACTCTAATGGAGCTATAATCTTTTTATCTGCATAATTATATCCAGCAACAATACTTCTTCTTTGTGTTTGATATGCTAAAACCTCACCATAACTAGGCTCACCAATTAGTGACCATCCTCTTAGGATAGAAAGCTTATTGTCACACCCCATCTCATCTATATAAAATAACAAGTTTTGAGCTATTTCTTTTAGTTTTTCTATATACTCCAACCTTTCATGTTCTTTTCTTTGCTTATATTTTGGAGTCTTTTTTTAAAACTAAAACCAAGTCTATTAAGACAATCATAAAATGTACTTCTTGGAATATCAGGGGCTAATGCTTCTTTTATATCTAATGCACTTGCATCTGGATGATCTATCAAATACTGTTCAATCAATGTTTTATCGGTAAAGCTAGCGACTCTGCCACAACCAACTCCTTGCTTTGAACTATAATCTCCGGTTCTTTTATAAAACTCTATCCATGAAACAACTGTACGCTTATCTATGTTAAAAAACTTACTCAGCTCGAACTCCGTCATACCTTCTTCATATTTATTAATTACGATGTCTCTAAAATATTGGCTATATGATGGCATTTTTATTAGACA
TTATAACATTTCTACAAATATCTTTTTCTACAAATATCTTTCGGATTAACTATATAAGTAGAGTCAACAACCATCCAAATCACCCAATTATCTATAATTTTCTGCTTGCTAAAAAAACGCATACCAATGATGCTACACTTGTAAAACCATCCATATATGGCGTTGTTGAATCAGTATAAAATATAAGTAGTTGCGAAACTAGTAACCCAAATACTACAATGCTTACTAGAACTTTTAACCAACCAATGATTTTAAGTCTATGAACAACTATCTTTTTGTGACTAAAGTTGGGTTGCCAACTATACCAACCGTATCCAAAGCTAAATAAAAGAATCATCTGCAATATAGCATCGGCATATAGTCCACTAACAGAAAATAAACCCGCACTCATGATCAAACCAACTATCTCCACAGGCCAACCAATGACATAAAGCCTTGCCAGCAAAAAGGTACACAAAAGATTAACAATCATTGTACAAAAATCAAAAATATGCAGCATATTTATTTTACTAATCAAAGTATTATAAATATTATAATAACTTTGAAGTTGGCGTATTAAAGCCATAAACTTTAGTAGGTTAGTGTTTATACCAATATTTTGAGATGCTTTCTGCAAGCTAATAACATTTAGCTATCTAGCCTAAATAATTAATATACAAAACTTTCAAGCTTATTGAATTTTTCAACAGATACAGCGCGTTATAACAAATAAGTAATTGACTAAATTAAAAAGCAAGTATAATATCGATTGTGTTTATTACATAATATAAAACGAGGATAAAAAAAATATGAAATTAAGAAAAGTATTAATCGCGACATTATTAGGAGCTTCTGCTTTATCTTTAAGTAGTTGTTGGTTACTTGTTGGTGCAGCTGTTGGTGGTGGAACTGCTGCGTATATTTCTGGTGAGTATTCAATGAATATGAGTGGCAGTGTAAAAGATATTTACAATGCTACTTTAAAAGCTGTTCAAAGCAATGATGATTTTGTAATTACTAAAAAATCTATTACTTCTGTTGATGCAGTTGTTGATGGTAGTACTAAGGTAGACTCAACAAGTTTCTATGTTAAAATAGAAAAACTTACTGATAATGCTTCAAAAGTTACAATTAAGTTTGGTACTTTTGGTGACCAAGCAATGTCAGCAACATTAATGGATCAAATCCAAAAGAATCTTTAATTAAATAGGTAATTACTATAATGACTTTTCTAAAGAAAGCTTTTATTGCAACTATAGTTTCTATTTCAGCATTAGTTCTAAATAGTTGTATTGTTGCAGCAATAGCTGTTGGTGGTGGAACAGTTGCCTATATTGATGGAAATTATTTTATGAATATAGAAGGCAACTATAAAGCTGTCTATAAAGCTACTCTTAAAGCTATTAATGATAATAATGACTTTGTTCTAGTATCAAAAGATCTTGATCAAACAAAGCAAAATGCCGACATTGAAGGTGCTACTAAAATTGATAGTACGAGTTTTAGTGTCAAAATTGAAAGACTGACAGATCAGGCTACTAAAGTGACAATCAAATTTGGTACTTTTGGCGATCAAGCAATGTCATCAACATTAATGGATCAGATCCAGGCAGCTGTACATAAAGCTTCTTAGAAATGTACAAAAAACTCTACTTAATTATATTATCCACAATAATCGCAATCTCTCTTAATAGTTGTGTTGTTGCCGCTGTTTTAATTGGTACAGCAGTTGTTGCTGGAGGTACAGTATATTACATCAATGGTAACTATATAATCGAAGTCCCTAAAGATATTAGAAGTGTATACAATGCTACAATCAAGACTATACAGATGGATAGTCAAAATAAACTAATAAGTCAAACCTATAATACTAAATCTGCTATAATTAAAGCTTTACAAAAAGGTGAAAAAATTAGTATAGATTTAAGCAATATTGATAGTCGTTCAACAGAGATAAAAATTCGTATAGGTGTACTTGGCGATGAGAAAAAATCTGCTGATTTAGCAAACTCAATAACAAA
AAATATCACCTAAGCAATATTTCTCGAACTTTGGTTAACTTTTTCTTTTTAAAAACTTTCAAAAATGTATAATTTGTGTTAGTTTGCAAACTACCCTTATATCCATAATGAGTAATAAGGTATTAGATACATATTATAAAAACAATCGACATATTTGGGTGCTAGTACTATCTGGTGCTGTTATAGGCACAATGATTGGTCTTCTAGCAACAGCATTTCAGCTACTCCTAGACTTTATTTTTAAAATTAAGCTGGCTCTTTTTTCTTTCAGTGGTGGTAATCTTTTTATCGAAATCGCTATGTCAATATCATTAAGTATTGTGATGGTATTAATTTCGATTTTTATTGTTAAAAAATTTGCGAAAGAGGCTGGTGGTAGCGGTATCCAAGAGGTTGAGGGTGCTTTAAAAGGCTGCCGCAAAATACGTAAAAGAGTTATGCCCGTGAAGTTTATAAGTGGACTTTTTTCGTTAGGCTCAGGTTTAAGTTTAGGTAAAGAGGGACCATCAATTCATATGGCTGCTGCATTAGCGCAGTTTTTTGTTGATAAATTTAAACTTACTACAAAATATGCTAATGCGGTTATCTCTGCTGGGGCTGGAGCTGGACTAGCAGCTGCTTTTAATACCCCACTTTCTGGGATTATCTTTGTTATTGAAGAGATGAATAGAAAGTTTAGATTTAGTGTTTCGGCAATAAAGTGTGTGCTAGTAGCATGTATCATGAGTACAGTTATCTCTAGAGCTATTATGGGTAATCCTCCAGCAATACGCGTAGAAACTTTCAGCTCAGTACCACAAAATACTCTTTGGTTATTTATGGTATTAGGGATTATATTTGGTTATTTTGGTTTACTATTTAACAAATCCTTAATCAAAGTGGCAAACTTTTTCTCAGAAGGCTCCAAGAAGAGGTATTGGACTTTAGTTATAATTGTTTGCATAATTTTTGGTATTGGTGTTGTTCTATCTCCAAATGCTGTTGGCGGTGGCTATATTGTCATAGCAAATACTCTTGATTATAACTTATCAATCAAGATGCTTTTAGTGCTTTTTGTACTTCGTTTTGCTGGAGTTATTTTCTCATATGGCACCGGCGTTACTGGTGGGATATTCGCACCAATGATTGCGCTTGGTACTG

And the worst of all, it looks like this...

Page 7

Our research interest

[Slide background: the DNA sequence from Page 6, repeated]

Page 8

Now, 1,366 genomes have been sequenced or are being sequenced

Page 9

Evolution of genome sizes

In Mb

Thale cress (Arabidopsis thaliana): 150

Fruit fly (Drosophila melanogaster): 160

Pufferfish (Takifugu rubripes): 400

Human (Homo sapiens): 3,000

Onion (Allium cepa): 16,750

Tiger salamander (Ambystoma tigrinum): 32,000

Marbled lungfish (Protopterus aethiopicus): 132,000

http://www.rbgkew.org.uk/

Page 10

What's in the genome

Genome

Annotated genes

Exon

UTR

Intron

Cis-regulatory elements

Selfish elements

Novel genes

Dead genes (pseudogenes)

Page 11

"Non-genic": repetitive elements

E.g. Human genome

Exons take up? Introns account for? Repetitive elements occupy? Unknown?

Venter et al. (2001) Science 291:1304

                      A     B     C
Exons                 1%    24%   25%
Introns               24%   1%    25%
Repetitive elements   35%   60%   45%
Unknown               40%   15%   5%

Page 12

What is in the unknown regions?

Investigate with tiling array

cDNA array

Tiling array

Gap size: 10 bp
Probe size: 25 bp

Number of features:
Arabidopsis, 135 Mb, 1 chip, ~6 x 10^6 features
Human, 3 Gb, 7 chips, ~4.2 x 10^7 features
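A rough way to see where feature counts of this magnitude come from: with 25 bp probes separated by 10 bp gaps, one probe starts every 35 bp. The sketch below is a naive single-strand estimate under that assumption only. The function name is hypothetical. The counts quoted on the slide differ from this estimate, so the real chip designs presumably involve choices the slide does not state (for example, which strands or regions are tiled).

```python
def tiling_features(genome_bp, probe_bp=25, gap_bp=10):
    """Naive count of probes needed to tile one strand end to end."""
    if genome_bp < probe_bp:
        return 0
    step = probe_bp + gap_bp                  # one probe starts every `step` bp
    return (genome_bp - probe_bp) // step + 1


for name, size in (("Arabidopsis", 135_000_000), ("Human", 3_000_000_000)):
    print(f"{name}: ~{tiling_features(size):.2e} probes on one strand")
```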

Page 13

"Non-genic": unannotated genes

Kapranov et al., 2002. Science

Tiling array analysis of human Chr 21, 22

Page 14

Tiling array analysis of human transcriptome

Kapranov et al., 2002. Science

Human Chr 21, 22

What do you think these expressed regions represent?

Page 15

Difficulties for coding gene prediction

Training data
You need to know something...
"Biased" toward the properties of the majority.

Real genes that are shorter tend to be much harder to predict.

Table 3. Accuracy of GISMO, Glimmer and CRITICA in predicting short genes (<300 bp)

Gene finder   Cor    Sn     Snfk (%)   Sp
GISMO         0.64   63.0   86.4       69.0
Glimmer       0.54   72.0   83.7       44.0
CRITICA       0.60   46.0   67.4       84.0

Snfk denotes the sensitivity in detecting function-known genes.

Krause et al., 2006. Nucleic Acids Res. 35:540

Page 16

Novel coding sequence identification

Arabidopsis thaliana as an example
135 Mb, ~50% occupied by annotated genes.
Focus on coding sequences 90-300 bp long.

What would you do next to eliminate ORFs that are likely false predictions?

133,090 sORFs
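For concreteness, here is a minimal sketch of how candidate small ORFs in the 90-300 bp range could be enumerated. The slide does not say how ORFs were defined, so everything here is an assumption: an ORF runs from the first in-frame ATG to the next in-frame stop codon, both strands are scanned, and the length includes the start and stop codons. The function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case A/C/G/T string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_sorfs(seq, min_bp=90, max_bp=300):
    """Yield (strand, start, end, orf) for ATG..stop ORFs of min_bp..max_bp.

    Coordinates are 0-based, end-exclusive, on the scanned strand.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None                       # position of the open ATG, if any
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    length = i + 3 - start
                    if min_bp <= length <= max_bp:
                        yield strand, start, i + 3, s[start:i + 3]
                    start = None               # look for the next ORF in this frame


# Toy input: a 96 bp ORF (ATG + 30 GCT codons + TAA) with 2 bp flanks.
demo = "CC" + "ATG" + "GCT" * 30 + "TAA" + "GG"
for hit in find_sorfs(demo):
    print(hit[:3], len(hit[3]))
```

Such a scan is deliberately permissive, which is why the later criteria are needed to remove likely false predictions.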

Page 17

Criterion 1: Codon usage bias

Some codons are used more frequently than others

http://www.cbs.dtu.dk/services/GenomeAtlas/

Page 18

Criterion 1: Codon usage bias

For example: codons for proline

Suppose the following two sequences both code for poly-proline. Which one is more likely to be a real coding sequence?

Codon   NCDS   CDS
CCT     0.25   0.12
CCC     0.25   0.49
CCA     0.25   0.06
CCG     0.25   0.33

Seq1 CCT CCA CCT

Seq2 CCC CCG CCC

$p(\mathrm{CDS} \mid \mathrm{Seq1}) = 0.12 \times 0.06 \times 0.12 = 8.6 \times 10^{-4}$

$p(\mathrm{CDS} \mid \mathrm{Seq2}) = 0.49 \times 0.33 \times 0.49 = 7.9 \times 10^{-2}$
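The proline example can be reproduced numerically. The sketch below multiplies the per-codon frequencies from the table to score each sequence under the CDS and NCDS columns. Strictly speaking, such a product is the likelihood of the sequence given the model; turning it into a posterior probability of being coding also needs the NCDS likelihood and a prior. The 0.5 prior used here is an assumption for illustration, and this toy calculation is not claimed to be the procedure of Hanada et al.

```python
from math import prod

# Proline codon frequencies from the slide's table.
CDS = {"CCT": 0.12, "CCC": 0.49, "CCA": 0.06, "CCG": 0.33}
NCDS = {"CCT": 0.25, "CCC": 0.25, "CCA": 0.25, "CCG": 0.25}


def likelihood(codons, table):
    """Product of per-codon frequencies under one model."""
    return prod(table[c] for c in codons)


def posterior_cds(codons, prior_cds=0.5):
    """P(CDS | codons) by Bayes' rule; prior_cds=0.5 is an assumed prior."""
    l_cds = likelihood(codons, CDS) * prior_cds
    l_ncds = likelihood(codons, NCDS) * (1 - prior_cds)
    return l_cds / (l_cds + l_ncds)


seq1 = ["CCT", "CCA", "CCT"]
seq2 = ["CCC", "CCG", "CCC"]
for name, seq in (("Seq1", seq1), ("Seq2", seq2)):
    print(name, f"{likelihood(seq, CDS):.1e}", f"{posterior_cds(seq):.2f}")
```

Under these assumptions Seq2, which uses the codons preferred in known coding sequences, scores far higher than Seq1.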

Page 19

Posterior probability of coding sequence

Compare known non-coding and coding sequences

Hanada et al., 2007. Genome Res.

Page 20

Posterior probability of coding sequence

Scanning the Arabidopsis genome

Hanada et al., 2007. Genome Res.

Page 21

After applying the first criterion

7,442 coding sORFs

Page 22

How good is the CDS finding measure?

For the training data

For 18 Arabidopsis small protein genes
All 18 are predicted as CDS.

For 84 yeast small protein genes
All 84 are predicted as CDS.

Page 23

So what does this mean?

If a sequence is a true coding sequence
Our approach can predict it with high accuracy.
So, the sensitivity is very good.

Is this good enough?

What about specificity?
Namely, how good are the criteria at excluding false positives?
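To make the two terms concrete, the sketch below uses the standard confusion-matrix definitions, which are an assumption here since the slides do not define them. Sensitivity can be computed from the small protein gene sets (18 of 18 and 84 of 84 predicted as CDS). Specificity cannot be computed from those sets because they contain no known negatives, so the specificity numbers below are purely hypothetical.

```python
def sensitivity(tp, fn):
    """Fraction of true coding sequences that are predicted as coding."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of true non-coding sequences that are rejected."""
    return tn / (tn + fp)


# Counts from the small protein gene sets: every known gene was recovered.
print(sensitivity(tp=18, fn=0))   # Arabidopsis set
print(sensitivity(tp=84, fn=0))   # yeast set

# HYPOTHETICAL counts, only to show the calculation: 100 known non-coding
# sequences, of which 10 are wrongly predicted as coding.
print(specificity(tn=90, fp=10))
```

A method that calls everything coding would also reach perfect sensitivity, which is why specificity has to be assessed separately.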

Page 24

Criterion 2: Expression

Which of the following distributions more likely depicts the expression levels of true CDS compared to those of false CDS?

Tiling array
Gap size: 10 bp
Probe size: 25 bp

[Plot axes: Frequency vs. Expression level (Low to High)]

Page 25

Comparison of expression levels

A: Exon
B: Intron
C: Predicted novel CDS
D: tRNA
E: rRNA

Exon, intron, tRNA, rRNA, our predictions

Page 26

Applying the second criterion

Predictions are significantly enriched in expressed sequences

2,996 transcribed sORFs

Page 27

Criterion 3: Purifying selection

Compare known coding and non-coding sequences

$\omega = K_a / K_s$

$K_a$: non-synonymous substitution rate
$K_s$: synonymous substitution rate

$\omega > 1$: positive selection
$\omega = 1$: selectively neutral
$\omega < 1$: negative (purifying) selection
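The interpretation of the Ka/Ks ratio can be written as a small classifier. This sketch only labels a pair of rates that has already been estimated; estimating Ka and Ks from aligned sequences, and testing whether the ratio differs significantly from 1, are separate steps not shown here. The function names, the tolerance, and the example values are hypothetical.

```python
import math


def omega(ka, ks):
    """Ratio of non-synonymous to synonymous substitution rates."""
    if ks <= 0:
        raise ValueError("Ks must be positive to compute Ka/Ks")
    return ka / ks


def selection_regime(ka, ks, tol=1e-9):
    """Label the regime implied by omega = Ka/Ks."""
    w = omega(ka, ks)
    if math.isclose(w, 1.0, abs_tol=tol):
        return "selectively neutral"
    return "positive selection" if w > 1 else "negative (purifying) selection"


# Hypothetical rate pairs, for illustration only.
for ka, ks in ((0.02, 0.2), (0.1, 0.1), (0.3, 0.1)):
    print(ka, ks, selection_regime(ka, ks))
```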

Page 28

Criterion 3: Purifying selection

Compare known coding and non-coding sequences

Page 29

In the end,

We found that a large number (941) of small ORFs have the following three properties:
They have nucleotide composition similar to known coding sequences.
They are expressed.
They are subject to selection in the way a protein-coding sequence would be.

Take home message:
We don't know the functions of any of these.
The view that most of the "intergenic" region is junk DNA may be wrong.
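The counts reported along the way (133,090 sORFs, 7,442 coding sORFs, 2,996 transcribed sORFs, 941 in the end) can be summarized as a filtering funnel. The sketch below only does the arithmetic on those four numbers; the labels and the function name are illustrative.

```python
# Counts quoted on the slides, in the order the criteria were applied.
STEPS = [
    ("sORFs, 90-300 bp", 133_090),
    ("coding sORFs (criterion 1)", 7_442),
    ("transcribed sORFs (criterion 2)", 2_996),
    ("all three properties", 941),
]


def retention(steps):
    """Return rows of (label, count, fraction of previous, fraction of first)."""
    first = steps[0][1]
    prev = first
    rows = []
    for label, count in steps:
        rows.append((label, count, count / prev, count / first))
        prev = count
    return rows


for label, count, of_prev, of_first in retention(STEPS):
    print(f"{label:33s} {count:>8,d} {of_prev:7.1%} {of_first:7.2%}")
```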

Page 30

Acknowledgement

Current and past lab members
Kousuke Hanada
Melissa Lehti-Shiu
Cheng Zou

TIGR
Chris Town
Hank Wu

University of Chicago
Wen-Hsiung Li
Justin O. Borevitz
Xu Zhang

Funding