Zipf’s monkeys: Observations from real and random genomes

Upload: zola

Post on 23-Feb-2016


Environmental genomics
  When an organism dies, it decomposes and the DNA in its cells degenerates into smaller and smaller fragments
  Given a collection of DNA fragments (i.e. reads), figure out which organisms they came from

The data
  AGTCGATGCAGTCAGCATACGATCAGACTGCAGCT
  TATATCGCATCGCGCATG
  ATTACTACTGCGCGATCAGCATCATATACGACTACGGCAG
  ATCATCATCGCGCATCAATCAGTG
  [Slide animation: sequence lines, first unbroken, are progressively split into fragments of varying length]

How can we reconstruct the original genomes?

Approaches
  Jigsaw puzzle
    Find common subsequences
    Align overlapping regions
  Statistics
    Compute histograms of oligonucleotides (n-grams)
    Match to distributions for known organisms
  Use rare polymers to select anchor points (BLAST-like)

Compression distance
  Conjecture: a lossless, dictionary-based sequence compressor built for a genome compresses one of its own subsequences better than would the compressor built for another genome

(Normalized) universal compression distance

  \[ \mathrm{UCD}(x, y) = \frac{\max[\, C(xy) - C(x),\; C(yx) - C(y) \,]}{\max[\, C(x),\; C(y) \,]} \]

  where C(s) is the compressed size of s, and xy is the concatenation of x and y

CM clustering
  Compression Maximization
  Adapt compression into a kind of EM clustering
  Partition reads randomly into [say] two groups
  For each read, compute compression distance to each group (à la leave-one-out)
  Reassign read to closest group
  Iterate until some stopping criterion
  Apply recursively to each group

Experiment
  groupA: DG2, NM1, MR2, DE3, AD5, AF1, DE1, AF3, DG4, AF5, CA1, MR4, CA3, DE4, CA5, NM4, CS2, AD2, CS4, MR3
  groupB: AF2, DE2, AD4, CA4, DE5, DG1, AD1, NM3, AF4, DG5, MR1, AD3, CS5, CA2, MR5, CS3, NM2, DG3, CS1, NM5

Experiment: result
  groupA: AD1, AD2, AD3, AD4, AD5, AF1, AF2, AF3, AF4, AF5, CA1, CA2, NM1, NM2, NM3, CS1, CS2, CS3, CS4, CS4
  groupB: DE1, DE2, DE3, DE4, DE5, DG1, DG2, DG3, DG4, DG5, MR1, MR2, MR3, MR4, MR5, CA3, CA4, CA5, NM4, NM5
  stop when CD > 70

Reassembly
  Can the LZ trie be used to reassemble reads into genomes?
  The LZ trie is a regular grammar of the set of reads
  A long phrase is an extension of a shorter phrase
  The start of one read is the end of another
  The part of a long phrase that is the suffix after a shorter phrase (i.e. the difference between the short phrase and the long one) is the prefix of another phrase
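The universal compression distance and the CM clustering loop described above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: zlib stands in for the dictionary-based compressor (the slides do not say which compressor was used), a group is represented by the concatenation of its reads, and the names csize, ucd and cm_cluster, together with the parameters k, max_iter and seed, are hypothetical.

```python
import random
import zlib


def csize(data: bytes) -> int:
    """Compressed size in bytes; zlib is an assumed stand-in compressor."""
    return len(zlib.compress(data, 9))


def ucd(x: bytes, y: bytes) -> float:
    """max[C(xy) - C(x), C(yx) - C(y)] / max[C(x), C(y)]."""
    cx, cy = csize(x), csize(y)
    return max(csize(x + y) - cx, csize(y + x) - cy) / max(cx, cy)


def cm_cluster(reads, k=2, max_iter=20, seed=0):
    """Return one group label per read after iterative reassignment."""
    rng = random.Random(seed)
    # Partition reads randomly into k groups.
    labels = [rng.randrange(k) for _ in reads]
    for _ in range(max_iter):
        changed = False
        for i, read in enumerate(reads):
            dists = []
            for g in range(k):
                # Leave-one-out: build the group without read i itself.
                members = [r for j, r in enumerate(reads)
                           if labels[j] == g and j != i]
                if members:
                    dists.append(ucd(read, b"".join(members)))
                else:
                    dists.append(float("inf"))  # empty group: never closest
            # Reassign the read to the closest group.
            best = min(range(k), key=dists.__getitem__)
            if best != labels[i]:
                labels[i] = best
                changed = True
        # Assumed stopping criterion: a full pass with no reassignment.
        if not changed:
            break
    return labels
```

Labels are updated in place during each pass, so later reads in a pass already see earlier reassignments; a batch update would be an equally plausible reading of the slide. The recursive step would call cm_cluster again on the reads of each resulting group. Short reads compress poorly with a general-purpose compressor, so distances on toy inputs tend to be noisy.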

Along the way...
  While setting up the initial experiments, we started to ponder things that might go wrong
    Different genomes might have a lot of common subsequences that will conflate the clustering result
    SNPs and missing fragments might thwart compression
    Compression model might take too long to converge on a useful model (paucity of data)

What is the underlying principle being leveraged?

Information theory
  A linear sequence of symbols intended for communication exhibits a balance between randomness and regularity
    If a sequence is entirely random, it is noise
    If a sequence is entirely predictable, it is redundant
  Patterns provide means for recognition (interpretation) and irregularities provide for novelty (information)
  Compression attempts to minimize redundancy

Information theory
  Human languages exhibit non-uniform distributions over letters, phonemes, words, etc.

Brown Corpus word frequencies

DNA primary sequences
  Four nucleotide symbols: A, C, G, T
  Much of a genome codes nothing, and the rest is genes
  A gene is copied (transcription) off the genome, and the copy is used to build a protein (translation)
  Three consecutive nucleotides form a codon, which codes for a specific amino acid
  A sequence of amino acids (residues) constitutes a protein
  Proteins are where structure definitely exists

DNA primary sequences
  4^3 = 64 possible codons
  20 possible amino acids
  Many amino acids have more than one codon
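The codon arithmetic above can be checked by direct enumeration. The short sketch below lists every three-letter word over the four nucleotides and sets aside the three standard stop codons (TAA, TAG, TGA); the variable names are illustrative only.

```python
from itertools import product

NUCLEOTIDES = "ACGT"

# Every ordered triple of nucleotides is one codon.
codons = ["".join(triple) for triple in product(NUCLEOTIDES, repeat=3)]

# The three stop codons of the standard genetic code.
stop_codons = {"TAA", "TAG", "TGA"}
sense_codons = [c for c in codons if c not in stop_codons]

print(len(codons))        # 64, i.e. 4**3
print(len(sense_codons))  # 61
```

With 61 coding codons and only 20 amino acids, some amino acids must share several codons, which is the redundancy the slide points to.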

Genomic regularities
  Most genes start with ATG and end with a stop codon (TAG, TAA, and TGA most frequent)
  TATA-box in regulatory region (for binding)
  GC-rich regions (for stability)

But
  Frequency of individual nucleotides or residues is not so interesting (no syntax)
  Tertiary structure of proteins is The Thing: the interactions of amino acid residues are paramount

Genomic regularities
  Do genomes have sequential syntactic structures?

Codon frequencies in real DNA

4-gram frequencies in real DNA

5-gram frequencies in real DNA

6-gram frequencies in real DNA

6-gram probabilities in real DNA
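The frequency plots listed above rest on oligonucleotide histograms. A minimal sketch of how such a rank-frequency table might be computed is given below. It assumes overlapping windows with a step of one nucleotide; the slides do not say whether n-grams were counted that way or in a fixed reading frame. The function names ngram_counts and rank_frequency are hypothetical.

```python
from collections import Counter


def ngram_counts(seq: str, n: int) -> Counter:
    """Count every overlapping n-gram in seq (sliding window, step 1)."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def rank_frequency(counts: Counter):
    """Return (rank, n-gram, relative frequency), most frequent first.

    Ties are broken alphabetically so the ordering is deterministic.
    """
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, gram, count / total)
            for rank, (gram, count) in enumerate(ranked, start=1)]


if __name__ == "__main__":
    # First example read from the slides, used here only as toy input.
    read = "AGTCGATGCAGTCAGCATACGATCAGACTGCAGCT"
    for row in rank_frequency(ngram_counts(read, 3))[:5]:
        print(row)
```

Plotting frequency against rank on log-log axes is the usual way to look for a Zipf-like straight line in such a table.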

Problems from paucity of data
  Takes time for an LZ compression trie to become saturated with characteristic phrases
  Experimental data somewhat small, thus interesting sequences may not manifest quickly enough
  Prime the trie by prepending some random DNA to the data prior to computing CD
  How much? How about a million?

bigram frequency in random DNA

codon frequency in random DNA

10-gram frequency in random DNA

4-gram frequency in random DNA

5-gram frequency in random DNA

5-gram frequency in random DNA

7-gram frequency in random DNA

8-gram frequency in random DNA

9-gram frequency in random DNA
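The random-DNA experiment behind these plots can be approximated with a short simulation. The sketch below assumes that "random DNA" means independent, uniformly distributed nucleotides, which the slides do not state explicitly; random_dna, ngram_summary and the seed parameter are hypothetical names.

```python
import random
from collections import Counter


def random_dna(length: int, seed: int = 0) -> str:
    """Independent, uniformly distributed nucleotides (assumed model)."""
    rng = random.Random(seed)
    return "".join(rng.choices("ACGT", k=length))


def ngram_summary(seq: str, n: int):
    """Return (distinct n-grams seen, 4**n possible, min count, max count)."""
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    return len(counts), 4 ** n, min(counts.values()), max(counts.values())
```

For example, ngram_summary(random_dna(1_000_000), 10) summarizes the 10-grams of a million random nucleotides. Such a sequence has 999,991 overlapping 10-gram windows but 4^10 = 1,048,576 possible 10-grams, so some 10-grams cannot appear at all, whereas for short n-grams every possibility is expected to occur many times.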

Miller’s monkey
  19th century: Vilfredo Pareto showed that power-law distributions abound in social, scientific, economic and geophysical data
  1949: G.K. Zipf argued that power-law distributions are an interesting linguistic phenomenon
  1957: G.A. Miller argued that the effect is related to the random placement of spaces, and that a monkey at a typewriter would produce language with a Zipfian distribution
  1968: David Howes argued that Miller’s proof is flawed
  2004: Michael Mitzenmacher demonstrated the connection between power-law distributions and log-normal distributions

Conclusion
  Probably nothing!