Jigsaw Puzzlers' Delight: Sequencing DNA

Prof. Sara Billey and Prof. Sreeram Kannan

University of Washington

Happy Mathday! March 20, 2016

http://www.nist.gov/pml/div689/images/sh_17004592_dna_Benjamin_Albiach_Galan_LR.jpg

Who Are These People?

Each human genome is a three billion nucleotide long “book” written in an alphabet with only the four letters A, C, G, T.

http://davidlazarphoto.com/

•  Different people have slightly different genomes: on average, roughly 1 mutation in 1000 nucleotides.

•  The 1 in 1000 nucleotides difference accounts for height, high cholesterol susceptibility, and 1000s of genetic diseases.

CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGATCAGCTACCACATCGTAGCTACGATGCATTAGCAAGCTATCGATCGATCGATCGATTATCTACGATCGATCGATCGATCACTATACGAGCTACTACGTACGTACGATCGCGGGACTATTATCGACTACAGATAAAACATGCTAGTACAACAGTATACATAGCTGCGGGATACGATTAGCTAATAGCTGACGATATCCGAT

CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGATCAGCTACAACATCGTAGCTACGATGCATTAGCAAGCTATCGATCGATCGATCGATTATCTACGATCGATCGATCGATCACTATACGAGCTACTACGTACGTACGATCGCGTGACTATTATCGACTACAGATGAAACATGCTAGTACAACAGTATACATAGCTGCGGGATACGATTAGCTAATAGCTGACGATATCCGAT

Few Mutations Can Make a Big Difference…

©2013 by Compeau and Pevzner.
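To make the "1 in 1000" figure concrete, here is a minimal sketch that counts the positions where two equal-length sequences differ and estimates how many differences that rate implies for a genome of three billion letters. The function names and the short toy sequences are assumptions made for illustration; they are not taken from the slides.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def expected_differences(genome_length, one_in=1000):
    """Expected number of differing letters at a rate of 1 in `one_in`."""
    return genome_length / one_in


if __name__ == "__main__":
    # Hypothetical toy sequences, not the ones shown on the slide.
    print(hamming_distance("ACGTAC", "ACCTAA"))
    # Three billion letters at 1 difference per 1000 letters.
    print(expected_differences(3_000_000_000))
```

At that rate, two human genomes would be expected to differ in roughly three million positions.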

Genomes for Different Species

•  All human genomes are similar (99.9% agreement).

•  Human genomes and chimpanzee genomes are further apart (96% agreement).

•  Some genomes are 100 times larger than the human genome: Amoeba dubia, Paris japonica.


A Short Genome (5386 bases long)

Enterobacteria phage phiX174 sensu lato, complete genome from (http://www.ncbi.nlm.nih.gov/nuccore/9626372?report=fasta) NCBI Reference Sequence: NC_001422.1 GenBank Graphics >gi|9626372|ref|NC_001422.1| Enterobacteria phage phiX174 sensu lato, complete genome GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAAAAATTATCTT GATAAAGCAGGAATTACTACTGCTTGTTTACGAATTAAATCGAAGTGGACTGCTGGCGGAAAATGAGAAA ATTCGACCTATCCTTGCGCAGCTCGAGAAGCTCTTACTTTGCGACCTTTCGCCATCAACTAACGATTCTG TCAAAAACTGACGCGTTGGATGAGGAGAAGTGGCTTAATATGCTTGGCACGTTCGTCAAGGACTGGTTTA GATATGAGTCACATTTTGTTCATGGTAGAGATTCTCTTGTTGACATTTTAAAAGAGCGTGGATTACTATC TGAGTCCGATGCTGTTCAACCACTAATAGGTAAGAAATCATGAGTCAAGTTACTGAACAATCCGTACGTT TCCAGACCGCTTTGGCCTCTATTAAGCTCATTCAGGCTTCTGCCGTTTTGGATTTAACCGAAGATGATTT CGATTTTCTGACGAGTAACAAAGTTTGGATTGCTACTGACCGCTCTCGTGCTCGTCGCTGCGTTGAGGCT TGCGTTTATGGTACGCTGGACTTTGTGGGATACCCTCGCTTTCCTGCTCCTGTTGAGTTTATTGCTGCCG TCATTGCTTATTATGTTCATCCCGTCAACATTCAAACGGCCTGTCTCATCATGGAAGGCGCTGAATTTAC GGAAAACATTATTAATGGCGTCGAGCGTCCGGTTAAAGCCGCTGAATTGTTCGCGTTTACCTTGCGTGTA CGCGCAGGAAACACTGACGTTCTTACTGACGCAGAAGAAAACGTGCGTCAAAAATTACGTGCGGAAGGAG TGATGTAATGTCTAAAGGTAAAAAACGTTCTGGCGCTCGCCCTGGTCGTCCGCAGCCGTTGCGAGGTACT AAAGGCAAGCGTAAAGGCGCTCGTCTTTGGTATGTAGGTGGTCAACAATTTTAATTGCAGGGGCTTCGGC CCCTTACTTGAGGATAAATTATGTCTAATATTCAAACTGGCGCCGAGCGTATGCCGCATGACCTTTCCCA TCTTGGCTTCCTTGCTGGTCAGATTGGTCGTCTTATTACCATTTCAACTACTCCGGTTATCGCTGGCGAC TCCTTCGAGATGGACGCCGTTGGCGCTCTCCGTCTTTCTCCATTGCGTCGTGGCCTTGCTATTGACTCTA CTGTAGACATTTTTACTTTTTATGTCCCTCATCGTCACGTTTATGGTGAACAGTGGATTAAGTTCATGAA GGATGGTGTTAATGCCACTCCTCTCCCGACTGTTAACACTACTGGTTATATTGACCATGCCGCTTTTCTT GGCACGATTAACCCTGATACCAATAAAATCCCTAAGCATTTGTTTCAGGGTTATTTGAATATCTATAACA ACTATTTTAAAGCGCCGTGGATGCCTGACCGTACCGAGGCTAACCCTAATGAGCTTAATCAAGATGATGC TCGTTATGGTTTCCGTTGCTGCCATCTCAAAAACATTTGGACTGCTCCGCTTCCTCCTGAGACTGAGCTT TCTCGCCAAATGACGACTTCTACCACATCTATTGACATTATGGGTCTGCAAGCTGCTTATGCTAATTTGC ATACTGACCAAGAACGTGATTACTTCATGCAGCGTTACCATGATGTTATTTCTTCATTTGGAGGTAAAAC 
CTCTTATGACGCTGACAACCGTCCTTTACTTGTCATGCGCTCTAATCTCTGGGCATCTGGCTATGATGTT GATGGAACTGACCAAACGTCGTTAGGCCAGTTTTCTGGTCGTGTTCAACAGACCTATAAACATTCTGTGC CGCGTTTCTTTGTTCCTGAGCATGGCACTATGTTTACTCTTGCGCTTGTTCGTTTTCCGCCTACTGCGAC TAAAGAGATTCAGTACCTTAACGCTAAAGGTGCTTTGACTTATACCGATATTGCTGGCGACCCTGTTTTG TATGGCAACTTGCCGCCGCGTGAAATTTCTATGAAGGATGTTTTCCGTTCTGGTGATTCGTCTAAGAAGT TTAAGATTGCTGAGGGTCAGTGGTATCGTTATGCGCCTTCGTATGTTTCTCCTGCTTATCACCTTCTTGA AGGCTTCCCATTCATTCAGGAACCGCCTTCTGGTGATTTGCAAGAACGCGTACTTATTCGCCACCATGAT TATGACCAGTGTTTCCAGTCCGTTCAGTTGTTGCAGTGGAATAGTCAGGTTAAATTTAATGTGACCGTTT ATCGCAATCTGCCGACCACTCGCGATTCAATCATGACTTCGTGATAAAAGATTGAGTGTGAGGTTATAAC GCCGAAGCGGTAAAAATTTTAATTTTTGCCGCTGAGGGGTTGACCAAGCGAAGCGCGGTAGGTTTTCTGC TTAGGAGTTTAATCATGTTTCAGACTTTTATTTCTCGCCATAATTCAAACTTTTTTTCTGATAAGCTGGT TCTCACTTCTGTTACTCCAGCTTCTTCGGCACCTGTTTTACAGACACCTAAAGCTACATCGTCAACGTTA TATTTTGATAGTTTGACGGTTAATGCTGGTAATGGTGGTTTTCTTCATTGCATTCAGATGGATACATCTG TCAACGCCGCTAATCAGGTTGTTTCTGTTGGTGCTGATATTGCTTTTGATGCCGACCCTAAATTTTTTGC CTGTTTGGTTCGCTTTGAGTCTTCTTCGGTTCCGACTACCCTCCCGACTGCCTATGATGTTTATCCTTTG AATGGTCGCCATGATGGTGGTTATTATACCGTCAAGGACTGTGTGACTATTGACGTCCTTCCCCGTACGC CGGGCAATAACGTTTATGTTGGTTTCATGGTTTGGTCTAACTTTACCGCTACTAAATGCCGCGGATTGGT TTCGCTGAATCAGGTTATTAAAGAGATTATTTGTCTCCAGCCACTTAAGTGAGGTGATTTATGTTTGGTG CTATTGCTGGCGGTATTGCTTCTGCTCTTGCTGGTGGCGCCATGTCTAAATTGTTTGGAGGCGGTCAAAA AGCCGCCTCCGGTGGCATTCAAGGTGATGTGCTTGCTACCGATAACAATACTGTAGGCATGGGTGATGCT GGTATTAAATCTGCCATTCAAGGCTCTAATGTTCCTAACCCTGATGAGGCCGCCCCTAGTTTTGTTTCTG GTGCTATGGCTAAAGCTGGTAAAGGACTTCTTGAAGGTACGTTGCAGGCTGGCACTTCTGCCGTTTCTGA TAAGTTGCTTGATTTGGTTGGACTTGGTGGCAAGTCTGCCGCTGATAAAGGAAAGGATACTCGTGATTAT CTTGCTGCTGCATTTCCTGAGCTTAATGCTTGGGAGCGTGCTGGTGCTGATGCTTCCTCTGCTGGTATGG TTGACGCCGGATTTGAGAATCAAAAAGAGCTTACTAAAATGCAACTGGACAATCAGAAAGAGATTGCCGA GATGCAAAATGAGACTCAAAAAGAGATTGCTGGCATTCAGTCGGCGACTTCACGCCAGAATACGAAAGAC CAGGTATATGCACAAAATGAGATGCTTGCTTATCAACAGAAGGAGTCTACTGCTCGCGTTGCGTCTATTA TGGAAAACACCAATCTTTCCAAGCAACAGCAGGTTTCCGAGATTATGCGCCAAATGCTTACTCAAGCTCA 
AACGGCTGGTCAGTATTTTACCAATGACCAAATCAAAGAAATGACTCGCAAGGTTAGTGCTGAGGTTGAC TTAGTTCATCAGCAAACGCAGAATCAGCGGTATGGCTCTTCTCATATTGGCGCTACTGCAAAGGATATTT CTAATGTCGTCACTGATGCTGCTTCTGGTGTGGTTGATATTTTTCATGGTATTGATAAAGCTGTTGCCGA TACTTGGAACAATTTCTGGAAAGACGGTAAAGCTGATGGTATTGGCTCTAATTTGTCTAGGAAATAACCG TCAGGATTGACACCCTCCCAATTGTATGTTTTCATGCCTCCAAATCTTGGAGGCTTTTTTATGGTTCGTT CTTATTACCCTTCTGAATGTCACGCTGATTATTTTGACTTTGAGCGTATCGAGGCTCTTAAACCTGCTAT TGAGGCTTGTGGCATTTCTACTCTTTCTCAATCCCCAATGCTTGGCTTCCATAAGCAGATGGATAACCGC ATCAAGCTCTTGGAAGAGATTCTGTCTTTTCGTATGCAGGGCGTTGAGTTCGATAATGGTGATATGTATG TTGACGGCCATAAGGCTGCTTCTGACGTTCGTGATGAGTTTGTATCTGTTACTGAGAAGTTAATGGATGA ATTGGCACAATGCTACAATGTGCTCCCCCAACTTGATATTAATAACACTATAGACCACCGCCCCGAAGGG GACGAAAAATGGTTTTTAGAGAACGAGAAGACGGTTACGCAGTTTTGCCGCAAGCTGGCTGCTGAACGCC CTCTTAAGGATATTCGCGATGAGTATAATTACCCCAAAAAGAAAGGTATTAAGGATGAGTGTTCAAGATT GCTGGAGGCCTCCACTATGAAATCGCGTAGAGGCTTTGCTATTCAGCGTTTGATGAATGCAATGCGACAG GCTCATGCTGATGGTTGGTTTATCGTTTTTGACACTCTCACGTTGGCTGACGACCGATTAGAGGCGTTTT ATGATAATCCCAATGCTTTGCGTGACTATTTTCGTGATATTGGTCGTATGGTTCTTGCTGCCGAGGGTCG CAAGGCTAATGATTCACACGCCGACTGCTATCAGTATTTTTGTGTGCCTGAGTATGGTACAGCTAATGGC CGTCTTCATTTCCATGCGGTGCACTTTATGCGGACACTTCCTACAGGTAGCGTTGACCCTAATTTTGGTC GTCGGGTACGCAATCGCCGCCAGTTAAATAGCTTGCAAAATACGTGGCCTTATGGTTACAGTATGCCCAT CGCAGTTCGCTACACGCAGGACGCTTTTTCACGTTCTGGTTGGTTGTGGCCTGTTGATGCTAAAGGTGAG CCGCTTAAAGCTACCAGTTATATGGCTGTTGGTTTCTATGTGGCTAAATACGTTAACAAAAAGTCAGATA TGGACCTTGCTGCTAAAGGTCTAGGAGCTAAAGAATGGAACAACTCACTAAAAACCAAGCTGTCGCTACT TCCCAAGAAGCTGTTCAGAATCAGAATGAGCCGCAACTTCGGGATGAAAATGCTCACAATGACAAATCTG TCCACGGAGTGCTTAATCCAACTTACCAAGCTGGGTTACGACGCGACGCCGTTCAACCAGATATTGAAGC AGAACGCAAAAAGAGAGATGAGATTGAGGCTGGGAAAAGTTACTGTAGCCGACGTTTTGGCGGCGCAACC TGTGACGACAAATCTGCTCAAATTTATGCGCGCTTCGATAAAAATGATTGGCGTATCCAACCTGCA ***************************************************************

DNA = Deoxyribonucleic acid

•  DNA encodes genetic instructions used in the development, functioning and reproduction of all known living things.

•  DNA is a type of polymer (long chain of repeating molecules) first isolated by Friedrich Miescher in 1869.

•  Photo 51: X-ray diffraction image by Rosalind Franklin and Ray Gosling, 1952.

https://en.wikipedia.org/wiki/Photo_51

DNA = Deoxyribonucleic acid

Each nucleotide is composed of a nitrogen-containing nucleobase, one of cytosine (C), guanine (G), adenine (A), or thymine (T), along with a sugar called deoxyribose and a phosphate group.

T = Thymine = C5H6N2O2

http://www.councilforresponsiblegenetics.org/geneticprivacy/images/c16x6base-pairs.png

Scientific Goal of the Century

Goal: To discover how we can read off and classify the function and variations in each genome.

First Steps: Acquire exact DNA sequences from different living creatures. Compare.

Question: What is holding us back?

Answer: It’s not so easy to read DNA!

http://www.nist.gov/oles/forensics/images/DNA-Strand.jpg

Real Image of DNA from 2012

http://scitechdaily.com/first-electron-microscope-image-of-dna-double-helix/

What Makes Genome Sequencing Difficult?

•  Modern sequencing machines cannot read an entire genome one nucleotide at a time from beginning to end (like we read a book).

•  They can only shred the genome and generate short reads.

•  The genome assembly is not the same as a jigsaw puzzle: we must use overlapping reads to reconstruct the genome, a giant overlap puzzle!


Why Do We Sequence 1000s of Species?

–  Applications in medicine (genomes of pathogens), agriculture (oil palm genome), biotechnology (genomes of energy-producing cyanobacteria), etc.


Why Do We Sequence Personal Genomes?

•  2010: Nicholas Volker became the first human being to be saved by genome sequencing.

–  Doctors could not diagnose his condition; he went through dozens of surgeries.

–  Sequencing revealed a rare mutation in the XIAP gene linked to a defect in his immune system.

–  This led doctors to use immunotherapy, which saved the child.

Brief History of Genome Sequencing

•  1977: Walter Gilbert and Frederick Sanger develop independent DNA sequencing methods.

•  1980: They share the Nobel Prize in Chemistry.

•  Still, their sequencing methods were too expensive ($3 billion to sequence a human genome).

Walter Gilbert

Frederick Sanger


The Race to Sequence the Human Genome

•  1990: The public Human Genome Project, headed by Francis Collins, aims to sequence the human genome by 2005.

•  1997: Craig Venter founds Celera Genomics, a private firm, with the same goal.

•  2000:

Francis Collins

Craig Venter

Early 2000s: Many more mammalian genomes are sequenced using the same Sanger sequencing method, but it is clear that new technology is needed for further progress.

From Human to Mouse to Rat to …


Next Generation Sequencing Technologies

•  Early 2000s: The market for new sequencing machines takes off.

–  Illumina reduces the cost of sequencing a human genome from $3 billion to $10,000.

–  Complete Genomics builds a genomic factory in Silicon Valley that sequences hundreds of genomes per month.

–  Beijing Genome Institute orders hundreds of sequencing machines, becoming the world’s largest sequencing center.


10,000 Genomes and Beyond

•  2010: Scientists launch a project to sequence 10,000 vertebrate genomes.

•  Now: Human genome sequencing costs about $1000, and partial sequencing is $199 on 23andme.com!


What brought the price of genome sequencing down?

1)  Better technology: Shorter reads are cheaper to produce.

2)  Better mathematical algorithms: Solve the overlapping puzzle problem to reconstruct the original DNA sequence quickly and reliably from the random reads.

From Reads to Sequences

Unsolved Problem: Find the best possible way to reconstruct the original DNA sequence from the reads.

Example reads: AAGT TAGA GTAG GAAG

One Solution: AAGTAGAGTAGAAG (14 letters)

Better Solution: GAAGTAGA (8 letters, the shortest possible, though not the only one: GTAGAAGT also contains all four reads)
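A candidate solution can be checked mechanically: every read must appear somewhere in it as a run of consecutive letters. The sketch below does that for the two solutions on this slide. The helper name `contains_all` is an assumption for illustration.

```python
def contains_all(candidate, reads):
    """True if every read occurs in `candidate` as a contiguous substring."""
    return all(read in candidate for read in reads)


if __name__ == "__main__":
    reads = ["AAGT", "TAGA", "GTAG", "GAAG"]
    for candidate in ["AAGTAGAGTAGAAG", "GAAGTAGA"]:
        # Print the candidate, its length, and whether it explains every read.
        print(candidate, len(candidate), contains_all(candidate, reads))
```

Both candidates pass the check; they differ only in length, which is why the shorter one is preferred.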

From Snips to Sequences

Activity: Get into groups of 4-6 people. Get a packet of snips. Try to recreate the most likely DNA sequence from these snips.

Note: Each snip is a consecutive sequence of 20 letters from the original. The snips were sampled in a circular fashion, so some snips wrap around.

Find the secret message in positions 10, 20, 30, …, 90.

Hint: The original sequence has 100 letters. It ends with ATATGGA.
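For anyone who wants to build their own packet of snips, here is a minimal sketch of the sampling described in the activity: a snip of 20 letters is cut starting at some position, wrapping around to the beginning if it runs off the end. A second helper reads off the letters at positions 10, 20, …, 90, counting from 1. The function names and the toy sequence are assumptions for illustration.

```python
def circular_snip(sequence, start, length=20):
    """Return `length` letters beginning at 0-based index `start`,
    wrapping around to the front of the sequence when needed."""
    n = len(sequence)
    return "".join(sequence[(start + i) % n] for i in range(length))


def every_tenth_letter(sequence):
    """Letters at 1-based positions 10, 20, ..., 90 of the sequence."""
    return "".join(sequence[position - 1] for position in range(10, 91, 10))


if __name__ == "__main__":
    toy = "ABCDEFGHIJ"  # hypothetical stand-in for a DNA sequence
    # A snip of 4 letters starting at index 8 wraps around to the front.
    print(circular_snip(toy, 8, length=4))
```

Once a group has reassembled the 100 letters, `every_tenth_letter` can be applied to their answer to read the message.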

From Snips to Sequences

Solution Sequence:

C G C C T C G A G A A T T T G T C T A T
T C A T T A A C G T C A G T T T G C T A
C T C C G G A C C C G C C G T G A C A A
T C C G A C T A T C G T G C T G G C C A
C C G C A G T C T T T T G A T A T G G A

How to reconstruct?

T T G T C T A T T C A T T A A C G T C A

T A T T C A T T A A C G T C A G T T T G

C A T T A A C G T C A G T T T G C T A C

A C G T C A G T T T G C T A C T C C G G

C G T C A G T T T G C T A C T C C G G A

G C T A C T C C G G A C C C G C C G T G

..T T G T C T A T T C A T T A A C G T C A G T T T G C T A C T C C G G A C C C G C C G T G ..

Why does this work?

•  Each read is extended using the most overlapping read

•  Overlap is significant: >7 usually

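•  Puzzle: What is the chance that two k-length DNA sequences are the same?

•  How many possible k-length starting positions are there in a DNA sequence of length D?

An overlap of length k is unlikely to arise by chance when 4^{-k} < 1/D, that is, when k > log_4 D.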
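One way to make the puzzle concrete, under the simplifying assumption that each letter is chosen uniformly and independently from A, C, G, T: two k-length sequences agree with probability 4^{-k}, and a sequence of length D has about D starting positions, so chance matches become rare once D · 4^{-k} drops below 1. The sketch below finds the smallest such k. The function names are hypothetical.

```python
def chance_match_probability(k):
    """Probability that two random k-letter DNA strings are identical,
    assuming uniform, independent letters (a simplifying assumption)."""
    return 4.0 ** -k


def min_reliable_overlap(sequence_length):
    """Smallest k for which sequence_length * 4**-k is below 1."""
    k = 1
    while sequence_length * chance_match_probability(k) >= 1:
        k += 1
    return k


if __name__ == "__main__":
    for d in (100, 3_000_000_000):
        print(d, min_reliable_overlap(d))
```

For the 100-letter activity this threshold is small, which is consistent with overlaps longer than 7 being treated as significant.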

Greedy algorithm (TIGR Assembler, phrap, CAP3...)

Input: the set of snips

1.  Set the initial set of “contigs” as the snips

2.  Find two contigs with largest overlap and merge them into a new contig

3.  Repeat step 2 until only one contig remains
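The three steps above can be sketched in a few lines. This is a minimal illustration, not the actual code of TIGR Assembler, phrap or CAP3: it treats the sequence as linear (no wrap-around), assumes error-free snips, and assumes no snip lies entirely inside another. The names `overlap` and `greedy_assemble` are hypothetical.

```python
from itertools import permutations


def overlap(a, b):
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(snips):
    """Repeatedly merge the two contigs with the largest overlap."""
    # Step 1: the initial contigs are the snips (exact duplicates dropped).
    contigs = list(dict.fromkeys(snips))
    # Step 3: repeat until a single contig remains.
    while len(contigs) > 1:
        # Step 2: find the ordered pair of contigs with the largest overlap.
        best_k, best_a, best_b = -1, None, None
        for a, b in permutations(contigs, 2):
            k = overlap(a, b)
            if k > best_k:
                best_k, best_a, best_b = k, a, b
        contigs.remove(best_a)
        contigs.remove(best_b)
        # Merge: keep all of `best_a`, then the part of `best_b` past the overlap.
        contigs.append(best_a + best_b[best_k:])
    return contigs[0]


if __name__ == "__main__":
    snips = [
        "TTGTCTATTCATTAACGTCA",
        "TATTCATTAACGTCAGTTTG",
        "CATTAACGTCAGTTTGCTAC",
        "ACGTCAGTTTGCTACTCCGG",
        "CGTCAGTTTGCTACTCCGGA",
        "GCTACTCCGGACCCGCCGTG",
    ]
    print(greedy_assemble(snips))
```

The example uses the six snips from the "How to reconstruct?" slide. Checking every pair of contigs on every round is slow for large inputs; it is chosen here only because it mirrors the steps directly.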

Shortest Common Supersequence (SCS)

Input: A set of reads

Output: The shortest sequence containing all the reads as contiguous substrings (often called the shortest common superstring) ⇔ Shortest DNA sequence that can explain all reads

Example reads: AAGT TAGA GTAG GAAG

One Solution: AAGTAGAGTAGAAG

Better Solution: GAAGTAGA
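For a handful of reads, a shortest sequence can be found by brute force: try every ordering of the reads, glue each ordering together using the largest possible overlaps, and keep the shortest result. The sketch below assumes that no read is contained inside another and that reads must appear as consecutive letters. The function names are hypothetical, and the running time grows factorially, so this is only usable for tiny examples.

```python
from itertools import permutations


def overlap(a, b):
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def merge_in_order(reads):
    """Glue reads together in the given order, overlapping as much as possible."""
    result = reads[0]
    for read in reads[1:]:
        result += read[overlap(result, read):]
    return result


def shortest_superstring(reads):
    """Try every ordering of the reads and return a shortest merged result."""
    return min((merge_in_order(order) for order in permutations(reads)), key=len)


if __name__ == "__main__":
    reads = ["AAGT", "TAGA", "GTAG", "GAAG"]
    best = shortest_superstring(reads)
    print(best, len(best))
```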

Read Overlap Graph

[Figure: overlap graph with the reads AAGT, TAGA, GTAG and GAAG as nodes; edges are labeled with overlap lengths (3, 2, 2, 1, 1).]

Find a path through all the reads (nodes) with highest weight.
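A read overlap graph of this kind can be sketched as a dictionary that maps an ordered pair of reads to the length of their overlap. The weight of a path is the sum of the overlaps along it, and a path that visits every read with weight w spells a sequence whose length is the total read length minus w. The names `overlap_graph` and `path_weight` are assumptions for illustration.

```python
from itertools import permutations


def overlap(a, b):
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def overlap_graph(reads, min_overlap=1):
    """Map each ordered pair (a, b) to its overlap, keeping only real overlaps."""
    graph = {}
    for a, b in permutations(reads, 2):
        k = overlap(a, b)
        if k >= min_overlap:
            graph[(a, b)] = k
    return graph


def path_weight(path, graph):
    """Total overlap along consecutive reads in `path` (0 where no edge exists)."""
    return sum(graph.get(pair, 0) for pair in zip(path, path[1:]))


if __name__ == "__main__":
    reads = ["AAGT", "TAGA", "GTAG", "GAAG"]
    graph = overlap_graph(reads)
    for (a, b), weight in sorted(graph.items()):
        print(a, "->", b, weight)
    print(path_weight(["GAAG", "AAGT", "GTAG", "TAGA"], graph))
```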

Shortest Common Supersequence (SCS)

Question: How to find the SCS?

Answer: Equivalent to finding a Hamiltonian path in a read-overlap graph ⇔ hard problem

Alternate way: formulate an Eulerian cycle problem ⇔ easy to find
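The Eulerian formulation can be sketched with a de Bruijn graph, the construction described in the Compeau, Pevzner and Tesler article listed under Resources: each k-letter read becomes an edge from its first k-1 letters to its last k-1 letters, and a walk that uses every edge exactly once spells out a sequence. The sketch below is a minimal illustration under stated assumptions. It looks for an Eulerian path (an open walk) rather than a cycle because the toy sequence is linear, it assumes such a walk exists, and its input is the complete set of 4-letter pieces of GAAGTAGA, which includes AGTA in addition to the four example reads. All function names are hypothetical.

```python
from collections import defaultdict


def de_bruijn(kmers):
    """One edge per k-mer, from its (k-1)-letter prefix to its (k-1)-letter suffix."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Walk every edge exactly once (Hierholzer's algorithm).
    Assumes the graph actually has an Eulerian path or cycle."""
    in_degree = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            in_degree[target] += 1
    # Start where out-degree exceeds in-degree by one, if such a node exists.
    start = next(
        (node for node in graph if len(graph[node]) - in_degree[node] == 1),
        next(iter(graph)),
    )
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Turn a walk over (k-1)-letter nodes back into a sequence."""
    return path[0] + "".join(node[-1] for node in path[1:])


if __name__ == "__main__":
    kmers = ["GAAG", "AAGT", "AGTA", "GTAG", "TAGA"]
    print(spell(eulerian_path(de_bruijn(kmers))))
```

Unlike the search over all orderings of reads, this walk takes time proportional to the number of edges, which is the sense in which the Eulerian problem is easy.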

Conditions for Reconstruction

•  When can a DNA sequence be reconstructed correctly from reads?

•  Which jigsaw puzzles are easily reconstructible?

Jigsaw puzzles

[Figure: an easier jigsaw puzzle and a harder jigsaw puzzle]

How exactly do the fundamental limits depend on repeat statistics?

Unreconstructable DNA Sequences

Not reconstructable: Interleaved repeats

Two sequences that differ only in how the segments between their interleaved repeats are arranged are confusable if: Read length < Interleaved Repeat length
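A small experiment illustrates the problem. Take two repeats X and Y that appear interleaved (X … Y … X … Y) and build a second sequence by swapping the two segments that follow the copies of X. When the reads are shorter than the repeats, both sequences produce exactly the same collection of reads, so no algorithm can tell them apart. The sequences below are made up for illustration, the reads are taken from a linear sequence with no wrap-around, and the function names are hypothetical.

```python
from collections import Counter


def all_reads(sequence, read_length):
    """Multiset of every window of `read_length` consecutive letters."""
    return Counter(
        sequence[i:i + read_length]
        for i in range(len(sequence) - read_length + 1)
    )


def interleaved_pair(x, y, a="G", b="C", c="A", d="GG", e="C"):
    """Two sequences a X b Y c X d Y e and a X d Y c X b Y e (b and d swapped)."""
    first = a + x + b + y + c + x + d + y + e
    second = a + x + d + y + c + x + b + y + e
    return first, second


if __name__ == "__main__":
    s1, s2 = interleaved_pair("ACGT", "TTAG")  # repeats of length 4
    print(s1)
    print(s2)
    for read_length in (3, 6):
        same = all_reads(s1, read_length) == all_reads(s2, read_length)
        print(read_length, "same reads" if same else "different reads")
```

In this toy case reads of length 3 cannot separate the two sequences, while reads of length 6 are long enough to bridge a whole repeat and do separate them.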

Reprise

•  Algorithms for DNA reconstruction are derived from graph theory and probability theory.

•  Mathematical algorithms have led to faster and more accurate reconstruction.

•  Still many questions unanswered.

•  Come, join, contribute to the DNA revolution!

Resources and Acknowledgements

Many thanks to Glenn Tesler, Phillip Compeau and Pavel Pevzner, Alan, Marisa and Paul Viola for help on preparing some of these slides!

Thanks to all of you for listening and participating!

Resources:

“How to apply de Bruijn graphs to genome assembly” Phillip E C Compeau, Pavel A Pevzner, and Glenn Tesler. Nature Biotechnology 29, 987–991 (2011) http://www.nature.com/nbt/journal/v29/n11/full/nbt.2023.html

“Genome Sequencing” by Phillip Compeau and Pavel Pevzner https://www.coursera.org/course/assembly
