the science of information: from communication to dna ... · the science of information: from...
TRANSCRIPT
![Page 1: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08277f7e708231d4209bb1/html5/thumbnails/1.jpg)
The Science of Information:
From Communication to DNA Sequencing
David Tse
Dept. of EECS
U.C. Berkeley
Telecom Paris
August 31, 2012
Research supported by NSF Center for Science of Information.
Joint work with A. Motahari, G. Bresler, M. Bresler.
Acknowledgements: S. Batzolgou, L. Pachter, Y. Song
![Page 2: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08277f7e708231d4209bb1/html5/thumbnails/2.jpg)
Communication: the beginning
• Prehistoric: smoke signals, drums.
• 1837: telegraph
• 1876: telephone
• 1897: radio
• 1927: television
Communication design tied to the specific source and
specific physical medium.
![Page 3: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08277f7e708231d4209bb1/html5/thumbnails/3.jpg)
Grand Unification
channel capacity C bits/ sec
source entropy rate H bits/ source sym
Shannon 48
Theorem:
max. rate of reliable communicat ion
=C
Hsource sym / sec.
Model all sources and channels statistically.
A unified way of looking at all communication problems in terms of
information flow.
source
reconstructed
source
![Page 4: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08277f7e708231d4209bb1/html5/thumbnails/4.jpg)
60 Years Later
• All communication systems are designed based on
the principles of information theory.
• A benchmark for comparing different schemes and
different channels.
• Suggests totally new ways of communication.
![Page 5: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08277f7e708231d4209bb1/html5/thumbnails/5.jpg)
Secrets of Success
• Information , then computation.
It took 60 years, but we got there.
• Simple models, then complex.
The discrete memoryless channel
………… is like the Holy Roman Empire.
• Infinity, and then back.
Allow us to think in terms of typical behavior.
![Page 6: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08277f7e708231d4209bb1/html5/thumbnails/6.jpg)
Tom Cover, 1990
“Asymptotic limit is the first term in the Taylor series expansion
at infinity.
And theory is the first term in the Taylor series of practice.”
![Page 7: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08277f7e708231d4209bb1/html5/thumbnails/7.jpg)
Looking Forward
Can the success of this way of thinking be broadened
to other fields?
![Page 8: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08277f7e708231d4209bb1/html5/thumbnails/8.jpg)
Information Theory of DNA Sequencing
![Page 9: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08277f7e708231d4209bb1/html5/thumbnails/9.jpg)
DNA sequencing
DNA: the blueprint of life
Problem: to obtain the sequence of nucleotides.
…ACGTGACTGAGGACCGTG
CGACTGAGACTGACTGGGT
CTAGCTAGACTACGTTTTA
TATATATATACGTCGTCGT
ACTGATGACTAGATTACAG
ACTGATTTAGATACCTGAC
TGATTTTAAAAAAATATT…
courtesy: Batzoglou
![Page 10: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08277f7e708231d4209bb1/html5/thumbnails/10.jpg)
Impetus: Human Genome Project
1990: Start
2001: Draft
2003: Finished 3 billion nucleotides
courtesy: Batzoglou
3 billion $$$$
![Page 11: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08277f7e708231d4209bb1/html5/thumbnails/11.jpg)
Sequencing gets cheaper and faster
Cost of one human genome
• HGP: $ 3 billion
• 2004: $30,000,000
• 2008: $100,000
• 2010: $10,000
• 2011: $4,000
• 2012-13: $1,000
• ???: $300
Time to sequence one genome: years days
Massive parallelization.
courtesy: Batzoglou
![Page 12: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08277f7e708231d4209bb1/html5/thumbnails/12.jpg)
But many genomes to sequence
100 million species
(e.g. phylogeny)
7 billion individuals
(SNP, personal genomics)
1013 cells in a human
(e.g. somatic mutations
such as HIV, cancer) courtesy: Batzoglou
![Page 13: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08277f7e708231d4209bb1/html5/thumbnails/13.jpg)
Whole Genome Shotgun Sequencing
Reads are assembled to reconstruct the original DNA sequence.
![Page 14: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08277f7e708231d4209bb1/html5/thumbnails/14.jpg)
A Gigantic Jigsaw Puzzle
![Page 15: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08277f7e708231d4209bb1/html5/thumbnails/15.jpg)
Many Sequencing Technologies
• HGP era: single technology
(Sanger)
• Current: multiple “next generation”
technologies (eg. Illumina, SoLiD,
Pac Bio, Ion Torrent, etc.)
• Each technology has different
read lengths, noise profiles, etc
![Page 16: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08277f7e708231d4209bb1/html5/thumbnails/16.jpg)
Many assembly algorithms
Source:
Wikipedia
![Page 17: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08277f7e708231d4209bb1/html5/thumbnails/17.jpg)
And many more…….
A grand total of 42!
Why is there no single optimal algorithm?
![Page 18: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08277f7e708231d4209bb1/html5/thumbnails/18.jpg)
A basic question
• What is the minimum number of reads required for
reliable reconstruction?
• How much intrinsic information does each read
provide about the DNA sequence?
• A benchmark for comparing different algorithms and
different technologies.
• An open question!
![Page 19: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08277f7e708231d4209bb1/html5/thumbnails/19.jpg)
Coverage Analysis
• Pioneered by Lander-Waterman
• What is the minimum number of reads to ensure there is
no gap between the reads with a desired prob.?
• Only provides a lower bound on the minimum number of
reads to reconstruct.
• Clearly not tight.
![Page 20: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08277f7e708231d4209bb1/html5/thumbnails/20.jpg)
Communication and Sequencing:
An Analogy
Communication:
Sequencing:
Question: what is the max. sequencing rate such that
reliable reconstruction is asymptotically possible?
source
sequence
max. communicat ion rate R =Cchannel
Hsource
source sym / sec.
sequencing rate R =G
NDNA sym / read
![Page 21: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08277f7e708231d4209bb1/html5/thumbnails/21.jpg)
A Basic Model
• DNA sequence: i.i.d. with marginal distribution
p=(p1, p2, p3, p4).
• Starting positions of reads: i.i.d. uniform on the DNA
sequence.
• Read process: noiseless.
Will build on this to look at statistics from genomic data.
![Page 22: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08277f7e708231d4209bb1/html5/thumbnails/22.jpg)
The read channel
• Capacity depends on
– read length: L
– DNA length: G
• Normalized read length:
• Eg. L = 100, G = 3 £ 109 :
read
channel AGCTTATAGGTCCGCATTACC AGGTCC
¹L :=L
logG
¹L = 4:6
L
G
![Page 23: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08277f7e708231d4209bb1/html5/thumbnails/23.jpg)
Result: Sequencing Capacity
Renyi entropy of order 2
C = 0
C = ¹L
The higher the entropy,
the easier the problem!
![Page 24: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08277f7e708231d4209bb1/html5/thumbnails/24.jpg)
Complexity is in the eye of the beholder
Low entropy High entropy
harder to compress
easier jigsaw puzzle harder jigsaw puzzle
easier to compress
![Page 25: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08277f7e708231d4209bb1/html5/thumbnails/25.jpg)
Capacity Result Explained
C = 0
C = ¹L
coverage constraint
N > G=L logG
![Page 26: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08277f7e708231d4209bb1/html5/thumbnails/26.jpg)
A necessary condition for reconstruction
interleaved repeats of length
None of the copies is straddled by a read (unbridged).
Reconstruction is impossible!
Special cases:
Under i.i.d. model, greedy is optimal for mixed reads.
![Page 27: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08277f7e708231d4209bb1/html5/thumbnails/27.jpg)
A sufficient condition: greedy algorithm
Input: the set of N reads of length L
1. Set the initial set of contigs as the reads
2. Find two contigs with largest overlap and merge
them into a new contig
3. Repeat step 2 until only one contig remains
Algorithm progresses in stages:
at stage
contig
merge reads at overlap
overlap
![Page 28: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08277f7e708231d4209bb1/html5/thumbnails/28.jpg)
Greedy algorithm: stage
merge mistake => there must be a -repeat
and
each copy is not bridged by a read.
A sufficient condition for reconstruction:
There is no unbridged - repeat for any .
repeat
bridging read already merged
` `
contigs
![Page 29: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08277f7e708231d4209bb1/html5/thumbnails/29.jpg)
Summary
Necessary condition for reconstruction:
No unbridged interleaved -repeats for any .
Sufficient condition (via greedy algorithm)
No unbridged -repeats for any .
For the i.i.d. DNA model:
1) If there are no unbridged interleaved repeats, then
w.h.p. there are no unbridged repeats.
2) The probability is dominated when either
![Page 30: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08277f7e708231d4209bb1/html5/thumbnails/30.jpg)
Capacity Result Explained
C = 0
C = ¹L
interleaved (L-1)- repeats
L-1 L-1 L-1 L-1
coverage constraint
![Page 31: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08277f7e708231d4209bb1/html5/thumbnails/31.jpg)
Summary: Two Regimes
coverage-limited
regime repeat-limited
regime
Question:
Is this clean state of affairs tied to the i.i.d. DNA model?
![Page 32: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08277f7e708231d4209bb1/html5/thumbnails/32.jpg)
I.I.D. DNA vs real DNA
• Mammalian DNA has many long repeats.
• How will the greedy algorithm perform for general
DNA statistics?
• Will there be a clean decomposition into two
regimes?
![Page 33: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08277f7e708231d4209bb1/html5/thumbnails/33.jpg)
Greedy algorithm: general DNA statistics
• Reconstruction if there are no unbridged repeats.
• Performance depends on the DNA statistics through the
number of - repeats:
• Necessary condition translates similarly to an upper
bound on capacity:
`
? ?
![Page 34: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08277f7e708231d4209bb1/html5/thumbnails/34.jpg)
2 4 6 8 10 12
11.5
12
12.5
13
13.5
14
14.5
15
15.5
16
16.5
2 4 6 8 10 12 14
10
11
12
13
14
15
16
20 40 60 80 100 120 140 160
0
2
4
6
8
10
12
14
16
0 500 1000 1500 20000
5
10
15
`
Rgreedy (L)log(# of `-repeats)
i.i.d. data
0 500 1000 1500 2000 2500 3000 3500 4000 45000
50
100
150
200
250
0 500 1000 1500 2000 2500 3000 3500 4000 45000
50
100
150
200
250
upper bound
2200 2250 2300 2350 2400 24500
20
40
60
80
100
120
140
longest repeat
at
upper bound
I.I.D. DNA vs real DNA
GRCh37 Chr 22 (G = 35M)
longest interleaved repeats
at 2325
![Page 35: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08277f7e708231d4209bb1/html5/thumbnails/35.jpg)
0 1000 2000 3000 4000 5000 60000
50
100
150
200
250
300
350
Rgreedy (L)
0 1000 2000 3000 40000
5
10
15
longest interleaved repeats
at length 2248
upper bound
longest repeat
at
Chromosome 19
GRCh37 Chr 19 (G = 55M)
log(# of `-repeats)
There is another more sophisticated algorithm that would close the
gap and in fact near optimal on all 22 chromosomes.
![Page 36: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08277f7e708231d4209bb1/html5/thumbnails/36.jpg)
Ongoing work
• Noisy reads
• Reference-based assembly.
• Partial reconstruction.
……...
![Page 37: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August](https://reader034.vdocuments.mx/reader034/viewer/2022042319/5f08277f7e708231d4209bb1/html5/thumbnails/37.jpg)
Conclusion
• Information theory has made a huge impact on
communication.
• Its success stems from focusing on something
fundamental: information.
• This philosophy may be useful for other important
engineering problems.
• DNA sequencing is a good example.