Application of Communication Theory to Automatic DNA Sequencing
Stephen William Davies
A thesis submitted in conformity with the requirements for the degree of
Doctor of Philosophy, Graduate Department of Electrical and Computer Engineering,
University of Toronto
© Copyright Stephen William Davies 1999
National Library of Canada / Bibliothèque nationale du Canada
Acquisitions and Bibliographic Services / Acquisitions et services bibliographiques
395 Wellington Street, Ottawa ON K1A 0N4, Canada
The author has granted a non-exclusive licence allowing the National Library of Canada to reproduce, loan, distribute or sell copies of this thesis in microform, paper or electronic formats.
The author retains ownership of the copyright in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.
Application of Communication Theory to Automatic DNA Sequencing
Doctor of Philosophy 1999
Stephen William Davies
Electrical and Computer Engineering, University of Toronto
Abstract
DeoxyriboNucleic Acid (DNA) sequencing is one of the pillars of the current
biotechnology revolution. Current automatic DNA sequencing algorithms use heuristic
approaches based on automating the manual analysis done by molecular biologists.
In this thesis, a more formal and rigorous approach is followed wherein the first
statistical model of the sequencing data is built and then the optimal processor is
derived from the model.
The model characterizes peak shape and the local fluctuations in peak parameters
(peak time, amplitude and width). The characterization of peak parameters includes
their point probability density functions and their average dependence on their
neighbouring peaks (covariance). Jitter in peak time is found to be correlated over
several neighbouring peaks. A practical noise model is proposed consisting of a white
noise component and a noise component with spectrum similar to that of the signal itself.
The model can be used to generate simulations for the comparison and evaluation of
DNA sequencing algorithms.
Based on the model, an optimal DNA sequencing algorithm was derived using
the maximum likelihood approach of the analogous field of digital communications.
The uncertainty associated with parameters of the optimum processor is addressed by
maintaining multiple hypotheses for both the different possible information sequences
and the different possible parameter sequences.
The performance of the algorithm is examined with real data from both the accurate
6% cross-linked gels and the much faster 4% gels. Results with the 6% gel
data are comparable with those of a commercial algorithm, though simulations have
suggested the potential for a two to three-fold reduction in error rate. Results with
the 4% gel data exhibited an error rate that was four times lower than that of a
commercial sequencing algorithm. The DNA model and the DNA-ML algorithm do
offer benefits beyond a reduction in error rate. They may guide the refinement of
the entire sequencing process. Assigning probabilities to alternative sequences may
aid the clinician in forming a diagnosis. The overall benefits to healthcare include
the reduction of total test costs and reduction of the damage caused by acting on
erroneous information.
Acknowledgments
I would like to thank my supervisors. Dr. M. Eizenman provided excellent advice
and guidance in this work, and was tireless in ensuring its completion. Dr. S.
Pasupathy's gentle nudges opened the door to the communications literature. Both
provided the questions and direction that led to the timely completion of this thesis.
Werner Muller's generous aid made this work possible. I have greatly enjoyed
the time spent with him in the molecular biology laboratory of the Eye Research
Institute of Canada (ERIC). The support of Dr. K. Tsilfidis is greatly appreciated. I
was fortunate to have the opportunity to discuss the physics and chemistry of DNA
sequencing with two experts, Dr. G. Slater of the University of Ottawa and Dr. R.
Macgregor of the University of Toronto.
At the Institute of Biomedical Engineering, I have enjoyed the support of many
but can mention just a few. Prof. A. Dolan provided leadership when I needed it
most. Thas Yuwaraj's support of this work was unflagging and invaluable. Melina
Cartlidge was just so helpful.
I would like to acknowledge the financial support provided by the Natural Sciences
and Engineering Research Council of Canada and the Sumner Foundation.
Finally, I would like to thank my mom and dad for the support and wisdom that
has brought me this far and will hopefully carry me further.
Contents
Acknowledgments
List of Tables
List of Figures
List of Abbreviations and Symbols
1 Introduction
1.1 Motivation
1.2 A Perspective on DNA Sequencing
1.2.1 DNA's Function and Structure
1.2.2 Manual DNA Sequencing
1.2.3 Automatic DNA Sequencing
1.2.4 Errors
1.2.5 Other Sequencing Methods
1.2.6 The Human Genome Project
1.2.7 Clinical Role
1.3 Data Communications
1.3.1 Model
1.3.2 Receiver Technology
1.4 Analogy between DNA Sequencing and Data Communications
1.5 Research Approach
1.6 Dissertation Organization
2 Details of the Chemistry and Physics of DNA Sequencing
2.1 Sequencing Chemistry
2.1.1 Chemical Structures
2.1.2 DNA Amplification
2.1.3 Sequencing Reaction Molecules: Terminators and Polymerases
2.1.4 Fidelity and Peak Amplitude Variation
2.1.5 Degradation
2.2 Sequencing Physics
2.2.1 Sequencing Gel
2.2.2 Theories of Electrophoresis
2.2.3 Kuhn Length
2.2.4 Resolution
2.2.5 Other Concerns in Electrophoresis
2.2.6 Detection of Fluorescent Labels
2.3 Summary
3 A Statistical Model of the DNA Time-Series
3.1 Gross and Local Structure of DNA Time-Series
3.2 Signal Peak Shape
3.3 Local Covariance Model of Peak Parameters
3.3.1 Methods
3.3.2 Results
3.3.3 Discussion
3.4 Noise Process Model
3.5 Simulated Data from Model
3.6 Significance and Novelty
4 Maximum Likelihood Sequence Detection
4.1 The Maximum Likelihood Concept
4.2 Additive White Noise Finite Response
4.3 Noise Whitening
4.4 Nuisance Parameters
4.5 Cost Function Derivation
4.5.1 Conditional Likelihood
4.5.2 Nuisance Likelihood
4.5.3 Cost Function
4.6 Significance
5 Implementation
5.1 Hypothesis Reduction
5.1.1 Peak Estimation
5.1.2 Future Peak ISI Cancellation
5.1.3 Sequential Decoding
5.2 Unique Algorithm Considerations
5.2.1 Unequal Length Comparisons
5.2.2 Selection of Symbol Region, K_i
5.3 Modelling Limitations and Robustness
5.3.1 Pulse Width
5.3.2 Noise Whitening
5.4 Comparison with Typical Automatic Sequencer Techniques
5.4.1 ISI Suppression
5.4.2 Peak Detection
5.4.3 Search Window
5.4.4 Multi-Peak Tests
5.4.5 Special Rules
5.4.6 Promise of Approach
6 Performance with Real Data
6.1 Data Set 1 - Typical Case
6.1.1 Source and Model
6.1.2 Sensitivity to Parameters
6.1.3 Error Comparison
6.1.4 Error Analysis
6.2 Data Set 2 - High Speed Gel
6.2.1 Source / Rationale
6.2.2 Model and Adjustments
6.2.3 Error Comparison
6.2.4 Error Analysis
6.2.5 Assessment and Significance
7 Conclusions
7.1 Thesis Summary
7.2 Thesis Contributions
7.3 Suggestions for Future Research
Bibliography
A Large Scale Trend Removal
List of Tables
5.1 Performance as a function of pulse width mismatch for 300 bases of simulated data.
Performance (insertions/deletions/substitution errors) as a function of algorithm parameter settings for 300 bases of real data.
Errors observed for DNA-ML algorithm (β=0.85, fractional bandwidth=0.25) and Pharmacia ALF internal algorithm for 300 bases of real data.
Data Set 2 error type and location for DNA-ML with baseline parameter settings.
Data Set 2 error type and location for DNA-ML with modified parameter settings.
List of Figures
1.1 DNA sequencing.
1.2 Sample DNA time series.
1.3 Typical automatic sequencing algorithm block diagram.
1.4 Error rate as a function of distance along the DNA molecule.
1.5 Data communications system block diagram.
1.6 Communication signals.
2.1 Structure of deoxyadenosine 5'-monophosphate (dAMP) (after [13]).
2.2 A single-stranded DNA (ssDNA) molecule (after [13]); full structure shown for phosphate and ribose groups but bases are represented by one of A, C, G, or T.
2.3 Double stranded DNA (dsDNA) detailed structure [11]. Phosphate-deoxyribose backbones are on extreme left and right, corresponding to respective strands. Bases run from top to bottom along the center of the diagram. Hydrogen bonding is seen along center as dashed line emanating from a hydrogen (H) that also has a solid line indicating a covalent bond to the other strand.
2.4 Bulging of copy with insertion of a T.
2.5 Hairpin loop due to complementary GC runs.
2.6 Cleavage pathway for depurination (Guanine base) [20]. Additional symbols are: R for deoxyribose, G for guanine, T for thymine and P for phosphate.
3.1 Sample entire time series for "T" channel. Mean inter-base separation is 14.7 samples.
3.2 Selected compensated time series for same sequencing session as Figure 3.1. Individual channel data has been offset in this figure for clarity. Top curve is for A channel with C, G and T channels presented in order from top.
3.3 High resolution view of a segment of the compensated time series (actually Figure 1.2 repeated for reader's convenience).
3.4 Micro-satellite repeat data trace. Major peaks in time order are: primer peak, proximal DNA standard peak, sample peak, distal DNA standard peak.
3.5 Proximal DNA standard peak (solid line) and distal DNA standard peak (dash-dot). Warped peak (dotted line) was created by scaling the time coordinates by 0.7286.
3.6 Approximation of proximal peak of Figure 3.4 (dotted line) by leading exponential (samples 1-35, dashed line), Gaussian (samples 36-70, solid line), and decaying exponential (samples 71-200, dashed line). Inset is the logarithm of the same data.
3.7 Three isolated peaks from DNA sequencing data.
3.8 Peak time jitter for "G" labelled product applied to six contiguous lanes of the gel (total of 79 "G" peaks present over the range of 350 bases in original sequence). Six overlapping curves are plotted corresponding to the six gel lanes.
3.9 Peak time jitter.
3.10 Local peak amplitude estimates.
3.11 Pulse width estimates.
Covariance of peak time jitter. Monotonically increasing region just to the left of and including lag zero and monotonically decreasing region to its right is referred to as the mainlobe. Inset is a logarithmic plot of the right side of the mainlobe.
Covariance of difference between successive peak time jitter values.
Peak amplitude covariance.
Pulse width covariance.
Block diagram of peak parameter system model.
Histogram of scaled peak time jitter. To ensure comparability of samples, data was divided (scaled) by jitter standard deviation linear trend prior to forming histogram.
Histogram of scaled difference between adjacent peak time jitter values. To ensure comparability of samples, data was divided (scaled) by jitter standard deviation linear trend prior to forming histogram.
Theoretical covariance of peak time jitter for system of Figure 3.16. Inset is a logarithmic plot of the right side of the mainlobe.
Theoretical covariance of difference between successive peak time jitter values for system of Figure 3.16.
Noise spectrum estimate for "A" channel bases 297-319.
Simulated compensated time series for comparison with real data of Figure 3.2. Individual channel data has been offset in this figure for clarity. Top curve is for A channel with C, G and T channels presented in order from top.
High resolution view of a segment of the simulated compensated time series (compare with Figure 3.3).
Maximum likelihood processor block diagram.
Peak estimator.
Spectral estimate for a short section of "noise whitened" data lacking signal peaks.
Raw time series for Data Set 2. Individual channel data has been offset in this figure for clarity.
Compensated time series for Data Set 2 corresponding to first 4000 samples from Figure 6.1. Individual channel data has been offset in this figure for clarity. Top curve is for A channel with C, G and T channels presented in order from top.
Covariance of peak time jitter for Data Set 2. Inset is a logarithmic plot of the right side of the mainlobe.
Covariance of difference between successive peak time jitter values for Data Set 2.
Selected section of data after application of whitening filter with coloured noise variance 0.0016 and white noise variance 0.000009, all in units of peak mean squared.
Selected section of data after application of whitening filter with coloured noise variance 0.0016 and white noise variance 0.000001, all in units of peak mean squared.
Compensated time series for Data Set 2 corresponding to bases 110-140. Individual channel data has been offset in this figure for clarity. Top curve is for A channel with C, G and T channels presented in order from top. DNA-ML algorithm estimates of peak amplitudes and times are indicated by "*". X-axis is time in samples. True and estimated sequences are indicated at top and bottom, respectively, coded as A=1, C=2, G=3 and T=4.
Compensated time series for Data Set 2 corresponding to first 30 bases. Individual channel data has been offset in this figure for clarity. Top curve is for A channel with C, G and T channels presented in order from top. DNA-ML algorithm estimates of peak amplitudes and times are indicated by "*". X-axis is time in samples. True and estimated sequences are indicated at top and bottom, respectively, coded as A=1, C=2, G=3 and T=4.
6.9 Waveforms associated with 4 "G" run from base 49 to 52 selected to illustrate estimation of third "G". Raw waveform is compensated but not whitened. Dashed curve is formed from whitened data by subtracting estimated contribution from previous two bases and predicted contribution from next base, and then applying matched filter.
6.10 Compensated time series for Data Set 2 corresponding to bases 155 to 185. Individual channel data has been offset in this figure for clarity. Top curve is for A channel with C, G and T channels presented in order from top. DNA-ML algorithm estimates of peak amplitudes and times are indicated by "*". X-axis is time in samples. True and estimated sequences are indicated at top and bottom, respectively, coded as A=1, C=2, G=3 and T=4.
A.1 Inter channel peak time variation - plot of ratio of "T" channel peak times to those of "A" channel for data used in Chapter 3.
List of Abbreviations & Symbols
ABBREVIATIONS
A        adenine
AWGN     additive white Gaussian noise
bis      N,N'-methylene-bis-acrylamide
C        cytosine
dAMP     deoxyadenosine 5'-monophosphate
dATP     deoxyadenosine 5'-triphosphate
dNMP     deoxynucleotide monophosphate
dNTP     deoxynucleotide triphosphate
ddNTP    dideoxynucleotide triphosphate
DFE      decision feedback equalizer
DNA      deoxyribonucleic acid
DNA-ML   DNA maximum likelihood (algorithm)
dsDNA    double stranded DNA
FIR      finite impulse response
G        guanine
ISI      inter-symbol interference
ML       maximum likelihood
MLSD     maximum likelihood sequence detection
MSR      micro-satellite repeat
PCR      polymerase chain reaction
pdf      probability density function
RNA      ribonucleic acid
ssDNA    single stranded DNA
SNR      signal to noise ratio
T        thymine
TBE      8.9mM tris-borate and 0.2mM ethylenediaminetetra-acetic acid
SELECTED SYMBOLS
jitter auto-regressive weighting observation vector whitened observation vector information symbol sequence amplitude noise sample index peak time for peak i pulse width generic pulse shape pulse shape peaking at t_i evaluated at k Kronecker delta function summation indicates estimate mean inter-symbol separation jitter jitter state variable jitter input disturbance jitter measurement noise variance expectation difference between adjacent jitter values non-stationary noise spectrum Fourier transformation conditional pdf nuisance parameter noise whitening filter cost (negative log likelihood) convolution i-th subset of samples best estimate of # at i given data up to and including j covariance matrix Kalman gain cost to point z
comparison (hypothesis test)
CHAPTER 1
Introduction
DeoxyriboNucleic Acid (DNA) carries the genetic information that codes for life.
The extraction of this information, a process known as DNA sequencing, is one of
the pillars of the current biotechnology revolution. This chapter provides background
information on DNA and introduces the field of data communications as a potential
aid in DNA sequencing. It then presents the research approach used to investigate the
applicability of communication theory to automatic DNA sequencing and concludes
with an overview of the thesis.
1.1 Motivation
This work is motivated by the potential for a timely impact on a significant industry
with broad healthcare implications. Timeliness is implied as an explosive growth
in the use of DNA sequencing as a clinical tool is imminent, built on the scientific
foundation provided by the human genome project [1]. DNA sequencers will become
as ubiquitous as X-ray machines and the industry will grow dramatically as the full
clinical potential is realized. If applying concepts from communication theory can
increase the reliability of DNA testing then the benefits include reduction of total
test costs and reduction of the damage caused by acting on erroneous information.
1.2 A Perspective on DNA Sequencing
1.2.1 DNA's Function and Structure
In what is often referred to as the central maxim of biology, DNA is transcribed to
RiboNucleic Acid (RNA) which is then translated to protein [13]. DNA is normally
a very large molecule that serves as the permanent store of the genetic information.
Only a small portion of this information is required to define a particular protein
so only this small portion is copied to RNA. The protein itself may perform some
particular cell function such as that of an enzyme for a metabolic reaction.
Each DNA molecule is a sequence of bases where genetic information is encoded
according to the type of base (Adenine (A), Guanine (G), Cytosine (C), or Thymine
(T)) at each point in the sequence. Each consecutive group of three bases (triplet)
either codes for an amino acid to be incorporated in the protein or else it codes for
simple control information.
DNA's large scale structure makes it well suited for the storage of information [13].
It is normally in the form of a double helix composed of two intertwined strands of
DNA with one strand bearing the genetic information (sense strand) and the other its
complement. The complement is defined by the following rules: (1) where the sense
strand has an A, the complement must have a T; (2) where the sense strand has a T,
the complement must have an A; (3) where the sense strand has a C, the complement
must have a G; and (4) where the sense strand has a G, the complement must have
a C. The double helix is maintained by hydrogen bonds between the sense strand and
the complement; there are two hydrogen bonds for each A-T pair and three for each
G-C pair. This stable structure is resistant to damage by outside forces. However, if a
nick should occur in one strand, cellular machinery will repair it using the information
available in the other strand.
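The four complement rules and the hydrogen bond counts above can be written down directly. The following is a minimal illustrative sketch, not part of the thesis software; the function names are hypothetical, and the complement is formed position by position without modelling strand direction.

```python
# Base-pairing rules from the text: A pairs with T, and C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

# Hydrogen bonds contributed by the pair that each sense-strand base forms:
# two for an A-T pair, three for a G-C pair.
HYDROGEN_BONDS = {"A": 2, "T": 2, "C": 3, "G": 3}


def complement(sense: str) -> str:
    """Return the complement strand, base by base (direction is ignored)."""
    return "".join(COMPLEMENT[base] for base in sense)


def hydrogen_bonds(sense: str) -> int:
    """Count the hydrogen bonds holding the sense strand to its complement."""
    return sum(HYDROGEN_BONDS[base] for base in sense)


print(complement("ACGTAT"))      # TGCATA
print(hydrogen_bonds("ACGTAT"))  # 14
```

Because each rule is its own inverse, applying `complement` twice returns the original strand, which is the property the cell relies on when it repairs a nick from the intact strand.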
To transcribe from DNA to RNA or to make a copy of the DNA, a portion of
the double-stranded DNA (dsDNA) must be separated into single-stranded DNA
(ssDNA). Then the enzymes and other molecules involved in making the copy can access
the information. If an entire dsDNA molecule is separated into two complementary
ssDNA molecules then it is said to be denatured.
1.2.2 Manual DNA Sequencing
In Sanger [2] DNA sequencing, molecular biologists employ a polymerase enzyme
to prepare a set of partial copies of the original ssDNA molecule, all starting from
the same location as determined by a primer molecule (Figure 1.1(a)). In addition
to the A, C, G, and T substrate needed to make the copies, an additional terminator
molecule is included which competes with one of the substrate bases for inclusion in
the copy. Once a terminator is incorporated, copying is stopped for that particular
molecular copy and thus its length is fixed. In the example of Figure 1.1(a), the
terminator corresponding to adenine (A) is used. It competes with A at two points in
the figure. Based on chance, some copies will incorporate the A at these points and
continue growing. Others will incorporate the terminator and refrain from further
extension. The final result thus contains molecules of lengths corresponding to the
positions of adenine in the original DNA sample.
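The chance competition between substrate and terminator can be illustrated with a small simulation. This is a sketch under stated assumptions rather than a model taken from the thesis: the sequence string stands for the copy being synthesised, the termination probability `p_terminate` and the function names are hypothetical, and copies that never incorporate a terminator are simply ignored.

```python
import random


def terminated_lengths(copy_sequence: str, base: str) -> list[int]:
    """Fragment lengths at which a terminator for `base` can stop the copy."""
    return [i + 1 for i, b in enumerate(copy_sequence) if b == base]


def simulate_reaction(copy_sequence: str, base: str, p_terminate: float = 0.5,
                      n_copies: int = 1000, seed: int = 0) -> dict[int, int]:
    """Grow n_copies copies; at each `base` position a terminator is
    incorporated with probability p_terminate (an assumed value).
    Returns a mapping from fragment length to number of terminated copies."""
    rng = random.Random(seed)
    counts: dict[int, int] = {}
    for _ in range(n_copies):
        for i, b in enumerate(copy_sequence):
            if b == base and rng.random() < p_terminate:
                counts[i + 1] = counts.get(i + 1, 0) + 1
                break  # copying stops once a terminator is incorporated
    return counts


print(terminated_lengths("ACGTAT", "A"))  # [1, 5]
print(sorted(simulate_reaction("ACGTAT", "A")))
```

In this sketch the shorter fragment is produced more often than the longer one, since a copy must escape termination at every earlier position to reach a later one.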
As indicated in Figure 1.1(b), four sets of reactions are carried out, each
corresponding to a different terminator/base type. The products are labelled with either
fluorescent or radioactive markers. Thus, for each base in the original sequence, the
sequence position has been encoded as molecular size and base type by which marker
is present.
Electrophoresis is then used to separate these charged DNA molecules. The samples
are placed at the top end of a gel and a voltage is applied from that end to the
other. Small DNA molecules will move quickly down the gel but larger ones encounter
more resistance and thus move more slowly. Thus, the molecules become separated
by size. At some point in time, the voltage may be removed and the gel image may
be recorded as in Figure 1.1(c). This two dimensional plot may be read from bottom
to top with horizontal position indicating base type. Here, the bottom-most band is
in the lane corresponding to the A marked sample holder (hereafter referred to as the
A lane), indicating that the first base is an A. The second lowest band is in the lane
corresponding to the C marked sample holder (hereafter referred to as the C lane),
indicating that the second base is a C, and so on. Details on manual DNA sequencing
may be found in [33].
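Reading the gel from bottom to top amounts to sorting the bands by fragment size and noting the lane of each. A minimal sketch, assuming an idealised gel in which every band is resolved and each band is described by a hypothetical (lane, fragment length) pair:

```python
def read_gel(bands: list[tuple[str, int]]) -> str:
    """Read an idealised gel from bottom to top.

    Each band is (lane, fragment_length). The smallest fragments travel
    furthest, so ordering by length gives the bottom-to-top reading order.
    """
    return "".join(lane for lane, _length in sorted(bands, key=lambda b: b[1]))


# Bands for the example of Figure 1.1: lane label and fragment length.
bands = [("A", 1), ("A", 5), ("C", 2), ("G", 3), ("T", 4), ("T", 6)]
print(read_gel(bands))  # ACGTAT
```

Real gels depart from this idealisation through noise and overlapping bands, which is what makes the automatic reading problem difficult.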
[Figure 1.1 panels: (a) Form labelled subset of different length fragments, here corresponding to the adenine (A) base locations. (b) Result of repeating reactions of (a) with terminators for each of the four base types. (c) Conduct electrophoresis to separate by size of molecule. Only labelled fragments will be seen. Result reads from bottom to top ACGTAT.]
Figure 1.1: DNA sequencing.
1.2.3 Automatic DNA Sequencing
Automatic sequencing algorithms have been developed to recover the DNA sequence
either from images as displayed in Figure 1.1(c) or from marker detectors
mounted at a fixed location on the gel column. In the latter case, the input to the
algorithm is a time-series (Figure 1.2) where the order of detection is in order of
increasing molecular mass as the speedy small molecules reach the detector first. In
both cases, the algorithm faces a difficult task due to noise and poor resolution of
overlapping bands/peaks. Automatic sequencing algorithms are available from both
academic [38] [39] [41] and commercial sources (Applied Biosystems, Du Pont, Molecular
Dynamics, Pharmacia, Scanalytics [36]). Such algorithms typically feature band
sharpening filters, simple normalization, thresholding and data clock recovery. Neural
networks have also been used [41]. Figure 1.3, a generic abstraction based largely on
[38], is illustrative of typical sequencing algorithms.
Figure 1.2: Sample DNA time series.
1.2.4 Errors
Three classes of errors occur in DNA sequencing: substitutions, insertions and
deletions. The first, substitution, corresponds to simply mistaking one base for another,
as when noise has caused a strong peak in the wrong lane. Alternatively, such
a noise peak may lead to a base being called in the noise lane and the true base peak
being called as well, particularly if it occurred slightly later or earlier than the false
peak. Thus one more base would be called than was actually in the data; this is
referred to as an insertion. A deletion may occur when two bands overlap to such an
extent that there is only one peak and thus only one of the two bases is called. There
are several other scenarios that can lead to insertions and/or deletions. In this thesis,
the generic term "error" refers to all error classes.
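One common way to count the three error classes is to align the called sequence against the true sequence at minimum edit distance and tally the operations. The sketch below uses a standard dynamic-programming alignment as an illustration; it is an assumption that this matches how errors are scored later in the thesis, and the function name is hypothetical. When several alignments share the minimum cost, the tally reflects just one of them.

```python
def count_errors(true_seq: str, called_seq: str) -> dict[str, int]:
    """Tally substitutions, insertions and deletions along one
    minimum-cost alignment of called_seq against true_seq."""
    n, m = len(true_seq), len(called_seq)
    # cost[i][j]: fewest errors aligning true_seq[:i] with called_seq[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i  # all true bases deleted
    for j in range(1, m + 1):
        cost[0][j] = j  # all called bases inserted
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = int(true_seq[i - 1] != called_seq[j - 1])
            cost[i][j] = min(cost[i - 1][j - 1] + mismatch,  # match/substitution
                             cost[i - 1][j] + 1,             # deletion
                             cost[i][j - 1] + 1)             # insertion
    counts = {"substitutions": 0, "insertions": 0, "deletions": 0}
    i, j = n, m
    while i > 0 or j > 0:  # trace one optimal path back to the origin
        if i > 0 and j > 0:
            mismatch = int(true_seq[i - 1] != called_seq[j - 1])
            if cost[i][j] == cost[i - 1][j - 1] + mismatch:
                counts["substitutions"] += mismatch
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            counts["deletions"] += 1
            i -= 1
        else:
            counts["insertions"] += 1
            j -= 1
    return counts


print(count_errors("ACGTAT", "ACCTAT"))   # one substitution
print(count_errors("ACGTAT", "ACGGTAT"))  # one insertion
print(count_errors("ACGTAT", "ACTAT"))    # one deletion
```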
Figure 1.3: Typical automatic sequencing algorithm block diagram.
Figure 1.4 is a schematic illustration of the dependence of error rate on base
location for both manual and automatic DNA sequencing. There is a constant error
rate zone that stretches over the first 300-500 bases. Error rate in this zone is of
the order of 1-5% [4] with automatic sequencers generally performing more poorly than
human readers. Note that these are typical results and certain DNA sequences can
lead to considerably poorer results. Beyond the constant error rate zone lies the rising
error rate zone where, for example, error rate can increase by as much as 7% in 100
bases [5]. This rising error rate may be attributed to several factors among which is
the difficulty in resolving between large molecules where a single base difference in
length equates to a small fractional change in overall size. To combat this problem,
practitioners have adopted the strategy of breaking large DNA molecules into smaller
ones, sequencing the smaller molecules and then combining the results [6]. Note that
useful information is available even in the rising error rate zone as data from this
region is used to aid the combining of sequence fragments [7].
1.2.5 Other Sequencing Methods
Three other sequencing methods deserve attention at this point: (1) single lane
/ multiple fluorescent markers, (2) Maxam-Gilbert sequencing, and (3) biochip se-
quencing by hybridization.
Figure 1.4: Error rate as a function of distance along the DNA molecule.
Single Lane / Multiple Fluorescent Markers
This technology, protected under several patents by Applied Biosystems Inc.,
forms the basis for the most popular automatic sequencing machines. As in the
techniques described earlier in this document, four different sequencing reactions are
carried out. However, the four reactions use different fluorescent markers so that their
products feature a spectral peak at different points in the spectrum. The products
may then be mixed into a single solution and loaded into the same lane of a gel (if
the label is part of the terminator then all the reactions could have taken place in the
same test tube). Near the bottom of the gel, a laser is used to excite the fluorescent
bands as they pass the detector region. For each lane, there are four detectors, one
for each of the four different fluorescent peaks. As before, the order of the bands
indicates position in the sequence. Base type, however, is indicated by peak colour.
A key advantage is high throughput as four times the number of sequences can be
done on the same gel (i.e., one sequence per lane vs one sequence every four lanes).
Also, alignment problems are minimized as all DNA from the same sequence goes
down the same lane. This eliminates the effect of lane to lane gel inhomogeneities.
Possible problems include interference between base types as fluorescent spectra overlap.
This thesis will not include an examination of this type of data. However, certain
aspects of the work in this thesis will apply directly as they are derived for the same
physical processes. For example, in both the multi-lane and single lane data, the
dynamics of DNA molecules in a gel will be common. Differences will occur as the
multi-lane data uses the same marker for all bases while the single lane data will use
markers of different size and mass for each base type. As will be developed later in
the thesis, both will feature noise associated with the hydrolysis of DNA. However,
the multi-lane data will not suffer from the inter-base interference due to overlapping
fluorescent spectra. Thus, a judicious reader may draw conclusions from this work
which will be relevant to single lane data. At the same time, however, this reader
will no doubt identify areas where additional work must be done in order to properly
treat the single lane application.
Maxam-Gilbert Sequencing
Developed at the same time as Sanger sequencing, the Maxam-Gilbert method is
based on degradation of DNA rather than synthesis [42]. The process begins with the
labelling of one end of the ssDNA molecules [33]. This is then loaded into four test
tubes. One tube then is used in a reaction that breaks the DNA wherever there is a
G in the sequence. Another tube is used in a reaction that breaks the DNA wherever
there is a C. Two other similar reactions are run, one that breaks the DNA wherever
there is a G or an A, and one that breaks the DNA wherever there is a C or a T. Thus,
A locations must be decoded by comparing the G data and the A+G data. T locations
must be similarly decoded. Also, the timing of the reactions is important so that the
product is dominated by molecules produced by only single breaks per original DNA
molecule; otherwise, the first few bases will dominate the results. Maxam-Gilbert
sequencing is useful for sequences of less than 250 bases [33] and can perform better
than Sanger sequencing for sequences with long runs of identical bases. However, the
vast majority of sequencing today is performed using the Sanger method. Results
presented in this thesis should apply for the Maxam-Gilbert method other than for
considerations related to decoding the A+G and C+T lanes.
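The decoding of the A+G and C+T lanes described above can be sketched in Python as a set difference: A positions are those present in the A+G lane but absent from the G lane, and T positions are found likewise from the C+T and C lanes. This is a minimal sketch under idealized assumptions (noise-free bands already reduced to integer positions); the helper names `ideal_lanes` and `decode_lanes` are hypothetical and are not taken from this thesis.

```python
def ideal_lanes(sequence):
    """Idealized, noise-free band positions for the four Maxam-Gilbert reactions."""
    lanes = {"G": set(), "A+G": set(), "C": set(), "C+T": set()}
    for position, base in enumerate(sequence):
        if base in "GA":
            lanes["A+G"].add(position)   # the G-or-A reaction cleaves here
        if base == "G":
            lanes["G"].add(position)     # the G-only reaction cleaves here
        if base in "CT":
            lanes["C+T"].add(position)   # the C-or-T reaction cleaves here
        if base == "C":
            lanes["C"].add(position)     # the C-only reaction cleaves here
    return lanes


def decode_lanes(lanes):
    """Recover the base sequence by comparing the single-base and two-base lanes."""
    a_positions = lanes["A+G"] - lanes["G"]   # in A+G but not in G, so the base is A
    t_positions = lanes["C+T"] - lanes["C"]   # in C+T but not in C, so the base is T
    calls = {}
    for base, positions in (("G", lanes["G"]), ("A", a_positions),
                            ("C", lanes["C"]), ("T", t_positions)):
        for position in positions:
            calls[position] = base
    return "".join(calls[position] for position in sorted(calls))
```

With real data the comparison would have to tolerate noise and band misalignment between lanes, which this sketch ignores.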
Sequencing by Hybridization
Sequencing By Hybridization (SBH) uses an array of complementary DNA
sequences where all possible N base sequences are represented in the array. The
labelled DNA to be sequenced will hybridize to its complement and the array will then
fluoresce at the corresponding location. (Hybridization is the process where two
complementary DNA sequences form a double helix through hydrogen bonding.) Other
means of detecting the hybridization location are possible. SBH is fast and convenient.
However, it is limited to extremely short sequences as even sequencing an eight base
sequence implies an array with 4^8 = 65536 elements. Sequences of a hundred bases
would require a currently infeasible array size. Complementary DNA hybridization
arrays are more promising for detecting mutations or the presence of a few previously
known sequences rather than for SBH. SBH will not be further addressed in this
thesis.
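The array-size argument above is simple arithmetic: with four possible bases at each of N positions, the array needs 4^N elements. A short Python sketch (the function name `sbh_array_size` is hypothetical) makes the growth explicit; for N = 100 the count exceeds 10^60.

```python
def sbh_array_size(n_bases):
    """Number of array elements needed to represent every possible n_bases-long sequence."""
    return 4 ** n_bases  # four choices (A, C, G, T) at each position


eight_base_array = sbh_array_size(8)      # the eight base case quoted in the text
hundred_base_array = sbh_array_size(100)  # far beyond any practical array
```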
1.2.6 The Human Genome Project
The Human Genome Project is well underway in its endeavour to sequence the 3
billion base pair human genome [43]. Completion of this project is expected to occur
by the year 2005. The product of this project is a consensus sequence of the human
genome reflecting the typical sequence seen in the population of humans. This project
has served as a major impetus in the development of new sequencing strategies. It
will aid the identification of new genes and control areas in the genome. As well, it
will aid in the identification of abnormalities.
1.2.7 Clinical Role
As is no doubt obvious to the reader, gene databases such as the one produced by
the human genome project and sequenced patient DNA will allow clinicians to identify
genetic abnormalities. Soon these clinicians will be able to treat such abnormalities
through gene therapy. However, there are other very important clinical roles for DNA
sequencing with immediate benefits. DNA sequencing can quickly and effectively
compare the Human Leukocyte Antigens (HLAs) of the patient and potential donor
organ and thus determine if the tissues are compatible. DNA sequencing can quickly
identify viral strains and determine what drugs the patient's infection would resist.
Thus, the financial and health penalties associated with prescribing an inappropriate
course of treatment can be avoided.
1.3 Data Communications
Data communications is concerned with the transmission and reception of a se-
quence of information with as little error as possible. A field of vigorous investigation
since the work of Nyquist in 1924, it offers a well-developed body of knowledge and
experience. The basic philosophy is first to develop a model of the channel over
which communication must occur and then to derive from the model an appropriate
communications technique.
Figure 1.5 depicts the basic elements of a data communications system. At the left
of Figure 1.5, the transmitter takes the input data stream and performs a sequence
of operations in order to represent the information as an analog signal at the input of
the channel. These operations may include source encoding to reduce the number of
symbols necessary to represent the input, channel encoding to allow error correction at
the receiver, and modulation to map the coded digital sequence to signal waveform(s).
Figure 1.6(a) shows a simple waveform representing the information sequence 101.
Moving to the center of Figure 1.5, passage through the channel has two main
effects [56]. First, the waveform is distorted by the channel impulse response, c(τ, t),
which is the response at time t to an impulse at time τ. This extends the duration
of the received symbol signal and causes interference between adjacent symbols, a
phenomenon known as Inter-Symbol Interference (ISI). ISI can lead to errors as in
the case where, due to the symbols on either side of the symbol of interest being
ones, a sufficiently high level will be measured at the time of interest such that the
symbol will be declared a one when it really should have been a zero. Figure 1.6(b)
depicts the waveform of 1.6(a) after passage through such a channel impulse response.
Figure 1.5: Data communications system block diagram.
Figure 1.6(c) displays the data in 1.6(b) with the second major channel effect included,
that of the noise source shown in Figure 1.5. Clearly, symbol detection is complicated
by the random nature of the received waveform.
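The two channel effects described above, distortion by the impulse response and additive noise, can be sketched in a few lines of Python. The sketch is hedged: it assumes a time-invariant impulse response (a simplification of the general c(τ, t)), rectangular pulses, Gaussian noise, and arbitrary illustrative tap values; none of the names or numbers come from this thesis.

```python
import random


def convolve(signal, impulse_response):
    """Discrete convolution of a signal with a finite impulse response."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, x in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += x * h
    return out


def transmit(bits, samples_per_symbol=4):
    """Represent each bit by a rectangular pulse of the given length."""
    wave = []
    for bit in bits:
        wave.extend([float(bit)] * samples_per_symbol)
    return wave


def channel(wave, impulse_response, noise_std, rng):
    """Return (distorted, received): ISI from the impulse response, then additive noise."""
    distorted = convolve(wave, impulse_response)
    received = [v + rng.gauss(0.0, noise_std) for v in distorted]
    return distorted, received


bits = [1, 0, 1]                # the information sequence of Figure 1.6(a)
h = [0.4, 0.3, 0.2, 0.1]        # assumed channel impulse response (illustrative only)
wave = transmit(bits)
distorted, received = channel(wave, h, noise_std=0.1, rng=random.Random(0))
```

In this sketch the first sample of the zero symbol, `distorted[4]`, is about 0.6 rather than 0 because the tail of the preceding one spills into it, which is the ISI mechanism described in the text.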
The receiver at the right of Figure 1.5 may attempt to reduce detection errors by
estimating and removing the ISI and then averaging over time to limit the effect of
noise. The resulting digital sequence will then be passed through a decoder for error
correction and, hopefully, recovery of the original information sequence.
1.3.2 Receiver Technology
Removing the ISI and limiting the effect of noise is not a trivial task. Several
receiver structures have been developed to address this problem. The first is the zero-
forcing linear equalizer. It applies a filter that inverts the effect of the channel impulse
response (and in some cases, transmitter pulse shaping) so that the ISI is guaranteed
to be zero at the symbol time. Thus, neighbouring symbols do not directly contribute
to errors. Unfortunately, this inverse filter is applied to the received waveform and
usually increases the noise component. A variant known as the mean-square error
linear equalizer is designed to minimize the sum squared of residual ISI and noise at
the symbol time; it trades off some ISI cancellation for reduced noise emphasis. Non-linear
equalizers avoid noise emphasis by using a proxy for the signal in canceling the
ISI. For example, the decision feedback equalizer applies the sequence detected thus
far to a filter representing the channel and then subtracts the result from the received
waveform to remove the ISI from previous symbols. As the noise does not appear in
the proxy, it is not emphasized. Of course, problems occur if there are errors in the
sequence detected thus far.
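The decision feedback idea can be sketched as follows, assuming one sample per symbol, antipodal symbols (+1/-1), a known channel with a unit main tap, and only post-cursor ISI. These are illustrative assumptions, and the function names and tap values are hypothetical rather than taken from this thesis.

```python
def apply_channel(symbols, taps):
    """Noise-free channel output r[k] = sum over d of taps[d] * symbols[k - d]."""
    received = []
    for k in range(len(symbols)):
        received.append(sum(t * symbols[k - d] for d, t in enumerate(taps) if k - d >= 0))
    return received


def decision_feedback_equalizer(received, post_cursor_taps):
    """Subtract the ISI predicted from past decisions, then threshold at zero."""
    decisions = []
    for k, r in enumerate(received):
        isi_estimate = 0.0
        for d, tap in enumerate(post_cursor_taps, start=1):
            if k - d >= 0:
                isi_estimate += tap * decisions[k - d]   # proxy built from decisions, not from noisy data
        decisions.append(1 if r - isi_estimate >= 0 else -1)
    return decisions


symbols = [1, -1, -1, 1, 1, -1]
taps = [1.0, 1.2]                                   # assumed channel: main tap plus one strong echo
received = apply_channel(symbols, taps)
threshold_only = [1 if r >= 0 else -1 for r in received]
dfe_output = decision_feedback_equalizer(received, taps[1:])
```

In this noise-free example the echo is stronger than the main tap, so thresholding alone misreads several symbols while the feedback structure recovers the transmitted sequence. As noted above, a wrong decision would feed an incorrect ISI estimate into later symbols.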
There is a more sophisticated and mathematically rigorous receiver that implicitly
limits the effect of both ISI and noise. Known as the Maximum Likelihood Sequence
Detector (MLSD) [55] or Maximum Likelihood Sequence Estimator (MLSE) [56], it
is based on a solid statistical approach. A probabilistic model is developed for the
received waveform for each of the possible transmitted sequences. The hypothesized
sequence then implies the ISI in the waveform and the probabilistic uncertainty
models the fluctuation due to noise. The actual received waveform is then used as an
argument to the probability functions and the sequence which yields the greatest
probability is selected. This process can be performed efficiently using the Viterbi
algorithm [8].
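A minimal sketch of maximum likelihood sequence detection with the Viterbi algorithm is given below. It assumes a known two-tap channel, one sample per symbol, antipodal symbols, no symbol before the first one, and white Gaussian noise, so that the most probable sequence is the one minimizing the summed squared error. These assumptions and the name `viterbi_mlsd` are illustrative only; the receiver derived later in this thesis is not reproduced here.

```python
def viterbi_mlsd(received, taps, alphabet=(-1, 1)):
    """Most likely symbol sequence for r[k] = h0*s[k] + h1*s[k-1] + Gaussian noise."""
    h0, h1 = taps
    # The trellis state is the most recent symbol. For each state keep the best
    # (lowest squared-error) path that ends in it.
    metrics = {s: (received[0] - h0 * s) ** 2 for s in alphabet}
    paths = {s: [s] for s in alphabet}
    for r in received[1:]:
        new_metrics, new_paths = {}, {}
        for s in alphabet:
            # Choose the predecessor symbol giving the lowest accumulated error.
            best_prev = min(alphabet,
                            key=lambda p: metrics[p] + (r - h0 * s - h1 * p) ** 2)
            new_metrics[s] = metrics[best_prev] + (r - h0 * s - h1 * best_prev) ** 2
            new_paths[s] = paths[best_prev] + [s]
        metrics, paths = new_metrics, new_paths
    best_final = min(alphabet, key=lambda s: metrics[s])
    return paths[best_final]
```

The work grows linearly with the sequence length, whereas evaluating the probability of every candidate sequence directly would grow exponentially; this is the efficiency referred to above.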
1.4 Analogy between DNA Sequencing and Data Communications
In this chapter, DNA sequencing has been shown to depend on noisy, overlapping
signals that represent a sequence of information. Data communications has been
shown to be concerned with extracting an information sequence from noisy, overlapping
signals. There is clearly a strong analogy between data communications and
DNA sequencing. Interestingly, this analogy has not been identified previously in the
literature. This thesis will exploit this analogy in the hope of improving our under-
standing of DNA sequencing through the use of the powerful concepts developed for
data communications.
1.5 Research Approach
This research first identified the key aspects of the DNA time-series through a
study of the literature and preliminary investigations of real data. Then statistical
models were developed which incorporated those features. With respect to this
modelling, the optimum receiver / sequencing algorithm was then derived. Sub-optimal
implementations were then investigated using both simulated and real data.
1.6 Dissertation Organization
This thesis will first detail in Chapter 2 the chemical and physical processes
underlying the time-series observed in DNA sequencing. In Chapter 3, statistical models
are developed for the DNA time series. The derivation of the optimum sequencing
algorithm is presented in Chapter 4. The analysis will explore the mathematics so
that insights may be formed into the key structural and functional features of the
algorithm. Chapter 5 discusses the implementation of the algorithm. Simulations are
used to aid in choosing an appropriate design. Performance with real data is
investigated in Chapter 6 and compared with that of a commercial sequencer. Finally,
Chapter 7 summarizes the contributions of this work and provides suggestions for
further research.
Figure 1.6: Communication signals: (a) transmitted signal; (b) signal distorted by channel impulse response; (c) received signal. Horizontal axes: time (samples).
CHAPTER 2
Details of the Chemistry and Physics
of DNA Sequencing
The introductory chapter of this thesis presented a high-level view of the sequenc-
ing process. However, the modelling in this thesis requires a deeper understanding of
the sequencing process. Thus, to lay the foundation for the statistical modelling to
follow, this chapter provides a more detailed description of the chemical and physical
processes involved in DNA sequencing.
2.1 Sequencing Chemistry
2.1.1 Chemical Structures
DNA is a polymer composed of monomers called nucleotides (i.e., A, C, G, T).
Figure 2.1 depicts the chemical structure of the A nucleotide which is more properly
referred to as deoxyadenosine 5'-monophosphate. Note the nucleotide's three
components: phosphate, deoxyribose and adenine. All DNA nucleotides have the same
phosphate and deoxyribose components but differ in their base which may be Adenine
(A), Cytosine (C), Guanine (G) or Thymine (T). DeoxyAdenosine 5'-MonoPhosphate
may be abbreviated as dAMP; its higher energy triphosphate form is abbreviated as
dATP. Similar abbreviations apply for the other nucleotides as in dCMP, dGMP,
dTMP, and dCTP, dGTP and dTTP. Also, a generic nucleotide representing any of
the four bases may be referred to as dNMP or dNTP.
Figure 2.1: Structure of deoxyadenosine 5'-monophosphate (dAMP) (after [13]).
Note the numbering by the carbons of the ribose in Figure 2.1. Of particular
importance are the 5' and 3' carbons as the ssDNA polymer is formed by connecting the
5' carbon of one nucleotide to the 3' carbon of the next nucleotide using a phosphate
group (Figure 2.2).
Figure 2.3 shows two complementary strands of DNA hybridized together. The
hydrogen bonds joining the strands are indicated by dashed lines. One can see how
A is complementary to T and G is complementary to C by virtue of the hydrogen
bonds they can form. Further, each complementary pair incorporates one base with a
single ring, known as a pyrimidine (C or T), and one base with a double ring, known
as a purine (A or G).
Figure 2.2: A single-stranded DNA (ssDNA) molecule (after [13]); full structure shown for phosphate and ribose groups but bases are represented by one of A, C, G, or T.
2.1.2 DNA Amplification
Typically, researchers start with a very small amount of DNA that has been
isolated from the cell or virus of interest. To obtain good strong signals in DNA
sequencing, more DNA is needed. So, prior to the actual sequencing reactions, steps
are taken to produce many copies of the original DNA. This processing is referred
to as DNA amplification. There are two major methods for DNA amplification:
Polymerase Chain Reaction (PCR) and cloning.
PCR is based on the use of a special thermally stable polymerase enzyme, Taq,
that was isolated from bacteria living in geothermal vents [12]. In PCR, the DNA is
first heated to approximately 95 °C to denature it (i.e., separate dsDNA into ssDNA).
It is then cooled to approximately 55 °C to allow a primer to hybridize to it. A primer
is a short (10-20 base) piece of ssDNA that is complementary to the start of the DNA
segment to be copied. The polymerase enzyme will then bind to the DNA at the
end of the primer and start to add bases, extending the primer to make a copy
of the original DNA. The solution is typically heated to roughly 70 °C during this
phase to speed the incorporation of bases. Then the cycle is repeated with the initial
heating to 95 °C serving to release the copy from the original molecule. This copy is
complementary to the original ssDNA. Note that a non-thermally stable polymerase
would be destroyed by heating to 95 °C.
Figure 2.3: Double stranded DNA (dsDNA) detailed structure [11]. Phosphate-deoxyribose backbones are on extreme left and right, corresponding to respective strands. Bases run from top to bottom along the center of the diagram. Hydrogen bonding is seen along center as dashed line emanating from a hydrogen (H) that also has a solid line indicating a covalent bond to the other strand.
The description of PCR is not yet complete. Another primer, complementary to
the copy at the other end of the segment of interest, is also included in the solution.
This primer will bind to the new copy and then the polymerase enzyme can make a
complementary copy of the copy. Applying the definition of a complement, this new
copy is therefore identical to the original ssDNA over the segment of interest. Now
there are two molecules available as templates for the next round of copying. In this
manner, PCR doubles the amount of DNA with every thermal cycle. The typical
number of cycles used in PCR is 10-20 and thus amplification factors of a thousand
(2^10) to a million (2^20) are typical. Also, any errors are duplicated in subsequent
thermal cycles. Should the enzyme dissociate from the template before completion of
the full copy then, for primer labelled sequencing, the short copy will contribute to
peaks in all four (A,C,G,T) time series at its terminal base position. This phenomenon
is known as a "false-stop" or "false termination".
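The amplification factors quoted above follow from doubling once per thermal cycle. The short Python sketch below reproduces the arithmetic; the optional `efficiency` parameter (the fraction of templates copied per cycle) is a hypothetical addition for illustration and does not appear in the text, which assumes perfect doubling.

```python
def pcr_amplification(cycles, efficiency=1.0):
    """Ideal amplification factor after the given number of thermal cycles.

    efficiency = 1.0 corresponds to perfect doubling in every cycle, as in the text;
    lower values are a hypothetical way to model incomplete copying.
    """
    return (1.0 + efficiency) ** cycles


ten_cycles = pcr_amplification(10)      # 2^10, about a thousand
twenty_cycles = pcr_amplification(20)   # 2^20, about a million
```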
The second method of DNA amplification, cloning, relies on cells, typically
bacteria or yeast, to amplify the DNA of interest [13]. A cloning vector is used to introduce
the DNA template into the cell; the general process is often referred to as recombinant
DNA technology. Two standard cloning vectors are plasmids and bacteriophages.
A plasmid is a small (several thousand base pair) piece of circular DNA that is
capable of replication within a bacterium. A plasmid vector consists of the original
plasmid's DNA required for replication plus the DNA of interest inserted into the
circular DNA molecule. This is then inserted into the cell and through the cell's
normal reproductive cycle additional copies of the DNA of interest are made.
A bacteriophage is a virus that infects bacteria. The DNA of interest can be
attached to the bacteriophage's DNA. The bacteriophage is then applied to a
bacteria colony. Over the course of infection each bacterium makes many copies of the
bacteriophage. These can then be harvested and the DNA of interest extracted.
As in the PCR case, cloning techniques can lead to errors that may be passed to
subsequent generations.
2.1.3 Sequencing Reaction Molecules: Terminators and
Polymerases
In the introductory chapter, Sanger DNA sequencing is described as depending on
competition between a substrate base and a terminator molecule. The terminator
molecule is actually a modified nucleotide where the 3' hydroxyl (OH) group of the
deoxyribose is replaced by a hydrogen atom. The A terminator is then properly
called dideoxyadenosine 5'-monophosphate (abbreviated ddAMP). Its 5' end will bind
wherever the 5' end of an A would bind. As it lacks the 3' hydroxyl group, it is not
possible to bond other nucleotides at the 3' location and polymerization at this end is
not possible. Thus, once it is incorporated, construction of the DNA copy will stop.
In sequencing, DNA polymerases are responsible for incorporating the bases into
the complementary DNA molecule. There are many different DNA polymerases, a
diversity made possible as different organisms have different natural polymerases.
Taq, used in PCR as discussed above, may also be used in sequencing. If thermal
cycling is used for the sequencing reactions then the process is referred to as cycle
sequencing. Sequenase, another thermally stable DNA polymerase enzyme, can also
be used in cycle sequencing. T7 DNA polymerase is not thermally stable but it can
make very long copies. Polymerase enzymes are limited in the length of DNA they
can copy as they eventually dissociate from the DNA template. Taq, Sequenase and
T7 DNA polymerase were used in the experiments of this thesis.
2.1.4 Fidelity and Peak Amplitude Variation
The polymerase can make mistakes in making the copies and can cause fluctuations
in the amplitudes of correct peaks. For example, the error rate for Taq is estimated as
2 x 10^-4 misincorporations per nucleotide per cycle [12]. Other authors measured the
error rate for Taq as one single-base substitution error in 9000 bases and one frameshift
error (i.e., insertion or deletion) in 41000 bases [14]. Under special conditions, base
substitution and frameshift error rates of less than IO-' have been observed [15].
Errors may occur in both the amplification and sequencing phases. Beyond a direct
substitution of one base for another due to partial affinity for bases other than the
correct one, many other error mechanisms are possible.
¹ Note that in most of this thesis and in the literature in general, when discussing DNA sequencing the word 'base' will often refer to the complete nucleotide including its phosphate and deoxyribose groups.
One source of error is 5' to 3' exonuclease activity wherein the polymerase removes
bases from the primer end of the copy. If this occurs during amplification, it will
reduce yield as the PCR primers may not be capable of binding to the shorter copies.
If part of the region complementary to the sequencing primer is lost, the molecule
will not participate in the sequencing reactions. The 5' to 3' exonuclease activity is
part of the DNA repair mechanism in living cells. Sequencing polymerases typically
have been modified to suppress this activity.
Blunt-end addition is another DNA polymerase error. Just after completing a full
length copy, the polymerase adds additional bases at the 3' end [16, 17], even though
the template does not have bases there. The bases are added at random but A's are
added more often. If blunt end addition occurs during amplification then the results
are unaffected as no changes have been made to the region of interest. As to the
sequencing reactions, the ddNTP terminators do not allow blunt end addition. Thus,
blunt end addition does not lead to sequencing errors.
As the polymerase ages, it becomes partially inactivated [33]. This can lead to it
dissociating from the template before the copy incorporates a ddNTP. If the primer is
fluorescently labelled then the copy will be detected. This will indicate a base of type
corresponding to that ddNTP even though the base at that point in the sequence
may be of another type. In amplification, polymerase dissociation leads to some
products having a shorter length than others. When sequenced, both the short and
long length copies will contribute to the initial peaks but only the long length copies
will contribute to the later peaks. Therefore, this phenomenon will contribute to a
reduced signal strength for later peaks. Also, many times the sequencing reactions
are likely to go the full length of the short copy without incorporating a terminator.
For primer labelled data, this results in peaks in all four time series (A,C,G,T) at the
short copy's terminal base position. As mentioned earlier, this phenomenon is known
as a "false-stop" or "false termination".
Figure 2.4: Bulging of copy with insertion of a T.
There are several known sequence dependent effects associated with polymerases
(see Table 4.1 in [33]). For example, in a string of consecutive C's, later C peaks are
likely (but not certain) to be larger than earlier C peaks. However, a "complete"
list of these dependencies is not available, largely due to the very large number of
possible combinations. Pairwise sequence dependencies are presented in [Ul. Relative
separation depended on the 3' terminal dideoxynucleotide and increased in order of
C, A, G and T. However, relative separation was also dependent on the penultimate
base adjacent to the 3' dideoxynucleotide and increased in order T, A/G and C.
Generally, peak levels can fluctuate. In [18], the fluctuation, which was defined as the
ratio of the difference in adjacent peak levels to their average, was varied from 0.1 to
10 by changing the concentration and type of the divalent cation needed to activate
the polymerase. In [45], amplitude fluctuations between runs for same sequence and
polymerase were found to be highly correlated.
Other errors in the copying process are associated with problems in the hybridiza-
tion of template and copy. Either the template or the copy may bulge out, forming a
little loop. If the copy bulges then the copy has one or more insertions (Figure 2.4).
If the template bulges then the copy will have one or more deletions. The stability of
these bulges depends on local sequence [19]. Larger bulges, known as hairpin loops,
are often associated with regions rich in G's and C's as these regions may form hydro-
gen bonds as in Figure 2.5. If the template forms a hairpin loop then the polymerase
is more likely to dissociate at the loop. This leads to the level reduction and false
stop effects mentioned previously.
Formation of bulges also allows the primer to hybridize to a region it only partially
matches. This phenomenon is known as secondary priming. It leads to additional
peaks in the DNA time series. These peaks correspond to the sequence from the
secondary priming site. Secondary priming can be suppressed by setting the annealing
temperature to be high enough so that only the exact complementary hybridization
will be stable.
Figure 2.5: Hairpin loop due to complementary GC runs.
In summary, the error mechanisms discussed in this section impact on DNA time
series as additional peaks in the data. These additional peaks occur at locations
consistent with those expected for true peaks as they correspond to product
containing an integer number of uncorrupted bases. Problems associated with polymerase
dissociation can lead to reduced signal levels.
2.1.5 Degradation
Chemical breakdown can cause sequencing errors. The main process is hydrolysis
where long DNA molecules are split into two smaller fragments by the addition of
water, with the water's hydroxyl group added to one fragment and its hydrogen
added to the other. The elevated temperatures used in sequencing promote hydrolysis.
The labelled fragments cause anomalous peaks in the time series. Hydrolysis during
amplification and sequencing reactions leads to effects similar to polymerase/template
dissociation. Hydrolysis can also occur while the DNA is in the sequencing gel. The
resulting anomalous peaks can occur anywhere in the time series, as they are offset
in time by when hydrolysis occurs which in turn is a random variable.
From long DNA molecules of all the same length, hydrolysis can lead to a
population of products of all lengths (base counts) shorter than the original length. As
the products of a hydrolysis reaction may themselves undergo hydrolysis, the shorter
members of this population tend to be more populous than the longer members of
this population. Thus, the level of noise due to this population should be higher
earlier in the time series.
Figure 2.6: Cleavage pathway for depurination (Guanine base) [20]. Additional symbols are: R for deoxyribose, G for guanine, T for thymine and P for phosphate.
Hydrolysis can remove the base from the deoxyribose monophosphate of a nucleic
acid. Purine bases are more likely than pyrimidine bases to be removed from DNA.
Depurination is the hydrolytic removal of purine bases from DNA. Figure 2.6
illustrates the depurination pathway; all the intermediate products may be present and
would lead to additional anomalous peaks in the DNA time series. Note that
depurination ultimately leads to cleavage of the DNA into two shorter DNA molecules.
Hydrolysis can also cleave the fluorescent label from the DNA. Depending on the
label molecule, the fluorescent label may then be positively or negatively charged.
If this hydrolysis occurs before the detectors then a loss of signal level results. The
noise background will increase either from negatively charged labels migrating ahead
of the DNA band prior to the detectors or from positively charged labels migrating
backwards past the detectors after the source band has passed the detectors.
2.2 Sequencing Physics
2.2.1 Sequencing Gel
A gel is an aggregate of polymers that encompasses a liquid medium. There
are connections, referred to as cross-links, between some of the fibers. A molecule
undergoing electrophoresis must weave its way between the fibers. It is this interaction
that allows discrimination of DNA molecule length. In a simple solution rather than
a gel, DNA's charge and resistance to motion scale with length and discrimination of
length is not possible [21].
The cross-links lead to the notion of pores in the gel and a gel is often char-
acterized by its mean pore size. The large pore agarose gel is used for separating
large molecules such as million base dsDNA molecules. The smaller pore
polyacrylamide gel is normally used for DNA sequencing. Currently experimental, capillary
gel electrophoresis can also work if the fibers are not cross-linked [22, 23].
A polyacrylamide fiber is composed of long runs of acrylamide monomers. The
fibers are covalently cross-linked by N,N'-methylene-bis-acrylamide, a molecule which
is usually referred to as "bis". The weight ratio of acrylamide to "bis" determines the
extent of cross-linking and therefore the character of the gel (10:1 is brittle while 100:1
is pasty [24]). A 19:1 ratio is typical for DNA sequencing. The gel concentration sets
the mean pore size. The gel is formed by adding acrylamide and "bis" to the liquid
medium.
For sequencing, the medium contains at least two components: a buffer and a
denaturing agent. The buffer provides ions that undergo electrophoresis just as the
DNA molecules do. The concentrations are set so that the vast majority of charge is
carried by the buffer ions. Thus, the buffer ions define the local electric field conditions
and as they are small and uniform in distribution, the DNA molecules see a uniform
electric field that is essentially unperturbed by other DNA molecules. The ability of
the buffer to do this is referred to as its ionic strength and is defined as one half the
sum, over the ionic species present, of the molality times the charge squared. Ionic strength is
often reported as a multiple of TBE where 1xTBE is the ionic strength of a reference
buffer (1xTBE) consisting of 89 mM Tris-Borate and 2 mM EthyleneDiamineTetraacetic
Acid (EDTA). For the experiments reported in this thesis, the buffer is TBE.
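The definition of ionic strength given above (one half the sum of molality times charge squared) can be illustrated with a minimal sketch. The function name `ionic_strength` and the example salt are hypothetical and are not taken from the thesis.

```python
def ionic_strength(species):
    """Return 0.5 * sum(m_i * z_i**2) over (molality, charge) pairs."""
    return 0.5 * sum(molality * charge ** 2 for molality, charge in species)


# Hypothetical example: 0.1 mol/kg of a 1:1 salt (one +1 ion and one -1 ion).
one_to_one_salt = [(0.1, +1), (0.1, -1)]
print(ionic_strength(one_to_one_salt))
```

Because the charge enters squared, multiply charged ions contribute disproportionately to the ionic strength.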
The denaturing agent ensures that the DNA remains single stranded. The denaturing
agent also helps prevent the ssDNA from forming hydrogen bonds between
different regions of itself. A gel formed with a denaturing agent is referred to as a
"denaturing gel". For the experiments reported in this thesis, the denaturing agent is
urea.
2.2.2 Theories of Electrophoresis
There are several electrophoretic theories in the literature [24, 46, 47, 48, 49, 50]
and an excellent overview of these theories is given in [9]. Individual theories explain
different electrophoretic regimes where the regimes are defined by the relative sizes of
the molecule and the gel pore [51]. Electric field strength can also provide the basis
for differentiation into different regimes. For the sequencing gel conditions used in the
experiments of this thesis, the two most relevant models and regimes are the Ogston
model (1 to ~150 basepairs) and the biased reptation model (~150 to 500 basepairs).
Ogston calculated the fraction of the gel that can contain a sphere of a specific
radius (i.e., the ratio of the total volume of all pores bigger than the sphere to the
total volume). To obtain a simple electrophoresis theory, the arbitrary idea that the
mobility is proportional to the fractional volume was explored. This of course ignores
the requirement that a gel must have a connected set of sufficiently big pores from
one end to the other in order for the sphere to pass through it. Surprisingly then,
the model yielded a good fit to experimental data over a fair region of molecular
mass, with the upper limit of the region being roughly two to three orders of magnitude
greater than the lower limit of the region. As a result, it has been the basic
electrophoresis model for nearly four decades. The model states that the mobility
decreases exponentially with the square of the ratio of sphere radius to pore radius.
In applying the model to DNA sequencing, the DNA is presumed to have folded
into a sphere-like random coil with volume equal to that of the linear DNA molecule.
The biased reptation model applies when the DNA molecule is large enough such
that it cannot fit into a single pore. In this model, the leading part of the linear
polymer, the "head", is assumed to enter the pore and to choose the path to the
next pore. The rest of the polymer just proceeds in order along the path selected
by the head. The head "searches" for the entrance to the next pore through its
normal thermodynamic motions. The word "biased" in the model name refers to
the electric field biasing the head to follow the field lines. In the biased reptation
model, mobility is inversely proportional to molecular length. The model also states that
beyond a certain limiting length, mobility is independent of molecular length. Two
different length molecules of size greater than the limiting length would have the same
mobility, take the same time to travel through the gel and thus would not be resolved.
Fortunately, this limiting length is in the thousands of base pairs for the gels used in
the experiments for this thesis.
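The qualitative behaviour of the two models can be sketched numerically. This is an illustration only: the prefactors, the default limiting length of 3000 bases, and the hard switch to a constant mobility beyond that length are assumptions made for the sketch, not values or forms taken from the cited theories.

```python
import math


def ogston_mobility(sphere_radius, pore_radius, free_mobility=1.0):
    """Mobility falls exponentially with (sphere radius / pore radius) squared."""
    return free_mobility * math.exp(-(sphere_radius / pore_radius) ** 2)


def biased_reptation_mobility(n_bases, constant=1.0, limiting_length=3000):
    """Mobility ~ constant / length, held flat beyond the limiting length.

    limiting_length is a hypothetical value standing in for "thousands of bases".
    """
    return constant / min(n_bases, limiting_length)


for n in (200, 400, 4000, 5000):
    print(n, biased_reptation_mobility(n))
```

In the sketch, molecules of 4000 and 5000 bases receive the same mobility and so would not be resolved, which is the limiting behaviour described above.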
These electrophoresis models assume that the molecule is a linear polymer composed
of identical symmetric monomers. DNA has varying monomers (A,C,G,T) and
the bases are attached asymmetrically off the side of the phosphate deoxyribose backbone.
However, rotation about the backbone is possible. One can think of a long
DNA molecule representing an instantiation of a random set of base types and rotations.
As the set is large, the law of large numbers applies and the average properties
become more representative. One could think of an average monomer and average
rotation leading to the idealized linear polymer of identical symmetric monomers to
which the models would then apply. Thus, these models provide the average character
of the electrophoresis results but on a base by base basis, fluctuations about this
average would be expected due to variations in monomer type and orientation.
Further, microscope studies of actual DNA molecules undergoing electrophoresis
have revealed more complicated behavior than assumed in the models above [52, 53,
54]. These behaviors include herniation outside the reptation "tube", hooking on gel
chains and release of hooked molecules where one end goes backward relative to the
general direction of migration.
The idealized electrophoresis models above assume the polymer consists of stiff
rods joined at nodes and that these rods are free to take any relative angles. Rather
than use the length of a monomer as the length of the rod, the models use the Kuhn
length of the DNA as the rod length. Kuhn length is the contour² distance between
two points on the DNA molecule such that the angle of the local segment along the
contour at one point is uncorrelated with that of the other point [31]. This justifies
the rods being free to take any relative angles in the models. Persistence length,
defined as half the Kuhn length, is often used as a measure of this property.
Kuhn length may be easily understood by considering a thick rope. For two points
hundreds of rope diameters apart, it is easy to place the local segment at any angle
independent of that at the other point (presuming the rope is not stretched taut).
However, for two points a few diameters apart, the stiffness of the rope limits the
possible relative angles.
The persistence length of ssDNA varies from 5 to 12 bases depending on the
ionic strength of the buffer; here, the maximum ionic strength has been restricted
to 10^-2 mol/L to reflect the maximum likely to be used in sequencing gels [27].
The dependence on ionic strength arises because the stiffness of the ssDNA has a fixed structural component and
an electrostatic component [32]. The electrostatic component is due to electrostatic
repulsion between the charged bases. As ionic strength increases, more positive ions
gather around the DNA molecule and shield the bases from the negative charge of
adjacent bases. Thus, electrostatic repulsion is decreased and the DNA molecule
becomes more flexible. The Kuhn length of ssDNA varies from about 2.4 nm at
infinite ionic strength to 16 nm at a low ionic strength (the phosphate to phosphate
interbase separation for ssDNA is 0.43 nm) [32].
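As a small worked example, the quoted Kuhn lengths can be converted into a number of bases using the 0.43 nm interbase separation, and into persistence lengths using the factor of one half. The helper names are hypothetical; the numerical inputs are the ones quoted in the text.

```python
INTERBASE_NM = 0.43  # phosphate-to-phosphate separation for ssDNA, as quoted above


def kuhn_length_in_bases(kuhn_length_nm):
    """Number of bases spanned by one Kuhn segment."""
    return kuhn_length_nm / INTERBASE_NM


def persistence_length_nm(kuhn_length_nm):
    """Persistence length, defined as half the Kuhn length."""
    return kuhn_length_nm / 2.0


for kuhn_nm in (2.4, 16.0):
    print(kuhn_nm,
          round(kuhn_length_in_bases(kuhn_nm), 1),
          persistence_length_nm(kuhn_nm))
```

The two quoted extremes correspond to roughly 5.6 and 37.2 bases per Kuhn segment.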
² The contour is the path taken in going from base to base along the DNA molecule.
2.2.4 Resolution
Resolution refers to the ability to discern multiple consecutive same type bases.
As resolution becomes poorer, the sequencing error rate increases. A useful measure
for resolution is the ratio of peak width to interbase separation. Peak width in the
gel is determined by the peak width on loading plus the additional increments due to
diffusion and dispersion since loading [25]. The peak width on loading is complicated
by the stacking of the DNA at the decelerating interface between loading well and gel.
Yarmola [25] defines diffusion as that spreading component present in the absence of
an electric field and dispersion as the additional time dependent component present in
the presence of an electric field. Slater [9] combines diffusion and dispersion together
and refers to the aggregate as diffusion; this is the nomenclature used in the remainder
of this thesis.
The time-dependent peak width in the gel, $\Delta x_D(t)$, is given in [9] as
\[ \Delta x_D(t) = \sqrt{\Delta x_0^2 + 2Dt} \tag{2.1} \]
where $\Delta x_0$ is the peak width on loading and $D$ is the diffusion coefficient. Peak width
in time, $p_w$, is then $\Delta x_D / v$ where $v$ is the speed of the peak in the gel.
Resolution becomes problematic in the region that the biased reptation model
applies. The analysis is simplified by assuming that only the biased reptation model
applies. So mobility ($v$) is inversely proportional to molecular length and hence base
number ($i$); $v = C/i$ where $C$ is the constant of proportionality. The center of the
band passes the detectors at $t = L/v = Li/C$ where $L$ is the length of the gel.
Application of these factors to the results of the previous paragraph yields
\[ p_w(i) = \frac{i}{C}\sqrt{\Delta x_0^2 + \frac{2DLi}{C}} \]
Thus, peak width grows with a rate that is between linear in $i$ and $i^{3/2}$. Note that for
this model the separation between adjacent peaks is constant at $Li/C - L(i-1)/C = L/C$.
The resolution, $\mathrm{res}(i) = p_w(i)/(t_i - t_{i-1})$, is then
\[ \mathrm{res}(i) = \frac{i}{L}\sqrt{\Delta x_0^2 + \frac{2DLi}{C}} \]
A small value of resolution is desirable. According to this equation (and as is actually
the case in practice), resolution is improved by using longer gels (large L).
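The dependence of resolution on base number and gel length can be explored with a short sketch. It assumes the squared peak width in the gel grows as the loading width squared plus 2Dt (a common diffusion form; the factor 2 is an assumption), together with v = C/i and t = Li/C from the text. All parameter values below are hypothetical.

```python
import math


def peak_width_time(i, gel_length, c, dx0, diffusion):
    """Peak width in time for base number i under the biased reptation model."""
    speed = c / i                     # v = C / i
    arrival = gel_length / speed      # t = L / v = L i / C
    # Assumed spreading law: width^2 = dx0^2 + 2 D t.
    width_in_gel = math.sqrt(dx0 ** 2 + 2.0 * diffusion * arrival)
    return width_in_gel / speed


def resolution(i, gel_length, c, dx0, diffusion):
    """Peak width divided by the constant inter-peak separation L / C."""
    return peak_width_time(i, gel_length, c, dx0, diffusion) / (gel_length / c)


# Hypothetical parameter values, chosen only to show the trends.
for length in (0.2, 0.4):
    print(length, [round(resolution(i, length, 1e-3, 1e-3, 1e-9), 3)
                   for i in (100, 200, 400)])
```

Under these assumptions the ratio grows with base number and shrinks when the gel is lengthened, in line with the remark that longer gels improve resolution.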
The diffusion coefficient does depend on the mass and hence the length of the DNA
molecule. The temperature defines the average kinetic energy of the molecules in the
system. Kinetic energy is proportional to the product of mass and velocity squared
so, for the same temperature, a larger mass implies a lower velocity and vice versa.
Smaller velocities lead to lesser diffusion. Thus, one would expect larger molecules
to have narrower bands in the gel than if they had the diffusion coefficient of the
smaller molecules. However, the effect will be complicated by the configuration of the
molecule and in practice, the impact may be modest. As the diffusion coefficient will
decrease with base number and hence peak time, overall spreading due to diffusion
as indicated by Equation 2.1 will vary less than if the diffusion coefficient had been
constant. Practically, to first order, band width in the gel varies little with base
number.
2.2.5 Other Concerns in Electrophoresis
Gel Inhomogeneity
Bubbles in the gel, dust on the interior glass wall and defects in the loading
well shape are practical problems [33] that can lead to local variations in mobility.
Generally, the impact is felt in terms of extending the pulse shape, particularly if,
for the same lane, there is an entire gel section along the field axis which is problem
free and high mobility and a section along the field axis with flaws and low mobility.
For very large bubbles or well defects, the signal for a particular base type may be
significantly degraded or lost as migration down the gel is inhibited.
On a larger scale, variations in the degree of cross-linking over the entire gel lead
to variations in mobility from lane to lane. This problem can be addressed by scaling
in time the time series for each lane so that generally the peaks occur in the right
sequence; this is referred to as lane alignment.
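Scaling a lane's time series in time can be sketched as a resampling operation. The sketch assumes a single constant scale factor per lane and linear interpolation, both simplifications introduced here; the function name `rescale_in_time` is hypothetical and this is not the compensation procedure used in the thesis.

```python
def rescale_in_time(series, scale):
    """Resample so that a feature at sample k moves to about sample k * scale."""
    last = len(series) - 1
    n_out = int(last * scale) + 1
    out = []
    for k in range(n_out):
        src = k / scale                      # position in the original series
        lo = min(int(src), last)
        hi = min(lo + 1, last)
        frac = src - lo
        # Linear interpolation between the two nearest original samples.
        out.append((1.0 - frac) * series[lo] + frac * series[hi])
    return out


lane = [0.0, 0.0, 1.0, 0.0, 0.0]
print(rescale_in_time(lane, 2.0))
```

A slow lane would be given a scale factor below one and a fast lane a factor above one, so that peaks from different lanes fall in the correct order.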
Secondary Structure
The hairpin loops mentioned in Section 2.1 can also affect electrophoretic mobility.
The resulting molecule tends to be more compact and migrates faster [33]. Thus,
peaks corresponding to bases subsequent to the position of the hairpin formation
will reach the detectors sooner than could be expected given the times of the pre-hairpin
peaks. The peaks will appear to bunch up, a phenomenon known as band
compression. More modest secondary structure, such as bends [28] and a tendency
to form arcs [29], will affect mobility in a more modest fashion.
2.2.6 Detection of Fluorescent Labels
Fluorescent labels must be excited by a source with wavelength shorter than that
of the labels' emission wavelength. For example, fluorescein has an emission maximum
at 520 nm and must be excited by a source of wavelength smaller than 494
nm [26]. The Pharmacia ALF Automatic DNA Sequencer, the data source for the
experiments reported in this thesis, uses a blue-green laser for excitation. The laser
beam enters the gel by the first lane and terminates by the last. This implies lane
1 has a clean illumination while the last lane would be illuminated by a beam that
has been attenuated and dispersed by passing through the gel. The peaks in the last
lane are likely to be weaker and broader. They will also be noisier as the dispersed
beam can excite a wider region and thus more potential fluorescence noise sources.
The noise sources include chemical contaminants and the glass of the gel assembly.
Detection in the Pharmacia ALF is by an array of photodiodes. The output of
these devices is characterized as shot-noise as an impulse is produced for each photon
received. However, as the level of fluorescence is large and as low-pass filtering is
performed in the amplifiers, the recorded signal appears as an analog measurement
of fluorescence intensity plus a small Gaussian measurement noise.
2.3 Summary
This chapter has identified the chemical and physical processes that determine the
character of the DNA time series. Amplitude and noise fluctuations are largely due to
chemical processes. Peak time variations are largely due to physical processes. The
discussion of fidelity leads to the noise model of the next chapter. The peak shape and
electrophoresis discussions lead to the signal model. Phenomena have been presented
in sufficient detail so as to provide the required background for the model developed
in the next chapter.
CHAPTER 3
A Statistical Model of the DNA
Time-Series
A statistical characterization of the DNA time-series will be presented in this
chapter. First, the gross features of the DNA time-series are described. Our interest
then focuses on the local features of the time-series. The signal peak shape and parameters
are modelled. This is followed by the noise model. Simulated data produced
by this model is then presented and compared visually with sample real data. For
completeness, the major known features of DNA data that are not included in the
model are summarized. Finally, the importance of the model is discussed.
3.1 Gross and Local Structure of DNA Time-Series
Amplitude trends are in evidence in Figure 3.1 which presents the entire time-series
for a single channel. Proceeding from left to right, a constant background level is first
seen. This could be due to background fluorescence and/or an offset in the sequencer
electronics. Next, a large peak is seen; this is known as the primer peak and is due
to an excess of the fluorescently labelled primer unincorporated into any sequencing
copies. The primer peak causes an exponentially decaying offset in the data. Near the
end of the data, an exponentially rising offset is seen. This is the precursor of the peak
at the end of the data due to fluorescently labelled full length copies of the original
DNA fragment. If the terminator had been labelled instead of the primer then neither
this peak nor the primer peak would be present. Over the central region, a downward
trend in peak amplitudes¹ can be seen. This is likely due to the competitive process
used to encode sequence information. Here, the relative concentration of ddNTP to
dNTP is high leading to a greater chance of terminating early rather than later in
the sequence. The trend may also be due to random polymerase dissociation during
the sequencing reactions.
Figure 3.1: Sample entire time series for "T" channel. Mean inter-base separation is 14.7 samples.
The gross structure of the time-series in Figure 3.1 is representative of DNA time-series
from a wide variety of DNA sequencers though parameter values may change.
These trends may be compensated and the useful data region extracted prior to
making sequence decisions. This is typical practice in automatic DNA sequencing
and is analogous to automatic gain control and automatic frequency control in radio
communications.
¹ For this thesis, peak amplitude is defined as the difference between peak maximum intensity and the local offset level. For example, in Figure 3.1, the peak at sample 7000 has an amplitude of just over 100 intensity units and an offset of just under 1300 intensity units.
Figure 3.2 presents the compensated time series for all four bases; Figure 3.1
presented the uncompensated T channel data for the same sequencing run. As this
data originated from different lanes, it was necessary to compensate for differences
in mobility. The compensation, detailed in the Appendix, features sufficient degrees
of freedom to allow for the Ogston and biased reptation regimes expected in sequencing
data. Also as described in the Appendix, the background, primer and end
of data offsets have been estimated and removed. The trend in peak amplitude has
been estimated and the data has been scaled by its inverse. The result features signal
absent regions with values near zero and isolated signal peaks with values near
one. Consecutive peaks can have values much greater than one due to constructive
interference. This latter phenomenon is more pronounced near the end of the run due
to the expected increase in pulse width with base number.
Figure 3.3 is a higher resolution presentation of the compensated time-series for
all four bases. Note that the individual peaks are of similar shape and that there is
evidence of noise. This suggests a time-series model as in
\[ y_k^n = \sum_{i=1}^{N_b} \delta_{n,x_i}\, a_i\, g_{k,t_i} + n_k \tag{3.1} \]
where $n$ refers to the base type, the index $k$ is the sample number, the sum is over
the base sequence position, $i$, and there are a total of $N_b$ bases in the sequence. The
Kronecker delta function, $\delta_{n,x_i}$, defined as one if $n$ is the same as $x_i$ and zero otherwise,
is used to determine if $x_i$, the base at sequence position $i$, is of the same type as the
channel $n$ and should therefore contribute to the observed waveform in that channel.
The contribution consists of a generic pulse shape, $g_{k,t_i}$, where the peak of the pulse
is centered on $t_i$ and the peak is scaled by $a_i$. The random variables $t_i$ and $a_i$ model
the timing jitter and amplitude fluctuation, respectively. Finally, an additive noise
process, $\{n_k\}$, represents the background fluctuation evident in Figure 3.3.
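The structure of this time-series model (a sum of scaled, shifted pulses per channel plus additive noise) can be simulated in a few lines. The sketch is illustrative only: the Gaussian pulse, the Gaussian jitter, amplitude and noise distributions, and every default parameter value are assumptions made here, with the 15 sample spacing chosen to be close to the quoted mean inter-base separation.

```python
import math
import random

BASES = "ACGT"


def gaussian_pulse(k, center, width):
    """Unit-height Gaussian; a stand-in for the generic pulse shape."""
    return math.exp(-0.5 * ((k - center) / width) ** 2)


def simulate_channels(sequence, n_samples, spacing=15.0, width=4.0,
                      amp_sd=0.2, jitter_sd=1.0, noise_sd=0.02, seed=0):
    """Return {base: samples} for a base sequence given as a string over ACGT."""
    rng = random.Random(seed)
    # Start each channel with the additive background noise process.
    channels = {b: [rng.gauss(0.0, noise_sd) for _ in range(n_samples)]
                for b in BASES}
    for i, base in enumerate(sequence, start=1):
        t_i = i * spacing + rng.gauss(0.0, jitter_sd)   # peak time with jitter
        a_i = 1.0 + rng.gauss(0.0, amp_sd)              # fluctuating amplitude
        # The Kronecker delta selects the one channel that matches base x_i.
        channel = channels[base]
        for k in range(n_samples):
            channel[k] += a_i * gaussian_pulse(k, t_i, width)
    return channels


traces = simulate_channels("ACGTTGCA", 150)
print({b: round(max(trace), 2) for b, trace in traces.items()})
```

Setting the three standard deviations to zero gives noise-free unit-height peaks, which is a convenient check that each base lands in the correct channel.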
Figure 3.2: Selected compensated time series for same sequencing session as Figure 3.1. Individual channel data has been offset in this figure for clarity. Top curve is for A channel with C, G and T channels presented in order from top.
3.2 Signal Peak Shape
The electrophoresis of a pure molecule should, after the band has moved sufficiently
away from the loading well, lead to a Gaussian shaped peak [9]. The Gaussian
peak is presumed in at least one automatic sequencing algorithm [36]. However,
in the data observed from the Pharmacia ALF sequencer, the peak shape is more
complicated than a simple Gaussian.
Referring again to Figure 3.3, it is evident that most peaks, unlike a Gaussian,
are not symmetric with respect to their tails. In particular, the peaks appear to have
the trailing tail extended.
To provide a cleaner look at these low level tails, a data set was examined which
featured high Signal to Noise Ratio (SNR) and well separated peaks that interfered
only modestly. Micro-Satellite Repeat (MSR) data, used in family genetic studies, fits
these criteria well. MSR product features 3-20 primer labelled ssDNA molecules. The
Pharmacia ALF permits MSR product to be loaded, electrophoresed, recorded and
analysed for size comparison. Figure 3.4 displays the trace of an MSR electrophoresis
session. The peaks are very strong as the fluorophores are spread over only a few
molecular sizes instead of hundreds in the example of Figure 3.1. They are separated
by tens of bases so the tails are relatively free of interference.
Figure 3.3: High resolution view of a segment of the compensated time series (actually Figure 1.2 repeated for reader's convenience). Horizontal axis: k (samples), 0 to 300.
Figure 3.5 shows the regions about the proximal and distal DNA standard peaks
in Figure 3.4. In Figure 3.5, each region had its baseline removed and the peak was
scaled to unit height. The distal peak was lined up with the proximal peak and then
time scaled in an attempt to match the shape of the proximal peak. In Figure 3.5, it
is evident that the peaks are extremely similar. Thus, the only significant difference
in the shape of the peaks in Figure 3.4 is that later peaks have been stretched in time.
This may be formalized by writing the pulse shape as
\[ g_{k,t_i} = g_c\!\left(\frac{k - t_i}{p_w(t_i)}\right) \]
where $k$ is the sample index, $t_i$ is the peak time, $g_c$ is the pulse shape (continuous)
when the pulse width is unity, and $p_w(t)$ describes the dependence of peak width on
peak time.
Figure 3.4: Micro-satellite repeat data trace. Major peaks in time order are: primer peak, proximal DNA standard peak, sample peak, distal DNA standard peak.
Figure 3.6 shows the components of the peak shape. The central peak is a Gaussian
while the tails are exponentials. The trailing exponential has a longer tirne constant
than the other tail. These tails are consistent with those seen in the high amplitude
primer peak in DNA sequencing data.
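A pulse with this three part structure, and its stretching in time, can be sketched as follows. The breakpoints and time constants are illustrative placeholders rather than fitted values, and evaluating a unit-width shape at (k - peak time) / peak width is one reading of the time-scaling description above. The names `unit_pulse` and `sampled_pulse` are hypothetical.

```python
import math


def unit_pulse(u, lead_break=-1.5, trail_break=1.5, lead_tau=0.5, trail_tau=2.0):
    """Unit-width pulse: leading exponential, Gaussian mainlobe, trailing exponential.

    All four parameters are illustrative assumptions. The exponentials are
    matched to the Gaussian at the breakpoints so that the pulse is continuous.
    """
    def gauss(x):
        return math.exp(-0.5 * x * x)

    if u < lead_break:
        return gauss(lead_break) * math.exp((u - lead_break) / lead_tau)
    if u > trail_break:
        # Longer time constant on the trailing side gives the extended tail.
        return gauss(trail_break) * math.exp(-(u - trail_break) / trail_tau)
    return gauss(u)


def sampled_pulse(k, peak_time, peak_width):
    """Pulse value at sample k for a peak centred at peak_time."""
    return unit_pulse((k - peak_time) / peak_width)


print([round(sampled_pulse(k, 100, 5.0), 3) for k in range(80, 131, 10)])
```

Doubling `peak_width` stretches the same shape by a factor of two in time, which mirrors the observation that later peaks look like time-stretched copies of earlier ones.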
As a check on the validity of this candidate generic pulse shape, isolated peaks
from DNA sequencing data were extracted. These had lower SNR than the MSR
peaks and there was evidence of intersymbol interference. These peaks are plotted
in Figure 3.7. Note that the peaks were simply aligned in time and not time scaled.
The peaks do appear to be asymmetric. However, the tails are not as consistent as in
the MSR data. The peak due to base 143 has a trailing exponential tail. The peak
due to base 275 appears to have a much weaker tail; this may be due to an error in
the baseline removal processing. The peak due to base 14 has a trailing tail but it
is far from exponential. Rather it appears to be a small echo of the main peak. As
the inter-base separation is roughly 15 samples, this echo appears approximately two
bases after the main peak. This is consistent with an error due to a bulge in the copy
as described in the previous chapter. Variations such as those in Figure 3.7 are seen
throughout the data.
Figure 3.5: Proximal DNA standard peak (solid line) and distal DNA standard peak (dash-dot). Warped peak (dotted line) was created by scaling the time coordinates by 0.7286. Horizontal axis: relative time (samples).
Two issues emerge: 1) could the peak shape associated with MSR data be fundamentally
different than that of sequencing data, and, 2) how should the peak be
modelled given the range of fluctuations observed? Regarding the first issue, MSR
and sequencing data are distinguished by signal level. The high signal level of MSR
data implies a very large number of molecules of identical size. This large number
may lead to DNA-DNA interactions and a phenomenon known as gel overloading.
Neither of these effects is well understood. One hypothesis is that due to overloading
some DNA molecules may be trapped or wrapped around gel fibers for a long
time. At a later, random time they are released and eventually are detected. Their
arrivals would be expected to have a Poisson distribution and this would lead to the
trailing exponential tail. This would be less likely to happen at lower gel loading and
thus is not seen as frequently in sequencing data.
Figure 3.6: Approximation of proximal peak of Figure 3.4 (dotted line) by leading exponential (samples 1-35, dashed line), Gaussian (samples 36-70, solid line), and decaying exponential (samples 71-200, dashed line). Inset is the logarithm of the same data.
Figure 3.7: Three isolated peaks from DNA sequencing data.
Advancing to the second issue, if the high SNR MSR data does not provide an
accurate model of the sequencing peak shape, and, further, that peak shape appears
to have considerable fluctuation, then perhaps a stochastic model should be employed.
This strategy is adopted by this thesis. The model employs the structure suggested
by the MSR data: leading exponential, Gaussian mainlobe and trailing exponential.
However, the scaling and time constants of the exponentials are not taken from the
MSR data. Rather, they are selected to loosely represent the average tails seen in
the sequencing data. The generic unit width pulse shape for the sequencing data set
discussed in this section is
3.3 Local Covariance Model of Peak Parameters
Now that the generic peak shape has been established, our attention turns to
the parameters necessary for its incorporation into Equation 3.1, specifically, peak
amplitude, $a_i$, peak time, $t_i$, and peak width, $p_w(t_i)$. These parameters may be
characterized in terms of their gross behavior (i.e., long term trends) and local behavior
(i.e., fluctuations and the dependency of these fluctuations on the values of
neighbouring peaks).
Chapter 2 has presented models describing the gross behavior of these parameters.
Amplitude is expected to decay with base number though fluctuations are expected
due to polymerase problems (Section 2.1.4). Peak time is expected to evolve
smoothly (on the scale of hundreds of bases) through Ogston and biased reptation
regimes; however, local fluctuations are expected due to the non-uniformity of the
DNA molecule. Under the biased reptation model, pulse width is expected to grow
linearly then slightly faster than linearly over a sequencing run; as pulse width is
driven by the statistical mechanics of a very large number of molecules, little fluctuation
is expected. Practical application of this knowledge with respect to the trends
in peak amplitude and time leads to the compensated data presented in the previous
section.
Little is known of the local behavior of these parameters. Certainly, Chapter 2 suggested
some mechanisms for local fluctuations but work in the literature has stopped
short of characterizing them other than for the limited number of amplitude sequence
dependencies mentioned in Chapter 2. In this section, a model will be developed
for the local fluctuations in peak parameters including their point probability density
functions and their average dependence on their neighbouring peaks (covariance). The
emphasis will be directed towards a practical model to be used in the development of
sequencing algorithms.
3.3.1 Methods
The α-A-crystalline exon 3 [10] data was obtained and then processed to remove
trends as described in the Appendix. It should be noted that amplitude trend removal
employed a 51 bin moving averager; as will be seen later in this section, this leads to
an artifact with this same period in the amplitude covariance estimate. The methods
employed to extract the peak measurements and form the covariance estimates are
detailed below.
Peak Extraction and Measurement
All measurements assume a directed search for peaks based on knowledge of the
true sequence. After identifying the correct peaks the following procedures were used
to extract the peak parameters.
For the basic peak measurements, we first obtain a background level estimate by
drawing a line through the point halfway back to the previous peak and the point
halfway forward to the next peak. This line is then subtracted from the data. The
peak amplitude is taken as the maximum of the result. The peak width is taken as
the distance between the half amplitude points on either side of the peak.
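The basic measurement just described can be sketched as a single function. The function name, the integer rounding of the halfway points, and the linear interpolation used to locate the half amplitude points between samples are assumptions made for this sketch; they are not details given in the text.

```python
def measure_peak(data, peak, prev_peak, next_peak):
    """Return (amplitude, width) for the peak near sample index `peak`."""
    # Baseline: straight line through the samples halfway to each neighbour.
    left = (prev_peak + peak) // 2
    right = (peak + next_peak + 1) // 2
    slope = (data[right] - data[left]) / (right - left)
    flat = [data[k] - (data[left] + slope * (k - left))
            for k in range(left, right + 1)]

    top = max(range(len(flat)), key=flat.__getitem__)
    amplitude = flat[top]
    if amplitude <= 0:
        return amplitude, 0.0
    half = amplitude / 2.0

    def crossing(step):
        """Walk away from the maximum to the half amplitude point; interpolate."""
        k = top
        while 0 <= k + step < len(flat) and flat[k + step] > half:
            k += step
        nxt = k + step
        if not 0 <= nxt < len(flat):
            return float(k)          # ran off the window without crossing
        frac = (flat[k] - half) / (flat[k] - flat[nxt])
        return k + step * frac

    return amplitude, crossing(+1) - crossing(-1)
```

For an isolated Gaussian peak on a sloping background, the returned width should be close to the Gaussian full width at half maximum (about 2.355 standard deviations).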
For data including unresolved peaks, a more elaborate procedure is followed to
alleviate the effect of neighbouring bases on the measurement. Again the process
begins with the identification of the correct peaks given knowledge of the true sequence.
Also, we start with a linear model of pulse width based on manual pulse width
measurements taken near the start and end of the data set. The peak shape is taken
as Gaussian; the effect of the tails is ignored and is a source of error. Now for the
measurement of each correct peak, the influence of neighbouring peaks is estimated
and removed and then the peak measurement is made as described for isolated peaks.
A multi-step iterative process is used to estimate neighbouring base interference.
For the first pass, as we move through the sequence, peak time and amplitude measurements,
together with the peak width model and peak shape function, are used to
estimate and suppress the influence of previous peaks. The influence of future peaks
is suppressed using a priori mean amplitudes and peak separations together with the
pulse width model and peak shape function. Intermediate parameter estimates are
then obtained as described above for isolated peaks.
Subsequent passes use the measured peak amplitudes and times from the previous
pass as the parameters in suppressing neighbouring peak influence. Updating these
measurements in this fashion leads to improved estimates of the amplitudes and
peak times. However, if pulse width was updated in the same fashion divergence
would be seen; a wide peak would grow wider if its adjacent peaks were seen as
narrow and thus their influence under-estimated. For example, the contribution of
a wider peak to its neighbouring peaks will be over-estimated which in turn would
lead to these neighbours being biased still narrower. With each pass the effect would
be emphasized and divergence from the correct values would result. To avoid this
problem, the peak width estimate that is used in suppressing the contribution to
neighbouring peaks is obtained from a low-order polynomial fit to the previous pass's
peak width measurements. Typically, five passes lead to effective convergence of the
parameter estimates.
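A heavily simplified sketch of the multi-pass idea follows. It re-estimates amplitudes only, holds peak times and widths fixed and known, assumes a Gaussian peak shape, and reads the amplitude directly at the peak sample; the procedure described above also updates peak times and uses smoothed width estimates. The function names and the a priori amplitude of one are assumptions.

```python
import math


def gaussian(k, center, width):
    """Unit-height Gaussian peak shape (tails ignored)."""
    return math.exp(-0.5 * ((k - center) / width) ** 2)


def iterate_amplitudes(data, times, widths, prior_amplitude=1.0, passes=5):
    """Re-estimate peak amplitudes after removing modelled neighbour interference."""
    amps = [prior_amplitude] * len(times)    # a priori mean amplitude for every peak
    for _ in range(passes):
        for i, t_i in enumerate(times):
            k = int(round(t_i))
            # Earlier peaks already carry this pass's estimates; later peaks still
            # carry the previous pass's values (or the prior on the first pass).
            interference = sum(amps[j] * gaussian(k, times[j], widths[j])
                               for j in range(len(times)) if j != i)
            amps[i] = data[k] - interference
    return amps


# Hypothetical overlapping peaks, 15 samples apart with width 8 samples.
true_amps = [1.0, 0.6, 1.3, 0.8]
times = [20, 35, 50, 65]
widths = [8.0] * 4
data = [sum(a * gaussian(k, t, w) for a, t, w in zip(true_amps, times, widths))
        for k in range(90)]
print([round(a, 3) for a in iterate_amplitudes(data, times, widths)])
```

With this modest overlap the estimates settle within a few passes, which is in keeping with the remark that about five passes are typically sufficient.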
As a check on the peak extraction and measurement process, DNA sequencing data
obtained from peaks well isolated from their same lane neighbours was compared with
that incorporating all peaks. This comparison involved the use of scatter plots and
covariance measurements.
The results for all peaks were checked to see if they lay within limits imposed
by the estimation error of the isolated peak measurements. Statistically significant
differences were not seen. The use of isolated peaks and non-isolated peaks together
(typically 350 pairs for a given base separation) allows examination of finer covariance
features than those that could be obtained using only the few (typically only 20-30
pairs were available for a given base separation) available isolated peak measurements.
Covariance Estimation
Once the basic peak parameters have been extracted, a covariance estimate may
be formed as a measure of their average dependence on their neighbouring peaks.
The covariance estimate is
\[ \hat{C}_x(l) = \frac{1}{N_l} \sum_{\{i \,:\, i,\, i+l \,\in\, \{1,\ldots,N\}\}} (\hat{x}_i - \hat{\mu}_{x_i})(\hat{x}_{i+l} - \hat{\mu}_{x_{i+l}}) \tag{3.4} \]
where $\{i : i, i+l \in \{1, \ldots, N\}\}$ defines the set of all pairs of bases $l$ bases apart,
$N_l = |\{i : i, i+l \in \{1, \ldots, N\}\}|$ is the size (cardinality) of that set, $N$ is the total
number of bases, $\hat{x}_i$ is the estimated parameter value at position $i$, and $\hat{\mu}_{x_i}$ is the
estimated mean value of the parameter at position $i$. Equation 3.4 says: take the
average of all pairs of deviates from the estimated mean that are $l$ bases apart ($l$, as
is common in the signal processing literature, will be referred to as the 'lag'). It does
not allow for non-stationarity in the data.
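As a minimal sketch of the estimator described by Equation 3.4 (the function name `lag_covariance` and the list-based inputs are illustrative assumptions, not part of the original work), the average over all pairs of deviates a given number of bases apart could be computed as follows:

```python
def lag_covariance(values, means, lag):
    """Average product of deviates from the estimated mean for all pairs `lag` bases apart.

    values: estimated parameter value at each base position
    means:  estimated mean (trend) value at each base position
    lag:    non-negative separation in bases
    """
    n = len(values)
    if len(means) != n:
        raise ValueError("values and means must have the same length")
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < number of bases")
    deviates = [x - m for x, m in zip(values, means)]
    # All index pairs (i, i + lag) that stay inside the data; there are n - lag of them.
    products = [deviates[i] * deviates[i + lag] for i in range(n - lag)]
    return sum(products) / len(products)


# Toy data (hypothetical): deviates alternate between +1 and -1 about a constant mean.
toy_values = [11.0, 9.0, 11.0, 9.0, 11.0, 9.0]
toy_means = [10.0] * 6
cov_lag0 = lag_covariance(toy_values, toy_means, 0)
cov_lag1 = lag_covariance(toy_values, toy_means, 1)
```

Because a single set of means is subtracted and all pairs are pooled, the sketch shares the stationarity assumption of the estimator: it returns one value per lag regardless of base number.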
If the actual covariance may vary with base number, why adopt a
stationary covariance estimator? Covariance estimates with estimate standard
deviations of 10% of the peak covariance require on the order of a hundred terms in the
summation (see [30] for information on covariance estimate quality). To estimate
the non-stationary covariance to this accuracy, a hundred or so electrophoresis runs
would then be required. These runs would differ with respect to gels and contaminants.
The covariance would then be a measure of run-to-run variation as well as variation
within a run. However, the interest of the sequencing algorithm designer is in what is
predictable within a run. Therefore, covariance within a run is likely to be a more useful
measurement, as it ignores run-to-run variation.
However, there is a non-stationary component to DNA sequencing data; peak
parameters vary with base number in a manner that leads to an increase in sequence
error rate. Our preliminary investigations using parameter covariance estimates formed
from short contiguous sections of data suggested that the general magnitude of the
covariance tends to increase with base number; however, the structure of the
covariance did not vary significantly with base number. Therefore, in the results, we focus
on the variation of the variance with respect to base number, which will affect the
general scaling of the covariance.
3.3.2 Results
In this section, the peak time, amplitude and pulse width measurements are
presented, their trends examined and their covariances calculated. The data is from the
gel electrophoresis of exon 3 of the gene coding the α-A-crystalline protein of the eye.
Note that similar results have been obtained for exon 1, which is about 2000 bases
away on the chromosome.
Correlation Between Lanes
The main thrust is the study of covariance estimates formed from data merged
across lanes. Figure 3.8 provides justification for this approach. To create Figure 3.8,
"G"-labelled product was applied to six adjacent gel lanes. The resulting time series
featured peaks at similar positions with some mis-alignment due to gel spatial
inhomogeneities. The peak locations were extracted. The data was scaled and shifted
so that the end peaks occurred at identical positions in all lanes; large scale trends
were then removed by least-squares fitting of a cubic to each lane's data and
subtracting off the trend. This produced the data shown in Figure 3.8. All six time-series
are plotted but the correlation is so high that they are difficult to distinguish. The
measured correlation coefficient between any pair of lanes is not less than 0.94
(79 data points per lane were used in the measurement). Therefore, lane-to-lane gel
variation must account for less than 12% of the jitter variance. Also, it is clear that,
after large scale trend compensation, the lanes are highly synchronized. Thus, we can
be confident that merging lane data will introduce effects that are relatively small
and localized in time.
Figure 3.8: Peak time jitter for "G"-labelled product applied to six contiguous lanes of the gel (a total of 79 "G" peaks present over the range of 350 bases in the original sequence; horizontal axis: base number). Six overlapping curves are plotted, corresponding to the six gel lanes.
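One way to read the 12% bound, assuming it follows from the usual relation that a correlation coefficient ρ between two series leaves a fraction 1 − ρ² of the variance unshared (this interpretation is an assumption; the text does not state the formula), is sketched below. The function name is illustrative.

```python
def unexplained_variance_fraction(rho):
    """Fraction of variance not shared between two series with correlation coefficient rho."""
    return 1.0 - rho ** 2


min_lane_correlation = 0.94  # lowest lane-to-lane correlation reported in the text
fraction = unexplained_variance_fraction(min_lane_correlation)
print(f"{fraction:.4f}")  # 0.1164, i.e. under 12%
```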
Basic Measurements and Variances
Peak time, amplitude and pulse width were measured as described in Section 3.3.1.
Figure 3.9 presents the measured peak time jitter (the difference between the measured
peak time and that expected based on large scale trends), which shall be denoted as
$\phi_i$. Some evidence of correlation is seen, as adjacent bases tend to have similar jitter
values. Note the increase in scatter with increasing base number. A linear
fit to standard deviation estimates formed using contiguous 50-bin sections of the
jitter in Figure 3.9 yields the standard deviation of the scatter as $\sigma = 1.75 + 0.0143\,i$, where $i$ is the base number.
Figure 3.9: Peak time jitter (horizontal axis: base number).
Figure 3.10 presents the local amplitude estimates. As already discussed, large
scale trends have been estimated and used to normalize the peaks to near unit
amplitude. However, as is evident in Figure 3.10, there remains a residual trend,
seen in the general decrease in local amplitude estimates with increasing base
number. About this trend, the scatter appears to have a consistent range, independent
of base number. Thus, at this level of investigation, amplitude estimates are
stationary (constant variance). The standard deviation of this amplitude scatter, $\sigma_a$,
expressed as a percentage of mean peak amplitude, is 23%.
Figure 3.10: Local peak amplitude estimates (horizontal axis: base number).
Figure 3.11 presents the pulse width estimates. Here, the trend in pulse width
appears to be linear, and a least squares fit yields the pulse width as $pw = 15.08 + 0.0326\,(i - 1)$. Horizontal striations are apparent in Figure 3.11; these correspond
to quantization of pulse width estimates to the nearest sample interval. The scatter
observed in pulse width estimates (standard deviation 10% of local pulse width) is
likely to be largely due to measurement error.
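The linear trends quoted in this subsection can be evaluated and re-fitted with a few lines of code. The sketch below restates the reported fits for the jitter standard deviation and the pulse width, and includes an ordinary least squares line fit of the kind that would produce such coefficients. The function names are illustrative, and the re-fit uses synthetic points generated from the reported trend rather than the measured data.

```python
def jitter_std(base_number):
    """Reported linear fit to the peak time jitter standard deviation."""
    return 1.75 + 0.0143 * base_number


def pulse_width(base_number):
    """Reported least squares fit to the pulse width trend."""
    return 15.08 + 0.0326 * (base_number - 1)


def fit_line(xs, ys):
    """Ordinary least squares fit y ~ a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Synthetic points on the reported pulse width trend (not measured data).
xs = [i - 1 for i in range(1, 351)]
ys = [pulse_width(i) for i in range(1, 351)]
a_fit, b_fit = fit_line(xs, ys)
```

With real measurements the fitted coefficients would of course carry estimation error; here the fit simply returns the intercept and slope used to generate the points.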
Covariances
Figure 3.12 presents the covariance of the timing jitter. The main lobe of
Figure 3.12 has a significant value over approximately 15 bins, indicating correlation in the
jitter extending over 15 base positions; the decaying oscillations evident beyond ±20
bins of the central peak in Figure 3.12 are artifacts of the trend removal processing.
In the inset of Figure 3.12, one side of the main lobe is presented using a logarithmic
scale. It appears to be very well approximated by a straight line, indicating an
exponential decay of the covariance.
An alternative view of timing jitter dependence is obtained through examination
of the difference between successive timing jitter values, $\Delta_i = \phi_i - \phi_{i-1}$. Taking the
difference between successive values removes the portion which is common to both
(i.e. the correlated part), leaving that which is different (i.e. the uncorrelated part).
This allows the examination and measurement of the uncorrelated part without the
artifacts seen in Figure 3.12. The covariance of this jitter difference is presented in
Figure 3.13. Of significance here are the negative peaks on either side of the main
lobe. These are indicative of the additive uncorrelated component in the original time
series. As will be seen in the next section, accurate knowledge of the uncorrelated
component from Figure 3.13, together with the correlation information in Figure 3.12,
allows us to solve for the parameters of a model which explains the observed data.
Figure 3.11: Pulse width estimates (horizontal axis: base number).
Figure 3.14 presents the amplitude covariance. The small values at non-zero lags
suggest that amplitude fluctuations are uncorrelated. Note that the general offset
from zero is due to large scale trends which were not completely removed prior to the
covariance calculations.
Figure 3.15 presents the pulse width covariance. Here some evidence of correlation
is seen in the first two lags. The covariance at these two lags is roughly 15% of the
zero lag value. Thus, 15% of the scatter in pulse width can be predicted from one
base to the next. However, as the scatter standard deviation is only 10% of the pulse
width, knowledge of the previous pulse width allows us to use a pulse width estimate
with error decreased by 1.5% of the pulse width. This improvement is insignificant in
terms of its potential impact on sequence error rate and, therefore, the pulse width
is treated as locally uncorrelated.
Figure 3.12: Covariance of peak time jitter (horizontal axis: lag in bases). The monotonically increasing region just to the left of and including lag zero, together with the monotonically decreasing region to its right, is referred to as the mainlobe. The inset is a logarithmic plot of the right side of the mainlobe.
Model
A model has been developed which reflects the correlation of the sequence peaks
over time and between channels. These measurements are well modelled by the system
presented in the block diagram of Figure 3.16.
The observed peak times, $t_i$, are modelled as
\[ t_i = \mu_{T_i} + \phi_i \tag{3.5} \]
where $i$ denotes base number, $\mu_{T_i}$ is the a priori mean expected value and $\phi_i$ is the
observed jitter. Practically, the peak time expected from the large scale trends would
be substituted for $\mu_{T_i}$; for the Pharmacia ALF data, $\mu_{T_i}$ is very close to a linear
function of $i$.
Figure 3.13: Covariance of difference between successive peak time jitter values (horizontal axis: lag in bases).
The timing jitter process is described by the following equations:
\[ \zeta_i = \beta\,\zeta_{i-1} + v_i \tag{3.6} \]
\[ \phi_i = \zeta_i + w_i \tag{3.7} \]
The state variable $\zeta$ incorporates the correlation memory of the system through the
auto-regressive weighting, $\beta$. Here, a large $\beta$ implies the jitter is constrained to be
similar to past values. The jitter process is driven by a white, zero-mean Gaussian
source, $v_i$, of variance $\sigma^2_{v_i}$; in the systems modelling literature, this would be referred
to as an 'input disturbance'. This input disturbance reflects the freedom of the
individual DNA molecules in choosing their 'random' path through the gel. The additive,
white, zero-mean Gaussian measurement noise, $w_i$, has variance $\sigma^2_{w_i}$. This
measurement noise may indeed be due to additive time series noise pulling the observed peak
location away from its noise-free location. However, it may also reflect other
phenomena such as mobility differences based on terminal sequence. Strictly speaking,
$\phi_i$ is not permitted to be a value which would place the peak before a previous peak
or after a subsequent peak; in practice, the means and standard deviations are such
that such values are unlikely to arise.
Figure 3.14: Peak amplitude covariance (horizontal axis: lag in bases).
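A minimal simulation sketch of the jitter process follows, assuming the first-order auto-regressive form implied by the description: a state that is β times its previous value plus a white Gaussian input disturbance, observed with additive white Gaussian measurement noise. The function name, the zero initial state and the constant variances are simplifying assumptions; the parameter values in the usage line are the averages reported later in this section and serve only as an illustration.

```python
import random


def simulate_jitter(n_bases, beta, var_v, var_w, seed=0):
    """Simulate observed jitter: auto-regressive state plus white measurement noise."""
    rng = random.Random(seed)
    sd_v = var_v ** 0.5
    sd_w = var_w ** 0.5
    state = 0.0  # assumed zero initial state
    jitter = []
    for _ in range(n_bases):
        # The input disturbance drives the state; beta carries the memory.
        state = beta * state + rng.gauss(0.0, sd_v)
        # Measurement noise is added to the state to give the observed jitter.
        jitter.append(state + rng.gauss(0.0, sd_w))
    return jitter


phi = simulate_jitter(n_bases=350, beta=0.85, var_v=4.79, var_w=3.71)
```

The sketch does not enforce the ordering constraint mentioned above (a peak may not pass its neighbours); as the text notes, such values are unlikely for realistic means and standard deviations.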
The choice of Gaussian distributions for the measurement noise and input
disturbance reflects the histograms formed from the data. The histogram of the total
jitter (Figure 3.17) is a monomodal plot whose mainlobe may be approximated by a
Gaussian; insufficient samples are available to form a hypothesis regarding the tails
of the distribution. The histogram of the difference between successive peak time
jitter values, Figure 3.18, seems similarly Gaussian; the difference is dominated by
the measurement noise and so this directly suggests that the measurement noise is
Gaussian. Given that the total jitter and the measurement noise appear Gaussian, it
is not unreasonable to suggest that their difference is Gaussian and hence the jitter
process input, $v$, is Gaussian.
For this model, the theoretical time jitter covariances will now be derived.
Expressions are developed for obtaining key parameters from observed covariances. By
iterative application of Equations 3.6 and 3.7, the observation can be written as
\[ \phi_i = \sum_{j=1}^{i} \beta^{i-j}\, v_j + w_i \]
where the base number, $i$, is greater than zero. The covariance at lag $k \ge 0$ is then
\[ E[\phi_i \phi_{i+k}] = \sum_{j=1}^{i} \beta^{2i-2j+k}\, \sigma^2_{v_j} + \sigma^2_{w_i}\, \delta_k \]
where $E[\cdot]$ denotes the expectation operation, $\delta_k$ is one at $k = 0$ and zero otherwise,
and both $v$ and $w$ are zero-mean, white
processes. Now if $v$ is a slowly non-stationary process with respect to $i$ relative to the
weight imposed by $\beta^{2i-2j}$, then $\sigma^2_{v_j}$ may be replaced by $\sigma^2_{v_i}$. Using the properties of the
geometric series, it can be shown that $\sum_{j=1}^{i} \beta^{2i-2j} = (1 - \beta^{2i})/(1 - \beta^2)$. Employing
this and the assumption that $i$ is large and $k$ small eventually yields
\[ E[\phi_i \phi_{i+k}] = \frac{\beta^{k}}{1-\beta^2}\, \sigma^2_{v_i} + \sigma^2_{w_i}\, \delta_k \tag{3.9} \]
Note the exponential decay with lag $k$ in this equation. Thus $\beta$ may be estimated as
the exponent of the slope of the log covariance estimate.
Figure 3.15: Pulse width covariance (horizontal axis: lag in bases).
Figure 3.16: Block diagram of peak parameter system model.
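The slope-based estimate can be sketched as a straight-line fit to the logarithm of the main lobe covariance, with the weighting recovered as the exponential of the fitted slope. In this sketch the function name is illustrative, and lag zero is left out on the assumption that it also contains the measurement noise variance and so does not lie on the same line.

```python
import math


def estimate_beta(covariances):
    """Estimate beta from covariance values at lags 1, 2, ... on the main lobe.

    Fits a straight line to log(covariance) against lag by least squares and
    returns exp(slope). All covariance values must be positive.
    """
    lags = list(range(1, len(covariances) + 1))
    logs = [math.log(c) for c in covariances]
    mean_lag = sum(lags) / len(lags)
    mean_log = sum(logs) / len(logs)
    slope = (sum((l - mean_lag) * (y - mean_log) for l, y in zip(lags, logs))
             / sum((l - mean_lag) ** 2 for l in lags))
    return math.exp(slope)


# Synthetic check (hypothetical data): an exactly exponential main lobe with beta = 0.85.
synthetic = [17.0 * 0.85 ** k for k in range(1, 8)]
beta_hat = estimate_beta(synthetic)
```

On measured covariances the fit would be restricted to the lags where the main lobe is clearly above the estimation error.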
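The covariance of the differences is obtained by first writing an expression for the
difference:
\[ \Delta_i = \phi_i - \phi_{i-1} = \sum_{j=1}^{i} \beta^{i-j}\, v_j - \sum_{j=1}^{i-1} \beta^{i-1-j}\, v_j + w_i - w_{i-1} \]
The covariance is then
\[ E[\Delta_i \Delta_{i+k}] = E[\phi_i \phi_{i+k}] - E[\phi_i \phi_{i+k-1}] - E[\phi_{i-1} \phi_{i+k}] + E[\phi_{i-1} \phi_{i+k-1}] \tag{3.10} \]
Now assume $w$ is a slowly non-stationary process such that $\sigma^2_{w_{i-1}} = \sigma^2_{w_i}$ and $v$ is
a slowly non-stationary process with respect to $i$ relative to the weight imposed by
$\beta^{2i-2j}$. Then for large $i$ and small $k$, Equation 3.10 becomes
\[ E[\Delta_i \Delta_{i+k}] = \begin{cases} \dfrac{2}{1+\beta}\,\sigma^2_{v_i} + 2\sigma^2_{w_i}, & k = 0 \\[1ex] -\dfrac{1-\beta}{1+\beta}\,\sigma^2_{v_i} - \sigma^2_{w_i}, & |k| = 1 \\[1ex] -\beta^{|k|-1}\,\dfrac{1-\beta}{1+\beta}\,\sigma^2_{v_i}, & |k| \ge 2 \end{cases} \tag{3.11} \]
Figure 3.17: Histogram of scaled peak time jitter (horizontal axis: jitter in samples). To ensure comparability of samples, data was divided (scaled) by the jitter standard deviation linear trend prior to forming the histogram.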
To obtain the model parameters from measured covariances, first estimate $\beta$ as
the exponent of the slope of the log covariance estimate. Using the lag zero value
of the covariance, Equation 3.9, and the lag one value of the difference covariance,
Equation 3.11, the following system of simultaneous equations may be written:
\[ \begin{aligned} E[\phi_i \phi_i] &= \frac{1}{1-\beta^2}\,\sigma^2_{v_i} + \sigma^2_{w_i} \\ E[\Delta_i \Delta_{i+1}] &= -\frac{1-\beta}{1+\beta}\,\sigma^2_{v_i} - \sigma^2_{w_i} \end{aligned} \tag{3.12} \]
These equations may be summed to yield
\[ E[\phi_i \phi_i] + E[\Delta_i \Delta_{i+1}] = \frac{\beta\,(2-\beta)}{1-\beta^2}\,\sigma^2_{v_i} \]
Combining terms, recognizing $E[\phi_i \phi_i]$ as $\sigma^2_{\phi_i}$ and introducing $k_\Delta = E[\Delta_i \Delta_{i+1}] / E[\phi_i \phi_i]$
(i.e. the ratio of the lag 1 negative peak of the covariance of the jitter differences
(Figure 3.13) to the peak time jitter variance (lag 0, Figure 3.12)) yields the jitter
process variance as
\[ \sigma^2_{v_i} = \frac{(1-\beta^2)(1+k_\Delta)}{\beta\,(2-\beta)}\,\sigma^2_{\phi_i} \]
Then, this expression for $\sigma^2_{v_i}$ may be substituted into Equation 3.12 and the result
solved to yield the measurement noise variance as
\[ \sigma^2_{w_i} = \left[ 1 - \frac{1+k_\Delta}{\beta\,(2-\beta)} \right] \sigma^2_{\phi_i} \]
where the jitter standard deviation is
\[ \sigma_{\phi_i} = a_\phi + b_\phi\, i \]
where $a_\phi$ and $b_\phi$ are the coefficients obtained from a linear fit to the standard deviation
estimates.
Figure 3.18: Histogram of scaled difference between adjacent peak time jitter values (horizontal axis: jitter difference in samples). To ensure comparability of samples, data was divided (scaled) by the jitter standard deviation linear trend prior to forming the histogram.
It is also possible to refine the $\beta$ estimate using the recovered variances. A more
sophisticated technique for estimating $\beta$ and the variances would exploit all the
information available in the covariances.
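The recovery of the two variances can be sketched numerically. The closed forms used below are derived here under the assumption of a first-order auto-regressive state with weighting β observed in white noise, in the limit of large base number: the jitter variance is taken as σ_v²/(1 − β²) + σ_w² and the lag-one difference covariance as −(1 − β)σ_v²/(1 + β) − σ_w². They are a reconstruction for illustration rather than a quotation of the original expressions, and the function name is hypothetical.

```python
def recover_variances(beta, jitter_variance, diff_cov_lag1):
    """Solve the two-equation system for (var_v, var_w).

    jitter_variance: lag-zero covariance of the jitter
    diff_cov_lag1:   lag-one covariance of the jitter differences (negative)
    """
    k_delta = diff_cov_lag1 / jitter_variance
    # Summing the two equations eliminates var_w:
    #   jitter_variance * (1 + k_delta) = var_v * beta * (2 - beta) / (1 - beta**2)
    var_v = jitter_variance * (1.0 + k_delta) * (1.0 - beta ** 2) / (beta * (2.0 - beta))
    # Back-substitute into the lag-zero equation to get the measurement noise variance.
    var_w = jitter_variance - var_v / (1.0 - beta ** 2)
    return var_v, var_w


# Round-trip check on synthetic values (not measured data).
beta, true_v, true_w = 0.8, 2.0, 1.0
lag0 = true_v / (1 - beta ** 2) + true_w
diff1 = -(1 - beta) / (1 + beta) * true_v - true_w
est_v, est_w = recover_variances(beta, lag0, diff1)
```

The round trip only confirms that the inversion is consistent with the assumed forward model; it says nothing about how well that model fits measured covariances.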
The amplitude and pulse width modelling is simple, as the covariances are assumed
to be zero for all lags other than zero. The amplitude is a truncated Gaussian random
variable of unit mean and variance $\sigma_a^2$; the truncation restricts the amplitude to
positive values only, with the probability density function rescaled for unit area. The
pulse width is a Gaussian random variable with mean
\[ \mu_{pw_i} = a_{pw} + b_{pw}\,(i - 1) \]
where $a_{pw}$ and $b_{pw}$ are the coefficients obtained from a linear fit to the pulse width
estimates. The pulse width has a constant variance, $\sigma^2_{pw}$.
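Drawing amplitudes and pulse widths from these two distributions can be sketched as follows. Rejection sampling is used here as one simple way to realize a Gaussian truncated to positive values; the function names are illustrative, the pulse width standard deviation of 1.5 is a hypothetical value chosen to be roughly 10% of the width, and the linear trend coefficients are those of the least squares fit reported earlier.

```python
import random


def sample_amplitude(rng, sd_a):
    """Draw from a Gaussian of mean one truncated to positive values (rejection sampling)."""
    while True:
        a = rng.gauss(1.0, sd_a)
        if a > 0.0:
            return a


def sample_pulse_width(rng, base_number, a_pw, b_pw, sd_pw):
    """Draw a Gaussian pulse width about the linear trend a_pw + b_pw * (base_number - 1)."""
    return rng.gauss(a_pw + b_pw * (base_number - 1), sd_pw)


rng = random.Random(1)
amplitudes = [sample_amplitude(rng, sd_a=0.23) for _ in range(1000)]
widths = [sample_pulse_width(rng, i, 15.08, 0.0326, sd_pw=1.5) for i in range(1, 351)]
```

Discarding non-positive draws is equivalent to rescaling the density over the positive axis to unit area; with a standard deviation of 0.23 about a mean of one, rejections are rare.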
While the amplitude and pulse width components of the model are simple and
direct reflections of the measurements, the peak jitter portion of the model is more
complicated. The validity of this part of the model will now be examined by
estimating the model parameters and then performing a simple numerical check and a
graphical comparison of the measured and theoretical covariances.
From the inset of Figure 3.12, $\beta$ may be estimated at 0.85. Figure 3.12 yields the
average $E[\phi_i \phi_i] = \sigma^2_{\phi_i}$ as 21 and Figure 3.13 yields the average $E[\Delta_i \Delta_{i+1}]$ as -3.1; the
term "average" is used here as the covariance estimates average over base number,
$i$. Applying Equations 3.8 and 3.9 yields the average $\sigma^2_{v_i}$ as 4.79 and $\sigma^2_{w_i}$ as 3.71.
As a check, these estimated variances are substituted into Equation 3.11, which is
then evaluated at lag 0 to yield the average $E[\Delta_i \Delta_i] = 12.6$. This agrees with the
measured value from Figure 3.13 of 12.9 within the expected measurement error.
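The lag-zero check can be reproduced in a few lines, again assuming the first-order auto-regressive-plus-white-noise reading of the model. Under that assumption the large base number lag-zero values are σ_v²/(1 − β²) + σ_w² for the jitter and 2σ_v²/(1 + β) + 2σ_w² for the jitter differences (expressions derived here for illustration).

```python
beta = 0.85    # estimated from the inset of Figure 3.12
var_v = 4.79   # average jitter process (input disturbance) variance
var_w = 3.71   # average measurement noise variance

# Lag-zero covariance of the jitter itself.
jitter_var = var_v / (1 - beta ** 2) + var_w
# Lag-zero covariance of the difference between successive jitter values.
diff_var = 2 * var_v / (1 + beta) + 2 * var_w

print(round(jitter_var, 1), round(diff_var, 1))  # 21.0 12.6
```

Both values match the figures quoted in the text: the jitter variance of 21 read from Figure 3.12 and the computed difference variance of 12.6.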
Figures 3.19 and 3.20 present the theoretical covariances of the model timing jitter
for comparison with Figures 3.12 and 3.13. Such comparison may only be done to a
confidence limit imposed by errors introduced by artifacts and estimation error. In
Figure 3.12, the trend removal artifact may be recognized by its 51 bin period. The
shorter period variations are due to estimation error; these would be smaller if more
data points were available for covariance estimation. So, to compare Figure 3.12
with Figure 3.19, visually subtract off the trend removal artifact and then impose
confidence limits equal to the extremes of the short period variations. The data of
Figure 3.19 then lies within these limits, particularly in the mainlobe region. Similar
agreement is seen in the jitter difference covariances of Figure 3.20 after allowing for
the estimation error exemplified by the data for lags 10-100.
3.3.3 Discussion
The covariances observed are understandable in light of the differences in the
processes underlying peak times, amplitudes and widths.
The local peak times are the result of large polymers, identical except for the last
few bases, moving in a similar fashion through a gel. The high correlation seen in
the multilane data (Figure 3.8) indicates that it is not necessary that the molecules
follow the same paths through the gel to obtain correlation in peak times. In fact,
this suggests that the gel itself is not the limiting factor in the correlation. Rather,
it is the similarity between the DNA molecules that determines the correlation.
Further, from Figure 3.12, the correlation decreases exponentially with a decay
length of about 5 bases and becomes insignificant above about 15 bases. The recent
work of Tinland et al. [32] indicates that the persistence length of ssDNA is about
p=2-5 nm, or about p=5-12 bases (at 0.43 nm/base). The correlation in the peak
times and the persistence length of the molecule may be related. The persistence
length of ssDNA is a measure of how far apart two points on a polymer need be for their
spatial orientation to be uncorrelated [31]. In other words, p reflects the stiffness of
the polymer chain. Most models of DNA gel electrophoresis predict that the mobility
of the analyte is related to its mean elongation in the field direction. It is clear that
for molecules of less than a persistence length difference in size (contour length), the
mean elongations will be strongly correlated. For instance, if the base composition
of a ssDNA molecule with M monomers makes the elongation slightly smaller (or
larger) than expected (as estimated for a generic chain of M monomers), then the
elongation of an M+1 monomer chain will also be smaller (larger) than expected.
Such correlations will extend over roughly one persistence length, and would thus
affect the expected mobilities accordingly. These effects have yet to be included in
the gel electrophoresis theories as theorists are more concerned with average trends
rather than specific cases.
Figure 3.19: Theoretical covariance of peak time jitter for the system of Figure 3.16. The inset is a logarithmic plot of the right side of the mainlobe.
Figure 3.20: Theoretical covariance of difference between successive peak time jitter values for the system of Figure 3.16.
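The conversion of the persistence length from nanometres to bases is a single division by the 0.43 nm per base figure quoted above. A small sketch (the function name is illustrative):

```python
NM_PER_BASE = 0.43  # contour length per base quoted in the text


def persistence_length_in_bases(length_nm, nm_per_base=NM_PER_BASE):
    """Convert a persistence length in nanometres to an equivalent number of bases."""
    return length_nm / nm_per_base


low = persistence_length_in_bases(2.0)
high = persistence_length_in_bases(5.0)
print(f"{low:.1f} to {high:.1f} bases")  # 4.7 to 11.6 bases
```

This is consistent with the quoted range of about 5-12 bases, which brackets the observed correlation decay length.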
The amplitude of a particular peak is determined by how many molecules
incorporate a terminator (ddNTP) at that base position. For our primer-labelled molecules,
the terminator differs from the normal nucleotide (dNTP) only in that it has a
hydrogen rather than a hydroxyl group on the 3' carbon. For such a small difference,
well removed from the location participating in the last condensation reaction, one
would expect little correlation in amplitudes, as simple random chance determines
incorporation of a nucleotide or terminator. On the other hand, some dependencies on
terminal sequence are seen, such as a rise in amplitude in a run of C's. However, these
runs are infrequent in our data and thus do not evidence a significant effect in the
measured covariance. Note that the consumption of primer-labelled substrate leads
to large scale decay in amplitude; however, for the ddNTP and dNTP concentrations
used, the effect of this decay on a base-by-base basis would be of the order of one
percent or less and thus does not impact significantly on the observed covariances.
The pulse width is determined by the distribution of like molecules in an
electrophoresis band. This distribution is in turn governed by thermodynamics and
diffusion [9]. Presuming the loading is too low for significant DNA-DNA interactions,
but nonetheless such that a very large number (x 10") of molecules participate in
each band, stable bands are expected whose width would follow a simple large scale
trend with respect to base number. Local fluctuations are likely to be insignificant
due to the large number of molecules involved. Given the elaborate scheme used to extract
the pulse width estimates, it is likely that the fluctuations observed in the pulse width
estimates are largely due to estimation error.
3.4 Noise Process Model
Our attention is now directed at the last term in Equation 3.1, the additive noise,
$n_k$. Physical phenomena, such as integrated sensor shot-noise and pre-amplifier
thermal noise, give rise to an uncorrelated Gaussian component. As implied in Section
2.2.6, this comprises only a small fraction of the total noise faced by the sequencing
algorithm.
Chemical phenomena in fact dominate the additive noise. This chemical noise
is created by any molecule that: (1) is not part of the true sequence, (2) passes by
the detectors, and (3) fluoresces. These molecules arise largely out of the chemical
processes involved in DNA sequencing; other chemical contaminants added at loading
time and in the gel also contribute.
Some noise molecules are part of the sequencing product placed in the loading
well. The discussion of fidelity in Section 2.1.4 presented a number of mechanisms
for the creation of anomalous peaks (i.e., noise molecules). These mechanisms lead to
labelled ssDNA that is not part of the true sequence. As these molecules are labelled
ssDNA and do undergo electrophoresis, the resulting noise peaks should have the
same shape as the correct peaks. As they are an integral number of bases long, the
peaks should tend to be more likely at the same points true peaks would be likely
to occur. Thus, the noise would tend to be cyclostationary with period equal to the
mean inter-base period.
Hydrolysis (Section 2.1.5) will also contribute to this noise. Hydrolysis prior to
loading can lead to labelled products that are an integral number of bases long and
so contribute to the noise as discussed in the previous paragraph. Other labelled
hydrolysis products (see Figure 2.6) can have a length that is not an integral number
of bases. If present in the product at loading, they will have the same peak shape as
the correct peaks, as peak shape is driven by diffusion. However, their peaks will be
likely to occur anywhere in the time-series rather than just in the regions where the
true peaks are likely to occur. These peaks would contribute uniformly to the overall
noise level.
Hydrolysis products produced during electrophoresis at a specific time and at a
specific location in the gel will have a peak width (and hence spectrum) determined
by the time of migration to the detectors. Generally these products may be produced
anywhere in the gel and at any time. The actual noise spectrum for this type of
noise is then an integral over space and time of creation of these hydrolysis products.
Relative to hydrolysis prior to loading, narrower peaks are expected (i.e. more energy
at higher frequencies). This is due to hydrolysis occurring near the detector.
This theoretical noise model, derived from the underlying chemistry and physics,
has two features which are easy to estimate and two features which are difficult to
estimate. The tractable features are the additive white noise and the diffusion driven
noise peak shape. The features that are difficult to estimate are the cyclostationarity and
the temporally-spatially integrated hydrolysis noise. The degree to which the last two
noise features are present depends on the actual sequencing process. In the absence of
DNA, these features are absent, except for possibly fluorescent contaminants. Thus,
noise data must be extracted from DNA sequencing data regions where true peaks
are absent. These are of limited length and so it is difficult to get the degrees of
freedom necessary for a good measurement.
Therefore, a less accurate but more practical noise model is proposed. It consists
of a white noise component and a coloured noise component. The coloured noise
component is obtained by driving a white source through a filter whose impulse
response is the correct peak pulse shape. This essentially averages over the period
of the cyclostationary component of the theoretical model. The temporally-spatially
integrated hydrolysis noise may be somewhat accounted for by adjusting the relative
levels of the white noise and coloured noise components.
Combining the two noise components together yields the noise spectrum as
\[ N(\omega, t) = N_0(t) + K_c\, |G(\omega, t)|^2 \]
where $G(\omega, t) = F(g_k | t) = \sum_k g_{k|t} \exp(-j \omega k / K)$ is the discrete Fourier transform of
the pulse shape with $K$ the total number of samples, $N_0$ is the white noise spectral
level and $K_c$ sets the coloured noise spectral level. The levels are scaled so that the
white noise variance is $\sigma^2_{n_w}(t) = (1/2\pi) \int_{-\pi B}^{\pi B} N_0(t)\, d\omega$ and the coloured noise variance is
$\sigma^2_{n_c}(t) = (1/2\pi) \int_{-\pi B}^{\pi B} K_c\, |F(g_k | t)|^2\, d\omega$. Here $B$ is the single sided sampling bandwidth,
$B = (1/2) f_s$, where $f_s$ is the sampling frequency. Note that as the pulse width is
dependent on peak time, the noise is non-stationary and $N(\omega, t)$ refers to the noise
spectra at time $t$. The noise is also assumed to be Gaussian.
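The two-component noise model can be sketched as a white Gaussian source passed through a filter whose impulse response is the pulse shape, plus an independent white Gaussian term. In the sketch below the Gaussian pulse is a hypothetical stand-in for the actual peak pulse shape, the gain and standard deviation values are arbitrary, and the pulse width is held fixed, so the non-stationarity with peak time is not represented.

```python
import math
import random


def gaussian_pulse(width, half_length):
    """Hypothetical stand-in pulse shape: sampled Gaussian with standard deviation `width`."""
    return [math.exp(-0.5 * (k / width) ** 2) for k in range(-half_length, half_length + 1)]


def simulate_noise(n_samples, pulse, white_sd, coloured_gain, seed=0):
    """White noise plus a white source filtered by the pulse shape."""
    rng = random.Random(seed)
    half = len(pulse) // 2
    # Extra source samples at both ends so the filtered output has no edge transient.
    source = [rng.gauss(0.0, 1.0) for _ in range(n_samples + 2 * half)]
    noise = []
    for n in range(n_samples):
        # Sliding weighted sum of the white source; for a symmetric pulse this
        # is the same as convolution with the pulse shape.
        coloured = sum(h * source[n + j] for j, h in enumerate(pulse))
        noise.append(coloured_gain * coloured + rng.gauss(0.0, white_sd))
    return noise


pulse = gaussian_pulse(width=6.0, half_length=24)
noise = simulate_noise(n_samples=500, pulse=pulse, white_sd=0.05, coloured_gain=0.02)
```

Here `coloured_gain` plays the role of the square root of the coloured noise spectral level, and `white_sd` sets the white noise level; adjusting their ratio is the mechanism the text suggests for partly absorbing the integrated hydrolysis noise.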
Figure 3.21 presents a noise spectrum estimate obtained from a 23 base section
(bases 297-319) of sequencing data that was free of true peaks. The spectral estimate
was formed using a single Fourier transform of the data after weighting by a Kaiser
window with the Kaiser $\beta$ parameter set to 6 [34]. Thus, the rapid fluctuations along
adjacent frequency cells are due to the lack of averaging. However, trends in the
spectra are a true reflection of the data, as the sidelobes imposed by the Kaiser window
are extremely low, in excess of 100 dB down over much of the spectrum. The large energy
in the low frequency region is consistent with the expected coloured noise component.
At higher frequencies, there is evidence of the white noise component as the spectral
slope is reduced.
3.5 Simulated Data from Model
Equations 3.1-3.3 and 3.5-3.18 in combination with Gaussian random number
generators and a predetermined sequence may be used to generate simulated DNA
time-series. Figures 3.22 and 3.23 present the results of a simulation where the base
sequence and parameters were identical to that of Figures 3.2 and 3.3. Thus, direct
comparison is possible. This comparison should be for "similar character" rather than
identical waveforms, as ideally the two data sets represent different realizations of the
same process.
Figure 3.21: Noise spectrum estimate for "A" channel bases 297-319 (horizontal axis: frequency in radians/sample).
Figure 3.22 demonstrates the same rise in level for later bases as is seen in
Figure 3.2. This effect is driven by the increasing pulse width. The noise fluctuations
also appear to be similar. Figures 3.23 and 3.3 feature similar pulse shapes and their
peak level fluctuations are comparable in size. The tails of the peaks also seem
similar, though the noise background fluctuations seem to mask the differences. One
discrepancy is in the run of C's, where in Figure 3.3 a rise in amplitude with base
number occurs while it does not in Figure 3.23, as this sequence dependent effect is
not modelled.
Visual examination for the effects of the timing jitter is more difficult. On this
scale, it is hard for the eye to assess the jitter by looking for extrema locations.
However, the amplitudes of adjacent peaks provide a cue, as peaks closer together will have
a stronger sum while those further apart will affect each other less. Unfortunately,
discrimination of this effect from normal amplitude fluctuations requires examination
and measurement of many peaks.
Figure 3.22: Simulated compensated time series for comparison with the real data of Figure 3.2. Individual channel data has been offset in this figure for clarity. The top curve is for the A channel, with the C, G and T channels presented in order from the top.
Simulation may be used to investigate the impact of modifications to the sequencing
process. The physical and chemical phenomena involved may be interpreted to
lead to certain parameter changes, and these revised parameter values may be used
in the simulation to assess the impact on algorithm performance. Adjustments to
the algorithm may then be tried and evaluated to identify an appropriate method for
ameliorating any deleterious effects of these modifications.
For example, reducing the percentage of cross-linking in the gel would reduce the
flow resistance and hence the time required for the DNA to move through the gel.
However, the diffusion coefficient would increase faster than mobility [9], so the band
width in the gel would increase. Therefore, in addition to being closer together, the
peaks in the DNA time-series would be relatively wider and interference between peaks
would be a bigger problem. The model may be used to generate controlled simulated
data with various peak separations and widths, and then the sequencer settings may be
optimized for each separation and width. It would be difficult and time consuming to
generate data sets experimentally with predetermined peak separations and widths.
Model based simulation facilitates controlled investigation of process and algorithm
features, and may be used to compare alternative sequencing algorithms.
Figure 3.23: High resolution view of a segment of the simulated compensated time series (compare with Figure 3.3).
3.6 Significance and Novelty
This chapter presents the first statistical model of the DNA time-series. The
chemistry and physics of DNA sequencing have been translated into a form where engineers
and mathematicians can directly contribute to the development of new sequencing
algorithms. The model also forms the foundation for further model development based
on extensions that incorporate additional attributes of the data. Finally, the model
may be used to generate simulations for the comparison of sequencing algorithms. It
provides the basis for a standard for the evaluation of DNA sequencing algorithms.
CHAPTER 4
Maximum Likelihood Sequence Detection
4.1 The Maximum Likelihood Concept
The optimum Maximum Likelihood (ML) processor selects the sequence, $\hat{\underline{x}}$, that
maximizes the probability of the observation, $\underline{y}$, given a signal model as in
$$\hat{\underline{x}} = \arg\max_{\tilde{\underline{x}}} \, p(\underline{y} \mid \tilde{\underline{x}}) \qquad (4.1)$$
where the hat is used to indicate the best estimate, the tilde indicates test values,
"arg max" returns the test value that maximizes the expression on its right and
$p(\underline{y} \mid \tilde{\underline{x}})$ is the conditional probability density function (pdf) of the observation.
Essentially, for each hypothesized sequence, it generates the expected signal waveform and compares
it with the observed waveform. It must search over all possible hypotheses, evaluating
the probability of the observation for each hypothesis, in order to find the best.
The ML Sequence Detector (MLSD) is universal; it is appropriate for any signal
and noise model. Other popular processors such as the linear equalizer and decision
feedback equalizer are structured around specific signal features. In particular, these
equalizers assume fixed symbol times and are oriented towards minimizing Inter-Symbol
Interference (ISI) and noise at these decision points. Due to the high peak time
jitter of DNA time-series, these equalizers are ill-suited for DNA sequencing. The
Maximum A Posteriori (MAP) processor is also universal. It brings a priori symbol
probabilities into the decision process. For equally probable sequences, the MAP
sequence detector¹ reduces to MLSD. This being the general case, it is appropriate
to select MLSD for the DNA sequencing problem. The resulting processor will be
referred to as the DNA-ML algorithm.
4.2 Additive White Noise Finite Response
To provide a context in which the extensions required for DNA-ML are evident, a
simple MLSD example will now be examined. This Additive White Gaussian Noise
(AWGN) Finite Impulse Response (FIR) example presumes the received signal is
corrupted by the addition of white noise. The signal has also been stretched and
distorted by the channel medium so that previous symbols interfere with the current
symbol. Accordingly, the k-th sample of the observation is given as
$$y_k = \sum_{j=0}^{N_h - 1} h_j x_{k-j} + n_k \qquad (4.2)$$
where $n_k$ is the white noise, $x_k$ is the information symbol and $h_j$ describes the channel
impulse response.
The noise is a zero-mean Gaussian random process. The channel impulse response
is presumed fixed and known. Then the probability density function (pdf) for the
received sample is just that of the noise shifted by the distorted signal. More formally,
the conditional pdf is given by
$$p(y_k \mid \underline{x}) = p_n\Big(y_k - \sum_{j=0}^{N_h - 1} h_j x_{k-j}\Big) \qquad (4.3)$$
where $\underline{x} = \{x_1, x_2, \ldots\}$ is the symbol sequence² and $p_n(n)$ is the noise pdf. As the noise
is white and Gaussian, the noise samples are independent and thus their joint pdf is
the product of their individual pdfs. This then allows writing the joint conditional
pdf for the entire observation as
$$p(\underline{y} \mid \underline{x}) = \prod_{k=1}^{N_k} p_n\Big(y_k - \sum_{j=0}^{N_h - 1} h_j x_{k-j}\Big) \qquad (4.4)$$
where $N_k$ is the total number of samples. The MLSD processor must choose the
sequence $\tilde{\underline{x}}$ that maximizes Equation 4.4 for the observed $\underline{y}$.
¹ Note that the MAP symbol detector does not reduce to MLSD as in some cases the most probable symbol at a location is not necessarily that which yields the most likely symbols for its neighbouring locations.
As the logarithm function is monotonic and increasing, maximizing Equation 4.4
is equivalent to minimizing the negative logarithm of Equation 4.4. The maximum
likelihood sequence is then
$$\hat{\underline{x}} = \arg\min_{\tilde{\underline{x}}} \big(-\log p(\underline{y} \mid \tilde{\underline{x}})\big) = \arg\min_{\tilde{\underline{x}}} \Big( \sum_{k=1}^{N_k} -\log p_n\Big(y_k - \sum_{j=0}^{N_h - 1} h_j \tilde{x}_{k-j}\Big) \Big) \qquad (4.5)$$
where the hat is used to indicate the best estimate, the tilde indicates test values
and "arg min" returns the test value that minimizes the expression on its right. The
negative logarithm of a likelihood function is often referred to as the 'cost'. The
Gaussian noise pdf is $p_n(n) = (1/\sqrt{2\pi\sigma^2}) \exp(-n^2/(2\sigma^2))$. Its logarithm is then
$\log(1/\sqrt{2\pi\sigma^2}) - n^2/(2\sigma^2)$. Substituting this into Equation 4.5 and removing additive
constants (and the common positive scale factor $1/(2\sigma^2)$) that do not affect the minimization results in
$$\hat{\underline{x}} = \arg\min_{\tilde{\underline{x}}} \sum_{k=1}^{N_k} \Big(y_k - \sum_{j=0}^{N_h - 1} h_j \tilde{x}_{k-j}\Big)^2 \qquad (4.6)$$
Thus, for this case, the maximum likelihood sequence estimate is found by finding
the hypothesized sequence that minimizes the sum of the squared differences between
the observation and the hypothesized received signal.
² $x_i$, $i < 1$, is presumed to be zero.
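The least-squares search described above can be illustrated with a minimal sketch. It assumes a two-tap channel, binary symbols and a short hand-picked noise vector, none of which come from the thesis, and it tests every hypothesis explicitly, so it is only practical for very short sequences.

```python
from itertools import product

def expected_signal(symbols, h):
    """Noise-free samples sum_j h[j] * x[k-j], with x[i] taken as zero before the start."""
    out = []
    for k in range(len(symbols)):
        acc = 0.0
        for j, hj in enumerate(h):
            if k - j >= 0:
                acc += hj * symbols[k - j]
        out.append(acc)
    return out

def squared_error_cost(y, symbols, h):
    """Sum of squared differences between the observation and the hypothesized signal."""
    return sum((yk - sk) ** 2 for yk, sk in zip(y, expected_signal(symbols, h)))

def brute_force_mlsd(y, h, alphabet):
    """Evaluate the cost of every possible sequence and return the cheapest one."""
    return min(product(alphabet, repeat=len(y)),
               key=lambda hyp: squared_error_cost(y, hyp, h))

# Illustrative values only (assumptions, not taken from the thesis).
h = [1.0, 0.5]                          # assumed channel impulse response
sent = (1, -1, -1, 1, 1)                # assumed transmitted symbols
noise = [0.1, -0.2, 0.05, 0.15, -0.1]   # assumed noise samples
y = [s + n for s, n in zip(expected_signal(sent, h), noise)]

print(brute_force_mlsd(y, h, alphabet=(-1, 1)))
```

The number of hypotheses tested grows exponentially with the sequence length, which is the motivation for the trellis search that follows.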
Further, the structure of Equation 4.6, and in particular the finite impulse length,
$N_h$, leads to an efficient method for finding $\hat{\underline{x}}$ [8]. Writing the summation over the
samples as an iterative cost, $C_k$, yields
$$C_k = C_{k-1} + \Big(y_k - \sum_{j=0}^{N_h - 1} h_j \tilde{x}_{k-j}\Big)^2 \qquad (4.7)$$
Therefore for a particular hypothesis, calculating the current cost requires the previous
cost and the last $N_h$ symbols in the hypothesis. These last $N_h$ symbols are
referred to as the current state. If there are $N_s$ possible symbol values then there
are $N_s^{N_h}$ possible values of the current state. A special rectangular grid of nodes,
known as a trellis, may be created where the x-axis of the grid is the sample time, k,
and the y-axis is the state, $\{x_{k+1-N_h}, \ldots, x_k\}$. A particular hypothesized sequence will
now correspond to a path from left to right connecting nodes of the trellis. MLSD
then corresponds to choosing the path with the least cost. Note that the possible
valid connections between nodes are limited by the definition of the state; two nodes
that both have $x_j$ in their state must have the same value for $x_j$ if a valid connection
between them is possible.
With this graphical representation, the minimization can be seen as a dynamic
programming problem. Consider a node that is on the path of minimum cost. The
path of minimum cost is the union of the minimum cost path from the start to this
node with the minimum cost path from this node to the end of the observed data.
Any other combination would have higher cost. Then the minimum cost path from
the start to a node at time k must include the minimum cost path from the start to
the node at time k - 1 that is on the minimum cost path to the node at time k. Thus,
if at every time iteration, only the cost and the path corresponding to the minimum
cost to each of the nodes is retained then this set of paths will include the minimum
cost path. With the iterative structure of the cost as defined in Equation 4.7, the
extension of the path to a current node may be done by selecting the previous node
whose cost from the start plus the incremental cost from the previous to the current
node is minimum. The new path is then the union of the best path from start to
selected previous node with the segment from selected previous node to the current
node. This is done for all nodes. At the last iteration, the node with lowest cost
identifies the minimum cost path through the trellis. This trellis based dynamic
programming algorithm is referred to as the Viterbi algorithm in the communications
literature.
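A minimal sketch of this trellis search for the AWGN-FIR example is given below. The channel taps, alphabet and observation are illustrative assumptions; the state is taken, as in the text, to be the last $N_h$ symbols, with symbols before the start of the sequence set to zero.

```python
def viterbi_mlsd(y, h, alphabet):
    """Least-cost path through the trellis for the squared-error cost of Equation 4.7."""
    n_h = len(h)
    # survivors maps a state (last n_h symbols, oldest first) to
    # (cost from the start, best symbol path reaching that state).
    survivors = {(0,) * n_h: (0.0, ())}
    for yk in y:
        extended = {}
        for state, (cost, path) in survivors.items():
            for sym in alphabet:
                new_state = state[1:] + (sym,)
                # The newest symbol is last in the state, so reverse it to pair with h[0], h[1], ...
                predicted = sum(hj * xj for hj, xj in zip(h, reversed(new_state)))
                new_cost = cost + (yk - predicted) ** 2
                # Retain only the cheapest path into each node.
                if new_state not in extended or new_cost < extended[new_state][0]:
                    extended[new_state] = (new_cost, path + (sym,))
        survivors = extended
    best_cost, best_path = min(survivors.values(), key=lambda cp: cp[0])
    return best_path, best_cost

# Illustrative values only (assumptions, not taken from the thesis).
h = [1.0, 0.5]
y = [1.1, -0.7, -1.45, 0.65, 1.4]
path, cost = viterbi_mlsd(y, h, alphabet=(-1, 1))
print(path, cost)
```

Only one survivor per state is kept at each sample time, so the work per sample is fixed by the number of states and symbol values and not by the length of the sequence.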
The Viterbi algorithm permits MLSD on systems with limited computational
resources. For an M symbol sequence, explicit testing of each of the possible hypotheses
requires $N_s^M$ tests. Using the Viterbi algorithm, only $M N_s^{N_h} N_s$ tests are required as at
each point in the sequence, each hypothesis must be extended by checking $N_s$ possible
next symbol values. Growth in computations is linear with M rather than exponential.
Thus, arbitrary length sequences may be processed using the Viterbi algorithm
whereas the brute force method would eventually exceed the available computing
capacity.
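The difference in growth can be made concrete with a short calculation. The values of $N_s$, $N_h$ and M below are assumptions chosen only for illustration (four symbol values, a three-sample impulse response and a 500 symbol sequence).

```python
def brute_force_tests(n_s, m):
    """Number of hypotheses tested explicitly: N_s ** M."""
    return n_s ** m

def viterbi_tests(n_s, n_h, m):
    """Number of path extensions in the trellis search: M * N_s ** N_h * N_s."""
    return m * n_s ** n_h * n_s

n_s, n_h, m = 4, 3, 500   # assumed values, for illustration only
print("Viterbi tests:", viterbi_tests(n_s, n_h, m))
print("decimal digits in brute force count:", len(str(brute_force_tests(n_s, m))))
```

Doubling M doubles the Viterbi count but squares the brute force count.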
In this section, the derivation through to practical implementation of an MLSD
processor has been presented. For the additive white noise and finite impulse response
case examined, the observation pdf is a simple function of the noise pdf; as the noise is
white, the joint pdf is a simple product of single sample pdfs. Hypothesis tests involve
a direct comparison of the observation with a hypothesized waveform (Equation 4.4).
Taking the negative logarithm of the likelihood pdf and removing constants led to
a simple cost function. The iterative structure of this cost function, together with
the finite length of the impulse response, permitted the use of the efficient Viterbi
algorithm. The model developed in Chapter 3 is much more complicated than the
case considered here. Extensions will have to be developed to address features such
as coloured noise and peak time jitter.
4.3 Noise Whitening
DNA time-series feature coloured Gaussian noise as described by Equation 3.12.
This implies that the noise will be correlated between samples, with a joint pdf described
by a full noise covariance matrix. Evaluation of the pdf will be laborious. The
uncorrelated noise presented in the earlier example implied independence of the Gaussian
variates which led to an easy to compute and easy to comprehend formulation for the
processor. Clearly, it is desirable to transform the DNA time-series into a form with
similar properties to the Additive White Gaussian Noise (AWGN) case.
This can be achieved using a noise whitening filter. The time varying noise whitening
filter, $\underline{h}_w$, is designed so that the noise in the whitened data, $\underline{y}_w$, is uncorrelated.
This is satisfied by a filter whose Fourier transform is the square root of the inverse
of the discrete noise spectrum. The time varying aspect must follow the signal peak
pulse width's time dependence.
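As a simple illustration of whitening, the sketch below assumes a time-invariant first-order autoregressive noise model, $n_k = a\,n_{k-1} + w_k$, which is not the noise model of Equation 3.12. For this assumed model the spectrum is proportional to $1/|1 - a e^{-j\omega}|^2$, so the two-tap filter $[1, -a]$ has a magnitude response proportional to the square root of the inverse noise spectrum and recovers the white driving samples.

```python
def colour_noise(white, a):
    """Generate assumed AR(1) coloured noise n[k] = a * n[k-1] + w[k] from white samples."""
    coloured, prev = [], 0.0
    for w in white:
        prev = a * prev + w
        coloured.append(prev)
    return coloured

def whiten(samples, a):
    """Apply the FIR whitening filter [1, -a], the inverse of the AR(1) colouring."""
    out, prev = [], 0.0
    for s in samples:
        out.append(s - a * prev)
        prev = s
    return out

# Illustrative values only (assumptions, not taken from the thesis).
white = [0.3, -1.2, 0.7, 0.1, -0.4]
a = 0.8
recovered = whiten(colour_noise(white, a), a)
print(recovered)
```

The time varying filter of the thesis would, in addition, have to change its coefficients with the peak pulse width; that aspect is not covered by this sketch.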
4.4 Nuisance Parameters
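In situations with nuisance parameters such as amplitude and peak time jitter,
optimum processors must extend their hypotheses to jointly include not only all
possible data sequences but all possible sequences of nuisance parameters [57].
Mathematically, when there are nuisance parameters, the maximum likelihood estimate is
$$\{\hat{\underline{x}}, \hat{\underline{\theta}}\} = \arg\max_{\tilde{\underline{x}}, \tilde{\underline{\theta}}} \, p(\underline{y}, \tilde{\underline{\theta}} \mid \tilde{\underline{x}}) = \arg\max_{\tilde{\underline{x}}, \tilde{\underline{\theta}}} \, p(\underline{y} \mid \tilde{\underline{x}}, \tilde{\underline{\theta}}) \, p(\tilde{\underline{\theta}})$$
where $\underline{\theta} = \{\underline{a}, \underline{t}\}$ is the nuisance parameter vector, the hat is used to indicate the
best estimate, the tilde indicates test values, "arg max" returns the test value that
maximizes the expression on its right, $p(\underline{y}, \tilde{\underline{\theta}} \mid \tilde{\underline{x}})$ is the probability density function
(pdf) of the observation, $p(\underline{y} \mid \tilde{\underline{x}}, \tilde{\underline{\theta}})$ is the probability density function of the observation
conditioned on the nuisance parameter vector, and $p(\tilde{\underline{\theta}})$ is the probability density
function of the nuisance parameter vector. For example, in the AWGN-FIR case
above, if the $h_j$ were random variables then the nuisance parameters $\underline{\theta}$ would be the
channel coefficients $h_j$. The pdf conditioned on the nuisance parameters is essentially the
pdf in Equation 4.3 and $p(\underline{\theta})$ is the density function of the $h_j$.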
At each point in the sequence and for each possible base type, $L_{AT}$ hypotheses
must now be evaluated where $L_{AT}$ is the number of amplitude and peak time pairs to
be considered for each signal peak. For continuous random variables such as the
amplitude and peak time, $L_{AT}$ should be infinity. Practically, a finite $L_{AT}$ that permits
good sampling of the amplitude and time joint pdf should achieve performance
approaching full ML. Regardless, the implication is that the number of hypotheses to be
considered is increased by a factor of $L_{AT}^{N_{sym}}$ when time and amplitude jitter is included
where $N_{sym}$ is the number of symbols in the sequence. For example, if ten possible
amplitude and peak time pairs are allowed for each symbol in a 500 point sequence,
then $10^{500}$ times as many hypotheses must be considered. Clearly, nuisance parameters have a
major impact on the computational load of MLSD processing.
4.5 Cost Function Derivation
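With the fundamentals of MLSD and its extensions for coloured noise and nuisance
parameters now established, attention may now be directed at the formal derivation
of the cost function for the DNA-ML algorithm. The algorithm seeks the sequence
and nuisance parameter values that minimize the negative log of the pdf, referred to
as a 'cost':
$$\{\hat{\underline{x}}, \hat{\underline{\theta}}\} = \arg\min_{\tilde{\underline{x}}, \tilde{\underline{\theta}}} \big[ \Lambda_c(\underline{y} \mid \tilde{\underline{x}}, \tilde{\underline{\theta}}) + \Lambda_\theta(\tilde{\underline{\theta}}) \big] \qquad (4.10)$$
where $\Lambda_c(\underline{y} \mid \tilde{\underline{x}}, \tilde{\underline{\theta}}) = -\log(p(\underline{y} \mid \tilde{\underline{x}}, \tilde{\underline{\theta}})) - \text{constant}$ is the part of the log likelihood due
to the conditional pdf and $\Lambda_\theta(\tilde{\underline{\theta}}) = -\log(p(\tilde{\underline{\theta}})) - \text{constant}$ is the part due to the
nuisance pdf. The constant offset serves to remove those terms which do not affect
the maximization. The following subsections will address first the conditional pdf
and second the nuisance pdf.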
4.5.1 Conditional Likelihood
The conditional likelihood, $\Lambda_c$, reflects the additive noise pdf for a specific $\{\tilde{\underline{x}}, \tilde{\underline{\theta}}\}$.
The noise is coloured and its pdf is complicated to evaluate due to the implied
correlations between all samples. The noise whitening filter, $\underline{h}_w$, is applied to create
$$\underline{y}_w = \underline{h}_w * \underline{y} \qquad (4.11)$$
where the * denotes convolution. As the noise spectrum is time dependent (see
Equation 3.12), the filter has to be recalculated for each possible peak time. Note
that this transformation will not change the result of our sequence detection problem
as the $\{\tilde{\underline{x}}, \tilde{\underline{\theta}}\}$ which best explains the observed $\underline{y}$ is also the one which best explains the
observed $\underline{y}_w$. For the whitened observation, $\underline{y}_w$, the noise terms are uncorrelated and
its pdf may be written as the product of the Gaussian sample pdfs as uncorrelated
Gaussian random variables are independent. The conditional likelihood then becomes
where $\tilde{\underline{y}}_w$ is the expected whitened observation for the hypothesized data sequence,
$\tilde{\underline{x}}$, and nuisance parameters, $\tilde{\underline{\theta}}$. Here the summations are over all data samples {k}
and base channels {n}. Unlike in the AWGN-FIR case, constants ($\frac{1}{2}\log(2\pi\sigma_w^2)$)
relating to the whitened noise variance, $\sigma_w^2$, are retained in this expression; as will be
explored in greater detail in Chapter 5, these terms cannot be dropped as hypotheses
incorporating different numbers of time samples may have to be compared.
Substitution of the expected observation into Equation 4.12 yields
where $g^w_{t,k}$ is the peak shape after application of the noise whitening filter for a peak
centered on time t and evaluated at k, and $\delta$ is the Kronecker delta. Without
changing the result, the summations outside the brackets may be freely interchanged
and split and the summation within the brackets may be split as long as the split
terms remain within the brackets. Thus, Equation 4.13 may be written as
where the range of k has been partitioned into non-overlapping subsets, $K_i$, such that
the union of these subsets corresponds to the complete range of k.
Equation 4.14 suggests grouping the samples into non-overlapping groups with
each group corresponding to a specific base in the data sequence. Within each group,
first subtract off the interference from other bases as indicated by the terms
encompassed by the innermost brackets of Equation 4.14. Then, evaluate the hypothesized
contribution of the specific base.
4.5.2 Nuisance Likelihood
Now consider the log likelihood of the nuisance parameters, the second term of
Equation 4.10. As specified in the model (Section 3.3.2), the peak amplitude fluctuation
is uncorrelated, its pdf is Gaussian, and the corresponding log likelihood term is
just the squared difference with respect to the mean, $\mu_a$, all normalized by twice the
variance, $\sigma_a^2$. The nuisance parameter log likelihood encompassing the log likelihood
of the peak amplitude and peak time may be written as
$$\Lambda_\theta(\tilde{\underline{\theta}}) = \sum_i \frac{(\tilde{a}_i - \mu_a)^2}{2\sigma_a^2} + \Lambda_t(\tilde{\underline{t}}) \qquad (4.15)$$
where $\tilde{a}_i$ is the hypothesized amplitude of the i-th peak and the last term, $\Lambda_t(\tilde{\underline{t}}) =
-\log(p(\tilde{\underline{t}})) - \text{constant}$, is the peak time log likelihood. Our analysis will now address
that term.
For our model, the correlation of the jitter with respect to all previous peaks
complicates the evaluation of the pdf and log likelihood. For a sequential processor,
on a given hypothesis, for each extension in hypothesis length, the entire hypothesis
must be fed into the new larger marginal probability function. Here, where the
probability function is Gaussian, the covariance matrix is extended from size N - 1
to N and the number of computations to evaluate the probability is proportional to
$N^2$. The total number of calculations in sequentially evaluating an N point hypothesis
would then go as $N^3$. It is desirable for the total number of calculations to be a linear
function of N.
An efficient sequential representation of the hypothesis pdf and log likelihood is
required. As the timing jitter appears as a functional argument of the waveform
and as it is sequence dependent, the simple whitening filter, appropriate for additive,
sequence independent disturbances, cannot be used. Instead, the Markov structure of
the timing jitter leads to an innovation technique for data whitening [66, 69] in which
a linear transformation is applied to the data to decorrelate current and previous
measurements. For uncorrelated Gaussian variates, the joint probability is just the
product of the marginal probabilities of the random variables. Thus, for suitably
transformed observations, the probability of the extended N point hypothesis is the
product of the probability of the N - 1 point hypothesis and the probability of the
current transformed observation. The log likelihood of the whitened data may then
be written as a simple sum of squared terms. The total computational load is then a
linear function of N.
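The linear-time evaluation can be sketched with a standard scalar Kalman predictor. The model below (a first-order state observed in additive noise) is a textbook assumption used for illustration; it is not the jitter model of Section 3.3.2 and omits the extension for an observable input disturbance developed later in this section. The parameter names a, q, r and p0 are hypothetical.

```python
import math

def innovations_cost(z, a, q, r, p0):
    """One-half the sum of squared whitened innovations, computed in a single pass.

    Assumed model (illustrative only):
        state:        x[i+1] = a * x[i] + w[i],  var(w) = q
        observation:  z[i]   = x[i] + e[i],      var(e) = r
    p0 is the prior variance of x[0]; the prior mean is zero.
    """
    x_pred, p_pred = 0.0, p0
    cost = 0.0
    for zi in z:
        s = p_pred + r                       # variance of the predicted observation
        nu = (zi - x_pred) / math.sqrt(s)    # whitened innovation
        cost += 0.5 * nu * nu
        gain = p_pred / s                    # Kalman gain
        x_filt = x_pred + gain * (zi - x_pred)
        p_filt = (1.0 - gain) * p_pred
        x_pred = a * x_filt                  # one-step state prediction
        p_pred = a * a * p_filt + q          # prediction variance
    return cost

# Illustrative values only; q is chosen so that the stationary state variance is 1.
a, q, r, p0 = 0.9, 0.19, 0.5, 1.0
z = [0.4, -0.3]
print(innovations_cost(z, a, q, r, p0))
```

For these two samples the result equals one-half of the quadratic form $z^T \Sigma^{-1} z$ taken with the full 2 x 2 covariance matrix, but each additional observation costs only a fixed number of operations.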
Statistically, the innovations approach identifies the new information in the current
observation, separating it from that which could be inferred from previous data. By
virtue of its correlation with previous samples, the correlated part may be predicted
from the previous samples. The i-th sample of the original peak time series may be
written as
where t , and dCi represents the part correlated with previous samples and tu, and #u,
the uncorrelated part. Using Equations 3.6 and 3.7, the timing jitter, #i = &, + A,,
may be written in state space form as
where A = G = O a n d C = H = J=1.
For a system describable in linear state space form, the optimum predictor is the
Kalman predictor [57, 67]. Using the Kalman predictor, the innovation is defined as
the difference between the observation and the best prediction of the new observation
given previous observations, all divided by the standard deviation of the prediction.
Equation 4.19 is not the usual state space form as the system input disturbance,
$v_i$, appears directly in this observation equation. The best estimate of the state at
i + 1 given the information available at i (denoted $\hat{c}_{i+1|i}$) is the conditional expectation
If $v_i$ was not observable in $\phi_i$ then $\hat{v}_{i|i}$ would just be zero (the a priori mean) and the
usual Kalman filter derivation [67] would apply.
The extension to the Kalman filter analysis must include the recovery of $\hat{v}_{i|i}$ and its
impact on the prediction covariance. By the usual projection operation, the estimate
of the input disturbance is
where E is the expectation operator, $\Phi_i$ denotes all the observations up to and
including the i-th, $\tilde{\phi}_i = \phi_i - \hat{\phi}_{i|i-1}$ is the innovation, $\tilde{\phi}_i'$ is its transform and $P^{\phi}_{i|i-1}$ is the
covariance of the prediction of $\phi_i$. Here, the white spectrum of $v$ and its appearance
in Equation 4.19 have been used to reach the last expression.
The state prediction covariance can be obtained using state Equation 4.18 as
where iii = v i - 'uili-1 and C, = Ci - Gli-l. The covariance ~ ' f i ~ i , T ] can be shown
to be zero. As the projection operation is used to form the estimate of the input
disturbance, the covariance of this estimate can be obtained as
Equations 4.20-4.23 should be combined with the standard Kalman filter equations.
The model parameters can be substituted and terms regrouped to yield
The first three equations describing the Kalman gain, $L$, state estimate, $\hat{c}$, and
covariance of the state estimate, $P$, are the standard Kalman filter equations. The
subsequent equations represent extensions due to the observability of the input
disturbance. Here, a gain, $L^v$, is used in recovering an estimate of the input disturbance,
$\hat{v}_{i|i}$, which is then used to predict the next state, $\hat{c}_{i+1|i}$. The prediction covariance,
$P_{i+1|i}$, has an additional term which reflects the covariance of the input disturbance
estimate.
The whitened innovation sequence to be used by the sequencer is (&-cii-l)/ d G '
or? to make the peak dependence explicit, (ti - ipr - c ~ ~ ~ - ~ )/ JPiii_i* This is an in-
dependent, zero mean, unit variance, Gaussian random process. The peak time log
likelihood is then one-half the sum of the squares of this process:
The nuisance parameter log likelihood becomes
4.5.3 Cost Function
Equations 4.10, 4.14 and 4.32, in conjunction with the whitening filter (4.11)
and Kalman predictor (4.24-4.30), define the maximum likelihood sequencer. As is
desired for the dynamic programming algorithm, for each hypothesized sequence, the
cost (log likelihood) may be written as the sum of the cost corresponding to previous
points in the sequence and a cost associated with the current point:
Figure 4.1 summarizes the algorithm structure. The hypothesized peak amplitudes
and times, together with the hypothesis' sequence, are used to generate a waveform
which is compared with the observation. This yields the likelihood conditioned on
the parameters. Also, the hypothesized peak amplitude and times are compared with
the Kalman filter predictions to obtain the parameter probability. The innovation is
also used to update the hypothesis' Kalman filter. Note that each hypothesis has its
own hypothesis dependent Kalman filter.
Figure 4.1: Maximum likelihood processor block diagram (blocks: hypotheses, peak estimator, waveform comparison, predictors, dynamic programming algorithm).
4.6 Significance
Following a formal derivation from the statistical model of the DNA time-series,
this chapter has presented the first DNA sequencing algorithm which can achieve
optimal detection performance. Of course this optimality is only to the extent that
the model reflects reality. Viewed purely from the perspective of detection theory, the
algorithm is sophisticated in addressing non-stationarities and nuisance parameters.
It is at the edge of the state of the art in communication theory. The derivation
leads naturally to a general structure. Components of this structure have well defined
tasks. They facilitate the assessment of current algorithms as analogous blocks may be
compared. Future work may see the cost function derived in this chapter incorporated
into a Maximum A Posteriori (MAP) processor to provide optimum estimates of base
type and probability on a base by base basis. When the DNA time-series model
receives further refinements, the DNA-ML algorithm may easily be extended to reflect
these new model features.
CHAPTER 5
Implementation
While including most of the features of the optimum algorithm, the implementation
of the DNA-ML algorithm requires changes mainly directed at reducing the
computational load. This chapter provides details regarding implementation of the
algorithm that was discussed in Chapter 4. Algorithm robustness to model errors also
receives attention.
5.1 Hypothesis Reduction
5.1.1 Peak Estimation
An alternative to carrying multiple hypotheses for the nuisance parameter is to
estimate the nuisance parameter and use that value in the sequence detection. This
approach is sometimes used in data communications where the carrier phase is a
slowly varying nuisance parameter [57]. For DNA data with well isolated peaks, peak
amplitude and time can be easily estimated by taking the local maximum. However,
as resolution decreases, estimates of adjacent peaks are biased closer together. The
bias leads to erroneous values being used in the likelihood evaluation and therefore to
sequencing errors. In the limit, peaks are not resolved and bases are deleted from the
sequence. The multiple hypothesis approach avoids this problem as it generates the
complete waveform for the hypothesis and compares it with the observation. Even if
the peaks are unresolved, this approach can obtain the correct result if the resulting
broad peak in the hypothesis waveform matches that in the observation.
To address this resolution problem, Equation 4.14 suggests using the sequence
data from adjacent points in the hypothesized sequence to reduce the influence of
neighbouring peaks in the peak estimator. Consider a sequence with two adjacent
"G"s that are poorly resolved. For the correct hypothesis, the measurement of the
parameters of one peak may be made more accurate by subtracting from the time
series a pulse of peak amplitude and time corresponding to the other peak then
taking the local maximum. It is possible to show that even if there are moderate
errors in the estimation of the parameters of the second peak, the overall accuracy
is much improved relative to simple peak detection without ISI removal. In order to
remove ISI, one must include past peaks already estimated and future peaks yet to
be estimated. Estimation of future peak ISI will be addressed in the next section; for
now, it will be assumed that it can be done successfully.
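The effect of subtracting the neighbouring pulse before taking the local maximum can be illustrated with a small sketch. Gaussian pulse shapes, the peak positions and the deliberately imperfect neighbour estimate are all assumptions made for illustration; the peak shape of Chapter 3 is not reproduced here.

```python
import math

def gaussian_pulse(n, centre, amplitude, sigma):
    """Assumed Gaussian peak shape sampled at k = 0 .. n-1."""
    return [amplitude * math.exp(-((k - centre) ** 2) / (2.0 * sigma ** 2))
            for k in range(n)]

def local_max(samples, lo, hi):
    """Index and value of the largest sample in samples[lo:hi]."""
    k = max(range(lo, hi), key=lambda i: samples[i])
    return k, samples[k]

# Two poorly resolved peaks of unit amplitude at samples 50 and 54 (assumed values).
n, sigma = 100, 3.0
observed = [p + q for p, q in zip(gaussian_pulse(n, 50, 1.0, sigma),
                                  gaussian_pulse(n, 54, 1.0, sigma))]

# Simple peak detection: the two peaks merge into one broad peak.
naive_time, naive_amp = local_max(observed, 40, 60)

# Subtract a hypothesized neighbour whose time and amplitude are moderately in error.
neighbour = gaussian_pulse(n, 55, 0.9, sigma)
cleaned = [o - p for o, p in zip(observed, neighbour)]
clean_time, clean_amp = local_max(cleaned, 40, 60)

print("without ISI removal:", naive_time, naive_amp)
print("with ISI removal:   ", clean_time, clean_amp)
```

In this example the estimate without ISI removal falls midway between the two peaks, whereas the estimate after subtraction lands on the first peak even though the neighbour was not removed exactly.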
For a particular base, if the ISI has been completely removed then the problem
becomes the detection / estimation of a single peak. This is optimally accomplished by
matched filtering the data, detecting the maximum, and recording the peak amplitude
and time [57]. The matched filter is the time-reversed noise whitened peak shape,
$g^w_{t,k}$. The peak estimate obtained in this fashion is then used to create the expected
peak as the last term in Equation 4.14. If the hypothesized sequence matches the
observed DNA sequence then the squared term of Equation 4.14 should be small.
Figure 5.1 summarizes the above procedures.
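A minimal sketch of the matched filter peak estimator follows. It assumes that the data have already been whitened and cleared of ISI, and it uses a Gaussian template as a stand-in for the whitened peak shape; the template, its length and the test pulse are illustrative assumptions.

```python
import math

def matched_filter_estimate(samples, template):
    """Correlate the samples with the pulse template and return (peak time, amplitude).

    Correlating with the template is the same as convolving with its time reverse.
    The amplitude is the correlation peak normalized by the template energy.
    """
    half = len(template) // 2
    energy = sum(g * g for g in template)
    best_time, best_amp = None, float("-inf")
    for centre in range(half, len(samples) - half):
        corr = sum(samples[centre - half + j] * g for j, g in enumerate(template))
        if corr / energy > best_amp:
            best_time, best_amp = centre, corr / energy
    return best_time, best_amp

# Illustrative values only (assumptions, not taken from the thesis).
sigma = 3.0
template = [math.exp(-(u ** 2) / (2.0 * sigma ** 2)) for u in range(-9, 10)]
samples = [2.5 * math.exp(-((k - 30) ** 2) / (2.0 * sigma ** 2)) for k in range(80)]

peak_time, peak_amp = matched_filter_estimate(samples, template)
print(peak_time, peak_amp)
```

With noise present, the same two steps (filter, then take the maximum) apply; only the accuracy of the returned time and amplitude changes.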
5.1.2 Future Peak ISI Cancellation
A priori peak predictions based on jitter and peak amplitude models may be
used to estimate the interference from future peaks. By modifying Equation 4.29,
the Kalman predictor used in the innovation processing may also be used to predict
future peak locations an arbitrary number, p, of bases forward of the current base as
where $t_i$ is the location of the current peak and the other quantities are the mean
inter-peak separation, the jitter auto-regressive weighting, $d$, the current estimate of
the current jitter state, $\hat{c}_{i|i}$, and the current estimate of the jitter driving process,
$\hat{v}_{i|i}$, all as described in Section 3.3.2. The amplitude prediction, after trend removal,
is just $\mu_a$.
Figure 5.1: Peak estimator (blocks: ISI removal, matched filter, peak detection, pulse estimate; single peak estimation under the correct hypothesis).
The accuracy of the future peak location estimate falls very quickly with p. On
the other hand, the further into the future a peak is from the current peak then the
smaller its contribution to the ISI is by virtue of the peak shape. Thus, the inaccuracy
in peak location for peaks far into the future has little impact on the accuracy of ISI
removal.
However, for a future peak whose mainlobe reaches well into the region of the
peak of interest, accuracy in peak prediction is very important. Prediction is limited
by the component of the future peak that is uncorrelated with the currently available
measurements. The uncorrelated component of the jitter in peak time and amplitude
may be sufficient to lead to large errors in ISI removal. To address this problem, the
algorithm may be extended to include several hypothesized locations for each future
peak (i.e. a constellation of candidate future peak locations). These additional
hypotheses represent a partial return to the optimal algorithm of Chapter 4. However,
they are only included for ISI removal and once a direct estimate of the peak's
parameters is obtained then the constellation is collapsed to that estimate. Thus previous
peaks in the hypothesis do not maintain nuisance parameter constellations and so the
computational load is much lower than for full MLSD.
5.1.3 Sequential Decoding
To further reduce the computational load, an algorithm may be selected that tests
only a subset of the possible hypotheses. A number of these have been developed and
analysed in the communications coding literature [61]. Of these, the M-algorithm¹
has been chosen for this thesis as it can be considered to be a maximum likelihood
processor under the constraint of retaining at most M hypotheses at each point
in the sequence [68]. This algorithm processes the data sequentially. At the i-th
position in the sequence, it retains M hypotheses corresponding to the most likely
i-point subsequences given the observed data from the start of the run up to the
current point under consideration.
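The retention rule can be sketched for the simple AWGN-FIR example of Section 4.2. The sketch below is an illustration under assumed channel taps, alphabet and data; it omits the peak estimation, nuisance parameters and unequal length issues that the DNA-ML implementation must handle.

```python
def m_algorithm(y, h, alphabet, m):
    """Sequential search that retains at most m hypotheses at each symbol position."""
    hypotheses = [(0.0, ())]                  # (cost so far, symbol path)
    for k, yk in enumerate(y):
        extended = []
        for cost, path in hypotheses:
            for sym in alphabet:
                new_path = path + (sym,)
                # Expected sample for this hypothesis; symbols before the start are zero.
                predicted = sum(hj * new_path[k - j]
                                for j, hj in enumerate(h) if k - j >= 0)
                extended.append((cost + (yk - predicted) ** 2, new_path))
        extended.sort(key=lambda cp: cp[0])   # most likely (lowest cost) first
        hypotheses = extended[:m]             # discard all but the best m
    return hypotheses[0][1]

# Illustrative values only (assumptions, not taken from the thesis).
h = [1.0, 0.5]
y = [1.1, -0.7, -1.45, 0.65, 1.4]
print(m_algorithm(y, h, alphabet=(-1, 1), m=4))
```

The work per symbol is proportional to m times the number of symbol values, independent of the channel memory; the price is that the maximum likelihood path is lost if it ever falls outside the best m.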
5.2 Unique Algorithm Considerations
The implementation of the DNA-ML algorithm is complicated by the interplay
between the dynamic programming algorithm and the asynchronous peak times. The
dynamic programming algorithm must compare hypotheses as it progresses through
the data set. Because of different values of the peak time parameters held by different
hypotheses, the hypotheses may be of different duration and therefore not properly
comparable. On average, the shorter hypotheses would have lower costs and be more
likely to be retained.
Similarly, the symbol region, $K_i$ (the set of time samples associated with each
symbol), is likely to be defined dynamically and thus the lengths of the summations
over the symbol will vary again based on differences in estimated parameter values.
The analysis of Chapter 4 did not give direct guidance as to how the $\{K_i\}$ were to
be defined.
¹ The technique is similar to the 'greedy' algorithm found in the computer science literature though there can be differences based on the specific definition used by particular authors.
Could a change in the dynamic programming from a base by base basis to being
on a sample by sample basis solve the problem? No, as now hypotheses would differ in
relatively how much of the latest base region was represented in the cost. The problem
only disappears if decisions are not made until the cost of the entire observation is
available. Of course, this would imply an untenable number of computations if all
hypotheses are retained that long.
In this section, unequal length hypothesis comparison and symbol region definition
will be examined.
5.2.1 Unequal Length Comparisons
Long before the development of the Viterbi algorithm, there were a number of
sequential decoding algorithms which would deal with unequal length hypotheses [61].
Typically, these algorithms follow the most likely hypothesis until its cost exceeds a
threshold. They then return to an earlier hypothesis and pursue it. If the end of data
is reached then the current hypothesis is returned as the sequence estimate. By no
means are these algorithms guaranteed to return the maximum likelihood estimate
of the sequence.
The most famous of these is the Fano algorithm [62]. This algorithm is still in
use today for systems that use very long codewords such as in space communications;
the Viterbi algorithm would require too many computations in such applications.
Developed on an ad hoc basis, it was later shown to also result from a probabilistic
analysis [63]. While its original application was for variable length codes, it has since
been used for sequence estimation in infinite length channels [64].
The Fano algorithm does not explicitly compare hypotheses. Rather, it examines
the Fano metric (Fm), the ratio of the hypothesis pdf to the unconditional pdf as in
$$F_m = \frac{p(y_1, \ldots, y_k \mid x_1, \ldots, x_k)}{p(y_1, \ldots, y_k)} \qquad (5.2)$$
where k is the index of the latest observation sample, $y_k$, and hypothesized symbol,
$x_k$. Here, as in the remainder of this section, the probability functions p() are defined
by their arguments. For the correct hypothesis, the Fano metric will increase with
time (k) as, in this case, the numerator is greater than the denominator. For the
incorrect hypothesis, the Fano metric will eventually fall below a threshold. This
hypothesis is then discarded and another pursued.
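As a rough illustration of this thresholding behaviour, the following Python sketch accumulates the logarithm of the ratio sample by sample and reports when a hypothesis would be abandoned. The Gaussian observation model, the two-symbol alphabet, the function names and the threshold value are all assumptions made for the example; none of them is taken from the thesis.

```python
import math

def gaussian_pdf(y, mean, sd):
    """Normal density; stands in for the probability functions p()."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def log_fano_metric(samples, hypothesis, sd=1.0, symbols=(-1.0, 1.0)):
    """Running log of p(y | x) / p(y) for independent samples.

    Assumption: y_k = x_k + Gaussian noise with equiprobable symbols, so
    the unconditional pdf is the average over the symbol alphabet.
    """
    total = 0.0
    history = []
    for y, x in zip(samples, hypothesis):
        conditional = gaussian_pdf(y, x, sd)
        unconditional = sum(gaussian_pdf(y, s, sd) for s in symbols) / len(symbols)
        total += math.log(conditional / unconditional)
        history.append(total)
    return history

def first_drop_below(history, threshold):
    """Index at which the hypothesis would be discarded, or None."""
    for k, value in enumerate(history):
        if value < threshold:
            return k
    return None

observed = [0.9, -1.1, 1.2, 0.8, -0.7, 1.0]
correct = [1.0, -1.0, 1.0, 1.0, -1.0, 1.0]
wrong = [-1.0, 1.0, -1.0, -1.0, 1.0, -1.0]
good = log_fano_metric(observed, correct, sd=0.5)
bad = log_fano_metric(observed, wrong, sd=0.5)
```

In this toy setting the metric of the correct hypothesis grows with every sample, while the wrong hypothesis falls below any reasonable threshold almost immediately.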
The Fano metric assumes the observation to be the same length as the hypothesis
and so further analysis is required to adapt it to the DNA sequencing problem. This
analysis builds on the work of Massey [63]. Consider two different hypothesized
sequences, $\underline{x}_1$ and $\underline{x}_2$, containing the same number of symbols. However, assume that
due to differing symbol lengths (i.e. due to peak time jitter in DNA sequencing),
the length of $\underline{y}_1$, the observation associated with $\underline{x}_1$, is different than the length of
$\underline{y}_2$, the observation associated with $\underline{x}_2$. Define $\underline{y}$ as the entire possible observation
encompassing both the observations thus far under the hypothesis and the future
observations. Then, using the plus superscript to indicate future observations,
$$\underline{y} = [\underline{y}_1, \underline{y}_1^+] = [\underline{y}_2, \underline{y}_2^+]$$
Consider comparing the two hypotheses when the entire observation is available:
As the observations are of the same length, the comparison does not suffer bias
from unequal length hypothesis observations. Similarly, each hypothesis contains
the same number of symbols and so bias is not introduced from differing number of
symbols. Here, the observation does include the effects of future symbols, $\underline{x}^+$, but
the probability density function is the marginal one obtained by averaging over all
possible future symbols as in
Thus, Equation 5.3 is a desirable test of the two hypotheses. It will now be demonstrated
that the Fano metric provides exactly this test.
First, normalize Equation 5.3 by the unconditional probability of the entire observation:
As both sides are scaled by the same factor, the result of the comparison is unaffected.
Applying the chain rule, the probability of the entire observation under hypothesis i
may be written as
and integrating over the possible same length hypotheses yields the unconditional pdf
as $\sum_{\underline{x}} p(\underline{y}_i \mid \underline{x})\, p(\underline{y}_i^+ \mid \underline{y}_i, \underline{x})$. The ratio may then be written as
The next step in Massey's analysis assumes a Discrete Memoryless Channel (DMC)
where the samples are independent. This implies $p(\underline{y}_i^+ \mid \underline{y}_i, \underline{x}_i) = p(\underline{y}_i^+)$ which on
substitution in Equation 5.7 yields
Thus, normalizing the hypothesis probability for the entire observation by the unconditional
pdf for the entire observation is equivalent to normalizing the hypothesis
probability for the observation thus far by the unconditional pdf for the observation
thus far (i.e. the Fano metric Equation 5.2). Equation 5.5 becomes
and so a means of comparing different observation length hypotheses has been developed,
albeit only for the case of independent samples.
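The cancellation at the heart of this argument can be sketched in a few lines. The notation below is an assumption chosen to match the surrounding text (underlined vectors, a plus superscript for future observations, hypothesis index i); it is a summary of the reasoning, not a reproduction of the thesis equations.

```latex
\begin{align*}
\frac{p(\underline{y}_i, \underline{y}_i^+ \mid \underline{x}_i)}
     {p(\underline{y}_i, \underline{y}_i^+)}
  &= \frac{p(\underline{y}_i \mid \underline{x}_i)\,
           p(\underline{y}_i^+ \mid \underline{y}_i, \underline{x}_i)}
          {p(\underline{y}_i)\, p(\underline{y}_i^+ \mid \underline{y}_i)}
  && \text{(chain rule)} \\
  &= \frac{p(\underline{y}_i \mid \underline{x}_i)\, p(\underline{y}_i^+)}
          {p(\underline{y}_i)\, p(\underline{y}_i^+)}
  && \text{(memoryless channel)} \\
  &= \frac{p(\underline{y}_i \mid \underline{x}_i)}{p(\underline{y}_i)}
\end{align*}
```

The future observations drop out only because, for independent samples, they carry no dependence on the past observations or on the hypothesized symbols.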
Can the Fano metric be applied to the DNA sequencing problem? DNA time
series feature memory due to the pulse shape and the correlation of the peak time
jitter: they are not the result of a memoryless channel as mandated by the Massey
analysis. The analysis was extended to include nuisance parameters and the ratio
of the probability of the entire (past, current and future) time series given the short
hypothesis to the unconditional probability of the entire time series emerged as
where as usual, $\underline{x}$, $\underline{y}$ and $\underline{\theta}$ are the information sequence, observations and nuisance
parameters, respectively. The summations are over all valid values of their argument
vectors; the prime is used to indicate a dummy variable vector. The development of
this expression assumed the observations were causal.
For the DMC, $p(\underline{y}^+, \underline{x}^+, \underline{\theta}^+ \mid \underline{y}, \underline{x}, \underline{\theta}) = p(\underline{y}^+, \underline{x}^+, \underline{\theta}^+)$ so that the last pair of
summations in the numerator and denominator cancel, leaving the Fano metric. The
same does not hold true for DNA time-series as the first few elements of $\underline{y}^+$ and $\underline{\theta}^+$
depend substantially on past values. In this region, the last pair of summations in the
numerator would be hypothesis dependent, while in the denominator, the last pair
of summations would be averaged over all possible previous hypotheses. Therefore,
they would not cancel. Certainly, beyond a few time constants of this memory the
tails of the Fano metric formulation should effectively cancel. However, our interest is
in comparing hypotheses which, while of different lengths, are probably within a few
memory time constants of each other. The Fano metric is not theoretically justified
in this region.
Nonetheless, the Fano metric at least provides a structure, albeit a sub-optimal
one. As no other is available, the Fano metric was investigated on sequencing data.
Specifically, the division by the unconditional pdf led to the addition of its logarithm
to the cost given by Equation 4.33 to produce
where the unconditional pdf, $p(\underline{y}_{Wk})$, has as its argument the vector of the k-th
whitened samples of all four channels. This is necessary to allow for the fundamental
nature of the series where ideally at a given point one channel has the base
peak while the others have the basic noise background.
The unconditional pdf, rather than being derived from the model, was developed
from the histogram of the actual data. After observing the histogram, the following
form was adopted:
Here the current observation, $\underline{y}_k$, is a vector of the four channel levels. The first of
these equations states that the probability of the observation is the weighted sum of
the probabilities of the observation given the base type, b, was known. The second
equation gives the probability of the observation given a known base type as the product
of the signal pdf on that base's channel and the noise pdf on the other channels.
Next is the noise pdf which is as used elsewhere in the DNA-ML algorithm. Finally,
the signal pdf has three regions weighted by constant c which serves to normalize the
function. It represents a 'heuristic fit' to the histogram. The first region represents
the signal peak being absent, due perhaps to a dropout, and so uses the form and
parameters of the noise pdf. The flat region represents the rising and falling regions
of the peak shape. The last region accounts for the peak amplitude.
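The structure of this pdf (a weighted sum over base types of a signal pdf on one channel times noise pdfs on the other three) can be sketched as follows. This is only an illustration: the Gaussian noise pdf, all numerical values and the names are assumptions, and the three-part signal pdf is written here as a simple mixture whose weights sum to one, which stands in for the piecewise form with normalizing constant c described above.

```python
import math

def normal_pdf(v, mean, sd):
    z = (v - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

# Hypothetical parameter values, chosen only to make the example run.
NOISE_MEAN, NOISE_SD = 0.1, 0.13
PEAK_MEAN, PEAK_SD = 1.0, 0.1

def noise_pdf(v):
    return normal_pdf(v, NOISE_MEAN, NOISE_SD)

def signal_pdf(v, w_absent=0.05, w_flat=0.25, w_peak=0.70):
    """Stand-in signal pdf: peak absent (noise-like), flat region covering
    the rising and falling edges, and a bump at the peak amplitude."""
    flat = 1.0 / PEAK_MEAN if 0.0 <= v <= PEAK_MEAN else 0.0
    return (w_absent * noise_pdf(v)
            + w_flat * flat
            + w_peak * normal_pdf(v, PEAK_MEAN, PEAK_SD))

def observation_pdf(levels, base_weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted sum over base types of the signal pdf on that base's
    channel times the noise pdf on the remaining channels."""
    total = 0.0
    for b, weight in enumerate(base_weights):
        term = signal_pdf(levels[b])
        for c, level in enumerate(levels):
            if c != b:
                term *= noise_pdf(level)
        total += weight * term
    return total

one_peak = observation_pdf([1.0, 0.1, 0.1, 0.1])
four_peaks = observation_pdf([1.0, 1.0, 1.0, 1.0])
```

With these assumed values, an observation with a peak on exactly one channel is far more probable than one with peaks on all four, which is the behaviour the four-channel formulation is meant to capture.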
The results of employing the Fano metric were not encouraging. In investigations
with real data some sequencing errors of the non-Fano implementation were corrected.
However, new errors occurred elsewhere and the overall error rate was not improved.
These errors did not exhibit a pattern from which one could infer a mechanism.
Theoretically, two phenomena could account for the lackluster performance of the
Fano metric. The most likely of these is the impact of the correlation in the DNA
time-series. The second possibility is that the synthetic observation pdf, Equation 5.3,
did not accurately reflect the true observation pdf. Any bias here would accumulate
in the cost with sample number and could eventually subvert the decision process.
The implementation whose performance is presented in the next chapter does
not incorporate compensation for unequal length comparisons. Such compensation
remains an open problem.
5.2.2 Selection of Symbol Region, Ki
A natural notion for a symbol region definition would be to center the region on
the symbol. The region's borders could be from halfway between the previous symbol
and the current symbol to halfway between the current symbol and the next symbol.
However, at the time of the current symbol, the next symbol's location is not known
so the upper border can't be set. Also, hypotheses with peaks closer together would
tend to have lower costs as fewer sample points would belong in the symbol region.
A constant region width would cure the border and varying width cost bias. However,
it would encounter problems with respect to leaving out points when the peaks
are widely separated and including the same point in the symbol regions for two
different bases when the peaks are close together.
Defining the current symbol region as between the previous and current symbols
has some advantages. First, the borders are known. Second, ISI is suppressed better
in this region as it is distant from the problems with future peak location error.
Unfortunately, with this definition only half of the current peak is used in the integration,
thus implying a lower signal to noise ratio. As well, the problems with varying
region width are still present. However, the advantages for this strategy appear to be
stronger than the disadvantages. This definition of current symbol region is used in
the real data processing of the next chapter.
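The adopted definition can be sketched as a mapping from hypothesized peak positions to sample index ranges. The treatment of the borders (previous peak excluded, current peak included) and the start of the first region at sample 0 are assumptions made for this example.

```python
def symbol_regions(peak_indices):
    """Assign to each symbol the samples after the previous peak up to
    and including its own peak.

    Assumption: the first symbol's region starts at sample 0.
    """
    regions = []
    previous = -1
    for peak in peak_indices:
        regions.append(range(previous + 1, peak + 1))
        previous = peak
    return regions

regions = symbol_regions([14, 29, 43])
```

Every sample up to the last peak belongs to exactly one region, but the regions differ in width whenever the peak spacing varies, which is the remaining source of bias noted above.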
5.3 Modelling Limitations and Robustness
Neglecting the different length hypothesis problem, the DNA-ML algorithm is
optimum only for data that exactly matches the model and parameters used. Errors
in some parameter settings are expected to have little effect. For example, a small
error in mean peak amplitude would have little effect as the peak to peak variance
is so large. As already discussed, errors in the tails of the pulse shape have only a
small effect as mainlobe ISI dominates. While the mainlobe shape is well known, it
does depend on the pulse width parameter, a parameter whose estimate has a fair
uncertainty. The sensitivity to this parameter should be investigated. Noise whitening
is fundamental to the development of the DNA-ML algorithm. Chapter 3 alluded to
the difficulty in measuring the noise spectrum. The impact of this on the algorithm
should be addressed. Another key parameter is β; sensitivity to β mismatch will be
investigated in Chapter 6. In this section, the sensitivity to pulse width mismatch is
examined and issues associated with the noise whitening process are discussed.
5.3.1 Pulse Width
Errors in pulse width setting can lead to large errors in waveform comparison,
particularly near the steep edges of the pulse. A simulated data set was created
wherein the model and all parameters were obtained from real data (Data Set 1 of
the next chapter). The algorithm used the same parameters with the exception being
pulse width. Table 5.1 presents the results of reprocessing the same data set with
several different pulse width settings. It is clear from the table that 10% mismatch
in the pulse width can result in a large increase in the error rate.
Table 5.1: Performance as a function of pulse width mismatch for 300 bases of simulated data.
Assumed/True Pulse Width      0.8       0.9       0.95   1      1.05   1.1    1.2
Insertions/Deletions/Errors   45/19/46  19/18/14  2/2/2  1/1/1  1/1/1  2/2/1  49/26/32
In Section 3.3.2, the scatter of the pulse width estimate had a standard deviation
of 10%. However, presuming the trend model to be correct, the least squares model fit
in effect averages these 300 estimates. Thus, the standard deviation of the error in the
resulting model is on the order of $10\%/\sqrt{300} \approx 0.6\%$. Thus, if the scatter is indeed
due to measurement error then the pulse width is known with sufficient accuracy
that pulse width mismatch is not a problem. On the other hand, if this hypothesis
is wrong and the measured pulse width variations are actually true reflections of
the physical processes then pulse width mismatch could be a large contributor to
sequencing errors.
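The 0.6% figure follows from the usual rule for the standard error of an average, sketched below. The rule assumes the individual estimates are independent and share the same variance; the variable names are chosen for this example.

```python
import math

scatter_percent = 10.0   # standard deviation of a single pulse width estimate
num_estimates = 300      # estimates effectively averaged by the model fit

# Standard error of an average of independent, equal-variance estimates.
model_error_percent = scatter_percent / math.sqrt(num_estimates)
print(round(model_error_percent, 2))  # 0.58
```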
5.3.2 Noise Whitening
A low quality noise spectrum estimate can lead to poor noise whitening. Section
3.4 alluded to the difficulty in measuring the noise spectrum of DNA time-series. It is
not possible to obtain a "noise only" data set that has enough data points so that the
statistics of the noise in DNA sequencing data can be estimated accurately. However,
in adopting the noise spectral model of Equation 3.12, the necessity of measuring the
full spectrum vanishes as the model only requires estimates of the white noise and
coloured noise variances.
The white noise variance estimate may be obtained from the sample to sample
variation of the data in regions without true signal peaks. The average of the square
of the difference between adjacent samples should be twice the white noise variance
if only white noise is present. Taking the difference between adjacent samples should
suppress the lower frequencies where the coloured noise is strong. To simplify the
variance estimation process, the assumption is made that the difference completely
suppresses the coloured noise. Thus the estimate of the white noise variance is simply
one half the average of the squared differences between consecutive samples.
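A minimal sketch of this estimator is given below, checked on synthetic data. The slow linear drift stands in for the low-frequency coloured component and, like the sample count and noise level, is an assumption made for the example.

```python
import random

def white_noise_variance(samples):
    """Half the mean squared difference between consecutive samples."""
    diffs = [b - a for a, b in zip(samples, samples[1:])]
    return 0.5 * sum(d * d for d in diffs) / len(diffs)

# Synthetic check: white Gaussian noise riding on a slow drift.
rng = random.Random(0)
true_sd = 0.02
data = [0.001 * n + rng.gauss(0.0, true_sd) for n in range(5000)]
estimate = white_noise_variance(data)
```

The drift barely affects the estimate because differencing removes slowly varying components; a coloured component with more energy at high frequencies would bias the estimate upward.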
The coloured noise variance estimate may be obtained by studying the weak signal-like
features in regions without true signal peaks. The assumption is made that
these features have a Gaussian distribution. Then the vertical interval* of the range
containing 95% of these features is an estimate of four times the standard deviation.
This then leads directly to the coloured noise variance as the square of the standard
deviation estimate.
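This rule can be sketched with empirical percentiles: the central 95% range runs from the 2.5th to the 97.5th percentile, and its width is taken as four standard deviations. The use of `statistics.quantiles` and the synthetic Gaussian test data are assumptions made for the example.

```python
import random
import statistics

def coloured_noise_variance(feature_levels):
    """Variance from the width of the central 95% range, taken as four
    standard deviations (Gaussian assumption)."""
    cuts = statistics.quantiles(feature_levels, n=40)  # cut points every 2.5%
    interval = cuts[-1] - cuts[0]                      # 97.5% minus 2.5%
    sd = interval / 4.0
    return sd * sd

rng = random.Random(1)
features = [rng.gauss(0.0, 0.07) for _ in range(20000)]
estimate = coloured_noise_variance(features)
```

For a Gaussian the 95% range is closer to 3.92 than 4 standard deviations, so the rule slightly underestimates the variance (by about 4%).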
Noise whitening based on such variance estimates has been attempted for a real
data set. Figure 5.2 presents the estimated spectrum for a 23 base noise only region
of the 'noise whitened' A channel starting at base 297. Clearly, the data is not white:
the level drops by roughly 20 dB in going from 0 to π radians/sample. Obviously, the
estimate of the white noise spectral level used in the generation of the noise whitening
filter was too high. Attempts at adjusting the white noise variance estimate led to
'whitened' data that exhibited other non-white features near the middle of the band.
Should the result shown in Figure 5.2 or the noise variance estimates be given greater
credit? The noise variance estimates have the advantage of being formed from a larger
amount of data. They should be more stable and representative of a larger portion
of the data set. On the other hand, the noise whitening is dependent on the noise
spectral model of which the variances are but two parameters. Figure 3.21 has the
advantage of directly modelling the entire noise spectrum but it is from such a small
data set that its quality is poor and it may not be representative of the entire DNA
time-series. Thus, for both the short noise spectral estimate approach and the model
based approach, some degree of spectral mismatch is to be expected and the noise
whitened data will not be truly white.
* adjusted for large scale trends
Figure 5.2: Spectral estimate for a short section of "noise whitened" data lacking signal peaks. (Horizontal axis: FREQUENCY (RADIANS/S), 0 to 3.)
Given the likelihood of mismatch, how may the processor be modified to allow
robustness with respect to this problem? The residual noise colour manifests itself as
a correlation between the terms within the sum over Ki in Equations 4.14 and 4.33.
As adjacent terms are now similar, there are fewer independent samples than implied
by the cardinality, |Ki|. Thus by summing over Ki, the weight given to these terms
is greater than implied by the number of independent samples. Now that these terms
are over weighted relative to the weight on the nuisance parameter log likelihood
terms, hypotheses with improbable jitter will be given more consideration and errors
will result.
To restore balance, the summation over Ki can be weighted by a factor in order
to reflect the true statistical degrees of freedom available. The statistical degrees of
freedom can be expressed as the product of the observation time and the statistical
bandwidth of the data. The statistical bandwidth is the bandwidth of an ideal low
pass process which, over the same observation time, yields the same statistical degrees
of freedom as the process of interest. Defining the fractional bandwidth as the ratio of
the statistical bandwidth to the total bandwidth of the observation, it may be easily
seen that scaling the summation over Ki by the fractional bandwidth provides the
proper weighting to compensate for the correlation in the samples. For example, with
a fractional bandwidth of 0.25, there are one quarter as many independent samples and
so the magnitude of the sum should be as though one quarter as many terms were
summed.
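The weighting can be sketched as a single scale factor applied to the waveform part of the cost before it is combined with the nuisance parameter term. The function names, the per-sample costs and the additive combination shown here are hypothetical; the actual cost is defined by Equations 4.14 and 4.33.

```python
def weighted_region_cost(sample_costs, fractional_bandwidth):
    """Scale the sum over a symbol region by the fractional bandwidth so
    that it counts only the effectively independent samples."""
    if not 0.0 < fractional_bandwidth <= 1.0:
        raise ValueError("fractional bandwidth must lie in (0, 1]")
    return fractional_bandwidth * sum(sample_costs)

def total_cost(sample_costs, jitter_cost, fractional_bandwidth):
    """Hypothetical combination of the waveform and nuisance terms."""
    return weighted_region_cost(sample_costs, fractional_bandwidth) + jitter_cost

region = [0.5] * 16
full = total_cost(region, jitter_cost=1.0, fractional_bandwidth=1.0)
quarter = total_cost(region, jitter_cost=1.0, fractional_bandwidth=0.25)
```

With a fractional bandwidth of 0.25 the sixteen correlated samples contribute as much as four independent ones, so the jitter term regains its intended influence on the decision.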
5.4 Comparison with Typical Automatic Sequencer Techniques
Currently available automatic sequencer algorithms incorporate some techniques
which address the same general signal features as do the various components of the
DNA-ML algorithm. They do differ in how they address these features. Some insight
as to the performance potential of the DNA-ML algorithm may be gained by
examining these differences.
5.4.1 ISI Suppression
Current algorithms address ISI suppression through either a peak sharpening filter,
deconvolution algorithm or maximum entropy algorithm. The peak sharpening filter
sharpens the peak and, undesirably, emphasizes the high frequency portion of the
noise. Deconvolution processing in a sense fits replicas of the generic pulse shape to
the observed data [35]; maximum entropy reconstruction performs similar processing
[?Il. Both signal and noise are represented by these pulses. On the other hand,
the DNA-ML subtracts off only interference from signal peaks as determined by the
sequence hypothesis. It does not emphasize the noise.
In that aspect, the current ISI suppression techniques are to the DNA-ML algorithm
what the linear equalizer is to the Decision Feedback Equalizer (DFE). Based
on the known performance advantage of the DFE [55], the DNA-ML would be expected
to have superior ISI suppression and thus reduced error rates. However, the
analysis that yields the advantage to the DFE is based on a known fixed pulse amplitude,
peak time and shape. Errors in peak amplitude and time parameters in the
DNA-ML algorithm could lead to reduced ISI suppression.
5.4.2 Peak Detection
At the lowest level, some algorithms detect peaks using a criterion such as the
largest local maximum exceeding an amplitude threshold in the time search window.
More sophisticated algorithms integrate peak area into the detection criteria [70].
This approaches the optimum matched filtering of the DNA-ML peak estimator but
gives greater weight to the smaller, hence noisier, portions of the peak. Giddings [38]
uses a dual-Gaussian bandpass filter which would be closer to but still different from
the matched filter. The DNA-ML should offer superior peak estimates.
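The simplest of these criteria can be sketched as follows. The function name, the half-open window convention and the sample trace are assumptions for illustration; the code is not drawn from any particular sequencer.

```python
def detect_peak(samples, start, stop, threshold):
    """Index of the largest local maximum above threshold within
    samples[start:stop], or None if there is none."""
    best = None
    for k in range(max(start, 1), min(stop, len(samples) - 1)):
        is_local_max = samples[k - 1] < samples[k] >= samples[k + 1]
        if is_local_max and samples[k] > threshold:
            if best is None or samples[k] > samples[best]:
                best = k
    return best

trace = [0.1, 0.2, 0.6, 0.3, 0.2, 0.9, 1.0, 0.7, 0.2, 0.1]
```

For this trace, `detect_peak(trace, 0, 10, 0.5)` selects the larger peak at index 6, while narrowing the window to samples 0 through 4 selects the smaller peak at index 2. The decision rests on a single sample, which is why area-based and matched filter detectors are less sensitive to noise.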
As the signal peaks in DNA time-series are large and easily detected with a crude
detector, it is unlikely that the theoretically superior peak detection capability of the
DNA-ML will offer any practical performance improvement for isolated peaks. But
by feeding the superior peak estimates to the ISI suppression algorithm, the DNA-ML
may realize improved performance for overlapping peaks.
5.4.3 Search Window
Current DNA sequencing algorithms at some point impose a search window which
defines where they will search for a peak. This prediction of where the next peak
is to be is implicitly relying on the correlation of the peak times as described in
Chapter 3. The DNA-ML through the peak time jitter pdf allows a broad range
of peak locations and identifies which are unlikely. The hard search window clearly
eliminates candidates outside a certain range and does not discriminate amongst
candidates within that range. Giddings [38], however, includes a confidence weighting
for peaks within the range that factors in distance from expected peak location. Note
that the DNA-ML implementation with peak estimation does impose a search window,
albeit a large one. The full DNA-ML of Chapter 4 is likely to perform better than
all these approaches as it can conceivably handle multiple valid peaks in what would
otherwise be the same search window.
5.4.4 Multi-Peak Tests
Multiple unresolved peaks are addressed in some current algorithms by assessing
whether the total area of the unresolved peak is closer to that of 1, 2, ..., or N isolated
peaks. In using area as a criterion, the noise reduction benefits of averaging over
several samples are gained. However, variation in the waveform which may encompass
inflection points and other indicators of multiple peaks is lost. The DNA-ML explicitly
considers all possible runs of bases and may take advantage of the waveform variations.
Still, peak area may be a very powerful metric and approach the performance of the
more sophisticated DNA-ML algorithm.
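The area test can be sketched as choosing the peak count whose expected total area is nearest the measured one. The unit sample spacing, the assumed area of a single isolated peak and the names are hypothetical values chosen for the example.

```python
def estimate_peak_count(samples, single_peak_area, max_peaks):
    """Number of isolated peaks whose combined area is closest to the
    area under the unresolved peak."""
    area = sum(samples)  # unit sample spacing assumed
    return min(range(1, max_peaks + 1),
               key=lambda n: abs(area - n * single_peak_area))

blob = [0.1, 0.6, 1.0, 1.1, 1.0, 1.1, 0.9, 0.5, 0.1]
count = estimate_peak_count(blob, single_peak_area=3.2, max_peaks=4)
```

Here the area of about 6.4 is matched best by two peaks. The test uses only the sum of the samples, so two waveforms with the same area but different shapes are indistinguishable to it.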
5.4.5 Special Rules
Commercial automatic sequencing algorithms incorporate special rules to handle
known special features of sequencing data. The rise in amplitude for a run of C's
is a classic example of a special feature that has been mapped into a special rule.
The DNA-ML algorithm as yet lacks such rules and would therefore be expected to
perform not as well in regions where these rules apply.
5.4.6 Promise of Approach
For DNA time series exhibiting the features modelled in Chapter 3, the DNA-ML
algorithm should be superior to more ad hoc algorithms. However, as seen above,
many of the current techniques, while not optimum, incorporate processing that approaches
that of the DNA-ML algorithm. The performance advantage of the DNA-ML
algorithm may not be dramatic. Commercial algorithms may have an advantage for
particular signal situations included in their modelling but not in the modelling of
Chapter 3.
The DNA model and the DNA-ML algorithm do offer benefits beyond a reduction
in error rate. They may guide the refinement of the entire sequencing process. For
example, chemical parameters, such as ionic strength, may be adjusted to reduce peak
time jitter. An additional benefit is offered by assigning probabilities to alternative
sequences as this may aid the clinician in forming his diagnosis³.
³ The user may request the evaluation of specific alternatives, or additional record keeping software may maintain a list of the most likely alternatives. In both cases, the cost function provides the key to extracting the probability of the alternative.
Performance with Real Data
In this chapter, the performance of the DNA-ML algorithm will be examined
using two real data sets, one from a 6% cross-linked gel and one from a 4% gel.
Typically, electrophoresis with 6% gels allows the accurate processing of 400 bases in
six hours which is standard for research applications, while the 4% gels allow much
faster processing which is important in clinical applications. Thus, these data sets
permit insight into two different application areas. For both data sets, the results are
compared to those obtained by the Pharmacia ALF Sequencer. While simulated data
is useful for examining algorithm behavior with known models, real data extends the
analysis to include unmodelled effects. Judgement may be made as to whether the
modelling is sufficient to ensure effective algorithm operation.
6.1 Data Set 1 - Typical Case
In this section, the DNA-ML processor is applied to real data that is representative
of that produced in research laboratories.
Chapter 6 O Performance with Real Data 101
The data is from the electrophoresis on a Pharmacia ALF Sequencer of exon 3
of the α-A-crystalline gene of the eye¹. It was preprocessed to remove large scale
trends prior to application of the algorithm described in this paper. Here 'large scale
trends' refers to features that extend over more than fifty bases. First, the inter-lane
mobility variations were removed as described in the Appendix. Then, the large
scale intra-lane variations in mean mobility were modelled by fitting a fourth order
polynomial to the entire set of peak times; these trends were removed by interpolating
and resampling the data based on this polynomial to achieve uniform average mobility.
The central region (bases 11-343) of the data was selected for analysis to remove
artifacts associated with the start and end of the run. Exponential trends in the
noise background level and peak amplitudes were then removed.
¹ An exon is a region of DNA which gets translated into protein. In between exons, DNA features introns, which are regions that do not code for proteins.
For convenience, the scaling during trend removal was such that the resulting mean
amplitude, $\mu_A$, was unity; its standard deviation, $\sigma_A$, was 0.1; all other amplitude
and noise offsets and variances quoted below are in normalized units based on this
scaling. The noise whitening filter (Sections 4.3 and 5.3.2) was designed assuming
the variance due to coloured noise was 0.005 and the variance due to white noise was
0.0000002. The non-stationarities (Section 3.3.2) were modelled by setting pulse width
as $p_w(t) = 13.48 + 0.00419t$ and total jitter variance as $(1.44 + 0.0141t)^2$. The
input disturbance variance was O:, = (6/(4 + 6/(i - p * ) ) ) ~ : ~ and the measurement
variance was o;, = 0.67~:~ + 2; here the offset of two in the measurement variance
reflects the error of the peak estimator as obtained through simulation studies. The
value of the jitter process auto-regressive weighting, β, will be considered in the
next section. The average inter-peak interval was 14.7 samples. To allow for errors
introduced in the trend removal process in addition to the original additive noise,
the mean noise level was set to 0.1 and its variance set to 0.0169; these numbers
were set based on empirical examination of the data. The M-algorithm carried 100
hypotheses. In peak estimation (Sections 5.1.1-2), the influences of one base forward
and three previous bases were removed. The generic unit width pulse shape used was
The central Gaussian part of this pulse shape is a very close fit to observed pulses.
The exponential tails are an approximation to the average seen in the ensemble; many
real pulses were observed to have stronger tails while some did not exhibit tails at all
(Figure 3-6).
6.1.2 Sensitivity to Parameters
The initial results with real data were much poorer than our simulations had
led us to expect. Two factors were instrumental: mismatch in the noise whitening
filter and mismatch in the setting of the jitter process auto-regressive weighting, β,
(Section 3.3.2), in the Kalman predictor (Section 4.5.2). Addressing those problems
eventually led to good performance.
To allow for possible mismatch, the same real data set was reprocessed for several
different hypothesized β's and fractional bandwidths (Section 5.3.2). Table 6.1 summarizes
the results. Here undesirable results are identified as: (i) insertions - the true
data has been split into two segments and additional base values placed between these
segments; (ii) deletions - two segments of the true data have had intervening bases
removed and the segments have been joined together; and, (iii) substitution errors -
on either side of a specific base the true sequence matched the recovered sequence
but at the specific base the true sequence and recovered sequence did not match. The
best results occurred for β near 0.85 and fractional bandwidth near 0.25. Error rates
increased on moving away from that locus, particularly when both β and the fractional
bandwidth were increased. However, it appears that the fractional bandwidth
may be varied over a large range without significantly affecting results. Also included
in the table is the jitter correlation time, TJ, defined as the interval in bases required
for the jitter correlation to drop below 50%.
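One way to tally the three error types is to align the recovered sequence against the true sequence with a minimum edit distance alignment and count each kind of edit along the alignment. The thesis does not state how its counts were produced, so the alignment method and the function below are assumptions made for illustration.

```python
def count_errors(true_seq, called_seq):
    """Return (insertions, deletions, substitutions) along one minimum
    edit distance alignment of called_seq against true_seq."""
    n, m = len(true_seq), len(called_seq)
    # cost[i][j]: fewest edits turning true_seq[:i] into called_seq[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = 1 if true_seq[i - 1] != called_seq[j - 1] else 0
            cost[i][j] = min(cost[i - 1][j - 1] + mismatch,
                             cost[i - 1][j] + 1,
                             cost[i][j - 1] + 1)
    # Trace back from the end, classifying each edit on the way.
    insertions = deletions = substitutions = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            mismatch = 1 if true_seq[i - 1] != called_seq[j - 1] else 0
            if cost[i][j] == cost[i - 1][j - 1] + mismatch:
                substitutions += mismatch
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            deletions += 1
            i -= 1
        else:
            insertions += 1
            j -= 1
    return insertions, deletions, substitutions
```

For example, `count_errors("ACGT", "ACGGT")` reports one insertion and `count_errors("ACGGT", "ACGT")` one deletion. When several alignments share the minimum cost, the split between error types depends on the tie-breaking order used in the trace back.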
Table 6.1: Performance (insertions/deletions/substitution errors) as a function of algorithm parameter settings for 300 bases of real data.
β    TJ    Fractional Bandwidth: 0.25    0.5    1
The β=0 case corresponds to a simple algorithm where the jitter (offset from a
priori mean) in the next sample is assumed to be equal to the previous offset.
6.1.3 Error Comparison
Table 6.2 compares the performance of the 'optimum' algorithm with that of the
internal algorithm of the Pharmacia ALF sequencer. Note that in four cases, both
algorithms make the same error. Also, errors at bases 258 and 260 for the optimum
algorithm correspond to the same event as jitter on the T lane led to an early T peak
that cut off a C at 258 and caused it to appear at 260 instead. The ambiguous peaks
with the Pharmacia ALF sequencer were due to its software allowing for heterozygotes,
that is, the presence of similar DNA molecules from mother and father that differ at only a
few bases. Thus, rather than just an A at a point in the sequence it is possible to
simultaneously have an A and a C at the same point. As it turns out, the sample
was probably heterozygous AC at 118 as identified by the Pharmacia algorithm; here,
the optimum algorithm's base selection reflected the GenBank sequence. For base 6,
however, the Pharmacia algorithm was in error.
6.1.4 Error Analysis
Even though the Pharmacia algorithm performed slightly better than the DNA-ML
algorithm, the DNA-ML algorithm has the potential to do better when biases
introduced during pre-processing and during the estimation of noise and signal statistics
are removed. For simulations with the same parameter settings as above and with
pulse shape, non-stationarities and model parameters that are known exactly, the error
rate was only 0.7%.
Table 6.2: Errors observed for DNA-ML algorithm (β=0.85, fractional bandwidth=0.25) and Pharmacia ALF internal algorithm for 300 bases of real data.
Base Number   DNA-ML                Pharmacia
6             -                     Ambiguous
118           -                     Ambiguous *
215           Del. G in triplet     Del. G in triplet
218           Ins. G form triplet   -
252           Ins. G form pair      Ins. G form pair
258           Del. C in pair        Del. C in pair
260           Ins. C form pair      -
275           Del. G in pair        Del. G in pair
Error rate    2%                    1.7% (* not incl.)
Examination of the actual errors encountered with real data allows us to infer the
most likely mechanism for error generation. First, from Table 6.1, note that most
of the errors were insertions or deletions. Further, in Table 6.2, it can be seen that
the insertions and deletions concern pairs or triplets of consecutive bases on the same
channel. From this we infer that the errors were probably due to the effects of ISI
from adjacent bases.
For optimal processing of ISI, the pulse width and pulse shape of adjacent bases
have to be known accurately. Errors in pulse width setting can lead to large errors
in waveform comparison, particularly near the steep edges of the pulse. Also, the
presumption of a single generic signal pulse shape could lead to similar errors. While
the mainlobe is stable, there appears to be fluctuation from peak to peak with respect
to the tail that follows the peak. Figure 3-6 gives examples of this fluctuation. For
short sections of data, particular tail realizations could lead to significant differences
in the spectra. These tail variations will affect the accuracy of the ISI removal process
and thus the accuracy of the peak estimator and the conditional likelihood component
of the cost (Equation 4.14).
Other factors may have led to a poorer performance with real data than with
simulated data. Errors in trend removal could certainly lead to problems as the
resulting offsets appear as discrepancies in the waveform comparison portion of the
algorithm (Equation 4.14). As the reader may recall, an attempt was made to address
this problem by including a noise mean offset and setting the algorithm noise variance
to be larger than that expected in the DNA time series; these represent additional
parameters which may not be at their best settings.
With respect to the whitening filter, it must be emphasized that the analysis of the
mismatch in Section 5.3.2 is based on a short noise only region. It is difficult to extract
noise data as most regions are contaminated by signal peaks. Empirically, the noise
also appears to be signal dependent; this would imply that one cannot characterize
the noise by performing electrophoresis in the absence of DNA. Additional work is
needed to properly characterize the noise.
Several other assumptions regarding the parameters of the DNA-ML algorithm
seem to have only a limited effect on the error rate. For example, the M-algorithm
carried only 100 hypotheses. Increasing this number should improve performance. On
the other hand, while similarly restricted to 100 hypotheses, the simulation achieved
much better performance. Also, the algorithm used only the three previous bases
and one future base in ISI removal; interference from bases beyond this region would
contribute directly to errors in peak estimation and waveform comparison. However,
from Table 2, interference from bases outside the window of bases used for ISI removal
does not appear to be a significant problem.
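
The role of the hypothesis limit in the M-algorithm can be illustrated with a
minimal sketch. This is not the DNA-ML implementation: the function name
m_algorithm, the TARGET sequence and the branch cost toy_cost are hypothetical
stand-ins, whereas the actual processor costs each hypothesis through waveform
comparison and parameter prediction. The sketch only shows the pruning step, in
which every surviving hypothesis is extended by each of the four bases and only
the M lowest-cost extensions are carried forward.

```python
import heapq

BASES = "ACGT"

def m_algorithm(num_bases, branch_cost, m=100):
    """Breadth-first search that keeps only the m lowest-cost hypotheses per depth."""
    survivors = [(0.0, "")]  # (accumulated cost, hypothesized sequence)
    for _ in range(num_bases):
        # Extend every survivor by each possible base and accumulate the cost.
        extended = [
            (cost + branch_cost(seq, base), seq + base)
            for cost, seq in survivors
            for base in BASES
        ]
        # Prune: carry only the m best extensions to the next depth.
        survivors = heapq.nsmallest(m, extended)
    return survivors

# Hypothetical branch cost: zero when the new base agrees with a known target.
TARGET = "GATTACA"

def toy_cost(seq, base):
    return 0.0 if TARGET[len(seq)] == base else 1.0

survivors = m_algorithm(len(TARGET), toy_cost, m=100)
best_cost, best_seq = survivors[0]
print(best_seq, best_cost, len(survivors))
```

With a larger M, a hypothesis that is temporarily costly (for example because of
a noise burst) is less likely to be pruned before later data can support it,
which is why increasing the number of hypotheses is expected to help.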
6.2 Data Set 2 - High Speed Gel
In this section, the DNA-ML processor is applied to real data obtained from a
gel set for fast electrophoresis. Rather than the 6% "bis" to acrylamide mixture
used in Data Set 1, Data Set 2 uses a 4% "bis" to acrylamide mixture. This implies
fewer cross-links, less mechanical resistance and faster passage of DNA molecules
through the gel. This fast gel data may foreshadow future clinical applications of
DNA sequencing where speed and productivity are highly valued.
6.2.1 Source / Rationale
Data Set 2 was also taken from the α-A-crystalline gene. This time a long segment
of DNA was selected spanning approximately 2000 bases. This included exon 2,
introns and exon 3. Amplification was via insertion into a plasmid and then in turn
into a bacterial culture (Data Set 1 used PCR for amplification). After amplification
the plasmids were nicked and changed from circular to linear form and then sequenced
using M13 Reverse as the sequencing primer. M13 Reverse is complementary to part of
the plasmid's own DNA. Thus, using M13 Reverse, the observed sequence corresponds
initially to plasmid DNA, then that of the primer used to select the desired DNA for
amplification, intervening α-crystalline DNA, then exon 3, intron and exon 2. M13
Reverse is so named as it leads to the sequencing of the complementary strand and
therefore the sequence is both complementary and in reverse order to that of the true
sequence.
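
The relationship between the observed sequence and the true sequence can be
made concrete with a short sketch. The function name reverse_complement and the
example read are illustrative only; the mapping itself is the standard
Watson-Crick pairing (A with T, C with G).

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(observed):
    """Map a read of the complementary strand back to the true-strand sequence."""
    # Reverse the order of the bases, then replace each base by its complement.
    return "".join(COMPLEMENT[base] for base in reversed(observed))

observed = "TTGCA"  # hypothetical read obtained with a reverse primer
print(reverse_complement(observed))
```

Applying the function twice returns the original read, so the same routine
converts in either direction.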
This data set differs from Data Set 1 in several significant ways. First, the low
gel cross-linking leads to an inter-base separation which is much shorter (mean 10.1
samples as opposed to 14.7 in Data Set 1; sampling frequency, nominal voltage, etc.
were unchanged). Second, the pulse width expressed in units of peak separation
is much higher than for Data Set 1 (1.46 versus 0.92 for the initial bases). Thus,
Inter-Symbol Interference (ISI) is greater in Data Set 2. Longer DNA molecules are
present due to the longer template in Data Set 2. As well, their hydrolysis products
are present to contribute to the background noise. As the template is so long that
the sequencing polymerase typically does not succeed in making a full length copy,
the large end of segment peak seen in Data Set 1 is not seen in Data Set 2. Unlike
Data Set 1, Data Set 2 used 7-deaza-GTP instead of GTP as a substrate. This
molecule cannot form the hydrogen bonds that lead to secondary structure such as
hairpin loops. This may impact on the structure of the peak parameter covariances.
Finally, as sequencing proceeds in different directions and on complementary strands,
sequence dependent interactions with the polymerase are likely to be different, even in
the area of exon 3. Data Set 2 should thus demonstrate different properties than Data
Set 1. In particular, it should highlight the performance of the DNA-ML algorithm
in the high ISI environment typical of fast gels and long sequences.
Figure 6.1: Raw time series for Data Set 2. Individual channel data has been offset in this figure for clarity.
6.2.2 Model and Adjustments
Preprocessing of Data Set 2 was as described in the Appendix with one exception.
As the data set lacked the rising background due to end of segment and primer
labelling, compensation was not required for this trend. Figures 6.1 and 6.2 show the
DNA time-series before and after compensation. Interestingly, while the template was
2000 bases long, the sequencing copies appeared to die out beyond approximately 800
bases (i.e. 8000 samples at 10 samples per base). Apparently, the polymerase (Thermo
Sequenase), template and copy complex became unstable in this region. Attempts at
increasing the copy length through varying the ddNTP:dNTP ratio and Mg++ cation
concentration were unsuccessful.
As before, manual cursoring based on a priori sequence knowledge was used to
identify the correct peaks for use in the correlation models. To model to the end of
exon 3, 287 bases were cursored and used to estimate parameters. Figure 6.3 presents
the correlation in peak jitter. The structure of this correlation is clearly consistent
with that discussed in Chapter 3. Similarly, the correlation of the difference between
adjacent peak times, Figure 6.4, is consistent with the earlier modelling. Data Set 2
used 7-deaza-GTP instead of GTP as a substrate and therefore should not suffer from
the effects of secondary structure such as hairpin loops. Thus, Figures 6.3 and 6.4
suggest that such structure is not a major contributor to the jitter covariance.

Figure 6.2: Compensated time series for Data Set 2 corresponding to first 1000 samples from Figure 6.1. Individual channel data has been offset in this figure for clarity. Top curve is for A channel with C, G and T channels presented in order from top.
Measured β was 0.78. Average jitter process input variance was 2.86 (samples
squared) and measurement variance was 10.1. The high value of the latter is consistent
with the difficulty in obtaining accurate peak measurements when the peaks are wide
and the noise is high. The non-stationarity (Section 3.3.2) of the total jitter standard
deviation was described by $\sigma = 2.7 + 0.0141\,i$ where $i$ is the base number.
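
As a rough consistency check on these figures, the sketch below assumes (the
detailed model of Chapter 3 is not reproduced here) that the jitter is a
first-order auto-regressive process with weighting 0.78 driven by white input
noise of variance 2.86, and that the measurement noise of variance 10.1 adds
independently. Under those assumptions the steady-state total standard
deviation can be compared with the fitted linear trend. The function names are
illustrative.

```python
import math

BETA = 0.78             # measured auto-regressive weighting
INPUT_VAR = 2.86        # jitter process input variance (samples squared)
MEASUREMENT_VAR = 10.1  # measurement variance (samples squared)

def stationary_ar1_variance(beta, input_var):
    """Steady-state variance of x[i] = beta * x[i-1] + w[i], with var(w) = input_var."""
    return input_var / (1.0 - beta ** 2)

def fitted_total_jitter_std(i):
    """Linear fit to the total jitter standard deviation quoted in the text."""
    return 2.7 + 0.0141 * i

process_var = stationary_ar1_variance(BETA, INPUT_VAR)
# Assumption: process and measurement contributions are independent and add.
total_std = math.sqrt(process_var + MEASUREMENT_VAR)

print(round(process_var, 2), round(total_std, 2))
print(round(fitted_total_jitter_std(100), 2))
```

The two printed standard deviations are of similar size near base 100, which is
no more than an order-of-magnitude check given the assumptions made.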
Figure 6.3: Covariance of peak time jitter for Data Set 2. Inset is a logarithmic plot of the right side of the mainlobe. X-axis is lag in bases.

Neither amplitude nor pulse width had significant covariance values beyond lag
zero. After compensation, amplitude was unit mean with standard deviation of 0.3
(units of mean). Pulse width non-stationarity was described by $\mathrm{pw} = 14.74 + 0.0331\,i$
where $i$ is the base number. Note that the initial pulse width for the 4% gel is higher
than the 13.48 samples used with the 6% gel (Data Set 1) as the diffusion coefficient
is higher. The peak mainlobes appeared to be Gaussian. Due to high ISI, clean
examples of the tails of the pulse shape were unavailable. Therefore, a Gaussian
pulse shape was used in the DNA-ML algorithm for Data Set 2.
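
The pulse widths quoted here can be related to the normalized widths given in
Section 6.2.1 (1.46 versus 0.92) by dividing by the mean inter-base separation
of each data set. The short sketch below reproduces that arithmetic; the
function names are illustrative, and the mean separations (10.1 and 14.7
samples) are taken from the text.

```python
def pulse_width(i, initial=14.74, slope=0.0331):
    """Fitted pulse width (samples) at base number i for Data Set 2."""
    return initial + slope * i

def normalized_width(width_samples, separation_samples):
    """Pulse width expressed in units of peak separation."""
    return width_samples / separation_samples

data_set_2 = normalized_width(pulse_width(0), 10.1)  # 4% gel, initial bases
data_set_1 = normalized_width(13.48, 14.7)           # 6% gel, initial bases

print(round(data_set_2, 2), round(data_set_1, 2))    # 1.46 0.92
```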
Following the procedures discussed in Section 5.3.2, coloured noise variance was
set to 0.0016 and white noise variance was set to 0.000009, both in units of the
peak mean squared (i.e. the square of the mean height of valid isolated peaks). Both
factors were extremely difficult to estimate due to the high ISI. Unlike the previous
data set, it was not possible to find a sufficiently wide "noise-only" region to form a
spectral estimate that could serve as a check on these parameter values.

Figure 6.4: Covariance of difference between successive peak time jitter values for Data Set 2. X-axis is lag in bases.

Rather, the whitened data was examined. With the settings of the previous paragraph
the noise was not fully whitened, as may be seen by the broad noise peaks in
Figure 6.5. Lowering the white noise variance to 0.000001 (i.e. reducing standard
deviation by a factor of 3) led to the whitened noise data seen in Figure 6.6. From
the bursts of noise at approximately 1500, 1800, 2000 and 2600 samples, it appears
that an additional high frequency, non-stationary noise process is present. In
uncompensated data, it is evidenced as sudden jumps in the intensity values. Also, for this
data set, the knowledge of the peak shape was poor, which implies that our coloured
noise spectrum may be inaccurate. To avoid problems due to the burst noise and
inaccurate pulse shape knowledge, a white noise variance of 0.000009 was selected,
which limited the emphasis on high frequencies after whitening. The fractional
bandwidth was set to 0.25 to reflect the reduced degrees of freedom available for waveform
comparison given the coloured data.
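
The effect of the white noise variance setting on high frequency emphasis can
be sketched numerically. The sketch assumes, purely for illustration, a noise
power spectrum made up of a flat white term plus a coloured term shaped like
the power spectrum of a Gaussian pulse. The pulse standard deviation
PULSE_SIGMA and the normalization of the coloured term are hypothetical choices
and are not taken from the thesis.

```python
import math

def noise_psd(f, coloured_var, white_var, pulse_sigma):
    """Assumed noise spectrum at frequency f (cycles/sample).

    Coloured term: shaped like the power spectrum of a Gaussian pulse with
    standard deviation pulse_sigma (samples). White term: flat.
    """
    coloured = coloured_var * math.exp(-(2.0 * math.pi * f * pulse_sigma) ** 2)
    return coloured + white_var

def whitening_gain(f, coloured_var, white_var, pulse_sigma):
    """Magnitude response of a filter that flattens the assumed noise spectrum."""
    return 1.0 / math.sqrt(noise_psd(f, coloured_var, white_var, pulse_sigma))

PULSE_SIGMA = 6.0  # hypothetical pulse standard deviation in samples

for white_var in (0.000009, 0.000001):
    low = whitening_gain(0.0, 0.0016, white_var, PULSE_SIGMA)
    high = whitening_gain(0.5, 0.0016, white_var, PULSE_SIGMA)
    print(white_var, round(high / low, 1))
```

Under these assumptions the gain at the highest frequencies is set almost
entirely by the white noise variance, so reducing that variance by a factor of
nine raises the high frequency gain by a factor of three. This is the mechanism
by which the larger setting limits the emphasis placed on burst noise.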
6.2.3 Error Comparison
The DNA-ML algorithm and the Pharmacia ALF algorithm experienced difficulty
in sequencing this data set. However, the mechanisms of error generation appear quite
different and suggest that direct detailed comparison is not meaningful. Therefore,
their performance is discussed separately.
Figure 6.5: Selected section of data after application of whitening filter with coloured noise variance of 0.0016 and white noise variance of 0.000009, all in units of peak mean squared. X-axis is time in samples.
The Pharmacia ALF algorithm sequenced the first 110 bases. Beyond that point
in the data, the algorithm deemed the data to be of too low a quality to sequence. The
Pharmacia ALF algorithm experienced 16 deletions in the first 60 bases. The problem
may have been due to the heavy ISI interacting with its base clock recovery algorithm.
As adjacent peaks were unresolved, fewer peaks were assumed to be present and so the
inter-base separation was estimated to be higher. Beyond base 60, enough instances
of isolated peaks were observed to correct this timing problem. An insertion error
occurred at base 71. No other errors occurred in the 110 bases the algorithm marked
(these 110 bases correspond to 125 true bases = 110 bases marked by Pharmacia +
16 deletions - 1 insertion). Average error rate was therefore 15.5%.
Figure 6.6: Selected section of data after application of whitening filter with coloured noise variance of 0.0016 and white noise variance of 0.000001, all in units of peak mean squared. X-axis is time in samples.

As will be discussed further in the error analysis section, several variants / parameter
settings were tried for the DNA-ML algorithm. For comparison with the
Pharmacia ALF, in the first 125 bases, the baseline case produced only 6 errors,
yielding an error rate of 4.8%. However, rather than being an indication of a vastly
superior algorithm, this factor of three improvement may be due to accurate a priori
knowledge of parameters such as mean base separation. The baseline case produced
23 errors in 200 bases sequenced (8 insertions / 7 deletions / 8 substitution errors).
Most of the errors (14) occurred in the region between base 150 and 200.
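
The error rates quoted in this section follow from simple counting, which the
sketch below reproduces. The variable and function names are illustrative. The
15.5% figure for the Pharmacia ALF is obtained when its 17 errors are divided
by the 110 bases it marked, while the DNA-ML figure uses the 125 true bases.

```python
def error_rate(errors, bases):
    """Error rate in percent."""
    return 100.0 * errors / bases

pharmacia_marked = 110
pharmacia_deletions = 16
pharmacia_insertions = 1

# Each deletion hides a true base; each insertion adds a base that is not there.
true_bases = pharmacia_marked + pharmacia_deletions - pharmacia_insertions
pharmacia_errors = pharmacia_deletions + pharmacia_insertions

print(true_bases)                                                 # 125
print(round(error_rate(pharmacia_errors, pharmacia_marked), 1))   # 15.5
print(round(error_rate(6, true_bases), 1))                        # 4.8
print(round(error_rate(8 + 7 + 8, 200), 1))                       # 11.5
```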
6.2.4 Error Analysis
In this section, DNA-ML algorithm errors made in sequencing Data Set 2 are
examined. Small changes to algorithm parameters are investigated in hopes of further
reducing the error rate.
First, Figure 6.7 presents the region about the final point of the marked sequence of
the Pharmacia ALF. Just before sample 1250 is the final peak called by the Pharmacia
ALF which was a "C" (second curve from the top). In this area, simultaneous large
levels are seen on the C and T channels. As well, the G lane level is well away from
the assumed noise mean of 0.1. Presumably, the Pharmacia algorithm found the data
to be overly ambiguous here. As such ambiguity is generally found at the end of a
data set, it stopped processing on the assumption that subsequent data would be
poorer still.

Figure 6.7: Compensated time series for Data Set 2 corresponding to bases 110-140. Individual channel data has been offset in this figure for clarity. Top curve is for A channel with C, G and T channels presented in order from top. DNA-ML algorithm estimates of peak amplitudes and times are indicated by "*". X-axis is time in samples. True and estimated sequences are indicated at top and bottom, respectively, coded as A=1, C=2, G=3 and T=4.
The DNA-ML algorithm processed through this region but did incur errors.
Table 6.3 presents the error locations and types for the baseline DNA-ML algorithm
and parameters. Substitution errors occurred at base 127 (approximately sample 1270)
where a C was called instead of a G and at base 133 (approximately sample 1335)
where a G was called instead of a C. In both cases, the erroneous peak occurred in
the middle of a run. The peak was called with a low level which should increase its
cost. However, in these cases, the erroneous peak may, through the ISI removal
processing of the parameter estimator, have reduced the ISI in the neighbouring peak
estimates. The resulting peak estimates may have been closer to the means and thus
lowered the cost of the hypothesis.

Table 6.3: Data Set 2 error type and location for DNA-ML with baseline parameter settings.

Error          Location (Base Number)
               Bases 1-125    Bases 126-200
Insertion      18, 70         146, 163, 164, 167, 168, 189
Deletion       8, 21, 47      151, 158, 181, 194
Substitution   43             127, 133, 153, 170, 173, 178, 191

Turning our attention to another interesting region, Figure 6.8 displays the
compensated but unwhitened time-series corresponding to the first 30 bases, together with
the DNA-ML algorithm estimates of peak times and amplitudes. The errors listed
in Table 6.3 at bases 8, 18 and 21 appear in Figure 6.8 at time samples 45, 170 and
205, respectively. The insertion error at base 18 / sample 170 stands out as the peak
in the A lane is so small relative to the valid peaks. This small peak was accepted
by the algorithm as the setting of the peak amplitude variance was high. In fact, the
small peak was within two standard deviations of the mean amplitude setting and
thus belonged within the region in which 95% of valid peaks would lie, assuming correct
parameter settings. It is likely that the peak amplitude variance was set erroneously
high. This will be investigated further later in this section.

Figure 6.8: Compensated time series for Data Set 2 corresponding to first 30 bases. Individual channel data has been offset in this figure for clarity. Top curve is for A channel with C, G and T channels presented in order from top. DNA-ML algorithm estimates of peak amplitudes and times are indicated by "*". X-axis is time in samples. True and estimated sequences are indicated at top and bottom, respectively, coded as A=1, C=2, G=3 and T=4.

Figure 6.9 provides insight into the ISI suppression and parameter estimation
processing. A run of four G's, encompassing bases 49-52 of Data Set 2, is unresolved
in the raw data. There may be inflection points that indicate the presence of the
four peaks; however, these visual cues could also be simply additive coloured noise.
The whitening filter helps to resolve the peaks as the increased high frequency
emphasis sharpens the peaks. Noise is clearly emphasized as well, as evidenced by the
fluctuations between samples 450 to 480 and 510 to 540. In the matched filtered and
ISI suppressed data, the influence of the first two G's has been removed. The high
accuracy of the estimated positions of past peaks facilitates the easy removal of their
influence. The fourth G still appears as a substantial peak because the predicted
peak location was very inaccurate, which in turn misaligned the replica used in
cancellation and permitted much of the peak to remain. Still, after this processing, the
peak of the third G is strongest and simple peak picking will yield good estimates of
amplitude and peak time. In a more severe scenario, noise and the next peak location
could have led to the next peak being strongest after this processing. In such a case,
if the search window was wide enough to encompass the next peak, then the next
peak would be selected and the third G would be deleted. Such errors can be difficult
to classify. It may be that the deletion at base 8 / sample 45 in Figure 6.8 is due to
such a phenomenon.
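
The cancellation and peak picking steps described above can be sketched on
synthetic data. This is a simplified illustration and not the DNA-ML processor:
it works directly on noise-free Gaussian pulses rather than on whitened and
matched filtered data, and the pulse standard deviation, peak positions and
search window are hypothetical values chosen so that neighbouring peaks overlap
heavily.

```python
import math

def gaussian_pulse(t, centre, amplitude, sigma):
    """Gaussian pulse shape evaluated at sample time t."""
    return amplitude * math.exp(-0.5 * ((t - centre) / sigma) ** 2)

def synthesize(times, peaks, sigma):
    """Sum of pulses; peaks is a list of (centre, amplitude) pairs."""
    return [sum(gaussian_pulse(t, c, a, sigma) for c, a in peaks) for t in times]

def suppress_isi(signal, times, neighbours, sigma):
    """Subtract replicas of the neighbouring peaks from the signal."""
    replica = synthesize(times, neighbours, sigma)
    return [s - r for s, r in zip(signal, replica)]

def pick_peak(signal, times, lo, hi):
    """Return (time, amplitude) of the largest sample inside the search window."""
    amplitude, time = max((s, t) for s, t in zip(signal, times) if lo <= t <= hi)
    return time, amplitude

SIGMA = 6.0                     # hypothetical pulse standard deviation (samples)
times = list(range(440, 581))
run = [(490, 1.0), (500, 1.0), (510, 1.0), (520, 1.0)]  # four closely spaced peaks
signal = synthesize(times, run, SIGMA)

# Accurate estimates of the two past peaks and an accurate prediction of the next.
clean = suppress_isi(signal, times, [run[0], run[1], run[3]], SIGMA)
# Same, but the next peak is predicted 4 samples late, so cancellation is partial.
biased = suppress_isi(signal, times, [run[0], run[1], (524, 1.0)], SIGMA)

clean_time, clean_amp = pick_peak(clean, times, 503, 517)
biased_time, biased_amp = pick_peak(biased, times, 503, 517)
print(clean_time, round(clean_amp, 3))
print(biased_time, round(biased_amp, 3))
```

When the replica of the next peak is misaligned, part of that peak survives the
subtraction and pulls the estimated time and amplitude of the third peak away
from their true values, which is the behaviour described for the fourth G.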
Figure 6.9: Waveforms associated with 4 "G" run from base 49 to 52 selected to illustrate estimation of third "G". Raw waveform is compensated but not whitened. Dashed curve is formed from whitened data by subtracting estimated contribution from previous two bases and predicted contribution from next base, and then applying matched filter.

Table 6.3 indicates the presence of a number of errors centered around base 170.
Figure 6.10 presents the compensated but unwhitened time series in this region
together with the DNA-ML algorithm's estimates of peak amplitudes and locations.
Evident in the area near base 170 / sample 1700 is a discontinuity in the data. This
appears on all lanes, though shifted in time due to lane alignment processing. The
event is likely due to temporary removal of field voltage as an operator might do to
allow visual inspection of the gel. Such a discontinuity is greatly emphasized by the
high-pass action of a whitening filter as may be seen directly about sample 1700 in
Figure 6.5. The time extent of the event is also emphasized by the whitening filter.
This leads to the errors reported at bases 167, 168, 170 and 173.
Figure 6.10: Compensated time series for Data Set 2 corresponding to bases 155 to 185. Individual channel data has been offset in this figure for clarity. Top curve is for A channel with C, G and T channels presented in order from top. DNA-ML algorithm estimates of peak amplitudes and times are indicated by "*". X-axis is time in samples. True and estimated sequences are indicated at top and bottom, respectively, coded as A=1, C=2, G=3 and T=4.

Inspection of the time-series and peak estimates about other errors revealed two
other phenomena which were contributing to errors. Eight of the errors could be
attributed to the valid peaks being weak. Four of these occurred on the C lane and
three on the G lane. The implication is that these lanes were scaled low. Another
group of six errors appeared to be due to a peak time jitter variance which may have
been too high. Five of these errors were insertions where the correct peaks appeared
near the anticipated times but there were earlier noise or resolution problems that
suggested inserting an erroneous peak. With a smaller jitter variance setting these
erroneous peaks might not have been accepted. As these events were insertions that
imply a shorter inter-base interval, the other mechanism that may lead to these errors
is the bias towards the shorter length hypothesis given two unequal length hypotheses
(see Section 5.2.1).
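
The way in which a variance setting controls which peaks are accepted can be
illustrated with the amplitude figures quoted in this section: unit mean
amplitude with a baseline variance of 0.09 (standard deviation 0.3), compared
with a variance one quarter as large, 0.0225. The sketch below uses a hard
two standard deviation gate for clarity. This is a simplification, since the
algorithm itself penalizes unlikely amplitudes through the hypothesis cost
rather than rejecting them outright, and the small peak amplitude of 0.45 is a
hypothetical value.

```python
import math

def acceptance_interval(mean, variance, n_sigma=2.0):
    """Interval of amplitudes lying within n_sigma standard deviations of the mean."""
    std = math.sqrt(variance)
    return mean - n_sigma * std, mean + n_sigma * std

def is_accepted(amplitude, mean, variance, n_sigma=2.0):
    """True when the amplitude falls inside the acceptance interval."""
    low, high = acceptance_interval(mean, variance, n_sigma)
    return low <= amplitude <= high

small_peak = 0.45  # hypothetical amplitude of a spurious small peak (units of mean)

print(is_accepted(small_peak, 1.0, 0.09))    # True: inside roughly 0.4 to 1.6
print(is_accepted(small_peak, 1.0, 0.0225))  # False: outside roughly 0.7 to 1.3
```

The same reasoning applies to the peak time jitter variance: a smaller setting
narrows the range of peak times that can be accepted without a large cost.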
This analysis suggests that parameter settings may be adjusted for better results.
Based on the above observations, a large number of parameter settings were
investigated and the modifications yielding the best result were: (1) peak amplitude variance
reduced from 0.09 to 0.0225; (2) jitter measurement variance reduced from 10.1 to
6.7 (with attendant modification of total jitter variance); (3) C and G lanes scaled
by 1.15; and, (4) fractional bandwidth set to 0.25 (baseline case had unit fractional
bandwidth). As shown in Table 6.4, total errors were reduced from 23 to 21 over
two hundred bases, but, more importantly, in the 125 bases corresponding to the
110 bases marked by the Pharmacia ALF algorithm, the error rate was only 3.2% as
opposed to the Pharmacia ALF's 15.5%.

Table 6.4: Data Set 2 error type and location for DNA-ML with modified parameter settings.

Error          Location (Base Number)
               Bases 1-125    Bases 126-200
Insertion      70             146, 167, 168, 175, 176, 189, 195
Deletion       2, 47          151, 173, 181
Substitution   43             127, 142, 161, 170, 178, 191, 197

As predicted, changing the peak amplitude variance removed the error at base 18.
It also removed errors at bases 21 and 158 though new errors were introduced near
the end of the run at bases 195 and 197. Scaling C and G lanes removed errors at
bases 133 and 153. The errors at bases 163 and 164, attributed to high jitter variance,
have also been removed. Reducing the jitter measurement variance without changing
the fractional bandwidth increased rather than lowered the error rate; clearly, the
different parameters interact to determine final performance. In general, the
modifications improved results early in the data set but were somewhat offset by new errors
introduced later in the data set.

6.2.5 Assessment and Significance
For good quality data as demonstrated with Data Set 1, the DNA-ML algorithm
achieved performance comparable to the commercial Pharmacia ALF algorithm. For
data with high ISI (Data Set 2), the DNA-ML algorithm appears to offer as much
as a four-fold improvement in error rate relative to the Pharmacia ALF algorithm.
However, the validity of the comparison is limited due to differences in the
initialization parameters for the two algorithms. Nonetheless, the preliminary investigation
suggests that the DNA-ML algorithm has significant potential in dealing with fast
gel data. This in turn implies that the DNA-ML algorithm may offer a performance
improvement in clinical applications.
CHAPTER 7
Conclusions
7.1 Thesis Summary
This thesis has provided the foundations for rigorous study of the DNA time-series.
The characteristics of the time-series arising from DNA sequencing have been
investigated, both from a theoretical and a statistical perspective. A statistical model has
been developed that reflects the local statistics of the DNA time-series. The maximum
likelihood sequence detector has been derived for this model. The implementation
of the processor addressed issues ranging from computational loading through to the
comparison of hypotheses of different lengths. Real data has been used to investigate
the performance of the processor. In comparison with a commercial algorithm, the
results indicate improved performance in situations where there is high overlap
between the peaks of adjacent bases. This is likely to be the case when DNA sequencing
is employed in high throughput clinical applications.
7.2 Thesis Contributions
The major contributions of this thesis are:
(1) The creation of the first statistical model of the DNA time-series;
(2) The derivation of the first optimal algorithm for DNA sequencing.
The development of the DNA time-series model focussed on ensuring its utility
for sequencing algorithm development. A generic peak shape, parameterized by peak
time, amplitude and width, is used to represent the signal peaks. The characterization
of the fluctuation of peak parameters includes their point probability density functions
and their correlations with neighbouring peaks. A practical noise model is proposed
consisting of a white noise component and a noise component with spectra similar to
that of the signal itself. The noise and peak parameter processes are non-stationary
with variances increasing with base number. The complete model can be used to
generate simulations for the comparison and evaluation of sequencing algorithms.
Based on the DNA time-series model, an optimal Maximum Likelihood DNA
sequencing algorithm was derived. It selects the hypothesized sequence that maximizes
the probability of the observed signals. The uncertainty associated with parameter
values is addressed by maintaining multiple hypotheses not just for the different
possible information sequences but also for the different possible parameter values. The
structure of the algorithm features two main branches, one that compares waveforms
based on hypothesized parameters, and one that predicts (and costs) parameter
values.
Additional significant contributions include:
(1) The creation of the hypothesis cost function which may be used to provide a
probability for different possible sequences to allow the user to directly assess sequence
alternatives;
(2) The recognition of the asynchrony between bases and samples of the DNA
time-series and the problems it leads to in comparing hypotheses;
(3) The introduction and application of techniques from communication theory,
including the Fano metric and M-algorithm, to DNA sequencing.
The first of these has potential clinical value as it allows meaningful comparison
between two possible genetic sequences. As to the second, asynchrony at the level
seen in DNA time-series is not found in communication systems. The resulting
problems associated with unequal length hypotheses that incorporate the same number of
symbols have not been dealt with elsewhere. The third relates to the introduction of
a valuable new set of tools to the DNA sequencing community.
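
One way in which a hypothesis cost could be turned into probabilities for
competing sequences is sketched below. The sketch assumes that each cost is a
negative log-likelihood up to a constant common to all hypotheses, and that the
listed hypotheses are the only candidates considered. Both are assumptions made
for illustration, and the sequences and cost values are hypothetical.

```python
import math

def hypothesis_probabilities(costs):
    """Convert costs (assumed negative log-likelihoods) to normalized probabilities.

    costs: dict mapping a hypothesized sequence to its cost.
    """
    best = min(costs.values())
    # Subtract the best cost before exponentiating to avoid numerical underflow.
    weights = {seq: math.exp(-(cost - best)) for seq, cost in costs.items()}
    total = sum(weights.values())
    return {seq: weight / total for seq, weight in weights.items()}

costs = {"ACGT": 10.2, "ACTT": 12.5, "AGGT": 15.0}  # hypothetical costs
probs = hypothesis_probabilities(costs)
for seq in sorted(probs, key=probs.get, reverse=True):
    print(seq, round(probs[seq], 3))
```

Under these assumptions, two sequences that differ at a single base can be
compared directly through the ratio of their probabilities.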
7.3 Suggestions for Future Research
This work provides the foundation and the structure for further research into the
inter-dependencies between the underlying chemistry and physics of DNA sequencing
and the other properties of the optimal sequencer.
Physical modelling of DNA electrophoresis has concentrated on gross behavior.
New work is needed to provide a physical model which fully explains the
correlation observed in peak time jitter. One could study the relationships between the
ionic strength of the solution (known to affect the persistence length) and the jitter
auto-regressive weighting, β. Similarly, there has yet to be a direct verification and
assessment of the chemical noise mechanisms described in this thesis. This would
be invaluable in ensuring model fidelity. For example, an experiment could be
conducted to measure the production of hydrolysis products with time and temperature
as the key variables. Further, the chemical and/or physical mechanism behind the
exponentially decaying tails of the pulse shape should be elucidated.
Further development is required to make the sequencing algorithm usable by
molecular biologists. As was seen in Chapter 6, errors in parameter settings can
have a very significant effect on system performance. On-line estimation of
parameters is necessary as well as tracking and correction of large scale parameter trends.
The techniques of system identification should be directly applicable.
Finally, the probabilistic description and hypothesis cost function developed in
this thesis can be applied to specific genetic tests as opposed to general sequencing.
Here, the hypothesis cost function may be assessed for the cases of mutation present
or mutation absent at base N. The early stages of this work are now underway at the
Institute of Bio-Medical Engineering of the University of Toronto.
Bibliography
[1] DeLisi, C., "The human genome project", American Scientist, V.76, 1988,
pp.488-493.
[2] Sanger, F., Nicklen, S., Coulson, A.R., "DNA sequencing with chain terminating
inhibitors", Proc. Natl. Acad. Sci., Vol. 74, pp.5463-5467, 1977.
[3] Davies, S.W., Eizenman, M., Pasupathy, S., "Optimal structure for automatic
processing of DNA sequences", IEEE Trans. Biomedical Eng., submitted for
publication.
[4] Hunkapiller, T., Kaiser, R., Koop, B., Hood, L., "Large-scale and automated DNA
sequence determination", Science, V.254, 1991, pp.59-67.
[5] Church, G., Gryan, G., Lakey, N., Kieffer-Higgins, S., Mintz, L., Temple, M.,
Rubenfield, M., Ghazizadeh, H., Robison, K., Richterich, P., "Automated
multiplex sequencing", pp.11-15 in Automated DNA Sequencing and Analysis, Adams,
M., Fields, C., Venter, J., (editors), Academic Press, New York, 1994.
[6] Burks, C., "DNA sequence assembly", IEEE Engineering in Medicine and Biology,
Nov./Dec., 1994, pp.771-773.
[7] Myers, E., "Advances in sequence assembly", pp.231-238 in Automated DNA
Sequencing and Analysis, Adams, M., Fields, C., Venter, J., (editors), Academic
Press, New York, 1994.
[8] Forney, G.D., Jr., "Maximum-likelihood sequence estimation of digital sequences
in the presence of intersymbol interference", IEEE Trans. Information Theory,
V.IT-18, May, 1972, pp.363-378.
[9] Slater, G.W., "Electrophoresis Theories", Chap. 2, pp.24-66 in Analysis of
Nucleic Acids by Capillary Electrophoresis, Heller, C., (editor), Chromatographia
CE Series, Vol. 1, Vieweg, Wiesbaden, Germany, 1997.
[10] Caspers, G.J., Pennings, J., de Jong, W.W., "A partial cDNA sequence corrects
the human alpha A crystallin primary structure", Exp. Eye Res., V.59, 1994,
pp.125-126.
[11] Casey, D., "Primer on molecular genetics", 1991-92 DOE Human Genome
Program Report, U.S. Dept. of Energy, Oak Ridge, Tenn., USA, 1992.
[12] Saiki, R.K., Gelfand, D.H., Stoffel, S., Scharf, S.J., Higuchi, R., Horn, G.T.,
Mullis, K.B., Erlich, H.A., "Primer-directed enzymatic amplification of DNA with
a thermostable DNA polymerase", Science, V.239, 1988, pp.487-491.
[13] Lodish, H., Baltimore, D., Berk, A., Zipursky, S.L., Matsudaira, P., Darnell, J.,
Molecular Cell Biology, 3rd ed., Scientific American Books, W.H. Freeman, N.Y.,
1995.
[14] Tindall, K.R., Kunkel, T.A., "Fidelity of DNA synthesis by the Thermus aquaticus
DNA polymerase", Biochemistry, V.27, 1988, pp.6008-6013.
[15] Eckert, K.A., Kunkel, T.A., "High fidelity DNA synthesis by the Thermus
aquaticus DNA polymerase", Nucleic Acids Research, V.18, N.3, 1990, pp.3739-3744.
[16] Clark, J.M., "Novel non-templated addition reactions catalyzed by procaryotic
and eucaryotic DNA polymerases", Nucleic Acids Research, V.16, N.20, 1988,
pp.9677-9686.
[17] Clark, J.M., Joyce, C.M., Beardsley, G.P., "Novel blunt-end addition reactions
catalyzed by DNA polymerase I of Escherichia coli", J. Mol. Biol., V.198, 1987,
pp.123-127.
[18] Tabor, S., Richardson, C.C., "Effect of manganese ions on the incorporation of
dideoxynucleotides by bacteriophage T7 DNA polymerase and Escherichia coli
DNA polymerase I", Proc. Natl. Acad. Sci. USA, V.86, 1989, pp.4076-4080.
[19] Ke, S-H, Wartell, R.M., "Influence of neighboring base pairs on the stability of
single base bulges and base pairs in a DNA fragment", Biochemistry, V.34, 1995,
pp.4593-4600.
[20] Suzuki, T., Ohsumi, S., Makino, K., "Mechanistic studies on depurination and
apurinic site chain breakage in oligodeoxyribonucleotides", Nucleic Acids Research,
V.22, N.23, 1994, pp.4997-5003.
[21] Viovy, J.L., Duke, T., Caron, F., "The physics of DNA electrophoresis",
Contemporary Physics, V.33, N.1, 1992, pp.25-40.
[22] Fang, Y., Zhang, J.Z., Hou, J.Y., Lu, H., Dovichi, N.J., "Activation energy
of the separation of DNA sequencing fragments in denaturing noncross-linked
polyacrylamide by capillary electrophoresis", Electrophoresis, V.17, 1996,
pp.1436-1442.
[23] Kamahori, M., Kambara, H., "Characteristics of single-stranded DNA separation
by capillary gel electrophoresis", Electrophoresis, V.17, 1996, pp.1476-1484.
[24] Maurer, H.R., Disc electrophoresis and related techniques of polyacrylamide gel
electrophoresis, Walter de Gruyter, Berlin, 1971.
[25] Yarmola, E., Sokoloff, H., Chrambach, A., "The relative contribution of
dispersion and diffusion to band spreading (resolution) in gel electrophoresis",
Electrophoresis, V.17, 1996, pp.1416-1419.
[26] Smith, L.M., Kaiser, R.J., Sanders, J.Z., Hood, L.E., "The synthesis and use of
fluorescent oligonucleotides in DNA sequence analysis", Methods in Enzymology,
V.155, 1987, pp.260-301.
[27] Slater, G., informal communication.
[28] Strutz, K., Stellwagen, N.C., "Intrinsic curvature of plasmid DNA's analyzed by
polyacrylamide gel electrophoresis", Electrophoresis, V.17, 1996, pp.989-995.
[29] Wheeler, D.L., Chrambach, A., "A computer simulation accounting for dissimilar
electrophoretic behavior between two similarly curved DNA fragments due to a
difference in arc length", Electrophoresis, V.15, 1994, pp.885-889.
[30] Bendat, J.S., Engineering Applications of Correlation and Spectral Analysis, 2nd
ed., J. Wiley, New York, 1993.
[31] Elias, H.-G., An Introduction to Polymer Science, VCH, Weinheim, Germany,
1997.
[32] Tinland, B., Pluen, A., Sturm, J., Weill, G., "Persistence length of
single-stranded DNA", Macromolecules, V.30, N.19, 1997, pp.5763-5765.
[33] Brown, T.A., DNA Sequencing: The Basics, Oxford University Press, New York,
1994.
[34] Oppenheim, A.V., Schafer, R.W., Digital Signal Processing, Prentice-Hall,
Englewood Cliffs, N.J., 1975.
[35] Xu, Y., Mural, R.J., Uberbacher, E.C., "Correcting sequencing errors in DNA
coding regions using a dynamic programming approach", Computer Applications
in Biosciences, Vol. 11, No. 2, pp.117-124, 1995.
[36] Wu, Y., Mislan, D., "Automated DNA sequencing: An image processing
approach", Applied and Theoretical Electrophoresis, No. 3, pp.223-228, 1993.
[37] Berno, A.J., "A graph theoretic approach to the analysis of DNA sequencing
data", Genome Research, Vol. 6, No. 2, pp.80-91, 1996.
[38] Giddings, M., Brumley, R., Haker, M., Smith, L., "An adaptive, object-oriented
strategy for base calling in DNA sequencing analysis", Nucleic Acids Research,
Vol. 21, No. 19, pp.4530-4540, 1993.
[39] Ives, J., Gesteland, R., Stockham, T., "An automated film reader for DNA
sequencing based on homomorphic deconvolution", IEEE Trans. Biomedical Eng.,
Vol. 41, No. 6, pp.509-519, June 1994.
[40] Tibbetts, C., Bowling, J., "Method and apparatus for automatic nucleic acid
sequence determination", United States Patent No. 5365455, Nov. 15, 1994.
[41] Tibbetts, C., Bowling, J., Golden, J., "Neural networks for automated basecalling
of gel-based DNA sequencing ladders", pp.219-229 in Automated DNA Sequencing
and Analysis, Adams, M., Fields, C., Venter, J., (editors), Academic Press, New
York, 1994.
[42] Maxam, A.M., Gilbert, W., "A New Method for Sequencing DNA", Proc. Nat.
Acad. Sci., USA, V.74, p.560, 1977.
[43] Roberts, L., Science, V.236, N.806, 1987.
[44] Bowling, J., Bruner, K., Cmarik, J., Tibbetts, C., "Neighboring nucleotide
interactions during DNA sequencing gel electrophoresis", Nucleic Acids Research,
Vol. 19, No. 11, pp.3089-3097, 1991.
[45] Tibbetts, C., Golden, J.B., III, Torgersen, D., "Parsing of genomic graffiti", pp.
183-182 in Genetic Mapping and DNA Sequencing, IMA Vol. Math. App., V.81,
Speed, T., Waterman, M.S., (editors), Springer Verlag, New York, 1996.
[46] De Gennes, P.G., "Reptation of a polymer chain in the presence of fixed
obstacles", J. Chemical Physics, V.55, N.2, 1971, pp.572-579.
[47] Lumpkin, O.J., Dejardin, P., Zimm, B.H., "Theory of gel electrophoresis of DNA",
Biopolymers, V.24, 1985, pp.1573-1593.
[48] Muthukumar, M., Baumgartner, A., "Effects of entropic barriers on polymer
dynamics", Macromolecules, V.22, 1989, pp.1937-1941.
[49] Zimm, B.H., "A gel as an array of channels", Electrophoresis, V.17, 1996,
pp.996-1002.
[50] Slater, G.W., Guo, H.L., "An exactly solvable Ogston model of gel
electrophoresis: 1. The role of the symmetry and randomness of the gel structure",
Electrophoresis, V.17, 1996, pp.977-988.
[51] Slater, G.W., Rousseau, J., Noolandi, J., Turmel, C., Lalande, M., "Quantitative
analysis of the three regimes of DNA electrophoresis in agarose gels", Biopolymers,
V.27, 1988, pp.509-524.
[52] Carlsson, C., Larsson, A., Jonsson, M., Norden, B., "Dancing DNA in capillary
solution electrophoresis", J. American Chemical Society, V.117, 1995,
pp.3871-3872.
[53] Smith, S.B., Aldridge, P.K., Callis, J.B., "Observation of individual DNA
molecules undergoing gel electrophoresis", Science, V.243, 1989, pp.203-206.
[54] Schwartz, D.C., Koval, M., "Conformational dynamics of individual DNA
molecules during gel electrophoresis", Nature, V.338, 1989, pp.520-522.
[55] Lee, E., Messerschmitt, D., Digital Communication, (2nd Ed.), Kluwer, New
York, 1994.
[56] Proakis, J.G., Digital Communications, (3rd Ed.), McGraw-Hill Inc., New York,
1995.
[57] Van Trees, H., Detection, Estimation and Modulation Theory, John Wiley &
Sons, New York, 1968.
[58] Falconer, D., Salz, J., "Optimal reception of digital data over the Gaussian
channel with unknown delay and phase jitter", IEEE Trans. Inform. Theory, Vol. 23,
No. 1, January, 1977, pp.117-126.
[59] Georghiades, C., "Optimal delay and sequence estimation from incomplete data",
IEEE Trans. Inform. Theory, Vol. 36, No. 1, January, 1990, pp.202-208.
[60] Moeneclaey, M., "Synchronization problems in PAM systems", IEEE Trans.
Commun., Vol. 28, No. 8, pp.1130-1136.
[61] Anderson, J.B., Mohan, S., "Sequential coding algorithms: a survey and cost
analysis", IEEE Trans. Commun., Vol. 32, Feb., 1984, pp.169-176.
[62] Fano, R.M., "A heuristic discussion of probabilistic decoding", IEEE Trans. Inf.
Th., V.IT-9, Apr., 1963, pp.64-73.
[63] Massey, J.L., "Variable-length codes and the Fano metric", IEEE Trans. Inf. Th.,
V.IT-18, Jan., 1972, pp.196-198.
[64] Xiong, F., Zerik, A., Shwedyk, E., "Sequential sequence estimation for channels
with intersymbol interference of finite or infinite length", IEEE Trans. Commun.,
V.38, N.6, June, 1990, pp.795-804.
[65] Davies, S.W., Eizenman, M., Pasupathy, S., "Exploiting multi-channel
information in systems with high symbol clock variance", in Proceedings, Canadian
Workshop on Information Theory, Toronto, Canada, June, 1997, pp.91-94.
[66] Yu, X., Pasupathy, S., "Innovations-based MLSE for Rayleigh fading channels",
IEEE Trans. Commun., Vol. 43, pp.1534-1544, Feb./Mar./Apr., 1995.
[67] Kumar, P.R., Varaiya, P., Stochastic Systems: Estimation, Identification and
Adaptive Control, Prentice-Hall, Englewood Cliffs, New Jersey, 1986.
[68] Aulin, T., "Breadth first maximum likelihood sequence detection", submitted to
IEEE Trans. Inf. Th.
[69] Lodge, J.H., Moher, M.L., "Maximum likelihood sequence estimation of CPM
signals transmitted over Rayleigh flat-fading channels", IEEE Trans. Commun.,
Vol. 38, No. 6, June, 1990, pp.787-794.
[70] Ewing, B., Hillier, L., Wendl, M.C., Green, P., "Base-calling of automated
sequencer traces using Phred. 1. Accuracy assessment", Genome Research, V.8, N.3,
1998, pp.175-185.
[71] Elder, J.K., "Maximum entropy image reconstruction of DNA sequencing gel
autoradiographs", Electrophoresis, V.11, 1990, pp.440-444.
APPENDIX A
Large Scale Trend Removal
DNA time-series exhibit several large scale features that stretch out over tens
or hundreds of bases. These features include mis-alignment between time-series and
amplitude offsets. Preprocessing is employed to remove these features prior to the
application of the DNA-ML algorithm.
For multi-lane sequencers, channel time series for the different base types are from
electrophoresis down different lanes of the gel. Variation in gel properties between
lanes leads to mobility variations and a tendency for mis-alignment of peaks in the
time series. Usually the data from different lanes is initially in synchrony but tends
to drift out of alignment with increasing sequence position. Automatic sequencing
algorithms typically use a different mobility constant for each lane to compensate for
this drift.
Our preprocessing goes a step beyond this linear mobility correction by using a
quadratic to account for the mobility variation between lanes. To obtain this
compensation, the measured peak times for each channel are linearly interpolated so as to
obtain values at every sequence position, even if the sequence does not have a base of
that type at that position. The result of this operation for the C, G and T channels is
then divided by the result for the A channel (Figure A.1). These normalized data are
smoothed by least-squares fitting of a quadratic to them. Thus, the quadratic
represents the large scale variation in mobility between the two lanes over the entire data
set. Data used in Chapter 6 have been temporally interpolated and then resampled
based on the respective quadratics to remove the inter-lane variation.

[Plot: ratio of peak times versus base position (horizontal axis labelled BASE, 0 to 350).]
Figure A.1: Inter channel peak time variation - plot of ratio of "T" channel peak times to those of "A" channel for data used in Chapter 3.
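
The inter-lane compensation described above can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions, not the code used to prepare the data: the function names, the input layout (peak times paired with their sequence positions) and the choice of numpy.interp and numpy.polyfit are all hypothetical. Note that numpy.interp holds the end values constant outside the range of measured positions, so the fit is only meaningful between the first and last peak of each channel.

```python
import numpy as np

def lane_ratio_quadratic(pos_x, t_x, pos_a, t_a, n_bases):
    """Fit a quadratic to the ratio of channel-X to channel-A peak times.

    pos_x, t_x : increasing sequence positions and measured peak times of the
                 bases seen in channel X (hypothetical input layout).
    pos_a, t_a : the same for the reference A channel.
    n_bases    : number of sequence positions to cover.
    Returns the quadratic coefficients, highest power first.
    """
    positions = np.arange(n_bases)
    # Linear interpolation gives each channel a peak time at every position,
    # even where the sequence has no base of that type.
    tx_all = np.interp(positions, pos_x, t_x)
    ta_all = np.interp(positions, pos_a, t_a)
    ratio = tx_all / ta_all
    # Least-squares quadratic: the large scale mobility variation between lanes.
    return np.polyfit(positions, ratio, 2)

def compensate_peak_times(pos_x, t_x, coeffs):
    """Divide channel-X peak times by the fitted ratio to align them with A."""
    return np.asarray(t_x, dtype=float) / np.polyval(coeffs, pos_x)
```

Dividing the measured peak times by the fitted ratio is one plausible reading of "compensated by their respective quadratics"; resampling a full time series, as done for the Chapter 6 data, would apply the same quadratic to the time axis instead.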
For the modelling in Chapter 3, the original C, G and T measured peak times were
compensated by their respective quadratics to bring them into large scale alignment
with the A channel. These data were then merged into a single time series. While this
compensation had removed the inter-lane variation, a general large scale variation in
mobility, common to all lanes, remained. For the autocorrelation estimate presented
in Figure 3.12, this variation was removed by subtracting a 51-bin moving average of
the data prior to calculating the autocorrelation estimate.
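
The detrending step can be illustrated as follows. This is a sketch under assumptions rather than the procedure actually used: the text does not say how the ends of the series were handled or which autocorrelation estimator was applied, so the centred window, the end correction and the biased estimator below are illustrative choices, and the function names are hypothetical. The series is assumed to be at least as long as the window.

```python
import numpy as np

def remove_moving_average(x, window=51):
    """Subtract a centred moving average of `window` bins from x."""
    x = np.asarray(x, dtype=float)
    kernel = np.ones(window)
    # Dividing by the number of contributing samples keeps the average
    # unbiased near the ends, where the window is truncated.
    local_sum = np.convolve(x, kernel, mode="same")
    local_count = np.convolve(np.ones_like(x), kernel, mode="same")
    return x - local_sum / local_count

def autocorrelation_estimate(x, max_lag):
    """Biased sample autocorrelation, normalised so that lag 0 equals 1."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    acf = np.array([np.dot(x[: n - k], x[k:]) / n for k in range(max_lag + 1)])
    return acf / acf[0]
```

A typical use would be autocorrelation_estimate(remove_moving_average(series), max_lag), where series is the merged data and max_lag is chosen by the analyst.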
Amplitude trends are in evidence in Figure 3.1 which presents the entire time-
series for a single channel. Proceeding from left to right, a constant background level
is first seen. This could be due to background fluorescence and/or an offset in the
sequencer electronics. Next, a large peak is seen; this is known as the primer peak
and is due to an excess of the fluorescently labelled primer used to identify the DNA
fragment to be sequenced. The primer peak causes an exponentially decaying offset
in the data. Near the end of the data, an exponentially rising offset is seen. This is
the precursor of the peak at the end of the data due to fluorescently labelled full
length copies of the original DNA fragment. Over the central region, a downward
trend in peak amplitudes can be seen. This is due to the competitive process used to
encode sequence information; substrate is consumed to label earlier positions, leaving
less available to label later positions.
Data used in Chapter 6 have had the background, primer and end of data offsets
estimated and removed. The trend in peak amplitude has been estimated and the
data have been scaled by its inverse. The result features signal-absent regions with
values near zero and isolated signal peaks with values near one (consecutive peaks
can have values much greater than one due to constructive interference).
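
The amplitude normalisation can be sketched as below. The text does not state how the offsets or the amplitude trend were estimated, so everything specific here is an assumption made for illustration: the offset is taken as already estimated, the trend is modelled as a low-order polynomial fitted to the logarithm of the heights of isolated peaks, and the function and parameter names are hypothetical.

```python
import numpy as np

def normalise_amplitude(data, offset, peak_idx, degree=1):
    """Remove an estimated offset and flatten the trend in peak amplitude.

    data     : raw samples of one channel.
    offset   : estimated background, primer and end-of-data offset, same
               length as data (its estimation is outside this sketch).
    peak_idx : sample indices of isolated peaks (hypothetical input); the
               offset-corrected values there must be positive.
    degree   : degree of the polynomial fitted to log peak height (assumption).
    """
    data = np.asarray(data, dtype=float)
    corrected = data - np.asarray(offset, dtype=float)
    peak_idx = np.asarray(peak_idx)
    # Fitting in the log domain keeps the fitted amplitude trend positive.
    coeffs = np.polyfit(peak_idx, np.log(corrected[peak_idx]), degree)
    trend = np.exp(np.polyval(coeffs, np.arange(len(data))))
    # Scaling by the inverse of the trend puts isolated peaks near one.
    return corrected / trend
```

With this scaling, signal-absent samples stay near zero and isolated peaks sit near one, which matches the description of the Chapter 6 data; overlapping peaks are not treated specially and can exceed one.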