
Application of Communication Theory to Automatic DNA Sequencing

Stephen William Davies

A thesis submitted in conformity with the requirements for the degree of

Doctor of Philosophy, Graduate Department of Electrical and Computer Engineering,

University of Toronto

© Copyright Stephen William Davies 1999

National Library of Canada, Acquisitions and Bibliographic Services, 395 Wellington Street, Ottawa ON K1A 0N4, Canada

The author has granted a non-exclusive licence allowing the National Library of Canada to reproduce, loan, distribute or sell copies of this thesis in microform, paper or electronic formats.

The author retains ownership of the copyright in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.


Application of Communication Theory to Automatic DNA Sequencing

Doctor of Philosophy 1999

Stephen William Davies

Electrical and Computer Engineering, University of Toronto

Abstract

DeoxyriboNucleic Acid (DNA) sequencing is one of the pillars of the current biotechnology revolution. Current automatic DNA sequencing algorithms use heuristic approaches based on automating the manual analysis done by molecular biologists. In this thesis, a more formal and rigorous approach is followed wherein the first statistical model of the sequencing data is built and then the optimal processor is derived from the model.

The model characterizes peak shape and the local fluctuations in peak parameters (peak time, amplitude and width). The characterization of peak parameters includes their point probability density functions and their average dependence on their neighbouring peaks (covariance). Jitter in peak time is found to be correlated over several neighbouring peaks. A practical noise model is proposed consisting of a white noise component and a noise component with spectrum similar to that of the signal itself. The model can be used to generate simulations for the comparison and evaluation of DNA sequencing algorithms.

Based on the model, an optimal DNA sequencing algorithm was derived using

the maximum likelihood approach of the analogous field of digital communications.

The uncertainty associated with parameters of the optimum processor is addressed by

maintaining multiple hypotheses for both the different possible information sequences

and the different possible parameter sequences.

The performance of the algorithm is examined with real data from both the accurate 6% cross-linked gels and the much faster 4% gels. Results with the 6% gel data are comparable with those of a commercial algorithm, though simulations have suggested the potential for a two to three-fold reduction in error rate. Results with the 4% gel data exhibited an error rate that was four times lower than that of a commercial sequencing algorithm. The DNA model and the DNA-ML algorithm do offer benefits beyond a reduction in error rate. They may guide the refinement of the entire sequencing process. Assigning probabilities to alternative sequences may aid the clinician in forming his diagnosis. The overall benefits to healthcare include the reduction of total test costs and reduction of the damage caused by acting on erroneous information.

Acknowledgments

I would like to thank my supervisors. Dr. M. Eizenman provided excellent advice and guidance in this work, and was tireless in ensuring its completion. Dr. S. Pasupathy's gentle nudges opened the door to the communications literature. Both provided the questions and direction that led to the timely completion of this thesis.

Werner Muller's generous aid made this work possible. I have greatly enjoyed the time spent with him in the molecular biology laboratory of the Eye Research Institute of Canada (ERIC). The support of Dr. K. Tsilfidis is greatly appreciated. I was fortunate to have the opportunity to discuss the physics and chemistry of DNA sequencing with two experts, Dr. G. Slater of the University of Ottawa and Dr. R. Macgregor of the University of Toronto.

At the Institute of Biomedical Engineering, I have enjoyed the support of many but can mention just a few. Prof. A. Dolan provided leadership when I needed it most. Thas Yuwaraj's support of this work was unflagging and invaluable. Melina Cartlidge was just so helpful.

I would like to acknowledge the financial support provided by the Natural Sciences

and Engineering Research Council of Canada and the Sumner Foundation.

Finally, I would like to thank my mom and dad for the support and wisdom that has brought me this far and will hopefully carry me further.

Contents

Acknowledgments
List of Tables
List of Figures
List of Abbreviations and Symbols

1 Introduction
1.1 Motivation
1.2 A Perspective on DNA Sequencing
1.2.1 DNA's Function and Structure
1.2.2 Manual DNA Sequencing
1.2.3 Automatic DNA Sequencing
1.2.4 Errors
1.2.5 Other Sequencing Methods
1.2.6 The Human Genome Project
1.2.7 Clinical Role
1.3 Data Communications
1.3.1 Model
1.3.2 Receiver Technology
1.4 Analogy between DNA Sequencing and Data Communications
1.5 Research Approach
1.6 Dissertation Organization

2 Details of the Chemistry and Physics of DNA Sequencing
2.1 Sequencing Chemistry
2.1.1 Chemical Structures
2.1.2 DNA Amplification
2.1.3 Sequencing Reaction Molecules: Terminators and Polymerases
2.1.4 Fidelity and Peak Amplitude Variation
2.1.5 Degradation
2.2 Sequencing Physics
2.2.1 Sequencing Gel
2.2.2 Theories of Electrophoresis
2.2.3 Kuhn Length
2.2.4 Resolution
2.2.5 Other Concerns in Electrophoresis
2.2.6 Detection of Fluorescent Labels
2.3 Summary

3 A Statistical Model of the DNA Time-Series
3.1 Gross and Local Structure of DNA Time-Series
3.2 Signal Peak Shape
3.3 Local Covariance Model of Peak Parameters
3.3.1 Methods
3.3.2 Results
3.3.3 Discussion
3.4 Noise Process Model
3.5 Simulated Data from Model
3.6 Significance and Novelty

4 Maximum Likelihood Sequence Detection
4.1 The Maximum Likelihood Concept
4.2 Additive White Noise Finite Response
4.3 Noise Whitening
4.4 Nuisance Parameters
4.5 Cost Function Derivation
4.5.1 Conditional Likelihood
4.5.2 Nuisance Likelihood
4.5.3 Cost Function
4.6 Significance

5 Implementation
5.1 Hypothesis Reduction
5.1.1 Peak Estimation
5.1.2 Future Peak ISI Cancellation
5.1.3 Sequential Decoding
5.2 Unique Algorithm Considerations
5.2.1 Unequal Length Comparisons
5.2.2 Selection of Symbol Region, Ki
5.3 Modelling Limitations and Robustness
5.3.1 Pulse Width
5.3.2 Noise Whitening
5.4 Comparison with Typical Automatic Sequencer Techniques
5.4.1 ISI Suppression
5.4.2 Peak Detection
5.4.3 Search Window
5.4.4 Multi-Peak Tests
5.4.5 Special Rules
5.4.6 Promise of Approach

6 Performance with Real Data
6.1 Data Set 1 - Typical Case
6.1.1 Source and Model
6.1.2 Sensitivity to Parameters
6.1.3 Error Comparison
6.1.4 Error Analysis
6.2 Data Set 2 - High Speed Gel
6.2.1 Source / Rationale
6.2.2 Model and Adjustments
6.2.3 Error Comparison
6.2.4 Error Analysis
6.2.5 Assessment and Significance

7 Conclusions
7.1 Thesis Summary
7.2 Thesis Contributions
7.3 Suggestions for Future Research

Bibliography

A Large Scale Trend Removal

List of Tables

5.1 Performance as a function of pulse width mismatch for 300 bases of simulated data.
Performance (insertions/deletions/substitution errors) as a function of algorithm parameter settings for 300 bases of real data.
Errors observed for DNA-ML algorithm (β=0.85, fractional bandwidth=0.25) and Pharmacia ALF internal algorithm for 300 bases of real data.
Data Set 2 error type and location for DNA-ML with baseline parameter settings.
Data Set 2 error type and location for DNA-ML with modified parameter settings.

List of Figures

1.1 DNA sequencing.
1.2 Sample DNA time series.
1.3 Typical automatic sequencing algorithm block diagram.
1.4 Error rate as a function of distance along the DNA molecule.
1.5 Data communications system block diagram.
1.6 Communication signals.
2.1 Structure of deoxyadenosine 5'-monophosphate (dAMP) (after [13]).
2.2 A single-stranded DNA (ssDNA) molecule (after [13]); full structure shown for phosphate and ribose groups but bases are represented by one of A, C, G, or T.
2.3 Double stranded DNA (dsDNA) detailed structure [11]. Phosphate-deoxyribose backbones are on extreme left and right, corresponding to respective strands. Bases run from top to bottom along the center of the diagram. Hydrogen bonding is seen along center as dashed line emanating from a hydrogen (H) that also has a solid line indicating a covalent bond to the other strand.
2.4 Bulging of copy with insertion of a T.
2.5 Hairpin loop due to complementary GC runs.

2.6 Cleavage pathway for depurination (Guanine base) [20]. Additional symbols are: R for deoxyribose, G for guanine, T for thymine and P for phosphate.
3.1 Sample entire time series for "T" channel. Mean inter-base separation is 14.7 samples.
3.2 Selected compensated time series for same sequencing session as Figure 3.1. Individual channel data has been offset in this figure for clarity. Top curve is for A channel with C, G and T channels presented in order from top.
3.3 High resolution view of a segment of the compensated time series (actually Figure 1.2 repeated for reader's convenience).
3.4 Micro-satellite repeat data trace. Major peaks in time order are: primer peak, proximal DNA standard peak, sample peak, distal DNA standard peak.
3.5 Proximal DNA standard peak (solid line) and distal DNA standard peak (dash-dot). Warped peak (dotted line) was created by scaling the time coordinates by 0.7286.
3.6 Approximation of proximal peak of Figure 3.4 (dotted line) by leading exponential (samples 1-35, dashed line), Gaussian (samples 36-70, solid line), and decaying exponential (samples 71-200, dashed line). Inset is the logarithm of the same data.
3.7 Three isolated peaks from DNA sequencing data.
3.8 Peak time jitter for "G" labelled product applied to six contiguous lanes of the gel (total of 79 "G" peaks present over the range of 350 bases in original sequence). Six overlapping curves are plotted corresponding to the six gel lanes.
3.9 Peak time jitter.
3.10 Local peak amplitude estimates.
3.11 Pulse width estimates.

Covariance of peak time jitter. Monotonically increasing region just to the left of and including lag zero and monotonically decreasing region to its right is referred to as the mainlobe. Inset is a logarithmic plot of the right side of the mainlobe.
Covariance of difference between successive peak time jitter values.
Peak amplitude covariance.
Pulse width covariance.
Block diagram of peak parameter system model.
Histogram of scaled peak time jitter. To insure comparability of samples, data was divided (scaled) by jitter standard deviation linear trend prior to forming histogram.
Histogram of scaled difference between adjacent peak time jitter values. To insure comparability of samples, data was divided (scaled) by jitter standard deviation linear trend prior to forming histogram.
Theoretical covariance of peak time jitter for system of Figure 3.16. Inset is a logarithmic plot of the right side of the mainlobe.
Theoretical covariance of difference between successive peak time jitter values for system of Figure 3.16.
Noise spectrum estimate for "A" channel bases 297-319.
Simulated compensated time series for comparison with real data of Figure 3.2. Individual channel data has been offset in this figure for clarity. Top curve is for A channel with C, G and T channels presented in order from top.
High resolution view of a segment of the simulated compensated time series (compare with Figure 3.3).
Maximum likelihood processor block diagram.
Peak estimator.
Spectral estimate for a short section of "noise whitened" data lacking signal peaks.

Raw time series for Data Set 2. Individual channel data has been offset in this figure for clarity.
Compensated time series for Data Set 2 corresponding to first 4000 samples from Figure 6.1. Individual channel data has been offset in this figure for clarity. Top curve is for A channel with C, G and T channels presented in order from top.
Covariance of peak time jitter for Data Set 2. Inset is a logarithmic plot of the right side of the mainlobe.
Covariance of difference between successive peak time jitter values for Data Set 2.
Selected section of data after application of whitening filter with coloured noise variance 0.0016 and white noise variance 0.000009, all in units of peak mean squared.
Selected section of data after application of whitening filter with coloured noise variance 0.0016 and white noise variance 0.000001, all in units of peak mean squared.
Compensated time series for Data Set 2 corresponding to bases 110-140. Individual channel data has been offset in this figure for clarity. Top curve is for A channel with C, G and T channels presented in order from top. DNA-ML algorithm estimates of peak amplitudes and times are indicated by "*". X-axis is time in samples. True and estimated sequences are indicated at top and bottom, respectively, coded as A=1, C=2, G=3 and T=4.
Compensated time series for Data Set 2 corresponding to first 30 bases. Individual channel data has been offset in this figure for clarity. Top curve is for A channel with C, G and T channels presented in order from top. DNA-ML algorithm estimates of peak amplitudes and times are indicated by "*". X-axis is time in samples. True and estimated sequences are indicated at top and bottom, respectively, coded as A=1, C=2, G=3 and T=4.

6.9 Waveforms associated with 4 "G" run from base 49 to 52 selected to illustrate estimation of third "G". Raw waveform is compensated but not whitened. Dashed curve is formed from whitened data by subtracting estimated contribution from previous two bases and predicted contribution from next base, and then applying matched filter.
6.10 Compensated time series for Data Set 2 corresponding to bases 155 to 185. Individual channel data has been offset in this figure for clarity. Top curve is for A channel with C, G and T channels presented in order from top. DNA-ML algorithm estimates of peak amplitudes and times are indicated by "*". X-axis is time in samples. True and estimated sequences are indicated at top and bottom, respectively, coded as A=1, C=2, G=3 and T=4.
A.1 Inter channel peak time variation - plot of ratio of "T" channel peak times to those of "A" channel for data used in Chapter 3.

List of Abbreviations & Symbols

ABBREVIATIONS

A        adenine
AWGN     additive white Gaussian noise
bis      N,N'-methylene-bis-acrylamide
C        cytosine
dAMP     deoxyadenosine 5'-monophosphate
dATP     deoxyadenosine 5'-triphosphate
dNMP     deoxynucleotide monophosphate
dNTP     deoxynucleotide triphosphate
ddNTP    dideoxynucleotide triphosphate
DFE      decision feedback equalizer
DNA      deoxyribonucleic acid
DNA-ML   DNA maximum likelihood (algorithm)
dsDNA    double stranded DNA
FIR      finite impulse response
G        guanine
ISI      inter-symbol interference
ML       maximum likelihood
MLSD     maximum likelihood sequence detection
MSR      micro-satellite repeat
PCR      polymerase chain reaction
pdf      probability density function
RNA      ribonucleic acid
ssDNA    single stranded DNA
SNR      signal to noise ratio
T        thymine
TBE      8.9mM tris-borate and 0.2mM ethylenediaminetetra-acetic acid

SELECTED SYMBOLS

jitter auto-regressive weighting
observation vector
whitened observation vector
information symbol sequence
amplitude
noise
sample index
peak time for peak i
pulse width
generic pulse shape
pulse shape peaking at t_i evaluated at k
Kronecker delta function
summation
indicates estimate
mean inter-symbol separation
jitter
jitter state variable
jitter input disturbance
jitter measurement noise
variance
expectation
difference between adjacent jitter values
non-stationary noise spectrum
Fourier transformation
conditional pdf
nuisance parameter
noise whitening filter
cost (negative log likelihood)
convolution
i-th subset of samples
best estimate of # at i given data up to and including j
covariance matrix
Kalman gain
cost to point z
comparison (hypothesis test)

CHAPTER 1

Introduction

DeoxyriboNucleic Acid (DNA) carries the genetic information that codes for life. The extraction of this information, a process known as DNA sequencing, is one of the pillars of the current biotechnology revolution. This chapter provides background information on DNA and introduces the field of data communications as a potential aid in DNA sequencing. It then presents the research approach used to investigate the applicability of communication theory to automatic DNA sequencing and concludes with an overview of the thesis.

1.1 Motivation

This work is motivated by the potential for a timely impact on a significant industry with broad healthcare implications. Timeliness is implied as an explosive growth in the use of DNA sequencing as a clinical tool is imminent, built on the scientific foundation provided by the human genome project [1]. DNA sequencers will become as ubiquitous as X-ray machines and the industry will grow dramatically as the full clinical potential is realized. If applying concepts from communication theory can increase the reliability of DNA testing then the benefits include reduction of total test costs and reduction of the damage caused by acting on erroneous information.


1.2 A Perspective on DNA Sequencing

1.2.1 DNA's Function and Structure

In what is often referred to as the central maxim of biology, DNA is transcribed to RiboNucleic Acid (RNA) which is then translated to protein [13]. DNA is normally a very large molecule that serves as the permanent store of the genetic information. Only a small portion of this information is required to define a particular protein so only this small portion is copied to RNA. The protein itself may perform some particular cell function such as that of an enzyme for a metabolic reaction.

Each DNA molecule is a sequence of bases where genetic information is encoded according to the type of base (Adenine (A), Guanine (G), Cytosine (C), or Thymine (T)) at each point in the sequence. Each consecutive group of three bases (triplet) either codes for an amino acid to be incorporated in the protein or else it codes for simple control information.
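The grouping into triplets can be sketched in a few lines of Python. This is only an illustration: the function name `triplets` and the assumption that reading starts at the first base of the string are not from the thesis, and any incomplete trailing group is simply ignored.

```python
def triplets(sequence: str) -> list[str]:
    """Split a base sequence into consecutive, non-overlapping triplets.

    Assumption (not from the thesis): reading starts at the first base,
    and an incomplete group at the end of the sequence is dropped.
    """
    usable = len(sequence) - len(sequence) % 3  # length of the whole triplets
    return [sequence[i:i + 3] for i in range(0, usable, 3)]


print(triplets("ATGGCCTAA"))  # ['ATG', 'GCC', 'TAA']
```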

DNA's large scale structure makes it well suited for the storage of information [13]. It is normally in the form of a double helix composed of two intertwined strands of DNA with one strand bearing the genetic information (sense strand) and the other its complement. The complement is defined by the following rules: (1) where the sense strand has an A, the complement must have a T; (2) where the sense strand has a T, the complement must have an A; (3) where the sense strand has a C, the complement must have a G; and (4) where the sense strand has a G, the complement must have a C. The double helix is maintained by hydrogen bonds between the sense strand and the complement; there are two hydrogen bonds for each A-T pair and three for each G-C pair. This stable structure is resistant to damage by outside forces. However, if a nick should occur in one strand, cellular machinery will repair it using the information available in the other strand.
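The four complement rules and the hydrogen bond counts translate directly into a lookup table. The following Python sketch is illustrative only; the names `complement` and `hydrogen_bonds` are hypothetical, and the complement is returned base by base in the same order as the sense strand, without modelling strand direction.

```python
# Pairing rules as stated in the text: A<->T and C<->G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
# Hydrogen bonds per base pair: two for A-T, three for G-C.
BONDS_PER_BASE = {"A": 2, "T": 2, "C": 3, "G": 3}


def complement(sense: str) -> str:
    """Return the complementary base for each position of the sense strand."""
    return "".join(COMPLEMENT[base] for base in sense)


def hydrogen_bonds(sense: str) -> int:
    """Count the hydrogen bonds holding the sense strand to its complement."""
    return sum(BONDS_PER_BASE[base] for base in sense)


print(complement("ACGTAT"))      # TGCATA
print(hydrogen_bonds("ACGTAT"))  # 14
```

Applying `complement` twice returns the original strand, which mirrors the repair property described above: either strand carries enough information to rebuild the other.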

To transcribe from DNA to RNA or to make a copy of the DNA, a portion of the double-stranded DNA (dsDNA) must be separated into single-stranded DNA (ssDNA). Then the enzymes and other molecules involved in making the copy can access the information. If an entire dsDNA molecule is separated into two complementary ssDNA molecules then it is said to be denatured.


1.2.2 Manual DNA Sequencing

In Sanger [2] DNA sequencing, molecular biologists employ a polymerase enzyme to prepare a set of partial copies of the original ssDNA molecule, all starting from the same location as determined by a primer molecule (Figure 1.1(a)). In addition to the A, C, G, and T substrate needed to make the copies, an additional terminator molecule is included which competes with one of the substrate bases for inclusion in the copy. Once a terminator is incorporated, copying is stopped for that particular molecular copy and thus its length is fixed. In the example of Figure 1.1(a), the terminator corresponding to adenine (A) is used. It competes with A at two points in the figure. Based on chance, some copies will incorporate the A at these points and continue growing. Others will incorporate the terminator and refrain from further extension. The final result thus contains molecules of lengths corresponding to the positions of adenine in the original DNA sample.
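The outcome of one such reaction can be illustrated with a short Python sketch. It is a deliberate simplification under stated assumptions: the chance element is ignored, so every position where the terminator can compete is assumed to yield some terminated copies, and the sequence is treated as the string being read, as in the example ACGTAT of Figure 1.1. The function name `fragment_lengths` is hypothetical.

```python
def fragment_lengths(sequence: str, terminator_base: str) -> list[int]:
    """Lengths of the terminated copies produced with one terminator type.

    Simplification: each occurrence of the base is assumed to terminate
    at least some copies, so every occurrence contributes one length.
    """
    return [position + 1
            for position, base in enumerate(sequence)
            if base == terminator_base]


# Adenine terminator applied to the example sequence of Figure 1.1.
print(fragment_lengths("ACGTAT", "A"))  # [1, 5]
```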

As indicated in Figure 1.1(b), four sets of reactions are carried out, each corresponding to a different terminator/base type. The products are labelled with either fluorescent or radioactive markers. Thus, for each base in the original sequence, the sequence position has been encoded as molecular size and base type by which marker is present.

Electrophoresis is then used to separate these charged DNA molecules. The samples are placed at the top end of a gel and a voltage is applied from that end to the other. Small DNA molecules will move quickly down the gel but larger ones encounter more resistance and thus move more slowly. Thus, the molecules become separated by size. At some point in time, the voltage may be removed and the gel image may be recorded as in Figure 1.1(c). This two dimensional plot may be read from bottom to top with horizontal position indicating base type. Here, the bottom-most band is in the lane corresponding to the A marked sample holder (hereafter referred to as the A lane), indicating that the first base is an A. The second lowest band is in the lane corresponding to the C marked sample holder (hereafter referred to as the C lane), indicating that the second base is a C, and so on. Details on manual DNA sequencing may be found in [33].
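Reading the gel from bottom to top amounts to sorting all bands by fragment length and noting the lane of each. The Python sketch below illustrates this for an idealized, noise-free gel; the names `run_reactions` and `read_gel` are hypothetical, and each lane is represented simply as a list of fragment lengths.

```python
def run_reactions(sequence: str) -> dict[str, list[int]]:
    """Idealized result of the four reactions: fragment lengths per lane."""
    return {lane: [i + 1 for i, base in enumerate(sequence) if base == lane]
            for lane in "ACGT"}


def read_gel(lanes: dict[str, list[int]]) -> str:
    """Read an idealized gel from the shortest fragment to the longest.

    The shortest fragments travel furthest, so ascending length
    corresponds to reading the gel image from bottom to top.
    """
    bands = sorted((length, lane)
                   for lane, lengths in lanes.items()
                   for length in lengths)
    return "".join(lane for _, lane in bands)


lanes = run_reactions("ACGTAT")
print(lanes)            # {'A': [1, 5], 'C': [2], 'G': [3], 'T': [4, 6]}
print(read_gel(lanes))  # ACGTAT
```

Real gels do not give clean integer lengths; overlapping and noisy bands are exactly what makes the automatic task difficult.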


Figure 1.1: DNA sequencing. (a) Formation of a labelled subset of different length fragments, here corresponding to the adenine (A) base locations. (b) Result of repeating reactions of (a) with terminators for each of the four base types. (c) Gel electrophoresis to separate by size of molecule. Only labelled fragments will be seen. Result reads from bottom to top ACGTAT.

1.2.3 Automatic DNA Sequencing

Automatic sequencing algorithms have been developed to recover the DNA sequence either from images as displayed in Figure 1.1(c) or from marker detectors mounted at a fixed location on the gel column. In the latter case, the input to the algorithm is a time-series (Figure 1.2) where the order of detection is in order of increasing molecular mass as the speedy small molecules reach the detector first. In both cases, the algorithm faces a difficult task due to noise and poor resolution of overlapping bands/peaks. Automatic sequencing algorithms are available from both academic [38] [39] [41] and commercial sources (Applied Biosystems, Du Pont, Molecular Dynamics, Pharmacia, Scanalytics [36]). Such algorithms typically feature band sharpening filters, simple normalization, thresholding and data clock recovery. Neural networks have also been used [41]. Figure 1.3, a generic abstraction based largely on [38], is illustrative of typical sequencing algorithms.

Figure 1.2: Sample DNA time series.

1.2.4 Errors

Three classes of errors occur in DNA sequencing: substitutions, insertions and deletions. The first, substitution, corresponds to simply mistaking one base for another, as when noise has caused a strong peak in the wrong lane. Alternatively, such a noise peak may lead to a base being called in the noise lane and the true base peak being called as well, particularly if it occurred slightly later or earlier than the false peak. Thus one more base would be called than was actually in the data; this is referred to as an insertion. A deletion may occur when two bands overlap to such an extent that there is only one peak and thus only one of the two bases is called. There are several other scenarios that can lead to insertions and/or deletions. In this thesis, the generic term "error" refers to all error classes.
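One common way to tally the three error classes, given a known true sequence and a called sequence, is a minimum edit distance alignment. The Python sketch below is an illustration of that general technique and is not presented as the scoring procedure used in this thesis; the function name `count_errors` is hypothetical. When several alignments are equally short, the split between classes depends on the tie-breaking order in the trace-back.

```python
def count_errors(true_seq: str, called_seq: str) -> dict[str, int]:
    """Count substitutions, insertions and deletions in a called sequence."""
    n, m = len(true_seq), len(called_seq)
    # dist[i][j]: minimum number of edits turning true_seq[:i] into called_seq[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = int(true_seq[i - 1] != called_seq[j - 1])
            dist[i][j] = min(dist[i - 1][j - 1] + mismatch,  # match or substitution
                             dist[i - 1][j] + 1,             # true base not called
                             dist[i][j - 1] + 1)             # extra base called

    # Trace one optimal alignment back from the end, tallying each edit type.
    counts = {"substitutions": 0, "insertions": 0, "deletions": 0}
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            mismatch = int(true_seq[i - 1] != called_seq[j - 1])
            if dist[i][j] == dist[i - 1][j - 1] + mismatch:
                counts["substitutions"] += mismatch
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            counts["deletions"] += 1
            i -= 1
        else:
            counts["insertions"] += 1
            j -= 1
    return counts


print(count_errors("ACGTAT", "ACTTAT"))   # one substitution
print(count_errors("ACGTAT", "ACGGTAT"))  # one insertion
print(count_errors("ACGTAT", "ACGAT"))    # one deletion
```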

Figure 1.3: Typical automatic sequencing algorithm block diagram.

Figure 1.4 is a schematic illustration of the dependence of error rate on base location for both manual and automatic DNA sequencing. There is a constant error rate zone that stretches over the first 300-500 bases. Error rate in this zone is of the order of 1-5% [4] with automatic sequencers generally performing more poorly than human readers. Note that these are typical results and certain DNA sequences can lead to considerably poorer results. Beyond the constant error rate zone lies the rising error rate zone where, for example, error rate can increase by as much as 7% in 100 bases [5]. This rising error rate may be attributed to several factors among which is the difficulty in resolving between large molecules where a single base difference in length equates to a small fractional change in overall size. To combat this problem, practitioners have adopted the strategy of breaking large DNA molecules into smaller ones, sequencing the smaller molecules and then combining the results [6]. Note that useful information is available even in the rising error rate zone as data from this region is used to aid the combining of sequence fragments [7].

1.2.5 Other Sequencing Methods

Three other sequencing methods deserve attention at this point: (1) single lane

/ multiple fluorescent markers, (2) Maxam-Gilbert sequencing, and (3) biochip se-

quencing by hybridization.


Figure 1.4: Error rate as a function of distance along the DNA molecule.

Single Lane / Multiple Fluorescent Markers

This technology, protected under several patents by Applied Biosystems Inc., forms the basis for the most popular automatic sequencing machines. As in the techniques described earlier in this document, four different sequencing reactions are carried out. However, the four reactions use different fluorescent markers so that their products feature a spectral peak at different points in the spectrum. The products may then be mixed into a single solution and loaded into the same lane of a gel (if the label is part of the terminator then all the reactions could have taken place in the same test tube). Near the bottom of the gel, a laser is used to excite the fluorescent bands as they pass the detector region. For each lane, there are four detectors, one for each of the four different fluorescent peaks. As before, the order of the bands indicates position in the sequence. Base type, however, is indicated by peak colour. A key advantage is high throughput, as four times the number of sequences can be done on the same gel (i.e., one sequence per lane vs. one sequence every four lanes). Also, alignment problems are minimized as all DNA from the same sequence goes down the same lane. This eliminates the effect of lane to lane gel inhomogeneities. Possible problems include interference between base types as fluorescent spectra overlap.

This thesis will not include an examination of this type of data. However, certain aspects of the work in this thesis will apply directly as they are derived for the same physical processes. For example, in both the multi-lane and single lane data, the dynamics of DNA molecules in a gel will be common. Differences will occur as the multi-lane data uses the same marker for all bases while the single lane data will use markers of different size and mass for each base type. As will be developed later in the thesis, both will feature noise associated with the hydrolysis of DNA. However, the multi-lane data will not suffer from the inter-base interference due to overlapping fluorescent spectra. Thus, a judicious reader may draw conclusions from this work which will be relevant to single lane data. At the same time, however, this reader will no doubt identify areas where additional work must be done in order to properly treat the single lane application.

Maxam-Gilbert Sequencing

Developed at the same time as Sanger sequencing, the Maxam-Gilbert method is based on degradation of DNA rather than synthesis [42]. The process begins with the labelling of one end of the ssDNA molecules [33]. This is then loaded into four test tubes. One tube then is used in a reaction that breaks the DNA wherever there is a G in the sequence. Another tube is used in a reaction that breaks the DNA wherever there is a C. Two other similar reactions are run, one that breaks the DNA wherever there is a G or an A, and one that breaks the DNA wherever there is a C or a T. Thus, A locations must be decoded by comparing the G data and the A+G data. T locations must be similarly decoded. Also, the timing of the reactions is important so that the product is dominated by molecules produced by only single breaks per original DNA molecule; otherwise, the first few bases will dominate the results. Maxam-Gilbert sequencing is useful for sequences of less than 250 bases [33] and can perform better than Sanger sequencing for sequences with long runs of identical bases. However, the vast majority of sequencing today is performed using the Sanger method. Results presented in this thesis should apply for the Maxam-Gilbert method other than for considerations related to decoding the A+G and C+T lanes.
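The lane-comparison step described above can be sketched as follows. This is a minimal, idealized illustration: it assumes noise-free band positions given as sets of integer indices, and the function name `decode_maxam_gilbert` and its arguments are hypothetical rather than anything defined in this thesis.

```python
def decode_maxam_gilbert(g_lane, ag_lane, c_lane, ct_lane, length):
    """Decode a sequence from the band positions seen in the four lanes.

    Each lane is a set of positions at which a band appears. A band in the
    A+G lane without a matching band in the G lane is read as A; a band in
    the C+T lane without a matching band in the C lane is read as T.
    """
    bases = []
    for pos in range(length):
        if pos in g_lane:
            bases.append("G")
        elif pos in ag_lane:
            bases.append("A")
        elif pos in c_lane:
            bases.append("C")
        elif pos in ct_lane:
            bases.append("T")
        else:
            bases.append("N")  # no band in any lane at this position
    return "".join(bases)


# Bands that the sequence GATC would ideally produce.
print(decode_maxam_gilbert({0}, {0, 1}, {3}, {2, 3}, 4))
```

With real data the band positions are noisy and overlapping, so this set logic only illustrates the decoding rule, not a practical base caller.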


Sequencing by Hybridization

Sequencing By Hybridization (SBH) uses an array of complementary DNA sequences where all possible N base sequences are represented in the array. The labelled DNA to be sequenced will hybridize to its complement and the array will then fluoresce at the corresponding location. (Hybridization is the process where two complementary DNA sequences form a double helix through hydrogen bonding.) Other means of detecting the hybridization location are possible. SBH is fast and convenient. However, it is limited to extremely short sequences, as even sequencing an eight base sequence implies an array with 4^8 = 65536 elements. Sequences of a hundred bases would require a currently infeasible array size. Complementary DNA hybridization arrays are more promising for detecting mutations or the presence of a few previously known sequences rather than for SBH. SBH will not be further addressed in this thesis.
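The array-size argument is simple arithmetic: with four possible bases at each of N positions, the array needs 4^N elements. A small sketch (the function name `sbh_array_size` is hypothetical) makes the growth concrete:

```python
def sbh_array_size(probe_length):
    """Number of distinct probes of the given length over the alphabet A, C, G, T."""
    return 4 ** probe_length


for n in (8, 12, 100):
    print(n, sbh_array_size(n))
```

For N = 8 this reproduces the 65536 elements quoted above; for N = 100 the count has more than sixty decimal digits.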

1.2.6 The Human Genome Project

The Human Genome Project is well underway in its endeavour to sequence the 3 billion base pair human genome [43]. Completion of this project is expected to occur by the year 2005. The product of this project is a consensus sequence of the human genome reflecting the typical sequence seen in the population of humans. This project has served as a major impetus in the development of new sequencing strategies. It will aid the identification of new genes and control areas in the genome. As well, it will aid in the identification of abnormalities.

1.2.7 Clinical Role

As is no doubt obvious to the reader, gene databases such as the one produced by the Human Genome Project and sequenced patient DNA will allow clinicians to identify genetic abnormalities. Soon these clinicians will be able to treat such abnormalities through gene therapy. However, there are other very important clinical roles for DNA sequencing with immediate benefits. DNA sequencing can quickly and effectively compare the Human Leukocyte Antigens (HLAs) of the patient and potential donor organ and thus determine if the tissues are compatible. DNA sequencing can quickly identify viral strains and determine what drugs the patient's infection would resist. Thus, the financial and health penalties associated with prescribing an inappropriate course of treatment can be avoided.

1.3 Data Communications

Data communications is concerned with the transmission and reception of a sequence of information with as little error as possible. A field of vigorous investigation since the work of Nyquist in 1924, it offers a well-developed body of knowledge and experience. The basic philosophy is first to develop a model of the channel over which communication must occur and then to derive from the model an appropriate communications technique.

Figure 1.5 depicts the basic elements of a data communications system. At the left of Figure 1.5, the transmitter takes the input data stream and performs a sequence of operations in order to represent the information as an analog signal at the input of the channel. These operations may include source encoding to reduce the number of symbols necessary to represent the input, channel encoding to allow error correction at the receiver, and modulation to map the coded digital sequence to signal waveform(s). Figure 1.6(a) shows a simple waveform representing the information sequence 101.

Moving to the center of Figure 1.5, passage through the channel has two main effects [56]. First, the waveform is distorted by the channel impulse response, c(τ, t), which is the response at time t to an impulse at time τ. This extends the duration of the received symbol signal and causes interference between adjacent symbols, a phenomenon known as Inter-Symbol Interference (ISI). ISI can lead to errors as in the case where, due to the symbols on either side of the symbol of interest being ones, a sufficiently high level will be measured at the time of interest such that the symbol will be declared a one when it really should have been a zero. Figure 1.6(b) depicts the waveform of 1.6(a) after passage through such a channel impulse response.



Figure 1.5: Data communications system block diagram.

Figure 1.6(c) displays the data in 1.6(b) with the second major channel effect included, that of the noise source shown in Figure 1.5. Clearly, symbol detection is complicated by the random nature of the received waveform.
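The two channel effects just described can be reproduced numerically. The sketch below is illustrative only and rests on assumptions that are not in the text: rectangular transmit pulses, a time-invariant impulse response (a special case of c(τ, t)), additive Gaussian noise, and the hypothetical function names `transmit` and `channel`.

```python
import random


def transmit(bits, samples_per_symbol=4):
    """Rectangular pulses: hold each bit value for samples_per_symbol samples."""
    return [float(b) for b in bits for _ in range(samples_per_symbol)]


def channel(signal, impulse_response, noise_std=0.0, seed=0):
    """Convolve the signal with the impulse response, then add Gaussian noise."""
    rng = random.Random(seed)
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for k, h in enumerate(impulse_response):
            out[i + k] += h * s  # each input sample is smeared over later samples
    return [y + rng.gauss(0.0, noise_std) for y in out]


tx = transmit([1, 0, 1])                          # analogous to Figure 1.6(a)
distorted = channel(tx, [0.4, 0.3, 0.2, 0.1])     # analogous to Figure 1.6(b)
received = channel(tx, [0.4, 0.3, 0.2, 0.1], noise_std=0.1)  # analogous to 1.6(c)
print(distorted)
print(received)
```

The distorted waveform no longer returns to zero during the middle symbol, which is the inter-symbol interference described above; the noise then makes the level at each symbol time random.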

The receiver at the right of Figure 1.5 may attempt to reduce detection errors by

estimating and removing the ISI and then averaging over time to limit the effect of

noise. The resulting digital sequence will then be passed through a decoder for error

correction and, hopefully, recovery of the original information sequence.


1.3.2 Receiver Technology


Removing the ISI and limiting the effect of noise is not a trivial task. Several receiver structures have been developed to address this problem. The first is the zero-forcing linear equalizer. It applies a filter that inverts the effect of the channel impulse response (and in some cases, transmitter pulse shaping) so that the ISI is guaranteed to be zero at the symbol time. Thus, neighbouring symbols do not directly contribute to errors. Unfortunately, this inverse filter is applied to the received waveform and usually increases the noise component. A variant known as the mean-square error linear equalizer is designed to minimize the sum squared of residual ISI and noise at the symbol time; it trades off some ISI cancellation for reduced noise emphasis. Non-linear equalizers avoid noise emphasis by using a proxy for the signal in canceling the ISI. For example, the decision feedback equalizer applies the sequence detected thus far to a filter representing the channel and then subtracts the result from the received waveform to remove the ISI from previous symbols. As the noise does not appear in the proxy, it is not emphasized. Of course, problems occur if there are errors in the sequence detected thus far.
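The decision feedback idea can be sketched at the symbol rate as follows. The assumptions here are not from the text: binary symbols (0 or 1), a channel known exactly as a short list of taps with `taps[0]` the main tap, one received sample per symbol, and the hypothetical function name `dfe_detect`.

```python
def dfe_detect(received, taps):
    """Decision feedback detection of binary symbols over a known ISI channel.

    received[k] is modelled as sum_j taps[j] * a[k - j] plus noise, where a[k]
    is the transmitted bit. Past decisions are passed through the trailing
    taps and subtracted before the current symbol is thresholded.
    """
    decisions = []
    for k, r in enumerate(received):
        # ISI predicted from the symbols decided so far (the "proxy" signal).
        isi = sum(taps[j] * decisions[k - j]
                  for j in range(1, len(taps)) if k - j >= 0)
        decisions.append(1 if (r - isi) > 0.5 * taps[0] else 0)
    return decisions


# Bits 1,0,1,1,0 sent through taps [1.0, 0.6] without noise.
rx = [1.0, 0.6, 1.0, 1.6, 0.6]
print(dfe_detect(rx, [1.0, 0.6]))          # feedback removes the trailing ISI
print([1 if r > 0.5 else 0 for r in rx])   # plain thresholding for comparison
```

In this example plain thresholding declares the two zeros to be ones because of the ISI from the preceding symbol, while the feedback structure recovers the transmitted bits. A wrong decision would be fed back and could corrupt later symbols, which is the error-propagation problem noted above.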

There is a more sophisticated and mathematically rigorous receiver that implicitly limits the effect of both ISI and noise. Known as the Maximum Likelihood Sequence Detector (MLSD) [55] or Maximum Likelihood Sequence Estimator (MLSE) [56], it is based on a solid statistical approach. A probabilistic model is developed for the received waveform for each of the possible transmitted sequences. The hypothesized sequence then implies the ISI in the waveform and the probabilistic uncertainty models the fluctuation due to noise. The actual received waveform is then used as an argument to the probability functions and the sequence which yields the greatest probability is selected. This process can be performed efficiently using the Viterbi algorithm [8].
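A minimal sketch of maximum likelihood sequence detection with the Viterbi algorithm is given below. It is a generic textbook-style illustration, not the receiver derived in this thesis, and it assumes binary symbols, a known symbol-rate channel (`taps`), white Gaussian noise (so that maximizing likelihood is equivalent to minimizing squared error), and zeros transmitted before the first symbol. The function name `mlsd_viterbi` is hypothetical.

```python
def mlsd_viterbi(received, taps):
    """Return the bit sequence minimizing the squared error to the received samples."""
    memory = len(taps) - 1
    start = (0,) * memory  # assumed channel state before the first symbol
    # survivors maps a state (most recent bits, newest first) to
    # (accumulated squared error, best bit sequence ending in that state).
    survivors = {start: (0.0, [])}
    for r in received:
        candidates = {}
        for state, (metric, path) in survivors.items():
            for bit in (0, 1):
                window = (bit,) + state
                expected = sum(h * a for h, a in zip(taps, window))
                new_metric = metric + (r - expected) ** 2
                new_state = window[:memory]
                # Keep only the best path into each state.
                if new_state not in candidates or candidates[new_state][0] > new_metric:
                    candidates[new_state] = (new_metric, path + [bit])
        survivors = candidates
    return min(survivors.values(), key=lambda entry: entry[0])[1]


# Bits 1,0,1,1,0 through taps [1.0, 0.6], with small perturbations as noise.
print(mlsd_viterbi([0.9, 0.7, 1.1, 1.5, 0.5], [1.0, 0.6]))
```

The number of states grows as 2^memory, but the work per received sample is fixed, which is what makes the search over all possible sequences tractable.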

1.4 Analogy between DNA Sequencing and Data Communications

In this chapter, DNA sequencing has been shown to depend on noisy, overlapping signals that represent a sequence of information. Data communications has been shown to be concerned with extracting an information sequence from noisy, overlapping signals. There is clearly a strong analogy between data communications and DNA sequencing. Interestingly, this analogy has not been identified previously in the literature. This thesis will exploit this analogy in the hope of improving our understanding of DNA sequencing through the use of the powerful concepts developed for data communications.


1.5 Research Approach

This research first identified the key aspects of the DNA time-series through a study of the literature and preliminary investigations of real data. Then statistical models were developed which incorporated those features. With respect to this modelling, the optimum receiver / sequencing algorithm was then derived. Sub-optimal implementations were then investigated using both simulated and real data.

1.6 Dissertation Organization

This thesis will first detail in Chapter 2 the chemical and physical processes underlying the time-series observed in DNA sequencing. In Chapter 3, statistical models are developed for the DNA time series. The derivation of the optimum sequencing algorithm is presented in Chapter 4. The analysis will explore the mathematics so that insights may be formed into the key structural and functional features of the algorithm. Chapter 5 discusses the implementation of the algorithm. Simulations are used to aid in choosing an appropriate design. Performance with real data is investigated in Chapter 6 and compared with that of a commercial sequencer. Finally, Chapter 7 summarizes the contributions of this work and provides suggestions for further research.

(a) Transmitted signal.

(b) Signal distorted by channel impulse response.

(c) Received signal.

Figure 1.6: Communication signals.

CHAPTER 2

Details of the Chemistry and Physics

of DNA Sequencing

The introductory chapter of this thesis presented a high-level view of the sequencing process. However, the modelling in this thesis requires a deeper understanding of the sequencing process. Thus, to lay the foundation for the statistical modelling to follow, this chapter provides a more detailed description of the chemical and physical processes involved in DNA sequencing.

2.1 Sequencing Chemistry

2.1.1 Chemical Structures

DNA is a polymer composed of monomers called nucleotides (i.e., A, C, G, T). Figure 2.1 depicts the chemical structure of the A nucleotide, which is more properly referred to as deoxyadenosine 5'-monophosphate. Note the nucleotide's three components: phosphate, deoxyribose and adenine. All DNA nucleotides have the same phosphate and deoxyribose components but differ in their base, which may be Adenine (A), Cytosine (C), Guanine (G) or Thymine (T). DeoxyAdenosine 5'-MonoPhosphate may be abbreviated as dAMP; its higher energy triphosphate form is abbreviated as dATP. Similar abbreviations apply for the other nucleotides, as in dCMP, dGMP, dTMP, and dCTP, dGTP and dTTP. Also, a generic nucleotide representing any of the four bases may be referred to as dNMP or dNTP.

Figure 2.1: Structure of deoxyadenosine 5'-monophosphate (dAMP) (after [13]).

Note the numbering by the carbons of the ribose in Figure 2.1. Of particular importance are the 5' and 3' carbons, as the ssDNA polymer is formed by connecting the 5' carbon of one nucleotide to the 3' carbon of the next nucleotide using a phosphate group (Figure 2.2).

Figure 2.3 shows two complementary strands of DNA hybridized together. The hydrogen bonds joining the strands are indicated by dashed lines. One can see how A is complementary to T and G is complementary to C by virtue of the hydrogen bonds they can form. Further, each complementary pair incorporates one base with a single ring, known as a pyrimidine (C or T), and one base with a double ring, known as a purine (A or G).
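The pairing rules can be written down as a small lookup table. The sketch below is illustrative; the names `COMPLEMENT`, `PURINES`, `complement` and `pairs_purine_with_pyrimidine` are hypothetical, and the complement is read position by position against the input, ignoring the 5'/3' orientation of the strands.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
PURINES = {"A", "G"}  # double-ring bases; C and T are single-ring pyrimidines


def complement(strand):
    """Base-by-base complement of a strand written as a string of A, C, G, T."""
    return "".join(COMPLEMENT[base] for base in strand)


def pairs_purine_with_pyrimidine(strand):
    """Check that every base pairs with a partner from the other ring class."""
    return all((base in PURINES) != (COMPLEMENT[base] in PURINES) for base in strand)


print(complement("GATTACA"))
print(pairs_purine_with_pyrimidine("GATTACA"))
```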

Figure 2.2: A single-stranded DNA (ssDNA) molecule (after [13]); full structure shown for phosphate and ribose groups but bases are represented by one of A, C, G, or T.

2.1.2 DNA Amplification

Typically, researchers start with a very small amount of DNA that has been isolated from the cell or virus of interest. To obtain good strong signals in DNA sequencing, more DNA is needed. So, prior to the actual sequencing reactions, steps are taken to produce many copies of the original DNA. This processing is referred to as DNA amplification. There are two major methods for DNA amplification: Polymerase Chain Reaction (PCR) and cloning.

Figure 2.3: Double stranded DNA (dsDNA) detailed structure [11]. Phosphate-deoxyribose backbones are on extreme left and right, corresponding to respective strands. Bases run from top to bottom along the center of the diagram. Hydrogen bonding is seen along center as dashed line emanating from a hydrogen (H) that also has a solid line indicating a covalent bond to the other strand.

PCR is based on the use of a special thermally stable polymerase enzyme, Taq, that was isolated from bacteria living in geothermal vents [12]. In PCR, the DNA is first heated to approximately 95 °C to denature it (i.e., separate dsDNA into ssDNA). It is then cooled to approximately 55 °C to allow a primer to hybridize to it. A primer is a short (10-20 base) piece of ssDNA that is complementary to the start of the DNA segment to be copied. The polymerase enzyme will then bind to the DNA at the end of the primer and start to add bases, extending the primer to make a copy of the original DNA. The solution is typically heated to roughly 70 °C during this phase to speed the incorporation of bases. Then the cycle is repeated, with the initial heating to 95 °C serving to release the copy from the original molecule. This copy is complementary to the original ssDNA. Note that a non-thermally stable polymerase would be destroyed by heating to 95 °C.

The description of PCR is not yet complete. Another primer, complementary to the copy at the other end of the segment of interest, is also included in the solution. This primer will bind to the new copy and then the polymerase enzyme can make a complementary copy of the copy. Applying the definition of a complement, this new copy is therefore identical to the original ssDNA over the segment of interest. Now there are two molecules available as templates for the next round of copying. In this manner, PCR doubles the amount of DNA with every thermal cycle. The typical number of cycles used in PCR is 10-20 and thus amplification factors of a thousand (2^10) to a million (2^20) are typical. Also, any errors are duplicated in subsequent thermal cycles. Should the enzyme dissociate from the template before completion of the full copy then, for primer labelled sequencing, the short copy will contribute to peaks in all four (A,C,G,T) time series at its terminal base position. This phenomenon is known as a "false-stop" or "false termination".
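The amplification factors quoted above follow directly from doubling per cycle. As a brief sketch (the function name `pcr_amplification` and the `efficiency` parameter are illustrative assumptions, with an efficiency of 1.0 meaning perfect doubling as described in the text):

```python
def pcr_amplification(cycles, efficiency=1.0):
    """Amplification factor after a number of thermal cycles.

    With efficiency 1.0 every template is copied in every cycle, so the
    amount of DNA doubles per cycle; lower values model incomplete copying.
    """
    return (1.0 + efficiency) ** cycles


print(pcr_amplification(10))  # about a thousand
print(pcr_amplification(20))  # about a million
```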

The second method of DNA amplification, cloning, relies on cells, typically bacteria or yeast, to amplify the DNA of interest [13]. A cloning vector is used to introduce the DNA template into the cell; the general process is often referred to as recombinant DNA technology. Two standard cloning vectors are plasmids and bacteriophages.

A plasmid is a small (several thousand base pair) piece of circular DNA that is capable of replication within a bacterium. A plasmid vector consists of the original plasmid's DNA required for replication plus the DNA of interest inserted into the circular DNA molecule. This is then inserted into the cell and, through the cell's normal reproductive cycle, additional copies of the DNA of interest are made.

A bacteriophage is a virus that infects bacteria. The DNA of interest can be attached to the bacteriophage's DNA. The bacteriophage is then applied to a bacteria colony. Over the course of infection each bacterium makes many copies of the bacteriophage. These can then be harvested and the DNA of interest extracted.

As in the PCR case, cloning techniques can lead to errors that may be passed to subsequent generations.


2.1.3 Sequencing Reaction Molecules: Terminators and Polymerases

In the introductory chapter, Sanger DNA sequencing is described as depending on competition between a substrate base and a terminator molecule. The terminator molecule is actually a modified nucleotide where the 3' hydroxyl (OH) group of the deoxyribose is replaced by a hydrogen atom. The A terminator is then properly called dideoxyadenosine 5'-monophosphate (abbreviated ddAMP). Its 5' end will bind wherever the 5' end of an A would bind. As it lacks the 3' hydroxyl group, it is not possible to bond other nucleotides at the 3' location and polymerization at this end is not possible. Thus, once it is incorporated, construction of the DNA copy will stop.

In sequencing, DNA polymerases are responsible for incorporating the bases into the complementary DNA molecule. There are many different DNA polymerases, a diversity made possible as different organisms have different natural polymerases. Taq, used in PCR as discussed above, may also be used in sequencing. If thermal cycling is used for the sequencing reactions then the process is referred to as cycle sequencing. Sequenase, another thermally stable DNA polymerase enzyme, can also be used in cycle sequencing. T7 DNA polymerase is not thermally stable but it can make very long copies. Polymerase enzymes are limited in the length of DNA they can copy as they eventually dissociate from the DNA template. Taq, Sequenase and T7 DNA polymerase were used in the experiments of this thesis.

2.1.4 Fidelity and Peak Amplitude Variation

The polymerase can make mistakes in making the copies and can cause fluctuations in the amplitudes of correct peaks. For example, the error rate for Taq is estimated as 2 x 10^-4 misincorporations per nucleotide per cycle [12]. Other authors measured the error rate for Taq as one single-base substitution error in 9000 bases and one frameshift error (i.e., insertion or deletion) in 41000 bases [14]. Under special conditions, base substitution and frameshift error rates of less than IO-' have been observed [15]. Errors may occur in both the amplification and sequencing phases. Beyond a direct substitution of one base for another due to partial affinity for bases other than the correct one, many other error mechanisms are possible.

Footnote 1: Note that in most of this thesis and in the literature in general, when discussing DNA sequencing the word 'base' will often refer to the complete nucleotide including its phosphate and deoxyribose groups.

One source of error is 5' to 3' exonuclease activity, wherein the polymerase removes bases from the primer end of the copy. If this occurs during amplification, it will reduce yield as the PCR primers may not be capable of binding to the shorter copies. If part of the region complementary to the sequencing primer is lost, the molecule will not participate in the sequencing reactions. The 5' to 3' exonuclease activity is part of the DNA repair mechanism in living cells. Sequencing polymerases typically have been modified to suppress this activity.

Blunt-end addition is another DNA polymerase error. Just after completing a full length copy, the polymerase adds additional bases at the 3' end [16, 17], even though the template does not have bases there. The bases are added at random but A's are added more often. If blunt end addition occurs during amplification then the results are unaffected as no changes have been made to the region of interest. As to the sequencing reactions, the ddNTP terminators do not allow blunt end addition. Thus, blunt end addition does not lead to sequencing errors.

As the polymerase ages, it becomes partially inactivated [33]. This can lead to it dissociating from the template before the copy incorporates a ddNTP. If the primer is fluorescently labelled then the copy will be detected. This will indicate a base of the type corresponding to the ddNTP used in that reaction, even though the base at that point in the sequence may be of another type. In amplification, polymerase dissociation leads to some products having a shorter length than others. When sequenced, both the short and long length copies will contribute to the initial peaks but only the long length copies will contribute to the later peaks. Therefore, this phenomenon will contribute to a reduced signal strength for later peaks. Also, the sequencing reactions will often run the full length of the short copy without incorporating a terminator. For primer labelled data, this results in peaks in all four time series (A,C,G,T) at the short copy's terminal base position. As mentioned earlier, this phenomenon is known as a "false-stop" or "false termination".

Figure 2.4: Bulging of copy with insertion of a T.

There are several known sequence dependent effects associated with polymerases (see Table 4.1 in [33]). For example, in a string of consecutive C's, later C peaks are likely (but not certain) to be larger than earlier C peaks. However, a "complete" list of these dependencies is not available, largely due to the very large number of possible combinations. Pairwise sequence dependencies are presented in [Ul. Relative separation depended on the 3' terminal dideoxynucleotide and increased in order of C, A, G and T. However, relative separation was also dependent on the penultimate base adjacent to the 3' dideoxynucleotide and increased in order of T, A/G and C. Generally, peak levels can fluctuate. In [18], the fluctuation, which was defined as the ratio of the difference in adjacent peak levels to their average, was varied from 0.1 to 10 by changing the concentration and type of the divalent cation needed to activate the polymerase. In [45], amplitude fluctuations between runs for the same sequence and polymerase were found to be highly correlated.

Other errors in the copying process are associated with problems in the hybridization of template and copy. Either the template or the copy may bulge out, forming a little loop. If the copy bulges then the copy has one or more insertions (Figure 2.4). If the template bulges then the copy will have one or more deletions. The stability of these bulges depends on local sequence [19]. Larger bulges, known as hairpin loops, are often associated with regions rich in G's and C's as these regions may form hydrogen bonds as in Figure 2.5. If the template forms a hairpin loop then the polymerase is more likely to dissociate at the loop. This leads to the level reduction and false stop effects mentioned previously.

Figure 2.5: Hairpin loop due to complementary GC runs.

Formation of bulges also allows the primer to hybridize to a region it only partially matches. This phenomenon is known as secondary priming. It leads to additional peaks in the DNA time series. These peaks correspond to the sequence from the secondary priming site. Secondary priming can be suppressed by setting the annealing temperature to be high enough so that only the exact complementary hybridization will be stable.

In summary, the error mechanisms discussed in this section impact on DNA time series as additional peaks in the data. These additional peaks occur at locations consistent with those expected for true peaks as they correspond to product containing an integer number of uncorrupted bases. Problems associated with polymerase dissociation can lead to reduced signal levels.

2.1.5 Degradation

Chemical breakdown can cause sequencing errors. The main process is hydrolysis, where long DNA molecules are split into two smaller fragments by the addition of water, with the water's hydroxyl group added to one fragment and its hydrogen added to the other. The elevated temperatures used in sequencing promote hydrolysis. The labelled fragments cause anomalous peaks in the time series. Hydrolysis during amplification and sequencing reactions leads to effects similar to polymerase/template dissociation. Hydrolysis can also occur while the DNA is in the sequencing gel. The resulting anomalous peaks can occur anywhere in the time series, as they are offset in time by when hydrolysis occurs, which in turn is a random variable.

From long DNA molecules of all the same length, hydrolysis can lead to a population of products of all lengths (base counts) shorter than the original length. As the products of a hydrolysis reaction may themselves undergo hydrolysis, the shorter members of this population tend to be more populous than the longer members. Thus, the level of noise due to this population should be higher earlier in the time series.

Figure 2.6: Cleavage pathway for depurination (Guanine base) [20]. Additional symbols are: R for deoxyribose, G for guanine, T for thymine and P for phosphate.

Hydrolysis can remove the base from the deoxyribose monophosphate of a nucleic acid. Purine bases are more likely than pyrimidine bases to be removed from DNA. Depurination is the hydrolytic removal of purine bases from DNA. Figure 2.6 illustrates the depurination pathway; all the intermediate products may be present and would lead to additional anomalous peaks in the DNA time series. Note that depurination ultimately leads to cleavage of the DNA into two shorter DNA molecules.


Hydrolysis can also cleave the fluorescent label from the DNA. Depending on the label molecule, the fluorescent label may then be positively or negatively charged. If this hydrolysis occurs before the detectors then a loss of signal level results. The noise background will increase either from negatively charged labels migrating ahead of the DNA band prior to the detectors or from positively charged labels migrating backwards past the detectors after the source band has passed the detectors.

2.2 Sequencing Physics

2.2.1 Sequencing Gel

A gel is an aggregate of polymers that encompasses a liquid medium. There are connections, referred to as cross-links, between some of the fibers. A molecule undergoing electrophoresis must weave its way between the fibers. It is this interaction that allows discrimination of DNA molecule length. In a simple solution rather than a gel, DNA's charge and resistance to motion scale with length and discrimination of length is not possible [21].

The cross-links lead to the notion of pores in the gel and a gel is often characterized by its mean pore size. The large pore agarose gel is used for separating large molecules such as million base dsDNA molecules. The smaller pore polyacrylamide gel is normally used for DNA sequencing. Currently experimental, capillary gel electrophoresis can also work if the fibers are not cross-linked [22, 23].

A polyacrylamide fiber is composed of long runs of acrylamide monomers. The

fibers are covalently cross-linked by N,N'-methylene-bis-acrylamide, a molecule which

is usually referred to as "bis". The weight ratio of acrylamide to "bis" determines the

extent of cross-linking and therefore the character of the gel (10:1 is brittle while 100:1

is pasty [24]). A 19:1 ratio is typical for DNA sequencing. The gel concentration sets

the mean pore size. The gel is formed by adding acrylamide and "bis" to the liquid

medium.

For sequencing, the medium contains at least two components: a buffer and a

denaturing agent. The buffer provides ions that undergo electrophoresis just as the


DNA molecules do. The concentrations are set so that the vast majority of charge is

carried by the buffer ions. Thus, the buffer ions define the local electric field conditions

and as they are small and uniform in distribution, the DNA molecules see a uniform

electric field that is essentially unperturbed by other DNA molecules. The ability of

the buffer to do this is referred to as its ionic strength and is defined as one half the

sum of the molecular molality times the molecular charge squared. Ionic strength is

often reported as a multiple of TBE where 10xTBE is the ionic strength of a reference

buffer (10xTBE) consisting of 89 mM Tris-Borate and 2 mM EthyleneDiamineTetraacetic

Acid (EDTA). For the experiments reported in this thesis, the buffer is TBE.
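
The definition of ionic strength given above (one half the sum, over ionic species, of molality times charge squared) can be sketched as follows. The function name and the example salts are illustrative assumptions and are not taken from the thesis.

```python
def ionic_strength(species):
    """Ionic strength as one half the sum of molality times charge squared.

    `species` is an iterable of (molality_in_mol_per_kg, charge) pairs,
    one pair per ionic species in the solution (hypothetical input format).
    """
    return 0.5 * sum(molality * charge ** 2 for molality, charge in species)


# Illustrative salts only (not the thesis buffer):
# a 1:1 salt at 0.1 mol/kg, and a 2:1 salt at 0.1 mol/kg.
salt_1_1 = ionic_strength([(0.1, +1), (0.1, -1)])
salt_2_1 = ionic_strength([(0.1, +2), (0.2, -1)])
```

Because charge enters squared, multivalent ions contribute disproportionately to the ionic strength at the same concentration.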

The denaturing agent ensures that the DNA remains single stranded. The denaturing

agent also helps prevent the ssDNA from forming hydrogen bonds between

different regions of itself. A gel formed with a denaturing agent is referred to as a

"denaturing gel". For the experiments reported in this thesis, the denaturing agent is

urea.

2.2.2 Theories of Electrophoresis

There are several electrophoretic theories in the literature [24, 46, 47, 48, 49, 50]

and an excellent overview of these theories is given in [9]. Individual theories explain

different electrophoretic regimes where the regimes are defined by the relative sizes of

the molecule and the gel pore [51]. Electric field strength can also provide the basis

for differentiation into different regimes. For the sequencing gel conditions used in the

experiments of this thesis, the two most relevant models and regimes are the Ogston

model (1-150 basepairs) and the biased reptation model (~150-500 basepairs).

Ogston calculated the fraction of the gel that can contain a sphere of a specific

radius (i.e., the ratio of the total volume of all pores bigger than the sphere to the

total volume). To obtain a simple electrophoresis theory, the arbitrary idea that the

mobility is proportional to the fractional volume was explored. This of course ignores

the requirement that a gel must have a connected set of sufficiently big pores from

one end to the other in order for the sphere to pass through it. Surprisingly then,

the model yielded a good fit to experimental data over a fair region of molecular

mass, with the upper limit of the region being roughly two to three orders of magnitude

greater than the lower limit of the region. As a result, it has been the basic

electrophoresis model for nearly four decades. The model states that the mobility

decreases exponentially with the square of the ratio of sphere radius to pore radius.

In applying the model to DNA sequencing, the DNA is presumed to have folded in

to a sphere-like random coil with volume equal to that of the linear DNA molecule.
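
As a minimal sketch of the Ogston dependence just described, the function below lets mobility fall off exponentially with the square of the ratio of sphere radius to pore radius. The function name, the `free_mobility` prefactor and the omission of any numerical constant in the exponent are assumptions made for illustration.

```python
import math


def ogston_mobility(sphere_radius, pore_radius, free_mobility=1.0):
    """Mobility that decays exponentially with (sphere radius / pore radius)^2.

    `free_mobility` is the hypothetical mobility of a vanishingly small sphere;
    any model-specific constant in the exponent is deliberately left out.
    """
    ratio = sphere_radius / pore_radius
    return free_mobility * math.exp(-ratio ** 2)


# Mobility relative to the free value for spheres of increasing size
# in a gel of unit mean pore radius (arbitrary units).
relative_mobility = [ogston_mobility(r, 1.0) for r in (0.0, 0.5, 1.0, 2.0)]
```

In this form a sphere much smaller than the pore moves almost freely, while one twice the pore radius retains under 2% of the free mobility.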

The biased reptation model applies when the DNA molecule is large enough such

that it cannot fit into a single pore. In this model, the leading part of the linear

polymer, the "head", is assumed to enter the pore and to choose the path to the

next pore. The rest of the polymer just proceeds in order along the path selected

by the head. The head "searches" for the entrance to the next pore through its

normal thermodynamic motions. The word "biased" in the model name refers to

the electric field biasing the head to follow the field lines. In the biased reptation

model mobility is inversely proportional to molecular length. It also states that

beyond a certain limiting length, mobility is independent of molecular length. Two

different length molecules of size greater than the limiting length would have the same

mobility, take the same time to travel through the gel and thus would not be resolved.

Fortunately, this limiting length is in the thousands of base pairs for the gels used in

the experiments for this thesis.
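
The two statements of the biased reptation model above (mobility inversely proportional to length, then independent of length beyond a limiting length) can be illustrated with a deliberately crude sketch. The hard cutoff, the constant `c`, the gel length and the limiting length of 3000 bases are all illustrative assumptions; the real transition is gradual.

```python
def reptation_speed(n_bases, c=1.0, limiting_length=3000):
    """Speed proportional to 1/length, frozen beyond a limiting length.

    The abrupt switch at `limiting_length` is a simplification for illustration.
    """
    return c / min(n_bases, limiting_length)


def arrival_time(n_bases, gel_length=1.0, c=1.0, limiting_length=3000):
    """Time for a band to travel the full gel length at constant speed."""
    return gel_length / reptation_speed(n_bases, c, limiting_length)


# Below the limit, adjacent lengths are separated by a constant gel_length / c.
separation = arrival_time(301) - arrival_time(300)
# Above the limit, different lengths arrive together and cannot be resolved.
unresolved = arrival_time(4000) == arrival_time(5000)
```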

These electrophoresis models assume that the molecule is a linear polymer composed

of identical symmetric monomers. DNA has varying monomers (A,C,G,T) and

the bases are attached asymmetrically off the side of the phosphate deoxyribose backbone.

However, rotation about the backbone is possible. One can think of a long

DNA molecule representing an instantiation of a random set of base types and rotations.

As the set is large, the law of large numbers applies and the average properties

become more representative. One could think of an average monomer and average

rotation leading to the idealized linear polymer of identical symmetric monomers to

which the models would then apply. Thus, these models provide the average character

of the electrophoresis results but on a base by base basis, fluctuations about this

average would be expected due to variations in monomer type and orientation.

Further, microscope studies of actual DNA molecules undergoing electrophoresis

have revealed more complicated behavior than assumed in the models above [52, 53,

54]. These behaviors include herniation outside the reptation "tube", hooking on gel

chains and release of hooked molecules where one end goes backward relative to the

general direction of migration.

The idealized electrophoresis models above assume the polymer consists of stiff

rods joined at nodes and that these rods are free to take any relative angles. Rather

than use the length of a monomer as the length of the rod, the models use the Kuhn

length of the DNA as the rod length. Kuhn length is the contour² distance between

two points on the DNA molecule such that the angle of the local segment along the

contour at one point is uncorrelated with that of the other point [31]. This justifies

the rods being free to take any relative angles in the models. Persistence length,

defined as half the Kuhn length, is often used as a measure of this property.

Kuhn length may be easily understood by considering a thick rope. For two points

hundreds of rope diameters apart, it is easy to place the local segment at any angle

independent of that at the other point (presuming the rope is not stretched taut).

However, for two points a few diameters apart, the stiffness of the rope limits the

possible relative angles.

The persistence length of ssDNA varies from 5 to 12 bases depending on the

ionic strength of the buffer; here, the maximum ionic strength has been restricted

to 10^-2 mol/L to reflect the maximum likely to be used in sequencing gels [27].

This is because the stiffness of the ssDNA has a fixed structural component and

an electrostatic component [32]. The electrostatic component is due to electrostatic

repulsion between the charged bases. As ionic strength increases, more positive ions

gather around the DNA molecule and shield the bases from the negative charge of

adjacent bases. Thus, electrostatic repulsion is decreased and the DNA molecule

becomes more flexible. The Kuhn length of ssDNA varies from about 2.4 nm at

infinite ionic strength to 16 nm at a low ionic strength (the phosphate to phosphate

interbase separation for ssDNA is 0.43 nm) [32].

² The contour is the path taken in going from base to base along the DNA molecule.


2.2.4 Resolution

Resolution refers to the ability to discern multiple consecutive same type bases.

As resolution becomes poorer, the sequencing error rate increases. A useful measure

for resolution is the ratio of peak width to interbase separation. Peak width in the

gel is determined by the peak width on loading plus the additional increments due to

diffusion and dispersion since loading [25]. The peak width on loading is complicated

by the stacking of the DNA at the decelerating interface between loading well and gel.

Yarmola [25] defines diffusion as that spreading component present in the absence of

an electric field and dispersion as the additional time dependent component present in

the presence of an electric field. Slater [9] combines diffusion and dispersion together

and refers to the aggregate as diffusion; this is the nomenclature used in the remainder

of this thesis.

The time-dependent peak width in the gel, Δx_D(t), is given in [9] as

where Δx_0 is the peak width on loading and D is the diffusion coefficient. Peak width

in time, pw, is then Δx_D/v where v is the speed of the peak in the gel.

Resolution becomes problematic in the region that the biased reptation model

applies. The analysis is simplified by assuming that only the biased reptation model

applies. So mobility (v) is inversely proportional to molecular length and hence base

number (i); v = C/i where C is the constant of proportionality. The center of the

band passes the detectors at t = L/v = Li/C where L is the length of the gel.

Application of these factors to the results of the previous paragraph yields

Thus, peak width grows with a rate that is between linear in i and i^(3/2). Note that for

this model the separation between adjacent peaks is constant at Li/C - L(i - 1)/C =

L/C. The resolution, res(i) = pw(i)/(t_i - t_{i-1}), is then

A small value of resolution is desirable. According to this equation (and as is actually

the case in practice), resolution is improved by using longer gels (large L).
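
One set of expressions consistent with the scalings stated in this subsection is sketched below. It assumes that the loading width and the diffusive spread add in quadrature; the factor 2 multiplying Dt is an assumption that depends on how peak width is defined, so these should be read as an illustration and not as the equations of [9].

```latex
% Assumed form: loading width and diffusive spread add in quadrature.
\Delta x_D(t) = \sqrt{\Delta x_0^{2} + 2Dt}

% With v = C/i and t_i = Li/C (biased reptation regime):
pw(i) = \frac{\Delta x_D(t_i)}{v}
      = \frac{i}{C}\sqrt{\Delta x_0^{2} + \frac{2DL}{C}\, i}

% Dividing by the constant peak separation L/C:
res(i) = \frac{pw(i)}{L/C}
       = \frac{i}{L}\sqrt{\Delta x_0^{2} + \frac{2DL}{C}\, i}
       = i\,\sqrt{\frac{\Delta x_0^{2}}{L^{2}} + \frac{2D}{CL}\, i}
```

Under this assumed form, pw(i) is linear in i when the loading width dominates and grows as i^(3/2) when diffusion dominates, and res(i) decreases as L increases, in agreement with the statements in the text.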

The diffusion coefficient does depend on the mass and hence the length of the DNA

molecule. The temperature defines the average kinetic energy of the molecules in the

system. Kinetic energy is proportional to the product of mass and velocity squared

so, for the same temperature, a larger mass implies a lower velocity and vice versa.

Smaller velocities lead to lesser diffusion. Thus, one would expect larger molecules

to have narrower bands in the gel than if they had the diffusion coefficient of the

smaller molecules. However, the effect will be complicated by the configuration of the

molecule and in practice, the impact may be modest. As the diffusion coefficient will

decrease with base number and hence peak time, overall spreading due to diffusion

as indicated by Equation 2.1 will vary less than if the diffusion coefficient had been

constant. Practically, to first order, band width in the gel varies little with base

number.

2.2.5 Other Concerns in Electrophoresis

Gel Inhomogeneity

Bubbles in the gel, dust on the interior glass wall and defects in the loading

well shape are practical problems [33] that can lead to local variations in mobility.

Generally, the impact is felt in terms of extending the pulse shape, particularly if,

for the same lane, there is an entire gel section along the field axis which is problem

free and high mobility and a section along the field axis with flaws and low mobility.

For very large bubbles or well defects, the signal for a particular base type may be

significantly degraded or lost as migration down the gel is inhibited.

On a larger scale, variations in the degree of cross-linking over the entire gel lead

to variations in mobility from lane to lane. This problem can be addressed by scaling


in time the time series for each lane so that generally the peaks occur in the right

sequence; this is referred to as lane alignment.

Secondary Structure

The hairpin loops mentioned in Section 2.1 can also affect electrophoretic mobility.

The resulting molecule tends to be more compact and migrates faster [33]. Thus,

peaks corresponding to bases subsequent to the position of the hairpin formation

will reach the detectors sooner than could be expected given the times of the pre-hairpin

peaks. The peaks will appear to bunch up, a phenomenon known as band

compression. More modest secondary structure, such as bends [28] and a tendency

to form arcs [29], will affect mobility in a more modest fashion.

2.2.6 Detection of Fluorescent Labels

Fluorescent labels must be excited by a source with wavelength shorter than that

of the labels' emission wavelength. For example, fluorescein has an emission maximum

at 520 nm and must be excited by a source of wavelength smaller than 494

nm [26]. The Pharmacia ALF Automatic DNA Sequencer, the data source for the

experiments reported in this thesis, uses a blue-green laser for excitation. The laser

beam enters the gel by the first lane and terminates by the last. This implies lane

1 has a clean illumination while the last lane would be illuminated by a beam that

has been attenuated and dispersed by passing through the gel. The peaks in the last

lane are likely to be weaker and broader. They will also be noisier as the dispersed

beam can excite a wider region and thus more potential fluorescence noise sources.

The noise sources include chemical contaminants and the glass of the gel assembly.

Detection in the Pharmacia ALF is by an array of photodiodes. The output of

these devices is characterized as shot-noise as an impulse is produced for each photon

received. However, as the level of fluorescence is large and as low-pass filtering is

performed in the amplifiers, the recorded signal appears as an analog measurement

of fluorescence intensity plus a small Gaussian measurement noise.


2.3 Summary

This chapter has identified the chemical and physical processes that determine the

character of the DNA time series. Amplitude and noise fluctuations are largely due to

chemical processes. Peak time variations are largely due to physical processes. The

discussion of fidelity leads to the noise model of the next chapter. The peak shape and

electrophoresis discussions lead to the signal model. Phenomena have been presented

in sufficient detail so as to provide the required background for the model developed

in the next chapter.

CHAPTER 3

A Statistical Model of the DNA

Time-Series

A statistical characterization of the DNA time-series will be presented in this

chapter. First, the gross features of the DNA time-series are described. Our interest

then focuses on the local features of the time-series. The signal peak shape and parameters

are modelled. This is followed by the noise model. Simulated data produced

by this model is then presented and compared visually with sample real data. For

completeness, the major known features of DNA data that are not included in the

model are summarized. Finally, the importance of the model is discussed.

3.1 Gross and Local Structure of DNA Time-Series

Amplitude trends are in evidence in Figure 3.1 which presents the entire time-series

for a single channel. Proceeding from left to right, a constant background level is first

seen. This could be due to background fluorescence and/or an offset in the sequencer

electronics. Next, a large peak is seen; this is known as the primer peak and is due

to an excess of the fluorescently labelled primer unincorporated into any sequencing

Figure 3.1: Sample entire time series for "T" channel. Mean inter-base separation is 14.7 samples.

copies. The primer peak causes an exponentially decaying offset in the data. Near the

end of the data, an exponential rising offset is seen. This is the precursor of the peak

at the end of the data due to fluorescently labelled full length copies of the original

DNA fragment. If the terminator had been labelled instead of the primer then neither

this peak nor the primer peak would be present. Over the central region, a downward

trend in peak amplitudes¹ can be seen. This is likely due to the competitive process

used to encode sequence information. Here, the relative concentration of ddNTP to

dNTP is high leading to a greater chance of terminating early rather than later in

the sequence. The trend may also be due to random polymerase dissociation during

the sequencing reactions.

The gross structure of the time-series in Figure 3.1 is representative of DNA time-

series from a wide variety of DNA sequencers though parameter values may change.

These trends may be compensated and the useful data region extracted prior to

¹ For this thesis, peak amplitude is defined as the difference between peak maximum intensity and the local offset level. For example, in Figure 3.1, the peak at sample 7000 has an amplitude of just over 100 intensity units and an offset of just under 1300 intensity units.

making sequence decisions. This is typical practice in automatic DNA sequencing

and is analogous to automatic gain control and automatic frequency control in radio

communications.

Figure 3.2 presents the compensated time series for all four bases; Figure 3.1

presented the uncompensated T channel data for the same sequencing run. As this

data originated from different lanes, it was necessary to compensate for differences

in mobility. The compensation, detailed in the Appendix, features sufficient degrees

of freedom to allow for the Ogston and biased reptation regimes expected in sequencing

data. Also as described in the Appendix, the background, primer and end

of data offsets have been estimated and removed. The trend in peak amplitude has

been estimated and the data has been scaled by its inverse. The result features signal

absent regions with values near zero and isolated signal peaks with values near

one. Consecutive peaks can have values much greater than one due to constructive

interference. This latter phenomenon is more pronounced near the end of the run due

to the expected increase in pulse width with base number.

Figure 3.3 is a higher resolution presentation of the compensated time-series for

all four bases. Note that the individual peaks are of similar shape and that there is

evidence of noise. This suggests a time-series model as in

where n refers to the base type, the index k is the sample number, the sum is over

the base sequence position, i, and there are a total of N_b bases in the sequence. The

Kronecker delta function, defined as one if n is the same as x_i and zero otherwise,

is used to determine if x_i, the base at sequence position i, is of the same type as the

channel n and should therefore contribute to the observed waveform in that channel.

The contribution consists of a generic pulse shape, g_{k,t_i}, where the peak of the pulse

is centered on t_i and the peak is scaled by a_i. The random variables t_i and a_i model

the timing jitter and amplitude fluctuation, respectively. Finally, an additive noise

process, {n_k}, represents the background fluctuation evident in Figure 3.3.
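
The structure of this time-series model (a sum over sequence positions, gated by a Kronecker delta on base type, of scaled pulses plus additive noise) can be sketched in code. This is a minimal illustration only: the Gaussian pulse, the fixed pulse width, the white Gaussian noise and all function and parameter names are assumptions, not the model developed later in the chapter.

```python
import numpy as np

BASES = "ACGT"


def synthesize(sequence, peak_times, amplitudes, width, n_samples,
               noise_sd=0.0, seed=0):
    """Return a dict mapping each base type to its simulated channel waveform."""
    rng = np.random.default_rng(seed)
    k = np.arange(n_samples)
    channels = {base: np.zeros(n_samples) for base in BASES}
    for base, t_i, a_i in zip(sequence, peak_times, amplitudes):
        # Kronecker delta: the pulse is added only to the channel whose
        # type matches the base at this sequence position.
        # A Gaussian stands in for the generic pulse shape (assumption).
        channels[base] += a_i * np.exp(-0.5 * ((k - t_i) / width) ** 2)
    for base in BASES:
        # Additive background noise, here white and Gaussian (assumption).
        channels[base] += rng.normal(0.0, noise_sd, n_samples)
    return channels


# Four bases, 15 samples apart, with amplitudes fluctuating about one.
example = synthesize("ACGT", [15, 30, 45, 60], [1.0, 0.8, 1.2, 0.9],
                     width=4.0, n_samples=80, noise_sd=0.02)
```

Timing jitter and amplitude fluctuation enter through the `peak_times` and `amplitudes` arguments, which a fuller simulation would draw from the distributions characterized later in the chapter.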

Figure 3.2: Selected compensated time series for same sequencing session as Figure 3.1. Individual channel data has been offset in this figure for clarity. Top curve is for A channel with C, G and T channels presented in order from top.

3.2 Signal Peak Shape

The electrophoresis of a pure molecule should, after the band has moved sufficiently

away from the loading well, lead to a Gaussian shaped peak [9]. The Gaussian

peak is presumed in at least one automatic sequencing algorithm [36]. However,

in the data observed from the Pharmacia ALF sequencer, the peak shape is more

complicated than a simple Gaussian.

Referring again to Figure 3.3, it is evident that most peaks, unlike a Gaussian,

are not symmetric with respect to their tails. In particular, the peaks appear to have

the trailing tail extended.

To provide a cleaner look at these low level tails, a data set was examined which

featured high Signal to Noise Ratio (SNR) and well separated peaks that interfered

only modestly. Micro-Satellite Repeat (MSR) data, used in family genetic studies, fits

these criteria well. MSR product features 3-20 primer labelled ssDNA molecules. The

Pharmacia ALF permits MSR product to be loaded, electrophoresed, recorded and

Figure 3.3: High resolution view of a segment of the compensated time series (actually Figure 1.2 repeated for reader's convenience).

analysed for size comparison. Figure 3.4 displays the trace of an MSR electrophoresis

session. The peaks are very strong as the fluorophores are spread over only a few

molecular sizes instead of hundreds in the example of Figure 3.1. They are separated

by tens of bases so the tails are relatively free of interference.

Figure 3.5 shows the regions about the proximal and distal DNA standard peaks

in Figure 3.4. In Figure 3.5, each region had its baseline removed and the peak was

scaled to unit height. The distal peak was lined up with the proximal peak and then

time scaled in an attempt to match the shape of the proximal peak. In Figure 3.5, it

is evident that the peaks are extremely similar. Thus, the only significant difference

in the shape of the peaks in Figure 3.4 is that later peaks have been stretched in time.

This may be formalized by writing the pulse shape as

where k is the sample index, t_i is the peak time, g_u is the pulse shape (continuous)


Figure 3.4: Micro-satellite repeat data trace. Major peaks in time order are: primer peak, proximal DNA standard peak, sample peak, distal DNA standard peak.

when the pulse width is unity, and p_w(t) describes the dependence of peak width on

peak time.

Figure 3.6 shows the components of the peak shape. The central peak is a Gaussian

while the tails are exponentials. The trailing exponential has a longer time constant

than the other tail. These tails are consistent with those seen in the high amplitude

primer peak in DNA sequencing data.
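
The three-part structure described above (leading exponential, Gaussian centre, trailing exponential with the longer time constant) and the stretching of later peaks in time can be sketched as follows. The join points, the time constants and the function names are hypothetical values chosen only to make the shape continuous and visibly asymmetric; they are not the fitted values from the thesis.

```python
import numpy as np


def unit_pulse(u, join_lead=1.5, join_trail=1.5, tau_lead=0.5, tau_trail=2.0):
    """Unit-width pulse: Gaussian core with exponential tails on both sides.

    The tails are matched to the Gaussian at the join points so the pulse is
    continuous; all four parameters are illustrative assumptions.
    """
    u = np.asarray(u, dtype=float)
    core = np.exp(-0.5 * u ** 2)
    lead = np.exp(-0.5 * join_lead ** 2) * np.exp((u + join_lead) / tau_lead)
    trail = np.exp(-0.5 * join_trail ** 2) * np.exp(-(u - join_trail) / tau_trail)
    return np.where(u < -join_lead, lead, np.where(u > join_trail, trail, core))


def pulse(k, peak_time, width):
    """Pulse centred on `peak_time`, stretched in time by `width`."""
    return unit_pulse((np.asarray(k, dtype=float) - peak_time) / width)


# A later peak is modelled as the same shape stretched in time.
k = np.arange(200)
early = pulse(k, 60, 4.0)
late = pulse(k, 140, 5.5)
```

With these illustrative settings the trailing tail at four widths past the peak is roughly forty times higher than the leading tail at four widths before it, which mimics the extended trailing tail noted in the text.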

As a check on the validity of this candidate generic pulse shape, isolated peaks

from DNA sequencing data were extracted. These had lower SNR than the MSR

peaks and there was evidence of intersymbol interference. These peaks are plotted

in Figure 3.7. Note that the peaks were simply aligned in time and not time scaled.

The peaks do appear to be asymmetric. However, the tails are not as consistent as in

the MSR data. The peak due to base 143 has a trailing exponential tail. The peak

due to base 275 appears to have a much weaker tail; this may be due to an error in

the baseline removal processing. The peak due to base 14 has a trailing tail but it

is far from exponential. Rather it appears to be a small echo of the main peak. As


Figure 3.5: Proximal DNA standard peak (solid line) and distal DNA standard peak (dash-dot). Warped peak (dotted line) was created by scaling the time coordinates by 0.7286.

the inter-base separation is roughly 15 samples, this echo appears approximately two

bases after the main peak. This is consistent with an error due to a bulge in the copy

as described in the previous chapter. Variations such as those in Figure 3.7 are seen

throughout the data.

Two issues emerge: 1) could the peak shape associated with MSR data be fundamentally

different than that of sequencing data, and, 2) how should the peak be

modelled given the range of fluctuations observed? Regarding the first issue, MSR

and sequencing data are distinguished by signal level. The high signal level of MSR

data implies a very large number of molecules of identical size. This large number

may lead to DNA-DNA interactions and a phenomenon known as gel overloading.

Neither of these effects is well understood. One hypothesis is that due to overloading

some DNA molecules may be trapped or wrapped around gel fibers for a long

time. At a later, random time they are released and eventually are detected. Their

arrivals would be expected to have a Poisson distribution and this would lead to the


Figure 3.6: Approximation of proximal peak of Figure 3.4 (dotted line) by leading exponential (samples 1-35, dashed line), Gaussian (samples 36-70, solid line), and decaying exponential (samples 71-200, dashed line). Inset is the logarithm of the same data.

trailing exponential tail. This would be less likely to happen at lower gel loading and

thus is not seen as frequently in sequencing data.

Advancing to the second issue, if the high SNR MSR data does not provide an

accurate model of the sequencing peak shape, and, further, that peak shape appears

to have considerable fluctuation, then perhaps a stochastic model should be employed.

This strategy is adopted by this thesis. The model employs the structure suggested

by the MSR data: leading exponential, Gaussian mainlobe and trailing exponential.

However, the scaling and time constants of the exponentials are not taken from the

MSR data. Rather, they are selected to loosely represent the average tails seen in

the sequencing data. The generic unit width pulse shape for the sequencing data set

Figure 3.7: Three isolated peaks from DNA sequencing data.

discussed in this section is

3.3 Local Covariance Model of Peak Parameters

Now that the generic peak shape has been established, our attention turns to

the parameters necessary for its incorporation into Equation 3.1, specifically, peak

amplitude, a_i, peak time, t_i, and peak width, p_w(t_i). These parameters may be

characterized in terms of their gross behavior (i.e., long term trends) and local behavior

(i.e., fluctuations and the dependency of these fluctuations on the values of

neighbouring peaks).

Chapter 2 has presented models describing the gross behavior of these parameters.

Amplitude is expected to decay with base number though fluctuations are expected

due to polymerase problems (Section 2.1.4). Peak time is expected to evolve

smoothly (on the scale of hundreds of bases) through Ogston and biased reptation

regimes; however, local fluctuations are expected due to the non-uniformity of the

DNA molecule. Under the biased reptation model, pulse width is expected to grow

linearly then slightly faster than linearly over a sequencing run; as pulse width is

driven by the statistical mechanics of a very large number of molecules, little fluctuation

is expected. Practical application of this knowledge with respect to the trends

in peak amplitude and time leads to the compensated data presented in the previous

section.

Little is known of the local behavior of these parameters. Certainly, Chapter 2 suggested

some mechanisms for local fluctuations but work in the literature has stopped

short of characterizing them other than for the limited number of amplitude sequence

dependencies mentioned in Chapter 2. In this section, a model will be developed

for the local fluctuations in peak parameters including their point probability density

functions and their average dependence on their neighbouring peaks (covariance). The

emphasis will be directed towards a practical model to be used in the development of

sequencing algorithms.

3.3.1 Methods

The α-A-crystalline exon 3 [10] data was obtained and then processed to remove

trends as described in the Appendix. It should be noted that amplitude trend removal

employed a 51 bin moving averager; as will be seen later in this section, this leads to

an artifact with this same period in the amplitude covariance estimate. The methods

employed to extract the peak measurement and form the covariance estimates are

detailed below.

Peak Extraction and Measurement

All measurements assume a directed search for peaks based on knowledge of the

true sequence. After identifying the correct peaks the following procedures were used

to extract the peak parameters.

For the basic peak measurements, we first obtain a background level estimate by

drawing a line through the point halfway back to the previous peak and the point

halfway forward to the next peak. This line is then subtracted from the data. The

peak amplitude is taken as the maximum of the result. The peak width is taken as

the distance between the half amplitude points on either side of the peak.
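
The basic measurement just described can be sketched as below: a straight baseline through the two halfway points, amplitude as the maximum of the baseline-subtracted data, and width as the distance between the half-amplitude points. The function name, the integer sample positions and the linear interpolation of the half-amplitude crossings are assumptions made for this illustration.

```python
import numpy as np


def measure_peak(y, prev_peak, peak, next_peak):
    """Return (peak_time, amplitude, width) for one isolated peak.

    `prev_peak`, `peak` and `next_peak` are integer sample positions of the
    neighbouring and current peaks, assumed known from the true sequence.
    """
    y = np.asarray(y, dtype=float)
    left = (prev_peak + peak) // 2     # halfway back to the previous peak
    right = (peak + next_peak) // 2    # halfway forward to the next peak
    k = np.arange(left, right + 1)
    # Background estimate: the straight line through the two halfway points.
    baseline = np.interp(k, [left, right], [y[left], y[right]])
    seg = y[left:right + 1] - baseline
    top = int(np.argmax(seg))
    amplitude = float(seg[top])
    if amplitude <= 0.0:
        return left + top, amplitude, 0.0
    half = 0.5 * amplitude
    lo = top
    while lo > 0 and seg[lo] > half:               # walk left to the crossing
        lo -= 1
    hi = top
    while hi < len(seg) - 1 and seg[hi] > half:    # walk right to the crossing
        hi += 1
    # Interpolate linearly between samples for sub-sample crossing positions.
    left_x = lo + (half - seg[lo]) / (seg[lo + 1] - seg[lo])
    right_x = hi - (half - seg[hi]) / (seg[hi - 1] - seg[hi])
    return left + top, amplitude, float(right_x - left_x)


# Synthetic isolated Gaussian peak on a constant offset (illustrative data).
k = np.arange(100)
trace = 10.0 + 2.0 * np.exp(-0.5 * ((k - 50) / 3.0) ** 2)
peak_time, peak_amp, peak_width = measure_peak(trace, 20, 50, 80)
```

For a Gaussian of standard deviation 3 samples the measured width should be close to the full width at half maximum, about 2.355 times 3 samples.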

For data including unresolved peaks, a more elaborate procedure is followed to

alleviate the effect of neighbouring bases on the measurement. Again the process

begins with the identification of the correct peaks given knowledge of the true sequence.

Also, we start with a linear model of pulse width based on manual pulse width

measurements taken near the start and end of the data set. The peak shape is taken

as Gaussian; the effect of the tails is ignored and is a source of error. Now for the

measurement of each correct peak, the influence of neighbouring peaks is estimated

and removed and then the peak measurement is made as described for isolated peaks.

A multi-step iterative process is used to estimate neighbouring base interference.

For the first pass, as we move through the sequence, peak time and amplitude measurements,

together with the peak width model and peak shape function, are used to

estimate and suppress the influence of previous peaks. The influence of future peaks

is suppressed using a priori mean amplitudes and peak separations together with the

pulse width model and peak shape function. Intermediate parameter estimates are

then obtained as described above for isolated peaks.

Subsequent passes use the measured peak amplitudes and times from the previous

pass as the parameters in suppressing neighbouring peak influence. Updating these

measurements in this fashion leads to improved estimates of the amplitudes and

peak times. However, if pulse width was updated in the same fashion divergence

would be seen; a wide peak would grow wider if its adjacent peaks were seen as

narrow and thus their influence under-estimated. For example, the contribution of

a wider peak to its neighbouring peaks will be over-estimated which in turn would

lead to these neighbours being biased still narrower. With each pass the effect would

be emphasized and divergence from the correct values would result. To avoid this

problem, the peak width estimate that is used in suppressing the contribution to

neighbouring peaks is obtained from a low-order polynomial fit to the previous pass's


peak width measurements. Typically, five passes lead to effective convergence of the

parameter estimates.
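
A much reduced sketch of this multi-pass idea is given below. It estimates amplitudes only, treats peak times as known, uses a single fixed Gaussian width in place of the peak width model, and reads the amplitude directly at the peak sample; each of these is a simplifying assumption, and the function names are hypothetical. What it does retain is the ordering described in the text: within a pass, earlier peaks are suppressed using their freshly measured amplitudes, while later peaks are suppressed using the a priori mean on the first pass and the previous pass's values thereafter.

```python
import numpy as np


def gaussian(k, t, a, sigma):
    """Gaussian peak of amplitude `a` centred on sample `t`."""
    return a * np.exp(-0.5 * ((k - t) / sigma) ** 2)


def iterative_amplitudes(y, times, sigma, prior_amp=1.0, n_passes=5):
    """Estimate peak amplitudes by repeatedly removing neighbouring peaks."""
    k = np.arange(len(y))
    amps = np.full(len(times), float(prior_amp))   # a priori mean amplitude
    for _ in range(n_passes):
        for i, t_i in enumerate(times):
            # Modelled contribution of every other peak, built from the
            # most recent amplitude estimates available.
            others = sum(gaussian(k, t_j, amps[j], sigma)
                         for j, t_j in enumerate(times) if j != i)
            residual = y - others
            amps[i] = residual[int(round(t_i))]
    return amps


# Four overlapping peaks, 12 samples apart with sigma = 6 (illustrative).
k = np.arange(80)
true_times = [20, 32, 44, 56]
true_amps = [1.0, 0.6, 1.4, 0.9]
trace = sum(gaussian(k, t, a, 6.0) for t, a in zip(true_times, true_amps))
estimated = iterative_amplitudes(trace, true_times, 6.0)
```

In this noise-free example five passes are ample for the estimates to settle, which is in line with the number of passes quoted above, although the real procedure also updates peak times and handles width separately.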

As a check on the peak extraction and measurement process, DNA sequencing data

obtained from peaks well isolated from their same lane neighbours was compared with

that incorporating all peaks. This comparison involved the use of scatter plots and

covariance measurements.

The results for all peaks were checked to see if they lay within limits imposed

by the estimation error of the isolated peak measurements. Statistically significant

differences were not seen. The use of isolated peaks and non-isolated peaks together

(typically 350 pairs for a given base separation) allows examination of finer covariance

features than those that could be obtained using only the few (typically only 20-30

pairs were available for a given base separation) available isolated peak measurements.

Covariance Estimation

Once the basic peak parameters have been extracted, a covariance estimate may

be formed as a measure of their average dependence on their neighbouring peaks.

The covariance estimate is

$$\hat{C}_x(l) = \frac{1}{N_l} \sum_{\{i \,:\, i,\, i+l \,\in\, \{1,\ldots,N\}\}} (\hat{x}_i - \hat{\mu}_{x_i})(\hat{x}_{i+l} - \hat{\mu}_{x_{i+l}}) \qquad (3.4)$$

where $\{i : i, i+l \in \{1, \ldots, N\}\}$ defines the set of all pairs of bases $l$ bases apart,

$N_l = |\{i : i, i+l \in \{1, \ldots, N\}\}|$ is the size (cardinality) of that set, $N$ is the total

number of bases, $\hat{x}_i$ is the estimated parameter value at position $i$, and $\hat{\mu}_{x_i}$ is the

estimated mean value of the parameter at position $i$. Equation 3.4 says take the

average of the products of all pairs of deviates from the estimated mean that are $l$ bases apart ($l$, as

is common in the signal processing literature, will be referred to as the 'lag'). It does

not allow for non-stationarity in the data.
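
The stationary estimator described above can be sketched in a few lines. This is an illustrative version only: the function name `lag_covariance` and the synthetic test data are assumptions, and the means are supplied by the caller (for instance from a trend fit).

```python
import numpy as np

def lag_covariance(values, means, max_lag):
    """Average product of deviations from the mean for all pairs `lag` bases apart."""
    dev = np.asarray(values, dtype=float) - np.asarray(means, dtype=float)
    n = dev.size
    cov = np.empty(max_lag + 1)
    for lag in range(max_lag + 1):
        # Pairs (i, i + lag) that both fall inside the record; there are n - lag of them.
        cov[lag] = np.mean(dev[: n - lag] * dev[lag:])
    return cov

# Synthetic check (assumed data): white noise of variance 4 should give
# roughly 4 at lag zero and roughly 0 elsewhere.
rng = np.random.default_rng(1)
x = rng.normal(0.0, 2.0, size=5000)
cov = lag_covariance(x, np.zeros_like(x), max_lag=5)
```

Because every lag is averaged over the whole record, any change of the covariance with base number is averaged out, which is the stationarity assumption discussed next.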

If it may be that the actual covariance varies with base number then why adopt a

stationary covariance estimator? Covariance estimates with estimate standard

deviations of 10% of the peak covariance require on the order of a hundred terms in the

summation (see [30] for information on covariance estimate quality). Then to estimate


the non-stationary covariance to this accuracy, a hundred or so electrophoresis runs

would be required. These runs would differ with respect to gels and contaminants.

The covariance would then be a measure of run to run variation as well as variation

within a run. However, the interest of the sequencing algorithm designer is what is

predictable in a run. Therefore, covariance within a run is likely to be a more useful

measurement as it ignores run to run variation.

However, there is a non-stationary component to DNA sequencing data; peak

parameters vary with base number in a manner that leads to an increase in sequence

error rate. Our preliminary investigations using parameter covariance estimates formed

from short contiguous sections of data suggested that the general magnitude of the

covariance tends to increase with base number; however, the structure of the

covariance did not vary significantly with base number. Therefore, in the results, we focus

on the variation of the variance with respect to base number, which will affect the

general scaling of the covariance.

3.3.2 Results

In this section, the peak time, amplitude and pulse width measurements are

presented, their trends examined and their covariances calculated. The data is from the

gel electrophoresis of exon 3 of the gene coding the α-A-crystalline protein of the eye.

Note that similar results have been obtained for exon 1 which is about 2000 bases

away on the chromosome.

Correlation Between Lanes

The main thrust is the study of covariance estimates formed from data merged

across lanes. Figure 3.8 provides justification for this approach. To create Figure 3.8,

"G"-labelled product was applied to six adjacent gel lanes. The resulting time series

featured peaks at similar positions with some mis-alignment due to gel spatial

inhomogeneities. The peak locations were extracted. After scaling and shifting this data

so that the end peaks occurred at identical positions in all lanes, and then removing

large scale trends by least-squares fitting of a cubic to each lane's data and

subtracting off the trend, the data shown in Figure 3.8 was produced. All six time-series

are plotted but the correlation is so high that they are difficult to distinguish. The

measured correlation coefficient between any pair of lanes is not less than 0.94

(79 data points per lane were used in the measurement). Therefore, lane to lane gel

variation must account for less than 12% of the jitter variance. Also, it is clear that,

after large scale trend compensation, the lanes are highly synchronized. Thus, we can

be confident that merging lane data will introduce effects that are relatively small

and localized in time.

Figure 3.8: Peak time jitter for "G" labelled product applied to six contiguous lanes of the gel (total of 79 "G" peaks present over the range of 350 bases in original sequence). Six overlapping curves are plotted corresponding to the six gel lanes.
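
The lane comparison can be sketched as below. This is a hedged illustration on synthetic data, not the measured lanes: the helper `detrend_cubic`, the trend coefficients and the noise levels are all assumptions chosen only to mimic six lanes sharing a common jitter.

```python
import numpy as np

def detrend_cubic(base_index, peak_times):
    """Remove the large scale trend with a least-squares cubic fit."""
    x = base_index / base_index.max()      # normalise for a well-conditioned fit
    coeffs = np.polyfit(x, peak_times, 3)
    return peak_times - np.polyval(coeffs, x)

rng = np.random.default_rng(2)
# 79 peak positions spread over 350 bases, as in the "G" lane example.
base_index = np.sort(rng.choice(np.arange(1, 351), size=79, replace=False)).astype(float)
shared = rng.normal(0.0, 3.0, size=79)     # jitter common to every lane (assumed)
lanes = [
    20.0 + 12.0 * base_index + 1e-3 * base_index**2
    + shared + rng.normal(0.0, 0.5, size=79)   # small lane-specific part (assumed)
    for _ in range(6)
]
residuals = np.array([detrend_cubic(base_index, lane) for lane in lanes])
corr = np.corrcoef(residuals)              # 6 x 6 matrix of lane-to-lane correlations
min_corr = corr[np.triu_indices(6, k=1)].min()
```

For a correlation coefficient of 0.94, the fraction of variance not shared between two lanes is 1 - 0.94^2, or about 0.12, which is the bound quoted in the text.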

Basic Measurements and Variances

Peak time, amplitude and pulse width were measured as described in Section 3.3.1.

Figure 3.9 presents the measured peak time jitter (difference between measured

peak time and that expected based on large scale trends) which shall be denoted as

$\phi_i$. Some evidence of correlation is seen as adjacent bases tend to have similar jitter

values. Note the increase in scatter with respect to increasing base number. A linear

fit to standard deviation estimates formed using contiguous 50-bin sections of the

jitter in Figure 3.9 yields the standard deviation of the scatter as $\sigma_{\phi_i} = 1.75 + 0.0143\,i$, where $i$ is the base number.

Figure 3.9: Peak time jitter.

Figure 3.10 presents the local amplitude estimates. As already discussed, large

scale trends have been estimated and used to normalize the peaks to near unit

amplitude. However, as is evident in Figure 3.10, there remains a residual trend as

evidenced in the general decrease in local amplitude estimates with increasing base

number. About this trend, the scatter appears to have a consistent range, indepen-

dent of base number. Thus, at this level of investigation, amplitude estimates are

stationary (constant variance). The standard deviation of this amplitude scatter, $\sigma_a$,

expressed as a percentage of mean peak amplitude is 23%.

Figure 3.10: Local peak amplitude estimates.

Figure 3.11 presents the pulse width estimates. Here, the trend in pulse width

appears to be linear and a least squares fit yields the pulse width as $pw = 15.08 + 0.0326\,(i - 1)$. Horizontal striations are apparent in Figure 3.11; these correspond

to quantization of pulse width estimates to the nearest sample interval. The scatter

observed in pulse width estimates (standard deviation 10% of local pulse width) is

likely to be largely due to measurement error.

Covariances

Figure 3.12 presents the covariance of the timing jitter. The main lobe of

Figure 3.12 has a significant value over approximately 15 bins indicating correlation in the

jitter extending over 15 base positions; the decaying oscillations evident beyond ±20

bins of the central peak in Figure 3.12 are artifacts of the trend removal processing.

In the inset of Figure 3.12, one side of the main lobe is presented using a logarithmic

scale. It appears to be very well approximated by a straight line, indicating an

exponential decay of the covariance.

An alternative view of timing jitter dependence is obtained through examination

of the difference between successive timing jitter values, $\Delta_i = \phi_i - \phi_{i-1}$. Taking the

difference between successive values removes the portion which is common to both

(i.e. the correlated part), leaving that which is different (i.e. the uncorrelated part).

This allows the examination and measurement of the uncorrelated part without the

artifacts seen in Figure 3.12. The covariance of this jitter difference is presented in

Figure 3.13. Of significance here are the negative peaks on either side of the main

lobe. These are indicative of the additive uncorrelated component in the original time

series. As will be seen in the next section, accurate knowledge of the uncorrelated

component from Figure 3.13, together with the correlation information in Figure 3.12,

allows us to solve for the parameters of a model which explains the observed data.

Figure 3.11: Pulse width estimates.

Figure 3.14 presents the amplitude covariance. The small values at non-zero lags

suggest that amplitude fluctuations are uncorrelated. Note that the general offset

from zero is due to large scale trends which were not completely removed prior to the

covariance calculations.

Figure 3.15 presents the pulse width covariance. Here some evidence of correlation

is seen in the first two lags. The covariance at these two lags is roughly 15% of the

zero lag value. Thus, 15% of the scatter in pulse width can be predicted from one

base to the next. However, as the scatter standard deviation is only 10% of the pulse

width, knowledge of the previous pulse width allows us to use a pulse width estimate

with error decreased by 1.5% of the pulse width. This improvement is insignificant in

terms of its potential impact on sequence error rate and, therefore, the pulse width

is treated as locally uncorrelated.

Figure 3.12: Covariance of peak time jitter. The monotonically increasing region just to the left of and including lag zero and the monotonically decreasing region to its right are referred to as the mainlobe. Inset is a logarithmic plot of the right side of the mainlobe.

Model

A model has been developed which reflects the correlation of the sequence peaks

over time and between channels. These measurements are well modelled by the system

presented in the block diagram of Figure 3.16.

The observed peak times, $t_i$, are modelled as

$$t_i = \mu_{T_i} + \phi_i \qquad (3.5)$$

where $i$ denotes base number, $\mu_{T_i}$ is the a priori mean expected value and $\phi_i$ is the

observed jitter. Practically, the peak time expected from the large scale trends would

be substituted for $\mu_{T_i}$; for the Pharmacia ALF data, $\mu_{T_i}$ is very close to a linear

function of $i$.

Figure 3.13: Covariance of difference between successive peak time jitter values.

The timing jitter process is described by the following equations:

$$\zeta_i = \beta\,\zeta_{i-1} + v_i \qquad (3.6)$$

$$\phi_i = \zeta_i + w_i \qquad (3.7)$$

The state variable $\zeta$ incorporates the correlation memory of the system through the

auto-regressive weighting, $\beta$. Here, a large $\beta$ implies the jitter is constrained to be

similar to past values. The jitter process is driven by a white, zero-mean Gaussian

source, $v_i$, of variance $\sigma_{v_i}^2$; in the systems modelling literature, this would be referred

to as an 'input disturbance'. This input disturbance reflects the freedom of the

individual DNA molecules in choosing their 'random' path through the gel. The additive,

white, zero-mean Gaussian measurement noise, $w_i$, has variance $\sigma_{w_i}^2$. This

measurement noise may indeed be due to additive time series noise pulling the observed peak

location away from its noise free location. However, it may also reflect other

phenomena such as mobility differences based on terminal sequence. Strictly speaking,

$\phi_i$ is not permitted to be a value which would place the peak before a previous peak

or after a subsequent peak; in practice, the means and standard deviations are such

that such values are unlikely to arise.

Figure 3.14: Peak amplitude covariance.
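
The jitter process just described can be sketched as a first-order auto-regressive state plus white measurement noise. The recursion form used here (state equals the weighting times the previous state plus the input disturbance, observation equals state plus measurement noise) is an assumption consistent with the description; the function name and the use of constant variances are also assumptions, with the numerical values borrowed from the estimates quoted later in this chapter.

```python
import numpy as np

def simulate_jitter(n, beta, var_v, var_w, rng):
    """Assumed form: zeta_i = beta * zeta_{i-1} + v_i, phi_i = zeta_i + w_i."""
    v = rng.normal(0.0, np.sqrt(var_v), size=n)   # input disturbance
    w = rng.normal(0.0, np.sqrt(var_w), size=n)   # measurement noise
    zeta = np.empty(n)
    state = 0.0
    for i in range(n):
        state = beta * state + v[i]               # correlation memory of the system
        zeta[i] = state
    return zeta + w

rng = np.random.default_rng(3)
phi = simulate_jitter(200_000, beta=0.85, var_v=4.79, var_w=3.71, rng=rng)
sample_var = phi.var()
# Steady-state variance of this assumed model: var_v / (1 - beta^2) + var_w.
theory_var = 4.79 / (1 - 0.85**2) + 3.71
```

A long record is used here only so that the sample variance settles near its steady-state value; a real run has a few hundred bases and slowly varying variances.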

The choice of Gaussian distributions for the measurement noise and input

disturbance reflects the histograms formed from the data. The histogram of the total

jitter (Figure 3.17) is a monomodal plot whose mainlobe may be approximated by a

Gaussian; insufficient samples are available to form a hypothesis regarding the tails

of the distribution. The histogram of the difference between successive peak time

jitter values, Figure 3.18, seems similarly Gaussian; the difference is dominated by

the measurement noise and so this directly suggests that the measurement noise is

Gaussian. Given that the total jitter and the measurement noise appear Gaussian, it

is not unreasonable to suggest that their difference is Gaussian and hence the jitter

process input, v, is Gaussian.

For this model, the theoretical time jitter covariances will now be derived.

Expressions are developed for obtaining key parameters from observed covariances. By

iterative application of Equations 3.6 and 3.7, the observation can be written as

$$\phi_i = \sum_{j=1}^{i} \beta^{\,i-j}\, v_j + w_i$$

where the base number, $i$, is greater than zero. The covariance is then

$$E[\phi_i \phi_{i+k}] = \sum_{j=1}^{i} \beta^{\,2i-2j+k}\, \sigma_{v_j}^2 + \sigma_{w_i}^2\, \delta_k, \qquad k \ge 0$$

where $E[\cdot]$ denotes the expectation operation, $\delta_k$ is one at $k = 0$ and zero otherwise, and both $v$ and $w$ are zero-mean, white

processes. Now if $v$ is a slowly non-stationary process with respect to $i$ relative to the

weight imposed by $\beta^{2i-2j}$ then $\sigma_{v_j}^2$ may be replaced by $\sigma_{v_i}^2$. Using the properties of the

geometric series, it can be shown that $\sum_{j=1}^{i} \beta^{2i-2j} = (1 - \beta^{2i})/(1 - \beta^2)$. Employing

this and the assumption that $i$ is large and $k$ small eventually yields

$$E[\phi_i \phi_{i+k}] \approx \frac{\beta^{|k|}}{1 - \beta^2}\, \sigma_{v_i}^2 + \sigma_{w_i}^2\, \delta_k \qquad (3.9)$$

Note the exponential decay with lag $k$ in this equation. Thus $\beta$ may be estimated as

the exponential of the slope of the log covariance estimate.

Figure 3.15: Pulse width covariance.

Figure 3.16: Block diagram of peak parameter system model.
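
Estimating the auto-regressive weighting from the log covariance can be sketched as a straight-line fit. This is an illustrative example only: the function name `estimate_beta` and the noise-free synthetic covariance values are assumptions, and lag zero is excluded because it also contains the measurement noise variance.

```python
import numpy as np

def estimate_beta(lags, covariance):
    """Fit a line to log(covariance) versus lag; the weighting is exp(slope)."""
    slope, _intercept = np.polyfit(lags, np.log(covariance), 1)
    return np.exp(slope)

# Synthetic main lobe (assumed values): an exponential decay with weighting 0.85.
lags = np.arange(1, 11)
cov = 17.3 * 0.85 ** lags
beta_hat = estimate_beta(lags, cov)
```

With measured covariances the fit would be restricted to the main lobe, where the estimate is well above the trend removal artifacts and estimation error.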

Figure 3.17: Histogram of scaled peak time jitter. To ensure comparability of samples, data was divided (scaled) by jitter standard deviation linear trend prior to forming histogram.

The covariance of the differences is obtained by first writing an expression for the

difference:

The covariance is then

Now assume $w$ is a slowly non-stationary process such that $\sigma_{w_{i-1}}^2 \approx \sigma_{w_i}^2$ and $v$ is

a slowly non-stationary process with respect to $i$ relative to the weight imposed by

$\beta^{2i-2j}$. Then for large $i$ and small $k$, Equation 3.10 becomes


Figure 3.18: Histogram of scaled difference between adjacent peak time jitter values. To ensure comparability of samples, data was divided (scaled) by jitter standard deviation linear trend prior to forming histogram.
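
The displayed equations for the difference covariance are not legible in this copy. The following is a hedged reconstruction, not the original typesetting: it assumes the jitter model is a first-order auto-regressive state $\zeta_i = \beta\zeta_{i-1} + v_i$ observed as $\phi_i = \zeta_i + w_i$, treats the variances as locally constant, and uses $\delta_k$ (one at $k = 0$, zero otherwise) as an introduced symbol.

```latex
% Difference of successive jitter values
\Delta_i = \phi_i - \phi_{i-1}

% Expansion in terms of the jitter covariance
E[\Delta_i \Delta_{i+k}] = E[\phi_i \phi_{i+k}] - E[\phi_i \phi_{i+k-1}]
                         - E[\phi_{i-1} \phi_{i+k}] + E[\phi_{i-1} \phi_{i+k-1}]

% Large i, small k, slowly varying variances
E[\Delta_i \Delta_{i+k}] \approx
    \frac{\sigma_{v_i}^2}{1 - \beta^2}
    \left( 2\beta^{|k|} - \beta^{|k-1|} - \beta^{|k+1|} \right)
    + \sigma_{w_i}^2 \left( 2\delta_k - \delta_{k-1} - \delta_{k+1} \right)

% Special cases used when solving for the variances
k = 0:\quad E[\Delta_i \Delta_i] \approx \frac{2\,\sigma_{v_i}^2}{1 + \beta} + 2\,\sigma_{w_i}^2
k = 1:\quad E[\Delta_i \Delta_{i+1}] \approx -\frac{(1 - \beta)\,\sigma_{v_i}^2}{1 + \beta} - \sigma_{w_i}^2
```

The negative lag-one value is dominated by the measurement noise term, which is why the negative side peaks of the difference covariance expose the uncorrelated component.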

To obtain the model parameters from measured covariances, first estimate $\beta$ as

the exponential of the slope of the log covariance estimate. Using the lag zero value

of the covariance, Equation 3.9, and the lag one value of the difference covariance,

Equation 3.11, the following system of simultaneous equations may be written:

These equations may be summed to yield

Combining terms, recognizing $E[\phi_i \phi_i]$ as $\sigma_{\phi_i}^2$ and introducing $k_\Delta = E[\Delta_i \Delta_{i+1}] / E[\phi_i \phi_i]$

(i.e. the ratio of the lag 1 negative peak of the covariance of the jitter differences

(Figure 3.13) to the peak time jitter variance (lag 0, Figure 3.12)) yields the jitter

process variance as

Then, this expression for $\sigma_{v_i}^2$ may be substituted into Equation 3.12 and the result

solved to yield the measurement noise variance as

where the jitter standard deviation is

$$\sigma_{\phi_i} = a_\phi + b_\phi\, i$$

where $a_\phi$ and $b_\phi$ are the coefficients obtained from a linear fit to the standard deviation

estimates.

It is also possible to refine the $\beta$ estimate using the recovered variances. A more

sophisticated technique for estimating $\beta$ and the variances would exploit all the

information available in the covariances.
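
The two-equation solution described above can be sketched numerically. The closed forms below are a reconstruction under the assumption of a first-order auto-regressive state plus white measurement noise with locally constant variances; they are not copied from the original equations, and the function names and test values are hypothetical.

```python
def forward(var_v, var_w, beta):
    """Model values of the lag-0 jitter variance and lag-1 difference covariance."""
    var_phi = var_v / (1.0 - beta**2) + var_w
    diff_lag1 = -var_v * (1.0 - beta) / (1.0 + beta) - var_w
    return var_phi, diff_lag1

def jitter_variances(var_phi, k_delta, beta):
    """Invert the two relations: k_delta is diff_lag1 / var_phi (a negative ratio)."""
    # Summing the two relations eliminates var_w and leaves var_v alone.
    var_v = var_phi * (1.0 + k_delta) * (1.0 - beta**2) / (beta * (2.0 - beta))
    # Substituting back into the lag-0 relation gives the measurement noise variance.
    var_w = var_phi - var_v / (1.0 - beta**2)
    return var_v, var_w

# Round trip on arbitrary (assumed) values to confirm the inversion is consistent.
var_phi, diff_lag1 = forward(5.0, 3.0, 0.8)
var_v, var_w = jitter_variances(var_phi, diff_lag1 / var_phi, 0.8)
```

The inversion is only as good as the weighting estimate fed into it, which is why the text suggests refining that estimate once the variances are known.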

The amplitude and pulse width modelling is simple, as the covariances are assumed

to be zero for all lags other than zero. The amplitude is a truncated Gaussian random

variable of unit mean and variance $\sigma_a^2$; the truncation restricts the amplitude to

positive values only with the probability density function rescaled for unit area. The

pulse width is a Gaussian random variable with mean

where $a_{pw}$ and $b_{pw}$ are the coefficients obtained from a linear fit to the pulse width

estimates. The pulse width has a constant variance, $\sigma_{pw}^2$.
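
Drawing amplitudes and pulse widths from these two distributions can be sketched as follows. This is a minimal illustration: the function names are hypothetical, the amplitude scatter of 0.23 and the linear width fit are taken from the measurements reported earlier, and the constant pulse width standard deviation of 1.5 samples is an assumed value.

```python
import numpy as np

def sample_amplitudes(n, sigma_a, rng):
    """Gaussian centred on one, truncated to positive values by rejection sampling."""
    out = rng.normal(1.0, sigma_a, size=n)
    bad = out <= 0.0
    while bad.any():
        # Redraw only the rejected samples; this rescales the density to unit area.
        out[bad] = rng.normal(1.0, sigma_a, size=int(bad.sum()))
        bad = out <= 0.0
    return out

def sample_pulse_widths(base_index, a_pw, b_pw, sigma_pw, rng):
    """Gaussian pulse width with a linear mean trend and constant variance."""
    mean = a_pw + b_pw * (base_index - 1)
    return rng.normal(mean, sigma_pw)

rng = np.random.default_rng(4)
base_index = np.arange(1, 351)
amps = sample_amplitudes(base_index.size, sigma_a=0.23, rng=rng)
widths = sample_pulse_widths(base_index, a_pw=15.08, b_pw=0.0326, sigma_pw=1.5, rng=rng)
```

With a scatter of 0.23 about a mean of one, rejections are rare, so the truncation changes the distribution very little in practice.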

While the amplitude and pulse width components of the model are simple and

direct reflections of the measurements, the peak jitter portion of the model is more

complicated. The validity of this part of the model will now be examined by

estimating the model parameters and then performing a simple numerical check and a

graphical comparison of the measured and theoretical covariances.

From the inset of Figure 3.12, $\beta$ may be estimated at 0.85. Figure 3.12 yields the

average $E[\phi_i \phi_i] = \sigma_{\phi_i}^2$ as 21 and Figure 3.13 yields the average $E[\Delta_i \Delta_{i+1}]$ as -3.1; the

term "average" is used here as the covariance estimates average over base number,

$i$. Applying Equations 3.8 and 3.9 yields the average $\sigma_{v_i}^2$ as 4.79 and $\sigma_{w_i}^2$ as 3.71.

As a check, these estimated variances are substituted into Equation 3.11 which is

then evaluated at lag 0 to yield the average $E[\Delta_i \Delta_i] = 12.6$. This agrees with the

measured value from Figure 3.13 of 12.9 within the expected measurement error.
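
The lag-zero check can be reproduced with a few lines of arithmetic. The two expressions are assumptions: they are the steady-state lag-zero values of a first-order auto-regressive state plus white measurement noise, which is how the jitter model is described in the text.

```python
beta, var_v, var_w = 0.85, 4.79, 3.71

# Lag-0 jitter variance: correlated part plus measurement noise.
var_phi = var_v / (1 - beta**2) + var_w

# Lag-0 variance of the difference between successive jitter values.
diff_lag0 = 2 * var_v / (1 + beta) + 2 * var_w

print(round(var_phi, 1), round(diff_lag0, 1))  # prints: 21.0 12.6
```

Both values match the figures quoted above: a jitter variance of about 21 and a difference variance of 12.6.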

Figures 3.19 and 3.20 present the theoretical covariances of the model timing jitter

for comparison with Figures 3.12 and 3.13. Such comparison may only be done to a

confidence limit imposed by errors introduced by artifacts and estimation error. In

Figure 3.12, the trend removal artifact may be recognized by its 51 bin period. The

shorter period variations are due to estimation error; these would be smaller if more

data points were available for covariance estimation. So, to compare Figure 3.12

with Figure 3.19, visually subtract off the trend removal artifact and then impose

confidence limits equal to the extremes of the short period variations. The data of

Figure 3.19 then lies within these limits, particularly in the mainlobe region. Similar

agreement is seen in the jitter difference covariances of Figure 3.20 after allowing for

the estimation error exemplified by the data for lags 10-100.

3.3.3 Discussion

The covariances observed are understandable in light of the differences in the

processes underlying peak times, amplitudes and widths.

The local peak times are the result of large polymers, identical except for the last

few bases, moving in a similar fashion through a gel. The high correlation seen in

the multilane data (Figure 3.8) indicates that it is not necessary that the molecules

follow the same paths through the gel to obtain correlation in peak times. In fact,

this suggests that the gel itself is not the limiting factor in the correlation. Rather,

it is the similarity between the DNA molecules that determines the correlation.

Further, from Figure 3.12, the correlation decreases exponentially with a decay

length of about 5 bases and becomes insignificant above about 15 bases. The recent

work of Tinland et al. [32] indicates that the persistence length of ssDNA is about

p=2-5 nm, or about p=5-12 bases (at 0.43 nm/base). The correlation in the peak

times and the persistence length of the molecule may be related. The persistence

length of ssDNA is a measure of how far apart two points on a polymer need be for their

spatial orientation to be uncorrelated [31]. In other words, p reflects the stiffness of

the polymer chain. Most models of DNA gel electrophoresis predict that the mobility

of the analyte is related to its mean elongation in the field direction. It is clear that

for molecules of less than a persistence length difference in size (contour length), the

mean elongations will be strongly correlated. For instance, if the base composition

of a ssDNA molecule with M monomers makes the elongation slightly smaller (or

larger) than expected (as estimated for a generic chain of M monomers), then the

elongation of a M+1 monomer chain will also be smaller (larger) than expected.

Such correlations will extend over roughly one persistence length, and would thus

affect the expected mobilities accordingly. These effects have yet to be included in

the gel electrophoresis theories as theorists are more concerned with average trends

rather than specific cases.

Figure 3.19: Theoretical covariance of peak time jitter for system of Figure 3.16. Inset is a logarithmic plot of the right side of the mainlobe.

Figure 3.20: Theoretical covariance of difference between successive peak time jitter values for system of Figure 3.16.

The amplitude of a particular peak is determined by how many molecules

incorporate a terminator (ddNTP) at that base position. For our primer labelled molecules,

the terminator differs from the normal nucleotide (dNTP) only in that it has a

hydrogen rather than a hydroxyl group on the 3' carbon. For such a small difference,

well removed from the location participating in the last condensation reaction, one

would expect little correlation in amplitudes as simple random chance determines

incorporation of a nucleotide or terminator. On the other hand, some dependencies on

terminal sequence are seen such as a rise in amplitude in a run of C's. However, these

runs are infrequent in our data and thus do not evidence a significant effect in the

measured covariance. Note that the consumption of primer-labelled substrate leads

to large scale decay in amplitude; however, for the ddNTP and dNTP concentrations

used, the effect of this decay on a base by base basis would be of the order of one

percent or less and thus does not impact significantly on the observed covariances.


The pulse width is determined by the distribution of like molecules in an

electrophoresis band. This distribution is in turn governed by thermodynamics via

diffusion [9]. Presuming the loading is too low for significant DNA-DNA interactions,

but nonetheless such that a very large number (x 10") of molecules participate in

each band, stable bands are expected whose width would follow a simple large scale

trend with respect to base number. Local fluctuations are likely to be insignificant

due to the large number of molecules involved. Given the elaborate scheme to extract

the pulse width estimates, it is likely that the fluctuations observed in the pulse width

estimates are largely due to estimation error.

3.4 Noise Process Model

Our attention is now directed at the last term in Equation 3.1, the additive noise,

$n_k$. Physical phenomena, such as integrated sensor shot-noise and pre-amplifier

thermal noise, give rise to an uncorrelated Gaussian component. As implied in Section

2.2.6, this comprises only a small fraction of the total noise faced by the sequencing

algorithm.

Chemical phenomena in fact dominate the additive noise. This chemical noise

is created by any molecule that: (1) is not part of the true sequence, (2) passes by

the detectors, and, (3) fluoresces. These molecules arise largely out of the chemical

processes involved in DNA sequencing; other chemical contaminants added at loading

time and in the gel also contribute.

Some noise molecules are part of the sequencing product placed in the loading

well. The discussion of fidelity in Section 2.1.4 presented a number of mechanisms

for the creation of anomalous peaks (i.e., noise molecules). These mechanisms lead to

labelled ssDNA that is not part of the true sequence. As these molecules are labelled

ssDNA and do undergo electrophoresis, the resulting noise peaks should have the

same shape as the correct peaks. As they are an integral number of bases long, the

peaks should tend to be more likely at the same points true peaks would be likely

to occur. Thus, the noise would tend to be cyclostationary with period equal to the

mean inter-base period.


Hydrolysis (Section 2.1.5) will also contribute to this noise. Hydrolysis prior to

loading can lead to labelled products that are an integral number of bases long and

so contribute to the noise as discussed in the previous paragraph. Other labelled

hydrolysis products (see Figure 2-6) can have a length that is not an integral number

of bases. If present in the product at loading, they will have the same peak shape as

the correct peaks as peak shape is driven by diffusion. However, their peaks will be

likely to occur anywhere in the time-series rather than just in the regions where the

true peaks are likely to occur. These peaks would contribute uniformly to the overall

noise level.

Hydrolysis products produced during electrophoresis at a specific time and at a

specific location in the gel will have a peak width (and hence spectrum) determined

by the time of migration to the detectors. Generally these products may be produced

anywhere in the gel and at any time. The actual noise spectrum for this type of

noise is then an integral over space and time of creation of these hydrolysis products.

Relative to hydrolysis prior to loading, narrower peaks are expected (i.e. more energy

at higher frequencies). This is due to hydrolysis occurring near the detector.

This theoretical noise model, derived from the underlying chemistry and physics,

has two features which are easy to estimate and two features which are difficult to

estimate. The tractable features are the additive white noise and the diffusion driven

noise peak shape. The difficult features to estimate are the cyclostationarity and

temporally-spatially integrated hydrolysis noise. The degree to which the last

two noise features are present depends on the actual sequencing process. In the absence of

DNA, these features are absent, except for possibly fluorescent contaminants. Thus,

noise data must be extracted from DNA sequencing data regions where true peaks

are absent. These are of limited length and so it is difficult to get the degrees of

freedom necessary for a good measurement.

Therefore, a less accurate but more practical noise model is proposed. It consists

of a white noise component and a coloured noise component. The coloured noise

component is obtained by driving a white source through a filter whose impulse

response is the correct peak pulse shape. This essentially averages over the period

of the cyclostationary component of the theoretical model. The temporally-spatially

integrated hydrolysis noise may be somewhat accounted for by adjusting the relative

levels of the white noise and coloured noise components.

Combining the two noise components together yields the noise spectrum as

$$N(\omega, t) = N_0(t) + K_c\, |G(\omega, t)|^2 \qquad (3.18)$$

where $G(\omega, t) = F(g_{k|t}) = \sum_k g_{k|t} \exp(-j\omega k/K)$ is the discrete Fourier transform of

the pulse shape with $K$ the total number of samples, $N_0$ is the white noise spectral

level and $K_c$ sets the coloured noise spectral level. The levels are scaled so that the white noise variance is $\sigma_{n_w}^2(t) = (1/2\pi)\int_{-\pi B}^{\pi B} N_0(t)\, d\omega$ and the coloured noise variance is

$\sigma_{n_c}^2(t) = (1/2\pi)\int_{-\pi B}^{\pi B} K_c\, |F(g_{k|t})|^2\, d\omega$. Here $B$ is the single sided sampling bandwidth,

$B = (1/2) f_s$, where $f_s$ is the sampling frequency. Note that as the pulse width is

dependent on peak time, the noise is non-stationary and $N(\omega, t)$ refers to the noise

spectra at time $t$. The noise is also assumed to be Gaussian.
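
The practical noise model (a white component plus white noise shaped by the pulse) can be sketched as below. This is an illustration under stated assumptions: a unit-energy Gaussian pulse stands in for the actual peak shape function, the names `gaussian_pulse`, `noise_spectrum`, `simulate_noise`, `n0` and `k_c` are hypothetical, and the levels are arbitrary.

```python
import numpy as np

def gaussian_pulse(width_samples):
    """Hypothetical stand-in pulse with unit energy (not the thesis peak function)."""
    sigma = width_samples / 2.355            # treat the width as full width at half maximum
    k = np.arange(-int(4 * sigma), int(4 * sigma) + 1)
    g = np.exp(-0.5 * (k / sigma) ** 2)
    return g / np.sqrt(np.sum(g ** 2))

def noise_spectrum(pulse, n_fft, n0, k_c):
    """White level plus a coloured level shaped by the pulse's squared spectrum."""
    return n0 + k_c * np.abs(np.fft.fft(pulse, n_fft)) ** 2

def simulate_noise(n, pulse, n0, k_c, rng):
    """White noise plus an independent white source driven through the pulse filter."""
    white = rng.normal(0.0, np.sqrt(n0), size=n)
    drive = rng.normal(0.0, np.sqrt(k_c), size=n)
    return white + np.convolve(drive, pulse, mode="same")

rng = np.random.default_rng(6)
pulse = gaussian_pulse(15.0)
spectrum = noise_spectrum(pulse, n_fft=1024, n0=0.01, k_c=0.04)
noise = simulate_noise(200_000, pulse, n0=0.01, k_c=0.04, rng=rng)
expected_var = 0.01 + 0.04 * np.sum(pulse ** 2)   # white part plus coloured part
```

Because the pulse width grows with peak time, a full simulation would regenerate the pulse (and hence the coloured component) as the position in the run advances; a single fixed width is used here for brevity.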

Figure 3.21 presents a noise spectrum estimate obtained from a 23 base section

(bases 297-319) of sequencing data that was free of true peaks. The spectral estimate

was formed using a single Fourier transform of the data after weighting by a Kaiser

window with Kaiser $\beta$ parameter set to 6 [34]. Thus, the rapid fluctuations along

adjacent frequency cells are due to the lack of averaging. However, trends in the

spectra are a true reflection of the data as the sidelobes imposed by the Kaiser window

are extremely low, in excess of 100 dB down over much of the spectrum. The large energy

in the low frequency region is consistent with the expected coloured noise component.

At higher frequencies, there is evidence of the white noise component as the spectral

slope is reduced.

3.5 Simulated Data from Model

Equations 3.1-3.3 and 3.5-3.18 in combination with Gaussian random number

generators and a predetermined sequence may be used to generate simulated DNA

time-series. Figures 3.22 and 3.23 present the results of a simulation where the base

sequence and parameters were identical to that of Figures 3.2 and 3.3. Thus, direct

comparison is possible. This comparison should be for "similar character" rather than

identical waveforms as ideally the two data sets represent different realizations of the

same process.

Figure 3.21: Noise spectrum estimate for "A" channel bases 297-319. (Horizontal axis: frequency in radians/sample.)

Figure 3.22 demonstrates the same rise in level for later bases as is seen in

Figure 3.2. This effect is driven by the increasing pulse width. The noise fluctuations

also appear to be similar. Figures 3.23 and 3.3 feature similar pulse shapes and their

peak level fluctuations are comparable in size. The tails of the peaks also seem

similar though the noise background fluctuations seem to mask the differences. One

discrepancy is in the run of C's where in Figure 3.3 a rise in amplitude with base

number occurs while it does not in Figure 3.23 as this sequence dependent effect is

not modelled.

Visual examination for the effects of the timing jitter is more difficult. On this

scale, it is hard for the eye to assess the jitter by looking for extrema locations. How-

ever, the amplitudes of adjacent peaks provide a cue as peaks closer together will have

a stronger sum while those further apart will affect each other less. Unfortunately,

discrimination of this effect from normal amplitude fluctuations requires examination

and measurement of many peaks.

Figure 3.22: Simulated compensated time series for comparison with real data of Figure 3.2. Individual channel data has been offset in this figure for clarity. Top curve is for A channel with C, G and T channels presented in order from top.

Simulation may be used to investigate the impact of modifications to the

sequencing process. The physical and chemical phenomena involved may be interpreted to

lead to certain parameter changes, and these revised parameter values may be used

in the simulation to assess the impact on algorithm performance. Adjustments to

the algorithm may then be tried and evaluated to identify an appropriate method for

ameliorating any deleterious effects of these modifications.

For example, reducing the percentage of cross-linking in the gel would reduce the

flow resistance and hence the time required for the DNA to move through the gel.

However, the diffusion coefficient would increase faster than mobility [9] so the band

width in the gel would increase. Therefore, in addition to being closer together, the

peaks in the DNA time-series would be relatively wider and interference between peaks

would be a bigger problem. The model may be used to generate controlled simulated

data with various peak separations and widths and then the sequencer settings may be

optimized for each separation and width. It would be difficult and time consuming to

generate data sets experimentally with predetermined peak separations and widths.

Model based simulation facilitates controlled investigation of process and algorithm

features, and may be used to compare alternative sequencing algorithms.

Figure 3.23: High resolution view of a segment of the simulated compensated time series (compare with Figure 3.3).

3.6 Significance and Novelty

This chapter presents the first statistical model of the DNA time-series. The

chemistry and physics of DNA sequencing have been translated into a form where engineers

and mathematicians can directly contribute to the development of new sequencing

algorithms. The model also forms the foundation for further model development based

on extensions that incorporate additional attributes of the data. Finally, the model

may be used to generate simulations for the comparison of sequencing algorithms. It

provides the basis for a standard for the evaluation of DNA sequencing algorithms.

CHAPTER 4

Maximum Likelihood Sequence Detection

4.1 The Maximum Likelihood Concept

The optimum Maximum Likelihood (ML) processor selects the sequence, $\hat{\underline{x}}$, that

maximizes the probability of the observation, $\underline{y}$, given a signal model as in

$$\hat{\underline{x}} = \arg\max_{\tilde{\underline{x}}}\; p(\underline{y} \mid \tilde{\underline{x}})$$

where the hat is used to indicate the best estimate, the tilde indicates test values,

"arg max" returns the test value that maximizes the expression on its right and $p(\underline{y} \mid \tilde{\underline{x}})$ is the conditional probability density function (pdf) of the observation. Essentially, for

each hypothesized sequence, it generates the expected signal waveform and compares

it with the observed waveform. It must search over all possible hypotheses, evaluating

the probability of the observation for each hypothesis, in order to find the best.

The ML Sequence Detector (MLSD) is universal; it is appropriate for any signal

and noise model. Other popular processors such as the linear equalizer and decision

feedback equalizer are structured around specific signal features. In particular, these


equalizers assume fixed symbol times and are oriented towards minimizing Inter-Symbol Interference (ISI) and noise at these decision points. Due to the high peak time jitter of DNA time-series, these equalizers are ill-suited for DNA sequencing. The Maximum A Posteriori (MAP) processor is also universal. It brings a priori symbol probabilities into the decision process. For equally probable sequences, the MAP sequence detector¹ reduces to MLSD. This being the general case, it is appropriate to select MLSD for the DNA sequencing problem. The resulting processor will be referred to as the DNA-ML algorithm.

4.2 Additive White Noise Finite Response

To provide a context in which the extensions required for DNA-ML are evident, a simple MLSD example will now be examined. This Additive White Gaussian Noise (AWGN) Finite Impulse Response (FIR) example presumes the received signal is corrupted by the addition of white noise. The signal has also been stretched and distorted by the channel medium so that previous symbols interfere with the current symbol. Accordingly, the k-th sample of the observation is given as

$$y_k = \sum_{j=0}^{N_h - 1} h_j x_{k-j} + n_k \qquad (4.2)$$

where $n_k$ is the white noise, $x_k$ is the information symbol and $h$ describes the channel impulse response.

The noise is a zero-mean Gaussian random process. The channel impulse response is presumed fixed and known. Then the probability density function (pdf) for the received sample is just that of the noise shifted by the distorted signal. More formally,

¹ Note that the MAP symbol detector does not reduce to MLSD, as in some cases the most probable symbol at a location is not necessarily that which yields the most likely symbols for its neighbouring locations.


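the conditional pdf is given by

$$p(y_k \mid \underline{x}) = p_n\Big(y_k - \sum_{j=0}^{N_h - 1} h_j x_{k-j}\Big) \qquad (4.3)$$

where $\underline{x} = \{x_1, x_2, \ldots\}$ is the symbol sequence² and $p_n(n)$ is the noise pdf. As the noise is white and Gaussian, the noise samples are independent and thus their joint pdf is the product of their individual pdfs. This then allows writing the joint conditional pdf for the entire observation as

$$p(\underline{y} \mid \underline{x}) = \prod_{k=1}^{N_k} p_n\Big(y_k - \sum_{j=0}^{N_h - 1} h_j x_{k-j}\Big) \qquad (4.4)$$

where $N_k$ is the total number of samples. The MLSD processor must choose the sequence $\tilde{\underline{x}}$ that maximizes Equation 4.4 for the observed $\underline{y}$.

As the logarithm function is monotonic and increasing, maximizing Equation 4.4 is equivalent to minimizing the negative logarithm of Equation 4.4. The maximum likelihood sequence is then

$$\hat{\underline{x}} = \arg\min_{\tilde{\underline{x}}} \big(-\log p(\underline{y} \mid \tilde{\underline{x}})\big) = \arg\min_{\tilde{\underline{x}}} \sum_{k=1}^{N_k} -\log p_n\Big(y_k - \sum_{j=0}^{N_h - 1} h_j \tilde{x}_{k-j}\Big) \qquad (4.5)$$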

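where the hat is used to indicate the best estimate, the tilde indicates test values and "arg min" returns the test value that minimizes the expression on its right. The negative logarithm of a likelihood function is often referred to as the 'cost'. The Gaussian noise pdf is $p_n(n) = (1/\sqrt{2\pi\sigma^2}) \exp(-n^2/(2\sigma^2))$. Its logarithm is then $\log(1/\sqrt{2\pi\sigma^2}) - n^2/(2\sigma^2)$. Substituting this into Equation 4.5 and removing the additive constants and the positive scale factor $1/(2\sigma^2)$, neither of which affects the minimization, results in

$$\hat{\underline{x}} = \arg\min_{\tilde{\underline{x}}} \sum_{k=1}^{N_k} \Big(y_k - \sum_{j=0}^{N_h - 1} h_j \tilde{x}_{k-j}\Big)^2 \qquad (4.6)$$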

Thus, for this case, the maximum likelihood sequence estimate is found by finding the hypothesized sequence that minimizes the sum of the squared differences between the observation and the hypothesized received signal.

² $x_i$, $i < 1$, is presumed to be zero.
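The least-squares search described above can be sketched directly as an exhaustive test of every hypothesized sequence. This is a minimal illustration only: the channel taps `h`, the binary alphabet, the noise level and the function names are assumptions chosen for the example and are not taken from the thesis.

```python
import itertools
import numpy as np

def received_signal(symbols, h):
    """Noise-free channel output y_k = sum_j h[j] * x[k-j], with x_i = 0 before the start."""
    return np.convolve(symbols, h)[:len(symbols)]

def brute_force_mlsd(y, h, alphabet):
    """Test every hypothesis and keep the one with the smallest sum of squared differences."""
    best_seq, best_cost = None, np.inf
    for candidate in itertools.product(alphabet, repeat=len(y)):
        cost = float(np.sum((y - received_signal(candidate, h)) ** 2))
        if cost < best_cost:
            best_seq, best_cost = candidate, cost
    return best_seq, best_cost

h = np.array([1.0, 0.5, 0.2])      # hypothetical channel impulse response
alphabet = (-1.0, 1.0)             # hypothetical binary symbol alphabet
rng = np.random.default_rng(0)
x_true = rng.choice(alphabet, size=8)
y = received_signal(x_true, h) + 0.1 * rng.standard_normal(8)

x_hat, cost = brute_force_mlsd(y, h, alphabet)
```

The loop visits every one of the `len(alphabet) ** len(y)` hypotheses, which is only feasible for very short sequences.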

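Further, the structure of Equation 4.6, and in particular the finite impulse length, $N_h$, leads to an efficient method for finding $\hat{\underline{x}}$ [8]. Writing the summation over the samples as an iterative cost, $C_k$, yields

$$C_k = C_{k-1} + \Big(y_k - \sum_{j=0}^{N_h - 1} h_j \tilde{x}_{k-j}\Big)^2 \qquad (4.7)$$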

Therefore for a particular hypothesis, calculating the current cost requires the previous cost and the last $N_h$ symbols in the hypothesis. These last $N_h$ symbols are referred to as the current state. If there are $N_s$ possible symbol values then there are $N_s^{N_h}$ possible values of the current state. A special rectangular grid of nodes, known as a trellis, may be created where the x-axis of the grid is the sample time, $k$, and the y-axis is the state, $\{x_{k+1-N_h}, \ldots, x_k\}$. A particular hypothesized sequence will now correspond to a path from left to right connecting nodes of the trellis. MLSD then corresponds to choosing the path with the least cost. Note that the possible valid connections between nodes are limited by the definition of the state; two nodes that both have $x_j$ in their state must have the same value for $x_j$ if a valid connection between them is to be possible.

With this graphical representation, the minimization can be seen as a dynamic programming problem. Consider a node that is on the path of minimum cost. The path of minimum cost is the union of the minimum cost path from the start to this node with the minimum cost path from this node to the end of the observed data. Any other combination would have higher cost. Then the minimum cost path from the start to a node at time $k$ must include the minimum cost path from the start to the node at time $k - 1$ that is on the minimum cost path to the node at time $k$. Thus, if at every time iteration, only the cost and the path corresponding to the minimum cost to each of the nodes is retained then this set of paths will include the minimum cost path. With the iterative structure of the cost as defined in Equation 4.7, the extension of the path to a current node may be done by selecting the previous node whose cost from the start plus the incremental cost from the previous to the current


node is minimum. The new path is then the union of the best path from start to selected previous node with the segment from selected previous node to the current node. This is done for all nodes. At the last iteration, the node with lowest cost identifies the minimum cost path through the trellis. This trellis based dynamic programming algorithm is referred to as the Viterbi algorithm in the communications literature.
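The survivor-path recursion just described can be sketched as follows for the AWGN-FIR example. The state is taken to be the last $N_h$ hypothesized symbols, as in the text; the dictionary-based bookkeeping, the function names, the channel taps and the alphabet are illustrative assumptions rather than the thesis implementation. A short exhaustive search is included only as a cross-check.

```python
import itertools
import numpy as np

def viterbi_mlsd(y, h, alphabet):
    """Minimum squared-error sequence for y_k = sum_j h[j] x[k-j] + n_k via the trellis."""
    n_h = len(h)
    # state -> (cost so far, path so far); symbols before the start are presumed zero
    survivors = {(0.0,) * n_h: (0.0, [])}
    for y_k in y:
        extended = {}
        for state, (cost, path) in survivors.items():
            for a in alphabet:
                new_state = state[1:] + (a,)          # valid connection: shared symbols agree
                expected = sum(h[j] * new_state[-1 - j] for j in range(n_h))
                new_cost = cost + (y_k - expected) ** 2
                # retain only the cheapest path into each node
                if new_state not in extended or new_cost < extended[new_state][0]:
                    extended[new_state] = (new_cost, path + [a])
        survivors = extended
    best_cost, best_path = min(survivors.values(), key=lambda cp: cp[0])
    return best_path, float(best_cost)

def cost_of(seq, y, h):
    """Sum of squared differences for one complete hypothesis."""
    return float(np.sum((y - np.convolve(seq, h)[:len(y)]) ** 2))

h = np.array([1.0, 0.5, 0.2])      # hypothetical channel impulse response
alphabet = (-1.0, 1.0)             # hypothetical binary symbol alphabet
rng = np.random.default_rng(3)
x_true = rng.choice(alphabet, size=10)
y = np.convolve(x_true, h)[:10] + 0.5 * rng.standard_normal(10)

x_hat, cost = viterbi_mlsd(y, h, alphabet)
best_exhaustive = min(itertools.product(alphabet, repeat=len(y)),
                      key=lambda s: cost_of(s, y, h))
```

At each sample, every retained state is extended by each possible symbol and only the cheapest path into each new state survives, so the work per sample is bounded by the number of states times the alphabet size.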

The Viterbi algorithm permits MLSD on systems with limited computational resources. For an $M$ symbol sequence, explicit testing of each of the possible hypotheses requires $N_s^M$ tests. Using the Viterbi algorithm, only $M N_s^{N_h} N_s$ tests are required as, at each point in the sequence, each hypothesis must be extended by checking $N_s$ possible next symbol values. Growth in computations is linear with $M$ rather than exponential. Thus, arbitrary length sequences may be processed using the Viterbi algorithm whereas the brute force method would eventually exceed the available computing capacity.
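As a worked illustration of the two counts, take the hypothetical values $N_s = 4$, $N_h = 3$ and $M = 500$ (assumed here purely for the arithmetic; they are not figures from the thesis):

$$N_s^{M} = 4^{500} = 2^{1000} \approx 1.07 \times 10^{301}, \qquad M N_s^{N_h} N_s = 500 \cdot 4^{3} \cdot 4 = 128\,000 .$$

Doubling $M$ doubles the second count but squares the first.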

In this section, the derivation through to practical implementation of an MLSD processor has been presented. For the additive white noise and finite impulse response case examined, the observation pdf is a simple function of the noise pdf; as the noise is white, the joint pdf is a simple product of single sample pdfs. Hypothesis tests involve a direct comparison of the observation with a hypothesized waveform (Equation 4.4). Taking the negative logarithm of the likelihood pdf and removing constants led to a simple cost function. The iterative structure of this cost function, together with the finite length of the impulse response, permitted the use of the efficient Viterbi algorithm. The model developed in Chapter 3 is much more complicated than the case considered here. Extensions will have to be developed to address features such as coloured noise and peak time jitter.

4.3 Noise Whitening

DNA time-series feature coloured Gaussian noise as described by Equation 3.12. This implies that the noise samples are correlated, with a joint pdf described by a full noise covariance matrix. Evaluation of the pdf will be laborious. The uncorrelated noise presented in the earlier example implied independence of the Gaussian variates, which led to an easy to compute and easy to comprehend formulation for the processor. Clearly, it is desirable to transform the DNA time-series into a form with similar properties to the Additive White Gaussian Noise (AWGN) case.

This can be achieved using a noise whitening filter. The time varying noise whitening filter, $\underline{h}_w$, is designed so that the noise in the whitened data, $\underline{y}_w$, is uncorrelated. This is satisfied by a filter whose Fourier transform is the square root of the inverse of the discrete noise spectrum. The time varying aspect must follow the signal peak pulse width's time dependence.
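The frequency-domain definition of the whitening filter can be sketched for the simpler time-invariant case. The example below assumes a known, fixed noise spectrum produced by a hypothetical three-tap colouring filter and uses circular (FFT-based) filtering; the time-varying aspect required for DNA data is deliberately left out, and all names are illustrative.

```python
import numpy as np

n = 1024
rng = np.random.default_rng(1)
white = rng.standard_normal(n)

# Hypothetical colouring filter; its squared magnitude response is the noise spectrum.
colouring = np.array([1.0, 0.6, 0.3])
H_c = np.fft.fft(colouring, n)
noise_spectrum = np.abs(H_c) ** 2

# Coloured noise obtained by circular filtering of the white noise.
coloured = np.real(np.fft.ifft(np.fft.fft(white) * H_c))

# Whitening filter: Fourier transform equal to the square root of the inverse spectrum.
H_w = 1.0 / np.sqrt(noise_spectrum)
whitened = np.real(np.fft.ifft(np.fft.fft(coloured) * H_w))

def lag1(x):
    """Normalized circular autocorrelation at lag one."""
    return float(np.dot(x, np.roll(x, 1)) / np.dot(x, x))
```

In this sketch `lag1(coloured)` is clearly non-zero while `lag1(whitened)` is close to zero, and the magnitude spectrum of `whitened` matches that of the original white sequence because only the phase differs.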

4.4 Nuisance Parameters

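In situations with nuisance parameters such as amplitude and peak time jitter, optimum processors must extend their hypotheses to jointly include not only all possible data sequences but all possible sequences of nuisance parameters [57]. Mathematically, when there are nuisance parameters, the maximum likelihood estimate is

$$\{\hat{\underline{x}}, \hat{\underline{\theta}}\} = \arg\max_{\tilde{\underline{x}}, \tilde{\underline{\theta}}} \; p(\underline{y}, \tilde{\underline{\theta}} \mid \tilde{\underline{x}}) = \arg\max_{\tilde{\underline{x}}, \tilde{\underline{\theta}}} \; p(\underline{y} \mid \tilde{\underline{x}}, \tilde{\underline{\theta}}) \, p(\tilde{\underline{\theta}})$$

where $\underline{\theta} = \{\underline{\alpha}, \underline{t}\}$ is the nuisance parameter vector, the hat is used to indicate the best estimate, the tilde indicates test values, "arg max" returns the test value that maximizes the expression on its right, $p(\underline{y}, \tilde{\underline{\theta}} \mid \tilde{\underline{x}})$ is the joint probability density function (pdf) of the observation and the nuisance parameters, $p(\underline{y} \mid \tilde{\underline{x}}, \tilde{\underline{\theta}})$ is the pdf of the observation conditioned on the nuisance parameter vector, and $p(\tilde{\underline{\theta}})$ is the pdf of the nuisance parameter vector. For example, in the AWGN-FIR case above, if the $h_j$ were random variables then the nuisance parameters $\underline{\theta}$ would be the $h_j$, $j = 0, \ldots, N_h - 1$. The pdf conditioned on the nuisance parameters is essentially the pdf in Equation 4.3 and $p(\underline{\theta})$ is the density function of the $h_j$.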

At each point in the sequence and for each possible base type, $L_{AT}$ hypotheses must now be evaluated, where $L_{AT}$ is the number of amplitude and peak time pairs to be considered for each signal peak. For continuous random variables such as the amplitude and peak time, $L_{AT}$ should be infinity. Practically, a finite $L_{AT}$ that permits good sampling of the amplitude and time joint pdf should achieve performance approaching full ML. Regardless, the implication is that, when time and amplitude jitter is included, the number of hypotheses to be considered is increased by a factor of $L_{AT}$ raised to the power of the number of symbols in the sequence. For example, if ten possible amplitude and peak time pairs are allowed for each symbol in a 500 point sequence, then $10^{500}$ times as many hypotheses must be considered. Clearly, nuisance parameters have a major impact on the computational load of MLSD processing.

4.5 Cost Function Derivation

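With the fundamentals of MLSD and its extensions for coloured noise and nuisance parameters now established, attention may now be directed at the formal derivation of the cost function for the DNA-ML algorithm. The algorithm seeks the sequence and nuisance parameter values that minimize the negative log of the pdf, referred to as a 'cost':

$$\{\hat{\underline{x}}, \hat{\underline{\theta}}\} = \arg\min_{\tilde{\underline{x}}, \tilde{\underline{\theta}}} \big[ \Lambda_c(\underline{y} \mid \tilde{\underline{x}}, \tilde{\underline{\theta}}) + \Lambda_\theta(\tilde{\underline{\theta}}) \big] \qquad (4.10)$$

where $\Lambda_c(\underline{y} \mid \tilde{\underline{x}}, \tilde{\underline{\theta}}) = -\log(p(\underline{y} \mid \tilde{\underline{x}}, \tilde{\underline{\theta}})) - \text{constant}$ is the part of the log likelihood due to the conditional pdf and $\Lambda_\theta(\tilde{\underline{\theta}}) = -\log(p(\tilde{\underline{\theta}})) - \text{constant}$ is the part due to the nuisance pdf. The constant offset serves to remove those terms which do not affect the minimization. The following subsections will address first the conditional pdf and second the nuisance pdf.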

4.5.1 Conditional Likelihood

The conditional likelihood, $\Lambda_c$, reflects the additive noise pdf for a specific $\{\tilde{\underline{x}}, \tilde{\underline{\theta}}\}$. The noise is coloured and its pdf is complicated to evaluate due to the implied correlations between all samples. The noise whitening filter, $\underline{h}_w$, is applied to create

$$\underline{y}_w = \underline{h}_w * \underline{y} \qquad (4.11)$$

where the $*$ denotes convolution. As the noise spectrum is time dependent (see Equation 3.12), the filter has to be recalculated for each possible peak time. Note that this transformation will not change the result of our sequence detection problem, as the $\{\tilde{\underline{x}}, \tilde{\underline{\theta}}\}$ which best explains the observed $\underline{y}$ is also the one which best explains the observed $\underline{y}_w$. For the whitened observation, $\underline{y}_w$, the noise terms are uncorrelated and its pdf may be written as the product of the Gaussian sample pdfs, as uncorrelated Gaussian random variables are independent. The conditional likelihood then becomes

where $\hat{\underline{y}}_w$ is the expected whitened observation for the hypothesized data sequence, $\tilde{\underline{x}}$, and nuisance parameters, $\tilde{\underline{\theta}}$. Here the summations are over all data samples $\{k\}$ and base channels $\{n\}$. Unlike in the AWGN-FIR case, constants ($(1/2) \log(2\pi\sigma_w^2)$) relating to the whitened noise variance, $\sigma_w^2$, are retained in this expression; as will be explored in greater detail in Chapter 5, these terms cannot be dropped as hypotheses incorporating different numbers of time samples may have to be compared.

Substitution of the expected observation into Equation 4.12 yields

where $g^w_{t,k}$ is the peak shape after application of the noise whitening filter for a peak centered on time $t$ and evaluated at $k$, and $\delta$ is the Kronecker delta. Without changing the result, the summations outside the brackets may be freely interchanged and split and the summation within the brackets may be split as long as the split


terms remain within the brackets. Thus, Equation 4.13 may be written as

where the range of $k$ has been partitioned into non-overlapping subsets, $K_i$, such that the union of these subsets corresponds to the complete range of $k$.

Equation 4.14 suggests grouping the samples into non-overlapping groups with each group corresponding to a specific base in the data sequence. Within each group, first subtract off the interference from other bases as indicated by the terms encompassed by the innermost brackets of Equation 4.14. Then, evaluate the hypothesized contribution of the specific base.

4.5.2 Nuisance Likelihood

Now consider the log likelihood of the nuisance parameters, the second term of Equation 4.10. As specified in the model (Section 3.3.2), the peak amplitude fluctuation is uncorrelated, its pdf is Gaussian, and the corresponding log likelihood term is just the squared difference with respect to the mean, $\mu_\alpha$, all normalized by twice the variance, $\sigma_\alpha^2$. The nuisance parameter log likelihood encompassing the log likelihood of the peak amplitude and peak time may be written as

$$\Lambda_\theta(\tilde{\underline{\theta}}) = \sum_i \frac{(\tilde{\alpha}_i - \mu_\alpha)^2}{2\sigma_\alpha^2} + \Lambda_t(\tilde{\underline{t}}) \qquad (4.15)$$

where $\tilde{\alpha}_i$ is the hypothesized amplitude of the i-th peak and the last term, $\Lambda_t(\tilde{\underline{t}}) = -\log(p(\tilde{\underline{t}})) - \text{constant}$, is the peak time log likelihood. Our analysis will now address that term.

For our model, the correlation of the jitter with respect to all previous peaks complicates the evaluation of the pdf and log likelihood. For a sequential processor, on a given hypothesis, for each extension in hypothesis length, the entire hypothesis must be fed into the new larger marginal probability function. Here, where the probability function is Gaussian, the covariance matrix is extended from size $N - 1$ to $N$ and the number of computations to evaluate the probability is proportional to $N^2$. The total number of calculations in sequentially evaluating an $N$ point hypothesis would then go as $N^3$. It is desirable for the total number of calculations to be a linear function of $N$.

An efficient sequential representation of the hypothesis pdf and log likelihood is required. As the timing jitter appears as a functional argument of the waveform and as it is sequence dependent, the simple whitening filter, appropriate for additive, sequence independent disturbances, cannot be used. Instead, the Markov structure of the timing jitter leads to an innovation technique for data whitening [66, 69] in which a linear transformation is applied to the data to decorrelate current and previous measurements. For uncorrelated Gaussian variates, the joint probability is just the product of the marginal probabilities of the random variables. Thus, for suitably transformed observations, the probability of the extended $N$ point hypothesis is the product of the probability of the $N - 1$ point hypothesis and the probability of the current transformed observation. The log likelihood of the whitened data may then be written as a simple sum of squared terms. The total computational load is then a linear function of $N$.
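The equivalence between the sequential (innovations) evaluation and the full-covariance evaluation can be checked numerically. The sketch below assumes a standard-form scalar state space model, $c_{i+1} = d\,c_i + u_i$ with observation $\phi_i = c_i + v_i$, which is simpler than the jitter model used in this chapter (where the input disturbance also appears in the observation); the parameter values and function names are illustrative assumptions.

```python
import numpy as np

def innovations_nll(phi, d, q, r, p0):
    """Negative log likelihood accumulated one observation at a time (linear in N)."""
    pred, p_pred = 0.0, p0            # one-step state prediction and its variance
    nll, whitened = 0.0, []
    for obs in phi:
        s = p_pred + r                # variance of the predicted observation
        e = obs - pred                # innovation
        whitened.append(e / np.sqrt(s))
        nll += 0.5 * (e * e / s + np.log(2.0 * np.pi * s))
        gain = p_pred / s             # Kalman gain
        filt = pred + gain * e
        p_filt = (1.0 - gain) * p_pred
        pred = d * filt
        p_pred = d * d * p_filt + q
    return float(nll), np.array(whitened)

def batch_nll(phi, d, q, r, p0):
    """Same quantity from the full N x N covariance matrix."""
    n = len(phi)
    var = np.empty(n)
    var[0] = p0
    for i in range(1, n):
        var[i] = d * d * var[i - 1] + q
    cov = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            cov[i, j] = d ** abs(i - j) * var[min(i, j)]
    cov += r * np.eye(n)
    _, logdet = np.linalg.slogdet(cov)
    quad = phi @ np.linalg.solve(cov, phi)
    return float(0.5 * (quad + logdet + n * np.log(2.0 * np.pi)))

rng = np.random.default_rng(2)
phi = rng.standard_normal(30)         # any data vector serves for the comparison
d, q, r, p0 = 0.9, 0.2, 0.1, 1.0      # hypothetical model parameters
nll_seq, whitened = innovations_nll(phi, d, q, r, p0)
nll_batch = batch_nll(phi, d, q, r, p0)
```

The two values agree to numerical precision, and the data-dependent part of `nll_seq` is one-half the sum of the squares of `whitened`, the normalized innovation sequence.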

Statistically, the innovations approach identifies the new information in the current observation, separating it from that which could be inferred from previous data. By virtue of its correlation with previous samples, the correlated part may be predicted from the previous samples. The i-th sample of the original peak time series may be written as

where $t_{c_i}$ and $\phi_{c_i}$ represent the part correlated with previous samples and $t_{u_i}$ and $\phi_{u_i}$ the uncorrelated part. Using Equations 3.6 and 3.7, the timing jitter, $\phi_i = \phi_{c_i} + \phi_{u_i}$,


may be written in state space form as

where A = G = O and C = H = J = 1.

For a system describable in linear state space form, the optimum predictor is the Kalman predictor [57, 67]. Using the Kalman predictor, the innovation is defined as the difference between the observation and the best prediction of the new observation given previous observations, all divided by the standard deviation of the prediction.

Equation 4.19 is not the usual state space form as the system input disturbance, $v_i$, appears directly in this observation equation. The best estimate of the state at $i + 1$ given the information available at $i$ (denoted $\hat{c}_{i+1|i}$) is the conditional expectation

If $v_i$ was not observable in $\phi_i$ then $\hat{v}_{i|i}$ would just be zero (the a priori mean) and the usual Kalman filter derivation [67] would apply.

The extension to the Kalman filter analysis must include the recovery of $\hat{v}_{i|i}$ and its impact on the prediction covariance. By the usual projection operation, the estimate of the input disturbance is

where $E$ is the expectation operator, $\Phi_i$ denotes all the observations up to and including the i-th, $\tilde{\phi}_i = \phi_i - \hat{\phi}_{i|i-1}$ is the innovation, $\tilde{\phi}_i^T$ is its transpose and $P^{\phi}_{i|i-1}$ is the covariance of the prediction of $\phi$. Here, the white spectrum of $v$ and its appearance in Equation 4.19 have been used to reach the last expression.

The state prediction covariance can be obtained using state Equation 4.18 as

where $\tilde{v}_i = v_i - \hat{v}_{i|i-1}$ and $\tilde{c}_i = c_i - \hat{c}_{i|i-1}$. The covariance $E[\tilde{v}_i \tilde{c}_i^T]$ can be shown to be zero. As the projection operation is used to form the estimate of the input disturbance, the covariance of this estimate can be obtained as

Equations 4.20-4.23 should be combined with the standard Kalman filter equations. The model parameters can be substituted and terms regrouped to yield

The first three equations describing the Kalman gain, $L$, state estimate, $\hat{c}$, and covariance of the state estimate, $P$, are the standard Kalman filter equations. The subsequent equations represent extensions due to the observability of the input disturbance. Here, a gain, $L_v$, is used in recovering an estimate of the input disturbance, $\hat{v}_{i|i}$, which is then used to predict the next state, $\hat{c}_{i+1|i}$. The prediction covariance, $P_{i+1|i}$, has an additional term which reflects the covariance of the input disturbance estimate.

The whitened innovation sequence to be used by the sequencer is $(\phi_i - \hat{c}_{i|i-1}) / \sqrt{P^{\phi}_{i|i-1}}$ or, to make the peak dependence explicit, $(t_i - i\mu_T - \hat{c}_{i|i-1}) / \sqrt{P^{\phi}_{i|i-1}}$. This is an independent, zero mean, unit variance, Gaussian random process. The peak time log likelihood is then one-half the sum of the squares of this process:

The nuisance parameter log likelihood becomes

4.5.3 Cost Function

Equations 4.10, 4.14 and 4.32, in conjunction with the whitening filter (4.11) and Kalman predictor (4.24-4.30), define the maximum likelihood sequencer. As is desired for the dynamic programming algorithm, for each hypothesized sequence, the cost (log likelihood) may be written as the sum of the cost corresponding to previous points in the sequence and a cost associated with the current point:

Figure 4.1 summarizes the algorithm structure. The hypothesized peak amplitudes

and times, together with the hypothesis' sequence, are used to generate a waveform

which is compared with the observation. This yields the likelihood conditioned on

the parameters. Also, the hypothesized peak amplitude and times are compared with


Figure 4.1: Maximum likelihood processor block diagram (blocks: hypotheses, DNA peak estimator, waveform comparison, predictors, dynamic programming algorithm).

the Kalman filter predictions to obtain the parameter probability. The innovation is also used to update the hypothesis' Kalman filter. Note that each hypothesis has its own hypothesis dependent Kalman filter.

4.6 Significance

Following a formal derivation from the statistical model of the DNA time-series, this chapter has presented the first DNA sequencing algorithm which can achieve optimal detection performance. Of course this optimality is only to the extent that the model reflects reality. Viewed purely from the perspective of detection theory, the algorithm is sophisticated in addressing non-stationarities and nuisance parameters. It is at the edge of the state of the art in communication theory. The derivation leads naturally to a general structure. Components of this structure have well defined tasks. They facilitate the assessment of current algorithms as analogous blocks may be compared. Future work may see the cost function derived in this chapter incorporated into a Maximum A Posteriori (MAP) processor to provide optimum estimates of base type and probability on a base by base basis. When the DNA time-series model receives further refinements, the DNA-ML algorithm may easily be extended to reflect these new model features.

CHAPTER 5

Implementation

While including most of the features of the optimum algorithm, the implementation of the DNA-ML algorithm requires changes mainly directed at reducing the computational load. This chapter provides details regarding implementation of the algorithm that was discussed in Chapter 4. Algorithm robustness to model errors also receives attention.

5.1 Hypothesis Reduction

5.1.1 Peak Estimation

An alternative to carrying multiple hypotheses for the nuisance parameter is to estimate the nuisance parameter and use that value in the sequence detection. This approach is sometimes used in data communications where the carrier phase is a slowly varying nuisance parameter [57]. For DNA data with well isolated peaks, peak amplitude and time can be easily estimated by taking the local maximum. However, as resolution decreases, estimates of adjacent peaks are biased closer together. The bias leads to erroneous values being used in the likelihood evaluation and therefore to sequencing errors. In the limit, peaks are not resolved and bases are deleted from the


sequence. The multiple hypothesis approach avoids this problem as it generates the

complete waveform for the hypothesis and compares it with the observation. Even if

the peaks are unresolved, this approach can obtain the correct result if the resulting

broad peak in the hypothesis waveform matches that in the observation.

To address this resolution problem, Equation 4.14 suggests using the sequence data from adjacent points in the hypothesized sequence to reduce the influence of neighbouring peaks in the peak estimator. Consider a sequence with two adjacent 'G's that are poorly resolved. For the correct hypothesis, the measurement of the parameters of one peak may be made more accurate by subtracting from the time series a pulse of peak amplitude and time corresponding to the other peak, then taking the local maximum. It is possible to show that even if there are moderate errors in the estimation of the parameters of the second peak, the overall accuracy is much improved relative to simple peak detection without ISI removal. In order to remove ISI, one must include past peaks already estimated and future peaks yet to be estimated. Estimation of future peak ISI will be addressed in the next section; for now, it will be assumed that it can be done successfully.
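The effect of subtracting a neighbouring peak before taking the local maximum can be illustrated with two overlapping pulses. The Gaussian pulse shape, the peak positions, the width and the size of the error in the neighbour's estimate are all assumptions made for this sketch; noise and whitening are omitted.

```python
import numpy as np

def pulse(t, amplitude, centre, width):
    """Hypothetical Gaussian peak shape."""
    return amplitude * np.exp(-0.5 * ((t - centre) / width) ** 2)

t = np.linspace(0.0, 100.0, 1001)
width = 3.0
true_time = 48.0

# Two equal, poorly resolved peaks: their sum has a single maximum between them.
observed = pulse(t, 1.0, true_time, width) + pulse(t, 1.0, 52.0, width)

# Simple peak detection: local maximum of the raw data.
naive_time = t[np.argmax(observed)]

# ISI removal with a deliberately imperfect estimate of the neighbouring peak,
# followed by the same local-maximum estimator.
residual = observed - pulse(t, 0.9, 52.5, width)
est_time = t[np.argmax(residual)]
est_amplitude = residual.max()
```

In this sketch the raw maximum falls midway between the two peaks, whereas the estimate after subtraction lies much closer to the true peak time even though the neighbour's amplitude and time were both in error.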

For a particular base, if the ISI has been completely removed then the problem becomes the detection / estimation of a single peak. This is optimally accomplished by matched filtering the data, detecting the maximum, and recording the peak amplitude and time [57]. The matched filter is the time-reversed noise whitened peak shape, $g^w_{t,k}$. The peak estimate obtained in this fashion is then used to create the expected peak as the last term in Equation 4.14. If the hypothesized sequence matches the observed DNA sequence then the squared term of Equation 4.14 should be small. Figure 5.1 summarizes the above procedures.

5.1.2 Future Peak ISI Cancellation

A priori peak predictions based on jitter and peak amplitude models may be used to estimate the interference from future peaks. By modifying Equation 4.29, the Kalman predictor used in the innovation processing may also be used to predict future peak locations an arbitrary number, $p$, of bases forward of the current base as


Figure 5.1: Peak estimator (blocks: ISI removal, matched filter, peak detection, single peak pulse estimate; estimation under correct hypothesis).

where $t_i$ is the location of the current peak, $\mu_T$ is the mean inter-peak separation, $d$ is the jitter auto-regressive weighting, $\hat{c}_{i|i}$ is the current estimate of the current jitter state, and $\hat{v}_{i|i}$ is the current estimate of the jitter driving process, all as described in Section 3.3.2. The amplitude prediction, after trend removal, is just $\mu_\alpha$.

The accuracy of the future peak location estimate falls very quickly with $p$. On the other hand, the further into the future a peak is from the current peak, the smaller its contribution to the ISI is by virtue of the peak shape. Thus, the inaccuracy in peak location for peaks far into the future has little impact on the accuracy of ISI removal.

However, for a future peak whose mainlobe reaches well into the region of the peak of interest, accuracy in peak prediction is very important. Prediction is limited by the component of the future peak that is uncorrelated with the currently available measurements. The uncorrelated component of the jitter in peak time and amplitude may be sufficient to lead to large errors in ISI removal. To address this problem, the algorithm may be extended to include several hypothesized locations for each future peak (i.e. a constellation of candidate future peak locations). These additional hypotheses represent a partial return to the optimal algorithm of Chapter 4. However, they are only included for ISI removal and once a direct estimate of the peak's parameters is obtained then the constellation is collapsed to that estimate. Thus previous peaks in the hypothesis do not maintain nuisance parameter constellations and so the computational load is much lower than for full MLSD.

5.1.3 Sequential Decoding

To further reduce the computational load, an algorithm may be selected that tests only a subset of the possible hypotheses. A number of these have been developed and analysed in the communications coding literature [61]. Of these, the M-algorithm¹ has been chosen for this thesis as it can be considered to be a maximum likelihood processor under the constraint of retaining at most M hypotheses at each point in the sequence [68]. This algorithm processes the data sequentially. At the i-th position in the sequence, it retains M hypotheses corresponding to the most likely i-point subsequences given the observed data from the start of the run up to the current point under consideration.
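The retain-at-most-M rule can be sketched on the simple AWGN-FIR example of Chapter 4 rather than on DNA data. The squared-error cost, the channel taps, the alphabet and the function name are illustrative assumptions; peak estimation and nuisance parameters are omitted.

```python
import numpy as np

def m_algorithm(y, h, alphabet, m):
    """Sequential search that keeps only the m lowest-cost hypotheses at each position."""
    n_h = len(h)
    hypotheses = [(0.0, [])]                      # (cost so far, symbol path so far)
    for k, y_k in enumerate(y):
        extended = []
        for cost, path in hypotheses:
            for a in alphabet:
                new_path = path + [a]
                # expected sample: sum_j h[j] * x[k-j], with zero symbols before the start
                expected = sum(h[j] * new_path[k - j] for j in range(n_h) if k - j >= 0)
                extended.append((cost + (y_k - expected) ** 2, new_path))
        extended.sort(key=lambda cp: cp[0])
        hypotheses = extended[:m]                 # discard all but the m most likely
    best_cost, best_path = hypotheses[0]
    return best_path, float(best_cost)

h = np.array([1.0, 0.5, 0.2])      # hypothetical channel impulse response
alphabet = (-1.0, 1.0)             # hypothetical binary symbol alphabet
rng = np.random.default_rng(4)
x_true = rng.choice(alphabet, size=12)
y = np.convolve(x_true, h)[:12] + 0.1 * rng.standard_normal(12)

x_hat, cost = m_algorithm(y, h, alphabet, m=4)
```

The work per position is proportional to `m` times the alphabet size, independent of the channel memory, at the price of possibly discarding the hypothesis that would eventually have had the lowest cost.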

5.2 Unique Algorithm Considerations

The implementation of the DNA-ML algorithm is complicated by the interplay between the dynamic programming algorithm and the asynchronous peak times. The dynamic programming algorithm must compare hypotheses as it progresses through the data set. Because of different values of the peak time parameters held by different hypotheses, the hypotheses may be of different duration and therefore not properly comparable. On average, the shorter hypotheses would have lower costs and be more likely to be retained.

Similarly, the symbol region, $K_i$ (the set of time samples associated with each symbol), is likely to be defined dynamically and thus the lengths of the summations over the symbol will vary, again based on differences in estimated parameter values. The analysis of Chapter 4 did not give direct guidance as to how the $\{K_i\}$ were to be defined.

¹ The technique is similar to the 'greedy' algorithm found in the computer science literature, though there can be differences based on the specific definition used by particular authors.

Could a change in the dynamic programming from a base by base basis to a sample by sample basis solve the problem? No, as now hypotheses would differ in relatively how much of the latest base region was represented in the cost. The problem only disappears if decisions are not made until the cost of the entire observation is available. Of course, this would imply an untenable number of computations if all hypotheses are retained that long.

In this section, unequal length hypothesis comparison and symbol region definition will be examined.

5.2.1 Unequal Length Comparisons

Long before the development of the Viterbi algorithm, there were a number of sequential decoding algorithms which would deal with unequal length hypotheses [61]. Typically, these algorithms follow the most likely hypothesis until its cost exceeds a threshold. They then return to an earlier hypothesis and pursue it. If the end of data is reached then the current hypothesis is returned as the sequence estimate. By no means are these algorithms guaranteed to return the maximum likelihood estimate of the sequence.

The most famous of these is the Fano algorithm [62]. This algorithm is still in use today for systems that use very long codewords such as in space communications; the Viterbi algorithm would require too many computations in such applications. Developed on an ad hoc basis, it was later shown to also result from a probabilistic analysis [63]. While its original application was for variable length codes, it has since been used for sequence estimation in infinite length channels [64].

The Fano algorithm does not explicitly compare hypotheses. Rather, it examines the Fano metric (Fm), the ratio of the hypothesis pdf to the unconditional pdf as in

where $k$ is the index of the latest observation sample, $y_k$, and hypothesized symbol, $x_k$. Here, as in the remainder of this section, the probability functions $p(\,)$ are defined by their arguments. For the correct hypothesis, the Fano metric will increase with time ($k$) as, in this case, the numerator is greater than the denominator. For the incorrect hypothesis, the Fano metric will eventually fall below a threshold. This hypothesis is then discarded and another pursued.

The Fano metric assumes the observation to be the same length as the hypothesis and so further analysis is required to adapt it to the DNA sequencing problem. This analysis builds on the work of Massey [63]. Consider two different hypothesized sequences, $\tilde{\underline{x}}_1$ and $\tilde{\underline{x}}_2$, containing the same number of symbols. However, assume that due to differing symbol lengths (i.e. due to peak time jitter in DNA sequencing), the length of $\underline{y}_1$, the observation associated with $\tilde{\underline{x}}_1$, is different than the length of $\underline{y}_2$, the observation associated with $\tilde{\underline{x}}_2$. Define $\underline{y}$ as the entire possible observation encompassing both the observations thus far under the hypothesis and the future observations. Then, using the plus superscript to indicate future observations, $\underline{y} = \{\underline{y}_1, \underline{y}_1^+\} = \{\underline{y}_2, \underline{y}_2^+\}$.

Consider comparing the two hypotheses when the entire observation is available:

As the observations are of the same length, the comparison does not suffer bias from unequal length hypothesis observations. Similarly, each hypothesis contains the same number of symbols and so bias is not introduced from differing number of symbols. Here, the observation does include the effects of future symbols, $\underline{x}^{+}$, but the probability density function is the marginal one obtained by averaging over all possible future symbols as in

Thus, Equation 5.3 is a desirable test of the two hypotheses. It will now be demonstrated that the Fano metric provides exactly this test.

First, normalize Equation 5.3 by the unconditional probability of the entire observation:

As both sides are scaled by the same factor, the result of the comparison is unaffected. Applying the chain rule, the probability of the entire observation under hypothesis i may be written as

and integrating over the possible same length hypotheses yields the unconditional pdf as $\sum_{\underline{x}} p(\underline{y}_i \mid \underline{x})\, p(\underline{y}_i^{+} \mid \underline{y}_i, \underline{x})$. The ratio may then be written as

The next step in Massey's analysis assumes a Discrete Memoryless Channel (DMC) where the samples are independent. This implies $p(\underline{y}_i^{+} \mid \underline{y}_i, \underline{x}_i) = p(\underline{y}_i^{+})$ which on substitution in Equation 5.7 yields

Thus, normalizing the hypothesis probability for the entire observation by the unconditional pdf for the entire observation is equivalent to normalizing the hypothesis probability for the observation thus far by the unconditional pdf for the observation thus far (i.e. the Fano metric, Equation 5.2). Equation 5.5 becomes

and so a means of comparing different observation length hypotheses has been developed, albeit only for the case of independent samples.
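The chain of steps described in words above can be written compactly. The following is a sketch in assumed notation rather than a reproduction of the numbered equations: $\underline{x}_i$ is hypothesis i, $\underline{y}_i$ the observation so far under that hypothesis, $\underline{y}_i^{+}$ the future observation, and the prior over same length hypotheses, $p(\underline{x}')$, is written out explicitly as an assumption.

```latex
% Assumed notation: \underline{y} = [\underline{y}_i, \underline{y}_i^{+}] is the entire observation.
% The prior p(\underline{x}') is an explicit assumption of this sketch.
\begin{align*}
\frac{p(\underline{y} \mid \underline{x}_i)}{p(\underline{y})}
  &= \frac{p(\underline{y}_i \mid \underline{x}_i)\,
           p(\underline{y}_i^{+} \mid \underline{y}_i, \underline{x}_i)}
          {\sum_{\underline{x}'} p(\underline{x}')\,
           p(\underline{y}_i \mid \underline{x}')\,
           p(\underline{y}_i^{+} \mid \underline{y}_i, \underline{x}')}
  && \text{chain rule} \\
  &= \frac{p(\underline{y}_i \mid \underline{x}_i)\, p(\underline{y}_i^{+})}
          {p(\underline{y}_i^{+}) \sum_{\underline{x}'} p(\underline{x}')\,
           p(\underline{y}_i \mid \underline{x}')}
  && \text{memoryless channel} \\
  &= \frac{p(\underline{y}_i \mid \underline{x}_i)}{p(\underline{y}_i)}
  && \text{Fano metric}
\end{align*}
```

Under these assumptions, comparing $p(\underline{y} \mid \underline{x}_1)$ with $p(\underline{y} \mid \underline{x}_2)$ gives the same decision as comparing the two Fano metrics, even though $\underline{y}_1$ and $\underline{y}_2$ differ in length.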

Can the Fano metric be applied to the DNA sequencing problem? DNA time series feature memory due to the pulse shape and the correlation of the peak time jitter: they are not the result of a memoryless channel as mandated by the Massey analysis. The analysis was extended to include nuisance parameters and the ratio of the probability of the entire (past, current and future) time series given the short hypothesis to the unconditional probability of the entire time series emerged as

where as usual, $\underline{x}$, $\underline{y}$ and $\underline{\theta}$ are the information sequence, observations and nuisance parameters, respectively. The summations are over all valid values of their argument vectors; the prime is used to indicate a dummy variable vector. The development of this expression assumed the observations were causal.

For the DMC, $p(\underline{y}^{+}, \underline{x}^{+}, \underline{\theta}^{+} \mid \underline{y}, \underline{x}, \underline{\theta}) = p(\underline{y}^{+}, \underline{x}^{+}, \underline{\theta}^{+})$ so that the last pair of summations in the numerator and denominator cancel, leaving the Fano metric. The same does not hold true for DNA time-series as the first few elements of $\underline{y}^{+}$ and $\underline{\theta}^{+}$ depend substantially on past values. In this region, the last pair of summations in the numerator would be hypothesis dependent, while in the denominator, the last pair of summations would be averaged over all possible previous hypotheses. Therefore, they would not cancel. Certainly, beyond a few time constants of this memory, the tails of the Fano metric formulation should effectively cancel. However, our interest is in comparing hypotheses which, while of different lengths, are probably within a few memory time constants of each other. The Fano metric is not theoretically justified in this region.

Nonetheless, the Fano metric at least provides a structure, albeit a sub-optimal one. As no other is available, the Fano metric was investigated on sequencing data. Specifically, the division by the unconditional pdf led to the addition of its logarithm to the cost given by Equation 4.33 to produce

where the unconditional pdf, $p(\underline{y}_{w_k})$, has as its argument the vector of the k-th whitened samples of all four channels. This is necessary to allow for the fundamental nature of the series where ideally at a given point one channel has the base peak while the others have the basic noise background.

The unconditional pdf, rather than being derived from the model, was developed from the histogram of the actual data. After observing the histogram, the following form was adopted:

Here the current observation, $\underline{y}_k$, is a vector of the four channel levels. The first of these equations states that the probability of the observation is the weighted sum of the probabilities of the observation given the base type, b, was known. The second equation gives the probability of the observation given a known base type as the product of the signal pdf on that base's channel and the noise pdf on the other channels. Next is the noise pdf which is as used elsewhere in the DNA-ML algorithm. Finally, the signal pdf has three regions weighted by constant c which serves to normalize the function. It represents a 'heuristic fit' to the histogram. The first region represents the signal peak being absent, due perhaps to a dropout, and so uses the form and parameters of the noise pdf. The flat region represents the rising and falling regions of the peak shape. The last region accounts for the peak amplitude.
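The structure of the first two equations, as described in words above, can be sketched as follows. This is an illustration only: the signal and noise pdfs are passed in as callables because their exact forms are not restated here, and the function name, the dictionary layout and the example pdfs are hypothetical.

```python
def observation_pdf(y, base_weights, signal_pdf, noise_pdf):
    """Unconditional pdf of one four-channel sample.

    y            -- dict mapping base ('A', 'C', 'G', 'T') to that channel's level
    base_weights -- dict mapping base to its weight (weights sum to one)
    signal_pdf   -- callable: pdf of a level on the channel holding the peak
    noise_pdf    -- callable: pdf of a level on a channel holding only noise
    """
    total = 0.0
    for base, weight in base_weights.items():
        # Probability of the observation given that `base` is the true base type:
        # signal pdf on that base's channel, noise pdf on the other channels.
        p_given_base = signal_pdf(y[base])
        for channel, level in y.items():
            if channel != base:
                p_given_base *= noise_pdf(level)
        total += weight * p_given_base
    return total


# Hypothetical example with equal weights and placeholder pdfs.
weights = {base: 0.25 for base in "ACGT"}
sample = {"A": 1.0, "C": 0.1, "G": 0.0, "T": 0.2}
value = observation_pdf(sample, weights, lambda v: 2.0, lambda v: 0.5)
```

Passing the pdfs as arguments keeps the weighted-sum structure separate from the heuristic fit, so a different signal pdf can be tried without touching the mixture itself.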

The results of employing the Fano metric were not encouraging. In investigations with real data some sequencing errors of the non-Fano implementation were corrected. However, new errors occurred elsewhere and the overall error rate was not improved. These errors did not exhibit a pattern from which one could infer a mechanism.

Theoretically, two phenomena could account for the lackluster performance of the Fano metric. The most likely of these is the impact of the correlation in the DNA time-series. The second possibility is that the synthetic observation pdf, Equation 5.3, did not accurately reflect the true observation pdf. Any bias here would accumulate in the cost with sample number and could eventually subvert the decision process.

The implementation whose performance is presented in the next chapter does not incorporate compensation for unequal length comparisons. Such compensation remains an open problem.

5.2.2 Selection of Symbol Region, Ki

A natural notion for a symbol region definition would be to center the region on the symbol. The region's borders could be from halfway between the previous symbol and the current symbol to halfway between the current symbol and the next symbol. However, at the time of the current symbol, the next symbol's location is not known so the upper border can't be set. Also, hypotheses with peaks closer together would tend to have lower costs as fewer sample points would belong in the symbol region.

A constant region width would cure the border and varying width cost bias. However, it would encounter problems with respect to leaving out points when the peaks are widely separated and including the same point in the symbol regions for two different bases when the peaks were close together.

Defining the current symbol region as between the previous and current symbols has some advantages. First, the borders are known. Second, ISI is suppressed better in this region as it is distant from the problems with future peak location error. Unfortunately, with this definition only half of the current peak is used in the integration, thus implying a lower signal to noise ratio. As well, the problems with varying region width are still present. However, the advantages for this strategy appear to be stronger than the disadvantages. This definition of current symbol region is used in the real data processing of the next chapter.

5.3 Modelling Limitations and Robustness

Neglecting the different length hypothesis problem, the DNA-ML algorithm is optimum only for data that exactly matches the model and parameters used. Errors in some parameter settings are expected to have little effect. For example, a small error in mean peak amplitude would have little effect as the peak to peak variance is so large. As already discussed, errors in the tails of the pulse shape have only a small effect as mainlobe ISI dominates. While the mainlobe shape is well known, it does depend on the pulse width parameter, a parameter whose estimate has a fair uncertainty. The sensitivity to this parameter should be investigated. Noise whitening is fundamental to the development of the DNA-ML algorithm. Chapter 3 alluded to the difficulty in measuring the noise spectrum. The impact of this on the algorithm should be addressed. Another key parameter is $\beta$; sensitivity to $\beta$ mismatch will be investigated in Chapter 6. In this section, the sensitivity to pulse width mismatch is examined and issues associated with the noise whitening process are discussed.

5.3.1 Pulse Width

Errors in pulse width setting can lead to large errors in waveform comparison, particularly near the steep edges of the pulse. A simulated data set was created wherein the model and all parameters were obtained from real data (Data Set 1 of the next chapter). The algorithm used the same parameters with the exception being pulse width. Table 5.1 presents the results of reprocessing the same data set with several different pulse width settings. It is clear from the table that 10% mismatch in the pulse width can result in a large increase in the error rate.

Table 5.1: Performance as a function of pulse width mismatch for 300 bases of simulated data.

Assumed/True Pulse Width      0.8       0.9       0.95   1      1.05   1.1    1.2
Insertions/Deletions/Errors   45/19/46  19/18/14  2/2/2  1/1/1  1/1/1  2/2/1  49/26/32

In Section 3.3.2, the scatter of the pulse width estimate had a standard deviation of 10%. However, presuming the trend model to be correct, the least squares model fit in effect averages these 300 estimates. Thus, the standard deviation of the error in the resulting model is on the order of $10\%/\sqrt{300} \approx 0.6\%$. Thus, if the scatter is indeed due to measurement error then the pulse width is known with sufficient accuracy that pulse width mismatch is not a problem. On the other hand, if this hypothesis is wrong and the measured pulse width variations are actually true reflections of the physical processes then pulse width mismatch could be a large contributor to sequencing errors.
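The averaging argument above is the usual standard error of a mean. A minimal check of the arithmetic, assuming the 300 estimates carry independent errors of equal size (the function name is hypothetical, and a fitted trend with several parameters would behave only approximately like a plain average):

```python
import math


def averaged_error_std(single_estimate_std, n_estimates):
    """Standard deviation of the mean of n independent, equally noisy estimates."""
    return single_estimate_std / math.sqrt(n_estimates)


# 10% scatter per estimate, 300 estimates.
model_error_percent = averaged_error_std(10.0, 300)
print(round(model_error_percent, 2))  # prints 0.58
```

The result, about 0.58%, is well inside the 5% mismatch at which Table 5.1 still shows only a few errors.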

5.3.2 Noise Whitening

A low quality noise spectrum estimate can lead to poor noise whitening. Section

3.4 alluded to the difficulty in measuring the noise spectrum of DNA time-series. It is

not possible to obtain a "noise only" data set that has enough data points so that the

statistics of the noise in DNA sequencing data can be estimated accurately. However,

in adopting the noise spectral model of Equation 3.12, the necessity of measuring the


full spectrum vanishes as the model only requires estimates of the white noise and

coloured noise variances.

The white noise variance estimate may be obtained from the sample to sample variation of the data in regions without true signal peaks. The average of the square of the difference between adjacent samples should be twice the white noise variance if only white noise is present. Taking the difference between adjacent samples should suppress the lower frequencies where the coloured noise is strong. To simplify the variance estimation process, the assumption is made that the difference completely suppresses the coloured noise. Thus the estimate of the white noise variance is simply one half the average of the squared differences between consecutive samples.
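The estimator just described is short enough to state directly. The sketch below assumes the input is a list of samples taken from a peak-free region; the function name and the synthetic test signal are hypothetical.

```python
import random


def white_noise_variance(samples):
    """Half the mean squared difference between consecutive samples."""
    diffs = [b - a for a, b in zip(samples, samples[1:])]
    return 0.5 * sum(d * d for d in diffs) / len(diffs)


# Synthetic check: pure white Gaussian noise with variance 0.01.
random.seed(0)
noise = [random.gauss(0.0, 0.1) for _ in range(20000)]
estimate = white_noise_variance(noise)
```

For pure white noise the estimate should land close to the true variance; any coloured noise that survives the differencing would bias it upward, which is the simplification the text acknowledges.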

The coloured noise variance estimate may be obtained by studying the weak signal like features in regions without true signal peaks. The assumption is made that these features have a Gaussian distribution. Then the vertical interval* of the range containing 95% of these features is an estimate of four times the standard deviation. This then leads directly to the coloured noise variance as the square of the standard deviation estimate.
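A sketch of this range-based estimate follows. It assumes the feature levels have already been collected into a list and takes the central 95% interval from the sorted values; the function name and the percentile indexing are illustrative choices, not taken from the text.

```python
def coloured_noise_variance(feature_levels):
    """Variance from the interval holding the central 95% of levels, taken as four sigma."""
    ordered = sorted(feature_levels)
    last = len(ordered) - 1
    low = ordered[round(0.025 * last)]   # lower edge of the central 95%
    high = ordered[round(0.975 * last)]  # upper edge of the central 95%
    sigma = (high - low) / 4.0
    return sigma * sigma
```

For a Gaussian the central 95% spans about 3.92 standard deviations, so treating it as four gives a variance roughly 4% low, which is small next to the other uncertainties discussed in this section.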

Noise whitening based on such variance estimates has been attempted for a real data set. Figure 5.2 presents the estimated spectrum for a 23 base noise only region of the 'noise whitened' A channel starting at base 297. Clearly, the data is not white: the level drops by roughly 20 dB in going from 0 to $\pi$ radians/sample. Obviously, the estimate of the white noise spectral level used in the generation of the noise whitening filter was too high. Attempts at adjusting the white noise variance estimate led to 'whitened' data that exhibited other non-white features near the middle of the band.

Should the result shown in Figure 5.2 or the noise variance estimates be given greater credit? The noise variance estimates have the advantage of being formed from a larger amount of data. They should be more stable and representative of a larger portion of the data set. On the other hand, the noise whitening is dependent on the noise spectral model of which the variances are but two parameters. Figure 3.21 has the advantage of directly modelling the entire noise spectrum but it is from such a small data set that its quality is poor and it may not be representative of the entire DNA time-series. Thus, for both the short noise spectral estimate approach and the model based approach, some degree of spectral mismatch is to be expected and the noise whitened data will not be truly white.

*adjusted for large scale trends

[Plot not reproduced; horizontal axis: frequency (radians), 0 to 3.]

Figure 5.2: Spectral estimate for a short section of "noise whitened" data lacking signal peaks.

Given the likelihood of mismatch, how may the processor be modified to allow robustness with respect to this problem? The residual noise colour manifests itself as a correlation between the terms within the sum over $K_i$ in Equations 4.14 and 4.33. As adjacent terms are now similar, there are fewer independent samples than implied by the cardinality, $|K_i|$. Thus by summing over $K_i$, the weight given to these terms is greater than implied by the number of independent samples. Now that these terms are overweighted relative to the weight on the nuisance parameter log likelihood terms, hypotheses with improbable jitter will be given more consideration and errors will result.

To restore balance, the summation over $K_i$ can be weighted by a factor in order to reflect the true statistical degrees of freedom available. The statistical degrees of freedom can be expressed as the product of the observation time and the statistical bandwidth of the data. The statistical bandwidth is the bandwidth of an ideal low pass process which, over the same observation time, yields the same statistical degrees of freedom as the process of interest. Defining the fractional bandwidth as the ratio of the statistical bandwidth to the total bandwidth of the observation, it may be easily seen that scaling the summation over $K_i$ by the fractional bandwidth provides the proper weighting to compensate for the correlation in the samples. For example, with a fractional bandwidth of 0.25, there are one quarter as many independent samples and so the magnitude of the sum should be as though one quarter as many terms were summed.
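The weighting can be illustrated with a small sketch. It assumes the per-sample cost terms over the symbol region and the nuisance parameter cost are already available as numbers; the function names are hypothetical and the cost terms stand in for those of Equations 4.14 and 4.33.

```python
def weighted_region_cost(sample_costs, fractional_bandwidth):
    """Sum of per-sample cost terms, scaled to reflect the independent samples."""
    if not 0.0 < fractional_bandwidth <= 1.0:
        raise ValueError("fractional bandwidth must lie in (0, 1]")
    return fractional_bandwidth * sum(sample_costs)


def hypothesis_cost(sample_costs, nuisance_cost, fractional_bandwidth):
    """Total cost: weighted waveform terms plus the unweighted nuisance term."""
    return weighted_region_cost(sample_costs, fractional_bandwidth) + nuisance_cost


# Eight correlated samples at fractional bandwidth 0.25 count as two.
example = hypothesis_cost([1.0] * 8, 3.0, 0.25)
```

Only the waveform sum is scaled; the nuisance parameter term keeps its full weight, which is what restores the balance between the two.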

5.4 Comparison with Typical Automatic Sequencer Techniques

Currently available automatic sequencer algorithms incorporate some techniques which address the same general signal features as do the various components of the DNA-ML algorithm. They do differ in how they address these features. Some insight as to the performance potential of the DNA-ML algorithm may be gained by examining these differences.

5.4.1 ISI Suppression

Current algorithms address ISI suppression through either a peak sharpening filter, deconvolution algorithm or maximum entropy algorithm. The peak sharpening filter sharpens the peak and, undesirably, emphasizes the high frequency portion of the noise. Deconvolution processing in a sense fits replicas of the generic pulse shape to the observed data [35]; maximum entropy reconstruction performs similar processing [?1]. Both signal and noise are represented by these pulses. On the other hand, the DNA-ML subtracts off only interference from signal peaks as determined by the sequence hypothesis. It does not emphasize the noise.

In that aspect, the current ISI suppression techniques are to the DNA-ML algorithm what the Linear equalizer is to the Decision Feedback Equalizer (DFE). Based on the known performance advantage of the DFE [55], the DNA-ML would be expected to have superior ISI suppression and thus reduced error rates. However, the analysis that yields the advantage to the DFE is based on a known fixed pulse amplitude, peak time and shape. Errors in peak amplitude and time parameters in the DNA-ML algorithm could lead to reduced ISI suppression.

5.4.2 Peak Detection

At the lowest level, some algorithms detect peaks using a criterion such as the largest local maximum exceeding an amplitude threshold in the time search window. More sophisticated algorithms integrate peak area into the detection criteria [70]. This approaches the optimum matched filtering of the DNA-ML peak estimator but gives greater weight to the smaller, hence noisier, portions of the peak. Giddings [38] uses a dual-Gaussian bandpass filter which would be closer to but still different from the matched filter. The DNA-ML should offer superior peak estimates.
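The lowest-level criterion mentioned above, the largest local maximum exceeding a threshold inside a search window, can be sketched as follows. This is a generic illustration and not the code of any particular sequencer; the function name, the half-open window convention and the example trace are assumptions.

```python
def detect_peak(samples, window_start, window_end, threshold):
    """Index of the largest local maximum above threshold in
    [window_start, window_end), or None if there is none."""
    best_index = None
    first = max(window_start, 1)
    last = min(window_end, len(samples) - 1)
    for k in range(first, last):
        is_local_max = samples[k] >= samples[k - 1] and samples[k] > samples[k + 1]
        if is_local_max and samples[k] > threshold:
            if best_index is None or samples[k] > samples[best_index]:
                best_index = k
    return best_index


# Hypothetical single-channel trace with a large peak at index 4
# and a smaller one at index 7.
trace = [0.0, 0.2, 0.1, 0.3, 0.9, 0.4, 0.1, 0.6, 0.2]
```

Such a detector looks at one sample at a time, which is why it is adequate for large isolated peaks but uses none of the pulse shape information that a matched filter exploits.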

As the signal peaks in DNA time-series are large and easily detected with a crude detector, it is unlikely that the theoretically superior peak detection capability of the DNA-ML will offer any practical performance improvement for isolated peaks. But by feeding the superior peak estimates to the ISI suppression algorithm, the DNA-ML may realize improved performance for overlapping peaks.

5.4.3 Search Window

Current DNA sequencing algorithms at some point impose a search window which defines where they will search for a peak. This prediction of where the next peak is to be is implicitly relying on the correlation of the peak times as described in Chapter 3. The DNA-ML through the peak time jitter pdf allows a broad range of peak locations and identifies which are unlikely. The hard search window clearly eliminates candidates outside a certain range and does not discriminate amongst candidates within that range. Giddings [38], however, includes a confidence weighting for peaks within the range that factors in distance from expected peak location. Note that the DNA-ML implementation with peak estimation does impose a search window, albeit a large one. The full DNA-ML of Chapter 4 is likely to perform better than all these approaches as it can conceivably handle multiple valid peaks in what would otherwise be the same search window.

5.4.4 Multi-Peak Tests

Multiple unresolved peaks are addressed in some current algorithms by assessing whether the total area of the unresolved peak is closer to that of 1, 2, ..., or N isolated peaks. In using area as a criterion, the noise reduction benefit of averaging over several samples is gained. However, variation in the waveform which may encompass inflection points and other indicators of multiple peaks is lost. The DNA-ML explicitly considers all possible runs of bases and may take advantage of the waveform variations. Still, peak area may be a very powerful metric and approach the performance of the more sophisticated DNA-ML algorithm.
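The area test described above amounts to a nearest-multiple decision. A minimal sketch, assuming unit sample spacing so that the area is the sum of the samples, and a known area for a single isolated peak (the function and parameter names are hypothetical):

```python
def estimate_peak_count(samples, single_peak_area, max_peaks):
    """Number of isolated peaks (1..max_peaks) whose combined area is
    closest to the area of the unresolved peak."""
    observed_area = sum(samples)  # unit sample spacing assumed
    return min(
        range(1, max_peaks + 1),
        key=lambda n: abs(observed_area - n * single_peak_area),
    )
```

The sum discards the shape of the waveform entirely, which is the loss of inflection-point information noted in the text.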

5.4.5 Special Rules

Commercial automatic sequencing algorithms incorporate special rules to handle known special features of sequencing data. The rise in amplitude for a run of C's is a classic example of a special feature that has been mapped into a special rule. The DNA-ML algorithm as yet lacks such rules and would therefore be expected to perform not as well in regions where these rules apply.

5.4.6 Promise of Approach

For DNA time series exhibiting the features modelled in Chapter 3, the DNA-ML algorithm should be superior to more ad hoc algorithms. However, as seen above, many of the current techniques, while not optimum, incorporate processing that approaches that of the DNA-ML algorithm. The performance advantage of the DNA-ML algorithm may not be dramatic. Commercial algorithms may have an advantage for particular signal situations included in their modelling but not in the modelling of Chapter 3.


The DNA model and the DNA-ML algorithm do offer benefits beyond a reduction in error rate. They may guide the refinement of the entire sequencing process. For example, chemical parameters, such as ionic strength, may be adjusted to reduce peak time jitter. An additional benefit is offered by assigning probabilities to alternative sequences as this may aid the clinician in forming his diagnosis³.

³The user may request the evaluation of specific alternatives, or additional record keeping software may maintain a list of the most likely alternatives. In both cases, the cost function provides the key to extracting the probability of the alternative.

Chapter 6

Performance with Real Data

In this chapter, the performance of the DNA-ML algorithm will be examined

using two real data sets, one from a 6% cross-linked gel and one from a 4% gel.

Typically, electrophoresis with 6% gels allows the accurate processing of 400 bases in

six hours which is standard for research applications, while the 4% gels allow much

faster processing which is important in clinical applications. Thus, these data sets

permit insight into two different application areas. For both data sets, the results are

compared to those obtained by the Pharmacia ALF Sequencer. While simulated data

is useful for examining algorithm behavior with known models, real data extends the

analysis to include unmodelled effects. Judgement may be made as to whether the

modelling is sufficient to ensure effective algorithm operation.

6.1 Data Set 1 - Typical Case

In this section, the DNA-ML processor is applied to real data that is representative of that produced in research laboratories.


The data is from the electrophoresis on a Pharmacia ALF Sequencer of exon 3 of the α-A-crystalline gene of the eye¹. It was preprocessed to remove large scale trends prior to application of the algorithm described in this paper. Here 'large scale trends' refers to features that extend over more than fifty bases. First, the inter-lane mobility variations were removed as described in the Appendix. Then, the large scale intra-lane variations in mean mobility were modelled by fitting a fourth order polynomial to the entire set of peak times; these trends were removed by interpolating and resampling the data based on this polynomial to achieve uniform average mobility. The central region (bases 11-343) of the data was selected for analysis to remove artifacts associated with the start and end of the run. Exponential trends in the noise background level and peak amplitudes were then removed.

For convenience, the scaling during trend removal was such that the resulting mean amplitude, $\mu$, was unity; its standard deviation, $\sigma$, was 0.1; all other amplitude and noise offsets and variances quoted below are in normalized units based on this scaling. The noise whitening filter (Sections 4.3 and 5.3.2) was designed assuming the variance due to coloured noise was 0.005 and the variance due to white noise was 0.0000002. The non-stationarities (Section 3.3.2) were modelled by setting pulse width

as p , ( t ) = 13.48 + 0.00419t and total jitter variance as O;, = (1.44 + 0.0141~)'. The

input disturbance variance was O:, = (6/(4 + 6/(i - p * ) ) ) ~ : ~ and the measurement

variance was o;, = 0.67~:~ + 2; here the offset of two in the measurement variance

reflects the error of the peak estimator as obtained through simulation studies. The value of the jitter process auto-regressive weighting, $\beta$, will be considered in the next section. The average inter-peak interval was 14.7 samples. To allow for errors introduced in the trend removal process in addition to the original additive noise, the mean noise level was set to 0.1 and its variance set to 0.0169; these numbers were set based on empirical examination of the data. The M-algorithm carried 100 hypotheses. In peak estimation (Sections 5.1.1-2), the influences of one base forward and three previous bases were removed. The generic unit width pulse shape used was

The central Gaussian part of this pulse shape is a very close fit to observed pulses. The exponential tails are an approximation to the average seen in the ensemble; many real pulses were observed to have stronger tails while some did not exhibit tails at all (Figure 3-6).

¹An exon is a region of DNA which gets translated into protein. In between exons, DNA features introns which are regions that do not code for proteins.

6.1.2 Sensitivity to Parameters

The initial results with real data were much poorer than our simulations had led us to expect. Two factors were instrumental: mismatch in the noise whitening filter and mismatch in the setting of the jitter process auto-regressive weighting, $\beta$ (Section 3.3.2), in the Kalman predictor (Section 4.5.2). Addressing those problems eventually led to good performance.

To allow for possible mismatch, the same real data set was reprocessed for several different hypothesized $\beta$'s and fractional bandwidths (Section 5.3.2). Table 6.1 summarizes the results. Here undesirable results are identified as: (i) insertions - the true data has been split into two segments and additional base values placed between these segments; (ii) deletions - two segments of the true data have had intervening bases removed and the segments have been joined together; and, (iii) substitution errors - if on either side of a specific base the true sequence matched the recovered sequence but at the specific base the true sequence and recovered sequence did not match. The best results occurred for $\beta$ near 0.85 and fractional bandwidth near 0.25. Error rates increased on moving away from that locus, particularly when both $\beta$ and the fractional bandwidth were increased. However, it appears that the fractional bandwidth may be varied over a large range without significantly affecting results. Also included in the table is the jitter correlation time, $T_J$, defined as the interval in bases required for the jitter correlation to drop below 50%.
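To give a feel for the jitter correlation time, the sketch below computes it under an explicit assumption that is not confirmed by the text here: the jitter is treated as a first-order auto-regressive process whose correlation at a lag of n bases is $\beta^n$. The function name is hypothetical, and the values it returns need not match those tabulated.

```python
import math


def jitter_correlation_time(beta, level=0.5):
    """Lag in bases at which an assumed AR(1) correlation beta**lag falls to `level`."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    return math.log(level) / math.log(beta)


# Under the AR(1) assumption, beta = 0.85 gives a little over four bases.
t_j = jitter_correlation_time(0.85)
```

Under this assumption the correlation time grows rapidly as the weighting approaches one, so small changes in a large weighting correspond to large changes in how far back the predictor effectively looks.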


Table 6.1: Performance (insertions/deletions/substitution errors) as a function of algorithm parameter settings for 300 bases of real data.

                 Fractional Bandwidth
β      T_J       0.25      0.5      1

The $\beta$=0 case corresponds to a simple algorithm where the jitter (offset from a priori mean) in the next sample is assumed to be equal to the previous offset.

6.1.3 Error Comparison

Table 6.2 compares the performance of the 'optimum' algorithm with that of the internal algorithm of the Pharmacia ALF sequencer. Note that in four cases, both algorithms make the same error. Also, errors at bases 258 and 260 for the optimum algorithm correspond to the same event as jitter on the T lane led to an early T peak that cut off a C at 258 and caused it to appear at 260 instead. The ambiguous peaks with the Pharmacia ALF sequencer were due to its software allowing for heterozygotes - the presence of similar DNA molecules from mother and father that differ at only a few bases. Thus, rather than just an A at a point in the sequence it is possible to simultaneously have an A and a C at the same point. As it turns out, the sample was probably heterozygous AC at 118 as identified by the Pharmacia algorithm; here, the optimum algorithm's base selection reflected the GenBank sequence. For base 6, however, the Pharmacia algorithm was in error.

6.1.4 Error Analysis

Even though the Pharmacia algorithm performed slightly better than the DNA-ML algorithm, the DNA-ML algorithm has the potential to do better when biases introduced during pre-processing and during the estimation of noise and signal statistics are removed. For simulations with the same parameter settings as above and with pulse shape, non-stationarities and model parameters that are known exactly, the error rate was only 0.7%.

Table 6.2: Errors observed for DNA-ML algorithm (β=0.85, fractional bandwidth=0.25) and Pharmacia ALF internal algorithm for 300 bases of real data.

Base Number   DNA-ML                 Pharmacia
6             -                      Ambiguous
118           -                      Ambiguous *
215           Del. G in triplet      Del. G in triplet
218           Ins. G form triplet    -
252           Ins. G form pair       Ins. G form pair
258           Del. C in pair         Del. C in pair
260           Ins. C form pair       -
275           Del. G in pair         Del. G in pair
Error rate    2%                     1.7% (* not incl.)
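The error rates quoted in Table 6.2 follow from simple counts over the 300 bases. A small check, assuming six errors are counted for the DNA-ML algorithm and five for the Pharmacia algorithm once the starred ambiguous call at base 118 is excluded (the function name is hypothetical):

```python
def error_rate_percent(n_errors, n_bases):
    """Error rate as a percentage of the bases processed."""
    return 100.0 * n_errors / n_bases


dna_ml_rate = error_rate_percent(6, 300)     # 2.0
pharmacia_rate = error_rate_percent(5, 300)  # about 1.67
```

The difference between the two algorithms on this data set is therefore a single error.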

Examination of the actual errors encountered with real data allows us to infer the most likely mechanism for error generation. First, from Table 6.1, note that most of the errors were insertions or deletions. Further, in Table 6.2, it can be seen that the insertions and deletions concern pairs or triplets of consecutive bases on the same channel. From this we infer that the errors were probably due to the effects of ISI from adjacent bases.

For optimal processing of ISI, the pulse width and pulse shape of adjacent bases have to be known accurately. Errors in pulse width setting can lead to large errors in waveform comparison, particularly near the steep edges of the pulse. Also, the presumption of a single generic signal pulse shape could lead to similar errors. While the mainlobe is stable, there appears to be fluctuation from peak to peak with respect to the tail that follows the peak. Figure 3-6 gives examples of this fluctuation. For short sections of data, particular tail realizations could lead to significant differences in the spectra. These tail variations will affect the accuracy of the ISI removal process and thus the accuracy of the peak estimator and the conditional likelihood component of the cost (Equation 4.14).


Other factors may have led to a poorer performance with real data than with simulated data. Errors in trend removal could certainly lead to problems as the resulting offsets appear as discrepancies in the waveform comparison portion of the algorithm (Equation 4.14). As the reader may recall, an attempt was made to address this problem by including a noise mean offset and setting the algorithm noise variance to be larger than that expected in the DNA time series; these represent additional parameters which may not be at their best settings.

With respect to the whitening filter, it must be emphasized that the analysis of the

mismatch in Section 5.3.2 is based on a short noise only region. It is difficult to extract

noise data as most regions are contaminated by signal peaks. Empirically, the noise

also appears to be signal dependent; this would imply that one cannot characterize

the noise by performing electrophoresis in the absence of DNA. Additional work is

needed to properly characterize the noise.

Several other assumptions regarding the parameters of the DNA-ML algorithm

seem to have only a lirnited effect on the error rate. For example, the M-algorithm

carried only 100 hypotheses. Increasing this number should improve performance. On

the other hand, while similarly restricted to 100 hypotheses, the simulation achieved

much better performance. Also, the algorithm used only the three previous bases

and one future base in ISI removal; interference from bases beyond this region would

contribute directly to errors in peak estimation and waveform comparison. However,

from Table 2, interference from bases outside the window of bases used for ISI removal

does not appear to be a significant problem.
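The hypothesis pruning mentioned above can be illustrated with a minimal sketch of an M-algorithm: a breadth-first tree search that extends every surviving hypothesis by each possible base and then keeps only the M lowest-cost candidates. The branch cost below is a toy squared-error measure with a single-tap ISI term. It is an assumption made for illustration and is not the cost of Equation 4.14; the names `branch_cost`, `m_algorithm` and `simulate`, and the ISI and noise values, are likewise hypothetical.

```python
import random

BASES = "ACGT"

def predicted_level(channel, base, prev_base, isi):
    """Toy channel model: unit peak on the called lane plus leakage from the previous base."""
    return (1.0 if channel == base else 0.0) + (isi if channel == prev_base else 0.0)

def branch_cost(obs, base, prev_base, isi=0.5):
    """Squared error between one four-lane observation and its prediction."""
    return sum(
        (obs[k] - predicted_level(name, base, prev_base, isi)) ** 2
        for k, name in enumerate(BASES)
    )

def m_algorithm(observations, m=100, isi=0.5):
    """Breadth-first search that keeps only the m cheapest hypotheses at each step."""
    survivors = [(0.0, "")]  # (accumulated cost, hypothesized sequence)
    for obs in observations:
        extended = []
        for cost, seq in survivors:
            prev = seq[-1] if seq else None
            for base in BASES:
                extended.append((cost + branch_cost(obs, base, prev, isi), seq + base))
        extended.sort(key=lambda h: h[0])
        survivors = extended[:m]  # prune to the m best hypotheses
    return survivors[0][1]

def simulate(sequence, isi=0.5, noise_std=0.1, seed=0):
    """Generate noisy four-lane observations for a known sequence (synthetic data)."""
    rng = random.Random(seed)
    observations, prev = [], None
    for base in sequence:
        observations.append(
            [predicted_level(name, base, prev, isi) + rng.gauss(0.0, noise_std)
             for name in BASES]
        )
        prev = base
    return observations

truth = "GATTACAGGGGCT"
decoded = m_algorithm(simulate(truth), m=100)
print(decoded == truth)
```

In this sketch the work per base grows linearly with M, which is why the number of retained hypotheses is a direct trade-off between computational load and the risk of discarding the correct path.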

6.2 Data Set 2 - High Speed Gel

In this section, the DNA-ML processor is applied to real data obtained from a

gel set for fast electrophoresis. Rather than the 6% "bis" to acrylamide mixture

used in Data Set 1, Data Set 2 uses a 4% "bis" to acrylamide mixture. This implies

fewer cross-links, less mechanical resistance and faster passage of DNA molecules

through the gel. This fast gel data may foreshadow future clinical applications of

DNA sequencing where speed and productivity are highly valued.


6.2.1 Source / Rationale

Data Set 2 was also taken from the α-A-crystalline gene. This time a long segment

of DNA was selected spanning approximately 2000 bases. This included exon 2,

introns and exon 3. Amplification was via insertion into a plasmid and then in turn

into a bacterial culture (Data Set 1 used PCR for amplification). After amplification

the plasmids were nicked and changed from circular to linear form and then sequenced

using M13 Reverse as the sequencing primer. M13 Reverse is complementary to part of

the plasmid's own DNA. Thus, using M13 Reverse, the observed sequence corresponds

initially to plasmid DNA then that of the primer used to select the desired DNA for

amplification, intervening α-crystalline DNA, then exon 3, intron and exon 2. M13

Reverse is so named as it leads to the sequencing of the complementary strand and

therefore the sequence is both complementary and in reverse order to that of the true

sequence.
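To compare a read obtained with a reverse primer against the published (true) strand, the read has to be complemented and reversed. The following is a minimal sketch of that conversion; the function name and the example read are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(seq):
    """Return the complementary strand read in the opposite direction."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

observed = "TTGCA"                   # hypothetical read from the reverse primer
print(reverse_complement(observed))  # TGCAA
```

Applying the function twice returns the original read, which is a convenient sanity check when aligning sequencer output with a reference.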

This data set differs from Data Set 1 in several significant ways. First, the low

gel cross-linking leads to an inter-base separation which is much shorter (mean 10.1

samples as opposed to 14.7 in Data Set 1; sampling frequency, nominal voltage, etc.

were unchanged). Second, the pulse width expressed in units of peak separation

is much higher than for Data Set 1 (1.46 versus 0.92 for the initial bases). Thus,

Inter-Symbol Interference (ISI) is greater in Data Set 2. Longer DNA molecules are

present due to the longer template in Data Set 2. As well, their hydrolysis products

are present to contribute to the background noise. As the template is so long that

the sequencing polymerase typically does not succeed in making a full length copy,

the large end of segment peak seen in Data Set 1 is not seen in Data Set 2. Unlike

Data Set 1, Data Set 2 used 7-deaza-GTP instead of GTP as a substrate. This

molecule cannot form the hydrogen bonds that lead to secondary structure such as

hairpin loops. This may impact on the structure of the peak parameter covariances.

Finally, as sequencing proceeds in different directions and on complementary strands,

sequence dependent interactions with the polymerase are likely to be different, even in

the area of exon 3. Data Set 2 should thus demonstrate different properties than Data

Set 1. In particular, it should highlight the performance of the DNA-ML algorithm

in the high ISI environment typical of fast gels and long sequences.


Figure 6.1: Raw time series for Data Set 2. Individual channel data has been offset in this figure for clarity.

6.2.2 Model and Adjustments

Preprocessing of Data Set 2 was as described in the Appendix with one exception.

As the data set lacked the rising background due to end of segment and primer

labelling, compensation was not required for this trend. Figures 6.1 and 6.2 show the

DNA time-series before and after compensation. Interestingly, while the template was

2000 bases long, the sequencing copies appeared to die out beyond approximately 800

bases (i.e. 8000 samples at 10 samples per base). Apparently, the polymerase (Thermo

Sequenase), template and copy complex became unstable in this region. Attempts at

increasing the copy length through varying the ddNTP:dNTP ratio and Mg++ cation

concentration were unsuccessful.

As before, manual cursoring based on a priori sequence knowledge was used to

identify the correct peaks for use in the correlation models. To model to the end of

exon 3, 287 bases were cursored and used to estimate parameters. Figure 6.3 presents

the correlation in peak jitter. The structure of this correlation is clearly consistent

with that discussed in Chapter 3. Similarly, the correlation of the difference between


Figure 6.2: Compensated time series for Data Set 2 corresponding to first 1000 samples from Figure 6.1. Individual channel data has been offset in this figure for clarity. Top curve is for A channel with C, G and T channels presented in order from top.

adjacent peak times, Figure 6.4, is consistent with the earlier modelling. Data Set 2

used 7-deaza-GTP instead of GTP as a substrate and therefore should not suffer from

the effects of secondary structure such as hairpin loops. Thus, Figures 6.3 and 6.4

suggest that such structure is not a major contributor to the jitter covariance.

Measured β was 0.78. Average jitter process input variance was 2.86 (samples

squared) and measurement variance was 10.1. The high value of the latter is consistent

with the difficulty in obtaining accurate peak measurements when the peaks are wide

and the noise is high. The non-stationarity (Section 3.3.2) of the total jitter standard

deviation was described by σ = 2.7 + 0.0141i where i is the base number.
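As a rough consistency check on these figures, the sketch below assumes the jitter follows a first-order auto-regressive process, j(i) = β j(i-1) + w(i), with input variance 2.86. Under that assumption (the full model is in Chapter 3 and is not reproduced here) the stationary process variance is the input variance divided by (1 - β²). The variable names are illustrative only.

```python
import math

beta = 0.78                  # measured auto-regressive weighting
input_variance = 2.86        # jitter process input variance (samples squared)
measurement_variance = 10.1  # peak time measurement variance (samples squared)

# Stationary variance of an AR(1) process driven by white input.
process_variance = input_variance / (1.0 - beta ** 2)
process_std = math.sqrt(process_variance)

# Spread of a measured peak time if measurement error is independent of the jitter.
measured_std = math.sqrt(process_variance + measurement_variance)

print(round(process_std, 2), round(measured_std, 2))
```

The process standard deviation obtained this way is about 2.70 samples, close to the constant term of the fitted trend; whether that term was derived this way is not stated in this section.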

Neither amplitude nor pulse width had significant covariance values beyond lag

zero. After compensation, amplitude was unit mean with standard deviation of 0.3

(units of mean). Pulse width non-stationarity was described by pw = 14.74 + 0.0331i

where i is the base number. Note that the initial pulse width for the 4% gel is higher

than the 13.48 samples used with the 6% gel (Data Set 1) as the diffusion coefficient


Figure 6.3: Covariance of peak time jitter for Data Set 2, plotted against lag (bases). Inset is a logarithmic plot of the right side of the mainlobe.

is higher. The peak mainlobes appeared to be Gaussian. Due to high ISI, clean

examples of the tails of the pulse shape were unavailable. Therefore, a Gaussian

pulse shape was used in the DNA-ML algorithm for Data Set 2.
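The initial pulse widths quoted here, together with the mean inter-base separations given in Section 6.2.1, reproduce the normalized pulse widths stated there. A short check, with a hypothetical helper name:

```python
def normalized_pulse_width(pulse_width, base_spacing):
    """Pulse width expressed in units of mean peak separation."""
    return pulse_width / base_spacing

data_set_2 = normalized_pulse_width(14.74, 10.1)  # 4% gel
data_set_1 = normalized_pulse_width(13.48, 14.7)  # 6% gel

print(round(data_set_2, 2), round(data_set_1, 2))  # 1.46 0.92
```

A ratio above one means that each pulse extends over more than one base interval, which is the sense in which ISI is greater in Data Set 2.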

Following the procedures discussed in Section 5.3.2, coloured noise variance was

set to 0.0016 and white noise variance was set to 0.000009, both in units of the

peak mean squared (i.e. the square of the mean height of valid isolated peaks). Both

factors were extremely difficult to estimate due to the high ISI. Unlike the previous

data set, it was not possible to find a sufficiently wide "noise-only" region to form

a spectral estimate that could serve as a check on these parameter values.

Rather, the whitened data was examined. With the settings of the previous paragraph

the noise was not fully whitened as may be seen by the broad noise peaks in

Figure 6.5. Lowering the white noise variance to 0.000001 (i.e. reducing standard

deviation by a factor of 3) led to the whitened noise data seen in Figure 6.6. From


Figure 6.4: Covariance of difference between successive peak time jitter values for Data Set 2, plotted against lag (bases).

the bursts of noise at approximately 1500, 1800, 2000 and 2600 samples, it appears

that an additional high frequency, non-stationary noise process is present. In uncompensated

data, it is evidenced as a sudden jump in the intensity values. Also, for this

data set, the knowledge of the peak shape was poor which implies that our coloured

noise spectrum may be inaccurate. To avoid problems due to the burst noise and

inaccurate pulse shape knowledge, a white noise variance of 0.000009 was selected

which limited the emphasis on high frequencies after whitening. The fractional bandwidth

was set to 0.25 to reflect the reduced degrees of freedom available for waveform

comparison given the coloured data.
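The effect of the white noise setting on high frequency emphasis can be sketched numerically. The example assumes a noise spectrum made of a Gaussian-shaped coloured component (mirroring a Gaussian pulse) plus a flat white floor, and takes the whitening gain as the reciprocal square root of that spectrum. The spectral normalization and the pulse standard deviation of 6 samples are assumptions for illustration, not values from the thesis.

```python
import math

def noise_psd(f, coloured_var, white_var, pulse_sigma):
    """Assumed noise spectrum at normalized frequency f (cycles per sample)."""
    shape = math.exp(-(2.0 * math.pi * f * pulse_sigma) ** 2)  # equals 1 at f = 0
    return coloured_var * shape + white_var

def whitening_gain(f, coloured_var, white_var, pulse_sigma=6.0):
    """Magnitude response of a filter that flattens the assumed spectrum."""
    return 1.0 / math.sqrt(noise_psd(f, coloured_var, white_var, pulse_sigma))

def high_frequency_emphasis(white_var, coloured_var=0.0016):
    """Gain at the Nyquist frequency relative to the gain at DC."""
    return (whitening_gain(0.5, coloured_var, white_var)
            / whitening_gain(0.0, coloured_var, white_var))

emphasis_selected = high_frequency_emphasis(0.000009)
emphasis_lowered = high_frequency_emphasis(0.000001)
print(round(emphasis_lowered / emphasis_selected, 2))
```

Under these assumptions, lowering the white noise variance by a factor of nine raises the relative high frequency gain by roughly a factor of three, so burst noise and pulse shape errors at high frequencies are amplified accordingly.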

6.2.3 Error Comparison

The DNA-ML algorithm and the Pharmacia ALF algorithm experienced difficulty

in sequencing this data set. However, the mechanisms of error generation appear quite

different and suggest that direct detailed comparison is not meaningful. Therefore,

their performance is discussed separately.


Figure 6.5: Selected section of data after application of whitening filter with coloured noise variance 0.0016 and white noise variance 0.000009, all in units of peak mean squared. X-axis is time in samples.

The Pharmacia ALF algorithm sequenced the first 110 bases. Beyond that point

in the data, the algorithm deemed the data to be of too low a quality to sequence. The

Pharmacia ALF algorithm experienced 16 deletions in the first 60 bases. The problem

may have been due to the heavy ISI interacting with its base clock recovery algorithm.

As adjacent peaks were unresolved, fewer peaks were assumed to be present and so the

inter-base separation was estimated to be higher. Beyond base 60, enough instances

of isolated peaks were observed to correct this timing problem. An insertion error

occurred at base 71. No other errors occurred in the 110 bases the algorithm marked

(these 110 bases correspond to 125 true bases = 110 bases marked by Pharmacia +

16 deletions - 1 insertion). Average error rate was therefore 15.5%.

As will be discussed further in the error analysis section, several variants / parameter

settings were tried for the DNA-ML algorithm. For comparison with the

Pharmacia ALF, in the first 125 bases, the baseline case produced only 6 errors,

yielding an error rate of 4.8%. However, rather than being an indication of a vastly


Figure 6.6: Selected section of data after application of whitening filter with coloured noise variance 0.0016 and white noise variance 0.000001, all in units of peak mean squared. X-axis is time in samples.

superior algorithm, this factor of three improvement may be due to accurate a priori

knowledge of parameters such as mean base separation. The baseline case produced

23 errors in 200 bases sequenced (8 insertions / 7 deletions / 8 substitution errors).

Most of the errors (14) occurred in the region between base 150 and 200.
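The error rates quoted in this section follow from simple counts. The sketch below recomputes them; the helper name is hypothetical and the counts are those given in the text.

```python
def error_rate(errors, bases):
    """Error rate in percent."""
    return 100.0 * errors / bases

alf_errors = 16 + 1              # 16 deletions and 1 insertion
alf_true_bases = 110 + 16 - 1    # true bases spanned by the 110 marked bases
alf_rate = error_rate(alf_errors, 110)

ml_rate_125 = error_rate(6, 125)           # DNA-ML baseline, first 125 bases
ml_rate_200 = error_rate(8 + 7 + 8, 200)   # DNA-ML baseline, 200 bases

print(alf_true_bases, round(alf_rate, 1), round(ml_rate_125, 1), round(ml_rate_200, 1))
# 125 15.5 4.8 11.5
```

Note that the 15.5% figure is reproduced when the 17 errors are divided by the 110 marked bases; dividing by the 125 true bases would give 13.6%.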

6.2.4 Error Analysis

In this section, DNA-ML algorithm errors made in sequencing Data Set 2 are

examined. Small changes to algorithm parameters are investigated in hopes of further

reducing the error rate.

First, Figure 6.7 presents the region about the final point of the marked sequence of

the Pharmacia ALF. Just before sample 1250 is the final peak called by the Pharmacia

ALF which was a "C" (second curve from the top). In this area, simultaneous large

levels are seen on the C and T channels. As well, the G lane level is well away from

the assumed noise mean of 0.1. Presumably, the Pharmacia algorithm found the data


Figure 6.7: Compensated time series for Data Set 2 corresponding to bases 110-140. Individual channel data has been offset in this figure for clarity. Top curve is for A channel with C, G and T channels presented in order from top. DNA-ML algorithm estimates of peak amplitudes and times are indicated by "*". X-axis is time in samples. True and estimated sequences are indicated at top and bottom, respectively, coded as A=1, C=2, G=3 and T=4.

to be overly ambiguous here. As such ambiguity is generally found at the end of a

data set, it stopped processing on the assumption that subsequent data would be

poorer still.

The DNA-ML algorithm processed through this region but did incur errors. Table 6.3

presents the error locations and types for the baseline DNA-ML algorithm

and parameters. Substitution errors occurred at base 127 (≈sample 1270) where a

C was called instead of a G and at base 133 (≈sample 1335) where a G was called

instead of a C. In both cases, the erroneous peak occurred in the middle of a run.

The peak was called with a low level which should increase its cost. However, in these

cases, the erroneous peak may, through the ISI removal processing of the parameter

estimator, have reduced the ISI in the neighbouring peak estimates. The resulting

peak estimates may have been closer to the means and thus lowered the cost of the

hypothesis.

Table 6.3: Data Set 2 error type and location for DNA-ML with baseline parameter settings.

Error          Location (Base Number)
               Bases 1-125     Bases 126-200
Insertion      18, 70          146, 163, 164, 167, 168, 189
Deletion       8, 21, 47       151, 158, 181, 194
Substitution   43              127, 133, 153, 170, 173, 178, 191

Directing our attention at another interesting region, Figure 6.8 displays the compensated

but unwhitened time-series corresponding to the first 30 bases, together with

the DNA-ML algorithm estimates of peak times and amplitudes. The errors listed

in Table 6.3 at bases 8, 18 and 21 appear in Figure 6.8 at time samples 45, 170 and

205, respectively. The insertion error at base 18 / sample 170 stands out as the peak

in the A lane is so small relative to the valid peaks. This small peak was accepted

by the algorithm as the setting of the peak amplitude variance was high. In fact, the

small peak was within two standard deviations of the mean amplitude setting and

thus lay within the region in which 95% of valid peaks would lie, assuming correct

parameter settings. It is likely that the peak amplitude variance was set erroneously

high. This will be investigated further later in this section.

Figure 6.8: Compensated time series for Data Set 2 corresponding to first 30 bases. Individual channel data has been offset in this figure for clarity. Top curve is for A channel with C, G and T channels presented in order from top. DNA-ML algorithm estimates of peak amplitudes and times are indicated by "*". X-axis is time in samples. True and estimated sequences are indicated at top and bottom, respectively, coded as A=1, C=2, G=3 and T=4.

Figure 6.9 provides insight into the ISI suppression and parameter estimation processing.

A run of four G's, encompassing bases 49-52 of Data Set 2, is unresolved

in the raw data. There may be inflection points that indicate the presence of the

four peaks; however, these visual cues could also be simply additive coloured noise.

The whitening filter helps to resolve the peaks as the increased high frequency emphasis

sharpens the peaks. Noise is clearly emphasized as well, as evidenced by the

fluctuations between samples 450 to 480 and 510 to 540. In the matched filtered and

ISI suppressed data, the influence of the first two G's has been removed. The high

accuracy of the estimated positions of past peaks facilitates the easy removal of their

influence. The fourth G still appears as a substantial peak because the predicted

peak location was very inaccurate, which in turn misaligned the replica used in cancellation

and permitted much of the peak to remain. Still, after this processing, the

peak of the third G is strongest and simple peak picking will yield good estimates of

amplitude and peak time. In a more severe scenario, noise and the next peak location

could have led to the next peak being strongest after this processing. In such a case,

if the search window was wide enough to encompass the next peak, then the next

peak would be selected and the third G would be deleted. Such errors can be difficult

to classify. It may be that the deletion at base 8 / sample 45 in Figure 6.8 is due to

such a phenomenon.
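The sensitivity of replica cancellation to timing error, described above for the fourth G, can be illustrated with a small sketch. It assumes a unit-amplitude Gaussian pulse with a standard deviation of 4 samples (a hypothetical value) and measures how much of the pulse survives when the subtracted replica is offset in time. The function names are illustrative only.

```python
import math

def gaussian_pulse(t, centre, sigma=4.0, amplitude=1.0):
    """Assumed Gaussian pulse shape evaluated at sample time t."""
    return amplitude * math.exp(-((t - centre) ** 2) / (2.0 * sigma ** 2))

def residual_peak(timing_error, centre=50.0, n=100):
    """Largest absolute residual left after subtracting a mistimed replica."""
    return max(
        abs(gaussian_pulse(t, centre) - gaussian_pulse(t, centre + timing_error))
        for t in range(n)
    )

for err in (0.0, 1.0, 5.0):
    print(err, round(residual_peak(err), 3))
```

With an exact position estimate the pulse cancels completely, a one-sample error leaves a small residual, and an error comparable to the pulse width leaves most of the pulse in place. This matches the contrast drawn above between well-estimated past peaks and a poorly predicted future peak.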

Table 6.3 indicates the presence of a number of errors centered around base 170.

Figure 6.10 presents the compensated but unwhitened time series in this region together

with the DNA-ML algorithm's estimates of peak amplitudes and locations.


Figure 6.9: Waveforms associated with 4 "G" run from base 49 to 52 selected to illustrate estimation of third "G". Raw waveform is compensated but not whitened. Dashed curve is formed from whitened data by subtracting estimated contribution from previous two bases and predicted contribution from next base, and then applying matched filter.

Evident in the area near base 170 / sample 1700 is a discontinuity in the data. This

appears on all lanes, though shifted in time due to lane alignment processing. The

event is likely due to temporary removal of field voltage as an operator might do to

allow visual inspection of the gel. Such a discontinuity is greatly emphasized by the

high-pass action of a whitening filter as may be seen directly about sample 1700 in

Figure 6.5. The time extent of the event is also emphasized by the whitening filter.

This leads to the errors reported at bases 167, 168, 170 and 173.

Inspection of the time-series and peak estimates about other errors revealed two

other phenomena which were contributing to errors. Eight of the errors could be

attributed to the valid peaks being weak. Four of these occurred on the C lane and

three on the G lane. The implication is that these lanes were scaled low. Another

group of six errors appeared to be due to a peak time jitter variance which may have

been too high. Five of these errors were insertions where the correct peaks appeared


Figure 6.10: Compensated time series for Data Set 2 corresponding to bases 155 to 185. Individual channel data has been offset in this figure for clarity. Top curve is for A channel with C, G and T channels presented in order from top. DNA-ML algorithm estimates of peak amplitudes and times are indicated by "*". X-axis is time in samples. True and estimated sequences are indicated at top and bottom, respectively, coded as A=1, C=2, G=3 and T=4.

near the anticipated times but there were earlier noise or resolution problems that

suggested inserting an erroneous peak. With a smaller jitter variance setting these

erroneous peaks might not have been accepted. As these events were insertions that

imply a shorter inter-base interval, the other mechanism that may lead to these errors

is the bias towards the shorter length hypothesis given two unequal length hypotheses

(see Section 5.2.1).
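The link between the jitter variance setting and the acceptance of mistimed peaks can be sketched with a Gaussian cost. The example treats the timing deviation of a candidate peak as a zero-mean Gaussian variable and computes the extra negative log-likelihood incurred by a deviation of 6 samples. This is a simplification, since the thesis predicts peak times from correlated jitter. The deviation of 6 samples and the comparison value of 6.7 are illustrative choices; 10.1 is the baseline measurement variance.

```python
import math

def gaussian_cost(deviation, variance):
    """Negative log-likelihood of a zero-mean Gaussian deviation."""
    return 0.5 * math.log(2.0 * math.pi * variance) + deviation ** 2 / (2.0 * variance)

def timing_penalty(deviation, variance):
    """Extra cost of a mistimed peak relative to a perfectly timed one."""
    return gaussian_cost(deviation, variance) - gaussian_cost(0.0, variance)

penalty_high_variance = timing_penalty(6.0, 10.1)
penalty_low_variance = timing_penalty(6.0, 6.7)
print(round(penalty_high_variance, 2), round(penalty_low_variance, 2))  # 1.78 2.69
```

A larger variance setting penalizes an early or late peak less, so a spurious peak near, but not at, the expected time is more easily accepted as an insertion.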

This analysis suggests that parameter settings may be adjusted for better results.

Based on the above observations, a large number of parameter settings were investi-

gated and the modifications yielding the best result were: (1) peak amplitude variance

reduced from 0.09 to 0.0225; (2) jitter measurement variance reduced from 10.1 to

6.7 (with attendant modification of total jitter variance); (3) C and G lanes scaled

by 1.15; and, (4) fractional bandwidth set to 0.25 (baseline case had unit fractional


bandwidth). As shown in Table 6.4, total errors were reduced from 23 to 21 over

two hundred bases, but, more importantly, in the 125 bases corresponding to the

110 bases marked by the Pharmacia ALF algorithm, the error rate was only 3.2% as

opposed to the Pharmacia ALF's 15.5%.

Table 6.4: Data Set 2 error type and location for DNA-ML with modified parameter settings.

Error          Location (Base Number)
               Bases 1-125     Bases 126-200
Insertion      70              146, 167, 168, 175, 176, 189, 195
Deletion       2, 47           151, 173, 181
Substitution   43              127, 142, 161, 170, 178, 191, 197

As predicted, changing the peak amplitude variance removed the error at base 18.

It also removed errors at bases 21 and 158 though new errors were introduced near

the end of the run at bases 195 and 197. Scaling C and G lanes removed errors at

bases 133 and 153. The errors at bases 163 and 164, attributed to high jitter variance,

have also been removed. Reducing the jitter measurement variance without changing

the fractional bandwidth increased rather than lowered the error rate; clearly, the

different parameters interact to determine final performance. In general, the modifications

improved results early in the data set but were somewhat offset by new errors

introduced later in the data set.

6.2.5 Assessment and Significance

For good quality data as demonstrated with Data Set 1, the DNA-ML algorithm

achieved performance comparable to the commercial Pharmacia ALF algorithm. For

data with high ISI (Data Set 2), the DNA-ML algorithm appears to offer as much

as a four-fold improvement in error rate relative to the Pharmacia ALF algorithm.

However, the validity of the comparison is limited due to differences in the initialization

parameters for the two algorithms. Nonetheless, the preliminary investigation

suggests that the DNA-ML algorithm has significant potential in dealing with fast

gel data. This in turn implies that the DNA-ML algorithm may offer a performance


improvement in clinical applications.

CHAPTER 7

Conclusions

7.1 Thesis Summary

This thesis has provided the foundations for rigorous study of the DNA time-series.

The characteristics of the time-series arising from DNA sequencing have been investigated,

both from a theoretical and a statistical perspective. A statistical model has

been developed that reflects the local statistics of the DNA time-series. The maximum

likelihood sequence detector has been derived for this model. The implementation

of the processor addressed issues ranging from computational loading through to the

comparison of hypotheses of different lengths. Real data has been used to investigate

the performance of the processor. In comparison with a commercial algorithm, the

results indicate improved performance in situations where there is high overlap between

the peaks of adjacent bases. This is likely to be the case when DNA sequencing

is employed in high throughput clinical applications.


7.2 Thesis Contributions

The major contributions of this thesis are:

(1) The creation of the first statistical model of the DNA time-series;

(2) The derivation of the first optimal algorithm for DNA sequencing.

The development of the DNA time-series model focussed on ensuring its utility

for sequencing algorithm development. A generic peak shape, parameterized by peak

time, amplitude and width, is used to represent the signal peaks. The characterization

of the fluctuation of peak parameters includes their point probability density functions

and their correlations with neighbouring peaks. A practical noise model is proposed

consisting of a white noise component and a noise component with spectra similar to

that of the signal itself. The noise and peak parameter processes are non-stationary

with variances increasing with base number. The complete model can be used to

generate simulations for the comparison and evaluation of sequencing algorithms.

Based on the DNA time-series model, an optimal Maximum Likelihood DNA sequencing

algorithm was derived. It selects the hypothesized sequence that maximizes

the probability of the observed signals. The uncertainty associated with parameter

values is addressed by maintaining multiple hypotheses not just for the different possible

information sequences but also for the different possible parameter values. The

structure of the algorithm features two main branches, one that compares waveforms

based on hypothesized parameters, and, one that predicts (and costs) parameter val-

ues.

Additional significant contributions include:

(1) The creation of the hypothesis cost function which may be used to provide a

probability for different possible sequences to allow the user to directly assess sequence

alternatives;

(2) The recognition of the asynchrony between bases and samples of the DNA

time-series and the problems it leads to in comparing hypotheses;

(3) The introduction and application of techniques from communication theory,

including the Fano metric and M-algorithm, to DNA sequencing.

The first of these has potential clinical value as it allows meaningful comparison


between two possible genetic sequences. As to the second, asynchrony at the level

seen in DNA time-series is not found in communication systems. The resulting problems

associated with unequal length hypotheses that incorporate the same number of

symbols have not been dealt with elsewhere. The third relates to the introduction of

a valuable new set of tools to the DNA sequencing community.
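As an illustration of how a hypothesis cost could be turned into a probability for competing sequences, the sketch below assumes each cost is a negative natural-log likelihood, that the listed hypotheses are the only alternatives, and that they are equally likely a priori. The costs and sequences are invented for the example, and the length compensation needed for unequal length hypotheses is ignored.

```python
import math

def hypothesis_probabilities(costs):
    """Normalize exp(-cost) over the candidate hypotheses."""
    best = min(costs.values())  # subtract the minimum for numerical stability
    weights = {seq: math.exp(-(cost - best)) for seq, cost in costs.items()}
    total = sum(weights.values())
    return {seq: w / total for seq, w in weights.items()}

costs = {"GATC": 12.0, "GATT": 13.0, "GAC": 15.0}  # hypothetical costs
probs = hypothesis_probabilities(costs)
for seq, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(seq, round(p, 3))
```

Only cost differences matter in this normalization, so a user comparing two candidate sequences needs the difference in their costs rather than the absolute values.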

7.3 Suggestions for Future Research

This work provides the foundation and the structure for further research into the

inter-dependencies between the underlying chemistry and physics of DNA sequencing

and the other properties of the optimal sequencer.

Physical modelling of DNA electrophoresis has concentrated on gross behavior.

New work is needed to provide a physical model which fully explains the correlation

observed in peak time jitter. One could study the relationships between the

ionic strength of the solution (known to affect the persistence length) and the jitter

auto-regressive weighting, β. Similarly, there has yet to be a direct verification and

assessment of the chemical noise mechanisms described in this thesis. This would

be invaluable in ensuring model fidelity. For example, an experiment could be conducted

to measure the production of hydrolysis products with time and temperature

as the key variables. Further, the chemical and/or physical mechanism behind the

exponentially decaying tails of the pulse shape should be elucidated.

Further development is required to make the sequencing algorithm usable by

molecular biologists. As was seen in Chapter 6, errors in parameter settings can

have a very significant effect on system performance. On-line estimation of param-

eters is necessary as well as tracking and correction of large scale parameter trends.

The techniques of system identification should be directly applicable.
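One way such on-line estimation might look is sketched below: a recursive least squares (RLS) fit of a linear trend y = a + b·i, updated one base at a time. RLS is chosen here only as a representative system identification technique and is not a method taken from the thesis. The synthetic data reuse the pulse width trend reported for Data Set 2; the function name and initial covariance are hypothetical.

```python
def rls_linear_trend(samples, forgetting=1.0, p0=1e6):
    """Recursive least squares estimate of (a, b) in y = a + b*i."""
    a, b = 0.0, 0.0
    # Symmetric 2x2 covariance matrix [[p11, p12], [p12, p22]].
    p11, p12, p22 = p0, 0.0, p0
    for i, y in samples:
        x1, x2 = 1.0, float(i)
        # Gain vector k = P x / (lambda + x' P x).
        px1 = p11 * x1 + p12 * x2
        px2 = p12 * x1 + p22 * x2
        denom = forgetting + x1 * px1 + x2 * px2
        k1, k2 = px1 / denom, px2 / denom
        # Correct the parameters using the prediction error.
        err = y - (a * x1 + b * x2)
        a += k1 * err
        b += k2 * err
        # Covariance update P = (P - k x' P) / lambda.
        p11, p12, p22 = (
            (p11 - k1 * px1) / forgetting,
            (p12 - k1 * px2) / forgetting,
            (p22 - k2 * px2) / forgetting,
        )
    return a, b

# Noise-free synthetic pulse widths following pw = 14.74 + 0.0331 i.
data = [(i, 14.74 + 0.0331 * i) for i in range(287)]
a_hat, b_hat = rls_linear_trend(data)
print(round(a_hat, 2), round(b_hat, 4))
```

A forgetting factor slightly below one would let the estimate follow slow changes in the trend, at the cost of noisier parameter values.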

Finally, the probabilistic description and hypothesis cost function developed in

this thesis can be applied to specific genetic tests as opposed to general sequencing.

Here, the hypothesis cost function may be assessed for the cases of mutation present

or mutation absent at base N. The early stages of this work are now underway at the

Institute of Bio-Medical Engineering of the University of Toronto.

Bibliography

DeLisi, C., 'The human genorne project", American Scientid, V. 76, 1988, pp.488-

493.

Sanger, F., Nicklen, S., Coulson, "DNA sequencing with chain terminating in-

hibitors", froc. Natl. Acad. Sci., Vol. 74, pp.5463-5467, 1977.

Davies, S. W., Eizenman, hi., Pasupathy, S., "Optimal structure for automatic

processing of DNA sequences", IEEE Trans. Biomedical Eng., submitted for pub-

lica t ion.

Hunkapiller, T., Kaiser, R., Koop, B., Hood, L., "Large-scale and aiitomated DNX

sequence determination", Science, V.254, 1991, pp.59-67.

Church, G., Gryan, G., Lakey, N., Kieffer-Higgins, S., Mintz, L., Temple, SI.,

Rubenfield, M., Ghazizadeh, H., Robison, K., Richterich, P., "Automated multi-

plex sequencing", pp.11-15 in Automated DNA Sequenca'ng and Analysis, Adamst

M., Fields, C., Venter, J., (editors), Academic Press, New York, 1994.

Burks, C., "DNA sequence assembly", IEEE Engineering in Medicine and Biology,

Nov./Dec., 1994, pp.771-773.

Myers, E., "Advances in sequence assernbly", pp.231-238 in Automated DNA Se-

penczng and Analysis, Adams, M., Fields, C.' Venter, J., (editors) , Academic

Press, New York, 1994.

Bibliography 124

181 Forney, G.D., Jr., "Maximum-likelihood sequence estimation of digital sequences

in the presence of intersymbol interference", IEEE Dans. Infunnation Theory,

V.IT-18, May, 1972, pp.363-378.

191 Slater, G.W., "Electrophoresis Theories", Chap. 2, pp.24-66 in Analysis of Nu-

cleic Acids by Captl laq Electrophoresis, Keller, C., (editor) , C hromatographia

CE Series, Vol. 1. Vieweg, Wiesbaden, Germany, 1997.

1101 Caspers, G. J., Pennings, J., de Jong, W.W., "A partial cDNA sequence corrects

the human alpha A crystallin primary structure", Ezp. Eye Res., V.59, 1904,

pp. 125-126.

11 11 Casey, D., "Primer on rnolecu1a.r genetics", 1991-92 DOE Humun Genome Pro-

gram Report, U.S. Dept. of Energy, Oak Ridge, Tenn., USA, 1992.

1121 Saiki, R.K., Gelfand, D.H., Stoffel, S., Scharf, S.J., Higuchi, R.? Horn, G.T.,

Mullis, K.B., Erlich, KA. , "Primer-directed enzymatic amplification of DN.4 wi t h

a thermostable DNA polymerase", Science, V.239, 1988, pp.487-491.

1131 Lodish, H., Baltimore, D., Birk, A., Zipursky, S.L., Matsudaira, P., Darnell, J . ,

Molecular Cell Biologg, 3rd. ed., Scientific American Books, W H . Freeman, N.Y.,

1995.

1141 Tindall, KR., Kunkel, TA., "Fidelity of DNA synthesis by the tliermus aquaticus

DNA polymerase", Biochemzstry, V.27, 1988, pp.6008-6013.

[151 Eckert, K A . , Kunkel, TA., "High fidelity DNA synthesis by the Tliermus aquati-

cus DNA polymerase", Nucleic Acids Research, V. 18, N.3, 1990, pp. 3739-3744.

1161 Clark, J.M., "Wovel non-templated addition reactions catalyzed by procaryotic

and eucaryot ic DN.4 polymerases", Nucleic Acids Research, V. 16, N.20, 1988:

pp.9677-9686.

[171 Clark, LM., Joyce, C.M., Beardsley, G.P., "Novel blunt-end addition react ions

catalyzed by DNA polymerase I of Escherichia col?', J. Mol. BioL, V. 198, 1987,

pp.123-127.

Bibliography 125

[lSI Tabor, S., Richardson, C.C., "Effect of manganese ions on the incorporation of

dideoxynucleotides by bacteriophage T7 DNA polymerase and Escherichia coli

DNA polymerase P', Proc. Natl. Acad. Sci. USA, V.86, 1989, pp.4076-4080.

1191 Ke, S-H, Wartell, R.M., "Influence of neighboring base pairs on the stability of

single base bulges and base pairs in a DNA fragment", Biochemistry, V.34, 1995,

pp.4593-4600.

1201 Suzuki, T., Ohsumi, S., Makino, K., "Mechanistic studies on depurination and

apiirinic site chain breakage in oligodeoxyribonucleotides", Nucleic Acids Research,

V.22, N.23, 1994, pp.4997-5003.

1211 Viovy, J.L., Duke, T., Caron, F., 'The physics of DNA electrophoresis", Contem-

poranj Physics, V.33, N.1, 1992, pp.25-40.

1221 Fang, Y., Zhang, J.Z., Hou, J.Y., Lu, H., Dovichi, N.J., "Activation cnergy

of the separation of DNA sequencing fragments in denaturing noncross-linked

polyacrylamide by capillary electrophoresis", Electrophoreszs, V. 17, 1996, pp. 1.136-

1442.

1231 Kamahori, A L , Kambara, H., "Characteristics of single-stranded DNA separation

by capillary gel electrophoresis", Electrophoreszs, V. 17, 1996, pp. 1476- 1484.

1241 Maurer, H.R., Dzsc electrophoresis and related techniques of polyacn~famide gel

electrophoresis, Walter de Gruyter, Berlin, 1971.

1251 Yarmola, E., Sokoloff, H., Chrambach, A., 'The relative contribution of disper-

sion and diffusion to band spreading (resolution) in gel electrophoresis", EIec-

trophoreszs, V.17, 1996, pp. 1416-1419.

[261 Smith, L.M., Kaiser, R.J., Sanders, J.Z., Hood, LX., 'The synthesis and use of

fluorescent oligonucleotides in DNA sequence anaiysis", Methods in Enzymology,

V.155, 1987, pp.260-301.

[271 Slater, G., informa1 communication.

Bibliography 126

[281 Strutz, K., Stellwagen, N.C., "Intrinsic curvature of plasmid DNA's analyzed by

polyacrylamide gel electrophoresis", Electrophoresis, V.17, 1996, pp.989-995.

1291 Wheeler, D.L., Chrarnbach, A., "A computer simulation accoiinting for dissimilar

electrophoretic behavior between two similarly curved DNA fragments due to a

difference in arc length", Electrophoresis, V. 15, 1994, pp.885-889.

1301 Bendat , J .S., Engineering Applications of Correlation and Spectral Analysis, 2nd

ed., J. Wiley, New York, 1993.

1311 Elias, H.-G., An Introduction to Polymer Science, VCH, Weinheim, Gerrnany.

1997.

[32] Tinland, B., Pluen, A., Sturm, J., Weill, G., "Persistence length of single-stranded DNA", Macromolecules, V.30, N.19, 1997, pp.5763-5765.

[33] Brown, T.A., DNA Sequencing: The Basics, Oxford University Press, New York, 1994.

[34] Oppenheim, A.V., Schafer, R.W., Digital Signal Processing, Prentice-Hall, Englewood Cliffs, N.J., 1975.

[35] Xu, Y., Mural, R.J., Uberbacher, E.C., "Correcting sequencing errors in DNA coding regions using a dynamic programming approach", Computer Applications in Biosciences, Vol. 11, No. 2, pp.117-124, 1995.

[36] Wu, Y., Mislan, D., "Automated DNA sequencing: An image processing approach", Applied and Theoretical Electrophoresis, No. 3, pp.223-228, 1993.

[37] Berno, A.J., "A graph theoretic approach to the analysis of DNA sequencing data", Genome Research, Vol. 6, No. 2, pp.80-91, 1996.

[38] Giddings, M., Brumley, R., Haker, M., Smith, L., "An adaptive, object-oriented strategy for base calling in DNA sequencing analysis", Nucleic Acids Research, Vol. 21, No. 19, pp.4530-4540, 1993.

[39] Ives, J., Gesteland, R., Stockham, T., "An automated film reader for DNA sequencing based on homomorphic deconvolution", IEEE Trans. Biomedical Eng., Vol. 41, No. 6, pp.509-519, June 1994.

[40] Tibbetts, C., Bowling, J., "Method and apparatus for automatic nucleic acid sequence determination", United States Patent No. 5365455, Nov. 15, 1994.

[41] Tibbetts, C., Bowling, J., Golden, J., "Neural networks for automated basecalling of gel-based DNA sequencing ladders", pp.219-229 in Automated DNA Sequencing and Analysis, Adams, M., Fields, C., Venter, J., (editors), Academic Press, New York, 1994.

[42] Maxam, A.M., Gilbert, W., "A New Method for Sequencing DNA", Proc. Nat. Acad. Sci., USA, V.74, p.560, 1977.

[43] Roberts, L., Science, V.236, N.806, 1987.

[44] Bowling, J., Bruner, K., Cmarik, J., Tibbetts, C., "Neighboring nucleotide interactions during DNA sequencing gel electrophoresis", Nucleic Acids Research, Vol. 19, No. 11, pp.3089-3097, 1991.

[45] Tibbetts, C., Golden, J.B., III, Torgersen, D., "Parsing of genomic graffiti", pp. 183-182 in Genetic Mapping and DNA Sequencing, IMA Vol. Math. App., V.81, Speed, T., Waterman, M.S., (editors), Springer Verlag, New York, 1996.

[46] De Gennes, P.G., "Reptation of a polymer chain in the presence of fixed obstacles", J. Chemical Physics, V.55, N.2, 1971, pp.572-579.

[47] Lumpkin, O.J., Dejardin, P., Zimm, B.H., "Theory of gel electrophoresis of DNA", Biopolymers, V.24, 1985, pp.1573-1593.

[48] Muthukumar, M., Baumgartner, A., "Effects of entropic barriers on polymer dynamics", Macromolecules, V.22, 1989, pp.1937-1941.

[49] Zimm, B.H., "A gel as an array of channels", Electrophoresis, V.17, 1996, pp.996-1002.

[50] Slater, G.W., Guo, H.L., "An exactly solvable Ogston model of gel electrophoresis: 1. The role of the symmetry and randomness of the gel structure", Electrophoresis, V.17, 1996, pp.977-988.

[51] Slater, G.W., Rousseau, J., Noolandi, J., Turmel, C., Lalande, M., "Quantitative analysis of the three regimes of DNA electrophoresis in agarose gels", Biopolymers, V.27, 1988, pp.509-524.

[52] Carlsson, C., Larsson, A., Jonsson, M., Norden, B., "Dancing DNA in capillary solution electrophoresis", J. American Chemical Society, V.117, 1995, pp.3871-3872.

[53] Smith, S.B., Aldridge, P.K., Callis, J.B., "Observation of individual DNA molecules undergoing gel electrophoresis", Science, V.243, 1989, pp.203-206.

[54] Schwartz, D.C., Koval, M., "Conformational dynamics of individual DNA molecules during gel electrophoresis", Nature, V.338, 1989, pp.520-522.

[55] Lee, E., Messerschmitt, D., Digital Communication, (2nd Ed.), Kluwer, New York, 1994.

[56] Proakis, J.G., Digital Communications, (3rd Ed.), McGraw-Hill Inc., New York, 1995.

[57] Van Trees, H., Detection, Estimation and Modulation Theory, John Wiley & Sons, New York, 1968.

[58] Falconer, D., Salz, J., "Optimal reception of digital data over the Gaussian channel with unknown delay and phase jitter", IEEE Trans. Inform. Theory, Vol. 23, No. 1, January, 1977, pp.117-126.

[59] Georghiades, C., "Optimal delay and sequence estimation from incomplete data", IEEE Trans. Inform. Theory, Vol. 36, No. 1, January, 1990, pp.202-208.

[60] Moeneclaey, M., "Synchronization problems in PAM systems", IEEE Trans. Commun., Vol. 28, No. 8, pp.1130-1136.

[61] Anderson, J.B., Mohan, S., "Sequential coding algorithms: a survey and cost analysis", IEEE Trans. Commun., Vol. 32, Feb., 1984, pp.169-176.

[62] Fano, R.M., "A heuristic discussion of probabilistic decoding", IEEE Trans. Inf. Th., V.IT-9, Apr., 1963, pp.64-73.

[63] Massey, J.L., "Variable-length codes and the Fano metric", IEEE Trans. Inf. Th., V.IT-18, Jan., 1972, pp.196-198.

[64] Xiong, F., Zerik, A., Shwedyk, E., "Sequential sequence estimation for channels with intersymbol interference of finite or infinite length", IEEE Trans. Commun., V.38, N.6, June, 1990, pp.795-804.

[65] Davies, S.W., Eizenman, M., Pasupathy, S., "Exploiting multi-channel information in systems with high symbol clock variance", in Proceedings, Canadian Workshop on Information Theory, Toronto, Canada, June, 1997, pp.91-94.

[66] Yu, X., Pasupathy, S., "Innovations-based MLSE for Rayleigh fading channels", IEEE Trans. Commun., Vol. 43, pp.1534-1544, Feb./Mar./Apr., 1995.

[67] Kumar, P.R., Varaiya, P., Stochastic Systems: Estimation, Identification and Adaptive Control, Prentice-Hall, Englewood Cliffs, New Jersey, 1986.

[68] Aulin, T., "Breadth first maximum likelihood sequence detection", submitted to IEEE Trans. Inf. Th.

[69] Lodge, J.H., Moher, M.L., "Maximum likelihood sequence estimation of CPM signals transmitted over Rayleigh flat-fading channels", IEEE Trans. Commun., Vol. 38, No. 6, June, 1990, pp.787-794.

[70] Ewing, B., Hillier, L., Wendl, M.C., Green, P., "Base-calling of automated sequencer traces using Phred. 1. Accuracy assessment", Genome Research, V.8, N.3, 1998, pp.175-185.

[71] Elder, J.K., "Maximum entropy image reconstruction of DNA sequencing gel autoradiographs", Electrophoresis, V.11, 1990, pp.440-444.

APPENDIX A

Large Scale Trend Removal

DNA time-series exhibit several large scale features that stretch out over tens

or hundreds of bases. These features include mis-alignment between time-series and

amplitude offsets. Preprocessing is employed to remove these features prior to the

application of the DNA-ML algorithm.

For multi-lane sequencers, channel time series for the different base types are from

electrophoresis down different lanes of the gel. Variation in gel properties between

lanes leads to mobility variations and a tendency for mis-alignment of peaks in the

time series. Usually the data from different lanes is initially in synchrony but tends

to drift out of alignment with increasing sequence position. Automatic sequencing

algorithms typically use a different mobility constant for each lane to compensate for

this drift.

Our preprocessing goes a step beyond this linear mobility correction by using a quadratic to account for the mobility variation between lanes. To obtain this compensation, the measured peak times for each channel are linearly interpolated so as to obtain values at every sequence position, even if the sequence does not have a base of that type at that position. The result of this operation for the C, G and T channels is then divided by the result for the A channel (Figure A.1). These normalized data are smoothed by least-squares fitting of a quadratic to them. Thus, the quadratic represents the large scale variation in mobility between the two lanes over the entire data set. Data used in Chapter 6 have been temporally interpolated and then resampled based on the respective quadratics to remove the inter-lane variation.

Figure A.1: Inter-channel peak time variation: plot of the ratio of "T" channel peak times to those of the "A" channel against base position (horizontal axis: BASE, 0 to 350), for data used in Chapter 3.
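The interpolation, normalization and quadratic fit described above can be sketched as follows. This is a minimal illustration in plain Python, not the code used for this work: the function names, the linear extrapolation at the ends of the data, the rescaling of base position to [0, 1] before fitting, and the choice to compensate by dividing peak times by the fitted ratio are all assumptions made for the example.

```python
from bisect import bisect_left


def interpolate_peak_times(positions, times, n_bases):
    """Linearly interpolate measured peak times to every sequence position.

    positions: sorted base indices at which this channel has a peak (>= 2)
    times:     measured peak time at each of those positions
    Positions outside the measured range are linearly extrapolated from the
    nearest pair of peaks (an assumption; the text does not specify this).
    """
    out = []
    for p in range(n_bases):
        j = bisect_left(positions, p)
        j = min(max(j, 1), len(positions) - 1)  # index of the bracketing segment
        p0, p1 = positions[j - 1], positions[j]
        t0, t1 = times[j - 1], times[j]
        out.append(t0 + (t1 - t0) * (p - p0) / (p1 - p0))
    return out


def _det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_quadratic(xs, ys):
    """Least-squares fit y ~ c0 + c1*x + c2*x**2 via the normal equations."""
    s = [sum(x ** k for x in xs) for k in range(5)]               # sums of x^0..x^4
    b = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    m = [[s[i + j] for j in range(3)] for i in range(3)]
    d = _det3(m)
    coeffs = []
    for col in range(3):                                          # Cramer's rule
        mc = [row[:] for row in m]
        for r in range(3):
            mc[r][col] = b[r]
        coeffs.append(_det3(mc) / d)
    return tuple(coeffs)


def mobility_ratio_fit(chan_positions, chan_times, ref_positions, ref_times, n_bases):
    """Fit a quadratic to the ratio of one channel's peak times to the A channel's."""
    chan = interpolate_peak_times(chan_positions, chan_times, n_bases)
    ref = interpolate_peak_times(ref_positions, ref_times, n_bases)
    u = [p / (n_bases - 1) for p in range(n_bases)]  # base position scaled to [0, 1]
    ratio = [c / r for c, r in zip(chan, ref)]
    return fit_quadratic(u, ratio)


def compensate(times, positions, coeffs, n_bases):
    """Divide measured peak times by the fitted ratio (assumed form of compensation)."""
    c0, c1, c2 = coeffs
    out = []
    for t, p in zip(times, positions):
        u = p / (n_bases - 1)
        out.append(t / (c0 + c1 * u + c2 * u * u))
    return out
```

Rescaling the base index to [0, 1] keeps the normal equations well conditioned for sequences of a few hundred bases; a library least-squares routine would serve equally well.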

For the modelling in Chapter 3, the original C, G and T measured peak times were

compensated by their respective quadratics to bring them into large scale alignment

with the A channel. These data were then merged into a single time series. While this

compensation had removed the inter-lane variation, a general large scale variation in

mobility, common to all lanes, remained. For the autocorrelation estimate presented

in Figure 3.12, this variation was removed by subtracting a 51 bin moving average of

the data prior to calculating the autocorrelation estimate.
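As a rough sketch of this detrending step, the fragment below subtracts a centred moving average and then forms an autocorrelation estimate. The text specifies only the 51 bin window; the truncated window at the ends of the data, the biased normalized estimator, and all function names are assumptions for illustration.

```python
def moving_average(x, window=51):
    """Centred moving average; the window is truncated at the ends of the data."""
    half = window // 2
    out = []
    for i in range(len(x)):
        lo = max(0, i - half)
        hi = min(len(x), i + half + 1)
        out.append(sum(x[lo:hi]) / (hi - lo))
    return out


def detrend(x, window=51):
    """Remove the large scale trend by subtracting the moving average."""
    trend = moving_average(x, window)
    return [xi - ti for xi, ti in zip(x, trend)]


def autocorrelation(x, max_lag):
    """Biased, normalized autocorrelation estimate for lags 0..max_lag.

    Assumes x is not constant (the sample variance must be non-zero).
    """
    n = len(x)
    mean = sum(x) / n
    d = [xi - mean for xi in x]
    var = sum(di * di for di in d) / n
    return [sum(d[i] * d[i + k] for i in range(n - k)) / (n * var)
            for k in range(max_lag + 1)]
```

A typical call would be `autocorrelation(detrend(series), max_lag)`, where `series` stands for the merged data and `max_lag` is chosen by the user.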

Amplitude trends are in evidence in Figure 3.1 which presents the entire time-

series for a single channel. Proceeding from left to right, a constant background level

is first seen. This could be due to background fluorescence and/or an offset in the

sequencer electronics. Next, a large peak is seen; this is known as the primer peak

and is due to an excess of the fluorescently labelled primer used to identify the DNA

fragment to be sequenced. The primer peak causes an exponentially decaying offset

in the data. Near the end of the data, an exponentially rising offset is seen. This is

the precursor of the peak at the end of the data due to fluorescently labelled full

length copies of the original DNA fragment. Over the central region, a downward

trend in peak amplitudes can be seen. This is due to the competitive process used to

encode sequence information; substrate is consumed to label earlier positions leaving

less available to label later positions.

Data used in Chapter 6 have had the background, primer and end of data offsets

estimated and removed. The trend in peak amplitude has been estimated and the

data have been scaled by its inverse. The result features signal-absent regions with

values near zero and isolated signal peaks with values near one (consecutive peaks

can have values much greater than one due to constructive interference).
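The amplitude normalization can be illustrated with a short sketch. The text does not say how the offsets or the amplitude trend were estimated, so the following is purely hypothetical: it assumes an offset estimate is already available for every sample and that the downward trend in peak height is well approximated by an exponential in sample index, fitted by a straight line to the logarithm of the peak heights. All names are invented for the example.

```python
import math


def fit_exponential_trend(peak_positions, peak_heights):
    """Fit height ~ a * exp(b * position) by a line fit to log(height).

    Peak heights must be positive (offsets already removed).
    """
    n = len(peak_positions)
    logs = [math.log(h) for h in peak_heights]
    mx = sum(peak_positions) / n
    my = sum(logs) / n
    sxx = sum((x - mx) ** 2 for x in peak_positions)
    sxy = sum((x - mx) * (y - my) for x, y in zip(peak_positions, logs))
    b = sxy / sxx
    a = math.exp(my - b * mx)
    return a, b


def normalize(samples, offsets, a, b):
    """Subtract the offset estimate, then scale by the inverse of the fitted trend."""
    return [(s - off) / (a * math.exp(b * i))
            for i, (s, off) in enumerate(zip(samples, offsets))]
```

After this step, samples in signal-absent regions sit near zero and samples at isolated peak maxima sit near one, which matches the description of the processed data above.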
