TAMS 23: Statistiska metoder i bioinformatik (Statistical Methods in Bioinformatics)
Föreläsning 6 (Lecture 6): Modelling of DNA and Words in DNA
Timo Koski, Matematisk statistik / Linköpings tekniska högskola
07/04/2003



Lecture 6

The lecture is based on portions of Chapter 5 and on Section 10.4 of Ewens and Grant. As far as the occurrences (or appearances) of DNA words are concerned, the lecture is, however, a mix of original research contributions and the material in the textbook. Sections 5.1 ('shotgun sequencing'), 5.4 ('long repeats') and 5.5 (r-scans) are offered as possible examination topics ('presentationer', i.e. student presentations).

Lecture 6: Contents

1) Weight Matrix Model

2) Markov Modelling

3) Words in a DNA sequence
   (A) The number of occurrences
   (B) The length between one occurrence of a word and the next
   (C) The waiting time till appearance of a word, including the Markovian case
   (D) p.g.f. for the waiting time till appearance of a word

An Image of Practical Life Science


Weight Matrix Model

A weight matrix $W$ is a simple model often used by molecular biologists as a representation for a family of signals. The sequences containing the signals are supposed to have equal length ($= n$) and to have no gaps (no positions are blank).

Weight Matrix Model

A weight matrix $W$ has as its entries the probabilities $w_j(s_i)$ (e.g. observed relative frequencies) that a string has one of the bases

$$s_1 = A, \quad s_2 = C, \quad s_3 = G, \quad s_4 = T$$

at position $j$:

$$W = \begin{pmatrix} w_1(A) & w_2(A) & \cdots & w_n(A) \\ w_1(C) & w_2(C) & \cdots & w_n(C) \\ w_1(G) & w_2(G) & \cdots & w_n(G) \\ w_1(T) & w_2(T) & \cdots & w_n(T) \end{pmatrix}$$

The weight matrix model is often called a profile.

Prob(Data | Model)

The probability of a finite sequence $x = (x_1, x_2, \dots, x_n)$ given the model $W$ is

$$P(x \mid W) = \prod_{j=1}^{n} \prod_{i=1}^{4} w_j(s_i)^{I(x_j = s_i)},$$

where the indicator $I(x_j = s_i)$, a function of $x$, is $0$ if the symbol $s_i$ does not appear in position $j$ in the string $x$ and is $1$ otherwise. Thus the bases in the different positions are independent given $W$.

Prob(Data | Model)

A sequence of strings $x^{(1)}, \dots, x^{(m)}$ is training data, i.e., known cases of members of a signal family. We take them to be generated independently given $W$, so multiplication of the preceding expressions assigns them the probability

$$P\left(x^{(1)}, \dots, x^{(m)} \mid W\right) = \prod_{t=1}^{m} P\left(x^{(t)} \mid W\right) = \prod_{j=1}^{n} \prod_{i=1}^{4} w_j(s_i)^{n_j(s_i)},$$

where $n_j(s_i)$ is the number of times the symbol $s_i$ appears at position $j$ in $x^{(1)}, \dots, x^{(m)}$.

Prob(Data | Model)

$$P\left(x^{(1)}, \dots, x^{(m)} \mid W\right) = \prod_{j=1}^{n} \prod_{i=1}^{4} w_j(s_i)^{n_j(s_i)},$$

where $n_j(s_i)$ is the number of times the symbol $s_i$ appears at position $j$ in $x^{(1)}, \dots, x^{(m)}$. The maximum likelihood estimate is

$$\hat{w}_j(s_i) = \frac{n_j(s_i)}{m}.$$

This will be shown during a later lesson ('lektion').
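A minimal Python sketch of the two preceding slides, under the stated independence assumptions; the function names (`estimate_weight_matrix`, `prob_given_model`) and the toy training strings are mine, not from the lecture. It computes $\hat{w}_j(s_i) = n_j(s_i)/m$ from aligned, gap-free training strings and then evaluates $P(x \mid W)$ position by position.

```python
from collections import Counter

BASES = "ACGT"

def estimate_weight_matrix(training):
    """ML estimates w_j(s_i) = n_j(s_i) / m from equal-length, gap-free strings."""
    m = len(training)
    n = len(training[0])
    W = []
    for j in range(n):
        counts = Counter(x[j] for x in training)
        W.append({b: counts[b] / m for b in BASES})
    return W

def prob_given_model(x, W):
    """P(x | W): the bases at different positions are independent given W."""
    p = 1.0
    for j, base in enumerate(x):
        p *= W[j][base]
    return p

train = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]   # toy training strings
W = estimate_weight_matrix(train)
print(prob_given_model("TATAAT", W))
```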

Example: Promoter Regions

RNA polymerase molecules start transcription by recognizing and binding to promoter regions upstream of the desired transcription start sites. Unfortunately, promoter regions do not follow a strict pattern. It is, however, possible to find a DNA sequence (called the consensus sequence) to which all of them are very similar.

Example: Promoter Regions

For example, the consensus sequence in the bacterium E. coli, based on a study of 263 promoters, is TTGACA, followed by 17 random base pairs, followed by TATAAT, with the latter located about 10 bases upstream of the transcription start site. None of the 263 promoter sites exactly matches the above consensus sequence.

Weight Matrix Model

By constructing the weight matrix of the TATAAT region ($n = 6$) we can compute the probability $P(x \mid W_{\text{TATAAT}})$ of a DNA sequence $x = (x_1, \dots, x_6)$. This means that successive bases are thought of as being generated independently from the distributions in the weight matrix table. Similarly, we can compute, using e.g. a weight matrix model $W_0$ estimated from a non-promoter region, the probability $P(x \mid W_0)$, and compare

$$P(x \mid W_{\text{TATAAT}}) \quad \text{with} \quad P(x \mid W_0)$$

to decide if $x$ is a member of the family.
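In practice this comparison is usually done on the log scale. A small sketch, with hypothetical profile values and function names of my own choosing (the lecture does not prescribe an implementation):

```python
import math

def log_odds(x, W_signal, W_background):
    """Log-likelihood ratio log P(x|W_signal) - log P(x|W_background).

    Each W is a list of dicts mapping base -> probability, one dict per position.
    Positive scores favour membership in the signal family.
    """
    def logprob(x, W):
        return sum(math.log(W[j][b]) for j, b in enumerate(x))
    return logprob(x, W_signal) - logprob(x, W_background)

# Toy profiles of length 6: a TATAAT-like signal versus a uniform background.
W_sig = [{"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7} if c == "T"
         else {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1} for c in "TATAAT"]
W_bg = [{b: 0.25 for b in "ACGT"} for _ in range(6)]
print(log_odds("TATAAT", W_sig, W_bg))   # clearly positive
```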

Markov Model

A Markov model is a statistical 'counter-hypothesis' to weight matrix models. Its parameter is a matrix of transition probabilities between consecutive bases:

$$P = \begin{pmatrix} p_{AA} & p_{AC} & p_{AG} & p_{AT} \\ p_{CA} & p_{CC} & p_{CG} & p_{CT} \\ p_{GA} & p_{GC} & p_{GG} & p_{GT} \\ p_{TA} & p_{TC} & p_{TG} & p_{TT} \end{pmatrix},$$

with rows and columns indexed by A, C, G, T. The matrix contains $4 \times 4$ transition probabilities (12 of them free, since every row sums to one) that will have to be learned from training data, i.e., from known cases of members of a signal family.

Prob(x | P)

If $x = (x_1, x_2, \dots, x_n)$, where $x_j \in \{A, C, G, T\}$ for $j = 1, \dots, n$, then

$$P(x \mid P) = P(x_1) \prod_{j=2}^{n} p_{x_{j-1} x_j}.$$

An Image of Practical Life Science


Learning with Markov chains

$$P = \begin{pmatrix} p_{AA} & p_{AC} & p_{AG} & p_{AT} \\ p_{CA} & p_{CC} & p_{CG} & p_{CT} \\ p_{GA} & p_{GC} & p_{GG} & p_{GT} \\ p_{TA} & p_{TC} & p_{TG} & p_{TT} \end{pmatrix}$$

We change the interpretation: the function $P(x \mid P)$ is regarded as a function of $P$ (or of the probabilities in $P$), called a likelihood function and denoted by

$$L(P) = P(x \mid P) = P(x_1) \prod_{j=2}^{n} p_{x_{j-1} x_j}.$$

The maximum likelihood estimate of $P$ is obtained by maximizing $L(P)$ as a function of $P$.

Learning with Markov chains

The maximum likelihood estimate $\hat{p}_{ij}$ of $p_{ij}$ is

$$\hat{p}_{ij} = \frac{n_{ij}}{n_i} \quad \text{for all } i \text{ and } j.$$

Here $n_{ij}$ is the number of times the sequence contains the pair of bases $(i, j)$ (in this order), i.e., the number of transitions from $i$ to $j$, and $n_i$ is the number of times the base $i$ occurs in the sequence.

This will be shown during a later lesson ('lektion').

Modelling with Markov chains

$$\hat{p}_{ij} = \frac{n_{ij}}{n_i} \quad \text{for all } i \text{ and } j.$$

Here $n_{ij}$ is the number of times the sequence contains the pair of bases $(i, j)$ (in this order), i.e., the number of transitions from $i$ to $j$, and $n_i$ is the number of times the base $i$ occurs in the sequence.

Hence Markov models assume that there is biological information contained in the frequencies with which pairs of bases follow each other.
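A small sketch of this estimator (the function name is mine): count transition pairs and base occurrences in a training sequence and normalise row by row. As a minor deviation from the slide, the denominator counts only occurrences of a base that are followed by another base, so every estimated row sums to one.

```python
from collections import defaultdict

def estimate_transition_matrix(seq, alphabet="ACGT"):
    """ML estimates p_ij = n_ij / n_i from one training sequence.

    n_ij counts adjacent pairs (i, j); here n_i counts occurrences of i that
    are followed by another base (i.e. excluding the last position).
    """
    n_ij = defaultdict(int)
    n_i = defaultdict(int)
    for a, b in zip(seq, seq[1:]):
        n_ij[(a, b)] += 1
        n_i[a] += 1
    return {i: {j: (n_ij[(i, j)] / n_i[i] if n_i[i] else 0.0) for j in alphabet}
            for i in alphabet}

P = estimate_transition_matrix("ACGTACGGACTTACGAAC")
print(P["A"])   # estimated transition probabilities out of A
```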

An Image of Practical Life Science


Learning & Parametric Inference

In maximum likelihood estimation we have regarded the transition probabilities as parameters and used training data to infer their values.

Inference: the process of deriving a conclusion from facts and/or premises.

In probabilistic modelling of sequences the facts are the observed sequences, the premise is represented by the model, and the conclusions concern unobserved quantities.

Another Image of Practical Life Science

Words in a DNA sequence

Here a word means a subsequence of a larger sequence. It is a unique subsequence, not a family of signals.

Words in a DNA sequence

The frequency of occurrence of specific short nucleotide words (like GAGA) within part or all of a nucleotide sequence is of interest for several biological reasons. The occurrence of a word in unexpectedly large numbers in a segment may provide information about structure or function.

Words in a DNA sequence

Information about structure or function:

- A segment of DNA may have originally been formed by replication of smaller segments. Such replication might be exact initially but might be altered over time. In regions where retention of function is necessary, however, exact repeats would be expected to occur.

- The occurrence of repeated subsequences may identify locations where an insert has been introduced into the DNA sequence.

DNA Words: statistical assumptions

Let $X$ assume values in $\{A, C, G, T\}$ and let $P(X = A) = P(X = C) = P(X = G) = P(X = T) = \tfrac{1}{4}$; $X$ is a nucleotide chosen at random, and successive nucleotides are independent. Hence the probability of the occurrence of a given word of length $L$ at a given position is

$$\left(\tfrac{1}{4}\right)^{L}.$$

DNA Words: statistical assumptions

The DNA dice model is biologically unreasonable as a model for a whole genome, but it can serve well as a statistical null hypothesis for detecting important deviations from random noise.

(A) Occurrence of a word

We say that 'a word occurs at position $i$' (like GAGA at position 7 in the figure) if it is found to begin at position $i$ in a longer sequence.

A T T C C A G A G A A T G G C G A C A
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

(A) Occurrence of a word

Let the random variable $N(w)$ be the number of occurrences of a nucleotide word $w$ (like GAGA) of length $L$ (like $L = 4$) within a nucleotide sequence of length $N$, $N \ge L$. Then $M = N - L + 1$ is the maximum value achievable by $N(w)$.
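$N(w)$ is straightforward to compute by brute force; note that overlapping occurrences must be counted, so a sliding window is used rather than, e.g., Python's str.count, which skips overlaps. A minimal sketch (function name mine):

```python
def count_occurrences(seq, word):
    """Number of (possibly overlapping) start positions of `word` in `seq`."""
    L = len(word)
    return sum(1 for i in range(len(seq) - L + 1) if seq[i:i + L] == word)

print(count_occurrences("ATTCCAGAGAATGGCGACA", "GAGA"))   # 1, beginning at position 7
print(count_occurrences("ACACACACACACACACACAC", "ACAC"))  # 9, the maximum for this word
print(count_occurrences("A" * 20, "AAAA"))                # 17 = 20 - 4 + 1
```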

Occurrence of a word: overlap capability (examples)

The word ACAC cannot occur more than 9 times in a sequence of length 20, because of its 'overlap capability':

ACACACACACACACACACAC

Occurrence of a word: overlap capability (examples)

The word AAAA can occur between 0 and 17 times in a sequence of length 20, because of its greater 'overlap capability'. The word ACGT has no overlap capability except in the trivial case.

Occurrence of a word

Let the random variable $N(w)$ be the number of occurrences of a nucleotide word of length $L$ within a nucleotide sequence of length $N$; $N$ is thought to be so large that end effects can be neglected. Let its probability function be denoted by

$$P\left(N(w) = x\right), \quad x = 0, 1, \dots, M.$$

This probability function depends on the overlap capability ('$\varepsilon$', to be defined) and is therefore not a binomial probability function.

Overlap capability

We define the overlap capability of a word as follows. Let $w$ be a word, $w = s_1 s_2 \cdots s_L$, so that the $s_i$ are its nucleotides read from left to right. The overlap capability $\varepsilon$ is a binary sequence $\varepsilon_1, \varepsilon_2, \dots, \varepsilon_L$, where $\varepsilon_j$ is defined as follows: $\varepsilon_j = 1$ if it is possible for the word's first $j$ letters to overlap its last $j$ letters (in the same order), and $\varepsilon_j = 0$ otherwise. The overlap capability is also known as the autocorrelation of a word.

Overlap capability

The overlap capability $\varepsilon$ of $w = s_1 s_2 \cdots s_L$ is a binary sequence $\varepsilon_1, \varepsilon_2, \dots, \varepsilon_L$, where $\varepsilon_j$ is defined as follows: $\varepsilon_j = 1$ if it is possible for the word's first $j$ letters to be equal to its last $j$ letters (in the same order), and $\varepsilon_j = 0$ otherwise. More formally,

$$\varepsilon_j = \begin{cases} 1 & \text{if } s_m = s_{L-j+m} \text{ for } m = 1, \dots, j, \\ 0 & \text{elsewhere.} \end{cases}$$

Overlap capability: examples (Ewens & Grant p. 169)

$$\varepsilon_j = \begin{cases} 1 & \text{if } s_m = s_{L-j+m} \text{ for } m = 1, \dots, j, \\ 0 & \text{elsewhere.} \end{cases}$$

word   ε1  ε2  ε3  ε4
GAGA    0   1   0   1
GGGG    1   1   1   1
GAAG    1   0   0   1
GAGC    0   0   0   1
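The definition translates directly into code. This sketch (function name mine) reproduces the table above: $\varepsilon_j = 1$ exactly when the word's first $j$ letters equal its last $j$ letters.

```python
def overlap_capability(word):
    """Binary vector (e_1, ..., e_L) with e_j = 1 iff word[:j] == word[-j:]."""
    L = len(word)
    return [1 if word[:j] == word[L - j:] else 0 for j in range(1, L + 1)]

for w in ("GAGA", "GGGG", "GAAG", "GAGC", "ACGT"):
    print(w, overlap_capability(w))
# GAGA [0, 1, 0, 1]   GGGG [1, 1, 1, 1]   GAAG [1, 0, 0, 1]
# GAGC [0, 0, 0, 1]   ACGT [0, 0, 0, 1]
```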

Number of Occurrences of a Word

We are not going to derive the probability function $P(N(w) = x)$, $x = 0, 1, \dots, M$, of $N(w)$, the number of occurrences of a nucleotide word of length $L$ within a nucleotide sequence of length $N$. Instead we are going to find

$$E[N(w)] \quad \text{and} \quad \operatorname{Var}(N(w)).$$

$E[N(w)]$, Expected Number of Occurrences of a Word

Let $p = \left(\tfrac14\right)^{L}$ and

$$Y_i = \begin{cases} 1 & \text{if the word begins in position } i, \\ 0 & \text{otherwise.} \end{cases}$$

Then

$$N(w) = Y_1 + Y_2 + \cdots + Y_M,$$

and

$$E[N(w)] = \sum_{i=1}^{M} E[Y_i] = \sum_{i=1}^{M} P(\text{the word begins in position } i) = \sum_{i=1}^{M} p = M p.$$

$E[N(w)]$, Expected Number of Occurrences of a Word

The expected number of occurrences of a word,

$$E[N(w)] = M p = (N - L + 1)\left(\tfrac14\right)^{L},$$

does not depend on the overlap capability!

$\operatorname{Var}(N(w))$, Variance of the Number of Occurrences of a Word

$$\operatorname{Var}(N(w)) = \sum_{i=1}^{M} \operatorname{Var}(Y_i) + \sum_{i=1}^{M} \sum_{j \neq i} \operatorname{Cov}(Y_i, Y_j),$$

where

$$\operatorname{Var}(Y_i) = p(1 - p), \qquad \operatorname{Cov}(Y_i, Y_j) = E[Y_i Y_j] - E[Y_i]\,E[Y_j] = E[Y_i Y_j] - p^2.$$

$E[Y_i Y_{i+j}]$

The correlations between indicators $Y_i$ and $Y_j$ that are near neighbours depend on $\varepsilon$: $E[Y_i Y_{i+j}]$ is just the probability that the word occurs at both position $i$ and position $i + j$.

$E[Y_i Y_{i+j}]$

- If $1 \le j \le L - 1$, then

$$E[Y_i Y_{i+j}] = \varepsilon_{L-j}\left(\tfrac14\right)^{L+j} = \varepsilon_{L-j}\, p \left(\tfrac14\right)^{j}.$$

- If $j \ge L$, then $\varepsilon$ is irrelevant, and

$$E[Y_i Y_{i+j}] = \left(\tfrac14\right)^{2L} = p^2.$$

Counting the cases in $\sum_{i}\sum_{j} E[Y_i Y_j]$

- There are $M$ terms among the $Y_i Y_j$ in $\sum_{i=1}^{M}\sum_{j=1}^{M} E[Y_i Y_j]$ such that $i = j$.

- If $j \neq i$, there are $2(M - d)$ terms such that $j = i + d$ or $i = j + d$, for $d = 1, 2, \dots, L - 1$.

- If $|i - j| \ge L$, there are

$$M^2 - M - 2\sum_{d=1}^{L-1}(M - d) = (M - L)(M - L + 1)$$

remaining terms, which do not depend on $\varepsilon$.

Counting the cases in $\sum_{i}\sum_{j} E[Y_i Y_j]$

Hence

$$\sum_{i=1}^{M}\sum_{j=1}^{M} E[Y_i Y_j] = M p + 2 p \sum_{j=1}^{L-1}(M - j)\,\varepsilon_{L-j}\left(\tfrac14\right)^{j} + (M - L)(M - L + 1)\,p^2.$$

$\operatorname{Var}(N(w))$, Variance of the Number of Occurrences of a Word

$$\operatorname{Var}(N(w)) = M p (1 - p) + 2 p \sum_{j=1}^{L-1}(M - j)\,\varepsilon_{L-j}\left(\tfrac14\right)^{j} - (L - 1)(2M - L)\,p^2.$$

In the right-hand side the second term equals $0$ if the word has no non-trivial overlap capability, i.e., if $\varepsilon_{L-j} = 0$ for $j = 1, \dots, L - 1$. If $L = 1$, both the second and the third term equal zero, and the variance is that of a binomial random variable.

$\operatorname{Var}(N(w))$, generalization

Let the alphabet be generic ($b$ symbols) with letter probabilities $q_1, \dots, q_b$, and let $w$ be a word of $L$ letters with probabilities $q_{(1)}, \dots, q_{(L)}$. Let

$$P_l = q_{(1)} q_{(2)} \cdots q_{(l)}$$

be the product of the probabilities of the first $l$ letters in $w$. Then we insert in the preceding formulas $P_L$ for $\left(\tfrac14\right)^{L}$ and $P_j$ for $\left(\tfrac14\right)^{j}$.
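A sketch putting the last few slides together (function names mine), for general letter probabilities as on this slide: it evaluates $E[N(w)] = Mp$ and the variance with the overlap correction, and a small Monte Carlo run gives an independent sanity check.

```python
import random

def word_count_moments(word, N, q):
    """E[N(w)] and Var(N(w)) for i.i.d. letters with probabilities q (dict base -> prob)."""
    L = len(word)
    M = N - L + 1
    eps = [1 if word[:j] == word[L - j:] else 0 for j in range(1, L + 1)]
    P = [1.0]
    for s in word:                     # P[l] = product of the first l letter probabilities
        P.append(P[-1] * q[s])
    p = P[L]
    mean = M * p
    var = (M * p * (1 - p)
           + 2 * p * sum((M - j) * eps[L - j - 1] * P[j] for j in range(1, L))
           - (L - 1) * (2 * M - L) * p * p)
    return mean, var

def simulate(word, N, q, reps=5000):
    """Monte Carlo estimate of the same two moments."""
    letters, weights = zip(*q.items())
    L = len(word)
    counts = []
    for _ in range(reps):
        seq = "".join(random.choices(letters, weights=weights, k=N))
        counts.append(sum(seq[i:i + L] == word for i in range(N - L + 1)))
    m = sum(counts) / reps
    v = sum((c - m) ** 2 for c in counts) / reps
    return m, v

q = {b: 0.25 for b in "ACGT"}
print(word_count_moments("GAGA", 1000, q))   # exact mean and variance
print(simulate("GAGA", 1000, q))             # should be close to the line above
```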

Reference

The results about $E[N(w)]$ and $\operatorname{Var}(N(w))$ presented above are due to

J. F. Gentleman and R. C. Mullin: The Distribution of the Frequency of Occurrence of Nucleotide Subsequences, Based on their Overlap Capability. Biometrics, 45, pp. 35-52, 1989.

Gentleman and Mullin also find the full probability distribution $P(N(w) = x)$, $x = 0, 1, \dots, M$.

(B) The length between one occurrence of a word and the next

(B) The length between one occurrence of a word and the next

Let $w$ be a word, $w = s_1 s_2 \cdots s_L$, with overlap capability $\varepsilon = (\varepsilon_1, \dots, \varepsilon_L)$. Let $Y$ be the distance to the next occurrence of $w$ after $w$ has occurred at position $i$. Let

$$f(y) = P(Y = y), \quad y = 1, 2, \dots$$

We are going to find a recursive formula for $f(y)$.

Some pertinent events

Let $Y$ be the distance to the next occurrence of $w$ after $w$ has occurred at position $i$, and let

$$f(y) = P(Y = y), \quad y = 1, 2, \dots$$

Let

$$A_y = \{\, w \text{ occurs at } i + y \text{ but nowhere between } i \text{ and } i + y \,\}.$$

Then $f(y) = P(A_y)$.

More pertinent events

Let

$$B_y = \{\, w \text{ occurs at } i + y \,\},$$

and, for $j = 1, \dots, y - 1$, let

$$C_{j,y} = \{\, w \text{ occurs at } i + j \text{ and at } i + y \text{ but nowhere between } i \text{ and } i + j \,\}.$$

Then

$$B_y = C_{1,y} \cup C_{2,y} \cup \cdots \cup C_{y-1,y} \cup A_y.$$

Decomposition of events

The decomposition

$$B_y = C_{1,y} \cup C_{2,y} \cup \cdots \cup C_{y-1,y} \cup A_y$$

yields, since the events in the right-hand side are disjoint,

$$f(y) = P(A_y) = P(B_y) - \sum_{j=1}^{y-1} P(C_{j,y}).$$

We shall now compute the probabilities in the right-hand side of this expression.

Decomposition of events: the case $1 \le y \le L - 1$

Let $1 \le y \le L - 1$. Then

$$B_y = C_{1,y} \cup \cdots \cup C_{y-1,y} \cup A_y$$

and

$$f(y) = P(B_y) - \sum_{j=1}^{y-1} P(C_{j,y}).$$

Decomposition of events: the case $1 \le y \le L - 1$, $j = y$

[Figure: the word $s_1 s_2 s_3 s_4$ occurring at position $i$ and again at position $i + y$ with $1 \le y \le L - 1$, so that the two occurrences overlap.]

Given that $w$ occurs at $i$, for $1 \le y \le L - 1$,

$$P(B_y) = \varepsilon_{L-y}\left(\tfrac14\right)^{y}.$$

Decomposition of events: the case $1 \le y \le L - 1$, $1 \le j \le y - 1$

[Figure: occurrences of $w = s_1 s_2 s_3 s_4$ at positions $i$, $i + j$ and $i + y$ with $1 \le j \le y - 1 \le L - 1$, all overlapping.]

For $1 \le j \le y - 1$ we have that

$$P(C_{j,y}) = f(j)\,\varepsilon_{L-(y-j)}\left(\tfrac14\right)^{y-j},$$

since the overlaps are to be taken into account.

$f(y)$ for $1 \le y \le L - 1$

From

$$f(y) = P(B_y) - \sum_{j=1}^{y-1} P(C_{j,y}),$$

we get, for $1 \le y \le L - 1$,

$$f(y) = \varepsilon_{L-y}\left(\tfrac14\right)^{y} - \sum_{j=1}^{y-1} f(j)\,\varepsilon_{L-y+j}\left(\tfrac14\right)^{y-j}.$$

Decomposition of events: the case $y \ge L$, $j = y$

Let $y \ge L$. Then the occurrence at $i + y$ does not overlap the occurrence at $i$, and

$$P(B_y) = \left(\tfrac14\right)^{L}.$$

Probability $P(C_{j,y})$ for $y \ge L$; $1 \le j \le y - L$

For $1 \le j \le y - L$ we have that

$$P(C_{j,y}) = f(j)\left(\tfrac14\right)^{L},$$

since the events that $w$ occurs at $i + j$ (but nowhere between $i$ and $i + j$) and that $w$ occurs at $i + y$ are in this case independent.

[Figure: occurrences of $w = s_1 s_2 s_3 s_4$ at positions $i$, $i + j$ and $i + y$ with $j$ between $L$ and $y - L$, so that the occurrences at $i + j$ and $i + y$ do not overlap.]

Probability $P(C_{j,y})$ for $y \ge L$; $y - L + 1 \le j \le y - 1$

For $y - L + 1 \le j \le y - 1$ we have that

$$P(C_{j,y}) = f(j)\,\varepsilon_{L-(y-j)}\left(\tfrac14\right)^{y-j},$$

since for $y - L + 1 \le j \le y - 1$ overlaps are to be taken into account.

[Figure: occurrences of $w = s_1 s_2 s_3 s_4$ at positions $i$, $i + j$ and $i + y$ with $y - L + 1 \le j \le y - 1$, so that the occurrences at $i + j$ and $i + y$ overlap.]

$f(y)$ for $y \ge L$

Hence, for $y \ge L$,

$$f(y) = \left(\tfrac14\right)^{L} - \left(\tfrac14\right)^{L}\sum_{j=1}^{y-L} f(j) - \sum_{j=y-L+1}^{y-1} f(j)\,\varepsilon_{L-y+j}\left(\tfrac14\right)^{y-j},$$

and the recursion of Ewens and Grant is complete.

(C) The waiting time to the first occurrence of a word, beginning at the origin

$$f(y) = \left(\tfrac14\right)^{L} - \left(\tfrac14\right)^{L}\sum_{j=1}^{y-L} f(j) - \sum_{j=y-L+1}^{y-1} f(j)\,\varepsilon_{L-y+j}\left(\tfrac14\right)^{y-j}$$

This formula is due to Gunnar Blom and Daniel Thorburn (1982), as the probability function of the waiting time $T_w$ until the word has appeared for the first time, in the sense that all of its symbols have been seen; hence

$$P(T_w = y) = f(y) \quad \text{for } y \ge L,$$

with $f(j) = 0$ for $j < L$.
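The recursion is easy to run numerically. A sketch (function name mine) for the waiting-time case, i.e., with $f(j) = 0$ for $j < L$, under the uniform model; as a check, the resulting mean for GAGA comes out close to the classical overlap value $4^2 + 4^4 = 272$.

```python
def waiting_time_pf(word, y_max):
    """f(y) = P(T_w = y) for the waiting time of `word` in i.i.d. uniform DNA.

    Implements
      f(y) = (1/4)^L - (1/4)^L * sum_{j=1}^{y-L} f(j)
                     - sum_{j=y-L+1}^{y-1} f(j) * eps_{L-y+j} * (1/4)^(y-j),
    with f(j) = 0 for j < L.
    """
    L = len(word)
    p = 0.25 ** L
    eps = [1 if word[:j] == word[L - j:] else 0 for j in range(1, L + 1)]  # eps[j-1] = eps_j
    f = [0.0] * (y_max + 1)
    for y in range(L, y_max + 1):
        val = p - p * sum(f[1:y - L + 1])
        for j in range(max(1, y - L + 1), y):
            val -= f[j] * eps[L - y + j - 1] * 0.25 ** (y - j)
        f[y] = val
    return f

f = waiting_time_pf("GAGA", 4000)
print(sum(f))                                   # close to 1
print(sum(y * fy for y, fy in enumerate(f)))    # close to 272 = 4**2 + 4**4
```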

The waiting time of a word: the Markov Case

We consider a stationary Markov chain $X_1, X_2, X_3, \dots$ with state space $\{A, C, G, T\}$, and let $\pi^{(m)}(u, v)$ designate the probability that the chain generates the letter $v$ exactly $m$ steps after the letter $u$, i.e., the $m$-step transition probability.

The Markov Case

We denote by $\mu(w)$ the stationary probability of observing $w$, so that

$$\mu(w) = \pi(s_1)\,\pi(s_1, s_2)\,\pi(s_2, s_3)\cdots\pi(s_{L-1}, s_L),$$

where $\pi(\cdot)$ is the stationary distribution of the chain and $\pi(u, v) = \pi^{(1)}(u, v)$ are the one-step transition probabilities.

The Markov Case: The recursion

The probability function $f(y) = P(T_w = y)$ of the waiting time of the word $w = s_1 s_2 \cdots s_L$, i.e., of the time $y$ at which $w$ is completed in a sample path of $X_1, X_2, \dots$, satisfies, for $y \ge L$,

$$f(y) = \mu(w) - \mu'(w)\sum_{j=L}^{y-L} f(j)\,\pi^{(y-L+1-j)}(s_L, s_1) - \sum_{j=y-L+1}^{y-1} f(j)\,\varepsilon_{L-y+j}\prod_{m=L-y+j+1}^{L}\pi(s_{m-1}, s_m),$$

where $\pi^{(m)}(s_L, s_1)$ is the probability of transition from $s_L$ to $s_1$ in $m$ steps, and

$$\mu'(w) = \pi(s_1, s_2)\,\pi(s_2, s_3)\cdots\pi(s_{L-1}, s_L).$$

Furthermore, $f(y) = 0$ for $y < L$ and $f(L) = \mu(w)$.

The Markov Case: The recursion

The result can clearly be derived in the same way as the result by Blom and Thorburn was derived above, and is due to

S. Robin and J.-J. Daudin (1999): Exact distribution of word occurrences in a random sequence of letters. Journal of Applied Probability, 36, pp. 179-193.

The p.g.f. of the waiting time of a word

Gunnar Blom and Daniel Thorburn (1982) also derived the p.g.f. of the waiting time of a word.

The p.g.f. of the waiting time of a word

Start from the recursion for $y \ge L$:

$$f(y) = \left(\tfrac14\right)^{L} - \left(\tfrac14\right)^{L}\sum_{j=1}^{y-L} f(j) - \sum_{j=y-L+1}^{y-1} f(j)\,\varepsilon_{L-y+j}\left(\tfrac14\right)^{y-j}$$

$$= \left(\tfrac14\right)^{L}\left(1 - \sum_{j=1}^{y-L} f(j)\right) - \sum_{j=y-L+1}^{y-1} f(j)\,\varepsilon_{L-y+j}\left(\tfrac14\right)^{y-j}.$$

Here

$$1 - \sum_{j=1}^{y-L} f(j) = P\left(T_w > y - L\right) = \sum_{j=y-L+1}^{\infty} f(j),$$

and we use this in the right-hand side.

The p.g.f. of the waiting time of a word

$$f(y) = \left(\tfrac14\right)^{L}\sum_{j=y-L+1}^{\infty} f(j) - \sum_{j=y-L+1}^{y-1} f(j)\,\varepsilon_{L-y+j}\left(\tfrac14\right)^{y-j}.$$

Now we multiply this equality by $t^{y}$ and sum over $y$:

$$\sum_{y=L}^{\infty} f(y)\,t^{y} = \left(\tfrac14\right)^{L}\sum_{y=L}^{\infty} t^{y}\sum_{j=y-L+1}^{\infty} f(j) - \sum_{y=L}^{\infty} t^{y}\sum_{j=y-L+1}^{y-1} f(j)\,\varepsilon_{L-y+j}\left(\tfrac14\right)^{y-j}.$$

The p.g.f. of the waiting time of a word

The left-hand side is the probability generating function

$$G_w(t) = E\left[t^{T_w}\right] = \sum_{y=L}^{\infty} f(y)\,t^{y},$$

and the double sums on the right-hand side can be rewritten (lots of details omitted) in the form

$$G_w(t) = \frac{(t/4)^{L}}{1 - t}\bigl(1 - G_w(t)\bigr) - G_w(t)\sum_{j=1}^{L-1}\varepsilon_{L-j}\left(\frac{t}{4}\right)^{j}.$$

The p.g.f. of the waiting time of a word

$$G_w(t) = \frac{(t/4)^{L}}{1 - t}\bigl(1 - G_w(t)\bigr) - G_w(t)\sum_{j=1}^{L-1}\varepsilon_{L-j}\left(\frac{t}{4}\right)^{j}$$

If we solve this equation w.r.t. $G_w(t)$ we get

$$G_w(t) = \frac{1}{1 + (1 - t)\,\varepsilon(4/t)},$$

where

$$\varepsilon(t) = \sum_{j=1}^{L} \varepsilon_j\,t^{j}$$

(= the generating function of the overlap capabilities).
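A quick numerical check of this closed form (helper names mine): differentiating at $t = 1$ gives $E[T_w] = \varepsilon(4) = \sum_{j=1}^{L}\varepsilon_j 4^{j}$, e.g. $4^2 + 4^4 = 272$ for GAGA, and a numerical derivative of $G_w$ agrees closely with that value.

```python
def eps_vector(word):
    """Overlap capabilities (eps_1, ..., eps_L) of the word."""
    L = len(word)
    return [1 if word[:j] == word[L - j:] else 0 for j in range(1, L + 1)]

def pgf(word, t):
    """G_w(t) = 1 / (1 + (1 - t) * eps(4/t)) with eps(z) = sum_j eps_j z^j."""
    z = 4.0 / t
    eps_poly = sum(e * z ** (j + 1) for j, e in enumerate(eps_vector(word)))
    return 1.0 / (1.0 + (1.0 - t) * eps_poly)

for w in ("GAGA", "ACGT", "AAAA"):
    mean_closed = sum(e * 4 ** (j + 1) for j, e in enumerate(eps_vector(w)))  # eps(4)
    h = 1e-6
    mean_numeric = (pgf(w, 1.0) - pgf(w, 1.0 - h)) / h    # numerical G'(1)
    print(w, mean_closed, round(mean_numeric, 2))
```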

End of Lecture 6
