discrete probability distributions: poisson applications ... · matlab exercise: binomial...

Discrete Probability Distributions:Poisson

Applications to Genome Assembly

Binomial Distribution• Binomially‐distributed random variable X equals sum of n independent Bernoulli trials

• The probability mass function is:

• Based on the binomial expansion:

Sec 3=6 Binomial Distribution 2

1 for 0,1,... (3-7)n xn xxf x C p p x n

Matlab exercise: Binomial distribution• Generate a sample of size 100,000 for binomially‐distributed random variable X with n=100, p=0.2

• Plot the approximation to the Probability Mass Function based on this sample

• Calculate the mean and variance of this sample and compare it to theoretical calculations:E[X]=n*p and V[X]=n*p*(1‐p)

Matlab exercise: Uniform distribution• n=100; p=0.2;• Stats=100000;• r1=rand(Stats,n)<p;• r2=sum(r1,2); • mean(r2)• var(r2)• [a,b]=hist(r2, 0:n);• p_b=a./sum(a);• figure; stem(b,p_b);• figure; semilogy(b,p_b,'ko‐')

Poisson Distribution• Limit of the binomial distribution when

– n , the number of attempts, is very large – p , the probability of success is very small – E(X)=np is just right

Siméon Denis Poisson (1781–1840)French mathematician and physicistFrom von Bortkiewicz 1898

7

1

Let , so

( 1)...( 1) ~ ;! ! !

.!

Normalization requires 1

1

!

.

Thus

x n x xx x

x

x

x

n xx

x

n n n

np E x p n

n n

nP X x p

n x nx x x

ex

px

X e

x

x

X

Px

P

Poisson Mean & Variance

If X is a Poisson random variable, then:• Mean: μ = E(X) = λ• Variance: σ2=V(X) = λ• Standard deviation: σ=λ1/2

Note: Variance = MeanNote: Standard deviation/Mean = λ‐1/2

decreases with λ

Sec 2‐ 8

Matlab exercise

• Find mean, variance, and plot the histogram of 100,000 Poisson distributed numbers with lambda=2

• Use random('Poisson',lambda) to generate Poisson‐distributed random numbers

Matlab: Poisson distributions• Stats=100000;• lambda=2;• r2=random('Poisson',lambda,Stats,1);• disp(mean(r2));• disp(var(r2));• disp(std(r2));• [a,b]=hist(r2, 0:max(r2));• p_p=a./sum(a);• figure; plot(b,p_p,'ko‐');

11

Discrete Poisson Distribution PMF: f(x)=exp(‐λ) λx/x!

Cumulative Distribution Function: CDF(x)=P(X≥x)?What is CDF(1)?

A. exp(‐λ)B. 0C. 1‐exp(‐λ)D. 1

Get your i‐clickers

12

Discrete Poisson Distribution PMF: f(x)=exp(‐λ) λx/x!

Cumulative Distribution Function: CDF(x)=P(X≥x)?What is CDF(1)?

A. exp(‐λ)B. 0C. 1‐exp(‐λ)D. 1

Get your i‐clickers

Credit: XKCD comics

Poisson Distribution in Genome Assembly

Poisson Example: Genome Assembly• Goal: figure out the sequence of DNA nucleotides (ACTG) along the entire genome

• Problem: Sequencers generate random short reads

• Solution: assemble genome from short reads using computers. Whole Genome Shotgun Assembly.

MinION, a palm‐sized gene sequencer made by UK‐based Oxford Nanopore Technologies

Short Reads assemble into Contigs

Promise of Genomics

I think I found the corner piece!

How many short reads do we need?

of K‐mers (5‐mers shown here )

Where is the Poisson?• G ‐ genome length (in bp) • L ‐ short read average length • N – number of short read sequenced • λ – sequencing redundancy = LN/G• x‐ number of short reads covering a given site on the genome

Ewens, Grant, Chapter 5.1

Poisson as a limit of Binomial. For a given site on the genome for each short read Prob(site covered): p=L/G is very small.Number of attempts (short reads): N is very large. Their product (sequencing redundancy): λ = NL/G is O(1).

What fraction of genome is covered?• Coverage: λ=NL/G, X – r.v. equal to the number of times a given site is coveredPoisson: P(X=x)= λx*exp(‐ λ) /x!P(X=0)=exp(‐ λ), P(X>0)=1‐ exp(‐ λ)

• Total length covered: G*[1‐ exp(‐ λ)]λ

λ

How many contigs?• Probability that a given short read is the right end of a contig = no left ends of other reads fall within it.

• Left ends of each of N‐1 N other reads has Prob: p=(L‐1)/G L/G to fall within given read.Expected number of short reads extending a given one is N*p=N*L/G= λProbability that none will extend it = exp(‐ λ):

• Number of contigs: Ncontigs=Ne‐ λ:

Average length of a contig?• Length of a genome covered:Gcovered=G∙ P(X>0)=G ∙ (1‐ exp(‐ λ))

• Number of contigs Ncontigs=N ∙ e‐ λ

• Average length of a contig =<L>=Σi Li /Ncontigs=Gcovered/Ncontigs=

G ∙ (1‐ exp(‐ λ))/ N ∙ e‐ λ=L ∙ (1‐ exp(‐ λ))/ λ ∙ e‐ λ

λ

Estimate

• Human genome is 3x109 bp long• Chromosome 1 spans about 250x106 bp• Illumina generates short reads 100 bp long• How many short reads are needed to completely assemble the 1st chromosome?

The answer is N=44x106 short (100bp) reads

or λ =17.6 fold redundant coverage.At 0.1$/Mb that means that the reads for de novo full assembly of

human genome would cost(3 x109 x 17.6 /106) x 0.1$

=$5300 /genomeIn reality is cheaper as we don’t

need de novo assembly

What spoils these estimates?

• Try searching human genome for TTAGGGTTAGGGTTAGGG (18 bases) and you will find 100s of matches

• How many one expects by pure chance? 2 ∙ 3x109 ∙ 4‐18=0.08 << 1 match

• Genome has lots of repeats. DNA repeats make assembly difficult

• Side remark. If it was not for repeats 17 bases would be enough to specify unique position in the human genome. log(2 ∙ 3x109)/log(4)=16.2412

• That is why microRNAs recognizing ~22 bases don’t have a lot of off‐target cleavage

Stuttering human genome

Repeats are like sky puzzle pieces

How many repeats are in eukaryotic genomes?

The Little Prince, Antoine de Saint‐Exupéry, 1943

Repetitive DNA and next-generation sequencing: computational challenges and solutionsTodd J. Treangen & Steven L. SalzbergNature Reviews Genetics 13, 36-46(January 2012)doi:10.1038/nrg3117

How to assemble genome with repeats?

L=50

L=1000

L=5000

• Answer:longer reads

• Problem:cheap sequencing

=short reads

Credit: XKCD comics

discrete probability distributions: poisson applications ... · matlab exercise: binomial...

Documents