discrete probability distributions: poisson applications ... · matlab exercise: binomial...
Binomial Distribution
• A binomially distributed random variable X equals the sum of n independent Bernoulli trials
• The probability mass function is:
$f(x) = C^n_x \, p^x (1-p)^{n-x}$ for $x = 0, 1, \ldots, n$ (3-7)
• Based on the binomial expansion:
$(a+b)^n = \sum_{k=0}^{n} C^n_k \, a^k b^{n-k}$
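As a quick numerical check of the probability mass function above, here is a minimal Python sketch (standard library only). The function name `binomial_pmf` is illustrative and not from the slides. It verifies that the PMF sums to 1, which is the binomial expansion with a=p and b=1-p, and that the mean and variance match n*p and n*p*(1-p).

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for n independent Bernoulli trials with success probability p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 100, 0.2  # same parameters as the Matlab exercise in these slides
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

total = sum(pmf)                                          # expansion of (p + (1-p))^n
mean = sum(x * f for x, f in enumerate(pmf))              # theory: n*p
var = sum((x - mean) ** 2 * f for x, f in enumerate(pmf)) # theory: n*p*(1-p)

print(total, mean, var)
```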
Matlab exercise: Binomial distribution
• Generate a sample of size 100,000 for a binomially distributed random variable X with n=100, p=0.2
• Plot the approximation to the probability mass function based on this sample
• Calculate the mean and variance of this sample and compare them to the theoretical values: E[X]=n*p and V[X]=n*p*(1-p)
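Matlab exercise: Binomial distribution (solution using uniform random numbers)
• n=100; p=0.2;
• Stats=100000;
• r1=rand(Stats,n)<p;  % each row holds n Bernoulli trials built from uniform random numbers
• r2=sum(r1,2);  % number of successes in each row
• mean(r2)  % compare with n*p
• var(r2)  % compare with n*p*(1-p)
• [a,b]=hist(r2, 0:n);
• p_b=a./sum(a);  % normalize counts to estimate the PMF
• figure; stem(b,p_b);
• figure; semilogy(b,p_b,'ko-')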
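For readers without Matlab, the same experiment can be sketched in plain Python. This is an assumption-laden counterpart, not the slides' code: the helper name `binomial_sample`, the fixed seed, and the smaller sample size (chosen so that pure Python stays fast) are all illustrative choices.

```python
import random

def binomial_sample(n, p, rng):
    """One binomial draw: count successes among n Bernoulli trials."""
    return sum(rng.random() < p for _ in range(n))

rng = random.Random(0)           # fixed seed, arbitrary, for reproducibility
n, p, stats = 100, 0.2, 20000    # smaller sample than the 100,000 used in the slides
sample = [binomial_sample(n, p, rng) for _ in range(stats)]

sample_mean = sum(sample) / stats
sample_var = sum((x - sample_mean) ** 2 for x in sample) / (stats - 1)

# Empirical PMF: fraction of draws equal to each value 0..n
counts = [0] * (n + 1)
for x in sample:
    counts[x] += 1
pmf_est = [c / stats for c in counts]

print(sample_mean, n * p)            # sample mean vs. theory
print(sample_var, n * p * (1 - p))   # sample variance vs. theory
```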
Poisson Distribution
• Limit of the binomial distribution when
– n, the number of attempts, is very large
– p, the probability of success, is very small
– E(X)=np is just right
Siméon Denis Poisson (1781–1840), French mathematician and physicist
Figure: from von Bortkiewicz (1898)
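Let $\lambda = np = E(X)$, so $p = \lambda/n$. For large $n$ at fixed $x$:
$$P(X=x) = C^n_x \, p^x (1-p)^{n-x} = \frac{n(n-1)\cdots(n-x+1)}{x!}\, p^x (1-p)^{n-x} \sim \frac{n^x p^x}{x!}\,(1-p)^{n} = \frac{\lambda^x}{x!}\,(1-p)^{n}.$$
Normalization requires $\sum_{x} P(X=x) = 1$. Since $\sum_{x=0}^{\infty} \lambda^x/x! = e^{\lambda}$, the $x$-independent factor must equal $e^{-\lambda}$.
Thus
$$P(X=x) = \frac{\lambda^x}{x!}\, e^{-\lambda}.$$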
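The limit can also be checked numerically. The sketch below (Python standard library; the function names and the choice λ=2 are illustrative assumptions) keeps λ=np fixed, increases n, and prints the largest pointwise gap between the binomial and Poisson PMFs, which should shrink roughly like 1/n.

```python
from math import comb, exp, factorial

def binomial_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p)**(n - x)

def poisson_pmf(x, lam):
    return exp(-lam) * lam**x / factorial(x)

lam = 2.0  # illustrative value of lambda = n*p, held fixed

def max_abs_diff(n):
    """Largest |binomial - Poisson| over x = 0..10 when p = lam/n."""
    p = lam / n
    return max(abs(binomial_pmf(x, n, p) - poisson_pmf(x, lam)) for x in range(11))

for n in (10, 100, 1000, 10000):
    print(n, max_abs_diff(n))
```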
Poisson Mean & Variance
If X is a Poisson random variable, then:
• Mean: $\mu = E(X) = \lambda$
• Variance: $\sigma^2 = V(X) = \lambda$
• Standard deviation: $\sigma = \lambda^{1/2}$
Note: Variance = Mean
Note: Standard deviation/Mean = $\lambda^{-1/2}$ decreases with λ
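These moments can be confirmed directly from the PMF. The following Python sketch (standard library only; function names and the truncation point `x_max=200` are assumptions made for illustration) sums the series numerically for a few values of λ and prints the ratio of standard deviation to mean next to λ^(-1/2).

```python
from math import exp, sqrt

def poisson_pmf_table(lam, x_max):
    """PMF values f(0..x_max), built with the recurrence f(x) = f(x-1) * lam / x."""
    f = [exp(-lam)]
    for x in range(1, x_max + 1):
        f.append(f[-1] * lam / x)
    return f

def moments(lam, x_max=200):
    """Mean and variance from the truncated PMF (the tail beyond x_max is negligible here)."""
    f = poisson_pmf_table(lam, x_max)
    mean = sum(x * fx for x, fx in enumerate(f))
    var = sum((x - mean) ** 2 * fx for x, fx in enumerate(f))
    return mean, var

for lam in (0.5, 2.0, 10.0):
    mean, var = moments(lam)
    print(lam, mean, var, sqrt(var) / mean, lam ** -0.5)
```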
Matlab exercise
• Find mean, variance, and plot the histogram of 100,000 Poisson distributed numbers with lambda=2
• Use random('Poisson',lambda) to generate Poisson‐distributed random numbers
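Matlab: Poisson distributions
• Stats=100000;
• lambda=2;
• r2=random('Poisson',lambda,Stats,1);  % Stats-by-1 vector of Poisson random numbers
• disp(mean(r2));  % compare with lambda
• disp(var(r2));  % compare with lambda
• disp(std(r2));  % compare with sqrt(lambda)
• [a,b]=hist(r2, 0:max(r2));
• p_p=a./sum(a);  % normalize counts to estimate the PMF
• figure; plot(b,p_p,'ko-');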
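Python's standard library has no Poisson generator, so a counterpart to the Matlab snippet needs its own sampler. The sketch below uses Knuth's multiplication method, which is a swapped-in technique and not necessarily what Matlab's `random('Poisson',...)` does internally; the helper name `poisson_sample` and the seed are illustrative.

```python
import math
import random

def poisson_sample(lam, rng):
    """One Poisson draw via Knuth's multiplication method (suitable for small lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    # Multiply uniform numbers until the product drops to exp(-lam) or below;
    # the number of extra factors needed is Poisson distributed.
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

rng = random.Random(0)     # fixed seed, arbitrary, for reproducibility
stats, lam = 100000, 2.0
r2 = [poisson_sample(lam, rng) for _ in range(stats)]

mean = sum(r2) / stats
var = sum((x - mean) ** 2 for x in r2) / (stats - 1)
print(mean, var, math.sqrt(var))

# Empirical PMF on 0..max(r2), analogous to hist(r2, 0:max(r2)) in the Matlab code
counts = [0] * (max(r2) + 1)
for x in r2:
    counts[x] += 1
p_p = [c / stats for c in counts]
```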
Discrete Poisson Distribution
PMF: $f(x) = e^{-\lambda} \lambda^x / x!$
Cumulative Distribution Function: CDF(x)=P(X≥x)? What is CDF(1)?
A. exp(-λ)
B. 0
C. 1-exp(-λ)
D. 1
Get your i-clickers
Poisson Example: Genome Assembly
• Goal: figure out the sequence of DNA nucleotides (ACTG) along the entire genome
• Problem: Sequencers generate random short reads
• Solution: assemble genome from short reads using computers. Whole Genome Shotgun Assembly.
Where is the Poisson?
• G - genome length (in bp)
• L - average short-read length
• N - number of short reads sequenced
• λ - sequencing redundancy = LN/G
• x - number of short reads covering a given site on the genome
Ewens, Grant, Chapter 5.1
Poisson as a limit of Binomial. For a given site on the genome and each short read, Prob(site covered): p=L/G is very small. Number of attempts (short reads): N is very large. Their product (sequencing redundancy): λ = NL/G is O(1).
What fraction of genome is covered?
• Coverage: λ=NL/G; X is the r.v. equal to the number of times a given site is covered
• Poisson: $P(X=x) = \lambda^x e^{-\lambda}/x!$, so $P(X=0) = e^{-\lambda}$ and $P(X>0) = 1 - e^{-\lambda}$
• Total length covered: $G\,[1 - e^{-\lambda}]$
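To get a feel for the numbers, the coverage formula can be tabulated for a few redundancies. This is a minimal Python sketch; the function name `coverage_stats` is hypothetical, and the values G=250×10^6 bp and L=100 bp are borrowed from the chromosome 1 estimate elsewhere in these slides.

```python
from math import exp

def coverage_stats(G, L, N):
    """Return (lambda, fraction of genome covered, covered length in bp)."""
    lam = N * L / G
    frac_covered = 1 - exp(-lam)    # P(X > 0) for a Poisson r.v. with mean lam
    return lam, frac_covered, G * frac_covered

G, L = 250e6, 100   # illustrative: about chromosome 1, with 100 bp reads
for lam_target in (1, 2, 5, 10):
    N = lam_target * G / L          # number of reads giving this redundancy
    lam, frac, covered = coverage_stats(G, L, N)
    print(f"lambda={lam:.0f}  N={N:.2e}  covered fraction={frac:.5f}  uncovered bp={G - covered:.0f}")
```

Even at 10-fold redundancy a small fraction exp(-10) of sites is expected to remain uncovered, which on a genome of this size is still thousands of bases.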
How many contigs?
• Probability that a given short read is the right end of a contig = probability that no left ends of other reads fall within it.
• The left end of each of the N-1 ≈ N other reads has probability p=(L-1)/G ≈ L/G to fall within the given read. The expected number of short reads extending a given one is N*p = N*L/G = λ. The probability that none will extend it is exp(-λ).
• Number of contigs: $N_{\text{contigs}} = N e^{-\lambda}$
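The contig count can be tested with a small simulation. The Python sketch below makes simplifying assumptions that are not in the slides: all reads have exactly length L, start positions are uniform and independent, and two reads belong to the same contig only if the next read starts inside the previous one. The function name `count_contigs` and the toy parameters are illustrative.

```python
import math
import random

def count_contigs(G, L, N, rng):
    """Place N reads of length L uniformly on a genome of length G and count
    contigs (maximal runs of overlapping reads)."""
    starts = sorted(rng.randrange(G - L + 1) for _ in range(N))
    contigs = 1
    for prev, nxt in zip(starts, starts[1:]):
        # The read starting at prev covers prev .. prev+L-1; a new contig begins
        # when the next read starts beyond that range.
        if nxt > prev + L - 1:
            contigs += 1
    return contigs

rng = random.Random(1)              # fixed seed, arbitrary, for reproducibility
G, L, N = 1_000_000, 100, 20_000    # toy genome with lambda = N*L/G = 2
lam = N * L / G
observed = count_contigs(G, L, N, rng)
expected = N * math.exp(-lam)
print(observed, expected)
```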
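Average length of a contig?
• Length of the genome covered: $G_{\text{covered}} = G \cdot P(X>0) = G\,(1 - e^{-\lambda})$
• Number of contigs: $N_{\text{contigs}} = N e^{-\lambda}$
• Average length of a contig: $\langle L \rangle = \sum_i L_i / N_{\text{contigs}} = G_{\text{covered}} / N_{\text{contigs}} = \frac{G\,(1 - e^{-\lambda})}{N e^{-\lambda}} = L\,\frac{1 - e^{-\lambda}}{\lambda e^{-\lambda}}$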
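The average contig length grows roughly exponentially with redundancy, since the last expression equals L(e^λ - 1)/λ. A short Python sketch makes this concrete; the function name `mean_contig_length` and the sample values of λ are assumptions chosen for illustration.

```python
from math import exp, expm1

def mean_contig_length(L, lam):
    """<L> = L*(1 - exp(-lam)) / (lam*exp(-lam)), rewritten as L*(exp(lam) - 1)/lam."""
    return L * expm1(lam) / lam

L = 100  # read length in bp, illustrative
for lam in (0.5, 1, 2, 5, 10):
    print(lam, round(mean_contig_length(L, lam)))

# Cross-check against G_covered / N_contigs for one toy parameter set
G, N = 1_000_000, 20_000
lam = N * L / G
direct = G * (1 - exp(-lam)) / (N * exp(-lam))
print(direct, mean_contig_length(L, lam))
```

For very small λ reads rarely overlap, so the average contig length tends to the read length L.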
Estimate
• Human genome is 3×10^9 bp long
• Chromosome 1 spans about 250×10^6 bp
• Illumina generates short reads 100 bp long
• How many short reads are needed to completely assemble the 1st chromosome?
The answer is N=44×10^6 short (100 bp) reads, or λ = 17.6-fold redundant coverage.
At 0.1$/Mb that means that the reads for de novo full assembly of the human genome would cost (3×10^9 × 17.6 / 10^6) × 0.1$ ≈ $5300 per genome.
In reality it is cheaper, as we don't need de novo assembly.
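The slides do not spell out how N was obtained. One reading that reproduces the quoted figures, assumed here, is to require the expected number of contigs N·exp(-λ) to fall to 1. The Python sketch below solves that condition by bisection; the function name `reads_for_single_contig` and the bracketing interval are illustrative choices, and the bracket only works when G/L is comfortably larger than e.

```python
from math import log

def reads_for_single_contig(G, L):
    """Solve N*exp(-N*L/G) = 1 for its large root by bisection on
    g(N) = ln(N) - N*L/G (assumed criterion: one expected contig)."""
    lo, hi = G / L, 100 * G / L     # redundancy between 1x and 100x brackets the root
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if log(mid) - mid * L / G > 0:
            lo = mid                # still more than one expected contig
        else:
            hi = mid
    return 0.5 * (lo + hi)

G_chr1, L = 250e6, 100              # chromosome 1 length and read length from the slide
N = reads_for_single_contig(G_chr1, L)
lam = N * L / G_chr1
cost = 3e9 * lam / 1e6 * 0.1        # whole genome at 0.1 $/Mb, as on the slide
print(f"N = {N:.3e} reads, lambda = {lam:.2f}, cost = ${cost:.0f}")
```

Under this assumed criterion the result should land close to the slide's N, λ and cost.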
What spoils these estimates?
• Try searching human genome for TTAGGGTTAGGGTTAGGG (18 bases) and you will find 100s of matches
• How many does one expect by pure chance? 2 ∙ 3×10^9 ∙ 4^(-18) ≈ 0.09 << 1 match
• Genome has lots of repeats. DNA repeats make assembly difficult
• Side remark: if it were not for repeats, 17 bases would be enough to specify a unique position in the human genome, since log(2 ∙ 3×10^9)/log(4)=16.2412
• That is why microRNAs recognizing ~22 bases don’t have a lot of off‐target cleavage
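Both back-of-the-envelope numbers above are easy to reproduce. In the Python sketch below the factor of 2 is interpreted as counting both DNA strands, which is an assumption since the slide does not say so, and every base is taken to be equally likely and independent.

```python
from math import ceil, log

genome_bp = 3e9
sites = 2 * genome_bp                    # assumed: both strands of the genome
k = 18                                   # length of the TTAGGG x3 query

expected_matches = sites * 4.0 ** (-k)   # chance matches for one specific 18-mer
min_len = log(sites) / log(4)            # length at which one chance match is expected

print(expected_matches)                  # much smaller than 1
print(min_len, ceil(min_len))            # smallest whole number of bases above that length
```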
Todd J. Treangen & Steven L. Salzberg, "Repetitive DNA and next-generation sequencing: computational challenges and solutions", Nature Reviews Genetics 13, 36-46 (January 2012), doi:10.1038/nrg3117
How to assemble genome with repeats?
Figure panels: L=50, L=1000, L=5000
• Answer: longer reads
• Problem: cheap sequencing = short reads