introduction to statistics iii

Statistics for Next Generation Sequencing (RNA-Seq)

Upload: strand-life-sciences-pvt-ltd

Post on 18-Jun-2015

Page 1: Introduction to statistics iii

Statistics for Next Generation Sequencing (RNA-Seq)

Page 2: Introduction to statistics iii

Distribution?

• 25000 genes, each with counts over several samples

• 2 conditions, each with several replicates

• Recall: log-Normal for microarrays

– Based on fitting to actual data with many replicates

• No equivalent data for RNA-Seq

– So go back to first principles

Page 3: Introduction to statistics iii

RNA-Seq Setting

• $a_i$ copies of transcripts from gene $i$

• Total number of molecules: $N$

• Choose $M$ of these molecules for sequencing, chosen at random

• Probability that a particular molecule falls in this sample of size $M$ is $M/N$

Page 4: Introduction to statistics iii

RNA-Seq Counts Distribution

• How many of the $a_i$ copies of transcripts from gene $i$ are chosen for sequencing?

• How is this quantity distributed?

• Hypergeometric Distribution

Page 5: Introduction to statistics iii

Hypergeometric Distribution

• $N$ items, of which $a_i$ are red and $N - a_i$ are black

• $M$ of the items are sampled at random

• How many reds are in the sample?
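The urn question above can be explored with a small simulation. This is only a sketch: the sizes are made up for illustration, and the function names are hypothetical, not taken from the slides. It draws items without replacement many times and compares the average number of reds with the hypergeometric mean (sample size times the fraction of red items).

```python
import random

def count_reds_in_sample(n_items, n_red, sample_size, rng):
    """Draw sample_size items without replacement and count the red ones.

    Items 0 .. n_red-1 are treated as red, all others as black.
    """
    drawn = rng.sample(range(n_items), sample_size)
    return sum(1 for item in drawn if item < n_red)

def mean_reds(n_items, n_red, sample_size, trials, seed=0):
    """Average number of reds over repeated independent draws."""
    rng = random.Random(seed)
    total = sum(
        count_reds_in_sample(n_items, n_red, sample_size, rng)
        for _ in range(trials)
    )
    return total / trials

# Made-up sizes, chosen only for illustration.
N_ITEMS, N_RED, SAMPLE_SIZE = 10_000, 500, 1_000
observed = mean_reds(N_ITEMS, N_RED, SAMPLE_SIZE, trials=2_000)
expected = SAMPLE_SIZE * N_RED / N_ITEMS  # mean of the hypergeometric distribution
print(f"simulated mean = {observed:.2f}, theoretical mean = {expected:.2f}")
```

In the RNA-Seq reading of the urn, the red items play the role of the transcripts of one gene and the sample is the set of molecules chosen for sequencing.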

Page 6: Introduction to statistics iii

Simplifying the Hypergeometric Distribution

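• Simplify $P(K = k) = \dfrac{\binom{a_i}{k}\binom{N - a_i}{M - k}}{\binom{N}{M}}$

• Assuming $k \ll a_i \ll N$ and $M \ll N$, this is approximately

$\dfrac{a_i^k}{k!} \left(\dfrac{M}{N}\right)^k e^{-a_i M / N}$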
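A quick numeric check can show how close the approximation is. The sketch below assumes made-up values for the total number of molecules, the number of copies of one gene, and the sample size; the function names are hypothetical. It evaluates the exact hypergeometric probability and the approximate expression for small counts and reports the largest gap.

```python
import math

def hypergeom_pmf(k, n_total, n_gene, n_sampled):
    """Exact probability that k of the gene's n_gene molecules are sampled."""
    if k < 0 or k > n_gene or k > n_sampled:
        return 0.0
    if n_sampled - k > n_total - n_gene:
        return 0.0
    numerator = math.comb(n_gene, k) * math.comb(n_total - n_gene, n_sampled - k)
    return numerator / math.comb(n_total, n_sampled)

def poisson_approx(k, n_total, n_gene, n_sampled):
    """Approximation (a^k / k!) * (M/N)^k * exp(-a*M/N) from the slide."""
    ratio = n_sampled / n_total
    return (n_gene ** k / math.factorial(k)) * ratio ** k * math.exp(-n_gene * ratio)

# Made-up sizes for illustration: N molecules, a copies of the gene, M sampled.
N_TOTAL, N_GENE, N_SAMPLED = 100_000, 500, 1_000

max_diff = max(
    abs(hypergeom_pmf(k, N_TOTAL, N_GENE, N_SAMPLED)
        - poisson_approx(k, N_TOTAL, N_GENE, N_SAMPLED))
    for k in range(16)
)
print(f"largest difference for k = 0..15: {max_diff:.5f}")
```

With these sizes the sampled fraction is 1%, so the two expressions should agree closely; the gap grows as the sampled fraction gets larger.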

Page 7: Introduction to statistics iii

The Poisson Distribution

$\lambda = a_i M / N$

λ is both the mean and the variance

$a_i$, $M$ and $N$ are all unknown and subsumed within λ
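The claim that λ is both mean and variance can be checked numerically by summing over the Poisson probabilities. This is a minimal sketch with a hypothetical helper name and an arbitrary λ; the sum is truncated at a large count, which is an assumption that is harmless for small λ.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability of count k, computed in log space for stability."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def poisson_mean_and_variance(lam, k_max=200):
    """Mean and variance obtained by summing the pmf up to k_max."""
    ks = range(k_max + 1)
    mean = sum(k * poisson_pmf(k, lam) for k in ks)
    second_moment = sum(k * k * poisson_pmf(k, lam) for k in ks)
    return mean, second_moment - mean ** 2

mean, var = poisson_mean_and_variance(5.0)
print(f"mean = {mean:.6f}, variance = {var:.6f}")
```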

Page 8: Introduction to statistics iii

The Poisson Distribution (Wikipedia)

• The number of soldiers killed by horse-kicks each year in each corps in the Prussian cavalry. This example was made famous by a book of Ladislaus Josephovich Bortkiewicz (1868–1931).

• The number of yeast cells used when brewing Guinness beer. This example was made famous by William Sealy Gosset (1876–1937).

• The number of phone calls arriving at a call centre per minute.

• The number of goals in sports involving two competing teams.

• The number of deaths per year in a given age group.

• The number of jumps in a stock price in a given time interval.

• Under an assumption of homogeneity, the number of times a web server is accessed per minute.

• The number of mutations in a given stretch of DNA after a certain amount of radiation.

• The proportion of cells that will be infected at a given multiplicity of infection.

Page 9: Introduction to statistics iii

Is Mean = Variance for NGS?

– Variance ∝ Mean²

[Figure, log scale: the white line is the Poisson line]
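One way to ask this question of a count table is to compute the mean and variance of each gene across replicates and look at their ratio. The sketch below uses a tiny invented table (the gene names and counts are made up, not data from the slides) and a hypothetical helper name. A ratio near 1 is what the Poisson model predicts; a ratio well above 1 indicates over-dispersion.

```python
from statistics import mean, variance

def mean_variance_per_gene(counts):
    """Return {gene: (mean, sample variance)} across the replicate counts."""
    return {gene: (mean(values), variance(values)) for gene, values in counts.items()}

# Invented replicate counts, for illustration only.
toy_counts = {
    "geneA": [3, 5, 4, 6],
    "geneB": [120, 95, 160, 210],
    "geneC": [1000, 1400, 800, 1900],
}

summary = mean_variance_per_gene(toy_counts)
for gene, (m, v) in summary.items():
    print(f"{gene}: mean = {m:.2f}, variance = {v:.2f}, variance/mean = {v / m:.2f}")
```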

Page 10: Introduction to statistics iii

Why this Over-Dispersion?

• The Poisson model only models technical variation, not biological variation

• Biological variation induces more variance than captured by the Poisson model

– No reason for a difference from microarrays, where SD ∝ Mean (or Variance ∝ Mean²)

[Figure: SD vs Mean for Microarrays]

Page 11: Introduction to statistics iii

Handling Over-Dispersion

$K \sim \text{Poisson}(\lambda)$, where $\lambda$ itself comes from a distribution $X$ with mean $\mu$ and variance $\sigma^2$

$\sigma_K^2 = \mu + \sigma^2$
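The extra variance follows from the laws of total expectation and total variance. The derivation below is a sketch under assumed notation: $K$ is the count, Poisson-distributed given its rate $\lambda$, and $\lambda$ is itself random with mean $\mu$ and variance $\sigma^2$.

```latex
\begin{align*}
E[K] &= E\big[E[K \mid \lambda]\big] = E[\lambda] = \mu, \\
\operatorname{Var}(K) &= E\big[\operatorname{Var}(K \mid \lambda)\big]
  + \operatorname{Var}\big(E[K \mid \lambda]\big)
  = E[\lambda] + \operatorname{Var}(\lambda) = \mu + \sigma^2 .
\end{align*}
```

The first term, $\mu$, is the Poisson (technical) part; the second term, $\sigma^2$, is the additional variance contributed by variation in the rate.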

Page 12: Introduction to statistics iii

What Distribution is X?

• Log-Normal for Arrays?

• The combination of log-Normal and Poisson doesn’t have a neat closed form (i.e., formula)

• So assume Gamma distribution

– Poisson + Gamma -> Negative Binomial

– Used traditionally to fix the problem of over-dispersion
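The Poisson + Gamma combination can be simulated directly: draw a rate from a Gamma distribution, then draw a count from a Poisson with that rate. The sketch below is illustrative only. The mean and variance of the rate are made-up values, the function names are hypothetical, and the Poisson sampler is Knuth's simple multiplication method, which is adequate for small rates. The simulated counts should have a variance close to the rate's mean plus the rate's variance, which is larger than the mean.

```python
import math
import random
from statistics import mean, variance

def poisson_sample(lam, rng):
    """Knuth's method: multiply uniforms until the product drops below exp(-lam)."""
    threshold = math.exp(-lam)
    k = 0
    product = rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def gamma_poisson_sample(rate_mean, rate_var, rng):
    """Draw a rate from a Gamma with the given mean and variance, then a count."""
    shape = rate_mean ** 2 / rate_var   # Gamma mean = shape * scale
    scale = rate_var / rate_mean        # Gamma variance = shape * scale^2
    lam = rng.gammavariate(shape, scale)
    return poisson_sample(lam, rng)

rng = random.Random(0)
RATE_MEAN, RATE_VAR = 10.0, 20.0  # made-up values for illustration
counts = [gamma_poisson_sample(RATE_MEAN, RATE_VAR, rng) for _ in range(20_000)]
sample_mean, sample_var = mean(counts), variance(counts)
print(f"mean = {sample_mean:.2f} (rate mean {RATE_MEAN})")
print(f"variance = {sample_var:.2f} (rate mean + rate variance = {RATE_MEAN + RATE_VAR})")
```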

Page 13: Introduction to statistics iii

The Gamma Distribution

• 2 parameters

– Shape

– Scale

• Lifespans are modeled as Gamma

Control on Right Tail

Page 14: Introduction to statistics iii

The Negative Binomial Distribution

• How many heads before you get $r$ tails?

• 2 parameters

– Tails probability $p$

– Number of tails $r$

• $\mu = \dfrac{r(1 - p)}{p}$

• $\sigma^2 = \dfrac{r(1 - p)}{p^2}$
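The coin-tossing description can be turned into a probability function and checked numerically. This sketch assumes the usual parametrisation, with p the tails probability and r the number of tails waited for (these symbol names are an assumption, as are the function names and example values). Under that parametrisation the mean should be r(1-p)/p and the variance r(1-p)/p².

```python
import math

def neg_binomial_pmf(k, r, p):
    """Probability of seeing k heads before the r-th tail (tails probability p)."""
    log_coeff = math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
    return math.exp(log_coeff + k * math.log(1 - p) + r * math.log(p))

def neg_binomial_moments(r, p, k_max=2000):
    """Mean and variance obtained by summing the pmf up to k_max."""
    ks = range(k_max + 1)
    mean = sum(k * neg_binomial_pmf(k, r, p) for k in ks)
    second_moment = sum(k * k * neg_binomial_pmf(k, r, p) for k in ks)
    return mean, second_moment - mean ** 2

R, P = 5, 1 / 3  # made-up example values
nb_mean, nb_var = neg_binomial_moments(R, P)
print(f"mean = {nb_mean:.4f}, formula r(1-p)/p = {R * (1 - P) / P:.4f}")
print(f"variance = {nb_var:.4f}, formula r(1-p)/p^2 = {R * (1 - P) / P ** 2:.4f}")
```

Because the coefficient is computed with the log-gamma function, the same code also accepts non-integer r, which is what a fit to real count data generally produces.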

Page 15: Introduction to statistics iii

Estimating Parameters

• 2 parameters

– Tails probability $p$

– Number of tails $r$

• $\mu = \dfrac{r(1 - p)}{p}$

• $\sigma^2 = \dfrac{r(1 - p)}{p^2}$

For each gene, estimate the mean across replicates, and then estimate the variance from the curve fit above.

Then use these formulae to estimate $p$ and $r$.
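Given a mean and a variance for a gene, the two Negative Binomial parameters follow by solving the mean and variance formulas for them. The sketch below assumes the usual parametrisation (tails probability p, number of tails r, mean r(1-p)/p, variance r(1-p)/p²), which gives p = mean / variance and r = mean² / (variance - mean). The function name is hypothetical.

```python
def nb_params_from_moments(mean, variance):
    """Method-of-moments estimates (p, r) for the Negative Binomial.

    Requires variance > mean; otherwise there is no over-dispersion to model
    and the formulas break down.
    """
    if variance <= mean:
        raise ValueError("variance must exceed the mean for a Negative Binomial fit")
    p = mean / variance
    r = mean ** 2 / (variance - mean)
    return p, r

# Example with made-up moments: mean 10, variance 30.
p_hat, r_hat = nb_params_from_moments(10.0, 30.0)
print(f"p = {p_hat:.4f}, r = {r_hat:.4f}")
```

The guard on variance > mean matters in practice: a gene whose fitted variance is at or below its mean looks Poisson-like, and r would be infinite or negative.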

Page 16: Introduction to statistics iii

Bias Correction

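• $\hat{\mu}$ and $\hat{\sigma}^2$ are unbiased estimates of $\mu$ and $\sigma^2$

• $\hat{p} = \hat{\mu} / \hat{\sigma}^2$ and $\hat{r} = \hat{\mu}^2 / (\hat{\sigma}^2 - \hat{\mu})$ are not necessarily unbiased estimates of the tails probability $p$ and the number of tails $r$ respectively

• So bias correction is needed. How?

– Do theoretical simulations and see what the bias factor is

– Correct by this factor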
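A simulation of this kind could look like the sketch below. It is not the procedure from the slides, only an illustration under several assumptions: the true parameters are made up, the variance is taken as the plain sample variance of a few replicates (rather than from a curve fit), only the tails-probability estimate is examined, and all function names are hypothetical. Counts are drawn from a Negative Binomial with known parameters, the estimate is computed for each simulated gene, and the average estimate is divided by the true value to get a bias factor.

```python
import math
import random
from statistics import mean, variance

def poisson_sample(lam, rng):
    """Knuth's method; adequate for the small rates used here."""
    threshold = math.exp(-lam)
    k = 0
    product = rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def nb_sample(r, p, rng):
    """Negative Binomial draw as a Gamma-distributed rate followed by a Poisson."""
    lam = rng.gammavariate(r, (1 - p) / p)
    return poisson_sample(lam, rng)

def bias_factor_for_p(r, p, n_replicates, trials, seed=0):
    """Average of p-hat = mean / variance over simulated genes, divided by true p."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        counts = [nb_sample(r, p, rng) for _ in range(n_replicates)]
        s2 = variance(counts)
        if s2 == 0:
            continue  # p-hat is undefined when all replicates are equal
        estimates.append(mean(counts) / s2)
    return mean(estimates) / p

# Made-up true parameters and replicate number, for illustration only.
factor = bias_factor_for_p(r=5, p=1 / 3, n_replicates=5, trials=5_000)
print(f"estimated bias factor for p-hat: {factor:.2f}")
```

Dividing an estimate by the simulated factor is one simple form of the correction; the factor depends on the number of replicates and on the true parameters, so it would need to be recomputed for each setting.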

Page 17: Introduction to statistics iii

Thank You