Propagating Measurement Uncertainty in Microarray Data Analysis

Page 1: Propagating Measurement Uncertainty in Microarray Data Analysis

Combining the strengths of UMIST and The Victoria University of Manchester

Propagating Measurement Uncertainty in Microarray Data Analysis

Magnus Rattray

School of Computer Science

University of Manchester

Page 2: Propagating Measurement Uncertainty in Microarray Data Analysis

Talk Outline

• Part 1: Affymetrix probe-level analysis

• Probabilistic model for oligonucleotide arrays

• Estimating credibility intervals

• Evaluation on real and spike-in data

• Part 2: Propagating uncertainties

• A general framework for propagating uncertainties

• Example 1: Identifying differentially expressed genes

• Example 2: Modified Principal Component Analysis

Page 3: Propagating Measurement Uncertainty in Microarray Data Analysis

Part 1: Affy probe-level analysis

PM – Perfect match DNA probe designed to measure signal

MM – Mismatch DNA probe designed to measure background

Probes for the same gene differ greatly in their binding affinities, e.g.

PM    83    77    70   982   530  1013   340  1832   464  1111
MM    86    65    79   489   172  1224   181   985   191   313

~10000-50000 probe-sets with 11-20 PM/MM probe-pairs
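To make the spread in probe affinities concrete, the example intensities from this slide can be summarised numerically. This is only an illustrative sketch: the function name `summarise_probe_set` and the choice of summary statistics are assumptions, not part of the talk.

```python
import math

# Example probe-pair intensities quoted on the slide (one probe-set, ten pairs).
pm = [83, 77, 70, 982, 530, 1013, 340, 1832, 464, 1111]
mm = [86, 65, 79, 489, 172, 1224, 181, 985, 191, 313]


def summarise_probe_set(pm, mm):
    """Return simple descriptive statistics for one probe-set (hypothetical helper)."""
    log_ratios = [math.log2(p / m) for p, m in zip(pm, mm)]
    return {
        # Ratio of brightest to dimmest PM probe for the same gene.
        "pm_dynamic_range": max(pm) / min(pm),
        # Probe pairs where the "background" probe is brighter than the signal probe.
        "pairs_with_mm_above_pm": sum(m > p for p, m in zip(pm, mm)),
        "log2_ratio_min": min(log_ratios),
        "log2_ratio_max": max(log_ratios),
    }


summary = summarise_probe_set(pm, mm)
print(summary)
```

The PM intensities for this single gene span more than an order of magnitude, and in some pairs MM exceeds PM, which is why a per-probe-set distributional model is attractive.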

Page 4: Propagating Measurement Uncertainty in Microarray Data Analysis

Are mismatch probes useful?

• In practice there is specific binding to MM, so some methods ignore MM probes altogether. But…

…if the fraction of specific binding to MM is the same for each chip, this term cancels when computing expression ratios.
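The cancellation can be written out in a few lines. The symbols here are illustrative assumptions, not the slide's notation: $\alpha_c$ is the specific signal on chip $c$, $a$ is the background, $\phi$ is the fraction of specific binding to MM, and intensities are taken at their expected values with the same proportionality constant on both chips.

```latex
\begin{align*}
\mathrm{PM}_c &\propto a + \alpha_c, \qquad \mathrm{MM}_c \propto a + \phi\,\alpha_c,\\
\mathrm{PM}_c - \mathrm{MM}_c &\propto (1-\phi)\,\alpha_c,\\
\frac{\mathrm{PM}_1 - \mathrm{MM}_1}{\mathrm{PM}_2 - \mathrm{MM}_2}
  &= \frac{(1-\phi)\,\alpha_1}{(1-\phi)\,\alpha_2} = \frac{\alpha_1}{\alpha_2}.
\end{align*}
```

Under these assumptions the unknown factor $(1-\phi)$ only rescales the signal, so it drops out of any ratio between chips.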

Page 5: Propagating Measurement Uncertainty in Microarray Data Analysis

Probabilistic probe-level analysis

• Most methods return a single expression level estimate

• Probabilistic models provide confidence intervals

• Useful for propagating through higher-level analysis

• Hopefully, this approach will also improve accuracy

A hierarchical Bayesian model (Hein et al. 2005) uses MCMC for Bayesian parameter estimation, but this can be prohibitively slow – a more efficient approach is required.

Page 6: Propagating Measurement Uncertainty in Microarray Data Analysis

Gamma model for oligo signal: gMOS

- PM (background+signal)

- MM (background)

- signal

Mean log-signal

Milo et al., Biochemical Transactions 31, 6 (2003)

where

Models (PM,MM) distribution for each probe-set
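The equations on this slide did not survive extraction. As a hedged sketch, the following assumes the usual gamma parameterisation for this kind of model: MM ~ Gamma(a, b) for background, PM ~ Gamma(a + alpha, b) for background plus signal, and signal s ~ Gamma(alpha, b), with b a rate parameter. Under that assumption the mean log-signal is digamma(alpha) - log(b). The names `a`, `alpha`, `b` and all functions below are illustrative.

```python
import math
import random


def digamma(x):
    """Central-difference derivative of log-gamma; adequate for illustration."""
    h = 1e-5 * x
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2.0 * h)


def mean_log_signal(alpha, b):
    """E[log s] for s ~ Gamma(shape=alpha, rate=b)."""
    return digamma(alpha) - math.log(b)


def sample_probe_pair(a, alpha, b, rng):
    """Draw one (PM, MM) pair under the assumed gamma model."""
    # random.gammavariate takes (shape, scale), so scale = 1 / rate.
    mm = rng.gammavariate(a, 1.0 / b)           # background only
    pm = rng.gammavariate(a + alpha, 1.0 / b)   # background + signal
    return pm, mm


# Compare the analytic mean log-signal with a Monte Carlo estimate.
rng = random.Random(0)
alpha, b = 2.0, 0.01
signals = [rng.gammavariate(alpha, 1.0 / b) for _ in range(20000)]
monte_carlo = sum(math.log(s) for s in signals) / len(signals)
analytic = mean_log_signal(alpha, b)
print(analytic, monte_carlo)
```

The numerical digamma avoids a dependency on SciPy; with SciPy available, `scipy.special.digamma` would be the more usual choice.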

Page 7: Propagating Measurement Uncertainty in Microarray Data Analysis

Modelling probe affinity: mgMOS

• PM and MM probes have correlated binding affinities

• Use a shared scale parameter for probe-pair

• Treat scale parameter as a latent variable

• Distribution of PM ( ) and MM ( ) is

Improves fit to data
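A small simulation illustrates why a shared latent scale parameter produces correlated PM and MM intensities. It is a sketch under assumptions not given on the slide: the scale is treated as a gamma-distributed rate with hypothetical hyper-parameters `c` and `d`, and correlation is measured on log-intensities.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_log_pairs(n, a, alpha, c, d, rng, shared_latent_scale=True):
    """Simulate log(PM), log(MM) for n probe pairs.

    With shared_latent_scale=True each pair draws its own rate ~ Gamma(c, rate=d)
    and uses it for both PM and MM; otherwise every pair uses the fixed rate c/d.
    """
    log_pm, log_mm = [], []
    fixed_rate = c / d
    for _ in range(n):
        rate = rng.gammavariate(c, 1.0 / d) if shared_latent_scale else fixed_rate
        log_pm.append(math.log(rng.gammavariate(a + alpha, 1.0 / rate)))
        log_mm.append(math.log(rng.gammavariate(a, 1.0 / rate)))
    return log_pm, log_mm


rng = random.Random(0)
corr_latent = pearson(*simulate_log_pairs(5000, 2.0, 3.0, 2.0, 200.0, rng, True))
corr_fixed = pearson(*simulate_log_pairs(5000, 2.0, 3.0, 2.0, 200.0, rng, False))
print(corr_latent, corr_fixed)
```

With a per-pair latent scale the log-intensities are clearly correlated, whereas a single fixed scale leaves PM and MM uncorrelated across probe pairs.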

Page 8: Propagating Measurement Uncertainty in Microarray Data Analysis

Further extensions of the model

• Share binding affinity parameter across multiple chips

• Include fraction of specific binding to MM probe

Parameter is unidentifiable

We estimate an empirical prior from spike-in data

Probe in probe-set on chip

Liu et al., Bioinformatics 21, 3637 (2005).

Page 9: Propagating Measurement Uncertainty in Microarray Data Analysis

Posterior signal distribution

• We estimate the mean signal over a probe-set as

• Only the first term is chip & condition specific

• Distribution of gives posterior signal distribution

• We assume a uniform positive prior on

• Approximate posterior of as truncated Gaussian or using a histogram approach (very similar in practice)

• Percentiles of provide percentiles of
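The percentile argument on this slide can be sketched as follows. The code assumes, for illustration only, that the chip-specific parameter has a Gaussian posterior truncated to positive values and that the signal is a monotonically increasing function of it (digamma is used here, as in a gamma model). Because the map is monotone, percentiles of the parameter map directly onto percentiles of the signal. Function names and the numbers in the usage line are hypothetical.

```python
import math
from statistics import NormalDist


def truncated_gaussian_percentile(p, mean, sd):
    """p-th quantile (0 < p < 1) of N(mean, sd^2) truncated to (0, inf)."""
    base = NormalDist(mean, sd)
    mass_below_zero = base.cdf(0.0)
    # Rescale p onto the part of the CDF that lies above zero.
    return base.inv_cdf(mass_below_zero + p * (1.0 - mass_below_zero))


def digamma(x):
    """Central-difference derivative of log-gamma; adequate for illustration."""
    h = 1e-5 * x
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2.0 * h)


def signal_percentiles(mean, sd, probs=(0.05, 0.5, 0.95)):
    """Push parameter percentiles through the monotone map to get signal percentiles."""
    return [digamma(truncated_gaussian_percentile(p, mean, sd)) for p in probs]


low, median, high = signal_percentiles(mean=1.0, sd=0.8)
print(low, median, high)
```

For a weakly expressed gene the lower percentile sits far below the median, which matches the long left-hand tail described on the next slide.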

Page 10: Propagating Measurement Uncertainty in Microarray Data Analysis

Posterior signal distribution

• Posterior becomes more peaked as signal increases

• Normal provides good fit for large signals

• For low signal there is a long left-hand tail due to the fact that we are measuring

• Posterior distribution can be used to put credibility intervals on the estimated expression level

Page 11: Propagating Measurement Uncertainty in Microarray Data Analysis

Results: Accuracy on real data

• 5 time-points, 3 replicates & qr-PCR for 14 genes

Method                   RMS error to PCR results
GC-RMA                   0.69
MAS 5.0                  0.66
mgMOS (post. median)     0.60
multi-mgMOS              0.60
Hierarchical Bayesian    0.72

[Figure panels: mgMOS, multi-mgMOS]

Mouse hair-follicle morphogenesis data from Lin et al. PNAS 101, 15955 (2004).
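For reference, a root-mean-square error between array-based estimates and PCR measurements can be computed as below. This is a generic sketch; the talk does not say how the profiles were normalised before comparison, and the example numbers are made up.

```python
import math


def rms_error(estimates, reference):
    """Root-mean-square difference between two equal-length sequences."""
    if len(estimates) != len(reference):
        raise ValueError("sequences must have the same length")
    squared = [(e - r) ** 2 for e, r in zip(estimates, reference)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical log-expression values for one gene over three time-points.
array_log_expr = [1.0, 2.0, 3.0]
pcr_log_expr = [1.0, 2.0, 5.0]
print(rms_error(array_log_expr, pcr_log_expr))
```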

Page 12: Propagating Measurement Uncertainty in Microarray Data Analysis

Importance of credibility intervals

Red boxes show truly differentially expressed genes

Left: Log-ratios used to rank genes

Right: Credibility intervals used to rank genes

Spike-in data from Choe et al Genome Biology 6, R16 (2005).

1331 up-regulated genes (1.2 to 4-fold), 12679 invariant

Page 13: Propagating Measurement Uncertainty in Microarray Data Analysis

Part 2: Propagating uncertainties

• Uncertainties can be propagated as noise

where is diagonal covariance matrix for gene

• Use your favourite probabilistic model for

• Data is not i.i.d. making parameter estimation tricky

We consider two popular tasks as examples:

(i) Combining replicates and identifying differential expression

(ii) Principal Component Analysis (PCA)
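The idea of propagating uncertainty as noise can be written generically as follows. The notation is an assumption, since the slide's equations were lost: $\mathbf{y}_n$ is the true log-expression profile of gene $n$, $\hat{\mathbf{y}}_n$ the probe-level estimate, $\Sigma_n$ the diagonal matrix of probe-level variances, and $\theta$ the parameters of whichever downstream model is used.

```latex
\begin{align*}
\hat{\mathbf{y}}_n &= \mathbf{y}_n + \boldsymbol{\epsilon}_n, \qquad
\boldsymbol{\epsilon}_n \sim \mathcal{N}(\mathbf{0}, \Sigma_n), \qquad
\Sigma_n = \mathrm{diag}(\sigma_{n1}^2, \dots, \sigma_{nd}^2),\\
p(\hat{\mathbf{y}}_n \mid \theta) &= \int \mathcal{N}(\hat{\mathbf{y}}_n \mid \mathbf{y}_n, \Sigma_n)\,
p(\mathbf{y}_n \mid \theta)\, d\mathbf{y}_n .
\end{align*}
```

Because each $\Sigma_n$ is different, the observed profiles are independent but not identically distributed, which is what complicates parameter estimation.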

Page 14: Propagating Measurement Uncertainty in Microarray Data Analysis

(i) Combining replicates

Simplest model of log-expression is a Gaussian:

for replicate in conditions with priors

• Hyper-parameters are

• Parameters are

• We can then calculate the probability of the sign of change in expression level between two conditions:
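The probability of the sign of change has a simple closed form if the posterior over each condition's mean log-expression is approximated as an independent Gaussian. That approximation, and the function name below, are assumptions made for this sketch rather than details taken from the slide.

```python
from statistics import NormalDist


def prob_positive_change(mean_1, sd_1, mean_2, sd_2):
    """P(mu_2 > mu_1) for independent Gaussian posteriors on the two condition means."""
    diff_sd = (sd_1 ** 2 + sd_2 ** 2) ** 0.5
    return NormalDist().cdf((mean_2 - mean_1) / diff_sd)


# Hypothetical posterior summaries for one gene in two conditions.
print(prob_positive_change(mean_1=0.0, sd_1=0.6, mean_2=1.0, sd_2=0.8))
```

Ranking genes by how far this probability is from 0.5 takes the posterior widths into account, unlike ranking by the log-ratio alone.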

Page 15: Propagating Measurement Uncertainty in Microarray Data Analysis

Hyper-parameter estimation

Likelihood:

Prior:

We wish to optimise the log marginal likelihood:

The integral is intractable, so we use a variational approximation (popular approach in machine learning).

The resulting optimisation resembles an EM-algorithm.

Page 16: Propagating Measurement Uncertainty in Microarray Data Analysis

Variational approximation

E-step:

M-step:

We use a factorised approximation to the posterior:
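To show the flavour of the E-step/M-step alternation, here is a deliberately simplified sketch. It is not the variational scheme from the talk: it fits a single gene in a single condition by exact EM and has no priors or shared hyper-parameters. The assumed model is x_i ~ N(mu, tau) for the true replicate value and y_i ~ N(x_i, v_i) for the measurement, with the probe-level variances v_i known and strictly positive. All names are illustrative.

```python
def em_replicates(y, v, n_iter=200, min_var=1e-8):
    """EM estimate of the mean mu and between-replicate variance tau.

    Assumed model: x_i ~ N(mu, tau), y_i ~ N(x_i, v_i), with each v_i > 0 known.
    """
    n = len(y)
    mu = sum(y) / n
    tau = max(sum((yi - mu) ** 2 for yi in y) / n, min_var)
    for _ in range(n_iter):
        # E-step: Gaussian posterior of each latent x_i given y_i.
        post_var = [1.0 / (1.0 / tau + 1.0 / vi) for vi in v]
        post_mean = [s * (mu / tau + yi / vi) for s, yi, vi in zip(post_var, y, v)]
        # M-step: re-estimate mu and tau from the posterior moments.
        mu = sum(post_mean) / n
        tau = max(sum((m - mu) ** 2 + s for m, s in zip(post_mean, post_var)) / n, min_var)
    return mu, tau


# Three precise replicates and one very noisy one (hypothetical values).
mu_hat, tau_hat = em_replicates([1.0, 1.4, 0.8, 3.0], [0.05, 0.05, 0.05, 4.0])
print(mu_hat, tau_hat)
```

The noisy replicate pulls the estimate far less than it would pull a plain average, which is the behaviour the propagated uncertainties are meant to produce.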

Page 17: Propagating Measurement Uncertainty in Microarray Data Analysis

Results: credibility intervals

Data from Lin et al. PNAS 101, 15955 (2004)

Page 18: Propagating Measurement Uncertainty in Microarray Data Analysis

Identifying differential expression

[Figure panels: one chip per condition; 3 replicates per condition]

Spike-in data from Choe et al Genome Biology 6, R16 (2005).

1331 up-regulated genes (1.2 to 4-fold), 12679 invariant

Page 19: Propagating Measurement Uncertainty in Microarray Data Analysis

(ii) Principal Component Analysis

• Popular dimensionality reduction technique

• Project data onto directions of greatest variation

This example is from Pomeroy et al. Nature 415, 436 (2002): embryonic tumours of the central nervous system.

Useful tool for visualising patterns and clusters within the data set

Usually requires an ad-hoc method for removing genes with low signal/noise

Page 20: Propagating Measurement Uncertainty in Microarray Data Analysis

Probabilistic PCA

• PCA can be cast as a probabilistic model

with -dimensional latent variables

• The resulting data distribution is

• Maximum likelihood solution is equivalent to PCA

Diagonal contains the top sample covariance eigenvalues and contains associated eigenvectors

Tipping and Bishop, J. Royal Stat. Soc. B 61, 611 (1999).
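Since the slide's equations were lost, the standard form of probabilistic PCA from the cited paper is reproduced here for orientation. The symbols are assumptions and may differ from those on the original slide: $d$ is the data dimension, $q$ the latent dimension, $U_q$ and $\Lambda_q$ the top $q$ eigenvectors and eigenvalues of the sample covariance, $\lambda_j$ its eigenvalues in decreasing order, and $R$ an arbitrary rotation.

```latex
\begin{align*}
\mathbf{y}_n &= W\mathbf{x}_n + \boldsymbol{\mu} + \boldsymbol{\epsilon}_n, \qquad
\mathbf{x}_n \sim \mathcal{N}(\mathbf{0}, I_q), \qquad
\boldsymbol{\epsilon}_n \sim \mathcal{N}(\mathbf{0}, \sigma^2 I_d),\\
\mathbf{y}_n &\sim \mathcal{N}(\boldsymbol{\mu},\; WW^{\top} + \sigma^2 I_d),\\
W_{\mathrm{ML}} &= U_q\,(\Lambda_q - \sigma^2 I_q)^{1/2} R, \qquad
\sigma^2_{\mathrm{ML}} = \frac{1}{d-q}\sum_{j=q+1}^{d} \lambda_j .
\end{align*}
```

The columns of $W_{\mathrm{ML}}$ span the same subspace as the leading principal components, which is the sense in which the maximum likelihood solution is equivalent to PCA.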

Page 21: Propagating Measurement Uncertainty in Microarray Data Analysis

Relationship to Factor Analysis

• Probabilistic PCA is equivalent to factor analysis with equal noise for every dimension

• In factor analysis for a diagonal covariance matrix

• An iterative algorithm (e.g. EM) is required to find parameters if precisions are not known in advance

In our case we want the precision to be gene and experiment specific – we need a more flexible model

Page 22: Propagating Measurement Uncertainty in Microarray Data Analysis

PCA with measurement uncertainty

• If we let the covariance matrix be gene specific then

Probabilistic PCA:

Corrupted data model:

• The log-likelihood is

with

• The maximum likelihood solution for the mean is

which is no longer the sample mean
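A one-dimensional analogue shows why the maximum likelihood mean stops being the sample mean. If sample n has total variance c + v_n (a shared model variance c plus a sample-specific measurement variance v_n), the likelihood is maximised by a precision-weighted mean. In the full model the scalar weights would be replaced by inverse covariance matrices, and the names below are illustrative assumptions.

```python
def ml_mean(y, noise_var, model_var):
    """Precision-weighted ML mean when sample n has variance model_var + noise_var[n]."""
    weights = [1.0 / (model_var + v) for v in noise_var]
    return sum(w * yi for w, yi in zip(weights, y)) / sum(weights)


# Hypothetical data: the second measurement is much noisier than the first.
y = [0.0, 10.0]
print(ml_mean(y, noise_var=[0.0, 9.0], model_var=1.0))  # pulled towards 0.0
print(sum(y) / len(y))                                   # plain sample mean
```

When all measurement variances are equal the weights cancel and the sample mean is recovered.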

Page 23: Propagating Measurement Uncertainty in Microarray Data Analysis

Likelihood optimisation

• The optimal parameters are solutions to a coupled non-linear set of equations (e.g. depends on )

• Gradients require inversion of large matrices

• An EM-algorithm provides more efficient optimisation

• M-step still requires non-linear optimisation

• Redundant parameterisation of model gives us a significant speed-up

Page 24: Propagating Measurement Uncertainty in Microarray Data Analysis

Advantages over standard PCA

• Automatically eliminates influence of consistently noisy genes, e.g. genes that are noisy in all experiments

• Automatically chooses no. of principal components because noise “explains away” some of the variation

• Down-weights influence of noisy measurements in an experiment specific way

• Provides error-bars on the reduced dimension representation of the data

• Can be used to “denoise” expression profiles

Page 25: Propagating Measurement Uncertainty in Microarray Data Analysis

Results: Improved visualisation

Under standard PCA 43% of samples are closest to a sample of the same tumour type.

For modified PCA this percentage increases to 71%.

Data from Pomeroy et al. Nature 415, 436 (2002).
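The percentages quoted on this slide are nearest-neighbour agreement rates. A minimal sketch of such a metric is given below, assuming Euclidean distance in the reduced space; the talk does not state which distance was used, and the toy coordinates and labels are made up.

```python
import math


def same_type_nearest_neighbour_rate(points, labels):
    """Fraction of samples whose nearest other sample carries the same label."""
    hits = 0
    for i, p in enumerate(points):
        nearest = min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        hits += labels[nearest] == labels[i]
    return hits / len(points)


# Toy 2-D projection with two well-separated tumour types (hypothetical).
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
print(same_type_nearest_neighbour_rate(points, ["A", "A", "B", "B"]))
```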

Page 26: Propagating Measurement Uncertainty in Microarray Data Analysis

Denoising a data set

• We can estimate the uncorrupted data from the noisy measurements as

• Denoised profile approaches original as noise is reduced

• Denoised data improves performance of clustering

Sanguinetti et al. Bioinformatics 21, 3748 (2005).
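The denoising formula itself was lost in extraction. As a hedged one-dimensional illustration, under a Gaussian model with mean mu, model variance c and measurement variance v, the posterior mean of the uncorrupted value shrinks the measurement towards mu by the factor c / (c + v). This is a standard Gaussian result used here as a stand-in, not necessarily the exact estimator from the talk.

```python
def denoise(y, mu, model_var, noise_var):
    """Posterior mean of the true value given a noisy measurement y (scalar Gaussian case)."""
    gain = model_var / (model_var + noise_var)
    return mu + gain * (y - mu)


# The estimate approaches the measurement as the noise variance goes to zero.
for noise_var in (1.0, 0.1, 0.0):
    print(noise_var, denoise(y=2.0, mu=0.0, model_var=1.0, noise_var=noise_var))
```

This reproduces the behaviour noted on the slide: the denoised value approaches the original measurement as the noise is reduced, and falls back to the model mean when the measurement is very uncertain.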

Page 27: Propagating Measurement Uncertainty in Microarray Data Analysis

Conclusions

• We have developed a computationally efficient probabilistic model for Affymetrix probe-level analysis.

• The model provides good accuracy and confidence intervals for gene expression level estimates.

• Measurement uncertainties can be propagated through an appropriate probabilistic model.

• Example applications to Bayesian t-test and PCA.

• Parameter estimation becomes much more difficult, so approximate methods are needed.

• The same principle can be applied to other models.

Page 28: Propagating Measurement Uncertainty in Microarray Data Analysis

Acknowledgments

Rest of the team:

Xuejun Liu, School of Computer Science, University of Manchester.

Guido Sanguinetti, Marta Milo & Neil Lawrence, Department of Computer Science, University of Sheffield.

Software: www.bioinf.man.ac.uk/resources/puma

Papers:

Liu et al. “A tractable probabilistic model for Affymetrix probe-level analysis across multiple chips” Bioinformatics 21, 3637 (2005).

Sanguinetti et al. “Accounting for probe-level noise in principal component analysis of microarray data” Bioinformatics 21, 3748 (2005).

Supported by a BBSRC award “Improved processing of microarray data with probabilistic models”