Microarray Normalization Issues in High-Throughput Data Analysis BIOS 691-803 Spring 2010 Dr Mark Reimers

Upload: derek-hill

Post on 04-Jan-2016

Microarray Normalization

Issues in High-Throughput Data Analysis

BIOS 691-803 Spring 2010

Dr Mark Reimers

Normalization of Expression Arrays

• Historical approaches 1995-2005

• Current standard approaches

– Q.75 (75th percentile) for Agilent

– Quantile for Affymetrix and Nimblegen

– VSN for Illumina

• Modeling correlation (later)

• Technical variable regression (later)

Historic Normalization Approaches

• One Parameter

– Single reference standard

– Total or median brightness

• Two parameter or non-parametric

– Invariant Set

– Lowess for two-color arrays

– Matching variance

– Distribution matching (by variance or by quantiles)

• Local approaches

– Print-tip separate normalization

Median Normalization

• Subtract chip medians from values to align centers of chip intensity distributions
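A minimal sketch of median normalization, assuming log-scale intensities stored as a probes × chips array. The function name `median_normalize` and the toy data are illustrative only and are not part of the lecture.

```python
import numpy as np

def median_normalize(log_intensity):
    """Subtract each chip's median so that all chip medians align at zero.

    log_intensity: 2-D array, rows = probes, columns = chips (log scale).
    """
    x = np.asarray(log_intensity, dtype=float)
    chip_medians = np.median(x, axis=0)   # one median per chip (column)
    return x - chip_medians               # broadcasts across rows

# Toy data: three chips sharing a profile but offset in overall brightness.
profile = np.array([4.0, 5.0, 6.0, 7.0, 8.0, 9.0])
chips = np.column_stack([profile, profile + 0.5, profile - 1.0])
normalized = median_normalize(chips)
```

Because the toy chips differ only by a constant shift, the normalized columns coincide. Real chips usually also differ in spread and shape, which a single-parameter method like this cannot correct.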

Two-color Intensity-Dependent Bias

Non-normalized data {(M, A)}, n = 1..5184:

M = log2(R/G), A = ½ log2(R·G)

• Saturation occurs at different densities for Cy-3 (green) and Cy-5 (red) dyes because different densities of label get attached to the same amount of cDNA target.

• Model bias by an intensity dependent function c(A)

Global (lowess) Normalization

Globally normalized data {(M, A)}, n = 1..5184:

M_norm = M − c(A)

• c(A) could be determined by any local averaging method

• Terry Speed suggested lowess (local weighted regression)

• Subtract c(A) to obtain ‘corrected’ data
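The correction M_norm = M − c(A) can be sketched with a simple local averaging method. The code below is an illustration under stated assumptions, not the lowess implementation used in practice: it fits a tricube-weighted local line at each point (the core of lowess, without the robustness iterations). The function name `local_linear_fit`, the `frac` value, and the simulated dye bias are all hypothetical.

```python
import numpy as np

def local_linear_fit(a, m, frac=0.4):
    """Estimate c(A) at each point by tricube-weighted local linear regression."""
    a = np.asarray(a, dtype=float)
    m = np.asarray(m, dtype=float)
    n = len(a)
    k = max(3, int(np.ceil(frac * n)))        # neighbours used in each local fit
    fitted = np.empty(n)
    for i in range(n):
        dist = np.abs(a - a[i])
        h = np.sort(dist)[k - 1]              # bandwidth: distance to k-th nearest point
        h = h if h > 0 else 1.0
        w = np.clip(1.0 - (dist / h) ** 3, 0.0, None) ** 3   # tricube weights
        # Weighted least squares for a local line m ~ b0 + b1 * (a - a[i])
        X = np.column_stack([np.ones(n), a - a[i]])
        sw = np.sqrt(w)
        coef, *_ = np.linalg.lstsq(X * sw[:, None], m * sw, rcond=None)
        fitted[i] = coef[0]                   # value of the local line at a[i]
    return fitted

# Simulated two-color data with a hypothetical intensity-dependent bias.
rng = np.random.default_rng(0)
A = rng.uniform(6, 14, size=300)
bias = 0.05 * (A - 10) ** 2 - 0.3
M = bias + rng.normal(0, 0.1, size=300)

c_A = local_linear_fit(A, M)    # estimated c(A)
M_norm = M - c_A                # 'corrected' log ratios
```

The loop costs O(n²), which is fine for a sketch; for full-size arrays a library lowess routine would be the more sensible choice.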

Print-tip Normalization

Print-tip normalized data {(M, A)}, n = 1..5184:

M_p,norm = M_p − c_p(A); p = print tip (1–16)

where c_p(A) is an intensity-dependent function for print tip p.

Print-tip layout:

1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16

Scaled Print-tip Normalization

Scaled print-tip normalized data {(M, A)}, n = 1..5184:

M_p,norm = s_p·(M_p − c_p(A)); p = print tip (1–16)

where s_p is a scale factor for print tip p, based on the median absolute deviation (MAD).

Figure panels: After print-tip normalization; After scaled print-tip normalization
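The scaling step can be sketched as follows. The slide only says that the scale factor is MAD-based; the specific choice here, s_p = (geometric mean of all tips' MADs) / MAD_p, is an assumption made for illustration. The simulated values stand in for M_p − c_p(A), and the function names are hypothetical.

```python
import numpy as np

def mad(x):
    """Median absolute deviation from the median."""
    x = np.asarray(x, dtype=float)
    return np.median(np.abs(x - np.median(x)))

def print_tip_scale_factors(m_centered, tip):
    """Return {tip: s_p}, with s_p = geometric-mean MAD / MAD_p (an assumed choice)."""
    tips = np.unique(tip)
    mads = np.array([mad(m_centered[tip == p]) for p in tips])
    geo_mean = np.exp(np.mean(np.log(mads)))
    return dict(zip(tips.tolist(), (geo_mean / mads).tolist()))

rng = np.random.default_rng(1)
tip = np.repeat(np.arange(1, 17), 324)           # 16 tips x 324 spots = 5184 spots
spread = np.linspace(0.5, 2.0, 16)[tip - 1]      # hypothetical tip-specific spread
m_centered = rng.normal(0.0, spread)             # stands in for M_p - c_p(A)

s = print_tip_scale_factors(m_centered, tip)
m_scaled = np.array([s[p] for p in tip.tolist()]) * m_centered
```

After scaling, every print-tip group has the same MAD, which is the sense in which the spreads are matched.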

Effect on Spatial Artifacts

Figure panels: Median normalization; Global lowess normalization; Print-tip lowess normalization; Scaled print-tip normalization

Quantile Normalization

• Determine reference distribution (can use any good chip or average a set of chips)

• For each chip, for each probe, determine quantile within that chip

• Shift to corresponding quantile of reference distribution

----------------------------------

• Easy to implement

• Resolves intensity dependent bias as well as loess
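The three steps above can be sketched in a few lines. This is a minimal illustration, assuming the reference is the average of the sorted chips (one of the two options mentioned) and that ties are broken by position rather than averaged. The function name `quantile_normalize` is hypothetical.

```python
import numpy as np

def quantile_normalize(x):
    """Map each chip (column) onto a common reference distribution."""
    x = np.asarray(x, dtype=float)
    # Reference distribution: mean across chips of the sorted intensities.
    reference = np.sort(x, axis=0).mean(axis=1)
    # Rank of each probe within its own chip (0 = smallest).
    order = np.argsort(x, axis=0, kind="stable")
    ranks = np.empty_like(order)
    cols = np.arange(x.shape[1])
    ranks[order, cols] = np.arange(x.shape[0])[:, None]
    # Replace each value by the reference value at the same rank.
    return reference[ranks]

chips = np.array([[5.0, 4.0, 3.0],
                  [2.0, 1.0, 4.0],
                  [3.0, 4.5, 6.0],
                  [4.0, 2.0, 8.0]])
normalized = quantile_normalize(chips)
```

After the mapping, every chip has exactly the same set of values and only the assignment of values to probes differs, which is why the method removes any intensity-dependent distortion that is monotone within a chip.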

Quantile Normalization (Irizarry et al 2002)

The mapping by quantile normalization

THINGS TO MAYBE ADD

• Maybe examples of how to do mapping

• How to assign reference

• linear extension to high genes

• Critiques: induced correlations – example and details about variable cross-hybridization

Ratio-Intensity: Before

Ratio-Intensity: After

Key Assumption of Quantile Norm

• The processes that distort the distribution act on all probes of a given intensity more or less equally

• Probably true within differences of 30% or 40%

• Smaller differences depend quite strongly on technical characteristics of probes

Critiques of Quantile Normalization

• Artificially compresses variation of highly expressed genes

• Confounds systematic changes due to cross-hybridization with changes in abundance for genes of low expression

• Induces artificial correlations in gene expression across samples

How to Assess Normalization?

• We want to minimize technical variation in relation to biological variation

– Most tests like t-test or ANOVA compare technical and between-group variance

• Compare distributions of biological to technical variation after normalization

• Most small estimates of variance are under-estimates

Other Issues in Normalization

• Transformation of variables

– Variance stabilization

• Background compensation

Why Variance Stabilization?

Figure: x-y plots and mean-variance plots for four cases: ideal, raw x, log2(x), and log2(x + offset). From Du CHI 2007

Variance Stabilization

• Simple power transforms (Box-Cox) often nearly stabilize variance

• Durbin and Huber derived a variance-stabilizing transform from a theoretical model:

– $y = \alpha$ (background) $+\, m e^{\eta}$ (mult. error) $+\, \varepsilon$ (static error)

– $m$ is true signal; $\eta$ and $\varepsilon$ have $N(0, \sigma_\eta^2)$ and $N(0, \sigma_\varepsilon^2)$ distributions

– Transform: $g(y) = \log\left( (y - \alpha) + \sqrt{(y - \alpha)^2 + \sigma_\varepsilon^2 / \sigma_\eta^2} \right)$

• Could estimate $\alpha$ (background) and the $\sigma$'s empirically

• In practice, the best effect on variance often comes from parameters different from the empirical estimates

– Huber’s is harder to estimate
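A small simulation can illustrate why a generalized-log transform of the form log(z + sqrt(z² + c)), with z = y − background, stabilizes variance under an additive-plus-multiplicative error model. Everything here is an assumption made for illustration: the parameter values, the choice c = σ_ε²/σ_η², and the function name `glog`.

```python
import numpy as np

def glog(y, alpha, c):
    """Generalized log: log((y - alpha) + sqrt((y - alpha)^2 + c))."""
    z = np.asarray(y, dtype=float) - alpha
    return np.log(z + np.sqrt(z ** 2 + c))

rng = np.random.default_rng(3)
alpha, sd_eta, sd_eps = 100.0, 0.1, 20.0     # hypothetical model parameters
c = sd_eps ** 2 / sd_eta ** 2

# Simulate y = alpha + m * exp(eta) + eps at several true signal levels m
# and record the standard deviation of the transformed values.
sds = {}
for m in [0.0, 100.0, 1000.0, 10000.0]:
    y = alpha + m * np.exp(rng.normal(0, sd_eta, 20000)) + rng.normal(0, sd_eps, 20000)
    sds[m] = float(np.std(glog(y, alpha, c)))
```

In this setup the standard deviation of the transformed values stays close to σ_η at every signal level, whereas a plain log would blow up near background. The same function equals arcsinh((y − α)/√c) plus a constant, which is the arcsinh form used for the VST.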

Effect of Box-Cox Transforms

Model Solution – arcsinh

Fit the relations between mean and standard deviation

Relations between log2 and VST (arcsinh)

(Lin, Pan, Huber, and Warren, 2007)

Illumina Bead Arrays

• Oligonucleotides (50-mers) immobilized on glass beads

• Identifier tag on each oligo

• Usually ~ 30 beads per probe

Comparing VSN results with log scale

VST improves the cross-site concordance

MAQC data

From Du CHI 2007

Applicability to Other Array Types

• The crucial assumption of most current methods for expression array normalization is that the differences between arrays reflect changes in only a small proportion of the genome and that the overall distribution of expression levels is unchanged

Recent Approaches

1. Use of standard or control variables to infer covariates (Mike West)

2. PCA of residuals to infer covariates or patterns of systematic error

3. Regression on technical covariates of probes

Inferred (Surrogate) Covariates

• Surrogate variable analysis (SVA)

– Leek and Storey, PLoS Genetics, 2007

• Motivation: many unmodeled (and unknown) factors affect the measures

• Even if known, most experiments don’t have sufficient d.f. to estimate their effects

• Idea: often the effects of all factors are somewhat correlated

• Can you infer a manageable set of ersatz (surrogate) covariates that do the same thing?

Underlying Model

• There are factors f1, …, fL, which affect genes via linear combinations of functions gi1(f1), …, giL(fL).

• The distorting effect on gene i in array j is:

$\sum_{l=1}^{L} g_{il}(f_{lj})$

• Claim: this is a sufficiently general representation, because additive models can represent most data sets (dubious)

• Fact: an additive representation can be represented as a linear function (of transformed variables)

Inferring Covariates

• Given observations Y and predictors X (L × N)

– (e.g. X might record diagnosis and age in columns)

• Fit the following model:

$y_{ij} = \mu_i + \sum_{l=1}^{L} \beta_{il} x_{lj} + r_{ij}$

• The residual matrix R is approximated by $R \approx U V^T$ using singular value decomposition with K non-trivial components:

$r_{ij} = \sum_{k=1}^{K} \gamma_{ik} v_{kj} + \varepsilon_{ij}$

• The kth row of matrix $V^T$ (entries $v_{kj}$) records the kth inferred covariate across the N samples
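The two steps on this slide (fit the known predictors gene by gene, then take an SVD of the residual matrix) can be sketched as below. This is an illustration on simulated data, not the published SVA procedure. The two-group design, the hidden alternating factor, and all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n_genes, n_samples = 500, 20

# Known predictor: a hypothetical two-group label, stored as a 1 x N matrix X.
X = np.array([[0.0] * 10 + [1.0] * 10])
# Hidden technical factor (e.g. an unrecorded batch), unknown to the analyst.
hidden = np.tile([1.0, -1.0], 10)

beta = rng.normal(0, 1, size=(n_genes, 1))
gamma = rng.normal(0, 2, size=(n_genes, 1))
Y = 8.0 + beta @ X + gamma @ hidden[None, :] + rng.normal(0, 0.3, size=(n_genes, n_samples))

# Fit y_ij = mu_i + sum_l beta_il x_lj + r_ij by least squares, all genes at once.
design = np.column_stack([np.ones(n_samples), X.T])       # N x (L + 1)
coef, *_ = np.linalg.lstsq(design, Y.T, rcond=None)       # (L + 1) x genes
R = Y - (design @ coef).T                                 # residual matrix, genes x N

# SVD of the residuals: rows of Vt are candidate inferred covariates across samples.
U, d, Vt = np.linalg.svd(R, full_matrices=False)
K = 1
surrogates = Vt[:K]
```

In this simulation the first row of `Vt` tracks the hidden factor up to sign and scale, because that factor was constructed to be orthogonal to the known predictor. When the hidden factor is correlated with the design, part of it is absorbed into the fitted coefficients instead.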

SV Decomposition of Residuals

• How many singular values to keep?

• Test whether fraction of variance explained is higher than ‘chance’

• Compute test statistic

• Assess significance by SVD following many permutations, acting on rows independently, to disrupt correlation structure

Figure: Surrogate variables
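A permutation test of this kind can be sketched as follows. The test statistic used here, the fraction of total variance carried by each singular value, is one plausible reading of the slide rather than a confirmed detail. The function names, the number of permutations, and the simulated residual matrix are assumptions.

```python
import numpy as np

def variance_fractions(R):
    """Fraction of total variance carried by each singular value of R."""
    d = np.linalg.svd(R, compute_uv=False)
    return d ** 2 / np.sum(d ** 2)

def permutation_pvalues(R, n_perm=50, seed=0):
    """Compare each observed fraction with its null from row-wise permutations."""
    rng = np.random.default_rng(seed)
    observed = variance_fractions(R)
    exceed = np.zeros_like(observed)
    for _ in range(n_perm):
        # Permute each row independently to destroy correlation between genes.
        permuted = np.array([rng.permutation(row) for row in R])
        exceed += variance_fractions(permuted) >= observed
    return (exceed + 1) / (n_perm + 1)

# Simulated residuals: one real shared pattern across 12 samples, plus noise.
rng = np.random.default_rng(7)
pattern = np.tile([1.0, -1.0], 6)
R = np.outer(rng.normal(0, 1, 200), pattern) + rng.normal(0, 0.5, size=(200, 12))
pvals = permutation_pvalues(R)
```

Only the first component should come out as significant here. The smallest attainable p-value is 1 / (n_perm + 1), so more permutations are needed if a stricter threshold is wanted.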

Using the Surrogate Covariates

• For each gene i separately fit β’s and γ’s in model:

$y_{ij} = \mu_i + \sum_{l=1}^{L} \beta_{il} x_{lj} + \sum_{k=1}^{K} \gamma_{ik} v_{kj} + \varepsilon_{ij}$

• Issue: How to limit d.f. used up by covariates?

– For each k, many genes show little correlation with inferred covariate k.

• Compute variance explained by each covariate across all genes

• Select which genes i are significantly associated with predictor v_k (i.e. γ_ik ≠ 0) using Storey’s FDR approach on variance explained

• Include only those significantly associated

Critiques and Issues for SVA

– SVA does not distinguish between technical variation and biological variation in most designs

• Biological variation within treatment groups is often important

– SVA assumes covariate effects are additive (linear in practice)

• This may be plausible for a single outcome, but the assumption of the model is that the same functions of the unknown factors contribute linearly to the distortion of ALL genes

– Fairly complex procedure with several tuning parameters

– Does not address confounding between systematic errors and design

• a general fault of covariate methods, but not (I think) necessary

– SVA as published is not robust, but this is easily fixed

A Simpler Approach Using SVD

• Left singular vectors represent basis for subspace of technical variation

• Hypothesis: Technical errors are reproducible

• Implication: one can ‘learn’ typical patterns of technical variation for each technology from one set of replicates

Algorithm

• Consider sets of technical replicates of the same samples, with only technical differences within sets

• PCA of replicates identifies major components

• Algorithm:

– Construct technical differences from mean of each set

– Robust PCA of differences

• Outliers can be handled by simple winsorization

– Find differences of each array from common mean of all arrays in experiment

– Project each array’s difference onto K PC’s (K small)

– Subtract projection (typically 50% of variance)

• Leverage points in regression are also winsorized
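The algorithm above can be sketched on simulated data as follows. This is a simplified illustration under several assumptions: a single reproducible technical pattern `t`, winsorization at three MAD-based standard deviations standing in for "robust PCA", and no winsorization of leverage points in the projection step. All function names are hypothetical.

```python
import numpy as np

def winsorize(x, limit=3.0):
    """Clip values more than `limit` robust SDs (MAD-based) from the median."""
    center = np.median(x)
    scale = 1.4826 * np.median(np.abs(x - center))
    return np.clip(x, center - limit * scale, center + limit * scale)

def learn_technical_pcs(replicate_sets, K):
    """PCA of within-set differences; returns probes x K left singular vectors."""
    diffs = [s - s.mean(axis=1, keepdims=True) for s in replicate_sets]
    D = winsorize(np.hstack(diffs))
    U, _, _ = np.linalg.svd(D, full_matrices=False)
    return U[:, :K]

def remove_technical(arrays, pcs):
    """Subtract each array's projection onto the learned technical subspace."""
    dev = arrays - arrays.mean(axis=1, keepdims=True)
    return arrays - pcs @ (pcs.T @ dev)

# Simulated example: one reproducible technical pattern t shared by all arrays.
rng = np.random.default_rng(5)
n_probes = 300
t = rng.normal(size=n_probes)
t /= np.linalg.norm(t)

def simulate(profiles):
    """profiles: probes x arrays of true signal; adds technical pattern and noise."""
    amplitude = 5.0 * rng.normal(size=profiles.shape[1])
    return profiles + np.outer(t, amplitude) + 0.1 * rng.normal(size=profiles.shape)

sample_a = rng.normal(8, 1, size=(n_probes, 1))
sample_b = rng.normal(8, 1, size=(n_probes, 1))
replicate_sets = [simulate(np.repeat(sample_a, 5, axis=1)),
                  simulate(np.repeat(sample_b, 5, axis=1))]
pcs = learn_technical_pcs(replicate_sets, K=1)

# A separate "experiment" of six arrays, normalized with the learned PCs.
experiment = simulate(np.repeat(rng.normal(8, 1, size=(n_probes, 1)), 6, axis=1))
cleaned = remove_technical(experiment, pcs)
```

Subtracting the projection also removes any biological signal that happens to lie in the learned subspace, so K should stay small, as the slide notes.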

Principal Components of MAQC

• Four samples: A: universal human reference RNA (mixed cell lines); B: brain; C: 3:1 mixture of A & B; D: 1:3 mixture of A & B

• Each sample hybridized five times in each of three labs

Scree plot of replicate PCA for Agilent 44K 1 color MAQC data set (3 sets of 4x5 reps)

Results on MAQC Data

• Using each lab’s PC’s to normalize the other two labs

• Five PC’s (left singular vectors) used

• Proportion of variance explained > 50%

– 5/40,000 expected if taking a ‘random’ subspace

Number of F-scores greater than 7

Technical Variable Regression

• Hypothesis:

1. Most technical variation between chips is caused by a few (unknown) systematic factors

2. Probes with similar technical characteristics (Tm, position in gene, location on chip, typical intensity, ..) will be distorted by similar amounts

• Therefore we can use technical variables as an index of technical similarity (the predictor) and (usually) treat real biological differences as ‘noise’

• Construct deviations from average or standard chip

• Identify which technical variables have the most effect

• Regress deviations from average on technical variables
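The last two steps (construct deviations from the average chip, then regress them on technical variables) can be sketched with ordinary least squares. This is a deliberately linear illustration on simulated data. The covariates chosen (Tm, pyrimidine fraction, average intensity), their effect sizes, and all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n_probes = 2000

# Hypothetical technical covariates, one value per probe.
tm = rng.normal(60, 5, n_probes)            # melting temperature
ct = rng.uniform(0.3, 0.7, n_probes)        # pyrimidine (C + T) fraction
avg = rng.normal(8, 1, n_probes)            # average log intensity over all chips

# One chip = average + technical distortion + everything else (treated as noise).
distortion = 0.04 * (tm - 60) - 1.0 * (ct - 0.5)
chip = avg + distortion + rng.normal(0, 0.2, n_probes)

# Deviations from the average chip, regressed on centered technical variables.
deviation = chip - avg
Z = np.column_stack([np.ones(n_probes),
                     tm - tm.mean(),
                     ct - ct.mean(),
                     avg - avg.mean()])
coef, *_ = np.linalg.lstsq(Z, deviation, rcond=None)

# Normalized chip: remove the part of the deviation predicted by technical variables.
corrected = chip - Z @ coef
```

A linear fit is only a starting point, since non-linear and non-additive effects are common in real data and call for a more flexible smoother over the same covariates.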

Covariates to Index Similar Probes

• Analogous to ‘loess’ normalization of Yang et al

• Index similar probes by technical covariates

• Known covariates of array probes

– Location (X,Y) on chip

– Reference (or average) intensity

– Tm (‘melting’ or annealing temperature)

– Location relative to priming site

• for expression arrays

– Pyrimidine content (C + T)

• Cross-hybridization easier than with purines

– Deviation of reference from average reference (two-color arrays)

Figure: Deviations of log intensity from average as function of average

Many Covariates Predict Deviations

• A moderate number (5-9) of technical predictors have significant effects on many chips

• Non-linear, non-additive interactions are usual

Figure: Deviations of chip GSM 25410 from average of all chips in expt., plotted against the average of all chips. LOESS curves tracking: Low CT, near 3’ end; High CT, near 3’ end; Low CT, far 3’ end; High CT, far 3’ end. Annotation: overall downward trend (apparent loss of expression) at higher values of average intensity.

Regression in Moderate Dimensions

• Local regression (LOESS) works reasonably well up to three or four dimensions, but there is too much flexibility in five or more dimensions

– Curse of dimensionality: if 7 variables are truly independent at all levels, and there are 4 bins for each variable: 4^7 = 16,384 bins

– But there is plenty of data!

• How to reconcile flexibility with restricting df?

• Representation: How to represent the high-dimensional surface effectively for 10^5 points?
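The bin count quoted above is simple arithmetic, and it is worth setting next to the number of probes. The sketch below assumes the "10^5 points" figure refers to probes on one chip and that points spread evenly over bins.

```python
bins_per_variable = 4
n_variables = 7
n_bins = bins_per_variable ** n_variables      # 4^7 = 16,384 bins
n_points = 10 ** 5                             # probes, per the slide
points_per_bin = n_points / n_bins             # roughly 6 probes per bin
```

So even with 10^5 probes, a fully binned 7-dimensional fit has only about six points per cell, which is why the degrees of freedom have to be rationed.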

Addressing Curse of Dimensionality

• Local regression unwieldy in 7 dimensions

• There don’t seem to be dimension reduction subspaces within predictors

– The condition number of the data matrices selected by 5% ‘slices’ is 1.5 - 2

• Approaches such as MARS don’t seem to work, because interactions dominate most main effects

– Manufacturers tune the probes to remove main effects

Issues in Building Representation

• Spend degrees of freedom wisely

• Borrow idea from MARS: limit number of effective dimensions in local regression

• Construct neighborhoods in R^7 that are wide in most directions, but narrow in directions of high variation

– Directions determined adaptively with high threshold