data preprocessing - broad institute · data acquisition and preprocessing are often united •...

Data Preprocessing

• Normalization: the process of removing sample-to-sample variation in the measurements that is not due to differential gene expression, bringing measurements from the different hybridizations to a common and convenient scale.

• Filtering: elimination of variables/genes whose expression variability across samples is below the instrument precision.

Sources of variation in the data

• Interesting variation:
  – e.g., differentially expressed genes between disease and normal tissues.

• "Obscuring" variation:
  – Technical:
    • sample preparation (RNA extraction), manufacture of the arrays, processing of the arrays (IVT), temperature in the lab
    • instrument (scanner) precision
    • different platforms
  – Biological:
    • different growth conditions, heterogeneity of samples, stochastic nature of biology.

Steps that introduce variation (noise)

[Figure: processing steps, starting from tissues, that introduce variation. Source: N Engl J Med, 354: 2463, 2006.]

Measuring variations

• Analysis of duplicated/replicated experiments at different levels can be used to assess the different sources of variation.

• Biological replicates: samples from the same biological 'state'.

• Technical replicates: splitting a single sample into several parts. Can be done at different stages of the protocol.

Biological variation >> Technical variation

cDNA Normalization

• cDNA and oligonucleotide arrays have different normalization needs due to different sources of noise.

• cDNA: since the two dyes (Cy3 and Cy5) are unbalanced (different efficiency), each channel is normalized separately and the channels are then combined.

• Normalization methods for the two technologies share similar concepts, but tools are often dedicated to a single technology.

Input: Raw signal matrix

[Figure: CEL files (probe-level data) → signal extraction from single experiments (e.g. using MAS5) → raw signal matrix indexed by genes (probesets) g1, g2, …, gm and experiments e1, e2, …, en → normalization.]

Data acquisition and preprocessing are often united

• Common signal extraction methods such as dChip, RMA (and gcRMA) combine signal extraction and preprocessing.

• Pros: normalization can be done at the probe level. Statistics on a set of samples can be used to identify outlier probes.

• Cons: generates a dependency between the samples. Example: adding or removing samples requires rerunning the signal extraction step.

Data acquisition: microarray processing

Data preprocessing: scaling/normalization/filtering

Scaling

• Common sources of variation yield readings at different scales. This can be caused by hybridizing different amounts of RNA, different efficiency of the labels, different scanning conditions …

• The distortion can be non-linear (e.g. due to saturation effects). Note: cDNA chips often encounter stronger non-linear distortions.

Example

• 10 biological replicates: MCF7 cells grown in 0.1% DMSO, taken from the Connectivity Map experiment (Lamb, Science 2006), run on U133A Affymetrix chips.

• Signal extracted with MAS5.

[Figure: Histogram of mean expression. x-axis: samples 1 to 10; y-axis: mean expression (0 to 250). 3.4-fold difference between extremes.]

Scatter plot

• Comparing samples using a scatter plot

[Figure: scatter plot of sample #7 (mean = 67.6875, y-axis) against sample #5 (mean = 227.5622, x-axis).]

Normalization/Scaling methods

• Linear scaling
• Invariant set normalization
• Quantile normalization
• cDNA: non-linear scaling (loess, splines, etc.)
• …

Normalization methods: linear scaling

• Assumption: same overall chip intensity across samples.
• Transformation: fitting a linear relationship with zero intercept.
• Reference sample: typically the one with the median mean value.

$$x'_i = x_i \times \frac{\bar{Y}}{\bar{X}}, \quad i = 1, 2, \ldots, p, \qquad \bar{X} = \mathrm{mean}(X), \quad \bar{Y} = \mathrm{mean}(Y)$$

Here X is the sample being scaled and Y is the reference sample.

[Figure: scatter plots before and after scaling, with the line y = x.]
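The linear scaling rule can be sketched in a few lines. This is a minimal illustration, not the implementation of any particular tool; the function name `linear_scale` and the toy data are assumptions, and the reference is chosen with `median_low` so that it is always the mean of an actual sample.

```python
from statistics import mean, median_low

def linear_scale(samples):
    """Scale each sample so that its mean equals the reference sample's mean.

    samples: list of lists, one inner list of intensities per array.
    """
    means = [mean(s) for s in samples]
    # Reference: the sample whose mean is the (low) median of all sample means.
    ref_mean = median_low(means)
    # x'_i = x_i * mean(Y) / mean(X), a line through the origin.
    return [[x * ref_mean / m for x in s] for s, m in zip(samples, means)]

# Hypothetical toy data: three arrays that differ only by a scale factor.
samples = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [4.0, 8.0, 12.0]]
scaled = linear_scale(samples)
```

After scaling, all three toy arrays share the mean of the middle one, which is what the "same overall chip intensity" assumption asks for.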

Normalization methods: Invariant Set Normalization (used by dChip)

• Assumption: many genes are kept unchanged and the transformation is linear.

• Method: identify genes whose ranks are relatively constant (e.g. std. of rank < 10).

• Use the mean of these genes to linearly scale the samples.

• Repeat several times, until convergence.
• Used by dChip at the probe level.

Normalization methods: quantile normalization (used by RMA)

• Assumption: measured expression grows monotonically with the true level of expression.

• Method (RMA uses quantile normalization at the probe level):
  – Transform the data so that the quantile-quantile plot for any two arrays is the straight identity line.
  – Take the mean quantile (across samples) and substitute it as the value of the data item in the original dataset.

[Figure: probes × samples matrix (color corresponds to rank): sort by column → replace with row averages → rearrange in original order.]
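The three steps (sort by column, replace with row averages, rearrange in original order) can be sketched as follows. This is an illustrative sketch only: the function name `quantile_normalize` and the toy matrix are assumptions, and ties are broken arbitrarily rather than averaged as a production implementation might do.

```python
def quantile_normalize(matrix):
    """matrix[p][s] is the value of probe p in sample s; returns a normalized copy."""
    n_probes, n_samples = len(matrix), len(matrix[0])
    # Step 1: for each sample (column), the probe indices in increasing order of value.
    order = [sorted(range(n_probes), key=lambda p: matrix[p][s])
             for s in range(n_samples)]
    # Step 2: the mean of the r-th smallest value across all samples.
    rank_means = [sum(matrix[order[s][r]][s] for s in range(n_samples)) / n_samples
                  for r in range(n_probes)]
    # Step 3: put the rank means back in each probe's original position.
    out = [[0.0] * n_samples for _ in range(n_probes)]
    for s in range(n_samples):
        for r, p in enumerate(order[s]):
            out[p][s] = rank_means[r]
    return out

# Hypothetical toy data: 3 probes (rows) measured in 2 samples (columns).
normalized = quantile_normalize([[2.0, 4.0], [5.0, 14.0], [4.0, 8.0]])
```

After the transformation every column contains exactly the same set of values, so the quantile-quantile plot of any two samples lies on the identity line.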

MAS5: Std(log2) vs. Mean(log2)

• Log2-transformation: noise nearly constant at all levels

[Figure: MAS5 signal. x-axis: mean of log (−2 to 14); y-axis: std of log (0 to 3).]

RMA: Std(log2) vs. Mean(log2)

[Figure: RMA signal. x-axis: mean of log (2 to 14); y-axis: std of log (0 to 2.5).]

RMA: less variation, more constant across values

Normalization assumptions: summary

• Same overall intensity (or same distribution) for different arrays.
• Measured expression grows linearly (or at least monotonically) with the true level of expression.
• Gene-specific noise is multiplicative (additive in the log scale).
• The log2 transform makes the noise independent of the mean.

• We typically use RMA or MAS5.

• Danger: make sure these assumptions hold in your experiment. For example: stem cells have higher overall expression than differentiated cells.
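A short worked step shows why multiplicative noise becomes additive and mean-independent after the log2 transform. The noise model below (measured value equals true level times a log-normal factor) is an assumption chosen for illustration; the slides do not write it out.

```latex
x_g = \mu_g \, e^{\varepsilon}, \quad \varepsilon \sim N(0, \sigma)
\;\Longrightarrow\;
\log_2 x_g = \log_2 \mu_g + \varepsilon \log_2 e,
\qquad
\operatorname{Std}(\log_2 x_g) = \sigma \log_2 e .
```

Under this model the spread of the log2 values does not depend on the true level of expression, which matches the nearly flat std-versus-mean plots.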

Questions

• …


Why filtering?

• Small N, large P implies vulnerability to overfitting (modeling noise).

• Try to reduce the number of hypotheses (and therefore the number of false negatives).

Filtering methods

• Variation filters based on a simple threshold:
  – select only genes that vary more than a given minimum (e.g., genes with s² > τ_s, or MAD > τ_MAD, or CV > τ_CV, etc.).

• Variation filters based on a noise envelope:
  – define a noise envelope based on replicates, select genes whose variation is larger than the envelope.

• Gene selection based on reproducibility:
  – need for duplicates.

• …
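A simple-threshold variation filter could look like the sketch below. The function names, the dictionary layout, and the thresholds are illustrative assumptions; note that the CV is only meaningful for data with a nonzero mean (for example unlogged intensities).

```python
from statistics import mean, median, stdev, variance

def mad(values):
    """Median absolute deviation from the median."""
    m = median(values)
    return median([abs(v - m) for v in values])

def variation_stats(values):
    """Sample variance (s^2), MAD, and coefficient of variation for one gene."""
    return {"var": variance(values),
            "mad": mad(values),
            "cv": stdev(values) / mean(values)}

def variation_filter(expr, stat="var", threshold=1.0):
    """Keep genes whose chosen variation statistic exceeds the threshold.

    expr maps gene name -> list of expression values across samples.
    """
    return [g for g, v in expr.items() if variation_stats(v)[stat] > threshold]

# Hypothetical toy data: geneA is nearly flat, geneB varies strongly.
expr = {"geneA": [10.0, 10.5, 9.5, 10.0], "geneB": [2.0, 8.0, 14.0, 4.0]}
kept = variation_filter(expr, stat="var", threshold=1.0)
```

Switching `stat` to `"mad"` or `"cv"` applies the other two criteria named in the list above with the same threshold logic.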

[Figure: two panels of per-gene variation. RMA: std of log vs. mean of log. MAS5: std vs. mean on log-log axes.]


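Filtering based on noise envelope

• Estimate a noise envelope based on replicate data (here: a set of 14 Stratagene© samples):

$$\log(\sigma_g) = \alpha + \beta \log(\mu_g) + \varepsilon, \qquad \varepsilon \sim N(0, \sigma)$$

• Compute the gene-specific mean and stdev on the data to be filtered.

• Superimpose the envelope on the data to be filtered and test whether each gene lies above its 95% upper bound:

$$\log(\sigma_g^{new}) \;\overset{?}{\geq}\; \left\lceil \hat{\alpha} + \hat{\beta} \log(\mu_g^{new}) \right\rceil_{95\%}$$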
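The envelope procedure can be sketched as an ordinary least-squares fit in log-log space followed by a threshold test. Several choices here are assumptions rather than the slides' method: natural logarithms, a one-sided normal quantile (z = 1.645) for the 95% bound, and the function names and toy data. All means must be positive and all standard deviations nonzero.

```python
import math
from statistics import mean, stdev

def fit_envelope(replicates):
    """Fit log(std) = alpha + beta * log(mean) over genes measured in replicates.

    replicates: one list of replicate values per gene.
    Returns (alpha, beta, sigma), with sigma the residual standard deviation.
    """
    xs = [math.log(mean(v)) for v in replicates]
    ys = [math.log(stdev(v)) for v in replicates]
    xbar, ybar = mean(xs), mean(ys)
    beta = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
            / sum((x - xbar) ** 2 for x in xs))
    alpha = ybar - beta * xbar
    resid = [y - (alpha + beta * x) for x, y in zip(xs, ys)]
    sigma = math.sqrt(sum(r * r for r in resid) / (len(xs) - 2))
    return alpha, beta, sigma

def passes_envelope(values, alpha, beta, sigma, z=1.645):
    """True if the gene's log(std) is at or above the upper envelope."""
    upper = alpha + beta * math.log(mean(values)) + z * sigma
    return math.log(stdev(values)) >= upper

# Hypothetical replicate data: noise roughly proportional to the mean.
replicates = [[10, 11, 9], [100, 108, 92], [1000, 1100, 900], [50, 54, 47]]
alpha, beta, sigma = fit_envelope(replicates)
noisy_gene_kept = passes_envelope([10, 30, 2], alpha, beta, sigma)
quiet_gene_kept = passes_envelope([10, 10.2, 9.8], alpha, beta, sigma)
```

Genes that pass vary more than replicate noise alone would explain at their expression level, so they are the ones retained by the filter.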

Filtering based on duplicates

• Look at duplicates (sample pairs).

• Select genes whose expression across duplicates correlates best.

Filtering based on reproducibility

For a given gene i (duplicate pair), i = 1, …, n:

                        experiment 1   experiment 2   …   experiment N
gene i (duplicate 1)    g11            g12            …   g1N
gene i (duplicate 2)    g21            g22            …   g2N

[Figure: scatter plot of duplicate 2 against duplicate 1, one point per experiment.]

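F statistic: maximizing correlation and spread

$$F = \frac{B}{W} \;\overset{?}{>}\; 1$$

B: between-groups variation; W: within-group variation

$$B = \frac{\sum_{i \in \#Groups} \sum_{j \in Group_i} \left( \bar{g}_{i\bullet} - \bar{g}_{\bullet\bullet} \right)^2}{\#Groups - 1}, \qquad W = \frac{\sum_{i \in \#Groups} \sum_{j \in Group_i} \left( g_{ij} - \bar{g}_{i\bullet} \right)^2}{\#Samples - \#Groups}$$

where $\bar{g}_{\bullet\bullet}$ is the overall mean and $\bar{g}_{i\bullet}$ is the mean of group i.

[Figure: duplicate scatter plots illustrating good and bad correlation and good and bad spread.]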
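The F statistic can be computed directly from the definitions of B and W. In this sketch each group is assumed to be one experiment and its members are the duplicate measurements of a single gene; the function name and toy values are illustrative, and a gene with zero within-group variation would cause a division by zero.

```python
def f_statistic(groups):
    """One-way F statistic B / W for a single gene.

    groups: list of lists; each inner list holds the gene's values in one group.
    """
    n_groups = len(groups)
    n_samples = sum(len(g) for g in groups)
    overall = sum(sum(g) for g in groups) / n_samples
    means = [sum(g) / len(g) for g in groups]
    # Between-groups variation: group mean vs. overall mean, counted once per sample.
    between = sum(len(g) * (m - overall) ** 2
                  for g, m in zip(groups, means)) / (n_groups - 1)
    # Within-group variation: each value vs. its own group mean.
    within = sum((x - m) ** 2
                 for g, m in zip(groups, means) for x in g) / (n_samples - n_groups)
    return between / within

# Hypothetical genes measured in duplicate in three experiments.
reproducible = f_statistic([[1.0, 1.2], [5.0, 5.2], [9.0, 8.8]])
irreproducible = f_statistic([[1.0, 9.0], [5.0, 5.2], [9.0, 1.0]])
```

The first gene has duplicates that agree and a wide spread across experiments, so F is far above 1; the second has duplicates that disagree, so F falls below 1.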

Best/worst markers

[Figure: best and worst markers. DLBCL dataset [Blood, 105(5):1851-1861 2005].]

References

1. Quackenbush, J., Microarray Data Normalization and Transformation. Nature Genetics, 2002. 32(4s): p. 496-501.

2. The Tumor Analysis Best Practices Working Group, Expression Profiling - Best Practices for Data Generation and Interpretation in Clinical Trials. Nature Reviews Genetics, 2004. 5(3): p. 229-237.

3. Dudoit, S., et al., Statistical Methods for Identifying Differentially Expressed Genes in Replicated cDNA Microarray Experiments. Statistica Sinica, 2002. 12(1): p. 111-139.

4. Hartemink, A.J., et al., Maximum Likelihood Estimation of Optimal Scaling Factors for Expression Array Normalization, in SPIE International Symposium on Biomedical Optics (BiOS01), M. Bittner, et al., Editors. 2001. p. 132-140.

5. Irizarry, R.A., et al., Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 2003. 4(2): p. 249-264.

6. Many more …

Questions

• …

Visualization: all steps can benefit from visualization

Visualization

• Heat Map
• PCA
• SVD
• MDS
• NMF (discussed in the clustering session)

Visualization of E: Heatmap

• Raw data (RMA, log2 scale)

$$E = \begin{pmatrix} & \vdots & \\ \cdots & e_{ij} & \cdots \\ & \vdots & \end{pmatrix}$$

[Figure: heatmap of the 500 most variable genes (clustered), samples grouped as ALL, MLL, AML. Leukemia data from Armstrong et al., Nat. Genet. (2002).]

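AML/MLL/ALL Heatmap

• Row centered and normalized data

$$x_{ij} = \frac{e_{ij} - \mu_i}{\sigma_i}, \qquad X = \begin{pmatrix} x_{11} & \cdots & \\ \vdots & x_{ij} & \vdots \\ & \cdots & x_{mn} \end{pmatrix}$$

where $\mu_i$ and $\sigma_i$ are the mean and standard deviation of row (gene) i.

Pro: clearly see differential expression
Con: lose absolute value

[Figure: heatmap of the 500 most variable genes (clustered), samples grouped as ALL, MLL, AML.]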
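Row centering and normalization is a per-gene z-score. A minimal sketch follows; the function name `row_normalize` is illustrative, and the use of the sample standard deviation (n − 1 denominator) is an assumption, since the slides do not say which estimator is used. Rows with zero variance would need separate handling.

```python
from statistics import mean, stdev

def row_normalize(E):
    """Return X with x_ij = (e_ij - mu_i) / sigma_i for each row i of E."""
    X = []
    for row in E:
        mu, sigma = mean(row), stdev(row)
        X.append([(e - mu) / sigma for e in row])
    return X

# Hypothetical toy data: two genes with the same pattern at different scales.
X = row_normalize([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]])
```

Both rows map to the same normalized profile, which illustrates the trade-off named above: the pattern is easy to compare, but the absolute level is lost.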

AML/MLL/ALL 3D Heatmap

• Raw z-axis and color

[Figure: 3D heatmap, samples grouped as ALL, MLL, AML.]

AML/MLL/ALL 3D Heatmap

• Raw z-axis, row centered and normalized color

[Figure: 3D heatmap, samples grouped as ALL, MLL, AML; labels: color, z-axis.]

Dimensionality Reduction

• Our brain is a good pattern recognition tool
• Problem: we are used to handling only 2 (or 3) dimensions
• Solution: dimensionality reduction

Visualization of genes or samples

• Aim: given x_{1:n} ∈ R^d, find y_{1:n} ∈ R^k (typically k << d) that capture some properties of x_{1:n}

• Methods:
  – Principal Component Analysis (PCA): projection onto a low-dimensional hyper-plane.
  – Singular Value Decomposition (SVD): approximate a matrix by a sum of simple matrices.
  – Multi-dimensional Scaling (MDS): mapping points such that distances are preserved.
  – Graph layout methods (e.g. springs and charges)
  – Independent Component Analysis (ICA)
  – Projection pursuit
  – …

• Caution: our brain also tends to find patterns in random data (over-fitting)

Principal Component Analysis (PCA)

• Aim: find a low k-dimensional hyper-plane on which the variation of the projected data is maximal.

$$y_i = V^T (x_i - \mu), \qquad \mu = \sum_i x_i / n, \qquad V = (v_1 \; v_2)$$

$$J(V) = \sigma_1^2 + \sigma_2^2$$

• Objective: find V that maximizes J(V).
• Equivalent to: find a low k-dimensional hyper-plane such that the projected data best approximate the original data.

Matrix multiplication: $C_{n \times k} = A_{n \times m} B_{m \times k}$, with $c_{ij} = \sum_{\alpha=1}^{m} a_{i\alpha} b_{\alpha j}$.

[Figure: data points x_i with the directions v1 and v2 and the variances σ1², σ2² along them.]

Incremental Building of the k Principal Components

• Algorithm (not the one actually used):
  – Loop i from 1 to k:
    • Find the direction v_i along which the variance σ_i² is maximal.
    • Remove from each point its projection on v_i.
  – ⇒ Principal components V = (v_1, …, v_k), captured variances {σ_i²}_{1:k}
  – ⇒ The projected data y_i = V^T (x_i − μ)

• The fraction of variance that is captured by the principal components, c_k, measures how well the projected data approximate the original data:

$$s_k = \sum_{i=1}^{k} \sigma_i^2, \qquad c_k = s_k / s_d$$

[Figure: captured variance c_k vs. k = # of PC.]
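The incremental algorithm can be sketched with power iteration to find each maximal-variance direction, followed by deflation (removing the projection). This is a teaching sketch under assumptions: power iteration is one possible way to find v_i, variances use a 1/n denominator, and the function name, iteration count, and toy points are illustrative.

```python
import math
import random

def pca_deflation(points, k, iters=200, seed=0):
    """Return (V, variances, captured) for the first k principal components."""
    rng = random.Random(seed)
    n, d = len(points), len(points[0])
    mu = [sum(p[j] for p in points) / n for j in range(d)]
    X = [[p[j] - mu[j] for j in range(d)] for p in points]   # centered data
    total = sum(x * x for row in X for x in row) / n         # s_d: total variance
    V, variances = [], []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(a * a for a in v))
        v = [a / norm for a in v]
        for _ in range(iters):
            # Power iteration step: w = X^T (X v), then renormalize.
            proj = [sum(a * b for a, b in zip(row, v)) for row in X]
            w = [sum(proj[i] * X[i][j] for i in range(n)) for j in range(d)]
            norm = math.sqrt(sum(a * a for a in w))
            if norm == 0.0:          # no variance left in the data
                break
            v = [a / norm for a in w]
        proj = [sum(a * b for a, b in zip(row, v)) for row in X]
        V.append(v)
        variances.append(sum(p * p for p in proj) / n)       # sigma_i^2
        # Deflation: remove from each point its projection on v.
        X = [[x - p * a for x, a in zip(row, v)] for row, p in zip(X, proj)]
    captured = [sum(variances[:i + 1]) / total for i in range(k)]  # c_k
    return V, variances, captured

# Hypothetical toy data: points scattered around the line y = x.
pts = [(0.0, 0.0), (1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8)]
V, variances, captured = pca_deflation(pts, k=2)
```

For these points the first component points along the diagonal and captures almost all of the variance, and with k equal to the data dimension the captured fraction reaches 1.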

PCA of leukemia samples

• Input

[Figure: genes × samples expression matrix, samples grouped as ALL, MLL, AML.]

• Output

[Figure: samples plotted on the principal components v1, v2, v3.]

Pitfalls of PCA

• Largest variance ≠ most informative: 2 pancakes

[Figure: two parallel flat clusters; the direction with the largest variance differs from the "interesting" direction.]

• Structure in low-dimensional space ⇒ there is structure in the full space. But NOT ⇐

Singular Value Decomposition (SVD)

• Aim: find the best approximation for E by a sum of K rank-1 matrices (meta-sample ⊗ meta-gene)

[Figure: E approximated as a sum of rank-1 matrices; 3D scatter of the ALL, MLL, and AML samples on the axes SVD #1, SVD #2, SVD #3.]

SVD of leukemia data

[Figure: heatmap of E (samples grouped as AML, ALL, MLL) approximated by the sum s1 v1 u1^T + s2 v2 u2^T + s3 v3 u3^T, each term shown as a heatmap; a second panel shows the same decomposition written as a product of factor matrices.]

Singular Value Decomposition (SVD)

• Aim: find the best approximation for E by a sum of K rank-1 matrices (meta-sample ⊗ meta-gene)

• $E \approx \sum_{i=1}^{K} s_i v_i u_i^T$, where {v_i}, {u_i} are orthogonal unit vectors

• Objective function: $J(\{s_i\},\{v_i\},\{u_i\}) = \sum_{ij} \left( e_{ij} - \sum_{\alpha=1}^{K} s_\alpha v_{i\alpha} u_{j\alpha} \right)^2$, where $v_{i\alpha}$ is the i-th component of $v_\alpha$

• Method: unique solution based on diagonalizing $E E^T$

• Note: {v_i} are the same as in PCA of the samples if the genes are centered

• Clustering: identify elements with large absolute value as members of clusters

[Figure: E approximated by a sum of three rank-1 matrices.]
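The leading term s_1 v_1 u_1^T can be approximated by alternating multiplications with E and E^T, which amounts to power iteration on E E^T. This is a sketch of one possible numerical route, not the method used for the figures; the function name, the all-ones starting vector, and the toy matrix are assumptions, and the start must not be orthogonal to the leading singular vector.

```python
import math

def top_singular_triplet(E, iters=200):
    """Approximate (s, v, u) of the best rank-1 approximation s * v * u^T of E."""
    m, n = len(E), len(E[0])
    u = [1.0 / math.sqrt(n)] * n          # starting guess for the right vector
    v, s = [0.0] * m, 0.0
    for _ in range(iters):
        # v is proportional to E u
        v = [sum(E[i][j] * u[j] for j in range(n)) for i in range(m)]
        norm = math.sqrt(sum(a * a for a in v))
        v = [a / norm for a in v]
        # u is proportional to E^T v; its norm converges to the singular value
        u = [sum(E[i][j] * v[i] for i in range(m)) for j in range(n)]
        s = math.sqrt(sum(a * a for a in u))
        u = [a / s for a in u]
    return s, v, u

# Hypothetical rank-1 matrix: outer product of (2, 1) and (1, 2).
E = [[2.0, 4.0], [1.0, 2.0]]
s, v, u = top_singular_triplet(E)
approx = [[s * v[i] * u[j] for j in range(2)] for i in range(2)]
```

Because the toy matrix has rank 1, a single term reproduces it exactly; for real expression data the remaining terms can be found by subtracting the approximation and repeating.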

Multi-dimensional Scaling (MDS)

• Aim: find a low k-dimensional representation of the data that best preserves the distance matrix of the original data.

• Objective: find y_{1:n} that minimize J(δ_ij, d_ij), where J(δ_ij, d_ij) measures how well d_ij approximates δ_ij.

• Method: gradient descent

$$\delta_{ij} = \lVert x_i - x_j \rVert, \qquad d_{ij} = \lVert y_i - y_j \rVert$$

[Figure: original points x_i mapped to low-dimensional points y_i.]

Objective functions

• Different ways to measure similarity between distance matrices:

  – Emphasize large differences:

$$J_{ee} = \frac{\sum_{i<j} \left( d_{ij} - \delta_{ij} \right)^2}{\sum_{i<j} \delta_{ij}^2}$$

  – Emphasize fractional differences:

$$J_{ff} = \sum_{i<j} \left( \frac{d_{ij} - \delta_{ij}}{\delta_{ij}} \right)^2$$

  – …
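Both objective functions are straightforward to evaluate for a candidate embedding. The sketch below assumes Euclidean distances and distinct original points (so no δ_ij is zero); the function names and the toy coordinates are illustrative.

```python
import math
from itertools import combinations

def pairwise(points):
    """Euclidean distances for all pairs i < j, in a fixed order."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]

def j_ee(delta, d):
    """Emphasizes large differences: sum (d - delta)^2 / sum delta^2."""
    return sum((a - b) ** 2 for a, b in zip(d, delta)) / sum(b ** 2 for b in delta)

def j_ff(delta, d):
    """Emphasizes fractional differences: sum ((d - delta) / delta)^2."""
    return sum(((a - b) / b) ** 2 for a, b in zip(d, delta))

# Hypothetical data: three points in 3D and a candidate 2D embedding.
x = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 5.0)]
y = [(0.0, 0.0), (3.0, 4.0), (0.0, 5.0)]
delta, d = pairwise(x), pairwise(y)
stress_ee, stress_ff = j_ee(delta, d), j_ff(delta, d)
```

In this toy embedding two of the three distances are preserved exactly and only the third contributes to either objective; an embedding that preserved all distances would score 0 on both.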

Gradient Descent

• Aim: find the minimum of J(a)
• Method:
  – Init: a(0) ← a random position
  – Iterate: a(t+1) ← a(t) − η∇J(a(t))
  – Stop: when ‖a(t+1) − a(t)‖ < ε or t > T

• Problem: finds a local minimum that depends on the starting point (according to its basin of attraction)

• See also: Newton's algorithm, conjugate gradient

$$\nabla J(a) = \left. \begin{pmatrix} \partial J(x) / \partial x_1 \\ \partial J(x) / \partial x_2 \\ \vdots \\ \partial J(x) / \partial x_d \end{pmatrix} \right|_{x = a}$$

[Figure: gradient descent.]
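The update rule and stopping criterion translate almost line by line into code. In this sketch the starting point is passed in explicitly instead of being drawn at random, and the step size, tolerance, function name, and the quadratic test function are all illustrative assumptions.

```python
import math

def gradient_descent(grad, a0, eta=0.1, eps=1e-8, max_iter=10000):
    """Iterate a <- a - eta * grad(a) until the step is shorter than eps."""
    a = list(a0)
    for _ in range(max_iter):                      # stop when t > T
        g = grad(a)
        new = [ai - eta * gi for ai, gi in zip(a, g)]
        step = math.sqrt(sum((x - y) ** 2 for x, y in zip(new, a)))
        a = new
        if step < eps:                             # stop when ||a(t+1) - a(t)|| < eps
            break
    return a

# J(a) = (a1 - 1)^2 + (a2 + 2)^2 has gradient (2(a1 - 1), 2(a2 + 2))
# and a single minimum at (1, -2).
a_min = gradient_descent(lambda a: [2 * (a[0] - 1), 2 * (a[1] + 2)], [5.0, 5.0])
```

This test function has only one minimum, so any start converges to it; for an MDS objective with several local minima, the result would depend on the starting point as noted above.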

MDS for leukemia samples

• Input:

[Figure: input data.]

• Used J_ee with Euclidean distance

• Output:

[Figure: MDS embedding of the samples.]

MDS vs. PCA

                                      PCA                   MDS
Optimal                               Yes                   Converges to local minima
Preserves distances                   Only projected part   Yes (attempts to)
Linear                                Yes                   Distorts space
Unique                                Yes                   Depends on initial configuration
Captures high-dimensional structure   Missing dimensions    Potentially better

References

1. Duda, Hart and Stork, Pattern Classification. Wiley & Sons, 2001.

2. Quackenbush, J., Microarray Data Normalization and Transformation. Nature Genetics, 2002. 32(4s): p. 496-501.

3. Allison, D.B., et al., Microarray data analysis: from disarray to consolidation and consensus. Nat. Rev. Genet., 2006. (7): p. 55-65.
