Data Preprocessing - Broad Institute
Data Preprocessing
• Normalization: the process of removing sample-to-sample variations in the measurements not due to differential gene expression; bringing measurements from the different hybridizations to a common and convenient scale.
• Filtering: elimination of variables/genes whose expression variability across samples is below the instrument precision.
Sources of variation in the data
• Interesting variation:
  – e.g., differentially expressed genes between disease and normal tissues.
• "Obscuring" variation:
  – Technical:
    • sample preparation (RNA extraction), manufacture of the arrays, processing of the arrays (IVT), temperature in the lab
    • instrument (scanner) precision
    • different platforms
  – Biological:
    • different growth conditions, heterogeneity of samples, stochastic nature of biology
Steps that introduce variation (noise)
[Figure: processing steps from tissues to scanned arrays; from N Engl J Med, 354: 2463, 2006]
Measuring variations
• Analysis of duplicated/replicated experiments at different levels can be used to assess the different sources of variation.
• Biological replicates: samples from the same biological 'state'.
• Technical replicates: splitting a single sample into several parts. Can be done at different stages of the protocol.
Biological variation >> Technical variation
Data Preprocessing
• Normalization: the process of removing sample-to-sample variations in the measurements not due to differential gene expression; bringing measurements from the different hybridizations to a common and convenient scale.
• Filtering: elimination of variables/genes whose expression variability across samples is below the instrument precision.
cDNA Normalization
• cDNA and oligonucleotide arrays have different normalization needs due to different sources of noise.
• cDNA: since the two dyes (Cy3 and Cy5) are unbalanced (different efficiency), each channel is normalized separately and then combined.
• Normalization methods for the two technologies share similar concepts, but tools are often dedicated to a single technology.
Input: Raw signal matrix
CEL files (probe-level data) → signal extraction from single experiments (e.g. using MAS5) → raw signal matrix → Normalization
Rows of the matrix: genes (probesets) g1 … gm; columns: experiments e1 … en.
Data acquisition and preprocessing are often united
• Common signal extraction methods such as dChip, RMA (and gcRMA) combine signal extraction and preprocessing.
• Pros: normalization can be done at the probe level; statistics over a set of samples can be used to identify outlier probes.
• Cons: generates a dependency between the samples. Example: adding/removing samples requires rerunning the signal extraction step.
Data acquisition (microarray processing) → Data preprocessing (scaling/normalization/filtering)
Scaling
• Common sources of variation yield readings at different scales. This can be caused by hybridizing different amounts of RNA, different label efficiencies, different scanning conditions, …
• The distortion can be non-linear (e.g. due to saturation effects). Note: cDNA chips often encounter stronger non-linear distortions.
Example
• 10 biological replicates: MCF7 cells grown in 0.1% DMSO, taken from the Connectivity Map experiment (Lamb, Science 2006), run on U133A Affymetrix chips.
• Signal extracted with MAS5.
[Figure: histogram of mean expression across the 10 replicates; 3.4-fold difference between extremes]
Scatter plot
• Comparing samples using a scatter plot.
[Figure: scatter plot of Sample #7 (mean = 67.6875) against Sample #5 (mean = 227.5622)]
Normalization/Scaling methods
• Linear scaling
• Invariant set normalization
• Quantile normalization
• cDNA: non-linear scaling (loess, splines, etc.)
• …
Normalization methods: linear scaling
• Assumption: same overall chip intensity across samples.
• Transformation: fit a linear relationship with zero intercept:
  x'_i = x_i × (Ȳ / X̄),  i = 1, 2, …, p,  where X̄ = mean(X), Ȳ = mean(Y)
• Reference sample: typically the one with the median mean value.
[Before/After scatter plots against the identity line y = x]
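The transformation above can be sketched in a few lines of numpy (toy data and variable names are illustrative, not any package's implementation):

```python
import numpy as np

def linear_scale(x, reference):
    """Linear scaling with zero intercept: x'_i = x_i * (mean(Y) / mean(X)),
    where Y is the reference sample and X the sample being scaled."""
    return x * (reference.mean() / x.mean())

# Toy data: the same expression pattern measured at different overall intensity.
rng = np.random.default_rng(0)
reference = rng.uniform(50.0, 500.0, size=1000)
sample = 3.4 * reference                 # a ~3.4-fold brighter chip
scaled = linear_scale(sample, reference)
```

After scaling, the sample's mean matches the reference mean exactly, by construction.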
Normalization methods: Invariant Set Normalization (used by dChip)
• Assumption: many genes are unchanged, and the transformation is linear.
• Method: identify genes whose ranks are relatively constant across samples (e.g. std. of rank < 10).
• Use the mean of these genes to linearly scale the samples.
• Repeat several times, until convergence.
• Used by dChip at the probe level.
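A toy, single-pass sketch of the idea (dChip's actual procedure iterates and works at the probe level; the rank threshold and names here are illustrative):

```python
import numpy as np

def invariant_set_scale(sample, reference, rank_diff_max=10):
    """Genes whose rank hardly changes between the two samples are assumed
    unchanged; the means of those genes fit a linear scaling factor."""
    r_s = sample.argsort().argsort()       # rank of each gene in the sample
    r_r = reference.argsort().argsort()    # rank of each gene in the reference
    invariant = np.abs(r_s - r_r) < rank_diff_max
    return sample * (reference[invariant].mean() / sample[invariant].mean())

reference = np.linspace(10.0, 1000.0, 500)
sample = 2.0 * reference                   # same pattern, double intensity
normalized = invariant_set_scale(sample, reference)
```

In this toy case all ranks agree, so every gene is "invariant" and the sample is scaled back onto the reference exactly.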
Normalization methods: quantile normalization (used by RMA)
• Assumption: measured expression grows monotonically with the true level of expression.
• Method (RMA uses quantile normalization at the probe level):
  – Transform the data so that the quantile-quantile plot for any two arrays is the straight identity line.
  – Take the mean quantile (across samples) and substitute it as the value of the data item in the original dataset.
• Schematically (rows = probes, columns = samples, color corresponds to rank): sort by column → replace with row averages → rearrange in original order.
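A compact sketch of the three steps (sort by column, replace with row averages, rearrange in original order); tie handling is naive here, unlike production implementations:

```python
import numpy as np

def quantile_normalize(X):
    """Quantile normalization of a probes-x-samples matrix: sort each column,
    average across each sorted row, then put the row averages back in each
    column's original order."""
    order = np.argsort(X, axis=0)                 # sort by column
    row_means = np.sort(X, axis=0).mean(axis=1)   # replace w/ row averages
    Xn = np.empty_like(X, dtype=float)
    for j in range(X.shape[1]):
        Xn[order[:, j], j] = row_means            # rearrange in original order
    return Xn

X = np.array([[5., 4., 3.],
              [2., 1., 4.],
              [3., 4., 6.],
              [4., 2., 8.]])
Xn = quantile_normalize(X)
```

After normalization every column contains exactly the same set of values (the row averages), so all samples share one distribution.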
MAS5: Std(log2) vs. Mean(log2)
• Log2-transformation: noise is nearly constant at all levels.
[Figure: std of log2 expression vs. mean of log2 expression under MAS5]
RMA: Std(log2) vs. Mean(log2)
[Figure: std of log2 expression vs. mean of log2 expression under RMA]
• RMA: less variation, more constant across values.
Normalization assumptions: summary
• Same overall intensity (or same distribution) for different arrays.
• Measured expression grows linearly with the true level of expression (or at least monotonically).
• Gene-specific noise is multiplicative (additive in the log scale).
• The log2 transform makes the noise independent of the mean.
• We typically use RMA or MAS5.
• Danger: make sure these assumptions hold in your experiment. For example: stem cells have higher overall expression than differentiated cells.
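A simulated illustration of the "noise is additive in log scale" assumption (synthetic data, not from any chip; the levels and noise width are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
true_level = np.array([10.0, 100.0, 1000.0])            # low/mid/high expressors
noise = rng.lognormal(mean=0.0, sigma=0.2, size=(500, 3))
measured = true_level * noise                           # multiplicative noise

raw_std = measured.std(axis=0)           # grows with the expression level
log_std = np.log2(measured).std(axis=0)  # roughly the same at all levels
```

On the raw scale the std scales with the mean; after log2 the three columns have nearly identical noise.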
Questions
• …

Data Preprocessing
• Normalization: the process of removing sample-to-sample variations in the measurements not due to differential gene expression; bringing measurements from the different hybridizations to a common and convenient scale.
• Filtering: elimination of variables/genes whose expression variability across samples is below the instrument precision.
Why filtering?
• Small N, large P implies vulnerability to overfitting (modeling noise).
• Try to reduce the number of hypotheses (and therefore the number of false negatives).
Filtering methods
• Variation filters based on a simple threshold:
  – select only genes that vary more than a given minimum (e.g., genes with s² > τ_s, or MAD > τ_MAD, or CV > τ_CV, etc.).
• Variation filters based on a noise envelope:
  – define a noise envelope based on replicates; select genes whose variation is larger than the envelope.
• Gene selection based on reproducibility:
  – needs duplicates.
• …
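A simple-threshold variation filter is essentially a one-liner per criterion; a sketch (matrix layout and thresholds are illustrative):

```python
import numpy as np

def variation_filter(E, min_std=None, min_cv=None):
    """Keep genes (rows of a genes-x-samples matrix) whose variation across
    samples exceeds the given threshold(s). Returns a boolean mask."""
    keep = np.ones(E.shape[0], dtype=bool)
    std = E.std(axis=1)
    if min_std is not None:
        keep &= std > min_std
    if min_cv is not None:
        keep &= (std / E.mean(axis=1)) > min_cv   # coefficient of variation
    return keep

E = np.array([[5.0, 5.1, 4.9, 5.0],    # flat gene: below instrument precision
              [1.0, 8.0, 2.0, 9.0]])   # strongly varying gene
mask = variation_filter(E, min_std=1.0)
```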
[Figures: Std(log2) vs. Mean(log2) under RMA, and Std vs. Mean on log-log axes under MAS5]
Filtering based on noise envelope
• Compute gene-specific mean and stdev on the data to be filtered.
• Estimate a noise envelope based on replicate data:
  log(σ_g) = α + β log(μ_g) + ε,  ε ~ N(0, σ)
• Super-impose the envelope on the data to be filtered: keep gene g if
  log(σ_g^new) ≥ α̂ + β̂ log(μ_g^new)  (e.g. at the 95% level of the envelope)
• Here: a set of 14 Stratagene© samples.
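A sketch of the envelope fit and filter, assuming the replicate arrays sit in a genes × samples numpy array (a real analysis would use the 95% level of the envelope; here the mean trend is used, and all names are illustrative):

```python
import numpy as np

def fit_noise_envelope(replicates):
    """Fit log(sigma_g) = alpha + beta * log(mu_g) by least squares on
    replicate data (genes x replicate samples). Returns (alpha, beta)."""
    mu = replicates.mean(axis=1)
    sd = replicates.std(axis=1, ddof=1)
    A = np.column_stack([np.ones_like(mu), np.log(mu)])
    (alpha, beta), *_ = np.linalg.lstsq(A, np.log(sd), rcond=None)
    return alpha, beta

def envelope_filter(E, alpha, beta):
    """Keep genes whose observed std exceeds the fitted noise trend."""
    return np.log(E.std(axis=1, ddof=1)) >= alpha + beta * np.log(E.mean(axis=1))

# Simulated replicates with noise proportional to the mean (so beta ~= 1).
rng = np.random.default_rng(2)
mu = rng.uniform(10.0, 1000.0, size=200)
replicates = mu[:, None] * (1.0 + 0.1 * rng.normal(size=(200, 5)))
alpha, beta = fit_noise_envelope(replicates)
keep = envelope_filter(replicates, alpha, beta)
```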
Filtering based on duplicates
• Look at duplicates (sample pairs).
• Select genes whose expression across duplicates correlates best.
Filtering based on reproducibility
[Figure: scatter of duplicate 1 vs. duplicate 2 measurements]
Given gene_i (a duplicate pair), i = 1, …, n:

                        experiment_1  experiment_2  …  experiment_N
gene_i (duplicate 1)    g_11          g_12          …  g_1N
gene_i (duplicate 2)    g_21          g_22          …  g_2N
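Scoring reproducibility as the per-gene correlation between the two duplicate rows might look like this (a sketch; the matrices and toy values are illustrative):

```python
import numpy as np

def reproducibility_scores(dup1, dup2):
    """Per-gene Pearson correlation between two duplicate measurements.
    dup1, dup2: genes x experiments matrices (row i = gene_i)."""
    a = dup1 - dup1.mean(axis=1, keepdims=True)
    b = dup2 - dup2.mean(axis=1, keepdims=True)
    return (a * b).sum(axis=1) / np.sqrt((a ** 2).sum(axis=1) * (b ** 2).sum(axis=1))

dup1 = np.array([[1.0, 2.0, 3.0, 4.0],    # a gene reproduced perfectly ...
                 [1.0, 2.0, 3.0, 4.0]])   # ... and a gene that is not
dup2 = np.array([[2.0, 4.0, 6.0, 8.0],
                 [4.0, 1.0, 3.0, 2.0]])
scores = reproducibility_scores(dup1, dup2)
```

Genes are then kept if their score exceeds a chosen cutoff.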
F statistic: maximizing correlation and spread
• Is F = B / W > 1?
• B = Between-groups variation, W = Within-group variation:
  B = Σ_{i∈Groups} Σ_{j∈Group_i} (ḡ_i• − ḡ••)² / (#Groups − 1)
  W = Σ_{i∈Groups} Σ_{j∈Group_i} (g_ij − ḡ_i•)² / (#Samples − #Groups)
  (ḡ•• = overall mean, ḡ_i• = group mean)
[Diagram: genes plotted by correlation across duplicates vs. spread; high correlation with high spread = good markers; low correlation or low spread = bad]
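The F statistic above, transcribed directly for a single gene (the toy expression values are made up):

```python
import numpy as np

def f_statistic(values, groups):
    """One-way F statistic for one gene: between-group variation B over
    within-group variation W."""
    labels = np.unique(groups)
    n, k = len(values), len(labels)
    overall = values.mean()
    B = sum((groups == g).sum() * (values[groups == g].mean() - overall) ** 2
            for g in labels) / (k - 1)
    W = sum(((values[groups == g] - values[groups == g].mean()) ** 2).sum()
            for g in labels) / (n - k)
    return B / W

expr = np.array([1.0, 1.1, 0.9, 5.0, 5.1, 4.9])   # tight groups, far apart
groups = np.array([0, 0, 0, 1, 1, 1])
F = f_statistic(expr, groups)
```

Tight, well-separated groups give a very large F; identical group means would give F near 0.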
Best/worst markers
DLBCL dataset [Blood, 105(5):1851-1861, 2005]
References
1. Quackenbush, J., Microarray Data Normalization and Transformation. Nature Genetics, 2002. 32(4s): p. 496-501.
2. The Tumor Analysis Best Practices Working Group, Expression Profiling - Best Practices for Data Generation and Interpretation in Clinical Trials. Nature Reviews Genetics, 2004. 5(3): p. 229-237.
3. Dudoit, S., et al., Statistical Methods for Identifying Differentially Expressed Genes in Replicated cDNA Microarray Experiments. Statistica Sinica, 2002. 12(1): p. 111-139.
4. Hartemink, A.J., et al., Maximum Likelihood Estimation of Optimal Scaling Factors for Expression Array Normalization, in SPIE International Symposium on Biomedical Optics (BiOS01), M. Bittner, et al., Editors. 2001. p. 132-140.
5. Irizarry, R.A., et al., Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 2003. 4(2): p. 249-264.
6. Many more …
Questions
• …

Visualization
All steps can benefit from visualization.
Visualization
• Heat map
• PCA
• SVD
• MDS
• NMF (discussed in the clustering session)
Visualization of E: Heatmap
• Raw data (RMA, log2 scale): E = (e_ij), rows = genes, columns = samples.
[Figure: heatmap of the 500 most variable genes (clustered); samples grouped as ALL, MLL, AML]
Leukemia data from Armstrong et al. Nat. Genet. (2002)
AML/MLL/ALL Heatmap
• Row centered and normalized data: x_ij = (e_ij − μ_i) / σ_i, X = (x_ij).
[Figure: heatmap of the 500 most variable genes (clustered); ALL, MLL, AML]
Pro: clearly see differential expression. Con: lose the absolute values.
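The row transformation can be sketched as (a minimal numpy version; matrix names are illustrative):

```python
import numpy as np

def row_normalize(E):
    """x_ij = (e_ij - mu_i) / sigma_i: center and scale each gene (row)."""
    mu = E.mean(axis=1, keepdims=True)
    sigma = E.std(axis=1, keepdims=True)
    return (E - mu) / sigma

E = np.array([[1.0, 2.0, 3.0],
              [10.0, 40.0, 100.0]])
X = row_normalize(E)
```

Every row of X then has mean 0 and std 1, which is exactly why absolute expression levels are lost.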
AML/MLL/ALL 3D Heatmap
• Raw z-axis and color.
[Figure: 3-D surface over the ALL, MLL, AML samples]
AML/MLL/ALL 3D Heatmap
• Raw z-axis; row centered and normalized color.
[Figure: 3-D surface; z-axis raw, color normalized]
Dimensionality Reduction
• Our brain is a good pattern-recognition tool.
• Problem: we are used to handling only 2 (or 3) dimensions.
• Solution: dimensionality reduction.
Visualization of genes or samples
• Aim: given x_1:n ∈ R^d, find y_1:n ∈ R^k (typically k << d) that capture some properties of x_1:n.
• Methods:
  – Principal Component Analysis (PCA): projection onto a low-dimensional hyper-plane.
  – Singular Value Decomposition (SVD): approximate a matrix by a sum of simple matrices.
  – Multi-dimensional Scaling (MDS): mapping points such that distances are preserved.
  – Graph layout methods (e.g. springs and charges)
  – Independent Component Analysis (ICA)
  – Projection pursuit
  – …
• Caution: our brain also tends to find patterns in random data (over-fitting).
Principal Component Analysis (PCA)
• Aim: find a low k-dimensional hyper-plane on which the variation of the projected data is maximal.
• (Matrix multiplication reminder: C_{n×k} = A_{n×m} B_{m×k}, with c_ij = Σ_{α=1}^{m} a_iα b_αj.)
• Projection: y_i = V^T (x_i − μ), with μ = Σ_i x_i / n and V = (v_1 v_2 …).
• Objective: find V that maximizes the captured variance J(V) = σ_1² + σ_2².
[Figure: data points x_i with principal directions v_1, v_2 and variances σ_1², σ_2²]
• Equivalent to: find the low k-dimensional hyper-plane such that the projected data best approximate the original data.
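A minimal PCA sketch via the eigen-decomposition of the covariance matrix (toy data; not the algorithm used by any particular package):

```python
import numpy as np

def pca(X, k):
    """PCA: X is n samples x d features. Returns (Y, V): the projected data
    y_i = V^T (x_i - mu) and the top-k principal directions as columns of V."""
    mu = X.mean(axis=0)
    C = np.cov(X - mu, rowvar=False)
    evals, evecs = np.linalg.eigh(C)      # eigenvalues in ascending order
    V = evecs[:, ::-1][:, :k]             # top-k directions (largest variance)
    return (X - mu) @ V, V

# Samples spread mostly along the first coordinate.
X = np.array([[0.0, 0.00], [1.0, 0.10], [2.0, -0.10], [3.0, 0.00], [4.0, 0.05]])
Y, V = pca(X, k=1)
```

The first principal direction lines up with the axis of largest spread, and the projections are centered by construction.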
Incremental Building of the k Principal Components
• Algorithm (not the one actually used):
  – Loop i from 1 to k:
    • Find the direction v_i along which the variance σ_i² is maximal.
    • Remove from each point its projection on v_i.
  – ⇒ Principal components V = (v_1, …, v_k), captured variances {σ_i²}_{1:k}.
  – ⇒ The projected data: y_i = V^T (x_i − μ).
• The fraction of variance captured by the principal components, c_k, measures how well the projected data approximate the original data:
  s_k = Σ_{i=1,…,k} σ_i²,  c_k = s_k / s_d
[Figure: captured variance c_k vs. number of PCs k]
PCA of leukemia samples
• Input: genes × samples expression matrix (ALL, MLL, AML).
• Output: samples projected onto v_1, v_2, v_3.
[Figure: input heatmap and 3-D scatter of the projected samples]
Pitfalls of PCA
• Largest variance ≠ most informative (the two-pancakes example: the direction with largest variance is not the "interesting" direction that separates the two groups).
• Structure in the low-dimensional space ⇒ there is structure in the full space. But NOT the converse.
Singular Value Decomposition (SVD)
• Aim: find the best approximation of E by a sum of K rank-1 matrices (meta-sample ⊗ meta-gene):
  E ≈ s_1 v_1 u_1^T + s_2 v_2 u_2^T + …
[Figure: leukemia samples plotted on SVD components #1, #2, #3; ALL, MLL, and AML separate]
SVD of leukemia data
[Figure: the AML/ALL/MLL expression matrix approximated as s_1 v_1 u_1^T + s_2 v_2 u_2^T + s_3 v_3 u_3^T]
SVD of leukemia data
[Figure: the AML/ALL/MLL expression matrix written as the sum s_1 v_1 u_1^T + s_2 v_2 u_2^T + s_3 v_3 u_3^T, together with the singular value spectrum]
Singular Value Decomposition (SVD)
• Aim: find the best approximation of E by a sum of K rank-1 matrices (meta-sample ⊗ meta-gene):
  E ≈ Σ_{i=1:K} s_i v_i u_i^T, where {v_i}, {u_i} are orthogonal unit vectors.
• Objective function: J({s_i}, {v_i}, {u_i}) = Σ_ij (e_ij − Σ_{α=1:K} s_α v_iα u_jα)²
• Method: unique solution based on diagonalizing E E^T.
• Note: {v_i} are the same as in PCA of the samples if the genes are centered.
• Clustering: identify elements with large absolute value as members of clusters.
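A sketch of the rank-K approximation using numpy's SVD (numpy returns `U, s, Vh` with E = U·diag(s)·Vh; the columns of U play the role of the slides' v_i and the rows of Vh of u_i^T; the random test matrix is illustrative):

```python
import numpy as np

def rank_k_approx(E, K):
    """Best rank-K approximation E ~= sum_{i=1..K} s_i v_i u_i^T."""
    U, s, Vh = np.linalg.svd(E, full_matrices=False)
    return (U[:, :K] * s[:K]) @ Vh[:K, :]

rng = np.random.default_rng(3)
E = rng.normal(size=(6, 4))
err = [np.linalg.norm(E - rank_k_approx(E, K)) for K in range(1, 5)]
```

The approximation error shrinks as K grows and vanishes at full rank.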
Multi-dimensional Scaling (MDS)
• Aim: find a low k-dimensional representation of the data that best preserves the distance matrix of the original data.
• Distances: δ_ij = ||x_i − x_j|| in the original space, d_ij = ||y_i − y_j|| in the low-dimensional space.
• Objective: find y_1:n that minimize J(δ_ij, d_ij), where J measures how well d_ij approximates δ_ij.
• Method: gradient descent.
Objective functions
• Different ways to measure similarity between distance matrices:
  – Emphasize large differences:
    J_ee = Σ_{i<j} (δ_ij − d_ij)² / Σ_{i<j} δ_ij²
  – Emphasize fractional differences:
    J_ff = Σ_{i<j} ((δ_ij − d_ij) / δ_ij)²
  – …
Gradient Descent
• Aim: find the minimum of J(a).
• Method:
  – Init: a(0) ← a random position
  – Iterate: a(t+1) ← a(t) − η ∇J(a(t))
  – Stop: when ||a(t+1) − a(t)|| < ε or t > T
  where ∇J(a) = (∂J(x)/∂x_1, …, ∂J(x)/∂x_d)^T evaluated at x = a.
• Problem: finds a local minimum, depending on the starting point (according to its basin of attraction).
• See also: Newton's algorithm, conjugate gradient.
[Figure: gradient descent trajectory on a surface]
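The gradient-descent recipe applied to the un-normalized MDS objective J = Σ_{i<j} (δ_ij − d_ij)² can be sketched as follows (η, the iteration count, and the toy square data are illustrative):

```python
import numpy as np

def pairwise_dist(P):
    return np.linalg.norm(P[:, None] - P[None, :], axis=2)

def mds(delta, k=2, eta=0.02, iters=2000, seed=0):
    """Metric MDS by gradient descent on J = sum_{i<j} (delta_ij - d_ij)^2."""
    n = delta.shape[0]
    Y = np.random.default_rng(seed).normal(scale=0.1, size=(n, k))
    for _ in range(iters):
        diff = Y[:, None] - Y[None, :]
        d = pairwise_dist(Y)
        np.fill_diagonal(d, 1.0)     # avoid divide-by-zero (diff there is 0)
        # step: y_i <- y_i - eta * sum_j ((d_ij - delta_ij) / d_ij) (y_i - y_j)
        Y -= eta * (((d - delta) / d)[:, :, None] * diff).sum(axis=1)
    return Y

# Recover a 2-D embedding of four points at the corners of a unit square.
X = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
delta = pairwise_dist(X)
Y = mds(delta)
stress = ((delta - pairwise_dist(Y)) ** 2).sum() / 2   # sum over i < j
```

As the slides warn, the result depends on the random initial configuration; a different seed can land in a different local minimum.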
MDS for leukemia samples
• Input: the leukemia expression matrix.
• Used J_ee with Euclidean distance.
• Output: [Figure: low-dimensional MDS embedding of the samples]
MDS vs. PCA

|                                   | PCA                 | MDS                              |
| Optimal                           | Yes                 | Converges to local minima        |
| Preserves distances               | Only projected part | Yes (attempts to)                |
| Linear                            | Yes                 | Distorts space                   |
| Unique                            | Yes                 | Depends on initial configuration |
| Captures high-dimensional structure | Missing dimensions | Potentially better              |
References
1. Duda, Hart and Stork, Pattern Classification. Wiley & Sons, 2001.
2. Quackenbush, J., Microarray Data Normalization and Transformation. Nature Genetics, 2002. 32(4s): p. 496-501.
3. Allison, D.B., et al., Microarray data analysis: from disarray to consolidation and consensus. Nature Reviews Genetics, 2006. 7: p. 55-65.