dimension reduction
Dimension Reduction Examples:
1. DNA MICROARRAYS: Khan et al. (2001) studied 4 types of small round blue cell tumors (SRBCT):
- Neuroblastoma (NB)
- Rhabdomyosarcoma (RMS)
- Ewing family of tumors (EWS)
- Burkitt lymphomas (BL)
Arrays: Training set = 63 arrays (23 EWS, 20 RMS, 12 NB, 8 BL). Testing set = 25 arrays (6 EWS, 5 RMS, 6 NB, 3 BL, 5 other).
Genes: 2308 genes were selected because they met a minimal expression level.
2. PLASTIC EXPLOSIVES: The data come from a study on the detection of plastic explosives in suitcases using X-ray signals. The 23 variables are the discrete x-components of the X-ray absorption spectrum. The objective is to detect the suitcases that contain explosives. 2993 suitcases were used for training and 60 for testing (see web page for dataset).
Covariance vs. Correlation Matrix
1. Use the covariance or the correlation matrix? If the variables are not measured in the same units, use correlations.
2. Dim(S) = Dim(R) = p x p, so if p is large, dimension reduction is needed.
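\[
S = \begin{pmatrix}
s_1^2 & s_{12} & \cdots & s_{1p} \\
s_{21} & s_2^2 & \cdots & s_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
s_{p1} & s_{p2} & \cdots & s_p^2
\end{pmatrix},
\qquad
R = \begin{pmatrix}
1 & r_{12} & \cdots & r_{1p} \\
r_{21} & 1 & \cdots & r_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
r_{p1} & r_{p2} & \cdots & 1
\end{pmatrix},
\qquad
r_{ij} = \frac{s_{ij}}{s_i s_j}
\]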
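The relation r_ij = s_ij / (s_i s_j) between the covariance matrix S and the correlation matrix R can be sketched in NumPy as follows. This is a minimal illustration on simulated data, not the data sets from the slides; the function name `covariance_and_correlation` and the scale factors are assumptions chosen for the example. It also shows why correlations are preferred when units differ: rescaling a variable changes S but leaves R unchanged.

```python
import numpy as np

def covariance_and_correlation(X):
    """Return (S, R) for an n x p data matrix X whose rows are observations."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    Xc = X - X.mean(axis=0)       # center each variable (column)
    S = Xc.T @ Xc / (n - 1)       # p x p sample covariance matrix
    s = np.sqrt(np.diag(S))       # standard deviations s_1, ..., s_p
    R = S / np.outer(s, s)        # r_ij = s_ij / (s_i * s_j)
    return S, R

rng = np.random.default_rng(0)
# Hypothetical data: 3 variables, the third measured on a much larger scale.
X = rng.normal(size=(50, 3)) * np.array([1.0, 1.0, 1000.0])
S, R = covariance_and_correlation(X)

# Change the units of the third variable and recompute.
S_rescaled, R_rescaled = covariance_and_correlation(X * np.array([1.0, 1.0, 0.001]))

print(np.round(R, 3))
print("R unchanged by rescaling:", np.allclose(R, R_rescaled))
print("S unchanged by rescaling:", np.allclose(S, S_rescaled))
```

Because the covariance matrix is dominated by whichever variable has the largest units, components computed from S would mostly track that one variable; components computed from R treat all variables on an equal footing.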
Sample Correlation Matrix

            Gene 141         Gene 187         Gene 246         Gene 509         Gene 1645        Gene 1955
Gene 141    1.0000           0.7983 (0.000)  -0.5058 (0.001)   0.7463 (0.000)  -0.4049 (0.007)   0.4676 (0.002)
Gene 187    0.7983 (0.000)   1.0000          -0.8111 (0.000)   0.9357 (0.000)  -0.6621 (0.000)   0.7891 (0.000)
Gene 246   -0.5058 (0.001)  -0.8111 (0.000)   1.0000          -0.7717 (0.000)   0.7624 (0.000)  -0.7977 (0.000)
Gene 509    0.7463 (0.000)   0.9357 (0.000)  -0.7717 (0.000)   1.0000          -0.6388 (0.000)   0.6827 (0.000)
Gene 1645  -0.4049 (0.007)  -0.6621 (0.000)   0.7624 (0.000)  -0.6388 (0.000)   1.0000          -0.8143 (0.000)
Gene 1955   0.4676 (0.002)   0.7891 (0.000)  -0.7977 (0.000)   0.6827 (0.000)  -0.8143 (0.000)   1.0000
Scatterplot Matrix
[Figure: scatterplot matrix of Genes 141, 187, 246, 509, 1645 and 1955.]
Principal Components: Geometrical Intuition
- The data cloud is approximated by an ellipsoid.
- The axes of the ellipsoid represent the natural components of the data.
- The length of each semi-axis represents the variability of the corresponding component.
[Figure: data cloud in the (Variable X1, Variable X2) plane, with the axes Component 1 and Component 2.]
DIMENSION REDUCTION
- When some of the components show very small variability, they can be omitted.
- The graph shows that Component 2 has low variability, so it can be removed.
- The dimension is reduced from dim = 2 to dim = 1.
[Figure: data cloud in the (Variable X1, Variable X2) plane, with the axes Component 1 and Component 2.]
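The reduction from two dimensions to one can be sketched numerically. The example below uses simulated data in which X2 is almost a linear function of X1; the coefficients 0.8 and 0.1 and the sample size are arbitrary assumptions made for illustration. The eigenvectors of the covariance matrix give the directions of the ellipse axes, and the eigenvalues give the variance along each axis (the semi-axis lengths are proportional to their square roots).

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical 2-D data: X2 is almost a linear function of X1.
x1 = rng.normal(size=200)
x2 = 0.8 * x1 + 0.1 * rng.normal(size=200)
X = np.column_stack([x1, x2])
Xc = X - X.mean(axis=0)                 # center the data cloud

S = np.cov(Xc, rowvar=False)            # 2 x 2 sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)    # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]       # reorder: largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

share = eigvals / eigvals.sum()         # fraction of variance per component
scores1 = Xc @ eigvecs[:, 0]            # 1-D representation (Component 1)
X_approx = np.outer(scores1, eigvecs[:, 0])   # back in centered 2-D coordinates

print("variance share per component:", np.round(share, 4))
print("variance lost by dropping Component 2:", round(float(eigvals[1]), 4))
```

Dropping Component 2 discards exactly the variance along the short axis of the ellipse, which is the second eigenvalue.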
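Linear Algebra
Linear algebra is useful for writing the computations in a convenient way.
Singular Value Decomposition: X = U D V', with dimensions X: n x p, U: n x p, D: p x p, V: p x p.
If X is centered, then S = V D^2 V' / (n - 1), where S, V and D^2 are all p x p.
Principal components (PC): columns of V (the component directions; the scores are X V = U D).
Eigenvalues (variances of the PCs): diagonal elements of D^2 / (n - 1).
Correlation matrix: subtract from each column of X its mean, divide by its standard deviation, and calculate the covariance of the standardized data.
If p > n, use the SVD X' = U D V', with dimensions X': p x n, U: p x n, D: n x n, V: n x n, so that S = U D^2 U' / (n - 1).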
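The SVD route to the principal components can be checked numerically. The sketch below uses simulated matrices (the sizes 30 x 5 and 8 x 40 are arbitrary assumptions) and `numpy.linalg.svd`; it verifies that the covariance matrix of the centered data equals V D^2 V' divided by n - 1, and that for p > n the same matrix is recovered from the SVD of X'.

```python
import numpy as np

rng = np.random.default_rng(2)

# Case n > p: hypothetical data with correlated columns.
n, p = 30, 5
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))
Xc = X - X.mean(axis=0)                              # center each column

U, d, Vt = np.linalg.svd(Xc, full_matrices=False)    # Xc = U diag(d) V'
V = Vt.T                                             # columns = PC directions
S = np.cov(Xc, rowvar=False)                         # p x p covariance matrix
S_from_svd = (V * d**2) @ V.T / (n - 1)              # V D^2 V' / (n - 1)
pc_var = d**2 / (n - 1)                              # variances of the PCs
scores = Xc @ V                                      # PC scores, equal to U D

# Case p > n: decompose the transpose, so U is p x n.
n2, p2 = 8, 40
Y = rng.normal(size=(n2, p2))
Yc = Y - Y.mean(axis=0)
U2, d2, V2t = np.linalg.svd(Yc.T, full_matrices=False)   # Yc' = U2 diag(d2) V2'
S2_from_svd = (U2 * d2**2) @ U2.T / (n2 - 1)             # U D^2 U' / (n - 1)

print("S matches V D^2 V'/(n-1):", np.allclose(S, S_from_svd))
print("PC variances:", np.round(pc_var, 3))
print("U2 shape when p > n:", U2.shape)
```

Working with the SVD of the data avoids forming the p x p matrix explicitly, which matters when p is in the thousands, as with the microarray data.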
[Figure: principal components of 100 genes, PC2 vs. PC1, in two panels.]
(a) Cells are the observations; genes are the variables.
(b) Genes are the observations; cells are the variables.
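Dimension reduction: Choosing the number of PCs
1. Choose k so that the first k components explain a set percentage of the variance, e.g. 70% or 80%.
2. Choose k as the number of eigenvalues greater than the average eigenvalue (which is 1 for a correlation matrix).
3. Scree plot: graph the eigenvalues, look for the last sharp decline, and choose k as the number of points above the cutoff.
4. Test the null hypothesis that the last m eigenvalues are equal (0), using the statistic

\[
u = \left(n - \frac{2p + 11}{6}\right)\left(m \log \bar{\lambda} - \sum_{i=p-m+1}^{p} \log \lambda_i\right),
\qquad
\bar{\lambda} = \frac{1}{m} \sum_{i=p-m+1}^{p} \lambda_i
\]

where n is the number of observations, p is the number of variables, and \lambda_1 \ge \dots \ge \lambda_p are the eigenvalues.
The same ideas can be applied to factor analysis.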
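Criteria 1, 2 and 4 above can be sketched as small functions. The function names and the example eigenvalues are hypothetical, chosen only for illustration. The degrees of freedom returned for the test, (m - 1)(m + 2) / 2, are those of the chi-square distribution commonly used as a large-sample reference for this statistic; that reference distribution is not stated in the slide and is an assumption here.

```python
import math
import numpy as np

def choose_k_by_variance(eigvals, target=0.80):
    """Smallest k whose leading eigenvalues explain at least `target` of the variance."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    cum = np.cumsum(lam) / lam.sum()
    return int(np.argmax(cum >= target) + 1)

def choose_k_above_average(eigvals):
    """Number of eigenvalues strictly greater than the average eigenvalue."""
    lam = np.asarray(eigvals, dtype=float)
    return int(np.sum(lam > lam.mean()))

def equal_tail_statistic(eigvals, m, n):
    """u = (n - (2p + 11)/6) * (m * log(mean of last m) - sum of logs of last m)."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    p = lam.size
    tail = lam[p - m:]                       # the m smallest eigenvalues
    u = (n - (2 * p + 11) / 6) * (m * math.log(tail.mean()) - np.log(tail).sum())
    df = (m - 1) * (m + 2) // 2              # assumed chi-square degrees of freedom
    return float(u), df

# Hypothetical eigenvalues (p = 8) from n = 63 observations.
eigvals = [5.0, 2.0, 1.0, 0.5, 0.5, 0.5, 0.25, 0.25]
print("k for 75% of variance:", choose_k_by_variance(eigvals, target=0.75))
print("k above average:", choose_k_above_average(eigvals))
print("u, df for last 3 eigenvalues:", equal_tail_statistic(eigvals, m=3, n=63))
```

The statistic u is zero when the last m eigenvalues are exactly equal and grows as they spread out, so large values of u are evidence against the null hypothesis.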
1. The top 5 eigenvalues explain 81% of the variability.
2. Five eigenvalues are greater than the average (2.5%).
3. Scree plot:
[Figure: scree plot of the eigenvalues (horizontal axis 1 to 42, vertical axis 0 to 50), with the average marked.]
4. The test statistic is significant for p - m = 6 and highly significant for p - m = 2.

p - m    24    20    15     9     8     7     6     5     4     3     2     1
u       0.1     5    32   146   182   222   279   340   425   554  1632  3260
chi^2   9.2    37    94   195   215   237   259   282   307   332   358   386
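The comparison behind item 4 can be reproduced from the tabulated values. The sketch below copies the three rows of the slide's table and lists the values of p - m for which u exceeds the tabulated reference value. It assumes that the third row holds chi-square critical values for the test; the slide does not label that row explicitly.

```python
# Values copied from the slide's table; the interpretation of the third row
# as chi-square critical values is an assumption.
p_minus_m = [24, 20, 15, 9, 8, 7, 6, 5, 4, 3, 2, 1]
u = [0.1, 5, 32, 146, 182, 222, 279, 340, 425, 554, 1632, 3260]
chi2 = [9.2, 37, 94, 195, 215, 237, 259, 282, 307, 332, 358, 386]

# Keep the entries where the test statistic exceeds the reference value.
significant = [k for k, stat, crit in zip(p_minus_m, u, chi2) if stat > crit]
ratios = {k: round(stat / crit, 2) for k, stat, crit in zip(p_minus_m, u, chi2)}

print("significant for p - m =", significant)
print("u / chi^2 by p - m:", ratios)
```

Under this reading, the statistic first exceeds the reference value at p - m = 6 and exceeds it by a wide margin at p - m = 2, which agrees with item 4.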
Biplots
A biplot is a graphical display of X in which two sets of markers are plotted. One set of markers, a1, ..., aG, represents the rows of X. The other set of markers, b1, ..., bp, represents the columns of X.
For example: X = U D V' is approximated by X2 = U2 D2 V2', which keeps the first two singular values and vectors.
Let A = U2 D2^a and B = V2 D2^b, with a + b = 1, so that X2 = A B'.
The biplot is the graph of A and B together in the same plot.
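The construction of the two sets of markers can be sketched as follows. The function name `biplot_markers` and the simulated 20 x 6 matrix are assumptions for illustration, and centering the columns of X before the SVD is also an assumption (the slide does not say whether X is centered here). The sketch only computes the coordinates A and B and checks that A B' reproduces the rank-2 approximation X2 for any split a + b = 1; plotting is left out.

```python
import numpy as np

def biplot_markers(X, a=0.5):
    """Return (A, B, X2): row markers, column markers and the rank-2 approximation."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                    # assumption: center the columns
    U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
    U2, d2, V2 = U[:, :2], d[:2], Vt[:2].T     # first two singular triplets
    b = 1.0 - a                                # exponents satisfy a + b = 1
    A = U2 * d2**a                             # A = U2 D2^a (one row per row of X)
    B = V2 * d2**b                             # B = V2 D2^b (one row per column of X)
    X2 = (U2 * d2) @ V2.T                      # X2 = U2 D2 V2'
    return A, B, X2

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 6))                   # hypothetical data matrix
for a in (0.0, 0.5, 1.0):
    A, B, X2 = biplot_markers(X, a=a)
    print(f"a = {a}: A {A.shape}, B {B.shape}, A B' == X2:", np.allclose(A @ B.T, X2))
```

The choice of a only moves the singular values between the row markers and the column markers: a = 1 puts them all on the rows and a = 0 puts them all on the columns, while the product A B' stays the same.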
Biplot of the first two principal components.
[Figure, panel 1: biplot on axes PC1 and PC2; points are labeled by tumor type (EW, BL, NB, RM) and by variable (V1 to V100).]
[Figure, panel 2: biplot on axes Comp.1 and Comp.2; points are labeled by gene number (for example 1955, 246, 187, 509, 1645) and by tumor type (EW, BL, NB, RM).]