dimension reduction


Dimension Reduction Examples:

1. DNA MICROARRAYS: Khan et al. (2001): 4 types of small round blue cell tumors (SRBCT):

- Neuroblastoma (NB)
- Rhabdomyosarcoma (RMS)
- Ewing family of tumors (EWS)
- Burkitt lymphomas (BL)

Arrays: Training set = 63 arrays (23 EWS, 20 RMS, 12 NB, 8 BL). Testing set = 25 arrays (6 EWS, 5 RMS, 6 NB, 3 BL, 5 other).

Genes: 2308 genes were selected because they showed a minimal level of expression.

2. PLASTIC EXPLOSIVES: The data come from a study on the detection of plastic explosives in suitcases using X-ray signals. The 23 variables are the discrete x-components of the X-ray absorption spectrum. The objective is to detect the suitcases that contain explosives. 2993 suitcases were used for training and 60 for testing (see web page for dataset).


Covariance Vs Correlation Matrix

1. Use the covariance or the correlation matrix? If the variables are not in the same units, use correlations.

2. Dim(S) = Dim(R) = p x p, so if p is large, dimension reduction is needed.

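$$
S = \begin{pmatrix}
s_1^2 & s_{12} & \cdots & s_{1p} \\
s_{21} & s_2^2 & \cdots & s_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
s_{p1} & s_{p2} & \cdots & s_p^2
\end{pmatrix},
\qquad
R = \begin{pmatrix}
1 & r_{12} & \cdots & r_{1p} \\
r_{21} & 1 & \cdots & r_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
r_{p1} & r_{p2} & \cdots & 1
\end{pmatrix},
\qquad
r_{ij} = \frac{s_{ij}}{s_i s_j}
$$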
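The relation between the covariance and correlation matrices can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper name `cov_to_corr` and the simulated data are illustrative and not part of the original material.

```python
import numpy as np

def cov_to_corr(S):
    """Convert a covariance matrix S into the correlation matrix R."""
    s = np.sqrt(np.diag(S))        # standard deviations s_i
    return S / np.outer(s, s)      # r_ij = s_ij / (s_i * s_j)

rng = np.random.default_rng(0)
# Hypothetical data: 50 observations of 3 variables in very different units
X = rng.normal(size=(50, 3)) * np.array([1.0, 100.0, 0.01])

S = np.cov(X, rowvar=False)        # p x p sample covariance matrix
R = cov_to_corr(S)                 # p x p sample correlation matrix

# R agrees with NumPy's own correlation routine and has a unit diagonal
assert np.allclose(R, np.corrcoef(X, rowvar=False))
assert np.allclose(np.diag(R), 1.0)
```

Because the variables here differ in scale by several orders of magnitude, the entries of S are dominated by the second variable, while R is unaffected by the choice of units.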


Sample Correlation Matrix (p-values in parentheses)

            Gene 141          Gene 187          Gene 246          Gene 509          Gene 1645         Gene 1955
Gene 141     1.0000            0.7983 (0.000)   -0.5058 (0.001)    0.7463 (0.000)   -0.4049 (0.007)    0.4676 (0.002)
Gene 187     0.7983 (0.000)    1.0000           -0.8111 (0.000)    0.9357 (0.000)   -0.6621 (0.000)    0.7891 (0.000)
Gene 246    -0.5058 (0.001)   -0.8111 (0.000)    1.0000           -0.7717 (0.000)    0.7624 (0.000)   -0.7977 (0.000)
Gene 509     0.7463 (0.000)    0.9357 (0.000)   -0.7717 (0.000)    1.0000           -0.6388 (0.000)    0.6827 (0.000)
Gene 1645   -0.4049 (0.007)   -0.6621 (0.000)    0.7624 (0.000)   -0.6388 (0.000)    1.0000           -0.8143 (0.000)
Gene 1955    0.4676 (0.002)    0.7891 (0.000)   -0.7977 (0.000)    0.6827 (0.000)   -0.8143 (0.000)    1.0000

[Figure: scatterplot matrix of genes 141, 187, 246, 509, 1645 and 1955.]


Principal Components Geometrical Intuition

- The data cloud is approximated by an ellipsoid.

- The axes of the ellipsoid represent the natural components of the data.

- The length of each semi-axis represents the variability of the corresponding component.

[Figure: data cloud in the (Variable X1, Variable X2) plane, with the axes Component 1 and Component 2 drawn through it.]


DIMENSION REDUCTION

- When some of the components show very small variability, they can be omitted.

- The graph shows that Component 2 has low variability, so it can be removed.

- The dimension is reduced from dim=2 to dim=1.

[Figure: data cloud in the (Variable X1, Variable X2) plane, elongated along Component 1 and narrow along Component 2.]
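The reduction from two dimensions to one can be illustrated with a small numerical example. This is only a sketch under stated assumptions: the 2-D cloud is simulated (long axis along the direction (1, 1), short axis along (1, -1)), and the components are taken as eigenvectors of the sample covariance matrix computed with NumPy.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical elongated 2-D cloud
t = rng.normal(scale=3.0, size=200)    # large variability (Component 1)
e = rng.normal(scale=0.3, size=200)    # small variability (Component 2)
X = np.column_stack([t + e, t - e])

mean = X.mean(axis=0)
Xc = X - mean                          # center the data
S = np.cov(Xc, rowvar=False)           # 2 x 2 sample covariance

eigvals, eigvecs = np.linalg.eigh(S)   # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]      # reorder: largest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

share = eigvals / eigvals.sum()        # fraction of variance per component
z1 = Xc @ eigvecs[:, 0]                # 1-D representation (dim=2 -> dim=1)

# Map the 1-D scores back to the plane to see what was kept
X_approx = np.outer(z1, eigvecs[:, 0]) + mean
print("variance share of each component:", np.round(share, 3))
```

In this simulated cloud almost all of the variance lies along the first component, so keeping only `z1` loses very little information.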


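Linear Algebra

Linear algebra is useful to write computations in a convenient way.

Singular Value Decomposition: $X = U D V'$, where $X$ is n x p, $U$ is n x p, $D$ is p x p and $V'$ is p x p.

X centered => $S = V D^2 V' / (n-1)$ (using the usual divisor n-1 for the sample covariance), where every factor is p x p.

Principal Components (PC): columns of $V$.

Eigenvalues (variances of the PCs): diagonal elements of $D^2 / (n-1)$.

Correlation Matrix: subtract from each column of X its mean, divide by its standard deviation, and calculate the covariance.

If p > n, then take the SVD of the transpose: $X' = U D V'$ and $S = U D^2 U' / (n-1)$, where $X'$ is p x n, $U$ is p x n, $D$ is n x n and $V'$ is n x n.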
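The link between the SVD of the centered data matrix and the covariance matrix can be verified directly. The sketch below assumes NumPy and a sample covariance with divisor n-1, so that S = V D^2 V' / (n-1); the data are simulated for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 5
# Hypothetical data with correlated columns
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))
Xc = X - X.mean(axis=0)                  # center each column

# Economy SVD: U is n x p, d holds the p singular values, Vt is V' (p x p)
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
V = Vt.T

S = np.cov(Xc, rowvar=False)             # sample covariance, divisor n-1

# S = V D^2 V' / (n-1)
assert np.allclose(S, V @ np.diag(d**2) @ Vt / (n - 1))

pc_var = d**2 / (n - 1)                  # variances of the PCs (eigenvalues of S)
scores = Xc @ V                          # PC scores, equal to U D
assert np.allclose(scores, U * d)
assert np.allclose(scores.var(axis=0, ddof=1), pc_var)
```

Working from the SVD avoids forming S explicitly, which matters when p is large.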


Principal components of 100 genes. PC2 vs PC1.

(a) Cells are the observations; genes are the variables.

(b) Genes are the observations; cells are the variables.
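The two views differ only in which dimension of the data matrix is treated as the observations. A minimal sketch, assuming NumPy, a simulated 63 x 100 expression matrix, and a hypothetical helper `pc_scores` that treats rows as observations:

```python
import numpy as np

def pc_scores(X, k=2):
    """Scores on the first k principal components; rows of X are observations."""
    Xc = X - X.mean(axis=0)                       # center each variable
    U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k] * d[:k]                       # first k columns of U D

rng = np.random.default_rng(3)
# Hypothetical expression matrix: 63 cells (rows) x 100 genes (columns)
X = rng.normal(size=(63, 100))

cell_scores = pc_scores(X)      # (a) cells are observations: one point per cell
gene_scores = pc_scores(X.T)    # (b) genes are observations: one point per gene

print(cell_scores.shape, gene_scores.shape)
```

Plotting the second column of each score matrix against the first gives the PC2 vs PC1 display for the corresponding view. Note that in view (a) there are more variables than observations, which the economy SVD handles without forming the 100 x 100 covariance matrix.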


Dimension reduction: Choosing the number of PCs

1. k components explain a chosen percentage of the variance, for example 70% or 80%.

2. k eigenvalues are greater than the average eigenvalue (the average is 1 for a correlation matrix).

3. Scree plot: graph the eigenvalues, look for the last sharp decline, and choose k as the number of points above the cutoff.

4. Test the null hypothesis that the last m eigenvalues are equal (0)

The same idea can be applied to factor analysis.
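The first two rules are simple to compute from a list of eigenvalues. The following sketch assumes NumPy; the function names `k_by_variance` and `k_by_average` and the example eigenvalues are illustrative.

```python
import numpy as np

def k_by_variance(eigvals, target=0.80):
    """Smallest k whose components explain at least `target` of the variance."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    cum = np.cumsum(lam) / lam.sum()          # cumulative proportion explained
    k = int(np.searchsorted(cum, target)) + 1
    return min(k, lam.size)                   # guard against rounding at target=1

def k_by_average(eigvals):
    """Number of eigenvalues greater than the average eigenvalue."""
    lam = np.asarray(eigvals, dtype=float)
    return int(np.sum(lam > lam.mean()))

# Hypothetical eigenvalues (sum 8, average 4/3)
lam = [4.0, 2.0, 1.0, 0.5, 0.3, 0.2]
print(k_by_variance(lam, 0.80))   # cumulative shares: 0.5, 0.75, 0.875, ...
print(k_by_average(lam))
```

The two rules need not agree: here the 80% rule keeps three components, while the average rule keeps two.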

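Test statistic for the hypothesis in item 4, where n is the number of observations and p the number of variables:

$$
u = \left(n - \frac{2p + 11}{6}\right)\left(m \log \bar{\lambda} - \sum_{i=p-m+1}^{p} \log \lambda_i\right),
\qquad
\bar{\lambda} = \frac{1}{m} \sum_{i=p-m+1}^{p} \lambda_i
$$

Under the null hypothesis, u is approximately chi-square distributed with (m-1)(m+2)/2 degrees of freedom.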
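A minimal sketch of the test in item 4, assuming the standard Bartlett-type statistic u = (n - (2p+11)/6)(m log(mean of the last m eigenvalues) - sum of the logs of the last m eigenvalues), compared with a chi-square distribution on (m-1)(m+2)/2 degrees of freedom. The function name and the example eigenvalues are hypothetical, and only NumPy is used, so the chi-square critical value itself must be looked up separately.

```python
import numpy as np

def equal_tail_statistic(eigvals, m, n):
    """Statistic u for H0: the last m eigenvalues are equal.

    eigvals: all p eigenvalues; n: number of observations.
    Returns (u, df), where df is the chi-square degrees of freedom.
    """
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    p = lam.size
    tail = lam[p - m:]                               # the m smallest eigenvalues
    u = (n - (2 * p + 11) / 6) * (m * np.log(tail.mean()) - np.log(tail).sum())
    df = (m - 1) * (m + 2) // 2
    return float(u), df

# Hypothetical eigenvalues; the last three are exactly equal, so u is zero
u_equal, df = equal_tail_statistic([5.0, 2.0, 1.0, 1.0, 1.0], m=3, n=50)

# If the last four eigenvalues are tested instead, they are not all equal and u > 0
u_unequal, df4 = equal_tail_statistic([5.0, 2.0, 1.0, 1.0, 1.0], m=4, n=50)
print(u_equal, df, u_unequal, df4)
```

Large values of u relative to the chi-square critical value are evidence against equality of the last m eigenvalues, which suggests retaining more components.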


1. The top 5 eigenvalues explain 81% of the variability.

2. Five eigenvalues are greater than the average (2.5%).

3. Scree plot:

[Figure: scree plot of the eigenvalues, with the average marked.]

4. The test statistic u is significant for p-m = 6 and highly significant for p-m = 2.

p-m    24   20   15    9    8    7    6    5    4    3     2     1
u     0.1    5   32  146  182  222  279  340  425  554  1632  3260
χ²    9.2   37   94  195  215  237  259  282  307  332   358   386


Biplots

Graphical display of X in which two sets of markers are plotted. One set of markers, a1,…,aG, represents the rows of X. The other set of markers, b1,…,bp, represents the columns of X.

For example: $X = U D V'$ is approximated by $X_2 = U_2 D_2 V_2'$, built from the first two components.

$A = U_2 D_2^{a}$ and $B = V_2 D_2^{b}$, with $a + b = 1$, so $X_2 = A B'$.

The biplot is the graph of A and B together in the same graph.
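The two sets of biplot markers can be computed from the SVD. This is a sketch under stated assumptions: NumPy is used, X is a simulated matrix that is centered by column before the SVD, and the split a = b = 0.5 is one arbitrary choice among those satisfying a + b = 1.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(20, 6))               # hypothetical data matrix
Xc = X - X.mean(axis=0)                    # assumption: columns are centered

U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
U2, d2, V2 = U[:, :2], d[:2], Vt[:2].T     # pieces for the first two components

a = 0.5                                    # any split with a + b = 1 works
b = 1.0 - a
A = U2 * d2**a                             # row markers, one per row of X
B = V2 * d2**b                             # column markers, one per column of X

X2 = U2 @ np.diag(d2) @ V2.T               # rank-2 approximation of Xc
assert np.allclose(A @ B.T, X2)
```

Plotting the rows of `A` as points and the rows of `B` as arrows in the same axes gives the biplot. Choosing a = 1 puts all of the scaling on the row markers, and a = 0 puts it on the column markers.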


[Figure: biplot of the first two principal components (axes PC1 and PC2). Samples are labeled by tumor type (EW, BL, NB, RM) and variables are labeled V1 to V100.]

[Figure: biplot of the first two principal components (axes Comp.1 and Comp.2). Samples are labeled by tumor type (EW, BL, NB, RM) and genes are labeled by gene number.]
