Biology-Driven Clustering of Microarray Data

Applications to the NCI60 Data Set

K.R. Coombes, K.A. Baggerly, D.N. Stivers, J. Wang, D. Gold, H.G. Sung, and S.J. Lee


Post on 17-Jan-2016

Introduction

• Microarray data is more than a large, unstructured matrix.
– We already know many genes important for studying cancer through their involvement in specific biological processes
– We also know that reproducible chromosomal abnormalities play an important role in cancer
• Need analytical methods that use biological information early

Methods

• First, updated the annotations of the genes on the microarray
• Performed separate analyses
– using genes on individual chromosomes
– using genes involved in different biological processes
• Developed ways to assess how well each set of genes classified samples

Quality of Annotations

• Problem:
– I.M.A.G.E. clone IDs and GenBank accession numbers are archival
– UniGene clusters, gene names, descriptions, functions, etc., are changeable
• Solution:
– Download latest UniGene (build 137) and LocusLink to update annotations

How many genes on the array have good annotations?

Number of Spots | Current UniGene Status
294 | None (control spots)
128 | Only 3’ – unknown to UniGene
1379 | Only 3’ – known to UniGene
1 | Only 5’ – unknown
6 | Only 5’ – known
399 | Both – unknown
763 | Both – 3’ known, 5’ unknown
291 | Both – 3’ unknown, 5’ known
646 | Both known, but disagree
6093 | Both known, and agree

Only trust the 7478 spots where the UniGene clusters match.
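The figure of 7478 can be checked against the table: it equals the spots with a single annotated end known to UniGene (1379 + 6) plus the spots where both ends are known and agree (6093). The short Python tally below illustrates this; the dictionary keys and the choice of which rows count as trusted are assumptions inferred from the arithmetic, not something the slides spell out.

```python
# Spot counts copied from the table above, keyed by UniGene status.
spot_counts = {
    "none (control spots)": 294,
    "only 3', unknown": 128,
    "only 3', known": 1379,
    "only 5', unknown": 1,
    "only 5', known": 6,
    "both, unknown": 399,
    "both, 3' known, 5' unknown": 763,
    "both, 3' unknown, 5' known": 291,
    "both known, disagree": 646,
    "both known, agree": 6093,
}

# Assumption: a spot is trusted when every sequenced end is known to
# UniGene and, if there are two ends, they point to the same cluster.
trusted_rows = ["only 3', known", "only 5', known", "both known, agree"]

total_spots = sum(spot_counts.values())
trusted_spots = sum(spot_counts[row] for row in trusted_rows)

print(total_spots)    # 10000
print(trusted_spots)  # 7478
```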

Where are the genes located?

[Figure: (Observed - Expected) / SD plotted by chromosome (1 to 22, X, Y); chi^2 = 148.8, p < 10^(-10)]
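One way such a plot and statistic could be produced is sketched below in Python. The slides do not say how the expected counts or the SD were defined, so this sketch assumes expected counts proportional to a reference gene count per chromosome and a binomial SD. The function name and the counts in the example are invented purely to exercise the calculation.

```python
import math

def standardized_residuals(observed, reference):
    """Return per-chromosome (Observed - Expected) / SD and the chi-square statistic.

    observed:  dict chromosome -> number of array genes on that chromosome
    reference: dict chromosome -> number of known genes there (assumed baseline)
    """
    n = sum(observed.values())
    ref_total = sum(reference.values())
    residuals = {}
    chi_square = 0.0
    for chrom, obs in observed.items():
        p = reference[chrom] / ref_total       # expected share of genes
        expected = n * p
        sd = math.sqrt(n * p * (1.0 - p))      # binomial SD (assumption)
        residuals[chrom] = (obs - expected) / sd
        chi_square += (obs - expected) ** 2 / expected
    return residuals, chi_square

# Hypothetical counts for three chromosomes, not the NCI60 array values.
observed = {"1": 120, "2": 60, "X": 20}
reference = {"1": 1000, "2": 800, "X": 200}
residuals, chi_square = standardized_residuals(observed, reference)
print(residuals)
print(chi_square)
```

A positive residual marks a chromosome that is over-represented on the array relative to the assumed baseline, and a negative one marks under-representation.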

How do we determine the functions of genes?

• UniGene -> LocusLink -> GeneOntology

• GeneOntology is a structured, hierarchical vocabulary to describe gene functions in three broad areas:
– biological process (why)
– molecular function (what)
– cellular component (where)
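The UniGene -> LocusLink -> GeneOntology chain amounts to two table lookups. The Python sketch below uses tiny in-memory dictionaries with invented identifiers; the real UniGene, LocusLink, and GeneOntology downloads have their own file formats, which are not shown here.

```python
# All identifiers below are hypothetical placeholders for illustration.
unigene_to_locus = {"Hs.0001": "L100", "Hs.0002": "L200"}
locus_to_go = {
    "L100": {
        "biological process": ["cell cycle"],
        "molecular function": ["kinase activity"],
        "cellular component": ["nucleus"],
    },
}

def annotate(unigene_id):
    """Follow UniGene -> LocusLink -> GeneOntology; return None if a link is missing."""
    locus_id = unigene_to_locus.get(unigene_id)
    if locus_id is None:
        return None
    return locus_to_go.get(locus_id)

print(annotate("Hs.0001")["biological process"])  # ['cell cycle']
print(annotate("Hs.0002"))                        # None (no GO entry for L200)
```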

What kinds of genes are on the microarray?

Function | Ann. | Spots
Oncogenesis | 140 | 180
Apoptosis | 128 | 138
Physiological proc. | 180 | 210
Perc. of ext. stimuli | 238 | 150
Ectoderm devel. | 129 | 152
Mesoderm devel. | 92 | 102
Cell adhesion | 111 | 140
Cell-cell signaling | 137 | 166
Signal transduction | 222 | 228
Intracell sig cascade | 110 | 110
Cell motility | 120 | 153
Cell organization | 98 | 118
Cell shape and size | 78 | 101
Protein traffic | 157 | 188
Transport | 146 | 136
Cell proliferation | 197 | 249
Stress response | 599 | 372
Radiation response | 147 | 136
Cell cycle | 494 | 283
Nucleic acid met. | 695 | 595
Protein metabolism | 471 | 567
Lipid metabolism | 146 | 156
Carbohydrate met. | 103 | 97
Energy pathways | 88 | 98

Data Preprocessing

• Remove spots with poor annotations and spots with median intensity below the 97th percentile of empty spots.

• Normalize each array so the median log ratio between channels is zero (equivalently, the median ratio is one)

• Center each gene so mean log ratio across experiments is zero

• Use (1-correlation)/2 as distance metric
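The normalization, centering, and distance steps can be sketched in plain Python as below. This is a minimal illustration, not the authors' code. It assumes the data are already log ratios with genes as rows and arrays as columns, shifts each array so its median log ratio is zero (a median ratio of one), and assumes the distance is taken between samples. The function names and the numbers are invented for the example.

```python
import math
from statistics import mean, median

def normalize_arrays(log_ratios):
    """Shift each array (column) so its median log ratio is zero."""
    n_arrays = len(log_ratios[0])
    medians = [median(row[j] for row in log_ratios) for j in range(n_arrays)]
    return [[row[j] - medians[j] for j in range(n_arrays)] for row in log_ratios]

def center_genes(log_ratios):
    """Shift each gene (row) so its mean log ratio across arrays is zero."""
    centered = []
    for row in log_ratios:
        row_mean = mean(row)
        centered.append([x - row_mean for x in row])
    return centered

def correlation_distance(x, y):
    """(1 - Pearson correlation) / 2, which lies between 0 and 1."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return (1.0 - sxy / math.sqrt(sxx * syy)) / 2.0

# Hypothetical log ratios: rows are genes, columns are arrays (samples).
data = [[1.0, 2.0, 0.5],
        [0.0, 3.0, 1.5],
        [2.0, 1.0, 2.5]]
processed = center_genes(normalize_arrays(data))

# Distance between the first two samples, using their gene profiles.
sample_0 = [row[0] for row in processed]
sample_1 = [row[1] for row in processed]
print(correlation_distance(sample_0, sample_1))
```

Dividing by two maps perfectly correlated profiles to distance 0 and perfectly anti-correlated profiles to distance 1.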

How well does a set of genes distinguish types of cancer?

• Three methods for assessment:
– Qualitative (PCA, MDS)
– Quantitative (PCA + ANOVA)
– Semi-quantitative (Grading Dendrograms)
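For the quantitative assessment, one plausible reading is that sample scores on a principal component are tested for differences between cancer types with a one-way ANOVA. The Python sketch below shows only that ANOVA step, on made-up component scores. The PCA step and the exact test used in the analysis are not described on the slides, so the function name and data here are assumptions.

```python
from statistics import mean

def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA; `groups` maps a label to a list of scores."""
    all_scores = [s for scores in groups.values() for s in scores]
    grand_mean = mean(all_scores)
    k, n = len(groups), len(all_scores)
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    # Variation of individual scores around their own group mean.
    ss_within = sum((s - mean(g)) ** 2 for g in groups.values() for s in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical scores on one principal component, grouped by cancer type.
scores = {
    "colon": [1.0, 2.0, 3.0],
    "leukemia": [4.0, 5.0, 6.0],
    "renal": [7.0, 8.0, 9.0],
}
print(one_way_anova_f(scores))  # 27.0
```

A large F statistic indicates that the component separates the cancer types well relative to the spread within each type.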

Multidimensional Scaling

[Figure: multidimensional scaling plot of the samples, coordinate 1 versus coordinate 2, with each sample marked by a one-letter cancer code (B, C, L, M, N, O, P, R, S)]

PCANOVA

How good is a dendrogram?

• A = cluster contains all and only one kind of cancer
• B = all, with extras
• C = all except one
• D = all except one, with extras
• E = all except two
• F = all except two, with extras
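The grading rule can be written as a small function. The Python sketch below grades one candidate cluster for one cancer type from two quantities: how many samples of that type are missing from the cluster, and whether any samples of other types are present. How the candidate cluster is picked from the dendrogram, and what happens when more than two samples are missing, is not stated on the slide; returning None in that case is an assumption, as are the function name and the sample labels.

```python
def grade_cluster(cluster, labels, cancer):
    """Grade how well `cluster` (a set of sample names) captures `cancer`.

    labels maps every sample name to its cancer type.
    """
    members = {s for s, kind in labels.items() if kind == cancer}
    missing = len(members - cluster)
    has_extras = any(labels[s] != cancer for s in cluster)
    grades = {(0, False): "A", (0, True): "B",
              (1, False): "C", (1, True): "D",
              (2, False): "E", (2, True): "F"}
    # Assumption: anything worse than "all except two" gets no grade.
    return grades.get((missing, has_extras))

# Hypothetical labels for a handful of samples.
labels = {"colon.ht29": "colon", "colon.km12": "colon", "colon.hct15": "colon",
          "renal.achn": "renal", "renal.tk10": "renal"}
print(grade_cluster({"colon.ht29", "colon.km12", "colon.hct15"}, labels, "colon"))  # A
print(grade_cluster({"colon.ht29", "colon.km12", "renal.achn"}, labels, "colon"))   # D
```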

[Figure: dendrogram of the NCI60 cell lines, with leaves labeled by cancer type and cell line (for example breast.mcf7, colon.ht29, leukemia.k562); height axis from 0.0 to 0.6]

Cancer B C L M N O P R S

Score A A D F D C B

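Can cancers be distinguished by genes on one chromosome?

Grades per chromosome (ch). The cancer columns are B C L M N O P R S; empty cells are omitted, so the grades are listed in column order without their column positions.

ch 1: B A D F D B
ch 2: E C D D E D E
ch 3: C E D E F
ch 4: E E E E
ch 5: A A D F E
ch 6: C A D E E D
ch 7: E A D E C E
ch 8: E C D
ch 9: B C C E E E
ch 10: D E
ch 11: E C C D
ch 12: B C C E E E
ch 13: D E
ch 14: A A F
ch 15: C B C F C
ch 16: (none)
ch 17: A A D F E E
ch 18: E D
ch 19: D D
ch 20: E C
ch 21: (none)
ch 22: A E E
ch X: B A D E D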

Heterogeneity of different types of cancer

• Some cancers (colon, leukemia) are fairly easy to distinguish from others

• Some (breast, lung) are so heterogeneous as to be almost impossible to distinguish

• Some chromosomes (1, 2, 6, 7, 9, 12, 17) can distinguish many cancers.

• Some (16, 21) are essentially random

[Figure: two dendrograms of the NCI60 cell lines, one using genes on Chromosome 2 and one using genes on Chromosome 16]

Can cancers be distinguished by genes of one function?

• Table for functional categories looks a lot like the table for chromosomes
• Some biological process categories (signal transduction, cell proliferation, cell cycle, protein metabolism) can distinguish many types of cancer
• Others (apoptosis, energy pathways) cannot

[Figure: four dendrograms of the NCI60 cell lines, one per functional category: cell surface receptor linked signal transduction; protein metabolism and modification; death (apoptosis); energy pathways]

Conclusions (I)

• Multiple views into the data provide substantial insight into differences in cancer types and gene sets.

• Cancer types differ greatly in their degree of heterogeneity, ranging from homogeneous (colon, leukemia) through moderately heterogeneous (renal, melanoma) to extremely heterogeneous (breast and lung).

Conclusions (II)

• Homogeneous cancers exhibit strong identifying signals across most views of the data.

• There are large differences in the ability of genes on different chromosomes or involved in different biological processes to distinguish cancer types.

Supplementary Material

Complete results of each analysis by chromosome and by function are available on our web site: http://www.mdanderson.org/depts/cancergenomics