ECML PKDD Workshop Challenge: Mining and exploiting interpretable local patterns Natalja Friesen,


Post on 18-Dec-2015


ECML PKDD Workshop Challenge:

Mining and exploiting interpretable local patterns

Natalja Friesen,


Intelligente Analyse- und Informationssysteme

Motivation for the challenge dataset

Gene expression analysis:

1. Finding a set of genes associated with a certain disease

2. Describing the discovered genes according to their functional role

Question: How can gene names be translated into understandable biological knowledge automatically?

Not a pure data mining problem – no prediction model is required

Understandability of the results is the key success factor

Goal of the challenge: investigation of typical key requirements for the usage of local pattern mining


Dataset Description

The original dataset: a study of the responses of cancer cells to ionizing radiation, Amundson et al. (2008)*

p53 status is a major determinant of gene expression responses to ionizing radiation

p53 is a well-known tumor suppressor protein involved in prevention of cancer

60 cell lines representing nine tumor types: breast, central nervous system, colon, leukemia, lung, melanoma, ovarian, prostate, and renal

We employed Student's t-test to identify the genes that are differentially expressed (p-value < 0.05) according to the p53 status

* Sally A. Amundson, Khanh T. Do, Lisa C. Vinikoor, R. Anthony Lee, Christine A. Koch-Paiz, Jaeyong Ahn, Mark Reimers, Yidong Chen, Dominic A. Scudiero, John N. Weinstein, Jeffrey M. Trent, Michael L. Bittner, Paul S. Meltzer, and Albert J. Fornace. Integrating Global Gene Expression and Radiation Survival Parameters across the 60 Cell Lines of the National Cancer Institute Anticancer Drug Screen. Cancer Res, 68(2):415–424, January 2008.
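
The gene selection step described above can be sketched as follows. This is only an illustration under stated assumptions: the gene names and expression values are hypothetical, the pooled-variance (equal variance) form of Student's t-test is assumed, and the p-value is obtained by numerically integrating the t density so that the sketch needs nothing beyond the Python standard library. It is not the code used to prepare the challenge dataset.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p(t, df, steps=2000):
    """Two-sided p-value, using Simpson's rule on the density over [0, |t|]."""
    upper = abs(t)
    if upper == 0.0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = total * h / 3  # P(0 <= T <= |t|)
    return max(0.0, 1.0 - 2.0 * central_mass)

def student_t_test(a, b):
    """Pooled-variance two-sample t-test; returns (t, p).

    Assumes both groups have at least two values in total beyond one each
    and a non-zero pooled variance.
    """
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((x - mean_b) ** 2 for x in b)
    df = na + nb - 2
    pooled_var = (ss_a + ss_b) / df
    t = (mean_a - mean_b) / math.sqrt(pooled_var * (1 / na + 1 / nb))
    return t, two_sided_p(t, df)

# Hypothetical expression values: (p53 wild-type lines, p53 mutant lines).
expression = {
    "gene_a": ([5.1, 5.3, 4.9, 5.2], [7.0, 7.2, 6.9, 7.1]),
    "gene_b": ([5.0, 5.2, 4.8, 5.1], [5.1, 4.9, 5.2, 5.0]),
}

# Keep the genes whose expression differs between the two groups at p < 0.05.
significant = [gene for gene, (wt, mut) in expression.items()
               if student_t_test(wt, mut)[1] < 0.05]
```

Testing thousands of genes at a fixed 0.05 threshold produces many false positives by chance; the slides do not say whether a multiple-testing correction was applied, so none is shown here.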


Enrichment by Gene Ontology terms

The set of genes was enriched using Gene Ontology (GO) terms.

Gene Ontology includes 38137 terms

The three categories of the GO hierarchy are:

biological processes (23928 terms)

molecular functions (9467 terms)

cellular component (3050 terms)

The resulting dataset consists of 6172 genes that are described by 9027 GO terms

The label indicates whether genes are statistically associated with the p53 status.
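
A dataset of this shape (genes as rows, GO terms as binary attributes, plus a label) could be represented as in the sketch below. The gene identifiers, annotations, and labels are made up for illustration, and `build_binary_table` is a hypothetical helper, not part of the challenge material.

```python
def build_binary_table(annotations):
    """Turn {gene: set of GO terms} into a sorted term list and 0/1 rows."""
    terms = sorted({term for gene_terms in annotations.values()
                    for term in gene_terms})
    rows = {gene: [1 if term in gene_terms else 0 for term in terms]
            for gene, gene_terms in annotations.items()}
    return terms, rows

# Hypothetical annotations; the real dataset has 6172 genes and 9027 GO terms.
annotations = {
    "gene_a": {"DNA binding", "zinc ion binding"},
    "gene_b": {"proteolysis", "plasma membrane"},
    "gene_c": {"DNA binding"},
}
# Hypothetical labels: 1 if the gene is statistically associated with p53 status.
labels = {"gene_a": 1, "gene_b": 0, "gene_c": 1}

terms, rows = build_binary_table(annotations)
```

With thousands of GO terms and only a few annotations per gene, most entries are 0, so a sparse representation such as the per-gene term sets in `annotations` is usually the more economical one.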


Evaluation

Validation of local patterns by domain experts from the Biological Research Foundation (BRF)

Questionnaire according to the following criteria:

Novelty: whether the subgroup comprises new knowledge

Usability: whether the knowledge in the subgroup is useful for the researcher

Generality: whether the subgroup contains very general terms and is therefore not interesting for the user

Score  Novelty    Usability    Generality
2      novel      very useful  very specific
1      -          useful       specific
0      not novel  not useful   general
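
Averaging questionnaire scores of this kind for one subgroup set could look like the sketch below. It assumes a plain arithmetic mean over the experts; the slides do not state how many experts took part or how their scores were combined, so the three ratings here are hypothetical.

```python
def average_scores(ratings):
    """Mean usability and generality over (usability, generality) ratings."""
    n = len(ratings)
    avg_usability = sum(usability for usability, _ in ratings) / n
    avg_generality = sum(generality for _, generality in ratings) / n
    return avg_usability, avg_generality

# Hypothetical ratings from three experts, each score on the 0-2 scale.
ratings = [(2, 1), (1, 1), (1, 0)]
avg_usability, avg_generality = average_scores(ratings)
```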


Results: Novelty

None of the discovered subgroups was considered novel

Definition of novelty: the GO terms represented in the subgroup are not known from the literature

Expert feedback:

“The disease is well known - cancer has been studied very thoroughly”.

“There always seem to be a PubMed mention of those terms relative to radiation and cancer disease”

“It is a good output for a general overview of the dataset, and often biologists need to have this outcome to focus on particular biological processes”


Challenge Results: Generality

Small subgroups are likely to be more interesting and useful

General subgroup descriptions (SD) are mostly not interesting

Expert feedback:

“the terms were very general”


Challenge Results: Usability

Usability is a main criterion for evaluating the results

Correlation between usability and generality

Results

Subgroup set   Participant         Av. usability  Av. generality
1              Wouter Duivesteijn  1              0
5 - specific   Arno Knobbe         1              0
4 - good       Arno Knobbe         1              0.66
2                                  1.33           0.66
6 - positive   Arno Knobbe         1.33           1
7 - diverse    Arno Knobbe         1.33           1
3                                  1.66           1.66
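
The relation between average usability and average generality in the results above can be checked with a Pearson correlation, as in this minimal sketch. The seven value pairs are read off the results table (one pair per subgroup set, decimal commas written as points); choosing Pearson's coefficient is an assumption, since the slides do not say how the correlation was measured.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Values in table order: subgroup sets 1, 5, 4, 2, 6, 7, 3.
avg_usability = [1, 1, 1, 1.33, 1.33, 1.33, 1.66]
avg_generality = [0, 0, 0.66, 0.66, 1, 1, 1.66]

r = pearson(avg_usability, avg_generality)
```

On the questionnaire scale a higher generality score means more specific terms, so a positive coefficient here means that the more specific subgroup sets were also rated as more useful. With only seven sets, the value is indicative at best.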


Best rules

Subgroup                                        Probability  Size  Usability  Generality
1. NADH dehydrogenase (ubiquinone) activity=1   1.0          29    2          2
2. proteolysis=1 & plasma membrane=1            0.825        40    2          2
3. DNA binding=1, zinc ion binding=1            0.612        125   2          2
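
How a rule of this form might be evaluated is sketched below. The sketch assumes that "Size" is the number of genes satisfying every condition of the rule and that "Probability" is the share of those genes carrying the positive label; the slides do not define either column, and the toy genes are hypothetical.

```python
def evaluate_subgroup(genes, required_terms):
    """Return (size, probability) of a conjunctive subgroup description.

    genes: list of dicts with a "terms" set and a 0/1 "label".
    required_terms: GO terms that must all be present (term=1).
    """
    covered = [gene for gene in genes if required_terms <= gene["terms"]]
    size = len(covered)
    if size == 0:
        return 0, None  # the rule covers no gene, so no probability exists
    positives = sum(gene["label"] for gene in covered)
    return size, positives / size

# Hypothetical toy data, not taken from the challenge dataset.
genes = [
    {"terms": {"proteolysis", "plasma membrane"}, "label": 1},
    {"terms": {"proteolysis", "plasma membrane", "DNA binding"}, "label": 1},
    {"terms": {"proteolysis"}, "label": 0},
    {"terms": {"proteolysis", "plasma membrane"}, "label": 0},
]

size, probability = evaluate_subgroup(genes, {"proteolysis", "plasma membrane"})
```

Adding a condition to a rule can only shrink the set of covered genes, which is why longer descriptions tend to yield smaller and more specific subgroups.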


Conclusion

1. Presentation of results is important to the user

2. Very large descriptions are hard to understand

3. Very general subgroups are unlikely to be useful

4. Expert knowledge plays an important role

5. Algorithm parameters should be optimized according to expert feedback

6. Generality is not always a characteristic of the data; it also depends on domain knowledge, so overly general attributes should be removed