Bertinoro, Nov 2005

Some Data Mining Challenges Learned From Bioinformatics & Actions Taken

Limsoon Wong

National University of Singapore

Bertinoro, Nov 2005 Copyright 2005 © Limsoon Wong

Plan

• Bioinformatics Examples
  – Treatment prognosis of DLBC lymphoma
  – Prediction of translation initiation site
  – Prediction of protein function from PPI data

• What have we learned from these projects?

• What have I been looking at recently?
  – Statistical measures beyond frequent items
  – Small changes that have large impact
  – Evolution of pattern spaces

Example #1: Treatment Prognosis for DLBC Lymphoma

Image credit: Rosenwald et al, 2002

Ref: H. Liu et al, “Selection of patient samples and genes for outcome prediction”, Proc. CSB2004, pages 382--392

Diffuse Large B-Cell Lymphoma

• DLBC lymphoma is the most common type of lymphoma in adults

• Can be cured by anthracycline-based chemotherapy in 35 to 40 percent of patients

DLBC lymphoma comprises several diseases that differ in responsiveness to chemotherapy

• Intl Prognostic Index (IPI)
  – age, “Eastern Cooperative Oncology Group” performance status, tumor stage, lactate dehydrogenase level, sites of extranodal disease, ...

• Not very good for stratifying DLBC lymphoma patients for therapeutic trials

Use gene-expression profiles to predict outcome of chemotherapy?

Knowledge Discovery from Gene Expression of “Extreme” Samples

[Figure: workflow for knowledge discovery from “extreme” samples]

• 240 samples

• “Extreme” sample selection (< 1 yr vs > 8 yrs): 26 long-term survivors, 47 short-term survivors

• Knowledge discovery from gene expression: 7399 genes reduced to 84 genes

• 80 samples

• T is long-term if S(T) < 0.3

• T is short-term if S(T) > 0.7
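The selection and thresholding rules on this slide can be sketched in Python. This is only an illustration under stated assumptions: the `Sample` record, the `died` flag (used so that a patient censored before 1 year is not counted as a short-term survivor), and the "undetermined" label for scores between the two thresholds are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    sample_id: str
    survival_years: float  # follow-up time in years
    died: bool             # assumption: True if death was observed


def select_extreme(samples, short_cutoff=1.0, long_cutoff=8.0):
    """Keep only the "extreme" samples: < 1 yr vs > 8 yrs."""
    short_term = [s for s in samples if s.died and s.survival_years < short_cutoff]
    long_term = [s for s in samples if s.survival_years > long_cutoff]
    return short_term, long_term


def risk_group(score, low=0.3, high=0.7):
    """Map a risk score S(T) to the labels used on the slide."""
    if score < low:
        return "long-term"
    if score > high:
        return "short-term"
    return "undetermined"  # assumption: the slide does not name this case


# Hypothetical cohort, for illustration only.
cohort = [
    Sample("s1", 0.4, True),
    Sample("s2", 9.2, False),
    Sample("s3", 3.5, True),
    Sample("s4", 0.8, False),  # censored early, so not selected
]
short_term, long_term = select_extreme(cohort)
print([s.sample_id for s in short_term], [s.sample_id for s in long_term])
print(risk_group(0.1), risk_group(0.5), risk_group(0.9))
```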

Kaplan-Meier Plot for 80 Test Cases

• p-value of log-rank test: < 0.0001

• Risk score thresholds: 0.7, 0.3

[Figure: Kaplan-Meier curves for the low-risk and high-risk groups]

• Without training sample selection, there is no clear difference in overall survival among the 80 samples in the validation group of the DLBCL study

Example #2: Protein Translation Initiation Site Recognition

Ref: L. Wong et al., “Using feature generation and feature selection for accurate prediction of translation initiation sites”, GIW 13:192--200, 2002

A Sample cDNA

299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT

• What makes the second ATG the TIS?

• Approach
  – Training data gathering
  – Signal generation
    • k-grams, distance, domain know-how, ...
  – Signal selection
    • Entropy, χ², CFS, t-test, domain know-how, ...
  – Signal integration
    • SVM, ANN, PCL, CART, C4.5, kNN, ...
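As a rough illustration of the signal generation step, the sketch below counts k-grams around a candidate ATG. The split into upstream/downstream and into three reading frames relative to the ATG is an assumption about how the k-gram signals are organised, not something the slides spell out; the function name and the toy sequence are hypothetical.

```python
from collections import Counter


def kgram_signals(seq, atg_pos, k):
    """Count k-grams around the candidate ATG at 0-based index atg_pos.

    Keys are (side, frame, kgram). frame is the k-gram's start offset
    relative to the ATG, modulo 3, so frame 0 means in-frame.
    """
    counts = Counter()
    for start in range(len(seq) - k + 1):
        if start + k <= atg_pos:
            side = "upstream"
        elif start >= atg_pos + 3:
            side = "downstream"
        else:
            continue  # k-gram overlaps the candidate ATG itself
        frame = (start - atg_pos) % 3
        counts[(side, frame, seq[start:start + k])] += 1
    return counts


# Hypothetical toy sequence with a candidate ATG at index 6.
toy = "GCCGCCATGGCTTGA"
counts = kgram_signals(toy, toy.find("ATG"), k=3)
print(counts[("upstream", 0, "GCC")])    # in-frame upstream GCC
print(counts[("downstream", 0, "TGA")])  # in-frame downstream stop codon
```

Each (side, frame, k-gram) count becomes one candidate feature, which is why the number of signals grows quickly with k.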

Too Many Signals ⇒ Feature Selection

• For each value of k, there are 4^k * 3 * 2 k-grams

• If we use k = 1, 2, 3, 4, 5, we have 24 + 96 + 384 + 1536 + 6144 = 8184 features!

• This is too many for most machine learning algorithms

• Choose a signal w/ low intra-class distance

• Choose a signal w/ high inter-class distance

• E.g.,
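The feature count on this slide, and one possible score that rewards low intra-class distance and high inter-class distance, can be sketched as follows. The choice of a Welch-style t statistic is an assumption (the t-test is one of the selection methods listed earlier in the talk); the example values are hypothetical.

```python
import math
from statistics import mean, variance


def num_kgram_features(k):
    """Number of k-gram signals for one k: 4^k * 3 * 2."""
    return (4 ** k) * 3 * 2


total = sum(num_kgram_features(k) for k in range(1, 6))
print(total)  # 24 + 96 + 384 + 1536 + 6144


def t_score(pos_values, neg_values):
    """Large when class means are far apart and within-class variance is small."""
    m1, m2 = mean(pos_values), mean(neg_values)
    v1, v2 = variance(pos_values), variance(neg_values)
    denom = math.sqrt(v1 / len(pos_values) + v2 / len(neg_values))
    if denom == 0:
        return math.inf if m1 != m2 else 0.0
    return abs(m1 - m2) / denom


# Hypothetical signal values in the positive (true TIS) and negative classes.
good = t_score([5, 6, 5, 6], [1, 2, 1, 2])  # well separated, tight classes
poor = t_score([1, 6, 2, 5], [2, 5, 1, 6])  # same means, wide spread
print(good > poor)
```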

Sample k-grams Selected by CFS

• Position –3 (Kozak consensus)

• In-frame upstream ATG (leaky scanning)

• In-frame downstream
  – TAA, TAG, TGA (stop codon)
  – CTG, GAC, GAG, and GCC (codon bias?)

Validation Results (on Chr X and Chr 21)

• Using top 100 features selected by entropy and trained on Pedersen & Nielsen’s

[Figure: validation results for ATGpr and for our method]

Example #3: Protein Function Prediction from Protein Interactions

[Figure: a protein with its level-1 and level-2 neighbours]

An Illustrative Case of Indirect Functional Association?

[Figure: SH3 proteins and SH3-binding proteins]

• Is indirect functional association plausible?

• Is it found often in real interaction data?

• Can it be used to improve protein function prediction from protein interaction data?

Freq of Indirect Functional Association

[Figure: protein interaction network; node labels give the protein name followed by its functional category codes]

YBR055C|11.4.3.1
YDR158W|1.1.6.5|1.1.9
YJR091C|1.3.16.1|16.3.3
YMR101C|42.1
YPL149W|14.4|20.9.13|42.25|14.7.11
YPL088W|2.16|1.1.9
YMR300C|1.3.1
YBL072C|12.1.1
YOR312C|12.1.1
YBL061C|1.5.4|10.3.3|18.2.1.1|32.1.3|42.1|43.1.3.5|1.5.1.3.2
YBR023C|10.3.3|32.1.3|34.11.3.7|42.1|43.1.3.5|43.1.3.9|1.5.1.3.2
YKL006W|12.1.1|16.3.3
YPL193W|12.1.1
YAL012W|1.1.6.5|1.1.9
YBR293W|16.19.3|42.25|1.1.3|1.1.9
YLR330W|1.5.4|34.11.3.7|41.1.1|43.1.3.5|43.1.3.9
YLR140W
YDL081C|12.1.1
YDR091C|1.4.1|12.1.1|12.4.1|16.19.3
YPL013C|12.1.1|42.16
YMR047C|11.4.2|14.4|16.7|20.1.10|20.1.21|20.9.1

• 59.2% proteins in dataset share some function with level-1 neighbours

• 27.9% share some function with level-2 neighbours but share no function with level-1 neighbours
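A small sketch of how level-1 and level-2 neighbours, and the "shares some function" test, might be computed from an interaction graph. The graph, protein names, and annotations below are hypothetical; defining level-2 neighbours as proteins two steps away that are not also level-1 neighbours is an assumption made for this illustration.

```python
def level1(graph, p):
    """Direct interaction partners of p."""
    return set(graph.get(p, ())) - {p}


def level2(graph, p):
    """Proteins two interaction steps from p, excluding p and its level-1 neighbours."""
    l1 = level1(graph, p)
    two_steps = set()
    for q in l1:
        two_steps |= level1(graph, q)
    return two_steps - l1 - {p}


def shares_function(functions, p, others):
    """True if p has at least one annotated function in common with any protein in others."""
    own = functions.get(p, set())
    return any(own & functions.get(q, set()) for q in others)


# Hypothetical undirected interactions, stored in both directions.
graph = {
    "P1": {"P2"},
    "P2": {"P1", "P3"},
    "P3": {"P2"},
}
# Hypothetical functional category annotations.
functions = {"P1": {"12.1.1"}, "P2": {"42.1"}, "P3": {"12.1.1", "16.3.3"}}

print(level1(graph, "P1"), level2(graph, "P1"))
print(shares_function(functions, "P1", level1(graph, "P1")))  # no shared function with L1
print(shares_function(functions, "P1", level2(graph, "P1")))  # shared function with L2 only
```

Here P1 is an instance of the second bullet: it shares a function with a level-2 neighbour but with none of its level-1 neighbours.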

Over-Rep of Functions in L1 & L2 Neighbours

[Figure: “Fraction of Neighbours with Functional Similarity”; x-axis: Similarity (≥0.1 to ≥0.9); y-axis: Fraction (0 to 0.6); series: L1 - L2, L2 - L1, L1 ∩ L2, L3 - (L1 ∪ L2), All Proteins]

[Figure: “Sensitivity vs Precision”; x-axis: Precision (0 to 1); y-axis: Sensitivity (0 to 1); series: L1 - L2, L2 - L1, L1 ∩ L2]

Performance Evaluation

• Prediction performance improves after incorporation of L1, L2, & interaction reliability info

[Figure: “Informative FCs”; x-axis: Precision (0 to 1); y-axis: Sensitivity (0 to 1); series: NC, Chi², PRODISTIN, Weighted Avg, Weighted Avg R]

What Have We Learned?

Some of those “techniques” frequently needed in analysis of biomedical data are insufficiently studied by current data mining researchers

• Recognizing what samples are relevant and what are not

• Recognizing what features are relevant and what are not & handling missing or incorrect values

• Recognizing trends, changes, and their causes

Action #1: Going Beyond Frequent Patterns to Recognize What Features Are Relevant and What Are Not

Going Beyond Frequent Patterns

• Statisticians use a battery of “interestingness” measures to decide if a feature/factor is relevant

• Examples:
  – Odds ratio
  – Relative risk
  – Gini index
  – Yule’s Q & Y
  – etc.

• Odds ratio

$$OR(P,D) = \frac{P_{D,ed} \,/\, P_{D,\neg e d}}{P_{D,e \neg d} \,/\, P_{D,\neg e \neg d}}$$
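For concreteness, the odds ratio and relative risk can be computed from the four cells of a 2×2 table. In this sketch the arguments are assumed to be counts of records in D: `n_ed` (contain pattern P and belong to the positive class d), `n_ned` (lack P, class d), `n_end` (contain P, not class d), and `n_nend` (lack P, not class d). The function names and the example counts are hypothetical.

```python
import math


def odds_ratio(n_ed, n_ned, n_end, n_nend):
    """(n_ed / n_ned) / (n_end / n_nend), written as a cross-product ratio."""
    num = n_ed * n_nend
    den = n_ned * n_end
    if den == 0:
        return math.inf if num > 0 else math.nan
    return num / den


def relative_risk(n_ed, n_ned, n_end, n_nend):
    """Risk of d among records with P, divided by risk of d among records without P."""
    with_p = n_ed + n_end
    without_p = n_ned + n_nend
    if with_p == 0 or without_p == 0:
        return math.nan
    risk_with = n_ed / with_p
    risk_without = n_ned / without_p
    if risk_without == 0:
        return math.inf if risk_with > 0 else math.nan
    return risk_with / risk_without


# Hypothetical counts, for illustration only.
print(odds_ratio(20, 10, 5, 25))     # (20 * 25) / (10 * 5)
print(relative_risk(20, 10, 5, 25))  # (20 / 25) / (10 / 35)
```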

Challenge: Frequent Pattern Mining Relies on Convexity for Efficiency, But …

• Proposition: Let $S^{OR}_{k}(ms,D) = \{ P \in F(ms,D) \mid OR(P,D) \ge k \}$. Then $S^{OR}_{k}(ms,D)$ is not convex

• i.e., the space of odds ratio patterns is not convex. Ditto for many other types of patterns

• Example from the OR search space: {A}: ∞, {A,B}: 1, {A,B,C}: 3
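The non-convexity can be checked on a toy dataset. A pattern space is convex if, whenever X ⊆ Y ⊆ Z and both X and Z are in the space, Y is in the space too. The transactions below are hypothetical, constructed so that the odds ratios along the chain {A} ⊂ {A,B} ⊂ {A,B,C} come out as ∞, 1, and 3; the minimum support threshold is assumed low enough that all three patterns are frequent.

```python
import math


def odds_ratio(pattern, positives, negatives):
    """Odds ratio of a pattern from its support in the positive and negative classes."""
    n_ed = sum(pattern <= t for t in positives)   # has pattern, positive class
    n_ned = len(positives) - n_ed                 # lacks pattern, positive class
    n_end = sum(pattern <= t for t in negatives)  # has pattern, negative class
    n_nend = len(negatives) - n_end               # lacks pattern, negative class
    num, den = n_ed * n_nend, n_ned * n_end
    if den == 0:
        return math.inf if num > 0 else math.nan
    return num / den


# Hypothetical transactions, for illustration only.
positives = [frozenset("ABC"), frozenset("ABC"), frozenset("A"), frozenset("A")]
negatives = [frozenset("ABC"), frozenset("AB"), frozenset("B"), frozenset("C")]

chain = [frozenset("A"), frozenset("AB"), frozenset("ABC")]
values = [odds_ratio(p, positives, negatives) for p in chain]
print(values)

k = 2
in_space = [v >= k for v in values]
print(in_space)  # the middle pattern drops out, so the space is not convex
```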

Solution: Luckily They Become Convex When Decomposed Into Plateaus

• Theorem: Let $S^{OR}_{n,k}(ms,D) = \{ P \in F(ms,D) \mid P_{D,ed} = n,\ OR(P,D) \ge k \}$. Then $S^{OR}_{n,k}(ms,D)$ is convex

The space of odds ratio patterns becomes convex when stratified into plateaus based on support levels on positive (or negative) dataset

• Proposition: Let $Q \in [P]_{D}$, then $OR(Q,D) = OR(P,D)$

The plateau space can be further divided into convex equivalence classes on the whole dataset

The space of equivalence classes can be concisely represented by generators and closed patterns
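To make equivalence classes, generators, and closed patterns concrete, here is a brute-force sketch on a toy dataset. It assumes the usual definition that two patterns are equivalent when exactly the same transactions contain them; since the odds ratio depends only on which transactions contain a pattern, every member of a class then has the same odds ratio. The transactions and function names are hypothetical, and the enumeration is exponential, so this is for illustration rather than real mining.

```python
from itertools import combinations


def equivalence_classes(transactions):
    """Group every occurring non-empty itemset by the set of transactions containing it."""
    items = sorted(set().union(*transactions))
    classes = {}
    for r in range(1, len(items) + 1):
        for combo in combinations(items, r):
            pattern = frozenset(combo)
            tids = frozenset(i for i, t in enumerate(transactions) if pattern <= t)
            if tids:
                classes.setdefault(tids, []).append(pattern)
    return classes


def borders(patterns):
    """Return (generators, closed pattern) of one equivalence class."""
    # Generators: minimal patterns, with no proper subset in the same class.
    generators = [p for p in patterns if not any(q < p for q in patterns)]
    # Closed pattern: the unique maximal pattern of the class.
    closed = max(patterns, key=len)
    return generators, closed


# Hypothetical transactions, for illustration only.
transactions = [frozenset("ABC"), frozenset("ABC"), frozenset("AB"), frozenset("C")]
classes = equivalence_classes(transactions)
for tids, patterns in classes.items():
    gens, closed = borders(patterns)
    print(sorted(tids), [sorted(g) for g in gens], sorted(closed))
```

Each class is the set of patterns lying between one of its generators and its closed pattern, which is why the pair is a concise representation.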

Performance

• Mining odds ratio and relative risk patterns depends on GC-growth

• GC-growth mines both generators and closed patterns

• It is comparable in speed to the fastest algorithms that mine only closed patterns

Action #2: Tipping Factors---The Small Changes With Large Impact

Tipping Events

• Given a data set, such as one related to human health, it is interesting to determine important cohorts and important factors causing transitions between cohorts
  ⇒ Tipping events
  ⇒ Tipping factors are “action items” for causing transitions

• A “tipping event” is two or more population cohorts that are significantly different from each other

• “Tipping factors” (TF) are small patterns whose presence or absence causes significant difference in population cohorts

• “Tipping base” (TB) is the pattern shared by the cohorts in a tipping event

• “Tipping point” (TP) is the combination of TB and a TF
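A rough sketch of these definitions: records matching the tipping base form one cohort, and adding the tipping factor gives the tipping point cohort. Comparing outcome rates between the two is an assumption made here for illustration; the slides do not give the actual significance measure, and the records, item names, and function names below are hypothetical.

```python
def cohort(records, pattern):
    """Records whose item set contains the pattern."""
    return [r for r in records if pattern <= r["items"]]


def outcome_rate(rows):
    """Fraction of rows with outcome 1 (NaN for an empty cohort)."""
    return sum(r["outcome"] for r in rows) / len(rows) if rows else float("nan")


def tipping_effect(records, base, factor):
    """Outcome rates for the cohort with TB but without TF, and for the cohort at TP = TB plus TF."""
    at_tp = cohort(records, base | factor)
    tb_only = [r for r in cohort(records, base) if not factor <= r["items"]]
    return outcome_rate(tb_only), outcome_rate(at_tp)


# Hypothetical health records, for illustration only.
records = [
    {"items": {"age>60", "smoker"}, "outcome": 1},
    {"items": {"age>60", "smoker"}, "outcome": 1},
    {"items": {"age>60"}, "outcome": 0},
    {"items": {"age>60"}, "outcome": 0},
    {"items": {"age>60"}, "outcome": 1},
    {"items": {"smoker"}, "outcome": 0},
]
rates = tipping_effect(records, base={"age>60"}, factor={"smoker"})
print(rates)
```

A real analysis would also need a statistical test on the two cohorts before calling the difference significant.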

Impact-To-Cost-Ratio of Tipping Points

Some Simple Results Useful For Constructing TPs

Action #3: Evolution of Pattern Spaces---How Do They Change When the Sample Space Changes?

Impact of Adding New Transactions on Key and Closed Patterns

Impact of Removing Items From All Transactions

Acknowledgements

• DLBC Lymphoma:
  – Jinyan Li, Huiqing Liu

• Translation Initiation:
  – Fanfan Zeng, Roland Yap
  – Huiqing Liu

• Protein Function Prediction:
  – Kenny Chua, Ken Sung

• Odds Ratio & Relative Risk:
  – Mengling Feng, Yap-Peng Tan
  – Haiquan Li, Jinyan Li

• Tipping Points:
  – Guimei Liu, Jinyan Li
  – Guozhu Dong

• Pattern Space Evolution:
  – Mengling Feng, Yap-Peng Tan
  – Guozhu Dong
  – Jinyan Li