Bertinoro, Nov 2005

Some Data Mining Challenges Learned From Bioinformatics & Actions Taken

Limsoon Wong

National University of Singapore

Bertinoro, Nov 2005 Copyright 2005 © Limsoon Wong

Plan

• Bioinformatics Examples
– Treatment prognosis of DLBC lymphoma
– Prediction of translation initiation site
– Prediction of protein function from PPI data

• What have we learned from these projects?

• What have I been looking at recently?
– Statistical measures beyond frequent items
– Small changes that have large impact
– Evolution of pattern spaces


Example #1: Treatment Prognosis for DLBC Lymphoma

Image credit: Rosenwald et al, 2002

Ref: H. Liu et al, “Selection of patient samples and genes for outcome prediction”, Proc. CSB2004, pages 382--392


Diffuse Large B-Cell Lymphoma

• DLBC lymphoma is the most common type of lymphoma in adults

• Can be cured by anthracycline-based chemotherapy in 35 to 40 percent of patients

DLBC lymphoma comprises several diseases that differ in responsiveness to chemotherapy

• Intl Prognostic Index (IPI)
– age, “Eastern Cooperative Oncology Group” performance status, tumor stage, lactate dehydrogenase level, sites of extranodal disease, ...

• Not very good for stratifying DLBC lymphoma patients for therapeutic trials

Use gene-expression profiles to predict outcome of chemotherapy?


Knowledge Discovery from Gene Expression of “Extreme” Samples

[Diagram: workflow]

• 240 samples → “extreme” sample selection (< 1 yr vs > 8 yrs) → 47 short-term survivors, 26 long-term survivors

• 7399 genes → knowledge discovery from gene expression → 84 genes

• 80 samples

• T is long-term if S(T) < 0.3

• T is short-term if S(T) > 0.7
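The selection and thresholding rules on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Patient` record, the handling of censored patients (those still alive at last follow-up), and the "undetermined" label for scores between the two thresholds are hypothetical and not taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    survival_years: float  # follow-up time in years
    died: bool             # True if death was observed during follow-up


def select_extreme(patients, short_cutoff=1.0, long_cutoff=8.0):
    """Keep only the "extreme" samples: < 1 yr vs > 8 yrs.

    Assumption: a patient counts as short-term only if death was observed
    before the cutoff; a censored patient with short follow-up is dropped.
    """
    short_term = [p for p in patients
                  if p.died and p.survival_years < short_cutoff]
    long_term = [p for p in patients if p.survival_years > long_cutoff]
    return short_term, long_term


def risk_group(score, low=0.3, high=0.7):
    """Apply the slide's thresholds to a risk score S(T)."""
    if score > high:
        return "short-term"
    if score < low:
        return "long-term"
    return "undetermined"  # assumption: the slide does not name this band
```

For instance, `risk_group(0.8)` falls in the short-term group and `risk_group(0.2)` in the long-term group.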


Kaplan-Meier Plot for 80 Test Cases

• p-value of log-rank test: < 0.0001

• Risk score thresholds: 0.7, 0.3

[Plot: Kaplan-Meier curves for the low-risk and high-risk groups]

• Without training sample selection, there is no clear difference in overall survival among the 80 samples in the validation group of the DLBCL study
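For readers unfamiliar with Kaplan-Meier plots, the curve for one risk group can be computed with the standard product-limit estimator. The sketch below is a generic textbook version, not the code used in the study; the function name and input layout are assumptions. Note that the survival estimate here is unrelated to the risk score S(T) on the previous slide.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the survival curve.

    times  : follow-up time of each patient
    events : True if death was observed at that time, False if censored
    Returns a list of (time, survival probability) at each death time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients whose follow-up ends at the same time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            surv *= 1 - deaths / n_at_risk
            curve.append((t, surv))
        # Both deaths and censored patients leave the risk set after t.
        n_at_risk -= removed
    return curve
```

Computing one curve per risk group and comparing the two with a log-rank test gives the kind of plot summarized on this slide.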


Example #2: Protein Translation Initiation Site Recognition

Ref: L. Wong et al., “Using feature generation and feature selection for accurate prediction of translation initiation sites”, GIW 13:192--200, 2002


A Sample cDNA

299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT

• What makes the second ATG the TIS?

• Approach
– Training data gathering
– Signal generation
  • k-grams, distance, domain know-how, ...
– Signal selection
  • Entropy, χ², CFS, t-test, domain know-how, ...
– Signal integration
  • SVM, ANN, PCL, CART, C4.5, kNN, ...
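As a rough illustration of the signal generation step, the sketch below finds candidate ATGs in a sequence and counts in-frame k-grams upstream and downstream of one of them. It is a minimal sketch, not the feature generator from the referenced paper; the function names and the exact windowing are assumptions.

```python
from collections import Counter


def candidate_sites(seq):
    """Positions of every ATG in the sequence (each is a candidate TIS)."""
    return [i for i in range(len(seq) - 2) if seq[i:i + 3] == "ATG"]


def kgram_counts(seq, atg_pos, k=3):
    """Count in-frame k-grams upstream and downstream of the ATG at atg_pos.

    "In-frame" here means the k-gram starts a multiple of 3 bases away
    from the ATG. Upstream k-grams must end before the ATG starts.
    """
    upstream = Counter()
    downstream = Counter()
    for i in range(atg_pos - 3, -1, -3):
        if i + k <= atg_pos:
            upstream[seq[i:i + k]] += 1
    for i in range(atg_pos + 3, len(seq) - k + 1, 3):
        downstream[seq[i:i + k]] += 1
    return upstream, downstream
```

For the toy sequence `"GCCATGGCTTAA"`, the only candidate is at position 3, with `GCC` upstream and `GCT`, `TAA` downstream. Each such count becomes one feature for the selection step.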


Too Many Signals → Feature Selection

• For each value of k, there are 4^k * 3 * 2 k-grams

• If we use k = 1, 2, 3, 4, 5, we have 24 + 96 + 384 + 1536 + 6144 = 8184 features!

• This is too many for most machine learning algorithms

• Choose a signal w/ low intra-class distance

• Choose a signal w/ high inter-class distance

• E.g., [figure not recovered]
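The feature count on this slide, and the idea of preferring signals with low intra-class and high inter-class distance, can be checked with a short sketch. The scoring function below is a t-statistic-style score chosen for illustration; it is an assumption and is not claimed to be the exact criterion used in the project.

```python
from math import sqrt
from statistics import mean, variance


def n_kgram_features(k):
    """Number of k-gram features for a given k, per the slide: 4^k * 3 * 2."""
    return 4 ** k * 3 * 2


def separation_score(pos_values, neg_values):
    """Score one signal by how well it separates the two classes.

    Numerator: distance between the class means (inter-class distance).
    Denominator: spread within each class (intra-class distance).
    Each class needs at least two values, as statistics.variance requires.
    """
    diff = abs(mean(pos_values) - mean(neg_values))
    spread = sqrt(variance(pos_values) / len(pos_values)
                  + variance(neg_values) / len(neg_values))
    if spread == 0:
        return float("inf") if diff > 0 else 0.0
    return diff / spread


total = sum(n_kgram_features(k) for k in range(1, 6))  # 8184, as on the slide
```

Ranking all signals by such a score and keeping the top few is one simple way to cut the 8184 candidates down to a manageable set.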


Sample k-grams Selected by CFS

• Position –3 (Kozak consensus)

• In-frame upstream ATG (leaky scanning)

• In-frame downstream
– TAA, TAG, TGA (stop codon)
– CTG, GAC, GAG, and GCC (codon bias?)


Validation Results (on Chr X and Chr 21)

• Using top 100 features selected by entropy and trained on Pedersen & Nielsen’s

[Plot: ATGpr vs. our method]


Example #3: Protein Function Prediction from Protein Interactions

[Figure: interaction network showing a level-1 neighbour and a level-2 neighbour]


An Illustrative Case of Indirect Functional Association?

• Is indirect functional association plausible?

• Is it found often in real interaction data?

• Can it be used to improve protein function prediction from protein interaction data?

[Figure: SH3 proteins and SH3-binding proteins]


YBR055C|11.4.3.1

YDR158W|1.1.6.5|1.1.9

YJR091C|1.3.16.1|16.3.3

YMR101C|42.1

YPL149W|14.4|20.9.13|42.25|14.7.11

YPL088W|2.16|1.1.9

YMR300C|1.3.1

YBL072C|12.1.1

YOR312C|12.1.1

YBL061C|1.5.4|10.3.3|18.2.1.1|32.1.3|42.1|43.1.3.5|1.5.1.3.2

YBR023C|10.3.3|32.1.3|34.11.3.7|42.1|43.1.3.5|43.1.3.9|1.5.1.3.2

YKL006W|12.1.1|16.3.3

YPL193W|12.1.1

YAL012W|1.1.6.5|1.1.9

YBR293W|16.19.3|42.25|1.1.3|1.1.9

YLR330W|1.5.4|34.11.3.7|41.1.1|43.1.3.5|43.1.3.9

YLR140W

YDL081C|12.1.1

YDR091C|1.4.1|12.1.1|12.4.1|16.19.3

YPL013C|12.1.1|42.16

YMR047C|11.4.2|14.4|16.7|20.1.10|20.1.21|20.9.1

Freq of Indirect Functional Association

• 59.2% proteins in dataset share some function with level-1 neighbours

• 27.9% share some function with level-2 neighbours but share no function with level-1 neighbours
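The two percentages above are fractions of proteins with a given kind of functional overlap. A minimal sketch of how such fractions could be computed from an interaction graph follows; the data layout (adjacency sets and per-protein function sets) and all names are assumptions made for illustration.

```python
def level1(graph, p):
    """Direct interaction partners of p."""
    return set(graph.get(p, ())) - {p}


def level2(graph, p):
    """Partners of p's partners (may overlap with level-1 neighbours)."""
    result = set()
    for q in level1(graph, p):
        result |= set(graph.get(q, ()))
    return result - {p}


def shares_function(functions, p, neighbours):
    """True if p has at least one function in common with some neighbour."""
    own = functions.get(p, set())
    return any(own & functions.get(q, set()) for q in neighbours)


def association_fractions(graph, functions):
    """Return (fraction sharing with level-1,
               fraction sharing with level-2 but not with level-1)."""
    proteins = [p for p in graph if functions.get(p)]
    direct = 0
    indirect_only = 0
    for p in proteins:
        if shares_function(functions, p, level1(graph, p)):
            direct += 1
        elif shares_function(functions, p, level2(graph, p)):
            indirect_only += 1
    return direct / len(proteins), indirect_only / len(proteins)


# Hypothetical toy network, not data from the study.
toy_graph = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
toy_functions = {"A": {"f1"}, "B": {"f2"}, "C": {"f1"}, "D": {"f1"}}
```

In the toy network, protein A shares no function with its only partner B but shares `f1` with its level-2 neighbour C, which is the indirect association the slide describes.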


Over-Rep of Functions in L1 & L2 Neighbours

[Plot: Fraction of Neighbours with Functional Similarity. x-axis: Similarity (≥0.1 to ≥0.9); y-axis: Fraction (0 to 0.6); series: L1 - L2, L2 - L1, L1 ∩ L2, L3 - (L1 ∪ L2), All Proteins]

[Plot: Sensitivity vs Precision. x-axis: Precision (0 to 1); y-axis: Sensitivity (0 to 1); series: L1 - L2, L2 - L1, L1 ∩ L2]


Performance Evaluation

• Prediction performance improves after incorporation of L1, L2, & interaction reliability info

[Plot: Informative FCs. x-axis: Precision (0 to 1); y-axis: Sensitivity (0 to 1); series: NC, Chi², PRODISTIN, Weighted Avg, Weighted Avg R]


What Have We Learned?


Some of those “techniques” frequently needed in analysis of biomedical data are insufficiently studied by current data mining researchers

• Recognizing what samples are relevant and what are not

• Recognizing what features are relevant and what are not & handling missing or incorrect values

• Recognizing trends, changes, and their causes


Action #1: Going Beyond Frequent Patterns to Recognize What Features Are Relevant and What Are Not


Going Beyond Frequent Patterns

• Statisticians use a battery of “interestingness” measures to decide if a feature/factor is relevant

• Examples:
– Odds ratio
– Relative risk
– Gini index
– Yule’s Q & Y
– etc

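• Odds ratio

$$OR(P,D) = \frac{P_{D,ed} \,/\, P_{D,e\bar{d}}}{P_{D,\bar{e}d} \,/\, P_{D,\bar{e}\bar{d}}}$$

where $P_{D,ed}$ is the number of records in $D$ that contain pattern $P$ (exposed, $e$) and have the outcome ($d$); an overbar denotes absence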
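Odds ratio and relative risk are both computed from a 2x2 contingency table of exposure (the record contains the pattern) against outcome. The sketch below uses the standard textbook definitions; the argument names and the handling of zero cells are assumptions.

```python
import math


def odds_ratio(n_ed, n_e_nd, n_ne_d, n_ne_nd):
    """Odds ratio from a 2x2 table.

    n_ed    : exposed, with outcome
    n_e_nd  : exposed, without outcome
    n_ne_d  : not exposed, with outcome
    n_ne_nd : not exposed, without outcome
    """
    numerator = n_ed * n_ne_nd
    denominator = n_e_nd * n_ne_d
    if denominator == 0:
        # Assumption: report infinity when only the denominator vanishes.
        return math.inf if numerator > 0 else math.nan
    return numerator / denominator


def relative_risk(n_ed, n_e_nd, n_ne_d, n_ne_nd):
    """Risk of the outcome among the exposed over risk among the unexposed."""
    exposed = n_ed + n_e_nd
    unexposed = n_ne_d + n_ne_nd
    if exposed == 0 or unexposed == 0:
        return math.nan
    risk_exposed = n_ed / exposed
    risk_unexposed = n_ne_d / unexposed
    if risk_unexposed == 0:
        return math.inf if risk_exposed > 0 else math.nan
    return risk_exposed / risk_unexposed
```

With 20 of 100 exposed records and 10 of 100 unexposed records showing the outcome, `odds_ratio(20, 80, 10, 90)` gives 2.25 and `relative_risk(20, 80, 10, 90)` gives 2.0.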


Challenge: Frequent Pattern Mining Relies on Convexity for Efficiency, But …

• Proposition: Let $S^{OR}_{k}(ms,D) = \{ P \in F(ms,D) \mid OR(P,D) \geq k \}$. Then $S^{OR}_{k}(ms,D)$ is not convex

• i.e., the space of odds ratio patterns is not convex. Ditto for many other types of patterns

• Example from the OR search space: {A}: ∞, {A,B}: 1, {A,B,C}: 3. For k = 2, {A} and {A,B,C} satisfy OR ≥ k, but {A,B}, which lies between them, does not
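A pattern space is convex when, for any two member patterns X ⊆ Z, every pattern Y with X ⊆ Y ⊆ Z is also a member. The brute-force check below illustrates the definition on the three odds ratio values shown on this slide ({A}: ∞, {A,B}: 1, {A,B,C}: 3); the threshold k = 2 and the function name are assumptions for the example.

```python
from itertools import combinations


def is_convex(space):
    """Check convexity of a set of patterns (each a frozenset of items)."""
    for x in space:
        for z in space:
            if x < z:  # x is a proper subset of z
                extra = z - x
                # Every pattern strictly between x and z must be present.
                for r in range(1, len(extra)):
                    for added in combinations(extra, r):
                        if x | frozenset(added) not in space:
                            return False
    return True


# Odds ratios from the slide's search-space example.
odds = {
    frozenset("A"): float("inf"),
    frozenset("AB"): 1.0,
    frozenset("ABC"): 3.0,
}
k = 2  # assumed threshold for the illustration
patterns_with_or_at_least_k = {p for p, v in odds.items() if v >= k}
```

Here `is_convex(patterns_with_or_at_least_k)` is false because {A,B} is missing, which is why border-based representations that rely on convexity cannot be applied to odds ratio patterns directly.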


Solution: Luckily They Become Convex When Decomposed Into Plateaus

• Theorem: Let $S^{OR}_{n,k}(ms,D) = \{ P \in F(ms,D) \mid P_{D,ed} = n,\ OR(P,D) \geq k \}$. Then $S^{OR}_{n,k}(ms,D)$ is convex

The space of odds ratio patterns becomes convex when stratified into plateaus based on support levels on positive (or negative) dataset

• Proposition: Let $Q \in [P]_{D}$, then $OR(Q,D) = OR(P,D)$

The plateau space can be further divided into convex equivalence classes on the whole dataset

The space of equivalence classes can be concisely represented by generators and closed patterns
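To make equivalence classes, generators, and closed patterns concrete, the brute-force sketch below groups patterns by the set of transactions that contain them. It enumerates every itemset, so it is only suitable for tiny examples and is not the GC-growth algorithm; all names and the toy transactions are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def equivalence_classes(transactions, min_support=1):
    """Group patterns that occur in exactly the same transactions."""
    items = sorted(set().union(*transactions))
    classes = defaultdict(list)
    for r in range(1, len(items) + 1):
        for combo in combinations(items, r):
            pattern = frozenset(combo)
            cover = frozenset(i for i, t in enumerate(transactions)
                              if pattern <= t)
            if len(cover) >= min_support:
                classes[cover].append(pattern)
    return classes


def closed_and_generators(patterns):
    """Closed pattern = the unique largest member of the class;
    generators = members with no proper subset in the class."""
    closed = max(patterns, key=len)
    generators = [p for p in patterns if not any(q < p for q in patterns)]
    return closed, generators


toy_transactions = [frozenset("ABC"), frozenset("ABC"), frozenset("AD")]
toy_classes = equivalence_classes(toy_transactions)
```

In the toy data, the patterns B, C, AB, AC, BC, and ABC all occur in transactions 0 and 1, so they form one class with closed pattern ABC and generators B and C. Because every member has the same support in each part of the dataset, all of them share one odds ratio.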


Performance

• Mining odds ratio and relative risk patterns depends on GC-growth

• GC-growth mines both generators and closed patterns

• It is comparable in speed to the fastest algorithms that mine only closed patterns


Action #2: Tipping Factors: The Small Changes With Large Impact


Tipping Events

• Given a data set, such as those related to human health, it is interesting to determine the important cohorts and the important factors that cause transitions between cohorts

Tipping events

Tipping factors are “action items” for causing transitions

• A “tipping event” is a set of two or more population cohorts that are significantly different from each other

• “Tipping factors” (TF) are small patterns whose presence or absence causes a significant difference in population cohorts

• “Tipping base” (TB) is the pattern shared by the cohorts in a tipping event

• “Tipping point” (TP) is the combination of TB and a TF
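The relation between tipping base, tipping factor, and tipping point can be illustrated on itemset data: the TB selects a population, and the TF splits it into two cohorts whose outcome rates are compared. The sketch below reports only the raw difference in outcome rate; the record layout, the names, and the omission of a significance test are all assumptions, and the toy records are hypothetical.

```python
def tipping_effect(records, base, factor):
    """Difference in outcome rate between the cohort matching TB + TF
    (the tipping point) and the cohort matching TB without the TF.

    records : list of (items, outcome) pairs, items a set, outcome a bool
    Returns None if either cohort is empty.
    """
    with_tf = [outcome for items, outcome in records
               if base <= items and factor <= items]
    without_tf = [outcome for items, outcome in records
                  if base <= items and not factor <= items]
    if not with_tf or not without_tf:
        return None
    return sum(with_tf) / len(with_tf) - sum(without_tf) / len(without_tf)


# Hypothetical records: "b" plays the tipping base, "f" the tipping factor.
toy_records = [
    ({"b", "f"}, True),
    ({"b", "f"}, True),
    ({"b"}, False),
    ({"b"}, True),
    ({"f"}, False),
]
effect = tipping_effect(toy_records, {"b"}, {"f"})
```

A real analysis would replace the raw difference with a statistical test before calling the two cohorts a tipping event.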


Impact-To-Cost-Ratio of Tipping Points


Some Simple Results Useful For Constructing TPs


Action #3: Evolution of Pattern Spaces: How Do They Change When the Sample Space Changes?


Impact of Adding New Transactions on Key and Closed Patterns


Impact of Removing Items From All Transactions


Acknowledgements

• DLBC Lymphoma:
– Jinyan Li, Huiqing Liu

• Translation Initiation:
– Fanfan Zeng, Roland Yap
– Huiqing Liu

• Protein Function Prediction:
– Kenny Chua, Ken Sung

• Odds Ratio & Relative Risk:
– Mengling Feng, Yap-Peng Tan
– Haiquan Li, Jinyan Li

• Tipping Points:
– Guimei Liu, Jinyan Li
– Guozhu Dong

• Pattern Space Evolution:
– Mengling Feng, Yap-Peng Tan
– Guozhu Dong
– Jinyan Li
