Bertinoro, Nov 2005
Some Data Mining Challenges Learned From Bioinformatics & Actions Taken
Limsoon Wong
National University of Singapore
Bertinoro, Nov 2005 Copyright 2005 © Limsoon Wong
Plan
• Bioinformatics Examples
– Treatment prognosis of DLBC lymphoma
– Prediction of translation initiation site
– Prediction of protein function from PPI data
• What have we learned from these projects?
• What have I been looking at recently?
– Statistical measures beyond frequent items
– Small changes that have large impact
– Evolution of pattern spaces
Example #1: Treatment Prognosis for DLBC Lymphoma
Image credit: Rosenwald et al, 2002
Ref: H. Liu et al, “Selection of patient samples and genes for outcome prediction”, Proc. CSB2004, pages 382--392
Diffuse Large B-Cell Lymphoma
• DLBC lymphoma is the most common type of lymphoma in adults
• Can be cured by anthracycline-based chemotherapy in 35 to 40 percent of patients
DLBC lymphoma comprises several diseases that differ in responsiveness to chemotherapy
• Intl Prognostic Index (IPI): age, “Eastern Cooperative Oncology Group” performance status, tumor stage, lactate dehydrogenase level, sites of extranodal disease, ...
• Not very good for stratifying DLBC lymphoma patients for therapeutic trials
Use gene-expression profiles to predict outcome of chemotherapy?
Knowledge Discovery from Gene Expression of “Extreme” Samples
[Figure: workflow diagram. Labels: 240 samples; “extreme” sample selection (< 1 yr vs > 8 yrs); 26 long-term survivors; 47 short-term survivors; 80 samples; knowledge discovery from gene expression; 7399 genes; 84 genes]
T is long-term if S(T) < 0.3
T is short-term if S(T) > 0.7
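The two threshold rules above can be sketched as a small function. This is only an illustration: the function name and the "undetermined" label for scores between the thresholds are assumptions, since the slide does not say how such samples are handled.

```python
def classify_survival(risk_score, low=0.3, high=0.7):
    """Label a sample T from its risk score S(T) using the slide's thresholds."""
    if risk_score < low:
        return "long-term"
    if risk_score > high:
        return "short-term"
    # Assumption: scores in [low, high] are left unlabeled; the slide is silent here.
    return "undetermined"


labels = [classify_survival(s) for s in (0.1, 0.5, 0.9)]
```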
Kaplan-Meier Plot for 80 Test Cases
p-value of log-rank test: < 0.0001
Risk score thresholds: 0.7, 0.3
[Figure: Kaplan-Meier curves for the low-risk and high-risk groups]
Without training-sample selection, there is no clear difference in overall survival among the 80 samples in the validation group of the DLBCL study
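For readers unfamiliar with the plot type, a Kaplan-Meier curve can be computed as in the minimal sketch below. It is a textbook product-limit estimator written for illustration, not the code used in the study; the function name and the toy follow-up data are assumptions.

```python
def kaplan_meier(times, events):
    """Return [(t, survival probability)] at each distinct time with a death.

    times: follow-up time per patient; events: 1 = death observed, 0 = censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = leaving = 0
        # Group all patients whose follow-up ends at the same time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # deaths and censored patients both leave the risk set
    return curve


# Toy data (hypothetical): four patients, the second one censored at t = 2.
curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
```

Plotting one such curve per risk group, and comparing the groups with a log-rank test, gives the kind of figure summarized on this slide.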
Example #2: Protein Translation Initiation Site Recognition
Ref: L. Wong et al., “Using feature generation and feature selection for accurate prediction of translation initiation sites”, GIW 13:192--200, 2002
A Sample cDNA
299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
• What makes the second ATG the TIS?
• Approach
– Training data gathering
– Signal generation
  • k-grams, distance, domain know-how, ...
– Signal selection
  • Entropy, χ², CFS, t-test, domain know-how...
– Signal integration
  • SVM, ANN, PCL, CART, C4.5, kNN, ...
Too Many Signals: Feature Selection
• For each value of k, there are 4^k * 3 * 2 k-grams
• If we use k = 1, 2, 3, 4, 5, we have 24 + 96 + 384 + 1536 + 6144 = 8184 features!
• This is too many for most machine learning algorithms
• Choose a signal w/ low intra-class distance
• Choose a signal w/ high inter-class distance
• E.g.,
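The feature count above can be checked directly, and the two selection criteria can be illustrated with a simple score. The slide's own example is not preserved in this transcript, so the t-statistic-like score below is an assumed stand-in (the t-test is one of the selection methods listed earlier). The function names and toy values are hypothetical.

```python
import math
import statistics


def kgram_feature_count(max_k):
    """Total number of k-gram features for k = 1..max_k, at 4^k * 3 * 2 per k."""
    return sum(4 ** k * 3 * 2 for k in range(1, max_k + 1))


def separation_score(class_a, class_b):
    """t-statistic-like score: large when class means are far apart (high
    inter-class distance) and values within each class are tight (low
    intra-class distance). Each class needs at least two values."""
    mean_gap = abs(statistics.mean(class_a) - statistics.mean(class_b))
    spread = math.sqrt(statistics.variance(class_a) / len(class_a)
                       + statistics.variance(class_b) / len(class_b))
    if spread == 0:
        return math.inf if mean_gap > 0 else 0.0
    return mean_gap / spread


total = kgram_feature_count(5)
# Toy signal values (hypothetical) for true TIS vs non-TIS candidates.
score = separation_score([1, 2, 3], [7, 8, 9])
```

Ranking all signals by such a score and keeping the top few is one simple way to cut thousands of candidate features down to a size most learners can handle.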
Sample k-grams Selected by CFS
• Position –3 (Kozak consensus)
• in-frame upstream ATG (leaky scanning)
• in-frame downstream
– TAA, TAG, TGA (stop codon)
– CTG, GAC, GAG, and GCC (codon bias?)
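A rough sketch of how three of the selected signals could be computed for one candidate ATG is shown below. The function names, the feature dictionary keys, and the toy sequence are assumptions for illustration; they are not taken from the original system.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def candidate_atgs(seq):
    """0-based start positions of every ATG in the sequence."""
    return [i for i in range(len(seq) - 2) if seq[i:i + 3] == "ATG"]


def tis_features(seq, i):
    """Signals for the candidate ATG starting at 0-based index i."""
    # Codons in the same reading frame as the candidate, before and after it.
    upstream = [seq[j:j + 3] for j in range(i - 3, -1, -3)]
    downstream = [seq[j:j + 3] for j in range(i + 3, len(seq) - 2, 3)]
    return {
        # Position -3 counts back from the A of the ATG (there is no position 0).
        "base_at_minus_3": seq[i - 3] if i >= 3 else None,
        "inframe_upstream_atg": "ATG" in upstream,
        "inframe_downstream_stop": any(c in STOP_CODONS for c in downstream),
    }


toy = "GCCATGGCTGCCATGGAATAAGC"  # hypothetical sequence, not real data
starts = candidate_atgs(toy)
features = tis_features(toy, starts[1])
```

Each candidate ATG then becomes one feature vector, which is passed to whichever classifier is used for signal integration.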
Validation Results (on Chr X and Chr 21)
• Using top 100 features selected by entropy and trained on Pedersen & Nielsen’s
[Figure: plot comparing ATGpr with our method]
Example #3: Protein Function Prediction from Protein Interactions
Level-1 neighbour
Level-2 neighbour
An Illustrative Case of Indirect Functional Association?
• Is indirect functional association plausible?
• Is it found often in real interaction data?
• Can it be used to improve protein function prediction from protein interaction data?
[Figure: SH3 proteins and SH3-binding proteins]
YBR055C|11.4.3.1
YDR158W|1.1.6.5|1.1.9
YJR091C|1.3.16.1|16.3.3
YMR101C|42.1
YPL149W|14.4|20.9.13|42.25|14.7.11
YPL088W|2.16|1.1.9
YMR300C|1.3.1
YBL072C|12.1.1
YOR312C|12.1.1
YBL061C|1.5.4|10.3.3|18.2.1.1|32.1.3|42.1|43.1.3.5|1.5.1.3.2
YBR023C|10.3.3|32.1.3|34.11.3.7|42.1|43.1.3.5|43.1.3.9|1.5.1.3.2
YKL006W|12.1.1|16.3.3
YPL193W|12.1.1
YAL012W|1.1.6.5|1.1.9
YBR293W|16.19.3|42.25|1.1.3|1.1.9
YLR330W|1.5.4|34.11.3.7|41.1.1|43.1.3.5|43.1.3.9
YLR140W
YDL081C|12.1.1
YDR091C|1.4.1|12.1.1|12.4.1|16.19.3
YPL013C|12.1.1|42.16
YMR047C|11.4.2|14.4|16.7|20.1.10|20.1.21|20.9.1
Freq of Indirect Functional Association
• 59.2% proteins in dataset share some function with level-1 neighbours
• 27.9% share some function with level-2 neighbours but share no function with level-1 neighbours
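The two percentages above can be computed from an interaction graph roughly as follows. This is a sketch under stated assumptions: level-2 neighbours are taken to be the partners of a protein's partners (so a protein can be in both L1 and L2), "sharing a function" is simplified to an exact match of annotation codes, and the protein names and graph are hypothetical.

```python
def neighbour_levels(graph, protein):
    """Return (level-1, level-2) neighbour sets; the sets may overlap."""
    level1 = set(graph.get(protein, ())) - {protein}
    level2 = set()
    for partner in level1:
        level2 |= set(graph.get(partner, ()))
    level2.discard(protein)
    return level1, level2


def shares_function(annotations, protein, others):
    own = annotations.get(protein, set())
    return any(own & annotations.get(o, set()) for o in others)


def association_fractions(graph, annotations):
    """Fractions of proteins sharing a function with (a) some level-1 neighbour,
    (b) some level-2 neighbour but no level-1 neighbour."""
    direct = indirect_only = 0
    for p in graph:
        l1, l2 = neighbour_levels(graph, p)
        if shares_function(annotations, p, l1):
            direct += 1
        elif shares_function(annotations, p, l2):
            indirect_only += 1
    return direct / len(graph), indirect_only / len(graph)


# Toy chain P1 - P2 - P3 (hypothetical): P1 and P3 share a function, P2 does not.
graph = {"P1": {"P2"}, "P2": {"P1", "P3"}, "P3": {"P2"}}
annotations = {"P1": {"12.1.1"}, "P2": {"42.1"}, "P3": {"12.1.1"}}
direct, indirect_only = association_fractions(graph, annotations)
```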
Over-Rep of Functions in L1 & L2 Neighbours
[Figure: “Fraction of Neighbours with Functional Similarity”; x-axis: Similarity (≥0.1 to ≥0.9); y-axis: Fraction (0 to 0.6); series: L1 - L2, L2 - L1, L1 ∩ L2, L3 - (L1 ∪ L2), All Proteins]
[Figure: “Sensitivity vs Precision”; x-axis: Precision (0 to 1); y-axis: Sensitivity (0 to 1); series: L1 - L2, L2 - L1, L1 ∩ L2]
Performance Evaluation
• Prediction performance improves after incorporation of L1, L2, & interaction reliability info
[Figure: “Informative FCs”; x-axis: Precision (0 to 1); y-axis: Sensitivity (0 to 1); series: NC, Chi², PRODISTIN, Weighted Avg, Weighted Avg R]
What Have We Learned?
Some of those “techniques” frequently needed in analysis of biomedical data are insufficiently studied by current data mining researchers
• Recognizing what samples are relevant and what are not
• Recognizing what features are relevant and what are not & handling missing or incorrect values
• Recognizing trends, changes, and their causes
Action #1: Going Beyond Frequent Patterns to Recognize What Features Are Relevant and What Are Not
Going Beyond Frequent Patterns
• Statisticians use a battery of “interestingness” measures to decide if a feature/factor is relevant
• Examples:
– Odds ratio
– Relative risk
– Gini index
– Yule’s Q & Y
– etc
• Odds ratio

$$\mathrm{OR}(P,D) = \frac{P_{D,ed} \,/\, P_{D,\bar{e}d}}{P_{D,e\bar{d}} \,/\, P_{D,\bar{e}\bar{d}}}$$
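As a minimal sketch, the odds ratio of a pattern can be computed from a 2×2 contingency table of counts: records with or without the pattern ("exposed" or not), in the positive or negative class. The reading of the four counts, the function name, and the toy data below are assumptions made for illustration.

```python
import math


def odds_ratio(transactions, labels, pattern):
    """Odds ratio of `pattern` over labeled transactions (label True = positive)."""
    pattern = set(pattern)
    ed = e_nd = ne_d = ne_nd = 0  # exposed/positive, exposed/negative, ...
    for items, positive in zip(transactions, labels):
        exposed = pattern <= set(items)
        if exposed and positive:
            ed += 1
        elif exposed:
            e_nd += 1
        elif positive:
            ne_d += 1
        else:
            ne_nd += 1
    numerator = ed * ne_nd
    denominator = e_nd * ne_d
    if denominator == 0:
        # Convention assumed here: infinite odds ratio when a cell is empty.
        return math.inf if numerator > 0 else math.nan
    return numerator / denominator


# Toy data (hypothetical): five transactions, the first two in the positive class.
transactions = [{"A", "B"}, {"A"}, {"B"}, {"C"}, {"A", "C"}]
labels = [True, True, False, False, False]
or_a = odds_ratio(transactions, labels, {"A"})
or_b = odds_ratio(transactions, labels, {"B"})
```

Here {A} gets an infinite odds ratio because no positive record lacks A, while {B} gets (1 × 2) / (1 × 1) = 2.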
Challenge: Frequent Pattern Mining Relies on Convexity for Efficiency, But …
• Proposition:
Let $S^{OR}_{k}(ms,D) = \{ P \in F(ms,D) \mid \mathrm{OR}(P,D) \ge k \}$. Then $S^{OR}_{k}(ms,D)$ is not convex
• i.e., the space of odds ratio patterns is not convex. Ditto for many other types of patterns
[Figure: OR search space, with example odds ratios {A}: ∞, {A,B}: 1, {A,B,C}: 3]
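The non-convexity can be seen on the three example values from the slide's figure ({A}: ∞, {A,B}: 1, {A,B,C}: 3). The sketch below checks the usual definition of convexity (if X ⊆ Y ⊆ Z and X, Z are selected, then Y must be too). The threshold k = 2 and the restriction to these three patterns are assumptions made for illustration.

```python
import math

# Odds ratios as shown in the slide's search-space figure.
or_values = {
    frozenset("A"): math.inf,
    frozenset("AB"): 1.0,
    frozenset("ABC"): 3.0,
}


def is_convex(selected, universe):
    """True if every pattern lying between two selected patterns is also selected."""
    for x in selected:
        for z in selected:
            if x <= z:
                for y in universe:
                    if x <= y <= z and y not in selected:
                        return False
    return True


def patterns_with_or_at_least(k):
    return {p for p, value in or_values.items() if value >= k}


universe = set(or_values)
convex_at_2 = is_convex(patterns_with_or_at_least(2), universe)  # {A,B} is missing
convex_at_1 = is_convex(patterns_with_or_at_least(1), universe)  # all three kept
```

With k = 2, {A} and {A,B,C} qualify but {A,B}, which sits between them, does not, so a border-based representation of the result is impossible.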
Solution: Luckily They Become Convex When Decomposed Into Plateaus
• Theorem:
Let $S^{OR}_{n,k}(ms,D) = \{ P \in F(ms,D) \mid P_{D,ed} = n,\ \mathrm{OR}(P,D) \ge k \}$. Then $S^{OR}_{n,k}(ms,D)$ is convex
The space of odds ratio patterns becomes convex when stratified into plateaus based on support levels on positive (or negative) dataset
• Proposition:
Let $Q \in [P]_D$; then $\mathrm{OR}(Q,D) = \mathrm{OR}(P,D)$
The plateau space can be further divided into convex equivalence classes on the whole dataset
The space of equivalence classes can be concisely represented by generators and closed patterns
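To make the last point concrete, the sketch below groups patterns into equivalence classes (patterns occurring in exactly the same transactions) and extracts each class's generators (minimal patterns) and closed pattern (the maximal one). It is a brute-force enumeration for tiny inputs, written only for illustration; it is not GC-growth, and the function names and toy data are assumptions.

```python
from itertools import combinations


def equivalence_classes(transactions, min_support=1):
    """Map each set of supporting transaction ids to the patterns that share it."""
    items = sorted(set().union(*transactions))
    classes = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            pattern = frozenset(combo)
            tids = frozenset(i for i, t in enumerate(transactions) if pattern <= t)
            if len(tids) >= min_support:
                classes.setdefault(tids, []).append(pattern)
    return classes


def borders(patterns):
    """Return (generators, closed pattern) of one equivalence class."""
    closed = max(patterns, key=len)  # the closed pattern of a class is unique
    generators = [p for p in patterns if not any(q < p for q in patterns)]
    return generators, closed


# Toy data (hypothetical).
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}]
classes = equivalence_classes(transactions)
generators, closed = borders(classes[frozenset({0, 1})])
```

Because all patterns in a class occur in the same transactions, they have the same contingency counts and hence the same odds ratio, which is why one generators-plus-closed-pattern pair is enough to describe the whole class.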
Performance
• Mining odds ratio and relative risk patterns depends on GC-growth
• GC-growth mines both generators and closed patterns
• It is comparable in speed to the fastest algorithms that mine only closed patterns
Action #2: Tipping Factors---The Small Changes With Large Impact
Tipping Events
• Given a data set, such as those related to human health, it is interesting to determine the important cohorts and the important factors causing transitions between cohorts
Tipping events
Tipping factors are “action items” for causing transitions
• “Tipping event” is two or more population cohorts that are significantly different from each other
• “Tipping factors” (TF) are small patterns whose presence or absence causes significant difference in population cohorts
• “Tipping base” (TB) is the pattern shared by the cohorts in a tipping event
• “Tipping point” (TP) is the combination of TB and a TF
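The definitions above can be illustrated with a toy comparison of two cohorts: records matching the tipping base without the tipping factor, and records matching the tipping point (base plus factor). This is only a sketch: the attribute names and data are hypothetical, and the raw difference in outcome rates stands in for whatever significance test the actual work uses.

```python
def outcome_rate(records, include, exclude=None):
    """Fraction of positive outcomes among records containing `include`
    and not containing all of `exclude` (if given)."""
    outcomes = [outcome for attrs, outcome in records
                if include <= attrs and (exclude is None or not exclude <= attrs)]
    return sum(outcomes) / len(outcomes) if outcomes else None


def tipping_effect(records, tipping_base, tipping_factor):
    tipping_point = tipping_base | tipping_factor  # TP = TB combined with TF
    rate_without_tf = outcome_rate(records, tipping_base, exclude=tipping_factor)
    rate_with_tf = outcome_rate(records, tipping_point)
    return tipping_point, rate_without_tf, rate_with_tf


# Hypothetical records: (attributes, adverse outcome observed?).
records = [
    ({"age>60", "smoker"}, True),
    ({"age>60", "smoker"}, True),
    ({"age>60"}, False),
    ({"age>60"}, False),
    ({"age>60"}, True),
]
tp, without_tf, with_tf = tipping_effect(records, {"age>60"}, {"smoker"})
```

In this toy data the outcome rate jumps from 1/3 to 1 when the single-item factor is added to the base, which is the kind of small change with large impact the definitions aim to capture.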
Impact-To-Cost-Ratio of Tipping Points
Some Simple Results Useful For Constructing TPs
Action #3: Evolution of Pattern Spaces---How Do They Change When the Sample Space Changes?
Impact of Adding New Transactions on Key and Closed Patterns
Impact of Removing Items From All Transactions
Acknowledgements
• DLBC Lymphoma:
– Jinyan Li, Huiqing Liu
• Translation Initiation:
– Fanfan Zeng, Roland Yap
– Huiqing Liu
• Protein Function Prediction:
– Kenny Chua, Ken Sung
• Odds Ratio & Relative Risk:
– Mengling Feng, Yap-Peng Tan
– Haiquan Li, Jinyan Li
• Tipping Points:
– Guimei Liu, Jinyan Li
– Guozhu Dong
• Pattern Space Evolution:
– Mengling Feng, Yap-Peng Tan
– Guozhu Dong
– Jinyan Li