Personalized Medicine: Analytics for Cancer Survival Curves
Ran Qi, Shujia Zhou, Yelena Yesha
June 13, 2013
IAB Meeting Research Report
Introduction: Cancer Staging (1)
• Cancer stage is an anatomic description of the character and extent of cancer spread (usually I to IV)
– Prognostic factors
  • Tumor (T): size, location, local extent
  • Nodes (N): number, location of nodal metastases
  • Metastasis (M): presence of distant organ spread
Lung cancer staging (bin model)
Stage        T        N         M
Stage I      T1       N0        M0
Stage IIA    T1       N1        M0
             T2       N0        M0
Stage IIB    T2       N1        M0
             T3       N0        M0
Stage IIIA   T1, 2    N2        M0
             T3       N1, 2     M0
Stage IIIB   T4       N0, 1, 2  M0
Stage IIIC   Any T    N3        M0
Stage IV     Any T    Any N     M1
[Slide annotation: "bin" labels a single T, N, M combination]
Lung cancer survival curves
A Bin Model
• Breast cancer: 5 T’s, 4 N’s, 2 M’s = 40 bins (5x4x2)
• Adding grades (3 levels): 120 bins (5x4x2x3)
• Adding ER (hormonal status, 2 levels): 240 bins
• Thus, for additional variables, the number of bins that would have to be added to a stage would be enormous, and collapsing into a stage would become impractical.
• “Bin” is also called “combination”.
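The bin counts on this slide are simply the product of the number of levels of each prognostic factor. A minimal sketch of that arithmetic follows; the function name `count_bins` and the dictionary keys are illustrative, not taken from the slides.

```python
from math import prod

def count_bins(levels):
    """Number of bins = product of the number of levels of every factor."""
    return prod(levels.values())

# Level counts quoted on the slide for breast cancer.
factors = {"T": 5, "N": 4, "M": 2}
tnm_bins = count_bins(factors)

factors["grade"] = 3          # adding grade (3 levels)
with_grade = count_bins(factors)

factors["ER"] = 2             # adding ER hormonal status (2 levels)
with_er = count_bins(factors)

print(tnm_bins, with_grade, with_er)
```

Each added factor multiplies the total, which is why the count grows so quickly.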
Problems
• How to combine the growing number of prognostic factors into a small number of stages
– Since the TNM staging system was announced in the 1950’s, many new prognostic factors have been identified.
– By 1995, 76 predictive factors for breast cancer.
– By 2002, 150 factors for lung cancer.
• Different prognostic factors have different levels of impact on the survival curves
Objectives
• Reduce the number of bins by grouping similar patients
• Find the relationship between prognostic factors and survival curves
Approaches
• Grouping cancer patients according to their similarity
• Ensemble Algorithm for Clustering Cancer Data (EACCD)
• Grouping Algorithm for Cancer Data (GACD)
The GACD work flow
[Figure: GACD workflow diagram. Box labels: Cancer Patient Dataset (200,000 patients); Combinations; Initialize groups of patients with cutoff; Partitioning clustering + statistical calculations; Log-rank test; Dissimilarity matrix; Learnt dissimilarity matrix; Hierarchical clustering with dendrogram; New groups of patients; Kaplan-Meier Estimator; Survival curves. Stage labels: Step 1 to Step 4. Annotations: "MCMC: jump over local minimum"; "Weight: increase efficiency".]
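The workflow names the Kaplan-Meier estimator and the log-rank test without giving formulas. As a rough sketch, the textbook versions of both can be written in plain Python as below. Using the log-rank chi-square statistic directly as the pairwise dissimilarity between combinations is an assumption made here for illustration; the slides do not say how the dissimilarity matrix is derived from the test. All function names are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time (events: 1=death, 0=censored)."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        while i < len(data) and data[i][0] == t:   # gather ties at time t
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= removed
    return curve

def logrank_chi2(times_a, events_a, times_b, events_b):
    """Two-sample log-rank chi-square statistic (1 degree of freedom)."""
    pooled = [(t, e, 0) for t, e in zip(times_a, events_a)]
    pooled += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in pooled if e})
    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n_a = sum(1 for u, _, g in pooled if u >= t and g == 0)
        n_b = sum(1 for u, _, g in pooled if u >= t and g == 1)
        d_a = sum(1 for u, e, g in pooled if u == t and e and g == 0)
        d_b = sum(1 for u, e, g in pooled if u == t and e and g == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0
    return (observed_a - expected_a) ** 2 / variance

def dissimilarity_matrix(groups):
    """groups: list of (times, events); entry [i][j] is the log-rank statistic."""
    k = len(groups)
    matrix = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            matrix[i][j] = matrix[j][i] = logrank_chi2(*groups[i], *groups[j])
    return matrix

# Toy data, not from the study: two combinations with different survival.
early = ([1, 2, 3], [1, 1, 1])
late = ([4, 5, 6], [1, 1, 1])
print(kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1]))
print(dissimilarity_matrix([early, late]))
```

A larger statistic means the two survival experiences differ more, so identical groups get a dissimilarity of zero.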
GACD
• Features
– A deterministic grouping method
– Uses weighted dissimilarity to improve the grouping efficiency
– Uses MCMC to avoid local minima
• Results
– Grouping results are sensitive to the partitioning algorithm (e.g., PAM and Fuzzy)
– Grouping results differ between local-minimum and global-minimum partitioning algorithms
– Implemented weighted dissimilarity
Prognostic factors: Size, node, age, race
Number of combinations: 59
Reduce 59 curves to 3
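Reducing many curves to a few groups corresponds to cutting a dendrogram at a chosen number of clusters. The sketch below shows one generic way to do that from a dissimilarity matrix. It assumes average linkage, which the slides do not specify, and it is not the GACD implementation; the function name `agglomerate` and the toy matrix are illustrative.

```python
def agglomerate(dissim, n_groups):
    """Average-linkage agglomerative clustering.

    Returns (clusters, merges): the final clusters as sorted index lists and
    the merge history as (cluster_a, cluster_b, distance) in merge order.
    """
    clusters = [[i] for i in range(len(dissim))]
    merges = []
    while len(clusters) > n_groups:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                dist = sum(dissim[i][j] for i, j in pairs) / len(pairs)
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), dist))
        clusters[a] = clusters[a] + clusters[b]   # merge b into a
        del clusters[b]
    return [sorted(c) for c in clusters], merges

# Toy dissimilarities for four combinations placed at 0, 1, 10, 11 on a line.
points = [0, 1, 10, 11]
toy = [[abs(p - q) for q in points] for p in points]
groups, history = agglomerate(toy, 2)
print(groups)
```

The merge history is kept because the order of merges is what a dendrogram displays.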
Evaluation Metric for Grouping Results
• The area enclosed by two Kaplan-Meier curves
• Linear correlation coefficient between the merging order of the dendrogram and the area between the Kaplan-Meier curves
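Both metrics can be sketched in a few lines. The code below treats each Kaplan-Meier curve as a right-continuous step function given as sorted (time, survival) pairs and integrates the absolute difference up to a chosen horizon. The horizon, the use of the absolute difference, and the function names are assumptions made for illustration; the slides only name the enclosed area and the linear correlation coefficient.

```python
def step_value(curve, t):
    """Value at time t of a step curve that starts at 1.0 (curve sorted by time)."""
    s = 1.0
    for time, surv in curve:
        if time <= t:
            s = surv
        else:
            break
    return s

def area_between(curve_a, curve_b, horizon):
    """Integral of |S_a(t) - S_b(t)| over [0, horizon] for two step curves."""
    knots = sorted(
        {0.0, float(horizon)}
        | {float(t) for t, _ in curve_a if 0 < t < horizon}
        | {float(t) for t, _ in curve_b if 0 < t < horizon}
    )
    area = 0.0
    for left, right in zip(knots, knots[1:]):
        gap = abs(step_value(curve_a, left) - step_value(curve_b, left))
        area += gap * (right - left)      # both curves are constant on [left, right)
    return area

def pearson(x, y):
    """Linear (Pearson) correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Toy curves, not from the study.
curve_a = [(1, 0.5), (3, 0.0)]
curve_b = [(2, 0.5)]
print(area_between(curve_a, curve_b, horizon=4))
```

To apply the second metric, one would pair each merge's position in the dendrogram with the area between the curves of the two merged groups and pass the two lists to `pearson`.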
Conclusion
• The expanded TNM system (e.g., EACCD and GACD) can analyze cancer survival with more prognostic factors.
• GACD improves the efficiency of the grouping algorithm by using weights.
• The area enclosed by two Kaplan-Meier curves appears to be useful for evaluating grouping results.
Acknowledgement
• This project is sponsored by NIST through NSF CHMPR. We would like to thank D. Chen, D. Henson, A. Schwartz, A. Dima, and M. Brady for the helpful discussions.