data and computational challenges in integrative biomedical informatics
Post on 05-Dec-2014
Embed Size (px)
DESCRIPTIONPresentation at the Georgia Tech Center for Data Analytics
- 1. Data and Computational Challenges in Integrative Biomedical InformaticsJoel Saltz MD, PhDChair Department of Biomedical Informatics,Director Center for Comprehensive InformaticsEmory UniversityAdjunct Professor CSE, CSCollege of Computing, Georgia Tech
2. Center for Comprehensive Informatics ANALYTICS INTEGRATIVE DATA 3. Integrative Biomedical Informatics AnalyticsCenter for Comprehensive Informatics Anatomic/functional characterization at fine level (Pathology) and gross level (Radiology) RadiologyImaging High throughput multi- scale image segmentation, featurePatientOmic extraction, analysis of Outcome Data features Integration of anatomic/functionalPathologic Features characterization with multiple types of omic information 4. Integrative Spatio-Temporal Molecular AnalyticsCenter for Comprehensive Informatics Aka Big Data 5. Quantitative Feature Analysis in Pathology:Emory In Silico Center for Brain TumorResearch (PI = Dan Brat, PD= Joel Saltz) 6. Using TCGA Data to Study GlioblastomaDiagnostic ImprovementMolecular ClassificationPredictors of Progression 7. Millions of Nuclei Defined by n Features Top-down analysis: use the featureswith existing diagnostic constructs Bottom-up analysis: let features defineand drive the analysis 8. TCGA Whole Slide ImagesStep 1:Nuclei Identify individual nuclei Segmentation and their boundariesJun Kong 9. Nuclear Analysis WorkflowStep 1: Step 2:NucleiFeature SegmentationExtraction Describe individual nuclei in terms of size,shape, and texture 10. Step 3:NucleiNuclear Qualities Classification 1 10OligodendrogliomaAstrocytoma 11. Comparison of Machine-based Classificationto Human Based ClassificationSeparation of GBM, Oligo1, Oligo2 Separation of GBM, Oligo1 andas Designated byOligo2 as Designated by MachineNeuropathologists 12. Survival AnalysisHumanMachine 13. Gene Expression Correlates of High Oligo-AstroRatio on Machine-based Classification Oligo Related Genes Myelin Basic Protein Proteolipoprotein HoxD1 Nuclear features most Associated with Oligo Signature Genes: Circularity (high) Eccentricity (low) 14. Millions of Nuclei Defined by n Features Top-down analysis: analyze features incontext of existing diagnostic constructs Bottom-up analysis: let nuclear featuresdefine and drive the analysis 15. Direct Study of Relationship Between vsCenter for Comprehensive InformaticsLee Cooper,Carlos Moreno 16. Clustering identifies three morphological groupsCenter for Comprehensive Informatics Analyzed 200 million nuclei from 162 TCGA GBMs (462 slides) Named for functions of associated genes: Cell Cycle (CC), Chromatin Modification (CM), Protein Biosynthesis (PB) Prognostically-significant (logrank p=4.5e-4) CC CM PB1 CC10 0.8 CM PB20Feature Indices 0.6Survival30 0.440 0.2500 0 500 1000 1500 2000 2500 3000Days 17. Center for Comprehensive Informatics Associations 18. Center for Comprehensive Informatics ANALYTICS HEALTHCARE DATA 19. Clinical Phenotype Characterization and the Emory Analytic Information WarehouseCenter for Comprehensive Informatics Example Project: Find hot spots in readmissions within 30 days What fraction of patients with a given principal diagnosis will bereadmitted within 30 days? What fraction of patients with a given set of diseases will be readmittedwithin 30 days? How does severity and time course of co-morbidities affectreadmissions? Geographic analyses Compare and contrast with UHC Clinical Data Base Repeat analyses across all UHC hospitals Are we performing the same? How are UHC-curated groupings of patients (e.g., product lines) useful? Need a repeatable process that we can apply identically to both local and UHC data 20. Overall SystemCenter for Comprehensive Informatics MetadataRepository I2b2 Web I2b2 ServerDatabase InvestigatorMetadata Manager Data ModelerDataQuery ProcessingSpecification Data Analyst Investigator DatabaseMapperData Analyst Study-Query tools specificDatabaseSource SourceSource Investigator data datadata 21. 5-year Datasets from Emory and University Healthcare ConsortiumCenter for Comprehensive Informatics EUH, EUHM and WW (inpatient encounters) Removed encounter pairs with chemotherapy and radiation therapy readmit encounters (CDW data) Encounter location (down to unit for Emory) Providers (Emory only) Discharge disposition Primary and secondary ICD9 codes Procedure codes DRGs Medication orders (Emory only) Labs (Emory only) Vitals (Emory only) Geographic information (CDW only + US Census and American Community Survey)Analytic Information 22. Using Emory & UHC Data to Find Associations With 30-day ReadmitsCenter for Comprehensive Informatics Problem: Raw clinical and administrative variables are difficult to use for associative data mining Too many diagnosis codes, procedure codes Continuous variables (e.g., labs) require interpretation Temporal relationships between variables are implicit Solution: Transform the data into a much smaller set of variables using heuristic knowledge Categorize diagnosis and procedure codes using codehierarchies Classify continuous variables using standardinterpretations (e.g., high, normal, low) Identify temporal patterns (e.g., frequency, duration,sequence) Apply standard data mining techniquesAnalytic Information 23. Derived VariablesCenter for Comprehensive Informatics 30-day readmit The 9 Emory Enhanced Risk Assessment Tool diagnosis categories UHC product lines Variables derived from a combination of codes and/or laboratory test results Obesity Diabetes/uncontrolled diabetes End-stage renal disease (ESRD) Pressure ulcer Sickle cell disease/sickle cell crisis Temporal variables derived over multiple encounters Multiple MI Multiple 30-day readmissions Chemotherapy within 180 (or 365) days before surgery Previous encounter within the last 90 (or 180) days 24. 30-Day Readmission Rates for Derived VariablesCenter for Comprehensive Informatics Emory Health Care 25. Geographic Analyses UHC Medicine General Product Line (#15)Center for Comprehensive Informatics Analytic Information Warehouse 26. Predictive Modeling for ReadmissionCenter for Comprehensive Informatics Random forests (ensemble of decision trees) Create a decision tree using a random subset of the variables in the dataset Generate a large number of such trees All trees vote to classify each test example in a training dataset Generate a patient-specific readmission risk for each encounter Rank the encounters by risk for a subsequent 30- day readmission Analytic Information 27. Emory Readmission Rates for High and Low Risk Groups Generated withCenter for Comprehensive Informatics Random Forest 28. Predictive Modeling Applied to 180 UHC Hospitals Readmission fraction of top 10% high risk patientsCenter for Comprehensive Informatics0.90.80.70.60.5 All Hospital ModelIndividual Hospital0.4Model0.30.20.1 0113 17 25 33 41 49 57 65 73 81 89 9716110512112913714515316917718591 29. Status of Healthcare Data AnalyticsCenter for Comprehensive Informatics Integrative dataset analysis can leverage patient information gathered over many encounters Temporal analyses can generate derived variables that appear to correlate with readmissions Predictive modeling has promise of providing decision support Data Analytics arm of the Emory New Care Model Initiative led by Greg Esper Ongoing analyses involve characterization of clinical phenotype in GWAS, biomarker and quality improvement efforts Co-lead (with Bill Hersh) of CTSA CER Informatics taskforce dedicated to this issue 30. Center for Comprehensive Informatics DATA COMPUTING HIGH END AND LARGE 31. Supercomputing Collaboration with ORNL: Titan Peak Speed 30,000,000,000,000,000 floating point operations per second!Center for Comprehensive Informatics 32. Core Transformations for multi-scale pipelinesCenter for Comprehensive Informatics Data Cleaning and Low Level Transformations Data Subsetting, Filtering, Subsampling Spatio-temporal Mapping and Registration Object Segmentation Feature Extraction, Object Classification Spatio-temporal Aggregation Change Detection, Comparison, and Quantification 33. Extreme DataCutter Two Level ModelCenter for Comprehensive Informatics 34. Center for Comprehensive Informatics Node Level Work Scheduling 35. VLDB 2012Center for Comprehensive Informatics Change Detection, Comparison, and Quantification 36. Thanks!