data and computational challenges in integrative biomedical informatics

37
Joel Saltz MD, PhD Chair Department of Biomedical Informatics, Director Center for Comprehensive Informatics Emory University Adjunct Professor CSE, CS College of Computing, Georgia Tech Data and Computational Challenges in Integrative Biomedical Informatics

Upload: joel-saltz

Post on 05-Dec-2014

548 views

Category:

Health & Medicine


0 download

DESCRIPTION

Presentation at the Georgia Tech Center for Data Analytics

TRANSCRIPT

Page 1: Data and Computational Challenges in Integrative Biomedical Informatics

Joel Saltz MD, PhDChair Department of Biomedical Informatics, Director Center for Comprehensive InformaticsEmory UniversityAdjunct Professor CSE, CSCollege of Computing, Georgia Tech

Data and Computational Challenges in Integrative Biomedical Informatics

Page 2: Data and Computational Challenges in Integrative Biomedical Informatics

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

INTEGRATIVE DATA ANALYTICS

Page 3: Data and Computational Challenges in Integrative Biomedical Informatics

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Integrative Biomedical Informatics Analytics• Anatomic/functional

characterization at fine level (Pathology) and gross level (Radiology)

• High throughput multi-scale image segmentation, feature extraction, analysis of features

• Integration of anatomic/functional characterization with multiple types of “omic” information

Radiology

Imaging

Patient Outcome

Pathologic Features

“Omic”

Data

Page 4: Data and Computational Challenges in Integrative Biomedical Informatics

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Integrative Spatio-Temporal Molecular Analytics

• Aka Big Data

Page 5: Data and Computational Challenges in Integrative Biomedical Informatics

Quantitative Feature Analysis in Pathology: Emory In Silico Center for Brain Tumor Research (PI = Dan Brat, PD= Joel Saltz)

Page 6: Data and Computational Challenges in Integrative Biomedical Informatics

Using TCGA Data to Study

Glioblastoma

Diagnostic Improvement

Molecular Classification

Predictors of Progression

Page 7: Data and Computational Challenges in Integrative Biomedical Informatics

Millions of Nuclei Defined by n Features

• Top-down analysis: use the features with existing diagnostic constructs

• Bottom-up analysis: let features define and drive the analysis

Page 8: Data and Computational Challenges in Integrative Biomedical Informatics

TCGA Whole Slide Images

Jun Kong

Step 1:Nuclei

Segmentation

• Identify individual nuclei and their boundaries

Page 9: Data and Computational Challenges in Integrative Biomedical Informatics

Nuclear Analysis Workflow

• Describe individual nuclei in terms of size, shape, and texture

Step 2:Feature

Extraction

Step 1:Nuclei

Segmentation

Page 10: Data and Computational Challenges in Integrative Biomedical Informatics

Oligodendroglioma Astrocytoma

Nuclear Qualities

1 10

Step 3:Nuclei

Classification

Page 11: Data and Computational Challenges in Integrative Biomedical Informatics

Comparison of Machine-based Classification to Human Based Classification

Separation of GBM, Oligo1, Oligo2 as Designated by Neuropathologists

Separation of GBM, Oligo1 and Oligo2 as Designated by Machine

Page 12: Data and Computational Challenges in Integrative Biomedical Informatics

Survival Analysis

Human Machine

Page 13: Data and Computational Challenges in Integrative Biomedical Informatics

Gene Expression Correlates of High Oligo-Astro Ratio on Machine-based Classification

Oligo Related Genes

Myelin Basic ProteinProteolipoproteinHoxD1

Nuclear features mostAssociated with Oligo Signature Genes:

Circularity (high)Eccentricity (low)

Page 14: Data and Computational Challenges in Integrative Biomedical Informatics

Millions of Nuclei Defined by n Features

• Top-down analysis: analyze features in context of existing diagnostic constructs

• Bottom-up analysis: let nuclear features define and drive the analysis

Page 15: Data and Computational Challenges in Integrative Biomedical Informatics

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Direct Study of Relationship Between Image Features vs Clinical Outcome, Response to Treatment, Molecular Information

Lee Cooper,Carlos Moreno

Page 16: Data and Computational Challenges in Integrative Biomedical Informatics

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Clustering identifies three morphological groups• Analyzed 200 million nuclei from 162 TCGA GBMs (462 slides)• Named for functions of associated genes:

Cell Cycle (CC), Chromatin Modification (CM),

Protein Biosynthesis (PB)• Prognostically-significant (logrank p=4.5e-4)

Featu

re I

ndic

es

CC CM PB

10

20

30

40

500 500 1000 1500 2000 2500 3000

0

0.2

0.4

0.6

0.8

1

Days

Sur

viva

l

CC

CM

PB

Page 17: Data and Computational Challenges in Integrative Biomedical Informatics

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Associations

Page 18: Data and Computational Challenges in Integrative Biomedical Informatics

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

HEALTHCARE DATA ANALYTICS

Page 19: Data and Computational Challenges in Integrative Biomedical Informatics

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

• Example Project: Find hot spots in readmissions within 30 days– What fraction of patients with a given principal diagnosis will be

readmitted within 30 days?– What fraction of patients with a given set of diseases will be readmitted

within 30 days?– How does severity and time course of co-morbidities affect readmissions?– Geographic analyses

• Compare and contrast with UHC Clinical Data Base– Repeat analyses across all UHC hospitals– Are we performing the same?– How are UHC-curated groupings of patients (e.g., product lines) useful?

• Need a repeatable process that we can apply identically to both local and UHC data

Clinical Phenotype Characterization and the Emory Analytic Information Warehouse

Page 20: Data and Computational Challenges in Integrative Biomedical Informatics

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Overall System

I2b2 Web Server

I2b2 Database

Source data

Database Mapper

Source data

Source data

Data Processing

Metadata Manager

Metadata Repository

Query Specification

Investigator

Data Analyst

Data Analyst

Data Modeler

Investigator

Query toolsStudy-

specific Database

Investigator

Page 21: Data and Computational Challenges in Integrative Biomedical Informatics

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

5-year Datasets from Emory and University Healthcare Consortium

• EUH, EUHM and WW (inpatient encounters)• Removed encounter pairs with chemotherapy and radiation

therapy readmit encounters (CDW data)

• Encounter location (down to unit for Emory)• Providers (Emory only)• Discharge disposition• Primary and secondary ICD9 codes• Procedure codes• DRGs• Medication orders (Emory only)• Labs (Emory only)• Vitals (Emory only)• Geographic information (CDW only + US Census and American

Community Survey)Analytic Information Warehouse

Page 22: Data and Computational Challenges in Integrative Biomedical Informatics

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Using Emory & UHC Data to Find Associations With 30-day Readmits

• Problem: “Raw” clinical and administrative variables are difficult to use for associative data mining– Too many diagnosis codes, procedure codes– Continuous variables (e.g., labs) require interpretation– Temporal relationships between variables are implicit

• Solution: Transform the data into a much smaller set of variables using heuristic knowledge– Categorize diagnosis and procedure codes using code

hierarchies– Classify continuous variables using standard interpretations

(e.g., high, normal, low)– Identify temporal patterns (e.g., frequency, duration, sequence)– Apply standard data mining techniques

Analytic Information Warehouse

Page 23: Data and Computational Challenges in Integrative Biomedical Informatics

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Derived Variables

• 30-day readmit• The 9 Emory Enhanced Risk Assessment Tool diagnosis categories• UHC product lines• Variables derived from a combination of codes and/or laboratory test results

– Obesity– Diabetes/uncontrolled diabetes– End-stage renal disease (ESRD)– Pressure ulcer– Sickle cell disease/sickle cell crisis

• Temporal variables derived over multiple encounters– Multiple MI– Multiple 30-day readmissions– Chemotherapy within 180 (or 365) days before surgery– Previous encounter within the last 90 (or 180) days

Page 24: Data and Computational Challenges in Integrative Biomedical Informatics

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

30-Day Readmission Rates for Derived VariablesEmory Health Care

Page 25: Data and Computational Challenges in Integrative Biomedical Informatics

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Geographic AnalysesUHC Medicine General Product Line (#15)

Analytic Information Warehouse

Page 26: Data and Computational Challenges in Integrative Biomedical Informatics

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Predictive Modeling for Readmission

• Random forests (ensemble of decision trees)– Create a decision tree using a random subset of the

variables in the dataset– Generate a large number of such trees– All trees vote to classify each test example in a

training dataset– Generate a patient-specific readmission risk for each

encounter

• Rank the encounters by risk for a subsequent 30-day readmission

Analytic Information Warehouse

Page 27: Data and Computational Challenges in Integrative Biomedical Informatics

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Emory Readmission Rates for High and Low Risk Groups Generated with Random Forest

Page 28: Data and Computational Challenges in Integrative Biomedical Informatics

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Predictive Modeling Applied to 180 UHC HospitalsReadmission fraction of top 10% high risk patients

1 12 23 34 45 56 67 78 89 100 111 122 133 144 155 166 1770

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

All Hospital Model

Individual Hospital Model

Page 29: Data and Computational Challenges in Integrative Biomedical Informatics

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Status of Healthcare Data Analytics

• Integrative dataset analysis can leverage patient information gathered over many encounters

• Temporal analyses can generate derived variables that appear to correlate with readmissions

• Predictive modeling has promise of providing decision support

• Data Analytics arm of the Emory New Care Model Initiative led by Greg Esper

• Ongoing analyses involve characterization of clinical phenotype in GWAS, biomarker and quality improvement efforts

• Co-lead (with Bill Hersh) of CTSA CER Informatics taskforce dedicated to this issue

Page 30: Data and Computational Challenges in Integrative Biomedical Informatics

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

HIGH END AND LARGE DATA COMPUTING

Page 31: Data and Computational Challenges in Integrative Biomedical Informatics

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Supercomputing – Collaboration with ORNL: Titan – Peak Speed 30,000,000,000,000,000 floating point operations per second!

Page 32: Data and Computational Challenges in Integrative Biomedical Informatics
Page 33: Data and Computational Challenges in Integrative Biomedical Informatics

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Core Transformations for multi-scale pipelines

• Data Cleaning and Low Level Transformations• Data Subsetting, Filtering, Subsampling• Spatio-temporal Mapping and Registration• Object Segmentation • Feature Extraction, Object Classification• Spatio-temporal Aggregation• Change Detection, Comparison, and Quantification

Page 34: Data and Computational Challenges in Integrative Biomedical Informatics

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Extreme DataCutter – Two Level Model

Coarse Grained Level

Page 35: Data and Computational Challenges in Integrative Biomedical Informatics

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Node Level Work Scheduling

Fine Grained Level

Page 36: Data and Computational Challenges in Integrative Biomedical Informatics

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

VLDB 2012

Change Detection, Comparison, and Quantification

Page 37: Data and Computational Challenges in Integrative Biomedical Informatics

Thanks!