nsf northeast hub big data workshop

33
Building a search engine to find environmental and phenotypic factors associated with disease and health Chirag J Patel Northeast Big Data Innovation Hub Workshop 02/24/17 [email protected] @chiragjp www.chiragjpgroup.org

Upload: chirag-patel

Post on 09-Apr-2017

66 views

Category:

Health & Medicine


0 download

TRANSCRIPT

Page 1: NSF Northeast Hub Big Data Workshop

Building a search engine to find environmental and phenotypic factors

associated with disease and health

Chirag J PatelNortheast Big Data Innovation Hub Workshop

02/24/17

[email protected]@chiragjp

www.chiragjpgroup.org

Page 2: NSF Northeast Hub Big Data Workshop

P = G + EType 2 Diabetes

CancerAlzheimer’s

Gene expression

Phenotype Genome

Variants

Environment

Infectious agentsNutrientsPollutants

Drugs

Page 3: NSF Northeast Hub Big Data Workshop

We are great at G investigation!

over 2400 Genome-wide Association Studies (GWAS)

https://www.ebi.ac.uk/gwas/

G

Page 4: NSF Northeast Hub Big Data Workshop

Nothing comparable to elucidate E influence!

E: ???We lack high-throughput methods and data to discover new E in P…

until now!

Page 5: NSF Northeast Hub Big Data Workshop

σ2Gσ2P H2 =

Heritability (H2) is the range of phenotypic variability attributed to genetic variability in a

population

Indicator of the proportion of phenotypic differences attributed to G.

Page 6: NSF Northeast Hub Big Data Workshop

Eye colorHair curliness

Type-1 diabetesHeight

SchizophreniaEpilepsy

Graves' diseaseCeliac disease

Polycystic ovary syndromeAttention deficit hyperactivity disorder

Bipolar disorderObesity

Alzheimer's diseaseAnorexia nervosa

PsoriasisBone mineral density

Menarche, age atNicotine dependence

Sexual orientationAlcoholism

LupusRheumatoid arthritis

Crohn's diseaseMigraine

Thyroid cancerAutism

Blood pressure, diastolicBody mass index

DepressionCoronary artery disease

InsomniaMenopause, age at

Heart diseaseProstate cancer

QT intervalBreast cancer

Ovarian cancerHangoverStrokeAsthma

Blood pressure, systolicHypertensionOsteoarthritis

Parkinson's diseaseLongevity

Type-2 diabetesGallstone diseaseTesticular cancer

Cervical cancerSciatica

Bladder cancerColon cancerLung cancerLeukemia

Stomach cancer

0 25 50 75 100Heritability: Var(G)/Var(Phenotype) Source: SNPedia.com

G estimates for burdensome diseases are low and variable: massive opportunity for E discovery

Type 2 Diabetes

Heart Disease

Asthma

Page 7: NSF Northeast Hub Big Data Workshop

Eye colorHair curliness

Type-1 diabetesHeight

SchizophreniaEpilepsy

Graves' diseaseCeliac disease

Polycystic ovary syndromeAttention deficit hyperactivity disorder

Bipolar disorderObesity

Alzheimer's diseaseAnorexia nervosa

PsoriasisBone mineral density

Menarche, age atNicotine dependence

Sexual orientationAlcoholism

LupusRheumatoid arthritis

Crohn's diseaseMigraine

Thyroid cancerAutism

Blood pressure, diastolicBody mass index

DepressionCoronary artery disease

InsomniaMenopause, age at

Heart diseaseProstate cancer

QT intervalBreast cancer

Ovarian cancerHangoverStrokeAsthma

Blood pressure, systolicHypertensionOsteoarthritis

Parkinson's diseaseLongevity

Type-2 diabetesGallstone diseaseTesticular cancer

Cervical cancerSciatica

Bladder cancerColon cancerLung cancerLeukemia

Stomach cancer

0 25 50 75 100Heritability: Var(G)/Var(Phenotype) Source: SNPedia.com

G estimates for complex traits are low and variable: massive opportunity for high-throughput E discovery

σ2E : Exposome!

Page 8: NSF Northeast Hub Big Data Workshop

Enhance accessibility of clinical and environmental data, and analytic artificial intelligence tools!

How can we drive discovery of environmental factors (E) in disease phenotypes (P)?

Page 9: NSF Northeast Hub Big Data Workshop

Enhance accessibility of large open data and tools to drive discovery of

environmental factors (E) in disease phenotypes (P)

Greg Cooper, MD, PhD Pittsburgh

Vasant Honavar, PhD Penn State

Noémie Elhadad, PhD Columbia

Chirag Patel, PhD Harvard

Page 10: NSF Northeast Hub Big Data Workshop

Where do we get disease (P) data?

Page 11: NSF Northeast Hub Big Data Workshop

wearable.com

Page 12: NSF Northeast Hub Big Data Workshop

Where do we get disease P data?Health record data from your doctor!

• Longitudinal data on millions of patients

• diagnoses, prescriptions, lab reports, notes

• Sitting there in institutional IT infrastructure

• OHDSI provides a unified model to access data across institutions, enhancing the scientific process!

Page 13: NSF Northeast Hub Big Data Workshop

Noémie Elhadad, PhD Columbia

Page 14: NSF Northeast Hub Big Data Workshop

Capitalize on digitalized health record data (from around the world)!

High-powered dataset(s) for discovery.

Page 15: NSF Northeast Hub Big Data Workshop

Where do we get environmental (E) data?

Page 16: NSF Northeast Hub Big Data Workshop

Examples of sources of disparate external exposome datasets available in the Exposome Data Warehouse

• Geological

• NASA - Cloud and Atmosphere Profiles

• NOAA Climate Data

• Pollution

• EPA Air Quality Surveillance Data Mart, or AirData,

• Socio-Economic

• US Census American Community Survey (ACS)

• Epidemiological

• CDC Wonder, USDA Food Atlas

Page 17: NSF Northeast Hub Big Data Workshop

A key challenge: mashing up Exposome Data Warehouse with patient

data from OHDSI

f(location, time)PM2.5income

Pollen count

EPA AirData American CommunitySurvey

NOAA Climate

home zipcodeencounter time

Page 18: NSF Northeast Hub Big Data Workshop

0.50

gini index

0.20

0.1

hh income (K)

100

50

3030

15

Wind

0

individualn

..

..

..

..

3/5/1998yes

10

E?age

no

95376

20

sex

M

Temp

individual2

PM2.5

individual1

75 55

02215

1/1/2016 23

Pb

individual3

21

70

F

M

zip

35

Time(E)

2312/11/2015

0

yes

50

302124

OHDSI EPA NOAA Census

milli

ons

of p

atie

nts

Page 19: NSF Northeast Hub Big Data Workshop

Will it work? yep!

Page 20: NSF Northeast Hub Big Data Workshop

Does temperature (and weather) influence asthma-related pediatric ER visits?

• Children <= 17 y/o with >=1 ICD9 code corresponding to 493.*

• N=56K, >84K ER visits

• Weather station data

• (daily temperature, wind, humidity)

• Case-crossover design (only investigated cases)

Yeran Li, PhD (MS, HSPH) Chirag Lakhani, PhD (HMS)

Yun Wang, PhD (Post-doc, HSPH)

Page 21: NSF Northeast Hub Big Data Workshop

Prevalence of asthma attack varies across the US

Page 22: NSF Northeast Hub Big Data Workshop

Temperature(F)20

4060

80

Lag

01

23

45

6

Relative risk

0.951.00

1.05

20 40 60 80

0.9

1.1

1.3

Overall temperature effect

Mean temperature (F)

Rel

ativ

e ris

k

●●

● ● ●

0 1 2 3 4 5 6

0.90

1.00

1.10

Lag effect at T=60F

Lag(day)

Rel

ativ

e R

isk

20 40 60 80

0.90

1.00

1.10

Temperature effect at Lag=0

Mean temperature (F)

Rel

ativ

e R

isk

Does temperature influence asthma ER visits?: yes! Relative risk of asthma attack by mean temperature

Page 23: NSF Northeast Hub Big Data Workshop

Rates of asthma attacks depend on season?: yes!

0.9

1.0

1.1

20 40 60 80

season Fall Spring Summer Winter

Page 24: NSF Northeast Hub Big Data Workshop

Temperature)effects)vary)at)different)regions)(use)NOAA)defini9on,)ER)kids))

h@ps://www.ncdc.noaa.gov/monitoringEreferences/maps/usEclimateEregions.php)

sout

h

north

east

sout

heas

t

cent

ral

sout

hwes

t

wn_

cent

ral

en_c

entra

l

west

north

west

Region counts by NOAA

Num

of o

bs

020

000

6000

0

51052

78126

57238

68719

15256 18050

44551

22923

13895

south northeast southeast central southwest wn_central en_central west northwest

0.5

1.0

1.5

2.0

20 40 60 80 20 40 60 80 20 40 60 80 20 40 60 80 20 40 60 80 20 40 60 80 20 40 60 80 20 40 60 80 20 40 60 80Temperature

Risk

weather effect: different weather zones by NOAA

Rates of asthma attacks dependent on region?: yes!

Page 25: NSF Northeast Hub Big Data Workshop
Page 26: NSF Northeast Hub Big Data Workshop

Does temperature (and weather) influence asthma-related ER visits in kids?: the tip of the iceberg!

Temperature(F)20

4060

80

Lag

01

23

45

6

Relative risk

0.951.00

1.05

20 40 60 80

0.9

1.1

1.3

Overall temperature effect

Mean temperature (F)

Rel

ativ

e ris

k

●●

● ● ●

0 1 2 3 4 5 6

0.90

1.00

1.10

Lag effect at T=60F

Lag(day)

Rel

ativ

e R

isk

20 40 60 80

0.90

1.00

1.10

Temperature effect at Lag=0

Mean temperature (F)

Rel

ativ

e R

isk

What other scientific questions? •what is influence of pollen?•what is the influence of air pollution?•what about adults?

Can we replicate the analysis? •different populations•using different data•with different analysts

Page 27: NSF Northeast Hub Big Data Workshop

ExposomeDBEnvironmental Protection Agency

AirData

US CensusSocioeconomic data

National Oceanic and Atmospheric Administration (NOAA)Climate Data

Observational Health Data Sciences and Informatics

(OHDSI)

Harvard Medical School Columbia University P&S

Geotemporal database Network of observational data

Causal Analytics

A2

A3

A1

AX Aim X

Analytics Workshop

Hands-on workshops for leveraging work in

Aims 1-3

Training Resources

Jupyter notebooksStudent internships at Pitt,

Harvard, PSU, and Columbia P&S

OHDSI Node

A4

University of Pittsburgh Penn State University

Integrating the ExposomeData Warehouse with OHDSI and causal modeling tools to drive and demonstrate

discovery.

Page 28: NSF Northeast Hub Big Data Workshop

Many hypotheses are possible to address: useful for public health & planning!

What is the effect of air pollution levels in disease?

Do adverse weather conditions influence hospital use?

What pharmaceutical drugs lead to adverse health outcomes?

How does socioeconomic context influence hospital use, disease rates, and recovery?

Page 29: NSF Northeast Hub Big Data Workshop

We will harness tools in machine learning extract signal from noise!

Greg Cooper, MD, PhD Pittsburgh

Vasant Honavar, PhD Penn State

SES

PM2.5

asthma

oF

bayesian networks

socio-economic

case-crossoverweather

air pollution

Sci Rep 2016

Page 30: NSF Northeast Hub Big Data Workshop

http://www.ccd.pitt.edu/tools/

Page 31: NSF Northeast Hub Big Data Workshop

Integrating the ExposomeDB with OHDSI and analytics to drive and demonstrate discovery by the community!

(trainees especially welcome)

• 2-day hands-on workshop in New York or Boston

• remote “exchange” internship program and 2-week immersion

• dissemination of electronic training resources

Page 32: NSF Northeast Hub Big Data Workshop

Many hypotheses are possible to address: useful for can we build a machine learning predictor to estimate E?

Eye colorHair curliness

Type-1 diabetesHeight

SchizophreniaEpilepsy

Graves' diseaseCeliac disease

Polycystic ovary syndromeAttention deficit hyperactivity disorder

Bipolar disorderObesity

Alzheimer's diseaseAnorexia nervosa

PsoriasisBone mineral density

Menarche, age atNicotine dependence

Sexual orientationAlcoholism

LupusRheumatoid arthritis

Crohn's diseaseMigraine

Thyroid cancerAutism

Blood pressure, diastolicBody mass index

DepressionCoronary artery disease

InsomniaMenopause, age at

Heart diseaseProstate cancer

QT intervalBreast cancer

Ovarian cancerHangoverStrokeAsthma

Blood pressure, systolicHypertensionOsteoarthritis

Parkinson's diseaseLongevity

Type-2 diabetesGallstone diseaseTesticular cancer

Cervical cancerSciatica

Bladder cancerColon cancerLung cancerLeukemia

Stomach cancer

0 25 50 75 100Heritability: Var(G)/Var(Phenotype) Source: SNPedia.com

σ2E : Exposome!

Page 33: NSF Northeast Hub Big Data Workshop

Harvard DBMI Isaac KohaneSusanne ChurchillNathan PalmerSunny AlvearMichal Preminger

Chirag J [email protected]

@chiragjpwww.chiragjpgroup.org

ThanksRagGroup Chirag Lakhani Yeran Li Shreyas BhaveRolando Acosta

Noémie Elhadad (Columbia) Vasant Honavar (PSU)

Greg Cooper (Pitt) George Hripcsak (Columbia)

René Baston Katie Naum

Kathleen McKeown

NIH Common FundBig Data to Knowledge