Machine Learning for Healthcare HST.956, 6.S897 Lecture 4: Risk stratification David Sontag


Post on 29-Oct-2021


Page 1: Machine Learning for Healthcare - GitHub Pages

Machine Learning for Healthcare, HST.956, 6.S897

Lecture 4: Risk stratification

David Sontag

Page 2: Machine Learning for Healthcare - GitHub Pages

Course announcements

• Recitation Friday at 2pm (4-153) – optional
• No class this Tuesday
• Problem set 1 due next Thursday, Feb 21
• Sign up for lecture scribing or MLHC community consulting
• Readings will be posted several days ahead
• All course communication through Piazza

Page 3: Machine Learning for Healthcare - GitHub Pages

Outline for today's class

1. Risk stratification
2. Case study: Early detection of Type 2 diabetes
   – Framing as supervised learning problem
   – Evaluating risk stratification algorithms
3. Discussion with Leonard D'Avolio (Assistant Professor at HMS, CEO @ Cyft)

Page 5: Machine Learning for Healthcare - GitHub Pages

What is risk stratification?

• Separate a patient population into high-risk and low-risk of having an outcome
   – Predicting something in the future
   – Goal is different from diagnosis, with distinct performance metrics
• Coupled with interventions that target high-risk patients
• Goal is typically to reduce cost and improve patient outcomes

Page 6: Machine Learning for Healthcare - GitHub Pages

Examples of risk stratification

Preterm infant's risk of severe morbidity?

(Saria et al., Science Translational Medicine 2010)

Page 7: Machine Learning for Healthcare - GitHub Pages

Examples of risk stratification

Does this patient need to be admitted to the coronary-care unit?

(Pozen et al., NEJM 1984)

Figure source: https://www.drmani.com/heart-attack/

Page 8: Machine Learning for Healthcare - GitHub Pages

Likelihood of hospital readmission?

Figure source: https://www.air.org/project/revolving-door-u-s-hospital-readmissions-diagnosis-and-procedure

Page 10: Machine Learning for Healthcare - GitHub Pages

Old vs. New

• Traditionally, risk stratification was based on simple scores using human-entered data
• Now, based on machine learning on high-dimensional data
   – Fits more easily into workflow
   – Higher accuracy
   – Quicker to derive (can special case)
• But, new dangers introduced with ML approach – to be discussed

Page 11: Machine Learning for Healthcare - GitHub Pages

Example commercial product

Likelihood of COPD-related hospitalizations

Optum Whitepaper, "Predictive analytics: Poised to drive population health"

Page 12: Machine Learning for Healthcare - GitHub Pages

Example commercial product

High-risk diabetes patients missing tests

Patient      # of A1c tests   # of LDL tests   Last A1c   Date of last A1c   Last LDL   Date of last LDL
Patient 1    2                0                9.2        5/3/13             N/A        N/A
Patient 2    2                0                8          1/30/13            N/A        N/A
Patient 3    0                0                N/A        N/A                N/A        N/A
Patient 4    0                2                N/A        N/A                133        8/9/13
Patient 5    0                0                N/A        N/A                N/A        N/A
Patient 6    0                1                N/A        N/A                115        7/16/13
Patient 7    1                0                10.8       9/18/13            N/A        N/A
Patient 8    0                0                N/A        N/A                N/A        N/A
Patient 9    0                0                N/A        N/A                N/A        N/A
Patient 10   0                0                N/A        N/A                N/A        N/A

Optum Whitepaper, "Predictive analytics: Poised to drive population health"

Page 14: Machine Learning for Healthcare - GitHub Pages

Type 2 Diabetes: A major public health challenge

[Figure: panels for 1994, 2000, and 2013, with legend bins <4.5%, 4.5%–5.9%, 6.0%–7.4%, 7.5%–8.9%, >9.0%]

$245 billion: Total costs of diagnosed diabetes in the United States in 2012
$831 billion: Total fiscal year federal budget for healthcare in the United States in 2014

Page 15: Machine Learning for Healthcare - GitHub Pages

Type 2 Diabetes Can Be Prevented*

Requirements for a successful large-scale prevention program:
1. Detect/reach truly at-risk population
2. Improve the interventions
3. Lower the cost of intervention

*Diabetes Prevention Program Research Group. "Reduction in the incidence of type 2 diabetes with lifestyle intervention or metformin." The New England Journal of Medicine 346.6 (2002): 393.

Page 16: Machine Learning for Healthcare - GitHub Pages

Traditional Risk Prediction Models

• Successful examples
   – ARIC
   – KORA
   – FRAMINGHAM
   – AUSDRISC
   – FINDRISC
   – San Antonio Model
• Easy to ask/measure in the office, or for patients to do online
• Simple model: can calculate scores by hand

Page 17: Machine Learning for Healthcare - GitHub Pages

Challenges of Traditional Risk Prediction Models

• A screening step needs to be done for every member in the population
   – Either in the physician's office or as surveys
   – Costly and time-consuming
   – Infeasible for regular screening for millions of individuals
• Models not easy to adapt to multiple surrogates, when a variable is missing
   – Discovery of surrogates not straightforward

Page 18: Machine Learning for Healthcare - GitHub Pages

Population-Level Risk Stratification

• Key idea: Use readily available administrative, utilization, and clinical data
• Machine learning will find surrogates for risk factors that would otherwise be missing
• Perform risk stratification at the population level – millions of patients

[Razavian, Blecker, Schmidt, Smith-McLallen, Nigam, Sontag. Big Data. '16]

Page 19: Machine Learning for Healthcare - GitHub Pages

Health stakeholders

Source for figure: http://www.mahesh-vc.com/blog/understanding-whos-paying-for-what-in-the-healthcare-industry

Page 20: Machine Learning for Healthcare - GitHub Pages

A Data-Driven approach on Longitudinal Data

• Looking at individuals who got diabetes today (compared to those who didn't)
   – Can we infer which variables in their record could have predicted their health outcome?

[Figure: timeline from "A Few Years Ago" to "Today"]

Page 21: Machine Learning for Healthcare - GitHub Pages

Administrative & Clinical Data

Patient:

Eligibility Record:
   – Member ID
   – Age/gender
   – ID of subscriber
   – Company code

Medical Claims:
   – ICD9 diagnosis codes
   – CPT code (procedure)
   – Specialty
   – Location of service
   – Date of Service

Lab Tests:
   – LOINC code (urine or blood test name)
   – Results (actual values)
   – Lab ID
   – Range high/low
   – Date

Medications:
   – NDC code (drug name)
   – Days of supply
   – Quantity
   – Service Provider ID
   – Date of fill

[Figure: the record types above laid out along a time axis]
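One way to hold the four record types in memory is sketched below. The class and field names are hypothetical (they only mirror the labels on the slide), and a few listed fields such as Lab ID, Service Provider ID, subscriber ID, and company code are left out to keep the sketch short.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class MedicalClaim:
    icd9_codes: List[str]          # ICD-9 diagnosis codes on the claim
    cpt_code: str                  # procedure code
    specialty: str
    location_of_service: str
    service_date: date


@dataclass
class LabResult:
    loinc_code: str                # identifies the urine or blood test
    value: float                   # actual result value
    range_low: Optional[float]
    range_high: Optional[float]
    result_date: date


@dataclass
class MedicationFill:
    ndc_code: str                  # identifies the drug
    days_of_supply: int
    quantity: float
    fill_date: date


@dataclass
class PatientRecord:
    member_id: str
    age: int
    gender: str
    claims: List[MedicalClaim] = field(default_factory=list)
    labs: List[LabResult] = field(default_factory=list)
    fills: List[MedicationFill] = field(default_factory=list)

    def event_dates(self) -> List[date]:
        """Dates of all claims, labs, and fills in chronological order."""
        dates = [c.service_date for c in self.claims]
        dates += [r.result_date for r in self.labs]
        dates += [f.fill_date for f in self.fills]
        return sorted(dates)
```

Keeping every event dated makes it straightforward to restrict a patient's record to a given time window later on.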

Page 22: Machine Learning for Healthcare - GitHub Pages

Top diagnosis codes

Out of 135K patients who had laboratory data

Code    Disease                      Count
4011    Benign hypertension          447017
2724    Hyperlipidemia NEC/NOS       382030
4019    Hypertension NOS             372477
25000   DMII wo cmp nt st uncntr     339522
2720    Pure hypercholesterolem      232671
2722    Mixed hyperlipidemia         180015
V7231   Routine gyn examination      178709
2449    Hypothyroidism NOS           169829
78079   Malaise and fatigue NEC      149797
V0481   Vaccin for influenza         147858
7242    Lumbago                      137345
V7612   Screen mammogram NEC         129445
V700    Routine medical exam         127848

Code    Disease                      Count
53081   Esophageal reflux            121064
42731   Atrial fibrillation          113798
7295    Pain in limb                 112449
41401   Crnry athrscl natve vssl     104478
2859    Anemia NOS                   103351
78650   Chest pain NOS               91999
5990    Urin tract infection NOS     87982
V5869   Long-term use meds NEC       85544
496     Chr airway obstruct NEC      78585
4779    Allergic rhinitis NOS        77963
41400   Cor ath unsp vsl ntv/gft     75519

Code    Disease                      Count
71947   Joint pain-ankle             28648
3004    Dysthymic disorder           28530
2689    Vitamin D deficiency NOS     28455
V7281   Preop cardiovsclr exam       27897
7243    Sciatica                     27604
78791   Diarrhea                     27424
V221    Supervis oth normal preg     27320
36501   Opn angl brderln lo risk     26033
37921   Vitreous degeneration        25592
4241    Aortic valve disorder        25425
61610   Vaginitis NOS                24736
70219   Other sborheic keratosis     24453
3804    Impacted cerumen             24046

Page 23: Machine Learning for Healthcare - GitHub Pages

Top lab test results

Count of people who have the test result (ever)

LOINC code   Lab test                                          Count
2160-0       Creatinine                                        1284737
3094-0       Urea nitrogen                                     1282344
2823-3       Potassium                                         1280812
2345-7       Glucose                                           1299897
1742-6       Alanine aminotransferase                          1187809
1920-8       Aspartate aminotransferase                        1187965
2885-2       Protein                                           1277338
1751-7       Albumin                                           1274166
2093-3       Cholesterol                                       1268269
2571-8       Triglyceride                                      1257751
13457-7      Cholesterol.in LDL                                1241208
17861-6      Calcium                                           1165370
2951-2       Sodium                                            1167675

LOINC code   Lab test                                          Count
2085-9       Cholesterol.in HDL                                1155666
718-7        Hemoglobin                                        1152726
4544-3       Hematocrit                                        1147893
9830-1       Cholesterol.total/Cholesterol.in HDL              1037730
33914-3      Glomerular filtration rate/1.73 sq M.predicted    561309
785-6        Erythrocyte mean corpuscular hemoglobin           1070832
6690-2       Leukocytes                                        1062980
789-8        Erythrocytes                                      1062445
787-2        Erythrocyte mean corpuscular volume               1063665

LOINC code   Lab test                                          Count
770-8        Neutrophils/100 leukocytes                        952089
731-0        Lymphocytes                                       943918
704-7        Basophils                                         863448
711-2        Eosinophils                                       935710
5905-5       Monocytes/100 leukocytes                          943764
706-2        Basophils/100 leukocytes                          863435
751-8        Neutrophils                                       943232
742-7        Monocytes                                         942978
713-8        Eosinophils/100 leukocytes                        933929
3016-3       Thyrotropin                                       891807
4548-4       Hemoglobin A1c/Hemoglobin.total                   527062

Page 25: Machine Learning for Healthcare - GitHub Pages

Framing for supervised machine learning

[Figure: three timelines spanning 2009–2013, each showing a feature construction period and a prediction window; the prediction windows are 2011–2013, 2010–2012, and 2009–2011]

Gap is important to prevent label leakage
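The role of the gap can be made concrete with a small labeling sketch. The function names, the day-based gap and window lengths, and the choice to drop patients whose onset falls before the prediction window are assumptions made for illustration; the slide itself only shows the timeline.

```python
from datetime import date, timedelta
from typing import Optional


def usable_for_features(event_date: date, feature_start: date, feature_end: date) -> bool:
    """Only events inside the feature construction period may feed the model."""
    return feature_start <= event_date < feature_end


def label_patient(onset: Optional[date], feature_end: date,
                  gap_days: int, window_days: int) -> Optional[int]:
    """Return 1 or 0 for the prediction window, or None to drop the patient.

    The prediction window starts gap_days after feature construction ends,
    so data recorded just before an onset cannot leak into the features.
    """
    window_start = feature_end + timedelta(days=gap_days)
    window_end = window_start + timedelta(days=window_days)
    if onset is None:
        return 0                      # no onset observed
    if onset < window_start:
        return None                   # assumption: onset before the window is excluded
    return 1 if onset < window_end else 0
```

With `feature_end = date(2011, 1, 1)`, a 365-day gap, and a 730-day window, an onset in mid-2012 is labeled positive, while an onset in mid-2011 (inside the gap) is dropped rather than labeled.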

Page 26: Machine Learning for Healthcare - GitHub Pages

Framing for supervised machine learning

Problem: Data is censored!
• Patients change health insurers frequently, but data doesn't follow them
• Left censored: may not have enough data to derive features
• Right censored: may not know label

[Figure: timeline spanning 2009–2013 with a feature construction period and a prediction window 2009–2011]

Page 27: Machine Learning for Healthcare - GitHub Pages

Reduction to binary classification

Exclude patients that are left- and right-censored.

[Figure: time axis marked at T and T+W, with diabetes onset marked, divided into three periods:
   – Data collection period: patient variables built from data in this period
   – Gap period between data collection and outcome evaluation
   – Outcome evaluation period: patient outcome evaluated in this period
 Patient markers as extracted: Patient A +, Patient B -, Patient C *, Patient D -, Patient E *, Patient F *, Patient G *]

This is an example of alignment by absolute time
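A minimal sketch of the exclusion rule follows, assuming each patient has a single enrollment interval and an optional onset date. The `Member` class, the function name, and the decision to drop patients whose onset precedes the evaluation period are hypothetical choices, not something the slide specifies.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Iterable, Optional


@dataclass
class Member:
    patient_id: str
    enrolled_from: date
    enrolled_to: date
    onset: Optional[date] = None     # date of diabetes onset, if observed


def build_cohort(members: Iterable[Member], collect_start: date,
                 eval_start: date, eval_end: date) -> Dict[str, int]:
    """Map patient_id to a 0/1 label, skipping censored patients."""
    labels: Dict[str, int] = {}
    for m in members:
        if m.enrolled_from > collect_start:
            continue                 # left-censored: not enough data for features
        if m.onset is not None and m.onset < eval_start:
            continue                 # assumption: onset before evaluation is excluded
        if m.onset is not None and m.onset < eval_end:
            labels[m.patient_id] = 1 # onset observed inside the evaluation period
        elif m.enrolled_to >= eval_end:
            labels[m.patient_id] = 0 # followed to the end with no onset in the period
        # otherwise right-censored: the label is unknown, so the patient is skipped
    return labels
```

Note that a patient who leaves the insurer after an observed onset still gets a positive label; only patients who leave with the outcome still unknown are treated as right-censored here.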

Page 28: Machine Learning for Healthcare - GitHub Pages

Alternative framings

• Align by relative time, e.g.
   – 2 hours into patient stay in ER
   – Every time patient sees PCP
   – When individual turns 40 yrs old
• Align by data availability

NOTE:
• If there are multiple data points per patient, make sure each patient appears in only one of train, validate, or test
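The note about keeping each patient in a single split can be sketched as follows: the split is drawn over unique patient IDs, and every data point then inherits the split of its patient. The function name, the 70/15/15 fractions, and the fixed seed are illustrative assumptions.

```python
import random
from typing import Dict, Iterable, Tuple


def split_by_patient(patient_ids: Iterable[str],
                     fractions: Tuple[float, float, float] = (0.7, 0.15, 0.15),
                     seed: int = 0) -> Dict[str, str]:
    """Assign each unique patient to exactly one of train / validate / test."""
    unique = sorted(set(patient_ids))       # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(unique)
    n_train = int(fractions[0] * len(unique))
    n_val = int(fractions[1] * len(unique))
    assignment: Dict[str, str] = {}
    for i, pid in enumerate(unique):
        if i < n_train:
            assignment[pid] = "train"
        elif i < n_train + n_val:
            assignment[pid] = "validate"
        else:
            assignment[pid] = "test"
    return assignment


# Usage: rows are (patient_id, features...) tuples; look up the split per row.
# rows_train = [r for r in rows if assignment[r[0]] == "train"]
```

Splitting rows at random instead would let the same patient appear in both train and test, which inflates the measured performance.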

Page 29: Machine Learning for Healthcare - GitHub Pages

Methods

• L1 Regularized Logistic Regression
   – Simultaneously optimizes predictive performance and
   – Performs feature selection, choosing the subset of the features that are most predictive
• This prevents overfitting to the training data
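To make the feature-selection effect visible, here is a small pure-Python sketch of L1-regularized logistic regression fitted by proximal gradient descent (soft-thresholding after each gradient step). This is one standard way to fit such a model, not necessarily the solver used in the case study; the function names, the step size, the use of the mean (rather than summed) loss, and the unpenalized intercept are all assumptions.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(v: float, t: float) -> float:
    """Shrink v towards 0 by t, returning exactly 0 inside [-t, t]."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def fit_l1_logreg(X: Sequence[Sequence[float]], y: Sequence[int], lam: float,
                  step: float = 0.5, iters: int = 500) -> Tuple[List[float], float]:
    """Minimize mean log-loss + lam * ||w||_1 (intercept b is not penalized)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(iters):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j] / n
            grad_b += err / n
        # Gradient step on the loss, then the L1 proximal step (soft-thresholding).
        w = [soft_threshold(wj - step * gj, step * lam) for wj, gj in zip(w, grad_w)]
        b -= step * grad_b
    return w, b
```

On data where only some features carry signal, the uninformative weights come out as exact zeros rather than small numbers, which is the sense in which the model selects features.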

Page 30: Machine Learning for Healthcare - GitHub Pages

L1 regularization

• Penalizing the L1 norm of the weight vector leads to sparse (read: many 0's) solutions for w.
• Why?

$$\min_w \sum_i \ell(x_i, y_i; w) + \lambda \|w\|_1, \qquad \|w\|_1 = \sum_d |w_d|$$

instead of

$$\min_w \sum_i \ell(x_i, y_i; w) + \lambda \|w\|_2^2, \qquad \|w\|_2^2 = \sum_d w_d^2$$

Page 31: Machine Learning for Healthcare - GitHub Pages

L1 regularization

• Penalizing the L1 norm of the weight vector leads to sparse (read: many 0's) solutions for w.
• Why? Minimize this:

$$\min_w \ell(w \cdot x, y) + \lambda |w|$$

[Figure: two panels, "Subject to constant L1 norm" and "Subject to constant L2 norm"]

Page 33: Machine Learning for Healthcare - GitHub Pages

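L1 regularization

• Penalizing the L1 norm of the weight vector leads to sparse (read: many 0's) solutions for w.
• Why?

$$\min_w \ell(w \cdot x, y) + \lambda |w|$$

Intuition #2 – w.w.g.d.d. (What would gradient descent do?)

$$\frac{d}{dw_i} \lambda |w| = \pm\lambda$$

Always pushes the elements w_i towards 0, with the same strength however small w_i is

$$\frac{d}{dw_i} \lambda \|w\|_2^2 = 2\lambda w_i$$

The push towards 0 gets weaker as w_i gets smaller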
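The gradient-descent intuition can be checked numerically on a single weight, looking at the penalty term alone. In this sketch the L1 update is clipped at zero when the step would cross it (the proximal form of the step, an assumption added so the iterate does not oscillate around 0), and `lam`, `eta`, and the function names are illustrative.

```python
from typing import Callable, List


def l1_step(w: float, lam: float, eta: float) -> float:
    """Move w towards 0 by a constant eta * lam, stopping at exactly 0."""
    shift = eta * lam
    if abs(w) <= shift:
        return 0.0
    return w - shift if w > 0 else w + shift


def l2_step(w: float, lam: float, eta: float) -> float:
    """Gradient step on lam * w**2: the push 2 * lam * w shrinks with w."""
    return w - eta * 2.0 * lam * w


def trajectory(step: Callable[[float, float, float], float],
               w0: float, lam: float, eta: float, n_steps: int) -> List[float]:
    """Apply the given update n_steps times and record every iterate."""
    values = [w0]
    for _ in range(n_steps):
        values.append(step(values[-1], lam, eta))
    return values


l1_path = trajectory(l1_step, 1.0, 1.0, 0.1, 20)
l2_path = trajectory(l2_step, 1.0, 1.0, 0.1, 20)
```

Starting from 1.0, the L1 path reaches exactly 0.0 within the 20 steps, while the L2 path multiplies the weight by 0.8 at every step and so stays strictly positive.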

Page 34: Machine Learning for Healthcare - GitHub Pages

Features used in models

• Demographics (age, sex, etc.)
• Health insurance coverage
• Procedures performed (457 features)
• Specialty of doctors seen (cardiology, rheumatology, …)
• Service place (urgent care, inpatient, outpatient, …)
• Laboratory indicators (7000 features)
   For the 1000 most frequent lab tests:
   – Was the test ever administered?
   – Was the result ever low?
   – Was the result ever high?
   – Was the result ever normal?
   – Is the value increasing?
   – Is the value decreasing?
   – Is the value fluctuating?
• Medications taken (999 features) (laxatives, metformin, anti-arthritics, …)
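The seven laboratory indicators could be derived per test roughly as sketched below. The slide does not define "increasing", "decreasing", or "fluctuating", so the rules here (based on the signs of consecutive differences) are assumptions, as are the function name and the tuple layout.

```python
from datetime import date
from typing import Dict, List, Tuple

# One result: (date, value, range_low, range_high)
Result = Tuple[date, float, float, float]


def lab_indicators(results: List[Result]) -> Dict[str, bool]:
    """Binary indicators for one lab test of one patient."""
    ordered = sorted(results, key=lambda r: r[0])          # chronological order
    values = [r[1] for r in ordered]
    diffs = [b - a for a, b in zip(values, values[1:])]
    went_up = any(d > 0 for d in diffs)
    went_down = any(d < 0 for d in diffs)
    return {
        "ever_administered": len(ordered) > 0,
        "ever_low": any(v < lo for _, v, lo, _ in ordered),
        "ever_high": any(v > hi for _, v, _, hi in ordered),
        "ever_normal": any(lo <= v <= hi for _, v, lo, hi in ordered),
        # Assumed definitions: monotone trends versus mixed direction changes.
        "increasing": went_up and not went_down,
        "decreasing": went_down and not went_up,
        "fluctuating": went_up and went_down,
    }
```

Seven indicators for each of the 1000 most frequent tests gives the 7000 laboratory features counted on the slide.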

Page 35: Machine Learning for Healthcare - GitHub Pages

Features used in models

• Demographics (age, sex, etc.)
• Health insurance coverage
• Procedures performed (457 features)
• Specialty of doctors seen (cardiology, rheumatology, …)
• Service place (urgent care, inpatient, outpatient, …)
• Laboratory indicators (7000 features)
• Medications taken (999 features) (laxatives, metformin, anti-arthritics, …)
• 16,000 ICD-9 diagnosis codes (all history)

[Figure labels: All history, 24 month history, 6 month history]

Total features per patient: 42,000
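The three history lengths suggest computing the same indicator over several look-back windows. The sketch below does this for coded events; the window lengths in days, the feature naming scheme, and the function name are assumptions, and the slide does not say which feature families use which windows.

```python
from datetime import date
from typing import Dict, Iterable, Optional, Tuple

# Look-back windows in days; None means the entire history.
WINDOWS: Dict[str, Optional[int]] = {"all": None, "24m": 730, "6m": 182}


def windowed_code_features(events: Iterable[Tuple[date, str]],
                           index_date: date) -> Dict[str, int]:
    """Binary 'code seen within window' features relative to index_date."""
    features: Dict[str, int] = {}
    for when, code in events:
        if when >= index_date:
            continue                     # never use data past the feature period
        age_days = (index_date - when).days
        for name, days in WINDOWS.items():
            if days is None or age_days <= days:
                features[f"{code}@{name}"] = 1
    return features
```

A sparse dictionary like this (only the features that fire are stored) is a natural fit when each patient has tens of thousands of possible features but only a few are non-zero.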

Page 38: Machine Learning for Healthcare - GitHub Pages

What are the Discovered Risk Factors?

Diabetes, 1-year gap

• 769 variables have non-zero weight

Top History of Disease                     Odds Ratio
Impaired Fasting Glucose (Code 790.21)     4.17 (3.87–4.49)
Abnormal Glucose NEC (790.29)              4.07 (3.76–4.41)
Hypertension (401)                         3.28 (3.17–3.39)
Obstructive Sleep Apnea (327.23)           2.98 (2.78–3.20)
Obesity (278)                              2.88 (2.75–3.02)
Abnormal Blood Chemistry (790.6)           2.49 (2.36–2.62)
Hyperlipidemia (272.4)                     2.45 (2.37–2.53)
Shortness Of Breath (786.05)               2.09 (1.99–2.19)
Esophageal Reflux (530.81)                 1.85 (1.78–1.93)

Additional Disease Risk Factors Include: Pituitary dwarfism (253.3), Hepatomegaly (789.1), Chronic Hepatitis C (070.54), Hepatitis (573.3), Calcaneal Spur (726.73), Thyrotoxicosis without mention of goiter (242.90), Sinoatrial Node dysfunction (427.81), Acute frontal sinusitis (461.1), Hypertrophic and atrophic conditions of skin (701.9), Irregular menstruation (626.4), …
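The slides do not say how the odds ratios or their intervals were computed. As a reference point only, the sketch below shows two textbook calculations: the odds ratio with a Wald-style 95% interval from a 2x2 table, and the odds ratio implied by a logistic regression weight on a binary feature. Function names and the example counts are made up.

```python
import math
from typing import Tuple


def odds_ratio_with_ci(a: int, b: int, c: int, d: int,
                       z: float = 1.96) -> Tuple[float, float, float]:
    """Odds ratio and interval from a 2x2 table.

    a: has risk factor, developed outcome    b: has risk factor, no outcome
    c: no risk factor, developed outcome     d: no risk factor, no outcome
    All four cells must be non-zero.
    """
    ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # std. error of log(OR)
    low = math.exp(math.log(ratio) - z * se_log)
    high = math.exp(math.log(ratio) + z * se_log)
    return ratio, low, high


def weight_to_odds_ratio(weight: float) -> float:
    """For a binary feature in logistic regression, exp(weight) is the odds
    ratio of having versus not having the feature, other features held fixed."""
    return math.exp(weight)
```

For example, with made-up counts of 20 and 80 among patients with a risk factor and 10 and 90 among those without, `odds_ratio_with_ci(20, 80, 10, 90)` gives an odds ratio of 2.25 with an interval around it.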

Page 40: Machine Learning for Healthcare - GitHub Pages

What are the Discovered Risk Factors?

Diabetes, 1-year gap

• 769 variables have non-zero weight

Top Lab Factors                                                                 Odds Ratio
Hemoglobin A1c/Hemoglobin.Total (High - Past 2 years)                           5.75 (5.42–6.10)
Glucose (High - Past 6 months)                                                  4.05 (3.89–4.21)
Cholesterol.In VLDL (Increasing - Past 2 years)                                 3.88 (3.53–4.27)
Potassium (Low - Entire History)                                                2.58 (2.24–2.98)
Cholesterol.Total/Cholesterol.In HDL (High - Entire History)                    2.29 (2.19–2.40)
Erythrocyte mean corpuscular hemoglobin concentration (Low - Entire History)    2.25 (1.92–2.64)
Eosinophils (High - Entire History)                                             2.11 (1.82–2.44)
Glomerular filtration rate/1.73 sq M.Predicted (Low - Entire History)           2.07 (1.92–2.24)
Alanine aminotransferase (High - Entire History)                                2.04 (1.89–2.19)

Additional Lab Test Risk Factors Include: Albumin/Globulin (Increasing - Entire History), Urea nitrogen/Creatinine (High - Entire History), Specific gravity (Increasing - Past 2 years), Bilirubin (High - Past 2 years), …

Page 41: Machine Learning for Healthcare - GitHub Pages

Positive predictive value (PPV)

Diabetes, 1-year gap

[Bar chart: PPV among the Top 100, Top 1000, and Top 10000 predictions, comparing "Traditional risk factors" with "Full model". Values as extracted, in legend order: 0.06, 0.07, 0.06 and 0.15, 0.17, 0.1]
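PPV among the top K predictions can be computed as in this short sketch: rank patients by predicted risk, keep the K highest, and take the fraction that actually developed the outcome. The function name and the toy scores are illustrative.

```python
from typing import Sequence


def ppv_at_k(scores: Sequence[float], labels: Sequence[int], k: int) -> float:
    """Fraction of true positives among the k highest-scoring patients."""
    if k <= 0:
        raise ValueError("k must be positive")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]                       # fewer than k rows if the data is smaller
    return sum(label for _, label in top) / len(top)


# Toy example: 4 patients, 2 of whom developed the outcome.
toy_scores = [0.9, 0.8, 0.7, 0.1]
toy_labels = [1, 0, 1, 0]
top2 = ppv_at_k(toy_scores, toy_labels, 2)   # 1 positive among the top 2
```

This metric matches how a risk model is used in practice: an intervention program can only reach a fixed number of patients, so what matters is how many of the K flagged patients are truly at risk.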
