Machine Learning for Healthcare, HST.956, 6.S897
Lecture 4: Risk stratification
David Sontag
Course announcements
• Recitation Friday at 2pm (4-153) – optional
• No class this Tuesday
• Problem set 1 due next Thursday, Feb 21
• Sign up for lecture scribing or MLHC community consulting
• Readings will be posted several days ahead
• All course communication through Piazza
Outline for today's class
1. Risk stratification
2. Case study: Early detection of Type 2 diabetes
   – Framing as supervised learning problem
   – Evaluating risk stratification algorithms
3. Discussion with Leonard D'Avolio (Assistant Professor at HMS, CEO @ Cyft)
What is risk stratification?
• Separate a patient population into high-risk and low-risk of having an outcome
  – Predicting something in the future
  – Goal is different from diagnosis, with distinct performance metrics
• Coupled with interventions that target high-risk patients
• Goal is typically to reduce cost and improve patient outcomes
Examples of risk stratification
• Preterm infant's risk of severe morbidity? (Saria et al., Science Translational Medicine 2010)
• Does this patient need to be admitted to the coronary-care unit? (Pozen et al., NEJM 1984)
  Figure source: https://www.drmani.com/heart-attack/
• Likelihood of hospital readmission?
  Figure source: https://www.air.org/project/revolving-door-u-s-hospital-readmissions-diagnosis-and-procedure
Old vs. New
• Traditionally, risk stratification was based on simple scores using human-entered data
• Now, based on machine learning on high-dimensional data
  – Fits more easily into workflow
  – Higher accuracy
  – Quicker to derive (can special-case)
• But, new dangers introduced with ML approach – to be discussed
Example commercial product: likelihood of COPD-related hospitalizations
(Optum Whitepaper, "Predictive analytics: Poised to drive population health")

Example commercial product: high-risk diabetes patients missing tests
(Optum Whitepaper, "Predictive analytics: Poised to drive population health")

Patient      # of A1c tests  # of LDL tests  Last A1c  Date of last A1c  Last LDL  Date of last LDL
Patient 1    2               0               9.2       5/3/13            N/A       N/A
Patient 2    2               0               8         1/30/13           N/A       N/A
Patient 3    0               0               N/A       N/A               N/A       N/A
Patient 4    0               2               N/A       N/A               133       8/9/13
Patient 5    0               0               N/A       N/A               N/A       N/A
Patient 6    0               1               N/A       N/A               115       7/16/13
Patient 7    1               0               10.8      9/18/13           N/A       N/A
Patient 8    0               0               N/A       N/A               N/A       N/A
Patient 9    0               0               N/A       N/A               N/A       N/A
Patient 10   0               0               N/A       N/A               N/A       N/A
Type 2 Diabetes: A major public health challenge
[Figure: maps for 1994, 2000, and 2013; legend: <4.5%, 4.5%–5.9%, 6.0%–7.4%, 7.5%–8.9%, >9.0%]
• $245 billion: Total costs of diagnosed diabetes in the United States in 2012
• $831 billion: Total fiscal year federal budget for healthcare in the United States in 2014
Type 2 Diabetes Can Be Prevented*
Requirements for a successful large-scale prevention program:
1. Detect/reach truly at-risk population
2. Improve the interventions
3. Lower the cost of intervention
* Diabetes Prevention Program Research Group. "Reduction in the incidence of type 2 diabetes with lifestyle intervention or metformin." The New England Journal of Medicine 346.6 (2002): 393.
Traditional Risk Prediction Models
• Successful examples
  – ARIC
  – KORA
  – FRAMINGHAM
  – AUSDRISC
  – FINDRISC
  – San Antonio Model
• Easy to ask/measure in the office, or for patients to do online
• Simple model: can calculate scores by hand
Challenges of Traditional Risk Prediction Models
• A screening step needs to be done for every member in the population
  – Either in the physician's office or as surveys
  – Costly and time-consuming
  – Infeasible for regular screening for millions of individuals
• Models not easy to adapt to multiple surrogates, when a variable is missing
  – Discovery of surrogates not straightforward
Population-Level Risk Stratification
• Key idea: Use readily available administrative, utilization, and clinical data
• Machine learning will find surrogates for risk factors that would otherwise be missing
• Perform risk stratification at the population level – millions of patients
[Razavian, Blecker, Schmidt, Smith-McLallen, Nigam, Sontag. Big Data. '16]
[Figure: Health stakeholders. Source: http://www.mahesh-vc.com/blog/understanding-whos-paying-for-what-in-the-healthcare-industry]
A Data-Driven Approach on Longitudinal Data
• Looking at individuals who got diabetes today (compared to those who didn't)
  – Can we infer which variables in their record could have predicted their health outcome?
[Figure: timeline running from "A few years ago" to "Today"]
Administrative & Clinical Data
Patient (records laid out along a time axis):
• Eligibility record: member ID; age/gender; ID of subscriber; company code
• Medical claims: ICD9 diagnosis codes; CPT code (procedure); specialty; location of service; date of service
• Lab tests: LOINC code (urine or blood test name); results (actual values); lab ID; range high/low; date
• Medications: NDC code (drug name); days of supply; quantity; service provider ID; date of fill
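One way to picture the four record types is as a small set of Python dataclasses. This is only an illustrative sketch: the class names, field names, the `events_before` helper, and the sample values (including the placeholder NDC code and the CPT code) are assumptions, not the schema used in the study.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class MedicalClaim:
    icd9_code: str
    cpt_code: str
    specialty: str
    location_of_service: str
    date_of_service: date


@dataclass
class LabTest:
    loinc_code: str
    result: Optional[float]
    range_low: Optional[float]
    range_high: Optional[float]
    test_date: date


@dataclass
class Medication:
    ndc_code: str
    days_of_supply: int
    quantity: float
    date_of_fill: date


@dataclass
class PatientRecord:
    member_id: str
    age: int
    gender: str
    claims: List[MedicalClaim] = field(default_factory=list)
    labs: List[LabTest] = field(default_factory=list)
    medications: List[Medication] = field(default_factory=list)

    def events_before(self, cutoff: date):
        """Return (date, kind, code) tuples strictly before cutoff, in time order."""
        events = [(c.date_of_service, "dx", c.icd9_code) for c in self.claims]
        events += [(t.test_date, "lab", t.loinc_code) for t in self.labs]
        events += [(m.date_of_fill, "rx", m.ndc_code) for m in self.medications]
        return sorted(e for e in events if e[0] < cutoff)


# Hypothetical patient; all values are made up for illustration.
patient = PatientRecord("M001", 52, "F")
patient.claims.append(
    MedicalClaim("401.1", "99213", "cardiology", "outpatient", date(2009, 3, 2)))
patient.labs.append(LabTest("4548-4", 6.1, 4.0, 5.6, date(2009, 6, 15)))
patient.medications.append(Medication("00000-0000-00", 30, 30.0, date(2010, 1, 5)))
timeline = patient.events_before(date(2010, 1, 1))
```

Keeping every event dated makes it straightforward to restrict feature construction to events before a chosen cutoff.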
Top diagnosis codes
Out of 135K patients who had laboratory data

Code    Disease                      Count
4011    Benign hypertension          447017
2724    Hyperlipidemia NEC/NOS       382030
4019    Hypertension NOS             372477
25000   DMII wo cmp nt st uncntr     339522
2720    Pure hypercholesterolem      232671
2722    Mixed hyperlipidemia         180015
V7231   Routine gyn examination      178709
2449    Hypothyroidism NOS           169829
78079   Malaise and fatigue NEC      149797
V0481   Vaccin for influenza         147858
7242    Lumbago                      137345
V7612   Screen mammogram NEC         129445
V700    Routine medical exam         127848

53081   Esophageal reflux            121064
42731   Atrial fibrillation          113798
7295    Pain in limb                 112449
41401   Crnry athrscl natve vssl     104478
2859    Anemia NOS                   103351
78650   Chest pain NOS               91999
5990    Urin tract infection NOS     87982
V5869   Long-term use meds NEC       85544
496     Chr airway obstruct NEC      78585
4779    Allergic rhinitis NOS        77963
41400   Cor ath unsp vsl ntv/gft     75519

71947   Joint pain-ankle             28648
3004    Dysthymic disorder           28530
2689    Vitamin D deficiency NOS     28455
V7281   Preop cardiovsclr exam       27897
7243    Sciatica                     27604
78791   Diarrhea                     27424
V221    Supervis oth normal preg     27320
36501   Opn angl brderln lo risk     26033
37921   Vitreous degeneration        25592
4241    Aortic valve disorder        25425
61610   Vaginitis NOS                24736
70219   Other sborheic keratosis     24453
3804    Impacted cerumen             24046
Top lab test results
Count of people who have the test result (ever)

LOINC     Lab test                                         Count
2160-0    Creatinine                                       1284737
3094-0    Urea nitrogen                                    1282344
2823-3    Potassium                                        1280812
2345-7    Glucose                                          1299897
1742-6    Alanine aminotransferase                         1187809
1920-8    Aspartate aminotransferase                       1187965
2885-2    Protein                                          1277338
1751-7    Albumin                                          1274166
2093-3    Cholesterol                                      1268269
2571-8    Triglyceride                                     1257751
13457-7   Cholesterol.in LDL                               1241208
17861-6   Calcium                                          1165370
2951-2    Sodium                                           1167675

2085-9    Cholesterol.in HDL                               1155666
718-7     Hemoglobin                                       1152726
4544-3    Hematocrit                                       1147893
9830-1    Cholesterol.total/Cholesterol.in HDL             1037730
33914-3   Glomerular filtration rate/1.73 sq M.predicted   561309
785-6     Erythrocyte mean corpuscular hemoglobin          1070832
6690-2    Leukocytes                                       1062980
789-8     Erythrocytes                                     1062445
787-2     Erythrocyte mean corpuscular volume              1063665

770-8     Neutrophils/100 leukocytes                       952089
731-0     Lymphocytes                                      943918
704-7     Basophils                                        863448
711-2     Eosinophils                                      935710
5905-5    Monocytes/100 leukocytes                         943764
706-2     Basophils/100 leukocytes                         863435
751-8     Neutrophils                                      943232
742-7     Monocytes                                        942978
713-8     Eosinophils/100 leukocytes                       933929
3016-3    Thyrotropin                                      891807
4548-4    Hemoglobin A1c/Hemoglobin.total                  527062
Framing for supervised machine learning
[Figure: three timelines over 2009–2013, each showing a feature construction period followed by a prediction window; the windows are labeled 2011-2013, 2010-2012, and 2009-2011]
Gap is important to prevent label leakage
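The role of the gap can be sketched in a few lines of Python. Events before the feature cutoff feed feature construction, events inside the prediction window define the label, and events that fall in the gap are used for neither. The function names and the specific cutoff, gap, and window lengths below are assumptions chosen for illustration; the slides do not give the exact values.

```python
from datetime import date
from typing import NamedTuple


class Framing(NamedTuple):
    feature_end: date   # features use events strictly before this date
    window_start: date  # feature_end plus the gap
    window_end: date    # exclusive end of the prediction window


def make_framing(feature_end_year, gap_years, window_years):
    """Build a framing whose boundaries fall on January 1 (an assumption)."""
    start_year = feature_end_year + gap_years
    return Framing(
        date(feature_end_year, 1, 1),
        date(start_year, 1, 1),
        date(start_year + window_years, 1, 1),
    )


def split_events(event_dates, framing):
    """Separate event dates into feature events and outcome events."""
    feature_events = [d for d in event_dates if d < framing.feature_end]
    outcome_events = [
        d for d in event_dates
        if framing.window_start <= d < framing.window_end
    ]
    return feature_events, outcome_events


# Features from before 2010, a 1-year gap, then a 2-year prediction window.
framing = make_framing(2010, 1, 2)
events = [date(2009, 5, 1), date(2010, 7, 1), date(2011, 3, 1)]
features, outcomes = split_events(events, framing)
```

Here the mid-2010 event lands in the gap, so it cannot leak information about the label into the features.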
Framing for supervised machine learning
Problem: Data is censored!
• Patients change health insurers frequently, but data doesn't follow them
• Left censored: may not have enough data to derive features
• Right censored: may not know label
[Figure: timeline over 2009–2013 with a feature construction period and a prediction window labeled 2009-2011. Annotations: "Data collection period: patient variables built from data in this period"; "Gap period between data collection and outcome evaluation"; "Patient outcome evaluated in this period" (from T to T+W, diabetes onset). Example patients: A +, B -, C *, D -, E *, F *, G *]
This is an example of alignment by absolute time.
Reduction to binary classification: exclude patients that are left- and right-censored.
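The reduction to binary classification can be sketched as a labeling function that returns 1, 0, or None (excluded). The exact exclusion rules are not spelled out on the slide, so the ones below are assumptions: a patient is left-censored if enrollment starts after the data collection period begins, right-censored if enrollment ends before the prediction window closes without an observed onset, and also excluded if onset precedes the prediction window.

```python
from datetime import date


def label_patient(enroll_start, enroll_end, onset,
                  collection_start, window_start, window_end):
    """Return 1 (onset in window), 0 (no onset, fully observed), or None (excluded)."""
    if enroll_start > collection_start:
        return None  # left-censored: not enough history to derive features
    if onset is not None and onset < window_start:
        return None  # assumption: onset before the window is not a prediction target
    if onset is not None and onset < window_end:
        return 1     # onset observed inside the prediction window
    if enroll_end < window_end:
        return None  # right-censored: the label cannot be known
    return 0         # observed through the whole window with no onset


# Hypothetical boundaries and patients, for illustration only.
COLLECTION_START = date(2009, 1, 1)
WINDOW_START = date(2011, 1, 1)
WINDOW_END = date(2013, 1, 1)


def label(enroll_start, enroll_end, onset):
    return label_patient(enroll_start, enroll_end, onset,
                         COLLECTION_START, WINDOW_START, WINDOW_END)


positive = label(date(2008, 1, 1), date(2013, 12, 31), date(2011, 6, 1))
negative = label(date(2008, 1, 1), date(2013, 12, 31), None)
left_censored = label(date(2010, 3, 1), date(2013, 12, 31), None)
right_censored = label(date(2008, 1, 1), date(2011, 6, 30), None)
```

Only patients with a label of 0 or 1 would enter the training set under this framing.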
Alternative framings
• Align by relative time, e.g.
  – 2 hours into patient stay in ER
  – Every time patient sees PCP
  – When individual turns 40 yrs old
• Align by data availability
NOTE:
• If there are multiple data points per patient, make sure each patient is in only one of train, validate, or test
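One simple way to honor the note above is to assign the split from the patient identifier rather than from the individual data point. The sketch below hashes the ID so that every row for a patient lands in the same partition; the function names and the 70/15/15 proportions are assumptions, not something the slides specify.

```python
import hashlib


def assign_split(patient_id, train=0.7, validate=0.15):
    """Deterministically map a patient ID to 'train', 'validate', or 'test'."""
    digest = hashlib.sha256(patient_id.encode("utf-8")).hexdigest()
    u = int(digest[:8], 16) / 16 ** 8  # pseudo-uniform value in [0, 1)
    if u < train:
        return "train"
    if u < train + validate:
        return "validate"
    return "test"


def split_rows(rows):
    """rows: iterable of (patient_id, payload). Returns a dict of lists by split."""
    out = {"train": [], "validate": [], "test": []}
    for patient_id, payload in rows:
        out[assign_split(patient_id)].append((patient_id, payload))
    return out


# Hypothetical rows: several data points per patient.
rows = [("p1", "visit-1"), ("p1", "visit-2"), ("p2", "visit-1"),
        ("p3", "visit-1"), ("p3", "visit-2"), ("p3", "visit-3")]
partitions = split_rows(rows)
```

Because the assignment depends only on the ID, rerunning the split or adding new rows for an existing patient never moves that patient across partitions.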
Methods
• L1-regularized logistic regression
  – Simultaneously optimizes predictive performance and
  – Performs feature selection, choosing the subset of the features that are most predictive
• This prevents overfitting to the training data
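To make the feature-selection effect concrete, here is a minimal pure-Python sketch of L1-regularized logistic regression fitted by proximal gradient descent (ISTA). The solver choice, hyperparameters, and toy data are assumptions for illustration; the slides do not say how the model was optimized. In the toy data the second feature is correlated with the label but redundant given the first, and the L1 penalty sets its weight to exactly zero.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrink toward zero, clipping at zero."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def fit_l1_logreg(X, y, lam, lr=0.5, steps=3000):
    """Minimize mean logistic loss + lam * ||w||_1 (intercept unpenalized)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j] / n
            grad_b += err / n
        # Gradient step on the loss, then soft-threshold for the L1 penalty.
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


# Toy data: feature 0 determines the label; feature 1 agrees with it 3 times out of 4.
X = [[1, 1], [1, 1], [1, 1], [1, -1],
     [-1, -1], [-1, -1], [-1, -1], [-1, 1]]
y = [1, 1, 1, 1, 0, 0, 0, 0]
w, b = fit_l1_logreg(X, y, lam=0.1)
```

After fitting, only the first weight is non-zero, which is the sense in which the penalty "chooses" a subset of features.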
L1 regularization
• Penalizing the L1 norm of the weight vector leads to sparse (read: many 0's) solutions for w.
• Why?
$$\min_w \sum_i \ell(x_i, y_i; w) + \lambda \|w\|_1, \qquad \|\vec{w}\|_1 = \sum_d |w_d|$$
instead of
$$\min_w \sum_i \ell(x_i, y_i; w) + \lambda \|w\|_2^2, \qquad \|\vec{w}\|_2^2 = \sum_d w_d^2$$
L1 regularization
• Why? Minimize this: $\min_w \ell(w \cdot x, y) + \lambda \|w\|_1$
[Figure: two panels, "Subject to constant L1 norm" and "Subject to constant L2 norm"]
L1 regularization
• Why? $\min_w \ell(w \cdot x, y) + \lambda \|w\|_1$
Intuition #2 – w.w.g.d.d. (What would gradient descent do?)
• L1 penalty: $\frac{d}{dw_i} \lambda \|w\|_1 = \pm\lambda$ (for $w_i \neq 0$). Always pushes elements of w towards 0, with the same strength whatever their size.
• L2 penalty: $\frac{d}{dw_i} \lambda \|w\|_2^2 = 2\lambda w_i$. The push towards 0 gets weaker as $w_i$ gets smaller.
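The contrast between the two penalties can be checked numerically by applying only the penalty's gradient step to a single weight. This is a sketch under simplifying assumptions: the data-fit term is ignored, and the L1 step is clipped at zero when it would overshoot (equivalent to soft-thresholding). The function names and constants are hypothetical.

```python
def shrink_l1(w, lam, lr, steps):
    """Repeated steps on lam * |w|: constant-size push, clipped at zero."""
    for _ in range(steps):
        step = lr * lam
        if abs(w) <= step:
            w = 0.0
        else:
            w -= step if w > 0 else -step
    return w


def shrink_l2(w, lam, lr, steps):
    """Repeated steps on lam * w**2: the push is proportional to w."""
    for _ in range(steps):
        w -= lr * 2 * lam * w
    return w


w_l1 = shrink_l1(1.0, lam=0.5, lr=0.1, steps=50)
w_l2 = shrink_l2(1.0, lam=0.5, lr=0.1, steps=50)
```

With these settings the L1 weight reaches exactly zero, while the L2 weight shrinks by a factor of 0.9 per step and stays small but non-zero.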
Features used in models
• Demographics (age, sex, etc.)
• Health insurance coverage
• Service place (urgent care, inpatient, outpatient, …)
• Specialty of doctors seen (cardiology, rheumatology, …)
• Procedures performed (457 features)
• Medications taken (999 features) (laxatives, metformin, anti-arthritics, …)
• 16,000 ICD-9 diagnosis codes (all history)
• Laboratory indicators (7000 features). For the 1000 most frequent lab tests:
  – Was the test ever administered?
  – Was the result ever low?
  – Was the result ever high?
  – Was the result ever normal?
  – Is the value increasing?
  – Is the value decreasing?
  – Is the value fluctuating?
• Feature history windows: all history, 24-month history, 6-month history
• Total features per patient: 42,000
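The seven per-test laboratory indicators could be derived from a patient's result history roughly as follows. The slides do not define "increasing", "decreasing", or "fluctuating", so the definitions here (strictly monotone sequences, and any mix of rises and falls) are assumptions, as are the function name and the reference range in the example.

```python
def lab_indicators(results, low, high):
    """Binary indicators for one lab test.

    results: chronologically ordered numeric values (may be empty).
    low, high: bounds of the normal reference range.
    """
    diffs = [later - earlier for earlier, later in zip(results, results[1:])]
    return {
        "ever_administered": len(results) > 0,
        "ever_low": any(v < low for v in results),
        "ever_high": any(v > high for v in results),
        "ever_normal": any(low <= v <= high for v in results),
        # Assumed definitions of the three trend indicators:
        "increasing": bool(diffs) and all(d > 0 for d in diffs),
        "decreasing": bool(diffs) and all(d < 0 for d in diffs),
        "fluctuating": any(d > 0 for d in diffs) and any(d < 0 for d in diffs),
    }


# Hypothetical hemoglobin A1c history with an assumed normal range of 4.0 to 5.6.
a1c = lab_indicators([5.4, 5.9, 6.3], low=4.0, high=5.6)
never_tested = lab_indicators([], low=4.0, high=5.6)
```

Computing the same indicators over each history window would multiply the feature count accordingly.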
What are the Discovered Risk Factors? (Diabetes, 1-year gap)
• 769 variables have non-zero weight

Top History of Disease                     Odds Ratio
Impaired Fasting Glucose (Code 790.21)     4.17 (3.87, 4.49)
Abnormal Glucose NEC (790.29)              4.07 (3.76, 4.41)
Hypertension (401)                         3.28 (3.17, 3.39)
Obstructive Sleep Apnea (327.23)           2.98 (2.78, 3.20)
Obesity (278)                              2.88 (2.75, 3.02)
Abnormal Blood Chemistry (790.6)           2.49 (2.36, 2.62)
Hyperlipidemia (272.4)                     2.45 (2.37, 2.53)
Shortness Of Breath (786.05)               2.09 (1.99, 2.19)
Esophageal Reflux (530.81)                 1.85 (1.78, 1.93)

Additional disease risk factors include: Pituitary dwarfism (253.3), Hepatomegaly (789.1), Chronic Hepatitis C (070.54), Hepatitis (573.3), Calcaneal Spur (726.73), Thyrotoxicosis without mention of goiter (242.90), Sinoatrial Node dysfunction (427.81), Acute frontal sinusitis (461.1), Hypertrophic and atrophic conditions of skin (701.9), Irregular menstruation (626.4), …
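For readers unfamiliar with odds ratios, the sketch below shows one common way to compute an odds ratio and a Wald-style 95% interval from a 2x2 table, plus the conversion from a logistic regression weight to an odds ratio. Whether the values on the slide were computed this way is not stated, and the counts in the example are made up.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald interval from a 2x2 table.

    a: exposed with outcome,   b: exposed without outcome,
    c: unexposed with outcome, d: unexposed without outcome.
    """
    ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(ratio) - z * se_log)
    upper = math.exp(math.log(ratio) + z * se_log)
    return ratio, lower, upper


def weight_to_odds_ratio(weight):
    """For a binary feature in logistic regression, the odds ratio is exp(weight)."""
    return math.exp(weight)


# Made-up counts, not taken from the study.
ratio, lower, upper = odds_ratio_ci(a=40, b=60, c=20, d=80)
```

An odds ratio above 1 with an interval that excludes 1 indicates that the factor is associated with higher odds of the outcome.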
What are the Discovered Risk Factors? (Diabetes, 1-year gap)
• 769 variables have non-zero weight

Top Lab Factors                                                                 Odds Ratio
Hemoglobin A1c/Hemoglobin.Total (High - Past 2 years)                           5.75 (5.42, 6.10)
Glucose (High - Past 6 months)                                                  4.05 (3.89, 4.21)
Cholesterol.In VLDL (Increasing - Past 2 years)                                 3.88 (3.53, 4.27)
Potassium (Low - Entire History)                                                2.58 (2.24, 2.98)
Cholesterol.Total/Cholesterol.In HDL (High - Entire History)                    2.29 (2.19, 2.40)
Erythrocyte mean corpuscular hemoglobin concentration (Low - Entire History)    2.25 (1.92, 2.64)
Eosinophils (High - Entire History)                                             2.11 (1.82, 2.44)
Glomerular filtration rate/1.73 sq M.Predicted (Low - Entire History)           2.07 (1.92, 2.24)
Alanine aminotransferase (High - Entire History)                                2.04 (1.89, 2.19)

Additional lab test risk factors include: Albumin/Globulin (Increasing - Entire History), Urea nitrogen/Creatinine (High - Entire History), Specific gravity (Increasing - Past 2 years), Bilirubin (High - Past 2 years), …
Positive predictive value (PPV) (Diabetes, 1-year gap)
[Bar chart: PPV of the top 100, top 1000, and top 10000 predictions for two models, "Traditional risk factors" and "Full model". Bar labels as extracted: 0.06, 0.07, 0.06, 0.15, 0.17, 0.1]
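PPV among the top-k predictions is the fraction of the k highest-scoring patients who actually develop the outcome. A minimal sketch follows; the function name and the toy scores and labels are made up for illustration.

```python
def ppv_at_k(scores, labels, k):
    """Fraction of true positives among the k highest-scoring examples.

    Assumes k >= 1 and at least one example.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    return sum(label for _, label in top) / len(top)


# Toy predictions: higher score means higher predicted risk.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.1]
labels = [1, 0, 1, 0, 0, 1]
ppv_top2 = ppv_at_k(scores, labels, 2)
ppv_top3 = ppv_at_k(scores, labels, 3)
```

This metric matches the intervention setting: if only k patients can be reached, what matters is how many of those k are truly at risk.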