![Page 1: Jessica Franklin Instructor in Medicine Division of Pharmacoepidemiology & Pharmacoeconomics](https://reader036.vdocuments.mx/reader036/viewer/2022062816/56816683550346895dda2a49/html5/thumbnails/1.jpg)
Comparing high-dimensional propensity score versus lasso variable selection for confounding adjustment in a novel simulation framework
Jessica Franklin
Instructor in Medicine
Division of Pharmacoepidemiology & Pharmacoeconomics
Brigham and Women’s Hospital and Harvard Medical School
QMC, Department of Quantitative Health Sciences
University of Massachusetts Medical School
April 15, 2014
![Page 2: Jessica Franklin Instructor in Medicine Division of Pharmacoepidemiology & Pharmacoeconomics](https://reader036.vdocuments.mx/reader036/viewer/2022062816/56816683550346895dda2a49/html5/thumbnails/2.jpg)
Background
• Administrative healthcare claims data are a popular data source for nonrandomized studies of interventions.
• Because treatments are not randomized, addressing confounding is the primary methodological challenge.
![Page 3: Jessica Franklin Instructor in Medicine Division of Pharmacoepidemiology & Pharmacoeconomics](https://reader036.vdocuments.mx/reader036/viewer/2022062816/56816683550346895dda2a49/html5/thumbnails/3.jpg)
Claims Data
• Comprehensive claims databases contain information on patient insurance enrollment and demographics, as well as every healthcare encounter, including:
  • Diagnoses
  • Procedures
  • Hospitalizations
  • Medications dispensed
• Dates of encounters provide a complete longitudinal record of patients’ healthcare interactions.
![Page 4: Jessica Franklin Instructor in Medicine Division of Pharmacoepidemiology & Pharmacoeconomics](https://reader036.vdocuments.mx/reader036/viewer/2022062816/56816683550346895dda2a49/html5/thumbnails/4.jpg)
New user design
• Potential confounders are measured prior to initiation of exposure.
• Active treatment comparator group reduces biases associated with non-user comparators.
[Figure: study timeline. Covariates are assessed before exposure initiation; follow-up for outcome events runs from exposure initiation until the end of data or enrollment.]
![Page 5: Jessica Franklin Instructor in Medicine Division of Pharmacoepidemiology & Pharmacoeconomics](https://reader036.vdocuments.mx/reader036/viewer/2022062816/56816683550346895dda2a49/html5/thumbnails/5.jpg)
Principles of variable selection
• Brookhart et al. (2006) showed that the best PS model is the model that includes all predictors of outcome (regardless of whether they are associated with exposure).
• Pearl (2010) and Myers et al. (2011) further noted that including instrumental variables (IVs) can increase bias from unmeasured confounding.
  • IVs are associated with exposure, but not associated with outcome except through exposure.
![Page 6: Jessica Franklin Instructor in Medicine Division of Pharmacoepidemiology & Pharmacoeconomics](https://reader036.vdocuments.mx/reader036/viewer/2022062816/56816683550346895dda2a49/html5/thumbnails/6.jpg)
hd-PS variable selection
• The high-dimensional propensity score (hd-PS) algorithm screens thousands of diagnoses, medications, and procedure codes and ranks variables according to likelihood of confounding.
• Relies on the idea that a large number of “proxy” variables can reduce bias from unmeasured confounding.
• Empirical evidence has shown a reduction in bias.
![Page 7: Jessica Franklin Instructor in Medicine Division of Pharmacoepidemiology & Pharmacoeconomics](https://reader036.vdocuments.mx/reader036/viewer/2022062816/56816683550346895dda2a49/html5/thumbnails/7.jpg)
Shrinkage methods
• Greenland (2008) suggested regularization methods as preferable to variable selection.
  • Shrinking coefficients allows for efficient estimation, even in models with many degrees of freedom.
• Lasso regression provides both shrinkage and principled variable selection.
  • Shrinkage allows for direct modeling of the outcome even with many potential confounders.
  • Some coefficients are shrunk all the way to 0.
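The difference between the two penalties can be sketched in the simplest possible setting. The example below assumes an orthonormal design, where the lasso solution for the objective ½‖y − Xb‖² + λ‖b‖₁ is the soft-thresholded unpenalized estimate and the ridge solution for ½‖y − Xb‖² + (λ/2)‖b‖² is the unpenalized estimate divided by (1 + λ). The coefficient values and λ are hypothetical and chosen only for illustration.

```python
def soft_threshold(z, lam):
    """Lasso solution for one coefficient under an orthonormal design."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # small coefficients are set exactly to zero


def ridge_shrink(z, lam):
    """Ridge solution for one coefficient under an orthonormal design."""
    return z / (1.0 + lam)  # shrinks toward zero but never reaches it


# Hypothetical unpenalized estimates, for illustration only.
unpenalized = [2.0, 0.5, -0.25, -1.5]
lam = 0.5

lasso = [soft_threshold(z, lam) for z in unpenalized]
ridge = [ridge_shrink(z, lam) for z in unpenalized]
print("lasso:", lasso)
print("ridge:", ridge)
```

In this toy case the lasso drops the two smallest coefficients entirely, while ridge keeps every variable with a reduced coefficient.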
![Page 8: Jessica Franklin Instructor in Medicine Division of Pharmacoepidemiology & Pharmacoeconomics](https://reader036.vdocuments.mx/reader036/viewer/2022062816/56816683550346895dda2a49/html5/thumbnails/8.jpg)
Objective
• To compare the performance of:
  • hd-PS variable selection
  • Ridge regression of the outcome on all potential confounders
  • Lasso regression of the outcome on all potential confounders
• The goal is maximum reduction in confounding bias.
![Page 9: Jessica Franklin Instructor in Medicine Division of Pharmacoepidemiology & Pharmacoeconomics](https://reader036.vdocuments.mx/reader036/viewer/2022062816/56816683550346895dda2a49/html5/thumbnails/9.jpg)
Comparing high-dimensional methods
• How can we answer this question?
  • Empirical studies are useful when we “know” the true treatment effect, but even then we can’t determine the contributions of bias and variance to overall error.
  • Ordinary simulation techniques with completely synthetic data cannot capture the complex correlation structure among covariates in claims data.
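When the true effect is known and many simulated datasets are available, overall error can be split into its bias and variance parts, since mean squared error equals squared bias plus variance. A minimal sketch follows; the estimates and the true value are hypothetical numbers, not results from this study.

```python
def decompose_error(estimates, truth):
    """Return (bias, variance, mse) of a list of estimates around a known truth."""
    n = len(estimates)
    mean_est = sum(estimates) / n
    bias = mean_est - truth
    # Population variance, so that mse == bias**2 + variance holds exactly.
    variance = sum((e - mean_est) ** 2 for e in estimates) / n
    mse = sum((e - truth) ** 2 for e in estimates) / n
    return bias, variance, mse


# Hypothetical effect estimates from five simulated datasets.
estimates = [0.10, 0.25, 0.18, 0.22, 0.15]
truth = 0.0

bias, variance, mse = decompose_error(estimates, truth)
print(bias, variance, mse)
```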
![Page 10: Jessica Franklin Instructor in Medicine Division of Pharmacoepidemiology & Pharmacoeconomics](https://reader036.vdocuments.mx/reader036/viewer/2022062816/56816683550346895dda2a49/html5/thumbnails/10.jpg)
Plasmode simulation
• We start with a real empirical cohort study:
  • 49,653 patients
  • Exposed to either ns-NSAIDs or Cox-2 inhibitors (X)
  • Followed for gastrointestinal events (Y)
  • Pre-defined covariates include age, sex, race, and 16 diagnosis/medication/procedure variables (C1)
• To get reasonable values for associations between covariates and outcome, we estimated a model with:
  • Y ~ X + all pre-defined covariates + interactions between age and binary covariates
![Page 11: Jessica Franklin Instructor in Medicine Division of Pharmacoepidemiology & Pharmacoeconomics](https://reader036.vdocuments.mx/reader036/viewer/2022062816/56816683550346895dda2a49/html5/thumbnails/11.jpg)
Simulation setup
• True outcome generation model: [equation not preserved]
  • Estimated coefficient values from the observed outcome model
  • Except for the coefficient on exposure: [value not preserved]
• To create simulated datasets:
  • Sample rows with replacement from (X, C)
  • Calculate the outcome probability for each patient in the sample
  • Simulate outcome
• We created 500 datasets, each of size 30,000, with outcome prevalence set to 5% and exposure prevalence set to 40%.
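The resampling steps above can be sketched as follows. This is a minimal illustration, assuming a logistic outcome model; the toy cohort, coefficient values, intercept, and the function name `simulate_plasmode` are all hypothetical, and the sketch omits the step that fixes exposure prevalence.

```python
import math
import random


def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))


def simulate_plasmode(rows, coefs, intercept, beta_x, n, rng):
    """Resample (X, C) rows with replacement and simulate a binary outcome."""
    simulated = []
    for _ in range(n):
        # Sampling whole rows keeps the observed exposure-covariate
        # associations and covariate correlations intact.
        x, c = rng.choice(rows)
        lin = intercept + beta_x * x + sum(b * v for b, v in zip(coefs, c))
        p = expit(lin)  # outcome probability under the chosen model
        y = 1 if rng.random() < p else 0
        simulated.append((x, c, y))
    return simulated


# Hypothetical toy cohort: (exposure, (covariate_1, covariate_2)).
cohort = [(1, (1, 0)), (0, (0, 1)), (1, (1, 1)), (0, (0, 0)), (0, (1, 0))]
rng = random.Random(2014)
data = simulate_plasmode(cohort, coefs=[0.8, -0.3], intercept=-2.0,
                         beta_x=0.0, n=1000, rng=rng)
print(sum(y for _, _, y in data) / len(data))  # simulated outcome prevalence
```

Only the outcome is synthetic here; exposure and covariates are always drawn from real rows, which is what distinguishes this from a fully synthetic simulation.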
![Page 12: Jessica Franklin Instructor in Medicine Division of Pharmacoepidemiology & Pharmacoeconomics](https://reader036.vdocuments.mx/reader036/viewer/2022062816/56816683550346895dda2a49/html5/thumbnails/12.jpg)
True causal diagram
[Diagram: nodes C1, X, Y, and C]
• Any variables associated with exposure remain associated with exposure.
• Any correlations among covariates and true confounders remain intact.
• C1 = True confounders, a subset of C = all measured covariates.
• Associations with outcome are determined by chosen simulation model.
![Page 13: Jessica Franklin Instructor in Medicine Division of Pharmacoepidemiology & Pharmacoeconomics](https://reader036.vdocuments.mx/reader036/viewer/2022062816/56816683550346895dda2a49/html5/thumbnails/13.jpg)
Outcome generation

| Variable | True OR |
| --- | --- |
| Age | 1.030928413 |
| Black race | 0.668385082 |
| Male gender | 1.418991333 |
| Congestive heart failure | 1.220575229 |
| Coronary disease | 1.184633001 |
| Prior bleeding | 10.62470195 |
| Prior ulcer | 0.777704249 |
| Recent hospitalization | 4.537106069 |
| Recent nursing home admission | 2.222756726 |
| Warfarin | 1.011494072 |
| Gastrointestinal drugs | 1.858528101 |
![Page 14: Jessica Franklin Instructor in Medicine Division of Pharmacoepidemiology & Pharmacoeconomics](https://reader036.vdocuments.mx/reader036/viewer/2022062816/56816683550346895dda2a49/html5/thumbnails/14.jpg)
The mechanics of hd-PS
• For each diagnosis, procedure, and medication code, hd-PS creates 3 potential variables:
  • Code observed ≥ 1 time during baseline period
  • Code observed ≥ median number of times
  • Code observed ≥ 75th percentile number of times
• There are 2 potential ranking methods:
  • Exposure-based: A simple RR association measure between exposure and each variable.
  • Bias-based: Bross’s bias formula, which considers the association of each variable with both exposure and outcome.
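Both steps can be illustrated with a short sketch. It assumes the median and 75th percentile are taken among patients with at least one occurrence of the code, and it ranks by the absolute log of Bross's bias multiplier, (P_C1(RR_CD − 1) + 1) / (P_C0(RR_CD − 1) + 1), where P_C1 and P_C0 are the covariate prevalences among exposed and unexposed patients and RR_CD is the covariate-outcome relative risk. The function names and input numbers are hypothetical, and a production hd-PS implementation may handle details differently.

```python
import math
import statistics


def recurrence_indicators(counts):
    """Build the three binary variables for one code from per-patient counts."""
    positive = sorted(c for c in counts if c > 0)
    if not positive:
        return [(0, 0, 0) for _ in counts]
    if len(positive) < 2:
        median = q75 = positive[0]
    else:
        # Assumption: thresholds are computed among patients with the code.
        median = statistics.median(positive)
        q75 = statistics.quantiles(positive, n=4, method="inclusive")[2]
    return [(int(c >= 1),
             int(c >= 1 and c >= median),
             int(c >= 1 and c >= q75)) for c in counts]


def bross_bias_multiplier(p_c1, p_c0, rr_cd):
    """Multiplicative bias from leaving one binary covariate unadjusted."""
    return (p_c1 * (rr_cd - 1.0) + 1.0) / (p_c0 * (rr_cd - 1.0) + 1.0)


def bias_rank_score(p_c1, p_c0, rr_cd):
    """Larger scores indicate more potential confounding."""
    return abs(math.log(bross_bias_multiplier(p_c1, p_c0, rr_cd)))


# Hypothetical counts of one diagnosis code for six patients.
indicators = recurrence_indicators([0, 1, 1, 2, 4, 8])
print(indicators)
print(bias_rank_score(0.4, 0.2, 2.0))
```

A covariate that is balanced across exposure groups, or unrelated to the outcome, gets a multiplier of 1 and therefore a score of 0.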
![Page 15: Jessica Franklin Instructor in Medicine Division of Pharmacoepidemiology & Pharmacoeconomics](https://reader036.vdocuments.mx/reader036/viewer/2022062816/56816683550346895dda2a49/html5/thumbnails/15.jpg)
hd-PS Analyses
• PSs were constructed using:
  • The top 500 exposure-ranked variables + demographics
  • The top 500 bias-ranked variables + demographics
  • The top 30 exposure-ranked variables + demographics
  • The top 30 bias-ranked variables + demographics
• Logistic regression of the outcome on exposure + deciles of each PS
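Adjusting for PS deciles means replacing the continuous score with nine indicator columns (one decile serves as the reference) in the outcome model. A minimal sketch of that encoding step, assuming rank-based deciles; the scores and function names are hypothetical.

```python
def decile_labels(scores):
    """Assign each propensity score to a decile (0 = lowest, 9 = highest) by rank."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    labels = [0] * n
    for rank, i in enumerate(order):
        labels[i] = min(9, rank * 10 // n)
    return labels


def decile_dummies(labels):
    """Indicator columns for deciles 1..9, with decile 0 as the reference."""
    return [[int(lab == d) for d in range(1, 10)] for lab in labels]


# Forty hypothetical propensity scores.
scores = [i / 100 for i in range(1, 41)]
labels = decile_labels(scores)
design = decile_dummies(labels)
print(labels)
```

The resulting columns would sit alongside the exposure indicator in the outcome regression.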
![Page 16: Jessica Franklin Instructor in Medicine Division of Pharmacoepidemiology & Pharmacoeconomics](https://reader036.vdocuments.mx/reader036/viewer/2022062816/56816683550346895dda2a49/html5/thumbnails/16.jpg)
Shrinkage analyses
• Regression of the outcome on all hdPS-screened variables (4,800, minus those that never occur) + exposure + demographics
  • Ridge regression
  • Lasso regression
• We apply no shrinkage to the coefficient on exposure.
• Calculate the crude estimate for comparison
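Leaving the exposure coefficient unpenalized can be expressed with a per-coefficient penalty factor, where a factor of 0 switches shrinkage off for that coefficient. The sketch below fits an L1-penalized logistic model by proximal gradient descent, which is a stand-in technique chosen for brevity and not necessarily the fitting method used in the study. The data, the function name `lasso_logistic`, and the tuning values are hypothetical.

```python
import math


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_threshold(z, t):
    return math.copysign(max(abs(z) - t, 0.0), z)


def lasso_logistic(X, y, lam, penalty_factor, step=0.5, iters=2000):
    """Proximal-gradient fit of an L1-penalized logistic model.

    penalty_factor[j] scales the penalty on coefficient j; 0 means no shrinkage.
    The intercept is never penalized.
    """
    n, p = len(X), len(X[0])
    intercept, beta = 0.0, [0.0] * p
    for _ in range(iters):
        # Residuals are computed once so all coefficients update simultaneously.
        resid = [expit(intercept + sum(b * v for b, v in zip(beta, row))) - yi
                 for row, yi in zip(X, y)]
        intercept -= step * sum(resid) / n
        for j in range(p):
            grad = sum(r * row[j] for r, row in zip(resid, X)) / n
            beta[j] = soft_threshold(beta[j] - step * grad,
                                     step * lam * penalty_factor[j])
    return intercept, beta


# Hypothetical data: columns are [exposure, covariate_1, covariate_2].
X, y = [], []
for i in range(40):
    x = 1 if i < 20 else 0
    X.append([x, i % 2, (i // 2) % 2])
    events = 10 if x else 4  # 10/20 events if exposed, 4/20 if unexposed
    y.append(1 if (i % 20) < events else 0)

# Heavy penalty on covariates, none on exposure.
b0, beta = lasso_logistic(X, y, lam=2.0, penalty_factor=[0, 1, 1])
print(beta)
```

With a penalty this heavy both covariates drop out, so the exposure coefficient matches the crude log odds ratio (log 4 in this toy data). If exposure were penalized as well, its coefficient would also be shrunk to zero.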
![Page 17: Jessica Franklin Instructor in Medicine Division of Pharmacoepidemiology & Pharmacoeconomics](https://reader036.vdocuments.mx/reader036/viewer/2022062816/56816683550346895dda2a49/html5/thumbnails/17.jpg)
Combination approaches
• Using the variables selected by the lasso regression:
  • Include them in a PS analysis
  • Include them in an ordinary logistic regression outcome model
• Using the 500 variables chosen by bias-based hd-PS:
  • Include them in an ordinary logistic regression outcome model
  • Include them in a lasso outcome model
  • Include them in a ridge outcome model
![Page 18: Jessica Franklin Instructor in Medicine Division of Pharmacoepidemiology & Pharmacoeconomics](https://reader036.vdocuments.mx/reader036/viewer/2022062816/56816683550346895dda2a49/html5/thumbnails/18.jpg)
Results – Variable selection
• Lasso selected 103 variables on average.
  • 66% were also selected by at least one hdPS algorithm
  • IQR: 62-70%
• Age was selected in 100% of simulations.
• Race was selected in 28%.
![Page 19: Jessica Franklin Instructor in Medicine Division of Pharmacoepidemiology & Pharmacoeconomics](https://reader036.vdocuments.mx/reader036/viewer/2022062816/56816683550346895dda2a49/html5/thumbnails/19.jpg)
Results - Bias
![Page 20: Jessica Franklin Instructor in Medicine Division of Pharmacoepidemiology & Pharmacoeconomics](https://reader036.vdocuments.mx/reader036/viewer/2022062816/56816683550346895dda2a49/html5/thumbnails/20.jpg)
Results - Bias
Crude confounding bias of 0.19.
![Page 21: Jessica Franklin Instructor in Medicine Division of Pharmacoepidemiology & Pharmacoeconomics](https://reader036.vdocuments.mx/reader036/viewer/2022062816/56816683550346895dda2a49/html5/thumbnails/21.jpg)
Results - Bias
Ridge and lasso regression with all variables reduce bias by 41% and 63%, respectively.
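To put these percentages on the same scale as the crude bias of 0.19, the short calculation below converts a percent reduction into the bias that remains. It assumes percent reduction is defined relative to the crude bias, which the slides do not state explicitly.

```python
def residual_bias(crude_bias, percent_reduction):
    """Bias remaining after a given percent reduction from the crude bias."""
    return crude_bias * (1.0 - percent_reduction / 100.0)


crude = 0.19
ridge_residual = residual_bias(crude, 41)  # about 0.112
lasso_residual = residual_bias(crude, 63)  # about 0.070
print(round(ridge_residual, 3), round(lasso_residual, 3))
```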
![Page 22: Jessica Franklin Instructor in Medicine Division of Pharmacoepidemiology & Pharmacoeconomics](https://reader036.vdocuments.mx/reader036/viewer/2022062816/56816683550346895dda2a49/html5/thumbnails/22.jpg)
Results - Bias
Ridge and lasso do better when they start with pre-screened variables. Bias is reduced by 70% and 83%, respectively.
![Page 23: Jessica Franklin Instructor in Medicine Division of Pharmacoepidemiology & Pharmacoeconomics](https://reader036.vdocuments.mx/reader036/viewer/2022062816/56816683550346895dda2a49/html5/thumbnails/23.jpg)
Results - Bias
Ordinary regression and PS approaches performed better. Exposure-based hdPS with 500 variables completely eliminated bias.
![Page 24: Jessica Franklin Instructor in Medicine Division of Pharmacoepidemiology & Pharmacoeconomics](https://reader036.vdocuments.mx/reader036/viewer/2022062816/56816683550346895dda2a49/html5/thumbnails/24.jpg)
Results - Bias
Bias-based hdPS variable selection also performed well, with 93% and 91% bias reduction in the PS and ordinary regression models.
![Page 25: Jessica Franklin Instructor in Medicine Division of Pharmacoepidemiology & Pharmacoeconomics](https://reader036.vdocuments.mx/reader036/viewer/2022062816/56816683550346895dda2a49/html5/thumbnails/25.jpg)
Results - Bias
PS and ordinary regression models also performed well with lasso variable selection (95% and 96% bias reduction).
![Page 26: Jessica Franklin Instructor in Medicine Division of Pharmacoepidemiology & Pharmacoeconomics](https://reader036.vdocuments.mx/reader036/viewer/2022062816/56816683550346895dda2a49/html5/thumbnails/26.jpg)
Results - Bias
When restricting variables to a very small set, bias-based hdPS was much preferred.
![Page 27: Jessica Franklin Instructor in Medicine Division of Pharmacoepidemiology & Pharmacoeconomics](https://reader036.vdocuments.mx/reader036/viewer/2022062816/56816683550346895dda2a49/html5/thumbnails/27.jpg)
Conclusion
• The variable selection method had relatively little importance.
• The estimation method mattered much more.
  • Shrinkage of coefficient estimates led to insufficient bias control.
• Focus on including a large number of potential confounders or confounder proxies.
![Page 28: Jessica Franklin Instructor in Medicine Division of Pharmacoepidemiology & Pharmacoeconomics](https://reader036.vdocuments.mx/reader036/viewer/2022062816/56816683550346895dda2a49/html5/thumbnails/28.jpg)
Limitations
• There are many “instruments” in the current simulation setup.
  • Variables associated with exposure that are not included in the outcome simulation model are essentially IVs, which is unrealistic.
• There is no unmeasured confounding in these data.
  • Variable selection is an easier task when all important confounders are measured.
![Page 29: Jessica Franklin Instructor in Medicine Division of Pharmacoepidemiology & Pharmacoeconomics](https://reader036.vdocuments.mx/reader036/viewer/2022062816/56816683550346895dda2a49/html5/thumbnails/29.jpg)
Future work
• Enrich the outcome model
  • Non-linear associations, more interactions, more true confounders
• Vary the true treatment effect
  • Modify the coefficient on treatment in the outcome generation model.
• Vary exposure prevalence
  • Can be accomplished by sampling within exposure group.
• Vary outcome prevalence
  • Modify the intercept in the outcome generation model.
• Unmeasured confounding
  • Set aside one or more true confounders and don’t allow methods to utilize these variables.
• Other base datasets
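Varying outcome prevalence by modifying the intercept can be sketched as a one-dimensional root-finding problem: hold every other term of the linear predictor fixed and search for the intercept whose average predicted probability equals the target. The example below uses bisection under an assumed logistic outcome model; the linear predictor values and the function name `calibrate_intercept` are hypothetical.

```python
import math


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def calibrate_intercept(linear_parts, target, lo=-20.0, hi=20.0, tol=1e-10):
    """Bisection on the intercept so the mean outcome probability hits target.

    linear_parts holds each patient's linear predictor without the intercept.
    """
    def prevalence(b0):
        return sum(expit(b0 + lp) for lp in linear_parts) / len(linear_parts)

    # Prevalence increases monotonically in the intercept, so bisection applies.
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if prevalence(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical linear predictors (exposure and covariate terms only).
linear_parts = [0.0, 0.4, -0.3, 1.2, 0.7, -1.0]
b0 = calibrate_intercept(linear_parts, target=0.05)
print(b0)
```

Only the intercept changes, so the covariate and exposure odds ratios in the outcome generation model stay as specified.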
![Page 30: Jessica Franklin Instructor in Medicine Division of Pharmacoepidemiology & Pharmacoeconomics](https://reader036.vdocuments.mx/reader036/viewer/2022062816/56816683550346895dda2a49/html5/thumbnails/30.jpg)
Thanks!
• Co-authors:
  • Wesley Eddings
  • Jeremy A Rassen
  • Robert J Glynn
  • Sebastian Schneeweiss
• Contact:
  • [email protected]
  • www.drugepi.org/faculty-staff-trainees/faculty/jessica-franklin/