Creating structured and flexible models: some open problems (gelman/presentations/mittalk2.pdf)

Outline: Weakly informative priors · Interactions in before-after studies · Interactions in regressions · Conclusions

Creating structured and flexible models: some open problems

Andrew Gelman
Department of Statistics and Department of Political Science
Columbia University

8 June 2009

Themes

- Goal: models that are structured enough to be able to learn from data but not be so strong as to overwhelm the data
- Collaborators: Joe Bafumi, Valerie Chan, Samantha Cook, Jeronimo Cortina, Zaiying Huang, Aleks Jakulin, Jouni Kerman, Gary King, Iain Pardoe, David Park, Maria Grazia Pittau, Boris Shor, Matt Stevens, Yu-Sung Su, Francis Tuerlinckx, Masanao Yajima, Shouhao Zhou

Section overview: Separation in logistic regression · Bayesian solution · Prior information · Evaluation using a corpus of datasets

Logistic regression

[Figure: the inverse-logit curve y = logit⁻¹(x), plotted for x from −6 to 6 and rising from 0 to 1; its slope at the center is 1/4.]
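The 1/4 slope at the curve's center can be checked numerically. A quick sketch in Python (the talk's own examples use R):

```python
import numpy as np

def invlogit(x):
    """Inverse logit: maps the real line onto (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# The derivative of invlogit is p*(1-p), which peaks at p = 0.5,
# giving a maximum slope of 0.5 * 0.5 = 1/4. Check with a
# central difference at x = 0:
h = 1e-6
slope_at_zero = (invlogit(h) - invlogit(-h)) / (2 * h)
print(round(slope_at_zero, 6))  # prints 0.25
```

This is the basis of the "divide by 4" shortcut for reading logistic regression coefficients as maximum changes in probability.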

A clean example

[Figure: binary data y against x from −10 to 20, with fitted curve estimated Pr(y = 1) = logit⁻¹(−1.40 + 0.33x); slope at the midpoint is 0.33/4.]

The problem of separation

[Figure: binary data in which every y = 0 has x < 0 and every y = 1 has x > 0; the fitted logistic curve steepens into a step function (slope = infinity?).]

Separation is no joke!

glm(vote ~ female + black + income, family=binomial(link="logit"))

                  1960               1968
             coef.est coef.se   coef.est coef.se
(Intercept)    -0.14   0.23       0.47   0.24
female          0.24   0.14      -0.01   0.15
black          -1.03   0.36      -3.64   0.59
income          0.03   0.06      -0.03   0.07

                  1964               1972
             coef.est coef.se   coef.est coef.se
(Intercept)    -1.15   0.22       0.67   0.18
female         -0.09   0.14      -0.25   0.12
black         -16.83 420.40      -2.63   0.27
income          0.19   0.06       0.09   0.05
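The 1964 column shows what separation does to maximum likelihood: when a predictor value perfectly predicts the outcome in the sample, the likelihood has no maximum and the estimate drifts toward infinity. A toy demonstration in Python (made-up data, not the survey above):

```python
import numpy as np

# Made-up, perfectly separated data: y = 1 exactly when x > 0.
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0])

def grad_loglik(b):
    """Gradient of the logistic log-likelihood, slope-only model."""
    p = 1.0 / (1.0 + np.exp(-b * x))
    return np.sum((y - p) * x)

# Gradient ascent: under separation the gradient stays positive,
# so the "maximum likelihood" slope keeps growing (roughly like
# the log of the iteration count) and never converges.
b, snapshots = 0.0, []
for t in range(1, 30001):
    b += 0.5 * grad_loglik(b)
    if t in (100, 1000, 30000):
        snapshots.append(b)
print(snapshots)  # strictly increasing, no limit
```

glm's iterative fit stops somewhere along this path, which is why the 1964 output reports an arbitrary huge coefficient with an enormous standard error.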

bayesglm()

- Bayesian logistic regression
- In the arm (Applied Regression and Multilevel modeling) package in R
- Replaces glm(); estimates are more numerically and computationally stable
- Student-t prior distributions for regression coefs
- Uses an EM-like algorithm
- We went inside glm.fit to augment the iteratively weighted least squares step
- Default choices for tuning parameters (we'll get back to this!)
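The core idea can be sketched without arm's EM-within-IWLS machinery: maximize the log-likelihood plus independent heavy-tailed log-priors. A minimal MAP sketch in Python (this is an illustration of the idea, not the package's actual algorithm; the toy data are mine):

```python
import numpy as np
from scipy.optimize import minimize

# Toy separated data: the ML slope is infinite, but the
# posterior mode under a Cauchy(0, 2.5) prior is not.
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0])

def neg_log_posterior(beta, scale=2.5):
    eta = beta[0] + beta[1] * x
    # Stable Bernoulli log-likelihood: y*eta - log(1 + exp(eta))
    loglik = np.sum(y * eta - np.logaddexp(0.0, eta))
    # Independent Cauchy(0, scale) log-priors, up to a constant
    logprior = -np.sum(np.log1p((beta / scale) ** 2))
    return -(loglik + logprior)

fit = minimize(neg_log_posterior, np.zeros(2))
intercept, slope = fit.x  # slope comes out finite and moderate
```

Generic optimization works here; the package instead augments the iteratively weighted least squares step so the prior costs essentially nothing on top of a glm fit.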

Regularization in action!

What else is out there?

- glm (maximum likelihood): fails under separation, gives noisy answers for sparse data
- Augment with prior "successes" and "failures": doesn't work well for multiple predictors
- brlr (Jeffreys-like prior distribution): computationally unstable
- brglm (improvement on brlr): doesn't do enough smoothing
- BBR (Laplace prior distribution): OK, not quite as good as bayesglm
- Non-Bayesian machine learning algorithms: understate uncertainty in predictions

Information in prior distributions

- Informative prior dist
  - A full generative model for the data
- Noninformative prior dist
  - Let the data speak
  - Goal: valid inference for any θ
- Weakly informative prior dist
  - Purposely include less information than we actually have
  - Goal: regularization, stabilization

Weakly informative priors for logistic regression coefficients

- Separation in logistic regression
- Some prior info: logistic regression coefs are almost always between −5 and 5:
  - 5 on the logit scale takes you from 0.01 to 0.50, or from 0.50 to 0.99
  - Smoking and lung cancer
- Independent Cauchy prior dists with center 0 and scale 2.5
- Rescale each predictor to have mean 0 and sd 1/2
- Fast implementation using EM; easy adaptation of glm
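The rescaling step is a one-liner. A Python sketch (the helper name is mine, not arm's):

```python
import numpy as np

def rescale(x):
    """Shift a numeric predictor to mean 0 and rescale it to
    sd 1/2, so one prior scale (2.5) is reasonable for every
    coefficient and numeric inputs are roughly comparable to
    centered binary ones."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (2 * x.std())

z = rescale([1, 2, 3, 4, 5])
# z now has mean 0 and standard deviation 0.5
```

With inputs on this common scale, a coefficient of 5 corresponds to a two-sd change in the predictor swinging the outcome across most of the probability range, which is what makes the "coefs are almost always between −5 and 5" prior defensible.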

Prior distributions

[Figure: candidate prior densities for θ, plotted over −10 to 10.]

Another example

Dose    #deaths/#animals
−0.86         0/5
−0.30         1/5
−0.05         3/5
 0.73         5/5

- Slope of a logistic regression of Pr(death) on dose:
  - Maximum likelihood est is 7.8 ± 4.9
  - With weakly informative prior: Bayes est is 4.4 ± 1.9
- Which is truly conservative?
- The sociology of shrinkage
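The maximum likelihood fit can be reproduced directly from the table. A Python sketch using generic optimization (the talk's analyses are in R):

```python
import numpy as np
from scipy.optimize import minimize

# Dose-response data from the table: deaths out of n = 5 per dose.
dose   = np.array([-0.86, -0.30, -0.05, 0.73])
deaths = np.array([0.0, 1.0, 3.0, 5.0])
n = 5.0

def neg_loglik(beta):
    eta = beta[0] + beta[1] * dose
    # Binomial log-likelihood in the stable form
    # y*eta - n*log(1 + exp(eta))
    return -np.sum(deaths * eta - n * np.logaddexp(0.0, eta))

fit = minimize(neg_loglik, np.zeros(2))
intercept, slope = fit.x  # slope should land near the 7.8 quoted above
```

With only 20 animals the likelihood surface is flat and heavy-tailed, so the weakly informative prior pulls the slope roughly in half while shrinking its standard error much more.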

Maximum likelihood and Bayesian estimates

[Figure: probability of death vs. dose for the data above, with fitted curves from glm and bayesglm; the glm curve is steeper.]

Conservatism of Bayesian inference

- Problems with maximum likelihood when data show separation:
  - Coefficient estimate of −∞
  - Estimated predictive probability of 0 for new cases
- Is this conservative?
- Not if evaluated by log score or predictive log-likelihood
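The log-score point is stark: a fit that assigns probability 0 to an event that then occurs scores minus infinity, so the degenerate ML fit is the opposite of conservative. A sketch in Python with made-up predictions:

```python
import numpy as np

def log_score(p_pred, y):
    """Sum of log predictive probabilities of the observed outcomes."""
    p_obs = np.where(y == 1, p_pred, 1 - p_pred)
    with np.errstate(divide="ignore"):
        return np.sum(np.log(p_obs))

y_new = np.array([1, 0, 1])

# Degenerate fit under separation: certain, and wrong once.
print(log_score(np.array([0.0, 0.0, 1.0]), y_new))  # prints -inf
# Regularized fit: hedged, never ruled anything out.
print(log_score(np.array([0.9, 0.1, 0.9]), y_new))  # finite
```

One confidently wrong prediction is enough to make the unregularized model infinitely worse under this criterion.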


Which one is conservative?

[Figure: probability of death vs. dose (0 to 20), comparing the glm and bayesglm fitted curves]


Prior as population distribution

I Consider many possible datasets

I The “true prior” is the distribution of β’s across these datasets

I Fit one dataset at a time

I A “weakly informative prior” has less information (wider variance) than the true prior

I Open question: How to formalize the tradeoffs from using different priors?
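The first bullets can be made concrete with a toy simulation (not from the talk): draw β's from a known "true prior", observe each with noise, and shrink each estimate with normal priors of varying scale, using the normal-normal posterior-mean factor s²/(s² + σ²) as a stand-in for a full Bayesian fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# "True prior": the betas underlying many hypothetical datasets
true_beta = rng.normal(0.0, 1.0, size=20_000)

# Each dataset yields a noisy estimate of its beta (sampling sd sigma)
sigma = 1.0
est = true_beta + rng.normal(0.0, sigma, size=true_beta.size)

# Posterior mean under a N(0, s^2) prior shrinks the estimate by
# s^2 / (s^2 + sigma^2); compare average error across prior scales
mse = {}
for s in [0.5, 1.0, 2.5, 10.0]:
    shrink = s**2 / (s**2 + sigma**2)
    mse[s] = np.mean((shrink * est - true_beta) ** 2)
    print(f"prior scale {s:>4}: mean squared error {mse[s]:.3f}")
```

The error is minimized when the prior scale matches the true prior (s = 1 here); wider, weakly informative scales pay an efficiency cost, which is one way to begin quantifying the tradeoff in the last bullet.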


Evaluation using a corpus of datasets

I Compare classical glm to Bayesian estimates using various prior distributions

I Evaluate using 5-fold cross-validation and average predictive error

I The optimal prior distribution for β’s is (approx) Cauchy(0, 1)

I Our Cauchy(0, 2.5) prior distribution is weakly informative!
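The evaluation scheme can be sketched as follows. This is an illustrative stand-in, not the corpus study itself: a single simulated logistic dataset, a Gaussian prior in place of the Cauchy family (so the MAP fit is just penalized IRLS), and 5-fold cross-validation of the held-out log-likelihood across prior scales.

```python
import numpy as np

rng = np.random.default_rng(1)

# One simulated logistic-regression dataset (stand-in for a corpus)
n_obs, n_pred = 200, 5
X = rng.normal(size=(n_obs, n_pred))
true_beta = rng.normal(0.0, 1.0, n_pred)
y = (rng.random(n_obs) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)

def fit_map(X, y, scale, iters=25):
    """Logistic MAP estimate under independent N(0, scale^2) priors (penalized IRLS)."""
    beta = np.zeros(X.shape[1])
    lam = 1.0 / scale**2
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        hess = X.T @ (w[:, None] * X) + lam * np.eye(X.shape[1])
        grad = X.T @ (y - p) - lam * beta
        beta = beta + np.linalg.solve(hess, grad)  # Newton step
    return beta

def avg_test_loglik(scale, k=5):
    """Average held-out log-likelihood per observation, k-fold CV."""
    idx = np.arange(n_obs)
    total = 0.0
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        beta = fit_map(X[train], y[train], scale)
        p = np.clip(1.0 / (1.0 + np.exp(-X[fold] @ beta)), 1e-12, 1 - 1e-12)
        total += np.sum(y[fold] * np.log(p) + (1 - y[fold]) * np.log(1 - p))
    return total / n_obs

for scale in [0.5, 1.0, 2.5, 10.0]:
    print(f"prior scale {scale:>4}: held-out log-lik per obs {avg_test_loglik(scale):.3f}")
```

The slides' comparison is this computation run over many real datasets, with the Cauchy/t family of priors in place of the Gaussian used here.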


Expected predictive loss, averaged over a corpus of datasets

[Figure: −log test likelihood (0.29–0.33) vs. scale of prior (0–5), for t priors with df = 0.5, 1, 2, 4, 8 and for BBR(l) and BBR(g); the GLM value (1.79) lies far above the plotted range]


Other examples of weakly informative priors

I Variance parameters

I Covariance matrices

I Population variation in a physiological model

I Mixture models

I Intentional underpooling in hierarchical models


End of part 1

I “Noninformative priors” are actually weakly informative

I “Weakly informative” is a more general and useful concept
I Regularization

I Better inferences
I Stability of computation (bayesglm)

I Why use weakly informative priors rather than informative priors?

I Conformity with statistical culture (“conservatism”)
I Labor-saving device
I Robustness


Legislative redistricting · Educational experiment · Incumbency advantage · Hierarchical models for treatment interactions

No-interaction model

I Before-after data with treatment and control groups
I Default model: constant treatment effects

I Fisher’s classical null hypothesis: effect is zero for all cases
I Regression model: yi = Tiθ + Xiβ + εi

[Figure: sketch of the “after” measurement y vs. the “before” measurement x, with lines for the control and treatment groups]


Actual data show interactions

I Treatment interacts with “before” measurement

I Before-after correlation is higher for controls than for treated units

I Examples:
I An observational study of legislative redistricting
I An experiment with pre-test, post-test data
I Congressional elections with incumbents and open seats


Observational study of legislative redistricting: before-after data

[Figure: estimated partisan bias (adjusted for state) vs. estimated partisan bias in the previous election, both on a −0.05 to 0.05 scale, with annotations marking the directions favoring Democrats and Republicans; points are coded as no redistricting, bipartisan redistricting, Democratic redistricting, or Republican redistricting]


Educational experiment: correlation between pre-test and post-test data for controls and for treated units

[Figure: before-after correlation (0.8–1.0) by grade (1–4), plotted separately for controls and treated units]


Correlation between two successive Congressional elections, for incumbents running (controls) and open seats (treated)

[Figure: correlation (0.0–0.8) by election year (1900–2000), plotted separately for incumbents and open seats]


Interactions as variance components

I yi = Tiθ + Xiβ + εi + ηi

I Unit-level “error term” ηi

I For control units, ηi persists from time 1 to time 2
I For treatment units, ηi changes:

I Subtractive treatment error (ηi only at time 1)
I Additive treatment error (ηi only at time 2)
I Replacement treatment error

I Under all these models, the before-after correlation is higher for controls than for treated units
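The last bullet is easy to check by simulation. A minimal sketch using unit-variance components, "replacement treatment error", and an assumed treatment effect θ = 0.5 (the particular numbers are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
theta = 0.5  # assumed constant treatment effect (illustrative)

# Unit-level error eta, plus fresh measurement noise each period
eta = rng.normal(0.0, 1.0, n)
before = eta + rng.normal(0.0, 1.0, n)

# Controls: eta persists from time 1 to time 2
after_control = eta + rng.normal(0.0, 1.0, n)

# Treated ("replacement treatment error"): eta is redrawn at time 2
after_treated = theta + rng.normal(0.0, 1.0, n) + rng.normal(0.0, 1.0, n)

r_control = np.corrcoef(before, after_control)[0, 1]
r_treated = np.corrcoef(before, after_treated)[0, 1]
print(f"before-after correlation, controls: {r_control:.2f}")  # ≈ 0.5
print(f"before-after correlation, treated:  {r_treated:.2f}")  # ≈ 0.0
```

The subtractive and additive variants give the same qualitative ordering: any mechanism that disturbs ηi for treated units lowers their before-after correlation relative to controls.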

Andrew Gelman Creating structured and flexible models: some open problems

Weakly informative priorsInteractions in before-after studies

Interactions in regressionsConclusions

Legislative redistrictingEducational experimentIncumbency advantageHierarchical models for treatment interactions

Interactions as variance components

I yi = Tiθ + Xiβ + εi + ηi

I Unit-level “error term” ηi

I For control units, ηi persists from time 1 to time 2I For treatment units, ηi changes:

I Subtractive treatment error (ηi only at time 1)I Additive treatment error (ηi only at time 2)I Replacement treatment error

I Under all these models, the before-after correlation is higherfor controls than treated units

Andrew Gelman Creating structured and flexible models: some open problems

Weakly informative priorsInteractions in before-after studies

Interactions in regressionsConclusions

Legislative redistrictingEducational experimentIncumbency advantageHierarchical models for treatment interactions

Interactions as variance components

I yi = Tiθ + Xiβ + εi + ηi

I Unit-level “error term” ηi

I For control units, ηi persists from time 1 to time 2I For treatment units, ηi changes:

I Subtractive treatment error (ηi only at time 1)I Additive treatment error (ηi only at time 2)I Replacement treatment error

I Under all these models, the before-after correlation is higherfor controls than treated units

Andrew Gelman Creating structured and flexible models: some open problems

Weakly informative priorsInteractions in before-after studies

Interactions in regressionsConclusions

Legislative redistrictingEducational experimentIncumbency advantageHierarchical models for treatment interactions

End of part 2

- Treatment and control can be modeled asymmetrically
- Additional variance component

Federal spending | Vote preferences | Income and voting | Incentives in sample surveys

Examples of interactions in regression

- Federal spending by state, year, category (50 × 19 × 10)
- Vote preference given state and demographic variables (50 × 2 × 2 × 4 × 4)
- Rich state, poor state, red state, blue state (50 × 2 for each election)
- Meta-analysis of incentives in sample surveys (26)

General concerns

- Lots of potential interactions
- Setting high-level interactions to zero? Too extreme, especially when interactions are of substantive interest
- Simple hierarchical model for interactions is too crude
- Model: large main effects can have large interactions. In a hierarchical setting, the model should come "naturally"

Federal spending by state

- Federal spending by state, year, category (50 × 19 × 10)
- For each spending category, a 50 × 19 data structure
- y_jt = α_j + β_t + γ_jt
- Possible model: γ_jt ∼ N(0, A + B|α_j β_t|)
- Some example data
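The proposed variance model can be simulated directly. A hypothetical sketch, where the constants A and B and the main-effect scales are invented for illustration and the second argument of the normal is treated as a standard deviation:

```python
import numpy as np

rng = np.random.default_rng(1)
J, T = 50, 19                        # states × years, as in the spending data
A, B = 0.02, 1.0                     # hypothetical variance-model constants

alpha = rng.normal(0, 0.2, size=J)   # state main effects alpha_j (assumed scale)
beta = rng.normal(0, 0.2, size=T)    # year main effects beta_t (assumed scale)

# Interaction sd grows with the product of main effects: sd_jt = A + B*|alpha_j*beta_t|
sd_jt = A + B * np.abs(np.outer(alpha, beta))
gamma = rng.normal(0, sd_jt)         # gamma_jt drawn cell by cell

y = alpha[:, None] + beta[None, :] + gamma   # cell values y_jt
```

Cells with large |α_j β_t| then tend to have large interactions, which is the pattern the next plot checks against the data.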

Interactions |γ_jt| plotted vs. main effects |α_j β_t|

[Scatterplot for the "Disretire" category of federal spending (inflation-adjusted): |a_j*b_t| on the x-axis (0.00 to 0.08), |e_jt| on the y-axis (0.00 to 0.20)]

Logistic regression for pre-election polls

- Logistic regression predicting vote from demographic and geographic factors: sex, ethnicity, age, education, state
- Hierarchical model for 4 age levels, 4 education levels, 16 age × education, 50 states
- Also consider interactions such as ethnicity × state
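The structure of such a model can be sketched by simulation. The batch standard deviations below are invented for illustration; in the real model they are hyperparameters estimated from the poll data:

```python
import numpy as np

rng = np.random.default_rng(2)

def inv_logit(x):
    return 1 / (1 + np.exp(-x))

n_age, n_edu, n_state = 4, 4, 50     # batch sizes from the slide

# Each batch of coefficients is shrunk toward 0 with its own sd (assumed values)
a_age = rng.normal(0, 0.3, size=n_age)
a_edu = rng.normal(0, 0.3, size=n_edu)
a_age_edu = rng.normal(0, 0.15, size=(n_age, n_edu))  # 16 age × education terms
a_state = rng.normal(0, 0.4, size=n_state)

def p_republican(age, edu, state, intercept=0.0):
    """P(Republican vote) for one demographic cell (sketch; ethnicity, sex omitted)."""
    return inv_logit(intercept + a_age[age] + a_edu[edu]
                     + a_age_edu[age, edu] + a_state[state])

p = p_republican(age=2, edu=1, state=33)
```

The point of the hierarchical setup is that the 16 age × education interaction terms share one sd, so the data decide how much the interactions are shrunk toward zero.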

Prediction error as function of number of predictors

[Two panels: MSE in the training sample and MSE in the test sample, each plotted against the number of parameters (0 to 150); MSE axis runs from 0.22 to 0.25]

Red state, blue state, rich state, poor state

- Richer voters favor the Republicans, but
- Richer states favor the Democrats
- Hierarchical logistic regression: predict your vote given your income and your state ("varying-intercept model")
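The varying-intercept model lets each state shift a common income curve up or down; adding varying slopes lets income matter more in some states than in others. A sketch with invented coefficients (the real estimates come from the hierarchical fit), chosen so that Mississippi sits above Ohio above Connecticut, as in the figures:

```python
import numpy as np

def inv_logit(x):
    return 1 / (1 + np.exp(-x))

income = np.linspace(-2, 2, 5)       # standardized income, as on the plots

# (intercept, income slope) per state -- hypothetical values for illustration
states = {"Connecticut": (-0.5, 0.2),
          "Ohio": (0.0, 0.35),
          "Mississippi": (0.4, 0.6)}

# P(vote Republican) = inv_logit(a_state + b_state * income)
curves = {name: inv_logit(a + b * income) for name, (a, b) in states.items()}
```

Forcing a common slope b gives the varying-intercept model; letting b vary by state gives the varying-intercept, varying-slope model, and the state-level slopes are themselves interactions worth modeling.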

Varying-intercept model

[Plot: probability of voting Republican (0.25 to 0.75) vs. income (−2 to 2), with one curve per state — curves for Connecticut, Ohio, and Mississippi shifted by state intercepts but sharing a common income slope]

Varying-intercept, varying-slope model

[Plot: probability of voting Republican (0.4 to 0.8) vs. income (−2 to 2), with state-specific intercepts and slopes for Connecticut, Ohio, and Mississippi]

Interactions!

[Scatterplot "Avg Income 2000 vs. Var Slope 2000": average state income in $10k units (2.0 to 3.5) on the x-axis, estimated income slope (0.0 to 0.5) on the y-axis, all 50 states labeled by postal abbreviation]

3-way interactions!

[Six panels, one per election year (1980, 1984, 1988, 1992, 1996, 2000): probability of voting Republican (0.1 to 0.9) vs. income (−2 to 2), with curves for Connecticut, Ohio, and Mississippi in each panel]

Meta-analysis of effects of incentives on survey response rates

- 6 factors:
  - Incentive or not
  - Value of incentive
  - Form (gift or cash)
  - Timing (before or after)
  - Mode (telephone or face-to-face)
  - Burden (short or long survey)
- Models:
  - No interactions: estimates don't make sense
  - Interactions: estimates are out of control

Model without interactions

- Estimated effects on response rate (in percentage points):

                        Beta (s.e.)
  Intercept             1.4  (1.6)
  Value of incentive    0.34 (0.17)
  Prepayment            2.8  (1.8)
  Gift                 −6.9  (1.5)
  Burden                3.3  (1.3)

- Will a low-value postpaid gift really reduce response rates by 7 percentage points??
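The implausible prediction can be read straight off the coefficients. A quick check using the table's point estimates, with an assumed 0/1 coding for the indicators (gift = 1 for a gift incentive, prepay = 1 for prepayment, burden = 1 for a high-burden survey):

```python
# Point estimates from the no-interaction model (percentage points);
# the 0/1 coding of the indicators is an assumption for illustration
coef = {"intercept": 1.4, "value": 0.34, "prepay": 2.8, "gift": -6.9, "burden": 3.3}

def incentive_effect(value, prepay, gift, burden):
    """Predicted change in response rate from offering an incentive."""
    return (coef["intercept"] + coef["value"] * value + coef["prepay"] * prepay
            + coef["gift"] * gift + coef["burden"] * burden)

# A $1 postpaid gift in a low-burden survey: under this model the incentive
# *lowers* the predicted response rate by about 5 points -- not believable
effect = incentive_effect(value=1, prepay=0, gift=1, burden=0)
```

The −6.9 gift coefficient is what drives the roughly 7-point drop questioned on the slide; an additive model with no interactions has no way to avoid it.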

Models with interactions

                               Model I      Model II     Model III     Model IV
Constant                       60.7 (2.2)   60.8 (2.5)   61.0 (2.5)    60.1 (2.5)
Incentive                       5.4 (0.7)    3.7 (0.8)    2.8 (1.0)     6.1 (1.2)
Mode                           15.2 (4.7)   16.1 (5.1)   16.0 (4.9)    18.0 (4.6)
Burden                         −7.2 (4.3)   −8.9 (5.0)   −8.7 (5.0)    −9.9 (5.0)
Mode × Burden                               −7.6 (9.8)   −7.8 (9.4)    −4.9 (9.1)
Incentive × Value                           0.14 (0.03)  0.33 (0.09)   0.26 (0.09)
Incentive × Timing                           4.4 (1.3)    1.7 (1.7)    −0.2 (2.1)
Incentive × Form                             1.4 (1.3)    1.1 (1.2)    −1.2 (2.0)
Incentive × Mode                            −2.3 (1.6)   −2.0 (1.7)     7.8 (2.9)
Incentive × Burden                           4.8 (1.5)    5.4 (1.8)    −5.2 (2.7)
Incentive × Value × Timing                               0.40 (0.17)   0.58 (0.18)
Incentive × Value × Burden                              −0.06 (0.06)   1.10 (0.24)
Incentive × Timing × Burden                                            11.1 (3.9)
Incentive × Value × Form                                               0.30 (0.20)
Incentive × Value × Mode                                              −1.20 (0.24)
Incentive × Timing × Form                                               9.9 (2.7)
Incentive × Timing × Mode                                             −17.4 (4.1)
Incentive × Form × Mode                                                −0.3 (2.5)
Incentive × Form × Burden                                               5.9 (3.2)
Incentive × Mode × Burden                                              −5.8 (3.0)
Within-study sd, σ              4.2 (0.3)    3.6 (0.3)    3.6 (0.3)     2.8 (0.3)
Between-study sd, τ            18 (2)       19 (2)       18 (2)        18 (2)

Structured hierarchical models

- Need to go beyond exchangeability to shrink batches of parameters in a reasonable way
- For example, parameter matrices α_jk don't look like exchangeable vectors
- Similar problems arise in shrinking higher-order terms in neural nets, wavelets, tree models, image models, ...
- Recall the "blessing of dimensionality": as the number of factors, and the number of levels per factor, increases, more information is available to estimate the hyperparameters of the big model
- In the background: advances in Bayesian computation including parameter expansion (Meng, Liu, Liu, Rubin, van Dyk), adaptive Metropolis algorithms (Pasarica), structured computations (Kerman)

Andrew Gelman Creating structured and flexible models: some open problems

Weakly informative priorsInteractions in before-after studies

Interactions in regressionsConclusions

Federal spendingVote preferencesIncome and votingIncentives in sample surveys

Structured hierarchical models

I Need to go beyond exchangeability to shrink batches ofparameters in a reasonable way

I For example, parameter matrices αjk don’t look likeexchangeable vectors

I Similar problems arise in shrinking higher-order terms inneural nets, wavelets, tree models, image models, . . .

I Recall the “blessing of dimensionality”: as the number offactors, and the number of levels per factor, increases, moreinformation is available to estimate the hyperparameters ofthe big model

I In the background: advances in Bayesian computationincluding parameter expansion (Meng, Liu, Liu, Rubin, vanDyk), adaptive Metropolis algorithms (Pasarica), structuredcomputations (Kerman)

Andrew Gelman Creating structured and flexible models: some open problems

Weakly informative priorsInteractions in before-after studies

Interactions in regressionsConclusions

Federal spendingVote preferencesIncome and votingIncentives in sample surveys

Structured hierarchical models

I Need to go beyond exchangeability to shrink batches ofparameters in a reasonable way

I For example, parameter matrices αjk don’t look likeexchangeable vectors

I Similar problems arise in shrinking higher-order terms inneural nets, wavelets, tree models, image models, . . .

I Recall the “blessing of dimensionality”: as the number offactors, and the number of levels per factor, increases, moreinformation is available to estimate the hyperparameters ofthe big model

I In the background: advances in Bayesian computationincluding parameter expansion (Meng, Liu, Liu, Rubin, vanDyk), adaptive Metropolis algorithms (Pasarica), structuredcomputations (Kerman)


End of part 3

I Interactions are important

I Technical challenges in modeling and computation

I Blessing of dimensionality
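The “blessing of dimensionality” can be made concrete with a small simulation (not from the talk; all numbers are invented): as the number of groups grows, the group-level scale, a hyperparameter of the big model, is estimated more and more precisely.

```python
import numpy as np

# Sketch: estimating the group-level sd sigma_alpha from J observed
# group means, each measured with known noise sd sigma_y / sqrt(n).
rng = np.random.default_rng(1)
sigma_alpha, sigma_y, n = 1.0, 1.0, 10   # invented true values

def scale_mse(J, reps=500):
    """Mean squared error of a method-of-moments estimate of sigma_alpha."""
    errs = []
    for _ in range(reps):
        alpha = rng.normal(0, sigma_alpha, J)
        ybar = alpha + rng.normal(0, sigma_y / np.sqrt(n), J)
        # E[var(ybar)] = sigma_alpha^2 + sigma_y^2 / n, so subtract the
        # known noise contribution (truncated at zero)
        s2_hat = max(ybar.var(ddof=1) - sigma_y**2 / n, 0.0)
        errs.append((np.sqrt(s2_hat) - sigma_alpha) ** 2)
    return float(np.mean(errs))

mse_few, mse_many = scale_mse(5), scale_mse(100)
print(mse_many < mse_few)
```

With 5 groups the scale estimate is wild; with 100 it is tight. More factors, and more levels per factor, mean more such batches, each one pinning down its own hyperparameter.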


Another example: who supports school vouchers?


What have we learned?

I Models need structure but not too much structure

I Interactions are important

I Treatment interactions in before-after studies

I 2-way, 3-way, . . . , interactions in regression models

I Conservatism in statistics

I Weak prior information is key

