
The Stata Journal
Volume 10, Number 4, 2010

A Stata Press publication
StataCorp LP
College Station, Texas

The Stata Journal

Editor
H. Joseph Newton
Department of Statistics
Texas A&M University
College Station, Texas 77843
979-845-8817; fax
[email protected]

Editor
Nicholas J. Cox
Department of Geography
Durham University
South Road
Durham DH1 3LE UK
[email protected]

Associate Editors

Christopher F. Baum, Boston College
Nathaniel Beck, New York University
Rino Bellocco, Karolinska Institutet, Sweden, and University of Milano-Bicocca, Italy
Maarten L. Buis, Tubingen University, Germany
A. Colin Cameron, University of California–Davis
Mario A. Cleves, Univ. of Arkansas for Medical Sciences
William D. Dupont, Vanderbilt University
David Epstein, Columbia University
Allan Gregory, Queen's University
James Hardin, University of South Carolina
Ben Jann, University of Bern, Switzerland
Stephen Jenkins, University of Essex
Ulrich Kohler, WZB, Berlin
Frauke Kreuter, University of Maryland–College Park
Peter A. Lachenbruch, Oregon State University
Jens Lauritsen, Odense University Hospital
Stanley Lemeshow, Ohio State University
J. Scott Long, Indiana University
Roger Newson, Imperial College, London
Austin Nichols, Urban Institute, Washington DC
Marcello Pagano, Harvard School of Public Health
Sophia Rabe-Hesketh, University of California–Berkeley
J. Patrick Royston, MRC Clinical Trials Unit, London
Philip Ryan, University of Adelaide
Mark E. Schaffer, Heriot-Watt University, Edinburgh
Jeroen Weesie, Utrecht University
Nicholas J. G. Winter, University of Virginia
Jeffrey Wooldridge, Michigan State University

Stata Press Editorial Manager: Lisa Gilmore
Stata Press Copy Editors: Deirdre Patterson and Erin Roberson

The Stata Journal publishes reviewed papers together with shorter notes or comments, regular columns, book reviews, and other material of interest to Stata users. Examples of the types of papers include 1) expository papers that link the use of Stata commands or programs to associated principles, such as those that will serve as tutorials for users first encountering a new field of statistics or a major new technique; 2) papers that go "beyond the Stata manual" in explaining key features or uses of Stata that are of interest to intermediate or advanced users of Stata; 3) papers that discuss new commands or Stata programs of interest either to a wide spectrum of users (e.g., in data management or graphics) or to some large segment of Stata users (e.g., in survey statistics, survival analysis, panel analysis, or limited dependent variable modeling); 4) papers analyzing the statistical properties of new or existing estimators and tests in Stata; 5) papers that could be of interest or usefulness to researchers, especially in fields that are of practical importance but are not often included in texts or other journals, such as the use of Stata in managing datasets, especially large datasets, with advice from hard-won experience; and 6) papers of interest to those who teach, including Stata with topics such as extended examples of techniques and interpretation of results, simulations of statistical concepts, and overviews of subject areas.

For more information on the Stata Journal, including information for authors, see the webpage

http://www.stata-journal.com

The Stata Journal is indexed and abstracted in the following:

• CompuMath Citation Index®
• Current Contents/Social and Behavioral Sciences®
• RePEc: Research Papers in Economics
• Science Citation Index Expanded (also known as SciSearch®)
• Scopus™
• Social Sciences Citation Index®

Copyright Statement: The Stata Journal and the contents of the supporting files (programs, datasets, and help files) are copyright © by StataCorp LP. The contents of the supporting files (programs, datasets, and help files) may be copied or reproduced by any means whatsoever, in whole or in part, as long as any copy or reproduction includes attribution to both (1) the author and (2) the Stata Journal.

The articles appearing in the Stata Journal may be copied or reproduced as printed copies, in whole or in part, as long as any copy or reproduction includes attribution to both (1) the author and (2) the Stata Journal. Written permission must be obtained from StataCorp if you wish to make electronic copies of the insertions. This precludes placing electronic copies of the Stata Journal, in whole or in part, on publicly accessible websites, fileservers, or other locations where the copy may be accessed by anyone other than the subscriber.

Users of any of the software, ideas, data, or other materials published in the Stata Journal or the supporting files understand that such use is made without warranty of any kind, by either the Stata Journal, the author, or StataCorp. In particular, there is no warranty of fitness of purpose or merchantability, nor for special, incidental, or consequential damages such as loss of profits. The purpose of the Stata Journal is to promote free communication among Stata users.

The Stata Journal, electronic version (ISSN 1536-8734) is a publication of Stata Press. Stata, Mata, NetCourse, and Stata Press are registered trademarks of StataCorp LP.

Subscriptions are available from StataCorp, 4905 Lakeway Drive, College Station, Texas 77845, telephone 979-696-4600 or 800-STATA-PC, fax 979-696-4601, or online at

http://www.stata.com/bookstore/sj.html

Subscription rates

The listed subscription rates include both a printed and an electronic copy unless otherwise mentioned.

Subscriptions mailed to U.S. and Canadian addresses:
  3-year subscription                        $195
  2-year subscription                        $135
  1-year subscription                        $ 69
  1-year student subscription                $ 42
  1-year university library subscription     $ 89
  1-year institutional subscription          $195

Subscriptions mailed to other countries:
  3-year subscription                        $285
  2-year subscription                        $195
  1-year subscription                        $ 99
  3-year subscription (electronic only)      $185
  1-year student subscription                $ 69
  1-year university library subscription     $119
  1-year institutional subscription          $225

Back issues of the Stata Journal may be ordered online at

http://www.stata.com/bookstore/sjj.html

Individual articles three or more years old may be accessed online without charge. More recent articles may be ordered online.

http://www.stata-journal.com/archives.html

The Stata Journal is published quarterly by the Stata Press, College Station, Texas, USA.

Address changes should be sent to the Stata Journal, StataCorp, 4905 Lakeway Drive, College Station, TX 77845, USA, or emailed to [email protected].

Volume 10 Number 4 2010

The Stata Journal

Articles and Columns 507

A suite of commands for fitting the skew-normal and skew-t models
    . . . . . . . . . . . . . . . Y. V. Marchenko and M. G. Genton   507
Fitting heterogeneous choice models with oglm . . . . . R. Williams  540
Frequentist q-values for multiple-test procedures . . R. B. Newson   568
Making spatial analysis operational: Commands for generating
    spatial-effect variables in monadic and dyadic data
    . . . . . . . . . . . . . . . . . E. Neumayer and T. Plumper     585
Age–period–cohort modeling
    . . . . . M. J. Rutherford, P. C. Lambert, and J. R. Thompson    606
A simple feasible procedure to fit models with high-dimensional
    fixed effects . . . . . . . . . . P. Guimaraes and P. Portugal   628
Variable selection in linear regression
    . . . . . . . . . . . . . . . . . C. Lindsey and S. Sheather     650
Speaking Stata: Graphing subsets . . . . . . . . . . . . N. J. Cox   670

Notes and Comments 682

Stata tip 68: Week assumptions . . . . . . . . . . . . . N. J. Cox   682
Stata tip 92: Manual implementation of permutations and bootstraps
    . . . . . . . . . . . . . . . . . . . . . . . . . . L. Angquist  686
Stata tip 93: Handling multiple y axes on twoway graphs
    . . . . . . . . . . . . . . . . . . . . . . . . . . V. Wiggins   689

Software Updates 691

The Stata Journal (2010) 10, Number 4, pp. 507–539

A suite of commands for fitting the skew-normal and skew-t models

Yulia V. Marchenko
StataCorp
College Station, TX
[email protected]

Marc G. Genton
Department of Statistics
Texas A&M University
College Station, TX
[email protected]

Abstract. Nonnormal data arise often in practice, prompting the development of flexible distributions for modeling such situations. In this article, we describe two multivariate distributions, the skew-normal and the skew-t, which can be used to model skewed and heavy-tailed continuous data. We then discuss some inferential issues that can arise when fitting these distributions to real data. We also consider the use of these distributions in a regression setting for more flexible parametric modeling of the conditional distribution given other predictors. We present commands for fitting univariate and multivariate skew-normal and skew-t regressions in Stata (skewnreg, skewtreg, mskewnreg, and mskewtreg) as well as some postestimation features (predict and skewrplot). We also demonstrate the use of the commands for the analysis of the famous Australian Institute of Sport data and U.S. precipitation data.

Keywords: st0207, skewnreg, skewtreg, mskewnreg, mskewtreg, skewrplot, predict, distribution, heavy tails, nonnormal, precipitation, regression, skewness, skew-normal, skew-t

1 Introduction

Nonnormal data arise often in practice. One common way of dealing with nonnormal data is to find a suitable transformation that makes the data more normal-like and to apply standard normal-based methods to the transformed data. Finding a suitable transformation can be difficult with multivariate data. Also, for the ease of interpretation, it is often preferable to work with data in the original scale. These difficulties motivated a search for more flexible parametric families of distributions to model nonnormal data. A number of approaches are available for univariate outcomes. For noncontinuous data, such as binary data or count data, binomial or Poisson distributions can be used. More generally, generalized linear models can be used to accommodate a range of distributions within an exponential family. However, the choices for multivariate outcomes are rather limited.

Our focus in this article is on continuous nonnormal data. Because real data often deviate from normality in the tails or exhibit asymmetry in the distribution, there has been a growing interest in distributions with additional parameters regulating asymmetry and tails directly. For example, for heavy-tailed data, the Student's t distribution is often considered. Traditionally, lognormal or gamma distributions are used to model positive skewed data. To accommodate asymmetry for data spanning the real line, one can consider skew-normal and skew-t distributions, which are skewed versions of the respective normal and Student's t distributions. One of the appealing features of these distributions is that they have tractable multivariate versions that allow us to model multivariate outcomes. More generally, the family of skew-elliptical distributions proposed by Branco and Dey (2001) allows for asymmetry in a class of elliptically symmetric distributions.

The simplest representative of the skew-elliptical family, as defined by Azzalini (1985), is the skew-normal distribution. Compared with the normal distribution, in addition to location and scale parameters, the skew-normal distribution has a shape parameter regulating the asymmetry of the distribution. Another commonly used representative is the skew-t distribution (Azzalini and Capitanio 2003), which extends the normal distribution to allow for both asymmetry and heavier tails with two additional parameters, a shape parameter and a degrees-of-freedom parameter. These extra parameters allow us to capture the features of the data more adequately. Azzalini and Dalla Valle (1996), Azzalini and Capitanio (1999), Branco and Dey (2001), and Azzalini and Capitanio (2003) study multivariate analogs of these distributions.

What makes these distributions appealing for use in practice is that they are simple extensions of their more commonly used counterparts, the normal and Student's t distributions, and that they share some properties. For example, the distribution of the quadratic forms of skew-normal and skew-t random vectors does not depend on the shape parameter (and is chi-squared for the skew-normal model, as it is for the normal model). This property is useful for evaluating model fit. These distributions are closed under linear transformations, and multivariate versions are closed under marginalization (but not conditioning). Similarly to the normal and Student's t distributions, the skew-normal and skew-t distributions can also be adapted to handle positive data by considering their log versions (Azzalini, dal Cappello, and Kotz 2002; Marchenko and Genton 2010).

A more detailed description of these and other skewed distributions can be found in the book edited by Genton (2004) and in the review by Azzalini (2005).

The structure of our article is the following: We start with a motivating example in section 2. In section 3, we proceed to describe the skewed distributions and, more generally, skewed regressions in more detail. We present commands for fitting the skewed models in section 4. In section 5, we provide more examples of using skew-normal and skew-t models in the analysis of the Australian Institute of Sport data, commonly used in the literature about skewed distributions.

2 Motivating example

We consider the Australian Institute of Sport dataset (Cook and Weisberg 1994), which is repeatedly used in the literature about skewed distributions. The ais.dta dataset contains 202 observations (100 females and 102 males) that record 13 biological characteristics of Australian athletes. In our examples, we use only a subset of these characteristics.

. use ais
(Biological measures from athletes at the Australian Institute of Sport)

. describe lbm bmi weight height fe female

              storage  display    value
variable name   type   format     label    variable label
----------------------------------------------------------------------------
lbm             double %9.0g               Lean body mass (kg)
bmi             double %9.0g               Body mass index (kg/m^2)
weight          double %9.0g               Weight (kg)
height          double %9.0g               Height (m)
fe              int    %9.0g               Plasma ferritin concentration (ng/ml)
female          byte   %9.0g      gender   Gender

Suppose we are interested in modeling plasma ferritin concentration recorded in the fe variable. From figure 1, we can see that the distribution of plasma ferritin concentration is skewed to the right compared with the normal distribution.

. histogram fe, normal
(bin=14, start=8, width=16.142857)

[Figure omitted: histogram of fe with overlaid normal density; x axis: Plasma ferritin concentration (ng/ml); y axis: Density]

Figure 1. Histogram of plasma ferritin concentration overlaid with normal density

As mentioned in the introduction, for a univariate outcome we can choose from several options. We can use a transformation-based approach and model the fe variable in the log metric, for example. If we prefer to work with the original scale, we can use one of the univariate distributions that accommodate asymmetry. Here we demonstrate the use of the skew-normal and skew-t distributions for modeling fe.
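A transformation-based check along those lines is sketched below (the new variable name logfe is ours); it simply compares the log-transformed fe with a normal reference:

. generate double logfe = ln(fe)
. histogram logfe, normal
. qnorm logfe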

We first fit the skew-normal distribution to plasma ferritin concentration fe using the new skewnreg command. For later comparison with the skew-t fit, we specify the dpmetric option to report results in the direct parameterization, which will be explained in section 3.3:

. skewnreg fe, dpmetric

initial:       log likelihood = -1033.6914
rescale:       log likelihood = -1033.6914
rescale eq:    log likelihood = -1033.6914
Iteration 0:   log likelihood = -1033.6914
Iteration 1:   log likelihood = -1032.6839
Iteration 2:   log likelihood = -1030.9463
Iteration 3:   log likelihood = -1030.9116
Iteration 4:   log likelihood = -1030.9115

Skew-normal regression                        Number of obs =      202
                                              Wald chi2(0)  =        .
Log likelihood = -1030.9115                   Prob > chi2   =        .

          fe |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+-----------------------------------------------------------------
       _cons |   20.24412   2.491879     8.12   0.000     15.36012    25.12811
-------------+-----------------------------------------------------------------
       alpha |   9.142567    2.56432                      4.116592    14.16854
       omega |   73.84035   4.141059                      66.15418    82.41954
---------------------------------------------------------------------------------
LR test vs normal regression:  chi2(1) = 70.17            Prob > chi2 = 0.0000

As mentioned in the introduction, compared with the symmetric normal distribution, the skew-normal distribution has an additional shape parameter. Labeled as alpha in the output, it regulates the asymmetry of the distribution. For positive values of the shape parameter, the distribution is skewed to the right; for negative values, the distribution is skewed to the left; and the distribution is symmetric (normal) when the shape parameter is zero. From the output, we can see that alpha is estimated to be 9.14 with a 95% confidence interval of [4.12, 14.17], which is evidence that the distribution of fe exhibits skewness to the right.

We can visually check how well the skew-normal distribution fits the data by using the new postestimation command, skewrplot:

. skewrplot, fitted
(bin=14, start=8, width=16.142857)

We specified the fitted option to plot the skew-normal density estimate [evaluated at the above maximum likelihood estimates (MLEs) of the model parameters] of the fitted values against the histogram of fe. From figure 2, we can see that the skew-normal density estimate closely follows the nonparametric density estimate and that it demonstrates better fit of the skew-normal distribution to fe than the normal distribution in figure 1.


[Figure omitted: histogram of fe with nonparametric and skew-normal (alpha = 9.14) density estimates; title: Distribution of fe; x axis: Plasma ferritin concentration (ng/ml); y axis: Density]

Figure 2. Histogram and skew-normal density estimate of plasma ferritin concentration

We can also fit the skew-t distribution to fe by using the skewtreg command:

. skewtreg fe

initial:       log likelihood = -1428.9045
rescale:       log likelihood = -1411.4498
rescale eq:    log likelihood = -1041.7301
Iteration 0:   log likelihood = -1041.7301
Iteration 1:   log likelihood = -1035.0139
Iteration 2:   log likelihood = -1030.6871
Iteration 3:   log likelihood = -1029.457
Iteration 4:   log likelihood = -1029.1935
Iteration 5:   log likelihood = -1029.186
Iteration 6:   log likelihood = -1029.186

Skew-t regression                             Number of obs =      202
                                              Wald chi2(0)  =        .
Log likelihood = -1029.186                    Prob > chi2   =        .

          fe |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+-----------------------------------------------------------------
       _cons |    22.2901   2.830001     7.88   0.000      16.7434     27.8368
-------------+-----------------------------------------------------------------
       alpha |   7.244468   2.270883     3.19   0.001     2.793619    11.69532
       omega |   62.12069   7.079737                      49.68519    77.66861
          df |   7.440234   4.405123                      2.331404     23.7441
---------------------------------------------------------------------------------
LR test vs normal regression:  chibar2(1_2) = 73.62    Prob >= chibar2 = 0.0000

In addition to the shape parameter, the skew-t distribution introduces a degrees-of-freedom parameter. Labeled as df in the output, this parameter regulates the heaviness of the tails of the distribution. The smaller the degrees of freedom, the "heavier" the tails of the distribution. (For instance, one degree of freedom yields a skew-Cauchy distribution of Arnold and Beaver [2000].) As the degrees of freedom becomes large, the skew-t distribution reduces to the skew-normal distribution or the normal distribution, when in addition the shape parameter is zero. From the output, we can see that the degrees of freedom is estimated to be 7.44 with a 95% confidence interval of [2.33, 23.74], which provides evidence for heavier-than-normal tails of the distribution of fe. The estimate of the shape parameter alpha is 7.24 with a 95% confidence interval [2.79, 11.70], which again confirms the existence of positive skewness in the distribution of fe.

As we did before, we can plot the density estimate of fitted values from the skew-t distribution estimated above against the nonparametric density estimate. The plot is shown in figure 3:

. skewrplot, fitted
(bin=14, start=8, width=16.142857)

[Figure omitted: histogram of fe with nonparametric and skew-t (alpha = 7.24, df = 7.44) density estimates; title: Distribution of fe; x axis: Plasma ferritin concentration (ng/ml); y axis: Density]

Figure 3. Histogram and skew-t density estimate of plasma ferritin concentration

From the graph, we can see that the skew-t distribution seems to fit the fe values better than the skew-normal distribution. We could also use probability–probability (P–P) or quantile–quantile (Q–Q) plots, as we demonstrate later, to more easily compare model fits.

Let us now describe the skew-normal and skew-t models in more detail.


3 The skew-normal and skew-t models

3.1 Definition and some properties

The density of the univariate skew-normal distribution, SN(ξ, ω², α), is

$$ f_{SN}(x;\, \xi, \omega^2, \alpha) \;=\; 2\,\omega^{-1}\phi(z)\,\Phi(\alpha z), \qquad x \in \mathbb{R} \tag{1} $$

where z = ω⁻¹(x − ξ), ξ ∈ ℝ is a location parameter, ω > 0 is a scale parameter, φ(·) is the density of a univariate standard normal distribution, and Φ(·) is the cumulative distribution function of the standard normal distribution. The additional multiplier 2Φ(αz) is a skewness factor, and it is controlled by a shape parameter α ∈ ℝ. When α > 0, the distribution is skewed to the right; when α < 0, the distribution is skewed to the left; and when α = 0, the skew-normal distribution (1) reduces to the normal distribution.

The univariate skew-t distribution, ST(ξ, ω², α, ν), is defined in a similar manner by introducing a multiplier to the Student's t density, which is a heavier-tailed distribution than the normal distribution:

$$ f_{ST}(x;\, \xi, \omega^2, \alpha, \nu) \;=\; 2\,\omega^{-1} t(z;\nu)\, T\!\left\{ \alpha z \sqrt{\frac{\nu+1}{\nu+z^2}};\; \nu+1 \right\}, \qquad x \in \mathbb{R} \tag{2} $$

where t(z; ν) is the density of a univariate standard Student's t distribution with degrees of freedom ν, and T(·; ν + 1) is the cumulative distribution function of a univariate standard Student's t distribution with ν + 1 degrees of freedom. Here again, ξ ∈ ℝ regulates the location of the distribution, ω > 0 regulates the scale of the distribution, the shape parameter α ∈ ℝ regulates asymmetry of the distribution, and the degrees-of-freedom parameter ν > 0 regulates the tails of the distribution. When α = 0, the density (2) reduces to the Student's t density; and when α = 0 and the degrees of freedom becomes very large (ν tends to ∞), the skew-t density reduces to the normal density. By introducing an extra parameter for regulating the tails, the skew-t distribution accommodates outlying observations and, thus, can be viewed as a more robust model than the skew-normal model; see Azzalini and Genton (2008) for details.

As mentioned in the introduction, one of the useful properties of the skew-normal and skew-t distributions is that their quadratic forms do not depend on the shape parameter. In the univariate case, if X ∼ SN(ξ, ω², α), then (X − ξ)²/ω² ∼ χ²_1. If X ∼ ST(ξ, ω², α, ν), then (X − ξ)²/ω² ∼ F_{1,ν}. These properties provide a way of evaluating model fit using Q–Q or P–P plots.

Multivariate analogs of the skew-normal and skew-t distributions are constructed in a similar manner for the corresponding multivariate normal and multivariate Student's t distributions. The density of the multivariate skew-normal distribution, SN_d(ξ, Ω, α), is

$$ f_{SN_d}(x;\, \Theta) \;=\; 2\,\phi_d(x;\, \xi, \Omega)\,\Phi(\alpha' z), \qquad x \in \mathbb{R}^d \tag{3} $$

where Θ = (ξ, Ω, α), $z = \Omega_{\mathrm{diag}}^{-1/2}(x - \xi) \in \mathbb{R}^d$, φ_d(x; ξ, Ω) is the density of a d-variate normal distribution with location ξ and covariance matrix Ω, and Ω_diag is the d × d diagonal matrix containing the diagonal elements of Ω. Similarly to the univariate case, when all d components of α are zero, the multivariate skew-normal density (3) reduces to the multivariate normal density φ_d(·).

The density of the multivariate skew-t distribution, ST_d(ξ, Ω, α, ν), is

$$ f_{ST_d}(x;\, \Theta) \;=\; 2\, t_d(x;\, \xi, \Omega, \nu)\, T\!\left\{ \alpha' z \left( \frac{\nu + d}{\nu + Q_x^{\xi,\Omega}} \right)^{1/2};\; \nu + d \right\}, \qquad x \in \mathbb{R}^d \tag{4} $$

where Θ = (ξ, Ω, α, ν), $z = \Omega_{\mathrm{diag}}^{-1/2}(x - \xi)$, $Q_x^{\xi,\Omega} = (x - \xi)'\Omega^{-1}(x - \xi)$, and $t_d(x;\, \xi, \Omega, \nu) = \Gamma\{(\nu + d)/2\}\,(1 + Q_x^{\xi,\Omega}/\nu)^{-(\nu+d)/2} / \{|\Omega|^{1/2}(\nu\pi)^{d/2}\Gamma(\nu/2)\}$ is the density of a d-variate Student's t distribution with ν degrees of freedom, and T(·; ν + d) is the cumulative distribution function of a univariate Student's t distribution with ν + d degrees of freedom. When all d components of α are zero, the multivariate skew-t density (4) reduces to the multivariate Student's t density t_d(·) and to the multivariate normal density φ_d(·) when in addition ν tends to ∞.

Similarly to the univariate case, if X ∼ SN_d(ξ, Ω, α), then the Mahalanobis measure (X − ξ)′Ω⁻¹(X − ξ) ∼ χ²_d. If X ∼ ST_d(ξ, Ω, α, ν), then (1/d)(X − ξ)′Ω⁻¹(X − ξ) ∼ F_{d,ν}.
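After a multivariate fit, the skewrplot command described in section 4 produces Q–Q plots that are presumably based on exactly these Mahalanobis-type quantities; a minimal sketch with the two outcomes analyzed in section 5.2:

. mskewnreg lbm bmi
. skewrplot, qq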

3.2 Regression models

Consider a sample Y = (y₁, y₂, . . . , yₙ)′ of n observations. In linear regression,

$$ y_i \;=\; \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi} + \varepsilon_i, \qquad i = 1, \ldots, n \tag{5} $$

where x_{1i}, . . . , x_{pi} define covariate values, β₀, . . . , β_p are the unknown regression coefficients, and ε_i is an error term. In normal linear regression, the errors are assumed to be normally distributed, $\varepsilon_i \overset{iid}{\sim} \mathrm{Normal}(0, \sigma^2)$. The skew-normal regression is a linear regression (5) with errors from the skew-normal distribution, $\varepsilon_i \overset{iid}{\sim} SN(0, \omega^2, \alpha)$. Similarly, the skew-t regression is defined by (5) with $\varepsilon_i \overset{iid}{\sim} ST(0, \omega^2, \alpha, \nu)$. Equivalently, the sample Y is assumed to follow the skew-normal distribution, $y_i \overset{iid}{\sim} SN(\xi_i, \omega^2, \alpha)$, or the skew-t distribution, $y_i \overset{iid}{\sim} ST(\xi_i, \omega^2, \alpha, \nu)$, respectively, where $\xi_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi}$. However, because the mean μ of a skewed random variate is not the same as the location parameter ξ, E(ε_i) ≠ 0 (unless α = 0) unlike the normal linear regression. The mean is $E(\varepsilon_i) = \sqrt{2/\pi}\,\omega\delta$ for the skew-normal regression and $E(\varepsilon_i) = \omega\delta\sqrt{\nu/\pi}\,\Gamma\{(\nu - 1)/2\}/\Gamma(\nu/2)$ when ν > 1 for the skew-t regression, where $\delta = \alpha/\sqrt{1 + \alpha^2}$. Then $E(y_i) = \xi_i + E(\varepsilon_i)$.
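As a quick numeric illustration of this adjustment, the DP estimates reported for fe in section 2 (alpha = 9.14, omega = 73.84) imply the following error mean under the skew-normal fit (a do-file sketch):

scalar alpha = 9.142567
scalar omega = 73.84035
scalar delta = alpha/sqrt(1 + alpha^2)
// E(eps) = sqrt(2/pi)*omega*delta, roughly 58.6 for these values
display "E(eps) = " sqrt(2/_pi)*omega*delta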

Under the multivariate regression setting, Y becomes an n × d data matrix, β becomes a p × d matrix of unknown coefficients, and the errors follow the multivariate skew-normal distribution, SN_d(0, Ω, α), or the multivariate skew-t distribution, ST_d(0, Ω, α, ν), respectively.

The method of maximum likelihood is used to obtain estimates of regression coefficients β and other model parameters, Ω, α, and ν. Two issues arise with likelihood inference for the skew-normal and skew-t models: 1) the existence of a stationary point at α = 0 of the profile log-likelihood function for the skew-normal model; and 2) unbounded MLEs. We discuss each issue in more detail below.

The existence of a stationary point at α = 0 for the skew-normal model leads to the singularity of the Fisher information matrix of the profile log likelihood for the shape parameter α (Azzalini 1985; Azzalini and Genton 2008). This violates standard assumptions underlying the asymptotic properties of the maximum likelihood estimators and, consequently, leads to slower convergence and possibly a bimodal limiting distribution of the estimates (Arellano-Valle and Azzalini 2008). All model parameters ξ, Ω, and α are identifiable, so the issue is really due to the chosen parameterization. To alleviate this issue, Azzalini (1985) suggested an alternative centered parameterization for the univariate skew-normal model under which the sampling distributions of the new parameters are closer to the normal distribution. Arellano-Valle and Azzalini (2008) extended this parameterization to the multivariate case. We will discuss the centered parameterization in more detail in section 3.3. This unfortunate property seems to vanish in the case of the skew-t distribution, unless the degrees of freedom are large enough that the skew-t distribution essentially becomes the skew-normal distribution; see Azzalini and Capitanio (2003) and Azzalini and Genton (2008) for details. More generally, the issue of the singularity of multivariate skew-symmetric models was investigated by Ley and Paindaveine (2010) and Hallin and Ley (forthcoming).

Both the skew-normal and skew-t models suffer from the problem of unboundedness of the MLEs for the shape and degrees-of-freedom parameters; that is, the maximum likelihood estimator can be infinite with positive probability for the finite true value of the parameter. For example, in the cases of the univariate standard skew-normal distribution and the univariate standard skew-t distribution with fixed degrees of freedom, when all observations are positive (or negative), which can happen with positive probability, the likelihood function is monotone increasing, and thus, an infinite estimate of the shape parameter is encountered. In other more general cases, such as unknown degrees of freedom and the multivariate case, the conditions under which the log likelihood is unbounded are more complicated and thus more difficult to describe. Sartori (2006) and Azzalini and Genton (2008) presented ways of dealing with the unbounded estimates. Sartori (2006) proposed a bias correction to the MLEs. Azzalini and Genton (2008) suggested a deviance-based approach according to which the unbounded MLEs of (α, ν) are replaced by the smallest values (α₀, ν₀) such that the likelihood-ratio test of H₀: (α, ν) = (α₀, ν₀) is not rejected at a fixed level, say, 0.1. Within a Bayesian framework, Liseo and Loperfido (2006) showed that the estimate of the posterior mode of the shape parameter is finite for the skew-normal model under the Jeffreys prior; and Bayes and Branco (2007) considered an alternative noninformative uniform prior for the shape parameter.

The centered parameterization is available for skewnreg and mskewnreg to alleviate the singularity issue. The issue of unbounded parameter estimates is not yet addressed in the presented commands. This issue is likely to arise when the distribution of the data (or residuals within the regression framework) is close to a half-normal distribution. If this issue occurs, one solution is to determine the iteration number after which the changes in the likelihood become very small and then to refit the model using the prespecified number of iterations in the iterate(#) option.
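A sketch of that workaround with hypothetical variables y, x1, and x2 (the cutoff of 20 iterations is illustrative and should be read off the log of the first fit):

. skewnreg y x1 x2
. skewnreg y x1 x2, iterate(20)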

3.3 Centered parameterization

Here we briefly describe the centered parameterization for the univariate skew-normal distribution as proposed by Azzalini (1985), and we outline the points made in Arellano-Valle and Azzalini (2008), where more details and the extension to the multivariate case can be found.

Let Y be distributed as SN(ξ, ω², α). Consider the following decomposition of Y:

$$ Y \;=\; \xi + \omega Z \;=\; \mu + \sigma (Z - \mu_z)/\sigma_z $$

where $\mu_z = E(Z) = \sqrt{2/\pi}\,\delta$, $\sigma_z^2 = \mathrm{Var}(Z) = 1 - 2\delta^2/\pi$, and $\delta = \alpha/\sqrt{1 + \alpha^2}$. Then $\mu = E(Y) = \xi + \omega\mu_z$ and $\sigma^2 = \mathrm{Var}(Y) = \omega^2(1 - \mu_z^2)$. Let $\gamma = \{(4 - \pi)/2\}\,\mathrm{sign}(\alpha)\,(\mu_z^2/\sigma_z^2)^{3/2}$ denote the skewness index of Y. (The skewness index γ is not the classical sample moment-based measure of skewness but is specific to this family of distributions.) Then, mean, standard deviation, and skewness index, (μ, σ, γ), form the centered parameterization. They are referred to as the centered parameters (CP) because they are obtained by centering Y. The set of parameters (ξ, ω, α) are referred to as the direct parameters (DP). It is worth noting that unlike the range of α, the range of γ is restricted to approximately (−0.9953, 0.9953). More generally in the multivariate setting, unlike the DPs (ξ, Ω, α), the CPs (μ, Σ, γ) cannot be chosen freely and are subject to certain constraints; see Arellano-Valle and Azzalini (2008) for details. Of course, both sets of parameters require the scale matrices to be positive definite.

In the regression setting, the CP metric affects only the estimate of the intercept and not the coefficients. Specifically, $\beta_0^{CP} = \beta_0 + \sqrt{2/\pi}\,\omega\delta$ and $\beta_i^{CP} = \beta_i$, $i = 1, \ldots, p$. Consequently, $\varepsilon_i^{CP} = \varepsilon_i - \sqrt{2/\pi}\,\omega\delta$, and so the residuals in the CP metric have a mean of zero, $E(\varepsilon_i^{CP}) = 0$, $i = 1, \ldots, n$. In what follows, when referring to residuals we will always assume the residuals are in the DP metric.

The use of CP is advantageous from both inferential and interpretation standpoints. The sampling distributions of the MLEs of CP are closer to quadratic forms, and the profile log likelihood for γ does not have a stationary point at γ = 0. Although the shape parameter α can be used as a guide to whether the normal model is sufficient for analysis, it is easier to infer the actual magnitude of the departure from normality based on the skewness index γ. Also, in the multivariate case, components of a skewness vector γ represent the skewness indexes of the marginal distributions whereas individual components of α, in general, cannot be used to infer the direction or the magnitude of the asymmetry in marginal distributions. Marginal skewness indexes are complicated functions of individual components of α. However, zero components of α do imply zero marginal skewness indexes or, in other words, symmetric marginal distributions. DP is useful for direct interpretation in the original model.

From the above formulas, we can see that a one-to-one correspondence exists between CP and DP, provided CP is within its admissible range. So after obtaining estimates in the CP metric, one can use the formulas above and the delta method to obtain respective estimates and their standard errors in the DP metric, and vice versa.
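A point-estimate sketch of this DP-to-CP mapping (standard errors would need the delta method), plugging in the DP estimates reported for fe in section 2:

scalar xi    = 20.24412
scalar omega = 73.84035
scalar alpha = 9.142567
scalar delta = alpha/sqrt(1 + alpha^2)
scalar mu_z  = sqrt(2/_pi)*delta
scalar sig_z = sqrt(1 - 2*delta^2/_pi)
display "mu    = " xi + omega*mu_z
display "sigma = " omega*sqrt(1 - mu_z^2)
// gamma must fall in approximately (-0.9953, 0.9953)
display "gamma = " 0.5*(4 - _pi)*(mu_z/sig_z)^3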

The centered parameterization is implemented in skewnreg and mskewnreg. At the time of publication of this article, the centered parameterization for the skew-t distribution is yet to appear in the literature (Arellano-Valle and Azzalini 2009) and thus is not implemented in skewtreg and mskewtreg.

4 A suite of commands for fitting skewed regressions

4.1 Syntax

Skewed regression models

Univariate skew-normal regression

skewnreg depvar [indepvars] [if] [in] [weight] [, constraints(constraints)
     collinear vce(vcetype) level(#) dpmetric estmetric nocnsreport
     coeflegend postdp display options maximize options]

Univariate skew-t regression

skewtreg depvar [indepvars] [if] [in] [weight] [, df(#)
     constraints(constraints) collinear vce(vcetype) level(#) estmetric
     nocnsreport coeflegend postdp display options maximize options]

Multivariate skew-normal regression

mskewnreg depvars [= indepvars] [if] [in] [weight] [,
     constraints(constraints) collinear vce(vcetype) level(#) dpmetric
     estmetric noshowomega nocnsreport coeflegend postdp postcp
     display options maximize options]


Multivariate skew-t regression

mskewtreg depvars [= indepvars] [if] [in] [weight] [, df(#)
     constraints(constraints) collinear vce(vcetype) level(#) estmetric
     noshowomega nocnsreport coeflegend postdp display options
     maximize options]

indepvars may contain factor variables; see [U] Factor variables.
fweights are allowed; see [U] weight.

Postestimation features

Predictions

predict [type] newvar [if] [in] [, xb residuals score stdp equation(eqno)]

Residual density plot over histogram (default with skewnreg and skewtreg)

skewrplot [, histogram fitted normal normopts(norm options)
     lineopts(line options) histopts(hist options) addplot(plot) twoway options]

Residual density plot with kernel-density estimate (skewnreg and skewtreg only)

skewrplot, kdensity [fitted normal normopts(norm options)
     lineopts(line options) kdenopts(kden options) addplot(plot) twoway options]

Residual-versus-fitted plot (skewnreg and skewtreg only)

skewrplot, rvf [addplot(plot) scatter options twoway options]

Probability–probability plot

skewrplot, pp [normal normopts(norm options) overlay addplot(plot)
     pp options graph options]


Quantile–quantile plot (default with mskewnreg and mskewtreg)

skewrplot, qq [normal normopts(norm options) addplot(plot) qq options
     graph options]

4.2 Description

The skewnreg and skewtreg commands fit skew-normal and skew-t regression models to univariate data. The mskewnreg and mskewtreg commands fit skew-normal and skew-t regression models to multivariate data. skewnreg and mskewnreg support both the CP metric (the default) and the DP metric (with the dpmetric option), whereas skewtreg and mskewtreg support only the DP metric. Regardless of the display metric, optimization is performed in the estimation metric specific to each command; see each command's help file for details. In the skew-t regression, the degrees-of-freedom parameter can optionally be set to a fixed value with the df() option.
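For instance, a bivariate skew-normal regression followed by a replay in the DP metric might look like this sketch (variables from the ais.dta dataset of section 2; output omitted):

. mskewnreg lbm bmi = weight, nolog
. mskewnreg, dpmetric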

The postestimation features include predictions and residual diagnostics plots. The predict command can be used after any of the four estimation commands to obtain linear predictions and their standard errors, residual estimates, and the score estimates. The equation() option can be used with multivariate regressions to obtain equation-specific predictions. The first equation is assumed by default.

The skewrplot command can be used after any of the four estimation commands to obtain a number of residual diagnostic plots. The default after univariate regressions is a residual density plot, where the skew-normal (or skew-t) density estimate of residuals, evaluated at MLEs from the previously fit model, is plotted together with a nonparametric residual density estimate (a histogram). Alternatively, if kdensity is used, a residual density plot is displayed together with a nonparametric kernel-density estimate of residuals instead of the histogram. In the absence of predictors, the fitted option can be used to plot density estimates of the fitted values instead of residuals. In addition, a normal density estimate can be added to the graph as a reference by specifying the normal option. The residual-versus-fitted plot can be obtained with the rvf option. The P–P and Q–Q plots are available after univariate or multivariate regressions. The Q–Q plot of residuals is the default after multivariate regressions. It can also be requested with the qq option. The P–P plot of residuals can be obtained with the pp option. If normal is used in combination with pp (or qq), a P–P (or Q–Q) plot of residuals from a normal regression fit is produced as a separate plot.
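For example, after a univariate fit one could request the kernel-density version with a normal reference curve, or a P–P plot paired with its normal-regression counterpart:

. skewrplot, kdensity normal
. skewrplot, pp normal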

4.3 Options

Common estimation options

constraints(constraints) specifies the linear constraints to be applied during estimation. The default is to perform unconstrained estimation. See [R] estimation options for details.

collinear specifies that the estimation command not omit collinear variables. See [R] estimation options for details.

vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory, that are robust to some kinds of misspecification, that allow for intragroup correlation, and that use bootstrap or jackknife methods; see [R] vce option.

level(#) specifies the confidence level, as a percentage, for confidence intervals. The default is level(95) or as set by set level. This option may be specified either at estimation or upon replay.

estmetric displays results in the estimation metric. The estimation metric used is specific to each estimation command. This option may be specified either at estimation or upon replay.

nocnsreport specifies that no constraints be reported. The default is to display user-specified constraints above the coefficient table.

coeflegend specifies that the legend of the coefficients and how to specify them in an expression be displayed rather than the coefficient table. This option may be specified either at estimation or upon replay.

postdp stores DP estimates and their variance–covariance estimator (VCE) in e(b) and e(V), respectively.

display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels; see [R] estimation options. These options may be specified either at estimation or upon replay.

maximize options: difficult, technique(algorithm spec), iterate(#), [no]log, trace, gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#), nrtolerance(#), nonrtolerance; see [R] maximize. Also, init(ml init args) can be specified; see [R] ml.

Other options for skewnreg

dpmetric specifies that the results be displayed in the DP metric instead of the default CP metric. This option may be specified either at estimation or upon replay.

Other options for mskewnreg

dpmetric specifies that the results be displayed in the DP metric instead of the default CP metric. This option may be specified either at estimation or upon replay.

noshowomega specifies that the display of the covariance (or scale) matrix be suppressed.

postcp stores CP estimates and their VCE in e(b) and e(V), respectively, instead of the estimation parameters.


Other options for skewtreg

df(#) specifies that the degrees-of-freedom parameter be fixed at # during estimation. This is equivalent to constrained estimation using the constraints() option when the degrees-of-freedom parameter is set to #.
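For instance, fixing one degree of freedom fits the skew-Cauchy model mentioned in section 2:

. skewtreg fe, df(1)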

Other options for mskewtreg

df(#) specifies that the degrees-of-freedom parameter be fixed at # during estimation. This is equivalent to constrained estimation using the constraints() option when the degrees-of-freedom parameter is set to #.

noshowomega specifies that the display of the covariance (or scale) matrix be suppressed.

Options for predict

xb, the default, calculates the linear prediction.

residuals calculates the residuals.

score calculates the first derivative of the log likelihood with respect to x_jβ.

stdp calculates the standard error of the linear prediction.

equation(eqno) is allowed only when you have previously fit mskewnreg or mskewtreg. It specifies the equation to which you are referring. equation() is filled in with one eqno for the xb, stdp, and residuals options. equation(#1) means the calculation is to be made for the first equation; equation(#2) means the second; and so on. You could also refer to the equations by their names. equation(lbm) would refer to the equation named lbm, and equation(bmi) would refer to the equation named bmi. If you do not specify equation(), results are the same as if you specified equation(#1).
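A usage sketch after the bivariate fit of lbm and bmi from section 5.2 (the new variable names are arbitrary):

. predict double xb_lbm, xb equation(lbm)
. predict double res_bmi, residuals equation(bmi)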

Options for skewrplot

histogram, the default after skewnreg and skewtreg, requests that the histogram of residuals be plotted together with a residual density estimate from a skewnreg or skewtreg fit. This option is not allowed with skewrplot after mskewnreg or mskewtreg.

kdensity requests that the kernel-density estimate of residuals be plotted together with a residual density estimate from a skewnreg or skewtreg fit instead of the histogram. This option is not allowed with skewrplot after mskewnreg or mskewtreg.

rvf requests that the residual-versus-fitted plot be produced. This option is not allowed with skewrplot after mskewnreg or mskewtreg.

pp requests that probability–probability plots of the observed residuals versus the residuals obtained from the fitted parametric model be produced.


qq, the default after mskewnreg and mskewtreg, requests that quantile–quantile plots of the observed residuals versus the residuals obtained from the fitted parametric model be produced.

fitted requests that the density of fitted values be plotted instead of the density of residuals from a skewnreg or skewtreg fit. This option is allowed only in combination with histogram or kdensity.

normal requests that a corresponding normal plot be produced for comparison. If histogram is used, normal specifies that the histogram be overlaid with an appropriately scaled normal density. The normal will have the same mean and standard deviation as the data. If kdensity is used, normal requests that a normal density be overlaid on the density estimate of residuals from a skewed regression fit. If pp or qq is used, normal requests that an additional, separate chi-squared probability plot or chi-squared quantile plot of squared standardized residuals from a normal regression fit be produced. This option can be used in combination with overlay to overlay P–P plots on one graph. This option is not allowed in combination with rvf.

normopts(norm options) specifies details about the look of normal plots produced when normal is specified. If histogram or kdensity is used, norm options affect rendition of the normal curve, such as the color and style of line used, and can be any of the options documented in [G] graph twoway line. If pp (or qq) is used, norm options affect the look of the chi-squared probability (or quantile) plot and can be any of the options documented for quantile in [R] diagnostic plots.

overlay specifies that the normal plot be overlaid with the main plot in one graph. This option requires normal and is not allowed in combination with qq. This option is implied with histogram and kdensity.

lineopts(line options) affect rendition of the curve from the skew fit. Aspects such as the color and style of line used are affected and can be specified using any of the options documented in [G] graph twoway line.

histopts(hist options) are any of the options other than discrete, fraction, frequency, percent, horizontal, and all Density plots options documented in [R] histogram.

kdenopts(kden options) are any of the options documented in [R] kdensity.

addplot(plot) provides a way to add other plots to the generated graph; see [G] addplot option.

scatter options are any of the options documented in [G] graph twoway scatter.

pp options are any of the options of quantile documented in [R] diagnostic plots.

qq options are any of the options of quantile documented in [R] diagnostic plots.

twoway options are any of the options other than by() documented in [G] twoway options.


graph options specify the overall look of a graph. If normal is used without overlay, graph options are any of the options documented in [G] graph combine. Otherwise, graph options are any of the twoway options above.

5 Numerical examples

5.1 Univariate analysis of Australian Institute of Sport data

Our motivating example demonstrated the use of skewnreg and skewtreg for modeling the distribution of plasma ferritin concentration from the Australian Institute of Sport data. We can also use these commands within the regression framework to accommodate departures from normality of the conditional distribution of the outcome of interest controlling for other covariates.

For the purpose of illustration, consider the conditional distribution of lean body mass, lbm, given the weight and height of an athlete. Linearity of lean body mass with respect to weight and height was established by previous analysis of these data (for example, Cook and Weisberg [1994]), so we consider a simple linear regression for modeling the conditional distribution of lbm. To obtain more meaningful estimates of main effects, we use recentered versions of covariates, weight_c and height_c, in our regression analysis. Also, to adjust for likely differences in the relationship due to gender, we interact weight_c and height_c with female. (Alternatively, we could have fit separate regressions for males and females to also allow the variability in the measurements to differ across gender.)

. use ais, clear
(Biological measures from athletes at the Australian Institute of Sport)

. summarize weight, meanonly

. generate weight_c = weight - r(mean)

. summarize height, meanonly

. generate height_c = height - r(mean)

We first fit a normal linear regression and examine the distribution of the residuals from its fit. It is worth noting that weight and height measurements are highly correlated.

. regress lbm i.female##c.(weight_c height_c)

      Source |       SS       df       MS             Number of obs =     202
-------------+------------------------------          F(  5,   196) = 1087.52
       Model |  33142.2236     5  6628.44472          Prob > F      =  0.0000
    Residual |  1194.61754   196  6.09498744          R-squared     =  0.9652
-------------+------------------------------          Adj R-squared =  0.9643
       Total |  34336.8411   201  170.830055          Root MSE      =  2.4688

              lbm |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
------------------+----------------------------------------------------------------
         1.female |  -9.014547   .4304858   -20.94   0.000    -9.863526   -8.165568
         weight_c |   .7101775   .0265595    26.74   0.000     .6577985    .7625566
         height_c |   14.83978   4.169091     3.56   0.000     6.617744    23.06182
female#c.weight_c |
                1 |  -.1765309    .041757    -4.23   0.000    -.2588816   -.0941802
female#c.height_c |
                1 |  -5.442548   5.965791    -0.91   0.363    -17.20793    6.322834
            _cons |   68.51799   .3006605   227.89   0.000     67.92504    69.11093

. predict resid, residuals

. kdensity resid, normal

[Figure omitted: kernel-density estimate of the residuals with overlaid normal density; kernel = epanechnikov, bandwidth = 0.6126; x axis: Residuals; y axis: Density]

Figure 4. Normal residuals density estimate

Figure 4 demonstrates a slight (longer left tail) skewness in the distribution of residuals compared with the assumed underlying normal distribution. More directly, we can use a Q–Q plot to compare the distribution of residuals with the normal distribution, as shown in figure 5:

. qnorm resid

[Figure omitted: normal quantile plot of the residuals; x axis: Inverse Normal; y axis: Residuals]

Figure 5. Normal Q–Q plot of residuals

The Q–Q plot confirms the existence of negative skewness in the distribution of residuals from the linear regression fit.

We store estimation results from regress for later comparison with skewed models:

. estimates store reg

To capture asymmetry in the data, we now fit the skew-normal regression:

. skewnreg lbm i.female##c.(weight_c height_c), nolog

Skew-normal regression                        Number of obs =      202
                                              Wald chi2(5)  =  6773.01
Log likelihood = -457.54665                   Prob > chi2   =   0.0000

              lbm |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
------------------+----------------------------------------------------------------
         1.female |  -8.225366   .4431113   -18.56   0.000    -9.093848   -7.356883
         weight_c |   .7737271   .0306772    25.22   0.000     .7136009    .8338533
         height_c |    9.91473   4.072276     2.43   0.015     1.933216    17.89624
female#c.weight_c |
                1 |  -.1959762   .0382144    -5.13   0.000    -.2708751   -.1210774
female#c.height_c |
                1 |  -3.118911   5.621625    -0.55   0.579    -14.13709    7.899271
            _cons |   68.05071   .3032845   224.38   0.000     67.45629    68.64514
------------------+----------------------------------------------------------------
            gamma |  -.6191484   .1192347    -5.19   0.000    -.8528442   -.3854526
            sigma |   2.416606   .1314719                      2.172189    2.688526
-------------------------------------------------------------------------------------
LR test vs normal regression:  chi2(1) = 17.18            Prob > chi2 = 0.0000

By default, skewnreg estimates and displays model parameters other than the standard deviation sigma in the CP metric, as discussed in section 3.3. The standard deviation is estimated in the log metric. From the output, we can see that both weight and height are strong predictors of lean body mass measurements, and their relationship differs between males and females. The estimated skewness index, labeled as gamma in the output, is −0.62, which suggests that the conditional distribution of lbm adjusted for weight and height is skewed to the left. According to the reported test of H0: γ = 0 with the test statistic of −5.19, we have strong evidence of asymmetry in the distribution of lbm, and thus the skew-normal regression may be more appropriate for the analysis than the normal regression. The likelihood-ratio test for the skew-normal regression versus the normal linear regression, which is reported at the bottom of the table, also favors the skew-normal model.

We can redisplay results in the DP metric by using the dpmetric option:

. skewnreg, dpmetric

Skew-normal regression                        Number of obs =      202
                                              Wald chi2(5)  =  6773.01
Log likelihood = -457.54665                   Prob > chi2   =   0.0000

              lbm |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
------------------+----------------------------------------------------------------
         1.female |  -8.225366   .4431113   -18.56   0.000    -9.093848   -7.356883
         weight_c |   .7737271   .0306772    25.22   0.000     .7136009    .8338533
         height_c |    9.91473   4.072276     2.43   0.015     1.933216    17.89624
female#c.weight_c |
                1 |  -.1959762   .0382144    -5.13   0.000    -.2708751   -.1210774
female#c.height_c |
                1 |  -3.118911   5.621625    -0.55   0.579    -14.13709    7.899271
            _cons |   70.78126   .2882586   245.55   0.000     70.21628    71.34624
------------------+----------------------------------------------------------------
            alpha |  -2.718978   .6434226                     -3.980063   -1.457893
            omega |   3.646351    .273574                      3.147716    4.223975
-------------------------------------------------------------------------------------
LR test vs normal regression:  chi2(1) = 17.18            Prob > chi2 = 0.0000

Notice that all regression coefficients remain the same: the transformation from the CP to the DP metric changes only the intercept. The estimate of the shape parameter alpha is −2.72 with a 95% confidence interval of [−3.98, −1.46]. The confidence interval does not include 0, corresponding to the normal regression, which agrees with our earlier findings. Also note that the scale parameter omega is now reported instead of the standard deviation sigma.

Similarly to figure 4, we can use the skewrplot command to plot the residual density estimate obtained nonparametrically against that from the skew-normal distribution evaluated at the MLEs of the model parameters, as shown in figure 6:


. skewrplot, kdensity

[Figure omitted: nonparametric and skew-normal (alpha = −2.72) density estimates of the residuals; title: Distribution of residuals; kernel = epanechnikov, bandwidth = 0.6401; x axis: Residuals; y axis: Density]

Figure 6. Skew-normal residuals density estimate

Figure 6 demonstrates an improved fit to the distribution of residuals.

Alternatively, we can obtain a Q–Q or P–P plot by using the respective options. For example,

. skewrplot, qq

[Figure omitted: Q–Q plot of the scaled squared residuals against expected χ², d.f. = 1]

Figure 7. Q–Q plot for the skew-normal model

produces the Q–Q plot of quantiles of the scaled squared residuals from the fitted skew-normal model against the quantiles of the chi-squared distribution with 1 degree of freedom, as shown in figure 7.

According to the Q–Q plot, the skew-normal model fits the data reasonably well, with the exception of several outlying observations in the right tail. See Dalla Valle (2007) for a formal test of the skew-normality in a population.

Next we store estimation results from the skew-normal regression for later comparison with other models. We store results in the DP metric by using the postdp option on replay:

. skewnreg, postdp

. estimates store skewn_dp

To accommodate heavier tails in addition to skewness, we fit the skew-t model:

. skewtreg lbm i.female##c.(weight_c height_c), nolog

Skew-t regression                             Number of obs =      202
                                              Wald chi2(5)  =  7955.92
Log likelihood = -450.12502                   Prob > chi2   =   0.0000

              lbm |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
------------------+----------------------------------------------------------------
         1.female |  -8.184854   .3913878   -20.91   0.000     -8.95196   -7.417748
         weight_c |   .7583558   .0300931    25.20   0.000     .6993743    .8173372
         height_c |   12.03037    3.83344     3.14   0.002     4.516964    19.54377
female#c.weight_c |
                1 |  -.1677404   .0375606    -4.47   0.000    -.2413579   -.0941229
female#c.height_c |
                1 |  -6.352142   5.232898    -1.21   0.225    -16.60843    3.904149
            _cons |   70.14246   .3307072   212.10   0.000     69.49429    70.79063
------------------+----------------------------------------------------------------
            alpha |  -1.760172   .6463594    -2.72   0.006    -3.027013    -.493331
            omega |   2.318537   .3619959                       1.70732    3.148569
               df |   3.658399   1.128259                      1.998842    6.695817
-------------------------------------------------------------------------------------
LR test vs normal regression:  chibar2(1_2) = 32.02    Prob >= chibar2 = 0.0000

As mentioned in section 3.3, the centered parameterization for the skew-t model is still under development and has not yet appeared in the literature. Thus the skewtreg command reports results only in the DP metric. Compared with the output of DPs from skewnreg, the skewtreg command reports an additional estimate of the degrees of freedom. The estimate of the degrees of freedom is 3.66 with a 95% confidence interval of [2.00, 6.70], which implies heavier-than-normal tails for the conditional distribution of lbm. The estimate of the shape parameter alpha is −1.76 with a 95% confidence interval of [−3.03, −0.49]. Again the reported likelihood-ratio test rejects the hypothesis of normality. The reported test of H0: α = 0, ν = ∞ requires a boundary correction because the degrees-of-freedom parameter is tested at its boundary value. As such, the distribution of the likelihood-ratio test statistic is a 50:50 percent mixture of chi-squared distributions with 1 and 2 degrees of freedom, labeled as chibar2(1_2) in the output; see, for example, Gutierrez, Carter, and Drukker (2001) and DiCiccio and Monti (2009) for more details.

We can also perform the likelihood-ratio test of the skew-t model versus the skew-normal model (H0: ν = ∞) by using the lrtest command. Because skewnreg and skewtreg are two different estimation commands, we need to specify the force option to obtain results. Although using this option is generally not recommended, it is safe in our case because we know that the skew-normal model is nested within the skew-t model.

. lrtest skewn ., force

Likelihood-ratio test                         LR chi2(1)  =     14.84
(Assumption: skewn nested in .)               Prob > chi2 =    0.0001

The likelihood-ratio test favors the skew-t model over the skew-normal model. The results from this test should be interpreted with caution because it does not automatically account for the fact that the degrees of freedom ν are tested at the boundary value ν = ∞. The distribution of the likelihood-ratio test statistic in this case is a 50:50 percent mixture of a degenerate distribution at 0 and a chi-squared distribution with 1 degree of freedom. As such, the corrected p-value is half the uncorrected p-value and is 0.000058 in this example:

. display r(p)/2

.00005841

We can also compare the two fits visually using, for example, a Q–Q plot. We use skewrplot, qq to obtain the Q–Q plot of residuals after skewtreg:



. skewrplot, qq

[Graph: scaled squared residuals plotted against expected F(1, 3.66) quantiles]

Figure 8. Q–Q plot for the skew-t model

According to figures 7 and 8, the skew-t model fits the lbm regression better than the skew-normal model.

Alternatively, we can use information criteria to compare the two models:

. estimates stats skewn .

       Model |    Obs    ll(null)   ll(model)     df          AIC         BIC
-------------+----------------------------------------------------------------
       skewn |    202           .   -457.5467      8     931.0933    957.5595
           . |    202           .    -450.125      9       918.25    948.0244

Note: N=Obs used in calculating BIC; see [R] BIC note

Both Akaike’s information criterion and Schwarz’s Bayesian information criterion are smaller for the skew-t model, which suggests that it is preferable to the skew-normal model.

We can also compare results from all three regressions, including the normal regression, side-by-side by using estimates table.

Because there is no CP parameterization for the skew-t regression, we can compare results only in the DP metric. Although skewtreg displays results in the DP metric, the results are saved in the estimation metric. To save results in the DP metric, we use the postdp option:

. skewtreg, postdp

. estimates store skewt_dp


We now combine all three estimation results in one table by using estimates table.

. estimates table reg skewn_dp skewt_dp, equation(1) star(0.05 0.01 0.005) b(%9.3f)

    Variable |    reg          skewn_dp       skewt_dp
-------------+------------------------------------------
#1           |
      female |
           1 |   -9.015***      -8.225***      -8.185***
    weight_c |    0.710***       0.774***       0.758***
    height_c |   14.840***       9.915*        12.030***
             |
     female#|
  c.weight_c |
           1 |   -0.177***      -0.196***      -0.168***
             |
     female#|
  c.height_c |
           1 |   -5.443         -3.119         -6.352
             |
       _cons |   68.518***      70.781***      70.142***
  alpha_cons |                  -2.719***      -1.760**
  omega_cons |                   3.646***       2.319***
     df_cons |                                  3.658***
--------------------------------------------------------
                  legend: * p<.05; ** p<.01; *** p<.005

According to the three regression models, both weight and height are strong predictors of lean body mass measurements. Despite the differences in coefficient estimates, all models lead to similar inferential conclusions. The estimates of the shape parameter alpha suggest the presence of negative skewness in the conditional distribution of lbm given weight and height. Because tests against zero are not appropriate for the scale and degrees-of-freedom parameters, the significance levels, reported automatically by estimates table for these parameters, should be ignored.

5.2 Multivariate analysis of Australian Institute of Sport data

Suppose we are interested in the distribution of lbm and bmi, the body mass index. In figure 9, the scatterplot of the lbm and bmi values suggests that the two variables are related and thus should be analyzed jointly.


. use ais
(Biological measures from athletes at the Australian Institute of Sport)

. scatter lbm bmi

[Graph: Lean body mass (kg) plotted against Body mass index (kg/m^2)]

Figure 9. Scatter plot of lbm and bmi

The scatterplot also suggests that the joint distribution of lbm and bmi is somewhat asymmetric, and so we fit the bivariate skew-normal distribution to lbm and bmi using mskewnreg:

. mskewnreg lbm bmi, nolog

Multivariate skew-normal regression             Number of obs   =        202
                                                Wald chi2(0)    =          .
Log likelihood = -1213.2609                     Prob > chi2     =          .

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
lbm          |
       _cons |   64.92238   .9165846    70.83   0.000     63.12591   66.71886
-------------+----------------------------------------------------------------
bmi          |
       _cons |   22.99999   .1964848   117.06   0.000     22.61489    23.3851
-------------+----------------------------------------------------------------
gamma        |
           1 |   .0061345   .0095526     0.64   0.521    -.0125882   .0248572
           2 |   .4534053   .0936021     4.84   0.000     .2699486    .636862
-------------+----------------------------------------------------------------
Sigma        |
         1 1 |    169.679   16.86076                      139.6514   206.1631
         1 2 |   26.31228   3.150039                      20.13832   32.48624
         2 2 |   7.910783   .8210286                      6.454709   9.695323
------------------------------------------------------------------------------
LR test vs MVN regression: chi2(2) = 37.55            Prob > chi2 = 0.0000


By default, mskewnreg reports results in the CP metric. The estimate of the skewness parameter for lbm is close to zero, and according to the z-test (p = 0.521), the hypothesis of H0: γ1 = 0 cannot be rejected. For bmi, however, there is strong evidence that the skewness parameter is different from zero. The joint test of H0: γ1 = 0, γ2 = 0 (see below) and the reported likelihood-ratio test strongly reject the hypothesis of bivariate normality for lbm and bmi.

. mskewnreg, postcp

. test [gamma1]_cons [gamma2]_cons

 ( 1)  [gamma1]_cons = 0
 ( 2)  [gamma2]_cons = 0

           chi2(  2) =   52.18
         Prob > chi2 =   0.0000

To test CPs with mskewnreg, we first need to post CP estimates and their VCE to e(b) and e(V) using the postcp option. By default, mskewnreg saves parameters and their VCE in the estimation metric, which is described in Azzalini and Capitanio (2003) for the multivariate skew-t distribution.

We can also obtain the results in the DP metric by using the dpmetric option:

. mskewnreg lbm bmi, dpmetric nolog

Multivariate skew-normal regression             Number of obs   =        202
                                                Wald chi2(0)    =          .
Log likelihood = -1213.2609                     Prob > chi2     =          .

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
lbm          |
       _cons |   61.76118    1.86054    33.20   0.000     58.11459   65.40777
-------------+----------------------------------------------------------------
bmi          |
       _cons |   20.13548   .2921862    68.91   0.000     19.56281   20.70816
-------------+----------------------------------------------------------------
alpha        |
           1 |   -2.30218   .5772141    -3.99   0.000    -3.433499  -1.170861
           2 |   5.515335   1.301097     4.24   0.000     2.965232   8.065439
-------------+----------------------------------------------------------------
Omega        |
         1 1 |   179.6722   21.30181                      142.4174   226.6725
         1 2 |   35.36759    7.52879                      20.61143   50.12375
         2 2 |   16.11622   2.299581                      12.18449   21.31664
------------------------------------------------------------------------------
LR test vs MVN regression: chi2(2) = 37.55            Prob > chi2 = 0.0000

Notice that the estimate of α1 corresponding to the shape parameter of lbm in the DP metric is very far from zero compared with the skewness index reported earlier. As mentioned in section 3.3, the individual shape parameters are poor estimates of the magnitude of the asymmetry. Although their zero values provide evidence that the multivariate normal model may be adequate, the opposite is not necessarily true, as we witnessed in this example.


We can compare the fit against the normal model by using, for example, a Q–Q plot:

. skewrplot, qq normal

[Graphs: Skew-normal Q–Q plot and Normal Q–Q plot of squared Mahalanobis distances against expected χ2 (d.f. = 2) quantiles]

Figure 10. Q–Q plot for bivariate skew-normal and normal model

Figure 10 shows that the bivariate skew-normal model fits the data better than the bivariate normal model.


We can also fit the bivariate skew-t model:

. mskewtreg lbm bmi, nolog

Multivariate skew-t regression                  Number of obs   =        202
                                                Wald chi2(0)    =          .
Log likelihood = -1213.1074                     Prob > chi2     =          .

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
lbm          |
       _cons |    61.9651   1.926496    32.16   0.000     58.18923   65.74096
-------------+----------------------------------------------------------------
bmi          |
       _cons |   20.19786   .3165282    63.81   0.000     19.57748   20.81825
-------------+----------------------------------------------------------------
alpha        |
           1 |  -2.234864   .5836011    -3.83   0.000    -3.378702  -1.091027
           2 |   5.242386   1.355911     3.87   0.000      2.58485   7.899922
-------------+----------------------------------------------------------------
Omega        |
         1 1 |   171.7734   24.33629                      130.1249   226.7521
         1 2 |   32.63323     8.5462                      15.88298   49.38347
         2 2 |    14.8864   3.046903                      9.967092   22.23366
-------------+----------------------------------------------------------------
          df |   51.00171   95.45806                      1.301432   1998.702
------------------------------------------------------------------------------
LR test vs MVN regression: chibar2(2_3) = 37.86   Prob >= chibar2 = 0.0000

The estimated degrees of freedom are large, which suggests that the skew-normal model is sufficient for modeling lbm and bmi.
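A formal comparison can mirror the univariate analysis above. The following is a minimal sketch (output omitted), where the stored-estimate name msn is ours and, because ν is again tested at the boundary ν = ∞, the p-value from lrtest should be halved:

. quietly mskewnreg lbm bmi, nolog
. estimates store msn
. quietly mskewtreg lbm bmi, nolog
. lrtest msn ., force
. display r(p)/2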

We can also adjust the location for gender by including female as a regressor:

. mskewnreg lbm bmi = female, nolog

Multivariate skew-normal regression             Number of obs   =        202
                                                Wald chi2(1)    =     314.13
Log likelihood = -1105.0246                     Prob > chi2     =     0.0000

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
lbm          |
      female |  -20.36519   1.149042   -17.72   0.000    -22.61728  -18.11311
       _cons |    75.0455   .8219865    91.30   0.000     73.43443   76.65656
-------------+----------------------------------------------------------------
bmi          |
      female |  -2.267239   .3202246    -7.08   0.000    -2.894868  -1.639611
       _cons |   24.13093   .2413993    99.96   0.000     23.65779   24.60406
-------------+----------------------------------------------------------------
gamma        |
           1 |   .1037418   .0543517     1.91   0.056    -.0027856   .2102692
           2 |   .6843178   .0915305     7.48   0.000     .5049213   .8637143
-------------+----------------------------------------------------------------
Sigma        |
         1 1 |   71.51098   7.115973                      58.83973   86.91101
         1 2 |   16.63504   2.002173                      12.71085   20.55923
         2 2 |   6.954864   .7538472                      5.623747   8.601051
------------------------------------------------------------------------------
LR test vs MVN regression: chi2(2) = 35.63            Prob > chi2 = 0.0000


We could also fit separate regressions for males and females to allow all parameters of the joint distribution to vary across gender.
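For example, a minimal sketch (output omitted), using the female indicator from the dataset:

. mskewnreg lbm bmi if female==0, nolog
. mskewnreg lbm bmi if female==1, nolog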

5.3 Log-skew-normal and log-skew-t distributions for modeling positive data

The lognormal and log-t distributions are often used to model data such as precipitation data or income data that have a positive support. These distributions imply that the distribution of the data in the log metric is symmetric. This assumption may be too restrictive in some applications. For example, here we investigate how reasonable this assumption is in the analysis of the monthly U.S. national precipitation data, following Marchenko and Genton (2010). The data are publicly available from the National Climatic Data Center, the largest archive of weather data, and include monthly precipitation measured in inches for the period of 1895–2007 (113 observations per month). The national values could be viewed as weighted averages of station data. More specifically, national values are obtained from the regional values weighted by area. The regional values for each of the nine U.S. climatic regions are computed from the statewide values (which are obtained from the divisional values weighted by area) weighted by area. The divisional monthly precipitation data are obtained as monthly equally weighted averages of values reported by all stations within a climatic division.

To fit the log-skew-normal model to the precipitation data, we follow the standard procedure and fit the skew-normal model, described previously, to the log of the precipitation. For example, we generate the new variable lnprecip to contain the log of the precipitation and fit the skew-normal distribution to the January (month==1) log-precipitation measurements over 113 years:

. use precip07_national
(Precipitation (inches), national U.S. data)

. generate lnprecip = ln(precip)

. skewnreg lnprecip if month==1, nolog

Skew-normal regression                          Number of obs   =        113
                                                Wald chi2(0)    =          .
Log likelihood = .71065091                      Prob > chi2     =          .

------------------------------------------------------------------------------
    lnprecip |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |   .7651154   .0228328    33.51   0.000     .7203639   .8098669
       gamma |  -.3321967   .1894122    -1.75   0.079    -.7034378   .0390445
       sigma |   .2428148   .0168615                      .2119171   .2782174
------------------------------------------------------------------------------
LR test vs normal regression: chi2(1) = 2.96          Prob > chi2 = 0.0853

The skewness index is not significantly different from zero at a 5% level, so the assumption of normality seems reasonable for January log precipitation.

More generally, we can obtain skewness indexes for all months. Below we use the statsby command to collect the estimates of skewness indexes and their respective standard errors from skewnreg over months and plot them along with their associated 95% confidence intervals (also see Cox [2010] for more examples of statsby):

. statsby gamma=_b[gamma:_cons] se_gamma=_se[gamma:_cons], by(month) clear:
> skewnreg lnprecip
(running skewnreg on estimation sample)

      command:  skewnreg lnprecip
        gamma:  _b[gamma:_cons]
     se_gamma:  _se[gamma:_cons]
           by:  month

Statsby groups
    1    2    3    4    5
............

. generate lb = gamma-1.96*se_gamma

. generate ub = gamma+1.96*se_gamma

. twoway (line gamma month, sort) (rcap ub lb month, sort), yline(0) xtitle("")
> ytitle("Skewness index") legend(off) xlabel(1(1)12, valuelabel angle(45))

[Graph: skewness index by month, January through December, with 95% confidence intervals]

Figure 11. Skewness indexes over months with 95% confidence intervals

From figure 11, we can see that the assumption of symmetry of the distribution of the log precipitation is questionable for some months (for example, September and October). The distribution of the log precipitation is negatively skewed for summer and fall months and becomes more symmetric in early spring. Similarly, we can investigate the trend in the tails of the distribution over months by plotting the estimated degrees of freedom from skewtreg.
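A sketch of that follow-up is below (not run here). We assume that skewtreg stores the degrees-of-freedom estimate in e(b) under the equation name df, so that it can be collected as _b[df:_cons]; if the command uses a different equation name, adjust accordingly:

. use precip07_national, clear
. generate lnprecip = ln(precip)
. statsby df=_b[df:_cons], by(month) clear: skewtreg lnprecip
. twoway line df month, sort ytitle("Degrees of freedom") xlabel(1(1)12, valuelabel angle(45))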

6 Conclusion

In this article, we described two flexible parametric models, the skew-normal and skew-t models, which can be used for the analysis of nonnormal data. We presented a suite of commands for fitting these models in Stata to univariate and multivariate data. We also provided postestimation features for obtaining linear predictions and for graphically evaluating the goodness of fit of the skewed distributions to the data. We demonstrated how to use the commands for univariate and multivariate analyses of the well-known Australian Institute of Sport data. We also showed how to use the developed commands to analyze data with positive support, using the U.S. precipitation data as an example.

7 Acknowledgments

The authors thank the referee for valuable comments. M. G. Genton’s research was partially supported by NSF grants CMG ATM-0620624 and DMS-1007504. We are grateful to R. B. Arellano-Valle, A. Azzalini, T. DiCiccio, and A. C. Monti for access to preliminary versions of their publications.

8 References

Arellano-Valle, R. B., and A. Azzalini. 2008. The centred parametrization for the multivariate skew-normal distribution. Journal of Multivariate Analysis 99: 1362–1382.

———. 2009. Parameters and other summary quantities of the skew-t distribution. Manuscript in preparation.

Arnold, B. C., and R. J. Beaver. 2000. The skew-Cauchy distribution. Statistics & Probability Letters 49: 285–290.

Azzalini, A. 1985. A class of distributions which includes the normal ones. Scandinavian Journal of Statistics 12: 171–178.

———. 2005. The skew-normal distribution and related multivariate families (with discussion by Marc G. Genton and a rejoinder by the author). Scandinavian Journal of Statistics 32: 159–200.

Azzalini, A., and A. Capitanio. 1999. Statistical applications of the multivariate skew normal distribution. Journal of the Royal Statistical Society, Series B 61: 579–602.

———. 2003. Distributions generated by perturbation of symmetry with emphasis on a multivariate skew t-distribution. Journal of the Royal Statistical Society, Series B 65: 367–389.

Azzalini, A., T. dal Cappello, and S. Kotz. 2002. Log-skew-normal and log-skew-t distributions as models for family income data. Journal of Income Distribution 11: 12–20.

Azzalini, A., and A. Dalla Valle. 1996. The multivariate skew-normal distribution. Biometrika 83: 715–726.

Azzalini, A., and M. G. Genton. 2008. Robust likelihood methods based on the skew-t and related distributions. International Statistical Review 76: 106–129.

Bayes, C. L., and M. D. Branco. 2007. Bayesian inference for the skewness parameter of the scalar skew-normal distribution. Brazilian Journal of Probability and Statistics 21: 141–163.

Branco, M. D., and D. K. Dey. 2001. A general class of multivariate skew-elliptical distributions. Journal of Multivariate Analysis 79: 99–113.

Cook, R. D., and S. Weisberg. 1994. An Introduction to Regression Graphics. New York: Wiley.

Cox, N. J. 2010. Speaking Stata: The statsby strategy. Stata Journal 10: 143–151.

Dalla Valle, A. 2007. A test for the hypothesis of skew-normality in a population. Journal of Statistical Computation and Simulation 77: 63–77.

DiCiccio, T. J., and A. C. Monti. 2009. Inferential aspects of the skew-t distribution. Manuscript in preparation.

Genton, M. G., ed. 2004. Skew-Elliptical Distributions and Their Applications: A Journey Beyond Normality. Boca Raton, FL: Chapman & Hall/CRC.

Gutierrez, R. G., S. Carter, and D. M. Drukker. 2001. sg160: On boundary-value likelihood-ratio tests. Stata Technical Bulletin 60: 15–18. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 269–273. College Station, TX: Stata Press.

Hallin, M., and C. Ley. Forthcoming. Skew-symmetric distributions and Fisher information—A tale of two densities. Bernoulli.

Ley, C., and D. Paindaveine. 2010. On the singularity of multivariate skew-symmetric models. Journal of Multivariate Analysis 101: 1434–1444.

Liseo, B., and N. Loperfido. 2006. A note on reference priors for the scalar skew-normal distribution. Journal of Statistical Planning and Inference 136: 373–389.

Marchenko, Y. V., and M. G. Genton. 2010. Multivariate log-skew-elliptical distributions with applications to precipitation data. Environmetrics 21: 318–340.

Sartori, N. 2006. Bias prevention of maximum likelihood estimates for scalar skew normal and skew t distributions. Journal of Statistical Planning and Inference 136: 4259–4275.

About the authors

Yulia V. Marchenko is a senior statistician at StataCorp. Her research interests include multiple imputation, survival analysis, skewed multivariate non-Gaussian distributions, and statistical software development.

Marc G. Genton is a professor at the Department of Statistics, Texas A&M University, College Station. His research interests include skewed multivariate non-Gaussian distributions, spatial and spatio-temporal statistics, and robustness.

The Stata Journal (2010) 10, Number 4, pp. 540–567

Fitting heterogeneous choice models with oglm

Richard Williams
Department of Sociology
University of Notre Dame

Notre Dame, IN

[email protected]

Abstract. When a binary or ordinal regression model incorrectly assumes that error variances are the same for all cases, the standard errors are wrong and (unlike ordinary least squares regression) the parameter estimates are biased. Heterogeneous choice models (also known as location–scale models or heteroskedastic ordered models) explicitly specify the determinants of heteroskedasticity in an attempt to correct for it. Such models are also useful when the variance itself is of substantive interest. This article illustrates how the author’s Stata program oglm (ordinal generalized linear models) can be used to fit heterogeneous choice and related models. It shows that two other models that have appeared in the literature (Allison’s model for group comparisons and Hauser and Andrew’s logistic response model with proportionality constraints) are special cases of a heterogeneous choice model and alternative parameterizations of it. The article further argues that heterogeneous choice models may sometimes be an attractive alternative to other ordinal regression models, such as the generalized ordered logit model fit by gologit2. Finally, the article offers guidelines on how to interpret, test, and modify heterogeneous choice models.

Keywords: st0208, oglm, heterogeneous choice model, location–scale model, gologit2, ordinal regression, heteroskedasticity, generalized ordered logit model

1 Introduction

When a binary or ordinal regression model incorrectly assumes that error variances are the same for all cases, the standard errors are wrong, and [unlike ordinary least squares (OLS) regression] the parameter estimates are biased (Yatchew and Griliches 1985). Heterogeneous choice models (also known as location–scale models or heteroskedastic ordered models) explicitly specify the determinants of heteroskedasticity in an attempt to correct for it (Williams 2009; Keele and Park 2006).

In addition, most regression-type analyses focus on the conditional mean of a variable or on conditional probabilities [for example, E(Y|X), Pr(Y = 1|X)]. Sometimes, however, determinants of the conditional variance are also of interest. For example, Allison (1999) speculated that unmeasured variables affecting the chances of promotion may be more important for women scientists than for men, causing women’s career outcomes to be more variable and less predictable. Heterogeneous choice models make it possible to examine such issues.

© 2010 StataCorp LP   st0208


Williams (2009) provides an extensive critique of the strengths and weaknesses of heterogeneous choice models, including a more detailed substantive discussion of some of the examples presented here. The current article takes a more applied approach and illustrates how the author’s Stata command oglm (ordinal generalized linear models1) can be used to fit heterogeneous choice models and related models. The article demonstrates how two other models that have appeared in the literature—Allison’s (1999) model for comparing logit and probit coefficients across groups and Hauser and Andrew’s (2006) logistic response model with proportionality constraints (LRPC)—are special cases and alternative parameterizations of oglm’s heterogeneous choice model; yet, despite these equivalencies, it is possible to interpret the results of these models in very different ways. The article further argues that heterogeneous choice models may sometimes be an attractive alternative to other ordinal regression models, such as the generalized ordered logit model fit by gologit2. Finally, the article offers guidelines on how to interpret the parameters of such models, ways to make interpretation easier, and procedures for testing hypotheses and making model modifications.

2 The heterogeneous choice or location–scale model

Suppose there is an observed variable y with ordered categories—for example, strongly disagree, disagree, neutral, agree, and strongly agree. One of the rationales for the ordered logit and probit models is that y is actually a collapsed or limited version of a latent variable, y∗. As respondents cross thresholds or cutpoints on y∗, their observed values on y change—for example,

y = 1 if −∞ < y∗ < κ1

y = 2 if κ1 < y∗ < κ2

y = 3 if κ2 < y∗ < κ3

y = 4 if κ3 < y∗ < κ4

y = 5 if κ4 < y∗ < +∞

The model for the underlying y∗ can be written as

y∗i = α0 + α1xi1 + · · · + αKxiK + σεi

where the x’s are the explanatory variables, the α’s are coefficients that give the effect of each x on y∗, εi is a residual term often assumed to have either a logistic or normal(0, 1) distribution, and σ is a parameter that allows the variance to be adjusted upward or downward.

1. The name is slightly misleading in that oglm can also fit the nonlinear models presented here.

Because y∗ is a latent variable, its metric has to be fixed in some way. Typically, this is done by scaling the coefficients so that the residual variance is π2/3 (as in logit) or 1 (as in probit).2 Further, because y∗ is unobserved, we do not actually estimate the α’s. Rather, we estimate parameters called β’s. As Allison (1999, citing Amemiya [1985, 269]) notes, the α’s and the β’s are related this way:

βk = αk/σ k = 1, . . . , K

This now leads us to a potential problem with the ordered logit/probit model. When σ is the same for all cases—residuals are homoskedastic—the ratio between the β’s and the α’s is also the same for all cases. However, when σ differs across cases—there is heteroskedasticity—the ratio also differs (Allison 1999). As Hoetker (2004, 17) notes, “. . . in the presence of even fairly small differences in residual variation, naïve comparisons of coefficients [across groups] can indicate differences where none exist, hide differences that do exist, and even show differences in the opposite direction of what actually exists.”

We will illustrate this first by a series of hypothetical examples. Remember, σ is an adjustment factor for the residual variance. Therefore, σ is fixed at 1 for one group, and the σ for the other group reflects how much greater or smaller that group’s residual variance is. In each example, the α’s and σ for group 0 are fixed at 1. For group 1, the values of the α’s and σ are systematically varied. We then see how cross-group comparisons of the β’s—that is, the parameters that are actually estimated in a logistic regression—are affected by differences in residual variability.

Case 1: Underlying alphas are equal, residual variances differ.

                   Group 0                                Group 1
Model using α      y*_i = x_i1 + x_i2 + x_i3 + ε_i        y*_i = x_i1 + x_i2 + x_i3 + 2ε_i
Model using β      y*_i = x_i1 + x_i2 + x_i3 + ε_i        y*_i = 0.5x_i1 + 0.5x_i2 + 0.5x_i3 + ε_i

In case 1, the underlying α’s all equal 1 in both groups. However, because the residual variance is twice as large for group 1 as it is for group 0, the β’s are only half as large for group 1 as for group 0. Naïve comparisons of coefficients can indicate differences where none exist.

2. This technique can be easily illustrated using Long and Freese’s fitstat command, which is part of the spost9 package available from Long’s website. No matter what logit or probit model is fit (for example, you can add variables, subtract variables, or change the variables completely), fitstat always reports a residual variance of 3.29 (that is, π2/3) for logit models and 1.0 for probit models.


Case 2: Underlying alphas differ, residual variances differ.

                   Group 0                                Group 1
Model using α      y*_i = x_i1 + x_i2 + x_i3 + ε_i        y*_i = 2x_i1 + 2x_i2 + 2x_i3 + 2ε_i
Model using β      y*_i = x_i1 + x_i2 + x_i3 + ε_i        y*_i = x_i1 + x_i2 + x_i3 + ε_i

In case 2, the α’s are twice as large in group 1 as those in group 0. However, because the residual variances also differ, the β’s for the two groups are the same. Differences in residual variances obscure the differences in the underlying effects. Naïve comparisons of coefficients can hide differences that do exist.

Case 3: Underlying alphas differ, residual variances differ even more.

                   Group 0                                Group 1
Model using α      y*_i = x_i1 + x_i2 + x_i3 + ε_i        y*_i = 2x_i1 + 2x_i2 + 2x_i3 + 3ε_i
Model using β      y*_i = x_i1 + x_i2 + x_i3 + ε_i        y*_i = (2/3)x_i1 + (2/3)x_i2 + (2/3)x_i3 + ε_i

In case 3, the α’s are again twice as large in group 1 as those in group 0. However, because of the large differences in residual variances, the β’s are smaller for group 1 than for group 0. Differences in residual variances make it look like the Xs have smaller effects on group 1 when really the effects are larger. Naïve comparisons of coefficients can even show differences in the opposite direction of what actually exists.

To think of the problem another way, the β’s that are fit are basically standardized coefficients, and hence, when doing cross-group comparisons we encounter problems that are very similar to those that occur when comparing standardized coefficients for different groups in OLS regression (Duncan 1975). Because coefficients are always scaled so that the residual variance is the same no matter what variables are in the model, the scaling of coefficients will differ across groups if the residual variances are different and will make cross-group comparisons of effects invalid.

The heterogeneous choice model provides us with a means for dealing with these problems. With this model, σ can differ across cases, hence correcting for heteroskedasticity. The heterogeneous choice model accomplishes this by simultaneously fitting two equations: one for the determinants of the outcome, or choice, and another for the determinants of the residual variance. The choice equation can be written as

\[
y_i^* = \sum_k x_{ik}\beta_k + \varepsilon_i
\]

The location or choice equation gives the value of the underlying latent variable. In the equation above, x is a vector of k values for the ith observation. The x’s are the explanatory variables and are said to be the determinants of the choice, or outcome. The β’s show how the x’s affect the choice.

The variance equation can be written as

\[
\sigma_i = \exp\Bigl(\sum_j z_{ij}\gamma_j\Bigr)
\]

The scale or variance equation indicates how the underlying latent variable is scaled for each case; that is, it reflects differences in residual variability that, if left unaccounted for, would cause values to be scaled differently across cases. In the equation above, z is a vector of j values for the ith observation. The z’s can define groups with different error variances in the underlying latent variable. For example, the z’s might include dummy variables for gender or race. However, the z’s can also include continuous variables that are related to the error variances. For example, as income increases, the error variances may increase. The z’s and x’s need not include any of the same variables, although they can. When the z’s all equal 0, σi = 1. The γ’s show how the z’s affect the variance (or more specifically, the log of σ; fitting the log of σ guarantees that σ itself will always have a positive value).

For an ordered variable y with M categories coded 1 to M, the full heterogeneous choice model (using logit link) can then be written as3

\[
P(y_i > m) = \operatorname{invlogit}\left\{\frac{\sum_k x_{ik}\beta_k - \kappa_m}{\exp\bigl(\sum_j z_{ij}\gamma_j\bigr)}\right\}
           = \operatorname{invlogit}\left(\frac{\sum_k x_{ik}\beta_k - \kappa_m}{\sigma_i}\right),
           \qquad m = 1, 2, \ldots, M-1 \tag{1}
\]

where

\[
\operatorname{invlogit}(x) = \text{inverse logit function of } x = \exp(x)/\{1 + \exp(x)\}
\]
\[
\exp\Bigl(\sum_j z_{ij}\gamma_j\Bigr) = \exp\{\ln(\sigma_i)\} = \sigma_i
\]
\[
\kappa_0 = -\infty \quad \text{and} \quad \kappa_M = \infty
\]

3. The actual coding does not matter so long as the categories are ordered. For example, Y could be coded −2 to 2 or Y could be a dichotomy coded 0–1.


The full model shows how the choice and variance equations are combined to come up with the probability for any given response. For example, you can compute the probability that a person with a given set of characteristics will strongly agree or disagree with a statement. In the above formula, the κ’s are the cutpoints. As is the case with logit and ologit, when the dependent variable is a 0–1 dichotomy, the model can be rewritten to add a constant (β0) rather than subtract a cutpoint. The end result is the same because the cutpoint and constant are opposite in sign. The logit link function is used here, but others, such as probit, complementary log–log, log–log, and cauchit, are possible.

When σi = 1 for all cases and links logit or probit are used, the heterogeneous choice model becomes the same as the ordered logit or probit models fit by ologit and oprobit. When the dependent variable is a dichotomy and the link is probit, the heterogeneous choice model becomes the same as the heteroskedastic probit model fit by hetprob (except that hetprob uses an intercept rather than a cutpoint). As we will see, although it is less apparent, various other models that have appeared in the literature are also special cases of heterogeneous choice models.
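For instance, the following sketch (with hypothetical variables y, x1, x2, and z1; output omitted) should produce matching log likelihoods and, up to the sign convention for the cutpoint versus the intercept, matching estimates:

. * ordered outcome, no variance equation: oglm reproduces ologit
. ologit y x1 x2
. oglm y x1 x2

. * binary outcome, probit link, variance equation in z1: oglm reproduces hetprob
. hetprob y x1 x2, het(z1)
. oglm y x1 x2, link(probit) hetero(z1)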

3 The oglm command

3.1 Syntax

oglm supports many standard Stata options, which work the same way as they do with other Stata commands. Several other options are unique to or fine-tuned for oglm. The complete syntax is

oglm depvar [indepvars] [if] [in] [weight] [,
     link(logit | probit | cloglog | loglog | cauchit) hetero(varlist) scale(varlist)
     eq2(varlist) flip hc ls force lrforce store(name) log or rrr eform irr
     hr constraints(clist) robust cluster(varname) level(#)
     maximize_options]

oglm shares the features of all estimation commands; see help estcom. oglm typed without arguments redisplays previous results. The following options may be given when redisplaying results:

store(name) or irr rrr hr eform level(#)

by, svy, nestreg, stepwise, xi, and possibly other prefix commands are allowed; see help prefix.

pweights, fweights, and iweights are allowed; see help weight.


3.2 Options

link(logit | probit | cloglog | loglog | cauchit) specifies the link function to be used. The legal values are link(logit), link(probit), link(cloglog), link(loglog), and link(cauchit). The default is link(logit).

Users should keep in mind that programs differ in the names used for some links. Stata’s loglog link corresponds to SPSS PLUM’s cloglog link, and Stata’s cloglog link is called nloglog in SPSS. The following advice for choosing an appropriate link function is excerpted from Norusis (2005, 84): “Probit and logit models are reasonable choices when the changes in the cumulative probabilities are gradual. If there are abrupt changes, other link functions should be used. The complementary log–log link may be a good model when the cumulative probabilities increase from 0 fairly slowly and then rapidly approach 1. If the opposite is true, namely that the cumulative probability for lower scores is high and the approach to 1 is slow, the negative log–log link may describe the data”.

hetero(varlist), scale(varlist), and eq2(varlist) are synonyms (use only one of them) and can be used to specify the variables believed to affect heteroskedasticity in heterogeneous choice and location–scale models. In such models, the model chi-squared statistic is a test of whether any of the choice and location parameters or the heteroskedasticity and scale parameters differ from zero; this differs from hetprob, where the model chi-squared tests only the choice and location parameters. The more neutral-sounding eq2(varlist) alternative is provided because it may be less confusing when using the flip option.

flip causes the command-line placement of the location and scale variables to be reversed; that is, what would normally be the choice and location variables will instead be the variance and scale variables, and vice versa. This functionality is primarily useful if you want to use the stepwise or nestreg prefix commands to do stepwise selection or hierarchical entry of the heteroskedasticity and scale variables. (Just be sure to remember which set of variables is which.) If you do this, use the likelihood-ratio test options of nestreg or stepwise, because the default Wald tests may be wrong otherwise.

hc and ls affect how the equations are labeled. If hc is used, then, to be consistent with the literature on heterogeneous choice, the equations are labeled “choice” and “variance”. If ls is used, the equations are labeled “location” and “scale”, which is consistent with SPSS PLUM and other published literature. If neither option is specified, then the scale or heteroskedasticity equation is labeled “lnsigma”, which is consistent with other Stata programs such as hetprob.

force can be used to force oglm to issue only warning messages in some situations when it would normally give a fatal error message. By default, the dependent variable can have a maximum of 20 categories. A variable with more categories than that is probably a mistaken entry by the user—for example, if a continuous variable has been specified rather than an ordinal one. However, if the dependent variable really is ordinal with more than 20 categories, force will let oglm analyze it (although other practical limitations, such as small sample sizes within categories, may prevent it from generating a final solution). Obviously, you should use force only when you are confident that you are not making a mistake. trustme can be used as a synonym for force.

lrforce forces Stata to report a likelihood-ratio statistic under certain conditions when it ordinarily would not. Some types of constraints can make a likelihood-ratio chi-squared test invalid. Hence, to be safe, Stata reports a Wald statistic whenever constraints are used. For many common sorts of constraints (for example, constraining the effects of two variables to be equal) a likelihood-ratio chi-squared statistic is probably appropriate. The lrforce option will be ignored when robust standard errors are specified either directly or indirectly (for example, via use of the robust or svy options). Use this option with caution.

store(name) causes the command estimates store name to be executed when oglm finishes. This is useful for when you wish to fit a series of models and want to save the results. See help estimates. The store() option may not work correctly when the svy prefix is used.

log displays the iteration log. By default, it is suppressed.

or reports the estimated coefficients transformed to relative odds ratios—that is, exp(b) rather than b; see [R] ologit for a description of this concept. Options rrr, eform, irr, and hr produce identical results (that are labeled differently) and can also be used. It is up to the user to decide whether the exp(b) transformation makes sense given the link function used; for example, it probably does not make sense when using the probit link.

constraints(clist) specifies the linear constraints to be applied during estimation. The default is to perform unconstrained estimation. Constraints are defined with the constraint command. constraints(1) specifies that the model is to be constrained according to constraint 1; constraints(1-4) specifies constraints 1 through 4; and constraints(1-4,8) specifies constraints 1 through 4 and 8.

robust specifies that the Huber/White/sandwich estimator of variance is to be used in place of the traditional calculation. If you specify pweights, robust is implied.

cluster(varname) specifies that the observations are independent across groups (clusters) but not necessarily within groups. varname specifies the group to which each observation belongs; for example, cluster(personid) would specify data with repeated observations on individuals. cluster() affects the estimated standard errors and variance–covariance matrix of the estimators, but not the estimated coefficients. cluster() can be used with pweights to produce estimates for unstratified cluster-sampled data.

level(#) specifies the confidence level, as a percentage, for confidence intervals. The default is level(95) or as set by set level.


maximize_options control the maximization process; see help maximize. You should never have to specify most of these. However, the difficult option can sometimes be useful with models that are running very slowly or not converging.

3.3 Options available when replaying results

store(), or, irr, rrr, hr, eform, and level(#) are the same as described above.

4 Empirical examples

A series of empirical examples will help to illustrate the utility of heterogeneous choice models and the capabilities of the oglm program. These examples require that Richard Williams’s oglm and gologit2 routines and Ben Jann’s (2005, 2007) esttab program (all available from the Statistical Software Components) be installed. The first two examples demonstrate the equivalencies between the heterogeneous choice model and two other models that have appeared in the literature: Allison’s (1999) model for group comparisons and Hauser and Andrew’s (2006) LRPC. The third example compares and contrasts heterogeneous choice models and generalized ordered logit models as a means for dealing with violations of assumptions in the ordered logit model. The final two examples deal with practical issues in fitting and interpreting heterogeneous choice models. They illustrate 1) how to interpret coefficients; 2) why likelihood-ratio tests, when possible, are often preferable to Wald tests for hypothesis testing; 3) the use of stepwise regression with the variance equation; and 4) the use of heterogeneous choice models as a diagnostic device even when the researcher does not want to use a heterogeneous choice model for the final analysis.
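All three are available from the Statistical Software Components (SSC) archive and can be installed from within Stata; note that esttab is distributed as part of the estout package:

. ssc install oglm
. ssc install gologit2
. ssc install estout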

4.1 Example 1: Allison’s model of group comparisons

Allison (1999) analyzes a dataset of 301 male and 177 female biochemists.4 The units of analysis are person–years rather than persons. Each person has one record for each year of service as an assistant professor, for as many as ten years; once a person achieves tenure, no further records are added. As a result, we have 1,741 person–years for men and 1,056 person–years for women. The dependent variable in Allison’s analysis, tenure, is promotion to associate professor; tenure is coded 1 if the person was promoted in that year, and 0 otherwise. For the independent variables, year is the number of years since the beginning of the assistant professorship, yearsq is years squared, select is a measure of the selectivity of the colleges where scientists received their bachelor’s degrees, articles is the cumulative number of articles published by the end of each person–year, and prestige is a measure of prestige of the department in which scientists were employed. The primary substantive interest of the analysis is whether the determinants of tenure differ for men (group 0) and women (group 1).

Williams (2009) provides an extended discussion of the strengths and weaknesses of Allison’s proposed strategy, some of which we will expand on later. The appendix of Allison’s article presents the Stata code that is needed to fit his models.5 We begin by summarizing Allison’s discussion and then show how his results can be replicated using oglm.

4. The data were originally collected by J. Scott Long (Long, Allison, and McGinnis 1993) and are available on his website.

Allison starts by fitting separate logistic regression models for men and women. Of key interest is the effect of published articles: The effect is twice as great for men (0.0737) as it is for women (0.0340), and separate tests reveal that this difference is statistically significant. Allison (1999, 188) says, “If accurate, this difference suggests that men get a greater payoff from their published work than do females, a conclusion that many would find troubling”.

Allison notes, however, that differences in effects could be artifacts of differences in residual variability. Reasons exist for believing that women have more heterogeneous career patterns than men, especially during the period covered by his data. “Hence, unmeasured variables affecting the chances of promotion may be more important for women than for men. That difference could explain why the coefficients . . . are larger for men than for women” (Allison 1999, 190). Using our earlier terminology, Allison is arguing that this difference in effect may fall under case 1, in which underlying alphas are equal but the residual variances differ.

To examine this possibility, Allison uses a program presented in the appendix of his article to fit a single model for men and women that includes a new parameter that he calls δ. In this model, the coefficients for men and women are constrained to be equal. The δ parameter adjusts for the differences in residual variability between men and women. Allison’s model can be written as

\[
P(y_i = 1) = \operatorname{invlogit}\left\{\Bigl(\sum_k x_{ik}\beta_k + \beta_0\Bigr)\times(1 + \delta G_i)\right\}
           = \operatorname{invlogit}\left\{\frac{\sum_k x_{ik}\beta_k + \beta_0}{1/(1 + \delta G_i)}\right\}
           = \operatorname{invlogit}\left(\frac{\sum_k x_{ik}\beta_k + \beta_0}{\sigma_i}\right) \tag{2}
\]

where x is a vector of explanatory variables, Gi is a grouping variable (in this case, female) coded either 1 or 0, and δ > −1. The traditional logistic regression model is a special case of the above, where δ = 0. Under Allison’s approach, the σ for group 0 equals 1, and the σ for group 1 equals 1/(1 + δ). The value of δ in Allison’s model is −0.26, meaning that the standard deviation of the disturbance for men (group 0) is 26% lower than the standard deviation for women (group 1); that is, women are more variable in their career histories, which causes the estimated coefficients in the female model to be smaller. To the model with δ, Allison then adds an interaction term for gender*articles. This interaction term is insignificant. Allison therefore concludes, “The apparent difference in the coefficients for article counts in table 1 does not necessarily reflect a real difference in causal effects. It can be readily explained by differences in the degree of residual variation between men and women”.

5. The do-file included with this article includes the code needed to replicate Allison’s analysis using his own programs.


Allison used specialized code to fit his model. However, as Williams (2009) points out, although he did not label it as such, Allison actually fit a heteroskedastic logit model, which in turn is a special case of a heterogeneous choice model: the link is logit, the dependent variable is a 0–1 dichotomy, and the variance equation is limited to a single 0–1 dichotomous grouping variable that also appears in the choice equation. Under these conditions, the heterogeneous choice model presented in (1) simplifies to

\[
P(y_i = 1) = \operatorname{invlogit}\left\{\frac{\sum_k x_{ik}\beta_k - \kappa}{\exp(G_i\gamma)}\right\}
           = \operatorname{invlogit}\left[\frac{\sum_k x_{ik}\beta_k - \kappa}{\exp\{\ln(\sigma_i)\}}\right]
           = \operatorname{invlogit}\left(\frac{\sum_k x_{ik}\beta_k - \kappa}{\sigma_i}\right) \tag{3}
\]

Note the similarities between the formulas for the heterogeneous choice model (3) and for Allison’s model (2). In Allison’s approach, a constant (β0) is added in the numerator, while in the heterogeneous choice model, a cutpoint (κ) is subtracted. This difference is trivial because one number is the negative of the other. In both models, the numerator is divided by σi. The main difference is how the two methods arrive at their estimate of σi. Neither method estimates σi directly, but σi is easily computed from the numbers they do estimate. The heterogeneous choice model estimates the log of σi, which guarantees that σi will be a positive number. Under Allison’s approach, δ is estimated, where δ is the difference between the values of σ in the two groups. Not surprisingly, then, oglm can easily reproduce the estimates from Allison’s model. The het(female) option tells oglm to include female in the variance equation, thus allowing residual variability to differ by gender.

. use "http://www.indiana.edu/~jslsoc/stata/spex_data/tenure01.dta"(Gender differences in receipt of tenure (Scott Long 06Jul2006))

. * Allison restricted the sample to the first 10 years as an Assistant Prof

. keep if year <= 10(148 observations deleted)

. * Allison´s Table 1 - men only

. quietly logit tenure female year yearsq select articles prestige if female==0

. quietly estimates store male

. * Allison´s Table 1 - females only

. quietly logit tenure female year yearsq select articles prestige if female==1

. quietly estimates store female

. * oglm replication of Allison´s delta models from his Table 2

. quietly oglm tenure year yearsq select articles prestige female,> hetero(female) store(oglm1)

. * Compute Allison´s delta

. display (1 - exp(.3022305))/ exp(.3022305)-.26083233

. quietly oglm tenure year yearsq select articles prestige female f_articles,> hetero(female) store(oglm2)

. * Compute Allison´s delta

. display (1 - exp(.1774193))/ exp(.1774193)-.16257142


. esttab male female oglm1 oglm2, stats(N ll) mtitle

                      (1)             (2)             (3)             (4)
                     male          female           oglm1           oglm2
------------------------------------------------------------------------------
main
year                1.909***        1.408***        1.910***        1.838***
                   (8.92)          (5.47)          (9.56)          (9.06)

yearsq             -0.143***      -0.0956***       -0.140***       -0.134***
                  (-7.70)         (-4.36)         (-8.24)         (-7.89)

select              0.216***       0.0551           0.182***        0.170**
                   (3.51)          (0.77)          (3.45)          (3.29)

articles           0.0737***       0.0340**        0.0635***       0.0720***
                   (6.37)          (2.69)          (6.22)          (6.31)

prestige           -0.431***       -0.371*         -0.446***       -0.420***
                  (-3.96)         (-2.38)         (-4.60)         (-4.37)

female                                              -0.939*         -0.378
                                                   (-2.53)         (-0.84)

f_articles                                                         -0.0305
                                                                   (-1.63)

_cons              -7.680***       -5.842***
                  (-11.27)        (-6.75)
------------------------------------------------------------------------------
lnsigma
female                                               0.302*          0.177
                                                    (2.07)          (1.09)
------------------------------------------------------------------------------
cut1
_cons                                                7.491***        7.365***
                                                   (11.36)         (11.25)
------------------------------------------------------------------------------
N                    1741            1056            2797            2797
ll                 -526.5          -306.2          -836.3          -835.1
------------------------------------------------------------------------------
t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001

The models labeled oglm1 and oglm2 correspond to the delta models in Allison’s table 2. The log likelihoods for the corresponding models are identical, as are the coefficients for the variables in the choice equation. Similar to the difference between logit and ologit with a binary dependent variable, oglm reports cutpoints rather than constants, and the cutpoints equal the negative of the constants. The main, less obvious difference in the results is that Allison’s model reports δ while oglm reports γ, which in this case is ln(σGroup1). These results are algebraically equivalent: δ = {1 − exp(γ)}/exp(γ) = (1 − σGroup1)/σGroup1. The code above shows how delta can easily be computed using Stata.



The oglm1 model says that the standard deviation of the residuals is exp(γ) = exp(0.302) = 1.35 times larger for women than men, while Allison’s model using delta makes the equivalent statement that the standard deviation for men is 26% smaller than it is for women. In the oglm2 model, the standard deviation is exp(γ) = exp(0.177) = 1.194 times larger for women, which is the same as saying that the standard deviation for men is 16.25% smaller.

While either Allison’s code or oglm can be used for this problem, there are several advantages to using oglm. oglm allows for both ordinal and binary dependent variables. This is not just a matter of convenience: ordinal variables are generally preferable because they contain more information about the underlying latent variable.6 The variance equation is not limited to a single binary variable; hence, the ability of the researcher to fit a properly specified model increases. oglm has several other powerful features, such as the ability to obtain predicted probabilities, which we describe later. Finally, the use of oglm makes it clear that the fitted model falls within the broader class of heterogeneous choice and location–scale models that have already been well documented in the literature.

4.2 Example 2: Hauser and Andrew’s LRPC and LRPPC models

Mare (1980) applied a logistic response model to school continuation. Contrary to prior supposition, Mare’s estimates suggested that the effects of some socioeconomic background variables declined across six successive transitions, including completion of elementary school through entry into graduate school. Hauser and Andrew (2006) replicate and extend Mare’s analysis using the same data he did, the 1973 Occupational Changes in a Generation (OCG II) survey data (Blau et al. 1983; Inter-University Consortium for Political and Social Research 2010). Rather than analyzing each educational transition separately as Mare did, Hauser and Andrew fit a single model across all educational transitions. They take the original dataset of 21,682 white men and restructure it into 88,768 person–transition records. For example, somebody who completed the first three educational transitions would have four records. On the first three records, the dependent variable, outcome, would be coded 1 because the person made the transition, while on the record for the uncompleted fourth transition the dependent variable would be coded 0. The person would have no records for the fifth and sixth transitions because you cannot make those transitions if you have not made the fourth. To each record, they also added variables trans1–trans6, each of which is coded 1 if the record is from the transition in question, and 0 otherwise. For example, trans3 is coded 1 for each person–transition record in which the individual has completed the second transition and is now eligible to complete the third; otherwise, trans3 is coded 0.

Hauser and Andrew argue that the relative effects of some (but not necessarily all) background variables are the same at each transition, and that multiplicative scalars express proportional change in the effect of those variables across successive transitions.

6. Williams (2009) discusses in more detail the limitations of binary dependent variables and the advantages offered by ordinal measures.


Specifically, Hauser and Andrew fit two new types of models. We primarily focus on the first of these, the LRPC.

\[
\log\left(\frac{p_{ij}}{1 - p_{ij}}\right) = \beta_{j0} + \lambda_j\sum_k \beta_k X_{ijk},
\qquad j = 1, 2, \ldots, 6 \tag{4}
\]

The λj introduce proportional increases or decreases in the βk across transitions; thus the LRPC model implies proportional changes in main effects across transitions. Instead of having to estimate a different set of betas for each transition, a single set of betas is estimated, along with one λj proportionality factor for each of the j = 6 transitions (λ1 is constrained to equal 1). The proportionality constraints would hold if, say, the coefficients for the second transition were all 2/3 as large as the corresponding coefficients for the first transition, the coefficients for the third transition were all half as large as for the first transition, etc. Put another way, if the model holds, the items can be viewed as forming a composite scale, providing a parsimonious and substantively interesting model.

Hauser and Andrew (2006, 8), however, note that “one cannot distinguish empirically between the hypothesis of uniform proportionality of effects across transitions and the hypothesis that group differences between parameters of binary regressions are artifacts of heterogeneity between groups in residual variation”. Similarly, Mare (2006, 32) points out that “the constants of proportionality, λj, are estimable, but their values incorporate both differences across equations in the effects of the regressors and also differences in the variances of the underlying dependent variables”.

Indeed, even though the rationales behind the models are totally different, the heterogeneous choice model estimated by oglm produces a fit identical to the LRPC model estimated by Hauser and Andrew: the models are empirically indistinguishable. In the heterogeneous choice model [(1) and (3)], the Xβ’s are divided by σ’s, while in the LRPC (4) the Xβ’s are multiplied by λ’s. Because multiplication is simply the inverse of division, it is not surprising that Hauser and Andrew’s LRPC results can be easily reproduced using oglm.7 In the corresponding oglm code, all the variables in Hauser and Andrew’s betas and intercepts equation are included in oglm’s choice equation (except for trans1, because its inclusion would result in perfect multicollinearity). The variables in their lambdas equation are included in oglm’s heteroskedasticity equation.

7. The fit of the LRPC model is presented in table 5, model 4 of Hauser and Andrew’s (2006) article. The do-files included with this article show how to exactly reproduce Hauser and Andrew’s original results and show the simple algebraic manipulations that convert their parameterization into oglm’s.


. use lrpc, clear
(Hauser & Andrew, Sociological Methodology 2006 pp. 1-26, modified OCG II data)

. oglm outcome dunc sibsttl9 ln_inc_trunc edhifaom edhimoom broken farm16 south
> trans2 trans3 trans4 trans5 trans6,
> hetero(trans2 trans3 trans4 trans5 trans6) store(olrpc)

Heteroskedastic Ordered Logistic Regression     Number of obs   =      88768
                                                LR chi2(18)     =   26602.23
                                                Prob > chi2     =     0.0000
Log likelihood = -33529.654                     Pseudo R2       =     0.2840

------------------------------------------------------------------------------
     outcome |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
outcome      |
        dunc |   .2751199   .0130478    21.09   0.000     .2495466   .3006931
    sibsttl9 |  -.1744805   .0072242   -24.15   0.000    -.1886396  -.1603213
ln_inc_trunc |   .5383488   .0216585    24.86   0.000     .4958989   .5807987
    edhifaom |   .0942192   .0067319    14.00   0.000     .0810249   .1074136
    edhimoom |   .1470293   .0068439    21.48   0.000     .1336155   .1604431
      broken |  -.2778073   .0524071    -5.30   0.000    -.3805232  -.1750913
      farm16 |  -.1634613   .0427207    -3.83   0.000    -.2471923  -.0797303
       south |  -.1850324   .0374289    -4.94   0.000    -.2583917   -.111673
      trans2 |    .468548    .102289     4.58   0.000     .2680652   .6690307
      trans3 |  -.8607577   .0742938   -11.59   0.000    -1.006371  -.7151445
      trans4 |  -4.017835   .0674156   -59.60   0.000    -4.149967  -3.885702
      trans5 |  -4.974159   .1330155   -37.40   0.000    -5.234865  -4.713454
      trans6 |  -5.384518    .345992   -15.56   0.000     -6.06265  -4.706387
-------------+----------------------------------------------------------------
lnsigma      |
      trans2 |   .2904472   .0348906     8.32   0.000     .2220628   .3588316
      trans3 |   .5309857   .0323389    16.42   0.000     .4676026   .5943688
      trans4 |   .6084307   .0319945    19.02   0.000     .5457226   .6711389
      trans5 |   1.582275   .0714418    22.15   0.000     1.442251   1.722298
      trans6 |    2.38262   .2095284    11.37   0.000     1.971952   2.793288
-------------+----------------------------------------------------------------
       /cut1 |  -.5622391   .0691998    -8.12   0.000    -.6978682  -.4266101
------------------------------------------------------------------------------

Equivalencies between the LRPC and heterogeneous choice models are immediately apparent. Hauser and Andrew’s LRPC program produces a log likelihood of −33529.654, as does oglm. The coefficients in Hauser and Andrew’s betas equation have exact counterparts in oglm’s choice equation. Simple algebraic manipulations can yield the other parameters reported by Hauser and Andrew; for example, the LRPC’s lambdas are the reciprocals of the heterogeneous choice model’s sigmas.
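For instance, using the lnsigma coefficient for trans2 reported above, λ2 is the reciprocal of the implied σ2 (roughly 0.75):

. display 1/exp(.2904472)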

Hauser and Andrew also propose a less restrictive model, which they call the logistic response model with partial proportionality constraints (LRPPC):

\[
\log\left(\frac{p_{ij}}{1 - p_{ij}}\right) = \beta_{j0} + \lambda_j\sum_{k=1}^{k'} \beta_k X_{ijk} + \sum_{k=k'+1}^{K} \beta_{jk} X_{ijk},
\qquad j = 1, 2, \ldots, 6
\]


This model maintains the proportionality constraints for some variables while allowing the effects of other variables to freely differ across transitions. For example, Hauser and Andrew say the LRPPC “could apply to Mare’s analysis where effects of socioeconomic variables appear to decline across transitions while those of farm origin, one-parent family, and Southern birth vary in other ways”.

The LRPPC model can also be easily fit using oglm. As Hauser and Andrew show in their appendix, this model is fit by adding interaction terms involving transitions and the variables whose effects are allowed to freely vary across transitions. In oglm, this is accomplished by adding the interaction terms to the choice equation. The code is shown below.

*** H & A Model 6: An intercept for each transition, proportional effects of* socioeconomic variables, interactions of broken, farm, and south with transition.* This is the second hetero choice model (equivalent to H & A´s LRPPC).oglm outcome trans2 trans3 trans4 trans5 trans6 broken farm16 south

trans2Xbroken trans2Xfarm16 trans2Xsouth trans3Xbroken trans3Xfarm16trans3Xsouth trans4Xbroken trans4Xfarm16 trans4Xsouth trans5Xbrokentrans5Xfarm16 trans5Xsouth trans6Xbroken trans6Xfarm16 trans6Xsouth duncsibsttl9 ln inc trunc edhifaom edhimoom,hetero(trans2 trans3 trans4 trans5 trans6) store(m6)

Having noted these equivalences, it is important to realize that the substantive implications and rationales that motivate the models are very different. The LRPC and LRPPC say that effects differ across transitions by scale factors. The heterogeneous choice model says that effects do not differ across transitions; they only appear to differ when you fit separate models because the variances of residuals change across transitions. Empirically, there is no way to distinguish between the two.8 In any event, there can be little arguing that, at least in these data, the effects of socioeconomic status relative to other influences decline across transitions. The only question is whether this trend is caused by a decline in the absolute effects of socioeconomic status or by an increase in the influences of other (omitted) variables.

8. Using Hauser and Andrew's published code, we also fit an LRPC model with Allison's biochemist data. The similarities were striking and obvious: Other than the intercepts, which the two programs parameterize differently, the coefficient estimates were identical. Most critically, Allison's σ, which his program estimated and which he reported in his article, is exactly identical to the reciprocal of Hauser and Andrew's λ, which their program estimated and which they reported in their article. Hauser and Andrew's software is, in fact, a generalization of Allison's software for when there are two or more groups. The theoretical concerns that motivated their models and programs lead to radically different interpretations of the results. According to Allison's theory (and the theory behind the heterogeneous choice model), apparent differences in effects between men and women are an artifact of differences in residual variability. Someone looking at these exact same numbers from the viewpoint of the LRPC, however, would conclude that the effect of articles (and every other variable, for that matter) is 26% smaller for women than it is for men.


4.3 Example 3: Heterogeneous choice versus generalized ordered logit models

Williams (2006) notes that the proportional odds assumption9 of the ordered logit model is often violated. He shows that generalized ordered logit models are one way of dealing with the problem. We will now illustrate that heterogeneous choice models may also be attractive alternatives.

Long and Freese (2006) present data from the 1977 and 1989 General Social Survey in which respondents were asked to evaluate the following statement: “A working mother can establish just as warm and secure a relationship with her child as a mother who does not work.” Responses were coded as 1 = strongly disagree (1SD), 2 = disagree (2D), 3 = agree (3A), and 4 = strongly agree (4SA). Explanatory variables are yr89 (survey year; 0 = 1977, 1 = 1989), male (0 = female, 1 = male), white (0 = nonwhite, 1 = white), age (measured in years), ed (years of education), and prst (occupational prestige scale). ologit yields the following results:

. use http://www.indiana.edu/~jslsoc/stata/spex_data/ordwarm2.dta, clear
(77 & 89 General Social Survey)

. ologit warm yr89 male white age ed prst, nolog

Ordered logistic regression                       Number of obs   =       2293
                                                  LR chi2(6)      =     301.72
                                                  Prob > chi2     =     0.0000
Log likelihood = -2844.9123                       Pseudo R2       =     0.0504

        warm        Coef.   Std. Err.       z   P>|z|     [95% Conf. Interval]

        yr89     .5239025   .0798988    6.56   0.000      .3673037    .6805013
        male    -.7332997   .0784827   -9.34   0.000     -.8871229   -.5794766
       white    -.3911595   .1183808   -3.30   0.001     -.6231815   -.1591374
         age    -.0216655   .0024683   -8.78   0.000     -.0265032   -.0168278
          ed     .0671728    .015975    4.20   0.000      .0358624    .0984831
        prst     .0060727   .0032929    1.84   0.065     -.0003813    .0125267

       /cut1    -2.465362   .2389126                     -2.933622   -1.997102
       /cut2     -.630904   .2333155                     -1.088194    -.173614
       /cut3     1.261854   .2340179                      .8031873    1.720521

9. As Williams (2006) notes, the parallel lines assumption goes by many different names. In Stata, Wolfe and Gould's (1998) omodel command calls it the “proportional odds assumption”, a term that is appropriate only when the logit link is used. Long and Freese's brant command refers to the “parallel regressions assumption”. Both SPSS's PLUM command (Norusis 2005) and SAS's PROC LOGISTIC (SAS Institute 2004) provide tests of what they call the “parallel lines assumption”. For consistency with other major statistical packages, oglm and gologit2 also use the term “parallel lines”, but researchers should realize that others may use different but equivalent phrasings.

Both Long and Freese (2006) and Williams (2006) use a Brant test to show that the assumptions of the ordered logit model are violated, but the main problems seem to be with the variables yr89 and male. Williams (2006) shows that a generalized ordered logit model (fit by gologit2) provides a superior fit while introducing only a few additional parameters. gologit2 relaxes the parallel lines constraint for those variables that violate it (yr89 and male), while maintaining the constraint for others. Williams's article discusses the model in detail, but his main results can be reproduced with the command

. gologit2 warm yr89 male white age ed prst, autofit lrf store(gologit2)

(output omitted )

The model chi-squared for the gologit2 model is 338.30 with 10 degrees of freedom, which is a significant improvement over the ordered logit model (301.72 with 6 degrees of freedom). At the same time, the gologit2 model is much more parsimonious than a multinomial logit model, which has a model chi-squared of 349.53 but requires 18 degrees of freedom. Williams (2006, 58) therefore concludes that “gologit2 can estimate models that are less restrictive than the parallel lines models estimated by ologit (whose assumptions are often violated) but more parsimonious and interpretable than those estimated by a nonordinal method, such as multinomial logistic regression (that is, mlogit)”.10

We will now consider whether a heterogeneous choice model might also be a reasonable alternative in this case. Both gologit2 and the Brant test identified yr89 and male as the variables that violated the assumptions of the ordered logit model, so we include them in the variance equation:11

10. Both the Brant test and gologit2's autofit option rely on purely empirical means to identify violations of a model's assumptions. It would be better, of course, if researchers had strong theories about when and where the model's assumptions will be violated, but we suspect this is rarely the case. Given that the alternatives are often to fit a model whose assumptions are known to be violated (for example, ologit) or to fit a model that has far more parameters than are necessary (for example, mlogit), the sort of middle ground taken by a program like gologit2 may be the best choice. Williams (2006) argues that when theory about the nature of violations is lacking, the use of more stringent significance levels when testing helps to avoid capitalizing on chance.

11. Stepwise selection (see example 5) also results in the variables yr89 and male being included in the variance equation.


. oglm warm yr89 male white age ed prst, hetero(yr89 male) store(oglm)

Heteroskedastic Ordered Logistic Regression       Number of obs   =       2293
                                                  LR chi2(8)      =     331.03
                                                  Prob > chi2     =     0.0000
Log likelihood = -2830.2563                       Pseudo R2       =     0.0552

        warm        Coef.   Std. Err.       z   P>|z|     [95% Conf. Interval]

warm
        yr89     .4531574   .0686839    6.60   0.000      .3185394    .5877755
        male    -.6345402   .0697638   -9.10   0.000     -.7712748   -.4978057
       white    -.3087676    .102739   -3.01   0.003     -.5101323   -.1074029
         age    -.0186098   .0021728   -8.56   0.000     -.0228684   -.0143512
          ed     .0535685   .0135944    3.94   0.000      .0269239     .080213
        prst     .0052866     .00278    1.90   0.057     -.0001622    .0107353

lnsigma
        yr89    -.1486188   .0458169   -3.24   0.001     -.2384183   -.0588192
        male    -.1909211    .044807   -4.26   0.000     -.2787412   -.1031011

       /cut1    -2.151122   .2114069  -10.18   0.000     -2.565472   -1.736772
       /cut2    -.5696264   .1992724   -2.86   0.004     -.9601932   -.1790596
       /cut3     1.066508   .2022099    5.27   0.000      .6701839    1.462832

The variables male and yr89 have significant effects in both the choice and variance equations. The negative coefficients in the variance equation reveal that men were less variable in their attitudes than were women, and that variability in attitudes toward working women declined across time. Both results seem plausible and substantively interesting. Women, torn between traditional and new roles, may be more divided in their feelings toward working women. Consensus may have increased across time as the notion of women working became more socially acceptable and less divisive.
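The magnitude of these differences is perhaps easier to see on the sigma scale (a small sketch, assuming the oglm model above is still the active estimation result; the approximate values in the comments are computed from the output shown):

* implied multiplicative effects on the residual standard deviation
display exp([lnsigma]_b[yr89])    // roughly 0.86: less residual variability in 1989
display exp([lnsigma]_b[male])    // roughly 0.83: less residual variability for men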

Both the gologit2 and oglm models provide a much better fit to the data than does the ordered logit model. From a purely empirical standpoint, cases can be made for either approach:

. lrtest gologit2 oglm, stats force

Likelihood-ratio test                                 LR chi2(2)  =      7.28
(Assumption: oglm nested in gologit2)                 Prob > chi2 =    0.0263

       Model      Obs    ll(null)   ll(model)   df        AIC        BIC

        oglm     2293    -2995.77   -2830.256   11   5682.513   5745.626
    gologit2     2293    -2995.77   -2826.618   13   5679.236   5753.825

Note: N=Obs used in calculating BIC; see [R] BIC note

The models are not nested, but nonetheless we can note that the gologit2 model produces a larger model chi-squared (338.30 versus 331.03) but at the cost of 2 degrees of freedom. The Bayesian information criterion statistic favors the oglm model, while the Akaike information criterion statistic leans slightly towards the gologit2 model. Additional analyses (not shown) reveal that the predicted probabilities and marginal effects for each model are very similar. Ergo, from a purely empirical standpoint, there is little reason for preferring one model over the other, and either clearly fits better than the ordered logit model. However, from a substantive standpoint, the simplicity of the oglm model and the insights about differences in variability across time and gender that are gained by adding only two parameters to the ordered logit model may be highly appealing.

There is no guarantee that other examples will show an equally tight race between the gologit2 and oglm models, and ultimately theoretical concerns should guide the choice between the two. Nonetheless, this example illustrates that when the assumptions of the ordered logit model are violated, researchers may want to at least consider the possibility that a heterogeneous choice model is warranted.

4.4 Example 4: A trivial change with seemingly nontrivial implications

In many types of analyses, it often makes little difference whether z tests or Wald tests or likelihood-ratio chi-squared tests are used to test hypotheses about individual coefficients. It is important to realize that this is often not the case with heterogeneous choice models. In particular, seemingly trivial changes in the coding of variables used in the variance equation can change the hypotheses that z tests or Wald tests of coefficients in the choice equation address. In brief, z tests of individual coefficients in the choice equation are conditional on the coding of the variables in the variance equation, while likelihood-ratio tests are not.

To illustrate this, we now present a seemingly innocuous change to Allison's model that was presented in example 1. Instead of using the variable female (coded 1 if female, 0 if male), we use male (coded 1 if male, 0 if female). Most people would probably expect that such a trivial change would have no meaningful impact on the model—but the actual results seem to suggest otherwise:

. * As before, use female in the equations

. quietly oglm tenure year yearsq select articles prestige female,
>     hetero(female) store(oglm_f)

. * Now use male instead

. quietly oglm tenure year yearsq select articles prestige male, hetero(male)
>     store(oglm_m)

. * Do females only logit model again, using oglm

. quietly oglm tenure year yearsq select articles prestige if female,
>     store(females)

. * Do males only logit model again, using oglm

. quietly oglm tenure year yearsq select articles prestige if male,
>     store(males)


. esttab oglm_f oglm_m males females, stats(N ll chi2 df_m) mtitle

                      (1)             (2)             (3)             (4)
                   oglm_f          oglm_m           males         females
tenure
year                1.910***        1.411***        1.909***        1.408***
                   (9.56)          (7.17)          (8.92)          (5.47)

yearsq             -0.140***       -0.103***       -0.143***       -0.0956***
                  (-8.24)         (-6.68)         (-7.70)         (-4.36)

select              0.182***        0.134***        0.216***        0.0551
                   (3.45)          (3.41)          (3.51)          (0.77)

articles           0.0635***       0.0470***       0.0737***        0.0340**
                   (6.22)          (5.80)          (6.37)          (2.69)

prestige           -0.446***       -0.330***       -0.431***        -0.371*
                  (-4.60)         (-4.07)         (-3.96)         (-2.38)

female             -0.939*
                  (-2.53)

male                                0.694***
                                   (3.69)

lnsigma
female              0.302*
                   (2.07)

male                               -0.302*
                                  (-2.07)

cut1
_cons               7.491***        6.231***        7.680***        5.842***
                  (11.36)         (10.04)         (11.27)          (6.75)

N                    2797            2797            1741            1056
ll                 -836.3          -836.3          -526.5          -306.2
chi2                413.1           413.1           302.4           114.6
df_m                    7               7               5               5

t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001

Comparing the first two models, as we would expect, the log likelihoods, model chi-squared, and degrees of freedom are all the same. Also as we would expect, in the variance equations, the coefficient for male is opposite in sign to the coefficient for female. Perhaps surprisingly, however, all the coefficients in the choice equations are different, as are the z-values. Note, too, that the coefficients in the first model (where males are coded 0) are similar to the coefficients in the males-only model 3. The same is true for the second model, which uses the variable male (so that females are coded 0), and the last model, for females only.

Why does this occur, and what should be done about it? This situation is very similar to the one that occurs when a regression model includes both main effects and interaction effects. For example, if a model includes x1, x2, and x1 × x2, then the coefficient for x1 reflects the effect of x1 when x2 equals zero. Further, the t- or z-value for x1 tests whether the effect of x1 differs from zero when x2 = 0; even if the effect of x1 is insignificant when x2 = 0, it may be significant for other values of x2.

Put another way, we can think of the coefficients in the choice equation as being the coefficients for a group where σ = 1, and hence the log of σ = 0. The log of σ will equal 0 when all the variables in the variance equation have a value of zero. The reported z-values in the choice equation, then, are tests of whether or not the effect of a variable differs from zero for a group that has a value of zero for all variables in the variance equation. That is, the tests are conditional on the values of the variables in the variance equation, and a different set of values would yield different conditional tests. The z-values are not global tests of whether the inclusion of a variable does or does not significantly improve overall model fit.

A very important implication of the explanation above is that z-values and Wald tests should generally not be relied on for hypothesis testing involving variables in the choice equation. At the very least, researchers who use them need to be clear on what hypotheses are being tested. As the examples show, the z-values in the choice equation are not invariant across arbitrary changes in the coding of the variance equation variables; for example, the z-value for prestige is −4.60 when female is used in the model but only −4.07 when male is used instead.12 Particularly in borderline situations, such differences could lead to different conclusions as to whether the effect of a variable was statistically significant.

Luckily, likelihood-ratio tests of individual coefficients do not have this problem. They can test whether the inclusion of a variable in the choice equation does or does not significantly improve model fit, and they are not conditional on the coding of the variables in the variance equation. To illustrate this point, we will conduct likelihood-ratio tests for the effect of prestige, first using female and then male in the models.

. * Test prestige under the male versus female models

. * Female is in the model:

. quietly oglm tenure (year yearsq select articles female), hetero(female)
>     store(f1)

. quietly oglm tenure (year yearsq select articles female prestige),
>     hetero(female) store(f2)

. lrtest f1 f2, stats

Likelihood-ratio test                                 LR chi2(1)  =     22.34
(Assumption: f1 nested in f2)                         Prob > chi2 =    0.0000

       Model      Obs    ll(null)   ll(model)   df        AIC        BIC

          f1     2797   -1042.828   -847.4507    7   1708.901   1750.456
          f2     2797   -1042.828   -836.2824    8   1688.565   1736.055

Note: N=Obs used in calculating BIC; see [R] BIC note

12. An additional complication with nestreg is that when Wald tests are used and a variable appears in both the choice and variance equations, both effects will be tested. When using the nestreg or stepwise prefix commands with oglm, it is strongly recommended that the lr (likelihood-ratio) option be specified.


. * Male is in the model:

. quietly oglm tenure (year yearsq select articles male), hetero(male) store(m1)

. quietly oglm tenure (year yearsq select articles male prestige), hetero(male)
>     store(m2)

. lrtest m1 m2, stats

Likelihood-ratio test                                 LR chi2(1)  =     22.34
(Assumption: m1 nested in m2)                         Prob > chi2 =    0.0000

       Model      Obs    ll(null)   ll(model)   df        AIC        BIC

          m1     2797   -1042.828   -847.4507    7   1708.901   1750.456
          m2     2797   -1042.828   -836.2824    8   1688.565   1736.055

Note: N=Obs used in calculating BIC; see [R] BIC note

We see that the likelihood-ratio tests give the same value (22.34) regardless of whether male or female is used in the model.

Another implication of these results is that researchers may want to code the variables in the variance equation so that zero is a substantively meaningful value. In the current examples, zero is meaningful in that it stands for one gender or the other. In other cases, however, zero may not even be a value that can occur in the data; for example, no one may have an IQ score of zero. In such instances, researchers may want to consider centering the variables in the variance equation (that is, subtracting the mean from each case) so that a score of 0 on the log of sigma reflects an “average” person. The coefficients in the choice equation will then tell you the effects of variables on an “average” person. Alternatively, the zero point might be chosen to represent some other meaningful value; for example, one could subtract 12 from years of education so that a score of 0 would stand for a high school graduate. Again, this recommendation is similar to those that are sometimes made for OLS regression models that include interaction effects. Such changes do not affect the fit of the model, but they may make it easier to interpret the results.
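As a small sketch of the centering advice (not from the article; it uses the attitudes data from example 3, and ed_c is a hypothetical centered copy of the education variable):

* center years of education so that lnsigma = 0 describes a respondent with
* average education, then use the centered copy in the variance equation
summarize ed, meanonly
generate double ed_c = ed - r(mean)
oglm warm yr89 male white age ed prst, hetero(yr89 male ed_c)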

4.5 Example 5: Using stepwise selection as a model building and diagnostic device

Stepwise selection procedures are often criticized for their atheoretical nature. As this example will show, however, stepwise selection can help to identify theoretically plausible alternative models that the researcher may wish to consider and can also be used as a diagnostic device even when the researcher does not want to ultimately present a heterogeneous choice model.

Stepwise selection of variables is easily done in Stata via the use of the stepwise prefix command. With oglm, stepwise selection can be used for either the choice or variance equation. To do stepwise selection for the variance equation, the flip option can be used to reverse the placement of the choice and variance equations in the command line. The variables in the choice equation can then be specified using the eq2() option.


Using the biochemist data and stepwise selection for the variance equation produces a somewhat different model than the one Allison proposed:

. stepwise, pe(.01) lr: oglm tenure female year yearsq select articles
>     prestige, eq2(female year yearsq select articles prestige) flip store(sw1)
LR test                        begin with empty model
p = 0.0000 <  0.0100  adding  articles

Heteroskedastic Ordered Logistic Regression       Number of obs   =       2797
                                                  LR chi2(7)      =     428.03
                                                  Prob > chi2     =     0.0000
Log likelihood = -828.81224                       Pseudo R2       =     0.2052

      tenure        Coef.   Std. Err.       z   P>|z|     [95% Conf. Interval]

tenure
      female    -.4179259   .1742084   -2.40   0.016      -.759368   -.0764838
        year     2.108752   .2486633    8.48   0.000      1.621381    2.596123
      yearsq    -.1542213   .0208579   -7.39   0.000     -.1951019   -.1133407
      select     .1744644   .0598623    2.91   0.004      .0571364    .2917925
    articles     .0628407   .0157851    3.98   0.000      .0319026    .0937789
    prestige     -.611869   .1307263   -4.68   0.000     -.8680877   -.3556502

lnsigma
    articles      .030149   .0091448    3.30   0.001      .0122256    .0480724

       /cut1     7.959556   .7637107   10.42   0.000      6.462711    9.456401

As the above output shows, in Allison's biochemist data, the only variable that enters into the variance equation using oglm's stepwise selection procedure is number of articles. A very plausible argument can be made for this: there may be little residual variability among biochemists with few articles (with most of them being denied tenure), but there may be much more variability among biochemists with more articles (having many articles may be a necessary but not sufficient condition for tenure). Hence, while heteroskedasticity may be a problem with these data, it may not be for the reasons first thought.

It is important to realize, however, that apparent problems with heteroskedasticity in a model may actually reflect other problems with the model specification: relevant variables may be omitted from the model; subgroup differences may be being ignored; and variables may need to be transformed in some way, for example, logged or squared. In the present example, the number of articles ranges from 0 to 73. It may be that, at some point, additional articles have less effect or even a negative effect on the likelihood of getting tenure (for example, if somebody has many articles but they are not that good).13 One simple way to address such a possibility is to add articles^2 to the model:

13. I thank Maarten Buis for suggesting that I consider adding terms for nonlinear effects to the model.


. generate articles2 = articles^2

. oglm tenure female year yearsq select articles articles2 prestige,
>     hetero(articles) store(sw2)

Heteroskedastic Ordered Logistic Regression       Number of obs   =       2797
                                                  LR chi2(8)      =     439.77
                                                  Prob > chi2     =     0.0000
Log likelihood = -822.94311                       Pseudo R2       =     0.2109

      tenure        Coef.   Std. Err.       z   P>|z|     [95% Conf. Interval]

tenure
      female    -.3470777   .1470053   -2.36   0.018     -.6352028   -.0589526
        year     1.764339   .2233363    7.90   0.000      1.326608     2.20207
      yearsq    -.1282567   .0182644   -7.02   0.000     -.1640543   -.0924591
      select     .1631087   .0503776    3.24   0.001      .0643704     .261847
    articles     .1481165   .0246791    6.00   0.000      .0997464    .1964865
   articles2     -.002716   .0008273   -3.28   0.001     -.0043374   -.0010945
    prestige    -.4909738   .1124811   -4.36   0.000     -.7114327    -.270515

lnsigma
    articles     .0081941    .009509    0.86   0.389     -.0104433    .0268315

       /cut1     7.375547    .680343   10.84   0.000      6.042099    8.708995

. lrtest sw1 sw2, stats

Likelihood-ratio test                                 LR chi2(1)  =     11.74
(Assumption: sw1 nested in sw2)                       Prob > chi2 =    0.0006

       Model      Obs    ll(null)   ll(model)   df        AIC        BIC

         sw1     2797   -1042.828   -828.8122    8   1673.624   1721.115
         sw2     2797   -1042.828   -822.9431    9   1663.886   1717.313

Note: N=Obs used in calculating BIC; see [R] BIC note

As we see, adding articles^2 significantly improves fit and makes the coefficient in the variance equation insignificant.14 Hence, even if the researcher does not want to use stepwise selection as a model-building device or does not want to present a heterogeneous choice model, he or she may still wish to use stepwise selection to diagnose potential problems in the model so they can then be addressed in other ways. Of course, researchers can also use theoretical reasons to identify those variables that might raise concerns about heteroskedasticity and specify the models themselves.

5 Other features of oglm

oglm has several other features that may make it useful to researchers. oglm supports multiple link functions, including logit (the default), probit, complementary log–log, log–log, and cauchit. Several special cases of ordinal generalized linear models can also be fit by oglm, including the parallel lines models of ologit and oprobit (where error variances are assumed to be homoskedastic), the heteroskedastic probit model of hetprob (where the dependent variable must be a dichotomy and the only link allowed is probit), the binomial generalized linear models of logit, probit, and cloglog (which also assume homoskedasticity), as well as similar models that are not otherwise fit by Stata. This makes oglm particularly useful for testing whether constraints on a model (for example, homoskedastic errors) are justified or for determining whether one link function is more appropriate for the data than are others.

14. A reviewer suggested that “rather than adding a squared term for productivity, either the square root of articles or the ln(articles + 0.5) are commonly used.” Inclusion of either of these terms also caused the variance coefficient to become insignificant. However, the overall fit of the model was better with articles^2.
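As a small sketch of the constraint-testing use mentioned above (not shown in the article; it refits two models on the biochemist data from the earlier examples and compares them with a likelihood-ratio test):

* is the homoskedasticity constraint of an ordinary (ordered) logit justified?
quietly oglm tenure year yearsq select articles prestige female, store(homosked)
quietly oglm tenure year yearsq select articles prestige female, ///
    hetero(female) store(hetsked)
lrtest homosked hetsked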

Other features of oglm include support for linear constraints, which makes it possible, for example, to impose and test the constraint that the effects of x1 and x2 are equal. oglm works with several prefix commands, including by, nestreg, xi, svy, and stepwise. oglm does not currently support factor variables and may or may not support other features that were added to Stata after version 9. Its predict command includes the ability to compute estimated probabilities. The actual values taken on by the dependent variable are irrelevant except that larger values are assumed to correspond to “higher” outcomes. As many as 20 outcomes are allowed. oglm was inspired by the SPSS PLUM routine but differs somewhat in its terminology and labeling of links.
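A minimal sketch of a linear constraint (the article notes that constraints are supported but does not show the syntax; the sketch assumes oglm accepts the standard constraints() option of Stata maximum likelihood estimators):

* constrain the choice-equation effects of select and articles to be equal
constraint 1 [tenure]select = [tenure]articles
oglm tenure year yearsq select articles prestige female, ///
    hetero(female) constraints(1)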

6 Acknowledgments

The documentation and source code for several Stata commands (for example, ologit_p) were major aids in developing the oglm documentation and in adding support for the predict command. Much of the code is adapted from Maximum Likelihood Estimation with Stata, Third Edition, by William Gould, Jeffrey Pitblado, and William Sribney (2006). SPSS's PLUM routine helped to inspire oglm and provided a means for double-checking the accuracy of the program. Joseph Hilbe, Mike Lacy, Maarten Buis, Glenn Hoetker, and Rory Wolfe provided stimulating comments on this article and on the development of oglm. Jeff Pitblado assisted with several difficult programming issues. J. Scott Long, Robert Hauser, and Megan Andrew provided access to the datasets used in these analyses. The 1973 occupational changes in a generation (OCG II) data (Blau et al. 1983) that Hauser and Andrew modified is made available by the Inter-University Consortium for Political and Social Research (2010). Brian Miller assisted with the analysis.

7 References

Allison, P. D. 1999. Comparing logit and probit coefficients across groups. Sociological Methods and Research 28: 186–208.

Amemiya, T. 1985. Advanced Econometrics. Cambridge: Harvard University Press.

Blau, P. M., O. D. Duncan, D. L. Featherman, and R. M. Hauser. 1983. Occupational changes in a generation, 1962 and 1973 (Computer file). Madison, WI: University of Wisconsin (producer). Ann Arbor, MI: Inter-University Consortium for Political and Social Research (distributor), 1994.


Duncan, O. D. 1975. Introduction to Structural Equation Models. New York: Academic Press.

Gould, W., J. Pitblado, and W. Sribney. 2006. Maximum Likelihood Estimation with Stata. 3rd ed. College Station, TX: Stata Press.

Hauser, R. M., and M. Andrew. 2006. Another look at the stratification of educational transitions: The logistic response model with partial proportionality constraints. Sociological Methodology 36: 1–26.

Hoetker, G. P. 2004. Confounded coefficients: Extending recent advances in the accurate comparison of logit and probit coefficients across groups. Working Paper. http://ssrn.com/abstract=609104.

Inter-University Consortium for Political and Social Research. 2010. Occupational changes in a generation, 1962 and 1973. http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/06162.

Jann, B. 2005. Making regression tables from stored estimates. Stata Journal 5: 288–308.

———. 2007. Making regression tables simplified. Stata Journal 7: 227–244.

Keele, L., and D. K. Park. 2006. Difficult choices: An evaluation of heterogeneous choice models. Working Paper. http://www.nd.edu/~rwilliam/oglm/ljk-021706.pdf.

Long, J. S., P. D. Allison, and R. McGinnis. 1993. Rank advancement in academic careers: Sex differences and the effects of productivity. American Sociological Review 58: 703–722.

Long, J. S., and J. Freese. 2006. Regression Models for Categorical Dependent Variables Using Stata. 2nd ed. College Station, TX: Stata Press.

Mare, R. D. 1980. Social background and school continuation decisions. Journal of the American Statistical Association 75: 295–305.

———. 2006. Response: Statistical models of educational stratification—Hauser and Andrew's models for school transitions. Sociological Methodology 36: 27–37.

Norusis, M. 2005. SPSS 13.0 Advanced Statistical Procedures Companion. Upper Saddle River, NJ: Prentice Hall.

SAS Institute. 2004. SAS/Stat 9.1 User’s Guide. Cary, NC: SAS Institute.

Williams, R. 2006. Generalized ordered logit/partial proportional odds models for ordinal dependent variables. Stata Journal 6: 58–82.

———. 2009. Using heterogeneous choice models to compare logit and probit coefficients across groups. Sociological Methods & Research 37: 531–559.


Wolfe, R., and W. Gould. 1998. sg76: An approximate likelihood-ratio test for ordinal response models. Stata Technical Bulletin 42: 24–27. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 199–204. College Station, TX: Stata Press.

Yatchew, A., and Z. Griliches. 1985. Specification error in probit models. Review of Economics and Statistics 67: 134–139.

About the author

Richard Williams is an associate professor and a former chairman of the Department of Sociology at the University of Notre Dame. His teaching and research interests include methods and statistics, demography, and urban sociology. His work has appeared in the American Sociological Review, Social Forces, Stata Journal, Social Problems, Demography, Sociology of Education, Journal of Urban Affairs, Cityscape, Journal of Marriage and Family, and Sociological Methods and Research. His recent research, which has been funded by grants from the Department of Housing and Urban Development and the National Science Foundation, focuses on the causes and consequences of inequality in American home ownership. He is a frequent contributor to Statalist.

The Stata Journal (2010) 10, Number 4, pp. 568–584

Frequentist q-values for multiple-test procedures

Roger B. Newson
National Heart and Lung Institute
Imperial College London
London, UK
[email protected]

Abstract. Multiple-test procedures are increasingly important as technology increases scientists' ability to make large numbers of multiple measurements, as they do in genome scans. Multiple-test procedures were originally defined to input a vector of input p-values and an uncorrected critical p-value, interpreted as a familywise error rate or a false discovery rate, and to output a corrected critical p-value and a discovery set, defined as the subset of input p-values that are at or below the corrected critical p-value. A range of multiple-test procedures is implemented using the smileplot package in Stata (Newson and the ALSPAC Study Team 2003, Stata Journal 3: 109–132; 2010, Stata Journal 10: 691–692). The qqvalue command uses an alternative formulation of multiple-test procedures, which is also used by the R function p.adjust. qqvalue inputs a variable of p-values and outputs a variable of q-values that are equal in each observation to the minimum familywise error rate or false discovery rate that would result in the inclusion of the corresponding p-value in the discovery set if the specified multiple-test procedure was applied to the full set of input p-values. Formulas and examples are presented.

Keywords: st0209, qqvalue, smileplot, multproc, p.adjust, R, multiple-test procedure, data mining, familywise error rate, false discovery rate, Bonferroni, Sidak, Holm, Holland, Copenhaver, Hochberg, Simes, Benjamini, Yekutieli

1 Introduction

Multiple-test procedures are one of the key themes in twenty-first-century biostatistics so far because technology gives scientists the power to measure unprecedented numbers of comparisons in genome scans, epigenome scans, and metabolome scans. A multiple-test procedure takes the following as input: a vector of p-values that corresponds to multiple comparisons testing multiple null hypotheses, and an uncorrected critical p-value, which is usually interpreted either as a maximum permissible familywise error rate (FWER) or as a maximum permissible false discovery rate (FDR). The multiple-test procedure outputs a corrected critical p-value that is used to define a discovery set as the subset of input p-values at or below the corrected critical p-value. A number of multiple-test procedures have been implemented in Stata using the smileplot package (Newson and the ALSPAC Study Team 2003, 2010).

© 2010 StataCorp LP   st0209


Frequentist multiple-test procedures are a generalization of the concept of confidence regions beyond scalar and even vector parameters to a set-valued parameter, namely, the set of null hypotheses that are true. If the input uncorrected critical p-value α ∈ (0, 1) is an FWER, then we can be 100(1 − α)% confident that all the null hypotheses in the discovery set are false. If the input uncorrected critical p-value α = β × γ is an FDR, then we can be 100(1 − β)% confident that over 100(1 − γ)% of the null hypotheses in the discovery set are false. Of course, the discovery set may be empty, in which case 100% of the null hypotheses in it are false.

Conventionally, a multiple-test procedure has been implemented by writing a program that inputs a vector of p-values and an uncorrected critical p-value and outputs a corrected critical p-value and a discovery set. The multproc command of the smileplot package introduced by Newson and the ALSPAC Study Team (2003) does just that.

The R function p.adjust (Smyth and the R Core Team 2010) uses an alternative way of implementing multiple-test procedures. This function inputs a vector of p-values and a specified multiple-test procedure. It outputs a new vector of q-values (parallel to the input vector), sometimes known as adjusted p-values. For each input p-value, the corresponding q-value is the lowest input uncorrected critical p-value (FWER or FDR) that would cause the input p-value to be included in the discovery set if the specified multiple-test procedure was applied to the full vector of p-values. This q-value may be one if there is no FWER or no FDR less than one for which the corresponding null hypothesis would be rejected.

The Stata qqvalue package is modeled broadly on the R function p.adjust; it generates q-values for an input variable of p-values and a specified multiple-test procedure. The name qqvalue originally stood for “quasi–q-value”, which was my initial choice of terminology and was intended to prevent confusion between the vector of adjusted p-values output by p.adjust and the scalar corrected critical p-value output by the multproc command of smileplot. The term q-value was originally introduced as an empirical Bayesian concept by Storey (2003), who aimed to control the positive FDR by estimating from the vector of input p-values the prior probability that a null hypothesis is true. The q-values calculated by p.adjust and qqvalue, by contrast, are the nearest frequentist equivalent of Storey's q-values. They are minimum FWERs or FDRs for rejection of individual input p-values, just as Storey's original q-values are minimum positive FDRs for rejection of individual input p-values. In view of this difference, I originally added the prefix “quasi–”, but was advised by Gordon Smyth (the author of p.adjust) that the prefix was not really necessary because it is now common to use the term q-value for the values computed by p.adjust. I therefore now conform to this usage but use the term “frequentist q-value” when making a distinction from the original Bayesian q-value.

The remainder of this article documents and details the qqvalue package. Section 2 documents the command itself. Section 3 presents and details the methods and formulas used. Section 4 gives some examples of the use of qqvalue in practice.


2 The qqvalue command

2.1 Syntax

qqvalue varname [if] [in] [, method(method) bestof(#) qvalue(newvar)
    npvalue(newvar) rank(newvar) svalue(newvar) rvalue(newvar) float fast]

where method is one of

    bonferroni | sidak | holm | holland | hochberg | simes | yekutieli

by varlist: can be used with qqvalue; see [D] by. If by varlist: is used, then all generated variables are calculated using the specified multiple-test procedure within each by-group defined by the variables in the varlist.
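For instance (a small sketch, not from the article, reusing the chromosome and p variables from the example in section 4.1), q-values can be computed separately within each chromosome:

bysort chromosome: qqvalue p, method(simes) qvalue(qq_bychr)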

2.2 Description

qqvalue is similar to the R function p.adjust. It inputs a single variable, assumed to contain p-values calculated for multiple comparisons, in a dataset with one observation per comparison. It outputs a new variable—calculated by inverting a multiple-test procedure specified by the user—containing the q-values corresponding to these p-values. Each q-value represents, for each corresponding p-value, the minimum uncorrected p-value threshold for which that p-value would be in the discovery set, assuming that the specified multiple-test procedure was used on the same set of input p-values to generate a corrected p-value threshold. These minimum uncorrected p-value thresholds may represent FWERs or FDRs, depending on the procedure used. qqvalue's options may be used to output other variables that contain the various intermediate results used in calculating the q-values. The multiple-test procedures available for qqvalue are a subset of those available using the multproc command of the smileplot package (Newson and the ALSPAC Study Team 2010).

2.3 Options

method(method) specifies the multiple-test procedure method to be used for calculating the q-values from the input p-values. The method may be bonferroni, sidak, holm, holland, hochberg, simes, or yekutieli. These method names specify that the q-values will be calculated from the input p-values by inverting the multiple-test procedure specified by the method() option of the same name for the multproc command of the smileplot package (Newson and the ALSPAC Study Team 2010). The default is method(bonferroni).

bestof(#) specifies an integer. If the bestof() option is specified and # is greater than the number of input p-values, then the q-values are calculated assuming that the input p-values are a subset (usually the smallest number of input p-values) of a superset of p-values. If the method() option specifies a one-step method (such as bonferroni or sidak), then the q-values do not depend on the other p-values in the superset, but only on the number of p-values in the superset. If the method() option specifies a step-down method (such as holm or holland), then it is assumed that all the other p-values in the superset are greater than the largest of the input p-values. If the method() option specifies a step-up method (such as hochberg, simes, or yekutieli), then it is assumed that all the other p-values in the superset are equal to one, which implies that the q-values will be conservative and will define an upper bound to the respective q-values that would have been calculated if we knew the other p-values in the superset. If bestof() is unspecified (or nonpositive), then the input p-values are assumed to be the full set of p-values calculated. The bestof() option is useful if the input p-values are known (or suspected) to be the smallest of a greater set of p-values that we do not know. This often happens if the input p-values are from a genome scan reported in the literature.

qvalue(newvar) specifies the name of a new output variable containing the q-values calculated from the input p-values. The new output variable is generated using the multiple-test procedure specified by the method() option.

npvalue(newvar) specifies the name of a new output variable to be generated. It contains in each observation the total number of p-values in the sample of observations specified by the if and in qualifiers or in the by-group containing that observation if the by: prefix is specified.

rank(newvar) is the name of a new variable to be generated. It contains in each observation the rank of the corresponding p-value from the lowest to the highest. Tied p-values are ranked according to their position in the input dataset. If the by: prefix is specified, then the ranks are defined within the by-group.

svalue(newvar) specifies the name of a new output variable to be generated, which contains the s-values calculated from the input p-values. The s-values are an intermediate result; they are calculated in the course of calculating the q-values and are used mainly for validation. They are calculated from the input p-values by inverting the formulas used for the rank-specific critical p-value thresholds, which are calculated by the multproc command of the smileplot package. These rank-specific p-value thresholds are returned in the generated variable specified by the critical() option of multproc. The s-values may be greater than one.

rvalue(newvar) specifies the name of a new output variable to be generated, which contains the r-values calculated from the input p-values. The r-values are an intermediate result; they are calculated in the course of calculating the q-values and are used mainly for validation. They are calculated from the s-values by truncating the s-values to a maximum of one. The q-values are calculated from the r-values using a procedure that is dependent on the multiple-test procedure specified by the method() option. If the multiple-test procedure is a one-step procedure (such as bonferroni or sidak), then the q-values are equal to the corresponding r-values. If the multiple-test procedure is a step-down procedure (such as holm or holland), then the q-value for each p-value is equal to the cumulative maximum of all the r-values corresponding to p-values of rank equal to or less than that p-value. If the multiple-test procedure is a step-up procedure (such as hochberg, simes, or yekutieli), then the q-value for each p-value is equal to the cumulative minimum of all the r-values corresponding to p-values of rank equal to or greater than that p-value.

float specifies that the output variables specified by the qvalue(), rvalue(), and svalue() options be created as variables of type float. If float is absent, then these variables are created as variables of type double. Whether or not float is specified, all generated variables are stored to the lowest precision possible without loss of information.

fast is an option for programmers. It specifies that qqvalue will not take any action to restore the original data in the event of failure or if the user presses Break.

3 Methods and formulas

The methods used are a development of those used by the multproc command of the smileplot package, which is documented in Newson and the ALSPAC Study Team (2003, 2010). I will therefore use a notation that is as consistent as possible with that source. I will use uppercase and lowercase symbols to denote different quantities and to reduce confusion in readers who refer both to that article and to this article.

We assume that there is a sequence of m distinct parameters θ1, . . . , θm, estimated using estimates θ̂1, . . . , θ̂m and having the values θ1(0), . . . , θm(0) under their respective null hypotheses. Typically, θi(0) is zero for difference parameters such as median differences or is one for ratio parameters such as median ratios. We denote by P1, . . . , Pm the observed p-values for testing the m null hypotheses. Each Pi has the property that if 0 ≤ α ≤ 1, then

    Pr( Pi ≤ α | θi = θi(0) ) ≤ α

We denote by R1, . . . , Rm the ranks (in ascending order) of P1, . . . , Pm and denote by Q1, . . . , Qm the p-values in ascending order so that for each i, QRi = Pi. (The Qi are not the q-values, which we will define in due course.)

The methods used by the multproc command of the smileplot package aim to define a credible (or acceptable) subset of indices C ⊆ {1, . . . , m} such that the null hypotheses (θi = θi(0) : i ∈ C) are acceptable and the complementary set of null hypotheses (θi = θi(0) : i ∉ C) are rejected. This is done by defining an uncorrected p-value threshold, punc; calculating a corrected p-value threshold, pcor, from punc and Q1, . . . , Qm; and defining the acceptable subset C to be the subset of indices i such that Pi > pcor. The methods used by qqvalue, by contrast, are derived by inverting the methods used by multproc, because they start from an individual input p-value and derive the minimum uncorrected p-value threshold which, if used, would have made the corrected p-value threshold at least as large as the individual input p-value.


The multiple-test procedures used by qqvalue and selected using the method() option are a subset of those used by multproc. They are listed in table 1 and classified in three ways: the form of the algorithm used (one-step, step-down, or step-up), the interpretation of the uncorrected overall critical p-value (FWER or FDR), and the correlation assumed between the Pi (independence, nonnegative, or arbitrary).

Table 1. Multiple-test procedures specified by the method() option of qqvalue

    method()      Step type    FWER/FDR    Correlation assumed

    bonferroni    one-step     FWER        arbitrary
    sidak         one-step     FWER        nonnegative
    holm          step-down    FWER        arbitrary
    holland       step-down    FWER        nonnegative
    hochberg      step-up      FWER        independence
    simes         step-up      FDR         nonnegative
    yekutieli     step-up      FDR         arbitrary

3.1 Formulas for one-step, step-down, and step-up methods

The formulas used by multproc are given in Newson and the ALSPAC Study Team (2003, section 3.1). Each of the methods of multproc works by specifying a nondecreasing sequence of individual critical p-values c1, . . . , cm, which correspond to the ordered input p-values Q1, . . . , Qm. The formulas used by each method for deriving these thresholds ci as functions of punc, i, and m are listed in that subsection.

Once these ci are specified, each multproc method selects an overall corrected critical p-value, pcor, from the ci in one of three ways, namely, one-step, step-down, or step-up. In the one-step case, the ci are all equal to a common value, pcor, defined in a way that is not dependent on i. In the step-down case, pcor is set to the minimum ci such that Qi > ci if such a ci exists or to the maximum critical p-value cm otherwise. In the step-up case, pcor is set to the maximum ci such that Qi ≤ ci if such a ci exists or to the minimum critical p-value c1 otherwise.

The q-values computed by qqvalue are derived by inverting the formulas of multproc. The technique can be summarized in the phrase “sorted p-values generate s-values generate r-values generate q-values”. For each given method, this technique is executed in three steps:

1. Invert the formula used for calculating ci as a function of punc to give a formula for calculating punc as a function of ci. If we substitute the sorted p-value Qi for ci in this formula, then the result will be denoted si. si will be expressed on an uncorrected p-value scale but may be one or greater if no FWER or FDR less than one will generate a threshold ci ≥ Qi.

2. Define ri = min(si, 1) as the minimum uncorrected critical p-value that generates a threshold that Qi can pass below. If we are willing to live with a FWER or FDR of 1, at which 100% of discoveries may be false, then any p-value may be included in the discovery set.

3. Define the set of q-values qi from the set of r-values ri, using a formula that depends on whether the procedure is one-step, step-down, or step-up. For a one-step procedure, this formula is

        qi = ri                            (1)

   For a step-down procedure, it is

        qi = max(rj : j ≤ i)               (2)

   For a step-up procedure, it is

        qi = min(rj : j ≥ i)               (3)

For each i, qi will then be the q-value corresponding to the sorted p-value Qi. Therefore, for each i, the q-value corresponding to Pi will be qRi.

The formulas for deriving the si from the Qi are derived by inverting a subset of those in Newson and the ALSPAC Study Team (2003, section 3.1). They are given as follows, together with references for the original multiple-test procedures:

One-step methods

1. bonferroni

        si = m Qi

2. sidak (Sidak 1967)

        si = 1 − (1 − Qi)^m

Step-down methods

1. holm (Holm 1979)

        si = (m − i + 1) Qi

2. holland (Holland and Copenhaver 1987)

        si = 1 − (1 − Qi)^(m − i + 1)

Step-up methods

1. hochberg (Hochberg 1988)

        si = (m − i + 1) Qi

   The si are the same as those for the step-down Holm method.

2. simes (Simes [1986]; Benjamini and Hochberg [1995]; Benjamini and Yekutieli [2001, first method])

        si = (m/i) Qi

3. yekutieli (Benjamini and Yekutieli [2001, second method])

        si = (m/i) Qi Σ(j=1 to m) 1/j

All these expressions for si are increasing in Qi and increasing in m and nonincreasing (or constant in the case of one-step procedures) in i. The corresponding expressions for ri = min(si, 1) will therefore be nondecreasing in Qi and in m, and will be nonincreasing in i.
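To make the recipe concrete, the following sketch (for illustration only; qqvalue does all of this internally) reproduces Holm q-values by hand from a variable p of input p-values:

* sorted p-values generate s-values generate r-values generate q-values
sort p
generate double s = (_N - _n + 1) * p        // Holm s-values, si = (m - i + 1)Qi
generate double r = min(s, 1)                // r-values: truncate at 1
generate double q = r
replace q = max(q, q[_n - 1]) if _n > 1      // step-down: cumulative maximum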

3.2 Incomplete sets of input p-values

We have assumed so far that the variable input to qqvalue contains the full set of p-values from a project. In practice, this may not be the case. Scientists who report genome scans frequently give only a short list of those associations with the lowest k < m p-values and do not report the rest (and so do scientists in other fields, who are less likely to admit it). Readers are then left with the problem of how much confidence to have in their “discoveries”.

Fortunately, reports of genome scans usually contain an indication of how many associations were really measured. (Unfortunately, this is usually not the case in many other fields.) This can be helpful, given the formulas of the previous subsection. Formulas (1), (2), and (3) imply that for each sorted p-value, Qi, the corresponding q-value, qi, depends only on Qi in the case of one-step procedures, depends on p-values equal to or less than Qi in the case of step-down procedures, and depends on p-values equal to or greater than Qi in the case of step-up procedures. This statement implies that q-values can be computed for any subset of p-values in the case of one-step procedures or for the lowest k p-values in the case of step-down procedures without knowing the other p-values. In the case of step-up procedures (which are usually more powerful), life is less simple. However, even in this case, (3) implies that we can still compute conservative estimates of the q-values for the lowest k p-values, which are guaranteed to be upper bounds for the corresponding true q-values, by assuming (conservatively) that all the other p-values in the full set are equal to one.

The bestof() option of qqvalue allows us to compute conservative q-values for an input variable containing a subset of k p-values by supplying the number m of p-values present in the full set. These conservative q-values will be correct for any subset of k p-values in the case of one-step procedures, correct for the lowest k p-values in the case of step-down procedures, and conservative for the lowest k p-values in the case of step-up procedures. We therefore may be able to show that we can be confident in a list of the highlights of a genome scan as long as we know how large the genome scan was.
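For example (a small sketch with hypothetical numbers), if the dataset holds only the smallest p-values reported from a scan of 500,000 tests, conservative Simes q-values can be requested as

qqvalue p, method(simes) bestof(500000) qvalue(q_cons)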


3.3 q-values versus discovery sets

A long list of multiple-test procedures was implemented in Stata using the smileplot package of Newson and the ALSPAC Study Team (2003, 2010). This package implemented the procedures by generating scalar corrected critical p-values and corresponding discovery set indicator variables. Since then, R users, and now also Stata users, have gained the option of using some of the same procedures to generate q-values. What are the advantages of the two policies?

Multiple-test procedures were originally developed and justified in terms of discovery sets. This is especially the case with multiple-test procedures that control the FDR, such as those of Benjamini and Yekutieli (2001), which are implemented using the options method(simes) and method(yekutieli) of smileplot and qqvalue. The Simes procedure, in particular, has the advantageous property that the power to detect an effect of a given size does not necessarily tend to zero as the number of comparisons tends to infinity, in contrast to the case with most other multiple-test procedures (see Genovese and Wasserman [2002]). Discovery sets that are defined to control the FDR also have two very useful multiplicative properties:

• If we control the FDR at α = β × γ, then we can be 100(1 − β)% confident that over 100(1 − γ)% of the discovery set will correspond to false null hypotheses (see Newson and the ALSPAC Study Team [2003]).

• If we carry out a preliminary study to find a candidate discovery set (controlling the FDR at β) and then carry out a follow-up study on an independent set of subjects (containing only comparisons from that candidate discovery set and controlling the FDR at γ), then the “overall” FDR of the process generating the follow-up discovery set, prior to the preliminary study, is α = β × γ (see Benjamini and Yekutieli [2005]).

The first of these results specifies a trade-off between how confident we can be and how much we can be confident about. The second of these results specifies a similar trade-off between how conservative we need to be in the preliminary study and how conservative we need to be in the follow-up study. Both of these results are entirely evidence-based and objectivist-frequentist, and they are derived without using any authority-based subjectivist claims of having prior knowledge.

In view of these properties of discovery sets, my first impulse was to adopt a standard practice of defining a nested list of three discovery sets that correspond to FDRs of 0.25, 0.05, and 0.01; then to identify these discovery sets by adding one, two, or three stars to the p-value in the table of results; then to add three footnotes to the table, with one, two, and three stars, respectively; and finally to indicate the corrected p-value thresholds under the respective FDRs.

However, the list of FDRs adopted by our research group might not be the same as the lists of FDRs adopted by other research groups, and readers might prefer to have a common analog scale of significance for results from all research groups. Moreover, the second result seems to assume (implausibly) that scientists conform rigorously and inflexibly to a study plan to the point of defining FDR thresholds prior to the preliminary study and canceling the follow-up study if the discovery set from the preliminary study is empty. Furthermore, if we have an output variable of q-values, then we can define as many discovery sets as we like by selecting observations with q-values at or below our chosen FDRs. For these reasons, I would currently argue that q-values represent an advance on nested discovery sets and that qqvalue should probably supersede smileplot for most purposes.
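Defining such nested discovery sets from a q-value variable is a one-line operation. A small sketch (not from the article), using the Simes q-value variable qq generated in section 4.1:

* number of stars: discovery sets at FDRs of 0.25, 0.05, and 0.01
generate byte stars = (qq <= 0.25) + (qq <= 0.05) + (qq <= 0.01)
tabulate stars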

It should be stressed that the field of multiple-test procedures is currently in a state of rapid development and that there is not necessarily a consensus on the subject, even among statisticians.

4 Examples

qqvalue, like smileplot, requires an input dataset with one observation per parameter and also requires data on p-values (and possibly other attributes) for the parameters. In Stata, such datasets are typically created using the official Stata statsby command (see [D] statsby) or, alternatively, using the parmest package of Newson (2003). In our examples, we will assume that such a dataset (or resultsset) has been created and that it contains a variable containing the input p-values.
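A minimal sketch of building such a resultsset with statsby (hypothetical variable names y, x, and group; parmest offers a more general alternative):

* one regression per group, keeping the two-sided p-value for the slope on x
statsby p = (2 * ttail(e(df_r), abs(_b[x] / _se[x]))), by(group) clear: ///
    regress y x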

4.1 Epigenetic assay data in the ALSPAC study

The Avon Longitudinal Study of Parents and Children (ALSPAC) is a multipurpose birth cohort study based at Bristol University, England. The study involves over 14,000 pregnancies in the Avon area of England in the early 1990s, the children from which have been followed through childhood. For further information, refer to the study website at http://www.alspac.bris.ac.uk.

A nested pilot study in ALSPAC subjected the cord blood DNA of 174 subjects (69 girls and 105 boys) to methylation assays. DNA methylation levels (as percentages) were measured at 1,505 methylation sites in the human genome. A methylation site is a position in the genome where a single DNA base can be either methylated (typically implying that a gene is switched off) or unmethylated (typically implying that a gene is switched on). The science of gene switching, including methylation, is known as epigenetics. Each of the 1,505 methylation assays performed on cord blood samples measured the percent of all copies of the appropriate methylation site that were methylated. The methylation data were considered to be useful at 1,495 of these sites.

The methylation levels at these 1,495 sites were distributed non-normally in ways that varied greatly from site to site, being positively skewed at some sites, negatively skewed at other sites, bimodal at others, and semidiscrete at others, with a vast majority of zeros (indicating no methylation) and a small minority of positive values (indicating some methylation). There did not seem to be a unified model whose parameters we might fit to the data at all sites. I therefore decided to use the methods of Newson (2006b) and Newson (2006a) to generate confidence intervals and p-values for Somers' D and unequal-variance confidence intervals for Theil–Sen median slopes and Hodges–Lehmann median differences. These methods are all implemented using the somersd package (Newson 2006a,b).

As a preliminary analysis, I compared methylation levels at each of the 1,495 sites, between the 105 boys and the 69 girls. I used Somers' D and the Hodges–Lehmann median difference, which have distinct confidence intervals sharing a common p-value. Both of these parameters were restricted to comparisons within laboratory batches to remove the influence of batch effects. The estimates, confidence intervals, and p-values were stored in an output dataset (or resultsset) with one observation per methylation site.

q-values for the Simes procedure were then computed using the following Stata code:

. qqvalue p, method(simes) qvalue(qq)

. format qq %8.2g

. summarize p qq, detail

                            P-value
-------------------------------------------------------------
      Percentiles      Smallest
 1%     7.41e-11       3.43e-15
 5%     .0017592       6.52e-14
10%     .0732356       2.87e-13       Obs                1495
25%     .3035019       4.59e-13       Sum of Wgt.        1495

50%      .579294                      Mean           .5381529
                        Largest       Std. Dev.       .304321
75%     .7946141              1
90%     .9225728              1       Variance       .0926113
95%      .966077              1       Skewness      -.3103729
99%     .9948297              1       Kurtosis       1.889998

                   q-value by method(simes)
-------------------------------------------------------------
      Percentiles      Smallest
 1%     7.15e-09       5.13e-12
 5%      .035067       4.87e-11
10%     .7131457       1.43e-10       Obs                1495
25%            1       1.72e-10       Sum of Wgt.        1495

50%            1                      Mean           .9052502
                        Largest       Std. Dev.      .2553094
75%            1              1
90%            1              1       Variance       .0651829
95%            1              1       Skewness      -2.859704
99%            1              1       Kurtosis        9.78171

Most of the q-values are as high as 1, but some are tiny, which implies that the corresponding p-values would still be in the Simes discovery set even if the FDR was controlled very stringently.

I then plotted the q-values against the position of the corresponding methylation site in the human genome. The human genome has 22 nonsex chromosomes, numbered from 1 to 22, and 2 sex chromosomes, denoted X and Y. Each chromosome has a very long linear DNA sequence, and each methylation site has a position (or coordinate) on its chromosome. I therefore defined, for each methylation site on each of the chromosomes 1–22 and X, a relative position on a scale from 0 (for the first methylation site on the chromosome) to 100 (for the last methylation site on the chromosome). (There were no methylation sites on the Y chromosome.)

The integer variable denoting the chromosome for each methylation site had the variable name chromosome, and the continuous variable denoting the methylation site's relative position had the variable name mrelpos. To make the plot, we use the commands regaxis and logaxis, which are components of the regaxis package.1 The regaxis package is very useful in defining axis scales and tick positions, especially for variables such as p-values and q-values that are plotted on a log scale. The Stata code for making the plot is as follows:

. regaxis mrelpos, include(0 100) cycle(25) lticks(xlabs)

. logaxis qq, base(10) include(1) lrange(yrange) lticks(ylabs)
>     maxticks(12)

. scatter qq mrelpos, msize(2)
>     by(chrom, compact row(4) total)
>     xlabel(`xlabs', labsize(4) angle(270))
>     yaxis(1 2)
>     yscale(reverse log range(`yrange')) ylab(`ylabs', labsize(4) angle(0))
>     ylabel(0.05, axis(2) labsize(4) angle(0))
>     yline(0.05, lpattern(shortdash))
>     plotregion(marg(2 2 0.5 0))

1. The regaxis package can be downloaded from Statistical Software Components at http://econpapers.repec.org/scripts/search/search.asp?ft=regaxis.

[Figure 1: scatterplots of q-value by method(simes), on a reverse log scale with a reference line at 0.05, against the relative position of the methylation site on the chromosome; one panel per chromosome (1–22 and X) plus a Total panel. Graphs by chromosome of methylation site.]

Figure 1. q-values for boy–girl methylation differences at 1,495 sites

The result of this code is given in figure 1, which shows one panel for each of the 23 chromosomes plus one for all methylation sites on all chromosomes. The horizontal axis gives the relative position of the methylation site, and the vertical axis gives the corresponding q-value on a reverse log scale. We see that even allowing for multiple comparisons, there is a large number of statistically significant boy–girl differences in methylation, and that most (but not all) of these are on the X chromosome. This finding does not surprise epigeneticists because a girl has two X chromosomes per cell, of which one is inactivated by methylation, whereas a boy has only one X chromosome per cell, which is not inactivated.

As a comparison, we also used the multproc command of the smileplot package of Newson and the ALSPAC Study Team (2003, 2010) to define a Simes corrected critical p-value corresponding to an FDR of 0.05. We plotted the p-values of the methylation sites against their positions in the genome, with vertical-axis reference lines at the uncorrected and corrected critical p-values. The result is given as figure 2, which has vertical-axis reference lines at the uncorrected critical p-value of 0.05 and at the corrected critical p-value of 0.00254181. The message of the two figures is qualitatively similar. However, figure 1 is arguably more informative because there you can see at a glance the discovery set under any FDR, rather than the discovery set only at the FDR of 0.05.
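The two displays can also be cross-checked directly: for the Simes procedure, the sites with q-values at or below 0.05 should be the same sites as those with p-values at or below the corrected critical p-value (a hedged sketch using the values reported above):

* Cross-check (sketch): the q-value discovery set at an FDR of 0.05 should match
* the set of p-values at or below the corrected critical p-value from multproc
count if qq <= 0.05
local nq = r(N)
count if p <= 0.00254181
assert r(N) == `nq'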

[Figure 2: scatterplots of p-value, on a reverse log scale with reference lines at 0.05 and 0.0025, against the relative position of the methylation site on the chromosome; one panel per chromosome (1–22 and X) plus a Total panel. Graphs by chromosome of methylation site.]
Figure 2. p-values for boy–girl methylation differences at 1,495 sites

4.2 Polymorphisms associated with autism spectrum disorders

In Wang et al. (2009), several research groups combined their genome scan data on the association of autism spectrum disorders with a total of 486,864 single-nucleotide polymorphisms (SNPs). The highlight of their results was a subset of associations (with the lowest p-values) between autism spectrum disorders and six SNPs in the 5p14.1 region of chromosome 5. This region lies between two genes that encode the amino acid sequences of cadherin molecules, which seem to play a role in cell–cell adhesion during the formation of connections between neurons in the developing brain. The authors gave the p-values for these six most significant SNPs.

These p-values were entered into a Stata dataset with one observation for each of the six SNPs and the following variables: snp (the name of the SNP), position (position of the SNP on chromosome 5), alleles (the DNA bases of the more and less frequent alleles of the SNP), and pcomb (the p-value for the association, which was determined using combined data from all scans).

We use pcomb as the input variable for qqvalue, and we output three q-value variables that were generated using the option bestof(486864) and the method() options simes, yekutieli, and bonferroni, respectively. The Stata code and its output are as follows:

. qqvalue pcomb, method(simes) bestof(486864) qv(qqcomb1)

. qqvalue pcomb, method(yekutieli) bestof(486864) qv(qqcomb2)

. qqvalue pcomb, method(bonferroni) bestof(486864) qv(qqcomb3)

. format qqcomb1 qqcomb2 qqcomb3 %8.2g

. list, noobs

           snp   position   alleles      pcomb   qqcomb1   qqcomb2   qqcomb3

     rs4307059   26003460       C/T   2.10e-10     .0001     .0014     .0001
     rs7704909   25934678       C/T   9.90e-10    .00018     .0024    .00048
    rs12518194   25987318       G/A   1.10e-09    .00018     .0024    .00054
     rs4327572   26008578       T/C   2.70e-09    .00033     .0045     .0013
     rs1896731   25934776       C/T   4.80e-08     .0047      .064      .023
    rs10038113   25938100       C/T   7.40e-08      .006      .082      .036

We see that, although these six SNPs are the most significant of 486,864 investigated, their association with autistic spectrum disorders is still at least suggestive, even if we use the yekutieli or bonferroni methods, whose q-values are in the variables qqcomb2 and qqcomb3, respectively. The associations are even more impressive if we use the more powerful simes method, whose q-values are in the variable qqcomb1.

5 Acknowledgments

I would like to thank my Imperial College colleagues Professor Peter Burney, for suggesting that something like q-values might be a good idea, and Adaikalavan Ramasamy, for drawing my attention to the p.adjust package in R. In addition, I would like to thank Gordon Smyth of the Walter and Eliza Hall Institute of Medical Research, Victoria, Australia, for writing the current version of p.adjust, for some very helpful correspondence when I was certifying qqvalue, and for some equally helpful advice on the terminology to use.

I would also like to thank my collaborators in the ALSPAC Study Team (Institute of Child Health, University of Bristol, United Kingdom) for allowing the use of their data in this paper. The whole ALSPAC Study Team comprises interviewers, computer technicians, laboratory technicians, clerical workers, research scientists, volunteers, and managers who continue to make the study possible. The ALSPAC study could not have been undertaken without the cooperation and support of the mothers and midwives who took part or without the financial support of the Medical Research Council, the Department of Health, the Department of the Environment, the Wellcome Trust, and other funders. The ALSPAC study is part of the World Health Organization–initiated European Longitudinal Study of Pregnancy and Childhood. My own work at Imperial College London is financed by the United Kingdom Department of Health.

6 References

Benjamini, Y., and Y. Hochberg. 1995. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B (Methodological) 57: 289–300.

Benjamini, Y., and D. Yekutieli. 2001. The control of the false discovery rate in multiple testing under dependency. Annals of Statistics 29: 1165–1188. Also downloadable from Yoav Benjamini's website at http://www.math.tau.ac.il/~ybenja/.

———. 2005. Quantitative trait loci analysis using the false discovery rate. Genetics 171: 783–790.

Genovese, C., and L. Wasserman. 2002. Operating characteristics and extensions of the false discovery rate procedure. Journal of the Royal Statistical Society, Series B 64: 499–517.

Hochberg, Y. 1988. A sharper Bonferroni procedure for multiple tests of significance. Biometrika 75: 800–802.

Holland, B. S., and M. D. Copenhaver. 1987. An improved sequentially rejective Bonferroni test procedure. Biometrics 43: 417–423.

Holm, S. 1979. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6: 65–70.

Newson, R. 2003. Confidence intervals and p-values for delivery to the end user. Stata Journal 3: 245–269.

———. 2006a. Confidence intervals for rank statistics: Percentile slopes, differences, and ratios. Stata Journal 6: 497–520.

———. 2006b. Confidence intervals for rank statistics: Somers' D and extensions. Stata Journal 6: 309–334.

Newson, R., and the ALSPAC Study Team. 2003. Multiple-test procedures and smile plots. Stata Journal 3: 109–132.

———. 2010. Software Updates: st0035_1: Multiple-test procedures and smile plots. Stata Journal 10: 691–692.

Sidak, Z. 1967. Rectangular confidence regions for the means of multivariate normal distributions. Journal of the American Statistical Association 62: 626–633.

Simes, R. J. 1986. An improved Bonferroni procedure for multiple tests of significance. Biometrika 73: 751–754.

Smyth, G., and the R Core Team. 2010. p.adjust. Part of the R package stats. http://www.r-project.org/.

Storey, J. D. 2003. The positive false discovery rate: A Bayesian interpretation and the q-value. Annals of Statistics 31: 2013–2035.

Wang, K., H. Zhang, D. Ma, M. Bucan, J. T. Glessner, B. S. Abrahams, D. Salyakina, M. Imielinski, J. P. Bradfield, P. M. A. Sleiman, C. E. Kim, C. Hou, E. Frackelton, R. Chiavacci, N. Takahashi, T. Sakurai, E. Rappaport, C. M. Lajonchere, J. Munson, A. Estes, O. Korvatska, J. Piven, L. I. Sonnenblick, A. I. Alvarez-Retuerto, E. I. Herman, H. Dong, T. Hutman, M. Sigman, S. Ozonoff, A. Klin, T. Owley, J. A. Sweeney, C. W. Brune, R. M. Cantor, R. Bernier, J. R. Gilbert, M. L. Cuccaro, W. M. McMahon, J. Miller, M. W. State, T. H. Wassink, H. Coon, S. E. Levy, R. T. Schultz, J. I. Nurnberger, J. L. Haines, J. S. Sutcliffe, E. H. Cook, N. J. Minshew, J. D. Buxbaum, G. Dawson, S. F. A. Grant, D. H. Geschwind, M. A. Pericak-Vance, G. D. Schellenberg, and H. Hakonarson. 2009. Common genetic variants on 5p14.1 associate with autistic spectrum disorders. Nature 459: 528–533.

About the author

Roger B. Newson is a lecturer in medical statistics at Imperial College London, United Kingdom. He works principally in asthma research. He wrote the qqvalue, smileplot, parmest, somersd, and regaxis Stata packages.

The Stata Journal (2010) 10, Number 4, pp. 585–605

Making spatial analysis operational: Commands for generating spatial-effect variables in monadic and dyadic data

Eric Neumayer
Department of Geography and Environment
London School of Economics and Political Science
London, UK
Centre for the Study of Civil War
International Peace Research Institute
Oslo, Norway
[email protected]

Thomas Plumper
Department of Government
University of Essex
Colchester, UK
Centre for the Study of Civil War
International Peace Research Institute
Oslo, Norway
[email protected]

Abstract. Spatial dependence exists whenever the expected utility of one unit of analysis is affected by the decisions or behavior made by other units of analysis. Spatial dependence is ubiquitous in social relations and interactions. Yet, there are surprisingly few social science studies accounting for spatial dependence. This holds true for settings in which researchers use monadic data, where the unit of analysis is the individual unit, agent, or actor, and even more true for dyadic data settings, where the unit of analysis is the pair or dyad representing an interaction or a relation between two individual units, agents, or actors. Dyadic data offer more complex ways of modeling spatial-effect variables than do monadic data. The commands described in this article facilitate spatial analysis by providing an easy tool for generating, with one command line, spatial-effect variables for monadic contagion as well as for all possible forms of contagion in dyadic data.

Keywords: st0210, spspc, spundir, spmon, spdir, spagg, spatial dependence, spatial analysis, contagion, spatial lag, spatial error, monadic data, dyadic data

1 Introduction

Do you avoid taking the car during rush hours? If so, you understand the concept of spatial dependence, which in this case means that your choice of a means of transport or the choice of your time of travel is partly a function of other individuals' choices.

© 2010 StataCorp LP   st0210

More generally, spatial dependence exists whenever the expected utility of one unit of analysis is influenced by the choices of other units of analysis.

Spatial dependence is also of interest to biologists and other natural scientists, but it is for social scientists that its study and analysis is of greatest importance. To state that the social sciences are characterized by interdependence between the various units of analysis and thus by spatial dependence is almost a tautology. Social science is the study of social relations and interactions, so situations in which units are entirely unaffected by what other units do are likely to be rare.

Yet given the nature of its field of study, only a surprisingly small minority of social science research either actively seeks to analyze spatial dependence or at least to control for its effect. Part of the reason is, of course, that spatial econometrics is still a fairly young subdiscipline (properly starting only with Anselin's [1988] monograph from some twenty years ago) and that it takes time for new methods and advice on specification issues to penetrate mainstream social science research. Another reason is that many applied researchers may find it difficult, particularly for dyadic data, to create the spatial-effect variables required for modeling spatial dependence. It is here that the commands described in this article facilitate spatial analysis by providing an easy tool for the generation of spatial-effect variables in both monadic and dyadic data.

We start in section 2 by briefly discussing the importance of spatial dependence for the social sciences and contrasting this with the relatively minor role that relevant studies play in published research. In section 3, we provide an overview of the three types of spatial dependence and the appropriate models for analyzing them—namely, spatial lag (spatial autoregressive) models, spatial-x models, and spatial-error models. Whatever the model, spatial-effect variables need to be created.

The modeling options open to researchers in specifying spatial-effect variables differ greatly between monadic and dyadic data. Spatial effects in monadic data (that is, where the unit of analysis is a single unit, actor, or agent) are discussed in section 4. In monadic data, spatial dependence always emanates from other units. A more detailed discussion is given in section 5 for the more complex specification of spatial-effect variables in dyadic data (that is, where the unit of analysis is a dyad or pair representing an interaction or a relation between two units, actors, or agents). Here spatial dependence can emanate from all other dyads, but also from merely one part of other dyads and from either their aggregate behavior relating to all dyads or their specific behavior relating to only the dyad under observation. There are thus many more modeling options available in dyadic data.

In section 6, we describe a technique for generating spatial-effect variables for dyadic data. It allows researchers to work from a standard dyadic dataset, obviating the need to construct a 4-adic dataset that would connect dyads with dyads. Section 7 provides detailed information on the Stata commands that generate the various spatial-effect variables in monadic and dyadic data.

2 Spatial dependence in the social sciences

Spatial dependence is a common, albeit often neglected, part of social interaction. From a theoretical perspective, spatial dependence can result from coercion, competition, externalities, learning, or emulation (Simmons and Elkins 2004; Elkins and Simmons 2005; Franzese and Hays 2010). Units of analysis—call them agents—change their behavior because others pressurize them (Levi-Faur 2005), because they need to find a competitive advantage (Basinger and Hallerberg 2004), because the strategies carried out by other agents affect the payoffs they generate from their own behavior (Genschel and Plumper 1997; Simmons and Elkins 2004; Franzese and Hays 2006; Plumper and Troeger 2008), because agents learn that other strategies proved to be more successful (Mooney 2001; Meseguer 2005), or because they want to mimic the behavior of others (Weyland 2005). As a consequence, all social science studies in which agents' strategies are partly dependent on the strategies chosen by other agents need to account for spatial dependence.

Existing analyses of spatial dependence are usually motivated by studying one or more of the mechanisms mentioned above that cause dependence among agents. It is important to note, however, that spatial dependence is also likely to exist when researchers do not have a direct theoretical interest in analyzing it. Not controlling for existing spatial effects causes omitted variable bias just as it is caused by the exclusion of any other variable that is correlated with at least one regressor and the dependent variable (Franzese and Hays 2010). Empirical analyses in the social sciences should therefore control for spatial dependence almost as frequently as social scientists nowadays control for temporal dependence—that is, for the impact that the prior behavior of a unit of analysis has on its present behavior.

Surprisingly, however, the number of articles referenced in the Social Sciences Citation Index with either the term "spatial analysis" or "spatial dependence" in the title is very small, albeit slightly increasing over time; see figure 1.


Figure 1. The number of articles in Social Sciences Citation Index journals with "spatial analysis" (light gray) or "spatial dependence" (dark gray) in the title in the years 1990–2009

Naturally, there will be many studies that study spatial dependence but do not include either "spatial analysis" or "spatial dependence" in the title. On the other hand, there will be some studies containing either term in the title without actually analyzing or modeling spatial dependence as further defined below. For example, there will be some studies dealing merely with the detection of spatial association and correlation in the data with the help of Moran's I statistic or similar. Such measurement error notwithstanding, the general picture certainly holds true: As yet, spatial analyses are still confined to a small minority of studies. In addition, these spatial analyses are concentrated in only a handful of areas of the social sciences: demography,1 health science2 (especially epidemiology3), and geographic information system–based research4 in geography. We also find a few articles in political science,5 political economy,6 economics,7 and geography.8 Spatial analyses may have become more common over the last years, but given the underlying logic of social science, it seems fair to say that they are not yet common enough. The commands presented here facilitate the generation of spatial-effect variables, thus rendering it easier for researchers to study or at least control for spatial dependence.

1. See, for example, Schmertmann, Potter, and Cavenaghi 2008; Chi and Zhu 2008; and Crews and Peralvo 2008.
2. For example, Crighton et al. 2007, and Kandala and Ghilagaber 2006.
3. For example, Atanaka-Santos, Souza-Santos, and Czeresnia 2007.
4. For example, Alix-Garcia 2007, and Gray and Shadbegian 2007.
5. For example, Neumayer and Plumper 2010a.
6. For example, Plumper, Troeger, and Winner 2009; Hays 2009; and Garrett, Wagner, and Wheelock 2005.
7. For example, Kosfeld and Dreger 2006, and Rice, Venables, and Patacchini 2006.
8. For example, Perkins and Neumayer 2010, and Perkins and Neumayer forthcoming.

3 Types of spatial dependence

One can distinguish three types of spatial dependence that call for three types of spatial models. In the first type of spatial dependence, the dependent variable in other units of analysis exerts an influence on the dependent variable in the unit under observation. For example, active labor-market policies in other countries (negatively) influence active labor-market policies in the country under observation because such policies generate positive externalities not captured by the country implementing the policy (Franzese and Hays 2006). The estimation model required to deal with this effect is commonly called a spatial lag model (Franzese and Hays 2007) or a spatial autoregressive model (Anselin 1988). In such models, the spatial-effect variable consists of the weighted values of the dependent variable in other units—that is, of the spatially lagged dependent variable. In scalar notation, the spatial lag model or spatial autoregressive model is formally specified in its simplest form and for monadic data as follows:

y_{it} = α + ρ ∑_{k} w_{ikt} y_{kt} + β X_{it} + ε_{it}        (1)

where i, k = 1, 2, . . . , N denotes the (monadic) unit of observation; t = 1, 2, . . . , T is time; X_{it} is a set of explanatory variables that may include the temporally lagged dependent variable, unit fixed effects, and period-specific time dummies; and ε_{it} is an independent and identically distributed error term. The spatial autoregression parameter ρ gives the impact of the spatial-effect variable, the spatial lag ∑_{k} w_{ikt} y_{kt}, on the dependent variable y_{it}. The spatial lag itself is the product of two elements. The first element, an N × N × T block-diagonal spatial weighting matrix, measures the relative connectivity between N number of units i and N number of units k in T number of time periods in the off-diagonal cells of the matrix.9 The second element is an N × T matrix of the value of the dependent variable.10

9. The diagonal of the matrix has values of zero because i = k and units cannot spatially depend on themselves.

10. The spatial lag could also be temporally lagged.
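To make the construction concrete, the spatial lag in (1) can also be computed by hand from a dyadic dataset; the following is a minimal sketch in which all variable names are assumed and are not taken from the article (the spmon command described in section 7 automates this step):

* Sketch: dyadic dataset with variables i, k, year, w (connectivity from i
* to k), and y_k (dependent variable of unit k); dyads with k == i are
* assumed absent or to carry a zero weight
generate double wy = w * y_k
collapse (sum) wy w, by(i year)
generate double splag = wy / w   // row-standardized; omit the division for the raw sum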

In the second, rarely analyzed type of spatial dependence, some independent variable of other units affects the dependent variable in the unit under observation. For example, support of terrorist groups by other countries can affect the foreign policy (for example, military spending, alliance formation, and so on) of the country under observation. We call the estimation model required to analyze this type of dependence a spatial-x model. In such models, the spatial-effect variable consists of the (weighted) values of one or more independent explanatory variables in all other units:

y_{it} = α + ρ ∑_{k} w_{ikt} x_{kt} + β X_{it} + ε_{it}

Finally, there is a third type of spatial dependence, in which the error processes are systematically correlated across units of observation. To some extent, this type of dependence will simply be the consequence of failing to adequately model one or both of the other types of dependence: If, say, the dependent variable of other units affects the dependent variable of the unit under observation and this fact is not accounted for, then the error processes will be systematically correlated across units of observation. In fact, researchers will sometimes relegate spatial dependence to the error term for the sake of convenience, despite knowing that the correlated errors are the result of failing to model spatial dependence in the dependent or independent variables. However, there are also factors that can genuinely lead to this third type of spatial dependence. For example, Galton (1889) famously argued that common behavioral patterns across tribes and societies may well be the result of common descent, not of emulation or learning, which would suggest that the spatial correlation in the residuals is best modeled via a spatial-error model.

Spatial-error models account for spatial dependence in the error term, which consists of at least two parts: one is an independent and identically distributed spatially uncorrelated component ε_{it}, and the other is a spatial component ρ ∑_{k} w_{ikt} u_{kt}. The model to be fit is thus

y_{it} = α + β X_{it} + ε_{it} + ρ ∑_{k} w_{ikt} u_{kt}

Not controlling for correlated errors violates the Gauss–Markov assumptions and thus leads to spatial heteroskedasticity. As a consequence, estimates are inconsistent.

The models of spatial dependence can also be combined. Combining the spatial lag model or spatial autoregressive model with the spatial-x model leads to what Anselin (1988, 111) and LeSage and Pace (2009, 32) call a spatial Durbin model. Combining the spatial lag model or spatial autoregressive model with the spatial-error model leads to what Anselin (1988, 36) calls a mixed-regressive spatial autoregressive model with a spatial autoregressive disturbance.

When fitting spatial lag models or spatial autoregressive models, researchers have to deal with an obvious endogeneity problem: When units k affect unit i, the odds are that unit i also affects units k; thus y_{k} → y_{i} → y_{k} → · · · , where the arrows represent an influence.11 In spatial-x models, endogeneity may also occur if there is feedback from the dependent variable on the spatially lagged independent variable. In this case, x_{j} → y_{i} → x_{i} → y_{j} → x_{j} → · · · . Franzese and Hays (2007) show that fitting such models with simple ordinary least squares, what they call spatial-ordinary least squares, does not suffer much from simultaneity bias if the strength of interdependence, ρ, remains modest. In all other cases, researchers need to appropriately account for the endogeneity in the variance–covariance matrix. They can do so by either instrumenting the endogenous spatial-effect variable, which Kelejian and Prucha (1998) and Franzese and Hays (2007) call spatial two-stage least squares (2SLS), or by using spatial maximum-likelihood (spatial ML) models. Maximum likelihood models and appropriate software exist now for an increasing number of estimators (see, for example, Ward and Gleditsch [2008, appendix A] and LeSage and Pace [2009]).

11. Endogeneity will be absent only if units exclusively depend on other units on which they do not exert an effect in turn, but this constellation is likely to represent the exception rather than the rule. Endogeneity is thus likely to be present in the vast majority of spatial lag models.
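As a minimal sketch of the spatial-2SLS idea (not the authors' code; all variable names are assumed), the endogenous spatial lag of the dependent variable can be instrumented with the spatial lag of an exogenous regressor:

* sp_y is the spatial lag of y, sp_x the spatial lag of an exogenous
* regressor x; both would be generated with the commands described below
ivregress 2sls y x (sp_y = sp_x), vce(robust)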

4 Spatial-effect variables in monadic data

In all three variants of spatial analyses, researchers need to create a spatial-effect variable that consists of the weighted values of the dependent, independent, or error-term variable of other units of observation. Before we come to describe the commands that generate such variables, we first need to explain the multiple forms of modeling this spatial-effect variable. From now on, we will focus on spatial lag models or spatial autoregressive models, because these are very popular in applied research. Everything we say carries over to spatial-x models and spatial-error models, as well. We start with monadic data, discussing the more complex case of spatial effects in dyadic data in more detail in the next section.

In monadic data, spatial dependence always emanates from all other units, weighted by the connectivity variable. The spatial weighting matrix represents the degree to which unit i is connected to units k, if at all. It can be a dichotomous variable such as geographical contiguity between two units, or it can measure a nonspatial relationship such as trade or investment links (Beck, Gleditsch, and Beardsley 2006). Theory should decide which is the appropriate variable and how exactly it is defined and operationalized. For example, contiguity can be defined in different ways, and trade flows can enter in levels, in logged form, or in other functional forms.

The variable used for the weighting matrix can be undirected as in the case of contiguity or directed as in the case of, say, exports. With directed connectivity variables, researchers must choose whether the weighting matrix measures connectivity from unit i to units k as in (1) above or measures connectivity from units k to unit i as in the following specification:

y_{it} = α + ρ ∑_{k} w_{kit} y_{kt} + β X_{it} + ε_{it}        (2)

For, say, exports as the connectivity variable, the weighting matrix in (1) measures exports from i to k, whereas in (2) it measures exports from k to i. Which weighting matrix is appropriate will depend on the specific research context and must also be justified on theoretical grounds.

The weighting matrix is often row-standardized, which means that each cell of the matrix is divided by the row-sum of cells. For example, if the nonstandardized weighting matrix consists of absolute foreign direct investment flows, then the row-standardized weighting matrix consists of shares of foreign direct investment flows. Plumper and Neumayer (2010) argue that researchers must always consider whether row-standardization of the weighting matrix is appropriate for their research design because it changes the substantive meaning of the connectivity variable. One should therefore justify one's decision on theoretical grounds rather than take row-standardization as the unquestioned norm.
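Row-standardization is easy to carry out by hand in a dyadic dataset; the following is a minimal sketch with assumed variable names (the commands in section 7 row-standardize by default unless norowst is specified):

* Sketch: divide each connectivity value w by its row sum, that is, by the
* sum over all partner units k for a given unit i and period
bysort i year: egen double rowsum = total(w)
generate double w_std = w / rowsum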

5 Spatial-effect variables in dyadic data

In monadic data, spatial dependence weighted by the connectivity variable always emanates from all other units. As discussed above, the only freedoms researchers have lie in the choice of a connectivity variable and its functional form (for example, in levels or in logged form); whether to row-standardize the weighting matrix; and, in case the connectivity variable is directed, the decision whether connectivity should be directed from unit i to units k or the reverse.

In contrast, a dyadic estimation dataset offers more freedoms with respect to the channels through which spatial dependence can be modeled, leading to different types of contagion, and with respect to the specification of the weighting matrix. Most importantly, with dyadic data one can distinguish directed and undirected dyads. In directed dyads, the interaction between two dyad members ij initiates with i and is directed toward j. In the directed dyad ij, unit i is called the source, while unit j is called the target of the interaction. It is different from the directed dyad ji where, in contrast, unit j is the source and unit i is the target.

In contrast, in undirected dyadic data, whilst one can distinguish unit i from unit j, it is either not possible to distinguish between the dyad ij and the dyad ji or researchers do not want to make such a distinction. For example, if the dependent variable measures the presence or absence of militarized conflict between country i and country j, then it may not be clear which country started the conflict or this question may be irrelevant because researchers may merely be interested in whether a conflict exists, not who initiated it. As a consequence, the dependent variables of dyads ij and ji are identical in undirected dyadic data.12

12. Both directed and undirected dyadic data settings are no less likely to be subject to spatial dependence than monadic data settings. What one unit does in relation to another unit with which it forms a dyad will often influence as well as be influenced by the relations of other dyads. Yet in Neumayer and Plumper (2010b), we could identify only three prior studies analyzing spatial effects in a dyadic data setting.

Undirected dyadic datasets are most similar to the monadic setting, because spatial dependence always emanates from other dyads. With directed dyadic data, spatial dependence can also emanate from other dyads, but there are more options to be discussed further below. When spatial dependence comes from other dyads, the only choice is whether one wishes to allow dyads that either unit i or unit j form with other units to also exert an influence on the spatial-effect variable. For example, a spatial lag model or a spatial autoregressive model of what Neumayer and Plumper (2010b) name inclusive dyad contagion will be specified as follows:

y_{ij} = α + ρ ∑_{km ≠ ij} ω_{pq} y_{km} + · · · + ε_{ij}        (3)

with "· · ·" representing other explanatory variables. For ease of exposition, (3) assumes a time-invariant research setting, but a time dimension can be easily added to all variables. Consider military alliances between two countries as an example. Whether country i and country j form an alliance may partly depend on what other alliances exist between countries in the world, including those that either country i or country j have concluded with countries besides each other. Which other dyads are relevant for the spatial effect—and if so, to what extent—will be specified by the weighting matrix, with

ω_{pq} ∈ {w_{ik}, w_{ki}, w_{jm}, w_{mj}, w_{ik+jm}, w_{ki+mj}, w_{ik×jm}, w_{ki×mj}}

as eight possible specifications of the (potentially directed) weighting matrix.13 In words, the weighting matrix can either link (source) units i and k (w_{ik}, w_{ki}) or (target) units j and m (w_{jm}, w_{mj}) or the sum (w_{ik+jm}, w_{ki+mj}) or the product (w_{ik×jm}, w_{ki×mj}) of the two units.14

13. The list is not exhaustive, and links can be combined with each other (see Neumayer and Plumper [2010b]). Even if the variable that is to be spatially lagged is an undirected dyadic variable, the weighting matrix can still be a directed dyadic variable.

14. For undirected dyad contagion, in which it is not possible to distinguish sources from targets, simply read this sentence, omitting the words source and target. Taking the sum of two weighting matrices implies that they are substitutes for each other (the lack of one link can be compensated by the presence of the other), whereas taking the product implies that they are complements.
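In symbols (this merely restates footnote 14), the additive and multiplicative links combine the two elementary connectivities as

w_{ik+jm} = w_{ik} + w_{jm}   and   w_{ik×jm} = w_{ik} × w_{jm}

with the analogous definitions for w_{ki+mj} and w_{ki×mj}.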

In contrast, a spatial lag model or a spatial autoregressive model of exclusive dyad contagion disallows all dyads that contain either unit i or unit j from exerting an influence on the spatial-effect variable and is modeled as

y_{ij} = α + ρ ∑_{k ≠ i,j; m ≠ i,j} ω_{pq} y_{km} + · · · + ε_{ij}

with the same set of options available for the weighting matrix. On the decision whether to form an alliance between countries i and j, this specification would exclude alliances that countries i and j have with other countries.

As alluded to already, directed dyadic datasets offer more modeling flexibility than just dyadic contagion. The reason is that in such datasets, it is possible to distinguish the source i of a dyadic interaction from its target j. This opens the possibility that spatial dependence only derives from other sources or from other targets, instead of from all other dyads. Moreover, contagion may stem from either the aggregate behavior of other sources or targets or from their specific behavior with respect to the dyad ij under observation.

What Neumayer and Plumper (2010b) coin aggregate source contagion consists of spatial dependence coming from the aggregate behavior of other sources k—that is, from their relationship with any target m, not just the specific target j under observation:

y_{ij} = α + ρ ∑_{k ≠ i} ∑_{m} ω_{pq} y_{km} + · · · + ε_{ij}

Our previous example of military alliances could be a directed dyadic relationship if it were possible to distinguish the source (initiator) from the target (recipient) of the interaction, but it is perhaps more likely to be an undirected dyadic relationship. We therefore switch to international terrorism as an example, where the dyadic relationship between perpetrator and victim is more clearly directed. With aggregate source contagion, the likelihood that terrorists from country i attack victims from country j may partly depend on the aggregate overall propensity of terrorists from other countries k to attack victims from any other country m.

If, instead, only the relationship of other sources k with the specific target j matters for spatial dependence, then the situation calls for modeling specific source contagion:

y_{ij} = α + ρ ∑_{k ≠ i} ω_{pq} y_{kj} + · · · + ε_{ij}

With specific source contagion, the aggregate overall propensity of terrorists from other countries k no longer matters for whether terrorists from country i are more likely to attack victims from country j. Instead, only the propensity of terrorists from other countries k to attack victims from this specific country j matters.15 In both aggregate and specific source contagion, the basic set of link functions is ω_{pq} ∈ {w_{ik}, w_{ki}}; that is, the weighting matrix links source units i and k with each other, either from i to k or from k to i, if it is a directed variable.16

The two forms of target contagion function very similarly, only this time it is the aggregate or specific behavior of other targets m from which the spatial effect emanates. For aggregate target contagion, in which the aggregate behavior of other targets m with any source k (not just the specific source i under observation) matters,

y_{ij} = α + ρ ∑_{k} ∑_{m ≠ j} ω_{pq} y_{km} + · · · + ε_{ij}        (4)

For the example of international terrorism, the propensity of terrorists from country i to attack victims from country j may partly depend on how much terrorism victims from other countries m experience, independently of who the terrorists are.

Specific target contagion, in which only interactions of other targets m with the specific source i matter, is modeled as

y_{ij} = α + ρ ∑_{m ≠ j} ω_{pq} y_{im} + · · · + ε_{ij}        (5)

Here the propensity of terrorists from country i to attack victims from country j partly depends on how much terrorism terrorists from this country i inflict on victims from other countries m. In both forms of target contagion, the set of basic link functions comprises ω_{pq} ∈ {w_{jm}, w_{mj}}; that is, the weighting matrix links target units j and m with each other, either from j to m or from m to j, if it is a directed variable.

15. For example, Neumayer and Plumper (2010a) use the civilizational affiliation of countries of (potential) terrorists as a connectivity variable to test whether there is evidence for international terrorism spreading along civilizational lines in the form of specific source contagion, as predicted by Huntington (1996).

16. As with dyadic contagion, further link functions are possible. The same applies to the forms of target contagion of (4) and (5).

6 Parsing through a virtual 4-adic dataset

In principle, because the weighting matrix is of one dimension above the dimension of the estimation dataset, one needs a dataset of one dimension higher than the estimation dataset to generate the spatial-effect variable. So, for example, to create a spatial-effect variable for a monadic dataset of dimension N × T, one needs a dataset connecting monadic units with each other—that is, a dyadic dataset of dimension N × N × T. To generate a spatial-effect variable for dyadic data, one would normally need a so-called 4-adic dataset of dimension (N_i × N_j) × (N_i × N_j) × T—that is, a dataset that connects dyads with dyads. Table 1 displays a very simple directed 4-adic dataset for the case of N_i, N_j = 3 with i, j, k, m ∈ {1, 2, 3} and T = 1; that is, the dataset is time-invariant.

Table 1. The parsing technique and matching of spatial-effect variable components for specific source contagion and w_{ik} as connectivity

(The full virtual 4-adic dataset contains all 81 combinations of i, j, k, m ∈ {1, 2, 3} with T = 1. For each observed dyad ij, the table lists the km dyads that are relevant for the spatially lagged variable, the km dyads that are relevant for the connectivity variable, and how the two sets are matched when the spatial-effect variable is built.)

  Dyad ij   km dyads relevant for the    km dyads relevant   Matched pairs
            spatially lagged variable    for connectivity    (lagged ↔ connectivity)
  -----------------------------------------------------------------------------------
  1–1       2–1, 3–1                     1–2, 1–3            2–1 ↔ 1–2; 3–1 ↔ 1–3
  1–2       2–2, 3–2                     1–2, 1–3            2–2 ↔ 1–2; 3–2 ↔ 1–3
  1–3       2–3, 3–3                     1–2, 1–3            2–3 ↔ 1–2; 3–3 ↔ 1–3
  2–1       1–1, 3–1                     2–1, 2–3            1–1 ↔ 2–1; 3–1 ↔ 2–3
  2–2       1–2, 3–2                     2–1, 2–3            1–2 ↔ 2–1; 3–2 ↔ 2–3
  2–3       1–3, 3–3                     2–1, 2–3            1–3 ↔ 2–1; 3–3 ↔ 2–3
  3–1       1–1, 2–1                     3–1, 3–2            1–1 ↔ 3–1; 2–1 ↔ 3–2
  3–2       1–2, 2–2                     3–1, 3–2            1–2 ↔ 3–1; 2–2 ↔ 3–2
  3–3       1–3, 2–3                     3–1, 3–2            1–3 ↔ 3–1; 2–3 ↔ 3–2

Note: Each matched pair combines the observation supplying the value to be spatially lagged (the dyad k–j) with the observation supplying the connectivity weight (the dyad i–k).

The dataset shown in table 1 is very small, but in many actual research contexts with i and j of medium to large size and multiple time periods, such a 4-adic dataset will be far too large for the memory of standard personal computers (PCs). The commands discussed in this article circumvent this problem by parsing through a virtual 4-adic dataset. Thus, rather than generating an actual full-sized 4-adic dataset of dimension (N_i × N_j) × (N_i × N_j) × T, the commands exploit the fact that for any one specific dyad ij, say, dyad 1–1 in table 1, a dyadic dataset of dimension N_i × N_j × T contains both the full set of dyads from which spatial dependence can possibly derive and the full set of dyads that are potentially relevant for the weighting matrix. The commands therefore loop through the full set of dyads, and for any one specific dyad ij, they save in temporary files the dyads km that are relevant for whichever type of contagion is created, as well as another set of dyads km that are relevant for the weighting matrix that is dependent on the type of connectivity chosen. Table 1 shows which dyads km are relevant for the example of specific source contagion and w_{ik} as the chosen connectivity. For the ij dyad of 1–1, the km dyads of 2–1 and 3–1 are relevant for the spatially lagged variable because with specific source contagion it is dyads of the other sources 2 and 3 with the specific target 1 that matter. The km dyads of 1–2 and 1–3 are relevant for measuring connectivity from source 1 to source 2 and from source 1 to source 3.

Once all the necessary components for creating the spatial-effect variable have been saved in temporary files, the commands then combine all the components with each other by merging the relevant dyads for the spatially lagged variable with the relevant dyads for the connectivity variable to create the spatial-effect variable. The matched pairs in table 1 show which dyads of the variable that is to be spatially lagged are merged with which dyads of the variable representing connectivity. Finally, unless the nomerge option is specified, the resulting spatial-effect variable is created and saved in the current working directory, as well as merged into the original dyadic dataset.
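The logic of this merge can be reproduced by hand for the example of table 1; the following is a minimal sketch for specific source contagion with w_{ik} as connectivity, assuming a dyadic dataset dyads.dta with variables i, j, y (the variable to be spatially lagged), and w (connectivity from i to j). All names are illustrative, and the commands in section 7 make these steps unnecessary:

* Dyads supplying the outcomes y_kj (other sources k with target j)
use dyads, clear
keep i j y
rename i k
rename y y_kj
tempfile lagged
save `lagged'

* Dyads supplying the weights w_ik (unit i with other sources k)
use dyads, clear
keep i j w
rename j k
rename w w_ik
joinby k using `lagged'            // every weight dyad i-k meets every outcome dyad k-j
drop if k == i                     // other sources only
generate double wy = w_ik * y_kj
collapse (sum) wy w_ik, by(i j)
generate double splag = wy / w_ik  // row-standardized spatial-effect variable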

This parsing technique has two main advantages. The one already mentioned is that it makes generating spatial-effect variables possible without having to create a large 4-adic dataset. The second advantage is that users need not worry about creating a connectivity variable that links sources or targets with each other or with additive or multiplicative combinations of the two, depending on the type of connectivity required. By looping through each possible dyad ij and saving only the dyads km in temporary files that are relevant for the specific connectivity variable chosen by the user, all that is required is a connectivity variable that links unit i to unit j. The commands virtually transform this connectivity variable to generate the actual weighting matrix chosen by the user according to the link options available.

Unfortunately, this parsing technique also comes with two disadvantages. First, depending on the size of the dyadic dataset, it can take from seconds to several minutes, hours, or even days to generate the spatial-effect variable on standard PCs. As a general rule, the commands that generate aggregate source or target contagion are fast,17 the commands that generate specific source or target contagion are considerably slower, and the ones that generate undirected or directed dyad contagion are the slowest. However, creating an actual 4-adic dataset is unlikely to represent a superior alternative to the parsing technique. When its size is moderate enough that it could be handled by standard PC memory size, the commands employing the parsing technique also work relatively fast. Processing the commands is time-consuming only when creating an actual 4-adic dataset is difficult or impossible.

17. The same is true for the command that generates spatial effects for monadic data, but spmon does not rely on the parsing technique, anyway.

The second disadvantage of the parsing technique is that because no actual 4-adic weighting matrix is constructed, researchers cannot apply spatial ML methods because doing so would require using the 4-adic weighting matrix. Instead, researchers need to rely on the instrumental-variable technique of spatial-2SLS (Kelejian and Prucha 1998; Franzese and Hays 2007) to account for the simultaneity bias introduced by the spatial-effect variable. Of course, the commands were written specifically for cases in which the 4-adic weighting matrix is simply too large to be handled by standard PCs. In samples for which the 4-adic weighting matrix is not too large, spatial ML can be used—but in such situations researchers do not need the commands described here anyway because they can simply create the spatial-effect variables by hand if they can construct the entire actual 4-adic weighting matrix.

Sparse matrix modeling represents another alternative that is potentially superior to the parsing technique in some contexts (Ward and Gleditsch 2008). However, this technique makes sense only where a large share of zeros (or some other specific constant number) is in the weighting matrix, as is typically the case for using contiguity or similar as the connectivity variable. If the share of zeros is small—for example when researchers weight by distance, exports, or some other continuous connectivity variable—sparse matrix modeling does not provide any advantage. We contend that connectivity variables with no zeros or a small share of zeros will become much more popular in the future because theories will often predict spatial dependence working via more complicated links than simple dichotomous weights such as contiguity.

7 Commands for generating spatial-effect variables

7.1 Syntax

Monadic contagion

    spmon lagvar [if] [in], i(varname) k(varname) weightvar(varname)
        [reverse_W std_options]

Undirected dyad contagion

    spundir lagvar [if] [in], i(varname) j(varname) weightvar(varname)
        link(link_fcn) [exclusive std_options]

Directed dyad contagion

    spdir lagvar [if] [in], source(varname) target(varname) weightvar(varname)
        link(link_fcn) [exclusive std_options]

Aggregate source or target contagion

    spagg lagvar [if] [in], source(varname) target(varname) weightvar(varname)
        form(source | target) [reverse_W std_options]

Specific source or target contagion

    spspc lagvar [if] [in], source(varname) target(varname) weightvar(varname)
        form(source | target) [reverse_W std_options]

For std_options, see the Standard options subsection in section 7.2.


7.2 Description of commands and options

Because for dyadic datasets the parsing technique described in section 6 obviates the need for a 4-adic dataset, both spmon, which generates spatial-effect variables for monadic data, and the set of commands that generate spatial-effect variables for dyadic data (spagg, spspc, spdir, and spundir) all merely require a dyadic dataset. This dataset must contain at least four variables.

First, the dataset must contain the variable to be spatially lagged (lagvar), the name of which is stated right after each command. For spmon, this variable must be the same for all dyads of a specific unit k with various combinations of unit i (for any given time period), whereas for spagg, spspc, spdir, and spundir, this variable will typically differ from dyad to dyad. For example, in spatial lag (spatial autoregressive) models, this variable will simply be the dependent variable of other dyads.

Second, the dataset must contain a variable identifying unit i, which is stated in i(varname) in spmon and spundir and in source(varname) in the commands that generate spatial-effect variables for directed dyadic data. The difference is purely notational. All that matters is that the variable listed in i(varname) or source(varname) identifies unit i. It can be a numeric or string variable.

The third variable must identify a second unit, which is stated in k(varname) for spmon, in j(varname) for spundir, and in target(varname) for the commands for directed dyadic data. Again the difference is purely notational. What matters is that this numeric or string variable identifies unit k or unit j and that together ik or ij uniquely identify a specific dyad.

The fourth and final variable that a dataset must contain for the commands described here to work is the weighting or connectivity variable, which is always listed in weightvar(varname). It connects unit i with units k in case of spmon and unit i with units j in case of the commands that generate spatial-effect variables in dyadic data. This variable will typically be different for each dyad of a specific unit i with various combinations of units k (j). Also it may or may not be directed. If the spatial-effect variable is to be time-variant, then one additionally needs a fifth (optional) variable in the dataset that identifies time; see time(varname) in the Standard options subsection of section 7.2.

For spagg and spspc, one must also specify whether the spatial effect arises from sources or targets in other directed dyads. Use form(source) if the spatial effect stems from other sources, or form(target) if the spatial effect derives from other targets.
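For example, hypothetical calls (dataset and variable names are assumed) requesting specific source contagion and aggregate target contagion would be

. spspc terror, source(origin) target(victim) weightvar(w) form(source) time(year)
. spagg terror, source(origin) target(victim) weightvar(w) form(target) time(year)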

The commands spmon, spagg, and spspc each allow only two basic link functions such that, for simplicity, the function linking unit i to units k is the default option for spmon and for the source contagion forms of spagg and spspc, while the function linking unit j to units m is the default option for the target contagion forms of spagg and spspc. In each of these cases, specifying the reverse_W option reverses the direction of the connectivity variable such that the weighting matrix represents connectivity from, respectively, units k to unit i and from units m to unit j, instead. Naturally, this option makes sense only if the connectivity variable is in fact a directed variable. Otherwise, both the default and the reverse_W option will lead to the same generated spatial-effect variable.

Both spundir and spdir can create spatial-effect variables with a variety of specified connectivities, such that the required link(link_fcn) option prompts users to choose one of eight possible link functions: ik, ki, jm, mj, ik+jm, ki+mj, ik*jm, or ki*mj. The ik link function requests that the virtually transformed weighting variable listed in weightvar(varname) is to represent connectivity from unit i to other units k. The ki link function requests connectivity from other units k to unit i, instead. The jm link function requests connectivity from unit j to other units m. The mj link function requests connectivity from other units m to unit j, instead. The ik+jm link function requests that the virtually transformed weighting variable represents the sum of connectivities invoked by ik and jm. The ki+mj link function does the same, but for the sum of connectivities invoked by ki and mj. The ik*jm option requests that the virtually transformed weighting variable represent the product of connectivities invoked by ik and jm. The ki*mj option does the same, but for the product of connectivities invoked by ki and mj.

Both spundir and spdir can also create either inclusive dyad contagion (the default option) or exclusive dyad contagion, to be requested by invoking the exclusive option.
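Again, purely illustrative calls (all names assumed) might be

. spdir alliance, source(ccode1) target(ccode2) weightvar(trade) link(ik*jm) time(year)
. spundir conflict, i(ccode1) j(ccode2) weightvar(trade) link(ki) exclusive time(year)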

Standard options

std_options can be any of the following:

  time(varname)        contains the numeric time variable
  sename(name)         names the created spatial-effect variable
  labelname(label)     names the label given to the spatial-effect variable
  filename(filename)   names the file to which the spatial-effect variable is saved
  norowst              specifies that the spatial-effect variable not be row-standardized
  nomerge              specifies no automatic merge of the spatial-effect variable into
                       the original dataset

All commands allow restricting the relevant sample with if and in conditions. As mentioned already, there is an optional time-variable identifier, time(varname), which is needed if the spatial-effect variable is to be time-variant. All commands also allow users to name the created spatial-effect variable by specifying the sename(name) option, to give the created spatial-effect variable a specific label by specifying the labelname(label) option, and to save a dataset containing the generated spatial-effect variable in the current directory under the name specified in the filename(filename) option. Without these options, the generated spatial-effect variable and files are given predefined names.18 Each command allows deviating from generating a row-standardized spatial-effect variable (the default option) by specifying the norowst option. Each command will normally automatically merge the generated spatial-effect variable into the original dataset used for generating it, but this can be prevented by specifying the nomerge option. This option is particularly relevant if one uses two separate datasets—one for the creation of the spatial-effect variable and another one that is the actual estimation dataset into which the spatial-effect variable created from the other dataset then needs to be merged by hand. The automatic merge default option is most suitable when the dataset used for the creation of the spatial-effect variable is also the estimation dataset. For the analysis of spatial dependence in monadic data, users must always have two datasets because the estimation dataset is monadic, whereas the dataset used for the creation of the spatial-effect variable must be dyadic.19

18. Consult the help files for each command for information on these default names.

19. For spmon, if the spatial-effect variable is merged into the original dyadic dataset used for the creation of the variable, then it will have the same value for all units of i in any given time period.
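The two-dataset workflow for monadic data might look like the following sketch, in which all dataset and variable names are assumed and the merge keys assume that the saved file keeps the identifiers named in i() and time():

* Sketch (names assumed): create the spatial-effect variable from a dyadic
* dataset, save it with nomerge, and merge it by hand into the monadic
* estimation dataset (assumed to have one observation per unit and year)
use dyadic_links, clear
spmon policy, i(country) k(partner) weightvar(exports) time(year) ///
    sename(sp_policy) filename(sp_policy_file) nomerge
use estimation_data, clear
merge 1:1 country year using sp_policy_file, keepusing(sp_policy) nogenerate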

7.3 A note on the format of the dyadic dataset required for spundir

Often, undirected dyadic datasets are organized such that if dyad ij is contained in the dataset, then dyad ji is excluded, and vice versa. The reason is that one of the dyads contains redundant information, given that the value of the dependent variable for ij equals that of ji. If the dataset is in this nonsymmetric format, then it must contain only those dyads for which i is numerically smaller than or equal to j and exclude all dyads for which i is larger than j, which follows common practice.20

Thus, for example, if i and j both run from 1 to 4, then the dataset would contain the dyads 1–1, 1–2, 1–3, 1–4, 2–2, 2–3, 2–4, 3–3, 3–4, and 4–4, but it would exclude dyads 2–1, 3–1, 3–2, 4–1, 4–2, and 4–3. (Dyads 1–1, 2–2, 3–3, and 4–4 may also be excluded if a dyadic relationship of a unit with itself is impossible, which depends on the research context, namely, the type of relationship studied.)

It is, however, possible and often convenient to organize an undirected dyadic dataset so that it contains both dyad ij and dyad ji, despite the fact that the value of the dependent variable for these two dyads must be the same. For spundir to work, it does not matter whether the dataset is kept in the nonsymmetric or the symmetric format. Users must, however, organize their data in the symmetric format if the weighting variable is to be directed, because a directed, dyadic weighting variable requires a fully symmetric dyadic dataset.
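If the data are held in the nonsymmetric format, they can be expanded into the symmetric format with a few generic data-management steps. The following is a minimal sketch only, assuming the dyad members are held in numeric variables named i and j (hypothetical names); it simply duplicates every ij observation as a ji observation while keeping self-dyads once.

. preserve
. drop if i == j                 // self-dyads need not be duplicated
. generate long tmp = i          // swap the dyad members to turn ij into ji
. replace i = j
. replace j = tmp
. drop tmp
. tempfile mirrored
. save `mirrored'
. restore
. append using `mirrored'        // the data now contain both ij and ji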

8 Conclusion

Spatial dependence is a common phenomenon in social relations. Social science research is therefore particularly in need of modeling, or at least controlling for, spatial dependence.

18. Consult the help files for each command for information on these default names.
19. For spmon, if the spatial-effect variable is merged into the original dyadic dataset used for the creation of the variable, then it will have the same value for all units of i in any given time period.
20. If i and j are string variables, then this condition requires that the dataset contain only those dyads for which i is alphabetically prior or equal to j and excludes all dyads for which i is alphabetically subsequent to j.


The rapid improvements in both computing power and spatial estimation techniques, as well as mounting advice on specification issues, are bound to make more scholars interested in spatial analysis (Anselin 1988; Beck, Gleditsch, and Beardsley 2006; Darmofal 2006; Franzese and Hays 2007, 2010; Ward and Gleditsch 2008; Neumayer and Plumper 2010b; Plumper and Neumayer 2010). The purpose of the commands described here is to render such analysis easier by allowing users to generate all types of spatial-effect variables with a single command line and, in the case of dyadic data, to do so without the need to construct a large 4-adic dataset.

9 References

Alix-Garcia, J. 2007. A spatial analysis of common property deforestation. Journal of Environmental Economics and Management 53: 141–157.

Anselin, L. 1988. Spatial Econometrics: Methods and Models. Dordrecht: Kluwer Academic Publishers.

Atanaka-Santos, M., R. Souza-Santos, and D. Czeresnia. 2007. Spatial analysis for stratification of priority malaria control areas, Mato Grosso State, Brazil. Cadernos de Saude Publica 23: 1099–1112.

Basinger, S. J., and M. Hallerberg. 2004. Remodeling the competition for capital: How domestic politics erases the race to the bottom. American Political Science Review 98: 261–276.

Beck, N., K. S. Gleditsch, and K. Beardsley. 2006. Space is more than geography: Using spatial econometrics in the study of political economy. International Studies Quarterly 50: 27–44.

Chi, G., and J. Zhu. 2008. Spatial regression models for demographic analysis. Population Research and Policy Review 27: 17–42.

Crews, K. A., and M. F. Peralvo. 2008. Segregation and fragmentation: Extending landscape ecology and pattern metrics analysis to spatial demography. Population Research and Policy Review 27: 65–88.

Crighton, E. J., S. J. Elliott, R. Moineddin, P. Kanaroglou, and R. Upshur. 2007. A spatial analysis of the determinants of pneumonia and influenza hospitalizations in Ontario (1992–2001). Social Science and Medicine 64: 1636–1650.

Darmofal, D. 2006. Spatial econometrics and political science. Paper presented at the Annual Meeting of the Southern Political Science Association, Atlanta, GA, January 5–7.

Elkins, Z., and B. A. Simmons. 2005. On waves, clusters, and diffusion: A conceptual framework. Annals of the American Academy of Political and Social Sciences 598: 33–51.

Franzese, R. J., Jr., and J. C. Hays. 2006. Strategic interaction among EU governments in active labor market policy-making: Subsidiarity and policy coordination under the European employment strategy. European Union Politics 7: 167–189.

———. 2007. Spatial econometric models of cross-sectional interdependence in political science panel and time-series-cross-section data. Political Analysis 15: 140–164.

———. 2010. Spatial-Econometric Models of Interdependence. (Book prospectus.) PDF file.

Galton, F. 1889. Discussion of “On a method of investigating the development of institutions; applied to laws of marriage and descent”, by Edward B. Tylor. Journal of the Anthropological Institute of Great Britain and Ireland 18: 270–272.

Garrett, T. A., G. A. Wagner, and D. C. Wheelock. 2005. A spatial analysis of state banking regulation. Papers in Regional Science 84: 575–595.

Genschel, P., and T. Plumper. 1997. Regulatory competition and international co-operation. Journal of European Public Policy 4: 626–642.

Gray, W. B., and R. J. Shadbegian. 2007. The environmental performance of polluting plants: A spatial analysis. Journal of Regional Science 47: 63–84.

Hays, J. C. 2009. Globalization and the New Politics of Embedded Liberalism. Oxford: Oxford University Press.

Huntington, S. 1996. The Clash of Civilizations and the Remaking of World Order. New York: Simon & Schuster.

Kandala, N.-B., and G. Ghilagaber. 2006. A geo-additive Bayesian discrete-time survival model and its application to spatial analysis of childhood mortality in Malawi. Quality and Quantity 40: 935–957.

Kelejian, H. H., and I. R. Prucha. 1998. A generalized spatial two-stage least squares procedure for estimating a spatial autoregressive model with autoregressive disturbances. Journal of Real Estate Finance and Economics 17: 99–121.

Kosfeld, R., and C. Dreger. 2006. Thresholds for employment and unemployment: A spatial analysis of German regional labour markets, 1992–2000. Papers in Regional Science 85: 523–542.

LeSage, J., and R. K. Pace. 2009. Introduction to Spatial Econometrics. Boca Raton: Chapman & Hall/CRC.

Levi-Faur, D. 2005. The global diffusion of regulatory capitalism. Annals of the American Academy of Political and Social Sciences 598: 12–32.

Meseguer, C. 2005. Policy learning, policy diffusion and the making of a new order. Annals of the American Academy of Political and Social Sciences 598: 67–82.

Mooney, C. Z. 2001. Modeling regional effects on state policy diffusion. Political Research Quarterly 54: 103–124.

Neumayer, E., and T. Plumper. 2010a. Galton’s problem and the spread of international terrorism along civilizational lines. Conflict Management and Peace Science 27: 308–325.

———. 2010b. Spatial effects in dyadic data. International Organization 64: 145–166.

Perkins, R., and E. Neumayer. 2010. Geographic variations in the early diffusion of corporate voluntary standards: Comparing ISO 14001 and the Global Compact. Environment and Planning A 42: 347–365.

———. Forthcoming. Transnational spatial dependencies in the geography of non-resident patent filings. Journal of Economic Geography.

Plumper, T., and E. Neumayer. 2010. Model specification in the analysis of spatial dependence. European Journal of Political Research 49: 418–442.

Plumper, T., and V. E. Troeger. 2008. Fear of floating and the external effects of currency unions. American Journal of Political Science 52: 656–676.

Plumper, T., V. E. Troeger, and H. Winner. 2009. Why is there no race to the bottom in capital taxation? International Studies Quarterly 53: 761–786.

Rice, P., A. J. Venables, and E. Patacchini. 2006. Spatial determinants of productivity: Analysis for the regions of Great Britain. Regional Science and Urban Economics 36: 727–752.

Schmertmann, C. P., J. E. Potter, and S. M. Cavenaghi. 2008. Exploratory analysis of spatial patterns in Brazil’s fertility transition. Population Research and Policy Review 27: 1–15.

Simmons, B. A., and Z. Elkins. 2004. The globalization of liberalization: Policy diffusion in the international political economy. American Political Science Review 98: 171–189.

Ward, M. D., and K. S. Gleditsch. 2008. Spatial Regression Models. London: Sage.

Weyland, K. G. 2005. Theories of policy diffusion: Lessons from Latin American pension reform. World Politics 57: 262–295.

About the authors

Eric Neumayer is a professor and head of department in the Department of Geography and Environment at the London School of Economics and Political Science. He works on international political economy topics.

Thomas Plumper is a professor of government and Director of the Essex Summer School in Social Science Data Analysis at the University of Essex. He works on political economy and methodology.

The Stata Journal (2010) 10, Number 4, pp. 606–627

Age–period–cohort modeling

Mark J. Rutherford
Department of Health Sciences

University of Leicester, UK

[email protected]

Paul C. Lambert
Department of Health Sciences

University of Leicester, UK

[email protected]

John R. Thompson
Department of Health Sciences

University of Leicester, UK

[email protected]

Abstract. Age–period–cohort models provide a useful method for modeling incidence and mortality rates. It is well known that age–period–cohort models suffer from an identifiability problem due to the exact relationship between the variables (cohort = period − age). In 2007, Carstensen published an article advocating the use of an analysis that models age, period, and cohort as continuous variables through the use of spline functions (Carstensen, 2007, Statistics in Medicine 26: 3018–3045). Carstensen implemented his method for age–period–cohort models in the Epi package for R. In this article, a new command is introduced, apcfit, that performs the methods in Stata. The identifiability problem is overcome by forcing constraints on either the period or cohort effects. The use of the command is illustrated through an example relating to the incidence of colon cancer in Finland. The example shows how to include covariates in the analysis.

Keywords: st0211, apcfit, poprisktime, age–period–cohort models, incidence rates, mortality rates, Lexis diagrams

1 Introduction

An age–period–cohort (APC) model provides a modeling tool that can be used to summarize the information that is routinely collected by cancer registries and registries for other diseases. Classically, APC models fit the effects of age, period, and cohort as factors. It has become common practice to report the age and period effects in 5-year intervals, resulting in 10-year overlapping intervals for the relevant cohorts. However, through the increase in computer power and the use of restricted cubic (natural) splines (Durrleman and Simon 1989), it has been shown that it is possible to analyze the effects as continuous variables (Carstensen 2007). This article builds on the work of Carstensen and explains how the method and the extensions have been made available in Stata.

APC models suffer from an identifiability problem. The date of birth can be calculated directly from the age at diagnosis and the date of diagnosis. If fitted directly in a generalized linear model (GLM), this leads to overparameterization and, consequently, the exclusion of one of the terms. It is therefore necessary to fit constraints to the model to extract identifiable answers for each of the parameters. This step is required because each of the components of the model provides different insights into the trends of the disease over time. The age effect provides information on the rates of disease in terms of different age groups. The period effect can highlight changes in treatment that could affect all ages simultaneously. The cohort effects are associated with long-term exposures, with different generations being exposed to different risks (Robertson, Gandini, and Boyle 1999).

Other Stata commands are available that apply constraints to overcome the identifiability issue for APC models. The apc_ie and apc_cglim commands are available to download as part of the apc package, which can be found via the Statistical Software Components (SSC) archive. The apc_cglim command uses a single equality constraint. The age, period, and cohort terms are fitted as factors, and a constraint that sets two of the categories from different components equal to one another is applied to overcome the lack of identifiability (Yang, Fu, and Land 2004). The apc_ie command uses the intrinsic estimator, which employs a principal components regression to arrive at the constrained estimates for the age, period, and cohort effects. The two approaches are described in detail and compared by Yang, Fu, and Land (2004). A good overview of techniques available to carry out APC models is given by Land (2008).

The apcfit command differs from the existing approaches by using restricted cubic splines to model the three variables. apcfit gives estimates for the three effects (age, period, and cohort) that can then be combined to give the predicted rates. The estimates for the three components are also interpretable individually and can be plotted to show incidence and mortality trends over the different time scales. The graphs that can be produced provide a clear and simple depiction of the data. A further benefit of apcfit is the potential for further modeling to investigate the effect of covariates.

2 Methods

The general APC model can be described using the following equation:

ln {λ(a,p)} = f(a) + g(p) + h(c)

where f(), g(), and h() are functions, λ refers to the rate, a refers to the age variable, p refers to the period variable, and c refers to the cohort variable. This model can be used to predict the incidence or mortality rate for any given combination of age and period. However, because of the direct relationship between the terms, c = p − a, the components of this model cannot be uniquely determined. The model needs to be constrained in some way to ensure that three functions showing the age, period, and cohort effects can be extracted. Carstensen (2007) details a method that allows this to be achieved.
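To make the indeterminacy explicit (a standard observation about APC models rather than anything specific to apcfit), note that for any constants $\mu$ and $\gamma$ the transformation

$$ f(a) \mapsto f(a) - \mu - \gamma a, \qquad g(p) \mapsto g(p) + \gamma p, \qquad h(c) \mapsto h(c) + \mu - \gamma c $$

leaves $\ln\{\lambda(a,p)\}$ unchanged, because $\gamma(p - a - c) = 0$ when $c = p - a$. Some constraint on the linear trends is therefore needed before the three functions can be reported separately.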

In essence, the method proposed by Carstensen uses restricted cubic (natural) splines for the age, period, and cohort terms within a GLM framework with a Poisson family error structure, a log link function, and an offset of log(person risk-time). However, to overcome the identifiability problem, transformations are made to the spline basis vectors for the period and cohort effects using matrix transformations.


Having performed the matrix transformations, a GLM is fitted within Stata using the adjusted spline basis vectors. Using this GLM as a foundation, it is possible to extend the analysis to include covariates. The data required to do this have observations for each unique age–period combination for every level of the covariate of interest. It is then possible to adjust for the effect of the covariate by including the term in the GLM. It is also possible to include interaction terms between the covariate and age, period, and cohort.

[Figure 1 about here: a partial Lexis diagram with calendar time (period), 1998–2001, on the horizontal axis and age, 33–36, on the vertical axis; the upper and lower triangular subsets of the age-34, year-2000 square are annotated with the dates 3/1/2000 and 3/2/2000 and the corresponding birth cohorts 1965 and 1966.]

Figure 1. Snapshot of a Lexis diagram indicating the reasoning behind the use of the average values that are offset by 1/6 for the triangular subsets (compared with the average values for the squares of a Lexis diagram)

2.1 Form of the data

Cancer registries and other disease registries typically collect data that could be summarized in a Lexis diagram. A Lexis diagram summarizes a population’s disease status over calendar time against age. For example, Lexis diagrams can be used to depict the number of new cases of a disease by category for age and period of diagnosis (see figure 1). A Lexis diagram is usually split into five-year intervals for period and age; we suggest that yearly intervals should be used. The cells of a Lexis diagram can also be further subdivided by cohort by using information on patients’ dates of birth.

To appropriately fit the models allowed by the apcfit command, it is necessary to ensure that the data are in the right form. In practice, the dataset will have one observation for each of the subsets of the Lexis diagram. The width of the intervals for the age and period terms will be dictated by the availability of population figures for the intervals. Each observation will consist of these variables: number of events (cases), population risk time (person-years), mean age, period, and cohort.

To analyze data that are set out in the form of a Lexis diagram, the appropriate averages for age and period need to be used for the triangular subsections of the diagram (see figure 1; the dots in the center of the two triangles give the average values that should be used). Carstensen (2007) highlights that when the diagram is split into yearly age and period categories, the necessary values differ from conventional averages by one sixth. The conventional averages that would be used are at the center of the square; that is, at age 34 1/2 and period 2000 1/2. These values are different from those at the center of the two triangles by one sixth. Making this distinction provides data that can then model the full extent of the Lexis diagram, taking into account both the upper and lower triangular subsets. The reasoning behind the averages used for the values of age, period, and cohort is illustrated in figure 1. The set of three lines that pass through the center of each triangle indicates the values that should be used for age, period, and cohort. The distinction between the upper and lower triangular subsets is defined by the patients’ years of birth. This can again be seen from the partial Lexis diagram given as figure 1; the upper triangular subset relates to patients who were born in 1965, whereas the lower triangular subset relates to patients who were born in 1966.
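As an illustration of these adjusted averages, the following minimal sketch assumes that A, P, and C hold the left endpoints of the yearly age, period, and cohort classes (as in the example data of section 5); note that the poprisktime command performs this adjustment automatically, so these lines are purely explanatory.

. * lower triangles: patients born in year P - A
. generate double Aadj = A + 1/3 if C == P - A
. generate double Padj = P + 2/3 if C == P - A
. * upper triangles: patients born in year P - A - 1
. replace Aadj = A + 2/3 if C == P - A - 1
. replace Padj = P + 1/3 if C == P - A - 1
. generate double Cadj = Padj - Aadj

These values reproduce the offsets visible in the poprisktime output shown later (for example, A = 30.333, P = 1980.667, C = 1950.334 for a lower triangle).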

Once the data are set up with the appropriate averages for age and period, it is necessary to ensure that the population risk-time is calculated for each triangular section of the Lexis diagram.

Person-years (person risk-time)

For the majority of countries, it is possible to obtain population figures in one-year age classes for each calendar year. For example, the population data used in the example in section 5 were obtained from Statistics Finland (Statistics Finland). It is then possible to use these figures in the calculation of risk-time for each triangular subset of the Lexis diagram.

The command poprisktime can be used to calculate the population risk-time from population data using the formula suggested by Sverdrup (1967) as given in Carstensen (2007). The syntax of the command is detailed in the following section. The poprisktime command adjusts the averages for the age and period variables provided that the dataset is split into yearly intervals. The population risk-time should be calculated for every possible combination of age and period. The dataset containing the values for the population risk-time (Y) is merged with the dataset containing the number of cases as part of the command. An example of the form of data that are required to carry out the poprisktime command is given at the beginning of the example in section 5.

2.2 Matrix transformations

The main function of the apcfit command is to make transformations to the spline basis vectors so that the resulting output has a clear and sensible interpretation in spite of the identifiability issue. The matrix transformations are performed using Mata, and they remove the trend from the cohort and period terms. The so-called drift term is then added to either the cohort or period terms, depending on the selected parameterization. This is why the cohort term, the period term, or both terms are constrained to have 0 slope (see param() in section 3.3).

The appropriate spline basis vectors are combined into matrices relating to each of the components (age, period, and cohort) of the model. Let these three design matrices be MA, MP, and MC. The method requires that the period and cohort matrices be detrended. This is achieved by projecting the columns of the matrices onto the orthogonal complement of a two-column matrix, X. In the case of the detrending of the period matrix, X = [1 | P], where P is the column of all the values of period.

The form of a general inner product that allows weighting is

$$ \langle x \mid y \rangle = \sum_i x_i w_i y_i = x'Wy $$

where $W = \mathrm{diag}(w_i)$. The projection matrix on the column space of X with respect to a general inner product is

$$ P_W = X(X'WX)^{-1}X'W $$

and the projection of M on the orthogonal complement is

$$ (I - P_W)M $$

Once the period and cohort matrices have been projected in this way, the next stage is to reduce the number of columns of the matrices to ensure that they are of full rank. The columns of the matrices are also pivoted during this process. The rank of the matrices and the pivoting vector to be used are obtained by using Mata’s hqrd() function. The columns required to ensure that the matrices are of full rank are selected using the select() function in Mata. The matrices are then centered around the relevant reference points by subtracting a row corresponding to the reference point from each of the rows of M. A column of 1s is then attached at the beginning of the MA matrix to ensure that the intercept is part of the age effects. The intercept term is contained within the age effects so that the age term carries the rate dimension. Then, according to the parameterization, the drift column is added to the front of either the period or cohort matrix. For full details of the matrix operations, see the appendix of Carstensen’s article (2007).

2.3 Weights

The weighting matrix can take on any form; however, three logical choices for the weights are wi = 1, wi = Di (where Di is the number of cases for an observation), and wi = Yi (where Yi is the population risk-time for an observation). Carstensen (2007) suggests using a weighting that is based upon the number of cases (D). Using equal weights (of 1) during the process of the detrending is a method that is attributed to Holford (1983). Using different values for the weights produces different estimates for the drift term.
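To make the weighted projection concrete, here is a minimal Mata sketch of the detrending step described above; it is not the apcfit internals, all matrix names are illustrative, and toy period values and equal (Holford-type) weights are used purely so that the fragment runs on its own.

mata:
    P  = (1980::1999)                 // toy column of period values
    M  = P, P:^2                      // toy basis columns to be detrended
    w  = J(rows(P), 1, 1)             // equal weights; could instead hold cases or risk-time
    X  = J(rows(P), 1, 1), P          // X = [1 | P]
    W  = diag(w)                      // weighting matrix
    Pw = X*invsym(X'*W*X)*X'*W        // weighted projection onto the column space of X
    Md = (I(rows(P)) - Pw)*M          // columns of M projected onto the orthogonal complement
end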

3 The apcfit command

3.1 Syntax

apcfit [if] [in], age(varname) cases(varname) poprisktime(varname)
    [period(varname) cohort(varname) agefitted(newvar) perfitted(newvar)
    cohfitted(newvar) refper(#) refcoh(#) drextr(weighted | holford)
    param(ACP | APC | AdCP | AdPC | AP | AC) level(#) dfa(#) dfp(#) dfc(#)
    nper(#) bknotsa(numlist) bknotsp(numlist) bknotsc(numlist)
    knotsa(numlist) knotsp(numlist) knotsc(numlist)
    knotplacement(equal | weighted) adjust replace]

3.2 Note

rcsgen must be installed to run apcfit. rcsgen can be installed from the SSC archive. rcsgen generates basis functions for restricted cubic splines.
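If rcsgen is not yet installed, it can typically be obtained from within Stata (an internet connection is assumed):

. ssc install rcsgen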

3.3 Options

age(varname) specifies the variable that refers to the age values. age() is required.

cases(varname) specifies the variable that refers to the number of cases or deaths for a given age and period. cases() is required.

poprisktime(varname) specifies the variable that refers to the population risk-time (person-years) for a given age and period. The population risk-time can be calculated using the poprisktime command. poprisktime() is required.


period(varname) specifies the variable that refers to the period values. This variable must be specified in all cases except when the age–cohort parameterization option is specified.

cohort(varname) specifies the variable that refers to the cohort values. If this variable is not given, the cohort values are calculated from the period and age variables according to the equality cohort = period − age.

agefitted(newvar) can specify the name of the fitted rate values given for age in the output. The default is agefitted(agefitted). This can be useful when the user wants to compare the results of more than one parameterization.

perfitted(newvar) can specify the name of the fitted relative risk (RR) values given for period in the output. The default is perfitted(perfitted). This can be useful when the user wants to compare the results of more than one parameterization.

cohfitted(newvar) can specify the name of the fitted RR values given for cohort in the output. The default is cohfitted(cohfitted). This can be useful when the user wants to compare the results of more than one parameterization.

refper(#) can specify the reference period for the model. The default is to take the reference period to be the median date of diagnosis among the cases.

refcoh(#) can specify the reference cohort for the model. The default is to take the reference cohort to be the median date of birth among the cases.

drextr(weighted | holford) specifies the method of drift extraction for the model.

drextr(weighted) lets the drift extraction depend on the weighted average (by number of cases) of the period and cohort effects. The default is drextr(weighted).

drextr(holford) uses a naive average over all the values of the estimated effects, disregarding the number of cases.

param(ACP | APC | AdCP | AdPC | AP | AC) specifies the parameterization of the APC model.

param(ACP) dictates that the age effects should be rates for the reference cohort, the cohort effects should be RR relative to the reference cohort, and the period effects should be RR constrained to be 0 on average (on the log scale) with 0 slope. The default is param(ACP).

param(APC) dictates that the age effects should be rates relative to the reference period, the period effects should be RR relative to the reference period, and the cohort effects should be RR constrained to be 0 on average (on the log scale) with 0 slope.

param(AdCP) dictates that the age effects should be rates for the reference cohort, and the cohort and period effects should be RR constrained to be 0 on average (on the log scale) with 0 slope. The drift term is missing from this model, and so the fitted values do not multiply to the fitted rates.


param(AdPC) dictates that the age effects should be rates for the reference period, and the cohort and period effects should be RR constrained to be 0 on average (on the log scale) with 0 slope. The drift term is missing from this model, and so the fitted values do not multiply to the fitted rates.

param(AP) dictates that the age effects should be rates for the reference period, and the period effects should be RR relative to the reference period. The cohort effects are not included in this model. Therefore, there is no identifiability issue.

param(AC) dictates that the age effects should be rates for the reference cohort, and the cohort effects should be RR relative to the reference cohort. The period effects are not included in this model. Therefore, there is no identifiability issue.

level(#) specifies the confidence level, as a percentage, for confidence intervals. The default is level(95).

dfa(#) specifies the degrees of freedom used for the natural (restricted) cubic spline relating to the age variable. The default is dfa(5) (unless knotsa() is specified). The (df−1) internal knots are placed at the centiles of the data, depending on the value specified.

dfp(#) specifies the degrees of freedom used for the natural (restricted) cubic spline relating to the period variable. The default is dfp(5) (unless knotsp() is specified). The (df−1) internal knots are placed at the centiles of the data, depending on the value specified.

dfc(#) specifies the degrees of freedom used for the natural (restricted) cubic spline relating to the cohort variable. The default is dfc(5) (unless knotsc() is specified). The (df−1) internal knots are placed at the centiles of the data, depending on the value specified.

nper(#) specifies the units to be used in reported rates. For example, if the analysis time is in years, specifying nper(1000) results in rates per 1000 person-years. The default is nper(1).

bknotsa(numlist) specifies the lower and upper boundary knots for the age variable; the default is the upper and lower values of the variable.

bknotsp(numlist) specifies the lower and upper boundary knots for the period variable; the default is the upper and lower values of the variable.

bknotsc(numlist) specifies the lower and upper boundary knots for the cohort variable; the default is the upper and lower values of the variable.

knotsa(numlist) specifies the knots for the age variable if the dfa() option is not used; the default is to use dfa(5) if neither option is specified.

knotsp(numlist) specifies the knots for the period variable if the dfp() option is not used; the default is to use dfp(5) if neither option is specified.

knotsc(numlist) specifies the knots for the cohort variable if the dfc() option is not used; the default is to use dfc(5) if neither option is specified.


knotplacement(equal | weighted) specifies the method of knot placement for the spline terms.

knotplacement(equal) means that the knots are placed at equally spaced centiles of the respective variables, depending on the number of knots that are used. This is the default.

knotplacement(weighted) means that the knots are placed at centiles of the variables that are dependent on the number of cases. For example, if there are more cases in the higher ages, the knots would be concentrated at the higher ages.

adjust specifies that the constrained variables be given relative to a reference point rather than averaging to zero on the log scale. This option cannot be applied to the age–period and age–cohort parameterizations. This option alters the variable that is constrained to be 0 on average (on the log scale) with 0 slope to still have 0 slope but to make the RRs relative to the reference point that is specified (or the median, if not specified). Adjusting the third variable to be relative to a reference point alters the interpretation of the age effects.

replace specifies that the default fitted-value variables for age, period, and cohort should be replaced by the new run of the command. This will work only if the default names are used for the original model and if all the variables are still in the dataset.

4 The poprisktime command

4.1 Syntax

poprisktime using filename, age(varname) period(varname) cohort(varname)
    cases(varname) agemin(#) agemax(#) permin(#) permax(#)
    [pop(string) poprisktime(newvar) covariates(varlist) missingreplace]

4.2 Datasets

The using dataset refers to the dataset that contains population data split into yearly intervals of age and period. The master dataset should contain information on the number of cases split by age, period, and cohort. The intervals for the age and period variables should, again, be of length 1. Examples of the form of the data that are required will be detailed in the next section.

4.3 Options

age(varname) specifies the age variable that must have the same name in both datasets. In the population dataset (the using dataset), age values one less and one greater than those specified by agemin() and agemax() are required to avoid missing values.


period(varname) specifies the period variable that must have the same name in both datasets. In the population dataset (the using dataset), period values one greater than that specified by permax() are required. Population data are also necessary for at least as low as the permin() value. Missing values will be generated for the population risk-time variable if the population data do not at least satisfy these requirements.

cohort(varname) specifies the cohort variable in the master dataset.

cases(varname) specifies the variable in the master dataset that contains the number of cases.

agemin(#) specifies the minimum age in the output dataset. In the population dataset (the using dataset), an age that is less than this is required.

agemax(#) specifies the maximum age in the output dataset. In the population dataset (the using dataset), an age that is greater than this is required.

permin(#) specifies the minimum period in the output dataset. In the population dataset (the using dataset), a period that is equal to or less than this is required.

permax(#) specifies the maximum period in the output dataset. In the population dataset (the using dataset), a period that is greater than this is required.

pop(string) specifies the name of the variable in the using dataset that refers to the population figures. If this option is not specified, it is assumed that the variable is called pop.

poprisktime(newvar) specifies the name of the new variable that is added to the file to specify the population risk-time. The default is to name the variable Y.

covariates(varlist) can specify any covariates, such as a sex variable, by which the two datasets are split. If this option is specified, the covariates are included in the merge statement. If the dataset is split by covariates, this option must be specified so that the variables by which the data are merged uniquely identify the observations in the datasets. Both datasets must be split by the same covariates.

missingreplace specifies whether the missing values for the cases variable should be replaced with a zero in the merging process. This should be an appropriate assumption, unless missing data were present in the data beforehand. If missingreplace is not specified, the cases() variable is likely to contain missing values. A warning is given to indicate the presence of missing values that most probably should be replaced with values of 0.

5 Example

Colon cancer data from Finland will be used to illustrate the use of the poprisktime and the apcfit commands. The data cover diagnoses between 1980 and 2003 for all regions of Finland. It was decided to restrict the age range of the dataset to people who were, at time of diagnosis, equal to or greater than 20 but less than 80 years of age. To highlight the possibility of including covariates in the analysis, the gender of patients was included when collapsing the dataset into unique records of age, period, and cohort. The data were collapsed into yearly intervals for age and period, leading to an upper and lower cohort value for each unique combination of age and period (according to their date of birth).

This collapse led to (80 − 20) × (2004 − 1980) different age–period categories, each of which was further subdivided by date of birth into two categories. This gave a total of 2,880 observations for (D, Y), one for each triangular subset. However, because the Finnish dataset contains sufficient information to include a sex term in the dataset, the dataset actually contains 5,760 (= 2,880 × 2) observations. To increase the dataset to include the sex term, the calculations for population risk-time were done for each gender.
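As a hedged illustration only, a collapse of this kind could be produced from patient-level records along the following lines; the variable names datediag, datebirth, and sex are assumptions and not part of the Finnish dataset as distributed.

. generate A = floor((datediag - datebirth)/365.25)   // approximate completed age at diagnosis
. generate P = year(datediag)                          // calendar year of diagnosis
. generate C = year(datebirth)                         // birth cohort
. keep if A >= 20 & A < 80 & P >= 1980 & P <= 2003
. generate byte D = 1
. collapse (sum) D, by(A P C sex)                      // one record per age-period-cohort-sex cell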

5.1 Creating the dataset

Getting the data into the appropriate form has been facilitated by the creation of the poprisktime command. The poprisktime command uses the merge command to ensure that the data are matched in terms of the unique values of age, period, and cohort. It therefore requires the definition of a using dataset and needs the master dataset to be in the appropriate form. The data that are required for the master dataset for poprisktime take the form:

. list A P D C if A==30 & P<1990, noobs

A P D C

  30   1980   1   1950
  30   1982   2   1952
  30   1983   1   1952
  30   1983   1   1953
  30   1984   1   1953

  30   1985   1   1954
  30   1985   2   1955
  30   1986   1   1956
  30   1987   1   1956
  30   1987   3   1957

  30   1988   1   1958
  30   1989   1   1959

It should be noted that these values refer to the left endpoints of the relevant classes of age (A), period (P), and cohort (C). This form would result in the two triangular subsets highlighted in figure 1 being listed as having the same age and period values (A = 34 and P = 2000) but different cohort values (C = 1965 and C = 1966). The penultimate column (D) refers to the number of cases.


The data that are required for the using dataset for the poprisktime command should take the form

. list A P pop if A==30 & P<1990, noobs

A P pop

  30   1980   84828
  30   1981   81245
  30   1982   83554
  30   1983   80213
  30   1984   80514

  30   1985   79804
  30   1986   80150
  30   1987   77427
  30   1988   73556
  30   1989   75669

The dataset (produced by poprisktime) that can be used by apcfit takes the form

. poprisktime using popdatanosex, age(A) period(P) cohort(C) cases(D) agemax(80)
>     agemin(20) permin(1980) permax(2004) missingreplace

. list A P C D Y if A>30 & A<31 & P<1990, noobs

A P C D Y

  30.333   1980.667   1950.334   1   40605.83
  30.333   1981.667   1951.334   0      41758
  30.333   1982.667   1952.334   2   40086.33
  30.333   1983.667   1953.334   1      40245
  30.333   1984.667   1954.334   0   39909.17

  30.333   1985.667   1955.334   2   40069.83
  30.333   1986.667   1956.334   1    38721.5
  30.333   1987.667   1957.334   3   36779.33
  30.333   1988.667   1958.334   1   37818.17
  30.333   1989.667   1959.334   1      37791

  30.667   1980.333   1949.666   0    42433.5
  30.667   1981.333   1950.666   0    40654.5
  30.667   1982.333   1951.666   0   41799.67
  30.667   1983.333   1952.666   1   40115.83
  30.667   1984.333   1953.666   1   40257.33

  30.667   1985.333   1954.666   1   39909.33
  30.667   1986.333   1955.666   0   40066.67
  30.667   1987.333   1956.666   1   38710.33
  30.667   1988.333   1957.666   0      36794
  30.667   1989.333   1958.666   0      37872

It should be noted that to create the data used during this example, all the above datasets were also split by a covariate for gender.


5.2 Basic model

Having set up the data into the correct form, as detailed in the previous section, the apcfit command can be applied. The simplest form of the apcfit command uses the defaults for the options and only requires the specification of the data, as shown below:

. apcfit, age(A) period(P) cases(D) poprisktime(Y) nper(100000)

Iteration 0:   log likelihood = -10229.164
Iteration 1:   log likelihood = -9772.3502
Iteration 2:   log likelihood = -9757.1651
Iteration 3:   log likelihood = -9757.0919
Iteration 4:   log likelihood = -9757.0919

Generalized linear models                         No. of obs      =      5760
Optimization     : ML                             Residual df     =      5745
                                                  Scale parameter =         1
Deviance         =  6592.292062                   (1/df) Deviance =  1.147483
Pearson          =  6554.902862                   (1/df) Pearson  =  1.140975

Variance function: V(u) = u                       [Poisson]
Link function    : g(u) = ln(u)                   [Log]

                                                  AIC             =  3.393087
Log likelihood   = -9757.091911                   BIC             =  -43151.9

------------------------------------------------------------------------------
             |                 OIM
           D |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 _spA1_intct |  -9.142635   .0348165  -262.60   0.000    -9.210874   -9.074396
       _spA2 |   1.702715   .0358606    47.48   0.000     1.632429       1.773
       _spA3 |  -.0312765   .0262975    -1.19   0.234    -.0828187    .0202658
       _spA4 |   .0775714   .0206082     3.76   0.000     .0371802    .1179627
       _spA5 |   .0135517   .0117284     1.16   0.248    -.0094356     .036539
       _spA6 |   .0332615   .0065958     5.04   0.000      .020334     .046189
       _spP1 |   .0201192    .007845     2.56   0.010     .0047432    .0354951
       _spP2 |   .0025498   .0067633     0.38   0.706     -.010706    .0158056
       _spP3 |   .0103832   .0071587     1.45   0.147    -.0036476     .024414
       _spP4 |   .0029901   .0075344     0.40   0.691    -.0117772    .0177573
 _spC1_ldrft |   .0107694   .0011545     9.33   0.000     .0085067    .0130321
       _spC2 |   .0099424   .0224079     0.44   0.657    -.0339763    .0538611
       _spC3 |  -.0080999   .0155304    -0.52   0.602    -.0385389     .022339
       _spC4 |  -.0415647   .0163449    -2.54   0.011       -.0736   -.0095293
       _spC5 |  -.0198339   .0153494    -1.29   0.196    -.0499182    .0102504
           Y | (exposure)
------------------------------------------------------------------------------

The apcfit command saves the adjusted spline basis as _spA* for the age variable, _spP* for the period variable, and _spC* for the cohort variable, which allows other models to be fit using the glm command (providing that the appropriate family, link, and offset are used). As a result (providing that the dataset was appropriately split for any given covariate), further models can be fit that can account for interactions.
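For instance, the model just fitted by apcfit can be reproduced directly with glm using the saved bases, exactly as is done again at the start of section 5.3 below:

. glm D _spA* _spP* _spC*, family(poisson) lnoffset(Y) nocons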

The following set of commands can be used to list the estimated fitted values and their confidence intervals by age. The apcfit command creates new variables in the dataset containing the fitted values for the first occurrence of each unique value. In the interest of saving space, the output displayed below is limited to ages less than twenty-five years. The rates given are per 100,000 person-years because the nper(100000) option is used.


. sort A

. list A agefitted agefitted_lci agefitted_uci if agefitted!=. & A<=25, noobs
>     abbreviate(13) divider separator(10)

A agefitted agefitted_lci agefitted_uci

  20.333   .7674873   .5526323   1.065875
  20.667   .7808697   .5675128   1.074438
  21.333   .8083171   .5981881    1.09226
  21.667   .8225069   .6140866   1.101665
  22.333   .8517627   .6468446   1.121598
  22.667   .8669665   .6638144   1.132291
  23.333   .8984786   .6987644   1.155273
  23.667   .9149407   .7168638   1.167748
  24.333   .9492388   .7541356   1.194817
  24.667   .9672476   .7734399   1.209619

Figure 2 displays the fitted incidence of the Finnish colon data for males and females combined. The default for apcfit is to make the reference points the median value (with respect to the number of cases) for the period and cohort variable, respectively. The median cohort for the Finnish colon data is 1926.33, and the median period is 1993.67. To correctly interpret results, it is vital that the values of the reference points are known. The apcfit command returns the values of the reference points as r-class values refcoh and refper. The default parameterization was used in the model because the param() option was not specified. This means that the age values are relative to the reference cohort, having been adjusted for the effect of period. Under the default parameterization, the period effect is constrained to have 0 slope and to be 0 on average (on the log scale). This is due to the fact that the period effects are detrended and the drift term is then added to the cohort effects. The cohort effect is relative to the reference cohort and is allowed a slope through the inclusion of the drift term in the cohort effect.
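Because other r-class commands overwrite these results, the reference points are best displayed immediately after the apcfit call; for example, as a minimal check:

. display "Reference cohort = " r(refcoh) "   Reference period = " r(refper)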


[Figure 2 about here: the left panel plots the estimated age effect as a rate per 100,000 person-years (log scale, roughly 1 to 200) against age 20–80; the right panel plots the estimated cohort and period effects as rate ratios against calendar time 1900–2000. Panel title: Estimated effects from the weighted APC model for the colon data.]

Figure 2. Graph for APC model for Finnish colon data. The leftmost solid line refers to the estimated age effect, the longest of the solid lines on the RR half of the graph refers to the estimated cohort effect, and the shortest line in the RR half of the graph refers to the estimated period effect. The respective regions surrounding the lines provide the 95% confidence intervals. The circle indicates the reference point.

The degrees of freedom were taken to be five for each of the spline bases for the three variables (age, period, and cohort). It is interesting to alter the degrees of freedom for any one of the variables, particularly the cohort variable, although this might lead to overfitting if the number is increased too much. The decision on the number of degrees of freedom can be aided through the use of Akaike's information criterion (AIC). AIC values can be obtained via the estat ic command. A lower AIC value suggests a better-fitting model.
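For instance, one might compare models with different numbers of cohort degrees of freedom along the following lines; this is a sketch only, and the replace option simply reuses the default fitted-value names created earlier.

. apcfit, age(A) period(P) cases(D) poprisktime(Y) dfc(5) nper(100000) replace
. estat ic
. apcfit, age(A) period(P) cases(D) poprisktime(Y) dfc(8) nper(100000) replace
. estat ic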

5.3 Including covariates

Including covariates in the analysis is a relatively simple process provided that the data are in the correct form. The only difficulty lies in appropriately splitting the dataset and the consequent calculation of the population risk-time. Population figures that are split by gender are usually available, making it feasible to calculate the population risk-time for males and females separately.

Refitting the same model as in section 5.2, our code is

. apcfit, age(A) period(P) cases(D) poprisktime(Y) agefitted(ageAll)
>     perfitted(perAll) cohfitted(cohAll) nper(100000)

. glm D _spA* _spP* _spC*, lnoffset(Y) family(poisson) nocons

(output omitted )


Using this structure, it is possible to add terms to the GLM to take into account the effect of gender. The simplest method for the inclusion of the sex term into the GLM is to assume a proportional effect for gender. The method for including this term is shown below. The covariate for gender is coded as 0 for female and 1 for male. The eform option is used to report the term for gender as an incidence rate ratio (IRR) (males relative to females).

. glm D _spA* _spP* _spC* sex, family(poisson) lnoffset(Y) nocons eform

Iteration 0:   log likelihood = -10183.733
Iteration 1:   log likelihood = -9719.7909
Iteration 2:   log likelihood = -9704.9493
Iteration 3:   log likelihood = -9704.8737
Iteration 4:   log likelihood = -9704.8737

Generalized linear models                         No. of obs      =      5760
Optimization     : ML                             Residual df     =      5744
                                                  Scale parameter =         1
Deviance         =  6487.855702                   (1/df) Deviance =  1.129501
Pearson          =  6473.673607                   (1/df) Pearson  =  1.127032

Variance function: V(u) = u                       [Poisson]
Link function    : g(u) = ln(u)                   [Log]

                                                  AIC             =  3.375303
Log likelihood   = -9704.873731                   BIC             = -43247.68

------------------------------------------------------------------------------
             |                 OIM
           D |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 _spA1_intct |   .0000998   3.55e-06  -259.28   0.000     .0000931     .000107
       _spA2 |    5.51232   .1977228    47.59   0.000     5.138099    5.913797
       _spA3 |   .9671249   .0254402    -1.27   0.204     .9185265    1.018295
       _spA4 |   1.079698   .0222534     3.72   0.000     1.036952    1.124207
       _spA5 |    1.01307   .0118833     1.11   0.268     .9900445     1.03663
       _spA6 |   1.033405   .0068161     4.98   0.000     1.020132    1.046851
       _spP1 |   1.020693   .0080073     2.61   0.009     1.005119    1.036508
       _spP2 |   1.002559   .0067806     0.38   0.706     .9893567    1.015937
       _spP3 |    1.01047   .0072336     1.45   0.146     .9963913    1.024748
       _spP4 |   1.003077   .0075577     0.41   0.683     .9883726    1.017999
 _spC1_ldrft |   1.010528   .0011671     9.07   0.000     1.008243    1.012818
       _spC2 |   1.007732   .0225837     0.34   0.731     .9644266    1.052982
       _spC3 |   .9915087   .0153987    -0.55   0.583     .9617826    1.022154
       _spC4 |   .9601781   .0156951    -2.49   0.013     .9299038     .991438
       _spC5 |   .9799076   .0150411    -1.32   0.186     .9508666    1.009835
         sex |   1.158693   .0166619    10.24   0.000     1.126492    1.191814
           Y | (exposure)
------------------------------------------------------------------------------

The output given above shows that, in Finland, males have about a 16% greater incidence of colon cancer than females across the entire dataset when adjusting for the other effects. The p-value for the sex term highlights that the effect for gender is significant at the 5%, and even the 0.1%, level. This measure of significance, however, assumes that the effect of gender is proportional over both time scales and date of birth (cohort).


Full interaction using the adjusted spline variables

The other extreme is to fit a full interaction for gender with all the spline terms in the model. In theory, this approach should give results equivalent to fitting two separate models for each gender (which can be achieved by using an if qualifier as part of the apcfit command). However, the results given in this case are not exactly equivalent if the default drift extraction is used because the default drift extraction uses a weighting that is based upon the number of cases. Therefore, the weighting takes into account the cases for both males and females when fitting the model with all patients included but only takes into account, for example, the number of cases for males in the model fitted exclusively for males. This means that when the full interaction is fitted, the fitted values are slightly different from the values obtained by fitting two separate models. This difference would be magnified if the number of cases for males and females were markedly different, for example, if the dataset related to cases of breast cancer. However, in most cases, the difference will be negligible. If the drift extraction suggested by Holford is used (which gives equal weight to all the observations), then the model with the full interaction term for sex is entirely equivalent to the two separate models for each of the genders.

Using reduced splines to model the interaction

Fitting the full gender interaction with all the components of the model may result in overfitting. Technically, it involves using the same number of knots for the original components and for the difference between the genders when fewer knots would suffice. However, it is possible to perform an analysis that uses fewer knots for the differences between the genders whilst still maintaining the greater number of knots for the baseline shape. Spline bases with a reduced number of degrees of freedom can be created to model the effect of the interaction with gender.

It is possible that some of the components of the model, such as the period effect, may not vary significantly with gender. The likelihood-ratio test can be used to compare nested models; a sketch of such a test follows the code below. It should be noted that the identifiability problem is reintroduced when modeling the effect of gender using interaction terms. Therefore, it is only possible to fit an interaction with at most two of the age, period, and cohort components (unless the full adjusted spline terms are used, as described above).

Fitting the effect of gender using just a single model rather than two separate models allows the calculation of the time-dependent IRR for the genders. This ratio can be plotted against the relevant time scale to show the differences between the genders in terms of the incidence rate. This approach can also be implemented with a reduced set of spline bases to model the difference between the genders.

The code given below generates spline terms to model the interaction with sex for the age and cohort effects. The spline terms have a reduced number of degrees of freedom compared with those used to calculate the original fitted values using apcfit. It was decided to illustrate the method using reduced splines with degrees of freedom equal to three (compared with eight degrees of freedom for each of the components of the original APC model). It is important to center the generated reduced set of splines for the cohort effect around the reference cohort. Centering can be achieved using the fact that the original cohort splines will take a value of zero when the line of the spline design matrix for the cohort effects corresponds to the reference cohort. This approach assumes that the reference cohort selected is a part of the original dataset of cohort values. If that is not the case, further work is required to center the reduced set of cohort splines.

. apcfit, age(A) period(P) cases(D) poprisktime(Y) dfa(8) dfp(8) dfc(8)
>     nper(100000)

. *** Generate the reduced set of splines for age ***
. rcsgen A, gen(newA) df(3) orthog
Variables newA_1 to newA_3 were created

. local dfreda=wordcount(r(knots))-1

. *** Obtain the relevant interaction terms ***
. forvalues i = 1/`dfreda' {
.     generate sexnewA`i'=sex*newA_`i'
. }

. *** Generate the reduced set of splines for cohort ***
. rcsgen C, gen(newC) df(3) orthog

. local dfredc=wordcount(r(knots))-1

. *** Obtain the relevant interaction terms ***
. forvalues i = 1/`dfredc' {
.     quietly generate sexnewC`i'=sex*newC_`i'
. }

. *** Center the reduced cohort splines on the reference cohort ***
. forvalues i = 1/`dfredc' {
.     quietly summarize sexnewC`i' if _spC1==0 & sex==1
.     quietly generate sexnewCref`i'=r(mean)
.     quietly replace sexnewC`i'=sexnewC`i'-sexnewCref`i' if sex==1
. }

. drop sexnewCref*

. *** Fit the reduced splines as part of the GLM ***
. glm D _sp* sex sexnewA* sexnewC*, lnoffset(Y) f(p) nocons
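As flagged above, a likelihood-ratio test can then be used to ask, for example, whether the cohort-by-sex terms add anything once the age-by-sex terms are included. The following is a sketch only; the stored-estimate names are illustrative, and the final line restores the full model so that the predictions below still refer to it.

. estimates store full
. quietly glm D _sp* sex sexnewA*, lnoffset(Y) f(p) nocons
. estimates store agesexonly
. lrtest agesexonly full
. estimates restore full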

The required estimates for each of the genders can be obtained from this model by using the partpred command, which allows for the calculation of partial predictions from the last fitted model. The partpred package is available to download from the SSC archive. The partpred command can also be used to calculate the relevant time-dependent IRRs for the difference between the genders in terms of the age and cohort effects. The required code for the prediction of the age effects, and their confidence intervals, for both genders is

. partpred agefem if sex==0, for(_spA*) eq(D) eform ci(agefemlci agefemuci)

. partpred agemal if sex==1, for(_spA* sex sexnewA*) eq(D) eform
>     ci(agemallci agemaluci)
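The corresponding time-dependent IRR over age (males versus females, as plotted in figure 3) can be obtained in the same hedged fashion by predicting only the sex and sex-by-age interaction part of the model; the new variable names here are illustrative.

. partpred irrage if sex==1, for(sex sexnewA*) eq(D) eform ci(irragelci irrageuci)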


[Figure 3 about here: the estimated time-dependent IRR (males compared with females) for the age component, plotted against age 20–80 on a vertical scale running roughly from 0.5 to 2.]

Figure 3. Graph of time-dependent IRR for the age-by-sex interaction. The solid line is the IRR (males compared with females) for the age-by-sex interaction using a reduced set of splines with three degrees of freedom (from the model using reduced splines for age and cohort, having fitted a model with eight degrees of freedom for each of the components). The shaded region around this line gives the relevant 95% confidence interval. The dashed line gives the IRR for sex in terms of the age component from the model with the full interaction for sex (from the previous section) using 8 degrees of freedom for each of the components. The dotted lines form the appropriate 95% confidence interval.

The results of the analysis using a reduced set of spline bases can be compared with the analysis that uses a full interaction for each of the components. The major differences that will be observed between these two analyses relate to the two fundamental differences in assumptions between the models. First, the model that fits a full interaction with the sex term also allows for a difference in terms of the period effect. The model fitted with the reduced sets of splines for age and cohort assumes that the period effect is the same for both genders. Second, the model fitted with the reduced sets of splines for age and cohort assumes that the effect of gender can be modeled using fewer knots than the model that fits the full interaction with gender. Figure 3 shows a comparison of the two analyses in terms of the effect of gender on the IRR for the age component. Using an increased number of degrees of freedom could lead to an overfitting of the IRR for any of the components, which is apparent in figure 3. It can be seen that using the results of the reduced sets of splines gives a more believable and interpretable effect of gender over the age time scale. In fact, it could be argued that it may only be necessary to model the interaction using a linear function, considering how straight the line is for the reduced splines model. However, there is a danger that a reduction in the degrees of freedom may make the spline function too inflexible to follow a more complex pattern.


Continuous covariates

The example given above is simplistic in that it includes a single binary variable as the only covariate. Carrying out more complex scenarios is also feasible using the output given by apcfit. Providing that the appropriately split population data are available for a continuous covariate, it is possible to carry out more complex interaction models. It is also possible to use splines to describe the effect of the continuous covariate and create spline-by-spline interactions for any of the components of the APC model. Even though presentation of the results for these models can become difficult, they do have clear advantages in terms of flexibility.
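As a hedged illustration only, a spline-by-spline interaction between a hypothetical continuous covariate (here called depriv) and the age effect could be built along the following lines, assuming the population risk-time has been computed separately for each level of the covariate; every generated name is illustrative.

. rcsgen depriv, gen(dpr) df(2) orthog            // reduced splines for the covariate
. rcsgen A, gen(agered) df(3) orthog               // reduced splines for age
. local k = 0
. foreach d of varlist dpr* {
.     foreach a of varlist agered* {
.         local ++k
.         quietly generate double intDA`k' = `d'*`a'
.     }
. }
. glm D _sp* dpr* agered* intDA*, lnoffset(Y) family(poisson) nocons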

5.4 Future incidence

Having obtained the models for incidence via the apcfit command, it is possible to extrapolate to make predictions about future incidence. The fact that restricted cubic (natural) splines are linear beyond the boundary knot can be used to make linear extensions to the predicted values. Using a linear regression analysis of the final few estimates of an APC analysis to predict future trends has been common practice for some time (Osmond 1985; Bray and Møller 2006). Utilizing the fact that restricted cubic splines are linear beyond the boundary knot to make predictions is conceptually similar to using linear regression. However, for accurate predictions to be given using this method, the boundary knots must be placed within the range of the data. It is also possible to simply project the drift term using the apcfit models to make predictions for future incidence (Møller et al. 2003, 2007). Further work is planned for investigating the effectiveness of these methods for giving predictions of incidence.

6 Conclusion

Carstensen suggested that APC models should be fit using continuous variables for the age, period, and cohort effects (Carstensen 2007). This procedure can now be accomplished in Stata by using the apcfit command. The command provides the possibility of producing a clear and attractive graphical display of the data. It also provides the foundations required to predict the future rates of incidence and mortality.

7 Acknowledgments

We would like to acknowledge that this method was suggested and implemented in R by Bendix Carstensen. We would also like to thank Bendix for his useful comments during the drafting of this article. This work was carried out as part of a PhD thesis that is funded by the Medical Research Council. Finally, we would like to thank the Finnish Cancer Registry for access to the data.


8 References

Bray, F., and B. Møller. 2006. Predicting the future burden of cancer. Nature Reviews Cancer 6: 63–74.

Carstensen, B. 2007. Age–period–cohort models for the Lexis diagram. Statistics in Medicine 26: 3018–3045.

Durrleman, S., and R. Simon. 1989. Flexible regression models with cubic splines. Statistics in Medicine 8: 551–561.

Holford, T. R. 1983. The estimation of age, period and cohort effects for vital rates. Biometrics 39: 311–324.

Land, K. C. 2008. Disentangling Age-Period-Cohort Effects: New Models, Methods, and Empirical Applications. http://help.pop.psu.edu/help-by-statistical-method/apc/Land-Presentation.ppt/view.

Møller, B., H. Fekjær, T. Hakulinen, H. Sigvaldason, H. H. Storm, M. Talback, and T. Haldorsen. 2003. Prediction of cancer incidence in the Nordic countries: Empirical comparison of different approaches. Statistics in Medicine 22: 2751–2766.

Møller, H., L. Fairley, V. Coupland, C. Okello, M. Green, D. Forman, B. Møller, and F. Bray. 2007. The future burden of cancer in England: Incidence and numbers of new patients in 2020. British Journal of Cancer 96: 1484–1488.

Osmond, C. 1985. Using age, period and cohort models to estimate future mortality rates. International Journal of Epidemiology 14: 124–129.

Robertson, C., S. Gandini, and P. Boyle. 1999. Age-period-cohort models: A comparative study of available methodologies. Journal of Clinical Epidemiology 52: 569–583.

Statistics Finland. http://www.stat.fi/index en.html.

Sverdrup, E. 1967. Statistiske metoder ved dødelikhetsundersøkelser. Statistical Memoirs (in Norwegian). Institute of Mathematics, University of Oslo.

Yang, Y., W. J. Fu, and K. C. Land. 2004. A methodological comparison of age-period-cohort models: The intrinsic estimator and conventional generalized linear models. Sociological Methodology 34: 75–110.

About the authors

Mark Rutherford is a PhD student at the University of Leicester, UK. He is currently working on a PhD thesis relating to the prediction of future cancer burden on society. The PhD is funded by a Medical Research Council grant.

Paul Lambert is a reader in medical statistics at the University of Leicester, UK. His main interest is in the development and application of methods in population based cancer research.

John Thompson is a professor in the department of Health Sciences at the University of Leicester with a research interest in Genetic Epidemiology. He teaches on the department's MSc in Medical Statistics and is a longtime Stata user and enthusiast.

The Stata Journal (2010)10, Number 4, pp. 628–649

A simple feasible procedure to fit models with high-dimensional fixed effects

Paulo Guimaraes
University of South Carolina
Columbia, South Carolina
and Universidade do Porto
Porto, [email protected]

Pedro Portugal
Banco de Portugal and
Universidade Nova de Lisboa
Lisbon, Portugal
[email protected]

Abstract. In this article, we describe an iterative approach for the estimation of linear regression models with high-dimensional fixed effects. This approach is computationally intensive but imposes minimum memory requirements. We also show that the approach can be extended to nonlinear models and to more than two high-dimensional fixed effects.

Keywords: st0212, fixed effects, panel data

1 Introduction

The increasing availability of large microlevel datasets has spurred interest in methods for estimation of models with high-dimensional fixed effects. Researchers in several fields such as economics, sociology, and political science, among others, find the introduction of fixed effects to be a particularly appealing way of controlling for unobserved heterogeneity that is shared among groups of observations. In this case, it becomes possible to account for all intergroup variability by adding to the set of regressors some dummy variables that absorb group-specific heterogeneity. This approach has the advantage of allowing for the existence of general patterns of correlation between the unobserved effects and the other regressors.

In practice, when fitting a model with a single fixed effect (that is, a factor in the analysis of covariance), one is not required to actually add the group dummy variables to the set of regressors. This is particularly convenient when dealing with high-dimensional fixed effects, that is, in a situation where the number of groups (dummy variables) is very large. For several common procedures such as linear regression, Poisson regression, and logit regression, the fixed effect can be eliminated from the model, thereby making it possible to obtain estimates for the coefficients of the relevant regressors without having to introduce the group dummy variables in the model. For other nonlinear models, it is still possible to avoid the explicit introduction of dummy variables to account for the fixed effect by modifying the iterative algorithm used to solve the maximum likelihood problem (see Greene [2004]).

However, there is no simple solution in situations that have more than one high-dimensional fixed effect. A notable example would be the large employer–employee panel datasets commonly used in the labor economics literature. When studying relations in the labor market, researchers often want to simultaneously account for two sources of unobserved heterogeneity: the firm and the worker. Explicit introduction of dummy variables is not an option because the number of units (groups) for either firms or workers is too large. Other well-known examples of large datasets with obvious sources of unobserved heterogeneity include these two types of panel datasets: patient claims data, in which the potential sources of heterogeneity are the patient, the doctor, and the hospital; and student performance, in which potential sources of heterogeneity are the student, the teacher, and the school.

Abowd, Kramarz, and Margolis (1999) tackled the problem of accounting for two high-dimensional fixed effects in the linear regression problem. In a widely cited article, the authors proposed several methods that provide approximations to the least-square solution.1 Later, in an unpublished article, Abowd, Creecy, and Kramarz (2002) presented an iterative algorithm that leads to the exact least-square solution of a linear regression model with two fixed effects. The user-written command a2reg is a Stata implementation of this algorithm by Amine Ouazad. In a recent article published in the Stata Journal, Cornelissen (2008) presented a new user-written command, felsdvreg, which consists of a memory-saving procedure for estimation of a linear regression model with two high-dimensional fixed effects.

At this point, we should make clear that the methods discussed above (as well as those discussed in this article) may not lead to consistent estimation. Unlike the case with most panel-data estimators, consistency is achieved if we are willing to admit that the dimension of the groups is unrelated to sample size. This means that, from an asymptotic point of view, the number of parameters of the model remains unchanged as the sample size tends to infinity. This assumption gets around the incidental parameter problem but more correctly places these estimators as extensions of Stata's areg command, that is, as alternative ways to fit models with large sets of dummy variables.

In our own work (Carneiro, Guimaraes, and Portugal 2010), we were faced with the problem of fitting a linear regression model with two high-dimensional fixed effects (firm and worker) using a linked employer–employee dataset with over 30 million observations. Implementation of the user-written commands discussed above in a computer with 8 gigabytes of random-access memory (RAM) and running Stata/MP for Windows failed because of memory limitations. However, using an iterative procedure that was simple to implement, we were able to fit the linear regression model with two and even three high-dimensional fixed effects. The approach is computationally intensive, but it has the advantage of imposing minimal memory requirements. In this article, we present a detailed discussion of the method proposed in Carneiro, Guimaraes, and Portugal (2010) and show how it can be extended to nonlinear models and to applications with more than two high-dimensional fixed effects.

1. For a discussion on the implementation of these methods in Stata, see Andrews, Schank, and Upward (2006).


2 The linear regression model

2.1 One fixed effect

To start with, consider the conventional linear regression model setup

yi = β1x1i + β2x2i + · · · + βkxki + εi

or, more compactly,

    Y = Xβ + ε

Application of the least-squares method results in the set of normal equations given below:

    ∂SS/∂β1 = ∑i x1i(yi − β1x1i − β2x2i − · · · − βkxki) = 0
    ∂SS/∂β2 = ∑i x2i(yi − β1x1i − β2x2i − · · · − βkxki) = 0
    · · ·
    ∂SS/∂βk = ∑i xki(yi − β1x1i − β2x2i − · · · − βkxki) = 0        (1)

These equations have a closed-form solution, the least-squares estimator, given by the well-known formula

    β = (X′X)−1 X′Y

However, the above formula is one of several alternative ways to find the solution to (1). For instance, we can solve for β using a partitioned iterative algorithm. An example of such an algorithm is shown below:

• Initialize β1^(0), β2^(0), . . . , βk^(0).

• Solve for β1^(1) as the solution to ∂SS/∂β1 = ∑i x1i(yi − β1x1i − β2^(0)x2i − · · · − βk^(0)xki) = 0.

• Solve for β2^(1) as the solution to ∂SS/∂β2 = ∑i x2i(yi − β1^(1)x1i − β2x2i − · · · − βk^(0)xki) = 0.

• . . .

• Repeat until convergence is achieved.

Algorithms such as this one are discussed in Smyth (1996). This algorithm is known as the "zigzag" or full Gauss–Seidel algorithm. According to Smyth, this algorithm produces a stable but slow iteration depending on the correlation between the parameter estimators. In this particular case, use of an iterative algorithm to solve the normal equations is highly inefficient. However, this implementation has the advantage of not requiring the explicit calculation of the inverse of the X′X matrix.
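To make the zigzag updates concrete, the following minimal sketch (not part of the original article) applies them to a two-regressor model without a constant; the variable names y, x1, and x2 are hypothetical and assumed to be in memory, and the convergence check mirrors the RSS-based criterion used in the examples below.

* Minimal sketch of the zigzag (full Gauss-Seidel) iteration for
* y = b1*x1 + b2*x2 + e (hypothetical variables y, x1, x2)
scalar b1 = 0
scalar b2 = 0
local rss1 = 0
local dif  = 1
while abs(`dif') > epsdouble() {
    * update b1 holding b2 fixed: solve sum_i x1i*(yi - b1*x1i - b2*x2i) = 0
    capture drop r1
    quietly generate double r1 = y - scalar(b2)*x2
    quietly regress r1 x1, noconstant
    scalar b1 = _b[x1]
    * update b2 holding the new b1 fixed
    capture drop r2
    quietly generate double r2 = y - scalar(b1)*x1
    quietly regress r2 x2, noconstant
    scalar b2 = _b[x2]
    * stop when the residual sum of squares no longer changes
    local rss2 = `rss1'
    local rss1 = e(rss)
    local dif  = `rss2' - `rss1'
}
display "b1 = " scalar(b1) "   b2 = " scalar(b2)

At convergence the two scalars should reproduce the coefficients from a joint regression of y on x1 and x2 without a constant.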


Consider now what happens if we include a set of dummy variables to account for a fixed effect in the regression. In that case,

    Y = Zβ + Dα + ε

where Z is the matrix of explanatory variables with N × k dimension and D is the N × G1 matrix of dummy variables. Now we can write the normal equations as

    [ Z′Z  Z′D ] [ β ]     [ Z′Y ]
    [ D′Z  D′D ] [ α ]  =  [ D′Y ]

which can be arranged to show

    Z′Zβ + Z′Dα = Z′Y
    D′Zβ + D′Dα = D′Y

Solving each set of equations independently yields

    β = (Z′Z)−1 Z′(Y − Dα)
    α = (D′D)−1 D′(Y − Zβ)

The above partition of the normal equations suggests a convenient iteration strategy. To obtain the exact least-squares solution, one can simply alternate between estimation of β and estimation of α. It is important to mention that we no longer have to worry about the dimensionality of D. The expression (D′D)−1 D′ used in the estimation of α translates into a simple group average of the residuals of the regression of Y on Z. On the other hand, the expression Dα that shows up in the equation for β is a column vector containing all the elements of α. Estimation of β consists of a simple linear regression of a transformed Y on Z. In our implementation, instead of transforming Y, we will keep Y as the dependent variable and add Dα as an additional covariate. When the estimation procedure converges, the coefficient on Dα must equal one and the vector Dα will contain all the estimated coefficients for the group dummy variables. With this approach, we avoided inversion of a potentially large matrix that would be required if we had simply added D to the set of regressors. As an illustration of this approach, we use auto.dta and replicate the coefficient estimates obtained with areg as shown in example 1 of [R] areg.

. sysuse auto
(1978 Automobile Data)

. keep if rep78 < .
(5 observations deleted)

. generate double fe1=0

. local rss1=0

. local dif=1

. local i=0


. while abs(`dif')>epsdouble() {
  2. quietly {
  3. regress mpg weight gear_ratio fe1
  4. local rss2=`rss1'
  5. local rss1=e(rss)
  6. local dif=`rss2'-`rss1'
  7. capture drop yhat
  8. predict double yhat
  9. generate double temp1=mpg-yhat+_b[fe1]*fe1
 10. capture drop fe1
 11. egen double fe1=mean(temp1), by(rep78)
 12. capture drop temp1
 13. local i=`i'+1
 14. }
 15. }

. display "Total Number of Iterations --> " `i'
Total Number of Iterations --> 13

. quietly regress mpg weight gear_ratio fe1

. estimates table, b(%10.7f)

Variable active

weight        -0.0051031
gear_ratio     0.9014780
fe1            1.0000000
_cons         34.0588892

As we implied earlier, the estimated coefficients are identical to those obtained using the areg command (see [R] areg). The final regression includes an additional variable, fe1, with a coefficient of one. This variable was created during estimation and contains the estimates of the fixed effect.
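The equivalence can be checked directly with areg, which absorbs the rep78 indicators (a quick verification, not shown in the original log):

* Same model with the rep78 fixed effect absorbed by areg
areg mpg weight gear_ratio, absorb(rep78)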

2.2 More than one fixed effect

Suppose now that instead of a single high-dimensional fixed effect, we have two high-dimensional fixed effects. That is, we now intend to fit the following model:

Y = Zβ + D1α + D2γ + ε

where D1 is N × G1, D2 is N × G2, and both G1 and G2 have high dimensionality. As discussed earlier, in this particular case, estimation of the linear regression model is complicated. However, implementation of the partitioned algorithm discussed above is straightforward. Proceeding as we did before, we can solve the normal equations as

    β = (Z′Z)−1 Z′(Y − D1α − D2γ)
    α = (D′1D1)−1 D′1(Y − Zβ − D2γ)
    γ = (D′2D2)−1 D′2(Y − Zβ − D1α)        (2)

Iterating between these sets of equations provides us with the exact least-squares solution. All we have to do is compute several linear regressions with k explanatory variables and compute group means of residuals. If we add more fixed effects, the logic remains unchanged. To illustrate the approach, we modify the above algorithm and apply it to the ancillary dataset that accompanies the felsdvreg command developed by Thomas Cornelissen. As before, we introduce D1α and D2γ as additional regressors instead of subtracting them from the dependent variable.

. use felsdvsimul, clear

. generate double temp=0

. generate double fe1=0

. generate double fe2=0

. local rss1=0

. local dif=1

. local i=0

. while abs(`dif')>epsdouble() {
  2. quietly {
  3. regress y x1 x2 fe1 fe2
  4. local rss2=`rss1'
  5. local rss1=e(rss)
  6. local dif=`rss2'-`rss1'
  7. capture drop res
  8. predict double res, res
  9. replace temp=res+_b[fe1]*fe1, nopromote
 10. capture drop fe1
 11. egen double fe1=mean(temp), by(i)
 12. replace temp=res+_b[fe2]*fe2, nopromote
 13. capture drop fe2
 14. egen double fe2=mean(temp), by(j)
 15. local i=`i'+1
 16. }
 17. }

. display "Total Number of Iterations --> " `i'
Total Number of Iterations --> 695

. quietly regress y x1 x2 fe1 fe2

. estimates table, b(%10.7f)

Variable active

x1      1.0292584
x2     -0.7094820
fe1     1.0000000
fe2     1.0000000
_cons  -1.9951791

As we hinted above, the estimates for the model coefficients are identical to the least-squares results with all dummy variables included, as reported in Cornelissen (2008). The algorithm took 695 iterations to converge, which is one of the drawbacks of this approach. Fortunately, as discussed below, there is substantial room for improvement. One obvious simplification is to sweep out one of the fixed effects by subtracting the group mean from all variables. By doing this, we avoid dealing with one of the fixed effects. This means that with minor modifications, the code shown above can be used to fit a model with three high-dimensional fixed effects. We illustrate the estimation of a model with three high-dimensional fixed effects by assuming that our single variable of interest is x1 and that x2 is a variable indicating an additional fixed effect. The modified code is shown below:

. use felsdvsimul, clear

. generate double temp=0

. generate double fe1=0

. generate double fe2=0

. generate double fe1t=0

. generate double fe2t=0

. egen double mean=mean(x1), by(x2)

. generate double x1t=x1-mean

. capture drop mean

. egen double mean=mean(y), by(x2)

. generate double yt=y-mean

. local rss1=0

. local dif=1

. local i=0

. while abs(`dif')>epsdouble() {
  2. quietly {
  3. capture drop mean
  4. egen double mean=mean(fe1), by(x2)
  5. replace fe1t=fe1-mean, nopromote
  6. capture drop mean
  7. egen double mean=mean(fe2), by(x2)
  8. replace fe2t=fe2-mean, nopromote
  9. regress yt x1t fe1t fe2t
 10. local rss2=`rss1'
 11. local rss1=e(rss)
 12. local dif=`rss2'-`rss1'
 13. replace temp=yt-_b[x1t]*x1t+_b[fe1t]*(fe1-fe1t)-_b[fe2t]*fe2t, nopromote
 14. capture drop fe1
 15. egen double fe1=mean(temp), by(i)
 16. replace temp=yt-_b[x1t]*x1t-_b[fe1t]*fe1t+_b[fe2t]*(fe2-fe2t), nopromote
 17. capture drop fe2
 18. egen double fe2=mean(temp), by(j)
 19. local i=`i'+1
 20. }
 21. }

. display "Total Number of Iterations --> " `i'
Total Number of Iterations --> 477

. quietly areg y x1 fe1 fe2, absorb(x2)

. estimates table, b(%10.7f)

Variable active

x1      0.9634370
fe1     1.0000000
fe2     1.0000000
_cons  -2.2295298

As we can readily see, the estimated coefficient for x1 is the same as that obtained by the simple regression on x1 and three sets of dummy variables.

. quietly xi: regress y x1 i.x2 i.i i.j

. estimates table, keep(x1) b(%10.7f)

Variable active

x1 0.9634370

In Carneiro, Guimaraes, and Portugal (2010), we estimated a conventional Mincerian wage equation with three high-dimensional fixed effects. Our data source is the Quadros do Pessoal, a mandatory employment survey collected yearly by the Portuguese Ministry of Labor and Social Security. The dataset comprised more than 30 million observations spanning from 1986 to 2007. In our estimation, we wanted to account for firm, worker, and job heterogeneity and year fixed effects. With around 6.4 million workers, 624,171 firms, and 115,822 jobs, employing dummy variables to account for the fixed effects was not an option. In the following example, though, we show the output result of one specification with 28 covariates and the three high-dimensional fixed effects estimated using the approach outlined above.

********** Linear Regression with 3 High-Dimensional Fixed Effects **********

Number of obs        =  30906573
F(6703079, 24203493) =     30.81
Prob > F             =    0.0000
R-squared            =    0.8951
Adj R-squared        =    0.8660
Root MSE             =    0.2159

ln_real_hw      Coef.    Std. Err.        t    P>|t|     [95% Conf. Interval]

age          .0506563     .0038823    13.05    0.000      .043047     .0582655
agesq       -.0002992     6.36e-07  -470.60    0.000    -.0003005     -.000298
hab_1       -.0228571     .0011448   -19.97    0.000    -.0251008    -.0206134
hab_2       -.0202456     .0010558   -19.18    0.000    -.0223149    -.0181763
hab_3       -.0173233     .0010535   -16.44    0.000    -.0193881    -.0152585
hab_4       -.0068901     .0010517    -6.55    0.000    -.0089515    -.0048288
hab_5        .0047679       .00105     4.54    0.000     .0027099      .006826
hab_67       .0658466     .0011658    56.48    0.000     .0635618     .0681314
hab_8910     .1175194     .0011608   101.24    0.000     .1152443     .1197945
y87          .0139098     .0038957     3.57    0.000     .0062743     .0215453
y88          .0036002     .0077708     0.46    0.643    -.0116303     .0188307
y89           .108814     .0116502     9.34    0.000     .0859801      .131648
y91          .0802597     .0194122     4.13    0.000     .0422125     .1183068
y92          .0748048     .0232936     3.21    0.001     .0291501     .1204595
y93           .042098     .0271755     1.55    0.121    -.0111649     .0953609
y94          .0060495     .0310574     0.19    0.846    -.0548219     .0669209
y95         -.0086197      .034939    -0.25    0.805     -.077099     .0598596
y96         -.0186946     .0388209    -0.48    0.630    -.0947821      .057393
y97          .0024842     .0427027     0.06    0.954    -.0812116       .08618
y98          .0294311     .0465845     0.63    0.528    -.0618729     .1207351
y99           .043531     .0504663     0.86    0.388    -.0553812     .1424433
y03         -.0054844     .0659942    -0.08    0.934    -.1348307      .123862
y00          .0412941     .0543482     0.76    0.447    -.0652264     .1478146
y02          .0171165     .0621123     0.28    0.783    -.1046213     .1388544
y04          .0296664     .0698753     0.42    0.671    -.1072867     .1666195
y05          .0194295     .0737572     0.26    0.792    -.1251318     .1639909
y06         -.0153513     .0776394    -0.20    0.843    -.1675216     .1368191
y07         -.0225893     .0815211    -0.28    0.782    -.1823678     .1371893

2.3 Estimation of the standard errors

To provide the standard errors associated with the estimator of β, we would need to estimate

    V(β) = σ2 (X′X)−1

which again raises the problem of inverting the X′X matrix. An alternative solution to estimate the elements of V(β) is to use the known relation

    V(βj) = σ2 / {N s2j (1 − R2j.123...)}

where s2j is the sample variance associated with the xj variable and R2j.123... is the coefficient of determination obtained from a regression of xj on all other remaining explanatory variables. Estimation of σ2 is easy because the final regression that provides the estimates of β has the correct sum of squared residuals (SSR). As we will see, the remaining difficulty is the computation of the number of degrees of freedom associated with SSR. With multiple fixed effects, it may be difficult to compute the actual dimension of X because some of the coefficients for the fixed effects may not be identifiable. If we were simply estimating the model by adding dummy variables to account for the fixed effects, Stata would numerically identify collinearities and drop variables as needed to identify the coefficients in the model. However, to implement our solution, we need to know beforehand the number of coefficients of the dummy variables that can be identified.2 Computation of R2j.123... is not a problem because we know how to estimate a model with high-dimensional effects. However, this approach may be time consuming because it would require estimation of a regression with high-dimensional fixed effects for each of the regressors.
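The relation above is the usual variance-inflation identity and can be verified on any small dataset. The sketch below (an illustration added here, with arbitrary variable choices on auto.dta) reproduces the squared standard error reported by regress from σ2, the sample variance of the regressor, and the auxiliary R2:

* Check V(b_j) = sigma2 / {N * s2_j * (1 - R2_j.123...)} for one regressor
sysuse auto, clear
quietly regress price mpg weight
scalar v_direct = _se[mpg]^2           // squared standard error from regress
scalar sigma2   = e(rss)/e(df_r)       // estimated error variance
quietly regress mpg weight             // auxiliary regression for R2_j.123...
scalar r2j = e(r2)
quietly summarize mpg
scalar s2j = r(Var)*(r(N)-1)/r(N)      // sample variance with divisor N
display "direct:  " v_direct
display "formula: " sigma2/(r(N)*s2j*(1 - r2j))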

Fortunately, there is an alternative strategy that will produce results faster. The idea is a standard application of the well-known Frisch–Waugh–Lovell theorem; simply put, it consists of fitting the model in two steps. In the first step, we expurgate the fixed effects from all variables in the model. This involves running a linear regression of each individual variable on only the high-dimensional effects and storing the residuals. In the second step, we run the regression of interest using the stored residuals of the variables obtained in the first step instead of the original variables. Because we are not dealing with the high-dimensional fixed effects, the regressions in the second step are easy to implement and will have the correct standard errors provided that we adjust the degrees of freedom. One reason why this approach works well is that the calculations in the first step are relatively simple. We can see from (2) that in this case, the algorithm involves only the computation of means. In the next example, we again use the ancillary dataset that accompanies the felsdvreg user-written command and show how to obtain the correct standard errors in a regression with two high-dimensional fixed effects. In the Stata code shown below, we speed up the algorithm by demeaning all variables with respect to one of the fixed effects.

. use felsdvsimul, clear

. recast double _all

. generate double temp=0

. generate double fe2=0

. generate double lastfe2=0

2. In the appendix, we provide a more extensive discussion of this issue.


. foreach var of varlist y x1 x2 {
  2. quietly {
  3. local i=0
  4. local dif=1
  5. capture drop mean
  6. egen double mean=mean(`var'), by(i)
  7. replace `var'=`var'-mean, nopromote
  8. while abs(`dif')>epsdouble() {
  9. replace lastfe2=fe2, nopromote
 10. capture drop mean
 11. egen double mean=mean(fe2), by(i)
 12. capture drop fe2
 13. egen double fe2=mean(`var'+mean), by(j)
 14. replace temp=sum(reldif(fe2,lastfe2)), nopromote
 15. local dif=temp[_N]/_N
 16. display `i' " " `dif'
 17. local i=`i'+1
 18. }
 19. noisily display "Total Number of Iterations for `var' --> " `i'
 20. generate double `var'_res=`var'-fe2+mean
 21. }
 22. }

Total Number of Iterations for y --> 40
Total Number of Iterations for x1 --> 41
Total Number of Iterations for x2 --> 41

. regress y_res x1_res x2_res, nocons dof(69)

Source          SS        df        MS            Number of obs =     100
                                                  F(2, 69)      =   17.17
Model       1033.49122     2   516.745611         Prob > F      =  0.0000
Residual    2076.72508    69   30.0974649         R-squared     =  0.3323
                                                  Adj R-squared =  0.0323
Total        3110.2163   100    31.102163         Root MSE      =  5.4861

y_res           Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

x1_res       1.029258    .2151235    4.78   0.000      .6000987    1.458418
x2_res       -.709482    .2094198   -3.39   0.001     -1.127263   -.2917009

Notice that the iterative procedure was much faster, taking about 40 iterations for each variable. The final regression has the correct estimates for both the coefficients and standard errors. To obtain the correct standard errors, we had to adjust the degrees of freedom of the regression. This was done by using the dof() option in the regress command. Given that in felsdvsimul.dta we have 100 observations, that G1 = 15, that G2 = 20, and that we have six mobility groups and two regressors (x1 and x2), the regression has 100 − 15 − 20 + 6 − 2 = 69 degrees of freedom.3 We also can easily compute clustered standard errors by using the regression on the transformed variables. In the example below, we cluster the standard errors on the variable g and confirm the results using the regress command.4

3. See the appendix for a more detailed explanation.
4. If the clustering variable is different for each observation (each observation is treated as a cluster), then we obtain heteroskedasticity-robust White-corrected standard errors.


. quietly regress y_res x1_res x2_res, nocons mse1

. matrix VV=e(V)

. predict double res, residual

. _robust res, variance(VV) minus(31) cluster(g)

. display "Clustered standard error of x1 --> " sqrt(VV[1,1])Clustered standard error of x1 --> .20274454

. display "Clustered standard error of x2 --> " sqrt(VV[2,2])Clustered standard error of x2 --> .14387506

. quietly xi: regress y x1 x2 i.i i.j, vce(cluster g)

. display "Clustered standard error of x1 --> " _se[x1]Clustered standard error of x1 --> .20274454

. display "Clustered standard error of x2 --> " _se[x2]Clustered standard error of x2 --> .14387506

Finally, even though we ran a regression on transformed variables, we are still able torecover the estimates for the coefficients of the fixed effects. We do so by implementingthe same iterative procedure discussed above to the residuals obtained when we subtractthe effects of x1 and x2 from y.

. quietly regress y_res x1_res x2_res, nocons

. generate double dum=y-_b[x1_res]*x1-_b[x2_res]*x2

. capture drop mean

. egen double mean=mean(dum), by(i)

. replace dum=dum-mean, nopromote
(29 real changes made)

. local i=0

. local dif=1

. while abs(`dif')>epsdouble() {
  2. quietly replace lastfe2=fe2, nopromote
  3. capture drop mean
  4. egen double mean=mean(fe2), by(i)
  5. capture drop fe2
  6. egen double fe2=mean(dum+mean), by(j)
  7. quietly replace temp=sum(reldif(fe2,lastfe2)), nopromote
  8. local dif=temp[_N]/_N
  9. local i=`i'+1
 10. }

. display "Total Number of Iterations for fe2 --> " `i'
Total Number of Iterations for fe2 --> 41

. quietly replace dum=dum-fe2, nopromote

. egen double fe1=mean(dum), by(i)

. quietly regress y x1 x2 fe1 fe2, nocons

. estimates table, b(%10.7f)

Variable active

x1      1.0292583
x2     -0.7094820
fe1     1.0000000
fe2     1.0000000

We confirm that the estimated coefficients are correct by adding the variables fe1 and fe2 to the linear regression of y on x1 and x2. As expected, the estimated β coefficients are the correct ones and the coefficients associated with the variables fe1 and fe2 equal one.

Subtracting the influence of the fixed effects from each variable and working with only the residuals has some advantages compared with the process shown earlier that entailed direct estimation of the full regression with all the fixed effects added. First, the simple regressions in step 1 are likely to converge at a faster rate. Second, it is possible to test different specifications of the model using only the transformed variables without the need to deal with the high-dimensional fixed effects. And third, when dealing with very large datasets, we can substantially reduce memory requirements because during step 1 we only need to load into memory the variable being handled and the group identifiers for the fixed effects. In fact, we could do even better because the solution to the algorithm is performed independently across mobility groups, meaning that it would be possible to load each mobility group into memory separately.5
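As an illustration of the second point above, an alternative specification can be assessed directly on the residualized variables (a hypothetical example; dropping x2 leaves 100 − 15 − 20 + 6 − 1 = 70 degrees of freedom under the same mobility-group structure):

* Alternative specification fit on the stored residuals only
regress y_res x1_res, nocons dof(70)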

3 Extension to nonlinear models

In this section, we show that the iterative approach outlined earlier for the linear regression model can be extended to nonlinear models. With nonlinear models, it is possible to estimate correctly the vector β, but there is no easy solution for estimation of the associated standard errors. While it would be possible to bootstrap the standard errors, this solution is likely to be computationally very expensive. Given that with the iterative approach proposed in this article we obtain the correct values for the log-likelihood function, it may be easier to implement statistical tests for the coefficients based upon likelihood ratios (LRs). To illustrate, let us first consider a typical Poisson regression model with expected value given by

E(yi) = λi = exp(x′iβ)

We know that the maximum likelihood estimators are obtained as the solution to

    ∂ ln L/∂β = ∑i {yixi − xi exp(x′iβ)} = ∑i {yi − exp(x′iβ)}xi = 0

If one of the regressors is a dummy variable, say dj, then its estimated coefficient, say αj, has a closed-form solution given by

    exp(αj) = d′jy × [d′j exp{(x′iβ)(j)}]−1        (3)

5. Currently available in the Statistical Software Components archive are two user-written commands that implement the algorithms discussed in this article for estimation of linear regression models with two high-dimensional fixed effects. The first, gpreg, is a fast Mata implementation programmed by Johannes Schmieder. The second, reg2hdfe, is particularly suited for estimation with large datasets and was programmed by Paulo Guimaraes.


where the subscript (j) in the argument of the exponential function shows that dj is excluded from the argument. The above expression suggests a simple iterative strategy, much like the one used for the linear regression. We can alternate between estimation of a Poisson regression with k explanatory variables and calculation of the estimates for all the coefficients of the fixed effects using (3). These estimates can be kept in a single column vector. To show this algorithm at work, we replicate the estimates of the Poisson model with fixed effects, which appear in example 1 of [XT] xtpoisson as an illustration of that command with the fe option. That example shown in the manual includes an exposure variable, service, which we incorporated into the algorithm:

. webuse ships, clear

. quietly poisson acc op_75_79 co_65_69 co_70_74 co_75_79, exposure(service)

. keep if e(sample)
(6 observations deleted)

. bysort ship: egen sumy=total(acc)

. generate double off=0

. generate double temp=0

. local dif=1

. local ll1=0

. local i=0

. while abs(`dif')>epsdouble() {
  2. quietly poisson acc op_75_79 co_65_69 co_70_74 co_75_79, offset(off)
>    noconstant
  3. local ll2=`ll1'
  4. local ll1=e(ll)
  5. capture drop xb
  6. predict double xb, xb
  7. quietly replace temp=xb-off+log(service), nopromote
  8. capture drop sumx
  9. bysort ship: egen double sumx=total(exp(temp))
 10. quietly replace off=log(sumy/sumx)+log(service), nopromote
 11. local dif=`ll2'-`ll1'
 12. local i=`i'+1
 13. }

. display "Total Number of Iterations --> " `i'
Total Number of Iterations --> 103

. quietly poisson acc op_75_79 co_65_69 co_70_74 co_75_79, noconstant
>    offset(off)

. estimates table, b(%10.7f) eform

Variable active

op_75_79   1.4688312
co_65_69   2.0080025
co_70_74   2.2669302
co_75_79   1.5736955

Extending the algorithm to two fixed effects is straightforward. We use the same dataset as before, but instead of including the dummy variables for year of construction (co_65_69, co_70_74, and co_75_79), we treat the year of construction as a fixed effect; that is, we let the variable yr_con identify a second fixed effect. Now the algorithm is implemented without an exposure variable. For comparability purposes, we first run a Poisson regression that includes dummy variables for both fixed effects.

. webuse ships, clear

. local dif=1

. xi: poisson acc op_75_79 i.yr_con i.ship, nolog
i.yr_con          _Iyr_con_1-4        (naturally coded; _Iyr_con_1 omitted)
i.ship            _Iship_1-5          (naturally coded; _Iship_1 omitted)

Poisson regression                        Number of obs =      34
                                          LR chi2(8)    =  475.45
                                          Prob > chi2   =  0.0000
Log likelihood = -118.47588               Pseudo R2     =  0.6674

accident        Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

op_75_79     .2928003    .1127466    2.60   0.009       .071821    .5137796
_Iyr_con_2   .5824489    .1480547    3.93   0.000      .2922671    .8726308
_Iyr_con_3   .4627844     .151248    3.06   0.002      .1663437    .7592251
_Iyr_con_4  -.1951267    .2135749   -0.91   0.361     -.6137258    .2234724
_Iship_2      1.79572    .1666196   10.78   0.000      1.469151    2.122288
_Iship_3    -1.252763    .3273268   -3.83   0.000     -1.894312   -.6112142
_Iship_4    -.9044563    .2874597   -3.15   0.002     -1.467867   -.3410457
_Iship_5    -.1462833    .2351762   -0.62   0.534     -.6072202    .3146537

_cons 1.308451 .1972718 6.63 0.000 .9218049 1.695096

. keep if e(sample)
(6 observations deleted)

. bysort ship: egen sumy1=total(acc)

. bysort yr_con: egen sumy2=total(acc)

. generate double off=0

. generate double temp=0

. generate double temp1=0

. generate double temp2=0

. generate double fe1=0

. generate double fe2=0

. local ll1=0

. local i=0

. while abs(`dif')>epsdouble() {
  2. quietly poisson acc op_75_79, noconstant offset(off)
  3. local ll2=`ll1'
  4. local ll1=e(ll)
  5. capture drop xb
  6. predict double xb, xb
  7. quietly replace temp1=xb-off+fe2, nopromote
  8. capture drop sumx
  9. bysort ship: egen double sumx=total(exp(temp1))
 10. quietly replace fe1=log(sumy1/sumx), nopromote
 11. quietly replace temp2=xb-off+fe1, nopromote
 12. capture drop sumx
 13. bysort yr_con: egen double sumx=total(exp(temp2))
 14. quietly replace fe2=log(sumy2/sumx), nopromote
 15. quietly replace off=fe1+fe2
 16. local dif=`ll2'-`ll1'
 17. local i=`i'+1
 18. }

. display "Total Number of Iterations --> " `i'
Total Number of Iterations --> 45

. quietly poisson acc op_75_79, noconstant offset(off)

. estimates table, b(%10.7f)

Variable active

op_75_79 0.2928003

To test the statistical significance of the variable op_75_79 using an LR test, we run the same regression as above but without the op_75_79 variable and retaining the log-likelihood value. The value for the log likelihood of this restricted regression is −121.88042, which leads to a value of the LR test of LR = 2 × (−118.47588 + 121.88042) = 6.80908. The LR statistic follows a chi-squared distribution with 1 degree of freedom, and its square root should be comparable with the z statistic reported in the Stata output for the Poisson regression. Taking the square root of the LR statistic, we obtain 2.6094213, which is very close to the z statistic for op_75_79 that is reported by Stata in the Poisson regression that explicitly includes all dummy variables.6 Application of the algorithm to Poisson regression was straightforward because we could find a closed-form solution for the coefficients associated with the fixed effects. However, in most nonlinear regression models, the fixed effects do not have a closed-form solution. As shown in Guimaraes (2004), models from the multinomial logit family such as logit, multinomial logit, and conditional logit all can be fit using Poisson regression, meaning that the above algorithm could be used for these cases. The absence of a closed-form solution for the coefficients of the fixed effects does not invalidate use of the zigzag algorithm, but it requires the use of a numerical optimization routine to solve for the coefficients of the fixed effects. This routine may slow down the algorithm considerably. As an example of this approach, we show an application of the zigzag algorithm to fit a negative binomial model with fixed effects.7
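The arithmetic of this LR test can be scripted directly (a small sketch; the two log-likelihood values are the ones quoted above and would normally be stored from the corresponding runs):

* LR test for op_75_79 from the unrestricted and restricted log likelihoods
local ll_u = -118.47588     // unrestricted model (with op_75_79)
local ll_r = -121.88042     // restricted model (without op_75_79)
local lr   = 2*(`ll_u' - `ll_r')
display "LR statistic = " `lr'
display "sqrt(LR)     = " sqrt(`lr')
display "p-value      = " chi2tail(1, `lr')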

6. The results are not identical because Stata reports the Wald statistic, while we calculated the LR statistic. Asymptotically, the two statistics are equivalent.

7. The fixed-effects negative binomial model (xtnbreg with the fe option) is not equivalent to a negative binomial model with dummy variables added for fixed effects (see Guimaraes [2008]).


. webuse ships, clear

. xi: nbreg acc op_75_79 co_65_69 co_70_74 co_75_79 i.ship, nolog
i.ship            _Iship_1-5          (naturally coded; _Iship_1 omitted)

Negative binomial regression              Number of obs =      34
                                          LR chi2(8)    =   41.45
Dispersion = mean                         Prob > chi2   =  0.0000
Log likelihood = -88.445258               Pseudo R2     =  0.1898

accident        Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

op_75_79     .3324104     .328116    1.01   0.311     -.3106852     .975506
co_65_69     .8380919    .4378077    1.91   0.056     -.0199955    1.696179
co_70_74     1.658684    .4850461    3.42   0.001       .708011    2.609357
co_75_79     .8604224    .5955773    1.44   0.149     -.3068876    2.027732
_Iship_2      2.35359    .4701847    5.01   0.000      1.432045    3.275135
_Iship_3    -1.104561    .5214874   -2.12   0.034     -2.126657    -.082464
_Iship_4    -.9606946    .4905212   -1.96   0.050     -1.922098    .0007092
_Iship_5     -.077889    .4780747   -0.16   0.871     -1.014898    .8591201

_cons .4230202 .5218569 0.81 0.418 -.5998006 1.445841

/lnalpha -.7372302 .3814595 -1.484877 .0104166

alpha .4784372 .1825044 .2265302 1.010471

Likelihood-ratio test of alpha=0: chibar2(01) = 60.06 Prob>=chibar2 = 0.000

. keep if e(sample)
(6 observations deleted)

. egen id=group(ship)

. quietly sum id

. local maxg=r(max)

. local dif=1

. local ll1=0

. local i=0

. generate double off1=0

. generate double off2=0

. while abs(`dif')>epsdouble() {
  2. quietly nbreg acc op_75_79 co_65_69 co_70_74 co_75_79, noconstant offset(off1)
  3. local lna=log(e(alpha))
  4. constraint define 1 [lnalpha]_cons=`lna'
  5. local ll2=`ll1'
  6. local ll1=e(ll)
  7. capture drop xb
  8. predict double xb, xb
  9. quietly replace off2=xb-off1, nopromote
 10. forval j=1/`maxg' {
 11. quietly nbreg acc if id==`j', offset(off2) constraint(1)
 12. quietly replace off1=_b[_cons] if e(sample), nopromote
 13. }
 14. local dif=`ll2'-`ll1'
 15. local i=`i'+1
 16. }

. display "Total Number of Iterations -->" `i'
Total Number of Iterations -->122

. quietly nbreg acc op_75_79 co_65_69 co_70_74 co_75_79, offset(off1) noconstant


. estimates table, b(%10.7f)

Variable active

accident
  op_75_79    0.3324104
  co_65_69    0.8380920
  co_70_74    1.6586841
  co_75_79    0.8604225

lnalpha
  _cons      -0.7372302

Following an approach similar to the one used for the negative binomial model, it should be possible to extend the algorithm to other nonlinear models. In general, the algorithm should work well with models that have globally concave log-likelihood functions such as the ones discussed here.

4 Conclusion

In this article, we successfully explored the implementation of the full Gauss–Seidel algorithm to fit regression models with high-dimensional fixed effects. The main advantage of this procedure is the ability to fit linear regression models with two or more high-dimensional fixed effects under minimal memory requirements. Generalizing the procedure to nonlinear regression models is straightforward, particularly in cases having a closed-form solution for the fixed effect.

We do not claim, however, that our procedure is a superior estimation strategy. Quite to the contrary, the zigzag algorithm can be very slow, and researchers should use more efficient estimation techniques whenever available. We know that the linear regression model with two high-dimensional fixed effects is fit much more efficiently with the user-written command felsdvreg, the same way that xtpoisson is the better approach to fit a Poisson model with a single high-dimensional fixed effect. Nevertheless, the zigzag algorithm may prove useful in some circumstances, namely, when existing approaches do not work because of hardware (memory) limitations or when no other known ways of fitting the model exist. As we mentioned earlier, the estimation strategy outlined in this article is time consuming, but it does have the advantages of imposing minimum memory requirements and of being simple to implement.

There are many ways to improve the speed of the algorithms discussed above, and research is needed to figure out how to improve them. In the examples presented in this article, we used a very strict convergence criterion. In practical applications, though, a more relaxed criterion is likely to substantially lower the number of iterations without meaningful changes to the final results. Other obvious tools that will speed the algorithms include more efficient Stata code (possibly Mata), better starting values, and the use of convergence acceleration techniques in the algorithm. Speed should not be hard to accomplish because the estimates of fixed effects tend to converge monotonically, making it possible to use the information from the last iterations to adjust the trajectory of the fixed-effect estimates and thus obtain faster convergence.

We would like to make researchers aware of the large-sample properties of these estimators. Given the multiple-dimension panel data, the asymptotic behavior of the estimators can be studied in different ways. The estimators are consistent if we are willing to admit that the dimension of the groups is unrelated to sample size. From the literature on panel data, we know that a critical situation arises when the dimension of one fixed effect (say N) increases without bound while T remains fixed. In this case, the number of individual parameters increases as N increases, raising the incidental parameter problem originally discussed by Neyman and Scott (1948) and recently reviewed by Lancaster (2000). In the linear regression model, it is well known that the least-square dummy variable model (or equivalently, the within estimator) still provides consistent estimates of the slope coefficients, but not of the individual fixed effects. This is because in the linear model, the estimators of the slopes and of the individual effects are asymptotically independent (Hsiao 2003). As for nonlinear models, in general, the estimators of the regression coefficients (the slopes) will be plagued by the incidental parameter problem and, for T fixed, will be inconsistent. With one fixed effect, the incidental parameter problem may be overcome by finding a minimal sufficient statistic for the individual effect as in the conditional logit model of Chamberlain (1980). Lancaster (2000) offers useful reparameterizations for a number of conventional nonlinear regression models. If, however, both N and T increase without bound, the inconsistency generated by the incidental parameter problem is circumvented, leading to consistent estimates of the slope and the individual effects.

We should stress that the article presents a technique for estimation of models with large numbers of dummy variables. From an asymptotic perspective, this approach is not equivalent to the use of panel-data estimators that condition out or difference the fixed effects. This means that consistency and asymptotic normality of our estimators rely on the implicit assumption that the number of groups remains fixed as the sample size tends to infinity.

Finally, we would like to point out that we motivated the introduction of fixed effects in large datasets as a way to control for unobserved heterogeneity. However, there may be other reasons why researchers may want to deal with large numbers of dummy variables. With large datasets, it may not make sense to impose functional relationships in the variables, and we can instead let the data best show those relationships using a dummy variable for each different value of the regressor. With millions of observations, the loss in degrees of freedom is minimal.

5 References

Abowd, J. M., R. H. Creecy, and F. Kramarz. 2002. Computing person and firm effects using linked longitudinal employer–employee data. Technical Paper No. TP-2002-06, Center for Economic Studies, U.S. Census Bureau. http://lehd.did.census.gov/led/library/techpapers/tp-2002-06.pdf.

Abowd, J. M., F. Kramarz, and D. N. Margolis. 1999. High wage workers and high wage firms. Econometrica 67: 251–333.

Andrews, M., T. Schank, and R. Upward. 2006. Practical fixed-effects estimation methods for the three-way error-components model. Stata Journal 6: 461–481.

Carneiro, A., P. Guimaraes, and P. Portugal. 2010. Real wages and the business cycle: Accounting for worker, firm, and job heterogeneity. Unpublished manuscript.

Chamberlain, G. 1980. Analysis of covariance with qualitative data. Review of Economic Studies 47: 225–238.

Cornelissen, T. 2008. The Stata command felsdvreg to fit a linear model with two high-dimensional fixed effects. Stata Journal 8: 170–189.

Greene, W. 2004. The behaviour of the maximum likelihood estimator of limited dependent variable models in the presence of fixed effects. Econometrics Journal 7: 98–119.

Guimaraes, P. 2004. Understanding the multinomial-Poisson transformation. Stata Journal 4: 265–273.

———. 2008. The fixed effects negative binomial model revisited. Economics Letters 99: 63–66.

Hsiao, C. 2003. Analysis of Panel Data. 2nd ed. Cambridge: Cambridge University Press.

Lancaster, T. 2000. The incidental parameter problem since 1948. Journal of Econometrics 95: 391–413.

Neyman, J., and E. Scott. 1948. Consistent estimation from partially consistent observations. Econometrica 16: 1–32.

Smyth, G. K. 1996. Partitioned algorithms for maximum likelihood and other non-linear estimation. Statistics and Computing 6: 201–216.

About the authors

Paulo Guimaraes is a research associate professor at the University of South Carolina and currently is a visiting professor at the University of Porto.

Pedro Portugal is a senior researcher at the Bank of Portugal and a visiting full professor at the Universidade Nova de Lisboa.

Appendix

In this appendix, we try to provide some intuition on the issue of identification of the fixed effects. Consider first a regression model with N observations and a single fixed effect with G1 levels (a one-way ANOVA model):

E(yit) = μ + αi

If we replace E(yit) by the data cell means, we have a system of G1 equations on G1 + 1 unknowns. To solve this model, we need to impose one restriction (typically, μ = 0 or α1 = 0). With this restriction, we are able to estimate G1 coefficients of the model. This in turn means that SSR has N − G1 degrees of freedom (or N − k − G1 if there are an additional k noncollinear explanatory variables in the model).

Consider now a regression model with two fixed effects with G1 and G2 levels, respectively:

E(yit) = μ + αi + ηj

The unique combinations of αi and ηj available on the data define a set of equations, but the interdependence between these equations does not make obvious how many restrictions are needed to identify the coefficients. Abowd, Creecy, and Kramarz (2002) presented an algorithm that counts the number of restrictions needed to identify the coefficients. To illustrate, consider an example with G1 = G2 = 3 and the following unique combinations of the levels of the fixed effects:

μ + α1 + η1

μ + α1 + η2

μ + α2 + η1

μ + α2 + η3

μ + α3 + η2

μ + α3 + η3

To solve this system of equations, we can start out by imposing two restrictions, for example, μ = 0 and α1 = 0. With these restrictions in place, we can immediately identify η1 and η2. In turn, knowledge of η1 and η2 allows the identification of α2 and α3, thereby leading to the identification of η3. This sequence of steps is illustrated below:

[Diagram omitted: in three passes over the six expressions, η1 and η2 are identified first, then α2 and α3, and finally η3.]

Identification of the coefficients is possible only because the equations are "connected". Consider now an alternative regression model with parameters given by

μ + α1 + η1

μ + α1 + η2

μ + α2 + η1

μ + α2 + η2

μ + α3 + η3

μ + α3 + η3

If we follow the same strategy as above and set μ = 0 and α1 = 0, we now have the following sequence of steps:

[Diagram omitted: only the coefficients appearing in the first four expressions can be resolved; α3 and η3 remain unidentified.]

Now, with these two restrictions, we are only able to identify the coefficients in the first four equations. This happens because the first set of equations does not share any coefficients with the remaining equations. In Abowd, Creecy, and Kramarz (2002) terminology, there are now two mobility groups. Only with an additional restriction (α3 = 0 or η3 = 0) can we identify the remaining coefficients. It should be obvious that each additional mobility group requires an additional restriction. If we let M designate the number of mobility groups, then we conclude that the number of identified coefficients is G1 + G2 − M, and the degrees of freedom associated with SSR is N − G1 − G2 + M (or N − k − G1 − G2 + M if there are k noncollinear explanatory variables in the model). We can use a similar logic to that outlined above to count the number of identifiable coefficients in a model with more than two fixed effects. However, development of an algorithm for this purpose is no simple task and is well beyond the scope of this article.
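For the felsdvsimul.dta example used in section 2.3, this count can be scripted once the number of mobility groups is known (a small sketch; M = 6 is taken from the earlier discussion rather than computed here):

* Degrees of freedom N - k - G1 - G2 + M for the felsdvsimul.dta example
use felsdvsimul, clear
quietly tabulate i
local G1 = r(r)        // levels of the first fixed effect
quietly tabulate j
local G2 = r(r)        // levels of the second fixed effect
local M  = 6           // mobility groups, as reported in section 2.3
local k  = 2           // regressors x1 and x2
display "degrees of freedom = " _N - `k' - `G1' - `G2' + `M'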

The Stata Journal (2010)10, Number 4, pp. 650–669

Variable selection in linear regression

Charles Lindsey
StataCorp
College Station, TX
[email protected]

Simon Sheather
Department of Statistics
Texas A&M University
College Station, TX

Abstract. We present a new Stata program, vselect, that helps users perform variable selection after performing a linear regression. Options for stepwise methods such as forward selection and backward elimination are provided. The user may specify Mallows's Cp, Akaike's information criterion, Akaike's corrected information criterion, Bayesian information criterion, or R2 adjusted as the information criterion for the selection. When the user specifies the best subset option, the leaps-and-bounds algorithm (Furnival and Wilson, Technometrics 16: 499–511) is used to determine the best subsets of each predictor size. All the previously mentioned information criteria are reported for each of these subsets. We also provide options for doing variable selection only on certain predictors (as in [R] nestreg) and support for weighted linear regression. All options are demonstrated on real datasets with varying numbers of predictors.

Keywords: st0213, vselect, variable selection, regress, nestreg

1 Theory/motivation

Redundant predictors in a linear regression yield a decrease in the residual sum of squares (RSS) and less-biased predictions at the cost of an increased variance in predictions.

In settings where there are a small number of predictors, the partial F test can be used to determine whether certain groups of predictors should be included in the model. We divide the predictors into two groups. One group, the base group, will be included in our model. The other group, the suspected group, may or may not be included within the model; we are not yet sure. We call the regression model containing all predictors in both groups, base and suspected, the full (FULL) model. The regression model containing only the base predictors is called the reduced (RED) model.

The partial F test has a test statistic

    F = {(RSSRED − RSSFULL)/(dfRED − dfFULL)} / (RSSFULL/dfFULL)

Under the null hypothesis that the RED model is true (all the predictor coefficients for the suspected group are zero), F has an F(dfRED − dfFULL, dfFULL) distribution. Acceptance of the null hypothesis leads us to use the RED model as our regression model. Rejection of the null hypothesis indicates that we should not ignore the predictors in the suspected group (at least one of the predictor coefficients is not zero). We can then reperform the test using subsets of the suspected group to determine which predictors to include in the model. The partial F test may be easily performed in Stata via nestreg (see [R] nestreg).
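For example, a partial F test of a suspected group can be run with nestreg on auto.dta (an arbitrary illustration of the syntax, not taken from the article):

* Base group (weight, length); suspected group (mpg, foreign)
sysuse auto, clear
nestreg: regress price (weight length) (mpg foreign)

nestreg reports a block test for the second group of predictors, which for regress corresponds to the partial F test described above.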

In this article, we are concerned with those cases in which there are a large number of predictors. When the suspected predictor list grows large, it is not feasible to use the partial F test method to determine the final regression model. A variety of algorithms have been created to deal with this situation. These variable selection algorithms take the specification of the FULL model and output an optimal RED model. The command presented here, vselect, performs the stepwise selection algorithms forward selection and backward elimination as well as the best subsets leaps-and-bounds algorithm.

The output of these algorithms and the partial F test is not very meaningful unless FULL is a valid regression model. A regression model is valid if the assumptions for performing its significance tests are met. They can be assessed using residual plots, scale-location plots, etc. Details can be found in Sheather (2009).

We must also note that inference on the models produced by these algorithms is not equivalent to the inference on the same models that the users find independently without consulting the algorithms. Each step of a variable selection algorithm will fit one or more models and then make an inference on the next step using information from these models. So in addition to inferences made using the final model, many preliminary inferences are made during variable selection.

This will affect the significance levels of the final model. The situation is similar to performing multiple comparisons on the factor means after an analysis of variance tells you there is a significant effect. Each of these comparisons should be evaluated at a different significance level than that of the original factor effect.

Cross-validation methods can be used to handle this multiple inference difficulty. These methods generally perform variable selection on subsets of the data and then use an average measure of the results on these subsets to find the final model. They may also split the data into two parts, performing variable selection on one part (train) and using the other (test) for evaluating the resulting model. Details of this method and a general discussion of the multiple inference problem in variable selection are given in Sheather (2009). The variable selection methods that we use here may be applied under certain cross-validation techniques.

The definition of optimal is not uniformly agreed upon. The optimal model is one that optimizes one or more information criteria. There are multiple information criteria and multiple guidelines for the number and type of information criteria that should be met.



1.1 Information criteria

An information criterion is a function of a regression model's explanatory power and complexity. The model's explanatory power (goodness of fit) increases the criterion in the desirable direction, while the complexity of the model counterbalances the explanatory power and moves the criterion in the undesirable direction.

We have singled out five relevant criteria for evaluating linear regression models: Mallows's Cp, R2ADJ (adjusted R2), Akaike's information criterion (AIC), Akaike's corrected information criterion (AICc), and the Bayesian information criterion (BIC). We use the definitions of these criteria given in Sheather (2009) and Izenman (2008). Our definitions for BIC and AIC correspond with those given in estat (see [R] estat).

The R2 adjusted information criterion is an improvement to the R2 measure of a model's explanatory power. We abbreviate the RSSRED notation to simply RSS. The SST notation refers to the total sum of squares.

    R2 = 1 − RSS/SST

A penalty for unnecessary predictors is introduced by a multiplication by (n − 1)/(n − k − 1), where n is the sample size and k is the number of predictors in the model.

    R2ADJ = 1 − {(n − 1)/(n − k − 1)} RSS/SST

As R2ADJ increases, the model becomes more desirable.

The next information criterion, AIC (Akaike 1974), works in the opposite way: asthe criterion decreases, the model becomes more desirable. The explanatory power ofthe model is measured by the maximized log likelihood of the predictor coefficients(assuming a normal model) and error variance. The complexity penalization comesfrom an addition of the number of predictors.

AIC = 2{-log L(β0, β1, ..., βp, σ2 | Y) + k + 2}

After we formulate the regression model in terms of a normal distribution likelihood, we obtain

AIC = n log(RSS/n) + 2k + n + n log(2π)

Hurvich and Tsai (1989) developed a bias-corrected version of AIC, called AICc. AICc is preferred when the sample size is small or the number of predictors is large relative to the sample size. Using our simplified version of AIC,

AICc = AIC + 2(k + 2)(k + 3)/{n - (k + 2) - 1}


Let p = k + 1. As in the previous section, we use RSSFULL to refer to the RSS under the model containing all predictors. Suppose we have m possible predictors, excluding the intercept. In Izenman (2008), the information criterion Cp, or Mallows's Cp, is defined by

Cp = (n - m - 1) RSS/RSSFULL - (n - 2p)

According to the Cp criterion, good models have Cp ≈ p. The full model will always satisfy this criterion. Further, as noted in Hocking (1976), models with small values of Mallows's Cp may be preferred as well. The Mallows's Cp criterion was originally developed in Mallows (1973).

Our final information criterion, BIC, was proposed by Schwarz (1978). Raftery (1995) provides another development and motivation for the criterion. BIC is similar to AIC, but it adjusts the penalty term for complexity based on the sample size.

BIC = -2 log L(β0, β1, ..., βp, σ2 | Y) + (k + 2) log n

This reduces to

BIC = n log(RSS/n) + k log n + n + n log(2π)
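For concreteness, here is a sketch of computing these reduced forms by hand from the results that regress leaves behind (the variable names are placeholders; k is taken as e(df_m), the number of slope coefficients, and SST as e(mss) + e(rss)). Because constant terms have been dropped, the absolute values may differ from other software, but rankings across models fit to the same data do not.

. regress y x1 x2 x3
. local n = e(N)
. local k = e(df_m)
. local rss = e(rss)
. local sst = e(mss) + e(rss)
. display "AIC   = " `n'*ln(`rss'/`n') + 2*`k' + `n' + `n'*ln(2*_pi)
. display "AICc  = " `n'*ln(`rss'/`n') + 2*`k' + `n' + `n'*ln(2*_pi) + 2*(`k'+2)*(`k'+3)/(`n'-(`k'+2)-1)
. display "BIC   = " `n'*ln(`rss'/`n') + `k'*ln(`n') + `n' + `n'*ln(2*_pi)
. display "R2ADJ = " 1 - (`n'-1)/(`n'-`k'-1)*`rss'/`sst'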

There is controversy over what should be called the best information criterion. According to Sheather (2009), choosing a model based solely on R2ADJ generally leads to overfitting (having too many predictors). There is also debate over whether AIC or AICc should be used in preference to BIC. A comparison of page 46 of Simonoff (2003) with page 208 of Hastie, Tibshirani, and Friedman (2001) demonstrates this. Mallows's Cp suffers from similar controversies. Inference using Cp will be asymptotically equivalent to AIC, but both will share different properties than BIC (Izenman 2008).

For each predictor size k, the best model under each of the information criteria for that predictor size is the model that minimizes RSS. All other terms are constant for the same predictor size. So at each predictor size, we can find the best model of that size by minimizing the RSS. This remarkable result can greatly simplify the variable selection process.
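To illustrate this principle directly (a brute-force check, not how vselect searches, since it uses leaps and bounds instead), the best two-predictor model among candidates x1-x4 could be found by comparing the RSS of every pair; all names in this do-file fragment are placeholders.

local bestrss = .
foreach a of varlist x1-x4 {
    foreach b of varlist x1-x4 {
        if "`a'" >= "`b'" continue        // visit each unordered pair once
        quietly regress y `a' `b'
        if e(rss) < `bestrss' {
            local bestrss = e(rss)
            local bestvars "`a' `b'"
        }
    }
}
display "best model of size 2: y `bestvars'   RSS = `bestrss'"

Whichever pair minimizes the RSS also optimizes every criterion defined above among two-predictor models.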

Now that we have defined the relevant information criteria, we will present the variable selection algorithms implemented in vselect that use the criteria. We begin with stepwise selection algorithms.

1.2 Stepwise selection

We present two stepwise selection algorithms, forward selection and backward elimination. These algorithms work with only one information criterion, which may be any of the ones defined previously except Mallows's Cp. Technically, Mallows's Cp could be used in stepwise selection, but the decision on which predictors to keep or add to


the model would be more difficult. All the other criteria measures have an intrinsic ordering among their values. The smallest AIC is best, the larger R2ADJ is preferable, etc. Mallows's Cp suggests a good model when it is close to the number of predictors and the intercept of the model it measures, but as mentioned in Hocking (1976), small values of Mallows's Cp can yield good models as well. Our stepwise selection algorithms make an automated decision on whether to keep a variable in the model or add a variable to the model. Ideally, this would be based on a simple ranking of the possible models based on an information criterion. If we use both suggestions for interpretation of Mallows's Cp, the algorithm cannot make the decision based on a simple ranking of models. Given this, we will not use Mallows's Cp in stepwise selection. It will still be used in the leaps-and-bounds variable selection, however.

Forward selection is an iterative procedure. Our initial model is composed of only the intercept term. At every iteration, we add to the model the predictor that will yield the most optimal information criterion value when it is included in the model. If there is no predictor that favorably changes the information criterion from its value in the previous iteration, the algorithm terminates with the model from the previous iteration.
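As a do-file sketch of a single forward step under AIC (the response y and the candidate list are placeholders; a full implementation would wrap this step in a loop that stops when no candidate improves the criterion, which is essentially what vselect automates):

local included ""
local candidates "x1 x2 x3 x4 x5"
quietly regress y `included'                  // intercept-only model to start
local bestaic = -2*e(ll) + 2*(e(df_m) + 2)    // AIC as defined above
local bestvar ""
foreach v of local candidates {
    quietly regress y `included' `v'
    local aic = -2*e(ll) + 2*(e(df_m) + 2)
    if `aic' < `bestaic' {
        local bestaic = `aic'
        local bestvar "`v'"
    }
}
if "`bestvar'" != "" display "add `bestvar'  (AIC = `bestaic')"
else display "no addition improves AIC; stop"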

Backward elimination is also an iterative procedure. In this case, the initial model is composed of all the predictors. At every iteration, we remove from the model the predictor that will yield the largest improvement in the information criterion value when it is removed from the model. If there is no predictor whose removal will favorably change the information criterion value from that of the previous iteration, the algorithm terminates with the model from the previous iteration.

Both stepwise selection algorithms examine at most m(m + 1)/2 of the 2^m possible models. When the predictors are highly correlated, the results of stepwise selection and all-subsets selection methods can differ dramatically. The algorithms are intuitive and simple to understand. In many cases, they end up with the best model as well.

For a more dependable algorithm, we turn to the leaps-and-bounds algorithm of Furnival and Wilson (1974).

1.3 Leaps and bounds

The leaps-and-bounds algorithm actually gives p different models. Each of the models contains a different number of predictors and is the most optimal model among models having the same number of predictors. The vselect command provides the five information criteria for each of the models produced by leaps and bounds. The optimal model is the one model with these qualities: the smallest value of AIC, AICc, and BIC; the largest value of R2ADJ; and a value of Mallows's Cp that is close to the number of predictors in the model + 1 or the smallest among the other Mallows's Cp values. These guidelines help avoid the controversy of which information criterion is the best.

Sometimes there is no single model that optimizes all the criteria. We will see an example of this in the next section. There are no fixed guidelines for this situation. Generally, we can narrow the choices down to a few models that are close in optimization.


Then we make an arbitrary choice among them. All the models in our final group are close together in fit, so we do not lose or gain much explanatory power by choosing one over another.

As explained in Furnival and Wilson (1974), the leaps-and-bounds algorithm organizes all the possible models into tree structures and scans through them, skipping (or leaping) over those that are definitely not optimal. The original description of the algorithm is done with large amounts of Fortran code. Ni and Huo (2005) provide an easier description of the original algorithm.

Each node in the tree corresponds to two sets of predictors. The predictor lists are created based on an automatic ordering of all the predictors by their t test statistic value in the original regression. When the algorithm examines a node, it compares the regressions of each pair of predictor lists with the optimal regressions of each predictor size that have already been conducted. Depending on the results, all or some of the descendants of that node can be skipped by the algorithm. The initial ordering of the predictors and their smart placement in sets within the nodes ensure that the algorithm completes after finding the optimal predictor lists and examining only a fraction of all possible regressions.

Space constraints do not allow us to provide a fuller description of the algorithm than we already have. We can say that it gives us the best models for each predictor quantity and that it does so by only examining a manageable fraction of all the possible models.

1.4 Extensions: Nested models and weighting

Our discussion so far has focused on ordinary least-squares regression models, where variable selection should be performed on all the model predictors. Lawless and Singhal (1978) provide an extension of the leaps-and-bounds algorithm to nonnormal models. Rather than using the RSS to compare models, they use the log likelihood L(β). An essential condition for our use of the RSS in variable selection is that for a set of predictors A contained in predictor set B, RSS(B) ≤ RSS(A). In many situations, L(B) ≤ L(A), but it is not always true.

Variable selection in weighted linear regressions and in linear regressions where we perform selection on only certain of the predictors will fit into the Lawless and Singhal (1978) theoretical framework and will satisfy the desired likelihood inequality. Weighted linear regression is of tremendous practical use. The form of nested variable selection in which some predictors are fixed is very appealing as well. Through organization or legal policy, analysts may be forced to fix certain predictors as being in their model, but they would still desire to optimize the model with the free predictors to which they have access.


vselect implements variable selection for weighted linear regression and variable selection where some predictors are fixed. Further implementation of the Lawless and Singhal (1978) methods is under development.

The information criteria will change for weighted linear regression models. Earlier, we simplified the log likelihood of the model in terms of the RSS. Now we will deal with the weighted RSS. Simple derivation will show that our previously presented information criteria formulas are accurate under weighted regression when we substitute the weighted RSS for the RSS.
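As a sketch of that substitution (placeholder variable names; w is an analytic weight), the residual sum of squares that regress reports after an aweight fit is computed with the weights (normalized to sum to the number of observations), so it can stand in for the weighted RSS in the same reduced formula:

. regress y x1 x2 [aweight = w]
. local n = e(N)
. local k = e(df_m)
. display "AIC = " `n'*ln(e(rss)/`n') + 2*`k' + `n' + `n'*ln(2*_pi)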

We have now explained all the theory behind vselect.

2 The vselect command

2.1 Syntax

The syntax for the vselect command is

vselect depvar indepvars [if] [in] [weight] [, fix(varlist) best backward forward r2adj aic aicc bic]

2.2 Options

fix(varlist) fixes these predictors in every regression.

best gives the best model for each quantity of predictors.

backward selects a model by backward elimination.

forward selects a model by forward selection.

r2adj uses R2 adjusted information criterion in stepwise selection.

aic uses AIC in stepwise selection.

aicc uses AICc in stepwise selection.

bic uses BIC in stepwise selection.

3 Examples

vselect is very straightforward in use. We will first use bridge.dta from Sheather (2009) (also Tryfos [1998]). Then we will test vselect on two datasets highlighted in Ni and Huo (2005): the diabetes data (Efron et al. 2004) and the famous housing data (Frank and Asuncion 2010). Finally, we will work with a weighted regression from a Stata example dataset that provides state-level information from the 1980 U.S. Census.


3.1 Bridge example

bridge.dta can be analyzed using least-squares regression. As Sheather (2009) suggests, we will work with logs of the original predictors.

. use bridge

. foreach var of varlist time-spans {
  2. quietly replace `var' = ln(`var')
  3. }

. regress time darea-spans

      Source |       SS       df       MS              Number of obs =      45
-------------+------------------------------           F(  5,    39) =   27.05
       Model |  13.3303983     5  2.66607966           Prob > F      =  0.0000
    Residual |  3.84360283    39  .098553919           R-squared     =  0.7762
-------------+------------------------------           Adj R-squared =  0.7475
       Total |  17.1740011    44  .390318208           Root MSE      =  .31393

------------------------------------------------------------------------------
        time |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       darea |  -.0456443   .1267496    -0.36   0.721    -.3020196    .2107309
       ccost |   .1960863   .1444465     1.36   0.182    -.0960843     .488257
        dwgs |   .8587948   .2236177     3.84   0.000     .4064852    1.311104
      length |  -.0384353   .1548674    -0.25   0.805    -.3516842    .2748135
       spans |     .23119    .1406819     1.64   0.108    -.0533659     .515746
       _cons |     2.2859    .6192558     3.69   0.001     1.033337    3.538463
------------------------------------------------------------------------------

. estat vif

    Variable |      VIF       1/VIF
-------------+----------------------
       ccost |     8.48    0.117876
      length |     8.01    0.124779
       darea |     7.16    0.139575
       spans |     3.88    0.257838
        dwgs |     3.41    0.293350
-------------+----------------------
    Mean VIF |     6.19

Analysis of the residuals and other checks will reveal that the model is valid. As we see, it does have serious multicollinearity problems. All but two of the variance inflation factors exceed 5. Removing redundant predictors should solve this problem.


Forward selection

First, we will try to use forward selection based on AIC.

. vselect time-spans, forward aic
FORWARD variable selection
Information Criteria: AIC

Stage 0 reg time : AIC 86.35751

AIC 47.19052 : add darea
AIC 37.60067 : add ccost
AIC 32.80693 : add dwgs
AIC 49.00033 : add length
AIC 56.43028 : add spans

Stage 1 reg time dwgs : AIC 32.80693

AIC 30.30586 : add darea
AIC 26.61563 : add ccost
AIC 28.33827 : add length
AIC 25.33412 : add spans

Stage 2 reg time dwgs spans : AIC 25.33412

AIC 27.12765 : add darea
AIC 25.2924 : add ccost
AIC 27.14563 : add length

Stage 3 reg time dwgs spans ccost : AIC 25.2924

AIC 27.06413 : add darea
AIC 27.1425 : add length

Final Model

      Source |       SS       df       MS              Number of obs =      45
-------------+------------------------------           F(  3,    41) =   46.99
       Model |  13.3047499     3  4.43491664           Prob > F      =  0.0000
    Residual |  3.86925122    41  .094371981           R-squared     =  0.7747
-------------+------------------------------           Adj R-squared =  0.7582
       Total |  17.1740011    44  .390318208           Root MSE      =   .3072

------------------------------------------------------------------------------
        time |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        dwgs |   .8355863   .2135074     3.91   0.000     .4043994    1.266773
       spans |   .1962899   .1107299     1.77   0.084    -.0273336    .4199134
       ccost |    .148275   .1074829     1.38   0.175    -.0687911     .365341
       _cons |   2.331693   .3576636     6.52   0.000     1.609377     3.05401
------------------------------------------------------------------------------

We begin with no predictors, with an AIC of 86.35751 for the intercept in stage 0. Addition of dwgs will change the AIC of the model to 32.80693, a more optimal value than the other possibilities of single-predictor addition and the null model. So we add dwgs to the model and move to the next stage. When we add spans to the model that predicts time with dwgs, we get an AIC of 25.33412.


So we enter stage 2 with the model predicting time by dwgs and spans. This model yields an AIC of 25.33412. If we add darea to this model, we obtain an AIC of 27.12765. Addition of length would cause the AIC to rise to 27.14563. Adding either of these would not improve the fit of the model. The addition of the other remaining potential predictor, ccost, yields an AIC of 25.2924. This is a very slight gain in terms of AIC, but it is a gain.

In stage 3, we have added ccost to the model, so the AIC is now 25.2924. We now predict time based on dwgs, spans, ccost, and the intercept. Addition of darea to this model raises the AIC to 27.06413. Addition of length to this model raises the AIC to 27.1425. Adding any more predictors causes an increase in AIC, so we terminate the forward selection algorithm with the final model predicting time with dwgs, spans, and ccost.

Now we will compare this result with forward selection using BIC as an information criterion.

. vselect time-spans, forward bic
FORWARD variable selection
Information Criteria: BIC

Stage 0 reg time : BIC 88.16417

BIC 50.80385 : add darea
BIC 41.21399 : add ccost
BIC 36.42026 : add dwgs
BIC 52.61365 : add length
BIC 60.04361 : add spans

Stage 1 reg time dwgs : BIC 36.42026

BIC 35.72585 : add darea
BIC 32.03562 : add ccost
BIC 33.75826 : add length
BIC 30.75411 : add spans

Stage 2 reg time dwgs spans : BIC 30.75411

BIC 34.3543 : add darea
BIC 32.51905 : add ccost
BIC 34.37228 : add length

Final Model

      Source |       SS       df       MS              Number of obs =      45
-------------+------------------------------           F(  2,    42) =   68.08
       Model |  13.1251524     2  6.56257622           Prob > F      =  0.0000
    Residual |   4.0488487    42  .096401159           R-squared     =  0.7642
-------------+------------------------------           Adj R-squared =  0.7530
       Total |  17.1740011    44  .390318208           Root MSE      =  .31049

------------------------------------------------------------------------------
        time |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        dwgs |   1.041632   .1541992     6.76   0.000     .7304454    1.352819
       spans |   .2853049   .0909484     3.14   0.003     .1017636    .4688462
       _cons |   2.661732   .2687132     9.91   0.000     2.119447    3.204017
------------------------------------------------------------------------------


This method suggests the two-predictor model that predicts time with dwgs and spans.

Backward elimination

Backward elimination based on AIC yields the same model as forward selection. It takes one fewer iteration.

. vselect time-spans, backward aic
BACKWARD variable selection
Information Criteria: AIC

Stage 0 reg time darea ccost dwgs length spans : AIC 28.99311

AIC 27.1425 : remove darea
AIC 29.07072 : remove ccost
AIC 41.42757 : remove dwgs
AIC 27.06413 : remove length
AIC 30.00605 : remove spans

Stage 1 reg time darea ccost dwgs spans : AIC 27.06413

AIC 25.2924 : remove darea
AIC 27.12765 : remove ccost
AIC 39.44412 : remove dwgs
AIC 28.60344 : remove spans

Stage 2 reg time ccost dwgs spans : AIC 25.2924

AIC 25.33412 : remove ccost
AIC 37.57602 : remove dwgs
AIC 26.61563 : remove spans

Final Model

      Source |       SS       df       MS              Number of obs =      45
-------------+------------------------------           F(  3,    41) =   46.99
       Model |  13.3047499     3  4.43491664           Prob > F      =  0.0000
    Residual |  3.86925122    41  .094371981           R-squared     =  0.7747
-------------+------------------------------           Adj R-squared =  0.7582
       Total |  17.1740011    44  .390318208           Root MSE      =   .3072

------------------------------------------------------------------------------
        time |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       ccost |    .148275   .1074829     1.38   0.175    -.0687911     .365341
        dwgs |   .8355863   .2135074     3.91   0.000     .4043994    1.266773
       spans |   .1962899   .1107299     1.77   0.084    -.0273336    .4199134
       _cons |   2.331693   .3576636     6.52   0.000     1.609377     3.05401
------------------------------------------------------------------------------

In the initial stage, we have the full model with all predictors and an AIC of 28.99311. Removal of length will yield the most optimal AIC.

At stage 1, we have removed length and our model now has an AIC of 27.06413. If we remove darea, we will have reached the final model for forward selection under AIC. Removal of the other predictors will yield less optimal models. At stage 2, removal of any of the predictors will yield worse models in terms of AIC.


Best subsets

The leaps-and-bounds algorithm finds the same forward selection and backward elimination models that we previously discussed. To reach the result, the algorithm needs to perform only 5 out of all 32 possible regressions.

. vselect time-spans, best

Response :           time
Fixed Predictors :
Selected Predictors: dwgs spans ccost darea length

Actual Regressions    5
Possible Regressions  32

Optimal Models Highlighted:

  # Preds     R2ADJ           C         AIC        AICC         BIC
  1          .70224    9.708371    32.80693    161.0968    36.42026
  2        .7530191    2.082574    25.33412    154.0386    30.75411
  3        .7582178    2.260247     25.2924    154.5353    32.51905
  4        .7534273    4.061594    27.06413    156.9791    36.09744
  5        .7475037           6    28.99311    159.7246    39.83309

Selected Predictors

  1 : dwgs
  2 : dwgs spans
  3 : dwgs spans ccost
  4 : dwgs spans ccost darea
  5 : dwgs spans ccost darea length

The optimal R2ADJ value, 0.7582178, is obtained by the three-variable model with predictors dwgs, spans, and ccost. This is the same model obtained by forward selection and backward elimination under AIC. This model also optimizes AIC, with an AIC of 25.2924.

The most optimal model under BIC and AICc is the two-predictor model using dwgs and spans. This is the same model found by forward selection under BIC. We find that Mallows's Cp suggests the five-predictor model when we choose the best model as having a Cp value close to the predictor size + 1. Otherwise, when picking the smallest Mallows's Cp model, we would choose the two-predictor model that BIC and AICc chose.

This is one of the occasions when there is no completely clear, best final model. We can narrow our decision down to the two mentioned models. We might investigate whether AICc is more appropriate than AIC in this situation. Recall that picking the model with the highest R2ADJ generally leads to overfitting (Sheather 2009). Regardless, there is little difference between the values of AIC and R2ADJ for the two- and three-predictor models. We will arbitrarily pick the two-predictor model that estimates time by dwgs and spans as our final model. This selection yields no high variance inflation factors.


. estat vif

    Variable |      VIF       1/VIF
-------------+----------------------
        dwgs |     1.66    0.603451
       spans |     1.66    0.603451
-------------+----------------------
    Mean VIF |     1.66

3.2 Diabetes and housing data

For brevity, we will omit stepwise model selection and focus solely on a best subsets selection method in each of the following datasets. We will document that our implementation of the leaps-and-bounds algorithm obtains the same models as Ni and Huo (2005). We will also demonstrate how few models (relative to all possible models) the leaps-and-bounds algorithm needs to fit before finding the optimal models.

diabetes.dta (Efron et al. 2004) contains information on 442 diabetes patients. They are measured on 10 baseline predictor variables and one measure of disease progression. The predictors include age, sex, body mass index (bmi), blood pressure (bp), and six serum measurements (s1-s6). The progression variable, prog, is our models' response and was recorded a year after the 10 baseline predictors.

Evaluation of the residual plots and other diagnostics does show that the full model is valid. As we see in the variance inflation factors, though, there are serious multicollinearity problems.

. use diabetes, clear

. regress prog age-s6

      Source |       SS       df       MS              Number of obs =     442
-------------+------------------------------           F( 10,   431) =   46.27
       Model |  1357023.32    10  135702.332           Prob > F      =  0.0000
    Residual |   1263985.8   431  2932.68168           R-squared     =  0.5177
-------------+------------------------------           Adj R-squared =  0.5066
       Total |  2621009.12   441  5943.33135           Root MSE      =  54.154

------------------------------------------------------------------------------
        prog |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -.0363613   .2170414    -0.17   0.867    -.4629526    .3902301
         sex |  -22.85965   5.835821    -3.92   0.000    -34.32986   -11.38944
         bmi |   5.602962   .7171055     7.81   0.000     4.193503    7.012421
          bp |   1.116808   .2252382     4.96   0.000     .6741061     1.55951
          s1 |  -1.089996   .5733318    -1.90   0.058     -2.21687    .0368782
          s2 |   .7464501   .5308344     1.41   0.160     -.296896    1.789796
          s3 |   .3720042   .7824638     0.48   0.635    -1.165915    1.909924
          s4 |   6.533831   5.958638     1.10   0.273    -5.177772    18.24543
          s5 |   68.48312   15.66972     4.37   0.000     37.68454    99.28169
          s6 |   .2801171    .273314     1.02   0.306     -.257077    .8173111
       _cons |  -334.5671   67.45462    -4.96   0.000     -467.148   -201.9862
------------------------------------------------------------------------------


. estat vif

    Variable |      VIF       1/VIF
-------------+----------------------
          s1 |    59.20    0.016891
          s2 |    39.19    0.025515
          s3 |    15.40    0.064926
          s5 |    10.08    0.099246
          s4 |     8.89    0.112473
         bmi |     1.51    0.662499
          s6 |     1.48    0.673572
          bp |     1.46    0.685200
         sex |     1.28    0.782429
         age |     1.22    0.821486
-------------+----------------------
    Mean VIF |    13.97

When we invoke vselect on the data, we find that we only needed to run 29 of a possible 1,024 regressions. Our model choices match those of Ni and Huo (2005). The choices of best model predictor sizes were five for BIC, six for AIC and AICc, and eight for R2ADJ. Mallows's Cp chooses the 10-predictor model when we choose the best model as having a Cp value close to the predictor size + 1. If we go with the smallest Mallows's Cp value, then we choose the six-predictor model. The six-predictor model seems like a prudent choice, given all of this and the closeness of the optimal BIC and R2ADJ values to their values under six predictors.

. vselect prog age-s6, best

Response :           prog
Fixed Predictors :
Selected Predictors: bmi bp s5 sex s1 s2 s4 s6 s3 age

Actual Regressions    29
Possible Regressions  1024

Optimal Models Highlighted:

  # Preds     R2ADJ           C         AIC        AICC         BIC
  1        .3424327    148.3513    4912.038    6166.435    4920.221
  2        .4570228    47.07119    4828.398    6082.832    4840.672
  3        .4765213    30.66302    4813.226    6067.705    4829.591
  4         .487366    21.99793    4804.963    6059.498    4825.419
  5        .5029966    9.147958    4792.264    6046.863    4816.811
  6        .5081925    5.560187    4788.603    6043.278    4817.243
  7        .5084884    6.303253     4789.32    6044.079    4822.051
  8        .5085553    7.248507    4790.241    6045.093    4827.062
  9        .5076694    9.028067    4792.015     6046.97    4832.928
  10       .5065593          11    4793.986    6049.055     4838.99

Selected Predictors

  1  : bmi
  2  : bmi s5
  3  : bmi bp s5
  4  : bmi bp s5 s1
  5  : bmi bp s5 sex s3
  6  : bmi bp s5 sex s1 s2
  7  : bmi bp s5 sex s1 s2 s4
  8  : bmi bp s5 sex s1 s2 s4 s6
  9  : bmi bp s5 sex s1 s2 s4 s6 s3
  10 : bmi bp s5 sex s1 s2 s4 s6 s3 age


Using the six-predictor model, we still find some high variance inflation factors between the first and second serum variables. They are far lower in magnitude than they are under the full model:

. estat vif

    Variable |      VIF       1/VIF
-------------+----------------------
          s1 |     8.81    0.113561
          s2 |     7.37    0.135750
          s5 |     2.20    0.454745
         bmi |     1.47    0.678813
          bp |     1.34    0.743677
         sex |     1.23    0.815832
-------------+----------------------
    Mean VIF |     3.74

If we are concerned about this multicollinearity, we can try the five-predictor model that BIC chose:

. estat vif

    Variable |      VIF       1/VIF
-------------+----------------------
          s5 |     1.46    0.684663
          s3 |     1.46    0.685455
         bmi |     1.44    0.692867
          bp |     1.35    0.742260
         sex |     1.24    0.807833
-------------+----------------------
    Mean VIF |     1.39

housing.dta contains real estate data for 506 Boston residences. You can obtain the dataset at http://archive.ics.uci.edu/ml/datasets/Housing. Many authors have analyzed this dataset (Frank and Asuncion 2010), and we will compare our analysis results with Ni and Huo (2005). Thirteen predictors are used to predict the median value of the home. Using vselect on the data, we obtain the same models as Ni and Huo (2005). We performed 71 regressions to obtain the optimal models, which is a small fraction of the total possible number of models that could be fit.


. use housing

. vselect y v1-v13, best

Response :           y
Fixed Predictors :
Selected Predictors: v13 v6 v8 v11 v5 v9 v12 v2 v1 v10 v4 v3 v7

Actual Regressions    71
Possible Regressions  8192

Optimal Models Highlighted:

  # Preds     R2ADJ           C         AIC        AICC         BIC
  1        .5432418    362.7529    3286.975    4722.989    3295.428
  2        .6371245    185.6474    3171.542    4607.588    3184.222
  3        .6767036    111.6489    3114.097    4550.183    3131.003
  4        .6878351    91.48526    3097.359    4533.493    3118.492
  5        .7051702    59.75364    3069.439    4505.629    3094.798
  6        .7123567    47.17537    3057.939    4494.195    3087.525
  7         .718256    37.05889    3048.438    4484.767    3082.251
  8        .7222072    30.62398    3042.275    4478.685    3080.314
  9        .7252743    25.86591    3037.638    4474.138    3079.903
  10       .7299149    18.20493    3029.997    4466.595    3076.488
  11       .7348058    10.11455    3021.726    4458.432    3072.445
  12       .7343282    12.00275    3023.611    4460.433    3078.556
  13       .7337897          14    3025.609    4462.554     3084.78

Selected Predictors

  1  : v13
  2  : v13 v6
  3  : v13 v6 v11
  4  : v13 v6 v8 v11
  5  : v13 v6 v8 v11 v5
  6  : v13 v6 v8 v11 v5 v4
  7  : v13 v6 v8 v11 v5 v12 v4
  8  : v13 v6 v8 v11 v5 v12 v2 v4
  9  : v13 v6 v8 v11 v5 v9 v12 v1 v4
  10 : v13 v6 v8 v11 v5 v9 v12 v2 v1 v10
  11 : v13 v6 v8 v11 v5 v9 v12 v2 v1 v10 v4
  12 : v13 v6 v8 v11 v5 v9 v12 v2 v1 v10 v4 v3
  13 : v13 v6 v8 v11 v5 v9 v12 v2 v1 v10 v4 v3 v7

3.3 Census 1980 Stata dataset

Now we will show how to use the weighting and fixed options for vselect by using census13.dta, which can be obtained by typing webuse census13 in Stata or from http://www.stata-press.com/data/r11/census13.dta. This dataset contains one observation per state and records various summary demographic information for the state's population. We wish to predict birthrate brate with the median age, medage; squared median age, medage2; divorce rate, dvcrate; marriage rate, mrgrate; and geographic region of the state. We standardize median age to prevent obvious multicollinearity between its linear and quadratic term, yielding the transformed variables tmedage and tmedage2. The 1980 population of the state, pop, is used as an analytic weight.


. webuse census13
(1980 Census data by state)

. describe region

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------
region          int    %-8.0g      cenreg     Census region

. label list cenreg
cenreg:
           1 NE
           2 N Cntrl
           3 South
           4 West

. generate ne = region == 1

. generate n = region == 2

. generate s = region == 3

. generate w = region == 4

. summarize medage

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      medage |        50       29.54    1.693445       24.2       34.7

. generate tmedage = (medage-r(mean))/r(sd)

. generate tmedage2 = tmedage^2

Invoking vselect on the data, we find that AIC and AICc both select the five-predictor model. BIC differs in that it chooses to exclude the North Central region of the U.S. as a predictor and so chooses a four-predictor model. R2ADJ chose to include the marriage rate as a predictor, yielding a six-predictor model. Mallows's Cp advocates the seven-predictor model when we choose a model with Cp close to the number of predictors + 1. Otherwise, when choosing the smallest Cp value, we will choose the five-predictor model. The level of difference for each criterion from the AIC-chosen predictor size to its own chosen size is minimal. So we choose the five-predictor model. Further investigation will show that this is a valid model. Its variance inflation factors are not problematic, either.


. vselect brate tmedage tmedage2 mrgrate dvcrate n s w [aweight=pop], best

Response :           brate
Fixed Predictors :
Selected Predictors: tmedage tmedage2 n w dvcrate mrgrate s

Actual Regressions    11
Possible Regressions  128

Optimal Models Highlighted:

  # Preds     R2ADJ           C         AIC        AICC         BIC
  1        .6731149    65.51087      397.97    540.3855     401.794
  2        .7937451    24.89423    375.8925    518.6752    381.6285
  3        .8412783     9.88896    363.7191    506.9766    371.3672
  4        .8557213    6.141906    359.8499    503.6973      369.41
  5        .8623259    5.051247    358.3834    502.9439    369.8555
  6        .8625235    6.012409    359.1621    504.5681    372.5463
  7        .8592919           8    361.1473    507.5412    376.4435

Selected Predictors

  1 : tmedage
  2 : tmedage tmedage2
  3 : tmedage tmedage2 w
  4 : tmedage tmedage2 w dvcrate
  5 : tmedage tmedage2 n w dvcrate
  6 : tmedage tmedage2 n w dvcrate mrgrate
  7 : tmedage tmedage2 n w dvcrate mrgrate s

Now suppose that we were forced to include marriage rate as a predictor. We remove it from the predictor list and put it in the fix() option.

. vselect brate tmedage tmedage2 dvcrate n s w [aweight=pop], best fix(mrgrate)

Response :           brate
Fixed Predictors :   mrgrate
Selected Predictors: tmedage tmedage2 n w dvcrate s

Actual Regressions    10
Possible Regressions  64

Optimal Models Highlighted:

  # Preds     R2ADJ           C         AIC        AICC         BIC
  1         .670209    66.15834    399.3598    542.1425    405.0959
  2        .7915307    26.15233    377.3511    520.6086    384.9992
  3        .8385064    11.64741    365.4859    509.3332     375.046
  4        .8565161    6.867985    360.4501    505.0106    371.9222
  5        .8625235    6.012409    359.1621    504.5681    372.5463
  6        .8592919           8    361.1473    507.5412    376.4435

Selected Predictors

  1 : tmedage
  2 : tmedage tmedage2
  3 : tmedage tmedage2 w
  4 : tmedage tmedage2 w dvcrate
  5 : tmedage tmedage2 n w dvcrate
  6 : tmedage tmedage2 n w dvcrate s

Here the optimal model on R2ADJ, AIC, and AICc is the five-predictor model. This is actually a six-predictor model because we have already fixed mrgrate as being in the model.


. regress brate mrgrate tmedage tmedage2 n w dvcrate [aweight=pop]
(sum of wgt is   2.2591e+08)

      Source |       SS       df       MS              Number of obs =      50
-------------+------------------------------           F(  6,    43) =   52.24
       Model |  21242.2364     6  3540.37274           Prob > F      =  0.0000
    Residual |   2914.3087    43   67.774621           R-squared     =  0.8794
-------------+------------------------------           Adj R-squared =  0.8625
       Total |  24156.5451    49  492.990717           Root MSE      =  8.2325

------------------------------------------------------------------------------
       brate |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     mrgrate |  -134.7134   130.6446    -1.03   0.308    -398.1833    128.7566
     tmedage |  -21.11739   1.569742   -13.45   0.000    -24.28307    -17.9517
    tmedage2 |   4.217915   .7312436     5.77   0.000     2.743222    5.692609
           n |    5.03472   2.944985     1.71   0.095    -.9044078    10.97385
           w |   11.92932   3.405185     3.50   0.001     5.062111    18.79653
     dvcrate |   1886.619   735.5317     2.56   0.014     403.2778     3369.96
       _cons |    146.665   4.676581    31.36   0.000     137.2338    156.0962
------------------------------------------------------------------------------

4 Conclusion

We explored both the theory and practice of variable selection in linear regression. Using real datasets, we have demonstrated the use of each flavor of variable selection: forward selection, backward elimination, and best subset selection. Variable selection on weighted linear regression and fixed predictor models was also demonstrated.

The vselect command was fully defined as a method for performing linear regression variable selection in Stata. Its use on each of the three algorithms and contexts of variable selection was demonstrated using a variety of datasets.

5 References

Akaike, H. 1974. A new look at the statistical model identification. IEEE Transactions on Automatic Control 19: 716–723.

Efron, B., T. Hastie, I. Johnstone, and R. Tibshirani. 2004. Least angle regression. Annals of Statistics 32: 407–499.

Frank, A., and A. Asuncion. 2010. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml/datasets/Housing.

Furnival, G. M., and R. W. Wilson. 1974. Regression by leaps and bounds. Technometrics 16: 499–511.

Hastie, T., R. J. Tibshirani, and J. Friedman. 2001. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer.

Hocking, R. R. 1976. A Biometrics invited paper: The analysis and selection of variables in linear regression. Biometrics 32: 1–49.


Hurvich, C. M., and C.-H. Tsai. 1989. Regression and time series model selection in small samples. Biometrika 76: 297–307.

Izenman, A. J. 2008. Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning. New York: Springer.

Lawless, J. F., and K. Singhal. 1978. Efficient screening of nonnormal regression models. Biometrics 34: 318–327.

Mallows, C. L. 1973. Some comments on Cp. Technometrics 15: 661–675.

Ni, X., and X. Huo. 2005. Enhanced leaps-and-bounds method in subset selections with additional optimality tests. http://www3.informs.org/site/qsr/downloadfile.php?i=17e62166fc8586dfa4d1bc0e1742c08b.

Raftery, A. E. 1995. Bayesian model selection in social research. Sociological Methodology 25: 111–163.

Schwarz, G. 1978. Estimating the dimension of a model. Annals of Statistics 6: 461–464.

Sheather, S. J. 2009. A Modern Approach to Regression with R. New York: Springer.

Simonoff, J. S. 2003. Analyzing Categorical Data. New York: Springer.

Tryfos, P. 1998. Methods for Business Analysis and Forecasting: Text and Cases. New York: Wiley.

About the authors

Charles Lindsey is a statistician and software developer at StataCorp. He graduated from the Department of Statistics at Texas A&M with a PhD in May 2010.

Simon Sheather is professor and head of the Department of Statistics at Texas A&M University. Simon's research interests are in the fields of nonparametric and robust statistics and flexible regression methods. In 2001, Simon was named an honorary fellow of the American Statistical Association. Simon is currently listed on ISIHighlyCited.com among the top one-half of one percent of all mathematical scientists in terms of citations of his published works.

The Stata Journal (2010) 10, Number 4, pp. 670–681

Speaking Stata: Graphing subsets

Nicholas J. Cox
Department of Geography
Durham University
Durham, UK

[email protected]

Abstract. Graphical comparison of results for two or more groups or subsets can be accomplished by way of subdivision, superimposition, or juxtaposition. The choice between superimposition (several groups in one panel) and juxtaposition (several groups in several panels) can require fine discrimination: while juxtaposition increases clarity, it requires mental superimposition to be most effective. Discussion of this dilemma leads to exploration of a compromise design in which each subset is plotted in a separate panel, with the rest of the data as a backdrop. Univariate and bivariate examples are given, and associated Stata coding tips and tricks are commented on in detail.

Keywords: gr0046, graphics, subdivision, superimposition, juxtaposition, quantile plots, Gumbel distribution, scatterplots

1 Introduction

A common graphical problem—indeed for many researchers the key graphical problem—is to compare results for two or more groups or subsets of some larger group or set. Results might be measured responses, absolute or relative frequencies, summary statistics, parameter estimates, model figures of merit, or whatever else is worth plotting. We might seek comparisons according to treatment, disease, gender, ethnicity, industry, product, habitat, land use, area, time period, and so forth: you can multiply examples for yourself.

Various strategies, elementary but also fundamental, recur repeatedly in plotting results for different groups.

Subdivision of some whole is the principle behind pie charts, stacked bar charts, and layered area plots. Each group is represented by its own share, relative or absolute as the case may be.

Superimposition of differing points or lines is the principle behind scatterplots, line plots, and other plots in which different groups are denoted by (for example) distinct marker symbols, marker colors, line patterns, or line colors.

Juxtaposition of separate subpanels or panels within a display conveys group comparisons by what Tufte (2001) called small multiples: the same basic design is repeated for each subset, and possibly also for the total set. Other terms in use are trellis and lattice graphics, multipanel graph, and panel charts (Robbins 2010).

© 2010 StataCorp LP gr0046


All these ideas are staples within statistical and scientific graphics, and their advantages and disadvantages have been much discussed in many texts and articles. The books of Tufte (1990, 1997, 2001, 2006) and Cleveland (1993, 1994) remain my own favorites as overviews of the field. Despite that large literature, with many preachers and many precepts, it often seems that only experimentation will show which idea is most effective for a particular dataset.

Subdivision is the strategy likely to be first encountered in graphical education through the pies and bars widely met in childhood. However, only some graphical problems reduce to comparing fractions of a whole.

Superimposition promises the advantage of a common scale that can be used for comparison, yet in practice superimposition can mean confusion. The mixture of different elements may appear mostly as a mess in which patterns are difficult to discern. Tangled line plots are often referred to as spaghetti plots, not always with affection or admiration. Some other term (muesli plots?) seems needed for classified scatterplots that convey much detail but in which systematic differences are hard to decipher.

Juxtaposition—in Stata terms often effected by a by() or over() option—provides separation, which clarifies what is being compared, sometimes at the expense of making that comparison harder work. To work well, juxtaposition still requires mental superimposition. Judging the fine structure of differences between adjacent panels can be difficult enough, and judging differences between panels at opposite ends of a display is evidently even more difficult.

In this column, we will look at a combination of superimposition and juxtaposition, in which subsets are shown separately, but in every case the set as a whole acts as a backdrop. An earlier Stata tip (Cox 2009) emphasized the notion that the whole of the data may serve as a graphical substrate for a particular subset. Repeating information might be criticized as redundant, but rather the idea is that repetition provides reinforcement. Consider a dictum of Tufte (1990, 37): "Simplicity of reading derives from the context of detailed and complex information, properly arranged. A most unconventional design strategy is revealed: to clarify, add detail."

In your own work, you are likely to be able to use color in your talks and possibly also within your reports or even published papers. However, many journals still prohibit or inhibit anything other than black and white and what shades of gray lie between. In this column, we follow such a restriction, using only contrasts discernible by varying gray scale.

2 A univariate example

2.1 Annual maximum windspeeds

A first example concerns data on annual maximum windspeeds for various places in the southeastern United States. My source is Hosking and Wallis (1997, 31); their source is Simiu, Changery, and Filliben (1979). A recipe that has long been a standard in the


statistics of extremes is to focus on the maximums of a variable in each of several blocks of time. A year is a natural block for meteorology and climatology. The data are for varying numbers of years, ranging from 19 years (1958–1976) for Key West, Florida, to 35 years (1943–1977) for Brownsville, Texas, so that we should prefer a common basis for graphical comparison of these univariate samples. windspeed.dta is provided with the media for this issue of the Stata Journal. The dataset includes two variables, windspeed and place. The ordering of the places is by mean maximum windspeed.

Researchers accustomed to such data tend to reach first for quantile plots. The official command quantile is limited to one batch of data at a time. While more versatile user-written alternatives are available (Cox 2005a, 2010), the spirit of this particular column is that you can work out code for yourself using basic commands.

Given an ordered sample of size n for variable y, y(1) ≤ y(2) ≤ ... ≤ y(n−1) ≤ y(n), the usual ordinate and abscissa for a quantile plot are y(i) and (i − a)/(n − 2a + 1), respectively. The abscissa for some choice of a is, in effect, an empirical cumulative probability and is often called a plotting position. The naive choice i/n for plotting position would imply probabilities 1/n and 1 at the ends of the data, while (i − 1)/n would just reverse the problem. Either choice would be awkward, implying that no value can be more extreme than those observed and because theoretical quantiles are often not defined for probabilities 0 or 1. We need, therefore, a slightly more complicated method. The choice of a is the subject of a small but contentious literature, to which Thas (2010) is one entry point. Given a choice, we can implement it for ourselves in Stata. Below we use a = 0—that is, i/(n + 1), as is common in statistics of extremes.

To calculate plotting positions, it is convenient, if not outstandingly efficient, to use egen functions. These functions take care of sorting issues, handling of any missing values, and separate calculations for separate groups:

. use windspeed

. egen rank = rank(windspeed), by(place) unique

. egen count = count(windspeed), by(place)

. generate pp = rank/(count + 1)

. label variable pp "fraction of data"


Figure 1 shows quantile plots, with a separate panel for each place.

. scatter windspeed pp, by(place) yla(, ang(h)) xla(0(.25)1)

[Figure omitted: quantile plots, one panel per place (Brownsville, TX; Macon, GA; Montgomery, AL; Key West, FL; Port Arthur, TX; Corpus Christi, TX); y axis: annual maximum windspeed, mi/hr; x axis: fraction of data]

Figure 1. Quantile plots of annual maximum windspeed data for six places in the southeastern United States

The usual starting point with such data is the fitting of Gumbel distributions, with a distribution function for location parameter (mode) ξ and scale parameter α of

F(y) = exp[−exp{−(y − ξ)/α}]

defined over the real line. Gumbel distributions are named for Emil Julius Gumbel (1891–1966), who did much to systematize knowledge of the statistics of extremes and wrote the first extended monograph on the subject (Gumbel 1958). For more detail on his scientific and political career, see Freudenthal (1967), Hertz (2001), and Brenner (2001).

For context, note that the mean of a Gumbel distribution is ξ + αγ and the standard deviation is απ/√6. Here γ ≈ 0.57721+ is Euler's constant (in Stata, -digamma(1)) and π ≈ 3.14159 (in Stata, _pi or c(pi)) is the even better known constant. Without venturing into numerical fits, the distribution function can easily be inverted, giving

y(F ) = ξ − α ln(− ln F )

so that a plot of y against −ln(−ln F) should be approximately linear with intercept ξ and slope α if y is drawn from a Gumbel distribution. The quantity −ln(−ln F) is thus


often called a Gumbel reduced variate, "reduced" implying unit-free and dimensionless. The resulting plot is a Gumbel plot. For another example and literature references, see Cox (2007a).

Because the plotting position variable pp has already been calculated separately for each place, we can apply functions as just stated algebraically:

. gen gumbel = -ln(-ln(pp))

. label var gumbel "Gumbel reduced variate"

There are two obvious versions of the corresponding graph. Figure 2 separates out different places into different panels:

. scatter windspeed gumbel, by(place) yla(, ang(h))

[Figure omitted: Gumbel plots, one panel per place (Brownsville, TX; Macon, GA; Montgomery, AL; Key West, FL; Port Arthur, TX; Corpus Christi, TX); y axis: annual maximum windspeed, mi/hr; x axis: Gumbel reduced variate]

Figure 2. Gumbel plots of annual maximum windspeed data for six places in the southeastern United States, one panel for each place

Figure 2 is clearly ideal for considering individual places. How easy does it make comparison, however? Conversely, figure 3 separates places by using different point symbols (Cox 2005b). The separate command (see [D] separate) makes this step easier but is not essential because a series of if qualifications could produce the same result.


. separate windspeed, by(place) veryshortlabel

. scatter windspeed? gumbel, ytitle("`: var label windspeed'") yla(, ang(h))
>     legend(pos(11) ring(0) order(6 5 4 3 2 1) col(1))

[Figure omitted: Gumbel plot with places superimposed; y axis: annual maximum windspeed, mi/hr; x axis: Gumbel reduced variate; legend: Corpus Christi, TX; Port Arthur, TX; Key West, FL; Montgomery, AL; Macon, GA; Brownsville, TX]

Figure 3. Gumbel plots of annual maximum windspeed data for six places in the southeastern United States, places being superimposed

I initially ordered the places lowest mean first. Now it becomes evident that the reverse order is needed here. More importantly, graphs like figure 3 appear frequently in books and journals. But how effective are they? Readers will find it easy to understand the principle: given detailed study of the legend, they could study the graph carefully to learn more about contrasts. But will they be encouraged to do that by the design? And how easy would that be? It is a characteristic of these data, like many others, that the groups overlap to some extent. With this design, however, that inevitably implies that some groups are partially obscured on the graph by others.

The resulting dilemma is clear. Figure 2 and figure 3 have corresponding advantages and limitations. Moreover, the example should strike many readers as modest if not minute in size, with just 6 groups and sample sizes between 19 and 35. The problem with more data can be much more serious.

A suggested compromise is this: Show each group separately, but with the rest of the data shown as a backdrop. Figure 4 is the result. Now, as is natural in many ways, the other data provide context for each subset.

. qui forval i = 1/6 {
>     scatter windspeed gumbel if place != `i', ms(Oh) mcolor(gs12)
>     || scatter windspeed gumbel if place == `i', ms(D) mcolor(gs1)
>     yla(, ang(h)) yti("") xti("") legend(off)
>     subtitle("`: label (place) `i''", box fcolor(gs13) bexpand size(medium))
>     name(g`i', replace)
> }


. graph combine g1 g2 g3 g4 g5 g6, imargin(small)
>     l2ti("`: var label windspeed'") b2ti("`: var label gumbel'")

[Figure omitted: six Gumbel plot panels (Brownsville, TX; Macon, GA; Montgomery, AL; Key West, FL; Port Arthur, TX; Corpus Christi, TX); y axis: annual maximum windspeed, mi/hr; x axis: Gumbel reduced variate]

Figure 4. Gumbel plots of annual maximum windspeed data for six places in the southeastern United States. Data for the five other places are shown as a backdrop to the data for each place.

The code is more complicated than for previous graphs, but the logic is still straightforward. The next subsection gives a commentary for those who would like to see matters discussed in greater detail. The details of what you see printed depend on having previously set the Stata Journal graph scheme by typing

. set scheme sj

2.2 Comments on code

qui forval i = 1/6 {

1. We loop using forvalues over the distinct groups of a categorical variable (place). In this case, we know in advance that there are 6 groups, numbered 1–6. In more general situations, we might want to automate the looping. Various techniques exist for doing that. One method is to use levelsof (see [P] levelsof) to produce a list of the distinct groups, followed by a call to foreach; a sketch follows this item. Another, perhaps easier, method is to use the group() function of egen (see [D] egen) to define a grouping variable with positive integer values (Cox 2007b).
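For instance, a minimal sketch of the levelsof route, assuming place is still the grouping variable and abbreviating the graph commands already shown:

levelsof place, local(levels)
foreach i of local levels {
    // same two overlaid scatter commands as above, with `i' as the group value
    display "panel for group `i': `: label (place) `i''"
}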


scatter windspeed gumbel if place != `i', ms(Oh) mcolor(gs12)

2. We first lay down the rest of the data as a backdrop. So, for example, the first time around the loop, when `i' evaluates to 1, we look for place not equal to 1. The data for that complementary subset are shown lightly. The suggestion here is that open circles Oh and a light color gs12 are suitably muted.

It is tempting just to plot all the data as substrate, on the grounds that we will just plot over each subset being emphasized. In principle, that is correct; in practice it is possible for small parts of the symbols in question to be visible even though they lie underneath the symbols to be plotted. So do use the != constraint.

|| scatter windspeed gumbel if place == `i', ms(D) mcolor(gs1)

3. The subset being emphasized is now plotted directly on top. More prominent symbols and colors are needed. (In other problems, line patterns, widths, and colors will be the elements to adjust.)

yla(, ang(h)) yti("") xti("") legend(off)

4. About the options specified, note first that yla(, ang(h)) is a personal choice, although the underlying logic that text is more readable horizontally is a point on which many will agree. More particular to the key themes here are suppression of ytitle(), xtitle(), and legend(). In this problem, the ytitle() and xtitle() would be the same on all six graphs, which is unnecessary. We will see in a moment how to put just one title on both the left and bottom sides of the overall graph. The appearance of a legend would be triggered by the double plotting of windspeed. A legend would not help, so we suppress it, too.

subtitle("`: label (place) `i''", box fcolor(gs13) bexpand size(medium))

5. Evidently, each graph needs some explanatory text. place is a numeric variable with value labels attached, so we use an extended macro function (see [P] macro) to look up the label concerned as we go around the loop. If no label were attached to the numeric value in question, the value itself would be shown instead. This approach does not extend to string variables, except that there is an easy work-around: just map the string variable to a numeric variable with value labels first, using encode (see [D] encode) or the egen function group() that was mentioned earlier.

In the rendering of the text, the extra options box, fcolor(gs13), bexpand, and size(medium) are doubly optional. In this case, they come from peeking at graphs produced with by() in the Graph Editor and so producing a similar overall style.

name(g`i', replace)

6. It is essential that we save each graph for later combining, which is the next step. There is a choice between using name() and using saving(). A side effect of using name() is that each resulting graph remains open in a separate Graph window, so that it can be checked. Although we have not previously used any of these graph


names, writing ", replace" will let you revise this code a little more easily—that is, assuming that you do not write perfect code the first time and every time.

graph combine g1 g2 g3 g4 g5 g6, imargin(small)
    l2ti("`: var label windspeed'") b2ti("`: var label gumbel'")

7. graph combine is used to put the graphs together. In this case with six graphs, the default of two rows and three columns looks fine. We add titles to the combined graph using l2title() and b2title(). The option names denote titles on the left and bottom of the graph (and may evoke nostalgia among longtime Stata users for the graphics syntax used before Stata 8). Notice further how we generalize a step beyond wiring in the particular variable labels: the extended macro calls ensure that Stata looks up the current variable labels, so that the same code can be used even if we change the variable labels. (More general code yet would protect against the possibility that no variable labels have been assigned.)

3 Intermezzo: Advice from Edward Tufte

Make all visual distinctions as subtle as possible, but still clear and effective (Tufte 1997, 73).

Minimal contrasts of the secondary elements (figure) relative to the negative space (ground) will tend to produce a visual hierarchy with layers of inactive background, calm secondary structure, and notable content. Conversely, when everything is emphasized, nothing is emphasized; the design will often be noisy, cluttered, and informationally flat (Tufte 1997, 74).

4 A bivariate example: Cirque lengths, widths, and grades

For a bivariate example, we examine some data similar to a dataset used in Cox (2005b). The data (Evans and Cox 1995) refer to the lengths and widths of cirques in the Lake District, England. cumbrian_cirques.dta is provided with the media for this issue of the Stata Journal. Cirques are armchair-shaped hollows formerly occupied by glaciers. Length and width are basic quantitative measures of their size. Logarithmic scales are standard for such data. Here we also bring in grade, a judgment-based variable of how well developed each feature is on a five-point ordered scale from classic to poor.

Figure 5 is a standard scatterplot. Because we will have five scatterplots for each grade, we might as well use the total suboption to add a panel for all the data combined. An ad hoc refinement is to insist on an extra space in the axis label at 2000 meters to prevent the last digit from being elided in the combined display.


. use cumbrian_cirques, clear

. scatter width length, by(grade, total) xsc(log) ysc(log) ms(Oh)
>     xla(200 500 1000 2000 "2000 ") yla(200 500 1000 2000)

[Figure omitted: scatterplot panels classic, well-defined, definite, poor, marginal, and Total; y axis: width (m); x axis: length (m)]

Figure 5. Scatterplots of cirque width and length by grade for the English Lake District

In the compromise design, most of the small code tricks are the same, but as with figure 5 we add a further display of all the data. So that you have a comparison of style with earlier graphs, we will leave the subtitle area unboxed. Figure 6 is the result.

. forval i = 1/5 {
>     scatter width length if grade != `i', xsc(log) ysc(log) ms(Oh) mcolor(gs12)
>     xla(200 500 1000 2000 "2000 ") yla(200 500 1000 2000)
>     || scatter width length if grade == `i', xsc(log) ysc(log) ms(D) mcolor(gs1)
>     yla(, ang(h)) yti("") xti("") legend(off)
>     subtitle("`: label (grade) `i''", size(medium)) name(g`i', replace)
> }

. scatter width length, xsc(log) ysc(log) ms(D) mcolor(gs1)
>     xla(200 500 1000 2000 "2000 ") yla(200 500 1000 2000)
>     yla(, ang(h)) yti("") xti("") legend(off)
>     subtitle("all cirques", size(medium)) name(g6, replace)

. graph combine g1 g2 g3 g4 g5 g6, imargin(small)
>     l2ti("`: var label width'") b2ti("`: var label length'")


[Figure omitted: scatterplot panels classic, well-defined, definite, poor, marginal, and all cirques; y axis: width (m); x axis: length (m)]

Figure 6. Scatterplots of cirque width and length by grade for the English Lake District. Data for the four other grades are shown as a backdrop to the data for each grade.

5 Conclusions

The conclusions lie with you, the reader. This column has a flavor of experiment. Do you think that the compromise design—which might be called a subset and substrate design—has any advantages over alternatives for the examples here? Do you think that you can use the ideas here to improve your own comparative displays? Many graph types lie open for exploration, including especially attempts to more easily and more effectively see fine structure in spaghetti plots.

6 References

Brenner, A. D. 2001. Emil J. Gumbel: Weimar German Pacifist and Professor. Boston: Brill.

Cleveland, W. S. 1993. Visualizing Data. Summit, NJ: Hobart.

———. 1994. The Elements of Graphing Data. Rev. ed. Summit, NJ: Hobart.

Cox, N. J. 2005a. Speaking Stata: The protean quantile plot. Stata Journal 5: 442–460.

———. 2005b. Stata tip 27: Classifying data points on scatter plots. Stata Journal 5: 604–606.

———. 2007a. Stata tip 47: Quantile–quantile plots without programming. Stata Journal 7: 275–279.


———. 2007b. Stata tip 52: Generating composite categorical variables. Stata Journal 7: 582–583.

———. 2009. Stata tip 78: Going gray gracefully: Highlighting subsets and downplaying substrates. Stata Journal 9: 499–503.

———. 2010. Software Updates: gr42_5: Quantile plots, generalized. Stata Journal 10: 691–692.

Evans, I. S., and N. J. Cox. 1995. The form of glacial cirques in the English Lake District, Cumbria. Zeitschrift für Geomorphologie 39: 175–202.

Freudenthal, A. M. 1967. Emil J. Gumbel. American Statistician 21(1): 41.

Gumbel, E. J. 1958. Statistics of Extremes. New York: Columbia University Press.

Hertz, S. 2001. Emil Julius Gumbel. In Statisticians of the Centuries, ed. C. C. Heyde and E. Seneta, 406–410. New York: Springer.

Hosking, J. R. M., and J. R. Wallis. 1997. Regional Frequency Analysis: An Approach Based on L-Moments. Cambridge: Cambridge University Press.

Robbins, N. B. 2010. Trellis display. Wiley Interdisciplinary Reviews: Computational Statistics 2: 600–605.

Simiu, E., M. J. Changery, and J. J. Filliben. 1979. Extreme wind speeds at 129 stations in the contiguous United States. Building Science Series 118, National Bureau of Standards.

Thas, O. 2010. Comparing Distributions. New York: Springer.

Tufte, E. R. 1990. Envisioning Information. Cheshire, CT: Graphics Press.

———. 1997. Visual Explanations: Images and Quantities, Evidence and Narrative. Cheshire, CT: Graphics Press.

———. 2001. The Visual Display of Quantitative Information. 2nd ed. Cheshire, CT: Graphics Press.

———. 2006. Beautiful Evidence. Cheshire, CT: Graphics Press.

About the author

Nicholas Cox is a statistically minded geographer at Durham University. He contributes talks, postings, FAQs, and programs to the Stata user community. He has also coauthored 15 commands in official Stata. He wrote several inserts in the Stata Technical Bulletin and is an editor of the Stata Journal.

The Stata Journal (2010) 10, Number 4, pp. 682–685

Stata tip 68: Week assumptions

Nicholas J. Cox
Department of Geography
Durham University
Durham, UK

[email protected]

1 Introduction

Stata’s handling of dates and times is centered on daily dates. Days are aggregated into weeks, months, quarters, half-years, and years. Days are divided into hours, minutes, and seconds. This may all sound simple in principle, leaving just the matter of identifying the syntax for converting from one form of representation to another. For an introduction, see [U] 24 Working with dates and times. For more comprehensive treatments, see [D] dates and times and [D] functions. As a matter of history, know that specific date functions were introduced in Stata 4 in 1995, replacing an earlier system based on ado-files. These date functions were much enhanced in Stata 6 in 1999 and again in Stata 10 in 2007.

However, matters are not quite as simple as this description implies. The occasional addition of leap seconds lengthening the year is a complication arising with some datasets. A little thought shows that weeks are also awkward: whatever the precise definition of a week, weeks are not guaranteed to nest neatly and evenly into months, quarters, half-years, or years. This tip focuses on Stata’s solution for weeks and on how to set up your own alternatives given different definitions of the week. A conventional Western calendar with seven-day weeks is assumed throughout this tip. For much richer historical, cultural, and computational context, see Richards (1998), Holford-Strevens (2005), and Dershowitz and Reingold (2008).

Gabriel Rossman is thanked for providing stimulating comments.

2 Stata’s definition of weeks

Stata’s definition of weeks matches one desideratum and violates others, as would any other definition. For Stata, week 1 of any year starts on 1 January, whatever day of the week that is (Sunday through Saturday).

. display %tw wofd(mdy(1,1,2010))
2010w1

. display %tw wofd(mdy(1,7,2010))
2010w1

. display %tw wofd(mdy(1,8,2010))
2010w2

© 2010 StataCorp LP dm0052


In these examples, two date functions are used: mdy() yields daily dates from month, day, and year components, and wofd() converts such dates to the corresponding weeks. Most users prefer to see dates shown intelligibly with date display formats such as %tw, thus hiding the underlying Stata machinery, which pivots on a convention that 1 January 1960 is date origin or day 0.

At the end of the year, for Stata the 52nd week always lasts 8 days in nonleap years and 9 days in leap years. (Recall that 52 × 7 = 364, so a calendar year always has either 1 day or 2 days more than that.)

This definition ensures that weeks nest within years, meaning that no week ever starts in one calendar year and finishes in the next. Also by this definition, a year has precisely 52 weeks, even though the last week is never 7 days long. Naturally, this solution (and indeed any other) cannot ensure that weeks always nest within months, quarters, or half-years. (February is sometimes an exception, but necessarily the only one.)
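As a quick check (these commands are added here for illustration, and the output shown is what the definition above implies), 31 December 2010 falls in week 52, and that week starts on 24 December:

. display %tw wofd(mdy(12,31,2010))
2010w52

. display %td dofw(yw(2010,52))
24dec2010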

3 Alternative assumptions

Users who deal with weeks may wish to work with other definitions. The most obvious alternatives arise from regarding particular days of the week as defining either the start or the end of the week. With these definitions, the relation of weeks to the days of the week is fixed, and all weeks last precisely 7 days. However, other desiderata are violated: notably, weeks may now span two calendar years.

Such definitions are most likely to appeal whenever weekly cycles are part of what is being investigated and particular days have meaning. Examples from major religions will be familiar. Particular days of the week are often key for financial transactions. More parochially, Durham University teaching weeks start on Thursdays in the first term of each academic year and on Mondays in the other two terms.

With such alternatives, there is no need to set up a new numbering system for weeks, at least as far as working within Stata is concerned. Each week can just be identified by its start or end date, whichever is desired. If you want to map the weeks that do occur in your dataset to a simple numbering scheme, egen, group() does this easily.
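For example, a minimal sketch, assuming a variable that holds each observation’s week-starting date, such as the fristart variable constructed below:

. egen weeknum = group(fristart)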

Classifying weeks by their start days is an exercise in rounding down, and classifying them by their end days is one in rounding up, so solutions using floor() and ceil() are possible; see Cox (2003). However, a solution using a date is likely to seem more attractive.

Consider this pretend dataset of 8 days in 2010.

. clear

. set obs 8
obs was 0, now 8

. gen day = _n + mdy(9,25,2010)

. gen day2 = day


. format day %tdDay

. format day2 %tdd_m

. gen dow = dow(day)

. list

     +------------------------+
     | day      day2      dow |
     |------------------------|
  1. | Sun    26 Sep        0 |
  2. | Mon    27 Sep        1 |
  3. | Tue    28 Sep        2 |
  4. | Wed    29 Sep        3 |
  5. | Thu    30 Sep        4 |
     |------------------------|
  6. | Fri     1 Oct        5 |
  7. | Sat     2 Oct        6 |
  8. | Sun     3 Oct        0 |
     +------------------------+

The Stata function dow() returns day of the week coded 0 for Sunday through 6 for Saturday. The dates 26 September 2010 and 3 October 2010 were Sundays. Now we have within reach one-line solutions for various problems, given here in terms of the example daily date variable day.

The simplest case is classifying days in each week by the Sundays that start them. Glancing at the listing above shows that we just need to subtract dow(day) from day:

. gen sunstart = day - dow(day)

Working with other days can be thought of as rotating the days of the week to an origin other than Sunday. One good way to do that is using the versatile function mod() (Cox 2007). If d is one of 1, . . . , 6, then mod(dow(day) - d, 7) rotates the results of dow() so that day d is the new origin. d = 0 leaves the days as they were.
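As a small illustrative sketch, a display loop makes the rotation visible for a Friday origin, d = 5: the offset is 0 for Fridays, 1 for Saturdays, 2 for Sundays, and so on up to 6 for Thursdays.

forvalues d = 0/6 {
    display "dow = `d´ gives offset " mod(`d´ - 5, 7)
}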

Hence if we wish to classify days by starting Fridays, subtract mod(dow(day) - 5, 7) from day:

. gen fristart = day - mod(dow(day) - 5, 7)

This is consistent with the simpler rule for Sunday. As just implied, mod(dow(day) - 0, 7) is identical to mod(dow(day), 7) and in turn to dow(day).

Now consider classifying weeks by the days that end them. We now need to add an appropriate number between 0 and 6 to the date. That number will cycle from 6 to 0 to 6, rather than from 0 to 6 to 0; it is given by mod(d - dow(day), 7). So, Friday-ending weeks are given by

. gen friend = day + mod(5 - dow(day), 7)

This trick also offers a good way to generate ending Sundays:

. gen sunend = day + mod(0 - dow(day), 7)


The zero in the line just above is not necessary but is left in to emphasize the resemblance to the previous line.

Finally, let’s show the results of these calculations. There are several other ways to compute them, but using mod() has its attractions. This method could also be extended to weeks that are not 7 days long, which do arise in certain research problems.

. format su* fr* %tdd_m

. list day* *start *end

     +--------------------------------------------------------+
     | day      day2   sunstart   fristart   friend    sunend |
     |--------------------------------------------------------|
  1. | Sun    26 Sep     26 Sep     24 Sep    1 Oct    26 Sep |
  2. | Mon    27 Sep     26 Sep     24 Sep    1 Oct     3 Oct |
  3. | Tue    28 Sep     26 Sep     24 Sep    1 Oct     3 Oct |
  4. | Wed    29 Sep     26 Sep     24 Sep    1 Oct     3 Oct |
  5. | Thu    30 Sep     26 Sep     24 Sep    1 Oct     3 Oct |
     |--------------------------------------------------------|
  6. | Fri     1 Oct     26 Sep      1 Oct    1 Oct     3 Oct |
  7. | Sat     2 Oct     26 Sep      1 Oct    8 Oct     3 Oct |
  8. | Sun     3 Oct      3 Oct      1 Oct    8 Oct     3 Oct |
     +--------------------------------------------------------+
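The extension to periods other than 7 days mentioned above follows the same rounding-down logic. As a minimal sketch, assuming 10-day periods anchored at 1 January 2010 (both choices are arbitrary illustrations), each day could be classified by the date starting its period:

. gen start10 = day - mod(day - mdy(1,1,2010), 10)

. format start10 %tdd_m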

References

Cox, N. J. 2003. Stata tip 2: Building with floors and ceilings. Stata Journal 3: 446–447.

———. 2007. Stata tip 43: Remainders, selections, sequences, extractions: Uses of the modulus. Stata Journal 7: 143–145.

Dershowitz, N., and E. M. Reingold. 2008. Calendrical Calculations. 3rd ed. Cambridge: Cambridge University Press.

Holford-Strevens, L. 2005. The History of Time: A Very Short Introduction. Oxford: Oxford University Press.

Richards, E. G. 1998. Mapping Time: The Calendar and Its History. Oxford: Oxford University Press.

Editors’ note. Some alert readers have noted that there was no Stata tip 68; we moved from 67 to 69 without explanation. This was pure oversight on the part of the Stata Journal. In Stata tradition, we now fix this bug with an apology, particularly so that any future compilations of tips include precisely the number advertised.

The Stata Journal (2010) 10, Number 4, pp. 686–688

Stata tip 92: Manual implementation of permutations and bootstraps

Lars Angquist
Institute of Preventive Medicine
Copenhagen University Hospitals
Copenhagen, Denmark
[email protected]

In mathematics, a permutation might be seen as a reordering of an ordered set of abstract elements (see, for example, Fraleigh [2002]),1 whereas in data analysis, when facing empirical data, this concept may correspond to a reordering of an ordered set of observations. Loosely speaking, in statistics and significance testing, this is an interesting concept when simulating under a null hypothesis corresponding, in some sense, to a null association or null effect of one specific variable (most often an outcome) with respect to another one. Here one basically keeps the dataset constant except for the values of the core variable, which are instead randomly permuted. Because all permutations are generally equally likely under the null hypothesis of no association (at least if potential confounding is properly dealt with), such simulations are a way of estimating the null distribution underlying, for instance, related p-values.

For similar reasons, one may apply the bootstrap simulation procedure. Here one does not reorder observations (or, in general, elements) but rather simulates from the empirical distribution based on this very set. In simulation terminology, the bootstrap and permutation procedures in this sense correspond to a uniformly random selection of values from the empirical distribution with and without replacement, respectively. For more information, see, for example, Manly (2007) for permutations, Davison and Hinkley (1997) for bootstraps, and Robert and Casella (2004) for stochastic simulation in general.

In Stata, one may, given some assumed framework, use the commands permute and bootstrap to perform tasks related to permutation-based and bootstrap-based significance tests, respectively. Sometimes, however, whether because one needs to be more specific or because one simply wants more detailed control over the actual data manipulations, it might be preferable to perform some related manual labor at your computer keyboard. This tip is about the general structure of a solution for such a task.

Permuting: Assume that you have a variable of interest, permvar, that you want to permute in the sense noted above. Typing

1. The set of all possible reorderings (permutations) includes the permutation that actually leaves the order intact. This is called the identity permutation.

© 2010 StataCorp LP st0214


generate id=_n
generate double u=runiform()
sort u
local type: type permvar
generate `type´ upermvar=permvar[id]

in Stata will give you an additional column (upermvar) of permuted values. In the first command, a new variable, id, that corresponds to the current sort order is created.2 In the second command, a column is generated with values uniformly distributed between 0 and 1. Because the values of u were randomly generated, sorting on u puts the observations in a random order. The next command saves the variable type of permvar in the local macro type so that the type can be applied to the new variable in the last command. The last command stores the permutation in the new variable upermvar: each new value is a value of permvar from a randomly selected observation. (The random selection is controlled by the id variable, which was put in a random order by the sort command.)

To reduce the risk of tied values with respect to the (inherently discrete) random draws, and moreover to further increase the, so to speak, randomness of the derived values, one might replace code lines 2–3 with the following:

...
generate double u1=runiform()
generate double u2=runiform()
sort u1 u2
...

The randomness reference corresponds to the fact that computer-generated random numbers are random only to the extent permitted by the implementation of what is termed pseudorandom numbers (see, for instance, Knuth [1998]). To achieve reproducible results, one might take advantage of this pseudorandomness by explicitly stating a starting point, that is, a seed value, for the deterministic algorithm:

set seed 760130

The number must be a positive integer. For instance, this command might be used when assuring that different methods give equivalent results or, for example, with respect to estimated variances of certain derived estimates of interest, when comparing methods with respect to efficiency performance.3
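As a small illustration, reissuing the same seed reproduces the same draws, so the two display commands below print the identical value:

set seed 760130
display runiform()
set seed 760130
display runiform()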

2. In other words, this construction is based on the observation number indicator _n, which equals 1, 2, . . . , N through the present observations (rows), where N is the number of observations in the dataset (generally reachable in a similar fashion through _N in Stata). Moreover, one approach to retaining a sort order, irrespective of the content of an executed program, is by taking advantage of the sortpreserve option (see help program or Newson [2004]).

3. You might use your personal birthdate as an easily remembered seed value. This is in fact used in the above case, though I am not revealing which date format I used; see help dates and times. I thank Claus Holst for this tip!


Bootstrapping: A related but slightly different variant of the above schedule might be used to derive a bootstrapped variable called ubootsvar. It is based on the empirical distribution formed or constituted by the present observations of the original variable bootsvar.

generate u=ceil(runiform()*_N)
generate ubootsvar=bootsvar[u]

Here the uniformly distributed values are not used to decide on a sort order (the underlying index values), but rather to directly constitute index values by making them be part of a uniformly distributed simulation of values on the integers 1, 2, . . . , N. To achieve this, the so-called ceiling function, ceil(), is used. For more information on runiform(), see help runiform or Buis (2007)4 (with respect to its use for simulations); further, ceil() and the related floor() function are described in Cox (2003).

Moreover, one might implement the above code structures into loops based on, for instance, foreach or forvalues; a sketch of such a loop follows below. Under such circumstances, one might also take advantage both of temporary variables (see help tempvar) and of the specific matrix-oriented environment of Mata (see help mata), though the general structure described here might to some extent serve as a guideline or a template for such cases as well. Once ready, strap your boots and let the permutation begin!
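A minimal sketch of such a loop (the 1,000 replicates and the placeholder comment marking where each replicate’s statistic would be computed and stored are illustrative choices, not prescriptions):

set seed 760130
local type: type permvar
forvalues r = 1/1000 {
    tempvar id u uperm
    quietly {
        generate long `id´=_n
        generate double `u´=runiform()
        sort `u´
        generate `type´ `uperm´=permvar[`id´]
        * compute and store the statistic of interest for replicate `r´ here
        drop `id´ `u´ `uperm´
    }
}

Dropping the temporary variables at the end of each pass keeps them from accumulating across replicates when the loop is run interactively or from a do-file.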

References

Buis, M. L. 2007. Stata tip 48: Discrete uses for uniform(). Stata Journal 7: 434–435.

Cox, N. J. 2003. Stata tip 2: Building with floors and ceilings. Stata Journal 3: 446–447.

Davison, A. C., and D. V. Hinkley. 1997. Bootstrap Methods and Their Application. Cambridge: Cambridge University Press.

Fraleigh, J. B. 2002. A First Course in Abstract Algebra. 7th ed. Reading, MA: Addison–Wesley.

Knuth, D. E. 1998. The Art of Computer Programming, Volume 2: Seminumerical Algorithms. 3rd ed. Reading, MA: Addison–Wesley.

Manly, B. F. J. 2007. Randomization, Bootstrap and Monte Carlo Methods in Biology. 3rd ed. Boca Raton, FL: Chapman & Hall/CRC.

Newson, R. 2004. Stata tip 5: Ensuring programs preserve dataset sort order. Stata Journal 4: 94.

Robert, C. P., and G. Casella. 2004. Monte Carlo Statistical Methods. 2nd ed. New York: Springer.

4. The uniform() function was improved in Stata 11 and was renamed runiform().

The Stata Journal (2010) 10, Number 4, pp. 689–690

Stata tip 93: Handling multiple y axes on twoway graphs

Vince Wiggins
StataCorp
College Station, TX

[email protected]

Sometimes users find it difficult to handle multiple y axes on their twoway graphs. The main issue is controlling the side of the graph, left or right, where each axis is placed.

Here is a contrived example that exhibits the issue:

. sysuse auto
(1978 Automobile Data)

. collapse (mean) mpg trunk, by(length foreign)

. twoway bar mpg length, yaxis(2) || line trunk length, yaxis(1)

(Graph appears here: bars of (mean) mpg and a line of (mean) trunk plotted against Length (in.); the (mean) mpg axis, yaxis(2), is drawn on the left and the (mean) trunk axis, yaxis(1), on the right.)

We might want yaxis(1) to be on the left of the graph and yaxis(2) to be on the right of the graph, but twoway insists on putting yaxis(2) on the left and yaxis(1) on the right. We could achieve what we want by reversing the order of the two plots, but the bars then occlude the lines, and who wants that?

It might be surprising, but the number assigned to an axis has nothing to do with its placement on the graph. twoway places the axes in the order in which it encounters them, with no consideration of their assigned number. How authoritarian! Consider twoway’s problem: when it sees yaxis(2), it cannot be sure that it will ever see a yaxis(1). Moreover, twoway will let you create more than two y axes, and in that case it just stacks them up on the left of the graph like cordwood.
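To make that stacking visible, here is a sketch that extends the example with a third mean variable, turn, and a third y axis (an illustrative addition, not part of the original example); by default, both the yaxis(2) and yaxis(3) scales then end up on the left:

. sysuse auto, clear

. collapse (mean) mpg trunk turn, by(length foreign)

. twoway bar mpg length, yaxis(2) || line trunk length, yaxis(1)
>     || line turn length, yaxis(3)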

© 2010 StataCorp LP gr0047


Do not worry. Although we may not like twoway’s rules, we can alter them. If we want any axes to appear in a different position, we just tell twoway to move them to the alternate (other) side of the graph using the yscale(alt) option. In this example, if we do not like the position of either y axis, we will need to tell each of them to switch to the other side.

. twoway bar mpg length, yaxis(2) ||
>     line trunk length, yaxis(1) yscale(alt) yscale(alt axis(2))

(Graph appears here: the same plots with both y axes switched, so that the (mean) mpg axis now appears on the right and the (mean) trunk axis on the left; x axis Length (in.).)

We typed just yscale(alt) rather than the more explicit (but still valid) yscale(alt axis(1)) because axis(1) is the default whenever we do not specify an axis. To alter the side where axis(2) appears, we had to be explicit about the axis number and type yscale(alt axis(2)).

If your axis is not where you want it, tell it to alter itself.1

1 Acknowledgment

I would like to thank Stata Journal editor Nicholas Cox for the initial adaptation of this tip from a Statalist posting, though Nick bears no responsibility for any remaining errors or puns.

1. The Stata Journal editors, against much stiff opposition, declare this to be the worst pun so far in the history of the Stata Journal.

The Stata Journal (2010) 10, Number 4, pp. 691–692

Software Updates

dm0038 1: kountry: A Stata utility for merging cross-country data from multiple sources. R. Raciborski. Stata Journal 8: 390–400.

kountry, kountryadd, kountrybackup, and kountryrestore failed when the path to the PLUS directory contained spaces. This has been fixed.

dm0048 1: Finding variables. N. J. Cox. Stata Journal 10: 281–296.

The not option gave incorrect results if a varlist was specified. This has been fixed.

gr42 5: Quantile plots, generalized. N. J. Cox. Stata Journal 6: 597; 5: 471; 4: 97. Stata Technical Bulletin 61: 10–11; 51: 16–18. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 55–56; vol. 9, pp. 113–116.

The handling of axis titles and missing values has been improved in various minor respects. The main program, qplot, no longer depends on an extra program supplied, but not documented, in previous versions.

pr0041 1: Speaking Stata: Correlation with confidence, or Fisher’s z revisited. N. J. Cox. Stata Journal 8: 413–439.

In some circumstances, corrci could exit without saving r-class results. Various minor fixes have been made to the saving of results in corrci to correct this.

st0015 6: Concordance correlation coefficient and associated measures, tests, and graphs. T. J. Steichen and N. J. Cox. Stata Journal 8: 594; 7: 444; 6: 284; 5: 470; 4: 491; 2: 183–189. Stata Technical Bulletin 58: 9; 54: 25–26; 45: 21–23; 43: 35–39. Reprinted in Stata Technical Bulletin Reprints vol. 10, p. 137; vol. 9, pp. 169–170; vol. 8, pp. 137–145.

Various minor updates have been made to the graphics options and their documentation, principally making explicit support for plot() and addplot() and allowing systematic control of reference line presentation through a new collective suboption, lopts().

st0035 1: Multiple-test procedures and smile plots. R. Newson and the ALSPAC Study Team. Stata Journal 3: 109–132.

The smileplot package has been updated to Stata 10. The online help files now have the .sthlp extension. Also, a new addplot() option has been added to the smileplot command, allowing users to superimpose additional plots on a smile plot in the standard manner for Stata. (The old plot() option, which formerly performed the same function, still works but is deprecated as obsolete by the author.)

© 2010 StataCorp LP up0030


st0043 2: Confidence intervals and p-values for delivery to the end user. R. Newson. Stata Journal 3: 359; 245–269.

The parmest package has been updated to Stata 11 to take advantage of the new capabilities for factor variables and multiple imputation. New options omit and empty have been added to generate in the output dataset indicator variables of the same names, which indicate that the parameter corresponding to the output observation is omitted or corresponds to an empty factor-value combination, respectively. A new option, msetype, has been added to generate in the output dataset a string variable of the same name, which contains the matrix stripe element type of the corresponding parameter and indicates whether this parameter is a variable, error, factor, interaction, or product parameter. The degrees of freedom can now be made to vary between parameters in the same estimation, and this is the default behavior if the estimation results are from the new Stata 11 mi estimate command, which uses a degrees-of-freedom vector. New options bmatrix(), vmatrix(), and dfmatrix() have been added to allow the user to extract the estimates, variances, and degrees of freedom, respectively, from matrices other than the default estimation results. The dof() option, which specifies scalar degrees of freedom, can now be a scalar instead of a constant number.

A new saved result, r(dofpres), now indicates that degrees of freedom were used to calculate the confidence limits and p-values, instead of using the normal distribution.

st0145 1: Production function estimation in Stata using the Olley and Pakes method. M. Yasar, R. Raciborski, and B. Poi. Stata Journal 8: 221–231.

The command now supports more than two state variables. Second-degree polynomial expansion has been replaced with third-degree polynomial expansion. Support for predict has been added.

st0150 3: A Stata package for the estimation of the dose–response function through adjustment for the generalized propensity score. M. Bia and A. Mattei. Stata Journal 9: 652; 8: 594; 8: 354–373.

This update corrects and updates all files to work with Stata 11.

st0182 1: Direct and indirect effects in a logit model. M. L. Buis. Stata Journal 10: 11–29.

This update of ldecomp fixes a bug that resulted in an error in the table of observed proportions when one or more control variables were specified. This table is displayed when the obspr option is specified and returned as the matrix e(prop_obs). Fortunately, this error did not propagate to the other tables.

This update also fixes a hard-to-understand error message given when the list of control variables included one or more variables that were also specified in the direct() and indirect() options. In the new version, these variables will automatically be removed from the list of control variables.