
    Logistic Regression for Two-Stage Case-Control Data
    Author(s): N. E. Breslow and K. C. Cain
    Source: Biometrika, Vol. 75, No. 1 (Mar., 1988), pp. 11-20
    Published by: Biometrika Trust
    Stable URL: http://www.jstor.org/stable/2336429
    Accessed: 28/03/2010 00:47




  • Biometrika (1988), 75, 1, pp. 11-20 Printed in Great Britain

    Logistic regression for two-stage case-control data

    BY N. E. BRESLOW AND K. C. CAIN Department of Biostatistics, University of Washington,

    Seattle, Washington 98195, U.S.A.


    Samples of diseased cases and nondiseased controls are drawn at random from the population at risk. After classification according to the exposure of interest, subsamples of cases and controls are selected for purposes of covariable ascertainment. A modification of the usual logistic regression analysis yields consistent estimates of covariable-adjusted relative risks and their standard errors. By balancing the numbers of exposed and nonexposed for whom covariable information is ascertained within case and control samples, some efficiency may be gained over the usual single stage design, particularly when the exposure is rare and the relative risks associated with the covariables are large. The procedure may be useful also when covariable information is missing for a large part of the sample.

    Some key words: Asymptotic efficiency; Conditional maximum likelihood; Choice based sample; Case control study; Cohort study; Design of medical study; Epidemiology; Logistic regression; Missing data; Retrospective study.


    Consider an epidemiological study where the exposure of primary interest and the disease outcome are known for a large number of subjects. Because of their high cost, however, the covariable data needed to adjust for confounding effects can only be collected for a much smaller number. For example, radiation dosimetry and lung disease status may be available in computer records for tens of thousands of nuclear energy workers, but the smoking histories required to make sense of the association require abstraction of medical records or personal interviews and are therefore potentially available only for a few hundred subjects.

    A typical approach to this problem is to measure the covariables for subsamples of diseased cases and nondiseased controls, analysing the resultant data by logistic regression so as to estimate covariable-adjusted relative risks (Breslow & Day, 1980). Unfortunately, this has two drawbacks. First, it ignores the data on the marginal association between exposure and disease that are available at the 'first stage' of sampling. Secondly, when the exposure is rare, the case-control subsample may not be as informative as one in which exposed cases and controls are deliberately overrepresented.

    This paper develops the modifications of logistic regression analysis needed to take advantage of the extra information available in the first stage sample and to adjust for possible bias caused by oversampling of exposed persons at the second stage. It also investigates the efficiency of case-control subsamples relative to those that are balanced for both disease status and exposure.


    Table 1. Sizes of first and second stage samples in a study of coronary heart disease.

                    First stage                Second stage
                Diseased    Nondiseased      Cases    Controls
    Male          14867        2971           249       250
    Female         2944        2934           249       248

    Table 1 is a simple example using data from a large register of patients evaluated for coronary heart disease by angiography (Vlietstra et al., 1980). This table and a data file containing covariable information for the 996 patients selected at the second stage were made available to students as part of an examination. Some students produced the naive analysis shown in Table 2. Since no account was taken of the first stage data, the regression coefficient and standard error estimated for male sex are both seriously biased. By contrast, the adjusted relative risk of exp(1.659) = 5.25 from the correct analysis agrees rather well with the marginal odds ratio of 5.0 determined from the first stage data. The decrease in the standard error from 0.144 to 0.077 reflects the additional information on sex and disease contained in those data. Results developed in § 4 below show that in simple cases like this, with a single binary exposure variable, no adjustment to the covariable coefficients and standard errors obtained from the naive analysis is needed; the information about such effects is contained entirely in the second stage data.
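As a quick check on the figures quoted above, the marginal odds ratio can be recomputed from the first stage counts in Table 1. The sketch below is illustrative only; the variable and function names are ours, not the authors'.

```python
# First stage counts from Table 1: (diseased, nondiseased) by sex.
first_stage = {"male": (14867, 2971), "female": (2944, 2934)}
# Second stage counts from Table 1: (cases, controls) by sex.
second_stage = {"male": (249, 250), "female": (249, 248)}


def odds_ratio(counts):
    """Male-versus-female odds ratio for disease from a 2 x 2 table of counts."""
    d_m, nd_m = counts["male"]
    d_f, nd_f = counts["female"]
    return (d_m * nd_f) / (nd_m * d_f)


marginal_or = odds_ratio(first_stage)    # roughly 5, as quoted in the text
subsample_or = odds_ratio(second_stage)  # near 1: the subsample is balanced by design

print(round(marginal_or, 2), round(subsample_or, 2))
```

The crude odds ratio in the balanced second stage sample is close to one by construction, which is why an analysis that ignores the first stage counts cannot recover the association with sex.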


    Suppose the probability of disease occurrence in the population satisfies the logistic model

    $$P_1(x; \beta) = \mathrm{pr}(D = 1 \mid x) = 1/\{1 + \exp(-x\beta)\}, \tag{1}$$

    where $x$ includes exposure variables, covariables, interactions and a constant term, where $\beta$ is a $(p + 1)$-vector of regression coefficients, and where $D = 1$ denotes the diseased and $D = 0$ the nondiseased state. Our goal is to develop consistent and asymptotically normal estimates of $\beta$, and also to provide some guidance on the appropriate number of subjects to draw from each stratum, when the data are collected according to the following two-stage sampling scheme. First, $N_1$ cases and $N_0$ controls are sampled from the population at risk. These may represent separate samples of cases and controls (Anderson, 1972) or else simply the numbers of diseased and nondiseased subjects observed in a simple random sample drawn at the first stage. In either event, they are further classified into $J$ strata on the basis of their exposure history and we denote by $N_{1j}$ the number of cases and by $N_{0j}$ the number of controls having stratum $S = j$. Then, at the second stage, subsamples of size $n_{ij}$ are selected, one from each of the $2J$ disease/exposure stratum combinations. We denote by $x_{ijk}$ the value of $x$ that is measured for the $k$th subject in the $(i, j)$th subsample, where $i = 0$ or $1$, $j = 1, \ldots, J$ and $k = 1, \ldots, n_{ij}$.

    For the simple situation where the regression variables consist of an indicator $S$ of exposure stratum, $S = 1$ for nonexposed and $S = 2$ for exposed, and a discrete covariable factor $F$ that takes values $l = 1, \ldots, L$, White (1982) shows how to combine the data from the first and second stage samples to estimate the relative risks. Let

    $$\psi_l = \frac{\mathrm{pr}(D = 1 \mid S = 2, F = l)\,\mathrm{pr}(D = 0 \mid S = 1, F = l)}{\mathrm{pr}(D = 0 \mid S = 2, F = l)\,\mathrm{pr}(D = 1 \mid S = 1, F = l)}$$


    denote the odds ratio relating disease and exposure at covariable level $l$. Substitution of $\mathrm{pr}(D \mid S, F) = \mathrm{pr}(F \mid D, S)\,\mathrm{pr}(D, S)/\mathrm{pr}(S, F)$ yields

    $$\psi_l = \frac{\mathrm{pr}(F = l \mid D = 1, S = 2)\,\mathrm{pr}(F = l \mid D = 0, S = 1)}{\mathrm{pr}(F = l \mid D = 0, S = 2)\,\mathrm{pr}(F = l \mid D = 1, S = 1)} \times \frac{\mathrm{pr}(D = 1, S = 2)\,\mathrm{pr}(D = 0, S = 1)}{\mathrm{pr}(D = 0, S = 2)\,\mathrm{pr}(D = 1, S = 1)},$$

    or $\psi_l = \phi_l \psi$, where $\psi$ is the marginal odds ratio and $\phi_l$ is a correction factor. If $r_{ijl}$ denotes the number of subjects out of $n_{ij}$ with $F = l$, an estimate of $\psi_l$ is thus

    $$\hat\psi_l = \frac{(r_{12l}/n_{12})(r_{01l}/n_{01})\,N_{12}N_{01}}{(r_{02l}/n_{02})(r_{11l}/n_{11})\,N_{02}N_{11}}.$$

    White (1982) notes that the variances and covariances of the estimated log odds ratios satisfy $\mathrm{var}(\log\hat\psi_l) = \sum_{i,j} r_{ijl}^{-1} - c$ and $\mathrm{cov}(\log\hat\psi_l, \log\hat\psi_{l'}) = -c$ for $l \ne l'$, where the correction term is $c = \sum_{i,j}(n_{ij}^{-1} - N_{ij}^{-1})$. We extend her results to incorporate continuous exposure and concomitant variables, estimating parameters by 'conditional maximum likelihood' under the logistic model.
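The estimate and variance attributed to White (1982) above can be sketched in a few lines. The function name and the dictionary layout keyed by $(i, j)$ are our own choices, and the counts in the usage lines are hypothetical, not data from the paper.

```python
import math


def white_log_odds_ratio(r, n, N):
    """Estimate log(psi_l) and its variance for one covariable level l.

    r, n, N map (i, j) to counts, with i = 0 (control) or 1 (case) and
    j = 1 (nonexposed) or 2 (exposed):
      N[i, j]  first stage count,
      n[i, j]  second stage subsample size,
      r[i, j]  second stage subjects with F = l.
    """
    cells = [(0, 1), (0, 2), (1, 1), (1, 2)]
    frac = {k: r[k] / n[k] for k in cells}
    psi = (frac[1, 2] * frac[0, 1] * N[1, 2] * N[0, 1]) / (
        frac[0, 2] * frac[1, 1] * N[0, 2] * N[1, 1]
    )
    c = sum(1 / n[k] - 1 / N[k] for k in cells)      # correction term
    var_log = sum(1 / r[k] for k in cells) - c
    return math.log(psi), var_log


# Hypothetical counts, for illustration only.
N = {(0, 1): 900, (0, 2): 100, (1, 1): 400, (1, 2): 100}
n = {(0, 1): 50, (0, 2): 50, (1, 1): 50, (1, 2): 50}
r = {(0, 1): 10, (0, 2): 20, (1, 1): 15, (1, 2): 30}
print(white_log_odds_ratio(r, n, N))
```

When every first stage subject is subsampled and there is a single covariable level, so that $r = n = N$, the estimate reduces to the marginal log odds ratio with the usual variance $\sum 1/N_{ij}$.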


    Our approach to estimation of the parameter vector $\beta$ in the quantal response function (1) consists of an adaptation and extension of the 'conditional maximum likelihood' methods developed for 'choice based' samples by Manski & McFadden (1981, § 1.8) and Hsieh, Manski & McFadden (1985). See also Amemiya (1985, § 9.5). Consider the conditional probability that a member of the second stage sample from stratum $j$ with regression variables $x$ is a case having $i = 1$ or a control having $i = 0$. This is

    $$R_{ij}(x; \beta, \xi) = \frac{\mathrm{pr}(x \mid D = i, S = j)(n_{ij}/n)}{\sum_h \mathrm{pr}(x \mid D = h, S = j)(n_{hj}/n)}, \tag{2}$$

    where the sum is over $h = 0, 1$, and where

    $$\mathrm{pr}(x \mid D = i, S = j) = P_i(x) f_j(x)/Q_{ij}. \tag{3}$$

    Here $P_1$ is defined by (1), $P_0 = 1 - P_1$, $f_j(x) = \mathrm{pr}(x, S = j)$ is the subdensity of $x$ in stratum $j$ and $Q_{ij} = \mathrm{pr}(D = i, S = j)$. We have assumed that $\mathrm{pr}(D = i \mid S = j, x) = P_i(x)$, that is that the probabilities of disease depend on the stratum only through the exposure variables explicitly included in $x$. Further define

    $$q_{ji} = \mathrm{pr}(S = j \mid D = i) \tag{4}$$

    and write $Q_{ij} = q_{ji}\pi_i$, where the marginal disease probabilities $\pi_i = \mathrm{pr}(D = i)$ are, for the moment, assumed known. If $\xi' = (\xi_0', \xi_1') = (\xi_{10}, \ldots, \xi_{J0}, \xi_{11}, \ldots, \xi_{J1})$ is the $2J$-dimensional parameter vector with elements $\xi_{ji} = \log q_{ji}$, then the log pseudo-likelihood function for the second stage sampled data is

    $$l(\beta, \xi) = \sum_{i,j,k} \log R_{ij}(x_{ijk}; \beta, \xi). \tag{5}$$

    For known $\xi$, Manski & McFadden (1981) proved that maximization of $l(\beta, \xi)$ yields a consistent asymptotically normal estimator of $\beta$. Here, however, $\xi$ is estimated by $\hat\xi_{ji} = \log(N_{ij}/N_i)$. Using the fact that $(N_{01}, \ldots, N_{0J})$ and $(N_{11}, \ldots, N_{1J})$ have independent multinomial distributions and applying standard Taylor-series arguments, one can show that $N_i^{1/2}(\hat\xi_i - \xi_i)$ is asymptotically normal with zero mean and covariance matrix

    $$B_i = D_{q_i}^{-1} - M, \tag{6}$$

    where $D_{q_i}$ denotes the $J \times J$ diagonal matrix with diagonal elements $q_{1i}, \ldots, q_{Ji}$ and $M$ denotes the $J \times J$ matrix whose entries are all 1.
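The covariance matrix in (6) is easy to form numerically. The sketch below uses NumPy and a hypothetical probability vector; the function name is ours. Because the stratum probabilities sum to one, the matrix annihilates the probability vector itself, which gives a convenient sanity check.

```python
import numpy as np


def multinomial_log_cov(q):
    """Return B = diag(1/q) - 1 1' for a vector q of stratum probabilities."""
    q = np.asarray(q, dtype=float)
    J = q.size
    return np.diag(1.0 / q) - np.ones((J, J))


# Hypothetical stratum probabilities for one disease group.
q = np.array([0.2, 0.3, 0.5])
B = multinomial_log_cov(q)
print(B)
print(B @ q)   # numerically zero, since the q's sum to one
```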

    PROPOSITION 1. Suppose that the true value of $\beta$ is an interior point of a compact parameter space, and that the sample sizes $n_{ij}$ and $N_i$ tend to infinity in such a way that $n_{ij}/n$ and $N_i/n$ have finite, nonzero limits $\nu_{ij}$ and $\lambda_i$, respectively. Let $E_{ij}$ denote expectation with respect to the distribution of $x$ in the $(i, j)$th subsample, and assume that this distribution is sufficiently regular to guarantee the existence of the expectations shown below. Let $R^*_{ij}(x) = R^*_{ij}(x; \beta, \xi)$ denote the conditional distribution (2) except that the $n_{ij}/n$ are replaced by their limiting values $\nu_{ij}$. Then the estimator $\hat\beta$ that maximizes $l(\beta, \hat\xi)$ is consistent for $\beta$. Furthermore, $n^{1/2}(\hat\beta - \beta)$ has a limiting normal distribution with zero mean and covariance matrix $H^{-1}(G + ABA')H^{-1}$, where

    $$H = \sum_{i,j} \nu_{ij} E_{ij}\{\partial^2 \log R^*_{ij}(x)/\partial\beta\,\partial\beta'\},$$

    $$G = \sum_{i,j} \nu_{ij}\bigl[E_{ij}\{\partial \log R^*_{ij}(x)/\partial\beta\}\{\partial \log R^*_{ij}(x)/\partial\beta'\} - E_{ij}\{\partial \log R^*_{ij}(x)/\partial\beta\}\,E_{ij}\{\partial \log R^*_{ij}(x)/\partial\beta'\}\bigr],$$

    $$A = \sum_{i,j} \nu_{ij} E_{ij}\{\partial^2 \log R^*_{ij}(x)/\partial\beta\,\partial\xi'\}$$

    are evaluated at the true $(\beta, \xi)$ and $B = \mathrm{diag}(\lambda_0^{-1}B_0, \lambda_1^{-1}B_1)$.

    Proof. The proof follows closely that of Theorem 2 of Hsieh, Manski & McFadden and only a brief outline is given. Their assumptions of positivity, identifiability and regularity are clearly satisfied for the logistic function (1). We impose sufficient regularity conditions on the regression variables $x$ to guarantee the existence of $H$, $G$ and $A$ as finite matrices and otherwise to implement the remarks made in their § 5.1, which are needed for the extension to continuous $x$. Their arguments show that, with $\xi$ fixed at its true value, the normalized log likelihood $l(\beta, \xi)/n$ converges uniformly in a neighbourhood of the true $\beta$ to $l^*(\beta, \xi) = \sum_{i,j} \nu_{ij} E_{ij} \log R^*_{ij}(x; \beta, \xi)$. Consistency follows since $l^*$ is uniquely maximized at the true value. Since this is an interior point, $\hat\beta$ satisfies the score equation $\partial l(\beta, \hat\xi)/\partial\beta = U_\beta(\hat\beta, \hat\xi) = 0$. Thus we have the Taylor expansion

    $$0 = n^{-1/2}U_\beta(\hat\beta, \hat\xi) = n^{-1/2}U_\beta(\beta, \xi) + n^{-1}U_{\beta\beta}\,n^{1/2}(\hat\beta - \beta) + n^{-1}U_{\beta\xi}\,n^{1/2}(\hat\xi - \xi) + o_p(1).$$

    Here $U_{\beta\beta}$ and $U_{\beta\xi}$ denote the second partial derivatives of the log likelihood which, when divided by $n$, converge to $H$ and $A$, respectively. The leading term $n^{-1/2}U_\beta$ is asymptotically $N(0, G)$ and $n^{1/2}(\hat\xi - \xi)$ is asymptotically $N(0, B)$. Since these are independent, this proves the theorem. □

    For the logistic model of (1), the $R_{ij}$ have a linear logistic structure

    $$\log\{R_{1j}(x; \beta, \xi)/R_{0j}(x; \beta, \xi)\} = \log(\pi_0/\pi_1) + \log(n_{1j}/n_{0j}) + \xi_{j0} - \xi_{j1} + x\beta. \tag{7}$$
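To make the linear logistic structure (7) concrete, the conditional probability $R_{1j}$ can be evaluated directly from its logit. This is a minimal sketch; the function name and argument names are ours, and the default `pi1=0.5` is an arbitrary illustrative value that makes the $\log(\pi_0/\pi_1)$ term vanish.

```python
import math


def prob_case(x, beta, n1j, n0j, xi_j0, xi_j1, pi1=0.5):
    """Conditional probability R_1j that a sampled subject in stratum j is a case.

    The logit is log(pi0/pi1) + log(n1j/n0j) + xi_j0 - xi_j1 + x.beta, as in (7).
    pi1 = 0.5 is a placeholder value chosen for illustration only.
    """
    xb = sum(xk * bk for xk, bk in zip(x, beta))
    logit = (math.log((1 - pi1) / pi1) + math.log(n1j / n0j)
             + xi_j0 - xi_j1 + xb)
    return 1.0 / (1.0 + math.exp(-logit))


# With equal subsample sizes and equal xi terms only x.beta remains.
print(prob_case([1.0, 2.0], [0.5, -0.25], n1j=50, n0j=50, xi_j0=-1.0, xi_j1=-1.0))
```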

    In the sequel, we prove that a minor modification of the naive logistic regression analysis suffices to estimate $\beta$ consistently, and we derive a correction matrix $C$ that may be used in conjunction with the naive variance matrix to estimate $\mathrm{var}(\hat\beta)$. Let $n_{\cdot j} = \sum_i n_{ij}$ denote the total number of cases and controls sampled from stratum $j$ and suppose that the $n \times (p + 1)$ design matrix $X$ with rows $x_{ijk}'$ is partitioned $X' = (X_1', \ldots, X_J')$, where $X_j$ is the $n_{\cdot j} \times (p + 1)$ submatrix whose rows represent regression variables for subjects in the $j$th stratum. Denote by $\hat d$ the correspondingly arranged vector of fitted values $\hat d_{ijk} = R_{1j}(x_{ijk}; \hat\beta, \hat\xi)$, by $d$ the vector of case-control indicators, $d_{1jk} = 1$ and $d_{0jk} = 0$, and by $V$ the $n \times n$ diagonal matrix with the estimated conditional variances $v_{ijk} = \hat d_{ijk}(1 - \hat d_{ijk})$ on the diagonal, where $V$ is partitioned similarly to the other quantities so that $n_{\cdot j} \times n_{\cdot j}$ submatrices $V_j$ corresponding to the $j$th stratum lie along the main diagonal. Finally, let $e$ denote the $n \times 1$ vector of ones and let $e_j$ denote the $n \times 1$ vector with ones in locations corresponding to the $j$th stratum and zeros elsewhere.

    PROPOSITION 2. Let $W_j = X'Ve_j$ and $W = \sum_j W_j = X'Ve$, where $X$ and $V$ are the design and variance matrices just defined. Then the score equations satisfied by $\hat\beta$ are

    $$X'(d - \hat d) = 0. \tag{8}$$

    A consistent asymptotic covariance matrix is given by

    $$\mathrm{var}(\hat\beta) = (X'VX)^{-1}\{(X'VX) - C\}(X'VX)^{-1}, \tag{9}$$

    where

    $$C = \sum_{i,j}(1/n_{ij} - 1/N_{ij})\,W_jW_j' + (1/N_0 + 1/N_1)\,WW'. \tag{10}$$

    Proof. The score equations (8) are an immediate consequence of the logistic structure for $R_{ij}(x)$ shown in (7). To derive (9) and (10) we first replace the matrices $H$, $A$ and $B$ defined in Proposition 1 by their empirical counterparts. Since $\partial^2 \log R_{ijk}/\partial\beta\,\partial\beta' = -v_{ijk}x_{ijk}x_{ijk}'$, where $R_{ijk} = R_{ij}(x_{ijk}; \beta, \xi)$ and we evaluate $(\beta, \xi)$ at $(\hat\beta, \hat\xi)$, we have $n^{-1}(X'VX) \to -H$ as $n \to \infty$. Similarly, defining $\varepsilon_i = +1$ or $-1$ according as $i = 1$ or $0$, we have $\partial^2 \log R_{ijk}/\partial\beta\,\partial\xi_{ji} = \varepsilon_i v_{ijk}x_{ijk}$ and thus $n^{-1}(-W_1 \ldots -W_J\ W_1 \ldots W_J) \to A$ as $n \to \infty$. Substituting $N_i/n$ for $\lambda_i$, and $N_{ij}/N_i$ for the diagonal elements of $D_{q_i}$ in $B$, it follows that $n^{-1}\sum_i\bigl(\sum_j N_{ij}^{-1}W_jW_j' - N_i^{-1}WW'\bigr)$ consistently estimates $ABA'$. The estimator we propose for $G$ requires more argument and is derived in the Appendix. □
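Formulas (9) and (10) translate directly into a few matrix operations. The following NumPy sketch is one possible arrangement, not the authors' program; the function name, the 0-based stratum coding and the `(2, J)` array layout for $n_{ij}$ and $N_{ij}$ are our assumptions.

```python
import numpy as np


def corrected_covariance(X, v, stratum, n_sub, N_first):
    """Covariance estimate (9) using the correction matrix C of (10).

    X        (n, p+1) design matrix for the second stage sample
    v        (n,) fitted conditional variances d_hat * (1 - d_hat)
    stratum  (n,) stratum label of each subject, coded 0..J-1
    n_sub    (2, J) second stage sizes n_ij (row 0 controls, row 1 cases)
    N_first  (2, J) first stage sizes N_ij
    Returns the corrected and the naive covariance matrices.
    """
    X = np.asarray(X, float)
    v = np.asarray(v, float)
    stratum = np.asarray(stratum)
    n_sub = np.asarray(n_sub, float)
    N_first = np.asarray(N_first, float)

    info = X.T @ (v[:, None] * X)                      # X'VX
    J = n_sub.shape[1]
    # Column j holds W_j = X'V e_j, the sum of v * x over stratum j.
    W_j = np.stack(
        [X[stratum == j].T @ v[stratum == j] for j in range(J)], axis=1
    )
    W = W_j.sum(axis=1)

    coef = (1 / n_sub - 1 / N_first).sum(axis=0)       # summed over i for each j
    d = (1 / N_first.sum(axis=1)).sum()                # 1/N_0 + 1/N_1
    C = (W_j * coef) @ W_j.T + d * np.outer(W, W)

    naive = np.linalg.inv(info)
    return naive @ (info - C) @ naive, naive
```

The fitted variances `v` would come from the offset logistic fit; any routine that returns fitted probabilities can supply them.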

    In circumstances when the first stage data are quite extensive, the covariance estimator defined by (9) and (10) may not be positive-definite. This anomaly is avoided if one estimates $G$ instead by the sum of the within stratum sample covariance matrices of the scores:

    $$\hat G = n^{-1}\sum_{i,j}\Bigl[\sum_{k=1}^{n_{ij}}(1 - \hat R_{ijk})^2 x_{ijk}x_{ijk}' - \frac{1}{n_{ij}}\Bigl\{\sum_{k=1}^{n_{ij}}(1 - \hat R_{ijk})x_{ijk}\Bigr\}\Bigl\{\sum_{k=1}^{n_{ij}}(1 - \hat R_{ijk})x_{ijk}\Bigr\}'\Bigr].$$

    However, provided that it is positive-definite, we prefer the estimator of $G$ that is derived in the Appendix and that leads to the simple expression (9).
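The alternative estimate of $G$ described above, a sum of within-subsample covariance matrices of the scores, might be computed along the following lines. This is a sketch under our own assumptions: the score of a subject is taken as $(d - \hat d)x$, which equals $(1 - \hat R_{ijk})x_{ijk}$ up to sign, and the result is divided by $n$ to match the normalization of $G$ in Proposition 1.

```python
import numpy as np


def stratum_score_covariance(X, d, d_hat, stratum):
    """Estimate G by summing within-(i, j) sample covariance matrices of the scores."""
    X = np.asarray(X, float)
    d = np.asarray(d)
    d_hat = np.asarray(d_hat, float)
    stratum = np.asarray(stratum)
    n, p = X.shape
    # Score of each subject: (d - d_hat) * x.
    scores = (d - d_hat)[:, None] * X
    G = np.zeros((p, p))
    for i in (0, 1):
        for j in np.unique(stratum):
            s = scores[(d == i) & (stratum == j)]
            if len(s) == 0:
                continue                      # empty disease/stratum cell
            total = s.sum(axis=0)
            G += s.T @ s - np.outer(total, total) / len(s)
    return G / n
```

Each summand is a centred cross-product matrix, so the result is positive semidefinite by construction, which is the property the text appeals to.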


    The score equations (8) are identical to those for a logistic regression analysis of the $d_{ijk}$ on regression variables $x_{ijk}$ that are augmented by the variable

    $$\log(\pi_0/\pi_1) + \log(n_{1j}/n_{0j}) + \log(\hat q_{j0}/\hat q_{j1}) = \log\{(\pi_0 N_1)/(\pi_1 N_0)\} + \log\{(n_{1j}N_{0j})/(n_{0j}N_{1j})\}.$$

    The second term, $\log\{(n_{1j}N_{0j})/(n_{0j}N_{1j})\}$, enters the model as an offset, i.e. a regression variable with coefficient fixed at unity a priori. Since the leading term is constant, it is incorporated into a new grand mean parameter $\beta_0^* = \beta_0 + \log\{(\pi_0 N_1)/(\pi_1 N_0)\}$. Provided that the model contains a free constant term, neither the regression coefficients nor their standard errors depend upon the values actually inserted for $\pi_i$, and in practice we simply omit them.
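The offset analysis described above can be carried out by any logistic regression routine that accepts an offset. For readers without such a routine to hand, here is a bare Newton-Raphson sketch in NumPy; it is our illustration rather than the authors' GLIM code, with no safeguards against separation or non-convergence, and the toy data are hypothetical.

```python
import numpy as np


def fit_logistic_offset(X, d, offset, n_iter=25, tol=1e-10):
    """Newton-Raphson fit of logit pr(d = 1) = offset + X beta.

    Returns beta and the naive covariance (X'VX)^{-1} from the last iteration.
    """
    X = np.asarray(X, float)
    d = np.asarray(d, float)
    offset = np.asarray(offset, float)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        d_hat = 1.0 / (1.0 + np.exp(-(offset + X @ beta)))
        v = d_hat * (1.0 - d_hat)
        info = X.T @ (v[:, None] * X)              # X'VX
        step = np.linalg.solve(info, X.T @ (d - d_hat))
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta, np.linalg.inv(info)


# Toy data: one binary regressor, 100 subjects at each level.
x = np.repeat([0, 0, 1, 1], [30, 70, 60, 40])
d = np.repeat([1, 0, 1, 0], [30, 70, 60, 40])
X = np.column_stack([np.ones(x.size), x])
offset = np.where(x == 1, -1.5, 0.0)              # hypothetical stratum offsets
beta, naive_cov = fit_logistic_offset(X, d, offset)
print(beta)
```

In this saturated toy example the fitted slope equals the crude log odds ratio minus the difference in offsets between the two levels, mirroring the relation between naive and corrected coefficients discussed in the text.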


    Table 2. Results of a logistic regression analysis of coronary heart disease: regression coefficients ± standard errors.

    Variables        Naive analysis      Correct analysis
    Constant         -4.604 ± 0.481      -4.604 ± 0.474
    Male sex          0.044 ± 0.144       1.659 ± 0.077
    Age (years)       0.072 ± 0.008       0.072 ± 0.008
    Hypertensive      0.356 ± 0.145       0.356 ± 0.145
    Smoker            1.024 ± 0.178       1.024 ± 0.178
    Ex-smoker         1.033 ± 0.179       1.033 ± 0.179

    Standard computer programs such as GLIM (Baker & Nelder, 1978) treat $(X'VX)^{-1}$ at convergence as the asymptotic covariance matrix for $\hat\beta$. If a separate binary regression variable is included in the model for each exposure stratum $j = 2, \ldots, J$, then a simple correction may be applied to $(X'VX)^{-1}$ to estimate the variances and covariances. In this case, furthermore, no correction is needed for the naive variances of the covariable coefficients nor for the covariable × exposure interactions.

    PROPOSITION 3. Suppose the design matrix satisfies $X = (X_1\ X_2)$, where $X_1 = (e\ e_2\ \ldots\ e_J)$. Thus the first column of $X$ is associated with the grand mean $\beta_0$ and the next $J - 1$ columns with coefficients $\beta_j$ $(j = 1, \ldots, J - 1)$ that represent the log relative risks of exposure at level $j + 1$ relative to exposure at level 1. Define $c^* = (n_{01}^{-1} - N_{01}^{-1}) + (n_{11}^{-1} - N_{11}^{-1})$ and $d^* = (N_0^{-1} + N_1^{-1})$. Then the asymptotic covariance matrix of $\hat\beta$ is $(X'VX)^{-1} - C^*$, where the $J \times J$ upper left-hand corner of $C^*$ has elements

    $$c^*_{11} = c^* + d^*, \qquad c^*_{1j} = c^*_{j1} = -c^*,$$

    $$c^*_{jj} = c^* + (n_{0j}^{-1} - N_{0j}^{-1}) + (n_{1j}^{-1} - N_{1j}^{-1}) \quad (j = 2, \ldots, J), \qquad c^*_{jj'} = c^* \quad (j \ne j' = 2, \ldots, J).$$

    The remaining elements of $C^*$ are 0.

    Proof. First, note that the vectors $e_j$ and $e$ used to define $W_j$ and $W$ in Proposition 2 may be written as linear combinations of the columns of $X$. Let $u_j$ be the $(p + 1)$-dimensional vector with $j$th element equal to 1, and all other elements 0. Then $e = Xu_1$, $e_j = Xu_j$ for $j = 2, \ldots, J$ and $e_1 = X(u_1 - u_2 - \cdots - u_J)$. The proposition follows upon inserting these expressions into the formula for $C$ in Proposition 2. □
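The upper left-hand block of the correction matrix in Proposition 3 can be assembled from the stage one and stage two counts alone. The sketch below is ours; in particular the signs of the off-diagonal entries are those obtained by inserting $e_1 = X(u_1 - u_2 - \cdots - u_J)$ into (10), and the three-stratum counts are hypothetical.

```python
import numpy as np


def c_star_block(n_sub, N_first):
    """J x J upper-left block of C* from (2, J) arrays of n_ij and N_ij.

    Row 0 holds controls (i = 0), row 1 cases (i = 1); column 0 is the
    reference stratum j = 1.
    """
    n_sub = np.asarray(n_sub, float)
    N_first = np.asarray(N_first, float)
    c_j = (1 / n_sub - 1 / N_first).sum(axis=0)      # one term per stratum
    d_star = (1 / N_first.sum(axis=1)).sum()         # 1/N_0 + 1/N_1
    J = c_j.size
    block = np.full((J, J), c_j[0])                  # c* everywhere to start
    block[0, 1:] = block[1:, 0] = -c_j[0]            # constant by exposure terms
    block[0, 0] = c_j[0] + d_star
    block[np.arange(1, J), np.arange(1, J)] += c_j[1:]
    return block


# Hypothetical counts for three exposure strata.
n_sub = [[50, 50, 50], [50, 50, 50]]
N_first = [[800, 150, 50], [300, 120, 80]]
print(c_star_block(n_sub, N_first))
```

Subtracting this block from the corresponding corner of the naive matrix $(X'VX)^{-1}$ gives the corrected variances and covariances for the constant and the exposure coefficients; the covariable rows and columns are left untouched.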

    As an illustration, consider the correct and incorrect variances for male sex in Table 2. One readily calculates

    $$\sum_{i,j}(n_{ij}^{-1} - N_{ij}^{-1}) = 0.015 = (0.144)^2 - (0.077)^2,$$

    in accordance with the proposition. Similarly, the difference between the correct and naive regression coefficients equals the difference between the offsets for females ($j = 1$) and males ($j = 2$), namely

    $$1.659 - 0.044 = 0.0006 - (-1.6143) = \log\{(n_{11}N_{01})/(n_{01}N_{11})\} - \log\{(n_{12}N_{02})/(n_{02}N_{12})\}.$$

    The difference between the correct and naive constant terms is equal to the offset for females, which in this example is approximately zero.
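The arithmetic of this illustration can be reproduced from Tables 1 and 2. The sketch below only re-derives the quoted quantities; the dictionary layout and names are ours.

```python
import math

# Table 1 counts keyed by (i, j): i = 1 diseased / case, i = 0 nondiseased /
# control; j = 1 female, j = 2 male.
N = {(1, 1): 2944, (0, 1): 2934, (1, 2): 14867, (0, 2): 2971}
n = {(1, 1): 249, (0, 1): 248, (1, 2): 249, (0, 2): 250}


def offset(j):
    """Offset log{(n_1j N_0j) / (n_0j N_1j)} for stratum j."""
    return math.log(n[1, j] * N[0, j] / (n[0, j] * N[1, j]))


shift = offset(1) - offset(2)                      # correct minus naive coefficient
correction = sum(1 / n[k] - 1 / N[k] for k in n)   # variance correction for male sex

naive_coef, naive_se = 0.044, 0.144                # male sex row of Table 2
print(round(offset(1), 4), round(offset(2), 4))
print(round(naive_coef + shift, 3))
print(round(math.sqrt(naive_se**2 - correction), 3))
```

The corrected standard error computed this way from the rounded table entries comes out slightly below the tabulated 0.077, a discrepancy of the size one would expect from rounding the naive standard error to three decimals.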


    This section presents numerical results on the asymptotic efficiencies of several stage two designs for the simple problem in which there are two exposure categories and a single binary covariable. The model to be fitted is

    $$\log\{\mathrm{pr}(D = 1)/\mathrm{pr}(D = 0)\} = \beta_0 + \beta_1 x_1 + \beta_2 x_2,$$

    where $x_1$ and $x_2$ are binary variables for the exposure and the covariable, respectively. Two strata correspond to the two exposure categories: $S = 1$ for $x_1 = 0$, and $S = 2$ for $x_1 = 1$. We assume that the disease is rare so that $\beta_0 \to -\infty$. The asymptotic variances of interest are $\mathrm{avar}(\hat\beta_1) = \lim n\,\mathrm{var}(\hat\beta_1)$ and $\mathrm{avar}(\hat\beta_2)$. We also consider $\mathrm{avar}(\hat\beta_3)$, where $\beta_3$ is the coefficient of an interaction term $x_3 = x_1x_2$ that is added to the fitted model. However, $\beta_3 = 0$ in the model generating the data. All terms of the form $n_{ij}/n$ and $N_i/n$ occurring in this section are evaluated in the limit as $n \to \infty$.

    First suppose that $\lambda_0 = N_0/n$ and $\lambda_1 = N_1/n$ are both very large so that the error of estimation in the marginal odds ratios may be ignored. This situation would arise in practice if large numbers of both diseased and nondiseased subjects were available in the first stage sample, all with exposure status known. Two simple designs are compared, the standard case-control design with $\tfrac12 n$ diseased subjects and $\tfrac12 n$ nondiseased subjects selected without regard to exposure so that $n_{ij}/n = \tfrac12 N_{ij}/N_i$, and the balanced design defined by $n_{ij}/n = \tfrac14$. We also compared the balanced design to three 'optimal' designs which are most efficient for estimating each of the three parameters, respectively. These were determined by a grid search.

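    For a given design the asymptotic variance of $\hat\beta_1$ can be written

    $$\mathrm{avar}(\hat\beta_1) = \Bigl[\Bigl\{\sum_{i=0}^{1}\sum_{j=1}^{2}(n_{ij}g_{ij})^{-1}\Bigr\}^{-1} + \Bigl\{\sum_{i=0}^{1}\sum_{j=1}^{2}\{n_{ij}(1 - g_{ij})\}^{-1}\Bigr\}^{-1}\Bigr]^{-1} - \sum_{i=0}^{1}\sum_{j=1}^{2}(1/n_{ij} - 1/N_{ij}),$$

    where $g_{ij} = \mathrm{pr}(x_2 = 1 \mid D = i, S = j)$. The $g_{ij}$ depend on the degree of confounding between $x_1$ and $x_2$ as measured by the control odds ratio

    $$\theta = \frac{\mathrm{pr}(x_1 = 1, x_2 = 1 \mid D = 0)\,\mathrm{pr}(x_1 = 0, x_2 = 0 \mid D = 0)}{\mathrm{pr}(x_1 = 1, x_2 = 0 \mid D = 0)\,\mathrm{pr}(x_1 = 0, x_2 = 1 \mid D = 0)} = \frac{\mathrm{pr}(x_2 = 1 \mid D = 0, S = 2)\,\mathrm{pr}(x_2 = 0 \mid D = 0, S = 1)}{\mathrm{pr}(x_2 = 0 \mid D = 0, S = 2)\,\mathrm{pr}(x_2 = 1 \mid D = 0, S = 1)},$$

    and on the probabilities $\mathrm{pr}(x_1 = 1 \mid D = 0)$ and $\mathrm{pr}(x_2 = 1 \mid D = 0)$. Since the disease is rare, $\mathrm{pr}(x_2 = 1 \mid S = j) \approx \mathrm{pr}(x_2 = 1 \mid D = 0, S = j) = g_{0j}$. Simple equations for $\mathrm{avar}(\hat\beta_2)$ and $\mathrm{avar}(\hat\beta_3)$ can also be derived that do not involve the $N_{ij}$ terms.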

    Table 3(a), in columns 3-5, shows the asymptotic efficiency of the balanced design relative to the case-control design for a rare disease with $\mathrm{pr}(x_1 = 1 \mid D = 0) = 0.05$, $\mathrm{pr}(x_2 = 1 \mid D = 0) = 0.3$, $\exp(\beta_1) = 2$, and several values of $\exp(\beta_2)$ and $\theta$. Since the purpose of the analysis is to estimate the effects of exposure after controlling for the covariable, we are most interested in efficiency with respect to $\beta_1$. If the covariable $x_2$ is strongly related to disease, the balanced design is considerably more efficient than the case-control design. By contrast if $\beta_2 = 0$, then $\mathrm{avar}(\hat\beta_1)$ for the balanced design is about the same as or slightly worse than that for the case-control design. The balanced design is much more efficient than the case-control design for estimating interaction; see column 5.
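The $\beta_1$ efficiencies of the balanced design relative to the case-control design can be explored numerically. The sketch below implements our reading of the asymptotic variance of $\hat\beta_1$ as the pooled variance of the two covariable-specific log odds ratios minus the first stage correction, with all $n_{ij}$ and $N_{ij}$ expressed as limiting fractions of $n$. The rare disease formulas for the case covariable distribution and the case stratum distribution are derived by us from the fitted model; function names and defaults are ours.

```python
import math


def control_covariate_probs(p1, p2, theta):
    """g_01, g_02 = pr(x2 = 1 | D = 0, S = j) given margins and control odds ratio."""
    if abs(theta - 1.0) < 1e-12:
        a = p1 * p2                                   # pr(x1 = 1, x2 = 1 | D = 0)
    else:
        s = 1 + (theta - 1) * (p1 + p2)
        a = (s - math.sqrt(s * s - 4 * theta * (theta - 1) * p1 * p2)) / (2 * (theta - 1))
    return (p2 - a) / (1 - p1), a / p1                # nonexposed, exposed


def avar_beta1(n, g, N=None):
    """Asymptotic variance of beta_1; n, g, N are dicts keyed by (i, j)."""
    a1 = sum(1 / (n[k] * g[k]) for k in n)
    a0 = sum(1 / (n[k] * (1 - g[k])) for k in n)
    # N=None stands for an infinitely large first stage sample.
    first_stage = 0.0 if N is None else sum(1 / N[k] for k in n)
    return 1 / (1 / a1 + 1 / a0) - sum(1 / n[k] for k in n) + first_stage


def efficiency_balanced_vs_case_control(exp_b2, theta, exp_b1=2.0, p1=0.05, p2=0.3):
    g01, g02 = control_covariate_probs(p1, p2, theta)
    g = {(0, 1): g01, (0, 2): g02}
    # Rare disease: the case covariable distribution tilts the control one by exp(beta_2).
    for j, g0 in ((1, g01), (2, g02)):
        g[1, j] = g0 * exp_b2 / (g0 * exp_b2 + 1 - g0)
    # Stratum distribution among cases, again under the rare disease approximation.
    w1 = (1 - p1) * (g01 * exp_b2 + 1 - g01)
    w2 = p1 * exp_b1 * (g02 * exp_b2 + 1 - g02)
    q = {(0, 1): 1 - p1, (0, 2): p1, (1, 1): w1 / (w1 + w2), (1, 2): w2 / (w1 + w2)}
    case_control = {k: 0.5 * q[k] for k in q}
    balanced = {k: 0.25 for k in q}
    return avar_beta1(case_control, g) / avar_beta1(balanced, g)


print(round(efficiency_balanced_vs_case_control(5.0, 1.0), 2))
print(round(efficiency_balanced_vs_case_control(0.2, 1.0), 2))
```

At $\exp(\beta_2) = 1$ and $\theta = 1$ both variances vanish in the limit of an infinite first stage sample, so the ratio is undefined there and that combination should not be passed to the function as written.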

    The balanced design may be considerably less efficient than the optimal design for estimating $\beta_1$, especially if the covariable is a strong prognostic factor. The same holds true for $\beta_2$. However, because of the no interaction assumption, the designs which are


    Table 3. Asymptotic relative efficiencies of the three parameter estimates

    (a) $\lambda_0 = \lambda_1 = \infty$

                                Balanced design relative      Balanced design relative
                                to case-control design        to optimal design
    $\exp(\beta_2)$  $\theta$   $\beta_1$  $\beta_2$  $\beta_3$    $\beta_1$  $\beta_2$  $\beta_3$

    0.2        0.2        1.37    0.68    4.41         0.60    0.59    0.82
    0.2        0.5        2.81    0.82    3.90         0.67    0.74    0.90
    0.2        1.0        4.35    1.01    3.51         0.67    0.94    0.94
    0.2        2.0        4.71    1.24    3.25         0.69    0.80    0.94
    0.2        5.0        3.56    1.47    3.30         0.53    0.74    0.93

    1.0        0.2        0.71    0.71    5.77         0.68    0.68    0.94
    1.0        0.5        0.87    0.87    4.71         0.84    0.84    0.99
    1.0        1.0        1.00    1.01    4.08         1.00    1.00    1.00
    1.0        2.0        1.09    1.09    3.76         0.95    0.95    1.00
    1.0        5.0        1.05    1.05    3.90         0.95    0.95    1.00

    5.0        0.2        1.84    0.78    6.48         0.52    0.76    0.94
    5.0        0.5        3.28    0.95    4.92         0.77    0.94    0.99
    5.0        1.0        3.72    1.01    4.10         0.98    1.00    1.00
    5.0        2.0        2.70    0.97    3.64         0.79    0.95    0.99
    5.0        5.0        1.36    0.83    3.44         0.51    0.79    0.95

    (b) $\lambda_0 = \infty$, $\lambda_1 = \tfrac12$

                                Balanced design relative      Balanced design relative
                                to case-control design        to optimal design
    $\exp(\beta_2)$  $\theta$   $\beta_1$  $\beta_2$  $\beta_3$    $\beta_1$  $\beta_2$  $\beta_3$

    0.2        0.2        1.02    0.83    1.43         0.97    0.82    1.00
    0.2        0.5        1.09    0.85    1.43         0.97    0.85    1.00
    0.2        1.0        1.18    0.88    1.45         0.98    0.87    1.00
    0.2        2.0        1.27    0.90    1.50         0.98    0.88    1.00
    0.2        5.0        1.34    0.93    1.65         0.96    0.90    1.00

    1.0        0.2        0.99    0.74    2.30         0.99    0.71    0.96
    1.0        0.5        1.00    0.78    2.18         1.00    0.76    0.97
    1.0        1.0        1.00    0.81    2.09         1.00    0.80    0.97
    1.0        2.0        1.00    0.83    2.04         1.00    0.82    0.98
    1.0        5.0        0.98    0.82    2.06         0.98    0.81    0.98

    5.0        0.2        1.14    0.76    2.94         0.97    0.75    0.93
    5.0        0.5        1.22    0.80    2.44         0.98    0.79    0.96
    5.0        1.0        1.22    0.81    2.12         0.95    0.80    0.97
    5.0        2.0        1.15    0.80    1.90         0.92    0.79    0.98
    5.0        5.0        1.00    0.76    1.74         0.88    0.74    0.98

    For $\exp(\beta_1) = 2$, $\mathrm{pr}(x_1 = 1 \mid D = 0) = 0.05$ and $\mathrm{pr}(x_2 = 1 \mid D = 0) = 0.3$. Efficiencies for $\beta_1$ and $\beta_2$ based on fitting the model with no interaction term, $x_3$. Optimal design refers to optimality for variance of that parameter estimate.

    optimal for $\beta_1$ and $\beta_2$ are degenerate in the sense that one or two of the four values $n_{ij}$ are zero. They would not be used in practice since they have zero power for testing for an interaction. For estimating $\beta_3$, the balanced design is almost as efficient as the optimal design.

    Suppose now that $\lambda_1 = N_1/n = \tfrac12$ with $\lambda_0 = N_0/n$ still large. This arises with studies of a rare disease in which only a small number of cases but a large pool of potential controls is available in the first stage sample. For the case-control design, all of the cases are kept together with a random sample of controls. The balanced design keeps all of the cases and samples equal numbers of exposed and nonexposed controls; $n_{1j} = N_{1j}$ and $n_{0j} = \tfrac14 n$ for $j = 1, 2$. The optimal designs are found subject to the constraint that $n_{1j} \le N_{1j}$, for $j = 1, 2$.

    Table 3(b) shows that the relative efficiencies for this situation are much closer to 1. In comparison with the case-control design, the balanced design still offers the possibility of significant improvement in efficiency with only a small risk of lost efficiency for estimation of $\beta_1$, and has far superior performance for estimation of the interaction. In comparison with the optimal designs, the balanced design has efficiency close to one for estimating $\beta_1$ and $\beta_3$.


    Many epidemiological studies use existing medical records or other data sources where the information on covariables is rather incomplete in comparison with the information on disease and exposure. A quite common practice is to restrict the regression analyses to subjects for whom the full complement of data is available. However, if the pattern of missing covariable information is associated with the joint classification of exposure and disease, such analyses may distort the associations of primary interest. An approach to this problem is to treat the subjects for whom data on exposure and disease are available as the first stage sample, to treat the subsample on whom covariable data are also available as the second stage sample, and to apply the results of §§ 3 and 4. Provided that the covariable information is missing at random within subgroups formed by exposure and disease, use of the offset term in the logistic analysis effectively removes the selection bias. Furthermore, the precision of the relative risk estimate is enhanced by using all available data on exposure and disease, the amount of enhancement being reflected in the correction to the usual covariance matrix; see (10).
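In the missing data setting just described, the stratum offsets follow from simple counts of all records and of complete records. The sketch below assumes a hypothetical record layout of `(disease, stratum, complete)` tuples; neither the layout nor the function name comes from the paper.

```python
import math
from collections import Counter


def missing_data_offsets(records):
    """Offsets log{(n_1j N_0j) / (n_0j N_1j)} from (disease, stratum, complete) records.

    Every record counts toward the first stage total N_ij; records with
    complete covariable data also count toward the second stage total n_ij.
    """
    N = Counter((d, s) for d, s, _ in records)
    n = Counter((d, s) for d, s, complete in records if complete)
    strata = sorted({s for _, s, _ in records})
    return {s: math.log(n[1, s] * N[0, s] / (n[0, s] * N[1, s])) for s in strata}


# Hypothetical records: covariables are missing more often for exposed controls.
records = (
    [(1, "unexposed", True)] * 5 + [(1, "unexposed", False)] * 5
    + [(0, "unexposed", True)] * 10 + [(0, "unexposed", False)] * 10
    + [(1, "exposed", True)] * 10
    + [(0, "exposed", True)] * 10 + [(0, "exposed", False)] * 30
)
print(missing_data_offsets(records))
```

A stratum in which no cases or no controls have complete data would produce a zero count and the function would fail; such cells need to be merged or handled separately before the offsets are computed.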

    Although applied here exclusively with the logistic response function, Proposition 1 is valid much more generally. The proof depends only on the positivity, identifiability and regularity conditions of Hsieh et al. (1985) and on some further conditions on the distribution of the covariables. For it to be of practical utility, however, prior information on the marginal disease probability $\pi_1$ would generally be needed. One could well imagine situations where such information was contained in a third sample, thereby leading to a further generalization of their results. With minor modifications to $C$ that only affect the variance of the constant term, Proposition 2 may be shown to hold for studies where the first stage sample is drawn from exposure rather than disease subgroups.

    In separate unpublished work we have shown that the 'conditional maximum likelihood' estimates considered here differ from constrained maximum likelihood estimates of the kind considered by Anderson (1972) and Prentice & Pyke (1979) unless a separate regression parameter is estimated for each stratum as in Proposition 3. Further work is needed to determine the amount of efficiency loss in situations where the two differ.


    This work was supported by a grant from the United States Public Health Service. The advice and assistance of Dr B. McKnight and Ms M. Kaad are gratefully acknowledged. We are grateful to the investigators participating in the Coronary Artery Surgery Study for permission to use the data shown in Tables 1 and 2.



    Derivation of the estimate for G

    The fundamental identity

    $$\sum_{i,j}\nu_{ij}E_{ij}\bigl[\{\partial \log R^*_{ij}(x)/\partial\beta\}\{\partial \log R^*_{ij}(x)/\partial\beta'\}\bigr] = -\sum_{i,j}\nu_{ij}E_{ij}\{\partial^2 \log R^*_{ij}(x)/\partial\beta\,\partial\beta'\}$$

    follows by noting that $\sum_i \nu_{ij}E_{ij}/\sum_i \nu_{ij}$ denotes expectation with respect to the limiting joint distribution of the sampled data $(D, x)$ within the $j$th stratum. Taking the expectation first with respect to the distribution of $D$ given $x$, that is with respect to $R^*_{ij}(x)$, and then with respect to the marginal distribution of $x$ in the sample, we obtain the result. Thus the leading term of $G$ equals $-H$ and is consistently estimated by $(X'VX)/n$.

    To evaluate the second term in $G$, write

    $$E_{ij}\{\partial \log R^*_{ij}(x)/\partial\beta\} = \int \{\partial \log R^*_{ij}(x)/\partial\beta\}\,\mathrm{pr}(x \mid D = i, S = j)\,dx$$

    $$= \int \{\partial \log R^*_{ij}(x)/\partial\beta\}\,P_i(x)f_j(x)Q_{ij}^{-1}\,dx$$

    $$= \nu_{ij}^{-1}\int R^*_{ij}(x)\{\partial \log R^*_{ij}(x)/\partial\beta\}\Bigl\{\sum_h P_h(x)\nu_{hj}f_j(x)Q_{hj}^{-1}\Bigr\}\,dx$$

    using (2) and (3). Now

    $$R^*_{ij}(x)\{\partial \log R^*_{ij}(x)/\partial\beta\} = \varepsilon_i x R^*_{1j}(x)\{1 - R^*_{1j}(x)\} = \varepsilon_i x v^*_j(x),$$

    say. Furthermore, the term in brackets is the limiting conditional density for the sampled $x$ within stratum $j$, which we denote $g_j(x)$. Thus the second term of $G$ may be written

    $$-\sum_{i,j}\nu_{ij}^{-1}\int x v^*_j(x)g_j(x)\,dx \int x' v^*_j(x)g_j(x)\,dx$$

    and is consistently estimated by $-n^{-1}\sum_{i,j} n_{ij}^{-1}W_jW_j'$. This completes the proof of Proposition 2.


    AMEMIYA, T. (1985). Advanced Econometrics. Cambridge, Mass: Harvard University Press.

    ANDERSON, J. A. (1972). Separate sample logistic discrimination. Biometrika 59, 19-35.

    BAKER, R. J. & NELDER, J. A. (1978). The GLIM System: Release 3. Oxford: Numerical Algorithms Group.

    BRESLOW, N. E. & DAY, N. E. (1980). Statistical Methods in Cancer Research I: The Analysis of Case-Control Studies. Lyon: International Agency for Research on Cancer.

    HSIEH, D. A., MANSKI, C. F. & McFADDEN, D. (1985). Estimation of response probabilities from augmented retrospective observations. J. Am. Statist. Assoc. 80, 651-62.

    MANSKI, C. F. & McFADDEN, D. (1981). Alternative estimators and sample designs for discrete choice analysis. In Structural Analysis of Discrete Data with Econometric Applications, Ed. C. F. Manski and D. McFadden, pp. 2-50. Cambridge, Mass: MIT Press.

    PRENTICE, R. L. & PYKE, R. (1979). Logistic disease incidence models and case-control studies. Biometrika 66, 403-11.

    VLIETSTRA, R. E., FRYE, R. L., KRONMAL, R. A., SIM, D. A., TRISTANI, F. E. & KILLIP, III, T. (1980). Risk factors and angiographic coronary artery disease: A report from the coronary artery surgery study (CASS). Circulation 62, 254-61.

    WHITE, J. E. (1982). A two stage design for the study of the relationship between a rare exposure and a rare disease. Am. J. Epidemiol. 115, 119-28.

    [Received October 1986. Revised July 1987]
