THE ERROR~IN~VARIABLES PROBLEM IN THE LOGISTIC REGRESSION MODEL
by
Rhonda Robinson Clark
Department of Biostatistics
University of North Carolina at Chapel Hill
Institute of Statistics Mimeo Series No. 1407
July 1982
THE ERROR-IN-VARIABLES PROBLEM
IN THE LOGISTIC REGRESSION MODEL
by
Rhonda Robinson Clark

A Dissertation submitted to the faculty of the University of North Carolina at Chapel Hill in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Biostatistics.
Chapel Hill
JULY 1982
ABSTRACT
RHONDA ROBINSON CLARK. The Error-in-Variables Problem in the Logistic Regression Model. (Under the direction of Clarence E. Davis.)
It is well known that when the independent variable in simple linear regression is measured with error, least squares estimates are not unbiased. This is also true for logistic regression, and the magnitude of the bias is demonstrated through simulation studies.
Estimators are presented as possible solutions to the 'error-in-variables' problem; that is, the problem of obtaining consistent estimators of model parameters when measurement error is present in the independent variable. Two solutions require an external estimate of the variance of the measurement error, two require multiple measures on the independent variable, while two others are extensions of the method of grouping and the instrumental variable approach. Simulation studies show that the use of an external estimate of the error variance or multiple measures on the independent variable leads to estimators with substantially lower mean square error than the least squares estimate. For the grouping and instrumental variable approaches, the proposed estimators have lower mean square error under some but not all conditions.
The methods discussed are applied to data from the Lipid
Research Clinics Program Prevalence Study.
ACKNOWLEDGEMENTS
I wish to thank the chairman of my Doctoral Committee,
Dr. Clarence E. Davis, for his guidance, encouragement, and
never-ending optimism. Without his support this work could
not have been completed. I am also grateful to Drs. G. Koch,
M. Symons, J. Hosking, and G. Heiss for their suggestions and
participation on the committee.
I would also like to take this opportunity to thank my
friends and family for their love and support throughout my
years in this program.
Finally, I would like to thank Ms. Ernestine Bland for
the excellent and speedy job that she performed in typing the
manuscript.
TABLE OF CONTENTS

CHAPTER                                                              PAGE

I    INTRODUCTION AND REVIEW OF THE LITERATURE ... 1
     1.1 The Error-in-Variables Problem in Linear Regression ... 3
          1.1.1 Introduction ... 3
          1.1.2 Classical Approaches to the Error-in-Variables Problem ... 5
          1.1.3 Alternatives to the Classical Approaches ... 8
                1.1.3.1 The Method of Grouping ... 8
                1.1.3.2 The Use of Instrumental Variables ... 10
                1.1.3.3 Replication of Observations ... 11
     1.2 The Use of the Logistic Regression Model in Epidemiologic Research ... 15
          1.2.1 Introduction ... 15
          1.2.2 Discriminant Function Approach ... 16
          1.2.3 Weighted Least Squares/Maximum Likelihood Approach ... 17
          1.2.4 Comparison of the Approaches ... 20
     1.3 Outline of the Research ... 21

II   DESCRIPTION OF THE BIAS DUE TO MEASUREMENT ERROR ... 23
     2.1 A Review of the Simple Linear and Multiple Linear Regression Cases ... 24
          2.1.1 Simple Linear Regression Case ... 24
          2.1.2 Multiple Linear Regression Case ... 25
     2.2 Mathematical Descriptions of the Bias Under the Logistic Regression Model ... 28
     2.3 Description of the Bias Under the Logistic Regression Model Using Simulation ... 30
          2.3.1 Description of the Simulation Procedure ... 31
          2.3.2 Simple Logistic Regression Case ... 33
          2.3.3 Multiple Logistic Regression Case ... 34
          2.3.4 Summary of Results ... 35

III  ALTERNATIVE ESTIMATES OF β WHEN MEASUREMENT ERROR IS PRESENT IN THE SIMPLE LOGISTIC REGRESSION MODEL ... 37
     3.1 Modified Discriminant Function Estimator ... 37
     3.2 An Estimator Based on the Method of Grouping ... 38
          3.2.1 The Method ... 39
          3.2.2 Rationale Behind the Method ... 40
     3.3 Use of an Instrumental Variable to Estimate β ... 43
          3.3.1 Iterative Instrumental Variable Estimator ... 43
          3.3.2 Two-Stage Least Squares Estimator ... 47
     3.4 Another Two-Stage Estimator of β ... 49
     3.5 Comparison of the Alternative Estimators ... 51

IV   USE OF MULTIPLE MEASUREMENTS IN THE ESTIMATION OF β ... 57
     4.1 The Mean of m Independent Measurements as the Independent Variable ... 57
     4.2 Further Reduction of the Bias When m is Small ... 59
     4.3 Use of the James-Stein Estimator of U_i as the Independent Variable ... 61
     4.4 Estimation of σ_ε² When m Measurements are Taken on a Subsample ... 63
     4.5 Results of Simulation Study ... 63

V    APPLICATION OF METHODS TO THE LIPID RESEARCH CLINICS DATA ... 67
     5.1 Description of the Data Set ... 67
     5.2 Computation of the Estimates and Their Variances ... 68
     5.3 Comparison of the Estimates ... 74
     5.4 Conclusions ... 77

LIST OF TABLES

TABLE                                                                PAGE

1.1 Some Examples of Between-Individual and Within-Individual Variability ... 2
1.2 ANOVA Table for X and Y Data ... 12
2.1 Estimates of the Bias Along With 95% Confidence Intervals ... 36
3.1 Simulated Mean, Bias, Standard Deviation, and Mean Square Error of the Estimators in Chapter 3 ... 56
4.1 Simulated Mean, Bias, Standard Deviation, and Mean Square Error of the Estimators in Chapter 4 ... 66
5.1 Sample Estimates of X̄, S_X², and S_E² by Model and Data Set ... 71
5.2 Correlations Between Triglyceride Level and Instrumental Variables by Data Set ... 73
5.3 Simple Logistic Model With Total Cholesterol Level at Visit 2 as the Independent Variable ... 79
5.4 Multiple Logistic Model With Total Cholesterol at Visit 2 and Age as the Independent Variables ... 80
5.5 Simple Logistic Model With Triglyceride Level at Visit 2 as the Independent Variable ... 81
5.6 Multiple Logistic Model With Triglyceride Level at Visit 2 and Age as the Independent Variables ... 83
CHAPTER I
INTRODUCTION AND REVIEW OF THE LITERATURE
Epidemiologic data collected from prospective studies are often used to model the incidence of disease as a function of one or more independent variables. A commonly employed statistical procedure for such studies is multiple logistic regression. For example, in a prospective study of coronary heart disease in Framingham (Truett, Cornfield and Kannel (1967)), the variation in the incidence of disease as a function of serum cholesterol, systolic blood pressure, age, relative weight, hemoglobin, cigarettes/day, and ECG measured on the initial examination was investigated. The probability, P, of developing the disease during the 12-year time interval was estimated for each individual using the logistic regression model:

P = 1 / (1 + e^−(α + β_1 x_1 + ... + β_7 x_7))
In the standard linear regression problem one assumes that the
independent variable is measured without error and that the values
of the dependent variable deviate from their true values only by a
random error component representing explanatory variables excluded
from the model or sampling error. Under these assumptions the
ordinary least squares (OLS) estimates of the parameters are
unbiased with minimum variance. Violation of these assumptions in the linear regression model leads to an underestimate of the expected magnitude of the regression coefficients (Snedecor and Cochran (1967)).
It is well known from the epidemiologic literature that variables such as systolic blood pressure and serum cholesterol level are subject to possible "errors" or variation in measurement. Two recognized sources of this variation are within-individual variability and observer (interviewer, laboratory) error. Gardner and Heady (1973) give the following examples of within-individual variability σ_W and between-individual variability σ_B:
TABLE 1.1 Some Examples of Between-Individual and Within-Individual Variability

Variable                Mean    σ_W    σ_B
Cholesterol (mg %)      215     25     40
Blood Pressure (mmHg)   141     9.1    13.6
Calories (per day)      2846    370    406
Often these errors are combined with sampling error and forgotten.
The purpose of this research is to study the effect of measurement
error on the estimated logistic regression model coefficients and
to propose methods of producing consistent estimates of these regres
sion coefficients when error in the independent variable is present.
1.1 The Error-in-Variables Problem in Linear Regression
1.1.1 Introduction
There are several possible situations out of which the finite set of independent pairs of observations (X_1,Y_1), (X_2,Y_2), ..., (X_n,Y_n) might arise. In each of the situations that will be discussed, there is a corresponding set of true values that will be denoted as (U_1,V_1), (U_2,V_2), ..., (U_n,V_n). In the regression situation the following linear relationship between the true variables U and V holds:

V = α + βU + e
E(V|U) = α + βU

where V is a random variable, U may be fixed or random, and e has mean 0 and variance σ_e². In the particular case where U and V are bivariate normal, this is the correlation model and E(U|V) = γ + δV.
If V and U are linearly related as follows:

V = α + βU

and if both V and U are fixed, the relationship between the variables is called functional. If the above relationship holds, but V and U are random variables, we have a structural relationship between V and U. If we fail to observe the true variables U and V in any of the above situations, and instead observe X = U + ε and Y = V + η, we have what is known as the error-in-variables problem or measurement error. For the remainder of this study we shall concern ourselves with the error-in-variables problem when
the relationship between the true unobserved variables is structural.
Suppose that only V is subject to error; then V = α + βU becomes

Y − η = α + βU
Y = α + βU + η

where E(η) = 0, η has variance σ_η² and is independent of U. Thus the relationship between the observed values is a regression-type relationship with

E(Y|U) = α + βU.

The ordinary least squares (OLS) estimators are, in this case, both consistent and unbiased.
If V is not subject to error, but U is, then

V = α + β(X − ε)
X = −α/β + (1/β)V + ε
  = α* + β*V + ε
E(X|V) = α* + β*V

where E(ε) = 0, ε has variance σ_ε² and is independent of V. Again, we have a regression-type relation, with the OLS estimators α̂* and β̂* yielding consistent estimators of α and β (β = 1/β*, α = −α*/β*).

If both U and V are subject to error the problem becomes complicated. The relationship between the observed variables is

Y = α + βX + η − βε.

The error term is now η − βε and is no longer independent of X since X = U + ε. The OLS technique in this situation leads to
biased and inconsistent estimates of the regression parameters.
That is, if b is the OLS estimator of β, then

plim b = β σ_U² / (σ_U² + σ_ε²)

where σ_U² is the variance of U.
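The attenuation result above is easy to check numerically. A minimal sketch in Python (assuming numpy; the parameter values are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

n = 200_000
alpha, beta = 1.0, 2.0
sigma2_u, sigma2_eps = 4.0, 1.0          # var(U) and var(measurement error)

U = rng.normal(0.0, np.sqrt(sigma2_u), n)
V = alpha + beta * U + rng.normal(0.0, 0.5, n)   # equation error eta
X = U + rng.normal(0.0, np.sqrt(sigma2_eps), n)  # observed with error

b = np.cov(X, V)[0, 1] / np.var(X)       # OLS slope fitted to the observed data

# classical attenuation: plim b = beta * sigma2_u / (sigma2_u + sigma2_eps)
expected = beta * sigma2_u / (sigma2_u + sigma2_eps)
```

With these values the fitted slope settles near the attenuated value 1.6 rather than the true slope 2.0.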
In another model proposed by Berkson (1950), observations are made on V for a given X. Here again X = U + ε and Y = V + η. We are no longer trying to observe a given U, but for each fixed X there are a number of U's which could have given rise to that particular observed X. Also for each U there is a probability that the observed fixed X is an observation on that U with error ε. The U's are now random variables distributed about the fixed X with error ε. Now ε is independent of X (but not of U), and

Y = α + βX + η − βε
E(Y|X) = α + βX

where X is fixed, and we again have a situation in which the OLS estimator of β is unbiased and consistent.
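The contrast with the classical model can also be simulated: when X is a set of fixed target levels and the true value scatters around each level, the regression error remains independent of X and OLS is not attenuated. A sketch under illustrative assumptions (values are not from the text):

```python
import numpy as np

rng = np.random.default_rng(1)

n = 100_000
alpha, beta = 1.0, 2.0
X = rng.choice([0.0, 1.0, 2.0, 3.0], size=n)   # fixed target levels
U = X + rng.normal(0.0, 0.7, n)                # true value scatters around X
Y = alpha + beta * U + rng.normal(0.0, 0.5, n)

# OLS of Y on the fixed X: the Berkson error is independent of X,
# so the slope is consistent for beta (no attenuation)
b = np.cov(X, Y)[0, 1] / np.var(X)
```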
1.1.2 Classical Approaches to the Error-in-Variables Problem
A number of approaches have been proposed for dealing with the
error-in-variables problem under the structural model. We will
begin with the classical approaches discussed by Moran (1971) and
Madansky (1959).
Let the relationship between the unobserved true variables be

V_i = α + βU_i.

If we assume that

1. η_i and ε_i are normally and independently distributed with means equal to 0 and variances σ_η² and σ_ε² respectively,
2. U_i is normally distributed with mean μ and variance σ_U²,
3. U_i, η_i and ε_i are mutually independent,

then the observed variables X and Y are jointly distributed in a bivariate normal distribution with parameters μ, σ_U², σ_ε², σ_η², α and β, where

E(X) = μ
Var(X) = σ_U² + σ_ε²
E(Y) = α + βμ
Var(Y) = β²σ_U² + σ_η²
Cov(X,Y) = βσ_U²          (1.1.2.1)

The following quantities are jointly sufficient for the parameters in (1.1.2.1):

X̄ = Σ_{i=1}^n X_i / n
Ȳ = Σ_{i=1}^n Y_i / n
S_XX = Σ_{i=1}^n (X_i − X̄)² / n
S_XY = Σ_{i=1}^n (X_i − X̄)(Y_i − Ȳ) / n
S_YY = Σ_{i=1}^n (Y_i − Ȳ)² / n          (1.1.2.2)

The maximum likelihood equations are derived by setting the quantities in (1.1.2.1) equal to those in (1.1.2.2) if we can assume that the latter five quantities are functionally independent.
The fact that we have five equations and six unknowns makes the parameters σ_U², σ_ε², σ_η², α, and β unidentifiable. In order to avoid this problem we need to change some of our assumptions or obtain additional information. In particular, if we knew σ_η², σ_ε², or σ_η²/σ_ε² and were sure that cov(ε,η) = 0, we could estimate β and subsequently α. Both Moran (1971) and Madansky (1959) give the following estimates of β when additional information is available.

1. σ_η² known:

β̂_1 = (S_YY − σ_η²) / S_XY

2. σ_ε² known:

β̂_2 = S_XY / (S_XX − σ_ε²)

3. λ = σ_η²/σ_ε² known:

β̂_3 = [S_YY − λS_XX + √((S_YY − λS_XX)² + 4λS_XY²)] / (2S_XY)
4. When both σ_η² and σ_ε² are known, we are confronted with an over-identification situation. We now have four parameters and five equations. As a result we arrive at inconsistencies. That is,

β̂_1 = (S_YY − σ_η²)/S_XY  and  β̂_2 = S_XY/(S_XX − σ_ε²)

are both derived from the same system of equations. In large samples these two estimates will become asymptotically equivalent. Because of the inconsistencies, a maximum likelihood solution cannot be arrived at by equating (1.1.2.1) to (1.1.2.2). Kiefer (1964) points out that under these circumstances the proper procedure is to write out the likelihood and maximize it with respect to the four unknown parameters α, β, μ, and σ_U². Barnett (1967) gives the resulting equations.

5. The previous four estimators are derived upon the assumption that cov(ε,η) = 0. If cov(ε,η) ≠ 0, then cov(X,Y) = cov(ε,η) + βσ_U². Therefore if both σ_η² and σ_ε² are known and we assume that cov(ε,η) is not equal to 0, we have five unknowns and five equations. The estimate of β is

β̂_5 = sign(S_XY) √[(S_YY − σ_η²)/(S_XX − σ_ε²)]
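The known-variance corrections β̂_1 and β̂_2 can be sketched as method-of-moments adjustments to the sample moments in (1.1.2.2). Illustrative Python (assuming numpy; parameter values are invented for the demonstration, not from the text):

```python
import numpy as np

rng = np.random.default_rng(2)

n = 100_000
alpha, beta = 1.0, 2.0
sigma2_u, sigma2_eps, sigma2_eta = 4.0, 1.0, 0.25

U = rng.normal(5.0, np.sqrt(sigma2_u), n)
X = U + rng.normal(0.0, np.sqrt(sigma2_eps), n)
Y = alpha + beta * U + rng.normal(0.0, np.sqrt(sigma2_eta), n)

Sxx = np.var(X)                      # moments with divisor n, as in the text
Syy = np.var(Y)
Sxy = np.cov(X, Y, bias=True)[0, 1]

beta1 = (Syy - sigma2_eta) / Sxy     # usable when sigma2_eta is known
beta2 = Sxy / (Sxx - sigma2_eps)     # usable when sigma2_eps is known
b_ols = Sxy / Sxx                    # attenuated least squares slope, for contrast
```

Both corrected estimates recover the true slope, while the uncorrected OLS slope stays attenuated.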
1.1.3 Alternatives to the Classical Approaches
1.1.3.1 The Method of Grouping
Several methods of solving the error-in-variables problem when
no additional information about variances is available have been
proposed. These methods do not require any distributional assumptions.
One such method is known as the method of grouping. The method consists of ordering the observed pairs (X_i,Y_i), selecting proportions p_1 and p_2 such that p_1 + p_2 ≤ 1, placing the first np_1 pairs in group G_1 and the last np_2 pairs in another group G_3, and discarding G_2, the middle group of observations (if p_1 + p_2 < 1). The estimate is

β̂_6 = (Ȳ_1 − Ȳ_3) / (X̄_1 − X̄_3) = b_2 / b_1

Wald (1940) states that β̂_6 is a consistent estimate of β if:

1. the grouping is independent of the errors ε_i and η_i;
2. as n → ∞, b_1 does not approach 0, namely lim inf_{n→∞} |X̄_1 − X̄_3| > 0.
In order to determine when conditions (1) and (2) are satisfied we consider several possible procedures for assigning observations to the groups. It is obvious that if the observations are assigned to the groups randomly, condition (1) would be satisfied but not condition (2), since then X̄_1 − X̄_3 converges to 0.

If the magnitude of the U_i's were known, it would be possible to rank the observations by their corresponding magnitude of U_i. This grouping procedure would satisfy both conditions. However, it is unlikely that information on the magnitude of the U_i's would be available without the actual values. Another possibility is to order the observations (X_i,Y_i) by the magnitude of the observed X_i's. This procedure will not guarantee consistency since the ordering may be dependent on the error. Neyman and Scott (1951) give the
following necessary and sufficient condition for the consistency of β̂_6 when the observations are grouped according to the magnitude of the observed X_i's:

β̂_6 is a consistent estimate of β if and only if the range of U has 'gaps' of 'sufficient length' at 'appropriate places' (determined by p_1 and p_2) where U has probability zero of occurring.

This condition guarantees that, as n → ∞, the misgrouped observations with respect to U do not contribute to plim β̂_6 tending away from β.
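The grouping conditions can be illustrated numerically: with a normal U (no gaps in its range), grouping by the observed X leaves the estimate attenuated, while grouping by the true ranks of U (a grouping independent of the errors) recovers β. A sketch under illustrative parameter values, not from the text:

```python
import numpy as np

rng = np.random.default_rng(3)

n = 30_000
alpha, beta = 1.0, 2.0
U = rng.normal(5.0, 2.0, n)                     # normal U: no gaps in its range
X = U + rng.normal(0.0, 1.0, n)
Y = alpha + beta * U + rng.normal(0.0, 0.5, n)

def wald_grouping(order):
    # outer thirds (p1 = p2 = 1/3); the middle third is discarded
    k = len(order) // 3
    g1, g3 = order[:k], order[-k:]
    return (Y[g3].mean() - Y[g1].mean()) / (X[g3].mean() - X[g1].mean())

beta6_x = wald_grouping(np.argsort(X))  # grouped by observed X: stays attenuated
beta6_u = wald_grouping(np.argsort(U))  # grouped by true U ranks: consistent
```

Here beta6_u settles near the true slope 2.0, while beta6_x stays near the attenuated value 1.6, the Neyman-Scott point above.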
1.1.3.2 The Use of Instrumental Variables
A second method of obtaining consistent estimates of β involves the use of instrumental variables. Suppose that in addition to having observations on the variables X_i and Y_i, we also observe another variable Z_i which is known to be correlated with U_i and V_i, but is independent of ε_i and η_i. Then

β̂_7 = Σ_{i=1}^n (Z_i − Z̄)(Y_i − Ȳ) / Σ_{i=1}^n (Z_i − Z̄)(X_i − X̄)

is a consistent estimate of β provided Σ_{i=1}^n (Z_i − Z̄)(X_i − X̄)/n does not approach zero as n → ∞ (i.e., cov(Z,X) ≠ 0). It should be noted that the effectiveness of this method depends on one's ability to find a variable which is independent of the error and correlated
with U and on the strength of that correlation. Letting Z_i take the values −1, 0, 1 depending on i, where i is independent of the error, reduces β̂_7 to β̂_6. Also, Durbin (1954) suggests that Z_i = i is a better instrumental variable if the rank order of the X_i's is the same as the rank order of the U_i's. That is, it leads to a more efficient estimate than the method of grouping.
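β̂_7 can be computed directly from its definition. In this sketch Z is built as a noisy correlate of U, independent of both errors; the setup and parameter values are illustrative, not from the text:

```python
import numpy as np

rng = np.random.default_rng(4)

n = 50_000
alpha, beta = 1.0, 2.0
U = rng.normal(5.0, 2.0, n)
Z = U + rng.normal(0.0, 2.0, n)   # correlated with U, independent of eps and eta
X = U + rng.normal(0.0, 1.0, n)   # error-prone measurement of U
Y = alpha + beta * U + rng.normal(0.0, 0.5, n)

num = np.sum((Z - Z.mean()) * (Y - Y.mean()))
den = np.sum((Z - Z.mean()) * (X - X.mean()))
beta7 = num / den                  # instrumental variable estimate

b_ols = np.cov(X, Y, bias=True)[0, 1] / np.var(X)   # attenuated, for contrast
```

The instrumental variable estimate is consistent for β even though the OLS slope on X remains attenuated.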
1.1.3.3 Replication of Observations
We have seen that in the classical case we can estimate β when we know σ_ε², σ_η², or σ_η²/σ_ε². This naturally leads to the consideration of the estimation of β when we can estimate one or more of these quantities. This is possible if for each (U_i,V_i) there is more than one corresponding value of (X_i,Y_i).

Assume that we have n pairs of values (U_i,V_i) and N_i observations on each pair. That is,

x_ij = U_i + ε_ij
y_ij = V_i + η_ij,   j = 1,2,...,N_i;  i = 1,2,...,n

If the usual assumptions of independence are made, a one-way analysis of variance can be carried out on the x's and y's and an estimate of β computed. Madansky describes the procedure by using the following ANOVA table.
TABLE 1.2 ANOVA Table for X and Y Data

Between sets:
  I.   Mean square: Σ_{i=1}^n N_i(x̄_i − x̄)² / (n−1)
       Expected mean square: σ_ε² + [(N² − Σ_{i=1}^n N_i²)/N(n−1)] σ_U²
  II.  Mean square: Σ_{i=1}^n N_i(x̄_i − x̄)(ȳ_i − ȳ) / (n−1)
       Expected mean square: cov(ε,η) + [(N² − Σ_{i=1}^n N_i²)/N(n−1)] β σ_U²
  III. Mean square: Σ_{i=1}^n N_i(ȳ_i − ȳ)² / (n−1)
       Expected mean square: σ_η² + [(N² − Σ_{i=1}^n N_i²)/N(n−1)] β² σ_U²

Within sets:
  IV.  Mean square: Σ_{i=1}^n Σ_{j=1}^{N_i} (x_ij − x̄_i)² / (N−n)
       Expected mean square: σ_ε²
  V.   Mean square: Σ_{i=1}^n Σ_{j=1}^{N_i} (x_ij − x̄_i)(y_ij − ȳ_i) / (N−n)
       Expected mean square: cov(ε,η)
  VI.  Mean square: Σ_{i=1}^n Σ_{j=1}^{N_i} (y_ij − ȳ_i)² / (N−n)
       Expected mean square: σ_η²
where

x̄_i = Σ_{j=1}^{N_i} x_ij / N_i      ȳ_i = Σ_{j=1}^{N_i} y_ij / N_i
x̄ = Σ_{i=1}^n Σ_{j=1}^{N_i} x_ij / N      ȳ = Σ_{i=1}^n Σ_{j=1}^{N_i} y_ij / N
N = Σ_{i=1}^n N_i
Tukey (1951) proposes the following estimates of β:

β̂_8 = (II − V) / (I − IV)
β̂_9 = (III − VI) / (II − V)
β̂_10 = √[(III − VI) / (I − IV)]

It is easily seen that these estimates converge to β as n → ∞ and N_i → ∞ for some i.
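With equal numbers of replicates, the between- and within-set mean squares and Tukey's β̂_8 = (II − V)/(I − IV) can be sketched as follows (illustrative values; equal N_i = m is assumed for simplicity):

```python
import numpy as np

rng = np.random.default_rng(5)

n, m = 20_000, 3                 # n true pairs, m replicate measures of each
alpha, beta = 1.0, 2.0
U = rng.normal(5.0, 2.0, n)
V = alpha + beta * U + rng.normal(0.0, 0.5, n)

X = U[:, None] + rng.normal(0.0, 1.0, (n, m))   # replicated error-prone x_ij
Y = V[:, None] + rng.normal(0.0, 0.5, (n, m))   # replicated y_ij
xb, yb = X.mean(axis=1), Y.mean(axis=1)          # set means
N = n * m

msI  = m * np.sum((xb - xb.mean()) ** 2) / (n - 1)                 # between, x
msII = m * np.sum((xb - xb.mean()) * (yb - yb.mean())) / (n - 1)   # between, xy
msIV = np.sum((X - xb[:, None]) ** 2) / (N - n)                    # within, x
msV  = np.sum((X - xb[:, None]) * (Y - yb[:, None])) / (N - n)     # within, xy

beta8 = (msII - msV) / (msI - msIV)   # Tukey components-of-variance estimate
```

The within-set mean squares estimate the error variance terms, so the ratio cancels the attenuating σ_ε² and recovers β.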
Housner and Brennan (1948) also give an estimate of β when there are multiple measurements on U and V. They consider the expression

b_ij,kl = (y_ij − y_kl) / (x_ij − x_kl)

Since

y_ij = α + βx_ij + η_ij − βε_ij,

then

(x_ij − x_kl) b_ij,kl = β(x_ij − x_kl) + (η_ij − η_kl) − β(ε_ij − ε_kl)

so that

b_ij,kl = β + [(η_ij − η_kl) − β(ε_ij − ε_kl)] / (x_ij − x_kl)

for all i, j, k, l and x_ij not equal to x_kl. Summing over all combinations of points and ignoring the terms involving the error gives the following estimate:

β̂_11 = [Σ_{i=1}^n ȳ_i N_i (Σ_{j=1}^n N_j − 2Σ_{j=1}^i N_j + N_i)] / [Σ_{i=1}^n x̄_i N_i (Σ_{j=1}^n N_j − 2Σ_{j=1}^i N_j + N_i)]
This estimate approaches β in probability as N_i → ∞ for at least two distinct values of i. Madansky gives the following suggestions on which estimate (β̂_8, β̂_9, β̂_10, or β̂_11) to use and when:

1. If the relation is believed to be linear, the optimum allocation of observations would be at two points. Hence, the Housner-Brennan estimate is preferred.

2. If the underlying structure is not linear, and one is trying to approximate some function in a small area of its range by a linear function, it may be advisable to increase n at the expense of decreasing N_i, to as little as 2. In this case, the Tukey components-of-regression estimate is better.

Moran states that none of these methods is optimal and suggests using the sums of squares (if ρ = 0) to estimate σ_ε² and σ_η², and then using these estimates in Barnett's solution for the case when σ_ε² and σ_η² are known. However, this is still not the complete maximum likelihood case.
Many of the ideas and results discussed above in reference to the error-in-variables problem with one independent variable extend to the multivariate situation. A brief discussion is given by Moran (1971).
1.2 Use of the Logistic Regression Model in Epidemiologic Research
1.2.1 Introduction
One of the most often used indicators of disease frequency in epidemiology is the incidence rate. Theoretically this rate estimates the probability of disease or death for a particular population over a fixed time period. It is possible to model this probability as a function of one or more independent variables. The logistic function is a commonly used model for this purpose. Although this study mainly involves the univariate logistic function, we note here that it is in the multivariate case that this function is most often used. This is because most chronic diseases are affected by multiple factors simultaneously.

There are two methods used to estimate α and β in the logistic regression model

Y_i = 1 / (1 + e^−(α + βU_i)) + η_i

One, proposed by Truett, Cornfield and Kannel (1967), uses a linear discriminant analysis approach. The other method, proposed by Walker and Duncan (1967), uses an iterative procedure to obtain maximum likelihood estimates of the parameters.
1.2.2 Discriminant Function Approach
Assume that we follow N individuals over a fixed time period. Also assume that this sample is derived from two populations: D (those who would develop the disease) and ND (those who would not develop the disease), the respective sample sizes being n_1 and n_0. Suppose also that there exists a variable U having a normal distribution with mean μ_1 in the diseased population and mean μ_0 in the non-diseased population. Finally, we assume that the variance of U is the same in each population.

Let

P(D|U) = probability of disease for an individual characterized by U
P(ND|U) = probability of not having the disease given U
P(D) = p = unconditional probability of disease
P(ND) = 1−p = unconditional probability of not having the disease
f_0(U) = P(U|ND) = probability of U given the individual does not have the disease
f_1(U) = P(U|D) = probability of U given the individual does have the disease
From Bayes' Theorem:

P(D|U) = p f_1(U) / [p f_1(U) + (1−p) f_0(U)]

In particular, if the distributions f_0(U) and f_1(U) are univariate normal with means μ_0 and μ_1 respectively, and with the same variance σ_U², then

P(D|U) = 1 / (1 + e^−(α + βU))
where

α = −log((1−p)/p) − (μ_1 − μ_0)(μ_1 + μ_0) / (2σ_U²)
β = (μ_1 − μ_0) / σ_U²

The following sample estimates of α and β are derived by inserting maximum likelihood estimates of μ_1, μ_0, p, and σ_U² into the above equations:

α̂ = −log(n_0/n_1) − ½ β̂ (ū_1 + ū_0)
β̂ = (ū_1 − ū_0) / s_U²

where

s_U² = [(n_0 − 1)s_0² + (n_1 − 1)s_1²] / (n_1 + n_0 − 2)

and s_0² and s_1² are sample variances of U given y = 0 and y = 1, respectively. An estimate of risk can be computed for each individual conditional upon his value of U as

P̂(D|U) = 1 / (1 + e^−(α̂ + β̂U))
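The discriminant function estimates can be sketched by generating U from the two normal populations and plugging the group means into the formulas above. The parameter values are illustrative, not from the text:

```python
import numpy as np

rng = np.random.default_rng(6)

n = 50_000
p, mu1, mu0, sigma = 0.3, 6.0, 5.0, 1.0     # P(D), group means, common SD
beta_true = (mu1 - mu0) / sigma**2
alpha_true = np.log(p / (1 - p)) - (mu1**2 - mu0**2) / (2 * sigma**2)

y = rng.binomial(1, p, n)                    # disease indicator
U = np.where(y == 1, rng.normal(mu1, sigma, n), rng.normal(mu0, sigma, n))

u1, u0 = U[y == 1], U[y == 0]
n1, n0 = len(u1), len(u0)
# pooled within-group variance of U
s2 = ((n0 - 1) * u0.var(ddof=1) + (n1 - 1) * u1.var(ddof=1)) / (n0 + n1 - 2)

beta_hat = (u1.mean() - u0.mean()) / s2
alpha_hat = -np.log(n0 / n1) - 0.5 * beta_hat * (u1.mean() + u0.mean())
risk = 1.0 / (1.0 + np.exp(-(alpha_hat + beta_hat * U)))   # per-person risk
```

Because the data here satisfy the normality and equal-variance assumptions exactly, the plug-in estimates recover α and β.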
1.2.3 Weighted Least Squares/Maximum Likelihood Approach
Walker and Duncan (1967) assume that the logistic function is an appropriate model for the probability that an individual will develop a particular disease conditional upon the risk factor U. They proceed from that assumption to use weighted least squares estimation to obtain estimates of the coefficients of the model:
Y_i = 1 / (1 + e^−(α + βU_i)) + η_i

where

E(Y_i|U_i) = 1 / (1 + e^−(α + βU_i)) = P_i

is the probability that the ith individual in the sample acquires the disease within the follow-up period given U_i.
They approximate the function f(U_i,α,β) using a first-order Taylor series expansion around initial values α_0 and β_0 as

f(U_i,α,β) ≈ f(U_i,α_0,β_0) + [∂f(U_i,α_0,β_0)/∂α](α − α_0) + [∂f(U_i,α_0,β_0)/∂β](β − β_0)

Letting Y*, U*, and η* denote the correspondingly rescaled observations, design matrix, and error, Y_i is approximated as

Y* ≈ U*(θ − θ_0) + η*

The iterative weighted least squares estimate of θ = (α,β) is then

θ̂_{r+1} = θ̂_r + (U′W_r U)⁻¹ U′W_r Y*_r

where U is an n×2 matrix having as its ith row (1, U_i), W_r is a diagonal weight matrix determined as the inverse of the variance matrix of η*, and Y* is an n×1 vector of rescaled observations, where θ = (α,β).
The previous results are identical to those obtained by using maximum likelihood equations. Given a sample of n individuals free of disease who are followed for a fixed period of time and identified at the end of the period as having developed the disease or not, the probability of disease given the variable U_i is

P(Y_i = y_i | U_i) = P_i^{y_i} (1 − P_i)^{1−y_i}

where Y_i = 1 if the ith individual has the disease and Y_i = 0 otherwise. The likelihood of the sample of n individuals is given by

L(α,β) = Π_{i=1}^n P_i^{Y_i} (1 − P_i)^{1−Y_i}

The likelihood equations are then

Σ_{i=1}^n Y_i − Σ_{i=1}^n P_i = 0
Σ_{i=1}^n Y_i U_i − Σ_{i=1}^n P_i U_i = 0
Because of the nonlinearity of the above equations, this method requires the use of an iterative computing technique. The most often used iterative method is the Newton-Raphson technique, which is based on the first-order Taylor series expansion of

T(θ) = d log L(θ)/dθ

T(θ) = T(θ_0) + T′(θ_0)(θ − θ_0) + T″(θ_0)(θ − θ_0)²/2! + ...
T(θ) ≈ T(θ_0) + T′(θ_0)(θ − θ_0)

Setting T(θ) = 0 and solving for θ gives us the MLE of θ:

θ_1 = θ_0 − [T′(θ_0)]⁻¹ T(θ_0)

In practice the initial estimates are the discriminant function estimates. The right side gives a new trial value, for which the process is repeated until successive θ estimates agree to a specified extent and T(θ) = 0 at convergence. This method works well if T(θ) is stable over a range of values near θ (i.e., if the likelihood function is close to normal in shape). Asymptotic likelihood theory guarantees this for large samples. The method fails if the likelihood is multimodal.
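The Newton-Raphson iteration for the likelihood equations can be sketched directly; the score and information below play the roles of T(θ) and −T′(θ) for the two-parameter logistic model. The data are simulated for illustration (not the Framingham or LRC data):

```python
import numpy as np

rng = np.random.default_rng(7)

n = 20_000
alpha, beta = -2.0, 0.5
U = rng.normal(3.0, 1.5, n)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(alpha + beta * U))))

D = np.column_stack([np.ones(n), U])       # design matrix, ith row (1, U_i)
theta = np.zeros(2)                         # initial trial value

for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-(D @ theta)))
    score = D.T @ (y - p)                   # T(theta): the likelihood equations
    info = D.T @ (D * (p * (1 - p))[:, None])   # -T'(theta)
    step = np.linalg.solve(info, score)
    theta = theta + step
    if np.abs(step).max() < 1e-10:          # successive estimates agree
        break
```

Convergence here is rapid; in practice the discriminant function estimates, rather than zeros, would be used as starting values.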
1.2.4 Comparison of the Approaches
One of the major drawbacks of the discriminant function approach is its requirement that U be normally distributed. This requirement is rarely satisfied, even approximately, although in some cases an appropriate transformation can be made to normalize the variable. Truett, Cornfield and Kannel's results do, however, show that despite extreme non-normality of U the agreement between observation and expectation is quite good.

The Walker-Duncan weighted least squares approach makes no such assumptions about the distribution of the independent variable. Also, this approach forces the total number of expected cases to equal the total number of observed cases. This is a desirable property for any smoothing procedure. As stated previously, this procedure is iterative
and therefore computationally more difficult than the discriminant
function approach.
Halperin, Blackwelder and Verter (1970) prefer the weighted
least squares approach on theoretical grounds since it does not
assume any particular distribution of U and it gives results which
asymptotically converge to the proper value if the logistic model
holds.
1.3 Outline of the Research
The purpose of this research is to study the effect of measurement error in the independent variable on the estimated logistic regression model coefficients and to propose methods of producing consistent estimates of the coefficients. The main focus of this study is the univariate logistic regression model:

Y_i = 1 / (1 + e^−(α + βU_i)) + η_i

In Chapter II the bias in the estimated coefficients of the univariate logistic model is described for several different distributions of the unobserved independent variable, U_i. Also, a description of the bias is given for the multivariate logistic regression model with two independent variables, one measured without error and the other measured with error.

In Chapter III we investigate as possible solutions to the error-in-variables problem under the logistic regression model: an adaptation of the method of grouping, the use of instrumental variables, and solutions when certain variances are known. The criteria
for comparison between these methods and the iterative weighted least squares method are the simulated bias and MSE of the estimated coefficient.

In Chapter IV the use of multiple measures of the independent variable is investigated. A comparison is made between the use of one measurement, the mean of m measurements, and the James-Stein estimate of the independent variable as the independent variable in the model. Again the criteria for comparison are the simulated bias and MSE.

Finally, in Chapter V the methods discussed are applied to a data set from the Lipid Research Clinics Program Prevalence Study. That data set contains observations on 4171 men and women. The variables of interest are cholesterol level, triglyceride level, and age, with the outcome variable being mortality status at 5.7 years after the measurements are taken. Other variables brought into the study and used as instrumental variables are HDL cholesterol level and Quetelet index (weight/height²).

A summary of the results and recommendations for further research are given in Chapter VI.
CHAPTER II
DESCRIPTION OF THE BIAS DUE TO MEASUREMENT ERROR
Before investigating possible solutions to the error-in-variables problem under the logistic regression model, it is important to make two determinations. We must first determine whether there is a bias under this model. Secondly, we must determine the direction and magnitude of the bias given that it exists. If there is no bias, or if the bias is always small and/or away from the null, there would be little need for this study. It is also of interest to describe the bias for different distributions of the independent variable and to determine whether the distribution of the independent variable has an effect on the bias. Therefore, in this chapter the bias is described for the following cases:

1. Simple linear regression;
2. Multiple linear regression (two independent variables);
3. Simple logistic regression where the independent variable is
   a) conditionally normal;
   b) conditionally exponential;
   c) unconditionally normal;
   d) unconditionally exponential.
4. Multiple logistic regression where the independent variables are
   a) conditionally bivariate normal;
   b) unconditionally bivariate normal.

2.1 A Review of the Simple Linear and Multiple Linear Regression Cases

2.1.1 Simple Linear Regression Case
It was stated previously that in the simple linear error-in-variables problem the OLS estimate of β_U is biased. That is, if β̂_X is the OLS estimate based on the observed values, then

E(β̂_X) ≈ β_U σ_U² / (σ_U² + σ_ε²)

and the bias is

E(β̂_X) − β_U ≈ −β_U σ_ε² / (σ_U² + σ_ε²)
Proof

Assume that the true relationship between V and U is

V = α_U + β_U U

but we observe

X = U + ε
Y = V + η

and

E(ε) = E(η) = E(Uε) = E(ηV) = E(Vε) = E(Uη) = E(εη) = 0
var ε = σ_ε²,  var U = σ_U²
then

β_X = cov(X,Y)/var X = cov(U,V)/(σ_U² + σ_ε²) = β_U σ_U²/(σ_U² + σ_ε²)
It is known from OLS theory that ax is consistent for Bx• Thisim-0 2
plies that ax is consistent fO,r Bu{o~ ~ o~} and not for Bu. The bias
is always towards the null and if o~ is small relative to O~t the
biasis sma11 .
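As an illustrative aside (not part of the dissertation's own computations), the attenuation result above can be checked numerically; the sample size, seed, and variance choices below are arbitrary assumptions:

```python
import numpy as np

# Hypothetical check of the attenuation result: with sigma_u^2 = sigma_e^2 = 1,
# E(beta_x) = beta_u * sigma_u^2 / (sigma_u^2 + sigma_e^2) = 0.5.
rng = np.random.default_rng(0)
n, beta_u = 100_000, 1.0

U = rng.normal(0.0, 1.0, n)               # true predictor, var sigma_u^2 = 1
X = U + rng.normal(0.0, 1.0, n)           # observed with error, var sigma_e^2 = 1
Y = beta_u * U + rng.normal(0.0, 1.0, n)  # linear model with alpha_u = 0

beta_x = np.cov(X, Y)[0, 1] / np.var(X, ddof=1)  # OLS slope on observed data
print(beta_x)  # near 0.5 rather than beta_u = 1
```

With equal signal and error variances the slope is cut roughly in half, exactly the attenuation factor derived above.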
2.1.2 Multiple Linear Regression Case
Cochran (1968) examines the bias when there are two independent
variables in the model as follows: Suppose the true unobservable
relationship is

    V = α_u + β_u1 U1 + β_u2 U2

but the following variables are observed:

    X1 = U1 + ε1
    X2 = U2 + ε2
    Y = V + η

where ε1 and ε2 have variances σ_ε1² and σ_ε2², and U1, U2, ε1, ε2, and η
are uncorrelated except for corr(U1, U2) = ρ. The coefficients β_u1 and β_u2
are therefore estimated based on the observable relationship

    Y = α_x + β_x1 X1 + β_x2 X2 + η.

Let the variance-covariance matrices for the U's and X's be Σ_u and
Σ_x = Σ_u + Σ_ε. Then

    β_x = Σ_x⁻¹ σ_xy = Σ_x⁻¹ Σ_u β_u.

This implies that if β̂_x1 and β̂_x2 are the OLS estimates of β_u1 and
β_u2, then E(β̂_x) = Σ_x⁻¹ Σ_u β_u.

If the correlation between U1 and U2 is zero (i.e., ρ = 0), then each
coefficient is attenuated by its own ratio σ_u²/(σ_u² + σ_ε²), which is
identical to the result in the univariate case.

If only one variable is subject to measurement error, for
example U2, then even though U1 is measured without error there is a bias in the
estimation of β_u1. This bias may be negative or positive, depending
on the signs of β_u1, β_u2, and ρ. It is possible to produce situations
in which β_u1 < β_u2 but E(β̂_x1) > E(β̂_x2). Since f < 1, we observe a
larger bias in β̂_x2 for the multiple regression case than for the
simple linear regression case. Also, as ρ increases to 1, the bias
in the multiple linear regression case increases.
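A numeric sketch of the two-variable result, under the same configuration used later in section 2.3.3 (unit variances, error only in U2); the seed and sample size are illustrative assumptions:

```python
import numpy as np

# Error in X2 alone still biases the coefficient of the error-free U1
# (upward here), while the coefficient of X2 is attenuated toward 0.
rng = np.random.default_rng(1)
n, rho = 100_000, 0.75
cov = [[1.0, rho], [rho, 1.0]]
U = rng.multivariate_normal([0.0, 0.0], cov, n)
Y = U[:, 0] + U[:, 1] + rng.normal(0.0, 1.0, n)  # beta_u1 = beta_u2 = 1

X2 = U[:, 1] + rng.normal(0.0, 1.0, n)           # only U2 measured with error
D = np.column_stack([np.ones(n), U[:, 0], X2])
b = np.linalg.lstsq(D, Y, rcond=None)[0]
print(b[1], b[2])  # roughly 1.52 and 0.30 for these population values
```

The population values follow from β_x = Σ_x⁻¹ Σ_u β_u with the matrices above; note the error-free coefficient is pushed away from 1, not toward it.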
2.2 Mathematical Descriptions of the Bias Under the Logistic Regression Model
In Chapter I we discussed three methods of estimating the
coefficient β in the logistic regression model. The discriminant
function approach assumes that U is conditionally normal and yields
an estimator of β_x that can be used to describe the bias in estimating β_u. The iterative weighted least squares and the maximum
likelihood methods make no distributional assumptions about U, but
do not produce a closed-form solution for β, thus making a mathematical description of the bias difficult to obtain. In this
section the bias is described mathematically with the independent
variable conditionally distributed.

Assume that U|Y=1 ~ N(μ1, σ_u²) and U|Y=0 ~ N(μ0, σ_u²); then from
Chapter I

    β_u = (μ1 - μ0)/σ_u²

and

    α_u = -ln((1-p)/p) - (μ1 - μ0)(μ1 + μ0)/(2σ_u²).
Under the error-in-variables model we observe

    X = U + ε

where ε ~ N(0, σ_ε²) and is independent of U, while the outcome Y is
observed without error, so that

    var(X|Y) = var(U|Y) + var ε = σ_u² + σ_ε².

Therefore

    β_x = (μ1 - μ0)/(σ_u² + σ_ε²) = β_u σ_u²/(σ_u² + σ_ε²).

These results are identical to those obtained in the simple linear
case. We conclude that when the true independent variable comes
from a mixture of two normals with equal variance and measurement
error exists in the independent variable, the discriminant function
estimate of the coefficients using the observed data is not unbiased.
The bias is always towards the null and is dependent upon the magnitude of σ_ε² relative to that of σ_u².
Next we investigate the bias when U|Y=1 ~ Exponential(λ1) and
U|Y=0 ~ Exponential(λ0). In this case

    P(Y=1|U) = 1/(1 + e^{-(α_u + β_u U)})

where

    α_u = -ln(λ1(1-p)/(λ0 p)),   β_u = (λ1 - λ0)/(λ0 λ1).

If we observe X = U + ε, where ε ~ N(0, σ_ε²) and ε is independent of U,
then

    var(X|Y=0) = var(U|Y=0) + var ε
    var(X|Y=1) = var(U|Y=1) + var ε.

As a result, there is a bias in the estimation of
β_u when measurement error is present and the independent variable
is conditionally exponential. Again the bias is towards the null.
We conclude this section by considering the bias in the multiple
logistic model when (U1, U2) is conditionally multivariate normal.
Assume that

    U|Y=0 ~ MN(μ_0, Σ_u)
    U|Y=1 ~ MN(μ_1, Σ_u)

where

    U = (U1, U2)',  μ_0 = (μ10, μ20)',  μ_1 = (μ11, μ21)'.

If we observe X = U + ε, then

    X|Y=0 ~ MN(μ_0, Σ_x)
    X|Y=1 ~ MN(μ_1, Σ_x)

where Σ_x = Σ_u + Σ_ε. The expected value of the estimator of β_u that
we would obtain by using the discriminant function approach on the
observed X's is

    E(β̂_x) = Σ_x⁻¹ Σ_u β_u.

This result is also identical to the multiple linear regression case,
and the conclusions made in that case also apply here.
2.3 Description of the Bias Under the Logistic Regression Model Using Simulation
In this section we study the bias when the iterative weighted
least squares or maximum likelihood methods are used and no distri
butional assumptions about the independent variable are necessary.
2.3.1 Description of the Simulation Procedure

General Model:

    Y_i|U_i = 1/(1 + e^{-(α + βU_i)}) + η_i

where

    Y_i = 1 with probability P_i = 1/(1 + e^{-(α + βU_i)})
    Y_i = 0 with probability 1 - P_i,

i.e., Y_i|U_i is Bernoulli(P_i).

We begin the simulation by generating values of U_i under the
following distributions:

1. U_i|Y_i Normal
2. U_i|Y_i Exponential
3. U_i ~ Normal
4. U_i ~ Exponential

Next, the error values, ε_i, are generated from a N(0,1) distribution.
The observed values, X_i, are then computed as

    X_i = U_i + ε_i.

After assigning values to α and β, values of the dependent variable
Y_i are derived by computing

    P_i = 1/(1 + e^{-(α_u + β_u U_i)})

and letting Y_i = 1 if ξ_i < P_i, and Y_i = 0 otherwise,
where ξ_i ~ Uniform(0,1). For each sample of values (Y_i, X_i), i = 1, 2,
..., 500, the SAS procedure LOGIST was used to produce the iterative
weighted least squares estimate of β_u. The bias was then computed as
β̂ - β_u. This procedure was followed for 150 samples of size 500, resulting
in 150 estimates of the coefficient. The mean and standard deviation of
these estimates were computed along with 95% confidence intervals.
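The procedure above can be sketched in Python; the Newton-Raphson fit below stands in for SAS PROC LOGIST, and the seed, number of replicates (50 rather than 150), and parameter values α = β_u = 1 are illustrative assumptions:

```python
import numpy as np

def fit_logistic(x, y, iters=25):
    """Plain Newton-Raphson (IWLS) fit of logit P(Y=1) = a + b*x."""
    D = np.column_stack([np.ones(len(x)), x])
    th = np.zeros(2)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-D @ th))
        W = p * (1.0 - p)
        th = th + np.linalg.solve(D.T @ (W[:, None] * D), D.T @ (y - p))
    return th

rng = np.random.default_rng(2)
alpha, beta_u, n = 1.0, 1.0, 500
biases = []
for _ in range(50):                     # the text uses 150 samples
    U = rng.normal(0.0, 1.0, n)         # case 3: U unconditionally normal
    X = U + rng.normal(0.0, 1.0, n)     # measurement error, variance 1
    P = 1.0 / (1.0 + np.exp(-(alpha + beta_u * U)))
    Y = (rng.uniform(size=n) < P).astype(float)
    biases.append(fit_logistic(X, Y)[1] - beta_u)

mean_bias = np.mean(biases)
print(mean_bias)  # near the -0.5377 reported in Table 2.1
```

Replacing the U-generation step reproduces the other three cases of the design.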
2.3.2 Simple Logistic Regression Case

Simulations in which the independent variable is conditionally
distributed were done as a check on the simulation procedure, since
we know mathematically what the results should be. The results are
summarized in Table 2.1. We selected 150 samples of size 500 from
a mixture of two normal populations. The conditional normal distributions were:

    U|Y=1 ~ N(2,1)
    U|Y=0 ~ N(1,1)

with p = P(Y=1) = 1/2 and β_u = (μ1 - μ0)/σ_u² = 1. If the simulation
is performing as expected, then

    E(β̂_x) ≈ β_u σ_u²/(σ_u² + σ_ε²) = 0.5.

Our results are good, with E(β̂_x) = 0.5109.

Next we checked the simulation procedure when U was conditionally
exponential. We let p = 1/2, λ0 = 1, λ1 = 2 and β_u = (λ1 - λ0)/(λ0 λ1) = 0.5.
In this case, too, the estimate should fall short of β_u, and again
the results are good, with E(β̂_x) = 0.3201. Based on these results,
we feel confident that the simulations are performing as desired.

The simulations show that when U_i is distributed as a N(0,1)
variable, the estimated bias is -0.5377 when β_u = 1 and 0.5338 when
β_u = -1. The bias for this case is towards the null and slightly
greater in magnitude than the bias that results when U_i comes from
a mixture of two normals. When U_i is exponential (λ=1) and β_u = 1,
the bias is -0.6825. This is much larger than the bias in the normal
case.
2.3.3 Multiple Logistic Regression Case

We now investigate the bias under a multiple logistic regression
model. In particular we assume that (U1, U2) come from a multivariate
normal distribution with correlation ρ, means 0, and variances 1.
U2 is measured with error and U1 is measured without error. The
simulation is performed with ρ = 0.25 and then with ρ = 0.75. The
parameters α, β_u1, and β_u2 are all assigned the value 1.

The results show that there is a bias in the estimation of β_u1
although U1 is measured without error. (See Table 2.1.) Also, the
bias in estimating β_u2 is greater than the bias that results in the
simple logistic case. We know from the linear regression case that

    E(β̂_x)_MULT = f · E(β̂_x)_SIMPLE

where f = (1 - ρ²)/(1 - ρ²R²) and R is the correlation between X2 and
U2, so that R² = σ_u²/(σ_u² + σ_ε²). We find approximately the same
relationship,

    E(β̂_x)_MULT ≈ f · E(β̂_x)_SIMPLE,

for the logistic case in the following table:

    ρ       f        E(β̂_x)_MULT    f·E(β̂_x)_SIMPLE
    0.25    0.9677   0.4392         0.4474
    0.75    0.6087   0.2755         0.2814
TABLE 2.1  Estimates of the Bias Along With 95% Confidence Intervals

    Model                          E(β̂_x)    Bias      95% CI on the Bias
    U|Y ~ Normal, β_u = 1          0.5109   -0.4891   (-0.4993, -0.4789)
    U|Y ~ Exponential, β_u = 0.5   0.3201   -0.1799   (-0.1881, -0.1717)
    U ~ Normal, β_u = 1            0.4623   -0.5377   (-0.5503, -0.5251)
    U ~ Exponential, β_u = 1       0.3175   -0.6825   (-0.6950, -0.6700)
    U ~ Normal, β_u = -1          -0.4662    0.5338   ( 0.5208,  0.5468)
    (U1,U2) Bivariate Normal, ρ = 0.25
        β_u1 = 1                   1.0631    0.0631   ( 0.0407,  0.0855)
        β_u2 = 1                   0.4392   -0.5608   (-0.5743, -0.5473)
    (U1,U2) Bivariate Normal, ρ = 0.75
        β_u1 = 1                   1.4868    0.4868   ( 0.4576,  0.5160)
        β_u2 = 1                   0.2755   -0.7245   (-0.7390, -0.7101)
CHAPTER III

ALTERNATIVE ESTIMATES OF β WHEN MEASUREMENT ERROR IS PRESENT IN THE SIMPLE LOGISTIC REGRESSION MODEL
3.1 Modified Discriminant Function Estimator

Recall that the discriminant function estimator of the coefficients in the logistic model is appropriate when one can assume that
the conditional distribution of the independent variable, U given Y,
is normal. That is, if

    U|Y=0 ~ N(μ0, σ_u²)
    U|Y=1 ~ N(μ1, σ_u²)

and

    P(Y=1|U) = 1/(1 + e^{-(α + βU)}),

then

    β = (μ1 - μ0)/σ_u².                                   (3.1.1)

Suppose measurement error is present; that is, we observe
X = U + ε, where U and ε are independent with variances σ_u² and σ_ε²,
respectively. Since the variance of X is σ_x² = σ_u² + σ_ε², by substituting σ_x² - σ_ε² for σ_u² in equation (3.1.1), β can be rewritten as:

    β = (μ1 - μ0)/(σ_x² - σ_ε²).                          (3.1.2)

If σ_ε² is known, α and β can be estimated by replacing μ1, μ0, σ_x²,
and P(Y=1) in equation (3.1.2) with their maximum likelihood estimators: X̄1, X̄0, S_x², and p̂ = n1/n. The resulting modified discriminant function estimator is:

    β̂_MD = (X̄1 - X̄0)/(S² - σ_ε²),

where S² is the pooled sample variance formed from S1² and S0², the
sample variances of X_i given Y_i=1 and X_i given Y_i=0, respectively.
Since X̄1, X̄0, S_x², and p̂ are maximum likelihood estimators of μ1, μ0,
σ_x², and P(Y=1), respectively, they are consistent estimators, and
therefore α̂_MD and β̂_MD are consistent estimators of α and β. Even if
the assumption of conditional normality is not met, the estimator in
(3.1.2) may produce better estimates of the parameters than the
iterative weighted least squares estimator of Walker and Duncan.
These estimators will be compared in section 3.5.
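A sketch of the computation, assuming σ_ε² is known; the pooled form used for S² is our reading of the text, and the sample size and seed are arbitrary:

```python
import numpy as np

# Modified discriminant function estimate (3.1.2) on simulated data:
# U|Y=1 ~ N(2,1), U|Y=0 ~ N(1,1), p = 1/2, known error variance 1.
rng = np.random.default_rng(3)
n, s2_e = 5000, 1.0

Y = rng.uniform(size=n) < 0.5
U = np.where(Y, rng.normal(2.0, 1.0, n), rng.normal(1.0, 1.0, n))
X = U + rng.normal(0.0, np.sqrt(s2_e), n)

x1, x0 = X[Y].mean(), X[~Y].mean()
n1, n0 = int(Y.sum()), int((~Y).sum())
s2 = ((n1 - 1) * X[Y].var(ddof=1) + (n0 - 1) * X[~Y].var(ddof=1)) / (n - 2)
beta_md = (x1 - x0) / (s2 - s2_e)   # subtract the known error variance
print(beta_md)  # near beta_u = (mu1 - mu0) / sigma_u^2 = 1
```

Without the subtraction of σ_ε² in the denominator, the same computation gives the attenuated β_x of Chapter II.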
3.2 An Estimator Based on the Method of Grouping

One means of obtaining a consistent estimator of β when measurement error occurs in simple linear regression is the method of grouping. This suggests that the method might also produce less biased
estimators in the case of simple logistic regression.

3.2.1 The Method

Suppose one has available a grouping criterion, Z, that is correlated with the true unobserved independent variable, U, but is not
correlated with the error term, ε. The method consists of ranking
the observations based upon the magnitude of their corresponding
Z values. Those with the smallest n1 Z values are classified
into group 1. The middle values are classified into group 2
and discarded. Those observations with the largest n3 Z values are
classified into group 3. The group sizes are n1 = n2 = n3 = n/3.
An ideal situation would be to have a grouping criterion that would
correctly classify each observed X_i value into the same group
that its corresponding U_i value belongs to. The following quantities
are computed for groups 1 and 3:

    a. X̄_g = Σ X_i/n_g,             g = 1, 3
    b. P̂_g = Σ Y_i/n_g,             g = 1, 3
    c. Â_g = log[P̂_g/(1 - P̂_g)],   g = 1, 3

The parameter β is then estimated as

    β̂_GRP = (Â1 - Â3)/(X̄1 - X̄3),

with α estimated from the fitted line.
3.2.2 Rationale Behind the Method

We first transform the model

    P_i = P(Y_i=1|U_i) = 1/(1 + e^{-(α + βU_i)})

into the following linear form:

    A_i = log[P_i/(1 - P_i)] = α + βU_i.                  (3.2.2.1)

Since Y_i is a dichotomous variable, an observed value of
log[P_i/(1 - P_i)] cannot be computed unless it can be assumed that
P_i is equal to some value P_g for all U_i's in group g; therefore we
can pool the corresponding Y_i's and estimate P_g, where

    P(Y_i=1|U_i) = P_g                                    (3.2.2.2)

for i = 1, 2, ..., n_g and g = 1, 2, 3.

As a result, equation (3.2.2.1) becomes

    A_g = α + βU_ig,

and, averaging within groups,

    Ā_g = A_g = α + βŪ_g                                  (3.2.2.3)

where

    Ā_g = Σ A_g/n_g,   Ū_g = Σ U_i/n_g.

In this form we recognize β as the slope of the line in equation
(3.2.2.3). Therefore, β can be written as

    β = (A1 - A3)/(Ū1 - Ū3).                              (3.2.2.4)
Theorem 3.1

The estimator β̂_GRP is consistent for β in equation (3.2.2.4)
if

    i)  grouping is independent of the error, ε_i;
    ii) X̄1 - X̄3 does not approach zero as n becomes large.

Proof:

Again,

    β̂_GRP = (Â1 - Â3)/(X̄1 - X̄3).

a) Using Taylor series expansions of Â1 and Â3 about P1 and P3,
respectively, we have

    Â_g = log[P̂_g/(1 - P̂_g)]
        = log[P_g/(1 - P_g)] + (P̂_g - P_g)/(P_g(1 - P_g))
          + ½ (2P_g - 1)/(P_g²(1 - P_g)²) (P̂_g - P_g)² + ...

for g = 1, 3. Since P̂_g is a sample proportion, P̂_g → P_g as n → ∞;
therefore

    lim_{n→∞} E(Â_g) = log[P_g/(1 - P_g)] = A_g

and

    lim_{n→∞} var(Â_g) = 0.

Therefore Â1 and Â3 are consistent for A1 and A3, respectively.

b) If the grouping is independent of the error, ε_i, then

    var[(X̄1 - X̄3) - (Ū1 - Ū3)]
        = var[(1/n1) Σ_{G1} (X_i - U_i) - (1/n3) Σ_{G3} (X_i - U_i)]
        = σ_ε²/n1 + σ_ε²/n3
        = σ_ε² (p1⁻¹ + p3⁻¹)/n,

where p_g = n_g/n, and

    lim_{n→∞} var[(X̄1 - X̄3) - (Ū1 - Ū3)] = 0.

Therefore X̄1 - X̄3 goes to Ū1 - Ū3 in probability as n becomes large.

c) If Â1 - Â3 and X̄1 - X̄3 are consistent for A1 - A3 and Ū1 - Ū3,
respectively, then β̂_GRP is consistent for β as long as Ū1 - Ū3 does
not approach zero. If the grouping criterion, Z, is correlated with
U, then

    E(U|g=1) ≠ E(U|g=3),

and therefore Ū1 - Ū3 will never approach 0.

In conclusion, if we have a grouping criterion that is correlated with the unobserved independent variable, U, but is not correlated with the error term, ε, then β̂_GRP is consistent for β in
model (3.2.2.2). If the P_i's are not equal within groups, β in
model (3.2.2.2) will be an approximation to the parameter in the
true model. Even so, it is possible that β̂_GRP will estimate the
true parameter better than the Walker-Duncan estimator, which, when
measurement error is present, is also based upon an approximate model.
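The grouping computation can be sketched as follows; the correlation corr(U, Z) = 0.9, sample size, and seed are illustrative assumptions:

```python
import numpy as np

# Grouping estimator of section 3.2: rank on Z, keep the outer thirds,
# and take the ratio of logit and mean differences between groups 1 and 3.
rng = np.random.default_rng(4)
n = 3000
U = rng.normal(0.0, 1.0, n)
Z = 0.9 * U + np.sqrt(1 - 0.9**2) * rng.normal(0.0, 1.0, n)  # grouping criterion
X = U + rng.normal(0.0, 1.0, n)                              # error variance 1
P = 1.0 / (1.0 + np.exp(-(1.0 + 1.0 * U)))                   # alpha = beta = 1
Y = (rng.uniform(size=n) < P).astype(float)

order = np.argsort(Z)
g1, g3 = order[: n // 3], order[-(n // 3):]   # middle third discarded
xbar = [X[g].mean() for g in (g1, g3)]
phat = [Y[g].mean() for g in (g1, g3)]
ahat = [np.log(p / (1 - p)) for p in phat]
beta_grp = (ahat[0] - ahat[1]) / (xbar[0] - xbar[1])
print(beta_grp)  # much less biased than the naive logistic fit on X
```

Lowering the correlation between U and Z inflates the variance of the estimator, consistent with the comparisons in section 3.5.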
3.3 Use of an Instrumental Variable to Estimate β

Another method of obtaining consistent estimates of β in linear
regression is one based upon the use of an instrumental variable.
An instrumental variable is one that is correlated with the true
unobserved independent variable, but is not correlated with the
error terms. In this section we describe the use of an instrumental
variable, Z, in producing estimators of β in simple logistic regression when measurement error is present in the independent variable.
3.3.1 Iterative Instrumental Variable Estimator

Let

    Y_i = 1/(1 + e^{-(α + βU_i)}) + η_i
        = 1/(1 + e^{-(α + βX_i - βε_i)}) + η_i
        = f(α, β, ε_i) + η_i.

Using a first-order Taylor series expansion of f(α, β, ε_i) about the
point (α0, β0, 0), the model can be approximated as follows:

    Y_i ≈ P_{0i} + P_{0i}Q_{0i}(α - α0) + X_i P_{0i}Q_{0i}(β - β0)
          - P_{0i}Q_{0i} β0 ε_i + η_i                      (3.3.1.1)

where

    P_{0i} = 1/(1 + e^{-(α0 + β0 X_i)})   and   Q_{0i} = 1 - P_{0i}.

Next, let

    Y*_i = (Y_i - P_{0i})/(P_{0i}Q_{0i}).

As a result,

    Y*_i ≈ (α - α0) + X_i(β - β0) - β0 ε_i + η*_i.         (3.3.1.2)

If one observes an instrumental variable, Z_i, that is correlated
with U_i but is not correlated with η_i or ε_i, then

    a) cov(η*, Z) = 0, since η is independent of X and Z,
    b) cov(β0 ε, Z) = β0 cov(ε, Z) = 0,

and

    c) cov(X, Z) = cov(U, Z) ≠ 0.

Using equation (3.3.1.2), β can be written as:

    cov(Y*, Z) + β0 cov(X, Z) = β cov(X, Z).               (3.3.1.3)

A natural estimator of β* is

    β̂_{r+1} = β̂_r + [Σ (Z_i - Z̄)(Y*_{ir} - Ȳ*_r)] / [Σ (Z_i - Z̄)(X_i - X̄)].   (3.3.1.4)

Next we derive an estimator of α. Taking the expected value of the
terms in equation (3.3.1.2), and using E(ε_i) = E(η_i) = 0, we have

    E(Y*) = (α - α0) + (β - β0)E(X).

A natural estimator of α* is

    α̂_{r+1} = α̂_r + Ȳ*_r - (β̂_{r+1} - β̂_r) X̄.             (3.3.1.5)

Initial values, α̂0 and β̂0, are assigned, and equations (3.3.1.4)
and (3.3.1.5) are computed iteratively until the solution converges;
for example, until |β̂_{r+1} - β̂_r| is less than some prespecified
amount (e.g., 0.0005).
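The iteration can be sketched as follows; the starting values, tolerance, corr(U, Z) = 0.9, and the clipping safeguard are our own illustrative choices:

```python
import numpy as np

# Iterative IV updates (3.3.1.4)-(3.3.1.5) on simulated data.
rng = np.random.default_rng(5)
n = 5000
U = rng.normal(0.0, 1.0, n)
Z = 0.9 * U + np.sqrt(1 - 0.81) * rng.normal(0.0, 1.0, n)  # instrument
X = U + rng.normal(0.0, 1.0, n)                            # error variance 1
P = 1.0 / (1.0 + np.exp(-(1.0 + U)))                       # alpha = beta = 1
Y = (rng.uniform(size=n) < P).astype(float)

a, b = 0.0, 0.5                          # starting values alpha_0, beta_0
for _ in range(50):
    p = 1.0 / (1.0 + np.exp(-(a + b * X)))
    p = np.clip(p, 1e-10, 1 - 1e-10)     # numerical safeguard
    ystar = (Y - p) / (p * (1.0 - p))    # Y* of equation (3.3.1.2)
    b_new = b + np.cov(Z, ystar)[0, 1] / np.cov(Z, X)[0, 1]
    a_new = a + ystar.mean() - (b_new - b) * X.mean()
    done = abs(b_new - b) < 5e-4
    a, b = a_new, b_new
    if done:
        break
print(b)  # much closer to beta = 1 than the naive fit on X
```

As the text notes, convergence degrades when corr(U, Z) is small; with a weak instrument the covariance in the denominator is noisy and the updates become unstable.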
Theorem 3.2

Let α̂_IV and β̂_IV be the convergent solutions to equations (3.3.1.4)
and (3.3.1.5). Then α̂_IV and β̂_IV are consistent estimators of α*
and β*, respectively.

Proof:

At convergence,

    β̂_r = β̂_{r+1} = β̂_IV,                                 (3.3.1.6)

so the update term Σ (Z_i - Z̄)(Y*_{ir} - Ȳ*_r) / Σ (Z_i - Z̄)(X_i - X̄)
vanishes. Since Σ (Z_i - Z̄)(Y*_i - Ȳ*) is consistent for (n-2) cov(Z, Y*),
and Σ (Z_i - Z̄)(X_i - X̄) is consistent for (n-2) cov(Z, X), then β̂_IV,
which is the ratio of the two sample covariances, is consistent for β*,
which is the ratio of the corresponding population covariances, as
long as cov(Z, X) ≠ 0. Since Ȳ*, X̄, and β̂_IV are consistent for E(Y*),
E(X), and β*, respectively, α̂_IV is consistent for α*.

The parameters α* and β* only approximate α and β; therefore
α̂_IV and β̂_IV are not consistent for α and β. In cases where α* and
β* are close to α and β, respectively, these estimators should be
less biased than the iterative weighted least squares estimate. The
estimators α̂_IV and β̂_IV can also be used to estimate β when only a
categorical variable is available. If the observations belong to
three distinct groups, Z_i can take the values 1, -1, 0, depending upon
whether observation i is a member of group g, where g = 1, 2, 3. A
simplified form of the update is then

    β̂_{r+1} = β̂_r + [Σ_{G1} Y*_{ir} - Σ_{G3} Y*_{ir}] / [Σ_{G1} X_i - Σ_{G3} X_i],

where G1, G2, and G3 are the three groups with n/3 observations in
each group. The grouping must be independent of the measurement and
sampling errors, and correlated with the independent variable U. If
more than three groups exist, say g, then the values 1, 2, ..., g could
be assigned to Z_i and β̂ computed using equation (3.3.1.4).
3.3.2 Two-Stage Least Squares Estimates

The two-stage least squares (TSLS) estimator was derived
independently by Theil (1961) and by Basmann (1957). The estimate
is computed in two steps, as follows:

1. First, regress X, the observed independent variable, on Z,
the instrumental variable, using the linear regression model:

    X_i = π0 + π1 Z_i + w_i,   E(w_i) = 0,  var(w_i) = σ_w².   (3.3.2.1)

2. Use the estimates of π0 and π1 computed in step 1 to
estimate the unobserved independent variable U_i. The
model of U_i given Z_i is

    E(U_i) = π0 + π1 Z_i                                       (3.3.2.2)

since X_i = U_i + ε_i, provided Z_i and ε_i are independent and
E(ε_i) = 0. Thus Û_i = π̂0 + π̂1 Z_i, and we now regress Y_i on Û_i.

In simple linear regression the TSLS estimator and the instrumental variable estimator are identical. Let

    Y_i = α + βU_i + η_i
    X_i = U_i + ε_i

with

    a) U, η, ε independent and E(η) = E(ε) = 0;
    b) cov(Z,X) = cov(Z,U), and cov(Z,η) = cov(Z,ε) = 0.

If X is regressed on Z, the estimates of π0 and π1 are

    π̂1 = Σ (Z_i - Z̄)(X_i - X̄) / Σ (Z_i - Z̄)²

and

    π̂0 = X̄ - π̂1 Z̄.

The estimate of β that results from regressing Y_i on Û_i using the
model

    Y_i = α + βÛ_i + η'_i

is

    β̂_TSLS = Σ (Û_i - Û̄)(Y_i - Ȳ) / Σ (Û_i - Û̄)².

In the case of logistic regression, the model in step 2 is:

    Y_i = 1/(1 + e^{-(α + βÛ_i)}) + η**_i.

The iterative weighted least squares estimator of β using the above
model is:

    θ̂_{r+1} = θ̂_r + (Û'W_r Û)⁻¹ Û'W_r v*_r

where Û is an n×2 matrix with [1  Û_i] as the ith row, W_r is the
diagonal weight matrix with {P̂_{Ûi,r} Q̂_{Ûi,r}} on the diagonal, and
v*_r is an n×1 vector with elements

    (Y_i - P̂_{Ûi,r}) / (P̂_{Ûi,r} Q̂_{Ûi,r}),   P̂_{Ûi,r} = 1/(1 + e^{-(α̂_r + β̂_r Û_i)}).

This estimator can be computed in practice by using SAS's PROC
LOGIST on the values of (Y_i, Û_i), i = 1, 2, ..., n. The resulting
estimator will be referred to as β̂_TSLS.
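The two stages can be sketched in Python; the IWLS loop stands in for SAS PROC LOGIST, and corr(U, Z) = 0.9, the sample size, and the seed are illustrative assumptions:

```python
import numpy as np

# Two-stage procedure: OLS of X on Z gives U_hat, then IWLS logistic
# regression of Y on U_hat.
rng = np.random.default_rng(6)
n = 5000
U = rng.normal(0.0, 1.0, n)
Z = 0.9 * U + np.sqrt(1 - 0.81) * rng.normal(0.0, 1.0, n)  # instrument
X = U + rng.normal(0.0, 1.0, n)                            # error variance 1
P = 1.0 / (1.0 + np.exp(-(1.0 + U)))                       # alpha = beta = 1
Y = (rng.uniform(size=n) < P).astype(float)

# Stage 1: U_hat = pi0_hat + pi1_hat * Z.
pi1 = np.cov(Z, X)[0, 1] / np.var(Z, ddof=1)
pi0 = X.mean() - pi1 * Z.mean()
U_hat = pi0 + pi1 * Z

# Stage 2: Newton-Raphson (IWLS) fit of Y on U_hat.
D = np.column_stack([np.ones(n), U_hat])
th = np.zeros(2)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-D @ th))
    W = p * (1.0 - p)
    th = th + np.linalg.solve(D.T @ (W[:, None] * D), D.T @ (Y - p))
print(th[1])  # beta_TSLS, close to 1 when corr(U, Z) is high
```

Unlike the iterative IV estimator, there is no outer iteration to converge here, which is part of why the text prefers β̂_TSLS in section 3.5.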
3.3.3 Another Two-Stage Estimator of β

In previous sections, X_i and π̂0 + π̂1 Z_i were used as estimates
of the unobserved independent variable U_i. In this section the use
of the Bayes, or James-Stein, estimate of U_i is discussed.

If X_i = U_i + ε_i is observed, where U_i ~ N(μ, σ_u²) and ε_i ~ N(0, σ_ε²),
then (X_i, U_i) is distributed as bivariate normal with

    var(X_i) = σ_u² + σ_ε²,   var(U_i) = σ_u²,

and

    cov(X_i, U_i) = σ_u²,

if E(ε_i) = 0 and U_i, ε_i are independent. Also, the conditional
distribution of U_i given X_i is N(μ + ρ(X_i - μ), ρσ_ε²), where
ρ = σ_u²/(σ_u² + σ_ε²).

The Bayes estimate of U_i is that function of the observations
that minimizes the conditional expectation of the loss function

    L = (U_i - f(X_i))².

E[(U_i - f(X_i))² | X_i] is minimized when f(X_i) = E(U_i|X_i), if a
minimum exists. Therefore

    Û_i = μ + ρ(X_i - μ).

In the linear case, using Û_i in place of U_i leaves the slope of
interest unchanged: β0 = β_u.

Proof: Let Y_i = α + β0 Û_i + η_i. Then

    β0 = cov(Y_i, μ + ρ(X_i - μ)) / var(μ + ρ(X_i - μ))
       = ρ cov(Y_i, X_i) / (ρ² var X_i)
       = (1/ρ) β_x
       = [(σ_u² + σ_ε²)/σ_u²] β_x = β_u.

Since β0 is equal to β_u, any consistent estimator of β0 is a consistent estimator of β_u, and it is well known that the OLS estimator
using the values (Y_i, Û_i) is consistent for β0. Whether a similar
relationship exists in the logistic case is investigated by means of
simulation in the next section.

In most situations the parameters μ and ρ are unknown. However,
the parameter μ can always be estimated as the mean of the observed
variable X_i. If σ_ε² is known, ρ can be estimated as

    ρ̂ = (S_x² - σ_ε²)/S_x²,

where

    S_x² = Σ (X_i - X̄)²/(n-1).

The estimator of U_i is then

    Û_i = X̄ + ρ̂(X_i - X̄),

the James-Stein (1966) estimator. Using Û_i as the independent variable, β̂_Bayes1 is then computed as the iterative weighted least
squares estimator.
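A sketch of β̂_Bayes1, assuming σ_ε² is known; the IWLS loop stands in for SAS PROC LOGIST, and the sample size and seed are arbitrary:

```python
import numpy as np

# beta_Bayes1: shrink X toward its mean by rho_hat = (S_x^2 - s2_e)/S_x^2,
# then fit the logistic model to the shrunken predictor by IWLS.
rng = np.random.default_rng(7)
n, s2_e = 5000, 1.0
U = rng.normal(0.0, 1.0, n)
X = U + rng.normal(0.0, np.sqrt(s2_e), n)
P = 1.0 / (1.0 + np.exp(-(1.0 + U)))            # alpha = beta = 1
Y = (rng.uniform(size=n) < P).astype(float)

s2_x = X.var(ddof=1)
rho_hat = (s2_x - s2_e) / s2_x
U_hat = X.mean() + rho_hat * (X - X.mean())     # James-Stein shrinkage

D = np.column_stack([np.ones(n), U_hat])
th = np.zeros(2)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-D @ th))
    W = p * (1.0 - p)
    th = th + np.linalg.solve(D.T @ (W[:, None] * D), D.T @ (Y - p))
print(th[1])  # beta_Bayes1, substantially less biased than the fit on X
```

Shrinking by ρ̂ rescales the predictor by the reliability ratio, which in the linear case exactly undoes the attenuation; the logistic case is checked by simulation in section 3.5.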
3.5 Comparison of Alternative Estimators

In this section the estimators described previously are compared using simulated values of the bias, standard deviation, and
mean square error. The observations for these simulations were
generated as described in section 2.3.1. In the true model, α and β
are both assigned the value 1. The independent variable, U, is
generated as a N(0,1) variable except in the case of the first
modified discriminant function estimate, where U is conditionally
normal. Both the error term, ε, and the instrumental variable, Z,
are generated as N(0,1) variables. In order to investigate the
effect of the correlation between U and Z on the grouping and
instrumental variable estimators, we generated three sets of Z
values for each sample. The sets differ only in the value of
corr(U,Z). The correlations are 0.2, 0.5, and 0.9. Two additional
sets of Z values were generated for simulation of β̂_TSLS, with
corr(U,Z) = 0.3 and 0.4.

The following estimators were computed for 150 samples of size
500:

1. Modified Discriminant Function Estimator, σ_ε² known:
   a) U conditionally normal with means μ1 = 2, μ0 = 1;
   b) U ~ N(0,1).

2. The Iterative Weighted Least Squares Estimator with

       Û_i = X̄ + [(S_x² - σ_ε²)/S_x²](X_i - X̄)

   as the independent variable, β̂_Bayes1.

3. The Grouping Estimator, β̂_GRP.

4. The Iterative Instrumental Variable Estimator, β̂_IV,

       β̂_{r+1} = β̂_r + [Σ (Z_i - Z̄)(Y*_{ir} - Ȳ*_r)] / [Σ (Z_i - Z̄)(X_i - X̄)],

   with
   a) Z ~ N(0,1);
   b) Z taking the values -1, 0, 1.

5. The Two-Stage Least Squares Estimator, β̂_TSLS, which is the
   Iterative Weighted Least Squares Estimator with

       Û_i = π̂0 + π̂1 Z_i

   as the independent variable.

Estimators β̂_GRP and β̂_IV were computed for values of corr(U,Z) = 0.2,
0.5, and 0.9, and β̂_TSLS was computed for values of corr(U,Z) = 0.2, 0.3,
0.4, 0.5, and 0.9. The estimated bias, standard deviation, and
mean square error were computed for each estimator. The results are
shown in Table 3.1.
The estimators proposed are useful under different assumptions
and depending upon the availability of additional information, such
as the variance of the measurement error (σ_ε²) or observations on an
instrumental variable. This makes comparisons between some of the
estimators unreasonable. However, one can make the following comparisons and conclusions.

1. When measurement error is present and the variance of that
error is known, the modified discriminant function estimator,
β̂_MD, has much smaller bias and MSE than β̂_IWLS, regardless
of whether U_i is conditionally normal or unconditionally
normal. In both cases the bias is very small, 0.025 and
-0.006, as compared to -0.538 for β̂_IWLS. When U is conditionally normal, β̂_MD has a smaller estimated standard deviation than β̂_MD when U is unconditionally normal.

2. When σ_ε² is known and one cannot assume conditional normality of U_i, then the use of the Bayes estimate of U_i as the
independent variable produces an estimator of β with
smaller MSE than β̂_MD. See lines 2 and 3 of Table 3.1.

3. The method of grouping estimator, β̂_GRP, the iterative instrumental variable estimator, β̂_IV, and the TSLS estimator,
β̂_TSLS, vary in quality relative to the value of the correlation between the instrumental variable (or grouping criterion) and the independent variable, U. Both the bias and
the standard deviation of the estimators increase as the
correlation, ρ_UZ, decreases.

4. The rate of convergence for β̂_IV is also a function of ρ_UZ.
For example, when ρ_UZ = 0.2, only 108 out of 150 samples
provided solutions within 30 iterations. In contrast, when
ρ_UZ = 0.9, all but two samples converged to a solution within
30 iterations. This problem is due to the large variance of
the estimator when the correlation is small. It is well
known that inefficient/unstable estimators have convergence
problems.

5. The bias in β̂_GRP, β̂_IV, and β̂_TSLS is smaller than the bias
in β̂_IWLS, although the standard deviation of these estimators is always larger than the standard deviation of
β̂_IWLS. For values of ρ_UZ > 0.2 the MSE's of these estimators
are smaller than the MSE of β̂_IWLS.

6. In terms of bias and MSE, there appears to be little
difference in the performances of β̂_GRP, β̂_IV, and β̂_TSLS.
However, β̂_TSLS is preferred over the other two because of
the convergence problem of β̂_IV and the necessary model
assumptions of β̂_GRP. Also, β̂_TSLS and its variance are
easy to compute using SAS's PROC LOGIST.
TABLE 3.1  Simulated Mean, Bias, Standard Deviation and Mean Square Error of the Estimators

    Estimator                           Mean     Bias     STD(β̂)   √MSE
    β̂_MD: U|Y ~ Normal                 1.025    0.025    0.166    0.168
    β̂_MD: U ~ Normal                   0.994   -0.006    0.203    0.203
    β̂_Bayes1                           0.936   -0.064    0.163    0.175
    β̂_GRP:  ρ_UZ = 0.2                 0.835   -0.165    0.691    0.710
            ρ_UZ = 0.5                 0.914   -0.086    0.260    0.274
            ρ_UZ = 0.9                 0.958   -0.042    0.154    0.160
    β̂_IV: Z ~ N(0,1)
            ρ_UZ = 0.2, n = 108        0.605   -0.395    0.371    0.542
            ρ_UZ = 0.5, n = 136        0.845   -0.155    0.209    0.260
            ρ_UZ = 0.9, n = 148        0.885   -0.115    0.158    0.195
    β̂_IV: Z = -1, 0, 1
            ρ_UZ = 0.2, n = 100        0.524   -0.476    0.380    0.609
            ρ_UZ = 0.5, n = 135        0.813   -0.187    0.200    0.274
            ρ_UZ = 0.9, n = 147        0.867   -0.133    0.152    0.202
    β̂_TSLS: ρ_UZ = 0.2                0.911   -0.089    0.911    0.915*
            ρ_UZ = 0.3                0.856   -0.144    0.398    0.423
            ρ_UZ = 0.4                0.857   -0.143    0.286    0.320
            ρ_UZ = 0.5                0.866   -0.134    0.230    0.266
            ρ_UZ = 0.9                0.956   -0.044    0.148    0.154
    β̂_IWLS (Walker-Duncan)            0.462   -0.538    0.079    0.544

*After removal of the extreme value, 9.39, the starred line becomes:
                                       0.855   -0.145    0.588    0.606
CHAPTER IV

USE OF MULTIPLE MEASUREMENTS IN THE ESTIMATION OF β

4.1 The Mean of m Independent Measurements as the Independent Variable

In previous sections it has been assumed that the measurement
error, ε_i, has a mean value of zero and is independent of the unobserved variable U_i, for all values of i. If, under these assumptions, m independent measurements are made on each of the n
independent sample units, with the observed variable being

    X_ij = U_i + ε_ij,   i = 1, 2, ..., n;  j = 1, 2, ..., m,

then

    E(X_ij|U_i) = U_i

and

    var(X_ij|U_i) = var(ε_ij|U_i) = σ_ε².

Therefore X̄_i· = Σ_j X_ij/m, which estimates E(X_ij|U_i), also estimates
U_i. Also, as the number of repeated measurements increases, X̄_i· converges to U_i. This suggests the use of X̄_i· in place of the unknown
independent variable U_i when estimating β.

In the case of simple linear regression, if

    X̄_i· = U_i + ε̄_i·

with variance σ_x̄² = σ_u² + σ_ε²/m is used as the independent variable
in the model, then

    β_x̄ = β_u σ_u²/(σ_u² + σ_ε²/m)                        (4.1.1)

and as m → ∞, β_x̄ converges to β_u. Therefore, if β̂_x̄ is a consistent
estimator of β_x̄, then as both n and m become large, β̂_x̄ will converge
to β_u in probability.
The relationship between β_x̄ and β_u shown in equation (4.1.1)
also exists in the logistic regression case when the independent variable U_i is conditionally normal. If X̄_i· is used to estimate U_i, β_x̄
takes the form

    β_x̄ = (μ1 - μ0)/(σ_u² + σ_ε²/m).

Again, as m → ∞, β_x̄ goes to β_u. This proof cannot be extended
directly to the logistic regression case in which the independent
variable U_i is not conditionally normal. However, if X̄_i· replaces
U_i in the logistic model as follows:

    P(Y_i=1|X̄_i·) = 1/(1 + e^{-(α + βX̄_i·)}) = P_i,

then the iterative weighted least squares (IWLS) estimator is

    θ̂_{r+1} = θ̂_r + (X̄'W_r X̄)⁻¹ X̄'W_r v*_r                (4.1.2)

where X̄ is an n×2 matrix with ith row (1, X̄_i·), W_r is a diagonal
weight matrix with elements {P̂_ir Q̂_ir}, and v*_r is an n×1 vector
with ith element {(Y_i - P̂_ir)/(P̂_ir Q̂_ir)}.

As m → ∞, X̄_i· → U_i and equation (4.1.2) becomes

    θ̂_{r+1} = θ̂_r + (U'W_r U)⁻¹ U'W_r v*_r.

The n×2 matrix U has ith row (1, U_i), W_r is diagonal with elements
{P_i Q_i}, v*_r is an n×1 vector with elements {(Y_i - P_i)/(P_i Q_i)}, and

    P_i = 1/(1 + e^{-(α + βU_i)}).

It is well known that β̂_u, the IWLS estimator without measurement error (i.e., U_i observable), converges to β as n → ∞. Therefore,
β̂_x̄ converges to β as both m → ∞ and n → ∞.
4.2 Further Reduction of the Bias When m is Small

Another advantage of having multiple measurements on each
sample unit is that one can now estimate the variance of the measurement error. Since var(X_ij|U_i) = var(ε_ij|U_i) = σ_ε², σ_ε² can be
estimated by pooling the estimated values of var(X_ij|U_i) as follows:

    S²_{X|U_i} = Σ_{j=1}^m (X_ij - X̄_i·)²/(m-1),   i = 1, 2, ..., n,

and

    S_ε² = Σ_{i=1}^n Σ_{j=1}^m (X_ij - X̄_i·)²/(n(m-1)).

An estimate of σ_x̄² can be computed as

    S_x̄² = Σ_{i=1}^n (X̄_i· - X̄··)²/(n-1).                  (4.2.1)

Now, from equation (4.1.1), the relationship between β_x̄ and β_u for
the linear regression case, and for the logistic case in which U_i is
conditionally normal, can be written as:

    β_u = p β_x̄,   where p = σ_x̄²/(σ_x̄² - σ_ε²/m).         (4.2.2)

Using β̂_x̄, S_ε², and S_x̄² as estimates of β_x̄, σ_ε², and σ_x̄²,
respectively, β_u can be estimated as:

    β̂_u = [S_x̄²/(S_x̄² - S_ε²/m)] β̂_x̄.                      (4.2.3)

Since S_x̄², S_ε², and β̂_x̄ are consistent estimates of σ_x̄², σ_ε², and β_x̄,
we conclude that β̂_u is consistent for β_u.
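The correction (4.2.3) can be sketched as follows; the IWLS loop stands in for SAS PROC LOGIST, and m = 2, the sample size, and the seed are illustrative assumptions:

```python
import numpy as np

# Estimate s2_e from m repeated measurements, fit IWLS to the means
# X_i., then rescale the slope by S_xbar^2 / (S_xbar^2 - S_e^2/m).
rng = np.random.default_rng(8)
n, m = 5000, 2
U = rng.normal(0.0, 1.0, n)
Xij = U[:, None] + rng.normal(0.0, 1.0, (n, m))   # m measurements per unit
Xbar = Xij.mean(axis=1)
P = 1.0 / (1.0 + np.exp(-(1.0 + U)))              # alpha = beta = 1
Y = (rng.uniform(size=n) < P).astype(float)

s2_e = ((Xij - Xbar[:, None]) ** 2).sum() / (n * (m - 1))
s2_xbar = Xbar.var(ddof=1)

D = np.column_stack([np.ones(n), Xbar])
th = np.zeros(2)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-D @ th))
    W = p * (1.0 - p)
    th = th + np.linalg.solve(D.T @ (W[:, None] * D), D.T @ (Y - p))

beta_u_hat = s2_xbar / (s2_xbar - s2_e / m) * th[1]
print(th[1], beta_u_hat)  # raw slope on the means, then the corrected estimate
```

Even with only two measurements per unit the rescaling removes most of the remaining attenuation, matching the pattern in Table 4.1.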
When the independent variable in the logistic regression model
is not conditionally normal, the simulation studies of Chapter II have
shown the relationship in equation (4.2.2) to be an approximate
relationship. That is,

    β_u ≈ p β_x̄,   p = σ_x̄²/(σ_x̄² - σ_ε²/m),

in the logistic case. Still, the estimator in equation (4.2.3) is
expected to be less biased than β̂_x̄ when m is small.
4.3 Use of the James-Stein Estimator of U_i as the Independent Variable

If more than one measurement is available for each sample unit,
one can compute the James-Stein estimator of U_i (discussed previously
in section 3.4) when both ρ and σ_ε² are unknown. In the case of m
measurements on each sampling unit, one observes

    X̄_i· = U_i + ε̄_i·

where (X̄_i·, U_i) is bivariate normal with E(X̄_i·) = E(U_i) = μ,
var(X̄_i·) = σ_u² + σ_ε²/m, var(U_i) = σ_u², and cov(X̄_i·, U_i) = σ_u².
Also,

    E(U_i|X̄_i·) = μ + ρ*(X̄_i· - μ),   ρ* = σ_u²/(σ_u² + σ_ε²/m).

The Bayes estimator of U_i is again that function of the observations
that minimizes the conditional expectation of the loss function
L = (U_i - f(X̄_i·))², which is

    Û_i = μ + ρ*(X̄_i· - μ).

Since μ and ρ* are unknown, we estimate them using

    μ̂ = X̄·· = Σ_i Σ_j X_ij/nm

and

    ρ̂* = (S_x̄² - S_ε²/m)/S_x̄²,

where S_x̄² and S_ε² are defined in equation (4.2.1). The estimator of
U_i is

    Û_i = X̄·· + ρ̂*(X̄_i· - X̄··).

Using Û_i above as the independent variable, β is estimated using
the IWLS estimator. The result is β̂_Bayes2.

In the case of linear regression, this estimator, β̂_Bayes2, is
identical to the estimator derived in section 4.2, p̂β̂_x̄.

Proof:

    β̂_Bayes2 = Σ (Û_i - Û̄)(Y_i - Ȳ) / Σ (Û_i - Û̄)²
             = ρ̂* Σ (X̄_i· - X̄··)(Y_i - Ȳ) / [ρ̂*² Σ (X̄_i· - X̄··)²]
             = (1/ρ̂*) β̂_x̄ = p̂ β̂_x̄,

since Û_i - Û̄ = ρ̂*(X̄_i· - X̄··) and p̂ = 1/ρ̂*.

Whether this is true in the case of logistic regression is investigated in section 4.5 by means of simulation.
4.4 Estimation of σ_ε² When m Measurements are Taken on a Subsample

It is not always economically feasible to take more than one
measurement on each sample unit. However, one can take m measurements on a subsample of k units and estimate σ_ε². Since U_i is fixed
for each unit and independent of ε_ij, an estimate of σ_ε² would be

    S_ε² = Σ_{i=1}^k Σ_{j=1}^m (X_ij - X̄_i·)²/(k(m-1)),

and β_u is then estimated as

    β̂_ss = [S_x²/(S_x² - S_ε²)] β̂_IWLS,

where S_x² and β̂_IWLS are based on a single measurement. For example,
for a sample unit with m measurements, use only the first in computing
S_x² and β̂_IWLS.
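A sketch of the subsample version; the IWLS loop stands in for SAS PROC LOGIST, and the choices k = 500, m = 2, the sample size, and the seed are illustrative assumptions:

```python
import numpy as np

# Section 4.4: s2_e is estimated from m = 2 measurements on a subsample
# of k units; the logistic fit itself uses a single measurement per unit.
rng = np.random.default_rng(9)
n, k, m = 5000, 500, 2
U = rng.normal(0.0, 1.0, n)
X = U + rng.normal(0.0, 1.0, n)                  # the single measurement
P = 1.0 / (1.0 + np.exp(-(1.0 + U)))             # alpha = beta = 1
Y = (rng.uniform(size=n) < P).astype(float)

# Repeat measurements on the first k units only.
Xrep = U[:k, None] + rng.normal(0.0, 1.0, (k, m))
s2_e = ((Xrep - Xrep.mean(axis=1, keepdims=True)) ** 2).sum() / (k * (m - 1))

D = np.column_stack([np.ones(n), X])
th = np.zeros(2)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-D @ th))
    W = p * (1.0 - p)
    th = th + np.linalg.solve(D.T @ (W[:, None] * D), D.T @ (Y - p))

beta_ss = X.var(ddof=1) / (X.var(ddof=1) - s2_e) * th[1]
print(beta_ss)  # corrected estimate, near beta = 1
```

Because S_ε² is estimated from only k units, the correction is noisier than when every unit is measured m times, which is reflected in the larger standard deviation of β̂_ss in Table 4.1.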
4.5 Results of Simulation Study
Simulated values of the following estimators were produced for
150 samples of size 500.
The IWLS estimator with Xi as the independentvariable. •
1.A
ex:
2.AApax: The estimator in (1) adjusted by the factor
S~A xP = 22
Sj( - Se:/m
3. Saayes 2: The IWLS estimator with
A _ [.S~ -S2/m.]U. =.~ + X E (i,..- ~), S~
x
64
as the independent variable.
The IWL5 estimator with Xi as the independent variable adjusted by the factor
52x
where 52 is based on mmeasurements made on a subsample of k units.
Different estimates of $\hat\beta_{\bar X}$, $\hat\rho\hat\beta_{\bar X}$, and $\hat\beta_{\text{Bayes 2}}$ were computed for values of m = 2, 4, 6, and 8, where m is the number of repeated measurements. The estimated mean, bias, and mean square error of the estimators are shown in Table 4.1.
As expected, the bias in the estimators decreases as m increases. Even when m is as small as 2, the reduction in the bias and MSE is large. The actual reduction in bias from $\hat\beta_{IWLS}$ to $\hat\beta_{\bar X}$ when m = 2 is 29.4%. Adjustment of the estimator $\hat\beta_{\bar X}$ by the factor $\hat\rho$ decreases the bias even more significantly: in the case of m = 2, there is an 87.9% reduction in the bias from $\hat\beta_{IWLS}$ to $\hat\rho\hat\beta_{\bar X}$. A similar reduction in bias occurs with the use of $\hat\beta_{SS}$ and $\hat\beta_{\text{Bayes 2}}$. As shown in the linear regression case, $\hat\rho\hat\beta_{\bar X}$ and $\hat\beta_{\text{Bayes 2}}$ are identical estimators. There is an increase in the standard deviation of these estimators over that of $\hat\beta_{IWLS}$; however, the $\sqrt{MSE}$ is largest for the IWLS estimator. Another point worth noting is that the use of $\hat\rho$ does not affect the results of the test of $H_0\colon \beta = 0$, since the t-statistic for this test is invariant under scale transformation. That is,

$$t = \frac{\hat\rho\,\hat\beta_{\bar X}}{\hat\rho\,\widehat{SE}(\hat\beta_{\bar X})} = \frac{\hat\beta_{\bar X}}{\widehat{SE}(\hat\beta_{\bar X})}.$$
TABLE 4.1 Simulated Values of the Mean, Standard Deviation, and Mean Square Error of the Estimators

Estimator                 m      Mean      Bias      STD     √MSE
β̂_X̄                       2      0.620    -0.380     0.092    0.391
                          4      0.776    -0.224     0.105    0.247
                          6      0.829    -0.171     0.122    0.210
                          8      0.874    -0.126     0.125    0.178
ρ̂β̂_X̄                      2      0.935    -0.065     0.146    0.160
                          4      0.969    -0.031     0.131    0.135
                          6      0.968    -0.032     0.143    0.147
                          8      0.984    -0.016     0.140    0.140
β̂_Bayes 2                 2      0.935    -0.065     0.146    0.160
                          4      0.969    -0.031     0.131    0.135
                          6      0.968    -0.032     0.143    0.147
                          8      0.984    -0.016     0.140    0.140
β̂_SS (k=50)               2      0.942    -0.058     0.247    0.256
β̂_IWLS                    1      0.462    -0.538     0.079    0.544
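The attenuation documented in Table 4.1 is easy to reproduce in miniature. The following sketch (ours, in Python, with an arbitrary seed and sample size; the dissertation's simulations were run in SAS) fits the simple logistic model by Newton-Raphson, which is equivalent to IWLS, first on the true covariate and then on its error-contaminated version:

```python
import math
import random

def fit_logistic(x, y, iters=25):
    """Newton-Raphson (equivalently IWLS) fit of P(Y=1) = 1/(1+exp(-(a+b*x))).
    Returns the fitted intercept a and slope b."""
    a = b = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = p * (1.0 - p)              # IWLS weight
            g0 += yi - p                   # score for the intercept
            g1 += (yi - p) * xi            # score for the slope
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det   # Newton step
        b += (h00 * g1 - h01 * g0) / det
    return a, b

random.seed(3)
n = 5000
U = [random.gauss(0, 1) for _ in range(n)]                      # true covariate
Y = [1 if random.random() < 1 / (1 + math.exp(-(1 + u))) else 0 for u in U]
X = [u + random.gauss(0, 1) for u in U]                         # measured with error
_, b_true = fit_logistic(U, Y)   # roughly recovers beta = 1
_, b_err  = fit_logistic(X, Y)   # attenuated towards zero
```

With a measurement-error variance equal to the covariate variance, the fitted slope on the noisy covariate falls well below the slope recovered from the true covariate, which is the bias pattern in the first rows of Table 4.1.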
CHAPTER V
APPLICATION OF METHODS TO THE LIPID RESEARCH CLINICS DATA
5.1 Description of the Data Set
In this chapter the methods discussed previously are applied to data collected as part of the Lipid Research Clinics (LRC) Program Prevalence Study. Detailed descriptions of the LRC Prevalence Study can be found elsewhere (Heiss, et al. (1980)). The data used in this demonstration consist of observations on 4171 white men and women, age 30 and above. After participation in Visit 1 of the Prevalence Study these individuals were randomly selected, as part of a 15% sample of Visit 1 participants, to participate in Visit 2. As a result, we have available two determinations of plasma cholesterol and triglyceride levels, one taken at each visit, along with other sociodemographic data. Other variables used in this study are HDL cholesterol level measured at Visit 2, and Quetelet index (weight/height²). The outcome variable for this demonstration is mortality status at an average of 5.7 years after the measurements were taken. Mortality status is modeled as a function of either total cholesterol or triglyceride level. Because of the influences of sex and age on mortality, separate logistic models were fit for each sex and, when possible, age was included in the model as an additional independent variable. Both males and females using hormones or having a missing value for that variable were eliminated from the demonstration because of the influence of sex hormone usage on lipid and lipoprotein levels (Heiss, et al. (1980)).
5.2 Computation of the Estimators and Their Variances
The following estimators were computed by model and data set, along with their standard deviations and p-values corresponding to the test of $H_0\colon \beta = 0$:
1. $\hat\beta_{IWLS}$ is the iterative weighted least squares estimate with a single measurement of either cholesterol or triglyceride as the independent variable. Both the estimate and its variance were computed using SAS's PROC LOGIST.
2. $\hat\beta_{MD}$ is the modified discriminant function estimate using an external estimate of the variance of the measurement error. The estimate is computed as described in Chapter III. A Taylor series approximation is used to derive a large-sample estimate of the variance of $\hat\beta_{MD}$. When there is more than one independent variable, the variance of the rth coefficient can be estimated as

$$\hat V(\hat\beta_{MD,r}) = s^{rr}\left(\frac{1}{n_1} + \frac{1}{n_0}\right),$$

where $s^{rr}$ is the rth diagonal element of the inverse covariance matrix.
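For a two-variable case this variance estimate is a one-liner; the covariance matrix and case counts below are hypothetical numbers of our own, used only to show the arithmetic:

```python
# Hypothetical 2x2 sample covariance matrix S of the independent variables,
# with n1 cases and n0 controls (all numbers are illustrative)
S = [[4.0, 1.0],
     [1.0, 9.0]]
n1, n0 = 200, 300

# Invert the 2x2 covariance matrix directly
det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
S_inv = [[ S[1][1] / det, -S[0][1] / det],
         [-S[1][0] / det,  S[0][0] / det]]

# Var-hat(beta_MD,r) = s^rr * (1/n1 + 1/n0), s^rr the r-th diagonal of S inverse
var_md = [S_inv[r][r] * (1 / n1 + 1 / n0) for r in range(2)]
```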
3. $\hat\beta_{\text{Bayes 1}}$ is the iterative weighted least squares estimate which uses

$$\hat U_i = \bar X + \left[\frac{S_X^2 - \hat\sigma_\varepsilon^2}{S_X^2}\right](X_i - \bar X)$$

as the first-stage estimate of $U_i$. Again, $\hat\sigma_\varepsilon^2$ is an external estimate of the variance of the measurement error. After computing the $\hat U_i$'s, both the estimate $\hat\beta_{\text{Bayes 1}}$ and its variance are computed using SAS's PROC LOGIST on the $\hat U_i$'s and the dependent variable.
4. $\hat\beta_{GRP}$ is the grouping estimate in which observations are grouped according to the magnitude of an instrumental variable. The estimate is computed as described in Chapter III. The large-sample approximation to its variance, obtained from a Taylor series approximation of the estimator, is

$$\hat V(\hat\beta_{GRP}) = \frac{\hat V(\hat\lambda_1) + \hat V(\hat\lambda_3)}{(\bar X_1 - \bar X_3)^2} + \left(\frac{\hat\lambda_1 - \hat\lambda_3}{\bar X_1 - \bar X_3}\right)^2\frac{\hat V(\bar X_1) + \hat V(\bar X_3)}{(\bar X_1 - \bar X_3)^2}.$$
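Chapter III's exact construction is outside this excerpt, so the sketch below (ours, in Python) shows only the general Wald-type grouping idea: sort by the instrument, compare the outer thirds, and divide the logit difference by the mean difference. The three-group split, the logit_mean helper, and all names are our own assumptions, not the dissertation's formulas.

```python
import math
import random

def grouping_estimate(x, y, z):
    """Wald-type grouping sketch: order units by the instrumental variable z,
    then divide the difference in logits of the outer groups' event rates
    by the difference in the groups' mean x."""
    order = sorted(range(len(z)), key=lambda i: z[i])
    k = len(order) // 3
    lo, hi = order[:k], order[-k:]

    def logit_mean(idx):
        p = sum(y[i] for i in idx) / len(idx)
        return math.log(p / (1 - p))

    xbar_lo = sum(x[i] for i in lo) / k
    xbar_hi = sum(x[i] for i in hi) / k
    return (logit_mean(hi) - logit_mean(lo)) / (xbar_hi - xbar_lo)

random.seed(5)
n, rho = 3000, 0.9
U = [random.gauss(0, 1) for _ in range(n)]
Z = [rho * u + (1 - rho**2) ** 0.5 * random.gauss(0, 1) for u in U]  # instrument
X = [u + random.gauss(0, 1) for u in U]                              # error-prone
Y = [1 if random.random() < 1 / (1 + math.exp(-(1 + u))) else 0 for u in U]
b_grp = grouping_estimate(X, Y, Z)
```

Because grouping uses only the instrument's ordering, the errors in X do not bias the group means, but the estimate is noisy unless the instrument is strongly correlated with the true covariate.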
5. $\hat\beta_{IV}$ is the iterative instrumental variable estimate. Its variance is computed for fixed values of the observed independent variable $X_i$ and for fixed values of the instrumental variable $Z_i$.
6. $\hat\beta_{TSLS}$ is the two-stage iterative weighted least squares estimate which uses $\hat U_i = \hat\pi_0 + \hat\pi_1 Z_i$ as the first-stage estimate of $U_i$. $Z_i$ is an instrumental variable, and $\hat\pi_0$ and $\hat\pi_1$ result from the linear regression of the observed independent variable X on Z. Both the estimate $\hat\beta_{TSLS}$ and its variance are computed using SAS's PROC LOGIST on the dependent variable and the $\hat U_i$'s.
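The first stage is an ordinary least squares regression of the error-prone X on the instrument Z. A self-contained sketch follows (ours, in Python; the seed and the 0.9 instrument correlation are arbitrary illustrative choices):

```python
import random

def first_stage_tsls(x, z):
    """OLS of the error-prone X on the instrument Z: U-hat_i = pi0 + pi1*z_i."""
    n = len(x)
    zbar, xbar = sum(z) / n, sum(x) / n
    pi1 = sum((zi - zbar) * (xi - xbar) for zi, xi in zip(z, x)) / \
          sum((zi - zbar) ** 2 for zi in z)
    pi0 = xbar - pi1 * zbar
    return [pi0 + pi1 * zi for zi in z]

random.seed(4)
U = [random.gauss(0, 1) for _ in range(1000)]
Z = [0.9 * u + (1 - 0.9**2) ** 0.5 * random.gauss(0, 1) for u in U]  # instrument
X = [u + random.gauss(0, 1) for u in U]                              # error-prone
U_hat = first_stage_tsls(X, Z)
```

Because Z is uncorrelated with the measurement error, the fitted values $\hat U_i$ track the true covariate about as well as the instrument itself does, which is why a weak instrument makes the second stage unstable.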
7. $\hat\beta_{\bar X}$ is the iterative weighted least squares estimate which uses the mean of the two measurements of cholesterol (or triglyceride) as an estimate of the true independent variable (PROC LOGIST).
8. $\hat\beta_{\text{Bayes 2}}$ is the iterative weighted least squares estimate which uses

$$\hat U_i = \bar{\bar X} + \left[\frac{S_X^2 - S_\varepsilon^2/2}{S_X^2}\right](\bar X_i - \bar{\bar X})$$

as the independent variable (PROC LOGIST).

The sample estimates of $\bar{\bar X}$, $S_X^2$, and $S_\varepsilon^2$ by model and by data set are shown in Table 5.1.
TABLE 5.1 Sample Estimates of $\bar{\bar X}$, $S_X^2$, and $S_\varepsilon^2$ by Model and Data Set

                            Males                         Females
                  Total Chol.   Triglyceride    Total Chol.   Triglyceride
X̄̄                     206.1         140.3           205.4         103.3
S_X²                 1258.4        9086.4          1622.8        4673.1
S_ε²                  256.2        3092.6           258.3         815.6
Not all of the estimators listed above were computed for all models. Estimators requiring an external estimate of the variance of the measurement error ($\hat\beta_{MD}$ and $\hat\beta_{\text{Bayes 1}}$) could only be computed for models in which total cholesterol was the independent variable. A relatively good estimate of the within-individual variability of total cholesterol was found in a paper by Ederer (1972). He observed, using data from the National Diet-Heart Study based on a sample of 1000 men, that in the absence of dietary change the between-visit correlation was about 0.85 when measurements of total cholesterol were taken one to two weeks apart. He also reports a within-visit between-individual standard deviation of 38 ($\sigma_X$) for
cholesterol. Given two measurements of cholesterol for each individual,

$$X_{i1} = U_i + \varepsilon_{i1}, \qquad X_{i2} = U_i + \varepsilon_{i2},$$

where $var(X_{i1}) = var(X_{i2}) = \sigma_X^2$, $U_i$ is the true value of cholesterol, and $\varepsilon_{i1}$ and $\varepsilon_{i2}$ are error terms, we have

$$corr(X_{i1}, X_{i2}) = \frac{\sigma_U^2}{\sigma_U^2 + \sigma_\varepsilon^2}.$$

Therefore, $\sigma_\varepsilon^2 = [1 - corr(X_{i1}, X_{i2})]\,\sigma_X^2$. Using Ederer's values for $corr(X_{i1}, X_{i2})$ and $\sigma_X^2$, we computed $\sigma_\varepsilon^2$ from the above equation to be 216.6. No estimate of reasonable quality could be found for the variance of the measurement error in triglyceride level.
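The arithmetic is immediate; a two-line check (in Python, with Ederer's two reported numbers):

```python
corr = 0.85        # Ederer's between-visit correlation for total cholesterol
sigma_x = 38.0     # within-visit between-individual standard deviation
sigma2_eps = (1 - corr) * sigma_x ** 2   # sigma_eps^2 = [1 - corr] * sigma_X^2
```

This reproduces 216.6, the value used for $\hat\sigma_\varepsilon^2$ in the demonstrations.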
Those estimators requiring an instrumental variable ($\hat\beta_{GRP}$, $\hat\beta_{IV}$, $\hat\beta_{TSLS}$) were demonstrated for models in which triglyceride level was the independent variable of interest. None of the variables observed in this study was adequately correlated with total cholesterol. By 'adequately' we mean having a correlation greater than 0.2. Our simulations in Chapter III showed that estimators requiring an instrumental variable perform poorly when the correlation between the instrumental variable and the independent variable measured with error is less than or equal to 0.2. Therefore, we did not attempt to demonstrate the use of these estimators with total cholesterol as the independent variable. Our data set contained two variables adequately correlated with triglyceride level: HDL cholesterol level at Visit 2, and Quetelet index (weight/height²). It was also observed that the correlation between log(HDL) and triglyceride was even higher than the correlation between HDL and triglyceride. Therefore, using HDL, log(HDL), and Quetelet index, estimators requiring instrumental variables were computed. Table 5.2 shows the correlations between triglyceride level and HDL, log(HDL), and Quetelet index. Correlations adjusted for age are also given.
TABLE 5.2 Correlations Between Triglyceride Level and Instrumental Variables by Data Set

Instrumental Variable        Males       Females
HDL                         -0.3134     -0.3265
  (Adj. for Age)            -0.3209     -0.3639
log(HDL)                    -0.3565     -0.3589
  (Adj. for Age)            -0.3628     -0.3917
Quetelet Index               0.2483      0.2557
  (Adj. for Age)             0.2496      0.2344
In using an instrumental variable to estimate model coefficients it is necessary to assume that the error in measuring the instrumental variable is independent of the error in measuring triglyceride level. This appears to be a reasonable assumption given the mechanisms for measuring HDL cholesterol and triglyceride levels. Obviously there is no relationship between errors in the measurement of triglyceride level and errors in measuring height or weight.
The grouping estimator, $\hat\beta_{GRP}$, and the instrumental variable estimator, $\hat\beta_{IV}$, are demonstrated only for the simple logistic case. Extensions of these estimators to the multiple logistic case were not attempted as part of this research.

Tables 5.3-5.6 show the estimators computed by model and data set along with their standard deviations and p-values corresponding to the test of $H_0\colon \beta = 0$.
5.3 Comparison of the Estimates
Comparisons of the estimates are made within model and within
data set.
1. In the simple logistic model with total cholesterol as the independent variable we observe that

a) The magnitude of the estimate increases from $\hat\beta_{IWLS}$ to $\hat\beta_{MD}$ to $\hat\beta_{\bar X}$. The increase in magnitude from $\hat\beta_{IWLS}$ to $\hat\beta_{\bar X}$ is about 58.3%. Also, $\hat\beta_{\text{Bayes 1}}$ and $\hat\beta_{\text{Bayes 2}}$ are larger in magnitude than $\hat\beta_{IWLS}$ and $\hat\beta_{\bar X}$, respectively. The largest increase in magnitude is from $\hat\beta_{IWLS}$ to $\hat\beta_{\text{Bayes 2}}$, a 76.4% increase. Although the model coefficient corresponding to total cholesterol is not significantly different from zero (α = 0.05) regardless of the estimator used, there is a decrease in the p-value from $\hat\beta_{IWLS}$ to $\hat\beta_{MD}$ to $\hat\beta_{\bar X}$.

b) For females, the largest increase in magnitude is from $\hat\beta_{IWLS}$ to $\hat\beta_{MD}$, a 41.9% increase. Other patterns mentioned for males in (a) also occur for females under this model. The model coefficient corresponding to total cholesterol is significantly different from zero using any of the estimators computed.
2. In the multiple logistic model with total cholesterol and age as the independent variables, we observe that

a) When age is added to the model the magnitude of the estimates decreases from that of the simple logistic case, except for $\hat\beta_{MD}$ (males).

b) In the male data set $\hat\beta_{\bar X}$ and $\hat\beta_{\text{Bayes 2}}$ perform poorly; they show a large decrease in magnitude. $\hat\beta_{MD}$ performs better than $\hat\beta_{IWLS}$ in terms of being larger in magnitude; the increase in magnitude is 352.5%. $\hat\beta_{\text{Bayes 1}}$ performs as expected.

c) Except for $\hat\beta_{MD}$, the patterns described in (1.a) are present among the estimators computed under this model using the female data set. The increase in magnitude from $\hat\beta_{IWLS}$ to $\hat\beta_{\text{Bayes 2}}$ is 20.7%.
3. In the simple logistic model with triglyceride as the independent variable we observe that

a) For males, there is an increase in magnitude from $\hat\beta_{IWLS}$ to $\hat\beta_{\bar X}$ to $\hat\beta_{\text{Bayes 2}}$, along with a corresponding decrease in p-value. The increase in magnitude from $\hat\beta_{IWLS}$ to $\hat\beta_{\text{Bayes 2}}$ is 45.5%.

b) For females, there is a smaller increase in magnitude (11.9%) from $\hat\beta_{IWLS}$ to $\hat\beta_{\text{Bayes 2}}$ than for males. Also, the p-value increases from $\hat\beta_{IWLS}$ to $\hat\beta_{\text{Bayes 2}}$.

c) For both males and females the model coefficient is significantly different from zero (α = 0.05) using $\hat\beta_{IWLS}$, $\hat\beta_{\bar X}$, or $\hat\beta_{\text{Bayes 2}}$.

d) For the female data set the $\hat\beta_{TSLS}$ estimates are larger in magnitude than $\hat\beta_{IWLS}$, but have increased p-values, although they remain significant. For the male data set, the $\hat\beta_{TSLS}$ estimates are larger in magnitude than the $\hat\beta_{IWLS}$ estimates except for the case in which HDL is the instrumental variable. None of the TSLS estimates suggests that the model coefficient for triglyceride is significant using the male data set.

e) The grouping and instrumental variable estimators perform poorly for both the male and female data sets. For $\hat\beta_{IV}$ (males), there is no gain in magnitude. There is a gain in magnitude from $\hat\beta_{IWLS}$ to $\hat\beta_{GRP}$ and from $\hat\beta_{IWLS}$ to $\hat\beta_{IV}$ in the female data set. These gains are offset by large increases in variance.
4. In the multiple logistic model with triglyceride level and age as the independent variables, we observe that

a) There is an increase in magnitude from $\hat\beta_{IWLS}$ to $\hat\beta_{TSLS}$ for both data sets, except when Quetelet index is used as the instrumental variable (male data set). There is also an increase in magnitude from $\hat\beta_{IWLS}$ to $\hat\beta_{\bar X}$ to $\hat\beta_{\text{Bayes 2}}$.

b) Except for $\hat\beta_{TSLS}$ (HDL, QI) for males, $\hat\beta_{\bar X}$ for females, and $\hat\beta_{\text{Bayes 2}}$ for females, the coefficient corresponding to triglyceride in the model is significantly different from zero (α = 0.05).
5.4 Conclusions

The simulation studies of the previous chapters showed that on the average $\hat\beta_{MD}$, $\hat\beta_{\text{Bayes 1}}$, $\hat\beta_{\bar X}$, and $\hat\beta_{\text{Bayes 2}}$ perform better than $\hat\beta_{IWLS}$. They also showed that if the correlation between the independent variable and an instrumental variable was high enough, $\hat\beta_{GRP}$, $\hat\beta_{IV}$, and $\hat\beta_{TSLS}$ would perform well. By 'perform well' we mean have reduced bias and MSE. Since the bias was shown in Chapter II to always be towards the null, we demonstrate the reduction in bias offered by the proposed alternative estimators as an increase in their magnitude over that of $\hat\beta_{IWLS}$. We also expected, based on simulation results, to demonstrate situations in which the p-value corresponding to $H_0\colon \beta = 0$ decreases when one uses the alternative estimators rather than $\hat\beta_{IWLS}$. We were able to demonstrate this in most cases for $\hat\beta_{MD}$, $\hat\beta_{\text{Bayes 1}}$, $\hat\beta_{\bar X}$, and $\hat\beta_{\text{Bayes 2}}$. However, when looking at the estimators requiring an instrumental variable, the p-value actually increased. This is probably a result of one or more of the following:
1. The correlation between triglyceride and the instrumental variables was too small, resulting in a large estimated variance.

2. The variance used for $\hat\beta_{IV}$ in this demonstration is conditional on the observed value of triglyceride and conditional on the instrumental variable. This may not be an appropriate estimate of the variance.
TABLE 5.3 Simple Logistic Model With Total Cholesterol Level at Visit 2 as the Independent Variable

Estimator               Males (n=2508)    Females (n=1663)
β̂_IWLS                     0.00254           0.01381
  Std. Error               0.00247           0.00277
  P-value                  (0.30)            (<0.0001)
β̂_MD                       0.00308           0.01959
  Std. Error               0.00275           0.00356
  P-value                  (0.26)            (<0.0001)
β̂_Bayes 1                  0.00301           0.01575
  Std. Error               0.00292           0.00316
  P-value                  (0.30)            (<0.0001)
β̂_X̄                        0.00402           0.01489
  Std. Error               0.00258           0.00284
  P-value                  (0.12)            (<0.0001)
β̂_Bayes 2                  0.00448           0.01617
  Std. Error               0.00287           0.00309
  P-value                  (0.12)            (<0.0001)
TABLE 5.4 Multiple Logistic Model With Total Cholesterol at Visit 2 and Age as the Independent Variables

Estimator               Males (n=2508)    Females (n=1663)
β̂_IWLS                    -0.00143           0.00783
  Std. Error               0.00272           0.00314
  P-value                  (0.60)            (0.01)
β̂_MD                      -0.00647           0.00630
  Std. Error               0.00282           0.00401
  P-value                  (0.02)            (0.12)
β̂_Bayes 2                 -0.00014           0.00945
  Std. Error               0.00316           0.00357
  P-value                  (0.97)            (0.01)
TABLE 5.5 Simple Logistic Model with Triglyceride Level at Visit 2 as the Independent Variable

Estimator               Males (n=2508)    Females (n=1663)
β̂_GRP, HDL                 0.00185           0.00901
  Std. Error               0.00488           0.00644
  P-value                  (0.70)            (0.16)
β̂_IV, HDL                 -0.00043           0.00601
  Std. Error               0.00288           0.00681
  P-value                  (0.88)            (0.38)
β̂_IV, LHDL                 0.00095           0.00601
  Std. Error               0.00254           0.00619
  P-value                  (0.71)            (0.33)
β̂_IV, QI                  -0.00114           0.00642
  Std. Error               0.00392           0.00959
  P-value                  (0.77)            (0.50)
TABLE 5.5 CONTINUED

Estimator               Males (n=2508)    Females (n=1663)
β̂_TSLS, HDL               -0.00114           0.01328
  Std. Error               0.00280           0.00658
  P-value                  (0.68)            (0.04)
β̂_TSLS, LHDL               0.00169           0.01170
  Std. Error               0.00249           0.00514
  P-value                  (0.50)            (0.02)
β̂_Bayes 2                  0.00243           0.00394
  Std. Error               0.00081           0.00128
  P-value                  (<0.0001)         (<0.0001)
TABLE 5.6 Multiple Logistic Model with Triglyceride Level at Visit 2 and Age as the Independent Variables

Estimator               Males (n=2508)    Females (n=1663)
β̂_IWLS                     0.00200           0.00207
  Std. Error               0.00062           0.00106
  P-value                  (<0.0001)         (0.05)
β̂_TSLS, HDL                0.00471           0.01622
  Std. Error               0.00290           0.00642
  P-value                  (0.10)            (0.01)
β̂_TSLS, LHDL               0.00618           0.01324
  Std. Error               0.00247           0.00498
  P-value                  (<0.0001)         (0.01)
β̂_TSLS, QI                 0.00094           0.01471
  Std. Error               0.00400           0.00684
  P-value                  (0.82)            (0.03)
β̂_X̄                        0.00254           0.00205
  Std. Error               0.00072           0.00112
  P-value                  (<0.0001)         (0.07)
β̂_Bayes 2                  0.00306           0.00225
  Std. Error               0.00087           0.00123
  P-value                  (<0.0001)         (0.07)
CHAPTER VI
SUMMARY AND SUGGESTIONS FOR FURTHER RESEARCH
In this work it has been determined that the iterative weighted least squares (IWLS) estimator of the model coefficients in logistic regression is not unbiased when measurement error is present in the independent variable. Several estimators have been proposed as alternatives to the IWLS estimator. Of the estimators discussed, the ones requiring an external estimate of the variance of the measurement error and the ones requiring multiple measurements of the independent variable performed best in terms of having smaller bias and MSE than the IWLS estimator. The estimators requiring an instrumental variable perform well only if the correlation between the instrumental variable and the independent variable is high. The application of the instrumental variable and grouping estimates also demonstrates that they are not likely to be very useful in practice.
The major focus of this work has been the simple logistic regression model, although some mention of bias in the multiple logistic regression model was made. A more complete investigation of the effect of measurement error on the estimation of coefficients in the multiple logistic regression model would be of value, especially when some, but not all, variables are measured without error. Most of the estimators discussed can also be used in the multiple case; however, further work is needed to extend the iterative instrumental variable estimator to the multiple case. It would also be worthwhile to develop an unconditional estimate of the variance of this estimator.
BIBLIOGRAPHY

Barnett, V.D. 1967. A Note on Linear Structural Relationships when Both Residual Variances are Known. Biometrika 54, 670-672.

Basmann, R.L. 1957. A Generalized Classical Method of Linear Estimation of Coefficients in a Structural Equation. Econometrica 25, 77-84.

Berkson, J. 1950. Are There Two Regressions? Journal of the American Statistical Association 45, 164-180.

Cochran, W.G. 1968. Errors of Measurement in Statistics. Technometrics 10, 637-666.

Durbin, J. 1954. Errors in Variables. Review of the International Statistical Institute 22, 23-32.

Ederer, F. 1972. Serum Cholesterol Changes: Effects of Diet and Regression Toward the Mean. Journal of Chronic Diseases 25, 277-289.

Gardner, M.J. and Heady, J.A. 1973. Some Effects of Within-Person Variability in Epidemiological Studies. Journal of Chronic Diseases 26, 781-795.

Halperin, M., Blackwelder, W.C., and Verter, J.I. 1970. Estimation of the Multivariate Logistic Risk Function: A Comparison of the Discriminant Function and Maximum Likelihood Approaches. Journal of Chronic Diseases 24, 125-158.

Heiss, G., Tamir, I., Davis, C.E., Tyroler, H.A., Rifkind, B.M., Schonfeld, G., Jacobs, D., and Frantz, I.D. 1980. Lipoprotein Cholesterol Distributions in Selected North American Populations: The Lipid Research Clinics Program Prevalence Study. Circulation 61, No. 2, 302-315.

Housner, G.W. and Brennan, J.F. 1948. The Estimation of Linear Trends. Annals of Mathematical Statistics 19, 380-388.

James, W. and Stein, C. 1961. Estimation with Quadratic Loss. Proceedings of the Fourth Berkeley Symposium 1, University of California Press, 361-379.

Kiefer, J. 1964. Review of Kendall and Stuart's Advanced Theory of Statistics, II. Annals of Mathematical Statistics 35, 1371-1380.

Kleijnen, J.P.C. 1974. Statistical Techniques in Simulation: Part 1. Marcel Dekker, Inc., New York.

Lindley, D.V. and Smith, A.F.M. 1972. Bayes Estimates for Linear Models. Journal of the Royal Statistical Society B, 34, 1-41.

Madansky, A. 1959. The Fitting of Straight Lines when Both Variables are Subject to Error. Journal of the American Statistical Association 54, 173-205.

Moran, P.A.P. 1971. Estimating Structural and Functional Relationships. Journal of Multivariate Analysis 1, 232-255.

Neyman, J. and Scott, E. 1951. On Certain Methods of Estimating the Linear Structural Relationship. Annals of Mathematical Statistics 22, 352-361. (Correction 23, 135)

Snedecor, G.W. and Cochran, W.G. 1967. Statistical Methods. 6th Edition, The Iowa State University Press, Ames, Iowa.

Theil, H. 1961. Econometric Forecasts and Policy. 2nd Edition, North-Holland Publishing Company, Amsterdam.

Truett, J., Cornfield, J., and Kannel, W. 1967. A Multivariate Analysis of the Risk of Coronary Heart Disease in Framingham. Journal of Chronic Diseases 20, 511-524.

Tukey, J.W. 1951. Components in Regression. Biometrics 7, 33-69.

Wald, A. 1940. The Fitting of Straight Lines if Both Variables are Subject to Error. Annals of Mathematical Statistics 11.

Walker, S.H. and Duncan, D.B. 1967. Estimation of the Probability of an Event as a Function of Several Independent Variables. Biometrika 54, 167-179.
PROGRAMS USED TO GENERATE SAMPLES USED IN SIMULATION STUDIES

A. THE FOLLOWING SAS STATEMENTS WERE USED TO GENERATE 150 SAMPLES OF 500 OBSERVATIONS EACH USED TO COMPUTE THE BIAS FOR VARIOUS MODELS AND DISTRIBUTIONS OF THE TRUE INDEPENDENT VARIABLE U. THE VARIABLES ARE

U = TRUE INDEPENDENT VARIABLE
LU = MEASUREMENT ERROR TERM
F = A UNIFORM(0,1) VARIABLE USED TO COMPUTE Y
X = OBSERVED INDEPENDENT VARIABLE
Y = OBSERVED DEPENDENT VARIABLE

1. U DISTRIBUTED AS A NORMAL(0,1) VARIABLE, ALPHA=1, BETA=1.

DATA START;
DO K=1 TO 150;
DO I=1 TO 500;
U=NORMAL(543217);
LU=NORMAL(134273);
F=UNIFORM(221137);
X=U + LU;
P=1/(1 + EXP(-1-U));
IF F<P THEN Y=1; ELSE Y=0;
OUTPUT; END; END;

2. U DISTRIBUTED AS A NORMAL(0,1) VARIABLE, ALPHA=1, BETA=-1.

DATA START;
DO K=1 TO 150;
DO I=1 TO 500;
U=NORMAL(756923);
LU=NORMAL(145269);
X=U + LU;
F=UNIFORM(952633);
P=1/(1 + EXP(-1+U));
IF F<P THEN Y=1; ELSE Y=0;
OUTPUT; END; END;
3. U IS CONDITIONALLY DISTRIBUTED AS A NORMAL(1,1) VARIABLE GIVEN Y=0 AND A NORMAL(2,1) VARIABLE GIVEN Y=1, ALPHA=-1.5, BETA=1, P(Y=1)=0.5.

DATA START;
DO K=1 TO 150;
DO I=1 TO 500;
IF 1<=I<=250 THEN U=NORMAL(879211) + 2;
ELSE U=NORMAL(879211) + 1;
LU=NORMAL(567887);
X=U + LU;
F=UNIFORM(567777);
P=1/(1 + EXP(1.5-U));
IF F<P THEN Y=1; ELSE Y=0;
OUTPUT; END; END;

4. U IS DISTRIBUTED AS AN EXPONENTIAL(1) VARIABLE, ALPHA=1, BETA=1.

DATA START;
DO K=1 TO 150;
DO I=1 TO 500;
R=UNIFORM(102081); U=-LOG(R);
LU=NORMAL(684911);
F=UNIFORM(111111);
X=U + LU;
P=1/(1 + EXP(-1-U));
IF F<P THEN Y=1; ELSE Y=0;
OUTPUT; END; END;

5. U IS CONDITIONALLY DISTRIBUTED AS AN EXPONENTIAL(1) VARIABLE GIVEN Y=0 AND AN EXPONENTIAL(2) VARIABLE GIVEN Y=1, ALPHA=-0.6932, BETA=0.5, P(Y=1)=0.5.

DATA START;
DO K=1 TO 150;
DO I=1 TO 500;
R=UNIFORM(786421);
IF 1<=I<=250 THEN U=-2*LOG(R); ELSE U=-LOG(R);
LU=NORMAL(567887);
X=U + LU;
F=UNIFORM(567777);
P=1/(1 + EXP(.6932 - .5*U));
IF F<P THEN Y=1; ELSE Y=0;
OUTPUT; END; END;
6. U1,U2 ARE BIVARIATE NORMALS WITH MEANS 0, VARIANCES 1 AND CORRELATION=0.25, ALPHA=1, BETA1=1, BETA2=1.

DATA START;
DO K=1 TO 150;
DO I=1 TO 500;
U1=NORMAL(929673);
C=NORMAL(782347);
U2=.25*U1 + .9683*C;
LU2=NORMAL(237819);
X1=U1;
X2=U2 + LU2;
F=UNIFORM(543221);
P=1/(1 + EXP(-1-U1-U2));
IF F<P THEN Y=1; ELSE Y=0;
OUTPUT; END; END;

7. U1,U2 ARE BIVARIATE NORMALS WITH MEANS 0, VARIANCES 1 AND CORRELATION=0.75, ALPHA=1, BETA1=1, BETA2=1.

DATA START;
DO K=1 TO 150;
DO I=1 TO 500;
U1=NORMAL(929673);
C=NORMAL(182347);
U2=.75*U1 + .6614*C;
LU2=NORMAL(237819);
X1=U1;
X2=U2 + LU2;
F=UNIFORM(543221);
P=1/(1 + EXP(-1-U1-U2));
IF F<P THEN Y=1; ELSE Y=0;
OUTPUT; END; END;

B. THE FOLLOWING SAS STATEMENTS WERE USED TO GENERATE 150 SAMPLES OF 500 OBSERVATIONS EACH USED TO COMPUTE THE VARIOUS ESTIMATORS OF BETA. THE VARIABLES ARE DESCRIBED IN A, WITH THE FOLLOWING ADDITIONS:

W = A NORMAL(0,1) VARIABLE USED TO COMPUTE Z
Z = THE INSTRUMENTAL VARIABLE
RHO = CORRELATION BETWEEN U AND Z
1. GENERATES SAMPLES USED TO COMPUTE THE INSTRUMENTAL VARIABLES ESTIMATOR, THE GROUPING ESTIMATOR, THE TWO-STAGE LEAST SQUARES ESTIMATOR AND THE MODIFIED DISCRIMINANT FUNCTION ESTIMATOR.

DATA START;
DO RHO=0.2,0.5,0.9;
DO K=1 TO 150;
DO I=1 TO 500;
U=NORMAL(543217);
W=NORMAL(139081);
Z=RHO*U + ((1-RHO**2)**0.5)*W;
LU=NORMAL(134273);
X=U + LU;
F=UNIFORM(221137);
P=1/(1 + EXP(-1-U));
IF F<P THEN Y=1; ELSE Y=0;
OUTPUT; END; END; END;

2. GENERATES SAMPLES USED TO COMPUTE ESTIMATORS REQUIRING MULTIPLE MEASUREMENTS OF THE INDEPENDENT VARIABLE.

DATA START;
DO K=1 TO 150;
DO I=1 TO 500;
U=NORMAL(543217);
LU1=NORMAL(134273);
LU2=NORMAL(410069);
LU3=NORMAL(546334);
LU4=NORMAL(755455);
LU5=NORMAL(374211);
LU6=NORMAL(198229);
LU7=NORMAL(765431);
LU8=NORMAL(194581);
X1=U + LU1;
X2=U + LU2;
X3=U + LU3;
X4=U + LU4;
X5=U + LU5;
X6=U + LU6;
X7=U + LU7;
X8=U + LU8;
XBAR=MEAN(OF X1-X8);
V=VAR(OF X1-X8);
OUTPUT; END; END;

3. GENERATES SAMPLES WITH TWO MEASUREMENTS TAKEN ON ONLY A SUBSAMPLE OF 50 OBSERVATIONS PER SAMPLE.

DATA START;
DO K=1 TO 150;
DO I=1 TO 500;
U=NORMAL(543217);
LU1=NORMAL(134273);
X1=U + LU1;
IF I<51 THEN DO;
LU2=NORMAL(410069);
X2=U + LU2;
V=VAR(OF X1-X2);
END;
F=UNIFORM(221137);
P=1/(1 + EXP(-1-U));
IF F<P THEN Y=1; ELSE Y=0;
OUTPUT; END; END;
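For readers without SAS, generation scheme A.1 above translates directly. The sketch below is ours, in Python; Python's random module seeds differently from SAS's NORMAL and UNIFORM functions, so the streams will not match the originals.

```python
import math
import random

random.seed(543217)  # seed value echoes the SAS program, but the stream differs
samples = []
for k in range(150):                 # 150 samples
    sample = []
    for i in range(500):             # of 500 observations each
        u = random.gauss(0, 1)               # true independent variable U
        x = u + random.gauss(0, 1)           # observed X = U + measurement error
        p = 1 / (1 + math.exp(-1 - u))       # logistic model with alpha = beta = 1
        y = 1 if random.random() < p else 0  # observed dependent variable Y
        sample.append((x, y))
    samples.append(sample)
```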