
Contributions to Methodology and Statistics, A. Ferligoj and A. Kramberger (Editors), Metodološki zvezki, 10, Ljubljana: FDV 1995

Some Properties of R² in Ordinary Least Squares Regression

Janez Stare¹

Abstract

A lot has been written about R² in order to discourage or at least warn against its use. Many of the papers address the use and definition of R² in generalized linear models, where its different interpretations lead to different statistics, but quite a lot has also been said about the drawbacks of R² in ordinary least squares regression. The authors usually question its use as a measure of goodness of fit, complain that R² cannot be 1 when there are replicated responses, or show that its value depends on the range of the independent variables, which makes it difficult to estimate its population value. Often modifications of R² or alternative statistics are proposed. All these arguments raise doubt about the adequacy of using R². We argue that this doubt is usually not justified and that there is no great danger in using R² in ordinary least squares regression.

1 Introduction

The coefficient of determination R² is widely used as a measure of the predictive power of linear regression models. The main reason for this is the interpretation of R² as the proportion of variation of the dependent variable explained or accounted for by the model. In other words, R² tells us how well our model explains the occurrence of different values of the outcome.

Although R² has some nice properties in OLSR, much has been written to disparage its use even in this setting. It is obvious that some problems arise from a different understanding of the concept of goodness-of-fit. The distinction between the concepts of explained variation and goodness-of-fit is also not always clear, which adds to the differences in views about the usefulness of R². The paper by Korn and Simon (1991) makes a great contribution to the clarification of this misunderstanding.

¹ Institute for Biomedical Informatics, Medical Faculty, P.O. Box 18, 61105 Ljubljana, Slovenia


They demonstrate that explained variation measures both the explained risk and the goodness-of-fit of a model. By goodness-of-fit they understand the consistency of the model with the data, while the explained risk is a measure of how much better predictions are when one uses covariates compared to not using them.

2 Definition and interpretation of R² in ordinary least squares regression

We review here some well-known results without giving any proofs. Let $x_1, x_2, \ldots, x_p$ and $y$ denote $p+1$ variables. For any given set of values, $x_{10}, x_{20}, \ldots, x_{p0}$ say, we may be interested in the conditional expectation (mean) of $y$, denoted by $E(y \mid x_{10}, \ldots, x_{p0})$. When such a conditional expectation is defined for all $x$-values, the function $E(y \mid x_1, \ldots, x_p)$ is called the regression curve of $y$ on $x_1, x_2, \ldots, x_p$. When this function takes the form

$$E(y \mid x_1, \ldots, x_p) = \alpha + \sum_{i=1}^{p} \beta_i x_i,$$

we talk about linear regression. The conditional variance $\mathrm{var}(y \mid x_1, \ldots, x_p)$ shall be denoted by $\sigma^2_{y \cdot 1 \ldots p}$. The population $\mathbf{R}^2$ is defined by²

$$1 - \mathbf{R}^2 = \frac{\sigma^2_{y \cdot 1 \ldots p}}{\sigma^2_y}.$$

The positive square root of $\mathbf{R}^2$ is called the multiple correlation coefficient and is denoted by $R$. When we want to emphasize that we are talking about the correlation between $y$ and $x_1, x_2, \ldots, x_p$, we write $R_{y(1 \ldots p)}$. It can be shown that $R$ is indeed a correlation coefficient (see for example Stuart and Ord, 1991).

Most often we are not interested in the population $\mathbf{R}^2$ but in its sample analogue, denoted by $R^2$. It can be defined in the same way, replacing population variances with sample estimates.

² We follow the notation used by Stuart and Ord (1991). Thus, the bold-faced $\mathbf{R}^2$ denotes the population value while $R^2$ stands for the sample estimate.


However, another approach to its definition will be given, an approach that is much more common and more natural (see for example Draper and Smith, 1981).

Let $y_i$, $i = 1, \ldots, n$, denote the observed values of the dependent variable, $\bar{y}$ their mean, and $\hat{y}_i$ the predicted values. The situation is depicted in Figure 1.

[Figure 1: the deviation of an observed value from the mean, $y_i - \bar{y}$, split into the parts $y_i - \hat{y}_i$ and $\hat{y}_i - \bar{y}$.]

We can write

$$(y_i - \bar{y}) = (y_i - \hat{y}_i) + (\hat{y}_i - \bar{y}).$$

Squaring both sides and summing over i gives

$$\sum_i (y_i - \bar{y})^2 = \sum_i (y_i - \hat{y}_i)^2 + \sum_i (\hat{y}_i - \bar{y})^2, \qquad (2.1)$$

where the cross-product term is omitted since it is equal to 0. The left-hand side of equation (2.1) is the sum of squares of deviations of the observed values from the mean, usually called 'SS about the mean'. On the right-hand side of this equation we have the sum of squares of deviations of observed values from predicted values, $(y_i - \hat{y}_i)$, and the sum of squares of deviations of predicted values from the mean, $(\hat{y}_i - \bar{y})$. The quantities $y_i - \hat{y}_i$ are usually called residuals. Equation (2.1) can then be expressed as

SS about the mean = SS about regression + SS due to regression.

We see that SS about the mean is divided into two parts. One would expect from a good regression model that the residuals be small, that is, that SS due to regression be much greater than SS about regression.


In other words, one would like to have the ratio (SS due to regression / SS about the mean) as close to 1 as possible. Since SS about the mean can be regarded as the total variability observed in the dependent variable and SS due to regression as the amount of this variability that is explained, the ratio

$$R^2 = \frac{\text{SS due to regression}}{\text{SS about the mean}} = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2} \qquad (2.2)$$

represents the proportion of the total variability in the dependent variable that is explained by the independent variables. $R^2$ is often called the coefficient of determination. It can be shown that $R^2$ is the square of the sample correlation coefficient between $y$ and the best fitting linear combination of $x_1, x_2, \ldots, x_p$.

Another definition of $R^2$ is often given:

$$R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}. \qquad (2.3)$$

In ordinary least squares regression (OLSR), the two definitions are equivalent because of property (2.1). Kvalseth (1985) lists other definitions and discusses their properties in nonlinear regression. He also gives a list of general properties that R² should possess. Based on this list, he decides on definition (2.3) as being the most useful for more general models.
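The equivalence is easy to verify numerically. The following sketch (Python with numpy; the data-generating model and all parameter values are illustrative assumptions, not taken from the paper) fits an OLS line and computes R² via both (2.2) and (2.3):

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative data: y = 2 + 0.5*x + noise (values are arbitrary).
    n = 100
    x = rng.uniform(0, 10, n)
    y = 2 + 0.5 * x + rng.normal(0, 1, n)

    # OLS fit with an intercept, via least squares.
    X = np.column_stack([np.ones(n), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    y_hat = X @ beta

    ss_total = np.sum((y - y.mean()) ** 2)      # SS about the mean
    ss_resid = np.sum((y - y_hat) ** 2)         # SS about regression
    ss_model = np.sum((y_hat - y.mean()) ** 2)  # SS due to regression

    r2_def_22 = ss_model / ss_total             # definition (2.2)
    r2_def_23 = 1 - ss_resid / ss_total         # definition (2.3)
    print(r2_def_22, r2_def_23)                 # identical under OLS

Under OLS with an intercept the two printed values agree to machine precision; for fits that are not least squares, or models without an intercept, they generally differ, which is the situation Kvalseth (1985) addresses.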

What is so appealing about R²? It seems that the following properties make it so useful:

- it has an intuitively clear interpretation,
- it is a number that can be easily calculated when the model is fitted,
- it is invariant to units of measurement,
- it lies between 0 and 1,
- it becomes larger when the model 'fits better'.

It should be noted that the value of R² does not depend only on the distances between predicted and observed values but also on the variation of the outcome variable. So anything that influences this variation also influences the value of R². This is evident from definition (2.3).

The interpretation and use of R² has been extensively discussed in the literature (some examples are in Crocker, 1972; Barrett, 1974; Draper and Smith, 1981; Ranney and Thigpen, 1981; Healy, 1984; Kvalseth, 1985; Helland, 1987;


Willett and Singer, 1988; Nagelkerke, 1991; Scott and Wild, 1991). We return to some of these later. While the distribution of R² and the inference about its population value are not the issues of main interest here, two points must be stated:

- with increasing $n$, $R^2$ tends to

$$\frac{\beta' S_x \beta}{\beta' S_x \beta + \sigma^2_{y \cdot 1 \ldots p}}, \qquad (2.4)$$

where $S_x$ is the sample covariance matrix of the independent variables (Helland, 1987). Thus, the value of R² depends on the variation among the independent variables;

- under the null hypothesis $\mathbf{R}^2 = 0$, the expected value of $R^2$ can be shown to be

$$E(R^2) = \frac{p}{n-1},$$

where $p$ is the number of independent variables and $n$ is the sample size. This means that we can expect values of the sample $R^2$ greater than 0 even if its population value is 0. This property of $R^2$ is another reason for requiring large $n/p$ ratios in regression analysis.

Because of this second property, a modified R², called the adjusted R², is sometimes used. However, apart from being smaller than the usual statistic, there is no other reason for its use.

The following points have been raised against R²:

- it cannot be 1 when there are replicated responses,
- it diminishes with replicated responses,
- it depends on the slope of the regression plane,
- it depends on the range of the independent variables.

3 Analysis of some properties of R² in ordinary least squares regression

It was stated in the introduction that criticism has often been addressed to the use of R² even in ordinary regression. We will show here that this criticism is mostly unjustified.


The points raised against R² will be discussed using examples from bivariate linear regression, which readily generalize to the multivariate case.

3.1 The slope

Figures 3.1.a and 3.1.b illustrate the dependence of R² on the slope of the line. The two lines have different slopes, but the points are at the same distance from the line. We see that there is a big difference in R², which is clearly due to the greater variation of the dependent variable in the second case. Some authors (Barrett, 1974) argue that a measure of goodness-of-fit should be the same in both cases, since the precision of prediction is the same. This is true if we measure this precision just by the distances of the points from the curve; in this sense even a model with R² = 0 could have the same precision. If we take into account the relative gain from going from the null model to the fitted model, then this gain is much greater in the second case, and this is reflected in R². The second model is clearly more capable of distinguishing the differing outcomes, and this is an important feature in judging the quality of the model. Alternatively, following Korn and Simon (1991), we can say that changing the slope of the line changes the explained risk while the goodness-of-fit is not affected.

[Figure 3.1.a: scatter plot of y against x with the fitted line; b = 0.1, R² = 0.1.]


[Figure 3.1.b: scatter plot of y against x with the fitted line; b = 1.0, R² = 0.88.]
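The situation in Figures 3.1.a and 3.1.b can be imitated in a few lines of code. In the sketch below the residual standard deviation around the line is the same for both slopes; the slopes, noise level, and x-range are assumptions chosen to roughly match the figures:

    import numpy as np

    rng = np.random.default_rng(2)

    def r2_for_slope(b, n=100, noise_sd=5.0):
        """R-squared of an OLS fit when the true model is y = b*x + e."""
        x = rng.uniform(0, 50, n)
        y = b * x + rng.normal(0, noise_sd, n)
        X = np.column_stack([np.ones(n), x])
        y_hat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
        return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

    # Same scatter about the line, different slopes:
    print(r2_for_slope(0.1))  # small R2, roughly as in Figure 3.1.a
    print(r2_for_slope(1.0))  # large R2, roughly as in Figure 3.1.b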

3.2 R² cannot be 1 when there are replicated responses

It is of course clear that if we have different outcomes for the same values of the independent variable, there will always remain some unexplained variation. This has been a motive for various attempts to modify R² for such cases in order to have a statistic that has 1 as its upper limit. All these attempts have their drawbacks, which we will not discuss here (see for example Chang and Afifi, 1987). Figures 3.2.a and 3.2.b show two models for which these modified measures would be 1 but R² is different. The reason is that R² takes into account the spread of the outcomes around the line, while the modified measures do not.

The possibility of R² reaching 1 actually does not depend on the replicated responses. If the true R² is 1, then the replicated responses are all equal, while if the true R² is not 1, then not having replicated responses does not change anything. If we are sampling from data that follow a linear model with normally distributed error, the probability of obtaining points that lie exactly on a straight line is the same as that of getting equal responses with repeated measurements: in both cases we would have to observe exactly the means of the underlying normal distributions. Modified statistics can only be looked at as statistics that give additional information, not as a replacement for R².


It is confusing to use one statistic when there are replicated responses and another when there are none.

[Figure 3.2: behaviour of R² with replicated responses. Figure 3.2.a: sd = 5, R² = 0.89; Figure 3.2.b: sd = 2, R² = 0.98.]
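The upper limit that the modified measures aim at is itself easy to compute: no model that assigns a single predicted value to each x can fit more closely than the group means of the replicates, so R² is bounded above by 1 − (pure-error SS)/(SS about the mean). A minimal sketch of this bound follows; the data-generating model is an assumption for illustration only:

    import numpy as np

    rng = np.random.default_rng(3)

    # Illustrative data with replicated responses (values are arbitrary).
    x = np.repeat(np.arange(1, 11), 5)  # 10 x-values, 5 replicates each
    y = 3.0 * x + rng.normal(0, 2, x.size)

    ss_total = np.sum((y - y.mean()) ** 2)

    # Pure-error SS: scatter of the replicates around their group means.
    group_means = {v: y[x == v].mean() for v in np.unique(x)}
    ss_pure = sum(np.sum((y[x == v] - m) ** 2) for v, m in group_means.items())

    # No fitted curve can beat the group means, so R2 is bounded above by:
    print(1 - ss_pure / ss_total)

Figures 3.2.a and 3.2.b correspond to models whose fitted values hit these group means exactly, so the modified measures reach 1 there while R² still reflects the residual scatter.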


3.3 The diminishing of R² with replicated responses

This is actually not true, and we could stop discussing this point here. But as there seems to be some confusion about it, it is probably worth saying a few words on the topic. We do not know if the confusion existed before Draper's paper in 1984, but it was certainly amplified after that. The paper was, for example, cited in the same journal in the same year by Healy (1984), who took the assertions from Draper's paper for granted. The fact that Draper actually corrected his mistake in 1985 seems to have been overlooked even after many years (see for example Scott and Wild, 1991).

The non-dependence of R² on replicated responses can be shown in different ways:

• first, it can be seen from formula (2.4) for the asymptotic value of R² that, with a given range of the independent variables, the value of R² depends only on the variance around the fitted line;

• second, the example illustrated in Table 3.1 provides empirical evidence that Draper's argument was wrong (a simulation sketch reproducing this experiment is given at the end of this section). The letter k in the table denotes the number of different values of x, and N stands for the number of cases. The independent variable x was randomly assigned integer values between 0 and 50. The dependent variable y was then generated as

$$y = x + 10 + e,$$

where the error term e was taken to follow the normal distribution N(0,5). It can be seen that even with N − k increasing dramatically, R² does not change apart from random variation;

Table 3.1: The dependence of R² on replicated responses.

          N=100    N=200    N=1200   N=5000   N=10000
          k=43     k=50     k=50     k=50     k=50
    R²    .8851    .8796    .8860    .8933    .8913

• third, a simple algebraic consideration shows that replicated responses have no effect on R². Suppose we estimated R² from two different samples in such a way that x takes the same values in both samples. For the sake of argument, assume that the estimates are equal. Let us denote the corresponding SS due to regression by $SR_i$ and SS about the mean by $SM_i$ ($i = 1, 2$). Then we have


$$R^2 = \frac{SR_1}{SM_1} = \frac{SR_2}{SM_2}.$$

It follows that

$$SR_1 \cdot SM_2 = SR_2 \cdot SM_1. \qquad (3.1)$$

If we take both samples together, we get the following estimate for $R^2$:

$$R^2 = \frac{SR_1 + SR_2}{SM_1 + SM_2}.$$

This is clearly the estimate that we get after adding replicated responses to the first sample. When is this estimate equal to the estimates from the separate samples? For this to be true we should have

$$\frac{SR_1 + SR_2}{SM_1 + SM_2} = \frac{SR_1}{SM_1},$$

and from this

$$SR_1 \cdot SM_1 + SR_2 \cdot SM_1 = SR_1 \cdot SM_1 + SR_1 \cdot SM_2.$$

After cancelling $SR_1 \cdot SM_1$ we get

$$SR_1 \cdot SM_2 = SR_2 \cdot SM_1,$$

which is always true, as we know from (3.1).
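Finally, here is the simulation sketch promised in the second point above, reproducing the experiment behind Table 3.1. Two reading assumptions are made: N(0,5) is taken to mean a standard deviation of 5, and x is drawn uniformly from the integers 0 to 50; both are consistent with the reported R² values of about 0.88-0.89:

    import numpy as np

    rng = np.random.default_rng(4)

    def r2_with_replicates(n):
        """R-squared for y = x + 10 + e with x an integer in 0..50."""
        x = rng.integers(0, 51, n).astype(float)
        y = x + 10 + rng.normal(0, 5, n)   # N(0,5) read as sd = 5
        X = np.column_stack([np.ones(n), x])
        y_hat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
        return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

    # As N grows, the number of replicates per x-value grows with it,
    # yet R2 only fluctuates randomly around the same value.
    for n in (100, 200, 1200, 5000, 10000):
        print(n, round(r2_with_replicates(n), 4))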

3.4 The dependence of R² on the range of the independent variables

This is exemplified in Figure 3.3. The first panel represents a random sample from a population for which the true model is

$$y = 5 + 1.25x + e,$$

with e distributed as N(0,9). In the second panel, values of x between 4 and 17 are not used, and in the third panel the values outside the interval (4,17) are not used. The effect on R² is large. The reason for this is obvious: such sampling affects the variance of the outcome.


Now it is easier to accept the increase in R² from Figure 3.3.c to Figure 3.3.a: in Figure 3.3.a we have more evidence to support the fitted line, and our prediction is relatively better, since the points at the edges of the range are furthest from the mean and contribute more to the variance of the outcome than the points in the middle.

But it is harder to accept that the reduced evidence in the second panel gives a bigger R². Besides the fact that only the more distant points were used, one should also stress that drawing a straight line through the whole range of the independent variable is then actually an extrapolation of the regression curve, which is bad statistical practice. Of course, if there is good reason to believe that the model actually holds over the whole range, this property of R² is annoying, and one should be careful in choosing the values of the independent variables if one can do so.

If we want to estimate the real population value of R², there is no alternative to taking random samples (Helland, 1987). The dependence of R² on the distribution of the covariates is also a property which often makes this statistic inappropriate for comparing models built on different data sets. Unless we are sure that the two samples represent the same population, comparison of models based on R² is not suitable. (A small simulation sketch reproducing the logic of Figure 3.3 is given after the figure.)


[Figure 3.3: dependence of R² on the range of x. Figure 3.3.a: full range, R² = 0.55; Figure 3.3.b: values of x between 4 and 17 removed, R² = 0.72; Figure 3.3.c: only values of x between 4 and 17 used, R² = 0.22.]
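The sketch below reproduces the logic of Figure 3.3. The range of x (uniform on −5 to 25) and the reading of N(0,9) as a standard deviation of 9 are assumptions inferred from the plotted axes and the reported R² values, which they match approximately:

    import numpy as np

    rng = np.random.default_rng(5)

    def r2(x):
        """R-squared of an OLS fit when the true model is y = 5 + 1.25x + e."""
        y = 5 + 1.25 * x + rng.normal(0, 9, x.size)  # N(0,9) read as sd = 9
        X = np.column_stack([np.ones(x.size), x])
        y_hat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
        return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

    x = rng.uniform(-5, 25, 300)                  # assumed range of x
    print(r2(x))                                  # full range, cf. Figure 3.3.a
    print(r2(x[(x < 4) | (x > 17)]))              # middle removed, cf. Figure 3.3.b
    print(r2(x[(x >= 4) & (x <= 17)]))            # only the middle, cf. Figure 3.3.c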


References

[1] Barrett, J.P. (1974): The coefficient of determination - some limitations. The American Statistician, 28, 19-20.

[2] Chang, P.C. and Afifi, A.A. (1987): Goodness-of-fit statistics for general linear regression equations in the presence of replicated responses. The American Statistician, 41, 195-199.

[3] Crocker, D.C. (1972): Some interpretations of the multiple correlation coefficient. The American Statistician, 26, 31-33.

[4] Draper, N.R. (1984): The Box-Wetz criterion versus R². J. R. Statist. Soc. A, 147, Part 1, 101-103.

[5] Draper, N.R. (1985): Corrections: The Box-Wetz criterion versus R². J. R. Statist. Soc. A, 148, Part 4, 357.

[6] Draper, N.R. and Smith, H. (1981): Applied Regression Analysis, 2nd ed. New York: Wiley.

[7] Healy, M.J.R. (1984): The use of R² as a measure of goodness of fit. J. R. Statist. Soc. A, 147, Part 4, 608-609.

[8] Helland, I.S. (1987): On the interpretation and use of R² in regression analysis. Biometrics, 43, 61-69.

[9] Korn, E.L. and Simon, R. (1991): Explained residual variation, explained risk, and goodness of fit. The American Statistician, 45, 201-206.

[10] Kvalseth, T.O. (1985): Cautionary note about R². The American Statistician, 39, 279-285.

[11] Nagelkerke, N.J.D. (1991): A note on a general definition of the coefficient of determination. Biometrika, 78, 691-692.

[12] Ranney, G.B. and Thigpen, C.C. (1981): The sample coefficient of determination in simple linear regression. The American Statistician, 35, 152-153.

[13] Scott, A. and Wild, C. (1991): Transformations and R². The American Statistician, 45, 127-129.

[14] Stuart, A. and Ord, J.K. (1991): Kendall's Advanced Theory of Statistics, 5th ed., Vol. 2: Classical Inference and Relationship. London: Edward Arnold.

[15] Willett, J.B. and Singer, J.D. (1988): Another cautionary note about R²: its use in weighted least-squares regression analysis. The American Statistician, 42, 236-238.