DISCUSSION: "Some Statistical Observations," by Geoff Kite


WATER RESOURCES BULLETIN
VOL. 26, NO. 4    AMERICAN WATER RESOURCES ASSOCIATION    AUGUST 1990

DISCUSSION¹

"Some Statistical Observations," by Geoff Kite²

W. Kirby, D. Helsel, and E. J. Gilroy³

We certainly agree with the author that statistical methods commonly are misused in the water resources community, and that better training of individuals and better refereeing of papers are needed. However, the author's statements about regression assumptions and spurious correlation themselves are not entirely correct. He and many others seem to have misunderstood the statements about spurious correlation made by Benson (1965).

The author states that regressions between variables containing common factors violate "the basic assumption of independence needed for linear regression" (p. 485). The basic assumption of linear regression is that the mean value of the dependent variable does depend on the values of the independent variables. The assumption of independence is that the departures from the means (errors) of the dependent variable are uncorrelated among the several observations; that is, that there are no serial or cross correlations among the errors. Although this assumption can be relaxed (Stedinger and Tasker, 1985), it nonetheless relates to lack of correlation among the separate observations of the dependent variable, rather than to any relationship between the dependent and independent variables.
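The distinction can be illustrated with a small simulation. The following is only a sketch on synthetic data: the sample size, the regression coefficients, the AR(1) coefficient of 0.8, and the helper names `ols_residuals` and `lag1_autocorr` are all assumptions made for illustration. In both cases the dependent variable depends strongly on the explanatory variable; only the second case violates the independence assumption, because its errors are serially correlated.

```python
import random

def ols_residuals(x, y):
    """Residuals from a simple least-squares line y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

def lag1_autocorr(e):
    """Lag-1 serial correlation of a residual series."""
    m = sum(e) / len(e)
    num = sum((e[i] - m) * (e[i - 1] - m) for i in range(1, len(e)))
    den = sum((ei - m) ** 2 for ei in e)
    return num / den

rng = random.Random(42)  # fixed seed so the sketch is repeatable
n = 500
x = [rng.uniform(0.0, 10.0) for _ in range(n)]

# Case 1: independent errors (the assumption holds).
e_iid = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Case 2: serially correlated AR(1) errors (the assumption is violated).
e_ar = [rng.gauss(0.0, 1.0)]
for _ in range(n - 1):
    e_ar.append(0.8 * e_ar[-1] + rng.gauss(0.0, 1.0))

# The mean of y depends on x in exactly the same way in both cases.
y_iid = [1.0 + 0.5 * xi + ei for xi, ei in zip(x, e_iid)]
y_ar = [1.0 + 0.5 * xi + ei for xi, ei in zip(x, e_ar)]

r1_iid = lag1_autocorr(ols_residuals(x, y_iid))
r1_ar = lag1_autocorr(ols_residuals(x, y_ar))

print(f"lag-1 residual correlation, independent errors: {r1_iid:.3f}")
print(f"lag-1 residual correlation, AR(1) errors:       {r1_ar:.3f}")
```

The lag-1 residual correlation is used here only as one convenient diagnostic of serial correlation; cross correlation among sites would need a different check.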

Benson (1965) states clearly that the plotting of ratios (or sums) with common elements "is not wrong per se, nor are the correlation coefficients computed between such ratios (or sums) wrong, provided that the interpretation of correlation is made only in terms of ratios (or sums) and not in terms of any of the individual factors." Benson also quotes Chayes (1949) as saying only that "no ratio correlation permits valid inference about the relationship between any two of the absolute measures from which the ratios are formed." Neither of these statements contains any prohibition against plotting or correlating ratios (or products, or sums) with common elements. What is prohibited is the presumption that the correlation between ratios with common elements is indicative of the correlations between the quantities from which the ratios are formed. Specifically, it sometimes has been presumed that a poorly-defined relationship between two variables could in some way be improved by finding suitable highly-correlated products or ratios and then extracting the original variables from that relationship. The burden of Benson's (1965) article is to demonstrate that such presumptions are invalid and unwarranted.
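The point can be seen numerically. The sketch below uses three mutually independent synthetic quantities (the uniform distributions, the sample size, and the helper name `pearson_r` are assumptions made for illustration). The ratios x/z and y/z are clearly correlated through their common denominator, and that correlation is a correct statement about the ratios; it simply says nothing about the relationship between x and y themselves.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 5000

# Three mutually independent positive quantities (hypothetical data).
x = [rng.uniform(1.0, 2.0) for _ in range(n)]
y = [rng.uniform(1.0, 2.0) for _ in range(n)]
z = [rng.uniform(1.0, 2.0) for _ in range(n)]

# Correlation between the absolute measures: close to zero.
r_xy = pearson_r(x, y)

# Correlation between ratios sharing the denominator z: clearly positive.
x_over_z = [a / c for a, c in zip(x, z)]
y_over_z = [b / c for b, c in zip(y, z)]
r_ratio = pearson_r(x_over_z, y_over_z)

print(f"r(x, y)     = {r_xy:.3f}")
print(f"r(x/z, y/z) = {r_ratio:.3f}")
```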

The question of spurious correlation thus seems to have little to do with any inherent spuriousness or other statistical pathologies of ratios or their correlation coefficients. Rather, it has to do with the illegitimate interpretation and application of correlation coefficients.

¹Discussion No. 88064D of the Water Resources Bulletin.
²Paper No. 88064 of the Water Resources Bulletin 25(3):483-490.
³Respectively, Hydrologists and Mathematical Statistician, U.S. Geological Survey, MS 415, Reston, Virginia 22092.

It is legitimate to interpret the correlation coefficient as an index of which of several explanatory variables provides the best explanation of a specific dependent variable. (It makes no difference whether the specific dependent variable is a single "absolute measure" or a ratio, product, sum, or difference of other variables, as long as the definition of the dependent variable stays fixed.) This use of the correlation coefficient as an index of model performance follows logically from the fact that the correlation coefficient equals the proportion of the dependent variable's standard deviation that is accounted for by a linear regression equation with the explanatory (independent) variable. The best explanatory variable is the one that leaves the smallest residual standard deviation, and this is measured satisfactorily by the proportion of standard deviation explained (correlation coefficient) as long as the comparisons are made with respect to the same dependent variable.
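One way to read this statement, offered here as an interpretation rather than as the writers' own derivation, is that the standard deviation of the fitted values equals the absolute value of the correlation coefficient times the standard deviation of the dependent variable. The sketch below checks that identity, and the companion identity for the residuals, on synthetic data; the coefficients, sample size, and variable names are assumptions.

```python
import math
import random

rng = random.Random(1)  # fixed seed; the data are synthetic
n = 200
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [2.0 + 0.8 * xi + rng.gauss(0.0, 1.0) for xi in x]

mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
syy = sum((yi - my) ** 2 for yi in y)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))

b = sxy / sxx                    # least-squares slope
a = my - b * mx                  # least-squares intercept
r = sxy / math.sqrt(sxx * syy)   # correlation coefficient

fitted = [a + b * xi for xi in x]
resid = [yi - fi for yi, fi in zip(y, fitted)]

# All three standard deviations use the same divisor (n - 1) so the
# ratios below are exact algebraic identities, not approximations.
sd_y = math.sqrt(syy / (n - 1))
sd_fitted = math.sqrt(sum((fi - my) ** 2 for fi in fitted) / (n - 1))
sd_resid = math.sqrt(sum(ei ** 2 for ei in resid) / (n - 1))

print(f"|r|                 = {abs(r):.6f}")
print(f"sd(fitted) / sd(y)  = {sd_fitted / sd_y:.6f}")
print(f"sd(resid) / sd(y)   = {sd_resid / sd_y:.6f}")
print(f"sqrt(1 - r^2)       = {math.sqrt(1 - r * r):.6f}")
```

For a fixed dependent variable, sd(y) is the same in every comparison, so ranking explanatory variables by |r| is the same as ranking them by residual standard deviation.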

If a different dependent variable is introduced, however, by multiplication, division, or summation of the dependent variable with other variables, specifically the explanatory variable, a comparison becomes problematical. The new dependent variable has a new physical meaning, new dimensional units, and a new value of standard deviation. It is difficult to imagine how one might meaningfully compare, for example, a correlation between discharge and area with a correlation between runoff and area. From a purely mathematical point of view, one could argue on the basis of correlation coefficients that a discharge-area correlation is more useful in estimating discharge than a runoff-area correlation is in estimating runoff. But what is the practical hydrologic significance of such a conclusion? How can it be applied by someone who wants to use area to estimate runoff?

One obvious (but naive) answer might be that, in this example, the high correlation between discharge and area should be exploited to obtain a preliminary estimate of discharge, which then should be divided by area to obtain runoff. Benson (1965), the author, and the writers all would agree that any such improvement would be spurious.
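A hedged numerical sketch of this example follows. It works with logarithms, which is an assumption made here so that dividing by area becomes subtracting log-area; the synthetic coefficients, the noise level, and the helper name `fit_line` are also assumptions. The discharge-area correlation comes out far higher than the runoff-area correlation, yet the naive two-step runoff estimates coincide with those of the direct runoff-area regression, so the higher correlation buys nothing.

```python
import math
import random

def fit_line(x, y):
    """Return least-squares intercept, slope, and correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b, sxy / math.sqrt(sxx * syy)

rng = random.Random(7)  # fixed seed; all data are synthetic
n = 300
log_area = [rng.uniform(1.0, 4.0) for _ in range(n)]
# Hypothetical basins: log-discharge rises a little less than
# proportionally with log-area, plus random scatter.
log_q = [0.2 + 0.9 * la + rng.gauss(0.0, 0.3) for la in log_area]
# Runoff is discharge per unit area, so log-runoff = log-discharge - log-area.
log_runoff = [lq - la for lq, la in zip(log_q, log_area)]

a_q, b_q, r_q = fit_line(log_area, log_q)        # discharge vs. area
a_r, b_r, r_r = fit_line(log_area, log_runoff)   # runoff vs. area

# Naive two-step estimate: predict discharge from area, then divide by area.
naive = [a_q + b_q * la - la for la in log_area]
# Direct estimate: predict runoff from area.
direct = [a_r + b_r * la for la in log_area]
max_diff = max(abs(p - d) for p, d in zip(naive, direct))

print(f"r(discharge, area) = {r_q:.3f}")
print(f"r(runoff, area)    = {r_r:.3f}")
print(f"largest difference between naive and direct estimates = {max_diff:.2e}")
```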

Although any such improvement would be spurious, and although many warnings against such correlations have been issued, not much has been said about the consequences of using such an approach. It seems to have been tacitly assumed that such "spurious" correlations inevitably will lead to erroneous conclusions and inaccurate estimates. We are not aware that such assumptions have been shown to be true.

This question has been addressed by Gilroy et al. (1990), who compared two methods of estimating sediment load using stream discharge. Because sediment load is the product of discharge and concentration, the correlation between load and discharge contains a common factor and thus might be considered spurious. A correlation between concentration and discharge does not contain the common factor, and thus might be considered more legitimate.

These relationships usually are studied in terms of the logarithmically transformed data, in which terms the log-load, M, is the sum of log-concentration, C, log-discharge, Q, and a units-conversion term, A; that is, M = C + Q + A. In practice, Q and C are measured and M is computed for a limited set of calibration data. The problem then is to estimate future values of M using only future observed values of Q. Two methods may be considered. First, a direct regression of M on Q may be used; this regression involves a dependent variable that has one term in common with the explanatory variable. Alternatively, a regression of C against Q could be used, followed by addition of Q and A to the estimated C to obtain the M-estimate.

Gilroy et al. (1990) compared the direct (and supposedly spurious) regression of M on Q with the (supposedly more legitimate) two-step estimate involving regression of C on Q, followed by addition of Q and A to obtain M. They found that the two procedures yielded mathematically identical equations for estimating M from Q, identical M-estimates, identical estimates of standard error of the M-estimates, identical confidence intervals on the M-estimates, and identical results of tests of significance of the M-vs-Q slope.
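The equivalence can be checked on synthetic data. The sketch below is not the computation of Gilroy et al. (1990); the data, the value assigned to the units-conversion term A, and the helper name `fit_line` are assumptions for illustration. It fits M on Q directly, fits C on Q and adds Q and A, and compares the resulting equations and residual standard errors.

```python
import math
import random

def fit_line(x, y):
    """Return least-squares intercept, slope, and residual standard error."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return a, b, math.sqrt(sse / (n - 2))

rng = random.Random(3)  # fixed seed; all data are synthetic
n = 100
A = 0.43  # hypothetical units-conversion term, in log units

Q = [rng.gauss(2.0, 0.5) for _ in range(n)]               # log-discharge
C = [1.0 + 0.6 * q + rng.gauss(0.0, 0.4) for q in Q]      # log-concentration
M = [c + q + A for c, q in zip(C, Q)]                     # log-load

# Method 1: direct regression of M on Q (common term Q on both sides).
a_m, b_m, se_m = fit_line(Q, M)

# Method 2: regress C on Q, then add Q and A to the estimated C.
a_c, b_c, se_c = fit_line(Q, C)
a_two_step = a_c + A      # intercept of the implied M-vs-Q equation
b_two_step = b_c + 1.0    # slope of the implied M-vs-Q equation

print(f"direct:   M = {a_m:.6f} + {b_m:.6f} Q, s = {se_m:.6f}")
print(f"two-step: M = {a_two_step:.6f} + {b_two_step:.6f} Q, s = {se_c:.6f}")
```

Because the residuals and the values of Q are the same in both fits, the standard error of the slope is also the same. What does change between the two methods is the null hypothesis that a slope test refers to: a C-vs-Q slope of zero corresponds to an M-vs-Q slope of one.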

Although Gilroy et al. (1990) considered only one case of spurious correlation, this one case nonetheless seems to cover many of the cases of practical importance. In particular, it includes log-log correlation of ratios with denominator in common with the explanatory variable. The following general proposition is true: log-log regression analysis is invariant under formation of products or ratios of the original dependent variable with the original independent variables. That is, the same answers are obtained in the end, whether or not ratios or products are used in the intermediate steps of analysis. Specifically, direct log-log regression of load against discharge gives the same answers in the end as the two-step procedure involving log-concentration. Similarly, direct log-log regression of runoff against area would give the same answers in the end as a logarithmic area-discharge-runoff method. The proof follows directly from the matrix-algebra formulation of the multiple linear regression equations for the direct and two-step methods.
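The matrix-algebra argument is not written out in the text. One possible sketch of it is given below; the symbols y, X, c, and the starred quantities are notation assumed here for illustration and do not come from the source.

```latex
% Assumed notation (not from the source):
%   y : n-vector of the log-transformed original dependent variable
%   X : n-by-p design matrix (a column of ones and the log-transformed
%       explanatory variables), with X^T X invertible
%   c : fixed p-vector, so that Xc is the logarithm of the product or
%       ratio factor formed from the explanatory variables
\begin{align*}
y^{*} &= y + Xc
  && \text{new dependent variable (log of product or ratio)} \\
\hat{\beta}^{*} &= (X^{\top}X)^{-1}X^{\top}y^{*}
  = (X^{\top}X)^{-1}X^{\top}y + (X^{\top}X)^{-1}X^{\top}Xc
  = \hat{\beta} + c \\
\hat{y}^{*} &= X\hat{\beta}^{*} = X\hat{\beta} + Xc = \hat{y} + Xc \\
y^{*} - \hat{y}^{*} &= y - \hat{y}
  && \text{identical residuals}
\end{align*}
```

In the sediment example, taking y = C, X with columns (1, Q), and c = (A, 1) gives y* = M. Removing Xc from the fitted values of the transformed regression recovers the fitted values of the original one, and because the residuals are identical, so is every quantity computed from them together with X, such as the standard error of estimate.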

It thus is clear that the hunt for spurious correlations will be carried out more effectively by searching for illogical interpretations of correlation coefficients than by merely searching for ratios with common factors. The examples quoted by the author do not include information about how the correlations were interpreted and used, so it is not possible to say whether the correlations were spurious or legitimate. It is unfortunate that the author seized upon the cosmetic aspect of the problem and did not come to grips with more fundamental problems of using regression and correlation to analyze and interpret observational data.


LITERATURE CITED

Benson, M. A., 1965. Spurious Correlation in Hydraulics and Hydrology. Journal of the Hydraulics Division, ASCE 91(HY4):35-42.

Chayes, F., 1949. On Ratio Correlation in Petrography. Journal of Geology 57(3):239-254.

Gilroy, E. J., W. H. Kirby, T. A. Cohn, and D. G. Glysson, 1990. Discussion on Uncertainty in Suspended-Sediment Transport Curves. Journal of Hydraulics, ASCE 116(1):143-145.

Stedinger, J. R. and G. D. Tasker, 1985. Regional Hydrologic Analysis: Ordinary, Weighted, and Generalized Least Squares Compared. Water Resources Research 21(9):1421-1432.
