Statistics in Psychology Using R and SPSS (Rasch/Statistics in Psychology Using R and SPSS) || Regression and Correlation

Post on 09-Dec-2016

212 views

Category:

Documents

TRANSCRIPT

• P1: OTA/XYZ P2: ABCJWST094-c11 JWST094-Rasch September 22, 2011 16:25 Printer Name: Yet to Come

11

Regression and correlation

In this chapter we polarize deterministic relationships and dependencies on the one sideand stochastic relationships of two characters on the other side thereby stochastic meansdependent on chance. Within the field of psychology only the latter relationships are ofinterest. We will deal at first with the graphical representation of corresponding observationpairs as a scatter plot. We will focus mainly on the linear regression function, which graspsthe relationship of two characters, modeled by two random variables. Then we will considerseveral correlation coefficients, which quantify the strength of relationship. Depending onthe scale type of the characters of interest, different coefficients become relevant. In additionto (point) estimators for the corresponding parameters, this chapter particularly deals withhypothesis testing concerning these parameters.

11.1 IntroductionThe term relationship can be explained by means of its colloquial sense. We know fromeveryday life, for instance, phrases and statements of the following type: Childrens behavioris related to parents behavior, Learning effort and success in exams are related, Birth-rateand calendar month are related, and so on.

Bachelor Example 11.1 Type and strength of the relationship between the ages of themother and father of last-born children is to be determined

From everyday life experience, as well as for obvious reasons, ages of bothparents are related. Although there are some extreme cases, for example a 40-year-old mother together with a 25-year-old father of a child, or a 65-year-oldfather with a 30-year-old mother, nevertheless in our society a preponderance ofparents with similar ages can be observed.

Statistics in Psychology Using R and SPSS, First Edition. Dieter Rasch, Klaus D. Kubinger and Takuya Yanagida. 2011 John Wiley & Sons, Ltd. Published 2011 by John Wiley & Sons, Ltd.

• P1: OTA/XYZ P2: ABCJWST094-c11 JWST094-Rasch September 22, 2011 16:25 Printer Name: Yet to Come

304 REGRESSION AND CORRELATION

In contrast to relationships of variables in mathematics, in psychology relationships do notfollow specific mathematical functions. Thus exact functions like y = f (x) do not exist forrelationships in the field of psychology.

Bachelor In many mathematical functions, every value x from the domain of a function fcorresponds to exactly one value y from the co-domain. In this case, the functionis then unique. The graph of such a function is a curve in the plane (x, y). Insuch a function, x is called the independent variable and y is called the dependentvariable. This terminology is common also in the field of psychology, mainly inexperimental psychology, and can also be found in relevant statistical softwarepackages.

Thus, if we talk about a relationship, an interdependence between two characters or a de-pendency of one character on the other is meant. We have to distinguish strictly between theterms of a functional relationship in mathematics, the stochastic dependency in statistics, andthe colloquial term of relationship in psychology, which means a dependency in the mean.However, we will see that psychology is concerned exactly with the (stochastic) dependencieswith which statistics deals. We also call them statistical dependencies. Cases which dealwith functional relationships, like in mathematics, hardly exist in empirical sciences and arenot discussed in this book.

Bachelor Example 11.2 Some simple mathematical functionsBetween the side lengths, l, of a square and its circumference, c, the following

functional relationship exists: c = 4l. Here, this function is a linear function(equation of a straight line). To calculate the area A from the side lengths, theformula A = l2, a quadratic function, has to be used. Of course it is possibleto combine both formulas and to calculate A from c by means of the formula:A = c216 .

Obviously relations between two quantitative characters, between two ordinal-scaled, and twonominal-scaled characters are of interest, but also all cases of mixtures (e.g. a quantitativecharacter on one hand and an ordinal-scaled character on the other hand).

Example 11.3 The dependency of characters from Example 1.1Aside from gestational age at birth, which has been addressed several times before, and

which trivially is a ratio-scaled character, mainly the test scores of all subtests are intervalscaled. It could be of interest, for instance, to what extent test scores from the first and thesecond test date are related, or scores from different subtests.

Besides this, data from Example 1.1 includes ordinal-scaled characters like social statusor sibling position, as well as nominal-scaled characters like marital status of the mother,sex of the child, urban/rural, and native language of the child. Of interest could be therelationships of gestational age at birth (ratio-scaled) and sibling position (ordinal-scaled), ofsocial status (ordinal-scaled) and urban/rural (nominal-scaled, polychotomous), or of maritalstatus of the mother (nominal-scaled, polychotomous) and the native language of the child(nominal-scaled, dichotomous).

• P1: OTA/XYZ P2: ABCJWST094-c11 JWST094-Rasch September 22, 2011 16:25 Printer Name: Yet to Come

INTRODUCTION 305

However, relationships often also result between an actually interesting character and afactor or a noise factor (see Chapter 4, as well as Section 12.1.1 and Section 13.2).

Example 11.4 Does a relationship exist between the native language of a child and its testscores in the intelligence tests battery in Example 1.1?

When it is a question of a difference in the (nominal-scaled, dichotomous) factor nativelanguage of the child related to, for instance, the test score of subtest Everyday Knowledge,then the two-sample Welch test, which has already been discussed, even tests the relationshipbetween the test score (as a quantitative character) and the native language of the child. If wefurthermore suppose that the (nominal-scaled, dichotomous) character sex of the child is anoise factor, then the interest is in the relationship between sex of the child or native languageof the child and the test score.

For the purpose of discovering whether certain relationships of (quantitative) charactersexist, we illustrate the phenomenon of a relation by representing the pairs of observationsin a rectangular coordinate system. In the case of two characters x and y, this is done bymeans of representing every research unit v by a point, with the observations xv and yv ascoordinates. We thus obtain a diagram, which represents all pairs of observations as a scatterplot.

Bachelor Example 11.5 In which way are the test scores in the characters Applied Com-puting, 1st test date and Applied Computing, 2nd test date related?

We assume that intelligence is a stable character, thus a character which isconstant over time and over different situations. In this case, the relationshipshould be strict. For illustrating the relation graphically, we represent the testscores at the first test date on the x-axis (axis of the abscissa) and the test scoresat the second test date on the y-axis (axis of the ordinates). Both characters arequantitative.

In R, we first enable access to the data set Example_1.1 (see Chapter 1) by using thefunction attach(). Then we type

> plot(sub3_t1, sub3_t2,+ xlab = "Applied Computing, 1st test date (T-Scores)",+ ylab = "Applied Computing, 2nd test date (T-Scores)")

i.e. we use both characters, Applied Computing, 1st test date (sub3_t1) and AppliedComputing, 2nd test date (sub3_t2), as arguments in the function plot(), and labelthe axes with xlab and ylab.

As a result, we get a chart identical to Figure 11.1.

In SPSS, we use the same command sequence (Graphs Chart Builder. . .) as shown inExample 5.2 in order to open the window shown in Figure 5.5. Here we select Scatter/Dot inthe panel Choose from: of the Gallery tab; next we drag and drop the symbol Simple Scatterinto the Chart preview above. Then we move the character Applied Computing, 1st test date

• P1: OTA/XYZ P2: ABCJWST094-c11 JWST094-Rasch September 22, 2011 16:25 Printer Name: Yet to Come

306 REGRESSION AND CORRELATION

to the field X-Axis? and the character Applied Computing, 2nd test date to the field Y-Axis?.After clicking OK we obtain the chart shown in Figure 11.1

70

60

50

40

40 50 60Applied Computing, 1st test date (T-Scores)

App

lied

Com

putin

g, 2n

d te

st d

ate

(T-Sc

ores

)

70

30

20

Figure 11.1 Scatter plot for Example 11.5.

We clearly distinguish from the scatter plot the following tendency: The biggerthe score is at the first test date, the bigger the score is at the second test date, butobviously there is no functional relationship. In that case, all of the points wouldhave to be situated exactly on the graph of any determined function, which is hardto imagine for the given scatter plot. Nevertheless, one comes to the conclusionthat this scatter plot can be described, at least roughly, by a straight line (with apositive slope).

Although, in empirical research in psychology, only relationships by trend and no func-tional relationships exist, in the case of two (quantitative) characters it seems to be meaningfulto place a straight line or, if necessary, some other type of curve through the points of a scatterplot in the way that all deviations of the points from the resulting line (or curve) are as smallas possible on the whole. The procedure of adjusting a line or, more generally stated, acurve, to the points of a scatter plot concerns the type of relationship. The question is if therelationship is linear or nonlinear; and in the case that it is linear, the slope and the interceptof the linear function are of interest.

• P1: OTA/XYZ P2: ABCJWST094-c11 JWST094-Rasch September 22, 2011 16:25 Printer Name: Yet to Come

INTRODUCTION 307

Bachelor Obviously, it is always feasible to fit a mathematical function (e.g. a straightline) in the best way possible to a scatter plot. However, the essential questionis to what extent the adjustment succeeds. It is always about how distant thesingle points of the scatter plot are from the curve that belongs to the func-tion. In the case of a relationship of 100%, all of the points would be situatedexactly on it.

The question of the type of relationship is the subject of the so-called regression analysis.Initially the appropriate model is selected; that is the determination of the type of function.In psychology this is almost always the linear function. The next step is to concretize thechosen function; that is, to find that one which adjusts to the scatter plot as well as possible.If this concretization were done by appearance in a subjective way, the result would be tooinexact. An exact determination can be made by means of a certain numeric procedure, theleast squares method (see Section 6.5). From a formal point of view, this means that theslope and intercept of the so-called regression line have to be determined in the best waypossible.

Bachelor After having exactly determined the type of relationship between two characters,as concerns future research units the outcome for one of the characters can bepredicted from the outcome of the other character. For that purpose, the graph ofthe mathematical function, which has finally been chosen, has to be positionedin a way that the overall differences (over all pairs of observations) between thetrue value and that value determined by the other character based on the givenrelationship, becomes a minimum. That is how the term regression derives fromLatin (move backwards), in the way that a certain point on the abscissa isprojected, via the respective mathematical function, onto the corresponding valueon the ordinate.

Doctor Generally, in regression analysis two different cases have to be distinguished:there are two different types of pairs of outcomes. In the case which is usualin psychology, both these values of each research unit were ascertained (mostlysimultaneously) by observation, and result as a matter of fact (so-called modelII). In contrast, in the other type (model I), the pair of outcomes consists of onevalue (e.g. the value x), which is chosen (purposely) in advance by the researcherbut only the other value (y) is observed as a matter of course. This type is rarelyfound in psychology. An example of this model is the observation of childrensgrowth between the age of 6 and 12 years. The investigator thus determines theage at which the body size of the children is measured. In this case, the x-valuesare not measured, but determined a priori.

The type of the model influences the calculation of slope and intercept ofthe regression line. More detailed information concerning model I in combina-tion with regression analysis can be found for example in Rasch, Verdooren, &Gowers (2007).

• P1: OTA/XYZ P2: ABCJWST094-c11 JWST094-Rasch September 22, 2011 16:25 Printer Name: Yet to Come

308 REGRESSION AND CORRELATION

11.2 Regression modelIf we model the characters x and y by two random variables x and y, then (x, y) is called abivariate random variable. Then the regression model is

yv = f (xv ) + ev (v = 1, 2, . . . , n) (11.1)

or

xv = g( yv ) + ev (v = 1, 2, . . . , n) (11.2)

The function f is called a regression function from y onto x and the function g is called aregression function from x onto y. In function f , y is the regressand and x is the regressor, andvice versa in function g. The random variables ev or ev describe error terms. The model partic-ularly assumes that these errors are distributed with mean 0 and the same (mostly unknown)variance in each research unit v. It is also assumed that the error terms are independent fromeach other, as well as that the errors are independent from the variables x or y, respectively.

If functions f and g are linear functions, the following regression models result:

yv = 0 + 1xv + ev (v = 1, 2, . . . , n) (11.3)

or

xv = 0 + 1 yv + ev (v = 1, 2, . . . , n) (11.4)

respectively. If the random variables x and y are (two-dimensional) normally distributed, theregression function is always linear.

Usually the two functions y = 0 + 1x and x = 0 + 1 y differ from each other. Theyare identical exclusively in the case that the strength of the relationship is maximal.

We thus have defined the regression model for the population. Now, we have to estimatethe adequate parameters 0 and 1 or 0 and 1, respectively, with the help of a randomsample taken from the relevant population. With regard to the content, this is of interestmainly because, in practice, sometimes only one of the values (e.g. x-value) is known andwe want to predict the corresponding one (y-value) in the best way possible. For both ofthe random variables x and y we will denote, in addition to the parameters of the regressionmodel, means and variances as follows: x and y, 2x and 2y .

Starting from the need to place the regression line th...