Statistics in Psychology Using R and SPSS (Rasch/Statistics in Psychology Using R and SPSS) || Samples from more than One Population



  • P1: OTA/XYZ P2: ABCJWST094-c13 JWST094-Rasch September 26, 2011 8:59 Printer Name: Yet to Come


    Samples from more than one population

    This chapter is about scenarios where at least two characters are observed on research units of samples from at least two populations. Predominantly, the discussed methods can only be used for quantitative characters. In particular, the generalization of the analysis of variance with more than one character will be discussed. Thereby even noise factors are taken into account. Finally, it will be shown how characters (in combination) can be selected, from a relatively large pool of characters, in order to best discriminate between two or more groups.

    Basically, this chapter is about the generalizations of the research questions from Chapter 12; that is, for the case of more than one sample or population, respectively. Further, it is about the generalization of research questions from Chapter 10; that is, comparing more than one or two samples with respect to at least two characters instead of only a single character. Regression and correlation analysis will also be dealt with, since this is obligatory whenever at least two characters are under consideration. First of all, the general linear model will be introduced, as all of the respective methods can be incorporated there.

    13.1 General linear model

    So far, we considered, as one of the most complex cases of the general linear model, for instance the three-way analysis of variance (see Table 10.23):

    $y_{ijkv} = \mu + a_i + b_j + c_k + (ab)_{ij} + (ac)_{ik} + (bc)_{jk} + (abc)_{ijk} + e_{ijkv}$    (13.1)

    Statistics in Psychology Using R and SPSS, First Edition. Dieter Rasch, Klaus D. Kubinger and Takuya Yanagida. 2011 John Wiley & Sons, Ltd. Published 2011 by John Wiley & Sons, Ltd.



    If we now label the vector of all main and interaction effects (a, b, (ab), etc.) as $\theta$, then the general linear model¹ can be written quite simply as ($y$ and $e$ are also vectors):

    $y = \theta + e$    (13.2)

    In the simplest case of one character in just a single random sample, the model can be written as: $y_v = \mu + e_v$; $v = 1, 2, \ldots, n$.

    The model in (13.2) can also be written as follows (with $X$ as a matrix and $\beta$ as a vector):

    $y = X\beta + e$    (13.3)

    For example, the simple analysis of variance (model I) can be written as: $y_{iv} = \mu + a_i + e_{iv}$; $i = 1, 2, \ldots, a$; $v = 1, 2, \ldots, n_i$ (see Formula (10.1)).

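    Doctor

    For example, for $a = 3$ and $n_i = n = 2$, $y$ is the vector of all $y_{iv}$. In lexicographical order, this is written (transposed) as: $y^T = (y_{11}, y_{12}, y_{21}, y_{22}, y_{31}, y_{32})$. Analogously, $e$ is the vector of all $e_{iv}$, and $\beta$ the vector of the four parameters $\mu, a_1, a_2, a_3$. $X$ has the form

    $$X = \begin{pmatrix} 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \end{pmatrix}$$

    Formula (13.3) then is

    $$\begin{pmatrix} y_{11} \\ y_{12} \\ y_{21} \\ y_{22} \\ y_{31} \\ y_{32} \end{pmatrix} = \begin{pmatrix} 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} \mu \\ a_1 \\ a_2 \\ a_3 \end{pmatrix} + \begin{pmatrix} e_{11} \\ e_{12} \\ e_{21} \\ e_{22} \\ e_{31} \\ e_{32} \end{pmatrix}$$

    In analysis of variance, $X$ is a matrix consisting of zeros and ones. In simple linear regression, the model equation is $y_v = \beta_0 + \beta_1 x_v + e_v$; $v = 1, \ldots, n$; it follows that $\beta^T = (\beta_0, \beta_1)$, and the matrix $X$ has $n$ rows and two columns, the $v$th row being $(1 \;\; x_v)$. In the regression analysis, the matrix $X$ thus contains the outcomes of the regressor.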
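The two design matrices described above can be built directly in R. The following is a minimal sketch, not taken from the book: the object names (A, X_anova, X_reg) and the regressor values are hypothetical and chosen only for illustration.

```r
# Grouping factor for a = 3 levels with n = 2 observations each,
# in the lexicographical order y11, y12, y21, y22, y31, y32.
A <- factor(rep(1:3, each = 2))

# Design matrix as in the text: a column of ones for mu,
# followed by one indicator column per factor level (a1, a2, a3).
X_anova <- cbind(1, model.matrix(~ 0 + A))
colnames(X_anova) <- c("mu", "a1", "a2", "a3")
print(X_anova)

# The first column is the sum of the other three, so X has rank 3,
# not 4; a side condition on the a_i is needed for unique estimates.
rank_anova <- qr(X_anova)$rank

# Simple linear regression: the v-th row of X is (1, x_v).
x <- c(95, 100, 105, 110)   # hypothetical outcomes of the regressor
X_reg <- cbind(1, x)
print(X_reg)
```

Note that lm() does not use this four-column matrix internally; by default it drops one indicator column (treatment contrasts) so that the design matrix has full rank.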

    Although the general linear model covers a large number of models once random factors and regressors are permitted, the general notation does offer some advantages. Most notably, estimation by the method of least squares and the derivation of the distribution of test statistics can be handled within the general model and then easily specialized to particular models. The analysis of covariance, factor analysis, and discriminant analysis can also be described using the general linear model.

    ¹ The general linear model was defined mainly as model I of the analysis of variance, as well as of the regression analysis, so that both approaches could be handled consistently.
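To illustrate how least squares estimation works in the general notation, the following sketch solves the normal equations $X^T X b = X^T y$ for a simple linear regression and compares the result with lm(). The data are simulated; all names and numerical values are assumptions made for this illustration only.

```r
set.seed(1)                            # reproducible simulated data
n <- 20
x <- rnorm(n, mean = 100, sd = 15)     # hypothetical regressor outcomes
y <- 30 + 0.6 * x + rnorm(n, sd = 5)   # hypothetical regressand

X <- cbind(1, x)                       # n rows, two columns: (1, x_v)

# Least squares estimate from the normal equations X'X b = X'y.
beta_hat <- solve(crossprod(X), crossprod(X, y))

# The same estimates from lm(), which works via a QR decomposition.
beta_lm <- coef(lm(y ~ x))

print(cbind(normal_equations = as.numeric(beta_hat), lm = beta_lm))
```

The same two lines of matrix algebra apply to any full-rank design matrix, which is the practical benefit of treating analysis of variance and regression within one model.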

    13.2 Analysis of covariance

    The analysis of covariance, a special case of the general linear model, is related to regression analysis as well as to analysis of variance. The analysis of covariance is about the mean differences between the (at least two) levels of a particular factor A with respect to the character y; these factor levels can again be seen as different populations. It is also about the assumption/fear that the differences between the factor levels $A_1, A_2, \ldots, A_i, \ldots, A_a$ with respect to the (means of) the character y (i.e. the relationship of $\boldsymbol{y}$ and $E(\boldsymbol{y}) = \mu$, respectively, and A) are disturbed, superimposed, or hidden by another character, z. In this respect, the analysis of covariance differs from the partial correlation coefficient in Section 12.1.1 simply by replacing the quantitative character x by the qualitative factor A. In both cases, the characters y and z are quantitative. The factor can be fixed or random. The character, or so to say noise factor, z, modeled as the random variable $\boldsymbol{z}$, is called a covariate. The covariate can be fixed (in that case we write $z$ instead of $\boldsymbol{z}$) or random. In psychological applications it is, however, usually random.


    The characters y and z are modeled by the normally distributed random variables $\boldsymbol{y}$ and $\boldsymbol{z}$. For the purpose of hypothesis testing, a two-dimensional normal distribution of these variables must be assumed. The data structure is shown in Table 13.1. The difference from Table 10.1 is the additional column in which the outcomes $z_{iv}$ are listed.

    Table 13.1 Data structure of a (one-way) analysis of covariance

    level A1        level A2        ...   level Aa
    y11    z11      y21    z21      ...   ya1    za1
    y12    z12      y22    z22      ...   ya2    za2
    ...             ...             ...   ...
    y1n1   z1n1     y2n2   z2n2     ...   yana   zana

    We only consider the case of a fixed factor A. The model equation is derived from Formula (10.1); however, in the expression $y_{iv} = \mu + a_i + e_{iv}$, the $a_i$ is co-determined by the $i$th of the $a$ regression functions from z onto y: $a_i = a_i(\beta_{0i} + \beta_{1i} z_{iv}) = \beta_{0i} + \beta_{1i} z_{iv}$. Assuming $\beta_{1i} = \beta_1$ for all $i$, it follows:

    $y_{iv} = \mu + \beta_{0i} + \beta_1 z_{iv} + e_{iv}$  $(i = 1, \ldots, a;\ v = 1, \ldots, n_i)$    (13.4)

    As in the analysis of variance, for instance, the following assumptions are made:

    • The $e_{iv}$ are normally distributed $N(0, \sigma^2)$; therefore they all have the same expectation 0 and the same variance $\sigma^2$.
    • The $e_{iv}$ are independent (from each other).
    • The $z_{iv}$ and the $e_{iv}$ are independent (from each other).
    • $a \geq 2$, and, for all $i$, $n_i \geq 2$.
    • $\sum_{i=1}^{a} a_i = 0$.

    Because $\beta_{1i} = \beta_1$, it is particularly presupposed that the regression lines are parallel. Therefore, they only differ in terms of the intercept $\beta_{0i}$. It follows that the $a$ levels of the factor A only have influence on the intercept. This assumption is certainly fulfilled if the variance-covariance matrix

    $$\Sigma_{(i)} = \begin{pmatrix} \sigma^2_{y(i)} & \sigma_{yz(i)} \\ \sigma_{yz(i)} & \sigma^2_{z(i)} \end{pmatrix}$$

    is the same in all $a$ populations, which means that its elements do not depend on $i$ (if so, $\beta_{1i} = \sigma_{yz(i)} / \sigma^2_{z(i)} = \beta_1 = \sigma_{yz} / \sigma^2_z$). This is then referred to as the homogeneity of the variance-covariance matrix.
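The parallel-lines model (13.4) can be fitted with lm(). The sketch below uses simulated data, since the book's data set is not reproduced here; the variable names and the true parameter values (intercepts 20 and 28, common slope 0.7) are assumptions chosen purely for illustration.

```r
set.seed(13)                                   # reproducible simulated data
n_i <- 30                                      # observations per factor level
A <- factor(rep(c("A1", "A2"), each = n_i))    # fixed factor with a = 2 levels
z <- rnorm(2 * n_i, mean = 100, sd = 15)       # covariate outcomes

# Level-specific intercepts, one common slope, normal errors N(0, 5^2).
intercepts <- c(A1 = 20, A2 = 28)
y <- unname(intercepts[as.character(A)]) + 0.7 * z + rnorm(2 * n_i, sd = 5)

# '0 + A' requests one intercept per level instead of a reference level;
# each estimate corresponds to mu + beta_0i, and 'z' is the common slope beta_1.
fit <- lm(y ~ 0 + A + z)
print(coef(fit))
```

With the default parameterization, lm(y ~ A + z), the coefficient for level A2 would instead be the difference between the two intercepts; the fitted values are identical in both versions.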


    Example 13.1 The influence of Latin courses on the development of the ability of reasoning is to be examined

    Teachers often argue that Latin courses in high school are important not only because they are the basis for other languages, but also because they are an excellent exercise in reasoning. There might be a study in which students in the 11th level of education are tested with a pertinent psychological test for measuring their reasoning ability. The factor in question, Latin courses, then would have two levels: from the 7th level of education with Latin courses and without Latin courses; thus the factor is fixed. The character of interest is the test score in reasoning.

    Bear in mind (see Section 3.1) that this is a so-called ex-post-facto design. That is, the assignment to the two levels is not due to randomization. It follows that, in the end, any effects detected may have existed from the very beginning: students who choose Latin courses thus might differ fundamentally, and right from the onset, from those who do not. Intelligence is one of the potential noise factors. For this we might have measured the character intelligence, too; again with an adequate psychological test. The idea is to clean up the differences between the levels of the factor Latin courses in the reasoning test scores from the possible contribution of the factor intelligence. If the factor Latin courses were a quantitative character like the two characters reasoning and intelligence, we would compute a partial correlation coefficient. This then would adjust the correlation between Latin courses and reasoning with respect to the noise factor intelligence.

    A comparison between the means of students with and without Latin courses as concerns the character reasoning would be unfair if all those with Latin courses were more intelligent than those without Latin courses. To analyze the latter, we assume that if there is any (positive) correlation between the character reasoning (y) as the regressand and the character intelligence (z) as the regressor, then the slope of the regression line in both groups will be equal. Thus, the effect of Latin courses could be estimated by the distance between the two parallel regression lines for any value of the regressor variable, for example for an intelligence value of 100. This estimation would then be adjusted with respect to a possibly existing difference in the means of intelligence in both groups. If we take the (from the content point of view unrealistic) value 0 instead of 100, the estimation is just the difference between the intercepts.
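The claim that the distance between two parallel regression lines is the same at every value of the regressor can be checked numerically. The following sketch uses simulated data, not the data set of Example 13.1; the variable names mirror the example, but all numerical values are assumptions.

```r
set.seed(2)                                     # reproducible simulated data
n_i <- 40
latin <- factor(rep(c("without Latin", "with Latin"), each = n_i),
                levels = c("without Latin", "with Latin"))
intelligence <- rnorm(2 * n_i, mean = 100, sd = 15)
reasoning <- 25 + 6 * (latin == "with Latin") + 0.7 * intelligence +
  rnorm(2 * n_i, sd = 5)

# Parallel-lines model: one common slope, group-specific intercepts.
fit <- lm(reasoning ~ latin + intelligence)

# Predicted reasoning for both groups at intelligence 100 and at 0.
groups <- factor(levels(latin), levels = levels(latin))
at100 <- predict(fit, data.frame(latin = groups, intelligence = 100))
at0   <- predict(fit, data.frame(latin = groups, intelligence = 0))

diff100 <- unname(diff(at100))                  # distance of the lines at z = 100
diff0   <- unname(diff(at0))                    # distance at z = 0 (intercepts)
effect  <- unname(coef(fit)["latinwith Latin"]) # adjusted group effect

print(c(at_100 = diff100, at_0 = diff0, coefficient = effect))
```

All three printed values agree, because under a common slope the covariate term cancels when the two predictions are subtracted.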

    We now use the data set Example 13.1 (see Chapter 1 for its availability) to answer the given research question. The null hypothesis is: eliminating the effects of z, the means of y do not differ with respect to the factor levels $A_i$. Initially, however, it is of importance whether the slopes of the regression lines are the same for both factor levels or not.

    In R, we first enable access to the database Example_13.1 (see Chapter 1) by using the function attach(). Then we type

    > coef(lm(rsng ~ intelligence, subset = latin == "without Latin"))
    > coef(lm(rsng ~ intelligence, subset = latin == "with Latin"))

    i.e. we use the function lm(), each time specifying as the first argument that the character reasoning (rsng) is to be analyzed with regard to the character intelligence; as the second argument, we use subset, once to select the children without Latin and once to select the children with Latin. We apply the function coef() to the results.

    This yields:

    (Intercept) intelligence
     32.0325187    0.6361691

    (Intercept) intelligence
     16.1740102    0.9057169

    In SPSS, we first estimate the regression coefficients for each of the two levels of the factor Latin course separately. In order to do this, we split the data by the factor Latin courses by using the same sequence of commands (Data > Split File...) as shown in Example 5.11 and, in the following window in Figure 5.23, click the button Compare groups, after which we drag and drop Latin course into the field Groups Based on:. After clicking OK, the following analysis will be conducted separately for children without Latin and with Latin. Following the steps described in Example 11.5 (Analyze > Regression > Linear...), we arrive at Figure 11.5, where we mark the character reasoning in order to drag and drop it into the field Dependent:. We drag and drop the character intelligence into the field Independent(s):. After clicking OK, we obtain the results in Table 13.2, among other things.

    Table 13.2 SPSS-output of the regression coefficients in Example 13.1 (shortened output)

    Coefficients (a)

                                     Unstandardized Coefficients
    Latin courses    Model           B          Std. Error
    without Latin    (Constant)      32.033     14.184
    with Latin       intelligence      .906       .103

    a. Dependent Variable: reasoning

    The estimated regression function from reasoning onto intelligence for students without Latin courses is: $\hat{y}_{\text{without Latin}} = 32.033 + 0.636z$; for students with Latin courses, $\hat{y}_{\text{with Latin}} = 16.174 + 0.906z$. Although there is a test which could test the hypothesis of equal slopes, we forgo such a test and illustrate the situation only graphically, with a scatter plot.
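For readers who do want a formal check of equal slopes, one common approach (not necessarily the test the authors have in mind) is to compare the parallel-lines model with a model that adds a factor-by-covariate interaction. The sketch below uses simulated data with hypothetical values; with the book's data one would use the variables rsng, intelligence and latin instead.

```r
set.seed(4)                                     # reproducible simulated data
n_i <- 40
latin <- factor(rep(c("without Latin", "with Latin"), each = n_i),
                levels = c("without Latin", "with Latin"))
intelligence <- rnorm(2 * n_i, mean = 100, sd = 15)
reasoning <- 25 + 6 * (latin == "with Latin") + 0.7 * intelligence +
  rnorm(2 * n_i, sd = 5)

# Model with a common slope (parallel regression lines).
fit_parallel <- lm(reasoning ~ latin + intelligence)

# Model with a separate slope per group (adds latin:intelligence).
fit_separate <- lm(reasoning ~ latin * intelligence)

# F-test of the interaction: a small p-value speaks against equal slopes.
slope_test <- anova(fit_parallel, fit_separate)
print(slope_test)

p_value <- slope_test[["Pr(>F)"]][2]
```

The comparison has one numerator degree of freedom here because, with two groups, the interaction adds exactly one slope parameter.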

    In R, we create the scatter plot by typing

    > plot(intelligence, rsng, col = as.numeric(latin),
    +      xlab = "intelligence", ylab = "reasoning")

    i.e. we apply the function plot() and use intelligence and reasoning (rsng) as arguments; we select the color of each data point with col = as.numeric(latin) according to the character value of the factor latin, wherein we first have to convert them into numeric values by using the function as.numeric(). Finally, we label the axes accordingly with xlab and ylab.

    As a result, we get a chart that exactly corresponds to the one in Figure 13.1.

    In SPSS, we deactivate the partitioning of the data we needed earlier by using the same sequence of commands (Data > Split File...) as before, and clicking the button Analyze all cases, do not create groups. After clicking OK, the partitioning will be cancelled. Following the steps described in Example 5.2, we open the Chart Builder (see Figure 5.5), where we click the diagram type Scatter/Dot in the gallery tab in order to select a Grouped Scatter (the second chart from the left in the first row; when the cursor hovers over this panel, Grouped Scatter appears); we click on it and drag and drop it into the field Chart preview. Now, we drag and drop the character reasoning into the field Y-Axis?, the character intelligence into the field X-Axis?, and the character Latin courses into the field Set color. After clicking OK, we obtain Figure 13.1 (which can be edited with the context menu Edit Content > In Separate Window).

    [Scatter plot of reasoning (y-axis) against intelligence (x-axis); legend: with Latin, without Latin]

    Figure 13.1 SPSS-output showing the scatter plot of the characters reasoning and intelligence in Example 13.1.

    Although the two slopes, in numerical terms, clearly differ from each other, the chart leaves the impression that the regression lines fitted to the two scatter plots run almost parallel. Clearly, however, they run shifted by an additive constant (intercept) on the ordinate. The dots of the students with Latin courses are on average higher in the level of reasoning than those of students without Latin courses.
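The visual impression of near-parallel lines is easier to judge when the group-wise regression lines are drawn into the scatter plot. The following sketch does this with abline() for simulated data; the data frame d and its values are hypothetical stand-ins for the data set of Example 13.1.

```r
set.seed(3)                                     # reproducible simulated data
n_i <- 40
latin <- factor(rep(c("without Latin", "with Latin"), each = n_i),
                levels = c("without Latin", "with Latin"))
intelligence <- rnorm(2 * n_i, mean = 100, sd = 15)
rsng <- 25 + 6 * (latin == "with Latin") + 0.7 * intelligence +
  rnorm(2 * n_i, sd = 5)
d <- data.frame(latin, intelligence, rsng)

# Scatter plot colored by group, as in the text.
plot(d$intelligence, d$rsng, col = as.numeric(d$latin),
     xlab = "intelligence", ylab = "reasoning")

# One least squares line per group, drawn in the color of that group.
slopes <- numeric(0)
for (g in levels(d$latin)) {
  fit_g <- lm(rsng ~ intelligence, data = d[d$latin == g, ])
  abline(fit_g, col = which(levels(d$latin) == g))
  slopes[g] <- coef(fit_g)[["intelligence"]]
}
legend("topleft", legend = levels(d$latin), col = 1:2, pch = 1)

print(slopes)                                   # the two estimated slopes
```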


    Obviously, answering the given research question requires more than descriptive statistics. The null hypothesis is: H0: $\mu_{i.z} = \mu_{l.z}$ for...

