Quantitative Data Analysis I. & II. Contingency tables: multivariate analysis and elaboration – introduction to 3-fold of data sorting, ordinal correlations Jiří Šafr jiri.safr(AT)seznam.cz updated 26/11/2014 UK FHS Historical sociology (2014+) ® Jiří Šafr, 2014


DESCRIPTION

UK FHS Historical sociology (2014). Quantitative Data Analysis I. Contingency tables: multivariate analysis and elaboration, introduction to third level of data sorting. Jiří Šafr, jiri.safr(AT)seznam.cz. Updated 5/6/2014.

TRANSCRIPT

Page 1: Quantitative Data Analysis I

Quantitative Data Analysis I. & II.

Contingency tables: multivariate analysis and elaboration, introduction to threefold data sorting, ordinal correlations

Jiří Šafr, jiri.safr(AT)seznam.cz

updated 26/11/2014

UK FHS, Historical sociology (2014+)

® Jiří Šafr, 2014

Page 2: Quantitative Data Analysis I

Multivariate analysis: threefold level of data sorting in crosstabulation

→ enables

a) more detailed description and

b) elaboration

(introduction 1.)

Page 3: Quantitative Data Analysis I

Third level of data sorting in contingency table

• A contingency table analysis is used to examine the relationship between two categorical variables (bivariate crosstabulation)

• but it can be organized within levels of a third variable. If our goal is elaboration (rather than detailed description), we call this variable the test variable or factor, and we aim to control for its effects.

• If a third variable is introduced, it will form separate layers or strata in the table.

Page 4: Quantitative Data Analysis I

3rd level of sorting data in contingency table

• We simultaneously analyse relationships among several variables (mostly several independent, i.e. explanatory, variables).

• The principle is the same as in bivariate analysis.

• The goal of the 3rd level of data sorting is in principle:

– More detailed description (in sub-subgroups)

– Elaboration of relationships → searching for causal relations, deeper understanding of context, distinguishing between substantive and false (spurious) relations, controlling for the effect of the 3rd variable (X↔Y / Z)

• The same holds for any 3rd level of data sorting in general, i.e. also for means in subgroups and linear association (scatter plots, correlation, regression). We will explain it on contingency tables first.

Page 5: Quantitative Data Analysis I

Principle of multivariate analysis: 3rd level of data sorting (2×2×2 table)

Dependent variable: attendance at religious services, simultaneously by 2 independent variables: Age, Gender

              Under 40           40 and older
              Men      Women     Men      Women
Weekly        21%      30%       34%      50%
Less often    79       70        66       50
100% =        (270)    (332)     (317)    (414)

Both older men and older women go to church more frequently than the young (i.e. religiosity rises with age).

In each age category women attend church more often than men.

Measured in percentage points, age has a slightly larger effect on church attendance than gender: the difference between the age groups is 13 points among men and 20 points among women, whereas the difference between men and women is 9 points under 40 and 16 points at 40 and older.

Age as well as gender has an independent effect on church attendance. Within each category of one independent variable, the categories of the other still influence people's behaviour.

Similarly, both independent variables have a cumulative effect on behaviour: older women attend church the most, young men the least.

[Babbie 1997: 391-392]

[Figure: stacked bar chart (0 to 100 %) of Weekly vs. Less often attendance for men and women, under 40 and 40 and older. Annotations: difference between men and women of 9 percentage points under 40 and 16 percentage points at 40 and older.]

Church attendance by gender and age, USA 1990

Source: General Social Survey, NORC [Babbie 1997: 391]
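The percentage-point comparisons on this slide can be reproduced with a few lines of code. The sketch below uses Python purely for illustration (the course itself works in SPSS); the dictionary layout and function names are assumptions made for this example, and the percentages are the ones printed in the table.

```python
# Percent attending weekly, keyed by (age group, gender); values from the slide's table.
weekly = {
    ("Under 40", "Men"): 21,
    ("Under 40", "Women"): 30,
    ("40 and older", "Men"): 34,
    ("40 and older", "Women"): 50,
}

def gender_gap(age):
    """Women minus men, in percentage points, within one age group."""
    return weekly[(age, "Women")] - weekly[(age, "Men")]

def age_gap(gender):
    """Older minus younger, in percentage points, within one gender."""
    return weekly[("40 and older", gender)] - weekly[("Under 40", gender)]

gender_gaps = {age: gender_gap(age) for age in ("Under 40", "40 and older")}
age_gaps = {gender: age_gap(gender) for gender in ("Men", "Women")}

print(gender_gaps)  # {'Under 40': 9, '40 and older': 16}
print(age_gaps)     # {'Men': 13, 'Women': 20}
```

Reading the table in both directions like this is the core of threefold sorting: each gap is computed while the other independent variable is held constant.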

Page 6: Quantitative Data Analysis I

Simplification of the 2×2×2 table:

We show only the "positive" category of the dependent variable ("attend weekly").

However, we are not losing any information. The frequencies in brackets report the base for the percentages, and the omitted category can be completed from the shown one (e.g. 100 % − 30 % = 70 % "Less often").

[Babbie 1997: 391]

Attend Church Weekly (%)

                Men      Women
Under 40        21       30
                (270)    (332)
40 and older    34       50
                (317)    (414)

Original table:

              Under 40           40 and older
              Men      Women     Men      Women
Weekly        21%      30%       34%      50%
Less often    79       70        66       50
100% =        (270)    (332)     (317)    (414)

Source: General Social Survey, NORC [Babbie 1997: 391]
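To see why the simplified table loses no information, the omitted row and the approximate cell counts can be rebuilt from the percentage and its base. This is a minimal Python sketch for illustration only; the names are hypothetical, and the counts are approximate because the published percentages are rounded.

```python
# (percent attending weekly, base n) per subgroup, as printed in the simplified table.
cells = {
    ("Under 40", "Men"): (21, 270),
    ("Under 40", "Women"): (30, 332),
    ("40 and older", "Men"): (34, 317),
    ("40 and older", "Women"): (50, 414),
}

def complete(pct_weekly, base):
    """Return (% less often, approx. n weekly, approx. n less often)."""
    pct_less_often = 100 - pct_weekly
    n_weekly = round(base * pct_weekly / 100)  # approximate: the percent is rounded
    return pct_less_often, n_weekly, base - n_weekly

full = {group: complete(pct, base) for group, (pct, base) in cells.items()}

for group, values in full.items():
    print(group, values)
```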

Page 7: Quantitative Data Analysis I

Threefold data sorting (2×2×2 table) → description/exploration

Do students living in a dormitory fail exams more often than those living elsewhere? Is this true for male as well as for female students? And who fails more often: male or female dormitory students?

Male        Dormitory   Elsewhere   Total
Failed      4%          19%         17%
Passed      96%         81%         83%
Total       100%        100%        100%

→ 15 percentage point difference

Female      Dormitory   Elsewhere   Total
Failed      30%         31%         30%
Passed      70%         69%         70%
Total       100%        100%        100%

→ only 1 percentage point difference

Female students living in a dormitory fail exams more often than male dormitory students (30 % vs. 4 %). However, their failure rate is about the same as that of female students living elsewhere, i.e. for women, living in a dormitory most probably has no effect on exam results. For men the effect is positive: male students living in a dormitory fail less often than men living elsewhere and are the most successful group of all.

Source: adapted from [Kapr, Šafář 1969: 152]

Page 8: Quantitative Data Analysis I

Introduction to elaboration

Threefold data sorting

Controlling for the factor

Page 9: Quantitative Data Analysis I

Testing / controlling for the effect of a 3rd variable (factor) → Elaboration

• Constructing separate tables, split by the categories of the third variable, holds the tested factor constant.

→ The relationship between the two variables is then net, i.e. cleaned of the distorting effect of the factor variable.

Page 10: Quantitative Data Analysis I

Threefold data sorting: controlling for the effect of the third variable: interpretation and arrangement of a (2×3×3) table

Is voting related to age, even when the effect of education is controlled?

                Elementary education       Secondary education        University education
Age (years)     < 39    40-59   > 60       < 39    40-59   > 60       < 39    40-59   > 60
Voted           18%     24%     32%        36%     34%     49%        40%     50%     70%
Did not vote    82      76      68         64      66      51         60      50      30
Total           100 %   100 %   100 %      100 %   100 %   100 %      100 %   100 %   100 %
N               (109)   (202)   (45)       (97)    (271)   (139)      (27)    (62)    (50)

We ask:

1. Are there differences in Y (voting) along X (age) within the categories of the controlling variable Z (education)? We compare this with the bivariate crosstabulation (Y by X).

2. Are the differences between the extreme categories of X (age) approximately the same within each category of the controlling variable Z (education)?

For ordinal independent variables we compare the percentage differences between the extreme categories separately within the categories of the controlling variable (the factor).

Differences between the extreme categories of age, in percentage points: Elementary 14, Secondary 13, University 30.

Whereas for elementary (ZŠ) and secondary (SŠ) education the difference between the youngest and the oldest is about the same, for university education (VŠ) it is about twice as large. → Thus education partly intervenes in the relationship between voting and age.
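The comparison of extreme age categories within each education level is simple arithmetic, sketched here in Python for illustration (the variable names are assumptions; the percentages are those in the table on this slide).

```python
# % voted by age group (youngest, middle, oldest) within each education level.
voted = {
    "Elementary": [18, 24, 32],
    "Secondary": [36, 34, 49],
    "University": [40, 50, 70],
}

# Oldest minus youngest, in percentage points, with education held constant.
extreme_diff = {edu: pcts[-1] - pcts[0] for edu, pcts in voted.items()}

print(extreme_diff)  # {'Elementary': 14, 'Secondary': 13, 'University': 30}
```

If the three differences were roughly equal, education would not change the age effect; the much larger difference among university graduates is what signals that education intervenes.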

Page 11: Quantitative Data Analysis I

Interaction and additive effect

Additive effect: the effects of both variables add together to produce the final result.

Interaction effect: the effect of one variable on another is contingent on the value of a third variable.

Example of an additive effect: a similar effect of age within the categories of education, only on a "different level". The percentage point difference between the age categories is the same within every category of education.

% voted     Education
            Elem.   Sec.    Univ.
younger     30      35      65
older       40      45      75

[Figure: line chart of % voted (axis 25 to 75) by education (Elem., Sec., Univ.) for younger and older; data labels 30, 35, 65 and 40, 45, 75.]

Example of an interaction effect: a different effect of age on voting within the categories of education. Among the younger there is no difference by education; among the older the percentage voting rises with higher education. Voting is highest among older university graduates.

[Figure: line chart of % voted (axis 25 to 50) by education (Elem., Sec., Univ.) for younger and older; data labels as printed: 31, 33, 29, 37, 51, 31.]

Note: adding the omitted % "Did not vote" completes each cell to 100 %.

[Treiman 2009: 26-28]
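A quick way to tell the two patterns apart is to compute the effect of age separately within each education category and check whether it stays constant. The Python sketch below does this for the additive example from this slide; the names and the exact-equality check are simplifying assumptions (with real survey data one would allow for sampling error).

```python
# % voted from the additive example on this slide.
voted = {
    "younger": {"Elem.": 30, "Sec.": 35, "Univ.": 65},
    "older": {"Elem.": 40, "Sec.": 45, "Univ.": 75},
}

# Effect of age (older minus younger) within each category of education.
age_effect = {
    edu: voted["older"][edu] - voted["younger"][edu]
    for edu in voted["younger"]
}

# Additive pattern: the same age effect at every education level.
# Differing effects across levels would instead point to interaction.
is_additive = len(set(age_effect.values())) == 1

print(age_effect)   # {'Elem.': 10, 'Sec.': 10, 'Univ.': 10}
print(is_additive)  # True
```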

Page 12: Quantitative Data Analysis I

Testing the effect of a further factor (beyond the bivariate relationship)

• We compare the strength of the relationship in the original bivariate table with the relationships in the new tables, split by the categories of the third variable (the controlling factor).

• If the association between the original variables disappears or substantially weakens in the new tables → the association in the original (bivariate) table is a function of the third variable (the controlling factor).

• Further on you will see how to detect a hidden relationship quickly using association coefficients within subgroups of the third, controlling factor (Phi, Cramér's V and Lambda for nominal variables, and ordinal correlation).

• Later, in QDA II., we will also learn how to standardize (weight) the table along the controlling factor Z, i.e. as if the cases in all categories of variable X had the same distribution across the categories of Z (e.g. the same education).

Page 13: Quantitative Data Analysis I

Why do we conduct elaboration?

The aim is the net relationship between two variables when the effect of a 3rd variable is controlled.

1. To detect and describe interaction (or additive) effects

and when doing this we can reveal

2. Spurious association (false association/correlation)

3. Suppressed (hidden) association

The following two examples explain it.

Coefficients of association (e.g. Lambda, used here) are explained later or in 3. Contingency tables and analysis of categorical data.

Page 14: Quantitative Data Analysis I

Example I.: Spurious association (false association/correlation)

1. Bivariate relationship

Seemingly strong association, but …

[Table: Preference for meal (Caviar, Hamburger) by Religiosity (High, Low), with totals; cell values are not preserved in this transcript.]

Source: [Disman 1993: 219-223]

Page 15: Quantitative Data Analysis I

2. After controlling for the effect of Education (threefold data sorting)

People with low education

No association for people with low education; 0 percentage point difference (also Lambda = 0).

[Table: Preference for meal (Caviar, Hamburger) by Religiosity (High, Low), with totals, for people with low education; cell values are not preserved in this transcript.]

Source: [Disman 1993: 219-223]

Page 16: Quantitative Data Analysis I

2. After controlling for the effect of Education (3rd level of data sorting)

People with high education

[Table: Preference for meal (Caviar, Hamburger) by Religiosity (High, Low), with totals, for people with high education; cell values are not preserved in this transcript.]

The association disappears when we control for the effect of education → education is the factor in the background that influences both religiosity and food preference.

Source: [Disman 1993: 219-223]

Page 17: Quantitative Data Analysis I

Example II.: Suppressed (hidden) association

1. Bivariate relationship

At first glance no association, but …

[Table: Would buy / Would not buy by Package A / Package B, with totals; cell values are not preserved in this transcript.]

Source: [Disman 1993: 219-223]

Page 18: Quantitative Data Analysis I

2. When gender is controlled for (threefold data sorting)

[Tables: Would buy / Would not buy by Package A / Package B, with totals, separately for men and for women; cell values are not preserved in this transcript.]

Controlling for the 3rd variable (the factor) revealed a suppressed association (false independence) between the two variables.

Reason for this bias → the relationship between the variables exists only in a part of the population (among women).

Source: [Disman 1993: 219-223]

Page 19: Quantitative Data Analysis I

When examining relationships in elaboration, coefficients of association/ordinal correlation can help us find interaction or suppressed effects.

Page 20: Quantitative Data Analysis I

Ordinal correlation for ordinal variables: bivariate "zero-order" table/correlation (4o×4o table)

Source: data [ISSP 2007, ČR]

CROSSTABS income4 BY edu4 /STATISTICS GAMMA BTAU.

When our data come from a random sample (i.e. not the whole population), we must in addition first test the statistical hypothesis that the coefficient is not zero in the whole population (and not only in our sample). Here the Approx. Significance (the p-value) is < 5 % → we reject the null hypothesis that Gamma/Tau-b is zero in the whole population. More on this in QDA II.
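For readers who want to see what GAMMA and BTAU measure, here is a minimal Python sketch using the textbook definitions based on concordant and discordant pairs. It is an illustration only, not SPSS's implementation; the function names and the toy 2×2 counts are assumptions, not ISSP data, and no significance test is included.

```python
from math import sqrt

def pair_counts(table):
    """Count (concordant, discordant) pairs in a crosstab of counts.

    Rows and columns must both be listed in ascending category order.
    """
    concordant = discordant = 0
    n_rows, n_cols = len(table), len(table[0])
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(i + 1, n_rows):       # cells in rows below row i
                for l in range(n_cols):
                    if l > j:                    # higher on both variables
                        concordant += table[i][j] * table[k][l]
                    elif l < j:                  # higher on one, lower on the other
                        discordant += table[i][j] * table[k][l]
    return concordant, discordant

def gamma(table):
    """Goodman and Kruskal's gamma: (C - D) / (C + D), ties ignored."""
    c, d = pair_counts(table)
    return (c - d) / (c + d)

def tau_b(table):
    """Kendall's tau-b: like gamma, but corrected for ties on either variable."""
    c, d = pair_counts(table)
    n = sum(map(sum, table))
    all_pairs = n * (n - 1) / 2
    tied_on_rows = sum(t * (t - 1) / 2 for t in map(sum, table))
    tied_on_cols = sum(t * (t - 1) / 2 for t in map(sum, zip(*table)))
    return (c - d) / sqrt((all_pairs - tied_on_rows) * (all_pairs - tied_on_cols))

toy = [[20, 10],
       [10, 20]]  # hypothetical counts for illustration
print(round(gamma(toy), 3), round(tau_b(toy), 3))
```

Because gamma ignores tied pairs while tau-b penalizes them, gamma is never smaller in absolute value than tau-b for the same table.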

Page 21: Quantitative Data Analysis I

Is the strength of the relationship (ordinal correlation) identical for men and women?

→ We can compute conditional association/correlation coefficients separately within the categories of the control variable, the factor (gender).

Here a 4o×4o×2 table.

Page 22: Quantitative Data Analysis I

Ordinal correlation for ordinal variables in the 3rd level of data sorting (separately for men and women) → gender [s30] is the controlling factor

First-order conditional table/correlation

CROSSTABS prijem4 BY vzd4 BY s30 /STATISTICS GAMMA BTAU.

Among women education has a slightly stronger effect, but on the whole women earn less than men regardless of education level (see also the graph with means of income).

Source: data [ISSP 2007, ČR]

In QDA II. we will further compute partial ordinal correlation (GAMMA).
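The idea of a conditional (first-order) coefficient can be sketched in Python: the same gamma is simply computed once per layer of the control variable. The layer tables below are hypothetical toy counts chosen for illustration, not the ISSP results, and the helper is the textbook concordant/discordant definition rather than SPSS's own code.

```python
def gamma(table):
    """Goodman and Kruskal's gamma for a crosstab with ordered rows and columns."""
    concordant = discordant = 0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            for lower_row in table[i + 1:]:          # rows below row i
                for l, other in enumerate(lower_row):
                    if l > j:
                        concordant += count * other
                    elif l < j:
                        discordant += count * other
    return (concordant - discordant) / (concordant + discordant)

# One crosstab per category of the controlling factor (hypothetical counts).
layers = {
    "men": [[20, 10], [10, 20]],
    "women": [[25, 5], [5, 25]],
}

conditional_gamma = {group: gamma(table) for group, table in layers.items()}

for group, value in conditional_gamma.items():
    print(group, round(value, 3))
```

Clearly different conditional coefficients across layers suggest interaction; similar ones suggest that the control variable does not change the strength of the relationship.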

Page 23: Quantitative Data Analysis I

Types of contingency tables with 3 variables and coefficients of association/correlation

Notation: the number gives the count of categories, o = ordinal, n = nominal variable.

Generally you can always use association (no direction, just the strength of mutual dependence) → coefficients of association.

• 2×2×2 (similarly 2×2×3n): all dichotomous → coefficients of association and also the special point-biserial or tetrachoric correlation

• 2×3o×3n or 2×3o×2: dependent variable dichotomous, independent ordinal, control nominal → ordinal correlation within groups of the control factor (without the possibility of considering a linear trend in the strength of association/correlation)

• 2×3n×3o: dependent variable dichotomous, independent nominal, control factor ordinal → only coefficients of association (but we can consider a linear trend in the strength of association across categories of the control factor)

• 3o×3o×3o (similarly 2×2×3o): all ordinal → ordinal correlation (we can consider a linear trend in the strength of correlation across categories of the control factor) + coefficients of partial correlation (i.e. the net correlation of X↔Y when the effect of Z is controlled; more on this in QDA II.)

The same applies to variables with more than 3 categories (e.g. 4o or 4n).

Page 24: Quantitative Data Analysis I

Coefficients of association in (bivariate and) multivariate analysis in SPSS within CROSSTABS

• Within CROSSTABS we can compute several measures of association and correlation for variables Y × X (bivariate) as well as separately within categories of the controlling factor Z → this can help us quickly assess interaction and reveal a "false" relationship.

• For nominal variables (Y, X, Z as controlling factor), coefficients of association (they range from 0 to 1 → no direction):

CROSSTABS var1 BY var2 BY var3-controlling /CELLS COL /STATISTICS CC PHI.

Coefficients of association: CC = Contingency coefficient, PHI = Cramér's V (for dichotomous variables its equivalent is Phi); there are also other coefficients of association and correlation (e.g. Lambda).

• For ordinal variables (Y, X) and a nominal/ordinal controlling factor (Z), in addition to the association coefficients also ordinal correlation (it ranges from -1 through 0 to 1 → it shows direction):

CROSSTABS var1 BY var2 /CELLS COL /STATISTICS CC PHI GAMMA CORR BTAU.

Correlation coefficients: GAMMA = Goodman & Kruskal's Gamma, BTAU = Kendall's Tau-b, CORR = Spearman's Rho (+ Pearson's correlation coefficient R for ratio variables)

• Notice: if we do not find a correlation, it does not mean that there is no (strong) relationship (association).

Moreover, with ordinal variables, comparing correlation coefficients with coefficients of association can help us identify the form of the relationship (nonlinearity).

• Notice: in the case of means in subgroups (MEANS) we can compute the coefficient Eta² (for a ratio × nominal variable):

MEANS var1-dependent-numeric BY var2-independent-categ. BY var3-controlling-categorical /CELLS MEAN STDDEV COUNT /STATISTICS ANOVA.

More on coefficients of association and correlation can be found in 2. Korelace a asociace: vztahy mezi kardinálními/ ordinálními znaky (in Czech only) at http://metodykv.wz.cz/AKD2_korelace.ppt
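As a rough illustration of what the CC and PHI keywords report, the chi-square based coefficients can be computed by hand. The Python sketch below follows the standard textbook formulas; the function names and the toy counts are assumptions for this example, and it is not a reproduction of SPSS output.

```python
from math import sqrt

def chi_square(table):
    """Pearson chi-square statistic and total n for a crosstab of counts."""
    n = sum(map(sum, table))
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n

def contingency_coefficient(table):
    """Pearson's contingency coefficient: sqrt(chi2 / (chi2 + n))."""
    chi2, n = chi_square(table)
    return sqrt(chi2 / (chi2 + n))

def cramers_v(table):
    """Cramér's V: sqrt(chi2 / (n * (min(rows, cols) - 1)))."""
    chi2, n = chi_square(table)
    k = min(len(table), len(table[0])) - 1
    return sqrt(chi2 / (n * k))

toy = [[20, 10],
       [10, 20]]  # hypothetical counts for illustration
print(round(contingency_coefficient(toy), 3), round(cramers_v(toy), 3))
```

Applying the same functions to each layer of the controlling factor separately mirrors what adding `BY var3` does in CROSSTABS.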

Page 25: Quantitative Data Analysis I

Notice: first check the counts (absolute frequencies) when sorting data at a higher level (especially, but not only, in crosstabulation)

• When doing the 3rd level of data sorting, always check the counts in the individual cells of the table with caution, notably in small samples.

CROSSTABS var1 BY var2 BY var3 /CELLS COL COUNT.

• If the frequencies are too small, interpreting the table makes no sense from either the statistical or the substantive point of view.

→ You can collapse (recode) categories with sparse cell entries.
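Checking the bases can be automated. The Python sketch below flags subgroups whose percentage base falls under a threshold, using the N row from the voting table earlier in this deck; the threshold of 30 is a hypothetical rule of thumb chosen only for this example, not a value given in the slides.

```python
# Percentage bases (N) per education × age subgroup, from the voting table.
bases = {
    ("Elementary", "< 39"): 109, ("Elementary", "40-59"): 202, ("Elementary", "> 60"): 45,
    ("Secondary", "< 39"): 97, ("Secondary", "40-59"): 271, ("Secondary", "> 60"): 139,
    ("University", "< 39"): 27, ("University", "40-59"): 62, ("University", "> 60"): 50,
}

MIN_BASE = 30  # hypothetical threshold, an assumption for illustration

# Subgroups whose base is too small to interpret percentages with confidence.
sparse = [
    (edu, age, n) for (edu, age), n in bases.items() if n < MIN_BASE
]

print(sparse)  # [('University', '< 39', 27)]
```

A flagged subgroup is a candidate for collapsing with a neighbouring category before the table is interpreted.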

Page 26: Quantitative Data Analysis I

More examples will be added later …