Chi-Square test: Presentation 12

Upload: lisa-farmer

Post on 24-Dec-2015


TRANSCRIPT

Page 1: Chi-Square test Presentation 12. What does it mean for two categorical variables to be related? Remember that Chi-Square is used to test for a relationship

Chi-Square test

Presentation 12

Page 2: What does it mean for two categorical variables to be related?

Remember that Chi-Square is used to test for a relationship between 2 categorical variables.

H0: There is no relationship between the variables.

Ha: There is a relationship between the variables.

If two categorical variables are related, it means the chance that an individual falls into a particular category for one variable depends upon the particular category they fall into for the other variable.

Let's say that we wanted to determine if there is a relationship between religion (Christian, Jew, Muslim, Other) and smoking. When we test if there is a relationship between these two variables, we are trying to determine if the chance that an individual is a smoker differs from one religion to another. If that is the case, then we can say that Religion and Smoking are related or associated.

Page 3: Chi-Square test for 2-way tables

Suppose we are studying two categorical variables in a population, where the first variable has r levels (i.e. possible outcomes) and the second one has c levels.

We can summarize a sample from this population using a table with r rows and c columns.

A two-way table, also called a contingency table, displays the counts of how many individuals fall into each possible combination of categories of two categorical variables. So, each cell of the table (total number of cells is r x c) represents a combination of categories of the two variables.

The following table presents the data on race and smoking. The two variables of interest, race and smoking, have r = 4 and c = 2, resulting in 4x2=8 combinations of categories.

Race        NSmoke   Smoke
Caucasian   620      75
Black       240      41
Hispanic    130      29
Other       190      38

Page 4: Chi-Square test for 2-way tables

By considering the number of observations falling into each category, we will see how to test hypotheses of the form:

H0: The two variables are not associated.

Ha: The two variables are associated.

Two different experimental situations will lead to contingency tables:

1. We have two populations under study, and every individual in both populations is classified by the same categorical variable. In this case the null hypothesis is a statement of homogeneity among the two populations.

2. We have one population under study, and we are interested in checking the relationship between two categorical variables. In this case the null hypothesis is a statement of independence between the two variables.

For sufficiently large samples, the same test is appropriate for both of these situations. This test is called the chi-square test, and in the following we will go over the steps for testing the relationship between two variables.

Page 5: Some Notation!

For i taking values from 1 to r (number of rows) and j taking values from 1 to c (number of columns), denote:

R_i = total count of observations in the i-th row.

C_j = total count of observations in the j-th column.

O_ij = observed count for the cell in the i-th row and the j-th column.

E_ij = expected count for the cell in the i-th row and the j-th column if the two variables were independent, i.e. if H0 was true. These counts are calculated as

$$E_{ij} = \frac{R_i \times C_j}{n},$$

thus,

$$\text{Expected} = \frac{\text{Row total} \times \text{Column total}}{\text{Total sample size } n}.$$
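As a rough illustration of the expected-count formula, the following Python sketch computes E_ij = R_i x C_j / n for the race and smoking counts from the earlier slide. The function name `expected_counts` and the list-of-lists layout are assumptions made for this example; they are not part of the presentation.

```python
def expected_counts(observed):
    """Return expected counts under independence: E_ij = R_i * C_j / n."""
    row_totals = [sum(row) for row in observed]        # R_i
    col_totals = [sum(col) for col in zip(*observed)]  # C_j
    n = sum(row_totals)                                # total sample size
    return [[r * c / n for c in col_totals] for r in row_totals]


# Race (rows) by smoking status (columns: NSmoke, Smoke), from the slides.
race_smoking = [
    [620, 75],   # Caucasian
    [240, 41],   # Black
    [130, 29],   # Hispanic
    [190, 38],   # Other
]

expected = expected_counts(race_smoking)
for row in expected:
    print([round(e, 2) for e in row])
```

Each row of expected counts adds up to its row total and each column to its column total, which gives a quick check on the arithmetic.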

Page 6: Example

Race        NSmoke         Smoke        Total
Caucasian   O_11 = 620     O_12 = 75    R_1 = 695
Black       O_21 = 240     O_22 = 41    R_2 = 281
Hispanic    O_31 = 130     O_32 = 29    R_3 = 159
Other       O_41 = 190     O_42 = 38    R_4 = 228
Total       C_1 = 1180     C_2 = 183    n = 1363

E_11 = (695x1180)/1363     E_12 = (695x183)/1363

E_21 = (281x1180)/1363     E_22 = (281x183)/1363

E_31 = (159x1180)/1363     E_32 = (159x183)/1363

E_41 = (228x1180)/1363     E_42 = (228x183)/1363

Page 7: Chi-Square Analysis Details

The 5 Steps in a Chi-Square Test:

Step 1: Write the null and alternative hypothesis.

H0: There is no relationship between the variables.

Ha: There is a relationship between the variables.

Step 2: Check conditions.

A) All expected counts should be > 1.

B) At least 80% of expected counts should be > 5.

Step 3: Calculate Test Statistic and p-value.

The test statistic measures the difference between the observed counts and the expected counts assuming independence:

$$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} = \sum_{\text{all cells}} \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$

This is called the chi-square statistic because if the null hypothesis is true, then it has a chi-square distribution with (r-1)x(c-1) degrees of freedom.
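Steps 2 and 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `chi_square_statistic` and `conditions_met` are invented for this example, and the data are the race and smoking counts from the earlier slide.

```python
def chi_square_statistic(observed):
    """Return (statistic, df, expected) for an r x c table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    # Sum of (Observed - Expected)^2 / Expected over all cells.
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, expected


def conditions_met(expected):
    """Step 2: all expected counts > 1 and at least 80% of them > 5."""
    cells = [e for row in expected for e in row]
    all_above_one = all(e > 1 for e in cells)
    share_above_five = sum(e > 5 for e in cells) / len(cells)
    return all_above_one and share_above_five >= 0.8


race_smoking = [[620, 75], [240, 41], [130, 29], [190, 38]]
statistic, df, expected = chi_square_statistic(race_smoking)
print(conditions_met(expected), round(statistic, 3), df)
```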

Page 8: Chi-Square Analysis Details

Step 3 Cont. Find the p-value.

If the χ2-statistic is large, it implies that the observed counts are not close to the counts we would expect to see if the two variables were independent. Thus, a "large" χ2 gives evidence against the null hypothesis, and supports the alternative.

The p-value of the chi-square test is the probability that the χ2-statistic is as large or larger than the value we obtained if H0 is true. Also, if H0 is true, the χ2-statistic has a chi-square distribution with (r-1)x(c-1) df.

Thus, the p-value for the Chi-Square test is ALWAYS the area to the right of the test statistic under the curve, i.e. p-value = P(X > χ2), where X has a chi-square distribution with (r-1)x(c-1) df.

To get this probability we need to use a chi-square distribution with (r-1)x(c-1) df (Table A.4). Using Minitab, or any other statistical software, you can obtain the p-value from the output. Otherwise, you can report a range for the p-value using Table A.4 (since usually you will not be able to find the exact p-value on the table).
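When no statistical package is at hand, the right-tail area can also be computed directly. The sketch below uses only Python's standard `math` module; the function name `chi_square_p_value` is an assumption for this example. It relies on the standard recurrence for the regularized upper incomplete gamma function, starting from the closed forms for 1 and 2 degrees of freedom.

```python
import math


def chi_square_p_value(statistic, df):
    """Right-tail area P(X > statistic) for X ~ chi-square with df degrees of freedom.

    Uses Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1)
    with y = statistic / 2 and a stepping up to df / 2.
    """
    if df < 1:
        raise ValueError("df must be a positive integer")
    if statistic <= 0:
        return 1.0
    y = statistic / 2.0
    if df % 2 == 0:
        a, tail = 1.0, math.exp(-y)              # Q(1, y), i.e. df = 2
    else:
        a, tail = 0.5, math.erfc(math.sqrt(y))   # Q(1/2, y), i.e. df = 1
    while a < df / 2.0:
        tail += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return tail


# The familiar 5% critical values for 1, 2 and 3 df should give about 0.05.
for crit, dof in [(3.841, 1), (5.991, 2), (7.815, 3)]:
    print(dof, round(chi_square_p_value(crit, dof), 4))
```

Statistical software reports the same quantity; this sketch is only meant to show that the p-value is nothing more than a right-tail area.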

Page 9: Chi-Square Analysis Details

Step 4: Decide whether or not the result is statistically significant.

The results are statistically significant if the p-value is less than alpha, where alpha is the significance level (usually α = 0.05).

Step 5: Report the conclusion in the context of the situation.

The p-value is ______, which is < α, so this result is statistically significant. Reject H0. Conclude that (the two variables) are related.

The p-value is ______, which is > α, so this result is NOT statistically significant. We cannot reject H0. Cannot conclude that (the two variables) are related.

Page 10: Detailed Example

Derek wants to know if the geographical area that a student grew up in is associated with whether or not the student drinks alcohol. Below are the results he obtained from a random sample of PSU students:

             No    Yes   Total
Big City     21    65    86
Rural        11    130   141
Small Town   18    198   216
Suburban     37    345   382
Total        87    738   825

Page 11: Detailed Example

1. H0: There is no relationship between the geographical area that a student grew up in and whether or not the student drinks alcohol.

Ha: There is a relationship between the geographical area that a student grew up in and whether or not the student drinks alcohol.

2. To check the conditions we need to calculate the expected counts for each cell.

E_11 = (R_1 x C_1)/n = (86x87)/825 = 9.07,

E_12 = (R_1 x C_2)/n = (86x738)/825 = 76.93, …

E_32 = (R_3 x C_2)/n = ___________________, …

Page 12: Detailed Example

Here is the Minitab output with the Observed and Expected counts for each cell. We can see that the conditions are satisfied!

            No      Yes      All
Big_City    21      65       86
            9.07    76.93    86.00
Rural       11      130      141
            14.87   126.13   141.00
SmallTow    18      198      216
            22.78   193.22   216.00
Suburban    37      345      382
            40.28   341.72   382.00
All         87      738      825
            87.00   738.00   825.00

Page 13: Detailed Example

3. Chi-Square statistic and P-value:

χ2 = sum {(Observed - Expected)^2/Expected}
   = (21-9.07)^2/9.07 + (65-76.93)^2/76.93
   + (11-14.87)^2/14.87 + (130-126.13)^2/126.13
   + (18-22.78)^2/22.78 + (198-193.22)^2/193.22
   + (37-40.28)^2/40.28 + (345-341.72)^2/341.72
   = 20.091

df = (4-1)x(2-1) = 3

p-value = P(X > 20.091) < P(X > 16.27) = 0.001 (Table A.4)

4. Since the p-value < 0.05, the test is significant, and we can reject the null.

5. We can conclude that there is a relationship between the geographical area that a student grew up in and whether or not the student drinks alcohol.
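The table-lookup reasoning in step 3 can be mirrored in code. The sketch below recomputes the statistic from the observed counts and then reports the smallest tabulated tail area whose critical value the statistic exceeds. The dictionary `CRITICAL_VALUES_DF3` is an assumption for this example: it holds standard chi-square critical values for 3 df rounded to two decimals, not a copy of Table A.4.

```python
# Upper-tail critical values of the chi-square distribution with 3 df,
# as found in standard chi-square tables (rounded to two decimals).
CRITICAL_VALUES_DF3 = {0.05: 7.81, 0.01: 11.34, 0.001: 16.27}

# Rows: Big City, Rural, Small Town, Suburban. Columns: No, Yes.
observed = [[21, 65], [11, 130], [18, 198], [37, 345]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

statistic = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n   # expected count E_ij
        statistic += (o - e) ** 2 / e

df = (len(row_totals) - 1) * (len(col_totals) - 1)

# Smallest tabulated tail area whose critical value the statistic exceeds;
# the p-value is then below that area.
exceeded = [alpha for alpha, crit in CRITICAL_VALUES_DF3.items() if statistic > crit]
p_value_bound = min(exceeded) if exceeded else None

print(round(statistic, 3), df, p_value_bound)
```

Because the statistic exceeds even the 0.001 critical value, the lookup only yields a range (p-value < 0.001), which is all the conclusion in steps 4 and 5 needs.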

Page 14: Special Case - Analyzing 2x2 tables

In a lot of cases the categorical variables of interest have two levels each. In this case, we can summarize the data using a contingency table having two rows and two columns (i.e. r=c=2). The general form of a 2x2 table is

         Column 1   Column 2   Total
Row 1    A          B          R_1
Row 2    C          D          R_2
Total    C_1        C_2        n

In this case, the chi-square statistic has the following simplified form,

$$\chi^2 = \frac{n\,(AD - BC)^2}{R_1 R_2 C_1 C_2}.$$

Under the null hypothesis, the χ2-statistic has a chi-square distribution with (2-1)x(2-1)=1 degrees of freedom.

Page 15: Example for 2x2 table

Is there a relationship between gender and smoking habits?

Gender   NSmoke   Smoke   Total
Male     540      52      592
Female   325      31      356
Total    865      83      948

Minitab Output

        C1       C2      Total
1       540      52      592
        540.17   51.83
2       325      31      356
        324.83   31.17
Total   865      83      948

Chi-Sq = 0.000 + 0.001 + 0.000 + 0.001 = 0.002
DF = 1, P-Value = 0.968

Minitab uses the general formula of the χ2 test statistic. The simplified formula gives

$$\chi^2 = \frac{948\,(540 \times 31 - 52 \times 325)^2}{592 \times 356 \times 865 \times 83} = 0.0016;$$

the difference from the chi-square statistic in the output is because Minitab rounded to 3 decimal places.
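To see that the simplified 2x2 formula and the general formula agree before any rounding, a small Python comparison may help. The function names `chi_square_2x2` and `chi_square_general` are assumptions made for this sketch.

```python
def chi_square_2x2(a, b, c, d):
    """Shortcut statistic n(AD - BC)^2 / (R1 R2 C1 C2) for the table [[a, b], [c, d]]."""
    r1, r2 = a + b, c + d   # row totals
    c1, c2 = a + c, b + d   # column totals
    n = r1 + r2
    return n * (a * d - b * c) ** 2 / (r1 * r2 * c1 * c2)


def chi_square_general(observed):
    """General formula: sum over all cells of (O - E)^2 / E."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n
            total += (o - e) ** 2 / e
    return total


gender_smoking = [[540, 52], [325, 31]]   # rows: Male, Female; columns: NSmoke, Smoke
shortcut = chi_square_2x2(540, 52, 325, 31)
general = chi_square_general(gender_smoking)
print(round(shortcut, 4), round(general, 4))
```

Both routes give the same value at full precision, so the 0.002 in the Minitab output reflects rounding of the four cell contributions rather than a different formula.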

Page 16: Relationship Between Chi-Square and 2 Proportions Tests

When do we use Chi-Square and when do we use 2 proportions?

Situation 1: Both categorical variables of interest have exactly 2 levels.

Question - Is there a relationship between the variables, or is there a difference in the proportions?

Answer - Either Chi-Square or a Two Sided Test of 2-proportions will lead to the same conclusion!

In this case, the χ2-statistic = (z-statistic)^2, and the p-values of the two tests are equal, i.e.

P(X_(1 df) > χ2-stat) = 2 P(Z > |z-stat|).

Situation 2: Both categorical variables of interest have exactly 2 levels.

Question - Is one proportion greater/smaller than the other?

Answer - This is a one-sided test and you MUST use a test of 2 proportions.

Situation 3: At least one of the two categorical variables of interest has MORE than 2 levels.

Question - Is there a relationship between the variables?

Answer - MUST use a Chi-Square Test.

Page 17: Examples of Chi-Square and 2-Proportions

Gender   NSmoke   Smoke
Male     540      52
Female   325      31

Q1: Is there a difference in the proportion of males and females that smoke?

Solution: Either a Chi-Square or Test of 2 proportions is fine.

2-proportions

H0: pm - pf = 0

Ha: pm - pf ≠ 0

Chi-Square

H0: There is no relationship between Gender and Smoking.

Ha: There is a relationship between Gender and Smoking.

Q2: Is the proportion of males who smoke greater than the proportion of females who smoke?

Solution: Test of 2 proportions, because the alternative is one sided!

2-proportions H0: pm - pf = 0 vs Ha: pm - pf > 0
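The link between the two tests can be checked numerically on the gender and smoking counts. The sketch below assumes the usual pooled standard error for the two-proportion z statistic and a chi-square statistic without continuity correction; the function names are invented for this example.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z statistic for H0: p1 - p2 = 0."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se


def upper_tail_normal(z):
    """P(Z > z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


z = two_proportion_z(52, 592, 31, 356)   # male smokers, then female smokers
chi_square = z ** 2                      # matches the 2x2 chi-square statistic

two_sided_p = 2 * upper_tail_normal(abs(z))   # Q1: Ha: pm - pf != 0
one_sided_p = upper_tail_normal(z)            # Q2: Ha: pm - pf > 0

print(round(z, 3), round(chi_square, 4), round(two_sided_p, 3), round(one_sided_p, 3))
```

The two-sided p-value agrees with the chi-square p-value reported by Minitab for these data, while the one-sided p-value for Q2 has no chi-square counterpart, which is why Q2 calls for the test of 2 proportions.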

Page 18: Examples of Chi-Square and 2-Proportions

Race        NSmoke   Smoke
Caucasian   620      75
Black       240      41
Hispanic    130      29
Other       190      38

Q: Is there a relationship between Race and Smoking? Is there a difference in the proportion of smokers among the different races?

Solution: Chi-Square because Race has more than 2 levels!

Chi-Square Test

H0: There is no relationship between Race and Smoking.

Ha: There is a relationship between Race and Smoking.