Week 10 Nov 3-7

Two Mini-Lectures QMM 510 Fall 2014

Chi-Square Tests ML 10.1

Chapter Contents

15.1 Chi-Square Test for Independence

15.2 Chi-Square Tests for Goodness-of-Fit

15.3 Uniform Goodness-of-Fit Test

15.4 Poisson Goodness-of-Fit Test

15.5 Normal Chi-Square Goodness-of-Fit Test

15.6 ECDF Tests (Optional)

Chapter 15

So many topics, so little time …

Chi-Square Test for Independence

• A contingency table is a cross-tabulation of n paired observations into categories.

• Each cell shows the count of observations that fall into the category defined by its row (r) and column (c) heading.

Contingency Tables

Chi-Square Test

• In a test of independence for an r × c contingency table, the hypotheses are:
  H0: Variable A is independent of variable B
  H1: Variable A is not independent of variable B

• Use the chi-square test for independence to test these hypotheses.

• This nonparametric test is based on frequencies.

• The n data pairs are classified into c columns and r rows and then the observed frequency fjk is compared with the expected frequency ejk.

• The critical value comes from the chi-square probability distribution with d.f. degrees of freedom.

d.f. = degrees of freedom = (r – 1)(c – 1)

where r = number of rows in the table and c = number of columns in the table

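• Appendix E contains critical values for right-tail areas of the chi-square distribution, or use Excel’s =CHISQ.INV.RT(α,d.f.). (Excel’s =CHISQ.DIST.RT(x,d.f.) goes the other way: it returns the right-tail area, i.e., the p-value, for a test statistic x.)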

• The mean of a chi-square distribution is d.f. with variance 2d.f.

Chi-Square Distribution

• Consider the shape of the chi-square distribution: it starts at zero and is skewed right, becoming more symmetric as d.f. increases.

• Assuming that H0 is true, the expected frequency of row j and column k is:

ejk = RjCk/n

where Rj = total for row j (j = 1, 2, …, r)

Ck = total for column k (k = 1, 2, …, c)

n = sample size
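As a minimal sketch of the formula ejk = RjCk/n, the following Python function computes the expected frequency for every cell of a table. The function name `expected_frequencies` and the 2 × 3 table of counts are hypothetical, made up purely for illustration.

```python
def expected_frequencies(observed):
    """Return the table of expected counts e_jk = R_j * C_k / n."""
    row_totals = [sum(row) for row in observed]          # R_j
    col_totals = [sum(col) for col in zip(*observed)]    # C_k
    n = sum(row_totals)                                  # sample size
    return [[r * c / n for c in col_totals] for r in row_totals]


# Hypothetical 2 x 3 table of observed counts (illustration only).
observed = [
    [20, 30, 50],
    [30, 30, 40],
]
expected = expected_frequencies(observed)
for row in expected:
    print([round(e, 2) for e in row])
```

Because both row totals are equal in this made-up table, each row gets the same expected counts; only the column totals differ.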

Expected Frequencies

• Step 1: State the Hypotheses
  • H0: Variable A is independent of variable B
  • H1: Variable A is not independent of variable B

• Step 2: Specify the Decision Rule
  • Calculate d.f. = (r – 1)(c – 1)
  • For a given α, look up the right-tail critical value $\chi^2_R$ from Appendix E or by using Excel =CHISQ.INV.RT(α,d.f.).
  • Reject H0 if the test statistic exceeds $\chi^2_R$.

Steps in Testing the Hypotheses

• For example, for d.f. = 6 and α = .05, $\chi^2_{.05}$ = 12.59.

• The rejection region is the right tail of the chi-square distribution beyond the critical value.
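The critical value for d.f. = 6 and α = .05 can be checked without a table. The sketch below is an illustration under a stated restriction: it relies on a closed form for the right-tail area that holds only when d.f. is even, and finds the critical value by bisection. The function names are hypothetical; in practice Appendix E, Excel, or a statistics package would be used.

```python
import math


def chi2_right_tail_even_df(x, df):
    """Right-tail area P(X > x) for a chi-square variable with EVEN d.f.

    For d.f. = 2k the area has the closed form
    exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch only handles positive even d.f.")
    half = x / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def chi2_critical_even_df(alpha, df):
    """Find the value whose right-tail area equals alpha, by bisection."""
    lo, hi = 0.0, 1000.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_right_tail_even_df(mid, df) > alpha:
            lo = mid   # tail area still too large: move right
        else:
            hi = mid
    return (lo + hi) / 2.0


critical = chi2_critical_even_df(0.05, 6)
print(round(critical, 2))
```

Any calculated test statistic larger than this value falls in the rejection region.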

• Step 3: Calculate the Expected Frequencies
  • ejk = RjCk/n

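• Step 4: Calculate the Test Statistic
  • The chi-square test statistic is

$$\chi^2_{\text{calc}} = \sum_{j=1}^{r} \sum_{k=1}^{c} \frac{(f_{jk} - e_{jk})^2}{e_{jk}}$$

• Step 5: Make the Decision
  • Reject H0 if the test statistic $\chi^2_{\text{calc}}$ > $\chi^2_R$ or if the p-value ≤ α.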
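Steps 4 and 5 can be sketched in a few lines of Python. The statistic sums (fjk − ejk)²/ejk over all cells. The table below is hypothetical, and it was chosen to be 2 × 3 so that d.f. = 2, a special case in which the right-tail area is simply exp(−x/2); that shortcut is an assumption of this sketch and does not apply to other d.f.

```python
import math


def chi_square_statistic(observed):
    """Sum (f_jk - e_jk)^2 / e_jk over all cells, with e_jk = R_j * C_k / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for j, row in enumerate(observed):
        for k, f_jk in enumerate(row):
            e_jk = row_totals[j] * col_totals[k] / n
            stat += (f_jk - e_jk) ** 2 / e_jk
    return stat


# Hypothetical 2 x 3 table of observed counts (illustration only).
observed = [[20, 30, 50], [30, 30, 40]]
alpha = 0.05

chi2_calc = chi_square_statistic(observed)
df = (len(observed) - 1) * (len(observed[0]) - 1)   # (r - 1)(c - 1)

# Special case: with exactly 2 d.f. the right-tail area is exp(-x / 2),
# so the critical value and the p-value both have closed forms.
critical = -2.0 * math.log(alpha)
p_value = math.exp(-chi2_calc / 2.0)
reject = chi2_calc > critical

print(round(chi2_calc, 3), round(critical, 3), round(p_value, 4), reject)
```

The two decision rules always agree: the statistic exceeds the critical value exactly when the p-value is below α.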


Example: MegaStat

p-value = 0.2154 is not small enough to reject the hypothesis of independence at α = .05

All cells have ejk ≥ 5, so Cochran’s Rule is met

Caution: Don’t highlight row or column totals

• For a 2 × 2 contingency table, the chi-square test is equivalent to a two-tailed z test for two proportions.

• The hypotheses are:
  H0: π1 = π2
  H1: π1 ≠ π2
  where π1 and π2 are the two population proportions.

Test of Two Proportions

Figure 14.6
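The equivalence between the 2 × 2 chi-square test and the two-tailed z test can be checked numerically: the squared pooled z statistic equals the chi-square statistic (with no continuity correction). The sketch below uses hypothetical counts (30 successes out of 100 versus 45 out of 120), and the function names are illustrative.

```python
import math


def z_two_proportions(x1, n1, x2, n2):
    """Pooled z statistic for H0: the two population proportions are equal."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se


def chi_square_2x2(x1, n1, x2, n2):
    """Chi-square statistic for the 2 x 2 table of successes and failures."""
    observed = [[x1, n1 - x1], [x2, n2 - x2]]
    row_totals = [n1, n2]
    col_totals = [x1 + x2, (n1 - x1) + (n2 - x2)]
    n = n1 + n2
    stat = 0.0
    for j in range(2):
        for k in range(2):
            e_jk = row_totals[j] * col_totals[k] / n
            stat += (observed[j][k] - e_jk) ** 2 / e_jk
    return stat


# Hypothetical sample counts (illustration only).
z = z_two_proportions(30, 100, 45, 120)
chi2 = chi_square_2x2(30, 100, 45, 120)
print(round(z ** 2, 4), round(chi2, 4))
```

Both printed values should match, which is why the two tests give the same p-value for a 2 × 2 table.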

Small Expected Frequencies

• The chi-square test is unreliable if the expected frequencies are too small.

• Rules of thumb:
  • Cochran’s Rule requires that ejk > 5 for all cells.
  • Up to 20% of the cells may have ejk < 5.

• Most agree that a chi-square test is infeasible if ejk < 1 in any cell.
  • If this happens, try combining adjacent rows or columns to enlarge the expected frequencies.
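These rules of thumb are easy to check mechanically once the expected frequencies are known. The following sketch assumes a hypothetical 3 × 2 table of expected counts; the function name and the returned keys are illustrative, and Cochran’s Rule is treated here as "at least 5" (use a strict `>` for the stricter reading).

```python
def check_expected_frequencies(expected):
    """Summarize how a table of expected counts fares against the rules of thumb."""
    cells = [e for row in expected for e in row]
    below_5 = sum(1 for e in cells if e < 5)
    return {
        "min_expected": min(cells),
        "share_below_5": below_5 / len(cells),       # compare with the 20% guideline
        "cochran_ok": all(e >= 5 for e in cells),    # every cell at least 5
        "any_below_1": any(e < 1 for e in cells),    # test considered infeasible
    }


# Hypothetical expected frequencies (illustration only).
expected = [
    [12.5, 7.5],
    [4.0, 2.4],
    [8.5, 5.1],
]
summary = check_expected_frequencies(expected)
print(summary)
```

Here two of six cells fall below 5, so combining the second row with an adjacent row would be one way to enlarge the expected frequencies.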

• Chi-square tests for independence can also be used to analyze quantitative variables by coding them into categories.

Cross-Tabulating Raw Data

• For example, the variables Infant Deaths per 1,000 and Doctors per 100,000 can each be coded into various categories:
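A minimal sketch of this coding step follows. The data pairs, the cut points (8.0 deaths per 1,000; 175 and 275 doctors per 100,000), and the category labels are all hypothetical, chosen only to show how paired numerical observations become a contingency table of counts.

```python
from collections import Counter


def code_value(value, cut_points, labels):
    """Return the label of the first category whose upper cut point exceeds value.

    labels must have one more entry than cut_points; the last label is open-ended.
    """
    for cut, label in zip(cut_points, labels):
        if value < cut:
            return label
    return labels[-1]


# Hypothetical (infant deaths per 1,000, doctors per 100,000) pairs.
pairs = [
    (5.2, 310), (9.8, 150), (6.1, 280), (11.4, 120),
    (7.3, 205), (4.9, 330), (8.8, 190), (10.2, 140),
]

death_labels, death_cuts = ["low", "high"], [8.0]
doctor_labels, doctor_cuts = ["low", "medium", "high"], [175, 275]

counts = Counter(
    (code_value(d, death_cuts, death_labels), code_value(m, doctor_cuts, doctor_labels))
    for d, m in pairs
)

# Rows: infant deaths (low, high); columns: doctors (low, medium, high).
table = [[counts[(r, c)] for c in doctor_labels] for r in death_labels]
for label, row in zip(death_labels, table):
    print(label, row)
```

With only eight made-up pairs the expected frequencies would be far too small for a real test; the point is only the mechanics of cross-tabulating raw data.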

Why Do a Chi-Square Test on Numerical Data?

• The researcher may believe there’s a relationship between X and Y, but doesn’t want to use regression.

• There are outliers or anomalies that prevent us from assuming that the data came from a normal population.

• The researcher has numerical data for one variable but not the other.

• More than two variables can be compared using contingency tables.

• However, it is difficult to visualize a higher-order table.
  • For example, you could visualize a cube as a stack of tiled 2-way contingency tables.
  • Major computer packages permit three-way tables.

3-Way Tables and Higher

Chi-Square Tests for Goodness-of-Fit ML 10.2

Purpose of the Test

• The goodness-of-fit (GOF) test helps you decide whether your sample resembles a particular kind of population.

• The chi-square test is versatile and easy to understand.

Hypotheses for GOF tests:

• The hypotheses are:
  H0: The population follows a _____ distribution
  H1: The population does not follow a _____ distribution

• The blank may contain the name of any theoretical distribution (e.g., uniform, Poisson, normal).

Test Statistic and Degrees of Freedom for GOF

• Assuming n observations, the observations are grouped into c classes and then the chi-square test statistic is found using:

$$\chi^2_{\text{calc}} = \sum_{j=1}^{c} \frac{(f_j - e_j)^2}{e_j}$$

where fj = the observed frequency of observations in class j

ej = the expected frequency in class j if the sample came from the hypothesized population
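As an illustration, the GOF statistic sums (fj − ej)²/ej over the c classes. The sketch below applies it to a hypothetical uniform case, where every class has the same expected frequency n/c; the observed counts and the function name are made up for the example.

```python
def gof_statistic(observed, expected):
    """Sum (f_j - e_j)^2 / e_j over the c classes."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same number of classes")
    return sum((f - e) ** 2 / e for f, e in zip(observed, expected))


# Hypothetical observed counts in c = 5 classes (illustration only).
observed = [18, 22, 25, 15, 20]
n = sum(observed)
c = len(observed)

# Under a uniform hypothesis each class is expected to hold n / c observations.
expected = [n / c] * c

chi2_calc = gof_statistic(observed, expected)
print(round(chi2_calc, 2))
```

A value near zero indicates that the observed counts are close to what the hypothesized distribution predicts.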

• If the proposed distribution gives a good fit to the sample, the test statistic will be near zero.

• The test statistic follows the chi-square distribution with degrees of freedom

d.f. = c – m – 1.

• where c is the number of classes used in the test and m is the number of parameters estimated.

Test Statistic and Degrees of Freedom for GOF tests

• Many statistical tests assume a normal population, so this is the most common GOF test.

• Two parameters, the mean μ and the standard deviation σ, fully describe a normal distribution.

• Unless μ and σ are known a priori, they must be estimated from a sample in order to perform a GOF test for normality.

Is the Sample from a Normal Population?

Normal Chi-Square GOF Test

Method 1: Standardize the Data

• Transform sample observations x1, x2, …, xn into standardized z-values.

• Count the sample observations within each interval on the z-scale and compare them with expected normal frequencies ej.

Problem: Frequencies will be small in the end bins yet large in the middle bins (this may violate Cochran’s Rule and seems inefficient).

Method 2: Equal Bin Widths

• Step 1: Divide the exact data range into c groups of equal width, and count the sample observations in each bin to get observed bin frequencies fj.

• Step 2: Convert the bin limits into standardized z-values: $z = (x - \bar{x})/s$, where $\bar{x}$ and s are the sample mean and standard deviation.

• Step 3: Find the normal area within each bin assuming a normal distribution.

• Step 4: Find expected frequencies ej by multiplying each normal area by the sample size n.


Problem: Frequencies will be small in the end bins yet large in the middle bins (this may violate Cochran’s Rule and seems inefficient).
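The four steps of the equal-bin-width method can be sketched with Python’s standard library. The data values are hypothetical, the function name is illustrative, and one simplification is assumed: normal areas outside the observed data range are ignored, so the expected frequencies add up to slightly less than n.

```python
from statistics import NormalDist, mean, stdev


def equal_width_frequencies(data, c):
    """Observed and expected normal frequencies for c equal-width bins."""
    n = len(data)
    xbar, s = mean(data), stdev(data)   # sample estimates of mu and sigma
    lo, hi = min(data), max(data)
    width = (hi - lo) / c
    limits = [lo + i * width for i in range(c + 1)]

    # Step 1: observed counts; the last bin also includes the maximum.
    observed = [0] * c
    for x in data:
        j = min(int((x - lo) / width), c - 1)
        observed[j] += 1

    # Step 2: standardize the bin limits.
    z_limits = [(x - xbar) / s for x in limits]

    # Step 3: normal area within each bin.
    std_normal = NormalDist()
    areas = [
        std_normal.cdf(z_limits[i + 1]) - std_normal.cdf(z_limits[i])
        for i in range(c)
    ]

    # Step 4: expected frequency = normal area times sample size.
    expected = [n * a for a in areas]
    return observed, expected


# Hypothetical sample (illustration only).
data = [61, 64, 66, 67, 68, 69, 70, 70, 71, 72,
        72, 73, 74, 75, 76, 78, 79, 81, 84, 89]
observed, expected = equal_width_frequencies(data, 4)
print(observed)
print([round(e, 2) for e in expected])
```

The output shows the pattern noted above: the end bins get small expected frequencies while the middle bins get large ones.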

Method 3: Equal Expected Frequencies


• Define histogram bins in such a way that an equal number of observations would be expected under the hypothesis of a normal population, i.e., so that ej = n/c.

• A normal area of 1/c is expected in each bin.

• The first and last classes must be open-ended, so to define c bins we need c-1 cut points.

• Count the observations fj within each bin.

• Compare the fj with the expected frequencies ej = n/c.

Advantage: Makes efficient use of the sample.

Disadvantage: Cut points on the z-scale may seem strange.

• Standard normal cut points for equal area bins.

Table 15.16
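Cut points of this kind can be generated directly from the inverse standard normal CDF: for c bins, the j-th cut point is the z-value with cumulative area j/c. The sketch below uses Python’s `statistics.NormalDist`; the function names, the choice of c values, and the z-values being counted are hypothetical.

```python
from bisect import bisect_right
from statistics import NormalDist


def equal_area_cut_points(c):
    """Return the c - 1 standard normal cut points that give c bins of area 1/c."""
    std_normal = NormalDist()
    return [std_normal.inv_cdf(j / c) for j in range(1, c)]


def observed_counts(z_values, cuts):
    """Count z-values per bin; the first and last bins are open-ended."""
    counts = [0] * (len(cuts) + 1)
    for z in z_values:
        counts[bisect_right(cuts, z)] += 1
    return counts


for c in (4, 5, 8):
    print(c, [round(z, 3) for z in equal_area_cut_points(c)])

# Hypothetical standardized observations (illustration only).
z_values = [-1.2, -0.3, 0.1, 0.9, 1.5, -0.8, 0.4, -0.1]
cuts = equal_area_cut_points(4)
counts = observed_counts(z_values, cuts)
print(counts, "expected per bin:", len(z_values) / 4)
```

The cut points are symmetric about zero, and each observed count is compared with the same expected frequency n/c.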

Critical Values for Normal GOF Test

• Two parameters, μ and σ, are estimated from the sample (m = 2), so the degrees of freedom are d.f. = c – m – 1 = c – 3.

• We need at least four bins to ensure at least one degree of freedom.

Small Expected Frequencies

• Cochran’s Rule suggests at least ej ≥ 5 in each bin (e.g., with 4 bins we would want n ≥ 20, and so on).

Visual Tests

• The fitted normal superimposed on a histogram gives visual clues as to the likely outcome of the GOF test.

• A simple “eyeball” inspection of the histogram may suffice to rule out a normal population by revealing outliers or other non-normality issues.

ECDF Tests ML 10.3

• There are alternatives to the chi-square test for normality based on the empirical cumulative distribution function (ECDF).

• ECDF tests are done by computer. Details are omitted here.

• A small p-value casts doubt on normality of the population.

• The Kolmogorov-Smirnov (K-S) test uses the largest absolute difference between the actual and expected cumulative relative frequency of the n data values.

• The Anderson-Darling (A-D) test is based on a probability plot. When the data fit the hypothesized distribution closely, the probability plot will be close to a straight line. The A-D test is widely used because of its power and attractive visual.
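Although ECDF tests are normally left to software, the K-S statistic itself is simple to sketch: it is the largest absolute gap between the empirical cumulative relative frequency and the fitted normal CDF. The code below computes only the statistic, not a p-value; when μ and σ are estimated from the sample, the usual K-S tables do not apply directly, so the p-value is best taken from a statistics package. The data and the function name are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def ks_statistic_normal(data):
    """Largest absolute gap between the ECDF and a fitted normal CDF."""
    n = len(data)
    fitted = NormalDist(mean(data), stdev(data))   # parameters estimated from the sample
    d_max = 0.0
    for i, x in enumerate(sorted(data), start=1):
        expected = fitted.cdf(x)
        # The ECDF jumps at x from (i - 1)/n to i/n, so check both sides of the step.
        d_max = max(d_max, abs(i / n - expected), abs((i - 1) / n - expected))
    return d_max


# Hypothetical sample (illustration only).
data = [98, 105, 110, 112, 115, 118, 120, 123, 127, 134]
d = ks_statistic_normal(data)
print(round(d, 4))
```

A small statistic means the ECDF tracks the fitted normal curve closely, which corresponds to a near-linear probability plot.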

ECDF Tests for Normality

Example: Minitab’s Anderson-Darling Test for Normality

Near-linear probability plot suggests good fit to normal distribution

p-value = 0.122 is not small enough to reject normal population at α = .05

Data: weights of 80 babies (in ounces)

Example: MegaStat’s Normality Tests

Near-linear probability plot suggests good fit to normal distribution

p-value = 0.2487 is not small enough to reject normal population at α = .05 in this chi-square test

Data: weights of 80 babies (in ounces)

Note: MegaStat’s chi-square test is not as powerful as the A-D test, so we would prefer the A-D test if software is available. The MegaStat probability plot is good, but shows no p-value.
