

  • Contingency Tables

    Professor Ron Fricker
    Naval Postgraduate School
    Monterey, California

    8/25/12

    Reading Assignment: None

  • Goals for this Lecture

    •  Understand and be able to conduct tests for discrete contingency table data
       –  One-way chi-square goodness-of-fit tests
          •  Homogeneity
          •  Other distributions
       –  Two-way chi-square tests
          •  Independence
          •  Homogeneity

    •  All assuming simple random sampling (SRS) and no finite population correction (fpc)

  • One-Way Classifications

    •  Each item classified into one (and only one) of k categories (cells)
       –  Denote counts as $x_1, x_2, \ldots, x_k$ with $x_1 + x_2 + \cdots + x_k = n$

    [Diagram: a random sample of size n is drawn from the population and each item is classified into Category 1, Category 2, …, Category k, giving cell frequencies $x_1, x_2, \ldots, x_k$]

  • One-Way Tables in R

    •  Just use table() or xtabs() on one variable
       –  E.g., tabulating Q1 in the New Student Survey* (output not reproduced in this transcript)

    * Data from 2008 survey of NPS new students
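
As a minimal sketch of the one-way tabulation described above, the R code below tabulates a single variable with both table() and xtabs(). The data frame `survey` and its Q1 responses are hypothetical stand-ins; the actual New Student Survey data is not reproduced here.

```r
# Hypothetical Likert responses standing in for survey question Q1
likert_levels <- c("Strongly Disagree", "Disagree", "Neutral",
                   "Agree", "Strongly Agree")
survey <- data.frame(
  Q1 = factor(
    c("Agree", "Neutral", "Agree", "Strongly Agree", "Disagree",
      "Agree", "Neutral", "Strongly Agree", "Agree", "Disagree"),
    levels = likert_levels
  )
)

one_way   <- table(survey$Q1)          # counts x_1, ..., x_k
one_way_x <- xtabs(~ Q1, data = survey)  # same counts via a formula

print(one_way)
print(prop.table(one_way))  # cell proportions x_i / n
```

Because Q1 is stored as a factor with all five levels, categories that nobody chose still appear in the table with a count of zero.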

  • Two-Way Contingency Tables

    •  A two-way contingency table (or cross tabulation) gives counts by all pairwise combinations of variable levels

                              Variable 2
                              “A”        “B”        Total
    Variable 1    “X”         # or %     # or %     # or %
                  “Y”         # or %     # or %     # or %
                  Total       # or %     # or %

    –  Cell (“X”, “B”): number or percent of obs that are both “X” and “B”
    –  Row “Y” total: number or percent of obs that are “Y”

  • Two-Way Tables in R

    •  Just use table() or xtabs() on two variables
       –  E.g., tabulating Q1 by gender in the New Student Survey* (output not reproduced in this transcript)

    * Data from 2008 survey of NPS new students
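
A two-way tabulation might look like the sketch below. The data frame `survey` and the Q1 and Gender values are made up for illustration and are not the New Student Survey responses.

```r
# Hypothetical responses: one row per respondent
survey <- data.frame(
  Q1     = c("Agree", "Neutral", "Agree", "Disagree",
             "Agree", "Neutral", "Agree", "Disagree"),
  Gender = c("M", "F", "M", "M", "F", "F", "M", "F")
)

two_way   <- table(Q1 = survey$Q1, Gender = survey$Gender)
two_way_x <- xtabs(~ Q1 + Gender, data = survey)  # same table via a formula

print(two_way_x)
print(addmargins(two_way_x))               # append row and column totals
print(prop.table(two_way_x, margin = 2))   # proportions within each column
```

Using margin = 1 instead gives proportions within each row, and omitting margin gives each cell as a share of all n observations.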

  • Higher-Way Tables in R

    •  Just keep adding variables…
       –  E.g., Q1 by gender by country* (output not reproduced in this transcript)

    * Data from 2008 survey of NPS new students
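
Adding a third variable to the formula gives a three-way table. In this sketch the data frame `survey` and its values are hypothetical; ftable() is one convenient way to print the result as a single flat layout.

```r
# Hypothetical responses with three categorical variables
survey <- data.frame(
  Q1      = c("Agree", "Disagree", "Agree", "Agree", "Disagree", "Agree"),
  Gender  = c("M", "F", "F", "M", "M", "F"),
  Country = c("US", "US", "Intl", "Intl", "US", "US")
)

three_way <- xtabs(~ Q1 + Gender + Country, data = survey)

print(three_way)          # one Q1-by-Gender slice per country
print(ftable(three_way))  # the same counts flattened into one display
```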

  • One-Way Goodness-of-Fit Test

    •  Have counts for k categories, $x_1, x_2, \ldots, x_k$, with $x_1 + x_2 + \cdots + x_k = n$

    •  (Unknown) population cell probabilities denoted $p_1, p_2, \ldots, p_k$ with $p_1 + p_2 + \cdots + p_k = 1$

    •  Estimate each cell probability from the observed counts:

       $$\hat{p}_i = x_i / n, \quad i = 1, 2, \ldots, k$$

    •  The hypotheses to be tested are

       $$H_0: p_1 = p_1^*,\ p_2 = p_2^*,\ \ldots,\ p_k = p_k^*$$
       $$H_a: p_i \neq p_i^* \text{ for at least one } i$$

       –  Here $p_1^*, \ldots, p_k^*$ are the cell probabilities specified by the null hypothesis

  • Goodness-of-Fit Test for Homogeneity

    •  Null hypothesis is that every category is equally likely:

       $$p_i^* = 1/k, \quad i = 1, 2, \ldots, k$$

       –  I.e., the distribution of category characteristics is homogeneous in the population

    •  If the null is true, in each cell (in a perfect world) we would expect to observe

       $$e_i = n p_i^*$$

       counts

    •  So, how to do a statistical test that assesses how “far away” the $e_i$ expected counts are from the $x_i$ observed counts?

  • Answer: Chi-square Test

    •  Idea: Look at how far off table counts are from what is expected under the null

       $$X^2 = \sum_{i=1}^{k} \frac{(\text{observed}_i - \text{expected}_i)^2}{\text{expected}_i} = \sum_{i=1}^{k} \frac{(x_i - e_i)^2}{e_i}$$

    •  Reject if chi-square statistic too large
       –  Assess “too large” using chi-squared distribution

  • Conducting the Test

    •  First calculate the $X^2$ statistic
    •  Then calculate the p-value:

       $$p\text{-value} = \Pr(\chi^2_{k-1} \geq X^2)$$

       –  $\chi^2_{k-1}$ is the chi-square distribution with k-1 degrees of freedom

    •  Reject null if p-value < $\alpha$, for some pre-determined significance level $\alpha$
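
The goodness-of-fit calculation can be sketched by hand in R. The observed counts in `x` are hypothetical, and the null here is homogeneity (every $p_i^* = 1/k$); pchisq() with lower.tail = FALSE returns the upper-tail probability used as the p-value.

```r
x <- c(18, 25, 30, 27)   # hypothetical observed counts x_1, ..., x_k
n <- sum(x)              # sample size
k <- length(x)           # number of categories

p_star <- rep(1 / k, k)  # null probabilities under homogeneity
e <- n * p_star          # expected counts e_i = n * p_i^*

X2 <- sum((x - e)^2 / e)                                # chi-square statistic
p_value <- pchisq(X2, df = k - 1, lower.tail = FALSE)   # Pr(chi^2_{k-1} >= X2)

alpha <- 0.05            # significance level, chosen in advance
cat("X2 =", X2, " p-value =", p_value, " reject:", p_value < alpha, "\n")
```

With these counts every expected count is 25, so the statistic is (49 + 0 + 25 + 4) / 25 = 3.12 on 3 degrees of freedom.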

  • Example

    •  In Excel: (output not reproduced in this transcript)

    •  In R, use the chisq.test() function
       –  Default is the GoF test for homogeneity

    * Data from 2008 survey of NPS new students; remember, here we are assuming SRS and no fpc, which is actually not true for this data
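
A minimal sketch of the R call, using hypothetical counts in place of the survey data: when chisq.test() is given a single vector of counts and no probabilities, it tests against equal cell probabilities.

```r
x <- c(18, 25, 30, 27)   # hypothetical observed counts

fit <- chisq.test(x)     # default null: all categories equally likely
print(fit)

fit$expected             # expected counts under the null
fit$statistic            # the X-squared statistic
fit$parameter            # degrees of freedom, k - 1
fit$p.value              # upper-tail p-value
```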

  • Goodness-of-Fit Test for Other Distributions

    •  Homogeneity is just a special case
    •  Can test against any set of hypothesized probabilities $p_i^*$ as long as

       $$\sum_{i=1}^{k} p_i^* = 1$$

    •  Might have some theory that says what the distribution should be, for example

    •  Remember, don’t look at the data first and then specify the probabilities…

  • Example

    •  In Excel: (output not reproduced in this transcript)

    •  In R, again use the chisq.test() function
       –  Now, add a vector for the probabilities

    * Data from 2008 survey of NPS new students; remember, here we are assuming SRS and no fpc, which is actually not true for this data
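
Passing the hypothesized probabilities through the `p` argument might look like this. Both the counts and the probabilities in `p_star` are hypothetical; in practice the probabilities would come from theory and be fixed before looking at the data.

```r
x      <- c(18, 25, 30, 27)       # hypothetical observed counts
p_star <- c(0.2, 0.2, 0.3, 0.3)   # hypothetical null probabilities

# chisq.test() stops with an error if the probabilities do not sum to 1
stopifnot(isTRUE(all.equal(sum(p_star), 1)))

fit <- chisq.test(x, p = p_star)
print(fit)
fit$expected                      # n * p_star
```

Here the expected counts are 20, 20, 30, 30, so the statistic is 4/20 + 25/20 + 0/30 + 9/30 = 1.75 on 3 degrees of freedom.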

  • A Note

    •  Pearson chi-square test depends on all cells having sufficiently large expected counts:

       $$e_i = n p_i^* \geq 5$$

       –  If not, collapse across some categories
       –  E.g., count and probability for “Strongly Disagree” and “Disagree” aggregated

    * Data from 2008 survey of NPS new students; remember, here we are assuming SRS and no fpc, which is actually not true for this data
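
One way to collapse categories is sketched below, with hypothetical counts and null probabilities. The two lowest categories are merged by summing both their counts and their null probabilities, so the collapsed probabilities still sum to 1.

```r
# Hypothetical counts and null probabilities for a five-point Likert item
x <- c("Strongly Disagree" = 2, "Disagree" = 6, "Neutral" = 20,
       "Agree" = 40, "Strongly Agree" = 32)
p_star <- c(0.03, 0.07, 0.20, 0.40, 0.30)

e <- sum(x) * p_star
print(e)   # the first expected count is below 5

# Aggregate "Strongly Disagree" and "Disagree": add counts AND probabilities
x_c <- c("Disagree or Strongly Disagree" = sum(x[1:2]), x[3:5])
p_c <- c(sum(p_star[1:2]), p_star[3:5])

e_c <- sum(x_c) * p_c
print(e_c)                        # all expected counts are now at least 5

fit <- chisq.test(x_c, p = p_c)   # k is now 4, so df = 3
print(fit)
```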

  • Some Notation for Two-Way Contingency Tables

    •  Table has r rows and c columns
    •  Observed cell counts are $x_{ij}$, with

       $$\sum_{i=1}^{r} \sum_{j=1}^{c} x_{ij} = n$$

    •  Denote row sums:

       $$x_{i\bullet} = \sum_{j=1}^{c} x_{ij}, \quad i = 1, \ldots, r$$

    •  Denote column sums:

       $$x_{\bullet j} = \sum_{i=1}^{r} x_{ij}, \quad j = 1, \ldots, c$$
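
In R the row and column sums correspond to rowSums() and colSums(). The 2 x 2 table below is hypothetical and only illustrates the notation.

```r
# Hypothetical r = 2 by c = 2 table of observed counts x_ij
x <- matrix(c(30, 10,
              20, 40),
            nrow = 2, byrow = TRUE,
            dimnames = list(Var1 = c("X", "Y"), Var2 = c("A", "B")))

row_totals <- rowSums(x)   # row sums: one total per row i
col_totals <- colSums(x)   # column sums: one total per column j
n          <- sum(x)       # grand total

print(addmargins(as.table(x)))   # the table with its margins attached
```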

  • Chi-square Test for Independence

    •  Independence means the probability of being in any cell is the product of the row and column probabilities

                              Variable 2
                              “A”              “B”              Margin
    Variable 1    “X”         Pr(X) × Pr(A)    Pr(X) × Pr(B)    Pr(X)
                  “Y”         Pr(Y) × Pr(A)    Pr(Y) × Pr(B)    Pr(Y)
                  Margin      Pr(A)            Pr(B)

    –  Pr(Y): probability that a random obs is a “Y”
    –  Pr(X) × Pr(B): probability that an obs is both “X” and “B”

  • The Hypotheses

    •  Independence means, for all cells in the table, $p_{ij} = p_{i\bullet}\, p_{\bullet j}$, where
       –  $p_{i\bullet}$ is the probability of having row i characteristic
       –  $p_{\bullet j}$ is the probability of having column j characteristic

    •  The hypotheses to be tested are

       $$H_0: p_{ij} = p_{i\bullet}\, p_{\bullet j}, \quad i = 1, 2, \ldots, r;\ j = 1, 2, \ldots, c$$
       $$H_a: p_{ij} \neq p_{i\bullet}\, p_{\bullet j} \text{ for some } i \text{ and } j$$

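  • Chi-square Test Statistic

    •  Test statistic:

       $$X^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(x_{ij} - e_{ij})^2}{e_{ij}}$$

    •  Under the null, the expected count is calculated as

       $$e_{ij} = n \hat{p}_{ij} = n\, \hat{p}_{i\bullet}\, \hat{p}_{\bullet j} = n \times \frac{x_{i\bullet}}{n} \times \frac{x_{\bullet j}}{n} = \frac{x_{i\bullet}\, x_{\bullet j}}{n}$$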

  • Conducting the Test

    •  Now, proceed as with the goodness-of-fit test
       –  Except degrees of freedom are $\nu = (r-1)(c-1)$

    •  Large values of the chi-square statistic are evidence that the null is false

    •  We’ll let R do the p-value calculation
       –  Reject null if p-value < $\alpha$, for some pre-determined significance level $\alpha$
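
Before handing the work to R's built-in test, it can help to see the two-way calculation done by hand. This sketch uses a hypothetical 2 x 2 table; outer() multiplies every row total by every column total, which gives all the expected counts at once.

```r
# Hypothetical table of observed counts x_ij
x <- matrix(c(30, 10,
              20, 40),
            nrow = 2, byrow = TRUE,
            dimnames = list(Var1 = c("X", "Y"), Var2 = c("A", "B")))
n <- sum(x)

# Expected counts under independence: (row total * column total) / n
e <- outer(rowSums(x), colSums(x)) / n

X2 <- sum((x - e)^2 / e)                   # chi-square statistic
nu <- (nrow(x) - 1) * (ncol(x) - 1)        # degrees of freedom (r-1)(c-1)
p_value <- pchisq(X2, df = nu, lower.tail = FALSE)

print(e)
cat("X2 =", X2, " df =", nu, " p-value =", p_value, "\n")
```

For this table the expected counts are 20, 20, 30, 30, so the statistic is 100/20 + 100/20 + 100/30 + 100/30 = 50/3 on 1 degree of freedom.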

  • Example: Mobile Learning Survey

    •  In mobile learning devices survey, is there an association between those who own a smartphone and those who own a PDA?
       –  “Do you own a smartphone (such as iPhone, Android, and Blackberry)?” (yes/no)
       –  “Do you own a PDA (such as iPad, Zune HD, iPod Touch, Palm, excluding previously mentioned devices)?” (yes/no)

    •  Conclusion: The two sets of responses are not independent, so yes, there is an association

    * Data from 2010 mobile learning devices survey of NPS students (again, assuming SRS and no fpc)
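
A test of this kind could be run with chisq.test() on a two-way table, as sketched below. The counts are invented for illustration and are not the mobile learning survey results. Note that for 2 x 2 tables chisq.test() applies Yates' continuity correction by default; setting correct = FALSE gives the plain Pearson statistic.

```r
# Hypothetical yes/no counts (NOT the actual survey data)
tab <- matrix(c(120, 30,
                 80, 70),
              nrow = 2, byrow = TRUE,
              dimnames = list(Smartphone = c("No", "Yes"),
                              PDA        = c("No", "Yes")))

pearson <- chisq.test(tab, correct = FALSE)  # Pearson chi-square statistic
yates   <- chisq.test(tab)                   # default: continuity-corrected

print(pearson)
print(yates)
pearson$expected                             # expected counts under independence
```

The corrected statistic is always somewhat smaller than the uncorrected one, so it is worth stating which version is being reported.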

  • What’s the Connection?

    •  Those who do not own a smartphone are also slightly more likely not to own a PDA

    •  Similarly, those who own a smartphone are slightly more likely to own a PDA
       –  Perhaps not a big surprise…

    * Data from 2010 mobile learning devices survey of NPS students (again, assuming SRS and no fpc, and data cleaned up for convenience)
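
To see the direction of an association, one option is to compare row proportions and the observed-minus-expected differences. The counts in this sketch are hypothetical, not the survey data.

```r
# Hypothetical yes/no counts (NOT the actual survey data)
tab <- matrix(c(120, 30,
                 80, 70),
              nrow = 2, byrow = TRUE,
              dimnames = list(Smartphone = c("No", "Yes"),
                              PDA        = c("No", "Yes")))

# Share of PDA owners and non-owners within each smartphone group
row_pct <- prop.table(tab, margin = 1)
print(round(row_pct, 3))

# Positive entries mark cells with more observations than independence predicts
fit <- chisq.test(tab, correct = FALSE)
diff_counts <- fit$observed - fit$expected
print(diff_counts)
```

In this made-up table the PDA ownership rate is higher among smartphone owners than among non-owners, which is the pattern the positive diagonal differences reflect.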

  • Chi-square Test for Homogeneity

    •  The question: Is the distribution of a variable (say on a Likert scale) the same for two or more row categories?

    •  Idea: Each row is a population, and under the null the proportion falling in each column category is the same for every row

    •  Good news: Calculation is exactly the same as test for independence!

  • Example: Mobile Learning Survey

    •  In mobile learning devices survey, is the age distribution different for resident and DL students?

    •  Sure looks different, so let’s test it formally: (output not reproduced in this transcript)

    * Data from 2010 mobile learning devices survey of NPS students (again, assuming SRS and no fpc, and data cleaned up for convenience)
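
A homogeneity test of this form might be set up as follows. The age groups and counts are hypothetical placeholders, not the survey results; each row is treated as its own population and the same chisq.test() call used for independence applies.

```r
# Hypothetical counts by student type and age group (NOT the survey data)
tab <- matrix(c(40, 80, 50, 30,
                10, 40, 70, 80),
              nrow = 2, byrow = TRUE,
              dimnames = list(Student = c("Resident", "DL"),
                              Age     = c("<30", "30-34", "35-39", "40+")))

# Age distribution within each row (each row sums to 1)
print(round(prop.table(tab, margin = 1), 3))

fit <- chisq.test(tab)   # df = (2 - 1) * (4 - 1) = 3
print(fit)
fit$expected             # check that every expected count is at least 5
```

The continuity correction only applies to 2 x 2 tables, so no correct argument is needed here.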

  • What We Have Just Learned

    •  Discussed tests for contingency tables
       –  One-way chi-square goodness-of-fit tests
          •  Homogeneity
          •  Other distributions
       –  Two-way chi-square tests
          •  Independence
          •  Homogeneity

    •  All can be useful for analyzing Likert scale and other categorical survey data

    •  Next class, we will learn how to modify these tests for complex sampling situations