applied statistics lecture_7
Introduction to applied statistics
& applied statistical methods
Prof. Dr. Chang Zhu
Overview
• Chi-square test
• Discriminant analysis
• Logistic regression
• Nominal data/categorical data
• Dichotomous variable
Only 2 values, e.g. yes or no, male or female
• Binary variable
Assign a 1 (yes) or 0 (no) to indicate presence or absence of something
Chi-square analysis
• Level of measurement is nominal
• The chi square test is non-parametric. It can
be used when normality is not assumed.
Association between categorical variables
• Suppose both response and explanatory variables are
categorical.
• There is association if the population conditional
distribution for the response variable differs among the
categories of the explanatory variable.
Example: Contingency table on happiness cross-classified by
family income (data from 2006 GSS)
                    Happiness
Income     Very         Pretty       Not too      Total
---------------------------------------------------------
Above      272 (44%)    294 (48%)    49 (8%)      615
Average    454 (32%)    835 (59%)    131 (9%)     1420
Below      185 (20%)    527 (57%)    208 (23%)    920
---------------------------------------------------------
Response: Happiness, Explanatory: Income
Relationship between income and happiness?
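The percentages in the contingency table above are conditional distributions of happiness within each income category (each cell divided by its row total). A minimal Python sketch, using only the counts from the table, shows how they can be reproduced; the variable and function names are illustrative.

```python
# Counts from the contingency table above (2006 GSS):
# rows = family income, columns = happiness (very, pretty, not too).
counts = {
    "Above": [272, 294, 49],
    "Average": [454, 835, 131],
    "Below": [185, 527, 208],
}


def conditional_percentages(row):
    """Return each cell as a percentage of its row total."""
    total = sum(row)
    return [100 * cell / total for cell in row]


row_percentages = {
    income: conditional_percentages(row) for income, row in counts.items()
}

for income, percentages in row_percentages.items():
    print(income, [round(p) for p in percentages])
```

If the three rows of percentages were (nearly) identical, the data would give no evidence of an association; here they clearly differ.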
Chi-Squared Test of Independence
(Karl Pearson, 1900)
• Tests H0: The variables are statistically independent
• Ha: The variables are statistically dependent
• Intuition behind test statistic: Summarize differences
between observed cell counts and expected cell counts
(what is expected if H0 true)
• Notation: fo = observed frequency (cell count)
fe = expected frequency
r = number of rows in table, c = number of columns
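The slide introduces the notation but not the statistic itself. The standard Pearson statistic is χ² = Σ (fo − fe)² / fe, summed over all cells, with fe = (row total × column total) / N and df = (r − 1)(c − 1). The following Python sketch applies these formulas to the income-by-happiness counts from the earlier contingency table; the function name is illustrative, not part of any library.

```python
def pearson_chi_square(table):
    """Return (chi-square statistic, degrees of freedom) for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, fo in enumerate(row):
            # Expected count if the two variables were independent (H0 true).
            fe = row_totals[i] * col_totals[j] / n
            chi2 += (fo - fe) ** 2 / fe

    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


# Rows: income above / average / below; columns: very / pretty / not too happy.
happiness_by_income = [
    [272, 294, 49],
    [454, 835, 131],
    [185, 527, 208],
]

chi2, df = pearson_chi_square(happiness_by_income)
print(f"chi-square = {chi2:.1f}, df = {df}")
```

A large statistic relative to its degrees of freedom is evidence against independence; the p-value comes from the chi-squared distribution with df degrees of freedom.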
Chi-square analysis
• Chi-squared test answers “Is there an association?”
• Standardized residuals answer “How do data differ
from what independence predicts?”
• “How strong is the association?” using a measure of
the strength of association, such as the difference of
proportions
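As a hedged illustration of the last two questions, the sketch below computes standardized residuals using one common definition, (fo − fe) / sqrt(fe × (1 − row proportion) × (1 − column proportion)), and a difference of proportions for the income-by-happiness table shown earlier. The function name is illustrative; the choice of which two proportions to compare is an assumption made for the example.

```python
import math


def standardized_residuals(table):
    """Standardized (adjusted) residual for every cell of an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    residuals = []
    for i, row in enumerate(table):
        residual_row = []
        for j, fo in enumerate(row):
            fe = row_totals[i] * col_totals[j] / n
            se = math.sqrt(
                fe * (1 - row_totals[i] / n) * (1 - col_totals[j] / n)
            )
            residual_row.append((fo - fe) / se)
        residuals.append(residual_row)
    return residuals


# Rows: income above / average / below; columns: very / pretty / not too happy.
table = [
    [272, 294, 49],
    [454, 835, 131],
    [185, 527, 208],
]

residuals = standardized_residuals(table)
for row in residuals:
    print([round(z, 1) for z in row])

# Difference of proportions "very happy": above-average vs below-average income.
diff_very_happy = 272 / 615 - 185 / 920
print(round(diff_very_happy, 3))
```

Residuals with absolute value above about 2 or 3 mark cells whose counts differ from what independence predicts by more than chance alone would suggest.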
Chi-square analysis
• Like all hypothesis tests, the chi-square test is
sensitive to sample size.
– For a given pattern of association, the obtained chi
square increases as N increases.
– With large samples, trivial relationships may be
significant. As a rule of thumb, when N > 1000, set
your alpha = .01.
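The sensitivity to sample size can be seen directly: if every cell count is multiplied by the same factor, the cell proportions (and so the strength of the association) stay the same, but the chi-square statistic is multiplied by that factor. The sketch below uses a small hypothetical 2 × 2 table, invented purely for illustration.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum(
        (fo - row_totals[i] * col_totals[j] / n) ** 2
        / (row_totals[i] * col_totals[j] / n)
        for i, row in enumerate(table)
        for j, fo in enumerate(row)
    )


# Hypothetical counts (not from the lecture data).
small_sample = [[30, 20], [20, 30]]                          # N = 100
large_sample = [[10 * c for c in row] for row in small_sample]  # N = 1000

chi2_small = chi_square_statistic(small_sample)
chi2_large = chi_square_statistic(large_sample)
print(chi2_small, chi2_large)
```

Both tables show the same 60% versus 40% split, yet the larger sample produces a statistic ten times as large, which is why a significant result in a large sample does not by itself indicate a strong association.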
Practice 1
• CHI-SQUARE TEST (CROSS-TAB)
• A group of students was classified in terms of
personality (introvert or extrovert) and in terms
of colour preference (red, yellow, green or
blue). Personality and colour preference are
categorical variables. We want to answer
this question:
• Is there an association between personality and
colour preference?
Practice 1
• In SPSS, Analyze > Descriptive Statistics >
Crosstabs
Practice 1 (output)
Chi-Square Tests
                               Value      df   Asymp. Sig. (2-sided)
Pearson Chi-Square             71.200a    3    .000
Likelihood Ratio               70.066     3    .000
Linear-by-Linear Association   69.124     1    .000
N of Valid Cases               400
a. 0 cells (0.0%) have expected count less than 5. The minimum expected count is 10.00.
There is a relationship between students’ personality and preferences for colours: χ² (3, N = 400) = 71.20, p < .0001.
Discriminant analysis
• Similar to regression, except that the criterion (or
dependent variable) is categorical rather than
continuous.
• Used to identify boundaries between groups of
objects
For example: (a) Does a person have the disease or
not?
(b) Is someone a good credit risk or not?
(c) Should a student be admitted to college or not?
Discriminant analysis
• We wish to predict group membership for a
number of subjects from a set of predictor
variables.
• The criterion variable (also called grouping
variable) is the object of classification. This is
ALWAYS a categorical variable.
• Simple case: two groups and p predictor
variables.
Discriminant analysis
• Similar to regression:
– What predictor variables are related to the criterion
(dependent variable)?
– Predict values on the criterion variable when given
new values on the predictor variables
Discriminant analysis
• Can we classify new (unclassified) subjects into
groups?
– Given the classification functions how accurate are
we? And when we are inaccurate is there some
pattern to the misclassification?
• What is the strength of association between
group membership and the predictors?
D = (.024 × age) + (.080 × self-concept) + (-.100 × anxiety) + (-.012 × days absent)
+ (.134 × anti-smoking score) - 4.543
Discriminant analysis
Questions?
•Which predictors are most important in
predicting group membership?
Practice 2
A study is set up to determine if the following variables
help to discriminate between those who smoke and those
who don’t:
•age
•absence (days of absence last year)
•selfcon (self-concept score)
•anxiety (anxiety score)
•anti_smoking (attitude towards anti-smoking policies)
Practice 2
• In SPSS, Analyze > Classify > Discriminant
Practice 2
D = (.024 × age) + (.080 × self-concept) + (-.100 × anxiety) + (-.012 × days absent)
+ (.134 × anti-smoking score) - 4.543

Functions at Group Centroids
(means of group calculated by the D function)
               Function 1
non-smoker     1.125
smoker         -1.598

Canonical Discriminant Function Coefficients
                                 Function 1
age                              .024
self concept score               .080
anxiety score                    -.100
days absent last year            -.012
total anti-smoking test score    .134
(Constant)                       -4.543
Unstandardized coefficients
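To make the use of the D function concrete, here is a minimal Python sketch that computes a discriminant score from the unstandardized coefficients above and assigns the group whose centroid is closest. Two things are assumptions for illustration only: the subject's values are invented, and the nearest-centroid rule is a simplification (SPSS classifies using posterior probabilities, which can also take group sizes into account).

```python
# Unstandardized coefficients and constant from the SPSS output above.
COEFFICIENTS = {
    "age": 0.024,
    "selfcon": 0.080,
    "anxiety": -0.100,
    "absence": -0.012,
    "anti_smoking": 0.134,
}
CONSTANT = -4.543

# Group centroids (group means of D) from the SPSS output above.
CENTROIDS = {"non-smoker": 1.125, "smoker": -1.598}


def discriminant_score(subject):
    """Compute D for one subject given a dict of predictor values."""
    return CONSTANT + sum(
        COEFFICIENTS[name] * subject[name] for name in COEFFICIENTS
    )


def nearest_centroid(score):
    """Simplified rule: pick the group whose centroid is closest to D."""
    return min(CENTROIDS, key=lambda group: abs(score - CENTROIDS[group]))


# Hypothetical subject: these values are invented, not taken from the data file.
hypothetical_subject = {
    "age": 40,
    "selfcon": 40,
    "anxiety": 20,
    "absence": 5,
    "anti_smoking": 25,
}

score = discriminant_score(hypothetical_subject)
print(round(score, 3), nearest_centroid(score))
```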
Practice 2
Classification Results (a, c)
                                          Predicted Group Membership
                          smoke or not    non-smoker   smoker    Total
Original            Count non-smoker      238          19        257
                          smoker          17           164       181
                    %     non-smoker      92.6         7.4       100.0
                          smoker          9.4          90.6      100.0
Cross-validated (b) Count non-smoker      238          19        257
                          smoker          17           164       181
                    %     non-smoker      92.6         7.4       100.0
                          smoker          9.4          90.6      100.0
a. 91.8% of original grouped cases correctly classified.
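The percentages in the classification table follow directly from the counts. As a small sketch (the names are illustrative), the per-group and overall hit rates can be recomputed like this:

```python
# Counts from the classification table: outer key = actual group,
# inner key = predicted group.
classification = {
    "non-smoker": {"non-smoker": 238, "smoker": 19},
    "smoker": {"non-smoker": 17, "smoker": 164},
}


def hit_rates(table):
    """Return (per-group % correct, overall % correct)."""
    per_group = {
        actual: 100 * row[actual] / sum(row.values())
        for actual, row in table.items()
    }
    total = sum(sum(row.values()) for row in table.values())
    correct = sum(row[actual] for actual, row in table.items())
    return per_group, 100 * correct / total


per_group, overall = hit_rates(classification)
print({group: round(pct, 1) for group, pct in per_group.items()})
print(round(overall, 1))
```

The off-diagonal counts (19 and 17) are the misclassified cases, which is where to look for any pattern in the errors.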
Practice 2
When reporting the result, we should include the following:
• Name of the predictors and sample size
• Results of the Univariate ANOVAs and the Box’s M test
• The significance of the discriminant function
• The variance explained (Canonical correlation coefficient)
• Significant predictors and their contribution to the model
(discriminant function)
• Result from the cross-validation process
(page 9)
Logistic regression
• In logistic regression the response (Y) is a
dichotomous categorical variable.
• For example: voting, mortality, and participation data are not continuous or normally distributed.
• Binary logistic regression is a type of regression analysis where the dependent variable is a dummy variable: coded 0 (did not vote) or 1 (did vote)
Logistic regression
• Models the relationship between a set of
variables x_i
– dichotomous (eat: yes/no)
– categorical (social class, ...)
– continuous (age, ...)
and a dichotomous variable Y
Binary Logistic regression
• Binary logistic regression is a type of regression analysis where the dependent variable is a dummy variable (coded 0, 1)
Binary Dependent Variables
A few examples:
• Consumer chooses brand (1) or not (0);
• A quality defect occurs (1) or not (0);
• A person is hired (1) or not (0);
• Other examples
Binary Logistic regression
• The logistic regression model is simply a non-linear transformation of the linear regression.
• The logistic distribution has an S-shaped distribution function (cumulative distribution function), similar to that of the standard normal distribution, which constrains the estimated probabilities to lie between 0 and 1.
Binary Logistic regression
• p: the probability of success/event (range from 0 to 1)
• 1-p: probability of failure/non-event
If the probability of success is .8 (80%), the
probability of failure is ???
• The odds of success: the ratio of the probability of
success to the probability of failure
• What are the odds of success in the above situation?
• What can we conclude about the probabilities of success and
failure when the odds equal 1?
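The definitions above translate into a one-line calculation, odds = p / (1 − p). The following Python sketch (function name illustrative) works through the slide's example with p = .8 and the case where the odds equal 1.

```python
def odds(p):
    """Odds of success for a probability of success p (0 <= p < 1)."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1 - p)


p_success = 0.8
p_failure = 1 - p_success          # probability of failure

print(round(p_failure, 2))         # failure probability for the example
print(round(odds(p_success), 2))   # odds of success for the example
print(odds(0.5))                   # success and failure equally likely
```

With p = .8 the probability of failure is .2 and the odds are 4 (success is four times as likely as failure); odds of 1 correspond to p = 1 − p = .5.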
Binary Logistic regression
• Logistic regression: model the logit-transformed probability as
a linear function of the predictor variables:
logit(p) = log(p/(1-p)) = log(odds) = b0 + b1*x1 + ... + bk*xk
• We can also transform the log of the odds back to a
probability:
p = exp(b0 + b1*x1 + ... + bk*xk)/(1 + exp(b0 + b1*x1 + ... + bk*xk))
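The two formulas are inverses of each other, which a short Python sketch can confirm. The function names and the example coefficients are illustrative assumptions, not values from the lecture.

```python
import math


def logit(p):
    """Log of the odds for a probability p with 0 < p < 1."""
    return math.log(p / (1 - p))


def inverse_logit(x):
    """Transform a log-odds value back to a probability."""
    return math.exp(x) / (1 + math.exp(x))


def predicted_probability(b0, coefficients, values):
    """p for one case: inverse logit of b0 + b1*x1 + ... + bk*xk."""
    linear_predictor = b0 + sum(b * x for b, x in zip(coefficients, values))
    return inverse_logit(linear_predictor)


# Round trip: probability -> log odds -> probability.
print(round(inverse_logit(logit(0.8)), 3))

# Hypothetical model with one predictor (coefficients invented for illustration).
print(round(predicted_probability(-1.0, [0.5], [2.0]), 3))
```

Whatever value the linear predictor takes, the back-transformed p stays strictly between 0 and 1, which is the point of modelling the logit rather than p itself.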
Binary Logistic regression
(SPSS output)
Variables in the Equation
                      B (log odds)   S.E.   Wald   df   Sig.   Exp(B) (odds ratio)
Step 1a   gender(1)   -.005          .202   .001   1    .981   .995
• If the odds ratio > 1: when the predictor increases, the odds
of the event occurring increase.
• If the odds ratio < 1: when the predictor increases, the odds
of the event occurring decrease.
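The columns of the SPSS output are linked by simple formulas: Exp(B) is the exponential of B, and the Wald statistic is (B / S.E.)². A brief Python sketch using the gender(1) row illustrates this; the variable names are illustrative.

```python
import math

# Values from the gender(1) row of the SPSS output above.
b = -0.005
standard_error = 0.202

odds_ratio = math.exp(b)           # Exp(B)
wald = (b / standard_error) ** 2   # Wald statistic, 1 df

print(round(odds_ratio, 3), round(wald, 3))
```

Here the odds ratio is very close to 1 and the Wald statistic is close to 0, consistent with the non-significant result (Sig. = .981) for gender.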
Practice 3
• Conduct logistic regression to see if gender is a
significant predictor of whether someone is a
smoker or non-smoker.
• In SPSS, Analyze > Regression > Binary
Logistic
• The data file is smoker_DA.sav.
Practice 3
• Conduct logistic regression to see if anti-smoking attitude is a significant predictor of whether someone is a smoker or non-smoker.
• In SPSS, Analyze > Regression > Binary Logistic
• The data file is smoker_DA.sav.
Practice 3
Conduct logistic regression to see whether the following are
significant predictors of whether someone is a smoker or
non-smoker:
•age
•gender
•absence (days of absence last year)
•selfcon (self-concept score)
•anxiety (anxiety score)
•anti_smoking (attitude towards anti-smoking policies)
When we have no idea about the relative importance of the
predictors, we choose a stepwise method (Forward: LR).
Practice 3
                          B         S.E.    Odds Ratio   95% C.I. for Odds Ratio
                                                         Lower    Upper
constant                  9.257**   2.050   10480.856
self-concept              -.260**   .033    .771         .724     .822
anxiety                   .236**    .036    1.266        1.181    1.357
absence                   .075*     .030    1.078        1.016    1.144
anti-smoking test score   -.303**   .075    .739         .638     .856
Notes. R² = .607 (Cox & Snell), .818 (Nagelkerke). Model χ² (8) = 42.0, p < .001. *p < .05. **p < .01
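The odds ratios and their confidence intervals can be recovered approximately from B and S.E. using the usual Wald interval, exp(B ± 1.96 × S.E.). The sketch below assumes that this is how the intervals in the table were formed; because it starts from the rounded B and S.E. values, the limits may differ slightly in the third decimal. The function name is illustrative.

```python
import math


def odds_ratio_with_ci(b, standard_error, z=1.96):
    """Return (odds ratio, lower limit, upper limit) of a 95% Wald interval."""
    return (
        math.exp(b),
        math.exp(b - z * standard_error),
        math.exp(b + z * standard_error),
    )


# (B, S.E.) pairs from the table above.
estimates = {
    "self-concept": (-0.260, 0.033),
    "anxiety": (0.236, 0.036),
    "absence": (0.075, 0.030),
    "anti-smoking test score": (-0.303, 0.075),
}

for name, (b, se) in estimates.items():
    ratio, lower, upper = odds_ratio_with_ci(b, se)
    print(f"{name}: OR = {ratio:.3f}, 95% CI [{lower:.3f}, {upper:.3f}]")
```

An interval that excludes 1 corresponds to a predictor that is significant at the .05 level, which matches the asterisks in the table.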
Practice 3
• Report:
• A logistic regression analysis was conducted with age, gender,
number of days absent from work in the previous year,
self-concept score, anxiety score, and attitude to anti-smoking
workplace policy as predictors. A total of 438 cases were
analyzed. The full model significantly predicted whether an
employee is a smoker or non-smoker (χ² = 42.04, df = 8,
p < .001), accounting for between 60.7% and 81.8% of the
variance in group membership, with 92.6% of non-smokers and
90.6% of smokers successfully predicted.
Assignment 7
• Deadline: December 10, 2014
• Detail: guidelines page 16
• Questions?