applied statistics lecture_7


Introduction to applied statistics

& applied statistical methods

Prof. Dr. Chang Zhu

Overview

• Chi-square test

• Discriminant analysis

• Logistic regression

• Nominal data/categorical data


• Dichotomous variable

Only 2 values, yes or no, male or female

• Binary variable

Assign a 1 (yes) or 0 (no) to indicate the presence or absence of something


Chi-square analysis

• Level of measurement is nominal

• The chi square test is non-parametric. It can

be used when normality is not assumed.


Association between categorical variables

• Suppose both the response and explanatory variables are categorical.

• There is an association if the population conditional distribution of the response variable differs among the categories of the explanatory variable.

Example: Contingency table on happiness cross-classified by

family income (data from 2006 GSS)


                        Happiness
Income     Very happy   Pretty happy   Not too happy   Total
-------------------------------------------------------------
Above      272 (44%)    294 (48%)       49 (8%)          615
Average    454 (32%)    835 (59%)      131 (9%)         1420
Below      185 (20%)    527 (57%)      208 (23%)         920
-------------------------------------------------------------

Response: Happiness, Explanatory: Income

Relationship between income and happiness?
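The row percentages in the table are the sample conditional distributions of happiness within each income group. A minimal sketch of how they can be recomputed from the counts (the variable and function names are illustrative, not part of the lecture):

```python
# Observed counts from the 2006 GSS table (rows: income, columns: happiness).
counts = {
    "Above":   [272, 294, 49],
    "Average": [454, 835, 131],
    "Below":   [185, 527, 208],
}
happiness_levels = ["Very", "Pretty", "Not too"]


def row_proportions(row):
    """Return the conditional distribution of the response within one row."""
    total = sum(row)
    return [cell / total for cell in row]


conditional = {income: row_proportions(row) for income, row in counts.items()}

for income, props in conditional.items():
    formatted = ", ".join(
        f"{level}: {p:.0%}" for level, p in zip(happiness_levels, props)
    )
    print(f"{income:<8} {formatted}")
```

If the three printed distributions were identical, income and happiness would show no association in the sample; here the share of "very happy" respondents falls as income falls.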


Chi-Squared Test of Independence

(Karl Pearson, 1900)

• Tests H0: The variables are statistically independent

• Ha: The variables are statistically dependent

• Intuition behind test statistic: Summarize differences

between observed cell counts and expected cell counts

(what is expected if H0 true)

• Notation: fo = observed frequency (cell count)

fe = expected frequency

r = number of rows in table, c = number of columns
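With this notation, the Pearson statistic is the sum over all cells of (fo - fe)² / fe, where fe = (row total × column total) / N, and it is referred to a chi-squared distribution with df = (r - 1)(c - 1). The sketch below applies these standard formulas to the income and happiness counts; the function name `chi_squared` is an illustrative choice.

```python
# Observed counts: rows are income (Above, Average, Below),
# columns are happiness (Very, Pretty, Not too).
observed = [
    [272, 294, 49],
    [454, 835, 131],
    [185, 527, 208],
]


def chi_squared(table):
    """Return (Pearson chi-squared statistic, degrees of freedom)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, f_o in enumerate(row):
            # Expected count under independence.
            f_e = row_totals[i] * col_totals[j] / n
            statistic += (f_o - f_e) ** 2 / f_e
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df


chi2, df = chi_squared(observed)
print(f"chi-squared = {chi2:.1f}, df = {df}")
```

For this table the statistic is about 172.3 on 4 degrees of freedom, far beyond the .05 critical value of 9.49, so H0 (independence) is rejected.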


• Chi-squared test answers “Is there an association?”

• Standardized residuals answer “How do data differ

from what independence predicts?”

• “How strong is the association?” using a measure of

the strength of association, such as the difference of

proportions
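The last two questions can be illustrated on the income and happiness counts. This sketch assumes the usual standardized (adjusted) residual, (fo - fe) / sqrt(fe × (1 - row proportion) × (1 - column proportion)), and compares the proportion "very happy" between the above-average and below-average income groups; the names used are illustrative.

```python
import math

# Rows: income (Above, Average, Below); columns: happiness (Very, Pretty, Not too).
observed = [
    [272, 294, 49],
    [454, 835, 131],
    [185, 527, 208],
]


def standardized_residuals(table):
    """Standardized residual for every cell of a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    result = []
    for i, row in enumerate(table):
        out = []
        for j, f_o in enumerate(row):
            f_e = row_totals[i] * col_totals[j] / n
            se = math.sqrt(
                f_e * (1 - row_totals[i] / n) * (1 - col_totals[j] / n)
            )
            out.append((f_o - f_e) / se)
        result.append(out)
    return result


residuals = standardized_residuals(observed)
for label, row in zip(["Above", "Average", "Below"], residuals):
    print(label, [round(z, 1) for z in row])

# Difference of proportions "very happy": above-average vs below-average income.
diff_very_happy = 272 / 615 - 185 / 920
print(f"difference of proportions = {diff_very_happy:.2f}")
```

Residuals larger than about 3 in absolute value flag cells that depart clearly from independence; a positive sign means more cases than independence predicts.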


Chi-square analysis

• Like all tests of hypothesis, chi square is sensitive to sample size.

– For the same pattern of cell proportions, the obtained chi square increases as N increases.

– With large samples, even trivial relationships may be statistically significant. As a rule of thumb, when N > 1000, set a stricter alpha of .01.
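The sensitivity to sample size can be checked numerically. The sketch below uses a hypothetical 2 × 2 table with a very weak association (52% versus 48%) and multiplies every count by 10 and by 100; the counts are invented purely for illustration.

```python
def chi_squared(table):
    """Pearson chi-squared statistic for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, f_o in enumerate(row):
            f_e = row_totals[i] * col_totals[j] / n
            total += (f_o - f_e) ** 2 / f_e
    return total


def scale(table, factor):
    """Multiply every cell count by the same factor (same proportions)."""
    return [[cell * factor for cell in row] for row in table]


base = [[52, 48], [48, 52]]  # hypothetical counts, N = 200

results = {factor: chi_squared(scale(base, factor)) for factor in (1, 10, 100)}
for factor, value in results.items():
    print(f"N = {200 * factor:>6}: chi-squared = {value:.2f}")
```

The cell proportions never change, yet the statistic grows in proportion to N. With df = 1 the critical values are 3.84 (alpha = .05) and 6.63 (alpha = .01), so the same trivial difference is non-significant at N = 200 and N = 2,000 but significant at N = 20,000.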

Practice 1

• CHI-SQUARE TEST (CROSS-TAB)

• A group of students was classified in terms of personality (introvert or extrovert) and in terms of colour preference (red, yellow, green or blue). Personality and colour preference are categorical variables. We want to answer this question:

• Is there an association between personality and colour preference?


• In SPSS, Analyze > Descriptive Statistics > Crosstabs

Practice 1 (output)

Chi-Square Tests

                               Value     df   Asymp. Sig. (2-sided)
Pearson Chi-Square             71.200a   3    .000
Likelihood Ratio               70.066    3    .000
Linear-by-Linear Association   69.124    1    .000
N of Valid Cases               400

a. 0 cells (0.0%) have expected count less than 5. The minimum expected count is 10.00.

There is a relationship between students’ personality and preferences for colours: χ² (3, N = 400) = 71.20, p < .0001.
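SPSS prints the significance as .000, which only means the p-value rounds to zero at three decimals. As a rough check, the upper-tail probability for df = 3 can be computed with the Python standard library using the closed form erfc(sqrt(x/2)) + sqrt(2x/pi) × exp(-x/2), which holds for 3 degrees of freedom only. The function name is illustrative.

```python
import math


def chi2_sf_df3(x):
    """Upper-tail probability P(X > x) for a chi-squared variable with df = 3."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


# Pearson chi-squared value reported in the Practice 1 output (df = 3).
p_value = chi2_sf_df3(71.2)
print(f"p = {p_value:.2e}")

# Sanity check against the familiar .05 critical value for df = 3.
print(f"P(X > 7.815) = {chi2_sf_df3(7.815):.3f}")
```

The computed p-value is many orders of magnitude below .0001, which is consistent with the way the result is reported.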


Discriminant analysis

• Similar to regression, except that the criterion (or dependent variable) is categorical rather than continuous.

• Used to identify boundaries between groups of objects

For example:

(a) Does a person have the disease or not?

(b) Is someone a good credit risk or not?

(c) Should a student be admitted to college or not?



• We wish to predict group membership for a

number of subjects from a set of predictor

variables.

• The criterion variable (also called grouping

variable) is the object of classification. This is

ALWAYS a categorical variable.

• Simple case: two groups and p predictor

variables.



• Similar to regression:

– Which predictor variables are related to the criterion (dependent variable)?

– Predict values on the criterion variable when given new values on the predictor variables


• Can we classify new (unclassified) subjects into

groups?

– Given the classification functions how accurate are

we? And when we are inaccurate is there some

pattern to the misclassification?

• What is the strength of association between

group membership and the predictors?

D = (.024 × age) + (.080 × self-concept) + (-.100 × anxiety) + (-.012 × days absent) + (.134 × anti-smoking score) - 4.543



Questions?

• Which predictors are most important in predicting group membership?

Practice 2

A study is set up to determine if the following variables help to discriminate between those who smoke and those who don't:

• age

• absence (days of absence last year)

• selfcon (self-concept score)

• anxiety (anxiety score)

• anti_smoking (attitude towards anti-smoking policies)


Practice 2

• In SPSS, Analyze > Classify > Discriminant


D = (.024 × age) + (.080 × self-concept) + (-.100 × anxiety) + (-.012 × days absent) + (.134 × anti-smoking score) - 4.543

Functions at Group Centroids
(means of group calculated by the D function)

               Function 1
non-smoker       1.125
smoker          -1.598

Canonical Discriminant Function Coefficients (unstandardized coefficients)

                                  Function 1
age                                 .024
self concept score                  .080
anxiety score                      -.100
days absent last year              -.012
total anti-smoking test score       .134
(Constant)                        -4.543
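To see how the unstandardized coefficients and group centroids are used, the sketch below computes D for one person and assigns the group whose centroid is closest. The person's scores are hypothetical, and the nearest-centroid rule is a simplifying assumption: SPSS classifies using posterior probabilities and prior group sizes, so its cut-off can differ slightly.

```python
# Unstandardized coefficients and constant from the SPSS output.
COEFFICIENTS = {
    "age": 0.024,
    "selfcon": 0.080,
    "anxiety": -0.100,
    "absence": -0.012,
    "anti_smoking": 0.134,
}
CONSTANT = -4.543

# Group centroids (mean D score in each group) from the SPSS output.
CENTROIDS = {"non-smoker": 1.125, "smoker": -1.598}


def discriminant_score(person):
    """Compute D as the weighted sum of the predictors plus the constant."""
    return CONSTANT + sum(COEFFICIENTS[name] * person[name] for name in COEFFICIENTS)


def nearest_centroid(score):
    """Assign the group whose centroid is closest to the score (simplified rule)."""
    return min(CENTROIDS, key=lambda group: abs(score - CENTROIDS[group]))


# Hypothetical person; these values are invented for illustration only.
person = {"age": 40, "selfcon": 46, "anxiety": 20, "absence": 5, "anti_smoking": 22}

score = discriminant_score(person)
predicted = nearest_centroid(score)
print(f"D = {score:.3f}, predicted group: {predicted}")
```

A positive coefficient pushes D towards the non-smoker centroid and a negative one towards the smoker centroid, which is one way to read the direction of each predictor.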

Practice 2

Classification Results a,c

                                            Predicted Group Membership
                           smoke or not     non-smoker    smoker    Total
Original            Count  non-smoker          238           19       257
                           smoker               17          164       181
                    %      non-smoker          92.6          7.4    100.0
                           smoker               9.4         90.6    100.0
Cross-validated b   Count  non-smoker          238           19       257
                           smoker               17          164       181
                    %      non-smoker          92.6          7.4    100.0
                           smoker               9.4         90.6    100.0

a. 91.8% of original grouped cases correctly classified.
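The percentages in the classification table follow directly from the counts. A short sketch of the arithmetic (the dictionary layout and function names are illustrative choices):

```python
# Counts from the classification table: actual group -> predicted group -> count.
confusion = {
    "non-smoker": {"non-smoker": 238, "smoker": 19},
    "smoker": {"non-smoker": 17, "smoker": 164},
}


def group_hit_rate(actual):
    """Percentage of cases in one actual group that were classified correctly."""
    row = confusion[actual]
    return 100 * row[actual] / sum(row.values())


def overall_hit_rate():
    """Percentage of all cases that were classified correctly."""
    correct = sum(confusion[group][group] for group in confusion)
    total = sum(sum(row.values()) for row in confusion.values())
    return 100 * correct / total


print(f"non-smokers correctly classified: {group_hit_rate('non-smoker'):.1f}%")
print(f"smokers correctly classified:     {group_hit_rate('smoker'):.1f}%")
print(f"overall:                          {overall_hit_rate():.1f}%")
```

The off-diagonal counts (19 and 17) are the misclassifications; comparing them shows whether errors lean towards one group.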


Practice 2

When reporting the result, we should include the following:

• Name of the predictors and sample size

• Results of the Univariate ANOVAs and the Box’s M test

• The significance of the discriminant function

• The variance explained (Canonical correlation coefficient)

• Significant predictors and their contribution to the model

(discriminant function)

• Result from the cross-validation process

(page 9)

Logistic regression

• In logistic regression the response (Y) is a

dichotomous categorical variable.

• For example: voting, mortality, and participation data are neither continuous nor normally distributed.

• Binary logistic regression is a type of regression analysis where the dependent variable is a dummy variable: coded 0 (did not vote) or 1 (did vote)


Logistic regression

• Models the relationship between a set of

variables xi

– dichotomous (eat : yes/no)

– categorical (social class, ... )

– continuous (age, ...)

and

– dichotomous variable Y


Binary Dependent Variables

A few examples:

• Consumer chooses brand (1) or not (0);

• A quality defect occurs (1) or not (0);

• A person is hired (1) or not (0);

• Other examples



• The logistic regression model is simply a non-linear transformation of the linear regression.

• The logistic distribution function is an S-shaped cumulative distribution function, similar in shape to that of the standard normal distribution, and it constrains the estimated probabilities to lie between 0 and 1.



• p: the probability of success/event (range from 0 to 1)

• 1-p: probability of failure/non-event

If the probability of success is .8 (80%), what is the probability of failure?

• The odds of success: the ratio of the probability of success to the probability of failure, odds = p / (1 - p)

• What are the odds of success in the above situation?

• What can we conclude about the probabilities of success and failure when the odds equal 1?
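The two questions can be worked through numerically. A minimal sketch, with illustrative function names, that converts a probability to odds and back:

```python
def odds(p):
    """Odds of success: probability of success divided by probability of failure."""
    return p / (1 - p)


def probability_from_odds(o):
    """Invert the odds back to a probability."""
    return o / (1 + o)


p_success = 0.8
p_failure = 1 - p_success
print(f"P(failure) = {p_failure:.1f}")
print(f"odds of success = {odds(p_success):.1f}")
print(f"odds = 1 corresponds to p = {probability_from_odds(1.0):.1f}")
```

With p = .8 the probability of failure is .2 and the odds are 4 to 1. Odds of exactly 1 mean success and failure are equally likely (p = .5).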



• Logistic regression: model the logit-transformed probability as a linear function of the predictor variables:

logit(p) = log(p/(1-p)) = log(odds) = b0 + b1*x1 + ... + bk*xk

• We can also transform the log of the odds back to a probability:

p = exp(b0 + b1*x1 + ... + bk*xk) / (1 + exp(b0 + b1*x1 + ... + bk*xk))
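The two formulas are inverses of each other, which a short sketch can confirm. The coefficients `b0` and `b1` and the predictor value `x1` below are hypothetical and chosen only to show the arithmetic.

```python
import math


def logit(p):
    """Log of the odds for a probability p strictly between 0 and 1."""
    return math.log(p / (1 - p))


def inverse_logit(log_odds):
    """Transform a log-odds value back to a probability."""
    return math.exp(log_odds) / (1 + math.exp(log_odds))


# Hypothetical model with one predictor (values invented for illustration).
b0, b1 = -1.5, 0.8
x1 = 2.0

log_odds = b0 + b1 * x1          # linear predictor on the logit scale
p = inverse_logit(log_odds)      # predicted probability of the event

print(f"log odds = {log_odds:.2f}, p = {p:.3f}")
print(f"round trip: logit(p) = {logit(p):.2f}")
```

Whatever value the linear predictor takes, the inverse logit always returns a number between 0 and 1, which is why the model is fitted on the log-odds scale.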



(SPSS output)

Variables in the Equation

                        B (log odds)   S.E.   Wald   df   Sig.   Exp(B) (odds ratio)
Step 1a   gender(1)       -.005        .202   .001   1    .981        .995

• If the odds ratio > 1: when the predictor increases, the odds of the event occurring increase.

• If the odds ratio < 1: when the predictor increases, the odds of the event occurring decrease.
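The columns of the SPSS output for gender are linked by simple formulas: Exp(B) is the exponential of B, the Wald statistic is (B / S.E.)², and its p-value comes from a chi-squared distribution with 1 degree of freedom. The sketch below recomputes them from the rounded values in the table, so small rounding differences from the SPSS figures are expected.

```python
import math

# Values for gender(1) as printed in the SPSS output (already rounded).
b = -0.005
se = 0.202

odds_ratio = math.exp(b)                 # Exp(B)
wald = (b / se) ** 2                     # Wald chi-squared statistic, df = 1
# For df = 1, the chi-squared upper-tail probability equals erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(wald / 2))

print(f"Exp(B) = {odds_ratio:.3f}")
print(f"Wald   = {wald:.3f}")
print(f"Sig.   = {p_value:.3f}")
```

An odds ratio this close to 1 with a p-value near 1 indicates that gender, on its own, does not help to predict smoking status in these data.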

Practice 3

• Conduct a logistic regression to see if gender is a significant predictor of whether someone is a smoker or non-smoker.

• In SPSS, Analyze > Regression > Binary Logistic

• The data file is smoker_DA.sav.

Practice 3

• Conduct a logistic regression to see if anti-smoking attitude is a significant predictor of whether someone is a smoker or non-smoker.

• In SPSS, Analyze > Regression > Binary Logistic

• The data file is smoker_DA.sav.


Practice 3

Conduct a logistic regression to see whether the following are significant predictors of whether someone is a smoker or non-smoker:

• age

• gender

• absence (days of absence last year)

• selfcon (self-concept score)

• anxiety (anxiety score)

• anti_smoking (attitude towards anti-smoking policies)

When we have no idea about the relative importance of the predictors, we choose a stepwise method (Forward: LR).


Practice 3

                                                          95% C.I. for Odds Ratio
                           B        S.E.    Odds Ratio    Lower     Upper
constant                  9.257**   2.050   10480.856
self-concept              -.260**    .033        .771      .724      .822
anxiety                    .236**    .036       1.266     1.181     1.357
absence                    .075*     .030       1.078     1.016     1.144
anti-smoking test score   -.303**    .075        .739      .638      .856

Notes. R² = .607 (Cox & Snell), .818 (Nagelkerke). Model χ² (8) = 42.0, p < .001. *p < .05. **p < .01
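The odds ratios and confidence limits in this table can be recovered from B and S.E.: the odds ratio is exp(B), and an approximate 95% interval is exp(B ± 1.96 × S.E.). The sketch below recomputes them from the rounded coefficients, so the last decimal may differ slightly from the SPSS figures.

```python
import math

# (B, S.E.) for each predictor, as printed in the table (rounded values).
estimates = {
    "self-concept": (-0.260, 0.033),
    "anxiety": (0.236, 0.036),
    "absence": (0.075, 0.030),
    "anti-smoking test score": (-0.303, 0.075),
}

Z_95 = 1.96  # standard normal quantile for a two-sided 95% interval


def odds_ratio_with_ci(b, se):
    """Return (odds ratio, lower limit, upper limit) for one coefficient."""
    return math.exp(b), math.exp(b - Z_95 * se), math.exp(b + Z_95 * se)


summary = {name: odds_ratio_with_ci(b, se) for name, (b, se) in estimates.items()}

for name, (ratio, lower, upper) in summary.items():
    print(f"{name:<24} OR = {ratio:.3f}  95% CI [{lower:.3f}, {upper:.3f}]")
```

An interval that excludes 1 corresponds to a predictor that is significant at the .05 level. Here higher anxiety and more days absent raise the odds of the outcome coded 1, while higher self-concept and anti-smoking scores lower them.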


Practice 3

• Report: A logistic regression analysis was conducted with age, gender, number of days absent from work in the previous year, self-concept score, anxiety score, and attitude to anti-smoking workplace policy as predictors. A total of 438 cases were analyzed. The full model significantly predicted whether an employee is a smoker or non-smoker (χ² = 42.04, df = 8, p < .001), accounting for between 60.7% and 81.8% of the variance in group membership, with 92.6% of non-smokers and 90.6% of smokers successfully predicted.

Assignment 7

• Deadline: December 10, 2014

• Detail: guidelines page 16


• Questions?
