
Analyses of Categorical Dependent Variables

Analyses Involving Categorical Dependent Variables

When Dependent Variables are Categorical

Examples: Dependent variable is simply Failure vs. Success

Dependent variable is Lived vs. Died

Dependent variable is Passed vs Failed

Chi-square analysis is frequently used.

Example Question: Is there a difference in likelihood of death in an ATV accident between persons wearing helmets and those without helmets?

Dependent variable is Death: No (0) vs. Yes (1). Independent variable is Helmet: No (0) vs. Yes (1).

Crosstabs

So, based on this analysis, there is no significant difference in likelihood of dying between ATV accident victims wearing helmets and those without helmets.

Comments on Chi-square analyses

What's good?

1. The analysis is appropriate. It hasn't been supplanted by something else.

2. The results are usually easy to communicate, especially to lay audiences.

3. A DV with a few more than 2 categories can be easily analyzed.

4. An IV with only a few more than 2 categories can be easily analyzed.

What's bad?

1. Incorporating more than one independent variable is awkward, requiring multiple tables.

2. Certain tests, such as tests of interactions, can't be performed easily when you have more than one IV.

3. Chi-square analyses can't be done when you have continuous IVs unless you categorize the continuous IVs, which goes against recommendations to NOT categorize continuous variables because you lose power.

Alternatives to the Chi-square test. We'll focus on dichotomous (two-valued) DVs.

1. Linear Regression techniques

a. Multiple Linear Regression. Stick your head in the sand and pretend that your DV is continuous and regress the (dichotomous) DV onto the collection of IVs.

b. Discriminant Analysis (equivalent to MR when DV is dichotomous)

Problems with regression-based methods, when the dependent variable is dichotomous and the independent variable is continuous.

1. Assumption is that underlying relationship between Y and X is linear.

But when Y has only two values, how can that be?

2. Linear techniques assume that variability about the regression line is homogeneous across possible values of X. But when Y has only two values, residual variability will vary as X varies, a violation of the homogeneity assumption.

3. Residuals will probably not be normally distributed.

4. The regression line will extend beyond the more negative of the two Y values in the negative direction and beyond the more positive value in the positive direction, resulting in Y-hats that are impossible values.

2. Logistic Regression

3. Probit analysis

Logistic Regression and Probit analysis are very similar. Almost everyone uses Logistic. We'll focus on it.

The Logistic Regression Equation

Without restricting the interpretation, assume that the dependent variable, Y, takes on two values, 0 or 1.

Conceptualizing Y-hat. When Y is GPA, actual GPAs and predicted GPAs are just like each other. They can even be identical. However, when Y is a dichotomy, it can take on only 2 values, 0 and 1, although predicted Ys can be any value. So how do we reconcile that discrepancy?

When you have a two-valued DV it is convenient to think of Y-hat as the likelihood or probability that one of the values will occur. We'll use that conceptualization in what follows and view Y-hat as the probability that Y will equal 1.

The equation will be presented as an equation for the probability that Y = 1, written simply as P(Y=1). So we're conceptualizing Y-hat as the probability that Y is 1.

The equation for simple Logistic Regression (analogous to Predicted Y = B0 + B1*X in linear regression):

Y-hat = P(Y=1) = 1 / (1 + e^-(B0 + B1*X)) = e^(B0 + B1*X) / (e^(B0 + B1*X) + 1)

The logistic regression equation defines an S-shaped (ogival) curve that rises from 0 to 1 as X ranges from -∞ to +∞. P(Y=1) is never negative and never larger than 1.
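To make the equation concrete, here is a small R sketch (not part of the original handout; the B0, B1, and X values are made-up numbers). The two algebraic forms above give identical values, and both stay strictly between 0 and 1:

# The simple logistic regression equation, typed directly into R.
logit_p <- function(x, b0, b1) 1 / (1 + exp(-(b0 + b1 * x)))

x  <- seq(-5, 5, by = 1)
p1 <- logit_p(x, b0 = 0, b1 = 1)              # first form of the equation
p2 <- exp(0 + 1 * x) / (exp(0 + 1 * x) + 1)   # second, equivalent form
all.equal(p1, p2)                             # TRUE: the two forms agree
range(p1)                                     # always strictly between 0 and 1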

The curve of the equation . . .

B0: B0 is analogous to the linear regression constant, i.e., the intercept parameter. Although B0 defines the "height" of the curve at a given X, it should be noted that the curve as a whole moves to the right as B0 decreases. For the graphs below, B1 = 1 and X ranged from -5 to +5.

(Figure: logistic curves for B0 = -1, 0, and +1; the vertical axis is P(Y=1).)

For equations for which B1 is the same, changing B0 only changes the location of the curve over the range of X-axis values.

The slope of the curve remains the same.

B1: B1 is analogous to the slope of the linear regression line. B1 defines the steepness of the curve. It is sometimes called a discrimination parameter.

The larger the value of B1, the steeper the curve, the more quickly it goes from 0 to 1. B0=0 for the graph.

(Figure: logistic curves for B1 = 1, 2, and 4; the vertical axis is P(Y=1).)
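Curves like the two figures above can be reproduced with a short R sketch (illustrative only; plogis() is R's built-in logistic function, equal to 1/(1 + e^-q)):

# Changing B0 moves the curve left/right; changing B1 changes its steepness.
curve(plogis(0 + 1 * x),  from = -5, to = 5, xlab = "X", ylab = "P(Y=1)")  # B0 = 0,  B1 = 1
curve(plogis(1 + 1 * x),  add = TRUE, lty = 2)                             # B0 = +1: curve shifts left
curve(plogis(-1 + 1 * x), add = TRUE, lty = 3)                             # B0 = -1: curve shifts right

curve(plogis(1 * x), from = -5, to = 5, xlab = "X", ylab = "P(Y=1)")       # B1 = 1
curve(plogis(2 * x), add = TRUE, lty = 2)                                  # B1 = 2: steeper
curve(plogis(4 * x), add = TRUE, lty = 3)                                  # B1 = 4: steepest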

Note that there is a MAJOR difference between the linear regression curves we're familiar with and logistic regression curves:

The logistic regression lines asymptote at 0 and 1. They're bounded by 0 and 1.

But the linear regression lines extend below 0 on the left and above 1 on the right; the predicted Ys range from -∞ to +∞.

If we interpret P(Y) as a probability, the linear regression curves cannot literally represent P(Y) except for a limited range of X values.

Example:

P(Y) = .09090909   Odds of Y = .09090909/.90909091 = .1   Y is 1/10th as likely to occur as to not occur.
P(Y) = .50         Odds of Y = .5/.5 = 1                   Y is as likely to occur as to not occur.
P(Y) = .80         Odds of Y = .8/.2 = 4                   Y is 4 times more likely to occur than to not occur.
P(Y) = .99         Odds of Y = .99/.01 = 99                Y is 99 times more likely to occur than to not occur.
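These odds values are easy to verify in R (an illustrative check, not part of the handout); taking the log of the odds recovers the linear part of the model:

p <- c(.0909, .50, .80, .99)
p / (1 - p)        # odds: approximately .1, 1, 4, and 99
log(p / (1 - p))   # log odds (the logit); qlogis(p) gives the same values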

So logistic regression is logistic in probability but linear in log odds.

Why we must fit ogival-shaped curves: the curse of categorization

Here's a perfectly nice linear relationship between score values, from a recent study.

This relationship is of ACT Comp scores to Wonderlic scores. It shows that as intelligence gets higher, ACT scores get larger.

[DataSet3] G:\MdbR\0DataFiles\BalancedScale_110706.sav

Here's the relationship when ACT Comp has been dichotomized at 23, into Low vs. High.

When proportions of High scores are plotted vs. WPT value, we get the following:

So, to fit the above curve relating proportions of persons with High ACT scores to WPT, we need a model that is ogival.

This is where the logistic regression function comes into play.

This means that even if the underlying true values are linearly related, proportions based on the dichotomized values will not be linearly related to the independent variable.
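A small simulation makes the point (illustrative only; the numbers are made up and are not the ACT/WPT data): the underlying relationship is perfectly linear, but after dichotomizing the outcome, the proportion of "High" cases traces an ogive against the predictor.

set.seed(123)
x  <- rep(1:30, each = 200)                    # hypothetical predictor values
y  <- 10 + 1.5 * x + rnorm(length(x), sd = 10) # outcome linearly related to x
hi <- as.numeric(y >= 40)                      # dichotomize the outcome at an arbitrary cutoff
p_hi <- tapply(hi, x, mean)                    # proportion "High" at each value of x
plot(1:30, p_hi, xlab = "x", ylab = "Proportion High")  # S-shaped, not linear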

Crosstabs and Logistic Regression

Applied to the same 2x2 situation

The FFROSH data.

The data here are from a study of the effect of the Freshman Seminar course on 1st semester GPA and on retention. It involved students from 1987-1992. The data were gathered to investigate the effectiveness of having the freshman seminar course as a requirement for all students. There were two main criteria, i.e., dependent variables: first semester GPA excluding the seminar course, and whether a student continued into the 2nd semester.

The dependent variable in this analysis is whether or not a student moved directly into the 2nd semester in the spring following his/her 1st fall semester. It is called RETAINED and is equal to 1 for students who retained to the immediately following spring semester and 0 for those who did not.

The analysis reported here was a serendipitous finding regarding the time at which students register for school. It has been my experience that those students who wait until the last minute to register for school perform more poorly on the average than do students who register earlier. This analysis looked at whether this informal observation could be extended to the likelihood of retention to the 2nd semester.

After examining the distribution of the times students registered prior to the first day of class we decided to compute a dichotomous variable representing the time prior to the 1st day of class that a student registered for classes. The variable was called EARLIREG for EARLY REGistration. It had the value 1 for all students who registered 150 or more days prior to the first day of class and the value 0 for students who registered within 150 days of the 1st day. (The 150 day value was chosen after inspection of the 1st semester GPA data.)
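In R the same dichotomization is a one-liner (a sketch; days_before_reg below is a hypothetical name for whatever the registration lead-time variable was called):

# 1 = registered 150 or more days before the first day of class, 0 = registered later.
ffroshnm$earlireg <- ifelse(ffroshnm$days_before_reg >= 150, 1, 0)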

So the analysis that follows examines the relationship of RETAINED to EARLIREG, retention to the 2nd semester to early registration.

The analyses will be performed using CROSSTABS and using LOGISTIC REGRESSION.

First, univariate analyses . . .

GET FILE='E:\MdbR\FFROSH\Ffroshnm.sav'.

Fre var=retained earlireg.

retained

               Frequency   Percent   Valid Percent   Cumulative Percent
Valid   .00          552      11.6            11.6                 11.6
        1.00        4201      88.4            88.4                100.0
        Total       4753     100.0           100.0

earlireg

               Frequency   Percent   Valid Percent   Cumulative Percent
Valid   .00         2316      48.7            48.7                 48.7
        1.00        2437      51.3            51.3                100.0
        Total       4753     100.0           100.0

crosstabs tables = retained by earlireg /cells=cou col /sta=chisq.

Crosstabs

(So, 92.4% of those who registered early were retained, compared to 84.2% of those who registered late.)

The same analysis using Logistic Regression

Analyze -> Regression -> Binary Logistic

logistic regression retained WITH earlireg.

Logistic Regression

(The display to the left is a valuable check to make sure that your 1 is the same as the Logistic Regression procedure's 1. Do whatever you can to make Logistic's 1s be the same cases as your 1s. Trust me.)

The Logistic Regression procedure applies the logistic regression model to the data. It estimates the parameters of the logistic regression equation.

That equation is P(Y) = 1 / (1 + e^-(B0 + B1*X))

The LOGISTIC REGRESSION procedure performs the estimation in two stages.

The first stage estimates only B0. So the model fit to the data in the first stage is simply

P(Y) = 1 / (1 + e^-B0)

SPSS labels the various stages of the estimation procedure Blocks. In Block 0, a model with only B0 is estimated

The second stage estimates both B0 and B1. So the model fit to the data in the second stage is simply

P(Y) = 1 / (1 + e^-(B0 + B1*X))

SPSS labels the various stages of the estimation procedure Blocks. In Block 1, a model with both B0 and B1 is estimated.

The first stage . . .

Block 0: Beginning Block (estimating only B0)

Explanation of the above table: The program estimated B0 = 2.030. The resulting P(Y=1) = .8839.

The program computes Y-hat = .8839 for each case using the logistic regression formula with the estimate of B0. If Y-hat is less than or equal to a predetermined cut value of 0.500, that case is recorded as a predicted 0. If Y-hat is greater than 0.5, the program records that case as a predicted 1. It then creates the above table of number of actual 1s and 0s vs. predicted 1s and 0s. All predicted Ys are 1 in this particular example. Sometimes this table is more useful than it was in this case. It's typically most useful when the equation includes continuous predictors.

The Variables in the Equation Table for Block 0.

The Variables in the Equation box is the Logistic Regression equivalent of the Coefficients box in regular regression analysis. The prediction equation for Block 0 is Y-hat = 1/(1 + e^-2.030). Recall that B1 is not yet in the equation.

The test statistic in the Variables in the Equation table is not a t statistic, as in regular regression, but the Wald statistic. The Wald statistic is (B/SE)^2. So (2.030/.045)^2 = 2,035, which would be 2009.624 if the two coefficients were represented with greater precision.

Exp(B) is the odds ratio: e^2.030. It is the ratio of the odds of Y=1 when the predictor equals 1 to the odds of Y=1 when the predictor equals 0. It's an indicator of strength of relationship to the predictor. It means nothing here since there is no predictor in the equation.
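The Block 0 numbers can be verified by hand with the rounded coefficients shown above (an illustrative check):

b0 <- 2.030; se <- .045
1 / (1 + exp(-b0))   # .8839, the predicted P(Y=1) for every case
(b0 / se)^2          # about 2035, the Wald statistic from the rounded coefficients
exp(b0)              # the Exp(B) printed for the constant, about 7.6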

The Variables not in the Equation Table for Block 0.

The Variables not in the Equation table gives information on each independent variable that is not in the equation. Specifically, it tells you whether or not the variable would be significant if it were added to the equation. In this case, it's telling us that EARLIREG would contribute significantly to the equation if it were added, which is what SPSS does next . . .

The second stage . . .

Block 1: Method = Enter (Adding estimation of B1 to the equation)

Whew: three chi-square statistics.

Step: Compared to the previous step in a stepwise regression. Ignore for now, since this regression had only 1 step.

Block: Tests the significance of the improvement in fit of the model evaluated in this block vs. the previous block. Note that the chi-square is identical to the Likelihood ratio chi-square printed in the Chi-square Box in the CROSSTABS output.

Model: I believe this is analogous to the ANOVA F in REGRESSION, testing whether the model with all predictors fits better than a model with no predictors, i.e., an independence model.

The value under -2 Log likelihood is a measure of how well the model fit the data in an absolute sense. Values closer to 0 represent better fit. But goodness of fit is complicated by sample size. The R Square values are measures analogous to percent of variance accounted for. All three measures tell us that there is a lot of variability in proportions of persons retained that is not accounted for by this one-predictor model.

The above table is the version of the table including Y-hats based on B0 and B1.

Note that since X is a dichotomous variable here, there are only two y-hat values. They are

P(Y) = 1 / (1 + e^-(B0 + B1*0)) = .842 (see below)

And

P(Y) = 1 / (1 + e^-(B0 + B1*1)) = .924 (see below)

In both cases, the y-hat was greater than .5, so predicted Y in the table was 1 for all cases.

The prediction equation is Y-hat = P(Y=1) = 1 / (1 + e^-(1.670 + .830*EARLIREG)).

Since EARLIREG has only two values, those students who registered early will have a predicted RETAINED value of 1/(1 + e^-(1.670 + .830*1)) = .924. Those who registered late will have a predicted RETAINED value of 1/(1 + e^-(1.670 + .830*0)) = 1/(1 + e^-1.670) = .842.

Exp(B) is called the odds ratio. It is the ratio of the odds of Y=1 when X=1 to the odds of Y=1 when X=0.

Recall that the odds of 1 are P(Y=1)/(1-P(Y=1)). The odds ratio is

Odds ratio = (Odds when X=1) / (Odds when X=0) = [.924/(1-.924)] / [.842/(1-.842)] = 12.158 / 5.329 = 2.29.

So a person who registered early had odds of retaining that were 2.29 times the odds of a person registering late being retained. Odds ratio of 1 means that the DV is not related to the predictor.
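An illustrative check of these numbers with the rounded coefficients (plogis() is the logistic function 1/(1 + e^-q)):

p_early <- plogis(1.670 + .830 * 1)   # .924
p_late  <- plogis(1.670 + .830 * 0)   # .842
(p_early / (1 - p_early)) / (p_late / (1 - p_late))   # odds ratio, about 2.29
exp(.830)                                             # the same ratio, directly from B1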

Graphical representation of what we've just found.

The following is a plot of Y-hat vs. X, that is, the plot of predicted Y vs. X. Since there are only two values of X (0 and 1), the plot has only two points. The curve drawn on the plot is the theoretical relationship of y-hat to other hypothetical values of X over a wide range of X values (ignoring the fact that none of them could occur.) The curve is analogous to the straight line plot in a regular regression analysis.

(Figure: Y-hat plotted against EARLIREG. The two points are the predicted values for the two possible values of EARLIREG.)

CrossTabs in Rcmdr

Start Rcmdr in R, then import the ffrosh for P5100 data.

Crosstabs in Rcmdr requires that the variables to be crossed be factors.

First, convert the variables to factors

Data -> Manage variables in active data set -> Convert numeric variables to factors

Note that I created a new column in the Rcmdr data editor so that I can use EARLIREG in procedures that analyze regular variables and also in procedures that require factors.
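One way to do the same conversion directly in R (a sketch; the new column names match those that appear in the output below):

ffroshnm$earliregfact <- factor(ffroshnm$earlireg)   # levels "0" and "1"
ffroshnm$retainedfact <- factor(ffroshnm$retained)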

By the way: Rcmdr's import automatically converts any variable whose values have labels into factors.

You can remove this tendency by

1) Removing value labels from the variable in the SPSS file prior to importing, or

2) Unchecking the Convert value labels to factor levels box in the Import SPSS Data set dialog.

Statistics -> Contingency Tables -> Two-way table . . .

Note that this procedure works only for factors.

The output.

Frequency table:

             earliregfact
retainedfact    0    1
           0  367  185
           1 1949 2252

Pearson's Chi-squared test

data: .Table

X-squared = 78.832, df = 1, p-value < 2.2e-16

The chi-square value is the same as the Pearson chi-square in the SPSS output above.
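The same test can be run without Rcmdr by entering the four cell counts directly (a sketch using the counts from the frequency table above):

tab <- matrix(c(367, 1949, 185, 2252), nrow = 2,
              dimnames = list(retained = c("0", "1"), earlireg = c("0", "1")))
prop.table(tab, margin = 2)        # column proportions: .842 retained among late, .924 among early registrants
chisq.test(tab, correct = FALSE)   # X-squared = 78.8, matching the Pearson chi-square above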

Logistic Regression Analysis in Rcmdr.

Statistics -> Fit Models -> Generalized Linear Models . . .

(Note that the procedure is Generalized Linear Models . . ., not Linear Model and not Linear Regression.)

library(foreign, pos=14)

> ffroshnm <- ...
> colnames(ffroshnm) <- ...
> library(abind, pos=15)

> GLM.1 <- glm(retained ~ earlireg, family = binomial(logit), data = ffroshnm)
> summary(GLM.1)

Call:

glm(formula = retained ~ earlireg, family = binomial(logit),

data = ffroshnm)

Deviance Residuals:

Min 1Q Median 3Q Max

-2.2708 0.3974 0.3974 0.5874 0.5874

(The z value is the square root of the Wald value printed by SPSS. The p-values are identical.)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  1.66971    0.05690  29.343
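As a check on the correspondence with the SPSS results, the coefficients, odds ratio, and the two predicted probabilities can be pulled out of GLM.1 directly (an illustrative sketch, not part of the original Rcmdr output):

coef(GLM.1)        # intercept about 1.670, earlireg about .830
exp(coef(GLM.1))   # Exp(B): the earlireg odds ratio, about 2.29
predict(GLM.1, newdata = data.frame(earlireg = c(0, 1)), type = "response")   # .842 and .924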

Data -> Weight Cases . . .

All analyses after the Weight Cases dialog will involve the expanded data of 150 cases.

If you're interested, the syntax that will do all of the above is . . .

DATASET ACTIVATE DataSet5.

Data list free / gender source judgment freq.

(This syntax reads frequency counts and analyzes them as if they were individual respondent data.)

Begin data.

1 1 1 7

1 1 0 18

1 2 1 13

1 2 0 12

1 3 1 19

1 3 0 6

2 1 1 5

2 1 0 20

2 2 1 17

2 2 0 8

2 3 1 20

2 3 0 5

end data.

weight by freq.

value labels gender 1 "Male" 2 "Female"

/source 1 "Economist" 2 "Labor Leader" 3 "Politician"

/ judgment 1 "Biased" 0 "Unbiased".
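A hedged R sketch of the same analysis (not from the handout): enter the twelve frequency-table rows directly and let glm() weight each row by its count, which is equivalent to expanding the data to 150 cases. The source factor is releveled so that Politician is the reference group, matching the SPSS dummy coding described below; Female is already the reference level for gender.

# The 12 rows of the gender x source x judgment frequency table listed above.
dat <- data.frame(
  gender   = factor(rep(c("Male", "Female"), each = 6)),
  source   = factor(rep(c("Economist", "Labor Leader", "Politician"), each = 2, times = 2)),
  judgment = rep(c(1, 0), times = 6),
  freq     = c(7, 18, 13, 12, 19, 6, 5, 20, 17, 8, 20, 5)
)
dat$source <- relevel(dat$source, ref = "Politician")

# Block 1: main effects.  Block 2: add the source x gender interaction.
m1 <- glm(judgment ~ source + gender, family = binomial, data = dat, weights = freq)
m2 <- update(m1, . ~ . + source:gender)
anova(m1, m2, test = "Chisq")   # likelihood-ratio test of the interaction block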

The logistic regression dialogs . . .

Analyze -> Regression -> Binary Logistic . . .

The syntax to invoke the Logistic Regression command is

logistic regression judgment /categorical = gender source

/enter source gender /enter source by gender

/save = pred(predict).

(Note that we have to tell the LOGISTIC REGRESSION procedure to analyze the interaction between the two factors.)

The Categorical Variables Codings table (below) tells us about the group coding variables for source and gender:

Source is dummy coded, with Politician as the reference group.
Source(1) compares Economists with Politicians.
Source(2) compares Labor Leaders with Politicians.
Gender is dummy coded, with Female as the reference group.
Gender(1) compares Males with Females.
Thanks, Logistic.

Logistic Regression output

(Note that the N is incorrect. We told SPSS to expand the summary data into 150 individual cases. But this part of the Logistic Regression command does not acknowledge that expansion except in the footnote.)

Case Processing Summary

Unweighted Cases(a)                       N    Percent
Selected Cases   Included in Analysis    12      100.0
                 Missing Cases            0         .0
                 Total                   12      100.0
Unselected Cases                          0         .0
Total                                    12      100.0

a. If weight is in effect, see classification table for the total number of cases.

(As always, make sure the internal code is identical to the original code.)

Dependent Variable Encoding

Original Value    Internal Value
.00 Unbiased                   0
1.00 Biased                    1

Categorical Variables Codings(a)

                             Frequency   Parameter coding
                                            (1)      (2)
source   1.00 Economist          4         1.000     .000
         2.00 Labor Leader       4          .000    1.000
         3.00 Politician         4          .000     .000
gender   1.00 Male               6         1.000
         2.00 Female             6          .000

a. This coding results in indicator coefficients.

Note that if you had not learned about group coding variables, the information in the Categorical Variables Codings table would make no sense at all.

This is one example of SPSS output whose understanding depends on your understanding of group coding schemes.

Block 0: Beginning Block

(I generally ignore the Block 0 output. Not much of interest here except to logistic regression aficionados.)

Classification Table(a,b)

                                        Predicted
                                   judgment                    Percentage
Observed                           .00 Unbiased   1.00 Biased     Correct
Step 0   judgment   .00 Unbiased        0              69              .0
                    1.00 Biased         0              81           100.0
         Overall Percentage                                          54.0

a. Constant is included in the model.
b. The cut value is .500

Variables in the Equation

                      B    S.E.   Wald   df   Sig.   Exp(B)
Step 0   Constant   .160   .164   .958    1   .328    1.174

Variables not in the Equation

                                Score   df   Sig.
Step 0   Variables   source    30.435    2   .000
                     source(1) 27.174    1   .000
                     source(2)  1.087    1   .297
                     gender(1)   .242    1   .623
         Overall Statistics    30.676    3   .000

Block 1: Method = Enter

This is the 1st of two blocks: one for the main effects; one for the interaction.

Omnibus Tests of Model Coefficients

                  Chi-square   df   Sig.
Step 1   Step         32.186    3   .000
         Block        32.186    3   .000
         Model        32.186    3   .000

Model Summary

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1             174.797(a)                   .193                  .258

a. Estimation terminated at iteration number 4 because parameter estimates changed by less than .001.

Classification Table(a)

                                        Predicted
                                   judgment                    Percentage
Observed                           .00 Unbiased   1.00 Biased     Correct
Step 1   judgment   .00 Unbiased       38              31            55.1
                    1.00 Biased        12              69            85.2
         Overall Percentage                                          71.3

a. The cut value is .500

Variables in the Equation

                         B     S.E.     Wald   df   Sig.   Exp(B)
Step 1(a)   source                    26.944    2   .000
            source(1)  -2.424   .476  25.880    1   .000     .089
            source(2)   -.862   .448   3.709    1   .054     .422
            gender(1)   -.202   .368    .303    1   .582     .817
            Constant    1.370   .393  12.143    1   .000    3.934

a. Variable(s) entered on step 1: source, gender.

Source: The overall differences in probability of rating passage as biased across the 3 source groups.

Source(1): The probability of rating the passage as biased was least when respondents were told that the message was from Economists.

Source(2): No officially significant difference between the probability of rating the passage as biased when attributed to Labor Leaders vs. when attributed to Politicians.

Gender(1): No difference in the probability of rating the passage as biased between males and females.

Block 2: Method = Enter

This block adds the interaction of Source x Gender. No change in results.

Omnibus Tests of Model Coefficients

                  Chi-square   df   Sig.
Step 1   Step          1.594    2   .451
         Block         1.594    2   .451
         Model        33.780    5   .000

Model Summary

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1             173.203(a)                   .202                  .269

a. Estimation terminated at iteration number 4 because parameter estimates changed by less than .001.

Classification Table(a)

                                        Predicted
                                   judgment                    Percentage
Observed                           .00 Unbiased   1.00 Biased     Correct
Step 1   judgment   .00 Unbiased       38              31            55.1
                    1.00 Biased        12              69            85.2
         Overall Percentage                                          71.3

a. The cut value is .500

Variables in the Equation

                                       B     S.E.     Wald   df   Sig.   Exp(B)
Step 1(a)   source                                  17.214    2   .000
            source(1)               -2.773   .707   15.374    1   .000     .063
            source(2)                -.633   .659     .922    1   .337     .531
            gender(1)                -.234   .685     .116    1   .733     .792
            source * gender                          1.573    2   .455
            source(1) by gender(1)    .675   .958     .497    1   .481    1.965
            source(2) by gender(1)   -.440   .902     .238    1   .626     .644
            Constant                 1.386   .500    7.687    1   .006    4.000

a. Variable(s) entered on step 1: source * gender.

Since the interaction was not significant, we don't have to interpret the results of this block.

The main conclusion is that respondents rated passages from politicians as more biased than from economists.

Logistic Regression Example 6: Amylase vs Lipase

logistic regression pancgrp with logamy loglip

LOGISTIC REGRESSION VAR=pancgrp

/METHOD=ENTER logamy loglip

/CLASSPLOT

/CRITERIA PIN(.05) POUT(.10) ITERATE(20) CUT(.5) .

[DataSet3]G:\MdbT\InClassDatasets\amylip.sav

Logistic Regression

Block 0: Beginning Block

The following assumes a model with only the constant, B0, in the equation.

Each p-value tells you whether or not the variable would be significant if entered BY ITSELF. That is, each of the above p-values should be interpreted on the assumption that only 1 of the variables would be entered.

Block 1: Method = Enter

(Specificity: 204/208. Sensitivity: 38/48.)

Specificity is the ability to identify cases who do NOT have the disease.

Among those without the disease, .981 were correctly identified.

Sensitivity is the ability to identify cases who do have the disease.

Among those with the disease, .792 were correctly identified.
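Both figures come straight from the classification table counts (an illustrative check; the fractions are the ones quoted above):

204 / 208   # specificity: proportion of no-Pancreatitis cases correctly classified, about .981
38 / 48     # sensitivity: proportion of Pancreatitis cases correctly classified, about .792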

(Note that LOGAMY does not officially increase predictability over that afforded by LOGLIP.)

Interpretation of the coefficients . . .

Bs: Not easily interpretable on a raw probability scale. On a log odds scale: Expected increase in log odds for a one-unit increase in IV.

P(Y=1) increases as the IV increases if Bi > 0, and decreases if Bi < 0. We just cannot give a simple quantitative prediction of the amount of change in probability of Y=1.

SEs: Standard error of the estimate of Bi.

Wald: Test statistic.

Sig: p-value associated with test statistic.

Note that LOGAMY does NOT (officially) add significantly to prediction over and above the prediction afforded by LOGLIP.

Exp(B): Odds ratio for a one-unit increase in IV among persons equal on the other IV.

A person one unit higher on the IV will have Exp(B) times the odds of having Pancreatitis.

So a person one unit higher on LOGLIP will have 20.04 times the odds of having Pancreatitis.

The Exp(B) column is mostly useful for dichotomous predictors: 0 = absent; 1 = present.
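For a continuous predictor the Exp(B) scaling is easy to work out by hand (an illustrative calculation using the 20.04 figure quoted above):

exp_b <- 20.04    # odds ratio for a one-unit increase in LOGLIP
b <- log(exp_b)   # the corresponding B, about 3.0
exp(b * 0.5)      # a person 0.5 units higher has about 4.5 times the odds
exp(b * 2)        # a person 2 units higher has about 20.04^2, roughly 402 times the odds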

Classification Plots: a frequency distribution of all cases across Y-hat values, with different symbols representing actual classification.

(Classification plot, Step 1: Observed Groups and Predicted Probabilities. The horizontal axis is the predicted probability of membership for Pancreatitis, from 0 to 1; the cut value is .50. Symbols: N = No Pancreatitis, P = Pancreatitis; each symbol represents 5 cases. The N cases pile up at low predicted probabilities and the P cases at high predicted probabilities.)

One aspect of the above plot is misleading because each symbol represents a group of cases. Only those cases which happened to be so close to other cases that a group of 5 cases could be formed are represented. So, for example, those relatively few cases whose y-hats were close to .5 are not seen in the above plot, because there were not enough to make 5 cases.

Classification Plots using dot plots.

Here's the same information gotten as dot plots of Y-hats with PANCGRP as a Row Panel Variable.

(Dot plots: No-Pancreatitis cases in one panel, Pancreatitis cases in the other.)

For the most part, the patients who did not get Pancreatitis had small predicted probabilities while the patients who did get it had high predicted probabilities, as you would expect. There were, however, a few patients who did get Pancreatitis who had small values of Y-hat. Those patients are dragging down the sensitivity of the test. Note that these patients don't show up on the CASEPLOT produced by the LOGISTIC REGRESSION procedure.

Classification Plots using Histograms in EXPLORE

Here's another equivalent representation of what the authors of the program were trying to show.

Visualizing the equation with two predictors (skipped to ROC plots in 2016).

(Mike: use this as an opportunity to whine about SPSS's horrible 3-D graphing capability.)

With one predictor, a simple scatterplot of YHATs vs. X will show the relationship between Y and X implied by the model.

For two-predictor models, a 3-D scatterplot is required. Here's how the graph below was produced.

Graphs -> Interactive -> Scatterplot. . .

(The graph shows the general ogival relationship of YHAT on the vertical to LOGLIP and LOGAMY. But the relationships really aren't apparent until the graph is rotated. Don't ask me to demonstrate rotation. SPSS now does not offer the ability to rotate the graph interactively. It used to offer such a capability, but it's been removed. Shame on SPSS.)

The same graph but with Linear Regression Y-hats plotted vs. loglip and logamy.

Representing Relationships with a Table (the PowerPoint slides)

compute logamygp2 = rnd(logamy,.5).
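The SPSS RND(logamy, .5) call rounds to the nearest multiple of .5; a one-line R equivalent (illustrative; it assumes logamy is a numeric vector in the workspace):

logamygp2 <- round(logamy / 0.5) * 0.5   # e.g., 2.31 -> 2.5, 2.24 -> 2.0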