
Visualizing Simple Logistic Regression Models using Mosaic Plots

Heike Hofmann

August 30, 2006

Abstract

For categorical data, the relationship between visual displays of the data and models is not very well explored, even though for categorical data and loglinear models strong relationships do exist. Starting from assessing odds ratios visually, interaction effects of variables can be examined using mosaic plots. Cumulative link models provide a way to describe trends between ordinal variables. In order to link the theory tightly to the displays, cumulative odds ratios are used. Examples from real data sets highlight the usability and applicability of this approach throughout the paper.

Keywords: Cumulative Link Models; Cumulative Odds Ratios; Logistic Regression; Mosaic Plots; Multivariate Categorical Data; Proportional Odds Regression Model; Visual Modelling

1 Introduction

“With a powerful conceptual model, a graph can also become a tool for thinking”
Michael Friendly (1995), p. 160

In addition to the above quote, it should be mentioned that only the “right” graph will be the tool enabling us to think. Mosaic plots seem to be the right graphical display for loglinear models, which undeniably provide a very powerful conceptual model.

The relationship between mosaic plots and loglinear models has long been established: Friendly (1999) suggested showing residuals on mosaics of the raw data by shading the tiles. Theus and Lauer (1999) overlaid mosaics of the fitted values with proportionally colored residuals to help detect patterns in the residuals, as a visual aid for finding missing interaction terms in loglinear models. The relationship between odds ratios and mosaic plots of multidimensional binary tables has been established by Hofmann (2001).

The goal of this paper is to make the link between mosaic plots and loglinear models more explicit by pointing out visual indicators for parameters in different models. We will also extend the conceptual model to ordinal variables and their corresponding models. The idea goes back to the very simple displays we are used to working with in our everyday environment: by looking at a scatterplot of two variables, we are immediately able to draw conclusions about a corresponding simple regression.

Figure 1 gives an example of a scatterplot of two variables (left), with the regression line drawn on top of the points. From the scatterplot we can find rough visual estimates for intercept and slope in the corresponding simple linear regression model. The amount of scatter allows some conclusions about the overall error. The boxplots on the right hand side of figure 1 correspond to a simple ANOVA model of Y in X. Again, we can visually assess the estimates of the model from the medians of the boxplots. Both the size of the boxes and the hinges allow us to draw conclusions about the standard error of the model.

Figure 1: Scatterplot of two continuous variables (left), boxplots by group for a combination of one continuous and one categorical variable (right). Both plots correspond directly to linear models.

We want to introduce a similar model for mosaic plots. Obviously, we are able to identify patterns in mosaic plots. Figure 2 shows an example of three mosaic plots of a 6 × 4 table. One mosaic plot shows the raw data, the other two show random permutations of rows and columns of the table. The mosaic plot in the middle stands out because of its clearly recognizable structure, identifying mosaic B as the one which shows the real data. The question then becomes how to translate a pattern as visible as the one in mosaic B into an appropriate model.

Figure 2: Three mosaic plots (Mosaic A, Mosaic B, Mosaic C) of a two-dimensional data table. One mosaic shows the raw data, the other two mosaics show randomly permuted values. Mosaic B stands out because of its prominent tile pattern.

Section 2 gives an introduction to relating odds ratios to mosaic plots and to “reading” odds ratios directly from the plot. We will apply this method to mosaic plots of continuous variables, which will lead us to logistic regression as the corresponding model. Section 3 extends the concept to an ordinal response variable. This requires the introduction of cumulative odds ratios and the proportional odds ratio model (or cumulative logit model, Agresti (1984)). Using the proportional odds ratio model as an example, we discuss how higher dimensional interaction terms can be assessed in the mosaic plot.


2 Logistic Regression

2.1 Reading Odds Ratios in Binary Tables

For a table of two binary variables of the form:

    Y \ X   x1   x2
    y1      b    d
    y2      a    c

the odds ratio θ = ad/(bc) gives a measure for the strength of the association. Using a mosaic plot to visually represent the table, the odds ratio is preserved and can be read off approximately from the plot by comparing the heights of tiles a and c. More precisely, the relationship between the odds ratio θ and the difference in the heights $h_a$ and $h_c$ of tiles a and c, respectively, is given as (Hofmann, 2001):

$$h := h_c - h_a = \frac{1 - \theta}{1 + \theta + \sqrt{4\theta + D^2}}, \qquad (1)$$

where D = a/b − d/c is a measure for the asymmetry in the data. Solving equation (1) for θ shows that the log odds ratio is approximately linear in the difference of the tiles' heights:

$$\log \theta(h, D) = -4h + \sum_{i+j \ge 3} O\left(h^i D^j\right). \qquad (2)$$

The sign follows from equation (1): for θ > 1 the difference h is negative. The approximation works best for D → 0 or h → 0. D is small whenever a/b ≈ d/c, i.e. if $h_a \approx 1 - h_c = h_d$; h is small whenever $h_a \approx h_c$.
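Equation (1) and the linear approximation can be checked numerically. The following Python sketch is an illustration only: the cell counts are made up, and the function names `tile_heights` and `h_from_theta` are not from the paper. It assumes that the tile heights are the column proportions $h_a = a/(a+b)$ and $h_c = c/(c+d)$, which agrees with the condition $h_a \approx 1 - h_c = h_d$ given above. With h defined as $h_c - h_a$ and θ = ad/(bc), an odds ratio above 1 gives a negative h, so log θ is compared with −4h.

```python
import math

def tile_heights(a, b, c, d):
    """Heights of tiles a and c in a mosaic plot of the 2x2 table.

    Column x1 holds cells b (y1) and a (y2); column x2 holds d (y1) and c (y2).
    """
    return a / (a + b), c / (c + d)

def h_from_theta(theta, D):
    """Right-hand side of equation (1)."""
    return (1 - theta) / (1 + theta + math.sqrt(4 * theta + D ** 2))

# Hypothetical counts, chosen only for illustration.
a, b, c, d = 30, 20, 20, 30

theta = (a * d) / (b * c)   # odds ratio
D = a / b - d / c           # asymmetry measure
h_a, h_c = tile_heights(a, b, c, d)
h = h_c - h_a               # height difference as defined in equation (1)

print(f"h from tile heights : {h:.4f}")
print(f"h from equation (1) : {h_from_theta(theta, D):.4f}")
print(f"log(theta)          : {math.log(theta):.4f}")
print(f"-4 * h              : {-4 * h:.4f}")
```

For this symmetric table (D = 0) the two values of h coincide and the linear term is close to log θ. Trying tables with larger asymmetry or larger height differences shows how the approximation degrades.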

Figure 3: Mosaic plot corresponding to a 2 × 2 contingency table. The graph on the left hand side gives the relationship between bin heights and log odds. For log odds values between −2 and 2 the values can be assessed visually.

Figure 3 shows a mosaic plot of two binary variables, corresponding to the table above. Similarly, the log odds value can be read from the display as a function of the tile's height a: log odds(a) = log(a/(1 − a)). Since this function is point-symmetric around a = 0.5 and approximately linear for a ∈ (0.25, 0.75), we may, for highlighting values in this range, read the log odds values from the log odds scale sketched on the rightmost side of figure 3.

The notion of “reading” values from a display should not be taken too far. We do not mean to imply that precise numbers can be obtained from a display; we are aiming for a rough assessment only. In the case of log odds ratios we will be looking at some of the following properties:


• Equal heights of tiles identify log odds ratios close to zero (indicating statistical independence in the underlying values).

• Comparisons between several displays allow a judgement of weaker and stronger association.

• Small multiples in comparisons: statements of the form “the odds ratio appears to be about the same” in two plots, or “one odds ratio is about x times the other odds ratio”, where x is a small integer multiple.
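The near-linearity of the log odds in the tile height, which is what makes these rough readings possible, can be checked with a few lines of Python. This is a small sketch, not part of the paper: the function names are illustrative, and the linear reference is the tangent line at a height of 0.5, where the log odds function has slope 4.

```python
import math

def log_odds(p):
    """Log odds of a tile height p, with 0 < p < 1."""
    return math.log(p / (1 - p))

def linear_log_odds(p):
    """Tangent-line approximation of the log odds at p = 0.5."""
    return 4 * (p - 0.5)

for p in (0.25, 0.4, 0.5, 0.6, 0.75, 0.9):
    exact = log_odds(p)
    approx = linear_log_odds(p)
    print(f"height = {p:.2f}  log odds = {exact:+.3f}  "
          f"linear = {approx:+.3f}  error = {exact - approx:+.3f}")
```

Inside the range (0.25, 0.75) the error stays below 0.1 on the log odds scale; outside it grows quickly, which is why readings from very short or very tall tiles are less reliable.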

A decision about the significance of an association, on the other hand, cannot be made based on the display. Significance depends on the overall number of observations n, and this number is not preserved visually in a mosaic plot. Figure 4 shows three mosaic plots of a series of 2 × 2 tables with increasingly strong associations from left to right. In the mosaic on the left hand side, the log odds ratio is close to zero, indicating near independence. The difference in heights of the purple tiles approximately triples in size from one plot to the next. As for significance, the mosaic plots do not allow any conclusions. Even the strongest association on the right needs at least 94 observations to be significant.
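The dependence of significance on n can be illustrated with a short Python sketch. The counts below are hypothetical (chosen so that the odds ratio is 2.8, they are not the table behind figure 4), and the Wald z statistic with the usual large-sample standard error is one common way to judge significance, not necessarily the test used in the paper. Multiplying all cells by a constant leaves the mosaic plot and the odds ratio unchanged, while the z statistic grows with the square root of the multiplier.

```python
import math

def log_odds_ratio_z(a, b, c, d):
    """Log odds ratio log(ad/(bc)) of a 2x2 table and its Wald z statistic."""
    log_theta = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # large-sample standard error
    return log_theta, log_theta / se

base = (14, 10, 10, 20)  # hypothetical counts a, b, c, d with odds ratio 2.8
results = []
for k in (1, 2, 4, 8):
    table = tuple(k * n for n in base)
    log_theta, z = log_odds_ratio_z(*table)
    results.append((sum(table), log_theta, z))
    print(f"n = {sum(table):4d}  log odds ratio = {log_theta:.3f}  z = {z:.2f}")
```

In this example the smallest table does not reach the conventional threshold of 1.96, while the same proportions with twice as many observations do.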

Figure 4: Series of three mosaic plots of 2 × 2 tables (Y versus X) with varying degrees of association. The odds ratios from left to right are 1.1, 1.4 and 2.8, respectively. The difference in heights of the purple tiles, which corresponds to the log odds ratio, approximately triples each time.

2.2 Application: Logistic Regression

The approximation for log odds ratios can also be applied in a logistic regression. Assume that we have a continuous variable X and a binary variable Y. Figure 5 shows different aspects of these data. The loess fit might hint towards a logistic regression. Complementary to these more traditionally used displays, we suggest the use of mosaic plots for a discretized version of X, i.e. X is cut into equally spaced intervals of size d. Figures 6 and 7 show a histogram and a mosaic plot of discretized versions of X. The coloring is given by the groups in Y. The differences in heights of neighboring (purple) tiles in the mosaic are approximately constant. This difference gives a visual estimate of a in the logistic regression equation

$$\operatorname{logit} P(Y = 1 \mid X = x) = \log \frac{P(Y = 1 \mid X = x)}{P(Y = 0 \mid X = x)} = ax + b + \varepsilon,$$

with ε ∼ N(0, σ), i.i.d.

The difference between the highlighting heights of tiles i and i + 1 is approximately equal to $-\tfrac{1}{4} \log \theta_i$, where $\theta_i$ is the odds ratio of the four tiles involved, i.e. assuming equidistant breakpoints $x_0 < x_1 < \dots < x_I$ for X, $x \in (x_i, x_{i+1})$ and $d = x_{i+1} - x_i$:

$$\log \theta_i = \operatorname{logit} P(Y = 1 \mid X = x) - \operatorname{logit} P(Y = 1 \mid X = x + d) = -ad \qquad (3)$$

Equation (3) is true for all $x \in (x_i, x_{i+1})$ and 0 ≤ i < I − 1, which leads immediately to the quantities shown in the mosaics:

$$\log \theta_i = \operatorname{logit} P(Y = 1 \mid X \in (x_i, x_{i+1})) - \operatorname{logit} P(Y = 1 \mid X \in (x_{i+1}, x_{i+2})) = -ad$$

The difference in heights of neighboring tiles is therefore approximately ad/4.

Figure 5: Scatterplot of Y vs X (left) overlaid by a loess fit. Parallel boxplots (right) of X by Y. The width of the boxes is adjusted to the number of values in each group.

Figure 6: Histogram (left) and mosaic plot (right) of X with coloring given by Y. X is cut into 10 intervals.
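The link between the slope a and the tile heights can be illustrated with a small Python sketch. It is a simplification under stated assumptions: the slope, intercept and bin layout are made up, the error term is dropped, and the highlighted height of each bin is taken as the model probability at the bin midpoint rather than the exact conditional probability over the interval.

```python
import math

def expit(t):
    """Inverse logit."""
    return 1 / (1 + math.exp(-t))

def logit(p):
    """Log odds of a probability p."""
    return math.log(p / (1 - p))

# Hypothetical slope, intercept and bin layout (not taken from the paper).
a, b = 0.5, 0.0
d = 1.0                                        # bin width
midpoints = [-4.5 + d * i for i in range(10)]  # 10 bins covering (-5, 5)

# Highlighted tile heights, evaluated at the bin midpoints.
heights = [expit(a * x + b) for x in midpoints]

logit_steps, height_steps = [], []
for i in range(len(heights) - 1):
    logit_steps.append(logit(heights[i + 1]) - logit(heights[i]))
    height_steps.append(heights[i + 1] - heights[i])
    print(f"bins {i + 1:2d}-{i + 2:2d}: logit step = {logit_steps[-1]:.3f}, "
          f"height step = {height_steps[-1]:.3f}, a*d/4 = {a * d / 4:.3f}")
```

The logits of neighboring bins always differ by a·d, whereas the height steps are close to a·d/4 only for heights in the middle range and shrink towards the tails. This is why the visual estimate of a should be read from tiles with heights roughly between 0.25 and 0.75.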

3 Proportional Odds Logistic Regression

3.1 Cumulative Odds Ratios

In the situation where Y is not binary but an ordinal categorical variable, the modelling approach should take that into account. Agresti (2002) suggests the use of cumulative odds and odds ratios for ordinal data. The cumulative logit is defined as:

$$\operatorname{logit} P(Y \le j \mid X = x) = \log \frac{P(Y \le j \mid X = x)}{1 - P(Y \le j \mid X = x)} = \log \frac{P(Y \le j \mid X = x)}{P(Y > j \mid X = x)}$$

Figure 7: Histogram (left) and mosaic plot (right) of X with coloring given by Y. X is cut into 20 intervals.

The difference between two logits then constitutes the cumulative (log) odds ratio. We are going to use the following definition (see figure 8):

$$\theta_{ij} = \operatorname{logit} P(Y \le j \mid X = i + 1) - \operatorname{logit} P(Y \le j \mid X = i) = \log \frac{P(Y \le j \mid X = i + 1)\, P(Y > j \mid X = i)}{P(Y > j \mid X = i + 1)\, P(Y \le j \mid X = i)}$$

The advantage of using cumulative odds ratios over the more commonly used local odds ratios is that we are able to apply the method of reading odds ratios directly from the previous section.

Figure 8: Mosaic plot of a 2 × 4 table; bins separated by a black line are added up to form a cumulative odds ratio.

3.2 Application: Proportional Odds Logistic Regression

A proportional odds logistic regression in ordinal variables X and Y, given as

$$\operatorname{logit} P(Y \le j \mid X = x) = \beta x + \alpha_j + \varepsilon_{jx},$$

means constant cumulative odds ratios:

$$\operatorname{logit} P(Y \le j \mid X = x_1) - \operatorname{logit} P(Y \le j \mid X = x_2) = \beta (x_1 - x_2).$$
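The constancy of the cumulative odds ratios can be made concrete with a short Python sketch. The parameter values are hypothetical and the error term is ignored; the function names are illustrative. The sketch computes the cumulative probabilities and the resulting category probabilities (the tile heights of a mosaic column) for several values of x, and the cumulative log odds ratio between neighboring columns at every cut point j.

```python
import math

def expit(t):
    """Inverse logit."""
    return 1 / (1 + math.exp(-t))

def logit(p):
    """Log odds of a probability p."""
    return math.log(p / (1 - p))

def cumulative_probs(x, alphas, beta):
    """P(Y <= j | X = x), j = 1..J-1, under logit P(Y <= j | x) = beta*x + alpha_j."""
    return [expit(beta * x + alpha) for alpha in alphas]

def cell_probs(x, alphas, beta):
    """Category probabilities P(Y = j | X = x), j = 1..J (the tile heights)."""
    cum = cumulative_probs(x, alphas, beta) + [1.0]
    return [cum[0]] + [cum[j] - cum[j - 1] for j in range(1, len(cum))]

# Hypothetical parameters for a 4-level response; the alphas must be increasing.
alphas = [-1.0, 0.8, 1.8]
beta = -0.2

all_ratios = []
for x in range(5):
    lower = cumulative_probs(x, alphas, beta)
    upper = cumulative_probs(x + 1, alphas, beta)
    ratios = [logit(q) - logit(p) for p, q in zip(lower, upper)]
    all_ratios.append(ratios)
    print(f"x = {x}: cumulative log odds ratios = {[round(r, 3) for r in ratios]}, "
          f"tile heights = {[round(p, 3) for p in cell_probs(x, alphas, beta)]}")
```

Every cumulative log odds ratio equals β, regardless of the column and the cut point, while the tile heights themselves change from column to column.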

Figure 9: Mosaic plots of the Mental Health Data (panels: Raw Data, Uniform Odds Ratio, Column Effects Odds Ratios). Mental Health is plotted versus SES of the student's parents. Fitted values from a proportional odds logistic regression are shown in the middle, raw data are on the left. The plot on the right shows fitted values from model (4).

For neighboring columns we can assume x2 = x1 + 1, which means that we are, again, able to get a rough visual estimate for β from the difference in corresponding tiles' heights of a mosaic plot.

The Mental Health data have been used by Goodman (1979) for an introduction of his simple association models. Table 1 contains numbers on the relationship between mental health and socioeconomic status of the parents.

    Mental Health               Parents' Socio-Economic Status
    Status                A     B     C     D     E     F    Total
    impaired             46    40    60    94    78    71      389
    moderate symptom     58    54    65    77    54    54      362
    mild symptom         94    94   105   141    97    71      602
    well                 64    57    57    72    36    21      307
    Total               262   245   287   384   265   217     1660

Table 1: Mental Health Data, recorded by Srole et al. (1962).

Here, we use a proportional odds logistic regression (polr) to fit the relationship between the mental health of a student (classified on a scale of well, mildly impaired, moderately impaired, or impaired) and the socio-economic status of his parents (SES, classified from A, highest, to F, lowest). Figure 9 shows three mosaic plots of this data set. On the left, the raw data are shown, which exhibit a distinct, if not surprising, pattern of deteriorating mental health with lower SES. The fitted values from a polr are shown in the mosaic in the middle of figure 9. Differences in heights of corresponding levels (same color or shading) in neighboring columns give an estimate of β. Since we are dealing with a constant cumulative odds ratio, this difference in heights is the same across all columns. The parameters $\alpha_j$ are given by the heights of the leftmost column, since $\alpha_j = \operatorname{logit} P(Y \le j \mid X = 0)$, see figure 10.

    polr(formula = mh ~ ses, data = mh, weights = number)

    Coefficients:
         Value      Std. Error  t value
    ses  0.1666683  0.02770198  6.016475   (beta)

    Intercepts:
                                                        Value    Std. Error  t value
    well|mild symptom formation                        -0.9234   0.1115      -8.2826   (alpha3)
    mild symptom formation|moderate symptom formation   0.7741   0.1093       7.0824   (alpha2)
    moderate symptom formation|impaired                 1.7813   0.1162      15.3322   (alpha1)

Figure 10: R output of the fitted proportional odds logistic regression (left). On the right, the corresponding mosaic plot of fitted values (same as the mosaic in the middle of fig. 9). The corresponding parameters are sketched in at the appropriate places.

The mosaic plot on the right of figure 9 does not assume the same cumulative odds ratio across all columns, but estimates different cumulative odds ratios between each two neighboring columns separately. The underlying model is

$$\operatorname{logit} P(Y \le j \mid X = i) = \beta^X_i + \alpha_j + \varepsilon_{ji}, \qquad (4)$$

i.e. X is assumed to be a nominal factor. In the graphic, we can see this, too: the difference in heights of corresponding levels is the same for all three levels of the same columns, but different between different columns, since

$$\operatorname{logit} P(Y \le j \mid X = i + 1) - \operatorname{logit} P(Y \le j \mid X = i) = \beta^X_{i+1} + \alpha_j - \left(\beta^X_i + \alpha_j\right) = \beta^X_{i+1} - \beta^X_i,$$

i.e. the differences between tiles' heights of corresponding levels are independent of j.

In the example, the first two columns are almost equal in level heights, indicating a cumulative log odds ratio close to zero. All other log odds ratios are positive, indicating that mental health deteriorates with lower socioeconomic status.
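The empirical counterparts of these quantities can be computed directly from the counts in Table 1. The following Python sketch is an illustration, not the fitting procedure used in the paper: it evaluates the raw cumulative log odds ratios between neighboring SES columns at each of the three cut points, with Y cumulated from "well" upwards. With this orientation the values for the column pairs B-C through E-F come out negative; the sign is purely a matter of orientation and flips if the column order or the direction of cumulation is reversed.

```python
import math

# Counts from Table 1; rows ordered well < mild < moderate < impaired,
# columns are the parents' SES categories A..F.
counts = {
    "well":     [64, 57, 57, 72, 36, 21],
    "mild":     [94, 94, 105, 141, 97, 71],
    "moderate": [58, 54, 65, 77, 54, 54],
    "impaired": [46, 40, 60, 94, 78, 71],
}
levels = ["well", "mild", "moderate", "impaired"]
ses = "ABCDEF"

def cumulative_logits(col):
    """Empirical logit P(Y <= j | X = col) for the first three levels of Y."""
    total = sum(counts[lev][col] for lev in levels)
    running, out = 0, []
    for lev in levels[:-1]:
        running += counts[lev][col]
        out.append(math.log(running / (total - running)))
    return out

logits = [cumulative_logits(i) for i in range(len(ses))]

# theta[i][j]: cumulative log odds ratio between columns i and i+1 at cut point j.
theta = [[q - p for p, q in zip(logits[i], logits[i + 1])]
         for i in range(len(ses) - 1)]

for i, row in enumerate(theta):
    print(f"{ses[i]}-{ses[i + 1]}: " + "  ".join(f"{t:+.3f}" for t in row))
```

The values for the pair A-B are all within 0.1 of zero, in line with the nearly equal level heights of the first two columns, and all remaining pairs share one sign.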

Figure 11: Mosaic plots of the Mental Health data (panels: Raw Data, Uniform Odds Ratio, Column Effects Odds Ratios). SES categories A and B are collapsed, as are C and D. From left to right the data displayed are the raw data, fitted values from a polr with uniform odds ratio, and fitted values from a polr as given in eq. (4).

Even though the second model is not significantly better than the first model, the results might nevertheless suggest that socioeconomic status is measured on too fine a scale. The association between mental health and socioeconomic status would be captured better if socioeconomic levels A & B and C & D were collapsed, forcing the corresponding log odds ratios $\beta_A$ and $\beta_C$ to be zero. Figure 11 shows mosaic plots for the same models as in figure 9 with the reduced data set. The mosaic plots are almost indistinguishable, visually indicating good fits. XXX goodness of fit statistics?

3.3 Assessing higher dimensional interactions

The routines described in this section apply to all previously introduced models and mosaics. We will restrict the description to the example of proportional odds models, though, as we feel that the relationships between graphical displays and this type of model cover all aspects and allow the reader to readily apply the techniques to other types of models.

Assessments of higher dimensional interaction terms can be made by comparing parameters of lower dimensional mosaic plots corresponding to splits of the data set along one of the variables in the interaction.

An interaction effect between three variables is present, by definition, if the association between two of these variables changes for different values or levels of the third variable.

Assume we are dealing with three variables X, Y and Z, where Y is the ordinal response variable, X is ordinal and Z is a nominal variable. A proportional odds logistic regression might look like this:

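$$\operatorname{logit} P(Y \le i \mid X = x, Z = k) = \alpha_i + \beta_x x + \beta^Z_k + \beta^{XZ}_k x + \varepsilon_{ixk},$$

i.e. we are modeling cumulative odds of Y by main effects and an interaction between X and Z. The term $\beta^{XZ}_k$ will be significant if there is a significant three-way interaction between the variables X, Y, and Z. Differences between the $\beta^{XZ}_k$ can be written as differences of log odds ratios:

$$\begin{aligned}
\text{log odds}_{ixk} &= \operatorname{logit} P(Y \le i \mid X = x + 1, Z = k) - \operatorname{logit} P(Y \le i \mid X = x, Z = k) \\
&= \alpha_i + \beta_x (x + 1) + \beta^Z_k + \beta^{XZ}_k (x + 1) - \left(\alpha_i + \beta_x x + \beta^Z_k + \beta^{XZ}_k x\right) \\
&= \beta_x + \beta^{XZ}_k, \\
\beta^{XZ}_{k+1} - \beta^{XZ}_k &= \text{log odds}_{ix(k+1)} - \text{log odds}_{ixk}.
\end{aligned}$$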

This implies that we have a visual means to assess the three-way interaction by comparing the two-way (log) odds ratios. Large differences indicate a strong three-way interaction; small differences or equality translate to a weak interaction or no three-way interaction. Figure 12 shows an example of this.

On the left of the figure, a mosaic plot of Y versus X is shown. On the right, there are two mosaics, one on top of the other. The upper mosaic contains variables Y versus X and W. All of the log odds ratios are very similar (maybe increasing slightly in value from left to right), which indicates that the corresponding three-way interaction of X, Y, and W is weak at best. Adding the interaction term $\beta^{XW}$ to the model decreases the model's deviance by 0.12 (on 1 degree of freedom, as we have assumed W to be ordinal), which is obviously not a significant improvement.

The mosaic below shows variable Y versus X and Z. Only the first two log odds ratios show similar values; the third log odds ratio is quite a bit larger, and the fourth log odds ratio even changes direction. These are major differences in log odds ratios and will contribute to a significant three-way interaction in the model. The model with the additional interaction term corresponds to a decrease of the deviance by 8.53 (on 3 degrees of freedom), which corresponds to a p-value of 0.036.
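The two deviance comparisons can be reproduced with the chi-square distribution, which is the usual reference distribution for a drop in deviance between nested models. The Python sketch below uses the closed-form upper tail probabilities for 1 and 3 degrees of freedom so that only the standard library is needed; the function names are illustrative and not from the paper.

```python
import math

def chi2_sf_1df(x):
    """Upper tail probability of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))

def chi2_sf_3df(x):
    """Upper tail probability of a chi-square distribution with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

p_w = chi2_sf_1df(0.12)  # interaction of X and W: deviance drop 0.12 on 1 df
p_z = chi2_sf_3df(8.53)  # interaction of X and Z: deviance drop 8.53 on 3 df

print(f"X:W interaction, p = {p_w:.3f}")
print(f"X:Z interaction, p = {p_z:.3f}")
```

The second value agrees with the p-value of 0.036 quoted above, while the first is far from any conventional significance level.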

Figure 12: Two dimensional mosaic plot of the marginal association of X and Y on the left. The mosaics on the right each incorporate an additional third variable: (a) non-significant three-way interaction between X, Y, and W; (b) significant three-way interaction between X, Y, and Z. In the upper mosaic the association between X and Y is only slightly different for the different levels of W, indicating an absent or very weak three-way interaction between X, Y, and W. In contrast, the association between X and Y is dramatically different for the different levels of Z in the lower mosaic plot. This is the indicator for a strong three-way interaction of X, Y, and Z.

Note again that we cannot draw any conclusions about significance from the graphical representation, since significance depends on the overall number of observations, which is not shown. We can, however, safely conclude that the interaction shown in the upper mosaic of figure 12 is a lot weaker than the one shown in the lower mosaic, given that both representations are based on the same number of observations.

Conceptually, higher interaction terms could be visually displayed and assessed in a similar fashion, i.e. a four-way interaction corresponds to the (log) odds ratio of two-dimensional log odds ratios (see e.g. Bhapkar and Koch (1968)), for which we could again use the approximation described in section 2.1. A five-way interaction is then based on the difference in corresponding four-way interactions. In practice, however, this does not seem feasible. Fortunately, we are only rarely dealing with these high-dimensional interaction terms, as we tend to avoid them because of the difficulties in their interpretation.

4 Example: The Housing Data

The housing data (see table 2) contain data from a housing conditions survey in Copenhagen as reported by ??. Four variables are recorded:

Satisfaction Y (ordinal variable): Low < Medium < High
Contact C with other residents (ordered binary variable): Low < High
Perceived Influence I on management decisions (ordinal): Low < Medium < High
Type T of housing (nominal): Towers, Apartments, Terrace, Atriums

    Type                       Tower        Apartment        Atrium         Terrace
    Influence               L    M    H    L    M    H    L    M    H    L    M    H   Total
    Contact  Satisfaction
    Low      Low           21   34   10   61   43   26   13    8    6   18   15    7     262
             Medium        21   22   11   23   35   18    9    8    7    6   13    5     178
             High          28   36   36   17   40   54   10   12    9    7   13   11     273
    High     Low           14   17    3   78   48   15   20   10    7   57   31    5     305
             Medium        19   23    5   46   45   25   23   22   10   23   21    6     268
             High          37   40   23   43   86   62   20   24   21   13   13   13     395
    Total                 140  172   88  268  297  200   95   84   60  124  106   47    1681

Table 2: Frequency Table from a Copenhagen Housing Conditions Survey (Madsen, 1976)

Venables and Ripley (2002, p. 204) suggest two different models for the housing data. We will follow their approach with the corresponding graphical displays. Both models are in the framework of proportional odds logistic regression. In the first model the cumulative logit of the response is modeled by the main effects of the explanatory variables, i.e.

Model 1

$$\operatorname{logit} P(Y \le i \mid T = j, I = k, C = \ell) = \alpha_i + \beta^T_j + \beta^I_k + \beta^C_\ell + \varepsilon_{ijk\ell}$$

with $\varepsilon_{ijk\ell} \sim N(0, \sigma^2)$ i.i.d.,

i.e. all explanatory variables are treated as nominal factors. This yields a model with (assuming first effects are zero for identifiability) 2 + 3 + 2 + 1 = 8 parameters.

Figure 13 shows a series of three mosaic plots of the housing data corresponding to the parameters of model 1. The top row shows raw data, the bottom row shows fitted values for the same variables. From left to right, satisfaction is plotted versus Type, perceived Influence and Contact. The mosaic plot in the middle suggests exploiting the ordinal structure of Influence in the model. This leads to a variation of model 1 where $\beta^I_k = \beta^I \cdot k$, i.e. influence is modelled as an ordinal variable. This frees one parameter without a significant loss in fit.

The largest deviation between raw and fitted values becomes apparent by comparing the two leftmost mosaic plots: the model is not fully able to explain the satisfaction of residents in atrium type houses. More people than predicted by the model express medium satisfaction. This difference is mainly due to fewer people expressing low satisfaction.

Model 2 has two additional terms accommodating the interactions between type of housing and perceived influence, and between type of housing and contact with other residents:

Model 2

$$\operatorname{logit} P(Y \le i \mid T = j, I = k, C = \ell) = \alpha_i + \beta^T_j + \beta^I_k + \beta^C_\ell + \beta^{TC}_{j\ell} + \beta^{TI}_{jk} + \varepsilon_{ijk\ell}$$

with $\varepsilon_{ijk\ell} \sim N(0, \sigma^2)$ i.i.d.

Figure 13: Series of three mosaic plots of the housing data (Type, Influence and Contact versus Satisfaction). The top row shows raw data, the bottom row shows fitted values from model 1 for the same variables.

Figure 14: Mosaic plots of Type and Influence versus Satisfaction. Raw data are shown on the left, fitted values from model 2 are given on the right. The pattern of strictly increasing odds ratios between Influence and Satisfaction only holds for apartments and terraces.

Figure 14 shows two mosaic plots: raw data are displayed on the left, fitted values from model 2are given on the right. The association between Satisfaction and perceived Influence varies for thefour different types of housing. Here, Influence is modeled as a nominal factor. From the mosaicplot it becomes obvious, that any other approach would be unsuccessful in a model, as the patternof strictly increasing odds ratios between Influence and Satisfaction does not hold for residents ofTowers. The largest differences between model and raw data are for residents of Atrium houses witha medium perceived influence: fewer people than expected by the model express high satisfaction,balanced by more medium satisfied people. The second higher-order term of the model involves thevariables Type, Contact and Satisfaction. Figure 15 shows mosaics of their relationship. Raw data


[Figure 15 panels: "Raw Data: Type and Contact versus Satisfaction" (left) and "Model 2: Type and Contact versus Satisfaction" (right). Axes: Type (Tower, Apartment, Atrium, Terrace) and Contact (Low, High) against Satisfaction (Low, Medium, High).]

Figure 15: Mosaic plots of Type and Contact versus Satisfaction. Raw data are on the left, fitted values from model 2 are shown on the right. Different odds ratios between Satisfaction and Contact are modelled for different types of housing.

[Figure 16 panels: "Raw Data" (left), "Model 1" (upper right) and "Model 2" (lower right). Each panel shows Type (Tower, Apartment, Atrium, Terrace), Cont (Low, High), Influence (Low, Medium, High) and Satisfaction (Low, Medium, High).]

Figure 16: Mosaic plots of the full housing data. All four variables are displayed. The plot on the left shows the raw data. On the right fitted values are displayed, with fits from model 1 on top, and fits from model 2 at the bottom.

are on the left, fitted values are on the right. Satisfaction of residents generally seems to increase with increased contact. The only exception is the residents of terrace houses, who express higher satisfaction with less neighborly contact. Overall, these residents express the least satisfaction with their housing situation.
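The comparison of Contact and Satisfaction within one housing type can be made concrete with cumulative odds ratios. The following is a minimal sketch, assuming the standard definition in which the ordinal response is split at each cutpoint into "at or below" versus "above". The function name `cumulative_odds_ratios` and the counts are hypothetical and serve only as illustration; they are not the housing data.

```python
def cumulative_odds_ratios(row_a, row_b):
    """Cumulative odds ratios of group a versus group b.

    Each row holds counts over the ordered response categories
    (e.g. Low, Medium, High satisfaction). For every cutpoint the
    response is collapsed to "at or below" versus "above", and the
    odds ratio of the resulting 2x2 table is returned.
    """
    if len(row_a) != len(row_b) or len(row_a) < 2:
        raise ValueError("rows need the same length and at least two categories")
    ratios = []
    for cut in range(1, len(row_a)):
        a_low, a_high = sum(row_a[:cut]), sum(row_a[cut:])
        b_low, b_high = sum(row_b[:cut]), sum(row_b[cut:])
        if a_high == 0 or b_low == 0:
            raise ValueError("empty cell: odds ratio is undefined at this cutpoint")
        ratios.append((a_low * b_high) / (a_high * b_low))
    return ratios


# Hypothetical counts for one housing type (NOT taken from the housing data):
# satisfaction Low, Medium, High for residents with low and high contact.
low_contact = [30, 20, 10]
high_contact = [20, 20, 20]

ratios = cumulative_odds_ratios(low_contact, high_contact)
print(ratios)
```

Values above 1 at every cutpoint mean that residents with low contact are more likely to fall into the lower satisfaction categories, i.e. satisfaction increases with contact. Under a proportional odds model the ratios would be the same at every cutpoint.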


A comprehensive view of the housing data is given in figure 16. All three mosaic plots show the same four variables. The raw data are displayed on the left, fitted values from model 1 are in the upper right plot, and fitted values from model 2 are in the lower right plot. The deficits of model 1 become obvious right away: since only main effects are fitted, none of the higher-dimensional associations are considered. If you think of the plot as consisting of eight two-dimensional panels arranged as $\begin{pmatrix} 1 & 2 & 3 & 4 \\ 5 & 6 & 7 & 8 \end{pmatrix}$, only the essence of panels no. 2, 4, and 6 is caught by model 1. Panel 7 will always be a problem in proportional odds models, as it breaks the property of cumulative odds ratios underlying proportional odds logistic regressions. Visually, the fit of model 2 is a lot better than the fit of model 1. The mosaic plot corresponding to model 2 also gives the impression that the model fits well. In numbers, model 2 is significantly better than model 1 with a decrease in deviance of 0.21 on 2 degrees of freedom. Compared to the full model, model 2 has a residual deviance of 2.24 on 8 degrees of freedom.
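A residual deviance can be turned into an approximate goodness-of-fit p-value by referring it to a chi-square distribution with the residual degrees of freedom. The sketch below assumes this usual large-sample approximation, which the text does not spell out, and uses only the Python standard library. The function name `chi_square_upper_tail` is hypothetical.

```python
import math


def chi_square_upper_tail(statistic, df):
    """Return P(X >= statistic) for X ~ chi-square with integer df >= 1."""
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if statistic <= 0:
        return 1.0
    y = statistic / 2.0
    # Closed-form starting values of the regularized upper incomplete
    # gamma function Q(a, y): Q(1/2, y) = erfc(sqrt(y)), Q(1, y) = exp(-y).
    if df % 2 == 1:
        a, tail = 0.5, math.erfc(math.sqrt(y))
    else:
        a, tail = 1.0, math.exp(-y)
    # Recurrence: Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1),
    # applied until a reaches df / 2.
    while a < df / 2.0:
        tail += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return tail


# Residual deviance of model 2 against the full model, as quoted in the text.
p_fit = chi_square_upper_tail(2.24, 8)
print(round(p_fit, 3))
```

A large p-value here means the data give no evidence against model 2 relative to the full model, which agrees with the visual impression from the mosaic plot.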

5 Conclusions

We have given the theoretical basis together with the graphical facilities for assessing linear trends in log-linear models visually. This in particular enables the visual evaluation of effects in models with ordinal variables. The approach involves graphics still further in the modelling process. Its advantage lies in the graphical display, which allows us to inspect the data directly throughout an analysis and enables us to “see” why a specific model works, what its assumptions are and where it could be improved.

Cumulative odds ratios have been used as the necessary theoretical construct to transfer the approach of visualising interaction effects from binary variables to ordinal variables. It is ensured that visual conclusions based on the vertical differences of bins still provide valid results.

These considerations show that mosaic plots form powerful graphical tools for the visual modelling of categorical data. They unite both mathematical correctness and great flexibility with respect to the underlying models.

References

Agresti, A. (1984), Analysis of Ordinal Data, New York: John Wiley and Sons.

Agresti, A. (2002), Categorical Data Analysis, 2nd ed., New York: John Wiley and Sons.

Bhapkar, V. and Koch, G. (1968), “Hypotheses of ‘No Interaction’ in Multidimensional Contingency Tables,” Technometrics, 10, 107–123.

Buja, A. and Cook, D. (1999), “Inference for Data Visualization,” in Proceedings of the Statistical Graphics Section, ASA.

Friendly, M. (1995), “Conceptual and Visual Models for Categorical Data,” The American Statistician, 49, 153–160.

Friendly, M. (1999), “Extending Mosaic Displays: Marginal, Conditional, and Partial Views of Categorical Data,” Journal of Computational and Graphical Statistics.

Goodman, L. A. (1979), “Simple Models for the Analysis of Association in Cross-Classifications Having Ordered Categories,” Journal of the American Statistical Association, 74, 537–552.


Hofmann, H. (2001), “Generalized Odds Ratios for Visual Modelling,” Journal of Computational and Graphical Statistics, 10, 1–13.

Madsen, M. (1976), “Statistical Analysis of Multiple Contingency Tables. Two Examples,” Scandinavian Journal of Statistics, 3, 97–106.

Srole, L., Langner, T., Michael, S., Opler, M., and Rennie, T. (1962), Mental Health in the Metropolis: The Midtown Manhattan Study, New York: McGraw-Hill.

Theus, M. and Lauer, S. (1999), “Visualizing Loglinear Models,” Journal of Computational and Graphical Statistics, 3, 396–412.

Venables, W. N. and Ripley, B. D. (2002), Modern Applied Statistics with S, 4th ed., Springer.
