[wiley series in probability and statistics] correspondence analysis || external stability and...


Page 1: [Wiley Series in Probability and Statistics] Correspondence Analysis || External stability and confidence regions

8 External stability and confidence regions

8.1 Introduction

Correspondence analysis and its variants provide the analyst with statistical tools to graphically depict the symmetric or asymmetric association between two or more nominal or ordinal categorical variables. One issue that needs consideration is whether these graphical displays are reliable in representing the association. By reliable we mean the following: if we repeatedly draw samples of size $n$ in which the same two categorical variables are cross-classified to form a contingency table, will the association between the variable categories in one sample be similar to, or wildly different from, the association in another sample?

When discussing the theory and application underlying the correspondence analysis of a two-way contingency table, our attention has so far been on quantifying this association and visually depicting it; no inferential aspect has been introduced into our discussion. In this chapter we introduce an inferential paradigm into correspondence analysis by considering the sampling variation of the configuration of points. This is achieved using parametric and non-parametric approaches. In much of the literature on correspondence analysis, the focus has been on the reliability of a point's proximity to another point and to the origin. Here, we shall only consider the proximity of a point to the origin by describing confidence regions for each point in a low-dimensional correspondence plot or biplot.

When considering the issue of stability in correspondence analysis, various definitions can be found in the literature. One of the most general definitions is that of Gifi (1990, p. 36), who says that a stable solution to a correspondence analysis (via the triplet described in Chapter 3) is one where

a small and unimportant change in data, model, or technique leads to a small and unimportant change in the results.

Correspondence Analysis: Theory, Practice and New Strategies, First Edition. Eric J. Beh and Rosaria Lombardo. © 2014 John Wiley & Sons, Ltd. Published 2014 by John Wiley & Sons, Ltd. Companion website: http://www.wiley.com/go/correspondence_analysis


Other authors describe two types of stability, external stability and internal stability; see, for example, Greenacre (1984, Section 8.1), Lebart et al. (1984) and Markus (1994). External stability is concerned with whether the conclusions from a given sample can be generalised to its population. Small samples are therefore a possible source of external instability, because they may not characterise the population structure as comprehensively as a large sample. Internal stability refers to understanding the structure of the sample, and whether the presence of outliers is of concern.

Our main aim is the derivation of confidence regions for each of the points in a low-dimensional plot. By using these regions, one can assess the legitimacy of generalising the sample results to the population level. The methods for constructing these confidence regions are numerous and varied, and we shall only consider some of them.

In this chapter, we consider only external stability. In particular, we confine our attention to Gifi's change-in-data issue, where the focus is on investigating the degree of sensitivity of the various classical and ordered simple correspondence analysis techniques.

For simple correspondence analysis, Lebart et al. (1984, pp. 182-186) constructed $100(1-\alpha)\%$ confidence circles for each variable category. Such an approach has since been considered in a number of contributions to correspondence analysis, including the ordered simple correspondence analysis of Beh (1997); see also Beh (2001). These regions have also been developed for nominal/ordinal non-symmetrical correspondence analysis (Lombardo et al. 2007; Beh 2010; Beh and D'Ambra 2009, 2010).

In the past, confidence circles, or circular regions, have been used as a means of identifying whether a category is statistically consistent with what is expected under the hypothesis of independence. Such regions are derived algebraically, since they are based on the singular values and singular vectors of a transformed contingency table. Hence, they may be regarded as parametric since they are also based on the assumptions that underlie Pearson's chi-squared statistic and the Goodman-Kruskal tau index. In particular, the confidence circles of Lebart et al. (1984) and the confidence ellipses of Beh (2010) may be considered examples of this approach.

We should nonetheless distinguish among the different non-circular regions that have been discussed over the past 20 years by Ringrose (1992, 1996), Gifi (1990), Markus (1994), Greenacre (2007), and Linting et al. (2007). We may distinguish between elliptical regions computed algebraically (Beh, 2010) and elliptical regions computed using the non-parametric or semi-parametric bootstrapping procedures of Markus (1994), Greenacre (2007) and Linting et al. (2007).

Non-parametric confidence regions are based on asymptotic statistics such as confidence ellipses (Gifi 1990, pp. 408-415), or on bootstrap resampling techniques; see, for example, Efron (1979), Greenacre (2007), Markus (1994) and Ringrose (1996, 2012).

When elliptical regions are algebraically derived (Beh, 2010), we can also calculate approximate $p$-values designed to reflect the statistical significance of the distance of a point from the origin. Such $p$-values allow one to determine the statistical significance of a category's contribution to the association structure between the two categorical variables.

This chapter will review some of the simple procedures for constructing confidence regions for a two-dimensional correspondence plot.

8.2 On the statistical significance of a point

The most appealing feature of the output obtained when performing a classical/symmetrical, or non-symmetrical, correspondence analysis on a two-way contingency table is the correspondence plot. As we described in Chapters 4 and 5, depending on the scaling of the diagonal matrix of singular values, $\boldsymbol{\Lambda}_\lambda$, one can also obtain a biplot. Such plots allow the analyst to visualise the association structure between the categorical variables and are often constructed using two or (at most) three dimensions. An important issue when visualising this association is to identify the statistical significance of the proximity of a point to the origin of the display; recall that in Chapters 4 and 5 we demonstrated that the origin coincides with the position of all of the points in the configuration where the difference between the observed cell frequency and its expected cell frequency (under independence) is zero. It is to this issue that we shall direct our attention when constructing algebraic regions. Thus, the interpretation of the origin is of fundamental importance. Points that are located close to the origin indicate that those categories do not play a major role in describing the association structure of the variables. On the other hand, the further a point lies from the origin, the more important this category is for describing the association structure between the variables. The next issue to address is how far from the origin a point needs to be, at a given level of confidence, before the category becomes a statistically significant contributor to the association structure.

To resolve this issue from a classical correspondence analysis perspective, Lebart et al. (1984, pp. 182-186) proposed the construction of a $100(1-\alpha)\%$ confidence circle for each category in the configuration. If the origin falls within such a circle, then the category is deemed not to play a statistically significant role in the association structure of the variables. However, if the origin falls outside the circle, then the category is considered statistically important in defining the association structure between the categorical variables.

The following section describes the construction of confidence regions when performing simple correspondence analysis. We shall then adapt this approach to obtain approximate $p$-values which reflect the statistical significance of the distance of a point from the origin.

8.3 Circular confidence regions for classical correspondence analysis

Consider a two-way contingency table $\mathbf{N}$ of dimension $I \times J$ that cross-classifies $n$ individuals/items according to $I$ row categories and $J$ column categories. Denote the matrix of the joint relative frequencies by $\mathbf{P} = (p_{ij})$ so that $\sum_{i=1}^{I}\sum_{j=1}^{J} p_{ij} = 1$. Let $p_{i\cdot} = \sum_{j=1}^{J} p_{ij}$ and $p_{\cdot j} = \sum_{i=1}^{I} p_{ij}$ be the $i$th marginal row proportion and the $j$th marginal column proportion, respectively.

To graphically summarise the nature of the association between the categorical variables of $\mathbf{N}$, simple correspondence analysis may be considered. As we have already described in the preceding chapters, there are many ways in which simple correspondence analysis can be performed. Nishisato (2007, Chapter 2) provides a very good account of the various ways that it can be done. By considering the singular value decomposition of the matrix of standardised residuals, Chapter 4 showed that simple correspondence analysis may be performed such that

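$$\mathbf{S} = \mathbf{D}_I^{-1/2}\left(\mathbf{P} - \mathbf{r}\mathbf{c}^T\right)\mathbf{D}_J^{-1/2} = \tilde{\mathbf{A}}\boldsymbol{\Lambda}_\lambda\tilde{\mathbf{B}}^T,$$

where $\mathbf{r}$ and $\mathbf{c}$ are the vectors of row and column marginal proportions, respectively. The singular vectors respect the orthonormality constraints $\tilde{\mathbf{A}}^T\tilde{\mathbf{A}} = \mathbf{I}_{M^*}$ and $\tilde{\mathbf{B}}^T\tilde{\mathbf{B}} = \mathbf{I}_{M^*}$. To graphically depict the association between the row and column principal coordinates in a low-dimensional space of $M < M^*$ dimensions, the $i$th row profile and $j$th column profile may be simultaneously represented by

$$\mathbf{F} = \mathbf{D}_I^{-1/2}\tilde{\mathbf{A}}\boldsymbol{\Lambda}_\lambda, \qquad \mathbf{G} = \mathbf{D}_J^{-1/2}\tilde{\mathbf{B}}\boldsymbol{\Lambda}_\lambda, \tag{8.1}$$

respectively. Thus, Pearson's chi-squared statistic of $\mathbf{N}$ can be expressed in terms of the coordinates by

$$X^2 = n \times \mathrm{trace}\left(\boldsymbol{\Lambda}_\lambda^2\right) = n \times \mathrm{trace}\left(\mathbf{F}^T\mathbf{D}_I\mathbf{F}\right) = n \times \mathrm{trace}\left(\mathbf{G}^T\mathbf{D}_J\mathbf{G}\right).$$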
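The middle equality above can be checked in one line. The sketch below writes the left singular vectors of $\mathbf{S}$ as $\tilde{\mathbf{A}}$ (this symbol is an assumption where the extracted text is illegible) and uses only the definition of $\mathbf{F}$ in Equation 8.1 and the orthonormality of the singular vectors; the argument for $\mathbf{G}$ is identical with $\tilde{\mathbf{B}}$ and $\mathbf{D}_J$.

```latex
\mathbf{F}^T\mathbf{D}_I\mathbf{F}
  = \boldsymbol{\Lambda}_\lambda\tilde{\mathbf{A}}^T
    \mathbf{D}_I^{-1/2}\,\mathbf{D}_I\,\mathbf{D}_I^{-1/2}
    \tilde{\mathbf{A}}\boldsymbol{\Lambda}_\lambda
  = \boldsymbol{\Lambda}_\lambda\tilde{\mathbf{A}}^T\tilde{\mathbf{A}}\boldsymbol{\Lambda}_\lambda
  = \boldsymbol{\Lambda}_\lambda^2
\quad\Longrightarrow\quad
n \times \mathrm{trace}\left(\mathbf{F}^T\mathbf{D}_I\mathbf{F}\right)
  = n \times \mathrm{trace}\left(\boldsymbol{\Lambda}_\lambda^2\right) = X^2 .
```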

From this result, Lebart et al. (1984) showed that, for the two-way contingency table $\mathbf{N}$, the radius of the $100(1-\alpha)\%$ confidence circle for the $i$th row category in a two-dimensional correspondence plot is

$$r^I_{i(\alpha)} = \sqrt{\frac{\chi^2_\alpha}{n p_{i\cdot}}}. \tag{8.2}$$

Here, $\chi^2_\alpha$ is the $100(1-\alpha)$th percentile of a chi-squared distribution with two degrees of freedom; the degrees of freedom reflect the two dimensions of the correspondence plot. With this radius, the origin lies inside the circle exactly when $n p_{i\cdot}\left(f_{i1}^2 + f_{i2}^2\right) \le \chi^2_\alpha$, which is the comparison made by the test statistic of Section 8.6.1. Confidence circles constructed in this manner are similar to those considered by Mardia et al. (1982, pp. 345-348) for canonical analysis. For example, the radii of the 95% confidence circles for the $i$th row and the $j$th column category, respectively, in a two-dimensional correspondence plot are

$$r^I_{i(0.05)} = \sqrt{\frac{5.99}{n p_{i\cdot}}}, \qquad r^J_{j(0.05)} = \sqrt{\frac{5.99}{n p_{\cdot j}}},$$

where 5.99 is the 95th percentile of the chi-squared distribution with two degrees of freedom.

These confidence circles allow the analyst to identify the statistical significance of those points in a two-dimensional correspondence plot that contribute to the association structure of the categorical variables being considered. In practice, if the origin is included within the circle, then that particular category does not make a statistically significant contribution to the association structure between the variables. Similarly, a confidence circle that does not include the origin means that, at the level of significance that is specified, the category to which it is related makes a statistically significant contribution to the association structure.
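The decision rule for a confidence circle can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the radius is taken as $\sqrt{\chi^2_\alpha/(n p_{i\cdot})}$, the form implied by the test statistic $n p_{i\cdot}\sum_m f_{im}^2$ of Section 8.6.1; the function names and the example category are hypothetical; and the chi-squared percentile uses the closed form available for two degrees of freedom.

```python
import math


def chi2_percentile_2df(alpha):
    """Upper-alpha percentile of a chi-squared distribution with 2 df.

    With two degrees of freedom the survival function is exp(-x / 2),
    so the percentile has the closed form -2 * ln(alpha).
    """
    return -2.0 * math.log(alpha)


def circle_radius(n, p_margin, alpha=0.05):
    """Radius of the 100(1 - alpha)% confidence circle of a category with
    marginal proportion p_margin in a table of n observations (assumed form)."""
    return math.sqrt(chi2_percentile_2df(alpha) / (n * p_margin))


def origin_inside_circle(f1, f2, radius):
    """True if the origin lies within the circle centred at (f1, f2)."""
    return math.hypot(f1, f2) <= radius


# Hypothetical category: 20% of n = 500 observations, plotted at (0.12, -0.05).
r = circle_radius(n=500, p_margin=0.20)
print(round(chi2_percentile_2df(0.05), 2))   # 5.99
print(origin_inside_circle(0.12, -0.05, r))  # True: not a significant contributor
```

A category whose circle contains the origin, as in this hypothetical case, would be read as not contributing significantly at the chosen level.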

These circular regions were also used for the early demonstration of correspondence analysis to analyse data obtained from a large survey (Lebart 1985). See also Yanao et al. (2006), Griffiths (1997) and Nicoli et al. (2010) for examples of the application of these regions.

There are two major problems with these circular confidence regions. Firstly, they ignore any information contained in the higher dimensions. Thus, despite the higher dimensions contributing relatively little to the association when compared with the first two dimensions, it is still possible that important information concerning the structure of the association that exists in these higher dimensions is overlooked. The second problem is that they do not take into consideration that the axes of the correspondence plot are weighted differently; recall that the principal inertia of the first axis is never less than the principal inertia of the second axis.

To overcome these two issues, Beh (2010) proposed special elliptical confidence regions. Such regions are interpreted in a manner that is analogous to the circular confidence regions of Lebart et al. (1984) and still enable the analyst to identify the extent to which a row (or column) category contributes to the association. From such a framework, one may also determine the $p$-value of each category. The advantage of doing so is that the researcher can investigate the statistical significance of a category's contribution to the overall association between the variables.

8.4 Elliptical confidence regions for classical correspondence analysis

Let us consider the second issue concerning the circular confidence regions described in Section 8.3: that the axes are assumed to be equally weighted. Since the principal inertias of the axes from a two-dimensional correspondence plot are such that the first is never less than the second, it seems more appropriate to construct confidence regions that reflect these inertias. This can be achieved by considering an elliptical region rather than a circular one. The following discussion is based in part on the recent work of Beh and Lombardo (2014).

8.4.1 The information in the optimal correspondence plot

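Suppose we consider the following property of the elements of the $i$th row singular vector, for $M^* = \min(I, J) - 1$:

$$\sum_{m=1}^{M^*} a_{im}^2 = \frac{1}{p_{i\cdot}}$$

so that

$$a_{i1}^2 + a_{i2}^2 = \frac{1}{p_{i\cdot}} - \sum_{m=3}^{M^*} a_{im}^2. \tag{8.3}$$

Therefore, by considering the definition of the principal coordinate for the $i$th row category in Equation 8.1, for which $a_{im} = f_{im}/\lambda_m$, Equation 8.3 can be alternatively expressed as

$$\frac{f_{i1}^2}{\lambda_1^2} + \frac{f_{i2}^2}{\lambda_2^2} = \frac{1}{p_{i\cdot}} - \sum_{m=3}^{M^*} a_{im}^2. \tag{8.4}$$

Dividing both sides of Equation 8.4 by its right-hand side leads to the result

$$\frac{f_{i1}^2}{\lambda_1^2\left(1/p_{i\cdot} - \sum_{m=3}^{M^*} a_{im}^2\right)} + \frac{f_{i2}^2}{\lambda_2^2\left(1/p_{i\cdot} - \sum_{m=3}^{M^*} a_{im}^2\right)} = 1.$$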


One may note that this is just the equation of an ellipse centred at $(f_{i1}, f_{i2})$ in a two-dimensional correspondence plot with a semi-major axis length of

$$x_i = \lambda_1\sqrt{\frac{1}{p_{i\cdot}} - \sum_{m=3}^{M^*} a_{im}^2} \tag{8.5}$$

and a semi-minor axis length of

$$y_i = \lambda_2\sqrt{\frac{1}{p_{i\cdot}} - \sum_{m=3}^{M^*} a_{im}^2}. \tag{8.6}$$

The semi-major and semi-minor axis lengths for the $j$th column coordinate in a two-dimensional space can be obtained in a similar manner.

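Beh (2010) argued that since $\lambda_m^2$ accounts for a certain proportion of the total inertia, $X^2/n$, we can determine a value $\tilde{\lambda}_{m(\alpha)}$ that contributes the same proportion of the inertia derived from the $100(1-\alpha)$th percentile of Pearson's chi-squared statistic with $(I-1)(J-1)$ degrees of freedom, $\chi^2_\alpha/n$. That is,

$$\frac{\lambda_m^2}{X^2/n} = \frac{\tilde{\lambda}^2_{m(\alpha)}}{\chi^2_\alpha/n} \tag{8.7}$$

so that we replace $\lambda_m$ in $x_i$ and $y_i$ with

$$\tilde{\lambda}_{m(\alpha)} = \lambda_m\sqrt{\frac{\chi^2_\alpha}{X^2}}. \tag{8.8}$$

When the chi-squared statistic of $\mathbf{N}$ is equal to $\chi^2_\alpha$, for a given value of $\alpha$, then $\tilde{\lambda}_{m(\alpha)} = \lambda_m$ and the semi-major and semi-minor axis lengths are just $x_i$ and $y_i$, respectively. Otherwise, the $100(1-\alpha)\%$ confidence ellipse for the $i$th row category has a semi-major axis length of

$$x_{i(\alpha)} = \lambda_1\sqrt{\frac{\chi^2_\alpha}{X^2}\left(\frac{1}{p_{i\cdot}} - \sum_{m=3}^{M^*} a_{im}^2\right)} \tag{8.9}$$

and a semi-minor axis length of

$$y_{i(\alpha)} = \lambda_2\sqrt{\frac{\chi^2_\alpha}{X^2}\left(\frac{1}{p_{i\cdot}} - \sum_{m=3}^{M^*} a_{im}^2\right)}. \tag{8.10}$$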

Thus, circular regions will arise if, and only if, $\lambda_1 = \lambda_2$; that is, when the principal inertia of the first axis is equal to the principal inertia of the second axis (a rare occurrence when analysing a two-way contingency table). These confidence ellipses therefore overcome the second of the issues concerning the confidence circles of Lebart et al. (1984), namely the equal weighting of the two axes.
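The semi-axis lengths of Equations 8.9 and 8.10 can be sketched as a small Python function. This is a minimal illustration rather than the authors' implementation: the function name, argument names and numerical inputs are hypothetical, and the chi-squared percentile is supplied by the caller (21.026 is the 95th percentile for 12 degrees of freedom).

```python
import math


def ellipse_semi_axes(lam1, lam2, p_i, a_higher, chi2_alpha, x2):
    """Semi-major and semi-minor axis lengths for the i-th row category.

    lam1, lam2 : first two singular values (lam1 >= lam2 > 0)
    p_i        : marginal proportion of the row category
    a_higher   : the elements a_im for m = 3, ..., M* (empty if M* = 2)
    chi2_alpha : 100(1 - alpha)th percentile with (I - 1)(J - 1) df
    x2         : Pearson chi-squared statistic of the table
    """
    higher = sum(a * a for a in a_higher)
    inside = (chi2_alpha / x2) * (1.0 / p_i - higher)
    if inside < 0:
        raise ValueError("1/p_i minus the higher-dimension sum must be non-negative")
    root = math.sqrt(inside)
    return lam1 * root, lam2 * root


# Hypothetical inputs for a single row category.
x, y = ellipse_semi_axes(0.30, 0.10, p_i=0.25, a_higher=[0.8, 0.5],
                         chi2_alpha=21.026, x2=60.0)
print(x, y)
```

Because both lengths share the same square-root factor, their ratio is always $\lambda_1/\lambda_2$, so the ellipse collapses to a circle only when the two singular values coincide.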

It is useful to make use of the general parametric form of a $100(1-\alpha)\%$ confidence ellipse when coding the ellipses in R. The form of the ellipse for the $i$th row profile (say), whose path follows the points $\left(p_{i1(\alpha)}, p_{i2(\alpha)}\right)$ and is centred at $(f_{i1}, f_{i2})$, is

$$p_{i1(\alpha)} = f_{i1} + x_{i(\alpha)}\cos(t), \qquad p_{i2(\alpha)} = f_{i2} + y_{i(\alpha)}\sin(t),$$

where $x_{i(\alpha)}$ and $y_{i(\alpha)}$ are the semi-axis lengths defined above and $t \in [0, 2\pi]$. These ellipses are positioned such that the semi-major axis is parallel to the first principal axis. Note that when $100(1-\alpha)\%$ confidence circles are to be constructed, so that $x_{i(\alpha)} = y_{i(\alpha)} = r^I_{i(\alpha)}$, the general parametric forms of the circles are

$$p_{i1(\alpha)} = f_{i1} + r^I_{i(\alpha)}\cos(t), \qquad p_{i2(\alpha)} = f_{i2} + r^I_{i(\alpha)}\sin(t).$$
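The parametric form translates directly into code. The text refers to R; the sketch below uses Python purely for illustration, and the function name, number of points and example values are hypothetical. Passing the same value for both semi-axis lengths traces a confidence circle.

```python
import math


def ellipse_path(f1, f2, x_len, y_len, num_points=100):
    """Points (p_i1, p_i2) on the ellipse centred at (f1, f2) whose
    semi-axes x_len and y_len are parallel to the two principal axes."""
    points = []
    for k in range(num_points + 1):
        t = 2.0 * math.pi * k / num_points  # t runs over [0, 2*pi]
        points.append((f1 + x_len * math.cos(t), f2 + y_len * math.sin(t)))
    return points


# Hypothetical row point at (0.12, -0.05) with semi-axis lengths 0.31 and 0.10.
path = ellipse_path(0.12, -0.05, 0.31, 0.10)
print(path[0])
```

The returned list can be handed to any plotting routine as a closed polygon, since the first and last points coincide.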

Ellipsoids can be constructed for three-dimensional or higher dimensional correspondence plots by considering $m > 2$. One may note that Equations 8.9 and 8.10 incorporate the (unweighted) sum of squares of the left singular vectors of $\mathbf{S}$. Therefore, unlike the confidence circles of Lebart et al. (1984), constructing confidence ellipses in this manner takes into consideration the contribution that the $i$th row principal coordinate makes in dimensions higher than the second. In fact, since all $M^*$ dimensions are reflected in the semi-major and semi-minor axis lengths, all of the contribution that a point makes to the association (as reflected in the optimal correspondence plot) can be accounted for using Equations 8.9 and 8.10. Therefore, Beh's (2010) confidence ellipses also overcome the first of the issues concerning the confidence circles of Lebart et al. (1984), namely the neglect of the higher dimensions.

8.4.2 The information in the first two dimensions

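If the information contained in the third and higher dimensions is minimal or (for some reason) is ignored, then the semi-major and semi-minor axis lengths for the $i$th row category are

$$x_{i(\alpha)} = \lambda_1\sqrt{\frac{\chi^2_\alpha}{X^2 p_{i\cdot}}}, \qquad y_{i(\alpha)} = \lambda_2\sqrt{\frac{\chi^2_\alpha}{X^2 p_{i\cdot}}}, \tag{8.11}$$

with $p_{\cdot j}$ replacing $p_{i\cdot}$ for the $j$th column category. One may note that, under such a framework, the link between $x_{i(\alpha)}$ and $r^I_{i(\alpha)}$ is simply

$$x_{i(\alpha)} = r^I_{i(\alpha)}\sqrt{\frac{\lambda_1^2}{X^2/n}}.$$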

Therefore, if we were to ignore the contribution the first axis makes to the quality of the optimal correspondence plot, then $x_{i(\alpha)}$ would be just the radius, $r^I_{i(\alpha)}$, of the $100(1-\alpha)\%$ confidence circle for the $i$th row category. We may also note that the sum of squares of these semi-axis lengths is

$$x^2_{i(\alpha)} + y^2_{i(\alpha)} = \phi^2\,\frac{\chi^2_\alpha}{n_{i\cdot}}, \tag{8.12}$$

where $n_{i\cdot} = n p_{i\cdot}$ is the $i$th row total and $\phi^2 = \left(\lambda_1^2 + \lambda_2^2\right)/\left(X^2/n\right)$ is the proportion of the total inertia accounted for by a two-dimensional correspondence plot constructed using the first two principal axes. Therefore, when the optimal correspondence plot consists of only two dimensions ($M^* = 2$), $\phi^2 = 1$ and the sum of squares of these semi-axis lengths is such that $x^2_{i(\alpha)} + y^2_{i(\alpha)} = \left(r^I_{i(\alpha)}\right)^2$. By considering the semi-axis lengths described by Equations 8.9 and 8.10, as the level of confidence increases (so that $\alpha$ decreases), the elliptical region for each of the $I$ rows and $J$ columns expands. More will now be described about the features of Beh's (2010) confidence ellipses.
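The sum of squares in Equation 8.12 can be verified directly from the two-dimensional semi-axis lengths of Equation 8.11. The short derivation below is a sketch that assumes both semi-axes refer to the $i$th row category (so that both involve $p_{i\cdot}$) and writes $n_{i\cdot} = n p_{i\cdot}$ for the row total.

```latex
x^2_{i(\alpha)} + y^2_{i(\alpha)}
  = \left(\lambda_1^2 + \lambda_2^2\right)\frac{\chi^2_\alpha}{X^2 p_{i\cdot}}
  = \frac{\lambda_1^2 + \lambda_2^2}{X^2/n}\cdot\frac{\chi^2_\alpha}{n p_{i\cdot}}
  = \phi^2\,\frac{\chi^2_\alpha}{n_{i\cdot}} .
```

When $M^* = 2$ the first factor equals one, and the right-hand side reduces to $\chi^2_\alpha / n_{i\cdot}$.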

8.4.3 Eccentricity of elliptical regions

Another important characteristic of any ellipse is its eccentricity, which is bounded by the interval $[0, 1]$; note that a circle has zero eccentricity. Such a measure can prove beneficial when performing correspondence analysis since the eccentricity of an ellipse reflects the relative magnitude of the first principal inertia compared with the second principal inertia. By considering the definition of eccentricity of Weir et al. (2005, p. 698), for example, the eccentricity of the elliptical region for the $i$th row category can be shown to be equivalent to

$$E_{i(\alpha)} = \sqrt{1 - \frac{y^2_{i(\alpha)}}{x^2_{i(\alpha)}}}. \tag{8.13}$$

Therefore, substituting Equations 8.9 and 8.10 into Equation 8.13 and simplifying yields an eccentricity for a $100(1-\alpha)\%$ confidence ellipse in a two-dimensional correspondence plot of

$$E_{i(\alpha)} = \sqrt{1 - \frac{\lambda_2^2}{\lambda_1^2}}. \tag{8.14}$$

We could also have substituted $x_{i(\alpha)}$ and $y_{i(\alpha)}$ from Equation 8.11 into Equation 8.13 and obtained the same result.

Equation 8.14 shows that the eccentricity of an elliptical confidence region remains fixed for all row categories; only the principal inertias associated with the first and second axes determine its magnitude. We can also see that, if the principal inertias along these axes are equal (so that $\lambda_1^2 = \lambda_2^2$), then the eccentricity of the confidence ellipses is zero, resulting in circular confidence regions that are akin to those of Lebart et al. (1984). We can also derive the measure of eccentricity, $E_{j(\alpha)}$, of the $100(1-\alpha)\%$ confidence ellipse for the $j$th column principal coordinate in a correspondence plot. It is easy to verify that its value is just the right-hand side of Equation 8.14, which means that the eccentricity of the ellipses for the row and column points in a correspondence plot is the same.
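Since Equation 8.14 depends only on the first two singular values, it reduces to a one-line computation. The following is a minimal sketch with a hypothetical function name and illustrative singular values.

```python
import math


def eccentricity(lam1, lam2):
    """Eccentricity of the confidence ellipses (Equation 8.14).

    lam1 and lam2 are the first and second singular values, lam1 >= lam2 >= 0.
    """
    if lam1 <= 0 or lam2 < 0 or lam2 > lam1:
        raise ValueError("expected lam1 >= lam2 >= 0 with lam1 > 0")
    return math.sqrt(1.0 - (lam2 / lam1) ** 2)


print(eccentricity(0.3, 0.3))  # 0.0: equal principal inertias give circles
print(eccentricity(0.5, 0.3))  # close to 0.8
```

The same value applies to every row and column ellipse in the plot, so it needs to be computed only once per analysis.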

8.4.4 Comparison of confidence regions

The expressions for $r^I_{i(\alpha)}$, $x_{i(\alpha)}$ and $y_{i(\alpha)}$ show that changing the level of significance, $\alpha$, changes their magnitude. To show this, we consider, for example, a comparison of the semi-major axis length, $x_{i(\alpha)}$, of a $100(1-\alpha)\%$ and a $100(1-\alpha')\%$ confidence ellipse. The ratio between these two quantities of the correspondence plot (irrespective of its dimensionality) is

$$r(\alpha, \alpha') = \frac{x_{i(\alpha)}}{x_{i(\alpha')}} = \frac{y_{i(\alpha)}}{y_{i(\alpha')}} = \sqrt{\frac{\chi^2_\alpha}{\chi^2_{\alpha'}}}. \tag{8.15}$$

Therefore, when $\alpha < \alpha'$, then $r(\alpha, \alpha') > 1$, since $\chi^2_\alpha > \chi^2_{\alpha'}$. For example, consider the two-way asbestos data of Table 1.2 that consists of four row categories and five column categories, so that the chi-squared statistic considered has 12 degrees of freedom. The semi-major, and semi-minor, axis lengths for a 0.01 level of significance, compared with the semi-axis lengths for a 0.05 level of significance, change by a factor of

$$r(0.01, 0.05) = \sqrt{\frac{\chi^2_{0.01}}{\chi^2_{0.05}}} = \sqrt{\frac{26.217}{21.026}} = 1.117.$$

That is, the semi-axis length for a 99% confidence ellipse along all dimensions will be 1.117 times longer than its corresponding semi-axis length for a 95% confidence ellipse (the ratio of the two percentiles themselves is 1.247). Similarly, for such a contingency table, the semi-axis length of a 99% confidence ellipse will be 1.189 times longer than the semi-axis length of a 90% confidence ellipse. Depending on the magnitude of the semi-axis length, such differences may appear minimal or very large. In particular, when a category has a dominant role in the structure of the association (so that the area of the confidence ellipse is small), the impact of changing the level of significance may be quite small. However, if a category is not a statistically significant contributor to the association structure (so that the area of the confidence ellipse is quite large), the impact of changing the level of significance may be quite large. Such a ratio $r(\alpha, \alpha')$ can also be shown to be applicable to the confidence circles of Lebart et al. (1984).

One may also compare the semi-major axis length of a $100(1-\alpha)\%$ confidence ellipse that uses only the first two dimensions with the semi-major axis length of the ellipse that reflects the association in the optimal correspondence plot. Suppose we define $x_{i(\alpha)}$ to be the semi-major axis length of the confidence ellipse of the $i$th row category in the optimal ($M^*$-dimensional) correspondence plot. Similarly, let $\tilde{x}_{i(\alpha)}$ be this length for the point in a two-dimensional plot. Then, using Equations 8.9, 8.11 and 8.3, the ratio between these semi-major axis lengths is

$$q(2, M) = \frac{x_{i(\alpha)}}{\tilde{x}_{i(\alpha)}} = \sqrt{1 - p_{i\cdot}\sum_{m=3}^{M^*} a_{im}^2} = \sqrt{p_{i\cdot}\left(a_{i1}^2 + a_{i2}^2\right)}.$$

Thus, if the optimal correspondence plot consists of $M^* = 2$ dimensions, then such a ratio becomes $q(2, 2) = 1$ and there is no change in the size of the confidence ellipse. However, in the more general case where $M^* > 2$, $q(2, M) < 1$. Therefore, when considering the association structure reflected in the optimal correspondence plot, the semi-major (or semi-minor) axis length is

$$\sqrt{1 - p_{i\cdot}\sum_{m=3}^{M^*} \frac{f_{im}^2}{\lambda_m^2}} = \sqrt{p_{i\cdot}\left[\left(\frac{f_{i1}}{\lambda_1}\right)^2 + \left(\frac{f_{i2}}{\lambda_2}\right)^2\right]}$$

times the semi-axis length obtained when only considering the information in the first two dimensions, and is therefore shorter.
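Both ratios of this section are simple to evaluate numerically. The sketch below is illustrative only: the function names are hypothetical, and the percentiles are those of a chi-squared distribution with 12 degrees of freedom (26.217, 21.026 and 18.549 for $\alpha$ = 0.01, 0.05 and 0.10). Note that the ratio of percentiles 26.217/21.026 is about 1.247, whereas Equation 8.15 takes its square root.

```python
import math


def axis_ratio(chi2_alpha, chi2_alpha_prime):
    """Equation 8.15: ratio of semi-axis lengths at two significance levels."""
    return math.sqrt(chi2_alpha / chi2_alpha_prime)


def dimension_ratio(p_i, a_higher):
    """q(2, M): semi-axis length in the optimal plot relative to the length
    obtained from the first two dimensions only.

    a_higher holds the elements a_im for m = 3, ..., M* (empty if M* = 2).
    """
    inside = 1.0 - p_i * sum(a * a for a in a_higher)
    if inside < 0:
        raise ValueError("p_i times the higher-dimension sum must not exceed 1")
    return math.sqrt(inside)


print(round(axis_ratio(26.217, 21.026), 3))  # 1.117 (99% versus 95%)
print(round(axis_ratio(26.217, 18.549), 3))  # 1.189 (99% versus 90%)
print(dimension_ratio(0.25, []))             # 1.0 when M* = 2
```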

8.5 Confidence regions for non-symmetrical correspondence analysis

In Sections 8.3 and 8.4, we described the construction of circular and elliptical confidence regions for simple correspondence analysis. When there exists an asymmetric association structure between the categorical variables, one may instead consider these regions for non-symmetrical correspondence analysis. Therefore, suppose we consider the column variable as a predictor variable and the row variable as its response variable. For such a variable structure, Chapter 5 showed that the analysis can be performed by first considering the generalised singular value decomposition of

$$\frac{p_{ij}}{p_{\cdot j}} - p_{i\cdot} = \sum_{m=1}^{M^*} a_{im}\lambda_m b_{jm}.$$

Here, $a_{im}$ and $b_{jm}$ are akin to the left and right singular vectors of the correspondence analysis of the standardised residuals of the cells and are constrained such that

$$\sum_{i=1}^{I} a_{im}a_{im'} = \begin{cases} 1, & m = m', \\ 0, & m \neq m', \end{cases} \qquad \sum_{j=1}^{J} p_{\cdot j}\,b_{jm}b_{jm'} = \begin{cases} 1, & m = m', \\ 0, & m \neq m'. \end{cases}$$

As we previously discussed, the singular values, $\lambda_m$ for $m = 1, 2, \ldots, M^* = \min(I, J) - 1$, are arranged in descending order.

For our asymmetric variable structure, the aim is to depict the prediction of the rows given the columns in a low-dimensional space, where the Goodman-Kruskal tau index (Goodman and Kruskal, 1954) is

$$\tau = \frac{\sum_{i=1}^{I}\sum_{j=1}^{J} p_{\cdot j}\left(\dfrac{p_{ij}}{p_{\cdot j}} - p_{i\cdot}\right)^2}{1 - \sum_{i=1}^{I} p_{i\cdot}^2} = \frac{\tau_{\mathrm{num}}}{1 - \sum_{i=1}^{I} p_{i\cdot}^2}.$$

To test the significance of the asymmetric association structure, a formal statistical test can be made by considering the C-statistic of Light and Margolin (1971),

$$C = (n - 1)(I - 1)\,\tau \sim \chi^2_{\alpha,\,(I-1)(J-1)},$$

where $\chi^2_{\alpha,\,(I-1)(J-1)}$ is the $100(1-\alpha)$th percentile of a chi-squared distribution with $(I-1)(J-1)$ degrees of freedom.
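The tau index and the C-statistic can be computed directly from a table of counts. The sketch below is a minimal illustration with hypothetical function names; it treats the rows as the response variable and the columns as the predictor, as in the text, and uses two small invented tables to show the extreme cases.

```python
def goodman_kruskal_tau(table):
    """Return (tau_num, tau) for a table of counts.

    table is a list of rows; rows are the response categories and
    columns are the predictor categories.
    """
    n = sum(sum(row) for row in table)
    n_rows, n_cols = len(table), len(table[0])
    p = [[count / n for count in row] for row in table]
    p_row = [sum(row) for row in p]
    p_col = [sum(p[i][j] for i in range(n_rows)) for j in range(n_cols)]
    tau_num = sum(
        p_col[j] * (p[i][j] / p_col[j] - p_row[i]) ** 2
        for i in range(n_rows)
        for j in range(n_cols)
        if p_col[j] > 0  # skip empty predictor categories
    )
    tau = tau_num / (1.0 - sum(pi ** 2 for pi in p_row))
    return tau_num, tau


def c_statistic(table):
    """C = (n - 1)(I - 1) tau of Light and Margolin (1971)."""
    n = sum(sum(row) for row in table)
    _, tau = goodman_kruskal_tau(table)
    return (n - 1) * (len(table) - 1) * tau


# Invented tables: exact independence and perfect prediction.
print(goodman_kruskal_tau([[10, 20], [30, 60]])[1])  # approximately 0
print(goodman_kruskal_tau([[50, 0], [0, 50]])[1])    # 1.0
```

The value of `c_statistic` would then be compared with the chi-squared percentile with $(I-1)(J-1)$ degrees of freedom.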

8.5.1 Circular regions in non-symmetrical correspondence analysis

Circular and elliptical confidence regions for non-symmetrical correspondence analysis have recently been discussed in the literature, some for nominal variables and others for ordinal variables. See, for example, Lombardo et al. (2007), Corbellini et al. (2008), Beh and D'Ambra (2009), Beh (2010), Crisci and D'Ambra (2011), Simonetti et al. (2011), D'Ambra et al. (2012), D'Ambra and Crisci (2014) and Beh and Lombardo (2013). Before describing the construction of elliptical confidence regions for non-symmetrical correspondence analysis, we shall first consider the construction of confidence circles for the analysis.

Confidence circles for the non-symmetrical correspondence analysis of a two-way contingency table consisting of nominal or ordered categories (Lombardo et al., 2007; Beh and D'Ambra, 2009) have been derived in a manner similar to that described for simple correspondence analysis. As a result, the radius of the confidence circle for the $j$th column (predictor) category in the two-dimensional plot is

$$r^J_{j(\alpha)} = \sqrt{\frac{\chi^2_\alpha\left(1 - \sum_{i=1}^{I} p_{i\cdot}^2\right)}{p_{\cdot j}(n - 1)(I - 1)}}. \tag{8.16}$$

Similarly, the radius of the confidence circle for the $i$th row (response) category in the two-dimensional plot is

$$r^I_{i(\alpha)} = \sqrt{\frac{\chi^2_\alpha\left(1 - \sum_{i=1}^{I} p_{i\cdot}^2\right)}{(n - 1)(I - 1)}}.$$
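The two radii above differ only by the factor $1/\sqrt{p_{\cdot j}}$, which the following sketch makes explicit. The function name and inputs are hypothetical, and the sketch assumes that $\chi^2_\alpha$ is the percentile with two degrees of freedom, by analogy with the circles of Section 8.3; the text does not restate the degrees of freedom here.

```python
import math


def nsca_circle_radii(p_row, p_col, n, alpha=0.05):
    """Radii of the confidence circles in non-symmetrical correspondence analysis.

    p_row, p_col : marginal proportions of the response (row) and
                   predictor (column) categories
    n            : sample size
    Returns (row_radius, list_of_column_radii).
    """
    chi2_alpha = -2.0 * math.log(alpha)  # assumed: 2 df, closed-form percentile
    n_rows = len(p_row)
    base = chi2_alpha * (1.0 - sum(p * p for p in p_row)) / ((n - 1) * (n_rows - 1))
    row_radius = math.sqrt(base)                          # same for every row
    col_radii = [math.sqrt(base / pj) for pj in p_col]    # Equation 8.16
    return row_radius, col_radii


# Hypothetical margins for a 2 x 3 table with n = 101.
row_r, col_r = nsca_circle_radii([0.5, 0.5], [0.2, 0.3, 0.5], n=101)
print(row_r, col_r)
```

Under these formulas every response category shares one radius, while rarer predictor categories receive larger circles.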

8.5.2 Elliptical regions in non-symmetrical correspondence analysis

To reflect the association structure in dimensions higher than the second, and to take into consideration the different weighting of each of the principal axes, the $100(1-\alpha)\%$ confidence ellipse for the $i$th row (response) category is constructed using semi-axis lengths along the first and second principal axes of

$$x_{i(\alpha)} = \lambda_1\sqrt{\frac{\chi^2_\alpha\left(1 - \sum_{i=1}^{I} p_{i\cdot}^2\right)}{\tau_{\mathrm{num}}(n - 1)(I - 1)}\left(1 - \sum_{m=3}^{M^*} a_{im}^2\right)},$$

$$y_{i(\alpha)} = \lambda_2\sqrt{\frac{\chi^2_\alpha\left(1 - \sum_{i=1}^{I} p_{i\cdot}^2\right)}{\tau_{\mathrm{num}}(n - 1)(I - 1)}\left(1 - \sum_{m=3}^{M^*} a_{im}^2\right)}.$$
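These semi-axis lengths can be sketched in the same way as those for simple correspondence analysis. The function below is illustrative only: its name, arguments and example values are hypothetical, and the chi-squared percentile is supplied by the caller because the text does not restate its degrees of freedom at this point.

```python
import math


def nsca_ellipse_semi_axes(lam1, lam2, p_row, n, tau_num, a_higher, chi2_alpha):
    """Semi-axis lengths of the confidence ellipse for a response category.

    lam1, lam2 : first two singular values
    p_row      : marginal proportions of the response (row) categories
    n          : sample size
    tau_num    : numerator of the Goodman-Kruskal tau index
    a_higher   : the elements a_im for m = 3, ..., M* (empty if M* = 2)
    chi2_alpha : chi-squared percentile chosen by the analyst
    """
    n_rows = len(p_row)
    gini = 1.0 - sum(p * p for p in p_row)
    higher = sum(a * a for a in a_higher)
    inside = chi2_alpha * gini / (tau_num * (n - 1) * (n_rows - 1)) * (1.0 - higher)
    if inside < 0:
        raise ValueError("the higher-dimension sum must not exceed 1")
    root = math.sqrt(inside)
    return lam1 * root, lam2 * root


# Hypothetical inputs for one response category.
x_ns, y_ns = nsca_ellipse_semi_axes(0.3, 0.1, [0.5, 0.5], n=101, tau_num=0.05,
                                    a_higher=[], chi2_alpha=5.9915)
print(x_ns, y_ns)
```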


It can also be shown that, for the elliptical regions from the non-symmetrical correspondence analysis of a two-way contingency table, their eccentricity can be calculated using Equation 8.14, as in simple correspondence analysis. The comments made in Section 8.4.4 concerning the comparison between confidence regions are still pertinent here.

8.6 Approximate 𝒑-values and classical correspondence analysis

By considering the theory that underlies the construction of $100(1-\alpha)\%$ confidence circles of a point in a plot obtained from the correspondence analysis of a contingency table, one may approximate the $p$-value of this point in relation to its proximity to the origin. Such a $p$-value can then be used to assess the statistical significance of the contribution of a row (or column) category to the association between the variables. In this section, we shall provide an algebraically derived approximation based on the circular regions of Lebart et al. (1984). We shall then amend the approximation using the theory that underpins Beh's (2010) elliptical regions.

8.6.1 Approximate 𝒑-values based on confidence circles

To derive an approximate $p$-value, we first consider the null and alternative hypotheses under which it will be generated. As described above, the relative distance of a row, or column, principal coordinate from the origin of the correspondence plot reflects the departure of the category associated with that coordinate from the hypothesis of complete independence. Therefore, the contribution of the $i$th row category to the chi-squared statistic, or alternatively the total inertia, can be assessed by considering its proximity to the origin. This suggests that appropriate null and alternative hypotheses are

$$H_0 : f_{im} = 0, \tag{8.17}$$

$$H_1 : f_{im} \neq 0, \tag{8.18}$$

for $m = 1, 2, \ldots, M^*$. For such hypotheses, we may consider

$$X^2_{iM^*} = n p_{i\cdot}\sum_{m=1}^{M^*} f_{im}^2 \tag{8.19}$$

as the test statistic. Often a subset of $D < M^*$ dimensions is used to visually represent the association between the row and column variables; typically $D = 2$ or $D = 3$. Therefore, the $p$-value of the $i$th row in a $D$-dimensional correspondence plot is approximately

$$p\text{-value}_{iD} = P\left\{\chi^2 > X^2_{iD}\right\} \approx P\left\{\chi^2 > n p_{i\cdot}\sum_{m=1}^{D} f_{im}^2\right\}. \tag{8.20}$$

Thus, based on the sample of the contingency table, this $p$-value represents the probability, under the null hypothesis, of obtaining a position for the $i$th row point that is at least as extreme as the one observed from the contingency table. Therefore, a $p$-value that is less than a given level of significance provides evidence that the category does play a statistically significant role in describing the association structure, since the particular point in the configuration is deemed not to be consistent with zero. One may note that while Equation 8.20 takes into consideration the proximity of a point to the origin, it ignores the magnitude of the principal inertia for each of the $D$ axes in the correspondence plot.
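Equation 8.20 can be evaluated with nothing more than a chi-squared survival function. The sketch below is a hedged illustration: the function names are hypothetical, the survival function is built from the standard recurrence $Q(x \mid \nu + 2) = Q(x \mid \nu) + (x/2)^{\nu/2}e^{-x/2}/\Gamma(\nu/2 + 1)$, and the degrees of freedom default to $D$ (two for a two-dimensional plot, matching the confidence circles), which is an assumption because the text does not state them for this equation.

```python
import math


def chi2_sf(x, df):
    """P(chi-squared with df degrees of freedom > x), df a positive integer."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        q, v = math.exp(-half), 2          # exact survival function for 2 df
    else:
        q, v = math.erfc(math.sqrt(half)), 1  # exact survival function for 1 df
    while v < df:
        # Step from v to v + 2 degrees of freedom.
        q += half ** (v / 2.0) * math.exp(-half) / math.gamma(v / 2.0 + 1.0)
        v += 2
    return q


def p_value_circle(n, p_i, coords, df=None):
    """Approximate p-value of Equation 8.20 for a row category.

    coords holds the first D principal coordinates f_i1, ..., f_iD.
    """
    stat = n * p_i * sum(f * f for f in coords)
    return chi2_sf(stat, len(coords) if df is None else df)


# Hypothetical category: 20% of n = 500 observations, plotted at (0.12, -0.05).
print(p_value_circle(500, 0.20, [0.12, -0.05]))
```

For this hypothetical point the statistic is 1.69, so the $p$-value is well above 0.05, in agreement with the origin lying inside the corresponding 95% confidence circle.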

8.6.2 Approximate 𝒑-values based on confidence ellipses

Consider now the elliptical regions generated using the semi-axis length defined by Equations 8.9 and 8.10. Suppose that a 𝐷 (< 𝑀∗)-dimensional correspondence plot is used to visualise the association between the row and column categories of a contingency table. Equations 8.9 and 8.10 suggest that

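\[ \chi^2 \sim X^2 \left(\frac{1}{p_{i\cdot}} - \sum_{m=D+1}^{M^*} a_{im}^2\right)^{-1} \sum_{m=1}^{D} \left(\frac{f_{im}}{\lambda_m}\right)^2. \tag{8.21} \]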

Thus, if the information contained in dimensions 𝐷 + 1, 𝐷 + 2, … , 𝑀∗ is reflected in the construction of a confidence interval, the 𝑝-value of the 𝑖th row point in a 𝐷-dimensional correspondence plot is

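\[ p\text{-value}_{iD} \approx P\left\{\chi^2 > X^2 \left(\frac{1}{p_{i\cdot}} - \sum_{m=D+1}^{M^*} a_{im}^2\right)^{-1} \sum_{m=1}^{D} \left(\frac{f_{im}}{\lambda_m}\right)^2\right\}. \tag{8.22} \]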

Since 𝑎𝑖𝑚 = 𝑓𝑖𝑚∕𝜆𝑚, the 𝑝-value may be expressed in terms of the principal coordinates in the optimal correspondence plot by

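\[ p\text{-value}_{iD} \approx P\left\{\chi^2 > X^2 \left[\frac{1}{p_{i\cdot}} - \sum_{m=D+1}^{M^*} \left(\frac{f_{im}}{\lambda_m}\right)^2\right]^{-1} \sum_{m=1}^{D} \left(\frac{f_{im}}{\lambda_m}\right)^2\right\}. \tag{8.23} \]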

Therefore, if the coordinate of the 𝑖th row category in the optimal correspondence plot lies at the origin, the 𝑝-value is unity since $P\{\chi^2 > 0\} = 1$. Conversely, if the position of a point lies at a distance from the origin in the optimal correspondence plot, then the 𝑝-value is approximately $P\{\chi^2 > X^2\}$.

Due to the inclusion of $\sum_{m=D+1}^{M^*} (f_{im}/\lambda_m)^2$ in Equation 8.23, it is apparent that the proximity of a point from the origin in dimensions higher than the 𝐷th is reflected in this 𝑝-value. Certainly, in many practical situations, the first 𝐷 dimensions may be sufficient to visualise the association between the categorical variables. In this case, one may ignore the higher dimensions without any real loss of information of this structure. By doing so, the 𝑝-value for the 𝑖th row category as given by Equation 8.23 may be amended to give

\[ p\text{-value}_{iD} \approx P\left\{\chi^2 > X^2 p_{i\cdot} \sum_{m=1}^{D} \left(\frac{f_{im}}{\lambda_m}\right)^2\right\}. \tag{8.24} \]


By considering the 𝑝-values of Equations 8.23 and 8.24, unlike Equation 8.20, the inequality of the principal inertia values can be taken into consideration, thereby reflecting the relative weighting of each of the axes in a 𝐷-dimensional correspondence plot.
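The difference between Equations 8.23 and 8.24 can be seen in a small numerical sketch. The function `ellipse_pvalue`, its argument names and the input values below are hypothetical, and the degrees of freedom `df` must be supplied by the user because this section does not specify them.

```r
# Sketch only: ellipse-based p-value for a single row point.
# ASSUMPTION: df is chosen by the user; it is not specified in this section.
ellipse_pvalue <- function(X2, p_i, f, lambda, D, df, higher = TRUE) {
  a2   <- (f / lambda)^2                # squared standard coordinates a_im^2
  low  <- sum(a2[seq_along(a2) <= D])   # dimensions 1, ..., D
  high <- sum(a2[seq_along(a2) >  D])   # dimensions D + 1, ..., M*
  stat <- if (higher) {
    X2 * low / (1 / p_i - high)         # Equation 8.23
  } else {
    X2 * p_i * low                      # Equation 8.24
  }
  c(statistic = stat,
    p.value   = pchisq(stat, df = df, lower.tail = FALSE))
}

# Hypothetical inputs: chi-squared statistic, row mass, principal coordinates
# and singular values for M* = 3 dimensions
X2     <- 50
p_i    <- 0.25
f      <- c(0.20, 0.10, 0.05)
lambda <- c(0.50, 0.25, 0.10)

with_higher    <- ellipse_pvalue(X2, p_i, f, lambda, D = 2, df = 2)
without_higher <- ellipse_pvalue(X2, p_i, f, lambda, D = 2, df = 2, higher = FALSE)
print(rbind(with_higher, without_higher))
```

When `D` equals the number of available dimensions the sum over higher dimensions is empty and the two versions coincide.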

8.7 Approximate 𝒑-values and non-symmetrical correspondence analysis

The derivations given in Section 8.6 may also be undertaken to determine approximations of 𝑝-values for row (response) and column (predictor) points in a plot obtained from a non-symmetrical correspondence analysis. By amending the derivation given in Section 8.6.2, the approximate 𝑝-value for the 𝑗th column point in a 𝐷-dimensional correspondence plot is

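\[ p\text{-value}_{jD} \approx P\left\{\chi^2 > \frac{(n-1)(I-1)\,\tau_{\mathrm{num}}}{1 - \sum_{i=1}^{I} p_{i\cdot}^2} \left[1 - \sum_{m=D+1}^{M^*} \left(\frac{g_{jm}}{\lambda_m}\right)^2\right]^{-1} \sum_{m=1}^{D} \left(\frac{g_{jm}}{\lambda_m}\right)^2\right\}. \]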

Therefore, by considering the elliptical regions and these 𝑝-values, a two-dimensional correspondence plot can be constructed to reflect the statistical significance of a category that would otherwise require an optimal correspondence plot to view the asymmetric association structure. If only the first two dimensions are considered when constructing the elliptical regions (so that 𝐷 = 2), then for the 𝑗th column predictor category, the approximate 𝑝-value of its point is

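\[ p\text{-value}_{jD} = P\left\{\chi^2 > \frac{(n-1)(I-1)\,\tau_{\mathrm{num}}}{1 - \sum_{i=1}^{I} p_{i\cdot}^2} \left[\left(\frac{g_{j1}}{\lambda_1}\right)^2 + \left(\frac{g_{j2}}{\lambda_2}\right)^2\right]\right\}. \]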

Expressions for the calculation of these 𝑝-values for the row (response) categories may be derived in a very similar manner.
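The two column 𝑝-value expressions of this section can be sketched in R as follows. The function `nsca_col_pvalue`, its arguments and all numerical inputs are hypothetical; `tau_num` is taken as a given quantity, and the degrees of freedom `df` are an assumption supplied by the user since they are not stated here.

```r
# Sketch only: approximate p-value of the j-th column point in a
# non-symmetrical correspondence analysis.
# ASSUMPTION: df is chosen by the user; tau_num is computed elsewhere.
nsca_col_pvalue <- function(n, I, tau_num, p_row, g, lambda, D, df,
                            higher = TRUE) {
  scale_const <- (n - 1) * (I - 1) * tau_num / (1 - sum(p_row^2))
  a2   <- (g / lambda)^2
  low  <- sum(a2[seq_along(a2) <= D])   # dimensions 1, ..., D
  high <- sum(a2[seq_along(a2) >  D])   # dimensions D + 1, ..., M*
  stat <- if (higher) scale_const * low / (1 - high) else scale_const * low
  c(statistic = stat,
    p.value   = pchisq(stat, df = df, lower.tail = FALSE))
}

# Hypothetical inputs for a table with I = 3 rows and M* = 2 dimensions
out <- nsca_col_pvalue(n = 101, I = 3, tau_num = 0.05,
                       p_row = c(0.5, 0.3, 0.2),
                       g = c(0.10, 0.05), lambda = c(0.20, 0.10),
                       D = 2, df = 2)
print(out)
```

Setting `higher = FALSE` drops the bracketed correction term, which corresponds to the second expression above where only the first two dimensions are considered.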

8.8 Bootstrap elliptical confidence regions

In Sections 8.3–8.5 we described simple formulae to derive circular and elliptical confidence regions for each point in a correspondence plot. An alternative, and more commonly used, strategy is to construct such regions using bootstrapping. Therefore, in this section we shall consider the construction of bootstrap confidence regions and go on to compare their coverage with those of the algebraic regions.

For those confidence regions described above, implicit in their derivation are the distributional assumptions underlying the use of the chi-squared statistic. Sometimes, depending on the data being analysed, such assumptions may lead to spurious coverages. Therefore, bootstrap sampling techniques can be employed and have been considered many times in


the correspondence analysis literature. For example, Ringrose (2012) explored the use of bootstrapping for the construction of ellipses defining confidence regions from the simple correspondence analysis of a contingency table. Lombardo and Ringrose (2012) applied the same procedure to non-symmetrical correspondence analysis. Greenacre (2007, pp. 196–197) also described the implementation of bootstrapping to construct confidence regions, projecting the replicated bootstrap tables onto the graph of the original data as supplementary points and showing the convex hull, after connecting the outer set of points by lines. To address the problem of outliers, a peeling step has also been proposed to remove the impact of the 5% most extreme outlying replicates. Greenacre (1984, 2007, p. 197) also considered the delta method for the construction of elliptical regions. The delta method is an asymptotic method (it uses the partial derivatives of the eigenvectors) based on theoretical assumptions presented by Gifi (1990, p. 408). It relies on the assumption of independent random sampling, which is not always satisfied (Greenacre, 2007). Lebart (2008) also constructed bootstrapped confidence regions for multiple correspondence analysis. See also Linting et al. (2007), Markus (1994), Timmerman et al. (2007), Lombardo et al. (2012) and Takane and Jung (2009) for more debate on this issue.

The focus of our attention here will be to discuss the semi-parametric bootstrap approach of Ringrose (2012) for simple correspondence analysis.

8.9 Ringrose’s bootstrap confidence regions

Suppose we have points for each row and column category for the population, albeit unknown. Based on the original sample of size 𝑛, the configuration obtained from performing a simple correspondence analysis on the contingency table may be viewed as corresponding to the population’s points, as in the bootstrapping procedure the sample points are in effect taking the role of the population ones.

Of course, if we were to randomly select another sample of size 𝑛, the configuration might be subtly, or even completely, different. That is, we expect there to be variation in the configuration, and the confidence ellipses to be constructed reflect this variation with a 100 (1 − 𝛼)% coverage of the population configuration.

Typically, bootstrap procedures can be used to construct 100 (1 − 𝛼)% confidence regions. Suppose we project a sample of points (𝐱) onto their sample axes (or sample left singular vectors, 𝐚) and compare their position with the original population point at 𝐱† on population axes 𝛼. Therefore, we would be concerned with the evaluation of the variation of the sample points and the population points projected onto different axes. That is, our focus is on evaluating $\mathrm{Var}(\mathbf{a}^T \mathbf{x} - \boldsymbol{\alpha}^T \mathbf{x}^{\dagger})$.

Ringrose (2012) considered a slightly different scenario where the variation between the sample points (𝐱) and population point (𝐱†) are on the same axes (𝐚) so that $\mathrm{Var}(\mathbf{a}^T (\mathbf{x} - \mathbf{x}^{\dagger}))$ is evaluated. Hence, in order to take into account the reliability of the correspondence plot, confidence regions are constructed based on the variability in the difference between the sample and population points, when both are projected onto the sample axes.

Therefore, we require a 100 (1 − 𝛼)% confidence region such that the population point is projected onto the sample axes and not onto the population axes (which represent a set of axes different from the ones we are looking at).


8.9.1 Confidence ellipses and covariance matrix

Denote for any original sample quantity 𝐗, the corresponding population quantity by 𝐗† and the corresponding resample (bootstrap) quantity by 𝐗⋆. In order to construct confidence regions around category sample points, Ringrose (2012) computed the variance matrix of each point of the following matrix of differences $\mathbf{d}_x = (\mathbf{X} - \mathbf{X}^{\dagger})\mathbf{a}$ for the 𝐽 − 1 axes. Setting the expected value of $\mathbf{d}_x$ equal to 0, that is $E(\mathbf{d}_x) \approx 0$, the variance matrix is defined as

\[ \mathrm{Var}(\mathbf{d}_x) = \boldsymbol{\Sigma}_x, \]

which takes into account the correlation between the sample points and the sample axes. At present, there exists no simple strategy to calculate the variance/covariance matrix $\boldsymbol{\Sigma}_x$, although a parametric approach may be considered to estimate it. This involves assuming a multinomial distribution for 𝐗 and producing many resampled data matrices 𝐗⋆. Once these are obtained, one may then perform a correspondence analysis on each of these resampled matrices, using the variance of the difference between the bootstrapped and original sample profiles when projected onto the bootstrap axes, that is the points of $\mathbf{d}_x = (\mathbf{X}^{\star} - \mathbf{X})\mathbf{a}$.

An alternative non-parametric approach involves obtaining 𝐵 bootstrapped samples from the original 𝐗 data set. The estimated variance matrix $\hat{\boldsymbol{\Sigma}}_x$ is therefore calculated over the 𝐵 bootstrap replications of $\mathbf{d}_x$. Sample axes can rotate from sample to sample, and so it is important to allow this variation in constructing the confidence regions. As a result, our computation of the confidence regions takes into account the variation caused by rotations.
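A much simplified sketch of the resampling idea described above is given below. It is not the implementation of Ringrose (2012) or of any package. The function `boot_row_cov` and the toy table are hypothetical, the whole table is resampled from a multinomial distribution, bootstrap axes are matched to the sample axes by reflection only (no reordering), and it is assumed that no resampled table has an empty row or column.

```r
# Sketch only: bootstrap estimate of the covariance matrix of each row point.
# ASSUMPTIONS: multinomial resampling of the whole table; axes are matched by
# reflection only; every resampled table has non-empty rows and columns.
boot_row_cov <- function(N, B = 200, dims = 1:2) {
  n <- sum(N)
  ca_axes <- function(tab) {
    P  <- tab / sum(tab)
    r  <- rowSums(P)
    cm <- colSums(P)
    S  <- (P - outer(r, cm)) / sqrt(outer(r, cm))
    V  <- svd(S)$v[, dims, drop = FALSE]
    # Row profiles and the column-metric axes onto which they are projected
    list(profiles = P / r, axes = V / sqrt(cm))
  }
  samp <- ca_axes(N)
  d <- array(NA_real_, dim = c(B, nrow(N), length(dims)))
  for (b in seq_len(B)) {
    Nb <- matrix(rmultinom(1, n, as.vector(N) / n), nrow = nrow(N))
    bt <- ca_axes(Nb)
    # Reflect each bootstrap axis so that it points the same way as the sample axis
    s <- sign(colSums(bt$axes * samp$axes))
    s[s == 0] <- 1
    axes_b <- sweep(bt$axes, 2, s, `*`)
    # Difference between bootstrap and sample profiles on the bootstrap axes
    d[b, , ] <- (bt$profiles - samp$profiles) %*% axes_b
  }
  # One covariance matrix per row category, taken over the B replicates
  lapply(seq_len(nrow(N)), function(i) cov(matrix(d[, i, ], nrow = B)))
}

set.seed(1)
N <- matrix(c(50, 30, 20,
              20, 40, 30,
              10, 30, 60), nrow = 3)   # hypothetical counts
Sigma_hat <- boot_row_cov(N, B = 200)
print(Sigma_hat[[1]])
```

Because only reflections are removed, any genuine rotation of the axes between replicates remains in the estimated covariance matrices, which is the behaviour the text argues for.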

In the literature (see, for example, Markus (1994), Linting et al. (2007) and Timmerman et al. (2007)) one possibility is to optimally rotate the axes from each of the 𝐵 bootstrapped samples to be as close as possible to the axes in the original/population sample. Procrustes rotations of the bootstrap coordinates relative to the sample coordinates have therefore been used to remove the spurious variation due to the difference between sample and bootstrap axes. Nevertheless, as remarked by Ringrose (1992, 1996, 2012), this procedure will tend to underestimate that part of the variation in the difference between the projected sample and parent/population points which is induced by the variation in the projection. Therefore, to allow for any reflection and reordering from the sample to bootstrap axes, Ringrose (2012) seeks the best match between sample and bootstrap axes under reflection and reordering. Unlike the full Procrustes rotations, this approach to reordering does not remove true variability caused by axis rotations. Due to the nature of the singular value decomposition, the row and column axes were rearranged jointly, with the reorderings and reflections the same for both.

One drawback of all forms of the bootstrap elliptical regions (including those of Ringrose, 2012) is that, unlike Beh’s (2010) elliptical regions discussed in Section 8.4, they cannot reflect information contained in dimensions greater than the third.

However, the computation of algebraic confidence regions (Lebart et al., 1984; Beh, 2010) is based on the null hypothesis of independence between the row and column variables. If this hypothesis is true, the population coordinates are assumed to lie at the origin. We may interpret that the sample points are projected onto the population axes which are not rotated. This means that the algebraically derived ellipses will always have semi-axes that are parallel with the principal axes of the correspondence plot.
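To see why an ellipse built from an estimated covariance matrix is generally rotated while an axis-parallel ellipse is not, consider the following sketch. It uses the standard construction of an ellipse from a 2 × 2 covariance matrix with a chi-squared quantile on 2 degrees of freedom; that choice, the function `cov_ellipse` and the matrices used are assumptions for illustration and are not taken from Ringrose (2012).

```r
# Sketch only: path of a 100(1 - alpha)% ellipse for a 2 x 2 covariance matrix.
# ASSUMPTION: a chi-squared quantile with 2 degrees of freedom scales the ellipse.
cov_ellipse <- function(centre, Sigma, alpha = 0.05, npts = 200) {
  e <- eigen(Sigma, symmetric = TRUE)
  t <- seq(0, 2 * pi, length.out = npts)
  circle <- rbind(cos(t), sin(t))               # unit circle, 2 x npts
  k <- sqrt(qchisq(1 - alpha, df = 2))
  # Stretch the circle along the eigenvectors, then shift to the centre
  pts <- k * e$vectors %*% (sqrt(pmax(e$values, 0)) * circle)
  list(path  = t(pts + centre),
       angle = atan2(e$vectors[2, 1], e$vectors[1, 1]))  # major-axis direction
}

centre <- c(0.5, 0.2)                           # hypothetical point
Sigma_rot  <- matrix(c(0.04, 0.01, 0.01, 0.02), nrow = 2)  # correlated axes
Sigma_diag <- diag(c(0.04, 0.02))                          # uncorrelated axes

ell_rot  <- cov_ellipse(centre, Sigma_rot)
ell_diag <- cov_ellipse(centre, Sigma_diag)
```

With a diagonal covariance matrix the major axis is parallel to the first principal axis, as for the algebraic regions; a non-zero covariance tilts the ellipse.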


8.10 Confidence regions and Selikoff’s asbestos data

Consider the two-way contingency Table 1.2 based on the data collected and analysed by Selikoff (1981). Throughout the book we have described that there exists a statistically significant association between the categorical variables; this was established in Chapter 2 using Pearson’s chi-squared test of independence. A graphical representation of this association can be depicted using simple correspondence analysis, non-symmetrical correspondence analysis and ordered symmetrical and non-symmetrical correspondence analysis. However, we have yet to formally identify those categories that make a statistically significant contribution to this association. Therefore, we shall now highlight the confidence regions and 𝑝-values of each of the categories of Table 1.2.

To investigate the nature of the association, a simple correspondence analysis can be performed where the two-dimensional correspondence plot is given by Figure 8.1. With a Pearson chi-squared statistic of 648.81 and 12 degrees of freedom (𝑝-value < 0.0001), there is no doubt about the significant association. The two-dimensional plot graphically accounts for 99.6% of the explained inertia, that is, of the association between the variables. Superimposed on this figure are the 95% confidence circles of Lebart et al. (1984) for each row and column point.

Figure 8.1 shows that, with the exception of the ‘20–29’ age group (whose circular region overlaps the origin), all of the row and column categories contribute to the association structure between the two variables. This suggests that the 𝑝-value for ‘20–29’ is more than 0.05. The row category 𝑝-values based on the confidence circles of Lebart et al. (1984), obtained using Equation 8.20 where 𝐷 = 2, are summarised in the first column of Table 8.1. Similarly, the

Figure 8.1 95% confidence circles based on the approach of Lebart et al. (1984) for Selikoff’s asbestos data in Table 1.2. [Plot omitted; axes: Principal Axis 1 (84.22%) and Principal Axis 2 (15.35%).]


Table 8.1 Algebraic and bootstrap 𝑝-values of row categories of Selikoff’s asbestos data of Table 1.2.

Rows       Circle 𝑝-value   Ellipse (𝐷 = 2) 𝑝-value   Optimal ellipse 𝑝-value
None       0.000            0.000                     0.000
Grade 1    0.003            0.000                     0.000
Grade 2    0.000            0.000                     0.000
Grade 3    0.000            0.000                     0.000

Table 8.2 Algebraic and bootstrap 𝑝-values of column categories of Selikoff’s asbestos data of Table 1.2.

Columns    Circle 𝑝-value   Ellipse (𝐷 = 2) 𝑝-value   Optimal ellipse 𝑝-value
0–9        0.000            0.000                     0.000
10–19      0.000            0.000                     0.000
20–29      0.085            0.000                     0.000
30–39      0.000            0.000                     0.000
40+        0.000            0.000                     0.000

column category 𝑝-values of the asbestos data (Table 1.2) are summarised in the first column of Table 8.2. It can be seen that, apart from ‘20–29’ (whose 𝑝-value is 0.085), the 𝑝-value of all row and column categories is almost zero. Therefore, the confidence circles of Lebart et al. (1984), and the 𝑝-values derived from them, are effective for monitoring the statistical significance of a category from the hypothesis of no association.

As already described, the confidence circles of Lebart et al. (1984) are constructed by assuming that the principal inertias of the first two dimensions are the same. Since the two values are quite different ($\lambda_1^2 = 0.489$ and $\lambda_2^2 = 0.089$), one may instead consider the elliptical regions proposed by Beh (2010). These regions appear in Figure 8.2 and are derived using the semi-major and semi-minor axis lengths defined by Equation 8.11. Therefore, Figure 8.2 reflects the information contained in the first two dimensions of the correspondence plot. Note that the relative size of these regions is fairly consistent with those that appear in Figure 8.1. By taking into consideration the unequal weighting of the two principal axes, the region for ‘20–29’ now does not include the origin. This suggests that this column category does play a statistically significant role in the association structure between the two variables. Indeed, the 𝑝-values associated with elliptically generated regions reflect the importance (or not) of these categories. The third column of Table 8.1 summarises the 𝑝-values of the row categories, which are calculated using Equation 8.24 where 𝐷 = 2, while the third column of Table 8.2 provides those 𝑝-values of the column categories. Table 8.1 shows that, when considering the first two dimensions only or when increasing the dimensions, the 𝑝-value for all the row categories is effectively 0, which indicates that those categories do play a significant role in the association structure. In both Tables 8.1 and 8.2, a 𝑝-value of 0 is one that is less than 0.001. Similarly, Figure 8.3 shows that at the 0.05 level of significance, ‘Grade 2’ and ‘Grade 3’ (which have ellipses smaller than when considering two dimensions) do play a significant role in the association structure between the variables of the asbestos data (Table 1.2).


Figure 8.2 95% confidence ellipses based on the approach of Beh (2010) for Selikoff’s asbestos data in Table 1.2. Only the information contained in the first two dimensions is reflected here. [Plot omitted; axes: Principal Axis 1 (84.22%) and Principal Axis 2 (15.35%).]

Figure 8.3 95% confidence ellipses based on the approach of Beh (2010) for Selikoff’s asbestos data of Table 1.2. All of the information contained in the optimal correspondence plot is reflected here. [Plot omitted; axes: Principal Axis 1 (84.22%) and Principal Axis 2 (15.35%).]


Figure 8.2 ignores the association reflected by the third (and, in general, higher) principal axis. However, such information may be reflected by considering the 𝑝-values or the display of Figure 8.3 where 𝐷 = 3. In such a case, all of the row categories have a 𝑝-value that is less than 0.001, thus concluding that all of these categories play a statistically significant role in the association between the row and column variables of the asbestos data. The elliptical regions of Figure 8.3 for both sets of categories provide a graphical representation of this result.

The key difference between Figures 8.2 and 8.3 is the size of the elliptical region for the age group ‘20–29’. This suggests that this category has a non-zero coordinate in the third dimension of the optimal correspondence plot.

Figure 8.3 confirms that this is the case. As a result, the size of the regions for the ‘10–19’ and ‘20–29’ age groups becomes smaller due to the inclusion of the additional information on the association structure contained in the higher dimension. Such changes, yielded from the regions and 𝑝-values, show that it is important to reflect the information contained in higher dimensions, rather than relying on findings of the association structure on just the first two dimensions, as is typically done when performing correspondence analysis.

Finally, to assess the previous features, in Figure 8.4 we show the superimposed bootstrap ellipses which take into account only the first two dimensions. Despite their different shapes, these confidence regions lead to conclusions similar to those obtained by considering the algebraically derived elliptical regions. What is clearly different is the shape and the orientation of the ellipses, and further investigation may be important. Indeed, these bootstrap

Figure 8.4 95% confidence ellipses based on the bootstrap approach of Ringrose (2012) for Selikoff’s asbestos data in Table 1.2. [Plot omitted; panel title: Multinomial resampling; axes: Principal Axis 1 (84.22%) and Principal Axis 2 (15.35%).]


confidence ellipses are derived after projecting sample points onto sample axes, while the algebraically derived confidence ellipses have a different genesis, as sample points may be seen as projected onto population axes.

8.11 Confidence regions and mother–child attachment data

Suppose we now consider the two-way contingency table of Table 4.10 that cross-classifies a mother’s attachment to her child, and the child’s response to the mother’s level of attachment (see Section 4.13 and Van IJzendoorn (1995)). While it is apparent that the association structure between mother’s attachment level and infant response may be treated asymmetrically, we shall first consider that the association structure between the variables is symmetrical. In doing so, we highlight the confidence regions and 𝑝-values of Table 4.10 by performing a classical correspondence analysis.

A chi-squared test of independence concludes that with a Pearson’s chi-squared statistic of 252.4 and 9 degrees of freedom (𝑝-value < 0.0001), there is ample evidence to conclude that the two categorical variables are associated.
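The quoted tail probability can be checked directly in R from the statistic and its degrees of freedom as reported above; the variable names are arbitrary.

```r
# Upper-tail probability of a chi-squared variable with 9 degrees of freedom
# exceeding the reported Pearson statistic of 252.4
X2 <- 252.4
df <- 9
p_value <- pchisq(X2, df = df, lower.tail = FALSE)
p_value < 0.0001
```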

To investigate further the nature of the association, a simple correspondence analysis can be performed. Doing so leads to the two-dimensional correspondence plot of Figure 8.5. The

Figure 8.5 95% confidence circles based on the approach of Lebart et al. (1984) for the mother–child attachment data in Table 4.10. [Plot omitted; axes: Principal Axis 1 (54.06%) and Principal Axis 2 (35.84%).]


first axis, with a principal inertia of $\lambda_1^2 = 0.25$, accounts for 54.06% of the inertia, while the second axis accounts for 35.84% ($\lambda_2^2 = 0.017$). Therefore, the first two dimensions visually summarise about 90% of the association that exists between the two variables. Superimposed on the configuration of Figure 8.5 are the 95% confidence circles of Lebart et al. (1984) for each row and column point.

Figure 8.5 suggests that with the exception of ‘Resistant’ (whose circular region overlaps the origin), all row and column categories contribute to the association structure between the two variables. It is also evident that the confidence circle of ‘Preoccupied’ includes the origin within the region. This suggests that the 𝑝-value for ‘Resistant’ is more than 0.05, and the 𝑝-value of ‘Preoccupied’ is more than 0.05. The row category 𝑝-values based on the confidence circles, obtained using Equation 8.20 with 𝐷 = 2, are summarised in the second column of Table 8.3. Similarly, the column category 𝑝-values of the contingency table are summarised in the second column of Table 8.4. It can be seen that the 𝑝-value of ‘Resistant’ is 0.287. However, the 𝑝-value of ‘Preoccupied’ is 0.02, less than 0.05. This suggests that the confidence circles, or the 𝑝-values derived from them, are not effective for monitoring the statistical significance of a category from the hypothesis of no association. This may be due to the equal weighting that is applied to each of the axes when constructing confidence circles.

As described above, the confidence circles are constructed by assuming that the principal inertias of the first two dimensions are equal. Since they are quite different, one may instead consider elliptical regions. These regions appear in Figure 8.6, but reflect only

Figure 8.6 95% confidence ellipses based on the approach of Beh (2010) for the mother–child attachment data in Table 4.10. Only the information contained in the first two dimensions is reflected here. [Plot omitted; axes: Principal Axis 1 (54.06%) and Principal Axis 2 (35.84%).]


Table 8.3 𝑝-values of row categories of the mother–child attachment data (Table 4.10).

Rows           Circle 𝑝-value   Ellipse (𝐷 = 2) 𝑝-value   Optimal ellipse 𝑝-value
Avoidant       0.000            0.000                     0.000
Secure         0.000            0.000                     0.000
Resistant      0.981            0.802                     0.000
Disorganised   0.000            0.000                     0.000

the information contained in the first two dimensions. Note that the relative size of these regions is consistent with those that appear in Figure 8.5. By taking into consideration the unequal weighting of the two principal axes, the region for ‘Preoccupied’ now includes the origin, which suggests that this column category does not play a statistically significant role in the association structure between the two variables. Indeed, the 𝑝-values associated with elliptically generated regions reflect the importance (or not) of these categories. The third column of Table 8.3 summarises the 𝑝-values of the row categories, which are calculated using Equation 8.24, where 𝐷 = 2, while the third column of Table 8.4 provides those 𝑝-values of the column categories. Furthermore, Table 8.3 shows that, when considering the first two dimensions only, the 𝑝-value for ‘Resistant’ is 0.802, indicating that this particular row category does not play a significant part in the association. However, the remaining three row categories, which have a 𝑝-value less than 0.001, do play a significant role in the association structure; in the following tables, a zero 𝑝-value represents those categories with a 𝑝-value less than 0.001. Similarly, Figure 8.6 shows that, at the 0.05 level of significance, ‘Preoccupied’ (which has a 𝑝-value of 0.108 when considering only the first two dimensions) does not play a significant role in the association structure between the variables of Table 4.10.

These 𝑝-values ignore the association reflected by the third (and, in general, higher) principal axis. However, such information may be reflected by considering the 𝑝-values derived from Equation 8.23 where 𝐷 = 2. In such a case, all the row and column categories have a 𝑝-value that is less than 0.001, thus concluding that all these categories play a statistically significant role in the association between the row and column variables of Table 4.10. The elliptical regions of Figure 8.7 for both sets of categories provide a graphical representation of this result.

The key difference between Figures 8.6 and 8.7 is the size of the elliptical region for ‘Resistant’ and ‘Preoccupied’. This suggests that these two categories have a relatively large non-zero coordinate in the third dimension of the optimal correspondence plot. As a result, the 𝑝-value of these categories dramatically changes due to the inclusion of the additional information on the association structure contained in the higher dimension, in both cases

Table 8.4 𝑝-values of column categories of Table 4.10.

Columns       Circle 𝑝-value   Ellipse (𝐷 = 2) 𝑝-value   Optimal ellipse 𝑝-value
Dismissing    0.000            0.000                     0.000
Autonomous    0.000            0.000                     0.000
Preoccupied   0.025            0.108                     0.000
Unresolved    0.000            0.000                     0.000


Figure 8.7 95% confidence ellipses based on the approach of Beh (2010) for the mother–child attachment data in Table 4.10. All of the information contained in the optimal correspondence plot is reflected here. [Plot omitted; axes: Principal Axis 1 (54.06%) and Principal Axis 2 (35.84%).]

reducing the 𝑝-value from a relatively large quantity to less than 0.001. Such a dramatic change in the conclusions, yielded from the regions and 𝑝-values, shows that it is important to reflect the information contained in higher dimensions, rather than relying on findings of the association structure on just the first two dimensions, as is typically done when performing correspondence analysis.

Finally, Figure 8.8 shows the bootstrap confidence ellipses, which take into account the information contained in only the first two dimensions. We can see that the conclusions concerning the association structure from these regions are similar to those reached when considering the algebraic elliptical regions in Figure 8.6.

8.12 R code

In this section, we provide some R code for obtaining confidence circles of Lebart et al. (1984) and confidence ellipses of Beh (2010) for a simple correspondence analysis of a contingency table. This code can be easily adapted for a non-symmetrical correspondence analysis, or an analysis of ordered categorical variables. However, we shall not consider this here. Ringrose’s bootstrapped confidence regions can be constructed using the cabootcrs library available on the CRAN website (cran.r-project.org/web/packages/cabootcrs/index.html).


Figure 8.8 95% confidence ellipses based on the bootstrap approach of Ringrose (2012) for the mother–child attachment data in Table 4.10. [Plot omitted; panel title: Multinomial resampling; axes: Principal Axis 1 (54.06%) and Principal Axis 2 (35.84%).]

Here we shall describe two pieces of R code. The first will be for calculating the elliptical, or circular, paths of the confidence regions of Lebart et al. (1984) and Beh (2010). The second piece of R code can be used to obtain a correspondence plot from the simple correspondence analysis of a two-way contingency table; this plot superimposes the algebraically derived confidence regions, given a specified value of 𝛼, for each of the points in the configuration.

8.12.1 Calculating the path of a confidence ellipse

The following R code constructs elliptical confidence regions for a specific row or column point. One may use, for example, the code of Macdonald (2002), as Beh (2010) did. There also exists on CRAN (cran.r-project.org/web/packages/ellipse/index.html) the package ellipse, which constructs elliptical regions for a variety of statistical situations.

The input parameters of the R code we present here are as follows:

∙ coord -- The principal coordinate of a row or column category in a correspondence plot. Its length must be at least 2.

∙ xcoord -- The semi-major axis length of the coordinate specified by coord.

∙ ycoord -- The semi-minor axis length of the coordinate specified by coord.

∙ col -- The colour of the line that follows the path of the elliptical region.


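regions <- function(coord, xcoord, ycoord, col = "black"){
  # The original default, col = col, refers to itself and fails when col is
  # not supplied; "black" is used here as an assumed default colour.

  # Angles tracing one full revolution around the region
  t <- seq(0, 2*pi, length.out = 1000)

  # Points on the ellipse centred at coord with semi-axis lengths xcoord, ycoord
  pcoord1 <- coord[1] + xcoord*cos(t)
  pcoord2 <- coord[2] + ycoord*sin(t)

  # Add the path to the current plot
  lines(pcoord1, pcoord2, col = col)
}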

For elliptical regions, xcoord and ycoord can be specified so that they reflect the information contained either in an optimal correspondence plot or in a plot with less than 𝑀∗ = min (𝐼, 𝐽 ) − 1 dimensions. Circular confidence regions using this function can be constructed by selecting the semi-major axis length, xcoord, and semi-minor axis length, ycoord, so that they are equivalent. However, we shall make use of the symbols function that exists in R.
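As a brief usage sketch, the following code draws an axis-parallel ellipse and a circle around a single point and checks whether the ellipse covers the origin. The helper `ellipse_path` is a hypothetical variant of regions that returns the path instead of drawing it, and the coordinate and semi-axis lengths are made-up values rather than results from any table in this chapter.

```r
# Sketch only: same parametrisation as regions, but the path is returned
ellipse_path <- function(coord, xcoord, ycoord, npts = 200) {
  t <- seq(0, 2 * pi, length.out = npts)
  cbind(x = coord[1] + xcoord * cos(t),
        y = coord[2] + ycoord * sin(t))
}

pt     <- c(0.4, -0.1)   # hypothetical principal coordinates of a point
xcoord <- 0.30           # hypothetical semi-major axis length
ycoord <- 0.12           # hypothetical semi-minor axis length
path   <- ellipse_path(pt, xcoord, ycoord)

plot(path, type = "n", asp = 1,
     xlab = "Principal Axis 1", ylab = "Principal Axis 2")
lines(path, col = "red")                       # elliptical region
points(pt[1], pt[2], pch = "+", col = "red")
# Circular region of radius xcoord drawn with the base R symbols function
symbols(pt[1], pt[2], circles = xcoord, inches = FALSE, add = TRUE, fg = "blue")
abline(h = 0, v = 0, lty = 3)

# The origin lies inside the ellipse when this quantity is at most 1
covers_origin <- (pt[1] / xcoord)^2 + (pt[2] / ycoord)^2 <= 1
covers_origin
```

A point whose region does not cover the origin corresponds, in the terms used above, to a category whose 𝑝-value falls below the chosen level of significance.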

8.12.2 Constructing elliptical regions in a correspondence plot

By adapting the R code simpleca presented in Chapter 4, we shall provide some code that superimposes the confidence regions onto a correspondence plot obtained using simple correspondence analysis and non-symmetrical correspondence analysis.

8.12.2.1 R code for simple correspondence analysis

Confidence circles and ellipses can be superimposed onto a simple correspondence plot using the following code, labelled ca.regions.exe. The core of the code calculates the semi-axis lengths for the two axes and then uses the function regions (described in the previous section) to derive the path of these regions.

The input parameters of ca.regions.exe are as follows:

1. N -- The two-way contingency table to be analysed.

2. a1 and a2 -- The axes used to construct the correspondence plot. By default they are set to construct a plot using the first and second principal axes.

3. alpha -- The level of significance used to construct the confidence region. By default it is set at 0.05.

4. cols -- The colours used for the points, labels and regions. By default the features for the rows are in red and those for the columns are in blue.

5. M -- The number of dimensions to consider as part of the analysis. When constructing elliptical regions or calculating their 𝑝-values, M may be 2 or may be min (𝐼, 𝐽 ) − 1.

6. region -- The type of confidence region to be considered. When region = 1, confidence circles are superimposed onto the correspondence plot. Similarly, when region = 2, confidence ellipses are constructed.

7. scaleplot -- A scaling factor that adjusts the magnification of the axes to better visualise the points, or their confidence regions. By default it is set at 1.2.


This function includes as part of its output a summary of the row and column characteristics. It includes the semi-axis lengths (or radii, depending on the type of confidence region specified) for each point along the first and second principal axes. The 𝑝-values for the elliptical and circular regions are also included.

ca.regions.exe <- function (N, a1 = 1, a2 = 2, alpha = 0.05, cols = c(2, 4),
                            M = min(nrow(N), ncol(N)) - 1, region = 2,
                            scaleplot = 1.2) {

  ##############################################################
  ## Defining features of the contingency table for CA        ##
  ##############################################################

  I <- nrow(N)   # Number of rows of table
  J <- ncol(N)   # Number of columns of table

  Inames <- dimnames(N)[[1]]   # Row category names
  Jnames <- dimnames(N)[[2]]   # Column category names

  n <- sum(N)      # Total number classified in the table
  p <- N * (1/n)   # Matrix of joint relative proportions

  Imass <- as.matrix(apply(p, 1, sum))   # Row masses
  Jmass <- as.matrix(apply(p, 2, sum))   # Column masses

  ItJ <- Imass %*% t(Jmass)
  y <- p - ItJ
  dI <- diag(Imass[1:I])
  dJ <- diag(Jmass[1:J])
  Ih <- Imass^-0.5
  Jh <- Jmass^-0.5
  dIh <- diag(Ih[1:I])
  dJh <- diag(Jh[1:J])

  x <- dIh %*% y %*% dJh

  sva <- svd(x)
  a <- dIh %*% sva$u
  b <- dJh %*% sva$v

  dmu <- diag(sva$d)   # Diagonal matrix of singular values

  f <- a %*% dmu   # Row coordinates for Classical CA
  g <- b %*% dmu   # Column coordinates for Classical CA

  dimnames(f)[[1]] <- Inames
  dimnames(g)[[1]] <- Jnames

  Principal.Inertia <- diag(t(f[, 1:min(I-1, J-1)]) %*% dI %*%
                            f[, 1:min(I-1, J-1)])

  Total.Inertia <- sum(Principal.Inertia)
  Percentage.Inertia <- (Principal.Inertia/Total.Inertia) * 100
  Total.Perc.Inertia.M <- sum(Principal.Inertia[1:M])


  chisq.val <- qchisq(1 - alpha, df = (I - 1) * (J - 1))

  ##############################################################
  ## Construction of correspondence plot                      ##
  ##############################################################

  par(pty = "s")

  plot(0, 0, pch = " ", xlim = scaleplot * range(f[, 1:M], g[, 1:M]),
       ylim = scaleplot * range(f[, 1:M], g[, 1:M]),
       xlab = paste("Principal Axis ", a1, "(",
                    round(Percentage.Inertia[a1], digits = 2), "%)"),
       ylab = paste("Principal Axis ", a2, "(",
                    round(Percentage.Inertia[a2], digits = 2), "%)"))

  # Labels are placed on the same axes (a1, a2) as the points
  text(f[, a1], f[, a2], labels = Inames, adj = 0, col = cols[1])
  points(f[, a1], f[, a2], pch = "*", col = cols[1])

  text(g[, a1], g[, a2], labels = Jnames, adj = 1, col = cols[2])
  points(g[, a1], g[, a2], pch = "#", col = cols[2])

  abline(h = 0, v = 0)

  title(main = paste(100 * (1 - alpha), "% Confidence Regions"))

  ##############################################################
  ## Calculating the row and column radii length for a        ##
  ## confidence circle                                        ##
  ##############################################################

  radii <- sqrt(qchisq(1 - alpha, 2)/(n * Imass))
  radij <- sqrt(qchisq(1 - alpha, 2)/(n * Jmass))

  ##############################################################
  ## Calculating the semi-axis lengths for the confidence     ##
  ## ellipses                                                 ##
  ##############################################################

  hlax1.row <- vector(mode = "numeric", length = I)
  hlax2.row <- vector(mode = "numeric", length = I)

  hlax1.col <- vector(mode = "numeric", length = J)
  hlax2.col <- vector(mode = "numeric", length = J)

  if (M > 2){

    # Semi-axis lengths for the row coordinates in an optimal plot

    for (i in 1:I){
      hlax1.row[i] <- dmu[1,1] * sqrt((chisq.val/(n * Total.Inertia)) *
                        (1/Imass[i] - sum(a[i, 3:M]^2)))
      hlax2.row[i] <- dmu[2,2] * sqrt((chisq.val/(n * Total.Inertia)) *
                        (1/Imass[i] - sum(a[i, 3:M]^2)))
    }

    # Semi-axis lengths for the column coordinates in an optimal plot

    for (j in 1:J){
      hlax1.col[j] <- dmu[1,1] * sqrt((chisq.val/(n * Total.Inertia)) *
                        (1/Jmass[j] - sum(b[j, 3:M]^2)))
      hlax2.col[j] <- dmu[2,2] * sqrt((chisq.val/(n * Total.Inertia)) *
                        (1/Jmass[j] - sum(b[j, 3:M]^2)))
    }

  } else {

    # Semi-axis lengths for the row coordinates in a two-dimensional plot

    for (i in 1:I){
      hlax1.row[i] <- dmu[1,1] * sqrt((chisq.val/(n * Total.Inertia)) *
                        (1/Imass[i]))
      hlax2.row[i] <- dmu[2,2] * sqrt((chisq.val/(n * Total.Inertia)) *
                        (1/Imass[i]))
    }

    # Semi-axis lengths for the column coordinates in a two-dimensional plot

    for (j in 1:J){
      hlax1.col[j] <- dmu[1,1] * sqrt((chisq.val/(n * Total.Inertia)) *
                        (1/Jmass[j]))
      hlax2.col[j] <- dmu[2,2] * sqrt((chisq.val/(n * Total.Inertia)) *
                        (1/Jmass[j]))
    }

  }

  ##############################################################
  ## Eccentricity                                             ##
  ##############################################################

  eccentricity <- sqrt(1 - (dmu[a2, a2]/dmu[a1, a1])^2)

  ##############################################################
  ## Approximate P-values                                     ##
  ##############################################################

  pvalrow <- vector(mode = "numeric", length = I)
  pvalrowcircle <- vector(mode = "numeric", length = I)

  pvalcol <- vector(mode = "numeric", length = J)
  pvalcolcircle <- vector(mode = "numeric", length = J)

  for (i in 1:I){

    # Approximate row P-values from Lebart et al.’s (1984) confidence
    # circles

    pvalrowcircle[i] <- 1 - pchisq(n * Imass[i] * (f[i, 1]^2 + f[i, 2]^2),
                                   df = (I - 1) * (J - 1))

    # Approximate P-values based on Beh’s (2010) confidence ellipses

    if (M > 2){
      pvalrow[i] <- 1 - pchisq(n * Total.Inertia * ((1/Imass[i] -
                      sum(a[i, 3:M]^2))^(-1)) * ((f[i, 1]/dmu[1, 1])^2
                      + (f[i, 2]/dmu[2, 2])^2), df = (I - 1) * (J - 1))
    } else {
      pvalrow[i] <- 1 - pchisq(n * Total.Inertia * Imass[i] *
                      ((f[i, 1]/dmu[1, 1])^2 + (f[i, 2]/dmu[2, 2])^2),
                      df = (I - 1) * (J - 1))
    }
  }

  for (j in 1:J){

    # Approximate column P-values based on Lebart et al.’s (1984)
    # confidence circles; the column mass Jmass[j] is used here

    pvalcolcircle[j] <- 1 - pchisq(n * Jmass[j] * (g[j, 1]^2 + g[j, 2]^2),
                                   df = (I - 1) * (J - 1))

    # Approximate P-values based on Beh’s (2010) confidence ellipses

    if (M > 2){
      pvalcol[j] <- 1 - pchisq(n * Total.Inertia * ((1/Jmass[j] -
                      sum(b[j, 3:M]^2))^(-1)) * ((g[j, 1]/dmu[1, 1])^2
                      + (g[j, 2]/dmu[2, 2])^2), df = (I - 1) * (J - 1))
    } else {
      pvalcol[j] <- 1 - pchisq(n * Total.Inertia * Jmass[j] *
                      ((g[j, 1]/dmu[1, 1])^2 + (g[j, 2]/dmu[2, 2])^2),
                      df = (I - 1) * (J - 1))
    }
  }

  summ.name <- c("HL Axis 1", "HL Axis 2", "P-value-ellipse",
                 "P-value-circle")

  if (region == 1){
    row.summ <- cbind(radii, radii, pvalrow, pvalrowcircle)
    col.summ <- cbind(radij, radij, pvalcol, pvalcolcircle)
  } else if (region == 2){
    row.summ <- cbind(hlax1.row, hlax2.row, pvalrow, pvalrowcircle)
    col.summ <- cbind(hlax1.col, hlax2.col, pvalcol, pvalcolcircle)
  }

  dimnames(row.summ) <- list(paste(Inames), paste(summ.name))
  dimnames(col.summ) <- list(paste(Jnames), paste(summ.name))


  ##############################################################
  ## Superimposing the confidence regions                     ##
  ##############################################################

  if (region == 1){

    # Superimposing the confidence circles; inches = FALSE keeps the
    # radii in the units of the plot axes

    symbols(f[, a1], f[, a2], circles = radii, add = TRUE,
            inches = FALSE, fg = cols[1])
    symbols(g[, a1], g[, a2], circles = radij, add = TRUE,
            inches = FALSE, fg = cols[2])

  } else if (region == 2){

    # Superimposing the confidence ellipses, centred on the plotted
    # axes a1 and a2

    for (i in 1:I){
      regions(f[i, c(a1, a2)], xcoord = hlax1.row[i],
              ycoord = hlax2.row[i], col = cols[1])
    }
    for (j in 1:J){
      regions(g[j, c(a1, a2)], xcoord = hlax1.col[j],
              ycoord = hlax2.col[j], col = cols[2])
    }
  }

  ##############################################################
  ## Summary of output                                        ##
  ##############################################################

  if (region == 1){

    list(Row.Summary = round(row.summ, digits = 4),
         Column.Summary = round(col.summ, digits = 4))

  } else if (region == 2){

    list(Eccentricity = round(eccentricity, digits = 4),
         Row.Summary = round(row.summ, digits = 4),
         Column.Summary = round(col.summ, digits = 4))

  }
}
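The first part of this function can be checked independently. The following is a minimal sketch, assuming a small made-up 3 × 3 table (the counts and labels are purely illustrative and are not the asbestos data): it repeats the singular value decomposition step and confirms that 𝑛 times the total inertia reproduces Pearson's chi-squared statistic, which is the quantity on which the confidence regions are based.

```r
# Made-up contingency table, for illustration only
N <- matrix(c(20, 10,  5,
              10, 15, 10,
               5, 10, 25), nrow = 3, byrow = TRUE,
            dimnames = list(c("r1", "r2", "r3"), c("c1", "c2", "c3")))

n <- sum(N)
p <- N / n
Imass <- rowSums(p)   # row masses
Jmass <- colSums(p)   # column masses

# Standardised residuals and their SVD, as in ca.regions.exe
x <- diag(Imass^-0.5) %*% (p - Imass %o% Jmass) %*% diag(Jmass^-0.5)
sva <- svd(x)

# Row principal coordinates and the principal inertias
f <- diag(Imass^-0.5) %*% sva$u %*% diag(sva$d)
principal_inertia <- colSums(Imass * f^2)
total_inertia <- sum(principal_inertia[1:(min(dim(N)) - 1)])

# Pearson chi-squared statistic from base R, for comparison
chisq_stat <- unname(chisq.test(N, correct = FALSE)$statistic)
print(c(n * total_inertia, chisq_stat))
```

The same check can be applied to any two-way table with positive row and column totals.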

As a simple demonstration of this code, Figure 8.1 may be obtained by

> ca.regions.exe(asbestos.dat, region = 1)

and produces the following numerical output:

$Row.Summary
        HL Axis 1 HL Axis 2 P-value-ellipse P-value-circle
0-9        0.1316    0.1316               0         0.0000
10-19      0.1257    0.1257               0         0.0000
20-29      0.2789    0.2789               0         0.0849
30-39      0.1757    0.1757               0         0.0000
40+        0.2225    0.2225               0         0.0000

$Column.Summary
        HL Axis 1 HL Axis 2 P-value-ellipse P-value-circle
None       0.1021    0.1021               0         0.0000
Grade 1    0.1279    0.1279               0         0.0032
Grade 2    0.2181    0.2181               0         0.0000
Grade 3    0.3462    0.3462               0         0.0000

From this output, the second and third columns, HL Axis 1 and HL Axis 2, respectively, are the half-lengths of each row and column confidence circle; that is, the radius of the circle. The fourth column summarises the 𝑝-value based on the elliptical regions that capture the information contained in the first two dimensions of the display. Similarly, the final column summarises the approximate 𝑝-value from the confidence circles displayed in Figure 8.1.
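Because the radius in ca.regions.exe is sqrt(qchisq(1 - alpha, 2)/(n * Imass)), a reported radius can be inverted to recover the number of observations behind a category. The sketch below does this for the radius 0.1316 reported for the row 0-9, assuming the default alpha = 0.05; the helper radius_from_count is a hypothetical name introduced only for this illustration.

```r
alpha <- 0.05
radius_0_9 <- 0.1316                  # HL Axis 1 for row "0-9" in the listing

# Critical value used for the circles: chi-squared with 2 degrees of freedom
crit <- qchisq(1 - alpha, df = 2)

# Inverting radius = sqrt(crit / (n * mass)) gives n * mass, the row total
implied_count <- crit / radius_0_9^2
print(round(implied_count, 1))

# Hypothetical helper: radius of the circle for a category with a given count
radius_from_count <- function(count, alpha = 0.05) {
  sqrt(qchisq(1 - alpha, df = 2) / count)
}

# Quadrupling the count halves the radius
print(radius_from_count(c(100, 400)))
```

The radius therefore depends on the data only through the category total, so sparsely observed categories always receive the largest circles.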

The elliptical regions based on the two dimensions of Figure 8.2 may be obtained by

> ca.regions.exe(asbestos.dat, region = 2, M = 2)

which also produces the following numerical output:

$Eccentricity
[1] 0.9043

$Row.Summary
        HL Axis 1 HL Axis 2 P-value-ellipse P-value-circle
0-9        0.2262    0.0966           0e+00         0.0000
10-19      0.2162    0.0923           0e+00         0.0000
20-29      0.4795    0.2047           4e-04         0.0849
30-39      0.3021    0.1290           0e+00         0.0000
40+        0.3825    0.1633           0e+00         0.0000

$Column.Summary
        HL Axis 1 HL Axis 2 P-value-ellipse P-value-circle
None       0.1755    0.0749               0         0.0000
Grade 1    0.2200    0.0939               0         0.0032
Grade 2    0.3749    0.1601               0         0.0000
Grade 3    0.5951    0.2541               0         0.0000

Here HL Axis 1 and HL Axis 2 are the semi-major and semi-minor axis lengths of the confidence ellipses. Similarly, Figure 8.3 is obtained by

> ca.regions.exe(asbestos.dat, region = 2)


giving

$Eccentricity
[1] 0.9043

$Row.Summary
        HL Axis 1 HL Axis 2 P-value-ellipse P-value-circle
0-9        0.2208    0.0943               0         0.0000
10-19      0.1866    0.0797               0         0.0000
20-29      0.3344    0.1428               0         0.0849
30-39      0.2930    0.1251               0         0.0000
40+        0.3578    0.1528               0         0.0000

$Column.Summary
        HL Axis 1 HL Axis 2 P-value-ellipse P-value-circle
None       0.1753    0.0749               0         0.0000
Grade 1    0.2169    0.0926               0         0.0032
Grade 2    0.2818    0.1203               0         0.0000
Grade 3    0.4054    0.1731               0         0.0000
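The eccentricity is the same in both sets of output because, in ca.regions.exe, the two semi-axis lengths of any point differ only by the factor dmu[2, 2]/dmu[1, 1]. As a small check under that reading of the code, the sketch below recovers the eccentricity from the semi-axis lengths reported for the row 0-9 in each listing; agreement is only to about three decimal places since the listed values are rounded.

```r
# Semi-axis lengths for row "0-9", copied from the two listings:
# the run with M = 2 and the default (optimal) run
hl1 <- c(M2 = 0.2262, optimal = 0.2208)   # HL Axis 1
hl2 <- c(M2 = 0.0966, optimal = 0.0943)   # HL Axis 2

# Same formula as in ca.regions.exe, with hl2/hl1 standing in for the
# ratio of the second to the first singular value
ecc <- sqrt(1 - (hl2 / hl1)^2)
print(round(ecc, 3))
```

An eccentricity this close to 1 reflects how strongly the first principal axis dominates the second for these data.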

8.12.2.2 R code for non-symmetrical correspondence analysis

When performing non-symmetrical correspondence analysis on a two-way contingency table, the R code above can be amended. When the column categories are treated as forming the predictor variable and the row categories are treated as forming the response variable, the code necessary to construct confidence regions may be obtained by replacing

x <- dIh %*% y %*% dJh

sva <- svd(x)
a <- dIh %*% sva$u
b <- dJh %*% sva$v

dmu <- diag(sva$d) # Diagonal matrix of singular values
...
chisq.val <- qchisq(1 - alpha, df = (I - 1)*(J - 1))
...
radii <- sqrt(qchisq(1 - alpha, 2)/(n * Imass))
radij <- sqrt(qchisq(1 - alpha, 2)/(n * Jmass))

in ca.regions.exe with

x <- y %*% dJh
sva <- svd(x)
a <- sva$u
b <- dJh %*% sva$v

dmu <- diag(sva$d)
f <- a %*% dmu
g <- b %*% dmu
...
tauden <- 1 - sum(Imass^2)

# Both (n - 1) and (I - 1) belong in the denominator, as in radii and radij
chisq.val <- (qchisq(1 - alpha, df = (I - 1) * (J - 1)) * tauden)/
             ((n - 1) * (I - 1))
...
radii <- sqrt((qchisq(1 - alpha, 2) * tauden)/((n - 1) * (I - 1) * Imass))
radij <- sqrt((qchisq(1 - alpha, 2) * tauden)/((n - 1) * (I - 1) * Jmass))
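To see what the amended decomposition measures, the sketch below uses a small made-up table (illustrative counts only) and checks that the sum of squared singular values of y %*% dJh equals the numerator of the Goodman-Kruskal tau index computed directly from its definition. The last lines assume that the scaling by tauden, (n - 1) and (I - 1) corresponds to the C-statistic of Light and Margolin (1971); that interpretation is an assumption of this sketch rather than something stated alongside the code.

```r
# Made-up contingency table, for illustration only
N <- matrix(c(20, 10,  5,
              10, 15, 10,
               5, 10, 25), nrow = 3, byrow = TRUE)

n <- sum(N); I <- nrow(N); J <- ncol(N)
p <- N / n
Imass <- rowSums(p)
Jmass <- colSums(p)

# Amended decomposition: only the column (predictor) masses are used
x <- (p - Imass %o% Jmass) %*% diag(Jmass^-0.5)
sva <- svd(x)
tau_num_svd <- sum(sva$d^2)

# Direct definition: sum over i, j of p.j * (p_ij / p.j - p_i.)^2
cond <- sweep(p, 2, Jmass, "/")   # column profiles p_ij / p.j
tau_num_direct <- sum(sweep((cond - Imass)^2, 2, Jmass, "*"))
print(c(tau_num_svd, tau_num_direct))

# Tau index and the statistic compared with a chi-squared distribution
tauden <- 1 - sum(Imass^2)
tau <- tau_num_svd / tauden
C_stat <- (n - 1) * (I - 1) * tau
p_value <- 1 - pchisq(C_stat, df = (I - 1) * (J - 1))
print(c(tau = tau, C = C_stat, p.value = p_value))
```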

References

Beh, E.J. (2001) Confidence circles for correspondence analysis using orthogonal polynomials. Journal of Applied Mathematics and Decision Sciences, 5, 35--45.

Beh, E.J. (2010) Elliptical confidence regions for simple correspondence analysis. Journal of Statistical Planning and Inference, 140, 2582--2588.

Beh, E.J. and D’Ambra, L. (2009) Some interpretative tools for non-symmetrical correspondence analysis. Journal of Classification, 26, 55--76.

Beh, E.J. and D’Ambra, L. (2010) Non-symmetrical correspondence analysis with concatenation and linear constraints. The Australian and New Zealand Journal of Statistics, 52, 27--44.

Beh, E.J. and Lombardo, R. (2014) Confidence regions and 𝑝-values for classical and non-symmetric correspondence analysis. Communications in Statistics: Theory and Methods, doi:10.1080/03610926.2013.768665.

Corbellini, A., Riani, M., and Donatini, A. (2008) Multivariate data analysis techniques to detect warnings of elderly frailty. Statistica Applicata, 20, 159--178.

Crisci, A. and D’Ambra, A. (2011) The confidence ellipses in multiple non-symmetrical correspondence analysis for the evaluation of the innovative performance of the manufacturing enterprises in Campania. Statistica & Applicazioni, 9, 175--187.

D’Ambra, A. and Crisci, A. (2014) The confidence ellipses in decomposition multiple non-symmetrical correspondence analysis. Communications in Statistics: Theory and Methods, 43(6), 1209--1221.

D’Ambra, L., D’Ambra, A. and Sarnacchiaro, P. (2012) Visualizing main effects and interaction in multiple non-symmetric correspondence analysis. Journal of Applied Statistics, 30, 2165--2175.

Efron, B. (1979) Bootstrap methods: another look at the jackknife. Annals of Statistics, 7, 1--26.

Gifi, A. (1990) Non-Linear Multivariate Analysis. John Wiley & Sons, Inc., New York.

Goodman, L.A. and Kruskal, W.H. (1954) Measures of association for cross classifications. Journal of the American Statistical Association, 49, 732--764.

Greenacre, M. (1984) Theory and Application of Correspondence Analysis. Academic Press.

Greenacre, M. (2007) Correspondence Analysis in Practice, 2nd edn. Chapman & Hall/CRC.

Griffiths, S. (1997) Long-Term Adjustment after Extremely Challenging Events in a Sample of Vietnamese--Canadian Seniors. M. Arts Thesis, Mount Allison University, Canada.

Lebart, L. (1985) Exploratory analysis of survey data: the role of correspondence analysis (with discussions), in Recent Developments in the Analysis of Large Scale Data Sets (ed. A.Z. Israels), Eurostat, pp. 169--188.


Lebart, L. (2008) Validation techniques in multiple correspondence analysis, in Multiple Correspondence Analysis and Related Methods (eds. M. Greenacre and J. Blasius), pp. 179--195. Chapman & Hall/CRC.

Lebart, L., Morineau, A. and Warwick, K.M. (1984) Multivariate Descriptive Statistical Analysis. John Wiley & Sons, Inc., New York.

Light, R.J. and Margolin, B.H. (1971) An analysis of variance for categorical data. Journal of the American Statistical Association, 66, 534--544.

Linting, M., Meulman, J.J., Groenen, P.J.F., and Van der Kooij, A.J. (2007) Stability of nonlinear principal components analysis: an empirical study using the balanced bootstrap. Psychological Methods, 12(3), 359--379.

Lombardo, R., Beh, E.J. and D’Ambra, L. (2007) Non-symmetric correspondence analysis with ordinal variables. Computational Statistics and Data Analysis, 52, 566--577.

Lombardo, R. and Ringrose, T. (2012) Bootstrap confidence regions in non-symmetrical correspondence analysis. Electronic Journal of Applied Statistical Analysis, 5, 413--417.

Lombardo, R., Ringrose, T. and Beh, E. (2012) Bootstrap confidence regions in classical and ordered multiple correspondence analysis. In Book of Short Paper Analysis and Modeling of Complex Data in Behavioural and Social Sciences -- Cladag2012 Capri (eds. D. Vicari, A. Okada and G. Ragozino), pp. 53--55. Cleup Padova.

Macdonald, P.D.M. (2002) Drawing an ellipse in Splus or R. Available at www.math.mcmaster.ca/peter/s4m03/s4m03 0102/ellipse.html (accessed December 5, 2013).

Mardia, K.V., Kent, J.T., and Bibby, J.M. (1982) Multivariate Analysis. Academic Press.

Markus, M.T. (1994) Bootstrap Confidence Regions in Non-Linear Multivariate Analysis. DSWO Press.

Nicoli, A., Capodanno, F., Valli, B., Di Girolamo, R., Villani, M.T., Nucera, A., Focarelli, R., and La Sala, G.B. (2010) Impact on insemination technique, semen quality and oocyte cryopreservation on pronuclear morphology of zygotes derived from sibling oocytes. Zygote, 18, 61--68.

Nishisato, S. (2007) Multidimensional Nonlinear Descriptive Analysis. Taylor & Francis Group.

Ringrose, T.J. (1992) Bootstrapping and correspondence analysis in archaeology. Journal of Archaeological Science, 19, 615--629.

Ringrose, T.J. (1996) Alternative confidence regions for canonical variate analysis. Biometrika, 83, 575--587.

Ringrose, T.J. (2012) Bootstrap confidence regions for correspondence analysis. Journal of Statistical Computation and Simulation, 83, 1397--1413.

Selikoff, I.J. (1981) Household risks with inorganic fibers. Bulletin of the New York Academy of Medicine, 57, 947--961.

Simonetti, B., D’Ambra, L. and Amenta, R. (2011) New developments in ordinal non symmetrical correspondence analysis. In New Perspectives in Statistical Modeling and Data Analysis (eds. S. Ingrassia, R. Rocci and M. Vichi), pp. 497--504. Springer.

Takane, Y. and Jung, S. (2009) Tests of ignoring and eliminating in nonsymmetric correspondence analysis. Advances in Data Analysis and Classification, 3, 315--340.

Timmerman, M.E., Kiers, H.A.L. and Smilde, A.K. (2007) Estimating confidence intervals for principal component loadings: a comparison between the bootstrap and asymptotic results. British Journal of Mathematical and Statistical Psychology, 60, 295--314.

Van IJzendoorn, M.H. (1995) Adult attachment representations, parental responsiveness, and infant attachment: a meta-analysis on the predictive validity of the adult attachment interview. Psychological Bulletin, 117, 387--403.

Weir, M.D., Hass, J. and Giordano, F.R. (2005) Thomas’ Calculus, 11th edn. Pearson.

Yanao, K., Imai, K., Shimizu, A., and Hanashita, T. (2006) A new method for gene discovery in large-scale microarray data. Nucleic Acids Research, 34, 1532--1539.