
5 Principal Component Analysis

A principal component analysis is designed to reduce p-dimensional observations into s-dimensional components (where usually $s \ll p$). More specifically, a principal component is a linear combination of the original variables, and the goal is to find those s principal components which together explain most of the underlying variance–covariance structure of the p variables. Two methods currently exist for performing principal component analysis on symbolic data, the vertices method covered in Section 5.1 and the centers method treated in Section 5.2. These are for interval-valued data only. No methodology yet exists for other types of symbolic data.

5.1 Vertices Method

Cazes et al. (1997), Chouakria (1998), and Chouakria et al. (1998) developed a method of conducting principal component analysis on symbolic data for which each symbolic variable $Y_j$, $j = 1, \ldots, p$, takes interval values $\xi_{uj} = [a_{uj}, b_{uj}]$, say, for each observation $u = 1, \ldots, m$, and where each observation may represent the aggregation of $n_u$ individuals. Or equivalently, each observation is inherently interval valued. When $a_{uj} < b_{uj}$, we refer to the interval $[a_{uj}, b_{uj}]$ as a non-trivial interval. For each observation $\xi_u = (\xi_{u1}, \ldots, \xi_{up})$, let $q_u$ be the number of non-trivial intervals. Each symbolic data point, $\xi_u = ([a_{u1}, b_{u1}], \ldots, [a_{up}, b_{up}])$, is represented by a hyperrectangle $H_u$ with $m_u = 2^{q_u}$ vertices. When $a_{uj} < b_{uj}$ for all $j = 1, \ldots, p$, then $m_u = 2^p$.

Example 5.1. The first two observations $(w_1, w_2)$ in Table 5.1 are examples of observations with no trivial intervals; these form $p = 3$ dimensional hyperrectangles in $\mathbb{R}^3$ (see Figure 5.1). The observation $w_3$, however, with $a_{33} = b_{33}$, is a two-dimensional hyperrectangle, i.e., it is a plane in $\mathbb{R}^3$ as seen in Figure 5.1. In this case, $q_3 = 2$, and so there are $m_3 = 2^2 = 4$ vertices. Likewise, the observation $w_4$ is a line in $\mathbb{R}^3$ with $q_4 = 1$ and so $m_4 = 2^1 = 2$ vertices, and that of $w_5$ is a point in $\mathbb{R}^3$ with $q_5 = 0$ and so there is $m_5 = 2^0 = 1$ vertex in $\mathbb{R}^3$. □

Symbolic Data Analysis: Conceptual Statistics and Data Mining, L. Billard and E. Diday. © 2006 John Wiley & Sons, Ltd. ISBN: 978-0-470-09016-9

Table 5.1 Trivial and non-trivial intervals.

wu    Y1 = [a1, b1]    Y2 = [a2, b2]    Y3 = [a3, b3]
w1    [1, 3]           [1, 4]           [2, 3]
w2    [2, 4]           [3, 4]           [4, 6]
w3    [1, 3]           [2, 4]           [3, 3]
w4    [3, 5]           [3, 3]           [2, 2]
w5    [3, 3]           [2, 2]           [5, 5]

Figure 5.1 Types of hypercubes: clouds of vertices. [The figure plots the hyperrectangles $H_1, \ldots, H_5$ of Table 5.1 against the $Y_1, Y_2, Y_3$ axes.]


Each hyperrectangle $H_u$ can be represented by a $2^{q_u} \times p$ matrix $M_u$ with each row containing the coordinate values of a vertex $V_{k_u}$, $k_u = 1, \ldots, 2^{q_u}$, of the hyperrectangle. There are

$$n = \sum_{u=1}^{m} m_u = \sum_{u=1}^{m} 2^{q_u} \qquad (5.1)$$

vertices in total. Then, an $(n \times p)$ data matrix $M$ is constructed from the $M_u$, $u = 1, \ldots, m$, according to

$$M = \begin{pmatrix} M_1 \\ \vdots \\ M_m \end{pmatrix} = \begin{pmatrix} \begin{bmatrix} a_{11} & \ldots & a_{1p} \\ & \vdots & \\ b_{11} & \ldots & b_{1p} \end{bmatrix} \\ \vdots \\ \begin{bmatrix} a_{m1} & \ldots & a_{mp} \\ & \vdots & \\ b_{m1} & \ldots & b_{mp} \end{bmatrix} \end{pmatrix}. \qquad (5.2)$$

For example, if $p = 2$, the observation $\xi_u = ([a_{u1}, b_{u1}], [a_{u2}, b_{u2}])$ is transformed to the $2^p \times p = 2^2 \times 2$ matrix

$$M_u = \begin{bmatrix} a_{u1} & a_{u2} \\ a_{u1} & b_{u2} \\ b_{u1} & a_{u2} \\ b_{u1} & b_{u2} \end{bmatrix},$$

and likewise for M.
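The vertex construction of Equation (5.2) is easy to mechanize. A minimal sketch (the helper name `vertex_matrix` is ours, not from the text): each non-trivial interval contributes two coordinate choices and each trivial interval one, so the Cartesian product of the choices enumerates the $2^{q_u}$ rows of $M_u$.

```python
from itertools import product

def vertex_matrix(intervals):
    # Each non-trivial interval [a, b] (a < b) offers two coordinate
    # choices; a trivial interval (a == b) offers one. The Cartesian
    # product of the choices lists the 2**q_u vertices of H_u.
    axes = [(a, b) if a < b else (a,) for a, b in intervals]
    return [list(v) for v in product(*axes)]

# A p = 2 observation xi_u = ([20, 25], [60, 80]):
M_u = vertex_matrix([(20, 25), (60, 80)])
# rows: [20, 60], [20, 80], [25, 60], [25, 80], matching the 2^2 x 2 form above
```

The row order (all $a$'s first, all $b$'s last) matches the layout of Equation (5.2).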

Example 5.2. Consider the artificial data of Table 5.1. We saw in Example 5.1 that the first two $(u = 1, 2)$ observations formed a hyperrectangle in $p = 3$ dimensional space. Therefore, for $u = 1$ and $u = 2$, the $m_u = 2^3 = 8$ vertices are represented by the rows in $M_1$ and $M_2$ given as

$$M_1 = \begin{pmatrix} 1 & 1 & 2 \\ 1 & 1 & 3 \\ 1 & 4 & 2 \\ 1 & 4 & 3 \\ 3 & 1 & 2 \\ 3 & 1 & 3 \\ 3 & 4 & 2 \\ 3 & 4 & 3 \end{pmatrix}, \qquad M_2 = \begin{pmatrix} 2 & 3 & 4 \\ 2 & 3 & 6 \\ 2 & 4 & 4 \\ 2 & 4 & 6 \\ 4 & 3 & 4 \\ 4 & 3 & 6 \\ 4 & 4 & 4 \\ 4 & 4 & 6 \end{pmatrix}.$$


In this case $m_u = 2^3 = 8$, and $M_1$ and $M_2$ are both $8 \times 3$ matrices. However, for $u = 3$, we have a $2^2 \times 3$ matrix since the hyperrectangle $H_3$ associated with $\xi_3$ is a rectangle in $\mathbb{R}^3$, i.e., $m_3 = 2^2 = 4$, and therefore

$$M_3 = \begin{pmatrix} 1 & 2 & 3 \\ 1 & 4 & 3 \\ 3 & 2 & 3 \\ 3 & 4 & 3 \end{pmatrix};$$

for $u = 4$, the matrix $M_4$ is $2^1 \times 3$ since $\xi_4$, represented by $H_4$, is a line in $\mathbb{R}^3$, i.e., $m_4 = 2^1 = 2$, and therefore

$$M_4 = \begin{pmatrix} 3 & 3 & 2 \\ 5 & 3 & 2 \end{pmatrix};$$

and for $u = 5$, $M_5$ is a $1 \times 3$ matrix since $\xi_5$ is a single point in $\mathbb{R}^3$, i.e., $m_5 = 2^0 = 1$, and therefore

$$M_5 = (3 \;\; 2 \;\; 5).$$

Then, the complete set of vertices is represented by the rows of $M$, where, from Equation (5.2),

$$M = \begin{pmatrix} M_1 \\ M_2 \\ M_3 \\ M_4 \\ M_5 \end{pmatrix}$$

and where $M$ is an $n \times p = 23 \times 3$ matrix since here

$$n = \sum_{u=1}^{m} m_u = \sum_{u=1}^{m} 2^{q_u} = 8 + 8 + 4 + 2 + 1 = 23. \qquad \square$$
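As a check on Example 5.2, the same vertex enumeration applied to all five observations of Table 5.1 reproduces the counts $8 + 8 + 4 + 2 + 1 = 23$ (a sketch; the variable names are ours):

```python
from itertools import product

# Intervals (a, b) for Y1, Y2, Y3 taken from Table 5.1.
observations = [
    [(1, 3), (1, 4), (2, 3)],   # w1
    [(2, 4), (3, 4), (4, 6)],   # w2
    [(1, 3), (2, 4), (3, 3)],   # w3
    [(3, 5), (3, 3), (2, 2)],   # w4
    [(3, 3), (2, 2), (5, 5)],   # w5
]

def vertices(obs):
    # 2**q_u vertex rows of H_u; trivial intervals contribute one choice.
    axes = [(a, b) if a < b else (a,) for a, b in obs]
    return [list(v) for v in product(*axes)]

counts = [len(vertices(obs)) for obs in observations]   # [8, 8, 4, 2, 1]
M = [row for obs in observations for row in vertices(obs)]
n = len(M)                                              # 23, as in Equation (5.1)
```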

Example 5.3. Consider the two observations

$$\xi(w_1) = ([20, 25], [60, 80]),$$
$$\xi(w_2) = ([21, 28], [65, 86]).$$

Then,

$$M_1 = \begin{pmatrix} 20 & 60 \\ 20 & 80 \\ 25 & 60 \\ 25 & 80 \end{pmatrix}, \qquad M_2 = \begin{pmatrix} 21 & 65 \\ 21 & 86 \\ 28 & 65 \\ 28 & 86 \end{pmatrix},$$

and hence

$$M' = \begin{pmatrix} M_1 \\ M_2 \end{pmatrix}' = \begin{pmatrix} 20 & 20 & 25 & 25 & 21 & 21 & 28 & 28 \\ 60 & 80 & 60 & 80 & 65 & 86 & 65 & 86 \end{pmatrix}. \qquad \square$$

The matrix $M$ is treated as though it represents classical $p$-variate data for $n$ individuals. Chouakria (1998) has shown that the basic theory for a classical analysis carries through; hence, a classical principal component analysis can be applied.

Briefly, a classical principal component analysis finds linear combinations of the classical variables $X_1, \ldots, X_p$. Thus, the $\nu$th principal component $PC_\nu$, $\nu = 1, \ldots, s$, is

$$PC_\nu = e_{\nu 1} X_1 + \cdots + e_{\nu p} X_p \qquad (5.3)$$

with $\sum_{j=1}^{p} e_{\nu j}^2 = 1$, where $\lambda_\nu$, $\nu = 1, \ldots, p$, are the eigenvalues and $e_\nu = (e_{\nu 1}, \ldots, e_{\nu p})$ is the $\nu$th eigenvector associated with the underlying covariance matrix. The resulting $PC_\nu$, $\nu = 1, \ldots, s$, are uncorrelated. The $\nu$th principal component is that value which maximizes the variance $\mathrm{Var}(PC_\nu)$ subject to $e_\nu' e_\nu = 1$. The total variance underlying the $(X_1, \ldots, X_p)$ values can be shown to be $\lambda = \sum_{\nu=1}^{p} \lambda_\nu$. Hence, the proportion of the total variance that is explained by the $\nu$th principal component is $\lambda_\nu/\lambda$. Often, by standardizing the covariance matrix, the resulting correlation matrix is used instead. A detailed discussion of the application and theory of principal components can be found in most texts devoted to multivariate analysis; see, for example, Anderson (1984) for a theoretical treatise and Johnson and Wichern (1998) for an applied presentation.

Observations can be weighted by a factor $p_u$, say, with $\sum_{u=1}^{m} p_u = 1$. In classical analyses these weights may typically be $p_u = 1/m$. However, for our symbolic data, it may be desirable to have weights that reflect in some way the differences between, for example, $\xi_1 = [3, 7]$ and $\xi_2 = [1, 9]$, i.e., two observations that by spanning different interval lengths have different internal variations. This suggests the following options.

Definition 5.1: The weight associated with an observation $\xi_u$, $u = 1, \ldots, m$, is defined by $p_u$. The vertex weight associated with the vertex $k_u$ of $\xi_u$, $k_u = 1, \ldots, m_u$, is defined by $p_{k_u}^u$, with

$$\sum_{k_u=1}^{m_u} p_{k_u}^u = p_u, \quad u = 1, \ldots, m, \qquad (5.4)$$

and

$$\sum_{u=1}^{m} p_u = 1. \qquad (5.5)$$

There are many possible choices for $p_u$ and $p_{k_u}^u$. The simplest scenarios are when all observations are equally weighted, and all vertices in the cluster of vertices in $H_u$ are equally weighted. In these cases,

$$p_u = 1/m, \quad u = 1, \ldots, m, \qquad (5.6)$$

and

$$p_{k_u}^u = p_u(1/m_u), \quad k_u = 1, \ldots, m_u. \qquad (5.7)$$

Other possible weights for $p_u$ could take into account the 'volume' occupied in $\mathbb{R}^p$ by an observation $\xi_u$, as suggested by the following definitions.

Definition 5.2: The volume associated with an observation $\xi_u$, $u = 1, \ldots, m$, is

$$V(\xi_u) = \prod_{\substack{j=1 \\ b_{uj} \neq a_{uj}}}^{p} (b_{uj} - a_{uj}). \qquad (5.8)$$

Definition 5.3: An observation $\xi_u$, $u = 1, \ldots, m$, is said to be proportionally weighted by volume when

$$p_u = V(\xi_u)/V \qquad (5.9)$$

where $V = \sum_{u=1}^{m} V(\xi_u)$. □

Definition 5.4: An observation $\xi_u$, $u = 1, \ldots, m$, is said to be inversely proportionally weighted by volume when

$$p_u' = \left(1 - V(\xi_u)/V\right) \Big/ \sum_{u=1}^{m} \left(1 - V(\xi_u)/V\right). \qquad (5.10)$$

Clearly, in both Definitions 5.3 and 5.4, $\sum_{u=1}^{m} p_u = 1$ as required. The use of the weight $p_u$ in Equation (5.9) clearly gives larger weight to an observation, e.g., $\xi = [1, 9]$, which embraces more volume in $\mathbb{R}^p$, reflecting the larger internal variation of such an observation over an observation with smaller volume, such as $\xi = [3, 7]$. The use of the $p_u'$ in Equation (5.10) might be preferable when the interval length(s) reflect a measure of uncertainty $\epsilon$, e.g., $\xi = 6 \pm \epsilon$. The larger the degree of uncertainty, the larger the value of $\epsilon$. In this case, however, it would be desirable to give such an observation a lower weight than one for which the degree of uncertainty $\epsilon$ is small.
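Definitions 5.2–5.4 translate directly into code. A sketch under our own naming (note we return 0 for a point observation with no non-trivial interval, matching $V(\xi_5) = 0$ in Example 5.4 below):

```python
import math

def volume(obs):
    # Definition 5.2: product of (b - a) over the non-trivial intervals only.
    sides = [b - a for a, b in obs if b > a]
    return math.prod(sides) if sides else 0     # a point observation gets 0

def volume_weights(observations):
    V = [volume(o) for o in observations]
    total = sum(V)
    p = [v / total for v in V]                  # Definition 5.3: p_u = V(xi_u)/V
    q = [1 - v / total for v in V]
    p_inv = [x / sum(q) for x in q]             # Definition 5.4: inverse weighting
    return p, p_inv

# The two intervals discussed above: xi_1 = [3, 7] and xi_2 = [1, 9].
p, p_inv = volume_weights([[(3, 7)], [(1, 9)]])
# p gives the wider interval [1, 9] the larger weight; p_inv reverses that.
```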


Example 5.4. For the data of Table 5.1, it follows that the volume of each data point is, from Equation (5.8),

$$V(\xi_1) = (3-1)(4-1)(3-1) = 12, \qquad V(\xi_2) = (5-2)(4-3)(6-4) = 6,$$
$$V(\xi_3) = (3-1)(4-2) = 4, \qquad V(\xi_4) = (5-3) = 2, \qquad V(\xi_5) = 0.$$

Then, $\sum_u V(\xi_u) = 24$. Hence the weights $p_u$ are, from Equation (5.9),

$$p_1 = 12/24 = 0.5;$$

likewise,

$$p_2 = 0.25, \quad p_3 = 0.167, \quad p_4 = 0.089, \quad p_5 = 0.$$

Clearly, $\sum_u p_u = 1$. The inverse volume weights for the data of Table 5.1 follow from Equation (5.10) as

$$p_1' = (1 - 12/24)\big/\left[(1 - 12/24) + (1 - 6/24) + (1 - 4/24) + (1 - 2/24) + (1 - 0/24)\right] = (1 - 12/24)/4 = 0.125;$$

likewise,

$$p_2' = 0.188, \quad p_3' = 0.208, \quad p_4' = 0.229, \quad p_5' = 0.250. \qquad \square$$

Weights reflecting the number of observations $n_u$ that made up the symbolic observation $w_u$ could be defined by

$$p_u'' = p_u n_u \Big/ \sum_{u=1}^{m} p_u n_u. \qquad (5.11)$$

Example 5.5. Let us assume the symbolic observations of Table 5.1 were the result of aggregating $n_u = 3, 2, 4, 2, 1$ classical observations for the categories $u = 1, \ldots, 5$, respectively. Then, the weighted volume-based weights $p_u''$ are

$$p_1'' = 3(0.5)\big/\left[3(0.5) + 2(0.25) + 4(0.167) + 2(0.089) + 1(0)\right] = 1.5/2.846 = 0.527;$$

likewise,

$$p_2'' = 0.177, \quad p_3'' = 0.234, \quad p_4'' = 0.062, \quad p_5'' = 0. \qquad \square$$

The weight $p_u$ (and likewise $p_u'$ and $p_u''$) is associated with the observation $\xi_u$. This weight is then distributed across the $m_u = 2^{q_u}$ vertices associated with $\xi_u$. This leads us to the following.


Definition 5.5: Under the assumption that values across the intervals $[a_{uj}, b_{uj}]$, $j = 1, \ldots, p$, $u = 1, \ldots, m$, are uniformly distributed, the weights associated with the observation $w_u$ are equally distributed across its vertices $V_{k_u}$, $k_u = 1, \ldots, 2^{q_u}$, i.e., the weight on $V_{k_u}$ is

$$p_{k_u}^u = p_u/2^{q_u}, \quad k_u = 1, \ldots, 2^{q_u}, \quad u = 1, \ldots, m. \qquad (5.12)$$

Therefore, a classical (weighted or unweighted, as desired) principal component analysis is conducted on the $n$ observations in $M$. Let $Y_1^*, \ldots, Y_s^*$, $s \leq p$, denote the first $s$ principal components with associated eigenvalues $\lambda_1 \geq \cdots \geq \lambda_s \geq 0$ which result from this classical analysis of the $n$ vertices. We then construct the interval vertices principal components $Y_1^V, \ldots, Y_s^V$ as follows.

Let $L_u$ be the set of row indices in $M$ identifying the vertices of the hyperrectangle $H_u$, i.e., $L_u$ represents the rows of $M_u$ describing the symbolic data observation $\xi_u$. For each row $k_u = 1, \ldots, 2^{q_u}$ in $L_u$, let $y_{\nu u k_u}$ be the value of the principal component $Y_\nu^*$, $\nu = 1, \ldots, s$, for that row $k_u$. Then, the $\nu$th interval vertices principal component $Y_{\nu u}^V$ for observation $w_u$ is given by

$$Y_{\nu u}^V = y_{\nu u} = [y_{\nu u}^a, y_{\nu u}^b], \quad \nu = 1, \ldots, s, \qquad (5.13)$$

where

$$y_{\nu u}^a = \min_{k_u \in L_u} y_{\nu u k_u} \quad \text{and} \quad y_{\nu u}^b = \max_{k_u \in L_u} y_{\nu u k_u}. \qquad (5.14)$$
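Putting Equations (5.1)–(5.14) together, the whole vertices method fits in a short routine: stack the vertex rows into $M$, run a classical (standardized) PCA, then take the min/max score over each observation's rows. A sketch with our own function name; it assumes every column of $M$ has positive variance so the standardization is well defined.

```python
import numpy as np
from itertools import product

def vertices_pca(observations, s=2):
    rows, owner = [], []
    for u, obs in enumerate(observations):
        axes = [(a, b) if a < b else (a,) for a, b in obs]
        for v in product(*axes):                      # the 2**q_u vertices of H_u
            rows.append(v)
            owner.append(u)
    M = np.asarray(rows, dtype=float)
    Z = (M - M.mean(axis=0)) / M.std(axis=0, ddof=1)  # standardized vertex matrix
    lam, E = np.linalg.eigh(np.cov(Z, rowvar=False))
    order = np.argsort(lam)[::-1][:s]
    scores = Z @ E[:, order]                          # classical PCs of the n vertices
    # Equations (5.13)-(5.14): interval [min, max] of the scores per observation.
    owner = np.asarray(owner)
    return [list(zip(scores[owner == u].min(axis=0),
                     scores[owner == u].max(axis=0)))
            for u in range(len(observations))]

table_5_1 = [
    [(1, 3), (1, 4), (2, 3)], [(2, 4), (3, 4), (4, 6)],
    [(1, 3), (2, 4), (3, 3)], [(3, 5), (3, 3), (2, 2)],
    [(3, 3), (2, 2), (5, 5)],
]
intervals = vertices_pca(table_5_1, s=2)
```

A point observation such as $w_5$ has a single vertex, so its principal component intervals are degenerate (min equals max), consistent with $m_5 = 1$.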

Then, plotting these values against each principal component axis $PC_\nu$ gives us the data represented as hyperrectangles of principal components in $s$-dimensional space. For example, take the $p = 3$ dimensional hyperrectangle for a single observation $\xi_u$, represented by $H_u$ in Figure 5.2 (adapted from Chouakria, 1998). Figure 5.2 contains $s = 3$ axes corresponding to the first, second, and third principal components calculated from Equations (5.13)–(5.14) on all $m$ observations, using all $n$ vertices. The projections of points in the hyperrectangle $H_u$ onto the first and second principal component plane, and also onto the second and third principal component plane, are displayed.

It is easily verified that the rectangle formed by the two interval-valued principal components constitutes a maximal envelope of the projection points from $H_u$. Therefore, it follows that every point in the hypercube $H_u$, when projected onto the plane, lies in this envelope. However, depending on the actual value of $H_u$, there can be some (exterior) points within the envelope that may not be projections of points in $H_u$.

Equations (5.13)–(5.14) can be verified as follows. Take any point $\tilde{x}_u$ with $\tilde{x}_{uj} \in [a_{uj}, b_{uj}]$. Then, the $\nu$th principal component associated with this $\tilde{x}_u$ is

$$\widetilde{PC}_\nu = \sum_{j=1}^{p} e_{\nu j}(\tilde{x}_{uj} - \bar{X}_j^*).$$


Figure 5.2 Projection of the hypercube $H_u$ onto the principal component $(\nu = 1, 2, 3)$ axes. [The figure shows $H_u$ in the $(Y_1, Y_2, Y_3)$ space with the interval endpoints $y_{\nu u}^a$, $y_{\nu u}^b$ marked on the $PC_1$, $PC_2$, $PC_3$ axes.]

It follows that

$$\sum_{j=1}^{p} e_{\nu j}(\tilde{x}_{uj} - \bar{X}_j^*) \geq \sum_{j \in J^+} e_{\nu j}(a_{uj} - \bar{X}_j^*) + \sum_{j \in J^-} e_{\nu j}(b_{uj} - \bar{X}_j^*) \qquad (5.15)$$

and

$$\sum_{j=1}^{p} e_{\nu j}(\tilde{x}_{uj} - \bar{X}_j^*) \leq \sum_{j \in J^-} e_{\nu j}(a_{uj} - \bar{X}_j^*) + \sum_{j \in J^+} e_{\nu j}(b_{uj} - \bar{X}_j^*) \qquad (5.16)$$

where $J^+ = \{j: e_{\nu j} > 0\}$ and $J^- = \{j: e_{\nu j} < 0\}$. However, by definition, the right-hand side of Equation (5.15) is

$$\min_{k_u \in L_u} y_{\nu u k_u} = y_{\nu u}^a$$

and the right-hand side of Equation (5.16) is

$$\max_{k_u \in L_u} y_{\nu u k_u} = y_{\nu u}^b.$$

Hence, for all $j = 1, \ldots, p$,

$$\widetilde{PC}_\nu \in [y_{\nu u}^a, y_{\nu u}^b],$$

and so $Y_{\nu u}^V$ as in Equations (5.13)–(5.14) holds for all $\tilde{x}_u$ with $\tilde{x}_{uj} \in [a_{uj}, b_{uj}]$.
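The argument behind (5.15)–(5.16), that the extreme scores over the $2^p$ vertices are attained by taking $a_{uj}$ or $b_{uj}$ according to the sign of $e_{\nu j}$, can be checked numerically (a throwaway sketch with randomly generated intervals and an arbitrary direction $e$):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
p = 5
a = rng.uniform(0, 1, size=p)
b = a + rng.uniform(0.1, 1.0, size=p)          # intervals [a_j, b_j]
e = rng.normal(size=p)                          # loadings e_nu,j (only signs matter)
xbar = rng.normal(size=p)                       # the centering values X*_j

# Brute force: score every vertex of the hyperrectangle.
scores = [sum(e[j] * (v[j] - xbar[j]) for j in range(p))
          for v in product(*zip(a, b))]

# Closed form from (5.15)-(5.16): j in J+ takes a_j at the minimum and
# b_j at the maximum; j in J- does the opposite.
lo = sum(e[j] * ((a[j] if e[j] > 0 else b[j]) - xbar[j]) for j in range(p))
hi = sum(e[j] * ((b[j] if e[j] > 0 else a[j]) - xbar[j]) for j in range(p))
```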


Example 5.6. Table 5.2 provides data relating to activity for $m = 14$ business sectors of the financial community for five random variables related to $Y_1$ = Cost of a specific activity (Job Cost), $Y_2$ = Job Code, $Y_3$ = Activity Code, $Y_4$ = Monthly Cost, and $Y_5$ = Annual Budgeted Cost. Here, the 'sector' is the concept. These interval values were obtained by aggregating values for specific companies within the respective industry sectors as described in Table 5.2, column (a). Column (g) of Table 5.2 gives the number of companies $n_u$ so aggregated in each case. Thus, for the automotive sector, for example, there were $n_2 = 48$ specific values which when aggregated gave the interval-valued observations displayed. These data are after Vu et al. (2003).

Suppose we perform a principal component analysis on the first three variables $Y_1, Y_2, Y_3$ only. We first determine the matrices $M_u$, $u = 1, \ldots, 14$. For these data, the $M_u$ are each taken to be of size $2^p \times p = 8 \times 3$ (for the few trivial intervals in Table 5.2, the corresponding vertex coordinates simply repeat). Thus, we see that when $u = 1$, we have from Equation (5.2),

$$M_1 = \begin{pmatrix} 1168 & 1 & 1 \\ 1168 & 1 & 20 \\ 1168 & 7 & 1 \\ 1168 & 7 & 20 \\ 4139 & 1 & 1 \\ 4139 & 1 & 20 \\ 4139 & 7 & 1 \\ 4139 & 7 & 20 \end{pmatrix}.$$

Likewise, we find $M_u$ for all $u$; and hence we find the matrix $M$, which is of size $(m 2^p) \times p = 112 \times 3$. Each row of $M$ represents a classical vertex value in the set of $(m = 14)$ hypercubes.

A classical principal component analysis on these $n = 112$ classical values is then performed in the usual manner. We denote the $\nu$th principal component for the vertex $k_u$ of the observation $w_u$ by $PC_\nu = y_{\nu u k_u}$, $\nu = 1, \ldots, s$, $u = 1, \ldots, m$, $k_u = 1, \ldots, 8$. Since we ultimately want to find the interval vertices principal components for each observation $\xi_u$ through Equations (5.13) and (5.14), it is advantageous to sort the classical analysis output by $u$ at a minimum, and preferably by both $u$ and the $\nu$th principal component value. The results when sorted by $u$ and the first principal component $(\nu = 1)$ are displayed for the FMCG ($w_1$) and durable goods ($w_{13}$) sectors in Table 5.3. Thus, the rows $i = 1, \ldots, 8$ correspond to the $(k_1 = 1, \ldots, 8)$ vertices of $\xi_1$ $(u = 1)$, the rows $i = 97, \ldots, 104$ correspond to the $(k_{13} = 1, \ldots, 8)$ vertices of $\xi_{13}$ $(u = 13)$, and so on.

Then, to find the first vertices principal component interval $Y_{11}^V$, i.e., $\nu = 1$, we take Table 5.3 and Equation (5.14), which give

$$y_{11}^a = \min_{k_1 \in L_1} y_{11k_1} = -1.535$$


Table 5.2 Finance data.

wu    (a) Sector           (b) Job Cost    (c) Job Code   (d) Activity Code   (e) Monthly Cost   (f) Annual Budget   (g) nu
                           [a1, b1]        [a2, b2]       [a3, b3]            [a4, b4]           [a5, b5]
w1    FMCG                 [1168, 4139]    [1, 7]         [1, 20]             [650, 495310]      [1000, 1212300]     605
w2    Automotive           [2175, 4064]    [1, 6]         [1, 20]             [400, 377168]      [400, 381648]        48
w3    Telecommunications   [2001, 4093]    [1, 7]         [1, 20]             [1290, 250836]     [1290, 500836]       63
w4    Publishing           [2465, 3694]    [1, 3]         [4, 20]             [5000, 42817]      [5000, 79367]        12
w5    Finance              [2483, 4047]    [1, 7]         [2, 20]             [6500, 248629]     [6500, 313501]       40
w6    Consultants          [2484, 4089]    [1, 6]         [9, 20]             [4510, 55300]      [4510, 74400]        15
w7    Consumer goods       [2532, 4073]    [1, 6]         [2, 20]             [2900, 77500]      [2901, 77500]        34
w8    Energy               [2542, 3685]    [1, 6]         [2, 20]             [2930, 54350]      [2930, 54350]        17
w9    Pharmaceutical       [2547, 3688]    [1, 7]         [1, 20]             [1350, 49305]      [12450, 1274700]     31
w10   Tourism              [2604, 3690]    [1, 5]         [9, 20]             [1600, 31700]      [1600, 31700]        13
w11   Textiles             [2697, 4012]    [1, 5]         [9, 9]              [12800, 54850]     [12800, 94000]       15
w12   Services             [3481, 4058]    [1, 1]         [9, 9]              [8400, 31500]      [8400, 41700]         3
w13   Durable goods        [2726, 4068]    [1, 5]         [10, 20]            [5800, 28300]      [3700, 55400]         3
w14   Others               [3042, 4137]    [1, 1]         [1, 20]             [458, 19400]       [458, 19400]         34


Table 5.3 Classical principal components, u = 1, 13.

i     ku   u    Y1     Y2   Y3   PC1      PC2      PC3
1     1    1    4139   1    1    −1.535   −1.064    0.253
2     2    1    4139   1    20   −1.177    1.275    0.131
3     3    1    4139   7    1     0.137   −1.22     1.943
4     4    1    4139   7    20    0.496    1.107    1.820
5     5    1    1168   1    1     0.959   −1.579   −2.266
6     6    1    1168   1    20    1.318    0.756   −2.389
7     7    1    1168   7    1     2.632   −1.747   −0.577
8     8    1    1168   7    20    2.990    0.592   −0.699
⋮
97    1    13   4068   1    10   −1.306    0.031    0.135
98    2    13   4068   1    20   −1.117    1.262    0.070
99    3    13   4068   5    10   −0.191   −0.080    1.262
100   4    13   2726   1    10   −0.179   −0.201   −1.003
101   5    13   4068   5    20   −0.002    1.151    1.197
102   6    13   2726   1    20    0.010    1.030   −1.068
103   7    13   2726   5    10    0.936   −0.313    0.123
104   8    13   2726   5    20    1.125    0.918    0.059
⋮

and

$$y_{11}^b = \max_{k_1 \in L_1} y_{11k_1} = 2.990$$

where $L_1 = \{1, \ldots, 8\}$; hence, from Equation (5.13), the first vertices principal component interval $Y_{11}^V$ for the $w_1$ observation, i.e., for the FMCG sector, is

$$Y_{11}^V = y_{11} = [-1.535, 2.990].$$

Similarly, for the durable goods sector, i.e., for the $w_{13}$ observation $\xi_{13}$, we apply Equation (5.14) to Table 5.3 to obtain

$$y_{1,13}^a = \min_{k_{13} \in L_{13}} y_{1,13,k_{13}} = -1.306, \qquad y_{1,13}^b = \max_{k_{13} \in L_{13}} y_{1,13,k_{13}} = 1.125,$$

where $L_{13} = \{97, \ldots, 104\}$; hence, from Equation (5.13), the first vertices principal component interval $Y_{1,13}^V$ for the durable goods sector is

$$Y_{1,13}^V = y_{1,13} = [-1.306, 1.125].$$


The complete set of the first, second, and third vertices principal component intervals based on the $Y_1, Y_2, Y_3$ variables for all observations is provided in Table 5.4.

Table 5.4 Vertices principal components – finance data (based on Y1, Y2, Y3).

wu    Sector               PC1 = Y^V_1u        PC2 = Y^V_2u        PC3 = Y^V_3u
w1    FMCG                 [−1.535, 2.990]     [−1.747, 1.275]     [−2.389, 1.943]
w2    Automotive           [−1.473, 1.866]     [−1.544, 1.262]     [−1.535, 1.598]
w3    Telecommunications   [−1.497, 2.291]     [−1.602, 1.267]     [−1.683, 1.904]
w4    Publishing           [−1.105, 0.786]     [−1.041, 1.197]     [−1.289, 0.420]
w5    Finance              [−1.439, 1.886]     [−1.396, 1.259]     [−1.274, 1.859]
w6    Consultants          [−1.343, 1.606]     [−0.506, 1.266]     [−1.273, 1.568]
w7    Consumer goods       [−1.461, 1.566]     [−1.359, 1.263]     [−1.232, 1.599]
w8    Energy               [−1.135, 1.558]     [−1.358, 1.196]     [−1.224, 1.270]
w9    Pharmaceutical       [−1.157, 1.832]     [−1.508, 1.196]     [−1.220, 1.561]
w10   Tourism              [−1.007, 1.227]     [−0.457, 1.197]     [−1.171, 0.947]
w11   Textiles             [−1.278, 0.941]     [−0.441, −0.101]    [−1.021, 1.221]
w12   Services             [−1.316, −0.832]    [−0.193, −0.093]    [−0.356, 0.133]
w13   Durable goods        [−1.306, 1.125]     [−0.313, 1.262]     [−1.068, 1.262]
w14   Others               [−1.534, −0.256]    [−1.254, 1.274]     [−0.800, 0.252]

A plot of the first and second vertices principal component intervals, i.e., of the $Y_{1u}^V$ and $Y_{2u}^V$, is shown in Figure 5.3. Four clusters emerge. The first contains the observation $w_1$ corresponding to the FMCG sector. The second contains the automotive, telecommunications, finance, consumer goods, and energy sectors (i.e., $w_2, w_3, w_5, w_7, w_8$). In fact, the $PC_1$ and $PC_2$ principal components for the consumer goods $(w_7)$ and energy $(w_8)$ sectors are almost identical, as are those for the automotive $(w_2)$ and finance $(w_5)$ sectors. The remaining observations $(w_4, w_6, w_9, w_{10}, w_{11}, w_{12}, w_{13})$ form a third cluster. However, it may be advantageous to identify a fourth cluster for the textile and services sectors corresponding to the $w_{11}, w_{12}$ observations, with the third cluster consisting of the remaining five sectors. These clusters are nested clusters in that the first cluster spans the other clusters through the $w_1$ observation for the FMCG sector. The data values for this sector were the result of over 600 observations and have large internal variation. Therefore, it is not surprising that the related principal component takes this form. In contrast, the observations in the fourth cluster, the textile and services sectors, have much smaller internal variation across the three variables $(Y_1, Y_2, Y_3)$ collectively, and this is reflected in their comparably clustered principal components.

Figure 5.3 Vertices principal components: finance data.

Figure 5.4 shows the corresponding plot for the classical analysis of the 112 vertices values. Three distinct clusters are apparent. First, recall that here each symbolic observation is represented by its eight vertex values. Thus, the middle cluster of Figure 5.4 essentially contains the vertex values of the $w_{11}$ and $w_{12}$ symbolic values of the inner fourth cluster, along with the lower vertex values for the $w_u$ for $u = 6, 10, 13$ observations. The top outer cluster of Figure 5.4 contains the respective upper vertex values of all except the $w_{11}$ and $w_{12}$ observations of the first three clusters of Figure 5.3, while the bottom outer cluster of Figure 5.4 contains the lower vertex values of all but the $w_u$ for $u = 6, 10, 11, 12, 13$ observations.

We can also find the percentages of the overall variation, $\lambda_\nu/\lambda$, accounted for by the $\nu$th principal component. For this analysis of these data, the first vertices principal component accounts for 36.3% of the variation, while the first two vertices principal components together account for 69.6% of the total variation. □

Figure 5.4 Classical principal components: vertices of finance data.

As discussed earlier, observations can be weighted by $p_u \geq 0$. Then, in the transformation of the symbolic observations to the matrix $M$ of classical observations in Example 5.6, each vertex of the hyperrectangle $H_u$ in effect has the same weight $(p_u 2^{-p})$. Weights can be arbitrarily assigned, or can take many forms. Let us consider an analysis where weights are proportional to the number of individual observations $n_u$ that were aggregated to produce the symbolic observation $\xi_u$, $u = 1, \ldots, m$.
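A weighted classical PCA of the vertex matrix can be sketched as follows (our own convention: row weights $w_i \geq 0$ summing to 1, a weighted mean and covariance, then standardization to a correlation matrix as in the unweighted analysis; actual software packages may weight differently):

```python
import numpy as np

def weighted_pca(M, w):
    M = np.asarray(M, dtype=float)
    w = np.asarray(w, dtype=float)               # vertex weights, sum(w) == 1
    mean = w @ M                                  # weighted column means
    Zc = M - mean
    cov = (Zc * w[:, None]).T @ Zc                # weighted covariance matrix
    d = np.sqrt(np.diag(cov))
    corr = cov / np.outer(d, d)                   # standardized, as in the text
    lam, E = np.linalg.eigh(corr)
    order = np.argsort(lam)[::-1]
    return lam[order], E[:, order], (Zc / d) @ E[:, order]

# With equal weights this reduces to the unweighted (correlation-based) analysis.
M = [[1, 2], [3, 1], [2, 4], [4, 3]]
lam, E, scores = weighted_pca(M, [0.25] * 4)
```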

Example 5.7. Suppose now that instead of the unweighted principal component analysis of Example 5.6, we perform a weighted analysis. Let us do this for the same data of Table 5.2, and let us assume that the weights are proportional to the number of specific companies $n_u$ that generated the symbolic values; see Table 5.2, column (g). That is, we use the weights of Equation (5.11).

We perform the weighted classical analysis on the vertices of matrix $M$, and again sort the output by $u$ and by the first principal component values. The results for the FMCG sector $(w_1)$, i.e., for $k_1 = 1, \ldots, 8$, are shown in Table 5.5. From this table and Equation (5.14), we obtain for the first $(\nu = 1)$ principal component

$$y_{11}^a = \min_{k_1 \in L_1} y_{11k_1} = -0.223, \qquad y_{11}^b = \max_{k_1 \in L_1} y_{11k_1} = 0.183;$$

hence the first weighted principal component interval for the FMCG sector $(w_1)$ is

$$Y_{11}^V = y_{11} = [-0.223, 0.183].$$

The respective first, second, and third weighted vertices principal components for all $w_u$, $u = 1, \ldots, 14$, are displayed in Table 5.6. For this analysis, 34.5% of the variation is explained by the first principal component and 67.8% by the first two principal components. Figure 5.5 shows the plots of the first and second weighted vertices principal components, i.e., the $Y_{1u}^V$ and $Y_{2u}^V$. The same clusters as were observed in Example 5.6 and Figure 5.3 emerge. □

Table 5.5 Weighted principal components, u = 1.

k    u    Y1     Y2   Y3   PC1        PC2        PC3
1    1    1168   7    1    −0.22337   −0.09465   −0.00915
2    1    1168   7    20   −0.18499    0.15218   −0.02026
3    1    1168   1    1    −0.04812   −0.12987   −0.18630
4    1    4139   7    1    −0.03063   −0.11584    0.18574
5    1    1168   1    20   −0.00974    0.11696   −0.19741
6    1    4139   7    20    0.00775    0.13099    0.17463
7    1    4139   1    1     0.14462   −0.15106    0.00859
8    1    4139   1    20    0.18300    0.09577   −0.00252
⋮

Table 5.6 Weighted vertices principal components – finance data (based on Y1, Y2, Y3).

wu    Sector               PC1 = Y^V_1u        PC2 = Y^V_2u        PC3 = Y^V_3u
w1    FMCG                 [−0.223, 0.183]     [−0.151, 0.152]     [−0.197, 0.186]
w2    Automotive           [−0.129, 0.178]     [−0.151, 0.139]     [−0.131, 0.151]
w3    Telecommunications   [−0.169, 0.180]     [−0.151, 0.146]     [−0.143, 0.183]
w4    Publishing           [−0.016, 0.154]     [−0.109, 0.119]     [−0.112, 0.037]
w5    Finance              [−0.136, 0.177]     [−0.137, 0.143]     [−0.111, 0.179]
w6    Consultants          [−0.093, 0.180]     [−0.047, 0.137]     [−0.111, 0.148]
w7    Consumer goods       [−0.104, 0.179]     [−0.138, 0.137]     [−0.108, 0.151]
w8    Energy               [−0.103, 0.154]     [−0.135, 0.137]     [−0.107, 0.126]
w9    Pharmaceutical       [−0.134, 0.154]     [−0.148, 0.142]     [−0.107, 0.156]
w10   Tourism              [−0.056, 0.154]     [−0.044, 0.130]     [−0.103, 0.093]
w11   Textiles             [−0.050, 0.153]     [−0.046, −0.013]    [−0.091, 0.114]
w12   Services             [0.118, 0.156]      [−0.047, −0.042]    [−0.039, −0.001]
w13   Durable goods        [−0.046, 0.178]     [−0.034, 0.129]     [−0.095, 0.117]
w14   Others               [0.073, 0.183]      [−0.151, 0.104]     [−0.074, 0.008]

When the number of variables $p$ is large, it can be difficult to visualize features in the data. The reduction to $s < p$ principal components is one avenue that assists this visualization. Further interpretations can be obtained by looking at associations between the observations and the principal components. Interpretations of the parameters underlying classical analyses can be extended to the symbolic analysis. Typically, quality measures of the principal components are represented by their contribution functions and variances.


Figure 5.5 Weighted vertices principal components: finance data.

Definition 5.6: The relative contribution between a given principal component $PC_\nu$ and an observation $\xi_u$, represented here by its observed hypercube $H_u$, can be measured by

$$C_{\nu u} = \mathrm{Con}_1(PC_\nu, H_u) = \frac{\sum_{k_u=1}^{m_u} p_{k_u}^u\, y_{\nu u k_u}^2}{\sum_{k_u=1}^{m_u} p_{k_u}^u\, [d(k_u, G)]^2} \qquad (5.17)$$

where $y_{\nu u k_u}$ is the $\nu$th principal component for the vertex $k_u$ of $H_u$ (see Equation (5.14)), $p_{k_u}^u$ is the weight of that vertex (see Definition 5.1), and where $d(k_u, G)$ is the Euclidean distance between the row $k_u$ of $X$ and $G$, defined as the centroid of all $n$ rows of $X$, given by

$$d^2(k_u, G) = \sum_{j=1}^{p} \left( \frac{X_{k_u j} - \bar{X}_{Gj}}{S_{Gj}} \right)^2 \qquad (5.18)$$

where

$$\bar{X}_{Gj} = \frac{1}{n} \sum_{u=1}^{m} \sum_{k_u=1}^{m_u} X_{k_u j}$$

with $X_{k_u j}$ being the value of the $Y_j$ variable for the vertex $k_u$ of $H_u$, and where

$$S_{Gj}^2 = \frac{1}{n-1} \sum_{u=1}^{m} \sum_{k_u=1}^{m_u} (X_{k_u j} - \bar{X}_{Gj})^2.$$

An alternative measure of the relationship between $PC_\nu$ and $H_u$ is the contribution function

$$C_{\nu u}^* = \mathrm{Con}_2(PC_\nu, H_u) = \frac{p_{k_u}^u}{p_u} \sum_{k_u=1}^{m_u} \frac{y_{\nu u k_u}^2}{[d(k_u, G)]^2}.$$

The first function $C_{\nu u}$ identifies the relative contribution of all the vertices of the hyperrectangle $H_u$ to the variance $\lambda_\nu$ of the $\nu$th principal component, while the second function $C_{\nu u}^*$ identifies the average squared cosine of the angles between these vertices and the axis of the $\nu$th principal component.

Example 5.8. Let us calculate the relative contribution $C_{\nu u}$ for the financial data of Table 5.2. We take $u = 1$ and $\nu = 1$. The principal component outcomes for each vertex are as in Table 5.3.

For these data,

$$\bar{X}_G = (\bar{X}_{G1}, \bar{X}_{G2}, \bar{X}_{G3}) = (3231.57, 3.07, 11.39)$$

and

$$S_G = (S_{G1}, S_{G2}, S_{G3}) = (829.17, 2.52, 8.02).$$

Therefore, for the first vertex $(k_1 = 1)$, the squared Euclidean distance is, from Equation (5.18),

$$d^2(k_1 = 1, G) = \left( \frac{1168 - 3231.57}{829.17} \right)^2 + \left( \frac{1 - 3.07}{2.52} \right)^2 + \left( \frac{1 - 11.39}{8.02} \right)^2 = 8.547,$$

and for the last vertex $k_1 = 8$, this distance is

$$d^2(k_1 = 8, G) = \left( \frac{4139 - 3231.57}{829.17} \right)^2 + \left( \frac{7 - 3.07}{2.52} \right)^2 + \left( \frac{20 - 11.39}{8.02} \right)^2 = 4.782.$$

Hence, by summing over all vertices in this hyperrectangle $H_1$, we obtain from Equation (5.17),

$$C_{11} = \mathrm{Con}_1(PC_1, H_1) = \frac{(0.959)^2 + \cdots + (0.496)^2}{(8.547) + \cdots + (4.782)} = 0.422.$$


Note that the weights $p_{k_u}^u$ cancel out here, as they take the same value for all $k_u$.

The complete set of these contributions between each observation $\xi_u$ $(H_u)$ and the $\nu$th vertices principal component, $u = 1, \ldots, 14$, $\nu = 1, 2, 3$, is displayed in Table 5.7.

Table 5.7 Contributions and inertia (Hu, PCν).

                            Contributions              Inertia                  Total
wu    Sector               PC1    PC2    PC3          PC1    PC2    PC3        I
w1    FMCG                 0.422  0.224  0.353        0.185  0.107  0.184      0.159
w2    Automotive           0.316  0.379  0.305        0.078  0.102  0.089      0.089
w3    Telecommunications   0.360  0.311  0.329        0.109  0.103  0.119      0.110
w4    Publishing           0.204  0.514  0.282        0.026  0.071  0.042      0.046
w5    Finance              0.326  0.339  0.335        0.079  0.090  0.097      0.088
w6    Consultants          0.376  0.243  0.381        0.063  0.045  0.077      0.061
w7    Consumer goods       0.297  0.398  0.305        0.061  0.090  0.075      0.075
w8    Energy               0.285  0.450  0.265        0.052  0.089  0.057      0.066
w9    Pharmaceutical       0.310  0.407  0.283        0.070  0.101  0.077      0.082
w10   Tourism              0.320  0.359  0.321        0.035  0.043  0.043      0.040
w11   Textiles             0.469  0.065  0.465        0.042  0.006  0.050      0.033
w12   Services             0.927  0.018  0.055        0.079  0.002  0.006      0.031
w13   Durable goods        0.337  0.324  0.340        0.042  0.044  0.051      0.046
w14   Others               0.385  0.507  0.108        0.068  0.099  0.023      0.065

Observe that for the services sector $(u = 12)$, this contribution with the first (second) principal component is high (low) for $\nu = 1$ $(\nu = 2)$, with

$$C_{1,12} = \mathrm{Con}_1(PC_1, H_{12}) = 0.927, \qquad C_{2,12} = \mathrm{Con}_1(PC_2, H_{12}) = 0.018,$$

respectively, whereas for the other sectors $(u \neq 12)$, the differences are much less dramatic. These contributions are confirmed by the plots of Figure 5.3, where the plot corresponding to $u = 12$ is a thin rectangle largely confined to the first principal component axis, while the plots for $u \neq 12$ are squarer, reflecting the relative values of these functions when $\nu = 1, 2$. □
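With equal vertex weights, Equation (5.17) reduces to a ratio of two sums, as in Example 5.8. A sketch (the names are ours; `scores_u` holds the $PC_\nu$ values of the $m_u$ vertices of $H_u$, and the distances use the standardized centroid of the full vertex matrix):

```python
import numpy as np

def relative_contribution(scores_u, M, rows_u):
    M = np.asarray(M, dtype=float)
    Gbar = M.mean(axis=0)                       # centroid of all n vertex rows
    S = M.std(axis=0, ddof=1)
    # d^2(k_u, G) of Equation (5.18), one value per vertex of H_u
    d2 = (((M[rows_u] - Gbar) / S) ** 2).sum(axis=1)
    # Equal vertex weights cancel between numerator and denominator.
    return float((np.asarray(scores_u, dtype=float) ** 2).sum() / d2.sum())

# Tiny check on a 2 x 2 configuration where the ratio is computable by hand.
M = [[0, 0], [2, 0], [0, 2], [2, 2]]
c = relative_contribution([-1.0, 1.0], M, [0, 3])   # = 2/3
```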

Definition 5.7: A correlation measure between the $\nu$th principal component $PC_\nu$ and the random variable $X_j$ is

$$C_{\nu j} = \mathrm{Cor}(PC_\nu, X_j) = e_{\nu j}\sqrt{\lambda_\nu/\sigma_j^2} \qquad (5.19)$$

where $e_{\nu j}$ is the $j$th component of the $\nu$th eigenvector $e_\nu$ associated with $X_j$ (see Equation (5.3)), where

$$\lambda_\nu = \mathrm{Var}(PC_\nu) \qquad (5.20)$$

is the $\nu$th eigenvalue, and where $\sigma_j^2$ is the variance of $X_j$. Note that when the variance–covariance matrix is standardized, these reduce to $\sigma_j^2 = 1$. □


Example 5.9. Take the analysis of the financial data of Table 5.2. We obtain the eigenvalues

$$\lambda = (1.090, 0.998, 0.912)$$

and the eigenvectors

$$e_1 = (-0.696, 0.702, 0.151), \quad e_2 = (0.144, -0.070, 0.987), \quad e_3 = (0.703, 0.709, -0.052).$$

Substituting the relevant $\lambda_\ell$ and $e_{\ell j}$ values into Equation (5.19) gives the correlations $C_{\ell j}$, j = 1, ..., 3, ℓ = 1, ..., 3. For example, the correlation between the first random variable Y1 (Job Cost) and the first vertex principal component PC1 (ℓ = 1) is, with $\sigma_1^2 = 1$,

$$C_{11} = \text{Cor}(PC_1, Y_1) = e_{11}\sqrt{\lambda_1} = (-0.696)\sqrt{1.090} = -0.727.$$

The complete set is given in Table 5.8. From these we observe, for example, that Job Cost (Y1) and Job Code (Y2), but not Activity (Y3), are strongly correlated with the first principal component, while Activity (Y3) is strongly related to the second principal component. □

Table 5.8 Correlations Cor(PCℓ, Yj).

Variable          PC1       PC2       PC3

Y1 Job Cost      −0.727     0.144     0.672
Y2 Job Code       0.732    −0.070     0.677
Y3 Activity       0.158     0.986    −0.050
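These correlations are easy to recompute. The sketch below (our own illustration, not code from the text) applies Equation (5.19) to the eigenvalues and eigenvectors quoted in Example 5.9; with standardized variables, every σj = 1.

```python
import numpy as np

# Equation (5.19): C_{lj} = e_{lj} * sqrt(lambda_l) / sigma_j.
# Eigenpairs are those quoted in Example 5.9 (standardized data).
lam = np.array([1.090, 0.998, 0.912])           # eigenvalues lambda_l
E = np.array([[-0.696,  0.702,  0.151],         # eigenvector e_1
              [ 0.144, -0.070,  0.987],         # eigenvector e_2
              [ 0.703,  0.709, -0.052]])        # eigenvector e_3
sigma = np.ones(3)                              # standardized: sigma_j = 1

C = E * np.sqrt(lam)[:, None] / sigma[None, :]  # C[l, j] = Cor(PC_l, X_j)
print(np.round(C[0, 0], 3))                     # -0.727, matching Table 5.8
```

Small discrepancies in the third decimal against Table 5.8 are rounding effects, since the quoted eigenpairs are themselves rounded.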

Definition 5.8: The absolute contribution of a single observation, through the vertices of Hu, to the variance $\lambda_\ell$ is measured by the inertia

$$I_{\ell u} = \text{Inertia}(PC_\ell, H_u) = \left[\sum_{k_u=1}^{m_u} p_{uk_u}\, y_{\ell uk_u}^2\right]\Big/\lambda_\ell \qquad (5.21)$$

and the contribution of this observation to the total variance is

$$I_u = \text{Inertia}(H_u) = \left\{\sum_{k_u=1}^{m_u} p_{uk_u}\left[d(k_u, G)\right]^2\right\}\Big/I_T \qquad (5.22)$$

where $I_T = \sum_{\ell=1}^{p} \lambda_\ell$ is the total variance of all the vertices in $\mathbb{R}^p$. It is easily verified that

$$\sum_{u=1}^{m} I_{\ell u} = 1, \qquad \sum_{u=1}^{m} I_u = 1. \qquad (5.23)$$

□


Example 5.10. The proportions of the variance of the ℓth principal component (i.e., of $\lambda_\ell$) that are attributed to each of the observations ξu (through its Hu) for the financial data of Table 5.2 are shown in Table 5.7. For example, for the FMCG sector (w1) and PC1 (ℓ = 1), Equation (5.21) becomes

$$I_{11} = \text{Inertia}(PC_1, H_1) = \left(\frac{1}{14\times 3}\right)\left[(0.959)^2 + \cdots + (0.496)^2\right]\big/(1.090) = 0.185.$$

These results confirm what was observed in the plots of Figure 5.3: the FMCG sector is responsible for a larger share of the overall variance than are the other sectors. □
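The contribution and inertia bookkeeping above can be checked numerically. The sketch below is illustrative only — it uses a synthetic vertex cloud (the raw financial vertices are not reproduced here) with uniform weights, and verifies two accounting identities that follow from Equations (5.17) and (5.21): the contributions Con(PCℓ, Hu) sum to 1 over ℓ for each observation, and the inertias Iℓu sum to 1 over u for each ℓ.

```python
import numpy as np

# Synthetic vertex cloud: 3 observations with 8 vertices each, p = 3.
rng = np.random.default_rng(0)
X = rng.normal(size=(24, 3))
groups = np.repeat(np.arange(3), 8)    # which observation each vertex belongs to
w = np.full(len(X), 1.0 / len(X))      # uniform vertex weights p_{uk_u}

Xc = X - (w[:, None] * X).sum(axis=0)  # center at the weighted centroid G
S = (w[:, None] * Xc).T @ Xc           # weighted covariance matrix (weights sum to 1)
lam, V = np.linalg.eigh(S)
lam, V = lam[::-1], V[:, ::-1]         # eigenvalues/vectors in descending order
Y = Xc @ V                             # vertex PC scores y_{l,u,k_u}
d2 = (Xc ** 2).sum(axis=1)             # squared distances d^2(k_u, G)

for u in range(3):
    g = groups == u
    con = (Y[g] ** 2).sum(axis=0) / d2[g].sum()            # Con(PC_l, H_u)
    inertia = (w[g, None] * Y[g] ** 2).sum(axis=0) / lam   # I_{lu}, Eq (5.21)
    print(u, np.round(con.sum(), 6))                       # each prints: u 1.0
```

Here the uniform weights cancel in the contributions, exactly as noted for Example 5.8 above.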

In a different direction, an alternative visual aid in interpreting the results is to use in Equations (5.13)–(5.14) only those vertices whose contribution to the principal component PCℓ exceeds some prespecified value α. That is, we set

$$Y_{\ell u}^*(\alpha) = [y_{\ell u}^a(\alpha),\; y_{\ell u}^b(\alpha)]$$

where

$$y_{\ell u}^a(\alpha) = \min_{k_u \in L_u}\{y_{\ell uk_u} : \text{Con}(k_u, PC_\ell) \ge \alpha\}, \qquad y_{\ell u}^b(\alpha) = \max_{k_u \in L_u}\{y_{\ell uk_u} : \text{Con}(k_u, PC_\ell) \ge \alpha\}, \qquad (5.24)$$

and where

$$\text{Con}(k_u, PC_\ell) = \frac{y_{\ell uk_u}^2}{[d(k_u, G)]^2} \qquad (5.25)$$

is the contribution of a single vertex $k_u$ to the ℓth principal component. When for a given ℓ = ℓ1 (say) all Con(ku, PCℓ1) < α, then, to keep track of the position of the hypercube Hu on the principal component plane (ℓ1, ℓ2), say, we project the center of the hypercube onto that axis. In this case there is no variability on that principal component ℓ1, whereas if there is variability for the other principal component ℓ2, there is a line segment along the ℓ2 axis.

An alternative to the criterion of Equation (5.24) is to replace Con(ku, PCℓ) with

$$\text{Con}(k_u, PC_{\ell_1}, PC_{\ell_2}) = \text{Con}(k_u, PC_{\ell_1}) + \text{Con}(k_u, PC_{\ell_2}). \qquad (5.26)$$

In this case, vertices that make a sufficiently large contribution to either of the two principal components PCℓ1 and PCℓ2 are retained, rather than only those vertices that contribute to just one principal component.

Example 5.11. Table 5.9 displays interval-valued observations for 30 categories of irises. These categories were generated by aggregating consecutive groups of five from the 150 individual observations published in Fisher (1936). Such an aggregation might occur if the five observations in each category corresponded to the measurements of five different flowers at the same 'site', or five different flowers on the same plant, etc. There are three species: setosa (S), versicolor (Ve), and virginica (Vi). The data therefore list 30 observations (S1, ..., S10, Ve1, ..., Ve10, Vi1, ..., Vi10), 10 for each of the three species. There are p = 4 random variables, Y1 = Sepal Length, Y2 = Sepal Width, Y3 = Petal Length, and Y4 = Petal Width.

Table 5.9 Iris data.

wu    Name   Sepal Length Y1   Sepal Width Y2   Petal Length Y3   Petal Width Y4

w1    S1     [4.6, 5.1]        [3.0, 3.6]       [1.3, 1.5]        [0.2, 0.2]
w2    S2     [4.4, 5.4]        [2.9, 3.9]       [1.4, 1.7]        [0.1, 0.4]
w3    S3     [4.3, 5.8]        [3.0, 4.0]       [1.1, 1.6]        [0.1, 0.2]
w4    S4     [5.1, 5.7]        [3.5, 4.4]       [1.3, 1.7]        [0.3, 0.4]
w5    S5     [4.6, 5.4]        [3.3, 3.7]       [1.0, 1.9]        [0.2, 0.5]
w6    S6     [4.7, 5.2]        [3.0, 3.5]       [1.4, 1.6]        [0.2, 0.4]
w7    S7     [4.8, 5.5]        [3.1, 4.2]       [1.4, 1.6]        [0.1, 0.4]
w8    S8     [4.4, 5.5]        [3.0, 3.5]       [1.3, 1.5]        [0.1, 0.2]
w9    S9     [4.4, 5.1]        [2.3, 3.8]       [1.3, 1.9]        [0.2, 0.6]
w10   S10    [4.6, 5.3]        [3.0, 3.8]       [1.4, 1.6]        [0.2, 0.3]
w11   Ve1    [5.5, 7.0]        [2.3, 3.2]       [4.0, 4.9]        [1.3, 1.5]
w12   Ve2    [4.9, 6.6]        [2.4, 3.3]       [3.3, 4.7]        [1.0, 1.6]
w13   Ve3    [5.0, 6.1]        [2.0, 3.0]       [3.5, 4.7]        [1.0, 1.5]
w14   Ve4    [5.6, 6.7]        [2.2, 3.1]       [3.9, 4.5]        [1.0, 1.5]
w15   Ve5    [5.9, 6.4]        [2.5, 3.2]       [4.0, 4.9]        [1.2, 1.8]
w16   Ve6    [5.7, 6.8]        [2.6, 3.0]       [3.5, 5.0]        [1.0, 1.7]
w17   Ve7    [5.4, 6.0]        [2.4, 3.0]       [3.7, 5.1]        [1.0, 1.6]
w18   Ve8    [5.5, 6.7]        [2.3, 3.4]       [4.0, 4.7]        [1.3, 1.6]
w19   Ve9    [4.0, 6.1]        [2.3, 3.0]       [3.3, 4.6]        [1.0, 1.4]
w20   Ve10   [4.1, 6.2]        [2.5, 3.0]       [3.0, 4.3]        [1.1, 1.3]
w21   Vi1    [5.8, 7.1]        [2.7, 3.3]       [5.1, 6.0]        [1.8, 2.5]
w22   Vi2    [4.9, 7.6]        [2.5, 3.6]       [4.5, 6.6]        [1.7, 2.5]
w23   Vi3    [5.7, 6.8]        [2.5, 3.2]       [5.0, 5.5]        [1.9, 2.4]
w24   Vi4    [6.0, 7.7]        [2.2, 3.8]       [5.0, 6.9]        [1.5, 2.3]
w25   Vi5    [5.6, 7.7]        [2.7, 3.3]       [4.9, 6.7]        [1.8, 2.3]
w26   Vi6    [6.1, 7.2]        [2.8, 3.2]       [4.8, 6.0]        [1.6, 2.1]
w27   Vi7    [6.1, 7.9]        [2.6, 3.8]       [5.1, 6.4]        [1.4, 2.2]
w28   Vi8    [6.0, 7.7]        [3.0, 3.4]       [4.8, 6.1]        [1.8, 2.4]
w29   Vi9    [5.8, 6.9]        [2.7, 3.3]       [5.1, 5.9]        [1.9, 2.5]
w30   Vi10   [5.0, 6.7]        [2.5, 3.4]       [5.0, 5.4]        [1.8, 2.3]
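The aggregation step that produces such interval-valued observations can be sketched as follows. This is illustrative code on synthetic measurements, not a reconstruction of Fisher's actual records, and the function name is our own.

```python
import numpy as np

def aggregate_to_intervals(data, k):
    """Collapse each run of k consecutive rows into per-variable [min, max] intervals."""
    n, p = data.shape
    assert n % k == 0, "number of individuals must divide evenly into groups"
    grouped = data.reshape(n // k, k, p)
    # last axis holds the interval endpoints (lower, upper)
    return np.stack([grouped.min(axis=1), grouped.max(axis=1)], axis=-1)

# Synthetic stand-in: 150 individual flowers, 4 measured variables.
rng = np.random.default_rng(1)
individuals = rng.normal(5.0, 0.5, size=(150, 4))
intervals = aggregate_to_intervals(individuals, k=5)
print(intervals.shape)   # (30, 4, 2): 30 observations, 4 interval variables
```

Aggregating in groups of five over 150 rows yields exactly the 30 × 4 layout of Table 5.9.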


We perform the vertices principal component analysis on these data. The resulting principal components obtained from Equation (5.13) are displayed in Table 5.10. Also shown are the relative contributions for each observation calculated from Equation (5.17). The plot of the principal components on the ℓ = 1 and ℓ = 2 axes is shown in Figure 5.6. These correspond to α = 0 in Equation (5.24). While some form of clustering is apparent, greater clarity can be gleaned by taking α > 0 in Equation (5.24).

Table 5.10 Vertices PC – iris data.

wu    Name   Principal components                    Contribution
             PC1                PC2                  PC1     PC2

w1    S1     [−2.471, −1.810]   [−0.800, 0.423]      0.923   0.062
w2    S2     [−2.784, −1.384]   [−1.054, 1.087]      0.814   0.140
w3    S3     [−2.990, −1.410]   [−0.932, 1.370]      0.785   0.133
w4    S4     [−2.578, −1.559]   [0.256, 2.057]       0.673   0.300
w5    S5     [−2.624, −1.466]   [−0.298, 0.766]      0.914   0.042
w6    S6     [−2.330, −1.577]   [−0.760, 0.320]      0.920   0.058
w7    S7     [−2.740, −1.475]   [−0.566, 1.635]      0.766   0.201
w8    S8     [−2.627, −1.600]   [−0.891, 0.395]      0.885   0.055
w9    S9     [−2.687, −0.995]   [−2.079, 0.844]      0.610   0.348
w10   S10    [−2.549, −1.598]   [−0.796, 0.859]      0.875   0.098
w11   Ve1    [−0.081, 1.648]    [−1.397, 0.766]      0.391   0.353
w12   Ve2    [−0.902, 1.392]    [−1.517, 0.799]      0.210   0.410
w13   Ve3    [−0.617, 1.277]    [−2.161, 0.088]      0.130   0.709
w14   Ve4    [−0.227, 1.416]    [−1.581, 0.466]      0.277   0.487
w15   Ve5    [0.055, 1.445]     [−0.923, 0.593]      0.530   0.309
w16   Ve6    [−0.250, 1.558]    [−0.875, 0.382]      0.452   0.154
w17   Ve7    [−0.342, 1.207]    [−1.318, 0.085]      0.256   0.497
w18   Ve8    [−0.192, 1.500]    [−1.397, 1.007]      0.324   0.494
w19   Ve9    [−0.683, 1.004]    [−1.653, 0.069]      0.131   0.632
w20   Ve10   [−0.654, 0.773]    [−1.273, 0.076]      0.146   0.545
w21   Vi1    [0.751, 2.579]     [−0.476, 1.169]      0.739   0.110
w22   Vi2    [−0.158, 3.148]    [−1.187, 1.895]      0.480   0.200
w23   Vi3    [0.795, 2.295]     [−0.847, 0.851]      0.733   0.116
w24   Vi4    [0.323, 3.316]     [−1.313, 2.262]      0.543   0.311
w25   Vi5    [0.580, 2.974]     [−0.557, 1.390]      0.684   0.111
w26   Vi6    [0.718, 2.279]     [−0.238, 0.976]      0.782   0.096
w27   Vi7    [0.334, 2.962]     [−0.597, 2.296]      0.521   0.311
w28   Vi8    [0.702, 2.686]     [0.099, 1.548]       0.660   0.186
w29   Vi9    [0.825, 2.441]     [−0.461, 1.093]      0.742   0.107
w30   Vi10   [0.243, 2.136]     [−1.113, 1.141]      0.484   0.197

Figure 5.6 Iris data: principal components – α = 0.

The first two principal components (ℓ = 1, 2) on the vertices of the fourth category S4 (w4) of the setosa iris are given in Table 5.11. The contributions between the vertices and the principal components are calculated from Equation (5.25). For example, consider the first vertex in H4 (i.e., k4 = 1, u = 4). Then,

$$\text{Con}(k_4 = 1, PC_1) = (-2.079)^2/4.454 = 0.971$$

where the squared Euclidean distance for this vertex is

$$d^2(k_4 = 1, G) = \left(\frac{5.1-5.807}{0.937}\right)^2 + \left(\frac{3.5-3.062}{0.540}\right)^2 + \left(\frac{1.3-3.730}{1.785}\right)^2 + \left(\frac{0.3-1.207}{0.774}\right)^2 = 4.454.$$

These contributions for all vertices in H4 are given in Table 5.11 along with their first and second principal components.

Suppose α = 0.1, and take ℓ = 1. Then, from Equation (5.24), we see that no vertices are deleted from consideration, since their contributions are all ≥ 0.1, and hence the first vertex principal component becomes

$$PC_1(\alpha = 0.1) = [-2.578, -1.559].$$


Table 5.11 Con(ku, PCℓ), u = 4.

Vertex k4   Y1    Y2    Y3    Y4     PC1      PC2     Con(k4, PC1)   Con(k4, PC2)

1           5.1   3.5   1.3   0.3    −2.079   0.256   0.971          0.015
2           5.1   3.5   1.3   0.4    −2.005   0.270   0.965          0.017
3           5.1   3.5   1.7   0.3    −1.948   0.275   0.975          0.019
4           5.1   3.5   1.7   0.4    −1.874   0.289   0.984          0.023
5           5.1   4.4   1.3   0.3    −2.578   1.807   0.669          0.329
6           5.1   4.4   1.3   0.4    −2.504   1.822   0.650          0.344
7           5.1   4.4   1.7   0.3    −2.448   1.827   0.639          0.356
8           5.1   4.4   1.7   0.4    −2.374   1.841   0.620          0.373
9           5.7   3.5   1.3   0.3    −1.764   0.471   0.798          0.057
10          5.7   3.5   1.3   0.4    −1.690   0.486   0.791          0.065
11          5.7   3.5   1.7   0.3    −1.633   0.491   0.799          0.072
12          5.7   3.5   1.7   0.4    −1.559   0.505   0.797          0.084
13          5.7   4.4   1.3   0.3    −2.263   2.023   0.546          0.436
14          5.7   4.4   1.3   0.4    −2.189   2.037   0.526          0.456
15          5.7   4.4   1.7   0.3    −2.133   2.042   0.516          0.473
16          5.7   4.4   1.7   0.4    −2.059   2.057   0.497          0.496

However, for ℓ = 2, the vertices corresponding to k4 = 1, ..., 4, 9, ..., 12, are deleted. Hence, the second vertex principal component is found by taking the minimum and maximum of the PC2 values that remain, i.e., we use Equation (5.24) to give

$$PC_2(\alpha = 0.1) = [1.807, 2.057].$$

The principal component values that emerge are given in Table 5.12.

When α = 0.5, the first vertex principal component for H4 is unchanged, since the vertices that determine its endpoints have contributions ≥ 0.5. Thus, from Equation (5.24) we have

$$PC_1(\alpha = 0.5) = [-2.578, -1.559].$$

However, we see that all vertices are such that

$$\text{Con}(k_4, PC_2) \le \alpha = 0.5$$

for this observation H4. Therefore, to locate this observation on the PC2 axis of the PC1 × PC2 plane, we calculate the average PC2 value over all vertices. Thus we obtain

$$\bar{PC}_2 = 0.226.$$

The same procedure is followed for all hypercubes Hu, u = 1, ..., 30. The resulting principal component values are shown in Table 5.12. The single values

Table 5.12 Vertices PC, α = 0.1, 0.5 – iris data.

                   α = 0.1                                α = 0.5
wu         PC1                 PC2                 PC1                 PC2

w1   S1    [−2.471, −1.810]    [−0.800, −0.791]    [−2.471, −1.810]    0.066
w2   S2    [−2.784, −1.384]    [−1.054, 1.087]     [−2.784, −1.384]    0.149
w3   S3    [−2.990, −1.410]    [−0.932, 1.370]     [−2.990, −1.410]    0.129
w4   S4    [−2.578, −1.559]    [1.807, 2.057]      [−2.578, −1.559]    0.226
w5   S5    [−2.624, −1.466]    [0.722, 0.766]      [−2.624, −1.466]    0.042
w6   S6    [−2.330, −1.577]    [−0.760, −0.722]    [−2.330, −1.577]    0.061
w7   S7    [−2.740, −1.475]    [1.331, 1.635]      [−2.740, −1.475]    0.161
w8   S8    [−2.627, −1.600]    [−0.891, −0.863]    [−2.627, −1.600]    0.056
w9   S9    [−2.687, −0.995]    [−2.079, 0.844]     [−2.687, −1.827]    [−2.079, −1.742]
w10  S10   [−2.549, −1.598]    [−0.796, −0.859]    [−2.549, −1.598]    0.101
w11  Ve1   [0.361, 1.648]      [−1.397, 0.766]     [1.149, 1.648]      [−1.397, −1.325]
w12  Ve2   [−0.902, 1.392]     [−1.517, 0.799]     [−0.902, 1.392]     [−1.517, −1.364]
w13  Ve3   [−0.671, 1.277]     [−2.161, −0.366]    [0.722, 0.722]      [−2.161, −0.636]
w14  Ve4   [−0.227, 1.416]     [−1.581, 0.466]     [0.916, 1.416]      [−1.581, −1.481]
w15  Ve5   [0.318, 1.445]      [−0.923, 0.593]     [0.762, 1.445]      [−0.923, 0.284]
w16  Ve6   [−0.250, 1.558]     [−0.875, −0.185]    [−0.250, 1.558]     [−0.875, −0.775]
w17  Ve7   [−0.342, 1.207]     [−1.318, −0.283]    [0.417, 1.207]      [−1.318, −1.017]
w18  Ve8   [0.439, 1.500]      [−1.397, 1.007]     [1.271, 1.500]      [−1.397, 1.007]
w19  Ve9   [−0.683, 1.004]     [−1.653, −0.327]    [−0.683, 0.615]     [−1.653, −1.138]
w20  Ve10  [−0.654, 0.773]     [−1.273, −0.320]    [−0.654, 0.496]     [−1.273, −0.849]
w21  Vi1   [0.751, 2.579]      [−0.476, 1.169]     [1.084, 2.579]      0.119
w22  Vi2   [1.045, 3.148]      [−1.187, 1.895]     [1.870, 3.148]      [−1.187, 1.680]
w23  Vi3   [0.795, 2.295]      [−0.847, 0.851]     [1.184, 2.295]      0.126
w24  Vi4   [0.916, 3.316]      [−1.313, 2.262]     [1.804, 3.316]      [−1.313, 2.171]
w25  Vi5   [0.580, 2.974]      [−0.557, 1.390]     [0.913, 2.974]      0.116
w26  Vi6   [0.718, 2.279]      [0.451, 0.976]      [0.718, 2.279]      0.101
w27  Vi7   [0.759, 2.962]      [−0.597, 2.296]     [1.000, 2.962]      [1.472, 2.233]
w28  Vi8   [0.702, 2.686]      [0.711, 1.548]      [0.924, 2.686]      0.177
w29  Vi9   [0.825, 2.441]      [−0.461, 1.093]     [1.158, 2.441]      0.115
w30  Vi10  [0.613, 2.136]      [−1.113, 1.141]     [1.135, 2.136]      0.207


(such as $\bar{PC}_2 = 0.226$ for w4) are those for which no vertex had Con(ku, PCℓ) > α. The resulting plots will be lines. Notice that the plot for Ve3 (w13) is also a line, parallel to the PC2 axis. However, in this case there is one vertex for which Con(ku, PC1) > 0.5, so that the first principal component takes the value [0.722, 0.722].

Clearly, when α = 0, all vertices are included, so that for H4,

$$PC_1(\alpha = 0) = [-2.578, -1.559], \qquad PC_2(\alpha = 0) = [0.256, 2.057].$$

The plots for all observations for α = 0.1 and α = 0.5 are shown in Figure 5.7 and Figure 5.8, respectively, and can be compared along with that for α = 0 in Figure 5.6.

Figure 5.7 Iris data: principal components – α = 0.1.

As α increases from 0 to 0.5, the versicolor and virginica species separate out with no overlap between them, except for the Ve5 and Ve8 categories which essentially link the two flora. The setosa species are all quite distinct from both the versicolor and virginica. The S9 category becomes identified as a cluster of its own, separate from the other setosa categories. More importantly, instead of the harder-to-identify clusters obtained when α = 0, we have for α = 0.5 more distinct clusters. That is, use of Equation (5.24) has enhanced our visual ability to identify clusters. □

Figure 5.8 Iris data: principal components – α = 0.5.

Classical data

When all the observations are classical data with $a_{ij} = b_{ij}$ for all i = 1, ..., m, j = 1, ..., p, it follows that the number of non-trivial intervals is $q_i = 0$ and hence $m_i = 1$ for all i = 1, ..., m. Hence, the vertex weights $w_{ik_i} \equiv w_i$. All the results here carry through as special cases.

5.2 Centers Method

As an alternative to using the vertices of the hyperrectangles Hu as above, the centers of the hyperrectangles can be used. In this case, each observation ξu = ([a_{u1}, b_{u1}], ..., [a_{up}, b_{up}]) is transformed to

$$X_u^c = (X_{u1}^c, \ldots, X_{up}^c), \qquad u = 1, \ldots, m,$$

where

$$X_{uj}^c = (a_{uj} + b_{uj})/2, \qquad j = 1, \ldots, p. \qquad (5.27)$$

This implies that the symbolic data matrix has been transformed to a classical m × p matrix $X^c$ with classical variables $X_1^c, \ldots, X_p^c$, say.


Then, classical principal component analysis is applied to the data matrix $X^c$. The ℓth centers principal component for observation ξu is, for ℓ = 1, ..., s, u = 1, ..., m,

$$y_{\ell u}^c = \sum_{j=1}^{p}(X_{uj}^c - \bar{X}_j^c)\,w_{\ell j} \qquad (5.28)$$

where the mean of the values for the variable $X_j^c$ is

$$\bar{X}_j^c = \frac{1}{m}\sum_{u=1}^{m} X_{uj}^c \qquad (5.29)$$

and where $w_\ell = (w_{\ell 1}, \ldots, w_{\ell p})$ is the ℓth eigenvector of the variance–covariance matrix associated with $X^c$.

Since each coordinate $X_{uj}$ lies in the interval $[a_{uj}, b_{uj}]$, and since the principal components are linear functions of $X_{uj}^c$, we can obtain the interval centers principal components as

$$Y_{\ell u}^c = [y_{\ell u}^{ca},\; y_{\ell u}^{cb}] \qquad (5.30)$$

where

$$y_{\ell u}^{ca} = \sum_{j=1}^{p}\;\min_{a_{uj}\le x_{uj}\le b_{uj}}(x_{uj} - \bar{X}_j^c)\,w_{\ell j} \qquad (5.31)$$

and

$$y_{\ell u}^{cb} = \sum_{j=1}^{p}\;\max_{a_{uj}\le x_{uj}\le b_{uj}}(x_{uj} - \bar{X}_j^c)\,w_{\ell j}. \qquad (5.32)$$

It follows that

$$y_{\ell u}^{ca} = \sum_{j\in J^-}(b_{uj} - \bar{X}_j^c)\,w_{\ell j} + \sum_{j\in J^+}(a_{uj} - \bar{X}_j^c)\,w_{\ell j} \qquad (5.33)$$

and

$$y_{\ell u}^{cb} = \sum_{j\in J^-}(a_{uj} - \bar{X}_j^c)\,w_{\ell j} + \sum_{j\in J^+}(b_{uj} - \bar{X}_j^c)\,w_{\ell j} \qquad (5.34)$$

where $J^-$ denotes those values of j for which $w_{\ell j} < 0$ and $J^+$ those for which $w_{\ell j} > 0$.
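The whole centers pipeline — Equation (5.27), a classical PCA, then Equations (5.33)–(5.34) — can be sketched as below. The data are synthetic intervals (not Table 5.2), the variables are standardized as in Example 5.12, and the endpoint formulas are cross-checked against a brute-force minimum/maximum over all 2^p vertices.

```python
import numpy as np
from itertools import product

# Synthetic interval data: 10 observations, p = 4 interval variables.
rng = np.random.default_rng(2)
mid = rng.normal(size=(10, 4))                  # interval midpoints
half = rng.uniform(0.1, 1.0, size=(10, 4))      # interval half-widths
A, B = mid - half, mid + half                   # bounds [a_uj, b_uj]

Xc = (A + B) / 2.0                              # Equation (5.27)
mean, sd = Xc.mean(axis=0), Xc.std(axis=0)
Za, Zb = (A - mean) / sd, (B - mean) / sd       # standardized bounds
cov = np.cov(((Xc - mean) / sd).T, bias=True)
lam, W = np.linalg.eigh(cov)
lam, W = lam[::-1], W[:, ::-1]                  # eigenpairs, descending order

# Equations (5.33)-(5.34): the lower endpoint takes b_uj where w_lj < 0 and
# a_uj where w_lj > 0; the upper endpoint takes the opposite choice.
Wt = W.T[:, None, :]                            # (component, 1, variable)
y_lo = (np.where(Wt < 0, Zb, Za) * Wt).sum(axis=2).T
y_hi = (np.where(Wt < 0, Za, Zb) * Wt).sum(axis=2).T

# Cross-check on one observation: each endpoint equals the min/max PC score
# over all 2^p vertices of the standardized hyperrectangle.
u = 0
verts = np.array([[(Zb if s else Za)[u, j] for j, s in enumerate(bits)]
                  for bits in product([0, 1], repeat=4)])
scores = verts @ W
print(np.allclose(scores.min(axis=0), y_lo[u]),
      np.allclose(scores.max(axis=0), y_hi[u]))   # True True
```

The vertex cross-check is exactly why (5.31)–(5.32) collapse to (5.33)–(5.34): each term of the linear form is minimized or maximized independently at an interval endpoint.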

Example 5.12. Let us take the financial data of Table 5.2 and consider all five variables Y1, ..., Y5. The centroid values $X_{uj}^c$, j = 1, ..., p = 5, u = 1, ..., m = 14, are obtained from Equation (5.27) and displayed in Table 5.13. Also given are the means $\bar{X}_j^c$ along with the standard deviations $S_j^c$ for each variable $X_j^c$, j = 1, ..., 5. A classical principal component analysis on these centroid values gives the eigenvalues

$$\lambda = (2.832, 1.205, 0.430, 0.327, 0.206)$$

and hence the eigenvector associated with the first principal component is

$$e_1 = (-0.528, 0.481, -0.086, 0.491, 0.491).$$

Table 5.13 Centers method statistics – finance data.

wu      X1c       X2c    X3c     X4c         X5c

w1      2653.5    4.0    10.5    247980.0    606650.0
w2      3119.5    3.5    10.5    188784.0    191024.0
w3      3047.0    4.0    10.5    126063.0    251063.0
w4      3079.5    2.0    12.0     23908.5     42183.5
w5      3265.0    4.0    11.0    127564.5    160000.5
w6      3286.5    3.5    14.5     29905.0     39455.0
w7      3302.5    3.5    11.0     40200.0     40200.5
w8      3113.5    3.5    11.0     28640.0     28640.0
w9      3117.5    4.0    10.5     25327.5    643575.0
w10     3147.0    3.0    14.5     16650.0     16650.0
w11     3354.5    3.0     9.0     33825.0     53400.0
w12     3769.5    1.0     9.0     19950.0     25050.0
w13     3397.0    3.0    15.0     17050.0     29550.0
w14     3589.5    1.0    10.5      9929.0      9929.0

X̄cj    3231.57   3.07   11.39    66841.18   152669.32
Scj      264.16   1.04    1.93    75193.08   213311.91

We substitute this result into Equations (5.33) and (5.34) to obtain the interval centers principal component. For example, from Equation (5.33), we have the centers principal component (in standardized values, dividing by $S_j^c$) for the FMCG sector (w1),

$$y_{11}^{ca} = \left(\frac{4139-3231.57}{264.16}\right)(-0.528) + \left(\frac{1-3.07}{1.04}\right)(0.481) + \left(\frac{20-11.39}{1.93}\right)(-0.086) + \left(\frac{650-66841.18}{75193.08}\right)(0.491) + \left(\frac{1000-152669.32}{213311.91}\right)(0.491) = -3.940,$$

and from Equation (5.34) we have, for the FMCG sector (w1),

$$y_{11}^{cb} = \left(\frac{1168-3231.57}{264.16}\right)(-0.528) + \left(\frac{7-3.07}{1.04}\right)(0.481) + \left(\frac{1-11.39}{1.93}\right)(-0.086) + \left(\frac{495310-66841.18}{75193.08}\right)(0.491) + \left(\frac{1212300-152669.32}{213311.91}\right)(0.491) = 11.650.$$

Hence, the first centers principal component for the FMCG sector (w1) is

$$Y_{11}^c = [-3.940, 11.650].$$

The complete set of the first, second, and third centers principal components for all sectors, u = 1, ..., 14, is shown in Table 5.14.

Table 5.14 Centers principal components – finance data.

wu    Sector                  PC1 = Y1uC          PC2 = Y2uC          PC3 = Y3uC

w1    FMCG                    [−3.940, 11.650]    [−8.297, 7.205]     [−5.879, 5.123]
w2    Automotive              [−3.793, 6.487]     [−7.113, 6.030]     [−4.377, 2.362]
w3    Telecommunications      [−3.843, 6.750]     [−6.908, 6.490]     [−3.222, 2.888]
w4    Publishing              [−3.013, 1.502]     [−4.242, 4.824]     [−1.040, 0.710]
w5    Finance                 [−3.705, 5.296]     [−6.232, 6.058]     [−3.009, 2.213]
w6    Consultants             [−3.807, 2.707]     [−2.356, 5.751]     [−1.154, 1.177]
w7    Consumer goods          [−3.789, 3.073]     [−5.561, 5.716]     [−1.355, 1.371]
w8    Energy                  [−3.013, 2.849]     [−5.144, 5.707]     [−1.129, 1.157]
w9    Pharmaceutical          [−3.007, 6.125]     [−6.754, 6.011]     [−1.049, 5.145]
w10   Tourism                 [−3.035, 1.750]     [−1.909, 5.345]     [−0.892, 0.773]
w11   Textiles                [−3.093, 1.858]     [−2.308, 0.265]     [−0.771, 0.977]
w12   Services                [−3.223, −1.842]    [−2.233, −1.644]    [−0.274, 0.262]
w13   Durable goods           [−3.758, 1.494]     [−1.795, 5.227]     [−0.809, 0.920]
w14   Others                  [−3.939, −0.740]    [−5.853, 3.720]     [−0.618, 0.498]

The proportion of the total variation explained by the first centers principal component is $\lambda_1/\sum_i \lambda_i = 0.566$, i.e., 56.6%. Likewise, we find that the first two and first three principal components together account for 80.7% and 89.3%, respectively, of the total variation.
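These percentages follow directly from the eigenvalues; a quick check:

```python
import numpy as np

# Cumulative explained-variance proportions from the eigenvalues of
# Example 5.12: the first three shares are 0.566, 0.807, 0.893.
lam = np.array([2.832, 1.205, 0.430, 0.327, 0.206])
print(np.round(np.cumsum(lam) / lam.sum(), 3))
```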

Figure 5.9 shows the plots of the first two centers principal components for each observation. There are three distinct clusters. The first consists of the w1 observation (i.e., the FMCG sector). The second cluster consists of the w2, w3, w5, w9 observations (i.e., the automotive, telecommunications, finance, and pharmaceutical sectors). These are distinct clusters when viewed with respect to the first centers principal component. If viewed from the second centers principal component, the pharmaceutical sector (w9) would move from the second cluster to the first. The remaining sectors clearly belong to a single third cluster. □

Figure 5.9(a) First/second centers principal components.

Figure 5.9(b) First/second centers principal components.

The analyses of the financial data in Examples 5.6, 5.7, and 5.12 all produce, not unreasonably, comparable clusters, with the added feature that the clusters are nested inside each other. The following example gives unnested clusters.

Example 5.13. The data of Table 5.15 provide some measurements for eight different makes of car. The variables are Y1 = Price (×10⁻³, in euros), Y2 = Maximum Velocity, Y3 = Acceleration Time to reach a given speed, and Y4 = Cylinder Capacity of the car.

The centroid values for all four variables are displayed in Table 5.16, along with the means $\bar{X}_j^c$ and standard deviations $S_j^c$, j = 1, ..., 4.

Table 5.15 Cars data.

wu    Car            Y1 = Price        Y2 = Max      Y3 = Accn     Y4 = Cylinder
                                       Velocity      Time          Capacity
                     [a1, b1]          [a2, b2]      [a3, b3]      [a4, b4]

w1    Aston Martin   [260.5, 460.0]    [298, 306]    [4.7, 5.0]    [5935, 5935]
w2    Audi A6        [68.2, 140.3]     [216, 250]    [6.7, 9.7]    [1781, 4172]
w3    Audi A8        [123.8, 171.4]    [232, 250]    [5.4, 10.1]   [2771, 4172]
w4    BMW 7          [104.9, 276.8]    [228, 240]    [7.0, 8.6]    [2793, 5397]
w5    Ferrari        [240.3, 391.7]    [295, 298]    [4.5, 5.2]    [3586, 5474]
w6    Honda NSR      [205.2, 215.2]    [260, 270]    [5.7, 6.5]    [2977, 3179]
w7    Mercedes C     [55.9, 115.2]     [210, 250]    [5.2, 11.0]   [1998, 3199]
w8    Porsche        [147.7, 246.4]    [280, 305]    [4.2, 5.2]    [3387, 3600]

Table 5.16 Centers method statistics – cars data.

wu    Car            X1c       X2c      X3c     X4c

w1    Aston Martin   360.25    302.0    4.85    5935.0
w2    Audi A6        104.25    233.0    8.20    2976.5
w3    Audi A8        147.60    241.0    7.75    3471.5
w4    BMW 7          190.85    234.0    7.80    4095.0
w5    Ferrari        316.00    296.5    4.85    4530.0
w6    Honda NSR      210.20    265.0    6.10    3078.0
w7    Mercedes C      85.55    230.0    8.10    2598.5
w8    Porsche        197.05    292.5    4.70    3493.5

X̄cj                 201.47    261.75   6.54    3772.25
Scj                   95.86     31.21   1.58    1070.17


The principal component analysis of these centroid values on all p = 4 random variables gave the eigenvalues

$$\lambda = (3.449, 0.505, 0.040, 0.005).$$

The first eigenvector is

$$e_1 = (0.512, 0.515, -0.499, 0.461).$$

Hence, substituting into Equations (5.33)–(5.34) for each observation, we obtain the centers principal components. For example, for the Aston Martin (w1), we have

$$PC_1(\text{Aston Martin}) = Y_{11}^c = [y_{11}^{ca}, y_{11}^{cb}] = [2.339, 3.652]$$

since

$$y_{11}^{ca} = \left(\frac{260.5-201.47}{95.86}\right)(0.512) + \left(\frac{298-261.75}{31.21}\right)(0.515) + \left(\frac{5.0-6.54}{1.58}\right)(-0.499) + \left(\frac{5935-3772.25}{1070.17}\right)(0.461) = 2.339$$

and

$$y_{11}^{cb} = \left(\frac{460.0-201.47}{95.86}\right)(0.512) + \left(\frac{306-261.75}{31.21}\right)(0.515) + \left(\frac{4.7-6.54}{1.58}\right)(-0.499) + \left(\frac{5935-3772.25}{1070.17}\right)(0.461) = 3.652.$$

The complete set of centers principal components is displayed in Table 5.17. Plots based on the first and second principal components are shown in Figure 5.10.

From this figure, two (or three) clusters emerge. One cluster consists of the Aston Martin, Ferrari, and Porsche cars (i.e., w1, w5, and w8), and the other


Table 5.17 Centers principal components – cars data.

wu    Car            PC1                 PC2                 PC3

w1    Aston Martin   [2.339, 3.652]      [0.429, 1.171]      [−1.030, 0.738]
w2    Audi A6        [−3.335, −0.404]    [−1.484, 1.718]     [−1.198, 1.007]
w3    Audi A8        [−2.467, 0.176]     [−1.104, 1.732]     [−0.793, 0.670]
w4    BMW 7          [−2.154, 0.608]     [−0.487, 2.387]     [−1.367, 1.501]
w5    Ferrari        [1.104, 2.013]      [−1.150, 0.783]     [−0.916, 1.321]
w6    Honda NSR      [−0.338, 0.221]     [−0.900, −0.349]    [0.198, 0.509]
w7    Mercedes C     [−3.818, −0.487]    [−1.868, 1.509]     [−0.910, 0.810]
w8    Porsche        [0.266, 1.624]      [−1.720, −0.666]    [−0.880, 0.330]

Figure 5.10 Centers principal components: cars data.

cluster consists of the Audi A6, Audi A8, BMW 7, Mercedes C, and Honda NSR cars (i.e., w2, w3, w4, w6, and w7). Whether or not the Honda NSR should form its own third cluster is answered by studying the plots obtained from the second and third principal components. These are shown in Figure 5.11, and they do confirm the choice of the Honda NSR cars as a separate third cluster. □

Figure 5.11 Centers principal components: cars data.

5.3 Comparison of the Methods

We first compare the two methods through the two datasets of Table 5.2 and Table 5.15. Both the vertices and centers methods were applied to the financial data of Table 5.2, in Example 5.6 and Example 5.12, respectively. It is clear from a comparison of the results in Table 5.4 and Table 5.14, or Figure 5.3 and Figure 5.9(a), respectively, that though somewhat similar clusters of the m = 14 financial sectors emerge, there are some slight differences. In the centers method, the publishing, tourism, and others sectors (i.e., observations w4, w10, w14) are more obviously contained within the inner third cluster, whereas in the vertices method these observations may be deemed to belong to a separate cluster (if attention is focused on the first principal component alone, but not if the focus is on the second principal component alone).

The centers method was applied to the cars dataset of Table 5.15 in Example 5.13, with the resulting principal components displayed in Table 5.17 and plotted in Figure 5.10 and Figure 5.11. Applying the vertices method to these data, we obtain the corresponding vertices principal components of Table 5.18, and plots of the first two principal components in Figure 5.12. Comparing these tables and figures, we observe that for these data the two methods give essentially the same results.


Table 5.18 Vertices principal components – cars data.

wu    Car            PC1                 PC2                 PC3

w1    Aston Martin   [2.032, 3.213]      [0.384, 0.957]      [−0.730, 0.580]
w2    Audi A6        [−2.858, −0.394]    [−1.212, 1.404]     [−1.293, 1.117]
w3    Audi A8        [−2.046, 0.048]     [−0.972, 1.479]     [−1.119, 0.946]
w4    BMW 7          [−1.864, 0.470]     [−0.427, 1.916]     [−1.463, 1.159]
w5    Ferrari        [0.998, 2.621]      [−0.920, 0.650]     [−0.711, 1.204]
w6    Honda NSR      [−0.286, 0.186]     [−0.742, −0.289]    [0.090, 0.491]
w7    Mercedes C     [−3.235, −0.530]    [−1.592, 1.281]     [−1.303, 1.152]
w8    Porsche        [0.215, 1.429]      [−1.367, −0.550]    [−0.607, 0.486]

Figure 5.12 Vertices principal components: cars data.

There are, however, some fundamental differences between the two methods. Both methods are adaptations of classical principal component analysis, but they vary in how the internal variation of the symbolic observations is incorporated into the final answer.

Consider the two symbolic data points

$$\xi_1 = ([12, 18], [4, 8]), \qquad \xi_2 = ([14, 16], [5, 7]).$$

These have respective vertex values

$$V_1 = \{(12, 4), (12, 8), (18, 4), (18, 8)\}, \qquad V_2 = \{(14, 5), (14, 7), (16, 5), (16, 7)\}$$

and centroid values

$$\bar{X}_1^c = \bar{X}_2^c = (15, 6).$$

Notice in particular that the data rectangle formed by ξ2 lies entirely inside the data rectangle of ξ1. This reflects the larger internal variation of ξ1 as compared to that of ξ2. However, both ξ1 and ξ2 have the same centroid values.

The vertices method uses the $m_u = 2^p$ (here $m_u = 4$) vertices for each observation as classical data points, whereas the centers method uses the single centroid for each observation as a classical data point. Therefore, the observation ξ1, through the vertices in V1, will produce different principal components for its $m_u = 4$ points than those produced from the observation ξ2 through the vertices in V2. Use of Equations (5.13) and (5.14) will produce the symbolic interval vertices principal component $Y_u^V$ in each case. Clearly, $Y_1^V$ for ξ1 is itself a larger rectangle (with larger internal variation) than $Y_2^V$ for ξ2, with the vertices principal component for ξ2 contained inside that for ξ1.

In contrast, the centers method will produce the same principal components for both ξ1 and ξ2, since they share the same centroid value $\bar{X}_u^c$. The internal variation of each observation is instead reflected in the construction of the interval centers principal component $Y_u^c$ through Equation (5.30) and Equations (5.33) and (5.34). Since ξ1 spans a larger rectangle than does ξ2, it is evident from these equations that $Y_1^c$ for ξ1 will be larger (and have larger internal variation) than $Y_2^c$ for ξ2, with the centers principal component for ξ2 contained within that for ξ1.

This phenomenon is demonstrated by the artificial datasets of Table 5.19. These data were constructed so that each ξu observation of Dataset 2 (column (ii)) is contained entirely 'inside' the corresponding ξu observation of Dataset 1 (column (i)). The resulting vertices principal components and centers principal components are displayed in Table 5.20 and Table 5.21, respectively. The two sets of vertices principal components are plotted in Figure 5.14 and those for the

Table 5.19 Artificial datasets.

wu    (i) Dataset 1            (ii) Dataset 2
      [a1, b1]    [a2, b2]     [a1, b1]    [a2, b2]

w1    [12, 18]    [4, 8]       [14, 16]    [5, 7]
w2    [3, 9]      [10, 14]     [4, 8]      [11, 13]
w3    [12, 20]    [11, 19]     [14, 18]    [13, 17]
w4    [1, 7]      [1, 9]       [3, 5]      [4, 6]


Table 5.20 Vertices principal components – artificial data.

wu    Dataset 1                               Dataset 2
      PC1                 PC2                 PC1                 PC2

w1    [−0.529, 0.654]     [0.389, 1.572]      [−0.243, 0.318]     [0.856, 1.417]
w2    [−0.730, 0.453]     [−1.386, −0.202]    [−0.537, 0.272]     [−1.322, −0.514]
w3    [0.389, 2.316]      [−1.054, 0.873]     [1.013, 2.136]      [−0.715, 0.408]
w4    [−2.130, −0.422]    [−0.949, 0.758]     [−1.760, −1.199]    [−0.346, 0.215]

Table 5.21 Centers principal components – artificial data.

wu    Dataset 1                               Dataset 2
      PC1                 PC2                 PC1                 PC2

w1    [−0.609, 0.673]     [0.423, 1.705]      [−0.231, 0.295]     [0.801, 1.327]
w2    [−0.763, 0.519]     [−1.500, −0.218]    [−0.500, 0.257]     [−1.237, −0.481]
w3    [0.423, 2.525]      [−1.200, 0.903]     [0.949, 2.000]      [−0.673, 0.378]
w4    [−2.320, −0.449]    [−0.993, 0.878]     [−1.647, −1.122]    [−0.320, 0.205]

centers principal components are shown in Figure 5.15. The distinct clusters, each of size 1, corresponding to the respective w1, w2, w3, w4 data values are apparent. Notice, however, that whereas the actual data hyperrectangles do not overlap (see Figure 5.13), the principal component rectangles do overlap.
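The containment argument can also be verified numerically. The sketch below fixes the common centroid and applies the endpoint formulas of Equations (5.33)–(5.34) to ξ1 and ξ2 for two arbitrary (assumed) weight vectors; the interval for ξ2 is nested inside that for ξ1 in each case.

```python
import numpy as np

# The two symbolic data points above; rows hold [a_j, b_j].
xi1 = np.array([[12.0, 18.0], [4.0, 8.0]])
xi2 = np.array([[14.0, 16.0], [5.0, 7.0]])
centroid = xi1.mean(axis=1)                  # (15, 6), shared by both points

def pc_interval(xi, w, center):
    """Interval projection of a hyperrectangle onto weight vector w,
    per Equations (5.33)-(5.34)."""
    a, b = xi[:, 0] - center, xi[:, 1] - center
    lo = np.where(w < 0, b, a) @ w           # Equation (5.33)
    hi = np.where(w < 0, a, b) @ w           # Equation (5.34)
    return lo, hi

for w in (np.array([0.8, 0.6]), np.array([-0.6, 0.8])):
    lo1, hi1 = pc_interval(xi1, w, centroid)
    lo2, hi2 = pc_interval(xi2, w, centroid)
    print(lo1 <= lo2 <= hi2 <= hi1)          # True: interval for xi2 is nested
```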

Remark: One of the distinguishing features of the financial data of Table 5.2 is that some of the interval values are wide in range. This is especially true of the FMCG sector. This particular sector involved over 600 original individual company records, including a very few dominant firms with large budgets. Recall that when aggregating individual observations into interval-valued symbolic observations, the resulting interval spans the minimum Y value to the maximum Y value. Thus, when, as in the FMCG sector, relative outlier values exist, the resulting symbolic interval is perforce wide, and as such may not truly reflect the distribution of the observations whose aggregation produced that interval. Of course, smaller aggregations, such as that observed for the pharmaceutical sector (w9, with nu = 31), can also contain outlier observations.

In these situations, an aggregation producing a histogram-valued observation would more accurately reflect the set of observations making up that concept class. How a vertices principal component analysis can be developed for histogram data is still an open problem. While it is possible to find the centroid values for histogram-valued observations (see Equations (3.34) and (3.35)), which in turn means that a classical analysis can be applied to these centroid values, how these results can be extended (through a histogram version of Equation (5.3)) also remains an open problem.
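The centroid route mentioned here, namely a classical PCA applied to the interval midpoints with interval components recovered by projecting the endpoints, is the centers method of Section 5.2 and can be sketched as follows. This is our own NumPy illustration, with our own function names, and it ignores the weighting options of the chapter.

```python
import numpy as np

def centers_pca(intervals, n_components=2):
    """Centers-method sketch: classical PCA on the interval midpoints,
    with interval components recovered by projecting the endpoints."""
    intervals = np.asarray(intervals, dtype=float)   # shape (m, p, 2)
    centers = intervals.mean(axis=2)                 # midpoints, (m, p)
    mean = centers.mean(axis=0)
    centered = centers - mean
    cov = centered.T @ centered / len(centers)
    eigvals, eigvecs = np.linalg.eigh(cov)
    axes = eigvecs[:, np.argsort(eigvals)[::-1][:n_components]]   # (p, k)
    # Over a hyperrectangle, the projection on an axis is minimized
    # (maximized) by taking each variable's lower (upper) endpoint when
    # the loading is positive, and the opposite endpoint when negative.
    load = axes.T[:, None, :]                        # (k, 1, p)
    lo = np.where(load >= 0, intervals[..., 0], intervals[..., 1])  # (k, m, p)
    hi = np.where(load >= 0, intervals[..., 1], intervals[..., 0])
    pc_lo = np.einsum('kup,kp->uk', lo - mean, axes.T)
    pc_hi = np.einsum('kup,kp->uk', hi - mean, axes.T)
    return np.stack([pc_lo, pc_hi], axis=-1)         # (m, k, 2)
```

For a trivial (point) observation the lower and upper projections coincide, so its principal components are single points, matching the classical analysis of its centroid.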


[Figure: the four data rectangles u = 1, 2, 3, 4 plotted against Variable Y1 (horizontal axis) and Variable Y2 (vertical axis).]

Figure 5.13 Two artificial datasets.

[Figure: the principal component rectangles u = 1, 2, 3, 4 plotted against the First Principal Component (horizontal axis) and the Second Principal Component (vertical axis).]

Figure 5.14 Vertices principal component: artificial data.


[Figure: the principal component rectangles u = 1, 2, 3, 4 plotted against the First Principal Component (horizontal axis) and the Second Principal Component (vertical axis).]

Figure 5.15 Centers principal component: artificial data.

Exercises

E5.1. Consider the financial data of Table 5.2. Do an unweighted vertices principal component analysis using the random variables Y1 = Job Cost, Y3 = Activity Code, and Y4 = Annual Budget.

E5.2. For the same financial data, do an unweighted vertices principal component analysis on all five random variables.

E5.3. Carry out a weighted (with weights proportional to the number nu) centers principal component analysis on the same financial dataset using the random variables (i) Y1, Y2, Y3, and (ii) Y1, Y3, Y5.

E5.4. Repeat Exercise E5.1 where now only those vertices which contribute a level exceeding α (of Equation (5.24)) on either the first or second principal component, for α = 0.1, 0.2, 0.5, are retained. Compare these results to those for α = 0.

E5.5. Repeat Exercise E5.4 where now the total contribution for both the first and second principal components (i.e., use Equation (5.26)) exceeds α = 0.1, 0.2, 0.5. Compare the results of this question to those from Exercise E5.4.


E5.6. Consider the cars data of Table 5.15. Conduct a vertices principal component analysis using the three variables Y1 = Price, Y3 = Time, and Y4 = Cylinder Capacity.

E5.7. Repeat Exercise E5.6 retaining only those vertices which have a relative contribution to each of the first and second principal components, respectively, exceeding α = 0.1, 0.2, 0.5 (of Equation (5.24)), and compare the results to those of Exercise E5.6.

E5.8. Repeat Exercise E5.6 but retain only those vertices whose total relative contribution exceeds α = 0.1, 0.2, 0.5 (of Equation (5.26)), and compare these results to those of Exercise E5.6 and Exercise E5.7.

E5.9. Find the correlations between the four random variables Y1, ..., Y4 of the cars data and the first two vertices principal components. (Use Equation (5.19).)

E5.10. For the cars dataset, find the inertia Hvu (of Equation (5.21)) for each car and the first and second vertices principal components. What is the contribution of each car to the total variance (see Equation (5.22))?

E5.11. Consider the league teams data of Table 2.24.

(i) Calculate the variance–covariance matrix V = (vij), i, j = 1, 2, 3, for the vertices, underlying the vertices principal component analysis.

(ii) Calculate the variance–covariance matrix C = (cij), i, j = 1, 2, 3, for the centers, underlying the centers principal component analysis.

(iii) Show that

V = C + E

where the matrix E = (eij), i, j = 1, 2, 3, has elements eij = Σu∈E wuj(auj − buj)², where the wuj are constants (weights).

E5.12. Show that the relationship V = C + E (of Exercise E5.11) holds for i, j = 1, ..., p, in general.
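The relationship V = C + E of Exercises E5.11 and E5.12 can be checked numerically. The sketch below is our own NumPy illustration on the Dataset 1 intervals of the artificial data; it takes the equal weights wuj = 1/(4m), a particular choice under which E is diagonal with ejj = Σu (auj − buj)²/(4m).

```python
import itertools

import numpy as np

def cov(x):
    """Empirical covariance with divisor n, applied to vertices or centers."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean(axis=0)
    return d.T @ d / len(x)

# Dataset 1 of the artificial data: intervals [a_uj, b_uj], shape (m, p, 2).
intervals = np.array([[[12, 18], [4, 8]],
                      [[3, 9], [10, 14]],
                      [[12, 20], [11, 19]],
                      [[1, 7], [1, 9]]], dtype=float)
m, p, _ = intervals.shape

# Pool the 2^p vertices of every observation's hyperrectangle.
corners = list(itertools.product([0, 1], repeat=p))
vertices = np.array([obs[np.arange(p), c]
                     for obs in intervals for c in corners])

V = cov(vertices)                      # vertices variance-covariance matrix
C = cov(intervals.mean(axis=2))        # centers variance-covariance matrix
spans = intervals[..., 1] - intervals[..., 0]
# Equal weights w_uj = 1/(4m): E is diagonal, e_jj = sum_u (a_uj - b_uj)^2 / (4m).
E = np.diag((spans ** 2).sum(axis=0) / (4 * m))
assert np.allclose(V, C + E)           # V = C + E holds exactly
```

With this vertex enumeration the check is exact rather than approximate: the off-diagonal vertex covariances coincide with the center covariances, and only the diagonal picks up the within-interval spread term.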

References

Anderson, T. W. (1984). An Introduction to Multivariate Statistical Analysis (2nd ed.), John Wiley & Sons, Inc., New York.

Cazes, P., Chouakria, A., Diday, E., and Schektman, Y. (1997). Extensions de l'Analyse en Composantes Principales à des Données de Type Intervalle. Revue de Statistique Appliquée 24, 5–24.

Chouakria, A. (1998). Extension des Méthodes d'Analyse Factorielle à des Données de Type Intervalle. Doctoral Thesis, Université Paris-Dauphine.

Chouakria, A., Diday, E., and Cazes, P. (1998). An Improved Factorial Representation of Symbolic Objects. In: Knowledge Extraction from Statistical Data. European Commission Eurostat, Luxembourg, 301–305.


Fisher, R. A. (1936). The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics 7, 179–188.

Johnson, R. A. and Wichern, D. W. (1998). Applied Multivariate Statistical Analysis (4th ed.), Prentice Hall, Englewood Cliffs, NJ.

Vu, T. H. T., Vu, T. M. T., and Foo, R. W. S. (2003). Analyse de Données Symboliques sur des Projets Marketing. Technical Report, CEREMADE, Université Paris IX Dauphine.