
7

Cluster Analysis

An important aspect of analyzing data is the formation of groups or clusters of observations, with those observations within a cluster being similar and those between clusters being dissimilar according to some suitably defined dissimilarity or similarity criteria. These clusters, often referred to as classes, are not necessarily those considered in previous chapters. Up to now, a class was the aggregation of observations satisfying a category (e.g., age × gender). The clusters that result satisfy a similarity (or dissimilarity) criterion. However, these cluster outputs can, but need not, be a category.

After describing briefly the basics of partitioning, hierarchical, and pyramidal clustering, we then delve more deeply into each of these in turn. Since these structures are based on dissimilarity measures, we first describe these measures as they pertain to symbolic data.

7.1 Dissimilarity and Distance Measures

7.1.1 Basic definitions

The formation of subsets $C_1, \ldots, C_r$ of E into a partition, hierarchy, or pyramid is governed by similarity $s(a, b)$ or dissimilarity $d(a, b)$ measures between two objects a and b, say. These measures take a variety of forms. Since typically a similarity measure is an inverse functional of its corresponding dissimilarity measure (e.g., $d(a, b) = 1 - s(a, b)$, and the like), we consider just dissimilarity measures. Distance measures are important examples of dissimilarity measures. We will define these entities in terms of an 'object' (which has a description D). These relate to an 'observation' (which has a realization $\xi$); see Chapter 2.


Symbolic Data Analysis: Conceptual Statistics and Data Mining, L. Billard and E. Diday © 2006 John Wiley & Sons, Ltd. ISBN: 978-0-470-09016-9


Definition 7.1: Let a and b be any two objects in E. Then, a dissimilarity measure $d(a, b)$ is a measure that satisfies

(i) $d(a, b) = d(b, a)$;

(ii) $d(a, a) = d(b, b) \le d(a, b)$ for all $a \ne b$;

(iii) $d(a, a) = 0$ for all $a \in E$. □

Definition 7.2: A distance measure, also called a metric, is a dissimilarity measure as defined in Definition 7.1 which further satisfies

(iv) $d(a, b) = 0$ implies $a = b$;

(v) $d(a, b) \le d(a, c) + d(c, b)$ for all $a, b, c \in E$. □

Dissimilarity measures are symmetric from property (i), and property (v) is called the triangular inequality.

Definition 7.3: An ultrametric measure is a distance measure as defined in Definition 7.2 which also satisfies

(vi) $d(a, b) \le \mathrm{Max}\{d(a, c), d(c, b)\}$ for all $a, b, c \in E$. □

We note that it can be shown that ultrametrics are in one-to-one correspondence with hierarchies. Therefore, to compare hierarchies, we can compare their associated ultrametrics.

Definition 7.4: For the collection of objects $a_1, \ldots, a_m$ in E, the dissimilarity matrix (or distance matrix) is the $m \times m$ matrix D with elements $d(a_i, a_j)$, $i, j = 1, \ldots, m$. □

Example 7.1. Consider the observations represented by a, b, c (or, equivalently, $w_1, w_2, w_3$) whose dissimilarity matrix is

$$ D = \begin{pmatrix} 0 & 2 & 1 \\ 2 & 0 & 3 \\ 1 & 3 & 0 \end{pmatrix}. $$

That is, the dissimilarity measure between a and b is $d(a, b) = 2$, while $d(a, c) = 1$ and $d(b, c) = 3$. Since property (v) holds for all a, b, c, this matrix D is also a distance matrix, e.g.,

$$ d(b, c) = 3 \le d(b, a) + d(a, c) = 2 + 1 = 3. \qquad \square $$
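As a quick illustrative check (our own sketch, not from the book), the properties of Definitions 7.1–7.2 can be verified numerically for a matrix such as that of Example 7.1; the helper names below are our own and the input is assumed to be a nested list or NumPy array.

```python
import numpy as np

def is_dissimilarity(D):
    """Check properties (i)-(iii) of Definition 7.1 for a square matrix D."""
    D = np.asarray(D, dtype=float)
    return np.allclose(D, D.T) and np.allclose(np.diag(D), 0.0) and np.all(D >= 0)

def is_distance(D):
    """Additionally check the triangular inequality (v): d(a,b) <= d(a,c) + d(c,b)."""
    D = np.asarray(D, dtype=float)
    m = D.shape[0]
    triangle = all(D[i, j] <= D[i, k] + D[k, j] + 1e-12
                   for i in range(m) for j in range(m) for k in range(m))
    return is_dissimilarity(D) and triangle

D = [[0, 2, 1],
     [2, 0, 3],
     [1, 3, 0]]
print(is_distance(D))   # True, as argued in Example 7.1
```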

Definition 7.5: A dissimilarity (or distance) matrix whose elements monotonically increase as they move away from the diagonal (by column and by row) is called a Robinson matrix. □

Example 7.2. The matrix D in Example 7.1 is not a Robinson matrix since, for example,

$$ d(a, c) = 1 < d(a, b) = 2. $$

However, the matrix

$$ D = \begin{pmatrix} 0 & 1 & 3 \\ 1 & 0 & 2 \\ 3 & 2 & 0 \end{pmatrix} $$

is a Robinson matrix. □
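The Robinson property is also easy to automate. The following sketch (our own illustration, assuming a symmetric matrix given as a nested list or NumPy array) tests whether rows are non-decreasing moving away from the diagonal, and distinguishes the two matrices of Examples 7.1 and 7.2.

```python
import numpy as np

def is_robinson(D):
    """True if elements never decrease moving away from the diagonal along each row
    (for a symmetric matrix, checking rows also covers the columns)."""
    D = np.asarray(D, dtype=float)
    m = D.shape[0]
    for i in range(m):
        right = D[i, i:]        # row i, moving right from the diagonal
        left = D[i, i::-1]      # row i, moving left from the diagonal
        if np.any(np.diff(right) < 0) or np.any(np.diff(left) < 0):
            return False
    return True

print(is_robinson([[0, 2, 1], [2, 0, 3], [1, 3, 0]]))  # False (Example 7.1)
print(is_robinson([[0, 1, 3], [1, 0, 2], [3, 2, 0]]))  # True  (Example 7.2)
```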

Robinson matrices play a crucial role in pyramidal clustering as it can be shown that they are in one-to-one correspondence with indexed pyramids (see Diday, 1989); see Section 7.5. Each clustering process requires a dissimilarity or distance matrix D of some form. Some distance and dissimilarity measures that arise frequently in symbolic clustering methodologies will be presented here. We pay special attention to those measures developed by Gowda and Diday (1991) and Ichino and Yaguchi (1994) and their extensions to Minkowski distances. In many instances, these measures are extensions of a classical counterpart, and in all cases reduce to a classical equivalent as a special case. For excellent reviews of these measures, see Gordon (1999), Esposito et al. (2000), and Gower (1985).

There are a number of dissimilarity and distance measures based on the notions of the Cartesian operators 'join' and 'meet' between two sets A and B, introduced by Ichino (1988). We denote $A = (A_1, \ldots, A_p)$ and $B = (B_1, \ldots, B_p)$. In the context of symbolic data, the set A corresponds to the observed value of a p-dimensional observation $\xi = (Y_1, \ldots, Y_p)$, i.e., $A \equiv \xi$. Thus, when A is a multi-valued symbolic object, $A_j$ takes single values in the space $\mathcal{Y}_j$ (e.g., $A_j$ = {green, red}); and when A is interval valued, $A_j$ takes interval values $[a_j, b_j]$ in $\mathcal{Y}_j \subset \mathbb{R}$ (e.g., $A_j = [3, 7]$). The possible values for $B_j$ are defined analogously. How these notions, and their application to the relevant dissimilarity measures, pertain for modal-valued objects in general is still an open problem (though we shall see below that there are some instances where modal multi-valued objects are covered).

Definition 7.6: The Cartesian join $A \oplus B$ between two sets A and B is their componentwise union,

$$ A \oplus B = (A_1 \oplus B_1, \ldots, A_p \oplus B_p) $$

where $A_j \oplus B_j$ = '$A_j \cup B_j$'. When A and B are multi-valued objects with $A_j = \{a_{j1}, \ldots, a_{js_j}\}$ and $B_j = \{b_{j1}, \ldots, b_{jt_j}\}$, then

$$ A_j \oplus B_j = \{a_{j1}, \ldots, b_{jt_j}\}, \qquad j = 1, \ldots, p, \qquad (7.1) $$

is the set of values in $A_j$, $B_j$, or both. When A and B are interval-valued objects with $A_j = [a_j^A, b_j^A]$ and $B_j = [a_j^B, b_j^B]$, then

$$ A_j \oplus B_j = [\mathrm{Min}(a_j^A, a_j^B), \ \mathrm{Max}(b_j^A, b_j^B)]. \qquad (7.2) \ \square $$


Definition 7.7: The Cartesian meet $A \otimes B$ between two sets A and B is their componentwise intersection,

$$ A \otimes B = (A_1 \otimes B_1, \ldots, A_p \otimes B_p) $$

where

$$ A_j \otimes B_j = A_j \cap B_j, \qquad j = 1, \ldots, p. $$

When A and B are multi-valued objects, then $A_j \otimes B_j$ is the list of possible values from $\mathcal{Y}_j$ common to both. When A and B are interval-valued objects forming overlapping intervals on $Y_j$,

$$ A_j \otimes B_j = [\mathrm{Max}(a_j^A, a_j^B), \ \mathrm{Min}(b_j^A, b_j^B)], \qquad (7.3) $$

and when $A_j \cap B_j = \emptyset$, then $A_j \otimes B_j = 0$. □

Example 7.3. Let A and B be multi-valued symbolic objects with

A = ({red, green}, {small}),
B = ({red, blue}, {small, medium}).

Then, their join is, from Equation (7.1),

A ⊕ B = ({red, green, blue}, {small, medium}),

and their meet is, from Equation (7.3),

A ⊗ B = ({red}, {small}). □

Example 7.4. Let A and B be the interval-valued symbolic objects

A = ([3, 7], [21, 25], [5, 9]),
B = ([5, 8], [19, 24], [6, 11]).

Then, their join is, from Equation (7.2),

A ⊕ B = ([3, 8], [19, 25], [5, 11]),

and from Equation (7.3) their meet is

A ⊗ B = ([5, 7], [21, 24], [6, 9]). □


These same definitions apply to mixed variables, as shown in the following example.

Example 7.5. Let A and B be the symbolic objects

A = ([3, 7], {red, green}),
B = ([5, 8], {red, blue}).

Then, their join and meet are, respectively,

A ⊕ B = ([3, 8], {red, green, blue}),
A ⊗ B = ([5, 7], {red}). □
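To make the join and meet operators concrete, here is a small Python sketch (our own illustration, not code from the book) that represents multi-valued components as sets and interval components as (low, high) tuples, and reproduces the computations of Examples 7.3–7.5.

```python
def join(a, b):
    """Cartesian join of two components: union for sets, interval hull for intervals (Eqs 7.1-7.2)."""
    if isinstance(a, set):
        return a | b
    (la, ha), (lb, hb) = a, b
    return (min(la, lb), max(ha, hb))

def meet(a, b):
    """Cartesian meet of two components: intersection for sets or intervals (Eq. 7.3)."""
    if isinstance(a, set):
        return a & b
    (la, ha), (lb, hb) = a, b
    lo, hi = max(la, lb), min(ha, hb)
    return (lo, hi) if lo <= hi else None   # None signals an empty meet

# Example 7.5: one interval-valued and one multi-valued component
A = [(3, 7), {"red", "green"}]
B = [(5, 8), {"red", "blue"}]
print([join(aj, bj) for aj, bj in zip(A, B)])  # (3, 8) and {red, green, blue}
print([meet(aj, bj) for aj, bj in zip(A, B)])  # (5, 7) and {red}
```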

Two important distance measures are those developed by Gowda and Diday (1991) and Ichino and Yaguchi (1994). We present these distances here in their general form. Their applications to multi-valued variables and to interval-valued variables are presented in Section 7.1.2 and Section 7.1.3, respectively.

Definition 7.8: The Gowda–Diday dissimilarity measure between two sets A and B is

$$ D(A, B) = \sum_{j=1}^{p} \left[ D_{1j}(A_j, B_j) + D_{2j}(A_j, B_j) + D_{3j}(A_j, B_j) \right], \qquad (7.4) $$

where $D_{1j}(A_j, B_j)$ is a measure that relates to the relative sizes (or span) of $A_j$ and $B_j$, $D_{2j}(A_j, B_j)$ relates to the relative content of $A_j$ and $B_j$, and $D_{3j}(A_j, B_j)$ measures their relative position. □

Definition 7.9: The Ichino–Yaguchi dissimilarity measure on the variable $Y_j$ component of the two sets A and B is

$$ \phi_j(A, B) = |A_j \oplus B_j| - |A_j \otimes B_j| + \gamma \left( 2|A_j \otimes B_j| - |A_j| - |B_j| \right), \qquad (7.5) $$

where $|A|$ is the number of elements in A if $Y_j$ is multi-valued and is the length of the interval $A_j$ if $Y_j$ is interval-valued, and where $0 \le \gamma \le 0.5$ is a prespecified constant. □

Definition 7.10: The generalized Minkowski distance of order $q \ge 1$ between two sets A and B is

$$ d_q(A, B) = \left( \sum_{j=1}^{p} \left[ w_j^* \, \phi_j(A, B) \right]^q \right)^{1/q} \qquad (7.6) $$

where $w_j^*$ is an appropriate weight for the distance component $\phi_j(A, B)$ on $Y_j$, $j = 1, \ldots, p$. □


As the number of variables p increases, the Minkowski distance in Equation (7.6) can rise. To account for any inordinate increase in this distance, Equation (7.6) can be replaced by

$$ d_q(A, B) = \left( \frac{1}{p} \sum_{j=1}^{p} \left[ w_j^* \, \phi_j(A, B) \right]^q \right)^{1/q}. \qquad (7.7) $$

Distances weighted to account for different scales of measurement take the form

$$ \phi_j'(A, B) = \phi_j(A, B) / |\mathcal{Y}_j|, \qquad (7.8) $$

where $|\mathcal{Y}_j|$ = number of possible values if $Y_j$ is multi-valued, or $|\mathcal{Y}_j|$ = total length spanned by observations if $Y_j$ is interval-valued. It is easy to show that scale-weighted distances of Equation (7.8) when used in Equation (7.7) are such that $0 \le d_q(A, B) \le 1$. There are two special cases.

Definition 7.11: A generalized Minkowski distance of order q = 1 is called a city block distance when

$$ d(A, B) = \sum_{j=1}^{p} c_j \, \phi_j(A, B), \qquad (7.9) $$

where the city weights are such that $\sum_j c_j = 1$. □

Definition 7.12: A generalized Minkowski distance of order q = 2 is a Euclidean distance when

$$ d(A, B) = \left( \sum_{j=1}^{p} \left[ \phi_j(A, B) \right]^2 \right)^{1/2}. \qquad (7.10) $$

The Euclidean distances of Equation (7.10) can be normalized to account for differences in scale and for large values of p. This gives us the following definition.

Definition 7.13: A normalized Euclidean distance is such that

$$ d(A, B) = \left( \frac{1}{p} \sum_{j=1}^{p} \left[ |\mathcal{Y}_j|^{-1} \phi_j(A, B) \right]^2 \right)^{1/2}. \qquad (7.11) $$

These Definitions 7.8–7.13 all involve the concepts of join and meet in some manner, with the precise nature depending on whether a given random variable $Y_j$ is multi-valued or interval-valued. They can also contain, or not contain, weight factors such as those of Equation (7.6). Examples of their use are provided in the respective Sections 7.1.2–7.1.4 below.
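The generic combination step of Definitions 7.10–7.13 is simple to express in code. The following sketch (our own, assuming the per-variable components $\phi_j$ have already been computed) implements the weighted Minkowski combination of Equations (7.6) and (7.7); the function and parameter names are illustrative.

```python
def minkowski(phis, q=1, weights=None, normalize_p=False):
    """Combine per-variable components phi_j into a generalized Minkowski
    distance (Eqs 7.6-7.7); weights play the role of the w_j* factors."""
    p = len(phis)
    w = weights if weights is not None else [1.0] * p
    total = sum((wj * fj) ** q for wj, fj in zip(w, phis))
    if normalize_p:
        total /= p
    return total ** (1.0 / q)

# City block (q = 1) and Euclidean (q = 2) combinations of two components
print(minkowski([1.0, 0.5], q=1))   # 1.5
print(minkowski([1.0, 0.5], q=2))   # 1.118...
```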


7.1.2 Multi-valued variables

Let $Y = (Y_1, \ldots, Y_p)$ be a p-dimensional multi-valued random variable with $Y_j$ taking values in $\mathcal{Y}_j = \{Y_{j1}, \ldots, Y_{js_j}\}$. Without loss of generality, let us first assume that each $Y_j$ is modal-valued and that all possible values in $\mathcal{Y}_j$ occur. The observation $w_u$ in E therefore can be written in the form, for each $u = 1, \ldots, m$,

$$ \xi(w_u) = \big( \{Y_{u1k_1}, p_{u1k_1}; \ k_1 = 1, \ldots, s_1\}, \ \ldots, \ \{Y_{upk_p}, p_{upk_p}; \ k_p = 1, \ldots, s_p\} \big) \qquad (7.12) $$

where $p_{ujk_j}$ is the relative frequency of $Y_{ujk_j}$. Then, those values $Y_{jk_j}$ in $\mathcal{Y}_j$ which do not occur in the observation $\xi$ have corresponding values of $p_{jk_j} \equiv 0$. Likewise, for a non-modal multi-valued random variable, these $p_{jk_j} \equiv 1/n_{uj}$ where $n_{uj}$ is the number of values from $\mathcal{Y}_j$ which do occur.

Example 7.6. Let $\mathcal{Y}_1$ = {blue, green, red}. Then the observation

$\xi_1$ = {blue, green}

is rewritten as

$\xi_1$ = {blue, 0.5; green, 0.5; red, 0}.

Likewise, the observation

$\xi_2$ = {blue, 0.4; red, 0.6}

becomes

$\xi_2$ = {blue, 0.4; green, 0.0; red, 0.6}. □

Definition 7.14: For multi-valued modal data of the form of Equation (7.12), a categorical distance measure between any two observations $\xi(w_{u_1})$ and $\xi(w_{u_2})$ is $d(w_{u_1}, w_{u_2})$ where

$$ d^2(w_{u_1}, w_{u_2}) = \sum_{j=1}^{p} \sum_{k_j=1}^{s_j} \left( p \sum_{u=1}^{m} p_{ujk_j} \right)^{-1} \left( p_{u_1 j k_j} - p_{u_2 j k_j} \right)^2. \qquad (7.13) \ \square $$

Example 7.7. Consider the observations

$\xi(w_1)$ = ({blue, 0.4; green, 0.6}, {urban}),
$\xi(w_2)$ = ({blue, 0.3; green, 0.4; red, 0.3}, {urban, rural}),
$\xi(w_3)$ = ({blue}, {urban, 0.4; rural, 0.6}).

When transformed to the format of Equation (7.12), the relative frequencies $p_{ujk_j}$, $u = 1, 2, 3$, $k_1 = 1, 2, 3$, $k_2 = 1, 2$, $j = 1, 2$, take the values as shown in Table 7.1.


Table 7.1 Relative frequencies p_ujkj – Example 7.7 data.

wu     Y1 = Color                 Y2 = Habitat         pu
       blue    green   red        urban   rural
w1     0.4     0.6     0.0        1.0     0.0          2
w2     0.3     0.4     0.3        0.5     0.5          2
w3     1.0     0.0     0.0        0.4     0.6          2

Sum    1.7     1.0     0.3        1.9     1.1          6

Then, from Equation (7.13), the squared distance between the observations $\xi(w_1)$ and $\xi(w_2)$ is

$$ d^2(w_1, w_2) = \frac{1}{2} \left[ \frac{(0.4-0.3)^2}{1.7} + \frac{(0.6-0.4)^2}{1.0} + \frac{(0.0-0.3)^2}{0.3} + \frac{(1.0-0.5)^2}{1.9} + \frac{(0.0-0.5)^2}{1.1} \right] = 0.352. $$

Likewise, we obtain

$$ d^2(w_1, w_3) = 0.544, \qquad d^2(w_2, w_3) = 0.381. $$

Hence, the distance matrix is

$$ D = \begin{pmatrix} 0 & 0.593 & 0.738 \\ 0.593 & 0 & 0.617 \\ 0.738 & 0.617 & 0 \end{pmatrix}. $$

Notice that this D is a Robinson matrix. □
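As an illustration (not from the book), the categorical distance of Equation (7.13) can be computed directly from the relative-frequency table. The sketch below, with hypothetical function names, assumes each observation is stored as one row of frequencies over a common ordering of all categories, and reproduces the matrix of Example 7.7.

```python
import numpy as np

def categorical_dist(freqs, p):
    """Distance matrix from Eq. (7.13); freqs holds rows of relative frequencies
    p_ujk over all categories of all p variables, as in Table 7.1."""
    X = np.asarray(freqs, dtype=float)
    m = X.shape[0]
    col_sum = X.sum(axis=0)            # the Sum row of Table 7.1
    D = np.zeros((m, m))
    for u1 in range(m):
        for u2 in range(m):
            D[u1, u2] = np.sqrt(np.sum((X[u1] - X[u2]) ** 2 / (p * col_sum)))
    return D

# Rows of Table 7.1: (blue, green, red, urban, rural)
freqs = [[0.4, 0.6, 0.0, 1.0, 0.0],
         [0.3, 0.4, 0.3, 0.5, 0.5],
         [1.0, 0.0, 0.0, 0.4, 0.6]]
print(np.round(categorical_dist(freqs, p=2), 3))   # 0.593, 0.738, 0.617 off-diagonal
```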

There are two measures that apply to non-modal multi-valued observations. For these data, let the format be written as

$$ \xi(w_u) = \big( \{Y_{u1k_1^u}; \ k_1^u = 1, \ldots, s_1\}, \ \ldots, \ \{Y_{upk_p^u}; \ k_p^u = 1, \ldots, s_p\} \big). \qquad (7.14) $$

Definition 7.15: The Gowda–Diday dissimilarity measure between two multi-valued observations $\xi(w_1)$ and $\xi(w_2)$ of the form of Equation (7.14) is

$$ D(w_1, w_2) = \sum_{j=1}^{p} \left[ D_{1j}(w_1, w_2) + D_{2j}(w_1, w_2) \right], $$

where

$$ D_{1j}(w_1, w_2) = |k_j^1 - k_j^2| / k_j, \qquad j = 1, \ldots, p, \qquad (7.15) $$

$$ D_{2j}(w_1, w_2) = (k_j^1 + k_j^2 - 2k_j^*) / k_j, \qquad j = 1, \ldots, p, \qquad (7.16) $$

where $k_j$ is the number of values from $\mathcal{Y}_j$ in the join $\xi(w_1) \oplus \xi(w_2)$ and $k_j^*$ is the number of values in the meet $\xi(w_1) \otimes \xi(w_2)$ for the variable $Y_j$. (The relative position component of Definition 7.8 is zero for multi-valued data.) □

Definition 7.16: The Ichino–Yaguchi dissimilarity measure (or distance) between the two multi-valued observations $\xi(w_1)$ and $\xi(w_2)$ of the form of Equation (7.14) for the variable $Y_j$ is

$$ \phi_j(w_1, w_2) = k_j - k_j^* + \gamma(2k_j^* - k_j^1 - k_j^2), \qquad (7.17) $$

where $k_j$, $k_j^*$, $k_j^1$, and $k_j^2$ have the same meaning as in Definition 7.15 and where $0 \le \gamma \le 0.5$ is a prespecified constant. □

Example 7.8. The data of Table 7.2 give values for Color $Y_1$ from $\mathcal{Y}_1$ = {red, black, blue} and Habitat $Y_2$ from $\mathcal{Y}_2$ = {urban, rural} for m = 4 species of birds.

Table 7.2 Birds – Color and Habitat.

wu    Species     (Color, Habitat)
w1    species1    ({red, black}; {urban, rural})
w2    species2    ({red}; {urban})
w3    species3    ({red, black, blue}; {rural})
w4    species4    ({red, black, blue}; {urban, rural})

The components of the Gowda–Diday distances for each variable $(Y_1, Y_2)$ and for each pair of species $(w_{u_1}, w_{u_2})$ are displayed in Table 7.3. For example, consider the pair (species1, species2). From Equation (7.15), the span distance component for $Y_1$ is

$$ D_{11}(w_1, w_2) = |2 - 1|/2 = 1/2, $$

and the content distance component from Equation (7.16) for $Y_1$ is

$$ D_{21}(w_1, w_2) = (2 + 1 - 2 \times 1)/2 = 1/2. $$

Hence, for $Y_1$, the distance measure is

$$ D_1(w_1, w_2) = 1/2 + 1/2 = 1. $$

Likewise, for the variable $Y_2$, we obtain

$$ D_2(w_1, w_2) = 1/2 + 1/2 = 1. $$

Hence, the unweighted Gowda–Diday distance between species1 and species2 is

$$ D(w_1, w_2) = D_1(w_1, w_2) + D_2(w_1, w_2) = 2. $$


Table 7.3 Gowda–Diday distances: birds – Color and Habitat.

(wu1, wu2)    Y1 = Color               Y2 = Habitat             (Y1, Y2)
              D11    D21    D1         D12    D22    D2         D
(w1, w2)      1/2    1/2    1          1/2    1/2    1          2
(w1, w3)      1/3    1/3    2/3        1/2    1/2    1          5/3
(w1, w4)      1/3    1/3    2/3        0      0      0          2/3
(w2, w3)      2/3    2/3    4/3        0      1      1          7/3
(w2, w4)      2/3    2/3    4/3        1/2    1/2    1          7/3
(w3, w4)      0      0      0          1/2    1/2    1          1

The Gowda–Diday distance matrix becomes

$$ D = \begin{pmatrix} 0 & 2 & 5/3 & 2/3 \\ 2 & 0 & 7/3 & 7/3 \\ 5/3 & 7/3 & 0 & 1 \\ 2/3 & 7/3 & 1 & 0 \end{pmatrix}. $$

When these distances are normalized to adjust for scale, the weights for the variables $Y_1$ and $Y_2$ are $|\mathcal{Y}_1| = 3$ and $|\mathcal{Y}_2| = 2$, respectively. Thus, for example, the weighted distance between species1 and species2 is

$$ d(w_1, w_2) = 1/3 + 1/2 = 5/6. $$

The weighted distance matrix is

$$ D = \begin{pmatrix} 0 & 5/6 & 13/18 & 2/9 \\ 5/6 & 0 & 17/18 & 17/18 \\ 13/18 & 17/18 & 0 & 1/2 \\ 2/9 & 17/18 & 1/2 & 0 \end{pmatrix}. \qquad \square $$
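The following Python sketch (our own illustration, with hypothetical names) computes the Gowda–Diday components of Equations (7.15)–(7.16) for multi-valued observations stored as tuples of sets, and reproduces the unweighted and scale-weighted distances of Example 7.8.

```python
from itertools import combinations

def gowda_diday_multival(x, y, domain_sizes=None):
    """Sum over variables of the span + content components (Eqs 7.15-7.16);
    supplying domain_sizes (|Y_j|) gives the scale-weighted version."""
    total = 0.0
    for j, (Aj, Bj) in enumerate(zip(x, y)):
        k_join, k_meet = len(Aj | Bj), len(Aj & Bj)
        d1 = abs(len(Aj) - len(Bj)) / k_join              # span component
        d2 = (len(Aj) + len(Bj) - 2 * k_meet) / k_join    # content component
        w = 1.0 / domain_sizes[j] if domain_sizes else 1.0
        total += w * (d1 + d2)
    return total

birds = {"w1": ({"red", "black"}, {"urban", "rural"}),
         "w2": ({"red"}, {"urban"}),
         "w3": ({"red", "black", "blue"}, {"rural"}),
         "w4": ({"red", "black", "blue"}, {"urban", "rural"})}
for a, b in combinations(birds, 2):
    print(a, b,
          round(gowda_diday_multival(birds[a], birds[b]), 3),          # unweighted
          round(gowda_diday_multival(birds[a], birds[b], [3, 2]), 3))  # scale-weighted
```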

Example 7.9. We can obtain the Ichino–Yaguchi distances for the data of Table 7.2 by substitution into Equation (7.17). For example, for species1 and species4 we have, for the variables $Y_1$ and $Y_2$, respectively,

$$ \phi_1(w_1, w_4) = 3 - 2 + \gamma(2 \times 2 - 3 - 2) = 1 + \gamma(-1), $$

and

$$ \phi_2(w_1, w_4) = 2 - 2 + \gamma(2 \times 2 - 2 - 2) = 0. $$

These distances for each pair of species are collected in Table 7.4. These distances are now substituted into Equation (7.6) to obtain the Minkowski distance of order q. For example, taking $\gamma = 1/2$, we have the unweighted Minkowski distance between species2 and species4 as

$$ d_q(w_2, w_4) = (1^q + 0.5^q)^{1/q}. \qquad (7.18) $$


Table 7.4 Ichino–Yaguchi distances: birds – Color and Habitat.

(wu1, wu2)    φ1(wu1, wu2)    φ2(wu1, wu2)    Unweighted          Weighted
              (Y1)            (Y2)            q = 1    q = 2      q = 1    q = 2
(w1, w2)      1 + (−1)γ       1 + (−1)γ       0.500    0.707      0.208    0.300
(w1, w3)      1 + (−1)γ       1 + (−1)γ       0.500    0.707      0.208    0.300
(w1, w4)      1 + (−1)γ       0               0.250    0.500      0.083    0.167
(w2, w3)      2 + (−2)γ       2 + (−2)γ       1.000    1.414      0.417    0.601
(w2, w4)      2 + (−2)γ       1 + (−1)γ       0.750    1.118      0.292    0.417
(w3, w4)      0               1 + (−1)γ       0.250    0.500      0.125    0.250

Were these distances to be weighted to account for scale, this would become

$$ d_q(w_2, w_4) = \left[ (1/3)^q + (0.5/2)^q \right]^{1/q}. \qquad (7.19) $$

Then the city block distances are obtained by substituting q = 1 into Equation (7.18) or Equation (7.19) for the unweighted or weighted distances, respectively. The resulting distances for $c_1 = c_2$ are summarized in Table 7.4. Hence, the unweighted city block distance matrix is

$$ D = \begin{pmatrix} 0 & 0.50 & 0.50 & 0.25 \\ 0.50 & 0 & 1.00 & 0.75 \\ 0.50 & 1.00 & 0 & 0.25 \\ 0.25 & 0.75 & 0.25 & 0 \end{pmatrix}. $$

The Euclidean distances are found by substituting q = 2 into Equation (7.18) or Equation (7.19) for the unweighted or weighted distances, respectively. These values are given in Table 7.4. Therefore, we see that the weighted Euclidean distance matrix for these data is

$$ D = \begin{pmatrix} 0 & 0.30 & 0.30 & 0.17 \\ 0.30 & 0 & 0.60 & 0.42 \\ 0.30 & 0.60 & 0 & 0.25 \\ 0.17 & 0.42 & 0.25 & 0 \end{pmatrix}. \qquad \square $$
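A sketch of the multi-valued Ichino–Yaguchi computation (our own code, not the book's): $\phi_j$ from Equation (7.17) for each variable, then the unweighted city block and Euclidean combinations used in Example 7.9.

```python
def phi_multival(Aj, Bj, gamma=0.5):
    """Ichino-Yaguchi component of Eq. (7.17) for one multi-valued variable."""
    k_join, k_meet = len(Aj | Bj), len(Aj & Bj)
    return k_join - k_meet + gamma * (2 * k_meet - len(Aj) - len(Bj))

def dist(x, y, q=2, gamma=0.5, city_weights=None):
    """Minkowski combination of the phi_j, in the style of Eq. (7.18)."""
    phis = [phi_multival(a, b, gamma) for a, b in zip(x, y)]
    c = city_weights if city_weights is not None else [1.0] * len(phis)
    return sum((cj * f) ** q for cj, f in zip(c, phis)) ** (1 / q)

w2 = ({"red"}, {"urban"})
w4 = ({"red", "black", "blue"}, {"urban", "rural"})
print(dist(w2, w4, q=1, city_weights=[0.5, 0.5]))  # 0.75  (city block, c1 = c2 = 1/2)
print(dist(w2, w4, q=2))                            # 1.118 (unweighted Euclidean)
```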

7.1.3 Interval-valued variables

There are a number of dissimilarity and distance measures for interval-valued data, some of which have a kind of analogous counterpart to those for multi-valued observations and some of which are unique to interval-valued data. Let us denote an interval-valued realization $\xi_u$ for the observation $w_u$ by

$$ \xi_u = \left( [a_{uj}, b_{uj}], \ j = 1, \ldots, p \right), \qquad u = 1, \ldots, m. \qquad (7.20) $$


Definition 7.17: The Gowda–Diday dissimilarity measure between two interval-valued observations $w_{u_1}$ and $w_{u_2}$ of the form of Equation (7.20) is given by

$$ D(w_{u_1}, w_{u_2}) = \sum_{j=1}^{p} D_j(w_{u_1}, w_{u_2}) \qquad (7.21) $$

where for the variable $Y_j$, the distance is

$$ D_j(w_{u_1}, w_{u_2}) = \sum_{k=1}^{3} D_{jk}(w_{u_1}, w_{u_2}) \qquad (7.22) $$

with

$$ D_{j1}(w_{u_1}, w_{u_2}) = \left| (b_{u_1 j} - a_{u_1 j}) - (b_{u_2 j} - a_{u_2 j}) \right| / k_j \qquad (7.23) $$

where

$$ k_j = \left| \mathrm{Max}(b_{u_1 j}, b_{u_2 j}) - \mathrm{Min}(a_{u_1 j}, a_{u_2 j}) \right| $$

is the length of the entire distance spanned by $w_{u_1}$ and $w_{u_2}$; with

$$ D_{j2}(w_{u_1}, w_{u_2}) = \left[ (b_{u_1 j} - a_{u_1 j}) + (b_{u_2 j} - a_{u_2 j}) - 2 I_j \right] / k_j \qquad (7.24) $$

where $I_j$ is the length of the intersection of the intervals $[a_{u_1 j}, b_{u_1 j}]$ and $[a_{u_2 j}, b_{u_2 j}]$, i.e.,

$$ I_j = \left| \mathrm{Max}(a_{u_1 j}, a_{u_2 j}) - \mathrm{Min}(b_{u_1 j}, b_{u_2 j}) \right| $$

if the intervals overlap, and $I_j = 0$ if not; and with

$$ D_{j3}(w_{u_1}, w_{u_2}) = |a_{u_1 j} - a_{u_2 j}| / |\mathcal{Y}_j| \qquad (7.25) $$

where $|\mathcal{Y}_j|$ is the total length in $\mathbb{R}$ covered by the observed values of $Y_j$, i.e.,

$$ |\mathcal{Y}_j| = \max_u (b_{uj}) - \min_u (a_{uj}). \qquad (7.26) $$

The components $D_{jk}(w_{u_1}, w_{u_2})$, $k = 1, 2$, of Equations (7.23)–(7.24) are counterparts of the span and content components of the Gowda–Diday distance for multi-valued variables given in Definition 7.15. The third component $D_{j3}(w_{u_1}, w_{u_2})$ is a measure of the relative positions of the two observations.

Example 7.10. We consider the veterinary clinic data of Table 7.5, and take the first three observations only. We first calculate the Gowda–Diday distances for the random variable $Y_1$ = Height. Then, substituting the data values from Table 7.5 into Equation (7.23), we obtain the distance component between the male and female horses as

$$ D_{11}(\text{HorseM, HorseF}) = \frac{|(180-120) - (160-158)|}{|\mathrm{Max}(180, 160) - \mathrm{Min}(120, 158)|} = (60-2)/(180-120) = 0.967. $$

Table 7.5 Veterinary data.

wu     Animal     Y1 = Height        Y2 = Weight
w1     HorseM     [120.0, 180.0]     [222.2, 354.0]
w2     HorseF     [158.0, 160.0]     [322.0, 355.0]
w3     BearM      [175.0, 185.0]     [117.2, 152.0]
w4     DeerM      [37.9, 62.9]       [22.2, 35.0]
w5     DeerF      [25.8, 39.6]       [15.0, 36.2]
w6     DogF       [22.8, 58.6]       [15.0, 51.8]
w7     RabbitM    [22.0, 45.0]       [0.8, 11.0]
w8     RabbitF    [18.0, 53.0]       [0.4, 2.5]
w9     CatM       [40.3, 55.8]       [2.1, 4.5]
w10    CatF       [38.4, 72.4]       [2.5, 6.1]

Likewise, by substituting into Equation (7.24), we obtain

$$ D_{12}(\text{HorseM, HorseF}) = \frac{(180-120) + (160-158) - 2(160-158)}{180-120} = 0.967, $$

and by substituting into Equation (7.25), we find

$$ D_{13}(\text{HorseM, HorseF}) = \frac{|120-158|}{\mathrm{Max}(180, 160, 185) - \mathrm{Min}(120, 158, 175)} = 38/(185-120) = 0.584. $$

Hence, the Gowda–Diday distance for the variable $Y_1$ is

$$ D_1(\text{HorseM, HorseF}) = 0.967 + 0.967 + 0.584 = 2.518. $$

For the random variable $Y_2$ = Weight, we likewise obtain

$$ D_{21}(\text{HorseM, HorseF}) = 0.744, \quad D_{22}(\text{HorseM, HorseF}) = 0.759, \quad D_{23}(\text{HorseM, HorseF}) = 0.419, $$

and hence

$$ D_2(\text{HorseM, HorseF}) = 1.922. $$


Therefore, the Gowda–Diday distance between the male and female horses over all variables is

$$ D(\text{HorseM, HorseF}) = 2.518 + 1.922 = 4.440. $$

Each distance component for each variable and each animal pair is shown in Table 7.6. From these results, the Gowda–Diday distance matrix for the male and female horses and the male bears is

$$ D = \begin{pmatrix} 0 & 4.44 & 3.67 \\ 4.44 & 0 & 2.13 \\ 3.67 & 2.13 & 0 \end{pmatrix}. \qquad \square $$

Table 7.6 Gowda–Diday distances – horses and bears.

(wu1, wu2)          Y1 = Height                        Y2 = Weight                        (Y1, Y2)
                    D11     D12     D13     D1         D21     D22     D23     D2         D
(HorseM, HorseF)    0.967   0.967   0.584   2.518      0.744   0.759   0.419   1.922      4.440
(HorseM, BearM)     0.769   0.923   0.846   2.538      0.409   0.703   0.021   1.133      3.671
(HorseF, BearM)     0.296   0.444   0.231   0.971      0.008   0.285   0.861   1.154      2.125
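A compact Python sketch of Equations (7.21)–(7.26) (our own illustration, with assumed helper names): each observation is a list of (a, b) intervals, the span $|\mathcal{Y}_j|$ is computed from all observations passed in, and the code reproduces the first row of Table 7.6.

```python
def gowda_diday_interval(x, y, spans):
    """Gowda-Diday distance for interval-valued observations (Eqs 7.21-7.26).
    x, y: lists of (a, b) intervals; spans: |Y_j| over all observations."""
    total = 0.0
    for (a1, b1), (a2, b2), span in zip(x, y, spans):
        k = max(b1, b2) - min(a1, a2)                    # length spanned by both intervals
        inter = max(0.0, min(b1, b2) - max(a1, a2))       # I_j, zero if disjoint
        d1 = abs((b1 - a1) - (b2 - a2)) / k               # relative span
        d2 = ((b1 - a1) + (b2 - a2) - 2 * inter) / k      # relative content
        d3 = abs(a1 - a2) / span                          # relative position
        total += d1 + d2 + d3
    return total

data = {"HorseM": [(120.0, 180.0), (222.2, 354.0)],
        "HorseF": [(158.0, 160.0), (322.0, 355.0)],
        "BearM":  [(175.0, 185.0), (117.2, 152.0)]}
spans = [max(b for a, b in (v[j] for v in data.values())) -
         min(a for a, b in (v[j] for v in data.values())) for j in range(2)]
print(round(gowda_diday_interval(data["HorseM"], data["HorseF"], spans), 2))  # 4.44, cf. Table 7.6
```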

Definition 7.18: The Ichino–Yaguchi dissimilarity measure between the two interval-valued observations $w_{u_1}$ and $w_{u_2}$ of the form of Equation (7.20) for the interval-valued variable $Y_j$ is

$$ \phi_j(w_{u_1}, w_{u_2}) = |w_{u_1 j} \oplus w_{u_2 j}| - |w_{u_1 j} \otimes w_{u_2 j}| + \gamma \left( 2|w_{u_1 j} \otimes w_{u_2 j}| - |w_{u_1 j}| - |w_{u_2 j}| \right), \qquad (7.27) $$

where as before $|A|$ is the length of the interval $A = [a, b]$, i.e., $|A| = b - a$, and $0 \le \gamma \le 0.5$ is a prespecified constant. □

Definition 7.19: The generalized (weighted) Minkowski distance of order q for interval-valued objects $w_{u_1}$ and $w_{u_2}$ is

$$ d_q(w_{u_1}, w_{u_2}) = \left( \sum_{j=1}^{p} \left[ w_j^* \, \phi_j(w_{u_1}, w_{u_2}) \right]^q \right)^{1/q} \qquad (7.28) $$

where $\phi_j(w_{u_1}, w_{u_2})$ is the Ichino–Yaguchi distance of Equation (7.27) and where $w_j^*$ is a weight function associated with $Y_j$.

The city block distance is a Minkowski distance of order q = 1, namely,

$$ d(w_{u_1}, w_{u_2}) = \sum_{j=1}^{p} c_j w_j^* \, \phi_j(w_{u_1}, w_{u_2}) \qquad (7.29) $$


where $c_j > 0$, $\sum_j c_j = 1$, and the normalized Euclidean distance is a Minkowski distance of order q = 2, i.e.,

$$ d(w_{u_1}, w_{u_2}) = \left( \frac{1}{p} \sum_{j=1}^{p} \left[ w_j^* \, \phi_j(w_{u_1}, w_{u_2}) \right]^2 \right)^{1/2}. \qquad (7.30) $$

These definitions for a Minkowski distance are completely analogous with those for multi-valued objects; the difference is in the nature of the Ichino–Yaguchi dissimilarity measure $\phi_j(w_{u_1}, w_{u_2})$.

A weight function which takes account of the scale of measurements on $Y_j$ is $w_j^* = 1/|\mathcal{Y}_j|$ where $|\mathcal{Y}_j|$ is the span of the real line $\mathbb{R}^1$ covered by the observations as defined in Equation (7.26).

Example 7.11. We consider again the veterinary clinic data of Table 7.5 and take the first three classes, male and female horses and male bears. The Ichino–Yaguchi dissimilarity measures for each variable $Y_j$, obtained from Equation (7.27), are shown in Table 7.7. For example,

$$ \phi_1(\text{HorseM, BearM}) = (185-120) - (180-175) + \gamma(2 \times 5 - 60 - 10) = 60 + \gamma(-60), $$

$$ \phi_2(\text{HorseM, BearM}) = (354-117.2) - 0 + \gamma(2 \times 0 - 131.8 - 34.8) = 236.8 + \gamma(-166.6). $$

Taking $\gamma = 0.5$ and substituting into Equation (7.28) gives the Minkowski distances. Weighted distances obtained by taking $w_j^* = 1/|\mathcal{Y}_j|$ can also be found. Here,

$$ w_1^* = 1/[\mathrm{Max}(180, 160, 185) - \mathrm{Min}(120, 158, 175)] = 1/65 $$

and similarly,

$$ w_2^* = 1/(355 - 117.2) = 1/237.8. $$

The weighted (by $w_j^*$) and unweighted distances for q = 1, 2 are also shown in Table 7.7. □

Table 7.7 Ichino–Yaguchi distances – horses and bears.

(wu1, wu2)          φ1(wu1, wu2)      φ2(wu1, wu2)          Unweighted           Weighted
                    (j = 1)           (j = 2)               q = 1    q = 2       q = 1    q = 2
(HorseM, HorseF)    58 + (−58)γ       100.8 + (−100.8)γ     79.4     58.1        0.658    0.494
(HorseM, BearM)     60 + (−60)γ       236.8 + (−166.6)γ     183.5    156.4       1.107    0.794
(HorseF, BearM)     27 + (−12)γ       237.8 + (−67.8)γ      224.9    205.0       1.181    0.916
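The interval version of Equation (7.27) and the weighted Minkowski combination of Equation (7.28) can be sketched as follows (our own illustration, with assumed function names); it reproduces the (HorseM, BearM) entries of Table 7.7.

```python
def phi_interval(a, b, gamma=0.5):
    """Ichino-Yaguchi component (Eq. 7.27) for one interval-valued variable."""
    (a1, b1), (a2, b2) = a, b
    join = max(b1, b2) - min(a1, a2)
    meet = max(0.0, min(b1, b2) - max(a1, a2))
    return join - meet + gamma * (2 * meet - (b1 - a1) - (b2 - a2))

def minkowski_interval(x, y, q=1, gamma=0.5, weights=None):
    """Weighted Minkowski combination (Eq. 7.28) over all variables."""
    w = weights if weights is not None else [1.0] * len(x)
    phis = [phi_interval(a, b, gamma) for a, b in zip(x, y)]
    return sum((wj * f) ** q for wj, f in zip(w, phis)) ** (1 / q)

horse_m = [(120.0, 180.0), (222.2, 354.0)]
bear_m = [(175.0, 185.0), (117.2, 152.0)]
print(round(minkowski_interval(horse_m, bear_m, q=1), 1))          # 183.5 (unweighted, q = 1)
print(round(minkowski_interval(horse_m, bear_m, q=2,
                               weights=[1/65, 1/237.8]), 3))       # 0.794 (weighted Euclidean)
```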


Extensions of Minkowski distances of order q = 2 for interval-valued data play an important role in the divisive clustering methodology (of Section 7.4). Given this role these extensions are described herein; the general case of order q follows readily.

Definition 7.20: The Hausdorff distance for the variable $Y_j$ between two interval-valued observations $\xi_{u_1}$ and $\xi_{u_2}$ is

$$ \phi_j(w_{u_1}, w_{u_2}) = \mathrm{Max}\left\{ |a_{u_1 j} - a_{u_2 j}|, \ |b_{u_1 j} - b_{u_2 j}| \right\}. \qquad (7.31) \ \square $$

Example 7.12. For the horses and male bears observations of Table 7.5, we can see that the Hausdorff distances are

$$ \phi_1(\text{HorseM, HorseF}) = \mathrm{Max}\{|120-158|, |180-160|\} = 38, $$
$$ \phi_2(\text{HorseM, HorseF}) = \mathrm{Max}\{|222.2-322|, |354-355|\} = 99.8. $$

Likewise,

$$ \phi_1(\text{HorseM, BearM}) = 55, \qquad \phi_2(\text{HorseM, BearM}) = 202, $$
$$ \phi_1(\text{HorseF, BearM}) = 27, \qquad \phi_2(\text{HorseF, BearM}) = 204.8. \qquad \square $$

Definition 7.21: The Euclidean Hausdorff distance matrix for two interval-valued observations $\xi_{u_1}$ and $\xi_{u_2}$ is $D = (d(w_{u_1}, w_{u_2}))$, $u_1, u_2 = 1, \ldots, m$, where

$$ d(w_{u_1}, w_{u_2}) = \left( \sum_{j=1}^{p} \left[ \phi_j(w_{u_1}, w_{u_2}) \right]^2 \right)^{1/2} \qquad (7.32) $$

and $\phi_j(w_{u_1}, w_{u_2})$ are the Hausdorff distances defined in Equation (7.31). □

Definition 7.22: A normalized Euclidean Hausdorff distance matrix has elements

$$ d(w_{u_1}, w_{u_2}) = \left( \sum_{j=1}^{p} \left[ \frac{\phi_j(w_{u_1}, w_{u_2})}{H_j} \right]^2 \right)^{1/2} \qquad (7.33) $$

where

$$ H_j^2 = \frac{1}{2m^2} \sum_{u_1=1}^{m} \sum_{u_2=1}^{m} \left[ \phi_j(w_{u_1}, w_{u_2}) \right]^2 \qquad (7.34) $$

and where $\phi_j(w_{u_1}, w_{u_2})$ is the Hausdorff distance of Equation (7.31). □


Example 7.13. From the Hausdorff distances found in Example 7.12 between the male and female horses and the male bears, it follows by substituting into Equation (7.32) that the Euclidean Hausdorff distance matrix is

$$ D = \begin{pmatrix} 0 & 106.79 & 209.35 \\ 106.79 & 0 & 206.57 \\ 209.35 & 206.57 & 0 \end{pmatrix} $$

where, for example,

$$ d(\text{HorseM, HorseF}) = (38^2 + 99.8^2)^{1/2} = 106.79. $$

To find the normalized distance matrix, we first need, from Equation (7.34),

$$ H_1^2 = \frac{1}{2 \times 3^2} \left[ 38^2 + 55^2 + 27^2 \right] = 288.778 $$

and similarly,

$$ H_2^2 = 5150.39. $$

Hence, from Equation (7.33) we obtain the normalized Euclidean Hausdorff distance matrix as

$$ D = \begin{pmatrix} 0 & 2.63 & 4.29 \\ 2.63 & 0 & 3.27 \\ 4.29 & 3.27 & 0 \end{pmatrix}. $$

For example,

$$ d(\text{HorseM, HorseF}) = \left[ \left( \frac{38}{16.99} \right)^2 + \left( \frac{99.8}{71.77} \right)^2 \right]^{1/2} = 2.633. \qquad \square $$

Definition 7.23: A span normalized Euclidean Hausdorff distance matrix has elements

$$ d(w_{u_1}, w_{u_2}) = \left( \sum_{j=1}^{p} \left[ \frac{\phi_j(w_{u_1}, w_{u_2})}{|\mathcal{Y}_j|} \right]^2 \right)^{1/2} \qquad (7.35) $$

where $|\mathcal{Y}_j|$ is the span of the observations $Y_j$ defined in Equation (7.26). □

Example 7.14. For our female and male horses and male bears observations, the spans for $Y_1$ and $Y_2$ are, respectively,

$$ |\mathcal{Y}_1| = 65, \qquad |\mathcal{Y}_2| = 237.8. $$

Then, for example,

$$ d(\text{HorseM, HorseF}) = \left[ \left( \frac{38}{65} \right)^2 + \left( \frac{99.8}{237.8} \right)^2 \right]^{1/2} = 0.720. $$

We obtain the span normalized Euclidean Hausdorff distance matrix as

$$ D = \begin{pmatrix} 0 & 0.72 & 1.20 \\ 0.72 & 0 & 0.94 \\ 1.20 & 0.94 & 0 \end{pmatrix}. \qquad \square $$

The first normalization of Definition 7.22 is a type of standard deviation corresponding to symbolic data. If the data are classical data, then the Hausdorff distances are equivalent to Euclidean distances on $\mathbb{R}^2$, in which case $H_j$ corresponds exactly to the standard deviation of $Y_j$. We refer to this as the dispersion normalization. The normalization of Definition 7.23 is based on the length of the domain of $\mathcal{Y}_j$ actually observed, i.e., the maximum deviation. We refer to this as the span normalization.
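The Hausdorff-based distances of Definitions 7.20–7.23 are straightforward to compute. The sketch below (our own illustration, with assumed names) evaluates Equation (7.31), the Euclidean combination of Equation (7.32), and the span normalization of Equation (7.35), reproducing the worked entries for (HorseM, HorseF) in Examples 7.12–7.14.

```python
import numpy as np

def hausdorff(x, y):
    """Hausdorff components (Eq. 7.31), one per variable; x, y are (p, 2) interval arrays."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.maximum(np.abs(x[:, 0] - y[:, 0]), np.abs(x[:, 1] - y[:, 1]))

def euclidean_hausdorff(x, y, span=None):
    """Eq. (7.32); dividing by the spans |Y_j| gives the span-normalized form (Eq. 7.35)."""
    phi = hausdorff(x, y)
    if span is not None:
        phi = phi / np.asarray(span, float)
    return float(np.sqrt((phi ** 2).sum()))

horse_m = [[120.0, 180.0], [222.2, 354.0]]
horse_f = [[158.0, 160.0], [322.0, 355.0]]
print(hausdorff(horse_m, horse_f))                                        # [38.  99.8]
print(round(euclidean_hausdorff(horse_m, horse_f), 2))                    # 106.79
print(round(euclidean_hausdorff(horse_m, horse_f, span=[65, 237.8]), 2))  # 0.72
```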

7.1.4 Mixed-valued variables

Dissimilarity and distance measures can be calculated between observations that have a mixture of multi-valued and interval-valued variables. In these cases, the formulas for these measures provided in Section 7.1.2 are used on the $Y_j$ which are multi-valued and those formulas of Section 7.1.3 are used on the interval-valued $Y_j$ variables. The distances over all $Y_j$ variables are then calculated using the appropriate summation formula, e.g., Equation (7.6) for a generalized Minkowski distance matrix. These principles are illustrated in the following example.

Example 7.15. Table 7.8 provides the oils dataset originally presented by Ichino (1988). The categories are 6 types of oil (linseed, perilla, cotton seed, sesame, camellia, and olive) and 2 fats (beef and hog); m = 8. There are four interval-valued variables, namely, $Y_1$ = Specific Gravity (in g/cm³), $Y_2$ = Freezing Point (in °C), $Y_3$ = Iodine Value, and $Y_4$ = Saponification. There is one multi-valued variable, $Y_5$ = Fatty Acids. This variable identifies nine major fatty acids contained in the oils and takes values from the list of acids $\mathcal{Y}_5$ = {A, C, L, Ln, Lu, M, O, P, S} ≡ {arachic, capric, linoleic, linolenic, lauric, myristic, oleic, palmitic, stearic}.

Table 7.8 Ichino's oils data.

wu    Oil           Y1 Specific Gravity   Y2 Freezing Point   Y3 Iodine Value   Y4 Saponification   Y5 Fatty Acids
w1    linseed       [0.930, 0.935]        [−27.0, −8.0]       [170.0, 204.0]    [118.0, 196.0]      {L, Ln, M, O, P}
w2    perilla       [0.930, 0.937]        [−5.0, −4.0]        [192.0, 208.0]    [188.0, 197.0]      {L, Ln, O, P, S}
w3    cotton seed   [0.916, 0.918]        [−6.0, −1.0]        [99.0, 113.0]     [189.0, 198.0]      {L, M, O, P, S}
w4    sesame        [0.920, 0.926]        [−6.0, −4.0]        [104.0, 116.0]    [187.0, 193.0]      {A, L, O, P, S}
w5    camellia      [0.916, 0.917]        [−21.0, −15.0]      [80.0, 82.0]      [189.0, 193.0]      {L, O}
w6    olive         [0.914, 0.919]        [0.0, 6.0]          [79.0, 90.0]      [187.0, 196.0]      {L, O, P, S}
w7    beef          [0.860, 0.870]        [30.0, 38.0]        [40.0, 48.0]      [190.0, 199.0]      {C, M, O, P, S}
w8    hog           [0.858, 0.864]        [22.0, 32.0]        [53.0, 77.0]      [190.0, 202.0]      {L, Lu, M, O, P, S}


Then, taking equal weights $c_j = 1/p = 1/5$ and $\gamma = 0$, we can obtain Ichino's city block distances shown in Table 7.9. These are calculated by substitution into Equation (7.29) for q = 1, where for $j = 1, \ldots, 4$ the entries $\phi_j(\cdot, \cdot)$ are obtained from Equation (7.27), and for j = 5 the entries $\phi_5(\cdot, \cdot)$ are obtained from Equation (7.17).

For example, consider the pair (linseed, lard) = W (say). Then,

$$ \phi_1(W) = [(0.935 - 0.858) - (0.864 - 0.858)]/(0.937 - 0.858) = 0.899, $$
$$ \phi_2(W) = [(32 - (-27)) - 0]/(38 - (-27)) = 0.908, $$
$$ \phi_3(W) = [(204 - 33) - 0]/(208 - 40) = 1.018, $$
$$ \phi_4(W) = [(202 - 118) - (196 - 190)]/(202 - 118) = 0.929, $$
$$ \phi_5(W) = (6 - 5)/9 = 0.111. $$

Hence,

$$ d(W) = \frac{1}{5}(0.899 + \cdots + 0.111) = 0.773. \qquad \square $$

Table 7.9 City block distances – oils, (Y1, ..., Y5).

wu    Oils          w1      w2      w3      w4      w5      w6      w7      w8
w1    linseed       0       0.320   0.471   0.488   0.466   0.534   0.853   0.809
w2    perilla       0.320   0       0.244   0.226   0.336   0.273   0.626   0.582
w3    cottonseed    0.471   0.244   0       0.105   0.182   0.117   0.418   0.374
w4    sesame        0.488   0.226   0.105   0       0.192   0.141   0.503   0.459
w5    camellia      0.466   0.336   0.182   0.192   0       0.160   0.504   0.460
w6    olive         0.534   0.273   0.117   0.141   0.160   0       0.407   0.363
w7    beef          0.853   0.626   0.418   0.503   0.504   0.407   0       0.181
w8    lard          0.809   0.582   0.374   0.459   0.460   0.363   0.181   0
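For mixed data the per-variable components are simply drawn from the appropriate family. The sketch below (our own, not the book's code) combines interval components (Eq. 7.27 with $\gamma = 0$) and multi-valued components (Eq. 7.17 with $\gamma = 0$), each scaled by its span or domain size, into the city block distance of Equation (7.29); it reproduces the (linseed, perilla) entry of Table 7.9.

```python
def phi_interval(a, b):
    """Eq. (7.27) with gamma = 0: join length minus meet length."""
    (a1, b1), (a2, b2) = a, b
    return (max(b1, b2) - min(a1, a2)) - max(0.0, min(b1, b2) - max(a1, a2))

def phi_multival(A, B):
    """Eq. (7.17) with gamma = 0: size of the join minus size of the meet."""
    return len(A | B) - len(A & B)

def city_block_mixed(x, y, scales, c):
    """Eq. (7.29): weighted sum of scaled components; sets are multi-valued,
    tuples are interval-valued; scales are the |Y_j|, c the city weights."""
    total = 0.0
    for xj, yj, sj, cj in zip(x, y, scales, c):
        phi = phi_multival(xj, yj) if isinstance(xj, set) else phi_interval(xj, yj)
        total += cj * phi / sj
    return total

linseed = [(0.930, 0.935), (-27.0, -8.0), (170.0, 204.0), (118.0, 196.0), {"L", "Ln", "M", "O", "P"}]
perilla = [(0.930, 0.937), (-5.0, -4.0), (192.0, 208.0), (188.0, 197.0), {"L", "Ln", "O", "P", "S"}]
scales = [0.937 - 0.858, 38 - (-27), 208 - 40, 202 - 118, 9]   # spans over all 8 oils; |Y_5| = 9
print(round(city_block_mixed(linseed, perilla, scales, [0.2] * 5), 3))  # 0.32, the d(w1, w2) entry
```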

7.2 Clustering Structures

7.2.1 Types of clusters: definitions

We have p random variables $Y_j$, $j = 1, \ldots, p$, with actual realizations $\xi_u = (\xi_1, \ldots, \xi_p)$ on observations $w_u$, $u = 1, \ldots, m$, $w_u \in E = \{w_1, \ldots, w_m\}$ (if symbolic data), or realizations $x_i = (x_1, \ldots, x_p)$ for observations $i \in \Omega$, $i = 1, \ldots, n$ (if classical data). We give definitions in terms of the symbolic observations in E; those for classical observations in $\Omega$ are special cases.


Definition 7.24: A partition of E is a set of subsets $(C_1, \ldots, C_r)$ such that

(i) $C_i \cap C_j = \emptyset$ for all $i \ne j = 1, \ldots, r$, and

(ii) $\bigcup_i C_i = E$, for $i = 1, \ldots, r$.

That is, the subsets are disjoint, but are exhaustive of the entire dataset E. The subsets of a partition are sometimes called classes or clusters. □

Example 7.16. The data of Table 7.5 represent values for $Y_1$ = Height and $Y_2$ = Weight of 10 animal breeds. These interval-valued observations were obtained by aggregating a larger set of data detailing animals handled at a certain veterinary clinic over a one-year period (available at http://www.ceremade.dauphine.fr/%7Etouati/clinique.htm). In this case, the concept involved is breed × gender. The plotted values of the $w_u$, $u = 1, \ldots, 8$, observations in Figure 7.1 suggest a partitioning which puts the observations $w_u$ with $u = 1, 2, 3$ (i.e., the male (M) and female (F) horses and the male bears) into class $C_1^1$ say, and the observations $w_u$ for $u = 4, 5, 6, 7, 8$ into class $C_2^1$.

Figure 7.1 Partitions: veterinary data, u = 1, ..., 8. (Scatter plot of Weight against Height for observations w1–w8.)

Another partitioning may have placed the horses, i.e., observations $w_1$ and $w_2$, into a single cluster $C_1^2$, say, and the bear, i.e., observation $w_3$, into its own cluster $C_2^2$, with a third cluster $C_3^2 = C_2^1$. Still another partitioning might produce the four clusters $C_i^3$, $i = 1, \ldots, 4$, with $C_1^3 \equiv C_1^2$ and $C_2^3 \equiv C_2^2$, and with $C_3^3$ containing the observations $w_7$ and $w_8$ for the rabbits, and $C_4^3$ containing the observations $w_4, w_5, w_6$ corresponding to the male and female deer and the female dogs. This subpartitioning of $C_2^1$ is more evident from Figure 7.2, which plots the $w_u$, $u = 4, \ldots, 8$, observations of $C_2^1$ plus those for the male and female cats of $w_9, w_{10}$. Whether the first partitioning $P_1 = (C_1^1, C_2^1)$ or the second partitioning $P_2 = (C_1^2, C_2^2, C_3^2)$ or the third partitioning $P_3 = (C_1^3, C_2^3, C_3^3, C_4^3)$ is chosen in any particular analysis depends on the selection criteria associated with any clustering technique. Such criteria usually fall under the rubric of similarity–dissimilarity measures and distance measures; these were discussed in Section 7.1. □

Figure 7.2 Veterinary partitions, u = 4, ..., 10. (Scatter plot of Weight against Height for observations w4–w10.)

Definition 7.25: A hierarchy on E is a set of subsets $H = (C_1, \ldots, C_r)$ such that

(i) $E \in H$;

(ii) for all single observations $w_u$ in E, $C_u = \{w_u\} \in H$; and

(iii) $C_i \cap C_j \in \{\emptyset, C_i, C_j\}$ for all $i \ne j = 1, \ldots, m$. □


That is, condition (iii) tells us that either any two clusters $C_i$ and $C_j$ are disjoint, or one cluster is contained entirely inside the other, and every individual in E is contained in at least one cluster larger than itself.

Note that if $C_i \cap C_j = \emptyset$ for all $i \ne j$, then the hierarchy becomes a partitioning. Henceforth, reference to a hierarchy implies that $C_i \cap C_j \ne \emptyset$ for at least one set of $(i, j)$ values.

Example 7.17. Consider all observations $w_u$, $u = 1, \ldots, 10$, in the veterinary data of Table 7.5. The hierarchy of Figure 7.3 may apply. Here, we start with all observations in the subset $E = C_1$. Then, at the first level of the hierarchy, the observations $w_u$ with $u = 1, 2, 3$ (which correspond to the large animals, i.e., the male and female horses and the male bears) form one cluster $C_1^1$ and the $w_u$ for $u = 4, 5, 6, 7, 8, 9, 10$ form a second cluster, $C_2^1$, consisting of the relatively smaller breeds. At this level, the hierarchy is the same as the first partitioning $P_1$ of Example 7.16. Then, at a second level, the observations $w_1$ and $w_2$ (male and female horses) form a cluster $C_{11}^2$ and the observation corresponding to the $w_3$ male bears forms a second cluster $C_{12}^2$. Also at this second level, the other level 1 cluster $C_2^1$ has two subclusters, namely, $C_{21}^2$ consisting of observations $\{w_4, \ldots, w_8\}$, and $C_{22}^2$ consisting of observations $w_9$ and $w_{10}$. Then, at the third level down we observe that the second-level cluster $C_{21}^2$ contains a third tier of clustering, namely, observations $w_4, w_5$, and $w_6$ as the subcluster $C_{211}^3$ and observations $w_7$ and $w_8$ as the subcluster $C_{212}^3$. If we continue in this vein, we finally have at the bottom of the hierarchy 10 clusters each consisting of a single observation $\{w_u\}$, $u = 1, \ldots, 10$. Apart from the entire set E and the individual clusters $C_u = \{w_u\}$, there are eight clusters (relabeling the clusters) shown in Figure 7.3:

$$ C_1 \equiv C_{11}^2 = \{w_1, w_2\}, \quad C_2 \equiv C_{12}^2 = \{w_3\}, \quad C_3 \equiv C_{211}^3 = \{w_4, w_5, w_6\}, $$
$$ C_4 \equiv C_{212}^3 = \{w_7, w_8\}, \quad C_5 \equiv C_{22}^2 = \{w_9, w_{10}\}, \quad C_6 \equiv C_{21}^2 = \{w_4, \ldots, w_8\}, $$

and

$$ C_7 \equiv C_1^1 = \{w_1, w_2, w_3\}, \qquad C_8 \equiv C_2^1 = \{w_4, \ldots, w_{10}\}. $$

The hierarchy is $H = (C_1, \ldots, C_8, \{w_1\}, \ldots, \{w_{10}\}, E)$. □

Figure 7.3 Hierarchy clusters – veterinary data. (Tree with E split into $C_1^1$ = {1, 2, 3} and $C_2^1$ = {4, 5, 6, 7, 8, 9, 10}; then $C_{11}^2$ = {1, 2}, $C_{12}^2$ = {3}, $C_{21}^2$ = {4, 5, 6, 7, 8}, $C_{22}^2$ = {9, 10}; then $C_{211}^3$ = {4, 5, 6} and $C_{212}^3$ = {7, 8}.)

In Example 7.17, we started with the entire set of observations and split them into two classes, and then proceeded to split each of these classes into a second tier of classes, and so on. This 'top-down' process is an example of divisive hierarchical clustering.

Definition 7.26: Divisive clustering is a top-down clustering process which starts with the entire dataset as one class and then proceeds downward through as many levels as necessary to produce the hierarchy $H = (C_1, \ldots, C_r)$.

Agglomerative clustering is a bottom-up clustering process which starts with each observation being a class of size 1, and then forms unions of classes at each level for as many levels as necessary to produce the hierarchy $H = (C_1, \ldots, C_r)$ or a pyramid $P = (P_1, \ldots, P_m)$. □

In the agglomerative process, each observation is aggregated at most once for a hierarchy or at most twice for a pyramid. These two clustering processes are discussed in more detail later in Sections 7.4 and 7.5, respectively.

Example 7.17 (continued): The same clusters $C_1, \ldots, C_8$ could have been developed by agglomerative clustering. In this case, we start with the hierarchy consisting of the 10 classes $H = (\{w_1\}, \{w_2\}, \ldots, \{w_{10}\})$. Then, at the bottom level, the clustering criteria form the union $\{w_5\} \cup \{w_6\} = \{w_5, w_6\}$, i.e., the female deer and dogs, with all other clusters remaining as individuals.

At the second level, the criteria produce the unions

$$ C_{11}^2 = \{w_1\} \cup \{w_2\} = \{w_1, w_2\}, \qquad C_{21}^2 = \{w_4\} \cup \{w_5, w_6\} = \{w_4, w_5, w_6\}, $$
$$ C_{22}^2 = \{w_7\} \cup \{w_8\} = \{w_7, w_8\}, \qquad C_{12}^2 = \{w_3\}, $$

while at the third level, we have the two clusters

$$ C_1^1 = C_{11}^2 \cup C_{12}^2 = \{w_1, w_2, w_3\}, $$
$$ C_2^1 = C_{21}^2 \cup C_{22}^2 = \{w_4, \ldots, w_{10}\}; $$

hence we reach the complete set $E = C_1^1 \cup C_2^1 = \{w_1, \ldots, w_{10}\}$. □

Notice that the top two levels of this hierarchy tree correspond to the partitioning $P_3$ of Example 7.16. A hierarchy has additional levels as necessary to reach single units at its base.


Definition 7.27: A set of clusters $H = (C_1, \ldots, C_r)$ where at least one pair of classes overlap, i.e., where there exists at least one set $(C_i, C_j)$, $i \ne j$, such that $C_i \cap C_j \ne \emptyset$, and where there exists an order on the observations such that each cluster defines an interval for this order, is called a non-hierarchical pyramid. □

Definition 7.32 gives a more general definition of a pyramid. That definition includes the notion of ordered sets and allows a hierarchy to be a special case of a pyramid.

Example 7.18. Suppose for the animal breeds data of Table 7.5 we start with the 10 clusters $H = (\{1\}, \ldots, \{10\})$ corresponding to the observations $w_u$, $u = 1, \ldots, 10$. Suppose the clustering criteria are such that the five clusters

$$ C_1 = \{w_1, w_2\}, \quad C_2 = \{w_3\}, \quad C_3 = \{w_4, w_5, w_6\}, \quad C_4 = \{w_7, w_8\}, \quad C_5 = \{w_8, w_9, w_{10}\} $$

emerge; see Figure 7.4. Notice that the $w_8$ (female rabbits) observation is in both the $C_4$ and $C_5$ clusters. □

Figure 7.4 Pyramid clusters – veterinary data. (E splits into $C_1^1$ = {1, 2, 3} and $C_2^1$ = {4, 5, 6, 7, 8, 9, 10}; below them the clusters $C_1$ = {1, 2}, $C_2$ = {3}, $C_3$ = {4, 5, 6}, $C_4$ = {7, 8}, and $C_5$ = {8, 9, 10} appear, with $C_4$ and $C_5$ overlapping in observation 8 under {7, 8, 9, 10}.)

This overlap of clusters in Figure 7.4 is what distinguishes a pyramid tree from a hierarchy tree. The clusters $C_1, \ldots, C_4$ of Example 7.18 are identical, respectively, to the clusters $C_{11}^2, C_{12}^2, C_{211}^3$, and $C_{212}^3$ in the hierarchy of Example 7.17 and Figure 7.3. The difference is in the fifth cluster. Here, in the pyramid tree, the observation $w_8$ corresponding to the female rabbit is in two clusters, i.e., $C_4$ consisting of male and female rabbits, and $C_5$ consisting of the male and female cats plus the female rabbits. We notice from Figure 7.2 that the data are highly suggestive that the cluster $C_5$ should contain all three $\{w_8, w_9, w_{10}\}$ observations, and that the rabbits together form their own cluster $C_4$ as well.

A pictorial comparison of the three clustering types can be observed by viewingFigures 7.1–7.4.

Definition 7.28: Let $P = (C_1, \ldots, C_r)$ be a hierarchy of E. A class $C_j$ is said to be a predecessor of a class $C_i$ if $C_i$ is completely contained in $C_j$ but is not all of $C_j$, and if there is no class $C_k$ contained in $C_j$ that contains $C_i$; that is, there is no k such that $C_i \subset C_k \subset C_j$. Equivalently, we can define $C_i$ as a successor of $C_j$. □

In effect, a predecessor cluster of a particular cluster refers to the cluster immediately above that cluster in the hierarchy tree. Any cluster in a hierarchy has at most one predecessor, whereas it may have up to two predecessors in a pyramid tree.

Example 7.19. Let $H = (C_1, C_2, C_3, \ldots, C_9)$ of Figure 7.5, where

$$ C_1 = E = \{a, b, c, d, e\}, $$
$$ C_2 = \{a, b, c\}, \quad C_3 = \{b, c\}, \quad C_4 = \{d, e\}, $$
$$ C_5 = \{a\}, \ \ldots, \ C_9 = \{e\}. $$

Then, $C_1$ is a predecessor of $C_2$. However, $C_1$ is not a predecessor of $C_3$ since, although $C_3 \subset C_1$, there does exist another class, here $C_2$, which is contained in $C_1$ and which also contains $C_3$. □

Figure 7.5 Predecessors. (Hierarchy on E = {a, b, c, d, e}: $C_2$ = {a, b, c} with subcluster $C_3$ = {b, c}; $C_4$ = {d, e}; singletons $C_5$ = {a}, $C_6$ = {b}, $C_7$ = {c}, $C_8$ = {d}, $C_9$ = {e}.)


7.2.2 Construction of clusters: building algorithms

As noted in Section 7.2.1 and Definition 7.26, hierarchies can be built from the top down, or from the bottom up. In this section, we discuss the divisive top-down clustering process; the pyramid bottom-up clustering process is covered in Section 7.5. Regardless of the type of cluster, an algorithm is necessary to build its structure. These algorithms will typically be iterative in nature.

Step 1: The first step is to establish the criterion (or criteria) by which one particular clustering structure would be selected over another. This criterion would be based on dissimilarity and distance measures such as those described in Section 7.1. Therefore, if we have a set of objects $C = \{a_1, \ldots, a_m\}$ (say), and if we take one cluster $C_1 = \{a_1, \ldots, a_k\}$ and a second cluster $C_2 = \{a_{k+1}, \ldots, a_m\}$, then a clustering criterion could be that value of $k = k^*$ for which

$$ D_k = \sum_{(i,j) \in C_1} d(a_i, a_j) + \sum_{(i,j) \in C_2} d(a_i, a_j) = D_k^{(1)} + D_k^{(2)}, \qquad (7.36) $$

say, is minimized, where $d(a, b)$ is the selected dissimilarity or distance measure. That is,

$$ D_{k^*} = \min_k D_k. \qquad (7.37) $$

The aim is to find clusters that are internally as homogeneous as possible.

Step 2: The second step therefore is to calculate $D_k$ from Equation (7.36) for all the clusters possible from the initial set of objects C. Note that the cluster $C_1$ contains any k objects from C (i.e., not necessarily the first ordered k objects), so that, in theory, there are

$$ N = \sum_{k=1}^{m} \binom{m}{k} $$

possible calculations of $D_k$. However, some specific methods allow for a reduced number without loss of information. An example is the divisive clustering method (of Section 7.4) for which there are only $(m - 1)$ calculations of $D_k$.

Step 3: The next step is to select $k^*$ from Equation (7.37). This choice gives the two clusters $C_1$ and $C_2$ with $C_1 \cup C_2 = C$.

Step 4: This process is repeated by returning to Step 2 using each of $C_1$ and $C_2$ separately in turn instead of C.

Step 5: Steps 2–4 are repeated until the number of clusters reaches a predetermined number of clusters $(r \le m)$ and/or the clustering criterion reaches some preassigned limit, e.g., the number of clusters is $r = R$, or $D_k^{(1)} + \cdots + D_k^{(r)} \le \delta$.
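A brute-force sketch of Steps 1–3 for a small dataset follows (our own illustration with hypothetical names; real methods such as the divisive algorithm of Section 7.4 avoid this exhaustive search).

```python
from itertools import combinations

def within_sum(D, members):
    """D_k^(c) of Eq. (7.36): sum of pairwise distances inside one cluster."""
    return sum(D[i][j] for i, j in combinations(members, 2))

def best_bipartition(D):
    """Exhaustive Step 2-3 search: the split (C1, C2) minimizing D_k (Eq. 7.37)."""
    objects = list(range(len(D)))
    best = None
    for r in range(1, len(objects) // 2 + 1):
        for C1 in combinations(objects, r):
            C2 = tuple(o for o in objects if o not in C1)
            Dk = within_sum(D, C1) + within_sum(D, C2)
            if best is None or Dk < best[0]:
                best = (Dk, C1, C2)
    return best

# Distance matrix of Example 7.1 (objects a, b, c)
D = [[0, 2, 1], [2, 0, 3], [1, 3, 0]]
print(best_bipartition(D))   # (1, (1,), (0, 2)): the split {b} versus {a, c}
```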

In the next three sections, this construction process is implemented for partitions (in Section 7.3), for divisive hierarchies (in Section 7.4), and (in reverse, starting at the bottom) for pyramids (in Section 7.5).


7.3 Partitions

In Example 7.16 and Figure 7.1 we were able to partition visually the veterinary data of Table 7.5 into two classes or clusters. The building algorithm of Section 7.2.2 can be used to construct this partition in a more rigorous way.

Example 7.20. Take the veterinary data of Table 7.5. Let us suppose (in Step 1) that we select our partitioning criterion to be based on the Ichino–Yaguchi distance measures of Equation (7.27) of Definition 7.18. The pairwise Ichino–Yaguchi distances between the male and female horses and male bears were calculated in Example 7.11, for each variable $Y_1$ = Height and $Y_2$ = Weight. Calculation of the other distances is left as an exercise. Let us take these distances to determine the normalized Euclidean distance d(a, b) of Equation (7.30) for use in the partitioning criterion of Equation (7.36). For example,

$$ d(\text{HorseM, BearM}) = \left\{ \frac{1}{2} \left[ (30)^2/167 + (153.5)^2/354.6 \right] \right\}^{1/2} = 5.993, $$

where we have taken $\gamma = 1/2$, and used the weight $w_j^* = 1/|\mathcal{Y}_j|$, $j = 1, 2$, where $|\mathcal{Y}_j|$ is as defined in Equation (7.26). Then, the normalized Euclidean distance matrix for all m = 10 animals is

$$ D = \begin{pmatrix}
0 & 2.47 & 5.99 & 11.16 & 11.76 & 11.28 & 12.37 & 12.45 & 12.06 & 11.85 \\
2.47 & 0 & 7.74 & 13.07 & 13.62 & 13.16 & 14.25 & 14.35 & 13.97 & 13.77 \\
5.99 & 7.74 & 0 & 8.13 & 9.04 & 8.52 & 9.36 & 9.35 & 8.74 & 8.39 \\
11.16 & 13.07 & 8.13 & 0 & 0.98 & 0.70 & 1.26 & 1.31 & 0.98 & 0.95 \\
11.76 & 13.62 & 9.04 & 0.98 & 0 & 0.67 & 0.78 & 1.08 & 1.19 & 1.48 \\
11.28 & 13.16 & 8.52 & 0.70 & 0.67 & 0 & 1.11 & 1.23 & 1.26 & 1.36 \\
12.37 & 14.25 & 9.36 & 1.26 & 0.78 & 1.11 & 0 & 0.37 & 0.81 & 1.21 \\
12.45 & 14.35 & 9.35 & 1.31 & 1.08 & 1.23 & 0.37 & 0 & 0.69 & 1.09 \\
12.06 & 13.97 & 8.74 & 0.98 & 1.19 & 1.26 & 0.81 & 0.69 & 0 & 0.51 \\
11.85 & 13.77 & 8.39 & 0.95 & 1.48 & 1.36 & 1.21 & 1.09 & 0.51 & 0
\end{pmatrix}. $$

We take as our criterion in Equation (7.36) the sum of the pairwise normalized Euclidean distances for all animals in the given subsets $(C_1, C_2)$.

Step 2 then entails calculating these total distances $D_k^{(1)}$ and $D_k^{(2)}$ for respective partitions. For example, suppose we set $C_1$ = {HorseM, HorseF, BearM} and therefore $C_2$ consists of the other $(m - k = 10 - 3 = 7)$ animals, i.e., $C_2$ = {DeerM, DeerF, DogF, RabbitM, RabbitF, CatM, CatF}. Then, the respective distance matrices are

$$ D^{(1)} = \begin{pmatrix} 0 & 2.47 & 5.99 \\ 2.47 & 0 & 7.74 \\ 5.99 & 7.74 & 0 \end{pmatrix}, $$

$$ D^{(2)} = \begin{pmatrix}
0 & 0.98 & 0.70 & 1.26 & 1.31 & 0.98 & 0.95 \\
0.98 & 0 & 0.67 & 0.78 & 1.08 & 1.19 & 1.48 \\
0.70 & 0.67 & 0 & 1.11 & 1.23 & 1.26 & 1.36 \\
1.26 & 0.78 & 1.11 & 0 & 0.37 & 0.81 & 1.21 \\
1.31 & 1.08 & 1.23 & 0.37 & 0 & 0.69 & 1.09 \\
0.98 & 1.19 & 1.26 & 0.81 & 0.69 & 0 & 0.51 \\
0.95 & 1.48 & 1.36 & 1.21 & 1.09 & 0.51 & 0
\end{pmatrix}. $$

Then, the sum of the Euclidean distances inside the cluster $C_1$ is the sum of the upper diagonal elements of $D^{(1)}$ (or equivalently, half the sum of all the elements of $D^{(1)}$), and likewise for the total of the Euclidean distances inside the cluster $C_2$. In this case, from Equation (7.36),

$$ D_3^{(1)} = 16.20, \qquad D_3^{(2)} = 21.02, $$

and hence

$$ D_3 = 37.22. $$

Table 7.10 provides some partitions and their respective cluster sums of normalized Euclidean distances $D_k^{(1)}$ and $D_k^{(2)}$ and the total distance $D_k$. This is by no means an exhaustive set. At Step 3, we select the partition which minimizes $D_k$. In this case, we obtain

C1 = {HorseM, HorseF, BearM},
C2 = {DeerM, DeerF, DogF, RabbitM, RabbitF, CatM, CatF}.

We now have our R = 2 clusters, as previously depicted visually in Figure 7.1. If R > 2 clusters are required, then in Step 5 we repeat the process (Steps 2–4), as needed on $C_1$ and $C_2$ first, and then on their clusters. □

Table 7.10 Two partitions – veterinary data.

k   C1                                                                Dk(1)     Dk(2)     Dk
2   {HorseM, HorseF}                                                  2.47      82.54     85.01
2   {HorseF, BearM}                                                   7.74      103.95    111.69
3   {HorseM, HorseF, BearM}                                           16.20     21.02     37.22
3   {BearM, DogF, DeerM}                                              17.35     127.24    144.59
4   {HorseM, HorseF, BearM, DeerM}                                    48.56     14.84     63.40
5   {HorseM, HorseF, BearM, DeerM, DeerF}                             83.96     7.64      93.60
6   {HorseM, HorseF, BearM, DeerM, DeerF, DogF}                       118.29    4.68      122.97
7   {HorseM, HorseF, BearM, DeerM, DeerF, DogF, RabbitM}              157.42    2.29      159.71
8   {HorseM, HorseF, BearM, DeerM, DeerF, DogF, RabbitM, RabbitF}     197.56    0.51      198.07
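The bookkeeping in Table 7.10 is easy to automate. The sketch below (our own illustration) computes a within-cluster sum $D_k^{(c)}$ from a precomputed distance matrix, shown here on the 3 × 3 block used in Step 2 of Example 7.20.

```python
from itertools import combinations

def cluster_sum(D, members):
    """Sum of pairwise distances within one cluster (upper triangle of its submatrix)."""
    return sum(D[i][j] for i, j in combinations(members, 2))

def partition_criterion(D, clusters):
    """D_k of Eq. (7.36): total of the within-cluster sums over all clusters."""
    return sum(cluster_sum(D, c) for c in clusters)

# 3 x 3 block for {HorseM, HorseF, BearM} from the normalized Euclidean matrix
D1 = [[0, 2.47, 5.99],
      [2.47, 0, 7.74],
      [5.99, 7.74, 0]]
print(round(cluster_sum(D1, [0, 1, 2]), 2))   # 16.2, i.e. D_3^(1) of Example 7.20
```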


7.4 Hierarchy–Divisive Clustering

As noted in Section 7.2 and Definition 7.26, hierarchies can be built from the top down, or from the bottom up. In this section, we discuss the divisive top-down clustering process; the pyramid bottom-up clustering process is covered in Section 7.5. The divisive clustering methodology for symbolic data was developed by Chavent (1997, 1998, 2000). We describe the principles in Section 7.4.1 and then apply them to multi-valued and interval-valued variables in Sections 7.4.2 and 7.4.3, respectively.

7.4.1 Some basics

We start with all observations in the one class $C_1 \equiv E$, and proceed by bisecting this class into two classes $(C_1^1, C_1^2)$. The actual choice for $C_1^1$ and $C_1^2$ will depend on an optimality criterion for the bisection. Then, $C_1^b$ $(b = 1, 2)$ is bisected into $C_1^b = (C_{11}^b, C_{12}^b)$, which will be denoted by $(C_2^1, C_2^2)$, and so on. Whether $C_1^1$ or $C_1^2$ is bisected depends on the underlying partitioning criteria being used. Suppose that at the rth stage there are the clusters $(C_1, \ldots, C_r) = P_r$, with the cluster $C_k$ containing $m_k$ observations. Recall that $E = \cup_k C_k$. One of the clusters, $C_k$ say, is to be bisected into two subclusters, $C_k = (C_k^1, C_k^2)$, to give a new partitioning

$$ P_{r+1} = (C_1, \ldots, C_k^1, C_k^2, \ldots, C_r) \equiv (C_1, \ldots, C_{r+1}). $$

The particular choice of $C_k$ is determined by a selection criterion (described in Equation (7.42) below) which when optimally bisected will minimize the total within-cluster variance for the partition across all optimally bisected $\{C_k\}$.

The class $C_k$ itself is bisected into $C_k^1$ and $C_k^2$ by determining which observations in $C_k$ satisfy, or do not satisfy, some criterion $q(\cdot)$ based on the actual values of the variables involved. Suppose the class $C_k$ contains observations identified by $u = 1, \ldots, m_k$, say. Then,

$$ C_k^1 = \{u : q(u) \text{ is true, or } q(u) = 1\}, $$
$$ C_k^2 = \{u : q(u) \text{ is false, or } q(u) = 0\}. $$

The divisive clustering technique for symbolic data assumes that the question $q(\cdot)$ relates to one $Y_j$ variable at a time. However, at each bipartition, all $Y_j$, $j = 1, \ldots, p$, variables can be considered.

Example 7.21. The data of Table 7.14 below provide values for six random variables relating to eight different breeds of horses (see Section 7.4.3 where the data are described more fully). A bipartition query on the random variable $Y_1$ = Minimum Height of the horses would be

$$ q_1 : Y_1 \le 130. $$

Then, comparing the interval midpoints to $q_1(\cdot)$, we have, for example, $q_1(u = 1) = 0$ since for the first horse the midpoint is $(145 + 155)/2 = 150 > 130$. Likewise, $q_1(2) = 0$, $q_1(3) = 1$, $q_1(4) = 0$, $q_1(5) = 0$, $q_1(6) = 0$, $q_1(7) = 1$, $q_1(8) = 1$. Therefore, were the bisection to be based on this $q_1(\cdot)$ alone, the new classes would contain the breeds $w_u$ according to

$$ C_1^1 = \{w_3, w_7, w_8\}, \qquad C_1^2 = \{w_1, w_2, w_4, w_5, w_6\}. \qquad \square $$

If there are $z_j$ different bipartitions for the variable $Y_j$, then there are $z_k = z_1 + \cdots + z_p$ possible bipartitions of $C_k$ into $(C_k^1, C_k^2)$. The actual bipartition chosen is that which satisfies the given selection criteria.

At each stage there are two choices:

(i) For the selected cluster $C_k$, how is the split into two subclusters $C_k^1$ and $C_k^2$ determined?

(ii) Which cluster $C_k$ is to be split?

The answers involve minimizing within-cluster variation and maximizing between-cluster variation for (i) and (ii), respectively.

Definition 7.29: For a cluster $C_k = \{w_1, \ldots, w_{m_k}\}$, the within-cluster variance $I(C_k)$ is

$$ I(C_k) = \frac{1}{2\mu} \sum_{u_1=1}^{m_k} \sum_{u_2=1}^{m_k} p_{u_1} p_{u_2} \, d^2(u_1, u_2) \qquad (7.38) $$

where $d^2(u_1, u_2)$ is a distance measure between the observations $w_{u_1}, w_{u_2}$ in $C_k$, and $p_u$ is the weight of the observation $w_u$ with $\mu = \sum_{u=1}^{m_k} p_u$. □

The distances $d(u_1, u_2)$ take any of the forms presented in Section 7.1. When $p_u = 1/m$, where m is the total number of observations in E,

$$ I(C_k) = \frac{1}{2 m m_k} \sum_{u_1=1}^{m_k} \sum_{u_2=1}^{m_k} d^2(u_1, u_2) $$

or, equivalently, since $d(u_1, u_2) = d(u_2, u_1)$,

$$ I(C_k) = \frac{1}{m m_k} \sum_{u_1=1}^{m_k} \sum_{u_2 > u_1}^{m_k} d^2(u_1, u_2). \qquad (7.39) $$
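A small sketch (our own, with an assumed function name) of the equal-weight within-cluster variance of Equation (7.39), given a distance matrix over E and the index set of a cluster:

```python
from itertools import combinations

def within_cluster_variance(D, members, m):
    """I(C_k) of Eq. (7.39) with equal weights p_u = 1/m; D holds the distances d(u1, u2)."""
    mk = len(members)
    return sum(D[u1][u2] ** 2 for u1, u2 in combinations(members, 2)) / (m * mk)

# Toy check with the distance matrix of Example 7.1 treated as the full set E (m = 3)
D = [[0, 2, 1], [2, 0, 3], [1, 3, 0]]
print(within_cluster_variance(D, [0, 1, 2], m=3))   # (4 + 1 + 9) / 9 = 1.555...
```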

Definition 7.30: For a partition $P_r = (C_1, \ldots, C_r)$, the total within-cluster variance is

$$ W(P_r) = \sum_{k=1}^{r} I(C_k) \qquad (7.40) $$

where $I(C_k)$ is the within-cluster variance for $C_k$ as defined in Definition 7.29. □


Note that a special case of Equation (7.40) is W(E) where the complete set of observations E consists of r = m clusters $C_k$ with $C_k = \{w_{u_k}\}$ containing a single observation.

Definition 7.31: For the partition $P_r = (C_1, \ldots, C_r)$, the between-cluster variance is

$$ B(P_r) = W(E) - W(P_r). \qquad (7.41) \ \square $$

The divisive clustering methodology is to take in turn each cluster $C_k$ in $P_r = (C_1, \ldots, C_r)$. Let $P_r(k) = (C_k^1, C_k^2)$ be a bipartition of this $C_k$. There are $n_k \le (m_k - 1)$ distinct possible bipartitions. When all observations $w_u$ in $C_k$ are distinct, $n_k = m_k - 1$. The particular bipartition $P_r(k)$ chosen is that one which satisfies

$$ \min_{L_k} \{ W(P_r(k)) \} = \min_{L_k} \{ I(C_k^1) + I(C_k^2) \} \qquad (7.42) $$

where $L_k$ is the set of all bipartitions of $C_k$. From Equation (7.42), we obtain the best subpartition of $C_k = (C_k^1, C_k^2)$.

The second question is which value of k, i.e., which $C_k \subset P_r$, is to be selected? For a given choice $C_k$, the new partition is $P_{r+1} = (C_1, \ldots, C_k^1, C_k^2, \ldots, C_r)$. Therefore, the total within-cluster variation of Equation (7.40) can be written as

$$ W(P_{r+1}) = \sum_{k'=1}^{r} I(C_{k'}) - I(C_k) + I(C_k^1) + I(C_k^2). $$

The cluster $C_k$ is chosen so as to minimize this total within-cluster variation. This is equivalent to finding $k'$ such that

$$ I_{k'} = \max_k \{ I(C_k) - I(C_k^1) - I(C_k^2) \} = \max_k \{ \Delta W(C_k) \}. \qquad (7.43) $$

The third and final issue of the divisive clustering process is when to stop. That is, what value of r is enough? Clearly, by definition of a hierarchy it is possible to have r = m, at which point each class/cluster consists of a single observation. In practice, however, the partitioning process stops at an earlier stage. The precise question as to what is the optimal value of r is still an open one. Typically, a maximum value for r is prespecified, R say, and then the divisive clustering process is continued consecutively from r = 1 up to r = R.

This procedure is an extension in part of the Breiman et al. (1984) approach for classical data. At the $k$th bipartition of $C_k$ into $(C_k^{(1)}, C_k^{(2)})$ there will be at most $p(m_k - 1)$ bipartitions possible. This procedure therefore involves much less computation than the corresponding $(2^{m_k - 1} - 1)$ possible bipartitions that would have to be considered were the Edwards and Cavalli-Sforza (1965) procedure used.
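One divisive step — pick the best split within each cluster by Equation (7.42), then split the cluster with the largest gain by Equation (7.43) — can be sketched as follows. This reuses `within_cluster_variance` from the sketch above; `bipartitions` is a placeholder supplied by the caller (for example, the ordered cutpoint splits of Equation (7.47) below), so the code is an outline under those assumptions rather than the book's implementation.

```python
# Sketch of one divisive step: Equation (7.42) inside each cluster,
# Equation (7.43) across clusters.
def best_split(D, C, m, bipartitions):
    best = None
    for C1, C2 in bipartitions(C):
        W = within_cluster_variance(D, C1, m) + within_cluster_variance(D, C2, m)
        if best is None or W < best[0]:
            best = (W, C1, C2)
    return best                        # (W(P_r(k)), C_k^(1), C_k^(2))

def divisive_step(D, partition, m, bipartitions):
    gains = []
    for k, C in enumerate(partition):
        if len(C) < 2:                 # singletons cannot be split further
            continue
        W, C1, C2 = best_split(D, C, m, bipartitions)
        gains.append((within_cluster_variance(D, C, m) - W, k, C1, C2))
    dW, k, C1, C2 = max(gains)         # Equation (7.43): maximize Delta W
    return partition[:k] + [C1, C2] + partition[k + 1:]
```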


7.4.2 Multi-valued variables

The divisive clustering methodology applied to non-modal multi-valued variables will be illustrated through its application to the dataset of Table 7.11. These relate to computer capacity in $m = 7$ different offices in a given company. The random variable $Y_1$ = Brand identifies the manufacturer of the computer and takes coded values from $\mathcal{Y}_1 = \{B1, \ldots, B4\}$. The random variable $Y_2$ = Type of computer takes values from $\mathcal{Y}_2 = \{\text{laptop } (T1), \text{desktop } (T2)\}$. We perform the divisive clustering procedure using the $Y_1$ variable only; the procedure basing the partitioning on both variables $Y_1$ and $Y_2$ is left as an exercise. We will use the categorical distance measures of Definition 7.14 and Equation (7.13). In order to do this, we first transform the data according to Equation (7.12). The resulting relative frequencies $p_{ujk_j}$ are as shown in Table 7.12.

Table 7.11 Computers.

  wu            Y1 = Brand      Y2 = Type
  w1  Office1   {B1, B2, B3}    {T1, T2}
  w2  Office2   {B1, B4}        {T1}
  w3  Office3   {B2, B3}        {T2}
  w4  Office4   {B2, B3, B4}    {T2}
  w5  Office5   {B2, B3, B4}    {T1, T2}
  w6  Office6   {B1, B4}        {T1}
  w7  Office7   {B2, B4}        {T2}

Table 7.12 Computer frequencies.

                 Brand                       Type
              B1    B2    B3    B4        T1    T2     Total
  Office1     1/3   1/3   1/3   0         1/2   1/2    2
  Office2     1/2   0     0     1/2       1     0      2
  Office3     0     1/2   1/2   0         0     1      2
  Office4     0     1/3   1/3   1/3       0     1      2
  Office5     0     1/3   1/3   1/3       1/2   1/2    2
  Office6     1/2   0     0     1/2       1     0      2
  Office7     0     1/2   0     1/2       0     1      2
  Frequency   4/3   2     3/2   13/6      3.0   4.0    14.0

Looking at all the possible partitions of a cluster $C$ into two subclusters can be a lengthy task, especially if the size of $C$ is large. However, whenever the multi-valued random variable is such that there is a natural order to the entities in $\mathcal{Y}_j$, or when it is a modal-valued multi-valued variable, it is only necessary to consider those cutpoints $c$ for which
$$\sum_{k_j < c_j} p_{jk_j} > 1/2 \qquad (7.44)$$
for each $Y_j$, i.e., $c_j \ge Y_{jk_j}$, which ensures Equation (7.44) holds.

Example 7.22. For the computer dataset, from Table 7.11, the cutpoint for the $Y_1$ = Brand variable is between the possible brands B2 and B3 when comparing most offices; for the potential subcluster involving offices 4 and 5, this cutpoint lies between B3 and B4. □

Example 7.23. For the random variable $Y_1$ = Brand in the computer dataset of Table 7.11, we substitute the frequencies $p_{uk}$ of Table 7.12 into Equation (7.13), where $p = 1$ here, to obtain the categorical distance matrix as

D =
    0.00  0.52  0.34  0.37  0.37  0.52  0.54
    0.52  0.00  0.77  0.57  0.57  0.00  0.56
    0.34  0.77  0.00  0.29  0.29  0.77  0.53
    0.37  0.57  0.29  0.00  0.00  0.57  0.32
    0.37  0.57  0.29  0.00  0.00  0.57  0.32
    0.52  0.00  0.77  0.57  0.57  0.00  0.56
    0.54  0.56  0.53  0.32  0.32  0.56  0.00

For example,
$$d^2(w_1, w_4) = \left[\frac{(1/3-0)^2}{4/3} + \frac{(1/3-1/3)^2}{2} + \frac{(1/3-1/3)^2}{3/2} + \frac{(0-1/3)^2}{13/6}\right]^{1/2} = 0.367.$$
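The worked value above can be reproduced numerically. The sketch below (mine, not the book's) assumes the chi-squared-like form used in that computation, i.e., squared frequency differences divided by the column totals of Table 7.12, summed and square-rooted.

```python
# Categorical distance matrix for Y1 = Brand from the relative frequencies
# of Table 7.12 (rows: Office1..Office7; columns: B1..B4).
import numpy as np

P = np.array([
    [1/3, 1/3, 1/3, 0],
    [1/2, 0,   0,   1/2],
    [0,   1/2, 1/2, 0],
    [0,   1/3, 1/3, 1/3],
    [0,   1/3, 1/3, 1/3],
    [1/2, 0,   0,   1/2],
    [0,   1/2, 0,   1/2],
])
col_freq = P.sum(axis=0)            # 4/3, 2, 3/2, 13/6, as in Table 7.12

diff = P[:, None, :] - P[None, :, :]
D = np.sqrt((diff ** 2 / col_freq).sum(axis=-1))
print(round(D[0, 3], 3))            # ~0.367, matching the worked d(w1, w4)
```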

Let us reorder the observations so that cutpoints change progressively from brands B1 to B4. The ordering is as shown in part (a) of Table 7.13, along with the corresponding cutpoints. Thus, for example, the cutpoint for Office3 lies between the brands B2 and B3. Under this ordering the distance matrix becomes

D =
    0.00  0.00  0.77 | 0.56  0.52  0.57  0.57
    0.00  0.00  0.77 | 0.56  0.52  0.57  0.57
    0.77  0.77  0.00 | 0.53  0.34  0.29  0.29
    ---------------------------------------------
    0.56  0.56  0.53 | 0.00  0.54  0.32  0.32
    0.52  0.52  0.34 | 0.54  0.00  0.37  0.37
    0.57  0.57  0.29 | 0.32  0.37  0.00  0.00
    0.57  0.57  0.29 | 0.32  0.37  0.00  0.00


Table 7.13 Computer clusters.

          wk        s   Cutpoint   I(C1_i)   I(C2_i)   W(C_i)
  (a)     Office2   1   B1, B2     0.000     0.151     0.151
  i = 1   Office6   2   B1, B2     0.000     0.096     0.096
          Office3   3   B2, B3     0.073     0.069     0.142
          Office7   4   B2, B3     0.114     0.035     0.149
          Office1   5   B2, B3     0.146     0.000     0.146
          Office4   6   B3, B4     0.172     0.000     0.172
          Office5   7   B3, B4     0.191     —         0.191

  (b)     Office3   1   B2, B3     0.000     0.069     0.069
  i = 2   Office7   2   B2, B3     0.038     0.035     0.073
          Office1   3   B2, B3     0.067     0.000     0.067
          Office4   4   B3, B4     0.085     0.000     0.085
          Office5   5   B3, B4     0.096     —         0.096

Suppose we want $R = 2$ clusters. With this ordering, there are $m - 1 = 6$ possible partitions, with $s$ offices in $C_1$ and $m - s$ offices in $C_2$. Take $s = 3$. Then, $C = (C_1^{(1)}, C_1^{(2)})$ with
$$C_1^{(1)} = \{\text{Office2, Office6, Office3}\}, \qquad C_1^{(2)} = \{\text{Office7, Office1, Office4, Office5}\}.$$
The dashed lines in the matrix D relate to this partition, with the upper left-hand portion being the distance matrix associated with the cluster $C_1^{(1)}$ and the lower right-hand matrix being that for the cluster $C_1^{(2)}$. From Equation (7.39),
$$I(C_1^{(1)}) = \frac{1}{7\times 3}\,(0 + 0.77 + 0.77) = 0.073$$
and
$$I(C_1^{(2)}) = \frac{1}{7\times 4}\,(0.54 + 0.32 + \cdots + 0.00) = 0.069.$$
Hence,
$$W(C_1) = I(C_1^{(1)}) + I(C_1^{(2)}) = 0.073 + 0.069 = 0.142.$$
The within-cluster variances $I(C_1^{(1)})$ and $I(C_1^{(2)})$, along with the total within-cluster variance $W(C_1)$ for each partition, are shown in part (a) of Table 7.13. Comparing the $W(C_1)$ values, we select the optimal bipartition to be
$$C_1 = \{\text{Office2, Office6}\}, \qquad C_2 = \{\text{Office1, Office3, Office4, Office5, Office7}\},$$
since this $(s = 2)$ partition produces the minimum $W(C_1)$ value; here $W(C_1) = 0.096$.

To bisect $C_2$ into two clusters $C_2 = (C_2^{(1)}, C_2^{(2)})$, this process is repeated on the D matrix associated with $C_2$. The resulting within-cluster variances and total within-cluster variances are shown in part (b) of Table 7.13. From this, we observe that the optimal $R = 3$ cluster partition is
$$C_1 = \{\text{Office2, Office6}\}, \quad C_2 = \{\text{Office1, Office3, Office7}\}, \quad C_3 = \{\text{Office4, Office5}\},$$
since the $C_2 = (C_2^{(1)}, C_2^{(2)})$ cluster when cut at $s = 3$ gives the minimum $W(C_2)$ value at 0.067.

The total within-cluster variation is
$$W(P_3) = 0.000 + (0.067 + 0.000) = 0.067 < W(E) = 0.191.$$
Therefore, a reduction in variation has been achieved. This hierarchy is displayed in Figure 7.6. □

Figure 7.6 Computer hierarchy: E splits into C1 = {Office2, Office6}, C2 = {Office1, Office3, Office7}, and C3 = {Office4, Office5}.

7.4.3 Interval-valued variables

How the divisive clustering methodology is applied to interval-valued random variables will be described through its illustration on the horses dataset of Table 7.14. This dataset consists of eight breeds of horses E = {European, Arabian, European Pony, French, North European, South European, French Pony, American Pony}, coded as E = {CES, CMA, PEN, TES, CEN, LES, PES, PAM}, respectively. There are $p = 6$ random variables: $Y_1$ = Minimum Weight, $Y_2$ = Maximum Weight, $Y_3$ = Minimum Height, $Y_4$ = Maximum Height, $Y_5$ = (coded) Cost of Mares, and $Y_6$ = (coded) Cost of Fillies. At the end of this section, the complete hierarchy on all eight breeds using all six random variables will be presented, though often only portions of the dataset are used to demonstrate the methodology as we proceed through the various stages involved.

Table 7.14 Horses interval-valued dataset.

  wu  Breed  Minimum     Maximum     Minimum     Maximum     Mares       Fillies
             Weight      Weight      Height      Height      Cost        Cost
  w1  CES    [410, 460]  [550, 630]  [145, 155]  [158, 175]  [150, 480]  [40, 130]
  w2  CMA    [390, 430]  [570, 580]  [130, 158]  [150, 167]  [0, 200]    [0, 50]
  w3  PEN    [130, 190]  [210, 310]  [90, 135]   [107, 153]  [0, 100]    [0, 30]
  w4  TES    [410, 630]  [560, 890]  [135, 165]  [147, 172]  [100, 350]  [30, 90]
  w5  CEN    [410, 610]  [540, 880]  [145, 162]  [155, 172]  [70, 600]   [20, 160]
  w6  LES    [170, 450]  [290, 650]  [118, 158]  [147, 170]  [0, 350]    [0, 90]
  w7  PES    [170, 170]  [290, 290]  [124, 124]  [147, 147]  [380, 380]  [100, 100]
  w8  PAM    [170, 170]  [290, 290]  [120, 120]  [147, 147]  [230, 230]  [60, 60]

Let the interval-valued variables $Y_j$ take values $[a_{uj}, b_{uj}]$, $j = 1, \ldots, p$, $u = 1, \ldots, m$. The observed midpoints are
$$X_{uj} = (a_{uj} + b_{uj})/2. \qquad (7.45)$$
The interval-valued divisive clustering method poses the question $q(\cdot)$ in the form
$$q_j(u): X_{uj} \le c_j, \quad j = 1, \ldots, p, \qquad (7.46)$$
where $c_j$ is a cutpoint associated with the variable $Y_j$. In Example 7.21 above, $c_1 = 130$.

For each $j$, we first reorder the midpoints from the smallest to the largest; let us label these $X_{sj}$, $s = 1, \ldots, m_k$. Then, the selected cutpoint values are the midpoints of consecutive $X_{uj}$ values (for those observations $u$ in the cluster $C_k$ being bipartitioned). That is, we select
$$c_{sjk} = (X_{sj} + X_{s+1,j})/2, \quad s = 1, \ldots, m_k - 1, \qquad (7.47)$$
for variable $Y_j$ in cluster $C_k$. Therefore, there are at most $z_j = m_k - 1$ different cutpoints, and hence at most $z_j$ different bipartitions of $C_k$ possible when based on $Y_j$ alone. If two or more $X_{uj}$ have the same value, then $z_j < m_k - 1$. It is easy to verify that for any value $c$ with $X_{s,j} < c < X_{s+1,j}$ the same bipartitioning pertains.


Example 7.24. Consider the horse breeds data of Table 7.14, and the random variable $Y_1$ = Minimum Weight $\equiv Y$, say.

The first partition $P_1$ divides the entire dataset $C_1$ into $(C_1^{(1)}, C_1^{(2)})$. Table 7.15 provides the midpoints $X_{sj} \equiv X_s$, $s = 1, \ldots, 7$, where the observations have been rearranged from the lowest to the highest $X$ value. The cutpoints $c_s$ ($\equiv c_{jsk}$, $j = 1$, $k = 1$) are the midpoints of these $X_s$. For example, from the observations $w_3$ and $w_7$ which, when reordered, correspond, respectively, to $s = 1, 2$, we have
$$X_1 = (130 + 190)/2 = 160, \qquad X_2 = (170 + 170)/2 = 170;$$
hence,
$$c_1 = (160 + 170)/2 = 165.$$
There are $z_j = 7$ different cutpoints $c_s$. □
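The midpoint and cutpoint computations of Equations (7.45)–(7.47) are mechanical; the sketch below (an illustration of mine, not the book's code) applies them to $Y_1$ = Minimum Weight for all eight breeds. Note that $w_7$ and $w_8$ have tied midpoints, so the cut between them is degenerate ($z_j < m_k - 1$); the sketch drops it, whereas Table 7.15 carries the next distinct cutpoint, 240.0, for both tied rows.

```python
# Equations (7.45)-(7.47): interval midpoints, the midpoint ordering, and the
# candidate cutpoints halfway between consecutive distinct midpoints.
intervals = {  # Y1 = Minimum Weight, Table 7.14
    "CES": (410, 460), "CMA": (390, 430), "PEN": (130, 190), "TES": (410, 630),
    "CEN": (410, 610), "LES": (170, 450), "PES": (170, 170), "PAM": (170, 170),
}
mid = {u: (a + b) / 2 for u, (a, b) in intervals.items()}     # Equation (7.45)
order = sorted(mid, key=mid.get)                               # reorder by midpoint

cuts = []
for s in range(len(order) - 1):                                # Equation (7.47)
    x1, x2 = mid[order[s]], mid[order[s + 1]]
    if x1 != x2:               # tied midpoints give no new bipartition
        cuts.append((x1 + x2) / 2)

print(order)   # ['PEN', 'PES', 'PAM', 'LES', 'CMA', 'CES', 'CEN', 'TES']
print(cuts)    # [165.0, 240.0, 360.0, 422.5, 472.5, 515.0]
```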

Having selected the cutpoints, we now develop an appropriate distance matrix. We use the Hausdorff distances of Equation (7.31) and the normalized Hausdorff distances of Equation (7.33) in Examples 7.25–7.31 and Examples 7.32–7.33, respectively.

Example 7.25. Consider the first four observations of the horse breeds data of Table 7.14. For the present, we assume that the breeds are in the original sequence. If we consider $Y_1$ = Minimum Weight only, then the Hausdorff distance matrix $D_1 = (d_1(u_1, u_2))$, $u_1, u_2 = 1, \ldots, m = 4$, is, from Equation (7.31),
$$D_1 = \begin{pmatrix} 0 & 30 & 280 & 170 \\ 30 & 0 & 260 & 200 \\ 280 & 260 & 0 & 440 \\ 170 & 200 & 440 & 0 \end{pmatrix} \qquad (7.48)$$
For example, for $u_1 = 1$, $u_2 = 3$, we have
$$|a_{11} - a_{31}| = |410 - 130| = 280, \qquad |b_{11} - b_{31}| = |460 - 190| = 270;$$
hence,
$$d_1(w_1, w_3) = \max(280, 270) = 280. \qquad \square$$

Example 7.26. We continue with the data of Table 7.14 but take $p = 2$ with $Y_1$ = Minimum Weight and $Y_2$ = Maximum Weight. Again, let us use the first four observations only. The Hausdorff distance matrix $D_1$ was given in Equation (7.48) in Example 7.25. The Hausdorff distance matrix for $Y_2$ is
$$D_2 = \begin{pmatrix} 0 & 50 & 340 & 260 \\ 50 & 0 & 360 & 310 \\ 340 & 360 & 0 & 580 \\ 260 & 310 & 580 & 0 \end{pmatrix} \qquad (7.49)$$


Then, the Euclidean Hausdorff distance matrix, from Equation (7.32), is $D$ where
$$D \# D = (D_1 \# D_1 + D_2 \# D_2) \qquad (7.50)$$
where, if $A$ and $B$ have elements $A = (a_{ij})$ and $B = (b_{ij})$, then $C = A \# B$ has elements $c_{ij} = (a_{ij} \times b_{ij})$. Therefore, from Equations (7.48) and (7.49), we obtain
$$D \# D = \begin{pmatrix} 0 & 900 & 78400 & 28900 \\ 900 & 0 & 67600 & 40000 \\ 78400 & 67600 & 0 & 193600 \\ 28900 & 40000 & 193600 & 0 \end{pmatrix} + \begin{pmatrix} 0 & 2500 & 115600 & 67600 \\ 2500 & 0 & 129600 & 96100 \\ 115600 & 129600 & 0 & 336400 \\ 67600 & 96100 & 336400 & 0 \end{pmatrix}$$
$$= \begin{pmatrix} 0 & 3400 & 194000 & 96500 \\ 3400 & 0 & 197200 & 136100 \\ 194000 & 197200 & 0 & 530000 \\ 96500 & 136100 & 530000 & 0 \end{pmatrix}.$$
Hence,
$$D = \begin{pmatrix} 0 & 58.31 & 440.45 & 310.64 \\ 58.31 & 0 & 444.07 & 368.92 \\ 440.45 & 444.07 & 0 & 728.01 \\ 310.64 & 368.92 & 728.01 & 0 \end{pmatrix} \qquad (7.51)$$
Thus, for example, the Hausdorff distance between the $w_1$ and $w_3$ observations is
$$d(w_1, w_3) = (280^2 + 340^2)^{1/2} = (78400 + 115600)^{1/2} = 440.45.$$
From these $D$ values, we observe that the observations corresponding to $u = 1, 2$ are close, with $d(w_1, w_2) = 58.31$, as far as the $(Y_1, Y_2)$ weight variables go, while the observations for $u = 3, 4$ are far apart, with $d(w_3, w_4) = 728.01$. (These distances reflect the data in that the observation $w_3$ represents a breed of pony, while those for $w_u$, $u = 1, 2, 4$, are for horses.) □

Example 7.27. Let us find the dispersion normalized Euclidean Hausdorff distances for the first four horse breeds, for $Y_1$ = Minimum Weight and $Y_2$ = Maximum Weight. The distance matrix $D_1$ of Equation (7.48) is the non-normalized matrix for $Y_1$. Using these values in Equation (7.34), we obtain the normalization factor as
$$H_1 = \left\{\frac{1}{2\times 4^2}\left[0^2 + 30^2 + 280^2 + \cdots + 440^2 + 0^2\right]\right\}^{1/2} = 159.96. \qquad (7.52)$$
For example, for $(w_1, w_3)$,
$$d_1(w_1, w_3) = 280/159.96 = 1.750.$$


Then the complete normalized Euclidean Hausdorff distance matrix is
$$D_1(H_1) = \begin{pmatrix} 0 & 0.187 & 1.750 & 1.063 \\ 0.187 & 0 & 1.625 & 1.250 \\ 1.750 & 1.625 & 0 & 2.751 \\ 1.063 & 1.250 & 2.751 & 0 \end{pmatrix} \qquad (7.53)$$
For the variable $Y_2$ = Maximum Weight, by applying Equation (7.34) to the entries of the non-normalized matrix $D_2$ of Equation (7.49), we obtain
$$H_2 = \left\{\frac{1}{2\times 4^2}\left[0^2 + 50^2 + \cdots + 580^2 + 0^2\right]\right\}^{1/2} = 216.19. \qquad (7.54)$$
Hence, we can obtain the normalized Euclidean Hausdorff distances. For example, for $(w_1, w_3)$,
$$d_2(w_1, w_3) = 340/216.19 = 1.573.$$
Then, the complete dispersion normalized Euclidean Hausdorff distance matrix for $Y_2$ is
$$D_2(H_2) = \begin{pmatrix} 0 & 0.231 & 1.573 & 1.203 \\ 0.231 & 0 & 1.665 & 1.434 \\ 1.573 & 1.665 & 0 & 2.683 \\ 1.203 & 1.434 & 2.683 & 0 \end{pmatrix} \qquad (7.55)$$
Finally, we can obtain this normalization for the two variables $(Y_1, Y_2)$ combined. We apply Equation (7.33), using now the matrices of Equations (7.48) and (7.49). Thus we have, e.g., for $(w_1, w_3)$,
$$d(w_1, w_3) = \left[\left(\frac{280}{159.96}\right)^2 + \left(\frac{340}{216.19}\right)^2\right]^{1/2} = 2.353. \qquad (7.56)$$
Similarly, we can obtain the distances for the other $(w_{u_1}, w_{u_2})$ pairs; we then obtain the complete normalized Euclidean Hausdorff distance matrix as
$$D = \begin{pmatrix} 0 & 0.298 & 2.353 & 1.605 \\ 0.298 & 0 & 2.327 & 1.902 \\ 2.353 & 2.327 & 0 & 3.842 \\ 1.605 & 1.902 & 3.842 & 0 \end{pmatrix} \qquad (7.57)$$
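The dispersion normalization of Equations (7.33)–(7.34) is a small extension of the previous sketch; the lines below reuse `numpy`, `D1`, and `D2` from the sketch following Example 7.26 and are, again, my own illustration rather than the book's code.

```python
# Dispersion-normalized Euclidean Hausdorff distances, Equations (7.33)-(7.34),
# continuing from the D1 and D2 computed in the previous sketch (m = 4 breeds).
m = 4
H1 = np.sqrt((D1 ** 2).sum() / (2 * m ** 2))    # ~159.96, Equation (7.34)
H2 = np.sqrt((D2 ** 2).sum() / (2 * m ** 2))    # ~216.19
Dn = np.sqrt((D1 / H1) ** 2 + (D2 / H2) ** 2)   # Equation (7.33)
print(np.round(Dn, 3))                          # d(w1, w3) ~ 2.353, as in (7.56)
```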

Example 7.28. Let us now calculate the span normalized Euclidean Hausdorff distance for each of the $Y_1$, $Y_2$ and $(Y_1, Y_2)$ data of Table 7.14 using the $w_u$, $u = 1, \ldots, 4$, observations. Consider $Y_1$ = Minimum Weight. From Table 7.14 and Equation (7.26), it is clear that
$$|\mathcal{Y}_1| = 630 - 130 = 500. \qquad (7.58)$$


Hence, the span normalized Euclidean Hausdorff distance matrix for $Y_1$ is, from Equation (7.35),
$$D_1 = \begin{pmatrix} 0 & 0.060 & 0.560 & 0.340 \\ 0.060 & 0 & 0.520 & 0.400 \\ 0.560 & 0.520 & 0 & 0.880 \\ 0.340 & 0.400 & 0.880 & 0 \end{pmatrix} \qquad (7.59)$$
where, e.g., for $(w_1, w_3)$, we have, from Equation (7.35) with $p = 1$, $j = 1$,
$$d_1(w_1, w_3) = \varphi_1(w_1, w_3)/|\mathcal{Y}_1| = 280/500 = 0.56,$$
where $\varphi_1(w_1, w_3) = 280$ is obtained from Equation (7.31).

Similarly, for $Y_2$ alone, we have
$$|\mathcal{Y}_2| = 890 - 210 = 680; \qquad (7.60)$$
hence the span normalized Euclidean Hausdorff distance matrix is, from Equation (7.35),
$$D_2 = \begin{pmatrix} 0 & 0.074 & 0.500 & 0.382 \\ 0.074 & 0 & 0.529 & 0.456 \\ 0.500 & 0.529 & 0 & 0.853 \\ 0.382 & 0.456 & 0.853 & 0 \end{pmatrix} \qquad (7.61)$$
For the combined $(Y_1, Y_2)$ variables, we have from Equation (7.35) that the span normalized Euclidean Hausdorff distance matrix, also called the maximum deviation normalized Hausdorff distance matrix, is
$$D = \begin{pmatrix} 0 & 0.095 & 0.751 & 0.512 \\ 0.095 & 0 & 0.742 & 0.606 \\ 0.751 & 0.742 & 0 & 1.226 \\ 0.512 & 0.606 & 1.226 & 0 \end{pmatrix} \qquad (7.62)$$
This $D$ matrix is obtained in a similar way to the elementwise products of Equation (7.50) used in obtaining Equation (7.51). Thus, e.g., for $(w_1, w_3)$, we obtain, using the spans of Equations (7.58) and (7.60),
$$d(w_1, w_3) = \left[\left(\frac{280}{500}\right)^2 + \left(\frac{340}{680}\right)^2\right]^{1/2} = 0.751. \qquad \square$$

To partition this cluster $C$ into two subclusters, to avoid considering all possible splits (here, $\binom{4}{2} = 6$ in total) we first reorder the classes according to decreasing (or equivalently increasing) values of the midpoints $c$ of Equation (7.47).


Example 7.29. We again consider the first four breeds and $Y_1$ = Minimum Weight of Table 7.14. When reordered by the cutpoint values, we have $P_1 = C$ = {TES, CES, CMA, PEN}. The corresponding Hausdorff distance matrix is
$$D = \begin{pmatrix} 0 & 170 & 200 & 440 \\ 170 & 0 & 30 & 280 \\ 200 & 30 & 0 & 260 \\ 440 & 280 & 260 & 0 \end{pmatrix}$$
Then, for this cluster $C = \{w_1, \ldots, w_4\}$, the within-cluster variation $I(C)$ is, from Equation (7.39),
$$I(C) = \frac{1}{4\times 4}\left[170^2 + 200^2 + \cdots + 260^2\right] = 25587.5.$$
Suppose now that this cluster $C$ is partitioned into the subclusters $C_1^{(1)}$ = {TES} and $C_1^{(2)}$ = {CES, CMA, PEN}. Then, from Equation (7.39),
$$I(C_1^{(1)}) = \frac{1}{4\times 1}\,[0] = 0, \qquad I(C_1^{(2)}) = \frac{1}{4\times 3}\left[30^2 + 280^2 + 260^2\right] = 12241.7.$$
Therefore, from Equation (7.43), we have
$$I_1 = 25587.5 - 0 - 12241.7 = 13345.8.$$
Similarly, for the partition of $C$ into $C_2^{(1)}$ = {TES, CES}, $C_2^{(2)}$ = {CMA, PEN}, we find
$$I_2 = 13525.0,$$
and for the partition of $C$ into $C_3^{(1)}$ = {TES, CES, CMA} and $C_3^{(2)}$ = {PEN}, we find
$$I_3 = 19770.8.$$
Hence, the optimal bipartition of this $C$ is that $k'$ for which
$$I_{k'} = \max_k (I_k) = I_3 = 19770.8.$$
Therefore, the optimal partition of these four breeds is to create the partition $P_2 = (C_1, C_2)$ with subclusters $C_1$ = {TES, CES, CMA} and $C_2$ = {PEN}. The total within-cluster variation of $P_2$ is, from Equation (7.40),
$$I(P_2) = 5816.7 + 0 = 5816.7 < I(C) = 25587.5.$$
The desired reduction in total within-cluster variation has therefore been achieved. □
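The scan over the three admissible splits of this ordered cluster can be written out directly; the sketch below (mine, under the equal-weight convention $p_u = 1/m$ with $m = 4$) evaluates every leading/trailing split of the reordered sequence and reports the one maximizing the gain of Equation (7.43).

```python
# Example 7.29 as a scan over ordered splits of {TES, CES, CMA, PEN}.
import numpy as np

D = np.array([[0, 170, 200, 440],
              [170, 0, 30, 280],
              [200, 30, 0, 260],
              [440, 280, 260, 0]], dtype=float)
m = 4

def I(members):                      # Equation (7.39) with p_u = 1/m
    idx = list(members)
    sub = D[np.ix_(idx, idx)]
    return (sub[np.triu_indices(len(idx), k=1)] ** 2).sum() / (m * len(idx))

labels = ["TES", "CES", "CMA", "PEN"]
whole = I(range(m))                  # 25587.5
splits = [(whole - I(range(s)) - I(range(s, m)), s) for s in range(1, m)]
dI, s = max(splits)
print(labels[:s], labels[s:], round(dI, 3))
# ['TES', 'CES', 'CMA'] ['PEN'] 19770.833
```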


We can now proceed to determine the complete hierarchy. We do this for the $Y_1$ random variable only in Example 7.30, and then in Examples 7.31 and 7.32 the procedure is developed with both $Y_1$ and $Y_4$ for non-normalized and normalized distances, respectively. Finally, in Example 7.33, the solution for all breeds and all six random variables is presented, with the calculation details left to the reader as an exercise. We let the stopping criterion for the number of partitions be $R = 4$.

Example 7.30. Consider all horse breeds and $Y_1$ = Minimum Weight. We first reorder the breeds so that the midpoints $X_u$ obtained from Equation (7.45) are in increasing order. These are shown in Table 7.15 along with the corresponding $s$ and cutpoint values $c_s$ obtained from Equation (7.47). The Hausdorff distance matrix for this reordered sequence is

D =
        w3    w7    w8    w6    w2    w1    w5    w4
  w3     0    40    40   260   260   280   420   440
  w7    40     0     0   280   260   290   440   460
  w8    40     0     0   280   260   290   440   460
  w6   260   280   280     0   220   240   240   240
  w2   260   260   260   220     0    30   180   200
  w1   280   290   290   240    30     0   150   170
  w5   420   440   440   240   180   150     0    20
  w4   440   460   460   240   200   170    20     0      (7.63)

Table 7.15 Divisive clustering – horses, first bipartition, Y1.

  Breed  Minimum Weight  Xu   s   cs     I(C1_s)   I(C2_s)   W(C_s)    ΔW(C_s)
  w3     (130, 190)      160  1   165.0  0         28873.2   28873.2   5559.6
  w7     (170, 170)      170  2   240.0  100.0     20450.0   20550.0   13882.8
  w8     (170, 170)      170  3   240.0  133.3     8657.5    8790.8    25642.0
  w6     (170, 450)      310  4   360.0  7112.5    3909.4    11021.9   23410.9
  w2     (390, 430)      410  5   422.5  11970.0   2158.3    14128.3   20304.5
  w1     (410, 460)      435  6   472.5  16331.3   25.0      16356.3   18076.6
  w5     (410, 610)      510  7   515.0  26071.4   0         26071.4   8361.4
  w4     (410, 630)      520  —   —      34432.8

Then, from Equation (7.39), the within-cluster variance $I(C) = I(E)$ is
$$I(E) = \frac{1}{8\times 8}\left[40^2 + \cdots + 20^2\right] = 34432.8.$$
There are seven possible cuts of $C$ into $P_1 = (C_1, C_2)$, one for each $s = 1, \ldots, 7$. Let us denote the $s$th possible bipartition as $P(s) = (C_s^{(1)}, C_s^{(2)})$. The first partition of $E$ at $s = 1$ is such that $C_1 = (C_1^{(1)}, C_1^{(2)})$ where $C_1^{(1)} = \{w_3\}$ and $C_1^{(2)} = \{w_u, u \ne 3\}$.


The within-cluster variations for each of $C_1^{(1)}$ and $C_1^{(2)}$ are calculated from Equation (7.39) with the relevant entries from the full distance matrix of Equation (7.63). Therefore,
$$I(C_1^{(1)}) = 0$$
and
$$I(C_1^{(2)}) = \frac{1}{8\times 7}\left[40^2 + 260^2 + \cdots + 170^2\right] = 28873.2.$$
Hence, from Equation (7.40),
$$W(C_1) = I(C_1^{(1)}) + I(C_1^{(2)}) = 28873.2,$$
and, from Equation (7.43),
$$I_1 = \Delta W(C_1) = I(E) - W(C_1) = 5559.6.$$

When $s = 3$, say, the bipartition becomes $P_3 = (C_3^{(1)}, C_3^{(2)})$ with $C_3^{(1)} = \{w_3, w_7, w_8\}$ and $C_3^{(2)} = \{w_6, w_2, w_1, w_5, w_4\}$, the remaining five breeds in the reordered sequence. In this case,
$$I(C_3^{(1)}) = \frac{1}{8\times 3}\left[40^2 + 40^2 + 0\right] = 133.3,$$
$$I(C_3^{(2)}) = \frac{1}{8\times 5}\left[220^2 + \cdots + 20^2\right] = 8657.5,$$
and hence $W(C_3) = 8790.8$ and $\Delta W(C_3) = 25642.0$. The cluster variations $I(C_s^{(1)})$, $I(C_s^{(2)})$, $W(C_s)$, and $I_s = \Delta W(C_s)$ are given in Table 7.15 for each bipartition $s = 1, \ldots, 7$.

From the results in Table 7.15, it is seen that the cut $s$ which maximizes $I_s$ (or, equivalently, minimizes $W(C_s)$) is $s = 3$. Therefore, the two clusters for the $r = 2$ partition are
$$P_2 = (C_1 = \{w_3, w_7, w_8\},\; C_2 = \{w_1, w_2, w_4, w_5, w_6\}).$$

The question on which this partition was based is
$$q_1: \text{Is } Y = \text{Minimum Weight} \le 240? \qquad (7.64)$$
If 'Yes', then that observation falls into the cluster $C_1$; if 'No', then the observation falls into $C_2$.

To find the best $(r = 3)$ three-cluster partition, we first take each of these $C_1$ and $C_2$ clusters and find the best subpartitioning for each in the manner described above. That is, we find the optimal
$$C_1 = (C_1^{(1)}, C_1^{(2)}) \quad\text{and}\quad C_2 = (C_2^{(1)}, C_2^{(2)}).$$
Thus, for example, if the first cluster $C_1 = \{w_3, w_7, w_8\}$ is split at $s = 1$, to give
$$C_1^{(1)} = \{w_3\}, \qquad C_1^{(2)} = \{w_7, w_8\},$$
we can calculate the within-cluster variations as
$$I(C_1^{(1)}) = 0, \qquad I(C_1^{(2)}) = 0;$$
hence
$$I(C_1^{(1)}) + I(C_1^{(2)}) = 0.$$
Similarly, for the split
$$C_1^{(1)} = \{w_3, w_7\}, \qquad C_1^{(2)} = \{w_8\},$$
we have
$$I(C_1^{(1)}) + I(C_1^{(2)}) = 100.0 + 0 = 100.0.$$
Also, the total within-cluster variation for $C_1$ is
$$I(C_1) = \frac{1}{8\times 3}\left[40^2 + 40^2\right] = 133.3.$$

The $I(C_s^{(1)})$, $I(C_s^{(2)})$, $W(C_s)$, and $I_s = \Delta W(C_s)$ values are summarized in part (a) of Table 7.16. Thus, if this cluster is selected to form the actual partition, it is the partition $C_1 = (C_1^{(1)} = \{w_3\}, C_1^{(2)} = \{w_7, w_8\})$ which is optimal, since this bipartition has the minimum value of $W(C_s)$, or equivalently the maximum value of $I_s$.

Table 7.16 Second and third bipartition – horses, Y1.

(a) Second partition
  Breed  Y1          Xu   s   cs     I(C1_s)  I(C2_s)  W(C_s)   ΔW(C_s)
  w3     (130, 190)  160  1   165.0  0        0        0        133.3
  w7     (170, 170)  170  2   240.0  100.0    0        100.0    33.3
  w8     (170, 170)  170  —   240.0  133.3
  w6     (170, 450)  310  1   360.0  0        3909.4   3909.4   4748.1
  w2     (390, 430)  410  2   422.5  3025.0   2158.3   5183.3   3474.2
  w1     (410, 460)  435  3   472.5  4454.2   25.0     4479.2   4178.3
  w5     (410, 610)  510  4   515.0  6856.3   0        6856.3   1801.3
  w4     (410, 630)  520  —   —      8657.5

(b) Third partition
  w3     (130, 190)  160  1   165.0  0        0        0        133.3
  w7     (170, 170)  170  2   240.0  100.0    0        100.0    33.3
  w8     (170, 170)  170  —   240.0  133.3
  w6     (170, 450)  310  —   360.0  0        0        0        0
  w2     (390, 430)  410  1   422.5  0        2158.3   2158.3   1751.0
  w1     (410, 460)  435  2   472.5  56.3     25.0     81.3     3828.1
  w5     (410, 610)  510  3   515.0  2325.0   0        2325.0   1584.4
  w4     (410, 630)  520  —   —      3909.4


Similarly, if the second cluster of five observations $C_2 = \{w_1, w_2, w_4, w_5, w_6\}$ is split into the two subclusters
$$C_2^{(1)} = \{w_6\}, \qquad C_2^{(2)} = \{w_1, w_2, w_4, w_5\},$$
say, we obtain the within-cluster variation from Equation (7.39) as
$$I(C_2^{(1)}) = 0, \qquad I(C_2^{(2)}) = 3909.4,$$
and hence
$$W(C_1) = I(C_2^{(1)}) + I(C_2^{(2)}) = 3909.4.$$
Therefore,
$$I_1 = \Delta W(C_1) = 8657.5 - 3909.4 = 4748.1.$$
The within-cluster variations for all possible partitions of $C_2$ are shown in part (a) of Table 7.16. The total within-cluster variation for the whole cluster $C_2$ is
$$I(C_2) = 8657.5.$$

Comparing the $W(C_s) = I(C_2^{(1)}) + I(C_2^{(2)})$ for each $s$, we observe that the $s = 1$ cut gives the optimal partition within $C_2$. That is, if $C_2$ is to be split, we obtain optimally
$$C_2 = (C_2^{(1)} = \{w_6\},\; C_2^{(2)} = \{w_1, w_2, w_4, w_5\}).$$
The cluster to be partitioned at this (second) stage is chosen to be the $C_k$ which maximizes $I_s$. In this case,
$$I_1(C_1) = \max_s I_s(C_1) = 133.3$$
and
$$I_1(C_2) = \max_s I_s(C_2) = 4748.1.$$
Since
$$I_1(C_2) > I_1(C_1),$$
it is the optimal partitioning of $C_2$, and not that of $C_1$, which is chosen. Hence, the best $r = 3$ cluster partition of these data is
$$P_3 = (C_1 = \{w_3, w_7, w_8\},\; C_2 = \{w_6\},\; C_3 = \{w_1, w_2, w_4, w_5\}).$$

The question being asked now is the pair of questions
$$q_1: \text{Is } Y = \text{Minimum Weight} \le 240?$$
If 'No', then we also ask
$$q_2: \text{Is } Y = \text{Minimum Weight} \le 360?$$


The process continues in a similar manner to obtain the $R = r = 4$ partition $P_4 = (C_1, C_2, C_3, C_4)$. The variations $I(C_k^{(1)})$, $I(C_k^{(2)})$, $W(C_k)$, and $\Delta W(C_k)$ for the three clusters $k = 1, 2, 3$ in the $(r = 3)$ $P_3$ partition are displayed in part (b) of Table 7.16. It is seen that the optimal $P_4$ partition is
$$P_4 = (C_1 = \{w_3, w_7, w_8\},\; C_2 = \{w_6\},\; C_3 = \{w_1, w_2\},\; C_4 = \{w_4, w_5\}).$$
The partitioning criteria are the following three questions:
$$q_1: \text{Is } Y = \text{Minimum Weight} \le 240?$$
If 'Yes', then the observations fall into cluster $C_1$. If 'No', then the next question is
$$q_2: \text{Is } Y = \text{Minimum Weight} \le 360?$$
If 'Yes', then the observations fall into cluster $C_2$. If 'No', then the third question is
$$q_3: \text{Is } Y = \text{Minimum Weight} \le 472.5?$$
If 'Yes', then the observations fall into $C_3$; otherwise, they fall into cluster $C_4$.

While this partitioning was based on the one variable $Y_1$ = Minimum Weight only, the bi-plot of $(Y_1, Y_2)$, where $Y_2$ = Maximum Weight, shown in Figure 7.7 suggests that this partitioning is indeed appropriate.

Figure 7.7 Bi-plot of the horse breeds, Y1 = Minimum Weight (horizontal axis) against Y2 = Maximum Weight (vertical axis), with the breeds CES, CMA, PEN, TES, CEN, LES, PES, PAM plotted by their weight ranges. □
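The three questions $q_1$–$q_3$ act as a small decision rule on the Minimum Weight midpoints. The following sketch (mine; the breed midpoints are taken from Table 7.14) routes each breed to its cluster and recovers the partition $P_4$ above.

```python
# The partitioning questions of Example 7.30 applied to the Y1 midpoints.
mid = {"CES": 435, "CMA": 410, "PEN": 160, "TES": 520,
       "CEN": 510, "LES": 310, "PES": 170, "PAM": 170}

def assign(x):
    if x <= 240:          # q1
        return "C1"
    if x <= 360:          # q2
        return "C2"
    if x <= 472.5:        # q3
        return "C3"
    return "C4"

clusters = {}
for breed, x in mid.items():
    clusters.setdefault(assign(x), []).append(breed)
print(clusters)   # C1: PEN, PES, PAM; C2: LES; C3: CES, CMA; C4: TES, CEN
```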


Example 7.31. We want to find $R = 4$ clusters from all $m = 8$ breeds of horses from Table 7.14, based on the two variables $Y_1$ = Minimum Weight and $Y_4$ = Maximum Height.

Table 7.17 First partition, (Y1, Y4) – horses.

(a) Cut on Y1
  Breed  Y_u1        X_u1   c_1s    s   I(C1_s)   I(C2_s)   W(C_s)    ΔW(C_s)
  w3     [130, 190]  160.0  165.0   1   0         28988.1   28988.1   5750.8
  w7     [170, 170]  170.0  170.0   2   200.0     20522.3   20722.2   14016.6
  w8     [170, 170]  170.0  240.0   3   266.7     8670.2    8936.9    25801.9
  w6     [170, 450]  310.0  360.0   4   7295.6    3919.0    11214.6   23524.3
  w2     [390, 430]  410.0  422.5   5   12182.9   2166.4    14349.3   20389.5
  w1     [410, 460]  435.0  472.5   6   16599.4   30.8      16630.2   18108.7
  w5     [410, 610]  510.0  515.0   7   26366.5   0         26366.5   8372.4
  w4     [410, 630]  520.0  —       —   34738.8

(b) Cut on Y4
  Breed  Y_u4        X_u4   c_4s    s   I(C1_s)   I(C2_s)   W(C_s)    ΔW(C_s)
  w3     [107, 153]  130.0  138.50  1   0         28988.1   28988.1   5750.8
  w7     [147, 147]  147.0  147.00  2   200.0     20522.3   20722.3   14016.6
  w8     [147, 147]  147.0  152.75  3   266.7     8670.2    8936.9    25801.9
  w6     [147, 170]  158.5  158.50  4   7295.6    3919.0    11214.6   23524.3
  w2     [150, 167]  158.5  161.00  5   12182.9   2166.4    14349.3   20389.5
  w5     [155, 172]  163.5  165.00  6   23845.0   923.2     24768.2   9980.7
  w1     [158, 175]  166.5  168.00  7   26366.5   0         26366.5   8372.4
  w4     [147, 172]  169.5  —       —   34738.8

For each $s$ value for each of the $Y_1$ and $Y_4$ cuts, we calculate the within-cluster variations $I(C_s^{(1)})$ and $I(C_s^{(2)})$, $s = 1, \ldots, 7$. These are shown in parts (a) and (b) of Table 7.17 for cuts on the $Y_1$ and $Y_4$ orderings, respectively. For example, if we bipartition $C$ at $s = 3$ on $Y_1$, we have the two subclusters
$$C = (C_3^{(1)} = \{w_3, w_7, w_8\},\; C_3^{(2)} = \{w_6, w_2, w_1, w_5, w_4\}).$$
The calculations for $I(C_3^{(1)})$ take the upper left-hand $3\times 3$ submatrix of $D(Y_1)$ and those for $I(C_3^{(2)})$ take the lower right-hand $5\times 5$ submatrix of $D(Y_1)$. Then, from Equation (7.39),
$$I(C_3^{(1)}) = \frac{1}{8\times 3}\left[(56.57)^2 + (56.57)^2 + 0^2\right] = 266.7,$$


Table 7.18 Euclidean Hausdorff distance matrix.

D(Y1), cut on Y1 (order w3, w7, w8, w6, w2, w1, w5, w4):
  w3    0.000   56.569   56.569  263.059  263.532  284.607  422.734  441.814
  w7   56.569    0.000    0.000  280.943  260.768  291.349  440.710  460.679
  w8   56.569    0.000    0.000  280.943  260.768  291.349  440.710  460.679
  w6  263.059  280.943  280.943    0.000  220.021  240.252  240.133  240.008
  w2  263.532  260.768  260.768  220.021    0.000   31.048  180.069  200.063
  w1  284.607  291.349  291.349  240.252   31.048    0.000  150.030  170.356
  w5  422.734  440.710  440.710  240.133  180.069  150.030    0.000   21.541
  w4  441.814  460.679  460.679  240.008  200.063  170.356   21.541    0.000

D(Y4), cut on Y4 (order w3, w7, w8, w6, w2, w5, w1, w4):
  w3    0.000   56.569   56.569  263.059  263.532  422.734  284.607  441.814
  w7   56.569    0.000    0.000  280.943  260.768  440.710  291.349  460.679
  w8   56.569    0.000    0.000  280.943  260.768  440.710  291.349  460.679
  w6  263.059  280.943  280.943    0.000  220.021  240.133  240.252  240.008
  w2  263.532  260.768  260.768  220.021    0.000  180.069   31.048  200.063
  w5  422.734  440.710  440.710  240.133  180.069    0.000  150.030   21.541
  w1  284.607  291.349  291.349  240.252   31.048  150.030    0.000  170.356
  w4  441.814  460.679  460.679  240.008  200.063   21.541  170.356    0.000

and also from Equation (7.39),
$$I(C_3^{(2)}) = \frac{1}{8\times 5}\left[(220.02)^2 + (240.25)^2 + \cdots + (21.54)^2\right] = 8670.2.$$
Hence, from Equation (7.40),
$$W(C_3) = I(C_3^{(1)}) + I(C_3^{(2)}) = 266.7 + 8670.2 = 8936.9.$$
The total within-cluster variation for $P_1$ (before its bisection) is, from Equation (7.39) and Table 7.18,
$$I(C) = \frac{1}{8\times 8}\left[(56.57)^2 + \cdots + (21.54)^2\right] = 34738.8.$$
Therefore,
$$\Delta W(C_3) = 34738.8 - 8936.9 = 25801.9.$$
Since the cut orders on $Y_1$ and $Y_4$ are the same for $s = 1, \ldots, 5, 8$, these $W(C_s)$ values are the same in both cases, as Table 7.17 reveals. They differ, however, for the $s = 6$ and $s = 7$ cuts since, when cutting on $Y_1$, $C_7^{(2)} = \{w_5, w_4\}$, but when cutting on $Y_4$, $C_7^{(2)} = \{w_1, w_4\}$.

The optimal bipartition is that for which $W(C_s)$ is minimum. In this case, this occurs at $s = 3$. Therefore, the second partition is
$$P_2 = (C_1 = \{w_3, w_7, w_8\},\; C_2 = \{w_6, w_2, w_1, w_5, w_4\}).$$

The question that separates $C$ into $C_1$ and $C_2$ is
$$\text{Is } Y_1 = \text{Minimum Weight} \le 240? \qquad (7.65)$$
If 'Yes', then the observation goes into cluster $C_1$, but if 'No', then it goes into cluster $C_2$. Since the cut on the $Y_4$ order gives the same minimum $W(C_3)$ value, the question might have been phrased in terms of $Y_4$ as
$$\text{Is } Y_4 = \text{Maximum Height} \le 152.75? \qquad (7.66)$$
However, the distance between the breeds $w_8$ (PAM) and $w_6$ (LES) for the $Y_1$ variable is $d_1(w_8, w_6) = 280$, while the distance between them for the $Y_4$ variable is $d_2(w_8, w_6) = 23$. Therefore, the criterion selected is that based on $Y_1$, i.e., Equation (7.65) rather than Equation (7.66).

By partitioning $P_1$ into $P_2$, the new total within-cluster variation, from Equation (7.40), is
$$W(P_2) = W(C_3) = I(C_3^{(1)}) + I(C_3^{(2)}) = 8936.9.$$
The proportional reduction in variation is
$$[I(P_1) - W(C_3)]/I(P_1) = (34738.8 - 8936.9)/34738.8 = 0.743.$$
That is, 74.3% of the total variation is explained by this particular partitioning.

The second partitioning stage is to bipartition $P_2 = (C_1, C_2)$ into three subclusters. We take each of $C_k$, $k = 1, 2$, and repeat the previous process by taking cuts $s$ on each of $C_1$ and $C_2$ for each of $Y_1$ and $Y_4$, using the relevant distance submatrices extracted from the full distance matrices in Table 7.18. The resulting $I(C_s^{(1)})$, $I(C_s^{(2)})$, $W(C_s)$, and $I_s$ values in each case are displayed in Table 7.19. We note again that these values differ for the $Y_1$ and $Y_4$ variables at the $s = 3$ and $s = 4$ cuts on $C_2$ because of the different orders for the breeds in $C_2$.

Then, if the cluster to be partitioned is $C_1$, its optimal bisection is into
$$C_1 = (C_1^{(1)} = \{w_3\},\; C_1^{(2)} = \{w_7, w_8\}),$$
since $W(C_1) = \min_s W(C_s) = 0$ and $I_1^1 = \max_s I_s = 266.7$.

If the cluster to be partitioned is $C_2$, its optimal bisection is
$$C_2 = (C_2^{(1)} = \{w_6\},\; C_2^{(2)} = \{w_2, w_1, w_5, w_4\}),$$
since for subclusters within $C_2$,
$$W(C_1) = \min_s W(C_s) = 3919.0$$


Table 7.19 Second partition, (Y1, Y4) – horses.

(a) Cut on Y1
  Breed  Y_u1        X_u1   c_1s    s   I(C1_s)  I(C2_s)  W(C_s)   ΔW(C_s)
  w3     [130, 190]  160.0  165.0   1   0        0        0        266.7
  w7     [170, 170]  170.0  170.0   2   200.0    0        200.0    66.7
  w8     [170, 170]  170.0  240.0   —   266.7
  ----------------------------------------------------------------------
  w6     [170, 450]  310.0  360.0   4   0        3919.0   3919.0   4751.2
  w2     [390, 430]  410.0  422.5   5   3025.6   2166.4   5192.0   3478.2
  w1     [410, 460]  435.0  472.5   6   4462.3   29.0     4491.3   4178.9
  w5     [410, 610]  510.0  515.0   7   6865.4   0        6865.4   1804.8
  w4     [410, 630]  520.0  —       —   8670.2

(b) Cut on Y4
  w3     [107, 153]  130.0  138.50  1   0        0        0        266.7
  w7     [147, 147]  147.0  147.00  2   200.0    0        200.0    66.7
  w8     [147, 147]  147.0  152.75  —   266.7
  ----------------------------------------------------------------------
  w6     [147, 170]  158.5  158.50  4   0        3919.0   3919.0   4751.2
  w2     [150, 167]  158.5  161.00  5   3025.6   2166.4   5192.0   3478.2
  w5     [155, 172]  163.5  165.00  6   5770.8   1813.8   7584.6   1085.6
  w1     [158, 175]  166.5  168.00  7   6865.4   0        6865.4   1804.8
  w4     [147, 172]  169.5  —       —   8670.2

and $I_1^2 = \max_s I_s = 4751.2$. The choice of whether to bisect $C_1$ or $C_2$ is made by taking the optimal bisection within $C_1$ or $C_2$ for which $I_s$ is maximized. Here,
$$I_1^2 = \max(I_1^1, I_1^2) = \max(266.7, 4751.2) = 4751.2.$$
Therefore, it is the $C_2$ cluster that is bisected. This gives us the new partitioning as
$$P_3 = (C_1 = \{w_3, w_7, w_8\},\; C_2 = \{w_6\},\; C_3 = \{w_2, w_1, w_5, w_4\}).$$
The criterion now becomes the two questions:
$$q_1: \text{Is } Y_1 = \text{Minimum Weight} \le 240?$$
If 'No', then we ask
$$q_2: \text{Is } Y_1 = \text{Minimum Weight} \le 360? \qquad (7.67)$$
Again, the criterion of Equation (7.67) is based on the $Y_1$ variable since the distance between $w_6$ (LES) and $w_2$ (CMA) is greater for the $Y_1$ variable than it is for the $Y_4$ variable (at 160 and 5, respectively).

The third partitioning stage, to give $P_4 = (C_1, C_2, C_3, C_4)$, proceeds in a similar manner. The relevant within-cluster variations are shown in Table 7.20. Note that, this time, the optimal partitioning of $C_3$ is different depending on whether the cut is based on $Y_1$ or $Y_4$, with the $Y_1$ cut producing
$$C_3 = (C_3^{(1)} = \{w_2, w_1\},\; C_3^{(2)} = \{w_5, w_4\})$$


Table 7.20 Third partition, (Y1, Y4) – horses.

(a) Cut on Y1
  Breed  Y_u1        X_u1   c_1s    s   I(C1_s)  I(C2_s)  W(C_s)   ΔW(C_s)
  w3     [130, 190]  160.0  165.0   1   0        0        0        266.7
  w7     [170, 170]  170.0  170.0   2   200.0    0        200.0    66.7
  w8     [170, 170]  170.0  240.0   —   266.7
  ----------------------------------------------------------------------
  w6     [170, 450]  310.0  360.0   4   0        0        0        0
  ----------------------------------------------------------------------
  w2     [390, 430]  410.0  422.5   5   0        2166.4   2166.4   1752.6
  w1     [410, 460]  435.0  472.5   6   60.2     29.0     89.2     3829.7
  w5     [410, 610]  510.0  515.0   7   2329.1   0        2329.1   1589.9
  w4     [410, 630]  520.0  —       —   3919.0

(b) Cut on Y4
  w3     [107, 153]  130.0  138.50  1   0        0        0        266.7
  w7     [147, 147]  147.0  147.00  2   200.0    0        200.0    66.7
  w8     [147, 147]  147.0  152.75  —   266.7
  ----------------------------------------------------------------------
  w6     [147, 170]  158.5  158.50  4   0        0        0        0
  ----------------------------------------------------------------------
  w2     [150, 167]  158.5  161.00  5   0        2166.4   2166.4   1752.6
  w5     [155, 172]  163.5  165.00  6   2026.6   1813.8   3840.4   78.6
  w1     [158, 175]  166.5  168.00  7   2329.1   0        2329.1   1589.9
  w4     [147, 172]  169.5  —       —   3919.0

and the $Y_4$ cut producing
$$C_3 = (C_3^{(1)} = \{w_2\},\; C_3^{(2)} = \{w_5, w_1, w_4\}).$$
However, the $Y_1$ cut is chosen as it has the higher $I_s = \Delta W(C_s)$ value (3829.7 for the $Y_1$ cut compared to 1752.6 for the $Y_4$ cut).

From these results we conclude that the best partitioning of $E$ into four clusters gives
$$P_4 = (C_1 = \{w_3, w_7, w_8\},\; C_2 = \{w_6\},\; C_3 = \{w_1, w_2\},\; C_4 = \{w_4, w_5\}).$$
The progressive schemata are displayed in Figure 7.8. The complete criteria on which this bipartitioning is made are
$$q_1: \text{Is } Y_1 = \text{Minimum Weight} \le 240?$$
If 'Yes', then the breed goes into $C_1$. If 'No', then we ask
$$q_2: \text{Is } Y_1 \le 360?$$
If 'Yes', then the breed goes into $C_2$. If 'No', then we ask
$$q_3: \text{Is } Y_1 \le 472.5?$$
If 'Yes', then the breed goes into $C_3$. If 'No', it goes into $C_4$.


Figure 7.8 Hierarchy based on (Y1, Y4) – horses, by breed #. [Tree: E = {1, …, 8}; q1: Is Y1 ≤ 240? Yes → C1 = {3, 7, 8}; No → {1, 2, 4, 5, 6}; q2: Is Y1 ≤ 360? Yes → C2 = {6}; No → {1, 2, 4, 5}; q3: Is Y1 ≤ 472.5? Yes → C3 = {1, 2}; No → C4 = {4, 5}.]

Finally, the total within-cluster variation for the partition $P_4$ is
$$I(P_4) = \sum_{k=1}^{4} I(C_k) = 266.7 + 0 + 60.2 + 29.0 = 355.9.$$
Therefore, the total between-cluster variation for $P_4$ is, from Equation (7.41),
$$B(P_4) = I(P_1) - I(P_4) = 34738.8 - 355.9 = 34382.9.$$
Thus, approximately 99% of the total variation is explained by the differences between clusters. □

Example 7.32. To find the hierarchy for the same horse breeds of Table 7.14 based on the random variables $Y_1$ = Minimum Weight and $Y_4$ = Maximum Height by using the dispersion Euclidean Hausdorff distances, we proceed as in Example 7.31. However, the distances are calculated by use of Equation (7.33). Therefore, at each cut $s$, the distances must be recalculated since, from Equation (7.34), the normalization factor $H_j$ is based only on those observations in the (sub)cluster under consideration.

Table 7.21 gives the elements of the dispersion Euclidean distance matrix used when cutting by $Y_1$. Consider the $s = 3$ cut on $Y_1$.


Table 7.21 Dispersion Euclidean distance matrix – cut on Y1.

  wu    w3    w7    w8    w6    w2    w1    w5    w4
  w3   0.00  2.30  2.30  2.68  2.83  3.28  3.56  3.29
  w7   2.30  0.00  0.00  2.00  1.81  2.24  2.77  2.86
  w8   2.30  0.00  0.00  2.00  1.81  2.24  2.77  2.86
  w6   2.68  2.00  2.00  0.00  1.20  1.44  1.37  1.30
  w2   2.83  1.81  1.81  1.20  0.00  0.49  1.01  1.12
  w1   3.28  2.24  2.24  1.44  0.49  0.00  0.83  1.11
  w5   3.56  2.77  2.77  1.37  1.01  0.83  0.00  0.47
  w4   3.29  2.86  2.86  1.30  1.12  1.02  0.64  0.00

We need to calculate the relevant distance matrices for each of the subclusters $C_1 = \{w_3, w_7, w_8\}$ and $C_2 = \{w_6, w_2, w_1, w_5, w_4\}$. Take $C_2$. Then, for the $Y_1$ and $Y_4$ variables, from Equation (7.34),
$$H_1^2 = \frac{1}{5^2}\left[220^2 + \cdots + 20^2\right] = 13852, \qquad H_4^2 = \frac{1}{5^2}\left[3^2 + \cdots + 8^2\right] = 20.24.$$
Hence, from Equation (7.33), for instance,
$$d(w_6, w_2) = \left[\left(\frac{220}{117.7}\right)^2 + \left(\frac{3}{4.5}\right)^2\right]^{1/2} = 1.98.$$
Proceeding likewise for all elements in $C_2$, we obtain
$$D(C_2) = \begin{pmatrix} 0 & 1.98 & 2.71 & 3.18 & 2.09 \\ 1.98 & 0 & 1.89 & 1.80 & 2.03 \\ 2.71 & 1.89 & 0 & 1.44 & 1.79 \\ 3.18 & 1.80 & 1.44 & 0 & 2.84 \\ 2.09 & 2.03 & 1.79 & 2.84 & 0 \end{pmatrix}.$$
Similarly,
$$D(C_1) = \begin{pmatrix} 0 & 3 & 3 \\ 3 & 0 & 0 \\ 3 & 0 & 0 \end{pmatrix}.$$
We proceed as in Example 7.31, using at each cut $s$ the recalculated distance matrix. Then, we calculate the respective inertias $I(C_s^{(1)})$, $I(C_s^{(2)})$, $W(C_s)$, and $\Delta W(C_s) \equiv I_s$. Hence, for each $r = 2, 3, 4$, we obtain the optimal cut to give the best $P_r = (C_1, \ldots, C_r)$. The final hierarchy of $R = 4$ clusters is found to be
$$P_4 = (C_1 = \{w_3, w_7, w_8\},\; C_2 = \{w_6\},\; C_3 = \{w_1, w_2\},\; C_4 = \{w_4, w_5\}).$$
The details are left to the reader as an exercise. □


Example 7.33. The hierarchy based on all horse breeds and all six random variables of the data of Table 7.14, when using the normalized Hausdorff distances of Equation (7.33), for $R = 4$ is obtained as
$$P_4 = (C_1 = \{w_3\},\; C_2 = \{w_2, w_4, w_6\},\; C_3 = \{w_7, w_8\},\; C_4 = \{w_1, w_5\}).$$
This is displayed in Figure 7.9. The numbers indicated at the nodes in this figure tell us the sequence of the partitioning. Thus, after partitioning $E$ into
$$P_1 = (C_1^{(1)} = \{3, 7, 8\},\; C_1^{(2)} = \{1, 2, 4, 5, 6\})$$
at the first stage, it is the $C_1^{(1)}$ subcluster which is partitioned at the second stage. The criteria on which this bipartitioning proceeds are the questions $q_1$, $q_2$, $q_3$:
$$q_1: \text{Is } Y_1 = \text{Minimum Weight} \le 240? \qquad (7.68)$$
$$q_2: \text{Is } Y_4 = \text{Maximum Height} \le 138.5? \qquad (7.69)$$
$$q_3: \text{Is } Y_4 = \text{Maximum Height} \le 161.5? \qquad (7.70)$$

Figure 7.9 Hierarchy based on (Y1, …, Y6) – horses, by breed # (# at node indicates bipartition sequence). [Tree: node 1, E = {1, …, 8}, q1: Is Y1 ≤ 240? Yes → {3, 7, 8} (node 2, q2: Is Y4 ≤ 138.5? Yes → C1 = {3}, No → C3 = {7, 8}); No → {1, 2, 4, 5, 6} (node 3, q3: Is Y4 ≤ 161.5? Yes → C2 = {2, 4, 6}, No → C4 = {1, 5}).]

If the answer to q1 is 'Yes' and to q2 is 'Yes', then the breeds go into C1. If the answer to q1 is 'Yes' and to q2 is 'No', then the breeds go into C3. If the answer to q1 is 'No' and to q3 is 'Yes', then the breeds go into C2. If the answer to q1 is 'No' and to q3 is 'No', then they go into C4. The details are left as an exercise. □


7.5 Hierarchy–Pyramid Clusters

7.5.1 Some basics

Pyramid clustering was introduced by Diday (1984, 1986) for classical data. Bertrand (1986) and Bertrand and Diday (1990) extended this concept to symbolic data, followed by a series of developments in, for example, Brito (1991, 1994) and Brito and Diday (1990). The underlying structures have received a lot of attention since then. We restrict our treatment to the basic principles. Recall from Section 7.2 that pyramid clusters are built from the bottom up, starting with each observation as a cluster of size 1. Also, clusters can overlap. This overlapping of clusters distinguishes a pyramid structure from the pure hierarchical structures of Section 7.4. See Section 7.5.2 below for an example of this comparison.

Example 7.34. Figure 7.10 represents a pyramid constructed on the five observations in Ω = {a, b, c, d, e}. The 11 clusters
C1 = {a}, C2 = {b}, C3 = {c}, C4 = {d}, C5 = {e},
C6 = {b, c}, C7 = {c, d}, C8 = {a, b, c}, C9 = {b, c, d},
C10 = {a, b, c, d}, C11 = {a, b, c, d, e}
together form a pyramid, written as
P = (C1, …, C11).

Figure 7.10 Pyramid clustering: the pyramid on E = {a, b, c, d, e} built from the 11 clusters listed above, with the overlapping clusters {b, c}, {c, d}, {a, b, c}, {b, c, d}, {a, b, c, d}, and {a, b, c, d, e}.


For example, the clusters C6 and C7 contain the observations {b, c} and {c, d}, respectively. Notice that the observation {c} appears in both the C6 and C7 clusters, i.e., they overlap. □

We have the following formal definitions; these are expansions of the earlier Definition 7.27 for pyramids.

Definition 7.32: A pyramidal classification of the observations in $E = \{w_1, \ldots, w_m\}$ is a family $P = \{C_1, C_2, \ldots\}$ of non-empty subsets such that

(i) $E \in P$;

(ii) for all $w_u \in E$, all single observations $\{w_u\} \in P$;

(iii) for any two subsets $C_i, C_j$ in $P$,
$$C_i \cap C_j = \emptyset \quad\text{or}\quad C_i \cap C_j \in P,$$
for all $i, j$; and

(iv) there exists a linear order $\le$ on $E$ such that every $C_i$ in $P$ spans a range of observations in $E$, i.e., if the subset $C_i$ covers the interval $C_i = [\alpha_i, \beta_i]$, then $C_i$ contains all (but only those) observations $w_k$ for which $\alpha_i \le w_k \le \beta_i$, or
$$C_i = \{w_k : \alpha_i \le w_k \le \beta_i,\; w_k \in E\},$$
where $\alpha_i, \beta_i \in E$. □

This linear order condition implies that for all $a \le b \le c$ (in that order), the distances between them satisfy
$$d(a, c) \ge \max\{d(a, b), d(b, c)\} \qquad (7.71)$$
where $d(x, y)$ is the distance between observations $x$ and $y$.

Example 7.35. Suppose there are three observations $E = \{a, b, c\}$ with a distance matrix

  D =      a   b   c
      a  ( 0   3   1 )
      b  (     0   2 )
      c  (         0 )

If the observations represent values $Y_1 = \xi$, $Y_2 = \xi + 3$, $Y_3 = \xi + 1$, for some arbitrary $\xi$, of a single variable $Y$, then they can be displayed schematically as in Figure 7.11.


Figure 7.11 Observations for Example 7.35: on a line, Y1 = ξ, Y3 = ξ + 1, and Y2 = ξ + 3.

For condition (iv) to hold, it is necessary to relabel {a, b, c}, so to speak, into the linear order $a' = a$, $b' = c$, $c' = b$, with the new distance matrix

  D' =      a'  b'  c'
       a' ( 0   1   3 )
       b' (     0   2 )
       c' (         0 )

Note that this new distance matrix D' is a Robinson matrix. □

A pyramidal clustering can reflect a measure of the proximity of clusters by the 'height', $h$, of the pyramid branches. Thus, we see in Figure 7.12 two different pyramids, each with the same observations in the respective clusters, but with different proximity measures. Definition 7.32 gives us both pyramids of Figure 7.12. The two pyramids are seen as two different pyramids when their indexes are taken into account. For example, consider the cluster {a, b, c}. In pyramid (a), this cluster has height $h = 2.5$, whereas in pyramid (b), the height is $h = 2$. Indexes can take various forms. An example is a measure of distance between clusters. This leads us to the following definition.

Figure 7.12 Two pyramids, different proximity: both are built on {a, b, c, d} from the clusters {a}, {b}, {c}, {d}, {b, c}, {c, d}, {a, b, c}, {a, b, c, d}, but with different index heights (e.g., {a, b, c} at h = 2.5 in pyramid (a) and at h = 2 in pyramid (b)).

Definition 7.33: Let $P$ be a pyramid with the properties (i)–(iv) of Definition 7.32, and let $h(C_i) \ge 0$ be an index associated with the clusters $C_i$ in $P$, $i = 1, \ldots, r$, where $r$ is the number of clusters in $P$. If, for all $(i, j)$ pairs in $E$ with $C_i$ contained in but not equal to $C_j$, there exists an index $h$ that is isotonic on $P$, i.e.,
$$h(C_i) \le h(C_j) \quad\text{if } C_i \subset C_j, \qquad (7.72)$$
and
$$h(C_i) = 0 \quad\text{if } C_i \text{ is a singleton}, \qquad (7.73)$$
then $(P, h)$ is defined to be an indexed pyramid. □

Example 7.36. Let $E$ contain the observations $E = \{a, b, c, d\}$. Let the pyramid be $P = (C_1 = \{a\}$, $C_2 = \{b\}$, $C_3 = \{c\}$, $C_4 = \{d\}$, $C_5 = \{b, c\}$, $C_6 = \{c, d\}$, $C_7 = \{a, b, c\}$, $C_8 = \{a, b, c, d\})$, and let $h(C_5) = 1.0$ and $h(C_7) = 2.5$; see pyramid (a) of Figure 7.12.

Then,
$$h(C_5) = 1.0 \le 2.5 = h(C_7)$$
and
$$C_5 = \{b, c\} \subset \{a, b, c\} = C_7.$$
Similarly,
$$C_1 = \{a\} \subset \{a, b, c\} = C_7$$
since $h(C_1) = 0 \le 2.5 = h(C_7)$, and likewise for the other clusters. Therefore, $P$ is an indexed pyramid. □

Definition 7.34: An indexed pyramid $P$ with classes $P = (C_1, \ldots, C_r)$ is said to be indexed in the broad sense if, for all $(i, j)$ pairs $C_i$ and $C_j$ with $C_i$ inside of but not equal to $C_j$ but with equal indexes, i.e., with
$$C_i \subset C_j, \quad h(C_i) = h(C_j),$$
there exist two other classes $C_l, C_k$ of $P$, $l, k \ne i, j$, such that
$$C_i = C_l \cap C_k, \quad C_i \ne C_l, \quad C_i \ne C_k. \qquad (7.74)$$ □

An implication of Definition 7.34 is that a pyramid indexed in the broad sense has clusters $C_i$ each with two predecessors.


Example 7.37. To illustrate a pyramid indexed in the broad sense, consider the structure of Figure 7.13, which defines the pyramid of $E = \{a, b, c, d\}$ by $P = (C_1, \ldots, C_{10})$ where
C1 = {a, b, c, d}, C2 = {a, b, c}, C3 = {b, c, d},
C4 = {a, b}, C5 = {b, c}, C6 = {b, c, d},
C7 = {a}, C8 = {b}, C9 = {c}, C10 = {d},
and where the index heights are
$$h(C_4) = 0.75, \quad h(C_5) = h(C_6) = 1.0, \quad h(C_2) = h(C_3) = 2.0, \quad h(E) = 3.0.$$

Figure 7.13 Pyramid indexed in the broad sense: C4 = {a, b} at h = 0.75; C5 = {b, c} and C6 = {b, c, d} at h = 1; C2 = {a, b, c} and C3 = {b, c, d} at h = 2; C1 = E = {a, b, c, d} at h = 3.

In particular, consider the clusters C5 and C6. Here h(C5) = h(C6) and C5 ⊂ C6 completely. Then, by Definition 7.34, there exist two other clusters whose intersection is C5. In this case, we see that the clusters C2 and C3 satisfy this condition. Note too that since C5 is contained completely within C6, and since they have the same index value, this pyramid indexed in the broad sense can be arbitrarily approximated by an indexed pyramid. □


7.5.2 Comparison of hierarchy and pyramid structures

By 'hierarchy' in this subsection, we refer to Definition 7.25. Our hierarchy can be created by a top-down (divisive) process or a bottom-up (agglomerative) process, but for the present purposes the hierarchy has no overlapping clusters. Both hierarchy and pyramid structures consist of nested clusters. However, a pyramid can contain overlapping clusters while a pure hierarchy structure does not. An implication of this statement is that any one cluster in a hierarchy has at most one predecessor, while for a pyramid any one cluster has at most two predecessors. Thirdly, though index values may be comparable, they may induce different distance matrices depending on whether the structure is a hierarchy or a pyramid.

Indeed, hierarchies are identified by ultrametric dissimilarity matrices, while pyramids are identified by Robinson matrices. There is a one-to-one correspondence between a hierarchy and its ultrametric matrix, and between a pyramid and its Robinson matrix.

Finally, a hierarchy is a special case of (the broader structure of) a pyramid, and an ultrametric matrix is an example of a Robinson matrix, but the converses do not hold.

Example 7.38. Consider the hierarchy $H$ of Figure 7.14(a) and the pyramid $P$ of Figure 7.14(b). The hierarchy $H$ consists of the clusters
$$H = (C_1^h = \{a, b, c\},\; C_2^h = \{a, b\},\; C_3^h = \{a\},\; C_4^h = \{b\},\; C_5^h = \{c\})$$
and has index values of
$$h(C_1^h) = 3, \quad h(C_2^h) = 1, \quad h(C_k^h) = 0, \; k = 3, 4, 5.$$

Figure 7.14 Comparison of hierarchy (a) and pyramid (b): both are built on {a, b, c} with {a, b} at index h = 1 and {a, b, c} at h = 3; the pyramid additionally contains the overlapping cluster {b, c} at h = 1.2.


The pyramid $P$ consists of the clusters
$$P = (C_1^p \equiv C_1^h,\; C_2^p \equiv C_2^h,\; C_3^p \equiv C_3^h,\; C_4^p \equiv C_4^h,\; C_5^p \equiv C_5^h,\; C_6^p = \{b, c\}),$$
with the index values
$$h(C_k^p) \equiv h(C_k^h), \; k = 1, \ldots, 5, \qquad h(C_6^p) = 1.2;$$
i.e., the structures are similar except that the pyramid contains the additional cluster $C_6^p = \{b, c\}$. Notice that the two pyramid clusters $C_2^p$ and $C_6^p$ overlap, each containing the observation $\{b\}$. Therefore, this implies that the cluster $C_4^p = \{b\}$ has two predecessors in $P$, namely, $C_2^p$ and $C_6^p$. In contrast, this cluster has only one predecessor in $H$, namely, $C_2^h = C_2^p$.

Further, let us suppose that the index function $h(\cdot)$ is the distance $d(x, y)$ induced from the hierarchy, or the pyramid, as the lowest height/distance between the observations $\{x\}$ and $\{y\}$ in $H$, or $P$, where we start at $h = 0$ for the individuals themselves. Therefore, in this case, we see from Figure 7.14(a) that the distance matrix for the hierarchy $H$ is

  D^h =     a   b   c
        a ( 0   1   3 )
        b (     0   3 )
        c (         0 )

where, for example, the distance between the observations $a$ and $b$ is $d(a, b) = 1$ since the lowest $C_i \in H$ containing both $a$ and $b$ is $C_2^h = \{a, b\}$, and not $C_1^h = \{a, b, c\}$. That is, $C_2^h$ is lower down the tree than $C_1^h$, and so is chosen. Rather than 'lower' down the tree, we can think of $C_2^h$ as being the smallest $C_k^h$ which contains both $\{a\}$ and $\{b\}$. The distance $d(a, c)$ is the index value between the clusters $C_1^h = \{a, b, c\}$ and $C_5^h = \{c\}$, i.e., $d(a, c) = 3$; likewise, $d(b, c) = 3$.

In contrast, the corresponding distance matrix for the pyramid $P$ in Figure 7.14(b) is

  D^p =     a   b    c
        a ( 0   1    3  )
        b (     0    1.2)
        c (          0  )

For example, $d(b, c) = 1.2$, since in $P$ the lowest $C_k^p$ that contains $\{b\}$ and $\{c\}$ is $C_6^p = \{b, c\}$, at an index of $h = 1.2$ above $\{c\}$, and not the $h = 3$ between $C_1^p = \{a, b, c\}$ and $C_5^p = \{c\}$.


It is observed that $D^h$ is an ultrametric dissimilarity matrix, while $D^p$ is a Robinson matrix. That $D^h$ is ultrametric is readily shown from the fact that, from Definition 7.3 in Section 7.1,
$$d(a, b) = 1 \le \max\{d(a, c), d(c, b)\} = \max\{3, 3\} = 3,$$
$$d(a, c) = 3 \le \max\{d(a, b), d(b, c)\} = \max\{1, 3\} = 3,$$
$$d(b, c) = 3 \le \max\{d(a, b), d(a, c)\} = \max\{1, 3\} = 3.$$
In addition, {a, b, c} forms an isosceles triangle with base (a, b) having a length $d(a, b) = 1$, which is shorter than the length of the other two sides, $d(a, c) = d(b, c) = 3$.

Notice also that $D^h$ is a Robinson matrix, as is $D^p$, since its elements do not decrease in value as they move away from the diagonal. An application of Definition 7.5 verifies this. However, the Robinson matrix $D^p$ is not an ultrametric matrix. To see this, it is only necessary to note that
$$d(a, c) = 3 \not\le \max\{d(a, b), d(b, c)\} = \max\{1, 1.2\} = 1.2. \qquad \square$$
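Both matrix properties used in this example are easy to verify mechanically. The following sketch (mine, not from the book) checks the ultrametric condition over all triples and the Robinson condition that entries never decrease moving away from the diagonal along a row or up a column, for the two matrices $D^h$ and $D^p$ above.

```python
# Checks for ultrametric and Robinson matrices.
import numpy as np

def is_ultrametric(D):
    n = len(D)
    return all(D[i, k] <= max(D[i, j], D[j, k]) + 1e-12
               for i in range(n) for j in range(n) for k in range(n))

def is_robinson(D):
    """Entries never decrease moving right along a row away from the diagonal,
    nor moving up a column away from the diagonal."""
    n = len(D)
    right_ok = all(D[i, j] <= D[i, j + 1] for i in range(n) for j in range(i, n - 1))
    up_ok = all(D[i + 1, j] <= D[i, j] for j in range(n) for i in range(j))
    return right_ok and up_ok

Dh = np.array([[0, 1, 3], [1, 0, 3], [3, 3, 0]], dtype=float)
Dp = np.array([[0, 1, 3], [1, 0, 1.2], [3, 1.2, 0]], dtype=float)
print(is_ultrametric(Dh), is_robinson(Dh))   # True True
print(is_ultrametric(Dp), is_robinson(Dp))   # False True
```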

7.5.3 Construction of pyramids

We construct our pyramid using the methodology of Brito (1991, 1994). Since pyramid clustering is an agglomerative bottom-up method, the principles of cluster construction outlined in Section 7.2.2 apply in reverse. That is, instead of starting with the complete set E and partitioning clusters C into two subclusters at each stage (C1 and C2), we start with m clusters, one for each observation singly, and merge two subclusters (C1 and C2, say) into one new cluster C. The other features of the construction process remain, i.e., selection of a criterion for choosing which two subclusters C1 and C2 prevail, and selection of a stopping rule. For these we need Definitions 7.35 and 7.36.

Definition 7.35: Let $E$ be a set of $m$ observations and let $C \subseteq E$ be a subset (or cluster) consisting of elements of $E$. Suppose that the observations are represented by the realizations $Y(w_1), \ldots, Y(w_m)$ of the random variable $Y = (Y_1, \ldots, Y_p)$. Let $s$ be the symbolic assertion object
$$s = f(C) = \bigwedge_{j=1}^{p}\left[Y_j \subseteq \bigcup_{w_u \in C} Y_j(w_u)\right], \qquad (7.75)$$
where $f$ is the mapping with $f(s) = \mathrm{Ext}(s; C)$; likewise, let $g = g(E)$ be the corresponding mapping with $g(s) = \mathrm{Ext}(s; E)$. Suppose $h$ is an iterated mapping satisfying
$$h = f \circ g, \text{ which gives } h(s) = f(g(s)),$$
$$h' = g \circ f, \text{ which gives } h'(C) = g(f(C)).$$
Then a symbolic object $s$ is complete if and only if $h(s) = s$. □

This is clearly a formal definition, and is based on definitions presented and illustrated in Sections 2.2.2 and 2.2.3. Brito (1991) relates this to the notion of a concept (see Definition 2.22) through the result that, for a complete object s for which C = g(s) = Ext(s; E), the pair (C, s) is a concept, where C ⊆ E is a subset (cluster) of E. In a less formal way, completeness corresponds to the 'tightest' condition which describes the extension of a set C.

Example 7.39. Consider the data of Table 7.22 (extracted from Table 2.1a) for the random variables $Y_3$ = Age, $Y_4$ = Gender, $Y_{10}$ = Pulse Rate, and $Y_{15}$ = LDL Cholesterol Level. Suppose we have the two assertion objects
$$s_1 = [\text{Age} \subseteq [20, 60]] \wedge [\text{Gender} \subseteq \{F\}] \wedge [\text{Pulse} \subseteq [75, 80]],$$
$$s_2 = [\text{Age} \subseteq [27, 56]] \wedge [\text{Gender} \subseteq \{F\}] \wedge [\text{Pulse} \subseteq [76, 77]] \wedge [\text{LDL} \subseteq [71, 80]]. \qquad (7.76)$$
The extensions are therefore $\mathrm{Ext}(s_1) = \{\text{Carly, Ellie}\} = \mathrm{Ext}(s_2)$.

Table 7.22 Health measurements.

  wu      Y3    Y4      Y10    Y15
          Age   Gender  Pulse  LDL
  Amy     70    F       72     124
  Carly   27    F       77     80
  Ellie   56    F       76     71
  Matt    64    M       81     124
  Sami    87    F       88     70

Although the extensions of s1 and s2 contain the same two individuals, the description of the set C = {Carly, Ellie} is more precise in s2 than it is in s1. That is, s1 is not a complete object, whereas s2 is a complete object. □

From Definition 7.35, it follows that a cluster C consisting of a single observation is complete. Also, the union of two complete objects is itself a complete object.

Example 7.40. To continue with Example 7.39 and the data of Table 7.22, suppose we have the clusters
$$C_1 = \{\text{Amy}\}, \qquad C_2 = \{\text{Carly, Ellie}\}.$$
Then, the single observation $C_1$ with
$$s_1 = s_1(C_1) = [\text{Age} \subseteq [70, 70]] \wedge [\text{Gender} \subseteq \{F\}] \wedge [\text{Pulse} \subseteq [72, 72]] \wedge [\text{LDL} \subseteq [124, 124]]$$
is a complete object. Also, $s_2 = s_2(C_2)$, where $s_2$ is given in Equation (7.76), is a complete object. Merging $C_1$ and $C_2$, we obtain the new cluster
$$C = C_1 \cup C_2 = \{\text{Amy, Carly, Ellie}\}$$
with extension
$$s = s(C) = [\text{Age} \subseteq [27, 70]] \wedge [\text{Gender} \subseteq \{F\}] \wedge [\text{Pulse} \subseteq [72, 80]] \wedge [\text{LDL} \subseteq [71, 124]].$$
This $s = s(C)$ is a complete object. □
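The merged description is obtained variable by variable: the enclosing interval for interval-valued variables, the union of categories for multi-valued ones. The sketch below (mine, built directly from the Table 7.22 values, with illustrative names) computes the tightest such description for the merged cluster; note it yields the smallest enclosing intervals consistent with Table 7.22.

```python
# Tightest description of a merged cluster: interval envelope for
# interval-valued variables, category union for multi-valued variables.
obs = {
    "Amy":   {"Age": (70, 70), "Gender": {"F"}, "Pulse": (72, 72), "LDL": (124, 124)},
    "Carly": {"Age": (27, 27), "Gender": {"F"}, "Pulse": (77, 77), "LDL": (80, 80)},
    "Ellie": {"Age": (56, 56), "Gender": {"F"}, "Pulse": (76, 76), "LDL": (71, 71)},
}

def describe(cluster):
    desc = {}
    for var in obs[cluster[0]]:
        vals = [obs[u][var] for u in cluster]
        if isinstance(vals[0], set):                       # categorical: union
            desc[var] = set().union(*vals)
        else:                                              # interval: envelope
            desc[var] = (min(a for a, _ in vals), max(b for _, b in vals))
    return desc

print(describe(["Amy", "Carly", "Ellie"]))
# Age (27, 70), Gender {'F'}, Pulse (72, 77), LDL (71, 124)
```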

Definition 7.36: Let $Y = (Y_1, \ldots, Y_p)$ be a random variable with $Y_j$ taking values in the domain $\mathcal{Y}_j$, where $\mathcal{Y}_j$ is a finite set of categories (or list of possible values for a multi-valued $Y_j$) or a bounded interval on $\mathbb{R}$ (for an interval-valued $Y_j$). Let the object $s$ have description
$$s = \bigwedge_{j=1}^{p}\left[Y_j \in D_j\right], \quad D_j \subseteq \mathcal{Y}_j, \; j = 1, \ldots, p.$$
Then, the generality degree of $s$ is
$$G(s) = \prod_{j=1}^{p} c(D_j)/c(\mathcal{Y}_j), \qquad (7.77)$$
where $c(A)$ is the cardinality of $A$ if $Y_j$ is a multi-valued variable, and $c(A)$ is the length of the interval $A$ if $Y_j$ is an interval-valued variable. □

Example 7.41. Consider the first four horse breeds in Table 7.14. Take the random variable $Y_3$ = Minimum Height. Suppose we are interested in the cluster C = {CES, CMA}. Then, the descriptions of $\mathcal{Y}$ and $C$ are
$$\mathcal{Y} = [90, 165], \qquad C = [130, 158].$$
Hence, the generality degree of $C$ for the random variable $Y_3$ is
$$G(C) = (158 - 130)/(165 - 90) = 0.373.$$
If our interest is in both $Y_3$ and $Y_4$ = Maximum Height, the descriptions of $\mathcal{Y}$ and $C$ become
$$\mathcal{Y} = ([90, 165], [107, 175]), \qquad C = ([130, 158], [150, 175]).$$
Hence, the generality degree of $C$ for both $Y_3$ and $Y_4$ is
$$G(C) = \left(\frac{158 - 130}{165 - 90}\right)\left(\frac{175 - 150}{175 - 107}\right) = 0.137. \qquad \square$$

Example 7.42. Consider the color and habitat data of birds shown in Table 7.2.Take the object or cluster C = {species 1, species 2}. Then, using both variables,we have

� = ��red, black, blue�� �rural, urban��

and

C = ��red, black�� �rural, urban��

Hence, the generality degree of C is

G(C) = (2/3) × (2/2) = 0.667.   □

The stopping rule for a pyramid is that the 'current' cluster is the entire set E itself, or that no two subclusters C1 and C2 satisfy the following conditions:

(i) neither C1 nor C2 has already been merged twice in earlier steps, and there is a total linear order (as defined in Definition 7.32) ≤1 on E such that the merged cluster C = C1 ∪ C2 is an interval with respect to ≤1;

(ii) the merged cluster C is complete (as defined in Definition 7.35).

The selection criterion for choosing which particular pair (C1, C2) to merge is to take that pair which minimizes the generality degree G(s) defined in Definition 7.36 and Equation (7.77).
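The following Python sketch illustrates the construction for the special case of a single interval-valued variable whose observations are already arranged in a compatible linear order. The bookkeeping shown here (enumeration of candidate merges, the rule that a cluster may be merged at most twice, and the tie-breaking choice of the pair of largest clusters when several merges attain the minimum generality degree) is our own simplification rather than a prescription of the text; for a single interval-valued variable a merged cluster described by its spanning interval is automatically complete, so condition (ii) requires no separate check.

from itertools import combinations

def build_pyramid(intervals):
    """Greedy pyramid construction for one interval-valued variable whose
    observations (a, b) are listed in a compatible linear order."""
    n = len(intervals)
    total = max(b for _, b in intervals) - min(a for a, _ in intervals)

    def g(lo, hi):                     # generality degree of positions lo..hi
        a = min(intervals[u][0] for u in range(lo, hi + 1))
        b = max(intervals[u][1] for u in range(lo, hi + 1))
        return (b - a) / total

    clusters = [(u, u) for u in range(n)]        # singleton clusters
    succ = {c: 0 for c in clusters}              # times each cluster was merged
    history = []

    while (0, n - 1) not in clusters:
        best = None
        for A, B in combinations(clusters, 2):
            if succ[A] == 2 or succ[B] == 2:
                continue                          # already merged twice
            lo, hi = min(A[0], B[0]), max(A[1], B[1])
            if max(A[0], B[0]) > min(A[1], B[1]) + 1:
                continue                          # union is not an interval of the order
            if (lo, hi) in (A, B) or (lo, hi) in clusters:
                continue                          # nothing new would be formed
            size = (A[1] - A[0] + 1) + (B[1] - B[0] + 1)
            key = (g(lo, hi), -size)              # minimize G; break ties by largest clusters
            if best is None or key < best[0]:
                best = (key, A, B, (lo, hi))
        if best is None:
            break                                 # no admissible merge remains
        (gval, _), A, B, C = best
        succ[A] += 1
        succ[B] += 1
        clusters.append(C)
        succ[C] = 0
        history.append((C, round(gval, 2)))
    return history

# Minimum Weight intervals for PEN, CMA, CES, TES (see Example 7.43 below)
horses = [(130, 190), (390, 430), (410, 460), (410, 630)]
for cluster, gval in build_pyramid(horses):
    print(cluster, gval)

With this tie-breaking rule the sketch recovers the merge order and generality degree values listed in Table 7.23 for the horse breeds of Example 7.43.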

Example 7.43. We construct a pyramid on the first four observations in Table 7.14, i.e., for the horse breeds CES (u = 1), CMA (u = 2), PEN (u = 3), and TES (u = 4). We take Y1 = Minimum Weight only. At the initial step t = 0, we identify the four singleton clusters.

Let us measure the distance between any two observations by the generality degree given above in Equation (7.77), for j = p = 1. Clearly, the length of the entire set 𝒴 is

d(𝒴) = max_u {bu} − min_u {au} = 630 − 130 = 500.


Hence, the distance matrix is

D′ =
          CES     CMA     PEN     TES
   CES    0       0.14    0.66    0.44
   CMA    0.14    0       0.60    0.48
   PEN    0.66    0.60    0       1.00
   TES    0.44    0.48    1.00    0

where, for example, from Equation (7.77),

d(1, 3) = c({CES, PEN})/c(𝒴)
        = [Max(b1, b3) − Min(a1, a3)]/d(𝒴)
        = (460 − 130)/500 = 0.66.

Transforming these observations so that a linear order applies (recall Example 7.35), we obtain the Robinson distance matrix

D1 =
          PEN     CMA     CES     TES
   PEN    0       0.60    0.66    1.00
   CMA    0.60    0       0.14    0.48
   CES    0.66    0.14    0       0.44
   TES    1.00    0.48    0.44    0                                   (7.78)

It is easily checked from Definition 7.32(iv) that this gives a linear order, at least of the initial clusters, denoted as

C1 = {PEN},   C2 = {CMA},   C3 = {CES},   C4 = {TES}.

Then since the smallest distance is

d(2, 3) = d(CMA, CES) = 0.14,

the cluster formed at this first (t = 1) stage is

C5 = C2 ∪ C3 = {CMA, CES}.

The second (t = 2) step is to form a new cluster by joining two of the clusters C1, ..., C5. Since C1, ..., C4, and hence C1, ..., C5, have an internal linear order, it is necessary to consider only the four possibilities:

C1 ∪ C2,   C1 ∪ C5,   C5 ∪ C4,   C3 ∪ C4.

The distances between C1 and C2, and C3 and C4, are known from D1 in Equation (7.78). To calculate the distance between C1 = {PEN} and C5 = {CMA, CES}, we first observe that C5 consists of the interval

𝒴5 = [Min(a2, a3), Max(b2, b3)]
   = [390, 460].


Then, the generality degree for C1 ∪C5 is, from Equation (7.77),

d(C1, C5) = [Max(460, 190) − Min(390, 130)]/500

          = (460 − 130)/500 = 0.66.

Similarly,

d(C4, C5) = [Max(630, 460) − Min(410, 390)]/500

          = (630 − 390)/500 = 0.48,

or equivalently,

G(C1 ∪ C2) = 0.60,   G(C1 ∪ C5) = 0.66,

G(C4 ∪ C5) = 0.48,   G(C3 ∪ C4) = 0.44.

Since

Min G(Ci ∪ Cj) = G(C3 ∪ C4) = 0.44,

the two clusters C3 and C4 are merged to form a new cluster C6 = {CES, TES}. This C6 cluster consists of the new interval 𝒴6 = [410, 630].

For the third (t = 3) step, the possible merged clusters are

C1 ∪ C2,   C1 ∪ C5,   C5 ∪ C6.

Notice in particular that C3 = {CES} cannot be merged with another cluster at this stage, as it already has two predecessors, having been merged into the new clusters C5 and then C6. The possible clusters have generality degree values of

G(C1 ∪ C2) = 0.60,   G(C1 ∪ C5) = 0.66,   G(C5 ∪ C6) = 0.48,

where we have

G(C5 ∪ C6) = G({CMA, CES, TES})
           = [Max(430, 460, 630) − Min(390, 410, 410)]/500
           = (630 − 390)/500 = 0.48.

Or equivalently, the new distance matrix is

D3 =
                      PEN     CMA     {CMA, CES}   {CES, TES}
   PEN                0       0.60    0.66         1.00
   CMA                0.60    0       −            −
   {CMA, CES}         0.66    −       0            0.48
   {CES, TES}         1.00    −       0.48         0

Since the smallest generality degree value is G(C5 ∪ C6) = 0.48, the new cluster C7 is formed from merging C5 and C6, i.e.,

C7 = C5 ∪ C6 = {CMA, CES, TES}


and spans the interval

𝒴7 = [Min(390, 410, 410), Max(430, 460, 630)]
   = [390, 630].

Similarly, at the next t = 4th step, we have

G(C1 ∪ C2) = 0.60,   G(C1 ∪ C5) = 0.66,   G(C1 ∪ C7) = 1.00,

D4 =
                          PEN     CMA     {CMA, CES}   {CMA, CES, TES}
   PEN                    0       0.60    0.66         1.00
   CMA                    0.60    0       −            −
   {CMA, CES}             0.66    −       0            −
   {CMA, CES, TES}        1.00    −       −            0

Hence, C1 and C2 are merged to give C8 = C1 ∪ C2 = {PEN, CMA}, which is the interval 𝒴8 = [130, 430], at a distance d = 0.60.

At the t = 5th step, we have

G(C5 ∪ C8) = 0.66,   G(C7 ∪ C8) = 1.00,   G(C5 ∪ C7) = 0.48,

to give the distance matrix

D5 =
          C5      C7      C8
   C5     0       0.48    0.66
   C7     0.48    0       1.00
   C8     0.66    1.00    0

The smallest value here, G(C5 ∪ C7) = 0.48, does not produce a new cluster, since C5 is already contained in C7. Hence, taking the next smallest value, 0.66, the new cluster is C9 = C5 ∪ C8 = {PEN, CMA, CES}, which is the interval 𝒴9 = [130, 460].

Finally, the t = 6th step gives C10 = C7 ∪ C9 = E. Since C10 contains all the observations, the pyramid is complete.

The complete pyramid P = (C1, ..., C10), showing the observations contained in each Ci, is displayed in Table 7.23. Also given is the symbolic description of each cluster, as well as the generality index (or minimum distance selected) when that cluster was merged. For example, the C6 cluster consists of two breeds, CES and TES, which together have a minimum weight interval value of 𝒴6 = [410, 630]. The generality index value which resulted when this cluster was formed is d = 0.44. The pictorial representation of this pyramid is displayed in Figure 7.15.   □


Table 7.23 Pyramid clusters for horses data.

Cluster Ck    Elements                  Interval 𝒴k    Generality G(Ck)

C1            {PEN}                     [130, 190]     0
C2            {CMA}                     [390, 430]     0
C3            {CES}                     [410, 460]     0
C4            {TES}                     [410, 630]     0
C5            {CMA, CES}                [390, 460]     0.14
C6            {CES, TES}                [410, 630]     0.44
C7            {CMA, CES, TES}           [390, 630]     0.48
C8            {PEN, CMA}                [130, 430]     0.60
C9            {PEN, CMA, CES}           [130, 460]     0.66
C10           {PEN, CMA, CES, TES}      [130, 630]     1.00

Figure 7.15 Pyramid of horses {CES, CMA, PEN, TES}, Y1.

Example 7.44. Let us take the same four horses (CES, CMA, PEN, TES) used in Example 7.43, but now let us build the pyramid based on the two random variables Y5 = Mares and Y6 = Fillies. We again use the generality degree of Equation (7.77). We calculate the relevant D matrix and, after reordering to attain


the desired linear order, we obtain the Robinson matrix

D1 =
          PEN     CMA     TES     CES
   PEN    0       0.16    0.50    1
   CMA    0.16    0       0.50    1
   TES    0.50    0.50    0       0.61
   CES    1       1       0.61    0

For example,

G(PEN, CMA) = ((200 − 0)/(480 − 0)) × ((50 − 0)/(130 − 0)) = 0.16,

since {PEN, CMA} spans ([0, 200], [0, 50]) and 𝒴 = (𝒴5, 𝒴6) spans ([0, 480], [0, 130]).

Then, at the first step (t = 1), the two clusters {PEN} and {CMA} are merged since, of all possible mergers, this has the minimum generality degree. This produces a cluster C5 = {PEN, CMA}.

We next (at step t = 2) calculate the generality degrees for the possible mergers among the clusters C5, CMA, TES, and CES. We obtain

D2 =
          C5      CMA     TES     CES
   C5     0       0       0.50    1
   CMA    0       0       0       1
   TES    0.50    0       0       0.61
   CES    1       1       0.61    0

The minimum generality degree is actually zero between C5 and CMA; however, since CMA is already contained within C5, we use the next minimal degree value. Therefore, we merge the clusters C5 and {TES} to give us

C6 = {PEN, CMA, TES}.

The process is repeated (at step t = 3) by calculating the generality degrees across TES, CES, and C6. Thus, we obtain

D3 =
          C6      TES     CES
   C6     0       0.50    1
   TES    0.50    0       0.61
   CES    1       0.61    0

As in the construction of C6, we ignore the 'possible' merger of TES with C6 (TES is already contained in C6) and instead merge {TES} and {CES}, which otherwise have the minimum generality degree, at 0.61. This produces the cluster C7 = {TES, CES}.

Finally, at the last step (t = 4), we merge C6 and C7 to give C8 = E. The complete pyramid structure is shown statistically in Table 7.24 and pictorially in Figure 7.16. Notice that for these two variables (Y5, Y6), only two clusters, C6 = {TES, CMA, PEN} and C7 = {CES, TES}, overlap, with the horse TES common to both.   □


Table 7.24 Pyramid clusters for horses data – (Y5, Y6).

Cluster Ck    Elements                𝒴k = (Y5, Y6)              Generality degree G(Ck)

C1            {CES}                   ([150, 480], [40, 130])    0
C2            {TES}                   ([100, 350], [30, 90])     0
C3            {CMA}                   ([0, 200], [0, 50])        0
C4            {PEN}                   ([0, 100], [0, 30])        0
C5            {CMA, PEN}              ([0, 200], [0, 50])        0.16
C6            {TES, CMA, PEN}         ([0, 350], [0, 90])        0.50
C7            {CES, TES}              ([100, 480], [30, 130])    0.61
C8            {CES, TES, CMA, PEN}    ([0, 480], [0, 130])       1.00

Figure 7.16 Pyramid of horses {CES, CMA, PEN, TES}, (Y5, Y6).

Example 7.45. Consider the two random variables Y1 = Minimum Weight and Y2 = Maximum Weight as the basis for constructing a pyramid on the same four horses (CES, CMA, PEN, TES) used in Examples 7.43 and 7.44. We show in Figure 7.17 the pyramid obtained when the merging criterion used is a dissimilarity matrix whose elements are the generalized weighted Minkowski distances obtained from Equation (7.28) when q = 1, using the Ichino–Yaguchi


measures of Equation (7.27) with γ = 0. The linearly ordered distance matrix is

D1 =
          PEN      CMA      CES      TES
   PEN    0        1.144    1.678    2.000
   CMA    1.144    0        0.203    0.911
   CES    1.678    0.203    0        0.737
   TES    2.000    0.911    0.737    0

The details are omitted, and left as an exercise for the reader.
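For readers attempting this exercise, one possible organization of the computations in Python is sketched below. It assumes the commonly used form of the Ichino–Yaguchi measure for interval-valued variables, φ(A, B) = |A ⊕ B| − |A ⊗ B| + γ(2|A ⊗ B| − |A| − |B|), here with γ = 0, and combines the variables into a city block (q = 1) sum with weights 1/c(𝒴j); the exact weighting used in Equations (7.27) and (7.28) should be verified before comparing results with the matrix above. The function names are ours.

def ichino_yaguchi(A, B, gamma=0.0):
    """Ichino-Yaguchi dissimilarity between intervals A = (a1, b1), B = (a2, b2)."""
    join = max(A[1], B[1]) - min(A[0], B[0])             # length of A join B
    meet = max(0.0, min(A[1], B[1]) - max(A[0], B[0]))   # length of A meet B
    return join - meet + gamma * (2 * meet - (A[1] - A[0]) - (B[1] - B[0]))

def weighted_cityblock(x, y, domains, gamma=0.0):
    """q = 1 weighted Minkowski combination over several interval variables."""
    return sum(ichino_yaguchi(Aj, Bj, gamma) / (dj[1] - dj[0])
               for Aj, Bj, dj in zip(x, y, domains))

# Single-variable illustration with the Minimum Weight intervals of CMA and CES;
# the matrix above also involves Y2 = Maximum Weight, whose values are in Table 7.14.
print(weighted_cityblock([(390, 430)], [(410, 460)], domains=[(130, 630)]))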

Figure 7.17 Pyramid of horses {CES, CMA, PEN, TES}, (Y1, Y2).

Notice that there are three overlapping clusters, specifically C5 = {CES, CMA} and C6 = {CMA, PEN} with CMA in common, C5 and C8 = {TES, CES} with CES in common, and C7 = {CES, CMA, PEN} and C9 = {TES, CES, CMA} with two horses, CES and CMA, in common.   □

By comparing Examples 7.43–7.45, we can identify some interesting features not uncommon in pyramid structures. We see that Figure 7.15 and Figure 7.17 are in effect almost mirror images of each other, both with the same relative linear ordering of the four horses, but with the roles of the horses TES and PEN reversed in the pyramid itself. One (Figure 7.15) was constructed using the


generality degree as the optimality criterion, while the other used a dissimilarity measure. Of course, in this case, it could be that the more dominant influence of the random variable Y2 (Maximum Weight) prescribed this difference; we leave that determination as an exercise for the reader. The pyramid built on the variables Y5 (Mares) and Y6 (Fillies) shown in Figure 7.16 is quite different from those in Figures 7.15 and 7.17. Not only is the linear order different (with the horse TES now taking an 'interior' position in the ordering), but there is only one pair of overlapping clusters compared to three overlapping clusters for the other two pyramids. These differing structures allow for enhanced interpretations from the now richer knowledge base generated by these analyses. Whatever the effect(s) of the inclusion or exclusion of certain variables, the impact of differing distance measures remains in general an open question.

This technique can also be extended to the case of mixed variables, where some random variables are multi-valued and some are interval-valued.

Example 7.46. Take the last four oils, namely, camellia Ca, olive Ol, beef B, and hog H, of Ichino's oils dataset given in Table 7.8. Let us build a pyramid from these four oils based on the three random variables Y1 = Specific Gravity, Y3 = Iodine Value, and Y5 = Fatty Acids, and let us use the generality degree of Equation (7.77) as our optimality criterion.

The Y1 and Y3 variables are interval-valued, taking values, respectively, over the ranges 𝒴1 = [0.858, 0.919] with c(𝒴1) = 0.061, and 𝒴3 = [40, 90] with c(𝒴3) = 50; Y5 is a multi-valued variable taking values over 𝒴5 = {C, L, Lu, M, O, P, S} with c(𝒴5) = 7.

After reordering the oils so that a linear order pertains, we obtain the distance matrix

D1 =
         Ol       Ca       H        B
   Ol    0        0.010    0.634    0.829
   Ca    0.010    0        0.481    0.673
   H     0.634    0.481    0        0.146
   B     0.829    0.673    0.146    0

For example, the generality degree for the pair Ol and H is

G({Ol, H}) = (0.061/0.061) × (37/50) × (6/7) = 0.634,

since the cluster {Ol, H} has description ([0.858, 0.919], [53.0, 90.0], {L, Lu, M, O, P, S}). Then, at this step t = 1, since the cluster {Ol, Ca} has the smallest generality degree value (at 0.010) in D1, the new cluster is C5 = {Ol, Ca}. The description for this cluster is ([0.914, 0.919], [79.0, 90.0], {L, O, P, S}), as given in Table 7.25.
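As a quick self-contained check of this mixed-variable calculation (the variable names in the snippet are ours), Equation (7.77) gives:

# Generality degree of the oil cluster {Ol, H} from Equation (7.77):
# Y1 and Y3 are interval-valued, Y5 is multi-valued.
g_Y1 = (0.919 - 0.858) / 0.061   # c(D1)/c(Y1) for the interval [0.858, 0.919]
g_Y3 = (90.0 - 53.0) / 50.0      # c(D3)/c(Y3) for the interval [53.0, 90.0]
g_Y5 = 6 / 7                     # six of the seven possible fatty acids
print(round(g_Y1 * g_Y3 * g_Y5, 3))   # 0.634, as obtained above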


Table 7.25 Pyramid clusters – Oils (Ol, Ca, H, B).

Cluster    Oils              Description 𝒴k = (Y1, Y3, Y5)                            G(Ck)

C1         {Ol}              ([0.914, 0.919], [79.0, 90.0], {L, O, P, S})             0
C2         {Ca}              ([0.916, 0.917], [80.0, 82.0], {L, O})                   0
C3         {H}               ([0.858, 0.864], [53.0, 77.0], {L, Lu, M, O, P, S})      0
C4         {B}               ([0.860, 0.870], [40.0, 48.0], {C, M, O, P, S})          0
C5         {Ol, Ca}          ([0.914, 0.919], [79.0, 90.0], {L, O, P, S})             0.01
C6         {H, B}            ([0.858, 0.870], [40.0, 77.0], {C, L, Lu, M, O, P, S})   0.15
C7         {Ca, H}           ([0.858, 0.917], [53.0, 82.0], {L, Lu, M, O, P, S})      0.48
C8         {Ol, Ca, H}       ([0.858, 0.919], [53.0, 90.0], {L, Lu, M, O, P, S})      0.63
C9         {Ol, Ca, H, B}    ([0.858, 0.919], [40.0, 90.0], {C, L, Lu, M, O, P, S})   1.0

We repeat the process taking into account the clusters C2, ..., C5. We obtain the distance matrix as

D2 =
         C5       Ca       H        B
   C5    0        0        0.634    0.829
   Ca    0        0        0.481    0.673
   H     0.634    0.481    0        0.146
   B     0.829    0.673    0.146    0

Since the smallest generality degree value which produces a new cluster is 0.146, we merge (at this step t = 2) the two oils H and B, to give C6 = {H, B}.

Now, for step t = 3, we consider the four clusters C5, Ca, H, and C6. These give the distance matrix

D3 =
         C5       Ca       H        C6
   C5    0        0        0.634    1
   Ca    0        0        0.481    0.812
   H     0.634    0.481    0        0.146
   C6    1        0.812    0.146    0

Then, from D3, the smallest generality degree value to give us a new cluster is 0.481, obtained when Ca and H are merged to form C7 = {Ca, H}.

We now take the three clusters C5, C7, C6 (in this linear order) and find the distance matrix to be

D4 =
         C5       C7       C6
   C5    0        0.634    1
   C7    0.634    0        0.812
   C6    1        0.812    0

Therefore, since the smallest generality degree value is 0.634, the optimal merge (at this step t = 4) is to form C8 = C5 ∪ C7 = {Ol, Ca, H}. The final step (t = 5) is to merge the two clusters C6 and C8, which produces the entire dataset E.


The pyramid therefore consists of the nine clusters P = (C1, ..., C8, E). The oils contained in each cluster, along with their statistical descriptions and the corresponding index h (= the generality degree value at their formation), are given in Table 7.25. The pictorial representation is shown in Figure 7.18.   □

Figure 7.18 Pyramid – oils (Ol, Ca, H, B), (Y1, Y3, Y5).

Exercises

E7.1. Consider the bird data of Table 7.2, and assume that the listed values of the random variables Y1 = Color and Y2 = Habitat are equally likely. Obtain the categorical distance measures for all pairs of observations by using Equation (7.13). Hence, give the distance matrix D.

E7.2. For the computer data of Table 7.11, find the normalized Euclidean distance matrix for (i) Y1 = Brand, (ii) Y2 = Type, and (iii) (Y1, Y2), respectively, based on the Gowda–Diday dissimilarity measure.

E7.3. Redo Exercise E7.2 but base the distance matrix in each case on the Ichino–Yaguchi dissimilarity measure with (a) γ = 0, (b) γ = 0.5.

E7.4. Consider the veterinary clinic data of Table 7.5. Obtain the Gowda–Diday distance matrix for the four animals: male and female rabbits and cats. Use (i) Y1 = Height, (ii) Y2 = Weight, and (iii) (Y1, Y2).


E7.5. Repeat Exercise E7.4 but this time obtain the Ichino–Yaguchi distance matrix for these four animals. Calculate these distances when (a) γ = 0, (b) γ = 0.5.

E7.6. Calculate the city block distance matrix based on the dissimilarity measures obtained in (i) Exercise E7.4 and (ii) Exercise E7.5.

E7.7. Calculate the normalized Euclidean distance matrix for the male and female rabbits and cats based on (i) the Gowda–Diday dissimilarity measure and (ii) the Ichino–Yaguchi dissimilarity measure with γ = 0.5.

E7.8. Derive the unweighted Minkowski distance matrix based on the Ichino–Yaguchi dissimilarity measure with γ = 0.5 using the oils w1, ..., w6 from Table 7.8, for (i) Y1 = Specific Gravity, (ii) Y5 = Fatty Acids, and (iii) (Y1, Y5).

E7.9. Take the health insurance data of Tables 2.2. Find an optimum partition of the six classes into two clusters based on the random variables (i) Y5 = Marital Status, (ii) Y6 = Number of Parents Alive, and (iii) both (Y5, Y6). Use the city block distance matrix obtained from the Ichino–Yaguchi dissimilarity measures, in the divisive clustering technique.

E7.10. Take the second four horses w5, ..., w8, i.e., CEN, LES, PES, and PAM, of the horses data of Table 7.14. Use the divisive method to obtain the optimum set of two clusters using the Gowda–Diday dissimilarity measures in a normalized Euclidean distance matrix, on the random variable Y3 = Minimum Height.

E7.11. Repeat Exercise E7.10 but with the random variable Y4 = Maximum Height.

E7.12. Repeat Exercise E7.10 where now the bivariate random variables (Y3, Y4) are used.

E7.13. Use the divisive clustering method to find the optimum set of three clusters among the oils data of Table 7.8 based on the two random variables (Y1, Y5) and the city block distance matrix obtained via the Ichino–Yaguchi dissimilarity measures with γ = 0.5.

E7.14. Repeat Exercise E7.13 but use the Gowda–Diday dissimilarity measures when calculating the distance matrix.

E7.15. Take the health insurance data of Tables 2.2. Obtain the pyramid built from the random variables (i) Y1 = Age, (ii) Y13 = Cholesterol, and (iii) both (Y1, Y13). Use the generality degree as the basis of the distance matrix.

E7.16. Repeat Exercise E7.15 where now the distance matrix is obtained from a normalized Euclidean distance based on the Ichino–Yaguchi dissimilarity measure with γ = 0.

References

Bertrand, P. (1986). Étude de la Représentation Pyramidale. Thèse de 3ème Cycle, Université Paris, Dauphine.

Bertrand, P. and Diday, E. (1990). Une Généralisation des Arbres Hiérarchiques: les Représentations Pyramidales. Revue de Statistique Appliquée 38, 53–78.


Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regression Trees. Wadsworth, Belmont, CA.

Brito, P. (1991). Analyse de Données Symboliques. Pyramides d'Héritage. Thèse de Doctorat, Université Paris, Dauphine.

Brito, P. (1994). Use of Pyramids in Symbolic Data Analysis. In: New Approaches in Classification and Data Analysis (eds. E. Diday, Y. Lechevallier, M. Schader, P. Bertrand, and B. Burtschy). Springer-Verlag, Berlin, 378–386.

Brito, P. and Diday, E. (1990). Pyramidal Representation of Symbolic Objects. In: Knowledge, Data and Computer-Assisted Decisions (eds. M. Schader and W. Gaul). Springer-Verlag, Heidelberg, 3–16.

Chavent, M. (1997). Analyse de Données Symboliques. Une Méthode Divisive de Classification. Thèse de Doctorat, Université Paris, Dauphine.

Chavent, M. (1998). A Monothetic Clustering Algorithm. Pattern Recognition Letters 19, 989–996.

Chavent, M. (2000). Criterion-based Divisive Clustering for Symbolic Data. In: Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data (eds. H.-H. Bock and E. Diday). Springer-Verlag, Berlin, 299–311.

Diday, E. (1984). Une Représentation Visuelle des Classes Empiétantes: les Pyramides. Rapport de Recherche 291, INRIA, Rocquencourt.

Diday, E. (1986). Orders and Overlapping Clusters by Pyramids. In: Multivariate Data Analysis (eds. J. De Leeuw, W. J. Heiser, J. J. Meulman, and F. Critchley). DSWO Press, Leiden, 201–234.

Diday, E. (1989). Data Analysis, Learning Symbolic and Numeric Knowledge. Nova Science,Antibes.

Edwards, A. W. F. and Cavalli-Sforza, L. L. (1965). A Method for Cluster Analysis. Biometrics 21, 362–375.

Esposito, F., Malerba, D., and Tamma, V. (2000). Dissimilarity Measures for Symbolic Data. In: Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data (eds. H.-H. Bock and E. Diday). Springer-Verlag, Berlin, 165–185.

Gordon, A. D. (1999). Classification (2nd ed.). Chapman and Hall, London.

Gowda, K. C. and Diday, E. (1991). Symbolic Clustering Using a New Dissimilarity Measure. Pattern Recognition 24, 567–578.

Gower, J. C. (1985). Measures of Similarity, Dissimilarity, and Distance. In: Encyclopedia of Statistical Sciences, Volume 5 (eds. S. Kotz and N. L. Johnson). John Wiley & Sons, Inc., New York, 397–405.

Ichino, M. (1988). General Metrics for Mixed Features - The Cartesian Space Theory for Pattern Recognition. In: Proceedings of the 1988 Conference on Systems, Man, and Cybernetics. Pergamon, Oxford, 494–497.

Ichino, M. and Yaguchi, H. (1994). Generalized Minkowski Metrics for Mixed Feature Type Data Analysis. IEEE Transactions on Systems, Man, and Cybernetics 24, 698–708.