
Applied Multivariate Analysis

Seppo Pynnonen

Department of Mathematics and Statistics, University of Vaasa, Finland

Spring 2017


Cluster analysis


1 Cluster analysis

Background

Distance data

Hierarchical clustering



Background

Example 1

Consider the following data:

Beer                  Calories  Sodium  Alcohol  Cost
-----------------------------------------------------
Budweiser                  144      15      4.7  0.43
Schlitz                    151      19      4.9  0.43
Lowenbrau                  157      15      0.9  0.48
Kronenbourg                170       7      5.2  0.73
Heineken                   152      11      5.0  0.77
Old Milwaukee              145      23      4.6  0.28
Augsberger                 175      24      5.5  0.40
Strohs Bohemian Style      149      27      4.7  0.42
Miller Lite                 99      10      4.3  0.43
Budweiser Light            113       8      3.7  0.40
Coors                      140      18      4.6  0.44
Coors Light                102      15      4.1  0.46
Michelob Light             135      11      4.2  0.50
Becks                      150      19      4.7  0.76
Kirin                      149       6      5.0  0.79
Pabst Extra Light           68      15      2.3  0.38
Hamms                      139      19      4.4  0.43
Heilemans Old Style        144      24      4.9  0.43
Olympia Gold Light          72       6      2.9  0.46
Schlitz Light               97       7      4.2  0.47
-----------------------------------------------------



A natural question is whether some beers are more alike than others, i.e., whether the beers fall into natural groups.



Before clustering, check descriptive statistics and plots



It turns out that Lowenbrau is an outlier, in particular in its alcohol content relative to the other beers.



No more obvious outliers.
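The outlier check described above relies on descriptive statistics and plots. As a rough numerical counterpart, the sketch below standardizes the alcohol column of Example 1 and reports the most extreme value. The z-score screen is an assumption made for illustration, not the method used on the slides, and the names `alcohol`, `z`, and `most_extreme` are hypothetical.

```python
from statistics import mean, stdev

# Alcohol content copied from the Example 1 table.
alcohol = {
    "Budweiser": 4.7, "Schlitz": 4.9, "Lowenbrau": 0.9, "Kronenbourg": 5.2,
    "Heineken": 5.0, "Old Milwaukee": 4.6, "Augsberger": 5.5,
    "Strohs Bohemian Style": 4.7, "Miller Lite": 4.3, "Budweiser Light": 3.7,
    "Coors": 4.6, "Coors Light": 4.1, "Michelob Light": 4.2, "Becks": 4.7,
    "Kirin": 5.0, "Pabst Extra Light": 2.3, "Hamms": 4.4,
    "Heilemans Old Style": 4.9, "Olympia Gold Light": 2.9, "Schlitz Light": 4.2,
}

m = mean(alcohol.values())
s = stdev(alcohol.values())  # sample standard deviation

# z-score of every brand (assumed screening rule, for illustration only)
z = {brand: (value - m) / s for brand, value in alcohol.items()}

# Brand whose alcohol content lies furthest from the mean
most_extreme = max(z, key=lambda brand: abs(z[brand]))
print(most_extreme, round(z[most_extreme], 2))
```

A univariate screen like this only looks at one variable at a time; scatter plots remain useful for spotting observations that are unusual in the relation between variables.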



Problem: Group sample units into homogeneous subgroups on the basis of a given data set.

Synonyms:

clustering

morphometrics

pattern recognition

classification

taxonomy



The main difference from discriminant analysis is that in cluster analysis the number of groups is not known in advance.

Data:

1 Observations on variables $x_{i1}, \dots, x_{ip}$, $i = 1, \dots, n$.

2 An $n \times n$ distance matrix, which describes how far apart the observations are from each other.



Distance data


Dissimilarity:

Let $x_i = (x_{i1}, \dots, x_{ip})'$ denote the observations on the variables for sample unit $i$. The dissimilarity between sample units $i$ and $j$ is measured by a suitable distance measure

$$d_{ij} = d(x_i, x_j), \tag{1}$$

that has the following properties:

1 $d(x, y) \ge 0$

2 $d(x, y) = 0 \Leftrightarrow x = y$

3 $d(x, y) = d(y, x)$

4 $d(x, y) \le d(x, z) + d(y, z)$



Example 2

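Euclidean distance

$$d(x, y) = \left[\sum_{j=1}^{p} (x_j - y_j)^2\right]^{1/2}. \tag{2}$$

Block distance

$$d(x, y) = \sum_{j=1}^{p} |x_j - y_j|. \tag{3}$$

Mahalanobis distance

$$d(x_i, x_j) = \left[(x_i - x_j)' \Sigma^{-1} (x_i - x_j)\right]^{1/2}, \tag{4}$$

where $\Sigma = \mathrm{Cov}(x_i) = \mathrm{Cov}(x_j)$.

Unlike the Euclidean and block distances, the Mahalanobis distance is independent of the scales of the variables.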
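The three distances of Example 2 can be sketched in a few lines of Python. This is a minimal illustration for $p = 2$, with the $2 \times 2$ covariance matrix inverted by hand; the covariance values and the function names (`euclidean`, `block`, `mahalanobis_2d`) are assumptions chosen for the example, not taken from the slides.

```python
import math

def euclidean(x, y):
    """Equation (2): square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def block(x, y):
    """Equation (3): sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def mahalanobis_2d(x, y, cov):
    """Equation (4) for p = 2, inverting the 2x2 covariance matrix by hand."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = ((d / det, -b / det), (-c / det, a / det))
    u = (x[0] - y[0], x[1] - y[1])
    q = (u[0] * (inv[0][0] * u[0] + inv[0][1] * u[1])
         + u[1] * (inv[1][0] * u[0] + inv[1][1] * u[1]))
    return math.sqrt(q)

x, y = (10.0, 7.0), (20.0, 2.0)
cov = ((100.0, 15.0), (15.0, 16.0))  # hypothetical covariance matrix

# Rescale the second variable by 10; its variance and covariance rescale with it.
x_scaled, y_scaled = (x[0], 10 * x[1]), (y[0], 10 * y[1])
cov_scaled = ((100.0, 150.0), (150.0, 1600.0))

print(euclidean(x, y), euclidean(x_scaled, y_scaled))  # changes with the scale
print(mahalanobis_2d(x, y, cov),
      mahalanobis_2d(x_scaled, y_scaled, cov_scaled))  # unchanged by the scale
```

The last line illustrates the scale independence stated above: rescaling a variable rescales $\Sigma$ in the same way, and the two effects cancel in equation (4).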



Scaling makes a difference!

Example 3

Observation  Weight (g)  Length (cm)
1                    10            7
2                    20            2
3                    30           10

Euclidean distances, with length measured in cm and in mm:

       cm     mm
d12  11.2   51.0
d13  20.2   36.1
d23  12.8   80.6

With length in cm the closest pair is (1, 2); with length in mm it is (1, 3).

Thus, with Euclidean and block distances, differences in the variables should have equal practical importance.

Linear or non-linear transformations may be needed. E.g., variables in different currencies should be converted to the same currency. Similarly, if ratios are more meaningful than differences, take logarithms.

Outliers should also be removed.
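The distances in Example 3 can be checked with a short script. This is only a sketch: the helper `distances` and its `length_factor` argument are hypothetical names introduced for illustration.

```python
import math
from itertools import combinations

# Observations from Example 3: (weight in g, length in cm)
obs = {1: (10, 7), 2: (20, 2), 3: (30, 10)}

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def distances(length_factor):
    """Pairwise Euclidean distances with length multiplied by length_factor."""
    out = {}
    for i, j in combinations(sorted(obs), 2):
        xi = (obs[i][0], obs[i][1] * length_factor)
        xj = (obs[j][0], obs[j][1] * length_factor)
        out[(i, j)] = round(euclidean(xi, xj), 1)
    return out

d_cm = distances(1)   # length in cm
d_mm = distances(10)  # length in mm

print(d_cm)  # {(1, 2): 11.2, (1, 3): 20.2, (2, 3): 12.8}
print(d_mm)  # {(1, 2): 51.0, (1, 3): 36.1, (2, 3): 80.6}
print(min(d_cm, key=d_cm.get), min(d_mm, key=d_mm.get))  # closest pair changes
```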



General solutions:

(1) Standardization

$$z_{ij} = \frac{x_{ij}}{s_j}, \tag{5}$$

where $s_j$ is the standard deviation of variable $x_j$.

Shortcoming: Standardization may weaken the clustering information in the variables [1].

(2) Mahalanobis distance

[1] Milligan, Glenn W., and Martha C. Cooper, 1988, A study of standardization of variables in cluster analysis. Journal of Classification 5, 181–204.
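As a small illustration of solution (1), the sketch below divides each variable by its standard deviation, as in equation (5), and shows that the resulting Euclidean distances no longer depend on the unit of measurement. The data are the three observations of Example 3; the helper names `standardize` and `pairwise_distances` are hypothetical.

```python
import math
from statistics import stdev

weight_g = [10, 20, 30]
length_cm = [7, 2, 10]
length_mm = [10 * v for v in length_cm]  # same lengths in another unit

def standardize(column):
    """Equation (5): divide each value by the sample standard deviation."""
    s = stdev(column)
    return [v / s for v in column]

def pairwise_distances(*columns):
    """Euclidean distances between all pairs of rows, in the order d12, d13, d23."""
    rows = list(zip(*columns))
    return [math.dist(rows[i], rows[j])
            for i in range(len(rows)) for j in range(i + 1, len(rows))]

d_cm = pairwise_distances(standardize(weight_g), standardize(length_cm))
d_mm = pairwise_distances(standardize(weight_g), standardize(length_mm))

print([round(d, 3) for d in d_cm])
print([round(d, 3) for d in d_mm])  # same values as for cm
```

Standardization fixes the unit problem, but it also gives every variable the same weight, which is the shortcoming noted above.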



Similarity:

Association measures $c_{ij} = c(x_i, x_j)$ with the properties

1 $0 \le c_{ij} \le 1$

2 $c_{ii} = 1$

3 if $c_{ij} = 1$ then $x_i = x_j$

4 $c_{ij} = c_{ji}$


Hierarchical clustering


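Note: Similarity → Dissimilarity

$$d_{ij} = 1 - c_{ij} \tag{8}$$

Distance matrix $D = (d_{ij})$.

Agglomerative clustering: bottom up. Start with every observation as its own cluster and join clusters step by step.

Divisive clustering: top down. Start with all observations in one cluster and split it step by step.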




Strategies (examples):

Single linkage (nearest neighbor): The distance between two groups is the distance between their closest members. The cases that have the smallest distance (are the most similar) are combined first.

Complete linkage (furthest neighbor): The distance between two groups is the distance between their furthest members.

Average linkage: The distance between two groups is the average distance between pairs of observations, one in each cluster.

Centroid method: Distances between groups are measured between the group means (group centroids).

Minimum variance: Ward's minimum-variance method joins, at each step, the two clusters whose merger gives the smallest increase in the within-cluster sum of squares.

k-means clustering: Given k clusters, minimize the within-cluster sum of squares. An initial partition must be found somehow (e.g., randomly or by some other clustering method).
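The first four strategies differ only in how the distance between two groups is defined. The sketch below computes these four between-group distances for two small, made-up groups of points; the groups `A` and `B` and the function names are hypothetical, and Ward's method and k-means are left out.

```python
import math
from itertools import product

def single(a, b):
    """Nearest neighbor: smallest distance between members of a and b."""
    return min(math.dist(p, q) for p, q in product(a, b))

def complete(a, b):
    """Furthest neighbor: largest distance between members of a and b."""
    return max(math.dist(p, q) for p, q in product(a, b))

def average(a, b):
    """Average distance over all pairs, one observation from each group."""
    return sum(math.dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))

def centroid(a, b):
    """Distance between the group means."""
    mean_a = [sum(coord) / len(a) for coord in zip(*a)]
    mean_b = [sum(coord) / len(b) for coord in zip(*b)]
    return math.dist(mean_a, mean_b)

A = [(0.0, 0.0), (0.0, 3.0)]  # made-up group
B = [(4.0, 0.0), (4.0, 3.0)]  # made-up group

print(single(A, B), complete(A, B), average(A, B), centroid(A, B))
# 4.0 5.0 4.5 4.0
```

For any two groups, the single linkage distance is never larger than the average linkage distance, which in turn is never larger than the complete linkage distance.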




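Example 5

Single linkage (nearest neighbor)

$$D = \begin{array}{c} 1 \\ 2 \\ 3 \\ 4 \\ 5 \end{array}
\begin{pmatrix}
0 \\
7 & 0 \\
1 & 6 & 0 \\
9 & 3 & 8 & 0 \\
8 & 5 & 7 & 4 & 0
\end{pmatrix}$$

The nearest are 1 and 3 (distance 1). Joining them yields a new distance matrix, in which the distance from $\{1, 3\}$ to another unit is the smaller of the two original distances, e.g. $\min(d_{12}, d_{32}) = \min(7, 6) = 6$:

$$D^{(1)} = \begin{array}{c} \{1, 3\} \\ 2 \\ 4 \\ 5 \end{array}
\begin{pmatrix}
0 \\
6 & 0 \\
8 & 3 & 0 \\
7 & 5 & 4 & 0
\end{pmatrix}$$

Next join 2 and 4 (distance 3) to yield

$$D^{(2)} = \begin{array}{c} \{1, 3\} \\ \{2, 4\} \\ 5 \end{array}
\begin{pmatrix}
0 \\
6 & 0 \\
7 & 4 & 0
\end{pmatrix}$$

Next join 5 to $\{2, 4\}$ (distance 4)

$$D^{(3)} = \begin{array}{c} \{1, 3\} \\ \{2, 4, 5\} \end{array}
\begin{pmatrix}
0 \\
6 & 0
\end{pmatrix}$$

Finally, $\{1, 3\}$ and $\{2, 4, 5\}$ are joined at distance 6.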
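The steps of Example 5 can be reproduced with a short agglomerative loop. The sketch below is a naive implementation under stated assumptions: it recomputes the cluster distance as the minimum over all member pairs at every step, which gives the same result as updating the matrix but is inefficient for large data. The names `D`, `d_obs`, `cluster_distance`, and `single_linkage` are hypothetical.

```python
from itertools import combinations

# Lower triangle of the distance matrix D from Example 5, keyed by (i, j), i < j.
D = {(1, 2): 7, (1, 3): 1, (2, 3): 6, (1, 4): 9, (2, 4): 3, (3, 4): 8,
     (1, 5): 8, (2, 5): 5, (3, 5): 7, (4, 5): 4}

def d_obs(i, j):
    """Distance between two original observations."""
    return D[(min(i, j), max(i, j))]

def cluster_distance(pair):
    """Single linkage: smallest distance between members of the two clusters."""
    a, b = pair
    return min(d_obs(i, j) for i in a for j in b)

def single_linkage(n):
    """Return the list of merges as (cluster_a, cluster_b, joining distance)."""
    clusters = [(i,) for i in range(1, n + 1)]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=cluster_distance)
        height = cluster_distance((a, b))
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
        merges.append((a, b, height))
    return merges

for a, b, height in single_linkage(5):
    print(a, b, height)
# (1,) (3,) 1
# (2,) (4,) 3
# (5,) (2, 4) 4
# (1, 3) (2, 4, 5) 6
```

The joining distances 1, 3, 4, and 6 are the heights at which the branches meet in a dendrogram.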




The resulting dendrogram is

   |
 6 |      +-------------+
   |      |             |
 4 |      |         +---+---+
   |      |         |       |
 3 |      |       +-+-+     |
   |      |       |   |     |
 1 |    +-+-+     |   |     |
   |    |   |     |   |     |
        1   3     2   4     5




Number of Clusters:

A large jump in the joining distance in the dendrogram suggests the number of clusters: cut the tree just before the jump.

A large jump in another criterion that is in use.

Maximum pseudo F, minimum pseudo t.

Maximum cubic clustering criterion (CCC).

In k-means clustering, plot the within-cluster sum of squares against k and look for an "elbow", similar to the scree plot in PCA.
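The elbow idea can be sketched with a tiny one-dimensional k-means. Everything here is an assumption made for illustration: the data are made up (three well-separated groups), the starting centers are taken at evenly spaced positions of the sorted data rather than at random, and `kmeans_wss` is a hypothetical helper, not a library function.

```python
def kmeans_wss(points, k, iterations=100):
    """Run a basic 1-D k-means and return the within-cluster sum of squares."""
    data = sorted(points)
    n = len(data)
    # Deterministic starting centers (an assumption; random starts are common).
    centers = [data[int((i + 0.5) * n / k)] for i in range(k)]
    groups = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        groups = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda c: abs(x - centers[c]))
            groups[nearest].append(x)
        # Update step: each center moves to the mean of its group.
        new_centers = [sum(g) / len(g) if g else centers[c]
                       for c, g in enumerate(groups)]
        if new_centers == centers:
            break
        centers = new_centers
    # Within-cluster sum of squares around the group means.
    wss = 0.0
    for g in groups:
        if g:
            m = sum(g) / len(g)
            wss += sum((x - m) ** 2 for x in g)
    return wss

# Made-up data with three well-separated groups.
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]
wss = {k: kmeans_wss(points, k) for k in range(1, 5)}

for k, value in wss.items():
    print(k, round(value, 2))
```

For these data the sum of squares falls sharply up to k = 3 and only marginally afterwards, so the elbow points to three clusters.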




Example 6

Beer brands.

data beer;
  input brand $21. calories sodium alcohol cost;
  datalines;
Budweiser             144 15 4.7 0.43
Schlitz               151 19 4.9 0.43
.
.
;

/* b is the subset after removing Lowenbrau */
/* (this data step is an assumed reconstruction; the slide only mentions b in a comment) */
data b;
  set beer;
  if brand ne 'Lowenbrau';
run;

/* single linkage method */
proc cluster data = b method = single simple std; /* simple statistics, standardize */
  id brand;
  var calories--cost;
run;




Single linkage



There is no obvious break; the biggest jump occurs when Becks, Kirin, Heineken, and Kronenbourg are joined to the rest.

Pseudo t drops at 4 clusters and is lowest at 7, 9, and 10.

Pseudo F does not seem to work.

Thus, on this basis, the beers cluster into American and European brands.

Within the American brands, there seems to be a distinction between the light beers and the others (Michelob Light does not fit this picture).




Complete linkage

/* complete linkage */
proc cluster data = b method = complete;
  var calories sodium alcohol;
  id brand;
run;



Complete linkage divides the beers essentially into two light brands, the European brands, and the other American brands.




Centroid

/* centroid linkage */
proc cluster data = b method = centroid;
  var calories sodium alcohol cost;
  id brand;
run;



The result is similar to that of complete linkage.

