
Advanced Studies in Applied Statistics (WBL), ETHZ
Applied Multivariate Statistics, Spring 2018, Week 3

Lecturer: Beate Sick, [email protected]

Remark: Much of the material has been developed together with Oliver Dürr for different lectures at ZHAW.

Topics of today

• Similarity and Distances
  • Numeric data
  • Categorical data
  • Mixed data types
• Outlier detection
  • Univariate outlier detection by visual checks and additional tests
  • Multivariate outlier detection
    • Parametric: squared Mahalanobis distances and Chi-square test
    • Non-parametric: robust PCA for multivariate outlier detection
• Multidimensional Scaling
  • Metric MDS
  • isoMDS

What is Similarity?

"The quality or state of being similar; likeness; resemblance; as, a similarity of features." (Webster's Dictionary)

Similarity is hard to define, but... "We know it when we see it."

The real meaning of similarity is a philosophical question. We will take a more pragmatic approach.

Defining Distance Measures (Recap)

Definition: Let O1 and O2 be two objects. The distance (dissimilarity) between O1 and O2 is a real number denoted by d(O1, O2).

[Figure: example distances between numbers (0.23, 3, 342.7) and between the strings "Peter" and "Piotr"]

(Dis-)similarities / Distance

Pairs of objects:
• Similarity (large ⇒ similar), vague definition
• Dissimilarity (small ⇒ similar), Rules 1-3
• Distance / Metric (small ⇒ similar), Rule 4 in addition

Examples of metrics (more follow with the examples):
• Euclidean and other Lp metrics
• Jaccard distance (1 - Jaccard index)
• Graph distance (shortest path)

[Figure: the Rules 1-4; stated below for reference]
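For reference, the "Rules" referred to above are the usual metric axioms. The original slide figure is not reproduced in this transcript, so the numbering below is the conventional one and may differ slightly from the slide:

1. $d(O_1, O_2) \ge 0$ (non-negativity)
2. $d(O_1, O_2) = 0 \iff O_1 = O_2$ (identity)
3. $d(O_1, O_2) = d(O_2, O_1)$ (symmetry)
4. $d(O_1, O_3) \le d(O_1, O_2) + d(O_2, O_3)$ (triangle inequality, required in addition for a metric/distance)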

Example of a Metric

Task 1
• Draw 3 objects on a piece of paper and measure their distances (e.g. with a ruler).
• Is this a proper distance? Are Axioms 1-4 fulfilled?

Task 2
• The 3 entities A, B, C have the dissimilarities:
  d(A,B) = 1
  d(B,C) = 1
  d(A,C) = 3
• Is this dissimilarity a distance?
• Can you try to draw them on a piece of paper?

Problematic: Wordmaps

Try to do a wordmap with: Bank, Finance, Sitting

Triangle inequality: not just a mathematical gimmick!

The triangle inequality would imply: d("sitting", "finance") ≤ d("sitting", "bank") + d("bank", "finance")

We live in a Euclidean Space

If we are presented objects in the two-dimensional plane, we intuitively assume Euclidean distance between the objects.

[Figure: four points p1-p4 plotted in the plane]

Euclidean Distance and its Generalization

Distance between observations $o_i$ and $o_j$, with $p$ features describing each observation.

Euclidean distance for two observations $o_i$, $o_j$ described by $p$ numeric features:

$d(o_i, o_j) = \sqrt{\sum_{k=1}^{p} (o_{ik} - o_{jk})^2}$

Minkowski distance as a generalization:

$d_r(o_i, o_j) = \left( \sum_{k=1}^{p} |o_{ik} - o_{jk}|^r \right)^{1/r}$

2D example (2 features per observation):

obs  x1  x2
o1    0   2
o2    2   0
o3    3   1
o4    5   1

$d(o_2, o_3) = \sqrt{(2-3)^2 + (0-1)^2} = \sqrt{2}$

[Figure: the four points p1-p4 plotted in the (x1, x2) plane]
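As a small check of the worked example (not part of the original slide), the Euclidean distances between the four observations can be computed in R with the base function dist():

X = matrix(c(0, 2,   # o1
             2, 0,   # o2
             3, 1,   # o3
             5, 1),  # o4
           ncol = 2, byrow = TRUE,
           dimnames = list(c("o1", "o2", "o3", "o4"), c("x1", "x2")))
dist(X)   # Euclidean distances; the entry for (o2, o3) is sqrt(2) = 1.41...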

L1: Manhattan Distance

[Figure: city-block grid with two points A and B; one block is one unit]

• How many blocks do you have to walk from A to B?
• What is the L1 distance from A to B, i.e. the Minkowski distance with r = 1?
• What is the Euclidean distance?

$d_r(o_i, o_j) = \left( \sum_{k=1}^{p} |o_{ik} - o_{jk}|^r \right)^{1/r}$

(Image from Wikipedia)

Minkowski Distances

• r = 1: City block (Manhattan, taxicab, L1 norm) distance
  $d(o_i, o_j) = \sum_{k=1}^{p} |o_{ik} - o_{jk}|$

• r = 2: Euclidean distance (L2 norm)
  $d(o_i, o_j) = \left( \sum_{k=1}^{p} (o_{ik} - o_{jk})^2 \right)^{1/2}$

• r = ∞: "Supremum" or maximum (Lmax norm, L∞ norm) distance; this is the maximum difference between any component of the vectors
  $d(o_i, o_j) = \max_{k=1,\dots,p} |o_{ik} - o_{jk}|$

(An R sketch of these three variants follows below.)
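Reusing the matrix X from the earlier sketch, the three Minkowski variants can be computed with the method argument of dist() (again just an illustration, not from the slides):

dist(X, method = "manhattan")          # r = 1, city-block distance
dist(X, method = "euclidean")          # r = 2 (the default)
dist(X, method = "maximum")            # r = infinity, supremum distance
dist(X, method = "minkowski", p = 3)   # general Minkowski distance with r = 3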

Distance matrix

As discussed on the last couple of slides, there are different possibilities to determine the pairwise distance between two observations $o_i$ and $o_j$.

We can collect all these pairwise distances $d_{ij} = d(o_i, o_j)$ in a distance matrix:

$D = \begin{pmatrix} 0 & d_{12} & \cdots & d_{1n} \\ d_{21} & 0 & \cdots & d_{2n} \\ \vdots & & \ddots & \vdots \\ d_{n1} & d_{n2} & \cdots & 0 \end{pmatrix}$

All diagonal elements are 0: $d_{kk} = d(o_k, o_k) = 0$.

Symmetry: $d_{ij} = d(o_i, o_j) = d(o_j, o_i) = d_{ji}$.
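Continuing the small example from above (not part of the slide), the full distance matrix and its properties can be inspected in R:

D = as.matrix(dist(X))   # symmetric matrix of pairwise Euclidean distances
all(diag(D) == 0)        # TRUE: all diagonal elements are 0
isSymmetric(D)           # TRUE: d_ij = d_ji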

How to calculate dissimilarities with categorical variables?

Similarity measures for binary data

A common situation is that the objects $o_1$ and $o_2$ have only binary attributes, like for example gender (f/m), driving license (yes/no), Nobel prize holder (yes/no):

$o_1 = (o_{11}, o_{12}, \dots, o_{1p})$ and $o_2 = (o_{21}, o_{22}, \dots, o_{2p})$

We distinguish between symmetric and asymmetric binary variables.

In a symmetric binary variable both levels have roughly comparable frequencies (example: gender).

In an asymmetric binary variable both levels have very different frequencies (example: Nobel prize holder).

Simple Matching Coefficient: similarity measure for "symmetric" binary vectors

The objects $o_1 = (o_{11}, o_{12}, \dots, o_{1p})$ and $o_2 = (o_{21}, o_{22}, \dots, o_{2p})$ have only binary attributes.

Compute for the symmetric binary variables (could be only a subset of the p binary variables) the Simple Matching Coefficient:

SMC = # matches / # attributes

corresponding to the proportion of matching features over all features:

SMC = (M11 + M00) / (M01 + M10 + M11 + M00)

M01 = the number of attributes where o1 was 0 and o2 was 1
M10 = the number of attributes where o1 was 1 and o2 was 0
M00 = the number of attributes where o1 was 0 and o2 was 0
M11 = the number of attributes where o1 was 1 and o2 was 1

Jaccard Coefficient: similarity measure for "asymmetric" binary vectors

The objects $o_1 = (o_{11}, o_{12}, \dots, o_{1p})$ and $o_2 = (o_{21}, o_{22}, \dots, o_{2p})$ have only binary attributes.

Compute for the asymmetric binary variables (could be only a subset of the p binary variables) the Jaccard Coefficient:

J = # both-1 matches / # attributes that are not both zero

corresponding to the proportion of matching features over those features which are 1 in at least one of the two observations:

J = M11 / (M01 + M10 + M11)

M01 = the number of attributes where o1 was 0 and o2 was 1
M10 = the number of attributes where o1 was 1 and o2 was 0
M11 = the number of attributes where o1 was 1 and o2 was 1
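A small R illustration of both coefficients for two binary vectors (the vectors are made up for this example and are not from the slides):

o1 = c(1, 0, 1, 1, 0, 0)
o2 = c(1, 1, 1, 0, 0, 0)
M11 = sum(o1 == 1 & o2 == 1)
M00 = sum(o1 == 0 & o2 == 0)
M10 = sum(o1 == 1 & o2 == 0)
M01 = sum(o1 == 0 & o2 == 1)
SMC = (M11 + M00) / (M01 + M10 + M11 + M00)   # 4/6 for these vectors
J   = M11 / (M01 + M10 + M11)                 # 2/4 for these vectors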

Gower's dissimilarity for mixed data types

Idea: use a distance measure $d_{ij}^{(k)}$ between 0 and 1 for each variable or feature k:

- kth variable is binary or nominal: use the methods discussed above, e.g.
  $d_{ij}^{(k)} = 1 - \frac{M_{11} + M_{00}}{M_{01} + M_{10} + M_{11} + M_{00}}$

- kth variable is numeric:
  $d_{ij}^{(k)} = \frac{|x_{ik} - x_{jk}|}{R_k}$
  where $x_{ik}$ is the value of object i on variable k and $R_k$ is the range of variable k over all objects.

- kth variable is ordinal: use normalized ranks; then proceed as with numeric variables.

Aggregate the distance measures over all variables/features/dimensions:

$d_{ij} = \frac{1}{p} \sum_{k=1}^{p} d_{ij}^{(k)}$

Dissimilarity for mixed data types with the R function "daisy", which calculates Gower's dissimilarity:

> str(flower)
'data.frame': 18 obs. of 8 variables:
 $ V1: Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 2 2 ...
 $ V2: Factor w/ 2 levels "0","1": 2 1 2 1 2 2 1 1 2 2 ...
 $ V3: Factor w/ 2 levels "0","1": 2 1 1 2 1 1 1 2 1 1 ...
 $ V4: Factor w/ 5 levels "1","2","3",..: 4 2 3 4 5 4 4 2 3 5 ...   <- nominal factor
 $ V5: Ord.factor w/ 3 levels "1"<"2"<"3": 3 1 3 2 2 3 3 2 1 2 ...
 $ V6: Ord.factor w/ 18 levels "1"<"2"<"3"<..: 15 3 1 16 2 12 ...
 $ V7: num 25 150 150 125 20 50 40 100 25 100 ...
 $ V8: num 15 50 50 50 15 40 20 15 15 60 ...
> library(cluster)
> dist = daisy(flower, type = list(asymm = c(1, 3), symm = 2, ordratio = 7))
> str(dist)
Classes 'dissimilarity', 'dist' atomic [1:153] 0.901 0.618 ...
 ..- attr(*, "Size")= int 18
 ..- attr(*, "Metric")= chr "mixed"
 ..- attr(*, "Types")= chr [1:8] "A" "S" "A" "N" ...
> plot(hclust(dist))

How to visualize multivariate observations of mixed data types in 2D?

Goal of Multidimensional Scaling

MDS gets as input distances between observations or data points and results in a visualization of points in 2D

[Figure: points in 2D connected by bars; the bars between points represent the given distances between the points]

As input to MDS we only know the distances, and we look for a low-dimensional point configuration in which the points have the same or similar distances.

Example for metric MDS

Problem: given the Euclidean distances among points, recover the positions of the points!

Example: road distances between 21 European cities (almost Euclidean, but not quite).

[Distance matrix: the eurodist data, shown on the next slide]

MDS in R (eurodist data: road distances between 21 European cities):

res.cmd = cmdscale(eurodist)
plot(res.cmd, pch = "")
text(res.cmd, labels = rownames(res.cmd))

[Figure: 2D configuration of the 21 cities (Athens, Barcelona, Brussels, ..., Vienna) recovered by cmdscale]

The configuration can be
- shifted
- rotated
- reflected
without changing the distances.

Example for metric MDS

After flipping the vertical axis:

[Figure: the MDS configuration of the cities with the vertical axis flipped]
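Flipping the vertical axis just changes the sign of the second coordinate; a minimal sketch (not part of the slide) continuing the code above:

res.cmd[, 2] = -res.cmd[, 2]   # reflect the configuration about the horizontal axis
plot(res.cmd, pch = "")
text(res.cmd, labels = rownames(res.cmd))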

Equivalence of PCA and MDS with Euclidean distance

PCA representation of the data matrix = MDS representation of the Euclidean distance matrix

MDS on Euclidean distances results in an equivalent low-dimensional representation (up to rotation, flipping, shifts) as PCA on the data matrix (however, the data matrix must first be derived from the distance matrix).
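A small R check of this equivalence (not on the slide; the built-in iris measurements are used just as a generic numeric data set):

X = scale(iris[, 1:4], center = TRUE, scale = FALSE)   # centered numeric data
pca.scores = prcomp(X)$x[, 1:2]                        # first two principal components
mds.coords = cmdscale(dist(X), k = 2)                  # metric MDS on Euclidean distances
# The two configurations agree up to sign flips of the axes:
cor(pca.scores[, 1], mds.coords[, 1])   # close to +1 or -1
cor(pca.scores[, 2], mds.coords[, 2])   # close to +1 or -1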

Distance matrix and 2D plot of multivariate mixed data

library(cluster)
dist = daisy(flower)         # Gower dissimilarity of the mixed-type data
mdist = as.matrix(dist)

library(pheatmap)
pheatmap(mdist)              # heatmap of the distance matrix

library(MASS)
mds = isoMDS(mdist, k = 2)   # non-metric MDS
d.mds = as.data.frame(mds$points)
names(d.mds) = c("c1", "c2")

library(ggplot2)
ggplot(data = d.mds, aes(x = c1, y = c2)) +
  geom_point() +
  geom_text(label = row.names(mdist), hjust = 1.2)

How much does an observation differ from the average?

Let's standardize and look at the z-score

We start from a variable X with $E(X) = \mu_X$ and $Var(X) = \sigma_X^2$ and apply the z-transformation:

$Z = \frac{X - \mu_X}{\sigma_X}$

The standardized variable Z has mean zero and variance 1:

$E(Z) = 0, \quad sd(Z) = 1$

Often the z-transformation is applied to different univariate features to make them "comparable". A z-score of -2 always means that the observation is two standard deviations below the population mean.

In case of a Normal-distributed X we know that $Z \sim N(0, 1)$.
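A one-line R illustration (not from the slide): the base function scale() performs exactly this z-transformation:

z = as.numeric(scale(iris$Sepal.Length))   # (x - mean(x)) / sd(x)
c(mean(z), sd(z))                          # approximately 0 and exactly 1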

z-score

The z-score has the unit "standard deviation".

How much is my IQ above/below average?

[Figure: IQ scale marked in standard deviations (1 sd, 2 sd, ..., 10 sd) away from the mean]

Remark: Mean and SD can also be determined for non-Normal distributed variables, but the intuition is lost and we might prefer to work with quantiles.

The multivariate Normal distribution

$\mathbf{x} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, with density

$f(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^k \, |\boldsymbol{\Sigma}|}} \exp\!\left( -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^t \, \boldsymbol{\Sigma}^{-1} \, (\mathbf{x} - \boldsymbol{\mu}) \right)$

[Figure: density and iso-density ellipses of a bivariate Normal example $(X, Y)^t \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$]

Remarks:
• All marginal distributions of a multivariate Normal distribution are univariate Normal distributions.
• All conditional distributions of a multivariate Normal distribution are univariate Normal distributions.
• Each iso-density line is an ellipse, or its higher-dimensional generalization.

Mahalanobis distance is the multivariate z-score

$MD(\mathbf{x}) = \sqrt{ (\mathbf{x} - \boldsymbol{\mu})^t \, \boldsymbol{\Sigma}^{-1} \, (\mathbf{x} - \boldsymbol{\mu}) }$

The Mahalanobis distance MD(x) measures the distance of x to the mean of the multivariate Normal distribution in units of standard deviations.

[Figure: iso-density ellipses of a bivariate Normal; all points on the same ellipse have the same Mahalanobis distance (MD = 1, MD = 2)]

In case of a multivariate Normal-distributed x, the squared distance MD(x)² follows a χ² distribution with df = p (see the outlier-detection slides below).

Outlier

We need expectations or a model to identify an outlier!

Outlier detection using a boxplot representation

All points beyond the whiskers are called “extreme” values.

Is there any model around?


The model behind the “extreme” value definition in boxplots

99% of N(0,1) data are within the whiskers. When visualizing non-Normal distributed data, this model is not valid.

[Figure: boxplot of 100k data points simulated from a N(0,1)]

Outlier detection in the univariate case via the Grubbs test

library(outliers)
x = c(45, 56, 54, 34, 32, 45, 67, 45, 67, 65, 154)
grubbs.test(x)
# Grubbs test for one outlier
#
# data: x
# G = 2.80490, U = 0.13459, p-value = 0.0001816
# alternative hypothesis: highest value 154 is an outlier

Grubbs developed this test statistic in 1950 (assuming, for small n, a Normal distribution as in the t-test) to investigate whether at "some time during the experiment something possibly happened to cause an extraneous variation on the high side or on the low side". It is nowadays also routinely used in regression model checking procedures (e.g. to find outliers among Cook's distance values or standardized residuals).

[Figure: the data points with the value 154 marked as a potential outlier]

Outlier detection in the multivariate case via a χ² test

The Mahalanobis distance MD(x) measures the distance of x to the mean of the multivariate Normal distribution in units of SD:

$MD^2(\mathbf{x}) = (\mathbf{x} - \boldsymbol{\mu})^t \, \boldsymbol{\Sigma}^{-1} \, (\mathbf{x} - \boldsymbol{\mu})$, with $\mathbf{x}, \boldsymbol{\mu}$ of dimension $p \times 1$ and $\boldsymbol{\Sigma}$ of dimension $p \times p$.

If $\mathbf{x} \sim N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, then $MD^2(\mathbf{x}) \sim \chi^2_{df=p}$.

Outlier detection via the Mahalanobis distance can be performed for data for which the multivariate Normal assumption is reasonable, by checking whether the MD² of a p-dimensional observation is "sticking out" of the χ² distribution with df = p.

Outlier detection based on the expected χ² distribution of squared Mahalanobis distances from the assumed Normal-distribution center

• Compute for each p-dimensional observation x the (robust version of the) squared Mahalanobis distance MD(x)².
• Generate a quantile-quantile plot against the expected χ² distribution with df = p to identify observations that stick out (e.g. MD(x)² > 97.5% quantile of χ²_p).
• In addition, use "adjusted quantiles" that are estimated by simulations from the expected χ² distribution without outliers.
• Use (robust) PCA to visualize the data in a 2D score plot.

(A small R sketch of the first two steps follows below.)
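A minimal R sketch of the first two steps (not from the slides; the classical, non-robust estimates and the built-in iris measurements are used just for illustration):

X = as.matrix(iris[, 1:4])
p = ncol(X)
md2 = mahalanobis(X, center = colMeans(X), cov = cov(X))   # squared Mahalanobis distances
which(md2 > qchisq(0.975, df = p))                         # observations beyond the 97.5% quantile
# QQ-plot of md2 against the chi-square distribution with df = p:
qqplot(qchisq(ppoints(nrow(X)), df = p), md2,
       xlab = "theoretical chi-square quantiles",
       ylab = "squared Mahalanobis distance")
abline(0, 1)
# For a robust version, replace colMeans/cov by robust estimates,
# e.g. the MCD estimates from robustbase::covMcd().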

Extreme quantiles of the χ² distribution hint at outliers

[Figure: χ²-quantile plots and PC1/PC2 score plots with ordinary and adjusted quantile cutoffs]

Adjusted quantile via simulation: the point where the ECDF leaves the "plausible" range defines an adaptive cutoff.

(Slide credit: Markus Kalisch)

Outlier detection via robust PCA

imagine 784 dimensions ;-)

Assumption: the manifold hypothesis holds.


Dimension reduction via PCA

The PCA rotation can be achieved by multiplying X with an orthogonal rotation matrix A:

$Y_{n \times p} = X_{n \times p} \, A$  (PCA representation)
$X_{n \times p} = Y_{n \times p} \, A^t$  (full reconstruction)

Partly reconstruct X with only k < p PCs:

$\hat{X}_{n \times p} = \left[ Y_{n \times k}, \; 0_{n \times (p-k)} \right] A^t$

How good is the data representation? The reconstruction error is given by the squared orthogonal distance between a data point and its projection on the plane. PCA minimizes the reconstruction error over all available m data points:

$\sum_{i=1}^{m} \| \mathbf{x}^{(i)} - \hat{\mathbf{x}}^{(i)} \|^2 = \| X - \hat{X} \|^2$
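A small R sketch (not from the slides) of the partial reconstruction and its error, using prcomp() on a generic centered data matrix:

X = scale(iris[, 1:4], center = TRUE, scale = FALSE)   # centered data
pca = prcomp(X)
A = pca$rotation                      # orthogonal rotation matrix (loadings)
Y = pca$x                             # PCA representation: Y = X %*% A
k = 2
Xhat = Y[, 1:k] %*% t(A[, 1:k])       # reconstruction from the first k PCs
recon.error = rowSums((X - Xhat)^2)   # squared orthogonal distance per observation
sum(recon.error)                      # total reconstruction error minimized by PCA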

PCA is not robust against outliers

The first two PCs are the directions of maximal variance.

Since the variance is not robust against outliers, the result of PCA is also not robust against outliers.

We can use a robust version of PCA which is resistant to outliers.

[Figure: data cloud with outliers; PC1 with classical PCA vs. PC1 with robust PCA]

The reconstruction of the red point has a reconstruction error equal to the squared distance between the red and green points; PCA minimizes the sum of these squared distances.

Points with extreme reconstruction errors are identified as outliers.

[Figure: data points and their projections onto the 1st PC]

PCA can be used for outlier detection

We should use robust PCA to identify outliers via reconstruction errors.

In robust PCA the directions of the PCs are not heavily influenced by the positions of a few outliers. Therefore outliers have large distances to the hyperplane which is spanned by the first couple of PCs and which captures a large part of the variance of the non-outlying points.

[Figure: robust PCA plane with the outliers lying far away from it]

PCA in R

There are two major R implementations of PCA: prcomp() and princomp().

- prcomp() is numerically more stable and therefore preferred (see chapter 2.7 in the StDM script).
- princomp() has a few more options and is therefore sometimes used.

For robust PCA and outlier detection we can use the package rrcov:

- PcaHubert() performs robust PCA (a small usage sketch follows below).
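A minimal sketch of robust PCA with outlier flagging (not from the slides; it assumes the standard rrcov interface, so check ?PcaHubert for the exact arguments):

library(rrcov)
X = as.matrix(iris[, 1:4])
rpca = PcaHubert(X, k = 2)   # robust PCA, keeping 2 components
summary(rpca)
plot(rpca)                   # outlier map: score distance vs. orthogonal distance
which(!rpca@flag)            # observations flagged as outliers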

PCA: Variants (just for reference)

A huge number of variants of PCA exists and is available in R packages, for example:

Robust PCA: make PCA less sensitive to outliers, for example by using a robust estimate of the covariance matrix (PcaCov() in rrcov) or by other means like using Projection Pursuit (pcaPP)

Constrained PCA: PCA-like transformation with some constraints on sparsity (constructing linear combinations from only a small number of original variables) and / or non-negativity of principal components (nsprcomp, elasticnet)

Kernel PCA: By use of the so-called kernel trick, PCA can be extended by implicitly transforming the data to a high-dimensional space. Can also cope with non-numerical data like graphs, texts etc. R implementation e.g. as kpca() in kernlab.

Factor Analysis is related to PCA. The focus is on interpretable transformations, often used in the social sciences and psychology. Factors are often viewed as latent, unobservable variables that influence the outcomes of measurements.

For more variants implemented in R, see the CRAN task view "Multivariate": https://cran.r-project.org/web/views/Multivariate.html