Advanced Studies in Applied Statistics (WBL), ETHZ
Applied Multivariate Statistics
Spring 2018, Week 3
Lecturer: Beate [email protected]
Remark: Much of the material has been developed together with Oliver Dürr for different lectures at ZHAW.
Topics of today

• Similarity and Distances
  • Numeric data
  • Categorical data
  • Mixed data types
• Outlier detection
  • Univariate outlier detection by visual checks and additional tests
  • Multivariate outlier detection
    • Parametric: squared Mahalanobis distances and Chi-Square test
    • Non-parametric: robust PCA for multivariate outlier detection
• Multidimensional Scaling
  • Metric MDS
  • isoMDS
What is Similarity?

"The quality or state of being similar; likeness; resemblance; as, a similarity of features." (Webster's Dictionary)

Similarity is hard to define, but... "We know it when we see it."

The real meaning of similarity is a philosophical question. We will take a more pragmatic approach.
Defining Distance Measures (Recap)

Definition: Let O1 and O2 be two objects. The distance (dissimilarity) between O1 and O2 is a real number denoted by d(O1, O2).

[Figure: example distances between the objects "Peter" and "Piotr": 0.23, 3, 342.7]

(Dis-)similarities / Distance

Pairs of objects:
• Similarity (large ⇒ similar), vague definition
• Dissimilarity (small ⇒ similar), Rules 1-3
• Distance / Metric (small ⇒ similar), Rule 4 in addition

Examples of metrics (more follow with the examples):
• Euclidean and other Lp-metrics
• Jaccard distance (1 - Jaccard index)
• Graph distance (shortest path)
Rules
1. d(O1, O2) ≥ 0 (non-negativity)
2. d(O1, O2) = 0 if and only if O1 = O2 (identity)
3. d(O1, O2) = d(O2, O1) (symmetry)
4. d(O1, O3) ≤ d(O1, O2) + d(O2, O3) (triangle inequality)
Example of a Metric

Task 1
• Draw 3 objects on a piece of paper and measure their distances (e.g. with a ruler).
• Is this a proper distance? Are Axioms 1-4 fulfilled?

Task 2
• The 3 entities A, B, C have the dissimilarities:
  d(A,B) = 1, d(B,C) = 1, d(A,C) = 3
• Is this dissimilarity a distance?
• Can you try to draw them on a piece of paper?
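A quick check of Task 2 in R (a minimal sketch; the entities A, B, C are only given through their dissimilarities):

```r
# Dissimilarities from Task 2
d_AB <- 1
d_BC <- 1
d_AC <- 3

# Rule 4 (triangle inequality): d(A,C) <= d(A,B) + d(B,C)
d_AC <= d_AB + d_BC   # FALSE: the triangle inequality is violated
```

Since 3 > 1 + 1, this dissimilarity is not a distance, and no three points on a piece of paper can realize it.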
Problematic: Wordmaps

Try to do a wordmap with: Bank, Finance, Sitting

Triangle inequality: not just a mathematical gimmick!

The triangle inequality would imply:
d("sitting", "finance") ≤ d("sitting", "bank") + d("bank", "finance")
We live in a Euclidean space

If we are presented objects in the two-dimensional plane, we intuitively assume Euclidean distance between the objects.

[Figure: four points p1-p4 in the plane]
Euclidean Distance and its Generalization

Distance between observations o_i and o_j, with p features describing each observation.

Euclidean distance for two observations o_i, o_j described by p numeric features:

$d(o_i, o_j) = \sqrt{\sum_{k=1}^{p} (o_{ik} - o_{jk})^2}$

Minkowski distance as a generalization:

$d_r(o_i, o_j) = \left(\sum_{k=1}^{p} |o_{ik} - o_{jk}|^r\right)^{1/r}$

2D example (2 features per observation):

obs  x1  x2
o1    0   2
o2    2   0
o3    3   1
o4    5   1

[Figure: the four points p1-p4 plotted in the (x1, x2) plane]

$d(o_2, o_3) = \sqrt{(2-3)^2 + (0-1)^2} = \sqrt{2}$
L1: Manhattan Distances

[Figure: city-block grid with points A and B; one block is one unit]

• How many blocks do you have to walk from A to B?
• What is the L1 distance (r = 1) from A to B?
• What is the Euclidean distance?

$d_r(o_i, o_j) = \left(\sum_{k=1}^{p} |o_{ik} - o_{jk}|^r\right)^{1/r}$

Image from Wikipedia
Minkowski Distances

r = 1: city block (Manhattan, taxicab, L1 norm) distance:

$d(o_i, o_j) = \sum_{k=1}^{p} |o_{ik} - o_{jk}|$

r = 2: Euclidean distance (L2 norm):

$d(o_i, o_j) = \sqrt{\sum_{k=1}^{p} (o_{ik} - o_{jk})^2}$

r = ∞: "supremum" or maximum (Lmax norm, L∞ norm) distance; this is the maximum difference between any component of the vectors:

$d(o_i, o_j) = \max_{k=1,\dots,p} |o_{ik} - o_{jk}|$
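These special cases can be computed with R's built-in dist(); a short sketch using observations o2 = (2, 0) and o3 = (3, 1) from the earlier 2D example:

```r
o <- rbind(o2 = c(2, 0), o3 = c(3, 1))

dist(o, method = "manhattan")          # r = 1: |2-3| + |0-1| = 2
dist(o, method = "euclidean")          # r = 2: sqrt(2), about 1.41
dist(o, method = "maximum")            # r = Inf: max(|2-3|, |0-1|) = 1
dist(o, method = "minkowski", p = 3)   # general r via the p argument
```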
Distance matrix

As discussed on the last couple of slides, there are different possibilities to determine the pairwise distance between two observations o_i and o_j.

We can collect all these pairwise distances d_ij in a distance matrix:

$D = \begin{pmatrix} 0 & d_{12} & \cdots & d_{1n} \\ d_{21} & 0 & \cdots & d_{2n} \\ \vdots & & \ddots & \vdots \\ d_{n1} & d_{n2} & \cdots & 0 \end{pmatrix}, \quad d_{ij} = d(o_i, o_j)$

All diagonal elements are 0: $d_{kk} = d(o_k, o_k) = 0$

Symmetry: $d_{ij} = d(o_i, o_j) = d(o_j, o_i) = d_{ji}$
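A minimal sketch in R, using the four observations from the 2D example, that verifies the two properties of the distance matrix:

```r
X <- rbind(o1 = c(0, 2), o2 = c(2, 0), o3 = c(3, 1), o4 = c(5, 1))
D <- as.matrix(dist(X))   # pairwise Euclidean distances as an n x n matrix

all(diag(D) == 0)   # TRUE: all diagonal elements are 0
isSymmetric(D)      # TRUE: d_ij = d_ji
```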
Similarity measures for binary data

A common situation is that objects $o_1 = (o_{11}, o_{12}, \dots, o_{1p})$ and $o_2 = (o_{21}, o_{22}, \dots, o_{2p})$ have only binary attributes, for example gender (f/m), driving license (yes/no), Nobel prize holder (yes/no).

We distinguish between symmetric and asymmetric binary variables:
• In a symmetric binary variable both levels have roughly comparable frequencies (example: gender).
• In an asymmetric binary variable the two levels have very different frequencies (example: Nobel prize holder).
Similarity measures for "symmetric" binary vectors: Simple Matching Coefficient

The objects $o_1 = (o_{11}, o_{12}, \dots, o_{1p})$ and $o_2 = (o_{21}, o_{22}, \dots, o_{2p})$ have only binary attributes.

Compute for the symmetric binary variables (could be only a subset of the p binary variables) the Simple Matching Coefficient:

SMC = # matches / # attributes

corresponding to the proportion of matching features over all features:

SMC = (M11 + M00) / (M01 + M10 + M11 + M00)

where
M01 = the number of attributes where o1 was 0 and o2 was 1
M10 = the number of attributes where o1 was 1 and o2 was 0
M00 = the number of attributes where o1 was 0 and o2 was 0
M11 = the number of attributes where o1 was 1 and o2 was 1
Similarity measures for "asymmetric" binary vectors: Jaccard Coefficient

The objects $o_1 = (o_{11}, o_{12}, \dots, o_{1p})$ and $o_2 = (o_{21}, o_{22}, \dots, o_{2p})$ have only binary attributes.

Compute for the asymmetric binary variables (could be only a subset of the p binary variables) the Jaccard Coefficient:

J = # both-1 matches / # not-both-zero attribute values

corresponding to the proportion of matching features over those features which are 1 in at least one of the two observations:

J = M11 / (M01 + M10 + M11)

where
M01 = the number of attributes where o1 was 0 and o2 was 1
M10 = the number of attributes where o1 was 1 and o2 was 0
M11 = the number of attributes where o1 was 1 and o2 was 1
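Both coefficients are easy to compute by hand; a minimal sketch in R for two made-up binary vectors:

```r
o1 <- c(1, 0, 0, 1, 0)
o2 <- c(1, 0, 1, 1, 1)

M11 <- sum(o1 == 1 & o2 == 1)   # both 1
M00 <- sum(o1 == 0 & o2 == 0)   # both 0
M10 <- sum(o1 == 1 & o2 == 0)
M01 <- sum(o1 == 0 & o2 == 1)

smc <- (M11 + M00) / (M01 + M10 + M11 + M00)   # (2 + 1) / 5 = 0.6
jac <- M11 / (M01 + M10 + M11)                 # 2 / 4 = 0.5
```

Note that the SMC uses all five attributes, while the Jaccard coefficient ignores the attribute that is 0 in both vectors.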
Gower's dissimilarity for mixed data types

Idea: use a distance measure $d_{ij}^{(k)}$ between 0 and 1 for each variable or feature:

• kth variable is binary or nominal: use the discussed methods, e.g.

  $d_{ij}^{(k)} = 1 - \frac{M_{11} + M_{00}}{M_{01} + M_{10} + M_{11} + M_{00}}$

• kth variable is numeric ($x_{ik}$: value for object i in variable k; $R_k$: range of variable k over all objects):

  $d_{ij}^{(k)} = \frac{|x_{ik} - x_{jk}|}{R_k}$

• kth variable is ordinal: use normalized ranks, then proceed as with numeric variables.

Aggregate the distance measures over all variables/features/dimensions:

$d_{ij} = \frac{1}{p} \sum_{k=1}^{p} d_{ij}^{(k)}$
Dissimilarity for mixed data types with the R function "daisy", calculating Gower's dissimilarity:

> str(flower)
'data.frame': 18 obs. of 8 variables:
 $ V1: Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 2 2 ...
 $ V2: Factor w/ 2 levels "0","1": 2 1 2 1 2 2 1 1 2 2 ...
 $ V3: Factor w/ 2 levels "0","1": 2 1 1 2 1 1 1 2 1 1 ...
 $ V4: Factor w/ 5 levels "1","2","3",..: 4 2 3 4 5 4 4 2 3 5 ...   <- nominal factor
 $ V5: Ord.factor w/ 3 levels "1"<"2"<"3": 3 1 3 2 2 3 3 2 1 2 ...
 $ V6: Ord.factor w/ 18 levels "1"<"2"<"3"<..: 15 3 1 16 2 12 ...
 $ V7: num 25 150 150 125 20 50 40 100 25 100 ...
 $ V8: num 15 50 50 50 15 40 20 15 15 60 ...
> library(cluster)
> dist = daisy(flower, type = list(asymm = c(1, 3), symm = 2, ordratio = 7))
> str(dist)
Classes 'dissimilarity', 'dist' atomic [1:153] 0.901 0.618 ...
 ..- attr(*, "Size")= int 18
 ..- attr(*, "Metric")= chr "mixed"
 ..- attr(*, "Types")= chr [1:8] "A" "S" "A" "N" ...
> plot(hclust(dist))
Goal of Multidimensional Scaling

MDS takes as input distances between observations (data points) and produces a visualization of the points in 2D.

[Figure: point configuration; the bars between points represent the given distances]

As input to MDS we only know the distances, and we look for a low-dimensional point configuration in which the points have the same or similar distances.
Example for metric MDS

[Table: distance matrix]

Problem: given Euclidean distances among points, recover the positions of the points!

Example: road distances between 21 European cities (almost Euclidean, but not quite).
eurodist data: [Table: road-distance matrix of the 21 European cities]

MDS in R:

res.cmd = cmdscale(eurodist)
plot(res.cmd, pch = "")
text(res.cmd, labels = rownames(res.cmd))

[Figure: 2D MDS configuration of the 21 cities, from Athens, Barcelona and Gibraltar to Hamburg, Stockholm and Vienna]

The configuration can be
- shifted
- rotated
- reflected
without changing the distances.
Equivalence of PCA and MDS with Euclidean distance

PCA representation on the data matrix = MDS representation on the Euclidean distance matrix.

MDS on Euclidean distances results in a low-dimensional representation that is equivalent (up to rotation, flipping, and shifts) to PCA on the data matrix (however, for PCA the data matrix must first be available; MDS works directly on the distance matrix).
Distance matrix and 2D plot of multivariate mixed data:

library(cluster)
dist = daisy(flower)
mdist = as.matrix(dist)

library(pheatmap)
pheatmap(mdist)

library(MASS)
mds = isoMDS(mdist, k = 2)
d.mds = as.data.frame(mds$points)
names(d.mds) = c("c1", "c2")

library(ggplot2)
ggplot(data = d.mds, aes(x = c1, y = c2)) +
  geom_point() +
  geom_text(label = row.names(mdist), hjust = 1.2)
z-score

Let's standardize and look at the z-score. We start from a variable X with $E(X) = \mu_x$ and $Var(X) = \sigma_x^2$ and apply the z-transformation:

$Z = \frac{X - \mu_x}{\sigma_x}$

The standardized variable Z has mean zero and variance 1:

$E(Z) = 0, \quad sd(Z) = 1$

In case of a Normally distributed X we know that $Z \sim N(0, 1)$.

The z-score has the unit "standard deviation". Often the z-transformation is applied to different univariate features to make them "comparable". A z-score of -2 always means that the observation is two standard deviations below the population mean.

[Figure: axis in units of standard deviations; how much is my IQ above/below average?]

Remark: mean and SD can also be determined for non-Normally distributed variables, but the intuition is lost and we might prefer to work with quantiles.
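The z-transformation in one line of R (scale() does the same; the values below are made up):

```r
x <- c(100, 115, 85, 130, 70)   # e.g. IQ measurements (made-up values)
z <- (x - mean(x)) / sd(x)      # z-transformation

mean(z)   # 0 (up to rounding)
sd(z)     # 1
```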
The multivariate Normal distribution

$\mathbf{x} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ with density

$f(\mathbf{x}) = \frac{1}{(2\pi)^{k/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^t \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$

[Figure: density and contour plot of a bivariate Normal distribution $(X, Y)^t \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$]

Remarks:
• All marginal distributions of a multivariate Normal distribution are univariate Normal distributions.
• All conditional distributions of a multivariate Normal distribution are univariate Normal distributions.
• Each iso-density line is an ellipse, or its higher-dimensional generalization.
Mahalanobis distance is the multivariate z-score

$MD(\mathbf{x}) = \sqrt{(\mathbf{x} - \boldsymbol{\mu})^t \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})}$

The Mahalanobis distance MD(x) measures the distance of x to the mean of the multivariate Normal distribution in units of standard deviations.

In case of a multivariate Normally distributed x, the squared distance $MD(\mathbf{x})^2$ follows a $\chi^2$ distribution with df = p.

[Figure: iso-distance ellipses at MD = 1 and MD = 2]
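R's mahalanobis() returns the squared distance MD²; a minimal sketch on simulated bivariate Normal data (the parameters are made up):

```r
library(MASS)   # for mvrnorm
set.seed(1)
Sigma <- matrix(c(1, 0.6, 0.6, 2), nrow = 2)
X <- mvrnorm(n = 200, mu = c(0, 0), Sigma = Sigma)

md2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))  # squared MD
md  <- sqrt(md2)   # Mahalanobis distance in units of SD
```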
Outlier detection using a boxplot representation

All points beyond the whiskers are called "extreme" values. Is there a model behind this definition?

The model behind the "extreme" value definition in boxplots

About 99% of N(0,1) data are within the whiskers. When visualizing non-Normally distributed data, this model is not valid.

[Figure: boxplot of 100k data points simulated from a N(0,1)]
Outlier detection in the univariate case via the Grubbs test

library(outliers)
x = c(45, 56, 54, 34, 32, 45, 67, 45, 67, 65, 154)   # 154 is a potential outlier
grubbs.test(x)
# Grubbs test for one outlier
#
# data: x
# G = 2.80490, U = 0.13459, p-value = 0.0001816
# alternative hypothesis: highest value 154 is an outlier

Grubbs developed this test statistic in 1950 (assuming, as in the t-test, a Normal distribution for small n) to investigate whether at "some time during the experiment something possibly happened to cause an extraneous variation on the high side or on the low side". It is nowadays also routinely used in regression model checking procedures (e.g. to find outliers in Cook's distance values or standardized residuals).
Outlier detection in the multivariate case via a χ² test

The Mahalanobis distance MD(x) measures the distance of x to the mean of the multivariate Normal distribution in units of SD:

$\mathbf{x}_{p \times 1} \sim N(\boldsymbol{\mu}_{p \times 1}, \boldsymbol{\Sigma}_{p \times p})$

$MD^2 = (\mathbf{x} - \boldsymbol{\mu})^t \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})$

$MD^2 \sim \chi^2_{df=p}$

Outlier detection via Mahalanobis distance can be performed for data for which the multivariate Normal assumption is reasonable, by checking whether the MD² of a p-dimensional observation is "sticking out" of a χ² distribution with df = p.
Outlier detection based on the expected χ² distribution of squared Mahalanobis distances from the assumed Normal-distribution center:

• Compute for each p-dimensional observation x the (robust version of the) squared Mahalanobis distance MD(x)².
• Generate a quantile-quantile plot against the expected χ² distribution with df = p to identify observations that stick out (MD(x)² > 97.5% quantile of χ²_p).
• In addition, use "adjusted quantiles" that are estimated by simulations from the expected chi-square distribution without outliers.
• Use (robust) PCA to visualize the data in a 2D score plot.
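A minimal sketch of the first two steps on simulated data (one outlier is planted by hand; in practice a robust estimate, e.g. from covMcd, would replace cov()):

```r
set.seed(2)
X <- matrix(rnorm(100 * 3), ncol = 3)   # 100 observations, p = 3
X[1, ] <- c(6, 6, 6)                    # plant one clear outlier

md2 <- mahalanobis(X, colMeans(X), cov(X))
cutoff <- qchisq(0.975, df = ncol(X))   # 97.5% quantile of chi^2_p

which(md2 > cutoff)                     # observation 1 sticks out

# QQ plot of squared MDs against the expected chi^2_p distribution
qqplot(qchisq(ppoints(nrow(X)), df = ncol(X)), md2,
       xlab = "chi-square (df = 3) quantiles",
       ylab = "squared Mahalanobis distance")
```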
Adjusted Quantile via simulation

[Figure: ECDF of the squared Mahalanobis distances leaves the "plausible" range; this defines an adaptive cutoff]

Slide credit: Markus Kalisch
Outlier detection via robust PCA

[Figure: low-dimensional structure in high-dimensional data; imagine 784 dimensions ;-)]

Assumption: the manifold hypothesis holds.
Dimension reduction via PCA

A PCA rotation can be achieved by multiplying X with an orthogonal rotation matrix A:

$\mathbf{Y}_{n \times p} = \mathbf{X}_{n \times p} \, \mathbf{A}_{p \times p}$  (PCA representation)

$\mathbf{X}_{n \times p} = \mathbf{Y}_{n \times p} \, \mathbf{A}^t$  (full reconstruction)

Partly reconstruct X with only k < p PCs:

$\hat{\mathbf{X}}_{n \times p} = (\mathbf{Y}_{n \times k}, \mathbf{0}_{n \times (p-k)}) \, \mathbf{A}^t$

How good is the data representation? PCA minimizes the reconstruction error over all available m data points:

$\sum_{i=1}^{m} \|\mathbf{x}_i - \hat{\mathbf{x}}_i\|^2 = \|\mathbf{X} - \hat{\mathbf{X}}\|^2$

The reconstruction error is given by the squared orthogonal distance between a data point and its projection on the plane.
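A minimal sketch with prcomp() on the built-in iris measurements, verifying the full reconstruction X = Y Aᵗ and computing per-point reconstruction errors for k = 2:

```r
X <- scale(as.matrix(iris[, 1:4]), center = TRUE, scale = FALSE)  # centered data
pca <- prcomp(X, center = FALSE)
A <- pca$rotation              # orthogonal p x p rotation matrix
Y <- X %*% A                   # Y = X A (PCA representation, scores)

max(abs(Y %*% t(A) - X))       # ~0: full reconstruction X = Y A^t

k <- 2
Xhat <- Y[, 1:k] %*% t(A[, 1:k])     # partial reconstruction with k PCs
recon_err <- rowSums((X - Xhat)^2)   # squared orthogonal distance per point
```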
PCA is not robust against outliers

The first two PCs are the directions of maximal variance. Since the variance is not robust against outliers, the result of PCA is also not robust against outliers. We can use a robust version of PCA which is resistant to outliers.

[Figure: PC1 with classical PCA vs. PC1 with robust PCA]

PCA can be used for outlier detection

The reconstruction of the red point has a reconstruction error equal to the squared distance between the red and green points; PCA minimizes the sum of these squared distances. Points with extreme reconstruction errors are identified as outliers.

[Figure: data points and their projections onto the first PC]

We should use robust PCA to identify outliers via reconstruction errors. In robust PCA the directions of the PCs are not heavily influenced by the positions of some outliers; hence outliers have larger distances to the hyperplane which is spanned by the first couple of PCs and which captures large parts of the variance of the non-outlying points.

[Figure: outliers far from the robust PC hyperplane]
PCA in R

There are two major R implementations of PCA: prcomp() and princomp().
- prcomp is numerically more stable and therefore preferred (see chapter 2.7 in the StDM script).
- princomp has a few more options and is therefore sometimes used.

For robust PCA and outlier detection we can use the package rrcov:
- PcaHubert performs robust PCA.
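A short sketch of PcaHubert (assumes the rrcov package is installed; the data and the planted outliers are made up):

```r
library(rrcov)
set.seed(3)
X <- matrix(rnorm(50 * 4), ncol = 4)
X[1:3, ] <- X[1:3, ] + 8     # plant three clear outliers

rpc <- PcaHubert(X, k = 2)   # robust PCA with 2 components
which(!rpc@flag)             # flag == FALSE marks outlying observations
```

Observations flagged FALSE have large score or orthogonal distances and are treated as outliers; plot(rpc) shows the corresponding outlier map.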
PCA: Variants (just for reference)
A huge number of variants of PCA exists and is available in R packages, for example:
Robust PCA: make PCA less sensitive to outliers, for example by using a robust estimate of the covariance matrix (PcaCov() in rrcov) or by other means like using Projection Pursuit (pcaPP)
Constrained PCA: PCA-like transformation with some constraints on sparsity (constructing linear combinations from only a small number of original variables) and / or non-negativity of principal components (nsprcomp, elasticnet)
Kernel PCA: By use of the so-called kernel trick, PCA can be extended by implicitly transforming the data to a high-dimensional space. Can also cope with non-numerical data like graphs, texts etc. R implementation e.g. as kpca() in kernlab.
Factor Analysis is related to PCA. The focus is on interpretable transformations, often used in the social sciences and psychology. Factors are often viewed as latent, unobservable variables that influence the outcomes of measurements.
For more variants implemented in R, see the CRAN task view "Multivariate": https://cran.r-project.org/web/views/Multivariate.html