Finding Multivariate Outliers

Applied Multivariate Statistics, Spring 2012, ETH Zurich


Goals

Concept: Detecting outliers with (robustly) estimated Mahalanobis distance and QQ-plot

R: chisq.plot, pcout from package “mvoutlier”


Outlier in one dimension - easy

Look at scatterplots

Find dimensions of outliers

Find extreme samples just in these dimensions

Remove outlier


2d: More tricky

[Figure: 2-d scatterplot. The marked point is an outlier, although it is no outlier in x or y alone.]

Recap: Mahalanobis distance

True Mahalanobis distance:

$MD(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}$

Estimated Mahalanobis distance:

$\widehat{MD}(x) = \sqrt{(x - \hat{\mu})^T \hat{\Sigma}^{-1} (x - \hat{\mu})}$

Squared Mahalanobis distance $MD^2(x)$ = squared distance from the mean, measured in standard deviations IN THE DIRECTION OF x

Mahalanobis distance: Example

$\Sigma = \begin{pmatrix} 25 & 0 \\ 0 & 1 \end{pmatrix}, \quad \mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$

Point (20, 0): MD = 4

Point (0, 10): MD = 10

Point (10, 7): MD = 7.3
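The three distances in this example can be checked numerically. The following is a minimal sketch in plain Python; the helper name `mahalanobis_2d` is an assumption made for illustration, and it handles only the 2-d case by writing out the 2x2 matrix inverse explicitly.

```python
import math

def mahalanobis_2d(x, mu, sigma):
    """Mahalanobis distance of a 2-d point x from mean mu under covariance sigma."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    # Inverse of a 2x2 matrix, written out explicitly
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx = (x[0] - mu[0], x[1] - mu[1])
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu)
    q = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
         + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.sqrt(q)

SIGMA = ((25.0, 0.0), (0.0, 1.0))
MU = (0.0, 0.0)

for point in [(20, 0), (0, 10), (10, 7)]:
    print(point, round(mahalanobis_2d(point, MU, SIGMA), 1))
# (20, 0) 4.0
# (0, 10) 10.0
# (10, 7) 7.3
```

The point (0, 10) is closer to the mean than (20, 0) in Euclidean terms, but much further in Mahalanobis terms, because the standard deviation in the second coordinate is 1 rather than 5.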


Theory of Mahalanobis Distance

Assume the data are multivariate normally distributed (d dimensions)

Then the squared Mahalanobis distance of the samples follows a Chi-Square distribution with d degrees of freedom

(“By definition”: The sum of squares of d independent standard normal random variables has a Chi-Square distribution with d degrees of freedom.)
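A small simulation can make this statement concrete. The sketch below assumes d = 2, mean zero and identity covariance, so that the squared Mahalanobis distance is simply the sum of two squared standard normals; for d = 2 the Chi-Square cdf is 1 - exp(-x/2), which gives the quantile in closed form. The function name `chi2_quantile_2df` and the seed are illustrative choices.

```python
import math
import random

def chi2_quantile_2df(p):
    """Quantile of the Chi-Square distribution with 2 degrees of freedom.

    For d = 2 the cdf is 1 - exp(-x / 2), so the quantile has a closed form.
    """
    return -2.0 * math.log(1.0 - p)

rng = random.Random(1)  # fixed seed, for reproducibility only
n = 20000
# Squared Mahalanobis distance of a standard normal sample in d = 2
# (mu = 0, Sigma = identity) is just z1^2 + z2^2.
md2 = [rng.gauss(0, 1) ** 2 + rng.gauss(0, 1) ** 2 for _ in range(n)]

q = chi2_quantile_2df(0.975)
share_below = sum(v <= q for v in md2) / n
print(round(q, 3), share_below)
```

If the theory holds, the printed share of samples below the 97.5% quantile (about 7.378) should be close to 0.975.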


Check for multivariate outliers

Are there samples whose estimated squared Mahalanobis distance does not fit a Chi-Square distribution at all?

Check with a QQ-Plot

Technical details:

- The Chi-Square distribution is still a reasonably good approximation for the estimated Mahalanobis distance

- Use robust estimates for $\mu$, $\Sigma$
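The QQ-plot check can be sketched without any plotting library by computing the coordinates of the points. This is an illustration only, not the code behind `chisq.plot`: it assumes d = 2, uses the plotting positions (i - 0.5)/n, and the squared distances in `md2` are hypothetical values invented for the example.

```python
import math

def chi2_qq_points(md2):
    """Pair sorted squared distances with Chi-Square(2) quantiles."""
    n = len(md2)
    pairs = []
    for i, observed in enumerate(sorted(md2), start=1):
        p = (i - 0.5) / n                       # plotting position
        theoretical = -2.0 * math.log(1.0 - p)  # Chi-Square quantile for d = 2
        pairs.append((theoretical, observed))
    return pairs

# Hypothetical squared robust distances for 8 samples in d = 2
md2 = [0.2, 0.5, 0.9, 1.4, 2.0, 2.9, 4.1, 30.0]
for theoretical, observed in chi2_qq_points(md2):
    print(round(theoretical, 2), observed)
```

For samples that fit the Chi-Square distribution, the two columns are of similar size. The last sample (observed 30.0 against a theoretical quantile of about 5.55) is the kind of point that “sticks out” in the plot.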

Robust Estimates: Income of 7 people

[Figure: incomes of 7 people, comparing the standard deviation with a robust scatter estimate (robust standard deviation).]

Robust Estimates for outlier detection

If scatter is estimated robustly, outliers “stick out” much more

Robust Mahalanobis distance: Mean and covariance matrix are estimated robustly
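A one-dimensional sketch shows why robust estimates make outliers stand out. The incomes below are hypothetical numbers, not the data from the slides, and the median with the scaled median absolute deviation (MAD) is used here as one common robust choice; the slides do not say which robust estimator they use.

```python
import statistics

# Hypothetical incomes (in thousands); the last value is the outlier
incomes = [30, 32, 35, 38, 40, 41, 1000]

mean = statistics.mean(incomes)
sd = statistics.stdev(incomes)

median = statistics.median(incomes)
# Median absolute deviation, scaled by 1.4826 so that it estimates the
# standard deviation for normally distributed data
mad = 1.4826 * statistics.median(abs(v - median) for v in incomes)

# Distance of the largest income from the center, in units of scatter
classical = (max(incomes) - mean) / sd
robust = (max(incomes) - median) / mad
print(round(classical, 2), round(robust, 1))
```

With the classical estimates the outlier inflates both the mean and the standard deviation, so it ends up only a little more than 2 standard deviations from the mean. With the robust estimates it is more than 200 scatter units away from the median.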


Example - continued

Outlier easily detected!

Outliers in >2d can be well hidden!

[Figure: scatterplot matrix] No outlier, right? Wrong!

This outlier can’t be seen in the scatterplot matrix (but it can be seen in a 3d plot)

Method 1: Quantile of Chi-Square distribution

Compute for each sample (in d dimensions) the robustly estimated squared Mahalanobis distance $MD^2(x_i)$

Compute the 97.5%-quantile Q of the Chi-Square distribution with d degrees of freedom

All samples with $MD^2(x_i) > Q$ (equivalently, $MD(x_i) > \sqrt{Q}$) are declared outliers
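The cutoff rule of Method 1 can be sketched as follows. The sketch assumes d = 2, where the Chi-Square quantile has a closed form, and it takes the squared distances as given; the names `chi2_quantile_2df` and `flag_outliers` are illustrative. The distances used are those of the points (20, 0), (0, 10), (10, 7) and (5, 1) under covariance diag(25, 1) and mean zero, computed with the true rather than robustly estimated parameters.

```python
import math

def chi2_quantile_2df(p):
    """Chi-Square quantile for d = 2 (closed form, since cdf = 1 - exp(-x/2))."""
    return -2.0 * math.log(1.0 - p)

def flag_outliers(md2, level=0.975):
    """Return one boolean per sample: True if squared distance exceeds the cutoff."""
    cutoff = chi2_quantile_2df(level)
    return [v > cutoff for v in md2]

# Squared distances of (20, 0), (0, 10), (10, 7) and (5, 1)
# under Sigma = diag(25, 1), mu = (0, 0)
md2 = [16.0, 100.0, 53.0, 2.0]
print(round(chi2_quantile_2df(0.975), 2), flag_outliers(md2))
# 7.38 [True, True, True, False]
```

Note that the comparison is made on the squared distances, since it is the squared distance that follows the Chi-Square distribution.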


Method 2: Adjusted Quantile

Adjusted quantile for outliers: Depends on the distance between the cdf of the Chi-Square distribution and the ecdf of the samples in the tails

Simulate “normal” deviations in the tails

Outliers have “abnormally large” deviations in the tails (e.g. more than seen in 100 simulations without outliers)
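The idea of comparing tail deviations against simulations can be sketched as below. This is a simplified illustration, not the implementation in `aq.plot`: it assumes d = 2, measures the deviation as the largest positive gap between the Chi-Square cdf and the ecdf beyond the 97.5% quantile, and simulates outlier-free data with the true distances instead of re-estimating them robustly. All function names, seeds and the planted outlier values are assumptions made for the example.

```python
import math
import random

def chi2_cdf_2df(x):
    """Cdf of the Chi-Square distribution with 2 degrees of freedom."""
    return 1.0 - math.exp(-x / 2.0)

def tail_deviation(md2, delta):
    """Largest positive gap cdf - ecdf over the tail u >= delta.

    The ecdf is a step function, so the gap is largest just below a jump:
    at the i-th smallest value (0-based) the ecdf still equals i / n.
    """
    n = len(md2)
    gaps = [chi2_cdf_2df(v) - i / n
            for i, v in enumerate(sorted(md2)) if v >= delta]
    return max(gaps + [0.0])

def simulated_critical_value(n, delta, runs=100, seed=0):
    """Largest tail deviation seen in `runs` simulated outlier-free data sets."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(runs):
        clean = [rng.gauss(0, 1) ** 2 + rng.gauss(0, 1) ** 2 for _ in range(n)]
        worst = max(worst, tail_deviation(clean, delta))
    return worst

delta = -2.0 * math.log(0.025)  # 97.5% quantile for d = 2
rng = random.Random(42)
clean = [rng.gauss(0, 1) ** 2 + rng.gauss(0, 1) ** 2 for _ in range(200)]
contaminated = clean + [50.0 + k for k in range(20)]  # 20 planted outliers

crit = simulated_critical_value(len(contaminated), delta)
print(tail_deviation(contaminated, delta) > crit)
```

With 20 planted outliers among 220 samples, the ecdf stays near 0.91 where the Chi-Square cdf is already almost 1, so the tail deviation of the contaminated data should exceed anything seen in the 100 clean simulations.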

[Figure: cdf of the Chi-Square distribution and ecdf of the samples. The point where the ECDF leaves the “plausible” range defines the adaptive cutoff.]

R: Function “aq.plot”

Method 3: State of the art - pcout

Complex method based on robust principal components

Pretty involved methodology

Very fast – good for high dimensions

R: Function “pcout” in package “mvoutlier”

$wfinal01: 0 marks an outlier

$wfinal: Smaller values indicate more severe outliers

P. Filzmoser, R. Maronna, M. Werner. Outlier identification in high dimensions. Computational Statistics and Data Analysis, 52, 1694-1711, 2008.

Automatic outlier detection

It is always better to look at a QQ-plot to find outliers!

Just find points “sticking out”; no distributional assumption needed

If you can’t: Automatic outlier detection

- usually finds too many or too few outliers, depending on parameter settings

- depends on distributional assumptions (e.g. multivariate normality)

+ good for screening large amounts of data


Concepts to know

Find multivariate outliers with the robustly estimated Mahalanobis distance

Cutoff

- by eye (best method)

- quantile of Chi-Square distribution


R commands to know

chisq.plot, pcout in package “mvoutlier”


Next week

Missing values
