outliers and influential data points. no outliers?

Outliers and influential data points

Upload: opal-wilkerson

Post on 18-Jan-2016


TRANSCRIPT

Page 1: Outliers and influential data points. No outliers?

Outliers and influential data points

Page 2: Outliers and influential data points. No outliers?

No outliers?

[Figure: scatterplot of y (0 to 70) versus x (0 to 14).]

Page 3: Outliers and influential data points. No outliers?

An outlier? Influential?

[Figure: scatterplot of y (0 to 70) versus x (0 to 14).]

Page 4: Outliers and influential data points. No outliers?

An outlier? Influential?

[Figure: scatterplot of y (0 to 70) versus x (0 to 14) with two fitted lines.]

y = 1.73 + 5.12 x

y = 2.96 + 5.04 x

Page 5: Outliers and influential data points. No outliers?

An outlier? Influential?

[Figure: scatterplot of y (0 to 70) versus x (0 to 14).]

Page 6: Outliers and influential data points. No outliers?

An outlier? Influential?

[Figure: scatterplot of y (0 to 70) versus x (0 to 14) with two fitted lines.]

y = 1.73 + 5.12 x

y = 2.47 + 4.93 x

Page 7: Outliers and influential data points. No outliers?

An outlier? Influential?

[Figure: scatterplot of y (0 to 70) versus x (0 to 14).]

Page 8: Outliers and influential data points. No outliers?

An outlier? Influential?

[Figure: scatterplot of y (0 to 70) versus x (0 to 14) with two fitted lines.]

y = 1.73 + 5.12 x

y = 8.51 + 3.32 x

Page 9: Outliers and influential data points. No outliers?

Impact on regression analyses

• Not every outlier strongly influences the estimated regression function.

• Always determine whether the estimated regression function is unduly influenced by one or a few cases.

• Simple plots for simple linear regression.

• Summary measures for multiple linear regression.

Page 10: Outliers and influential data points. No outliers?

The hat matrix H

Page 11: Outliers and influential data points. No outliers?

The hat matrix H

The regression model:

$$ Y = X\beta + \varepsilon, \qquad E(Y) = X\beta $$

Least squares estimates:

$$ b = (X'X)^{-1} X'y $$

Fitted values:

$$ \hat{y} = Xb = X(X'X)^{-1}X'y = Hy $$

Page 12: Outliers and influential data points. No outliers?

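$$ y = \begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{pmatrix} = \begin{pmatrix} 8 \\ 15 \\ 10 \\ 7 \end{pmatrix}, \qquad X = \begin{pmatrix} 1 & x_{11} & x_{12} \\ 1 & x_{21} & x_{22} \\ 1 & x_{31} & x_{32} \\ 1 & x_{41} & x_{42} \end{pmatrix} = \begin{pmatrix} 1 & 4.2 & 4 \\ 1 & 6.5 & 6.5 \\ 1 & 3 & 3.5 \\ 1 & 3 & 2.8 \end{pmatrix} $$

$$ H = X(X'X)^{-1}X' = \begin{pmatrix} 0.411 & 0.202 & -0.058 & 0.444 \\ 0.202 & 0.931 & 0.020 & -0.152 \\ -0.058 & 0.020 & 0.994 & 0.044 \\ 0.444 & -0.152 & 0.044 & 0.664 \end{pmatrix} $$

$$ \hat{y} = Hy = \begin{pmatrix} 0.411 & 0.202 & -0.058 & 0.444 \\ 0.202 & 0.931 & 0.020 & -0.152 \\ -0.058 & 0.020 & 0.994 & 0.044 \\ 0.444 & -0.152 & 0.044 & 0.664 \end{pmatrix} \begin{pmatrix} 8 \\ 15 \\ 10 \\ 7 \end{pmatrix} = \begin{pmatrix} 8.85 \\ 14.71 \\ 10.08 \\ 6.36 \end{pmatrix} $$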
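The hat-matrix arithmetic on this slide can be reproduced in a few lines. The following is a minimal NumPy sketch, assuming the design matrix and response are as read from the slide's four-case example; the variable names are illustrative and not part of the original deck.

```python
import numpy as np

# Design matrix (intercept, x1, x2) and response, as read from the slide
X = np.array([[1.0, 4.2, 4.0],
              [1.0, 6.5, 6.5],
              [1.0, 3.0, 3.5],
              [1.0, 3.0, 2.8]])
y = np.array([8.0, 15.0, 10.0, 7.0])

# H = X (X'X)^{-1} X'; solve() avoids forming the inverse explicitly
H = X @ np.linalg.solve(X.T @ X, X.T)

# Fitted values are a linear combination of the observed responses
y_hat = H @ y

print(np.round(H, 3))
print(np.round(y_hat, 2))
print("trace(H) =", round(float(np.trace(H)), 6))  # equals p, the number of parameters
```

The hat matrix is symmetric and idempotent, and its trace equals the number of columns of X, which gives a quick sanity check on any hand calculation.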

Page 13: Outliers and influential data points. No outliers?

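$$ H = \begin{pmatrix} h_{11} & h_{12} & h_{13} & h_{14} \\ h_{21} & h_{22} & h_{23} & h_{24} \\ h_{31} & h_{32} & h_{33} & h_{34} \\ h_{41} & h_{42} & h_{43} & h_{44} \end{pmatrix} $$

$$ \hat{y} = Hy = \begin{pmatrix} h_{11} & h_{12} & h_{13} & h_{14} \\ h_{21} & h_{22} & h_{23} & h_{24} \\ h_{31} & h_{32} & h_{33} & h_{34} \\ h_{41} & h_{42} & h_{43} & h_{44} \end{pmatrix} \begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{pmatrix} = \begin{pmatrix} h_{11}y_1 + h_{12}y_2 + h_{13}y_3 + h_{14}y_4 \\ h_{21}y_1 + h_{22}y_2 + h_{23}y_3 + h_{24}y_4 \\ h_{31}y_1 + h_{32}y_2 + h_{33}y_3 + h_{34}y_4 \\ h_{41}y_1 + h_{42}y_2 + h_{43}y_3 + h_{44}y_4 \end{pmatrix} $$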

Page 14: Outliers and influential data points. No outliers?

Identifying outlying Y values

Page 15: Outliers and influential data points. No outliers?

Identifying outlying Y values

• Residuals

• Standardized residuals
  – also called internally studentized residuals

• Deleted residuals

• Deleted t residuals
  – also called studentized deleted residuals
  – also called externally studentized residuals

Page 16: Outliers and influential data points. No outliers?

Residuals

Ordinary residuals defined for each observation, i = 1, …, n:

$$ e_i = y_i - \hat{y}_i $$

Using matrix notation:

$$ e = y - \hat{y} = y - X(X'X)^{-1}X'y $$

$$ e = y - Hy = (I - H)y $$

Page 17: Outliers and influential data points. No outliers?

Variance of the residuals

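Residual vector:

$$ e = y - Hy = (I - H)y $$

Variance matrix:

$$ \mathrm{Var}(e) = \sigma^2 (I - H) $$

Variance of the ith residual:

$$ \mathrm{Var}(e_i) = \sigma^2 (1 - h_{ii}) $$

Estimated standard deviation of the ith residual:

$$ s(e_i) = \sqrt{MSE\,(1 - h_{ii})} $$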

Page 18: Outliers and influential data points. No outliers?

Standardized residuals

Standardized residuals defined for each observation, i = 1, …, n:

$$ e_i^* = \frac{e_i}{s(e_i)} = \frac{e_i}{\sqrt{MSE\,(1 - h_{ii})}} $$

Standardized residuals quantify how large the residuals are in standard deviation units.

Standardized residuals larger than 2 or smaller than -2 suggest that the y values are unusual.
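The definitions of the residuals, their estimated standard deviations, and the ±2 rule can be put together as follows. This is a minimal NumPy sketch, assuming the four (x, y) cases from the small deleted-t example that appears later in these slides; the function name `standardized_residuals` is hypothetical.

```python
import numpy as np

def standardized_residuals(X, y):
    """Return (e, h, r): ordinary residuals, leverages, standardized residuals."""
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X, X.T)  # hat matrix
    e = y - H @ y                          # e = (I - H) y
    h = np.diag(H)                         # leverages h_ii
    mse = e @ e / (n - p)                  # MSE = SSE / (n - p)
    r = e / np.sqrt(mse * (1.0 - h))       # e_i / s(e_i)
    return e, h, r

# Four cases from the small example used later in the slides
x = np.array([1.0, 2.0, 3.0, 10.0])
y = np.array([2.1, 3.8, 5.2, 2.1])
X = np.column_stack([np.ones_like(x), x])

e, h, r = standardized_residuals(X, y)
print(np.round(e, 2))    # ordinary residuals
print(np.round(r, 3))    # standardized residuals
print(np.abs(r) > 2)     # cases flagged by the +/-2 guideline
```

On these four cases no standardized residual exceeds 2 in absolute value, even though the case at x = 10 sits far from the other three, which is the situation the deleted residuals are designed to expose.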

Page 19: Outliers and influential data points. No outliers?

An outlying y value?

[Figure: scatterplot of y (0 to 70) versus x (0 to 14).]

Page 20: Outliers and influential data points. No outliers?

x        y        FITS1    HI1       s(e)     RESI1    SRES1
0.10000  -0.0716   3.4614  0.176297  4.27561  -3.5330  -0.82635
0.45401   4.1673   5.2446  0.157454  4.32424  -1.0774  -0.24916
1.09765   6.5703   8.4869  0.127014  4.40166  -1.9166  -0.43544
1.27936  13.8150   9.4022  0.119313  4.42103   4.4128   0.99818
2.20611  11.4501  14.0706  0.086145  4.50352  -2.6205  -0.58191
...
8.70156  46.5475  46.7904  0.140453  4.36765  -0.2429  -0.05561
9.16463  45.7762  49.1230  0.163492  4.30872  -3.3468  -0.77679
4.00000  40.0000  23.1070  0.050974  4.58936  16.8930   3.68110

S = 4.711

Unusual Observations

Obs  x     y      Fit    SE Fit  Residual  St Resid
21   4.00  40.00  23.11  1.06    16.89     3.68R

R denotes an observation with a large standardized residual

Page 21: Outliers and influential data points. No outliers?

Deleted residuals

If observed yi is extreme, it may “pull” the fitted equation towards itself, thereby yielding a small ordinary residual.

Delete the ith case, estimate the regression function using remaining n-1 cases, and use the x values to predict the response for the ith case.

Deleted residual:

$$ d_i = y_i - \hat{y}_{i(i)} $$

Page 22: Outliers and influential data points. No outliers?

Deleted t residuals

A deleted t residual is just a standardized deleted residual:

$$ t_i = \frac{d_i}{s(d_i)} = \frac{d_i}{\sqrt{MSE_{(i)} / (1 - h_{ii})}} $$

The deleted t residuals follow a t distribution with ((n-1)-p) degrees of freedom.

Page 23: Outliers and influential data points. No outliers?

[Figure: scatterplot of y (0 to 15) versus x (0 to 10) with two fitted lines.]

y = 0.6 + 1.55 x

y = 3.82 - 0.13 x

x   y    RESI1  TRES1
1   2.1  -1.59   -1.7431
2   3.8   0.24    0.1217
3   5.2   1.77    1.6361
10  2.1  -0.42  -19.7990
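The deleted residuals and deleted t residuals for these four cases can be recomputed by literally deleting each case and refitting. The sketch below is one way to do that with NumPy, assuming the four (x, y) pairs tabulated on this slide; the variable names are illustrative.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 10.0])
y = np.array([2.1, 3.8, 5.2, 2.1])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

# Leverages and ordinary residuals from the fit to all n cases
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)
e = y - H @ y

d = np.empty(n)  # deleted residuals d_i = y_i - yhat_i(i)
t = np.empty(n)  # deleted t residuals t_i = d_i / s(d_i)
for i in range(n):
    keep = np.arange(n) != i
    # Refit using the remaining n - 1 cases
    b_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    d[i] = y[i] - X[i] @ b_i
    resid_i = y[keep] - X[keep] @ b_i
    mse_i = resid_i @ resid_i / (n - 1 - p)       # MSE_(i)
    t[i] = d[i] / np.sqrt(mse_i / (1.0 - h[i]))   # s(d_i)^2 = MSE_(i) / (1 - h_ii)

print(np.round(e, 2))  # ordinary residuals (RESI1)
print(np.round(d, 2))  # deleted residuals
print(np.round(t, 4))  # deleted t residuals (TRES1)
```

The case at x = 10 has a small ordinary residual but a very large deleted t residual, because the line fitted to the other three cases predicts a response far from the observed 2.1. The refit loop is written for clarity; the identity d_i = e_i / (1 - h_ii) gives the same deleted residuals without refitting.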

Page 24: Outliers and influential data points. No outliers?

[Figure: scatterplot of y (0 to 70) versus x (0 to 14) with two fitted lines.]

y = 1.73 + 5.12 x

y = 2.96 + 5.04 x

Row  x        y        RESI1    SRES1     TRES1
1    0.10000  -0.0716  -3.5330  -0.82635  -0.81916
2    0.45401   4.1673  -1.0774  -0.24916  -0.24291
3    1.09765   6.5703  -1.9166  -0.43544  -0.42596
...
19   8.70156  46.5475  -0.2429  -0.05561  -0.05413
20   9.16463  45.7762  -3.3468  -0.77679  -0.76837
21   4.00000  40.0000  16.8930   3.68110   6.69012

Page 25: Outliers and influential data points. No outliers?

Identifying outlying X values

Page 26: Outliers and influential data points. No outliers?

Identifying outlying X values

• Use the diagonal elements, hii, of the hat matrix H to identify outlying X values.

• The hii are called leverages.

Page 27: Outliers and influential data points. No outliers?

Properties of the leverages (hii)

• The hii is a measure of the distance between the X values for the ith case and the means of the X values for all n cases.

• The hii is a number between 0 and 1, inclusive.

• The sum of the hii equals p, the number of parameters.
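These three properties are easy to check numerically. The sketch below assumes a simple linear regression on four illustrative x values (the ones from the small deleted-t example in these slides) and uses the standard closed form for simple linear regression, h_ii = 1/n + (x_i - x̄)² / Σ(x_j - x̄)², which makes the "distance from the mean" interpretation explicit.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 10.0])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

# Leverages are the diagonal elements of the hat matrix
h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))

# Closed form for simple linear regression: grows with squared distance from the mean of x
dev = x - x.mean()
h_closed = 1.0 / n + dev**2 / np.sum(dev**2)

print(np.round(h, 2))
print("sum of leverages:", round(float(h.sum()), 6), "p:", p)
print("all within [0, 1]:", bool(np.all((h >= 0) & (h <= 1))))
```

The case at x = 10 is the farthest from the sample mean of 4 and therefore has by far the largest leverage.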

Page 28: Outliers and influential data points. No outliers?

[Figure: dotplot for x, axis 0 to 9; sample mean = 4.751; annotated with h(1,1) = 0.176, h(11,11) = 0.048, h(20,20) = 0.163.]

HI1
0.176297  0.157454  0.127014  0.119313  0.086145  0.077744  0.065028
0.061276  0.048147  0.049628  0.049313  0.051829  0.055760  0.069311
0.072580  0.109616  0.127489  0.141136  0.140453  0.163492  0.050974

Sum of HI1 = 2.0000

Page 29: Outliers and influential data points. No outliers?

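Properties of the leverages (hii)

$$ \hat{y} = Hy = \begin{pmatrix} h_{11} & h_{12} & h_{13} & h_{14} \\ h_{21} & h_{22} & h_{23} & h_{24} \\ h_{31} & h_{32} & h_{33} & h_{34} \\ h_{41} & h_{42} & h_{43} & h_{44} \end{pmatrix} \begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{pmatrix} = \begin{pmatrix} h_{11}y_1 + h_{12}y_2 + h_{13}y_3 + h_{14}y_4 \\ h_{21}y_1 + h_{22}y_2 + h_{23}y_3 + h_{24}y_4 \\ h_{31}y_1 + h_{32}y_2 + h_{33}y_3 + h_{34}y_4 \\ h_{41}y_1 + h_{42}y_2 + h_{43}y_3 + h_{44}y_4 \end{pmatrix} $$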

If the ith case is outlying in terms of its X values, it has a large leverage value hii, and therefore exercises substantial leverage in determining the fitted value.

Page 30: Outliers and influential data points. No outliers?

Using leverages to identify outlying X values

Minitab flags any observation whose leverage value, hii, is more than 3 times the mean leverage value …

$$ \bar{h} = \frac{\sum_{i=1}^{n} h_{ii}}{n} = \frac{p}{n} $$

… or if it is greater than 0.99.
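The flagging rule described above can be written as a small helper. This is a sketch of the rule as stated on the slide, not of Minitab's actual implementation; the function name `flag_high_leverage` and its parameters are hypothetical. The leverage values are the 21 HI1 values listed earlier in these slides for a simple linear regression (p = 2).

```python
import numpy as np

def flag_high_leverage(h, p, n, multiple=3.0, absolute=0.99):
    """Flag leverages above multiple * (p / n) or above the absolute cutoff."""
    h = np.asarray(h, dtype=float)
    return (h > multiple * p / n) | (h > absolute)

# HI1 values listed in the slides (n = 21 cases, p = 2 parameters)
hi1 = np.array([
    0.176297, 0.157454, 0.127014, 0.119313, 0.086145, 0.077744, 0.065028,
    0.061276, 0.048147, 0.049628, 0.049313, 0.051829, 0.055760, 0.069311,
    0.072580, 0.109616, 0.127489, 0.141136, 0.140453, 0.163492, 0.050974,
])
n, p = hi1.size, 2

print("cutoff 3p/n =", round(3 * p / n, 3))
print("sum of leverages =", round(float(hi1.sum()), 4))
print("flagged cases:", np.flatnonzero(flag_high_leverage(hi1, p, n)) + 1)
```

With these values the cutoff is about 0.286 and no case is flagged, since the largest leverage is about 0.176.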

Page 31: Outliers and influential data points. No outliers?

[Figure: scatterplot of y (0 to 70) versus x (0 to 14).]

$$ 3\left(\frac{p}{n}\right) = 3\left(\frac{2}{21}\right) = 0.286 $$

x      y      HI1
14.00  68.00  0.357535

Unusual Observations

Obs  x     y      Fit     SE Fit  Residual  St Resid
21   14.0  68.00  71.449  1.620   -3.449    -1.59 X

X denotes an observation whose X value gives it large influence.

Page 32: Outliers and influential data points. No outliers?

[Figure: scatterplot of y (0 to 70) versus x (0 to 14).]

$$ 3\left(\frac{p}{n}\right) = 3\left(\frac{2}{21}\right) = 0.286 $$

x      y      HI2
13.00  15.00  0.311532

Unusual Observations

Obs  x     y      Fit    SE Fit  Residual  St Resid
21   13.0  15.00  51.66  5.83    -36.66    -4.23RX

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large influence.

Page 33: Outliers and influential data points. No outliers?

Identifying influential cases

Page 34: Outliers and influential data points. No outliers?

Influence

• A case is influential if its exclusion causes major changes in the estimated regression function.

Page 35: Outliers and influential data points. No outliers?

Identifying influential cases

• Difference in fits, DFITS

• Cook’s distance measure

Page 36: Outliers and influential data points. No outliers?

DFITS

The difference in fits …

$$ DFITS_i = \frac{\hat{y}_i - \hat{y}_{i(i)}}{\sqrt{MSE_{(i)}\, h_{ii}}} = t_i \sqrt{\frac{h_{ii}}{1 - h_{ii}}} $$

… represents the number of standard deviations that the fitted value increases or decreases when the ith case is included.

Page 37: Outliers and influential data points. No outliers?

DFITS

A case is influential if the absolute value of its DFITS value is …

… greater than 1 for small to medium data sets

… greater than $2\sqrt{p/n}$ for large data sets
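Both forms of DFITS, the definition by deletion and the shortcut through the deleted t residual, can be checked against each other. The sketch below assumes the four illustrative (x, y) cases from the small deleted-t example in these slides and applies the "greater than 1" guideline for small data sets; the shortcut for t_i uses the standard identity t_i = e_i √((n − p − 1) / (SSE(1 − h_ii) − e_i²)).

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 10.0])
y = np.array([2.1, 3.8, 5.2, 2.1])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)
y_hat = H @ y
e = y - y_hat

# Definition: refit without case i and compare the two fitted values for case i
dfits_refit = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    resid_i = y[keep] - X[keep] @ b_i
    mse_i = resid_i @ resid_i / (n - 1 - p)
    dfits_refit[i] = (y_hat[i] - X[i] @ b_i) / np.sqrt(mse_i * h[i])

# Shortcut: DFITS_i = t_i * sqrt(h_ii / (1 - h_ii)), no refitting needed
sse = e @ e
t = e * np.sqrt((n - p - 1) / (sse * (1.0 - h) - e**2))
dfits = t * np.sqrt(h / (1.0 - h))

influential = np.abs(dfits) > 1.0  # guideline for small to medium data sets
print(np.round(dfits, 3))
print(influential)
```

The two computations agree, and the case at x = 10 has a DFITS value far beyond the cutoff.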

Page 38: Outliers and influential data points. No outliers?

[Figure: scatterplot of y (0 to 70) versus x (0 to 14).]

$$ 2\sqrt{\frac{p}{n}} = 2\sqrt{\frac{2}{21}} = 0.62 $$

x      y      DFIT1
14.00  68.00  -1.23841

Page 39: Outliers and influential data points. No outliers?

[Figure: scatterplot of y (0 to 70) versus x (0 to 14).]

$$ 2\sqrt{\frac{p}{n}} = 2\sqrt{\frac{2}{21}} = 0.62 $$

x      y      DFIT2
13.00  15.00  -11.4670

Page 40: Outliers and influential data points. No outliers?

Cook’s distance

Cook’s distance measure …

$$ D_i = \frac{\sum_{j=1}^{n} \left( \hat{y}_j - \hat{y}_{j(i)} \right)^2}{p \cdot MSE} $$

… considers the influence of the ith case on all n fitted values.

Page 41: Outliers and influential data points. No outliers?

Cook’s distance

• Relate Di to the F(p, n-p) distribution.

• If Di is greater than the 50th percentile, F(0.50, p, n-p), then the ith case has lots of influence.
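Cook's distance and the 50th-percentile comparison can be sketched as follows, again assuming the four illustrative (x, y) cases from the small deleted-t example in these slides. To avoid a dependency beyond NumPy, the F percentile uses a closed form that holds only when the numerator degrees of freedom equal 2 (that is, p = 2); the helper name `f_median_two_params` is hypothetical, and for general p a statistics library would be needed instead.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 10.0])
y = np.array([2.1, 3.8, 5.2, 2.1])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)
y_hat = H @ y
e = y - y_hat
mse = e @ e / (n - p)

# Definition: change in all n fitted values when case i is deleted
cook_refit = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    shift = y_hat - X @ b_i
    cook_refit[i] = shift @ shift / (p * mse)

# Equivalent shortcut using only residuals and leverages
cook = e**2 * h / (p * mse * (1.0 - h) ** 2)

def f_median_two_params(n_cases):
    """50th percentile of F(2, n - 2); this closed form is valid only for p = 2."""
    nu = n_cases - 2
    return (nu / 2.0) * (2.0 ** (2.0 / nu) - 1.0)

cutoff = f_median_two_params(n)
print(np.round(cook, 3))
print("F(0.50, 2, n - 2) =", round(cutoff, 4))
print(cook > cutoff)
```

For n = 21 the helper reproduces the F(0.50, 2, 19) = 0.7191 cutoff quoted in the examples that follow.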

Page 42: Outliers and influential data points. No outliers?

[Figure: scatterplot of y (0 to 70) versus x (0 to 14).]

$$ F(0.50, 2, 19) = 0.7191 $$

x      y      COOK1
14.00  68.00  0.701960

Page 43: Outliers and influential data points. No outliers?

[Figure: scatterplot of y (0 to 70) versus x (0 to 14).]

$$ F(0.50, 2, 19) = 0.7191 $$

x      y      COOK2
13.00  15.00  4.04801