wbl statistik 2020 — robust fitting · 2020. 6. 5. · the outlier problem measuring robustness...

34
School of Engineering IDP Institute of Data Analysis and Process Design Zurich University of Applied Sciences WBL Statistik 2020 — Robust Fitting Half-Day 1: Introduction to Robust Estimation Techniques Andreas Ruckstuhl Institut f¨ ur Datenanalyse und Prozessdesign urcher Hochschule f¨ ur Angewandte Wissenschaften WBL Statistik 2020 — Robust Fitting

Upload: others

Post on 27-Mar-2021

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: WBL Statistik 2020 — Robust Fitting · 2020. 6. 5. · The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

School ofEngineering

IDP Institute ofData Analysis andProcess Design

Zurich University

of Applied Sciences

WBL Statistik 2020 — Robust FittingHalf-Day 1: Introduction to Robust Estimation Techniques

Andreas RuckstuhlInstitut fur Datenanalyse und Prozessdesign

Zurcher Hochschule fur Angewandte Wissenschaften

WBL Statistik 2020 — Robust Fitting

Page 2: WBL Statistik 2020 — Robust Fitting · 2020. 6. 5. · The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

Outline:Half-Day 1 • Regression Model and the Outlier Problem

• Measuring Robustness• Location M-Estimation• Inference• Regression M-Estimation• Example from Molecular Spectroscopy

Half-Day 2 • General Regression M-Estimation• Regression MM-Estimation• Example from Finance• Robust Inference• Robust Estimation with GLM

Half-Day 3 • Robust Estimation of the Covariance Matrix• Principal Component Analysis• Linear Discriminant Analysis• Baseline Removal: An application of robust fitting

beyond theoryWBL Statistik 2020 — Robust Fitting – HD1 2 / 34

Page 3: WBL Statistik 2020 — Robust Fitting · 2020. 6. 5. · The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

Your Lecturer

Name: Andreas RuckstuhlCivil Status: Married, 2 childrenEducation: Dr. sc. math. ETH

Position: Professor of Statistical Data Analysis, ZHAW WinterthurLecturer, WBL Applied Statistics, ETH Zurich

Expierence:1987 – 1991 Statistical Consulting and Teaching Assistant,

Seminar fur Statistik, ETHZ1991 – 1996 PhD, Teaching Assistant, Lecturer in NDK/WBL, ETHZ1996 Post-Doc, Texas A&M University, College Station, TX, USA1996 – 1999 Lecturer, ANU, Canberra, AustraliaSince 1999 Institute of Data Analysis and Process Design, ZHAW

WBL Statistik 2020 — Robust Fitting – HD1 3 / 34

Page 4: WBL Statistik 2020 — Robust Fitting · 2020. 6. 5. · The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

1.1 Regression Model and the Outlier ProblemExample Air Quality (I)False colour maps for NO2 pollution (with R. Locher)The actual air quality can be assessed insufficiently at a location far away from ameasurement station.More attractive would be a false colour map which shows daily means highly resolvedin time and location.

today/future until recently

see also https://home.zhaw.ch/∼lore/OLK/index.htmlWBL Statistik 2020 — Robust Fitting – HD1 4 / 34

Page 5: WBL Statistik 2020 — Robust Fitting · 2020. 6. 5. · The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

Example Air Quality (II)

Available are• Daily means at a few sites• False colour map of yearly means for NO2 based on a physical model

Idea:• Construct a false colour map which is highly resolved in time and space.

The idea is feasible because there is a nice empirical relationship between thedaily and the yearly means:

log 〈daily mean〉 ≈ βo + β1 log 〈yearly mean〉

βo and β1 are weather dependent

WBL Statistik 2020 — Robust Fitting – HD1 5 / 34

Page 6: WBL Statistik 2020 — Robust Fitting · 2020. 6. 5. · The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

Example Air Quality (III)

2.4 2.6 2.8

2.3

2.4

2.5

2.6

2.7

2.8

2.9

3.0

log(Annual Mean)

log(

Dai

ly M

ean)

LSrobust

2.4 2.6 2.8

2.4

2.5

2.6

2.7

2.8

2.9

log(Annual Mean)

log(

Dai

ly M

ean)

LSrobust

Left: Stable weather condition over the whole areaRight: The weather condition is different at a few measurement sites.

WBL Statistik 2020 — Robust Fitting – HD1 6 / 34

Page 7: WBL Statistik 2020 — Robust Fitting · 2020. 6. 5. · The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

The Regression ModelIn regression, the model is

Yi =m∑

j=0x (j)

i βj + Ei ,

where the random errors Ei , i = 1, . . . , n are independent and Gaussiandistributed with mean 0 and unknown spread σ.

In residual analysis, we check all assumptions as thoroughly as possible. Amongother things, we put a lot of effort in finding potential outliers and badleverage points which we would like to remove from the analysis because theylead to unreliable fits.

Why? - Because real data are never exactly Gaussian distributed.There are gross errors and other perturbations resulting in adistribution which is longer tailed than the Gaussian distribution.

WBL Statistik 2020 — Robust Fitting – HD1 7 / 34

Page 8: WBL Statistik 2020 — Robust Fitting · 2020. 6. 5. · The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

The Oldest Informal Approach to Robustness(which is still a common practise) is to1 examine the data for obvious outliers,2 delete these outliers3 apply the optimal inference procedure for the assumed model

to the cleaned data set.

However, this data analytic approach is not unproblematic since• Even professional statisticians do not always screen the data• It can be difficult or even impossible to identify outliers, particularly in

multivariate or highly structured data• cannot examine the relationships in the data without first fitting a model• It can be difficult to formalize this process so that it can be automatized• Inference based on applying a standard procedure to the cleaned data will be based

on distribution theory which ignores the cleaning process and hence will beinapplicable and possibly misleading

WBL Statistik 2020 — Robust Fitting – HD1 8 / 34

Page 9: WBL Statistik 2020 — Robust Fitting · 2020. 6. 5. · The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

1.2 Measuring Robustness

How robust an estimator is, can be investigated by two simple measures:• influence function and gross error sensitivity• breakdown point

Both measures are based on the idea of studying the reaction of the estimatorunder the influence of gross errors, i.e. arbitrary added data.

Gross Error SensitivityThe (gross error) sensitivity is based on the influence function and measures themaximum effect of a single observation on the estimated value.

Breakdown PointThe breakdown point gives the minimum proportion of data that can bealtered without causing completely unreliable estimates.

WBL Statistik 2020 — Robust Fitting – HD1 9 / 34

Page 10: WBL Statistik 2020 — Robust Fitting · 2020. 6. 5. · The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

Example: Prolongation of Sleep

(Cushny and Peebles, 1905)

0 1 2 3 4 5

Data on the prolongation ofsleep by means of two Drugs.These data were used by Stu-dent (1908) as the first illu-stration of the paired t-test.

y = 110

10∑i=1

yi = 1.58

med = median{yi , i = 1, . . . , 10} = 1.3

y10% = 18

9∑i=2

y(i) = 1.4

y∗ = 19

9∑i=1

y(i) = 1.23

For y∗ the rejection rule |y(n)−y|s > 2.18

is used, where the value 2.18 depends onthe sample size.

WBL Statistik 2020 — Robust Fitting – HD1 10 / 34

Page 11: WBL Statistik 2020 — Robust Fitting · 2020. 6. 5. · The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

Sensitivity Curve - Influence Function

Let β 〈y1, . . . , yn〉 be the estimator. The sensitivity curve

SC⟨

y ; y1, . . . , yn−1, β⟩

= β 〈y1, . . . , yn−1, y〉 − β 〈y1, . . . , yn−1〉1/n

describes the influence of an single observation y on the estimated value.

Note that SC 〈. . .〉 depends on the data y1, . . . , yn−1.

A similar measure, but one which characterises only the estimator, is theinfluence function

SC⟨

y ; y1, . . . , yn−1, β⟩

n↑∞−−−→ IF⟨

y ;N , β⟩

(The data y1, . . . , yn−1 have been replaced by the model distribution N .)

WBL Statistik 2020 — Robust Fitting – HD1 11 / 34

Page 12: WBL Statistik 2020 — Robust Fitting · 2020. 6. 5. · The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

Sensitivity Curve (Empirical Influence Function)Sensitivity curve for the prolongation of sleep data.The “outlier” y = 4.6 hours has been varied within −3 < y < 5.

−2 0 2 4

−4

−2

0

2

4

y

SC

(y)

meanmedian10% truncated meanwith rejection rule

WBL Statistik 2020 — Robust Fitting – HD1 12 / 34

Page 13: WBL Statistik 2020 — Robust Fitting · 2020. 6. 5. · The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

Definition of Robustness

A robust estimator has a bounded gross error sensitivity

γ∗⟨β, N

⟩:= max

x

∣∣∣IF ⟨x ; β,N⟩∣∣∣ <∞

Hence• Y is not robust and• med, Y10%, Y ∗ are robust

WBL Statistik 2020 — Robust Fitting – HD1 13 / 34

Page 14: WBL Statistik 2020 — Robust Fitting · 2020. 6. 5. · The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

Breakdown PointHow do the estimation methods respond to two outliers?We consider the worst case scenario of having two observations moving to infinity: y(10) = y(9) →∞

y →∞ + ε∗n 〈y〉 =

0n

= 0

med = 1.3 as before + ε∗n 〈med〉 =

410

= 0.4 n large−→ 0.5

y10% →∞ + ε∗n 〈y10%〉 =

110

= 0.1

y∗ →∞ + ε∗n

⟨y∗⟩

=1

10= 0.1

The breakdown point is the maximum ratio of outlying observation such thatthe estimator still returns reliable estimates.More formally, call Xm the set of all data sets y∗ = {y∗1 , . . . , y∗n } of size n having (n −m)elements in common with y = {y1, y2, . . . , yn}. Then the breakdown point ε∗n

⟨β; y⟩

is equalto m∗/n, where

m∗ = max{

m ≥ 0 :∣∣β⟨y∗⟩− β⟨y⟩ ∣∣ <∞ for all y∗ ∈ Xm

}.

WBL Statistik 2020 — Robust Fitting – HD1 14 / 34

Page 15: WBL Statistik 2020 — Robust Fitting · 2020. 6. 5. · The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

1.3 Location M-Estimation and InferenceLocation model Yi = β + Ei , i = 1, . . . , n , with Ei i.i.d. ∼ N ,

⟨0, σ2⟩

Let us find a weighted mean, wherein the weights are designed to prevent the influence ofoutliers on the estimator as much as possible:

βM =

∑ni=1 wi yi∑ni=1 wi

.

In order to determine appropriate weights (or weight functions), it helps to consult thecorresponding influence function IF . Theory tells that the influence function of βM is equal to

IF⟨

y , β,N⟩

= const · w · r = const · ψ⟨

r⟩

with ψ⟨

r⟩ def= w · r ,

where r is the scaled residual y−βσ

.

• Hence,the influence function of an M-estimator is proportional to the ψ-function.

• For a robust M-estimation the ψ-function must be bounded – see next page• The corresponding weight function can be determined by wi = ψ(r i )/r i

WBL Statistik 2020 — Robust Fitting – HD1 15 / 34

Page 16: WBL Statistik 2020 — Robust Fitting · 2020. 6. 5. · The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

Influence and Weight Functionψ- (top row9 and weight (bottom row) function for• ordinary least squares estimation (not robust) – on the left• a robust M-estimator (ψ-function is also known as Huber’s ψ-function) – on the right

r

ψ(a)

r

ψ(b)

−c

c

r

w(c)

1

r

w(d)

−2c −c c 2c

1

WBL Statistik 2020 — Robust Fitting – HD1 16 / 34

Page 17: WBL Statistik 2020 — Robust Fitting · 2020. 6. 5. · The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

• Note that the weights depend on the estimation βM and hence is only givenimplicitly.

• Usually, the M-Estimator is defined by an implicit equation,

n∑i=1

ψ

⟨ri⟨β,⟩

σ

⟩= 0 with ri 〈β〉 = yi − β ,

where σ is the scale parameter.Note: Basically, this equation corresponds to the normal equation of least-squaresestimators.

• A monotonously increasing, but bounded ψ-function leads to abreakdown point of ε∗ = 1

2 .

• The tuning constant c defining the kink in Huber’s ψ-function is determinedsuch that the relative efficiency of the M-estimator is 95% at the Gaussianlocation model: Hence, c = 1.345

WBL Statistik 2020 — Robust Fitting – HD1 17 / 34

Page 18: WBL Statistik 2020 — Robust Fitting · 2020. 6. 5. · The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

Estimation of the Scale Parameter• The scale parameter σ is usually unknown. But it is needed to determine the

position at which unsuitable observation will lose its influence.• The standard deviation as an estimator for σ however is extremely non-robust.

(Therefore the rejection rule breaks down with 2 outliers out of 10 obs., cf slide 14.)

• A robust estimator of scale with a breakdown point of 12 is, e.g., the median of

absolute deviation (MAD)

sMADdef=

mediani⟨∣∣yi −mediank 〈yk〉

∣∣⟩0.6745

(The correction factor 1/0.6745 is needed to obtain a consistent estimate for σ at theGaussian distribution.)

• There are more proposals of suitable scale estimators. Among them are versions where thescale parameter is estimated in each iteration of the location estimation as e.g. in Huber’sProposal 2.

• A better rejection rule is Huber-type skipped mean:∣∣∣ yi −mediank 〈yk〉sMAD

∣∣∣ > 3.5 .

WBL Statistik 2020 — Robust Fitting – HD1 18 / 34

Page 19: WBL Statistik 2020 — Robust Fitting · 2020. 6. 5. · The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

InferenceA point estimation without a confidence interval is an incomplete (and oftenuseless) information.Distribution of the M-estimator: Theory shows that

βa∼ N

⟨β,

1n τ σ

2⟩

with τ =∫ψ2 〈u〉 f 〈u〉 du(∫ψ′ 〈u〉 f 〈u〉 du

)2

The correction factor τ is needed to correct for downweighting goodobservations. It is larger than 1 and can be• either calculated using the assumption that f 〈〉 is the Gaussian density• or estimated using the empirical distribution of the residuals:

τ =1n∑n

i=1 ψ2⟨

r i

⟨β⟩⟩

(1n∑n

i=1 ψ′⟨

r i

⟨β⟩⟩)2 with r i

⟨β⟩

= Yi − βsMAD

WBL Statistik 2020 — Robust Fitting – HD1 19 / 34

Page 20: WBL Statistik 2020 — Robust Fitting · 2020. 6. 5. · The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

Confidence Interval

The confidence interval is calculated based on the z-test because of theasymptotic result.The confidence interval is formed by all values of β which are not rejected by the asymptoticz-test.

Hence, the (1− α) confidence interval is

β ± qN1−α/2√τ

sMAD√n,

where qN1−α/2 is the (1− α/2) quantile of the standard Gaussian distribution.

To adjust for estimating the scale parameter in a heuristic manner,N may be replaced by tn−1 in qN1−α/2.

WBL Statistik 2020 — Robust Fitting – HD1 20 / 34

Page 21: WBL Statistik 2020 — Robust Fitting · 2020. 6. 5. · The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

Example Prolongation of Sleep

> y # data> n <- length(y)

> library(robustbase)> (y.hM <- huberM(y, se=T, k=1.345))$mu # 1.371091$s # 0.59304$SE # 0.2302046

## Confidence interval> h1 <- qt(0.975, n-1) * y.hM$SE> y.hM$mu + c(-1,1)*h1## 0.8503317 1.8918499

## Classical estimation> t.test(y) # One Sample t-test95 percent confidence interval:0.7001142 2.4598858

• Using Huber’s M-estimator withc = 1.345 results in βM = 1.37Remember, Y = 1.58

• The robust scale estimator results insMAD = 0.59 and√τ is√

1.507 = 1.227

• The 95% confidence interval is

1.37±2.26·0.59/√

10·1.227 = [0.85, 1.89]

The classical one is [0.70, 2.46]

WBL Statistik 2020 — Robust Fitting – HD1 21 / 34

Page 22: WBL Statistik 2020 — Robust Fitting · 2020. 6. 5. · The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

2.1 Regression M-Estimation

The linear regression model is

Yi = β1 · 1 + β2 · x (2)i + . . .+ βp · x (p)

i + Ei with Ei i.i.d.

• If the errors Ei are exactly Gaussian distributed, then the least-squareestimator is optimal.

• If the errors Ei are only approximately Gaussian distributed, then theleast-square estimator is not approximately optimal but rather inefficientalready at slightly longer tailed distributions.

Real data are never exactly Gaussian distributed. There are gross errors and itsdistribution is often longer tailed than the Gaussian distribution.

WBL Statistik 2020 — Robust Fitting – HD1 22 / 34

Page 23: WBL Statistik 2020 — Robust Fitting · 2020. 6. 5. · The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

Example Modified Air Quality (IV)Recall the Example Air Quality:

Two outliers suffice to give unusable least squares lines

To keep the world simple (at least at the moment) we modify the data set suchthat it contains just one outlier:

2.4 2.5 2.6 2.7 2.8 2.9

2.2

2.4

2.6

2.8

log(Annual Mean)

log(

Dai

ly M

ean)

●●

● ●

LS

WBL Statistik 2020 — Robust Fitting – HD1 23 / 34

Page 24: WBL Statistik 2020 — Robust Fitting · 2020. 6. 5. · The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

From OLS to M-estimationAnalogously to the location model, we will replace the ordinary least squares (OLS)estimator by a weighted least squares(WLS) estimator, where the weights shouldreduce the influence of large residuals.

Normal equations of the WLS estimator: XT W r = 0 or equivalently,n∑

i=1wi · ri · x i = 0 ,

where ri⟨β⟩

:= yi −∑m

j=0 x (j)i βj are the residuals.

Replacing again (wi · ri ) by ψ 〈ri/σ〉 yields the formal definition of a regressionM-estimator βM:

n∑i=1

ψ

⟨ri⟨βM

⟩σ

⟩x i = 0 .

WBL Statistik 2020 — Robust Fitting – HD1 24 / 34

Page 25: WBL Statistik 2020 — Robust Fitting · 2020. 6. 5. · The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

The influence function of a regression M-estimatorIn order to determine appropriate weights (or weight functions), it again helps toconsult the corresponding influence function IF of the regression M-estimator:

IF⟨

x , y ; βM,N⟩

= ψ⟨ rσ

⟩M x .

M is a matrix, which only depends on the design matrix X .

Hence, the ψ function again reflects the influence of a residual (and thereforeof an observation) onto the estimator.

To obtain a robust regression M-estimator theψ-function must be boundedas, e.g., Huber’s ψ-function

Figure on the right: ψ- and weight function for• ordinary least squares estimation (not robust) –

on the left• Huber’s M-estimator – on the right

r

ψ(a)

r

ψ(b)

−c

c

r

w(c)

1

r

w(d)

−2c −c c 2c

1

WBL Statistik 2020 — Robust Fitting – HD1 25 / 34

Page 26: WBL Statistik 2020 — Robust Fitting · 2020. 6. 5. · The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

Example Modified Air Quality (V)>> f1 <- lm(...) # LS> coef(f1)Intercept Slope1.440561 0.471861

> ## library("MASS")> f2 <- rlm(...) # robust> coef(f2)Intercept Slope0.9369407 0.6814527

2.4 2.5 2.6 2.7 2.8 2.9

2.2

2.4

2.6

2.8

log(Annual Mean)

log(

Dai

ly M

ean)

●●

● ●

LSrobust

WBL Statistik 2020 — Robust Fitting – HD1 26 / 34

Page 27: WBL Statistik 2020 — Robust Fitting · 2020. 6. 5. · The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

What should be done with outliers?

Diagnose TreatmentTranscription error CorrectDifferent population (e.g. diseased, pre-gnant, . . . )

Remove observation/population

Implausible or dubious value Find explanation, . . .they can be the gateway for new insights

Prerequisite: Detect outliers!Robust methods are called for, particularly in high dimensional data, becauserobust methods• are not deceived by outliers,• help us detect outliers more easily and faster.• may help to automatize a statistical analysis in a safe way.

WBL Statistik 2020 — Robust Fitting – HD1 27 / 34

Page 28: WBL Statistik 2020 — Robust Fitting · 2020. 6. 5. · The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

2.2 Example From Molecular Spectroscopy

Model for the spectrum data:

Yi = θui − θ`i + Ei

which can be also written as

Y = Xθ + E ,

where

Y =

Y1Y2. . .Yn

, θ =

θAθB. . .. . .θaθb. . .

, and X =

0 1 0 . . . | 0 0 −1 0 . . .1 0 0 . . . | 0 −1 0 0 . . .

... |...

... |...

... |...

WBL Statistik 2020 — Robust Fitting – HD1 28 / 34

Page 29: WBL Statistik 2020 — Robust Fitting · 2020. 6. 5. · The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

Example Tritium Molecule:Histograms of the residuals

WBL Statistik 2020 — Robust Fitting – HD1 29 / 34

Page 30: WBL Statistik 2020 — Robust Fitting · 2020. 6. 5. · The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

Example Tritium Molecule:Estimated coefficients compared with theoretical ones (involves heavycomputation and many approximations)

WBL Statistik 2020 — Robust Fitting – HD1 30 / 34

Page 31: WBL Statistik 2020 — Robust Fitting · 2020. 6. 5. · The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

Appendix: Computational AspectsA regression M-estimator is defined implicitly (cf. SLIDE 24) by

n∑i=1

ψ⟨

ri⟨βM⟩/

σ⟩· x (k)

i = 0 , k = 1,2, . . . , p. (1)

or with weights

widef=

ψ⟨

ri⟨βM⟩/

σ⟩

ri⟨βM⟩/

σ(2)

the system of quations in (1) can be written as

n∑i=1

wi · ri⟨βM⟩· x (k)

i = 0 , k = 1,2, . . . , p. (3)

(These are the normal equations of a weighted regressionbut the weights depend on the solution βM!)

WBL Statistik 2020 — Robust Fitting – HD1 31 / 34

Page 32: WBL Statistik 2020 — Robust Fitting · 2020. 6. 5. · The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

Iterated Re-Weighted Least Squares1 Let β(m=0) be an initial solution for (3)

2 Calculate r (m)i = yi −

∑pj=1 x (j)

i · β(m)j ,

σ(m) = median{|r (m)i |}/0.6745 (MAV)

and w (m)i according to (2)

3 Solve weighted normal equations (3)4 Repeat steps (ii) and (iii) until convergence.

Algorithm Based on Modified ObservationsReplace step (3) above by solving the normal equation from the least-squares problem

n∑i=1

(y (m)

i −p∑

j=1

x (j)i · βj

)2!= min

β

where y (m)i

def=p∑

j=1

x (j)i · β

(m)j + w (m)

i · r (m)i are pseudo-observations.

WBL Statistik 2020 — Robust Fitting – HD1 32 / 34

Page 33: WBL Statistik 2020 — Robust Fitting · 2020. 6. 5. · The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

Illustration of the ”Modified Observations” Algorithm

20 40 60 80

15

20

25

30

35

40

45

Water Flow at Libby

Wat

er F

low

at N

ewga

te

(a)

P1

20 40 60 80

15

20

25

30

35

40

45

Water Flow at Libby

Wat

er F

low

at N

ewga

te

(b)

P1

20 40 60 80

15

20

25

30

35

40

45

Water Flow at Libby

Wat

er F

low

at N

ewga

te

(c)

20 40 60 80

15

20

25

30

35

40

45

Water Flow at Libby

Wat

er F

low

at N

ewga

te

(d)

P1

WBL Statistik 2020 — Robust Fitting – HD1 33 / 34

Page 34: WBL Statistik 2020 — Robust Fitting · 2020. 6. 5. · The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

The Outlier Problem Measuring Robustness Location M-Estimation Regression M-Estimation Example From Molecular Spectroscopy

Take Home Message Half-Day 1

• The least-squares estimator is the optimal estimator when the data isnormally distributed

• However, the least-squares estimator is unreliable if contaminatedobservations are presentsee, e.g., the example from molecular spectroscopy

• There are better (=efficient, intuitively “correct”) estimators.With M-Estimators, the influence of gross errors can be controlled verysmoothly.

Generally, influence function and breakdown point are two very importantmeasure for assessing the robustness properties of an estimator.

Be aware: Outliers are only define with respect to a model

WBL Statistik 2020 — Robust Fitting – HD1 34 / 34