
Introduction to Super Learning

Ted Westling, PhD, Postdoctoral Researcher
Center for Causal Inference, Perelman School of Medicine
University of Pennsylvania

September 25, 2018

Learning Goals

• Conceptual understanding of Super Learning (SL)

• Comfort with the SuperLearner R package

• Awareness of the mathematical backbone of SL


Outline

I. Motivation and description of SL (30 minutes)

II. Lab 1: Vanilla SL for a continuous outcome (30 minutes)

III. Mathematical presentation of SL (20 minutes)

IV. Lab 2: Vanilla SL for a binary outcome (30 minutes)

15 minute break

V. Bells and whistles: Screens, weights, and CV-SL (30 minutes)

VI. Lab 3: Binary outcome redux (40 minutes)

VII. Lab 4: Case-control analysis of Fluzone vaccine (30 minutes)

I. Motivation and description of Super Learning


Notation

• Y is a univariate outcome

• X is a p-variate set of predictors

• We observe n independent copies

  $(Y_1, X_1), \ldots, (Y_n, X_n)$

  from the joint distribution of (Y, X).


The problem

• We want to estimate a function, e.g.:

– Conditional mean (regression) function

– Conditional quantile function

– Conditional density function

– Conditional hazard function

• Super Learning can be applied in all of the above settings

• We will focus on estimating the regression function

µ(x) := E[Y | X = x].


Why?

1. Exploratory analysis

2. Imputation of missing values

3. Prediction for new observations

4. Assessing prediction quality/comparing competing estimators

5. Use as a nuisance parameter estimator

6. Confirmatory analysis/hypothesis testing
   (not our goal here)


We want to estimate µ(x) = E[Y | X = x].

How should we do it?

GAM
Random Forest
Neural network
GLM

How do we choose which algorithm to use?


Super Learning is:

An ensemble method for combining predictions from many candidate machine learning algorithms.

Measuring algorithm performance

• Suppose µ̂1, . . . , µ̂K are candidate estimators of µ.

• k will always index estimators, and i will always index observations (e.g. study participants).

• The mean squared error of µ̂k,

  $$\mathrm{MSE}(\hat\mu_k) = E\left[(Y - \hat\mu_k(X))^2\right],$$

  measures the performance of µ̂k as an estimator of µ.

• If we knew MSE(µ̂k), we could choose the µ̂k with the smallest MSE(µ̂k).


Estimating MSE

$$\mathrm{MSE}(\hat\mu_k) = E\left[(Y - \hat\mu_k(X))^2\right]$$

• It is tempting to take $\widehat{\mathrm{MSE}}(\hat\mu_k) = \frac{1}{n}\sum_{i=1}^{n} [Y_i - \hat\mu_k(X_i)]^2$.

• This estimator will favor µ̂k which are overfit, because the µ̂k are trained on the same data used to evaluate the MSE.

• Analogy: a student has the exam questions before taking the exam!

• Instead, we estimate MSE using cross-validation.
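To make the overfitting point concrete, here is a small base R illustration (not from the slides): a very flexible fit drives the in-sample MSE toward zero while doing worse on fresh data.

```r
# Small base R illustration (not from the slides): the naive in-sample MSE
# rewards overfitting, while the error on fresh data tells a different story.
set.seed(1)
train <- data.frame(x = runif(100)); train$y <- sin(2 * pi * train$x) + rnorm(100, sd = 0.3)
fresh <- data.frame(x = runif(100)); fresh$y <- sin(2 * pi * fresh$x) + rnorm(100, sd = 0.3)

fit_smooth <- lm(y ~ poly(x, 3),  data = train)  # modest fit
fit_wiggly <- lm(y ~ poly(x, 25), data = train)  # very flexible fit

mean(residuals(fit_smooth)^2)                             # in-sample MSE
mean(residuals(fit_wiggly)^2)                             # smaller: the flexible fit looks "better"
mean((fresh$y - predict(fit_smooth, newdata = fresh))^2)  # MSE on fresh data
mean((fresh$y - predict(fit_wiggly, newdata = fresh))^2)  # typically larger: overfitting exposed
```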


Cross-validation

1. Split the data into V "folds" of size roughly n/V.

2. For each fold v = 1, . . . , V:

   • the data in folds other than v is called the training set;

   • the data in fold v is called the test/validation set.


[Figure: schematic of 10-fold cross-validation. Gray: training sets. Yellow: validation sets.]

Cross-validation

1. Split the data into V "folds" of size roughly n/V.

2. For each fold v = 1, . . . , V:

   • the data in folds other than v is called the training set;

   • the data in fold v is called the test/validation set;

   • we obtain $\hat\mu_{k,v}$ using the training set;

   • we obtain $\hat\mu_{k,v}(X_i)$ for each $X_i$ in the validation set $V_v$.

3. Our cross-validated MSE is

   $$\widehat{\mathrm{MSE}}_{CV}(\hat\mu_k) = \frac{1}{V} \sum_{v=1}^{V} \frac{1}{|V_v|} \sum_{i \in V_v} \left[Y_i - \hat\mu_{k,v}(X_i)\right]^2.$$

   We average the MSEs over the V validation sets.
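A minimal base R sketch of this recipe for a single candidate algorithm (illustrative only, not the SuperLearner package's internal code); the linear-model candidate in the usage comment is a hypothetical placeholder.

```r
# Minimal base R sketch (illustrative, not the SuperLearner package's code):
# V-fold cross-validated MSE for one candidate algorithm.
cv_mse <- function(dat, outcome, fit_fun, predict_fun, V = 10) {
  folds <- sample(rep(seq_len(V), length.out = nrow(dat)))  # random fold assignment
  fold_mse <- numeric(V)
  for (v in seq_len(V)) {
    train <- dat[folds != v, , drop = FALSE]  # folds other than v: training set
    valid <- dat[folds == v, , drop = FALSE]  # fold v: validation set
    fit   <- fit_fun(train)                   # mu-hat_{k,v}, fit on the training set
    preds <- predict_fun(fit, valid)          # mu-hat_{k,v}(X_i) for i in the validation set
    fold_mse[v] <- mean((valid[[outcome]] - preds)^2)
  }
  mean(fold_mse)                              # average over the V validation sets
}

# Hypothetical usage with a linear-model candidate on a data frame `dat` with outcome `y`:
# cv_mse(dat, "y",
#        fit_fun     = function(d) lm(y ~ ., data = d),
#        predict_fun = function(f, d) predict(f, newdata = d))
```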


[Figure: schematic of 10-fold cross-validation, with the cross-validated predictions ("CV preds.") collected from the validation sets. Gray: training sets. Yellow: validation sets.]

How do we choose V?

• Large V:

  – more training data, so better for small n

  – more computation time

  – well-suited to high-dimensional covariates

  – well-suited to complicated or non-smooth µ

• Small V:

  – more test data

  – less computation time

(People typically use V = 5 or V = 10.)


“Discrete” Super Learner

• At this point, we have cross-validated MSE estimates

  $\widehat{\mathrm{MSE}}_{CV}(\hat\mu_1), \ldots, \widehat{\mathrm{MSE}}_{CV}(\hat\mu_K)$

  for each of our candidate algorithms.

• We could simply take as our estimator the µ̂k minimizing these cross-validated MSEs.

• We call this the “discrete Super Learner”.
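A minimal sketch of the discrete Super Learner built on the cv_mse helper sketched earlier; the two candidates and the data frame `dat` are hypothetical placeholders.

```r
# Minimal sketch of the discrete Super Learner: compute each candidate's
# cross-validated MSE and keep the single best one (cv_mse is the helper
# sketched above; the candidates and `dat` are placeholders).
candidates <- list(
  lm_linear    = list(fit  = function(d) lm(y ~ .,           data = d),
                      pred = function(f, d) predict(f, newdata = d)),
  lm_quadratic = list(fit  = function(d) lm(y ~ poly(x1, 2), data = d),
                      pred = function(f, d) predict(f, newdata = d))
)

cv_errors <- vapply(candidates,
                    function(alg) cv_mse(dat, "y", alg$fit, alg$pred, V = 10),
                    numeric(1))

best <- candidates[[which.min(cv_errors)]]  # candidate with the smallest CV MSE
discrete_sl_fit <- best$fit(dat)            # refit the winning algorithm on all the data
```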


Super Learner

• Let $\lambda = (\lambda_1, \ldots, \lambda_K)$ be an element of $S_K$, the K-dimensional simplex: each $\lambda_k \in [0, 1]$ and $\sum_k \lambda_k = 1$.

• Super Learner considers as its set of candidate algorithms all convex combinations $\hat\mu_\lambda := \sum_{k=1}^{K} \lambda_k \hat\mu_k$.

• The Super Learner is $\hat\mu_{\hat\lambda}$, where

  $$\hat\lambda := \arg\min_{\lambda \in S_K} \widehat{\mathrm{MSE}}_{CV}\left(\sum_{k=1}^{K} \lambda_k \hat\mu_k\right).$$

  (We use constrained optimization to compute the argmin.)


Super Learner

$$\hat\lambda := \arg\min_{\lambda \in S_K} \widehat{\mathrm{MSE}}_{CV}\left(\sum_{k=1}^{K} \lambda_k \hat\mu_k\right),$$

where

$$\widehat{\mathrm{MSE}}_{CV}\left(\sum_{k=1}^{K} \lambda_k \hat\mu_k\right) = \frac{1}{V} \sum_{v=1}^{V} \frac{1}{|V_v|} \sum_{i \in V_v} \left[Y_i - \sum_{k=1}^{K} \lambda_k \hat\mu_{k,v}(X_i)\right]^2.$$
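The SuperLearner package performs this constrained minimization internally. Purely as an illustration of the idea, here is a base R sketch that searches over the simplex via a softmax reparameterization; Z and y are assumed to have been assembled from the cross-validated predictions as above.

```r
# Illustrative base R sketch (the SuperLearner package does this step internally
# with its own optimization routine): minimize the cross-validated MSE over the
# simplex S_K using a softmax reparameterization so optim() can be applied.
# Z: n x K matrix with Z[i, k] = mu-hat_{k, v(i)}(X_i); y: outcome vector.
estimate_sl_weights <- function(Z, y) {
  K <- ncol(Z)
  cv_mse_of <- function(b) {
    lambda <- exp(b) / sum(exp(b))   # softmax keeps lambda in the simplex
    mean((y - Z %*% lambda)^2)
  }
  opt <- optim(par = rep(0, K), fn = cv_mse_of, method = "BFGS")
  exp(opt$par) / sum(exp(opt$par))   # lambda-hat: nonnegative weights summing to one
}
```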


Super Learner: steps

Putting it all together:

1. Define a library of candidate algorithms $\hat\mu_1, \ldots, \hat\mu_K$.

2. Obtain the CV-predictions $\hat\mu_{k,v}(X_i)$ for all $k$, $v$ and $i \in V_v$.

3. Use constrained optimization to compute the SL weights $\hat\lambda := \arg\min_{\lambda \in S_K} \widehat{\mathrm{MSE}}_{CV}\left(\sum_{k=1}^{K} \lambda_k \hat\mu_k\right)$.

4. Take $\hat\mu_{SL} = \sum_{k=1}^{K} \hat\lambda_k \hat\mu_k$.
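These four steps are what the SuperLearner R package automates. Below is a minimal sketch of a call for a continuous outcome, assuming the package and the wrapped algorithms' packages (gam, randomForest) are installed; the data frames `dat` and `new_dat`, the outcome `y`, and the predictor columns are hypothetical.

```r
# Minimal sketch of the four SL steps with the SuperLearner R package
# (assumptions: package and candidate-algorithm packages installed;
# `dat` and `new_dat` are hypothetical data frames).
library(SuperLearner)

sl_fit <- SuperLearner(
  Y          = dat$y,                                     # continuous outcome
  X          = dat[, c("x1", "x2", "x3")],                # predictors
  family     = gaussian(),                                # squared-error loss / MSE
  SL.library = c("SL.glm", "SL.gam", "SL.randomForest"),  # candidate algorithms
  cvControl  = list(V = 10)                               # 10-fold cross-validation
)

sl_fit$cvRisk                            # cross-validated MSE of each candidate
sl_fit$coef                              # estimated weights lambda-hat
predict(sl_fit, newdata = new_dat)$pred  # Super Learner predictions on new data
```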


II. Lab 1: Vanilla SL for a continuous outcome

III. Into the weeds: a mathematical presentation of SL

Review

Recall the construction of SL for a continuous outcome:

1. Define a library of candidate algorithms $\hat\mu_1, \ldots, \hat\mu_K$.

2. Obtain the CV-predictions $\hat\mu_{k,v}(X_i)$ for all $k$, $v$ and $i \in V_v$.

3. Use constrained optimization to compute the SL weights $\hat\lambda := \arg\min_{\lambda \in S_K} \widehat{\mathrm{MSE}}_{CV}\left(\sum_{k=1}^{K} \lambda_k \hat\mu_k\right)$.

4. Take $\hat\mu_{SL} = \sum_{k=1}^{K} \hat\lambda_k \hat\mu_k$.


In this section, we generalize this procedure to estimation of any summary of the observed data distribution, given an appropriate loss for the summary of interest.

Loss and risk: setup

• Denote by O the observed data unit – e.g. O = (Y, X).

• Denote by $\mathcal{O}$ the sample space of O.

• Let $\mathcal{M}$ denote our statistical model.

• Denote by $P_0 \in \mathcal{M}$ the true distribution of O.

• Thus, we observe i.i.d. copies $O_1, \ldots, O_n \sim P_0$.

• Suppose we want to estimate a parameter $\theta : \mathcal{M} \to \Theta$.

• Denote by $\theta_0 := \theta(P_0)$ the true parameter value.


Loss and risk

• Let L be a map from $\mathcal{O} \times \Theta$ to $\mathbb{R}$.

• We call L a loss function for θ if it holds that

  $$\theta_0 = \arg\min_{\theta \in \Theta} E_{P_0}[L(O, \theta)].$$

• $R_0(\theta) = E_{P_0}[L(O, \theta)]$ is called the oracle risk.

• These definitions of loss and risk come from the statistical learning literature (see, e.g., Vapnik, 1992, 1999, 2013) and are not to be confused with loss and risk from the decision theory literature (e.g., Ferguson, 2014).


Loss and risk: MSE example

MSE is the oracle risk corresponding to the squared-error loss function:

• O = (Y, X).

• $\theta(P) = \mu(P) = \{x \mapsto E_P[Y \mid X = x]\}$.

• $L(O, \mu) = [Y - \mu(X)]^2$ is the squared-error loss.

• $R_0(\mu) = \mathrm{MSE}(\mu) = E_{P_0}[Y - \mu(X)]^2$.


Estimating the oracle risk

$$\theta_0 = \arg\min_{\theta \in \Theta} R_0(\theta), \qquad R_0(\theta) = E_{P_0}[L(O, \theta)]$$

• Suppose that $\hat\theta_1, \ldots, \hat\theta_K$ are candidate estimators.

• As before, we need to estimate $R_0(\theta)$ to evaluate each $\hat\theta_k$.

• The naive estimator is $\hat{R}(\hat\theta_k) = \frac{1}{n}\sum_{i=1}^{n} L(O_i, \hat\theta_k)$.

• We instead estimate $R_0(\theta)$ using the cross-validated risk

  $$\hat{R}_{CV}(\hat\theta_k) = \frac{1}{V} \sum_{v=1}^{V} \frac{1}{|V_v|} \sum_{i \in V_v} L(O_i, \hat\theta_{k,v}).$$
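The structure is the same as the cross-validated MSE from before, with the loss left as an argument. A minimal base R sketch (illustrative only):

```r
# Minimal base R sketch of the generalized recipe: cross-validated risk of one
# candidate estimator under an arbitrary loss L(O, theta). Mirrors the earlier
# cv_mse sketch, with the loss supplied as a function of (validation data, fitted values).
cv_risk <- function(dat, fit_fun, predict_fun, loss_fun, V = 10) {
  folds <- sample(rep(seq_len(V), length.out = nrow(dat)))
  fold_risk <- numeric(V)
  for (v in seq_len(V)) {
    train <- dat[folds != v, , drop = FALSE]
    valid <- dat[folds == v, , drop = FALSE]
    fit   <- fit_fun(train)                                   # theta-hat_{k,v}
    fold_risk[v] <- mean(loss_fun(valid, predict_fun(fit, valid)))
  }
  mean(fold_risk)                                             # average over the V validation sets
}
```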


Super Learner: general steps

Using this framework, we can generalize the SL recipe:

1. Define a library of candidate algorithms $\hat\theta_1, \ldots, \hat\theta_K$.

2. Obtain the CV-risks $\hat{R}_{CV}(\hat\theta_k)$, $k = 1, \ldots, K$.

3. Use constrained optimization to compute the SL weights $\hat\lambda := \arg\min_{\lambda \in S_K} \hat{R}_{CV}\left(\sum_{k=1}^{K} \lambda_k \hat\theta_k\right)$.

4. Take $\hat\theta_{SL} = \sum_{k=1}^{K} \hat\lambda_k \hat\theta_k$.


Theoretical guarantees

van der Vaart et al. (2006) showed that, under some conditions, the oracle risk of the SL estimator is as good as the oracle risk of the oracle minimizer up to a multiple of (log n)/n, as long as the number of candidate algorithms is polynomial in n.

Loss functions for a binary outcome

We return to O = (Y, X), θ = µ.

• For continuous Y, we used squared-error loss.

• For binary Y, squared-error loss is still valid.

• However, there are (at least) two other alternative loss functions for a binary outcome:

  – Negative log-likelihood loss:
    $L(O, \mu) = -Y \log \mu(X) - [1 - Y] \log[1 - \mu(X)]$.

  – AUC loss.
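As a small illustration (not from the labs), the two losses can be written as R functions and plugged into the loss-generic cv_risk sketched earlier; the outcome column name `y` is a placeholder.

```r
# Small illustration (not from the labs): binary-outcome losses as R functions,
# compatible with the loss-generic cv_risk sketch above. `y` is a placeholder
# outcome column; mu_hat holds predicted probabilities.
squared_error_loss <- function(dat, mu_hat) (dat$y - mu_hat)^2

neg_loglik_loss <- function(dat, mu_hat) {
  p <- pmin(pmax(mu_hat, 1e-6), 1 - 1e-6)  # clip so the logs stay finite (implementation choice)
  -dat$y * log(p) - (1 - dat$y) * log(1 - p)
}
```

In the SuperLearner package, a binary outcome is typically indicated with family = binomial(), and the method argument chooses how the candidates are combined.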


IV. Lab 2: Vanilla SL for a binary outcome

15 minute break


V. Bells and whistles: Screens, weights, and CV-SL

Overview

In this section, we will introduce three of the add-ons to SL that are frequently useful in practice: variable screens, observation weights, and cross-validated SL.

Variable screens

• We think of a candidate algorithm as a two-step procedure:

1. Select a subset of the covariates.

2. Use the selected subset to fit a model.

• We call step 1 a screening procedure.

• While we could program steps 1 and 2 by hand in to each

candidate algorithm, the SuperLearner package has

built-in functionality to ease this process.

Screening algorithms allow us to guide the SL using our

domain knowledge.

32 / 48
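A minimal sketch of screen/learner pairs, assuming an outcome y and predictor data frame x: SL.library can be given as a list whose elements pair a prediction algorithm with a screening algorithm. The built-in screen "All" uses every covariate, and each pairing becomes its own candidate in the library.

    library(SuperLearner)

    sl_fit <- SuperLearner(
      Y = y, X = x, family = binomial(),
      SL.library = list(
        c("SL.glm",    "All"),           # GLM on all covariates
        c("SL.glm",    "screen.corP"),   # GLM on covariates passing a built-in
                                         # marginal correlation p-value screen
        c("SL.glmnet", "All")            # lasso on all covariates
      ))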


Example use-cases of screening

• If we have a high-dimensional set of covariates, we can try

different ways of reducing the dimensionality.

• If we have a large number of “raw” measurements, we

might try providing a smaller number of summary

measures – e.g. mean, median, min, max.

• If we have measurements collected at multiple time

points, we might try providing just baseline, or just the last

time point, or some summaries of the trajectory.

• We can force certain variables to always be used.

33 / 48
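For example, a user-written screen can restrict a candidate to a fixed subset of columns, forcing those variables to be used and dropping the rest. The column names below are hypothetical; a screen wrapper takes (Y, X, family, obsWeights, id, ...) and returns a logical vector of length ncol(X).

    screen.baselineOnly <- function(Y, X, family, obsWeights, id, ...) {
      keep <- c("age", "EVERVAX")   # hypothetical forced-in columns
      colnames(X) %in% keep         # TRUE = keep this covariate
    }

    # Used like a built-in screen:
    # SL.library = list(c("SL.glm", "screen.baselineOnly"), c("SL.glm", "All"))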


Observation weights

• In some applications, we need to include observation

weights in the procedure – e.g. case-control sampling,

or as a simple way to account for loss to follow-up.

• Observation weights can be included directly in a call to

SuperLearner, but method.AUC does not make correct

use of weights!!!!

• Note that some SuperLearner wrappers might not make

use of observation weights.

34 / 48
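A minimal sketch, assuming an outcome y, predictors x, and a numeric weight vector w of the same length as y:

    library(SuperLearner)

    sl_wt <- SuperLearner(Y = y, X = x, family = binomial(),
                          SL.library = c("SL.glm", "SL.mean"),
                          method = "method.NNloglik",   # a method that uses the weights
                          obsWeights = w)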


Case-control weights

• Let Y represent disease status at the end of a study.

• Suppose specimens from all ncase cases (Yi = 1) are

assayed.

• A random subset of Ncontrol controls (Yi = 0), out of

ncontrol total controls, is assayed.

• We will use this case-control cohort to predict disease

status using the results of the assay and other covariates.

35 / 48


Case-control weights

• We can use SL with observation weights.

• Cases have weight wi = 1.

• Controls have weight wi = ncontrol/Ncontrol .

• Control weights could also be estimated using a logistic

regression of the indicator of inclusion in the control cohort

on baseline covariates.

36 / 48
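A minimal sketch of computing these weights, with hypothetical counts; y is the assumed 0/1 disease indicator for the assayed case-control cohort:

    n_control_total   <- 1000   # hypothetical total controls in the cohort
    n_control_sampled <- 52     # hypothetical controls actually assayed

    w <- ifelse(y == 1, 1, n_control_total / n_control_sampled)
    # pass w as obsWeights in the SuperLearner() call, as in the earlier sketch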


Right-censored outcomes

• Suppose Y = I(T ≤ t0) indicates that disease occurs

before time t0.

• T is subject to right-censoring by C: we observe

Y = min{T ,C} and ∆ = I(T ≤ C).

• We want to estimate

µ(x) = P(T ≤ t0 | X = x) = E [Y | X = x].

37 / 48


Right-censored outcomes

µ0 = argmin_µ E_{P0}{ [∆ / G0(Y | X)] · L((Y, X), µ) }

• Here, G0(t | x) = P0(C > t | X = x).

• L is either squared-error or negative log-likelihood loss.

• If we knew G0, we could use SL with weight ∆ / G0(Y | X).

• Instead, we estimate G0 and plug in this estimator to

obtain an estimated weight.

• If C ⊥⊥ T , we can use a Kaplan-Meier estimator for G0;

otherwise we might use a Cox model.

38 / 48
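A minimal sketch of this IPCW idea in the independent-censoring case, with an assumed follow-up time vector time, event indicator delta, covariate data frame x, and landmark t0. The Kaplan-Meier fit for the censoring distribution and the small truncation of the estimated censoring probabilities are illustrative choices, not part of the slides.

    library(survival)
    library(SuperLearner)

    # Kaplan-Meier estimate of the censoring survival function G0(t) = P(C > t)
    km_cens <- survfit(Surv(time, 1 - delta) ~ 1)
    G_hat   <- stepfun(km_cens$time, c(1, km_cens$surv))

    y_bin  <- as.numeric(time <= t0 & delta == 1)   # I(T <= t0) among observed events
    w_ipcw <- delta / pmax(G_hat(time), 0.05)       # Delta / G_hat, truncated to avoid
                                                    # extreme weights (an assumption)

    sl_cens <- SuperLearner(Y = y_bin, X = x, family = binomial(),
                            SL.library = c("SL.glm", "SL.mean"),
                            method = "method.NNloglik",
                            obsWeights = w_ipcw)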


CV-Super Learner

• The standard SL framework gives us CV risks for each

candidate algorithm.

• However, the SL and discrete SL are obtained using all the

data, so their estimated risks will be optimistic.

• We can rectify this using a second layer of

cross-validation.

39 / 48


CV-Super Learner

1. Split the data into V1 folds.

2. For v = 1, . . . ,V1:

a. Run regular SL on the training set for fold v using

V2-fold CV.

b. Obtain discrete SL and SL predictions for the

validation set for fold v .

3. Combine the validation sets to obtain CV-risks for the

discrete SL and SL.

40 / 48
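A minimal sketch with assumed y and x: CV.SuperLearner() implements this, with the outer V1 folds set by cvControl and the inner V2 folds (used to fit each SL) set by innerCvControl.

    library(SuperLearner)

    cv_sl <- CV.SuperLearner(Y = y, X = x, family = binomial(),
                             SL.library = c("SL.mean", "SL.glm", "SL.glmnet"),
                             method = "method.NNloglik",
                             cvControl = list(V = 10),             # V1 outer folds
                             innerCvControl = list(list(V = 10)))  # V2 inner folds

    summary(cv_sl)   # honest CV risks for the SL, discrete SL, and each candidate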


VI. Lab 3: Binary outcome

redux

40 / 48


VII. Lab 4: Case-control analysis

of Fluzone vaccine

40 / 48


FLUVACS trial

• Healthy adults aged 18–49 years, Michigan, 2007–2008.

• Randomly assigned to:

– Fluzone – inactivated influenza vaccine (IIV)

– FluMist – live-attenuated influenza vaccine (LAIV)

– placebo.

• We are only interested in Fluzone vs placebo.

• Followed for one flu season.

• Endpoint = laboratory-confirmed influenza.

41 / 48


FLUVACS trial

42 / 48


FLUVACS trial

• All 52 cases and 52 random controls were assayed for a

variety of markers (HAI, NAI, MN, AM titers,

proteins/virus/peptide magnitude/breadth).

• Measured variables:

– Demographics: age, vaccinated in last year

(EVERVAX)

– Day 0 markers

– Day 30 markers

– Difference markers = Day 30 markers - Day 0 markers

43 / 48


Variable sets

1. Demo.

2. Demo. + Day 0 markers

3. Demo. + Day 30 markers

4. Demo. + Difference markers

5. Demo. + Day 0 markers + EVERVAX × Day 0 markers

6. Demo. + Day 30 markers + EVERVAX × Day 30 markers

7. Demo. + Diff. markers + EVERVAX × Diff. markers

8. Demo. + Day 0 + Day 30 + EVERVAX × (Day 0 + Day 30)

9. Demo. + Day 0 + Diff. + EVERVAX × (Day 0 + Diff.)

44 / 48
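A minimal sketch of building one of the interaction sets (set 5: demographics + Day 0 markers + EVERVAX × Day 0 markers). The data frame dat, its column names, and the character vector day0_vars of Day 0 marker names are assumptions for illustration.

    interactions <- as.data.frame(lapply(dat[day0_vars],
                                         function(m) m * dat$EVERVAX))
    names(interactions) <- paste0("EVERVAX_x_", day0_vars)

    x_set5 <- cbind(dat[c("age", "EVERVAX")], dat[day0_vars], interactions)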


Analysis goals

• We want to compare the quality of these nine sets of

variables for predicting flu status in the placebo and

Fluzone arms separately.

• We also want to compare the predictive quality of IgA, IgG,

and both IgA + IgG measurements.

• We will use cross-validated Super Learning to do this.

45 / 48
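A minimal sketch of that comparison (not the lab code): run CV.SuperLearner separately on each predictor set, reusing the case-control weights, and compare the cross-validated results. The list variable_sets (one predictor data frame per set), the outcome y, and the weights w are assumed; cross-validated AUC could then be computed from each fit's held-out predictions, for example with the cvAUC package.

    library(SuperLearner)

    lib <- c("SL.mean", "SL.glm", "SL.glmnet")

    cv_fits <- lapply(variable_sets, function(x_set) {
      CV.SuperLearner(Y = y, X = x_set, family = binomial(),
                      SL.library = lib,
                      method = "method.NNloglik",
                      obsWeights = w,               # case-control weights from before
                      cvControl = list(V = 10))
    })

    lapply(cv_fits, summary)   # cross-validated risk for SL, discrete SL, and candidates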


[Figure: cross-validated AUC for the SL, the discrete SL, and each candidate learner (SL.glm, SL.bayesglm, SL.glmnet, SL.earth, SL.gam, SL.xgboost, SL.ranger, SL.mean) in each of the nine variable sets (Baseline; Day 0; Day 30; Day 30 − Day 0; EV x Day 0; EV x Day 30; EV x (Day 30 − Day 0); EV x (Day 0, Day 30); EV x (Day 0, Diff)), by screen (All, screen.marginal.05, screen.marginal.10) and marker set (Both, IgA, IgG, Neither).]

46 / 48

[Figure: cross-validated AUC for the SL, the discrete SL, and a single candidate learner per variable-set panel (SL.xgboost, SL.bayesglm, SL.glm, or SL.gam, depending on the panel), by marker set (Both, IgA, IgG, Neither).]

47 / 48

[Figure: cross-validated AUC for the SL, the discrete SL, and each candidate learner across the nine variable sets, by screen (All, screen.marginal.05, screen.marginal.10) and marker set (Both, IgA, IgG, Neither).]

48 / 48

[Figure: cross-validated AUC for the SL, the discrete SL, and a single candidate learner per variable-set panel (SL.bayesglm or SL.ranger), by marker set (Both, IgA, IgG, Neither).]

49 / 48
