
Page 1

Thomas G. Dietterich
Department of Computer Science
Oregon State University
Corvallis, Oregon 97331
http://www.cs.orst.edu/~tgd

Bias-Variance Analysis of Ensemble Learning

Page 2

Outline

Bias-Variance Decomposition for Regression
Bias-Variance Decomposition for Classification
Bias-Variance Analysis of Learning Algorithms
Effect of Bagging on Bias and Variance
Effect of Boosting on Bias and Variance
Summary and Conclusion

Page 3

Bias-Variance Analysis in Regression

True function is y = f(x) + ε, where ε is normally distributed with zero mean and standard deviation σ.

Given a set of training examples, {(xi, yi)}, we fit an hypothesis h(x) = w · x + b to the data to minimize the squared error Σi [yi − h(xi)]².

Page 4

Example: 20 points, y = x + 2 sin(1.5x) + N(0, 0.2)
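As a concrete illustration of the previous two slides, here is a minimal Python sketch that draws 20 points from this generator and fits h(x) = w · x + b by least squares. The x-range [0, 10] and the seed are our assumptions, not part of the slides.

```python
# Minimal sketch: sample 20 points from y = x + 2 sin(1.5 x) + N(0, 0.2)
# and fit h(x) = w*x + b by least squares.
import numpy as np

rng = np.random.default_rng(0)  # seed is an arbitrary choice

x = rng.uniform(0.0, 10.0, size=20)  # x-range is an assumption
y = x + 2.0 * np.sin(1.5 * x) + rng.normal(0.0, 0.2, size=20)

w, b = np.polyfit(x, y, deg=1)  # minimizes sum_i [y_i - h(x_i)]^2
print(f"h(x) = {w:.3f} x + {b:.3f}")
```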

Page 5

50 fits (20 examples each)

Page 6

Bias-Variance Analysis

Now, given a new data point x* (with observed value y* = f(x*) + ε), we would like to understand the expected prediction error

E[ (y* − h(x*))² ]

Page 7

Classical Statistical Analysis

Imagine that our particular training sample S is drawn from some population of possible training samples according to P(S).
Compute E_P[ (y* − h(x*))² ].
Decompose this into "bias", "variance", and "noise".

Page 8

Lemma

Let Z be a random variable with probability distribution P(Z).
Let Z̄ = E_P[Z] be the average value of Z.
Lemma: E[ (Z − Z̄)² ] = E[Z²] − Z̄²

E[ (Z − Z̄)² ] = E[ Z² − 2 Z Z̄ + Z̄² ]
             = E[Z²] − 2 E[Z] Z̄ + Z̄²
             = E[Z²] − 2 Z̄² + Z̄²
             = E[Z²] − Z̄²

Corollary: E[Z²] = E[ (Z − Z̄)² ] + Z̄²

Page 9

Bias-Variance-Noise Decomposition

E[ (h(x*) − y*)² ]
  = E[ h(x*)² − 2 h(x*) y* + y*² ]
  = E[ h(x*)² ] − 2 E[ h(x*) ] E[y*] + E[y*²]
  = E[ (h(x*) − h̄(x*))² ] + h̄(x*)²          (lemma)
    − 2 h̄(x*) f(x*)
    + E[ (y* − f(x*))² ] + f(x*)²            (lemma)
  = E[ (h(x*) − h̄(x*))² ]                    [variance]
    + (h̄(x*) − f(x*))²                       [bias²]
    + E[ (y* − f(x*))² ]                      [noise]

where h̄(x*) = E[ h(x*) ] is the average prediction and E[y*] = f(x*).

Page 10

Derivation (continued)

E[ (h(x*) − y*)² ]
  = E[ (h(x*) − h̄(x*))² ] + (h̄(x*) − f(x*))² + E[ (y* − f(x*))² ]
  = Var(h(x*)) + Bias(h(x*))² + E[ ε² ]
  = Var(h(x*)) + Bias(h(x*))² + σ²

Expected prediction error = Variance + Bias² + Noise²
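The decomposition can be checked numerically. Below is a Monte Carlo sketch under the running example's assumptions (linear h, the f from page 4); the number of training sets, the x-range, and the seed are ours.

```python
# Monte Carlo check of the decomposition at a single test point x*:
# draw many training sets, fit h to each, and compare
# E[(y* - h(x*))^2] with Var(h(x*)) + Bias(h(x*))^2 + sigma^2.
import numpy as np

rng = np.random.default_rng(1)
SIGMA, N_SETS, N_TRAIN = 0.2, 2000, 20

def f(x):  # true function from the example
    return x + 2.0 * np.sin(1.5 * x)

x_star = 5.0
preds = np.empty(N_SETS)
for s in range(N_SETS):
    x = rng.uniform(0.0, 10.0, N_TRAIN)
    y = f(x) + rng.normal(0.0, SIGMA, N_TRAIN)
    w, b = np.polyfit(x, y, 1)  # linear hypothesis h(x) = w*x + b
    preds[s] = w * x_star + b

variance = preds.var()
bias2 = (preds.mean() - f(x_star)) ** 2
noise = SIGMA ** 2
y_star = f(x_star) + rng.normal(0.0, SIGMA, N_SETS)  # fresh observations
expected_error = np.mean((y_star - preds) ** 2)
print(expected_error, variance + bias2 + noise)  # the two should be close
```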

Page 11

Bias, Variance, and Noise

Variance: E[ (h(x*) − h̄(x*))² ]
  Describes how much h(x*) varies from one training set S to another.

Bias: [h̄(x*) − f(x*)]
  Describes the average error of h(x*).

Noise: E[ (y* − f(x*))² ] = E[ε²] = σ²
  Describes how much y* varies from f(x*).

Page 12

50 fits (20 examples each)

Page 13

Bias

Page 14

Variance

Page 15

Noise

Page 16

50 fits (20 examples each)

Page 17

Distribution of predictions at x = 2.0

Page 18

50 fits (20 examples each)

Page 19

Distribution of predictions at x = 5.0

Page 20

Measuring Bias and Variance

In practice (unlike in theory), we have only ONE training set S.
We can simulate multiple training sets by bootstrap replicates:

S′ = {x | x is drawn at random with replacement from S}, with |S′| = |S|.

Page 21

Procedure for Measuring Bias and Variance

Construct B bootstrap replicates of S (e.g., B = 200): S1, …, SB.
Apply the learning algorithm to each replicate Sb to obtain hypothesis hb.
Let Tb = S \ Sb be the data points that do not appear in Sb ("out of bag" points).
Compute the predicted value hb(x) for each x in Tb.

Page 22

Estimating Bias and Variance (continued)

For each data point x, we will now have the corresponding observed value y and several predictions y1, …, yK.
Compute the average prediction h̄.
Estimate bias as (h̄ − y).
Estimate variance as Σk (yk − h̄)² / (K − 1).
Assume noise is 0.
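Taken together, pages 20-22 amount to the following procedure. This is a hedged sketch assuming a scikit-learn style regressor with fit/predict; the helper name estimate_bias_variance and the defaults are ours.

```python
# Sketch of the out-of-bag bias/variance measurement procedure.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def estimate_bias_variance(make_model, X, y, B=200, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    preds = [[] for _ in range(n)]  # out-of-bag predictions per point
    for _ in range(B):
        idx = rng.integers(0, n, size=n)       # bootstrap replicate S_b
        oob = np.setdiff1d(np.arange(n), idx)  # T_b = S \ S_b
        if len(oob) == 0:
            continue
        model = make_model().fit(X[idx], y[idx])
        for i, p in zip(oob, model.predict(X[oob])):
            preds[i].append(p)
    bias2, var = [], []
    for i, p in enumerate(preds):
        if len(p) < 2:
            continue  # need at least 2 predictions for a variance estimate
        p = np.asarray(p)
        bias2.append((p.mean() - y[i]) ** 2)  # (h_bar - y)^2, noise assumed 0
        var.append(p.var(ddof=1))             # sum_k (y_k - h_bar)^2 / (K - 1)
    return np.mean(bias2), np.mean(var)

# Example usage (X, y assumed to be numpy arrays):
# b2, v = estimate_bias_variance(lambda: DecisionTreeRegressor(max_depth=2), X, y)
```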

Page 23

Approximations in this Procedure

Bootstrap replicates are not real data.
We ignore the noise:

  If we have multiple data points with the same x value, then we can estimate the noise.
  We can also estimate noise by pooling y values from nearby x values (another use for the Random Forest proximity measure?).

Page 24

Bagging

Bagging constructs B bootstrap replicates and their corresponding hypotheses h1, …, hB.

It makes predictions according to y = Σb hb(x) / B.

Hence, bagging's predictions are h̄(x).
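For concreteness, here is a minimal sketch using scikit-learn's BaggingRegressor; the base learner, B = 200, and the synthetic data are illustrative assumptions.

```python
# Hedged sketch of bagging for regression with scikit-learn.
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X = rng.uniform(0.0, 10.0, (200, 1))
y = X[:, 0] + 2.0 * np.sin(1.5 * X[:, 0]) + rng.normal(0.0, 0.2, 200)

# B = 200 bootstrap replicates; each tree is one hypothesis h_b,
# and predict() averages them: y = sum_b h_b(x) / B.
bagger = BaggingRegressor(DecisionTreeRegressor(), n_estimators=200,
                          bootstrap=True).fit(X, y)
print(bagger.predict([[5.0]]))
```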

Page 25

Estimated Bias and Variance of Bagging

If we estimate bias and variance using the same B bootstrap samples, we will have:

Bias = (h̄ − y) [same as before]
Variance = Σk (h̄ − h̄)² / (K − 1) = 0

Hence, according to this approximate way of estimating variance, bagging removes the variance while leaving bias unchanged.
In reality, bagging only reduces variance and tends to slightly increase bias.

Page 26

Bias/Variance Heuristics

Models that fit the data poorly have high bias: "inflexible models" such as linear regression and regression stumps.
Models that can fit the data very well have low bias but high variance: "flexible" models such as nearest neighbor regression and regression trees.
This suggests that bagging of a flexible model can reduce the variance while benefiting from the low bias.

Page 27

Bias-Variance Decomposition for Classification

Can we extend the bias-variance decomposition to classification problems?
Several extensions have been proposed; we will study the extension due to Pedro Domingos (2000a; 2000b).
Domingos developed a unified decomposition that covers both regression and classification.

Page 28

Classification Problems

Data points are generated by yi = n(f(xi)), where

  f(xi) is the true class label of xi
  n(·) is a noise process that may change the true label f(xi).

Given a training set {(x1, y1), …, (xm, ym)}, our learning algorithm produces an hypothesis h.
Let y* = n(f(x*)) be the observed label of a new data point x*. h(x*) is the predicted label. The error ("loss") is defined as L(h(x*), y*).

Page 29

Loss Functions for Classification

The usual loss function is 0/1 loss: L(y′, y) is 0 if y′ = y and 1 otherwise.
Our goal is to decompose E_P[ L(h(x*), y*) ] into bias, variance, and noise terms.

Page 30

Discrete Equivalent of the Mean: The Main Prediction

As before, we imagine that our observed training set S was drawn from some population according to P(S).
Define the main prediction to be

  ym(x*) = argmin_y′ E_P[ L(y′, h(x*)) ]

For 0/1 loss, the main prediction is the most common vote of h(x*) (taken over all training sets S weighted according to P(S)).
For squared error, the main prediction is h̄(x*).
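A small sketch of the main prediction for both losses, with predictions from bootstrap-trained hypotheses standing in for draws from P(S); the function name is ours.

```python
# Estimating the main prediction at x* from a list of votes h_1(x*), ..., h_B(x*).
import numpy as np

def main_prediction(votes, loss="zero_one"):
    votes = np.asarray(votes)
    if loss == "zero_one":
        # argmin_y' E[L(y', h(x*))] is the most common vote
        labels, counts = np.unique(votes, return_counts=True)
        return labels[np.argmax(counts)]
    # for squared error it is the mean prediction h_bar(x*)
    return votes.mean()

print(main_prediction([1, 0, 1, 1, 0]))              # -> 1 (majority vote)
print(main_prediction([2.0, 2.4, 1.9], loss="sq"))   # -> 2.1 (mean)
```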

Page 31

Bias, Variance, Noise

Bias: B(x*) = L(ym, f(x*))
  This is the loss of the main prediction with respect to the true label of x*.

Variance: V(x*) = E[ L(h(x*), ym) ]
  This is the expected loss of h(x*) relative to the main prediction.

Noise: N(x*) = E[ L(y*, f(x*)) ]
  This is the expected loss of the noisy observed value y* relative to the true label of x*.

Page 32

Squared Error Loss

These definitions give us the results we have already derived for squared error loss L(y′, y) = (y′ − y)²:

Main prediction: ym = h̄(x*)
Bias²: L(h̄(x*), f(x*)) = (h̄(x*) − f(x*))²
Variance: E[ L(h(x*), h̄(x*)) ] = E[ (h(x*) − h̄(x*))² ]
Noise: E[ L(y*, f(x*)) ] = E[ (y* − f(x*))² ]

Page 33

0/1 Loss for 2 classes

There are three components that determine whether y* = h(x*):

  Noise: y* = f(x*)?
  Bias: f(x*) = ym?
  Variance: ym = h(x*)?

Bias is either 0 or 1, because neither f(x*) nor ym is a random variable.

Page 34

Case Analysis of Error

Whether h(x*) correctly predicts y* depends on the three questions; for two classes, the eight cases are:

f(x*) = ym?   ym = h(x*)?   y* = f(x*)?   outcome
yes           yes           yes           correct
yes           yes           no            error [noise]
yes           no            yes           error [variance]
yes           no            no            correct [noise cancels variance]
no            yes           yes           error [bias]
no            yes           no            correct [noise cancels bias]
no            no            yes           correct [variance cancels bias]
no            no            no            error [noise cancels variance cancels bias]

Page 35

Unbiased Case

Let P(y* ≠ f(x*)) = N(x*) = τ.
Let P(ym ≠ h(x*)) = V(x*) = σ.
If f(x*) = ym, then we suffer a loss if exactly one of these events occurs:

  L(h(x*), y*) = τ(1 − σ) + σ(1 − τ)
               = τ + σ − 2τσ
               = N(x*) + V(x*) − 2 N(x*) V(x*)

Page 36

Biased Case

Let P(y* ≠ f(x*)) = N(x*) = τ.
Let P(ym ≠ h(x*)) = V(x*) = σ.
If f(x*) ≠ ym, then we suffer a loss if either both or neither of these events occurs:

  L(h(x*), y*) = τσ + (1 − σ)(1 − τ)
               = 1 − (τ + σ − 2τσ)
               = B(x*) − [N(x*) + V(x*) − 2 N(x*) V(x*)]

Page 37

Decomposition for 0/1 Loss (2 classes)

We do not get a simple additive decomposition in the 0/1 loss case:

E[ L(h(x*), y*) ] =
  if B(x*) = 1: B(x*) − [N(x*) + V(x*) − 2 N(x*) V(x*)]
  if B(x*) = 0: B(x*) + [N(x*) + V(x*) − 2 N(x*) V(x*)]

In the biased case, noise and variance reduce error; in the unbiased case, noise and variance increase error.
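Both cases can be verified by simulation. A sketch, with arbitrary τ and σ; the label 0/1 encoding and seed are ours.

```python
# Check the two-case 0/1 formula: flip f(x*) to get y* with prob tau (noise)
# and flip y_m to get h(x*) with prob sigma (variance).
import numpy as np

rng = np.random.default_rng(3)
tau, sigma, trials = 0.1, 0.3, 200_000

def simulated_loss(biased):
    f = 0                                 # true label
    y_m = 1 - f if biased else f          # main prediction
    y_star = np.where(rng.random(trials) < tau, 1 - f, f)    # noisy label
    h = np.where(rng.random(trials) < sigma, 1 - y_m, y_m)   # varying h
    return np.mean(h != y_star)

interaction = tau + sigma - 2 * tau * sigma
print(simulated_loss(False), 0 + interaction)  # unbiased: B + [N + V - 2NV]
print(simulated_loss(True), 1 - interaction)   # biased:   B - [N + V - 2NV]
```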

Page 38

Summary of 0/1 Loss

A good classifier will have low bias, in which case the expected loss will approximately equal the variance.
The interaction terms will usually be small, because both noise and variance will usually be < 0.2, so the interaction term 2 V(x*) N(x*) will be < 0.08.

Page 39

0/1 Decomposition in Practice

In the noise-free case:

E[ L(h(x*), y*) ] =
  if B(x*) = 1: B(x*) − V(x*)
  if B(x*) = 0: B(x*) + V(x*)

It is usually hard to estimate N(x*), so we will use this formula.

Page 40

Decomposition over an Entire Data Set

Given a set of test points T = {(x*1, y*1), …, (x*n, y*n)}, we want to decompose the average loss:

  L̄ = Σi E[ L(h(x*i), y*i) ] / n

We will write it as

  L̄ = B̄ + V̄u − V̄b

where B̄ is the average bias, V̄u is the average unbiased variance, and V̄b is the average biased variance. (We ignore the noise.)
V̄u − V̄b will be called the "net variance".
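A sketch of this dataset-level decomposition for 0/1 loss, given a matrix of predictions from B bootstrap-trained classifiers; the function name and the toy data are ours.

```python
# Average bias, unbiased variance, and biased variance under 0/1 loss.
import numpy as np

def zero_one_decomposition(preds, y_true):
    preds = np.asarray(preds)                    # shape (B, n), class labels
    _, n = preds.shape
    avg_bias = avg_vu = avg_vb = 0.0
    for i in range(n):
        labels, counts = np.unique(preds[:, i], return_counts=True)
        y_m = labels[np.argmax(counts)]          # main prediction at x*_i
        biased = (y_m != y_true[i])              # B(x*_i) in {0, 1}
        v = np.mean(preds[:, i] != y_m)          # V(x*_i)
        avg_bias += biased / n
        if biased:
            avg_vb += v / n                      # variance that cancels bias
        else:
            avg_vu += v / n                      # variance that adds error
    return avg_bias, avg_vu, avg_vb              # loss = B + Vu - Vb (noise-free)

preds = [[0, 1, 1], [0, 0, 1], [1, 0, 1]]        # toy: 3 models, 3 test points
print(zero_one_decomposition(preds, np.array([0, 1, 0])))
```

On the toy input, B̄ + V̄u − V̄b equals the average 0/1 loss exactly, as expected in the noise-free case.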

Page 41

Experimental Studies of Bias and Variance

Artificial data: can generate multiple training sets S and measure bias and variance directly.
Benchmark data sets: generate bootstrap replicates and measure bias and variance on a separate test set.

Page 42

Algorithms to Study

K-nearest neighbors: what is the effect of K?
Decision trees: what is the effect of pruning?
Support vector machines: what is the effect of kernel width σ? (A sketch of the first experiment follows.)
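The first question can be explored with a sketch like the one below, which reuses the zero_one_decomposition helper from page 40. The dataset, the K grid, B = 50, and the simplification of predicting on all of S (rather than only out-of-bag points, as the procedure on page 21 prescribes) are our assumptions.

```python
# Illustrative sweep over K for K-NN, measuring bias and net variance
# across bootstrap replicates. Requires zero_one_decomposition from above.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(4)
n, B = len(X), 50

for k in (1, 3, 9, 27):
    preds = np.empty((B, n), dtype=int)
    for b in range(B):
        idx = rng.integers(0, n, size=n)                  # bootstrap replicate
        model = KNeighborsClassifier(n_neighbors=k).fit(X[idx], y[idx])
        preds[b] = model.predict(X)   # simplification: predict on all of S
    bias, vu, vb = zero_one_decomposition(preds, y)
    print(f"K={k:2d}  bias={bias:.3f}  net variance={vu - vb:.3f}")
```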

Page 43

K-nearest neighbor (Domingos, 2000)

Chess (left): increasing K primarily reduces Vu.
Audiology (right): increasing K primarily increases B.

Page 44

Size of Decision Trees

Glass (left), Primary tumor (right): deeper trees have lower B, higher Vu.

Page 45

Example: 200 linear SVMs (training sets of size 20)

Error: 13.7%

Bias: 11.7%

Vu: 5.2%

Vb: 3.2%

Page 46

Example: 200 RBF SVMs, σ = 5

Error: 15.0%

Bias: 5.8%

Vu: 11.5%

Vb: 2.3%

Page 47

Example: 200 RBF SVMs, σ = 50

Error: 14.9%

Bias: 10.1%

Vu: 7.8%

Vb: 3.0%

Page 48

SVM Bias and Variance

Bias-variance tradeoff controlled by σ.
A biased classifier (linear SVM) gives better results than a classifier that can represent the true decision boundary!

Page 49

B/V Analysis of Bagging

Under the bootstrap assumption, bagging reduces only variance:

  Removing Vu reduces the error rate.
  Removing Vb increases the error rate.

Therefore, bagging should be applied to low-bias classifiers, because then Vb will be small.
Reality is more complex!

Page 50

Bagging Nearest Neighbor

Bagging first-nearest neighbor is equivalent (in the limit) to a weighted majority vote in which the k-th neighbor receives a weight of

  exp(−(k − 1)) − exp(−k)

Since the first nearest neighbor gets more than half of the vote (its weight is 1 − e⁻¹ ≈ 0.632), it will always win this vote. Therefore, bagging 1-NN is equivalent to 1-NN.
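A quick numeric check of these weights:

```python
# Limiting weights w_k = exp(-(k-1)) - exp(-k); the first neighbor's weight
# already exceeds 1/2, so it wins every vote.
import math

weights = [math.exp(-(k - 1)) - math.exp(-k) for k in range(1, 6)]
print([round(w, 4) for w in weights])   # [0.6321, 0.2325, 0.0855, ...]
# first weight vs. everything else (all weights sum to 1 in the limit):
print(weights[0] > 1 - weights[0])      # True
```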

Page 51

Bagging Decision Trees

Consider unpruned trees of depth 2 on the Glass data set. In this case, the error is almost entirely due to bias.
Perform 30-fold bagging (replicated 50 times; 10-fold cross-validation).
What will happen?

Page 52

Bagging Primarily Reduces Bias!

Page 53

Questions

Is this due to the failure of the bootstrap assumption in bagging?
Is this due to the failure of the bootstrap assumption in estimating bias and variance?
Should we also think of bagging as a simple additive model that expands the range of representable classifiers?

Page 54

Bagging Large Trees?

Now consider unpruned trees of depth 10 on the Glass data set. In this case, the trees have much lower bias.
What will happen?

Page 55

Answer: Bagging Primarily Reduces Variance

Page 56

Bagging of SVMs

We will choose a low-bias, high-variance SVM to bag: an RBF SVM with σ = 5.

Page 57

RBF SVMs again: σ = 5

Page 58

Effect of 30-fold Bagging: Variance is Reduced

Page 59

Effects of 30-fold Bagging

Vu is decreased by 0.010; Vb is unchanged.
Bias is increased by 0.005.
Error is reduced by 0.005.

Page 60

Bias-Variance Analysis of Boosting

Boosting seeks to find a weighted combination of classifiers that fits the data well.
Prediction: boosting will primarily act to reduce bias.
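A hedged sketch of boosting with scikit-learn's AdaBoost; the dataset, stump depth, and iteration count are illustrative assumptions, and staged_predict is used to track the ensemble as iterations proceed.

```python
# AdaBoost: a weighted combination of shallow trees fit to reweighted data.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
booster = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                             n_estimators=100).fit(X, y)

# staged_predict yields the ensemble's predictions after each iteration,
# which is how bias and variance can be tracked as boosting proceeds.
for t, pred in enumerate(booster.staged_predict(X), start=1):
    if t in (1, 10, 100):
        print(t, (pred != y).mean())   # training 0/1 loss at iteration t
```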

Page 61

Boosting DNA splice (left) and Audiology (right)

Early iterations reduce bias. Later iterations also reduce variance.

Page 62

Review and Conclusions

For regression problems (squared error loss), the expected error rate can be decomposed into

  Bias(x*)² + Variance(x*) + Noise(x*)

For classification problems (0/1 loss), the expected error rate depends on whether bias is present:

  if B(x*) = 1: B(x*) − [V(x*) + N(x*) − 2 V(x*) N(x*)]
  if B(x*) = 0: B(x*) + [V(x*) + N(x*) − 2 V(x*) N(x*)]
  or B(x*) + Vu(x*) − Vb(x*) [ignoring noise]

Page 63

Sources of Bias and Variance

Bias arises when the classifier cannot represent the true function; that is, the classifier underfits the data.
Variance arises when the classifier overfits the data.
There is often a tradeoff between bias and variance.

Page 64

Effect of Algorithm Parameters on Bias and Variance

k-nearest neighbor: increasing k typically increases bias and reduces variance.
Decision trees of depth D: increasing D typically increases variance and reduces bias.
RBF SVM with parameter σ: increasing σ increases bias and reduces variance.

Page 65

Effect of Bagging

If the bootstrap replicate approximation were correct, then bagging would reduce variance without changing bias.
In practice, bagging can reduce both bias and variance:

  For high-bias classifiers, it can reduce bias (but may increase Vu).
  For high-variance classifiers, it can reduce variance.

Page 66

Effect of Boosting

In the early iterations, boosting is primarily a bias-reducing method.
In later iterations, it appears to be primarily a variance-reducing method (see the end of Breiman's Random Forests paper).