
Page 1: WALD LECTURE 1


WALD LECTURE 1

MACHINE LEARNING

Leo Breiman UCB Statistics

[email protected]

Page 2: WALD LECTURE 1


A ROAD MAP

First--a brief overview of what we call machine learning, which consists of many diverse interests (not including data mining), and how I became a token statistician in this community.

Second--an exploration of ensemble predictors.

Beginning with bagging.

Then on to boosting.

Page 3: WALD LECTURE 1

ROOTS

Neural Nets--invented circa 1985

Brought together two groups:

A. Brain researchers applying neural nets to model some functions of the brain.

B. Computer scientists working on:

speech recognition, written character recognition, and other hard prediction problems.

Page 4: WALD LECTURE 1


JOINED BY

C. CS groups with research interests in training robots.

Supervised training

Stimulus was CART--circa 1985

(Machine Learning)

Self-learning robots

(Reinforcement Learning)

D. Other assorted groups

Artificial Intelligence

PAC Learning

etc. etc.

Page 5: WALD LECTURE 1


MY INTRODUCTION

1991

Invited talk on CART at a Machine Learning Conference.

Careful and methodical exposition assuming they had never heard of CART.

(as was true in Statistics)

How embarrassing:

After the talk I found that they knew all about CART and were busy using its lookalike, C4.5.

Page 6: WALD LECTURE 1


NIPS

(neural information processing systems)

The next year I went to NIPS--1992-93--and have gone every year since except for one.

In 1992 NIPS was a hotbed of neural net use for a variety of purposes:

prediction, control, brain models.

A REVELATION!

Neural Nets actually work in prediction:

despite a multitude of local minima

despite the dangers of overfitting

Skilled practitioners tailored large architectures of hidden units to accomplish special-purpose results in specific problems,

e.g. partial rotation and translation invariance in character recognition.

Page 7: WALD LECTURE 1


NIPS GROWTH

NIPS grew to include many diverse groups:

signal processing, computer vision, etc.

One reason for growth--skiing. Vancouver, Dec. 9-12th; Whistler, 12-15th.

In 2001, about 600 attendees. Many foreigners--especially Europeans.

Mainly computer scientists, some engineers, physicists, mathematical physiologists, etc.

Average age--30. Energy level--out of sight. Approach is strictly algorithmic.

For me, algorithmically oriented and feeling like a voice in the wilderness in statistics, this community was like home. My research was energized.

The papers presented at the NIPS 2000 conference are listed in the following to give a sense of the wide diversity of research interests.

Page 8: WALD LECTURE 1

• What Can a Single Neuron Compute?
• Who Does What? A Novel Algorithm to Determine Function Localization
• Programmable Reinforcement Learning Agents
• From Mixtures of Mixtures to Adaptive Transform Coding
• Dendritic Compartmentalization Could Underlie Competition and Attentional Biasing of Simultaneous Visual Stimuli
• Place Cells and Spatial Navigation Based on 2D Visual Feature Extraction, Path Integration, and Reinforcement Learning
• Speech Denoising and Dereverberation Using Probabilistic Models
• Combining ICA and Top-Down Attention for Robust Speech Recognition
• Modelling Spatial Recall, Mental Imagery and Neglect
• Shape Context: A New Descriptor for Shape Matching and Object Recognition
• Efficient Learning of Linear Perceptrons
• A Support Vector Method for Clustering
• A Neural Probabilistic Language Model
• A Variational Mean-Field Theory for Sigmoidal Belief Networks
• Stability and Noise in Biochemical Switches
• Emergence of Movement Sensitive Neurons' Properties by Learning a Sparse Code for Natural Moving Images
• New Approaches Towards Robust and Adaptive Speech Recognition
• Algorithmic Stability and Generalization Performance
• Exact Solutions to Time-Dependent MDPs
• Direct Classification with Indirect Data
• Model Complexity, Goodness of Fit and Diminishing Returns
• A Linear Programming Approach to Novelty Detection
• Decomposition of Reinforcement Learning for Admission Control of Self-Similar Call Arrival Processes
• Overfitting in Neural Nets: Backpropagation, Conjugate Gradient, and Early Stopping
• Incremental and Decremental Support Vector Machine Learning
• Vicinal Risk Minimization

Page 9: WALD LECTURE 1

• Temporally Dependent Plasticity: An Information Theoretic Account
• Gaussianization
• The Missing Link - A Probabilistic Model of Document and Hypertext Connectivity
• The Manhattan World Assumption: Regularities in Scene Statistics which Enable Bayesian Inference
• Improved Output Coding for Classification Using Continuous Relaxation (Koby Crammer, Yoram Singer)
• Sparse Representation for Gaussian Process Models
• Competition and Arbors in Ocular Dominance
• Explaining Away in Weight Space
• Feature Correspondence: A Markov Chain Monte Carlo Approach
• A New Model of Spatial Representation in Multimodal Brain Areas
• An Adaptive Metric Machine for Pattern Classification
• High-temperature Expansions for Learning Models of Nonnegative Data
• Incorporating Second-Order Functional Knowledge for Better Option Pricing
• A Productive, Systematic Framework for the Representation of Visual Structure
• Discovering Hidden Variables: A Structure-Based Approach
• Multiple Timescales of Adaptation in a Neural Code
• Learning Joint Statistical Models for Audio-Visual Fusion and Segregation
• Accumulator Networks: Suitors of Local Probability Propagation
• Sequentially Fitting "Inclusive" Trees for Inference in Noisy-OR Networks
• Factored Semi-Tied Covariance Matrices
• A New Approximate Maximal Margin Classification Algorithm
• Propagation Algorithms for Variational Bayesian Learning
• Reinforcement Learning with Function Approximation Converges to a Region
• The Kernel Gibbs Sampler

Page 10: WALD LECTURE 1

• From Margin to Sparsity
• `N-Body' Problems in Statistical Learning
• A Comparison of Image Processing Techniques for Visual Speech Recognition Applications
• The Interplay of Symbolic and Subsymbolic Processes in Anagram Problem Solving
• Permitted and Forbidden Sets in Symmetric Threshold-Linear Networks
• Support Vector Novelty Detection Applied to Jet Engine Vibration Spectra
• Large Scale Bayes Point Machines
• A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs Work
• Hierarchical Memory-Based Reinforcement Learning
• Beyond Maximum Likelihood and Density Estimation: A Sample-Based Criterion for Unsupervised Learning of Complex Models
• Ensemble Learning and Linear Response Theory for ICA
• A Silicon Primitive for Competitive Learning
• On Reversing Jensen's Inequality
• Automated State Abstraction for Options using the U-Tree Algorithm
• Dopamine Bonuses
• Hippocampally-Dependent Consolidation in a Hierarchical Model of Neocortex
• Second Order Approximations for Probability Models
• Generalizable Singular Value Decomposition for Ill-posed Datasets
• Some New Bounds on the Generalization Error of Combined Classifiers
• Sparsity of Data Representation of Optimal Kernel Machine and Leave-one-out Estimator
• Keeping Flexible Active Contours on Track using Metropolis Updates
• Smart Vision Chip Fabricated Using Three Dimensional Integration Technology
• Algorithms for Non-negative Matrix Factorization

Page 11: WALD LECTURE 1

• Color Opponency Constitutes a Sparse Representation for the Chromatic Structure of Natural Scenes
• Foundations for a Circuit Complexity Theory of Sensory Processing
• A Tighter Bound for Graphical Models
• Position Variance, Recurrence and Perceptual Learning
• Homeostasis in a Silicon Integrate and Fire Neuron
• Text Classification using String Kernels
• Constrained Independent Component Analysis
• Learning Curves for Gaussian Processes Regression: A Framework for Good Approximations
• Active Support Vector Machine Classification
• Weak Learners and Improved Rates of Convergence in Boosting
• Recognizing Hand-written Digits Using Hierarchical Products of Experts
• Learning Segmentation by Random Walks
• The Unscented Particle Filter
• A Mathematical Programming Approach to the Kernel Fisher Algorithm
• Automatic Choice of Dimensionality for PCA
• On Iterative Krylov-Dogleg Trust-Region Steps for Solving Neural Networks Nonlinear Least Squares Problems (Eiji Mizutani, James W. Demmel)
• Sex with Support Vector Machines (Baback Moghaddam, Ming-Hsuan Yang)
• Robust Reinforcement Learning
• Partially Observable SDE Models for Image Sequence Recognition Tasks
• The Use of MDL to Select among Computational Models of Cognition
• Probabilistic Semantic Video Indexing
• Finding the Key to a Synapse
• Processing of Time Series by Neural Circuits with Biologically Realistic Synaptic Dynamics
• Active Inference in Concept Learning

Page 12: WALD LECTURE 1

• Learning Continuous Distributions: Simulations With Field Theoretic Priors
• Interactive Parts Model: An Application to Recognition of On-line Cursive Script
• Learning Sparse Image Codes using a Wavelet Pyramid Architecture
• Kernel-Based Reinforcement Learning in Average-Cost Problems: An Application to Optimal Portfolio Choice
• Learning and Tracking Cyclic Human Motion
• Higher-Order Statistical Properties Arising from the Non-Stationarity of Natural Signals
• Learning Switching Linear Models of Human Motion
• Bayes Networks on Ice: Robotic Search for Antarctic Meteorites
• Redundancy and Dimensionality Reduction in Sparse-Distributed Representations of Natural Objects in Terms of Their Local Features
• Fast Training of Support Vector Classifiers
• The Use of Classifiers in Sequential Inference
• Occam's Razor
• One Microphone Source Separation
• Using Free Energies to Represent Q-values in a Multiagent Reinforcement Learning Task
• Minimum Bayes Error Feature Selection for Continuous Speech Recognition
• Periodic Component Analysis: An Eigenvalue Method for Representing Periodic Structure in Speech
• Spike-Timing-Dependent Learning for Oscillatory Networks
• Universality and Individuality in a Neural Code
• Machine Learning for Video-Based Rendering
• The Kernel Trick for Distances
• Natural Sound Statistics and Divisive Normalization in the Auditory System
• Balancing Multiple Sources of Reward in Reinforcement Learning
• An Information Maximization Approach to Overcomplete and Recurrent Representations

Page 13: WALD LECTURE 1

• Development of Hybrid Systems: Interfacing a Silicon Neuron to a Leech Heart Interneuron
• FaceSync: A Linear Operator for Measuring Synchronization of Video Facial Images and Audio Tracks
• The Early Word Catches the Weights
• Sparse Greedy Gaussian Process Regression
• Regularization with Dot-Product Kernels
• APRICODD: Approximate Policy Construction Using Decision Diagrams
• Four-legged Walking Gait Control Using a Neuromorphic Chip Interfaced to a Support Vector Learning Algorithm
• Kernel Expansions with Unlabeled Examples
• Analysis of Bit Error Probability of Direct-Sequence CDMA Multiuser Demodulators
• Noise Suppression Based on Neurophysiologically-motivated SNR Estimation for Robust Speech Recognition
• Rate-coded Restricted Boltzmann Machines for Face Recognition
• Structure Learning in Human Causal Induction
• Sparse Kernel Principal Component Analysis
• Data Clustering by Markovian Relaxation and the Information Bottleneck Method
• Adaptive Object Representation with Hierarchically-Distributed Memory Sites
• Active Learning for Parameter Estimation in Bayesian Networks
• Mixtures of Gaussian Processes
• Bayesian Video Shot Segmentation
• Error-correcting Codes on a Bethe-like Lattice
• Whence Sparseness?
• Tree-Based Modeling and Estimation of Gaussian Processes on Graphs with Cycles
• Algebraic Information Geometry for Learning Machines with Singularities
• Feature Selection for SVMs
• On a Connection between Kernel PCA and Metric Multidimensional Scaling
• Using the Nyström Method to Speed Up Kernel Machines

Page 14: WALD LECTURE 1

• Computing with Finite and Infinite Networks
• Stagewise Processing in Error-correcting Codes and Image Restoration
• Learning Winner-take-all Competition Between Groups of Neurons in Lateral Inhibitory Networks
• Generalized Belief Propagation
• A Gradient-Based Boosting Algorithm for Regression Problems
• Divisive and Subtractive Mask Effects: Linking Psychophysics and Biophysics
• Regularized Winnow Methods
• Convergence of Large Margin Separable Linear Classification

Page 15: WALD LECTURE 1

PREDICTION REMAINS A MAIN THREAD

Given a training set of data

T = {(y_n, x_n), n = 1, ..., N}

where the y_n are the response vectors and the x_n are vectors of predictor variables.

Find a function f operating on the space of prediction vectors, with values in the response vector space, such that:

If the (y_n, x_n) are i.i.d. from the distribution (Y, X), and given a function L(y, y') that measures the loss between y and the prediction y', the prediction error

PE(f, T) = E_{Y,X} L(Y, f(X, T))

is small.

Usually y is one dimensional. If it is numerical, the problem is regression; if it takes unordered labels, it is classification. In regression the loss is squared error. In classification the loss is one if the predicted label does not equal the true label, zero otherwise.
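To make the loss functions concrete, here is a minimal Python sketch (not from the lecture) of squared-error loss, zero-one loss, and a test-set estimate of the prediction error PE(f, T); the function and variable names are my own.

```python
import numpy as np

def squared_error_loss(y, y_hat):
    # Regression loss L(y, y') = (y - y')^2.
    return (np.asarray(y) - np.asarray(y_hat)) ** 2

def zero_one_loss(y, y_hat):
    # Classification loss: 1 if the predicted label differs from the true label, else 0.
    return (np.asarray(y) != np.asarray(y_hat)).astype(float)

def prediction_error(f, x_test, y_test, loss):
    # Estimate PE(f, T) = E L(Y, f(X, T)) by averaging the loss of the
    # fitted predictor f over an independent test set.
    y_hat = np.array([f(x) for x in x_test])
    return float(np.mean(loss(y_test, y_hat)))
```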

Page 16: WALD LECTURE 1

RECENT BREAKTHROUGHS

Two types of classification algorithms originated in 1996 that gave improved accuracy.

A. Support Vector Machines (Vapnik)

B. Combining Predictors:

Bagging (Breiman, 1996)
Boosting (Freund and Schapire, 1996)

Both bagging and boosting use ensembles of predictors defined on the prediction variables in the training set.

Let {f_1(x,T), f_2(x,T), ..., f_K(x,T)} be predictors constructed using the training set T such that for every value of x in the predictor space they output a value of y in the response space.

In regression, the predicted value of y corresponding to an input x is av_k f_k(x, T), the average over the ensemble.

In classification the output takes values in the class labels {1, 2, ..., J}. The predicted value of y is plur_k f_k(x, T), the plurality vote over the ensemble.

The averaging and voting can be weighted.
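A minimal sketch of the averaging and plurality-vote rules above, assuming each member of the ensemble is a Python callable that maps an input x to a prediction; the helper names and the optional vote weights are my own.

```python
import numpy as np

def ensemble_regression(predictors, x, weights=None):
    # Predicted y = (weighted) average over the ensemble: av_k f_k(x, T).
    preds = np.array([f(x) for f in predictors])
    return float(np.average(preds, weights=weights))

def ensemble_classification(predictors, x, weights=None):
    # Predicted y = (weighted) plurality vote over the class labels: plur_k f_k(x, T).
    preds = [f(x) for f in predictors]
    if weights is None:
        weights = np.ones(len(preds))
    votes = {}
    for label, w in zip(preds, weights):
        votes[label] = votes.get(label, 0.0) + w
    return max(votes, key=votes.get)
```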

Page 17: WALD LECTURE 1

1 7

THE STORY OF BAGGING

as illustrated to begin with by pictures of three one-dimensional smoothing examples using the same smoother.

They are not really smoothers, but predictors of the underlying function.

Page 18: WALD LECTURE 1


Page 19: WALD LECTURE 1

FIRST SMOOTH EXAMPLE

[Figure: data, the underlying function, and the prediction plotted as Y Variable versus X Variable on (0, 1).]

Page 20: WALD LECTURE 1

SECOND SMOOTH EXAMPLE

[Figure: data, the underlying function, and the prediction plotted as Y Variable versus X Variable on (0, 1).]

Page 21: WALD LECTURE 1

THIRD SMOOTH EXAMPLE

[Figure: data, the underlying function, and the prediction plotted as Y Variable versus X Variable on (0, 1).]

Page 22: WALD LECTURE 1

WHAT SMOOTH?

Here is a weak learner--

A WEAK LEARNER

[Figure: a single weak learner plotted as Y Variable versus X Variable on (0, 1).]

The smooth is an average of 1000 weak learners. Here is how the weak learners are formed:

Page 23: WALD LECTURE 1

FORMING THE WEAK LEARNER

[Figure: the training data with one randomly selected subset of points connected by lines, plotted as Y Variable versus X Variable on (0, 1).]

A subset of fixed size is selected at random. Then all the (y, x) points in the subset are connected by lines.

This is repeated 1000 times and the 1000 weak learners are averaged.
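A sketch of this construction, assuming one-dimensional x and a subset size of 10 (the slide only says "fixed size", so that number is my choice); np.interp supplies the connect-the-points-by-lines step.

```python
import numpy as np

def weak_learner(x_train, y_train, subset_size, rng):
    # Pick a random subset of fixed size and connect its (x, y) points by lines.
    idx = rng.choice(len(x_train), size=subset_size, replace=False)
    order = np.argsort(x_train[idx])
    xs, ys = x_train[idx][order], y_train[idx][order]
    return lambda x: np.interp(x, xs, ys)   # piecewise-linear interpolation

def averaged_smooth(x_train, y_train, subset_size=10, n_learners=1000, seed=0):
    # The smooth: average of many random weak learners.
    rng = np.random.default_rng(seed)
    learners = [weak_learner(x_train, y_train, subset_size, rng)
                for _ in range(n_learners)]
    return lambda x: np.mean([f(x) for f in learners], axis=0)
```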

Page 24: WALD LECTURE 1

THE PRINCIPLE

It's easier to see what is going on in regression:

PE(f, T) = E_{Y,X} (Y - f(X, T))^2

Want to average over all training sets of the same size drawn from the same distribution:

PE(f) = E_{Y,X,T} (Y - f(X, T))^2

This is decomposable into:

PE(f) = E_{Y,X} (Y - E_T f(X, T))^2 + E_{X,T} (f(X, T) - E_T f(X, T))^2

Or

PE(f) = (bias)^2 + variance

(Pretty Familiar!)
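A Monte Carlo sketch of this decomposition (my own, not from the lecture): refit the predictor on many training sets of the same size, compare the average prediction E_T f(X, T) with a known underlying function, and measure the spread around that average. The squared bias here is taken against the noiseless true function, so the irreducible noise in Y is left out of the total.

```python
import numpy as np

def bias_variance(make_training_set, fit, x_grid, true_f, n_reps=200, seed=0):
    # make_training_set(rng) -> (x_tr, y_tr): a fresh training set T
    # fit(x_tr, y_tr) -> callable giving f(x, T) on an array of x values
    rng = np.random.default_rng(seed)
    preds = np.empty((n_reps, len(x_grid)))
    for r in range(n_reps):
        x_tr, y_tr = make_training_set(rng)
        preds[r] = fit(x_tr, y_tr)(x_grid)
    mean_pred = preds.mean(axis=0)                        # approximates E_T f(x, T)
    bias2 = np.mean((true_f(x_grid) - mean_pred) ** 2)    # squared bias term
    variance = np.mean((preds - mean_pred) ** 2)          # variance term
    return bias2, variance
```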

Page 25: WALD LECTURE 1


BACK TO EXAMPLE

The kth weak learner is of the form:

f_k(x, T) = f(x, T, Θ_k)

where Θ_k is the random vector that selects the points to be in the weak learner. The Θ_k are i.i.d.

The ensemble predictor is:

F(x, T) = (1/K) Σ_k f(x, T, Θ_k)

Algebra and the LLN lead to:

Var(F) = E_{X,Θ,Θ'} [ρ_T(f(x, T, Θ), f(x, T, Θ')) Var_T(f(x, T, Θ))]

where Θ, Θ' are independent. Applying the mean value theorem--

Var(F) = ρ Var(f)

and

Bias^2(F) = E_{Y,X} (Y - E_{T,Θ} f(x, T, Θ))^2
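A rough numerical check of Var(F) = ρ Var(f) (my own sketch), assuming the members share a common variance and K is fairly large; simulate_member_preds is a hypothetical user-supplied function that draws a fresh T and Θ_1, ..., Θ_K and returns the K member predictions at one fixed x.

```python
import numpy as np

def ensemble_variance_check(simulate_member_preds, n_replicates=500, seed=0):
    rng = np.random.default_rng(seed)
    members = np.array([simulate_member_preds(rng) for _ in range(n_replicates)])
    F = members.mean(axis=1)                     # ensemble prediction in each replicate
    var_F = F.var()                              # left-hand side: Var(F)
    var_f = members.var()                        # member variance Var(f), pooled over k
    K = members.shape[1]
    corr = np.corrcoef(members, rowvar=False)    # K x K correlations across replicates
    rho_bar = (corr.sum() - K) / (K * (K - 1))   # mean between-member correlation
    return var_F, rho_bar * var_f                # these should roughly agree for large K
```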

Page 26: WALD LECTURE 1


THE MESSAGE

A big win is possible with weak learners as long as their correlation and bias are low.

In the sine curve example, the base predictor connects all points in order of x(n).

bias^2 = .000, variance = .166

For the ensemble

bias^2 = .042, variance = .0001

Bagging is of this type--each predictor is grown on a bootstrap sample, which amounts to a random vector Θ that puts weights 0, 1, 2, 3, ... on the cases in the training set.
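A sketch of bagging under these conventions, using scikit-learn regression trees as the base learners (my choice, not the lecture's) and one-dimensional x; drawing n indices with replacement is exactly the random 0, 1, 2, 3, ... weighting of the cases described above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_trees(x_train, y_train, n_trees=100, seed=0):
    # Each predictor is grown on a bootstrap sample of the training set.
    rng = np.random.default_rng(seed)
    n = len(x_train)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)                  # sample n cases with replacement
        tree = DecisionTreeRegressor()
        tree.fit(x_train[idx].reshape(-1, 1), y_train[idx])
        trees.append(tree)
    # The bagged prediction is the average of the individual trees.
    return lambda x: np.mean(
        [t.predict(np.asarray(x).reshape(-1, 1)) for t in trees], axis=0)
```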

But bagging does not produce the lowest possible correlation between the predictors. There are variants that produce lower correlation and better accuracy.

This Point Will Turn Up Again Later.

Page 27: WALD LECTURE 1


BOOSTING--A STRANGE ALGORITHM

I discuss boosting, in part, to illustrate the difference between the machine learning community and statistics in terms of theory.

But mainly because the story of boosting is fascinating and multifaceted.

Boosting is a classification algorithm that gives consistently lower error rates than bagging.

Bagging works by taking a bootstrap sample from the training set.

Boosting works by changing the weights on the training set.

It assumes that the predictor construction can incorporate weights on the cases.

The procedure for growing the ensemble is: use the current weights to grow a predictor.

Then, depending on the training set errors of this predictor, change the weights and grow the next predictor.

Page 28: WALD LECTURE 1

THE ADABOOST ALGORITHM

The weights w(n) on the nth case in the training set are non-negative and sum to one. Originally they are set equal. The process goes like so:

i) Let w^(k)(n) be the weights for the kth step and f_k the classifier constructed using these weights.

ii) Run the training set down f_k and let d(n) = 1 if the nth case is classified in error, zero otherwise.

iii) The weighted error is ε_k = Σ_n w^(k)(n) d(n). Set β_k = (1 - ε_k)/ε_k.

iv) The new weights are w^(k+1)(n) = w^(k)(n) β_k^d(n) / Σ_n w^(k)(n) β_k^d(n).

v) Voting for class is weighted, with the kth classifier having vote weight β_k.
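A sketch of steps i)-v) for the two-class case with labels -1/+1 and stumps as the weak learners; following the standard Freund-Schapire formulation I use log β_k as the voting weight in step v), and the early exit on degenerate rounds is my own addition.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(x_train, y_train, n_rounds=100):
    n = len(y_train)
    w = np.full(n, 1.0 / n)                               # equal starting weights, summing to one
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(x_train, y_train, sample_weight=w)      # i) grow f_k with the current weights
        d = (stump.predict(x_train) != y_train).astype(float)  # ii) d(n) = 1 on errors
        eps = np.sum(w * d)                               # iii) weighted error
        if eps <= 0.0 or eps >= 0.5:
            break                                         # degenerate round; stop early
        beta = (1.0 - eps) / eps
        w = w * beta ** d                                 # iv) upweight the errors ...
        w = w / w.sum()                                   # ... and renormalize to sum to one
        stumps.append(stump)
        alphas.append(np.log(beta))                       # v) vote weight (log beta, the usual choice)

    def predict(x):
        votes = sum(a * s.predict(x) for a, s in zip(alphas, stumps))
        return np.sign(votes)
    return predict
```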

Page 29: WALD LECTURE 1


THE MYSTERY THICKENS

Adaboost created a big splash in machine learning and led to hundreds, perhaps thousands, of papers. It was the most accurate classification algorithm available at that time.

It differs significantly from bagging. Bagging uses the biggest trees possible as the weak learners in order to reduce bias.

Adaboost uses small trees as the weak learners, often being effective using trees formed by a single split (stumps).

There is empirical evidence that it reduces bias as well as variance.

It seemed to converge, with the test set error gradually decreasing as hundreds or thousands of trees were added.

On simulated data its error rate is close to the Bayes rate.

But why it worked so well was a mystery that bothered me. For the last five years I have characterized the understanding of Adaboost as the most important open problem in machine learning.

Page 30: WALD LECTURE 1


IDEAS ABOUT WHY IT WORKS

A. Adaboost raises the weights on cases previously misclassified, so it focuses on the hard cases, with the easy cases just carried along.

Wrong: empirical results showed that Adaboost tried to equalize the misclassification rate over all cases.

B. The margin explanation: ingenious work by Schapire et al. derived an upper bound on the error rate of a convex combination of predictors in terms of the VC dimension of each predictor in the ensemble and the margin distribution.

The margin for the nth case is the vote in the ensemble for the correct class minus the largest vote for any of the other classes.
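A small sketch of this margin computation (my own names), assuming the ensemble votes have already been tallied into a case-by-class matrix; normalizing the votes to sum to one per case puts the margins in [-1, 1].

```python
import numpy as np

def margins(vote_matrix, y_true):
    # vote_matrix[n, j]: total vote weight case n received for class j
    # y_true[n]: index of the correct class for case n
    votes = np.asarray(vote_matrix, dtype=float)
    votes = votes / votes.sum(axis=1, keepdims=True)
    n = votes.shape[0]
    correct = votes[np.arange(n), y_true]                 # vote for the correct class
    others = votes.copy()
    others[np.arange(n), y_true] = -np.inf                # mask the correct class out
    return correct - others.max(axis=1)                   # margin per case
```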

The authors conjectured that Adaboost was so powerful because it produced high margin distributions.

I devised and published an algorithm that produced uniformly higher margin distributions than Adaboost, and yet was less accurate.

So much for margins.

Page 31: WALD LECTURE 1


THE DETECTIVE STORY CONTINUES

There was little continuing interest in machine learning about how and why Adaboost worked.

Often the two communities, statistics and machine learning, ask different questions.

Machine Learning: Does it work?
Statisticians: Why does it work?

One breakthrough occurred in my work in 1997.

In the two-class problem, label the classes as -1,+1. Then all the classifiers in the ensemble also take the values -1,+1.

Denote by F(x_n) any ensemble evaluated at x_n. If F(x_n) > 0 the prediction is class 1, else class -1.

On average, we want y_n F(x_n) to be as large as possible. Consider trying to minimize

Σ_n φ(y_n F_m(x_n))

where φ(x) is decreasing.

Page 32: WALD LECTURE 1


GAUSS-SOUTHWELL

The Gauss-Southwell method for minimizing a differentiable function g(x_1, ..., x_m) of m real variables goes this way:

i) At a point x, compute all the partial derivatives ∂g(x_1, ..., x_m)/∂x_k.

ii) Let the minimum of these be at x_j. Find the step size α that minimizes g(x_1, ..., x_j + α, ..., x_m).

iii) Let the new x be (x_1, ..., x_j + α, ..., x_m) for the minimizing value of α.
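A sketch of the three steps (my own), using a crude grid line search over α in place of an exact one-dimensional minimization; g returns the function value and grad its vector of partial derivatives.

```python
import numpy as np

def gauss_southwell(g, grad, x0, n_steps=100,
                    alphas=np.linspace(-1.0, 1.0, 201)):
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        j = int(np.argmin(grad(x)))                            # i) coordinate with the smallest partial derivative
        candidates = x + alphas[:, None] * np.eye(len(x))[j]   # ii) vary only coordinate j
        values = np.array([g(c) for c in candidates])
        x = candidates[np.argmin(values)]                      # iii) keep the minimizing alpha
    return x
```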

Page 33: WALD LECTURE 1


GAUSS-SOUTHWELL AND ADABOOST

To minimize Σ_n exp(-y_n F(x_n)) using Gauss-Southwell, denote the current ensemble as

F_m(x_n) = Σ_{k=1}^{m} a_k f_k(x_n)

i) Find the k = k* that minimizes (∂/∂f_k) Σ_n exp(-y_n F_m(x_n)).

ii) Find the a = a* that minimizes Σ_n exp(-y_n [F_m(x_n) + a f_{k*}(x_n)]).

iii) Then F_{m+1}(x_n) = F_m(x_n) + a* f_{k*}(x_n).

The Adaboost algorithm is identical to Gauss-Southwell as applied above.

This gave a rational basis for the odd form of Adaboost. Following it was a plethora of papers in machine learning proposing other functions of yF(x) to minimize using Gauss-Southwell.
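For completeness, here is the standard calculation, not spelled out on the slide, that connects the Gauss-Southwell step size to the β_k of Adaboost; it assumes the weak classifiers take values in {-1, +1}. For a fixed weak classifier f_{k*}, write w(n) for the current normalized weights and ε for its weighted error. Then

J(a) = Σ_n exp(-y_n [F_m(x_n) + a f_{k*}(x_n)]) ∝ Σ_n w(n) exp(-a y_n f_{k*}(x_n)) = e^{-a} (1 - ε) + e^{a} ε.

Setting dJ/da = 0 gives a* = (1/2) log((1 - ε)/ε) = (1/2) log β, so the exponential-loss step size is half the log of the Adaboost β.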

Page 34: WALD LECTURE 1


NAGGING QUESTIONS

The classification by the mth ensemble is defined by

sign(F_m(x))

The most important question I chewed on:

Is Adaboost consistent?

Does P(Y ≠ sign(F_m(X, T))) converge to the Bayes risk as m → ∞ and then |T| → ∞?

I am not a fan of endless asymptotics, but I believe that we need to know whether predictors are consistent or inconsistent.

For five years I have been bugging my theoretical colleagues with these questions.

For a long time I thought the answer was yes.

There was a paper three years ago which claimed that Adaboost overfit after 100,000 iterations, but I ascribed that to numerical roundoff error.

Page 35: WALD LECTURE 1


THE BEGINNING OF THE END

In 2000, I looked at the analog of Adaboost in population space, i.e. using the Gauss-Southwell approach to minimize

E_{Y,X} exp(-Y F(X))

The weak classifiers were the set of all trees with a fixed (large enough) number of terminal nodes.

Under some compactness and continuity conditions I proved that:

F_m → F in L_2(P)

P(Y ≠ sign(F(X))) = Bayes risk

But there was a fly in the ointment.

Page 36: WALD LECTURE 1


THE FLY

Recall the notation

F_m(x) = Σ_{k=1}^{m} a_k f_k(x)

An essential part of the proof in the population case was showing that:

Σ_k a_k^2 < ∞

But in the N-sample case, one can show that

a_k ≥ 2/N

So there was an essential difference between the population case and the finite sample case, no matter how large N.

Page 37: WALD LECTURE 1


ADABOOST IS INCONSISTENT

Recent work by Jiang, Lugosi, and Bickel-Ritov has clarified the situation.

The graph below illustrates. The 2-dimensional data consist of two circular Gaussians, with about 150 cases in each and some overlap. Error is estimated using a 5000-case test set. Stumps (one-split trees) were used.

ADABOOST ERROR RATE

[Figure: test set error rate (ranging from about 23 to 27) versus number of trees in thousands (0 to 50).]

The minimum occurs at about 5000 trees. Then the error rate begins climbing.

Page 38: WALD LECTURE 1


WHAT ADABOOST DOES

In its first stage, Adaboost tries to emulate the population version. This continues for thousands of trees. Then it gives up and moves into a second phase of increasing error.

Both Jiang and Bickel-Ritov have proofs that for each sample size N there is a stopping time h(N) such that if Adaboost is stopped at h(N), the resulting sequence of ensembles is consistent.

There are still questions--what is happening in the second phase? But this will come in the future.

For years I have been telling everyone in earshot that the behavior of Adaboost, particularly consistency, is a problem that plagues machine learning.

Its solution is at the fascinating interface between algorithmic behavior and statistical theory.

THANK YOU
