1
A Black-Box approach to machine learning
Yoav Freund
2
Why do we need learning?
• Computers need functions that map highly variable data:
 - Speech recognition: audio signal -> words
 - Image analysis: video signal -> objects
 - Bio-informatics: micro-array images -> gene function
 - Data mining: transaction logs -> customer classification
• For accuracy, functions must be tuned to fit the data source.
• For real-time processing, function computation has to be very fast.
3
The complexity/accuracy tradeoff
[Figure: error as a function of complexity, with the level of trivial performance marked.]
4
The speed/flexibility tradeoff
[Figure: flexibility vs. speed, with Matlab code, Java code, machine code, digital hardware, and analog hardware placed along the trade-off.]
5
Theory vs. Practice
• Theoretician: I want a polynomial-time algorithm that is guaranteed to perform arbitrarily well in “all” situations. - I prove theorems.
• Practitioner: I want a real-time algorithm that performs well on my problem. - I experiment.
• My approach: I want to combine algorithms whose performance and speed are guaranteed relative to the performance and speed of their components. - I do both.
6
Plan of talk
• The black-box approach
• Boosting
• Alternating decision trees
• A commercial application
• Boosting the margin
• Confidence rated predictions
• Online learning
7
The black-box approach
• Statistical models are not generators, they are predictors.
• A predictor is a function from observation X to action Z.
• After the action is taken, outcome Y is observed, which implies a loss L (a real-valued number).
• Goal: find a predictor with small loss (in expectation, with high probability, cumulative, …)
8
Main software components
• A predictor: a function mapping an observation x to an action z.
• A learner: maps training examples $(x_1,y_1), (x_2,y_2), \ldots, (x_m,y_m)$ to a predictor.
We assume the predictor will be applied to examples similar to those on which it was trained.
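As a minimal sketch (not from the talk), the two black-box components can be written as plain function types; the names and the toy learner below are illustrative only:

```python
from typing import Any, Callable, List, Tuple

# A predictor: a function from an observation x to an action z.
Predictor = Callable[[Any], Any]

# A learner: maps training examples (x1, y1), ..., (xm, ym) to a predictor.
Learner = Callable[[List[Tuple[Any, Any]]], Predictor]


def nearest_neighbor_learner(examples: List[Tuple[float, Any]]) -> Predictor:
    """Toy learner: predict the label of the closest stored example (numeric x only)."""
    def predictor(x: float) -> Any:
        _, label = min(examples, key=lambda ex: abs(ex[0] - x))
        return label
    return predictor
```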
9
Learning in a system
[Diagram: training examples feed a learning system, which outputs a predictor; the predictor sits inside the target system, mapping sensor data to actions, with a feedback path back to the learning system.]
10
Special case: Classification
• Observation X - arbitrary (measurable) space
• Prediction Z - {1,…,K}: $\hat y \in Z$; usually K=2 (binary classification)
• Outcome Y - finite set {1,…,K}: $y \in Y$
• Loss: $L(\hat y, y) = \begin{cases} 1 & \text{if } y \ne \hat y \\ 0 & \text{if } y = \hat y \end{cases}$
11
Batch learning for binary classification
• Data distribution: $(x,y) \sim D$; $y \in \{-1,+1\}$
• Generalization error: $\varepsilon(h) \doteq P_{(x,y)\sim D}\big(h(x) \ne y\big)$
• Training set: $T = (x_1,y_1), (x_2,y_2), \ldots, (x_m,y_m)$; $T \sim D^m$
• Training error: $\hat\varepsilon(h) \doteq \frac{1}{m}\sum_{(x,y)\in T} 1\big[h(x) \ne y\big] \doteq P_{(x,y)\sim T}\big[h(x) \ne y\big]$
12
Boosting
Combining weak learners
13
A weighted training set
$(x_1,y_1,w_1), (x_2,y_2,w_2), \ldots, (x_m,y_m,w_m)$
• Feature vectors $x_i$
• Binary labels $y_i \in \{-1,+1\}$
• Positive weights $w_i$
14
A weak learner
A weak learner takes a weighted training set $(x_1,y_1,w_1), (x_2,y_2,w_2), \ldots, (x_m,y_m,w_m)$ (instances $x_1, x_2, \ldots, x_m$) and outputs a weak rule $h$ with predictions $\hat y_1, \hat y_2, \ldots, \hat y_m$; $\hat y_i \in \{0,1\}$.

The weak requirement:
$\frac{\sum_{i=1}^m y_i \hat y_i w_i}{\sum_{i=1}^m w_i} > \gamma > 0$
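One simple weak learner that fits this interface is a decision stump. The sketch below (illustrative, not from the talk) searches single-feature thresholds for the stump with the largest weighted edge:

```python
import numpy as np

def stump_weak_learner(X, y, w):
    """Search single-feature thresholds for the stump h(x) = 1[x_j > theta]
    (or 1[x_j <= theta]) with the largest weighted edge
        sum_i y_i h(x_i) w_i / sum_i w_i,
    matching the weak requirement above (labels in {-1,+1}, predictions in {0,1})."""
    m, d = X.shape
    total_w = w.sum()
    best_edge, best_rule = -np.inf, None
    for j in range(d):
        for theta in np.unique(X[:, j]):
            for above in (True, False):                 # fire above, or at/below, theta
                fired = X[:, j] > theta if above else X[:, j] <= theta
                edge = (w[fired] * y[fired]).sum() / total_w
                if edge > best_edge:
                    best_edge, best_rule = edge, (j, theta, above)
    j, theta, above = best_rule

    def h(X_new):
        X_new = np.atleast_2d(X_new)
        fired = X_new[:, j] > theta if above else X_new[:, j] <= theta
        return fired.astype(int)                        # predictions in {0,1}
    return h, best_edge
```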
15
The boosting process
Start with uniform weights: $(x_1,y_1,1/n), \ldots, (x_n,y_n,1/n)$ -> weak learner -> $h_1$.
Reweight the examples: $(x_1,y_1,w_1), \ldots, (x_n,y_n,w_n)$ -> weak learner -> $h_2$; repeat to obtain $h_3, h_4, \ldots, h_T$.

Final rule:
$F_T(x) = \alpha_1 h_1(x) + \alpha_2 h_2(x) + \cdots + \alpha_T h_T(x)$
$f_T(x) = \mathrm{sign}\big(F_T(x)\big)$
16
Adaboost

$F_0(x) \equiv 0$
for $t = 1, \ldots, T$:
  $w_i^t = \exp\big(-y_i F_{t-1}(x_i)\big)$
  Get $h_t$ from the weak learner
  $\alpha_t = \ln\left(\frac{\sum_{i:\, h_t(x_i)=1,\, y_i=1} w_i^t}{\sum_{i:\, h_t(x_i)=1,\, y_i=-1} w_i^t}\right)$
  $F_{t+1} = F_t + \alpha_t h_t$
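A minimal sketch of this loop (assuming a weak learner with the interface of the stump sketch above; the smoothing constant `eps` is an implementation convenience, not part of the slide):

```python
import numpy as np

def adaboost(X, y, weak_learner, T, eps=1e-12):
    """Run the loop above: weights w_i = exp(-y_i F(x_i)), call the weak learner,
    set alpha_t from the log-ratio of fired positives to fired negatives, and add
    alpha_t * h_t to the score F.  weak_learner(X, y, w) returns a rule with
    outputs in {0,1}, as in the stump sketch."""
    F = np.zeros(len(y))                          # F_0(x_i) = 0
    rules, alphas = [], []
    for t in range(T):
        w = np.exp(-y * F)                        # w_i^t = exp(-y_i F_{t-1}(x_i))
        h, _ = weak_learner(X, y, w)              # get h_t from the weak learner
        fired = h(X)                              # h_t(x_i) in {0,1}
        w_pos = w[(fired == 1) & (y == +1)].sum()
        w_neg = w[(fired == 1) & (y == -1)].sum()
        alpha = np.log((w_pos + eps) / (w_neg + eps))
        F = F + alpha * fired                     # F_{t+1} = F_t + alpha_t h_t
        rules.append(h)
        alphas.append(alpha)

    def f(X_new):
        score = sum(a * h(X_new) for a, h in zip(alphas, rules))
        return np.sign(score)                     # final rule f_T(x) = sign(F_T(x))
    return f
```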
17
Main property of Adaboost
If the advantages of the weak rules over random guessing are $\gamma_1, \gamma_2, \ldots, \gamma_T$, then the training error of the final rule is at most

$\hat\varepsilon(f_T) \le \exp\left(-\sum_{t=1}^T \gamma_t^2\right)$
18
Boosting block diagram
[Diagram: the strong learner consists of a booster and a weak learner; the booster sends example weights to the weak learner, the weak learner returns a weak rule, and the strong learner outputs an accurate rule.]
19
What is a good weak learner?
The set of weak rules (features) should be:
• Flexible enough to be (weakly) correlated with most conceivable relations between feature vector and label.
• Simple enough to allow efficient search for a rule with non-trivial weighted training error.
• Small enough to avoid over-fitting.
Calculation of the prediction from the observations should be very fast.
20
Alternating decision trees
Freund, Mason 1997
21
Decision Trees
[Figure: a decision tree that first tests X>3 and then Y>5, with leaves labeled +1 and -1, shown next to the corresponding partition of the (X,Y) plane at X=3 and Y=5 into +1 and -1 regions.]
22
A decision tree as a sum of weak rules.
[Figure: the same tree rewritten as the sign of a sum of simple rules: a constant -0.2 plus rules with values ±0.1, ±0.2, and -0.3 attached to the tests X>3 and Y>5; the prediction is the sign of the total, giving the same +1/-1 partition of the (X,Y) plane.]
23
An alternating decision tree
[Figure: an alternating decision tree in which prediction nodes (values such as -0.2, +0.2, -0.3, +0.1, -0.1, +0.7, 0.0) alternate with decision nodes Y>5, X>3, and Y<1; an instance's score is the sum of the prediction values along all paths whose conditions it satisfies, and the class is the sign of that sum, shown next to the corresponding partition of the (X,Y) plane into +1 and -1 regions.]
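The scoring rule can be sketched in a few lines; the tree below is illustrative, loosely following the slide's numbers rather than reproducing its exact figure:

```python
def adt_score(x, node):
    """node = (prediction_value, splitters); each splitter is a triple
    (condition, child_if_true, child_if_false).  The score of x adds up the
    prediction values of every node reached on paths whose conditions x
    satisfies; the class is the sign of the total score."""
    value, splitters = node
    score = value
    for condition, if_true, if_false in splitters:
        score += adt_score(x, if_true if condition(x) else if_false)
    return score

# Illustrative alternating decision tree (values and tests loosely from the slide).
example_adt = (
    -0.2,
    [(lambda p: p["Y"] > 5,
      (+0.2, [(lambda p: p["X"] > 3, (+0.1, []), (-0.1, []))]),
      (-0.3, [])),
     (lambda p: p["Y"] < 1,
      (+0.7, []),
      (0.0, []))])

print(adt_score({"X": 4.0, "Y": 6.0}, example_adt))   # positive score => class +1
```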
24
Example: Medical Diagnostics
• Cleve dataset from UC Irvine database.
• Heart disease diagnostics (+1 = healthy, -1 = sick)
• 13 features from tests (real-valued and discrete).
• 303 instances.
25
AD-tree for heart-disease diagnostics
Score > 0: Healthy; score < 0: Sick
26
Commercial Deployment.
27
AT&T “buisosity” problem
Freund, Mason, Rogers, Pregibon, Cortes 2000
• Distinguish business/residence customers from call detail information (time of day, length of call, …).
• 230M telephone numbers, label unknown for ~30%
• 260M calls / day
• Required computer resources:
 - Huge: counting log entries to produce statistics -- use specialized I/O-efficient sorting algorithms (Hancock).
 - Significant: calculating the classification for ~70M customers.
 - Negligible: learning (2 hours on 10K training examples on an off-line computer).
28
AD-tree for “buisosity”
29
AD-tree (Detail)
30
Quantifiable results
• At 94% accuracy, coverage increased from 44% to 56%.
• Saved AT&T $15M in the year 2000 in operations costs and missed opportunities.
[Figure: precision/recall curve, accuracy as a function of score.]
31
Adaboost’s resistance to over-fitting
Why statisticians find Adaboost interesting.
32
A very curious phenomenon
Boosting decision trees
Using <10,000 training examples we fit >2,000,000 parameters
33
Large margins
$\mathrm{margin}_{F_T}(x,y) \doteq y\,\frac{\sum_{t=1}^T \alpha_t h_t(x)}{\sum_{t=1}^T \alpha_t} = y\,\frac{F_T(x)}{\|\vec\alpha\|_1}$

$\mathrm{margin}_{F_T}(x,y) > 0 \iff f_T(x) = y$

Thesis: large margins => reliable predictions

Very similar to SVM.
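As a small sketch (assuming the $\alpha_t$ and the values $h_t(x)$ of a boosted classifier are already available), the normalized margin of one example can be computed directly:

```python
import numpy as np

def normalized_margin(alphas, rule_values, y):
    """margin_{F_T}(x, y) = y * sum_t alpha_t h_t(x) / sum_t |alpha_t|.
    The margin lies in [-1, +1] for {-1,+1}-valued rules and is positive
    exactly when the voted classifier f_T gets the example right."""
    alphas = np.asarray(alphas, dtype=float)
    rule_values = np.asarray(rule_values, dtype=float)   # h_t(x) for one example
    return y * float(np.dot(alphas, rule_values)) / float(np.abs(alphas).sum())
```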
34
Experimental Evidence
35
Theorem (Schapire, Freund, Bartlett & Lee, Annals of Statistics 1998)

H: set of binary functions with VC-dimension d

$C = \left\{\sum_i \alpha_i h_i \;\middle|\; h_i \in H,\ \alpha_i > 0,\ \sum_i \alpha_i = 1\right\}$

$T = (x_1,y_1), (x_2,y_2), \ldots, (x_m,y_m)$; $T \sim D^m$

$\forall c \in C,\ \forall \theta > 0$, with probability $1-\delta$ w.r.t. $T \sim D^m$:

$P_{(x,y)\sim D}\big[\mathrm{sign}(c(x)) \ne y\big] \le P_{(x,y)\sim T}\big[\mathrm{margin}_c(x,y) \le \theta\big] + \tilde O\!\left(\frac{\sqrt{d/m}}{\theta}\right) + O\!\left(\log\frac{1}{\delta}\right)$

No dependence on the number of combined functions!!!
36
Idea of Proof
37
Confidence rated predictions
Agreement gives confidence
38
A motivating example
[Figure: a scatter of positive (+) and negative (-) examples forming two well-separated clusters, with a few query points marked "?"; points that fall between the clusters or far from all of the data are labeled "Unsure".]
39
The algorithm
Freund, Mansour, Schapire 2001

Parameters: $\eta > 0$, $\Delta > 0$

Hypothesis weight: $w(h) \doteq e^{-\eta\,\hat\varepsilon(h)}$

Empirical log ratio: $\hat l_\eta(x) \doteq \frac{1}{\eta}\ln\!\left(\frac{\sum_{h:\,h(x)=+1} w(h)}{\sum_{h:\,h(x)=-1} w(h)}\right)$

Prediction rule:
$\hat p_{\eta,\Delta}(x) = \begin{cases} +1 & \text{if } \hat l(x) > \Delta \\ \{-1,+1\} & \text{if } |\hat l(x)| \le \Delta \\ -1 & \text{if } \hat l(x) < -\Delta \end{cases}$
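A minimal sketch of this rule, assuming a finite list of hypotheses and their training errors are already available (interface names and the tiny smoothing constant are illustrative):

```python
import numpy as np

def confidence_rated_predictor(hypotheses, train_errors, eta, delta):
    """Weight each hypothesis by w(h) = exp(-eta * err_hat(h)), compute the empirical
    log ratio l_hat(x), and output +1, -1, or {-1,+1} (abstain) according to whether
    l_hat(x) is above delta, below -delta, or in between."""
    w = np.exp(-eta * np.asarray(train_errors, dtype=float))

    def predict(x, tiny=1e-300):
        votes = np.array([h(x) for h in hypotheses])          # each h(x) in {-1,+1}
        l_hat = np.log((w[votes == +1].sum() + tiny) /
                       (w[votes == -1].sum() + tiny)) / eta
        if l_hat > delta:
            return {+1}
        if l_hat < -delta:
            return {-1}
        return {-1, +1}                                       # abstain: "unsure"
    return predict
```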
40
Suggested tuning

Suppose H is a finite set. For $0 < \theta < \frac{1}{4}$, set

$\eta = \ln(8|H|)\, m^{1/2-\theta}$, $\qquad \Delta = \frac{2\ln(2/\delta)}{m} + \frac{\ln(8|H|)}{8\,m^{1/2+\theta}}$

Yields:

1) $P(\text{mistake}) = P_{(x,y)\sim D}\big(y \notin \hat p(x)\big) = 2\varepsilon(h^*) + O\!\left(\frac{\ln m}{m^{1/2-\theta}}\right)$

2) for $m = \Omega\!\left(\big(\ln(1/\delta)\,\ln|H|\big)^{1/\theta}\right)$:

$P(\text{abstain}) = P_{(x,y)\sim D}\big(\hat p(x) = \{-1,+1\}\big) = 5\varepsilon(h^*) + O\!\left(\frac{\ln(1/\delta) + \ln|H|}{m^{1/2-\theta}}\right)$
41
Confidence Rating block diagram
[Diagram: candidate rules, together with training examples $(x_1,y_1), (x_2,y_2), \ldots, (x_m,y_m)$, feed a rater-combiner, which outputs a confidence-rated rule.]
42
Face Detection
• Paul Viola and Mike Jones developed a face detector that can work in real time (15 frames per second).
Viola & Jones 1999
43
Using confidence to save time
The detector combines 6000 simple features using Adaboost.
In most boxes, only 8-9 features are calculated.
[Diagram: all boxes are scored on feature 1; boxes that are definitely not a face are rejected immediately, and only boxes that might be a face go on to feature 2, and so on.]
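The idea can be sketched as a rejection cascade (a simplified stand-in for the detector described above, not the authors' actual code; the stage functions and thresholds are illustrative):

```python
def cascade_detect(box, stages):
    """stages: list of (score_fn, reject_threshold) pairs, cheapest features first.
    A box is dropped ("definitely not a face") as soon as its running score falls
    below a stage's threshold, so most boxes never reach the expensive features;
    only boxes that survive every stage "might be a face"."""
    score = 0.0
    for score_fn, reject_threshold in stages:
        score += score_fn(box)
        if score < reject_threshold:
            return False
    return True
```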
44
Using confidence to train car detectors
45
Original Image Vs. difference image
46
Co-training
[Diagram: highway images are processed into two views, the raw B/W image and the difference image; a partially trained B/W-based classifier and a partially trained difference-based classifier each pass their confident predictions to the other as additional training labels.]
Blum and Mitchell 98
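A rough sketch of the loop, assuming scikit-learn-style classifiers exposing `fit` and `decision_function` and two view-extraction functions; all names are illustrative and this is not the exact procedure of the cited papers:

```python
import numpy as np

def co_train(view_a, view_b, clf_a, clf_b, labeled, unlabeled, rounds, threshold):
    """Each classifier is trained on its own view of the labeled data; every round,
    examples on which one classifier is confident are given that label and added to
    the *other* classifier's training set."""
    data_a = [(view_a(x), y) for x, y in labeled]
    data_b = [(view_b(x), y) for x, y in labeled]
    pool = list(unlabeled)
    for _ in range(rounds):
        clf_a.fit(*map(list, zip(*data_a)))
        clf_b.fit(*map(list, zip(*data_b)))
        remaining = []
        for x in pool:
            score_a = float(clf_a.decision_function([view_a(x)])[0])
            score_b = float(clf_b.decision_function([view_b(x)])[0])
            if abs(score_a) > threshold:                  # A confident: teach B
                data_b.append((view_b(x), int(np.sign(score_a))))
            elif abs(score_b) > threshold:                # B confident: teach A
                data_a.append((view_a(x), int(np.sign(score_b))))
            else:
                remaining.append(x)
        pool = remaining
    return clf_a, clf_b
```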
47
Co-Training Results
[Figure: results of the raw-image detector and the difference-image detector, before and after co-training.]
Levin, Freund, Viola 2002
48
Selective sampling
[Diagram: unlabeled data is scored by a partially trained classifier; a sample of the unconfident examples is sent out for labeling, and the resulting labeled examples are fed back to train the classifier.]
Query by Committee: Seung, Opper & Sompolinsky; Freund, Seung, Shamir & Tishby
49
Online learning
Adapting to changes
50
Online learning
So far, the only statistical assumption was that the data are generated IID.

Can we get rid of that assumption?

Yes, if we consider prediction as a repeated game.

Suppose we have a set of experts; we believe one is good, but we don't know which one.

An expert is an algorithm that maps the past $(x_1,y_1), (x_2,y_2), \ldots, (x_{t-1},y_{t-1}), x_t$ to a prediction $z_t$.
51
Online prediction game
For $t = 1, \ldots, T$:
• Experts generate predictions: $z_t^1, z_t^2, \ldots, z_t^N$
• The algorithm makes its own prediction: $\zeta_t$
• Nature generates the outcome: $y_t$

Total loss of expert i: $L_T^i = \sum_{t=1}^T L\big(z_t^i, y_t\big)$

Total loss of the algorithm: $L_T^A = \sum_{t=1}^T L\big(\zeta_t, y_t\big)$

Goal: for any sequence of events,
$L_T^A \le \min_i L_T^i + o(T)$
52
A very simple example
• Binary classification
• N experts
• One expert is known to be perfect
• Algorithm: predict like the majority of the experts that have made no mistake so far.
• Bound: $L_A \le \log_2 N$
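A sketch of that algorithm (often called the halving algorithm); the input format is illustrative:

```python
def halving(expert_predictions, outcomes):
    """expert_predictions[t][i] in {-1,+1} is expert i's prediction at round t;
    outcomes[t] in {-1,+1}.  Keep the experts that have never erred and predict
    with their majority vote; if some expert is perfect, each mistake removes at
    least half of the survivors, so the mistakes are at most log2(N)."""
    alive = set(range(len(expert_predictions[0])))
    mistakes = 0
    for preds, y in zip(expert_predictions, outcomes):
        vote = sum(preds[i] for i in alive)
        guess = +1 if vote >= 0 else -1
        if guess != y:
            mistakes += 1
        alive = {i for i in alive if preds[i] == y}   # drop experts that erred
    return mistakes
```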
53
History of online learning
• Littlestone & Warmuth
• Vovk
• Vovk and Shafer's recent book: "Probability and Finance: It's Only a Game!"
• Innumerable contributions from many fields: Hannan, Blackwell, Davison, Gallager, Cover, Barron, Foster & Vohra, Fudenberg & Levine, Feder & Merhav, Shtarkov, Rissanen, Cesa-Bianchi, Lugosi, Blum, Freund, Schapire, Valiant, Auer, …
54
Lossless compression
X - arbitrary input space
Y - {0,1}
Z - [0,1]

Log loss: $L(z,y) = y\log_2\frac{1}{z} + (1-y)\log_2\frac{1}{1-z}$

Entropy, lossless compression, MDL.
Statistical likelihood, standard probability theory.
55
Bayesian averaging
$\zeta_t = \frac{\sum_{i=1}^N w_t^i z_t^i}{\sum_{i=1}^N w_t^i}; \qquad w_t^i = 2^{-L_{t-1}^i}$

$\forall T > 0:\quad L_T^A = \log_2\sum_{i=1}^N w_1^i - \log_2\sum_{i=1}^N w_T^i \le \min_i L_T^i + \log_2 N$

Folk theorem in Information Theory
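A small sketch of this predictor for the log-loss setting above (experts output probabilities $z_t^i = P(y_t = 1)$; the array shapes and function name are illustrative):

```python
import numpy as np

def bayesian_average(expert_probs, outcomes):
    """expert_probs[t, i] = expert i's probability that y_t = 1; outcomes[t] in {0, 1}.
    Weights are w_t^i = 2^(-L_{t-1}^i), so the prediction is the Bayes mixture of the
    experts under log loss."""
    T, N = expert_probs.shape
    cum_loss = np.zeros(N)                     # L_{t-1}^i, in bits
    alg_loss = 0.0
    for z, y in zip(expert_probs, outcomes):
        w = np.power(2.0, -cum_loss)
        zeta = float(np.dot(w, z) / w.sum())   # algorithm's prediction zeta_t
        alg_loss += -np.log2(zeta if y == 1 else 1.0 - zeta)
        cum_loss += -np.log2(np.where(y == 1, z, 1.0 - z))
    return alg_loss, cum_loss                  # total loss, per-expert losses
```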
56
Game theoretical loss
X - arbitrary space

Y - a loss for each of N actions: $y \in [0,1]^N$

Z - a distribution over N actions: $p \in [0,1]^N$, $\|p\|_1 = 1$

Loss: $L(p,y) = p \cdot y = E_{i\sim p}\big[y_i\big]$
57
Learning in games
An algorithm which knows T in advance guarantees:

$L_T^A \le \min_i L_T^i + \sqrt{2T\ln N} + \ln N$

Freund and Schapire 94
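A sketch of an exponential-weights algorithm of this kind (the learning rate below is a common choice for known T, not necessarily the exact tuning in the paper):

```python
import numpy as np

def exponential_weights(loss_matrix, eta=None):
    """loss_matrix[t, i] in [0, 1] is the loss of action i at round t.  Play the
    distribution p_t proportional to exp(-eta * cumulative loss) and suffer p_t . y_t;
    with eta ~ sqrt(2 ln N / T) the regret is O(sqrt(T ln N))."""
    T, N = loss_matrix.shape
    if eta is None:
        eta = np.sqrt(2.0 * np.log(N) / T)      # requires knowing T in advance
    cum_loss = np.zeros(N)
    alg_loss = 0.0
    for y in loss_matrix:
        p = np.exp(-eta * cum_loss)
        p /= p.sum()
        alg_loss += float(np.dot(p, y))
        cum_loss += y
    return alg_loss, alg_loss - cum_loss.min()  # total loss, regret to the best action
```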
58
Multi-arm bandits
Auer, Cesa-Bianchi, Freund, Schapire 95

The algorithm cannot observe the full outcome $y_t$. Instead, a single $i_t \in \{1,\ldots,N\}$ is chosen at random according to $p_t$, and only $y_t^{i_t}$ is observed.

We describe an algorithm that guarantees, with probability $1-\delta$:

$L_T^A - \min_i L_T^i = O\!\left(\sqrt{NT\,\ln\frac{NT}{\delta}}\right)$
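A sketch in the spirit of the Exp3 family of bandit algorithms (simplified; the exact algorithm and tuning in the cited paper differ in details, and the callback interface is illustrative):

```python
import numpy as np

def exp3_sketch(loss_of, N, T, gamma=0.1, seed=0):
    """loss_of(t, i) -> loss in [0, 1] of action i at round t; only the chosen
    action's loss is ever queried.  Mix a gamma-fraction of uniform exploration
    into the exponential-weights distribution and update with an importance-
    weighted loss estimate, which stays unbiased despite the partial feedback."""
    rng = np.random.default_rng(seed)
    weights = np.ones(N)
    eta = gamma / N
    total = 0.0
    for t in range(T):
        p = (1.0 - gamma) * weights / weights.sum() + gamma / N
        i = int(rng.choice(N, p=p))
        loss = loss_of(t, i)                        # the only feedback this round
        total += loss
        weights[i] *= np.exp(-eta * loss / p[i])    # importance-weighted update
    return total
```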
59
Why isn’t online learning practical?
• Prescriptions too similar to the Bayesian approach.
• Implementing low-level learning requires a large number of experts.
• Computation increases linearly with the number of experts.
• Potentially very powerful for combining a few high-level experts.
60
Online learning for detector deployment

[Diagram: a face-detector library (e.g., "Merl frontal 1.0": a B/W frontal face detector for indoor scenes, neutral background, direct front-right lighting) is downloaded to an adaptive real-time face detector; images and feedback flow in, face detections flow out, and online learning (OL) adapts the detector in the field.]

Detector can be adaptive!!
61
Summary
• By combining predictors we can:
 - Improve accuracy.
 - Estimate prediction confidence.
 - Adapt on-line.
• To make machine learning practical:
 - Speed up the predictors.
 - Concentrate human feedback on hard cases.
 - Fuse data from several sources.
 - Share predictor libraries.