A Black-Box approach to machine learning
Yoav Freund

TRANSCRIPT

Page 1:

A Black-Box approach to machine learning

Yoav Freund

Page 2:

Why do we need learning?

• Computers need functions that map highly variable data:
  - Speech recognition: Audio signal -> words
  - Image analysis: Video signal -> objects
  - Bio-Informatics: Micro-array Images -> gene function
  - Data Mining: Transaction logs -> customer classification

• For accuracy, functions must be tuned to fit the data source.

• For real-time processing, function computation has to be very fast.

Page 3:

The complexity/accuracy tradeoff

[Figure: prediction error plotted against model complexity, with a horizontal line marking trivial performance.]

Page 4:

The speed/flexibility tradeoff

[Figure: flexibility plotted against speed; Matlab code, Java code, machine code, digital hardware, and analog hardware trade decreasing flexibility for increasing speed.]

Page 5:

Theory Vs. Practice

• Theoretician: I want a polynomial-time algorithm which is guaranteed to perform arbitrarily well in “all” situations.
  - I prove theorems.

• Practitioner: I want a real-time algorithm that performs well on my problem.
  - I experiment.

• My approach: I want combining algorithms whose performance and speed are guaranteed relative to the performance and speed of their components.
  - I do both.

Page 6:

Plan of talk

• The black-box approach
• Boosting
• Alternating decision trees
• A commercial application
• Boosting the margin
• Confidence-rated predictions
• Online learning

Page 7:

The black-box approach

• Statistical models are not generators, they are predictors.

• A predictor is a function from observation X to action Z.

• After the action is taken, outcome Y is observed, which implies a loss L (a real-valued number).

• Goal: find a predictor with small loss (in expectation, with high probability, cumulative…)

Page 8:

Main software components

A predictor: x -> z

A learner: training examples (x1,y1), (x2,y2), …, (xm,ym) -> predictor

We assume the predictor will be applied to examples similar to those on which it was trained.

Page 9:

Learning in a system

[Diagram: training examples feed a Learning System, which outputs a predictor; the predictor runs inside a Target System, mapping sensor data to actions, and the system's feedback flows back as new training examples.]

Page 10:

Special case: Classification

Observation X - arbitrary (measurable) space

Prediction ŷ ∈ Z; Z = {1,…,K}; usually K=2 (binary classification)

Outcome y ∈ Y; Y = finite set {1,…,K}

Loss(ŷ, y) = 1 if y ≠ ŷ; 0 if y = ŷ

Page 11:

Batch learning for binary classification

Data distribution: (x,y) ~ D; y ∈ {-1,+1}

Generalization error: ε(h) ≐ P_{(x,y)~D}[h(x) ≠ y]

Training set: T = (x1,y1), (x2,y2), …, (xm,ym); T ~ D^m

Training error: ε̂(h) ≐ (1/m) Σ_{(x,y)∈T} 1[h(x) ≠ y] ≐ P_{(x,y)~T}[h(x) ≠ y]
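The two empirical quantities above are easy to state in code. A minimal Python sketch (the helper name `training_error` is illustrative, not from the talk):

```python
def training_error(h, T):
    """hat-eps(h): the fraction of the sample T = [(x, y), ...] that
    hypothesis h misclassifies; labels are in {-1, +1}."""
    return sum(1 for x, y in T if h(x) != y) / len(T)

# A threshold hypothesis and a tiny sample standing in for T ~ D^m.
h = lambda x: +1 if x > 0 else -1
T = [(1, +1), (-1, -1), (2, -1), (-2, -1)]
print(training_error(h, T))  # one of four examples is misclassified: 0.25
```

The generalization error ε(h) is the same average taken under the distribution D instead of the sample.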

Page 12:

Boosting

Combining weak learners

Page 13:

A weighted training set

(x1,y1,w1), (x2,y2,w2), …, (xm,ym,wm)

- Feature vectors x_i
- Binary labels y_i ∈ {-1,+1}
- Positive weights w_i

Page 14:

A weak learner

A weak learner maps a weighted training set

(x1,y1,w1), (x2,y2,w2), …, (xm,ym,wm)

(instances x1, x2, …, xm) to a weak rule h with predictions ŷ1, ŷ2, …, ŷm; ŷi ∈ {0,1}.

The weak requirement:

(Σ_{i=1}^m y_i ŷ_i w_i) / (Σ_{i=1}^m w_i) > γ > 0
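The weak requirement is just a weighted correlation between labels and predictions, which a few lines of Python make explicit (a sketch; `weighted_edge` is an illustrative name, and the rules output {0,1} as on the slide):

```python
def weighted_edge(examples, h):
    """Advantage of rule h on a weighted set [(x, y, w), ...]:
    (sum_i w_i * y_i * h(x_i)) / (sum_i w_i), with y in {-1, +1}
    and h(x) in {0, 1}.  The weak requirement asks that some rule
    in the class achieve an edge > gamma > 0."""
    num = sum(w * y * h(x) for x, y, w in examples)
    den = sum(w for _, _, w in examples)
    return num / den

examples = [(0, -1, 1.0), (1, -1, 1.0), (2, +1, 1.0), (3, +1, 1.0)]
stump = lambda x: 1 if x >= 2 else 0   # fires exactly on the positives
print(weighted_edge(examples, stump))  # (1 + 1) / 4 = 0.5
```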

Page 15:

The boosting process

(x1,y1,1/n), …, (xn,yn,1/n) -> weak learner -> h1
(x1,y1,w1), …, (xn,yn,wn) -> weak learner -> h2
(x1,y1,w1), …, (xn,yn,wn) -> weak learner -> h3
  …
(x1,y1,w1), …, (xn,yn,wn) -> weak learner -> hT

(the example weights w_i are updated between rounds)

Final rule: F_T(x) = α1 h1(x) + α2 h2(x) + … + αT hT(x)

f_T(x) = sign(F_T(x))

Page 16:

Adaboost

F_0(x) ≡ 0

for t = 1…T:
  w_i^t = exp(-y_i F_{t-1}(x_i))
  Get h_t from the weak learner
  α_t = ln( (Σ_{i: h_t(x_i)=1, y_i=+1} w_i^t) / (Σ_{i: h_t(x_i)=1, y_i=-1} w_i^t) )
  F_{t+1} = F_t + α_t h_t
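The loop above translates almost line for line into Python. This is a sketch: the slide leaves the weak learner abstract, so here it is simulated by greedily picking, from a small candidate pool, the rule with the largest weighted advantage, and a small smoothing constant (an assumption) keeps the log ratio finite; rules output {0,1} as on the previous slides.

```python
import math

def adaboost(X, y, weak_rules, T):
    """Confidence-rated Adaboost as on the slide: X is a list of
    instances, y their labels in {-1,+1}, weak_rules candidate
    functions x -> {0,1}.  Returns [(alpha_t, h_t), ...]."""
    eps = 1e-9                 # smoothing (an assumption, not on the slide)
    F = [0.0] * len(X)         # F_{t-1}(x_i) for each training example
    ensemble = []
    for t in range(T):
        # w_i^t = exp(-y_i * F_{t-1}(x_i))
        w = [math.exp(-yi * Fi) for yi, Fi in zip(y, F)]
        # stand-in weak learner: rule with the largest |weighted advantage|
        h = max(weak_rules,
                key=lambda r: abs(sum(wi * yi * r(xi)
                                      for wi, yi, xi in zip(w, y, X))))
        # alpha_t = ln( sum_{h=1, y=+1} w_i^t / sum_{h=1, y=-1} w_i^t )
        wp = sum(wi for wi, yi, xi in zip(w, y, X) if h(xi) == 1 and yi == +1)
        wm = sum(wi for wi, yi, xi in zip(w, y, X) if h(xi) == 1 and yi == -1)
        alpha = math.log((wp + eps) / (wm + eps))
        ensemble.append((alpha, h))
        # F_t = F_{t-1} + alpha_t * h_t
        F = [Fi + alpha * h(xi) for Fi, xi in zip(F, X)]
    return ensemble

def predict(ensemble, x):
    return +1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
```

On a toy set with threshold rules, a few rounds already drive the training error to zero, as the next slide's bound predicts.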

Page 17:

Main property of Adaboost

If the advantages of the weak rules over random guessing are γ1, γ2, …, γT, then the training error of the final rule is at most

ε̂(f_T) ≤ exp( -Σ_{t=1}^T γ_t² )

Page 18:

Boosting block diagram

[Diagram: the Strong Learner contains a Booster and a Weak Learner; the Booster sends example weights to the Weak Learner and receives a weak rule back, and the Strong Learner outputs an accurate rule.]

Page 19:

What is a good weak learner?

The set of weak rules (features) should be:

• Flexible enough to be (weakly) correlated with most conceivable relations between feature vector and label.

• Simple enough to allow efficient search for a rule with non-trivial weighted training error.

• Small enough to avoid over-fitting.

Calculation of the prediction from the observations should be very fast.

Page 20:

Alternating decision trees

Freund, Mason 1997

Page 21:

Decision Trees

[Figure: a decision tree that tests X>3 (one branch predicts -1; the other tests Y>5, splitting into +1 and -1 leaves), shown next to the corresponding axis-parallel partition of the X-Y plane at X=3 and Y=5.]

Page 22:

A decision tree as a sum of weak rules.

[Figure: the same tree redrawn as a sum of weak rules with real-valued votes: a baseline score of -0.2, a rule on X>3 voting -0.1 / +0.1, and a rule on Y>5 voting +0.2 / -0.3; the predicted label is the sign of the summed votes, recovering the +1 / -1 regions in the X-Y plane.]

Page 23:

An alternating decision tree

[Figure: an alternating decision tree: a root prediction node (-0.2) with splitter nodes Y>5 (votes +0.2 / -0.3), X>3 (votes -0.1 / +0.1), and Y<1 (votes 0.0 / +0.7); an instance's score is the sum of the prediction nodes on every path it satisfies, and its label is the sign of that sum, shown with the induced +1 / -1 regions in the X-Y plane.]

Page 24:

Example: Medical Diagnostics

• Cleve dataset from UC Irvine database.

• Heart disease diagnostics (+1 = healthy, -1 = sick)

• 13 features from tests (real valued and discrete).

• 303 instances.

Page 25:

AD-tree for heart-disease diagnostics

Score > 0: Healthy; score < 0: Sick

Page 26:

Commercial Deployment.

Page 27:

AT&T “buisosity” problem

Freund, Mason, Rogers, Pregibon, Cortes 2000

• Distinguish business from residence customers using call-detail information (time of day, length of call, …).

• 230M telephone numbers; label unknown for ~30%.

• 260M calls / day.

• Required computer resources:
  - Huge: counting log entries to produce statistics -- use specialized I/O-efficient sorting algorithms (Hancock).
  - Significant: calculating the classification for ~70M customers.
  - Negligible: learning (2 hours on 10K training examples on an off-line computer).

Page 28:

AD-tree for “buisosity”

Page 29:

AD-tree (Detail)

Page 30:

Quantifiable results

• At 94% accuracy, coverage increased from 44% to 56%.

• Saved AT&T $15M in the year 2000 in operations costs and missed opportunities.

[Figure: precision/recall curve -- accuracy plotted against score.]

Page 31:

Adaboost’s resistance to overfitting

Why statisticians find Adaboost interesting.

Page 32:

A very curious phenomenon

Boosting decision trees

Using <10,000 training examples we fit >2,000,000 parameters

Page 33:

Large margins

margin_{F_T}(x,y) ≐ y (Σ_{t=1}^T α_t h_t(x)) / (Σ_{t=1}^T α_t) = y F_T(x) / ||α||₁

margin_{F_T}(x,y) > 0 ⇔ f_T(x) = y

Thesis: large margins => reliable predictions

Very similar to SVM.
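Given the final rule's weights and weak rules, the normalized margin is one line of Python (a sketch with illustrative names; the weak rules are assumed to vote in {-1,+1}, so the margin lies in [-1,+1]):

```python
def normalized_margin(alphas, rules, x, y):
    """margin_{F_T}(x, y) = y * F_T(x) / ||alpha||_1 for positive alphas;
    positive iff the voted prediction sign(F_T(x)) agrees with y."""
    F = sum(a * h(x) for a, h in zip(alphas, rules))
    return y * F / sum(alphas)

rules = [lambda x: +1, lambda x: -1 if x < 0 else +1]
print(normalized_margin([3.0, 1.0], rules, -1, +1))  # (3 - 1) / 4 = 0.5
```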

Page 34:

Experimental Evidence

Page 35:

Theorem (Schapire, Freund, Bartlett & Lee / Annals of Statistics 1998)

H: set of binary functions with VC-dimension d

C = { Σ α_i h_i | h_i ∈ H, α_i > 0, Σ α_i = 1 }

T = (x1,y1), (x2,y2), …, (xm,ym); T ~ D^m

∀c ∈ C, ∀θ > 0, with probability 1-δ w.r.t. T ~ D^m:

P_{(x,y)~D}[ sign(c(x)) ≠ y ] ≤ P_{(x,y)~T}[ margin_c(x,y) ≤ θ ] + Õ( √(d/m) / θ ) + O( √(log(1/δ)/m) )

No dependence on the number of combined functions!!!

Page 36:

Idea of Proof

Page 37:

Confidence rated predictions

Agreement gives confidence

Page 38:

A motivating example

[Figure: a scatter plot of + and - examples forming two clearly separated clusters; query points marked "?" fall either between the clusters or far from all the data and are labeled "Unsure".]

Page 39:

The algorithm

Parameters: η > 0, Δ > 0

Hypothesis weight: w(h) ≐ e^{-η ε̂(h)}

Empirical log ratio:

l̂_η(x) ≐ (1/η) ln( (Σ_{h: h(x)=+1} w(h)) / (Σ_{h: h(x)=-1} w(h)) )

Prediction rule:

p̂_{η,Δ}(x) = +1 if l̂(x) > Δ; {-1,+1} if |l̂(x)| ≤ Δ; -1 if l̂(x) < -Δ

Freund, Mansour, Schapire 2001
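The rule can be sketched directly from these definitions (illustrative Python; the tiny guard constant for the case where every hypothesis votes the same way is an assumption, not part of the slide):

```python
import math

def predict_with_abstain(hypotheses, errors, x, eta, delta):
    """Empirical-log-ratio rule: weight each h by w(h) = exp(-eta * err(h)),
    compute l(x) = (1/eta) * ln(sum_{h(x)=+1} w / sum_{h(x)=-1} w), and
    output {+1}, {-1}, or abstain with {-1,+1} when |l(x)| <= delta."""
    wp = sum(math.exp(-eta * e) for h, e in zip(hypotheses, errors) if h(x) == +1)
    wm = sum(math.exp(-eta * e) for h, e in zip(hypotheses, errors) if h(x) == -1)
    tiny = 1e-300  # guard against an empty vote on one side
    l = (1.0 / eta) * math.log((wp + tiny) / (wm + tiny))
    if l > delta:
        return {+1}
    if l < -delta:
        return {-1}
    return {-1, +1}

split = [lambda x: +1, lambda x: -1]   # equally good, disagreeing
agree = [lambda x: +1, lambda x: +1]
print(predict_with_abstain(split, [0.1, 0.1], 0, 1.0, 0.5))  # abstains
print(predict_with_abstain(agree, [0.1, 0.1], 0, 1.0, 0.5))  # confident +1
```

Disagreement among comparably good hypotheses drives l̂(x) toward zero, which is exactly when the rule abstains.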

Page 40:

Suggested tuning

Suppose H is a finite set and 0 < θ < 1/4. Set

η = ln(8|H|) · m^{1/2-θ}

Δ = 2 ln(2/δ)/m + ln(8|H|) / (8 m^{1/2+θ})

Yields:

1) P(mistake) = P_{(x,y)~D}( y ∉ p̂(x) ) = 2 ε(h*) + O( ln(m) / m^{1/2-θ} )

2) for m = Ω( (ln(1/δ) ln|H|)^{1/θ} ):

P(abstain) = P_{(x,y)~D}( p̂(x) = {-1,+1} ) = 5 ε(h*) + O( (ln(1/δ) + ln|H|) / m^{1/2-θ} )

Page 41:

Confidence Rating block diagram

[Diagram: candidate rules and training examples (x1,y1), (x2,y2), …, (xm,ym) feed a Rater-Combiner, which outputs a confidence-rated rule.]

Page 42:

Face Detection

• Paul Viola and Mike Jones developed a face detector that can work in real time (15 frames per second).


Viola & Jones 1999

Page 43:

Using confidence to save time

The detector combines 6000 simple features using Adaboost.

In most boxes, only 8-9 features are calculated.

[Diagram: a cascade -- all boxes pass through Feature 1; boxes scoring low are rejected as "definitely not a face", the rest ("might be a face") pass to Feature 2, and so on.]
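The attentional-cascade idea fits in a few lines of code (a sketch; the stage scores and thresholds are placeholders, not actual Viola-Jones features):

```python
def cascade_classify(stages, box):
    """stages: list of (score_fn, threshold) pairs.  A box is rejected as
    'definitely not a face' at the first stage whose score falls below
    its threshold, so the vast majority of boxes touch only the first
    one or two features."""
    for score, threshold in stages:
        if score(box) < threshold:
            return False        # definitely not a face
    return True                 # survived every stage: might be a face

stages = [(lambda b: b, 0.5), (lambda b: b, 0.9)]  # toy stage scores
print(cascade_classify(stages, 0.3))   # rejected by the first stage
print(cascade_classify(stages, 0.95))  # passes both stages
```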

Page 44:

Using confidence to train car detectors

Page 45:

Original Image Vs. difference image


Page 46:

Co-training

[Diagram: highway images are split into a raw B/W stream and a difference-image stream; a partially trained B/W-based classifier and a partially trained difference-based classifier each pass their confident predictions to the other as training labels.]

Blum and Mitchell 98

Page 47:

Co-Training Results

[Figure: detections by the raw-image detector and the difference-image detector, before and after co-training.]

Levin, Freund, Viola 2002

Page 48:

Selective sampling

[Diagram: unlabeled data flows through a partially trained classifier, which sends a sample of unconfident examples out for labeling; the labeled examples feed back into training.]

Query-by-committee: Seung, Opper & Sompolinsky; Freund, Seung, Shamir & Tishby

Page 49:

Online learning

Adapting to changes

Page 50:

Online learning

So far, the only statistical assumption was that the data is generated IID. Can we get rid of that assumption?

Yes, if we consider prediction as a repeated game.

Suppose we have a set of experts; we believe one is good, but we don’t know which one.

An expert is an algorithm that maps the past to a prediction:

(x1,y1), (x2,y2), …, (x_{t-1},y_{t-1}), x_t -> z_t

Page 51:

Online prediction game

For t = 1,…,T:
  Experts generate predictions: z_t^1, z_t^2, …, z_t^N
  Algorithm makes its own prediction: ζ_t
  Nature generates outcome: y_t

Total loss of expert i: L_T^i = Σ_{t=1}^T L(z_t^i, y_t)

Total loss of algorithm: L_T^A = Σ_{t=1}^T L(ζ_t, y_t)

Goal: for any sequence of events, L_T^A ≤ min_i L_T^i + o(T)

Page 52:

A very simple example

• Binary classification
• N experts
• One expert is known to be perfect
• Algorithm: predict like the majority of the experts that have made no mistake so far.
• Bound: L^A ≤ log₂ N
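This majority-of-consistent-experts ("halving") algorithm is easy to write down (a sketch with illustrative names):

```python
def halving_run(experts, stream):
    """experts: functions x -> {-1,+1}; stream: [(x, y), ...].
    Predict with the majority of still-consistent experts, then drop
    every expert that erred.  With a perfect expert among N, each
    mistake at least halves the consistent pool, so mistakes <= log2(N)."""
    alive = list(experts)
    mistakes = 0
    for x, y in stream:
        vote = sum(e(x) for e in alive)
        prediction = +1 if vote >= 0 else -1
        if prediction != y:
            mistakes += 1
        alive = [e for e in alive if e(x) == y]  # keep only consistent experts
    return mistakes

experts = [lambda x: +1, lambda x: -1,
           lambda x: +1 if x > 0 else -1,    # the perfect expert
           lambda x: -1 if x > 0 else +1]
stream = [(1, +1), (-1, -1), (2, +1)]
print(halving_run(experts, stream))  # at most log2(4) = 2 mistakes
```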

Page 53:

History of online learning

• Littlestone & Warmuth
• Vovk
• Shafer and Vovk’s recent book: “Probability and Finance: It’s Only a Game!”
• Innumerable contributions from many fields: Hannan, Blackwell, Davisson, Gallager, Cover, Barron, Foster & Vohra, Fudenberg & Levine, Feder & Merhav, Shtarkov, Rissanen, Cesa-Bianchi, Lugosi, Blum, Freund, Schapire, Valiant, Auer …

Page 54:

Lossless compression

X - arbitrary input space
Y - {0,1}
Z - [0,1]

Log loss: L(z,y) = y log₂(1/z) + (1-y) log₂(1/(1-z))

Connections: entropy, lossless compression, MDL; statistical likelihood, standard probability theory.

Page 55:

Bayesian averaging

ζ_t = (Σ_{i=1}^N w_t^i z_t^i) / (Σ_{i=1}^N w_t^i),  where w_t^i = 2^{-L_{t-1}^i}

∀T > 0: L_T^A = log₂ Σ_{i=1}^N w_1^i - log₂ Σ_{i=1}^N w_T^i ≤ min_i L_T^i + log₂ N

Folk theorem in Information Theory
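A short Python sketch of this mixture forecaster for log loss (illustrative names; initial weights are taken to be 1, matching the telescoping identity above):

```python
import math

def log_loss(z, y):
    """L(z, y) = y log2(1/z) + (1-y) log2(1/(1-z)), z in (0,1), y in {0,1}."""
    return y * math.log2(1.0 / z) + (1 - y) * math.log2(1.0 / (1.0 - z))

def bayesian_averaging(expert_preds, outcomes):
    """expert_preds[t][i] in (0,1) is expert i's probability that y_t = 1.
    Plays zeta_t = sum_i w_t^i z_t^i / sum_i w_t^i with w_t^i = 2^{-L_{t-1}^i}.
    Returns (algorithm's total log loss, list of expert total losses)."""
    N = len(expert_preds[0])
    L = [0.0] * N                         # cumulative expert losses
    total = 0.0
    for preds, y in zip(expert_preds, outcomes):
        w = [2.0 ** (-Li) for Li in L]
        zeta = sum(wi * zi for wi, zi in zip(w, preds)) / sum(w)
        total += log_loss(zeta, y)
        L = [Li + log_loss(zi, y) for Li, zi in zip(L, preds)]
    return total, L

preds = [[0.9, 0.5], [0.8, 0.5], [0.9, 0.5]]   # two experts, three rounds
total, L = bayesian_averaging(preds, [1, 1, 0])
assert total <= min(L) + math.log2(2)          # the folk-theorem bound
```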

Page 56:

Game theoretical loss

X - arbitrary space

Y - a loss for each of N actions: y ∈ [0,1]^N

Z - a distribution over N actions: p ∈ [0,1]^N, ||p||₁ = 1

Loss: L(p,y) = p · y = E_p[y]

Page 57:

Learning in games

An algorithm which knows T in advance guarantees:

L_T^A ≤ min_i L_T^i + √(2 T ln N) + ln N

Freund and Schapire 94
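One standard algorithm with a guarantee of this shape is Hedge-style multiplicative weights; here is a sketch (the learning-rate tuning η = √(2 ln N / T) is one common choice that indeed requires knowing T in advance; it is an assumption, not a quote from the talk):

```python
import math

def hedge(loss_rows):
    """loss_rows[t][i] in [0,1] is the loss of action i at round t.
    Plays p_t proportional to exp(-eta * cumulative loss) and suffers
    expected loss p_t . y_t.  Returns (total loss, per-action totals)."""
    T, N = len(loss_rows), len(loss_rows[0])
    eta = math.sqrt(2.0 * math.log(N) / T)   # tuning that needs T in advance
    L = [0.0] * N                            # cumulative action losses
    total = 0.0
    for row in loss_rows:
        w = [math.exp(-eta * Li) for Li in L]
        s = sum(w)
        total += sum((wi / s) * li for wi, li in zip(w, row))  # L(p, y) = p . y
        L = [Li + li for Li, li in zip(L, row)]
    return total, L

rows = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]  # alternating losses
total, L = hedge(rows)
assert total <= min(L) + math.sqrt(2 * 4 * math.log(2)) + math.log(2)
```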

Page 58:

Multi-arm bandits

The algorithm cannot observe the full outcome y_t. Instead, a single i_t ∈ {1,…,N} is chosen at random according to p_t, and only y_t^{i_t} is observed.

We describe an algorithm that guarantees, with probability 1-δ:

L_T^A - min_i L_T^i = O( √( N T ln(NT/δ) ) )

Auer, Cesa-Bianchi, Freund, Schapire 95

Page 59:

Why isn’t online learning practical?

• Prescriptions too similar to Bayesian approach.

• Implementing low-level learning requires a large number of experts.

• Computation increases linearly with the number of experts.

• Potentially very powerful for combining a few high-level experts.

Page 60:

Online learning for detector deployment

[Diagram: a Face Detector Library (e.g. “Merl frontal 1.0”: B/W frontal face detector; indoor, neutral background, direct front-right lighting) is downloaded into an adaptive real-time face detector, which turns online images into face detections and sends feedback back to the library.]

Detector can be adaptive!!

Page 61:

Summary

• By combining predictors we can:
  - Improve accuracy.
  - Estimate prediction confidence.
  - Adapt on-line.

• To make machine learning practical:
  - Speed up the predictors.
  - Concentrate human feedback on hard cases.
  - Fuse data from several sources.
  - Share predictor libraries.