1
A Black-Box approach to machine learning
Yoav Freund
2
Why do we need learning?
• Computers need functions that map highly variable data:
 - Speech recognition: audio signal -> words
 - Image analysis: video signal -> objects
 - Bio-informatics: micro-array images -> gene function
 - Data mining: transaction logs -> customer classification
• For accuracy, functions must be tuned to fit the data source.
• For real-time processing, function computation has to be very fast.
3
The complexity/accuracy tradeoff
[Figure: error as a function of complexity, with the level of trivial performance marked.]
4
The speed/flexibility tradeoff
[Figure: flexibility vs. speed, with Matlab code, Java code, machine code, digital hardware, and analog hardware placed along the trade-off.]
5
Theory vs. Practice
• Theoretician: I want a polynomial-time algorithm that is guaranteed to perform arbitrarily well in “all” situations. - I prove theorems.
• Practitioner: I want a real-time algorithm that performs well on my problem. - I experiment.
• My approach: I want to combine algorithms whose performance and speed are guaranteed relative to the performance and speed of their components. - I do both.
6
Plan of talk
• The black-box approach
• Boosting
• Alternating decision trees
• A commercial application
• Boosting the margin
• Confidence rated predictions
• Online learning
7
The black-box approach
• Statistical models are not generators, they are predictors.
• A predictor is a function from observation X to action Z.
• After the action is taken, outcome Y is observed, which implies a loss L (a real-valued number).
• Goal: find a predictor with small loss (in expectation, with high probability, cumulative, …)
8
Main software components
• A predictor: a function mapping an observation x to an action z.
• A learner: maps training examples $(x_1,y_1), (x_2,y_2), \ldots, (x_m,y_m)$ to a predictor.
We assume the predictor will be applied to examples similar to those on which it was trained.
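As a minimal sketch (not from the talk), the two black-box components can be written as plain function types; the names and the toy learner below are illustrative only:

```python
from typing import Any, Callable, List, Tuple

# A predictor: a function from an observation x to an action z.
Predictor = Callable[[Any], Any]

# A learner: maps training examples (x1, y1), ..., (xm, ym) to a predictor.
Learner = Callable[[List[Tuple[Any, Any]]], Predictor]


def nearest_neighbor_learner(examples: List[Tuple[float, Any]]) -> Predictor:
    """Toy learner: predict the label of the closest stored example (numeric x only)."""
    def predictor(x: float) -> Any:
        _, label = min(examples, key=lambda ex: abs(ex[0] - x))
        return label
    return predictor
```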
9
Learning in a system
[Diagram: training examples feed a learning system, which outputs a predictor; the predictor sits inside the target system, mapping sensor data to actions, with a feedback path back to the learning system.]
10
Special case: Classification
• Observation X - arbitrary (measurable) space
• Prediction Z - {1,…,K}: $\hat y \in Z$; usually K=2 (binary classification)
• Outcome Y - finite set {1,…,K}: $y \in Y$
• Loss: $L(\hat y, y) = \begin{cases} 1 & \text{if } y \ne \hat y \\ 0 & \text{if } y = \hat y \end{cases}$
11
Batch learning for binary classification
• Data distribution: $(x,y) \sim D$; $y \in \{-1,+1\}$
• Generalization error: $\varepsilon(h) \doteq P_{(x,y)\sim D}\big(h(x) \ne y\big)$
• Training set: $T = (x_1,y_1), (x_2,y_2), \ldots, (x_m,y_m)$; $T \sim D^m$
• Training error: $\hat\varepsilon(h) \doteq \frac{1}{m}\sum_{(x,y)\in T} 1\big[h(x) \ne y\big] \doteq P_{(x,y)\sim T}\big[h(x) \ne y\big]$
12
Boosting
Combining weak learners
13
A weighted training set
$(x_1,y_1,w_1), (x_2,y_2,w_2), \ldots, (x_m,y_m,w_m)$
• Feature vectors $x_i$
• Binary labels $y_i \in \{-1,+1\}$
• Positive weights $w_i$
14
A weak learner
A weak learner takes a weighted training set $(x_1,y_1,w_1), (x_2,y_2,w_2), \ldots, (x_m,y_m,w_m)$ (instances $x_1, x_2, \ldots, x_m$) and outputs a weak rule $h$ with predictions $\hat y_1, \hat y_2, \ldots, \hat y_m$; $\hat y_i \in \{0,1\}$.

The weak requirement:
$\frac{\sum_{i=1}^m y_i \hat y_i w_i}{\sum_{i=1}^m w_i} > \gamma > 0$
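One simple weak learner that fits this interface is a decision stump. The sketch below (illustrative, not from the talk) searches single-feature thresholds for the stump with the largest weighted edge:

```python
import numpy as np

def stump_weak_learner(X, y, w):
    """Search single-feature thresholds for the stump h(x) = 1[x_j > theta]
    (or 1[x_j <= theta]) with the largest weighted edge
        sum_i y_i h(x_i) w_i / sum_i w_i,
    matching the weak requirement above (labels in {-1,+1}, predictions in {0,1})."""
    m, d = X.shape
    total_w = w.sum()
    best_edge, best_rule = -np.inf, None
    for j in range(d):
        for theta in np.unique(X[:, j]):
            for above in (True, False):                 # fire above, or at/below, theta
                fired = X[:, j] > theta if above else X[:, j] <= theta
                edge = (w[fired] * y[fired]).sum() / total_w
                if edge > best_edge:
                    best_edge, best_rule = edge, (j, theta, above)
    j, theta, above = best_rule

    def h(X_new):
        X_new = np.atleast_2d(X_new)
        fired = X_new[:, j] > theta if above else X_new[:, j] <= theta
        return fired.astype(int)                        # predictions in {0,1}
    return h, best_edge
```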
15
The boosting process
Start with uniform weights: $(x_1,y_1,1/n), \ldots, (x_n,y_n,1/n)$ -> weak learner -> $h_1$.
Reweight the examples: $(x_1,y_1,w_1), \ldots, (x_n,y_n,w_n)$ -> weak learner -> $h_2$; repeat to obtain $h_3, h_4, \ldots, h_T$.

Final rule:
$F_T(x) = \alpha_1 h_1(x) + \alpha_2 h_2(x) + \cdots + \alpha_T h_T(x)$
$f_T(x) = \mathrm{sign}\big(F_T(x)\big)$
16
Adaboost

$F_0(x) \equiv 0$
for $t = 1, \ldots, T$:
  $w_i^t = \exp\big(-y_i F_{t-1}(x_i)\big)$
  Get $h_t$ from the weak learner
  $\alpha_t = \ln\left(\frac{\sum_{i:\, h_t(x_i)=1,\, y_i=1} w_i^t}{\sum_{i:\, h_t(x_i)=1,\, y_i=-1} w_i^t}\right)$
  $F_{t+1} = F_t + \alpha_t h_t$
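A minimal sketch of this loop (assuming a weak learner with the interface of the stump sketch above; the smoothing constant `eps` is an implementation convenience, not part of the slide):

```python
import numpy as np

def adaboost(X, y, weak_learner, T, eps=1e-12):
    """Run the loop above: weights w_i = exp(-y_i F(x_i)), call the weak learner,
    set alpha_t from the log-ratio of fired positives to fired negatives, and add
    alpha_t * h_t to the score F.  weak_learner(X, y, w) returns a rule with
    outputs in {0,1}, as in the stump sketch."""
    F = np.zeros(len(y))                          # F_0(x_i) = 0
    rules, alphas = [], []
    for t in range(T):
        w = np.exp(-y * F)                        # w_i^t = exp(-y_i F_{t-1}(x_i))
        h, _ = weak_learner(X, y, w)              # get h_t from the weak learner
        fired = h(X)                              # h_t(x_i) in {0,1}
        w_pos = w[(fired == 1) & (y == +1)].sum()
        w_neg = w[(fired == 1) & (y == -1)].sum()
        alpha = np.log((w_pos + eps) / (w_neg + eps))
        F = F + alpha * fired                     # F_{t+1} = F_t + alpha_t h_t
        rules.append(h)
        alphas.append(alpha)

    def f(X_new):
        score = sum(a * h(X_new) for a, h in zip(alphas, rules))
        return np.sign(score)                     # final rule f_T(x) = sign(F_T(x))
    return f
```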
17
Main property of Adaboost
If the advantages of the weak rules over random guessing are $\gamma_1, \gamma_2, \ldots, \gamma_T$, then the training error of the final rule is at most

$\hat\varepsilon(f_T) \le \exp\left(-\sum_{t=1}^T \gamma_t^2\right)$
18
Boosting block diagram
[Diagram: the strong learner consists of a booster and a weak learner; the booster sends example weights to the weak learner, the weak learner returns a weak rule, and the strong learner outputs an accurate rule.]
19
What is a good weak learner?
The set of weak rules (features) should be:
• Flexible enough to be (weakly) correlated with most conceivable relations between feature vector and label.
• Simple enough to allow efficient search for a rule with non-trivial weighted training error.
• Small enough to avoid over-fitting.
Calculation of the prediction from the observations should be very fast.
20
Alternating decision trees
Freund, Mason 1997
21
Decision Trees
[Figure: a decision tree that first tests X>3 and then Y>5, with leaves labeled +1 and -1, shown next to the corresponding partition of the (X,Y) plane at X=3 and Y=5 into +1 and -1 regions.]
22
A decision tree as a sum of weak rules.
[Figure: the same tree rewritten as the sign of a sum of simple rules: a constant -0.2 plus rules with values ±0.1, ±0.2, and -0.3 attached to the tests X>3 and Y>5; the prediction is the sign of the total, giving the same +1/-1 partition of the (X,Y) plane.]
23
An alternating decision tree
[Figure: an alternating decision tree in which prediction nodes (values such as -0.2, +0.2, -0.3, +0.1, -0.1, +0.7, 0.0) alternate with decision nodes Y>5, X>3, and Y<1; an instance's score is the sum of the prediction values along all paths whose conditions it satisfies, and the class is the sign of that sum, shown next to the corresponding partition of the (X,Y) plane into +1 and -1 regions.]
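The scoring rule can be sketched in a few lines; the tree below is illustrative, loosely following the slide's numbers rather than reproducing its exact figure:

```python
def adt_score(x, node):
    """node = (prediction_value, splitters); each splitter is a triple
    (condition, child_if_true, child_if_false).  The score of x adds up the
    prediction values of every node reached on paths whose conditions x
    satisfies; the class is the sign of the total score."""
    value, splitters = node
    score = value
    for condition, if_true, if_false in splitters:
        score += adt_score(x, if_true if condition(x) else if_false)
    return score

# Illustrative alternating decision tree (values and tests loosely from the slide).
example_adt = (
    -0.2,
    [(lambda p: p["Y"] > 5,
      (+0.2, [(lambda p: p["X"] > 3, (+0.1, []), (-0.1, []))]),
      (-0.3, [])),
     (lambda p: p["Y"] < 1,
      (+0.7, []),
      (0.0, []))])

print(adt_score({"X": 4.0, "Y": 6.0}, example_adt))   # positive score => class +1
```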
24
Example: Medical Diagnostics
• Cleve dataset from UC Irvine database.
• Heart disease diagnostics (+1 = healthy, -1 = sick)
• 13 features from tests (real-valued and discrete).
• 303 instances.
25
AD-tree for heart-disease diagnostics
Score > 0: Healthy; score < 0: Sick
26
Commercial Deployment.
27
AT&T “buisosity” problem
Freund, Mason, Rogers, Pregibon, Cortes 2000
• Distinguish business/residence customers from call detail information (time of day, length of call, …).
• 230M telephone numbers, label unknown for ~30%
• 260M calls / day
• Required computer resources:
 - Huge: counting log entries to produce statistics -- use specialized I/O-efficient sorting algorithms (Hancock).
 - Significant: calculating the classification for ~70M customers.
 - Negligible: learning (2 hours on 10K training examples on an off-line computer).
28
AD-tree for “buisosity”
29
AD-tree (Detail)
30
Quantifiable results
• At 94% accuracy, coverage increased from 44% to 56%.
• Saved AT&T $15M in the year 2000 in operations costs and missed opportunities.
[Figure: precision/recall curve, accuracy as a function of score.]
31
Adaboost’s resistance to over-fitting
Why statisticians find Adaboost interesting.
32
A very curious phenomenon
Boosting decision trees
Using <10,000 training examples we fit >2,000,000 parameters
33
Large margins
$\mathrm{margin}_{F_T}(x,y) \doteq y\,\frac{\sum_{t=1}^T \alpha_t h_t(x)}{\sum_{t=1}^T \alpha_t} = y\,\frac{F_T(x)}{\|\vec\alpha\|_1}$

$\mathrm{margin}_{F_T}(x,y) > 0 \iff f_T(x) = y$

Thesis: large margins => reliable predictions

Very similar to SVM.
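As a small sketch (assuming the $\alpha_t$ and the values $h_t(x)$ of a boosted classifier are already available), the normalized margin of one example can be computed directly:

```python
import numpy as np

def normalized_margin(alphas, rule_values, y):
    """margin_{F_T}(x, y) = y * sum_t alpha_t h_t(x) / sum_t |alpha_t|.
    The margin lies in [-1, +1] for {-1,+1}-valued rules and is positive
    exactly when the voted classifier f_T gets the example right."""
    alphas = np.asarray(alphas, dtype=float)
    rule_values = np.asarray(rule_values, dtype=float)   # h_t(x) for one example
    return y * float(np.dot(alphas, rule_values)) / float(np.abs(alphas).sum())
```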
34
Experimental Evidence
35
Theorem (Schapire, Freund, Bartlett & Lee, Annals of Statistics 1998)

H: set of binary functions with VC-dimension d

$C = \left\{\sum_i \alpha_i h_i \;\middle|\; h_i \in H,\ \alpha_i > 0,\ \sum_i \alpha_i = 1\right\}$

$T = (x_1,y_1), (x_2,y_2), \ldots, (x_m,y_m)$; $T \sim D^m$

$\forall c \in C,\ \forall \theta > 0$, with probability $1-\delta$ w.r.t. $T \sim D^m$:

$P_{(x,y)\sim D}\big[\mathrm{sign}(c(x)) \ne y\big] \le P_{(x,y)\sim T}\big[\mathrm{margin}_c(x,y) \le \theta\big] + \tilde O\!\left(\frac{\sqrt{d/m}}{\theta}\right) + O\!\left(\log\frac{1}{\delta}\right)$

No dependence on the number of combined functions!!!
36
Idea of Proof
37
Confidence rated predictions
Agreement gives confidence
38
A motivating example
[Figure: a scatter of positive (+) and negative (-) examples forming two well-separated clusters, with a few query points marked "?"; points that fall between the clusters or far from all of the data are labeled "Unsure".]
39
The algorithm
Freund, Mansour, Schapire 2001

Parameters: $\eta > 0$, $\Delta > 0$

Hypothesis weight: $w(h) \doteq e^{-\eta\,\hat\varepsilon(h)}$

Empirical log ratio: $\hat l_\eta(x) \doteq \frac{1}{\eta}\ln\!\left(\frac{\sum_{h:\,h(x)=+1} w(h)}{\sum_{h:\,h(x)=-1} w(h)}\right)$

Prediction rule:
$\hat p_{\eta,\Delta}(x) = \begin{cases} +1 & \text{if } \hat l(x) > \Delta \\ \{-1,+1\} & \text{if } |\hat l(x)| \le \Delta \\ -1 & \text{if } \hat l(x) < -\Delta \end{cases}$
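A minimal sketch of this rule, assuming a finite list of hypotheses and their training errors are already available (interface names and the tiny smoothing constant are illustrative):

```python
import numpy as np

def confidence_rated_predictor(hypotheses, train_errors, eta, delta):
    """Weight each hypothesis by w(h) = exp(-eta * err_hat(h)), compute the empirical
    log ratio l_hat(x), and output +1, -1, or {-1,+1} (abstain) according to whether
    l_hat(x) is above delta, below -delta, or in between."""
    w = np.exp(-eta * np.asarray(train_errors, dtype=float))

    def predict(x, tiny=1e-300):
        votes = np.array([h(x) for h in hypotheses])          # each h(x) in {-1,+1}
        l_hat = np.log((w[votes == +1].sum() + tiny) /
                       (w[votes == -1].sum() + tiny)) / eta
        if l_hat > delta:
            return {+1}
        if l_hat < -delta:
            return {-1}
        return {-1, +1}                                       # abstain: "unsure"
    return predict
```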
40
Suggested tuning

Suppose H is a finite set. For $0 < \theta < \frac{1}{4}$, set

$\eta = \ln(8|H|)\, m^{1/2-\theta}$, $\qquad \Delta = \frac{2\ln(2/\delta)}{m} + \frac{\ln(8|H|)}{8\,m^{1/2+\theta}}$

Yields:

1) $P(\text{mistake}) = P_{(x,y)\sim D}\big(y \notin \hat p(x)\big) = 2\varepsilon(h^*) + O\!\left(\frac{\ln m}{m^{1/2-\theta}}\right)$

2) for $m = \Omega\!\left(\big(\ln(1/\delta)\,\ln|H|\big)^{1/\theta}\right)$:

$P(\text{abstain}) = P_{(x,y)\sim D}\big(\hat p(x) = \{-1,+1\}\big) = 5\varepsilon(h^*) + O\!\left(\frac{\ln(1/\delta) + \ln|H|}{m^{1/2-\theta}}\right)$
41
Confidence Rating block diagram
[Diagram: candidate rules, together with training examples $(x_1,y_1), (x_2,y_2), \ldots, (x_m,y_m)$, feed a rater-combiner, which outputs a confidence-rated rule.]
42
Face Detection
• Paul Viola and Mike Jones developed a face detector that can work in real time (15 frames per second).
Viola & Jones 1999
43
Using confidence to save time
The detector combines 6000 simple features using Adaboost.
In most boxes, only 8-9 features are calculated.
[Diagram: all boxes are scored on feature 1; boxes that are definitely not a face are rejected immediately, and only boxes that might be a face go on to feature 2, and so on.]
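The idea can be sketched as a rejection cascade (a simplified stand-in for the detector described above, not the authors' actual code; the stage functions and thresholds are illustrative):

```python
def cascade_detect(box, stages):
    """stages: list of (score_fn, reject_threshold) pairs, cheapest features first.
    A box is dropped ("definitely not a face") as soon as its running score falls
    below a stage's threshold, so most boxes never reach the expensive features;
    only boxes that survive every stage "might be a face"."""
    score = 0.0
    for score_fn, reject_threshold in stages:
        score += score_fn(box)
        if score < reject_threshold:
            return False
    return True
```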
44
Using confidence to train car detectors
45
Original Image Vs. difference image
46
Co-training
[Diagram: highway images are processed into two views, the raw B/W image and the difference image; a partially trained B/W-based classifier and a partially trained difference-based classifier each pass their confident predictions to the other as additional training labels.]
Blum and Mitchell 98
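A rough sketch of the loop, assuming scikit-learn-style classifiers exposing `fit` and `decision_function` and two view-extraction functions; all names are illustrative and this is not the exact procedure of the cited papers:

```python
import numpy as np

def co_train(view_a, view_b, clf_a, clf_b, labeled, unlabeled, rounds, threshold):
    """Each classifier is trained on its own view of the labeled data; every round,
    examples on which one classifier is confident are given that label and added to
    the *other* classifier's training set."""
    data_a = [(view_a(x), y) for x, y in labeled]
    data_b = [(view_b(x), y) for x, y in labeled]
    pool = list(unlabeled)
    for _ in range(rounds):
        clf_a.fit(*map(list, zip(*data_a)))
        clf_b.fit(*map(list, zip(*data_b)))
        remaining = []
        for x in pool:
            score_a = float(clf_a.decision_function([view_a(x)])[0])
            score_b = float(clf_b.decision_function([view_b(x)])[0])
            if abs(score_a) > threshold:                  # A confident: teach B
                data_b.append((view_b(x), int(np.sign(score_a))))
            elif abs(score_b) > threshold:                # B confident: teach A
                data_a.append((view_a(x), int(np.sign(score_b))))
            else:
                remaining.append(x)
        pool = remaining
    return clf_a, clf_b
```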
47
Co-Training Results
[Figure: results of the raw-image detector and the difference-image detector, before and after co-training.]
Levin, Freund, Viola 2002
48
Selective sampling
[Diagram: unlabeled data is scored by a partially trained classifier; a sample of the unconfident examples is sent out for labeling, and the resulting labeled examples are fed back to train the classifier.]
Query by Committee: Seung, Opper & Sompolinsky; Freund, Seung, Shamir & Tishby
49
Online learning
Adapting to changes
50
Online learning
So far, the only statistical assumption was that the data are generated IID.

Can we get rid of that assumption?

Yes, if we consider prediction as a repeated game.

Suppose we have a set of experts; we believe one is good, but we don't know which one.

An expert is an algorithm that maps the past $(x_1,y_1), (x_2,y_2), \ldots, (x_{t-1},y_{t-1}), x_t$ to a prediction $z_t$.
51
Online prediction game
For $t = 1, \ldots, T$:
• Experts generate predictions: $z_t^1, z_t^2, \ldots, z_t^N$
• The algorithm makes its own prediction: $\zeta_t$
• Nature generates the outcome: $y_t$

Total loss of expert i: $L_T^i = \sum_{t=1}^T L\big(z_t^i, y_t\big)$

Total loss of the algorithm: $L_T^A = \sum_{t=1}^T L\big(\zeta_t, y_t\big)$

Goal: for any sequence of events,
$L_T^A \le \min_i L_T^i + o(T)$
52
A very simple example
• Binary classification
• N experts
• One expert is known to be perfect
• Algorithm: predict like the majority of the experts that have made no mistake so far.
• Bound: $L_A \le \log_2 N$
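A sketch of that algorithm (often called the halving algorithm); the input format is illustrative:

```python
def halving(expert_predictions, outcomes):
    """expert_predictions[t][i] in {-1,+1} is expert i's prediction at round t;
    outcomes[t] in {-1,+1}.  Keep the experts that have never erred and predict
    with their majority vote; if some expert is perfect, each mistake removes at
    least half of the survivors, so the mistakes are at most log2(N)."""
    alive = set(range(len(expert_predictions[0])))
    mistakes = 0
    for preds, y in zip(expert_predictions, outcomes):
        vote = sum(preds[i] for i in alive)
        guess = +1 if vote >= 0 else -1
        if guess != y:
            mistakes += 1
        alive = {i for i in alive if preds[i] == y}   # drop experts that erred
    return mistakes
```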
53
History of online learning
• Littlestone & Warmuth
• Vovk
• Vovk and Shafer's recent book: "Probability and Finance: It's Only a Game!"
• Innumerable contributions from many fields: Hannan, Blackwell, Davison, Gallager, Cover, Barron, Foster & Vohra, Fudenberg & Levine, Feder & Merhav, Shtarkov, Rissanen, Cesa-Bianchi, Lugosi, Blum, Freund, Schapire, Valiant, Auer, …
54
Lossless compression
X - arbitrary input space
Y - {0,1}
Z - [0,1]

Log loss: $L(z,y) = y\log_2\frac{1}{z} + (1-y)\log_2\frac{1}{1-z}$

Entropy, lossless compression, MDL.
Statistical likelihood, standard probability theory.
55
Bayesian averaging
$\zeta_t = \frac{\sum_{i=1}^N w_t^i z_t^i}{\sum_{i=1}^N w_t^i}; \qquad w_t^i = 2^{-L_{t-1}^i}$

$\forall T > 0:\quad L_T^A = \log_2\sum_{i=1}^N w_1^i - \log_2\sum_{i=1}^N w_T^i \le \min_i L_T^i + \log_2 N$

Folk theorem in Information Theory
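A small sketch of this predictor for the log-loss setting above (experts output probabilities $z_t^i = P(y_t = 1)$; the array shapes and function name are illustrative):

```python
import numpy as np

def bayesian_average(expert_probs, outcomes):
    """expert_probs[t, i] = expert i's probability that y_t = 1; outcomes[t] in {0, 1}.
    Weights are w_t^i = 2^(-L_{t-1}^i), so the prediction is the Bayes mixture of the
    experts under log loss."""
    T, N = expert_probs.shape
    cum_loss = np.zeros(N)                     # L_{t-1}^i, in bits
    alg_loss = 0.0
    for z, y in zip(expert_probs, outcomes):
        w = np.power(2.0, -cum_loss)
        zeta = float(np.dot(w, z) / w.sum())   # algorithm's prediction zeta_t
        alg_loss += -np.log2(zeta if y == 1 else 1.0 - zeta)
        cum_loss += -np.log2(np.where(y == 1, z, 1.0 - z))
    return alg_loss, cum_loss                  # total loss, per-expert losses
```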
56
Game theoretical loss
X - arbitrary space

Y - a loss for each of N actions: $y \in [0,1]^N$

Z - a distribution over N actions: $p \in [0,1]^N$, $\|p\|_1 = 1$

Loss: $L(p,y) = p \cdot y = E_{i\sim p}\big[y_i\big]$
57
Learning in games
An algorithm which knows T in advance guarantees:

$L_T^A \le \min_i L_T^i + \sqrt{2T\ln N} + \ln N$

Freund and Schapire 94
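A sketch of an exponential-weights algorithm of this kind (the learning rate below is a common choice for known T, not necessarily the exact tuning in the paper):

```python
import numpy as np

def exponential_weights(loss_matrix, eta=None):
    """loss_matrix[t, i] in [0, 1] is the loss of action i at round t.  Play the
    distribution p_t proportional to exp(-eta * cumulative loss) and suffer p_t . y_t;
    with eta ~ sqrt(2 ln N / T) the regret is O(sqrt(T ln N))."""
    T, N = loss_matrix.shape
    if eta is None:
        eta = np.sqrt(2.0 * np.log(N) / T)      # requires knowing T in advance
    cum_loss = np.zeros(N)
    alg_loss = 0.0
    for y in loss_matrix:
        p = np.exp(-eta * cum_loss)
        p /= p.sum()
        alg_loss += float(np.dot(p, y))
        cum_loss += y
    return alg_loss, alg_loss - cum_loss.min()  # total loss, regret to the best action
```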
58
Multi-arm bandits
Auer, Cesa-Bianchi, Freund, Schapire 95

The algorithm cannot observe the full outcome $y_t$. Instead, a single $i_t \in \{1,\ldots,N\}$ is chosen at random according to $p_t$, and only $y_t^{i_t}$ is observed.

We describe an algorithm that guarantees, with probability $1-\delta$:

$L_T^A - \min_i L_T^i = O\!\left(\sqrt{NT\,\ln\frac{NT}{\delta}}\right)$
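A sketch in the spirit of the Exp3 family of bandit algorithms (simplified; the exact algorithm and tuning in the cited paper differ in details, and the callback interface is illustrative):

```python
import numpy as np

def exp3_sketch(loss_of, N, T, gamma=0.1, seed=0):
    """loss_of(t, i) -> loss in [0, 1] of action i at round t; only the chosen
    action's loss is ever queried.  Mix a gamma-fraction of uniform exploration
    into the exponential-weights distribution and update with an importance-
    weighted loss estimate, which stays unbiased despite the partial feedback."""
    rng = np.random.default_rng(seed)
    weights = np.ones(N)
    eta = gamma / N
    total = 0.0
    for t in range(T):
        p = (1.0 - gamma) * weights / weights.sum() + gamma / N
        i = int(rng.choice(N, p=p))
        loss = loss_of(t, i)                        # the only feedback this round
        total += loss
        weights[i] *= np.exp(-eta * loss / p[i])    # importance-weighted update
    return total
```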
59
Why isn’t online learning practical?
• Prescriptions too similar to the Bayesian approach.
• Implementing low-level learning requires a large number of experts.
• Computation increases linearly with the number of experts.
• Potentially very powerful for combining a few high-level experts.
60
Online learning for detector deployment

[Diagram: a face-detector library (e.g., "Merl frontal 1.0": a B/W frontal face detector for indoor scenes, neutral background, direct front-right lighting) is downloaded to an adaptive real-time face detector; images and feedback flow in, face detections flow out, and online learning (OL) adapts the detector in the field.]

Detector can be adaptive!!
61
Summary
• By combining predictors we can:
 - Improve accuracy.
 - Estimate prediction confidence.
 - Adapt on-line.
• To make machine learning practical:
 - Speed up the predictors.
 - Concentrate human feedback on hard cases.
 - Fuse data from several sources.
 - Share predictor libraries.