Experts and Boosting Algorithms


Page 1:

Experts and Boosting Algorithms

Page 2:

Experts: Motivation

• Given a set of experts:
– No prior information
– No consistent behavior
– Goal: predict as well as the best expert

• Model:
– Online model
– Input: historical results

Page 3:

Experts: Model

• N strategies (experts)

• At time t:
– Learner A chooses a distribution over the N experts
– Let $p^t(i)$ be the probability of the i-th expert; clearly $\sum_i p^t(i) = 1$
– A loss vector $l^t$ is received
– Loss at time t: $\sum_i p^t(i)\, l^t(i)$

• Assume bounded loss: $l^t(i) \in [0,1]$

Page 4:

Expert: Goal

• Match the loss of best expert.

• Loss:
– $L_A$: the cumulative loss of the online algorithm A
– $L_i$: the cumulative loss of the i-th expert

• Can we hope to do better?

Page 5:

Example: Guessing letters

• Setting:
– Alphabet of k letters

• Loss:
– 1 for an incorrect guess
– 0 for a correct guess

• Experts:
– Each expert always guesses one fixed letter.

• Game: guess the most popular letter online (a sketch follows).
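A minimal Python sketch of this game, assuming the greedy strategy that always guesses the letter seen most often so far (function and variable names are illustrative):

```python
from collections import Counter

# One expert per letter; we greedily guess the most popular letter so far.
def guess_letters_online(sequence, alphabet):
    counts = Counter({a: 0 for a in alphabet})   # how often each expert was right
    total_loss = 0
    for letter in sequence:
        guess = max(alphabet, key=lambda a: counts[a])  # most popular so far
        total_loss += 0 if guess == letter else 1       # 0/1 loss
        counts[letter] += 1
    return total_loss

# Example: the best expert ("a") incurs loss 3 here; greedy also incurs loss 3.
print(guess_letters_online("abacada", "abcd"))
```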

Page 6:

Example 2: Rock-Paper-Scissors

• Two-player game.

• Each player chooses Rock, Paper, or Scissors.

• Loss matrix (row = our move, column = opponent's move):

• Goal: play as well as possible given the opponent.

          Rock   Paper  Scissors
Rock      1/2    1      0
Paper     0      1/2    1
Scissors  1      0      1/2

Page 7:

Example 3: Placing a point

• Action: choosing a point d.

• Loss (given the true location y): ||d − y||.

• Experts: One for each point.

• Important: Loss is Convex

• Goal: Find a “center”

$\|(\lambda d_1 + (1-\lambda) d_2) - y\| \;\le\; \lambda\,\|d_1 - y\| + (1-\lambda)\,\|d_2 - y\|$

Page 8:

Experts Algorithm: Greedy

• For each expert define its cumulative loss:

• Greedy: At time t choose the expert with minimum cumulative loss so far, namely $\arg\min_i L_i$ (sketched below).

$L_i^t = \sum_{j=1}^{t} l^j(i)$
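A minimal Python sketch of Greedy under these definitions, assuming losses arrive as a list of loss vectors with entries in [0,1] (names are illustrative):

```python
# Greedy: at each step follow the expert with minimum cumulative loss so far.
def greedy_experts(loss_vectors, n_experts):
    cumulative = [0.0] * n_experts   # L_i^t, cumulative loss per expert
    total_loss = 0.0
    for l in loss_vectors:
        leader = min(range(n_experts), key=lambda i: cumulative[i])  # arg min L_i
        total_loss += l[leader]      # Greedy suffers the leader's loss
        for i in range(n_experts):
            cumulative[i] += l[i]
    return total_loss
```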

Page 9:

Greedy Analysis

• Theorem: Let $L_G^T$ be the loss of Greedy at time T. Then

$L_G^T \le N\left(1 + \min_i L_i^T\right)$

• Proof!

Page 10:

Better Expert Algorithms

• Would like to bound

$L_A^T - \min_i L_i^T$

Page 11:

Expert Algorithm: Hedge(b)

• Maintains a weight vector $w^t$

• Probabilities: $p^t(k) = w^t(k) / \sum_j w^t(j)$

• Initialization: $w^1(i) = 1/N$

• Updates (see the sketch below):
– $w^{t+1}(k) = w^t(k) \cdot U_b(l^t(k))$
– where $b \in [0,1]$ and
– $b^r \le U_b(r) \le 1-(1-b)r$
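A minimal Python sketch of Hedge(b), using the simplest admissible update $U_b(r) = b^r$ (it satisfies both inequalities above, since $b^r$ is convex in r); names are illustrative:

```python
# Hedge(b): multiplicative-weights prediction over N experts.
def hedge(loss_vectors, n_experts, b=0.5):
    w = [1.0 / n_experts] * n_experts        # w^1(i) = 1/N
    total_loss = 0.0
    for l in loss_vectors:
        total = sum(w)
        p = [wi / total for wi in w]         # p^t(i) = w^t(i) / sum_j w^t(j)
        total_loss += sum(pi * li for pi, li in zip(p, l))  # expected loss p^t . l^t
        w = [wi * (b ** li) for wi, li in zip(w, l)]        # w^{t+1}(i) = w^t(i) b^{l^t(i)}
    return total_loss
```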

Page 12:

Hedge Analysis

• Lemma: For any sequence of losses

$\ln\left(\sum_{j=1}^{N} w^{T+1}(j)\right) \;\le\; -(1-b)\,L_H$

• Proof!

• Corollary:

$L_H \;\le\; \frac{-\ln\left(\sum_{j=1}^{N} w^{T+1}(j)\right)}{1-b}$

Page 13:

Hedge: Properties

• Bounding the weights:

$w^{T+1}(i) = w^1(i)\prod_{t=1}^{T} U_b(l^t(i)) \;\ge\; w^1(i)\, b^{\sum_{t=1}^{T} l^t(i)} = w^1(i)\, b^{L_i^T}$

• Similarly for a subset of experts.

Page 14:

Hedge: Performance

• Let k be the expert with minimal loss:

$\sum_{j=1}^{N} w^{T+1}(j) \;\ge\; w^{T+1}(k) \;\ge\; w^1(k)\, b^{L_k^T} = \frac{1}{N}\, b^{L_k^T}$

• Therefore

$L_H \;\le\; \frac{\ln N + L_k^T \ln(1/b)}{1-b}$

Page 15:

Hedge: Optimizing b

• For b = 1/2 we have

$L_H \;\le\; 2\ln N + 2\ln(2)\, L_k^T$

• Better selection of b:

$L_H \;\le\; \min_i L_i^T + \sqrt{2\left(\min_i L_i^T\right)\ln N} + \ln N$
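The second bound comes from tuning b against an assumed upper bound $\tilde{L} \ge \min_i L_i^T$ known in advance; a sketch of the choice, following the standard Hedge analysis:

```latex
% Assumes \tilde{L} \ge \min_i L_i^T is known before the game starts.
b = \frac{1}{1 + \sqrt{2\ln N / \tilde{L}}}
\quad\Longrightarrow\quad
L_H \;\le\; \min_i L_i^T + \sqrt{2\,\tilde{L}\,\ln N} + \ln N .
```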

Page 16:

Occam Razor

Page 17:

Occam Razor

• Finding the shortest consistent hypothesis.

• Definition: (α,β)-Occam algorithm
– α > 0 and β < 1
– Input: a sample S of size m
– Output: hypothesis h
– For every (x,b) in S: h(x) = b (consistency)
– $size(h) \le size(c_t)^{\alpha} \cdot m^{\beta}$

• Efficiency.

Page 18:

Occam algorithm and compression

[Diagram: A holds the labeled sample (x_i, b_i); B holds only x_1, …, x_m; A transmits information that lets B recover the labels.]

Page 19:

Compression

• Option 1:

– A sends B the values b1 , … , bm

– m bits of information

• Option 2:

– A sends B the hypothesis h

– Occam: for large enough m, $size(h) \le size(c_t)^{\alpha}\, m^{\beta} < m$

• Option 3 (MDL):

– A sends B a hypothesis h and “corrections”

– complexity: size(h) + size(errors)

Page 20:

Occam Razor Theorem

• A: an (α,β)-Occam algorithm for C using H

• D: a distribution over inputs X

• $c_t \in C$: the target function

• Sample size:

$m = O\!\left(\frac{1}{\epsilon}\ln\frac{1}{\delta} + \left(\frac{n^{\alpha}}{\epsilon}\right)^{\frac{1}{1-\beta}}\right)$

• With probability 1−δ, A(S) = h has error(h) < ε

Page 21:

Occam Razor Theorem

• Use the bound for finite hypothesis class.

• Effective hypothesis class size: $2^{size(h)}$

• $size(h) \le n^{\alpha} m^{\beta}$

• Sample size:

$m \;\ge\; \frac{1}{\epsilon}\left(n^{\alpha} m^{\beta}\ln 2 + \ln\frac{1}{\delta}\right)$

Page 22:

Weak and Strong Learning

Page 23:

PAC Learning model

• There exists a distribution D over domain X

• Examples: <x, c(x)>

– use c for target function (rather than ct)

• Goal:
– With high probability (1−δ)
– find h in H such that
– error(h,c) < ε
– ε, δ arbitrarily small

Page 24:

Weak Learning Model

• Goal: error(h,c) < ½ − γ

• The parameter γ is small:
– constant
– 1/poly

• Intuitively: a much easier task

• Question:
– Assume C is weakly learnable,
– is C then PAC (strong) learnable?

Page 25:

Majority Algorithm

• Hypothesis: hM(x)= MAJ[ h1(x), ... , hT(x) ]

• $size(h_M) \le T \cdot \max_t size(h_t)$

• Using Occam Razor

Page 26:

Majority: outline

• Sample m examples

• Start with a distribution 1/m per example.

• Modify the distribution and get ht

• Hypothesis is the majority

• Terminate upon perfect classification of the sample

Page 27:

Majority: Algorithm

• Use the Hedge algorithm.

• The "experts" will be associated with sample points.

• The loss is 1 on a correct classification:
– $l^t(i) = 1 - |h_t(x_i) - c(x_i)|$

• Setting b = 1 − γ

• $h_M(x) = \mathrm{MAJORITY}(h_1(x), \ldots, h_T(x))$

• Q: How do we set T?
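A minimal Python sketch of this boosting-by-majority loop, assuming a hypothetical weak_learner(p) that returns a 0/1 hypothesis with error at most ½ − γ under the distribution p:

```python
# Boosting by majority via Hedge over the m sample points.
def boost_by_majority(xs, ys, weak_learner, gamma, T):
    m = len(xs)
    w = [1.0 / m] * m                         # one "expert" per sample point
    hypotheses = []
    b = 1.0 - gamma                           # b = 1 - gamma
    for _ in range(T):
        total = sum(w)
        p = [wi / total for wi in w]
        h = weak_learner(p)
        hypotheses.append(h)
        for i in range(m):
            loss = 1 - abs(h(xs[i]) - ys[i])  # loss 1 on a CORRECT point
            w[i] *= b ** loss                 # correct points lose weight

    def h_M(x):                               # majority vote over h_1..h_T
        votes = sum(h(x) for h in hypotheses)
        return 1 if 2 * votes >= len(hypotheses) else 0
    return h_M
```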

Page 28:

Majority: Analysis

• Consider the set of errors S:
– S = {i | h_M(x_i) ≠ c(x_i)}

• For every i in S:
– $L_i / T \le 1/2$ (Proof!)

• From the Hedge properties:

$\ln\left(\sum_{i\in S} D(x_i)\right) \;\le\; -\frac{\gamma^2 T}{2}$

Page 29:

MAJORITY: Correctness

• Error probability:

$\sum_{i\in S} D(x_i) \;\le\; e^{-\gamma^2 T / 2}$

• Number of rounds:

$T \;\ge\; \frac{2}{\gamma^2}\ln m$

• Terminate when the error is less than 1/m

Page 30:

AdaBoost: Dynamic Boosting

• Better bounds on the error

• No need to "know" γ

• Each round uses a different b_t:

– as a function of the error

Page 31:

AdaBoost: Input

• Sample of size m: < xi,c(xi) >

• A distribution D over examples – We will use D(xi)=1/m

• Weak learning algorithm

• A constant T (number of iterations)

Page 32:

AdaBoost: Algorithm

• Initialization: $w^1(i) = D(x_i)$

• For t = 1 to T do:
– $p^t(i) = w^t(i) / \sum_j w^t(j)$
– Call the weak learner with $p^t$
– Receive $h_t$
– Compute the error $\epsilon_t$ of $h_t$ on $p^t$
– Set $b_t = \epsilon_t / (1-\epsilon_t)$
– $w^{t+1}(i) = w^t(i)\,(b_t)^{e}$, where $e = 1-|h_t(x_i)-c(x_i)|$

• Output

$h_A(x) = I\!\left[\sum_{t=1}^{T}\left(\log\frac{1}{b_t}\right) h_t(x) \;\ge\; \frac{1}{2}\sum_{t=1}^{T}\log\frac{1}{b_t}\right]$
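A minimal Python sketch of the algorithm above, again assuming a hypothetical weak_learner(p) that returns a 0/1 hypothesis with error $\epsilon_t \in (0, 1/2)$ on $p^t$:

```python
import math

# AdaBoost over a sample (xs, ys) with ys in {0,1}; D(x_i) = 1/m.
def adaboost(xs, ys, weak_learner, T):
    m = len(xs)
    w = [1.0 / m] * m                          # w^1(i) = D(x_i) = 1/m
    hs, alphas = [], []
    for _ in range(T):
        total = sum(w)
        p = [wi / total for wi in w]           # p^t(i) = w^t(i) / sum_j w^t(j)
        h = weak_learner(p)                    # weak hypothesis h_t
        eps = sum(pi for pi, x, y in zip(p, xs, ys) if h(x) != y)  # error on p^t
        b = eps / (1.0 - eps)                  # b_t = eps_t / (1 - eps_t)
        for i in range(m):
            e = 1 - abs(h(xs[i]) - ys[i])      # e = 1 on correctly classified points
            w[i] *= b ** e                     # their weight shrinks by b_t
        hs.append(h)
        alphas.append(math.log(1.0 / b))       # vote weight log(1/b_t)

    def h_A(x):                                # weighted majority vote
        s = sum(a * h(x) for a, h in zip(alphas, hs))
        return 1 if s >= 0.5 * sum(alphas) else 0
    return h_A
```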

Page 33:

AdaBoost: Analysis

• Theorem:
– Given $\epsilon_1, \ldots, \epsilon_T$,
– the error of $h_A$ is bounded by

$\epsilon \;\le\; 2^T \prod_{t=1}^{T} \sqrt{\epsilon_t (1-\epsilon_t)}$

Page 34:

AdaBoost: Proof

• Let $l^t(i) = 1-|h_t(x_i)-c(x_i)|$

• By definition: $p^t \cdot l^t = 1-\epsilon_t$

• Upper bounding the sum of weights:
– from the Hedge analysis.

• An error at $x_i$ occurs only if

$\sum_{t=1}^{T}\left(\log\frac{1}{b_t}\right)\left|h_t(x_i)-c(x_i)\right| \;\ge\; \frac{1}{2}\sum_{t=1}^{T}\log\frac{1}{b_t}$

Page 35:

AdaBoost Analysis (cont.)

• Bounding the weight of a point

• Bounding the sum of weights

• Final bound as a function of $b_t$

• Optimizing $b_t$:
– $b_t = \epsilon_t / (1-\epsilon_t)$

Page 36:

AdaBoost: Fixed bias

• Assume $\epsilon_t = 1/2 - \gamma$

• We bound:

$\left(1-4\gamma^2\right)^{T/2} \;\le\; e^{-2\gamma^2 T}$
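Filling in the one-line computation behind this bound:

```latex
% Each factor of the theorem with \epsilon_t = 1/2 - \gamma:
2\sqrt{\epsilon_t(1-\epsilon_t)}
  = 2\sqrt{(1/2-\gamma)(1/2+\gamma)}
  = \sqrt{1-4\gamma^2},
% and, using 1+x \le e^x,
\left(1-4\gamma^2\right)^{T/2} \le \left(e^{-4\gamma^2}\right)^{T/2} = e^{-2\gamma^2 T}.
```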

Page 37:

Learning OR with few attributes

• Target function: OR of k literals

• Goal: learn in time
– polynomial in k and log n
– with ε constant

• ELIM makes "slow" progress:
– disqualifies one literal per round
– may remain with O(n) literals

Page 38:

Set Cover - Definition

• Input: $S_1, \ldots, S_t$ with $S_i \subseteq U$

• Output: $S_{i_1}, \ldots, S_{i_k}$ with $\bigcup_{j} S_{i_j} = U$

• Question: are there k sets that cover U?

• NP-complete

Page 39:

Set Cover Greedy algorithm

• j = 0; $U_j = U$; $C = \emptyset$

• While $U_j \neq \emptyset$:
– Let $S_i$ be $\arg\max_i |S_i \cap U_j|$

– Add Si to C

– Let Uj+1 = Uj – Si

– j = j+1
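A minimal Python sketch of this greedy loop, with sets represented as Python sets (it assumes some cover of the universe exists):

```python
# Greedy set cover: repeatedly pick the set covering the most uncovered elements.
def greedy_set_cover(universe, sets):
    uncovered = set(universe)                 # U_0 = U
    cover = []                                # C = empty
    while uncovered:
        best = max(sets, key=lambda s: len(s & uncovered))  # arg max |S_i ∩ U_j|
        cover.append(best)
        uncovered -= best                     # U_{j+1} = U_j - S_i
    return cover

# Example usage on a small universe.
print(greedy_set_cover({1, 2, 3, 4, 5, 6}, [{1, 2, 3}, {4, 5}, {5, 6}, {1, 4}]))
```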

Page 40:

Set Cover: Greedy Analysis

• At termination, C is a cover.

• Assume there is a cover C’ of size k.

• C’ is a cover for every Uj

• Some S in C' covers at least $|U_j|/k$ elements of $U_j$

• Analysis of $U_j$: $|U_{j+1}| \le |U_j| - |U_j|/k = (1-1/k)\,|U_j|$

• Solving the recursion.

• Number of sets j < k ln |U|
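Solving the recursion (a one-line sketch, using $1+x \le e^x$):

```latex
|U_j| \;\le\; \left(1-\tfrac{1}{k}\right)^{j} |U|
\;\le\; e^{-j/k}\,|U| \;<\; 1
\quad \text{once } j > k \ln |U| .
```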

Page 41:

Building an Occam algorithm

• Given a sample S of size m:
– Run ELIM on S
– Let LIT be the set of remaining literals
– There exist k literals in LIT that classify all of S correctly

• Negative examples:
– any subset of LIT classifies them correctly

Page 42:

Building an Occam algorithm

• Positive examples:
– Search for a small subset of LIT
– which classifies S+ correctly
– For a literal z build $T_z = \{x \mid z \text{ satisfies } x\}$
– There are k sets that cover S+
– Find k ln m sets that cover S+

• Output h = the OR of the k ln m literals

• $size(h) \le k \ln m \cdot \log(2n)$

• Sample size: $m = O(k \log n \cdot \log(k \log n))$
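A minimal Python sketch of the positive-example step, reusing the greedy cover idea above; examples are represented as dicts from literal name to 0/1 (an illustrative encoding, and it assumes the sets $T_z$ can cover S+, which holds when the target OR is consistent):

```python
# Occam algorithm for OR of few literals: greedily cover the positive
# examples S+ with the sets T_z = {x in S+ | literal z is 1 on x}.
def occam_or(positives, literals):
    T = {z: {i for i, x in enumerate(positives) if x[z] == 1} for z in literals}
    uncovered = set(range(len(positives)))
    chosen = []
    while uncovered:                          # greedy set cover of S+
        z = max(literals, key=lambda lit: len(T[lit] & uncovered))
        chosen.append(z)
        uncovered -= T[z]
    return chosen                             # h = OR of the chosen literals
```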