Dasgupta, Kalai & Monteleoni COLT 2005
Analysis of perceptron-based active learning
Sanjoy Dasgupta, UCSD
Adam Tauman Kalai, TTI-Chicago
Claire Monteleoni, MIT
Selective sampling, online constraints
Selective sampling framework:
- Unlabeled examples x_t are received one at a time.
- The learner makes a prediction at each time step. A noiseless oracle for the label y_t can be queried, at a cost.
- Goal: minimize the number of labels needed to reach error ε, where ε is the error rate (w.r.t. the target) on the sampling distribution.
Online constraints:
- Space: the learner cannot store all previously seen examples (and then perform batch learning).
- Time: the running time of the learner's belief update step should not scale with the number of seen examples/mistakes.
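To make the protocol concrete, here is a minimal sketch of the selective sampling loop (illustrative Python only; the learner/oracle interfaces are assumptions, not part of the slides):

```python
# Minimal sketch of the selective sampling protocol (illustrative; the
# learner/oracle interfaces are assumed, not the authors' code).
def selective_sampling(stream, learner, oracle, label_budget):
    labels_used = 0
    for x_t in stream:                        # unlabeled examples arrive one at a time
        learner.predict(x_t)                  # the learner predicts at every time step
        if labels_used < label_budget and learner.wants_label(x_t):
            y_t = oracle(x_t)                 # noiseless label, obtained at a cost
            labels_used += 1
            learner.update(x_t, y_t)          # online update: constant space, fast update
    return learner, labels_used
```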
Problem framework
[Figure: target u, current hypothesis v_t, and the angle θ_t between them]
Target: u. Current hypothesis: v_t.
Error region: the region where u and v_t disagree.
Assumptions: separability; u passes through the origin; x ~ Uniform on the unit sphere S.
Error rate: ε_t.
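For reference, a standard fact for half-spaces through the origin under the uniform distribution on the sphere (implicit in the slides): the error rate is proportional to the angle between hypothesis and target.

```latex
% Error rate vs. angle, for x ~ Uniform(S) and half-spaces through the origin
\[
  \varepsilon_t
  \;=\; \Pr_{x \sim U(S)}\!\bigl[\operatorname{sgn}(u \cdot x) \neq \operatorname{sgn}(v_t \cdot x)\bigr]
  \;=\; \frac{\theta_t}{\pi},
  \qquad \theta_t := \angle(u, v_t).
\]
```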
Related work
Analysis, under the selective sampling model, of the Query By Committee (QBC) algorithm [Seung, Opper & Sompolinsky '92]:
Theorem [Freund, Seung, Shamir & Tishby '97]: Under selective sampling from the uniform distribution, QBC can learn a half-space through the origin to generalization error ε using Õ(d log 1/ε) labels.
BUT: the space required and the time complexity of the update both scale with the number of seen mistakes!
Related work
Perceptron: a simple online algorithm:
- Filtering rule: if y_t ≠ SGN(v_t · x_t), then update.
- Update step: v_{t+1} = v_t + y_t x_t
Distribution-free mistake bound O(1/γ²), if a margin γ exists.
Theorem [Baum '89]: The Perceptron, given sequential labeled examples from the uniform distribution, converges to generalization error ε after Õ(d/ε²) mistakes.
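A minimal sketch of this standard update in Python (illustrative only, assuming unit-norm examples; not the authors' code). The modified update introduced later differs only by the 2|v_t · x_t| scaling.

```python
import numpy as np

def perceptron_update(v, x, y):
    """Standard mistake-driven Perceptron step.

    v : current weight vector; x : example; y : label in {-1, +1}.
    """
    if y * np.sign(v @ x) <= 0:   # filtering rule: update only on a mistake
        v = v + y * x             # update step: v_{t+1} = v_t + y_t x_t
    return v
```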
Our contributions
- A lower bound of Ω(1/ε²) labels for the Perceptron update in the active learning context.
- A modified Perceptron update with an Õ(d log 1/ε) mistake bound.
- An active learning rule and a label bound of Õ(d log 1/ε).
- A bound of Õ(d log 1/ε) on total errors (labeled or not).
Perceptron
Perceptron update: v_{t+1} = v_t + y_t x_t
The error does not decrease monotonically.
[Figure: target u, hypothesis v_t, example x_t, and the updated v_{t+1}]
Lower bound on labels for Perceptron
Theorem 1: The Perceptron algorithm, using any active learning rule, requires Ω(1/ε²) labels to reach generalization error ε w.r.t. the uniform distribution.
Proof idea: Lemma: For small θ_t, the Perceptron update will increase θ_t unless ‖v_t‖ is large: Ω(1/sin θ_t). But ‖v_t‖ grows at rate at most √t, so we need t ≥ 1/sin² θ_t. Under the uniform distribution, ε_t ∝ θ_t ≥ sin θ_t.
[Figure: u, v_t, x_t, and the updated v_{t+1}]
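Spelling out the chain of inequalities sketched above (my reconstruction; it uses ‖x_t‖ = 1 and the fact that updates occur only on mistakes, where y_t (v_t · x_t) ≤ 0):

```latex
% Norm growth of the standard Perceptron over t updates (taking \|v_1\| = 1):
\[
  \|v_{t+1}\|^2 \;=\; \|v_t\|^2 + 2\,y_t\,(v_t \cdot x_t) + \|x_t\|^2 \;\le\; \|v_t\|^2 + 1
  \quad\Longrightarrow\quad \|v_t\| \le \sqrt{t}.
\]
% The lemma says \|v_t\| must reach \Omega(1/\sin\theta_t) before \theta_t can shrink,
% and under the uniform distribution \sin\theta_t \le \theta_t = \pi\varepsilon_t, so
\[
  t \;\ge\; \Omega\!\left(\frac{1}{\sin^2\theta_t}\right) \;\ge\; \Omega\!\left(\frac{1}{\varepsilon_t^{2}}\right).
\]
```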
A modified Perceptron update
Standard Perceptron update:
v_{t+1} = v_t + y_t x_t
Instead, weight the update by "confidence" w.r.t. the current hypothesis v_t:
v_{t+1} = v_t + 2 y_t |v_t · x_t| x_t   (with v_1 = y_0 x_0)
(similar to the update in [Blum et al. '96] for noise-tolerant learning)
Unlike the Perceptron:
- The error decreases monotonically: cos(θ_{t+1}) = u · v_{t+1} = u · v_t + 2 |v_t · x_t| |u · x_t| ≥ u · v_t = cos(θ_t)
- ‖v_t‖ = 1 (due to the factor of 2)
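A minimal runnable sketch of the modified update (illustrative Python, assuming unit-norm examples; not the authors' code). On a mistake the update reflects v_t about the hyperplane orthogonal to x_t, which is why the norm stays 1 and the angle to u can only shrink:

```python
import numpy as np

def modified_perceptron_update(v, x, y):
    """DKM-style step: v <- v + 2 * y * |v.x| * x, applied only on mistakes."""
    if y * np.sign(v @ x) <= 0:               # update only on a mistake
        v = v + 2.0 * y * abs(v @ x) * x      # confidence-weighted update
    return v                                  # ||v|| is unchanged (a reflection)

# Tiny check of the two properties claimed above, on random separable data.
rng = np.random.default_rng(0)
d = 10
u = rng.normal(size=d); u /= np.linalg.norm(u)        # target
x0 = rng.normal(size=d); x0 /= np.linalg.norm(x0)
v = np.sign(u @ x0) * x0                              # v_1 = y_0 x_0
for _ in range(1000):
    x = rng.normal(size=d); x /= np.linalg.norm(x)    # x ~ Uniform(S)
    y = np.sign(u @ x)                                # noiseless label
    v_next = modified_perceptron_update(v, x, y)
    assert u @ v_next >= u @ v - 1e-12                # cos(theta) never decreases
    assert abs(np.linalg.norm(v_next) - 1.0) < 1e-9   # ||v_t|| stays 1
    v = v_next
```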
A modified Perceptron update
Perceptron update: v_{t+1} = v_t + y_t x_t
Modified Perceptron update: v_{t+1} = v_t + 2 y_t |v_t · x_t| x_t
[Figure: geometry of the two updates, showing u, v_t, x_t, and the resulting v_{t+1} for each]
Mistake bound
Theorem 2: In the supervised setting, the modified Perceptron converges to generalization error ε after Õ(d log 1/ε) mistakes.
Proof idea: The exponential convergence follows from a multiplicative decrease in 1 − cos θ_t: on an update, cos θ_{t+1} = cos θ_t + 2 |v_t · x_t| |u · x_t|. We lower bound 2 |v_t · x_t| |u · x_t|, with high probability, using our distributional assumption.
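Written out (a reconstruction; the exact constant c is not given on the slides and is an assumption here), the step being referenced is:

```latex
% On each update the potential 1 - cos(theta_t) drops by exactly the lower-bounded term:
\[
  1 - \cos\theta_{t+1} \;=\; \bigl(1 - \cos\theta_t\bigr) \;-\; 2\,|v_t \cdot x_t|\,|u \cdot x_t|.
\]
% If, with constant probability over the mistake point, the drop is at least a 1/d fraction
% of the potential, i.e. 2|v_t . x_t||u . x_t| >= (c/d)(1 - cos(theta_t)), then
\[
  1 - \cos\theta_{t+1} \;\le\; \Bigl(1 - \tfrac{c}{d}\Bigr)\bigl(1 - \cos\theta_t\bigr),
\]
% and on the order of d log(1/eps) such updates suffice to reach error eps.
```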
Mistake bound
[Figure: the band {x : |a · x| ≤ k} about the equator orthogonal to a]
Theorem 2: In the supervised setting, the modified Perceptron converges to generalization error ε after Õ(d log 1/ε) mistakes.
Lemma (band): For any fixed a with ‖a‖ = 1, any γ ≤ 1, and x ~ U on S, the band {x : |a · x| ≤ γ/√d} has probability mass Θ(γ).
Apply this to |v_t · x| and |u · x| ⇒ 2 |v_t · x_t| |u · x_t| is large enough in expectation (using the size of θ_t).
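As a quick numerical illustration of the band lemma (my addition, not from the slides): the probability that a uniform point on the sphere lands in the band {x : |a · x| ≤ γ/√d} scales roughly linearly in γ.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 50, 100_000
a = np.zeros(d); a[0] = 1.0                      # any fixed unit vector
x = rng.normal(size=(n, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)    # n points uniform on the sphere S^{d-1}

for gamma in (0.1, 0.2, 0.4, 0.8):
    p = np.mean(np.abs(x @ a) <= gamma / np.sqrt(d))
    print(f"gamma={gamma:.1f}: P(|a.x| <= gamma/sqrt(d)) ~ {p:.3f}   p/gamma ~ {p/gamma:.2f}")
```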
Active learning rule
[Figure: hypothesis v_t, target u, and the labeling region L = {x : |v_t · x| ≤ s_t}]
Goal: filter so as to label just those points in the error region. But θ_t, and thus ε_t, is unknown!
Define the labeling region L = {x : |v_t · x| ≤ s_t}. Tradeoff in choosing the threshold s_t:
- If it is too high, we may wait too long for an error.
- If it is too low, the resulting update is too small.
- An appropriate choice of s_t makes the fraction of labeled points that are errors constant.
But θ_t is unknown! So choose s_t adaptively: start high, and halve s_t if there is no error in R consecutive labels.
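Putting the pieces together, here is a minimal sketch of the adaptive filtering rule combined with the modified update (illustrative Python; the initialization s0 and the value of R are assumptions, not the slides' exact prescription):

```python
import numpy as np

def dkm_active_perceptron(stream, oracle, R=16, s0=1.0):
    """Sketch of the active learner: query inside a band, halve the band when quiet.

    stream : iterable of unit-norm examples x_t
    oracle : function x -> true label in {-1, +1}, called only when we pay for a label
    R      : halve the threshold after R consecutive queried labels with no mistake
    s0     : initial labeling threshold ("start high"; assumed value)
    """
    v = None
    s = s0
    quiet = 0                                   # queried labels since the last mistake
    for x in stream:
        if v is None:                           # v_1 = y_0 x_0
            v = oracle(x) * x
            continue
        if abs(v @ x) <= s:                     # filtering rule: query only inside the band
            y = oracle(x)
            if y * np.sign(v @ x) <= 0:         # the labeled point is a mistake
                v = v + 2.0 * y * abs(v @ x) * x    # modified Perceptron update
                quiet = 0
            else:
                quiet += 1
                if quiet >= R:                  # no error in R consecutive labels
                    s /= 2.0                    # halve the threshold
                    quiet = 0
    return v
```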
Label bound
Theorem 3: In the active learning setting, the modified Perceptron, using the adaptive filtering rule, will converge to generalization error ε after Õ(d log 1/ε) labels.
Corollary: The total errors (labeled and unlabeled) will be Õ(d log 1/ε).
Proof technique
Proof outline: We show that the following lemmas hold with sufficient probability:
Lemma 1. s_t does not decrease too quickly.
Lemma 2. We query labels on a constant fraction of the error region.
Lemma 3. With constant probability the update is good.
By the algorithm, roughly a 1/R fraction of the labels are mistakes, and there exists R = Õ(1).
⇒ We can thus bound the labels and the total errors in terms of the mistakes.
Proof technique
Lemma 1. s_t is large enough.
Proof (by contradiction): Let t be the first time s_t becomes too small. Then a halving event has just occurred, meaning we saw R labels with no mistakes, which by Lemma 1a is unlikely.
Lemma 1a: For any particular queried label i, this event (no mistake) happens w.p. ≤ 3/4.
Proof technique
[Figure: u, v_t, and the band of width s_t about the decision boundary of v_t]
Lemma 1a.
Proof idea: Using this value of s_t, the band lemma applied in R^{d−1} gives constant probability of x′ falling in an appropriately defined band w.r.t. u′, where x′ is the component of x orthogonal to v_t and u′ is the component of u orthogonal to v_t. The claim follows.
Proof technique
Lemma 2. We query labels on a constant fraction of the error region.
Proof: Assume Lemma 1 for the lower bound on s_t; then apply Lemma 1a and the band lemma.
Lemma 3. With constant probability the update is good.
Proof: Assuming Lemma 1, by Lemma 2 each error is labeled with constant probability. From the mistake bound proof, each update is good (a multiplicative decrease in error) with constant probability.
Finally, solve for R: every R labels there is at least one update or we halve s_t, so the number of labels is bounded in terms of the numbers of updates and halvings, and there exists R = Õ(1) for which the above lemmas hold.
Summary of contributions

                                    samples             mistakes            labels          total errors    online?
PAC complexity [Long'03][Long'95]   Õ(d/ε), Ω(d/ε)      —                   —               —               —
Perceptron [Baum'97]                Õ(d/ε³)             Õ(d/ε²), Ω(1/ε²)    Ω(1/ε²)         Ω(1/ε²)         yes
QBC [FSST'97]                       Õ((d/ε) log 1/ε)    —                   Õ(d log 1/ε)    Õ(d log 1/ε)    no
[DKM'05]                            Õ((d/ε) log 1/ε)    Õ(d log 1/ε)        Õ(d log 1/ε)    Õ(d log 1/ε)    yes
Conclusions and open problems
- We achieve the optimal label complexity for this problem with, unlike QBC, a fully online algorithm.
- Matching bound on total errors (labeled and unlabeled).
Future work:
- Relax the distributional assumptions: uniform is sufficient but not necessary for the proof. Note: this bound is not possible under arbitrary distributions [Dasgupta '04].
- Relax the separability assumption: allow a "margin" of tolerated error.
- Analyze a margin version: for exponential convergence without the dependence on d.