
Machine Learning on a Budget

Kirill Trapeznikov

September 5th, 2013

Advisors: Venkatesh Saligrama, David Castanon

Committee: Venkatesh Saligrama, David Castanon, Prakash Ishwar, Ioannis Paschalidis

Chair: Ayse Coskun

1 / 79

Supervised Learning

Nature generates pairs (x, y) from a distribution P(x, y); we observe x and want to predict the label y.

example x = an image, a text article, …, a collection of k sensor measurements

label y = scene category, article topic, …, target present / not present

2 / 79

Classifier

Goal: find a classifier f(x) that minimizes the expected error:

\min_f \; \mathbb{E}_{x,y}\big[\mathbf{1}[f(x) \neq y]\big]

If P(x, y) is known, then the optimal f(x) is given by the posterior:

f(x) = \arg\max_y P(y \mid x)

3 / 79

Empirical Risk Minimization

In most cases P(x, y) is unknown and typically hard to estimate.

Empirical Risk Minimization: collect training data (x_1, y_1), (x_2, y_2), … and approximate the expected risk:

\min_{f \in \mathcal{F}} \sum_i \mathbf{1}[f(x_i) \neq y_i]

\mathcal{F} is a family of classifiers (e.g., linear separators)

[Diagram: labeled training data → supervised learner → classifier f(x) with small error]

4 / 79
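To make the ERM recipe concrete, here is a minimal sketch (mine, not from the talk) of minimizing an empirical surrogate risk over a linear family by gradient descent; the synthetic Gaussian data and the logistic surrogate are illustrative assumptions:

```python
import numpy as np

# Toy data: two Gaussian blobs with labels in {-1, +1} (an assumption for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (100, 2)), rng.normal(1.0, 1.0, (100, 2))])
y = np.hstack([-np.ones(100), np.ones(100)])

def surrogate_risk(w):
    # Logistic surrogate C(z) = log(1 + e^{-z}) averaged over margins z_i = y_i * w^T x_i.
    z = y * (X @ w)
    return np.mean(np.log1p(np.exp(-z)))

# Plain gradient descent over the linear family F = {f(x) = w^T x}.
w = np.zeros(2)
for _ in range(500):
    z = y * (X @ w)
    grad = -(X * (y / (1.0 + np.exp(z)))[:, None]).mean(axis=0)
    w -= 0.5 * grad

# The 0/1 empirical risk that the surrogate stands in for.
print(f"surrogate risk: {surrogate_risk(w):.3f}, 0/1 error: {np.mean(np.sign(X @ w) != y):.3f}")
```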

Learning Classifier

Most effort has been in learning a good classifier:

how to parametrize the classifier family \mathcal{F}

how to design and minimize a surrogate risk C(\cdot) replacing \mathbf{1}[\cdot]:

\min_{f \in \mathcal{F}} \sum_i C(f(x_i), y_i)

Recently, the cost of learning has gained importance:

Training Phase: labeling cost

Testing Phase: acquisition cost

5 / 79

Labeling Cost

To train a classifier we need measurements x and labels y.

often, a large collection of unlabeled data is available: the x_i's

labeling a y_i requires an expert, which is typically expensive

Examples

Medical Imaging

large amounts of unlabeled data (tests, scans, etc.); labeling requires a doctor/radiologist

Computer Vision / Object Detection

unlabeled images/videos; annotating objects in images requires user input

How to reduce labeling cost?

6 / 79

Reducing Labeling Cost

It is unnecessary to label every example: many are redundant or uninformative.

[Diagram: unlabeled pool → active learner + expert → labeled subset → supervised learner → classifier f(x) with small error]

Active Learning: label a small fraction of the training data X to learn a good classifier:

\min_{L,\, |L| \le B} \sum_{i \in X} \mathbf{1}[f_L(x_i) \neq y_i]

f_L(x): classifier trained on the labeled subset of examples L

labeling budget: label at most B examples

7 / 79
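As a sketch (not the thesis code), the budgeted labeling loop looks like the following; `oracle` and `select` are hypothetical placeholders for the expert and the query strategy, and the seed set is assumed to contain both classes:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning(X_pool, oracle, select, B, seed_idx):
    # seed_idx: initially labeled indices (assumed to cover both classes).
    labeled = {i: oracle(i) for i in seed_idx}
    clf = LogisticRegression()
    while len(labeled) < B:                       # labeling budget: at most B labels
        idx = sorted(labeled)
        clf.fit(X_pool[idx], [labeled[i] for i in idx])
        unlabeled = [i for i in range(len(X_pool)) if i not in labeled]
        i = select(clf, X_pool, unlabeled)        # query strategy picks the next example
        labeled[i] = oracle(i)                    # pay one unit of the budget
    idx = sorted(labeled)
    return clf.fit(X_pool[idx], [labeled[i] for i in idx])
```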

Test-time Costs

Given a classifier, making a decision requires acquiring sensor measurements: x = x_1, x_2, …, x_K

with acquisition costs: c1, c2, . . . , cK

In many situations,

some sensors: less informative but fast/cheap

others: more informative but slow/expensive

cannot afford to use every sensor all the time

Applications

security screening: x-ray machine vs. manual inspection

medical diagnoses: physical exam vs. invasive procedure

computer vision: coarse features vs. high res

Not every decision requires every sensor!

8 / 79

Reducing test-time costs

Sequential Sensor Selection

Sequentially select sensors to reduce the average acquisition cost:

Classify easy examples based on cheap sensors

Acquire expensive measurements only for difficult decisions

Learn a decision system F(x) from training data with full measurements:

\min_F \; \mathbb{E}_{x,y}\big[\text{error}(F, x, y) + \alpha\, \text{cost}(F, x)\big]

F(x) controls when to stop and classify, or request more sensor measurements

tune α to achieve the desired average acquisition budget

9 / 79


Active Learning: Reducing Training Cost

10 / 79

Overview

Active Boosted Learning: an active learning algorithm in the boosting framework

version space active learning

ActBoost algorithm

theoretical convergence

experiments: comparison to other methods on several datasets

11 / 79

Active Learning Problem

[Diagram: unlabeled pool → active learner + expert → labeled subset → supervised learner → classifier f(x) with small error]

Label a small fraction of the data to learn a good classifier

binary setting: labels y ∈ {+1, −1}

unlabeled pool of M examples: x_1, x_2, …, x_M

12 / 79

Margin Based Active Learning

Margin-based AL labels examples that are ambiguous w.r.t. the current classifier
[Schohn and Cohn, 2000, Balcan et al., 2007, Abe and Mamitsuka, 1998, Campbell et al., 2000]

[Diagram: loop — label the most uncertain sample → update the model → repeat]

Seems like a good strategy?

13 / 79
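A margin-based `select` for the loop sketched earlier would query the pool point closest to the current boundary (my illustration of the idea, not code from the talk):

```python
import numpy as np

def select_most_uncertain(clf, X_pool, unlabeled):
    # Smallest |decision value| = closest to the boundary = most ambiguous.
    margins = np.abs(clf.decision_function(X_pool[unlabeled]))
    return unlabeled[int(np.argmin(margins))]
```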

Sensitive to Initialization

Initialize from the first two clusters:

slow to converge (QBB method)

stuck in the first two clusters

[Figure: three-cluster data with initial labels from clusters 1 and 2; accuracy (%) vs. training samples (#) for ActBoost and QBB]

(Our) Alternative Strategy: ActBoost, robust to initialization bias

Version Space Based approach

14 / 79

Version Space [Freund et al., 1997]

Version Space: the set of classifiers that agree with the labeled data

Generalized Binary Search: label examples to bisect the version space ([Nowak, 2009])

[Figure: animation — each queried label removes roughly half of the candidate classifiers, homing in on the true classifier]

Labeled 4 examples instead of 12

logarithmic reduction in # of labeled examples (for simple classifier families)

15 / 79


Version Space

Issue:

Data is not separable with thresholds ⇒ the version space is empty

But separable with intervals

Need to consider more complex classifiers: boosted classifiers

16 / 79


Boosting

Combine simple binary decisions to form a strong classifier:

f(x) = q_1 \cdot h_1(x) + q_2 \cdot h_2(x) + q_3 \cdot h_3(x) + q_4 \cdot h_4(x) + \ldots

The boosted classifier is parametrized by a weight vector q:

q^T h(x) = \sum_{j=1}^{N} q_j \, h_j(x), \qquad h_j(x) \in \{-1, +1\}

with weights q_j on the weak learners h_j.

assume a fixed set of N weak learners:

x \to [h_1(x)\; h_2(x)\; \ldots\; h_N(x)]^T = h(x)

weak learning assumption: the version space is not empty [Freund and Schapire, 1996]

17 / 79
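In code, the fixed weak-learner map h(x) and the q-weighted vote might look like this (a sketch; the decision stumps are an assumed weak-learner family):

```python
import numpy as np

def stump(j, t):
    # Weak learner h(x) = sign(x_j - t), with outputs in {-1, +1}.
    return lambda X: np.where(X[:, j] > t, 1, -1)

weak_learners = [stump(0, 0.0), stump(1, 0.0), stump(0, 1.0)]  # assumed fixed set, N = 3

def h(X):
    # x -> [h_1(x) ... h_N(x)]^T for a batch of points: shape (n_points, N).
    return np.column_stack([hj(X) for hj in weak_learners])

def boosted_predict(q, X):
    return np.sign(h(X) @ q)   # sign(q^T h(x))
```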

Version Space of Boosted Classifiers

Restrict q to the probability simplex:

Q = \{ q \mid \sum_{j=1}^{N} q_j = 1,\; q_j \ge 0 \}

q correctly classifies x if

\text{sgn}[q^T h(x)] = y \iff y\, q^T h(x) > 0

For a labeled set of examples L_t:

version space = the set of q's that classify L_t correctly

Q_t = \{ q \in Q \mid y_i\, h(x_i)^T q \ge 0,\; \forall i \in L_t \}

Think of a polyhedron in N dimensions: [Figure: the polyhedron Q_t carved out of the simplex by the halfspaces y_i h(x_i)^T q ≥ 0]

18 / 79
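A direct membership test for Q_t, written as a sketch with H_labeled holding the rows h(x_i):

```python
import numpy as np

def in_version_space(q, H_labeled, y_labeled, tol=1e-9):
    # q must lie on the probability simplex ...
    on_simplex = bool(np.all(q >= -tol)) and abs(q.sum() - 1.0) < 1e-6
    # ... and satisfy every labeled constraint y_i h(x_i)^T q >= 0.
    consistent = bool(np.all(y_labeled * (H_labeled @ q) >= -tol))
    return on_simplex and consistent
```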

Iterations of Active Learning

As example x_t is labeled at time t, the labeled set L_t grows:

\emptyset = L_0 \subset L_1 \subset L_2 \subset \ldots \subset L_t

and the version space shrinks:

Q = Q_0 \supset Q_1 \supset Q_2 \supset \ldots \supset Q_t

Labeled examples become constraints:

y_t\, q^T h(x_t) \ge 0

[Figure: the simplex Q = Q_0 successively cut down to Q_1, Q_2, …, Q_t by the constraint hyperplanes y_i h(x_i)]

How to pick x_t to maximally reduce the version space?

19 / 79

Active Boosted Learning: ActBoost

Generalized binary search in the space of boosted classifiers.

Label x_t to best bisect the version space → geometric reduction.

y_t is unknown, but once revealed → about half of the version space is eliminated!

[Diagram: the hyperplane h(x_t) splits Q_t into Q_t^+ and Q_t^−; after requesting the label, y_t = +1 keeps Q_t^+ and y_t = −1 keeps Q_t^−]

ActBoost Strategy: label the x_t with the smallest volume difference:

\min_{x \in U_t} \big| \text{Vol}\, Q_t^+(x) - \text{Vol}\, Q_t^-(x) \big|

20 / 79

Approximate ActBoost Strategy

Approximate the volume by uniform random samples from Q_t (hit-and-run algorithm):

samples from Q_t: q_1, q_2, \ldots, q_D

[Figure: sampled q_d's falling on either side of the hyperplane h(x)]

Label the x with the greatest disagreement among the samples:

\min_{x \in U_t} \big| \text{Vol}\, Q_t^+(x) - \text{Vol}\, Q_t^-(x) \big| \;\approx\; \min_{x \in U_t} \Big| \sum_{d=1}^{D} \mathbf{1}[h(x)^T q_d > 0] - \sum_{d=1}^{D} \mathbf{1}[h(x)^T q_d \le 0] \Big|

pick x to equalize the number of q_d's on the two sides

21 / 79
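Given samples q_1, …, q_D from the current version space (the hit-and-run sampler is treated as an input here), the selection rule is a few lines; a sketch of the rule above, not the thesis implementation:

```python
import numpy as np

def actboost_select(H_unlabeled, Q_samples):
    # H_unlabeled: (M, N) rows h(x) for the unlabeled pool; Q_samples: (D, N) rows q_d.
    scores = H_unlabeled @ Q_samples.T            # (M, D): q_d^T h(x) for every pair
    n_pos = (scores > 0).sum(axis=1)              # samples voting +1 on each x
    n_neg = (scores <= 0).sum(axis=1)             # samples voting -1 on each x
    return int(np.argmin(np.abs(n_pos - n_neg)))  # most even split = best bisection
```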

Summary of Convergence Results

After labeling t examples, what can we say? We prove the following:

Volume Convergence

the volume is reduced to an ε fraction after t = \log(1/\varepsilon) / \log(1/\lambda) labels

logarithmic speed-up: \log(1/\varepsilon) vs. 1/\varepsilon

λ is a computable constant, a property of x_1, …, x_M and h_1, …, h_N

but reducing volume does not imply reducing error!

Error Convergence of the Sparse Strategy

need structure: search a p-sparse version space instead

limit q \in Q to p non-zero entries among q_1, \ldots, q_N

label t \ge \big( \log \binom{N}{p} + \log \tfrac{1}{\varepsilon} \big) / \log \tfrac{1}{\lambda} examples and achieve the error of a classifier trained on the full data

combinatorially hard: searches \binom{N}{p} subspaces!

ActBoost is a convex surrogate of the sparse strategy

22 / 79

Experiments

Compare to Query By Boosting (QBB)

a margin-based method [Abe and Mamitsuka, 1998] in boosting

labels examples that are ambiguous w.r.t. the current boosted classifier

requires initialization

ActBoost

use AdaBoost as the supervised learner to evaluate performance

stumps (thresholds on dimensions of x) as weak learners

[Diagram: ActBoost + expert select and label examples; AdaBoost trains the evaluation classifier F_{qa}(x)]

23 / 79

Unbiased Initialization Experiments

Remove initialization bias by randomly resampling the initial labeled set

simulate a well-behaved initialization (2D synthetic datasets)

[Figures: accuracy (%) vs. training samples (#) for Random, ActBoost, ActBoost(sp), and QBB on (a) banana and (b) box]

ActBoost is on par with QBB

ActBoost(sp) simulates the intractable sparse strategy with prior knowledge of the sparse support

24 / 79

Biased Initialization Experiments

Simulate adversarial conditions:

data lives in clusters; initialize by labeling points only from the first two clusters

[Figure: three-cluster data with initial labels from clusters 1 and 2; accuracy (%) vs. training samples (#) for ActBoost and QBB]

Goal: quickly discover (start labeling) 3rd cluster

25 / 79

Biased Initialization Experiments

Datasets consisting of three clusters, initialized from the first two only:

Dermatology: predict skin disease from physiological features

Soy: soybean disease from seed attributes

Iris: flower type from leaf measurements

[Figures: accuracy (%) vs. training samples (#) for ActBoost and QBB on each dataset]

ActBoost quickly discovers the unknown clusters; QBB does not explore the full space

26 / 79

Summary of ActBoost Work

Algorithm:

a novel active learning algorithm in the boosting framework

labels examples to approximately halve the feasible set of boosted classifiers

Convergence Results:

characterize volume convergence in terms of properties of the weak learners and the unlabeled training data

logarithmic error convergence for a sparse strategy

Experiments:

performs on par with margin-based methods when initialization is unbiased

not sensitive to initialization bias, unlike margin-based methods

Publication: Active Boosted Learning, AISTATS, 2011

27 / 79


Sequential Sensor Selection: reducing test-time costs

28 / 79

Overview

Sequential Reject Classifiers: learning sequential decisions to reduce acquisition cost

motivation

importance of future uncertainty in learning decisions

two-stage example, novel empirical risk approach

multiple stages

experimental results

29 / 79

Sequential Reject Classifier

[Diagram: K-stage cascade f_1 → f_2 → … → f_K; each stage classifies or rejects to the next, from a cheap/fast sensor to a slow/costly one]

K-stage decision system:

Stage k can use sensor k for a cost c_k

Measurements can be high dimensional

The order of stages/sensors is given

Decision at each stage, f_k(x_1, x_2, \ldots, x_k) \in \{\text{classify}, \text{reject}\}:

classify using measurements x_1, x_2, …, x_k, or

request (reject to) the next sensor

Learn the decisions F = \{f_1, f_2, \ldots, f_K\} to trade off error vs. cost:

\min_F \; \mathbb{E}_{x,y}\big[\text{error}(F, x, y) + \alpha\, \text{cost}(F, x)\big]

30 / 79
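At test time the cascade is a simple loop; this sketch assumes `stages` pairs each pretrained classifier d_k with its reject rule g_k, and `x_upto(k)` returns the first k sensor measurements (both hypothetical interfaces):

```python
def run_cascade(stages, x_upto, costs):
    # stages: [(d_1, g_1), ..., (d_K, g_K)]; the last stage must classify.
    total_cost = 0.0
    for k, (d_k, g_k) in enumerate(stages, start=1):
        total_cost += costs[k - 1]                    # pay for sensor k
        last_stage = (k == len(stages))
        if last_stage or not g_k(x_upto(k)):          # g_k(...) is True when it rejects
            return d_k(x_upto(k)), total_cost         # stop and classify here
```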


Example

Sensors of Increasing Resolutions

classify handwritten digit images

[Diagram: stages f_1 … f_4 see the digit at increasing resolution, from low resolution (cheap) to high resolution (expensive)]

Do we need all sensors for every decision?

31 / 79

Difficult Decision

[Diagram: an ambiguous digit 8 is rejected by f_1, f_2, f_3 and classified only by f_4 at full resolution]

high acquisition cost: need full resolution to make a decision

32 / 79


Easy Decision

[Diagram: a clear digit 1 is classified by f_1 at the lowest resolution]

small acquisition cost: full resolution is unnecessary

33 / 79


How to reduce sensor cost?

Sensor 1 is cheap, Sensor 2 is expensive

[Figure: decision boundaries in the (sensor 1, sensor 2) plane for the centralized strategy and the non-adaptive strategy]

Centralized strategy:

use both sensors

high cost, low error

Non-adaptive strategy:

only use sensor 1

low cost, high error

34 / 79


A better strategy: be adaptive

Only request the 2nd sensor on difficult examples

[Figure: the stage 1 decision classifies the easy regions of sensor 1 and rejects an ambiguous band; the stage 2 decision uses both sensors on the rejected examples]

35 / 79

How does it compare?

Same error rate as centralized for half the cost

[Figure: error rate vs. average cost per sample; the adaptive operating points lie between non-adaptive (1st sensor only, cost 0) and centralized (both sensors, cost 1), reaching the centralized error rate at roughly half the cost]

36 / 79

Deciding to reject

How to decide whether to use the next sensor?

[Diagram: two-stage cascade — f_1 on the cheap/fast sensor, f_2 on the expensive/slow sensor]

Risk of a decision:

\min\big[\underbrace{\text{current uncertainty}}_{\text{classify}},\; \underbrace{\alpha \times \text{cost} + \text{future uncertainty}}_{\text{reject to next stage}}\big]

(uncertainty is in correct classification)

Does the acquisition cost justify the reduction in uncertainty?

37 / 79


Deciding to reject

\text{Risk} = \min\big[\underbrace{\text{current uncertainty}}_{\text{classify}},\; \underbrace{\alpha \times \text{cost} + \text{future uncertainty}}_{\text{reject to next stage}}\big]

Difficulty: the sensor output is not known since it has not been acquired

How to determine future uncertainty?

Must base the decision on the collected measurements!

[Figure: joint (1st sensor, 2nd sensor) space; only the 1st sensor has been observed]

38 / 79

Myopic Approach

It is not clear how to determine the uncertainty of the future:

\min\big[\underbrace{\text{current uncertainty}}_{\text{classify}},\; \underbrace{\alpha \times \text{cost} + \text{future uncertainty}}_{\text{reject to next stage}}\big]

Ignore the future, and only use the current uncertainty to make a decision:

\min\big[\underbrace{\text{current uncertainty}}_{\text{classify}},\; \underbrace{\alpha \times \text{cost}}_{\text{reject to next stage}}\big]

Reduces to:

\text{decision} = \begin{cases} \text{classify}, & \text{uncertainty} < \text{threshold} \\ \text{reject}, & \text{uncertainty} \ge \text{threshold} \end{cases}

39 / 79
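In the discriminative setting of the next slide, the myopic rule is just a threshold on the stage classifier's confidence; a minimal sketch:

```python
import numpy as np

def myopic_decisions(clf, X_stage, threshold):
    # Confidence = |distance to the decision boundary| (the margin);
    # low confidence = high uncertainty = reject to the next sensor.
    confidence = np.abs(clf.decision_function(X_stage))
    return np.where(confidence >= threshold, "classify", "reject")
```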


Myopic In Discriminative Setting

Train a classifier h(x) at each stage

Classifier uncertainty ≈ distance to the decision boundary (margin):

Small distance → high uncertainty

Large distance → low uncertainty

[Figure: a reject band around the decision boundary of h(x), set by the threshold]

Related work: [Liu et al., 2008]

40 / 79

Example 1

Data: [Figure: two-class data in the (sensor 1, sensor 2) plane]

41 / 79

Example 1

1st Stage Classifier: only utilizes Sensor 1

[Figure: the 1st-stage boundary is a threshold on sensor 1]

41 / 79

Example 1

2nd Stage Classifier: utilizes Sensors 1 and 2

[Figure: the 2nd-stage boundary uses both sensors]

41 / 79

Example 1

Myopic Reject Classifier

[Figures: the stage 1 decision classifies outside a reject band on sensor 1; rejected examples fall to the stage 2 decision]

42 / 79

Example 1

Myopic Reject Classifier

Requests sensor 2 where sensor 1 is ambiguous

Current uncertainty seems to be a good criterion for rejection

[Figure: the reject region (reject to the 2nd stage, i.e., request the 2nd sensor) covers the band where sensor 1 is ambiguous]

43 / 79

Example 1: Error vs Budget

sweep the threshold to generate different operating points

[Figure: error vs. budget; the myopic curve nearly matches the optimal one]

Good performance overall

44 / 79

Example 2

[Figure: two-class data in the (sensor 1, sensor 2) plane]

45 / 79

Example 2

1st Stage Classifier: only utilizes Sensor 1

[Figure: the 1st-stage boundary is a threshold on sensor 1]

45 / 79

Example 2

2nd Stage Classifier: utilizes Sensors 1 and 2

[Figure: the 2nd-stage boundary uses both sensors]

45 / 79

Example 2

Region 1

[Figure: region 1 highlighted in the (sensor 1, sensor 2) plane]

separable only with sensor 2

45 / 79

Example 2

Region 2

[Figure: region 2 highlighted in the (sensor 1, sensor 2) plane]

neither sensor helps

45 / 79

Example 2

Myopic Reject Decision

Sensor 1 uncertainty is equally distributed between regions 1 and 2, so the myopic rule rejects uniformly in both regions

[Figure: the myopic reject band (reject to the 2nd stage) covers both regions]

46 / 79

Example 2

Myopic Reject Decision

Current uncertainty is equally distributed between regions 1 and 2

Without future uncertainty, the rule cannot tell where sensor 2 is useful

[Figure: error vs. budget; the myopic curve stays well above the optimal one]

47 / 79

Myopic

[Figures: error vs. budget — myopic fails on Example 2 but works on Example 1, compared against optimal]

48 / 79

Future Uncertainty is Important

Need to incorporate future uncertainty in the decision:

\min\big[\underbrace{\text{current uncertainty}}_{\text{classify}},\; \underbrace{\alpha \times \text{cost} + \text{future uncertainty}}_{\text{reject to next stage}}\big]

49 / 79

Generative & Parametric Methods

Known model: partially observable Markov decision process (POMDP)

Posterior Model: P( label | sensor measurements )

Likelihood Model: P( sensor k | sensor j )

Method 1: Learn the models and solve the POMDP

hard to learn the models

cannot solve the POMDP in the general case

Previous Work: [Ji and Carin, 2007, Kapoor and Horvitz, 2009, Zubek and Dietterich, 2002]

Method 2: Greedily maximize the expected utility of a sensor

a one-step look-ahead approximation to the POMDP; unclear how to choose the utility

correlation across sensors: the likelihood is hard to learn (e.g., when the sensor output is an image)

Previous Work: [Kanani and Melville, 2008, Koller and Gao, 2011]

50 / 79

Our Approach

Avoid estimating probability models

Directly learn the decision at each stage from training data

Empirical Risk Minimization (ERM): incorporates the uncertainty of the future in the current decision

51 / 79

Two Stage System

[Diagram: two-stage cascade — f_1 classifies or rejects using the cheap/fast sensor; f_2 classifies using the expensive/slow sensor]

52 / 79

Parametrization of Reject Region

We developed several parametrizations of f_k(x):

Binary Setting:

reject as the disagreement region of two binary decisions

learn f_k^p, f_k^n:

f_k(x) = \begin{cases} f_k^p(x), & f_k^p(x) = f_k^n(x) \\ \text{reject}, & f_k^p(x) \neq f_k^n(x) \end{cases}

Multi-class Setting:

stage classifiers d_k(x) \in \{1, \ldots, C\} are given / pretrained on x_1, \ldots, x_k

access to \sigma_{d_k}(x), the confidence of classification (e.g., margin)

only learn the binary decision g_k(x):

f_k(x) = \begin{cases} d_k(x), & \sigma_{d_k}(x) > g_k(x) \\ \text{reject}, & \sigma_{d_k}(x) \le g_k(x) \end{cases}

We focus on a simplified parametrization for illustration.

53 / 79
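As a one-line sketch, the multi-class stage decision with a pretrained d_k, its confidence σ_{d_k}, and a learned g_k (all passed in as callables here, an assumed interface):

```python
def stage_decision(d_k, sigma_k, g_k, x):
    # Classify with the pretrained stage classifier when its confidence
    # exceeds the learned reject rule's value; otherwise reject.
    return d_k(x) if sigma_k(x) > g_k(x) else "reject"
```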


Stage Classifiers

[Diagram: two-stage cascade — f_1 on the cheap/fast sensor, f_2 on the expensive/slow sensor]

Fix the classifiers at each stage:

d_1(x) is a standard classifier trained on sensor 1

d_2(x) is a standard classifier trained on sensors 1 & 2

d_1(x) := d_1(x_1), \qquad d_2(x) := d_2(x_1, x_2)

54 / 79

Decompose Reject Decision

[Diagram: two-stage cascade — f_1 on the cheap/fast sensor, f_2 on the expensive/slow sensor]

Decompose the classification and rejection decisions:

g(x) is the reject / not-reject decision:

f_1(x) = \begin{cases} d_1(x), & g(x) = \text{not reject} \\ \text{reject}, & \text{else} \end{cases}

55 / 79

Risk Based Approach

[Diagram: two-stage cascade — f_1 on the cheap/fast sensor, f_2 on the expensive/slow sensor]

Risks of each stage:

Current: R_{cu}(x) = \mathbf{1}[d_1 \text{ misclassifies } x]

Future: R_{fu}(x) = \mathbf{1}[d_2 \text{ misclassifies } x] + \alpha \times \text{sensor 2 cost}

Stage 1 reject decision g(x):

g(x) = \begin{cases} \text{classify at 1}, & R_{cu}(x) < R_{fu}(x) \\ \text{reject to 2nd sensor}, & R_{cu}(x) \ge R_{fu}(x) \end{cases}

Difficulty: R_{cu} and R_{fu} require the ground truth y, and R_{fu} requires sensor 2

56 / 79
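On a fully measured training set both risks are directly computable; a small sketch with the pretrained stage predictions given as arrays:

```python
import numpy as np

def stage_risks(d1_pred, d2_pred, y, alpha, c2):
    # R_cu: stage-1 misclassification; R_fu: stage-2 misclassification plus
    # the scaled cost of acquiring sensor 2.
    R_cu = (d1_pred != y).astype(float)
    R_fu = (d2_pred != y).astype(float) + alpha * c2
    return R_cu, R_fu
```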


Empirical Risk Minimization

Use training data with full measurements:

(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)

and the system risk for a point x under decision g(x):

R(g, x, y) = \begin{cases} R_{cu}(x, y), & g(x) = \text{not reject} \\ R_{fu}(x, y), & g(x) = \text{reject} \end{cases}

Minimize the empirical risk:

\min_g \mathbb{E}_{x,y}[R(g, x, y)] \;\approx\; \min_{g \in \mathcal{G}} \frac{1}{N} \sum_{i=1}^{N} R(g, x_i, y_i)

57 / 79


Back to Example 2

[Figure: Example 2 data in the (sensor 1, sensor 2) plane]

58 / 79

Example 2

[Figure: the learned reject region (reject to the 2nd stage) in the (sensor 1, sensor 2) plane]

59 / 79

Example 2

Smaller error for the same cost:

Myopic: error = 19%    Ours: error = 14.8%

[Figures: reject regions of the myopic rule and of our rule at equal cost]

60 / 79

Example 2

Incorporating future uncertainty in the current decision improves performance

[Figure: error vs. budget for myopic, ours, and optimal; our curve tracks the optimal one]

61 / 79

Learning to Reject

How to learn the reject decision g(x) (the green region in the figures)?

Reduce the reject option to learning a binary decision.

Define a weighted supervised learning problem:

the risk difference induces pseudo-labels on the training data,

\text{pseudo label of } x_i = \begin{cases} \text{reject}, & R_{cu}(x_i) > R_{fu}(x_i) \\ \text{not reject}, & R_{cu}(x_i) \le R_{fu}(x_i) \end{cases}

and importance weights, where the risk difference is the penalty for misclassifying:

\text{weight of } x_i = |R_{cu}(x_i) - R_{fu}(x_i)|

62 / 79
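Continuing the two-stage sketch, the pseudo-labels and weights are one comparison and one absolute difference per example:

```python
import numpy as np

def pseudo_labels_and_weights(R_cu, R_fu):
    # -1 = reject (the future risk is strictly smaller), +1 = not reject.
    labels = np.where(R_cu > R_fu, -1, 1)
    weights = np.abs(R_cu - R_fu)    # how much the choice matters for this example
    return labels, weights
```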

Learning to Reject

Risks induce pseudo-labels.

[Diagram: training examples tagged reject / not reject and weighted by |R_{cu} − R_{fu}|, fed to a learner for the reject decision]

\text{pseudo label of } x_i = \begin{cases} \text{reject}, & R_{cu}(x_i) > R_{fu}(x_i) \\ \text{not reject}, & R_{cu}(x_i) \le R_{fu}(x_i) \end{cases}

\text{weight of } x_i = |R_{cu}(x_i) - R_{fu}(x_i)|

63 / 79

Reduction to supervised learning

Empirical risk minimization simplifies to weighted supervised learning:

\arg\min_{g \in \mathcal{G}} \frac{1}{N} \sum_{i=1}^{N} R(g, x_i, y_i) \;=\; \arg\min_{g \in \mathcal{G}} \sum_{i=1}^{N} \mathbf{1}[g(x_i) \neq \text{pseudo label of } x_i] \times \text{weight of } x_i

64 / 79

Reduction to supervised learning

\min_{g \in \mathcal{G}} \sum_{i=1}^{N} \mathbf{1}[g(x_i) \neq \text{pseudo label of } x_i] \times \text{weight of } x_i

This can be solved with existing supervised learning tools. Choose:

a surrogate loss C(z) \ge \mathbf{1}[z \le 0] (e.g., logistic)

a classifier family \mathcal{G} (e.g., linear)

a general framework, not tied to a specific learning algorithm:

\min_{g \in \mathcal{G}} \sum_{i=1}^{N} C[g(x_i) \times \text{pseudo label of } x_i] \times \text{weight of } x_i

65 / 79
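Any learner that accepts importance weights then solves the reduction; here is a sketch with scikit-learn's logistic regression standing in for the surrogate/family choice:

```python
from sklearn.linear_model import LogisticRegression

def learn_reject_rule(X, R_cu, R_fu):
    labels, weights = pseudo_labels_and_weights(R_cu, R_fu)   # from the sketch above
    g = LogisticRegression()
    g.fit(X, labels, sample_weight=weights)   # weighted logistic surrogate
    return g                                  # g.predict(x): -1 = reject, +1 = classify
```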

Generalize to Multiple Stages

Measurements x = [x_1, \ldots, x_K] and true label y

[Diagram: K-stage cascade f_1 → f_2 → … → f_K, from cheap/fast to slow/costly sensors]

Learn the decisions at each stage, F = \{f_1, f_2, \ldots, f_K\}:

f_k(x) = \begin{cases} d_k(x), & g_k(x) = \text{not reject} \\ \text{reject}, & \text{else} \end{cases}

d_k(x) is a standard classifier, pretrained on sensors 1, \ldots, k

66 / 79

Stage-wise Decomposition

System risk: R(F, x, y) = \text{Error}(F(x), y) + \alpha\, \text{Cost}(F, x)

Stage-wise recursion:

R(F, x_i, y_i) = R_1(x_i, y_i, f_1)

R_k(x_i, y_i, f_k) = \begin{cases} \delta_i^{k+1}, & \text{reject to next stage} \\ 1, & \text{error \& not reject} \\ 0, & \text{correct \& not reject} \end{cases}

Cost-to-go: \delta_i^{k+1} = \alpha c_{k+1} + R_{k+1}(x_i, y_i, f_{k+1}),

the risk of the future stage decisions f_{k+1}, \ldots, f_K if rejected

67 / 79


Stage-wise Decomposition

R_k(x_i, y_i, f_k) = \begin{cases} \delta_i^{k+1}, & \text{reject to next stage} \\ 1, & \text{error \& not reject} \\ 0, & \text{correct \& not reject} \end{cases}

Key Observation: given the past f_1, \ldots, f_{k-1} and the future f_{k+1}, \ldots, f_K,

find the current decision f_k from the single-stage risk R_k.

This is equivalent to a two-stage problem with

R_{cu}(x_i) = \mathbf{1}[d_k \text{ misclassifies } x_i], \qquad R_{fu}(x_i) = \delta_i^{k+1}

68 / 79

Algorithm

For every training example x_i:

cost-to-go \delta_i^{k+1}: the empirical risk of the future stages

state S_i^k: indicates whether the example is still active at stage k

Algorithm: alternately minimize one stage at a time.

For every stage k:

1: Learn the decision f_k:

\min_{f \in \mathcal{F}} \sum_{i=1}^{N} S_i^k\, R_k[f, x_i, y_i, \delta_i^{k+1}]

2: Update the states S_i^j for future stages j > k

3: Update the costs-to-go \delta_i^j for past stages j < k

Repeat until the stopping criterion is satisfied.

69 / 79
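A self-contained sketch of the alternating sweep, under a deliberately simple g_k family: one confidence threshold per stage, matching the simplified parametrization above. Here `preds[k]` and `confs[k]` hold the fixed, pretrained stage classifiers' predictions and confidences on the training set; all names are illustrative assumptions, not the thesis code:

```python
import numpy as np

def train_multistage(preds, confs, y, alpha, costs, n_sweeps=3):
    K, N = len(preds), len(y)
    thr = np.zeros(K - 1)                      # the last stage always classifies

    def system_risk(i):
        # Error plus alpha times the cost actually paid for example i.
        cost = costs[0]
        for k in range(K - 1):
            if confs[k][i] > thr[k]:           # classify at stage k
                return float(preds[k][i] != y[i]) + alpha * cost
            cost += costs[k + 1]               # reject: acquire the next sensor
        return float(preds[K - 1][i] != y[i]) + alpha * cost

    grid = np.linspace(0.0, 1.0, 21)
    for _ in range(n_sweeps):                  # alternate over stages, one at a time
        for k in range(K - 1):
            best_t, best_r = thr[k], np.inf
            for t in grid:
                thr[k] = t
                r = np.mean([system_risk(i) for i in range(N)])
                if r < best_r:
                    best_t, best_r = t, r
            thr[k] = best_t
    return thr
```

The thesis version learns a general g_k via the weighted supervised reduction instead of a grid-searched threshold; the alternating loop structure is the same.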


Learning Stage Decision

Step 1:

f_k = \arg\min_{f \in \mathcal{F}} \sum_{i=1}^{N} S_i^k\, R_k[f, x_i, y_i, \delta_i^{k+1}]

simplifies to the two-stage setting with

R_{cu}(x_i) = \mathbf{1}[d_k \text{ misclassifies } x_i], \qquad R_{fu}(x_i) = \delta_i^{k+1}

learn a reject decision g_k(x) for every stage

a supervised weighted classification problem

use any classification algorithm

70 / 79

Example

Sensors of Varying Resolutions

classify handwritten digit images (MNIST)

[Diagram: stages f_1 … f_4 see the digit from low resolution (cheap) to high resolution (expensive)]

x: handwritten digit image; y ∈ {0, 1, …, 9}: label

Stage:   1     2     3       4
Sensor:  4x4   8x8   16x16   32x32
Cost:    1     2     3       4

Supervised Learner: logistic regression with linear classifiers

71 / 79
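The multi-resolution "sensors" can be simulated by block-averaging each 32×32 digit (a minimal stand-in for the experiment's pipeline, not the thesis code):

```python
import numpy as np

def downsample(img, factor):
    # Block-average an image whose sides are divisible by `factor`.
    h, w = img.shape
    return img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def sensor_outputs(img32):
    # Stage k sees the digit at its sensor's resolution (costs 1..4).
    return [downsample(img32, 8),   # stage 1: 4x4   (cheap)
            downsample(img32, 4),   # stage 2: 8x8
            downsample(img32, 2),   # stage 3: 16x16
            img32]                  # stage 4: 32x32 (expensive)
```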

Example

[Figure: digits 0, 1, and 8 shown at the four sensor resolutions; the stage at which each digit can be classified differs]

Sensor selection depends on the example

72 / 79

Handwritten Digit Dataset

Same performance as centralized (best) with a much lower budget

[Figure: error vs. budget for ours, myopic, and centralized on MNIST]

73 / 79

More Experiments

Achieve the target error rate with a fraction of the max budget:

Dataset     Stages   Sensors                            Target Error   Myopic   Ours
synthetic   2                                           .147           52%      28%
pima        3        weight, age, blood tests           .245           41%      15%
threat      3        ir, pmmw, ammw                     .16            89%      71%
covertype   3        soils, wild. areas, elev, aspect   .285           79%      40%
letter      3        pixel counts, edge feat's          .25            81%      51%
mnist       4        res. levels                        .085           90%      52%
landsat     4        hyperspectral bands                .17            56%      31%
mam         2        CAD feat's, expert rating          .173           65%      25%

74 / 79

Summary of Theoretical Results

Train a multi-stage decision system F(x). How well does it perform on unseen data,

\mathbb{E}_{x,y}\big[\mathbf{1}[F(x) \neq y]\big]\,?

We prove two types of test error bounds:

Margin-type bound (ACML 2012 / ML 2013)

two-stage, binary setting, for boosted classifiers

VC-dimension-type bound (AISTATS 2013)

VC dimension = complexity of a classifier family

small VC dimension = good generalization

the complexity of our system scales as K log K times the most complex stage

75 / 79

Summary of Sequential Sensor Selection

Sequential Reject Classifiers

an ordered sequence of stages/sensors

each stage either classifies or rejects to the next stage

multi-class and multi-stage

Novel Empirical Risk Minimization

Decompose into stage-wise empirical risk minimization

Reduce to a series of weighted supervised learning problems

Empirical Results

demonstrate on several datasets

achieve performance of a centralized system with less expensive sensors

Publications

Two Stage Decision System, IEEE SSP, 2012

Multi-Stage Classifier Design, ACML, 2012, and an extended version in Machine Learning, 2013

Supervised Sequential Classification Under Budget Constraints, AISTATS,2013

76 / 79

Future Directions

Dynamic Sensor Selection: learn a sequential decision system (a policy F(x)) when the order of stages is no longer fixed.

Difficulties

the policy has to choose which sensor to acquire next, or to stop and classify

the policy has to handle any subset of measurements

Promising direction: imitation learning

77 / 79

Future Directions

Main idea: instead of learning an optimal policy, learn to imitate a very good policy (an oracle)

the oracle may only be evaluated on a training set; it requires ground truth

learn a parameterized policy function to match the oracle's decisions on the training set

the policy estimate can then be used on unseen examples

Imitation learning is well studied in the robotics community, where the oracle is a human operator.

Difficulty in the sensor selection setting: designing or computing an oracle on the training data

78 / 79


Thank You!

advisors: Venkatesh Saligrama, David Castanon

rest of my committee: Prakash Ishwar, Ioannis Paschalidis

chair: Ayse Coskun

fellow BU students

79 / 79


[Abe and Mamitsuka, 1998] Abe, N. and Mamitsuka, H. (1998). Query learning strategies using boosting and bagging. In Proc. of the 15th ICML, pages 1–9.

[Balcan et al., 2007] Balcan, M.-F., Broder, A., and Zhang, T. (2007). Margin based active learning. In Proceedings of the 20th Conference on Learning Theory.

[Campbell et al., 2000] Campbell, C., Cristianini, N., and Smola, A. (2000). Query learning with large margin classifiers. In Proceedings of the 17th International Conference on Machine Learning, pages 111–118.

[Freund and Schapire, 1996] Freund, Y. and Schapire, R. E. (1996). Experiments with a new boosting algorithm. In Proc. of the 13th ICML, pages 148–156.

[Freund et al., 1997] Freund, Y., Seung, H. S., Shamir, E., and Tishby, N. (1997). Selective sampling using the query by committee algorithm. Machine Learning.

[Ji and Carin, 2007] Ji, S. and Carin, L. (2007). Cost-sensitive feature acquisition and classification. Pattern Recognition.

[Kanani and Melville, 2008] Kanani, P. and Melville, P. (2008). Prediction-time active feature-value acquisition for cost-effective customer targeting. In NIPS.

[Kapoor and Horvitz, 2009] Kapoor, A. and Horvitz, E. (2009). Breaking boundaries: Active information acquisition across learning and diagnosis. In NIPS.

[Koller and Gao, 2011] Gao, T. and Koller, D. (2011). Active classification based on value of classifier. In NIPS.

[Liu et al., 2008] Liu, L.-P., Yu, Y., Jiang, Y., and Zhou, Z.-H. (2008). TEFE: A time-efficient approach to feature extraction. In ICDM.

[Nowak, 2009] Nowak, R. D. (2009). The geometry of generalized binary search. In Advances in Neural Information Processing Systems.

[Schohn and Cohn, 2000] Schohn, G. and Cohn, D. (2000). Less is more: Active learning with support vector machines. In Proceedings of the 17th International Conference on Machine Learning, pages 839–846.

[Zubek and Dietterich, 2002] Zubek, V. B. and Dietterich, T. G. (2002). Pruning improves heuristic search for cost-sensitive learning. In ICML.

79 / 79