
Page 1: Active Learning
COMS 6998-4: Learning and Empirical Inference
Irina Rish, IBM T.J. Watson Research Center

Page 2: Outline

• Motivation
• Active learning approaches:
  - Membership queries
  - Uncertainty sampling
  - Information-based loss functions
  - Uncertainty-region sampling
  - Query by committee
• Applications:
  - Active collaborative prediction
  - Active Bayes net learning

Page 3: Standard Supervised Learning Model

Given m labeled points, we want to learn a classifier with misclassification rate < ε, chosen from a hypothesis class H with VC dimension d < ∞.

VC theory: need m to be roughly d/ε in the realizable case.

Page 4: Active Learning

In many situations – like speech recognition and document retrieval – unlabeled data is easy to come by, but there is a charge for each label.

What is the minimum number of labels needed to achieve the target error rate?

Page 5

Page 6: What is Active Learning?

Unlabeled data are readily available; labels are expensive

Want to use adaptive decisions to choose which labels to acquire for a given dataset

Goal is accurate classifier with minimal cost

Page 7: Active Learning Warning

The choice of data is only as good as the model itself. Assume a linear model; then two data points are sufficient. But what happens when the data are not linear?

Page 8: Active Learning Flavors

• Selective sampling vs. membership queries
• Pool vs. sequential
• Myopic vs. batch

Page 9: Active Learning Approaches

• Membership queries
• Uncertainty sampling
• Information-based loss functions
• Uncertainty-region sampling
• Query by committee

Page 10

Page 11

Page 12: Problem

Many results in this framework, even for complicated hypothesis classes.

[Baum and Lang, 1991] tried fitting a neural net to handwritten characters. The synthetic instances created were incomprehensible to humans!

[Lewis and Gale, 1992] tried training text classifiers: "an artificial text created by a learning algorithm is unlikely to be a legitimate natural language expression, and probably would be uninterpretable by a human teacher."

Page 13

Page 14

Page 15

Page 16: Uncertainty Sampling

Query the event that the current classifier is most uncertain about

Used trivially in SVMs, graphical models, etc.

[Figure: a row of data points; if uncertainty is measured as Euclidean distance to the decision boundary, the query is the point closest to the boundary.]

[Lewis & Gale, 1994]

Page 17: Information-based Loss Function

Maximize KL-divergence between posterior and prior

Maximize reduction in model entropy between posterior and prior

Minimize cross-entropy between posterior and prior

All of these are notions of information gain

[MacKay, 1992]

Page 18: Query by Committee

• Prior distribution over hypotheses
• Sample a set of classifiers from the distribution
• Query an example based on the degree of disagreement among the committee of classifiers

[Seung et al. 1992, Freund et al. 1997]

[Figure: data points and three committee classifiers A, B, C; queries target points where the committee disagrees.]

Page 19: Infogain-based Active Learning

Page 20: Notation

We Have:

1. Dataset, D

2. Model parameter space, W

3. Query algorithm, q

Page 21: Dataset (D) Example

t  Sex  Age    Test A  Test B  Test C  Disease
0  M    40-50  0       1       1       ?
1  F    50-60  0       1       0       ?
2  F    30-40  0       0       0       ?
3  F    60+    1       1       1       ?
4  M    10-20  0       1       0       ?
5  M    40-50  0       0       1       ?
6  F    0-10   0       0       0       ?
7  M    30-40  1       1       0       ?
8  M    20-30  0       0       1       ?

Page 22: Notation

We Have:

1. Dataset, D
2. Model parameter space, W
3. Query algorithm, q

Page 23: Model Example

[Figure: a probabilistic classifier with class node St and observation node Ot.]

Notation:
T : number of examples
Ot : vector of features of example t
St : class of example t

Page 24: Model Example

Patient state (St): St = DiseaseState

Patient observations (Ot): Ot1 = Gender, Ot2 = Age, Ot3 = TestA, Ot4 = TestB, Ot5 = TestC

Page 25: Possible Model Structures

[Figure: two candidate model structures over S, Age, Gender, TestA, TestB, TestC, differing in the edges between S and the observations.]

Page 26: Model Space

Model: St → Ot

Model parameters: P(St), P(Ot | St)

Generative model: must be able to compute P(St = i, Ot = ot | w)

Page 27: Model Parameter Space (W)

• W = space of possible parameter values
• Prior on parameters: P(W)
• Posterior over models:

P(W | D) ∝ P(D | W) P(W) = P(W) ∏_{t=1}^{T} P(St, Ot | W)

Page 28: Notation

We Have:

1. Dataset, D

2. Model parameter space, W

3. Query algorithm, q

q(W,D) returns t*, the next sample to label

Page 29: Game

while NotDone:
• Learn P(W | D)
• q chooses the next example to label
• The expert adds the label to D
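A minimal Python sketch of this game loop (fit_posterior, q, and expert_label are hypothetical stand-ins for the learner, the query algorithm, and the expert):

def active_learning_loop(D, pool, budget, fit_posterior, q, expert_label):
    # D: list of (example, label) pairs; pool: unlabeled examples.
    for _ in range(budget):
        posterior = fit_posterior(D)         # learn P(W | D)
        t_star = q(posterior, pool)          # q chooses the next example to label
        label = expert_label(pool[t_star])   # the expert supplies the label
        D.append((pool.pop(t_star), label))  # add the new labeled example to D
    return fit_posterior(D)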

Page 30: Simulation

[Figure: a pool of examples (O1, S1) … (O7, S7). Labels acquired so far: S2 = false, S5 = true, S7 = false; the remaining St are unknown (?), and q is pondering ("hmm…") which example to query next.]

Page 31: Active Learning Flavors

• Pool

(“random access” to patients)

• Sequential

(must decide as patients walk in the door)

Page 32: q?

• Recall: q(W,D) returns the “most interesting” unlabelled example.

• Well, what makes a doctor curious about a patient?

Page 33

[Slide showing a 1994 paper]

Page 34: Score Function

score_uncert(St) = uncertainty(P(St | Ot)) = H(St) = - Σ_i P(St = i) log P(St = i)
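A small numpy sketch of this score over an unlabeled pool (the predict-probability array and the names are illustrative assumptions):

import numpy as np

def entropy_scores(probs, eps=1e-12):
    """probs: (n_samples, n_classes) array of P(St = i | Ot)."""
    p = np.clip(probs, eps, 1.0)
    return -(p * np.log(p)).sum(axis=1)  # H(St) per sample

def uncertainty_query(probs):
    return int(np.argmax(entropy_scores(probs)))  # most uncertain sample

# Example: three unlabeled patients, binary disease state
probs = np.array([[0.98, 0.02], [0.50, 0.50], [0.88, 0.12]])
print(uncertainty_query(probs))  # -> 1 (the 50/50 case)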

Page 35: Uncertainty Sampling Example

t  Sex  Age    Test A  Test B  Test C  St  P(St)  H(St)
1  M    20-30  0       1       1       ?   0.02   0.043
2  F    20-30  0       1       0       ?   0.01   0.024
3  F    30-40  1       0       0       ?   0.05   0.086
4  F    60+    1       1       0       ?   0.12   0.159
5  M    10-20  0       1       0       ?   0.01   0.024
6  M    20-30  1       1       1       ?   0.96   0.073

The example with the highest entropy, t = 4, is queried; its label comes back FALSE.

Page 36: Uncertainty Sampling Example

t  Sex  Age    Test A  Test B  Test C  St     P(St)  H(St)
1  M    20-30  0       1       1       ?      0.01   0.024
2  F    20-30  0       1       0       ?      0.02   0.043
3  F    30-40  1       0       0       ?      0.04   0.073
4  F    60+    1       1       0       FALSE  0.00   0.00
5  M    10-20  0       1       0       ?      0.06   0.112
6  M    20-30  1       1       1       ?      0.97   0.059

With t = 4 now labeled FALSE, the highest-entropy example is t = 5; its label comes back TRUE.

Page 37: Uncertainty Sampling

GOOD: couldn't be easier
GOOD: often performs pretty well

BAD: H(St) measures information gain about the samples, not the model
BAD: sensitive to noisy samples

Page 38

Can we do better than uncertainty sampling?

Page 39

[Slide showing a 1992 paper]

Informative with respect to what?

Page 40: Model Entropy

[Figure: three posteriors P(W | D) over the model space W, from diffuse (H(W) = high), to more peaked (…better…), to a point mass (H(W) = 0).]

Page 41: Information Gain

• Choose the example that is expected to most reduce H(W)
• I.e., maximize H(W) - H(W | St), where H(W) is the current model-space entropy and H(W | St) is the expected model-space entropy if we learn St

Page 42: Score Function

score_IG(St) = MI(St; W) = H(W) - H(W | St)

Page 43

We usually can't integrate over all models to get H(W) or H(W | St)… but we can sample from P(W | D):

H(W) = - ∫ P(w) log P(w) dw

H(W) ≈ H(C) = - Σ_{c ∈ C} P(c) log P(c)

(C is a committee of models sampled from P(W | D).)

Page 44: Conditional Model Entropy

H(W) = - ∫ P(w) log P(w) dw

H(W | St = i) = - ∫ P(w | St = i) log P(w | St = i) dw

H(W | St) = Σ_i P(St = i) [ - ∫ P(w | St = i) log P(w | St = i) dw ]
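A rough numpy sketch of the resulting score H(C) - H(C | St), estimated with a committee C of models sampled from P(W | D) (the array layout and the uniform committee weighting are assumptions):

import numpy as np

def infogain_scores(committee_probs, eps=1e-12):
    """committee_probs: (C, n_samples, n_classes) array, where
    committee_probs[c, t, i] = P(St = i | Ot, model c).
    Committee members are assumed equally weighted a priori."""
    C, n, k = committee_probs.shape
    prior = np.full(C, 1.0 / C)
    H_prior = -np.sum(prior * np.log(prior))       # H(C) = log C
    scores = np.empty(n)
    for t in range(n):
        p_label = prior @ committee_probs[:, t, :]  # P(St = i), shape (k,)
        H_cond = 0.0
        for i in range(k):
            w = prior * committee_probs[:, t, i]    # posterior committee weights
            w = w / max(w.sum(), eps)
            H_cond += p_label[i] * -np.sum(w * np.log(w + eps))
        scores[t] = H_prior - H_cond                # H(C) - H(C | St)
    return scores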

Page 45: Score Function

score_IG(St) = H(C) - H(C | St)

Page 46: Example: Score = H(C) - H(C | St)

t  Sex  Age    Test A  Test B  Test C  St  P(St)  Score = H(C) - H(C | St)
1  M    20-30  0       1       1       ?   0.02   0.53
2  F    20-30  0       1       0       ?   0.01   0.58
3  F    30-40  1       0       0       ?   0.05   0.40
4  F    60+    1       1       1       ?   0.12   0.49
5  M    10-20  0       1       0       ?   0.01   0.57
6  M    20-30  0       0       1       ?   0.02   0.52

Page 47: Score Function

score_IG(St) = H(C) - H(C | St)
            = H(St) - H(St | C)    (by symmetry of mutual information)

Familiar?

Page 48: Uncertainty Sampling & Information Gain

score_Uncertain(St) = H(St)

score_InfoGain(St) = H(St) - H(St | C)

Page 49

But there is a problem…

Page 50

If our objective is to reduce the prediction error, then "the expected information gain of an unlabeled sample is NOT a sufficient criterion for constructing good queries."

Page 51: Strategy #2: Query by Committee

Temporary assumptions:
• Pool → Sequential
• P(W | D) → Version space
• Probabilistic → Noiseless

QBC attacks the size of the "version space"

Page 52

[Figure: a pool of examples (O1, S1) … (O7, S7); two models sampled from the version space both predict FALSE on the candidate query. No disagreement.]

Page 53

[Figure: the same pool; on another candidate, both models predict TRUE. Again no disagreement.]

Page 54

[Figure: the same pool; on this candidate, Model #1 predicts TRUE and Model #2 predicts FALSE.]

Ooh, now we're going to learn something for sure!

One of them is definitely wrong.

Page 55: The Original QBC Algorithm

As each example arrives…

1. Choose a committee C (usually of size 2) randomly from the version space
2. Have each member of C classify it
3. If the committee disagrees, select it
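A toy Python sketch of this stream-based QBC for 1-D threshold classifiers, where the version space is an interval and can be sampled exactly (an illustration, not the general algorithm):

import numpy as np

rng = np.random.default_rng(0)

# Toy hypothesis class: thresholds c on [0, 1]; h_c(x) = +1 iff x >= c.
# With noiseless data the version space is the interval (lo, hi).
lo, hi = 0.0, 1.0            # current version space
labeled = []

for x in rng.uniform(0, 1, size=200):      # examples arrive one at a time
    c1, c2 = rng.uniform(lo, hi, size=2)   # committee of 2 from the version space
    if (x >= c1) != (x >= c2):             # committee disagrees -> query the label
        y = x >= 0.7                       # hidden true threshold (the oracle)
        labeled.append((x, y))
        if y: hi = min(hi, x)              # shrink the version space
        else: lo = max(lo, x)

print(f"queried {len(labeled)} labels; threshold in ({lo:.3f}, {hi:.3f})")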

Page 56

[Slide showing a 1992 paper]

Page 57: Infogain vs Query by Committee

[Seung, Opper, Sompolinsky, 1992; Freund, Seung, Shamir, Tishby 1997]

First idea: Try to rapidly reduce volume of version space?

Problem: it doesn't take the data distribution into account.

Which pair of hypotheses is closest? It depends on the data distribution P. Distance measure on H: d(h, h') = P(h(x) ≠ h'(x))

Page 58: Query-by-Committee

First idea: try to rapidly reduce the volume of the version space?

Problem: it doesn't take the data distribution into account.

[Figure: version space H; to keep things simple, say d(h, h') = Euclidean distance. The volume can shrink while the error remains large!]

Page 59: Query-by-Committee

An elegant scheme which decreases the volume in a manner that is sensitive to the data distribution.

Bayesian setting: given a prior π on H.

H1 = H
For t = 1, 2, …
  receive an unlabeled point xt drawn from P
  [informally: is there a lot of disagreement about xt in Ht?]
  choose two hypotheses h, h' randomly from (π, Ht)
  if h(xt) ≠ h'(xt): ask for xt's label; set Ht+1

Problem: how to implement it efficiently?

Page 60: Query-by-Committee

For t = 1, 2, …
  receive an unlabeled point xt drawn from P
  choose two hypotheses h, h' randomly from (π, Ht)
  if h(xt) ≠ h'(xt): ask for xt's label; set Ht+1

Observation: the probability of getting the pair (h, h') in the inner loop (when a query is made) is proportional to π(h) π(h') d(h, h').

[Figure: the version space Ht, before vs. after the query.]

Page 61

Page 62: Query-by-Committee

Label bound: for H = {linear separators in Rd} and P = the uniform distribution, just d log 1/ε labels are needed to reach a hypothesis with error < ε.

Implementation: need to randomly pick h according to (π, Ht), e.g. H = {linear separators in Rd}, π = uniform distribution.

[Figure: the version space Ht as a convex region.]

How do you pick a random point from a convex body?

Page 63: Sampling from Convex Bodies

By random walk!
1. Ball walk
2. Hit-and-run

[Gilad-Bachrach, Navot, Tishby 2005] studies such random walks and also ways to kernelize QBC.
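A compact numpy sketch of hit-and-run for a convex body given by linear constraints Ax ≤ b (a generic illustration, not the kernelized version from the paper):

import numpy as np

def hit_and_run(A, b, x0, n_steps, rng):
    """Sample approximately uniformly from {x : A @ x <= b}, starting
    from an interior point x0, via the hit-and-run random walk."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        d = rng.normal(size=x.size)
        d /= np.linalg.norm(d)               # random direction
        # Find the chord {x + t d} inside the body: A @ (x + t d) <= b
        Ad, slack = A @ d, b - A @ x
        t_hi = np.min(slack[Ad > 0] / Ad[Ad > 0]) if np.any(Ad > 0) else np.inf
        t_lo = np.max(slack[Ad < 0] / Ad[Ad < 0]) if np.any(Ad < 0) else -np.inf
        x = x + rng.uniform(t_lo, t_hi) * d  # uniform point on the chord
    return x

# Example: the unit box in R^2
A = np.vstack([np.eye(2), -np.eye(2)])
b = np.array([1., 1., 0., 0.])
print(hit_and_run(A, b, x0=[0.5, 0.5], n_steps=100, rng=np.random.default_rng(0)))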

Page 64

Page 65: Some Challenges

[1] For linear separators, analyze the label complexity for some distribution other than uniform!

[2] How to handle nonseparable data? We need a robust base learner.

[Figure: nonseparable data, with + and - points on both sides of the true boundary.]

Page 66: Active Collaborative Prediction

Page 67: Approach: Collaborative Prediction (CP)

Given previously observed ratings R(x, y), where x is a "user" and y is a "product", predict unobserved ratings.

[Table: a QoS measure (e.g. bandwidth) for Client1…Client4 × Server1…Server3, with some entries unobserved (?).]

[Table: movie ratings for Alina, Gerry, Irina, and Raja on Matrix, Geisha, and Shrek; Alina's rating for "The Matrix" is unobserved (?).]

Examples:
- Will Alina like "The Matrix"? (unlikely)
- Will Client 86 have a fast download from Server 39?
- Will member X of the funding committee approve our project Y?

Page 68: Collaborative Prediction = Matrix Approximation

[Figure: a 100 clients × 100 servers matrix.]

• Important assumption: matrix entries are NOT independent, e.g. similar users have similar tastes

• Approaches: mainly factorized models assuming hidden ‘factors’ that affect ratings (pLSA, MCVQ, SVD, NMF, MMMF, …)

Page 69

Assumptions:
- There is a number of (hidden) factors behind the user preferences that relate to (hidden) movie properties
- Movies have intrinsic values associated with such factors
- Users have intrinsic weights associated with such factors; user ratings are weighted (linear) combinations of the movies' values

[Figure: a user's ratings row (2 4 5 1 4 2) decomposed into hidden factors and the user's weights associated with those factors.]

Page 70

Page 71

Page 72

Objective: find a factorizable X = UV' that approximates Y:

X* = argmin_X Loss(X, Y)

subject to some "regularization" constraints (e.g. rank(X) < k). The loss function depends on the nature of your problem.

[Figure: a sparse observed matrix Y approximated by a rank-k product X = UV'.]
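A small numpy sketch of this objective under squared loss on the observed entries, solved by alternating ridge regressions (a generic low-rank illustration; MMMF itself, discussed next, uses a norm-based convex formulation instead):

import numpy as np

def als_factorize(Y, mask, k=2, lam=0.1, n_iters=50, seed=0):
    """Fit X = U @ V.T to the observed entries of Y (mask == 1) by
    alternating least squares with ridge regularization lam."""
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    U = rng.normal(scale=0.1, size=(n, k))
    V = rng.normal(scale=0.1, size=(m, k))
    I = lam * np.eye(k)
    for _ in range(n_iters):
        for i in range(n):                    # update row (user) factors
            cols = mask[i] == 1
            A = V[cols].T @ V[cols] + I
            U[i] = np.linalg.solve(A, V[cols].T @ Y[i, cols])
        for j in range(m):                    # update column (item) factors
            rows = mask[:, j] == 1
            A = U[rows].T @ U[rows] + I
            V[j] = np.linalg.solve(A, U[rows].T @ Y[rows, j])
    return U, V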

Page 73: Matrix Factorization Approaches

Singular value decomposition (SVD): low-rank approximation; assumes a fully observed Y and sum-squared loss.

In collaborative prediction, Y is only partially observed, and low-rank approximation becomes a non-convex problem with many local minima.

Furthermore, we may not want sum-squared loss, but instead:
- accurate predictions (0/1 loss, approximated by hinge loss)
- cost-sensitive predictions (missing a good server vs. suggesting a bad one)
- ranking cost (e.g., suggest the k 'best' movies for a user)

These are NON-CONVEX PROBLEMS!

Use instead the state-of-the-art Max-Margin Matrix Factorization [Srebro 05]:
- replaces the bounded-rank constraint by a bounded norm of the U, V' vectors
- a convex optimization problem that can be solved exactly by semi-definite programming
- strongly relates to learning max-margin classifiers (SVMs)

Exploit MMMF's properties to augment it with active sampling!

Page 74: Key Idea of MMMF

Rows are feature vectors; columns are linear classifiers (weight vectors).

Xij = signij × marginij
Predictorij = signij: if signij > 0, classify as +1; otherwise classify as -1.

"Margin" here = Dist(sample, line).

[Figure: feature vectors and linear-classifier weight vectors; the margin is the distance from a sample to the separating line.]

Page 75: MMMF: Simultaneous Search for Low-norm Feature Vectors and Max-margin Classifiers

Page 76: Active Learning with MMMF

Margin-based heuristics:
- min-margin (most uncertain)
- min-margin positive ("good" uncertain)
- max-margin ('safe' choice, but no info)

[Figure: a matrix of predicted margins in [-1, 1]; the heuristics above pick different entries to query.]

- We extend MMMF to Active-MMMF using margin-based active sampling
- We investigate the exploitation vs. exploration trade-offs imposed by different heuristics
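A minimal numpy sketch of these selection heuristics over a matrix of predicted margins (function and variable names are illustrative):

import numpy as np

def select_query(margins, observed, heuristic="min-margin"):
    """margins: predicted sign*distance for each (row, column) pair;
    observed: boolean mask of already-labeled entries."""
    m = np.where(observed, np.nan, margins)       # only consider unobserved
    if heuristic == "min-margin":                 # most uncertain
        idx = np.nanargmin(np.abs(m))
    elif heuristic == "min-margin-positive":      # "good" uncertain
        pos = np.where(m > 0, m, np.nan)
        idx = np.nanargmin(pos)
    elif heuristic == "max-margin":               # safe, but little info
        idx = np.nanargmax(np.abs(m))
    else:
        raise ValueError(heuristic)
    return np.unravel_index(idx, margins.shape)   # (row, col) to query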

Page 77: Active Max-Margin Matrix Factorization

A-MMMF(M, s):
1. Given a sparse matrix Y, learn the approximation X = MMMF(Y)
2. Using the current predictions, actively select the "best" s samples and request their labels (e.g., test a client/server pair via an 'enforced' download)
3. Add the new samples to Y
4. Repeat 1-3

Issues:
- Beyond simple greedy margin-based heuristics?
- Theoretical guarantees? Not so easy with non-trivial learning methods and non-trivial data distributions (any suggestions???)
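A sketch of the A-MMMF loop, reusing the hypothetical als_factorize and select_query helpers from the earlier sketches (ALS standing in for the actual MMMF solver):

import numpy as np

def active_mmmf(Y, mask, oracle, rounds=10, s=5, k=2):
    """Active matrix completion loop: factorize, query s entries, repeat.
    oracle(i, j) returns the true value of entry (i, j)."""
    for _ in range(rounds):
        U, V = als_factorize(Y, mask, k=k)          # step 1: learn X = U V'
        X = U @ V.T
        for _ in range(s):                          # step 2: pick s queries
            i, j = select_query(X, mask.astype(bool), "min-margin")
            Y[i, j] = oracle(i, j)                  # request the label
            mask[i, j] = 1                          # step 3: add it to Y
    return als_factorize(Y, mask, k=k)              # step 4 done `rounds` times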

Page 78: Empirical Results

Network latency prediction

Bandwidth prediction (peer-to-peer)

Movie Ranking Prediction

Sensor net connectivity prediction

Page 79: Empirical Results: Latency Prediction

[Plots: P2Psim data; NLANR-AMP data.]

Active sampling with the most-uncertain (and most-uncertain-positive) heuristics provides a consistent improvement over random and least-uncertain-next sampling.

Page 80: Movie Rating Prediction (MovieLens)

Page 81: Sensor Network Connectivity

Page 82: Introducing Cost: Exploration vs Exploitation

Active sampling → lower prediction errors at lower cost (saves hundreds of samples); better predictions → better server-assignment decisions → faster downloads.

Active sampling achieves a good exploration vs. exploitation trade-off: reduced decision cost AND information gain.

[Plots: DownloadGrid, bandwidth prediction; PlanetLab, latency prediction.]

Page 83: Conclusions

Common challenge in many applications: need for cost-efficient sampling

This talk: linear hidden factor models with active sampling

Active sampling improves predictive accuracy while keeping sampling complexity low in a wide variety of applications

Future work:

Better active sampling heuristics?

Theoretical analysis of active sampling performance?

Dynamic Matrix Factorizations: tracking time-varying matrices

Incremental MMMF? (solving from scratch every time is too costly)

Page 84: References (some of the most influential papers)

• Simon Tong, Daphne Koller. Support Vector Machine Active Learning with Applications to Text Classification. Journal of Machine Learning Research. Volume 2, pages 45-66. 2001.

• Y. Freund, H. S. Seung, E. Shamir, N. Tishby. 1997. Selective sampling using the query by committee algorithm. Machine Learning, 28:133-168.

• David Cohn, Zoubin Ghahramani, and Michael Jordan. Active learning with statistical models, Journal of Artificial Intelligence Research, (4): 129-145, 1996.

• David Cohn, Les Atlas and Richard Ladner. Improving generalization with active learning, Machine Learning 15(2):201-221, 1994.

• D. J. C. MacKay. Information-Based Objective Functions for Active Data Selection. Neural Computation, vol. 4, no. 4, pp. 590-604, 1992.

Page 85: NIPS Papers

• Francis Bach. Active learning for misspecified generalized linear models. NIPS-06

• Ran Gilad-Bachrach, Amir Navot, Naftali Tishby. Query by Committee Made Real. NIPS-05

• Brent Bryan, Jeff Schneider, Robert Nichol, Christopher Miller, Christopher Genovese, Larry Wasserman . Active Learning For Identifying Function Threshold Boundaries . NIPS-05

• Rui Castro, Rebecca Willett, Robert Nowak. Faster Rates in Regression via Active Learning. NIPS-05

• Sanjoy Dasgupta. Coarse sample complexity bounds for active learning. NIPS-05

• Masashi Sugiyama. Active Learning for Misspecified Models. NIPS-05

• Brigham Anderson, Andrew Moore. Fast Information Value for Graphical Models. NIPS-05

• Dan Pelleg, Andrew W. Moore. Active Learning for Anomaly and Rare-Category Detection. NIPS-04

• Sanjoy Dasgupta. Analysis of a greedy active learning strategy. NIPS-04

• T. Jaakkola and H. Siegelmann. Active Information Retrieval. NIPS-01

• M. K. Warmuth et al. Active Learning in the Drug Discovery Process. NIPS-01

• Jonathan D. Nelson, Javier R. Movellan. Active Inference in Concept Learning. NIPS-00

• Simon Tong, Daphne Koller. Active Learning for Parameter Estimation in Bayesian Networks. NIPS-00

• Thomas Hofmann and Joachim M. Buhmann. Active Data Clustering. NIPS-97

• K. Fukumizu. Active Learning in Multilayer Perceptrons. NIPS-95

• Anders Krogh, Jesper Vedelsby. Neural Network Ensembles, Cross Validation, and Active Learning. NIPS-94

• Kah Kay Sung, Partha Niyogi. Active Learning for Function Approximation. NIPS-94

• David Cohn, Zoubin Ghahramani, Michael I. Jordan. Active Learning with Statistical Models. NIPS-94

• Sebastian B. Thrun and Knut Moller. Active Exploration in Dynamic Environments. NIPS-91

Page 86: ICML Papers

• Maria-Florina Balcan, Alina Beygelzimer, John Langford. Agnostic Active Learning. ICML-06

• Steven C. H. Hoi, Rong Jin, Jianke Zhu, Michael R. Lyu. Batch Mode Active Learning and Its Application to Medical Image Classification. ICML-06

• Sriharsha Veeramachaneni, Emanuele Olivetti, Paolo Avesani. Active Sampling for Detecting Irrelevant Features. ICML-06

• Kai Yu, Jinbo Bi, Volker Tresp. Active Learning via Transductive Experimental Design. ICML-06

• Rohit Singh, Nathan Palmer, David Gifford, Bonnie Berger, Ziv Bar-Joseph. Active Learning for Sampling in Time-Series Experiments With Application to Gene Expression Analysis. ICML-05

• Prem Melville, Raymond Mooney. Diverse Ensembles for Active Learning. ICML-04

• Klaus Brinker. Active Learning of Label Ranking Functions. ICML-04

• Hieu Nguyen, Arnold Smeulders. Active Learning Using Pre-clustering. ICML-04

• Greg Schohn and David Cohn. Less is More: Active Learning with Support Vector Machines, ICML-00

• Simon Tong, Daphne Koller. Support Vector Machine Active Learning with Applications to Text Classification. ICML-00.

COLT papers:

• S. Dasgupta, A. Kalai, and C. Monteleoni. Analysis of perceptron-based active learning. COLT-05.

• H. S. Seung, M. Opper, and H. Sompolinsky. 1992. Query by committee. COLT-92, pages 287-294.

Page 87: Journal Papers

• Antoine Bordes, Seyda Ertekin, Jason Weston, Leon Bottou. Fast Kernel Classifiers with Online and Active Learning. Journal of Machine Learning Research (JMLR), vol. 6, pp. 1579-1619, 2005.

• Simon Tong, Daphne Koller. Support Vector Machine Active Learning with Applications to Text Classification. Journal of Machine Learning Research. Volume 2, pages 45-66. 2001.

• Y. Freund, H. S. Seung, E. Shamir, N. Tishby. 1997. Selective sampling using the query by committee algorithm. Machine Learning, 28:133--168

• David Cohn, Zoubin Ghahramani, and Michael Jordan. Active learning with statistical models, Journal of Artificial Intelligence Research, (4): 129-145, 1996.

• David Cohn, Les Atlas and Richard Ladner. Improving generalization with active learning, Machine Learning 15(2):201-221, 1994.

• D. J. C. MacKay. Information-Based Objective Functions for Active Data Selection. Neural Computation, vol. 4, no. 4, pp. 590-604, 1992.

• Haussler, D., Kearns, M., and Schapire, R. E. (1994). Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14, 83-113.

• Fedorov, V. V. 1972. Theory of optimal experiment. Academic Press.

• Saar-Tsechansky, M. and F. Provost. Active Sampling for Class Probability Estimation and Ranking. Machine Learning 54:2 2004, 153-178.

Page 88: Workshops

• http://domino.research.ibm.com/comm/research_projects.nsf/pages/nips05workshop.index.html

Page 89: Appendix

Page 90: Active Learning of Bayesian Networks

Page 91: Entropy Function

A measure of information in a random event X with possible outcomes {x1, …, xn}:

H(X) = - Σ_i p(xi) log2 p(xi)    [Shannon, 1948]

Comments on the entropy function:
- The entropy of an event is zero when the outcome is known
- Entropy is maximal when all outcomes are equally likely
- It is the average minimum number of yes/no questions needed to determine the outcome (connection to binary search)
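A quick numeric check of these properties in Python (illustrative):

import numpy as np

def H(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                 # terms with p = 0 contribute nothing
    return -(p * np.log2(p)).sum()

print(H([1.0, 0.0]))   # 0.0 -> outcome known, zero entropy
print(H([0.5, 0.5]))   # 1.0 -> maximal for two outcomes: one yes/no question
print(H([0.25] * 4))   # 2.0 -> four equally likely outcomes: two questions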

Page 92: Kullback-Leibler Divergence

• P is the true distribution; Q is the distribution used to encode the data instead of P
• The KL divergence is the expected extra message length per datum that must be transmitted using Q
• It is a measure of how "wrong" Q is with respect to the true distribution P

DKL(P || Q) = Σ_i P(xi) log (P(xi)/Q(xi))
            = Σ_i P(xi) log P(xi) - Σ_i P(xi) log Q(xi)
            = H(P, Q) - H(P)
            = cross-entropy - entropy
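And the corresponding numeric check (illustrative):

import numpy as np

def cross_entropy(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return -(p * np.log2(q)).sum()

def kl(p, q):
    return cross_entropy(p, q) - cross_entropy(p, p)  # H(P, Q) - H(P)

p, q = [0.8, 0.2], [0.5, 0.5]
print(kl(p, q))   # ~0.278 extra bits per datum when encoding P with Q
print(kl(p, p))   # 0.0: no penalty when Q = P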

Page 93: Learning Bayesian Networks

[Figure: an example Bayesian network over E, B, R, A, C, with a CPT P(A | E, B) whose rows (.9/.1, .7/.3, .99/.01, .8/.2) correspond to the four (B, E) configurations.]

Data + prior knowledge → Learner → model:
• Model building
• Parameter estimation
• Causal structure discovery

Passive learning vs. active learning

Page 94: Active Learning

• Selective active learning
• Interventional active learning

In either case:
• Obtain a measure of quality of the current model
• Choose the query that most improves quality
• Update the model

Page 95: Active Learning: Parameter Estimation [Tong & Koller, NIPS-2000]

• Given a BN structure G
• A prior distribution p(θ)
• The learner requests a particular instantiation q (query)

[Figure: the learning loop; starting from the initial network G with prior p(θ), the active learner issues a query q, the response x arrives as training data, and the distribution is updated to p'(θ), and so on.]

Two questions: how to update the parameter density, and how to select the next query based on p.

Page 96: Updating the Parameter Density

[Figure: an example network over nodes A, B, J, M.]

• Do not update A, since we are fixing it
• If we select A, then do not update B: we would be sampling from P(B | A = a) ≠ P(B)
• If we force A, then we can update B: we are sampling from P(B | A := a) = P(B)*
• Update all other nodes as usual
• Obtain the new density p(θ | A = a, X = x)

*Pearl 2000

Page 97: Bayesian Point Estimation

• Goal: a single estimate θ̃, instead of a distribution p(θ) over θ
• If we choose θ̃ and the true model is θ', then we incur some loss, L(θ' || θ̃)

Page 98: Bayesian Point Estimation

• We do not know the true θ'
• The density p represents our beliefs over θ'
• Choose the θ̃ that minimizes the expected loss:

θ̃ = argmin_θ ∫ p(θ') L(θ' || θ) dθ'

• Call θ̃ the Bayesian point estimate
• Use the expected loss of the Bayesian point estimate as a measure of quality of p(θ):

Risk(p) = ∫ p(θ') L(θ' || θ̃) dθ'

Page 99: The Querying Component

• Set the controllable variables so as to minimize the expected posterior risk:

ExPRisk(p | Q = q) = Σ_x P(X = x | Q = q) ∫ p(θ | x) KL(θ || θ̃) dθ

• The KL divergence will be used for the loss (conditional KL divergence):

KL(θ || θ') = Σ_i KL(P_θ(Xi | Ui) || P_θ'(Xi | Ui))

Page 100: Algorithm Summary

• For each potential query q, compute ExPRisk(p | q)
• Choose the q for which the expected posterior risk is smallest (i.e., the expected risk reduction is greatest)

Cost of computing ExPRisk(p | q):
• The cost of Bayesian network inference
• Complexity: O(|Q| · cost of inference)

Page 101: Uncertainty Sampling

Maintain a single hypothesis, based on the labels seen so far. Query the point about which this hypothesis is most "uncertain".

Problem: the confidence of a single hypothesis may not accurately represent the true diversity of opinion in the hypothesis class.

[Figure: + and - labeled points with a single separator; the queried point X may be uninformative for other hypotheses that are also consistent with the data.]

Page 102

Page 103: Region of Uncertainty

Suppose the data lie on a circle in R2 and the hypotheses are linear separators (the spaces X and H are superimposed).

Current version space: the portion of H consistent with the labels so far. "Region of uncertainty" = the part of the data space about which there is still some uncertainty (i.e., disagreement within the version space).

[Figure: the current version space and the region of uncertainty in the data space, given two + labels.]

Page 104: Region of Uncertainty

Data and hypothesis spaces, superimposed (both are the surface of the unit sphere in Rd).

Algorithm [CAL92]: of the unlabeled points which lie in the region of uncertainty, pick one at random to query.

[Figure: the current version space and the region of uncertainty in the data space.]
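A toy numpy sketch of this rule for 1-D threshold classifiers, where the region of uncertainty is exactly the interval between the rightmost - and the leftmost + seen so far (an illustration, not the general algorithm):

import numpy as np

rng = np.random.default_rng(1)
pool = rng.uniform(0, 1, size=500)    # unlabeled pool, 1-D inputs
lo, hi = 0.0, 1.0                     # region of uncertainty for thresholds

for _ in range(10):
    region = pool[(pool > lo) & (pool < hi)]  # points the version space disagrees on
    if region.size == 0:
        break
    x = rng.choice(region)                    # CAL: pick one at random to query
    if x >= 0.3:                              # oracle with true threshold 0.3
        hi = min(hi, x)
    else:
        lo = max(lo, x)

print(f"threshold localized to ({lo:.3f}, {hi:.3f})")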

Page 105: Region of Uncertainty

The number of labels needed depends on H and also on P.

Special case: H = {linear separators in Rd}, P = uniform distribution over the unit sphere. Then just d log 1/ε labels [1] are needed to reach a hypothesis with error rate < ε. [2]

[1] Compare supervised learning: d/ε labels.
[2] The best we can hope for.

Page 106: Region of Uncertainty

Algorithm [CAL92]: of the unlabeled points which lie in the region of uncertainty, pick one at random to query.

For more general distributions this is suboptimal… We need to measure the quality of a query, or alternatively, the size of the version space.

Page 107

[Figure: expected infogain of a sample, i.e., uncertainty sampling!]

Page 108