techniques for exploiting unlabeled data

Techniques For Exploiting Unlabeled Data Mugizi Rwebangira Thesis Defense September 8,2008 Committee: Avrim Blum, CMU (Co-Chair) John Lafferty, CMU (Co-Chair) William Cohen, CMU Xiaojin (Jerry) Zhu, Wisconsin

Page 1: Techniques For Exploiting Unlabeled Data

Techniques For Exploiting Unlabeled Data

Mugizi Rwebangira

Thesis Defense

September 8, 2008

Committee:

Avrim Blum, CMU (Co-Chair)

John Lafferty, CMU (Co-Chair)

William Cohen, CMU

Xiaojin (Jerry) Zhu, Wisconsin

Page 2: Techniques For Exploiting Unlabeled Data

2

Motivation

Supervised Machine Learning:

Labeled Examples {(xi, yi)} → induction → Model x → y

Algorithms: SVM, Neural Nets, Decision Trees, etc.

Problems: Document classification, image classification, protein sequence determination.

Page 3: Techniques For Exploiting Unlabeled Data

3

Motivation

In recent years, there has been growing interest in techniques for using unlabeled data:

More data is being collected than ever before.

Labeling examples can be expensive and/or require human intervention.

Page 4: Techniques For Exploiting Unlabeled Data

4

Examples

Proteins: sequences can be determined easily, but structure determination is a hard problem.

Web Pages: can be crawled easily, but labeling requires human intervention.

Images: abundantly available (digital cameras), but labeling requires humans (captchas).

Page 5: Techniques For Exploiting Unlabeled Data

5

Motivation

Semi-Supervised Machine Learning:

Labeled Examples {(xi, yi)} + Unlabeled Examples {xi} → Model x → y

Page 6: Techniques For Exploiting Unlabeled Data

6

Motivation

[Figure: examples labeled + and -.]

Page 7: Techniques For Exploiting Unlabeled Data

7

However…

Techniques for using unlabeled data are not as well developed as supervised techniques. Still needed:

Best practices for using unlabeled data.

Techniques for adapting supervised algorithms into semi-supervised algorithms.

Page 8: Techniques For Exploiting Unlabeled Data

8

Outline

Motivation

Randomized Graph Mincut

Local Linear Semi-supervised Regression

Learning with Similarity Functions

Conclusion and Questions

Page 9: Techniques For Exploiting Unlabeled Data

9

Graph Mincut (Blum & Chawla, 2001)

Page 10: Techniques For Exploiting Unlabeled Data

10

Construct an (unweighted) Graph

Page 11: Techniques For Exploiting Unlabeled Data

11

Add auxiliary “super-nodes”

[Figure: graph with auxiliary super-nodes labeled + and -.]

Page 12: Techniques For Exploiting Unlabeled Data

12

Obtain s-t mincut

[Figure: the mincut separating super-nodes + and -.]

Page 13: Techniques For Exploiting Unlabeled Data

13

Classification

[Figure: the mincut splits the nodes into a + side and a - side.]

Page 14: Techniques For Exploiting Unlabeled Data

14

Problem

[Figure: graph with + and - nodes.]

Plain mincut can give very unbalanced cuts.

Page 15: Techniques For Exploiting Unlabeled Data

15

Solution

1. Add random weights to the edges.

2. Run plain mincut and obtain a classification.

3. Repeat the above process several times.

4. For each unlabeled example take a majority vote.
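The randomized procedure on this slide can be sketched in a few lines. This is a minimal illustration, not the implementation used in the thesis: the function names, the noise range, the number of rounds, and the use of Edmonds-Karp max-flow are all assumptions, and node names are assumed not to collide with the super-node names "+" and "-".

```python
import random
from collections import deque


def min_cut_source_side(capacity, source, sink):
    """Edmonds-Karp max-flow; returns the nodes on the source side of a min cut."""
    # residual[u][v] is the remaining capacity on the directed edge u -> v.
    residual = {u: dict(nbrs) for u, nbrs in capacity.items()}
    for u, nbrs in capacity.items():
        for v in nbrs:
            residual.setdefault(v, {}).setdefault(u, 0.0)

    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        parents = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 1e-12 and v not in parents:
                    parents[v] = u
                    queue.append(v)
        if sink not in parents:
            return set(parents)  # everything still reachable from the source
        path, v = [], sink
        while parents[v] is not None:
            path.append((parents[v], v))
            v = parents[v]
        bottleneck = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck


def randomized_mincut(edges, positives, negatives, rounds=25, noise=0.5, seed=0):
    """Majority vote over mincuts of randomly re-weighted copies of the graph."""
    rng = random.Random(seed)
    nodes = {u for edge in edges for u in edge}
    votes = {u: 0 for u in nodes}
    for _ in range(rounds):
        # Step 1: add random weights to the (originally unweighted) edges.
        weights = {(u, v): 1.0 + noise * rng.random() for u, v in edges}
        big = sum(weights.values()) + 1.0  # super-node edges are never worth cutting
        capacity = {}
        for (u, v), w in weights.items():
            capacity.setdefault(u, {})[v] = w
            capacity.setdefault(v, {})[u] = w
        for u in positives:
            capacity.setdefault("+", {})[u] = big
        for u in negatives:
            capacity.setdefault(u, {})["-"] = big
        # Step 2: run plain mincut and read off a classification.
        plus_side = min_cut_source_side(capacity, "+", "-")
        for u in nodes:
            votes[u] += 1 if u in plus_side else -1
    # Step 4: majority vote; |votes[u]| acts as the "margin" of the vote.
    return {u: (1 if votes[u] > 0 else -1) for u in nodes}, votes


# Usage: a path graph with one labeled example at each end.
path_edges = [(i, i + 1) for i in range(5)]
labels, votes = randomized_mincut(path_edges, positives=[0], negatives=[5])
```

On the path graph each round cuts a single (randomly lightest) edge, so the vote totals decrease steadily from the positive end to the negative end.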

Page 16: Techniques For Exploiting Unlabeled Data

16

Before adding random weights

[Figure: the mincut between + and - before adding random weights.]

Page 17: Techniques For Exploiting Unlabeled Data

17

After adding random weights

[Figure: the mincut between + and - after adding random weights.]

Page 18: Techniques For Exploiting Unlabeled Data

18

PAC-Bayes

• PAC-Bayes bounds suggest that when the graph has many small cuts consistent with the labeling, randomization should improve generalization performance.

• In this case each distinct cut corresponds to a different hypothesis.

• Hence the average of these cuts will be less likely to overfit than any single cut.

Page 19: Techniques For Exploiting Unlabeled Data

19

Markov Random Fields

• Ideally we would like to assign a weight to each cut in the graph (a higher weight to small cuts) and then take a weighted vote over all the cuts in the graph.

• This corresponds to a Markov Random Field model.

• We don’t know how to do this efficiently, but we can view randomized mincuts as an approximation.

Page 20: Techniques For Exploiting Unlabeled Data

20

How to construct the graph?

• k-NN
– Graph may not have small balanced cuts.
– How to learn k?

• Connect all points within distance δ
– Can have disconnected components.
– How to learn δ?

• Minimum Spanning Tree
– No parameters to learn.
– Gives connected, sparse graph.
– Seems to work well on most datasets.
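As an illustration of the minimum spanning tree option, here is a small sketch using Prim's algorithm on the complete Euclidean graph. The function name and the choice of Prim's algorithm are assumptions made for the example; the slides do not say how the tree was computed.

```python
import math


def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph; returns n-1 edges (i, j)."""
    n = len(points)
    if n == 0:
        return []
    in_tree = [False] * n
    best_dist = [math.inf] * n  # cheapest known connection to the tree
    best_from = [-1] * n        # tree node providing that connection
    best_dist[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the closest node that is not yet in the tree.
        i = min((k for k in range(n) if not in_tree[k]), key=lambda k: best_dist[k])
        in_tree[i] = True
        if best_from[i] >= 0:
            edges.append((best_from[i], i))
        # Update the cheapest connection of every remaining node.
        for j in range(n):
            if not in_tree[j]:
                dist = math.dist(points[i], points[j])
                if dist < best_dist[j]:
                    best_dist[j] = dist
                    best_from[j] = i
    return edges


# Usage: four points on a line form a chain.
tree_edges = minimum_spanning_tree([(0, 0), (1, 0), (5, 0), (6, 0)])
```

The result is always connected and has exactly n-1 edges, which matches the "connected, sparse, no parameters" properties listed above.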

Page 21: Techniques For Exploiting Unlabeled Data

21

Experiments

• ONE vs. TWO: 1128 examples (8 × 8 array of integers, Euclidean distance).

• ODD vs. EVEN: 4000 examples (16 × 16 array of integers, Euclidean distance).

• PC vs. MAC: 1943 examples (20 Newsgroups dataset, TF-IDF distance).

Page 22: Techniques For Exploiting Unlabeled Data

22

ONE vs. TWO

Page 23: Techniques For Exploiting Unlabeled Data

23

ODD vs. EVEN

Page 24: Techniques For Exploiting Unlabeled Data

24

PC vs. MAC

Page 25: Techniques For Exploiting Unlabeled Data

25

Summary

We can apply PAC sample complexity analysis and interpret it in terms of Markov Random Fields.

Randomization helps plain mincut achieve performance comparable to Gaussian Fields.

There is an intuitive interpretation for the confidence of a prediction in terms of the “margin” of the vote.

“Semi-supervised Learning Using Randomized Mincuts”, A. Blum, J. Lafferty, M.R. Rwebangira, R. Reddy, ICML 2004

Page 26: Techniques For Exploiting Unlabeled Data

26

Outline

Motivation

Randomized Graph Mincut

Local Linear Semi-supervised Regression

Learning with Similarity Functions

Proposed Work and Time Line

Page 27: Techniques For Exploiting Unlabeled Data

27

(Supervised) Linear Regression

[Figure: labeled points (*) plotted on x and y axes.]

Page 28: Techniques For Exploiting Unlabeled Data

28

Semi-Supervised Regression

[Figure: labeled points (*) and unlabeled points (+) plotted on x and y axes.]

Page 29: Techniques For Exploiting Unlabeled Data

29

Smoothness assumption

Things that are close together should have similar values

One way of doing this: Gaussian Fields (Zhu, Ghahramani & Lafferty)

Minimize $\xi(f) = \sum_{i,j} w_{ij} (f_i - f_j)^2$

where $w_{ij}$ is the similarity between examples $i$ and $j$, and $f_i$ and $f_j$ are the predictions for examples $i$ and $j$.
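To make the smoothness penalty concrete, the sketch below evaluates the sum of w_ij (f_i - f_j)^2 on a tiny chain graph. The dictionary-of-pairs representation and the function name are assumptions for illustration, and each listed pair is counted once.

```python
def smoothness_energy(weights, f):
    """Sum of w_ij * (f_i - f_j)^2 over the listed pairs (i, j)."""
    return sum(w * (f[i] - f[j]) ** 2 for (i, j), w in weights.items())


# Usage: three points in a chain, neighbors connected with similarity 1.
chain = {(0, 1): 1.0, (1, 2): 1.0}
smooth = smoothness_energy(chain, [0.0, 0.5, 1.0])  # predictions vary gradually
rough = smoothness_energy(chain, [0.0, 1.0, 0.0])   # predictions jump around
```

The gradually varying predictions get the lower energy, which is the behavior the smoothness assumption rewards.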

Page 30: Techniques For Exploiting Unlabeled Data

30

Local Constancy

The predictions made by Gaussian Fields are locally constant.

[Figure: prediction curve on x and y axes, with points u and u + Δ marked.]

More formally: $m(u + \Delta) \approx m(u)$

Page 31: Techniques For Exploiting Unlabeled Data

31

Local Linearity

For many regression tasks we would prefer predictions to be locally linear.

[Figure: prediction curve on x and y axes, with points u and u + Δ marked.]

More formally: $m(u + \Delta) \approx m(u) + m'(u)\,\Delta$

Page 32: Techniques For Exploiting Unlabeled Data

32

Problem

Develop a version of Gaussian Fields which is locally linear,

or a semi-supervised version of Linear Regression:

Local Linear Semi-supervised Regression

Page 33: Techniques For Exploiting Unlabeled Data

33

Local Linear Semi-supervised Regression

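[Figure: neighboring points $x_i$ and $x_j$ with local linear fits $\beta_i$ and $\beta_j$ and intercepts $\beta_{i0}$ and $\beta_{j0}$; the gap between $\beta_{i0}$ and $X_{ji}^T \beta_j$ is marked.]

Penalty: $(\beta_{i0} - X_{ji}^T \beta_j)^2$

By analogy with $\sum_{i,j} w_{ij} (f_i - f_j)^2$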

Page 34: Techniques For Exploiting Unlabeled Data

34

Local Linear Semi-supervised Regression

So we find β to minimize the following objective function:

$\xi(\beta) = \sum_{i,j} w_{ij} (\beta_{i0} - X_{ji}^T \beta_j)^2$

where $w_{ij}$ is the similarity between $x_i$ and $x_j$.
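The objective can be evaluated directly once X_ji is fixed. The slides do not define X_ji, so the sketch below assumes X_ji = (1, x_i - x_j), meaning that X_ji^T β_j extrapolates the local line at x_j to the point x_i. The function name and data layout are likewise hypothetical.

```python
def llsr_objective(xs, betas, weights):
    """Sum of w_ij * (beta_i0 - X_ji^T beta_j)^2 over the listed pairs (i, j).

    xs[i] is the feature tuple of example i; betas[i] = [intercept, slopes...].
    Assumption: X_ji = (1, x_i - x_j).
    """
    total = 0.0
    for (i, j), w in weights.items():
        # Value that the local linear model at x_j predicts at x_i.
        extrapolated = betas[j][0] + sum(
            slope * (a - b) for slope, a, b in zip(betas[j][1:], xs[i], xs[j])
        )
        total += w * (betas[i][0] - extrapolated) ** 2
    return total


# Usage: three points lying exactly on the line y = 2x + 1.
xs = [(0.0,), (1.0,), (3.0,)]
pairs = {(i, j): 1.0 for i in range(3) for j in range(3) if i != j}
linear_fit = [[1.0, 2.0], [3.0, 2.0], [7.0, 2.0]]    # correct slope everywhere
constant_fit = [[1.0, 0.0], [3.0, 0.0], [7.0, 0.0]]  # same intercepts, zero slope
linear_penalty = llsr_objective(xs, linear_fit, pairs)
constant_penalty = llsr_objective(xs, constant_fit, pairs)
```

Under this assumption, a globally linear function incurs no penalty, while a locally constant fit of the same data does.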

Page 35: Techniques For Exploiting Unlabeled Data

35

Synthetic Data: Gong

Gong function: $y = \frac{1}{x} \sin\left(\frac{15}{x}\right)$

Noise: $\sigma^2 = 0.1$
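A small generator for this synthetic data might look as follows. The sampling interval [0.5, 1.5], the uniform sampling of x, and the Gaussian form of the noise are assumptions; the slide gives only the function and the noise variance.

```python
import math
import random


def gong(x):
    """Noise-free Gong function y = (1/x) * sin(15/x)."""
    return (1.0 / x) * math.sin(15.0 / x)


def sample_gong(n, low=0.5, high=1.5, noise_variance=0.1, seed=0):
    """Draw n noisy (x, y) pairs; the interval [low, high] is an assumed range."""
    rng = random.Random(seed)
    sigma = math.sqrt(noise_variance)
    xs = [rng.uniform(low, high) for _ in range(n)]
    return [(x, gong(x) + rng.gauss(0.0, sigma)) for x in xs]


gong_sample = sample_gong(5)
```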

Page 36: Techniques For Exploiting Unlabeled Data

36

Experimental Results: GONG

Weighted Kernel Regression, MSE = 25.7

Page 37: Techniques For Exploiting Unlabeled Data

37

Experimental Results: GONG

Local Linear Regression, MSE = 14.4

Page 38: Techniques For Exploiting Unlabeled Data

38

Experimental Results: GONG

LLSR, MSE = 7.99

Page 39: Techniques For Exploiting Unlabeled Data

39

PROBLEM: RUNNING TIME

If we have n examples and dimension d, then to compute a closed-form solution we have to invert an $n(d+1) \times n(d+1)$ matrix.

This is prohibitively expensive, especially if d is large.

For example, if n = 1500 and d = 199 then we have to invert a matrix of size 720 GB in Matlab’s double precision format.
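The 720 GB figure can be checked with a line of arithmetic, assuming 8 bytes per double-precision entry and decimal gigabytes (10^9 bytes). The helper name is hypothetical.

```python
def dense_matrix_bytes(n, d, bytes_per_entry=8):
    """Memory for a dense n(d+1) x n(d+1) matrix of doubles."""
    side = n * (d + 1)
    return side * side * bytes_per_entry


matrix_gb = dense_matrix_bytes(1500, 199) / 1e9  # side = 300,000 entries
```

With n = 1500 and d = 199 the matrix has 300,000 rows and columns, or 9 × 10^10 entries.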

Page 40: Techniques For Exploiting Unlabeled Data

40

SOLUTION: ITERATION

It turns out that because of the form of the equation we can start from an arbitrary initial guess and do an iterative computation that provably converges to the desired solution.

In the case of n = 1500 and d = 199, instead of dealing with a matrix of size 720 GB we only have to store 2.4 MB in memory, which makes the algorithm much more practical.
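The slides do not say which iteration is used, so the sketch below uses Gauss-Seidel purely as a stand-in to show the idea: start from an arbitrary guess and refine it in place, keeping only vectors in memory. One plausible reading of the 2.4 MB figure is a single vector of n(d+1) doubles, which the second helper computes; both function names are hypothetical.

```python
def gauss_seidel(A, b, iterations=100):
    """Solve A x = b by in-place sweeps, starting from the zero vector.

    Converges for strictly diagonally dominant or symmetric positive
    definite A; it is not guaranteed to converge for arbitrary matrices.
    """
    n = len(b)
    x = [0.0] * n
    for _ in range(iterations):
        for i in range(n):
            off_diagonal = sum(A[i][j] * x[j] for j in range(n) if j != i)
            x[i] = (b[i] - off_diagonal) / A[i][i]
    return x


def vector_bytes(n, d, bytes_per_entry=8):
    """Memory for one vector of n(d+1) doubles."""
    return n * (d + 1) * bytes_per_entry


# Usage: a small diagonally dominant system with solution (1/11, 7/11).
solution = gauss_seidel([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```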

Page 41: Techniques For Exploiting Unlabeled Data

41

Experiments on Real Data

We do model selection using leave-one-out cross-validation.

We compare:

Weighted Kernel Regression (WKR) - a purely supervised method.

Local Linear Regression (LLR) - another purely supervised method.

Local Linear Semi-supervised Regression (LLSR)

Local Learning Regularization (LL-Reg) - a recent semi-supervised method.

For each algorithm and dataset we give:

1. The mean and standard deviation of 10 runs.

2. The results of an OPTIMAL choice of parameters.

Page 42: Techniques For Exploiting Unlabeled Data

42

Experimental Results

Dataset      n    d   nl    LLSR      LLSR-OPT   WKR       WKR-OPT
Carbon       58   1   10    27±25     19±11      70±36     37±11
Alligators   25   1   10    288±176   209±162    336±210   324±211
Smoke        25   1   10    82±13     79±13      83±19     80±15
Autompg      392  7   100   50±2      49±1       57±3      57±3

Dataset      n    d   nl    LLR       LLR-OPT    LL-Reg    LL-Reg-OPT
Carbon       58   1   10    57±16     54±10      162±199   74±22
Alligators   25   1   10    207±140   207±140    289±222   248±157
Smoke        25   1   10    82±12     80±13      82±14     70±6
Autompg      392  7   100   53±3      52±3       53±4      51±2

Page 43: Techniques For Exploiting Unlabeled Data

43

Summary

LLSR is a natural semi-supervised generalization of Linear Regression

While the analysis is not as clear as with semi-supervised classification, semi-supervised regression can perform better than supervised regression if the function has a smooth manifold similar to that of the GONG function.

FUTURE WORK:

Carefully analyzing the assumptions under which unlabeled data can be useful in regression.

Page 44: Techniques For Exploiting Unlabeled Data

44

Outline

Motivation

Randomized Graph Mincut

Local Linear Semi-supervised Regression

Learning with Similarity Functions

Proposed Work and Time Line

Page 45: Techniques For Exploiting Unlabeled Data

45

Kernels

Kernel trick: K(x,y) = Φ(x)∙Φ(y) (Mercer’s theorem)

This allows us to implicitly project non-linearly separable data into a high dimensional space where a linear separator can be found.

A kernel must satisfy strict mathematical conditions:

1. Continuous

2. Symmetric

3. Positive semi-definite

K(x,y): Informally considered as a measure of similarity between x and y

Page 46: Techniques For Exploiting Unlabeled Data

46

Problems with Kernels

There is a conceptual disconnect between the notion of kernels as similarity functions and the notion of finding max-margin separators in possibly infinite dimensional Hilbert spaces.

The properties of kernels, such as being positive semi-definite, are rather restrictive. In particular, similarity functions used in certain domains, such as the Smith-Waterman score in molecular biology, do not fit in this framework.

WANTED: A method for using similarity functions that is both easy and general.

Page 47: Techniques For Exploiting Unlabeled Data

47

The Balcan-Blum approach

An approach fitting these requirements was recently proposed by Balcan and Blum. They:

Gave a general definition of a good similarity function for learning.

Gave an algorithm for learning with good similarity functions.

Showed that kernels are a special case of their definition.

Page 48: Techniques For Exploiting Unlabeled Data

48

The Balcan-Blum approach

Suppose $S(x,y) \in (-1,+1)$ is our similarity function. Then:

1. Draw d examples {x1, x2, x3, …, xd} uniformly at random from the data set.

2. For each example x compute the mapping x → {S(x,x1), S(x,x2), S(x,x3), …, S(x,xd)}

KEY POINT: This method can make use of UNLABELED DATA.
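The two steps above can be sketched as follows. The function names and the toy similarity are hypothetical and chosen only so that the values stay inside (-1, +1); any similarity function with that range could be plugged in. Since no labels are needed to draw the reference examples, the pool can consist entirely of unlabeled data.

```python
import math
import random


def draw_landmarks(pool, d, seed=0):
    """Step 1: draw d examples uniformly at random (no labels needed)."""
    return random.Random(seed).sample(pool, d)


def similarity_features(x, landmarks, similarity):
    """Step 2: map x to its vector of similarities to the drawn examples."""
    return [similarity(x, landmark) for landmark in landmarks]


def toy_similarity(a, b):
    """Hypothetical similarity on real numbers with values inside (-1, +1)."""
    return math.tanh(1.0 - abs(a - b))


# Usage: an unlabeled pool of 20 one-dimensional examples, d = 4.
unlabeled_pool = [0.25 * i for i in range(20)]
landmarks = draw_landmarks(unlabeled_pool, d=4)
features = similarity_features(1.0, landmarks, toy_similarity)
```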

Page 49: Techniques For Exploiting Unlabeled Data

49

Combining Feature based and Graph Based Methods

Feature based methods operate directly on the native features, e.g. Decision Trees, MaxEnt, Winnow, Perceptron.

Graph based methods operate on the graph of similarities between examples, e.g. kernel methods, Gaussian Fields, Graph mincut and most semi-supervised learning methods.

These methods can work well on different datasets. We want to find a way to COMBINE these approaches into one algorithm.

Page 50: Techniques For Exploiting Unlabeled Data

50

SOLUTION: Similarity functions + Winnow

Use the Balcan-Blum approach to generate extra features.

Append the extra features to the original features:-

x → {x,S(x,x1), S(x,x2), S(x,x3), … S(x,xd)}

Run the Winnow algorithm on the combined features

(Winnow is known to be resistant to irrelevant features.)

Page 51: Techniques For Exploiting Unlabeled Data

51

Our Contributions

Practical techniques for using similarity functions

Combining graph based and feature based learning.

Page 52: Techniques For Exploiting Unlabeled Data

52

How to define a good similarity function?

By modifying a distance metric: K(x,y) = 1/(D(x,y)+1)

Problem: We can end up with all similarities close to ZERO (not good).

Solution: Scale the similarities as follows:

Sort the similarities for example x from most similar to least similar.

Give the most similar example similarity +1 and the least similar example similarity -1, and interpolate the remaining examples in between.

VERY IMPORTANT: The ranked similarity may not be symmetric, which is a big difference from kernels.
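The rescaling can be sketched as below. Interpolating linearly in rank is an assumption (the slide only says "interpolate"), and the function names are hypothetical. The usage example also shows the asymmetry: the similarity of a to b is ranked among a's neighbors, while that of b to a is ranked among b's.

```python
def naive_similarity(a, b):
    """K(x, y) = 1 / (D(x, y) + 1) with D the absolute difference."""
    return 1.0 / (abs(a - b) + 1.0)


def ranked_similarities(raw):
    """Rescale one example's raw similarities to [-1, +1] by rank."""
    m = len(raw)
    if m == 1:
        return [1.0]
    order = sorted(range(m), key=lambda k: raw[k], reverse=True)
    scaled = [0.0] * m
    for rank, k in enumerate(order):
        scaled[k] = 1.0 - 2.0 * rank / (m - 1)  # rank 0 -> +1, last rank -> -1
    return scaled


def ranked_row(points, i):
    """Ranked similarity from points[i] to every other point, keyed by index."""
    others = [j for j in range(len(points)) if j != i]
    raw = [naive_similarity(points[i], points[j]) for j in others]
    return dict(zip(others, ranked_similarities(raw)))


# Usage: three points on a line at 0, 1 and 10.
points = [0.0, 1.0, 10.0]
from_middle = ranked_row(points, 1)  # 10 is the least similar point to 1
from_far = ranked_row(points, 2)     # 1 is the most similar point to 10
```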

Page 53: Techniques For Exploiting Unlabeled Data

53

Evaluating a similarity function

K is a strongly (ε,γ)-good similarity function for a learning problem P if at least a (1 − ε) probability mass of examples x satisfy

$E_{x' \sim P}[K(x', x) \mid \ell(x') = \ell(x)] \ge E_{x' \sim P}[K(x', x) \mid \ell(x') \ne \ell(x)] + \gamma$

For a particular similarity function and dataset we can compute the margin γ for each example and then plot the examples by decreasing margin. If the margin is large for most examples, this is an indication that the similarity function may perform well on that dataset.
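An empirical version of this check replaces the expectations with sample averages over a labeled dataset. In the sketch below, excluding the example itself from its own averages and skipping examples that lack a same-label or different-label neighbor are assumptions, as are the function names.

```python
def empirical_margins(examples, labels, similarity):
    """Per-example margin: mean similarity to same-label examples minus
    mean similarity to different-label examples, sorted in decreasing order."""
    margins = []
    for i, x in enumerate(examples):
        same, different = [], []
        for j, other in enumerate(examples):
            if j == i:
                continue  # assumption: an example is not compared with itself
            bucket = same if labels[j] == labels[i] else different
            bucket.append(similarity(other, x))
        if same and different:
            margins.append(sum(same) / len(same) - sum(different) / len(different))
    return sorted(margins, reverse=True)


def inverse_distance(a, b):
    """Naive similarity 1 / (D + 1) on real numbers."""
    return 1.0 / (abs(a - b) + 1.0)


# Usage: two well-separated groups give a positive margin for every example.
margins = empirical_margins([0.0, 0.1, 5.0, 5.1], [1, 1, -1, -1], inverse_distance)
```

Plotting the returned list gives the kind of compatibility curve described above.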

Page 54: Techniques For Exploiting Unlabeled Data

54

Compatibility of the naïve similarity function on Digits1

Page 55: Techniques For Exploiting Unlabeled Data

55

Compatibility of the ranked similarity function on Digits1

Page 56: Techniques For Exploiting Unlabeled Data

56

Experimental Results

We’ll look at some experimental results on both real and synthetic datasets.

Page 57: Techniques For Exploiting Unlabeled Data

57

Synthetic Data: Circle

Page 58: Techniques For Exploiting Unlabeled Data

58

Experimental Results: Circle

Page 59: Techniques For Exploiting Unlabeled Data

59

Synthetic Data: Blobs and Lines

Can we create a data set that needs BOTH the original and the new features to do well?

To answer this we create a data set we will call “Blobs and Lines”.

We generate the data in the following way:

1. We select k points to be the centers of our “blobs” and assign them labels in {-1,+1}.

2. We flip a coin.

3. If heads, then we set x to be a random boolean vector of dimension d and set the label to be the first coordinate of x.

4. If tails, we pick one of the centers, flip r bits, set x equal to the result, and set the label to the label of the center.
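The generation procedure can be sketched as follows. Several details are not given on the slide and are assumed here: the centers are random boolean vectors with random labels, the coin is fair, the r flipped bits are distinct positions chosen uniformly, and a first coordinate of 1 maps to label +1 and 0 to label -1.

```python
import random


def blobs_and_lines(n, d, k, r, seed=0):
    """Generate n (x, y) pairs with boolean x of dimension d and y in {-1, +1}."""
    rng = random.Random(seed)
    # Step 1: choose k blob centers and give each a label.
    centers = [[rng.randint(0, 1) for _ in range(d)] for _ in range(k)]
    center_labels = [rng.choice([-1, 1]) for _ in range(k)]
    data = []
    for _ in range(n):
        if rng.random() < 0.5:
            # Heads ("line"): random vector, label read off the first coordinate.
            x = [rng.randint(0, 1) for _ in range(d)]
            y = 1 if x[0] == 1 else -1
        else:
            # Tails ("blob"): copy a center, flip r bits, inherit its label.
            c = rng.randrange(k)
            x = list(centers[c])
            for position in rng.sample(range(d), r):
                x[position] = 1 - x[position]
            y = center_labels[c]
        data.append((x, y))
    return data


blob_data = blobs_and_lines(n=50, d=20, k=3, r=2)
```

The "line" examples are separable using the original first feature, while the "blob" examples are best described by similarity to a center, which is why both kinds of features are needed.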

Page 60: Techniques For Exploiting Unlabeled Data

60

Synthetic Data: Blobs and Lines

[Figure: illustration of the Blobs and Lines data, showing groups of + and - examples.]

Page 61: Techniques For Exploiting Unlabeled Data

61

Experimental Results: Blobs and Lines

Page 62: Techniques For Exploiting Unlabeled Data

62

Experimental Results: Real Data

Dataset     n     d     nl    Winnow   SVM     NN     SIM     Winnow+SVM
Congress    435   16    100   93.79    94.93   90.8   90.90   92.24
Webmaster   582   1406  100   81.97    71.78   72.5   69.90   81.20
Credit      653   46    100   78.50    55.52   61.5   59.10   77.36
Wisc        683   89    100   95.03    94.51   95.3   93.65   94.49
Digit1      1500  241   100   73.26    88.79   94.0   94.21   91.31
USPS        1500  241   100   71.85    74.21   92.0   86.72   88.57

Page 63: Techniques For Exploiting Unlabeled Data

63

Experimental Results: Concatenation

What if we did something halfway between synthetic and real, by concatenating two different datasets? This can be viewed as simulating a dataset that has two different kinds of data.

We concatenated the datasets by padding each of them with a block of ZEROS:

Credit (653 × 46)     Padding (653 × 241)
Padding (653 × 46)    Digit1 (653 × 241)

Dataset           n     d    nl    Winnow   SVM     NN      SIM     Winnow+SVM
Credit + Digit1   1306  287  100   72.41    51.74   75.46   74.25   83.95
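The padding layout amounts to a block-diagonal stacking of the two datasets. A minimal sketch with rows stored as Python lists (the function name is hypothetical):

```python
def concatenate_with_padding(first, second):
    """Stack two datasets block-diagonally, filling the off-diagonal blocks with zeros."""
    width_first, width_second = len(first[0]), len(second[0])
    top = [row + [0] * width_second for row in first]     # [first | zeros]
    bottom = [[0] * width_first + row for row in second]  # [zeros | second]
    return top + bottom


# Usage: stand-ins with the shapes from the slide (653 x 46 and 653 x 241).
credit_like = [[1] * 46 for _ in range(653)]
digit_like = [[1] * 241 for _ in range(653)]
combined = concatenate_with_padding(credit_like, digit_like)
```

The combined data has 653 + 653 = 1306 rows and 46 + 241 = 287 columns, matching the n and d reported for Credit + Digit1.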

Page 64: Techniques For Exploiting Unlabeled Data

64

Conclusions

Generic similarity functions have a lot of potential to be applied to practical applications.

Combining feature based and graph based methods we can often get the “best of both worlds”

FUTURE WORK

Designing similarity functions suited to particular domains.

Theoretically provable guarantees on the quality of a similarity function

Page 65: Techniques For Exploiting Unlabeled Data

65

QUESTIONS?

Page 66: Techniques For Exploiting Unlabeled Data

66

Back Up Slides

Page 67: Techniques For Exploiting Unlabeled Data

67

References

“Semi-supervised Learning Using Randomized Mincuts”, A. Blum, J. Lafferty, M.R. Rwebangira, R. Reddy, ICML 2004

Page 68: Techniques For Exploiting Unlabeled Data

68

My Work

Practical techniques for using unlabeled data and generic similarity functions to “kernelize” the winnow algorithm.

Techniques for extending Local Linear Regression to the semi-supervised setting

Techniques for improving graph mincut algorithms for semi-supervised classification

Page 69: Techniques For Exploiting Unlabeled Data

69

Problem

There may be several minimum cuts in the graph.

[Figure: graph with + and - nodes.]

Indeed, there are potentially exponentially many minimum cuts in the graph.

Page 70: Techniques For Exploiting Unlabeled Data

70

Real Data: CO2

Source: World Watch Institute

Carbon dioxide concentration in the atmosphere over the last two centuries.

Page 71: Techniques For Exploiting Unlabeled Data

71

Experimental Results: CO2

Local Linear Regression, MSE = 144

Page 72: Techniques For Exploiting Unlabeled Data

72

Experimental Results: CO2

Weighted Kernel Regression, MSE = 660

Page 73: Techniques For Exploiting Unlabeled Data

73

Experimental Results: CO2

LLSR, MSE = 97.4

Page 74: Techniques For Exploiting Unlabeled Data

74

Winnow

A linear separator algorithm, first proposed by Littlestone.

We are particularly interested in Winnow because:

1. It is known to be able to learn effectively in the presence of irrelevant attributes. Since we will be creating many new features, we expect that many of them will be irrelevant.

2. It is fast and does not require a lot of memory. Since we hope to use large amounts of unlabeled data, scalability is an important consideration.
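For reference, here is a sketch of the textbook Winnow update for boolean features: weights start at 1, the threshold is the number of features, and active weights are doubled or halved after a mistake. This is the classic variant for learning monotone disjunctions, not necessarily the exact variant used in the experiments, and the function names are hypothetical.

```python
import itertools


def winnow_predict(weights, threshold, x):
    """Predict 1 if the total weight of the active features reaches the threshold."""
    return 1 if sum(w for w, bit in zip(weights, x) if bit) >= threshold else 0


def train_winnow(examples, labels, max_epochs=100):
    """Multiplicative updates on mistakes only; labels are 0 or 1."""
    n = len(examples[0])
    weights = [1.0] * n
    threshold = float(n)
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(examples, labels):
            if winnow_predict(weights, threshold, x) == y:
                continue
            mistakes += 1
            factor = 2.0 if y == 1 else 0.5  # promote on false negative, demote on false positive
            for i, bit in enumerate(x):
                if bit:
                    weights[i] *= factor
        if mistakes == 0:
            break
    return weights, threshold


# Usage: the target depends on 2 of 6 features; the other 4 are irrelevant.
examples = list(itertools.product([0, 1], repeat=6))
labels = [1 if (x[0] or x[2]) else 0 for x in examples]
weights, threshold = train_winnow(examples, labels)
```

Because the number of mistakes grows only logarithmically with the number of features for such targets, adding many irrelevant features costs little, which is the property point 1 relies on.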

Page 75: Techniques For Exploiting Unlabeled Data

79

PROPOSED WORK: Improving Running Time

Sparsification: Ignore examples which are far away so as to get a sparser matrix to invert.

Iterative methods for solving linear systems: For a matrix equation Ax = b, we can obtain successive approximations x1, x2, …, xk. This can be significantly faster if the matrix A is sparse.

Page 76: Techniques For Exploiting Unlabeled Data

80

PROPOSED WORK: Improving Running Time

Power series: Use the identity $(I - A)^{-1} = I + A + A^2 + A^3 + \dots$

$y' = (Q + \gamma\Delta)^{-1} P y = Q^{-1} P y + (-\gamma Q^{-1} \Delta) Q^{-1} P y + (-\gamma Q^{-1} \Delta)^2 Q^{-1} P y + \dots$

A few terms may be sufficient to get a good approximation.

Compute the supervised answer first, then “smooth” the answer to get the semi-supervised solution. This can be combined with iterative methods, as we can use the supervised solution as the starting point for our iterative algorithm.
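The power series identity can be checked numerically on a small matrix. The sketch below is only an illustration with hypothetical helper names; note that the series converges only when the spectral radius of A is below 1.

```python
def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def neumann_inverse(a, terms):
    """Approximate (I - A)^(-1) by I + A + A^2 + ... + A^(terms - 1)."""
    n = len(a)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    total = [row[:] for row in identity]
    power = identity
    for _ in range(terms - 1):
        power = mat_mul(power, a)
        total = [[total[i][j] + power[i][j] for j in range(n)] for i in range(n)]
    return total


# Usage: A is upper triangular with eigenvalues 0.2 and 0.3, so the series converges.
A = [[0.2, 0.1], [0.0, 0.3]]
few_terms = neumann_inverse(A, terms=4)
many_terms = neumann_inverse(A, terms=40)
exact = [[1.0 / 0.8, 0.1 / (0.8 * 0.7)], [0.0, 1.0 / 0.7]]  # inverse of I - A
```

Here even four terms agree with the exact inverse to about two decimal places, in line with the remark that a few terms may suffice.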

Page 77: Techniques For Exploiting Unlabeled Data

81

PROPOSED WORK: Experimental Evaluation

Comparison against other proposed semi-supervised regression algorithms.

Evaluation on a large variety of data sets, especially high dimensional ones.

Page 78: Techniques For Exploiting Unlabeled Data

82

PROPOSED WORK

Two main application areas:

1. Domains which have expert defined similarity functions that are not kernels (protein homology).

2. Domains which have many irrelevant features and in which the data may not be linearly separable in the original features (text classification).

Overall goal: Investigate the practical applicability of this theory and find out what is needed to make it work on real problems.

Page 79: Techniques For Exploiting Unlabeled Data

83

PROPOSED WORK: Protein Homology

The Smith-Waterman score is the best performing measure of similarity, but it does not satisfy the kernel properties.

Machine learning applications have either used other similarity functions or tried to force the SW score into a kernel.

Can we achieve better performance by using SW score directly?

Page 80: Techniques For Exploiting Unlabeled Data

84

PROPOSED WORK: Text Classification

The most popular technique is Bag-of-Words (BOW), where each document is converted into a vector and each position in the vector indicates how many times a particular word occurred.

The vectors tend to be sparse and there will be many irrelevant features, hence this is well suited to the Winnow algorithm. Our approach makes the Winnow algorithm more powerful.

Within this framework we have strong motivation for investigating “domain specific” similarity functions, e.g. “edit distance” between documents instead of cosine similarity.

Can we achieve better performance than current techniques using “domain specific” similarity functions?

Page 81: Techniques For Exploiting Unlabeled Data

85

PROPOSED WORK: Domain Specific Similarity Functions

As mentioned in the previous two slides, designing specific similarity functions for each domain is well motivated in this approach.

What are the “best practice” principles for designing domain specific similarity functions?

In what circumstances are domain specific similarity functions likely to be most useful?

We will answer these questions by generalizing from several different datasets and systematically noting what seems to work best.

Page 82: Techniques For Exploiting Unlabeled Data

86

Proposed Work and Time Line

Summer 2007 (1) Speeding up LLSR

(2) Learning with similarity in protein homology and text classification domain.

Fall 2007 (1) Comparison of LLSR with other semi-supervised regression algs.

(2) Investigate principles of domain specific similarity functions.

Spring 2008 Start Writing Thesis

Summer 2008 Finish Writing Thesis

Page 83: Techniques For Exploiting Unlabeled Data

87

Kernels

K(x,y) = Φ(x)∙Φ(y)

Allows us to implicitly project non-linearly separable data into a high dimensional space where a linear separator can be found.

A kernel must satisfy strict mathematical conditions:

1. Continuous

2. Symmetric

3. Positive semi-definite

Page 84: Techniques For Exploiting Unlabeled Data

88

Generic similarity Functions

What if the best similarity function in a given domain does not satisfy the properties of a kernel?

Two options:

1. Use a kernel with inferior performance

2. Try to “coerce” the similarity function into a kernel by building a kernel that has similar behavior.

There is another way …