techniques for exploiting unlabeled data

Techniques For Exploiting Unlabeled Data

Mugizi Rwebangira

Thesis Defense

September 8, 2008

Committee:

Avrim Blum, CMU (Co-Chair)

John Lafferty, CMU (Co-Chair)

William Cohen, CMU

Xiaojin (Jerry) Zhu, Wisconsin

2

Motivation

Supervised Machine Learning:

Labeled Examples $\{(x_i, y_i)\}$ → (induction) → Model $x \to y$

Algorithms: SVM, Neural Nets, Decision Trees, etc.

Problems: Document classification, image classification, protein sequence determination.

3

Motivation

In recent years, there has been growing interest in techniques for using unlabeled data:

More data is being collected than ever before.

Labeling examples can be expensive and/or require human intervention.

4

Examples

Proteins: The sequence can be easily determined, but structure determination is a hard problem.

Web Pages: Can be easily crawled on the web, but labeling requires human intervention.

Images: Abundantly available (digital cameras), but labeling requires humans (captchas).

5

Motivation

Semi-Supervised Machine Learning:

Labeled Examples $\{(x_i, y_i)\}$ + Unlabeled Examples $\{x_i\}$ → Model $x \to y$

6

Motivation

[Figure: examples labeled “+” and “−”]

7

However…

Techniques are not as well developed as supervised techniques:

Best practices for using unlabeled data.

Techniques for adapting supervised algorithms to semi-supervised algorithms.

8

Outline

Motivation

Randomized Graph Mincut

Local Linear Semi-supervised Regression

Learning with Similarity Functions

Conclusion and Questions

9

Graph Mincut (Blum & Chawla, 2001)

10

Construct an (unweighted) Graph

11

Add auxiliary “super-nodes”

[Figure: graph with auxiliary “−” and “+” super-nodes]

12

Obtain s-t mincut

[Figure: the s-t mincut separating the “−” and “+” super-nodes]

13

Classification

[Figure: nodes on either side of the mincut classified as “+” or “−”]

14

Problem

Plain mincut can give very unbalanced cuts.

[Figure: an unbalanced cut between the “+” and “−” examples]

15

Solution

1. Add random weights to the edges.

2. Run plain mincut and obtain a classification.

3. Repeat the above process several times.

4. For each unlabeled example take a majority vote.
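The randomized procedure on this slide can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation behind the reported results: the uniform noise range, the number of rounds, the large finite capacity `BIG` on the super-node edges, and the names `min_cut_source_side` and `randomized_mincut` are all hypothetical choices.

```python
import random
from collections import defaultdict, deque

BIG = 1e9  # assumed stand-in for "uncuttable" edges to the super-nodes


def min_cut_source_side(capacity, s, t):
    """Edmonds-Karp max-flow on an undirected graph.

    Returns the nodes still reachable from s in the final residual graph,
    i.e. the source side of a minimum s-t cut.
    """
    residual = defaultdict(lambda: defaultdict(float))
    for (u, v), c in capacity.items():
        residual[u][v] += c
        residual[v][u] += c

    while True:
        # Breadth-first search for an augmenting path from s to t.
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return set(parent)
        # Recover the path, find its bottleneck, and push flow along it.
        path = []
        v = t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck


def randomized_mincut(edges, positives, negatives, rounds=25, noise=0.5, seed=0):
    """Majority vote over mincuts of randomly re-weighted copies of the graph."""
    rng = random.Random(seed)
    nodes = sorted({u for edge in edges for u in edge})
    votes = dict.fromkeys(nodes, 0)
    for _ in range(rounds):
        # Unit-weight edges plus random weights (noise distribution is assumed).
        capacity = {edge: 1.0 + rng.uniform(0.0, noise) for edge in edges}
        # Tie the labeled examples to the "+" and "-" super-nodes.
        for p in positives:
            capacity[("+", p)] = BIG
        for q in negatives:
            capacity[("-", q)] = BIG
        # Plain mincut gives one classification per round.
        plus_side = min_cut_source_side(capacity, "+", "-")
        for u in nodes:
            votes[u] += 1 if u in plus_side else -1
    # Majority vote; an odd number of rounds avoids ties.
    return {u: 1 if votes[u] > 0 else -1 for u in nodes}


# A path graph where every single edge is a minimum cut of the unweighted graph.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
labels = randomized_mincut(edges, positives=["a"], negatives=["e"])
```

On this path graph plain mincut could return any of four equally small cuts, including the very unbalanced ones. With random weights the cut position varies from round to round, and the size of each node's vote margin gives a rough confidence.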

16

Before adding random weights

[Figure: the mincut between “+” and “−” before adding random weights]

17

After adding random weights

[Figure: the mincut between “+” and “−” after adding random weights]

18

PAC-Bayes

• PAC-Bayes bounds suggest that when the graph has many small cuts consistent with the labeling, randomization should improve generalization performance.

• In this case each distinct cut corresponds to a different hypothesis.

• Hence the average of these cuts will be less likely to overfit than any single cut.

19

Markov Random Fields

• Ideally we would like to assign a weight to each cut in the graph (a higher weight to small cuts) and then take a weighted vote over all the cuts in the graph.

• This corresponds to a Markov Random Field model.

• We don’t know how to do this efficiently, but we can view randomized mincuts as an approximation.

20

How to construct the graph?

• k-NN

– Graph may not have small balanced cuts.

– How to learn k?

• Connect all points within distance δ

– Can have disconnected components.

– How to learn δ?

• Minimum Spanning Tree

– No parameters to learn.

– Gives connected, sparse graph.

– Seems to work well on most datasets.
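As an illustration of the parameter-free option, the sketch below builds a minimum spanning tree over the examples with Prim's algorithm. It assumes Euclidean distance on numeric tuples, and the function name `mst_edges` is hypothetical. The slides do not say which MST algorithm was used.

```python
import math


def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph over `points`.

    Returns the n - 1 tree edges as (i, j) index pairs.
    """
    n = len(points)
    in_tree = [False] * n
    best_dist = [math.inf] * n   # cheapest known connection to the tree
    best_from = [-1] * n         # tree node providing that connection
    if n:
        best_dist[0] = 0.0
    edges = []
    for _ in range(n):
        # Add the closest node that is not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best_dist[i])
        in_tree[u] = True
        if best_from[u] >= 0:
            edges.append((best_from[u], u))
        # Update the cheapest connections of the remaining nodes.
        for v in range(n):
            if not in_tree[v]:
                dist = math.dist(points[u], points[v])
                if dist < best_dist[v]:
                    best_dist[v] = dist
                    best_from[v] = u
    return edges


tree = mst_edges([(0, 0), (1, 0), (5, 0), (6, 0)])
```

The result always has n - 1 edges and a single connected component, which is the "connected, sparse graph" property listed above.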

21

Experiments

• ONE vs. TWO: 1128 examples (8 × 8 array of integers, Euclidean distance).

• ODD vs. EVEN: 4000 examples (16 × 16 array of integers, Euclidean distance).

• PC vs. MAC: 1943 examples (20 Newsgroups dataset, TF-IDF distance).

22

ONE vs. TWO

23

ODD vs. EVEN

24

PC vs. MAC

25

Summary

We can apply PAC sample complexity analysis and interpret it in terms of Markov Random Fields.

Randomization helps plain mincut achieve performance comparable to Gaussian Fields.

There is an intuitive interpretation for the confidence of a prediction in terms of the “margin” of the vote.

“Semi-supervised Learning Using Randomized Mincuts”, A. Blum, J. Lafferty, M.R. Rwebangira, R. Reddy, ICML 2004

26

Outline

Motivation




Proposed Work and Time Line

27

(Supervised) Linear Regression

[Figure: points marked * plotted with axes x and y]

28

Semi-Supervised Regression

[Figure: points marked * and + plotted with axes x and y]

29

Smoothness assumption

Things that are close together should have similar values.

One way of doing this:

Minimize $\xi(f) = \sum_{i,j} w_{ij} (f_i - f_j)^2$

where $w_{ij}$ is the similarity between examples $i$ and $j$, and $f_i$ and $f_j$ are the predictions for examples $i$ and $j$.

Gaussian Fields (Zhu, Ghahramani & Lafferty)
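The smoothness objective can be evaluated and minimized on a toy graph as follows. This is a sketch under assumptions: the sparse dictionary of edge weights, the function names, and the simple neighbor-averaging sweep are illustrative choices, not the solver used for the reported experiments. The averaging step relies on the standard fact that, with labeled values held fixed, the minimizer sets each remaining prediction to the weighted average of its neighbors.

```python
def smoothness_energy(weights, f):
    """xi(f) = sum over edges (i, j) of w_ij * (f_i - f_j)^2."""
    return sum(w * (f[i] - f[j]) ** 2 for (i, j), w in weights.items())


def minimize_energy(weights, labeled, n, sweeps=200):
    """Keep labeled values fixed and repeatedly replace every other f_i
    by the weighted average of its neighbors."""
    neighbors = {i: [] for i in range(n)}
    for (i, j), w in weights.items():
        neighbors[i].append((j, w))
        neighbors[j].append((i, w))
    f = [labeled.get(i, 0.0) for i in range(n)]
    for _ in range(sweeps):
        for i in range(n):
            if i not in labeled and neighbors[i]:
                total = sum(w for _, w in neighbors[i])
                f[i] = sum(w * f[j] for j, w in neighbors[i]) / total
    return f


# A chain 0 - 1 - 2 with the two end points labeled 0.0 and 2.0.
weights = {(0, 1): 1.0, (1, 2): 1.0}
f = minimize_energy(weights, labeled={0: 0.0, 2: 2.0}, n=3)
energy = smoothness_energy(weights, f)
```

On this chain the unlabeled middle node settles halfway between its neighbors, which is the lowest-energy assignment.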

30

Local Constancy

The predictions made by Gaussian Fields are locally constant.

[Figure: plot with axes x and y marking the points u and u + Δ]

More formally: m(u + Δ) ≈ m(u)

31

Local Linearity

For many regression tasks we would prefer predictions to be locally linear.

[Figure: plot with axes x and y marking the points u and u + Δ]

More formally: m(u + Δ) ≈ m(u) + m′(u)Δ

32

Problem

Develop a version of Gaussian Fields which is locally linear,

or a semi-supervised version of Linear Regression:

Local Linear Semi-supervised Regression

33


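[Figure: local linear fits $\beta_i$ and $\beta_j$ at the points $x_i$ and $x_j$, showing $\beta_{i0}$, $\beta_{j0}$, and $X_{ji}^T \beta_j$]

Penalize $(\beta_{i0} - X_{ji}^T \beta_j)^2$, by analogy with $\sum_{i,j} w_{ij} (f_i - f_j)^2$.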

34


So we find $\beta$ to minimize the following objective function:

$\xi(\beta) = \sum_{i,j} w_{ij} (\beta_{i0} - X_{ji}^T \beta_j)^2$

where $w_{ij}$ is the similarity between $x_i$ and $x_j$.
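A one-dimensional sketch of this objective is given below. It rests on an assumption the slides do not spell out: that $X_{ji} = (1, x_i - x_j)^T$, so $X_{ji}^T \beta_j$ is the value that the local line fitted at $x_j$ predicts at $x_i$. The function name `llsr_objective` and the toy data are hypothetical.

```python
def llsr_objective(xs, betas, weights):
    """xi(beta) = sum over pairs (i, j) of w_ij * (beta_i0 - X_ji^T beta_j)^2.

    xs[i] is a scalar input; betas[i] = (intercept, slope) of the local line at xs[i].
    Assumes X_ji = (1, xs[i] - xs[j]).
    """
    total = 0.0
    for (i, j), w in weights.items():
        intercept_j, slope_j = betas[j]
        predicted_at_i = intercept_j + slope_j * (xs[i] - xs[j])
        total += w * (betas[i][0] - predicted_at_i) ** 2
    return total


xs = [0.0, 1.0, 2.0]
weights = {(0, 1): 1.0, (1, 0): 1.0, (1, 2): 1.0, (2, 1): 1.0}

# Local fits that all agree with the global line y = 3 + 2x.
line_fit = [(3.0, 2.0), (5.0, 2.0), (7.0, 2.0)]
# Same values at the points, but with zero slopes (locally constant fits).
constant_fit = [(3.0, 0.0), (5.0, 0.0), (7.0, 0.0)]

print(llsr_objective(xs, line_fit, weights))      # 0.0
print(llsr_objective(xs, constant_fit, weights))  # 16.0
```

Under this reading, an exactly linear function incurs no penalty, whereas the locally constant description of the same values does.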

35

Synthetic Data: Gong

Gong function: $y = \frac{1}{x} \sin\left(\frac{15}{x}\right)$

Noise variance: $\sigma^2 = 0.1$
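A small generator for this synthetic data set might look like the following. The sampling interval [0.5, 1.5], the uniform sampling of x, the Gaussian form of the noise, and the function names are assumptions for illustration. Only the Gong function and the noise variance come from the slide.

```python
import math
import random


def gong(x):
    """Gong function y = (1/x) * sin(15/x)."""
    return (1.0 / x) * math.sin(15.0 / x)


def sample_gong(n, lo=0.5, hi=1.5, noise_var=0.1, seed=0):
    """Draw n noisy samples; x is uniform on [lo, hi] (assumed interval)."""
    rng = random.Random(seed)
    noise_sd = math.sqrt(noise_var)
    xs = sorted(rng.uniform(lo, hi) for _ in range(n))
    ys = [gong(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


xs, ys = sample_gong(50)
```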

36

Experimental Results: GONG

Weighted Kernel Regression, MSE = 25.7

37

Experimental Results: GONG

Local Linear Regression, MSE = 14.4

38

Experimental Results: GONG

LLSR, MSE = 7.99

39

PROBLEM: RUNNING TIME

If we have n examples and dimension d, then to compute a closed-form solution we have to invert an n(d+1) × n(d+1) matrix.

This is prohibitively expensive, especially if d is large.

For example, if n = 1500 and d = 199, then we have to invert a matrix of size 720 GB in Matlab’s double precision format.

40

SOLUTION: ITERATION

It turns out that, because of the form of the equation, we can start from an arbitrary initial guess and do an iterative computation that provably converges to the desired solution.

In the case of n = 1500 and d = 199, instead of dealing with a matrix of size 720 GB we only have to store 2.4 MB in memory, which makes the algorithm much more practical.
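The two memory figures can be checked with a few lines of arithmetic. The sketch assumes 8 bytes per double and decimal units (1 GB = 10^9 bytes). Reading the 2.4 MB figure as one vector of n(d+1) doubles is an interpretation that happens to match the number, not something the slide states.

```python
n, d = 1500, 199
bytes_per_double = 8

unknowns = n * (d + 1)                                 # 300,000 coefficients in total
full_matrix_bytes = unknowns ** 2 * bytes_per_double   # dense n(d+1) x n(d+1) matrix
vector_bytes = unknowns * bytes_per_double             # a single vector of coefficients

full_matrix_gb = full_matrix_bytes / 1e9
vector_mb = vector_bytes / 1e6

print(full_matrix_gb)  # 720.0
print(vector_mb)       # 2.4
```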

41

Experiments on Real Data

We do model selection using leave-one-out cross-validation.

We compare:

Weighted Kernel Regression (WKR) – a purely supervised method.

Local Linear Regression (LLR) – another purely supervised method.

Local Linear Semi-supervised Regression (LLSR)

Local Learning Regularization (LL-Reg) – an up-to-date semi-supervised method.

For each algorithm and dataset we give:

1. The mean and standard deviation of 10 runs.

2. The results of an OPTIMAL choice of parameters.

42

Experimental Results

Dataset     n    d  nl   LLSR     LLSR-OPT  WKR      WKR-OPT
Carbon      58   1  10   27±25    19±11     70±36    37±11
Alligators  25   1  10   288±176  209±162   336±210  324±211
Smoke       25   1  10   82±13    79±13     83±19    80±15
Autompg     392  7  100  50±2     49±1      57±3     57±3

Dataset     n    d  nl   LLR      LLR-OPT   LL-Reg   LL-Reg-OPT
Carbon      58   1  10   57±16    54±10     162±199  74±22
Alligators  25   1  10   207±140  207±140   289±222  248±157
Smoke       25   1  10   82±12    80±13     82±14    70±6
Autompg     392  7  100  53±3     52±3      53±4     51±2

43

Summary

LLSR is a natural semi-supervised generalization of Linear Regression

While the analysis is not as clear as with semi-supervised classification, semi-supervised regression can perform better than supervised regression if the function has a smooth manifold similar to the GONG function.

FUTURE WORK:

Carefully analyzing the assumptions under which unlabeled data can be useful in regression.

44

Outline

Motivation





45

Kernels

Kernel trick: K(x,y) = Φ(x)∙Φ(y) (Mercer’s theorem)

This allows us to implicitly project non-linearly separable data into a high-dimensional space where a linear separator can be found.

A kernel must satisfy strict mathematical conditions:

1. Continuous

2. Symmetric

3. Positive semi-definite

K(x,y): Informally considered as a measure of similarity between x and y

46

Problems with Kernels

There is a conceptual disconnect between the notion of kernels as similarity functions and the notion of finding max-margin separators in possibly infinite-dimensional Hilbert spaces.

The properties of kernels, such as being positive semi-definite, are rather restrictive; in particular, similarity functions used in certain domains, such as the Smith-Waterman score in molecular biology, do not fit in this framework.

WANTED: A method for using similarity functions that is both easy and general.

47

The Balcan-Blum approach

An approach fitting these requirements was recently proposed by Balcan and Blum.

Gave a general definition of a good similarity function for learning.

Gave an algorithm for learning with good similarity functions.

Showed that kernels are a special case of their definition.

48

The Balcan-Blum approach

Suppose $S(x, y) \in (-1, +1)$ is our similarity function. Then:

1. Draw d examples {x1, x2, x3, …, xd} uniformly at random from the data set.

2. For each example x compute the mapping x → {S(x,x1), S(x,x2), S(x,x3), …, S(x,xd)}

KEY POINT: This method can make use of UNLABELED DATA.
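The two steps can be sketched as below. The similarity `toy_similarity` is a hypothetical placeholder chosen only so that the example runs (its values lie in (-1, 1]). The function names and the toy data are likewise assumptions. Note that drawing the d examples never looks at a label, which is why unlabeled data can be used.

```python
import math
import random


def toy_similarity(x, y):
    """Hypothetical similarity: 1 when x == y, approaching -1 as the points move apart."""
    return 2.0 / (1.0 + math.dist(x, y)) - 1.0


def draw_landmarks(data, d, seed=0):
    """Step 1: draw d examples uniformly at random (labels are not needed)."""
    return random.Random(seed).sample(data, d)


def similarity_features(x, landmarks, similarity=toy_similarity):
    """Step 2: map x to (S(x, x1), ..., S(x, xd))."""
    return [similarity(x, landmark) for landmark in landmarks]


unlabeled = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0), (5.0, 1.0)]
landmarks = draw_landmarks(unlabeled, d=3)
features = similarity_features((1.0, 1.0), landmarks)
```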

49

Combining Feature based and Graph Based Methods

Feature based methods operate directly on the native features, e.g. Decision Tree, MaxEnt, Winnow, Perceptron.

Graph based methods operate on the graph of similarities between examples, e.g. Kernel methods, Gaussian Fields, Graph mincut and most semi-supervised learning methods.

These methods can work well on different datasets; we want to find a way to COMBINE these approaches into one algorithm.

50

SOLUTION: Similarity functions + Winnow

Use the Balcan-Blum approach to generate extra features.

Append the extra features to the original features:

x → {x,S(x,x1), S(x,x2), S(x,x3), … S(x,xd)}

Run the Winnow algorithm on the combined features

(Winnow is known to be resistant to irrelevant features.)

51

Our Contributions

Practical techniques for using similarity functions

Combining graph based and feature based learning.

52

How to define a good similarity function?

By modifying a distance metric: K(x,y) = 1/(D(x,y)+1)

Problem: We can end up with all similarities close to ZERO (not good).

Solution: Scale the similarities as follows:

Sort the similarities for example x from most similar to least.

Give the most similar example similarity +1 and the least similar example similarity -1, and interpolate the remaining examples in between.

VERY IMPORTANT: The ranked similarity may not be symmetric, which is a big difference from kernels.
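The rank-based rescaling, and the asymmetry it introduces, can be seen in a short sketch. Linear interpolation by rank is an assumption (the slide only says "interpolate"), and the function names and the four points on a line are hypothetical.

```python
def naive_similarity(distance):
    """K(x, y) = 1 / (D(x, y) + 1)."""
    return 1.0 / (distance + 1.0)


def ranked_similarities(index, distances):
    """Rescale the similarities of example `index` to every other example.

    The most similar example gets +1, the least similar gets -1, and the
    rest are spaced linearly by rank (assumed interpolation scheme).
    """
    others = [j for j in range(len(distances)) if j != index]
    others.sort(key=lambda j: naive_similarity(distances[index][j]), reverse=True)
    if len(others) == 1:
        return {others[0]: 1.0}
    step = 2.0 / (len(others) - 1)
    return {j: 1.0 - step * rank for rank, j in enumerate(others)}


points = [0.0, 1.0, 3.0, 10.0]
distances = [[abs(p - q) for q in points] for p in points]

from_2 = ranked_similarities(2, distances)  # similarities as seen from the point 3.0
from_3 = ranked_similarities(3, distances)  # similarities as seen from the point 10.0

print(from_3[2])  # 1.0: the point 3.0 is the nearest neighbor of 10.0
print(from_2[3])  # -1.0: the point 10.0 is the farthest neighbor of 3.0
```

The same pair of points receives +1 in one direction and -1 in the other, so the resulting similarity matrix is not symmetric.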

53

Evaluating a similarity function

K is a strongly (ε,γ)-good similarity function for a learning problem P if at least a (1 − ε) probability mass of examples x satisfy

$\mathbb{E}_{x' \sim P}[K(x', x) \mid \ell(x') = \ell(x)] \ge \mathbb{E}_{x' \sim P}[K(x', x) \mid \ell(x') \ne \ell(x)] + \gamma$

For a particular similarity function and dataset we can compute the margin γ for each example and then plot the examples by decreasing margin. If the margin is large for most examples, this is an indication that the similarity function may perform well on that dataset.
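An empirical version of the per-example margin can be computed as in the sketch below, replacing the two expectations by sample means over a labeled set. Excluding each example from its own same-label average, skipping examples that have no same-label or no different-label partner, and the function name are assumptions made for illustration.

```python
def example_margins(K, labels):
    """Per-example margin: mean similarity to same-label examples minus
    mean similarity to different-label examples, sorted in decreasing order.

    K[j][i] plays the role of K(x_j, x_i); the matrix need not be symmetric.
    """
    n = len(labels)
    margins = []
    for i in range(n):
        same = [K[j][i] for j in range(n) if j != i and labels[j] == labels[i]]
        diff = [K[j][i] for j in range(n) if labels[j] != labels[i]]
        if same and diff:
            margins.append(sum(same) / len(same) - sum(diff) / len(diff))
    return sorted(margins, reverse=True)


labels = [1, 1, -1, -1]
# An idealized similarity: 1 within a class, 0 across classes.
K = [[1.0 if a == b else 0.0 for b in labels] for a in labels]
margins = example_margins(K, labels)
```

Plotting the returned list against its index gives the kind of compatibility curve shown on the following slides.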

54

Compatibility of the naïve similarity function on Digits1

55

Compatibility of the ranked similarity function on Digits1

56

Experimental Results

We’ll look at some experimental results on both real and synthetic datasets.

57

Synthetic Data: Circle

58

Experimental Results: Circle

59

Synthetic Data: Blobs and Lines

Can we create a data set that needs BOTH the original and the new features to do well?

To answer this we create a data set we will call “Blobs and Lines”.

We generate the data in the following way:

1. We select k points to be the centers of our “blobs” and assign them labels in {-1,+1}.

2. We flip a coin.

3. If heads, then we set x to be a random boolean vector of dimension d and set the label to be the first coordinate of x.

4. If tails, we pick one of the centers, flip r bits, set x equal to the result, and set the label to the label of the center.
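The generation procedure can be sketched as follows. Several details are assumptions not fixed by the slide: centers are random boolean vectors, a fair coin is used, a first coordinate of 0 maps to label -1, and the r flipped positions are distinct. The function name is hypothetical.

```python
import random


def blobs_and_lines(n, d, k, r, seed=0):
    """Generate n labeled boolean vectors of dimension d."""
    rng = random.Random(seed)
    # Step 1: k blob centers with labels in {-1, +1}.
    centers = [[rng.randint(0, 1) for _ in range(d)] for _ in range(k)]
    center_labels = [rng.choice([-1, 1]) for _ in range(k)]
    data = []
    for _ in range(n):
        # Step 2: flip a coin.
        if rng.random() < 0.5:
            # Step 3 (heads): random vector, labeled by its first coordinate.
            x = [rng.randint(0, 1) for _ in range(d)]
            y = 1 if x[0] == 1 else -1
        else:
            # Step 4 (tails): copy a center, flip r bits, keep the center's label.
            c = rng.randrange(k)
            x = list(centers[c])
            for position in rng.sample(range(d), r):
                x[position] = 1 - x[position]
            y = center_labels[c]
        data.append((x, y))
    return data


data = blobs_and_lines(n=200, d=20, k=4, r=2)
```

Examples from the "heads" branch are separable using the original first feature, while examples from the "tails" branch are best described by their similarity to a center, which is why both kinds of features are needed.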

60

Synthetic Data: Blobs and Lines

[Figure: “+” and “−” examples from the Blobs and Lines data set]

61

Experimental Results: Blobs and Lines

62

Experimental Results: Real Data

Dataset    n     d     nl   Winnow  SVM    NN    SIM    Winnow+SVM
Congress   435   16    100  93.79   94.93  90.8  90.90  92.24
Webmaster  582   1406  100  81.97   71.78  72.5  69.90  81.20
Credit     653   46    100  78.50   55.52  61.5  59.10  77.36
Wisc       683   89    100  95.03   94.51  95.3  93.65  94.49
Digit1     1500  241   100  73.26   88.79  94.0  94.21  91.31
USPS       1500  241   100  71.85   74.21  92.0  86.72  88.57

63

Experimental Results: Concatenation

What if we did something halfway between synthetic and real, by concatenating two different datasets? This can be viewed as simulating a dataset that has two different kinds of data.

We concatenated the datasets by padding each of them with a block of ZEROS:

Credit (653 × 46)     Padding (653 × 241)
Padding (653 × 46)    Digit1 (653 × 241)

Dataset          n     d    nl   Winnow  SVM    NN     SIM    Winnow+SVM
Credit + Digit1  1306  287  100  72.41   51.74  75.46  74.25  83.95
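The zero-padding layout amounts to a block-diagonal stacking of the two feature matrices. A minimal sketch with placeholder data follows. The function name and the constant placeholder values are hypothetical, and only the shapes come from the slide.

```python
def concatenate_with_padding(a, b):
    """Stack two datasets so each keeps its own feature columns.

    Rows of `a` are padded with zeros on the right; rows of `b` are padded
    with zeros on the left.
    """
    width_a, width_b = len(a[0]), len(b[0])
    top = [list(row) + [0.0] * width_b for row in a]
    bottom = [[0.0] * width_a + list(row) for row in b]
    return top + bottom


# Placeholder matrices with the shapes given on the slide.
credit = [[1.0] * 46 for _ in range(653)]
digit1 = [[2.0] * 241 for _ in range(653)]
combined = concatenate_with_padding(credit, digit1)

print(len(combined), len(combined[0]))  # 1306 287
```

The resulting shape matches the n = 1306 and d = 287 reported for the concatenated dataset.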

64

Conclusions

Generic similarity functions have a lot of potential to be applied to practical applications.

By combining feature based and graph based methods, we can often get the “best of both worlds”.

FUTURE WORK

Designing similarity functions suited to particular domains.

Theoretically provable guarantees on the quality of a similarity function

65

QUESTIONS?

66

Back Up Slides

67

References

“Semi-supervised Learning Using Randomized Mincuts”, A. Blum, J. Lafferty, M.R. Rwebangira, R. Reddy, ICML 2004

68

My Work

Practical techniques for using unlabeled data and generic similarity functions to “kernelize” the winnow algorithm.

Techniques for extending Local Linear Regression to the semi-supervised setting

Techniques for improving graph mincut algorithms for semi-supervised classification

69

Problem

There may be several minimum cuts in the graph.

[Figure: graph with “+” and “−” nodes illustrating several minimum cuts]

Indeed, there are potentially exponentially many minimum cuts in the graph.

70

Real Data: CO2

Source: World Watch Institute

Carbon dioxide concentration in the atmosphere over the last two centuries.

71

Experimental Results: CO2

Local Linear Regression, MSE = 144

72

Experimental Results: CO2

Weighted Kernel Regression, MSE = 660

73

Experimental Results: CO2

LLSR, MSE = 97.4

74

Winnow

A linear separator algorithm, first proposed by Littlestone.

We are particularly interested in Winnow because:

1. It is known to be able to learn effectively in the presence of irrelevant attributes. Since we will be creating many new features, we expect many of them will be irrelevant.

2. It is fast and does not require a lot of memory. Since we hope to use large amounts of unlabeled data, scalability is an important consideration.
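For reference, a textbook form of Winnow on boolean features is sketched below. It uses multiplicative updates by a factor of 2 and a threshold equal to the number of features. This is the classic variant for 0/1 inputs and is not necessarily the variant used in the experiments, which would have to handle real-valued similarity features. The function names and the toy target (an OR of two out of eight features) are assumptions.

```python
from itertools import product


def winnow_predict(weights, x):
    """Predict 1 when the total weight of the active features reaches the threshold n."""
    return 1 if sum(w for w, xi in zip(weights, x) if xi) >= len(weights) else 0


def winnow_train(examples, n, max_epochs=100):
    """examples: list of (x, y) with x a 0/1 list of length n and y in {0, 1}."""
    weights = [1.0] * n
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in examples:
            if winnow_predict(weights, x) == y:
                continue
            mistakes += 1
            # Promote active features on a missed positive, demote them on a false positive.
            factor = 2.0 if y == 1 else 0.5
            weights = [w * factor if xi else w for w, xi in zip(weights, x)]
        if mistakes == 0:
            break
    return weights


# Target concept: x[0] OR x[2]; the other six features are irrelevant.
n = 8
examples = [(list(x), 1 if (x[0] or x[2]) else 0) for x in product([0, 1], repeat=n)]
weights = winnow_train(examples, n)
```

Only one weight per feature is stored and each update touches only the active features, which fits the speed and memory points above.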

79

PROPOSED WORK: Improving Running Time

Sparsification: Ignore examples which are far away so as to get a sparser matrix to invert.

Iterative methods for solving linear systems: For a matrix equation Ax = b, we can obtain successive approximations x1, x2, …, xk. This can be significantly faster if the matrix A is sparse.
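One standard way to produce such successive approximations is the Jacobi iteration, sketched here on a sparse row representation. The slides do not say which iterative scheme is intended, so Jacobi is used purely as an example. It converges when A is strictly diagonally dominant, as in the toy system below. The function name and data layout are assumptions.

```python
def jacobi(rows, b, iterations=100):
    """Approximate the solution of Ax = b.

    rows[i] is a dict {column: value} holding the nonzero entries of row i,
    so each sweep costs time proportional to the number of nonzeros.
    """
    x = [0.0] * len(b)
    for _ in range(iterations):
        # Every component is updated from the previous approximation.
        x = [
            (b[i] - sum(a * x[j] for j, a in rows[i].items() if j != i)) / rows[i][i]
            for i in range(len(b))
        ]
    return x


# 4x + y = 6 and x + 3y = 7, whose exact solution is x = 1, y = 2.
rows = [{0: 4.0, 1: 1.0}, {0: 1.0, 1: 3.0}]
b = [6.0, 7.0]
solution = jacobi(rows, b)
```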

80

PROPOSED WORK: Improving Running Time

Power series: Use the identity $(I - A)^{-1} = I + A + A^2 + A^3 + \dots$

$y' = (Q + \gamma\Delta)^{-1} P y = Q^{-1} P y + (-\gamma Q^{-1} \Delta) Q^{-1} P y + (-\gamma Q^{-1} \Delta)^2 Q^{-1} P y + \dots$

A few terms may be sufficient to get a good approximation.

Compute the supervised answer first, then “smooth” the answer to get the semi-supervised solution. This can be combined with iterative methods, as we can use the supervised solution as the starting point for our iterative algorithm.
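The expansion of $y'$ follows from factoring $Q$ out of the inverse and applying the identity with $A = -\gamma Q^{-1}\Delta$. The derivation below assumes $Q$ is invertible. The convergence condition is the standard one for this kind of series and is not stated on the slide.

```latex
\begin{align*}
(Q + \gamma\Delta)^{-1}
  &= \bigl(Q\,(I + \gamma Q^{-1}\Delta)\bigr)^{-1} \\
  &= (I + \gamma Q^{-1}\Delta)^{-1}\, Q^{-1} \\
  &= (I - A)^{-1}\, Q^{-1}, \qquad A = -\gamma Q^{-1}\Delta \\
  &= \sum_{k=0}^{\infty} (-\gamma Q^{-1}\Delta)^{k}\, Q^{-1}
  \quad \text{if the spectral radius of } \gamma Q^{-1}\Delta \text{ is below } 1, \\[4pt]
y' = (Q + \gamma\Delta)^{-1} P y
  &= \sum_{k=0}^{\infty} (-\gamma Q^{-1}\Delta)^{k}\, Q^{-1} P y .
\end{align*}
```

The $k = 0$ term, $Q^{-1} P y$, is the supervised answer, and each further term applies one more round of smoothing to it.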

81

PROPOSED WORK: Experimental Evaluation

Comparison against other proposed semi-supervised regression algorithms.

Evaluation on a large variety of data sets, especially high-dimensional ones.

82

PROPOSED WORK

Two main application areas:

1. Domains which have expert defined similarity functions that are not kernels (protein homology).

2. Domains which have many irrelevant features and in which the data may not be linearly separable in the original features (text classification).

Overall goal: Investigate the practical applicability of this theory and find out what is needed to make it work on real problems.

83

PROPOSED WORK: Protein Homology

The Smith-Waterman score is the best performing measure of similarity, but it does not satisfy the kernel properties.

Machine learning applications have either used other similarity functions or tried to force the SW score into a kernel.

Can we achieve better performance by using SW score directly?

84

PROPOSED WORK: Text Classification

The most popular technique is Bag-of-Words (BOW), where each document is converted into a vector and each position in the vector indicates how many times a particular word occurred.

The vectors tend to be sparse and there will be many irrelevant features, hence this is well suited to the Winnow algorithm. Our approach makes the Winnow algorithm more powerful.
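A minimal Bag-of-Words conversion is sketched below. Lowercasing, whitespace tokenization, the alphabetical vocabulary order, and the function name are simplifying assumptions. Real pipelines typically add further preprocessing.

```python
from collections import Counter


def bag_of_words(documents):
    """Return the vocabulary and one count vector per document."""
    vocabulary = sorted({word for doc in documents for word in doc.lower().split()})
    vectors = []
    for doc in documents:
        counts = Counter(doc.lower().split())
        # Position i holds the number of times vocabulary[i] occurs in the document.
        vectors.append([counts.get(word, 0) for word in vocabulary])
    return vocabulary, vectors


vocabulary, vectors = bag_of_words(["the cat sat", "the cat and the dog"])
print(vocabulary)  # ['and', 'cat', 'dog', 'sat', 'the']
print(vectors)     # [[0, 1, 0, 1, 1], [1, 1, 1, 0, 2]]
```

With a realistic vocabulary most entries of each vector are zero, which is the sparsity the slide refers to.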

Within this framework we have strong motivation for investigating “domain specific” similarity functions, e.g. “edit distance” between documents instead of cosine similarity.

Can we achieve better performance than current techniques using “domain specific” similarity functions?

85

PROPOSED WORK: Domain Specific Similarity Functions

As mentioned in the previous two slides, designing specific similarity functions for each domain is well motivated in this approach.

What are the “best practice” principles for designing domain specific similarity functions?

In what circumstances are domain specific similarity functions likely to be most useful?

We will answer these questions by generalizing from several different datasets and systematically noting what seems to work best.

86


Summer 2007: (1) Speeding up LLSR. (2) Learning with similarity in the protein homology and text classification domains.

Fall 2007: (1) Comparison of LLSR with other semi-supervised regression algorithms. (2) Investigate principles of domain specific similarity functions.

Spring 2008: Start writing thesis.

Summer 2008: Finish writing thesis.

87

Kernels

K(x,y) = Φ(x)∙Φ(y)

Allows us to implicitly project non-linearly separable data into a high-dimensional space where a linear separator can be found.

Kernel must satisfy strict mathematical definitions

1. Continuous

2. Symmetric

3. Positive semi-definite

88

Generic similarity Functions

What if the best similarity function in a given domain does not satisfy the properties of a kernel?

Two options:

1. Use a kernel with inferior performance

2. Try to “coerce” the similarity function into a kernel by building a kernel that has similar behavior.

There is another way …
