Machine learning techniques in image analysis
TRANSCRIPT
Semi-supervised learning
Learning from both labeled and unlabeled data
Motivation: labeled data may be hard/expensive to get, but unlabeled data is usually cheaply available in much greater quantity
COMP 875 Machine learning techniques in image analysis
How can unlabeled data help?
Example: Text classification (Source: J. Zhu)
Classify astronomy vs. travel articles
Similarity measured by word overlap
When labeled data alone fails:
What if there are no overlapping words?
Unlabeled data as stepping stones:
Labels “propagate” via similar unlabeled articles
Another example (Source: J. Zhu)
Handwritten digit recognition with pixel-wise Euclidean distance
[Figure: two digits with large pixel-wise distance are not directly similar, but become indirectly similar through “stepping stone” examples]
Types of semi-supervised learning
Inductive learning: given a training set L of labeled data and U of unlabeled data, learn a predictor that can be applied to a brand-new unlabeled point not in U.
Transductive learning: given L and U, learn a predictor that can be applied only to U (i.e., the predictor cannot be easily extended to previously unseen data).
Simplest semi-supervised learning algorithm: Self-training (Source: J. Zhu)
Input: labeled data L and unlabeled data U
Repeat:
1 Learn predictor f from labeled data L using supervised learning
2 Apply f to the unlabeled instances in U
3 Remove a subset from U and add that subset and its inferred labels to L
How might we select this subset?
Advantages/disadvantages of this scheme?
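The loop above can be sketched in code. A minimal sketch (numpy only); the nearest-centroid base learner and the "most confident first" selection rule are illustrative choices of mine, not prescribed by the slide:

```python
import numpy as np

def nearest_centroid_fit(X, y):
    """Toy supervised learner: one centroid per class (illustrative choice)."""
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def nearest_centroid_predict(model, X):
    """Return predicted labels and a confidence score (negative distance)."""
    classes, centroids = model
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[d.argmin(axis=1)], -d.min(axis=1)

def self_train(X_l, y_l, X_u, k=1):
    """Self-training loop from the slide: fit on L, label U, then move the
    k most confident unlabeled points (with their inferred labels) into L."""
    X_l, y_l, X_u = X_l.copy(), y_l.copy(), X_u.copy()
    while len(X_u) > 0:
        model = nearest_centroid_fit(X_l, y_l)
        y_hat, conf = nearest_centroid_predict(model, X_u)
        pick = np.argsort(conf)[-k:]            # most confident subset
        X_l = np.vstack([X_l, X_u[pick]])
        y_l = np.concatenate([y_l, y_hat[pick]])
        X_u = np.delete(X_u, pick, axis=0)
    return X_l, y_l
```

Selecting high-confidence points first is the usual answer to "how might we select this subset?", though it can also reinforce early mistakes — the main disadvantage asked about on the slide.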
Self-training with nearest-neighbor classifier (Source: J. Zhu)
Input: labeled data L and unlabeled data U
Repeat:
1 Find unlabeled point x that is closest to a labeled point x′ and assign to x the label of x′.
2 Remove x from U; add it and its estimated label to L.
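This nearest-neighbor variant is concrete enough to code directly. A minimal sketch (numpy; the function name is mine):

```python
import numpy as np

def propagating_1nn(X_l, y_l, X_u):
    """Self-training with a 1-NN classifier: repeatedly move the unlabeled
    point closest to ANY labeled point into L, copying that point's label."""
    X_l, y_l, X_u = map(np.asarray, (X_l, y_l, X_u))
    X_l, y_l, X_u = X_l.copy(), y_l.copy(), X_u.copy()
    while len(X_u) > 0:
        # pairwise distances between unlabeled and labeled points
        d = np.linalg.norm(X_u[:, None, :] - X_l[None, :, :], axis=2)
        i, j = np.unravel_index(d.argmin(), d.shape)   # closest (x, x') pair
        X_l = np.vstack([X_l, X_u[i]])
        y_l = np.append(y_l, y_l[j])                   # x inherits label of x'
        X_u = np.delete(X_u, i, axis=0)
    return X_l, y_l
```

On a chain of unlabeled points this reproduces the "stepping stones" effect: labels spread point by point through regions of high density.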
Propagating nearest-neighbor: Example (Source: J. Zhu)
[Figure: label propagation shown at (a) iteration 1, (b) iteration 25, (c) iteration 74, (d) final]
Another example (Source: J. Zhu)
[Figure: four panels (a)–(d) showing another propagating nearest-neighbor run]
Another simple approach: Cluster-and-label (Source: J. Zhu)
Input: labeled data L and unlabeled data U
Repeat:
1 Cluster L ∪ U
2 For each cluster, let S be the set of labeled instances in that cluster
3 Learn a supervised predictor from S and apply it to all the unlabeled instances in that cluster
What is the underlying assumption here?
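A sketch of cluster-and-label, assuming k-means as the clustering step and a majority vote as the per-cluster predictor (both are illustrative choices; the slide leaves them open):

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Minimal k-means for the clustering step (first k points used as
    initial centers -- fine for this illustrative sketch)."""
    centers = X[:k].copy()
    for _ in range(iters):
        assign = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
        for c in range(k):
            if np.any(assign == c):
                centers[c] = X[assign == c].mean(axis=0)
    return assign

def cluster_and_label(X_l, y_l, X_u, k):
    """Cluster L ∪ U, then label each cluster's unlabeled points by a
    majority vote over the labeled points that fell into that cluster."""
    X_l, y_l, X_u = map(np.asarray, (X_l, y_l, X_u))
    X = np.vstack([X_l, X_u])
    assign = kmeans(X, k)
    n_l = len(X_l)
    y_u = np.empty(len(X_u), dtype=int)
    for c in range(k):
        S = y_l[assign[:n_l] == c]             # labeled instances in cluster c
        vote = np.bincount(S).argmax() if len(S) else -1  # -1: no labeled point
        y_u[assign[n_l:] == c] = vote
    return y_u
```

This makes the underlying assumption explicit: it works only when the clusters found by the clustering algorithm actually align with the classes.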
Cluster-and-label: Examples (Source: J. Zhu)
Hierarchical clustering, majority vote predictor within cluster
Generative models (Source: J. Zhu)
Labeled data (Xl, Yl):
Assuming each class has a Gaussian distribution, how do we find the decision boundary?
The most likely model, and its decision boundary
Labeled data (Xl, Yl) and unlabeled data Xu:
What is the most likely decision boundary now?
The two boundaries are different because they maximize different quantities:
p(Xl, Yl|θ)   vs.   p(Xl, Yl, Xu|θ)
Gaussian mixture model: θ are the component weights, means, and covariances
Generative models (Source: J. Zhu)
Only labeled data:
p(Xl, Yl|θ) = ∏i p(xi, yi|θ) = ∏i p(yi|θ) p(xi|yi, θ)
ML estimate for θ: sample means, covariances, and proportions for each of the classes
Labeled and unlabeled data:
p(Xl, Yl, Xu|θ) = p(Xl, Yl|θ) ∑Yu p(Xu, Yu|θ)
= ( ∏i labeled p(yi|θ) p(xi|yi, θ) ) ( ∏j unlabeled ∑c p(c|θ) p(xj|c, θ) )
ML estimate for θ: use EM (Yu are hidden variables)
The EM algorithm for Gaussian mixtures (Source: J. Zhu)
1 Start from MLE θ = {pc, µc, Σc} on (Xl, Yl):
pc: proportion of class c
µc: sample mean of class c
Σc: sample covariance matrix of class c
Repeat:
2 The E-step: compute the expected label p(y|x, θ) for all x in Xu.
3 The M-step: update MLE θ with the “softly labeled” Xu.
Special case of EM for Gaussian mixtures where the component assignments of labeled data are fixed.
Can also be viewed as a special case of self-training.
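The three steps can be sketched for a 1-D, two-class mixture (a toy version: the slides assume full-covariance Gaussians in general; the function name and the small variance floor are mine):

```python
import numpy as np

def semi_supervised_gmm(x_l, y_l, x_u, n_iter=50):
    """EM for a 1-D two-class Gaussian mixture with labeled + unlabeled data.
    Labeled points keep fixed (hard) component assignments; unlabeled points
    get soft labels p(y|x, θ) in the E-step."""
    x_l, y_l, x_u = map(np.asarray, (x_l, y_l, x_u))
    r_l = np.eye(2)[y_l]                      # fixed responsibilities (labeled)
    x = np.concatenate([x_l, x_u])
    # 1. initialize θ = {p_c, µ_c, σ_c} by MLE on the labeled data only
    p = np.array([np.mean(y_l == c) for c in (0, 1)])
    mu = np.array([x_l[y_l == c].mean() for c in (0, 1)])
    sig = np.array([x_l[y_l == c].std() + 1e-3 for c in (0, 1)])
    for _ in range(n_iter):
        # 2. E-step: expected labels p(y|x, θ) for the unlabeled points
        lik = p * np.exp(-0.5 * ((x_u[:, None] - mu) / sig) ** 2) / sig
        r_u = lik / lik.sum(axis=1, keepdims=True)
        r = np.vstack([r_l, r_u])
        # 3. M-step: MLE update of θ using the "softly labeled" data
        n_c = r.sum(axis=0)
        p = n_c / len(x)
        mu = (r * x[:, None]).sum(axis=0) / n_c
        sig = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / n_c) + 1e-3
    return p, mu, sig
```

Keeping r_l fixed is exactly what makes this the "special case of EM" mentioned above; replacing the soft r_u with hard argmax labels would turn it into self-training.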
Limitations of mixture models (Source: J. Zhu)
Assumption: mixture components correspond to class-conditional distributions.
When the assumption is wrong:
Discriminative approach: Semi-supervised SVMs (Source: J. Zhu)
Idea: try to keep labeled points outside the margin, while maximizing the margin.
Review: Standard SVMs
Classification function: f(x) = wTx + w0.
Standard SVM objective function:
min_{w,w0}  ‖w‖² + λ1 ∑i (1 − yi f(xi))+
Semi-supervised SVMs (Source: J. Zhu)
Classification function: f(x) = wTx + w0.
To incorporate unlabeled points, assign to them putative labels sgn(f(x)).
Semi-supervised SVM objective function:
min_{w,w0}  ‖w‖² + λ1 ∑_{i labeled} (1 − yi f(xi))+ + λ2 ∑_{j unlabeled} (1 − |f(xj)|)+
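The objective can be evaluated directly, which makes the role of the two hinge terms concrete. A sketch (numpy; the λ defaults are placeholders of mine):

```python
import numpy as np

def s3vm_objective(w, w0, X_l, y_l, X_u, lam1=1.0, lam2=0.1):
    """Semi-supervised SVM objective from the slide:
    ‖w‖² + λ1 Σ_labeled (1 − y f(x))+ + λ2 Σ_unlabeled (1 − |f(x)|)+."""
    w, X_l, y_l, X_u = map(np.asarray, (w, X_l, y_l, X_u))
    f_l = X_l @ w + w0                          # f(x) = wᵀx + w0
    f_u = X_u @ w + w0
    hinge = lambda z: np.maximum(0.0, 1.0 - z)
    return (w @ w
            + lam1 * hinge(y_l * f_l).sum()     # labeled points in the margin
            + lam2 * hinge(np.abs(f_u)).sum())  # unlabeled points in the margin
```

The unlabeled term (1 − |f(xj)|)+ is what pushes the decision boundary away from unlabeled points; it also makes the objective non-convex, which is why S3VMs are harder to optimize than standard SVMs.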
Graph-based semi-supervised learning (Source: J. Zhu)
Idea: construct a graph where nodes are labeled and unlabeled examples, and edges are weighted by the similarity of examples.
Unlabeled data can help “glue” the objects of the same class together.
Assumption: items connected by “heavy” edges are likely to have the same label.
Graph-based semi-supervised learning (Source: J. Zhu)
The mincut algorithm:
Assume binary classification (class labels are 0, 1).
Approach: fix Yl, find Yu to minimize ∑_{i∼j} wij |yi − yj|.
Combinatorial problem, but has polynomial-time solution.
Harmonic functions:
Let’s relax discrete labels to continuous values in R.
We want to find the harmonic function f that satisfies f(x) = y for all x in Xl and minimizes the energy ∑_{i∼j} wij (f(xi) − f(xj))².
A random walk interpretation (Source: J. Zhu)
Randomly walk from node i to j with probability wij / ∑k wik.
Stop if we hit a labeled node.
The harmonic function has the following interpretation: f(xi) = P(hit label 1 | start from i).
The harmonic solution (Source: J. Zhu)
We want to find the harmonic function f that satisfies f(x) = y for all labeled points x and minimizes the energy ∑_{i∼j} wij (f(xi) − f(xj))².
It can be shown that f(xi) = ∑_{j∼i} wij f(xj) / ∑_{j∼i} wij at all unlabeled points xi.
Iterative algorithm to compute harmonic function:
Initially, fix f(x) = y for all labeled data and set f to arbitrary values for all unlabeled data.
Repeat until convergence: for each unlabeled xi, set f(xi) to its weighted neighborhood average: f(xi) = ∑_{j∼i} wij f(xj) / ∑_{j∼i} wij.
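The iterative averaging algorithm in code (a sketch; it assumes W has a zero diagonal, so each row sum equals the sum of neighbor weights):

```python
import numpy as np

def harmonic_iterative(W, labeled, y, n_iter=1000):
    """Compute the harmonic function by iterative averaging: clamp f = y on
    labeled nodes, then repeatedly replace each unlabeled f(xi) by the
    weighted average of its neighbors' values."""
    n = W.shape[0]
    f = np.zeros(n)
    f[labeled] = y                                  # clamp labeled nodes
    unlabeled = [i for i in range(n) if i not in set(labeled)]
    for _ in range(n_iter):
        for i in unlabeled:
            f[i] = W[i] @ f / W[i].sum()            # weighted neighbor average
    return f
```

On a path graph 0–1–2–3 with f(0) = 0 and f(3) = 1 clamped, the iteration converges to the linear interpolation f = (0, 1/3, 2/3, 1), matching the random-walk interpretation above.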
The graph Laplacian (Source: J. Zhu)
Let W be a symmetric weight matrix with entries wij, and D be a diagonal matrix with entries Dii = ∑j wij.
The graph Laplacian matrix is defined as L = D −W .
Then we can write ∑_{i∼j} wij (f(xi) − f(xj))² = fᵀLf.
We want to minimize fᵀLf subject to the constraints f(xi) = yi on labeled data.
Solution: fu = −Luu⁻¹ Lul yl, where yl are the labels for the labeled data, and L = [ Lll  Llu ; Lul  Luu ] (blocks indexed by labeled/unlabeled points).
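The closed-form solution in code (numpy; the index bookkeeping for the labeled/unlabeled partition is mine):

```python
import numpy as np

def harmonic_closed_form(W, labeled, y_l):
    """Closed-form harmonic solution: with L = D − W partitioned into
    labeled/unlabeled blocks, the unlabeled values are fu = −Luu⁻¹ Lul yl."""
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W                  # graph Laplacian D − W
    u = [i for i in range(n) if i not in set(labeled)]
    Luu = L[np.ix_(u, u)]                           # unlabeled-unlabeled block
    Lul = L[np.ix_(u, labeled)]                     # unlabeled-labeled block
    return -np.linalg.solve(Luu, Lul @ np.asarray(y_l, float))
```

On the same path graph as before this returns (1/3, 2/3) for the two interior nodes, agreeing with the iterative algorithm but in a single linear solve.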
The graph Laplacian (Source: J. Zhu)
Alternative approach: allow f(xi) to be different from yi on labeled data, but penalize it:
min_f  ∑_{i labeled} c (f(xi) − yi)² + fᵀLf
Let C be a diagonal matrix where Cii = c if i is a labeled point, and Cii = 0 otherwise. Then we can write the objective function as
min_f  (f − y)ᵀC(f − y) + fᵀLf
where y is a vector whose entries correspond to labels of labeled points, and are arbitrary otherwise.
Then the solution is given by the linear system
(C + L)f = Cy.
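The penalized variant reduces to one linear solve. A sketch (numpy; the default penalty c is a placeholder of mine):

```python
import numpy as np

def harmonic_regularized(W, labeled, y, c=100.0):
    """Soft-label variant: solve (C + L) f = C y, where C puts weight c on
    labeled diagonal entries and 0 elsewhere."""
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W                  # graph Laplacian D − W
    C = np.zeros((n, n))
    yv = np.zeros(n)                                # arbitrary (0) off labels
    for i, yi in zip(labeled, y):
        C[i, i] = c
        yv[i] = yi
    return np.linalg.solve(C + L, C @ yv)
</```

As c → ∞ the penalty becomes a hard constraint and the solution approaches the exact harmonic function from the previous slide.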
Graph spectrum (Source: J. Zhu)
The spectrum of the graph represented by W is given by the eigenvalues and eigenvectors (λi, φi), i = 1, …, n, of the Laplacian L.
Properties of the graph spectrum:
A graph has k connected components if and only if λ1 = λ2 = … = λk = 0. The corresponding eigenvectors are constant on individual connected components, and zero elsewhere.
L = ∑_{i=1}^n λi φi φiᵀ.
Any function f on the graph can be written as a linear combination of eigenvectors: f = ∑_{i=1}^n ai φi.
The “smoothness” of f can be written as fᵀLf = ∑_{i=1}^n ai² λi.
Using the graph spectrum
Objective function
min_f  ∑_{i labeled} c (f(xi) − yi)² + fᵀLf = (f − y)ᵀC(f − y) + fᵀLf.
We can restrict our solution to “smooth” functions f, i.e., linear combinations of the first k eigenvectors associated with the smallest eigenvalues: f = ∑_{i=1}^k ai φi.
Now we can obtain f by solving a k × k linear system instead of an n × n linear system.
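A sketch of the spectral shortcut (with k = n it reproduces the full solution exactly; smaller k gives the cheaper smooth approximation; the function name and default c are mine):

```python
import numpy as np

def harmonic_spectral(W, labeled, y, k, c=100.0):
    """Expand f = Σ_{i≤k} a_i φ_i in the k smoothest Laplacian eigenvectors
    and solve the resulting k×k system for the coefficients a."""
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W
    lam, phi = np.linalg.eigh(L)          # eigenvalues in ascending order
    Phi = phi[:, :k]                      # n×k basis of smooth eigenvectors
    C = np.zeros((n, n))
    yv = np.zeros(n)
    for i, yi in zip(labeled, y):
        C[i, i] = c
        yv[i] = yi
    # minimize (Φa − y)ᵀC(Φa − y) + aᵀ diag(λ1..λk) a over the k coefficients,
    # using ΦᵀLΦ = diag(λ1..λk) for orthonormal eigenvectors
    A = Phi.T @ C @ Phi + np.diag(lam[:k])
    b = Phi.T @ C @ yv
    return Phi @ np.linalg.solve(A, b)
```

The k×k system is the whole point: for a large graph, eigendecomposition can be truncated to the first k eigenvectors, and the per-query cost no longer scales with n³.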
References
J. Zhu, Semi-supervised learning survey, University of Wisconsin technical report, 2008. http://pages.cs.wisc.edu/~jerryzhu/research/ssl/semireview.html
J. Zhu, Semi-supervised learning tutorial, Chicago Machine Learning Summer School, 2009. http://pages.cs.wisc.edu/~jerryzhu/pub/sslchicago09.pdf