Semi-Supervised Classification (2)
DESCRIPTION
Semi-Supervised Classification (2). Zohreh Karimi. Introduction to Semi-Supervised Learning, Xiaojin Zhu and Andrew B. Goldberg, University of Wisconsin, Madison, 2009. Semi-supervised learning methods: mixture models and EM, Co-Training, graph-based methods, SVM-based methods. - PowerPoint PPT Presentation
TRANSCRIPT
Slide 1
Semi-Supervised Classification (2)
Reference: Introduction to Semi-Supervised Learning, Xiaojin Zhu and Andrew B. Goldberg, University of Wisconsin, Madison, 2009.
Semi-supervised learning methods: mixture models and EM, Co-Training, graph-based methods, SVM-based methods.
Co-Training
Running example: named entity classification, e.g. deciding whether a named entity is a Location. Each instance has two views: view 1 is the named entity string itself, and view 2 is the surrounding context. In each round, the classifier trained on one view labels its most confident unlabeled instances, and these are added as training examples for the other view's classifier.
Co-Training assumptions: each view alone is sufficient to learn an accurate classifier given enough labeled data, and the two views are conditionally independent given the class label.
Why is the conditional independence assumption important for Co-Training? If the view-2 classifier f(2) decides that the context headquartered in indicates Location with high confidence, Co-Training will add unlabeled instances with that context as view-1 training examples. These new training examples for f(1) will include all representative Location named entities x(1), thanks to the conditional independence assumption. If the assumption didn't hold, the new examples could all be highly similar and thus be less informative for the view-1 classifier. It can be shown that if the two assumptions hold, Co-Training can learn successfully from labeled and unlabeled data. However, it is actually difficult to find tasks in practice that completely satisfy the conditional independence assumption. After all, the context Prime Minister of practically rules out most locations except countries. When the conditional independence assumption is violated, Co-Training may not perform well.
If the conditional independence assumption holds, then on average each added document will be as informative as a random document, and the learning will progress.
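As a concrete sketch, here is a minimal Co-Training loop in Python. The CentroidClassifier, the one-most-confident-instance-per-classifier schedule, and the fixed number of rounds are simplifying assumptions made for this illustration, not part of the original algorithm.

```python
import numpy as np

class CentroidClassifier:
    """Tiny stand-in classifier: predict the class with the nearest centroid.
    Confidence is the gap between the best and second-best class scores."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self
    def scores(self, X):
        # negative Euclidean distance to each centroid: higher = closer
        return -np.linalg.norm(X[:, None, :] - self.centroids_[None], axis=2)
    def predict(self, X):
        return self.classes_[self.scores(X).argmax(axis=1)]
    def confidence(self, X):
        s = np.sort(self.scores(X), axis=1)
        return s[:, -1] - s[:, -2]

def co_train(Xl1, Xl2, yl, Xu1, Xu2, rounds=5):
    """Each round, the classifier on each view labels its single most
    confident unlabeled instance; that instance (both views) joins the
    shared labeled pool, informing the other view's classifier."""
    Xl1, Xl2, yl = list(Xl1), list(Xl2), list(yl)
    remaining = list(range(len(Xu1)))
    c1, c2 = CentroidClassifier(), CentroidClassifier()
    for _ in range(rounds):
        c1.fit(np.array(Xl1), np.array(yl))
        c2.fit(np.array(Xl2), np.array(yl))
        for clf, Xu in ((c1, Xu1), (c2, Xu2)):
            if not remaining:
                break
            Xr = np.array([Xu[i] for i in remaining])
            j = int(clf.confidence(Xr).argmax())
            i = remaining.pop(j)
            Xl1.append(Xu1[i]); Xl2.append(Xu2[i])
            yl.append(int(clf.predict(Xr[j:j + 1])[0]))
    c1.fit(np.array(Xl1), np.array(yl))
    c2.fit(np.array(Xl2), np.array(yl))
    return c1, c2
```

With two well-separated clusters per view and one labeled example per class, both final classifiers recover the cluster labels from the unlabeled pool.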
Another two-view task: web-page classification, where view 1 is the text of the page itself and view 2 is the anchor text of hyperlinks pointing to the page.
Another example: classifying speech phonemes, with the audio signal as one view and the video of the speaker's mouth as the other.
Multiview learning (1)
The squared loss: c(x, y, f(x)) = (y − f(x))². The 0/1 loss: c(x, y, f(x)) = 0 if y = f(x), and 1 otherwise. Costs may also be asymmetric, e.g. c(x, y = healthy, f(x) = diseased) = 1 and c(x, y = diseased, f(x) = healthy) = 100.
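The cost functions above, written out directly; the asymmetric diagnosis costs simply restate the example's numbers.

```python
def squared_loss(y, fx):
    return (y - fx) ** 2

def zero_one_loss(y, fx):
    return 0 if y == fx else 1

def diagnosis_loss(y, fx):
    # a false alarm costs 1; missing a disease costs 100
    if y == fx:
        return 0
    return 1 if (y == "healthy" and fx == "diseased") else 100
```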
Multiview learning (2)
Multiview Learning (3)
Multiview learning trains k classifiers, one per view.
The semi-supervised regularizer measures the disagreement among the k classifiers' predictions on the unlabeled data.
The objective combines each classifier's individual regularized risk with the semi-supervised regularizer.
Multiview learning (4): each individual term is an empirical risk (over the labeled data) plus a standard regularizer.
Outline: mixture models and EM, Co-Training, graph-based methods, SVM-based methods.
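One common form for the semi-supervised regularizer is pairwise squared disagreement between the classifiers' predictions on the unlabeled data; a sketch (the squared penalty is one choice among several):

```python
import numpy as np

def disagreement_regularizer(preds):
    """Pairwise squared disagreement among k classifiers' real-valued
    predictions on u unlabeled instances (preds has shape (k, u))."""
    k = len(preds)
    return float(sum(np.sum((preds[i] - preds[j]) ** 2)
                     for i in range(k) for j in range(i + 1, k)))
```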
Graph-Based Methods (1)
Graph construction: the vertices are the labeled and unlabeled instances, and edges connect similar instances. Common constructions are the kNN graph, which connects each instance to its k nearest neighbors, and the ε-NN graph, which connects instances whose distance is below ε.
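A sketch of kNN graph construction; the Gaussian edge weighting and the bandwidth sigma are common choices added here for illustration, not mandated by the construction itself:

```python
import numpy as np

def knn_graph(X, k=2, sigma=1.0):
    """Symmetric kNN graph with Gaussian edge weights
    w_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None], axis=-1)  # pairwise distances
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(D[i])[1:k + 1]               # k nearest, excluding self
        W[i, nbrs] = np.exp(-D[i, nbrs] ** 2 / (2 * sigma ** 2))
    return np.maximum(W, W.T)                          # connect i~j if either is a kNN
```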
Graph-Based Methods (2): Regularization
Graph-based methods seek a function f that fits the labeled data, via a loss function, and is smooth over the graph, via a special graph-based regularizer; this regularization framework covers the methods below.
Mincut (1)
Treat the positively labeled vertices as source vertices and the negatively labeled vertices as sink vertices. Find a minimum cut, a set of edges of least total weight whose removal separates the sources from the sinks, and label each vertex according to the side of the cut on which it falls.
Mincut (2)
Cost function: an infinite penalty for disagreeing with a given label, plus the total weight of cut edges:
min over f(x) in {0, 1} of ∞ · Σ_{i=1..l} (y_i − f(x_i))² + Σ_{i,j} w_ij (f(x_i) − f(x_j))²
The first term is the loss (it forces f(x_i) = y_i on the labeled vertices) and the second term is the regularizer, so Mincut is a regularized risk problem with a combinatorial constraint.
Mincut (3)
Harmonic Function (1)
Harmonic Function (2)
Relax f to real values (if f(x) ≥ 0, predict y = 1, and if f(x) < 0, predict y = −1). The harmonic function f has many interesting interpretations. For example, one can view the graph as an electric network. Each edge is a resistor with resistance 1/wij, or equivalently conductance wij. The labeled vertices are connected to a 1-volt battery, so that the positive vertices connect to the positive side, and the negative vertices connect to the ground. Then the voltage established at each node is the harmonic function; see Figure 5.3(a).

The harmonic function f can also be interpreted by a random walk on the graph. Imagine a particle at vertex i. In the next time step, the particle will randomly move to another vertex j with probability proportional to wij. The random walk continues in this fashion until the particle reaches one of the labeled vertices. This is known as an absorbing random walk, where the labeled vertices are absorbing states. Then the value of the harmonic function at vertex i, f(xi), is the probability that a particle starting at vertex i eventually reaches a positive labeled vertex.
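The harmonic function also has a closed form: partition the Laplacian into labeled and unlabeled blocks and solve a linear system. A sketch, assuming the labeled vertices come first in the weight matrix:

```python
import numpy as np

def harmonic(W, yl):
    """Closed-form harmonic function on a graph with weight matrix W,
    with the first l vertices labeled by yl (values +1/-1). Solves
    L_uu f_u = -L_ul f_l, where L = D - W is the unnormalized Laplacian;
    on each unlabeled vertex, f equals the weighted average of its neighbors."""
    l = len(yl)
    L = np.diag(W.sum(axis=1)) - W
    f_l = np.asarray(yl, dtype=float)
    return np.linalg.solve(L[l:, l:], -L[l:, :l] @ f_l)
```

On a chain 0—2—3—1 with endpoints labeled +1 and −1, the solution interpolates linearly between the labels, matching the random-walk interpretation.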
Harmonic Function (3)
The unnormalized graph Laplacian matrix is L = D − W, where W is the (l + u) × (l + u) weight matrix whose (i, j)-th element is the edge weight wij, and D is the diagonal degree matrix with Dii = Σj wij.
Harmonic Function (4)
In matrix notation, the solution is expressed with the unnormalized graph Laplacian matrix.
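The Laplacian matters because its quadratic form measures the smoothness of f over the graph. A quick numerical check of the identity f⊤Lf = ½ Σ_{i,j} wij (f_i − f_j)², on an arbitrary 3-vertex example:

```python
import numpy as np

# Arbitrary 3-vertex weighted graph (edges 0-1 and 1-2).
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 2.0],
              [0.0, 2.0, 0.0]])
L = np.diag(W.sum(axis=1)) - W         # unnormalized Laplacian L = D - W

f = np.array([1.0, 0.5, -1.0])
quad = f @ L @ f                       # the quadratic form f' L f
pairwise = 0.5 * sum(W[i, j] * (f[i] - f[j]) ** 2
                     for i in range(3) for j in range(3))
assert np.isclose(quad, pairwise)      # the two forms agree
```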
Manifold Regularization (1)
The harmonic function is transductive: it produces labels only for the unlabeled vertices in the graph, and it forces f(xi) = yi on the labeled vertices.
Manifold Regularization (2)
Manifold regularization is inductive: it learns a function f defined over the whole input space, so it can predict on instances not in the graph.
Manifold Regularization (3)
The normalized graph Laplacian matrix D^{-1/2} L D^{-1/2} = I − D^{-1/2} W D^{-1/2} can be used in the regularizer in place of the unnormalized Laplacian L.
Spectral Graph Theory (1)
Spectral graph theory studies the eigenvalues λ1 ≤ λ2 ≤ … ≤ λn and the corresponding eigenvectors φ1, …, φn of the graph Laplacian L.
Spectral Graph Theory (2)
Spectral Graph Theory (3)
A smaller eigenvalue corresponds to a smoother eigenvector over the graph.
The graph has k connected components if and only if λ1 = … = λk = 0. The corresponding eigenvectors are constant on individual connected components, and zero elsewhere.
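A numerical check of the connected-components fact: a graph with two components has exactly two zero Laplacian eigenvalues (the example graph below is arbitrary).

```python
import numpy as np

def laplacian_spectrum(W):
    """Eigenvalues (ascending) and eigenvectors of L = D - W."""
    L = np.diag(W.sum(axis=1)) - W
    return np.linalg.eigh(L)

# Two disconnected components {0, 1} and {2, 3}: two zero eigenvalues.
W = np.zeros((4, 4))
W[0, 1] = W[1, 0] = 1.0
W[2, 3] = W[3, 2] = 1.0
vals, vecs = laplacian_spectrum(W)
```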
Graph Spectrum
Spectral Graph Theory (4)
Regularization term: expand f in the Laplacian eigenbasis, f = Σi ai φi. Then the regularization term becomes f⊤Lf = Σi λi ai², so a smooth f concentrates its coefficients ai on the eigenvectors φi with small eigenvalues λi.
Spectral Graph Theory (5)
On a graph with k connected components, the first k eigenvectors have zero eigenvalue, so they are not penalized by the regularization term at all.
Spectral Graph Theory (6)
Outline: mixture models and EM, Co-Training, graph-based methods, SVM-based methods.
Support Vector Machines
The central notion is the margin, specifically the geometric margin.
The signed geometric margin: the distance from the decision boundary to the closest labeled instance, taken positive when that instance lies on the correct side of the decision boundary.
The maximum margin hyperplane must be unique.
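The signed geometric margin can be computed directly from its definition; a sketch for a linear classifier, where w and b parametrize the hyperplane w·x + b = 0 and labels are ±1:

```python
import numpy as np

def geometric_margin(w, b, X, y):
    """Signed geometric margin of a labeled set (y in {-1, +1}) with
    respect to the hyperplane w.x + b = 0: the smallest signed distance
    y_i * (w.x_i + b) / ||w|| over all labeled instances."""
    return float(np.min(y * (X @ w + b) / np.linalg.norm(w)))
```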
Non-Separable Case (1)
Non-Separable Case (2)
Instances with slack 0 < ξ ≤ 1 lie inside the margin, but on the correct side of the decision boundary; instances with ξ > 1 lie on the wrong side of the decision boundary and are misclassified; instances with ξ = 0 are correctly classified, outside or on the margin.
Non-Separable Case (3)
Non-Separable Case (4)
S3VM (1)
S3VM (2)
Without a class-balance constraint, the majority (or even all) of the unlabeled instances may be predicted in only one of the classes; constraining the fraction of positive predictions prevents this degenerate solution.
S3VM (3)
A convex function has no non-global local minima, so convex objectives such as the supervised SVM's are easy to minimize. The S3VM objective function is non-convex, because the hat loss on the unlabeled instances is non-convex, and research in S3VMs has focused on how to efficiently find a near-optimum solution.
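A sketch of the S3VM primal objective for a linear classifier; the hat loss max(0, 1 − |f(x)|) on the unlabeled instances is the non-convex part, and lam1, lam2 are illustrative trade-off weights:

```python
import numpy as np

def s3vm_objective(w, b, Xl, yl, Xu, lam1=1.0, lam2=1.0):
    """S3VM objective for f(x) = w.x + b: hinge loss on the labeled data,
    an L2 regularizer, and the non-convex 'hat loss' max(0, 1 - |f(x)|),
    which pushes unlabeled instances away from the decision boundary."""
    fl = Xl @ w + b
    fu = Xu @ w + b
    hinge = np.maximum(0.0, 1.0 - yl * fl).sum()
    hat = np.maximum(0.0, 1.0 - np.abs(fu)).sum()
    return float(hinge + lam1 * (w @ w) + lam2 * hat)
```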
Logistic Regression
SVM and S3VM are non-probabilistic models; logistic regression is a probabilistic model. Training maximizes the posterior of the weights: the conditional log likelihood of the labels, combined with a Gaussian distribution as the prior on w. The resulting objective has two parts: the logistic loss and the regularizer induced by the prior. The second line follows from Bayes rule, ignoring the denominator, which is constant with respect to the parameters.
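The MAP objective above, with the Gaussian prior folded into an L2 term; a sketch for labels y in {−1, +1}, where lam absorbs the prior variance:

```python
import numpy as np

def neg_log_posterior(w, X, y, lam=1.0):
    """Negative log posterior for logistic regression: the logistic loss
    sum_i log(1 + exp(-y_i w.x_i)) plus an L2 term from the Gaussian prior."""
    margins = y * (X @ w)
    return float(np.sum(np.log1p(np.exp(-margins))) + lam * (w @ w))
```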
Logistic Regression + Entropy Regularizer for Semi-Supervised Learning
Intuition: if the two classes are well-separated, then the classification on any unlabeled instance should be confident: it either clearly belongs to the positive class, or to the negative class. Equivalently, the posterior probability p(y|x) should be either close to 1, or close to 0.
Entropy
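The intuition above translates into an entropy penalty on the predicted posteriors over the unlabeled data; a sketch for the binary case:

```python
import numpy as np

def entropy_regularizer(p):
    """Sum of binary entropies of predicted posteriors p(y = 1 | x) over
    the unlabeled instances: near zero when every prediction is confident,
    maximal (log 2 per instance) when predictions are uncertain."""
    p = np.clip(p, 1e-12, 1 - 1e-12)   # avoid log(0)
    return float(np.sum(-p * np.log(p) - (1 - p) * np.log(1 - p)))
```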
Semi-Supervised Logistic Regression
The entropy regularizer for logistic regression adds the entropy of the predicted posteriors p(y|x) on the unlabeled instances to the objective.
The entropy regularizer plays the same role for logistic regression that the hat loss plays for S3VM: both prefer decision boundaries that pass through low-density regions of the unlabeled data.