Semi-Supervised Classification (2)
DESCRIPTION
Semi-Supervised Classification (2). Zohreh Karimi. Introduction to Semi-Supervised Learning, Xiaojin Zhu and Andrew B. Goldberg, University of Wisconsin, Madison, 2009. Semi-supervised learning methods: mixture models and EM, Co-Training, graph-based methods, SVM-based methods. - PowerPoint PPT Presentation
TRANSCRIPT
Slide 1
Semi-Supervised Classification (2)
Reference: Introduction to Semi-Supervised Learning, Xiaojin Zhu and Andrew B. Goldberg, University of Wisconsin, Madison, 2009.
Semi-supervised learning methods: mixture models and EM, Co-Training, graph-based methods, SVM-based methods.
Co-Training
Running example: named entity classification, e.g. deciding whether a named entity is a Location. Each instance has two views: view 1 is the named entity string itself, and view 2 is the surrounding context. In each round, the classifier trained on one view labels its most confident unlabeled instances, and these are added as training examples for the other view's classifier.
Co-Training assumptions: each view alone is sufficient to learn an accurate classifier given enough labeled data, and the two views are conditionally independent given the class label.
Why is the conditional independence assumption important for Co-Training? If the view-2 classifier f(2) decides that the context headquartered in indicates Location with high confidence, Co-Training will add unlabeled instances with that context as view-1 training examples. These new training examples for f(1) will include all representative Location named entities x(1), thanks to the conditional independence assumption. If the assumption didn't hold, the new examples could all be highly similar and thus be less informative for the view-1 classifier. It can be shown that if the two assumptions hold, Co-Training can learn successfully from labeled and unlabeled data. However, it is actually difficult to find tasks in practice that completely satisfy the conditional independence assumption. After all, the context Prime Minister of practically rules out most locations except countries. When the conditional independence assumption is violated, Co-Training may not perform well.
If the conditional independence assumption holds, then on average each added document will be as informative as a random document, and the learning will progress.
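As a concrete sketch, here is a minimal Co-Training loop in Python. The CentroidClassifier, the one-most-confident-instance-per-classifier schedule, and the fixed number of rounds are simplifying assumptions made for this illustration, not part of the original algorithm.

```python
import numpy as np

class CentroidClassifier:
    """Tiny stand-in classifier: predict the class with the nearest centroid.
    Confidence is the gap between the best and second-best class scores."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self
    def scores(self, X):
        # negative Euclidean distance to each centroid: higher = closer
        return -np.linalg.norm(X[:, None, :] - self.centroids_[None], axis=2)
    def predict(self, X):
        return self.classes_[self.scores(X).argmax(axis=1)]
    def confidence(self, X):
        s = np.sort(self.scores(X), axis=1)
        return s[:, -1] - s[:, -2]

def co_train(Xl1, Xl2, yl, Xu1, Xu2, rounds=5):
    """Each round, the classifier on each view labels its single most
    confident unlabeled instance; that instance (both views) joins the
    shared labeled pool, informing the other view's classifier."""
    Xl1, Xl2, yl = list(Xl1), list(Xl2), list(yl)
    remaining = list(range(len(Xu1)))
    c1, c2 = CentroidClassifier(), CentroidClassifier()
    for _ in range(rounds):
        c1.fit(np.array(Xl1), np.array(yl))
        c2.fit(np.array(Xl2), np.array(yl))
        for clf, Xu in ((c1, Xu1), (c2, Xu2)):
            if not remaining:
                break
            Xr = np.array([Xu[i] for i in remaining])
            j = int(clf.confidence(Xr).argmax())
            i = remaining.pop(j)
            Xl1.append(Xu1[i]); Xl2.append(Xu2[i])
            yl.append(int(clf.predict(Xr[j:j + 1])[0]))
    c1.fit(np.array(Xl1), np.array(yl))
    c2.fit(np.array(Xl2), np.array(yl))
    return c1, c2
```

With two well-separated clusters per view and one labeled example per class, both final classifiers recover the cluster labels from the unlabeled pool.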
Another two-view task: web-page classification, where view 1 is the text of the page itself and view 2 is the anchor text of hyperlinks pointing to the page.
Another example: classifying speech phonemes, with the audio signal as one view and the video of the speaker's mouth as the other.
Multiview learning (1)
The squared loss: c(x, y, f(x)) = (y − f(x))². The 0/1 loss: c(x, y, f(x)) = 0 if y = f(x), and 1 otherwise. Costs may also be asymmetric, e.g. c(x, y = healthy, f(x) = diseased) = 1 and c(x, y = diseased, f(x) = healthy) = 100.
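The cost functions above, written out directly; the asymmetric diagnosis costs simply restate the example's numbers.

```python
def squared_loss(y, fx):
    return (y - fx) ** 2

def zero_one_loss(y, fx):
    return 0 if y == fx else 1

def diagnosis_loss(y, fx):
    # a false alarm costs 1; missing a disease costs 100
    if y == fx:
        return 0
    return 1 if (y == "healthy" and fx == "diseased") else 100
```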
Multiview learning (2)
Multiview Learning (3)
Multiview learning trains k classifiers, one per view.
The semi-supervised regularizer measures the disagreement among the k classifiers' predictions on the unlabeled data.
The objective combines each classifier's individual regularized risk with the semi-supervised regularizer.
Multiview learning (4): each individual term is an empirical risk (over the labeled data) plus a standard regularizer.
Outline: mixture models and EM, Co-Training, graph-based methods, SVM-based methods.
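One common form for the semi-supervised regularizer is pairwise squared disagreement between the classifiers' predictions on the unlabeled data; a sketch (the squared penalty is one choice among several):

```python
import numpy as np

def disagreement_regularizer(preds):
    """Pairwise squared disagreement among k classifiers' real-valued
    predictions on u unlabeled instances (preds has shape (k, u))."""
    k = len(preds)
    return float(sum(np.sum((preds[i] - preds[j]) ** 2)
                     for i in range(k) for j in range(i + 1, k)))
```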
Graph-Based Methods (1)
Graph construction: the vertices are the labeled and unlabeled instances, and edges connect similar instances. Common constructions are the kNN graph, which connects each instance to its k nearest neighbors, and the ε-NN graph, which connects instances whose distance is below ε.
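A sketch of kNN graph construction; the Gaussian edge weighting and the bandwidth sigma are common choices added here for illustration, not mandated by the construction itself:

```python
import numpy as np

def knn_graph(X, k=2, sigma=1.0):
    """Symmetric kNN graph with Gaussian edge weights
    w_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None], axis=-1)  # pairwise distances
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(D[i])[1:k + 1]               # k nearest, excluding self
        W[i, nbrs] = np.exp(-D[i, nbrs] ** 2 / (2 * sigma ** 2))
    return np.maximum(W, W.T)                          # connect i~j if either is a kNN
```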
Graph-Based Methods (2): Regularization
Graph-based methods seek a function f that fits the labeled data, via a loss function, and is smooth over the graph, via a special graph-based regularizer; this regularization framework covers the methods below.
Mincut (1)
Treat the positively labeled vertices as source vertices and the negatively labeled vertices as sink vertices. Find a minimum cut, a set of edges of least total weight whose removal separates the sources from the sinks, and label each vertex according to the side of the cut on which it falls.
Mincut (2)
Cost function: an infinite penalty for disagreeing with a given label, plus the total weight of cut edges:
min over f(x) in {0, 1} of ∞ · Σ_{i=1..l} (y_i − f(x_i))² + Σ_{i,j} w_ij (f(x_i) − f(x_j))²
The first term is the loss (it forces f(x_i) = y_i on the labeled vertices) and the second term is the regularizer, so Mincut is a regularized risk problem with a combinatorial constraint.
Mincut (3)
Harmonic Function (1)
Harmonic Function (2)
Relax f to real values (if f(x) ≥ 0, predict y = 1, and if f(x) < 0, predict y = −1). The harmonic function f has many interesting interpretations. For example, one can view the graph as an electric network. Each edge is a resistor with resistance 1/wij, or equivalently conductance wij. The labeled vertices are connected to a 1-volt battery, so that the positive vertices connect to the positive side, and the negative vertices connect to the ground. Then the voltage established at each node is the harmonic function; see Figure 5.3(a).

The harmonic function f can also be interpreted by a random walk on the graph. Imagine a particle at vertex i. In the next time step, the particle will randomly move to another vertex j with probability proportional to wij. The random walk continues in this fashion until the particle reaches one of the labeled vertices. This is known as an absorbing random walk, where the labeled vertices are absorbing states. Then the value of the harmonic function at vertex i, f(xi), is the probability that a particle starting at vertex i eventually reaches a positive labeled vertex.
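The harmonic function also has a closed form: partition the Laplacian into labeled and unlabeled blocks and solve a linear system. A sketch, assuming the labeled vertices come first in the weight matrix:

```python
import numpy as np

def harmonic(W, yl):
    """Closed-form harmonic function on a graph with weight matrix W,
    with the first l vertices labeled by yl (values +1/-1). Solves
    L_uu f_u = -L_ul f_l, where L = D - W is the unnormalized Laplacian;
    on each unlabeled vertex, f equals the weighted average of its neighbors."""
    l = len(yl)
    L = np.diag(W.sum(axis=1)) - W
    f_l = np.asarray(yl, dtype=float)
    return np.linalg.solve(L[l:, l:], -L[l:, :l] @ f_l)
```

On a chain 0—2—3—1 with endpoints labeled +1 and −1, the solution interpolates linearly between the labels, matching the random-walk interpretation.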
Harmonic Function (3)
The unnormalized graph Laplacian matrix is L = D − W, where W is the (l + u) × (l + u) weight matrix whose (i, j)-th element is the edge weight wij, and D is the diagonal degree matrix with Dii = Σj wij.
Harmonic Function (4)
In matrix notation, the solution is expressed with the unnormalized graph Laplacian matrix.
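The Laplacian matters because its quadratic form measures the smoothness of f over the graph. A quick numerical check of the identity f⊤Lf = ½ Σ_{i,j} wij (f_i − f_j)², on an arbitrary 3-vertex example:

```python
import numpy as np

# Arbitrary 3-vertex weighted graph (edges 0-1 and 1-2).
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 2.0],
              [0.0, 2.0, 0.0]])
L = np.diag(W.sum(axis=1)) - W         # unnormalized Laplacian L = D - W

f = np.array([1.0, 0.5, -1.0])
quad = f @ L @ f                       # the quadratic form f' L f
pairwise = 0.5 * sum(W[i, j] * (f[i] - f[j]) ** 2
                     for i in range(3) for j in range(3))
assert np.isclose(quad, pairwise)      # the two forms agree
```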
Manifold Regularization (1)
The harmonic function is transductive: it produces labels only for the unlabeled vertices in the graph, and it forces f(xi) = yi on the labeled vertices.
Manifold Regularization (2)
Manifold regularization is inductive: it learns a function f defined over the whole input space, so it can predict on instances not in the graph.
Manifold Regularization (3)
The normalized graph Laplacian matrix D^{-1/2} L D^{-1/2} = I − D^{-1/2} W D^{-1/2} can be used in the regularizer in place of the unnormalized Laplacian L.
Spectral Graph Theory (1)
Spectral graph theory studies the eigenvalues λ1 ≤ λ2 ≤ … ≤ λn and the corresponding eigenvectors φ1, …, φn of the graph Laplacian L.
Spectral Graph Theory (2)
Spectral Graph Theory (3)
A smaller eigenvalue corresponds to a smoother eigenvector over the graph.
The graph has k connected components if and only if λ1 = … = λk = 0. The corresponding eigenvectors are constant on individual connected components, and zero elsewhere.
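A numerical check of the connected-components fact: a graph with two components has exactly two zero Laplacian eigenvalues (the example graph below is arbitrary).

```python
import numpy as np

def laplacian_spectrum(W):
    """Eigenvalues (ascending) and eigenvectors of L = D - W."""
    L = np.diag(W.sum(axis=1)) - W
    return np.linalg.eigh(L)

# Two disconnected components {0, 1} and {2, 3}: two zero eigenvalues.
W = np.zeros((4, 4))
W[0, 1] = W[1, 0] = 1.0
W[2, 3] = W[3, 2] = 1.0
vals, vecs = laplacian_spectrum(W)
```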
Graph Spectrum
Spectral Graph Theory (4)
Regularization term: expand f in the Laplacian eigenbasis, f = Σi ai φi. Then the regularization term becomes f⊤Lf = Σi λi ai², so a smooth f concentrates its coefficients ai on the eigenvectors φi with small eigenvalues λi.
Spectral Graph Theory (5)
On a graph with k connected components, the first k eigenvectors have zero eigenvalue, so they are not penalized by the regularization term at all.
Spectral Graph Theory (6)
Outline: mixture models and EM, Co-Training, graph-based methods, SVM-based methods.
Support Vector Machines
The central notion is the margin, specifically the geometric margin.
The signed geometric margin: the distance from the decision boundary to the closest labeled instance, taken positive when that instance lies on the correct side of the decision boundary.
The maximum margin hyperplane must be unique.
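The signed geometric margin can be computed directly from its definition; a sketch for a linear classifier, where w and b parametrize the hyperplane w·x + b = 0 and labels are ±1:

```python
import numpy as np

def geometric_margin(w, b, X, y):
    """Signed geometric margin of a labeled set (y in {-1, +1}) with
    respect to the hyperplane w.x + b = 0: the smallest signed distance
    y_i * (w.x_i + b) / ||w|| over all labeled instances."""
    return float(np.min(y * (X @ w + b) / np.linalg.norm(w)))
```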
Non-Separable Case (1)
Non-Separable Case (2)
Instances with slack 0 < ξ ≤ 1 lie inside the margin, but on the correct side of the decision boundary; instances with ξ > 1 lie on the wrong side of the decision boundary and are misclassified; instances with ξ = 0 are correctly classified, outside or on the margin.
Non-Separable Case (3)
Non-Separable Case (4)
S3VM (1)
S3VM (2)
Without a class-balance constraint, the majority (or even all) of the unlabeled instances may be predicted in only one of the classes; constraining the fraction of positive predictions prevents this degenerate solution.
S3VM (3)
A convex function has no non-global local minima, so convex objectives such as the supervised SVM's are easy to minimize. The S3VM objective function is non-convex, because the hat loss on the unlabeled instances is non-convex, and research in S3VMs has focused on how to efficiently find a near-optimum solution.
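A sketch of the S3VM primal objective for a linear classifier; the hat loss max(0, 1 − |f(x)|) on the unlabeled instances is the non-convex part, and lam1, lam2 are illustrative trade-off weights:

```python
import numpy as np

def s3vm_objective(w, b, Xl, yl, Xu, lam1=1.0, lam2=1.0):
    """S3VM objective for f(x) = w.x + b: hinge loss on the labeled data,
    an L2 regularizer, and the non-convex 'hat loss' max(0, 1 - |f(x)|),
    which pushes unlabeled instances away from the decision boundary."""
    fl = Xl @ w + b
    fu = Xu @ w + b
    hinge = np.maximum(0.0, 1.0 - yl * fl).sum()
    hat = np.maximum(0.0, 1.0 - np.abs(fu)).sum()
    return float(hinge + lam1 * (w @ w) + lam2 * hat)
```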
Logistic Regression
SVM and S3VM are non-probabilistic models; logistic regression is a probabilistic model. Training maximizes the posterior of the weights: the conditional log likelihood of the labels, combined with a Gaussian distribution as the prior on w. The resulting objective has two parts: the logistic loss and the regularizer induced by the prior. The second line follows from Bayes rule, ignoring the denominator, which is constant with respect to the parameters.
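The MAP objective above, with the Gaussian prior folded into an L2 term; a sketch for labels y in {−1, +1}, where lam absorbs the prior variance:

```python
import numpy as np

def neg_log_posterior(w, X, y, lam=1.0):
    """Negative log posterior for logistic regression: the logistic loss
    sum_i log(1 + exp(-y_i w.x_i)) plus an L2 term from the Gaussian prior."""
    margins = y * (X @ w)
    return float(np.sum(np.log1p(np.exp(-margins))) + lam * (w @ w))
```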
Logistic Regression + Entropy Regularizer for Semi-Supervised Learning
Intuition: if the two classes are well-separated, then the classification on any unlabeled instance should be confident: it either clearly belongs to the positive class, or to the negative class. Equivalently, the posterior probability p(y|x) should be either close to 1, or close to 0.
Entropy
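The intuition above translates into an entropy penalty on the predicted posteriors over the unlabeled data; a sketch for the binary case:

```python
import numpy as np

def entropy_regularizer(p):
    """Sum of binary entropies of predicted posteriors p(y = 1 | x) over
    the unlabeled instances: near zero when every prediction is confident,
    maximal (log 2 per instance) when predictions are uncertain."""
    p = np.clip(p, 1e-12, 1 - 1e-12)   # avoid log(0)
    return float(np.sum(-p * np.log(p) - (1 - p) * np.log(1 - p)))
```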
Semi-Supervised Logistic Regression
The entropy regularizer for logistic regression adds the entropy of the predicted posteriors p(y|x) on the unlabeled instances to the objective.
The entropy regularizer plays the same role for logistic regression that the hat loss plays for S3VM: both prefer decision boundaries that pass through low-density regions of the unlabeled data.