Discriminative classification methods, kernels, and topic models
Jakob Verbeek
January 8, 2010
Plan for this course
1) Introduction to machine learning
2) Clustering techniques: k-means, Gaussian mixture density
3) Gaussian mixture density continued: parameter estimation with EM
4) Classification techniques 1: introduction, generative methods, semi-supervised Fisher kernels
5) Classification techniques 2: discriminative methods, kernels
6) Decomposition of images: topic models
Generative vs Discriminative Classification
• Training data consists of “inputs”, denoted x, and corresponding output “class labels”, denoted y.
• Goal is to predict the class label y for a given test input x.
• Generative probabilistic methods
  – Model the density of inputs x from each class: p(x|y)
  – Estimate the class prior probability p(y)
  – Use Bayes’ rule to infer the distribution over classes given the input
• Discriminative (probabilistic) methods
  – Directly estimate the class probability given the input: p(y|x)
  – Some methods have no probabilistic interpretation, but fit a function f(x) and assign classes based on sign(f(x))
$$p(y|x) = \frac{p(x|y)\,p(y)}{p(x)}, \qquad p(x) = \sum_y p(y)\,p(x|y)$$
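As a minimal sketch of the generative route via Bayes’ rule, assuming Gaussian class-conditional densities with illustrative (not learned) parameters:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Class-conditional densities p(x|y) assumed Gaussian here; means,
# covariances and priors would normally be estimated from training data.
means  = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
covs   = [np.eye(2), np.eye(2)]
priors = [0.5, 0.5]

def posterior(x):
    # p(y|x) is proportional to p(x|y) p(y); normalize by p(x) = sum_y p(y) p(x|y)
    joint = np.array([multivariate_normal.pdf(x, m, C) * p
                      for m, C, p in zip(means, covs, priors)])
    return joint / joint.sum()

x_test = np.array([1.5, 1.0])
print(posterior(x_test), "-> predicted class:", posterior(x_test).argmax())
```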
Discriminative vs generative methods
• Generative probabilistic methods
  – Model the density of inputs x from each class: p(x|y)
  – Parametric models
    • Assume a specific form of p(x|y), e.g. Gaussian, Bernoulli, …
  – Semi-parametric models
    • Combine simple class models, e.g. in mixtures of Gaussians
  – Non-parametric models
    • Do not assume a specific form of p(x|y), e.g. k-nearest-neighbour, histograms, kernel density estimation
• Discriminative (probabilistic) methods
  – Directly estimate the class probability given the input: p(y|x)
  – Logistic discriminant
  – Support vector machines
Discriminant function
1. Choose a class of decision functions in feature space.
2. Estimate the function parameters from the training set.
3. Classify a new pattern on the basis of this decision rule.
• kNN example from last week: needs to store all data
• Separation using a smooth curve: only need to store the curve parameters
Linear functions in feature space
• Decision function is linear in the features:
$$f(x) = w^T x + w_0 = w_0 + \sum_{d=1}^{D} w_d x_d$$
• Classification based on the sign of f(x)
• Slope is determined by w
• Offset from origin is determined by w0
• Decision surface is a (D−1)-dimensional hyperplane orthogonal to w, given by
$$f(x) = w^T x + w_0 = 0$$
Dealing with more than two classes
• First idea: construct a classifier for all classes from several 2-class classifiers
  – E.g. separate one class vs all others, or separate all pairs of classes
  – Learn the 2-class “base” classifiers independently
• Not clear what should be done in some regions
  – 1 vs all: region claimed by several classes
  – Pair-wise: no agreement
Dealing with more than two classes
• Alternative: use a linear function for each class k
$$f_k(x) = w_k^T x + w_{k0}$$
• Assign a sample to the class whose function has the maximum value:
$$y = \arg\max_k f_k(x)$$
• Decision regions are convex in this case
  – If two points fall in the region, then so do all points on the connecting line
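A minimal sketch of this argmax rule, with illustrative weight vectors rather than learned ones:

```python
import numpy as np

# One linear function per class, predict by the maximum score.
W  = np.array([[ 1.0, 0.0],
               [-1.0, 0.0],
               [ 0.0, 1.0]])   # one weight vector w_k per class
w0 = np.array([0.0, 0.0, -0.5])

def predict(X):
    scores = X @ W.T + w0        # f_k(x) = w_k^T x + w_k0
    return scores.argmax(axis=1) # class with maximum linear score

X = np.array([[2.0, 0.1], [-1.0, 0.3], [0.1, 3.0]])
print(predict(X))  # -> [0 1 2]
```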
Logistic discrimination
• We go back to the concept of x as a stochastic vector and assume that the log-ratio of the class-conditional densities is linear in x:
$$\log \frac{p(x|y=1)}{p(x|y=2)} = v_0 + w^T x$$
• This is valid for a variety of parametric models, examples to come…
• In this case we can show that
$$p(y=2|x) = \frac{1}{1+\exp(w_0 + w^T x)}, \qquad p(y=1|x) = 1 - p(y=2|x) = \frac{\exp(w_0 + w^T x)}{1+\exp(w_0 + w^T x)}$$
where
$$w_0 = v_0 + \log\frac{p_1}{p_2}$$
with $p_1 = p(y=1)$ and $p_2 = p(y=2)$ the class priors.
• Thus the class boundary is obtained by setting the linear function in the exponent to zero.
Logistic discrimination: example
• Suppose each class-conditional density p(x|y) is Gaussian with the same covariance matrix C:
$$p(x|y=1) = \mathcal{N}(x; m_1, C) = (2\pi)^{-d/2}\,|C|^{-1/2} \exp\!\left(-\tfrac{1}{2}(x-m_1)^T C^{-1} (x-m_1)\right)$$
• In this case we have
$$\log \frac{p(y=1|x)}{p(y=2|x)} = \log p(x|y=1) - \log p(x|y=2) + \log p(y=1) - \log p(y=2)$$
The quadratic terms cancel:
$$-\tfrac{1}{2}(x-m_1)^T C^{-1}(x-m_1) + \tfrac{1}{2}(x-m_2)^T C^{-1}(x-m_2) = (m_1 - m_2)^T C^{-1} x - \tfrac{1}{2} m_1^T C^{-1} m_1 + \tfrac{1}{2} m_2^T C^{-1} m_2$$
so the log-ratio is a linear function $w_0 + w^T x$ with
$$w = C^{-1}(m_1 - m_2), \qquad w_0 = -\tfrac{1}{2} m_1^T C^{-1} m_1 + \tfrac{1}{2} m_2^T C^{-1} m_2 + \log\frac{p(y=1)}{p(y=2)}$$
• Thus the probability of the class given the data is a logistic discriminant model:
$$p(y=2|x) = \frac{1}{1+\exp(w_0 + w^T x)}$$
• Points with equal probability lie on a hyperplane orthogonal to w.
[Figure: the class-conditional densities P(data|class) and the resulting posteriors P(class|data).]
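A small sketch checking this equivalence numerically, with made-up means, shared covariance and priors:

```python
import numpy as np
from scipy.stats import multivariate_normal

# For two Gaussian classes with shared covariance C, the posterior is
# exactly a logistic function of w0 + w^T x. Parameter values are illustrative.
m1, m2 = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
C      = np.array([[1.0, 0.3], [0.3, 1.0]])
p1, p2 = 0.6, 0.4

Cinv = np.linalg.inv(C)
w  = Cinv @ (m1 - m2)
w0 = -0.5 * m1 @ Cinv @ m1 + 0.5 * m2 @ Cinv @ m2 + np.log(p1 / p2)

x = np.array([0.5, -0.2])
logistic = np.exp(w0 + w @ x) / (1 + np.exp(w0 + w @ x))   # p(y=1|x)
bayes = (multivariate_normal.pdf(x, m1, C) * p1 /
         (multivariate_normal.pdf(x, m1, C) * p1 +
          multivariate_normal.pdf(x, m2, C) * p2))
print(logistic, bayes)  # the two values agree
```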
Multi-class logistic discriminant
• Assume that the log-ratio of the data likelihoods is linear in x for every pair of classes:
$$\log \frac{p(x|y=i)}{p(x|y=j)} = v_{ij0} + v_{ij}^T x$$
• In that case we find that
$$p(y=i|x) = \frac{\exp(w_{i0} + w_i^T x)}{\sum_j \exp(w_{j0} + w_j^T x)}$$
• This is known as the “soft-max” over the linear functions corresponding to the classes
• For any given pair of classes, the two classes are equally likely on a hyperplane in feature space
Parameter estimation for logistic discriminant
• For the 2-class or multi-class model, we set the parameters using maximum likelihood
  – Maximize the (log-)likelihood of predicting the correct class label for the training data
  – Thus we sum over all training data the log-probability of the correct label:
$$L = \sum_{n=1}^{N} \log p(y_n | x_n)$$
• The derivative of the log-likelihood has an intuitive interpretation. Using the indicator function $[y_n = c]$ (1 if true, else 0):
$$\frac{\partial L}{\partial w_{c0}} = \sum_n \left([y_n = c] - p(y{=}c|x_n)\right) = 0$$
The expected number of points from each class should equal the actual number.
$$\nabla_{w_c} L = \sum_n \left([y_n = c] - p(y{=}c|x_n)\right) x_n = 0$$
The expected value of each feature, weighting points by p(y|x), should equal its empirical expectation.
• No closed-form solution; use a gradient-based method to find the optimal parameters
  – Note 1: the log-likelihood is concave in w, so there are no local optima
  – Note 2: w is a linear combination of the data points
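A minimal sketch of this maximum-likelihood fit by gradient ascent on synthetic data (all names and constants below are illustrative):

```python
import numpy as np

# Multi-class logistic discriminant trained by gradient ascent.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 1.0, (50, 2)) for m in ([0, 0], [3, 0], [0, 3])])
y = np.repeat([0, 1, 2], 50)
N, D, K = X.shape[0], X.shape[1], 3

W, w0 = np.zeros((K, D)), np.zeros(K)
Y = np.eye(K)[y]                                   # one-hot targets: [y_n = c]

for it in range(200):
    scores = X @ W.T + w0
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    P = np.exp(scores)
    P /= P.sum(axis=1, keepdims=True)              # soft-max p(y=c|x_n)
    G = Y - P                                      # [y_n=c] - p(y=c|x_n)
    W  += 0.01 * G.T @ X                           # gradient ascent step
    w0 += 0.01 * G.sum(axis=0)

print("training accuracy:", (P.argmax(axis=1) == y).mean())
```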
Motivation for support vector machines
• Suppose the classes are linearly separable
  – There is a continuum of classifiers with zero error on the training data
• Generalization is not good in this case, but it is better if a margin is introduced
• Using statistical learning theory, larger margins can be formally shown to yield better generalization performance.
[Figures: separable data in feature space with axes x1 (roundness) and x2 (color); several zero-error separating lines vs a separator with a margin of width b/||w||.]
Support vector machines
• Classification method for 2-class problems, with class labels $y_n \in \{-1, +1\}$
• Suppose that the classes are linearly separable
• We seek a linear classification function that separates the training data, i.e.
$$y_n (w^T x_n + w_0) > 0$$
• To obtain a margin we require
$$y_n (w^T x_n + w_0) \geq b$$
• Since this requirement is invariant to scaling, we set b = 1
  – Otherwise a large b could be compensated for by multiplying w and w0 by a large number
• We get two hyperplanes:
$$H_1: \; w^T x + w_0 = +1, \qquad H_2: \; w^T x + w_0 = -1$$
• The size of the margin is $1/\sqrt{w^T w}$
• Points on the margin are called support vectors
Support vector machines: objective
• Our objective is to find the classification function with maximum margin
• Maximizing the margin means minimizing
$$L = \tfrac{1}{2}\, w^T w$$
• But subject to the constraints that the function separates the training data:
$$y_n (w^T x_n + w_0) \geq 1$$
• Using the technique of Lagrange multipliers for constrained optimization, we can solve this problem by minimizing the following w.r.t. w and w0, and maximizing w.r.t. the alpha parameters:
$$L_P = \tfrac{1}{2}\, w^T w - \sum_{n=1}^{N} \alpha_n \left[ y_n (w^T x_n + w_0) - 1 \right]$$
• This is a quadratic function in w and w0, and the optimal values are found by setting the derivatives to zero.
Support vector machines: optimization
• Setting the derivatives to zero we find
$$w = \sum_{i=1}^{N} \alpha_i y_i x_i, \qquad \sum_{i=1}^{N} \alpha_i y_i = 0$$
  – Note: w is a linear combination of the training data points
• If we substitute this into the Lagrangian function, we get the dual form
$$L_D = \sum_{i=1}^{N} \alpha_i - \tfrac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \, x_i^T x_j$$
which has to be maximized w.r.t. the alpha variables, subject to the constraints
$$\alpha_i \geq 0, \qquad \sum_{i=1}^{N} \alpha_i y_i = 0$$
  – Note: the training data appear only in the form of inner products here
• The dual function is quadratic in the alphas subject to linear constraints, and can be solved using standard quadratic solvers
• Only the data points precisely on the margin get a non-zero alpha
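A brief sketch using scikit-learn’s quadratic solver on toy data; in sklearn’s SVC, `dual_coef_` stores the products $\alpha_n y_n$ for the support vectors, so w can be recovered as their weighted sum:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Fit a (nearly) hard-margin linear SVM and inspect the dual solution.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=0.8, random_state=0)
svm = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C ~ hard margin

print("number of support vectors:", len(svm.support_))
w = svm.dual_coef_ @ svm.support_vectors_      # w = sum_n alpha_n y_n x_n
print("w:", w.ravel(), " w0:", svm.intercept_)
print("matches coef_:", np.allclose(w, svm.coef_))
```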
SVMs: non-separable classes
• Before, we assumed that the classes were separable
  – and found the w that separated the classes with maximum margin
• What if the classes are non-separable?
• Allow points to violate the margin by an amount given by non-negative “slack variables” $\xi_n$:
$$y_n (w^T x_n + w_0) \geq 1 - \xi_n$$
• Replace the infeasible class-separation constraint with a penalty for large slacks
• New objective function:
$$L = \tfrac{1}{2}\, w^T w + C \sum_i \xi_i$$
• Support vectors are either on the margin, or on the wrong side of the margin.
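A short sketch of the effect of the slack penalty C, using overlapping toy classes; smaller C tolerates more margin violations, so more points become support vectors:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# C trades margin width against slack on non-separable data.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=1)
for C in (0.01, 1.0, 100.0):
    svm = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: {len(svm.support_)} support vectors")
```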
Summary: linear discriminant analysis
• Two most widely used linear classifiers in practice:
  – Logistic discriminant (supports more than 2 classes directly)
  – Support vector machines (multi-class extensions recently developed)
• In both cases the weight vector w is a linear combination of the data points:
$$w = \sum_{i=1}^{N} \alpha_i x_i$$
• This means that we only need the inner products between data points to calculate the linear functions:
$$f(x) = w^T x + w_0 = \sum_{n=1}^{N} \alpha_n x_n^T x + w_0 = \sum_{n=1}^{N} \alpha_n k(x_n, x) + w_0$$
  – The function k that computes the inner products, $k(x_n, x) = x_n^T x$, is called a kernel function
Plan for this course
1) Introduction to machine learning
2) Clustering techniques: k-means, Gaussian mixture density
3) Gaussian mixture density continued: parameter estimation with EM
4) Classification techniques 1: introduction, generative methods, semi-supervised Fisher kernels
5) Classification techniques 2: discriminative methods, kernel functions
6) Decomposition of images: topic models
Motivation for generalized linear functions
• Linear classifiers are attractive because of their simplicity
  – Computationally efficient
  – SVM and logistic discriminant do not suffer from an objective function with multiple local optima
• The fact that only linear functions can be implemented limits their use to discriminating (approximately) linearly separable classes.
• Two possible extensions:
  1) Use a richer class of functions, not only linear ones
     – In this case optimization is often much more difficult
  2) Extend or replace the original set of features with non-linear functions of the original features
[Figure: non-separable data in feature space with axes x1 (roundness) and x2 (color), with a linear decision boundary w^T x + w_0 = 0.]
Generalized linear functions
• The data vector x is replaced by a vector $\phi(x) = (\phi_1(x), \ldots, \phi_m(x))^T$ of m fixed functions of x
• This can turn the problem into a linearly separable one
Example of an alternative feature space
• Suppose that one class `encapsulates’ the other class:
• A linear classifier does not work very well…
• Let’s map our features to a new space, spanned by
$$\phi(x) = (x_1^2, \; x_2^2, \; \sqrt{2}\, x_1 x_2)^T$$
• A circle around the origin with radius r in the original space,
$$(x_1)^2 + (x_2)^2 = r^2,$$
is now described by a plane in the new space.
• We can use the classification function
$$f(x) = w^T \phi(x) + w_0$$
with $w^T = (1, 1, 0)$ and $w_0 = -r^2$.
The kernel function of our example
• Let’s compute the inner product in the new feature space we defined:
$$\phi(x)^T \phi(z) = x_1^2 z_1^2 + x_2^2 z_2^2 + 2\, x_1 x_2 z_1 z_2 = (x^T z)^2 = k(x, z)$$
• Thus, we simply square the standard inner product, and do not need to explicitly map the points to the new feature space!
• This becomes useful if the new feature space has many features.
• The kernel function is a shortcut to compute inner products in feature space, without explicitly mapping the data to that space.
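A tiny numerical check of this identity, with arbitrary example vectors:

```python
import numpy as np

# Verify that the explicit map phi(x) = (x1^2, x2^2, sqrt(2) x1 x2)
# and the kernel k(x,z) = (x^T z)^2 give the same inner products.
def phi(x):
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

def k(x, z):
    return (x @ z) ** 2

x, z = np.array([1.0, 2.0]), np.array([-0.5, 3.0])
print(phi(x) @ phi(z), k(x, z))   # identical values (30.25)
```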
Example: non-linear support vector machine
• Example where the classes are separable, but not linearly
• A Gaussian kernel is used
• This corresponds to using an infinite number of non-linear features
[Figure: decision surface of the Gaussian-kernel SVM.]
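A small sketch of this situation using scikit-learn, with one class encircling the other:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Classes separable only non-linearly; an SVM with a Gaussian (RBF)
# kernel separates them, while a linear kernel cannot.
X, y = make_circles(n_samples=200, factor=0.4, noise=0.1, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf    = SVC(kernel="rbf", gamma=1.0).fit(X, y)
print("linear kernel accuracy:  ", linear.score(X, y))
print("Gaussian kernel accuracy:", rbf.score(X, y))
```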
Non-linear support vector machines
• We saw that to find the parameters of an SVM we need the data only in the form of inner products:
$$L_D = \sum_{i=1}^{N} \alpha_i - \tfrac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \, x_i^T x_j$$
• We can replace these directly by evaluations of the kernel function:
$$L_D = \sum_{i=1}^{N} \alpha_i - \tfrac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \, k(x_i, x_j)$$
• For the parameters we thus find
$$w = \sum_{n=1}^{N} \alpha_n y_n \phi(x_n)$$
• We compute the classification function for a new sample as
$$f(x) = w^T \phi(x) + w_0 = \sum_{n=1}^{N} \alpha_n y_n k(x_n, x) + w_0$$
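A minimal sketch of this kernelized decision function; the alphas, labels and w0 below are hypothetical placeholders, since in practice they come from maximizing the dual L_D:

```python
import numpy as np

# Kernelized SVM prediction: f(x) = sum_n alpha_n y_n k(x_n, x) + w0.
def gaussian_kernel(a, b, gamma=0.5):
    return np.exp(-gamma * np.sum((a - b) ** 2))

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
y_train = np.array([1, -1, 1])
alpha   = np.array([0.7, 1.2, 0.5])   # hypothetical dual variables
w0      = 0.1

def f(x):
    return sum(a * y * gaussian_kernel(xn, x)
               for a, y, xn in zip(alpha, y_train, X_train)) + w0

x_new = np.array([0.5, 0.5])
print("f(x) =", f(x_new), "-> class", np.sign(f(x_new)))
```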
Non-linear logistic discriminant
• A similar procedure can be applied for logistic discriminant
• Again we express w directly as a linear combination of the transformed training data:
$$w_c = \sum_{n=1}^{N} \alpha_{cn} \phi(x_n)$$
• We then optimize the log-likelihood of correct classification of the training samples as a function of the alpha variables
• For the multi-class version we obtain a linear function per class,
$$f_c(x) = w_c^T \phi(x) + w_{c0} = \sum_{n=1}^{N} \alpha_{cn} k(x_n, x) + w_{c0}$$
and class membership probability estimates given by the soft-max function:
$$p(y{=}c|x) = \frac{\exp(f_c(x))}{\sum_{c'} \exp(f_{c'}(x))}$$
The kernel function
• Starting from a representation in some feature space, we can find the corresponding kernel function as the `program’:
  1. Map x and z to the new feature space.
  2. Compute the inner product between the mapped points.
• A kernel is positive definite if the N×N matrix K containing all pairwise evaluations on N arbitrary points is positive definite, i.e. for non-zero w:
$$w^T K w = \sum_{i=1}^{N} \sum_{j=1}^{N} w_i w_j K_{ij} = \sum_{i=1}^{N} \sum_{j=1}^{N} w_i w_j \, k(x_i, x_j) > 0$$
• If a kernel computes inner products then it is positive (semi-)definite:
$$\sum_{i=1}^{N} \sum_{j=1}^{N} w_i w_j \, k(x_i, x_j) = \sum_{i=1}^{N} \sum_{j=1}^{N} w_i w_j \, \phi(x_i)^T \phi(x_j) = \left(\sum_{i=1}^{N} w_i \phi(x_i)\right)^{\!T} \left(\sum_{j=1}^{N} w_j \phi(x_j)\right) = z^T z \geq 0$$
• Mercer’s theorem: if k is positive definite, then there exists a feature space in which k computes the inner products.
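A quick empirical sketch of this property: eigendecompose the kernel matrix on a set of arbitrary points and check that no eigenvalue is negative (beyond rounding error):

```python
import numpy as np

# Check positive (semi-)definiteness of a kernel via its kernel matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))

def k(a, b):
    return (a @ b) ** 2   # the squared-inner-product kernel from before

K = np.array([[k(xi, xj) for xj in X] for xi in X])
eigvals = np.linalg.eigvalsh(K)
print("smallest eigenvalue:", eigvals.min())  # >= 0 up to rounding
```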
The “kernel trick” has many applications
• We used kernels to compute inner products to find non-linear SVMs and logistic discriminant models
• The main advantage of the “kernel trick” is that we can use feature spaces with a vast number of “derived” features without being confronted with the computational burden of explicitly mapping the data points to this space.
• The same trick can also be used for many other inner-product based methods for classification, regression, clustering, and dimension reduction:
  – k-means
  – Mixture of Gaussians
  – Principal component analysis
Image categorization with bags-of-words
• A range of kernel functions is used in categorization tasks
• Popular examples include
  – Chi-square RBF kernel over bag-of-words histograms
  – Pyramid-match kernel
  – … see the slides of Cordelia Schmid’s lecture 2 …
• Any interesting image similarity measure that can be shown to yield a positive definite kernel can be plugged into SVMs or logistic discriminant
• Kernels can be combined by summation, products, exponentiation, and many other ways
Summary of classification methods
• Generative probabilistic methods
  – Model the density of inputs x from each class: p(x|y)
    • Parametric, semi-parametric, and non-parametric models
  – Classification using Bayes’ rule
• Discriminative (probabilistic) methods
  – Directly estimate a decision function f(x) for classification
    • Logistic discriminant
    • Support vector machines
  – Linear methods are easily extended by using kernel functions
• Hybrid methods
  – Model the training input data with a probabilistic model
  – Use the probabilistic model to generate features for a discriminative model
  – Example: Fisher kernels combined with a linear classifier
Plan for this course
1) Introduction to machine learning
2) Clustering techniques: k-means, Gaussian mixture density
3) Gaussian mixture density continued: parameter estimation with EM
4) Classification techniques 1: introduction, generative methods, semi-supervised Fisher kernels
5) Classification techniques 2: discriminative methods, kernels
6) Decomposition of images: topic models
Topic models for image decomposition
• So far we mainly looked at clustering and classifying whole images
• What about localizing categories in an image, i.e. decomposing a scene into the different elements that constitute it?
• How can we decompose this beach scene?
[Figure: beach scene with regions labeled Sky, Water, Sand, Vegetation.]
Topic models for image decomposition
• How can we decompose this beach scene?
• We need to model the appearance of different categories
• It is difficult to use shape models
  – Some categories do not have a particular shape: water, sky, grass, etc.
  – Other categories occlude each other: buildings occluded by cars, trees, people
• Typically a relatively small number of categories appears per image
  – Several tens, not all the thousands of categories we might want to model
[Figure: beach scene with regions labeled Sky, Water, Sand, Vegetation.]
Topic Models for text documents
• Model each text as a mixture of different topics
  – A fixed set of topics is used (compare: categories in images)
  – The mix of topics is specific to each text (compare: mix of categories)
• Words are modelled by
  – selecting a topic given the document-specific mix, i.e. using p(topic | document)
  – selecting a word from the chosen topic, i.e. using p(word | topic)
• Words are no longer independent
  – Seeing some words, you can guess topics to predict other words
Topic Models for text documents
• Words are modelled by
  – selecting a topic given the document-specific mix using p(topic | document)
  – selecting a word from the chosen topic using p(word | topic)
• Given these distributions we can assign words to topics using p(topic | word, document)
Topic Models for text documents
• Words are no longer independent
  – Seeing some words, you can guess topics to predict other words
• Polysemous words can be disambiguated using the context
  – “Bank” can refer to a financial institution or to the side of a river
  – Seeing many financial terms in the text, we infer that the “finance” topic is strongly represented, and that “Bank” refers to the first meaning.
Topic Models for images
• Each image has its own specific mix of visual topics / categories
• All documents mix the same set of topics
• Image regions are extracted over a dense grid and at multiple scales
  – Each image region is assigned to a visual word
    • e.g. using k-means clustering of region descriptors such as SIFT
    • possibly also using color information, and the position of the region in the image (top, bottom, …)
• Example of image regions assigned to different topics
  – Clouds, Trees, Sky, Mountain
Probabilistic Latent Semantic Analysis (PLSA)
• Some notation:
  – w: (visual) words
  – t: topics
  – d: documents / images
• p(t|d): document-specific topic mix
• p(w|t): topic-specific distribution over words
• The distribution over words for a document, p(w|d), is a mixture of the topic distributions:
$$p(w|d) = \sum_{t=1}^{T} p(t|d)\, p(w|t)$$
• To sample a word:
  – Sample a topic from p(t|d)
  – Sample a word from that topic using p(w|t)
• We can now infer to which topic a word belongs using Bayes’ rule:
$$p(t|w,d) = \frac{p(w|t)\,p(t|d)}{p(w|d)}$$
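A minimal numerical sketch of these quantities for one document, with made-up probabilities for T=2 topics and W=4 words:

```python
import numpy as np

# theta[t] = p(t|d) and beta[t, w] = p(w|t), all values illustrative.
theta = np.array([0.7, 0.3])                      # p(t|d)
beta  = np.array([[0.5, 0.3, 0.1, 0.1],           # p(w|t=1)
                  [0.1, 0.1, 0.4, 0.4]])          # p(w|t=2)

p_w_given_d = theta @ beta                        # p(w|d) = sum_t p(t|d) p(w|t)
print("p(w|d):", p_w_given_d)

# Bayes' rule: p(t|w,d) = p(w|t) p(t|d) / p(w|d)
p_t_given_wd = (theta[:, None] * beta) / p_w_given_d
print("p(t|w,d):\n", p_t_given_wd)                # each column sums to 1
```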
Interpreting a new image with PLSA
• Suppose that we have learned the topic distributions p(word|topic)
• We want to assign image regions to topics
  – The topic mix p(topic|document) for the new image is an unknown parameter
• Find the topic mix p(t|d) using maximum likelihood with the EM algorithm
  – Associate a topic variable t_n with each visual word w_n
• Remember the EM algorithm
  – In the E-step we compute the posterior distribution over the hidden variables, in this case the distribution over topics for each visual word
  – In the M-step we maximize the expected joint log-likelihood of observed and hidden variables given the current posteriors, in this case the likelihood of the topic and word variables
• The log-likelihood, writing $\theta_{dt} = p(t|d)$ and $\beta_{tw} = p(w|t)$:
$$L = \sum_{n=1}^{N} \log p(w_n|d) = \sum_{n=1}^{N} \log \sum_{t=1}^{T} p(t_n{=}t|d)\, p(w_n|t) = \sum_{n=1}^{N} \log \sum_{t=1}^{T} \theta_{dt}\, \beta_{t w_n}$$
  – $\theta_{dt}$, the probability of topic t given the document, needs to be estimated for the new document
  – $\beta_{t w_n}$, the probability of word $w_n$ given topic t, is supposed to be known to us
Interpreting a new image with PLSA
• In the E-step we compute the posterior distribution over the hidden variables, in this case the distribution over topics for each visual word:
$$p(t_n{=}t \,|\, w_n, d) = \frac{p(w_n|t)\, p(t|d)}{p(w_n|d)} = \frac{\theta_{dt}\, \beta_{t w_n}}{\sum_{s=1}^{T} \theta_{ds}\, \beta_{s w_n}}$$
• The posterior is the same for two occurrences of the same word
  – We use $q_{wt}$ to denote the posterior probability of topic t for word w:
$$q_{wt} = \frac{\theta_{dt}\, \beta_{tw}}{\sum_{s=1}^{T} \theta_{ds}\, \beta_{sw}}$$
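A short sketch of this E-step for one document, with the same illustrative theta and beta as before:

```python
import numpy as np

# E-step: q[w, t] = p(t|w, d), from the current topic mix theta (p(t|d))
# and the known topic-word distributions beta (p(w|t)).
def e_step(theta, beta):
    joint = theta[:, None] * beta            # theta_dt * beta_tw, shape (T, W)
    q = joint / joint.sum(axis=0)            # normalize over topics
    return q.T                               # q[w, t] = p(t | w, d)

theta = np.array([0.7, 0.3])
beta  = np.array([[0.5, 0.3, 0.1, 0.1],
                  [0.1, 0.1, 0.4, 0.4]])
print(e_step(theta, beta))
```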
Interpreting a new image with PLSA
• In the M-step we maximize the expected joint log-likelihood of observed and hidden variables given the current posteriors:
$$\sum_{n=1}^{N} \sum_{t=1}^{T} p(t_n{=}t\,|\,w_n, d) \log\left[ p(t|d)\, p(w_n|t) \right]$$
• Using $n_{wd}$ to denote how often word w appears in the document, we rewrite this as
$$\sum_{w=1}^{W} \sum_{t=1}^{T} n_{wd}\, q_{wt} \left[ \log \theta_{dt} + \log \beta_{tw} \right]$$
• Maximizing this with respect to the topic mixing parameters we obtain
$$\theta_{dt} = \sum_{w=1}^{W} n_{wd}\, q_{wt} \Big/ \sum_{w=1}^{W} n_{wd}$$
  – the expected number of words assigned to topic t, divided by the total number of words in the document
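Putting the two steps together, a minimal sketch of PLSA inference for a new document (word counts and beta are made up; beta would come from training):

```python
import numpy as np

# Alternate E-step and M-step (theta update) with beta held fixed.
beta = np.array([[0.5, 0.3, 0.1, 0.1],
                 [0.1, 0.1, 0.4, 0.4]])      # p(w|t), assumed learned
n_wd = np.array([4.0, 2.0, 1.0, 3.0])        # word counts in the document

theta = np.full(2, 0.5)                      # initial uniform topic mix
for _ in range(50):
    joint = theta[:, None] * beta
    q = (joint / joint.sum(axis=0)).T        # E-step: q[w, t] = p(t|w,d)
    theta = (n_wd @ q) / n_wd.sum()          # M-step: update p(t|d)

print("estimated topic mix p(t|d):", theta)
```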
Interpreting a new image with PLSA
• Repeat the EM algorithm until convergence of p(t|d)
• After convergence, read off the word-to-topic assignments from p(t|w,d)
• Example images decomposed over 22 topics [Verbeek & Triggs, CVPR’07]
Learning the topics from labeled images
• As training data we have images labeled with the categories they contain
• It is not known which image regions belong to each category
• Learn the PLSA model, forcing each image to mix only the labeled topics
  – Example: image labeled as {building, grass, sky, tree}
  – Assignment of image regions to topics happens during learning
• EM algorithm
  – E-step:
    • Compute p(t|w,d) as before, but set the posterior to zero for topics not appearing in the labeling
  – M-step:
    • As before for the topic mix p(t|d) in each image; topics not in the labeling get weight zero
    • Estimate the distribution over words for each topic as
$$\beta_{tw} = \sum_{d=1}^{D} n_{wd}\, q_{wt} \Big/ \sum_{w'=1}^{W} \sum_{d=1}^{D} n_{w'd}\, q_{w't}$$
      – the expected number of times w is assigned to topic t, divided by the total number of words assigned to t
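A brief sketch of this topic-word M-step over a corpus, with synthetic counts and posteriors (here q is kept per document, q[d, w, t]):

```python
import numpy as np

# beta_tw = sum_d n_wd q_dwt / sum_w' sum_d n_w'd q_dw't.
# n (D x W) are word counts; q (D x W x T) are per-document posteriors.
rng = np.random.default_rng(0)
D, W, T = 3, 4, 2
n = rng.integers(0, 5, size=(D, W)).astype(float)
q = rng.random((D, W, T))
q /= q.sum(axis=2, keepdims=True)            # normalize posteriors over topics

counts = np.einsum('dw,dwt->tw', n, q)       # expected word-topic counts
beta = counts / counts.sum(axis=1, keepdims=True)
print("p(w|t):\n", beta)                     # each row sums to 1
```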
Interpreting new images with PLSA
• More example images taken from [Verbeek & Triggs, CVPR’07]
Plan for this course
1) Introduction to machine learning
2) Clustering techniques: k-means, Gaussian mixture density
3) Gaussian mixture density continued: parameter estimation with EM
4) Classification techniques 1: introduction, generative methods, semi-supervised
5) Classification techniques 2: discriminative methods, kernels
6) Decomposition of images: topic models, …

Next week: last course by Cordelia Schmid

Exam: 9h-12h, Friday January 29, 2010, room D207 (Ensimag)
• Not allowed to bring any material
• Prepare from slides + papers
• All material available on http://lear.inrialpes.fr/~verbeek/teaching