Discriminative classification methods, kernels, and topic models
Jakob Verbeek
January 8, 2010
Plan for this course
1) Introduction to machine learning
2) Clustering techniques: k-means, Gaussian mixture density
3) Gaussian mixture density continued: parameter estimation with EM
4) Classification techniques 1: introduction, generative methods, semi-supervised Fisher kernels
5) Classification techniques 2: discriminative methods, kernels
6) Decomposition of images: topic models
Generative vs Discriminative Classification
• Training data consists of “inputs”, denoted x, and corresponding output “class labels”, denoted y.
• Goal is to predict the class label y for a given test input x.
• Generative probabilistic methods
  – Model the density of inputs x from each class: p(x|y)
  – Estimate the class prior probability p(y)
  – Use Bayes’ rule to infer the distribution over classes given the input
• Discriminative (probabilistic) methods
  – Directly estimate the class probability given the input: p(y|x)
  – Some methods have no probabilistic interpretation, but fit a function f(x) and assign classes based on sign(f(x))
$$p(y|x) = \frac{p(x|y)\,p(y)}{p(x)}, \qquad p(x) = \sum_y p(y)\,p(x|y)$$
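As a minimal sketch of the generative route via Bayes’ rule, assuming Gaussian class-conditional densities with illustrative (not learned) parameters:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Class-conditional densities p(x|y) assumed Gaussian here; means,
# covariances and priors would normally be estimated from training data.
means  = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
covs   = [np.eye(2), np.eye(2)]
priors = [0.5, 0.5]

def posterior(x):
    # p(y|x) is proportional to p(x|y) p(y); normalize by p(x) = sum_y p(y) p(x|y)
    joint = np.array([multivariate_normal.pdf(x, m, C) * p
                      for m, C, p in zip(means, covs, priors)])
    return joint / joint.sum()

x_test = np.array([1.5, 1.0])
print(posterior(x_test), "-> predicted class:", posterior(x_test).argmax())
```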
Discriminative vs generative methods
• Generative probabilistic methods
  – Model the density of inputs x from each class: p(x|y)
  – Parametric models
    • Assume a specific form of p(x|y), e.g. Gaussian, Bernoulli, …
  – Semi-parametric models
    • Combine simple class models, e.g. in mixtures of Gaussians
  – Non-parametric models
    • Do not assume a specific form of p(x|y), e.g. k-nearest-neighbour, histograms, kernel density estimation
• Discriminative (probabilistic) methods
  – Directly estimate the class probability given the input: p(y|x)
  – Logistic discriminant
  – Support vector machines
Discriminant function
1. Choose a class of decision functions in feature space.
2. Estimate the function parameters from the training set.
3. Classify a new pattern on the basis of this decision rule.
• kNN example from last week: needs to store all data
• Separation using a smooth curve: only need to store the curve parameters
Linear functions in feature space
• Decision function is linear in the features:
$$f(x) = w^T x + w_0 = w_0 + \sum_{d=1}^{D} w_d x_d$$
• Classification based on the sign of f(x)
• Slope is determined by w
• Offset from origin is determined by w0
• Decision surface is a (D−1)-dimensional hyperplane orthogonal to w, given by
$$f(x) = w^T x + w_0 = 0$$
Dealing with more than two classes
• First idea: construct a classifier for all classes from several 2-class classifiers
  – E.g. separate one class vs all others, or separate all pairs of classes
  – Learn the 2-class “base” classifiers independently
• Not clear what should be done in some regions
  – 1 vs all: region claimed by several classes
  – Pair-wise: no agreement
Dealing with more than two classes
• Alternative: use a linear function for each class k
$$f_k(x) = w_k^T x + w_{k0}$$
• Assign a sample to the class whose function has the maximum value:
$$y = \arg\max_k f_k(x)$$
• Decision regions are convex in this case
  – If two points fall in the region, then so do all points on the connecting line
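A minimal sketch of this argmax rule, with illustrative weight vectors rather than learned ones:

```python
import numpy as np

# One linear function per class, predict by the maximum score.
W  = np.array([[ 1.0, 0.0],
               [-1.0, 0.0],
               [ 0.0, 1.0]])   # one weight vector w_k per class
w0 = np.array([0.0, 0.0, -0.5])

def predict(X):
    scores = X @ W.T + w0        # f_k(x) = w_k^T x + w_k0
    return scores.argmax(axis=1) # class with maximum linear score

X = np.array([[2.0, 0.1], [-1.0, 0.3], [0.1, 3.0]])
print(predict(X))  # -> [0 1 2]
```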
Logistic discrimination
• We go back to the concept of x as a stochastic vector and assume that the log-ratio of the class-conditional densities is linear in x:
$$\log \frac{p(x|y=1)}{p(x|y=2)} = v_0 + w^T x$$
• This is valid for a variety of parametric models, examples to come…
• In this case we can show that
$$p(y=2|x) = \frac{1}{1+\exp(w_0 + w^T x)}, \qquad p(y=1|x) = 1 - p(y=2|x) = \frac{\exp(w_0 + w^T x)}{1+\exp(w_0 + w^T x)}$$
where
$$w_0 = v_0 + \log\frac{p_1}{p_2}$$
with $p_1 = p(y=1)$ and $p_2 = p(y=2)$ the class priors.
• Thus the class boundary is obtained by setting the linear function in the exponent to zero.
Logistic discrimination: example
• Suppose each class-conditional density p(x|y) is Gaussian with the same covariance matrix C:
$$p(x|y=1) = \mathcal{N}(x; m_1, C) = (2\pi)^{-d/2}\,|C|^{-1/2} \exp\!\left(-\tfrac{1}{2}(x-m_1)^T C^{-1} (x-m_1)\right)$$
• In this case we have
$$\log \frac{p(y=1|x)}{p(y=2|x)} = \log p(x|y=1) - \log p(x|y=2) + \log p(y=1) - \log p(y=2)$$
The quadratic terms cancel:
$$-\tfrac{1}{2}(x-m_1)^T C^{-1}(x-m_1) + \tfrac{1}{2}(x-m_2)^T C^{-1}(x-m_2) = (m_1 - m_2)^T C^{-1} x - \tfrac{1}{2} m_1^T C^{-1} m_1 + \tfrac{1}{2} m_2^T C^{-1} m_2$$
so the log-ratio is a linear function $w_0 + w^T x$ with
$$w = C^{-1}(m_1 - m_2), \qquad w_0 = -\tfrac{1}{2} m_1^T C^{-1} m_1 + \tfrac{1}{2} m_2^T C^{-1} m_2 + \log\frac{p(y=1)}{p(y=2)}$$
• Thus the probability of the class given the data is a logistic discriminant model:
$$p(y=2|x) = \frac{1}{1+\exp(w_0 + w^T x)}$$
• Points with equal probability lie on a hyperplane orthogonal to w.
[Figure: the class-conditional densities P(data|class) and the resulting posteriors P(class|data).]
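A small sketch checking this equivalence numerically, with made-up means, shared covariance and priors:

```python
import numpy as np
from scipy.stats import multivariate_normal

# For two Gaussian classes with shared covariance C, the posterior is
# exactly a logistic function of w0 + w^T x. Parameter values are illustrative.
m1, m2 = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
C      = np.array([[1.0, 0.3], [0.3, 1.0]])
p1, p2 = 0.6, 0.4

Cinv = np.linalg.inv(C)
w  = Cinv @ (m1 - m2)
w0 = -0.5 * m1 @ Cinv @ m1 + 0.5 * m2 @ Cinv @ m2 + np.log(p1 / p2)

x = np.array([0.5, -0.2])
logistic = np.exp(w0 + w @ x) / (1 + np.exp(w0 + w @ x))   # p(y=1|x)
bayes = (multivariate_normal.pdf(x, m1, C) * p1 /
         (multivariate_normal.pdf(x, m1, C) * p1 +
          multivariate_normal.pdf(x, m2, C) * p2))
print(logistic, bayes)  # the two values agree
```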
Multi-class logistic discriminant
• Assume that the log-ratio of the data likelihoods is linear in x for every pair of classes:
$$\log \frac{p(x|y=i)}{p(x|y=j)} = v_{ij0} + v_{ij}^T x$$
• In that case we find that
$$p(y=i|x) = \frac{\exp(w_{i0} + w_i^T x)}{\sum_j \exp(w_{j0} + w_j^T x)}$$
• This is known as the “soft-max” over the linear functions corresponding to the classes
• For any given pair of classes, the two classes are equally likely on a hyperplane in feature space
Parameter estimation for logistic discriminant
• For the 2-class or multi-class model, we set the parameters using maximum likelihood
  – Maximize the (log-)likelihood of predicting the correct class label for the training data
  – Thus we sum over all training data the log-probability of the correct label:
$$L = \sum_{n=1}^{N} \log p(y_n | x_n)$$
• The derivative of the log-likelihood has an intuitive interpretation. Using the indicator function $[y_n = c]$ (1 if true, else 0):
$$\frac{\partial L}{\partial w_{c0}} = \sum_n \left([y_n = c] - p(y{=}c|x_n)\right) = 0$$
The expected number of points from each class should equal the actual number.
$$\nabla_{w_c} L = \sum_n \left([y_n = c] - p(y{=}c|x_n)\right) x_n = 0$$
The expected value of each feature, weighting points by p(y|x), should equal its empirical expectation.
• No closed-form solution; use a gradient-based method to find the optimal parameters
  – Note 1: the log-likelihood is concave in w, so there are no local optima
  – Note 2: w is a linear combination of the data points
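A minimal sketch of this maximum-likelihood fit by gradient ascent on synthetic data (all names and constants below are illustrative):

```python
import numpy as np

# Multi-class logistic discriminant trained by gradient ascent.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 1.0, (50, 2)) for m in ([0, 0], [3, 0], [0, 3])])
y = np.repeat([0, 1, 2], 50)
N, D, K = X.shape[0], X.shape[1], 3

W, w0 = np.zeros((K, D)), np.zeros(K)
Y = np.eye(K)[y]                                   # one-hot targets: [y_n = c]

for it in range(200):
    scores = X @ W.T + w0
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    P = np.exp(scores)
    P /= P.sum(axis=1, keepdims=True)              # soft-max p(y=c|x_n)
    G = Y - P                                      # [y_n=c] - p(y=c|x_n)
    W  += 0.01 * G.T @ X                           # gradient ascent step
    w0 += 0.01 * G.sum(axis=0)

print("training accuracy:", (P.argmax(axis=1) == y).mean())
```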
Motivation for support vector machines
• Suppose the classes are linearly separable
  – There is a continuum of classifiers with zero error on the training data
• Generalization is not good in this case, but it is better if a margin is introduced
• Using statistical learning theory, larger margins can be formally shown to yield better generalization performance.
[Figures: separable data in feature space with axes x1 (roundness) and x2 (color); several zero-error separating lines vs a separator with a margin of width b/||w||.]
Support vector machines
• Classification method for 2-class problems, with class labels $y_n \in \{-1, +1\}$
• Suppose that the classes are linearly separable
• We seek a linear classification function that separates the training data, i.e.
$$y_n (w^T x_n + w_0) > 0$$
• To obtain a margin we require
$$y_n (w^T x_n + w_0) \geq b$$
• Since this requirement is invariant to scaling, we set b = 1
  – Otherwise a large b could be compensated for by multiplying w and w0 by a large number
• We get two hyperplanes:
$$H_1: \; w^T x + w_0 = +1, \qquad H_2: \; w^T x + w_0 = -1$$
• The size of the margin is $1/\sqrt{w^T w}$
• Points on the margin are called support vectors
Support vector machines: objective
• Our objective is to find the classification function with maximum margin
• Maximizing the margin means minimizing
$$L = \tfrac{1}{2}\, w^T w$$
• But subject to the constraints that the function separates the training data:
$$y_n (w^T x_n + w_0) \geq 1$$
• Using the technique of Lagrange multipliers for constrained optimization, we can solve this problem by minimizing the following w.r.t. w and w0, and maximizing w.r.t. the alpha parameters:
$$L_P = \tfrac{1}{2}\, w^T w - \sum_{n=1}^{N} \alpha_n \left[ y_n (w^T x_n + w_0) - 1 \right]$$
• This is a quadratic function in w and w0, and the optimal values are found by setting the derivatives to zero.
Support vector machines: optimization
• Setting the derivatives to zero we find
$$w = \sum_{i=1}^{N} \alpha_i y_i x_i, \qquad \sum_{i=1}^{N} \alpha_i y_i = 0$$
  – Note: w is a linear combination of the training data points
• If we substitute this into the Lagrangian function, we get the dual form
$$L_D = \sum_{i=1}^{N} \alpha_i - \tfrac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \, x_i^T x_j$$
which has to be maximized w.r.t. the alpha variables, subject to the constraints
$$\alpha_i \geq 0, \qquad \sum_{i=1}^{N} \alpha_i y_i = 0$$
  – Note: the training data appear only in the form of inner products here
• The dual function is quadratic in the alphas subject to linear constraints, and can be solved using standard quadratic solvers
• Only the data points precisely on the margin get a non-zero alpha
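A brief sketch using scikit-learn’s quadratic solver on toy data; in sklearn’s SVC, `dual_coef_` stores the products $\alpha_n y_n$ for the support vectors, so w can be recovered as their weighted sum:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Fit a (nearly) hard-margin linear SVM and inspect the dual solution.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=0.8, random_state=0)
svm = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C ~ hard margin

print("number of support vectors:", len(svm.support_))
w = svm.dual_coef_ @ svm.support_vectors_      # w = sum_n alpha_n y_n x_n
print("w:", w.ravel(), " w0:", svm.intercept_)
print("matches coef_:", np.allclose(w, svm.coef_))
```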
SVMs: non-separable classes
• Before, we assumed that the classes were separable
  – and found the w that separated the classes with maximum margin
• What if the classes are non-separable?
• Allow points to violate the margin by an amount given by non-negative “slack variables” $\xi_n$:
$$y_n (w^T x_n + w_0) \geq 1 - \xi_n$$
• Replace the infeasible class-separation constraint with a penalty for large slacks
• New objective function:
$$L = \tfrac{1}{2}\, w^T w + C \sum_i \xi_i$$
• Support vectors are either on the margin, or on the wrong side of the margin.
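A short sketch of the effect of the slack penalty C, using overlapping toy classes; smaller C tolerates more margin violations, so more points become support vectors:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# C trades margin width against slack on non-separable data.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=1)
for C in (0.01, 1.0, 100.0):
    svm = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: {len(svm.support_)} support vectors")
```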
Summary: linear discriminant analysis
• Two most widely used linear classifiers in practice:
  – Logistic discriminant (supports more than 2 classes directly)
  – Support vector machines (multi-class extensions recently developed)
• In both cases the weight vector w is a linear combination of the data points:
$$w = \sum_{i=1}^{N} \alpha_i x_i$$
• This means that we only need the inner products between data points to calculate the linear functions:
$$f(x) = w^T x + w_0 = \sum_{n=1}^{N} \alpha_n x_n^T x + w_0 = \sum_{n=1}^{N} \alpha_n k(x_n, x) + w_0$$
  – The function k that computes the inner products, $k(x_n, x) = x_n^T x$, is called a kernel function
Plan for this course
1) Introduction to machine learning
2) Clustering techniques: k-means, Gaussian mixture density
3) Gaussian mixture density continued: parameter estimation with EM
4) Classification techniques 1: introduction, generative methods, semi-supervised Fisher kernels
5) Classification techniques 2: discriminative methods, kernel functions
6) Decomposition of images: topic models
Motivation for generalized linear functions
• Linear classifiers are attractive because of their simplicity
  – Computationally efficient
  – SVM and logistic discriminant do not suffer from an objective function with multiple local optima
• The fact that only linear functions can be implemented limits their use to discriminating (approximately) linearly separable classes.
• Two possible extensions:
  1) Use a richer class of functions, not only linear ones
     – In this case optimization is often much more difficult
  2) Extend or replace the original set of features with non-linear functions of the original features
[Figure: non-separable data in feature space with axes x1 (roundness) and x2 (color), with a linear decision boundary w^T x + w_0 = 0.]
Generalized linear functions
• The data vector x is replaced by a vector $\phi(x) = (\phi_1(x), \ldots, \phi_m(x))^T$ of m fixed functions of x
• This can turn the problem into a linearly separable one
Example of an alternative feature space
• Suppose that one class `encapsulates’ the other class:
• A linear classifier does not work very well…
• Let’s map our features to a new space, spanned by
$$\phi(x) = (x_1^2, \; x_2^2, \; \sqrt{2}\, x_1 x_2)^T$$
• A circle around the origin with radius r in the original space,
$$(x_1)^2 + (x_2)^2 = r^2,$$
is now described by a plane in the new space.
• We can use the classification function
$$f(x) = w^T \phi(x) + w_0$$
with $w^T = (1, 1, 0)$ and $w_0 = -r^2$.
The kernel function of our example
• Let’s compute the inner product in the new feature space we defined:
$$\phi(x)^T \phi(z) = x_1^2 z_1^2 + x_2^2 z_2^2 + 2\, x_1 x_2 z_1 z_2 = (x^T z)^2 = k(x, z)$$
• Thus, we simply square the standard inner product, and do not need to explicitly map the points to the new feature space!
• This becomes useful if the new feature space has many features.
• The kernel function is a shortcut to compute inner products in feature space, without explicitly mapping the data to that space.
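A tiny numerical check of this identity, with arbitrary example vectors:

```python
import numpy as np

# Verify that the explicit map phi(x) = (x1^2, x2^2, sqrt(2) x1 x2)
# and the kernel k(x,z) = (x^T z)^2 give the same inner products.
def phi(x):
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

def k(x, z):
    return (x @ z) ** 2

x, z = np.array([1.0, 2.0]), np.array([-0.5, 3.0])
print(phi(x) @ phi(z), k(x, z))   # identical values (30.25)
```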
Example: non-linear support vector machine
• Example where the classes are separable, but not linearly
• A Gaussian kernel is used
• This corresponds to using an infinite number of non-linear features
[Figure: decision surface of the Gaussian-kernel SVM.]
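A small sketch of this situation using scikit-learn, with one class encircling the other:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Classes separable only non-linearly; an SVM with a Gaussian (RBF)
# kernel separates them, while a linear kernel cannot.
X, y = make_circles(n_samples=200, factor=0.4, noise=0.1, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf    = SVC(kernel="rbf", gamma=1.0).fit(X, y)
print("linear kernel accuracy:  ", linear.score(X, y))
print("Gaussian kernel accuracy:", rbf.score(X, y))
```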
Non-linear support vector machines
• We saw that to find the parameters of an SVM we need the data only in the form of inner products:
$$L_D = \sum_{i=1}^{N} \alpha_i - \tfrac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \, x_i^T x_j$$
• We can replace these directly by evaluations of the kernel function:
$$L_D = \sum_{i=1}^{N} \alpha_i - \tfrac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \, k(x_i, x_j)$$
• For the parameters we thus find
$$w = \sum_{n=1}^{N} \alpha_n y_n \phi(x_n)$$
• We compute the classification function for a new sample as
$$f(x) = w^T \phi(x) + w_0 = \sum_{n=1}^{N} \alpha_n y_n k(x_n, x) + w_0$$
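A minimal sketch of this kernelized decision function; the alphas, labels and w0 below are hypothetical placeholders, since in practice they come from maximizing the dual L_D:

```python
import numpy as np

# Kernelized SVM prediction: f(x) = sum_n alpha_n y_n k(x_n, x) + w0.
def gaussian_kernel(a, b, gamma=0.5):
    return np.exp(-gamma * np.sum((a - b) ** 2))

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
y_train = np.array([1, -1, 1])
alpha   = np.array([0.7, 1.2, 0.5])   # hypothetical dual variables
w0      = 0.1

def f(x):
    return sum(a * y * gaussian_kernel(xn, x)
               for a, y, xn in zip(alpha, y_train, X_train)) + w0

x_new = np.array([0.5, 0.5])
print("f(x) =", f(x_new), "-> class", np.sign(f(x_new)))
```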
Non-linear logistic discriminant
• A similar procedure can be applied for logistic discriminant
• Again we express w directly as a linear combination of the transformed training data:
$$w_c = \sum_{n=1}^{N} \alpha_{cn} \phi(x_n)$$
• We then optimize the log-likelihood of correct classification of the training samples as a function of the alpha variables
• For the multi-class version we obtain a linear function per class,
$$f_c(x) = w_c^T \phi(x) + w_{c0} = \sum_{n=1}^{N} \alpha_{cn} k(x_n, x) + w_{c0}$$
and class membership probability estimates given by the soft-max function:
$$p(y{=}c|x) = \frac{\exp(f_c(x))}{\sum_{c'} \exp(f_{c'}(x))}$$
The kernel function
• Starting from a representation in some feature space, we can find the corresponding kernel function as the `program’:
  1. Map x and z to the new feature space.
  2. Compute the inner product between the mapped points.
• A kernel is positive definite if the N×N matrix K containing all pairwise evaluations on N arbitrary points is positive definite, i.e. for non-zero w:
$$w^T K w = \sum_{i=1}^{N} \sum_{j=1}^{N} w_i w_j K_{ij} = \sum_{i=1}^{N} \sum_{j=1}^{N} w_i w_j \, k(x_i, x_j) > 0$$
• If a kernel computes inner products then it is positive (semi-)definite:
$$\sum_{i=1}^{N} \sum_{j=1}^{N} w_i w_j \, k(x_i, x_j) = \sum_{i=1}^{N} \sum_{j=1}^{N} w_i w_j \, \phi(x_i)^T \phi(x_j) = \left(\sum_{i=1}^{N} w_i \phi(x_i)\right)^{\!T} \left(\sum_{j=1}^{N} w_j \phi(x_j)\right) = z^T z \geq 0$$
• Mercer’s theorem: if k is positive definite, then there exists a feature space in which k computes the inner products.
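A quick empirical sketch of this property: eigendecompose the kernel matrix on a set of arbitrary points and check that no eigenvalue is negative (beyond rounding error):

```python
import numpy as np

# Check positive (semi-)definiteness of a kernel via its kernel matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))

def k(a, b):
    return (a @ b) ** 2   # the squared-inner-product kernel from before

K = np.array([[k(xi, xj) for xj in X] for xi in X])
eigvals = np.linalg.eigvalsh(K)
print("smallest eigenvalue:", eigvals.min())  # >= 0 up to rounding
```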
The “kernel trick” has many applications
• We used kernels to compute inner products to find non-linear SVMs and logistic discriminant models
• The main advantage of the “kernel trick” is that we can use feature spaces with a vast number of “derived” features without being confronted with the computational burden of explicitly mapping the data points to this space.
• The same trick can also be used for many other inner-product based methods for classification, regression, clustering, and dimension reduction:
  – k-means
  – Mixture of Gaussians
  – Principal component analysis
Image categorization with bags-of-words
• A range of kernel functions is used in categorization tasks
• Popular examples include
  – Chi-square RBF kernel over bag-of-words histograms
  – Pyramid-match kernel
  – … see the slides of Cordelia Schmid’s lecture 2 …
• Any interesting image similarity measure that can be shown to yield a positive definite kernel can be plugged into SVMs or logistic discriminant
• Kernels can be combined by summation, products, exponentiation, and many other ways
Summary of classification methods
• Generative probabilistic methods
  – Model the density of inputs x from each class: p(x|y)
    • Parametric, semi-parametric, and non-parametric models
  – Classification using Bayes’ rule
• Discriminative (probabilistic) methods
  – Directly estimate a decision function f(x) for classification
    • Logistic discriminant
    • Support vector machines
  – Linear methods are easily extended by using kernel functions
• Hybrid methods
  – Model the training input data with a probabilistic model
  – Use the probabilistic model to generate features for a discriminative model
  – Example: Fisher kernels combined with a linear classifier
Plan for this course
1) Introduction to machine learning
2) Clustering techniques: k-means, Gaussian mixture density
3) Gaussian mixture density continued: parameter estimation with EM
4) Classification techniques 1: introduction, generative methods, semi-supervised Fisher kernels
5) Classification techniques 2: discriminative methods, kernels
6) Decomposition of images: topic models
Topic models for image decomposition
• So far we mainly looked at clustering and classifying whole images
• What about localizing categories in an image, i.e. decomposing a scene into the different elements that constitute it?
• How can we decompose this beach scene?
[Figure: beach scene with regions labeled Sky, Water, Sand, Vegetation.]
Topic models for image decomposition
• How can we decompose this beach scene?
• We need to model the appearance of different categories
• It is difficult to use shape models
  – Some categories do not have a particular shape: water, sky, grass, etc.
  – Other categories occlude each other: buildings occluded by cars, trees, people
• Typically a relatively small number of categories appears per image
  – Several tens, not all the thousands of categories we might want to model
[Figure: beach scene with regions labeled Sky, Water, Sand, Vegetation.]
Topic Models for text documents
• Model each text as a mixture of different topics
  – A fixed set of topics is used (compare: categories in images)
  – The mix of topics is specific to each text (compare: mix of categories)
• Words are modelled by
  – selecting a topic given the document-specific mix, i.e. using p(topic | document)
  – selecting a word from the chosen topic, i.e. using p(word | topic)
• Words are no longer independent
  – Seeing some words, you can guess topics to predict other words
Topic Models for text documents
• Words are modelled by
  – selecting a topic given the document-specific mix using p(topic | document)
  – selecting a word from the chosen topic using p(word | topic)
• Given these distributions we can assign words to topics using p(topic | word, document)
Topic Models for text documents
• Words are no longer independent
  – Seeing some words, you can guess topics to predict other words
• Polysemous words can be disambiguated using the context
  – “Bank” can refer to a financial institution or to the side of a river
  – Seeing many financial terms in the text, we infer that the “finance” topic is strongly represented, and that “Bank” refers to the first meaning.
Topic Models for images
• Each image has its own specific mix of visual topics / categories
• All documents mix the same set of topics
• Image regions are extracted over a dense grid and at multiple scales
  – Each image region is assigned to a visual word
    • e.g. using k-means clustering of region descriptors such as SIFT
    • possibly also using color information, and the position of the region in the image (top, bottom, …)
• Example of image regions assigned to different topics
  – Clouds, Trees, Sky, Mountain
Probabilistic Latent Semantic Analysis (PLSA)
• Some notation:
  – w: (visual) words
  – t: topics
  – d: documents / images
• p(t|d): document-specific topic mix
• p(w|t): topic-specific distribution over words
• The distribution over words for a document, p(w|d), is a mixture of the topic distributions:
$$p(w|d) = \sum_{t=1}^{T} p(t|d)\, p(w|t)$$
• To sample a word:
  – Sample a topic from p(t|d)
  – Sample a word from that topic using p(w|t)
• We can now infer to which topic a word belongs using Bayes’ rule:
$$p(t|w,d) = \frac{p(w|t)\,p(t|d)}{p(w|d)}$$
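A minimal numerical sketch of these quantities for one document, with made-up probabilities for T=2 topics and W=4 words:

```python
import numpy as np

# theta[t] = p(t|d) and beta[t, w] = p(w|t), all values illustrative.
theta = np.array([0.7, 0.3])                      # p(t|d)
beta  = np.array([[0.5, 0.3, 0.1, 0.1],           # p(w|t=1)
                  [0.1, 0.1, 0.4, 0.4]])          # p(w|t=2)

p_w_given_d = theta @ beta                        # p(w|d) = sum_t p(t|d) p(w|t)
print("p(w|d):", p_w_given_d)

# Bayes' rule: p(t|w,d) = p(w|t) p(t|d) / p(w|d)
p_t_given_wd = (theta[:, None] * beta) / p_w_given_d
print("p(t|w,d):\n", p_t_given_wd)                # each column sums to 1
```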
Interpreting a new image with PLSA
• Suppose that we have learned the topic distributions p(word|topic)
• We want to assign image regions to topics
  – The topic mix p(topic|document) for the new image is an unknown parameter
• Find the topic mix p(t|d) using maximum likelihood with the EM algorithm
  – Associate a topic variable t_n with each visual word w_n
• Remember the EM algorithm
  – In the E-step we compute the posterior distribution over the hidden variables, in this case the distribution over topics for each visual word
  – In the M-step we maximize the expected joint log-likelihood of observed and hidden variables given the current posteriors, in this case the likelihood of the topic and word variables
• The log-likelihood, writing $\theta_{dt} = p(t|d)$ and $\beta_{tw} = p(w|t)$:
$$L = \sum_{n=1}^{N} \log p(w_n|d) = \sum_{n=1}^{N} \log \sum_{t=1}^{T} p(t_n{=}t|d)\, p(w_n|t) = \sum_{n=1}^{N} \log \sum_{t=1}^{T} \theta_{dt}\, \beta_{t w_n}$$
  – $\theta_{dt}$, the probability of topic t given the document, needs to be estimated for the new document
  – $\beta_{t w_n}$, the probability of word $w_n$ given topic t, is supposed to be known to us
Interpreting a new image with PLSA
• In the E-step we compute the posterior distribution over the hidden variables, in this case the distribution over topics for each visual word:
$$p(t_n{=}t \,|\, w_n, d) = \frac{p(w_n|t)\, p(t|d)}{p(w_n|d)} = \frac{\theta_{dt}\, \beta_{t w_n}}{\sum_{s=1}^{T} \theta_{ds}\, \beta_{s w_n}}$$
• The posterior is the same for two occurrences of the same word
  – We use $q_{wt}$ to denote the posterior probability of topic t for word w:
$$q_{wt} = \frac{\theta_{dt}\, \beta_{tw}}{\sum_{s=1}^{T} \theta_{ds}\, \beta_{sw}}$$
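A short sketch of this E-step for one document, with the same illustrative theta and beta as before:

```python
import numpy as np

# E-step: q[w, t] = p(t|w, d), from the current topic mix theta (p(t|d))
# and the known topic-word distributions beta (p(w|t)).
def e_step(theta, beta):
    joint = theta[:, None] * beta            # theta_dt * beta_tw, shape (T, W)
    q = joint / joint.sum(axis=0)            # normalize over topics
    return q.T                               # q[w, t] = p(t | w, d)

theta = np.array([0.7, 0.3])
beta  = np.array([[0.5, 0.3, 0.1, 0.1],
                  [0.1, 0.1, 0.4, 0.4]])
print(e_step(theta, beta))
```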
Interpreting a new image with PLSA
• In the M-step we maximize the expected joint log-likelihood of observed and hidden variables given the current posteriors:
$$\sum_{n=1}^{N} \sum_{t=1}^{T} p(t_n{=}t\,|\,w_n, d) \log\left[ p(t|d)\, p(w_n|t) \right]$$
• Using $n_{wd}$ to denote how often word w appears in the document, we rewrite this as
$$\sum_{w=1}^{W} \sum_{t=1}^{T} n_{wd}\, q_{wt} \left[ \log \theta_{dt} + \log \beta_{tw} \right]$$
• Maximizing this with respect to the topic mixing parameters we obtain
$$\theta_{dt} = \sum_{w=1}^{W} n_{wd}\, q_{wt} \Big/ \sum_{w=1}^{W} n_{wd}$$
  – the expected number of words assigned to topic t, divided by the total number of words in the document
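Putting the two steps together, a minimal sketch of PLSA inference for a new document (word counts and beta are made up; beta would come from training):

```python
import numpy as np

# Alternate E-step and M-step (theta update) with beta held fixed.
beta = np.array([[0.5, 0.3, 0.1, 0.1],
                 [0.1, 0.1, 0.4, 0.4]])      # p(w|t), assumed learned
n_wd = np.array([4.0, 2.0, 1.0, 3.0])        # word counts in the document

theta = np.full(2, 0.5)                      # initial uniform topic mix
for _ in range(50):
    joint = theta[:, None] * beta
    q = (joint / joint.sum(axis=0)).T        # E-step: q[w, t] = p(t|w,d)
    theta = (n_wd @ q) / n_wd.sum()          # M-step: update p(t|d)

print("estimated topic mix p(t|d):", theta)
```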
Interpreting a new image with PLSA
• Repeat the EM algorithm until convergence of p(t|d)
• After convergence, read off the word-to-topic assignments from p(t|w,d)
• Example images decomposed over 22 topics [Verbeek & Triggs, CVPR’07]
Learning the topics from labeled images
• As training data we have images labeled with the categories they contain
• It is not known which image regions belong to each category
• Learn the PLSA model, forcing each image to mix only the labeled topics
  – Example: image labeled as {building, grass, sky, tree}
  – Assignment of image regions to topics happens during learning
• EM algorithm
  – E-step:
    • Compute p(t|w,d) as before, but set the posterior to zero for topics not appearing in the labeling
  – M-step:
    • As before for the topic mix p(t|d) in each image; topics not in the labeling get weight zero
    • Estimate the distribution over words for each topic as
$$\beta_{tw} = \sum_{d=1}^{D} n_{wd}\, q_{wt} \Big/ \sum_{w'=1}^{W} \sum_{d=1}^{D} n_{w'd}\, q_{w't}$$
      – the expected number of times w is assigned to topic t, divided by the total number of words assigned to t
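A brief sketch of this topic-word M-step over a corpus, with synthetic counts and posteriors (here q is kept per document, q[d, w, t]):

```python
import numpy as np

# beta_tw = sum_d n_wd q_dwt / sum_w' sum_d n_w'd q_dw't.
# n (D x W) are word counts; q (D x W x T) are per-document posteriors.
rng = np.random.default_rng(0)
D, W, T = 3, 4, 2
n = rng.integers(0, 5, size=(D, W)).astype(float)
q = rng.random((D, W, T))
q /= q.sum(axis=2, keepdims=True)            # normalize posteriors over topics

counts = np.einsum('dw,dwt->tw', n, q)       # expected word-topic counts
beta = counts / counts.sum(axis=1, keepdims=True)
print("p(w|t):\n", beta)                     # each row sums to 1
```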
Interpreting new images with PLSA
• More example images taken from [Verbeek & Triggs, CVPR’07]
Plan for this course
1) Introduction to machine learning
2) Clustering techniques: k-means, Gaussian mixture density
3) Gaussian mixture density continued: parameter estimation with EM
4) Classification techniques 1: introduction, generative methods, semi-supervised
5) Classification techniques 2: discriminative methods, kernels
6) Decomposition of images: topic models, …

Next week: last course by Cordelia Schmid

Exam: 9h-12h, Friday January 29, 2010, room D207 (Ensimag)
• Not allowed to bring any material
• Prepare from slides + papers
• All material available on http://lear.inrialpes.fr/~verbeek/teaching