

Harmonium Models for Video Classification

Jun Yang1, Rong Yan2, Yan Liu2 and Eric P. Xing1

1 School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA

2 IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, USA

Received 16 July 2007; accepted 7 August 2007. DOI:10.1002/sam.103

Published online 18 January 2008 in Wiley InterScience (www.interscience.wiley.com).

Abstract: Accurate and efficient video classification demands the fusion of multimodal information and the use of intermediate representations. Combining the two ideas into one framework, we propose a series of probabilistic models for video representation and classification using intermediate semantic representations derived from multimodal features of video. On the basis of a class of bipartite undirected graphical models named harmoniums, we propose the dual-wing harmonium (DWH) model, which represents video shots as latent semantic topics derived by jointly modeling the transcript keywords and color-histogram features of the data. Our family-of-harmonium (FoH) and hierarchical harmonium (HH) models extend DWH by introducing variables representing the category labels of data, which allows data representation and classification to be performed in the same model. Our models are among the few attempts to use undirected graphical models for representing and classifying video data. Experiments on a benchmark video collection show different semantic interpretations of video data under our models, as well as superior classification performance in comparison with several directed models. © 2008 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 1: 23–37, 2008

Keywords: undirected graphical model; harmonium; video classification; latent semantic topic

1. INTRODUCTION

Classifying video data into semantic categories, sometimes known as semantic video concept detection, is an important research topic. This task is challenging because video data contain multiple data types, including video frames as images, transcript text, speech, and audio, each bearing correlated and complementary information essential to conveying data semantics. The fusion of such multimodal information is regarded as a key research problem [1], and has been a widely used technique in video classification and retrieval methods. Many fusion strategies have been proposed, varying from early fusion [2], which merges the feature vectors extracted from different modalities, to late fusion, which combines the outputs of the classifiers or 'retrieval experts' built on each single modality [2–5]. Empirical results show that methods based on the fusion of multimodal information outperform those based on any single type of information in both classification and retrieval tasks.

Another trend in video classification is the search for low-dimensional, intermediate representations of video data to replace high-dimensional raw features such as color histograms and term vectors. The reason is that intermediate representations make sophisticated classifiers such as support vector machines [6] computationally efficient, whereas applying them to raw features would be more expensive. Moreover, using intermediate representations holds the promise of better interpretation of data semantics, and may lead to superior classification performance. Related work along this direction ranges from conventional dimension-reduction methods such as principal component analysis (PCA) and Fisher linear discriminant (FLD) [7] to the more recent probabilistic 'topic models' such as probabilistic latent semantic indexing (pLSI) [8], latent Dirichlet allocation (LDA) [9], and the exponential-family harmonium (EFH) [10]. While most of these models were initially developed only for single-modal data such as textual documents, extensions of them [11] have been studied recently in order to model data with multiple types of inputs (a.k.a. multimodal data) such as captioned images and video.

Correspondence to: Eric P. Xing ([email protected])

The key insights for video classification from previous work are to combine multimodal information and to use intermediate representations. The goal of this paper is to propose a series of probabilistic models for representing and classifying video data that take advantage of both insights. Our models are based on a class of bipartite, undirected graphical models (i.e. random fields) called harmoniums [10]. The first model, dual-wing harmonium (DWH), derives an intermediate representation as a set of latent semantic topics of video shots by jointly modeling the correlated information in the image regions and transcript keywords associated with the video shots. The derived latent topics are then used as semantic features for classifying video shots. The other two models, family-of-harmonium (FoH) and hierarchical harmonium (HH), extend DWH by explicitly incorporating the category label(s) of data into the model, which allows the classification and representation to be accomplished in a unified framework. Specifically, FoH consists of a set of category-specific DWH models, each modeling the video data from one specific category, and it assigns a video shot to the category with the highest probability. In contrast, HH introduces category labels as another layer of hidden variables in a DWH model, and performs classification through the inference of these label variables.

The proposed models differ from existing models for text/multimedia data in several important aspects. First, our models are among the first few undirected topic models for bimodal or multimodal data such as video, as most of the existing models are directed and were mainly proposed for single-modal data such as text documents. Besides providing an important alternative for modeling video data, our models offer unique properties not supported by their directed counterparts, among which is fast inference due to the conditional independence between latent variables. Furthermore, two of our models, FoH and HH, incorporate category labels as (hidden) model variables, which allows us to classify unlabeled data by directly inferring the distribution of the label variables. In comparison, most existing models [8–11] can only be used for deriving an intermediate data representation as latent semantic topics, and one has to build separate classifiers on top of the derived representation if classification is to be performed. Therefore, another advantage of our approach lies in the unification of representation and classification in the same model, which avoids the overhead of building separate classifiers. More importantly, by considering the interactions between latent semantic topics and category labels, our models may be able to learn better intermediate representations that reflect the category information in the data. Such 'supervised' intermediate representations are expected to provide more discriminative power and insight into the data than the 'unsupervised' representations generated by existing methods [8–11].

The notation used in the paper follows the convention of probabilistic graphical models. Uppercase characters represent random variables, and lowercase characters represent the instances (values) of the random variables. Bold font is used to indicate a vector of random variables or their values. In all the illustrations, shaded circles represent observed nodes while empty circles represent hidden (latent) nodes. Each node in a graphical model is associated with a random variable, so we use the terms node and variable interchangeably in this paper.

In Section 2 we review related work on the fusion of multimodal video features as well as representation models for text and multimedia data. A brief introduction to harmoniums is presented in Section 3. We propose the three harmonium models for video data in Section 4, and discuss their learning algorithms in Section 5. In Section 6, we show the experimental results and illustrate interesting interpretations of the data from the TRECVID video collection. Conclusions and future work are discussed in Section 7.

2. RELATED WORK

As pointed out in [1], the processing, indexing, and fusion of data in multiple modalities is a core problem of multimedia research. For video classification and retrieval, the fusion of features from multiple data types (e.g. keyframes, audio, transcript) allows them to complement each other and achieve better performance than using any single type of feature. This idea has been widely used in many existing methods. The fusion strategies vary from early fusion [2], which merges the feature vectors extracted from different data modalities, to late fusion, which combines the outputs of classifiers or 'retrieval experts' built on each single modality [2–5]. It remains an open question which fusion strategy is more appropriate for a given task; a comparison of the two strategies in video classification is presented in [2]. The approach presented in this paper takes neither route; instead, it derives a latent semantic representation of the video data by jointly modeling the multimodal low-level features, so that the fusion takes place somewhere between early fusion and late fusion.
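The two fusion strategies contrasted above can be sketched in a few lines. This is a minimal illustration only, not the specific methods of [2–5]: the feature dimensions, weights, and scores are hypothetical.

```python
import numpy as np

def early_fusion(text_feat, color_feat):
    """Early fusion: concatenate the per-modality feature vectors into one
    joint vector before any classifier sees them."""
    return np.concatenate([text_feat, color_feat])

def late_fusion(scores, weights=None):
    """Late fusion: combine the outputs (e.g. posterior scores) of
    classifiers trained separately on each modality."""
    scores = np.asarray(scores, dtype=float)
    if weights is None:                      # default: simple averaging
        weights = np.full(len(scores), 1.0 / len(scores))
    return float(np.dot(weights, scores))

# Hypothetical 3-dim text and 2-dim color features for one shot.
x = np.array([1.0, 0.0, 1.0])
z = np.array([0.2, 0.8])
fused = early_fusion(x, z)           # 5-dim joint feature vector
score = late_fusion([0.9, 0.7])      # average of two per-modality scores
```

The harmonium models of this paper sit between these two extremes: the modalities are fused through a shared latent layer rather than by concatenation or score averaging.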

There are many approaches to obtaining low-dimensional intermediate representations of video data. PCA has been the most popular method; it projects the raw features into a lower-dimensional feature space where the data variances are well preserved. Independent component analysis (ICA) and FLD are widely used alternatives for dimension reduction. Recently, there have also been many studies on modeling the latent semantic topics of text and multimedia data. For example, latent semantic indexing (LSI) by Deerwester et al. [12] transforms term counts linearly into a low-dimensional semantic eigenspace, and the idea was later extended by Hofmann to pLSI [8]. LDA by Blei et al. [9] is a directed graphical model that provides generative semantics of text documents, where each document


Fig. 1 The basic harmonium model.

is associated with a topic-mixing vector and each word is independently sampled according to a topic drawn from this topic mixture. LDA has been extended to Gaussian-Mixture LDA (GM-LDA) and Correspondence LDA (Corr-LDA) [11], both of which are used to model annotated data such as captioned images or video with transcript text. EFH, proposed by Welling et al. [10], is a bipartite undirected graphical model consisting of a layer of latent nodes representing semantic aspects and a layer of observed nodes representing the raw features.

In practice, the methods mentioned above are mainly used for transforming the high-dimensional raw features into a low-dimensional representation that presumably captures the latent semantics of the data. The classification task is usually performed by building a separate discriminative classifier (e.g. SVMs) on top of such latent semantic representations. In this paper two of the proposed models, namely FoH and HH, provide a unified approach that integrates representation and classification in the same framework. They not only achieve satisfactory classification performance, but also provide interesting insights into the data semantics, such as the internal structure of each category and the relationships between different categories. Similarly, Fei-Fei et al. [13] used a unified model for representing and classifying natural scene images by introducing category variables into the LDA model.

3. THE BASIC HARMONIUM MODEL

Harmoniums, originally studied by Smolensky [14] in his 'harmony theory', refer to a family of bipartite undirected graphical models (a.k.a. random fields) that consist of two layers of nodes. Figure 1 shows a basic harmonium model, where the nodes X = {Xi} at the bottom layer denote the observed data and the nodes H = {Hj} at the top layer model the latent semantic topics of the data. Depending on the specific application, the data nodes X can represent the keyword counts of a text document or the histogram features of an image, and the latent topic nodes H constitute a low-dimensional summarization of the data that captures the critical information in the raw data.

The bipartite topology of a harmonium ensures that the nodes within the same layer are conditionally independent given the nodes in the other layer. This makes possible a convenient constructive definition of the harmonium distribution based on two between-layer conditional distributions p(x|h) and p(h|x), both of which factorize over individual nodes as p(x|h) = ∏i p(xi|h) and p(h|x) = ∏j p(hj|x). Welling et al. [10] proposed a special class of harmonium models called EFH, where these conditionals are from the exponential family:

p(x \mid h) = \prod_i p(x_i \mid h) \propto \prod_i \exp\Big\{ \Big( \theta_i + \sum_j W_{ij}\, g(h_j) \Big) f(x_i) \Big\}

p(h \mid x) = \prod_j p(h_j \mid x) \propto \prod_j \exp\Big\{ \Big( \lambda_j + \sum_i W_{ij}\, f(x_i) \Big) g(h_j) \Big\}    (1)

where {f(xi)} and {g(hj)} are the sufficient statistics (or features) of nodes xi and hj, and {θi}, {λj}, and {Wij} are model parameters, which can be learned from the data. The partition functions (i.e. normalizers) in these distributions are not shown explicitly, and therefore we use a proportional sign instead of an equal sign in Eq. (1). We see that the data nodes x and the topic nodes h are coupled through the interaction terms Wij, so that the values of the topic nodes h affect the distribution of x, and vice versa. This ensures that the latent topic variables h are in 'harmony' with the data variables x, so that h preserves most of the information in x.

Welling et al. [10] showed that these local conditionals map precisely to the following harmonium random field (joint distribution):

p(x, h) \propto \exp\Big\{ \sum_i \theta_i f(x_i) + \sum_j \lambda_j g(h_j) + \sum_{ij} W_{ij} f(x_i) g(h_j) \Big\}    (2)

After the parameters {θi}, {λj}, {Wij} are learned, the harmonium model can be used to infer the latent topic nodes h from the observed data nodes x. Owing to the conditional independence between the latent nodes, the inference of h is very efficient. This is a nice property not provided by directed graphical models, which typically do not have such conditional independence. On the other hand, however, there is no marginal independence among either the data or the latent topic nodes in a harmonium. Therefore, learning harmoniums is usually more difficult due to the presence of the global partition function.

The basic harmonium has been used to derive latent semantic topics from the keyword features of text documents [10]. However, it is inadequate for modeling complex data such as video, which contain multiple types of inputs (features) following distributions of different families. Below we describe a series of extensions of the basic harmonium for modeling and classifying video data.
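To make the efficiency of inference concrete, here is a sketch of an EFH instantiated with binary observed nodes and unit-variance Gaussian latent nodes, so that f(xi) = xi and g(hj) = hj in Eq. (1). The sizes and the randomly drawn weights are hypothetical, not parameters from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 6, 3                       # observed nodes, latent topic nodes
W = rng.normal(scale=0.1, size=(N, K))   # interaction weights W_ij
theta = np.zeros(N)               # biases of the observed layer
lam = np.zeros(K)                 # biases of the latent layer

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def infer_topics(x):
    """E[h | x] for unit-variance Gaussian latent nodes: because the h_j
    are conditionally independent given x, this is a single matrix-vector
    product -- no iterative inference is needed."""
    return lam + W.T @ x

def p_x_given_h(h):
    """P(x_i = 1 | h) for binary observed nodes (Bernoulli conditionals)."""
    return sigmoid(theta + W @ h)

x = np.array([1, 0, 1, 1, 0, 0], dtype=float)
h = infer_topics(x)               # exact posterior mean of the topics
px = p_x_given_h(h)
```

This is precisely the property the text contrasts with directed topic models, where inferring the latent topics given a document generally requires iterative approximate inference.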

4. HARMONIUM MODELS FOR VIDEO DATA

A sketch of our approach to video classification is illustrated in Fig. 2. We classify video data in the form of video shots, which are short video segments whose length varies from a few seconds to half a minute or even longer. As video contains both textual and imagery data, we represent each video shot as a bag of keywords (extracted from the video closed captions or via speech recognition systems) and a set of fixed-sized image regions (extracted from a representative frame, or keyframe, of the video shot).

Each region is described by its color-histogram feature. In the training phase, the goal is to build a model of a certain type that derives the latent semantic topics of video data and captures the latent topics (and their combinations) that best describe each category. During the testing phase, we extract the keywords and color features from an unlabeled video shot, and then use them as features to predict which category the shot belongs to. The three proposed models, DWH, FoH, and HH, differ in the way the data are represented and classified.

4.1. Notations and Definitions

The random variables and parameters in our harmonium models are defined as follows:

• A video shot s is represented by a tuple (x, z, h, y), whose components respectively denote the keywords, region-based color features, latent semantic topics, and category labels of the shot.

• The vector x = (x1, . . . , xN) denotes the keyword feature extracted from the transcript associated with the shot. Here N is the size of the word vocabulary, and xi ∈ {0, 1} is a binary variable that indicates the absence or presence of the ith keyword (of the vocabulary) in the shot.

Fig. 2 A sketch of our approach to video classification.

• The vector z = (z1, . . . , zM) denotes the color-histogram features of the keyframe of the shot. Each keyframe is evenly divided into a grid of M fixed-sized rectangular regions, and zj ∈ RC is a C-dimensional vector that represents the color histogram of the jth region. Thus z is a stacked vector of length CM.

• The vector h = (h1, . . . , hK) represents the latent semantic topics of the shot, where K is the total number of latent topics. Each component hk ∈ R denotes how strongly the shot is associated with the kth latent topic.

• The category labels of a shot are modeled differently in the two models. In FoH, a single variable y ∈ {1, . . . , T} indicates the category the shot belongs to, where T is the total number of categories. In hierarchical harmonium, the labels are represented by a vector y = (y1, . . . , yT), with each yt ∈ {0, 1} denoting whether the shot is in the tth category. Here a video shot belongs to only one category, so we have Σt yt = 1.

• The three harmonium models presented below have different parameters. The parameters of a dual-wing harmonium are denoted as θ = (α, β, W, U). A FoH contains a set of category-specific dual-wing harmoniums, each with parameters θy = (πy, αy, βy, Wy, Uy), where y is the category label. A hierarchical harmonium has a single set of parameters θ = (α, β, τ, W, U, V).
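As a concrete illustration of this representation, the sketch below builds x and z for one hypothetical shot. The vocabulary, grid size, and number of color bins are made-up stand-ins for the paper's actual settings.

```python
import numpy as np

def keyword_vector(transcript_words, vocabulary):
    """x: binary presence/absence indicator over the N-word vocabulary."""
    words = set(transcript_words)
    return np.array([1.0 if w in words else 0.0 for w in vocabulary])

def region_histograms(keyframe, grid=(2, 2), bins=8):
    """z: stack the C-bin color histogram of each of the M grid regions.
    `keyframe` is a 2-D array of quantized color indices in [0, bins)."""
    feats = []
    for row in np.array_split(keyframe, grid[0], axis=0):
        for region in np.array_split(row, grid[1], axis=1):
            hist = np.bincount(region.ravel(), minlength=bins).astype(float)
            feats.append(hist / hist.sum())   # normalized region histogram
    return np.concatenate(feats)              # stacked vector of length C*M

vocab = ["weather", "sports", "election"]             # hypothetical vocabulary
x = keyword_vector(["election", "results"], vocab)    # -> [0., 0., 1.]
frame = np.random.default_rng(1).integers(0, 8, size=(32, 32))
z = region_histograms(frame)                  # M=4 regions, C=8 bins each
```

The latent topics h and labels y are not observed features; they are inferred by the models described next.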

4.2. Dual-Wing Harmonium (DWH)

As illustrated in Fig. 3, the dual-wing harmonium (DWH) extends the basic harmonium by introducing two 'wings' of data nodes in order to represent the bimodal information of video data. The nodes in the top layer represent the latent semantic topics H = {Hk} of a video shot s; the nodes in the bottom layer consist of two sets of observed variables: X = {Xi}, representing the keyword feature of the video shot, and Z = {Zj}, representing the region-based color feature of the shot. Thus, DWH models the low-level (keyword and color) features of a video shot and its latent semantic topics as two types of representations that influence each other. We can either conceive of the keyword and color features as being generated by the latent semantic topics, or conceive of the semantic topics as being summarized from the keyword and image features. This mutual influence is reflected in the conditional distributions of the variables representing the features and the semantic topics, detailed below.

4.2.1. Text feature

The variable xi indicating the presence/absence of term i ∈ {1, . . . , N} of the vocabulary follows the distribution:

P(X_i = 1 \mid h) = \frac{1}{1 + \exp(-\alpha_i - \sum_k W_{ik} h_k)}    (3)

P(X_i = 0 \mid h) = 1 - P(X_i = 1 \mid h)

Fig. 3 Dual-wing harmonium model.

This shows that each keyword in a video shot is sampled from a Bernoulli distribution that depends on the latent semantic topics h; that is, the probability that a keyword appears is determined by a weighted combination of the semantic topics h. The parameters αi and Wik are scalars, so α = (α1, . . . , αN) is an N-dimensional vector and W = [Wik] is a matrix of size N × K. Owing to the conditional independence among the xi given h, we have p(x|h) = ∏i p(xi|h).

4.2.2. Image feature

The color-histogram feature zj of the jth region in the keyframe of the shot admits a conditional multivariate Gaussian distribution:

p(z_j \mid h) = \mathcal{N}\Big( z_j \,\Big|\, \Sigma_j \Big( \beta_j + \sum_k U_{jk} h_k \Big),\; \Sigma_j \Big)    (4)

where zj is sampled from a distribution parameterized by the latent semantic topics h. Here, both βj and Ujk are C-dimensional vectors, and therefore β = (β1, . . . , βM) is a stacked vector of dimension CM and U = [Ujk] is a matrix of size CM × K. Note that Σj is a C × C covariance matrix, which, for simplicity, is set to the identity matrix I in our model. Again, we have p(z|h) = ∏j p(zj|h) due to conditional independence.
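The two observable conditionals, Eqs. (3) and (4), can be sketched as follows, with hypothetical sizes and randomly initialized parameters; the region covariances are fixed to the identity as in the text.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, C, K = 5, 4, 3, 2          # keywords, regions, histogram bins, topics
W = rng.normal(scale=0.1, size=(N, K))       # keyword-topic weights W_ik
U = rng.normal(scale=0.1, size=(M, C, K))    # region-topic weight vectors U_jk
alpha = np.zeros(N)                          # keyword biases
beta = np.zeros((M, C))                      # region mean biases

def p_keyword_given_topics(h):
    """Eq. (3): Bernoulli probability of each keyword given the topics h."""
    return 1.0 / (1.0 + np.exp(-(alpha + W @ h)))

def region_means_given_topics(h):
    """Eq. (4): mean of each region's Gaussian color feature, with the
    covariance Sigma_j fixed to the identity."""
    return beta + U @ h          # shape (M, C): one C-dim mean per region

h = np.array([0.5, -0.2])        # a hypothetical topic vector
px = p_keyword_given_topics(h)   # N keyword probabilities
mu = region_means_given_topics(h)
```

Both wings are driven by the same topic vector h, which is how the model couples the two modalities.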

4.2.3. Latent semantic topics

Finally, each latent topic variable hk follows a unit-variance Gaussian distribution whose mean is determined by a weighted combination of the keyword feature x and the color feature z:

p(h_k \mid x, z) = \mathcal{N}\Big( h_k \,\Big|\, \sum_i W_{ik} x_i + \sum_j U_{jk} z_j,\; 1 \Big)    (5)

where Wik and Ujk are the same parameters used in Eqs. (3) and (4). Similarly, p(h|x, z) = ∏k p(hk|x, z) holds.

So far we have presented the conditional distributions of all the variables in the model. These local conditionals can be mapped to the following harmonium random field:

p(x, z, h) \propto \exp\Big\{ \sum_i \alpha_i x_i + \sum_j \beta_j z_j - \sum_j \frac{z_j^2}{2} - \sum_k \frac{h_k^2}{2} + \sum_{ik} W_{ik} x_i h_k + \sum_{jk} U_{jk} z_j h_k \Big\}    (6)

We present the detailed derivation of this random field in the Appendix. Note that the partition function (global normalization term) of this distribution is not shown explicitly, so we use a proportional sign instead of an equal sign. This hidden partition function increases the difficulty of learning the model.


By integrating out the hidden variables h in Eq. (6), we obtain the marginal distribution over the observed keyword and color features of a video shot:

p(x, z) \propto \exp\Big\{ \sum_i \alpha_i x_i + \sum_j \beta_j z_j - \sum_j \frac{z_j^2}{2} + \frac{1}{2} \sum_k \Big( \sum_i W_{ik} x_i + \sum_j U_{jk} z_j \Big)^2 \Big\}    (7)

This distribution, too, contains a hidden partition function.

The parameters of a DWH model, θ = (α, β, W, U), are learned by maximizing the likelihood of a set of video shots, where the likelihood function is defined by Eq. (7). Due to the presence of the global partition function, the learning process requires approximate inference methods, which are discussed in Section 5. Note that in Eq. (5) we set the variance of the latent variables given the input variables to one in order to simplify parameter estimation. Introducing a covariance matrix Σ would offer additional freedom in the joint distribution p(x, z, h), but it would not lead to more general representations in terms of the marginal p(x, z) [10].

It is important to note that DWH is a model proposed only for representing video data: given the text and color features x and z of a video shot, one can infer the latent semantic topics h of the shot using a DWH model. The model by itself cannot be used directly for classification, because it contains no variables representing the category labels of data. To perform classification, one needs to first represent video data by their latent semantic topics (i.e. treating the latent variables h as a feature vector), and then build classifiers using classification models such as support vector machines (SVMs) on this latent semantic representation. The learning of the DWH model is 'unsupervised' in the sense that it does not involve data labels. In practice, we build only a single DWH model from all the available video shots, regardless of the categories they belong to. Moreover, the DWH model does not have to be updated or retrained when new categories (i.e. ones unseen in the training data) arrive; one can simply build more classifiers for these new categories on the representations derived from the same DWH model.
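This two-stage use of DWH can be sketched under hypothetical dimensions: Eq. (5) gives the topic representation in closed form, and the resulting K-dimensional vectors are what one would hand to a separate classifier such as an SVM.

```python
import numpy as np

rng = np.random.default_rng(3)
N, CM, K = 5, 12, 3              # keywords, stacked color dims (C*M), topics
W = rng.normal(scale=0.1, size=(N, K))   # learned keyword-topic weights
U = rng.normal(scale=0.1, size=(CM, K))  # learned color-topic weights

def topic_representation(x, z):
    """Eq. (5): the posterior mean of each unit-variance topic is just
    W^T x + U^T z -- one linear map, computed without any iteration."""
    return W.T @ x + U.T @ z

# Represent two hypothetical shots by their K latent topics; these K-dim
# vectors would then be the inputs to an off-the-shelf classifier.
x1, z1 = rng.integers(0, 2, N).astype(float), rng.random(CM)
x2, z2 = rng.integers(0, 2, N).astype(float), rng.random(CM)
features = np.stack([topic_representation(x1, z1),
                     topic_representation(x2, z2)])
```

Note that the same quantity W^T x + U^T z also appears squared inside the exponent of Eq. (7), which is where the marginal likelihood couples the two modalities.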

4.3. Family-of-Harmonium (FoH)

The FoH model extends the DWH model in order to integrate classification and representation in the same model, using DWH as its basic building block. As illustrated in Fig. 4, a FoH model consists of a set of T category-specific harmoniums, where each harmonium is a DWH that models the video shots from a specific category. The number of component harmoniums is equal to the number of categories. On top of these component harmoniums, a node Y ∈ {1, . . . , T} representing the category label is introduced as a 'switch variable' to indicate the specific harmonium used for modeling a given video shot. The semantics of a FoH model is apparent from its structure: given the category of a video shot, the harmonium corresponding to that category is used to model the shot.

Fig. 4 Family-of-harmonium model.

All the component harmoniums in FoH share exactly the same structure: they are all DWH models with the same number of input and latent nodes and the same forms of distributions, except that each harmonium owns a unique set of parameters (αy, βy, Wy, Uy) indexed by the category label y. The label variable Y is only an indicator of the specific harmonium used for modeling the video shot. Therefore, in Fig. 4, Y is not actually linked to any nodes in the component harmoniums, and it appears only as a subscript of the model parameters in the distribution function presented below.

The distribution of a FoH model can be easily constructed from the distribution of each component DWH. For each DWH, the conditionals of the variables of each type, namely x, z, and h, follow the distributions defined in Eqs. (3), (4), and (5), except that the parameters are indexed by the category label y. Therefore, the likelihood function of a component DWH given the category, p(x, z|y), has exactly the same form as the joint distribution of DWH defined in Eq. (7). The category label Y, the only new variable in FoH, follows a multinomial prior distribution:

p(y) = \mathrm{Multi}(\pi_1, \ldots, \pi_T)    (8)

where \sum_{t=1}^{T} \pi_t = 1. The marginal distribution (likelihood) of a labeled video shot in a FoH model can be decomposed into a category-specific marginal and a prior over the categories:

p(x, z, y) = p(x, z \mid y)\, p(y) \propto \pi_y \exp\Big\{ \sum_i \alpha_i^y x_i + \sum_j \beta_j^y z_j - \sum_j \frac{z_j^2}{2} + \frac{1}{2} \sum_k \Big( \sum_i W_{ik}^y x_i + \sum_j U_{jk}^y z_j \Big)^2 \Big\}    (9)

Learning a FoH model is equivalent to learning T independent DWH models, where each component DWH is learned from the video shots of the corresponding category by maximizing the likelihood function defined in Eq. (9). Therefore, the learning method used for DWH is readily applicable to the learning of FoH, as discussed in Section 5.

For classification, FoH behaves like a maximum-likelihood classifier. That is, it evaluates the probability function p(x, z, y) of a video shot under each of the component harmoniums, and assigns the shot to the category corresponding to the harmonium with the highest probability. This is because, by Bayes' rule, the posterior probability of the category label is proportional to the data likelihood: p(y|x, z) ∝ p(x, z|y)p(y). We can further simplify this by assuming that the category prior is uniform, i.e. p(y) = 1/T. Thus, we can predict the category of a shot by comparing its class-conditional likelihood p(x, z|y) under each harmonium y, which can be computed from Eq. (9). The harmonium that 'best fits' the shot determines its category.
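The decision rule itself is a simple argmax, sketched below. One practical caveat, an assumption of this sketch rather than a statement from the text: each component harmonium has its own partition function, so the unnormalized exponent of Eq. (9) is not directly comparable across categories; the sketch therefore takes properly normalized per-category log-likelihoods as given.

```python
import numpy as np

def foh_predict(log_lik_per_category, log_prior=None):
    """Maximum-likelihood classification in FoH: score a shot under every
    component harmonium and pick the best category.  The inputs are
    log p(x, z | y) values that are assumed already normalized, since each
    component DWH carries its own partition function."""
    scores = np.asarray(log_lik_per_category, dtype=float)
    if log_prior is not None:        # omit for a uniform prior p(y) = 1/T
        scores = scores + np.asarray(log_prior, dtype=float)
    return int(np.argmax(scores))

# Hypothetical log p(x, z | y) of one shot under T = 3 component harmoniums.
y_hat = foh_predict([-120.4, -118.9, -131.0])   # -> category 1
```

With a uniform prior the log-prior term is a constant shift and can be dropped, exactly as argued in the text.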

4.4. Hierarchical Harmonium (HH)

HH extends the basic DWH model in a different way in order to make it directly applicable to classification tasks. Instead of building a separate harmonium for each category, it introduces category labels as another (third) layer of nodes in a single DWH, making it a three-layer undirected model. In Fig. 5, the label variables Y = {Y_1, . . . , Y_T} with Y_t ∈ {0, 1} indicate a shot's membership in each category, and they form a bipartite subgraph with the latent topic variables H, on top of the bipartite subgraph between H and the inputs X and Z. Unlike in a FoH model, the label variables Y in HH are linked to the other variables in the model. In fact, there is a link between any Y_t and H_j

but not between any two Y_t, so the conditional independence property of harmoniums is preserved. Another difference is that a HH model contains only a single harmonium, while a FoH model contains one harmonium per category.

In a HH model, the conditional distributions of x and z stay the same as those in the DWH model, which are defined by Eqs. (3) and (4), respectively. Each label

[Figure: the label nodes Y_1, . . . , Y_T form the top layer, the latent topic nodes H_1, . . . , H_K the middle layer, and the input nodes X_1, . . . , X_N and Z_1, . . . , Z_M the bottom layer.]

Fig. 5 Hierarchical harmonium model.

variable Y_t follows a Bernoulli distribution parameterized by the latent variables h:

P(Y_t = 1 \mid \mathbf{h}) = \frac{1}{1 + \exp\big(-\tau_t - \sum_k V_{tk} h_k\big)} \quad (10)

P(Y_t = 0 \mid \mathbf{h}) = 1 - P(Y_t = 1 \mid \mathbf{h})

where V = [V_{tk}] is a matrix of size T × K. Note that if we treat h as the input and V_{tk} and τ_t as parameters, this distribution has exactly the same form as the distribution of the class label in logistic regression [7], i.e. P(Y = 1|x) = 1/(1 + exp(-β_0 - β^T x)). This implies that the model is actually performing logistic regression to compute each category label Y_t, using the latent semantic topics h as input.
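Eq. (10) is just a sigmoid applied to a linear function of the topics. Assuming a learned V and τ (the function name is ours), it reads:

```python
import numpy as np

def p_label_given_h(h, V, tau):
    """P(Y_t = 1 | h) for every label t at once (Eq. 10): a logistic
    regression on the latent topics h. V is T x K, tau is (T,)."""
    return 1.0 / (1.0 + np.exp(-(tau + V @ h)))
```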

The distribution of each latent variable h_k needs to be modified to incorporate the interactions between the label variables y and the topic variables h:

p(h_k \mid \mathbf{x}, \mathbf{z}, \mathbf{y}) = \mathcal{N}\Big( h_k \,\Big|\, \sum_i W_{ik} x_i + \sum_j U_{jk} z_j + \sum_t V_{tk} y_t,\; 1 \Big). \quad (11)

This shows that the distribution of the latent semantic topics h is affected not only by the data features x and z but also by their labels y. This is significantly different from the DWH and FoH models, as well as from directed graphical models such as LDA [9], where the distribution of the latent variables depends only on the input features. In this sense, the latent semantic topics derived from a HH model are 'supervised', while those derived by the other models are 'unsupervised'.


With the incorporation of the label variables, the joint distribution (i.e. random field) of a HH model becomes:

p(\mathbf{x}, \mathbf{z}, \mathbf{h}, \mathbf{y}) \propto \exp\Big\{ \sum_i \alpha_i x_i + \sum_j \beta_j z_j - \sum_j \frac{z_j^2}{2} + \sum_t \tau_t y_t - \sum_k \frac{h_k^2}{2} + \sum_{ik} W_{ik} x_i h_k + \sum_{jk} U_{jk} z_j h_k + \sum_{tk} V_{tk} y_t h_k \Big\} \quad (12)

After integrating out the hidden variables H, the marginal distribution of a labeled video shot (x, z, y) is:

p(\mathbf{x}, \mathbf{z}, \mathbf{y}) \propto \exp\Big\{ \sum_i \alpha_i x_i + \sum_j \beta_j z_j - \sum_j \frac{z_j^2}{2} + \sum_t \tau_t y_t + \frac{1}{2} \sum_k \Big( \sum_i W_{ik} x_i + \sum_j U_{jk} z_j + \sum_t V_{tk} y_t \Big)^2 \Big\} \quad (13)

The parameters of the HH model, θ = (α, β, τ, W, U, V), are estimated by maximizing the likelihood function defined by Eq. (13). Despite the introduction of the label variables Y, the learning procedure for HH is in spirit similar to that for DWH and FoH, and can be easily extended from the latter.

Classification is performed differently from both DWH and FoH. Since the data labels are represented as model variables Y, we can predict the category of an unlabeled video shot by inferring the label variables Y from its text and image features. This is done by computing the conditional probability p(Y_t = 1|x, z) for each label variable Y_t. We then assign the shot to the category with the highest probability, i.e. t* = argmax_t p(Y_t = 1|x, z). If a video shot may belong to more than one category, we can assign it to all categories with probability above a given threshold. There is, however, no analytical solution for p(Y_t = 1|x, z). Various approximate inference methods are available for this problem, as discussed further in Section 5.
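Both decision rules can be sketched together (the helper name is hypothetical; the posteriors p(Y_t = 1|x, z) are assumed to come from one of the approximate inference methods of Section 5):

```python
import numpy as np

def predict_categories(p_y, threshold=None):
    """Turn inferred posteriors p(Y_t = 1 | x, z) into category labels.

    With no threshold, single-label prediction: t* = argmax_t p_y[t].
    With a threshold, every category above it is assigned (multi-label).
    """
    p_y = np.asarray(p_y)
    if threshold is None:
        return int(np.argmax(p_y))
    return np.flatnonzero(p_y >= threshold).tolist()
```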

4.5. Discussion

There are several interesting differences and connections between the three harmonium models we have proposed.

• First of all, DWH is a representation model, while FoH and HH are classification models. DWH has no variables representing category labels, can be trained without data labels, and cannot be used directly for classification. It only derives the latent semantic representations of the data, and one needs to build separate classifiers to classify the data based on such representations. In comparison, FoH and HH incorporate label variables and need to be trained using labeled data. They not only derive the latent representation of the data but also perform classification within the same model. It is difficult to say theoretically which model is better, because they serve different purposes. But if classification is the only goal, FoH and HH do provide a more integrated and efficient approach by avoiding the need to train separate classifiers.

• Furthermore, the meanings of the derived latent topics differ across these models. In DWH and HH, the latent topics represent the 'common topics' of the data, since all the data share the same set of latent topics. The latent topics in HH are likely to differ from those in DWH, because they are 'supervised' by the category labels and presumably contain more discriminative information. The latent topics in HH also help reveal the connections between the various categories. In FoH, since each component harmonium is built for a specific category, the latent topics in each harmonium capture the internal structure of that category, i.e. they represent the themes or subcategories within that particular category. There are no correspondences between the latent topics across different harmoniums: the first topic in one harmonium is unrelated to the first topic in another.

• The three models also differ in terms of efficiency and flexibility. On a fixed collection, training a HH model is less expensive than training a FoH model, as the latter involves training multiple harmoniums. Training a DWH model is not expensive by itself, but it can be costly to train separate classifiers on top of it, depending on the classification algorithm used. When it comes to incorporating a new category, FoH can accommodate it by adding another harmonium trained from its data without changing the harmoniums for the existing categories, and DWH does not even have to be updated, except that a new classifier needs to be built for the new category. Introducing a new category is more expensive in a HH model, because adding another label node changes the model structure and therefore requires retraining the whole model.

To provide deeper insight into our models, we also compare them to other topic models for text and multimedia data, including PCA, pLSI [8], LDA [9] and its variants GM-LDA and Corr-LDA [11], and the exponential-family harmonium (EFH) [10,15]. Our models are among the first few


attempts to use undirected topic models, since most existing topic models are directed. Although there is no conclusion yet as to which is better, our models offer appealing properties such as easy inference due to conditional independence. Also note that the majority of topic models are designed for single-modal data, usually text documents, while our models join GM-LDA and Corr-LDA as among the few topic models for bi-modal data such as video and captioned images. Finally, our FoH and HH models integrate representation and classification in the same framework. In contrast, most existing topic models are intended only for data representation. The Bayesian hierarchical model for scene classification proposed by Fei-Fei et al. [13], which extends LDA, is a counterpart of FoH among the directed models.

5. LEARNING AND INFERENCE

The parameters of the three harmonium models in Section 4 are learned under the maximum-likelihood principle using gradient ascent and approximate inference techniques. In this section, we use the learning of HH as an example to describe the general procedure for learning a harmonium model, because HH is structurally the most complex of the three. The learning algorithm for DWH can be 'reduced' from the algorithm for HH, and learning a FoH is equivalent to learning multiple DWH models for different categories.

Given a labeled set of video shots χ = {x_n, z_n, y_n}_{n=1}^{N}, the parameters of a HH model, θ = (α, β, τ, W, U, V), are estimated by maximizing the log-likelihood of the data defined by Eq. (13). Owing to the complexity of the model, there is no closed-form solution to this maximization problem, and we have to resort to an iterative method such as gradient ascent. The learning rules (i.e. the gradients) are obtained by taking the derivatives of Eq. (13) with respect to the model parameters:

\delta\alpha_i = \langle x_i \rangle_{\tilde{p}} - \langle x_i \rangle_{p}, \quad \delta\beta_j = \langle z_j \rangle_{\tilde{p}} - \langle z_j \rangle_{p}, \quad \delta\tau_t = \langle y_t \rangle_{\tilde{p}} - \langle y_t \rangle_{p},

\delta W_{ik} = \langle x_i h'_k \rangle_{\tilde{p}} - \langle x_i h'_k \rangle_{p}, \quad \delta U_{jk} = \langle z_j h'_k \rangle_{\tilde{p}} - \langle z_j h'_k \rangle_{p}, \quad \delta V_{tk} = \langle y_t h'_k \rangle_{\tilde{p}} - \langle y_t h'_k \rangle_{p} \quad (14)

where h'_k = \sum_i W_{ik} x_i + \sum_j U_{jk} z_j + \sum_t V_{tk} y_t, and \langle \cdot \rangle_{\tilde{p}} and \langle \cdot \rangle_{p} denote expectations under the empirical distribution (i.e. data averages) and under the model distribution of the harmonium, respectively. Like other undirected graphical models, the likelihood function of the harmonium, Eq. (13), contains a global normalizer (a.k.a. partition function), which makes directly

computing ⟨·⟩_p intractable. Instead, we need approximate inference methods to estimate these model expectations. We explored four approximate inference methods in our work, which are briefly discussed below. Besides the learning process, inferring the conditional distribution of the label variables, p(Y_t = 1|x, z), in a HH model is also intractable and requires approximate inference.
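The learning rules in Eq. (14) can be sketched as follows. The helper names and the dictionary-based parameter layout are our own, and the empirical and model expectations are assumed to be supplied by whichever approximate inference method is in use:

```python
import numpy as np

def h_prime(x, z, y, W, U, V):
    """Conditional mean of the latent topics given (x, z, y): h'_k in Eq. (14)."""
    return W.T @ x + U.T @ z + V.T @ y

def gradient_step(params, data_stats, model_stats, lr=0.1):
    """One gradient-ascent update (Eq. 14): every gradient is the empirical
    expectation of a sufficient statistic minus its (approximated) model
    expectation. Both stats dicts map parameter names to those expectations."""
    return {name: params[name] + lr * (data_stats[name] - model_stats[name])
            for name in params}
```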

5.1. Mean Field Approximation

Mean field (MF) is a variational method that approximates the model distribution p by a factorized form, a product of marginals over clusters of variables [16]. We use the naive version of MF, where the joint probability p is approximated by a surrogate distribution q that is a product of singleton marginals over the variables:

q(\mathbf{x}, \mathbf{z}, \mathbf{y}, \mathbf{h}) = \prod_i q(x_i \mid \nu_i) \prod_j q(z_j \mid \mu_j) \prod_t q(y_t \mid \lambda_t) \prod_k q(h_k \mid \gamma_k) \quad (15)

where the singleton marginals are defined as q(x_i) ∼ Bernoulli(ν_i), q(z_j) ∼ N(μ_j, 1), q(y_t) ∼ Bernoulli(λ_t), and q(h_k) ∼ N(γ_k, 1), and {ν_i, μ_j, λ_t, γ_k} are the variational parameters. The variational parameters can be computed by minimizing the KL-divergence between p and q, which results in the following fixed-point update equations:

\nu_i = \sigma\Big( \alpha_i + \sum_k W_{ik} \gamma_k \Big)

\mu_j = \beta_j + \sum_k U_{jk} \gamma_k

\lambda_t = \sigma\Big( \tau_t + \sum_k V_{tk} \gamma_k \Big)

\gamma_k = \sum_i W_{ik} \nu_i + \sum_j U_{jk} \mu_j + \sum_t V_{tk} \lambda_t

where σ(x) = 1/(1 + exp(−x)) is the sigmoid function. We iteratively update the variational parameters using the above fixed-point equations until they converge, at which point the surrogate distribution q is fully specified. We then replace the intractable model expectations ⟨·⟩_p in Eq. (14) with ⟨·⟩_q, which are easy to compute from the fully factorized surrogate distribution q, and update the model parameters using the learning rules defined in Eq. (14).
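The fixed-point iteration can be sketched directly from the four update equations. This is a naive illustration with our own function names and a fixed iteration count rather than a convergence test; for some parameter settings the undamped iteration may oscillate:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def mean_field_hh(alpha, beta, tau, W, U, V, n_iter=50):
    """Naive mean-field fixed point for the HH model distribution.
    Returns the variational parameters (nu, mu, lam, gamma) of the
    fully factorized surrogate q in Eq. (15).
    W is N x K, U is M x K, V is T x K."""
    gamma = np.zeros(W.shape[1])
    for _ in range(n_iter):                    # n_iter >= 1 assumed
        nu = sigmoid(alpha + W @ gamma)        # Bernoulli means of x
        mu = beta + U @ gamma                  # Gaussian means of z
        lam = sigmoid(tau + V @ gamma)         # Bernoulli means of y
        gamma = W.T @ nu + U.T @ mu + V.T @ lam  # Gaussian means of h
    return nu, mu, lam, gamma
```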

It is important to note that, when using the gradient ascent method with mean field, the learning procedure contains two nested loops: the outer loop iteratively updates the


model parameters using the learning rules of Eq. (14), while the inner loop iteratively updates the variational parameters in order to approximate the model expectations in the learning rules. Whenever the model parameters are updated (and hence the model distribution p changes), the whole inner loop must be executed again to recompute the surrogate distribution q so that it approximates the updated model distribution p. Using gradient ascent with other iterative approximate inference methods, such as Gibbs sampling, also results in a learning procedure with nested loops.

5.2. Gibbs Sampling

Gibbs sampling, a special form of the Markov chain Monte Carlo (MCMC) method, has been widely used for approximate inference in complex graphical models [17]. This method repeatedly samples the variables in a particular order, one variable at a time, conditioned on the current values of the other variables. For example, in a HH model we define the sampling order to be y_1, . . . , y_T, h_1, . . . , h_K, as the other variables are given as input. This means we first sample each y_t from the conditional distribution defined in Eq. (10) using the current values of h_k, then sample each h_k according to Eq. (11) using the sampled values of y_t, and repeat this process iteratively. After a large number of iterations (the 'burn-in' period), this procedure is guaranteed to reach an equilibrium distribution that, in theory, equals the model distribution p. Therefore, we use the empirical expectation computed from the samples collected after the burn-in period to approximate the true expectation ⟨·⟩_p. The number of burn-in iterations and samples is at least in the thousands and typically around tens of thousands. Therefore, although this method yields accurate approximations, it is computationally intensive.
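A sketch of this alternating sampler for one observed shot (the function name and argument layout are our own assumptions):

```python
import numpy as np

def gibbs_y_h(x, z, tau, W, U, V, n_burn=1000, n_samples=500, rng=None):
    """Alternating Gibbs sampler for (y, h) given an observed shot (x, z):
    y_t ~ Bernoulli of Eq. (10); h_k ~ N(mean of Eq. (11), 1)."""
    rng = np.random.default_rng(rng)
    T, K = V.shape
    base = W.T @ x + U.T @ z              # fixed part of the h-mean
    h = np.zeros(K)
    samples = []
    for it in range(n_burn + n_samples):
        p_y = 1.0 / (1.0 + np.exp(-(tau + V @ h)))
        y = (rng.random(T) < p_y).astype(float)
        h = rng.normal(base + V.T @ y, 1.0)
        if it >= n_burn:                  # keep only post-burn-in samples
            samples.append((y, h))
    return samples
```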

5.3. Contrastive Divergence

An alternative to exact gradient ascent based on the learning rules in Eq. (14) is the contrastive divergence (CD) algorithm proposed by Hinton and Welling [18], which approximates the gradient learning rules. In each step of gradient ascent, instead of computing the model expectation ⟨·⟩_p exactly, CD starts from the empirical values as the initial samples, runs Gibbs sampling for only a few iterations, and uses these limited samples to approximate ⟨·⟩_p. It has been proved that the final parameter values under this kind of updating converge to the maximum-likelihood estimate [18]. In our implementation, we compute ⟨·⟩_q from a large number of samples obtained by running only one step of Gibbs sampling with different initializations.

Obviously, CD is substantially more efficient than the Gibbs sampling method, since the 'burn-in' process is skipped.
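A sketch of one CD step, showing only the statistics needed for the V gradient (the other parameters are handled analogously); the function decomposition is our own:

```python
import numpy as np

def cd1_stats(x, z, y, tau, W, U, V, rng=None):
    """One contrastive-divergence step for HH: sample h given the data,
    reconstruct y with a single Gibbs step, and return the positive
    (data) and negative (one-step) statistics for the V gradient."""
    rng = np.random.default_rng(rng)
    mean_h = W.T @ x + U.T @ z + V.T @ y
    h = rng.normal(mean_h, 1.0)                       # h ~ p(h | x, z, y)
    p_y = 1.0 / (1.0 + np.exp(-(tau + V @ h)))
    y1 = (rng.random(len(tau)) < p_y).astype(float)   # one Gibbs step on y
    pos = np.outer(y, mean_h)                         # data statistic <y h'>
    neg = np.outer(y1, W.T @ x + U.T @ z + V.T @ y1)  # one-step statistic
    return pos, neg
```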

5.4. The Uncorrected Langevin Method

The uncorrected Langevin method [19] originates from the Langevin Monte Carlo method by accepting all proposal moves. It makes use of gradient information and resembles noisy steepest ascent, which helps it avoid local optima. Similar to gradient ascent, the uncorrected Langevin algorithm has the following update rule:

\lambda_{ij}^{\text{new}} = \lambda_{ij} + \frac{\varepsilon^2}{2} \frac{\partial}{\partial \lambda_{ij}} \log p(X, \lambda) + \varepsilon\, n_{ij} \quad (16)

where n_{ij} ∼ N(0, 1) and ε is a parameter controlling the step size. As in the contrastive divergence algorithm, we use only a few iterations of Gibbs sampling to approximate the model distribution p.
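Eq. (16) as a one-step update (our own helper; `grad_log_p` stands for the gradient of the log-likelihood at the current parameters, itself estimated by the brief Gibbs runs mentioned above):

```python
import numpy as np

def langevin_step(lam, grad_log_p, eps, rng=None):
    """Uncorrected Langevin update (Eq. 16): a gradient-ascent step of
    size eps^2/2 plus Gaussian noise of scale eps, with every proposal
    accepted (no Metropolis correction)."""
    rng = np.random.default_rng(rng)
    return lam + 0.5 * eps ** 2 * grad_log_p + eps * rng.standard_normal(lam.shape)
```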

6. EXPERIMENTS

We evaluate the proposed models using video data from the TRECVID 2003 development set [20]. On the basis of the manual annotations on this set, we choose 2468 shots belonging to 15 semantic categories: airplane, animal, baseball, basketball, beach, desert, fire, football, hockey, mountain, office, road traffic, skating, studio, and weather news. Each shot belongs to only one category, and the size of a category varies from 46 to 373 shots. The keywords of each shot are extracted from the video closed captions associated with that shot. By removing non-informative words such as stop words and infrequent words, we reduce the number of distinct keywords (the vocabulary size) to 3000. Meanwhile, we evenly divide the key-frame of each shot into a grid of 5 × 5 regions and extract a 15-dimensional color histogram in HVC color space from each region. Therefore, each video shot is represented by a 3000-d keyword feature and a 375-d color-histogram feature. For simplicity, the keyword features are made binary, capturing only the presence or absence of each keyword, because a keyword rarely appears multiple times within the short duration of a shot.
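The feature construction can be sketched as follows. This is an illustrative simplification, not the authors' pipeline: a 15-bin histogram of a single scalar color channel stands in for the 15-dimensional HVC histogram, and the function name is ours.

```python
import numpy as np

def shot_features(keyframe, keywords, vocabulary, grid=5, bins=15):
    """Build the two per-shot feature vectors used in the experiments:
    a binary keyword vector over the vocabulary and a concatenation of
    per-region color histograms from a grid x grid division of the
    key-frame (here one scalar channel in [0, 1) for brevity)."""
    present = set(keywords)
    x = np.array([1.0 if w in present else 0.0 for w in vocabulary])
    H, W = keyframe.shape
    hists = []
    for r in range(grid):
        for c in range(grid):
            region = keyframe[r*H//grid:(r+1)*H//grid, c*W//grid:(c+1)*W//grid]
            hist, _ = np.histogram(region, bins=bins, range=(0.0, 1.0))
            hists.append(hist / max(hist.sum(), 1))   # normalize per region
    return x, np.concatenate(hists)   # shapes (|V|,) and (grid*grid*bins,)
```

With a 3000-word vocabulary and the paper's 5 × 5 grid of 15-bin histograms, the outputs are exactly the 3000-d and 375-d vectors described above.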

The experimental results are presented in two parts. First, we show some illustrative examples of the latent semantic topics derived by the proposed models and discuss the insights they provide into the structure and relationships of video categories. In the second part, we evaluate the performance of our models on video classification in comparison with several existing approaches.


6.1. Interpretation of Latent Semantic Topics

All three proposed models provide an intermediate representation in the form of latent semantic topics automatically derived from video data. Owing to the differences in model structure, the latent topics derived from these models carry different interpretations of the same data. For both DWH and HH, the latent topics are derived from all the data regardless of category; therefore, they are supposed to represent 'common topics' in this data collection. In Fig. 6, we show the key-frames and keywords of the video shots most tightly associated with 5 latent topics (i.e. having the highest conditional probability p(h_k|x, z)) derived by the DWH or HH model. We find that these topics roughly correspond to some of the 15 manually defined categories. For example, in Fig. 6(a) the topics are about 'weather', 'basketball', 'airplane', and 'anchor', while in Fig. 6(b) the topics are 'studio', 'baseball or football', 'weather', 'airplane or skating', and 'animal'. This shows that the derived latent semantic topics are able to capture the semantics of video data.

We should also note that, since these latent topics are derived by jointly modeling the textual and imagery features of video, they are more than simple clusters in the color or keyword feature space; they are, in a sense, 'co-clusters' in both feature spaces. For example, the shots of Topic 1 in Fig. 6(a) are visually very similar to each other; the shots of Topic 3 are not so similar visually, but they clearly have very close semantic meanings and share common keywords such as 'flying' and 'engine'. A closer examination also shows that the latent topics of HH correspond slightly better to the categories, while the latent topics of DWH appear to be clusters based on image and keyword features. This echoes the fact that the latent topics from HH are 'supervised' by the category information while those from DWH are not.

Another advantage of HH, as discussed in Section 4.5, is that it reveals the relationships between different categories through the hidden topics. We can tell how strongly a category t is associated with a latent topic j from the conditional probability p(y_t|h_j). Therefore, we can compute the similarity between any two categories by examining the hidden topics they are associated with. We show the pairwise similarity between the 15 categories using the color-coded confusion matrix in Fig. 7, where redder colors denote higher similarity and bluer colors denote lower similarity. We can see many meaningful pairs of related categories: e.g. 'mountain' is strongly related to 'animal', and 'baseball' is related to 'hockey', while 'studio' is not related to any category. These relationships are basically consistent with common sense.
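One plausible way to realize such a similarity measure (the paper does not spell out the exact computation, so this cosine-based choice over the label-topic weights is our assumption):

```python
import numpy as np

def category_similarity(V):
    """Pairwise category similarity from a learned HH label-topic weight
    matrix V (T x K): cosine similarity between rows, i.e. between the
    topic-association profiles of two categories."""
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    Vn = V / np.maximum(norms, 1e-12)   # guard against zero rows
    return Vn @ Vn.T                    # (T, T) similarity matrix
```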

The latent topics in a FoH model have a very different interpretation. Since each component harmonium in FoH is

Fig. 6 An illustration of 5 latent topics derived by (a) the DWH model and (b) the HH model from the data collection. Each topic is shown by the top 10 keywords and top 5 key-frames extracted from the most related video shots.

learned independently from the data of a specific category, the latent topics of that harmonium capture the structure or subcategories of that particular category. In Fig. 8, we show the representative key-frames and keywords associated with the 5 latent topics learned from the category 'Fire'. We see that these 5 topics roughly correspond to 5 subcategories under 'Fire', which can be described as 'forest fire in the night', 'explosion in outer space', 'launch


Fig. 7 The color-coded matrix showing the pairwise similarity between categories revealed in the HH model. Best viewed in color.

Fig. 8 An illustration of 5 latent topics of the component harmonium for the 'Fire' category in the FoH model.

of missile or space shuttle', 'smoke of fire', and 'close-up scene of fire'.

6.2. Performance on Video Classification

To evaluate the performance of the DWH, FoH, and HH models on video classification, we evenly divide our data set into a training set and a test set. The model parameters are estimated from the training set. Specifically, we implemented the learning methods based on the four inference algorithms described in Section 5, in order to examine their efficiency and accuracy. The FoH and HH models can be used directly for classification, while for the DWH model we train separate SVM classifiers on the intermediate representation derived from DWH. We also explore the issue of model selection, namely the impact of the number of latent semantic topics on classification performance.

Several other methods have been implemented for comparison. We implemented a baseline method that builds an SVM classifier on the same keyword and color features used as input to our models. This is essentially an 'early fusion' method, since it concatenates both features into a single longer feature vector.

We also implemented several directed topic models that produce intermediate representations for video data: the Gaussian-multinomial mixture model (GM-mixture), Gaussian-multinomial latent Dirichlet allocation (GM-LDA), and correspondence latent Dirichlet allocation (Corr-LDA). The details of these models can be found in [11]. Note that all these directed models are used only for data representation, and separate SVM classifiers are trained to perform classification based on the intermediate representations they derive. To make the methods comparable, we ensure that the same kernel function and parameters are used in the SVM classifiers trained on top of the different models. We use the RBF kernel [6] and the best empirical parameters found by cross validation. Also, to make the experiments tractable across models with different learning algorithms and different numbers of latent topics, we restrict this part of the experiments to a subset of our collection containing the 5 largest categories, with a total of 1078 shots: airplane, basketball, baseball, hockey, and weather.
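The classifier protocol can be sketched with scikit-learn standing in for whatever SVM implementation the authors used in 2008 (an anachronistic substitution, and the parameter grid is our own):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def train_svm_on_topics(topics_train, labels_train):
    """Train an RBF-kernel SVM on latent-topic representations, picking
    the kernel parameters by cross validation, mirroring the protocol
    described above. The grid values are illustrative assumptions."""
    grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.1, 1.0]}
    search = GridSearchCV(SVC(kernel="rbf"), grid, cv=3)
    search.fit(topics_train, labels_train)
    return search.best_estimator_
```

The same routine would be applied unchanged to the representations from DWH, GM-mixture, GM-LDA, and Corr-LDA, which is what makes the kernel and parameters identical across models.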

Figure 9(a) shows the classification accuracies of all the representation models, including the undirected DWH model and the directed ones: GM-mixture, GM-LDA, and Corr-LDA. To be fair, all the models are implemented using the MF variational method for learning and inference, except GM-mixture, which uses the expectation-maximization (EM) method. All models are evaluated with the number of latent semantic topics set to 5, 10, 20, 30, and 50, in order to study the relationship between performance and model complexity. Figure 9(b) compares the classification performance of our three models, among which DWH is a representation-only model while FoH and HH are classification models.

Several interesting observations can be drawn from Fig. 9. First, the three undirected models, FoH, HH, and DWH, achieve significantly higher performance than the directed models GM-mixture, GM-LDA, and Corr-LDA, which indicates that the harmonium model is an effective tool for video representation and classification. Among them, FoH is the best performer with 5 and 10 latent semantic nodes, while DWH is the best performer with 20 and 50 latent nodes, with HH a close runner-up. Second, we find that the performance of FoH and HH is overall comparable to that of DWH. Given that DWH uses an SVM classifier, this result is encouraging, as it shows that our approach is


[Figure: two panels of classification accuracy versus the number of latent topics (5–50). Panel (a), accuracy range 0.2–0.8, legend: Baseline, LSI, GM-Mix, GM-LDA, Corr-LDA, DWH. Panel (b), accuracy range 0.45–0.75, legend: Baseline, DWH, FoH, HH.]

Fig. 9 Comparison of classification performance (a) between various representation models (directed and undirected), and (b) between the three proposed harmonium models.

comparable to the performance of a state-of-the-art discriminative classifier. On the other hand, our approach enjoys advantages that SVM does not have; for example, FoH can easily accommodate a new category without retraining the whole model. Third, the performance of DWH and HH improves as the number of latent topics increases, which agrees with our intuition, because more latent topics lead to a better representation of the data. However, this trend is reversed for FoH, which performs much better with a smaller number of latent topics. While a theoretical explanation of this remains unclear, in practice it is a good property of FoH to achieve high performance with simpler models. Fourth, 20 appears to be a reasonable number of latent semantic topics for this data set, since further increasing the number of topics does not yield considerable improvement in performance.

Figure 10 shows the classification accuracies of the HH model implemented with the different approximate inference methods. From the graph, we can see that the Langevin and contrastive divergence (CD) methods perform similarly and are slightly better than mean field (MF) and Gibbs sampling. We also study the efficiency of these inference methods by measuring the time they need to reach convergence during training. The results show that mean field

[Figure: classification accuracy (0.58–0.67) versus the number of hidden topic nodes (5–50); legend: CD, MF, Langevin, Gibbs.]

Fig. 10 Classification performance of different approximate inference methods in HH.

is the most efficient (approx. 2 min), followed by CD and Langevin (approx. 9 min), with Gibbs sampling the slowest (approx. 49 min). Therefore, Langevin and CD are good choices for the learning and inference of our models in terms of both efficiency and classification performance.

7. CONCLUSION

We have described three undirected graphical models for the semantic representation and classification of video data. The proposed models derive latent semantic representations of video data by jointly modeling its textual and image features, and perform classification based on these latent representations. Experiments on TRECVID data have demonstrated that our models achieve satisfactory performance on video classification and provide insights into the internal structure and relationships of video categories. Several approximate inference algorithms have been examined in terms of efficiency and classification performance.

Our HH model by nature does not restrict the number of categories an instance (shot) belongs to, since P(Y_t = 1|x, z) can be high for multiple Y_t. Therefore, an interesting direction for future work is to evaluate the model on a multilabel data set, where each instance can belong to any number of categories. In this case, our method is effectively a multitask learning (MTL) method and should be compared with other MTL approaches. Our models can also be improved by using better low-level features as input. The region-based color-histogram features are quite sensitive to scale and illumination variations. Features such as local keypoint features are more robust and can be easily integrated into our models. It would be interesting to compare the latent semantic interpretations and classification performance under different features.


APPENDIX

This appendix shows the derivation of the harmonium random field (joint distribution) in the dual-wing harmonium model. We start by introducing the general form of the exponential-family harmonium [10], which has H as the latent topic variables and X and Z as two types of observed data variables. This harmonium random field has the exponential form:

p(\mathbf{x}, \mathbf{z}, \mathbf{h}) \propto \exp\Big\{ \sum_{ia} \theta_{ia} f_{ia}(x_i) + \sum_{jb} \eta_{jb} g_{jb}(z_j) + \sum_{kc} \lambda_{kc} e_{kc}(h_k) + \sum_{ikac} W_{ia}^{kc} f_{ia}(x_i) e_{kc}(h_k) + \sum_{jkbc} U_{jb}^{kc} g_{jb}(z_j) e_{kc}(h_k) \Big\}

where {f_{ia}(·)}, {g_{jb}(·)}, and {e_{kc}(·)} denote the sufficient statistics (features) of the variables x_i, z_j, and h_k, respectively.

The marginal distribution p(x, z) is then obtained by integrating out the variables h:

p(\mathbf{x}, \mathbf{z}) = \int_{\mathbf{h}} p(\mathbf{x}, \mathbf{z}, \mathbf{h})\, d\mathbf{h}

\propto \exp\Big\{ \sum_{ia} \theta_{ia} f_{ia}(x_i) + \sum_{jb} \eta_{jb} g_{jb}(z_j) \Big\} \times \prod_k \int_{h_k} \exp\Big\{ \sum_c \Big( \lambda_{kc} + \sum_{ia} W_{ia}^{kc} f_{ia}(x_i) + \sum_{jb} U_{jb}^{kc} g_{jb}(z_j) \Big) e_{kc}(h_k) \Big\}\, dh_k

= \exp\Big\{ \sum_{ia} \theta_{ia} f_{ia}(x_i) + \sum_{jb} \eta_{jb} g_{jb}(z_j) + \sum_k C_k(\{\hat{\lambda}_{kc}\}) \Big\}

and similarly we can derive:

p(\mathbf{x}, \mathbf{h}) \propto \exp\Big\{ \sum_{ia} \theta_{ia} f_{ia}(x_i) + \sum_{kc} \lambda_{kc} e_{kc}(h_k) + \sum_j B_j(\{\hat{\eta}_{jb}\}) \Big\}

p(\mathbf{z}, \mathbf{h}) \propto \exp\Big\{ \sum_{jb} \eta_{jb} g_{jb}(z_j) + \sum_{kc} \lambda_{kc} e_{kc}(h_k) + \sum_i A_i(\{\hat{\theta}_{ia}\}) \Big\}

where the shifted parameters θ̂ia , η̂jb and λ̂kc are defined as:

\hat{\theta}_{ia} = \theta_{ia} + \sum_{kc} W_{ia}^{kc} e_{kc}(h_k), \quad \hat{\eta}_{jb} = \eta_{jb} + \sum_{kc} U_{jb}^{kc} e_{kc}(h_k)

\hat{\lambda}_{kc} = \lambda_{kc} + \sum_{ia} W_{ia}^{kc} f_{ia}(x_i) + \sum_{jb} U_{jb}^{kc} g_{jb}(z_j)

The functions Ai(·), Bj (·), and Ck(·) are defined as:

A_i(\{\hat{\theta}_{ia}\}) = \int_{x_i} \exp\Big\{ \sum_a \hat{\theta}_{ia} f_{ia}(x_i) \Big\}\, dx_i

B_j(\{\hat{\eta}_{jb}\}) = \int_{z_j} \exp\Big\{ \sum_b \hat{\eta}_{jb} g_{jb}(z_j) \Big\}\, dz_j

C_k(\{\hat{\lambda}_{kc}\}) = \int_{h_k} \exp\Big\{ \sum_c \hat{\lambda}_{kc} e_{kc}(h_k) \Big\}\, dh_k

Further integrating out variables from these distributions gives the marginal distributions of x, z, and h:

p(\mathbf{x}) \propto \exp\Big\{ \sum_{ia} \theta_{ia} f_{ia}(x_i) + \sum_j B_j(\{\hat{\eta}_{jb}\}) + \sum_k C_k(\{\hat{\lambda}_{kc}\}) \Big\}

p(\mathbf{z}) \propto \exp\Big\{ \sum_{jb} \eta_{jb} g_{jb}(z_j) + \sum_i A_i(\{\hat{\theta}_{ia}\}) + \sum_k C_k(\{\hat{\lambda}_{kc}\}) \Big\}

p(\mathbf{h}) \propto \exp\Big\{ \sum_{kc} \lambda_{kc} e_{kc}(h_k) + \sum_i A_i(\{\hat{\theta}_{ia}\}) + \sum_j B_j(\{\hat{\eta}_{jb}\}) \Big\}

With all the above marginal distributions, we are ready to derive the conditional distributions:

p(\mathbf{x} \mid \mathbf{h}) = \frac{p(\mathbf{x}, \mathbf{h})}{p(\mathbf{h})} \propto \prod_i \exp\Big\{ \sum_a \hat{\theta}_{ia} f_{ia}(x_i) - A_i(\{\hat{\theta}_{ia}\}) \Big\}

p(\mathbf{z} \mid \mathbf{h}) = \frac{p(\mathbf{z}, \mathbf{h})}{p(\mathbf{h})} \propto \prod_j \exp\Big\{ \sum_b \hat{\eta}_{jb} g_{jb}(z_j) - B_j(\{\hat{\eta}_{jb}\}) \Big\}

p(\mathbf{h} \mid \mathbf{x}, \mathbf{z}) = \frac{p(\mathbf{x}, \mathbf{z}, \mathbf{h})}{p(\mathbf{x}, \mathbf{z})} \propto \prod_k \exp\Big\{ \sum_c \hat{\lambda}_{kc} e_{kc}(h_k) - C_k(\{\hat{\lambda}_{kc}\}) \Big\}

The specific conditional distributions of x, z, and h defined in Eqs. (3), (4), and (5) are all exponential-family distributions. They can be mapped to the general forms above with the following definitions:

f_{i1}(x_i) = x_i, \quad \theta_{i1} = \alpha_i, \quad \hat{\theta}_{i1} = \alpha_i + \sum_k W_{ik} h_k

g_{j1}(z_j) = z_j, \quad g_{j2}(z_j) = z_j^2, \quad \eta_{j1} = \beta_j, \quad \eta_{j2} = -1/2, \quad \hat{\eta}_{j1} = \beta_j + \sum_k U_{jk} h_k

e_{k1}(h_k) = h_k, \quad e_{k2}(h_k) = h_k^2, \quad \lambda_{k1} = 0, \quad \lambda_{k2} = -1/2, \quad \hat{\lambda}_{k1} = \sum_i W_{ik} x_i + \sum_j U_{jk} z_j

Therefore, by plugging these definitions into the general form of the harmonium random field at the beginning of this appendix, we obtain the specific random field:

p(\mathbf{x}, \mathbf{z}, \mathbf{h}) \propto \exp\Big\{ \sum_i \alpha_i x_i + \sum_j \beta_j z_j - \sum_j \frac{z_j^2}{2} - \sum_k \frac{h_k^2}{2} + \sum_{ik} W_{ik} x_i h_k + \sum_{jk} U_{jk} z_j h_k \Big\}

which is exactly the same as Eq. (6), except that the latter is defined for a specific category.


