
Automatic image annotation using semi-supervised generative modeling

S. Hamid Amiri, Mansour Jamzad*

Department of Computer Engineering, Sharif University of Technology, Tehran, Iran

* Corresponding author. E-mail addresses: [email protected] (S. Hamid Amiri), [email protected] (M. Jamzad).

Article info

Article history: Received 30 October 2013; Received in revised form 11 June 2014; Accepted 9 July 2014; Available online 19 July 2014

Keywords: Image annotation; Semi-supervised learning; Generative modeling; Gamma distribution

Abstract

Image annotation approaches need an annotated dataset to learn a model for the relation between images and words. Unfortunately, preparing a labeled dataset is highly time consuming and expensive. In this work, we describe the development of an annotation system in a semi-supervised learning framework which, by incorporating unlabeled images into the training phase, reduces the system's demand for labeled images. Our approach constructs a generative model for each semantic class in two main steps. First, based on the Gamma distribution, a generative model is constructed for each semantic class using the labeled images in that class. The second step incorporates the unlabeled images by using a modified EM algorithm to update the parameters of the constructed generative models. Performance evaluation of the proposed method on a standard dataset reveals that using unlabeled images results in a considerable improvement in the accuracy of the annotation system when only a limited number of labeled images is available for each semantic class.

© 2014 Elsevier Ltd. All rights reserved.
Pattern Recognition 48 (2015) 174–188, http://dx.doi.org/10.1016/j.patcog.2014.07.012

1. Introduction

The growth of digital image datasets and photo sharing communities on the Internet makes it necessary to provide a proper mechanism for searching images in large collections. Content-based image retrieval (CBIR) focuses on the problem of searching images based on their content. The first generation of CBIR was based on the query-by-example (QBE) paradigm [1], in which the CBIR system retrieves the images most visually similar to a query image given by the user. The semantic gap between low-level features and high-level concepts is a fundamental problem in QBE systems. To bridge the semantic gap, many systems use the relevance feedback approach [2,3] to incorporate user knowledge into the retrieval process. Besides these query-based search strategies, the new generation focuses on the development of automatic systems that semantically describe the content of an image. In this approach, a set of semantic labels is assigned to each image to describe its content. Then, a system is developed to train a model for the relation between visual features and tags of images. This formulation, called automatic image annotation or linguistic indexing, allows the system to assign words to every new test image. Furthermore, retrieval can then be performed based on input texts provided by the user. To bridge the semantic gap in an annotation system, some methods utilize active learning [4] to exploit user knowledge in developing the annotation system and thus reduce the required supervision. In what follows, we briefly overview the main features of prior studies.

1.1. Related works

There has been a great effort to design annotation systems using statistical learning. According to Ref. [5], one strategy for statistical annotation is unsupervised labeling [6,7], which estimates the joint density of visual features and words by an unsupervised learning algorithm. These methods introduce a hidden variable and assume that features and words are independent given the value of the hidden variable. Another formulation for statistical annotation is supervised multi-class labeling (SML) [5,8], which estimates a conditional distribution for each semantic class to determine the probability of a feature vector given the semantic label.

Regardless of the learning strategy, the training dataset plays a major role in developing annotation systems. Manual assignment of many words to a large number of images in the dataset is highly time consuming and labor intensive. On the other hand, a classifier may generalize poorly when some semantic labels have only a few images. It therefore seems essential to develop an annotation system that depends on a small number of labeled images in the training phase.

Although it is difficult to prepare an annotated dataset, one can easily obtain unlabeled images in large quantities (e.g., using photo sharing communities on the Internet such as Flickr and ImageNet). This large amount of pictures motivates image annotation systems to increase their generalization by incorporating unlabeled images into the training phase.

To reach this aim, semi-supervised learning (SSL) [9] has emerged in the machine learning community. Semi-supervised generative models [10,11] and graph-based methods [12] are two main classes of SSL that have recently received a great deal of interest, especially in image annotation [13–20].

A large number of semi-supervised annotation methods utilize graph-based learning to infer the tags of unlabeled images. The main challenging issue in these methods is graph construction. Liu et al. [13] proposed a method called nearest spanning chain (NSC) which constructs the learning graph using chain-wise statistical information instead of the traditional pair-wise similarities. Besides constructing multiple NSCs, which is computationally expensive, they do not show how the number of labeled images affects the annotation results. The authors of Ref. [14] focused on graph construction in the presence of noisy annotations. To this end, by solving an optimization problem, a sparse graph is constructed and the training process is modified to handle noisy annotations. A disadvantage of this approach is that the edges of the sparse graph are considered to be a subset of the edges of a kNN graph. The method in Ref. [15] constructs a bi-relational graph that comprises both visual features and semantic labels of images. By propagating the words of annotated images over the bi-relational graph, annotations of unlabeled images are extracted.

To further refine the annotation results, some works combine graph-based learning with other techniques. Tang et al. [16] proposed a method that combines graph-based learning with multiple instance learning [17] for image annotation. After constructing two graphs based on multiple and single instance representations, the graphs are integrated into a unified graph for the learning process. This method utilizes a simple weighted-sum rule for integration and suffers from early fusion problems in multi-modal representation [18]. Shao et al. [19] presented a framework that combines graph-based learning with a probabilistic model for learning latent topics of images. In this framework, a hidden variable is associated with each image and the probabilities of the hidden variables are considered to reside on a manifold. This approach suffers from the limitations of unsupervised labeling discussed in Ref. [5]. Zhu et al. [20] proposed a technique that utilizes graph-based learning to refine the candidate annotations obtained from the progressive relevance-based method [21]. Since the candidate annotations are obtained by propagating words from labeled images to unlabeled ones, noisy annotations will highly degrade the performance of this approach.

In addition to the above limitations, two main problems should be considered in graph-based image annotation. First, these methods are transductive and can only predict labels for the specific unlabeled samples observed in the training phase. To annotate a new test image, we must add the image to the unlabeled set and run the training phase again. The second problem is the memory and time complexity of graph-based approaches. Their primary bottleneck is the handling of the adjacency matrix of the graph, whose size grows super-linearly with the number of images.

It has been shown that annotation can be performed in a short time when generative models are utilized in the SML formulation [8]. Additionally, generative models are inductive and can predict the label of every sample in the feature space. Thus, in this work, we focus on semi-supervised generative models for image annotation. We follow the PSU protocol [5] for developing the system, which was previously utilized in a supervised system called ALIPR [8].

1.2. Overview of ALIPR

ALIPR utilizes the Corel60k [7] dataset, whose images are organized based on the PSU protocol [5]. This protocol assumes that images are divided into distinct categories or "concepts". Each concept contains a set of words describing all the images in that concept, even though some of these words do not occur in each individual image. Two different concepts may share some words in their descriptions. Corel60k is comprised of 599 concepts with about 100 images in each concept. The dataset also includes a total of 417 distinct words.

With the above structure, ALIPR constructs a generative model for each concept as follows:

(1) Extracting the color and texture signatures of images.
(2) Partitioning the images of a concept using a new clustering algorithm named D2-clustering and extracting a prototype for each cluster.
(3) Computing the distance between each prototype and the signatures assigned to it.
(4) Fitting a Gamma distribution to each cluster based on the distances in that cluster.
(5) Combining the distribution components to construct a mixture model for each concept.

The advantage of ALIPR over prior statistical systems such as [5–7] was its higher speed in the annotation procedure.

1.3. Contributions of this paper

The current study aims to develop a semi-supervised annotation system that includes unlabeled images in concept modeling. To this end, it makes the following contributions:

• We propose a new feature extraction strategy that obtains color and texture signatures in less time than ALIPR. Besides, the new features are more discriminative.
• We embed spectral clustering in the prototype extraction to overcome limitations of ALIPR. Indeed, D2-clustering in ALIPR is a divisive hierarchical clustering and has two limitations. First, since a set of linear optimization problems with a large number of variables must be solved at each level of clustering, prototype extraction is computationally expensive. Second, the division process utilizes a greedy strategy which may not lead to well-structured clusters.
• To use unlabeled images in the training phase, the formulation of prototype extraction is modified by assigning a weight to each signature that indicates the membership degree of the signature to one concept.
• Given an initial model for each concept, we incorporate unlabeled images into the training phase to improve the models by modifying the parameters of the clusters. Thus, a new parameter estimation method is presented which utilizes the unlabeled images.

The last two contributions provide semi-supervised annotation and are considered the main contributions of the proposed solution. The first two contributions, on the other hand, are not the main contributions but are necessary for achieving good efficiency in the annotation system.¹

In our approach, we assume that the training dataset follows the PSU protocol. However, in our experiments, we will also discuss how our approach can be applied to datasets that are not organized based on the PSU protocol. In fact, the PSU protocol assumption in the training phase does not weaken the applicability of our approach to other datasets.

¹ An early version of the supervised framework [24] explains the feature extraction and clustering phases.


The remainder of this paper is organized as follows. Feature extraction is discussed in Section 2. In Section 3, we focus on the problem of prototype extraction for a set of image signatures. Details of the training phase are discussed in Section 4. The experimental results are given in Section 5. Finally, we conclude our work in Section 6.

2. Feature extraction

To describe the content of images, color and texture features are extracted in this work. We refer to the color and texture descriptors of an image as its signature. To form a signature, color and texture segmentations are obtained for the input image using two distinct clustering algorithms as described below.

To segment an image based on color, it is mapped from the RGB color space to the Lab space. Then, we apply Meanshift [22] segmentation followed by an agglomerative hierarchical clustering [23] to the Lab components of the pixels. The details of this algorithm are discussed in Ref. [24].

To perform texture segmentation, the gray-scale version of the input image, obtained from the L component, is divided into distinct blocks of size 16×16 pixels. Then, Gabor features [25] are extracted from each block and a clustering algorithm is applied to these feature vectors for texture-based segmentation. The details of this algorithm are given in Ref. [24].

Suppose that the above procedure provides m segments for an input image. The centroid of each segment is defined by averaging its feature vectors. To extract a descriptor from the segmentation output, a discrete distribution is defined as $\{(\nu_1, p_1), (\nu_2, p_2), \ldots, (\nu_m, p_m)\}$, where $\nu_i$ is the centroid of the $i$th segment and $p_i$ denotes its probability, estimated by the percentage of pixels associated with the $i$th segment. Since each image is segmented using color and texture separately, we obtain two discrete distributions that form the image signature.

To compute the distance between two image signatures, we need a metric defined between two discrete distributions. The Mallows distance [26] is utilized to calculate the distance between two discrete distributions $\delta_1$ and $\delta_2$ defined as $\delta_i = \{(\nu^i_1, p^i_1), \ldots, (\nu^i_{m_i}, p^i_{m_i})\}$, $i = 1, 2$, where $m_i$ is the number of nonzero probabilities in $\delta_i$:

$$D^2(\delta_1, \delta_2) = \min_{W} \sum_{i=1}^{m_1} \sum_{j=1}^{m_2} w_{i,j} \,\|\nu^1_i - \nu^2_j\|^2, \qquad W = \{w_{i,j} \ge 0 \mid i = 1, \ldots, m_1 \text{ and } j = 1, \ldots, m_2\} \qquad (1)$$

where $w_{i,j}$ corresponds to the joint probability of $\nu^1_i$ and $\nu^2_j$. Because of the marginal distributions, we must add the following constraints to the above problem:

$$\sum_{j=1}^{m_2} w_{i,j} = p^1_i, \quad i = 1, \ldots, m_1; \qquad \sum_{i=1}^{m_1} w_{i,j} = p^2_j, \quad j = 1, \ldots, m_2 \qquad (2)$$

The objective function $D^2$ in Eq. (1) and the constraints in Eq. (2) are linear with respect to the set of optimization variables $W$, forming a problem with a linear programming (LP) solution.

Note that an image signature is comprised of two separate discrete distributions for color and texture. Therefore, the total distance between a pair of signatures $I_i$ and $I_j$ is calculated by summing the squared Mallows distances between their individual discrete distributions:

$$\tilde{D}(I_i, I_j) = \sum_{l=1}^{d} \lambda_l \, D^2(I_{i,l}, I_{j,l}) \qquad (3)$$

where $d$ is the number of distributions in a signature and $I_{i,l}$ denotes the $l$th distribution of $I_i$. In our experiments, a signature consists of color and texture parts, so we set $d = 2$. $\lambda_l$ is a fixed weight used to normalize the distance values of the color and texture parts. To determine these weights, we select a set of images with similar content and compute the Mallows distance among all pairs of their color (texture) distributions. Then, we statistically analyze the distance values to specify the weights $\lambda_1$ and $\lambda_2$ such that the color and texture parts have equal importance in computing the total distance $\tilde{D}$ in Eq. (3). The details of this strategy are given in Section 5.1.
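The LP in Eqs. (1) and (2) and the weighted combination in Eq. (3) can be set up directly with an off-the-shelf solver. The following sketch uses SciPy's linprog for illustration; the function names and the representation of a signature as a list of (support points, probabilities) pairs are our own assumptions, not the paper's implementation (which was written in MATLAB).

    import numpy as np
    from scipy.optimize import linprog

    def mallows_sq(V1, p1, V2, p2):
        # Squared Mallows distance (Eqs. (1)-(2)): transport plan w minimizing
        # sum_ij w_ij * ||v1_i - v2_j||^2 with row sums p1 and column sums p2.
        m1, m2 = len(p1), len(p2)
        cost = ((V1[:, None, :] - V2[None, :, :]) ** 2).sum(-1)   # (m1, m2)
        A_eq = np.zeros((m1 + m2, m1 * m2))
        for i in range(m1):
            A_eq[i, i * m2:(i + 1) * m2] = 1.0                    # sum_j w_ij = p1_i
        for j in range(m2):
            A_eq[m1 + j, j::m2] = 1.0                             # sum_i w_ij = p2_j
        b_eq = np.concatenate([p1, p2])
        res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
        return res.fun

    def total_distance(sig_i, sig_j, lambdas):
        # Eq. (3): weighted sum of the color and texture Mallows distances.
        # A signature is a list of (support, probabilities) pairs, one per part.
        return sum(lam * mallows_sq(Vi, pi, Vj, pj)
                   for lam, (Vi, pi), (Vj, pj) in zip(lambdas, sig_i, sig_j))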

Finally, the feature extraction methodology in this paper can bedistinguished from the one in ALIPR as described here. While bothALIPR and the proposed system extract color components in Labspace, but we take a different approach for color segmentation(i.e. a two-step clustering approach [24]). To obtain texture signa-ture, ALIPR extracts texture descriptors from wavelet domain andperforms K-means clustering on feature vectors. In contrast, weapply Gabor transform to images and utilize a graph-based cluster-ing approach [24] for texture segmentation. Gabor has the advan-tage of providing more details for an image in transform domain.

3. Prototype extraction

Assume that a set of image signatures is given and a weight is assigned to each signature representing its importance. The goal is to extract one prototype as the representative of the set of signatures. This problem is similar to the update step in K-means clustering, which calculates the centroid of a cluster.

We can formulate the extraction of prototype $\alpha_\eta$ for a cluster $C_\eta$ as follows:

$$\alpha_\eta = \arg\min_{\alpha \in \Omega} \sum_{i \in C_\eta} \rho_i \, \tilde{D}(\alpha, I_i) \qquad (4)$$

where $\Omega = \Omega_1 \times \Omega_2$ defines the space of prototypes such that $\Omega_k$ ($k = 1, 2$) indicates the space of the $k$th discrete distribution in the image signatures. Also, $\rho_i$ is the weight of image signature $I_i$. Since the prototype must have the same structure as the image signatures, it also contains two discrete distributions for the color and texture parts. Substituting Eq. (3) in (4), we have

$$\alpha_\eta = \arg\min_{\alpha \in \Omega} \sum_{i \in C_\eta} \rho_i \sum_{l=1}^{d} \lambda_l \, D^2(I_{i,l}, \alpha_{\cdot,l}) = \sum_{l=1}^{d} \lambda_l \arg\min_{\alpha_{\cdot,l} \in \Omega_l} \sum_{i \in C_\eta} \rho_i \, D^2(I_{i,l}, \alpha_{\cdot,l}) \qquad (5)$$

where $I_{i,l} = \{(\nu^{i,l}_1, p^{i,l}_1), (\nu^{i,l}_2, p^{i,l}_2), \ldots, (\nu^{i,l}_{m_{i,l}}, p^{i,l}_{m_{i,l}})\}$ is the $l$th discrete distribution in $I_i$. The above equation indicates that the optimization must be performed separately for the color and texture parts of the signatures. By applying the Mallows distance of Eq. (1) to the above problem, we have to solve the optimization problem in Eq. (6) to determine $\alpha_{\eta,l} = \{(a^{\eta,l}_1, q^{\eta,l}_1), \ldots, (a^{\eta,l}_{m_\alpha}, q^{\eta,l}_{m_\alpha})\}$:

$$\min_{\alpha_{\eta,l} \in \Omega_l} \sum_{i \in C_\eta} \rho_i \, D^2(I_{i,l}, \alpha_{\cdot,l}) = \min_{(a^{\eta,l}_k, q^{\eta,l}_k)} \sum_{i \in C_\eta} \rho_i \min_{w^{(i)}_{k,j}} \sum_{k=1}^{m_\alpha} \sum_{j=1}^{m_{i,l}} w^{(i)}_{k,j} \,\|a^{\eta,l}_k - \nu^{i,l}_j\|^2$$
$$\text{such that:} \quad \sum_{k=1}^{m_\alpha} q^{\eta,l}_k = 1, \quad q^{\eta,l}_k \ge 0, \quad \sum_{j=1}^{m_{i,l}} w^{(i)}_{k,j} = q^{\eta,l}_k, \quad \sum_{k=1}^{m_\alpha} w^{(i)}_{k,j} = p^{i,l}_j, \quad w^{(i)}_{k,j} \ge 0 \qquad (6)$$

To solve the above minimization problem with variables $w^{(i)}_{k,j}$, $a^{\eta,l}_k$ and $q^{\eta,l}_k$, a two-step iterative algorithm is performed. In the first step, we assume the $a^{\eta,l}_k$'s are fixed and solve the optimization for the $w^{(i)}_{k,j}$'s and $q^{\eta,l}_k$'s. As the constraints and the objective function are linear in this step, we use linear programming for the optimization. In the second step, given the $w^{(i)}_{k,j}$'s and $q^{\eta,l}_k$'s, we update the $a^{\eta,l}_k$'s. If we compute the derivative of the objective function in Eq. (6) with respect to $a^{\eta,l}_k$ and set it to zero, we obtain

$$a^{\eta,l}_k = \frac{\sum_{i \in C_\eta} \rho_i \sum_{j=1}^{m_{i,l}} w^{(i)}_{k,j} \, \nu^{i,l}_j}{\sum_{i \in C_\eta} \rho_i \sum_{j=1}^{m_{i,l}} w^{(i)}_{k,j}} \qquad (7)$$

S. Hamid Amiri, M. Jamzad / Pattern Recognition 48 (2015) 174–188176

Page 4: Pattern Recognition - پیپرسیمpapersim.com/wp-content/uploads/2018/03/Image...Automatic image annotation using semi-supervised generative modeling S. Hamid Amiri, Mansour Jamzadn

After initializing the prototype, the two steps are iterated until the change in the objective function becomes less than a threshold. The value of this threshold will be defined in Section 5.1. Initialization of the prototype $\alpha_{\eta,l}$ is explained in Section 4.1.

The above procedure for prototype extraction will be used during the training process. Based on our statistical learning framework, we can compute the probability that an image signature belongs to a cluster. This probability is considered as the weight of that signature during prototype extraction. Therefore, the weight $\rho_i$ is adaptively specified during the training process. The details are given in Section 4.3.2.

It should be noted that the above procedure is a modified version of the prototype extraction step in D2-clustering [8]. The main difference is that we consider a weight for each signature. Before applying prototype extraction, images must be clustered into separate groups. D2-clustering groups similar signatures together through a divisive hierarchical clustering: after calculating a prototype for an image cluster, it splits that cluster and performs the prototype extraction step again. This greedy strategy leads to suboptimal clusters and also makes the clustering algorithm computationally expensive. Instead, we extract groups of similar signatures by spectral clustering, which is introduced in Section 4. We call the above algorithm weighted D2-clustering, indicating that weighted signatures are used for clustering.
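To make the centroid update concrete, the second step of the iteration reduces to the weighted average in Eq. (7): each support point of the prototype is the ρ-weighted mean of the signature support points matched to it by the transport weights. A minimal sketch, assuming the $w^{(i)}_{k,j}$ from the LP step are already available (names are illustrative):

    import numpy as np

    def update_support_points(signatures, weights, transport):
        # signatures: list of (V_i, p_i) with V_i of shape (m_i, D)
        # weights:    list of scalars rho_i
        # transport:  list of arrays w_i of shape (K, m_i), one per signature
        # Returns the new prototype support points a_k, shape (K, D), per Eq. (7).
        K = transport[0].shape[0]
        D = signatures[0][0].shape[1]
        num = np.zeros((K, D))
        den = np.zeros(K)
        for (V, _), rho, w in zip(signatures, weights, transport):
            num += rho * (w @ V)          # rho_i * sum_j w_kj * v_j, for every k
            den += rho * w.sum(axis=1)    # rho_i * sum_j w_kj
        return num / den[:, None]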

4. Training

We assume that the training images are organized based on the PSU protocol as discussed in Section 1.2. The objective of the training phase is to fit a mixture model to each concept. The model calculates the probability that an image appears in that concept. Since the design of the annotation system is based on semi-supervised learning, we employ both the unlabeled and labeled images of a concept to construct its model. Note that unlabeled images are shared among all concepts in the training phase.

The mixture model of a concept is obtained in two major steps. First, an initial model is constructed using the labeled images of that concept. Then, the parameters of the models are updated by incorporating unlabeled images.

4.1. Initial modeling

To construct the initial model of a specific concept, we follow three steps: (1) clustering the labeled images of the concept; (2) calculating one prototype for each cluster; and (3) fitting a mixture model based on the prototypes in the concept.

To cluster the images of one concept, we utilize the self-tuning spectral clustering algorithm in Ref. [27]. This algorithm operates on a similarity graph G = (V, W) whose nodes correspond to the labeled image signatures of the concept. The distance between nodes i and j is defined as follows:

$$\mathrm{dis}(I_i, I_j) = D^{(c)}(I_i, I_j)/\sigma_c + D^{(t)}(I_i, I_j)/\sigma_t \qquad (8)$$

where $D^{(c)}$ and $D^{(t)}$ compute the Mallows distances between the color and texture signatures, respectively. To normalize these distances before combining them, we divide them by $\sigma_c$ and $\sigma_t$, the standard deviations of the color and texture distances between all images.

To form the edge matrix W, each node is connected to its k nearest neighbors on the graph. The weights of these edges are determined using the Gaussian function:

$$w_{ij} = \exp\!\left(-\mathrm{dis}(I_i, I_j)/\sigma^2\right) \qquad (9)$$

where $\sigma$ is a scaling parameter determined automatically using the local scaling approach [27].

We run the self-tuning spectral clustering [27] on the constructed graph to obtain the clusters of nodes. The advantage of this algorithm is the automatic selection of the number of clusters.

Our experiments show that small values of k in the kNN approach work well for clustering. Thus, we set k = 3 in the experiments. As a sample of the clustering output, Fig. 1 shows the result for the concept "autumn" in the Corel60k dataset when the number of labeled images is 80.
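The graph construction of Eqs. (8) and (9) can be sketched as follows, given a precomputed matrix of the combined distances. Eq. (9) leaves the scaling parameter to the local scaling approach of Ref. [27]; here we use the common per-pair form with the distance to the k-th neighbour, which is our reading of that approach rather than a detail stated in the paper.

    import numpy as np

    def build_similarity_graph(dist, k=3):
        # dist: (n, n) symmetric matrix of dis(I_i, I_j) values from Eq. (8).
        n = dist.shape[0]
        sigma = np.sort(dist, axis=1)[:, k]          # local scale: k-th neighbour distance
        W = np.zeros((n, n))
        for i in range(n):
            nbrs = np.argsort(dist[i])[1:k + 1]      # k nearest neighbours, skip self
            for j in nbrs:
                w = np.exp(-dist[i, j] / (sigma[i] * sigma[j]))   # Gaussian weight, cf. Eq. (9)
                W[i, j] = W[j, i] = w                # keep the adjacency symmetric
        return W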

After clustering the labeled images of one concept, we must determine a prototype for each cluster of signatures. To this end, we employ the prototype extraction algorithm of Section 3. In this algorithm, $\rho_i$ (the weight of image signature $I_i$) indicates the prior probability that $I_i$ belongs to a concept. Since a labeled image belongs to a certain concept, we set its weight to one in the prototype extraction algorithm.

As stated in Section 3, the prototype of each cluster is determined by an iterative strategy. To initialize the prototype of a typical cluster $C_\eta$, we compute the closeness centrality [28] for each signature in the cluster. Then, the signature with maximum closeness centrality is selected as the initial prototype. This process is discussed in detail in Ref. [24]. It should be noted that the discrete distribution $\alpha_{\eta,l}$ (see Section 3) in the initial prototype is forced to follow a uniform distribution.

After determining the prototypes of one concept, a mixture model is constructed for that concept. Similar to ALIPR, we design the kernel of the mixtures based on the Mallows distance between each image signature and its nearest prototype. It has been shown that these distances follow a Gamma distribution with two parameters b (scale) and s (shape) [8]:

$$f(u) = \frac{(u/b)^{s-1} e^{-u/b}}{b\,\Gamma(s)}, \quad u \ge 0 \qquad (10)$$

where u indicates the distance sample and $\Gamma(\cdot)$ is the Gamma function.

Thus, we can fit a Gamma distribution to each cluster based on the Mallows distances between the prototype and the signatures of that cluster. This results in Eq. (11) for the probability that an image signature I belongs to a cluster with prototype $\alpha_\eta$ [8]:

$$p(I \mid \alpha_\eta) = \left(1/\sqrt{\pi b_\eta}\right)^{2s} \exp\!\left(-\tilde{D}(I, \alpha_\eta)/b_\eta\right) \qquad (11)$$

Here, $b_\eta$ and s are the parameters of the Gamma distribution. Considering a prototype as the centroid of a mixture component, the overall model of the kth concept with m prototypes is obtained by combining the distributions around the prototypes:

$$\Phi(I \mid \mu_k) = \sum_{\eta=1}^{m} \omega_\eta \left(1/\sqrt{\pi b_\eta}\right)^{2s} \exp\!\left(-\tilde{D}(I, \alpha_\eta)/b_\eta\right) \qquad (12)$$

Fig. 1. The output of clustering for the concept "autumn". Images in the same cluster are enclosed by dashed lines.


where $\mu_k$ is the model of the kth concept. Also, the $\omega_\eta$'s are the prior probabilities of the clusters $\eta = 1, \ldots, m$ in the kth concept, with the constraint $\sum_{\eta=1}^{m} \omega_\eta = 1$. These priors are estimated by the percentage of signatures assigned to the prototype $\alpha_\eta$. It should be noted that a common parameter s is used for all component distributions in all concepts. As this parameter specifies the shape of the Gamma distribution, all component distributions have the same shape.

To estimate the parameters of the components in the initial models, the maximum likelihood (ML) method is used. We discuss the objective function of the estimator in the next section and explain how to estimate the parameters of the initial models in Section 4.4.
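Once the prototypes, scales and shared shape are available, evaluating the concept model of Eq. (12) for a signature is a direct sum over the mixture components. A small sketch (names are illustrative):

    import numpy as np

    def concept_likelihood(dists, omega, b, s):
        # Eq. (12): Phi(I | mu_k) for one signature, given its Mallows distances
        # to the m prototypes of the concept, the mixture priors omega (summing
        # to 1), the per-cluster scales b and the shared shape s.
        dists, omega, b = map(np.asarray, (dists, omega, b))
        comp = (1.0 / np.sqrt(np.pi * b)) ** (2 * s) * np.exp(-dists / b)
        return float(np.sum(omega * comp))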

4.2. Objective function for parameter estimation

Suppose that $M_k$ represents the number of prototypes for the kth concept, $M = \sum_k M_k$ is the total number of clusters, and $C_j$ ($j = 1, \ldots, M$) indicates the jth cluster with prototype $\alpha_j$. As defined in Eq. (12), the generative model of one concept is based on the Mallows distances between image signatures and prototypes. Therefore, the likelihood function is expressed in terms of the Mallows distances between signatures and their nearest prototypes.

We assume that every signature is confined to fall into one cluster. If $\mathrm{Concept}_{I_i}$ is the concept index of signature $I_i$, we can convert the sample $\langle I_i, \mathrm{Concept}_{I_i} \rangle$ into a unique sample $\langle u_i, z_i \rangle$, where $z_i$ is the index of the cluster that contains $I_i$. In fact, $z_i$ is a random variable taking values 1 to M, where M is the total number of clusters. Moreover, $u_i$ represents the Mallows distance between $I_i$ and the prototype of the cluster with index $z_i$. Considering $\theta_j = \{\alpha_j, b_j, s\}$ as the parameters of cluster $C_j$, the likelihood function for N signatures is defined as follows:

$$L(\theta; \mathcal{U}, \mathcal{Z}) = \prod_{i=1}^{N} \sum_{j=1}^{M} I(z_i = j)\, f(u_i \mid \theta_j) \qquad (13)$$

where $\theta = \{\theta_1, \ldots, \theta_M\}$ is the set of parameters of the M clusters. The sets $\mathcal{Z} = \{z_1, \ldots, z_N\}$ and $\mathcal{U} = \{u_1, \ldots, u_N\}$ indicate the cluster indices and the Mallows distances between the N signatures and their corresponding prototypes, respectively. Additionally, $f(u_i \mid \theta_j)$ is defined in Eq. (10) and $I(\cdot)$ denotes the indicator function, which returns one if its input expression is true and zero otherwise.

Eq. (13) can be rewritten as follows:

$$L(\theta; \mathcal{U}, \mathcal{Z}) = \exp\!\left(\sum_{i=1}^{N} \sum_{j=1}^{M} I(z_i = j) \log f(u_i \mid \theta_j)\right) \;\Rightarrow\; \log L(\theta; \mathcal{U}, \mathcal{Z}) = \sum_{i=1}^{N} \sum_{j=1}^{M} I(z_i = j) \log f(u_i \mid \theta_j) \qquad (14)$$

This function is regarded as the objective function for parameter estimation. Given the sets $\mathcal{Z}$ and $\mathcal{U}$, the ML estimate of $\theta$ is obtained by maximizing $\log L(\theta; \mathcal{U}, \mathcal{Z})$.

4.3. Incorporating unlabeled images

After constructing the initial models of the concepts, we incorporate unlabeled images to update the parameters of the component distributions of the generative models. As discussed in the previous section, the parameters are estimated using samples $\langle u_i, z_i \rangle_{i=1}^{N}$ which stem from complete observations $\langle I_i, \mathrm{Concept}_{I_i} \rangle_{i=1}^{N}$. We have a complete observation for an image signature $I_i$ if we know to which concept it belongs. Unfortunately, the concepts of unlabeled images are unknown, which leads to incomplete observations. We must convert them to complete observations to incorporate unlabeled images in the parameter estimation.

Let $D_l = \{\langle I_1, \mathrm{Concept}_{I_1} \rangle, \ldots, \langle I_l, \mathrm{Concept}_{I_l} \rangle\}$ and $D_u = \{I_{l+1}, \ldots, I_{l+u}\}$ be the labeled and unlabeled image sets, where l and u indicate the number of labeled and unlabeled images, respectively. We arrange a vector $\Pi_i$ for each image signature $I_i$ whose dimension is equal to the number of concepts. Each element of this vector indicates the membership degree of the signature to the corresponding concept. If the labeled signature $I_i$ is assigned to the kth concept ($\mathrm{Concept}_{I_i} = k$), then the kth component of $\Pi_i$ is equal to one; otherwise its value is zero. To form this vector for unlabeled images, we use the current models of the concepts to calculate the membership degrees. Therefore, for T concepts, the elements of $\Pi_i$ are defined as follows:

$$\Pi_i(k) = \Phi(\mu_k \mid I_i) = \begin{cases} \dfrac{p_k \, \Phi(I_i \mid \mu_k)}{\sum_{t=1}^{T} p_t \, \Phi(I_i \mid \mu_t)}, & I_i \in D_u \\[2mm] I(\mathrm{Concept}_{I_i} = k), & I_i \in D_l \end{cases} \qquad \forall k = 1, \ldots, T \qquad (15)$$

where $p_k$ is the prior probability of the kth concept, taken to be uniform. Also, $I(\cdot)$ is the indicator function, and $\mu_k$ and $\mu_t$ indicate the generative models of concepts k and t, respectively.

The concept index of an unlabeled image is regarded as a hidden variable. Since each sample $\langle I_i, \mathrm{Concept}_{I_i} \rangle$ is converted to a unique sample $\langle u_i, z_i \rangle$, the variables $z_i$ are also hidden. To find the ML estimate of the parameters in the presence of hidden variables, we use a modified EM algorithm. This algorithm alternates between an E-step and an M-step, as explained below.

4.3.1. Calculating the posterior probabilities

In the E-step, we compute the expected value of the log-likelihood function in Eq. (14) with respect to the conditional distribution of $\mathcal{Z}$ given $\mathcal{U}$ and $\theta^{(t)}$, the current estimate of the parameters at iteration t. This yields the following equation:

$$Q(\theta) = E_{\mathcal{Z} \mid \theta^{(t)}, \mathcal{U}}\big[\log L(\theta; \mathcal{U}, \mathcal{Z})\big] = E_{\mathcal{Z} \mid \theta^{(t)}, \mathcal{U}}\!\left[\sum_{i=1}^{N} \sum_{j=1}^{M} I(z_i = j) \log f(u_i \mid \theta_j)\right] = \sum_{i=1}^{N} \sum_{j=1}^{M} E[I(z_i = j)] \log f(u_i \mid \theta_j) \qquad (16)$$

where $E[I(z_i = j)]$ is defined as follows:

$$E[I(z_i = j)] = p(z_i = j) \qquad (17)$$

The term $p(z_i = j)$ in this equation is the probability that image signature $I_i$ belongs to the cluster $C_j$. To determine this probability, we use the fact that each signature is assigned to the closest prototype in a concept. Let $C_j$ be a cluster in the kth concept. If it contains the closest prototype to signature $I_i$, we set $p(z_i = j) = \Pi_i(k)$, the prior probability of the concept as defined in Eq. (15). For the other clusters in the kth concept, $p(z_i = j) = 0$. Therefore, we have the following equation for $p(z_i = j)$:

$$p(z_i = j) = \Pi_i(k) \cdot I\big(C_j = \mathrm{Nearest}_k(I_i)\big) \qquad (18)$$

where $\mathrm{Nearest}_k(I_i)$ is the cluster in the kth concept that contains the closest prototype to $I_i$. Obviously, if $I_i$ is a labeled image with concept index k ($\mathrm{Concept}_{I_i} = k$), then $p(z_i = j) = 1$ for the closest prototype of the kth concept and $p(z_i = j) = 0$ otherwise. To sum up, the E-step calculates the probabilities $p(z_i = j)$ using Eq. (18) to determine the prior probabilities of the signatures.
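The E-step therefore combines Eqs. (15) and (18): compute the membership degree of every signature to every concept and place that mass on the closest prototype inside each concept. The following sketch assumes uniform concept priors (as stated for Eq. (15)) and precomputed distances; all names are illustrative.

    import numpy as np

    def e_step(dist_to_proto, proto_concept, concept_models, labels):
        # dist_to_proto: (N, M) Mallows distances from signatures to all prototypes
        # proto_concept: length-M array, concept index of each cluster/prototype
        # concept_models: callable(i, k) -> Phi(I_i | mu_k), the concept likelihood
        # labels: length-N array, concept index for labeled images, -1 if unlabeled
        # Returns p, an (N, M) matrix with p[i, j] = p(z_i = j).
        N, M = dist_to_proto.shape
        T = proto_concept.max() + 1
        p = np.zeros((N, M))
        for i in range(N):
            # membership degrees Pi_i(k), Eq. (15); uniform concept priors cancel
            if labels[i] >= 0:
                pi = np.eye(T)[labels[i]]
            else:
                phi = np.array([concept_models(i, k) for k in range(T)])
                pi = phi / phi.sum()
            # Eq. (18): mass goes only to the closest prototype inside each concept
            for k in range(T):
                idx = np.where(proto_concept == k)[0]
                j_star = idx[np.argmin(dist_to_proto[i, idx])]
                p[i, j_star] = pi[k]
        return p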

4.3.2. Updating the parameters of clusters

Based on the labeled and unlabeled images and their prior probabilities, the M-step updates the parameters of each cluster. We explain the details of this step for a typical cluster $C_j$; for the other clusters, the update process is performed in an identical manner.

First, we utilize the weighted D2-clustering algorithm described in Section 3 to update the prototype of cluster $C_j$ using the labeled and unlabeled images assigned to the cluster. To run the clustering algorithm, we consider the current prototype as the initial prototype and follow the iterations discussed in Section 3. Moreover, the weights of the signatures are specified using the probabilities $p(z_i = j)$ from the E-step. Thus, in the weighted D2-clustering algorithm, we set $\rho_i = p(z_i = j)$ as the weight of signature $I_i$.

After the prototype extraction algorithm converges, we calculate the average within-cluster distance for $C_j$ by the following equation:

$$\bar{u}_j = \frac{\sum_{i \in C_j} \rho_i u_i}{\sum_{i \in C_j} \rho_i} \qquad (19)$$

where $u_i$ is the Mallows distance between the signature $I_i$ and the final prototype. If the average distance $\bar{u}_j$ is greater than a predefined threshold, the cluster is split into two clusters. To this end, a signature is randomly chosen from cluster $C_j$ as a new prototype. Given the current prototype of cluster $C_j$ and the newly selected prototype, the signatures in $C_j$ are assigned to one of the two prototypes based on their distances to them. After initializing each cluster with the two current prototypes, the prototype extraction step is applied to each cluster again, separately. The splitting procedure is recursively applied to new clusters until the within-cluster distance is less than the threshold. Note that the probability that signature $I_i$ is selected as a new prototype is defined by the weight $p^{\mathrm{sel}}_i = \rho_i / \sum_{k \in C_j} \rho_k$. The value of the splitting threshold is listed in Table 2.
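The splitting test described above can be summarized in a few lines: compare the weighted average distance of Eq. (19) with the threshold and, if it is exceeded, draw a new prototype with probabilities $p^{\mathrm{sel}}_i$. A sketch with illustrative names:

    import numpy as np

    def maybe_split_cluster(u, rho, threshold=300.0):
        # u: distances of the cluster's signatures to its prototype
        # rho: their weights p(z_i = j); threshold 300 is the value in Table 2.
        u_bar = np.sum(rho * u) / np.sum(rho)        # Eq. (19)
        if u_bar <= threshold:
            return None                              # no split needed
        p_sel = rho / np.sum(rho)                    # p_sel_i = rho_i / sum(rho)
        return np.random.choice(len(rho), p=p_sel)   # index of the new prototype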

Given the new prototypes, we update the parameters of the Gamma distributions fitted to the clusters such that the function $Q(\theta)$ in Eq. (16) is maximized. In the Appendix, we prove that the estimators of the parameters s and $b_j$ are

$$\log s - \psi(s) = -\left[\sum_{j=1}^{M} \sum_{i=1}^{N} p(z_i = j) \log \frac{u_i}{\bar{u}_j}\right] \Big/ \left[\sum_{j=1}^{M} \sum_{i=1}^{N} p(z_i = j)\right], \qquad b_j = \bar{u}_j / s \qquad (20)$$

Here $\bar{u}_j$ is obtained using Eq. (19) and $\psi(\cdot)$ is the di-Gamma function: $\psi(s) = d \log \Gamma(s)/ds$.

In addition to the above parameters, there is a prior probability $\omega_j$ for the cluster $C_j$ as defined in Eq. (12). To update $\omega_j$, we set it to the sum of the signatures' weights ($\rho_i$) in the cluster $C_j$. Then, the $\omega_j$'s are normalized to meet the constraint $\sum_j \omega_j = 1$ within each concept.

4.3.3. Filtering irrelevant signatures

As stated above, the unlabeled images are shared between all concepts, so a large portion of the training images of one concept comes from the unlabeled set. Since these images will be assigned to one cluster in every concept, we must deal with a large number of signatures in the update step. However, only a small number of these signatures are relevant to a given cluster. The irrelevant signatures fall into two categories: noisy signatures and outliers. The noisy signatures do not belong to a concept, but their color and texture are somehow similar to the signatures in that concept; they are mainly generated at the feature extraction step. Outliers, on the other hand, are signatures that are generated by the density function of a cluster in another concept. They form most of the irrelevant signatures.

The density function of a cluster is defined based on the Mallows distance between a signature and the prototype of the cluster. Assume that signature $I_i$ is an outlier for the kth concept. Then $I_i$ will be far from all prototypes of the kth concept, which leads to low values of $p(z_i = j)$ for all clusters $C_j$ in the kth concept. Thus, we determine the subset of unlabeled signatures in cluster $C_j$ with values of $p(z_i = j)$ less than a threshold. These signatures are discarded as outliers from the update process of cluster $C_j$. Fig. 2 shows a typical concept with seven prototypes, where each cluster consists of some labeled and unlabeled signatures. Note that the size of each signature represents its weight. The signatures with low weights are considered outliers and are removed from the clusters.

To specify the filtering threshold, we use a stepwise decreasing function of the update iterations. We start with an initial threshold at the first iteration of the EM algorithm. The threshold for the next iterations is obtained by multiplying the current threshold by a decreasing rate. The values of the initial threshold and decreasing rate are listed in Table 2. The reason for using this strategy is that the early iterations of the algorithm must use the most confident unlabeled images for updating the parameters of a cluster.
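The schedule is simply geometric: starting from the initial threshold, each EM iteration multiplies it by the decreasing rate. A one-line sketch with the values of Table 2:

    def filtering_threshold(iteration, initial=0.9, rate=0.9):
        # Stepwise decreasing filtering threshold: initial * rate^iteration
        # (0.9 and 0.9 are the initial value and rate reported in Table 2).
        return initial * rate ** iteration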

The filtering mechanism provides several advantages. First, it reduces the computational time of the EM algorithm because the number of constraints and variables $w^{(i)}_{k,j}$ in Eq. (6) is substantially reduced. Moreover, as stated in Ref. [11], unlabeled samples are informative for semi-supervised learning if the classes (concepts) are separated by a low-density area. The information content of outliers is very low, and they increase the overlap between different concepts. The filtering process reduces this negative effect of outliers and prevents the EM algorithm from converging to local maxima.

4.4. Parameter estimation for initial models

Since the initial models are constructed using the labeled images, the probabilities $p(z_i = j)$ are equal to one for the images in cluster $C_j$. Therefore, we have the following equation for $p(z_i = j)$:

$$p(z_i = j) = I\big(C_j = \mathrm{Nearest}_k(I_i)\big) \cdot I\big(\mathrm{Concept}_{I_i} = k\big) \qquad \forall k = 1, \ldots, T \qquad (21)$$

where T is the number of concepts. To compute the maximum likelihood estimate (MLE) of the parameters of the initial models, the above values of $p(z_i = j)$ are used in Eq. (20).

Fig. 3 shows the flowchart of the training procedure with two main blocks: initial model construction and incorporating unlabeled images. The update process of one concept continues until the change in the log-likelihood function is less than a threshold. The value of this threshold is given in Table 2.

Fig. 2. A typical concept with seven prototypes and some labeled and unlabeled images assigned to each prototype. Some unlabeled signatures are relevant to the concept, but the others are noise or outliers. (Legend: labeled signatures with weight 1, prototypes, relevant unlabeled signatures, noise, outliers.)


5. Experimental results

To evaluate the proposed annotation system, the Corel60k dataset is used. Each test image is annotated in a manner similar to ALIPR [8]. Briefly, the joint probabilities of words and visual features are computed using the concept probabilities. After sorting the words by their joint probabilities in descending order, the input image is annotated with the top-ranked words.
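The ALIPR-style annotation step is only summarized above, so the following is a sketch of one plausible reading: each word is scored by accumulating the probabilities of the concepts whose description contains it, and the top-ranked words are returned. Uniform concept priors are assumed and all names are illustrative.

    import numpy as np

    def annotate(signature, concept_models, concept_words, vocabulary, n_words=4):
        # Score each word by accumulating the probabilities of the concepts
        # whose description contains it, then return the top-ranked words.
        phi = np.array([model(signature) for model in concept_models])   # Phi(I | mu_k), Eq. (12)
        p_concept = phi / phi.sum()                                      # uniform priors assumed
        scores = {w: 0.0 for w in vocabulary}
        for p_k, words_k in zip(p_concept, concept_words):
            for w in words_k:
                scores[w] += p_k
        ranked = sorted(scores, key=scores.get, reverse=True)
        return ranked[:n_words]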

To assess the annotation performance, we compute the average precision and recall over different words as explained in [8]. In our experiments, we annotate all test images with different numbers of words (from 1 to 15) and compute precision and recall for each case. All implementations were performed in the MATLAB environment on a PC with a 2.4 GHz Core i7 processor and 4 GB of memory.

5.1. Parameter selection

Because the training step and the parameter estimation of the models are based on the distance metric in Eq. (3), it is essential to properly specify the weights λ1 and λ2 in this metric. Using the spectral clustering of Section 4.1, we can obtain a group (cluster) of images with similar content. To analyze the behavior of the color and texture parts, we compute the Mallows distance among all pairs of color (texture) signatures of images in the same group. Fig. 4 shows the histograms of the distance values obtained for the color and texture parts. These histograms follow Gamma distributions with the parameters listed in Table 1. While the shapes of the two distributions are very similar, they have different scale parameters. According to Table 1, the ratio of the scale parameter of the texture distance to that of the color distance is 0.42. Therefore, in Eq. (3), we set the color and texture weights to λ1 = 0.42 and λ2 = 1, respectively, to normalize the distances to the same scale before combining them.

Fig. 3. The steps of our algorithm with two main blocks: initial model construction and incorporating unlabeled images.

Fig. 4. Histograms of Mallows distance among all pairs of (a) color and (b) texture signatures of similar images (horizontal axes: distance; vertical axes: number of images).


With the above weights, the average distance between similar images is 341.25. To account for the effect of noisy images in this average, we set the splitting threshold of clusters in the M-step to 300, slightly less than the average. Table 2 lists the values of the parameters used in the modified EM algorithm. We also avoid splitting clusters with fewer than 5 images. In the filtering step, high values of the initial threshold and decreasing rate ensure that the earliest iterations of the update process incorporate the images most relevant to a cluster.
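The weight selection just described amounts to fitting a Gamma distribution to each set of pairwise distances and taking the ratio of the two scale parameters. A sketch using SciPy's Gamma fit with the location fixed at zero (our illustrative choice; the paper does not specify the fitting routine):

    import numpy as np
    from scipy import stats

    def distance_weights(color_dists, texture_dists):
        # Fit a Gamma distribution (location fixed at zero) to each set of
        # pairwise Mallows distances and use the ratio of scale parameters
        # as the normalizing weight; Table 1 gives 91.65 / 217.76 ~= 0.42.
        _, _, b_c = stats.gamma.fit(np.asarray(color_dists), floc=0)
        _, _, b_t = stats.gamma.fit(np.asarray(texture_dists), floc=0)
        return b_t / b_c, 1.0        # (lambda_1 for color, lambda_2 for texture)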

Because the prototype extraction procedure of Section 3 has a great influence on the accuracy of the generative models, we fix the threshold used in this procedure to a small value (i.e., 10⁻⁵). We consider ALIPR for comparison because it follows the same structure as our annotation system. Its parameters are selected with the same strategy as discussed above; however, we set λ1 = 1 and λ2 = 1 for the ALIPR baseline, as stated in Ref. [8].

5.2. Supervised performance

The first class of evaluation is performed under the supervised learning framework. To this end, we select the first 150 concepts from Corel60k and use the initial models of the concepts for image annotation. Each concept contains 100 images, and there are a total of 137 distinct words describing the selected concepts. The histogram of the number of words assigned to the first 150 concepts of the Corel60k dataset is shown in Fig. 5; it indicates that most images are annotated with 3 or 4 words. We use the first 80 images of each concept as the training set and the remaining images as test images. For a given test image, a word provided by the annotation system is correct if the corresponding concept of the image contains that word.

In addition to ALIPR, we compare our results with the Baseline annotation system [29]. This technique treats annotation as a retrieval problem and associates words to a new test image using its K nearest neighbors in the labeled set of images. The goal of this approach is to eliminate the need for training a complex model in the annotation system. It was shown [29] that this baseline method outperforms most state-of-the-art methods such as the annotation system in [5]. To implement the Baseline annotation system, we set K = 5 as suggested in Ref. [29] and compute the distance between images using Eq. (3). Other similar systems assume that annotations are available for each individual image and do not follow the PSU protocol; implementing them under the PSU protocol would not provide an informative comparison, so we do not consider them in our experiments.

Fig. 6 illustrates the results obtained by our approach under the supervised framework against the ALIPR and Baseline methods. As shown in this figure, our approach outperforms ALIPR in both recall and precision. As previously stated, the supervised setting differs from ALIPR in the feature extraction and initial clustering steps, while concept modeling is performed in the same manner as ALIPR. This means that our feature extraction and the initial clustering based on graph partitioning lead to more informative prototypes than those in ALIPR.

As shown in Fig. 6, the Baseline system could not achieve good performance on Corel60k. Averaging the annotation metrics over 2–6 words, our approach outperforms the Baseline system by 23% in precision and 17% in recall. The main reason for the poor performance of the Baseline system is that images in Corel60k are annotated with noisy tags. This issue is discussed in more detail in Section 5.4.

As discussed in Section 3, D2-clustering in ALIPR has a divisivenature which makes concept modeling computationally expensive.

Table 1. ML estimation for the parameters of the distributions fitted to the color and texture distances.

             Color distance   Texture distance
Shape (s)    1.94             1.71
Scale (b)    217.76           91.65

Table 2. Parameters used in the modified EM algorithm and their values.

Parameter specification                     Value
Threshold for splitting clusters            300
Minimum number of images in a cluster       5
Initial value of filtering threshold        0.9
Decreasing rate of filtering threshold      0.9
Stopping threshold for EM algorithm         0.001

Fig. 5. Histogram of the number of words assigned to the first 150 concepts of the Corel60k dataset (horizontal axis: number of words assigned to concepts; vertical axis: number of concepts).

Fig. 6. Comparative analysis of the supervised setting with ALIPR [8] and the Baseline system [29] with respect to (a) precision and (b) recall. The horizontal axes show the number of words used to annotate test images.


However, we avoid the divisive strategy by applying spectral clustering to the image signatures. This results in a great reduction in the time required for prototype extraction and, subsequently, concept modeling. To study this benefit experimentally, we examine the average time for constructing the model of one concept when there are 80 labeled images. The results in Table 3 show that the proposed method reduces the training time of ALIPR by 67%.

In our proposed system, it takes 7.39 s on average to annotate an image using the models of 150 concepts, which is comparable to ALIPR. We utilize the multi-threading scheme of MATLAB's Parallel Computing Toolbox to compute the concept probabilities. It also takes 2.77 s on average to extract the color and texture signatures of an image of size 384×256.

Fig. 7 shows the words suggested by our annotation system for some test images in the supervised setting, together with the ground-truth (human) annotations for comparison. It should be noted that some words provided by the annotation system could be considered correct even though they do not appear in the ground truth. For instance, the words "ocean" and "beach" provide a correct semantic for the left image in the first row of Fig. 7, but these words are not present in the ground-truth annotation. Such words are nevertheless counted as false annotations when evaluating recall and precision in Fig. 6.

5.3. Semi-supervised performance

In semi-supervised annotation, the running time is dominated by iterations over both the E-step and the updating of cluster prototypes in the M-step. Unfortunately, the runtime of these two steps scales up significantly as the number of unlabeled images increases. Moreover, we consider a subset of the images in a concept as unlabeled samples, so increasing the number of concepts also increases the training time. On the other hand, the goal of semi-supervised annotation is to reach the same level of performance as supervised annotation using far fewer labeled images. To reach this goal with an acceptable computational cost, we reduced the number of concepts in the SSL setting to the first 50 concepts of Corel60k. These concepts contain 5000 images, a de facto size for image annotation [29], and are annotated with 88 distinct words.

To evaluate the annotation performance under the SSL setting, we conduct 5-fold cross validation on the images of every concept. In more detail, we randomly split the images of every concept into five parts of equal size (i.e., 20 images) and select each of the five parts in turn as the test images.

Table 3. Comparative analysis of the training time (s) in ALIPR and the proposed method for extracting the model of one concept with 80 labeled images.

                      ALIPR     The proposed supervised setting
Average time          476.93    153.91
Standard deviation    120.87    83.95

Fig. 7. Comparing the annotations of the proposed method with human annotations for some test images. (For each test image, the figure lists the words suggested by the system, e.g. "people, ocean, beach, boat" or "animal, grass, wild life, tree", alongside the corresponding human annotations.)


The remaining 80 images of every concept are incorporated into the training phase and are divided into labeled and unlabeled sets as follows. Among the 80 images in the training set of one concept, the first l images are considered the labeled samples for that concept. To study the annotation performance for different numbers of labeled images, l is varied from 25 to 75 (l = 25, 35, 45, 55, 65, 75). To form the unlabeled set, we use the remaining images of the concepts which are not included in the test and labeled sets; that is, we select u images (u = 80 − l) from each concept as the unlabeled images. After computing precision and recall for each test part, we compute the average over all five parts to assess the annotation performance.

With the above configuration, the results in Fig. 8(a) and (b) are obtained after convergence of the SSL algorithm. Fig. 8(c) and (d) shows the precision and recall values at the first iteration of the SSL algorithm. Since the algorithm starts from the models constructed on labeled images, Fig. 8(c) and (d) represents the results of the supervised setting for the first 50 concepts.

If we compute the differences of recall and precision between the first and last iterations of the SSL algorithm, the amount of increase indicates how much the SSL algorithm improves the annotation performance. Fig. 9 shows the average differences of recall and precision for different numbers of labeled images on 50 concepts of Corel60k. In this figure, the average value for a specific number of labeled images represents the mean of the differences between the first and last iterations after annotating test images with different numbers of words (from 1 to 15).
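The averaging behind Fig. 9 can be written as a one-line computation. The sketch below assumes the per-word-count metric values at the first and last SSL iterations are available as length-15 arrays; the names are illustrative.

```python
import numpy as np

def average_gain(metric_first, metric_last):
    """Mean gain of a metric (precision or recall) between the first and last
    SSL iterations, averaged over annotation lengths of 1..15 words.

    metric_first, metric_last: sequences of length 15, holding the metric value
    when test images are annotated with 1, 2, ..., 15 words.
    """
    return float(np.mean(np.asarray(metric_last) - np.asarray(metric_first)))
```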

Referring to the results in Figs. 8 and 9, we conclude the following points:

• The values of precision and recall improve as the number of labeled images increases, such that the highest accuracy is obtained for 65 and 75 labeled images. This indicates that labeled images are more important than unlabeled ones for constructing accurate models of the concepts. This is not in conflict with the use of semi-supervised annotation, whose aim is to reduce the dependency of the modeling algorithm on labeled images by incorporating unlabeled ones.

• According to the test configuration, the number of unlabeled images increases when the number of labeled images decreases. Fig. 9 reveals that the average differences of recall and precision between the first and last iterations of the SSL algorithm grow as the number of unlabeled images increases (i.e. as the number of labeled images decreases). As previously stated, the first iteration of the semi-supervised algorithm is equivalent to the supervised version of the annotation system; therefore, the amount of difference indicates the level of improvement provided by the unlabeled images.

Fig. 8. Precision-recall curves of semi-supervised annotation on 50 concepts of Corel60k for different numbers of words provided by the system to annotate the test images. (a) and (b) show the results at the last iteration of the algorithm, while (c) and (d) show the results of the first iteration. The results are presented for different numbers of labeled images: 25, 35, 45, 55, 65 and 75. The results for 80 labeled images (i.e. with no unlabeled images) are also shown in (c) and (d).

Fig. 9. Average differences of recall and precision values between the first and final iterations of the semi-supervised algorithm on 50 concepts of Corel60k. The horizontal axis represents the number of images in each concept used as labeled samples in the algorithm.



By comparing the results for different numbers of labeled images in Fig. 9, recall and precision improve by about 15% when there are 25 labeled images in every concept.

In addition to the above experiments, we study the changes in precision and recall when the number of concepts varies. To this end, we run the SSL algorithm on 30 and 40 concepts, which contain 61 and 75 words in their descriptions, respectively. The results are shown in Figs. 10 and 11. By comparing the results in Figs. 8, 10 and 11, we conclude that precision and recall decrease as the number of concepts in the modeling process increases. To study this issue in more detail, we compute the average of the precision and recall values in each experiment (i.e. 30, 40 and 50 concepts) while varying the number of annotation words from 2 to 6. The results are shown in Fig. 12. The main reason for considering 2–6 words is that most images are annotated with this number of words, as shown in Fig. 5.

Referring to Fig. 12, it is obvious that both precision and recall decrease as the number of concepts increases. The main reason is that increasing the number of concepts leads to more overlap between concepts and, consequently, a higher misclassification probability. By decreasing the number of concepts from 50 to 40 and to 30, we achieve, on average, increases of 16.98% and 33.67% in precision, respectively; for recall, these rates are 19.11% and 38.25%. This indicates that recall is more affected than precision when the number of concepts decreases.

5.4. Comparing with other approaches

We compare the proposed annotation system with the iterative bi-relational graph-based (I-BG) method [15], a graph-based semi-supervised annotation approach. It constructs two subgraphs, one for images and one for semantic labels, and provides links between them to obtain an integrated graph for annotation.

Fig. 10. Precision-recall curves of semi-supervised annotation on 30 concepts from the Corel60k dataset for different numbers of annotation words and 25–75 labeled images per concept.

Fig. 11. Precision-recall curves of semi-supervised annotation on 40 concepts from the Corel60k dataset for different numbers of annotation words and 25–75 labeled images per concept.

Fig. 12. The average precision and recall values for different numbers of labeled images when the number of concepts is 30, 40 and 50.



To construct the subgraph of semantic labels, we follow the same strategy as the I-BG method. After computing the distance between every pair of images using Eq. (3), we follow the strategy in Ref. [15] to form the image subgraph. This provides a fair comparison between our approach and I-BG because the same distance metric is used in both methods. Furthermore, we set α = 0.01 and β = γ = 0.5 for I-BG, as suggested in Ref. [15].

In addition to I-BG, we compare our semi-supervised annotation with the Baseline system [29] introduced in Section 5.2. Similar to the test configuration in Section 5.3, we assess the performance of the I-BG and Baseline systems with different numbers of labeled images. In the I-BG method, all images are incorporated into the graph construction, but only a subset of images is considered as the labeled set. The Baseline method, on the other hand, utilizes only the labeled set for annotating test images.

We apply the I-BG and Baseline methods to 50 concepts of Corel60k with 5-fold cross validation. On average, each image in Corel60k is described by 4.1 words; therefore, we evaluate the performance of the three methods after annotating all test images with 4 words. As shown in Fig. 13, our approach outperforms both the I-BG and Baseline methods. Computing the overall average of the precision values for different numbers of labeled images, the proposed approach shows increases of 6% and 23% with respect to the I-BG and Baseline methods, respectively; for recall, these increases are about 9% and 30%. Note that Baseline performs poorly on the Corel60k dataset. As discussed in the previous section, noisy annotations and high intra-concept variance in Corel60k strongly degrade the performance of the Baseline system, which extracts the tags of a test image from its K nearest neighbors in the labeled set.

Finally, it takes about 33.23 s on average to annotate an image using the I-BG method, compared with about 5.18 s for our approach. Thus, the running time of graph-based methods is considerably higher than that of generative models, as discussed in Section 1.1.

5.5. Applying our approach to image-level annotated datasets

In this section, we discuss how our approach is applied to a dataset whose tags are provided at image level (i.e. tags are assigned to each individual image instead of to an image concept). As a case study, we consider the IAPR TC-12 dataset [30], which contains 20,000 images. This dataset is split into 16,000 training images and 4000 test images, and there are 228 distinct words that appear in both the training and test sets.

As mentioned before, we assume that images are annotated at concept level. The main reason for this assumption is to guarantee that each semantic class has enough images (samples) in the training phase. Indeed, the parameters of the Gamma distribution fitted to a cluster can be estimated accurately only if the cluster contains enough samples. Therefore, to apply our approach to an image-level annotated dataset, we must provide semantic classes with enough training images. Thus, we consider each word of the dataset as a semantic class and exclude from the training phase those words which are assigned to fewer images than a predefined threshold. Table 4 shows the number of distinct words in the IAPR dataset which are assigned to more than 20, 50 and 100 images. For instance, the first row of Table 4 indicates that there are 163 words that appear in more than 20 images; moreover, there are 15,970 images in the dataset which are annotated by these 163 words. Below, we discuss the results on IAPR for threshold values of 50 and 100 on the number of images.
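As an illustration, a minimal sketch of this vocabulary-filtering step is shown below. It assumes image-level annotations are available as a mapping from image identifiers to word lists; the function and variable names are ours, not part of the original system.

```python
from collections import Counter

def filter_vocabulary(annotations, min_images=100):
    """Keep only the words assigned to at least `min_images` training images.

    annotations: dict mapping image id -> list of words (image-level tags).
    Returns the retained vocabulary and the ids of images that still carry
    at least one retained word.
    """
    # Count in how many distinct images each word occurs.
    word_counts = Counter(w for words in annotations.values() for w in set(words))
    vocabulary = {w for w, c in word_counts.items() if c >= min_images}
    kept_images = [img for img, words in annotations.items()
                   if any(w in vocabulary for w in words)]
    return vocabulary, kept_images
```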

We first assess the performance of the SSL algorithm on the IAPR dataset when the threshold is 100. To compute the annotation metrics (i.e. precision and recall), we annotate each test image using the top 5 words with the largest concept probabilities. Moreover, to form the labeled and unlabeled sets in the training phase of the SSL algorithm, we randomly select l percent (l = 50, 60, 70, 80, 90) of the images assigned to each word as labeled images and the remaining images as unlabeled samples. Fig. 14 shows the precision and recall values in the first and final iterations of the SSL algorithm for different values of l. Comparing the results of the first iteration (using only the labeled images) and the final iteration (after incorporating unlabeled images), we conclude that the SSL algorithm improves the annotation performance significantly. As shown in Fig. 14, decreasing the percentage of labeled images, which is equivalent to increasing the number of unlabeled images, leads to a larger difference between the first and final iterations in both precision and recall. However, as the percentage of labeled images increases, precision and recall approach the values obtained for 100% labeled images.
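The annotation rule used here (take the top 5 words by concept probability) reduces to a simple ranking step. The sketch below is an illustrative rendering of that rule, assuming the per-word concept probabilities for a test image are already computed; the names are ours.

```python
import numpy as np

def annotate_image(concept_probs, vocabulary, top_k=5):
    """Return the top_k words with the largest concept probabilities.

    concept_probs: (W,) array, probability of each word/semantic class given
    the test image (one class per retained word in the IAPR setup).
    vocabulary: list of W words aligned with concept_probs.
    """
    order = np.argsort(concept_probs)[::-1][:top_k]  # indices of the largest probabilities
    return [vocabulary[i] for i in order]
```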

We also analyze the performance of the SSL algorithm when the threshold on the number of images assigned to each word is set to 50. The results, shown in Fig. 15, indicate a considerable increase in the annotation metrics after convergence (i.e. at the final iteration of the SSL algorithm). In Fig. 16, we compare the results in Figs. 14 and 15 by computing the increase rate of the annotation metrics when the threshold changes from 50 to 100. Fig. 16 indicates that decreasing the percentage of labeled images leads to a rise in the increase rate of the annotation metrics. When the threshold value changes from 100 to 50, some new semantic classes (words), each corresponding to only a few images, are incorporated into the training phase; as a result, the generalization of the mixture models constructed for these new semantic classes decreases. Another point concluded from Fig. 16 is that recall increases more than precision after discarding the new semantic classes.

Fig. 13. Comparing our approach with the I-BG and Baseline methods with respect to (a) precision and (b) recall. The horizontal axis is the number of labeled images per concept (25–75).

Table 4
The size of the IAPR dataset (number of words and training images) after discarding the words which are assigned to less than 20, 50 and 100 images.

Threshold on number of images    Number of words    Number of corresponding training images
20                               163                15,970
50                               123                15,916
100                              101                15,876



6. Conclusions

Image annotation plays a major role in indexing the images of large datasets and photo-sharing communities. We proposed an annotation system in the semi-supervised framework that incorporates unlabeled images into the training phase. To this end, we proposed a generative modeling approach in two steps. In the first step, an initial mixture model is constructed for each concept using its labeled images; this step is equivalent to the maximum likelihood (ML) estimation of the parameters of the mixture models based on the labeled images. In the second step, the parameters of the mixture models are updated by incorporating the unlabeled images. Since the unlabeled images generate incomplete observations for the clusters, we customized the EM algorithm steps to use the unlabeled signatures for updating the parameters of the Gamma distributions fitted to the clusters. Unlike many semi-supervised learning algorithms, the proposed approach is inductive and can annotate every test image not included in the training phase. Compared to supervised annotation systems, it reaches higher accuracy after incorporating the unlabeled samples. However, its training phase needs more time than supervised annotation because of the iterative nature of the EM algorithm. Our experiments also reveal that the SSL algorithm improves the annotation metrics for datasets which are not organized according to the PSU protocol.

There are many directions for future work. One of the main challenges is the extraction of proper features from images: using multiple texture descriptors or new features such as shape could improve the annotation results. Our system does not consider the relations between different words; a suitable model for word correlations or co-occurrences could help to filter out unrelated words assigned to an image.

Conflict of interest statement

None declared.

Acknowledgements

We would like to thank the Iran National Science Foundation (INSF) for their partial support of this work.

Appendix

In this section, we derive the parameter estimates of the Gamma distribution given in Eq. (20).

Fig. 14. Precision and recall values (%) on the IAPR dataset in the first and final iterations of the SSL algorithm after discarding the words which are assigned to less than 100 images. The horizontal axis indicates the percentage of labeled images that are randomly selected from the training images assigned to each word.

Fig. 15. Annotation performance (precision and recall, %) of the SSL algorithm on the IAPR dataset for different percentages of labeled images after discarding words assigned to less than 50 images.

Fig. 16. Increase rate (%) of the annotation metrics when the threshold for discarding the words of the dataset changes from 50 to 100, plotted against the percentage of labeled images.



Given the probabilities $p(z_i = j)$ from the E-step, we must maximize the function $Q(\theta)$ defined in Eq. (16) in the M-step. Substituting the definition of $f(u_i \mid \theta_j)$ from Eq. (10) into $Q(\theta)$, we have

$$
Q(\theta) = \sum_{j=1}^{M} \sum_{i=1}^{N} p(z_i = j)\,\underbrace{\left((s-1)\log u_i - s\log b_j - \frac{u_i}{b_j} - \log\Gamma(s)\right)}_{S} \tag{22}
$$

If the parameter $s$ is assumed to be fixed, the term $S$ is a concave function of $b_j$. Calculating the first derivative of $S$ with respect to $b_j$ and setting it to zero gives

$$
\frac{\partial Q(\theta)}{\partial b_j} = \sum_{i=1}^{N} p(z_i = j)\left(-\frac{s}{b_j} + \frac{u_i}{b_j^{2}}\right) = 0
\;\Rightarrow\;
\frac{s}{b_j}\sum_{i=1}^{N} p(z_i = j) = \frac{1}{b_j^{2}}\sum_{i=1}^{N} p(z_i = j)\,u_i \tag{23}
$$

Solving for $b_j$, we obtain

$$
b_j = \frac{\sum_{i=1}^{N} p(z_i = j)\,u_i}{s\sum_{i=1}^{N} p(z_i = j)}
\;\Rightarrow\;
b_j = \frac{\bar{u}_j}{s} \tag{24}
$$

where $\bar{u}_j$ is defined in Eq. (19). Substituting this estimate back into $Q(\theta)$ yields

$$
Q(\theta) = \sum_{j=1}^{M} \sum_{i=1}^{N} p(z_i = j)\left((s-1)\log u_i - s\log\frac{\bar{u}_j}{s} - s\,\frac{u_i}{\bar{u}_j} - \log\Gamma(s)\right)
= \sum_{j=1}^{M} \sum_{i=1}^{N} p(z_i = j)\left(s\left(\log\frac{u_i}{\bar{u}_j} - \frac{u_i}{\bar{u}_j}\right) - \log u_i + s\log s - \log\Gamma(s)\right) \tag{25}
$$

Here, $Q(\theta)$ is a concave function of $s$ because $\log\Gamma(s)$ is a convex function. Computing the derivative of $Q(\theta)$ with respect to $s$ and setting it to zero, we have

$$
\frac{\partial Q(\theta)}{\partial s}
= \sum_{j=1}^{M} \sum_{i=1}^{N} p(z_i = j)\left(\log\frac{u_i}{\bar{u}_j} - \frac{u_i}{\bar{u}_j} + \log s + 1 - \psi(s)\right)
= (\log s - \psi(s))\sum_{j=1}^{M}\sum_{i=1}^{N} p(z_i = j)
+ \underbrace{\sum_{j=1}^{M}\sum_{i=1}^{N} p(z_i = j)\left(1 - \frac{u_i}{\bar{u}_j}\right)}_{A}
+ \sum_{j=1}^{M}\sum_{i=1}^{N} p(z_i = j)\log\frac{u_i}{\bar{u}_j} \tag{26}
$$

Note that the term $A$ is equal to zero:

$$
\sum_{j=1}^{M}\sum_{i=1}^{N} p(z_i = j) - \sum_{j=1}^{M}\sum_{i=1}^{N} p(z_i = j)\frac{u_i}{\bar{u}_j}
= \sum_{j=1}^{M}\sum_{i=1}^{N} p(z_i = j) - \sum_{j=1}^{M}\frac{1}{\bar{u}_j}\sum_{i=1}^{N} p(z_i = j)\,u_i
= \sum_{j=1}^{M}\sum_{i=1}^{N} p(z_i = j) - \sum_{j=1}^{M}\frac{1}{\bar{u}_j}\,\bar{u}_j\sum_{i=1}^{N} p(z_i = j) = 0 \tag{27}
$$

Therefore, Eq. (26) reduces to

$$
\frac{\partial Q(\theta)}{\partial s}
= (\log s - \psi(s))\sum_{j=1}^{M}\sum_{i=1}^{N} p(z_i = j)
+ \sum_{j=1}^{M}\sum_{i=1}^{N} p(z_i = j)\log\frac{u_i}{\bar{u}_j} \tag{28}
$$

Setting this expression to zero, the estimate of $s$ is obtained by solving

$$
\log s - \psi(s) = -\left[\sum_{j=1}^{M}\sum_{i=1}^{N} p(z_i = j)\log\frac{u_i}{\bar{u}_j}\right] \Big/ \left[\sum_{j=1}^{M}\sum_{i=1}^{N} p(z_i = j)\right] \tag{29}
$$

References

[1] R. Datta, D. Joshi, J. Li, J.Z. Wang, Image retrieval: ideas, influences, and trends of the new age, ACM Comput. Surv. 40 (2) (2008) 1–60.
[2] M. Crucianu, M. Ferecatu, N. Boujemaa, Relevance feedback for image retrieval: a short survey, State of the Art in Audiovisual Content-Based Retrieval, Information Universal Access and Interaction including Datamodels and Languages, 2004.
[3] Y. Liu, D. Zhang, G. Lu, W.Y. Ma, A survey of content-based image retrieval with high-level semantics, Pattern Recognit. 40 (1) (2007) 262–282.
[4] S. Vijayanarasimhan, K. Grauman, What's it going to cost you?: Predicting effort vs. informativeness for multi-label image annotations, Comput. Vis. Pattern Recognit. (2009) 2262–2269.
[5] G. Carneiro, A.B. Chan, P.J. Moreno, N. Vasconcelos, Supervised learning of semantic classes for image annotation and retrieval, IEEE Trans. PAMI 29 (3) (2007) 394–410.
[6] F. Monay, D. Gatica-Perez, Modeling semantic aspects for cross-media image indexing, IEEE Trans. PAMI 29 (10) (2007) 1802–1817.
[7] J. Li, J.Z. Wang, Automatic linguistic indexing of pictures by a statistical modeling approach, IEEE Trans. PAMI 25 (9) (2003) 1075–1088.
[8] J. Li, J.Z. Wang, Real-time computerized annotation of pictures, IEEE Trans. PAMI 30 (6) (2008) 985–1002.
[9] X. Zhu, A. Goldberg, Introduction to Semi-Supervised Learning (Synthesis Lectures on Artificial Intelligence and Machine Learning), Morgan & Claypool Publishers, San Rafael, CA, USA, 2009.
[10] F.G. Cozman, I. Cohen, M.C. Cirelo, E. Politecnica, Semi-supervised learning of mixture models, ICML, 2003, pp. 99–106.
[11] Y. Grandvalet, Y. Bengio, Semi-supervised learning by entropy minimization, NIPS, 2004.
[12] D. Zhou, O. Bousquet, T.N. Lal, J. Weston, B. Scholkopf, Learning with local and global consistency, NIPS, 2004.
[13] J. Liu, M. Li, Q. Liu, H. Lu, S. Ma, Image annotation via graph learning, Pattern Recognit. 42 (2) (2009) 218–228.
[14] T. Jinhui, H. Richang, Y. Shuicheng, C. Tat-Seng, Q. Guo-Jun, J. Ramesh, Image annotation by kNN-sparse graph-based label propagation over noisily tagged web images, ACM Trans. Intell. Syst. Technol. 2 (14) (2011) 1–14 (15).
[15] H. Wang, H. Huang, C. Ding, Image annotation using bi-relational graph of images and semantic labels, Comput. Vis. Pattern Recognit. (2011) 793–800.
[16] J. Tang, H. Li, G.J. Qi, T.S. Chua, Image annotation by graph-based inference with integrated multiple/single instance representations, IEEE Trans. Multimed. 12 (2) (2010) 131–141.
[17] Y. Chen, J. Bi, J.Z. Wang, MILES: multiple-instance learning via embedded instance selection, IEEE Trans. PAMI 28 (12) (2006) 1931–1947.
[18] P.K. Atrey, M.A. Hossain, A. El Saddik, M.S. Kankanhalli, Multimodal fusion for multimedia analysis: a survey, Springer Multimed. Syst. J. 16 (6) (2010) 345–379.
[19] Y. Shao, Y. Zhou, X. He, D. Cai, H. Bao, Semi-supervised topic modeling for image annotation, in: Proceedings of the ACM International Conference on Multimedia, 2009, pp. 521–524.
[20] S. Zhu, Y. Liu, Semi-supervised learning model based efficient image annotation, IEEE Signal Process. Lett. 16 (11) (2009) 989–992.
[21] B. Wang, Z. Li, N. Yu, M. Li, Image annotation in a progressive way, ICME, Beijing, China (2007) 811–814.
[22] D. Comaniciu, P. Meer, Mean shift: a robust approach toward feature space analysis, IEEE Trans. PAMI 24 (2002) 603–619.
[23] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, Springer-Verlag, New York (2009) 520–528.
[24] S.H. Amiri, M. Jamzad, Large-scale image annotation using prototype-based models, in: Proceedings of the IEEE International Symposium on Image and Signal Processing and Analysis (ISPA), 2011, pp. 449–454.
[25] B.S. Manjunath, W.Y. Ma, Texture features for browsing and retrieval of large image data, IEEE Trans. PAMI 18 (1996) 837–842.
[26] E. Levina, P. Bickel, The earth mover's distance is the Mallows distance: some insights from statistics, in: Proceedings of the ICCV, 2001, pp. 251–256.
[27] L. Zelnik-Manor, P. Perona, Self-tuning spectral clustering, NIPS, 2004, pp. 1601–1608.
[28] S. Wasserman, K. Faust, Social Network Analysis: Methods and Applications, Cambridge University Press, Cambridge, United Kingdom (1994) 183–187.
[29] A. Makadia, V. Pavlovic, S. Kumar, Baselines for image annotation, Int. J. Comput. Vis. 90 (1) (2010) 88–105.
[30] H.J. Escalante, C.A. Hernández, J.A. González, A. López-López, M. Montes, E.F. Morales, L.E. Sucar, L. Villaseñor, M. Grubinger, The segmented and annotated IAPR TC-12 benchmark, Comput. Vis. Image Underst. 114 (4) (2010) 419–428.

S. Hamid Amiri received the B.S. degree in computer engineering from Shahid Bahonar University, Kerman, Iran, in 2006, and the M.S. degree in computer engineering from the Sharif University of Technology (SUT), Tehran, Iran, in 2008. Since September 2009, he has been pursuing the Ph.D. degree at the Department of Computer Engineering, SUT. His current research interests include various aspects of image processing, including image retrieval, classification and annotation.



Mansour Jamzad obtained his M.S. degree in computer science from McGill University, Montreal, Canada, and his Ph.D. in electrical engineering from Waseda University, Tokyo, Japan, in 1989. For two years after graduation, he worked as a postdoctoral researcher in the Department of Electronics and Communication Engineering, Waseda University. He was a visiting researcher at the Tokyo Metropolitan Institute of Gerontology during 1990–1994. He became a faculty member of the Department of Computer Engineering, Sharif University of Technology, Tehran, Iran, in 1995. Since then, he has been teaching graduate courses on digital image processing and machine vision. His main research interests are digital image processing, machine vision and their applications. In recent years he has been working on watermarking, steganography, content-based image retrieval, image annotation, face gestures, robot vision and computer graphics.
