
Edinburgh Research Explorer

Automatic Image Annotation Using Auxiliary Text Information

Citation for published version:
Feng, Y & Lapata, M 2008, Automatic Image Annotation Using Auxiliary Text Information. in ACL 2008, Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, June 15-20, 2008, Columbus, Ohio, USA. Association for Computational Linguistics, pp. 272-280. <http://www.aclweb.org/anthology/P08-1032>

Link:
Link to publication record in Edinburgh Research Explorer

Document Version:
Publisher's PDF, also known as Version of record

Published In:
ACL 2008, Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, June 15-20, 2008, Columbus, Ohio, USA

General rights
Copyright for the publications made accessible via the Edinburgh Research Explorer is retained by the author(s) and/or other copyright owners and it is a condition of accessing these publications that users recognise and abide by the legal requirements associated with these rights.

Take down policy
The University of Edinburgh has made every reasonable effort to ensure that Edinburgh Research Explorer content complies with UK legislation. If you believe that the public display of this file breaches copyright please contact [email protected] providing details, and we will remove access to the work immediately and investigate your claim.

Download date: 28. Jul. 2020


Proceedings of ACL-08: HLT, pages 272–280, Columbus, Ohio, USA, June 2008. ©2008 Association for Computational Linguistics

Automatic Image Annotation Using Auxiliary Text Information

Yansong Feng and Mirella Lapata
School of Informatics, University of Edinburgh
2 Buccleuch Place, Edinburgh EH8 9LW, UK

[email protected], [email protected]

Abstract

The availability of databases of images labeled with keywords is necessary for developing and evaluating image annotation models. Dataset collection is however a costly and time-consuming task. In this paper we exploit the vast resource of images available on the web. We create a database of pictures that are naturally embedded into news articles and propose to use their captions as a proxy for annotation keywords. Experimental results show that an image annotation model can be developed on this dataset alone without the overhead of manual annotation. We also demonstrate that the news article associated with the picture can be used to boost image annotation performance.

1 Introduction

As the number of image collections rapidly grows, so does the need to browse and search them. Recent years have witnessed significant progress in developing methods for image retrieval [1], many of which are query-based. Given a database of images, each annotated with keywords, the query is used to retrieve relevant pictures under the assumption that the annotations can essentially capture their semantics.

[1] The approaches are too numerous to list; we refer the interested reader to Datta et al. (2005) for an overview.

One stumbling block to the widespread use of query-based image retrieval systems is obtaining the keywords for the images. Since manual annotation is expensive, time-consuming, and practically infeasible for large databases, there has been great interest in automating the image annotation process (see references). More formally, given an image I with visual features V_I = {v1, v2, ..., vN} and a set of keywords W = {w1, w2, ..., wM}, the task consists in automatically finding the keyword subset W_I ⊂ W which can appropriately describe the image I. Indeed, several approaches have been proposed to solve this problem under a variety of learning paradigms. These range from supervised classification (Vailaya et al., 2001; Smeulders et al., 2000), to instantiations of the noisy-channel model (Duygulu et al., 2002), to clustering (Barnard et al., 2002), and methods inspired by information retrieval (Lavrenko et al., 2003; Feng et al., 2004).

Obviously, in order to develop accurate image annotation models, some manually labeled data is required. Previous approaches have been developed and tested almost exclusively on the Corel database. The latter contains 600 CD-ROMs, each containing about 100 images representing the same topic or concept, e.g., people, landscape, male. Each topic is associated with keywords and these are assumed to also describe the images under this topic. As an example consider the pictures in Figure 1, which are classified under the topic male and have the description keywords man, male, people, cloth, and face.

Figure 1: Images from the Corel database, exemplifying the concept male with keyword descriptions man, male, people, cloth, and face.

Current image annotation methods work well when large amounts of labeled images are available but can run into severe difficulties when the number of images and keywords for a given topic is relatively small. Unfortunately, databases like Corel are few and far between and somewhat idealized. Corel contains clusters of many closely related images which in turn share keyword descriptions, thus allowing models to learn image-keyword associations reliably (Tang and Lewis, 2007). It is unlikely that models trained on this database will perform well out-of-domain on other image collections which are more noisy and do not share these characteristics. Furthermore, in order to develop robust image annotation models, it is crucial to have large and diverse datasets both for training and evaluation.

In this work, we aim to relieve the data acquisition bottleneck associated with automatic image annotation by taking advantage of resources where images and their annotations co-occur naturally. News articles associated with images and their captions spring readily to mind (e.g., BBC News, Yahoo News). So, rather than laboriously annotating images with their keywords, we simply treat captions as labels. These annotations are admittedly noisy and far from ideal. Captions can be denotative (describing the objects the image depicts) but also connotative (describing sociological, political, or economic attitudes reflected in the image). Importantly, our images are not standalone; they come with news articles whose content is shared with the image. So, by processing the accompanying document, we can effectively learn about the image and reduce the effect of noise due to the approximate nature of the caption labels. To give a simple example, if two words appear both in the caption and the document, it is more likely that the annotation is genuine.

In what follows, we present a new database consisting of articles, images, and their captions which we collected from an on-line news source. We then propose an image annotation model which can learn from our noisy annotations and the auxiliary documents. Specifically, we extend and modify Lavrenko et al.'s (2003) continuous relevance model to suit our task. Our experimental results show that this model can successfully scale to our database, without making use of explicit human annotations in any way. We also show that the auxiliary document contains important information for generating more accurate image descriptions.

2 Related Work

Automatic image annotation is a popular task in computer vision. The earliest approaches are closely related to image classification (Vailaya et al., 2001; Smeulders et al., 2000), where pictures are assigned a set of simple descriptions such as indoor, outdoor, landscape, people, animal. A binary classifier is trained for each concept, sometimes in a “one vs all” setting. The focus here is mostly on image processing and good feature selection (e.g., colour, texture, contours) rather than the annotation task itself.

Recently, much progress has been made on the image annotation task thanks to three factors: the availability of the Corel database, the use of unsupervised methods, and new insights from the related fields of natural language processing and information retrieval. The co-occurrence model (Mori et al., 1999) collects co-occurrence counts between words and image features and uses them to predict annotations for new images. Duygulu et al. (2002) improve on this model by treating image regions and keywords as a bi-text and using the EM algorithm to construct an image region-word dictionary.

Another way of capturing co-occurrence information is to introduce latent variables linking image features with words. Standard latent semantic analysis (LSA) and its probabilistic variant (PLSA) have been applied to this task (Hofmann, 1998). Barnard et al. (2002) propose a hierarchical latent model in order to account for the fact that some words are more general than others. More sophisticated graphical models (Blei and Jordan, 2003) have also been employed, including Gaussian Mixture Models (GMM) and Latent Dirichlet Allocation (LDA).

Finally, relevance models originally developed for information retrieval have been successfully applied to image annotation (Lavrenko et al., 2003; Feng et al., 2004). A key idea behind these models is to find the images most similar to the test image and then use their shared keywords for annotation.

Our approach differs from previous work in two important respects. Firstly, our ultimate goal is to develop an image annotation model that can cope with real-world images and noisy data sets. To this end we are faced with the challenge of building an appropriate database for testing and training purposes. Our solution is to leverage the vast resource of images available on the web but also the fact that many of these images are implicitly annotated. For example, news articles often contain images whose captions can be thought of as annotations. Secondly, we allow our image annotation model access to knowledge sources other than the image and its keywords. This is relatively straightforward in our case; an image and its accompanying document have shared content, and we can use the latter to glean information about the former. But we hope to illustrate the more general point that auxiliary linguistic information can indeed bring performance improvements on the image annotation task.

3 BBC News Database

Our database consists of news images, which are abundant. Many on-line news providers supply pictures with news articles; some even classify news into broad topic categories (e.g., business, world, sports, entertainment). Importantly, news images often display several objects and complex scenes and are usually associated with captions describing their contents. The captions are image specific and use a rich vocabulary. This is in marked contrast to the Corel database, whose images contain one or two salient objects and a limited vocabulary (typically around 300 words).

We downloaded 3,361 news articles from the BBC News website [2]. Each article was accompanied by an image and its caption. We thus created a database of image-caption-document tuples. The documents cover a wide range of topics including national and international politics, advanced technology, sports, education, etc. An example of an entry in our database is illustrated in Figure 2. Here, the image caption is Marcin and Florent face intense competition from outside Europe and the accompanying article discusses EU subsidies to farmers. The images are usually 203 pixels wide and 152 pixels high. The average caption length is 5.35 tokens, and the average document length 133.85 tokens. Our captions have a vocabulary of 2,167 words and our documents 6,253. The vocabulary shared between captions and documents is 2,056 words.

[2] http://news.bbc.co.uk/

Figure 2: A sample from our BBC News database. Each entry contains an image, a caption for the image, and the accompanying document with its title.

4 Extending the Continuous Relevance Annotation Model

Our work is an extension of the continuous relevance annotation model put forward in Lavrenko et al. (2003). Unlike other unsupervised approaches where a set of latent variables is introduced, each defining a joint distribution on the space of keywords and image features, the relevance model captures the joint probability of images and annotated words directly, without requiring an intermediate clustering stage. This model is a good point of departure for our task for several reasons, both theoretical and empirical. Firstly, expectations are computed over every single point in the training set and therefore parameters can be estimated without EM. Indeed, Lavrenko et al. achieve competitive performance with latent variable models. Secondly, the generation of feature vectors is modeled directly, so there is no need for quantization. Thirdly, as we show below, the model can be easily extended to incorporate information outside the image and its keywords.

In the following we first lay out the assumptions underlying our model. We next describe the continuous relevance model in more detail and present our extensions and modifications.

Assumptions Since we are using a non-standard database, namely images embedded in documents, it is important to clarify what we mean by image annotation, and how the precise nature of our data impacts the task. We thus make the following assumptions:

1. The caption describes the content of the image directly or indirectly. Unlike traditional image annotation where keywords describe salient objects, captions supply more detailed information, not only about objects and their attributes, but also events. In Figure 2 the caption mentions Marcin and Florent, the two individuals shown in the picture, but also the fact that they face competition from outside Europe.

2. Since our images are implicitly rather than explicitly labeled, we do not assume that we can annotate all objects present in the image. Instead, we hope to be able to model event-related information such as “what happened”, “who did it”, “when” and “where”. Our annotation task is therefore more semantic in nature than traditionally assumed.

3. The accompanying document describes the content of the image. This is trivially true for news documents where the images conventionally depict events, objects or people mentioned in the article.

To validate these assumptions, we performed the following experiment on our BBC News dataset. We randomly selected 240 image-caption pairs and manually assessed whether the caption content words (i.e., nouns, verbs, and adjectives) could describe the image. We found that the captions express the picture's content 90% of the time. Furthermore, approximately 88% of the nouns in subject or object position directly denote salient picture objects. We thus conclude that the captions contain useful information about the picture and can be used for annotation purposes.

Model Description The continuous relevance image annotation model (Lavrenko et al., 2003) generatively learns the joint probability distribution P(V, W) of words W and image regions V. The key assumption here is that the process of generating images is conditionally independent from the process of generating words. Each annotated image in the training set is treated as a latent variable. Then for an unknown image I, we estimate:

P(V_I, W_I) = ∑_{s∈D} P(V_I|s) P(W_I|s) P(s)    (1)

where D is the number of images in the training database, V_I are the visual features of the image regions representing I, W_I are the keywords of I, s is a latent variable (i.e., an image-annotation pair), and P(s) the prior probability of s. The latter is drawn from a uniform distribution:

P(s) = 1 / N_D    (2)

where N_D is the number of latent variables in the training database D.
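As a rough illustration of equations (1) and (2), the following Python sketch computes the joint probability by summing over the training pairs with a uniform prior. The two likelihood functions p_visual and p_words stand in for equations (3) and (4)-(7) below; all names are ours rather than the paper's.

```python
# Sketch of equations (1)-(2): sum over every training pair s with a uniform
# prior P(s) = 1/N_D.  p_visual and p_words are assumed to implement
# equations (3) and (4)-(7) respectively.
def joint_probability(test_regions, candidate_words, training_pairs, p_visual, p_words):
    n_d = len(training_pairs)          # N_D, the number of latent variables
    prior = 1.0 / n_d                  # equation (2)
    total = 0.0
    for s in training_pairs:           # each s is one image-annotation pair
        total += p_visual(test_regions, s) * p_words(candidate_words, s) * prior
    return total
```

At test time the score of a single word w can be obtained by evaluating this sum with the candidate set {w}; since only the ranking matters for annotation, the normalising constant can be ignored.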

When estimating P(V_I|s), the probability of the image regions given s, Lavrenko et al. (2003) reasonably assume a generative Gaussian kernel distribution for the image regions:

P(V_I|s) = ∏_{r=1}^{N_{V_I}} P_g(v_r|s) = ∏_{r=1}^{N_{V_I}} (1/n_s^v) ∑_{i=1}^{n_s^v} exp{−(v_r − v_i)^T Σ^{−1} (v_r − v_i)} / √(2^k π^k |Σ|)    (3)

where N_{V_I} is the number of regions in image I, v_r the feature vector for region r in image I, n_s^v the number of regions in the image of latent variable s, v_i the feature vector for region i in s's image, k the dimension of the image feature vectors, and Σ the feature covariance matrix. According to equation (3), a Gaussian kernel is fit to every feature vector v_i corresponding to region i in the image of the latent variable s. Each kernel is determined by the feature covariance matrix Σ; for simplicity, Σ is assumed to be a diagonal matrix, Σ = βI, where I is the identity matrix and β is a scalar modulating the bandwidth of the kernel, whose value is optimized on the development set.
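A small numerical sketch of equation (3) under the Σ = βI simplification; the array shapes and the helper name are our own, and we assume the exponent carries the usual negative sign of a Gaussian kernel:

```python
import numpy as np

# Sketch of equation (3): P(V_I|s) is a product over the test image's regions;
# each factor is a kernel density estimate placed on the regions of the
# training image s, with bandwidth Sigma = beta * I (beta tuned on held-out data).
def p_visual(test_regions, train_regions, beta=1.0):
    """test_regions: (N_VI, k) array; train_regions: (n_s, k) array."""
    k = test_regions.shape[1]
    norm = np.sqrt((2.0 * np.pi * beta) ** k)   # sqrt(2^k pi^k |Sigma|), since |Sigma| = beta^k
    prob = 1.0
    for v_r in test_regions:
        diffs = train_regions - v_r
        sq_dist = (diffs * diffs).sum(axis=1) / beta   # (v_r - v_i)^T Sigma^{-1} (v_r - v_i)
        prob *= (np.exp(-sq_dist) / norm).mean()       # average over the n_s kernels
    return prob
```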

Lavrenko et al. (2003) estimate the word probabilities P(W_I|s) using a multinomial distribution. This is a reasonable assumption for the Corel dataset, where the annotations have similar lengths and the words reflect the salience of objects in the image (the multinomial model tends to favor words that appear multiple times in the annotation). However, in our dataset the annotations have varying lengths and do not necessarily reflect object salience. We are more interested in modeling the presence or absence of words in the annotation and thus use the multiple-Bernoulli distribution to generate words (Feng et al., 2004). Rather than relying solely on the annotations in the training database, we can also take the accompanying document into account using a weighted combination.

The probability of sampling a set of words W given a latent variable s from the underlying multiple-Bernoulli distribution that has generated the training set D is:

P(W|s) = ∏_{w∈W} P(w|s) ∏_{w∉W} (1 − P(w|s))    (4)

where P(w|s) denotes the probability of the w'th component of the multiple-Bernoulli distribution. Now, in estimating P(w|s) we can include the document as:

P_est(w|s) = α P_est(w|s_a) + (1 − α) P_est(w|s_d)    (5)

where α is a smoothing parameter tuned on the development set, s_a is the annotation for the latent variable s, and s_d its corresponding document.

Equation (5) smooths the influence of the annotation words and allows us to offset the negative effect of the noise inherent in our dataset. Since our images are implicitly annotated, there is no guarantee that the annotations are all appropriate. By taking into account P_est(w|s_d), it is possible to annotate an image with a word that appears in the document but is not included in the caption.

We use a Bayesian framework for estimating P_est(w|s_a). Specifically, we assume a beta prior (conjugate to the Bernoulli distribution) for each word:

P_est(w|s_a) = (µ b_{w,s_a} + N_w) / (µ + D)    (6)

where µ is a smoothing parameter estimated on the development set, b_{w,s_a} is a Boolean variable denoting whether w appears in the annotation s_a, and N_w is the number of latent variables that contain w in their annotations.

We estimate P_est(w|s_d) using maximum likelihood estimation (Ponte and Croft, 1998):

P_est(w|s_d) = num_{w,s_d} / num_{s_d}    (7)

where num_{w,s_d} denotes the frequency of w in the accompanying document of latent variable s and num_{s_d} the number of all tokens in the document. Note that we purposely leave P_est(w|s_d) unsmoothed, since it is used as a means of balancing the weight of word frequencies in annotations. So, if a word does not appear in the document, the probability of selecting it will not be greater than α (see Equation (5)).
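The word model of equations (4)-(7) can be sketched as follows; the argument names (caption words, document tokens, the N_w counts) are our own labelling of the quantities defined above, and corner cases such as empty documents are ignored.

```python
# Sketch of equations (5)-(7): mix the beta-smoothed caption estimate with the
# unsmoothed maximum-likelihood document estimate.
def p_word_given_s(w, caption_words, document_tokens, n_w, num_pairs, alpha, mu):
    b = 1.0 if w in caption_words else 0.0                        # b_{w,s_a}
    p_caption = (mu * b + n_w) / (mu + num_pairs)                 # equation (6)
    p_document = document_tokens.count(w) / len(document_tokens)  # equation (7), unsmoothed
    return alpha * p_caption + (1.0 - alpha) * p_document         # equation (5)

# Sketch of equation (4): multiple-Bernoulli probability of a word set W given s.
def p_words(word_set, vocabulary, caption_words, document_tokens, n_w_counts,
            num_pairs, alpha=0.5, mu=1.0):
    prob = 1.0
    for w in vocabulary:
        p = p_word_given_s(w, caption_words, document_tokens,
                           n_w_counts.get(w, 0), num_pairs, alpha, mu)
        prob *= p if w in word_set else (1.0 - p)
    return prob
```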

Unfortunately, including the document in the estimation of P_est(w|s) increases the vocabulary, which in turn increases computation time. Given a test image-document pair, we must evaluate P(w|V_I) for every w in our vocabulary, which is the union of the caption and document words. We reduce the search space by scoring each document word with its tf∗idf weight (Salton and McGill, 1983) and adding the n-best candidates to our caption vocabulary. This way the vocabulary is not fixed in advance for all images but changes dynamically depending on the document at hand.
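A possible implementation of this pruning step, using one common tf∗idf variant; the smoothing of the document frequency is our choice and is not specified in the paper:

```python
import math
from collections import Counter

# Score the words of one document by tf*idf against the whole collection and
# keep only the n best for the dynamic vocabulary.
def top_n_by_tfidf(document_tokens, collection, n=100):
    tf = Counter(document_tokens)
    num_docs = len(collection)
    scores = {w: freq * math.log(num_docs / (1 + sum(1 for doc in collection if w in doc)))
              for w, freq in tf.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n]
```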

Re-ranking the Annotation Hypotheses It is easy to see that the output of our model is a ranked word list. Typically, the k-best words are taken to be the automatic annotations for a test image I (Duygulu et al., 2002; Lavrenko et al., 2003; Jeon and Manmatha, 2004), where k is a small number and the same for all images.

So far we have taken account of the auxiliary document rather naively, by considering its vocabulary in the estimation of P(W|s). Crucially, documents are written with one or more topics in mind. The image (and its annotations) are likely to represent these topics, so ideally our model should prefer words that are strong topic indicators. A simple way to implement this idea is by re-ranking our k-best list according to a topic model estimated from the entire document collection.

Specifically, we use Latent Dirichlet Allocation (LDA) as our topic model (Blei et al., 2003). LDA represents documents as a mixture of topics and has been previously used to perform document classification (Blei et al., 2003) and ad-hoc information retrieval (Wei and Croft, 2006) with good results. Given a collection of documents and a set of latent variables (i.e., the number of topics), the LDA model estimates the probability of topics per document and the probability of words per topic. The topic mixture is drawn from a conjugate Dirichlet prior that remains the same for all documents.

For our re-ranking task, we use the LDA model to infer the m-best topics in the accompanying document. We then select from the output of our model those words that are most likely according to these topics. To give a concrete example, let us assume that for a given image our model has produced five annotations, w1, w2, w3, w4, and w5. However, according to the LDA model neither w2 nor w5 is a likely topic indicator. We therefore remove w2 and w5 and substitute them with words further down the ranked list that are topical (e.g., w6 and w7). An advantage of using LDA is that at test time we can perform inference without retraining the topic model.
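One way to realise this re-ranking, assuming the per-document topic distribution and the per-topic word distributions are already available from the trained LDA model; the probability threshold below is an illustrative choice of ours, since the paper does not spell out the exact selection criterion:

```python
# Keep the candidates that are probable under the document's m best topics and
# push the remaining ones down the list, returning the k best overall.
def rerank(candidates, doc_topic_probs, word_topic_probs, k=10, m=3, threshold=1e-4):
    best_topics = sorted(doc_topic_probs, key=doc_topic_probs.get, reverse=True)[:m]
    def topical(w):
        return any(word_topic_probs[t].get(w, 0.0) > threshold for t in best_topics)
    topical_words = [w for w in candidates if topical(w)]
    other_words = [w for w in candidates if not topical(w)]
    return (topical_words + other_words)[:k]   # fall back to non-topical words if needed
```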

5 Experimental Setup

In this section we discuss our experimental design for assessing the performance of the model presented above. We give details on our training procedure and parameter estimation, describe our features, and present the baseline methods used for comparison with our approach.

Data Our model was trained and tested on the database introduced in Section 3. We used 2,881 image-caption-document tuples for training, 240 tuples for development, and 240 for testing. The documents and captions were part-of-speech tagged and lemmatized with TreeTagger (Schmid, 1994). Words other than nouns, verbs, and adjectives were discarded. Words that were attested less than five times in the training set were also removed to avoid unreliable estimation. In total, our vocabulary consisted of 8,309 words.
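A sketch of the vocabulary construction just described; the Penn-style tag prefixes are an assumption on our part, and the tagging and lemmatisation themselves are assumed to have been done beforehand (with TreeTagger in the paper):

```python
from collections import Counter

CONTENT_PREFIXES = ("NN", "VB", "JJ")   # nouns, verbs, adjectives (assumed tagset)

# Keep content-word lemmas seen at least five times in the training set.
def build_vocabulary(tagged_texts, min_count=5):
    """tagged_texts: an iterable of lists of (lemma, pos_tag) pairs."""
    counts = Counter(lemma for text in tagged_texts
                     for lemma, tag in text if tag.startswith(CONTENT_PREFIXES))
    return {lemma for lemma, c in counts.items() if c >= min_count}
```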

Model Parameters Images are typically segmented into regions prior to training. We impose a fixed-size rectangular grid on each image rather than attempting segmentation using a general purpose algorithm such as normalized cuts (Shi and Malik, 2000). Using a grid avoids unnecessary errors from image segmentation algorithms, reduces computation time, and simplifies parameter estimation (Feng et al., 2004). Taking the small size and low resolution of the BBC News images into account, we evenly divide each image into 6×5 rectangles and extract features for each region. We use 46 features based on color, texture, and shape. They are summarized in Table 2.

Color:   average of RGB components, standard deviation
         average of LUV components, standard deviation
         average of LAB components, standard deviation
Texture: output of DCT transformation
         output of Gabor filtering (4 directions, 3 scales)
Shape:   oriented edge (4 directions)
         ratio of edge to non-edge

Table 2: Set of image features used in our experiments.
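For concreteness, a minimal sketch of the region step: the image is cut into a fixed 6×5 grid and per-cell statistics are computed. Only the RGB mean and standard deviation are shown; the remaining colour, texture, and shape features of Table 2 would be appended in the same way.

```python
import numpy as np

def grid_features(image, rows=5, cols=6):
    """image: (height, width, 3) array; returns one feature vector per grid cell."""
    height, width, _ = image.shape
    features = []
    for r in range(rows):
        for c in range(cols):
            cell = image[r * height // rows:(r + 1) * height // rows,
                         c * width // cols:(c + 1) * width // cols]
            pixels = cell.reshape(-1, 3).astype(float)
            # mean and standard deviation of the RGB channels for this cell
            features.append(np.concatenate([pixels.mean(axis=0), pixels.std(axis=0)]))
    return np.stack(features)   # shape (30, 6) for this reduced feature set
```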

The model presented in Section 4 has a few parameters that must be selected empirically on the development set. These include the vocabulary size, which is dependent on the n words with the highest tf∗idf scores in each document, and the number of topics for the LDA-based re-ranker. We obtained best performance with n set to 100 (no cutoff was applied in cases where the vocabulary was less than 100). We trained an LDA model with 20 topics on our document collection using David Blei's implementation [3]. We used this model to re-rank the output of our annotation model according to the three most likely topics in each document.

[3] Available from http://www.cs.princeton.edu/~blei/lda-c/index.html.

Baselines We compared our model against three baselines. The first baseline is based on tf∗idf (Salton and McGill, 1983). We rank the document's content words (i.e., nouns, verbs, and adjectives) according to their tf∗idf weight and select the top k to be the final annotations. Our second baseline simply annotates the image with the document's title. Again we only use content words (the average title length in the training set was 4.0 words). Our third baseline is Lavrenko et al.'s (2003) continuous relevance model. It is trained solely on image-caption pairs, uses a vocabulary of 2,167 words, and the same features as our extended model.

Model        Top 10                     Top 15                     Top 20
             Precision  Recall  F1      Precision  Recall  F1      Precision  Recall  F1
tf∗idf        4.37       7.09    5.41    3.57       8.12    4.86    2.65       8.89    4.00
DocTitle      9.22       7.03    7.20    9.22       7.03    7.20    9.22       7.03    7.20
Lavrenko03    9.05      16.01   11.81    7.73      17.87   10.71    6.55      19.38    9.79
ExtModel     14.72      27.95   19.82   11.62      32.99   17.18    9.72      36.77   15.39

Table 1: Automatic image annotation results on the BBC News database.

Evaluation Our evaluation follows the experimental methodology proposed in Duygulu et al. (2002). We are given an un-annotated image I and are asked to automatically produce suitable annotations for I. Given a set of image regions V_I, we use equation (1) to derive the conditional distribution P(w|V_I). We consider the k-best words as the annotations for I. We present results using the top 10, 15, and 20 annotation words. We assess our model's performance using precision, recall, and F1. In our task, precision is the percentage of correctly annotated words over all annotations that the system suggested. Recall is the percentage of correctly annotated words over the number of genuine annotations in the test data. F1 is the harmonic mean of precision and recall. These measures are averaged over the set of test words.
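The per-word scoring can be sketched as below; the exact averaging used in the paper may differ in detail, so this is only meant to make the definitions concrete.

```python
# For every test word, compare the images the system annotates with it against
# the images whose gold captions contain it, then average over words.
def per_word_scores(predictions, gold):
    """predictions, gold: dicts mapping image id -> set of annotation words."""
    words = {w for g in gold.values() for w in g}
    precision = recall = f1 = 0.0
    for w in words:
        retrieved = {i for i, p in predictions.items() if w in p}
        relevant = {i for i, g in gold.items() if w in g}
        correct = len(retrieved & relevant)
        p = correct / len(retrieved) if retrieved else 0.0
        r = correct / len(relevant) if relevant else 0.0
        precision += p
        recall += r
        f1 += 2 * p * r / (p + r) if p + r else 0.0
    n = len(words)
    return precision / n, recall / n, f1 / n
```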

6 Results

Our experiments were driven by three questions: (1) Is it possible to create an annotation model from noisy data that has not been explicitly hand labeled for this task? (2) What is the contribution of the auxiliary document? As mentioned earlier, considering the document increases the model's computational complexity, which can be justified as long as we demonstrate a substantial increase in performance. (3) What is the contribution of the image? Here, we are trying to assess if the image features matter. For instance, we could simply generate annotation words by processing the document alone.

Our results are summarized in Table 1. We compare the annotation performance of the model proposed in this paper (ExtModel) with Lavrenko et al.'s (2003) original continuous relevance model (Lavrenko03) and two other simpler models which do not take the image into account (tf∗idf and DocTitle). First, note that the original relevance model performs best when the annotation output is restricted to 10 words, with an F1 of 11.81% (recall is 9.05 and precision 16.01). F1 is marginally worse with 15 output words and decreases by 2% with 20. This model does not take any document-based information into account; it is trained solely on image-caption pairs. On the Corel test set the same model obtains a precision of 19.0% and a recall of 16.0% with a vocabulary of 260 words. Although these results are not strictly comparable with ours due to the different nature of the training data (in addition, we output 10 annotation words, whereas Lavrenko et al. (2003) output 5), they give some indication of the decrease in performance incurred when using a more challenging dataset. Unlike Corel, our images have greater variety, non-overlapping content, and employ a larger vocabulary (2,167 vs. 260 words).

When the document is taken into account (see ExtModel in Table 1), F1 improves by 8.01% (recall is 14.72% and precision 27.95%). Increasing the size of the output annotations to 15 or 20 yields better recall, at the expense of precision. Eliminating the LDA reranker from the extended model decreases F1 by 0.62%. Incidentally, LDA can also be used to rerank the output of Lavrenko et al.'s (2003) model; it increases the performance of this model by 0.41%.

Finally, considering the document alone, without the image, yields inferior performance. This is true for the tf∗idf model and the model based on the document titles [4]. Interestingly, the latter yields precision similar to Lavrenko et al. (2003). This is probably due to the fact that the document's title is in a sense similar to a caption. It often contains words that describe the document's gist and expectedly some of these words will also be appropriate for the image. In fact, in our dataset, the title words are a subset of those found in the captions.

[4] Reranking the output of these models with LDA slightly decreases performance (approximately by 0.2%).

Left image
tf∗idf: breastfeed, medical, intelligent, health, child
DocTitle: Breast milk does not boost IQ
Lavrenko03: woman, baby, hospital, new, day, lead, good, England, look, family
ExtModel: breastfeed, intelligent, baby, mother, tend, child, study, woman, sibling, advantage
Caption: Breastfed babies tend to be brighter

Center image
tf∗idf: culturalism, faith, Muslim, separateness, ethnic
DocTitle: UK must tackle ethnic tensions
Lavrenko03: bomb, city, want, day, fight, child, attack, face, help, government
ExtModel: aim, Kelly, faith, culturalism, community, Ms, tension, commission, multi, tackle, school
Caption: Segregation problems were blamed for 2001's Bradford riots

Right image
tf∗idf: ceasefire, Lebanese, disarm, cabinet, Haaretz
DocTitle: Mid-East hope as ceasefire begins
Lavrenko03: war, carry, city, security, Israeli, attack, minister, force, government, leader
ExtModel: Lebanon, Israeli, Lebanese, aeroplane, troop, Hezbollah, Israel, force, ceasefire, grey
Caption: Thousands of Israeli troops are in Lebanon as the ceasefire begins

Figure 3: Examples of annotations generated by our model (ExtModel), the continuous relevance model (Lavrenko03), and the two baselines based on tf∗idf and the document title (DocTitle). Words in bold face indicate exact matches, underlined words are semantically compatible. The original captions are in the last row.

Examples of the annotations generated by our model are shown in Figure 3. We also include the annotations produced by Lavrenko et al.'s (2003) model and the two baselines. As we can see, our model annotates the image with words that are not always included in the caption. Some of these are synonyms of the caption words (e.g., child and intelligent in the left image of Figure 3), whereas others express additional information (e.g., mother, woman). Also note that complex scene images remain challenging (see the center image in Figure 3). Such images are better analyzed at a higher resolution and probably require more training examples.

7 Conclusions and Future Work

In this paper, we describe a new approach for the collection of image annotation datasets. Specifically, we leverage the vast resource of images available on the Internet while exploiting the fact that many of them are labeled with captions. Our experiments show that it is possible to learn an image annotation model from caption-picture pairs even if these are not explicitly annotated in any way. We also show that the annotation model benefits substantially from additional information, beyond the caption or image. In our case this information is provided by the news documents associated with the pictures. But more generally our results indicate that further linguistic knowledge is needed to improve performance on the image annotation task. For instance, resources like WordNet (Fellbaum, 1998) can be used to expand the annotations by exploiting information about is-a relationships.

The uses of the database discussed in this article are many and varied. An interesting future direction concerns the application of the proposed model in a semi-supervised setting where the annotation output is iteratively refined with some manual intervention. Another possibility would be to use the document to increase the annotation keywords by identifying synonyms or even sentences that are similar to the image caption. Also note that our analysis of the accompanying document was rather shallow, limited to part-of-speech tagging. It is reasonable to assume that results would improve with more sophisticated preprocessing (i.e., named entity recognition, parsing, word sense disambiguation). Finally, we also believe that the model proposed here can be usefully employed in an information retrieval setting, where the goal is to find the image most relevant for a given query or document.


References

K. Barnard, P. Duygulu, D. Forsyth, N. de Freitas, D. Blei, and M. Jordan. 2002. Matching words and pictures. Journal of Machine Learning Research, 3:1107–1135.

D. Blei and M. Jordan. 2003. Modeling annotated data. In Proceedings of the 26th Annual International ACM SIGIR Conference, pages 127–134, Toronto, ON.

D. Blei, A. Ng, and M. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

R. Datta, J. Li, and J. Z. Wang. 2005. Content-based image retrieval – approaches and trends of the new age. In Proceedings of the International Workshop on Multimedia Information Retrieval, pages 253–262, Singapore.

P. Duygulu, K. Barnard, J. de Freitas, and D. Forsyth. 2002. Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. In Proceedings of the 7th European Conference on Computer Vision, pages 97–112, Copenhagen, Denmark.

C. Fellbaum, editor. 1998. WordNet: An Electronic Database. MIT Press, Cambridge, MA.

S. Feng, V. Lavrenko, and R. Manmatha. 2004. Multiple Bernoulli relevance models for image and video annotation. In Proceedings of the International Conference on Computer Vision and Pattern Recognition, pages 1002–1009, Washington, DC.

T. Hofmann. 1998. Learning and representing topic. A hierarchical mixture model for word occurrences in document databases. In Proceedings of the Conference for Automated Learning and Discovery, pages 408–415, Pittsburgh, PA.

J. Jeon and R. Manmatha. 2004. Using maximum entropy for automatic image annotation. In Proceedings of the 3rd International Conference on Image and Video Retrieval, pages 24–32, Dublin City, Ireland.

V. Lavrenko, R. Manmatha, and J. Jeon. 2003. A model for learning the semantics of pictures. In Proceedings of the 16th Conference on Advances in Neural Information Processing Systems, Vancouver, BC.

Y. Mori, H. Takahashi, and R. Oka. 1999. Image-to-word transformation based on dividing and vector quantizing images with words. In Proceedings of the 1st International Workshop on Multimedia Intelligent Storage and Retrieval Management, Orlando, FL.

J. M. Ponte and W. Bruce Croft. 1998. A language modeling approach to information retrieval. In Proceedings of the 21st Annual International ACM SIGIR Conference, pages 275–281, New York, NY.

G. Salton and M. J. McGill. 1983. Introduction to Modern Information Retrieval. McGraw-Hill, New York.

H. Schmid. 1994. Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing, Manchester, UK.

J. Shi and J. Malik. 2000. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905.

A. W. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain. 2000. Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12):1349–1380.

J. Tang and P. H. Lewis. 2007. A study of quality issues for image auto-annotation with the Corel data-set. IEEE Transactions on Circuits and Systems for Video Technology, 17(3):384–389.

A. Vailaya, M. Figueiredo, A. Jain, and H. Zhang. 2001. Image classification for content-based indexing. IEEE Transactions on Image Processing, 10:117–130.

X. Wei and B. W. Croft. 2006. LDA-based document models for ad-hoc retrieval. In Proceedings of the 29th Annual International ACM SIGIR Conference, pages 178–185, Seattle, WA.
