object-based tag propagation for semi-automatic annotation

Object-based Tag Propagation for Semi-AutomaticAnnotation of Images

Ivan Ivanov, Peter Vajda, Lutz Goldmann, Jong-Seok Lee, Touradj EbrahimiMultimedia Signal Processing Group – MMSPG

Institute of Electrical Engineering – IELEcole Polytechnique Fédérale de Lausanne – EPFL

CH-1015 Lausanne, Switzerland{ivan.ivanov, peter.vajda, lutz.goldmann, jong-seok.lee, touradj.ebrahimi}@epfl.ch

ABSTRACTOver the last few years, social network systems have greatlyincreased users’ involvement in online content creation andannotation. Since such systems usually need to deal with alarge amount of multimedia data, it becomes desirable to re-alize an interactive service that minimizes tedious and time-consuming manual annotation. In this paper, we proposean interactive online platform that is capable of performingsemi-automatic image annotation and tag recommendationfor an extensive online database. First, when the user marksa specific object in an image, the system performs an objectduplicate detection and returns the search results with im-ages containing similar objects. Then, the annotation of theobject can be performed in two ways: (1) In the tag rec-ommendation process, the system recommends tags associ-ated with the object in images of the search results, amongwhich, the user can accept some tags for the object in thegiven image. (2) In the tag propagation process, when theuser enters his/her tag for the object, it is propagated to im-ages in the search results. Different techniques to speed-upthe process of indexing and retrieval are presented in thispaper and their effectiveness demonstrated through a set ofexperiments considering various classes of objects.

Categories and Subject DescriptorsH.3.5 [Information Storage and Retrieval]: Online In-formation Services—Data sharing, Web-based services; H.3.3[Information Storage and Retrieval]: Information Searchand Retrieval—Information filtering

General TermsAlgorithms, Experimentation, Performance

Keywordssocial networks, tag propagation, image annotation, tag rec-ommendation, object duplicate detection

Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies arenot made or distributed for profit or commercial advantage and that copiesbear this notice and the full citation on the first page. To copy otherwise, torepublish, to post on servers or to redistribute to lists, requires prior specificpermission and/or a fee.MIR’10, March 29–31, 2010, Philadelphia, Pennsylvania, USA.Copyright 2010 ACM 978-1-60558-815-5/10/03 ...$10.00.

1. INTRODUCTIONIn recent years, social networks, digital photography and

web-based personal image collections have gained popular-ity. A social network service typically focuses on building on-line communities of people who share interests and activities,or are interested in exploring the interests and activities ofothers. At the same time, they have become a popular wayto share and to disseminate information, creating new chal-lenges for access, search and retrieval. For example, usersupload their personal photos and share them through onlinecommunities, asking other people to comment or rate them.This trend has resulted in a continuously growing volumeof publicly available photos on multimedia sharing websiteslike Flickr1, or social networks like FaceBook2. For instance,Flickr contains over 3.6 billion photos [2], and every monthmore than 2 billion photos are uploaded to FaceBook [1].

In these environments, photos are usually accompaniedwith metadata, such as comments, ratings, information aboutthe uploader and their social network. Moreover, a recenttrend is also to “tag” them. Tags are short textual annota-tions used to describe photos in order to provide meaningfulinformation about them.

The most popular tags in photo sharing sites such as Flickrare usually related to the location where the photo was taken(e.g., San Francisco or France), the objects/persons appear-ing in the photo (e.g., baby, car or house), or the event/timewhen the photo was taken (e.g., wedding or summer) [3, 17].

Annotations and their association with images provide apowerful cue for their grouping and indexing. This cue isalso essential for image retrieval systems to work in prac-tice. The current state-of-the-art in content-based imageretrieval systems has not yet delivered widely accepted so-lutions, except for some very narrow application domains,mainly because of the semantic gap problem, i.e., it is hardto extract semantically meaningful information using justlow-level features [19]. In social networks, the success ofFlickr and FaceBook proves that users are willing to pro-vide this semantic context (“subjective” users’ impressions)through manual annotations [17], which can help to bridgethe semantic gap, and therefore improve the results of vi-sual content search engines. Different users who annotatethe same photo can provide different annotations, which isone more advantage of these systems.

However, tagging a lot of photos by hand is a time-consumingtask. Users typically tag a small number of shared photos

1http://www.flickr.com2http://www.facebook.com

497

only, leaving most of them with incomplete metadata. Thislack of metadata seriously impairs search, as photos with-out proper annotations are typically much harder to find.Therefore, robust and efficient algorithms for automatic orsemi-automatic tagging (or tag propagation) are desirable tohelp people organize and browse large collections of personalphotos in an efficient way.

The main novelty of the paper comes from the applica-tion which realizes an interactive service that minimizes theusers’ tedious and time-consuming manual annotation pro-cess, and the evaluation of the object duplicate detectionpart of the system. We propose an interactive online plat-form which is capable of performing semi-automatic imageannotation and tag recommendation for an extensive on-line database of images containing various object classes.Since the most salient regions in images usually correspondto specific objects, we consider object-based tagging withinthe system. First, when the user marks a specific object inan image, the system performs an object duplicate detec-tion and returns the search results with images containingsimilar objects. Then, the annotation of the object can beperformed in two ways, i.e., tag recommendation and tagpropagation. In the tag recommendation mode, the sys-tem recommends tags for the object within the query im-age. The corresponding tags of all matched objects withinthe retrieved images are shown to the user who can thenselect appropriate tags among them. In the tag propagationmode, when the user enters his/her tag for the object, it ispropagated to other images containing similar objects.

The remaining sections of this paper are organized as fol-lows. We introduce related work in Section 2. Section 3 de-scribes our approach for the interactive online platform anddiscusses the two modes, tag recommendation and tag prop-agation. Experiments and results are discussed in Section 4.Finally, Section 5 concludes the paper with a summary andsome perspectives for future work.

2. RELATED WORKThe proposed system is related to different research fields

including visual content analysis, social networking and tag-ging. The goal of this section is to review the most relevantworks on human tagging and various approaches for tag rec-ommendation.

Tagging images is a very time consuming process and tag-ging objects within images even more. Therefore, it is nec-essary to understand and increase the motivation of users toannotate images. Ames and Naaman [5] have explored dif-ferent factors that motivate people to tag photos in mobileand online environments. One way is to decrease the com-plexity of the tagging process through tag recommendationwhich derives a set of possible tags from which the user canselect suitable ones. Another way is to provide incentivesfor the user in form of entertainment or rewards. The mostfamous examples are the ESP Game and Peekaboom, devel-oped for collecting information about image content. TheESP Game [21] randomly matches two players who are notallowed to communicate with each other. They are shownthe same image and asked to enter a textual label that de-scribes it. The aim is to enter the same word as your partnerin the shortest possible time. Peekaboom [22] takes the ESPGame to the next level. Unlike the ESP Game, it’s asym-metrical. To start, one player is shown an image and theother sees an empty black space. The first user is given a

word related to the image, and the aim is to communicatethat word to the other player by revealing portions of theimage. Peekaboom improves on the data collected by theESP Game and for each object in the image determines pre-cise location information. Unlike these games, LabelMe is aweb-based tool that allows easy image annotation and shar-ing of such annotations [16]. Using this tool, a large varietyof annotations are collected spanning many object categories(cars, people, buildings, animals, tools, etc.).

However, manually tagging a large number of photos isstill a tedious and time-consuming task. Thus automaticimage annotation has received a lot of attention recently.Automatic image annotation is a challenging task which hasnot been solved in a satisfactory fashion for real-world appli-cations. Most of the solutions are developed for a specific ap-plication and usually consider only one tag type, e.g., faces,locations, or events.

Berg et al. [8] propose an approach to label people, i.e.assign names to faces within newspapers (images with cap-tions). They cluster face images visually in appropriate dis-criminant coordinates and apply natural language process-ing techniques to the caption to check whether the personwhose name appears in a caption is depicted in the associ-ated image or not. Picasa3 also provides a service for nametagging which automatically finds similar faces in a photocollection.

Annotating images with geographical information such aslandmarks and locations is a topic which has recently gainedincreasing attention. Ahern et al. [4] have developed a mo-bile application called ZoneTag4 which enables the upload ofcontext-aware photos from mobile phones equipped with acamera to Flickr. In addition to automatically supplying thelocation metadata for each photo (provided by a GPS de-vice or mobile phone), ZoneTag supports context-based tagsuggestions in which tags are provided from different sourcesincluding past tags from the user, the user’s social network,and external geo-referenced data sources like Yahoo! Lo-cal, and Upcoming.org. Through the combination of tex-tual information with vision based methods efficient systemsfor tag recommendation and propagation can be developed.Kucuktunc et al. [10] incorporate visual and textual cues ofsemantically relevant photos to recommend tags and to im-prove the quality of the suggested tags. First, the systemrequires the user to add a few initial tags for each uploadedphoto. These initial tags are used to retrieve related pho-tos which contain other tags besides the initial ones. Thenthe set of candidate tags collected from a pool of images isweighted according to the similarity of the target photo tothe retrieved photo. Similarity is based on visual featuresincluding color histograms and SIFT features. Finally, ascoring function is applied to the list of candidate tags andthe tags with the highest scores are used to automaticallyexpand the tags of the target photo. This approach has itslimitations, since users typically do not provide initial tagsfor each uploaded photo, but rather tag only a small numberof shared photos.

Another application that combines textual and visual tech-niques has been proposed by Quack et al. [15]. They devel-oped a system that crawls photo collections on the internetto identify clusters of images referring to a common object

3http://picasa.google.com4http://zonetag.research.yahoo.com

498

(physical items on fixed locations), and events (special so-cial occasions taking place at certain times). The clustersare created based on the pair-wise visual similarities betweenthe images, and the metadata of the clustered photos is usedto derive a label for the clusters. Finally, Wikipedia5 articlesare attached to the images and the validity of these associ-ations is checked. Gammeter et al. [9] extends this ideatowards object-based auto-annotation of holiday photos ina large database that includes landmark buildings, statues,scenes, pieces of art, with help of external resources such asWikipedia. In both papers, [9] and [15], GPS coordinatesare used to pre-cluster objects which limit classes of objectsto landmarks and buildings. Another limitation is that GPScoordinates may not be always available. In contrast, ourwork considers extensive online database of various objectclasses, such as landmark buildings, cars, cover or text pagesof newspapers, shoes, trademarks and different gadgets likemobile phones, cameras, watches.

Lindstaedt et al. [11] developed tagr a tag recommenda-tion system for pictures which depict fruits and vegetables.They combine three types of information: visual content,text and user context. At first, they group annotated im-ages into classes using global color and texture features. Theuser defined annotations are then linked with the images.The resulting set of tags for visually similar images is thenextended with synonyms derived from WordNet. When theuser uploads an untagged image, it is assigned to one ofthe classes and corresponding tags are recommended to theuser. In addition, this system analyzes tags which the userassigns to the images and returns the profiles of users withsimilar tagging preferences. This method has been provento be effective to recommend tags for a set of selected fruitsand vegetables, but it cannot be applied to other classesof objects, which limits its applicability. Other approachesfor automatic image annotation consider only the context.Sigurbjornsson and van Zwol [17] developed a system whichrecommends a set of tags based on collective knowledge ex-tracted from Flickr. Given a photo with user-defined tags,a new list of candidate tags is derived for each of the user-defined tags, based on tag co-occurrence. The lists of can-didate tags are aggregated, tags are ranked, and a new setof recommended tags is provided to the user.

The tag propagation and tag recommendation are nowa-days very important in environment such as social network,since they provide efficient information for grouping or re-trieving images. The system proposed in this paper providesthese functionalities in an interactive way. The novelty isthat image annotation is performed at the object level bymaking use of content based processing. It does not con-sider context, such as text or GPS coordinates, which maylimit its applicability. This approach is suitable for all kindsof objects, such as trademarks, books, newspapers, and notjust buildings or landmarks.

3. SYSTEM OVERVIEWIn this section, we present our method for object-based

tag recommendation and propagation. Since the manualannotation of all the instances of an object within a largeset of images is very time consuming, the system offers tagpropagation of marked and tagged objects. Image annota-tion is performed at the object level, outlining the object

5http://www.wikipedia.org

T a g g e d o r n o n - ta g g e dim a g e s

F e a tu re e x tra c t io n

T a g s & b o u n d in g -b o x e sO p e ra H o u s e N ik e s h o e s

F e rr a r i

iP h o n e

Non

-tagg

ed im

age

Tagg

ed im

ages

F e rr a r i

M y S Q L d a ta b a s e

T a g s &b o u n d in g -b o x e s

C lu s te r in g

V is u a l fe a tu re s& tre e s tru c tu re

Im a g e m a tc h in g

O b je c t d u p lic a te d e te c tio n

T a g re c o m m e n d a t io n

T a gp ro p a g a t io n

O b je c t s e le c t io n

C a n d id a te s im a g e s

R e le v a n t im a g e s & b o u n d in g -b o x e s

O ff l in e

O n lin e

F e a tu re e x tra c t io n

F e r ra r i

F e r ra r i

F e rr a r i

F e rr a r i

Figure 1: Overview of the system for semi-automaticannotation of objects in images.

with a bounding-box. The system architecture is illustratedin Figure 1.

3.1 Offline partThe goal of the offline processing is to preprocess uploaded

images in order to allow efficient and interactive object tag-ging. It starts by describing each image with a set of sparselocal features. In order to speed up the feature matching,the features of all images are grouped hierarchically into atree representation.

3.1.1 Feature extractionFor a robust and efficient object localization sparse local

features are adopted to describe the image content. Salientregions are detected using the Fast-Hessian detector [7] whichis based on approximation of Hessian matrix detector. Theposition and scale are computed for each of the regions andwill be used for the object duplicate detection (described inSection 3.2.3).

The detected regions are described using Speeded Up Ro-bust Features (SURF) [7], which can be extracted very ef-ficiently and are robust to arbitrary changes in viewpoints.The goal of SURF is to approximate the popular and ro-bust features based on the Scale Invariant Feature Trans-form (SIFT) [12].

3.1.2 ClusteringFor the object tagging, features of a selected object have to

be matched against all the images in the database. Thereforea fast matching algorithm is required to ensure interactivityof the application. Hierarchical clustering is applied to groupthe features according to their similarity. This improves theefficiency of the feature matching since a fast approximationof the nearest neighbor search can be used.

Hierarchical k-means clustering is used to derive the vo-cabulary tree, similar to the one described in [13]. Withinthe tree, parent nodes correspond to the cluster centers de-rived from the features of all its children nodes and leaf nodescorrespond to the real features within the images. The clus-

499

tering leads to a balanced tree with a similar depth for allthe leaves.

Since the importance of the individual visual words (i.e.,nodes in the tree structure) may differ among the imagesin the database, weights wi are assigned to each of the cor-responding nodes i. These weights are equivalent to theinverse document frequency (IDF) commonly used in textretrieval which is defined as

wi = log

(N

Ni

)(1)

where N is the number of images in the database and Ni isthe number of images which have features in the subtree, ifthe i-th node is considered as a root of this subtree. Thebasic idea behind IDF is that the importance of a visual wordis higher if it is contained in only a few images. Furthermore,the importance of a visual word i in relation to an individualimage j is considered using the term frequency (TF) whichis defined as

mij =Nij∑k Nkj

(2)

where Nij is the number of occurrences of a visual word iwithin an image j and the denominator is the number ofoccurrences of all features within image j.

Given this TF-IDF weighting scheme the overall weightdij for a visual word i within an image j is given as

dij = mij · wi (3)

which can be combined into a vector dj . This vector will bematched to the one extracted from the query image to com-pute the similarity within the image matching step describedin Section 3.2.2.

The computational complexity of the complete offline phaseis O((n · N · log(n))2), where n is the size of the image andN is the number of images, since the clustering, which isthe most time consuming part of this method, uses limitednumber of iterations.

3.2 Online partThe goal of the online processing is to recommend a tag

for an object in a given image and then to propagate this tagto other images containing the same object. The user marksa desired object in the image by selecting a bounding-boxaround it. The system performs image matching by makinguse of local features and selects a reduced set of candidateimages which are most likely to contain the target object.The object duplicate detection is applied to detect and tolocalize the target object within the reduced set of images.The corresponding tags of all matched objects are shownto the user who can then select an appropriate tag amongthem. Once an object has been tagged the user can ask thesystem to propagate it automatically to other images withinthe database.

3.2.1 Object selectionThe user can annotate any photo in the database, which

is either uploaded by himself/herself or by any other user.Images are annotated on the object level. The database usedin this work covers a wide range of different classes, whichwill be described in more details in Section 4.1. Once theuser chooses a photo which he/she wants to annotate, theuser is free to label as many objects depicted in the image

as he/she chooses. The user interface used in this work isshown in Figure 2. By clicking on the button “add note”,the user marks an object by selecting object’s boundariesas a bounding-box. This process is commonly used in manyphoto sharing services, such as FaceBook. When a user en-ters the page with particular image from the dataset, tagswhich are previously entered by other users accompaniedwith the corresponding bounding-boxes, will already appearon the image. If there is a mistake in the annotation (eitherthe outline or the text of the label is not correct), the usermay either edit the label by renaming the object, redraw-ing along the object’s boundary or deleting labels for thechosen image. Once the desired object is marked, the tagrecommendation process can start.

3.2.2 Image matchingIn order to speed up the object duplicate detection process

image matching is used to select a reduced set of candidateimages which are most likely to contain the target object.Since the more complex object duplicate detection is onlyapplied to this reduced set, the overall speed is considerablyincreased. By making use of the local features, target im-ages can be distinguished from non target images even if thetarget object is just a small part of it.

Given the local features within the selected region in thequery image and the vocabulary tree, a weighting vector qis computed in the same way as the weighting vector dj forimage j, described in Section 3.1.2.

Based on these weighting vectors, the query image is matchedto all the images j in the database and the individual match-ing scores sj are computed in the same way as in [13]:

sj = ‖q − dj‖ = 2 − 2 ·∑

∀i:qi �=0∩dij �=0

qi · dij

‖q‖ · ‖di‖ . (4)

Image j whose scores sj exceed a predefined thresholdTI are discarded and will not be considered for the objectduplicate detection.

The complexity of the search step for similar images isO(n · log(n)), where n is the size of the query image, as thefeature extraction creates O(n · log(n)) features by makinguse of pyramids for detection of scale invariant features [7].

3.2.3 Object duplicate detectionThe goal of the object duplicate detection step is to de-

tect and to localize the target object within the reducedset of images returned from the image matching step. Theoutcome of this step is a set of predicted objects describedthrough their bounding-boxes for each of the images.

Local features are used for object duplicate detection in[12]. General Hough Transformation is then applied for ob-ject localization. Our object duplicates detection method isbased on this algorithm and the detection accuracy is im-proved by using inverse document frequency. Inverse docu-ment frequency has been used for the similar purpose in [18].Descriptors are extracted from local affine-invariant regionsand quantized into visual words, reducing the noise sensitiv-ity of the matching. Inverted files are then used to matchthe video frames to a query object and retrieve those whichare likely to contain the same object. Different techniquesfor the object duplicate detection have been proposed in theliterature. Vajda et al. [20] proposed to use sparse featureswhich are robust to arbitrary changes in viewpoints. Spa-tial graph model matching is then applied to improve the

500

Figure 2: Screenshot of the web application duringthe tag recommendation step. Based on the selectedobject the system automatically proposes tags fromwhich the user can select suitable ones.

accuracy of the detection, which considers the scale, orien-tation, position and neighborhood of the features. Philbinet al. [14] applied the Bag of Words method for detectingbuildings in a large database. To resolve the problem oflarge database they use a forest of 8 randomized k-d treesas a data structure for storing and searching features.

The detection and localization starts by matching the fea-tures within the selected region of the query image to thefeatures with the candidate image. Again the hierarchicalvocabulary tree is used to speed up the nearest neighborsearch. Matches whose distance is larger than a predefinedthreshold TF are discarded.

In order to detect and to localize target objects based onthese matched features, the general Hough transform [6] isapplied. Each matched feature within the candidate imagevotes for the position (center) and the scale of a bounding-box based on the position and scale of the correspondingfeature within the query image. Since unique features mayprovide a more reliable estimate of the bounding-box, thevote of a feature is equal to its inverse document frequency(IDF) value wi already described in Section 3.1.2. This leadsto a 3-dimensional histogram that describes the distributionof the votes across the bounding-box parameters (position,scale). To obtain the set of predicted objects the local max-ima of the histogram are searched and thresholded with TO .

The complexity of our method for object duplicate de-tection is O(n · log(n)), where n is the size of the queryimage, since the SURF feature extraction uses image pyra-mids for detection of scale invariant features [7] and thegeneral Hough transformation has the same computationalcomplexity, since we do not consider rotated objects withinthe database.

Figure 3: Screenshot of the web application duringthe tag propagation step. The system automaticallypropagates the tagged object to the images in thedatabase and asks the user to verify the result.

3.2.4 Tag recommendationThe main idea of the tag recommendation step is to assist

the user by suggesting probable tags for the marked object asshown in Figure 2. This is comparable to the autocompletefeature in text editors that suggests relevant words based onalready typed characters.

After selecting an object within the current image the usercan press the “recommend” button to ask for suitable tagsfor this object. The system tries to find duplicates of theselected object using the algorithms described in the previ-ous sections. The bounding-boxes of the predicted objectsare compared to those of already tagged objects. If there ismore than 50 % overlap, objects are considered as a match.The corresponding tags of all matched objects are combinedinto a recommendation list which is displayed to the user inform of tags and associated thumbnails. Duplicate tags mayappear in the recommendation list, when multiple imagescontain visually similar objects accompanied by the sametag, which can be seen in Figure 2.

The user has then the choice to select one or more of therecommended tags or to provide his own tags. At the end ofthis step, the bounding-box of the objects and the associatedtags are stored in the database.

3.2.5 Tag propagationSince the manual annotation of all the instances of an ob-

ject within a large set of images is very time consuming, thesystem offers tag propagation of marked and tagged objectsas shown in Figure 3. Thereby duplicates of the tagged ob-ject are detected within the database and the result is shownto the user.

501

Figure 4: Samples from the 160 objects within thedataset.

Once an object has been marked and tagged, a user canask the system to propagate it automatically to the otherimages in the database by pressing the “propagate” button.The system performs object duplicate detection in the wayexplained in previous sections, and returns images contain-ing object duplicates. Considering matches between propa-gated and already tagged objects one has to distinguish twocases:

• If an object duplicate does not match any alreadytagged object both its bounding-box and tag can beautomatically propagated to the corresponding image.However since the object duplicate detection may re-turn a few non-relevant objects the user can verify thepropagated tags.

• If an object duplicate matches an already tagged ob-ject the two bounding-boxes and sets of tags have to bemerged. The system can either ask the user to resolvethe conflict and merge the two objects, or this can bedone automatically using some heuristics. Since man-ually tagged objects are usually more reliable than au-tomatically propagated ones, the bounding-box of theobject duplicate will be discarded but the tags will becombined.

4. EXPERIMENTSIn this section the performance of the proposed tag prop-

agation and tag recommendation methods are evaluated andanalyzed in two application scenarios. The considered datasetis described in Section 4.1. In Section 4.2 the evaluation ispresented, and finally the results are discussed for both sce-narios in Section 4.3.

4.1 DatasetA new dataset was created in order to evaluate the tag

propagation and tag recommendation methods. Part of thedataset is obtained from Google Image Search6, Flickr andWikipedia by querying the associated tags for different classesof objects. The rest of the dataset is formed by manu-ally taking photos of particular objects using digital cameraCanon EOS 400D.

6http://images.google.com

Table 1: Summary of the classes and some exampleobjectsClasses Example objectsCars BMW Mini Cooper, Citroen C1, Ferrari

Enzo, Jeep Grand Cherokee, LamborghiniDiablo, Opel Ampera, Peugeot 206, RollsRoyce Phantom

Books “Digital Color Image Processing”, “Im-age Analysis and Mathematical Morphol-ogy”, “JPEG2000”, “Pattern Classifica-tion”, “Speech Recognition”

Gadgets Canon EOS 400D, iPhone, Nokia N97,Sony Playstation 3, Rolex Yacht-Master,Tissot Quadrato Chronograph

Buildings Sagrada Familia (Barcelona), Branden-burg Gate (Berlin), Tower Bridge (Lon-don), Golden Gate Bridge (San Fran-cisco), Eiffel Tower (Paris)

Newspapers MobileZone, Le Matin Bleu, 20 Minutes,EPFL Flash

Text Titles, paragraphs and image captions innewspapers

Shoes Adidas Barricade, Atomic Ski Boot, Con-verse All Star Diego, Grubin Sandals,Merrell Moab, Puma Unlimited

Trademarks Coca Cola, Guinness, Heineken, McDon-ald’s, Starbucks, Walt Disney

The resulting dataset consists of 3200 images: 8 classes ofobjects, and 20 objects for each of them. For each object,20 sample images are collected. Summary of the consideredclasses and some example objects are shown in Table 1. Fig-ure 4 shows a single image for a single object from some ofthe 160 objects, while Figure 5 provides several images for3 selected objects (e.g., Merrell Moab hiking shoes, GoldenGate Bridge, and Starbucks trademark).

As it can be seen, images with a large variety of viewpoints and distances, as well as with different backgroundenvironments, are considered for each object. The datasetis split into training and test subsets. Training images arechosen carefully so that they provide a frontal wide angleview of the objects depicted in images. Objects are selectedusing bounding-boxes. One sample image from each object

Figure 5: Selected objects for 3 different objects:Merrell Moab hiking shoe, Golden Gate Bridge (SanFrancisco), and Starbucks trademark.

502

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

recall

prec

isio

n

Figure 6: Performance of the object duplicate de-tection as precision vs. recall (PR) curve averagedover all the classes.

is chosen as a training image. All other images from thedataset are test images.

4.2 EvaluationSince both the tag recommendation and tag propagation

rely on the performance of the object duplicate detectiononly this part of the whole system has been assessed. It canbe evaluated as a typical detection problem where the set ofpredicted objects is compared against a set of ground truthobjects.

Objects are matched against each other based on the over-lap of their bounding-boxes. If the ratio between the over-lapping area and the overall area exceeds 50 % it is consid-ered as a match. Based on that, a confusion matrix is com-puted, which contains the number of true positives (TP ),false positives (FP ) and false negatives (FN). For the eval-uation precision-recall (PR) curves can be derived from thisconfusion matrix. PR curves plot the recall (R) versus theprecision (P ) with:

P =TP

TP + FP, (5)

R =TP

TP + FN. (6)

The F-measure is calculated to determine the optimumthresholds for the object duplicate detection. It can be com-puted as the harmonic mean of P and R values:

F =2 · P · RP + R

. (7)

Thus, it considers precision and recall equally weighted.

4.3 ResultsFigure 6 shows the performance of the object duplicate

detection in form of the average PR curve computed overall the classes within the dataset. It provides a good visu-alization of the opposing effects (high precision versus highrecall) which are inherent to any detection task. The results

0 50 100 150 200 250 300 350 400 450 5000

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Threshold

F m

easu

re

Figure 7: Performance of the object duplicate de-tection as average F-measure vs. object duplicatedetection threshold TO.

show that if both effects are considered with equal impor-tance, the optimum is achieved at R = 0.4 and P = 0.6.However the precision can be greatly increased if R = 0.2 isconsidered enough for the tag recommendation and propa-gation.

In order to determine the optimal threshold for the ob-ject duplicate detection, the F-measures across the differentthresholds has been computed. Figure 7 shows the thresh-old versus F-measure curve. The optimal threshold of 50 ischosen for the maximum F-measure of 0.49 and shown inall related figures by green markers. The final F-measureaveraged over the whole dataset is 0.48. The tag recommen-dation will be more supported than the tag propagation be-cause the propagation is more sensitive to the performanceof the object duplicate detection.

In order to compare the different classes with each other,the F-measure is computed for each of the object classes asshown in Figure 8. Trademarks perform best, thanks to thelarge number of salient regions. In the case of text or coverpages of newspapers, books or gadgets, the proposed tagpropagation scenario performs worse because the objects donot have enough discriminative features. Shiny or rotatedobjects, such as cars, shoes or buildings, are hard to detectdue to the changing reflections and varying viewpoint.

5. CONCLUSIONSocial networks are gaining popularity for sharing inter-

ests and information. Especially photo sharing and taggingis becoming increasingly popular. Among others, tags ofpeople, locations, and objects provide efficient informationfor grouping or retrieving images. Since the manual anno-tation of these tags is quite time consuming automatic tagpropagation based on visual similarity offers a very interest-ing solution.

In this work we have developed an efficient system forsemi-automatic object tagging in images. After marking de-sired object in an image, the system performs object dupli-cate detection in the whole database and returns the searchresults with images containing similar objects. Then, the

503

object-based tag propagation for semi-automatic annotation

Documents