
MICS: Multimodal Image Collection Summarization by Optimal Reconstruction Subset Selection

Jorge E. Camargo, Fabio A. González
MindLab Research Group
Universidad Nacional de Colombia
Bogotá, Colombia
{jecamargom, fagonzalezo}@unal.edu.co

Abstract—This paper presents a new method to automatically select a set of representative images from a larger set of images retrieved for a given query. We define an image collection summary as a subset of images from the collection that are visually and semantically representative. To build such a summary we propose MICS, a method that fuses two modalities, textual and visual, in a common latent space, and uses it to find a subset of images from which the visual content of the collection can be reconstructed. We conducted experiments on a collection of tagged images and demonstrate the ability of our approach to build summaries with representative visual and semantic content. The initial results show that the proposed method is able to build a meaningful summary that can be integrated into an image collection exploration system.

Index Terms—machine learning, information retrieval, image collection summarization, latent factor analysis, multimodal clustering

I. INTRODUCTION

The large number of images produced every day requires suitable systems to manage them efficiently and effectively. Photo-sharing systems like Flickr¹ pose important challenges for organizing, browsing and querying large image collections. The typical scenario for searching images within Flickr consists of providing a query by means of keywords, which is processed by the system to return a set of similar images according to a similarity criterion. Although this paradigm has been used satisfactorily in search engines for textual content, it is not necessarily the most suitable way to interact with large image collections. One of the problems of this approach is that, in general, textual queries are not enough to express the visual richness of images, and therefore the most relevant images are not necessarily at the top of the search results. Conventionally the user only explores the first result pages, so if the user does not navigate the other pages, (s)he will not see other images that could be of interest. Figure 1 shows the top 24 images returned by Flickr for the query apple. Note that the returned images have some relation with the term apple, since the associated tags contain it. However, this term can be used to describe different semantic concepts such as fruit, computers, food, cake, etc. The returned images in this example are not representative (iconic) of the complete set of results, so the user only has access to a small portion of them in a first view, and some relevant images may be missed.

¹ http://www.flickr.com

Image collection summarization is the process of selecting a representative portion of images from a larger set of images. Most of the summarization methods found in the literature extract visual features such as color, texture and edges to represent image content, which are then used by clustering algorithms to select image prototypes (the summary). However, images are commonly accompanied by other information sources such as text, audio, links, etc. Therefore, in the same way that visual features are used to represent image content, these additional information sources (modalities) can also be used to index images.

This paper proposes a method to automatically build representative image summaries from large image collections. In this work a summary is understood as a set of representative images that semantically represent a larger set of images, usually a search result. We propose a new method to construct multimodal image summaries in which both text and visual content are combined in the same latent semantic space in order to build a semantic summary. The proposed method also provides a mechanism to project images that do not have associated text, which addresses the problem of images that are not accessible because of the lack of text information associated with them. Notwithstanding that user satisfaction studies are widely used to evaluate summarization algorithms, we favored a more objective and quantitative evaluation metric for assessing the performance of the proposed summarization method. Consequently, the paper also presents a method to quantitatively measure the quality of an image collection summary by estimating the capacity of the summary to reconstruct the visual content of the collection.

This paper is organized as follows: Section II describes related work; Section III presents the proposed method; Section IV presents the experimental evaluation of the proposed strategy on a tagged image collection; and finally, Section V concludes the paper.

II. RELATED WORK

An image collection summary is a subset of images which are somehow representative of the whole collection. Different methods have been proposed to build image collection summaries, including methods based on clustering [1], [2], similarity pyramids [3], graph methods [4], [5], neural networks [6], formal concept analysis [7], and kernel methods [8], among others.


Figure 1. The first 24 image results from Flickr for the query apple (retrieved on December 9, 2012).

In most cases, the summarization problem is approached as an unsupervised learning problem. Typically, image clusters are identified in the collection and representative images from each cluster are chosen to compose the summary. The quality of the summarization process is calculated using quality measures commonly used for clustering algorithms. Measures such as separation and cohesion quantify the compactness of image clusters and the differentiation between clusters. Entropy, a concept from information theory, is also used to measure the homogeneity of the image summary with respect to prior knowledge (class labels).

Multimodal image collection analysis is an active topic in multimedia information retrieval. The authors of [9] present a method to index image collections which uses visual and non-visual features to represent image content. This representation strategy allows them to combine both modalities in search and indexing tasks, outperforming other methods such as singular value decomposition. In [10] the authors fuse tags and visual content for representing biomedical images. They demonstrate that the proposed framework is more efficient and effective compared to the use of a single modality.

Recently, multimodal summarization methods such as the one proposed in [11] compute iconic summaries of general visual content. The authors propose an iconic summarization through the clustering of visual features using the k-means algorithm and the clustering of text tags through probabilistic latent semantic analysis (pLSA). They calculate the intersection of both sets of clusters in order to extract iconic images and representative tags. Although both modalities are used to build the summary, this intersection is not efficient, since it is necessary to compute the intersection between all visual clusters and all text clusters. Another problem of this approach is that the clustering process is performed separately for each modality (text and visual), so the two modalities are not fused in a semantic way during the clustering process.

In this paper, we propose a strategy based on latent factors in which visual and text content are jointly used to better model image semantics. This provides the user with a more meaningful overview of image search results and thus a better way to interact with them.

III. PROPOSED METHOD

A. Image indexing

In this work, we follow a bag-of-words (BoW) approach to represent image content, which is one of the most common representations used in text mining and information retrieval (IR). Since we are dealing with multimodal objects, we use a common scheme for representing both text and visual content.

1) Visual indexing: The visual content is represented using a bag-of-features [12] approach, in which an orderless distribution of image features is constructed based on a predefined dictionary of visual patterns. This dictionary is built from the image collection and accounts for the occurrence of each visual pattern in the images. The bag of features is constructed using a training set of images that are split into sub-blocks of $n \times n$ pixels. A visual feature is extracted from each sub-block to represent the associated visual patterns using rotation-invariant properties. Then, all the extracted blocks are clustered to obtain $k$ centroids, which are used as a reference dictionary of visual patterns. Finally, we build a histogram of the occurrences of the visual patterns found in each image. This scheme is widely used in computer vision tasks, including image categorization and object recognition [13].
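The following is a minimal sketch of this pipeline in Python, assuming grayscale images as NumPy arrays; the raw-pixel block feature is a placeholder for the rotation-invariant descriptor, and names such as `extract_blocks` and `bovw_histogram` are illustrative rather than taken from the paper:

```python
import numpy as np
from sklearn.cluster import KMeans

def extract_blocks(image, n=8):
    """Split a grayscale image into non-overlapping n x n blocks (flattened)."""
    h, w = image.shape
    blocks = [image[i:i + n, j:j + n].ravel()
              for i in range(0, h - n + 1, n)
              for j in range(0, w - n + 1, n)]
    return np.array(blocks, dtype=float)

def build_dictionary(training_images, k=1000, n=8):
    """Cluster all training blocks into k centroids (the visual dictionary)."""
    all_blocks = np.vstack([extract_blocks(img, n) for img in training_images])
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(all_blocks)

def bovw_histogram(image, kmeans, n=8):
    """Histogram of visual-pattern occurrences for one image."""
    words = kmeans.predict(extract_blocks(image, n))
    return np.bincount(words, minlength=kmeans.n_clusters)
```

The dictionary is fit once on a training set; `bovw_histogram` then indexes any image against it.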

2) Text indexing: To index the tags associated with each image, we follow the Vector Space Model (VSM) [14]. This model is based on a vector representation where each component of a vector indicates the frequency of a word (term) in the document. Formally, a document is expressed as $V_d = [w_{1,d}, w_{2,d}, w_{3,d}, \ldots, w_{N,d}]$, where $w_{t,d} = tf_t \cdot \log \frac{|D|}{|\{d \in D : t \in d\}|}$, $tf_t$ is the frequency of term $t$ in document $d$, $|D|$ is the number of documents in the collection, and $\log \frac{|D|}{|\{d \in D : t \in d\}|}$ is the inverse frequency of the documents that contain $t$.
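As a concrete sketch, the weighting above can be computed directly from token counts (off-the-shelf implementations such as scikit-learn's TfidfVectorizer use a smoothed variant of this formula):

```python
import numpy as np

def tfidf_matrix(docs):
    """Build the term-document matrix with w_{t,d} = tf_t * log(|D| / df_t)."""
    vocab = sorted({t for d in docs for t in d.split()})
    index = {t: i for i, t in enumerate(vocab)}
    D = len(docs)
    tf = np.zeros((len(vocab), D))
    for j, d in enumerate(docs):
        for t in d.split():
            tf[index[t], j] += 1
    df = (tf > 0).sum(axis=1)    # number of documents containing each term
    idf = np.log(D / df)         # inverse document frequency
    return tf * idf[:, None], vocab
```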

B. Optimal reconstruction subset selection

Let $X_t \in \mathbb{R}^{m \times \ell}$ be the matrix containing all the vectors indexing the textual content of an image collection with $\ell$ images, and let $X_v \in \mathbb{R}^{n \times \ell}$ be the matrix containing the vectors indexing the visual content of the images. The general summarization problem consists of finding a subset of images that is representative of both the visual content of the collection and the semantic (textual) content associated with it. This problem is addressed through two main strategies: (1) a strategy to find a small subset of images that is a good representative of the visual content of the collection ($X_v$), and (2) a strategy to associate these images with the semantic classes found in the textual content ($X_t$). The two strategies are discussed in the following subsections.


1) Optimal visual reconstruction subset selection: Let $S = [x_{i_1}, x_{i_2}, \ldots, x_{i_r}] \in \mathbb{R}^{n \times r}$, where $\{i_1, \ldots, i_r\} \subseteq \{1, \ldots, \ell\}$ and each $x_i$ is a column of $X_v$; $S$ is then said to be a subset of $X_v$ ($S \subseteq X_v$). An optimal visual subset is a subset of $X_v$ from which the visual content in $X_v$ can be optimally reconstructed. Formally, this can be stated as a non-negative matrix factorization problem as follows:

$$\min_{S, H \geq 0} \; \|X_v - SH\|_F^2 \quad \text{s.t. } S \subseteq X_v, \qquad (1)$$

where $S \in \mathbb{R}^{n \times r}$ is the image subset (summary) that represents the image collection and $H \in \mathbb{R}^{r \times \ell}$ contains the weights used to linearly combine elements of the summary to reconstruct the visual content of the collection. In principle, this minimization problem could be solved using the algorithms proposed in [15], which solve the more general non-negative matrix factorization problem:

$$\min_{F, H \geq 0} \; \|X_v - FH\|_F^2, \qquad (2)$$

where $F \in \mathbb{R}^{n \times r}$. However, these methods do not guarantee that the factor matrix $F$ is a subset of $X_v$. An alternative is to use convex non-negative matrix factorization (CNMF) [16], which solves the following factorization problem:

$$\min_{W, H \geq 0} \; \|X_v - X_v W H\|_F^2, \qquad (3)$$

where $W \in \mathbb{R}^{\ell \times r}$. Notice that the matrix factor $F$ in Eq. 2 is replaced in this problem by $X_v W$; this means that the columns of $F = X_v W$ are linear combinations of the original columns (images) of $X_v$. Thus, $W$ may be used to find the images in $X_v$ that are most important for reconstructing the full visual content of the collection. This is the approach followed in our method.
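As a rough illustration, the CNMF objective of Eq. 3 can be minimized with a simple projected-gradient scheme (the algorithm in [16] uses multiplicative updates instead; this sketch assumes a small fixed step size and is not the authors' implementation):

```python
import numpy as np

def cnmf(Xv, r, n_iter=2000, lr=1e-5, seed=0):
    """Approximate min_{W,H>=0} ||Xv - Xv W H||_F^2 by projected gradient."""
    rng = np.random.default_rng(seed)
    n, l = Xv.shape
    W = rng.random((l, r))
    H = rng.random((r, l))
    for _ in range(n_iter):
        F = Xv @ W   # latent factors: non-negative combinations of images
        # scaled gradient step on H, then projection onto the non-negative orthant
        H = np.maximum(H - lr * F.T @ (F @ H - Xv), 0.0)
        # scaled gradient step on W (chain rule through F = Xv W), then projection
        W = np.maximum(W - lr * Xv.T @ ((F @ H - Xv) @ H.T), 0.0)
    return W, H
```

The candidate summary images for latent factor $j$ are then the columns of $X_v$ with the largest weights in `W[:, j]`, e.g. `np.argsort(-W[:, j])[:4]`.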

2) Multimodal fusion: The matrix factorization algorithms discussed in the previous subsection can be applied to the textual content of the collection as well. For instance, the CNMF algorithm can be applied to $X_t$ by replacing all instances of $X_v$ in Eq. 3 with $X_t$, producing the following matrix factorization problem:

$$\min_{W_t, H_t \geq 0} \; \|X_t - X_t W_t H_t\|_F^2. \qquad (4)$$

This has an interesting byproduct: the columns of $F_t := X_t W_t$ can be interpreted as cluster centroids representing clusters among image tags, which could eventually be associated with high-level semantic concepts. This in fact generates a new representation of the textual (visual) data, which is a latent semantic representation². This process is applied to both visual and textual data to obtain a multimodal latent representation of the image collection.

The summarization process that combines the optimal visual reconstruction subset selection and the multimodal fusion is presented in Algorithm 1.

In line 04 of the MICS algorithm, the text matrix $X_t$ is decomposed using CNMF as in Eq. 4.

² It is well known that non-negative matrix factorization is closely related to latent semantic analysis [17].

Algorithm 1 MICS algorithm
[01] Input: text matrix $X_t$, visual matrix $X_v$, number of clusters $k$, number of samples in each cluster $n$
[02] Output: list of multimodal clusters $L$
[03] begin
[04]   $[W_t\ H_t] \leftarrow CNMF(X_t, k)$
[05]   $[F_v\ H_v] \leftarrow CNMF(X_v, W_t, k)$
[06]   $C \leftarrow \{\}$, $L \leftarrow \{\}$
[07]   $F_v^s \leftarrow sort(F_v)$
[08]   $W_t^s \leftarrow sort(W_t)$
[09]   for $i := 1$ until $n$ do
[10]     $C(i) \leftarrow W_t^s(i) \cup F_v^s(i)$
[11]   end for
[12]   $C_w \leftarrow clusterWeight(C)$
[13]   $L \leftarrow rank(C, C_w)$
[14] end
[15] return $L$

Note that the parameters of the CNMF function are $X_t$ and $k$, where $X_t$ is the text matrix and $k$ is the size of the summary (the number of multimodal clusters). In this factorization $H_t$ is the basis of the latent space, in which text is represented as a linear combination of the $r$ columns of $F_t$; the corresponding coefficients of the combination are codified in the columns of $H_t$. As shown in [18], each column of $F_t$ corresponds to a cluster of the original objects, and each column of $H_t$ corresponds to an object represented in the latent space. Thus, in each column of $H_t$ we find the membership degree of each text term in the $i$-th cluster. Each column of $W_t$ corresponds to a cluster, and its values indicate the importance of each term for the given cluster.

Then, we fix $H_t$ and apply CNMF to factorize the visual matrix,

$$X_v = F_v H_t = X_v W_v H_t,$$

where $X_v \in \mathbb{R}^{n \times \ell}$, $W_v \in \mathbb{R}^{\ell \times r}$, and $F_v \in \mathbb{R}^{n \times r}$. This is expressed in line 05 of the MICS algorithm, where the CNMF function receives $X_v$, $W_t$, and $k$ as parameters. It is important to note that in this factorization we fix $H_t$ in order to find $W_v$; both information sources are therefore combined in this step, yielding a factorization that depends on the text and the visual information at the same time. In each column of $W_v$ we find the importance degree of each image in the $i$-th cluster. As output of this function call we obtain $F_v$ (line 10), which is used jointly with $W_t$ to create the $i$-th multimodal cluster $C_i$.
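Under the same projected-gradient assumptions as the `cnmf` sketch above, this fixed-$H_t$ step reduces to optimizing $W_v$ alone; `cnmf_fixed_h` is an illustrative name, not a function from the paper:

```python
import numpy as np

def cnmf_fixed_h(Xv, Ht, n_iter=2000, lr=1e-5, seed=0):
    """Approximate min_{Wv>=0} ||Xv - Xv Wv Ht||_F^2 with Ht held fixed,
    so the visual factorization inherits the text-derived latent space."""
    rng = np.random.default_rng(seed)
    l, r = Xv.shape[1], Ht.shape[0]
    Wv = rng.random((l, r))
    for _ in range(n_iter):
        R = Xv @ Wv @ Ht - Xv    # residual in the shared latent space
        # scaled gradient step on Wv, then projection onto the non-negative orthant
        Wv = np.maximum(Wv - lr * Xv.T @ (R @ Ht.T), 0.0)
    return Wv, Xv @ Wv           # Wv and Fv = Xv Wv
```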

In order to obtain the most important images and text terms for each cluster, $F_v$ and $W_t$ are sorted, and then the $n$ most important elements are selected for each cluster, producing a summary of $kn$ multimodal elements (lines 09 to 11).

To determine the importance of each multimodal cluster, we select the clusters with the highest weight: we normalize the $H_v$ matrix using the L1 norm and compute the sum of each row of this matrix in order to determine the number of images that belong to each cluster. This is expressed in line 12 of the MICS algorithm. Finally, we rank these clusters by weight, as expressed in line 13 with the function $rank(C, C_w)$, where $C$ is the list of clusters and $C_w$ contains their respective weights, used in the sorting operation.


Figure 2. Images and terms ranked according to their importance in a latent factor.

Figure 2 shows an illustration of a multimodal cluster, where images and terms are ranked according to their relative weights in the respective latent factor.
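Putting the pieces together, lines 04 to 13 of Algorithm 1 might be sketched as follows, reusing the `cnmf` and `cnmf_fixed_h` sketches above. Since the sketch keeps $H_t$ fixed, it stands in for $H_v$ in the cluster-weighting step, and term importance is read from $F_t = X_t W_t$; both are simplifying assumptions:

```python
import numpy as np

def mics_summary(Xt, Xv, terms, k=20, n=4):
    """Sketch of Algorithm 1: k multimodal clusters of n terms/images each."""
    Wt, Ht = cnmf(Xt, k)              # line 04: factorize the text matrix
    Wv, Fv = cnmf_fixed_h(Xv, Ht)     # line 05: factorize visual content, Ht fixed
    Ft = Xt @ Wt                      # term-by-factor importance scores
    clusters = []
    for i in range(k):                # lines 07-11: top-n elements per factor
        top_terms = [terms[t] for t in np.argsort(-Ft[:, i])[:n]]
        top_images = np.argsort(-Wv[:, i])[:n]
        clusters.append((top_terms, top_images))
    # Lines 12-13: weight clusters by L1-normalized latent assignments, then rank.
    A = Ht / np.maximum(Ht.sum(axis=0, keepdims=True), 1e-12)
    weights = A.sum(axis=1)           # soft number of images per cluster
    return [clusters[i] for i in np.argsort(-weights)]
```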

IV. EXPERIMENTATION

We crawled 4882 images and their associated tags for the query terms apple, love, closeup, and beauty, as described in Table I. These terms were chosen because they have different meanings depending on context, which makes the summarization process challenging. For instance, the term beauty can be found in images of women, nature, cats, flowers, etc. Figure 3 shows an example of a beauty image and its associated tags. These terms are useful to see whether a summary can discriminate different semantic sub-groups of images that belong to the same concept. We selected four concepts for this work, but the proposed method can be applied to more concepts and larger datasets.

Visual content was indexed using the bag of visual words method presented in Section III-A1. Each image was split into patches of 8×8 pixels. The DCT (Discrete Cosine Transform) descriptor [19] was used to index each patch. The parameter $k$ of the patch clustering process was set to 1000, so we obtained a dictionary of 1000 patches. Finally, each image was indexed using a histogram of 1000 bins.
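A small sketch of the per-patch descriptor under these settings, using SciPy's DCT (the exact coefficient selection of [19] may differ):

```python
import numpy as np
from scipy.fft import dctn

def dct_descriptor(patch):
    """2D DCT of an 8x8 grayscale patch; the coefficient grid (often a
    low-frequency zig-zag prefix) serves as the feature vector."""
    coeffs = dctn(patch.astype(float), norm='ortho')
    return coeffs.ravel()
```

Plugging `dct_descriptor` into the block-clustering sketch of Section III-A1 with $k = 1000$ yields the 1000-bin histograms used here.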

Text content was indexed using the vector space model presented in Section III-A2. We applied stop-word removal and stemming. We found that the frequency of many terms was lower than 5; therefore, we indexed only terms with a frequency higher than 5. Table I describes the number of terms indexed for each dataset after the pre-processing steps.
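This preprocessing could be approximated as follows; the stemmer is not specified in the paper (Porter is assumed), and `min_df=6` reads "frequency higher than 5" as document frequency, which may differ from the authors' raw-count threshold:

```python
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

stemmer = PorterStemmer()

def tokenize(text):
    # stop-word removal followed by stemming
    return [stemmer.stem(t) for t in text.lower().split()
            if t not in ENGLISH_STOP_WORDS]

# keep only terms occurring in more than 5 documents
vectorizer = TfidfVectorizer(tokenizer=tokenize, lowercase=False, min_df=6)
# Xt = vectorizer.fit_transform(tag_docs).T.toarray()   # terms x images
```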

For each of the four datasets, we applied the multimodal summarization algorithm proposed in this paper. The dimension of the latent space $r$ was empirically set to 20, so we obtained 20 multimodal clusters for each dataset.

Table I. Datasets used in the experimental evaluation. Four different one-term queries were used, and each dataset corresponds to a subset of the images returned by Flickr when queried with the corresponding term. The last column specifies the number of different text terms found in the respective dataset after pre-processing.

Dataset   Number of images   Terms
apple     1263               837
love      724                666
closeup   1405               995
beauty    1490               1195

Figure 3. Example of a beauty image and its associated tags crawled from Flickr.

A. Multimodal summarization results

The multimodal summary was built by selecting the four most important text terms and the four most important images of each cluster. We selected four images to illustrate the method, but this is a parameter of the algorithm that can be set according to the visualization metaphor used to display results. Figure 4 shows the eight most important multimodal clusters for the apple query. The number in each cluster represents the rank assigned according to its importance (the number of images belonging to the cluster). Note that each cluster has images and text terms representative of the concept apple in different contexts. For instance, there are clusters for fruits, computer elements, buildings, and apple trees. In the first sub-cluster, note that there are images of apples with water drops, whose visual appearance and text terms semantically correspond to the automatically organized sub-group. This result is useful because users can see the different semantic concepts that match the query, and thus have the opportunity to explore the subset most related to their needs.

Figure 5 shows the multimodal summary for the beauty query. In this case the clusters represent different high-level concepts of beauty. This concept is very subjective, so users annotate images of flowers, models, women, and art pictures with this term.

Figure 6 shows the results for the closeup query. The obtained summary contains images of flowers, women, insects, and animals. It is worth noting that this summary provides a good set of representative images of different topics.

Finally, Figure 7 shows the obtained clusters for the lovequery. Images of nature, love symbols, marriages, couples,and animals were grouped.

It is worth noting that the proposed method is robust to junk images.


Figure 4. Multimodal summary for apple

Figure 5. Multimodal summary for beauty

Figure 6. Multimodal summary for closeup

Figure 7. Multimodal summary for love

In the matrix factorization process, each latent factor (the columns of $W_t$ and $F_v$) represents a cluster in the latent space. In each cluster we select the most important images and terms (steps 07 and 08 of the MICS algorithm) according to their importance in the respective cluster. This mechanism makes it possible to select images that are representative of the collection and to penalize junk images.

B. Quality evaluation of the summary

We are interested in objectively evaluating the quality of the obtained summaries. To do this, we focus on evaluating the quality of a summary when it is built using the visual modality. The performance measure is based on the reconstruction error obtained after the factorization of the original matrix $X$. To facilitate the comparison, we exclusively used the visual modality. This is accomplished using the factorization described in Eq. 1, which attempts to represent the visual content of the complete collection as a linear combination of the summary $S \subseteq X_v$. First, we find $S_{nmf}$ using NMF, which is a subset of the most representative images (the images closest to the factor centroids), and measure its reconstruction error. A similar process is performed to find a baseline visual summary $S_{kmeans}$ using the k-means algorithm: we select the images closest to the obtained k-means centroids and calculate the reconstruction error as

$$\|X - S_{kmeans} H\|_F^2.$$
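A sketch of this evaluation, assuming the baseline summary consists of the images closest to the k-means centroids and that the non-negative weights $H$ are found column-by-column with SciPy's NNLS (the paper does not state its solver):

```python
import numpy as np
from scipy.optimize import nnls
from sklearn.cluster import KMeans

def kmeans_summary(X, k):
    """Baseline summary: the column of X closest to each k-means centroid."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X.T)
    idx = [np.argmin(((X.T - c) ** 2).sum(axis=1)) for c in km.cluster_centers_]
    return X[:, idx]

def reconstruction_error(X, S):
    """||X - S H||_F^2 with H >= 0, solved per column by non-negative least squares."""
    H = np.column_stack([nnls(S, X[:, j])[0] for j in range(X.shape[1])])
    return np.linalg.norm(X - S @ H, 'fro') ** 2
```

Computing `reconstruction_error(Xv, kmeans_summary(Xv, k))` over a range of $k$ gives the baseline curve of Figure 8.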

Figure 8 shows the reconstruction error obtained with C-NMF and k-means for different numbers of clusters $k$. In this analysis we evaluated the reconstruction error for different sizes of a visual summary of the apple dataset. Note how C-NMF outperforms the k-means clustering algorithm. This result shows that the proposed method achieves a good degree of representativeness of the complete image collection.

V. CONCLUSION AND FUTURE WORK

This paper presented a new method to build multimodal collection summaries. We proposed the MICS algorithm, which is based on latent factor analysis as a mechanism to fuse text and visual information in the same latent semantic space to better model image semantics.


Figure 8. Reconstruction error analysis for the apple dataset

This algorithm makes it possible to build semantic summaries that involve text and visual content in the clustering process. The proposed method was applied to four image collections extracted from Flickr. We also proposed a measure based on reconstruction error to objectively validate the performance of the proposed method in the construction of multimodal image collection summaries. The results are encouraging and show the feasibility of using this method to offer the user more diverse and semantically meaningful results when interacting with image collection exploration systems. In future work, we want to evaluate our algorithm on larger datasets, and we expect to conduct user studies to validate the proposed strategy from the user perspective.

ACKNOWLEDGMENTS

This work was partially funded by COLCIENCIAS through the project Sistema para la Recuperación de Imágenes Médicas Utilizando Indexación Multimodal, Convocatoria COLCIENCIAS 2465 de 2011.

REFERENCES

[1] Stan D, Sethi IK. eID: a system for exploration of image databases. Inf Process Manage. 2003 May;39(3):335–361.

[2] Simon I, Snavely N, Seitz SM. Scene summarization for online image collections. In: Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on; 2007. p. 1–8.

[3] Chen JY, Bouman CA, Dalton JC. Hierarchical browsing and search of large image databases. Image Processing, IEEE Transactions on. 2000;9(3):442–455.

[4] Cai D, He X, Li Z, Ma WY, Wen JR. Hierarchical clustering of WWW image search results using visual, textual and link information. In: Proceedings of the 12th annual ACM international conference on Multimedia; 2004. p. 952–959.

[5] Gao B, Liu TY, Qin T, Zheng X, Cheng QS, Ma WY. Web image clustering by consistent utilization of visual features and surrounding texts. In: MULTIMEDIA '05: Proceedings of the 13th annual ACM international conference on Multimedia. New York, NY, USA: ACM; 2005. p. 112–121.

[6] Deng D. Content-based image collection summarization and comparison using self-organizing maps. Pattern Recognition. 2007;40(2):718–727.

[7] Nobuhara H. A lattice structure visualization by formal concept analysis and its application to huge image database. In: Complex Medical Engineering, 2007. CME 2007. IEEE/ICME International Conference on; 2007. p. 448–452.

[8] Fan J, Gao Y, Luo H, Keim DA, Li Z. A novel approach to enable semantic and visual image summarization for exploratory image search. In: Proceedings of the 1st ACM international conference on Multimedia information retrieval. New York, NY, USA: ACM; 2008. p. 358–365.

[9] Caicedo JC, BenAbdallah J, González FA, Nasraoui O. Multimodal representation, indexing, automated annotation and retrieval of image collections via non-negative matrix factorization. Neurocomputing. 2012 Jan;76(1):50–60.

[10] Rahman MM, Antani S, Demner-Fushman D, Thoma G. Biomedical image retrieval using multimodal context and concept feature spaces. In: Müller H, Greenspan H, Syeda-Mahmood T, editors. Medical Content-Based Retrieval for Clinical Decision Support. vol. 7075 of Lecture Notes in Computer Science. Springer Berlin Heidelberg; 2012. p. 24–35.

[11] Raguram R, Lazebnik S. Computing iconic summaries of general visual concepts. 2008. p. 1–8.

[12] Csurka G, Dance CR, Fan L, Willamowski J, Bray C. Visual categorization with bags of keypoints. In: Workshop on Statistical Learning in Computer Vision, ECCV; 2004. p. 1–22.

[13] Bosch A, Muñoz X, Martí R. Which is the best way to organize/classify images by content? Image and Vision Computing. 2007;25(6):778–791.

[14] Salton G, Wong A, Yang CS. A vector space model for automatic indexing. Communications of the ACM. 1975;18(11):613–620.

[15] Lee DD, Seung HS. Algorithms for non-negative matrix factorization. Advances in Neural Information Processing Systems. 2001;13:556–562.

[16] Ding CHQ, Li T, Jordan MI. Convex and semi-nonnegative matrix factorizations. Pattern Analysis and Machine Intelligence, IEEE Transactions on. 2010 Jan;32(1):45–55.

[17] Ding H, Liu J, Lu H. Hierarchical clustering-based navigation of image search results. In: MM '08: Proceedings of the 16th ACM international conference on Multimedia. New York, NY, USA: ACM; 2008. p. 741–744.

[18] Xu W, Liu X, Gong Y. Document clustering based on non-negative matrix factorization. In: SIGIR '03: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval. New York, NY, USA: ACM; 2003. p. 267–273.

[19] Hare JS, Samangooei S, Lewis PH, Nixon MS. Semantic spaces revisited: investigating the performance of auto-annotation and semantic retrieval using semantic spaces. In: CIVR '08: Proceedings of the 2008 international conference on Content-based image and video retrieval. New York, NY, USA: ACM; 2008. p. 359–368.