
[IEEE 2013 XVIII Symposium of Image, Signal Processing, and Artificial Vision (STSIVA) - Bogotá, Colombia (2013.09.11-2013.09.13)] Symposium of Signals, Images and Artificial Vision

Visualizing Multimodal Image Collections
Anyela Chavarro, Jorge Camargo, Fabio A. González

Universidad Nacional de Colombia, MindLab Research Group

Abstract—This paper presents two different strategies for visualizing multimodal image collections, both based on a representation strategy that fuses text and visual content in the same latent space. This latent space makes it possible to find semantic groups of images, which are used to select image prototypes to build a semantic visualization. The first strategy is a graph-based visualization in which edges represent image similarities and vertices represent images. The second is a multimodal visualization in which a set of image prototypes surrounds a semantic tag cloud. We built a system prototype in order to evaluate both strategies. Results show that the proposed strategy is promising and could be used in a real image exploration system to improve the image collection exploration process.

Index Terms—image collection visualization, latent factor analysis, summarization

I. INTRODUCTION

IMAGES are a valuable asset in different fields of academia, research, and industry. They are growing explosively every day thanks to the Internet and to acquisition devices such as tablets, cameras, smartphones, and other capturing devices. Millions of images are shared every day in photo-sharing systems like Flickr1, which makes their storage, processing, and access challenging. One of the demanding tasks is how to allow users to access a huge image collection in an efficient, effective, and intuitive way. Conventional search engines provide access mechanisms typically based on the keyword-based and query-by-example paradigms. Although these paradigms have been satisfactory for accessing large text/web repositories, they are not enough to deal with other multimedia content such as images.

Image collection exploration is an active research area that aims to offer the user alternative paradigms to browse an image collection. The main challenges addressed by image collection exploration are image representation, summarization, visualization, and interaction. Image representation focuses on the problem of representing images in a computational structure using visual features; summarization, on the process of selecting a representative set of images as an overview of a larger set; visualization, on the way of presenting images once the system retrieves relevant ones; and interaction, on the mechanism by which users interact with the image collection.

One of the most important aspects of an image collection exploration system is the visualization of search results, because it is the entry point of the interaction for the user and it can serve as a guide in the exploration of the whole collection. Since the structure of the image collection is presented through a visualization metaphor, the user should be able to easily understand the collection in order to browse it.

1 http://www.flickr.com/

In the content-based image retrieval (CBIR) literature, visualization has not received the same attention as other issues (image representation and indexing, retrieval performance, etc.). CBIR systems like Google Similar Images and Flickr return a set of similar images in a list ranked according to a similarity criterion; as a result, the user obtains a set of result pages ordered according to the degree of relevance w.r.t. a query. Although this is the traditional way of visualizing results, it is not the most efficient for the user, who may miss images that could be of interest.

Recently, some works have presented visualization as an important issue that has to be addressed in image search systems. However, most of these works use only visual information at the pixel level (color, edges, and textures) to represent image similarity, which is then used to visualize image relationships through some visualization metaphor (2D, graph-based, treemap, radial, etc.) [1]. However, images are generally complemented with other modalities such as text, which can be used to improve image representation and visualization. Our hypothesis is that visual and text modalities can be combined to build a multimodal image collection exploration system that reduces the well-known semantic gap.

In this paper we present two different strategies for visualizing multimodal image collections. Both are based on a representation strategy that fuses text and visual content in the same latent space. This fusion strategy is based on the MICS algorithm [2], which yields a set of latent factors onto which the text and visual modalities are projected. This latent space allows us to find semantic groups of images, which are used to select image prototypes from which a semantic visualization is built.
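The details of MICS are given in [2]; purely as an illustration of the idea of projecting two modalities into one latent space, the fusion can be sketched as a joint non-negative factorization of the stacked feature matrix. The function name and the multiplicative-update scheme below are ours, not the MICS objective:

```python
import numpy as np

def fuse_modalities(V, T, k, iters=200, seed=0):
    """Project non-negative visual features V (n x dv) and text features
    T (n x dt) into a shared k-dimensional latent space by jointly
    factorizing the stacked matrix [V | T] ~= H @ W.  Row i of H is the
    multimodal latent representation of image i."""
    X = np.hstack([V, T])
    rng = np.random.default_rng(seed)
    n, d = X.shape
    H = rng.random((n, k)) + 1e-3   # latent image coordinates
    W = rng.random((k, d)) + 1e-3   # latent factor bases over both modalities
    for _ in range(iters):          # Lee-Seung multiplicative updates
        H *= (X @ W.T) / (H @ W @ W.T + 1e-9)
        W *= (H.T @ X) / (H.T @ H @ W + 1e-9)
    return H, W
```

Each row of W mixes text and visual dimensions, which is what lets a latent factor act as a "semantic group" spanning both modalities.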

The first strategy is a graph-based multimodal visualization. We explore different algorithms for drawing a graph using hierarchical and spring models. We also build a system prototype that uses Graphviz, an open-source graph visualization tool that provides different algorithms to build such a graph. For the second strategy, we propose a novel multimodal visualization metaphor that combines visual and text terms in the same visualization area. We conducted experiments with a set of images crawled from Flickr to validate the performance of the proposed system. This paper is organized as follows: Section 2 briefly reviews related work; Section 3 describes the multimodal image collection summarization, presents different graph visualization algorithms, and describes the proposed multimodal visualization metaphor; Section 4 presents the obtained results; we conclude the paper in Section 5.

978-1-4799-1121-9/13/$31.00 © 2013 IEEE


II. PREVIOUS WORK

Visualization of image collections is a research area that focuses on finding a way of visualizing relevant images of a collection in a simple representation that faithfully represents the complete dataset, or the results obtained after a query is resolved by the system [3], allowing the user to understand and access the collection in an intuitive way. The authors in [4] calculate a set of coordinates, x and y, using multidimensional scaling (MDS), which takes as input a similarity matrix expressing the similarity among all images of the collection. This technique is known as dimensionality reduction, since a high-dimensional representation (for instance, a color histogram) is reduced to obtain, typically, 2 coordinates. Other works use principal component analysis (PCA) [5] and isometric feature mapping (ISOMAP) [6] to reduce the dimensionality in a more elaborate way. The weakness of these approaches is that they do not take into account the limited display area of screens and the visibility issues that arise when images are projected onto the visualization area; however, the authors in [3] address these issues to some extent. It is worth noting that most of these works use only low-level visual features such as color, edges, and textures, and do not use other information sources to better represent image content.
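For reference, the classical MDS computation behind this kind of 2D projection can be sketched in a few lines of NumPy. This is the standard textbook (Torgerson) formulation operating on a distance matrix, not the authors' code:

```python
import numpy as np

def classical_mds(D, dims=2):
    """Classical (Torgerson) MDS: given an n x n matrix of pairwise
    Euclidean distances D, return n points in `dims` dimensions whose
    pairwise distances approximate D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # double-centering operator
    B = -0.5 * J @ (D ** 2) @ J              # recovered inner-product (Gram) matrix
    w, v = np.linalg.eigh(B)                 # eigendecomposition (ascending order)
    idx = np.argsort(w)[::-1][:dims]         # keep the top `dims` eigenpairs
    return v[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))
```

When D comes from true Euclidean points, the top eigenpairs recover them exactly (up to rotation); for image-similarity matrices the embedding is only an approximation.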

Recently, new approaches try to use both visual and textual modalities to visualize image collections. These approaches present images and their associated concepts in the same visualization metaphor. In [7], the authors use non-negative matrix factorization (NMF) to fuse the visual and text modalities into a latent representation that is then used as input to the PCA algorithm, obtaining a multimodal visualization with a 2D layout. Although this approach fuses both modalities to produce a more semantic visualization, it has some limitations due to overlapping, which makes it difficult to visualize and interact with the collection.

A graph-based visualization is another interesting way to present image relationships. TagGraph2 uses a graph to reach this aim. However, this tool has several drawbacks, which are discussed in [8]. ChainGraph3 is another tool that tries to fix some of TagGraph's problems, but the interaction process is difficult due to some limitations in the management of the screen size.

III. MATERIALS AND METHODS

A. Image Collection Summarization

In order to visualize large image collections, it is necessary to provide a mechanism that summarizes the collection. A summary presents to the user prototypical images that "summarize" a larger set of images. Since it is not possible to show all the images to the user at the same time, a summary is a good way to condense the results obtained by the system. We therefore use MICS [2] to build a multimodal summary, which is composed of multimodal clusters with representative terms and images.
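The paper delegates the summarization itself to MICS [2]; a generic sketch of the final step, turning cluster assignments in the latent space into image prototypes and representative tags, could look as follows (function and parameter names are ours):

```python
import numpy as np

def summarize(latent, labels, terms_per_image, m=5):
    """For each cluster, pick the m images nearest the cluster centroid
    in the latent space (the prototypes) and the m most frequent tags
    among the cluster's images (the representative terms)."""
    summary = {}
    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        centroid = latent[members].mean(axis=0)
        d = np.linalg.norm(latent[members] - centroid, axis=1)
        protos = members[np.argsort(d)[:m]]          # closest to the centroid
        counts = {}
        for i in members:                            # tag frequencies in cluster
            for t in terms_per_image[i]:
                counts[t] = counts.get(t, 0) + 1
        tags = sorted(counts, key=counts.get, reverse=True)[:m]
        summary[int(c)] = (protos.tolist(), tags)
    return summary
```

With 25 clusters and m=5, this yields exactly the 125 multimodal elements (prototypes plus tags) reported in Section 4.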

2 http://taggraph.com/
3 http://chaingraph.demos.interactivesystems.info/

B. Graph-based Visualization

Given the multimodal summary obtained using MICS, we explore different graph-based visualization algorithms. In this graph, edges represent image similarities and vertices represent images. In the literature there are different algorithms for drawing graphs, which are described below.

A hierarchical model, used in [9], [10], and [11], builds the graph in five steps. The first consists in breaking any cycle that occurs in the input graph by reversing the internal direction of certain cyclic edges. The second assigns nodes to discrete ranks or levels; in a top-to-bottom drawing, ranks determine Y coordinates, and edges that span more than one rank are broken into chains of "virtual" nodes and unit-length edges. The third sorts nodes within ranks to avoid crossings. The fourth sets the X coordinates of nodes to keep edges short. The final step routes the edge splines [12].

A spring model is another approach, proposed by Kamada and Kawai in [13], which draws the graph by constructing a virtual physical model and running an iterative solver to find a low-energy configuration. An ideal spring is placed between every pair of nodes such that its length is set to the shortest-path distance between the endpoints. The springs push the nodes so that their geometric distance in the layout approximates their path distance in the graph. In statistics, this algorithm is also known as multidimensional scaling [14]. Another algorithm is proposed in [15]; it is similar to Kamada's model, but works by reducing forces rather than energy. In Section 4 we present some of the visualizations obtained with the different algorithms described here.
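A minimal sketch of the spring idea: minimize the spring energy E = Σ_{i&lt;j} (|x_i − x_j| − D_ij)² over node positions by gradient descent. This is a simplification for illustration; Kamada and Kawai's original solver moves one node at a time with a Newton-style update:

```python
import numpy as np

def spring_layout(D, dims=2, iters=2000, lr=0.01, seed=0):
    """Spring-model layout in the spirit of Kamada-Kawai: position nodes
    so that geometric distances approximate the graph (shortest-path)
    distances D, by gradient descent on the spring energy."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    X = rng.standard_normal((n, dims))
    for _ in range(iters):
        diff = X[:, None, :] - X[None, :, :]              # pairwise displacements
        dist = np.linalg.norm(diff, axis=-1) + np.eye(n)  # +I avoids divide-by-zero
        # dE/dx_i = sum_j (dist_ij - D_ij) * (x_i - x_j) / dist_ij
        grad = (((dist - D) / dist)[:, :, None] * diff).sum(axis=1)
        X -= lr * grad
    return X
```

The fixed step size stands in for the iterative solver mentioned in the text; in practice libraries use line search or per-node Newton steps for faster convergence.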

Figure 1. Multimodal visualization metaphor. Images and tags are distributed in a concentric way to visualize a multimodal summary.


C. Multimodal Visualization

We propose a new multimodal visualization metaphor that combines the textual and visual modalities in the visualization space. In this metaphor, the image prototypes of the summary surround a tag cloud: images are placed on the vertices of a regular polygon, and text terms are distributed in a tag cloud at its center. The size of each tag represents the number of times that the term appears in the images of the summary. Figure 1 shows an example of the proposed multimodal visualization metaphor. This visualization shows a multimodal overview of the entire image collection. The idea of this visualization is to give the user a first step in the exploration process: when the user selects one of the images, the system retrieves the k-nearest-neighbor images of the selected image. Figure 2 shows the second visualization, obtained when an image is clicked. It is important to note that this mechanism allows the user to refine their needs by zooming into the multimodal summary.
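The geometry of the metaphor is straightforward to sketch: prototypes on the vertices of a regular polygon, tag font sizes scaled linearly with term frequency. The helper below is our own illustration; the coordinate system and font-size range are assumptions, not values from the paper:

```python
import math
from collections import Counter

def layout_metaphor(num_images, summary_tags, cx=400.0, cy=300.0,
                    radius=250.0, min_font=10, max_font=40):
    """Place image prototypes on the vertices of a regular polygon
    centred on (cx, cy); scale each tag's font size linearly with the
    number of times the term occurs in the summary."""
    positions = [(cx + radius * math.cos(2 * math.pi * i / num_images),
                  cy + radius * math.sin(2 * math.pi * i / num_images))
                 for i in range(num_images)]
    counts = Counter(summary_tags)
    lo, hi = min(counts.values()), max(counts.values())
    span = (hi - lo) or 1                      # avoid division by zero
    sizes = {t: min_font + (max_font - min_font) * (c - lo) / span
             for t, c in counts.items()}
    return positions, sizes
```

Keeping the images on a fixed-radius ring is what prevents the image/tag overlap problems discussed in Section 2.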

Figure 2. Visualization obtained after selecting an image of interest.

This visualization metaphor allows images and tags to be seen while avoiding the overlapping problems discussed in Section 2. Figure 3 illustrates a tool for highlighting image-tag relationships: when the mouse is over an image (red color), the related tags are highlighted to display semantic relationships.

IV. RESULTS

We crawled 1250 images from Flickr, together with their associated tags, for the query apple. We selected this query because it produces different types of images corresponding to different semantic concepts: fruits, computers, food, cakes, etc. We applied the MICS algorithm presented in Section 3. The algorithm needs as input the number of latent factors, which was set to 25, and the number of images and tags in each cluster, which was set to 5; we thus obtained 25 multimodal clusters and a summary of 125 multimodal elements (tags and images).

Figure 3. Highlighting of tags when the mouse is over an image.

A. Experimental Evaluation of Graph-based Visualizations

We used Graphviz4, a package of open-source tools initiated by AT&T Labs Research for drawing graphs specified in DOT-language scripts [16]. We developed a Java application to build the neighborhood graph from a distance matrix and express it in the DOT language. Then, we integrated this application with Graphviz using the following commands:

• DOT for the hierarchical model
• NEATO for the spring model of Kamada and Kawai
• FDP for the spring model of Fruchterman-Reingold
• SFDP, which also draws undirected graphs using the spring model of FDP, but uses a multi-scale approach to produce layouts of large graphs in a reasonably short time
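The original neighborhood-graph generator is a Java application; as an illustration only, building such a DOT script from a distance matrix can be sketched like this (assuming a zero-diagonal distance matrix; node names and labels are ours):

```python
import numpy as np

def knn_graph_dot(D, labels, k=4):
    """Emit DOT source for an undirected neighborhood graph in which
    each image is connected to its k nearest neighbors under the
    multimodal distance matrix D (zero diagonal assumed).  The output
    can be laid out with any of the engines above, e.g.
    `neato -Tpng graph.dot -o graph.png`."""
    n = D.shape[0]
    edges = set()
    for i in range(n):
        for j in np.argsort(D[i])[1:k + 1]:       # rank 0 is the node itself
            edges.add((min(i, int(j)), max(i, int(j))))
    lines = ["graph summary {", "  node [shape=box];"]
    lines += [f'  n{i} [label="{labels[i]}"];' for i in range(n)]
    lines += [f"  n{i} -- n{j};" for i, j in sorted(edges)]
    lines.append("}")
    return "\n".join(lines)
```

In a real system the node labels would be replaced by `image` attributes pointing at the prototype thumbnails.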

Each command corresponds to one of the algorithms described in Section 3.2. Figure 4 shows a visualization using the hierarchical model with k=4. In this figure, each colored circle represents a cluster. We can see that each image has four associated images, which in most cases belong to the same ground-truth cluster. Note that images that belong to a group are related to similar topics; for instance, the first group is related to the terms apple, toy, and robot, and the fifth group to the terms apple, green, and naturallight.

The spring model of NEATO is shown in Figure 5 with k=5. In this visualization, images of the same cluster are placed close to each other, so we can easily find new relationships between clusters. For instance, clusters eight and three are semantically related (girls and self portrait). We can also see how the clusters with green apples are linked; this visualization exposes semantic links between clusters.

Figure 6 displays the SFDP visualization, which provides a good representation of the visual collection and of the semantic links between images.

4 http://www.graphviz.org


B. Experimental Evaluation of the Multimodal Visualization Metaphor

We developed a web-based prototype application using Java and d3.js; this system allows the user to interact with the collection. Figure 1 shows the visualization of the multimodal summary of the Flickr collection: each of the 25 clusters is represented by a prototype image, and the system can be configured to visualize the k nearest neighbors of each image. Figure 2 shows the 8-neighborhood of the red-apple image that appears at the top of Figure 1; we can see that the images of this cluster are related to the terms apple, green, and light, and that most of them are green apples. The user can browse through the 125 elements of the multimodal collection.
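On the server side, the retrieval step triggered by a click reduces to a k-nearest-neighbor query in the shared latent space. A minimal sketch, assuming Euclidean distance (the paper does not specify the metric):

```python
import numpy as np

def k_nearest(latent, query_idx, k=8):
    """Return the indices of the k images closest to the clicked image
    in the shared latent space, excluding the image itself."""
    d = np.linalg.norm(latent - latent[query_idx], axis=1)
    order = np.argsort(d)
    return [int(i) for i in order if i != query_idx][:k]
```

The returned indices are what the d3.js front end would render as the neighborhood view of Figure 2.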

V. DISCUSSION AND CONCLUSIONS

In this paper we presented two mechanisms to visualize image collections, in which the visualization takes as input a set of image and tag prototypes. The first mechanism is based on graphs, which offer a way to visualize relationships among images using conventional algorithms such as NEATO and SFDP. Although Graphviz is a good tool for visualizing image collections, it only generates a static image, so it is not possible to interact with the obtained visualizations. The second mechanism is a novel visualization metaphor in which we project the visual and text modalities onto the same visualization area. We named this method multimodal image collection visualization. This new method offers an intuitive mechanism for interacting with the image collection through tags and images. As future work, we want to conduct a user evaluation of the multimodal visualization metaphor to validate its performance in a real scenario.

REFERENCES

[1] J. Camargo and F. Gonzalez, "Visualization, summarization and exploration of large collections of images: State of the art," in Latin American Conference on Networked and Electronic Media (LACNEM 2009), 2009.

[2] ——, "MICS: Multimodal image collection summarization by optimal reconstruction subset selection," in Computing Congress (CCC), 2013 8th Colombian, 2013.

[3] G. P. Nguyen and M. Worring, "Interactive access to large image collections using similarity-based visualization," J. Vis. Lang. Comput., vol. 19, no. 2, pp. 203–224, Apr. 2008. [Online]. Available: http://dx.doi.org/10.1016/j.jvlc.2006.09.002

[4] J. Zhang, Visualization for Information Retrieval, ser. The Information Retrieval Series. Springer-Verlag Berlin Heidelberg, 2008. [Online]. Available: http://books.google.com.co/books?id=x5i-tK8j0GoC

[5] I. Jolliffe, Principal Component Analysis, ser. Springer Series in Statistics. Springer, 2002. [Online]. Available: http://books.google.com.co/books?id=_olByCrhjwIC

[6] J. B. Tenenbaum, V. de Silva, and J. C. Langford, "A global geometric framework for nonlinear dimensionality reduction," Science, vol. 290, no. 5500, pp. 2319–2323, Dec. 2000. [Online]. Available: http://dx.doi.org/10.1126/science.290.5500.2319

[7] J. Camargo and F. Gonzalez, "Multimodal image collection summarization using non-negative matrix factorization," in Computing Congress (CCC), 2011 6th Colombian, 2011, pp. 1–6.

[8] S. Lohmann, P. Heim, L. Tetzlaff, T. Ertl, and J. Ziegler, "Exploring relationships between annotated images with the ChainGraph visualization," in Proceedings of the 4th International Conference on Semantic and Digital Media Technologies: Semantic Multimedia, ser. SAMT '09. Berlin, Heidelberg: Springer-Verlag, 2009, pp. 16–27. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-10543-2_4

[9] J. N. Warfield, "Crossing theory and hierarchy mapping," IEEE Transactions on Systems, Man and Cybernetics, vol. 7, no. 7, pp. 505–523, 1977.

[10] M.-J. Carpano, "Automatic display of hierarchized graphs for computer-aided decision analysis," IEEE Transactions on Systems, Man and Cybernetics, vol. 10, no. 11, pp. 705–715, 1980.

[11] K. Sugiyama, S. Tagawa, and M. Toda, "Methods for visual understanding of hierarchical system structures," IEEE Transactions on Systems, Man and Cybernetics, vol. 11, no. 2, pp. 109–125, 1981.

[12] E. Koutsofios and S. C. North, "Drawing graphs with dot," 1993.

[13] T. Kamada and S. Kawai, "An algorithm for drawing general undirected graphs," Inf. Process. Lett., vol. 31, no. 1, pp. 7–15, Apr. 1989. [Online]. Available: http://dx.doi.org/10.1016/0020-0190(89)90102-6

[14] S. C. North, "Drawing graphs with NEATO," NEATO User Manual, 2004.

[15] T. M. J. Fruchterman and E. M. Reingold, "Graph drawing by force-directed placement," Softw. Pract. Exper., vol. 21, no. 11, pp. 1129–1164, Nov. 1991. [Online]. Available: http://dx.doi.org/10.1002/spe.4380211102

[16] J. Ellson, E. R. Gansner, E. Koutsofios, S. C. North, and G. Woodhull, "Graphviz and Dynagraph – static and dynamic graph drawing tools," in Graph Drawing Software. Springer-Verlag, 2003, pp. 127–148.


Figure 4. Graph-based visualization using the DOT General View (k=4)

Figure 5. Visualization using the graph-based NEATO algorithm (k=5)


Figure 6. Visualization using the SFDP graph-based algorithm (k=5)