© 2017 NAVER LABS. All rights reserved.
Jean-Michel Renders and Gabriela Csurka
NLE@MediaEval’17: Combining Cross-Media
Similarity and Embeddings for Retrieving
Diverse Social Images
Our main motivations in the challenge were:
• Evaluate the cross-media similarity measure we proposed in [1][2],
which gave top-ranked results on several ImageCLEF multimodal
search tasks between 2007 and 2011.
• Compare this simple approach with more recent image and text
combination strategies, such as joint image and text embedding [3][4].
• Evaluate different methods to promote diversity at the top of the ranking.
Motivation
[1] S. Clinchant, J.-M. Renders, and G. Csurka, XRCE’s participation to ImageCLEF. In CLEF Working Notes 2007.
[2] S. Clinchant, J.-M. Renders, and G. Csurka, Trans–Media Pseudo–Relevance Feedback Methods in Multimedia
Retrieval, In Advances in Multilingual and Multimodal Information Retrieval. Vol. LNCS 5152. Springer, 2008.
[3] L. Wang, Y. Li, and S. Lazebnik, Learning Deep, Structure-Preserving Image-Text Embeddings. CVPR 2016
[4] A. Gordo and D. Larlus, Beyond instance-level image retrieval: Leveraging captions to learn a global visual
representation for semantic retrieval. CVPR 2017
The main idea is to switch media during the pseudo feedback process:
• use one media type to gather relevant multimedia objects from a repository;
• use the dual type to step further (retrieve, annotate, etc.).
Cross-media Pseudo Relevance Feedback*
[Diagram: pseudo-feedback loop. The top-N documents are ranked by image or textual similarity; their media are aggregated and switched (text ↔ image); the final step then uses the dual modality to rank, retrieve, compose, annotate, illustrate, etc.]
*S. Clinchant, J.-M. Renders, and G. Csurka, Trans–Media Pseudo–Relevance Feedback Methods in Multimedia
Retrieval, In Advances in Multilingual and Multimodal Information Retrieval. Vol. LNCS 5152. Springer, 2008.
• S_V(d, d′): visual similarity between documents d and d′
• S_T(d, q): textual similarity between document d and the query q
• NN_K^T(q): the top-K documents most similar to the query q according to S_T(d, q)
Cross-media Pseudo Relevance Feedback (cont)
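As a sketch, the cross-media score of [2] can be read as re-scoring every document by its visual similarity to the textual top-K feedback set NN_K^T(q), weighted by those neighbours' textual scores (function and variable names below are ours, not from the paper):

```python
import numpy as np

def cross_media_scores(s_text, s_visual, k=20):
    """Trans-media pseudo-relevance feedback (one possible reading of [2]).

    s_text:   (n_docs,) textual similarities S_T(d, q) to the query
    s_visual: (n_docs, n_docs) pairwise visual similarities S_V(d, d')
    Returns one cross-media score per document.
    """
    topk = np.argsort(-s_text)[:k]       # NN_K^T(q): textual top-K feedback set
    weights = s_text[topk]               # S_T(d', q) for each feedback document d'
    # sum over d' in NN_K^T(q) of S_T(d', q) * S_V(d, d')
    return s_visual[:, topk] @ weights
```

The re-ranked list is then obtained by sorting documents on these scores, optionally mixed with the original mono-modal scores.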
Joint visual and textual embedding*
*L. Wang, Y. Li, and S. Lazebnik, Learning Deep, Structure-Preserving Image-Text Embeddings. CVPR 2016
[Figure: each document i has a textual embedding p_i^T and a visual embedding p_i^V projected into a joint space.]
The aim is to force semantically similar documents (in our case, documents
relevant to the query) to have textual and visual embeddings that are close to
each other and far from the textual and visual embeddings of non-relevant documents.
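A minimal sketch of the kind of triplet objective such embeddings are trained with; the margin value and the names below are illustrative, not the exact loss of [3]:

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=0.1):
    """Hinge-style triplet loss in the spirit of structure-preserving
    image-text embeddings [3]: pull an anchor embedding towards the
    embedding of a relevant document (possibly in the other modality)
    and push it away from a non-relevant one.
    Inputs are L2-normalised embedding vectors."""
    d_pos = 1.0 - anchor @ positive   # cosine distance to the relevant document
    d_neg = 1.0 - anchor @ negative   # cosine distance to the non-relevant document
    return max(0.0, margin + d_pos - d_neg)
```

In practice the loss is summed over many sampled triplets, in both directions (text anchor with visual positives/negatives, and vice versa).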
The main idea is to reduce redundancy while maintaining relevance when re-
ranking the retrieved set of documents.
Given a set of documents 𝑑𝑖 retrieved for a query 𝑞, the method
incrementally ranks the documents according to the MMR criterion:
Diversity Promoting: Maximal Marginal Relevance* (MMR)
*J. Carbonell and J. Goldstein. The use of MMR, diversity based reranking for reordering documents and
producing summaries, SIGIR 1998.
• Rel is the set of relevant documents to be considered
• Sel is the set of already selected documents

MMR = argmax_{d_i ∈ Rel\Sel} [ β S_{T,V}(d_i, q) − (1 − β) max_{d_j ∈ Sel} S_V(d_i, d_j) ]
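The greedy MMR selection can be sketched as follows (function and variable names are ours):

```python
import numpy as np

def mmr_rerank(rel_scores, s_visual, beta=0.7, n_select=10):
    """Maximal Marginal Relevance re-ranking (Carbonell & Goldstein, 1998).

    rel_scores: (n,) combined relevance scores S_{T,V}(d_i, q)
    s_visual:   (n, n) pairwise visual similarities S_V(d_i, d_j)
    Greedily picks the candidate maximising
        beta * relevance - (1 - beta) * max similarity to already-selected docs.
    """
    candidates = list(range(len(rel_scores)))
    selected = []
    while candidates and len(selected) < n_select:
        def mmr(i):
            # Redundancy term: similarity to the closest already-selected document
            redundancy = max(s_visual[i, j] for j in selected) if selected else 0.0
            return beta * rel_scores[i] - (1.0 - beta) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With beta close to 1 the method reduces to plain relevance ranking; lowering beta trades precision for diversity.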
The above methods heavily rely on the choice of the mono-modal similarity
measures and, consequently, on a good textual/visual representation of the
query and the documents.
Therefore, we use the following SOA representations:
• Text: a mixture of LM-based representation [1] (using Dirichlet
smoothing) and Dual Embedding Space Model for Document Ranking
[2] pre-trained on the Bing query corpus.
• Image: Inception-ResNet V2 [3] and RMAC [4,5] deep models pre-
trained on ImageNet.
Visual and textual representation
[1] J. M. Ponte and W. B. Croft. A language modeling approach to information retrieval. SIGIR 1998.
[2] E. Nalisnick, B. Mitra, N. Craswell, and R. Caruana. Improving Document Ranking with Dual Word Embeddings,
WWW 2016
[3] C. Szegedy, S. Ioffe, V. Vanhoucke, Inception-v4, Inception-ResNet and the Impact of Residual Connections on
Learning, arXiv 1602.07261
[4] G. Tolias, R. Sicre, and H. Jégou. Particular object retrieval with integral max-pooling of CNN activations, ICML 2016
[5] A. Gordo, J. Almazan, J. Revaud and D. Larlus, End-to-end Learning of Deep Visual Representations for Image
Retrieval. CVPR 2016
Text: Dual Embedding Space Model* (DESM)
Embedding designed for IR applications because:
• word2vec learns two weight matrices, W_IN and W_OUT, but W_OUT is usually discarded;
• the IN-OUT dot product captures the log probability of word co-occurrence;
• the resulting DESM score is linearly combined with a traditional LM-based retrieval model.
*E. Nalisnick, B. Mitra, N. Craswell, and R. Caruana. Improving Document Ranking with Dual Word Embeddings,
WWW 2016
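A sketch of the DESM IN-OUT score as described in [2], where the document is represented by the centroid of its L2-normalised OUT vectors and the score averages cosine similarities of the query IN vectors to that centroid (variable names are ours):

```python
import numpy as np

def desm_score(query_in_vecs, doc_out_vecs):
    """DESM IN-OUT document score (a sketch of the model in [2]).

    query_in_vecs: (n_query_terms, dim) IN embeddings of the query words
    doc_out_vecs:  (n_doc_terms, dim) OUT embeddings of the document words
    """
    # Document centroid over L2-normalised OUT vectors
    doc_unit = doc_out_vecs / np.linalg.norm(doc_out_vecs, axis=1, keepdims=True)
    centroid = doc_unit.mean(axis=0)
    centroid_unit = centroid / np.linalg.norm(centroid)
    # Average cosine similarity of query IN vectors to the document centroid
    q_unit = query_in_vecs / np.linalg.norm(query_in_vecs, axis=1, keepdims=True)
    return float((q_unit @ centroid_unit).mean())
```

The final ranking score is then a linear mixture, e.g. alpha * LM_score + (1 - alpha) * desm_score, as on the slide above.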
Image: Inception Resnet-V2* – activation layer
*C. Szegedy, S. Ioffe, V. Vanhoucke, Inception-v4, Inception-ResNet and the Impact of Residual Connections on
Learning, arXiv 1602.07261
Image: RMAC* learning to rank model
*A. Gordo, J. Almazan, J. Revaud and D. Larlus, End-to-end Learning of Deep Visual Representations for Image
Retrieval. CVPR 2016
Our submitted runs
• The visual-only (Run1) and textual-only (Run2) runs perform similarly, with the
visual run achieving slightly higher precision and the textual run higher diversity.
• The cross-media similarity yielded a much better ranking in terms of both
precision and diversity (Run3).
• Learning a joint visual and textual embedding using the relevance scores,
combined with MMR, yielded poorer diversity (Run4 and Run5).