© 2017 NAVER LABS. All rights reserved.
Jean-Michel Renders and Gabriela Csurka
NLE@MediaEval’17: Combining Cross-Media
Similarity and Embeddings for Retrieving
Diverse Social Images
Our main motivations in the challenge were:
• Evaluate the cross-media similarity measure we proposed in [1][2],
which gave top-ranked results on several ImageCLEF multimodal
search tasks between 2007 and 2011.
• Compare this simple approach with more recent image and text
combination strategies, such as joint image and text embedding [3][4].
• Evaluate different methods to promote diversity at the top of the ranking.
Motivation
[1] S. Clinchant, J.-M. Renders, and G. Csurka, XRCE’s participation to ImageCLEF. In CLEF Working Notes 2007.
[2] S. Clinchant, J.-M. Renders, and G. Csurka, Trans–Media Pseudo–Relevance Feedback Methods in Multimedia
Retrieval, In Advances in Multilingual and Multimodal Information Retrieval. Vol. LNCS 5152. Springer, 2008.
[3] L. Wang, Y. Li, and S. Lazebnik, Learning Deep, Structure-Preserving Image-Text Embeddings. CVPR 2016
[4] A. Gordo and D. Larlus, Beyond instance-level image retrieval: Leveraging captions to learn a global visual
representation for semantic retrieval. CVPR 2017
The main idea is to switch media during the pseudo feedback process:
• use one media type to gather relevant multimedia objects from a repository;
• use the dual type to step further (retrieve, annotate, etc.).
Cross-media Pseudo Relevance Feedback*
[Diagram: pseudo-feedback loop. The top-N documents are ranked by image or textual similarity; their media are aggregated and switched (text ↔ image); the final step then uses the dual modality to rank, retrieve, compose, annotate, illustrate, etc.]
*S. Clinchant, J.-M. Renders, and G. Csurka, Trans–Media Pseudo–Relevance Feedback Methods in Multimedia
Retrieval, In Advances in Multilingual and Multimodal Information Retrieval. Vol. LNCS 5152. Springer, 2008.
• S_V(d, d′): visual similarity between documents d and d′
• S_T(d, q): textual similarity between document d and the query q
• NN_K^T(q): the top-K documents most similar to the query q according to S_T(d, q)
Cross-media Pseudo Relevance Feedback (cont)
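As a sketch, the cross-media score of [2] can be read as re-scoring every document by its visual similarity to the textual top-K feedback set NN_K^T(q), weighted by those neighbours' textual scores (function and variable names below are ours, not from the paper):

```python
import numpy as np

def cross_media_scores(s_text, s_visual, k=20):
    """Trans-media pseudo-relevance feedback (one possible reading of [2]).

    s_text:   (n_docs,) textual similarities S_T(d, q) to the query
    s_visual: (n_docs, n_docs) pairwise visual similarities S_V(d, d')
    Returns one cross-media score per document.
    """
    topk = np.argsort(-s_text)[:k]       # NN_K^T(q): textual top-K feedback set
    weights = s_text[topk]               # S_T(d', q) for each feedback document d'
    # sum over d' in NN_K^T(q) of S_T(d', q) * S_V(d, d')
    return s_visual[:, topk] @ weights
```

The re-ranked list is then obtained by sorting documents on these scores, optionally mixed with the original mono-modal scores.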
Joint visual and textual embedding*
*L. Wang, Y. Li, and S. Lazebnik, Learning Deep, Structure-Preserving Image-Text Embeddings. CVPR 2016
[Figure: each document i has a textual embedding p_i^T and a visual embedding p_i^V projected into a joint space.]
The aim is to force semantically similar documents (in our case, documents
relevant to the query) to have textual and visual embeddings that are close to
each other and far from the textual and visual embeddings of non-relevant documents.
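A minimal sketch of the kind of triplet objective such embeddings are trained with; the margin value and the names below are illustrative, not the exact loss of [3]:

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=0.1):
    """Hinge-style triplet loss in the spirit of structure-preserving
    image-text embeddings [3]: pull an anchor embedding towards the
    embedding of a relevant document (possibly in the other modality)
    and push it away from a non-relevant one.
    Inputs are L2-normalised embedding vectors."""
    d_pos = 1.0 - anchor @ positive   # cosine distance to the relevant document
    d_neg = 1.0 - anchor @ negative   # cosine distance to the non-relevant document
    return max(0.0, margin + d_pos - d_neg)
```

In practice the loss is summed over many sampled triplets, in both directions (text anchor with visual positives/negatives, and vice versa).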
The main idea is to reduce redundancy while maintaining relevance when re-
ranking the retrieved set of documents.
Given a set of documents 𝑑𝑖 retrieved for a query 𝑞, the method
incrementally ranks the documents according to the MMR criterion:
Diversity Promoting: Maximal Marginal Relevance* (MMR)
*J. Carbonell and J. Goldstein. The use of MMR, diversity based reranking for reordering documents and
producing summaries, SIGIR 1998.
• Rel is the set of relevant documents to be considered
• Sel is the set of already selected documents

MMR = argmax_{d_i ∈ Rel\Sel} [ β S_{T,V}(d_i, q) − (1 − β) max_{d_j ∈ Sel} S_V(d_i, d_j) ]
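The greedy MMR selection can be sketched as follows (function and variable names are ours):

```python
import numpy as np

def mmr_rerank(rel_scores, s_visual, beta=0.7, n_select=10):
    """Maximal Marginal Relevance re-ranking (Carbonell & Goldstein, 1998).

    rel_scores: (n,) combined relevance scores S_{T,V}(d_i, q)
    s_visual:   (n, n) pairwise visual similarities S_V(d_i, d_j)
    Greedily picks the candidate maximising
        beta * relevance - (1 - beta) * max similarity to already-selected docs.
    """
    candidates = list(range(len(rel_scores)))
    selected = []
    while candidates and len(selected) < n_select:
        def mmr(i):
            # Redundancy term: similarity to the closest already-selected document
            redundancy = max(s_visual[i, j] for j in selected) if selected else 0.0
            return beta * rel_scores[i] - (1.0 - beta) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With beta close to 1 the method reduces to plain relevance ranking; lowering beta trades precision for diversity.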
The above methods heavily rely on the choice of the mono-modal similarity
measures and, consequently, on a good textual/visual representation of the
query and the documents.
Therefore, we use the following SOA representations:
• Text: a mixture of LM-based representation [1] (using Dirichlet
smoothing) and Dual Embedding Space Model for Document Ranking
[2] pre-trained on the Bing query corpus.
• Image: Inception-ResNet V2 [3] and RMAC [4,5] deep models pre-
trained on ImageNet.
Visual and textual representation
[1] J. M. Ponte and W. B. Croft. A language modeling approach to information retrieval. SIGIR 1998.
[2] E. Nalisnick, B. Mitra, N. Craswell, and R. Caruana. Improving Document Ranking with Dual Word Embeddings,
WWW 2016
[3] C. Szegedy, S. Ioffe, V. Vanhoucke, Inception-v4, Inception-ResNet and the Impact of Residual Connections on
Learning, arXiv 1602.07261
[4] G. Tolias, R. Sicre, and H. Jégou. Particular object retrieval with integral max-pooling of CNN activations, ICML 2016
[5] A. Gordo, J. Almazan, J. Revaud and D. Larlus, End-to-end Learning of Deep Visual Representations for Image
Retrieval. CVPR 2016
Text: Dual Embedding Space Model* (DESM)
Embedding designed for IR applications because:
• word2vec learns two weight matrices, W_IN and W_OUT, but W_OUT is usually discarded;
• the IN-OUT dot product captures the log probability of word co-occurrence;
• the resulting DESM score is linearly combined with a traditional LM-based retrieval model.
*E. Nalisnick, B. Mitra, N. Craswell, and R. Caruana. Improving Document Ranking with Dual Word Embeddings,
WWW 2016
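A sketch of the DESM IN-OUT score as described in [2], where the document is represented by the centroid of its L2-normalised OUT vectors and the score averages cosine similarities of the query IN vectors to that centroid (variable names are ours):

```python
import numpy as np

def desm_score(query_in_vecs, doc_out_vecs):
    """DESM IN-OUT document score (a sketch of the model in [2]).

    query_in_vecs: (n_query_terms, dim) IN embeddings of the query words
    doc_out_vecs:  (n_doc_terms, dim) OUT embeddings of the document words
    """
    # Document centroid over L2-normalised OUT vectors
    doc_unit = doc_out_vecs / np.linalg.norm(doc_out_vecs, axis=1, keepdims=True)
    centroid = doc_unit.mean(axis=0)
    centroid_unit = centroid / np.linalg.norm(centroid)
    # Average cosine similarity of query IN vectors to the document centroid
    q_unit = query_in_vecs / np.linalg.norm(query_in_vecs, axis=1, keepdims=True)
    return float((q_unit @ centroid_unit).mean())
```

The final ranking score is then a linear mixture, e.g. alpha * LM_score + (1 - alpha) * desm_score, as on the slide above.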
Image: Inception Resnet-V2* – activation layer
*C. Szegedy, S. Ioffe, V. Vanhoucke, Inception-v4, Inception-ResNet and the Impact of Residual Connections on
Learning, arXiv 1602.07261
Image: RMAC* learning to rank model
*A. Gordo, J. Almazan, J. Revaud and D. Larlus, End-to-end Learning of Deep Visual Representations for Image
Retrieval. CVPR 2016
Our submitted runs
• The visual-only (Run1) and textual-only (Run2) runs perform similarly, with the
visual run achieving slightly higher precision and the textual run higher diversity.
• The cross-media similarity yielded a much better ranking in terms of both
precision and diversity (Run3).
• Learning a joint visual and textual embedding using the relevance scores,
combined with MMR, yielded poorer diversity (Run4 and Run5).