MediaEval 2017 Retrieving Diverse Social Images Task (Overview)
TRANSCRIPT
Retrieving Diverse Social Images Task
- task overview -
MediaEval 2017, September 13-15, Dublin, Ireland
Maia Zaharieva (TUW, Austria)
Bogdan Ionescu (UPB, Romania)
Alexandru Lucian Gînscă (CEA LIST, France)
Rodrygo L.T. Santos (UFMG, Brazil)
Henning Müller (HES-SO in Sierre, Switzerland)
Bogdan Boteanu (UPB, Romania)
Outline
- The Retrieving Diverse Social Images Task
- Dataset and Evaluation
- Participants
- Results
- Discussion and Perspectives
Diversity Task: Objective & Motivation
Objective: image search result diversification in the context of
social photo retrieval.
Why diversify search results?
- to respond to the needs of different users;
- as a method of tackling queries with unclear information needs;
- to widen the pool of possible results (increase performance);
- to reduce the number/redundancy of the returned items;
…
Diversity Task: Objective & Motivation #2
[no text on this slide]
Diversity Task: Objective & Motivation #3
[no text on this slide]
Diversity Task: Definition
For each query, participants receive a ranked list of photos retrieved
from Flickr using its default “relevance” algorithm.
Query = general-purpose, multi-topic term
e.g.: autumn colors, bee on a flower, home office, snow in
the city, holding hands, ...
Goal of the task: refine the results by providing a ranked list of up
to 50 photos (summary) that are considered to be both relevant and
diverse representations of the query.
relevant: a common photo representation of the query topics (all at once);
bad quality photos (e.g., severely blurred, out of focus) are not considered
relevant in this scenario
diverse: depicting different visual characteristics of the query topics and
subtopics with a certain degree of complementarity, i.e., most of the
perceived visual information is different from one photo to another.
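To make "relevant and diverse" concrete: below is a minimal, illustrative Python sketch of a greedy MMR-style re-ranking that builds such a summary. It is not any participant's actual method; the relevance scores and the pairwise similarity function are placeholder assumptions.

# Illustrative only: greedy MMR-style re-ranking that trades off relevance
# against redundancy when selecting up to 50 photos for the summary.
def diversify(candidates, relevance, similarity, k=50, lam=0.7):
    """Greedily pick photos that are relevant yet dissimilar to those already picked."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        def mmr_score(photo):
            redundancy = max((similarity(photo, s) for s in selected), default=0.0)
            return lam * relevance[photo] - (1 - lam) * redundancy
        best = max(pool, key=mmr_score)
        selected.append(best)
        pool.remove(best)
    return selected

# Toy usage with hypothetical photo ids and a tag-overlap similarity:
rel = {"p1": 0.9, "p2": 0.8, "p3": 0.5}
tags = {"p1": {"truck", "camper"}, "p2": {"truck", "camper"}, "p3": {"forest", "road"}}
sim = lambda a, b: len(tags[a] & tags[b]) / len(tags[a] | tags[b])
print(diversify(["p1", "p2", "p3"], rel, sim, k=2))  # -> ['p1', 'p3']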
Dataset: General Information & Resources
Provided information:
query text formulation;
ranked list of Creative Commons photos from Flickr*
(up to 300 photos per query);
metadata from Flickr (e.g., tags, description, views,
comments, date-time photo was taken, username, userid, etc);
visual, text & user annotation credibility descriptors;
semantic vectors for general English terms computed on top of
the English Wikipedia (wikiset);
relevance and diversity ground truth.
Photos:
Development (devset): 110 queries, 32,340 photos
Test (testset): 84 queries, 24,986 photos
Dataset: Provided Descriptors
General purpose visual descriptors:
e.g., Auto Color Correlogram, Color and Edge Directivity
Descriptor, Pyramid of Histograms of Orientation Gradients, etc;
Convolutional Neural Network (CNN) based descriptors:
computed with the Caffe framework;
General purpose text descriptors:
e.g., term frequency information, document frequency
information and their ratio, i.e., TF-IDF;
User annotation credibility descriptors (provide an automatic
estimate of the quality of users' tag-image content relationships):
e.g., measure of user image relevance, total number of images a
user shared, the percentage of images with faces.
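As a concrete illustration of the text descriptors, here is a minimal TF-IDF sketch over photo tags; the photo ids and tag lists are hypothetical and do not reflect the exact format of the provided descriptors.

import math
from collections import Counter

# Hypothetical photo-id -> Flickr tag list (placeholder data).
photos = {
    "p1": ["autumn", "colors", "park"],
    "p2": ["autumn", "leaves"],
    "p3": ["home", "office", "desk"],
}

n_docs = len(photos)
# Document frequency: in how many photos each tag occurs.
df = Counter(tag for tag_list in photos.values() for tag in set(tag_list))

def tf_idf(tag_list):
    """Term frequency times inverse document frequency for one photo's tags."""
    tf = Counter(tag_list)
    return {t: (tf[t] / len(tag_list)) * math.log(n_docs / df[t]) for t in tf}

print(tf_idf(photos["p1"]))  # higher weights for tags rare across the collection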
Dataset: Basic Statistics
                                   devset                  testset
                                   (design the methods)    (final benchmarking)
#queries                           110                     84
#images                            32,340                  24,986
#img. per query (min-average-max)  141 - 295 - 300         299 - 300 - 300
% relevant img.                    53                      57.4
avg. #clusters per query           17                      14
avg. #img. per cluster             9                       14
Dataset: Ground Truth - annotations
Relevance and diversity annotations were carried out by
expert annotators:
devset: relevance: 8 annotators + 1 master (3 annotations/query)
diversity: 1 annotation/query
testset: relevance: 8 annotators + 1 master (3 annotations/query)
diversity: 12 annotators (3 annotations/query)
Lenient majority voting was used to obtain the final relevance labels.
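A minimal sketch of lenient majority voting over relevance annotations is given below; the label coding (1 = relevant, 0 = non-relevant, -1 = don't know) and the tie-breaking rule are assumptions for illustration, not the organizers' exact protocol.

from collections import Counter

def lenient_majority(labels):
    """Relevant if relevant votes are at least as many as non-relevant votes."""
    counts = Counter(labels)
    return counts[1] >= counts[0]  # lenient: ties count as relevant

print(lenient_majority([1, 0, 1]))   # True
print(lenient_majority([1, 0, -1]))  # True (tie resolved leniently)
print(lenient_majority([0, 0, 1]))   # False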
Evaluation: Run Specification
Participants are required to submit up to 5 runs:
required runs:
run 1: automated using visual information only;
run 2: automated using textual information only;
run 3: automated, using fused textual-visual information without
other resources than those provided by the organizers;
general runs:
run 4: everything allowed, e.g., human-based or hybrid human-
machine approaches, including data from external sources
(e.g., the Internet) or pre-trained models obtained from
external datasets related to this task;
run 5: everything allowed.
Evaluation: Official Metrics
Cluster Recall @ X: CR@X = Nc / N
where X is the cutoff point, N is the total number of clusters for the
current query (from the ground truth, N <= 25), and Nc is the number of
different clusters represented among the top X ranked images;
note: cluster recall is computed only over the relevant images.
Precision @ X: P@X = R / X
where R is the number of relevant images among the top X;
F1-measure @ X: F1@X = harmonic mean of CR@X and P@X.
Metrics are reported for different values of X (5, 10, 20, 30, 40 & 50),
both per topic and overall (average).
Official ranking metric: F1@20
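The official metrics are easy to express in code. The sketch below assumes a hypothetical ground-truth representation (a mapping from each relevant photo id to its diversity cluster id); the actual task files use a different format.

def metrics_at_x(ranked_ids, rel_clusters, x):
    """P@X, CR@X and F1@X for one query; rel_clusters maps relevant photo id -> cluster id."""
    top = ranked_ids[:x]
    relevant = [pid for pid in top if pid in rel_clusters]
    p = len(relevant) / x                                     # Precision @ X
    n_clusters = len(set(rel_clusters.values()))              # N (per query)
    n_covered = len({rel_clusters[pid] for pid in relevant})  # Nc in the top X
    cr = n_covered / n_clusters if n_clusters else 0.0        # Cluster Recall @ X
    f1 = 2 * p * cr / (p + cr) if (p + cr) else 0.0           # harmonic mean
    return p, cr, f1

# Hypothetical example: 4 relevant photos spread over 3 clusters.
gt = {"a": 0, "b": 0, "c": 1, "d": 2}
run = ["a", "x", "c", "b", "y"]
print(metrics_at_x(run, gt, 5))  # -> (0.6, 0.666..., 0.631...)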
Participants: Basic Statistics
Survey:
- 22 respondents were interested in the task;
Registration:
- 14 teams registered (1 team is organizer-related);
Run submission:
- 6 teams finished the task, including 1 organizer-related team;
- 29 runs were submitted;
Workshop participation:
- 5 teams are represented at the workshop.
Participants: Submitted Runs (29)

Team (Country)          Required runs 1/2/3      Run 4               Run 5               P@20    CR@20   F1@20
                        (visual/text/vis-text)   (general)           (general)           (best)  (best)  (best)
NLE (France)            ✓ ✓ ✓                    visual-text         visual-text         0.793   0.679   0.705
MultiBrazil (Brazil)    ✓ ✓ ✓                    visual-text-cred.   visual-text-cred.   0.7208  0.6524  0.6634
UMONS (Belgium)         ✓ ✓ ✓                    visual-text-cred.   visual-cred.        0.8071  0.5856  0.6554
CFM (China)             ✓ ✓ ✓                    text-cred.          text-cred.          0.6881  0.6671  0.6533
tud-mmc (Netherlands)   ✓ ✓ ✓                    text-intent         ✗                   0.7262  0.6142  0.6462
Flickr initial results                                                                   0.6595  0.5831  0.5922
LAPI* (Romania)         ✓ ✓ ✓                    visual              cred.               0.633   0.6045  0.5777

*organizer related team
Results: P vs. CR @20 (all runs - testset)
[Scatter plot of P@20 against CR@20 for all submitted runs on the testset
(CFM, LAPI, MultiBrazil, NLE, tud-mmc, UMONS) and the initial Flickr results.]
Results: Best Team Runs (F1@X)
[Line plot of F1@X for X = 5, 10, 20, 30, 40, 50: Flickr initial results vs.
the best run per team: CFM_run5_text_cred.txt, LAPI_HC_PSRF_Run5.txt,
run3VisualTextual_MultiBrasil.txt, NLE_run3_CMRF_MMR.txt,
tudmmc_run4_tudmmc_intent.txt, UMONS_run5_visual_user_G.txt.]
Results: Best Team Runs (Cluster Recall @X)
[Line plot of CR@X for X = 5, 10, 20, 30, 40, 50 for the same best team runs
and the Flickr initial results.]
Results: Visual Results – Flickr Initial Results
Query: Truck Camper (CR@20=0.35, P@20=0.3, F1@20=0.32)

Results: Visual Results #2 – Best run (F1@20)
Query: Truck Camper (CR@20=0.68, P@20=0.8, F1@20=0.74)

Results: Visual Results #3 – Lowest run
Query: Truck Camper (CR@20=0.5, P@20=0.5, F1@20=0.5)
Brief Discussion
Methods:
this year mainly classification/clustering (& fusion), re-ranking,
relevance feedback, and neural-network based approaches (a minimal
fusion sketch follows below);
best run F1@20: improving relevance (text) + neural-network based
clustering; use of visual-text information (team NLE).
Dataset:
getting very complex (read: diverse);
Creative Commons content on Flickr is still a scarce resource;
the descriptors were very well received (employed by all of the
participants as provided).
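As an illustration of the visual-text fusion mentioned above, here is a minimal late-fusion sketch with hypothetical per-photo scores; it is a generic baseline, not team NLE's (or any other team's) actual method.

def normalise(scores):
    """Min-max normalisation of a photo-id -> score mapping."""
    lo, hi = min(scores.values()), max(scores.values())
    return {k: (v - lo) / (hi - lo) if hi > lo else 0.0 for k, v in scores.items()}

def late_fusion(visual, textual, alpha=0.5):
    """Weighted combination of normalised visual and textual relevance scores."""
    v, t = normalise(visual), normalise(textual)
    return {k: alpha * v[k] + (1 - alpha) * t[k] for k in visual}

# Hypothetical per-photo scores from a visual and a textual ranker.
fused = late_fusion({"p1": 0.2, "p2": 0.9, "p3": 0.4},
                    {"p1": 0.8, "p2": 0.3, "p3": 0.6})
print(sorted(fused, key=fused.get, reverse=True))  # fused ranking of photo ids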
Acknowledgements
Task auxiliaries:
Bogdan Boteanu, UPB, Romania & Mihai Lupu, Vienna University of
Technology, Austria
Task supporters:
Alberto Ueda, Bruno Laporais, Felipe Moraes, Lucas Chaves, Jordan
Silva, Marlon Dias, Rafael Glater
Catalin Mitrea, Mihai Dogariu, Liviu Stefan, Gabriel Petrescu, Alexandru
Toma, Alina Banica, Andreea Roxana, Mihaela Radu, Bogdan Guliman,
Sebastian Moraru
Questions & Answers
Thank you!