MediaEval 2017 Retrieving Diverse Social Images Task (Overview)
TRANSCRIPT
Retrieving Diverse Social Images Task
- task overview -
MediaEval 2017, September 13-15, Dublin, Ireland
Maia Zaharieva (TUW, Austria)
Bogdan Ionescu (UPB, Romania)
Alexandru Lucian Gînscă (CEA LIST, France)
Rodrygo L.T. Santos (UFMG, Brazil)
Henning Müller (HES-SO in Sierre, Switzerland)
Bogdan Boteanu (UPB, Romania)
Outline
- The Retrieving Diverse Social Images Task
- Dataset and Evaluation
- Participants
- Results
- Discussion and Perspectives
Diversity Task: Objective & Motivation
Objective: image search result diversification in the context of
social photo retrieval.
Why diversify search results?
- to respond to the needs of different users;
- as a method of tackling queries with unclear information needs;
- to widen the pool of possible results (increase performance);
- to reduce the number/redundancy of the returned items;
…
Diversity Task: Objective & Motivation #2
[no text on this slide]
Diversity Task: Objective & Motivation #3
[no text on this slide]
Diversity Task: Definition
For each query, participants receive a ranked list of photos retrieved
from Flickr using its default “relevance” algorithm.
Query = general-purpose, multi-topic term
e.g.: autumn colors, bee on a flower, home office, snow in
the city, holding hands, ...
Goal of the task: refine the results by providing a ranked list of up
to 50 photos (summary) that are considered to be both relevant and
diverse representations of the query.
relevant: a common photo representation of the query topics (all at once);
bad quality photos (e.g., severely blurred, out of focus) are not considered
relevant in this scenario
diverse: depicting different visual characteristics of the query topics and
subtopics with a certain degree of complementarity, i.e., most of the
perceived visual information is different from one photo to another.
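To make "relevant and diverse" concrete: below is a minimal, illustrative Python sketch of a greedy MMR-style re-ranking that builds such a summary. It is not any participant's actual method; the relevance scores and the pairwise similarity function are placeholder assumptions.

# Illustrative only: greedy MMR-style re-ranking that trades off relevance
# against redundancy when selecting up to 50 photos for the summary.
def diversify(candidates, relevance, similarity, k=50, lam=0.7):
    """Greedily pick photos that are relevant yet dissimilar to those already picked."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        def mmr_score(photo):
            redundancy = max((similarity(photo, s) for s in selected), default=0.0)
            return lam * relevance[photo] - (1 - lam) * redundancy
        best = max(pool, key=mmr_score)
        selected.append(best)
        pool.remove(best)
    return selected

# Toy usage with hypothetical photo ids and a tag-overlap similarity:
rel = {"p1": 0.9, "p2": 0.8, "p3": 0.5}
tags = {"p1": {"truck", "camper"}, "p2": {"truck", "camper"}, "p3": {"forest", "road"}}
sim = lambda a, b: len(tags[a] & tags[b]) / len(tags[a] | tags[b])
print(diversify(["p1", "p2", "p3"], rel, sim, k=2))  # -> ['p1', 'p3']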
Dataset: General Information & Resources
Provided information:
query text formulation;
ranked list of Creative Commons photos from Flickr*
(up to 300 photos per query);
metadata from Flickr (e.g., tags, description, views,
comments, date-time photo was taken, username, userid, etc);
visual, text & user annotation credibility descriptors;
semantic vectors for general English terms computed on top of
the English Wikipedia (wikiset);
relevance and diversity ground truth.
Photos:
Development (devset): 110 queries, 32,340 photos
Test (testset): 84 queries, 24,986 photos
Dataset: Provided Descriptors
General purpose visual descriptors:
e.g., Auto Color Correlogram, Color and Edge Directivity
Descriptor, Pyramid of Histograms of Orientation Gradients, etc;
Convolutional Neural Network (CNN) based descriptors:
computed with the Caffe framework;
General purpose text descriptors:
e.g., term frequency information, document frequency
information and their ratio, i.e., TF-IDF;
User annotation credibility descriptors (provide an automatic
estimate of the quality of users' tag-image content relationships):
e.g., measure of user image relevance, total number of images a
user shared, the percentage of images with faces.
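As a concrete illustration of the text descriptors, here is a minimal TF-IDF sketch over photo tags; the photo ids and tag lists are hypothetical and do not reflect the exact format of the provided descriptors.

import math
from collections import Counter

# Hypothetical photo-id -> Flickr tag list (placeholder data).
photos = {
    "p1": ["autumn", "colors", "park"],
    "p2": ["autumn", "leaves"],
    "p3": ["home", "office", "desk"],
}

n_docs = len(photos)
# Document frequency: in how many photos each tag occurs.
df = Counter(tag for tag_list in photos.values() for tag in set(tag_list))

def tf_idf(tag_list):
    """Term frequency times inverse document frequency for one photo's tags."""
    tf = Counter(tag_list)
    return {t: (tf[t] / len(tag_list)) * math.log(n_docs / df[t]) for t in tf}

print(tf_idf(photos["p1"]))  # higher weights for tags rare across the collection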
Dataset: Basic Statistics
                                   devset                  testset
                                   (design the methods)    (final benchmarking)
#queries                           110                     84
#images                            32,340                  24,986
#img. per query (min-average-max)  141 - 295 - 300         299 - 300 - 300
% relevant img.                    53                      57.4
avg. #clusters per query           17                      14
avg. #img. per cluster             9                       14
Dataset: Ground Truth - annotations
Relevance and diversity annotations were carried out by
expert annotators:
devset: relevance: 8 annotators + 1 master (3 annotations/query)
diversity: 1 annotation/query
testset: relevance: 8 annotators + 1 master (3 annotations/query)
diversity: 12 annotators (3 annotations/query)
Lenient majority voting was used to obtain the final relevance labels.
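A minimal sketch of lenient majority voting over relevance annotations is given below; the label coding (1 = relevant, 0 = non-relevant, -1 = don't know) and the tie-breaking rule are assumptions for illustration, not the organizers' exact protocol.

from collections import Counter

def lenient_majority(labels):
    """Relevant if relevant votes are at least as many as non-relevant votes."""
    counts = Counter(labels)
    return counts[1] >= counts[0]  # lenient: ties count as relevant

print(lenient_majority([1, 0, 1]))   # True
print(lenient_majority([1, 0, -1]))  # True (tie resolved leniently)
print(lenient_majority([0, 0, 1]))   # False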
Evaluation: Run Specification
Participants are required to submit up to 5 runs:
required runs:
run 1: automated using visual information only;
run 2: automated using textual information only;
run 3: automated, using fused textual-visual information without
other resources than those provided by the organizers;
general runs:
run 4: everything allowed, e.g., human-based or hybrid human-
machine approaches, including data from external sources
(e.g., the Internet) or pre-trained models obtained from
external datasets related to this task;
run 5: everything allowed.
Evaluation: Official Metrics
Cluster Recall @ X: CR@X = Nc / N
where X is the cutoff point, N is the total number of clusters for the
current query (from the ground truth, N <= 25), and Nc is the number of
different clusters represented among the top X ranked images;
note: cluster recall is computed only over the relevant images.
Precision @ X: P@X = R / X
where R is the number of relevant images among the top X;
F1-measure @ X: F1@X = harmonic mean of CR@X and P@X.
Metrics are reported for different values of X (5, 10, 20, 30, 40 & 50),
both per topic and overall (average).
Official ranking metric: F1@20
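The official metrics are easy to express in code. The sketch below assumes a hypothetical ground-truth representation (a mapping from each relevant photo id to its diversity cluster id); the actual task files use a different format.

def metrics_at_x(ranked_ids, rel_clusters, x):
    """P@X, CR@X and F1@X for one query; rel_clusters maps relevant photo id -> cluster id."""
    top = ranked_ids[:x]
    relevant = [pid for pid in top if pid in rel_clusters]
    p = len(relevant) / x                                     # Precision @ X
    n_clusters = len(set(rel_clusters.values()))              # N (per query)
    n_covered = len({rel_clusters[pid] for pid in relevant})  # Nc in the top X
    cr = n_covered / n_clusters if n_clusters else 0.0        # Cluster Recall @ X
    f1 = 2 * p * cr / (p + cr) if (p + cr) else 0.0           # harmonic mean
    return p, cr, f1

# Hypothetical example: 4 relevant photos spread over 3 clusters.
gt = {"a": 0, "b": 0, "c": 1, "d": 2}
run = ["a", "x", "c", "b", "y"]
print(metrics_at_x(run, gt, 5))  # -> (0.6, 0.666..., 0.631...)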
Participants: Basic Statistics
Survey:
- 22 respondents were interested in the task;
Registration:
- 14 teams registered (1 team is organizer-related);
Run submission:
- 6 teams finished the task, including 1 organizer-related team;
- 29 runs were submitted;
Workshop participation:
- 5 teams are represented at the workshop.
Participants: Submitted Runs (29)

Team (Country)          Required runs 1/2/3      Run 4               Run 5               P@20    CR@20   F1@20
                        (visual/text/vis-text)   (general)           (general)           (best)  (best)  (best)
NLE (France)            ✓ ✓ ✓                    visual-text         visual-text         0.793   0.679   0.705
MultiBrazil (Brazil)    ✓ ✓ ✓                    visual-text-cred.   visual-text-cred.   0.7208  0.6524  0.6634
UMONS (Belgium)         ✓ ✓ ✓                    visual-text-cred.   visual-cred.        0.8071  0.5856  0.6554
CFM (China)             ✓ ✓ ✓                    text-cred.          text-cred.          0.6881  0.6671  0.6533
tud-mmc (Netherlands)   ✓ ✓ ✓                    text-intent         ✗                   0.7262  0.6142  0.6462
Flickr initial results                                                                   0.6595  0.5831  0.5922
LAPI* (Romania)         ✓ ✓ ✓                    visual              cred.               0.633   0.6045  0.5777

*organizer related team
Results: P vs. CR @20 (all runs - testset)
[Scatter plot of P@20 against CR@20 for all submitted runs on the testset
(CFM, LAPI, MultiBrazil, NLE, tud-mmc, UMONS) and the initial Flickr results.]
Results: Best Team Runs (F1@X)
[Line plot of F1@X for X = 5, 10, 20, 30, 40, 50: Flickr initial results vs.
the best run per team: CFM_run5_text_cred.txt, LAPI_HC_PSRF_Run5.txt,
run3VisualTextual_MultiBrasil.txt, NLE_run3_CMRF_MMR.txt,
tudmmc_run4_tudmmc_intent.txt, UMONS_run5_visual_user_G.txt.]
Results: Best Team Runs (Cluster Recall @X)
[Line plot of CR@X for X = 5, 10, 20, 30, 40, 50 for the same best team runs
and the Flickr initial results.]
Results: Visual Results – Flickr Initial Results
Query: Truck Camper (CR@20=0.35, P@20=0.3, F1@20=0.32)

Results: Visual Results #2 – Best run (F1@20)
Query: Truck Camper (CR@20=0.68, P@20=0.8, F1@20=0.74)

Results: Visual Results #3 – Lowest run
Query: Truck Camper (CR@20=0.5, P@20=0.5, F1@20=0.5)
Brief Discussion
Methods:
this year mainly classification/clustering (& fusion), re-ranking,
relevance feedback, and neural-network based approaches (a minimal
fusion sketch follows below);
best run F1@20: improving relevance (text) + neural-network based
clustering; use of visual-text information (team NLE).
Dataset:
getting very complex (read: diverse);
Creative Commons content on Flickr is still a scarce resource;
the descriptors were very well received (employed by all of the
participants as provided).
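As an illustration of the visual-text fusion mentioned above, here is a minimal late-fusion sketch with hypothetical per-photo scores; it is a generic baseline, not team NLE's (or any other team's) actual method.

def normalise(scores):
    """Min-max normalisation of a photo-id -> score mapping."""
    lo, hi = min(scores.values()), max(scores.values())
    return {k: (v - lo) / (hi - lo) if hi > lo else 0.0 for k, v in scores.items()}

def late_fusion(visual, textual, alpha=0.5):
    """Weighted combination of normalised visual and textual relevance scores."""
    v, t = normalise(visual), normalise(textual)
    return {k: alpha * v[k] + (1 - alpha) * t[k] for k in visual}

# Hypothetical per-photo scores from a visual and a textual ranker.
fused = late_fusion({"p1": 0.2, "p2": 0.9, "p3": 0.4},
                    {"p1": 0.8, "p2": 0.3, "p3": 0.6})
print(sorted(fused, key=fused.get, reverse=True))  # fused ranking of photo ids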
Acknowledgements
Task auxiliaries:
Bogdan Boteanu, UPB, Romania & Mihai Lupu, Vienna University of
Technology, Austria
Task supporters:
Alberto Ueda, Bruno Laporais, Felipe Moraes, Lucas Chaves, Jordan
Silva, Marlon Dias, Rafael Glater
Catalin Mitrea, Mihai Dogariu, Liviu Stefan, Gabriel Petrescu, Alexandru
Toma, Alina Banica, Andreea Roxana, Mihaela Radu, Bogdan Guliman,
Sebastian Moraru
Questions & Answers
Thank you!