
Finding Diverse Images at MediaEval 2014

Eleftherios Spyromitros-Xioufis^{1,2}, Symeon Papadopoulos^1, Yiannis Kompatsiaris^1, and Ioannis Vlahavas^2

1 Information Technologies Institute, CERTH, Thessaloniki, Greece
2 Department of Informatics, Aristotle University of Thessaloniki, Greece


Summary
• All runs were produced by a different instantiation (features, parameter configuration) of the ReDiv method.
• The runs (fully automated, no external data):
  – 1 visual-only, using VLAD+CSURF [1] features for relevance and diversity
  – 1 text-only, using BoW features for relevance and diversity
  – 2 visual+textual variations, using early fusion of VLAD+CSURF and BoW features for relevance and VLAD+CSURF features for diversity
• A common criterion for model selection: best F1@20, calculated using leave-one(-location)-out cross-validation on the devset locations.

The ReDiv Method
ReDiv casts the dual relevance and diversity goal of a diversification algorithm into the following optimization problem^a:

$$\arg\max_{S \subseteq I,\ |S| = K} U(S \mid q) = w\,R(S \mid q) + (1 - w)\,D(S), \qquad (1)$$

where $I$ is the initial set of images and $S$ is a $K$-sized subset of $I$ that has maximum utility $U(S \mid q)$, defined as a weighted combination of the relevance and the diversity of $S$.

^a A similar formulation of the problem was used in [2]. In ReDiv, however, we use different definitions for $R(S \mid q)$ and $D(S)$ that we found more suitable for this task.

Relevance
Relevance in [2]: $R(S \mid q) = \sum_{im_i \in S} R(im_i \mid q) = \sum_{im_i \in S} \bigl(1 - d(im_i, im_q)\bigr)$.

This definition can be problematic, especially when one relies only on visual information: dissimilar images can be relevant (e.g. inside views) and vice versa (e.g. people in focus).

[Figure: Wikipedia image of Angkor Wat (left), a relevant inside view (center), and an irrelevant image with a person in front of the monument (right).]

Solution: Learn what is relevant from the ground truth!
A distinct model is built for each location, using the relevant/irrelevant images of the other locations as positive/negative examples. The Wikipedia images of each location are also added to the training set and used as positive examples with a large weight. Thus, in our case, $R(im_i \mid q)$ is the output of a probabilistic classification model.
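The per-location relevance model can be sketched as follows. This is a minimal pure-Python stand-in (batch gradient descent) for the L2-regularized logistic regression named in the Experimental Protocol; the function names, learning rate, regularization strength, and the idea of passing the Wikipedia positives with a large per-example weight are illustrative, not the actual implementation.

```python
import math

def train_relevance_model(X, y, weights, epochs=200, lr=0.1, lam=0.01):
    """Weighted, L2-regularized logistic regression via gradient descent.
    X: list of feature vectors; y: 1 (relevant) / 0 (irrelevant);
    weights: per-example weights -- the query location's Wikipedia images
    would appear here as positives with a large weight (illustrative)."""
    d = len(X[0])
    w = [0.0] * d
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        gw = [lam * wj for wj in w]  # L2 penalty gradient
        gb = 0.0
        for xi, yi, si in zip(X, y, weights):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            err = si * (p - yi)  # weighted logistic-loss gradient term
            for j in range(d):
                gw[j] += err * xi[j]
            gb += err
        w = [wj - lr * gj / n for wj, gj in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def relevance(model, x):
    """R(im|q): predicted probability that image x is relevant."""
    w, b = model
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))
```

In practice one would use an off-the-shelf solver; the point is only that the relevance score fed to ReDiv is a class probability, not a distance to the query image.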

Diversity
Assuming a ranking $im_{r_1}, \ldots, im_{r_K}$ of the images in $S$, [2] defines diversity as:

$$D(S) = \sum_{i=1}^{K} \frac{1}{i} \sum_{j=1}^{i} d(im_{r_i}, im_{r_j})$$ (high average dissimilarity → high diversity)

Problem: Image sets that contain highly similar image pairs (probably belonging to the same cluster) can receive high diversity scores → this negatively impacts Cluster Recall!
Solution: A stricter definition of diversity:

$$D(S) = \min_{im_i, im_j \in S,\ i \neq j} d(im_i, im_j)$$

The diversity of a set $S$ is defined as the dissimilarity between the most similar pair of images in $S$.
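The stricter definition amounts to taking the minimum pairwise dissimilarity over the set; a minimal sketch, using the cosine dissimilarity adopted in the runs (function names are illustrative):

```python
import math
from itertools import combinations

def cosine_distance(a, b):
    """1 - cosine similarity: the dissimilarity measure used in the runs."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def diversity(S):
    """D(S) = min over all pairs of d(im_i, im_j): a set is only as
    diverse as its most similar pair, so near-duplicates are penalized."""
    return min(cosine_distance(a, b) for a, b in combinations(S, 2))
```

A single near-duplicate pair drags the whole score down, which is exactly the behavior that protects Cluster Recall.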

Optimization
The exact optimization of Equation 1 is infeasible! We perform a greedy, approximate optimization: start with an empty set $S$ and sequentially expand it by adding, at each step $J$, the image $im^*$ that scores highest (among the unselected images) according to the criterion:

$$U(im^*) = w\,R(im^*) + (1 - w) \min_{im_j \in S_{J-1}} d(im^*, im_j),$$

where $S_{J-1}$ represents $S$ at step $J - 1$. A less greedy version keeps the $M > 1$ highest scoring image subsets at each step.
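The greedy expansion (the M = 1 case) can be sketched as follows; `rel` and `dist` stand in for the precomputed relevance model and dissimilarity function, and all names are illustrative:

```python
def greedy_diversify(images, rel, dist, K, w):
    """Greedy approximation of Eq. 1 with M = 1: start from the empty set
    and at each step add the unselected image maximizing
    U(im) = w*R(im) + (1-w)*min_{im_j in S} d(im, im_j).
    rel(im) -> relevance score; dist(a, b) -> dissimilarity."""
    S = []
    candidates = list(images)
    while candidates and len(S) < K:
        def utility(im):
            # min over an empty S defaults to 0, so the first pick
            # is simply the most relevant image
            div = min((dist(im, s) for s in S), default=0.0)
            return w * rel(im) + (1 - w) * div
        best = max(candidates, key=utility)
        S.append(best)
        candidates.remove(best)
    return S
```

With w = 0.5, a highly relevant image that near-duplicates an already selected one loses to a less relevant but dissimilar candidate, which is the intended trade-off.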

References
[1] E. Spyromitros-Xioufis, S. Papadopoulos, I. Kompatsiaris, G. Tsoumakas, and I. Vlahavas, "A comprehensive study over VLAD and product quantization in large-scale image retrieval," IEEE Transactions on Multimedia, 2014.
[2] T. Deselaers, T. Gass, P. Dreuw, and H. Ney, "Jointly optimising relevance and diversity in image retrieval," in ACM CIVR '09, New York, USA, 2009.

Experimental Protocol
ReDiv allows using different representations for relevance and diversity.
• To reduce the complexity of the experiments, representations were first evaluated in terms of relevance detection, and only the top performing ones were used for diversity.
• AUC (→ the ability to rank relevant images higher than irrelevant ones) was used to measure relevance detection performance.
• L2-regularized logistic regression was used as the classification algorithm.
• For each combination of relevance detection model and diversity representation, ReDiv was applied with different values for w, n (the number of most relevant images to consider) and M.
• The best performing setup for each type of features (visual, textual, visual+textual) was used to produce the final runs!
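For reference, the selection criterion can be computed as below. This is a sketch under the standard MediaEval definitions: P@20 is the fraction of relevant images in the top 20, CR@20 the fraction of ground-truth clusters represented among them, and F1@20 their harmonic mean; argument names are illustrative.

```python
def p_cr_f1_at_k(ranked, relevant, clusters, k=20):
    """P@k, CR@k and F1@k for one location.
    ranked: image ids in ranked order; relevant: set of relevant ids;
    clusters: maps each relevant id to its ground-truth cluster label."""
    top = ranked[:k]
    hits = [im for im in top if im in relevant]
    p = len(hits) / k
    total_clusters = len(set(clusters.values()))
    cr = len({clusters[im] for im in hits}) / total_clusters
    f1 = 2 * p * cr / (p + cr) if p + cr > 0 else 0.0
    return p, cr, f1
```

Averaging F1@20 over the held-out devset locations gives the model-selection score used to pick w, n and M.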

Runs

Visual (Run 1)
• We tested all precomputed visual features made available by the task organizers as well as our own features.
• Best results were obtained using VLAD+CSURF vectors^a [1] (k = 128, d′ = 128) for both relevance and diversity.
• Cosine distance was used as the dissimilarity measure.
• The parameters used to produce the 1st run are: w = 0.4, n = 75, M = 3.

^a The implementation is publicly available at: https://github.com/socialsensor/multimedia-indexing

Textual (Run 2)
• A parsed version of the Wikipedia page was used in place of the Wikipedia images.
• Flickr images are substituted by a concatenation of the words in their titles (×3), description (×2) and tags (×1).
• A bag-of-words representation with the 20K/7.5K most frequent words was used for the relevance/diversity component.
• Again, cosine distance was used as the dissimilarity measure.
• The parameters used to produce the 2nd run are: w = 0.95, n = 110, M = 1.

Visual+Textual (Runs 3 and 5)
• An early fusion (concatenation) of the visual and textual features used in Runs 1 and 2 was used for relevance.
• The visual features used in Run 1 were used for diversity.
• The parameters used to produce the 3rd run are: w = 0.75, n = 90, M = 5.
• The 5th run differs from the 3rd run only in the value used for n (= 95).
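The weighted concatenation used in the textual run (titles ×3, description ×2, tags ×1) amounts to weighted term counts before the bag-of-words cutoff; a minimal sketch, with an explicit `vocab` argument standing in for the 20K/7.5K most-frequent-word vocabularies (names are illustrative):

```python
from collections import Counter

def weighted_bow(title, description, tags, vocab):
    """Bag-of-words vector for a Flickr image: title words count x3,
    description words x2, tag words x1, as in the textual run.
    vocab fixes the dimensions; a missing word contributes 0."""
    counts = Counter()
    for text, weight in ((title, 3), (description, 2), (tags, 1)):
        for word in text.lower().split():
            counts[word] += weight
    return [counts[w] for w in vocab]
```

Repeating title words effectively triples their weight in the cosine distance, encoding the assumption that titles are the most reliable text field.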

Results

[Figure: F1@20 of the 3 different instantiations of the ReDiv method (Run 1, Run 2, Runs 3 & 5) as a function of w and n; F1@20 ranges roughly from 0.52 to 0.64 for n between 40 and 120.]

                Development Set                  Test Set (official)
Run     AUC^a   P@20    CR@20   F1@20    P@20    CR@20   F1@20
1       0.719   0.815   0.497   0.609    0.775   0.460   0.569
2       0.672   0.863   0.468   0.599    0.832   0.407   0.538
3       0.725   0.855   0.521   0.642    0.817   0.473   0.593
5       0.725   0.857   0.527   0.647    0.815   0.475   0.594
Flickr  0.636   0.833   0.346   0.477    pending pending pending

• Best performance was obtained by the visual+textual runs, followed by the visual-only run.
• Textual features alone are less helpful, but powerful in combination.

^a AUC is calculated on the ranking produced by the relevance detection model!

Acknowledgements
This work is supported by the SocialSensor FP7 project (http://www.socialsensor.eu), partially funded by the EC under contract number 287975.