Metric Inverted - An efficient inverted indexing method for metric spaces

Benjamin Sznajder, Jonathan Mamou, Yosi Mass, Michal Shmueli-Scheuer
IBM Research - Haifa
Presented by: Shai Erera
Motivation

Web 2.0 enables mass multimedia production
– Still, search is limited to manually added metadata
State-of-the-art solutions for CBIR (Content-Based Image Retrieval) do not scale
– They exhibit linear scalability in the collection size, due to the large number of distance computations
Can we use text IR methods to scale up CBIR?
Problem definition

Low-level image features can be generalized to metric spaces
Metric Space: an ordered pair (S, d), where S is a domain and d: S × S → ℝ is a distance function such that
– d satisfies non-negativity, reflexivity, symmetry and the triangle inequality
The best-k results for a query in a metric space are the k objects with the smallest distance to the query
– Convert distances to scores in [0,1] (small distance → high score), as in the sketch below
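A minimal sketch of one possible distance-to-score conversion. The slides do not specify the exact mapping, so the 1/(1 + d) formula below is an illustrative assumption; any monotone map from distances into [0,1] would do.

```python
import math

def euclidean(a, b):
    """A distance function d for one simple metric space (R^n, Euclidean)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def to_score(distance):
    """Map a non-negative distance to a score in (0, 1].

    ASSUMPTION: the slides leave the mapping abstract; 1/(1 + d) is one
    common choice that sends distance 0 to score 1 and large distances
    toward 0.
    """
    return 1.0 / (1.0 + distance)

# Identical points get the maximal score 1.0.
assert to_score(euclidean((0, 0), (0, 0))) == 1.0
```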
Problem definition

Top-K Problem:
– Assume m metric spaces, a query Q, an aggregate function f and score functions sd_i()
– Retrieve the best k objects D with the highest f(sd_1(Q,D), sd_2(Q,D), …, sd_m(Q,D))

[Figure: an example query point q with its k = 5 nearest neighbors]
Metric Inverted Index

Assume a collection of objects, each having m features
– Object D = {F1:v1, F2:v2, …, Fm:vm}
– m metric spaces
Indexing steps:
– Lexicon creation (select candidate terms)
– Invert objects (canonization to lexicon terms)
Metric inverted indexing – Lexicon creation

The number of distinct feature values is too large, so candidate terms must be selected
– Naïve solution: a lexicon of fixed size l
  Select l/m documents at random and extract their features; these l feature values form the lexicon
– Improvement: replace the random choice with clustering (K-Means, etc.), as sketched below
Keep the lexicon in an M-Tree structure
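A minimal sketch of the clustering-based lexicon creation, assuming each feature value is a fixed-length vector and using scikit-learn's KMeans. The slides name K-Means but no implementation, and using the cluster centroids as lexicon terms is an assumption (one could instead keep the actual feature value nearest to each centroid).

```python
import numpy as np
from sklearn.cluster import KMeans

def build_lexicon(feature_values, lexicon_size):
    """Pick lexicon terms for one feature space by clustering.

    feature_values: array of shape (num_objects, dim) holding one
    feature's values across the collection. ASSUMPTION: the cluster
    centroids serve as the lexicon terms for that feature space.
    """
    km = KMeans(n_clusters=lexicon_size, n_init=10, random_state=0)
    km.fit(np.asarray(feature_values))
    return km.cluster_centers_  # shape: (lexicon_size, dim)
```

In the actual system the resulting terms are stored in an M-Tree so that nearest-term lookups during canonization avoid a linear scan of the lexicon.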
Metric inverted indexing – Invert objects

Given object D = {F1:v1, F2:v2, …, Fm:vm}
Canonization – map each feature value (Fi:vi) to lexicon entries:
– For each feature, select the n nearest lexicon terms
– D’ = {F1:v11, F1:v12, …, F1:v1n,
        F2:v21, F2:v22, …, F2:v2n,
        …
        Fm:vm1, Fm:vm2, …, Fm:vmn}
Index D’ in the relevant posting-lists (see the sketch below)
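A minimal sketch of canonization and posting-list construction, assuming NumPy vectors and the lexicon arrays from the earlier sketch. The helper names nearest_terms, index_object and postings are illustrative, not from the slides, and the brute-force lookup stands in for the M-Tree search.

```python
from collections import defaultdict
import numpy as np

def nearest_terms(value, lexicon, n):
    """Indices of the n lexicon terms closest to one feature value.

    Brute-force Euclidean scan for clarity; the slides keep the lexicon
    in an M-Tree, which answers this lookup without a full scan.
    """
    dists = np.linalg.norm(lexicon - value, axis=1)
    return np.argsort(dists)[:n]

def index_object(doc_id, features, lexicons, n, postings):
    """Canonize object D into D' and add it to the posting-lists.

    features: {feature_name: value_vector}
    lexicons: {feature_name: lexicon array}
    postings: {(feature_name, term_idx): [doc ids]}
    """
    for fname, value in features.items():
        for t in nearest_terms(value, lexicons[fname], n):
            postings[(fname, int(t))].append(doc_id)

postings = defaultdict(list)  # the metric inverted index
```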
Retrieval stage – Term selection

Given query Q = {F1:qv1, F2:qv2, …, Fm:qvm}, apply the same canonization:
– For each feature, select the n nearest lexicon terms
– Q’ = {F1:qv11, F1:qv12, …, F1:qv1n,
        F2:qv21, F2:qv22, …, F2:qv2n,
        …
        Fm:qvm1, Fm:qvm2, …, Fm:qvmn}
Retrieval stage – Boolean filtering

These m·n posting-lists are queried via a Boolean query
Two possible modes (see the sketch below):
– Strict-query-mode:
– Fuzzy-query-mode:
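The definitions of the two modes did not survive extraction from the slides, so the sketch below encodes one plausible reading, stated as an assumption: strict mode keeps a document only if it matched at least one query term in every feature space, while fuzzy mode keeps any document that matched at least one query term anywhere.

```python
def boolean_filter(query_terms, postings, num_features, strict=False):
    """Candidate filtering over the m*n posting-lists of Q'.

    query_terms: iterable of (feature_name, term_idx) pairs from the
    query's canonization. ASSUMPTION: strict-query-mode = document
    matches some term in every feature space (AND across features);
    fuzzy-query-mode = document matches some term in any feature
    space (OR). The original slide's definitions are missing.
    """
    hits = {}  # doc_id -> set of feature names it matched
    for fname, term in query_terms:
        for doc_id in postings.get((fname, term), ()):
            hits.setdefault(doc_id, set()).add(fname)
    if strict:
        return {d for d, feats in hits.items() if len(feats) == num_features}
    return set(hits)  # fuzzy: any match suffices
```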
Retrieval stage – Scoring

Documents retrieved by the Boolean query are fully scored
Return the best k objects with the highest aggregate score f(sd_1(Q,D), sd_2(Q,D), …, sd_m(Q,D))
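A minimal sketch of the final scoring pass over the filtered candidates. The slides leave both the aggregate f and the score functions sd_i abstract, so sum and the earlier 1/(1 + d) mapping are assumptions here; distance_fns and collection are illustrative names.

```python
import heapq

def top_k(query, candidates, collection, distance_fns, k, aggregate=sum):
    """Fully score the Boolean-filtered candidates, return the best k.

    query: {feature_name: value}; collection: {doc_id: {feature_name:
    value}}; distance_fns: {feature_name: metric d_i}. ASSUMPTION:
    f = sum and sd_i(Q, D) = 1 / (1 + d_i(Q, D)), as in the earlier
    sketches.
    """
    def full_score(doc_id):
        feats = collection[doc_id]
        return aggregate(1.0 / (1.0 + distance_fns[f](query[f], feats[f]))
                         for f in query)

    return heapq.nlargest(k, candidates, key=full_score)
```

Only the candidates that survived Boolean filtering pay the full m distance computations, which is where the method saves work over an exhaustive scan.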
Experiments

Focus on:
– Efficiency
– Effectiveness
Collection of 160,000 images from Flickr
3 features are extracted from each image:
– EdgeHistogram, ScalableColor and ColorLayout
180 queries
– Fuzzy-query-mode
– Sampled from the collection of images
Compared against the M-Tree data structure
Experiments – Measures Used

Effectiveness: MAP is a natural candidate for measurement
– Problem: in image retrieval, no document is irrelevant
– Solution: we define as relevant the k highest-scored documents in the collection (according to the M-Tree computation)
– MAP@K: MAP computed on relevant and retrieved lists of size k
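A minimal sketch of MAP@K under the slide's definition, where the relevant set for each query is the k highest-scored documents from the exhaustive M-Tree run. Normalization conventions for AP@K vary; dividing by min(k, |relevant|) is the assumption used here (with |relevant| = k it reduces to dividing by k).

```python
def average_precision_at_k(retrieved, relevant, k):
    """AP@K for one query: precision at each rank where a relevant
    document appears, averaged over min(k, |relevant|)."""
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(retrieved[:k], start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / min(k, len(relevant)) if relevant else 0.0

def map_at_k(runs, k):
    """MAP@K over all queries; runs: list of (retrieved list, relevant set)."""
    return sum(average_precision_at_k(r, rel, k) for r, rel in runs) / len(runs)
```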
Experiments – Measures Used (contd.)

Efficiency: we count the number of computations per query
– A computation unit (cu) is one distance-computation call between two feature values
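A minimal sketch of how computation units could be counted in practice: wrap each feature's distance function so that every call costs one cu. The CountingMetric name is illustrative, and the usage reuses the euclidean sketch from earlier.

```python
class CountingMetric:
    """Wraps a distance function and counts computation units (cu):
    each call between two feature values costs one cu."""

    def __init__(self, fn):
        self.fn, self.cu = fn, 0

    def __call__(self, a, b):
        self.cu += 1
        return self.fn(a, b)

# Usage: wrap each feature's metric, run a query, then read off .cu.
d = CountingMetric(euclidean)
d((0.0, 0.0), (3.0, 4.0))
print(d.cu)  # 1 computation unit
```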
Effectiveness

[Figure: MAP vs. number of nearest terms, lexicon size = 12,000; x-axis: # nearest terms (5 to 30), y-axis: MAP (0.3 to 1.0); one curve each for K = 10, 20, 30]
Effectiveness

[Figure: MAP vs. lexicon size, number of nearest terms = 30; x-axis: lexicon size (3,000 to 48,000), y-axis: MAP (0.8 to 1.0); one curve each for K = 10, 20, 30]
Effectiveness vs. Efficiency

[Figure: MAP vs. number of comparisons, number of nearest terms = 30; x-axis: # comparisons (0 to 200,000), y-axis: MAP (0.86 to 1.0); one curve per lexicon size: 3,000, 12,000, 24,000, 48,000]
M-Tree vs. Metric Inverted

[Figure: number of comparisons vs. top-k, number of nearest terms = 30; x-axis: top-k (5 to 40), y-axis: # comparisons (0 to 500,000); one curve for the M-Tree and one per MII lexicon size: 3,000, 12,000, 24,000, 48,000]