TRANSCRIPT
Multimedia Information Retrieval in Networked Digital Libraries
Norbert Fuhr
University of Duisburg-Essen, Germany
MIND Project
• Task: Development of methods for accessing large numbers of multimedia digital libraries through a single system
• Funded by the EC under FP5 (2/01–7/03)
• Participants:
– Carnegie Mellon University
– University of Duisburg-Essen
– University of Florence
– University of Sheffield
– University of Strathclyde
MIND Architecture
[Architecture diagram: the User Interface passes queries to a Data Fuser and a Dispatcher; the Dispatcher routes each query to Proxies 1–4, which communicate through Wrappers 1–4 with the underlying digital libraries DL1–DL4, offering text, image, fact, and speech data.]
Query Processing
• Query transformation: map between heterogeneous schemas (Dublin Core, MARC 21)
• Resource selection: determine relevant libraries (more cost-effective than querying all libraries)
• Database query: run the query against each selected library (task of the “wrappers”)
• Document transformation: map between heterogeneous schemas (Dublin Core, MARC 21)
• Data fusion: fuse library results into one single ranked list or several clusters
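The five processing steps above can be sketched end to end. This is an illustrative toy only: the class and function names (Library, select_resources, fuse) and the term-overlap scoring are assumptions, not the MIND project's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Library:
    name: str
    docs: dict = field(default_factory=dict)  # doc_id -> text

    def search(self, query):
        # wrapper step: score documents by term overlap with the query
        terms = set(query.split())
        scored = [(doc_id, len(terms & set(text.split())))
                  for doc_id, text in self.docs.items()]
        return sorted([x for x in scored if x[1] > 0], key=lambda x: -x[1])

def select_resources(query, libraries, top_k=2):
    # resource selection: keep the libraries whose documents best match the query
    terms = set(query.split())
    def overlap(lib):
        return sum(len(terms & set(t.split())) for t in lib.docs.values())
    return sorted(libraries, key=overlap, reverse=True)[:top_k]

def fuse(result_lists):
    # data fusion: merge the per-library rankings into one list by score
    return sorted((item for results in result_lists for item in results),
                  key=lambda x: -x[1])

libs = [Library("DL1", {"d1": "image retrieval", "d2": "speech recognition"}),
        Library("DL2", {"d3": "image indexing"}),
        Library("DL3", {"d4": "opera recordings"})]
selected = select_resources("image retrieval", libs)
fused = fuse([lib.search("image retrieval") for lib in selected])
print(fused)  # [('d1', 2), ('d3', 1)]
```

Schema mapping is omitted here; in MIND each proxy would additionally translate the query and the returned documents between metadata schemas.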
Retrieval in Federated Digital Libraries
Tasks:
• Extract DL metadata (“resource description”)
• Select relevant libraries (“resource selection”)
• Communicate with libraries (“wrapper”)
• Combine results (“data fusion”)
Resource Description
Resource Description: Query-based Sampling
• For non-cooperative DLs
• Generate a sequence of queries
• Collect answer documents (~300)
• Generate the resource description from the sampled documents
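Query-based sampling can be sketched as a loop that probes the library's search interface and grows its query pool from the answers. The search interface and term-selection strategy here are assumptions for illustration.

```python
import random
from collections import Counter

def query_based_sample(search, seed_terms, target=300, seed=0):
    """Probe a non-cooperative DL with single-term queries and pool the answers."""
    rng = random.Random(seed)
    sample = {}                                        # doc_id -> text
    pool = list(seed_terms)
    while len(sample) < target and pool:
        term = pool.pop(rng.randrange(len(pool)))      # next probe query
        for doc_id, text in search(term):              # collect answer documents
            if doc_id not in sample:
                sample[doc_id] = text
                pool.extend(text.split())              # new probe terms from answers
    return sample

def resource_description(sample):
    # term statistics over the sample stand in for the full DL
    return Counter(w for text in sample.values() for w in text.split())

# toy non-cooperative DL, reachable only through a search call
corpus = {"d1": "jazz concert", "d2": "jazz opera", "d3": "opera house"}
def search(term):
    return [(i, t) for i, t in corpus.items() if term in t.split()]

sample = query_based_sample(search, ["jazz"], target=3)
desc = resource_description(sample)
```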
Resource Description for Images
• Feature vectors are clustered at different levels of granularity
• For each cluster we consider:
– Cluster centre c_i
– Cluster radius r_i
– Cluster population P_i
• The smaller the cluster radii, the finer the approximation of the feature-vector density.
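A minimal sketch of such a cluster-based description, with a simple k-means standing in for whatever clustering MIND actually used; each cluster is summarised by its (c_i, r_i, P_i) triple:

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(vs):
    return tuple(sum(col) / len(vs) for col in zip(*vs))

def kmeans(vectors, k, iters=20, seed=0):
    centres = random.Random(seed).sample(vectors, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in vectors:
            groups[min(range(k), key=lambda j: dist(v, centres[j]))].append(v)
        centres = [mean(g) if g else centres[i] for i, g in enumerate(groups)]
    return centres, groups

def describe(vectors, k):
    # one (centre c_i, radius r_i, population P_i) triple per cluster
    centres, groups = kmeans(vectors, k)
    return [(c, max((dist(v, c) for v in g), default=0.0), len(g))
            for c, g in zip(centres, groups)]

vectors = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.9, 1.1)]
clusters = describe(vectors, k=2)
```

Increasing k gives smaller radii and hence the finer density approximation mentioned above.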
Demo: Resource Descriptions for Images
• Java application demonstrating the resource-description process for images:
– Selection of the resource to be described
– Acquisition of individual document descriptors
– Processing of document descriptors and extraction of resource descriptors at several granularity levels
Resource selection
Resource Selection
• Querying all accessible DLs is too expensive
– Thus route the query only to the “best” libraries
• Two competing approaches:
– Resource ranking:
• Compute the similarity of each DL to the query (heuristically)
• Select a fixed number of top-ranked DLs
– Decision-theoretic framework (described in this talk):
• Assign costs to retrieval (money, time, retrieval quality)
• Compute the selection that minimises costs
• The number of selected DLs is not fixed
Multimedia Retrieval Model
• Query conditions c (attribute, predicate, comparison value) with weight Pr(q ← c)
– E.g. term, year number, colour histogram
• Probabilistic indexing weights Pr(c ← d)
– E.g. BM25 for text, histogram similarity for images
• Linear retrieval function:

Pr(q ← d) = Σ_{c_i ∈ q} Pr(q ← c_i) · Pr(c_i ← d)

where Pr(q ← c_i) is the condition weight and Pr(c_i ← d) the indexing weight.

• Mapping onto Pr(rel|q,d):
– Linear/logistic function
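The linear retrieval function is a plain weighted sum; a minimal sketch (the example condition names and weights are made up):

```python
def score(condition_weights, indexing_weights):
    """Pr(q<-d) = sum over query conditions c of Pr(q<-c) * Pr(c<-d)."""
    return sum(w * indexing_weights.get(c, 0.0)
               for c, w in condition_weights.items())

# query: term "jazz" weighted 0.7, a year condition weighted 0.3;
# the document only matches "jazz", so the score is 0.7*0.5 + 0.3*0.0
print(score({"jazz": 0.7, "year=1990": 0.3},
            {"jazz": 0.5, "opera": 0.9}))  # ~0.35
```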
Probability of Relevance
• Probability of relevance computed based on the probability of implication (score):

Pr(rel|q,d) = f(Pr(q ← d))

– Linear function with constant Pr(rel|q ← d)
– New approach: logistic function

f(x) := exp(b0 + b1·x) / (1 + exp(b0 + b1·x))

• Evaluation: better approximation quality than the linear function

[Plot: f(x) = Pr(rel|q,d) against x = Pr(q ← d) for the parameter settings b0=−20, b1=50; b0=−10, b1=100; b0=−4, b1=10]
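The logistic mapping is straightforward to evaluate; here it is with the three parameter pairs from the plot, probed at an arbitrary score x = 0.4:

```python
import math

def f(x, b0, b1):
    # logistic mapping from score x = Pr(q<-d) to probability of relevance
    z = b0 + b1 * x
    return math.exp(z) / (1 + math.exp(z))

for b0, b1 in [(-20, 50), (-10, 100), (-4, 10)]:
    print(f"b0={b0}, b1={b1}: f(0.4) = {f(0.4, b0, b1):.3f}")
```

b1 controls how sharply the S-curve rises; b0 shifts where it crosses 0.5.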
Cost Sources
• Computation and communication time
– Affine linear function
• Charges for delivery
– Linear function
• Retrieval quality (most interesting for IR)
– Costs C+ for relevant and C− for non-relevant documents, with C+ < C−
– Estimating retrieval quality: for a result set of s_i documents, estimate the number r_i of relevant documents it contains
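Putting the cost sources together, a sketch of the decision-theoretic idea: compute the expected cost of a result set from a library and prefer the cheapest. The cost constants and the "pick one library" simplification are illustrative; the framework actually optimises a selection over several libraries.

```python
def expected_cost(r_i, s_i, c_plus=0.0, c_minus=1.0, time_per_doc=0.05):
    # retrieval-quality cost: relevant docs cost C+ (< C-), non-relevant C-,
    # plus an illustrative affine time/charge component
    return r_i * c_plus + (s_i - r_i) * c_minus + time_per_doc * s_i

# estimated (r_i, s_i) per library; select the library with minimum expected cost
estimates = {"DL1": (8, 10), "DL2": (3, 10), "DL3": (6, 10)}
best = min(estimates, key=lambda dl: expected_cost(*estimates[dl]))
print(best)  # DL1
```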
Estimating Retrieval Quality – Method 1: Relevant Documents in DL
• Resource description:
– Expectation of indexing weights for terms (images, facts: vector clusters, retrieval function)
• Resource selection:
– Estimate the number of relevant documents in DL_i
– Estimate the number r_i of relevant documents in the result set
• Apply a recall-precision function
– Approximated by a linear function (defined by its starting point P(0))

[Plot: linear recall-precision functions for starting points P(0)=0.9 and P(0)=0.5]
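Method 1 can be worked through concretely. Assuming the linear recall-precision function falls from P(0) at recall 0 to 0 at recall 1 (the slide only fixes the starting point, so this exact form is an assumption), r follows from precision = r/s and recall = r/E:

```python
def expected_relevant(E, s, p0):
    """Solve precision = r/s, recall = r/E, precision = p0*(1 - recall) for r.

    E: estimated number of relevant documents in the DL
    s: size of the requested result set
    """
    # r/s = p0*(1 - r/E)  =>  r = p0*s*E / (E + p0*s)
    return p0 * s * E / (E + p0 * s)

print(round(expected_relevant(E=100, s=10, p0=0.9), 2))  # 8.26
```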
Estimating Retrieval Quality – Method 2: Simulated Retrieval on Sample
• Resource description:
– Store the complete index of the sample
• Resource selection:
– Simulate retrieval on the indexed sample (for all media types)
– Derive the distribution of probabilities of relevance
• Assumed to be representative of the whole collection
– Estimate the number r_i of relevant documents in the result set

[Plots: density p(x) of probabilities of relevance for the sample and, extrapolated, for the whole DL]
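The extrapolation step of Method 2 can be sketched as follows: each sample document stands for dl_size/|sample| documents of the full DL, and the expected number of relevant documents among the top s is the sum of their extrapolated relevance probabilities. This is a simplified reading of the method, not MIND's exact estimator.

```python
def estimate_relevant(sample_probs, dl_size, s):
    """Extrapolate sample relevance probabilities to the whole DL and
    sum the probabilities of the top-s documents."""
    per_doc = dl_size / len(sample_probs)  # docs each sample doc stands for
    remaining, r = s, 0.0
    for p in sorted(sample_probs, reverse=True):
        take = min(per_doc, remaining)
        r += take * p
        remaining -= take
        if remaining <= 0:
            break
    return r

# 3 sampled docs represent a DL of 300; request the top 150
print(estimate_relevant([0.9, 0.5, 0.1], dl_size=300, s=150))  # ~115
```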
Estimating Retrieval Quality – Method 3: Modelling Indexing Weights
• Resource description:
– Expectation/variance of indexing weights
• Resource selection:
– Approximate the indexing-weight distribution
• New: normal distribution
– Document-score distribution
• Also a normal distribution
– Proceed as for Method 2
– To do: other media types

[Plots: normal-distribution fits of the indexing weights and of the document scores for query 51, compared with the actual collection frequencies]
Estimating Retrieval Quality
• Methods:
1. Estimate the number of relevant documents in the DL, apply a linear recall-precision function
2. Simulate retrieval on the sample, extrapolate to the DL
3. Normal distribution for indexing weights, normal distribution for retrieval scores
• Results:
– Use the logistic mapping from retrieval score to probability of relevance
– Method 3 ≈ CORI > Method 1 > Method 2
Data fusion
Data Fusion
[Figure: example fused result list interleaving text, speech, and image documents at ranks 1–10]
• Rank data
– Rank
– Score
– Surrogates
– Duplicates across ranks
• Full document
• Broad information
– Collection information
– User preferences
Text Data Fusion
• Using a combination of evidence:
– Original rank position of documents
– Re-ranking based on the similarity of the surrogate to the query
• The surrogate could be a title or a text summary
– Promotion of documents found to be similar to others in the rankings
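Combining rank evidence with surrogate similarity might be sketched like this; the Jaccard similarity, the 1/rank transform, and the weights are illustrative choices, not MIND's actual fusion formula:

```python
def fused_score(rank, surrogate, query, w_rank=0.3, w_sim=0.7):
    # combine the original rank position with surrogate/query similarity
    q, s = set(query.split()), set(surrogate.split())
    sim = len(q & s) / len(q | s) if q | s else 0.0  # Jaccard similarity
    return w_rank * (1.0 / rank) + w_sim * sim

results = [(1, "opera house history"), (2, "jazz opera recordings")]
reranked = sorted(results,
                  key=lambda x: -fused_score(x[0], x[1], "jazz opera"))
print(reranked[0][1])  # the rank-2 document is promoted: jazz opera recordings
```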
Speech Data Fusion
• SpeechBot has a 50% word error rate on average
– 17.56% in the 300 top-ranked documents
– No need to treat speech differently for fusion

Spoken: “Speech is spoken and when recognised word errors occur”
Recognised: “Peaches broken and when recognised word errors occur”
Image Data Fusion – Score Normalisation
[Diagram: the data fuser (DF) sends the query to each DL search engine; the un-normalised scores s returned by the engines (e.g. s=0.8, 0.6, 0.5, 0.3, 0.2) are mapped to normalised scores σ (e.g. σ=0.9, 0.8, 0.8, 0.5, 0.4) before fusion]
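One common way to make engine scores comparable before fusion is min-max normalisation, sketched below. Note this is a stand-in: the σ values in the diagram above show MIND used a different mapping.

```python
def normalise(scores):
    # min-max normalisation onto [0, 1]
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

raw = [0.8, 0.6, 0.5, 0.3, 0.2]  # un-normalised engine scores from the diagram
print([round(x, 2) for x in normalise(raw)])  # [1.0, 0.67, 0.5, 0.17, 0.0]
```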
Image Data Fusion
Presentation of retrieval results
2D Information Display Space (IDS)
• Presenting multiple properties of documents simultaneously:
– Each document is represented by a visual object (VO)
– Visual features of each VO encode relevant properties of the documents
• Position, size, shape, colour
– Documents sharing the same value of one relevant property (selected by the user) are displayed with the same value of the corresponding visual feature
Sample Query Session
[Two slides of screenshots]
MMIR in Networked DLs – Open Issues
• Vague schema mappings for heterogeneous environments
• Cross-media searches
• Resource selection for non-textual media
• Decentralized (P2P) architectures
JXTA Search Architecture
Two-level architecture:
1. Search hubs
2. Provider peers
Conclusions and Outlook
• MIND provides methods for
– resource description,
– resource selection, and
– result fusion
in federated multimedia DLs
• Open issues:
– Heterogeneous, cross-media environments
– Time-dependent media
– Decentralized (P2P) architectures