TRANSCRIPT
Multimedia Information Retrieval in Networked Digital Libraries
Norbert Fuhr
University of Duisburg-Essen, Germany
MIND Project
• Task: Development of methods for accessing large numbers of multimedia digital libraries through a single system
• Funded by the EC under FP5 (2/01–7/03)
• Participants:
– Carnegie Mellon University
– University of Duisburg-Essen
– University of Florence
– University of Sheffield
– University of Strathclyde
MIND Architecture
[Architecture diagram: the User Interface passes queries to a Data Fuser and a Dispatcher; the Dispatcher routes each query to Proxies 1–4, which communicate through Wrappers 1–4 with the underlying digital libraries DL1–DL4, offering text, image, fact, and speech data.]
Query Processing
• Query transformation: map between heterogeneous schemas (Dublin Core, MARC 21)
• Resource selection: determine relevant libraries (more cost-effective than querying all libraries)
• Database query: run the query against each selected library (task of the “wrappers”)
• Document transformation: map between heterogeneous schemas (Dublin Core, MARC 21)
• Data fusion: fuse library results into one single ranked list or several clusters
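The five processing steps above can be sketched end to end. This is an illustrative toy only: the class and function names (Library, select_resources, fuse) and the term-overlap scoring are assumptions, not the MIND project's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Library:
    name: str
    docs: dict = field(default_factory=dict)  # doc_id -> text

    def search(self, query):
        # wrapper step: score documents by term overlap with the query
        terms = set(query.split())
        scored = [(doc_id, len(terms & set(text.split())))
                  for doc_id, text in self.docs.items()]
        return sorted([x for x in scored if x[1] > 0], key=lambda x: -x[1])

def select_resources(query, libraries, top_k=2):
    # resource selection: keep the libraries whose documents best match the query
    terms = set(query.split())
    def overlap(lib):
        return sum(len(terms & set(t.split())) for t in lib.docs.values())
    return sorted(libraries, key=overlap, reverse=True)[:top_k]

def fuse(result_lists):
    # data fusion: merge the per-library rankings into one list by score
    return sorted((item for results in result_lists for item in results),
                  key=lambda x: -x[1])

libs = [Library("DL1", {"d1": "image retrieval", "d2": "speech recognition"}),
        Library("DL2", {"d3": "image indexing"}),
        Library("DL3", {"d4": "opera recordings"})]
selected = select_resources("image retrieval", libs)
fused = fuse([lib.search("image retrieval") for lib in selected])
print(fused)  # [('d1', 2), ('d3', 1)]
```

Schema mapping is omitted here; in MIND each proxy would additionally translate the query and the returned documents between metadata schemas.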
Retrieval in Federated Digital Libraries
Tasks:
• Extract DL metadata (“resource description”)
• Select relevant libraries (“resource selection”)
• Communicate with libraries (“wrapper”)
• Combine results (“data fusion”)
Resource Description
Resource Description: Query-based Sampling
• For non-cooperative DLs
• Generate a sequence of queries
• Collect answer documents (~300)
• Generate the resource description from the sampled documents
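Query-based sampling can be sketched as a loop that probes the library's search interface and grows its query pool from the answers. The search interface and term-selection strategy here are assumptions for illustration.

```python
import random
from collections import Counter

def query_based_sample(search, seed_terms, target=300, seed=0):
    """Probe a non-cooperative DL with single-term queries and pool the answers."""
    rng = random.Random(seed)
    sample = {}                                        # doc_id -> text
    pool = list(seed_terms)
    while len(sample) < target and pool:
        term = pool.pop(rng.randrange(len(pool)))      # next probe query
        for doc_id, text in search(term):              # collect answer documents
            if doc_id not in sample:
                sample[doc_id] = text
                pool.extend(text.split())              # new probe terms from answers
    return sample

def resource_description(sample):
    # term statistics over the sample stand in for the full DL
    return Counter(w for text in sample.values() for w in text.split())

# toy non-cooperative DL, reachable only through a search call
corpus = {"d1": "jazz concert", "d2": "jazz opera", "d3": "opera house"}
def search(term):
    return [(i, t) for i, t in corpus.items() if term in t.split()]

sample = query_based_sample(search, ["jazz"], target=3)
desc = resource_description(sample)
```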
Resource Description for Images
• Feature vectors are clustered at different levels of granularity
• For each cluster we consider:
– Cluster centre c_i
– Cluster radius r_i
– Cluster population P_i
• The smaller the cluster radii, the finer the approximation of the feature-vector density.
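A minimal sketch of such a cluster-based description, with a simple k-means standing in for whatever clustering MIND actually used; each cluster is summarised by its (c_i, r_i, P_i) triple:

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(vs):
    return tuple(sum(col) / len(vs) for col in zip(*vs))

def kmeans(vectors, k, iters=20, seed=0):
    centres = random.Random(seed).sample(vectors, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in vectors:
            groups[min(range(k), key=lambda j: dist(v, centres[j]))].append(v)
        centres = [mean(g) if g else centres[i] for i, g in enumerate(groups)]
    return centres, groups

def describe(vectors, k):
    # one (centre c_i, radius r_i, population P_i) triple per cluster
    centres, groups = kmeans(vectors, k)
    return [(c, max((dist(v, c) for v in g), default=0.0), len(g))
            for c, g in zip(centres, groups)]

vectors = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.9, 1.1)]
clusters = describe(vectors, k=2)
```

Increasing k gives smaller radii and hence the finer density approximation mentioned above.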
Demo: Resource Descriptions for Images
• Java application demonstrating the resource-description process for images:
– Selection of the resource to be described
– Acquisition of individual document descriptors
– Processing of document descriptors and extraction of resource descriptors at several granularity levels
Resource selection
Resource Selection
• Querying all accessible DLs is too expensive
– Thus route the query only to the “best” libraries
• Two competing approaches:
– Resource ranking:
• Compute the similarity of each DL to the query (heuristically)
• Select a fixed number of top-ranked DLs
– Decision-theoretic framework (described in this talk):
• Assign costs to retrieval (money, time, retrieval quality)
• Compute the selection that minimises costs
• The number of selected DLs is not fixed
Multimedia Retrieval Model
• Query conditions c (attribute, predicate, comparison value) with weight Pr(q ← c)
– E.g. term, year number, colour histogram
• Probabilistic indexing weights Pr(c ← d)
– E.g. BM25 for text, histogram similarity for images
• Linear retrieval function:

Pr(q ← d) = Σ_{c_i ∈ q} Pr(q ← c_i) · Pr(c_i ← d)

where Pr(q ← c_i) is the condition weight and Pr(c_i ← d) the indexing weight.

• Mapping onto Pr(rel|q,d):
– Linear/logistic function
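The linear retrieval function is a plain weighted sum; a minimal sketch (the example condition names and weights are made up):

```python
def score(condition_weights, indexing_weights):
    """Pr(q<-d) = sum over query conditions c of Pr(q<-c) * Pr(c<-d)."""
    return sum(w * indexing_weights.get(c, 0.0)
               for c, w in condition_weights.items())

# query: term "jazz" weighted 0.7, a year condition weighted 0.3;
# the document only matches "jazz", so the score is 0.7*0.5 + 0.3*0.0
print(score({"jazz": 0.7, "year=1990": 0.3},
            {"jazz": 0.5, "opera": 0.9}))  # ~0.35
```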
Probability of Relevance
• Probability of relevance computed based on the probability of implication (score):

Pr(rel|q,d) = f(Pr(q ← d))

– Linear function with constant Pr(rel|q ← d)
– New approach: logistic function

f(x) := exp(b0 + b1·x) / (1 + exp(b0 + b1·x))

• Evaluation: better approximation quality than the linear function

[Plot: f(x) = Pr(rel|q,d) against x = Pr(q ← d) for the parameter settings b0=−20, b1=50; b0=−10, b1=100; b0=−4, b1=10]
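The logistic mapping is straightforward to evaluate; here it is with the three parameter pairs from the plot, probed at an arbitrary score x = 0.4:

```python
import math

def f(x, b0, b1):
    # logistic mapping from score x = Pr(q<-d) to probability of relevance
    z = b0 + b1 * x
    return math.exp(z) / (1 + math.exp(z))

for b0, b1 in [(-20, 50), (-10, 100), (-4, 10)]:
    print(f"b0={b0}, b1={b1}: f(0.4) = {f(0.4, b0, b1):.3f}")
```

b1 controls how sharply the S-curve rises; b0 shifts where it crosses 0.5.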
Cost Sources
• Computation and communication time
– Affine linear function
• Charges for delivery
– Linear function
• Retrieval quality (most interesting for IR)
– Costs C+ for relevant and C− for non-relevant documents, with C+ < C−
– Estimating retrieval quality: for a result set of s_i documents, estimate the number r_i of relevant documents it contains
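Putting the cost sources together, a sketch of the decision-theoretic idea: compute the expected cost of a result set from a library and prefer the cheapest. The cost constants and the "pick one library" simplification are illustrative; the framework actually optimises a selection over several libraries.

```python
def expected_cost(r_i, s_i, c_plus=0.0, c_minus=1.0, time_per_doc=0.05):
    # retrieval-quality cost: relevant docs cost C+ (< C-), non-relevant C-,
    # plus an illustrative affine time/charge component
    return r_i * c_plus + (s_i - r_i) * c_minus + time_per_doc * s_i

# estimated (r_i, s_i) per library; select the library with minimum expected cost
estimates = {"DL1": (8, 10), "DL2": (3, 10), "DL3": (6, 10)}
best = min(estimates, key=lambda dl: expected_cost(*estimates[dl]))
print(best)  # DL1
```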
Estimating Retrieval Quality – Method 1: Relevant Documents in DL
• Resource description:
– Expectation of indexing weights for terms (images, facts: vector clusters, retrieval function)
• Resource selection:
– Estimate the number of relevant documents in DL_i
– Estimate the number r_i of relevant documents in the result set
• Apply a recall-precision function
– Approximated by a linear function (defined by its starting point P(0))

[Plot: linear recall-precision functions for starting points P(0)=0.9 and P(0)=0.5]
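Method 1 can be worked through concretely. Assuming the linear recall-precision function falls from P(0) at recall 0 to 0 at recall 1 (the slide only fixes the starting point, so this exact form is an assumption), r follows from precision = r/s and recall = r/E:

```python
def expected_relevant(E, s, p0):
    """Solve precision = r/s, recall = r/E, precision = p0*(1 - recall) for r.

    E: estimated number of relevant documents in the DL
    s: size of the requested result set
    """
    # r/s = p0*(1 - r/E)  =>  r = p0*s*E / (E + p0*s)
    return p0 * s * E / (E + p0 * s)

print(round(expected_relevant(E=100, s=10, p0=0.9), 2))  # 8.26
```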
Estimating Retrieval Quality – Method 2: Simulated Retrieval on Sample
• Resource description:
– Store the complete index of the sample
• Resource selection:
– Simulate retrieval on the indexed sample (for all media types)
– Derive the distribution of probabilities of relevance
• Assumed to be representative of the whole collection
– Estimate the number r_i of relevant documents in the result set

[Plots: density p(x) of probabilities of relevance for the sample and, extrapolated, for the whole DL]
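The extrapolation step of Method 2 can be sketched as follows: each sample document stands for dl_size/|sample| documents of the full DL, and the expected number of relevant documents among the top s is the sum of their extrapolated relevance probabilities. This is a simplified reading of the method, not MIND's exact estimator.

```python
def estimate_relevant(sample_probs, dl_size, s):
    """Extrapolate sample relevance probabilities to the whole DL and
    sum the probabilities of the top-s documents."""
    per_doc = dl_size / len(sample_probs)  # docs each sample doc stands for
    remaining, r = s, 0.0
    for p in sorted(sample_probs, reverse=True):
        take = min(per_doc, remaining)
        r += take * p
        remaining -= take
        if remaining <= 0:
            break
    return r

# 3 sampled docs represent a DL of 300; request the top 150
print(estimate_relevant([0.9, 0.5, 0.1], dl_size=300, s=150))  # ~115
```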
Estimating Retrieval Quality – Method 3: Modelling Indexing Weights
• Resource description:
– Expectation/variance of indexing weights
• Resource selection:
– Approximate the indexing-weight distribution
• New: normal distribution
– Document-score distribution
• Also a normal distribution
– Proceed as for Method 2
– To do: other media types

[Plots: normal-distribution fits of the indexing weights and of the document scores for query 51, compared with the actual collection frequencies]
Estimating Retrieval Quality
• Methods:
1. Estimate the number of relevant documents in the DL, apply a linear recall-precision function
2. Simulate retrieval on the sample, extrapolate to the DL
3. Normal distribution for indexing weights, normal distribution for retrieval scores
• Results:
– Use the logistic mapping from retrieval score to probability of relevance
– Method 3 ≈ CORI > Method 1 > Method 2
Data fusion
Data Fusion
[Figure: example fused result list interleaving text, speech, and image documents at ranks 1–10]
• Rank data
– Rank
– Score
– Surrogates
– Duplicates across ranks
• Full document
• Broad information
– Collection information
– User preferences
Text Data Fusion
• Using a combination of evidence:
– Original rank position of documents
– Re-ranking based on the similarity of the surrogate to the query
• The surrogate could be a title or a text summary
– Promotion of documents found to be similar to others in the rankings
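Combining rank evidence with surrogate similarity might be sketched like this; the Jaccard similarity, the 1/rank transform, and the weights are illustrative choices, not MIND's actual fusion formula:

```python
def fused_score(rank, surrogate, query, w_rank=0.3, w_sim=0.7):
    # combine the original rank position with surrogate/query similarity
    q, s = set(query.split()), set(surrogate.split())
    sim = len(q & s) / len(q | s) if q | s else 0.0  # Jaccard similarity
    return w_rank * (1.0 / rank) + w_sim * sim

results = [(1, "opera house history"), (2, "jazz opera recordings")]
reranked = sorted(results,
                  key=lambda x: -fused_score(x[0], x[1], "jazz opera"))
print(reranked[0][1])  # the rank-2 document is promoted: jazz opera recordings
```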
Speech Data Fusion
• SpeechBot has a 50% word error rate on average
– 17.56% in the 300 top-ranked documents
– No need to treat speech differently for fusion

Spoken: “Speech is spoken and when recognised word errors occur”
Recognised: “Peaches broken and when recognised word errors occur”
Image Data Fusion – Score Normalisation
[Diagram: the data fuser (DF) sends the query to each DL search engine; the un-normalised scores s returned by the engines (e.g. s=0.8, 0.6, 0.5, 0.3, 0.2) are mapped to normalised scores σ (e.g. σ=0.9, 0.8, 0.8, 0.5, 0.4) before fusion]
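One common way to make engine scores comparable before fusion is min-max normalisation, sketched below. Note this is a stand-in: the σ values in the diagram above show MIND used a different mapping.

```python
def normalise(scores):
    # min-max normalisation onto [0, 1]
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

raw = [0.8, 0.6, 0.5, 0.3, 0.2]  # un-normalised engine scores from the diagram
print([round(x, 2) for x in normalise(raw)])  # [1.0, 0.67, 0.5, 0.17, 0.0]
```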
Image Data Fusion
Presentation of retrieval results
2D Information Display Space (IDS)
• Presenting multiple properties of documents simultaneously:
– Each document is represented by a visual object (VO)
– Visual features of each VO encode relevant properties of the documents
• Position, size, shape, colour
– Documents sharing the same value of one relevant property (selected by the user) are displayed with the same value of the corresponding visual feature
Sample Query Session
[Two slides of screenshots]
MMIR in Networked DLs – Open Issues
• Vague schema mappings for heterogeneous environments
• Cross-media searches
• Resource selection for non-textual media
• Decentralized (P2P) architectures
JXTA Search Architecture
Two-level architecture:
1. Search hubs
2. Provider peers
Conclusions and Outlook
• MIND provides methods for
– resource description,
– resource selection, and
– result fusion
in federated multimedia DLs
• Open issues:
– Heterogeneous, cross-media environments
– Time-dependent media
– Decentralized (P2P) architectures