Information Retrieval in Department 1

Holger Bast, Max-Planck-Institut für Informatik (MPII), Saarbrücken, Germany

Visit of the Scientific Advisory Board, Saarbrücken, June 2nd – 3rd, 2005


Page 1: Information Retrieval  in Department 1

Information Retrieval in Department 1

Holger Bast

Max-Planck-Institut für Informatik (MPII)

Saarbrücken, Germany

Visit of the Scientific Advisory Board

Saarbrücken, June 2nd – 3rd, 2005

Page 2: Information Retrieval  in Department 1

How it got started …

I shifted from formerly very theoretical work …

… to information retrieval topics

Over time a number of PhD/Master/Bachelor students joined in …

Josiane Parreira

Thomas Warken

Ingmar Weber

Christian Mortensen

Debapriyo Majumdar

… and a lot of interaction with Gerhard Weikum's group

Christian Klein

Benedikt Grundmann

Regis Newo

Daniel Fischer

Page 3: Information Retrieval  in Department 1

What we are doing …

Motivation

– even basic retrieval tasks are still far from being solved satisfactorily, e.g. searching my email

Two main research areas in the past 2 years

– Concept-based retrieval

– Searching with Autocompletion

This presentation

– main idea behind these areas

– lots of demos and examples

– highlight two results

Page 4: Information Retrieval  in Department 1

Concept-Based Retrieval

Term-document matrix (each column is a document expressed in terms):

internet   0  2  0  1  0  0
web        2  1  0  0  0  0
surfing    1  1  0  1  1  1
beach      0  0  1  1  1  1
hawaii     0  0  2  2  2  1

Example document:

Hawaii, 2nd June 2004
Dear Pen Pal,
I am writing to you from Hawaii. They have got internet access right on the beach here, isn’t that great? I’ll go surfing now!
your friend, CB

A query expressed in terms: (1, 0, 0, 0, 0)

Equally dissimilar to query!
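The "equally dissimilar" remark can be checked numerically. The sketch below is illustrative only: it assumes the query vector follows the row order of the matrix (so the query is the single term "internet") and uses cosine similarity as the measure, which the slide does not name.

```python
import math

# Term-document matrix from the slide: rows are terms, columns are the six documents.
A = [
    [0, 2, 0, 1, 0, 0],  # internet
    [2, 1, 0, 0, 0, 0],  # web
    [1, 1, 0, 1, 1, 1],  # surfing
    [0, 0, 1, 1, 1, 1],  # beach
    [0, 0, 2, 2, 2, 1],  # hawaii
]

# Assumption: entries are in row order, so this query is the single term "internet".
query = [1, 0, 0, 0, 0]


def column(matrix, j):
    """Return column j of a matrix stored as a list of rows."""
    return [row[j] for row in matrix]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Similarity of the query to each document, measured in term space.
term_space = [cosine(query, column(A, j)) for j in range(6)]
for j, score in enumerate(term_space, start=1):
    print(f"document {j}: {score:.3f}")
```

Under these assumptions the first document (web, surfing) and the third document (beach, hawaii) both score 0 against the query, although only the first is topically related to it.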

Page 5: Information Retrieval  in Department 1

Concept-Based Retrieval

Term-document matrix (each column is a document expressed in terms):

internet   0  2  0  1  0  0
web        2  1  0  0  0  0
surfing    1  1  0  1  1  1
beach      0  0  1  1  1  1
hawaii     0  0  2  2  2  1

Concept-document matrix (each column is a document expressed in concepts):

WWW        1  1  0  .5  0  0
Hawaii     0  0  1  .5  1  1

A query expressed in terms: (1, 0, 0, 0, 0)

Query expressed in concepts: (1, 0)
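A minimal sketch of the same comparison in concept space, using the concept-document matrix and the concept query from this slide. Cosine similarity is again an assumed measure, chosen only for illustration.

```python
import math

# Concept-document matrix from the slide: rows are concepts, columns are documents.
C = [
    [1, 1, 0, 0.5, 0, 0],  # WWW
    [0, 0, 1, 0.5, 1, 1],  # Hawaii
]

# Query expressed in concepts, as given on the slide.
query_concepts = [1, 0]


def column(matrix, j):
    """Return column j of a matrix stored as a list of rows."""
    return [row[j] for row in matrix]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Similarity of the query to each document, measured in concept space.
concept_space = [cosine(query_concepts, column(C, j)) for j in range(6)]
for j, score in enumerate(concept_space, start=1):
    print(f"document {j}: {score:.3f}")
```

In concept space the first two documents match the query fully, the pure Hawaii documents score 0, and the mixed fourth document lands in between.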

Page 6: Information Retrieval  in Department 1

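Concept-Based Retrieval

Term-document matrix (each column is a document expressed in terms):

internet   0  2  0  1  0  0
web        2  1  0  0  0  0
surfing    1  1  0  1  1  1
beach      0  0  1  1  1  1
hawaii     0  0  2  2  2  1

Term-concept matrix (each column is a concept expressed in terms):

internet   2  0
web        2  0
surfing    1  1
beach      0  1
hawaii     0  2

Concept-document matrix (each column is a document expressed in concepts):

WWW        1  1  0  .5  0  0
Hawaii     0  0  1  .5  1  1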

Page 7: Information Retrieval  in Department 1

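Concept-Based Retrieval

Term-document matrix:

internet   0  2  0  1  0  0
web        2  1  0  0  0  0
surfing    1  1  0  1  1  1
beach      0  0  1  1  1  1
hawaii     0  0  2  2  2  1

Term-concept matrix:

internet   2  0
web        2  0
surfing    1  1
beach      0  1
hawaii     0  2

Concept-document matrix:

WWW        1  1  0  .5  0  0
Hawaii     0  0  1  .5  1  1

The term-concept matrix and the concept-document matrix are combined by matrix multiplication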

Page 8: Information Retrieval  in Department 1

Concept-Based Retrieval

The approximation actually adds to the precision

Term-concept matrix:

internet   2  0
web        2  0
surfing    1  1
beach      0  1
hawaii     0  2

Concept-document matrix:

WWW        1  1  0  .5  0  0
Hawaii     0  0  1  .5  1  1

Product of the two (matrix multiplication):

internet   2  2  0  1   0  0
web        2  2  0  1   0  0
surfing    1  1  1  1   1  1
beach      0  0  1  .5  1  1
hawaii     0  0  2  1   2  2

Finding concepts = approximate low-rank matrix decomposition
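The product on this slide can be reproduced in a few lines. The sketch below multiplies the two factor matrices from the slides and compares the result with the original term-document matrix. The squared-error figure is added here only to make the word "approximate" concrete; it does not appear on the slide.

```python
# Original term-document matrix (rows: internet, web, surfing, beach, hawaii).
A = [
    [0, 2, 0, 1, 0, 0],
    [2, 1, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [0, 0, 2, 2, 2, 1],
]

# Term-concept matrix (columns: WWW, Hawaii).
U = [
    [2, 0],
    [2, 0],
    [1, 1],
    [0, 1],
    [0, 2],
]

# Concept-document matrix (rows: WWW, Hawaii).
C = [
    [1, 1, 0, 0.5, 0, 0],
    [0, 0, 1, 0.5, 1, 1],
]


def matmul(X, Y):
    """Plain matrix product of X (n x k) and Y (k x m), both lists of rows."""
    return [
        [sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
        for i in range(len(X))
    ]


# Rank-2 approximation of A, as shown on the slide.
approx = matmul(U, C)

# Sum of squared differences between the original and the approximation.
squared_error = sum(
    (a - b) ** 2 for row_a, row_b in zip(A, approx) for a, b in zip(row_a, row_b)
)

for row in approx:
    print(row)
print("squared error:", squared_error)
```

The two factors hold 5·2 + 2·6 = 22 numbers instead of the 30 in the full matrix, and the product gives the first document a nonzero "internet" entry even though the word never occurs in it.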

Page 9: Information Retrieval  in Department 1

A Concrete Example

676 abstracts from the Max-Planck-Institute

– for example:

We present two theoretically interesting and empirically successful techniques for improving the linear programming approaches, namely graph transformation and local cuts, in the context of the Steiner problem. We show the impact of these techniques on the solution of the largest benchmark instances ever solved.

– 3283 words (words like and, or, this, … removed)

– abstracts come from 5 departments: Algorithms, Logic, Graphics, CompBio, Databases

– reduce to 10 concepts

Page 10: Information Retrieval  in Department 1

How many concepts?

Implicitly, the matrix decomposition assigns a relatedness score to each pair of terms

[Figure: three plots of relatedness (y-axis) against number of concepts (x-axis, ticks 0, 200, 400, 600), one each for the term pairs voronoi / diagram, logic / logics, and logic / voronoi]

→ every fixed number of concepts is wrong!

Bast/Majumdar, SIGIR 2005

Page 11: Information Retrieval  in Department 1

How many concepts?

Implicitly, the matrix decomposition assigns a relatedness score to each pair of terms

[Figure: three plots of relatedness (y-axis) against number of concepts (x-axis, ticks 0, 200, 400, 600), one each for the term pairs voronoi / diagram, logic / logics, and logic / voronoi]

we instead assess the shape of the curves!

Bast/Majumdar, SIGIR 2005

Page 12: Information Retrieval  in Department 1

Searching with Autocompletion

best understood by example

and you can try it yourself via the new MPII webpages

An interactive search technology

– suggests completions of the word that is currently being typed

– along with that, hits are displayed (for the yet to be completed query)

Page 13: Information Retrieval  in Department 1

Useful in many ways

Learn about formulations used in the collection

– e.g. "guestbook"

Minimum of information required

– e.g. people's names

Gives stemming functionality (without stemmer)

– e.g. "raghavans", "raghavan3", …

Gives error-correction functionality (without error-correction)

– e.g. "raghvan", "ragavan", …

Database-like queries

– e.g. publications by Kurt Mehlhorn

all this with a single functionality

no dictionary, no training, readily applicable to any collection

Page 14: Information Retrieval  in Department 1

The core algorithmic problem

Given

– a set of documents D (the hits of the preceding part of the query)

– a range of words W (all completions of the last word the user has started typing)

Compute

– the subset of documents D' ⊆ D that contain at least one word from W

– the subset of words W' ⊆ W that occur in at least one document of D

– typically |W'| << |W|

D = 17, 23, 48, 116, …

raga 11, 47, 97, 134, …

ragade 15, 77, 214, …

ragan 58, 917, …

ragchi 6, 107, 514, …

ragavan 23, 118, …

rage 211

raged 6, 111, 517, …

ragen 37, 919, …

ragged 14, 77, 112, 245, …

raggett 17, 51, 116, …

raggio 7, 22, 50, 714, …

raghavan 23, 57, 116, …
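The problem statement can be made concrete with the example data above. This is a straightforward baseline sketch, not the algorithm of the talk: it walks over every word in W and intersects its posting list with D. The lists on the slide are truncated with "…", so only the document ids actually shown are used, and the function name `complete` is hypothetical.

```python
# Hits of the preceding part of the query (only the ids shown on the slide).
D = {17, 23, 48, 116}

# Posting lists for the words in W (truncated where the slide shows "…").
postings = {
    "raga": [11, 47, 97, 134],
    "ragade": [15, 77, 214],
    "ragan": [58, 917],
    "ragchi": [6, 107, 514],
    "ragavan": [23, 118],
    "rage": [211],
    "raged": [6, 111, 517],
    "ragen": [37, 919],
    "ragged": [14, 77, 112, 245],
    "raggett": [17, 51, 116],
    "raggio": [7, 22, 50, 714],
    "raghavan": [23, 57, 116],
}


def complete(docs, word_postings):
    """Return (D', W'): matching documents and the words that occur in them.

    Baseline: every posting list in W is inspected, whether or not it
    contributes to the result.
    """
    d_prime = set()
    w_prime = []
    for word, doc_ids in word_postings.items():
        hits = docs.intersection(doc_ids)
        if hits:
            w_prime.append(word)
            d_prime |= hits
    return sorted(d_prime), sorted(w_prime)


d_prime, w_prime = complete(D, postings)
print("D' =", d_prime)
print("W' =", w_prime)
```

On the visible data, 12 posting lists are inspected but only 3 words end up in W', which illustrates the remark that typically |W'| << |W|.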

Page 15: Information Retrieval  in Department 1

The core algorithmic problem

D = 17, 23, 48, 116, …

raga 11, 47, 97, 134, …

ragade 15, 77, 214, …

ragan 58, 917, …

ragchi 6, 107, 514, …

ragavan 23, 118, …

rage 211

raged 6, 111, 517, …

ragen 37, 919, …

ragged 14, 77, 112, 245, …

raggett 17, 51, 116, …

raggio 7, 22, 50, 714, …

raghavan 23, 57, 116, …

Ordinary Inverted Index: ~|W| time per query

Bast/Mortensen/Weber: ~|W'| time per query

Given

– a set of documents D (the hits of the preceding part of the query)

– a range of words W (all completions of the last word the user has started typing)

Compute

– the subset of documents D' ⊆ D that contain at least one word from W

– the subset of words W' ⊆ W that occur in at least one document of D

– typically |W'| << |W|
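The slide calls W a "range" of words. One way to see why is that all completions of a prefix sit next to each other in a sorted vocabulary. The sketch below assumes such a sorted vocabulary and finds the range by binary search; the slide does not say how the range is actually determined, and the words "rafting" and "rail" are hypothetical additions so that the range has neighbours on both sides.

```python
from bisect import bisect_left

# Vocabulary from the slide plus two hypothetical neighbours, sorted alphabetically.
vocabulary = sorted([
    "rafting",  # hypothetical, not on the slide
    "raga", "ragade", "ragan", "ragchi", "ragavan", "rage", "raged",
    "ragen", "ragged", "raggett", "raggio", "raghavan",
    "rail",     # hypothetical, not on the slide
])


def completion_range(sorted_words, prefix):
    """Return (lo, hi) so that sorted_words[lo:hi] are the words starting with prefix.

    Assumes no word contains the sentinel character U+FFFF.
    """
    lo = bisect_left(sorted_words, prefix)
    hi = bisect_left(sorted_words, prefix + "\uffff")
    return lo, hi


lo, hi = completion_range(vocabulary, "rag")
W = vocabulary[lo:hi]
print(f"W = vocabulary[{lo}:{hi}], |W| = {len(W)}")

# Typing one more letter narrows the range.
lo2, hi2 = completion_range(vocabulary, "ragg")
print(vocabulary[lo2:hi2])
```

Because W is a contiguous range, it can be described by two indices, however many words it contains.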