Information Retrieval in Department 1
Holger Bast
Max-Planck-Institut für Informatik (MPII)
Saarbrücken, Germany
Visit of the Scientific Advisory Board
Saarbrücken, June 2nd – 3rd, 2005
How it got started …
I shifted from formerly very theoretical work …
… to information retrieval topics
Over time a number of PhD/Master/Bachelor students joined in …
Josiane Parreira
Thomas Warken
Ingmar Weber
Christian Mortensen
Debapriyo Majumdar
… and a lot of interaction with Gerhard Weikum's group
Christian Klein
Benedikt Grundmann
Regis Newo
Daniel Fischer
What we are doing …
Motivation
– even basic retrieval tasks are still far from being solved satisfactorily, e.g. searching my email
Two main research areas in the past 2 years
– Concept-based retrieval
– Searching with Autocompletion
This presentation
– main idea behind these areas
– lots of demos and examples
– highlight two results
Concept-Based Retrieval
Documents and a query expressed in terms (one column per document, labeled d1 to d6 here; the rightmost column is the query):

            d1   d2   d3   d4   d5   d6     query
internet     0    2    0    1    0    0       1
web          2    1    0    0    0    0       0
surfing      1    1    0    1    1    1       0
beach        0    0    1    1    1    1       0
hawaii       0    0    2    2    2    1       0

An example document (its term counts match column d4):
Hawaii, 2nd June 2004
Dear Pen Pal,
I am writing to you from Hawaii. They have got internet access right on the beach here, isn’t that great? I’ll go surfing now!
your friend, CB

Equally dissimilar to query! (In terms, d1, d3, d5 and d6 all share no term with the query.)

Documents and the query expressed in concepts:

            d1   d2   d3   d4   d5   d6     query
WWW          1    1    0   .5    0    0       1
Hawaii       0    0    1   .5    1    1       0

Concepts expressed in terms:

            WWW   Hawaii
internet      2      0
web           2      0
surfing       1      1
beach         0      1
hawaii        0      2

Matrix multiplication: (concepts expressed in terms) ● (documents expressed in concepts) =

            d1   d2   d3   d4   d5   d6
internet     2    2    0    1    0    0
web          2    2    0    1    0    0
surfing      1    1    1    1    1    1
beach        0    0    1   .5    1    1
hawaii       0    0    2    1    2    2

The approximation actually adds to the precision
Finding concepts = approximate low-rank matrix decomposition
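The matrix multiplication on the slide can be checked with a few lines of code. The sketch below is only an illustration: it copies the three matrices from the slides, multiplies the two factors, and compares how the query scores against each document in terms and in concepts. The helper names (`matmul`, `column`, `dot`) and the use of a plain dot product as the similarity score are assumptions made for this example; the slides do not say which similarity measure is used.

```python
# Term-document counts (rows: internet, web, surfing, beach, hawaii).
A = [
    [0, 2, 0, 1, 0, 0],
    [2, 1, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [0, 0, 2, 2, 2, 1],
]
# Concepts expressed in terms (columns: WWW, Hawaii).
U = [[2, 0], [2, 0], [1, 1], [0, 1], [0, 2]]
# Documents expressed in concepts (rows: WWW, Hawaii).
V = [
    [1, 1, 0, 0.5, 0, 0],
    [0, 0, 1, 0.5, 1, 1],
]

def matmul(X, Y):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def column(M, j):
    return [row[j] for row in M]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Rank-2 approximation of A given by the two factors.
approx = matmul(U, V)

query_terms = [1, 0, 0, 0, 0]   # the query "internet"
query_concepts = [1, 0]         # the same query expressed in concepts

# Dot-product scores (an assumed similarity measure) per document.
term_scores = [dot(query_terms, column(A, j)) for j in range(6)]
concept_scores = [dot(query_concepts, column(V, j)) for j in range(6)]

for row in approx:
    print(row)
print("scores in terms:   ", term_scores)
print("scores in concepts:", concept_scores)
```

The first document contains "web" and "surfing" but not "internet", so it scores 0 against the query in terms and 1 in concepts, the same as the second document, which does contain "internet".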
A Concrete Example
676 abstracts from the Max-Planck-Institute
– for example:
We present two theoretically interesting and empirically successful techniques for improving the linear programming approaches, namely graph transformation and local cuts, in the context of the Steiner problem. We show the impact of these techniques on the solution of the largest benchmark instances ever solved.
– 3283 words (words like and, or, this, … removed)
– abstracts come from 5 departments: Algorithms, Logic, Graphics, CompBio, Databases
– reduce to 10 concepts
How many concepts?
[Figure: three plots of relatedness (y-axis) against number of concepts (x-axis, 0 to 600), one per term pair: voronoi / diagram, logic / logics, logic / voronoi]
Implicitly, the matrix decomposition assigns a relatedness score to each pair of terms
→ every fixed number of concepts is wrong!
→ we instead assess the shape of the curves!
Bast/Majumdar, SIGIR 2005
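A relatedness curve of this kind can be computed in miniature on the five-term matrix from the concept-based retrieval slides. The sketch below rests on an assumption the slides do not spell out: that the relatedness of terms i and j at k concepts is entry (i, j) of U_k U_k^T, where U_k holds the top k eigenvectors of the term-term matrix A·A^T. The eigenvectors come from a plain Jacobi iteration so that the block needs no external library; any SVD routine would serve equally well. All function names are hypothetical.

```python
import math

TERMS = ["internet", "web", "surfing", "beach", "hawaii"]
# Term-document counts from the concept-based retrieval slides.
A = [
    [0, 2, 0, 1, 0, 0],
    [2, 1, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [0, 0, 2, 2, 2, 1],
]

def gram(M):
    """Term-term matrix M * M^T."""
    return [[float(sum(a * b for a, b in zip(r1, r2))) for r2 in M] for r1 in M]

def jacobi_eigen(S, sweeps=50):
    """Eigenvalues (descending) and matching eigenvectors of a symmetric matrix."""
    n = len(S)
    B = [row[:] for row in S]
    V = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(B[p][q]) < 1e-12:
                    continue
                # Rotation angle that zeroes the off-diagonal entry B[p][q].
                diff = B[q][q] - B[p][p]
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * B[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for M in (B, V):      # rotate columns p and q of B and V
                    for k in range(n):
                        mp, mq = M[k][p], M[k][q]
                        M[k][p], M[k][q] = c * mp - s * mq, s * mp + c * mq
                for k in range(n):    # rotate rows p and q of B
                    bp, bq = B[p][k], B[q][k]
                    B[p][k], B[q][k] = c * bp - s * bq, s * bp + c * bq
    pairs = sorted(((B[m][m], [V[r][m] for r in range(n)]) for m in range(n)),
                   key=lambda pair: pair[0], reverse=True)
    return [val for val, _ in pairs], [vec for _, vec in pairs]

def relatedness_curve(vectors, i, j):
    """Assumed definition: entry (i, j) of U_k U_k^T for k = 1 .. len(vectors)."""
    curve, total = [], 0.0
    for u in vectors:
        total += u[i] * u[j]
        curve.append(total)
    return curve

values, vectors = jacobi_eigen(gram(A))
for a, b in [("internet", "web"), ("beach", "hawaii"), ("web", "hawaii")]:
    curve = relatedness_curve(vectors, TERMS.index(a), TERMS.index(b))
    print(a, "/", b, [round(x, 2) for x in curve])
```

With all five concepts the eigenvectors form an orthogonal basis, so every curve for two distinct terms ends at zero. Whatever distinguishes related from unrelated pairs therefore has to come from how the curve behaves at smaller k, not from its value at any single k.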
Searching with Autocompletion
best understood by example, and you can try it yourself via the new MPII webpages
An interactive search technology
– suggests completions of the word that is currently being typed
– along with that, hits are displayed (for the yet to be completed query)
Useful in many ways
Learn about formulations used in the collection
– e.g. "guestbook"
Minimum of information required
– e.g. people's names
Gives stemming functionality (without stemmer)
– e.g. "raghavans", "raghavan3", …
Gives error-correction functionality (without error-correction)
– e.g. "raghvan", "ragavan", …
Database-like queries
– e.g. publications by Kurt Mehlhorn
all this with a single functionality
no dictionary, no training, readily applicable to any collection
The core algorithmic problem
Given
– a set of documents D (the hits of the preceding part of the query)
– a range of words W (all completions of the last word the user has started typing)
Compute
– the subset of documents D' ⊆ D that contain at least one word from W
– the subset of words W' ⊆ W that occur in at least one document of D
– typically |W'| << |W|
D = 17, 23, 48, 116, …
raga 11, 47, 97, 134, …
ragade 15, 77, 214, …
ragan 58, 917, …
ragchi 6, 107, 514, …
ragavan 23, 118, …
rage 211
raged 6, 111, 517, …
ragen 37, 919, …
ragged 14, 77, 112, 245, …
raggett 17, 51, 116, …
raggio 7, 22, 50, 714, …
raghavan 23, 57, 116, …
The core algorithmic problem
Ordinary inverted index: ~|W| time per query
Bast/Mortensen/Weber: ~|W'| time per query
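The problem statement can be made concrete with the example from the slide. The sketch below implements only the ordinary inverted-index baseline, which inspects the list of every word in W and therefore takes time growing with |W|. It is not the Bast/Mortensen/Weber method, whose data structure the slides do not describe. The lists hold only the document ids visible on the slide; the elided tails ("…") are left out, so the result is illustrative. The names `word_range` and `complete` are hypothetical.

```python
from bisect import bisect_left

# Inverted index: word -> sorted document ids, as far as shown on the slide.
index = {
    "raga": [11, 47, 97, 134],
    "ragade": [15, 77, 214],
    "ragan": [58, 917],
    "ragchi": [6, 107, 514],
    "ragavan": [23, 118],
    "rage": [211],
    "raged": [6, 111, 517],
    "ragen": [37, 919],
    "ragged": [14, 77, 112, 245],
    "raggett": [17, 51, 116],
    "raggio": [7, 22, 50, 714],
    "raghavan": [23, 57, 116],
}
vocabulary = sorted(index)

def word_range(prefix):
    """W: all vocabulary words that start with the given prefix."""
    lo = bisect_left(vocabulary, prefix)
    hi = bisect_left(vocabulary, prefix + "\uffff")
    return vocabulary[lo:hi]

def complete(D, prefix):
    """Return (D', W') for a document set D and the word range of a prefix."""
    D = set(D)
    matching_docs, matching_words = set(), []
    for word in word_range(prefix):        # one list per word in W: ~|W| work
        hits = D.intersection(index[word])
        if hits:
            matching_words.append(word)
            matching_docs |= hits
    return sorted(matching_docs), matching_words

docs, words = complete([17, 23, 48, 116], "rag")
print(docs)   # documents in D' 
print(words)  # words in W'
```

On this data W has 12 words but W' has only 3 (ragavan, raggett, raghavan), and D' is 17, 23, 116. The baseline still reads all 12 lists, which is the gap between ~|W| and ~|W'| that the slide points to.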