The Index-based XXL Search Engine for Querying XML Data with Relevance Ranking

Anja Theobald and Gerhard Weikum
University of the Saarland, Saarbrücken, Germany
[email protected]
http://www-dbs.cs.uni-sb.de

Post on 04-Feb-2016


Page 1: The Index-based XXL Search Engine for Querying XML Data with Relevance Ranking


Page 2: The Index-based XXL Search Engine for Querying XML Data with Relevance Ranking

Conclusion

Problem:
• diversity of Web / Intranet data
• despite XML, a global schema is a myth
• users are swamped with results or are looking for needles in haystacks

Our contribution:
• combine XML querying with relevance ranking
• demonstrate efficiency and search result quality with the XXL search engine prototype

Page 3: The Index-based XXL Search Engine for Querying XML Data with Relevance Ranking

Outline

• Adding relevance to XML

• The XXL search engine: index-based query processing

• Experiments

Page 4: The Index-based XXL Search Engine for Querying XML Data with Relevance Ranking

XML Data Graph

<Uni> ETH Zürich
  <Fak> Faculty of Natural Sciences and Technology I
    <FR> Department of Computer Science
      <Lehre> ...
        <Hauptstudium>
          <Vorlesung> Performance analysis
            <Dozent> ... </>
            <Inhalt> ... Queues ... </>
            <Lit href=springer/nelson.xml >
            <Lit href=... >
          </Vorlesung>
          <Vorlesung> Speech processing
            <Inhalt> ... Markov chains ... </>
          </Vorlesung>
          ...
      </Lehre>
      ...
    </FR>
    ...
  </Fak>
  ...
</Uni>

<Uni> Uni Stuttgart
  <Fak> Faculty of Natural Sciences and Technology I
    <FR> Department of Computer Science
      <Lehre> ...
        <Hauptstudium>
          <Vorlesung> Performance analysis
            <Dozent> ... </>
            <Inhalt> ... Queues ... </>
            <Lit href=springer/nelson.xml >
            <Lit href=... >
          </Vorlesung>
          <Vorlesung> Speech processing
            <Inhalt> ... Markov chains ... </>
          </Vorlesung>
          ...
      </Lehre>
      ...
    </FR>
    ...
  </Fak>
  ...
</Uni>

...

<Uni> Uni Saarland
  <School> Math & Engineering
    <Dept> CS
      <Teaching> ...
        <GradStudies>
          <Course> Performance analysis
            <Lecturer> ... </>
            <Content> Queueing models .. </>
            <Lit href=springer/nelson.xml >
            <Lit href=... >
          </Course>
          <Course> Speech processing
            <Content> ... Markov chains... </>
          </Course>
          ...
      </Teaching>
      ..
    </Dept>
    ..
  </School>
  ...
</Uni>

[Figure: data graph with the node labels
 Uni: Uni Saarland; School: ...; Dept: ... CS ...; Teaching; GradStudies; Course: Speech processing (Content: ... Markov chains ...); Course: Performance analysis (Content: ... Queueing models; Lit; Lit);
 Book (Title: Stochastic...; Author: R. Nelson; Review: ... Chapter on Markov chains);
 Uni: Uni Stuttgart; School: CS; Course: Mobile Comm.; Prerequisites: ... Markov processes;
 Uni: Uni Augsburg; Curriculum: E Commerce; Weekend: Data Mining]

Semistructured data: elements, attributes, links organized as labeled graph

Page 5: The Index-based XXL Search Engine for Querying XML Data with Relevance Ranking

XML Querying

Select U, C
From www.allunis.de/unis.xml
Where Uni As U
And U.#.School?.#.(Inst | Dept)+ As D
And D Like "%CS%"
And D.#.Course As C
And C.# Like "%Markov chain%"

[Figure: data graph rooted at www.allunis.de/unis.xml with the node labels
 Uni: Uni Saarland; School: ...; Dept: ... CS; Teaching; GradStudies; Course: Speech processing (Content: ... Markov chains ...); Course: Performance analysis (Content: ... Queueing models; Lit; Lit);
 Book (Title: Stochastic...; Author: R. Nelson; Review: ... Chapter on Markov chains);
 Uni: Uni Stuttgart; School: CS; Course: Mobile comm.; Prerequisites: ... Markov processes;
 Uni: Uni Augsburg; Curriculum: E Commerce; Weekend: Data Mining; Outline: ... statistical methods for classification ...]

Regular expressions over path labels + logical conditions over element contents
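To make "regular expressions over path labels" concrete, here is a minimal Python sketch. It is not the XXL implementation: the helper names `path_to_regex` and `matches`, and the choice to encode a path as "/"-terminated labels, are assumptions made for illustration. It treats "#" as an arbitrary (possibly empty) sequence of labels and keeps "?", "+" and "|" with their usual regular-expression meaning.

```python
import re

def path_to_regex(expr: str) -> re.Pattern:
    """Translate an XXL-style path expression into a regex over label paths.

    Assumption: a path is encoded as its labels, each followed by "/",
    e.g. ["Uni", "School", "Dept"] -> "Uni/School/Dept/".
    """
    parts = []
    for step in expr.replace(" ", "").split("."):
        if step == "#":
            # wildcard: any number of arbitrary labels
            parts.append(r"(?:[^/]+/)*")
            continue
        m = re.fullmatch(r"\(?([\w|]+)\)?([?+*]?)", step)
        if m is None:
            raise ValueError(f"unsupported step: {step!r}")
        names = "|".join(re.escape(n) for n in m.group(1).split("|"))
        parts.append(f"(?:(?:{names})/){m.group(2)}")
    return re.compile("".join(parts))

def matches(expr: str, labels: list[str]) -> bool:
    """True if the whole label path satisfies the path expression."""
    path = "".join(label + "/" for label in labels)
    return path_to_regex(expr).fullmatch(path) is not None

expr = "Uni.#.School?.#.(Inst | Dept)+"
print(matches(expr, ["Uni", "School", "Dept"]))   # path ends in Dept
print(matches(expr, ["Uni", "School"]))           # no Inst or Dept at the end
```

Content conditions such as `D Like "%CS%"` would then be checked separately on the elements that the path expression binds.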

Page 6: The Index-based XXL Search Engine for Querying XML Data with Relevance Ranking

XML Querying

Select U, C
From www.allunis.de/unis.xml
Where Uni As U
And U.#.School?.#.(Inst | Dept)+ As D
And D Like "%CS%"
And D.#.Course As C
And C.# Like "%Markov chain%"

[Figure: the data graph rooted at www.allunis.de/unis.xml (Uni: Uni Saarland, Uni: Uni Stuttgart, Uni: Uni Augsburg and their subtrees), with the nodes matched by each condition highlighted:]
• Uni As U: Uni, Uni, Uni
• U.#.School?.#.(Inst | Dept)+ As D: School, School, School, Dept
• D Like "%CS%": CS, CS
• D.#.Course As C: Course, Course, Course
• C.# Like "%Markov chain%": Markov chains, Markov chains
• Result: U, C

Page 7: The Index-based XXL Search Engine for Querying XML Data with Relevance Ranking

Boolean vs. Ranked Retrieval

There is no global schema for Intranets or the Web.
Relevance ranking of results is absolutely crucial!

Page 8: The Index-based XXL Search Engine for Querying XML Data with Relevance Ranking

Ranked Retrieval with XXL

Select U, C
From www.allunis.de/unis.xml
Where Uni As U
And U.# As D
And D ~~ "CS"
And D.#.~Course As C
And C.# ~~ "Markov chain"

[Figure: data graph rooted at www.allunis.de/unis.xml with the node labels
 Uni: Uni Saarland; School: ...; Dept: ... CS; Teaching; GradStudies; Course: Speech processing (Content: ... Markov chains ...); Course: Performance analysis (Content: ... Queueing models; Lit; Lit);
 Book (Title: Stochastic...; Author: R. Nelson; Review: ... Chapter on Markov chains);
 Uni: Uni Stuttgart; School: CS; Course: Mobile comm.; Prerequisites: ... Markov processes;
 Uni: Uni Augsburg; Curriculum: E Commerce; Weekend: Data Mining; Outline: ... statistical methods for classification ...]

Page 9: The Index-based XXL Search Engine for Querying XML Data with Relevance Ranking

Ranked Retrieval with XXL

Select U, C
From www.allunis.de/unis.xml
Where Uni As U
And U.# As D
And D ~~ "Computer Science"
And D.#.~Course As C
And C.# ~~ "Markov chain"

[Figure: data graph rooted at www.allunis.de/unis.xml with the node labels
 Uni: Uni Saarland; School: ...; Dept: ... CS; Teaching; GradStudies; Course: Speech processing (Content: ... Markov chains ...); Course: Performance analysis (Content: ... Queueing models; Lit; Lit);
 Book (Title: Stochastic...; Author: R. Nelson; Review: ... Chapter on Markov chains);
 Uni: Uni Stuttgart; School: CS; Course: Mobile comm.; Prerequisites: ... Markov processes;
 Uni: Uni Augsburg; Curriculum: E Commerce; Weekend: Data Mining; Outline: ... statistical methods for classification ...]

Result ranking of XML data based on semantic similarity

Page 10: The Index-based XXL Search Engine for Querying XML Data with Relevance Ranking

Outline

Adding relevance to XML

• The XXL search engine: index-based query processing

• Experiments

Page 11: The Index-based XXL Search Engine for Querying XML Data with Relevance Ranking

XXL: Flexible XML Search Language

Extensible, simple core language:

Select F, D, S
From www.allunis.de/unis.xml
Where Uni.#.School?.#.(Inst|Dept) As F
And F.#.Lecturer As D
And F.#.Student As S
And D.Name = S.Name
And D.Area Like "%XML%"

• Where clause: conjunction of regular path expressions with binding of variables
• Elementary conditions on element/attribute names and contents
• Semantic similarity conditions on names and contents:
  ... F.#.~Lecturer As D And D.~Area ~~ "XML"
• Based on tf*idf similarity of contents, ontological similarity of names, and probabilistic combination of conditions
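The slides do not give the exact tf*idf formula behind the `~~` operator, so the following Python sketch only illustrates the general idea of scoring element contents against a query term. The function name `tfidf_scores`, the whitespace tokenization, the `log(1 + n/df)` weighting and the normalization to [0, 1] are all assumptions, not the XXL definition.

```python
import math
from collections import Counter

def tfidf_scores(query_terms, contents):
    """Score element contents against query terms with a simple tf*idf sum.

    contents: dict mapping a (hypothetical) element id to its text.
    Returns a dict element id -> score, normalized so the best match is 1.0.
    """
    n = len(contents)
    tokenized = {eid: text.lower().split() for eid, text in contents.items()}

    # document frequency: in how many element contents does a term occur?
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    raw = {}
    for eid, tokens in tokenized.items():
        tf = Counter(tokens)
        raw[eid] = sum(
            tf[t] * math.log(1 + n / df[t])
            for t in (q.lower() for q in query_terms)
            if t in tf
        )

    top = max(raw.values(), default=0.0)
    return {eid: (s / top if top > 0 else 0.0) for eid, s in raw.items()}

# Hypothetical element contents, for illustration only.
contents = {
    "e1": "XML query languages",
    "e2": "XML and XML indexing",
    "e3": "queueing models",
}
scores = tfidf_scores(["XML"], contents)
print(sorted(scores, key=scores.get, reverse=True))
```

Normalizing to [0, 1] lets such a content score be combined with other condition scores as if it were a probability.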

Page 12: The Index-based XXL Search Engine for Querying XML Data with Relevance Ranking

XXL Result Ranking

Query:
Where Uni.#.School?.#.(Inst|Dept)+ As D
And D.#.~Lecturer As D
And D.~Area ~~ "XML"

[Figure: data graph and result graph.
 Data graph: Uni: Uni Saarland; Dept: CS; Dept: Math; Prof: GW; Teaching; Project: IR for semistruct. data; Course: IR; Seminar: XML; Project: Digital libraries.
 Result graph: Uni: Uni Saarland; Dept: CS; Dept: Math; Prof: GW; Project: IR for semistruct. data, annotated with the scores 1.0, 1.0, 0.9, 0.8, 0.6.]

Relevance score: 0.432 = 1.0 * 1.0 * 0.9 * 0.8 * 0.6
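The arithmetic on this slide can be reproduced in a few lines of Python. The sketch assumes, as the product on the slide suggests, that the scores of the individual conditions along a result path are combined by multiplication, as if they were independent probabilities; the function name `relevance` is illustrative.

```python
import math

def relevance(condition_scores):
    """Combine per-condition scores along one result path by multiplication."""
    return math.prod(condition_scores)

# Scores taken from the slide's result graph.
score = relevance([1.0, 1.0, 0.9, 0.8, 0.6])
print(round(score, 3))  # 0.432
```

Because every factor is at most 1.0, a single weak match lowers the score of the whole result path, and exact matches (score 1.0) leave it unchanged.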

Page 13: The Index-based XXL Search Engine for Querying XML Data with Relevance Ranking

XXL Search Engine

[Figure: architecture with the components WWW, XXL applet, XXL servlets, Query processor, Path indexer, Content indexer, Ontology]

Select ...
Where Uni.#.(Inst|Dept) As F
And F ~~ "Computer Science"
And F.#.~Course.# ~~ "Markov Chains"

Subexpressions shown in the figure:
Uni.#.(Inst|Dept) As F
F ~~ "Computer Science"
F.#.~Course.# ~~ "Markov Chains"
F.#.~Seminar.# ~~ "Markov Chains"

• Query decomposition into index-supported subexpressions
• wide range of optimizations

Page 14: The Index-based XXL Search Engine for Querying XML Data with Relevance Ranking

Index Structures

Element Path Index:
Uni, {id1, {<School, {id13, id14}>, <Prof, {id111, id117, id119}>}, id2, {<Prof, {id15}>} }
School, {id13, {<Dean, {id27}>, <Dept, {id31, id32, id33}>}, id14, { ... } }
materializes all (parent, child) element name pairs and dynamically checks transitive connectivity

Element Content Index:
Engineering, idf=..., {<id79, tf=...>, <id85, tf=...>}
XML, idf=..., {<id46, tf=...>, <id49, tf=...>, <id53, tf=...>}
precomputes all term occurrences in element contents, with frequency statistics

Element Ontology Index:
Course, {<Seminar, 0.9>, <Project, 0.7>}, {<Teaching, 0.9>} {<Telecourse, 0.9>, <Video lecture, 0.7>, <Meditation, 0.1>}
contains synonyms, hypernyms, and hyponyms of element names, and "semantic" distances
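As a rough illustration of how two of these indexes might be held in memory, the Python sketch below mirrors the path index and ontology index entries from the slide as nested dictionaries. The data layout and the helper names `children` and `similar_names` are assumptions; the slides do not describe the physical representation. The content index is left out because its tf and idf values are elided on the slide, and the ontology entry is flattened into one name-to-score mapping.

```python
# Element path index: element name -> element id -> child name -> child ids.
# Entries copied from the slide; the elided entry for id14 is omitted.
path_index = {
    "Uni": {
        "id1": {"School": ["id13", "id14"], "Prof": ["id111", "id117", "id119"]},
        "id2": {"Prof": ["id15"]},
    },
    "School": {
        "id13": {"Dean": ["id27"], "Dept": ["id31", "id32", "id33"]},
    },
}

# Element ontology index: element name -> related name -> similarity score.
ontology_index = {
    "Course": {
        "Seminar": 0.9, "Project": 0.7, "Teaching": 0.9,
        "Telecourse": 0.9, "Video lecture": 0.7, "Meditation": 0.1,
    },
}

def children(name, element_id, child_name):
    """Ids of the child elements called child_name below one element."""
    return path_index.get(name, {}).get(element_id, {}).get(child_name, [])

def similar_names(name, threshold):
    """Names whose similarity to `name` is at least `threshold`.

    The name itself always qualifies with score 1.0.
    """
    result = {name: 1.0}
    for other, score in ontology_index.get(name, {}).items():
        if score >= threshold:
            result[other] = score
    return result

print(children("Uni", "id1", "School"))
print(similar_names("Course", 0.8))
```

A condition such as `~Course` could then be expanded to the names returned by `similar_names` before the path index is consulted, with each name carrying its similarity as a score.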

Page 15: The Index-based XXL Search Engine for Querying XML Data with Relevance Ranking

Query Decomposition & Evaluation

• decompose query into subqueries
• choose global evaluation order of subqueries
• represent subquery as NFSA
• for each subquery choose local evaluation strategy (top-down or bottom-up)
• evaluate subexpressions using indexes
• compute subquery result paths with relevance scores
• combine result paths into result graph

Example query:
Uni.#.(Inst|Dept)+ As F
And F ~~ "Computer Science"
And F.#.~Course.# ~~ "Markov Chains"

Example of subquery NFSA for Uni.#.(Inst|Dept)+:
[Figure: NFSA with the states Uni, %, Inst, Dept]
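The NFSA for the subquery Uni.#.(Inst|Dept)+ can be simulated with the standard set-of-states technique. The Python sketch below is one possible encoding, not the engine's own: the state numbering, the transition table and the function name `accepts` are assumptions, and "%" is used as the wildcard label as in the figure.

```python
# State 0: start; state 1: after Uni, loops on any label (the "#" wildcard);
# state 2: at least one Inst or Dept has been read (accepting).
TRANSITIONS = {
    0: [("Uni", 1)],
    1: [("%", 1), ("Inst", 2), ("Dept", 2)],
    2: [("Inst", 2), ("Dept", 2)],
}
ACCEPTING = {2}

def accepts(labels):
    """Run the NFSA over a path of element names, tracking all active states."""
    current = {0}
    for label in labels:
        following = set()
        for state in current:
            for symbol, target in TRANSITIONS.get(state, []):
                if symbol == "%" or symbol == label:
                    following.add(target)
        if not following:
            return False  # no state can consume this label
        current = following
    return bool(current & ACCEPTING)

print(accepts(["Uni", "School", "Dept"]))  # ends in Dept after a wildcard step
print(accepts(["Uni", "School"]))          # never reaches Inst or Dept
```

A top-down strategy would feed labels from the root towards the leaves as above; a bottom-up strategy would run the reversed automaton starting from elements found through an index.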

Page 16: The Index-based XXL Search Engine for Querying XML Data with Relevance Ranking

The Role of Ontologies

[Figure: documents from the WWW / Intranet, for example:]
<Uni> Univ. Saarland <School> Engineering <Dept> Computer Science <Faculty> Prof. Dr. GW <Project> Semistructured Data ... XML</> ...

Observation: Information becomes better searchable when it is more explicitly structured and canonically annotated

[Figure: concept graph with the nodes Course, Prof, Dept, Institute, Research, Teaching, Project, Seminar, University, Publication, Conference, Journal]

$\forall c\,(\mathit{Course}(c) \Rightarrow \exists s\,((\mathit{Dept}(s) \lor \mathit{Inst}(s)) \land \mathit{Curriculum}(c,s)))$

"Poor man's ontology":
• graph of concepts capturing hypernym/hyponym relationships (e.g., from WordNet)
• quantitative reasoning ("semantic similarity" measures)

Page 17: The Index-based XXL Search Engine for Querying XML Data with Relevance Ranking

Outline

Adding relevance to XML

The XXL search engine: index-based query processing

• Experiments

Page 18: The Index-based XXL Search Engine for Querying XML Data with Relevance Ranking

Example Data

Page 19: The Index-based XXL Search Engine for Querying XML Data with Relevance Ranking

Example Query

SELECT *
FROM INDEX
WHERE ~drama.#.scene AS C
AND C.speech AS S
AND (S.speaker ~ "Woman")
AND S.line AS L
AND (L.CONTENT ~ "leader")
AND C.speech AS M
AND (M.speaker = "MACBETH")

Page 20: The Index-based XXL Search Engine for Querying XML Data with Relevance Ranking

Example Ontology

thane -- (a feudal lord or baron in Scotland)
  => lord, noble, nobleman -- (a titled peer of the realm)
    => male aristocrat -- (a man who is an aristocrat)
      => leader -- (a person who rules or guides or inspires others)

Page 21: The Index-based XXL Search Engine for Querying XML Data with Relevance Ranking

Example Ontology

woman, adult female -- (an adult female person)
  => amazon, virago -- (a large strong and aggressive woman)
  => donna -- (an Italian woman of rank)
  => geisha, geisha girl -- (...)
  => lady -- (a polite name for any woman)
  ...
  => wife -- (a married woman, a man's partner in marriage)
  => witch -- (a being, usually female, imagined to have special powers derived from the devil)

Page 22: The Index-based XXL Search Engine for Querying XML Data with Relevance Ranking

Example Results

Relevance = 0.0070400005

<scene>
  <speech>
    <speaker> Second Witch </speaker>
    <line> All hail, Macbeth, hail to thee, thane of Cawdor! </line>
  </speech>
  <speech>
    <speaker> MACBETH </speaker>
    <line> ... </line>
  </speech>
</scene>

Page 23: The Index-based XXL Search Engine for Querying XML Data with Relevance Ranking

XXL Runtime Measurements

Q1:
Select * From Index
Where #.publication AS A
And A.~headline ~~ "XML"
And A.author% AS B

Q2:
Select * From Index
Where #.play AS A
And A.#.personae AS B
And B.~figure ~~ "King"
And B.title AS C

Q1:
  #results / top-down: 13114.3 sec
  bottom-up: 694 sec
  w/ optimization: 2.68 sec (incl. 0.37 sec), 2bu 1bu 3td

Q2:
  #results / top-down: 588.5 sec
  bottom-up: 3.7 sec
  w/ optimization: 4.64 sec (incl. 0.33 sec), 1bu 2td 3td 4td

Test data: 100 XML documents with a total of 240 000 elements (ot.xml, nt.xml, ..., hamlet.xml, macbeth.xml, ..., SigmodRecord.xml)

Page 24: The Index-based XXL Search Engine for Querying XML Data with Relevance Ranking

Conclusion

Goal:
should be able to find results for every search in one day (computer time) with < 1 min intellectual effort that the best human experts can find with infinite time

Research avenue:
explore and leverage synergies between XML (querying), (relevance-ranking) IR, (domain-specific or personal) ontologies, and machine learning (for classification, annotation, etc.)

pursued in CLASSIX project (joint DFG project with Norbert Fuhr's group in Dortmund)