the index-based xxl search engine for querying xml data with relevance ranking


The Index-based XXL Search Engine for Querying XML Data with Relevance Ranking

Anja Theobald and Gerhard Weikum
University of the Saarland

Saarbrücken, Germany

http://www-dbs.cs.uni-sb.de

Conclusion

Problem:
• diversity of Web / Intranet data: despite XML, a global schema is a myth
• users are swamped with results or are looking for needles in haystacks

Our contribution:
• combine XML querying with relevance ranking
• demonstrate efficiency and search result quality with the XXL search engine prototype


Outline

• Adding relevance to XML

• The XXL search engine: index-based query processing

• Experiments

XML Data Graph

<Uni> ETH Zürich
  <Fak> Nat.-Techn. Fak. I
    <FR> Fachrichtung Informatik
      <Lehre> ...
        <Hauptstudium>
          <Vorlesung> Leistungsanalyse
            <Dozent> ... </>
            <Inhalt> ... Warteschlangen ... </>
            <Lit href=springer/nelson.xml >
            <Lit href=... >
          </Vorlesung>
          <Vorlesung> Sprachverarbeitung
            <Inhalt> ... Markovketten ... </>
          </Vorlesung>
          ...
      </Lehre> ...
    </FR> ...
  </Fak> ...
</Uni>

<Uni> Uni Stuttgart
  <Fak> Nat.-Techn. Fak. I
    <FR> Fachrichtung Informatik
      <Lehre> ...
        <Hauptstudium>
          <Vorlesung> Leistungsanalyse
            <Dozent> ... </>
            <Inhalt> ... Warteschlangen ... </>
            <Lit href=springer/nelson.xml >
            <Lit href=... >
          </Vorlesung>
          <Vorlesung> Sprachverarbeitung
            <Inhalt> ... Markovketten ... </>
          </Vorlesung>
          ...
      </Lehre> ...
    </FR> ...
  </Fak> ...
</Uni>

...

<Uni> Uni Saarland
  <School> Math & Engineering
    <Dept> CS
      <Teaching> ...
        <GradStudies>
          <Course> Performance analysis
            <Lecturer> ... </>
            <Content> Queueing models .. </>
            <Lit href=springer/nelson.xml >
            <Lit href=... >
          </Course>
          <Course> Speech processing
            <Content> ... Markov chains... </>
          </Course>
          ...
      </Teaching> ..
    </Dept> ..
  </School> ...
</Uni>

[Figure: data graph for the documents above. Node labels under Uni: Uni Saarland: School: ..., Dept: ... CS ..., Teaching, GradStudies, Course: Speech processing, Course: Performance analysis, Content: ... Queueing models, Content: ... Markov chains ..., Lit:, and a Book node with Title: Stochastic..., Author: R. Nelson, Review: ... Chapter on Markov chains. Under Uni: Uni Stuttgart: School: CS, Course: Mobile Comm., Prerequisites: ... Markov processes. Under Uni: Uni Augsburg: Curriculum: E Commerce, Weekend: Data Mining.]

Semistructured data: elements, attributes, links organized as labeled graph
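A minimal Python sketch of such a labeled graph may help make the data model concrete. The `Node` class, its field names, and the `descendants` helper are illustrative assumptions, not part of XXL; the sample nodes are a simplified excerpt of the Uni Saarland document.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One XML element in the data graph (hypothetical structure)."""
    name: str                                        # element name, e.g. "Course"
    content: str = ""                                # text content of the element
    attributes: dict = field(default_factory=dict)   # e.g. {"href": "..."}
    children: list = field(default_factory=list)     # parent-child edges
    links: list = field(default_factory=list)        # href edges to other elements


def descendants(node, seen=None):
    """Yield every node reachable via child or link edges, each at most once."""
    seen = set() if seen is None else seen
    for nxt in node.children + node.links:
        if id(nxt) not in seen:
            seen.add(id(nxt))
            yield nxt
            yield from descendants(nxt, seen)


# Simplified excerpt of the example data.
book = Node("Book", children=[Node("Title", "Stochastic..."),
                              Node("Author", "R. Nelson")])
lit = Node("Lit", attributes={"href": "springer/nelson.xml"}, links=[book])
course = Node("Course", "Performance analysis",
              children=[Node("Content", "... Queueing models"), lit])
uni = Node("Uni", "Uni Saarland", children=[course])

names = [n.name for n in descendants(uni)]
```

Following the link edge from `Lit` makes the `Book` element reachable from the `Uni` node, which is what lets a path query cross document boundaries.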

XML Querying

[Figure: the data graph from the previous slide, here with an additional node Outline: ... statistical methods for classification ...]

Select U, C
From www.allunis.de/unis.xml
Where Uni As U
And U.#.School?.#.(Inst | Dept)+ As D
And D Like "%CS%"
And D.#.Course As C
And C.# Like "%Markov chain%"

Regular expressions over path labels + logical conditions over element contents
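As a rough illustration of the two kinds of conditions, the sketch below hand-translates one path expression and the Like patterns into Python regular expressions. It assumes that `#` matches any sequence of zero or more element names and that `%` in a Like pattern matches any substring; the function names and the slash-joined path encoding are assumptions made for this example, not XXL's implementation.

```python
import re


def like_to_regex(pattern):
    """Translate a Like pattern into a regex, treating '%' as 'any substring'."""
    parts = [re.escape(part) for part in pattern.split("%")]
    return re.compile("^" + ".*".join(parts) + "$", re.DOTALL)


# Hand translation of  Uni.#.School?.#.(Inst | Dept)+  over paths whose
# element names are joined with '/'.
PATH_TO_D = re.compile(r"^Uni(/[^/]+)*(/School)?(/[^/]+)*(/(Inst|Dept))+$")


def path_matches(labels):
    """Check whether a list of element names satisfies the path expression."""
    return PATH_TO_D.match("/".join(labels)) is not None


cs_condition = like_to_regex("%CS%")
markov_condition = like_to_regex("%Markov chain%")

ok_path = path_matches(["Uni", "School", "Dept"])
bad_path = path_matches(["Uni", "School", "Teaching"])
ok_content = markov_condition.match("... Markov chains ...") is not None
```

A path ending in `Teaching` fails because the expression requires the last step to be `Inst` or `Dept`.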

XML Querying

[Figure: the same data graph, with the query conditions evaluated step by step. The node labels shown for each condition are listed below.]

Uni As U: Uni:, Uni:, Uni:
U.#.School?.#.(Inst | Dept)+ As D: School:, School:, School:, Dept:
D Like "%CS%": CS, CS
D.#.Course As C: Course:, Course:, Course:
C.# Like "%Markov chain%": Markov chains, Markov chains
Result: U, C

Boolean vs. Ranked Retrieval

There is no global schema for Intranets or the Web.
Relevance ranking of results is absolutely crucial!

Ranked Retrieval with XXL

[Figure: the data graph from the preceding slides.]

Select U, C
From www.allunis.de/unis.xml
Where Uni As U
And U.# As D
And D ~~ "CS"
And D.#.~Course As C
And C.# ~~ "Markov chain"


Ranked Retrieval with XXL

[Figure: the data graph from the preceding slides.]

Select U, C
From www.allunis.de/unis.xml
Where Uni As U
And U.# As D
And D ~~ "Computer Science"
And D.#.~Course As C
And C.# ~~ "Markov chain"

Result ranking of XML data based on semantic similarity


Outline

Adding relevance to XML

• The XXL search engine: index-based query processing

• Experiments

XXL: Flexible XML Search Language

Where clause: conjunction of regular path expressions with binding of variables

Extensible, simple core language

Select F, D, S
From www.allunis.de/unis.xml
Where Uni.#.School?.#.(Inst|Dept) As F
And F.#.Lecturer As D
And F.#.Student As S
And D.Name = S.Name
And D.Area Like "%XML%"

Elementary conditions on element/attribute names and contents

Semantic similarity conditions on names and contents:

... F.#.~Lecturer As D And D.~Area ~~ "XML"

Based on tf*idf similarity of contents, ontological similarity of names, and probabilistic combination of conditions
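The tf*idf part of a content similarity condition can be sketched as follows. The slides do not give the exact weighting formula, so this uses one common textbook variant (relative term frequency times log inverse document frequency); the function name, the tokenized input format, and the sample contents are assumptions for illustration only.

```python
import math


def tf_idf(term, element_tokens, all_elements):
    """Score one term against one element's content (textbook tf*idf variant)."""
    if not element_tokens:
        return 0.0
    tf = element_tokens.count(term) / len(element_tokens)
    df = sum(1 for tokens in all_elements if term in tokens)
    idf = math.log(len(all_elements) / df) if df else 0.0
    return tf * idf


# Hypothetical tokenized element contents.
elements = [
    ["xml", "query", "xml"],
    ["markov", "chain"],
    ["xml", "index"],
]

score_xml = tf_idf("xml", elements[0], elements)
score_markov = tf_idf("markov", elements[0], elements)
```

A term that occurs in every element gets an idf of zero under this variant, so it contributes nothing to the ranking.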

XXL Result Ranking

Query:
Where Uni.#.School?.#.(Inst|Dept)+ As D
And D.#.~Lecturer As D
And D.~Area ~~ "XML"

[Figure: data graph and result graph. Data graph nodes: Uni: UniSaarland; Dept: CS; Dept: Math; Prof: GW; Teaching; Project: IR for semistruct. data; Course: IR; Seminar: XML; Project: Digital libraries. Result graph nodes: Uni: UniSaarland; Dept: CS; Dept: Math; Prof: GW; Project: IR for semistruct. data, annotated with the local scores 1.0, 1.0, 0.9, 0.8, 0.6.]

Relevance score: 0.432 = 1.0 * 1.0 * 0.9 * 0.8 * 0.6
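The arithmetic above multiplies the local scores along a result path. A short sketch, assuming the conditions are combined as a plain product (the function name is illustrative):

```python
from functools import reduce
from operator import mul


def relevance(local_scores):
    """Combine local scores of one result path by multiplication."""
    return reduce(mul, local_scores, 1.0)


score = relevance([1.0, 1.0, 0.9, 0.8, 0.6])
print(round(score, 3))  # 0.432
```

Because every factor is at most 1.0, each additional approximate match can only lower the overall relevance of a result.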

XXL Search Engine

[Figure: architecture of the XXL search engine. Components: WWW, XXL applet, XXL servlets, query processor, path indexer, content indexer, ontology.]

Select ... Where Uni.#.(Inst|Dept) As F
And F ~~ "Computer Science"
And F.#.~Course.# ~~ "Markov Chains"

Subexpressions:
Uni.#.(Inst|Dept) As F
F ~~ "Computer Science"
F.#.~Course.# ~~ "Markov Chains"
F.#.~Seminar.# ~~ "Markov Chains"

• Query decomposition into index-supported subexpressions
• Wide range of optimizations

Index Structures

Element Path Index:
Uni, {id1, {<School, {id13, id14}>, <Prof, {id111, id117, id119}>}, id2, {<Prof, {id15}>} }
School, {id13, {<Dean, {id27}>, <Dept, {id31, id32, id33}>}, id14, { ... } }
materializes all (parent, child) element name pairs and dynamically checks transitive connectivity

Element Content Index:
Engineering, idf=..., {<id79, tf=...>, <id85, tf=...>}
XML, idf=..., {<id46, tf=...>, <id49, tf=...>, <id53, tf=...>}
precomputes all term occurrences in element contents, with frequency statistics

Element Ontology Index:
Course, {<Seminar, 0.9>, <Project, 0.7>}, {<Teaching, 0.9>} {<Telecourse, 0.9>, <Video lecture, 0.7>, <Meditation, 0.1>}
contains synonyms, hypernyms, and hyponyms of element names, and "semantic" distances
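The three index structures can be pictured as nested dictionaries. This is only a sketch: the ids and similarity values are taken from the slide, but the idf and tf numbers are made up (the slide elides them), the assignment of the three ontology sets to synonyms, hypernyms, and hyponyms follows the order of the description and is an assumption, and the helper functions are hypothetical.

```python
# Element path index: element name -> element id -> child name -> child ids.
path_index = {
    "Uni": {
        "id1": {"School": ["id13", "id14"], "Prof": ["id111", "id117", "id119"]},
        "id2": {"Prof": ["id15"]},
    },
    "School": {
        "id13": {"Dean": ["id27"], "Dept": ["id31", "id32", "id33"]},
    },
}

# Element content index: term -> (idf, {element id: tf}).
# The idf and tf values here are invented for the example.
content_index = {
    "Engineering": (1.2, {"id79": 1, "id85": 2}),
    "XML": (0.8, {"id46": 3, "id49": 1, "id53": 1}),
}

# Element ontology index: name -> related names with similarity values.
ontology_index = {
    "Course": {
        "synonyms": {"Seminar": 0.9, "Project": 0.7},
        "hypernyms": {"Teaching": 0.9},
        "hyponyms": {"Telecourse": 0.9, "Video lecture": 0.7, "Meditation": 0.1},
    },
}


def children(name, element_id, child_name):
    """Look up the ids of child elements with a given name."""
    return path_index.get(name, {}).get(element_id, {}).get(child_name, [])


def similar_names(name, threshold):
    """Expand ~name into all names whose similarity reaches the threshold."""
    result = {name: 1.0}
    for related in ontology_index.get(name, {}).values():
        for other, sim in related.items():
            if sim >= threshold:
                result[other] = max(sim, result.get(other, 0.0))
    return result


depts = children("School", "id13", "Dept")
expanded = similar_names("Course", 0.8)
```

With a threshold of 0.8, `~Course` expands to Course, Seminar, Teaching, and Telecourse, which is how a condition on `~Course` can also match `Seminar` elements.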

Query Decomposition & Evaluation

• decompose query into subqueries
• choose global evaluation order of subqueries
• represent subquery as NFSA
• for each subquery choose local evaluation strategy (top-down or bottom-up)
• evaluate subexpressions using indexes
• compute subquery result paths with relevance scores
• combine result paths into result graph

Example query:
Uni.#.(Inst|Dept)+ As F
And F ~~ "Computer Science"
And F.#.~Course.# ~~ "Markov Chains"

Example of subquery NFSA: Uni.#.(Inst|Dept)+

[Figure: NFSA with transitions labeled Uni, %, Inst, Dept.]
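The NFSA for the subquery Uni.#.(Inst|Dept)+ can be simulated with a set of active states. The state numbering and transition table below are a hand-built assumption that reproduces the labels in the figure (Uni, %, Inst, Dept); they are not taken from the XXL implementation.

```python
ANY = "%"  # wildcard transition: matches any element name

# (source state, symbol, target state)
TRANSITIONS = [
    (0, "Uni", 1),
    (1, ANY, 1),      # the '#' part: skip arbitrary elements
    (1, "Inst", 2),
    (1, "Dept", 2),
    (2, "Inst", 2),   # the '+' part: further Inst/Dept steps
    (2, "Dept", 2),
]
START = {0}
ACCEPT = {2}


def accepts(labels):
    """Run the NFSA over a path of element names using a set of states."""
    states = set(START)
    for label in labels:
        states = {dst for (src, sym, dst) in TRANSITIONS
                  if src in states and sym in (ANY, label)}
        if not states:
            return False
    return bool(states & ACCEPT)


matched = accepts(["Uni", "School", "Dept"])
rejected = accepts(["Uni", "Dept", "Course"])
```

Reading the path from the root corresponds to a top-down evaluation; a bottom-up strategy would run the reversed automaton from candidate `Inst` or `Dept` elements towards the root.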


WWW / Intranet

The Role of Ontologies

<Uni> Univ. Saarland
  <School> Engineering
    <Dept> Computer Science
      <Faculty> Prof. Dr. GW
        <Project> Semistructured Data ... XML</>
        ...

[Figure: concept graph with nodes Course, Prof, Dept, Institute, Research, Teaching, Project, Seminar, University, Publication, Conference, Journal.]

$\forall c\, (\mathrm{Course}(c) \Rightarrow \exists s\, ((\mathrm{Dept}(s) \lor \mathrm{Inst}(s)) \land \mathrm{Curriculum}(c,s)))$

Observation: Information becomes better searchable when it is more explicitly structured and canonically annotated

"Poor man's ontology":
• graph of concepts capturing hypernym/hyponym relationships (e.g., from WordNet)
• quantitative reasoning ("semantic similarity" measures)


Outline

Adding relevance to XML

The XXL search engine: index-based query processing

• Experiments

Example Data

Example Query

SELECT *
FROM INDEX
WHERE ~drama.#.scene AS C
AND C.speech AS S
AND (S.speaker ~ "Woman")
AND S.line AS L
AND (L.CONTENT ~ "leader")
AND C.speech AS M
AND (M.speaker = "MACBETH")

Example Ontology

thane -- (a feudal lord or baron in Scotland)
  => lord, noble, nobleman -- (a titled peer of the realm)
    => male aristocrat -- (a man who is an aristocrat)
      => leader -- (a person who rules or guides or inspires others)

Example Ontology

woman, adult female -- (an adult female person)
  => amazon, virago -- (a large strong and aggressive woman)
  => donna -- (an Italian woman of rank)
  => geisha, geisha girl -- (...)
  => lady -- (a polite name for any woman)
  ...
  => wife -- (a married woman, a man's partner in marriage)
  => witch -- (a being, usually female, imagined to have special powers derived from the devil)
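One way to turn such hypernym chains into a number is to count the steps between two concepts and let similarity decay with distance. The sketch below encodes a few of the relationships shown above; the decay factor of 0.8 and both function names are purely hypothetical, and the slides do not state which similarity measure XXL actually uses.

```python
# child concept -> parent concept, taken from the two ontology excerpts
HYPERNYM = {
    "thane": "lord",
    "lord": "male aristocrat",
    "male aristocrat": "leader",
    "lady": "woman",
    "wife": "woman",
    "witch": "woman",
}


def hypernym_distance(word, target):
    """Number of hypernym steps from word up to target, or None if unreachable."""
    steps = 0
    seen = set()
    while word != target:
        if word not in HYPERNYM or word in seen:
            return None
        seen.add(word)
        word = HYPERNYM[word]
        steps += 1
    return steps


def name_similarity(word, target, decay=0.8):
    """Hypothetical measure: similarity shrinks by `decay` per hypernym step."""
    distance = hypernym_distance(word, target)
    return 0.0 if distance is None else decay ** distance


thane_to_leader = hypernym_distance("thane", "leader")
witch_to_woman = hypernym_distance("witch", "woman")
```

Under this sketch, a line mentioning "thane" is a weaker match for "leader" than a speaker "Witch" is for "Woman", since the former is three steps away and the latter only one.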

Example Results

Relevance = 0.0070400005

<scene>
  <speech>
    <speaker> Second Witch </speaker>
    <line> All hail, Macbeth, hail to thee, thane of Cawdor! </line>
  </speech>
  <speech>
    <speaker> MACBETH </speaker>
    <line> ... </line>
  </speech>
</scene>

XXL Runtime Measurements

Q1:
Select * From Index
Where #.publication AS A
And A.~headline ~~ "XML"
And A.author% AS B

Q2:
Select * From Index
Where #.play AS A
And A.#.personae AS B
And B.~figure ~~ "King"
And B.title AS C

                                      Q1                          Q2
#results, top-down:                   13114.3 sec                 588.5 sec
bottom-up:                            694 sec                     3.7 sec
w/ optimization:                      2.68 sec (incl. 0.37 sec)   4.64 sec (incl. 0.33 sec)
evaluation order of subqueries 1-4
(bu = bottom-up, td = top-down):      2bu 1bu 3td                 1bu 2td 3td 4td

Test data: 100 XML documents with a total of 240 000 elements (ot.xml, nt.xml, ..., hamlet.xml, macbeth.xml, ..., SigmodRecord.xml)

Conclusion

Goal:
should be able to find results for every search in one day (computer time) with < 1 min intellectual effort that the best human experts can find with infinite time

Research avenue:
explore and leverage synergies between XML (querying), (relevance-ranking) IR, (domain-specific or personal) ontologies, and machine learning (for classification, annotation, etc.)

pursued in CLASSIX project (joint DFG project with Norbert Fuhr's group in Dortmund)
