xml retrieval: from modelling to evaluation mounia lalmas queen mary university of london...

XML Retrieval: from modelling to evaluation

Mounia Lalmas

Queen Mary University of London


qmir.dcs.qmul.ac.uk

Outline

Structured document retrieval

XML

Content-oriented XML retrieval

Evaluation

Structured Document Retrieval

Traditional IR is about finding documents relevant to a user’s information need, e.g. an entire book.

SDR allows users to retrieve document components that are more focused on their information needs, e.g. a chapter of a book instead of an entire book.

The structure of documents is exploited to identify which document components to retrieve.

Structured Documents

Linear order of words, sentences, paragraphs …

Hierarchy or logical structure of a book’s chapters, sections …

Links (hyperlink), cross-references, citations …

Temporal and spatial relationships in multimedia documents

[Figure: a Book broken down into Chapters, Sections and Paragraphs; a World Wide Web page.]

Structured Documents

Explicit structure formalised through document representation standards (mark-up languages)

Layout: LaTeX (publishing), HTML (Web publishing)

Structure: SGML, XML (Web publishing, engineering), MPEG-7 (broadcasting)

Content/Semantic: RDF, DAML+OIL, OWL (semantic web)


<b><font size=+2>SDR</font></b><img src="qmir.jpg" border=0>

<section>
  <subsection>
    <paragraph>… </paragraph>
    <paragraph>… </paragraph>
  </subsection>
</section>

<Book rdf:about="book"> <rdf:author=".."/> <rdf:title="…"/></Book>

Outline

Structured document retrieval

XML

Content-oriented XML retrieval

Evaluation

XML: eXtensible Mark-up Language

• Meta-language (user-defined tags) being adopted as the document format language by W3C

• Used to describe content and structure (and not layout)

• Grammar described in a DTD (used for validation)

<lecture>
  <title> Structured Document Retrieval </title>
  <author> <fnm> Smith </fnm> <snm> John </snm> </author>
  <chapter>
    <title> Introduction into XML retrieval </title>
    <paragraph> …. </paragraph>
    …
  </chapter>
  …
</lecture>

<!ELEMENT lecture (title, author+, chapter+)>
<!ELEMENT author (fnm*, snm)>
<!ELEMENT fnm (#PCDATA)>
…

XML: eXtensible Mark-up Language

Use of XPath notation to refer to the XML structure

chapter/title: title is a direct sub-component of chapter

//title: any title

chapter//title: title is a direct or indirect sub-component of chapter

chapter/paragraph[2]: any direct second paragraph of any chapter

chapter/*: all direct sub-components of a chapter

<lecture>
  <title> Structured Document Retrieval </title>
  <author> <fnm> Smith </fnm> <snm> John </snm> </author>
  <chapter>
    <title> Introduction into SDR </title>
    <paragraph> …. </paragraph>
    …
  </chapter>
  …
</lecture>
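The XPath expressions above can be tried out on the lecture example. The following is a minimal sketch using Python's standard xml.etree.ElementTree module, which supports only a limited subset of XPath; the paragraph text and the helper name select are invented for illustration, since the slide elides the paragraph content.

```python
import xml.etree.ElementTree as ET

# Paragraph text is invented for illustration; the slide elides it.
LECTURE_XML = """
<lecture>
  <title>Structured Document Retrieval</title>
  <author><fnm>Smith</fnm><snm>John</snm></author>
  <chapter>
    <title>Introduction into SDR</title>
    <paragraph>First paragraph.</paragraph>
    <paragraph>Second paragraph.</paragraph>
  </chapter>
</lecture>
"""

root = ET.fromstring(LECTURE_XML)  # root is the <lecture> element


def select(path):
    """Return the stripped text of every element matched by path, relative to <lecture>."""
    return [(el.text or "").strip() for el in root.findall(path)]


# title is a direct sub-component of chapter
direct_titles = select("chapter/title")
# any title (ElementTree needs the leading "." for a relative search)
all_titles = select(".//title")
# title is a direct or indirect sub-component of chapter
nested_titles = select("chapter//title")
# the second paragraph directly under a chapter
second_paragraphs = select("chapter/paragraph[2]")
# all direct sub-components of a chapter
chapter_children = [el.tag for el in root.findall("chapter/*")]

print(direct_titles, all_titles, nested_titles, second_paragraphs, chapter_children)
```

Paths are evaluated relative to the lecture element here, so //title is written as .//title.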

Querying XML documents

Content-only (CO) queries

'open standards for digital video in distance learning'

Content-and-structure (CAS) queries

//article [about(., 'formal methods verify correctness aviation systems')] /body//section [about(.,'case study application model checking theorem proving')]

Structure-only (SA) queries

/article//*section/paragraph[2]

Passage retrieval

Fixed-length (e.g. 300-word windows, overlapping)

Discourse (e.g. sentence, paragraph): according to logical structure, but fixed

Semantic (e.g. TextTiling)

Retrieval: e.g. rank document based on highest-ranking passage or sum of ranking scores for all passages

Deal principally with CO queries

[Figure: a document (doc) divided into passages p1 to p6]
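As a rough illustration of the fixed-length variant, the sketch below splits a document into overlapping word windows, scores each window, and ranks the document by either the best passage or the sum over passages. The scoring function (a plain count of query-term occurrences) and all names are assumptions made for this example; the slide does not prescribe a passage scoring function.

```python
def fixed_windows(words, size, step):
    """Split a word list into windows of `size` words; step < size makes them overlap."""
    last_start = max(len(words) - size, 0)
    return [words[s:s + size] for s in range(0, last_start + step, step)]


def passage_score(passage, query_terms):
    """Assumed stand-in score: number of query-term occurrences in the passage."""
    return sum(1 for word in passage if word in query_terms)


def document_score(text, query, size=300, step=150, combine=max):
    """Score a document from its passages; combine is max (best passage) or sum."""
    words = text.lower().split()
    query_terms = set(query.lower().split())
    scores = [passage_score(p, query_terms) for p in fixed_windows(words, size, step)]
    return combine(scores)


# Two toy documents: doc_a concentrates the query terms, doc_b spreads them out.
doc_a = "xml retrieval " * 3 + "filler " * 20
doc_b = ("xml " + "filler " * 5) * 4
query = "xml retrieval"

best_a = document_score(doc_a, query, size=6, step=3, combine=max)
best_b = document_score(doc_b, query, size=6, step=3, combine=max)
total_a = document_score(doc_a, query, size=6, step=3, combine=sum)
total_b = document_score(doc_b, query, size=6, step=3, combine=sum)
print(best_a, best_b, total_a, total_b)
```

With overlapping windows, the sum counts a matching term once per window it falls in, so the two combination rules can rank documents differently.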

Database approaches to XML retrieval

Relational, OO, Native

Flexibility, expressiveness, complexity

Efficiency

Data-oriented retrieval
– containment and not aboutness
– no relevance-based ranking

Aims/challenges tend to focus on efficiency and performance

XQuery

Outline

Structured document retrieval

XML

Content-oriented XML retrieval: a definition, challenges, approaches

Evaluation


Content-oriented XML retrieval

Return document components of varying granularity (e.g. a book, a chapter, a section, a paragraph, a table, a figure, etc.), relevant to the user’s information need with regard to both content and structure.


Retrieve the best components according to content and structure criteria:

INEX: most specific component that satisfies the query, while being exhaustive to the query

Shakespeare study: best entry points, which are components from which many relevant components can be reached through browsing

???

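Challenge 1: term weights

[Figure: an Article whose weights for the query terms XML, retrieval and authoring are unknown (?XML, ?retrieval, ?authoring), with components Title, Section 1 and Section 2. Component term weights shown: 0.9 XML, 0.5 XML, 0.2 XML, 0.4 retrieval, 0.7 authoring.]

No fixed retrieval unit + nested document components:
how to obtain document and collection statistics (e.g. tf, idf)?
which aggregation formalism to use?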
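To make the statistics question concrete, here is a small sketch that computes term frequencies for every element of a tree, counting the text of nested components towards their ancestors, together with an element frequency (the number of elements containing a term). This is only one possible choice, not the answer the slide has in mind; the sample article and the function names are invented for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample article; element text is illustrative only.
ARTICLE_XML = """
<article>
  <title>XML retrieval</title>
  <section>XML retrieval models</section>
  <section>XML authoring tools</section>
</article>
"""


def own_terms(el):
    """Terms in the text directly owned by el (its text plus the tails of its children)."""
    parts = [el.text or ""] + [child.tail or "" for child in el]
    return Counter(" ".join(parts).lower().split())


def nested_tf(el, path, out):
    """Store in out[path] the term counts of el including all of its descendants."""
    counts = own_terms(el)
    seen = Counter()  # per-tag sibling counter, used to build XPath-style paths
    for child in el:
        seen[child.tag] += 1
        counts += nested_tf(child, f"{path}/{child.tag}[{seen[child.tag]}]", out)
    out[path] = counts
    return counts


root = ET.fromstring(ARTICLE_XML)
tf = {}
nested_tf(root, "/article", tf)

# Element frequency: in how many elements (at any level) does each term occur?
element_freq = Counter(term for counts in tf.values() for term in counts)

print(tf["/article"])
print(element_freq)
```

Because every term of a section is also counted in the article, nested units inflate such statistics, which is exactly why the choice of statistics is listed as a challenge.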


Challenge 2: augmentation weights

[Figure: the same Article tree and term weights as in Challenge 1, with augmentation weights 0.4, 0.5 and 0.2 added.]

Nested document components:
which components contribute best to content of Article?
how to estimate augmentation weights (e.g. size, number of children)?
how to aggregate term and augmentation weights?
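One way to picture the aggregation question is the sketch below, which propagates term weights from the components up to the Article by down-weighting each component's term weight with its augmentation weight and combining the results with a probabilistic OR. This is an assumed formalism chosen for illustration, not necessarily the one intended on the slide. The assignment of the figure's numbers to components (Title: 0.9 XML, augmentation 0.4; Section 1: 0.5 XML and 0.4 retrieval, augmentation 0.5; Section 2: 0.2 XML and 0.7 authoring, augmentation 0.2) is likewise an assumption, since the figure layout is lost.

```python
def propagate(children):
    """Aggregate term weights of child components into their parent.

    children: list of (augmentation_weight, {term: weight}) pairs.
    Each child contributes augmentation_weight * weight, and contributions
    are combined with a probabilistic OR: 1 - prod(1 - contribution).
    """
    not_present = {}
    for augmentation, weights in children:
        for term, weight in weights.items():
            not_present[term] = not_present.get(term, 1.0) * (1.0 - augmentation * weight)
    return {term: 1.0 - p for term, p in not_present.items()}


# Assumed mapping of the figure's numbers to components (see lead-in).
title = (0.4, {"xml": 0.9})
section_1 = (0.5, {"xml": 0.5, "retrieval": 0.4})
section_2 = (0.2, {"xml": 0.2, "authoring": 0.7})

article = propagate([title, section_1, section_2])
print(article)
```

Other aggregation choices (for example a weighted sum) would fit the same interface; the point is only that term and augmentation weights must be combined by some explicit rule.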


Challenge 3: component weights

[Figure: the same Article tree, term weights and augmentation weights (0.4, 0.5, 0.2) as in Challenge 2, with further weights 0.6, 0.4, 0.4 and 0.2 added for the components.]

Different types of document components:
which component is a good retrieval unit?
how to estimate component weights (frequency, user studies)?
how to aggregate term, augmentation and component weights?

Approaches …

vector space model

probabilistic model

bayesian network

language model

extending DB model

boolean model

natural language processing

cognitive model

ontology

parameter estimation

tuning

smoothing

fusion

phrase

term statistics

collection statistics

component statistics

proximity search

logistic regression

belief model

relevance feedback

Vector space model

Separate indexes: article index, abstract index, section index, sub-section index, paragraph index

Each index: RSV → normalised RSV

Merge the normalised results from all indexes

tf and idf as for fixed and non-nested retrieval units

(IBM Haifa, INEX 2003)
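The pipeline on this slide (one index per component type, a retrieval status value per index, normalisation, merge) can be sketched as follows. The tf-idf weighting and, in particular, the normalisation by the top score of each index are assumptions made for this example; the slide does not say how the RSVs are normalised, and this is not presented as the IBM Haifa implementation. All function names and sample texts are invented.

```python
import math
from collections import Counter


def build_index(units):
    """units: {unit_id: text}. tf and idf are computed within this index only."""
    tfs = {uid: Counter(text.lower().split()) for uid, text in units.items()}
    df = Counter(term for tf in tfs.values() for term in tf)
    idf = {term: math.log(1 + len(tfs) / d) for term, d in df.items()}
    return tfs, idf


def rsv(query, tfs, idf):
    """Length-normalised tf-idf dot product between the query and every unit."""
    q = Counter(query.lower().split())
    scores = {}
    for uid, tf in tfs.items():
        dot = sum(q[t] * tf[t] * idf.get(t, 0.0) ** 2 for t in q)
        norm = math.sqrt(sum((f * idf[t]) ** 2 for t, f in tf.items()))
        scores[uid] = dot / norm if norm else 0.0
    return scores


def normalise(scores):
    """Assumed normalisation: divide by the best score of the same index."""
    top = max(scores.values(), default=0.0)
    return {uid: (s / top if top else 0.0) for uid, s in scores.items()}


def search(query, indexes):
    """Run the query against every index and merge the normalised results."""
    merged = []
    for unit_type, (tfs, idf) in indexes.items():
        for uid, score in normalise(rsv(query, tfs, idf)).items():
            merged.append((score, unit_type, uid))
    return sorted(merged, reverse=True)


indexes = {
    "article": build_index({"a1": "xml retrieval evaluation", "a2": "database systems"}),
    "paragraph": build_index({
        "a1/p1": "xml retrieval",
        "a1/p2": "evaluation metrics",
        "a2/p1": "database systems",
    }),
}
results = search("xml retrieval", indexes)
for score, unit_type, uid in results:
    print(f"{score:.2f} {unit_type} {uid}")
```

Keeping one index per component type means tf and idf can be computed exactly as for flat documents, at the cost of needing comparable scores across indexes before merging.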

Language model

[Equation: element score, combining the element language model and the collection language model through a smoothing parameter]

[Equation: rank element, combining element size, element score and article score]

query expansion with blind feedback

ignore short elements (20-term threshold)

high value of the smoothing parameter leads to increase in size of retrieved elements

results with smoothing parameter = 0.9, 0.5 and 0.2 similar

(University of Amsterdam, INEX 2003)
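Since the formulas on this slide did not survive extraction, the sketch below shows only the general idea of scoring an element with a smoothed language model: a query-likelihood score in which the element model is linearly interpolated with the collection model. The interpolation form, the parameter name lam (here the weight on the element model) and the sample texts are assumptions for illustration; the element size and article score components named on the slide are not modelled.

```python
import math
from collections import Counter


def lm_score(query, element_text, collection_text, lam=0.5):
    """log P(query | element) with linear interpolation smoothing.

    lam is the assumed weight on the element model; (1 - lam) goes to the
    collection model. Returns -inf if a query term is unseen everywhere.
    """
    elem = Counter(element_text.lower().split())
    coll = Counter(collection_text.lower().split())
    elem_len, coll_len = sum(elem.values()), sum(coll.values())
    log_p = 0.0
    for term in query.lower().split():
        p_elem = elem[term] / elem_len if elem_len else 0.0
        p_coll = coll[term] / coll_len if coll_len else 0.0
        p = lam * p_elem + (1.0 - lam) * p_coll
        if p == 0.0:
            return float("-inf")
        log_p += math.log(p)
    return log_p


collection = "xml retrieval xml authoring database systems"
score_a = lm_score("xml retrieval", "xml retrieval", collection)
score_b = lm_score("xml retrieval", "database systems", collection)
print(score_a, score_b)
```

Thanks to the collection model, an element that lacks a query term still gets a non-zero probability, so elements can be ranked rather than simply filtered.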

Outline

Structured document retrieval

XML

Content-oriented XML retrieval

Evaluation

Evaluation of XML retrieval: INEX

Evaluating the effectiveness of content-oriented XML retrieval approaches

Collaborative effort: participants contribute to the development of the collection (queries, relevance assessments)

Similar methodology as for TREC, but adapted to XML retrieval

40+ participants worldwide

Workshop in Schloss Dagstuhl in December (20+ institutions)

INEX Test Collection

Documents (~500MB), which consist of 12,107 articles in XML format from the IEEE Computer Society; 8 million elements

INEX 2002

30 CO and 30 CAS queries

CO and CAS ad hoc retrieval tasks

inex_eval metric

INEX 2003

36 CO and 30 CAS queries

CO, SCAS and VCAS ad hoc retrieval tasks

CAS queries are defined according to an enhanced subset of XPath

inex_eval and inex_eval_ng metrics

INEX 2004 is just starting

Relevance in INEX

Exhaustivity: how exhaustively a document component discusses the query: 0, 1, 2, 3

Specificity: how focused the component is on the query: 0, 1, 2, 3

Relevance: (3,3), (2,3), (1,1), (0,0), …

Use of an online assessment tool to ensure exhaustive and consistent assessments (assessing a query takes a week!)

[Figure: a section within an article]

all sections relevant → article very relevant

all sections relevant → article better than sections

one section relevant → article less relevant

one section relevant → section better than article

…

Metrics

Recall / precision - based

quantisation functions to obtain one relevance value

expected search length

penalise overlap

consider size

Others: expected ratio of relevant, cumulated gain-based metrics, tolerance to irrelevance

[Figure: a section within an article]
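The role of a quantisation function can be shown with a small sketch: it maps an (exhaustivity, specificity) assessment to a single relevance value, which a recall/precision-style measure can then consume. The strict function below counts only (3,3) as relevant; the graded function is a hypothetical mapping invented for this example and is not the official INEX generalised quantisation. The precision computation ignores overlap and size, which the metrics above try to account for.

```python
def strict(exhaustivity, specificity):
    """Strict quantisation: only highly exhaustive and highly specific, i.e. (3,3), counts."""
    return 1.0 if (exhaustivity, specificity) == (3, 3) else 0.0


def graded(exhaustivity, specificity):
    """Hypothetical graded mapping (NOT the official INEX function): product scaled to [0, 1]."""
    return (exhaustivity * specificity) / 9.0


def precision_at(k, ranked_assessments, quantise):
    """Average quantised relevance of the top k components of a ranking."""
    top = ranked_assessments[:k]
    return sum(quantise(e, s) for e, s in top) / k


# Assessments of a ranked list of components, as (exhaustivity, specificity) pairs.
ranking = [(3, 3), (2, 3), (1, 1), (0, 0)]

p_strict = precision_at(4, ranking, strict)
p_graded = precision_at(4, ranking, graded)
print(p_strict, p_graded)
```

The same ranking looks much better under a graded mapping than under the strict one, which is why results are usually reported per quantisation function.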

Lessons learnt

Good definition of relevance

Expressing CAS queries was not easy

Relevance assessment process must be “improved”

Further development on metrics needed

User studies required

INEX 2004

http://inex.is.informatik.uni-duisburg.de:2004/

Tracks

Relevance feedback

Interactive

Heterogeneous collection

Natural language query









