Page 1: XML Retrieval: from modelling to evaluation Mounia Lalmas Queen Mary University of London mounia@dcs.qmul.ac.uk qmir.dcs.qmul.ac.uk

XML Retrieval: from modelling to evaluation

Mounia Lalmas

Queen Mary University of London

mounia@dcs.qmul.ac.uk

qmir.dcs.qmul.ac.uk

Page 2

Outline

Structured document retrieval

XML

Content-oriented XML retrieval

Evaluation

Page 4

Structured Document Retrieval

Traditional IR is about finding documents relevant to a user’s information need, e.g. an entire book.

SDR allows users to retrieve document components that are more focused on their information needs, e.g. a chapter of a book instead of an entire book.

The structure of documents is exploited to identify which document components to retrieve.

Page 5

Structured Documents

Linear order of words, sentences, paragraphs …

Hierarchy or logical structure of a book’s chapters, sections …

Links (hyperlink), cross-references, citations …

Temporal and spatial relationships in multimedia documents

[Figure: a book’s logical hierarchy (Book, Chapters, Sections, Paragraphs) shown next to a World Wide Web page]

Page 6

Structured Documents

Explicit structure formalised through document representation standards (mark-up languages)

Layout: LaTeX (publishing), HTML (Web publishing)

    <b><font size=+2>SDR</font></b><img src="qmir.jpg" border=0>

Structure: SGML, XML (Web publishing, engineering), MPEG-7 (broadcasting)

    <section>
      <subsection>
        <paragraph>… </paragraph>
        <paragraph>… </paragraph>
      </subsection>
    </section>

Content/Semantic: RDF, DAML + OIL, OWL (semantic web)

    <Book rdf:about="book">
      <rdf:author=".."/>
      <rdf:title="…"/>
    </Book>

Page 7

Outline

Structured document retrieval

XML

Content-oriented XML retrieval

Evaluation

Page 8

XML: eXtensible Mark-up Language

• Meta-language (user-defined tags) being adopted as the document format language by W3C

• Used to describe content and structure (and not layout)

• Grammar described in DTD (used for validation)

    <lecture>
      <title> Structured Document Retrieval </title>
      <author> <fnm> Smith </fnm> <snm> John </snm> </author>
      <chapter>
        <title> Introduction into XML retrieval </title>
        <paragraph> …. </paragraph>
        …
      </chapter>
      …
    </lecture>

    <!ELEMENT lecture (title, author+, chapter+)>
    <!ELEMENT author (fnm*, snm)>
    <!ELEMENT fnm (#PCDATA)>
    …

Page 9

XML: eXtensible Mark-up Language

Use of XPath notation to refer to the XML structure

chapter/title: title is a direct sub-component of chapter
//title: any title
chapter//title: title is a direct or indirect sub-component of chapter
chapter/paragraph[2]: the second paragraph directly under any chapter
chapter/*: all direct sub-components of a chapter

    <lecture>
      <title> Structured Document Retrieval </title>
      <author> <fnm> Smith </fnm> <snm> John </snm> </author>
      <chapter>
        <title> Introduction into SDR </title>
        <paragraph> …. </paragraph>
        …
      </chapter>
      …
    </lecture>
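The path expressions on this slide can be tried out with Python's standard `xml.etree.ElementTree`, which supports a limited subset of XPath. This is only a sketch: the sample document fills the slide's "…" gaps with made-up paragraphs and a nested section, and because ElementTree paths are relative to the element they are called on, `//title` is written `.//title`.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the slide's <lecture>, with invented content in place of "…".
LECTURE = """
<lecture>
  <title>Structured Document Retrieval</title>
  <author><fnm>Smith</fnm><snm>John</snm></author>
  <chapter>
    <title>Introduction into SDR</title>
    <paragraph>First paragraph.</paragraph>
    <paragraph>Second paragraph.</paragraph>
    <section><title>A nested section</title></section>
  </chapter>
</lecture>
"""

root = ET.fromstring(LECTURE)  # root is the <lecture> element


def texts(path):
    """Stripped text of every element matched by `path`, relative to <lecture>."""
    return [el.text.strip() for el in root.findall(path)]


direct_titles = texts("chapter/title")             # title directly under a chapter
all_titles = texts(".//title")                     # any title below <lecture>
nested_titles = texts("chapter//title")            # direct or indirect titles of a chapter
second_paragraphs = texts("chapter/paragraph[2]")  # second paragraph child of a chapter
chapter_children = [el.tag for el in root.findall("chapter/*")]
```

Here `chapter/title` matches one element while `chapter//title` also reaches the title of the nested section, which is the direct versus indirect distinction made on the slide.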

Page 10

Querying XML documents

Content-only (CO) queries

'open standards for digital video in distance learning'

Content-and-structure (CAS) queries

//article [about(., 'formal methods verify correctness aviation systems')] /body//section [about(.,'case study application model checking theorem proving')]

Structure-only (SA) queries

/article//*section/paragraph[2]
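To make the shape of a CAS query concrete, the sketch below splits the example above into its structural steps and the keywords of each `about(., '…')` filter. It is a simplification built on a regular expression, not a parser for the full query language; `STEP` and `split_cas` are hypothetical names.

```python
import re

CAS_QUERY = (
    "//article[about(., 'formal methods verify correctness aviation systems')]"
    "/body//section[about(.,'case study application model checking theorem proving')]"
)

# One step = a path fragment followed by a single about(., '...') filter.
STEP = re.compile(r"([/\w*]+)\s*\[\s*about\(\s*\.\s*,\s*'([^']*)'\s*\)\s*\]")


def split_cas(query):
    """Return (path, keywords) pairs; a simplification, not a full query parser."""
    return [(path, text.split()) for path, text in STEP.findall(query)]


steps = split_cas(CAS_QUERY)
# Concatenating the path fragments gives the structural part of the query.
target_path = "".join(path for path, _ in steps)
```

The last step names the element type to return (here `section`), while the earlier step constrains the article that contains it.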

Page 11

Passage retrieval

Fixed-length (e.g. 300-word windows, overlapping)

Discourse (e.g. sentence, paragraph): according to logical structure but fixed

Semantic (e.g. TextTiling)

Retrieval: e.g. rank document based on highest-ranking passage or sum of ranking scores for all passages

Deals principally with CO queries

[Figure: a document (doc) split into passages p1 to p6]
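A minimal sketch of the fixed-length variant: overlapping windows over a token list, with the document scored either by its best passage or by the sum over passages. The passage score used here (a count of query-term occurrences) and the function names are assumptions for illustration only; small window sizes keep the example readable.

```python
def windows(tokens, size=300, step=150):
    """Overlapping fixed-length windows; the last one may be shorter."""
    last_start = max(len(tokens) - size, 0)
    return [tokens[i:i + size] for i in range(0, last_start + step, step)]


def passage_score(passage, query_terms):
    """Toy score: number of query-term occurrences in the passage."""
    return sum(1 for token in passage if token in query_terms)


def document_scores(tokens, query_terms, size=300, step=150):
    """Return (score of the best passage, sum of all passage scores)."""
    scores = [passage_score(p, query_terms) for p in windows(tokens, size, step)]
    return max(scores), sum(scores)


doc = "xml retrieval is fun but xml authoring is hard too".split()
passages = windows(doc, size=4, step=2)
best, total = document_scores(doc, {"xml", "retrieval"}, size=4, step=2)
```

Because windows overlap, the sum counts a term occurrence once per window that covers it, so the two document scores can rank documents differently.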

Page 12

Database approaches to XML retrieval

Relational, OO, native

Flexibility, expressiveness, complexity

Efficiency

Data-oriented retrieval
– containment and not aboutness
– no relevance-based ranking

Aims/challenges tend to focus on efficiency and performance

XQuery

Page 13

Outline

Structured document retrieval

XML

Content-oriented XML retrieval: a definition, challenges, approaches

Evaluation

Page 14

Content-oriented XML retrieval

Return document components of varying granularity (e.g. a book, a chapter, a section, a paragraph, a table, a figure, etc.), relevant to the user’s information need with regard to both content and structure.

Page 15

Content-oriented XML retrieval

Retrieve the best components according to content and structure criteria:

INEX: most specific component that satisfies the query, while being exhaustive to the query

Shakespeare study: best entry points, which are components from which many relevant components can be reached through browsing

???

Page 16

Challenge 1: term weights

[Figure: term weights in a nested document]
Article: ?XML, ?retrieval, ?authoring
    Title: 0.9 XML
    Section 1: 0.5 XML, 0.4 retrieval
    Section 2: 0.2 XML, 0.7 authoring

No fixed retrieval unit + nested document components:
how to obtain document and collection statistics (e.g. tf, idf)?
which aggregation formalism to use?
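The statistics question can be illustrated by treating every element as a candidate retrieval unit and counting terms over its full text, descendants included. The toy document and the names `tf`, `ef` (element frequency) and `idf` are assumptions for this sketch; it is one naive choice of statistics, not a recommended one.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical toy article with nested components.
DOC = (
    "<article><title>XML</title>"
    "<section><p>XML retrieval</p></section>"
    "<section><p>XML authoring</p><p>authoring tools</p></section>"
    "</article>"
)

root = ET.fromstring(DOC)
elements = list(root.iter())  # every element is a candidate retrieval unit


def terms(element):
    """Lower-cased tokens of an element, including the text of its descendants."""
    return " ".join(element.itertext()).lower().split()


tf = [Counter(terms(el)) for el in elements]      # per-element term frequency
ef = Counter(t for counts in tf for t in counts)  # number of elements containing t
idf = {t: math.log(len(elements) / n) for t, n in ef.items()}
article_weights = {t: tf[0][t] * idf[t] for t in tf[0]}  # elements[0] is <article>
```

Because of nesting, one occurrence of a term is counted in the paragraph, its section and the article, so "XML" appears in 6 of the 7 elements and its idf is close to zero. This inflation is one reason the choice of statistics is a challenge.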

Page 17

Challenge 2: augmentation weights

[Figure: term weights in a nested document, with augmentation weights on the edges]
Article: ?XML, ?retrieval, ?authoring
    Title: 0.9 XML
    Section 1: 0.5 XML, 0.4 retrieval
    Section 2: 0.2 XML, 0.7 authoring
Augmentation weights shown in the figure: 0.4, 0.5, 0.2

Nested document components:
which components contribute best to content of Article?
how to estimate augmentation weights (e.g. size, number of children)?
how to aggregate term and augmentation weights?
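One possible aggregation, sketched with the slide's term weights: each child's term weight is down-weighted by its augmentation weight and the contributions are combined as a probabilistic OR. Both the combination rule and the assignment of 0.4, 0.5 and 0.2 to Title, Section 1 and Section 2 are assumptions; the transcript does not say which edge carries which weight.

```python
# (term weights, augmentation weight) per child; the edge assignment is assumed.
CHILDREN = {
    "Title": ({"XML": 0.9}, 0.4),
    "Section 1": ({"XML": 0.5, "retrieval": 0.4}, 0.5),
    "Section 2": ({"XML": 0.2, "authoring": 0.7}, 0.2),
}


def aggregate(children):
    """Combine augmented child weights with a probabilistic OR (an assumption)."""
    not_about = {}
    for term_weights, augmentation in children.values():
        for term, weight in term_weights.items():
            # Probability that this child does NOT make the parent about the term.
            not_about[term] = not_about.get(term, 1.0) * (1.0 - augmentation * weight)
    return {term: 1.0 - p for term, p in not_about.items()}


article = aggregate(CHILDREN)
```

Under these assumptions the Article's weight for "XML" is 1 - (1 - 0.36)(1 - 0.25)(1 - 0.04) = 0.5392, lower than the Title's own 0.9, so a more specific component can outrank its parent.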

Page 18

Challenge 3: component weights

[Figure: term weights in a nested document, with augmentation weights on the edges and component weights on the nodes]
Article: ?XML, ?retrieval, ?authoring
    Title: 0.9 XML
    Section 1: 0.5 XML, 0.4 retrieval
    Section 2: 0.2 XML, 0.7 authoring
Augmentation weights shown in the figure: 0.4, 0.5, 0.2
Component weights shown in the figure: 0.6, 0.4, 0.4, 0.2

Different types of document components:
which component is a good retrieval unit?
how to estimate component weights (frequency, user studies)?
how to aggregate term, augmentation and component weights?

Page 19

Approaches …

vector space model

probabilistic model

bayesian network

language model

extending DB model

boolean model

natural language processing

cognitive model

ontology

parameter estimation

tuning

smoothing

fusion

phrase

term statistics

collection statistics

component statistics

proximity search

logistic regression

belief model

relevance feedback

Page 20

Vector space model

Separate indexes per element type: article, abstract, section, sub-section, paragraph

For each index: RSV → normalised RSV

Normalised results from all indexes are merged

tf and idf as for fixed and non-nested retrieval units

(IBM Haifa, INEX 2003)
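The per-index pipeline on this slide can be sketched as follows. This is an illustration under stated assumptions, not the cited system: the toy indexes, the tf-idf formula and the normalisation (dividing by the best score in each index) are all hypothetical choices.

```python
import math
from collections import Counter

# Hypothetical toy indexes, one per element type (only two types shown).
INDEXES = {
    "article": {"a1": "xml retrieval evaluation xml", "a2": "database systems"},
    "section": {"a1/s1": "xml retrieval", "a1/s2": "evaluation metrics", "a2/s1": "xml database"},
}


def rsv(units, query):
    """tf-idf retrieval status values, with statistics local to one index."""
    tfs = {uid: Counter(text.split()) for uid, text in units.items()}
    df = Counter(term for tf in tfs.values() for term in tf)
    n = len(units)
    return {
        uid: sum(tf[t] * math.log(1 + n / df[t]) for t in query if t in tf)
        for uid, tf in tfs.items()
    }


def normalise(scores):
    """Scale scores into [0, 1] by the best score of the index (an assumption)."""
    top = max(scores.values())
    return {uid: (s / top if top else 0.0) for uid, s in scores.items()}


def merge(indexes, query):
    """Normalise each index separately, then merge into one ranked list."""
    merged = {}
    for units in indexes.values():
        merged.update(normalise(rsv(units, query)))
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)


ranking = merge(INDEXES, ["xml", "retrieval"])
```

Keeping one index per element type means tf and idf are computed over comparable, non-nested units, and normalisation is what makes scores from different indexes comparable at merge time.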

Page 21

Language model

[Equation: element score, annotated with the element language model, the collection language model and the smoothing parameter]

[Equation: rank of an element, annotated with element size, element score and article score]

query expansion with blind feedback

ignore small elements (threshold of 20 terms)

a high value of the parameter leads to an increase in the size of retrieved elements

results with parameter values 0.9, 0.5 and 0.2 are similar

(University of Amsterdam, INEX 2003)
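Since the slide's equations did not survive in the transcript, the sketch below shows a generic language-model element score with linear (Jelinek-Mercer) smoothing between an element model and a collection model. It is not claimed to be the exact formula of the cited work; `lam`, the toy elements and the omission of the element-size and article-score components are assumptions.

```python
import math
from collections import Counter


def lm_score(query, element, collection_tf, collection_len, lam=0.5):
    """Log-likelihood of the query under a linearly smoothed element model."""
    tf = Counter(element)
    score = 0.0
    for term in query:
        if collection_tf[term] == 0:
            continue  # term unseen in the collection: skipped in this sketch
        p_element = tf[term] / len(element) if element else 0.0
        p_collection = collection_tf[term] / collection_len
        score += math.log(lam * p_element + (1 - lam) * p_collection)
    return score


# Hypothetical toy elements; the collection model is estimated from all of them.
elements = {
    "e1": "xml retrieval models".split(),
    "e2": "database query processing".split(),
}
collection_tf = Counter(t for tokens in elements.values() for t in tokens)
collection_len = sum(collection_tf.values())
query = ["xml", "retrieval"]
scores = {
    eid: lm_score(query, tokens, collection_tf, collection_len)
    for eid, tokens in elements.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
```

The collection model keeps the score finite for elements that lack a query term, which matters for XML elements because many of them are very short.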

Page 22

Outline

Structured document retrieval

XML

Content-oriented XML retrieval

Evaluation

Page 23

Evaluation of XML retrieval: INEX

Evaluating the effectiveness of content-oriented XML retrieval approaches

Collaborative effort: participants contribute to the development of the collection (queries, relevance assessments)

Methodology similar to that of TREC, but adapted to XML retrieval

40+ participants worldwide

Workshop in Schloss Dagstuhl in December (20+ institutions)

Page 24

INEX Test Collection

Documents (~500MB), which consist of 12,107 articles in XML format from the IEEE Computer Society; 8 million elements

INEX 2002
– 30 CO and 30 CAS queries
– CO and CAS ad hoc retrieval tasks
– inex_eval metric

INEX 2003
– 36 CO and 30 CAS queries
– CO, SCAS and VCAS ad hoc retrieval tasks
– CAS queries are defined according to an enhanced subset of XPath
– inex_eval and inex_eval_ng metrics

INEX 2004 is just starting

Page 25

Relevance in INEX

Exhaustivity: how exhaustively a document component discusses the query: 0, 1, 2, 3

Specificity: how focused the component is on the query: 0, 1, 2, 3

Relevance: (exhaustivity, specificity) pairs, e.g. (3,3), (2,3), (1,1), (0,0), …

Use of an online assessment tool to ensure exhaustive and consistent assessments (assessing a query takes a week!)

[Figure: a section within an article]

all sections relevant → article very relevant
all sections relevant → article better than sections
one section relevant → article less relevant
one section relevant → section better than article
…

Page 26

Metrics

Recall/precision-based

quantisation functions to obtain one relevance value

expected search length

penalise overlap

consider size

Others: expected ratio of relevant, cumulated gain-based metrics, tolerance to irrelevance

[Figure: a section within an article]
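A quantisation function maps an (exhaustivity, specificity) assessment to a single relevance value. The sketch below shows a strict variant, where only (3,3) counts, and a graded variant whose weights are assumed here for illustration and should be checked against the INEX documentation. The `precision_at` helper is a plain precision-at-k over quantised values, not the inex_eval metric.

```python
def strict(e, s):
    """Strict quantisation: only highly exhaustive and highly specific counts."""
    return 1.0 if (e, s) == (3, 3) else 0.0


# Graded weights: an assumption for this sketch, to be verified against INEX.
GENERALISED = {
    (3, 3): 1.0,
    (2, 3): 0.75, (3, 2): 0.75, (3, 1): 0.75,
    (1, 3): 0.5, (2, 2): 0.5, (2, 1): 0.5,
    (1, 2): 0.25, (1, 1): 0.25,
    (0, 0): 0.0,
}


def generalised(e, s):
    """Graded quantisation via table lookup."""
    return GENERALISED[(e, s)]


def precision_at(assessments, k, quantise):
    """Mean quantised relevance of the top-k retrieved components."""
    return sum(quantise(e, s) for e, s in assessments[:k]) / k


# Hypothetical ranked run: the (e, s) assessment of each retrieved component.
run = [(3, 3), (2, 3), (0, 0), (1, 1)]
p_strict = precision_at(run, 4, strict)
p_generalised = precision_at(run, 4, generalised)
```

The same run scores 0.25 under the strict function and 0.5 under the graded one, which shows how much the choice of quantisation can change the reported effectiveness.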

Page 27

Lessons learnt

Good definition of relevance

Expressing CAS queries was not easy

Relevance assessment process must be “improved”

Further development on metrics needed

User studies required

Page 29

Thank you