ISP 433/533 Week 11
XML Retrieval
Structured Information
• Traditional IR
– Unit of information: terms and documents
– No structure
• Need more granularity
• Document has structure
– E.g. title, sections, footnotes, etc.
• A markup language is a mechanism to identify structures in a document
– Data + Metadata
Extensible Markup Language (XML)
• Markup (tags – not a fixed set)
• Content
• Nested, named trees with attributes
<?xml version="1.0" encoding="UTF-8"?>
<bookinfo>
  <book>
    <title>One Fish Two Fish</title>
    <author>John Meyer</author>
    <author>Peter Smith</author>
    <price>7.95</price>
  </book>
  <book>
    <title>Goodnight Moon</title>
    <author>Margaret Brown</author>
    <price>10.55</price>
  </book>
  ....
</bookinfo>
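As a rough illustration of the "nested, named trees" view, the sketch below parses the book list with Python's standard `xml.etree.ElementTree` module. The elided entries ("....") are left out, and the helper name `list_books` is an assumption made for this example only.

```python
import xml.etree.ElementTree as ET

# The two books shown on the slide; the elided entries ("....") are omitted.
BOOKINFO = """<bookinfo>
  <book>
    <title>One Fish Two Fish</title>
    <author>John Meyer</author>
    <author>Peter Smith</author>
    <price>7.95</price>
  </book>
  <book>
    <title>Goodnight Moon</title>
    <author>Margaret Brown</author>
    <price>10.55</price>
  </book>
</bookinfo>"""


def list_books(xml_text):
    """Return one dict per <book> element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    books = []
    for book in root.findall("book"):
        books.append({
            "title": book.findtext("title"),
            # A book may have several <author> children.
            "authors": [a.text for a in book.findall("author")],
            "price": float(book.findtext("price")),
        })
    return books


for entry in list_books(BOOKINFO):
    print(entry)
```

Each element becomes a node whose children are reached by name, which is the tree structure the following slides rely on.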
Elements
• Delimited by angle brackets
• Identify the nature of the content they surround
• Elements can be nested within another element
– A tree structure
• Elements may have attributes
– E.g. <div class="preface">
Unit of Retrieval
• Traditional IR
– Document
• XML IR
– Element or fragment of element
Example Retrieval Units
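[Figure: tree of an example XML document. The root document element (class="H.3.3") has an author (John Smith), a title (XML Retrieval), and two chapter elements containing heading and section elements (Introduction, "This...", XML Query Lang. XQL, Syntax, Examples, "We describe syntax of XQL"). Candidate retrieval units are marked 1 to 5.]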
Requirements for XML Retrieval
• Basic needs for XML retrieval
– Query both data and metadata
– Express the query in a way that is convenient for the user
– Return the proper document fragments
– Rank the results according to their relevance
INEX
The Initiative for the Evaluation of XML Retrieval
– International, coordinated effort to promote evaluation procedures for content-based XML retrieval
– Provides a large test collection of XML documents (12,000 articles in IEEE CS publications since 1995)
– Introduces both content-only (CO) and content-and-structure (CAS) topics
– Designed to be a long-term initiative with workshops held on a yearly basis (currently in the second year)
INEX CO Topic example
<Title>
  <cw>semantic web</cw>
</Title>
<Description>
  Research and business opportunities and challenges in developing and deploying the concept of the Semantic Web and the associated idea of web services.
</Description>
<Narrative>
  To be relevant, a document/component must either discuss the technical issues and opportunities associated with the semantic web, or it must discuss the business challenges, especially the question of viable business models for web services.
</Narrative>
<Keywords>
  semantic web, ontologies, SOAP, UDDI, RDF…
</Keywords>
INEX CAS Topic example
<Title>
  <te>//fig, //p, //ip1</te>
  <cw>Corba architecture</cw> <ce>//fgc</ce>
  <cw>Figure Corba Architecture</cw> <ce>//p, //ip1</ce>
</Title>
<Description>
  Find figures that describe the Corba architecture and the paragraphs that refer to those figures.
</Description>
<Narrative>
  To be relevant a figure must describe the standard Corba architecture or a system architecture that relies heavily on Corba…Retrieved components would ideally contain both the figure and the paragraph referring to it.
</Narrative>
<Keywords>
  CORBA Object Request Broker Architecture …
</Keywords>
An Inverted Index for XML
Element index
<section>  (1, 1:23, 0) (1, 8:22, 1) (1, 14:21, 2) …
<title>    (1, 2:7, 1) (1, 9:13, 2) (1, 15:20, 3) …
Text index
"information"  (1, 3, 2) …
"retrieval"    (1, 4, 2) …
Indexed document, with the position of each tag and word in parentheses:
<section>(1)
  <title>(2) Information(3) Retrieval(4) Using(5) RDBMS(6) </title>(7)
  <section>(8)
    <title>(9) Beyond(10) Simple(11) Translation(12) </title>(13)
    <section>(14)
      <title>(15) Extension(16) of(17) IR(18) Features(19) </title>(20)
    </section>(21)
  </section>(22)
</section>(23)
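The postings on this slide can be read as (document number, start:end position, nesting depth) for elements and (document number, position, nesting depth) for words; that reading is inferred from the numbers, not stated on the slide. Under that assumption, a minimal sketch of building both indexes in Python looks like this (the function names and the regular-expression tokenizer are illustrative choices, not part of the original material):

```python
import re
from collections import defaultdict

DOC = ("<section> <title> Information Retrieval Using RDBMS </title> "
       "<section> <title> Beyond Simple Translation </title> "
       "<section> <title> Extension of IR Features </title> "
       "</section> </section></section>")


def build_indexes(xml_text, doc_id=1):
    """Return (element_index, text_index) for one simple, attribute-free document."""
    # Every tag and every word is one token; positions start at 1.
    tokens = re.findall(r"</?[^>]+>|[^\s<>]+", xml_text)
    element_index = defaultdict(list)  # tag  -> [(doc, start, end, depth)]
    text_index = defaultdict(list)     # word -> [(doc, position, depth)]
    open_elements = []                 # stack of (tag, start position)
    for pos, tok in enumerate(tokens, start=1):
        if tok.startswith("</"):
            tag, start = open_elements.pop()
            # Depth = number of ancestors still open after this element closes.
            element_index[tag].append((doc_id, start, pos, len(open_elements)))
        elif tok.startswith("<"):
            open_elements.append((tok[1:-1], pos))
        else:
            text_index[tok.lower()].append((doc_id, pos, len(open_elements)))
    for postings in element_index.values():
        postings.sort()
    return dict(element_index), dict(text_index)


def word_in_element(word_posting, element_posting):
    """Containment test: the word position lies strictly inside the element span."""
    doc, pos, _depth = word_posting
    e_doc, start, end, _e_depth = element_posting
    return doc == e_doc and start < pos < end


element_index, text_index = build_indexes(DOC)
print(element_index["section"])   # [(1, 1, 23, 0), (1, 8, 22, 1), (1, 14, 21, 2)]
print(text_index["retrieval"])    # [(1, 4, 2)]
```

With positions stored this way, a question such as "does this title contain the word retrieval?" reduces to an interval comparison between two postings, without touching the document itself.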
XPath
• XPath is a non-XML language for identifying particular parts of XML documents
– Picking nodes and sets of nodes
• Similar to Unix file system expressions
– "/people/person/name/first_name"
– "*" wildcard
– ".." parent
– "." context node
– "//" descendants
– "@" attribute
– "[]" predicate, specifies a condition
XPath Example
chapter/heading
[Figure: the example document tree from the "Example Retrieval Units" slide, illustrating the expression above]
XPath Example
chapter//heading
[Figure: the example document tree from the "Example Retrieval Units" slide, illustrating the expression above]
XPath Example
//chapter[heading]
[Figure: the example document tree from the "Example Retrieval Units" slide, illustrating the expression above]
XPath Example
/document[@class="H.3.3" and author="John Smith"]
[Figure: the example document tree from the "Example Retrieval Units" slide, illustrating the expression above]
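The four expressions from these example slides can be tried out with Python's `xml.etree.ElementTree`, which supports a subset of XPath. The document below is an assumed reconstruction of the example tree (in particular, which heading belongs to which section is a guess), and the `<collection>` wrapper is added only so that the document element itself can be selected by a relative path. ElementTree has no `and` operator, so the conjunction of the two conditions is written as two chained predicates.

```python
import xml.etree.ElementTree as ET

# Assumed reconstruction of the example tree; the <collection> wrapper is hypothetical.
XML = """<collection>
  <document class="H.3.3">
    <author>John Smith</author>
    <title>XML Retrieval</title>
    <chapter>
      <heading>Introduction</heading>
      This...
    </chapter>
    <chapter>
      <heading>XML Query Language XQL</heading>
      <section><heading>Examples</heading></section>
      <section><heading>Syntax</heading>We describe syntax of XQL</section>
    </chapter>
  </document>
</collection>"""

root = ET.fromstring(XML)
doc = root.find("document")

# chapter/heading: headings that are direct children of a chapter.
direct = [h.text for h in doc.findall("chapter/heading")]

# chapter//heading: headings at any depth below a chapter.
any_depth = [h.text for h in doc.findall("chapter//heading")]

# //chapter[heading]: chapters that have at least one heading child.
chapters = root.findall(".//chapter[heading]")

# /document[@class="H.3.3" and author="John Smith"], as chained predicates.
matches = root.findall("document[@class='H.3.3'][author='John Smith']")

print(direct)
print(any_depth)
print(len(chapters), len(matches))
```

The difference between the first two results shows the effect of "//": the section headings appear only in the second list.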
More XPath Examples
• //@id/..
– All the elements that have attribute “id”
• //middle_initial/../first_name
– All the first_name elements that are siblings of middle_initial elements
• //person[profession='physicist']
– All person elements that have a profession child element with the value “physicist”
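A quick way to check these three readings is to run them against a small sample. The people data below is made up for illustration. Python's `xml.etree.ElementTree` does not support attribute steps such as `//@id/..`, so the first query uses the equivalent form `.//*[@id]`; the other two are used as written, with a leading "." because ElementTree paths are relative to the element being searched.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, not from the slides.
PEOPLE = """<people>
  <person id="p1">
    <name>
      <first_name>Richard</first_name>
      <middle_initial>P</middle_initial>
      <last_name>Feynman</last_name>
    </name>
    <profession>physicist</profession>
  </person>
  <person id="p2">
    <name>
      <first_name>Alan</first_name>
      <last_name>Turing</last_name>
    </name>
    <profession>mathematician</profession>
  </person>
</people>"""

root = ET.fromstring(PEOPLE)

# //@id/..  -> all elements that have an "id" attribute.
with_id = [e.get("id") for e in root.findall(".//*[@id]")]

# //middle_initial/../first_name -> first_name siblings of a middle_initial.
siblings = [e.text for e in root.findall(".//middle_initial/../first_name")]

# //person[profession='physicist'] -> persons with that profession child.
physicists = [e.get("id") for e in root.findall(".//person[profession='physicist']")]

print(with_id, siblings, physicists)
```

Only the first person has a middle_initial element, so only that person's first_name is returned by the second query.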
XQuery
• A language to query data that is similar to XML in structure
– Nested, named trees with attributes
• Based on XPath
FOR/LET PathExpression
WHERE AdditionalSelectionCriteria
RETURN ResultConstruction
XQuery Example
• Find the name(s) of customers who have ordered the part whose part_id is "xx"
FOR $c IN customers
FOR $o IN orders
WHERE $c.cust_id=$o.cust_id AND $o.part_id="xx"
RETURN $c.name
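The FOR / WHERE / RETURN pattern above behaves like a nested loop with a filter. The Python sketch below mirrors it with `xml.etree.ElementTree`; the customer and order data, the element names around them, and the function name `customers_who_ordered` are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical data; only cust_id, part_id and name come from the query above.
CUSTOMERS = """<customers>
  <customer><cust_id>c1</cust_id><name>Ann</name></customer>
  <customer><cust_id>c2</cust_id><name>Bob</name></customer>
</customers>"""

ORDERS = """<orders>
  <order><cust_id>c1</cust_id><part_id>xx</part_id></order>
  <order><cust_id>c2</cust_id><part_id>yy</part_id></order>
  <order><cust_id>c1</cust_id><part_id>zz</part_id></order>
</orders>"""


def customers_who_ordered(part_id):
    """Names of customers with at least one order for the given part."""
    customers = ET.fromstring(CUSTOMERS)
    orders = ET.fromstring(ORDERS)
    names = []
    for c in customers.findall("customer"):          # FOR $c IN customers
        for o in orders.findall("order"):            # FOR $o IN orders
            if (c.findtext("cust_id") == o.findtext("cust_id")
                    and o.findtext("part_id") == part_id):   # WHERE ...
                names.append(c.findtext("name"))     # RETURN $c.name
    return names


print(customers_who_ordered("xx"))
```

Like the query it mirrors, this returns a name once per matching order, so a customer with two orders for the same part would appear twice.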
More XQuery Example
• Find titles and prices of books by ‘Meyer’ or ‘Smith’
FOR $b IN document("bib.xml")//book
WHERE $b/author contains 'Meyer' OR $b/author contains 'Smith'
RETURN
  <result>
    <title> $b/title </title>
    <price> $b/price </price>
  </result>
One Document Structure
• Previous XQuery works
[Figure: document tree. bookinfo contains two book elements: one with title "Just Lost", authors "Mercy Meyer" and "Gina Meyer", and price $5.75; one with title "Brown Hedi" and price $13.95.]
Another Document Structure
• Same XQuery doesn’t work
[Figure: document tree in which bookinfo contains author elements (name: Dr. Meyer; name: M. Brown), and the book elements with their title and price are nested under the authors (One Fish Two Fish, $12.50; Cat in the Hat, $14.95; Goodnight Moon).]
Problem with XQuery
• Requires knowledge of document structure
• Dependent on document structure
• Difficult for naive user
• Need extensions to solve the problem
• Still in active research
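The structure dependence can be shown with a small experiment: the same "books by Meyer or Smith" query is run against a book-centric and an author-centric document. The Python function below is a hand translation of the earlier query, and the author-centric sample is one possible nesting chosen for illustration (which book belongs to which author is an assumption).

```python
import xml.etree.ElementTree as ET

BOOK_CENTRIC = """<bookinfo>
  <book>
    <title>Just Lost</title>
    <author>Mercy Meyer</author>
    <author>Gina Meyer</author>
    <price>$5.75</price>
  </book>
</bookinfo>"""

# Assumed nesting: books sit under their author, and the author's name
# is in a <name> element rather than an <author> child of <book>.
AUTHOR_CENTRIC = """<bookinfo>
  <author>
    <name>Dr. Meyer</name>
    <book><title>One Fish Two Fish</title><price>$12.50</price></book>
  </author>
</bookinfo>"""


def titles_and_prices(xml_text):
    """Books whose <author> child mentions Meyer or Smith (mirrors $b/author)."""
    root = ET.fromstring(xml_text)
    results = []
    for book in root.iter("book"):                        # //book
        authors = [a.text or "" for a in book.findall("author")]
        if any("Meyer" in a or "Smith" in a for a in authors):
            results.append((book.findtext("title"), book.findtext("price")))
    return results


print(titles_and_prices(BOOK_CENTRIC))    # finds the Meyer book
print(titles_and_prices(AUTHOR_CENTRIC))  # finds nothing
```

The second call returns an empty list even though the document clearly contains a book by Meyer, because the path from book to author that the query assumes does not exist in that structure.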
Don’t know the tags?
• Integrating with full-text keyword search
• Automatically identifying tag names
• Translating query terms to tag names
• Query expansion
Don’t know the structure?
• Schema-free XQuery
– Automatically identifies the minimal meaningful set of nodes that can provide the answer
[Figure: document tree. bookinfo contains two book elements: one with title "Just Lost", two name elements ("Mercy Meyer", "Gina Meyer"), and price $5.75; one with title "Brown Bear" and price $13.95.]
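To give a feel for how related nodes might be found without knowing the structure, the sketch below uses a plain lowest-common-ancestor heuristic: among all elements with the wanted tag, keep those that share the deepest ancestor with the node matching a known value. This is only a simplified illustration in the spirit of the bullet above, not the Schema-free XQuery algorithm itself; the function names and the author-centric sample (including its second author name) are hypothetical.

```python
import xml.etree.ElementTree as ET

BOOK_CENTRIC = """<bookinfo>
  <book><title>Just Lost</title><name>Mercy Meyer</name>
        <name>Gina Meyer</name><price>$5.75</price></book>
  <book><title>Brown Bear</title><price>$13.95</price></book>
</bookinfo>"""

# Hypothetical alternative structure holding the same two books.
AUTHOR_CENTRIC = """<bookinfo>
  <author><name>Mercy Meyer</name>
    <book><title>Just Lost</title><price>$5.75</price></book></author>
  <author><name>Hypothetical Author</name>
    <book><title>Brown Bear</title><price>$13.95</price></book></author>
</bookinfo>"""


def path_from_root(node, parent_of):
    """List of elements from the root down to node."""
    path = [node]
    while path[-1] in parent_of:
        path.append(parent_of[path[-1]])
    return path[::-1]


def shared_depth(a, b, parent_of):
    """Number of ancestors (counted from the root) that a and b have in common."""
    depth = 0
    for x, y in zip(path_from_root(a, parent_of), path_from_root(b, parent_of)):
        if x is not y:
            break
        depth += 1
    return depth


def closest(xml_text, anchor_text, target_tag):
    """Text of target_tag elements whose common ancestor with the anchor is lowest."""
    root = ET.fromstring(xml_text)
    parent_of = {child: parent for parent in root.iter() for child in parent}
    anchors = [e for e in root.iter() if (e.text or "").strip() == anchor_text]
    targets = list(root.iter(target_tag))
    found = []
    for anchor in anchors:
        if not targets:
            break
        best = max(shared_depth(anchor, t, parent_of) for t in targets)
        found += [t.text for t in targets
                  if shared_depth(anchor, t, parent_of) == best]
    return found


print(closest(BOOK_CENTRIC, "Just Lost", "price"))
print(closest(AUTHOR_CENTRIC, "Just Lost", "price"))
```

The same call, with no path expression at all, returns the price of "Just Lost" in both structures, which is the kind of structure independence the slide is after.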
Querying XML with Natural Language
• Translate natural language query to Schema-free XQuery
• NaLIX demo
Relevance Scoring
• Query: articles about “search engine”
[Figure: element tree of a chapter. The chapter contains a title ("Search and retrieval") and section elements; p elements hold the text "... search engine ... retrieval of semantic information ..." and "... information retrieval ... search engine ...".]
TermJoin
• User-defined score function generates the score based on term occurrences and other information
• They are then joined
[Figure: the same chapter tree, with each element annotated with a score; the values shown are 1, 2, 2, 4 and 5.]
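One simple score function that fits this description counts occurrences of the query terms in an element's own text and adds up the scores of its children. The sketch below applies it to an assumed reconstruction of the chapter tree (placing both p elements under one section is a guess). It illustrates bottom-up score propagation only and is not the TermJoin algorithm itself; the function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Assumed reconstruction of the chapter from the figure.
CHAPTER = """<chapter>
  <title>Search and retrieval</title>
  <section>
    <p>... search engine ... retrieval of semantic information ...</p>
    <p>... information retrieval ... search engine ...</p>
  </section>
</chapter>"""


def own_score(element, terms):
    """Count query-term occurrences in the element's own leading text."""
    words = re.findall(r"\w+", (element.text or "").lower())
    return sum(words.count(term) for term in terms)


def score_tree(element, terms, out):
    """Score an element as its own count plus its children's scores.

    Appends (tag, score) to `out` in post-order, i.e. children before parents.
    Text that follows a child element (its tail) is ignored in this sketch.
    """
    total = own_score(element, terms)
    for child in element:
        total += score_tree(child, terms, out)
    out.append((element.tag, total))
    return total


scores = []
score_tree(ET.fromstring(CHAPTER), ["search", "engine"], scores)
print(scores)
# [('title', 1), ('p', 2), ('p', 2), ('section', 4), ('chapter', 5)]
```

Under these assumptions the computed values are 1, 2, 2, 4 and 5, the same numbers that appear in the figure, although the slide does not say which score belongs to which element.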