April 14, 2003 Hang Cui, Ji-Rong Wen and Tat-Seng Chua
Hierarchical Indexing and Flexible Element Retrieval for Structured Document
Hang Cui, School of Computing, NUS
Ji-Rong Wen, Microsoft Research Asia
Tat-Seng Chua, School of Computing, NUS
Presentation for ECIR’03, Pisa, Italy
Outline
• Motivations and problems
• Hierarchical index propagation and pruning
• Flexible element retrieval
• Evaluation
• Conclusions
Motivations
• More structured and semi-structured documents on the Web.
• Users want to explore more of the document structure.
  – Access only relevant parts of a document, i.e. sections or paragraphs.
• IR can’t help
  – The document is the smallest result unit.
• Not Question Answering!
  – Can’t provide views of the internal document structure.
Encarta Articles – An Example
• Online encyclopedia.
• Well-structured XML documents.
• Nodes (elements): documents, sections and paragraphs (leaf nodes).
• Text is contained in paragraphs, which constitute sections and documents.
[Figure: tree of an example article, with a Document root, Section nodes below it, and Paragraph leaf nodes.]
Problems
• A document covers multiple aspects of a central topic.
  – Aspects are represented by sections or paragraphs.
  – Users usually want just one of the aspects.
• How can this goal be achieved by utilizing the document structure?
  – Flexible element retrieval gets elements at an arbitrary level rather than only leaf nodes.
  – Each element, at every level, should have a proper keyword description.
Our contributions
• Building an index with the same hierarchical structure as the document.
  – Not just indexing the leaf nodes.
• Keyword propagation mechanism.
  – Assign proper keywords to the nodes at each level (push broad-sense keywords to upper-level nodes).
  – Why not use a weight propagation technique?
  – Considering terms’ distributions.
• Flexible element retrieval according to queries.
  – With the hierarchical index, the system can access elements at an arbitrary level (documents, sections or paragraphs) with respect to the query.
  – Avoids assembling separate text fragments, as retrieval of leaf nodes only would require.
Hierarchical Indexing for Structured Documents
• Term weighting for the leaf nodes and the intermediate elements.
  – Combining the statistics of term occurrences and term distributions.
  – Term selection threshold.
• Propagation and pruning of the index terms.
Term Weighting for Paragraphs
• Paragraphs are “atomic”: they have no child elements.
• Consider the term occurrences only, using the TFIDF measure.
$$\mathrm{Weight}(t_i, P_j) = \ln\big(tf(t_i, P_j)\big) \times \ln\frac{N}{n_i}$$
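A minimal Python sketch of the paragraph weight, read literally from the formula on this slide. The slide does not define N and n_i; the sketch assumes the usual TFIDF reading, where N is the total number of paragraphs and n_i is the number of paragraphs containing the term. The function name `paragraph_weight` and its parameters are hypothetical.

```python
import math


def paragraph_weight(tf: int, n_paragraphs: int, n_containing: int) -> float:
    """Weight(t_i, P_j) = ln(tf(t_i, P_j)) * ln(N / n_i).

    Assumption: N is the number of paragraphs in the collection and
    n_i the number of paragraphs that contain the term.
    """
    if tf <= 0 or n_containing <= 0:
        return 0.0  # term absent from the paragraph or the collection
    return math.log(tf) * math.log(n_paragraphs / n_containing)


# A term seen 3 times in a paragraph, present in 10 of 100 paragraphs.
weight = paragraph_weight(3, 100, 10)
```

Under this literal reading a term that occurs exactly once gets weight zero, because ln(1) = 0; many TFIDF variants use ln(1 + tf) to avoid that.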
Term Weighting for Intermediate Elements
• Document-level or section-level elements.
• Takes into account the term distributions in the immediate-descendant elements.
$$\mathrm{Weight}(t_i, E_j) = \ln\big(1 + tf(t_i, E_j)\big) \times I(t_i, E_j)$$
Measuring Term Distributions
• Entropy-like measurement
  – How evenly a term is distributed across all the immediate-descendant elements of an intermediate element.
  – Normalization factor: the theoretical maximum entropy.
$$I(t_i, E_j) = \frac{\sum_{sub_k \in Sub(E_j)} \frac{tf(t_i, sub_k)}{tf(t_i, E_j)} \ln \frac{tf(t_i, sub_k)}{tf(t_i, E_j)}}{\sum_{sub_k \in Sub(E_j)} \frac{1}{N(sub)} \ln \frac{1}{N(sub)}} = \frac{\sum_{sub_k \in Sub(E_j)} \frac{tf(t_i, sub_k)}{tf(t_i, E_j)} \ln \frac{tf(t_i, sub_k)}{tf(t_i, E_j)}}{\ln \frac{1}{N(sub)}}$$
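The distribution measure can be sketched in Python as below. This is an illustration under stated assumptions rather than the authors' code: N(sub) is taken to be the number of immediate-descendant elements of E_j, tf(t_i, E_j) is taken to be the sum of the term's counts in those descendants, 0 · ln 0 is treated as 0, and an element with fewer than two children gets a distribution of 0. The names `distribution` and `element_weight` are hypothetical.

```python
import math
from typing import Sequence


def distribution(child_tfs: Sequence[int]) -> float:
    """I(t_i, E_j): entropy of the term over the children of E_j,
    divided by the maximum entropy ln(N(sub))."""
    n_children = len(child_tfs)
    total = sum(child_tfs)
    if n_children < 2 or total == 0:
        return 0.0  # assumption: degenerate cases carry no spread
    entropy = -sum(
        (tf / total) * math.log(tf / total) for tf in child_tfs if tf > 0
    )
    return entropy / math.log(n_children)


def element_weight(child_tfs: Sequence[int]) -> float:
    """Weight(t_i, E_j) = ln(1 + tf(t_i, E_j)) * I(t_i, E_j)."""
    return math.log(1 + sum(child_tfs)) * distribution(child_tfs)


even = distribution([2, 2, 2, 2])    # term spread evenly over four children
peaked = distribution([8, 0, 0, 0])  # term confined to one child
```

An evenly spread term scores close to 1 and a term confined to a single child scores 0, so with the same total count the evenly spread term receives the higher element weight.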
Term Selection
• Term weights are normalized to the range between 0 and 1 for the purpose of comparison.
• Compare the terms within one element.
  – Select the terms whose weights exceed a threshold as the index terms for this element.
• Repeat this process from the bottom up.
  – Broader-sense terms can be propagated to upper-level elements.
  – Term pruning avoids duplication of index terms.
Terms Propagation and Pruning Algorithm
1. For each leaf element, i.e. paragraph, calculate the weights of all its terms.
2. For each composite element Ej at the next upper level, calculate the terms’ weights by measuring the terms’ occurrences in this element and their distributions in the immediate-descendant elements of Ej.
3. For each term ti, if Weight(ti, Ej) >= average(Ej) + std_dev(Ej), then ti is selected as an index term of the element Ej, and all the descendant elements of Ej eliminate ti from their index term lists. This process is called index term propagation and pruning.
4. Repeat from step 2 recursively until the root node (i.e., the document) is reached.
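The four steps above can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation. `Node`, `distribution`, `select`, `prune` and `build_index` are hypothetical names, and paragraph weights come from a caller-supplied `leaf_weight` function. The sketch assumes the same average-plus-standard-deviation cut for paragraphs, uses the population standard deviation, and never selects zero-weight terms. The normalization to [0, 1] is skipped because a positive rescaling of the weights does not change which terms pass such a cut.

```python
import math
from dataclasses import dataclass, field
from statistics import mean, pstdev
from typing import Callable


@dataclass
class Node:  # hypothetical element type: document, section or paragraph
    name: str
    children: list["Node"] = field(default_factory=list)
    tf: dict[str, int] = field(default_factory=dict)  # term counts (given for paragraphs)
    index: set[str] = field(default_factory=set)      # selected index terms


def distribution(counts: list[int]) -> float:
    """Entropy of a term over the children, normalized by ln(number of children)."""
    total = sum(counts)
    if len(counts) < 2 or total == 0:
        return 0.0
    entropy = -sum(c / total * math.log(c / total) for c in counts if c > 0)
    return entropy / math.log(len(counts))


def select(weights: dict[str, float]) -> set[str]:
    """Step 3: keep terms whose weight is >= average + standard deviation."""
    if not weights:
        return set()
    values = list(weights.values())
    cut = mean(values) + pstdev(values)
    # Assumption: zero-weight terms are never index terms.
    return {t for t, w in weights.items() if w > 0 and w >= cut}


def prune(node: Node, terms: set[str]) -> None:
    """Remove terms promoted to `node` from the index of every descendant."""
    for child in node.children:
        child.index -= terms
        prune(child, terms)


def build_index(node: Node, leaf_weight: Callable[[str, Node], float]) -> dict[str, int]:
    """Bottom-up pass: fills `index` on every node, returns the node's term counts."""
    if not node.children:  # step 1: paragraph
        node.index = select({t: leaf_weight(t, node) for t in node.tf})
        return node.tf
    # Step 4: children are processed before their parent.
    child_tfs = [build_index(child, leaf_weight) for child in node.children]
    weights = {}
    for term in set().union(*child_tfs):  # step 2
        counts = [tf.get(term, 0) for tf in child_tfs]
        weights[term] = math.log(1 + sum(counts)) * distribution(counts)
        node.tf[term] = sum(counts)
    node.index = select(weights)  # step 3: propagation ...
    prune(node, node.index)       # ... and pruning
    return node.tf
```

Pruning after each selection is what guarantees that no term appears twice along any path from a paragraph to the document root.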
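An illustration of the process
[Figure: a document with two sections, shown first as the document structure and then as the resulting index structure.]
The Document Structure
• Section: Qing, Dynasty, Manchu, Kang Xi, History, China
• Section: Tang, Dynasty, Sui, Sui Yang, History, China
The Index Structure
• Document: History, Dynasty, Economy, China
• Section: Qing, Manchu, Kang Xi
• Section: Tang, Sui, Yang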
Flexible Element Retrieval
• No term duplications along one path.
• The path of an element
  – includes all the elements from this node to the root.
• Ranking relevant elements is equivalent to ranking their paths.
$$\mathrm{Relevance}(Path_p) = \sum_{i=1}^{Q} \mathrm{Weight}(t_i, Path_p) \times \ln\frac{N}{n_i}$$
Path Ranking Algorithm
1. Find all elements that contain at least one query term.
2. Get the paths of all candidate elements and merge the paths, that is, merge two paths into one if one is a part of the other.
3. Assign the weights of the query terms in the elements to their respective paths.
4. Rank these paths using the equation on the previous slide.
5. Return, in descending order, the elements corresponding to the ranked paths whose ranks satisfy the pre-defined threshold.
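The five steps above might look like this in Python. This is a sketch under stated assumptions, not the authors' implementation. The hierarchical index is modelled as a dict from element id to its index-term weights, `parent` maps each element to its parent, and `idf` holds a precomputed ln(N / n_i) per term. The names `path_to_root` and `rank_paths` are hypothetical. The sketch returns scored paths, because the slides do not spell out how a path maps back to a single element, and it reads "satisfying the threshold" as score >= threshold.

```python
def path_to_root(element: str, parent: dict[str, str]) -> tuple[str, ...]:
    """All elements from `element` up to the root, in that order."""
    path = [element]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return tuple(path)


def rank_paths(
    query: list[str],
    index: dict[str, dict[str, float]],
    parent: dict[str, str],
    idf: dict[str, float],
    threshold: float = 0.0,
) -> list[tuple[float, tuple[str, ...]]]:
    # Step 1: elements whose index terms include at least one query term.
    candidates = [
        elem for elem, weights in index.items() if any(t in weights for t in query)
    ]
    # Step 2: build paths and drop any path that is part of a longer one.
    # Every path ends at the root, so "part of" means "is a suffix of".
    paths = {path_to_root(elem, parent) for elem in candidates}
    merged = [
        p for p in paths if not any(p != q and q[-len(p):] == p for q in paths)
    ]
    # Steps 3 and 4: sum the query-term weights found along each path.
    scored = []
    for path in merged:
        score = sum(
            index.get(elem, {}).get(term, 0.0) * idf.get(term, 0.0)
            for elem in path
            for term in query
        )
        scored.append((score, path))
    # Step 5: keep paths that satisfy the threshold, best first.
    return sorted((sp for sp in scored if sp[0] >= threshold), reverse=True)
```

Because the index holds each term at most once along a path, summing over the elements of a path never counts a query term twice.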
Result Browsing
• The prototype interface can
  – Highlight the relevant parts of the selected document.
  – Allow the user to browse results in the original document structure.
• Query example: “the Manchu Qing Dynasty”
  – A section in “China”
  – The whole document for “Qing Dynasty”
Prototype Interface
Evaluation
• Data Set
  – 41,942 XML documents on various topics from the Encarta online encyclopedia.
  – Ten experimental queries
    • Each can be answered by only parts of the relevant document, e.g. “Fleet Street in London” is answered by a paragraph of the document “London”.
    • Relevance judgments were made by human assessors: for each query, a group of paragraphs is marked relevant, representing either whole relevant sections or individual relevant paragraphs.
  – Baseline system (TFIDF Para)
    • Indexing paragraph nodes only.
    • Applying the TFIDF measure to weight terms and using cosine similarity to retrieve answers.
Performance Evaluation
• Use precision, recall and F-value as performance metrics.
• Two sets of hierarchical indexes
  – One utilizing titles and one without considering titles.
• Answer selection threshold
  – Fixed values from 0.1 to 0.9, as used by most existing systems.
  – Dynamic thresholds: Avg and Avg + Std_Dev.
• Compared our system with TFIDF Para using different answer selection thresholds.
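The metrics and the dynamic thresholds listed above can be sketched as follows. The slides do not give the exact F-value formula or say which standard deviation is used, so the sketch assumes the balanced F-measure (harmonic mean of precision and recall) and the population standard deviation. The function names `precision_recall_f` and `dynamic_threshold` are hypothetical.

```python
from statistics import mean, pstdev


def precision_recall_f(retrieved: set, relevant: set) -> tuple[float, float, float]:
    """Precision, recall and balanced F-value for one query."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f_value = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f_value


def dynamic_threshold(scores: list[float], use_std: bool = True) -> float:
    """Avg (use_std=False) or Avg + Std_Dev (use_std=True) of the answer scores."""
    return mean(scores) + (pstdev(scores) if use_std else 0.0)


# Keep only the answers whose score reaches the dynamic threshold.
scores = [0.9, 0.4, 0.3, 0.2]
cut = dynamic_threshold(scores)
selected = [s for s in scores if s >= cut]
```

Unlike a fixed value between 0.1 and 0.9, a dynamic threshold adapts to the score distribution of each query.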
Results of Performance Comparison
• Substantial improvements over the TFIDF Para baseline
  – Precision improves by 48.83% (with titles) and 41.67% (without titles) on average.
  – F-value improves by 56.02% (with titles) and 40.89% (without titles).
  – Recall decreases slightly under some threshold settings (when the threshold for index term selection is too strict).
• User feedback
  – Our system can find more meaningful units instead of separate paragraphs, including some paragraphs that do not actually contain query terms.
  – Users are clear about their context when browsing the answers within the original document structure.
Threshold Setting
• Our system is less sensitive to the answer selection threshold settings.
• A dynamic threshold is a good alternative for such structured document retrieval.
[Figure: Comparison of F-Values with Different Selection Thresholds. x-axis: Thresholds; y-axis: F-Values (0.0 to 0.9); series: TFIDF Para, Flexible Retrieval (with title terms), Flexible Retrieval (without title terms).]
Conclusions
• A novel hierarchical index propagation and pruning mechanism generates a structured index.
• Flexible element retrieval, which returns relevant elements at arbitrary levels, is realized on the hierarchical index.
• It can satisfy users better than previous passage retrieval systems.
• More work can be done on generating hierarchical indexes for federated search.
Thanks!