
Comparing Document Segmentation for Passage Retrieval in Question Answering

Jorg Tiedemann

University of Groningen

presented by:

Moy’awiah Al-Shannaq

[email protected]

December 05, 2011


Outline

• Introduction
• Overview of passage retrieval module
• Strategies for passage retrieval in QA
• Document segmentation
• Passage retrieval in Joost
• Experiments
  – Setup
  – Result
• Conclusion
• Future Work
• References


Introduction

• Information Retrieval (IR): the area of study concerned with searching for documents, for information within documents, and for metadata about documents, as well as searching structured storage, relational databases, and the World Wide Web*.

• Passage Retrieval: retrieve individual passages within documents (one or more sentences or paragraphs).

• Precision: number of relevant documents retrieved / number of total retrieved.

• Recall: number of relevant documents retrieved / number of total relevant.

• Question Answering (QA): QA systems include a passage retrieval component to reduce the search space for information extraction modules.

* http://en.wikipedia.org/wiki/Information_retrieval
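Both measures above can be computed from the sets of retrieved and relevant items; a minimal sketch, with hypothetical document IDs:

```python
# Minimal precision/recall computation; document IDs are hypothetical.
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved result list."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually retrieved
    return len(hits) / len(retrieved), len(hits) / len(relevant)

# 3 of 5 retrieved documents are relevant; 6 documents are relevant in total.
p, r = precision_recall(["d1", "d2", "d3", "d4", "d5"],
                        ["d1", "d2", "d3", "d6", "d7", "d8"])
print(p, r)  # 0.6 0.5
```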


Passage Types

• Discourse passages: segmentation based on document structure.
  – Problems with this approach often arise with special structures such as headers, lists and tables, which are easily mixed with other units such as proper paragraphs.

• Semantic passages: split documents into semantically motivated units using some topical structure.

• Window-based passages: use fixed or variable-sized windows to segment documents into smaller units.
  – Window-based passages have a fixed length using non-overlapping parts of the document.


Passage Incorporation Approaches

• We can distinguish between two approaches to the incorporation of passages in information retrieval:

1) Passage-level evidence to improve document retrieval.

2) Using passages directly as the unit to be retrieved.

• The paper is interested in the second approach, returning small units for QA.


What are the differences between Passage Retrieval in QA and ordinary IR?

• Passage Retrieval in QA differs from ordinary IR in at least two points:

1) Queries are generated from user questions and not manually created as in standard IR.

2) The units to be retrieved are usually much smaller than documents in IR.

• The division of documents into passages is crucial for two reasons:

1) The textual units have to be big enough to ensure IR works properly.

2) They have to be small enough to enable efficient and accurate QA.


Strategies for Passage Retrieval in QA

• Search-time passaging: a two-step strategy of retrieving documents first and then selecting relevant passages within these documents.

  – Returns only one passage per relevant document.

• Index-time passaging: a one-step strategy that returns relevant passages from documents directly.

  – Allows multiple passages per relevant document to be returned.

• In our QA system we adopt the second strategy using a standard IR engine to match keyword queries generated from a natural language question with passages.
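Index-time passaging can be illustrated with a toy inverted index in which passages, rather than whole documents, are the indexed units, so a single query can return several passages from the same document. This is only a sketch; the paper uses a standard IR engine, and all names below are illustrative:

```python
from collections import defaultdict

# Toy index-time passaging: each passage is indexed as its own retrieval unit.
docs = {"doc1": ["the cat sat", "dogs bark loudly", "the cat purrs"],
        "doc2": ["rain falls today"]}

index = defaultdict(set)  # term -> set of (document id, passage id)
for doc_id, passages in docs.items():
    for p_id, text in enumerate(passages):
        for term in text.split():
            index[term].add((doc_id, p_id))

def retrieve(query):
    """Return all (document, passage) pairs matching any query term."""
    hits = set()
    for term in query.split():
        hits |= index[term]
    return sorted(hits)

# Both "cat" passages of doc1 come back, which a one-passage-per-document
# (search-time) strategy could not deliver.
print(retrieve("cat"))  # [('doc1', 0), ('doc1', 2)]
```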


Document Segmentation

• The experiments work with Dutch data from the QA tasks at the Cross-Language Evaluation Forum (CLEF).

• The document collection used there is a collection of two daily newspapers from the years 1994 and 1995.
  – It includes about 190,000 documents (newspaper articles).
  – 4 million sentences including approximately 80 million words.
  – The documents include additional markup to segment them into paragraphs.

• We define document boundaries as hard boundaries, i.e., passages may never come from more than one document in the collection.


Document Segmentation Strategies

• Window-based passages: Documents are split into passages of fixed size (in terms of number of sentences).

• Variable-sized arbitrary passages: Passages may start at any sentence in each document and may have variable lengths.

– This is implemented by adding redundant information to our standard IR index.

– We create passages starting at every sentence in a document for each length defined.

• Sliding window passages: A sliding window approach also adds redundancy to the index by sliding over documents with a fixed-sized window.
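The three strategies can be sketched over a document represented as a list of sentences (function names are illustrative, not taken from the paper's implementation):

```python
def fixed_windows(sentences, size):
    """Window-based: non-overlapping windows of `size` sentences."""
    return [sentences[i:i + size] for i in range(0, len(sentences), size)]

def arbitrary_passages(sentences, lengths):
    """Variable-sized arbitrary: a passage starts at every sentence,
    one per defined length, adding redundant entries to the index."""
    return [sentences[i:i + n]
            for i in range(len(sentences))
            for n in lengths
            if i + n <= len(sentences)]

def sliding_windows(sentences, size, step):
    """Sliding window: fixed-size windows advanced by `step` sentences
    (overlapping whenever step < size)."""
    return [sentences[i:i + size]
            for i in range(0, max(len(sentences) - size, 0) + 1, step)]

doc = ["s1", "s2", "s3", "s4", "s5"]
print(len(fixed_windows(doc, 2)))            # 3 non-overlapping passages
print(len(arbitrary_passages(doc, [2, 3])))  # 7 redundant index entries
print(len(sliding_windows(doc, 3, 1)))       # 3 overlapping passages
```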


Passage Retrieval in Joost

• Joost QA system includes two strategies:

1) A table-lookup strategy using fact databases that have been created off-line.

2) An on-line answer extraction strategy with passage retrieval and subsequent answer identification and ranking modules.

• The paper is interested in the second strategy in order to examine the passage retrieval component and its impact on QA performance.


Dutch CLEF Corpus

• The contents of the CLEF dataset are evidently very diverse. Most of the documents are very short, but the longest one contains 625 sentences.

• Figure 1: Distribution of document sizes in terms of sentences they contain in the Dutch CLEF corpus.


• Figure 2: Distribution of paragraph sizes in terms of sentences in the Dutch CLEF corpus.


• Figure 3: Distribution of paragraph sizes in terms of characters in the Dutch CLEF corpus.


Experiment Setup

• The entire Dutch CLEF document collection is used to create the index files with the various segmentation approaches.

• There are 777 questions; each question may have several answers.

• For each setting, 20 passages are retrieved per question using the same query generation strategy.


Evaluation Measures

1) Redundancy: The average number of passages retrieved per question that contain a correct answer.

2) Coverage: Percentage of questions for which at least one passage is retrieved that contains a correct answer.


3) Mean reciprocal ranks: The mean of the reciprocal rank of the first passage retrieved that contains a correct answer.
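All three measures can be sketched over per-question result lists, where each retrieved passage is marked True if it contains a correct answer (a minimal illustration, not the paper's evaluation code):

```python
def redundancy(results):
    """Average number of answer-bearing passages per question."""
    return sum(sum(ranks) for ranks in results) / len(results)

def coverage(results):
    """Fraction of questions with at least one answer-bearing passage."""
    return sum(any(ranks) for ranks in results) / len(results)

def mrr(results):
    """Mean reciprocal rank of the first answer-bearing passage (0 if none)."""
    def rr(ranks):
        return next((1 / i for i, hit in enumerate(ranks, start=1) if hit), 0.0)
    return sum(rr(ranks) for ranks in results) / len(results)

# Two questions, three retrieved passages each.
results = [[False, True, True],    # hits at ranks 2 and 3
           [False, False, False]]  # no answer-bearing passage
print(redundancy(results), coverage(results), mrr(results))  # 1.0 0.5 0.25
```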


Coverage and redundancy

• Figure 4: Coverage and redundancy of passages retrieved for various segmentation strategies.


Mean Reciprocal Ranks

• Figure 5: Mean reciprocal ranks of passage retrieval (IR MRR) and question answering (QA MRR) for various segmentation strategies.


Conclusion

• Accurate passage retrieval is essential for Question Answering.

• Discourse based segmentation into paragraphs works well with standard information retrieval techniques.

• Among the window-based approaches a segmentation into overlapping passages of variable-length performs best, in particular for passages with sizes of 1 to 10 sentences.

• Passage retrieval is more effective than full document retrieval.


Future Work

• Further improvements to discourse-based segmentation.

• Combine several retrieval settings using various segmentation approaches.


References

[1] J. P. Callan. Passage-level evidence in document retrieval. In SIGIR '94: Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 302–310, New York, NY, USA, 1994. Springer-Verlag New York, Inc.

[2] CLEF. Multilingual question answering at CLEF. http://clef-qa.itc.it/, 2005.

[3] M. A. Greenwood. Using pertainyms to improve passage retrieval for questions requesting information about a location. In Proceedings of the Workshop on Information Retrieval for Question Answering (SIGIR 2004), Sheffield, UK, 2004.

[4] M. Kaszkiel and J. Zobel. Effective ranking with arbitrary passages. Journal of the American Society of Information Science, 52(4):344–364, 2001.

[5] D. Moldovan, S. Harabagiu, M. Pasca, R. Mihalcea, R. Girju, R. Goodrum, and V. Rus. The structure and performance of an open-domain question answering system, 2000.

[6] I. Roberts and R. Gaizauskas. Evaluating passage retrieval approaches for question answering. In Proceedings of the 26th European Conference on Information Retrieval, 2004.

[7] S. E. Robertson, S. Walker, M. Hancock-Beaulieu, A. Gull, and M. Lau. Okapi at TREC-3. In Text REtrieval Conference, pages 21–30, 1992.

[8] http://en.wikipedia.org/wiki/Information_retrieval

Thank You