automatic text segmentation:...

26
Introduction TextTiling Evaluation Conclusion Automatic Text Segmentation: TextTiling Will Roberts [email protected] Tuesday, 20 November, 2007 1/20 Will Roberts Automatic Text Segmentation: TextTiling

Upload: ngodat

Post on 08-Sep-2018

228 views

Category:

Documents


1 download

TRANSCRIPT

IntroductionTextTilingEvaluationConclusion

Automatic Text Segmentation: TextTiling

Will [email protected]

Tuesday, 20 November, 2007

1/20 Will Roberts Automatic Text Segmentation: TextTiling

IntroductionTextTilingEvaluationConclusion

MotivationApproaches

1 IntroductionMotivationApproaches

2 TextTilingTheoryMethod

3 Evaluation

4 Conclusion

2/20 Will Roberts Automatic Text Segmentation: TextTiling

IntroductionTextTilingEvaluationConclusion

MotivationApproaches

Uses for Automatic Text Segmentation

hypertext display

information/passage retrieval

text summarization

automatic text generation

measuring stylistic variation for genre detection

aligning parallel multilingual corpora

breaking up connected documents

3/20 Will Roberts Automatic Text Segmentation: TextTiling

IntroductionTextTilingEvaluationConclusion

MotivationApproaches

Other Text Segmentation Approaches

Clustering or similarity matrices based on word co-occurance

Machine-learning or hand-crafted solutions for detection ofcue words

4/20 Will Roberts Automatic Text Segmentation: TextTiling

IntroductionTextTilingEvaluationConclusion

TheoryMethod

TextTiling Theory

Focus on multi-paragraph units in expository text

Topics not always contained in single paragraphs

Identify subtopic shifts

Subtopic: piece of text “about” somethingIdentify topic shift, not topicLinear segmentation

Subtopic shifts associated with change in vocabulary

Linguistically simple: no prosody, discourse markers, pronounreference resolution, . . .

5/20 Will Roberts Automatic Text Segmentation: TextTiling

IntroductionTextTilingEvaluationConclusion

TheoryMethod

TextTiling Theory

Focus on multi-paragraph units in expository text

Topics not always contained in single paragraphs

Identify subtopic shifts

Subtopic: piece of text “about” somethingIdentify topic shift, not topicLinear segmentation

Subtopic shifts associated with change in vocabulary

Linguistically simple: no prosody, discourse markers, pronounreference resolution, . . .

5/20 Will Roberts Automatic Text Segmentation: TextTiling

IntroductionTextTilingEvaluationConclusion

TheoryMethod

TextTiling Theory

Focus on multi-paragraph units in expository text

Topics not always contained in single paragraphs

Identify subtopic shifts

Subtopic: piece of text “about” somethingIdentify topic shift, not topicLinear segmentation

Subtopic shifts associated with change in vocabulary

Linguistically simple: no prosody, discourse markers, pronounreference resolution, . . .

5/20 Will Roberts Automatic Text Segmentation: TextTiling

IntroductionTextTilingEvaluationConclusion

TheoryMethod

Word Occurance Counts

6/20 Will Roberts Automatic Text Segmentation: TextTiling

IntroductionTextTilingEvaluationConclusion

TheoryMethod

TextTiling Algorithm

1 Tokenize

2 Calculate lexical similarity scores

3 Determine inter-sentence boundaries

7/20 Will Roberts Automatic Text Segmentation: TextTiling

IntroductionTextTilingEvaluationConclusion

TheoryMethod

Lexical Similarity

Computed for each sentence gap in text

Measure of the lexical similarity of the two sentences/blockson either side

Moving window of size k

8/20 Will Roberts Automatic Text Segmentation: TextTiling

IntroductionTextTilingEvaluationConclusion

TheoryMethod

Lexical Similarity

Block comparison

Normalized inner product of two word vectors

New vocabulary

New words introduced in a window centered around thesentence boundary

Lexical chaining

Number of lexical chains active at the sentence boundary

9/20 Will Roberts Automatic Text Segmentation: TextTiling

IntroductionTextTilingEvaluationConclusion

TheoryMethod

Lexical Similarity

Block comparison

Normalized inner product of two word vectors

New vocabulary

New words introduced in a window centered around thesentence boundary

Lexical chaining

Number of lexical chains active at the sentence boundary

9/20 Will Roberts Automatic Text Segmentation: TextTiling

IntroductionTextTilingEvaluationConclusion

TheoryMethod

Lexical Similarity

Block comparison

Normalized inner product of two word vectors

New vocabulary

New words introduced in a window centered around thesentence boundary

Lexical chaining

Number of lexical chains active at the sentence boundary

9/20 Will Roberts Automatic Text Segmentation: TextTiling

IntroductionTextTilingEvaluationConclusion

TheoryMethod

Lexical Similarity

10/20 Will Roberts Automatic Text Segmentation: TextTiling

IntroductionTextTilingEvaluationConclusion

TheoryMethod

Calculating Word Significance Values

Example

Saarland University (German: Universitat des Saarlandes) is auniversity located in Saarbrucken, the capital of the German stateof Saarland. It was founded in 1948 in co-operation with Franceand is organized in 8 faculties that cover all major fields of science.The university is particularly well known for research and educationin Computer Science and Medicine.Saarland University, the first to be established after the SecondWorld War, was founded in November 1948 with the support ofthe French Government and under the auspices of the University ofNancy.At the time the Saarland found itself in the special situation ofbeing partly autonomous and linked to France by economic . . .

11/20 Will Roberts Automatic Text Segmentation: TextTiling

IntroductionTextTilingEvaluationConclusion

TheoryMethod

Boundary Identification

Depth score is computed at each sentence gap from thelexical scores

Score is the sum of the heights of the peaks on either side ofthe sentence gapDi = (si−1 − si ) + (si+1 − si ) = si−1 + si+1 − 2si

Smooth depth scores and choose local minima

12/20 Will Roberts Automatic Text Segmentation: TextTiling

IntroductionTextTilingEvaluationConclusion

TheoryMethod

Boundary Identification

13/20 Will Roberts Automatic Text Segmentation: TextTiling

IntroductionTextTilingEvaluationConclusion

Evaluation Caveats

Algorithm depends on parameters: window size, smoothing,number of boundaries

Evaluation depends on desired application: precision, recall,and near-misses

14/20 Will Roberts Automatic Text Segmentation: TextTiling

IntroductionTextTilingEvaluationConclusion

Evaluation Methods

Breaking consecutive documents

Comparison with human judges

Humans rarely agree on correct segmentationConsensus of human judges can be used as “gold standard”

Pk metric

15/20 Will Roberts Automatic Text Segmentation: TextTiling

IntroductionTextTilingEvaluationConclusion

Evaluation Methods

Breaking consecutive documents

Comparison with human judges

Humans rarely agree on correct segmentationConsensus of human judges can be used as “gold standard”

Pk metric

15/20 Will Roberts Automatic Text Segmentation: TextTiling

IntroductionTextTilingEvaluationConclusion

Evaluation Methods

Breaking consecutive documents

Comparison with human judges

Humans rarely agree on correct segmentationConsensus of human judges can be used as “gold standard”

Pk metric

15/20 Will Roberts Automatic Text Segmentation: TextTiling

IntroductionTextTilingEvaluationConclusion

Human Judges

16/20 Will Roberts Automatic Text Segmentation: TextTiling

IntroductionTextTilingEvaluationConclusion

Pk Metric

17/20 Will Roberts Automatic Text Segmentation: TextTiling

IntroductionTextTilingEvaluationConclusion

Pk Metric

18/20 Will Roberts Automatic Text Segmentation: TextTiling

IntroductionTextTilingEvaluationConclusion

Conclusion

Strengths

Linguistically and computationally simple

Language independent

Weaknesses

Designed for expository text; poor for narrative texts anddiscourse

Near-miss errors might be unacceptable for some applications

Cannot extract hierarchical text structure

19/20 Will Roberts Automatic Text Segmentation: TextTiling

IntroductionTextTilingEvaluationConclusion

Literature

Hearst, M.Segmenting Text into Multi-paragraph Subtopic PassagesComputational Linguistics, 23(1), 1997.

Pevzner, L., and Hearst, M.A Critique and Improvement of an Evaluation Metric for TextSegmentationComputational Linguistics, pp. 9–16, 2001.

Richmond, K., Smith, A., and Amitay, E.Detecting Subject Boundaries Within a Text: A LanguageIndependent Statistical ApproachEMNLP, 1997.

20/20 Will Roberts Automatic Text Segmentation: TextTiling