computational models of discourse: texttilingcsporled/compdisc/slides/texttiling.pdf ·...

Computational Models of Discourse:TextTiling

Caroline Sporleder

Universitat des Saarlandes

Sommersemester 2009

13.05.2009

Caroline Sporleder [email protected] Computational Models of Discourse

Text Segmentation

. . . indentification of (sub-)topic shifts


Text Segmentation

Example

Penicillin is a group of beta-lactam antibiotics used in the treatmentof bacterial infections caused by susceptible, usually Gram-positive,organisms. The discovery of penicillin is usually attributed to Scot-tish scientist Sir Alexander Fleming in 1928. Fleming noticed a haloof inhibition of bacterial growth around a contaminant blue-greenmold Staphylococcus plate culture. Fleming concluded that themold was releasing a substance that was inhibiting bacterial growthand lysing the bacteria. Common adverse drug reactions associatedwith use of the penicillins include: diarrhea, nausea, rash, urticaria,and/or superinfection (including candidiasis).


Identifying Sub-Topic Boundaries

Why not simply use paragraph boundaries or section headings?

not all paragraph boundaries reflect topic changes(Stark 1988)

paragraphing conventions are genre-dependent(e.g., news texts)

subsections often too large

⇒ segmentation into multi-paragraph units

Applications

hypertext display

information retrieval

text summarisation


Sub-Topic Segmentation

Methods

find linguistic markers for “topic-shift”

adverbial clauses and prosodic markers (Brown & Yule, 1983)

certain (discourse) markers: oh, well, ok, however (Litman &Passonneau, 1995)

pronoun reference structure (Passonneau & Litman, 1993)

tense and aspect (Webber 1987)

distribution of lexical chains

distribution of lexical items (Hearst 1997) (for expositorytexts)


Sub-Topic Segmentation

Methodsfind linguistic markers for “topic-shift”

adverbial clauses and prosodic markers (Brown & Yule, 1983)

certain (discourse) markers: oh, well, ok, however (Litman &Passonneau, 1995)

pronoun reference structure (Passonneau & Litman, 1993)

tense and aspect (Webber 1987)

distribution of lexical chains

distribution of lexical items (Hearst 1997) (for expositorytexts)


TextTiling (Hearst, 1997)

Main Hypothesis“. . . when [a] subtopic changes, a significant proportion of thevocabulary changes as well.” (Hearst, 1997, p. 40)

⇒ related to segmentation with lexical chains but “shallower”(thus easier to compute)


Text Structure Types (Skorochod’ko 1972)

Compute word overlap between sentences and look at distributionof highly connected sentences:

chained

ringed

monolithic

piecewise


TextTiling (Hearst, 1997)

Example:

Stargazer text (life on earth and other planets)

Segments:

1 Intro: search for life in space (paragraphs 1–3)

2 The moon’s chemical composition (4–5)

3 How early earth-moon proximity shaped the moon (6–8)

4 How the moon helped life evolve on earth (9–12)

5 Improbability of the earth-moon system (13)

6 Binary/trinary star systems make life unlikely (14–16)

7 The low probability of nonbinary/trinary systems (17–18)

8 Properties of earth’s sun that facilitate life (19–20)

9 Summary (21)


Distribution of Terms

Distribution of terms in Stargazer (Hearst 1997, p. 42)


Distribution of Terms

frequent terms (e.g., moon): main topic

less frequent but uniformly distributed terms (scientist):neutral

terms that cluster (binary): indicative of sub-topic boundaries

⇒ need to find clusters


Comparing adjacent text blocks

AimDetermine which sentence boundaries correspond to sub-topicboundaries.

compute lexical score for the gap between pairs of text blocks.

text blocks contain k adjacent sentences and act as movingwindows

low lexical scores preceded and followed by high lexical scoresmay indicate topic shifts

Lexical ScoreThree possibilities:

dot product of word vectors (tf.idf weighting found not towork so well)

vocabulary introduction

lexical chains


Comparing adjacent text blocks

Hearst (1997), p. 44:


The TextTiling Algorithm

Three main steps:1 Tokenisation:

tokenisationstop word removalstemmingsplitting text in equal-size “pseudo-sentences” (tokensequences)

2 Computing Lexical Scoreset block size k (i.e., size of moving window) ≈ avg.paragraph lengthcompute score

3 Boundary Identificationassign a depth score to each token sequence gap i (find peakat left hand side l and peak at right hand side r ,depth − score(i) = score(l)− score(i) + score(r)− score(i))smooth scoresminor heuristics for boundary assignment (prevent boundariesthat are too close, mover boundaries to paragraph breaks)sort boundary scores and assign top n boundaries


Determining Boundaries

Hearst (1997), p. 51


Applying the algorithm to Stargazers

Hearst (1997), p. 55


Evaluation

Manually created gold standard

seven annotators for 12 articles

inter-annotator agreement: kappa score of .647

Kappa Score

K = P(A)−P(E)1−P(E)

where P(A) is the proportion of times that the annotators agreeand P(E ) the proportion of times that they would be expected toagree by chance.


Evaluation

Algorithm parameters

block/window size

number of words in a token sequence

smoothing parameters

number of boundaries to assign

⇒ ideally optimise on development set


Evaluation

Hearst (1997), p. 56


And after TextTiling?

Other work on text segmentation

mostly also based on lexical overlap, e.g., Choi (2000): cosinebetween word vectors of adjacent sentences plus comparisonof ranks

typically evaluated on artificial data sets (join different textsand attempt to find the boundaries)

What do you think about evalutation on artificial data sets?


computational models of discourse: texttilingcsporled/compdisc/slides/texttiling.pdf ·...

Documents