Processing of large document collections
Fall 2002, Part 2

Outline
1. Term selection: information gain
2. Character code issues
3. Text summarization

1. Term selection
A large document collection may contain millions of words -> document vectors would contain millions of dimensions.
Many algorithms cannot handle such a high dimensionality of the term space (= large number of terms).
Usually only a subset of the terms is used. How do we select the terms to use?
Term selection (often called feature selection or dimensionality reduction) methods.

Term selection: information gain
Information gain measures the (number of bits of) information obtained for category prediction by knowing the presence or absence of a term in a document.
Information gain is calculated for each term, and the n highest-scoring terms are selected.

Term selection: IG
Information gain for a term t:

$$G(t) = -\sum_{i=1}^{m} P(c_i)\log P(c_i) + P(t)\sum_{i=1}^{m} P(c_i \mid t)\log P(c_i \mid t) + P(\sim t)\sum_{i=1}^{m} P(c_i \mid \sim t)\log P(c_i \mid \sim t)$$

Estimating probabilities
Doc 1: cat cat cat (c)
Doc 2: cat cat cat dog (c)
Doc 3: cat dog mouse (~c)
Doc 4: cat cat cat dog dog dog (~c)
Doc 5: mouse (~c)
2 classes: c and ~c

Term selection: estimating probabilities
P(t): probability of a term t
P(cat) = 4/5: 'cat' occurs in 4 of the 5 documents, or
P(cat) = 10/17: the proportion of occurrences of 'cat' among all term occurrences

Term selection: estimating probabilities
P(~t): probability of the absence of t
P(~cat) = 1/5, or P(~cat) = 7/17

Term selection: estimating probabilities
P(ci): probability of category i
P(c) = 2/5 (the proportion of documents belonging to c in the collection), or
P(c) = 7/17 (7 of the 17 term occurrences are in documents belonging to c)

Term selection: estimating probabilities
P(ci | t): probability of category i given that t is in the document; i.e., the proportion of the documents where t occurs that belong to category i
P(c | cat) = 2/4 (or 6/10)
P(~c | cat) = 2/4 (or 4/10)
P(c | mouse) = 0
P(~c | mouse) = 1

Term selection: estimating probabilities
P(ci | ~t): probability of category i given that t is not in the document; i.e., the proportion of the documents where t does not occur that belong to category i
P(c | ~cat) = 0 (or 1/7)
P(c | ~dog) = 1/2 (or 6/12)
P(c | ~mouse) = 2/3 (or 7/15)

Term selection: estimating probabilities
In other words, let:
term t occur in B documents, A of which are in category c
category c contain D documents, out of the N documents in the collection

Term selection: estimating probabilities
For instance (see the sketch below):
P(t) = B/N
P(~t) = (N-B)/N
P(c) = D/N
P(c|t) = A/B
P(c|~t) = (D-A)/(N-B)
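
To make the estimation concrete, here is a minimal Python sketch (an addition, not from the original slides) that computes these document-frequency estimates for the toy corpus of the "Estimating probabilities" slide; A, B, D, and N follow the definitions above.

```python
# Toy corpus from the "Estimating probabilities" slide: (terms, category).
docs = [
    ("cat cat cat".split(), "c"),
    ("cat cat cat dog".split(), "c"),
    ("cat dog mouse".split(), "~c"),
    ("cat cat cat dog dog dog".split(), "~c"),
    ("mouse".split(), "~c"),
]

N = len(docs)                                # all documents in the collection
D = sum(1 for _, cat in docs if cat == "c")  # documents in category c

def estimates(t):
    """Document-frequency estimates for term t and category c."""
    B = sum(1 for terms, _ in docs if t in terms)                   # docs with t
    A = sum(1 for terms, cat in docs if t in terms and cat == "c")  # ...of them in c
    return {"P(t)": B / N,
            "P(~t)": (N - B) / N,
            "P(c)": D / N,
            "P(c|t)": A / B if B else 0.0,
            "P(c|~t)": (D - A) / (N - B) if N - B else 0.0}

print(estimates("cat"))  # P(t)=0.8, P(c|t)=0.5, P(c|~t)=0.0, as on the slides
```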

Term selection: IG
Information gain for a term t:

$$G(t) = -\sum_{i=1}^{m} P(c_i)\log P(c_i) + P(t)\sum_{i=1}^{m} P(c_i \mid t)\log P(c_i \mid t) + P(\sim t)\sum_{i=1}^{m} P(c_i \mid \sim t)\log P(c_i \mid \sim t)$$

G(cat) = -0.40
G(dog) = -0.38
G(mouse) = -0.01
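
The sketch below (an addition, not from the slides) applies the formula with the document-frequency estimates and base-2 logarithms. The absolute scores depend on the choice of probability estimates and logarithm base, so they will not necessarily match the figures quoted above; the point is the mechanics of the computation.

```python
import math

# Same toy corpus as in the previous sketch.
docs = [("cat cat cat".split(), "c"),
        ("cat cat cat dog".split(), "c"),
        ("cat dog mouse".split(), "~c"),
        ("cat cat cat dog dog dog".split(), "~c"),
        ("mouse".split(), "~c")]
N = len(docs)
D = sum(1 for _, cat in docs if cat == "c")

def plogp(ps):
    """Sum of p*log2(p) over ps, with the convention 0*log(0) = 0."""
    return sum(p * math.log2(p) for p in ps if p > 0)

def info_gain(t):
    B = sum(1 for terms, _ in docs if t in terms)
    A = sum(1 for terms, cat in docs if t in terms and cat == "c")
    g = -plogp([D / N, (N - D) / N])          # category entropy term
    if B:                                     # P(t) * sum_i P(c_i|t) log P(c_i|t)
        g += (B / N) * plogp([A / B, (B - A) / B])
    if N - B:                                 # P(~t) * sum_i P(c_i|~t) log P(c_i|~t)
        g += ((N - B) / N) * plogp([(D - A) / (N - B),
                                    (N - B - (D - A)) / (N - B)])
    return g

for t in ("cat", "dog", "mouse"):
    print(t, round(info_gain(t), 3))
```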

2. Character code issues
Abstract character vs. its graphical representation (glyph, font).
Abstract characters are grouped into alphabets; each alphabet forms the basis of the written form of a certain language or set of languages.

Character codes
For instance:
for English: uppercase letters A-Z, lowercase letters a-z, punctuation marks, digits 0-9, common symbols (+, =)
ideographic symbols of Chinese and Japanese
phonetic letters of Western languages

Some terminology
Character repertoire (Finnish: merkkivalikoima)
a set of distinct characters; an alphabet
no internal representation, ordering, etc. is assumed
usually defined by specifying the names of the characters and a sample presentation of the characters in visible form
a repertoire may contain characters which look the same (in some presentations) but are logically distinct

Some terminology
Character code (Finnish: merkkikoodi)
a mapping which defines a one-to-one correspondence between characters in a character repertoire and a set of nonnegative integers
each character is assigned a unique code position (also called code number, code value, code element, code point, code set value, or code)
the set of codes often has "holes"

Some terminology
Character encoding (Finnish: merkkikoodaus)
an algorithm for presenting characters in digital form by mapping sequences of code numbers into sequences of octets (= bytes)
in the simplest case, each character is mapped to an integer in the range 0-255 according to a character code, and these are used as octets
this works only for character repertoires with at most 256 characters
for larger sets, more complicated encodings are needed
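
As a small illustration of the simplest case, ISO 8859-1 (introduced below) maps each character of its repertoire to a single octet; in Python:

```python
text = "Größe"                       # German for 'size'; all Latin-1 characters
octets = text.encode("iso-8859-1")   # one octet per character
print(list(octets))                  # [71, 114, 246, 223, 101]
print(octets.decode("iso-8859-1"))   # round-trips back to 'Größe'
```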

Character codes
In English: 26 letters in both lower- and uppercase, ten digits, some punctuation marks.
In Russian: Cyrillic letters. Both could use the same set of code points (if the document is not bilingual).
In Japanese: there could be over 6,000 characters.

Character codes: standards
Character codes could be arbitrary, but in practice standardization is needed for interoperability (between computers, programs, ...).
Early standards were designed for English only, or for a small group of languages at a time.

Character codes: standards
ASCII
ISO 8859 (e.g. ISO Latin 1)
Unicode
UTF-8, UTF-16

ASCII
American Standard Code for Information Interchange
a seven-bit code -> 128 code positions; actually only 95 printable characters
code positions 0-31 and 127 are assigned to control characters (mostly outdated)
ISO 646 (1972), a version of ASCII, incorporated several national variants (accented letters and currency symbols); e.g. @[\]{|} were replaced

ASCII
With 7 bits, the set of code points is too small for anything other than American English.
Solution: 8 bits brings more code points (256).
The ASCII character repertoire is mapped to the values 0-127; additional symbols are mapped to the other values.

Extended ASCII
Problems:
different manufacturers each developed their own 8-bit extensions to ASCII
different character repertoires -> translation between them is not always possible
256 code values is still not enough to represent all the alphabets -> different variants for different languages

ISO 8859
Standardization of 8-bit character sets: in the 1980s the multipart standard ISO 8859 was produced.
It defines a collection of 8-bit character sets, each designed for a group of languages.
The first part, ISO 8859-1 (ISO Latin 1), covers most Western European languages: 0-127 is identical to ASCII, 128-159 is (mostly) unused, and 96 code values are used for accented letters and symbols.

"Safe" ASCII subset
Due to the national variants, only the following characters can be regarded as "safe" in data transmission:
A-Z, a-z, 0-9, and ! " % & ' ( ) * + , - . / : ; < = > ?

Unicode
256 code positions are not enough:
for ideographically represented languages (Chinese, Japanese, ...)
for simultaneous use of several languages
Solution: more than one byte for each code value; a 16-bit character set has 65,536 code positions.

Unicode
A 16-bit character set: 65,536 code positions.
Not sufficient to give all the characters required for the Chinese, Japanese, and Korean scripts distinct positions.
CJK consolidation: characters of these scripts are given the same code value if they look the same.

Unicode
Code values for all the characters used to write contemporary 'major' languages, and also the classical forms of some languages:
Latin, Greek, Cyrillic, Armenian, Hebrew, Arabic, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Georgian, Tibetan
Chinese, Japanese, and Korean ideograms, and the Japanese and Korean phonetic and syllabic scripts

Unicode
punctuation marks, technical and mathematical symbols, arrows, dingbats (pointing hands, stars, ...)
both accented letters and separate diacritical marks (accents, tildes, ...) are included, with a mechanism for building composite characters
this can also create problems: two characters that look the same may have different code values
-> normalization may be necessary

Unicode
Code values are provided for nearly 39,000 symbols.
Part of the code space is reserved for an expansion method (see later).
6,400 code points are reserved for private use: they will never be assigned to any character by the standard, so they will not conflict with it.

Unicode: encodings
The "native" Unicode encoding is UCS-2: it presents each code number as two consecutive octets m and n, where code number = 256m + n (a 2-byte integer).
This can be inefficient: for text containing ISO Latin characters only, the length of the Unicode-encoded sequence is twice the length of the ISO 8859-1 encoding.
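
A quick check of the code number = 256m + n rule in Python; for 16-bit code numbers, the big-endian UTF-16 codec produces the same two octets as UCS-2:

```python
ch = "€"                             # code number 8364 (U+20AC)
m, n = ch.encode("utf-16-be")        # two consecutive octets, big-endian
print(m, n, 256 * m + n == ord(ch))  # 32 172 True
```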

Unicode: encodings
UTF-8
ASCII code values are likely to be more common in most text than any other values.
In UTF-8 encoding, ASCII characters are sent as themselves (high-order bit 0).
Other characters (two-byte code values) are encoded using 2-6 octets, each with the high-order bit set to 1.
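
A small demonstration of the variable-length behaviour (octets printed as bit patterns, so the high-order bits are visible):

```python
for ch in ("A", "ä", "€"):
    octets = ch.encode("utf-8")
    print(ch, [f"{b:08b}" for b in octets])
# A ['01000001']                           -- 1 octet, high-order bit 0
# ä ['11000011', '10100100']               -- 2 octets, high-order bits 1
# € ['11100010', '10000010', '10101100']   -- 3 octets
```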

Unicode: encodings
UTF-16: the expansion method
two 16-bit values are combined into a 32-bit value -> about a million additional characters become available

Use of character codes
Try to use character codes logically: don't choose a character just because it looks right.
Inform applications of the encoding used: MIME headers, XML/HTML document declarations.
This should be the responsibility of the authoring applications... but...

3. Text summarization
"The process of distilling the most important information from a source to produce an abridged version for a particular user or task."

Text summarization
Many everyday uses:
headlines (from around the world)
outlines (notes for students)
minutes (of a meeting)
reviews (of books, movies)
...

Architecture of a text summarization system
Input:
a single document or multiple documents
text, images, audio, video
database

Architecture of a text summarization system
Output:
extract or abstract
compression rate: ratio of summary length to source length
connected text or fragmentary
generic or user-focused/domain-specific
indicative or informative

Architecture of a text summarization system
Three phases:
analyzing the input text
transforming it into a summary representation
synthesizing an appropriate output form

Condensation operations
selection of more salient (Finnish: "keskeinen", i.e. essential) or non-redundant information
aggregation of information (e.g. from different parts of the source, or of different linguistic descriptions)
generalization: replacing specific information with more general, abstract information

The level of processing
surface level
discourse level

Surface-level approaches
Tend to represent information in terms of shallow features.
The features are then selectively combined to yield a salience function that is used to extract information.

Surface level
Shallow features:
thematic features: presence of statistically salient terms, based on term frequency statistics
location: position in text, position in paragraph, section depth, particular sections
background: presence of terms from the title or headings in the text, or from the user's query

Surface level
Cue words and phrases:
"in summary", "our investigation"
emphasizers like "important", "in particular"
domain-specific bonus (+) and stigma (-) terms

Discourse-level approaches
Model the global structure of the text and its relation to communicative goals.
The structure can include:
format of the document (e.g. hypertext markup)
threads of topics as they are revealed in the text
rhetorical structure of the text, such as argumentation or narrative structure

Classical approaches
Luhn '58, Edmundson '69
General idea: give a score to each sentence; choose the sentences with the highest scores to be included in the summary.

Luhn's method
Filter the terms in the document using a stoplist.
Terms are normalized by combining orthographically similar terms:
differentiate, different, differently, difference -> differen
Frequencies of the combined terms are calculated, and non-frequent terms are removed.
-> "significant" terms remain

Luhn's method
Sentences are weighted using the resulting set of "significant" terms and a term density measure:
each sentence is divided into segments bracketed by significant terms that are not more than 4 non-significant terms apart
each segment is scored by taking the square of the number of bracketed significant terms divided by the total number of bracketed terms

Exercise (CNN News)
Let {13, computer, servers, Internet, traffic, attack, officials, said} be significant words.
"Nine of the 13 computer servers that manage global Internet traffic were crippled by a powerful electronic attack this week, officials said."

Exercise (CNN News)
Let {13, computer, servers, Internet, traffic, attack, officials, said} be significant words.
* * * [13 computer servers * * * Internet traffic] * * * * * * [attack * * officials said]

Exercise (CNN News)
[13 computer servers * * * Internet traffic]
score: 5² / 8 = 25/8 = 3.1
[attack * * officials said]
score: 3² / 5 = 9/5 = 1.8
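
A small Python sketch of this scoring (an addition, not from the slides; tokens lowercased and punctuation stripped) that reproduces the exercise:

```python
def luhn_score(words, significant, max_gap=4):
    """Best Luhn segment score in a sentence: a segment is a run of
    significant words with at most max_gap non-significant words between
    consecutive significant ones; score = (#significant)**2 / segment length."""
    def score(group):
        return len(group) ** 2 / (group[-1] - group[0] + 1)

    positions = [i for i, w in enumerate(words) if w in significant]
    best, group = 0.0, []
    for pos in positions:
        if group and pos - group[-1] > max_gap + 1:  # gap too large: close segment
            best, group = max(best, score(group)), []
        group.append(pos)
    return max(best, score(group)) if group else 0.0

sig = {"13", "computer", "servers", "internet", "traffic",
       "attack", "officials", "said"}
sentence = ("nine of the 13 computer servers that manage global internet "
            "traffic were crippled by a powerful electronic attack this "
            "week officials said").split()
print(luhn_score(sentence, sig))  # 3.125 = 5**2/8, from the first segment
```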

Luhn's method
The score of the highest-scoring segment is taken as the sentence score.
The highest-scoring sentences are chosen for the summary; a cutoff value is given.

"Modern" application
Text summarization of web pages on handheld devices (Buyukkokten, Garcia-Molina, Paepcke; 2001)
macro-level summarization
micro-level summarization

Web page summarization
Macro-level summarization:
the web page is partitioned into 'Semantic Textual Units' (STUs)
paragraphs, lists, alt texts (for images)
a hierarchy of STUs is identified
list - list item, table - table row
nested STUs are hidden

Web page summarization
Micro-level summarization: 5 methods were tested for displaying STUs in several states:
incremental: 1) the first line, 2) the first three lines, 3) the whole STU
all: the whole STU in a single state
keywords: 1) important keywords, 2) the first three lines, 3) the whole STU

Web page summarization
summary: 1) the STU's 'most significant' sentence is displayed, 2) the whole STU
keyword/summary: 1) keywords, 2) the STU's 'most significant' sentence, 3) the whole STU
The combination of keywords and a summary gave the best performance for discovery tasks on web pages.

Web page summarization
Extracting summary sentences: sentences are scored using a variant of Luhn's method:
words are TF*IDF weighted; given a weight cutoff value, the high-scoring words are selected as significant words
weight of a segment: the sum of the weights of the significant words divided by the total number of words within the segment
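
A sketch of the significant-word step, under assumptions: the exact TF*IDF variant, tokenization, and cutoff policy of the STU summarizer are not given here, so this is one plausible reading.

```python
import math
from collections import Counter

def significant_words(pages, idx, cutoff=1.0):
    """TF*IDF-weight the words of pages[idx] against the collection and
    keep the words whose weight reaches the cutoff (sketch)."""
    n = len(pages)
    df = Counter()                            # document frequency per word
    for page in pages:
        df.update(set(page.lower().split()))
    tf = Counter(pages[idx].lower().split())  # term frequency in this page
    return {w for w, f in tf.items() if f * math.log(n / df[w]) >= cutoff}

pages = ["the cat sat on the mat", "the dog barked", "the cat chased the dog"]
print(significant_words(pages, 2, cutoff=0.4))  # {'cat', 'chased', 'dog'}
```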

Edmundson's method
Extends earlier work to look at three features in addition to word frequencies:
cue phrases (e.g. "significant", "impossible", "hardly")
title and heading words
location

Edmundson's method
Programs weight sentences based on each of the four features.
Weight of a sentence = the sum of the feature weights (see the sketch below).
The programs were evaluated by comparison against manually created extracts.
Corpus-based methodology: training set and test set; in the training phase, the weights were manually readjusted.
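
A minimal sketch of this weighting scheme; the linear form follows the slide's description, while the coefficient values are placeholders rather than Edmundson's:

```python
# Edmundson-style sentence weight: a linear combination of the four
# feature scores (cue, key, title, location). The coefficients a1..a4
# are placeholders to be tuned against a training corpus; Edmundson
# readjusted them manually.
def sentence_weight(cue, key, title, location, a=(1.0, 1.0, 1.0, 1.0)):
    a1, a2, a3, a4 = a
    return a1 * cue + a2 * key + a3 * title + a4 * location
```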

Edmundson's method
Results:
the three additional features dominated the word-frequency measures
the combination cue-title-location was the best, with location being the best individual feature
keywords alone were the worst

Fundamental issues
What are the most powerful, but also most general, features to exploit for summarization?
How do we combine these features?
How can we evaluate how well we are doing?

Corpus-based approaches
In the classical methods, various features (thematic features, title, location, cue phrases) were used to determine the salience of information for summarization.
An obvious issue: determining the relative contribution of different features to any given text summarization task.

Corpus-based approaches
The contribution is dependent on the text genre, e.g. location:
in newspaper stories, the leading text often contains a summary
in TV news, a preview segment may contain a summary of the news to come
in scientific text: an author-written abstract

Corpus-based approaches
The importance of different text features for any given summarization problem can be determined by counting the occurrences of such features in text corpora.
In particular, analysis of human-generated summaries, along with their full-text sources, can be used to learn rules for summarization.

Corpus-based approaches
Challenges:
creating a suitable text corpus, designing an annotation scheme
ensuring a suitable set of summaries is available
may already be available: scientific papers
if not: author, professional abstractor, or judge

KPC method
Kupiec, Pedersen, Chen (1995): A Trainable Document Summarizer
a learning method using a corpus of abstracts written by professional human abstractors (Engineering Information Co.)
a naive Bayesian classification method is used

KPC method: general idea
Training phase:
select a set of features
calculate the probability of each feature value appearing in a summary sentence, using a training corpus (e.g. originals + manual summaries)

KPC method: general idea
When a new document is summarized:
for each sentence:
find the values of the features
calculate the probability of this feature-value combination appearing in a summary sentence
choose the n best-scoring sentences

KPC method: features
Sentence-length cut-off feature:
given a threshold (e.g. 5 words), the feature is true for all sentences longer than the threshold, and false otherwise
F1(s) = 0, if sentence s has 5 or fewer words
F1(s) = 1, if sentence s has more than 5 words

KPC method: features
Paragraph feature:
sentences in the first 10 paragraphs and the last 5 paragraphs of a document get a higher value
within paragraphs, paragraph-initial, paragraph-final, and paragraph-medial sentences are distinguished

KPC method: features
Paragraph feature:
F2(s) = i, if sentence s is the first sentence in a paragraph
F2(s) = f, if there are at least 2 sentences in the paragraph and s is the last one
F2(s) = m, if there are at least 3 sentences in the paragraph and s is neither the first nor the last sentence

KPC method: features
Thematic word feature:
a small number of thematic words (the most frequent content words) are selected
each sentence is scored as a function of the frequency of the thematic words it contains
the highest-scoring sentences are selected
a binary feature: true for a sentence if the sentence is present in the set of highest-scoring sentences

KPC method: features
Fixed-phrase feature:
true for sentences that contain any of 26 indicator phrases (e.g. "this letter...", "In conclusion..."), or that follow a section head containing specific keywords (e.g. "results", "conclusion")

KPC method: features
Uppercase word feature:
proper names and explanatory text for acronyms are usually important
the feature is computed like the thematic word feature
an uppercase thematic word is not sentence-initial, begins with a capital letter, and must occur several times
the first occurrence is scored twice as much as later occurrences

Exercise (CNN news)
sentence-length, F1: let threshold = 14
fewer than 14 words: F1(s) = 0, else F1(s) = 1
paragraph, F2:
i = first, f = last, m = medial
thematic words, F3:
score: how many thematic words a sentence has
F3(s) = 1, if score > 3, else F3(s) = 0

KPC method: classifier
For each sentence s, we compute the probability that s will be included in a summary S, given the k features F_j, j = 1...k.
The probability can be expressed using Bayes' rule:

$$P(s \in S \mid F_1, \ldots, F_k) = \frac{P(F_1, \ldots, F_k \mid s \in S)\, P(s \in S)}{P(F_1, \ldots, F_k)}$$

KPC method: classifier
Assuming statistical independence of the features:

$$P(s \in S \mid F_1, \ldots, F_k) = \frac{P(s \in S) \prod_{j=1}^{k} P(F_j \mid s \in S)}{\prod_{j=1}^{k} P(F_j)}$$

P(s ∈ S) is a constant, and P(F_j | s ∈ S) and P(F_j) can be estimated directly from the training set by counting occurrences.
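
A sketch of the resulting scoring function. The probability tables and feature values below are made-up toy numbers, not KPC's estimates; sums of logarithms replace the products for numerical stability, which leaves the ranking of sentences unchanged.

```python
import math

def kpc_score(feature_values, p_f_given_s, p_f, p_s):
    """Score one sentence with the naive Bayes rule above (sketch).
    feature_values: {feature name: observed value}; the probability
    tables would be estimated from the training corpus by counting."""
    score = math.log(p_s)
    for f, v in feature_values.items():
        score += math.log(p_f_given_s[f][v]) - math.log(p_f[f][v])
    return score

# Toy tables and one sentence's feature values (hypothetical numbers):
p_f_given_s = {"length": {0: 0.1, 1: 0.9},
               "paragraph": {"i": 0.6, "f": 0.2, "m": 0.2}}
p_f = {"length": {0: 0.3, 1: 0.7},
       "paragraph": {"i": 0.3, "f": 0.2, "m": 0.5}}
print(kpc_score({"length": 1, "paragraph": "i"}, p_f_given_s, p_f, p_s=0.05))
```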

KPC method: corpus
The corpus was acquired from Engineering Information Co., which provides abstracts of technical articles to online information services.
The articles do not have author-written abstracts; the abstracts were created by professional abstractors.

KPC method: corpus
188 document/summary pairs, sampled from 21 publications in the scientific/technical domain
the summaries are mainly indicative; average length is 3 sentences
the average number of sentences in the original documents is 86
author, address, and bibliography sections were removed

KPC method: sentence matching
The abstracts from the human abstractors are not extracts, but they are inspired by the original sentences.
The automatic summarization task here: extract the sentences that the human abstractor might have chosen to prepare the summary text (with minor modifications...).

KPC method: sentence matching
For training, a correspondence between the manual summary sentences and the sentences in the original document needs to be obtained.
The matching can be done in several ways.

KPC method: sentence matching
The matching can be done in several ways:
a direct sentence match: the same sentence is found in both
a direct join: 2 or more original sentences were used to form a summary sentence
a summary sentence can be 'unmatchable'
a summary sentence (single or joined) can be 'incomplete'

KPC method: sentence matching
Matching was done in two passes:
first, the best one-to-one sentence matches were found automatically
second, these matches were used as a starting point for the manual assignment of correspondences

KPC method: evaluation
A cross-validation strategy was used for evaluation:
the documents from a given journal were selected for testing, one journal at a time; all other document/summary pairs were used for training
unmatchable and incomplete summary sentences were excluded
a total of 498 unique sentences

KPC method: evaluation
Two ways of evaluating:
1. the fraction of manual summary sentences that were faithfully reproduced by the summarizer program
the summarizer produced the same number of sentences as were in the corresponding manual summary
-> 35% of the summary sentences were reproduced
83% is the highest possible value, since unmatchable and incomplete sentences were excluded
2. the fraction of the matchable sentences that were correctly identified by the summarizer -> 42%

KPC method: evaluation
The effect of the different features was also studied:
best combination (44%): paragraph, fixed-phrase, sentence-length
baseline: selecting sentences from the beginning of the document (result: 24%)
if 25% of the original sentences are selected: 84%

Discourse-based approaches
Discourse structure appears to play an important role in the strategies used by human abstractors and in the structure of their abstracts.
An abstract is not just a collection of sentences; it has an internal structure
-> an abstract should be coherent, and it should represent some of the argumentation used in the source

Discourse models
Cohesion: relations between words or referring expressions, which determine how tightly connected the text is
anaphora, ellipsis, synonymy, hypernymy (a dog is a kind of animal)
Coherence: the overall structure of a multi-sentence text in terms of macro-level relations between sentences (e.g. "although" -> contrast)

Boguraev, Kennedy (BG)
Goal: identify those phrasal units across the entire span of the document that best function as representative highlights of the document's content.
These phrasal units are called topic stamps.
A set of topic stamps is called a capsule overview.

BG
A capsule overview is:
not a set/sequence of sentences
a semi-formal (normalised) representation of the document, derived by a process of data reduction over the original text
not always very readable, but it still represents the flow of the narrative
can be combined with surrounding information to produce a more coherent presentation

Priest is charged with Pope attack
A Spanish priest was charged here today with attempting to murder the Pope. Juan Fernandez Krohn, aged 32, was arrested after a man armed with a bayonet approached the Pope while he was saying prayers at Fatima on Wednesday night.
According to the police, Fernandez told the investigators today that he trained for the past six months for the assault. He was alleged to have claimed the Pope 'looked furious' on hearing the priest's criticism of his handling of the church's affairs. If found guilty, the Spaniard faces a prison sentence of 15-20 years.

Capsule overview vs. summary
A summary could be, e.g.:
"A Spanish priest is charged after an unsuccessful murder attempt on the Pope"
A capsule overview:
A SPANISH PRIEST was charged
attempting to murder the POPE
HE trained for the assault
POPE furious on hearing PRIEST'S criticisms

BG
Primary consideration: the methods should apply to any document type and source (domain independence).
Also: efficient and scalable technology; shallow syntactic analysis, no comprehensive parsing engine needed.

BG
Based on findings about technical terms:
technical terms have linguistic properties that can be used to find terms automatically and quite reliably in different domains
technical terms seem to be topical
The task of content characterization: identifying phrasal units that have
lexico-syntactic properties similar to those of technical terms
discourse properties that signify their status as most prominent

BG: terms as content indicators
Problems:
undergeneration
overgeneration
differentiation

Undergeneration
The set of phrases should contain an exhaustive description of all the entities that are discussed in the text.
The set of technical terms has to be extended to also include expressions with pronouns, etc.

Overgeneration
The set of technical terms alone can already be large, and the extensions make the information overload even worse.
Solution: phrases that refer to the same participant in the discourse are combined using referential links.

Differentiation
The same list of terms may describe two documents even if they, e.g., focus on different subtopics.
It is necessary to differentiate term sets not only according to their membership, but also according to the relative representativeness of the terms they contain.

Term sets and coreference classes
Phrases are extracted using a phrasal grammar (e.g. a noun with modifiers).
Expressions with pronouns and incomplete expressions are also extracted, using a (Lingsoft) tagger that provides information about the part of speech, number, gender, and grammatical function of the tokens in a text.
This solves the undergeneration problem.

Term sets and coreference classes
The phrase set has to be reduced to solve the problem of overgeneration
-> a smaller set of expressions that uniquely identify the objects referred to in the text
application of anaphora resolution: e.g., to which noun does the pronoun 'he' refer?

Resolving coreferences
Procedure:
move through the text sentence by sentence, analysing the nominal expressions in each sentence from left to right
either an expression is identified as a new participant in the discourse, or it is taken to refer to a previously mentioned referent

Resolving coreferences
Coreference is determined by a 3-step procedure:
a set of candidates is collected: all nominals within a local segment of discourse
some candidates are eliminated due to morphological mismatches or syntactic restrictions
the remaining candidates are ranked according to their relative salience in the discourse

Salience factors
sent(term) = 100 iff term is in the current sentence
cntx(term) = 50 iff term is in the current discourse segment
subj(term) = 80 iff term is a subject
acc(term) = 50 iff term is a direct object
dat(term) = 40 iff term is an indirect object
...

Local salience of a candidate
The local salience of a candidate is the sum of the values of its salience factors.
The most salient candidate is selected as the antecedent.
If a coreference link cannot be established to some other expression, the nominal is taken to introduce a new referent.
-> coreference classes
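
A minimal sketch of this selection step, using the factor values from the previous slide (only a subset of the factors; the candidates and the factors holding for them are hypothetical):

```python
# Salience factor values from the previous slide (subset).
FACTORS = {"sent": 100, "cntx": 50, "subj": 80, "acc": 50, "dat": 40}

def local_salience(factors):
    """Local salience of a candidate = sum of the factors that hold for it."""
    return sum(FACTORS[f] for f in factors)

# Hypothetical antecedent candidates, with the factors that hold for each:
candidates = {"priest": {"sent", "cntx", "subj"}, "Pope": {"cntx", "acc"}}
antecedent = max(candidates, key=lambda c: local_salience(candidates[c]))
print(antecedent, local_salience(candidates[antecedent]))  # priest 230
```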

Topic stamps
In order to further reduce the referent set, some additional structure has to be imposed: the term set is ranked according to the salience of its members
i.e. the relative prominence or importance in the discourse of the entities to which the terms refer
objects in the centre of discussion have a high degree of salience

Saliency
Measured like local salience in coreference resolution, but tries to measure the importance of unique referents in the discourse.

Priest is charged with Pope attack
A Spanish priest was charged here today with attempting to murder the Pope. Juan Fernandez Krohn, aged 32, was arrested after a man armed with a bayonet approached the Pope while he was saying prayers at Fatima on Wednesday night.
According to the police, Fernandez told the investigators today that he trained for the past six months for the assault. He was alleged to have claimed the Pope 'looked furious' on hearing the priest's criticism of his handling of the church's affairs. If found guilty, the Spaniard faces a prison sentence of 15-20 years.

Saliency
'priest' is the primary element:
eight references to the same actor in the body of the story
these references occur in important syntactic positions: 5 are subjects of main clauses, 2 are subjects of embedded clauses, 1 is a possessive
'Pope attack' is also important:
'Pope' occurs 5 times, but not in such important positions (2 are direct objects)

Discourse segments
If the intention is to use very concise descriptions of one or two salient phrases, i.e. topic stamps, longer texts have to be broken down into smaller segments.
Topically coherent, contiguous segments can be found by using a lexical similarity measure.
Assumption: the distribution of the words used changes when the topic changes.
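
A simple sketch of this idea: compare the word distributions of adjacent windows of sentences with cosine similarity and propose a boundary where the similarity drops. The window size and threshold are arbitrary choices here, not values from BG.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) \
         * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def segment_boundaries(sentences, window=3, threshold=0.1):
    """Propose a segment boundary wherever the lexical similarity between
    the windows of sentences before and after a gap drops below threshold."""
    toks = [s.lower().split() for s in sentences]
    cuts = []
    for i in range(window, len(toks) - window + 1):
        left = Counter(w for s in toks[i - window:i] for w in s)
        right = Counter(w for s in toks[i:i + window] for w in s)
        if cosine(left, right) < threshold:
            cuts.append(i)
    return cuts
```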

BG: Summarization process
1. linguistic analysis
2. discourse segmentation
3. extended phrase analysis
4. anaphora resolution
5. calculation of discourse salience
6. topic stamp identification
7. capsule overview

Knowledge-rich approaches
Structured information can be used as the starting point for summarization.
Structured information: e.g. data and knowledge bases; it may have been produced by processing input text.
The summarizer does not have to address the linguistic complexities and variability of the input, but on the other hand the structure of the input text is not available either.

Knowledge-rich approaches
There is a need for measures of salience and relevance that are dependent on the knowledge source.
Addressing coherence, cohesion, and fluency becomes the entire responsibility of the generator.

STREAK
McKeown, Robin, Kukich (1995): Generating concise natural language summaries
Goal: folding information from multiple facts into a single sentence using concise linguistic constructions.

STREAK
Produces summaries of basketball games:
first creates a draft of the essential facts
then uses revision rules, constrained by the draft wording, to add in additional facts as the text allows

STREAK
Input:
a set of box scores for a basketball game
historical information (from a database)
Task: summarize the highlights of the game, underscoring their significance in the light of previous games.
Output: a short summary of a few sentences.

STREAK
The box-score input is represented as a conceptual network that expresses relations between what were the columns and rows of the table.
Essential facts: the game result, its location and date, and at least one final game statistic (the most remarkable statistic of a winning-team player).

STREAK
Essential facts can be obtained directly from the box score.
In addition, other potential facts:
other notable game statistics of individual players - from the box score
game-result streaks ("Utah recorded its fourth straight win") - historical
extremum performances, such as maximums or minimums - historical

STREAK
Essential facts are always included; potential facts are included if there is space.
The decision on which potential facts to include can be based on whether they can be combined with the essential information in cohesive and stylistically successful ways.

STREAK
Given the facts:
Karl Malone scored 39 points.
Karl Malone's 39-point performance is equal to his season high.
a single sentence is produced:
"Karl Malone tied his season high with 39 points."

Text summarization
surface-level methods
"manual" features
corpus-based learning
discourse-level methods
knowledge-rich methods