14
Text mining
I mentioned in Chapter 3 that there was a special class of data, namely text data. The earlier
chapters discussed tools for manipulating data consisting of codes and quantities; the aim of
this chapter is to complete our survey of data mining by showing how it can be combined with
linguistics and lexicometry for the automatic analysis and use of text data.
14.1 Definition of text mining
Text mining is the set of techniques and methods used for the automatic processing of natural
language text data available in reasonably large quantities in the form of computer files, with
the aim of extracting and structuring their contents and themes, for the purposes of rapid (non-
literary) analysis, the discovery of hidden data, or automatic decision making. It is different
from stylometry, which studies the style of texts in order to identify an author or date the work,
but it has much in common with lexicometry or lexical statistics (also known as ‘linguistic
statistics’ or ‘quantitative linguistics’); indeed, it is an extension of the latter science using
advanced methods of multidimensional statistics.
We can show this schematically as:
Text mining = Lexicometry + Data mining
Like data mining, text mining originated partly in response to the huge volume of text data
created and diffused in our society (think of the amounts of laws, orders, regulations,
contracts, for example), and partly for the purpose of quasi-generalized input and storage
of these data in computer systems. It also owes its acceptance to developments in statistical
and data processing tools whose power has increased greatly in recent years. Thus, following
the work of researchers such as Jean-Baptiste Estoup, George Kingsley Zipf, Benoît
Mandelbrot, George Udny Yule, Pierre Guiraud, Charles Muller, Gustav Herdan, Étienne
Brunet, Jean-Paul Benzécri, Ludovic Lebart and André Salem, there has been an exponential
growth in the use of statistics, probabilities, data analysis, Markov chains and artificial
Data Mining and Statistics for Decision Making, First Edition. Stéphane Tufféry.
© 2011 John Wiley & Sons, Ltd. Published 2011 by John Wiley & Sons, Ltd. ISBN: 978-0-470-68829-8
intelligence tools based on data mining for the processing of text material, and we have made
considerable progress since the early days of simple calculation of percentages. Beginning in
1916 (Estoup1) and 1935 (Zipf2), the frequency of appearance of a word in a text has been
studied by statistical methods, giving rise to Zipf’s law, a well-known formula which links the
frequency of a word to its rank in the table of frequencies. The example of James Joyce’s
Ulysses is famous: the 10th word appears 2653 times, the 100th word appears 265 times, the
1000th word appears 26 times and the 10 000th word appears twice. We find that the product
of the rank r and the frequency f is virtually constant:
rf = constant.
This law is not always valid with the same degree of accuracy, but it is truly universal,
because it applies to all types of text in all languages. Wentian Li3 demonstrated in 1992 that it
could be applied to a text in which the words were created by drawing letters (and a ‘space’
character) at random from an alphabet with a uniform distribution.
The formula shown above has now been revised to
r^a f = constant,
where a is an exponent which depends on the language and the type of speaker. It generally
ranges from 1.1 to 1.3, and is close to 1.6 in children’s language. As a general rule, it decreases
with the richness of the corpus, measured as the ratio of the number of different words V
(the vocabulary) to the total number of words (V is generally proportional to the square
root of N).
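The rank–frequency products quoted above for Ulysses can be checked directly, and the same product can be computed for any word-count table. The sketch below (the helper function name is invented for illustration) does both:

```python
from collections import Counter

# Rank-frequency figures for Ulysses quoted above: rank -> frequency
ulysses = {10: 2653, 100: 265, 1000: 26, 10000: 2}
products = {r: r * f for r, f in ulysses.items()}
# The products all stay near 26 000: Zipf's law roughly holds

def zipf_products(text, ranks):
    """Return rank * frequency for the given 1-based ranks of a text's words."""
    # Sort word counts in descending order to build the frequency table
    freqs = sorted(Counter(text.lower().split()).values(), reverse=True)
    return {r: r * freqs[r - 1] for r in ranks if r <= len(freqs)}
```

Applied to a real corpus, the dictionary returned by `zipf_products` should contain values of roughly the same order of magnitude across a wide range of ranks.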
Zipf’s law has since been extended to other rank-size problems, such as the rank of cities
in a country related to their size, the rank of businesses related to their turnover, the rank of
individuals related to their income, etc.
One interesting consequence of Zipf’s law for text mining is that a few tens of words are
enough to represent a large part of any corpus, enabling the depth and complexity of analyses
to be limited.
As in data mining, there are two types of method in text mining. Descriptive methods can
be used to search for themes dealt with in a set (corpus) of documents, without knowing these
themes in advance. Predictive methods find rules for automatically assigning a document to
one of a number of predefined themes. This may be done, for example, for the purpose of
automatically forwarding a letter or a CV to the appropriate department. The corpus analysed
must meet the following conditions:
. it must be in a data processing format (the automatic reading of handwriting, used in the
processing of cheques and mail, is a different problem);
. it must include a minimum number of texts;
. it must be sufficiently comprehensible and coherent;
1 Estoup, J.-B. (1916) Gammes Sténographiques, 4th edn. Paris: Imprimerie Moderne.
2 Zipf, G.K. (1935) The Psycho-biology of Language. Boston: Houghton-Mifflin. The definitive formulation can be found in Zipf, G.K. (1949) Human Behavior and the Principle of Least Effort. Cambridge, MA: Addison-Wesley.
3 Li, W. (1992) Random texts exhibit Zipf's-Law-like word frequency distribution. IEEE Transactions on Information Theory, 38(6), 1842–1845.
. there must not be too many different themes in each text;
. it must avoid, as far as possible, the use of innuendo, irony and antiphrasis (saying
the opposite of what one thinks, e.g. ‘Oh, brilliant!’ in response to a particularly
stupid blunder).
14.2 Text sources used
The main sources of texts analysed by text mining are opinion polls, customer satisfaction
surveys, letters of complaint, telephone interview transcriptions, electronic mail, reports of
marketing or medical interviews, press surveys, despatches from news agencies, experts’
documentation and reports, technology monitoring, competition monitoring, strategic and
economic monitoring, the Internet and on-line databases, and more recently curricula vitae.
Users of the information analysed may be financial analysts, economists, marketing
professionals, customer relations services, recruiters or decision makers.
14.3 Using text mining
Some periodical analyses in which the presentation is always identical can be automated by
using text mining. This generates quick analyses without the need for repetitive and tedious
computation. The applications include the automatic generation of satisfaction surveys,
reports on a business’s image or the state of the competition, and the automatic indexing of
documents.
Text mining is also used to discover hidden information ('descriptive method'), for example
new research fields (in filed patents), or information to be added to marketing databases on
customers' areas of interest and plans. It can even be used by a business wishing to communicate
with its customers in the vocabulary that they use, and to adapt its marketing presentations
to each customer segment. It can be used in search engines on the web.
Finally, text mining is an aid for decision making (‘predictive methods’), for example in
automatic mail routeing, email filtering (to identify spam and non-spam, technical and
business subject matter, etc.), data filtering and news filtering.
The discovery of hidden information and decision making are mainly classed as forms of
information retrieval, while quick analysis is a form of information extraction.
Information retrieval is concerned with documents in their totality and with the themes
which they deal with, and is used to compare documents and detect types of documents. It
aims to detect all the themes that are present. The analysis is global.
Information extraction is a search for specific information in the documents, without
any comparison of the documents, taking the order and proximity of words into account
to discriminate between different statements which have identical keywords. It is only
concerned with themes related to the ‘target’ database. Information extraction starts
with natural-language data and uses them to build up a structured database. It is a
matter of scanning the natural language text to detect words or phrases corresponding to
each field of the database. The analysis is local. In one sense, information extraction is a
more complex process, because it requires the use of lexical and morpho-syntactic
analysis to recognize the constituents of the text (words and phrases), their nature and
their relationships.
14.4 Information retrieval
This section will describe the different analyses, first linguistic and then statistical, that are
required for the automatic updating of the themes contained in a corpus of documents. These
analyses follow the Strasbourg School which, following Charles Muller, does not apply
statistical methods directly to the text, but to its underlying lexicon, found by a sequence of
operations described below for disambiguation, categorization, lemmatization and
combination. These operations consist of identifying units (the graphic forms which are sequences of
non-separator characters) in the text sequence, grouping them into equivalent classes (up to the
level of the theme) and performing counts and statistical analyses on these classes. This is not
the only approach, and other methods have been proposed by researchers who have pointed out
that the text sequence cannot be reduced to a series of unrelated units, and that the meaning of
a text is highly dependent on the relative positioning, juxtapositions and co-occurrences of the
graphic forms (even before considering the equivalence class). Étienne Brunet expressed this
in a humorous way: ‘Some people may regret the loss of the raw forms, whose opaque
materiality could conceal a degree of mystery. They may be repelled by a pale, bloodless
lemma reduced to a set of abstract properties’.4 Having said that, this method of content
analysis is effective because it can be used very successfully in conjunction with data mining
tools. It is also implemented in some of the leading text mining tools, such as IBM SPSS® Text
Analytics and (under the name of 'text parsing') in the Text Miner add-in of SAS® Enterprise
Miner™. Although largely automatic, it still needs to be adapted, sometimes manually, to the
needs of the user and the vocabulary of his organization: this is done by creating a list of prohibited
or obligatory terms and a dictionary of synonyms and compound words.
14.4.1 Linguistic analysis
Language identification
It should be noted that the Web obliges us to deal with multilingualism, even within a single
document in some cases. Some lovers of linguistic curiosities know about an extreme case of
multilingualism: this is the ‘polyglot’ phrase that has different meanings in different
languages. For example, at the time of Watergate, the English headline ‘Nixon put dire
comment on tape’ is also a French sentence meaning ‘Nixon could tell you how to type’. In
English-speaking parts of Canada there may be posters for a ‘Garage Sale’, which to French
speakers simply means ‘Dirty Garage’!
Identification of grammatical categories (grammatical labelling)
The next step is to identify the nouns, verbs, adjectives and adverbs in the texts of the corpus,
which requires a grammatical analysis. This can be complicated by the presence of
homographs (e.g. 'in a flood of tears, she tears up the letter').
Disambiguation
There are many sources of ambiguity in a natural language text. They may be due to the
polysemy of words (the fact that a word has several meanings), to ellipses (in a ‘telegraphic’
4 Brunet, E. (2002) Le lemme comme on l’aime. In JADT 2002: 6th International Conference on the Statistical
Analysis of Textual Data.
style), homographs (‘lead me to the lead mine!’), or antiphrasis and irony, the last of these
being particularly difficult to detect automatically.
Anaphora also gives rise to ambiguities that must be removed. ‘Anaphora’ is used here in
its linguistic sense, meaning avoiding repetition of a word by using another word, most often
a pronoun, to refer to it (‘he’, ‘she’, ‘it’, ‘this’, etc.).
The format of data processing texts is another source of ambiguities, such as the ambiguity
between the number ‘0’ and the letter ‘O’, the ambiguities due to line breaks without hyphens,
or those due to poor typography. The word ‘hit’ can be interpreted as a noun, a verb, an
adjective (‘a hit song’) or a past participle.
In personal notes which may have been recorded on an electronic notepad, the ambiguities
are even more numerous, owing to personal abbreviations, incorrect spelling, and the often
non-syntactical and non-logical order of entry of the words of a sentence.
This stage is tricky, and some ambiguities can only be removed by analysing the whole
text, or even by an arbitrary decision. We can also consider the probability of the appearance
of a form: ‘sate’ is more likely to be a verb meaning ‘satisfy completely’ than the past tense of
‘sit’, at least in a modern text.
Recognition of compound words
It is necessary to recognize that expressions such as ‘2 April 2005’, ‘the Governor of the
Central European Bank’ and ‘the Proceedings of the Royal Institution’ are groups meaning
a date, a person and a publication. ‘Term’ can denote either this kind of sequence of graphic
forms, or a graphic form with a length of 1 (a ‘word’). Then we must allow for the specialist
lexicon of the field of activity concerned. Thus, the banking lexicon includes terms such as
Visa card, current account, housing savings scheme, etc. The lexicon of business intelligence
will include the terms data mining, text mining, data warehouse, etc. It will be useful to create
a specific lexicon for the business, identifying sequences of graphic forms (often two or
three forms) which are repeated many times in the corpus, or even to compile such a
lexicon ‘manually’.
Lemmatization
The steps described above will have improved the understanding of the texts. They must then
be simplified, without changing the meaning of course, so that the main themes can be
extracted more easily.
We need to start by lemmatizing the texts: this means putting the terms in their canonical
form, so that nouns would be put in the singular and the various forms of verbs would be put in
the infinitive. This is the form in which the words are set out in an ordinary dictionary, which
may contain about 60 000 entries covering 700 000 different forms, such as plurals of nouns
and different tenses of verbs. French, Spanish, Russian and German have many inflected
forms (conjugations and declensions). German also has the distinctive feature of creating
compound words by stringing several nouns together, and we may have to decide whether to
divide these units into elementary fragments.
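The canonical-form idea can be sketched with a toy lookup table. The entries below are invented for illustration only; production lemmatizers rely on dictionaries of hundreds of thousands of forms plus morphological rules, and they assume that disambiguation has already decided between homographs:

```python
# Toy lemma table: inflected form -> canonical form (illustrative entries only)
LEMMAS = {
    "tears": "tear", "torn": "tear",
    "letters": "letter",
    "was": "be", "were": "be", "is": "be",
    "buys": "buy", "bought": "buy",
}

def lemmatize(tokens):
    """Replace each token by its lemma, falling back to the surface form."""
    return [LEMMAS.get(t.lower(), t.lower()) for t in tokens]
```

For example, `lemmatize("She tears up the letters".split())` yields `['she', 'tear', 'up', 'the', 'letter']`, the form on which counts and statistical analyses are then performed.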
Grouping the variants
A second stage of simplification is to group together the variants of terms found in texts. The
graphic variants (realise = realize), syntactic variants (name of a man = a man's name),
semantic variants ('X buys Y from Z' = 'Z sells Y to X'), synonyms (US = USA = United
States = Uncle Sam), parasynonyms (words with closely related meanings: discontent, anger,
dissatisfaction), and full forms of abbreviations (€ = EUR = euro, BBC = B.B.C. = British
Broadcasting Corporation) are all recorded. Like the dictionary of words, the dictionary of
synonyms can be saved in a file specified by the text mining software.
Expressions and metaphors are identified: for instance, ‘Empire of the Rising Sun’ is
replaced with ‘Japan’ and ‘Threadneedle Street’ is replaced with ‘Bank of England’.
Figure 14.1 shows an example of the results of a text mining analysis using SAS
Text Miner.
Grouping the analogies
We can then group the analogies. We group the terms in families of derivative terms, as in
a thesaurus, which may include the following group of terms, for example:
. credit/loan/undertaking/debt/borrow/borrower/debtor.
Intensity markers are also grouped, for example:
. a little/less/very little/−
. much/more/very/+
Identification of themes
The text analysis is completed by grouping all the terms around level 1 themes, then grouping
all the level 1 themes around level 2 themes. The first transition will be of the following type:
. cheque/bank card/draft/currency/. . . → means of payment
Figure 14.1 Terms disambiguated, labelled and lemmatized.
while the second transition could be:
. means of payment/money/cash/. . . → bank
14.4.2 Application of statistics and data mining
When the analysis of texts and their themes is completed, we can filter the themes or terms to
be examined. We can use either a statistical criterion (selecting terms and themes by their
frequency) or a semantic criterion (centred on a given subject), or a corpus (identifying
offensive words to avoid and their derivation, in order to ‘clean up’ a document). With the
statistical criterion, we can use a number of weighting rules, for example preferring terms
which appear frequently but in few texts (weight = frequency of the term / number of texts
containing it).
These terms, having been disambiguated, labelled, lemmatized, grouped and selected, are
then treated with data mining methods, with the individuals (in the statistical sense) being the
texts or documents (e.g. emails) and the characteristics of the individuals (their variables) being
the themes or terms in the documents. Thus we can produce lexical tables in which each cell
c_ij is the number of occurrences of term j (or an indicator of presence/absence) in document i,
to which the conventional statistical methods are applied. The cell c_ij can also be the number
of occurrences of term j in the set of documents relating to customer i (letters, reports of
interviews, etc.).
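A minimal sketch of such a lexical table, together with the frequency/document-count weighting rule mentioned above (the three toy documents and their contents are invented for illustration):

```python
from collections import Counter

docs = {
    "d1": "credit card credit account",
    "d2": "card account savings",
    "d3": "credit savings savings",
}

# Lexical table: cell c_ij = number of occurrences of term j in document i
table = {doc: Counter(text.split()) for doc, text in docs.items()}

# Weighting rule from the text: total frequency of the term
# divided by the number of documents containing it
terms = {t for counts in table.values() for t in counts}
weight = {
    t: sum(c[t] for c in table.values()) / sum(1 for c in table.values() if t in c)
    for t in terms
}
```

Here 'credit' appears three times across two documents, so its weight is 1.5, whereas 'card' appears twice across two documents and is weighted 1.0: terms concentrated in few texts are favoured.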
These tables can be processed by correspondence analysis, which simplifies the problem
by reducing the initial variables (corresponding to the terms), often present in very large
numbers (several thousand) although the preliminary transformations may have decreased
their number, to about a hundred factors (which no longer correspond to terms: this is the
drawback of the method). At the end of this transformation, continuous variables will have
been substituted for the initial discrete variables, and conventional data analysis techniques
can be used – classification, clustering, etc. This method is incorporated, under the name of
SVD (singular value decomposition), in SAS Text Miner. Some techniques such as regular-
ized regression (see Section 11.7.2) are useful when we need to process a large number of
variables compared with the number of individuals.
14.4.3 Suitable methods
Text mining can respond to two types of request. Open requests (or free text requests) are
requests in the form of keywords or free text, used to search for relevant documents in a corpus
that changes slowly (such as a yearbook or an electronic library), with the most relevant
sections of text highlighted. Predefined requests are requests relating to a number of fixed
terms, applied to a corpus that changes in a dynamic way with time (e.g. categorization of
documents, routeing/filtering of mail or news). They are subject to the same problems
as classification.
Like data mining, text mining includes descriptive and predictive methods. In the
predictive domain, the classification (or categorization) of documents is carried out according
to predefined themes (nomenclature). It is used for predefined requests (routeing, filtering)
and is based on decision trees (CART, C5.0) and supervised learning neural networks.
Markov chains can be used for open requests. A Markov chain can be briefly described as
follows. Imagine that we have n boxes, each filled with numbered balls. We draw a ball at
random from the first box; its number indicates the box from which the next ball is to be
drawn. We continue to draw balls until we reach an empty box. The set of boxes that we have
passed through, in sequence, is a Markov chain. The probability of drawing a given ball
depends on the box from which it is drawn, and therefore on all the previous drawings.
The same applies to a sentence: the probability of the appearance of a word depends on
the preceding words, and not all sequences of words have the same probability of
occurring. Markov chains are used for speech and handwriting recognition, spelling correc-
tion, voice control of automatic systems, and natural language human–machine interfaces
in general.
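The word-sequence idea can be sketched as a first-order (bigram) model estimated from a toy corpus. This is a deliberate simplification: practical language models condition on longer histories and smooth their estimates, and the function name here is invented for illustration:

```python
from collections import Counter, defaultdict

def bigram_probabilities(sentences):
    """Estimate P(next word | current word) from a small corpus."""
    # Count word-to-next-word transitions over all sentences
    transitions = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            transitions[current][nxt] += 1
    # Normalize the counts into conditional probabilities
    return {
        word: {nxt: n / sum(counts.values()) for nxt, n in counts.items()}
        for word, counts in transitions.items()
    }
```

Over the corpus `['the cat sat', 'the dog sat', 'the cat ran']`, for instance, P(cat | the) is 2/3: not all word sequences are equally probable, which is exactly what the Markov assumption exploits.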
In the descriptive domain, a corpus clustering is carried out according to non-predefined
themes (discovered in the documents) and can be followed by automatic extraction of
keywords (terms which are frequent in the cluster and rare in the set of documents). The
clustering can be carried out by a Kohonen network or an agglomerative hierarchical
algorithm. It is also possible to carry out a multiple correspondence analysis by matching
the text data with the other data. For example, we can match the response to a questionnaire
with the respondent’s socio-occupational category. It is possible to create a document map
and identify isolated themes, themes forming homogeneous sets, the strength of the links
between the themes in a single set (the vocabulary common to the themes) and the number of
documents for each theme. Figure 14.2 shows by way of example a graphic representation of
the Tropes software from Acetic.
Figure 14.2 Themes in Shakespeare’s Sonnets.
14.5 Information extraction
14.5.1 Principles of information extraction
Information extraction systems are made up of trigger words (verbs or nouns), linguistic
forms, and constraints which limit the application of the trigger. These systems require
specific semantic dictionaries for the domain or business, as well as syntactic analysers that
can recognize the general linguistic forms (subject, verb, direct object, etc.). Using a target to
be extracted from and predefined fields to be filled, information extraction systems detect the
relevant sentences and extract the desired information.
The main applications of information extraction are:
. automatic completion of predefined forms from free texts;
. automatic construction of bibliographic databases from research papers (fields to be
extracted: title, author, journal, publication date, research establishment, etc.);
. automatic scanning of Reuters despatches on the acquisition of one company by another
(fields to be extracted: purchaser, vendor, price, industrial sector, turnover, stock
exchange quotation, etc.);
. automatic scanning of the financial press (the ‘people’ section, on chief executives’
moves between companies);
. automatic detection of the plans or requirements of the customers of a business, based
on the records of sales staff (fields to be extracted: name of customer, type of product or
service offered, type of plan or requirement of the customer, amount, customer’s
deadline, customer’s response (take-up/refusal), reason for customer’s response, other
suppliers used by the customer, etc.), and the use of the extracted information in
a propensity score.
The performance of information extraction is summarized by two indicators. The
accuracy rate is the number of correctly completed fields divided by the number of completed
fields. The recall rate is the number of correctly completed fields divided by the number of
fields to be completed.
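The two indicators can be computed directly; the function name and the worked figures below are illustrative:

```python
def extraction_scores(correct, completed, to_complete):
    """Accuracy and recall rates of an information extraction system.

    correct     -- number of correctly completed fields
    completed   -- number of fields the system completed (right or wrong)
    to_complete -- number of fields that should have been completed
    """
    return correct / completed, correct / to_complete

# Example: the system fills 80 fields, 72 of them correctly,
# out of 100 fields that should have been filled
accuracy, recall = extraction_scores(correct=72, completed=80, to_complete=100)
```

In this example the accuracy rate is 0.9 and the recall rate is 0.72: a system can be very accurate on the fields it dares to fill while still missing many fields, so the two rates must be read together.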
14.5.2 Example of application: transcription of business interviews
If sales staff discover that their customers have plans that can be financed (buying a house,
changing their car, etc.), they offer the customer a credit proposal and note their reaction in
a report. If the reaction is positive, the completion of the report is less important, because it
will be evident that the product has been taken up. If the reaction is negative, the existence of
the report is more important, as it indicates that a product has been offered to the customer. It is
also useful to be able to analyse these marketing reports automatically in order to determine
the reasons for the customers’ refusal, and then deduce predictive models and typologies, or
even adapt the products offered.
One problem that arises is that these reports, written in natural language, are obviously not
standardized, and may contain:
. spelling mistakes;
. personal abbreviations;
. stream-of-consciousness writing;
. ellipsis (‘telegraphic’ style);
. an illogical order in sentences in some cases (related words may be separated by
a certain distance);
. negations which are not always explicit (the sentence 'construction Brighton – finance
NatWest' is a negation if the bank where the salesperson works is not NatWest!).
Faced with the difficulty of automatic normalization of reports, we need powerful text
mining tools for information extraction, not just keyword search tools.
Report analysis by text mining can be a highly penetrating method, offering these benefits:
. detection of customers resistant to certain kinds of credit (useful information for
building a propensity score);
. automatic detection of certain reasons for refusing to take up a product (customer
‘opposed to credit’, better offer from the competition, no need for credit, etc.);
. detection of customers having plans for future dates (enabling us to market to them at
the right time).
14.6 Multi-type data mining
A very promising method of data mining, known as multi-type data mining, can simulta-
neously examine text data (from text mining processes), paratextual data (such as the date and
purpose of a document, the type of document, the recipient of the document in the business,
etc.), and contextual data (such as information about the author of the document, his relations
with the business, the products he has bought, the services he has used, etc.).
Text data are converted into coded data and then stored with the other data in marketing
databases. The matching of all the data (textual and non-textual) makes multi-type data
mining a very powerful tool. For example, an attrition study will be more precise if it takes
letters of complaint and other exchanges between the business and the customer into account.