14
Text mining
I mentioned in Chapter 3 that there was a special class of data, namely text data. The earlier
chapters discussed tools for manipulating data consisting of codes and quantities; the aim of
this chapter is to complete our survey of data mining by showing how it can be combined with
linguistics and lexicometry for the automatic analysis and use of text data.
14.1 Definition of text mining
Text mining is the set of techniques and methods used for the automatic processing of natural
language text data available in reasonably large quantities in the form of computer files, with
the aim of extracting and structuring their contents and themes, for the purposes of rapid (non-
literary) analysis, the discovery of hidden data, or automatic decision making. It is different
from stylometry, which studies the style of texts in order to identify an author or date the work,
but it has much in common with lexicometry or lexical statistics (also known as ‘linguistic
statistics’ or ‘quantitative linguistics’); indeed, it is an extension of the latter science using
advanced methods of multidimensional statistics.
We can show this schematically as:
Text mining = Lexicometry + Data mining
Like data mining, text mining originated partly in response to the huge volume of text data
created and diffused in our society (think of the amounts of laws, orders, regulations,
contracts, for example), and partly for the purpose of quasi-generalized input and storage
of these data in computer systems. It also owes its acceptance to developments in statistical
and data processing tools whose power has increased greatly in recent years. Thus, following
the work of researchers such as Jean-Baptiste Estoup, George Kingsley Zipf, Benoît
Mandelbrot, George Udny Yule, Pierre Guiraud, Charles Muller, Gustav Herdan, Étienne
Brunet, Jean-Paul Benzécri, Ludovic Lebart and André Salem, there has been an exponential
growth in the use of statistics, probabilities, data analysis, Markov chains and artificial
Data Mining and Statistics for Decision Making, First Edition. Stéphane Tufféry.
© 2011 John Wiley & Sons, Ltd. Published 2011 by John Wiley & Sons, Ltd. ISBN: 978-0-470-68829-8
intelligence tools based on data mining for the processing of text material, and we have made
considerable progress since the early days of simple calculation of percentages. Beginning in
1916 (Estoup1) and 1935 (Zipf2), the frequency of appearance of a word in a text has been
studied by statistical methods, giving rise to Zipf’s law, a well-known formula which links the
frequency of a word to its rank in the table of frequencies. The example of James Joyce’s
Ulysses is famous: the 10th word appears 2653 times, the 100th word appears 265 times, the
1000th word appears 26 times and the 10 000th word appears twice. We find that the product
of the rank r and the frequency f is virtually constant:
rf = constant.
This law is not always valid with the same degree of accuracy, but it is truly universal,
because it applies to all types of text in all languages. Wentian Li3 demonstrated in 1992 that it
could be applied to a text in which the words were created by drawing letters (and a ‘space’
character) at random from an alphabet with a uniform distribution.
The formula shown above has now been revised to
r^a f = constant,
where a is an exponent which depends on the language and the type of speaker. It generally
ranges from 1.1 to 1.3, and is close to 1.6 in children’s language. As a general rule, it decreases
with the richness of the corpus, measured as the ratio of the number of different words V
(the vocabulary) to the total number of words (V is generally proportional to the square
root of N).
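The rank–frequency products quoted above for Ulysses can be checked directly, and the same product can be computed for any word-count table. The sketch below (the helper function name is invented for illustration) does both:

```python
from collections import Counter

# Rank-frequency figures for Ulysses quoted above: rank -> frequency
ulysses = {10: 2653, 100: 265, 1000: 26, 10000: 2}
products = {r: r * f for r, f in ulysses.items()}
# The products all stay near 26 000: Zipf's law roughly holds

def zipf_products(text, ranks):
    """Return rank * frequency for the given 1-based ranks of a text's words."""
    # Sort word counts in descending order to build the frequency table
    freqs = sorted(Counter(text.lower().split()).values(), reverse=True)
    return {r: r * freqs[r - 1] for r in ranks if r <= len(freqs)}
```

Applied to a real corpus, the dictionary returned by `zipf_products` should contain values of roughly the same order of magnitude across a wide range of ranks.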
Zipf’s law has since been extended to other rank-size problems, such as the rank of cities
in a country related to their size, the rank of businesses related to their turnover, the rank of
individuals related to their income, etc.
One interesting consequence of Zipf’s law for text mining is that a few tens of words are
enough to represent a large part of any corpus, enabling the depth and complexity of analyses
to be limited.
As in data mining, there are two types of method in text mining. Descriptive methods can
be used to search for themes dealt with in a set (corpus) of documents, without knowing these
themes in advance. Predictive methods find rules for automatically assigning a document to
one of a number of predefined themes. This may be done, for example, for the purpose of
automatically forwarding a letter or a CV to the appropriate department. The corpus analysed
must meet the following conditions:
. it must be in a data processing format (the automatic reading of handwriting, used in the
processing of cheques and mail, is a different problem);
. it must include a minimum number of texts;
. it must be sufficiently comprehensible and coherent;
1 Estoup, J.-B. (1916) Gammes Sténographiques, 4th edn. Paris: Imprimerie Moderne.
2 Zipf, G.K. (1935) The Psycho-biology of Language. Boston: Houghton-Mifflin. The definitive formulation can be found in Zipf, G.K. (1949) Human Behavior and the Principle of Least Effort. Cambridge, MA: Addison-Wesley.
3 Li, W. (1992) Random texts exhibit Zipf's-Law-like word frequency distribution. IEEE Transactions on Information Theory, 38(6), 1842–1845.
. there must not be too many different themes in each text;
. it must avoid, as far as possible, the use of innuendo, irony and antiphrasis (saying
the opposite of what one thinks, e.g. ‘Oh, brilliant!’ in response to a particularly
stupid blunder).
14.2 Text sources used
The main sources of texts analysed by text mining are opinion polls, customer satisfaction
surveys, letters of complaint, telephone interview transcriptions, electronic mail, reports of
marketing or medical interviews, press surveys, despatches from news agencies, experts’
documentation and reports, technology monitoring, competition monitoring, strategic and
economic monitoring, the Internet and on-line databases, and more recently curricula vitae.
Users of the information analysed may be financial analysts, economists, marketing
professionals, customer relations services, recruiters or decision makers.
14.3 Using text mining
Some periodical analyses in which the presentation is always identical can be automated by
using text mining. This generates quick analyses without the need for repetitive and tedious
computation. The applications include the automatic generation of satisfaction surveys,
reports on a business’s image or the state of the competition, and the automatic indexing of
documents.
Text mining is also used to discover hidden information ('descriptive method'), for example
new research fields (in filed patents), or information to be added to marketing databases on
customers' areas of interest and plans. It can even be used by a business wishing to communicate
with its customers in the vocabulary that they use, and to adapt its marketing presentations
to each customer segment. It can be used in search engines on the web.
Finally, text mining is an aid for decision making (‘predictive methods’), for example in
automatic mail routeing, email filtering (to identify spam and non-spam, technical and
business subject matter, etc.), data filtering and news filtering.
The discovery of hidden information and decision making are mainly classed as forms of
information retrieval, while quick analysis is a form of information extraction.
Information retrieval is concerned with documents in their totality and with the themes
which they deal with, and is used to compare documents and detect types of documents. It
aims to detect all the themes that are present. The analysis is global.
Information extraction is a search for specific information in the documents, without
any comparison of the documents, taking the order and proximity of words into account
to discriminate between different statements which have identical keywords. It is only
concerned with themes related to the ‘target’ database. Information extraction starts
with natural-language data and uses them to build up a structured database. It is a
matter of scanning the natural language text to detect words or phrases corresponding to
each field of the database. The analysis is local. In one sense, information extraction is a
more complex process, because it requires the use of lexical and morpho-syntactic
analysis to recognize the constituents of the text (words and phrases), their nature and
their relationships.
14.4 Information retrieval
This section will describe the different analyses, first linguistic and then statistical, that are
required for the automatic updating of the themes contained in a corpus of documents. These
analyses follow the Strasbourg School which, following Charles Muller, does not apply
statistical methods directly to the text, but to its underlying lexicon, found by a sequence of
operations described below for disambiguation, categorization, lemmatization and
combination. These operations consist of identifying units (the graphic forms which are sequences of
non-separator characters) in the text sequence, grouping them into equivalent classes (up to the
level of the theme) and performing counts and statistical analyses on these classes. This is not
the only approach, and other methods have been proposed by researchers who have pointed out
that the text sequence cannot be reduced to a series of unrelated units, and that the meaning of
a text is highly dependent on the relative positioning, juxtapositions and co-occurrences of the
graphic forms (even before considering the equivalence class). Étienne Brunet expressed this
in a humorous way: ‘Some people may regret the loss of the raw forms, whose opaque
materiality could conceal a degree of mystery. They may be repelled by a pale, bloodless
lemma reduced to a set of abstract properties’.4 Having said that, this method of content
analysis is effective because it can be used very successfully in conjunction with data mining
tools. It is also implemented in some of the leading text mining tools, such as IBM SPSS® Text
Analytics and (under the name of 'text parsing') in the Text Miner add-in of SAS® Enterprise
Miner™. Although largely automatic, it still needs to be adapted, sometimes manually, to the
needs of the user and the vocabulary of his organization: this is done by creating a list of prohibited
or obligatory terms and a dictionary of synonyms and compound words.
14.4.1 Linguistic analysis
Language identification
It should be noted that the Web obliges us to deal with multilingualism, even within a single
document in some cases. Some lovers of linguistic curiosities know about an extreme case of
multilingualism: this is the ‘polyglot’ phrase that has different meanings in different
languages. For example, at the time of Watergate, the English headline ‘Nixon put dire
comment on tape’ is also a French sentence meaning ‘Nixon could tell you how to type’. In
English-speaking parts of Canada there may be posters for a ‘Garage Sale’, which to French
speakers simply means ‘Dirty Garage’!
Identification of grammatical categories (grammatical labelling)
The next step is to identify the nouns, verbs, adjectives and adverbs in the texts of the corpus,
which requires a grammatical analysis. This can be complicated by the presence of
homographs (e.g. 'in a flood of tears, she tears up the letter').
Disambiguation
There are many sources of ambiguity in a natural language text. They may be due to the
polysemy of words (the fact that a word has several meanings), to ellipses (in a ‘telegraphic’
4 Brunet, E. (2002) Le lemme comme on l’aime. In JADT 2002: 6th International Conference on the Statistical
Analysis of Textual Data.
style), homographs (‘lead me to the lead mine!’), or antiphrasis and irony, the last of these
being particularly difficult to detect automatically.
Anaphora also gives rise to ambiguities that must be removed. ‘Anaphora’ is used here in
its linguistic sense, meaning avoiding repetition of a word by using another word, most often
a pronoun, to refer to it (‘he’, ‘she’, ‘it’, ‘this’, etc.).
The format of data processing texts is another source of ambiguities, such as the ambiguity
between the number ‘0’ and the letter ‘O’, the ambiguities due to line breaks without hyphens,
or those due to poor typography. The word ‘hit’ can be interpreted as a noun, a verb, an
adjective (‘a hit song’) or a past participle.
In personal notes which may have been recorded on an electronic notepad, the ambiguities
are even more numerous, owing to personal abbreviations, incorrect spelling, and the often
non-syntactical and non-logical order of entry of the words of a sentence.
This stage is tricky, and some ambiguities can only be removed by analysing the whole
text, or even by an arbitrary decision. We can also consider the probability of the appearance
of a form: ‘sate’ is more likely to be a verb meaning ‘satisfy completely’ than the past tense of
‘sit’, at least in a modern text.
Recognition of compound words
It is necessary to recognize that expressions such as ‘2 April 2005’, ‘the Governor of the
Central European Bank’ and ‘the Proceedings of the Royal Institution’ are groups meaning
a date, a person and a publication. ‘Term’ can denote either this kind of sequence of graphic
forms, or a graphic form with a length of 1 (a ‘word’). Then we must allow for the specialist
lexicon of the field of activity concerned. Thus, the banking lexicon includes terms such as
Visa card, current account, housing savings scheme, etc. The lexicon of business intelligence
will include the terms data mining, text mining, data warehouse, etc. It will be useful to create
a specific lexicon for the business, identifying sequences of graphic forms (often two or
three forms) which are repeated many times in the corpus, or even to compile such a
lexicon ‘manually’.
Lemmatization
The steps described above will have improved the understanding of the texts. They must then
be simplified, without changing the meaning of course, so that the main themes can be
extracted more easily.
We need to start by lemmatizing the texts: this means putting the terms in their canonical
form, so that nouns would be put in the singular and the various forms of verbs would be put in
the infinitive. This is the form in which the words are set out in an ordinary dictionary, which
may contain about 60 000 entries covering 700 000 different forms, such as plurals of nouns
and different tenses of verbs. French, Spanish, Russian and German have many inflected
forms (conjugations and declensions). German also has the distinctive feature of creating
compound words by stringing several nouns together, and we may have to decide whether to
divide these units into elementary fragments.
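The canonical-form idea can be sketched with a toy lookup table. The entries below are invented for illustration only; production lemmatizers rely on dictionaries of hundreds of thousands of forms plus morphological rules, and they assume that disambiguation has already decided between homographs:

```python
# Toy lemma table: inflected form -> canonical form (illustrative entries only)
LEMMAS = {
    "tears": "tear", "torn": "tear",
    "letters": "letter",
    "was": "be", "were": "be", "is": "be",
    "buys": "buy", "bought": "buy",
}

def lemmatize(tokens):
    """Replace each token by its lemma, falling back to the surface form."""
    return [LEMMAS.get(t.lower(), t.lower()) for t in tokens]
```

For example, `lemmatize("She tears up the letters".split())` yields `['she', 'tear', 'up', 'the', 'letter']`, the form on which counts and statistical analyses are then performed.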
Grouping the variants
A second stage of simplification is to group together the variants of terms found in texts. The
graphic variants (realise = realize), syntactic variants (name of a man = a man's name),
semantic variants ('X buys Y from Z' = 'Z sells Y to X'), synonyms (US = USA = United
States = Uncle Sam), parasynonyms (words with closely related meanings: discontent, anger,
dissatisfaction), and full forms of abbreviations (€ = EUR = euro, BBC = B.B.C. = British
Broadcasting Corporation) are all recorded. Like the dictionary of words, the dictionary of
synonyms can be saved in a file specified by the text mining software.
Expressions and metaphors are identified: for instance, ‘Empire of the Rising Sun’ is
replaced with ‘Japan’ and ‘Threadneedle Street’ is replaced with ‘Bank of England’.
Figure 14.1 shows an example of the results of a text mining analysis using SAS
Text Miner.
Grouping the analogies
We can then group the analogies. We group the terms in families of derivative terms, as in
a thesaurus, which may include the following group of terms, for example:
. credit/loan/undertaking/debt/borrow/borrower/debtor.
Intensity markers are also grouped, for example:
. a little/less/very little/−
. much/more/very/+
Identification of themes
The text analysis is completed by grouping all the terms around level 1 themes, then grouping
all the level 1 themes around level 2 themes. The first transition will be of the following type:
. cheque/bank card/draft/currency/. . . → means of payment
Figure 14.1 Terms disambiguated, labelled and lemmatized.
while the second transition could be:
. means of payment/money/cash/. . . → bank
14.4.2 Application of statistics and data mining
When the analysis of texts and their themes is completed, we can filter the themes or terms to
be examined. We can use either a statistical criterion (selecting terms and themes by their
frequency) or a semantic criterion (centred on a given subject), or a corpus (identifying
offensive words to avoid and their derivation, in order to ‘clean up’ a document). With the
statistical criterion, we can use a number of weighting rules, for example preferring terms
which appear frequently but in few texts (weight = frequency of the term / number of texts
containing it).
These terms, having been disambiguated, labelled, lemmatized, grouped and selected, are
then treated with data mining methods, with the individuals (in the statistical sense) being the
texts or documents (e.g. emails) and the characteristics of the individuals (their variables) being
the themes or terms in the documents. Thus we can produce lexical tables in which each cell
c_ij is the number of occurrences of term j (or an indicator of presence/absence) in document i,
to which the conventional statistical methods are applied. The cell c_ij can also be the number
of occurrences of term j in the set of documents relating to customer i (letters, reports of
interviews, etc.).
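A minimal sketch of such a lexical table, together with the frequency/document-count weighting rule mentioned above (the three toy documents and their contents are invented for illustration):

```python
from collections import Counter

docs = {
    "d1": "credit card credit account",
    "d2": "card account savings",
    "d3": "credit savings savings",
}

# Lexical table: cell c_ij = number of occurrences of term j in document i
table = {doc: Counter(text.split()) for doc, text in docs.items()}

# Weighting rule from the text: total frequency of the term
# divided by the number of documents containing it
terms = {t for counts in table.values() for t in counts}
weight = {
    t: sum(c[t] for c in table.values()) / sum(1 for c in table.values() if t in c)
    for t in terms
}
```

Here 'credit' appears three times across two documents, so its weight is 1.5, whereas 'card' appears twice across two documents and is weighted 1.0: terms concentrated in few texts are favoured.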
These tables can be processed by correspondence analysis, which simplifies the problem
by reducing the initial variables (corresponding to the terms), often present in very large
numbers (several thousand) although the preliminary transformations may have decreased
their number, to about a hundred factors (which no longer correspond to terms: this is the
drawback of the method). At the end of this transformation, continuous variables will have
been substituted for the initial discrete variables, and conventional data analysis techniques
can be used – classification, clustering, etc. This method is incorporated, under the name of
SVD (singular value decomposition), in SAS Text Miner. Some techniques such as regular-
ized regression (see Section 11.7.2) are useful when we need to process a large number of
variables compared with the number of individuals.
14.4.3 Suitable methods
Text mining can respond to two types of request. Open requests (or free text requests) are
requests in the form of keywords or free text, used to search for relevant documents in a corpus
that changes slowly (such as a yearbook or an electronic library), with the most relevant
sections of text highlighted. Predefined requests are requests relating to a number of fixed
terms, applied to a corpus that changes in a dynamic way with time (e.g. categorization of
documents, routeing/filtering of mail or news). They are subject to the same problems
as classification.
Like data mining, text mining includes descriptive and predictive methods. In the
predictive domain, the classification (or categorization) of documents is carried out according
to predefined themes (nomenclature). It is used for predefined requests (routeing, filtering)
and is based on decision trees (CART, C5.0) and supervised learning neural networks.
Markov chains can be used for open requests. A Markov chain can be briefly described as
follows. Imagine that we have n boxes, each filled with numbered balls. We draw a ball at
random from the first box; its number indicates the box from which the next ball is to be
drawn. We continue to draw balls until we reach an empty box. The set of boxes that we have
passed through, in sequence, is a Markov chain. The probability of drawing a given ball
depends on the box from which it is drawn, and therefore on all the previous drawings.
The same applies to a sentence: the probability of the appearance of a word depends on
the preceding words, and not all sequences of words have the same probability of
occurring. Markov chains are used for speech and handwriting recognition, spelling correc-
tion, voice control of automatic systems, and natural language human–machine interfaces
in general.
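The word-sequence idea can be sketched as a first-order (bigram) model estimated from a toy corpus. This is a deliberate simplification: practical language models condition on longer histories and smooth their estimates, and the function name here is invented for illustration:

```python
from collections import Counter, defaultdict

def bigram_probabilities(sentences):
    """Estimate P(next word | current word) from a small corpus."""
    # Count word-to-next-word transitions over all sentences
    transitions = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            transitions[current][nxt] += 1
    # Normalize the counts into conditional probabilities
    return {
        word: {nxt: n / sum(counts.values()) for nxt, n in counts.items()}
        for word, counts in transitions.items()
    }
```

Over the corpus `['the cat sat', 'the dog sat', 'the cat ran']`, for instance, P(cat | the) is 2/3: not all word sequences are equally probable, which is exactly what the Markov assumption exploits.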
In the descriptive domain, a corpus clustering is carried out according to non-predefined
themes (discovered in the documents) and can be followed by automatic extraction of
keywords (terms which are frequent in the cluster and rare in the set of documents). The
clustering can be carried out by a Kohonen network or an agglomerative hierarchical
algorithm. It is also possible to carry out a multiple correspondence analysis by matching
the text data with the other data. For example, we can match the response to a questionnaire
with the respondent’s socio-occupational category. It is possible to create a document map
and identify isolated themes, themes forming homogeneous sets, the strength of the links
between the themes in a single set (the vocabulary common to the themes) and the number of
documents for each theme. Figure 14.2 shows by way of example a graphic representation of
the Tropes software from Acetic.
Figure 14.2 Themes in Shakespeare’s Sonnets.
14.5 Information extraction
14.5.1 Principles of information extraction
Information extraction systems are made up of trigger words (verbs or nouns), linguistic
forms, and constraints which limit the application of the trigger. These systems require
specific semantic dictionaries for the domain or business, as well as syntactic analysers that
can recognize the general linguistic forms (subject, verb, direct object, etc.). Using a target to
be extracted from and predefined fields to be filled, information extraction systems detect the
relevant sentences and extract the desired information.
The main applications of information extraction are:
. automatic completion of predefined forms from free texts;
. automatic construction of bibliographic databases from research papers (fields to be
extracted: title, author, journal, publication date, research establishment, etc.);
. automatic scanning of Reuters despatches on the acquisition of one company by another
(fields to be extracted: purchaser, vendor, price, industrial sector, turnover, stock
exchange quotation, etc.);
. automatic scanning of the financial press (the ‘people’ section, on chief executives’
moves between companies);
. automatic detection of the plans or requirements of the customers of a business, based
on the records of sales staff (fields to be extracted: name of customer, type of product or
service offered, type of plan or requirement of the customer, amount, customer’s
deadline, customer’s response (take-up/refusal), reason for customer’s response, other
suppliers used by the customer, etc.), and the use of the extracted information in
a propensity score.
The performance of information extraction is summarized by two indicators. The
accuracy rate is the number of correctly completed fields divided by the number of completed
fields. The recall rate is the number of correctly completed fields divided by the number of
fields to be completed.
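The two indicators can be computed directly; the function name and the worked figures below are illustrative:

```python
def extraction_scores(correct, completed, to_complete):
    """Accuracy and recall rates of an information extraction system.

    correct     -- number of correctly completed fields
    completed   -- number of fields the system completed (right or wrong)
    to_complete -- number of fields that should have been completed
    """
    return correct / completed, correct / to_complete

# Example: the system fills 80 fields, 72 of them correctly,
# out of 100 fields that should have been filled
accuracy, recall = extraction_scores(correct=72, completed=80, to_complete=100)
```

In this example the accuracy rate is 0.9 and the recall rate is 0.72: a system can be very accurate on the fields it dares to fill while still missing many fields, so the two rates must be read together.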
14.5.2 Example of application: transcription of business interviews
If sales staff discover that their customers have plans that can be financed (buying a house,
changing their car, etc.), they offer the customer a credit proposal and note their reaction in
a report. If the reaction is positive, the completion of the report is less important, because it
will be evident that the product has been taken up. If the reaction is negative, the existence of
the report is more important, as it indicates that a product has been offered to the customer. It is
also useful to be able to analyse these marketing reports automatically in order to determine
the reasons for the customers’ refusal, and then deduce predictive models and typologies, or
even adapt the products offered.
One problem that arises is that these reports, written in natural language, are obviously not
standardized, and may contain:
. spelling mistakes;
. personal abbreviations;
. stream-of-consciousness writing;
. ellipsis (‘telegraphic’ style);
. an illogical order in sentences in some cases (related words may be separated by
a certain distance);
. negations which are not always explicit (the sentence 'construction Brighton – finance
NatWest' is a negation if the bank where the salesperson works is not NatWest!).
Faced with the difficulty of automatic normalization of reports, we need powerful text
mining tools for information extraction, not just keyword search tools.
Report analysis by text mining can be a highly penetrating method, offering these benefits:
. detection of customers resistant to certain kinds of credit (useful information for
building a propensity score);
. automatic detection of certain reasons for refusing to take up a product (customer
‘opposed to credit’, better offer from the competition, no need for credit, etc.);
. detection of customers having plans for future dates (enabling us to market to them at
the right time).
14.6 Multi-type data mining
A very promising method of data mining, known as multi-type data mining, can simulta-
neously examine text data (from text mining processes), paratextual data (such as the date and
purpose of a document, the type of document, the recipient of the document in the business,
etc.), and contextual data (such as information about the author of the document, his relations
with the business, the products he has bought, the services he has used, etc.).
Text data are converted into coded data and then stored with the other data in marketing
databases. The matching of all the data (textual and non-textual) makes multi-type data
mining a very powerful tool. For example, an attrition study will be more precise if it takes
letters of complaint and other exchanges between the business and the customer into account.