
Chapter 21: Text Mining


Introduction

• Data mining is the process of finding and exploiting useful patterns in data.

• Text Mining for Derived Columns

– Perhaps the most common use of text mining is to add new derived columns into a model set.

– Extracting derived variables is usually a matter of looking for specific patterns in the text.

• An address can be used to identify whether someone lives in an apartment by looking for an apartment number.

• If the address contains any of the following, then it probably includes an apartment number:

– “Apt.” in any case

– “ #”

– Address line beginning with “Apt.” or “Apartment”

– “Unit”
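
The apartment rules above can be sketched as a single regular expression that produces a derived true/false column. This is a minimal illustration, not a production address parser: the names `APARTMENT_PATTERN` and `looks_like_apartment` are hypothetical, and treating “Unit” as a whole word in any case is an assumption.

```python
import re

# Hypothetical pattern built from the rules listed above; real address data
# would likely need more cases.
APARTMENT_PATTERN = re.compile(
    r"""
    \bapt\b\.?          # "Apt" or "Apt." anywhere, in any case
    | \s\#              # a space followed by "#"
    | ^\s*apartment\b   # address line beginning with "Apartment"
    | \bunit\b          # "Unit" as a whole word (assumption)
    """,
    re.IGNORECASE | re.VERBOSE,
)

def looks_like_apartment(address_line: str) -> bool:
    """Return True if the address line probably includes an apartment number."""
    return APARTMENT_PATTERN.search(address_line) is not None

# Derived column for a small list of made-up addresses.
addresses = ["123 Main St Apt. 4", "742 Evergreen Terrace", "500 Oak Ave #7"]
is_apartment = [looks_like_apartment(a) for a in addresses]
```

The word boundaries keep the pattern from firing on words such as “United” or “Captain” that merely contain “unit” or “apt”.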



• Sources of Text

– E-mails sent by customers

– Notes entered by customer service reps, doctors, nurses, garage mechanics, and so on

– Transcriptions (voice-to-text translation) of customer service calls

– Comments on websites

– Newspaper and magazine articles

– Professional reports

• Basic Approaches to Representing Documents

– There is a continuum of approaches for understanding documents.

– At one end is the “bag of words” approach, where documents are considered merely a collection of their words.

– At the other end is the “understanding” approach, where an attempt is made to actually understand the document and what each word specifically means.


Representing Documents in Practice

• Stop Words

– Stop words are words that carry little meaning, in the sense that they do little to differentiate between documents.

– For instance, virtually all documents in English contain the word “the,” which carries essentially no meaning for typical text mining applications such as classification, deriving variables, and navigation.

• Stemming

– Stemming is the process of reducing words to their “stem,” the base word or almost-word that provides the meaning without additional grammatical information.

– For instance, the word stemming would be transformed into the word stem, as would stems and stemmed.

– The purpose of stemming is to better capture the content of a document.

– One customer complaint might refer to “late delivery” and another might say “not delivered on time,” two phrases that have no words in common. Using stemming, they would have the word deliver in common.
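
To make the idea concrete, here is a toy suffix-stripping stemmer. It is only a sketch under strong assumptions: the function name `toy_stem` and its four suffix rules are invented for this illustration, and practical stemmers use much larger rule sets.

```python
def toy_stem(word: str) -> str:
    """Strip a few common English suffixes. A toy rule set, not a real stemmer."""
    word = word.lower()
    for suffix in ("ing", "ed", "y", "s"):
        # Only strip when at least four letters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            word = word[: -len(suffix)]
            break
    # Undo consonant doubling: "stemm" -> "stem".
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word

# The two complaints share no words until they are stemmed.
first = {toy_stem(w) for w in "late delivery".split()}
second = {toy_stem(w) for w in "not delivered on time".split()}
shared = first & second
```

With these rules, “stemming,” “stems,” and “stemmed” all reduce to “stem,” and the two complaints end up sharing the term “deliver.”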



• Word pairs and phrases

– Identifying word pairs and phrases is important for understanding text.

• The rock group The Who is a famous example of what can happen with automated text processing.

• Most stop word lists would include both the and who on the list, so the phrase would disappear entirely from the document.

• This could be very problematic.

– There are two solutions to this problem.

• The easy solution is to keep capitalized stop words, or at least capitalized stop words that are not at the beginning of the sentence.

• A more sophisticated solution is to search for common word pairs and phrases, and to be sure to keep these.

• Using a Lexicon

– A lexicon is a list of words that are important.

– It might also include synonyms and common misspellings, so that several different words are combined into a single idea.

• For instance, “flight,” “fl,” and “flt” might all represent “flight” in airline comments.
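
The “easy solution” for phrases such as The Who can be sketched as follows. The stop word list here is a tiny made-up sample, and `remove_stop_words` is a hypothetical helper that assumes its input is a single sentence.

```python
STOP_WORDS = {"the", "who", "is", "a", "of", "to"}  # tiny illustrative list

def remove_stop_words(sentence: str) -> list[str]:
    """Drop stop words, but keep capitalized ones that do not start the sentence."""
    kept = []
    for position, token in enumerate(sentence.split()):
        word = token.strip(".,;:!?\"'")
        if not word:
            continue
        is_stop = word.lower() in STOP_WORDS
        is_capitalized = word[0].isupper()
        # A stop word survives only if it is capitalized mid-sentence.
        if is_stop and not (is_capitalized and position > 0):
            continue
        kept.append(word)
    return kept

band = remove_stop_words("My brother is a fan of The Who.")
plain = remove_stop_words("The concert is sold out")
```

In the first sentence “The” and “Who” are kept because they are capitalized and not sentence-initial; in the second, the leading “The” is removed like any other stop word.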


From Text to Numbers

• Techniques that use the bag-of-words approach transform the bag of words into a giant table of numbers, the term-document matrix.

• The term-document matrix is a simple array, where each row represents a single document and each column represents a particular word.

• Typically, the number of words in a document is reduced through several steps:

– Fixing misspellings

– Removing common words and words with little meaning (“stop words”)

– Stemming

– Replacing words with synonyms

• The result is a vocabulary or lexicon, typically of several hundred to several thousand words that describe each document.
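
A term-document matrix of word counts can be built in a few lines. This sketch assumes the documents have already been cleaned by the steps above (spelling fixed, stop words removed, stemmed, synonyms applied); the function name and the sample documents are invented for illustration.

```python
from collections import Counter

def term_document_matrix(documents):
    """Return (vocabulary, matrix): one row per document, one column per term."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Counter returns 0 for terms that are absent from the document.
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

docs = ["refund pay", "customer pay bill refund", "change account info"]
vocabulary, matrix = term_document_matrix(docs)
```

Each row of `matrix` lines up with `vocabulary`, so column 0 holds the counts for the alphabetically first term.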



• The cells in the matrix contain zero if the word is not in the document.

• Cells for words that are in the document could simply contain the value one, indicating the presence of the word.

• Another possibility is the number of times the word appears in the document.

• More commonly, though, the value is the inverse document frequency or one minus the log of the document frequency.

– The inverse document frequency is one divided by the number of documents containing the term.

– Words in many documents have low values; words in few documents have higher values.

– One minus the log of the document frequency behaves in a similar way.
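
The two weightings can be compared on a small matrix. This is a sketch with labeled assumptions: for the log variant, the document frequency is taken to be the fraction of documents containing the term and the log is the natural log, neither of which the text specifies. All function names are hypothetical.

```python
import math

def document_frequencies(matrix):
    """Count, for each column (term), how many rows (documents) contain it."""
    return [sum(1 for row in matrix if row[j] > 0) for j in range(len(matrix[0]))]

def idf_weights(matrix):
    """Inverse document frequency: 1 / (number of documents containing the term)."""
    return [1.0 / df if df else 0.0 for df in document_frequencies(matrix)]

def log_weights(matrix):
    """1 - ln(fraction of documents containing the term). The fraction is an assumption."""
    n_docs = len(matrix)
    return [1.0 - math.log(df / n_docs) if df else 0.0
            for df in document_frequencies(matrix)]

def weight_matrix(matrix, weights):
    """Put the term weight in cells where the term is present, zero elsewhere."""
    return [[w if cell > 0 else 0.0 for cell, w in zip(row, weights)]
            for row in matrix]

# Four documents, three terms: term 0 is in every document, term 2 in only one.
presence = [[1, 1, 0], [1, 0, 0], [1, 0, 1], [1, 1, 0]]
idf = idf_weights(presence)
logw = log_weights(presence)
weighted = weight_matrix(presence, idf)
```

Under both schemes the term found in every document gets the lowest weight and the term found in a single document gets the highest.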



• Each document can be thought of as a point in a giant “term” space.

• A corpus can contain thousands of possible terms.

• These terms form a space, where each term is along an axis.

• There are thousands of dimensions.

• The data is quite sparse, meaning that most documents do not contain most terms.

• High dimensional sparse data is a big challenge in data mining.

• The solution is to use the singular value decomposition (SVD) to reduce dimensionality, in the same way that principal components are used.
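
As a rough illustration of the dimensionality reduction, the sketch below uses NumPy's `numpy.linalg.svd` to project each document onto its top `k` singular directions. The use of NumPy, the function name `reduce_dimensions`, and the toy matrix are assumptions made for this example.

```python
import numpy as np

def reduce_dimensions(term_document_matrix, k):
    """Project each document (row) onto the top-k singular directions."""
    A = np.asarray(term_document_matrix, dtype=float)
    # A = U @ diag(s) @ Vt, with the singular values in s sorted largest first.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] * s[:k]  # shape: (number of documents, k)

# Four documents over four terms; the first two documents are identical.
A = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 2, 2]]
reduced = reduce_dimensions(A, k=2)
```

Each document is now described by `k` numbers instead of one number per term, and documents with identical term rows land on identical points in the reduced space.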



• Parsing

– In general, parsing works by replacing punctuation with spaces, and then taking all terms between spaces.

• Fixing Misspellings

– The automated task is simply a matter of constructing a valid dictionary and choosing the closest term.

– This is an iterative task, where you start with a dictionary and find the closest word to each word not in the dictionary.

– Some words are quite close (real misspellings) and some are quite far away (suggestions for new words to add into the dictionary).
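
Parsing and spelling correction can be sketched with the Python standard library. The closeness measure here is `difflib.get_close_matches`, which is one possible choice and not necessarily the one used in the original work; the `cutoff` value, the lowercasing step, and the sample dictionary are assumptions.

```python
import difflib
import string

def parse(text):
    """Replace punctuation with spaces, then take the terms between spaces."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.translate(table).lower().split()  # lowercasing is an added assumption

def fix_spelling(terms, dictionary, cutoff=0.8):
    """Map each unknown term to its closest dictionary word, if one is close enough.

    Returns the corrected terms plus the terms with no close match, which are
    candidates for adding to the dictionary on the next pass.
    """
    corrected, unmatched = [], []
    for term in terms:
        if term in dictionary:
            corrected.append(term)
            continue
        close = difflib.get_close_matches(term, dictionary, n=1, cutoff=cutoff)
        if close:
            corrected.append(close[0])
        else:
            corrected.append(term)
            unmatched.append(term)
    return corrected, unmatched

tokens = parse("Customer paid too much; money refunded!")
dictionary = ["customer", "refund", "bill"]
corrected, unmatched = fix_spelling(["custmer", "refund", "hbo"], dictionary)
```

Here “custmer” is close enough to be treated as a misspelling of “customer,” while “hbo” is far from every dictionary word and is returned as a suggestion for a new dictionary entry.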



• Stemming

– Stemming transforms words into their root forms.

– For example, one comment might contain “Customer paid too much on last bill; money refunded.” Another might say “Refunding overpayment.”

– These two comments have no words in common, yet they are saying essentially the same thing.

– The stemming algorithm recognizes that “overpayment” and “paid” both have the root of “pay.”

– Similarly, “refunded” and “refunding” both have the root of “refund.”

– Stemming turns the two comments into “Customer pay too much on last bill; money refund” and “Refund pay.”

– After stemming, the comments are not grammatically correct.

– On the other hand, comments similar to each other are much more likely to contain similar terms.
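
The effect on the two example comments can be checked by counting shared terms before and after mapping words to roots. Because “paid” is an irregular form, this sketch uses a small hand-written root table instead of a stemming algorithm; `ROOTS` and `terms` are hypothetical names, and a real system would derive the mappings from rules or a dictionary.

```python
# Hypothetical root table covering only this example.
ROOTS = {
    "paid": "pay",
    "overpayment": "pay",
    "refunded": "refund",
    "refunding": "refund",
}

def terms(comment, stem=False):
    """Lowercase the comment, drop simple punctuation, optionally map words to roots."""
    words = comment.lower().replace(";", " ").replace(".", " ").split()
    if stem:
        words = [ROOTS.get(word, word) for word in words]
    return set(words)

first = "Customer paid too much on last bill; money refunded."
second = "Refunding overpayment."

shared_before = terms(first) & terms(second)
shared_after = terms(first, stem=True) & terms(second, stem=True)
```

Before stemming the comments share no terms; afterwards they share “pay” and “refund,” which is what lets similarity measures and clustering treat them as related.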



• Applying Synonym Lists

– Synonym lists are lists of words and phrases that are recognized in the text and replaced by a common synonym.

– These serve several purposes, including fixing misspellings and finding word phrases.

• For example, “Change address” and “Change phone number” both turn into “Change account info.”

– The lists can also be used to fix misspellings.

• For example, these might all be synonyms for Showtime:

• Showtime / Show time / Show-time / ST / Showt / Shwotme

• The synonym lists turn these into a single word (in this case “Showtime”).
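
One way to apply a synonym list is a single case-insensitive regular expression over all variants. The table below is assembled from the examples above; the names `SYNONYMS` and `apply_synonyms`, and the choice to match variants only as whole words, are assumptions for this sketch.

```python
import re

# Hypothetical synonym list: each variant maps to one canonical term.
SYNONYMS = {
    "show time": "Showtime",
    "show-time": "Showtime",
    "showt": "Showtime",
    "shwotme": "Showtime",
    "st": "Showtime",
    "change address": "Change account info",
    "change phone number": "Change account info",
}

# Longest variants first, so longer phrases win over shorter overlapping entries.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(v) for v in sorted(SYNONYMS, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)

def apply_synonyms(text):
    """Replace every recognized variant with its canonical synonym."""
    return _PATTERN.sub(lambda m: SYNONYMS[m.group(1).lower()], text)

cleaned = apply_synonyms("Customer wants to add Show time")
```

Whole-word matching matters for short variants such as “ST,” which would otherwise match inside words like “customer” or “last.”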



• Using a Stop List

– The stop list contains words that have minimal meaning.

– Stop words can also be meaningful terms that simply do not distinguish between comments.

– The purpose of the stop word list is to remove words that do not distinguish between different comments, even when these words might seem meaningful.

• Converting Text to Numbers

– Singular value decomposition (SVD) is used to transform documents into numbers.

• Clustering

– Gaussian mixture models (GMM), also known as expectation-maximization clustering.

– In fact, several different methods were tried, notably k-means.
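
For reference, here is a minimal sketch of k-means, the simpler of the methods mentioned (it is not a Gaussian mixture model). It assumes each document has already been reduced to a short numeric vector, for example by SVD; the function name, the fixed iteration count, and the sample points are invented for illustration.

```python
import math

def k_means(points, initial_centers, iterations=10):
    """Plain k-means: alternate between assigning points and moving centers."""
    centers = [list(c) for c in initial_centers]
    assignments = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        assignments = [
            min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            for p in points
        ]
        # Update step: each center moves to the mean of its assigned points.
        for i in range(len(centers)):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:  # an empty cluster keeps its previous center
                centers[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments, centers

# Four documents as points in a two-dimensional reduced space (made-up values).
points = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9)]
assignments, centers = k_means(points, initial_centers=[points[0], points[2]])
```

K-means assigns every document to exactly one cluster, whereas a Gaussian mixture model gives each document a probability of belonging to each cluster.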


RapidMiner Practice

• To see:

– Training Videos\04 - Neil McGuigan - VancouverData\

• Text Mining 1 - Loading Text Into RapidMiner

• Text Mining 2 - Processing Text In RapidMiner

• Text Mining 3 - Text Association Rules in RapidMiner

• Text Mining 4 - Document Similarity and Clustering in RapidMiner

• Text Mining 5 - Automatic Classification of Documents using RapidMiner

• Text Mining 6 - Applying Model To New Documents

• To practice:

– Do the exercises presented in the videos using the file “TextMiningData.xls”.