CONCEPTUAL FOUNDATIONS OF TEXT MINING AND PREPROCESSING STEPS
WEEK 2 INTRODUCTION

Page 1: CONCEPTUAL FOUNDATIONS OF TEXT MINING AND PREPROCESSING STEPS WEEK 2 INTRODUCTION

CONCEPTUAL FOUNDATIONS OF TEXT MINING AND PREPROCESSING STEPS
WEEK 2 INTRODUCTION

Page 2: CONCEPTUAL FOUNDATIONS OF TEXT MINING AND PREPROCESSING STEPS WEEK 2 INTRODUCTION


BAG-OF-WORDS ASSUMPTION

“The order of the words in the document does not matter.”

• Only the occurrence of a word matters.

• "Ann is taller than Harry." == "Harry is taller than Ann."
• OK for document classification or clustering.

• Homographs?
  • 講台 (lecture platform), 美國在台協會 (American Institute in Taiwan)
  • "She can refuse to overlook our row," he moped, "unless I entrance her with the right present: a hit."
  • Her moped is presently right at the entrance to the building: she had hit a row of refuse cans!

• Information extraction and natural language processing?
  • Nelson Mandela was NOT born in 1910.

Page 3: CONCEPTUAL FOUNDATIONS OF TEXT MINING AND PREPROCESSING STEPS WEEK 2 INTRODUCTION


PREPROCESSING TEXT

Step 1: Choosing the scope of the document
• Classification or clustering → the entire document.
• Sentiment analysis, document summarization, information retrieval → paragraphs or sections.

Step 2: Tokenization

Step 3: Token normalization

Step 4: Stopping (dropping stop words)

Step 5: Stemming and lemmatization

Step 6: Sentence boundary detection
• Punctuation marks: . " ; ? !  (but how about "Washington, D.C."?)
• Statistical classification techniques are used to achieve near-perfect accuracy.

Page 4: CONCEPTUAL FOUNDATIONS OF TEXT MINING AND PREPROCESSING STEPS WEEK 2 INTRODUCTION


STEP 2: TOKENIZATION

Tokenization is the task of chopping a character sequence up into pieces, called tokens.
• Example:
  • Input: Friends, Romans, Countrymen, lend me your ears.
  • Output: <Friends, Romans, Countrymen, lend, me, your, ears>

Token vs. type vs. term, e.g., "to be or not to be":

• A token is an instance of a sequence of characters in some particular document that is a useful semantic unit for processing.
  • Number of tokens: 6 tokens.
• A type is the class of all tokens containing the same character sequence.
  • Number of types: 4 types (to, be, or, not).
• A term is a (normalized) type that is indexed in the IR system's dictionary (sometimes equal to the tokens).
  • Usually derived by various normalization processes.
  • Can be entirely distinct from the tokens (e.g., after stemming).
  • Number of terms: 0 (if to, be, or, and not are all dropped as stop words).
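To make the token/type/term distinction concrete, here is a minimal sketch (not part of the original slides) that counts all three for the example phrase; the stop list is an assumption chosen so that every type is dropped, matching the count of 0 terms above.

```python
# Count tokens, types, and terms for "to be or not to be".
text = "to be or not to be"

tokens = text.split()        # instances: ['to', 'be', 'or', 'not', 'to', 'be']
types = set(tokens)          # distinct character sequences: {'to', 'be', 'or', 'not'}

# Terms are the normalized types that survive indexing; here we assume a
# stop list that happens to contain all four types, so nothing is indexed.
stop_words = {"to", "be", "or", "not"}
terms = types - stop_words

print(len(tokens), len(types), len(terms))   # 6 4 0
```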

Page 5: CONCEPTUAL FOUNDATIONS OF TEXT MINING AND PREPROCESSING STEPS WEEK 2 INTRODUCTION


TOKENIZATION: SIMPLE METHOD

Simple method: just split on all non-alphanumeric characters.
• Chile's → Chile and s.
• O'Neal → O and Neal.
• C++ → C.
• [email protected] → chenli, kuo, mail, cgu, edu, tw.
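The sketch below illustrates this naive splitting rule and the problems it causes, using the slide's examples; naive_tokenize is an illustrative name, not a standard library function.

```python
import re

def naive_tokenize(text):
    # split on runs of characters that are neither ASCII letters nor digits
    return [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]

print(naive_tokenize("Chile's"))   # ['Chile', 's']
print(naive_tokenize("O'Neal"))    # ['O', 'Neal']
print(naive_tokenize("C++"))       # ['C']
```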

Page 6: CONCEPTUAL FOUNDATIONS OF TEXT MINING AND PREPROCESSING STEPS WEEK 2 INTRODUCTION


TOKENIZATION: HYPHENS

Should we split on the hyphen or regard the whole expression as one token?
• Hewlett-Packard?

Some useful heuristic rules:
• Allow short hyphenated prefixes on words.
  • E.g., co-worker.
• But not longer hyphenated forms.
  • E.g., the hold-him-back-and-drag-him-away manner.
• Some IR systems will generalize the query to cover all three of the one-word, hyphenated, and two-word forms.
  • E.g., over-eager → over-eager or over eager or overeager.

Page 7: CONCEPTUAL FOUNDATIONS OF TEXT MINING AND PREPROCESSING STEPS WEEK 2 INTRODUCTION


TOKENIZATION: WHITE SPACE

Splitting on white space can also split what should be regarded as a single token.
• Names: Los Angeles.
• Compounds are sometimes space-separated: whitespace vs. white space.
• Dates: Mar 11, 1983.
• Latin: et al., et cetera.

Moreover, hyphens and non-separating whitespace can interact.
• Air fares: San Francisco-Los Angeles.

Page 8: CONCEPTUAL FOUNDATIONS OF TEXT MINING AND PREPROCESSING STEPS WEEK 2 INTRODUCTION


TOKENIZATION: OTHER THAN ENGLISH

German writes compound nouns without spaces:
• Computerlinguistik → computational linguistics.
• Requires a compound-splitter module that checks whether a word can be subdivided into multiple words that appear in a vocabulary.

East Asian languages (e.g., Chinese, Japanese, Korean, and Thai) are written without any spaces between words.
• They require word segmentation as a preprocessing step.
• One approach: keep a large vocabulary and take the longest vocabulary match, with some heuristics for unknown words (a minimal sketch follows at the end of this slide).
  • 我要去總統府 ("I want to go to the Presidential Office"): segment as 總統 (president)?? or 總統府 (presidential palace)??
• Machine-learned sequence models, such as HMMs, can also be used.
• Alternatively, do the indexing with short subsequences of characters (n-grams), regardless of whether particular sequences cross word boundaries or not.
  • This is appealing because:
    • A single Chinese character usually has semantic content.
    • Most words are short (2 characters).
    • Word boundaries are not clear, and there may be no exact locations.
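To make the longest-match idea concrete, here is a minimal forward maximum-matching segmenter over a toy vocabulary; the vocabulary, function name, and window size are assumptions for this sketch, and production systems use far larger dictionaries or statistical sequence models instead.

```python
def longest_match_segment(text, vocab, max_len=4):
    """Greedy forward maximum matching: always take the longest dictionary word."""
    result, i = [], 0
    while i < len(text):
        # try the longest candidate first; fall back to a single character
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab or j == i + 1:
                result.append(text[i:j])
                i = j
                break
    return result

vocab = {"我", "要", "去", "總統", "總統府"}
print(longest_match_segment("我要去總統府", vocab))
# ['我', '要', '去', '總統府'] -- the longer match 總統府 wins over 總統
```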

Page 9: CONCEPTUAL FOUNDATIONS OF TEXT MINING AND PREPROCESSING STEPS WEEK 2 INTRODUCTION


STEP 3: NORMALIZATION

Canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens.

• Remove hyphens, periods, accents, and diacritics.
  • on-line → online; C.A.T. → CAT; naïve → naive.
• Spelling normalization.
  • color, colour; disability, disablitiy.
  • Dictionary-based methods, Soundex, Metaphone, string-edit distance, ...
• Case normalization.
  • 'Automatic' → 'automatic'.
  • However, many proper nouns are distinguished only by case.
    • E.g., person names: Bush vs. bush.
  • A heuristic for English: lowercase words at the beginning of a sentence; leave mid-sentence capitalized words capitalized.
  • Since users usually type lowercase regardless of the correct case of words, lowercasing everything often remains the most practical solution.
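The sketch below illustrates two of the normalization steps above (removing accents/diacritics, hyphens, and periods, then lowercasing) with Python's standard unicodedata module; the helper names are illustrative only.

```python
import unicodedata

def strip_accents(token):
    # decompose characters (NFD), then drop the combining accent marks
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def normalize(token):
    return strip_accents(token).replace("-", "").replace(".", "").lower()

print(normalize("naïve"))    # naive
print(normalize("on-line"))  # online
print(normalize("C.A.T."))   # cat
```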

Page 10: CONCEPTUAL FOUNDATIONS OF TEXT MINING AND PREPROCESSING STEPS WEEK 2 INTRODUCTION


NORMALIZATION IN OTHER LANGUAGES

The French word for 'the' has distinct forms based on the gender/number of the following noun, among other things:
• the → le, la, l', les.

Japanese is even more difficult.
• Japanese text mixes multiple writing systems: Chinese characters (kanji), hiragana (平假名), and katakana (片假名).
• Even a single word may be written with multiple writing systems.
• Retrieval systems therefore require complex equivalence classing across the writing systems.

Other issues:
• Foreign name translation, such as Beijing and Peking.
• A document containing many different languages requires more than one tokenizer and language-specific normalization.

Page 11: CONCEPTUAL FOUNDATIONS OF TEXT MINING AND PREPROCESSING STEPS WEEK 2 INTRODUCTION


STEP 4: STOPPING

Removing stop words.

Function words (grammatical words, closed-class words) vs. content words (open-class words):

• "Stopping is a commonly included feature in nearly every text mining software package."
  • Function words only: "is a in every"
  • Content words only: "Stopping commonly included feature nearly text mining software package."

• Stop words removed: "烏克蘭總統亞努科維奇 (Viktor Yanukovych) 昨日宣布停火,隨後警方民眾發生新一波衝突,死亡數字攀升" ("Ukrainian President Viktor Yanukovych yesterday announced ceasefire; afterwards police, civilians new wave of clashes; death toll climbed")
• Full sentence: "烏克蘭總統亞努科維奇 (Viktor Yanukovych) 雖然在昨日宣布停火,但隨後警方和民眾又發生新一波衝突,讓死亡數字往上攀升。" ("Although Ukrainian President Viktor Yanukovych announced a ceasefire yesterday, a new wave of clashes then broke out between the police and civilians, pushing the death toll upward.")

Page 12: CONCEPTUAL FOUNDATIONS OF TEXT MINING AND PREPROCESSING STEPS WEEK 2 INTRODUCTION


STOPPING

The general strategy for determining stop words:
• Sort the terms in a document collection by collection frequency.
  • Collection frequency: the number of times each term appears in the collection.
• Take the most frequent terms as a stop list, often with the help of human experts.
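A minimal sketch of this strategy on a toy collection: count collection frequency with a Counter and take the top few terms as stop-word candidates (the documents and the cutoff of 3 are assumptions for illustration).

```python
from collections import Counter

docs = [
    "my dog ate my homework",
    "my cat ate the sandwich",
    "a dolphin ate the homework",
]

# Collection frequency: total occurrences of each term across all documents.
collection_freq = Counter(token for doc in docs for token in doc.split())

stop_candidates = [term for term, _ in collection_freq.most_common(3)]
print(stop_candidates)   # ['my', 'ate', 'homework'] for this toy collection
```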

Existing stop word lists
• Generated from existing word frequency lists.
  • http://www.kilgarriff.co.uk/bnc-readme.html#lemmatised
  • a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the, to, was, were, will, with
  • http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words (319 words)

Stop words which might not be function words
• subject, sender, recipient, date

Titles of books or songs might consist mainly of stop words
• "Let It Be", "To be or not to be", "As We May Think", ...

Page 13: CONCEPTUAL FOUNDATIONS OF TEXT MINING AND PREPROCESSING STEPS WEEK 2 INTRODUCTION


STOPPING

Several works have examined how the statistics of language can be used to cope with common words in better ways.
• TF-IDF term weighting leads to very common words having little impact on document rankings.

Authorship detection, on the other hand, needs function words.

Page 14: CONCEPTUAL FOUNDATIONS OF TEXT MINING AND PREPROCESSING STEPS WEEK 2 INTRODUCTION


STEP 5: STEMMING AND LEMMATIZATION

Reduce terms to their "roots".
• Walking, walks, walked, walker, ... → walk.

Stemming: usually a crude heuristic process that chops off the ends of words.
• E.g., 'automate(s)', 'automatic', 'automation' are all reduced to 'automat'.
• Snowball stemmer: http://snowball.tartarus.org/index.php
  • (rule) '-tional' → '-tion'; (example) 'national' → 'nation'.
  • (rule) '-ly' → ''; (example) 'quickly' → 'quick'; but 'reply' → 'rep'??

Lemmatization: doing things more properly with the use of a vocabulary and morphological analysis of words (NLP); needs dictionary lookups.
• Returns the base or dictionary form of a word, the lemma.
• E.g., 'am', 'are', 'is' → 'be'.
• 'saw' would return either 'see' or 'saw' depending on whether the use of the token was as a verb or a noun.
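The hedged sketch below contrasts a rule-based stemmer with a dictionary-based lemmatizer, assuming the NLTK library is installed (the WordNet lemmatizer additionally needs its data, e.g. via nltk.download('wordnet')); exact stemmer outputs can vary slightly between implementations and versions.

```python
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer

stemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()

# Stemming: crude suffix chopping, no dictionary involved.
for word in ["national", "quickly", "reply", "automation"]:
    print(word, "->", stemmer.stem(word))
# national -> nation, quickly -> quick, reply -> repli, automation -> automat

# Lemmatization: uses the part of speech to return the dictionary form.
print(lemmatizer.lemmatize("are", pos="v"))   # be
print(lemmatizer.lemmatize("saw", pos="v"))   # see
print(lemmatizer.lemmatize("saw", pos="n"))   # saw
```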

Page 15: CONCEPTUAL FOUNDATIONS OF TEXT MINING AND PREPROCESSING STEPS WEEK 2 INTRODUCTION


CREATING VECTORS FROM PROCESSED TEXT

Doc 1: My dog ate my homework.

Doc 2: My cat ate the sandwich.

Doc 3: A dolphin ate the homework.

Terms: a, ate, cat, dolphin, dog, homework, my, sandwich, the

Binary vector

• Doc 1: 0,1,0,0,1,1,1,0,0

• Doc 2: 0,1,1,0,0,0,1,1,1

• Doc 3: 1,1,0,1,0,1,0,0,1

Integer count vector

• Doc 1: 0,1,0,0,1,1,2,0,0

• Doc 2: 0,1,1,0,0,0,1,1,1

• Doc 3: 1,1,0,1,0,1,0,0,1
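The following minimal sketch (plain Python, no libraries) rebuilds the binary and integer count vectors above, using the same fixed term order as the slide.

```python
terms = ["a", "ate", "cat", "dolphin", "dog", "homework", "my", "sandwich", "the"]
docs = [
    "my dog ate my homework",
    "my cat ate the sandwich",
    "a dolphin ate the homework",
]

for doc in docs:
    tokens = doc.split()
    counts = [tokens.count(t) for t in terms]       # integer count vector
    binary = [1 if c > 0 else 0 for c in counts]    # binary vector
    print(binary, counts)
# Doc 1 prints [0, 1, 0, 0, 1, 1, 1, 0, 0] and [0, 1, 0, 0, 1, 1, 2, 0, 0]
```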

Page 16: CONCEPTUAL FOUNDATIONS OF TEXT MINING AND PREPROCESSING STEPS WEEK 2 INTRODUCTION

INFORMATION RETRIEVAL

Term-document incidence matrix (entry is 1 if the document contains the term, 0 otherwise):

                 docID
term         1    2    3    4    5    6
antony       1    1    0    0    0    1
brutus       1    1    0    1    0    0
caesar       1    1    0    1    1    1
calpurnia    0    1    0    0    0    0
cleopatra    1    0    0    0    0    0
mercy        1    0    1    1    1    1
worser       1    0    1    1    1    0

Using this matrix, an information retrieval system can easily answer a user's Boolean queries, in which query terms are combined with the operators AND, OR, and NOT. For example, to answer the query "Brutus AND Caesar AND NOT Calpurnia":

    110100  (Brutus)
AND 110111  (Caesar)
AND 101111  (NOT Calpurnia: the complement of Calpurnia's row)
=   100100  (documents 1 and 4 answer the query)
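A small sketch of the same Boolean evaluation in code, using the 0/1 rows of the incidence matrix above; the variable names are just for illustration.

```python
brutus    = [1, 1, 0, 1, 0, 0]
caesar    = [1, 1, 0, 1, 1, 1]
calpurnia = [0, 1, 0, 0, 0, 0]

not_calpurnia = [1 - b for b in calpurnia]                             # 101111
result = [b & c & n for b, c, n in zip(brutus, caesar, not_calpurnia)]
print(result)   # [1, 0, 0, 1, 0, 0] -> documents 1 and 4 match
```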

Page 17: CONCEPTUAL FOUNDATIONS OF TEXT MINING AND PREPROCESSING STEPS WEEK 2 INTRODUCTION


Is term-document-matrix-based indexing feasible?

• No!
• Suppose that we have 1 million (1,000,000) documents to index.
• And the collection contains 500,000 distinct terms.
• Then the matrix will have 500,000 x 1,000,000 = 5 x 10^11 entries.
• At 1 bit per entry, the matrix will cost around 58 GB of memory!

Page 18: CONCEPTUAL FOUNDATIONS OF TEXT MINING AND PREPROCESSING STEPS WEEK 2 INTRODUCTION


A better way of indexing is to record only the things that do occur: an inverted index.

• Sometimes referred to as an inverted file.

• Consists of two parts: a dictionary and postings.
  • Posting: a record that a term appeared in a particular document.
  • Postings list (or inverted list): the list of all postings for a given term.

Page 19: CONCEPTUAL FOUNDATIONS OF TEXT MINING AND PREPROCESSING STEPS WEEK 2 INTRODUCTION


The input to (inverted) index construction is a list of normalized tokens for each document.

Then, we sort this list so that the terms are alphabetical.

Next, multiple occurrences of the same term from the same document are merged.

Instances of the same term are then grouped.

• The result is split into a dictionary and postings.
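A minimal sketch of these construction steps on a toy collection: gather (term, docID) pairs, sort and deduplicate them, then split the result into a dictionary and per-term postings lists (the documents are illustrative).

```python
from collections import defaultdict

docs = {1: "my dog ate my homework",
        2: "my cat ate the sandwich",
        3: "a dolphin ate the homework"}

# Deduplicated (term, docID) pairs, sorted by term and then by docID.
pairs = sorted({(term, doc_id)
                for doc_id, text in docs.items()
                for term in text.split()})

index = defaultdict(list)      # dictionary -> sorted postings list
for term, doc_id in pairs:
    index[term].append(doc_id)

print(index["ate"])        # [1, 2, 3]
print(index["homework"])   # [1, 3]
```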

Page 20: CONCEPTUAL FOUNDATIONS OF TEXT MINING AND PREPROCESSING STEPS WEEK 2 INTRODUCTION


The dictionary also records some statistics, such as the number of documents that contain each term (the document frequency).

• These can be used to rank retrieved documents.

The postings are much larger than the dictionary.

• So, in general, we keep the dictionary in memory.
• And the postings lists are normally kept on disk.

Page 21: CONCEPTUAL FOUNDATIONS OF TEXT MINING AND PREPROCESSING STEPS WEEK 2 INTRODUCTION


How to process Boolean queries using an inverted index.

Consider processing the simple conjunctive query:

Brutus AND Calpurnia

1. Locate Brutus in the dictionary.

2. Retrieve its postings.

3. Locate Calpurnia in the dictionary.

4. Retrieve its postings.

5. Intersect (merge) the two postings lists.

Page 22: CONCEPTUAL FOUNDATIONS OF TEXT MINING AND PREPROCESSING STEPS WEEK 2 INTRODUCTION


The intersection operation needs to be efficient.

Here we present an effective merge algorithm that requires the postings lists to be sorted by docID.

INTERSECT(p1, p2)
  answer <- <>
  while p1 != NULL and p2 != NULL
    if docID(p1) == docID(p2)
      ADD(answer, docID(p1))
      p1 <- next(p1)
      p2 <- next(p2)
    else if docID(p1) < docID(p2)
      p1 <- next(p1)
    else
      p2 <- next(p2)
  return answer

[Figure: two sorted postings lists being intersected, with pointer p1 on the Brutus list (1, 2, ..., 174) and pointer p2 on the Calpurnia list (2, ..., 58); the answer so far contains 2.]

The intersection takes O(x + y) time, where x and y are the lengths of the two postings lists, respectively.
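For reference, here is the same merge written as runnable Python over sorted lists of docIDs; the two postings lists below are illustrative, not the exact ones from the figure.

```python
def intersect(p1, p2):
    """Merge two postings lists that are sorted by docID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus    = [1, 2, 31, 174]
calpurnia = [2, 31, 58]
print(intersect(brutus, calpurnia))   # [2, 31]
```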

Page 23: CONCEPTUAL FOUNDATIONS OF TEXT MINING AND PREPROCESSING STEPS WEEK 2 INTRODUCTION


WHICH DOC IS MORE RELEVANT?

Page 24: CONCEPTUAL FOUNDATIONS OF TEXT MINING AND PREPROCESSING STEPS WEEK 2 INTRODUCTION


TERM FREQUENCY

The weight of a term depends on the number of occurrences of the term in the document.

• Notation: tf_{t,d}, the number of occurrences of term t in document d.

A critical problem of the term frequency weighting scheme:

• Each term occurrence is considered equally important (each occurrence contributes a weight of 1).
• "term i and term j are synonyms"

In fact, certain terms have little or no discriminating power.

• For instance, a collection of documents on the auto industry is likely to have the term 'auto' in almost every document.
• We need a mechanism for reducing the effect of terms that occur too often in the collection.

Page 25: CONCEPTUAL FOUNDATIONS OF TEXT MINING AND PREPROCESSING STEPS WEEK 2 INTRODUCTION


INVERSE DOCUMENT FREQUENCY

Document frequency:
• Notation: df_t, the number of documents in the collection that contain the term t.

Inverse document frequency (IDF):
• Notation: idf_t = log(N / df_t), where N is the number of documents in the collection.
• The idf of a rare term is high, and is likely to be low for a frequent term.
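A small sketch that computes idf_t = log10(N / df_t); the document frequencies are taken from the Reuters example two slides below, so the printed values should match that table.

```python
import math

N = 806_791   # documents in the Reuters collection
df = {"car": 18_165, "auto": 6_723, "insurance": 19_241, "best": 25_235}

for term, df_t in df.items():
    print(term, round(math.log10(N / df_t), 2))
# car 1.65, auto 2.08, insurance 1.62, best 1.5
```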

Page 26: CONCEPTUAL FOUNDATIONS OF TEXT MINING AND PREPROCESSING STEPS WEEK 2 INTRODUCTION


INVERSE DOCUMENT FREQUENCY

An alternative to document frequency is collection frequency (CF):
• The total number of occurrences of a term in the collection.
• But the purpose of term scoring is to discriminate between documents.
• So it is better to use a document-level statistic (DF) than a collection-wide statistic for term weighting.

Word          CF       DF
'try'         10422    8760
'insurance'   10440    3997

• 'try' can be a general term appearing in many documents.
• 'insurance' can be a discriminating term appearing in only a certain subset of documents.

Page 27: CONCEPTUAL FOUNDATIONS OF TEXT MINING AND PREPROCESSING STEPS WEEK 2 INTRODUCTION


INVERSE DOCUMENT FREQUENCY

Example of idf values in the Reuters collection of 806,791 documents:

term          DF        IDF
'car'         18,165    1.65
'auto'         6,723    2.08
'insurance'   19,241    1.62
'best'        25,235    1.5

Page 28: CONCEPTUAL FOUNDATIONS OF TEXT MINING AND PREPROCESSING STEPS WEEK 2 INTRODUCTION


TF-IDF WEIGHTING

TF-IDF combines term frequency and inverse document frequency to assign the weight of term t in document d as follows:

• tf-idf_{t,d} = tf_{t,d} x idf_t.

• The weight of term t in document d is:
  • High when t occurs many times in d and appears in only a small number of documents.
  • Low when t occurs few times in d, or occurs in virtually all documents in the collection.

A simple scoring mechanism for a query q against a document d is the overlap score measure:

• score(q, d) = ∑_{t in q} tf-idf_{t,d}.
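A minimal sketch of the overlap score measure on a toy collection; the documents, query, and helper names are assumptions for illustration, and base-10 logarithms are used as in the idf example.

```python
import math

docs = {1: "my dog ate my homework",
        2: "my cat ate the sandwich",
        3: "a dolphin ate the homework"}
N = len(docs)

def tf(term, doc_id):
    return docs[doc_id].split().count(term)

def idf(term):
    df = sum(1 for text in docs.values() if term in text.split())
    return math.log10(N / df) if df else 0.0

def score(query, doc_id):
    # overlap score: sum the tf-idf weights of the query terms in the document
    return sum(tf(t, doc_id) * idf(t) for t in query.split())

for d in docs:
    print(d, round(score("dog homework", d), 3))
# document 1 scores highest because it contains both query terms
```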

Page 29: CONCEPTUAL FOUNDATIONS OF TEXT MINING AND PREPROCESSING STEPS WEEK 2 INTRODUCTION


TWEAKING TF-IDF

Do twenty occurrences of a term in a document truly carry twenty times the significance of a single occurrence?

• We observe higher term frequencies in longer documents merely because longer documents tend to repeat the same words over and over again.

Sub-linear TF scaling:

• A common modification of TF is to use the logarithm of the term frequency:
  • wf_{t,d} = 1 + log(tf_{t,d}) if tf_{t,d} > 0, and 0 otherwise.
• Then, replace TF-IDF with WF-IDF:
  • wf-idf_{t,d} = wf_{t,d} x idf_t.

[Figure: term weight as a function of term frequency, comparing linear weighting with logarithmic (sub-linear) scaling.]
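A tiny sketch of the sub-linear scaling function above: with wf_{t,d} = 1 + log10(tf_{t,d}), twenty occurrences contribute a weight of roughly 2.3 rather than 20 (base-10 logs assumed here).

```python
import math

def wf(tf):
    # sub-linear term-frequency scaling
    return 1 + math.log10(tf) if tf > 0 else 0.0

for tf_value in [0, 1, 2, 10, 20]:
    print(tf_value, round(wf(tf_value), 2))
# 0 0.0, 1 1.0, 2 1.3, 10 2.0, 20 2.3
```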

Page 30: CONCEPTUAL FOUNDATIONS OF TEXT MINING AND PREPROCESSING STEPS WEEK 2 INTRODUCTION


REFERENCES

• Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze (2008). Introduction to Information Retrieval. Cambridge University Press.
  • Ch. 1: Boolean retrieval.
  • Ch. 2: The term vocabulary and postings lists.

• Miner, Gary, Dursun Delen, John Elder, Andrew Fast, Thomas Hill, and Robert A. Nisbet (2012). Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications. Academic Press.
  • Ch. 3: Conceptual Foundations of Text Mining and Preprocessing Steps.