
  • An Introduction to Text Mining

    Ravindra Jaju

  • Outline of the presentation

    Initiation/Introduction ...

    What makes text stand apart from other kinds of data?

    Classification

    Clustering

    Mining on The Web

  • Data Mining

    What: Looking for information from usually large amounts of data

    Mainly two kinds of activities - Descriptive and Predictive

    Example of a descriptive activity - Clustering

    Example of a predictive activity - Classification

  • What kind of data is this?

    It could be two customers' baskets, containing (milk, bread, butter) and (shaving cream, razor, after-shave lotion) respectively.

    Or, it could be two documents - "Java programming language" and "India beat Pakistan"

  • And what kind of data is this?

    Data about people, pairs!

  • Data representation

    Humans understand data in various forms

    Text

    Sales figures

    Images

    Computers understand only numbers

  • Working with data

    Most of the mining algorithms work only with numeric data

    All data, hence, are represented as numbers so that they can lend themselves to the algorithms

    Whether it is sales figures, crime rates, text, or images, one has to find a suitable way to transform the data into numbers.

  • Text Mining - Working with Numbers

    "Java Programming Language"    "India beat Pakistan"

    [Figure: the two documents encoded as numeric (1/0) vectors]

    The transformation to 1's and 0's hides all relationships between "Java" and "Language", and "India" and "Pakistan", which humans can make out (How?)

  • Text Mining - Working with Numbers (contd.)

    As we have seen, data transformation (from text/word to some index number in this case) means that there is some information loss

    One big challenge in this field today is to find a good data representation for input to the mining algorithms

  • Text Representation Issues

    Each word has a dictionary meaning, or meanings

    Run - (1) the verb, (2) the noun, in cricket

    Cricket - (1) the game, (2) the insect

    Each word is used in various senses

    Tendulkar made 100 runs

    Because of an injury, Tendulkar cannot run and will need a runner between the wickets

    Capturing the meaning of sentences is an important issue as well. Grammar, parts of speech, time sense could be easy!

    Finding out automatically who the "he" in "He is the President" is, given a document, is hard. And president of? Well ...

  • Text Representation Issues (contd.)

    In general, it is hard to capture these features from a text document

    One, it is difficult to extract this automatically

    Two, even if we did it, it won't scale!

    One simplification is to represent documents as a vector of words

    We have already seen examples

    Each document is represented as a vector, and each component of the vector represents some quantity related to a single word.

  • The Document Vector

    "Java Programming Language" (document A)

    "India beat Pakistan" (document B)

    "India beat Australia" (document C)

    What vector operation can you think of to find two similar documents?

    How about the dot product?

    As we can easily verify, documents B and C will have a higher dot product than any other combination (see the sketch below)
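
    A minimal sketch of this idea: build bag-of-words vectors over the combined vocabulary and compare documents with the dot product, and with its length-normalized form, the cosine. The toy documents match those above.

        # Bag-of-words vectors for the three example documents, compared
        # with the dot product and cosine similarity.
        from collections import Counter
        from math import sqrt

        docs = {
            "A": "java programming language",
            "B": "india beat pakistan",
            "C": "india beat australia",
        }
        vocab = sorted({w for text in docs.values() for w in text.split()})

        def to_vector(text):
            counts = Counter(text.split())
            return [counts[w] for w in vocab]

        def dot(u, v):
            return sum(a * b for a, b in zip(u, v))

        def cosine(u, v):
            return dot(u, v) / (sqrt(dot(u, u)) * sqrt(dot(v, v)))

        vec = {name: to_vector(text) for name, text in docs.items()}
        print(dot(vec["B"], vec["C"]))              # 2 ("india" and "beat" shared)
        print(dot(vec["A"], vec["B"]))              # 0 (no shared terms)
        print(round(cosine(vec["B"], vec["C"]), 3)) # 0.667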

  • More on document similarity

    The dot product, or cosine, between two vectors is a measure of similarity.

    Documents about related topics should have higher similarity

    [Figure: document vectors plotted in a 3-D term space with axes "Indonesia", "Java", and "Language", origin at (0, 0, 0)]

  • Document Similarity (contd.)

    How about distance measures?

    The cosine similarity measure will not capture the inter-cluster distances!

  • Further refinements to the DV representation

    Not all words are equally important - the, is, and, to, he, she, it (Why?)

    Of course, these words could be important in certain contexts

    We have the option of scaling the components for these words, or completely removing them from the corpus

    In general, we prefer to remove the stopwords and scale the remaining words

    Important words should be scaled upwards, and vice versa

    One widely used scaling factor - TF-IDF

    TF-IDF stands for the Term Frequency and Inverse Document Frequency product, for a word (sketched below).
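
    A minimal TF-IDF sketch under common conventions (the slide does not fix exact formulas): tf is the raw count of the term in a document, and idf = log(N / df), where df is the number of documents containing the term.

        # TF-IDF scaling: terms that appear everywhere get a low weight,
        # terms concentrated in few documents get a high weight.
        from collections import Counter
        from math import log

        docs = [
            "java programming language".split(),
            "india beat pakistan".split(),
            "india beat australia".split(),
        ]
        n_docs = len(docs)
        df = Counter(term for doc in docs for term in set(doc))

        def tf_idf(term, doc):
            tf = doc.count(term)
            idf = log(n_docs / df[term])
            return tf * idf

        # "india" occurs in 2 of 3 documents, "pakistan" in only 1,
        # so "pakistan" gets the larger weight in document B.
        print(tf_idf("india", docs[1]), tf_idf("pakistan", docs[1]))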

  • Text Mining - Moving Further

    Document/Term Clustering

    Given a large set, group similar entities

    Text Classification

    Given a document, find what topic it talks about

    Information Retrieval

    Search engines

    Information Extraction

    Question Answering

  • Clustering (Descriptive Activity)

    Activity: Group together similar documents

    Techniques used:

    Partitioning

    Hierarchical - Agglomerative, Divisive

    Grid-based

    Model-based

  • Clustering (contd.)

    Partitioning - Divide the input data into k partitions

    K-means, K-medoids (a small K-means sketch follows this slide)

    Hierarchical clustering

    Agglomerative - Each data point is assumed to be a cluster representative; keep merging similar clusters till we get a single cluster

    Divisive - The opposite of agglomerative
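
    A minimal K-means sketch, assuming plain numeric vectors and squared Euclidean distance; a production version would add better seeding (e.g. k-means++) and empty-cluster handling.

        # K-means: alternate between assigning points to the nearest
        # centroid and moving each centroid to the mean of its cluster.
        import random

        def kmeans(points, k, iters=100, seed=0):
            centroids = random.Random(seed).sample(points, k)
            for _ in range(iters):
                clusters = [[] for _ in range(k)]
                for p in points:
                    i = min(range(k),
                            key=lambda c: sum((a - b) ** 2
                                              for a, b in zip(p, centroids[c])))
                    clusters[i].append(p)
                new = [[sum(d) / len(cl) for d in zip(*cl)] if cl else centroids[i]
                       for i, cl in enumerate(clusters)]
                if new == centroids:   # assignments stable: converged
                    break
                centroids = new
            return centroids, clusters

        pts = [[0, 0], [0, 1], [10, 10], [10, 11]]
        print(kmeans(pts, k=2)[0])     # centroids near [0, 0.5] and [10, 10.5]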

  • Frequent term-based text clustering

    Idea:

    Frequent terms carry more information about the cluster they might belong to

    Highly correlated frequent terms probably belong to the same cluster

    D = {D1, ..., Dn} is the set of documents, with each Dj ⊆ T, the set of all terms

    Then candidate clusters are generated from F = {F1, ..., Fk}, where each Fi is a set of all frequent terms which occur together (a small sketch follows this slide).
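
    A simplified sketch of generating such candidates, assuming a term set is "frequent" when at least min_support documents contain all of its terms; real implementations (e.g. Beil et al.) use Apriori-style pruning rather than brute-force enumeration.

        # Candidate clusters from frequent co-occurring term sets: each
        # frequent set Fi comes with the documents it covers.
        from itertools import combinations

        docs = [
            {"java", "programming", "language"},
            {"java", "program", "code"},
            {"india", "beat", "pakistan"},
            {"india", "beat", "australia"},
        ]
        min_support = 2

        terms = sorted(set().union(*docs))
        frequent = []
        for size in (1, 2):
            for fs in combinations(terms, size):
                cover = [i for i, d in enumerate(docs) if set(fs) <= d]
                if len(cover) >= min_support:
                    frequent.append((fs, cover))

        print(frequent)   # e.g. ("beat", "india") covers documents 2 and 3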

  • Classification

    The problem statement:

    Given a set of documents, each with a label called the class label for that document

    Given a classifier which "learns" from the above data set

    For a new, unseen document, the classifier should be able to predict with a high degree of accuracy the correct class to which the new document belongs

  • Decision Tree Classifier

    A tree - Each node represents some kind of an evaluation for an attribute of the data

    Each edge, the decision taken

    The evaluation at each node is some kind of an "information gain" measure

    Reduction in entropy = more information gained (worked example below)

    Entropy: E(X) = -Σ p_i log2(p_i), where p_i represents the probability that the data corresponds to sample i

    Each edge represents a choice for the value of the attribute the node represents

    Good for text mining. But doesn't scale!
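
    A small worked example of the entropy measure and the information gain from a split:

        # Entropy of a class distribution, and the information gained
        # when a split yields two purer partitions.
        from math import log2

        def entropy(probs):
            return -sum(p * log2(p) for p in probs if p > 0)

        before = entropy([0.5, 0.5])     # 1.0 bit: maximum uncertainty for 2 classes
        after = 0.5 * entropy([0.9, 0.1]) + 0.5 * entropy([0.1, 0.9])
        print(before - after)            # gain > 0: the split is informative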

  • Statistical (Bayesian) Classification

    For document-class data, we calculate the probabilities of occurrence of events

    Bayes' Theorem: P(c|d) = P(c) . P(d|c) / P(d)

    Given a document d, the probability that it belongs to a class c is given by the above formula.

    In practice, the exact values of the probabilities of each event are unknown, and are estimated from the samples

  • Naïve Bayes Classification

    Probability of the document event d: P(d) = P(w1, ..., wn), where the wi are the words

    The RHS is generally a headache. We have to consider the inter-dependence of each of the wj events

    Naïve Bayes - Assume all the wj events are independent. The RHS then expands to the product Π P(wj)

    Most of the Bayesian text classifiers work with this simplification (a sketch follows)
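
    A minimal multinomial Naïve Bayes sketch using this simplification, with add-one (Laplace) smoothing for unseen words - an assumption beyond the slide, but standard - and log-probabilities to avoid underflow on long documents.

        # Naive Bayes: score(c) = log P(c) + sum of log P(w|c) over the words.
        from collections import Counter, defaultdict
        from math import log

        train = [
            ("java programming language", "computing"),
            ("python programming", "computing"),
            ("india beat pakistan", "cricket"),
            ("india beat australia", "cricket"),
        ]

        class_docs = defaultdict(int)
        word_counts = defaultdict(Counter)
        for text, label in train:
            class_docs[label] += 1
            word_counts[label].update(text.split())

        vocab = {w for c in word_counts.values() for w in c}
        n_train = len(train)

        def predict(text):
            def score(label):
                total = sum(word_counts[label].values())
                s = log(class_docs[label] / n_train)        # prior P(c)
                for w in text.split():                      # likelihood P(d|c)
                    s += log((word_counts[label][w] + 1) / (total + len(vocab)))
                return s
            return max(class_docs, key=score)

        print(predict("programming in java"))   # -> computing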

  • Bayesian Belief Networks

    This is an intermediate approach

    Not all words are independent

    If "java" and "program" occur together, then boost the probability value of the class "computer programming"

    If "java" and "indonesia" occur together, then the document is more likely about some-other-class

    Problem? How do we come up with correlations like the above?

  • Other classification techniques

    Support Vector Machines

    Find the best discriminant plane between two classes

    k Nearest Neighbour

    Association Rule Mining

    Neural Networks

    Case-based reasoning

  • An example - Text Classification from Labeled and Unlabeled Documents with Expectation Maximization

    Problem setting:

    Labeling documents is a manual process

    A lot more unlabeled documents are available as compared to labeled documents

    Unlabeled documents contain information which could help in the classification activity

  • An example (contd.)

    Train a classifier with the labeled documents

    Say, a Naïve Bayes classifier

    This classifier estimates the model parameters (the prior probabilities of the various events)

    Now, classify the unlabeled documents. Assuming the applied labels to be correct, re-estimate the model parameters

    Repeat the above step till convergence (sketched below)
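
    A sketch of this loop, assuming hypothetical helpers train(examples) -> model and model.predict(doc) (for instance, built on the Naïve Bayes sketch earlier). Hard labels are used here for brevity; Nigam et al. actually re-estimate with soft, probabilistic labels.

        # EM-style self-training: label the unlabeled pool with the current
        # model, re-train on labeled + guessed examples, repeat until the
        # guessed labels stop changing.
        def em_self_train(labeled, unlabeled, max_iters=20):
            model = train(labeled)                  # train() is a hypothetical helper
            previous = None
            for _ in range(max_iters):
                guesses = [model.predict(d) for d in unlabeled]   # E-step
                if guesses == previous:             # converged: labels stable
                    break
                previous = guesses
                model = train(labeled + list(zip(unlabeled, guesses)))  # M-step
            return model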

  • Expectation Maximization

    A useful technique for estimating hidden parameters

    In the previous example, the class labels were missing from some documents

    Consists of two steps:

    E-step: Set z^(k+1) = E[z | D; θ^(k)]

    M-step: Set θ^(k+1) = arg max_θ P(θ | D; z^(k+1))

    The above steps are repeated till convergence, and convergence does occur

  • Another example - Fast and Accurate Text Classification via Multiple Linear Discriminant Projections

  • Contd.

    Idea: Find a direction which maximizes the separation between classes.

    Why?

    Reduce noise, or rather, enhance the differences between classes

    The vector corresponding to this direction is the Fisher's discriminant

    Project the data-points onto this vector (see the sketch after this slide)

    For all data-points not separated by this vector, choose another
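
    A minimal sketch of computing the Fisher discriminant for two classes, assuming the standard closed form w = Sw^-1 (mu1 - mu2) with Sw the within-class scatter matrix; the data points here are made up for illustration.

        # Fisher's linear discriminant: the direction w that maximizes
        # between-class separation relative to within-class scatter.
        import numpy as np

        X1 = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 1.0]])   # class 1
        X2 = np.array([[4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])   # class 2

        mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
        Sw = (np.cov(X1.T, bias=True) * len(X1)
              + np.cov(X2.T, bias=True) * len(X2))            # within-class scatter
        w = np.linalg.solve(Sw, mu1 - mu2)

        # Projections onto w separate the two classes cleanly.
        print(X1 @ w, X2 @ w)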

  • Contd.

    Repeat till all data are now separable

    Note, we are looking at a 2-class case. This easily extends to multiple classes

    Project all the document vectors into the space with these vectors as the basis vectors

    Now, induce a decision tree on this projected representation

    The number of attributes is highly reduced

    Since this representation nicely separates the data points (documents), accuracy increases

  • Web Text Mining

    The WWW is a huge, directed graph, with documents as nodes and hyperlinks as the directed edges

    Apart from the text itself, this graph structure carries a lot of information about the usefulness of the nodes

    For example:

    10 random, average people on the streets say Mr. T. Ache is a good dentist

    5 reputed doctors, including dentists, recommend Mr. P. Killer as a better dentist

    Who would you choose?

  • Kleinberg's HITS

    HITS - Hypertext Induced Topic Selection

    Nodes on the web can be categorized into two types - hubs and authorities

    Authorities are nodes which one refers to for definitive information about a topic

    Hubs point to authorities

    HITS computes the hub and authority scores on a sub-universe of the web

    How does one collect this sub-universe?

  • HITS (contd.)

    The basic steps (sketched below):

    A(u) = Σ H(v), over all v pointing to u

    H(u) = Σ A(v), over all v pointed to by u

    Repeat the above till convergence

    Nodes with high A scores are relevant

    Relevant to what? Can we use this for efficient retrieval for a query?
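
    A minimal sketch of this iteration on a toy graph (node names made up for illustration), renormalizing the scores each round so they converge:

        # HITS power iteration: authorities accumulate hub scores of their
        # in-neighbours; hubs accumulate authority scores of their targets.
        from math import sqrt

        graph = {            # node -> nodes it links to
            "h1": ["a1", "a2"],
            "h2": ["a1"],
            "a1": [],
            "a2": [],
        }
        nodes = list(graph)
        auth = {n: 1.0 for n in nodes}
        hub = {n: 1.0 for n in nodes}

        for _ in range(50):
            auth = {u: sum(hub[v] for v in nodes if u in graph[v]) for u in nodes}
            hub = {u: sum(auth[v] for v in graph[u]) for u in nodes}
            for scores in (auth, hub):
                norm = sqrt(sum(s * s for s in scores.values())) or 1.0
                for n in scores:
                    scores[n] /= norm

        print(max(nodes, key=auth.get))   # a1: pointed to by both hubs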

  • PageRank

    Similar to HITS, but all pages have only one score - a Rank

    R(u) = c Σ (R(v)/Nv), summed over the set of pages v linking to u, where Nv is the number of links in v and c is a scaling factor (< 1)

    The higher the rank of the pages linking to a page, the higher is its own rank!

    To handle "rank sinks" (documents which do not link outside a set of pages), the formula is modified as R(u) = c Σ (R(v)/Nv) + c E(u)

    E(u) is a set of some pages, and acts as a rank source (what kind of pages?)
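
    A minimal sketch of the iteration, assuming a uniform rank source of (1 - c)/N per page - the common damping-factor form; the paper allows other choices of E. The toy graph is made up for illustration.

        # PageRank: each page's rank is a damped sum of the shared-out
        # ranks of the pages linking to it, plus a uniform rank source.
        graph = {            # node -> nodes it links to
            "a": ["b", "c"],
            "b": ["c"],
            "c": ["a"],
        }
        nodes = list(graph)
        c = 0.85
        rank = {n: 1.0 / len(nodes) for n in nodes}

        for _ in range(100):
            rank = {
                u: c * sum(rank[v] / len(graph[v]) for v in nodes if u in graph[v])
                   + (1 - c) / len(nodes)
                for u in nodes
            }

        print(sorted(rank.items(), key=lambda kv: -kv[1]))   # "c" ranks highest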

  • Some more topics which we haven't touched

    Using external dictionaries - WordNet

    Using language-specific techniques

    Computational linguistics

    Use grammar for judging the sense of a query in the information retrieval scenario

    Other interesting techniques:

    Latent Semantic Indexing - Finding the "latent" information in documents using Linear Algebra techniques

  • Some more comments

    Some purists do not consider most of the current activities in the text mining field as real text mining

    For example, see Marti Hearst's write-up, "Untangling Text Data Mining", at
    http://www.sims.berkeley.edu/~hearst/papers/acl99/acl99-tdm.html
  • Some more comments (contd.)

    One example that she mentions:

    stress is associated with migraines

    stress can lead to loss of magnesium

    calcium channel blockers prevent some migraines

    magnesium is a natural calcium channel blocker

    spreading cortical depression (SCD) is implicated in some migraines

    high levels of magnesium inhibit SCD

    migraine patients have high platelet aggregability

    magnesium can suppress platelet aggregability

    The above was inferred from a set of documents, with some human help

  • References

    Data Mining: Concepts and Techniques, by Jiawei Han and Micheline Kamber

    Principles of Data Mining, by David J. Hand et al.

    Text Classification from Labeled and Unlabeled Documents using EM, by Kamal Nigam et al.

    Fast and accurate text classification via multiple linear discriminant projections, by S. Chakrabarti et al.

    Frequent Term-Based Text Clustering, by Florian Beil et al.

    The PageRank Citation Ranking: Bringing Order to the Web, by Lawrence Page and Sergey Brin

    Untangling Text Data Mining, by Marti A. Hearst, http://www.sims.berkeley.edu/~hearst/papers/acl99/acl99-tdm.html

    And others