text mining intro
TRANSCRIPT
-
8/2/2019 Text Mining Intro
An Introduction to Text Mining
Ravindra Jaju
-
10/04/04 Ravindra Jaju
Outline of the presentation
Initiation/Introduction ...
What makes text stand apart from other kinds of data?
Classification
Clustering
Mining on the Web
-
Data Mining
What: Looking for information from usually large amounts of data
Mainly two kinds of activities - Descriptive and Predictive
Example of a descriptive activity - Clustering
Example of a predictive activity - Classification
-
What kind of data is this?
It could be two customers' baskets, containing (milk, bread, butter) and (shaving cream, razor, after-shave lotion) respectively.
Or, it could be two documents - "Java programming language" and "India beat Pakistan"
-
And what kind of data is this?
Data about people, pairs!
-
Data representation
Humans understand data in various forms
Text
Sales figures
Images
Computers understand only numbers
-
Working with data
Most of the mining algorithms work only with numeric data
All data, hence, are represented as numbers so that they can lend themselves to the algorithms
Whether it is sales figures, crime rates, text, or images, one has to find a suitable way to transform the data into numbers.
-
Text Mining - Working with numbers
"Java Programming Language"   "India beat Pakistan"
OR
The transformation to 1's and 0's hides all relationship between Java and Language, and India and Pakistan, which humans can make out (How?)
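The loss the slide describes can be sketched as follows; the vocabulary and index assignment below are illustrative, not from the slides. Each word gets an arbitrary position in a 0/1 vector, so nothing in the encoding relates "Java" to "Language":

```python
# Build a word -> index mapping over the two tiny documents, then
# encode each document as a 0/1 vector. The index numbers are
# arbitrary, so the encoding carries no word-to-word relationships.
docs = ["Java Programming Language", "India beat Pakistan"]

vocab = sorted({word for doc in docs for word in doc.split()})
index = {word: i for i, word in enumerate(vocab)}

def encode(doc):
    vec = [0] * len(vocab)
    for word in doc.split():
        vec[index[word]] = 1
    return vec

vectors = [encode(d) for d in docs]
```

The two vectors share no non-zero component, so as far as the numbers are concerned the documents are totally unrelated.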
-
Text Mining - Working with numbers (contd.)
As we have seen, data transformation (from text/word to some index number in this case) means that there is some information loss
One big challenge in this field today is to find a good data representation for input to the mining algorithms
-
Text Representation Issues
Each word has a dictionary meaning, or meanings
Run - (1) the verb. (2) the noun, in cricket
Cricket - (1) the game. (2) the insect.
Each word is used in various senses
Tendulkar made 100 runs
Because of an injury, Tendulkar can not run and will need a runner between the wickets
Capturing the meaning of sentences is an important issue as well. Grammar, parts of speech, time sense could be easy!
Finding out automatically who the "he" in "He is the President" is, given a document, is hard. And president of? Well ...
-
Text Representation Issues (contd.)
In general, it is hard to capture these features from a text document
One, it is difficult to extract this automatically
Two, even if we did it, it won't scale!
One simplification is to represent documents as a vector of words
We have already seen examples
Each document is represented as a vector, and each component of the vector represents some quantity related to a single word.
-
The Document Vector
"Java Programming Language" (document A)
"India beat Pakistan" (document B)
"India beat Australia" (document C)
What vector operation can you think of to find two similar documents?
How about the dot product?
As we can easily verify, documents B and C will have a higher dot product than any other combination
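The claim is easy to check directly. A minimal sketch, encoding the three documents as 0/1 word vectors and comparing every pair with the dot product:

```python
docs = {
    "A": "Java Programming Language",
    "B": "India beat Pakistan",
    "C": "India beat Australia",
}
vocab = sorted({w for d in docs.values() for w in d.split()})

def vec(doc):
    # 0/1 vector: 1 if the vocabulary word occurs in the document.
    words = doc.split()
    return [1 if w in words else 0 for w in vocab]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

sims = {
    pair: dot(vec(docs[pair[0]]), vec(docs[pair[1]]))
    for pair in ["AB", "AC", "BC"]
}
# B and C share "India" and "beat", so B.C scores 2; the other pairs
# share no words and score 0.
```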
-
More on document similarity
The dot product or cosine between two vectors is a measure of similarity.
Documents about related topics should have higher similarity
[Figure: document vectors plotted in a term space with axes "Indonesia", "Java", and "Language", origin at (0, 0, 0)]
-
Document Similarity (contd.)
How about distance measures?
Cosine similarity measure will not capture the inter-cluster distances!
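The point can be made concrete with two vectors chosen for illustration: cosine similarity ignores vector length, so a short document and a three-times-longer document with the same word proportions look identical, while Euclidean distance still sees them as far apart.

```python
import math

# Two term-count vectors pointing in the same direction: one could be
# a short document, the other the same text repeated three times.
u = [1, 2, 0]
v = [3, 6, 0]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# cosine(u, v) is exactly 1.0 ("identical"); euclidean(u, v) is not 0.
```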
-
Further refinements to the DV representation
Not all words are equally important
the, is, and, to, he, she, it (Why?)
Of course, these words could be important in certain contexts
We have the option of scaling the components of these words, or completely removing them from the corpus
In general, we prefer to remove the stopwords and scale the remaining words
Important words should be scaled upwards, and vice versa
One widely used scaling factor - TF-IDF
TF-IDF stands for the Term Frequency and Inverse Document Frequency product, for a word.
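A minimal TF-IDF sketch over a toy corpus (the documents are illustrative; real systems use many TF and IDF variants, this one uses raw counts and log(N/df)):

```python
import math

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
N = len(docs)

# Document frequency: in how many documents does each word occur?
df = {}
for doc in docs:
    for word in set(doc.split()):
        df[word] = df.get(word, 0) + 1

def tf_idf(doc):
    words = doc.split()
    weights = {}
    for word in set(words):
        tf = words.count(word)          # term frequency in this document
        idf = math.log(N / df[word])    # rare across corpus -> large idf
        weights[word] = tf * idf
    return weights

w = tf_idf(docs[0])
# "the" occurs in 2 of the 3 documents, so its weight is damped
# relative to "cat" or "mat", which occur in only one.
```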
-
Text Mining - Moving Further
Document/Term Clustering
Given a large set, group similar entities
Text Classification
Given a document, find what topic it talks about
Information Retrieval
Search engines
Information Extraction
Question Answering
-
Clustering (Descriptive Activity)
Activity: Group together similar documents
Techniques used
Partitioning
Hierarchical - Agglomerative, Divisive
Grid based
Model based
-
Clustering (contd.)
Partitioning - Divide the input data into k partitions
K-means, K-medoids
Hierarchical clustering
Agglomerative - Each data point is assumed to be a cluster representative
Keep merging similar clusters till we get a single cluster
Divisive - The opposite of agglomerative
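The partitioning approach can be sketched with a bare-bones K-means (Lloyd's algorithm) in plain Python; the points and initial centroids are illustrative, and a real implementation would add random restarts and a convergence test.

```python
def kmeans(points, centroids, iterations=10):
    """Partition `points` into len(centroids) clusters."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(xs) / len(xs) for xs in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return clusters, centroids

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.8)]
clusters, centroids = kmeans(points, centroids=[(0.0, 0.0), (1.0, 1.0)])
# The two tight groups of points end up in separate clusters.
```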
-
Frequent term-based text clustering
Idea
Frequent terms carry more information about the cluster they might belong to
Highly correlated frequent terms probably belong to the same cluster
D = {D1, ..., Dn} is the set of documents, and each Dj ⊆ T, the set of all terms
Then candidate clusters are generated from F = {F1, ..., Fk}, where each Fi is a set of all frequent terms which occur together.
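A toy sketch of the idea (not the actual algorithm of Beil et al.; the corpus and support threshold are invented for illustration): find terms frequent across the corpus, then use the set of frequent terms each document contains as a candidate cluster description.

```python
docs = [
    {"java", "programming", "language"},
    {"java", "code", "programming"},
    {"india", "cricket", "pakistan"},
    {"india", "cricket", "australia"},
]
min_support = 2  # a term is "frequent" if it appears in >= 2 documents

# Count the document support of each term.
support = {}
for doc in docs:
    for term in doc:
        support[term] = support.get(term, 0) + 1
frequent = {t for t, s in support.items() if s >= min_support}

# Candidate clusters: group documents by the co-occurring frequent
# terms they contain.
clusters = {}
for i, doc in enumerate(docs):
    key = frozenset(doc & frequent)
    clusters.setdefault(key, []).append(i)
# {java, programming} describes one cluster, {india, cricket} the other.
```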
-
Classification
The problem statement
Given a set of documents, each with a label called the class label for that document
Given a classifier which learns from the above data set
For a new, unseen document, the classifier should be able to predict with a high degree of accuracy the correct class to which the new document belongs
-
Decision Tree Classifier
A tree
Each node represents some kind of an evaluation for an attribute of the data
Each edge, the decision taken
The evaluation at each node is some kind of an information gain measure
Reduction in entropy - more information gained
Entropy E(X) = -Σi pi log2(pi)
pi represents the probability that the data corresponds to sample i
Each edge represents a choice for the value of the attribute the node represents
Good for text mining. But doesn't scale
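The entropy formula and the information-gain idea can be sketched directly (the 50/50 split is an illustrative example):

```python
import math

def entropy(probs):
    """E(X) = -sum(p_i * log2(p_i)); by convention 0*log(0) = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A 50/50 class split carries 1 bit of uncertainty; a pure node
# carries 0. Information gain of a split = parent entropy minus the
# weighted average entropy of the children.
parent = entropy([0.5, 0.5])
children = 0.5 * entropy([1.0]) + 0.5 * entropy([1.0])
gain = parent - children  # a perfect split gains the full 1 bit
```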
-
Statistical (Bayesian) Classification
For document-class data, we calculate the probabilities of occurrence of events
Bayes' Theorem: P(c|d) = P(c) . P(d|c) / P(d)
Given a document d, the probability that it belongs to a class c is given by the above formula.
In practice, the exact values of the probabilities of each event are unknown, and are estimated from the samples
-
Naïve Bayes Classification
Probability of the document event d: P(d) = P(w1, ..., wn), where the wi are the words
The RHS is generally a headache. We have to consider the inter-dependence of each of the wj events
Naïve Bayes - Assume all the wj events are independent. The RHS expands to Π P(wj)
Most of the Bayesian text classifiers work with this simplification
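A minimal multinomial Naïve Bayes sketch under that independence assumption; the tiny training corpus, class names, and add-one (Laplace) smoothing are illustrative choices, not from the slides.

```python
import math
from collections import Counter, defaultdict

# Tiny labeled corpus: (class label, document).
train = [
    ("sports", "india beat pakistan"),
    ("sports", "tendulkar made runs"),
    ("tech", "java programming language"),
    ("tech", "java code runs fast"),
]

class_docs = defaultdict(list)
for label, doc in train:
    class_docs[label].extend(doc.split())

vocab = {w for _, doc in train for w in doc.split()}
priors = {c: sum(1 for l, _ in train if l == c) / len(train) for c in class_docs}
counts = {c: Counter(words) for c, words in class_docs.items()}

def predict(doc):
    scores = {}
    for c in class_docs:
        # log P(c) + sum of log P(w | c), words assumed independent,
        # with add-one smoothing so unseen words don't zero the product.
        total = sum(counts[c].values())
        score = math.log(priors[c])
        for w in doc.split():
            score += math.log((counts[c][w] + 1) / (total + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)
```

Working in log space avoids underflow when the product runs over thousands of words.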
-
Bayesian Belief Networks
This is an intermediate approach - Not all words are independent
If "java" and "program" occur together, then boost the probability value of class "computer programming"
If "java" and "indonesia" occur together, then the document is more likely about some-other-class
Problem?
How do we come up with correlations like the above?
-
Other classification techniques
Support Vector Machines
Find the best discriminant plane between two classes
k Nearest Neighbour
Association Rule Mining
Neural Networks
Case-based reasoning
-
An example - Text Classification from labeled and unlabeled documents with Expectation Maximization
Problem setting
Labeling documents is a manual process
A lot more unlabeled documents are available as compared to labeled documents
Unlabeled documents contain information which could help in the classification activity
-
An example (contd.)
Train a classifier with the labeled documents
Say, a Naïve Bayes classifier
This classifier estimates the model parameters (the prior probabilities of the various events)
Now, classify the unlabeled documents. Assuming the applied labels to be correct, re-estimate the model parameters
Repeat the above step till convergence
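The loop above can be sketched in simplified form. This uses hard labels and a toy nearest-class-mean classifier on 1-D points, purely for illustration; Nigam et al. use Naïve Bayes with soft (probabilistic) label assignments.

```python
# Semi-supervised self-training sketch: train on labeled points,
# label the unlabeled ones, retrain on everything, repeat.
labeled = {0.0: "low", 1.0: "low", 9.0: "high"}
unlabeled = [0.5, 8.0, 8.5, 10.0]

def train(data):
    # "Model parameters" here are just the per-class means.
    means = {}
    for c in set(data.values()):
        pts = [x for x, l in data.items() if l == c]
        means[c] = sum(pts) / len(pts)
    return means

def classify(means, x):
    return min(means, key=lambda c: abs(x - means[c]))

data = dict(labeled)
for _ in range(5):  # "till convergence" (fixed iteration count here)
    means = train(data)
    data = dict(labeled)
    for x in unlabeled:
        data[x] = classify(means, x)  # assume predicted labels correct
```

After the loop, the unlabeled points near 0-1 carry the "low" label and those near 8-10 carry "high", and the class means have been re-estimated from all the data.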
-
Expectation Maximization
A useful technique for estimating hidden parameters
In the previous example, the class labels were missing from some documents
Consists of two steps
E-step: Set z(k+1) = E[z | D; θ(k)]
M-step: Set θ(k+1) = arg maxθ P(θ | D; z(k+1))
The above steps are repeated till convergence, and convergence does occur
-
Another example - Fast and accurate Text Classification via Multiple Linear Discriminant Projections
-
Contd.
Idea
Find a direction which maximizes the separation between classes.
Why?
Reduce noise, or rather, enhance the differences between classes
The vector corresponding to this direction is Fisher's discriminant
Project the data-points onto this vector
For all data-points not separated by this vector, choose another
-
Contd.
Repeat till all data are separable
Note, we are looking at a 2-class case. This easily extends to multiple classes
Project all the document vectors into the space represented by these vectors as the basis vectors
Now, induce a decision tree on this projected representation
The number of attributes is highly reduced
Since this representation nicely separates the data points (documents), accuracy increases
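Fisher's discriminant direction for the 2-class case can be sketched as w = Sw⁻¹(μ1 − μ2), where Sw is the within-class scatter matrix. The 2-D points below are invented for illustration, and the 2x2 linear algebra is written out in plain Python.

```python
class1 = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.0)]
class2 = [(6.0, 5.0), (7.0, 8.0), (8.0, 7.0)]

def mean(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def scatter(points, mu):
    # Within-class scatter: sum of outer products of the deviations.
    s = [[0.0, 0.0], [0.0, 0.0]]
    for x, y in points:
        dx, dy = x - mu[0], y - mu[1]
        s[0][0] += dx * dx; s[0][1] += dx * dy
        s[1][0] += dy * dx; s[1][1] += dy * dy
    return s

m1, m2 = mean(class1), mean(class2)
s1, s2 = scatter(class1, m1), scatter(class2, m2)
sw = [[s1[i][j] + s2[i][j] for j in range(2)] for i in range(2)]

# Invert the 2x2 scatter matrix and form w = Sw^{-1} (m1 - m2).
det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
inv = [[sw[1][1] / det, -sw[0][1] / det], [-sw[1][0] / det, sw[0][0] / det]]
diff = (m1[0] - m2[0], m1[1] - m2[1])
w = (inv[0][0] * diff[0] + inv[0][1] * diff[1],
     inv[1][0] * diff[0] + inv[1][1] * diff[1])

# Projecting each point onto w separates the two classes.
proj1 = [w[0] * x + w[1] * y for x, y in class1]
proj2 = [w[0] * x + w[1] * y for x, y in class2]
```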
-
Web Text Mining
The WWW is a huge, directed graph, with documents as nodes and hyperlinks as the directed edges
Apart from the text itself, this graph structure carries a lot of information about the usefulness of the nodes
For example
10 random, average people on the streets say Mr. T. Ache is a good dentist
5 reputed doctors, including dentists, recommend Mr. P. Killer as a better dentist
Who would you choose?
-
Kleinberg's HITS
HITS - Hypertext Induced Topic Selection
Nodes on the web can be categorized into two types - hubs and authorities
Authorities are nodes which one refers to for definitive information about a topic
Hubs point to authorities
HITS computes the hub and authority scores on a sub-universe of the web
How does one collect this sub-universe?
-
HITS (contd.)
The basic steps
A(u) = Σ H(v), over all v pointing to u
H(u) = Σ A(v), over all v pointed to by u
Repeat the above till convergence
Nodes with high A scores are relevant
Relevant to what?
Can we use this for efficient retrieval for a query?
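The two update steps can be sketched as a power iteration on a tiny link graph (the graph and node names are invented for illustration; scores are normalized each round so they converge instead of blowing up).

```python
# graph[u] = set of nodes that u links to.
graph = {
    "p1": {"a1", "a2"},
    "p2": {"a1", "a2"},
    "p3": {"a2"},
    "a1": set(),
    "a2": set(),
}

hub = {u: 1.0 for u in graph}
auth = {u: 1.0 for u in graph}

for _ in range(20):
    # Authority score: sum of hub scores of nodes pointing to u.
    auth = {u: sum(hub[v] for v in graph if u in graph[v]) for u in graph}
    # Hub score: sum of authority scores of nodes u points to.
    hub = {u: sum(auth[v] for v in graph[u]) for u in graph}
    # Normalize both score vectors.
    na = sum(x * x for x in auth.values()) ** 0.5
    nh = sum(x * x for x in hub.values()) ** 0.5
    auth = {u: x / na for u, x in auth.items()}
    hub = {u: x / nh for u, x in hub.items()}
# a2 is pointed to by all three hub pages, so it ends up the top authority.
```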
-
PageRank
Similar to HITS, but all pages have only one score - a Rank
R(u) = c Σv (R(v)/Nv)
v ranges over the set of pages linking to u, and Nv is the number of links in v. c is a scaling factor (< 1)
The higher the rank of pages linking to a page, the higher is its own rank!
To handle rank sinks (documents which do not link outside a set of pages), the formula is modified as R'(u) = c Σv (R(v)/Nv) + c E(u)
E(u) is a set of some pages, and acts as a rank source (what kind of pages?)
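The iteration can be sketched on a three-page graph (the graph, the damping value, and the uniform rank source are illustrative choices):

```python
# links[v] = pages that v links to; every page here has out-links,
# so there are no dangling nodes to special-case.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}
pages = list(links)
c = 0.85  # scaling factor; (1 - c) is spread uniformly as a rank source
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(50):
    new = {}
    for u in pages:
        # Sum R(v)/N_v over the pages v that link to u.
        incoming = sum(rank[v] / len(links[v]) for v in pages if u in links[v])
        new[u] = c * incoming + (1 - c) / len(pages)
    rank = new
# "c" has two in-links (one from the high-rank "a"), so it ends up
# with the highest rank; total rank mass stays at 1.
```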
-
Some more topics which we haven't touched
Using external dictionaries
WordNet
Using language-specific techniques
Computational linguistics
Use grammar for judging the sense of a query in the information retrieval scenario
Other interesting techniques
Latent Semantic Indexing - Finding the latent information in documents using Linear Algebra techniques
-
Some more comments
Some purists do not consider most of the current activities in the text mining field as real text mining
For example, see Marti Hearst's write-up, "Untangling Text Data Mining", at
http://www.sims.berkeley.edu/~hearst/papers/acl99/acl99-tdm.html
-
Some more comments (contd.)
One example that she mentions
stress is associated with migraines
stress can lead to loss of magnesium
calcium channel blockers prevent some migraines
magnesium is a natural calcium channel blocker
spreading cortical depression (SCD) is implicated insome migraines
high levels of magnesium inhibit SCD
migraine patients have high platelet aggregability
magnesium can suppress platelet aggregability
The above was inferred from a set of documents, with some human help
-
References
Data Mining: Concepts and Techniques, by Jiawei Han and Micheline Kamber
Principles of Data Mining, by David J. Hand et al.
Text Classification from Labeled and Unlabeled Documents using EM, Kamal Nigam et al.
Fast and accurate text classification via multiple linear discriminant projections, S. Chakrabarti et al.
Frequent Term-Based Text Clustering, Florian Beil et al.
The PageRank Citation Ranking: Bringing Order to the Web, Lawrence Page and Sergey Brin
Untangling Text Data Mining, by Marti A. Hearst, http://www.sims.berkeley.edu/~hearst/papers/acl99/acl99-tdm.html
And others