
  • An Introduction to Text Mining

    Ravindra Jaju

  • Outline of the presentation

    Initiation/Introduction ...

    What makes text stand apart from other kinds of data?

    Classification

    Clustering

    Mining on The Web

  • Data Mining

    What: Looking for information from usually large amounts of data

    Mainly two kinds of activities - Descriptive and Predictive

    Example of a descriptive activity - Clustering

    Example of a predictive activity - Classification

  • What kind of data is this?

    It could be two customers' baskets, containing (milk, bread, butter) and (shaving cream, razor, after-shave lotion) respectively.

    Or, it could be two documents - "Java programming language" and "India beat Pakistan"

  • And what kind of data is this?

    Data about people, pairs!

  • Data representation

    Humans understand data in various forms

    Text

    Sales figures

    Images

    Computers understand only numbers

  • Working with data

    Most of the mining algorithms work only with numeric data

    All data, hence, are represented as numbers so that they can lend themselves to the algorithms

    Whether it is sales figures, crime rates, text, or images, one has to find a suitable way to transform the data into numbers.

  • Text Mining - Working with Numbers

    "Java Programming Language"    "India beat Pakistan"

    [Figure: the two documents encoded as numeric (1/0) vectors]

    The transformation to 1's and 0's hides all relationships between "Java" and "Language", and "India" and "Pakistan", which humans can make out (How?)

  • Text Mining - Working with Numbers (contd.)

    As we have seen, data transformation (from text/word to some index number in this case) means that there is some information loss

    One big challenge in this field today is to find a good data representation for input to the mining algorithms

  • Text Representation Issues

    Each word has a dictionary meaning, or meanings

    Run - (1) the verb, (2) the noun, in cricket

    Cricket - (1) the game, (2) the insect

    Each word is used in various senses

    Tendulkar made 100 runs

    Because of an injury, Tendulkar cannot run and will need a runner between the wickets

    Capturing the meaning of sentences is an important issue as well. Grammar, parts of speech, time sense could be easy!

    Finding out automatically who the "he" in "He is the President" is, given a document, is hard. And president of? Well ...

  • Text Representation Issues (contd.)

    In general, it is hard to capture these features from a text document

    One, it is difficult to extract this automatically

    Two, even if we did it, it won't scale!

    One simplification is to represent documents as a vector of words

    We have already seen examples

    Each document is represented as a vector, and each component of the vector represents some quantity related to a single word.

  • The Document Vector

    "Java Programming Language" (document A)

    "India beat Pakistan" (document B)

    "India beat Australia" (document C)

    What vector operation can you think of to find two similar documents?

    How about the dot product?

    As we can easily verify, documents B and C will have a higher dot product than any other combination (see the sketch below)
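
    A minimal sketch of this idea: build bag-of-words vectors over the combined vocabulary and compare documents with the dot product, and with its length-normalized form, the cosine. The toy documents match those above.

        # Bag-of-words vectors for the three example documents, compared
        # with the dot product and cosine similarity.
        from collections import Counter
        from math import sqrt

        docs = {
            "A": "java programming language",
            "B": "india beat pakistan",
            "C": "india beat australia",
        }
        vocab = sorted({w for text in docs.values() for w in text.split()})

        def to_vector(text):
            counts = Counter(text.split())
            return [counts[w] for w in vocab]

        def dot(u, v):
            return sum(a * b for a, b in zip(u, v))

        def cosine(u, v):
            return dot(u, v) / (sqrt(dot(u, u)) * sqrt(dot(v, v)))

        vec = {name: to_vector(text) for name, text in docs.items()}
        print(dot(vec["B"], vec["C"]))              # 2 ("india" and "beat" shared)
        print(dot(vec["A"], vec["B"]))              # 0 (no shared terms)
        print(round(cosine(vec["B"], vec["C"]), 3)) # 0.667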

  • More on document similarity

    The dot product, or cosine, between two vectors is a measure of similarity.

    Documents about related topics should have higher similarity

    [Figure: document vectors plotted in a 3-D term space with axes "Indonesia", "Java", and "Language", origin at (0, 0, 0)]

  • Document Similarity (contd.)

    How about distance measures?

    The cosine similarity measure will not capture the inter-cluster distances!

  • Further refinements to the DV representation

    Not all words are equally important - the, is, and, to, he, she, it (Why?)

    Of course, these words could be important in certain contexts

    We have the option of scaling the components for these words, or completely removing them from the corpus

    In general, we prefer to remove the stopwords and scale the remaining words

    Important words should be scaled upwards, and vice versa

    One widely used scaling factor - TF-IDF

    TF-IDF stands for the Term Frequency and Inverse Document Frequency product, for a word (sketched below).
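
    A minimal TF-IDF sketch under common conventions (the slide does not fix exact formulas): tf is the raw count of the term in a document, and idf = log(N / df), where df is the number of documents containing the term.

        # TF-IDF scaling: terms that appear everywhere get a low weight,
        # terms concentrated in few documents get a high weight.
        from collections import Counter
        from math import log

        docs = [
            "java programming language".split(),
            "india beat pakistan".split(),
            "india beat australia".split(),
        ]
        n_docs = len(docs)
        df = Counter(term for doc in docs for term in set(doc))

        def tf_idf(term, doc):
            tf = doc.count(term)
            idf = log(n_docs / df[term])
            return tf * idf

        # "india" occurs in 2 of 3 documents, "pakistan" in only 1,
        # so "pakistan" gets the larger weight in document B.
        print(tf_idf("india", docs[1]), tf_idf("pakistan", docs[1]))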

  • Text Mining - Moving Further

    Document/Term Clustering

    Given a large set, group similar entities

    Text Classification

    Given a document, find what topic it talks about

    Information Retrieval

    Search engines

    Information Extraction

    Question Answering

  • Clustering (Descriptive Activity)

    Activity: Group together similar documents

    Techniques used:

    Partitioning

    Hierarchical - Agglomerative, Divisive

    Grid-based

    Model-based

  • Clustering (contd.)

    Partitioning - Divide the input data into k partitions

    K-means, K-medoids (a small K-means sketch follows this slide)

    Hierarchical clustering

    Agglomerative - Each data point is assumed to be a cluster representative; keep merging similar clusters till we get a single cluster

    Divisive - The opposite of agglomerative
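
    A minimal K-means sketch, assuming plain numeric vectors and squared Euclidean distance; a production version would add better seeding (e.g. k-means++) and empty-cluster handling.

        # K-means: alternate between assigning points to the nearest
        # centroid and moving each centroid to the mean of its cluster.
        import random

        def kmeans(points, k, iters=100, seed=0):
            centroids = random.Random(seed).sample(points, k)
            for _ in range(iters):
                clusters = [[] for _ in range(k)]
                for p in points:
                    i = min(range(k),
                            key=lambda c: sum((a - b) ** 2
                                              for a, b in zip(p, centroids[c])))
                    clusters[i].append(p)
                new = [[sum(d) / len(cl) for d in zip(*cl)] if cl else centroids[i]
                       for i, cl in enumerate(clusters)]
                if new == centroids:   # assignments stable: converged
                    break
                centroids = new
            return centroids, clusters

        pts = [[0, 0], [0, 1], [10, 10], [10, 11]]
        print(kmeans(pts, k=2)[0])     # centroids near [0, 0.5] and [10, 10.5]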

  • Frequent term-based text clustering

    Idea:

    Frequent terms carry more information about the cluster they might belong to

    Highly correlated frequent terms probably belong to the same cluster

    D = {D1, ..., Dn} is the set of documents, with each Dj ⊆ T, the set of all terms

    Then candidate clusters are generated from F = {F1, ..., Fk}, where each Fi is a set of all frequent terms which occur together (a small sketch follows this slide).
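
    A simplified sketch of generating such candidates, assuming a term set is "frequent" when at least min_support documents contain all of its terms; real implementations (e.g. Beil et al.) use Apriori-style pruning rather than brute-force enumeration.

        # Candidate clusters from frequent co-occurring term sets: each
        # frequent set Fi comes with the documents it covers.
        from itertools import combinations

        docs = [
            {"java", "programming", "language"},
            {"java", "program", "code"},
            {"india", "beat", "pakistan"},
            {"india", "beat", "australia"},
        ]
        min_support = 2

        terms = sorted(set().union(*docs))
        frequent = []
        for size in (1, 2):
            for fs in combinations(terms, size):
                cover = [i for i, d in enumerate(docs) if set(fs) <= d]
                if len(cover) >= min_support:
                    frequent.append((fs, cover))

        print(frequent)   # e.g. ("beat", "india") covers documents 2 and 3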

  • Classification

    The problem statement:

    Given a set of documents, each with a label called the class label for that document

    Given a classifier which "learns" from the above data set

    For a new, unseen document, the classifier should be able to predict with a high degree of accuracy the correct class to which the new document belongs

  • Decision Tree Classifier

    A tree - Each node represents some kind of an evaluation for an attribute of the data

    Each edge, the decision taken

    The evaluation at each node is some kind of an "information gain" measure

    Reduction in entropy = more information gained (worked example below)

    Entropy: E(X) = -Σ p_i log2(p_i), where p_i represents the probability that the data corresponds to sample i

    Each edge represents a choice for the value of the attribute the node represents

    Good for text mining. But doesn't scale!
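
    A small worked example of the entropy measure and the information gain from a split:

        # Entropy of a class distribution, and the information gained
        # when a split yields two purer partitions.
        from math import log2

        def entropy(probs):
            return -sum(p * log2(p) for p in probs if p > 0)

        before = entropy([0.5, 0.5])     # 1.0 bit: maximum uncertainty for 2 classes
        after = 0.5 * entropy([0.9, 0.1]) + 0.5 * entropy([0.1, 0.9])
        print(before - after)            # gain > 0: the split is informative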

  • Statistical (Bayesian) Classification

    For document-class data, we calculate the probabilities of occurrence of events

    Bayes' Theorem: P(c|d) = P(c) . P(d|c) / P(d)

    Given a document d, the probability that it belongs to a class c is given by the above formula.

    In practice, the exact values of the probabilities of each event are unknown, and are estimated from the samples

  • Naïve Bayes Classification

    Probability of the document event d: P(d) = P(w1, ..., wn), where the wi are the words

    The RHS is generally a headache. We have to consider the inter-dependence of each of the wj events

    Naïve Bayes - Assume all the wj events are independent. The RHS then expands to the product Π P(wj)

    Most of the Bayesian text classifiers work with this simplification (a sketch follows)
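
    A minimal multinomial Naïve Bayes sketch using this simplification, with add-one (Laplace) smoothing for unseen words - an assumption beyond the slide, but standard - and log-probabilities to avoid underflow on long documents.

        # Naive Bayes: score(c) = log P(c) + sum of log P(w|c) over the words.
        from collections import Counter, defaultdict
        from math import log

        train = [
            ("java programming language", "computing"),
            ("python programming", "computing"),
            ("india beat pakistan", "cricket"),
            ("india beat australia", "cricket"),
        ]

        class_docs = defaultdict(int)
        word_counts = defaultdict(Counter)
        for text, label in train:
            class_docs[label] += 1
            word_counts[label].update(text.split())

        vocab = {w for c in word_counts.values() for w in c}
        n_train = len(train)

        def predict(text):
            def score(label):
                total = sum(word_counts[label].values())
                s = log(class_docs[label] / n_train)        # prior P(c)
                for w in text.split():                      # likelihood P(d|c)
                    s += log((word_counts[label][w] + 1) / (total + len(vocab)))
                return s
            return max(class_docs, key=score)

        print(predict("programming in java"))   # -> computing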

  • Bayesian Belief Networks

    This is an intermediate approach

    Not all words are independent

    If "java" and "program" occur together, then boost the probability value of the class "computer programming"

    If "java" and "indonesia" occur together, then the document is more likely about some-other-class

    Problem? How do we come up with correlations like the above?

  • Other classification techniques

    Support Vector Machines

    Find the best discriminant plane between two classes

    k Nearest Neighbour

    Association Rule Mining

    Neural Networks

    Case-based reasoning

  • An example - Text Classification from Labeled and Unlabeled Documents with Expectation Maximization

    Problem setting:

    Labeling documents is a manual process

    A lot more unlabeled documents are available as compared to labeled documents

    Unlabeled documents contain information which could help in the classification activity

  • An example (contd.)

    Train a classifier with the labeled documents

    Say, a Naïve Bayes classifier

    This classifier estimates the model parameters (the prior probabilities of the various events)

    Now, classify the unlabeled documents. Assuming the applied labels to be correct, re-estimate the model parameters

    Repeat the above step till convergence (sketched below)
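
    A sketch of this loop, assuming hypothetical helpers train(examples) -> model and model.predict(doc) (for instance, built on the Naïve Bayes sketch earlier). Hard labels are used here for brevity; Nigam et al. actually re-estimate with soft, probabilistic labels.

        # EM-style self-training: label the unlabeled pool with the current
        # model, re-train on labeled + guessed examples, repeat until the
        # guessed labels stop changing.
        def em_self_train(labeled, unlabeled, max_iters=20):
            model = train(labeled)                  # train() is a hypothetical helper
            previous = None
            for _ in range(max_iters):
                guesses = [model.predict(d) for d in unlabeled]   # E-step
                if guesses == previous:             # converged: labels stable
                    break
                previous = guesses
                model = train(labeled + list(zip(unlabeled, guesses)))  # M-step
            return model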

  • Expectation Maximization

    A useful technique for estimating hidden parameters

    In the previous example, the class labels were missing from some documents

    Consists of two steps:

    E-step: Set z^(k+1) = E[z | D; θ^(k)]

    M-step: Set θ^(k+1) = arg max_θ P(θ | D; z^(k+1))

    The above steps are repeated till convergence, and convergence does occur

  • Another example - Fast and Accurate Text Classification via Multiple Linear Discriminant Projections

  • Contd.

    Idea: Find a direction which maximizes the separation between classes.

    Why?

    Reduce noise, or rather, enhance the differences between classes

    The vector corresponding to this direction is the Fisher's discriminant

    Project the data-points onto this vector (see the sketch after this slide)

    For all data-points not separated by this vector, choose another
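
    A minimal sketch of computing the Fisher discriminant for two classes, assuming the standard closed form w = Sw^-1 (mu1 - mu2) with Sw the within-class scatter matrix; the data points here are made up for illustration.

        # Fisher's linear discriminant: the direction w that maximizes
        # between-class separation relative to within-class scatter.
        import numpy as np

        X1 = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 1.0]])   # class 1
        X2 = np.array([[4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])   # class 2

        mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
        Sw = (np.cov(X1.T, bias=True) * len(X1)
              + np.cov(X2.T, bias=True) * len(X2))            # within-class scatter
        w = np.linalg.solve(Sw, mu1 - mu2)

        # Projections onto w separate the two classes cleanly.
        print(X1 @ w, X2 @ w)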

  • Contd.

    Repeat till all data are now separable

    Note, we are looking at a 2-class case. This easily extends to multiple classes

    Project all the document vectors into the space with these vectors as the basis vectors

    Now, induce a decision tree on this projected representation

    The number of attributes is highly reduced

    Since this representation nicely separates the data points (documents), accuracy increases

  • Web Text Mining

    The WWW is a huge, directed graph, with documents as nodes and hyperlinks as the directed edges

    Apart from the text itself, this graph structure carries a lot of information about the usefulness of the nodes

    For example:

    10 random, average people on the streets say Mr. T. Ache is a good dentist

    5 reputed doctors, including dentists, recommend Mr. P. Killer as a better dentist

    Who would you choose?

  • Kleinberg's HITS

    HITS - Hypertext Induced Topic Selection

    Nodes on the web can be categorized into two types - hubs and authorities

    Authorities are nodes which one refers to for definitive information about a topic

    Hubs point to authorities

    HITS computes the hub and authority scores on a sub-universe of the web

    How does one collect this sub-universe?

  • HITS (contd.)

    The basic steps (sketched below):

    A(u) = Σ H(v), over all v pointing to u

    H(u) = Σ A(v), over all v pointed to by u

    Repeat the above till convergence

    Nodes with high A scores are relevant

    Relevant to what? Can we use this for efficient retrieval for a query?
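
    A minimal sketch of this iteration on a toy graph (node names made up for illustration), renormalizing the scores each round so they converge:

        # HITS power iteration: authorities accumulate hub scores of their
        # in-neighbours; hubs accumulate authority scores of their targets.
        from math import sqrt

        graph = {            # node -> nodes it links to
            "h1": ["a1", "a2"],
            "h2": ["a1"],
            "a1": [],
            "a2": [],
        }
        nodes = list(graph)
        auth = {n: 1.0 for n in nodes}
        hub = {n: 1.0 for n in nodes}

        for _ in range(50):
            auth = {u: sum(hub[v] for v in nodes if u in graph[v]) for u in nodes}
            hub = {u: sum(auth[v] for v in graph[u]) for u in nodes}
            for scores in (auth, hub):
                norm = sqrt(sum(s * s for s in scores.values())) or 1.0
                for n in scores:
                    scores[n] /= norm

        print(max(nodes, key=auth.get))   # a1: pointed to by both hubs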

  • PageRank

    Similar to HITS, but all pages have only one score - a Rank

    R(u) = c Σ (R(v)/Nv), summed over the set of pages v linking to u, where Nv is the number of links in v and c is a scaling factor (< 1)

    The higher the rank of the pages linking to a page, the higher is its own rank!

    To handle "rank sinks" (documents which do not link outside a set of pages), the formula is modified as R(u) = c Σ (R(v)/Nv) + c E(u)

    E(u) is a set of some pages, and acts as a rank source (what kind of pages?)
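
    A minimal sketch of the iteration, assuming a uniform rank source of (1 - c)/N per page - the common damping-factor form; the paper allows other choices of E. The toy graph is made up for illustration.

        # PageRank: each page's rank is a damped sum of the shared-out
        # ranks of the pages linking to it, plus a uniform rank source.
        graph = {            # node -> nodes it links to
            "a": ["b", "c"],
            "b": ["c"],
            "c": ["a"],
        }
        nodes = list(graph)
        c = 0.85
        rank = {n: 1.0 / len(nodes) for n in nodes}

        for _ in range(100):
            rank = {
                u: c * sum(rank[v] / len(graph[v]) for v in nodes if u in graph[v])
                   + (1 - c) / len(nodes)
                for u in nodes
            }

        print(sorted(rank.items(), key=lambda kv: -kv[1]))   # "c" ranks highest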

  • Some more topics which we haven't touched

    Using external dictionaries - WordNet

    Using language-specific techniques

    Computational linguistics

    Use grammar for judging the sense of a query in the information retrieval scenario

    Other interesting techniques:

    Latent Semantic Indexing - Finding the "latent" information in documents using Linear Algebra techniques

  • Some more comments

    Some purists do not consider most of the current activities in the text mining field as real text mining

    For example, see Marti Hearst's write-up, "Untangling Text Data Mining", at
    http://www.sims.berkeley.edu/~hearst/papers/acl99/acl99-tdm.html
  • Some more comments (contd.)

    One example that she mentions:

    stress is associated with migraines

    stress can lead to loss of magnesium

    calcium channel blockers prevent some migraines

    magnesium is a natural calcium channel blocker

    spreading cortical depression (SCD) is implicated in some migraines

    high levels of magnesium inhibit SCD

    migraine patients have high platelet aggregability

    magnesium can suppress platelet aggregability

    The above was inferred from a set of documents, with some human help

  • References

    Data Mining: Concepts and Techniques, by Jiawei Han and Micheline Kamber

    Principles of Data Mining, by David J. Hand et al.

    Text Classification from Labeled and Unlabeled Documents using EM, by Kamal Nigam et al.

    Fast and accurate text classification via multiple linear discriminant projections, by S. Chakrabarti et al.

    Frequent Term-Based Text Clustering, by Florian Beil et al.

    The PageRank Citation Ranking: Bringing Order to the Web, by Lawrence Page and Sergey Brin

    Untangling Text Data Mining, by Marti A. Hearst, http://www.sims.berkeley.edu/~hearst/papers/acl99/acl99-tdm.html

    And others