Introduction to Text Mining
Jasminka Dobša Faculty of Organization and Informatics
University of Zagreb
March 16th, 2015, Karl-Franzens Uni Graz
Outline
• Definition of text mining
• Vector space model or Bag-of-words representation
  Representation of documents and queries in VSM
  Measure of similarity between document and query
  Example
  Preprocessing of text documents
• Clustering of text documents
• Classification of text documents
  Definition of the classification task
  Methods for classification of text documents
  Measures of evaluation of classification
Definition and basic concepts
• The discipline of text mining is an integral part of the wider discipline of data mining
• It is content-based processing of unstructured text documents and extraction of useful information from them
• Content-based treatment: treatment based only on the content of the documents, and not on metadata such as date of publication, source of publication, type of document or similar
• Basic text mining tasks:
  Information retrieval
  Automatic classification of text documents
Information retrieval: definition and basic concepts
• The problem: determining the set of documents from a collection that are relevant to a particular user query
• Assessment of the relevance of documents is based on a comparison of the index terms that appear in the texts of the documents and in the text of the query
• Index term: basically any word or group of words that appears in the text
• When the text is represented by a set of index terms, much of the semantics contained in the text is lost
• Information about the interconnectedness of index terms is lost
Labels and basic models
• Labels:
  dj – document of the collection of documents
  q – query
  ai, aj – vector representations of documents
  q – vector representation of the query
• Basic models:
  Probabilistic model
  Boolean model
  Vector model
Characteristics of the models
• In the probabilistic model, the set of rules by which representations of documents and queries are modeled is based on the theory of probability
• In the Boolean model, documents and queries are represented as sets of index terms
• In the vector model, documents and queries are represented as multidimensional vectors → such a model is algebraic
Vector space model
• The vector space model is today one of the most popular models for information retrieval
• It was developed by Gerard Salton and associates within the search system SMART (System for the Mechanical Analysis and Retrieval of Text)
• In this model the similarity between documents and queries can take real values in the interval [0,1]
• Index terms are assigned weights
• Documents are ranked according to their degree of similarity with the query
• The representation in the vector space model is called the Bag-of-words representation
• The vector space model assumes independence of index terms
Vector space model
• Notation:
  T = \{ t_1, t_2, \ldots, t_m \} – set of index terms
  D = \{ d_1, d_2, \ldots, d_n \} – set of documents
  aij – weight associated with the pair (ti, dj), i=1,2,...,m; j=1,2,...,n
  qi, i=1,2,...,m – weights associated with the pair (ti, q), where q is the query
• In the vector space model the document dj, j=1,2,...,n is represented by the vector aj = (a1j, a2j, ..., amj)^T
• The query is represented by the vector q = (q1, q2, ..., qm)^T
• The collection of documents is represented by the term-document matrix
Term-document matrix
• The terms are axes of the space
• Documents and queries are points or vectors in the space
• The space is high-dimensional: its dimension corresponds to the number of index terms
• Documents usually contain only a few index terms from the list of index terms
• The vectors representing documents are therefore sparse: most coordinates are equal to 0
A = \begin{pmatrix}
      a_{11} & a_{12} & \cdots & a_{1n} \\
      a_{21} & a_{22} & \cdots & a_{2n} \\
      \vdots & \vdots &        & \vdots \\
      a_{m1} & a_{m2} & \cdots & a_{mn}
    \end{pmatrix}

Rows represent terms t_1, \ldots, t_m; columns represent documents d_1, \ldots, d_n
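As a toy illustration, the following sketch builds such a matrix of raw term counts from a tiny invented collection; the documents and all names in it are made up for the example.

```python
# Minimal sketch: a term-document matrix of raw counts for a toy collection.
docs = ["survey of text mining",
        "matrix analysis and applied linear algebra"]
terms = sorted({w for d in docs for w in d.split()})

# A[i][j] = number of occurrences of term i in document j
A = [[d.split().count(t) for d in docs] for t in terms]

for t, row in zip(terms, A):
    print(f"{t:10s} {row}")   # most entries are 0: the matrix is sparse
```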
Measure of similarity
• The similarity between the document and the query is measured by the angle between their representations
• The measure of similarity between document and query is the cosine of the angle between them
\cos(a_j, q) = \frac{a_j^T q}{\|a_j\|_2 \, \|q\|_2}
Measure of similarity
• Vector representations of documents are usually normalized so that their Euclidean norm is 1: \|a_j\|_2 = 1
• The norm of the representation of the query, \|q\|_2, does not affect the ranking of documents by relevance
• The similarity between document and query is therefore usually measured simply by the dot product in the numerator
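A small sketch of the cosine measure under these conventions; the vectors below are invented toy values.

```python
import math

def cosine(a, q):
    # cos(a, q) = (a^T q) / (||a||_2 * ||q||_2)
    dot = sum(x * y for x, y in zip(a, q))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_q = math.sqrt(sum(y * y for y in q))
    return dot / (norm_a * norm_q) if norm_a and norm_q else 0.0

a_j = [0.5, 0.5, 0.7071, 0.0]   # toy normalized document vector
q   = [1.0, 0.0, 1.0, 0.0]      # toy query vector
print(cosine(a_j, q))

# With pre-normalized document vectors, ranking by the plain dot product
# gives the same order as ranking by the cosine.
```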
The process of assigning weights to index terms
• Insisting on high recall reduces the precision of retrieval, and vice versa
• Therefore, several components for assigning weights to index terms are introduced (see the sketch below)
• Components:
  Local – depends on the frequency of occurrence of the term in a particular document (term frequency – TF component)
  Global – the factor for an index term depends inversely on the number of documents in which the term appears (inverse document frequency – IDF component)
  Normalization – vectors representing documents are divided by the Euclidean norm of the representation (this avoids longer documents being returned as relevant more often)
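As a sketch, one common realization of these three components (raw TF, logarithmic IDF, Euclidean length normalization); the bxx and tfn weights used in the example later may differ in detail.

```python
import math

def tfidf_vectors(docs):
    """TF x IDF with Euclidean normalization; one common variant of the scheme."""
    terms = sorted({w for d in docs for w in d})
    n = len(docs)
    df = {t: sum(t in d for d in docs) for t in terms}   # document frequency
    idf = {t: math.log(n / df[t]) for t in terms}        # global component
    vectors = []
    for d in docs:
        v = [d.count(t) * idf[t] for t in terms]         # local TF x global IDF
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        vectors.append([x / norm for x in v])            # length normalization
    return terms, vectors

docs = [["text", "mining", "text"], ["linear", "algebra"], ["data", "mining"]]
terms, vecs = tfidf_vectors(docs)
```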
Advantages and disadvantages of the vector space model
• Advantages
  The cosine similarity measure between documents and queries enables ranking of documents according to their importance for a query
  The scheme for assigning weights to index terms increases the efficiency of information retrieval
• Disadvantage
  The assumption of independence of index terms (questionable!)
Example
• A collection of 15 documents (titles of books)
  9 from the field of data mining
  5 from the field of linear algebra
  One that combines these two areas (application of linear algebra to data mining)
• A list of words is formed as follows:
  From the words contained in at least two documents
  Words on the list of stop words are discarded
  Words are reduced to their basic forms, i.e. variations of a word are mapped to its basic version
Documents 1/2
D1 Survey of text mining: clustering, classification, and retrieval
D2 Automatic text processing: the transformation, analysis, and retrieval of information by computer
D3 Elementary linear algebra: A matrix approach
D4 Matrix algebra & its applications to statistics and econometrics
D5 Effective databases for text & document management
D6 Matrices, vector spaces, and information retrieval
D7 Matrix analysis and applied linear algebra
D8 Topological vector spaces and algebras
Documents 2/2
D9 Information retrieval: data structures & algorithms
D10 Vector spaces and algebras for chemistry and physics
D11 Classification, clustering and data analysis
D12 Clustering of large data sets
D13 Clustering algorithms
D14 Document warehousing and text mining: techniques for improving business operations, marketing and sales
D15 Data mining and knowledge discovery
Terms

Data mining terms   Linear algebra terms   Neutral terms
Text                Linear                 Analysis
Mining              Algebra                Application
Clustering          Matrix                 Algorithm
Classification      Vector
Retrieval           Space
Information
Document
Data

The terms are mapped to their basic versions, e.g. Applications, Applied → Application
Queries
• Q1: Data mining
  Relevant documents: all data mining documents
• Q2: Using linear algebra for data mining
  Relevant document: D6
Search results for weights bxx and tfn

              bxx               tfn
Document    Q1       Q2       Q1       Q2
D1          0.4472   0.4472   0.3607   0.2424
D2          0        0        0        0
D3          0        1.1547   0        0.6417
D4          0        0.5774   0        0.1471
D5          0        0        0        0
D6          0        0        0        0
D7          0        0.8944   0        0.4598
D8          0        0.5774   0        0.1654
D9          0.5      0.5      0.2634   0.1771
D10         0        0.5774   0        0.1654
D11         0.5      0.5      0.2634   0.1770
D12         0.7071   0.7071   0.4488   0.3016
D13         0        0        0        0
D14         0.5774   0.5774   0.4092   0.2750
D15         1.4142   1.4142   1.0000   0.6720
Comments on search results
• Assume that the search returns all documents whose similarity with the query is greater than 0
• For the first query (Data mining)
  All documents returned by the search are relevant
  Not all relevant documents are returned
  Only documents containing the terms contained in the query are returned – the search is lexical, not semantic
• For the second query (Using linear algebra for data mining)
  None of the returned documents is relevant
  The only relevant document is not returned by the search
• In this example, search using the bxx and tfn weights did not give significantly different results
Preprocessing of text
• Index term
  Any word that appears in the collection of documents
  A sequence of several words that appears in the texts of documents
• Preprocessing of text includes a number of procedures that prepare the text for processing and increase the efficiency of the processing itself
• Some of the procedures are as follows:
  Lexical processing
  Removal of terms such as conjunctions, prepositions and similar words that have low discriminatory value in information retrieval, the so-called stop words
  Transformation of words to their root form (stemming) or to their basic form (lemmatization)
• Lexical processing consists of removing punctuation, removing brackets, and converting letters to lower case
• Reducing words to their root form consists of removing suffixes and prefixes
• Alternative forms of a word are usually obtained by adding suffixes
• Most algorithms for reducing words to their stem form are therefore limited to removing suffixes
• The best known algorithm for reducing words to root form is Porter's algorithm, developed for the English language
• In addition to index terms consisting of a single word, phrases that occur frequently in the text, e.g. New York, can also be identified
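A minimal preprocessing sketch; the stop-word list and the suffix rules below are tiny invented stand-ins for a real stop-word list and for Porter's algorithm.

```python
import string

STOP_WORDS = {"the", "of", "and", "a", "for", "in", "by", "to"}  # toy list
SUFFIXES = ("ing", "es", "s")   # crude stand-in for Porter's suffix rules

def preprocess(text):
    # Lexical processing: lower-casing and punctuation removal
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = [w for w in text.split() if w not in STOP_WORDS]  # drop stop words
    stems = []
    for w in tokens:                # very crude stemming: strip one suffix
        for suf in SUFFIXES:
            if w.endswith(suf) and len(w) > len(suf) + 2:
                w = w[:-len(suf)]
                break
        stems.append(w)
    return stems

print(preprocess("Survey of text mining: clustering, classification, and retrieval"))
```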
CLUSTERING AND CLASSIFICATION OF TEXT DOCUMENTS
Clustering and classification of documents
• When organizing large collections of documents it is useful to use methods of clustering and automatic classification of documents
• Clustering is an unsupervised method, while classification is a supervised method
• Unsupervised methods do not require a labeled training set on which the classes are learned, while supervised methods require such a set
Clustering and classification
• Clustering:
  The algorithm groups data on the basis of their representations
  Some algorithms require the number of groups to be defined in advance
• Classification:
  The algorithm learns the characteristics of predefined classes from training examples
  Training examples have class labels assigned (manually)
Division
• Clustering methods are divided into:
  Hierarchical
    Agglomerative (bottom up)
      They begin with individual objects
      Groups are joined in accordance with certain criteria of similarity or closeness between groups (linkage methods)
      They end when all objects are joined into one group
    Divisive (top down)
      They start with the whole set of objects – one group
      Groups are split according to criteria of similarity between groups
      They end when each object is in a separate group
  Non-hierarchical (partitional)
Hierarchical methods
The result of clustering is displayed graphically by a dendrogram
K-means method
• The number of groups is set in advance
• It is not necessary to compute the similarity matrix (the matrix of distances between objects), so the method is appropriate for clustering a large number of objects
• The k-means method is the most popular partitional method
• Each p-dimensional vector is assigned to the group with the closest central vector (centroid)
K- means method: algorithm
1. K initial centroids of the groups are chosen
2. Each object is assigned to the group whose centroid is closest (restructuring of the groups)
3. Centroids of the new groups are computed
4. Steps 2 and 3 are repeated until the distance between the centroids of two consecutive iterations is less than a given value, or until the contents of the groups no longer change
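A compact sketch of these four steps for low-dimensional points; taking the first K objects as initial centroids is just one of several common choices.

```python
import math

def kmeans(points, k, iters=100):
    centroids = list(points[:k])              # step 1: choose initial centroids
    groups = []
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:                      # step 2: attach each object to the
            i = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            groups[i].append(p)               #         group with closest centroid
        new = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[i]
               for i, g in enumerate(groups)] # step 3: compute new centroids
        if new == centroids:                  # step 4: stop when groups stabilize
            break
        centroids = new
    return centroids, groups

print(kmeans([(1, 1), (1, 2), (8, 8), (9, 8), (8, 9)], 2))
```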
Example: k-means method (K=2)
[Figure: iterations of k-means – choice of initial centroids; restructuring of the groups; computation of centroids; restructuring of the groups; computation of centroids; restructuring of the groups; convergence.]
The Yahoo! hierarchy is not created by clustering, but it looks like what we would like to obtain by clustering
[Figure: excerpt of the Yahoo! directory at www.yahoo.com/Science – top-level categories such as agriculture, biology, physics, CS, and space, with subcategories such as agronomy, forestry, dairy, crops, botany, evolution, cell, magnetism, relativity, AI, HCI, courses, and missions.]
Source: course Information Retrieval and Web Search, Chris Manning, Pandu Nayak, Prabhakar Raghavan, Stanford University
The purpose of clustering of documents
• Clustering enables better performance of information retrieval
• It can mitigate the problem of homonyms (words with multiple meanings)
• E.g. for a query consisting of the word virus:
  Documents related to viruses (living organisms) and documents related to computer viruses will both be retrieved
  By clustering these documents into two groups we can choose the group that interests us, and thus avoid numerous search results that are not relevant to us
Automatic classification or categorization
• Automatic classification or categorization is the process of assigning labels of predefined classes to unstructured text documents
• Each document can be associated with one or more classes (single-label or multi-label text categorization)
• If a document can be assigned to more than one class, the classes overlap
• Categorization of documents into k categories (which may overlap) can be reduced to k binary classification problems
• The method defined this way is strict or hard classification
• Another way is to rank a given document dj over the categories c1, c2, ..., ck according to how suitable each category is for document dj
Applications
• Applications of text categorization are numerous:
  Classification of documents by topic, e.g. for forming hierarchical catalogues of web sites
  Classification of documents by genre
  Classification according to the authorship of a document
  Filtering of documents (spam filtering of e-mail messages)
Machine learning techniques for text classification
• In the eighties of the last century, classifiers for automatic classification of documents were constructed by hand, using so-called knowledge engineering techniques
• Such a classifier consists of a set of manually generated rules for each class
• The disadvantage of this approach is that whenever the classes are updated, the rules must be updated as well
• In the nineties of the last century a new approach began to develop, enabled by the discipline of machine learning
• Machine learning makes it possible to generate classifiers automatically, by learning the characteristics of classes from training examples
Machine learning techniques for text classification
• The process of classification is an example of supervised learning, because the learning process is conducted, i.e. supervised, through training examples
• When classes are updated, the expert's help can consist only of labeling new training examples
• Classifiers generated by machine learning techniques typically show better results on text classification than manually generated classifiers
Training set, validation set and test set
• When automatic categorization is used, the initial set of documents Ω is divided into:
  Set Tv – a collection of documents for learning, from which, if necessary, a smaller set of documents used for validation (tuning of model parameters) is set aside
  Set Te – a collection of documents for testing
• Usually the set of documents for learning is larger; a common ratio is 2:1
• This basic principle is called the train-and-test principle
• Another possible principle is cross-validation
Cross-validation
• K different classifiers Φ1, ..., Φk are formed
• These classifiers operate on the sets {TVi = Ω − Tei, Tei}
• The efficiency of classification is calculated for each of the classifiers Φ1, ..., Φk and the average value is taken (see the sketch below)
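A sketch of this scheme; train_and_evaluate is a hypothetical callback, assumed here, that fits a classifier Φi on TVi and returns its efficiency on Tei.

```python
def cross_validate(documents, labels, k, train_and_evaluate):
    """Fold i serves as Te_i; the remaining documents form TV_i = Ω - Te_i."""
    n = len(documents)
    folds = [list(range(i, n, k)) for i in range(k)]
    scores = []
    for test_idx in folds:
        test = set(test_idx)
        train_idx = [i for i in range(n) if i not in test]
        scores.append(train_and_evaluate(
            [documents[i] for i in train_idx], [labels[i] for i in train_idx],
            [documents[i] for i in test_idx], [labels[i] for i in test_idx]))
    return sum(scores) / k   # average efficiency over Φ1, ..., Φk
```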
Example (Manning, Nayak, Raghavan)
• Classification of Web pages of computer science departments into six classes: Student, Faculty, Course, Project, Person, Department
• Learning on about 5000 Web pages of four universities: Cornell, Washington, U. Texas, Wisconsin
• Classification of the web pages is conducted using the Naïve Bayes algorithm
• Results: [results table not recoverable from the transcript]
Documents in vector space (example: Manning, Nayak, Raghavan)

[Figure: documents as points in vector space, grouped into the classes Politics, Science, and Art.]
In which class is the test document?

[Figure: a test document among the classes Politics, Science, and Art.]
The document is in the class Politics

[Figure: the test document inside the Politics region. Aim: to find the boundaries between classes.]
Types of classifiers
• Some of the classifiers are:
  Probabilistic classifiers – the Naïve Bayes classifier
  The Rocchio classifier
  The k-nearest neighbors (kNN) classifier
  Support vector machines
• For a given document, the Naïve Bayes classifier ranks the classes based on the probability that the document belongs to each class
• The naivety refers to the assumed independence of index terms (it turned out that this is not too naive)
Rocchio classifier
• A simple representation is computed for each class: the centroid of the class
• A test document is attached to the class with the closest centroid
• This approach does not guarantee that the classification is consistent with the training documents

\mu_i = \frac{1}{|c_i|} \sum_{d_j \in c_i} d_j

where |c_i| is the number of documents in the class and the sum runs over the documents d_j in the class
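A short sketch of the centroid computation and the closest-centroid rule above; all names are illustrative.

```python
import math
from collections import defaultdict

def train_rocchio(vectors, labels):
    """mu_i = (1 / |c_i|) * sum of the document vectors in class c_i."""
    by_class = defaultdict(list)
    for v, c in zip(vectors, labels):
        by_class[c].append(v)
    return {c: [sum(col) / len(vs) for col in zip(*vs)]
            for c, vs in by_class.items()}

def rocchio_classify(vector, centroids):
    # attach the test document to the class with the closest centroid
    return min(centroids, key=lambda c: math.dist(vector, centroids[c]))
```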
Rocchio classifier: in which class is this document?

[Figure: classes Politics, Science, and Art with centroids \mu_1, \mu_2, \mu_3 and a test document.]
kNN classifier
• This algorithm belongs to a group of algorithms that learn directly from examples – instance-based learning
• The learning phase consists simply of storing the training examples
• During the test phase the most similar examples are retrieved from memory, and only those examples are used to classify the given example
• A classifier for the entire domain is never formed; a local classifier is generated for each new test instance
• Such a classifier is called lazy
kNN classifier
• In order to classify a test example it is necessary to determine its nearest neighbors – the documents from the training set that are closest to the test document
• Usually an odd number (3, 5, 7) of closest training documents is chosen, in order to avoid a tie between two classes
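A sketch of the kNN decision: retrieve the k most similar training documents by cosine similarity and let them vote; all names are illustrative.

```python
import math
from collections import Counter

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(test_vec, train_vecs, train_labels, k=7):
    # the k training documents most similar to the test document...
    nearest = sorted(range(len(train_vecs)),
                     key=lambda i: cosine(test_vec, train_vecs[i]),
                     reverse=True)[:k]
    # ...vote on the class; an odd k helps avoid ties between two classes
    return Counter(train_labels[i] for i in nearest).most_common(1)[0][0]
```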
Example: k=7 (7NN)

[Figure: 6 out of the 7 nearest documents are from the class Politics, one is from the class Science.]
Support vector machines
• Support vector machines are the most effective method for classification of text documents
• Theoretically, the method is based on statistical learning theory
• It is suitable for classification of text because it is effective in classifying objects that are represented by a large number of attributes – here, index terms
Definitions
• Let X be the input space – the space of representations of documents: X = \{ x_1, x_2, \ldots, x_l \}
• Let Y be the output space, the set of labels indicating membership in a class
• In the case of binary classification the output space is

Y = \{ -1, 1 \}

where 1 means the document belongs to the class and −1 means it does not
Definition
• In the general case the output space is Y = \{ 1, 2, \ldots, k \}
• Here we discuss the case of binary classification
• This is no loss of generality, because classification into k classes can be reduced to k mutually independent binary classification problems \{ c_i, \bar{c}_i \}, i = 1, 2, \ldots, k
• With the SVM method, examples are separated by a hyperplane – an (n−1)-dimensional subspace of the input space
• n is the dimension of the input space (here, the number of index terms)
Separation of examples by hyperplane
• In two-dimensional space a hyperplane is a line, represented by the equation ax + by − c = 0
• A dataset is called linearly separable if it can be separated by a hyperplane
• If the dataset is linearly separable, there are many possible hyperplanes that separate the examples of the two classes
Decision function
• The decision function defines the hyperplane dividing positive from negative examples
• The decision function is given in the form

f(x) = w^T x + b

where w is the weight vector, a vector perpendicular to the hyperplane, and b is the bias, which determines the distance of the hyperplane from the origin
• A test example is classified as positive (y = 1) if f(x) > 0, or as negative (y = −1) if f(x) < 0
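A sketch of applying an already learned decision function; finding the maximum-margin w and b requires solving a quadratic optimization problem, which is omitted here, and the values below are invented.

```python
def decide(x, w, b):
    """f(x) = w^T x + b; the sign decides class membership."""
    f = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if f > 0 else -1    # +1: belongs to the class, -1: does not

w, b = [0.8, -0.4, 0.1], -0.2    # toy weights, as if produced by an SVM solver
print(decide([1.0, 0.0, 1.0], w, b))
```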
Margin and support vectors
• There are many hyperplanes that separate the positive from the negative examples
• The SVM method finds the optimal one: it maximizes the distance between the hyperplane and the "problem spots", i.e. the points nearest to the plane dividing positive from negative examples – the so-called support vectors
• The margin is said to be maximized

[Figure: separating hyperplane (decision function) with margin ρ and support vectors on either side.]
Measures of evaluation of classification
• Evaluation of classification involves the calculation of measures of the classifier's ability to make correct decisions about the membership of documents in classes
• The basic measures are precision and recall
• The measures are computed on the basis of a contingency table
                          Expert decision
                          YES                     NO
Decision of    YES        TPi (true positive)     FPi (false positive)
classifier     NO         FNi (false negative)    TNi (true negative)
Measures of evaluation of classification
• Basic: precision and recall
• The precision for class i is defined as the proportion of correctly classified documents among all documents classified as positive for the given class (according to the classifier):

p_i = \frac{TP_i}{TP_i + FP_i}

• The recall for class i is defined as the proportion of correctly classified documents among all documents that actually belong to class i:

r_i = \frac{TP_i}{TP_i + FN_i}

• Combined measure: the F1 measure

F_1 = \frac{2 p r}{p + r}
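A sketch of these measures computed from the contingency counts; the counts below are invented toy values.

```python
def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def f1(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0

tp, fp, fn = 80, 20, 40          # toy contingency counts for class i
p, r = precision(tp, fp), recall(tp, fn)
print(p, r, f1(p, r))            # 0.8, ~0.667, ~0.727
```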
Reuters-21578 collection of documents
• The most widely used data set for classification
• It consists of Reuters news reports from 1987
• Divided into documents for learning and documents for testing (the ModApte split)
• Documents for learning: reports from the beginning of 1987 to April 7, 1987
• Documents for testing: reports from April 8 until the end of 1987
• 9603 documents for learning and 3299 documents for testing
• 10 large classes; 90 classes with at least one document for learning and at least one for testing; 118 classes in total
• The classes differ considerably in size
Reuters-21578 collection of documents
• The 10 big classes

Class          No. of train documents   No. of test documents
Earn           2877                     1087
Acquisitions   1650                     719
Money-fx       538                      179
Grain          433                      149
Crude          389                      189
Trade          369                      118
Interest       347                      131
Ship           197                      89
Wheat          212                      71
Corn           182                      56
Precision/recall graph: Category Crude

[Figure: precision vs. recall for category Crude, comparing LSVM, Decision Tree, Naïve Bayes, and Rocchio. Source: Dumais (1998).]
Precision/recall graph: Category Ship

[Figure: precision vs. recall for category Ship, comparing LSVM, Decision Tree, Naïve Bayes, and Rocchio. Source: Dumais (1998).]
Conclusion: classification methods
• Support vector machines have proved to be the most successful method for the classification of text documents
• Reasons:
  The method relies on statistical learning theory, which provides a bound on the generalization error (the error on test examples) that does not depend on the dimension of the input space (the number of index terms)
  The method handles irrelevant index terms well – it is questionable whether it is even necessary to discard stop words, because the algorithm effectively ignores them
  Methods of this type work well for examples with sparse representations – which is the case with text documents, because most documents contain only a few index terms from the list (which usually contains several thousand index terms)