Introduction to Text Mining
Jasminka Dobša Faculty of Organization and Informatics
University of Zagreb
March 16th, 2015, Karl-Franzens Uni Graz
Outline
• Definition of text mining
• Vector space model or Bag-of-words representation
  Representation of documents and queries in VSM
  Measure of similarity between document and query
  Example
  Preprocessing of text documents
• Clustering of text documents
• Classification of text documents
  Definition of the classification task
  Methods for classification of text documents
  Measures of evaluation of classification
Definition and basic concepts
• The discipline of text mining is an integral part of the wider discipline of data mining
• It is content-based processing of unstructured text documents and extraction of useful information from them
• Content-based treatment: treatment based only on the content of the documents, and not on metadata such as date of publication, source of publication, type of document or similar
• Basic text mining tasks:
  Information retrieval
  Automatic classification of text documents
Information retrieval: definition and basic concepts
• The problem: determining the set of documents from a collection that are relevant to a particular user query
• Assessment of the relevance of documents is based on a comparison of the index terms that appear in the texts of the documents and in the text of the query
• Index term: basically any word or group of words that appears in the text
• When the text is represented by a set of index terms, much of the semantics contained in the text is lost
• Information about the interconnectedness of index terms is lost
Labels and basic models
• Labels:
  dj – document of the collection of documents
  q – query
  ai, aj – vector representations of documents
  q – vector representation of the query
• Basic models:
  Probabilistic model
  Boolean model
  Vector model
Characteristics of the models
• In the probabilistic model, the set of rules by which representations of documents and queries are modeled is based on the theory of probability
• In the Boolean model, documents and queries are represented as sets of index terms
• In the vector model, documents and queries are represented as multidimensional vectors → such a model is algebraic
Vector space model
• The vector space model is today one of the most popular models for information retrieval
• It was developed by Gerard Salton and associates within the search system SMART (System for the Mechanical Analysis and Retrieval of Text)
• In this model the similarity between documents and queries can take real values in the interval [0,1]
• Index terms are assigned weights
• Documents are ranked according to their degree of similarity with the query
• The representation in the vector space model is called the Bag-of-words representation
• The vector space model assumes independence of index terms
Vector space model
• Notation:
  T = \{ t_1, t_2, \ldots, t_m \} – set of index terms
  D = \{ d_1, d_2, \ldots, d_n \} – set of documents
  aij – weight associated with the pair (ti, dj), i=1,2,...,m; j=1,2,...,n
  qi, i=1,2,...,m – weights associated with the pair (ti, q), where q is the query
• In the vector space model the document dj, j=1,2,...,n is represented by the vector aj = (a1j, a2j, ..., amj)^T
• The query is represented by the vector q = (q1, q2, ..., qm)^T
• The collection of documents is represented by the term-document matrix
Term-document matrix
• The terms are axes of the space
• Documents and queries are points or vectors in the space
• The space is high-dimensional: its dimension corresponds to the number of index terms
• Documents usually contain only a few index terms from the list of index terms
• The vectors representing documents are therefore sparse: most coordinates are equal to 0
A = \begin{pmatrix}
      a_{11} & a_{12} & \cdots & a_{1n} \\
      a_{21} & a_{22} & \cdots & a_{2n} \\
      \vdots & \vdots &        & \vdots \\
      a_{m1} & a_{m2} & \cdots & a_{mn}
    \end{pmatrix}

Rows represent terms t_1, \ldots, t_m; columns represent documents d_1, \ldots, d_n
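As a toy illustration, the following sketch builds such a matrix of raw term counts from a tiny invented collection; the documents and all names in it are made up for the example.

```python
# Minimal sketch: a term-document matrix of raw counts for a toy collection.
docs = ["survey of text mining",
        "matrix analysis and applied linear algebra"]
terms = sorted({w for d in docs for w in d.split()})

# A[i][j] = number of occurrences of term i in document j
A = [[d.split().count(t) for d in docs] for t in terms]

for t, row in zip(terms, A):
    print(f"{t:10s} {row}")   # most entries are 0: the matrix is sparse
```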
Measure of similarity
• The similarity between the document and the query is measured by the angle between their representations
• The measure of similarity between document and query is the cosine of the angle between them
\cos(a_j, q) = \frac{a_j^T q}{\|a_j\|_2 \, \|q\|_2}
Measure of similarity
• Vector representations of documents are usually normalized so that their Euclidean norm is 1: \|a_j\|_2 = 1
• The norm of the representation of the query, \|q\|_2, does not affect the ranking of documents by relevance
• The similarity between document and query is therefore usually measured simply by the dot product in the numerator
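A small sketch of the cosine measure under these conventions; the vectors below are invented toy values.

```python
import math

def cosine(a, q):
    # cos(a, q) = (a^T q) / (||a||_2 * ||q||_2)
    dot = sum(x * y for x, y in zip(a, q))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_q = math.sqrt(sum(y * y for y in q))
    return dot / (norm_a * norm_q) if norm_a and norm_q else 0.0

a_j = [0.5, 0.5, 0.7071, 0.0]   # toy normalized document vector
q   = [1.0, 0.0, 1.0, 0.0]      # toy query vector
print(cosine(a_j, q))

# With pre-normalized document vectors, ranking by the plain dot product
# gives the same order as ranking by the cosine.
```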
The process of assigning weights to index terms
• Insisting on high recall reduces the precision of retrieval, and vice versa
• Therefore, several components for assigning weights to index terms are introduced (see the sketch below)
• Components:
  Local – depends on the frequency of occurrence of the term in a particular document (term frequency – TF component)
  Global – the factor for an index term depends inversely on the number of documents in which the term appears (inverse document frequency – IDF component)
  Normalization – vectors representing documents are divided by the Euclidean norm of the representation (this avoids longer documents being returned as relevant more often)
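As a sketch, one common realization of these three components (raw TF, logarithmic IDF, Euclidean length normalization); the bxx and tfn weights used in the example later may differ in detail.

```python
import math

def tfidf_vectors(docs):
    """TF x IDF with Euclidean normalization; one common variant of the scheme."""
    terms = sorted({w for d in docs for w in d})
    n = len(docs)
    df = {t: sum(t in d for d in docs) for t in terms}   # document frequency
    idf = {t: math.log(n / df[t]) for t in terms}        # global component
    vectors = []
    for d in docs:
        v = [d.count(t) * idf[t] for t in terms]         # local TF x global IDF
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        vectors.append([x / norm for x in v])            # length normalization
    return terms, vectors

docs = [["text", "mining", "text"], ["linear", "algebra"], ["data", "mining"]]
terms, vecs = tfidf_vectors(docs)
```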
Advantages and disadvantages of the vector space model
• Advantages
  The cosine similarity measure between documents and queries enables ranking of documents according to their importance for a query
  The scheme for assigning weights to index terms increases the efficiency of information retrieval
• Disadvantage
  The assumption of independence of index terms (questionable!)
Example
• A collection of 15 documents (titles of books)
  9 from the field of data mining
  5 from the field of linear algebra
  One that combines these two areas (application of linear algebra to data mining)
• A list of words is formed as follows:
  From the words contained in at least two documents
  Words on the list of stop words are discarded
  Words are reduced to their basic forms, i.e. variations of a word are mapped to its basic version
Documents 1/2
D1 Survey of text mining: clustering, classification, and retrieval
D2 Automatic text processing: the transformation, analysis, and retrieval of information by computer
D3 Elementary linear algebra: A matrix approach
D4 Matrix algebra & its applications to statistics and econometrics
D5 Effective databases for text & document management
D6 Matrices, vector spaces, and information retrieval
D7 Matrix analysis and applied linear algebra
D8 Topological vector spaces and algebras
Documents 2/2
D9 Information retrieval: data structures & algorithms
D10 Vector spaces and algebras for chemistry and physics
D11 Classification, clustering and data analysis
D12 Clustering of large data sets
D13 Clustering algorithms
D14 Document warehousing and text mining: techniques for improving business operations, marketing and sales
D15 Data mining and knowledge discovery
Terms

Data mining terms   Linear algebra terms   Neutral terms
Text                Linear                 Analysis
Mining              Algebra                Application
Clustering          Matrix                 Algorithm
Classification      Vector
Retrieval           Space
Information
Document
Data

The terms are mapped to their basic versions, e.g. Applications, Applied → Application
Queries
• Q1: Data mining
  Relevant documents: all data mining documents
• Q2: Using linear algebra for data mining
  Relevant document: D6
Search results for weights bxx and tfn

              bxx               tfn
Document    Q1       Q2       Q1       Q2
D1          0.4472   0.4472   0.3607   0.2424
D2          0        0        0        0
D3          0        1.1547   0        0.6417
D4          0        0.5774   0        0.1471
D5          0        0        0        0
D6          0        0        0        0
D7          0        0.8944   0        0.4598
D8          0        0.5774   0        0.1654
D9          0.5      0.5      0.2634   0.1771
D10         0        0.5774   0        0.1654
D11         0.5      0.5      0.2634   0.1770
D12         0.7071   0.7071   0.4488   0.3016
D13         0        0        0        0
D14         0.5774   0.5774   0.4092   0.2750
D15         1.4142   1.4142   1.0000   0.6720
Comments on search results
• Assume that the search returns all documents whose similarity with the query is greater than 0
• For the first query (Data mining)
  All documents returned by the search are relevant
  Not all relevant documents are returned
  Only documents containing the terms contained in the query are returned – the search is lexical, not semantic
• For the second query (Using linear algebra for data mining)
  None of the returned documents is relevant
  The only relevant document is not returned by the search
• In this example, search using the bxx and tfn weights did not give significantly different results
Preprocessing of text
• Index term
  Any word that appears in the collection of documents
  A sequence of several words that appears in the texts of documents
• Preprocessing of text includes a number of procedures that prepare the text for processing and increase the efficiency of the processing itself
• Some of the procedures are as follows:
  Lexical processing
  Removal of terms such as conjunctions, prepositions and similar words that have low discriminatory value in information retrieval, the so-called stop words
  Transformation of words to their root form (stemming) or to their basic form (lemmatization)
• Lexical processing consists of removing punctuation, removing brackets, and converting letters to lower case
• Reducing words to their root form consists of removing suffixes and prefixes
• Alternative forms of a word are usually obtained by adding suffixes
• Most algorithms for reducing words to their stem form are therefore limited to removing suffixes
• The best known algorithm for reducing words to root form is Porter's algorithm, developed for the English language
• In addition to index terms consisting of a single word, phrases that occur frequently in the text, e.g. New York, can also be identified
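A minimal preprocessing sketch; the stop-word list and the suffix rules below are tiny invented stand-ins for a real stop-word list and for Porter's algorithm.

```python
import string

STOP_WORDS = {"the", "of", "and", "a", "for", "in", "by", "to"}  # toy list
SUFFIXES = ("ing", "es", "s")   # crude stand-in for Porter's suffix rules

def preprocess(text):
    # Lexical processing: lower-casing and punctuation removal
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = [w for w in text.split() if w not in STOP_WORDS]  # drop stop words
    stems = []
    for w in tokens:                # very crude stemming: strip one suffix
        for suf in SUFFIXES:
            if w.endswith(suf) and len(w) > len(suf) + 2:
                w = w[:-len(suf)]
                break
        stems.append(w)
    return stems

print(preprocess("Survey of text mining: clustering, classification, and retrieval"))
```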
CLUSTERING AND CLASSIFICATION OF TEXT DOCUMENTS
Clustering and classification of documents
• When organizing large collections of documents it is useful to use methods of clustering and automatic classification of documents
• Clustering is an unsupervised method, while classification is a supervised method
• Unsupervised methods do not require a labeled training set on which the classes are learned, while supervised methods require such a set
Clustering and classification
• Clustering:
  The algorithm groups data on the basis of their representations
  Some algorithms require the number of groups to be defined in advance
• Classification:
  The algorithm learns the characteristics of predefined classes from training examples
  Training examples have class labels assigned (manually)
Division
• Clustering methods are divided into:
  Hierarchical
    Agglomerative (bottom up)
      They begin with individual objects
      Groups are joined in accordance with certain criteria of similarity or closeness between groups (linkage methods)
      They end when all objects are joined into one group
    Divisive (top down)
      They start with the whole set of objects – one group
      Groups are split according to criteria of similarity between groups
      They end when each object is in a separate group
  Non-hierarchical (partitional)
Hierarchical methods
The result of clustering is displayed graphically by a dendrogram
K-means method
• The number of groups is set in advance
• It is not necessary to compute the similarity matrix (the matrix of distances between objects), so the method is appropriate for clustering a large number of objects
• The k-means method is the most popular partitional method
• Each p-dimensional vector is assigned to the group with the closest central vector (centroid)
K- means method: algorithm
1. K initial centroids of the groups are chosen
2. Each object is assigned to the group whose centroid is closest (restructuring of the groups)
3. Centroids of the new groups are computed
4. Steps 2 and 3 are repeated until the distance between the centroids of two consecutive iterations is less than a given value, or until the contents of the groups no longer change
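A compact sketch of these four steps for low-dimensional points; taking the first K objects as initial centroids is just one of several common choices.

```python
import math

def kmeans(points, k, iters=100):
    centroids = list(points[:k])              # step 1: choose initial centroids
    groups = []
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:                      # step 2: attach each object to the
            i = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            groups[i].append(p)               #         group with closest centroid
        new = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[i]
               for i, g in enumerate(groups)] # step 3: compute new centroids
        if new == centroids:                  # step 4: stop when groups stabilize
            break
        centroids = new
    return centroids, groups

print(kmeans([(1, 1), (1, 2), (8, 8), (9, 8), (8, 9)], 2))
```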
Example: k-means method (K=2)
[Figure: iterations of k-means – choice of initial centroids; restructuring of the groups; computation of centroids; restructuring of the groups; computation of centroids; restructuring of the groups; convergence.]
The Yahoo! hierarchy is not created by clustering, but it looks like what we would like to obtain by clustering
[Figure: excerpt of the Yahoo! directory at www.yahoo.com/Science – top-level categories such as agriculture, biology, physics, CS, and space, with subcategories such as agronomy, forestry, dairy, crops, botany, evolution, cell, magnetism, relativity, AI, HCI, courses, and missions.]
Source: course Information Retrieval and Web Search, Chris Manning, Pandu Nayak, Prabhakar Raghavan, Stanford University
The purpose of clustering of documents
• Clustering enables better performance of information retrieval
• It can mitigate the problem of homonyms (words with multiple meanings)
• E.g. for a query consisting of the word virus:
  Documents related to viruses (living organisms) and documents related to computer viruses will both be retrieved
  By clustering these documents into two groups we can choose the group that interests us, and thus avoid numerous search results that are not relevant to us
Automatic classification or categorization
• Automatic classification or categorization is the process of assigning labels of predefined classes to unstructured text documents
• Each document can be associated with one or more classes (single-label or multi-label text categorization)
• If a document can be assigned to more than one class, the classes overlap
• Categorization of documents into k categories (which may overlap) can be reduced to k binary classification problems
• The method defined this way is strict or hard classification
• Another way is to rank a given document dj over the categories c1, c2, ..., ck according to how suitable each category is for document dj
Applications
• Applications of text categorization are numerous:
  Classification of documents by topic, e.g. for forming hierarchical catalogues of web sites
  Classification of documents by genre
  Classification according to the authorship of a document
  Filtering of documents (spam filtering of e-mail messages)
Machine learning techniques for text classification
• In the eighties of the last century, classifiers for automatic classification of documents were constructed by hand, using so-called knowledge engineering techniques
• Such a classifier consists of a set of manually generated rules for each class
• The disadvantage of this approach is that whenever the classes are updated, the rules must be updated as well
• In the nineties of the last century a new approach began to develop, enabled by the discipline of machine learning
• Machine learning makes it possible to generate classifiers automatically, by learning the characteristics of classes from training examples
Machine learning techniques for text classification
• The process of classification is an example of supervised learning, because the learning process is conducted, i.e. supervised, through training examples
• When classes are updated, the expert's help can consist only of labeling new training examples
• Classifiers generated by machine learning techniques typically show better results on text classification than manually generated classifiers
Training set, validation set and test set
• When automatic categorization is used, the initial set of documents Ω is divided into:
  Set Tv – a collection of documents for learning, from which, if necessary, a smaller set of documents used for validation (tuning of model parameters) is set aside
  Set Te – a collection of documents for testing
• Usually the set of documents for learning is larger; a common ratio is 2:1
• This basic principle is called the train-and-test principle
• Another possible principle is cross-validation
Cross-validation
• K different classifiers Φ1, ..., Φk are formed
• These classifiers operate on the sets {TVi = Ω − Tei, Tei}
• The efficiency of classification is calculated for each of the classifiers Φ1, ..., Φk and the average value is taken (see the sketch below)
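A sketch of this scheme; train_and_evaluate is a hypothetical callback, assumed here, that fits a classifier Φi on TVi and returns its efficiency on Tei.

```python
def cross_validate(documents, labels, k, train_and_evaluate):
    """Fold i serves as Te_i; the remaining documents form TV_i = Ω - Te_i."""
    n = len(documents)
    folds = [list(range(i, n, k)) for i in range(k)]
    scores = []
    for test_idx in folds:
        test = set(test_idx)
        train_idx = [i for i in range(n) if i not in test]
        scores.append(train_and_evaluate(
            [documents[i] for i in train_idx], [labels[i] for i in train_idx],
            [documents[i] for i in test_idx], [labels[i] for i in test_idx]))
    return sum(scores) / k   # average efficiency over Φ1, ..., Φk
```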
Example (Manning, Nayak, Raghavan)
• Classification of Web pages of computer science departments into six classes: Student, Faculty, Course, Project, Person, Department
• Learning on about 5000 Web pages of four universities: Cornell, Washington, U. Texas, Wisconsin
• Classification of the web pages is conducted using the Naïve Bayes algorithm
• Results: [results table not recoverable from the transcript]
Documents in vector space (example: Manning, Nayak, Raghavan)

[Figure: documents as points in vector space, grouped into the classes Politics, Science, and Art.]
In which class is the test document?

[Figure: a test document among the classes Politics, Science, and Art.]
The document is in the class Politics

[Figure: the test document inside the Politics region. Aim: to find the boundaries between classes.]
Types of classifiers
• Some of the classifiers are:
  Probabilistic classifiers – the Naïve Bayes classifier
  The Rocchio classifier
  The k-nearest neighbors (kNN) classifier
  Support vector machines
• For a given document, the Naïve Bayes classifier ranks the classes based on the probability that the document belongs to each class
• The naivety refers to the assumed independence of index terms (it turned out that this is not too naive)
Rocchio classifier
• A simple representation is computed for each class: the centroid of the class
• A test document is attached to the class with the closest centroid
• This approach does not guarantee that the classification is consistent with the training documents

\mu_i = \frac{1}{|c_i|} \sum_{d_j \in c_i} d_j

where |c_i| is the number of documents in the class and the sum runs over the documents d_j in the class
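A short sketch of the centroid computation and the closest-centroid rule above; all names are illustrative.

```python
import math
from collections import defaultdict

def train_rocchio(vectors, labels):
    """mu_i = (1 / |c_i|) * sum of the document vectors in class c_i."""
    by_class = defaultdict(list)
    for v, c in zip(vectors, labels):
        by_class[c].append(v)
    return {c: [sum(col) / len(vs) for col in zip(*vs)]
            for c, vs in by_class.items()}

def rocchio_classify(vector, centroids):
    # attach the test document to the class with the closest centroid
    return min(centroids, key=lambda c: math.dist(vector, centroids[c]))
```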
Rocchio classifier: in which class is this document?

[Figure: classes Politics, Science, and Art with centroids \mu_1, \mu_2, \mu_3 and a test document.]
kNN classifier
• This algorithm belongs to a group of algorithms that learn directly from examples – instance-based learning
• The learning phase consists simply of storing the training examples
• During the test phase the most similar examples are retrieved from memory, and only those examples are used to classify the given example
• A classifier for the entire domain is never formed; a local classifier is generated for each new test instance
• Such a classifier is called lazy
kNN classifier
• In order to classify a test example it is necessary to determine its nearest neighbors – the documents from the training set that are closest to the test document
• Usually an odd number (3, 5, 7) of closest training documents is chosen, in order to avoid a tie between two classes
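A sketch of the kNN decision: retrieve the k most similar training documents by cosine similarity and let them vote; all names are illustrative.

```python
import math
from collections import Counter

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(test_vec, train_vecs, train_labels, k=7):
    # the k training documents most similar to the test document...
    nearest = sorted(range(len(train_vecs)),
                     key=lambda i: cosine(test_vec, train_vecs[i]),
                     reverse=True)[:k]
    # ...vote on the class; an odd k helps avoid ties between two classes
    return Counter(train_labels[i] for i in nearest).most_common(1)[0][0]
```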
Example: k=7 (7NN)

[Figure: 6 out of the 7 nearest documents are from the class Politics, one is from the class Science.]
Support vector machines
• Support vector machines are the most effective method for classification of text documents
• Theoretically, the method is based on statistical learning theory
• It is suitable for classification of text because it is effective in classifying objects that are represented by a large number of attributes – here, index terms
Definitions
• Let X be the input space – the space of representations of documents: X = \{ x_1, x_2, \ldots, x_l \}
• Let Y be the output space, the set of labels indicating membership in a class
• In the case of binary classification the output space is

Y = \{ -1, 1 \}

where 1 means the document belongs to the class and −1 means it does not
Definition
• In the general case the output space is Y = \{ 1, 2, \ldots, k \}
• Here we discuss the case of binary classification
• This is no loss of generality, because classification into k classes can be reduced to k mutually independent binary classification problems \{ c_i, \bar{c}_i \}, i = 1, 2, \ldots, k
• With the SVM method, examples are separated by a hyperplane – an (n−1)-dimensional subspace of the input space
• n is the dimension of the input space (here, the number of index terms)
Separation of examples by hyperplane
• In two-dimensional space a hyperplane is a line, represented by the equation ax + by − c = 0
• A dataset is called linearly separable if it can be separated by a hyperplane
• If the dataset is linearly separable, there are many possible hyperplanes that separate the examples of the two classes
Decision function
• The decision function defines the hyperplane dividing positive from negative examples
• The decision function is given in the form

f(x) = w^T x + b

where w is the weight vector, a vector perpendicular to the hyperplane, and b is the bias, which determines the distance of the hyperplane from the origin
• A test example is classified as positive (y = 1) if f(x) > 0, or as negative (y = −1) if f(x) < 0
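A sketch of applying an already learned decision function; finding the maximum-margin w and b requires solving a quadratic optimization problem, which is omitted here, and the values below are invented.

```python
def decide(x, w, b):
    """f(x) = w^T x + b; the sign decides class membership."""
    f = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if f > 0 else -1    # +1: belongs to the class, -1: does not

w, b = [0.8, -0.4, 0.1], -0.2    # toy weights, as if produced by an SVM solver
print(decide([1.0, 0.0, 1.0], w, b))
```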
Margin and support vectors
• There are many hyperplanes that separate the positive from the negative examples
• The SVM method finds the optimal one: it maximizes the distance between the hyperplane and the "problem spots", i.e. the points nearest to the plane dividing positive from negative examples – the so-called support vectors
• The margin is said to be maximized

[Figure: separating hyperplane (decision function) with margin ρ and support vectors on either side.]
Measures of evaluation of classification
• Evaluation of classification involves the calculation of measures of the classifier's ability to make correct decisions about the membership of documents in classes
• The basic measures are precision and recall
• The measures are computed on the basis of a contingency table
                          Expert decision
                          YES                     NO
Decision of    YES        TPi (true positive)     FPi (false positive)
classifier     NO         FNi (false negative)    TNi (true negative)
Measures of evaluation of classification
• Basic: precision and recall
• The precision for class i is defined as the proportion of correctly classified documents among all documents classified as positive for the given class (according to the classifier):

p_i = \frac{TP_i}{TP_i + FP_i}

• The recall for class i is defined as the proportion of correctly classified documents among all documents that actually belong to class i:

r_i = \frac{TP_i}{TP_i + FN_i}

• Combined measure: the F1 measure

F_1 = \frac{2 p r}{p + r}
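A sketch of these measures computed from the contingency counts; the counts below are invented toy values.

```python
def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def f1(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0

tp, fp, fn = 80, 20, 40          # toy contingency counts for class i
p, r = precision(tp, fp), recall(tp, fn)
print(p, r, f1(p, r))            # 0.8, ~0.667, ~0.727
```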
Reuters-21578 collection of documents
• The most widely used data set for classification
• It consists of Reuters news reports from 1987
• Divided into documents for learning and documents for testing (the ModApte split)
• Documents for learning: reports from the beginning of 1987 to April 7, 1987
• Documents for testing: reports from April 8 until the end of 1987
• 9603 documents for learning and 3299 documents for testing
• 10 large classes; 90 classes with at least one document for learning and at least one for testing; 118 classes in total
• The classes differ considerably in size
Reuters-21578 collection of documents
• The 10 big classes

Class          No. of train documents   No. of test documents
Earn           2877                     1087
Acquisitions   1650                     719
Money-fx       538                      179
Grain          433                      149
Crude          389                      189
Trade          369                      118
Interest       347                      131
Ship           197                      89
Wheat          212                      71
Corn           182                      56
Precision/recall graph: Category Crude

[Figure: precision vs. recall for category Crude, comparing LSVM, Decision Tree, Naïve Bayes, and Rocchio. Source: Dumais (1998).]
Precision/recall graph: Category Ship

[Figure: precision vs. recall for category Ship, comparing LSVM, Decision Tree, Naïve Bayes, and Rocchio. Source: Dumais (1998).]
Conclusion: classification methods
• Support vector machines have proved to be the most successful method for the classification of text documents
• Reasons:
  The method relies on statistical learning theory, which provides a bound on the generalization error (the error on test examples) that does not depend on the dimension of the input space (the number of index terms)
  The method handles irrelevant index terms well – it is questionable whether it is even necessary to discard stop words, because the algorithm effectively ignores them
  Methods of this type work well for examples with sparse representations – which is the case with text documents, because most documents contain only a few index terms from the list (which usually contains several thousand index terms)