Using Natural Language Processing For
Automated Text Classification
Abhishek Oswal
March 15, 2016
Contents

1 Introduction
2 Background
3 Types of Learning Techniques
  3.1 Supervised Learning
    3.1.1 Regression
    3.1.2 Classification
  3.2 Unsupervised Learning
    3.2.1 Clustering
  3.3 Comparison between supervised and unsupervised learning
  3.4 Examples of different learning techniques
4 Process of Classification
  4.1 Data Preprocessing
  4.2 Training Set
  4.3 Test Set
  4.4 Creation of Model
  4.5 Algorithm
  4.6 Classify
5 Text Categorization
  5.1 Mathematical Definition of the Text Classification Task
  5.2 Text Representation Format
    5.2.1 Bag-of-Words Representation
    5.2.2 Document-Term Matrix
  5.3 Methods to Classify
6 General Approach
  6.1 Precision and Recall
7 Bayesian Categorization
  7.1 Bayes Theorem
  7.2 Naive Bayes Equation
8 Support Vector Machines
  8.1 SVM Equation
9 k-Nearest Neighbor Categorization
  9.1 k-NN Equation
  9.2 k-NN Algorithm Example
10 Properties
  10.1 Properties of Naïve Bayes Categorization
  10.2 Limitations of Naïve Bayes Categorization
  10.3 Properties of k-Nearest Neighbor Categorization
  10.4 Limitations of k-Nearest Neighbor Categorization
11 Conclusion
Abstract

With the growth of technology and the Internet, it has become natural that we need to manage text as online information. Today we search for books and news on the Internet, and many companies and individuals have their own web pages. Whenever there is some information to find, we search for it on the Internet, so a great deal of information has become openly available. A small amount of information can still be classified by hand, but today there is so much information that manual classification has become impractical. Hence we need a fast, automatic approach for classifying text information into various fields. Text classification is gaining importance due to the accessibility of large numbers of electronic documents from a variety of sources. The classification problem has been studied in Natural Language Processing, Data Mining, and Machine Learning, with applications in diverse domains such as newsgroup filtering, document organization, and target marketing. This report mainly focuses on the analysis of the Naive Bayes categorization algorithm for automated text classification.

Key-words: Text Classification, Text Categorization, Naive Bayes, Support Vector Machine, Spam Filtering
List of Figures

3.1 Comparison between supervised and unsupervised learning
3.2 Examples of different learning techniques
5.1 Example representing categorization
5.2 Bag-of-words representation
5.3 Document-term matrix representation
6.1 General approach of classification
6.2 Formula for calculating precision
6.3 Formula for calculating recall
7.1 Bayes Theorem
7.2 General Naive Bayes equation
7.3 Formula for calculating P(c)
7.4 Formula for calculating P(x|c)
7.5 Maximum likelihood estimation
7.6 Calculating prior probability
7.7 Formula for predicting category
8.1 Support vector machine
9.1 Euclidean distance
9.2 k-NN algorithm equation
Chapter 1
Introduction
Categorization is the process in which objects are recognized and differentiated on the basis of various properties known as features. A category indicates a relationship between subjects based on attributes and objects of knowledge. Hence categorization implies that objects are grouped into various categories for some specific purpose.
Categorization is used for prediction, decision making, and so on. The objects we classify may be audio, images, video, text, etc. Text Categorization is also known as text classification. Text Classification is the process of assigning documents to a group of one or more existing categories. Categories are formed according to the concepts, themes, or relations present in their contents. Current research in text classification mainly aims to improve the quality of text representation, increase efficiency, and develop high-quality classifiers.

The text classification process consists of collecting data documents (gathering), data preprocessing (converting raw data to refined data), indexing, term-weighting methods, and classification algorithms (developing classifiers) based on various features.
The basic goal of text categorization is the classification of documents into a number of predecided categories. Each document can belong to exactly one category, multiple categories, or none at all. Machine learning approaches have been actively explored for this purpose; among them are the Naive Bayes classifier, k-nearest neighbor classifiers, support vector machines, and neural networks.
Services like mail filters, web filters, and online help desks are based on text classification. Mail filters sort business e-mails from spam by classifying each e-mail as “ordinary mail” or “spam mail.” Web filters prevent children from accessing undesirable website content by classifying web sites into categories. Hence, text classification technology is essential for these services to run.
Most research work in the area of Text Categorization uses supervised learning methods, which depend on a huge amount of labeled training data to achieve better and faster classification. Because labeled training data is scarce, data must be labeled manually before it can be used for classification, and that is a very long and expensive task. On the other hand, there are wide resources of unlabeled training data that can be utilized for text classification.
Recently, various research efforts have tried to build methods on the basis of unlabeled training data rather than using labeled data or manually labeling data of the same group. One such method that has worked well is Keyword-based Text Categorization, which is mainly based on keyword representations of categories and documents.
Chapter 2
Background
Today's world is weighed down with data and information from many sources, and the IT field has made the collection of data easier than ever before. Data Mining is a technique for extracting interesting patterns, known features, and knowledge from very large amounts of data. It mainly helps large business organizations.

Recently, data mining has also attracted the whole IT industry. It helps real-world applications convert large amounts of data into meaningful information. Data Mining is used in many fields: business, the banking sector, scientific research, intelligence agencies, social media, robotics, and more. Categorization is one of the data mining tasks.
Chapter 3
Types of Learning Techniques
Machine Learning is the ability to learn from observations, previous experiences, and other means, resulting in a system that can continually improve itself to give increased efficiency and better effectiveness.
There are different types of learning techniques.
1. Supervised learning
2. Unsupervised learning
3. Semi-supervised learning
Text Categorization using the k-nearest neighbor and Naive Bayes algorithms belongs to the supervised learning techniques.
3.1 Supervised Learning
Supervised Learning is a technique in which conclusions are drawn from a training set. The training set contains pairs of input data and the category labels to which they belong. The training data is initially categorized by experts to construct the categorization model.

Once the categorization model is trained, it must be able to categorize test data into the appropriate category. The test data is a set of data used to validate the categorization model developed from the training data set.

Supervised learning problems are divided into "regression" and "classification" problems.
3.1.1 Regression
In a regression problem, we are trying to predict results within a continuous
output, meaning that we are trying to map input variables to some continuous
function.
3.1.2 Classification
In a classification problem, we are instead trying to predict results in a dis-
crete output. In other words, we are trying to map input variables into
discrete categories.
3.2 Unsupervised Learning
Unsupervised Learning is a technique for discovering a function that describes hidden patterns in unlabeled data. Since the data given to the learner are unlabeled, there is no error or reward signal with which to evaluate a potential solution.
3.2.1 Clustering
We can derive structure from such data by clustering it based on relationships among the variables in the data. With unsupervised learning there is no feedback based on the prediction results, i.e., there is no teacher to correct you. Unsupervised learning is not only clustering; associative memory, for example, is also unsupervised learning.
3.3 Comparison between supervised and unsupervised learning
In supervised learning, labeled output datasets are provided and used to train the machine to produce the desired outputs. In unsupervised learning, no labeled datasets are provided; instead the data is clustered into different classes, and there is no desired output specified in advance.
Figure 3.1: Comparison between supervised and unsupervised learning
3.4 Examples of different learning techniques
Figure 3.2: Examples of different learning techniques
Chapter 4
Process of Classification
A categorization process is a systematic approach for building a categorization model from an input set of data. It requires a learning algorithm to identify a model that captures the relationship between the attribute set and the class label of the input data. The learning algorithm should both fit the input data well and predict the class labels of previously unseen records.
There are various steps involved in this process.
4.1 Data Preprocessing
Data preprocessing is a data mining technique that transforms raw data into an understandable format. Raw data is often inconsistent, incomplete, lacking in certain behaviors, and likely to contain many errors. Data preprocessing prepares raw data for further processing. Data goes through a number of steps during preprocessing:
• Data Cleaning
• Data Integration
• Data Transformation
• Data Discretization
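As a concrete illustration, the cleaning and transformation steps above can be sketched in a few lines of Python. The tiny stop-word list and the regular expression below are illustrative choices, not part of the report's method; real systems use much larger stop-word lists.

```python
import re

# A tiny illustrative stop-word list; real systems use a much larger one.
STOP_WORDS = {"the", "and", "or", "a", "an", "of", "to", "in"}

def preprocess(text):
    """Data cleaning and transformation: lowercase, strip non-letters,
    tokenize, and drop stop-words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # cleaning: remove digits and punctuation
    tokens = text.split()                  # whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The quick brown Fox, and the lazy Dog!"))
# → ['quick', 'brown', 'fox', 'lazy', 'dog']
```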
4.2 Training Set
A training set is a set of data used to find potentially predictive relationships. The training data set is the collection of data records whose class labels are already known; it is used to generate the categorization model, which is then applied to the test data set.
4.3 Test Set
A test set is a set of data used to assess the utility and strength of a predictive relationship. The test data set is a collection of records whose class labels are known; when it is given as input to the built classification model, the model should return the correct class labels of the records. The accuracy of the model is determined from the counts of correct and incorrect predictions on the test records.
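A minimal sketch of how a labeled collection might be divided into the training and test sets described above; the 70/30 split and the fixed seed are arbitrary illustrative choices.

```python
import random

def train_test_split(records, test_fraction=0.3, seed=42):
    """Shuffle the labeled records and hold out a fraction as the test set."""
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    shuffled = records[:]       # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

# Ten invented records of the form (document, class label).
data = [("doc%d" % i, "spam" if i % 2 else "ham") for i in range(10)]
train, test = train_test_split(data)
print(len(train), len(test))  # → 7 3
```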
4.4 Creation of Model
Model creation produces a first draft of the predictive relationship learned from the training data, with the expectation of future clarification, development, and revision. A model is used for understanding some part of the world; here, it is the mapping from document features to categories.
4.5 Algorithm
An algorithm is a self-contained, step-by-step set of operations to be performed. Algorithms exist for data processing, automated reasoning, and calculation. Here, the algorithm is the procedure used to build the model and solve the classification problem.
4.6 Classify
Classification is the labeling of the test data set using the model developed from the training data set with a particular algorithm. This output is what we need.
Chapter 5
Text Categorization
Categorization is organizing data in its most effective manner for its most efficient use. The Text Classification task is defined as the automatic classification of a document into two or more predefined classes.
5.1 Mathematical Definition of the Text Classification Task
Let (d_j, c_i) ∈ D × C, where D is the collection of documents and C = {c_1, c_2, ..., c_|C|} is a set of predefined categories. The main task of Text Categorization is then to assign a Boolean value to each pair in D × C.

Consider Fig. 5.1, in which D is the domain of documents and C1, C2, and C3 are different categories. D contains three different kinds of documents. After categorization, each document is assigned to its respective category.
Hence, in simple words, the classification problem can be defined as follows. We have a set of training records D = {X_1, ..., X_N}, such that each record is labeled with a class value drawn from a set of c different discrete values indexed by {1, ..., c}. The training data is used for the construction of a classification model, which relates the features in the underlying records to one of the class labels.

Figure 5.1: Example representing categorization
Here it must be noted that the frequency of words also plays a major role
in the classification process.
5.2 Text Representation Format
The first step in text categorization is to transform documents, which typically are strings of characters, into a representation suitable for the learning algorithm and the classification process.
5.2.1 Bag-of-Words Representation
Information Retrieval research suggests that word frequency and the words themselves work well as representation units, and that their ordering in a document is of little importance for many tasks such as classification. This leads to the conclusion that an attribute-value representation of text is very appropriate for the text classification process.
Figure 5.2: Bag-of-words representation
5.2.2 Document–Term Matrix
A document-term matrix or term-document matrix is a mathematical matrix that describes the frequency of terms occurring in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms; a term-document matrix is its transpose.
Each distinct word w_i corresponds to a feature, with the number of times w_i occurs in the document as its value. To avoid unnecessarily large feature vectors, words are considered as features only if they occur in the training data at least 3 times and are not stop-words (like "and", "or", etc.).
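A document-term matrix as described above can be built in a few lines of Python. The three toy documents are invented for illustration, and the minimum-frequency and stop-word filters mentioned above are omitted for brevity.

```python
from collections import Counter

# Three invented toy documents.
docs = ["the cat sat on the mat", "the dog sat", "cat and dog"]

# The vocabulary gives the columns of the matrix.
vocab = sorted({w for d in docs for w in d.split()})

# One row per document: how often each vocabulary term occurs in it.
dtm = [[Counter(d.split())[term] for term in vocab] for d in docs]

print(vocab)   # → ['and', 'cat', 'dog', 'mat', 'on', 'sat', 'the']
print(dtm[0])  # → [0, 1, 0, 1, 1, 1, 2]
```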
Figure 5.3: Document-term matrix representation
5.3 Methods to Classify
There are many categorization techniques in use. They are:
• Bayesian Categorization.
• K Nearest Neighbor Categorization.
• Decision Tree Categorization.
• Rule Based Categorization.
• Support Vector Machines.
• Neural Networks.
Chapter 6
General Approach
Three major categorization techniques are
• Bayesian
• Support Vector Machine
• kNN
Figure 6.1: General approach of classification
6.1 Precision and Recall
Precision and Recall values evaluate the performance of the categorization model. Precision measures exactness, whereas Recall measures completeness.

Let TP be the number of true positives, i.e. the documents correctly labeled as belonging to the category by both the experts and the model. Let FP be the number of false positives, i.e. the documents wrongly categorized by the model as belonging to that category. Let FN be the number of false negatives, i.e. the documents which should have been labeled as belonging to the category but were not.
Hence, Precision is defined as
Figure 6.2: Formula for calculating precision
Recall is defined as
Figure 6.3: Formula for calculating recall
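The two formulas translate directly into code. The TP/FP/FN counts below are made-up numbers used only to exercise the functions:

```python
def precision(tp, fp):
    """Exactness: of the documents the model put in the category,
    the fraction that truly belong there (TP / (TP + FP))."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Completeness: of the documents that truly belong to the category,
    the fraction the model found (TP / (TP + FN))."""
    return tp / (tp + fn)

# Made-up counts: 40 correct, 10 wrongly included, 20 missed.
print(precision(tp=40, fp=10))  # → 0.8
print(recall(tp=40, fn=20))     # → 0.666...
```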
Chapter 7
Bayesian Categorization
Bayesian classification is a well-known classification technique. It is used to predict class membership probabilities, i.e. the probability that a given record belongs to a specific category, and it is based on Bayes Theorem.

Bayes Theorem is a simple mathematical formula used for calculating conditional probabilities.
7.1 Bayes Theorem
Let X be a sample data record whose category is unknown, and let H be the hypothesis that the sample X belongs to a specified category C. We need to determine P(H|X), the probability that the hypothesis H holds given the data sample X.

Bayes Theorem is

Figure 7.1: Bayes Theorem

where P(H|X) is the posterior probability of H given X. The posterior probability is based on information such as background knowledge, whereas the prior probability is independent of the data sample X. P(X|H) is the probability of observing X given that H holds. If the given data is huge, it becomes difficult to calculate the above probabilities directly; conditional independence was introduced to overcome this limitation.
7.2 Naive Bayes Equation
Naive Bayes categorization is one of the simplest probabilistic Bayesian categorization methods. It is based on the assumption that the effect of an attribute value on a given category is independent of the values of the other attributes; this is called conditional independence, and it is used to simplify otherwise complex computations.

The Naive Bayes classifier is a probabilistic classifier based on the Naive Bayes assumption. From Bayes' rule, the posterior probability can be given as

Figure 7.2: General Naive Bayes equation
where x = (x_1, ..., x_n) is a feature vector and c is a category. Assume that the category c_max yields the maximum value of P(c|x).
The parameter P(c) is estimated as
Figure 7.3: Formula for calculating P(c)
The classification results are not affected by the parameter p(x), because it is independent of the categories. Assuming that the components of the feature vectors are statistically independent of each other, p(x|c) can be calculated as
Figure 7.4: Formula for calculating P(x|c)
If maximum likelihood estimation is used, then

Figure 7.5: Maximum likelihood estimation

where N(x, c) is the joint frequency of x and c. If some feature x(i) never appears in the training data, the probability of any instance containing x(i) becomes zero, regardless of the other features in the vector. Therefore, to avoid zero probabilities, Laplacian prior probabilities are used and p(x_i|c) is estimated as follows:
Figure 7.6: Calculating prior probability
The Naive Bayes classifier predicts the category c_max with the largest posterior probability:
Figure 7.7: Formula for predicting category
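The whole chapter can be sketched as a short Python program: estimate the priors P(c) and the Laplace-smoothed likelihoods P(x_i|c) from training data, then predict c_max. The tiny spam/ham corpus is invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """labeled_docs: list of (token_list, category) pairs.
    Returns priors P(c), per-category word counts, and the vocabulary."""
    cat_counts = Counter(c for _, c in labeled_docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, c in labeled_docs:
        word_counts[c].update(tokens)
        vocab.update(tokens)
    priors = {c: n / len(labeled_docs) for c, n in cat_counts.items()}
    return priors, word_counts, vocab

def classify(tokens, priors, word_counts, vocab):
    """Predict c_max: the category maximizing log P(c) + sum_i log P(x_i|c)."""
    best, best_score = None, float("-inf")
    for c, prior in priors.items():
        total = sum(word_counts[c].values())
        score = math.log(prior)
        for t in tokens:
            # Laplacian (add-one) smoothing avoids zero probabilities
            # for words unseen in category c.
            score += math.log((word_counts[c][t] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best

train = [(["cheap", "pills", "buy"], "spam"),
         (["meeting", "agenda", "notes"], "ham"),
         (["buy", "cheap", "now"], "spam")]
model = train_nb(train)
print(classify(["cheap", "buy"], *model))  # → spam
```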
Chapter 8
Support Vector Machines
A support vector machine (SVM) is a machine learning method that divides the feature space into a side of positive training examples and a side of negative examples. It constructs hyperplanes forming the margin between the positive and negative examples. These hyperplanes serve as the optimum solution in the sense of structural risk minimization.
8.1 SVM Equation
SVM calculates the optimal hyperplane that supplies the maximum margin, where w·x + b = 0 is the final border hyperplane for classification. The training examples lying on w·x + b = 1 and w·x + b = -1 are called support vectors.
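Once w and b have been learned, classification reduces to checking which side of the hyperplane w·x + b = 0 a point falls on. The weights below are hypothetical values standing in for the output of SVM training, not a trained model.

```python
def svm_predict(w, b, x):
    """Classify by the sign of w.x + b, i.e. by which side of the
    separating hyperplane w.x + b = 0 the point x falls on."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Hypothetical learned parameters: the hyperplane x1 + x2 - 3 = 0.
w, b = [1.0, 1.0], -3.0
print(svm_predict(w, b, [4.0, 2.0]))  # → 1  (positive side)
print(svm_predict(w, b, [1.0, 1.0]))  # → -1 (negative side)
```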
Figure 8.1: Support vector machine
Chapter 9
k-Nearest Neighbor Categorization
Nearest Neighbor search is an optimization problem of finding the closest points in a space. It is also called similarity search or closest point search. For a given set of points S in a space M and a query point q, the problem is to find the closest point in S to q. Usually the distance is measured by the Euclidean distance.
9.1 k-NN Equation
The k-Nearest Neighbor (k-NN) categorization is the simplest of all the supervised machine learning techniques, yet it is a widely used method for classification and retrieval. It classifies objects based on the closest training examples in the feature space. It is an instance-based learning method, also known as a lazy learning algorithm. A query instance is classified by the majority category of its k nearest neighbors. The k nearest neighbors of a query in the database are found by calculating the Euclidean distance. The neighbors of a query instance are taken from a data set of objects whose categories are already known.
Euclidean distance is calculated as
Figure 9.1: Euclidean distance
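The distance formula and the majority vote over the k nearest neighbors can be sketched as follows; the four labeled training points are invented for illustration.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(query, labeled_points, k):
    """Label the query by majority vote among its k nearest neighbors."""
    nearest = sorted(labeled_points, key=lambda p: euclidean(query, p[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Four invented labeled points in a 2-D feature space.
train = [([1.0, 1.0], "A"), ([1.5, 2.0], "A"),
         ([5.0, 5.0], "B"), ([6.0, 5.5], "B")]
print(knn_classify([1.2, 1.5], train, k=3))  # → A
```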
9.2 kNN Algorithm Example
The figure below shows the feature space for different values of k.
Figure 9.2: k-NN algorithm equation
Chapter 10
Properties
10.1 Properties of Naïve Bayes Categorization
• Naïve Bayes categorization is a probabilistic categorization which is
based on Conditional Independence between features.
• Naïve Bayes classifies an unknown instance by computing the category
which maximizes the posterior probability.
• Naïve Bayes categorization is flexible and robust to errors.
• The prior and the likelihood can be updated incrementally with each training example.
• It outputs a probabilistic hypothesis: not just a classification, but a probability distribution over all categories.
• Naïve Bayes is very efficient; its training time is linear in the size of the data.
• It is easy to implement when compared with other algorithms.
• Naïve Bayes has low variance and high bias compared to other algorithms.
10.2 Limitations of Naïve Bayes Categorization
• Sometimes the assumption of Conditional Independence is violated by
the real world data.
• It gives poor performance when the features are highly correlated.
• It does not consider the frequency of the word occurrences.
• Another problem is that the features are assumed to contribute independently to the result, even when the words are dependent, since each word's contribution is considered individually.
• It cannot be used for solving more complex classification problems.
10.3 Properties of k-Nearest Neighbor Categorization
• Unlike Naïve Bayes, kNN doesn't rely on prior probabilities.
• kNN computes the similarity between a testing instance and all the training examples in a collection.
• It does not explicitly compute a generalization or category prototypes.
• It is also called a case-based, instance-based, memory-based, or lazy learning algorithm.
• k-Nearest Neighbor is a robust method for finding the k most similar examples and returning the majority category of these k instances.
• It can work with relatively little information.
• Nearest Neighbor method depends on the similarity or distance metric.
• K Nearest Neighbor algorithm has the potential advantage for the prob-
lems with large number of classes.
10.4 Limitations of k-Nearest Neighbor Categorization
• Classification time is too long.
• It is difficult to find the optimal value of k.
• If the training data is large and complex, evaluating the target function may slow query processing, and irrelevant attributes may mislead the neighbor search.
Chapter 11
Conclusion
We discussed the background of categorization, presented the different methodologies, and explained them theoretically. We covered the Naive Bayes algorithm, support vector machines, and the k-nearest neighbour algorithm, along with the time efficiencies, advantages, and disadvantages of the two engines. From our study, we observe that the standard precision and recall values of the k-Nearest Neighbor categorization engine are better than those of the Naïve Bayes engine, so kNN performs better overall. Text Categorization remains an active area of research in the fields of information retrieval and machine learning. In future, this study can be extended by implementing the categorization engines on larger datasets.