
Text Classification/Categorization

Using Natural Language Processing For

Automated Text Classification

Abhishek Oswal

March 15, 2016


Contents

1 Introduction

2 Background

3 Types of Learning Techniques

3.1 Supervised Learning

3.1.1 Regression

3.1.2 Classification

3.2 Unsupervised Learning

3.2.1 Clustering

3.3 Comparison between supervised and unsupervised learning

3.4 Examples of different learning techniques

4 Process of Classification

4.1 Data Preprocessing

4.2 Training Set

4.3 Test Set

4.4 Creation of Model

4.5 Algorithm

4.6 Classify


5 Text Categorization

5.1 Mathematical Definition of the Text Classification Task

5.2 Text Representation Format

5.2.1 Bag-of-Words Representation

5.2.2 Document–Term Matrix

5.3 Methods to Classify

6 General Approach

6.1 Precision and Recall

7 Bayesian Categorization

7.1 Bayes Theorem

7.2 Naive Bayes Equation

8 Support Vector Machines

8.1 SVM Equation

9 k-Nearest Neighbor Categorization

9.1 k-NN Equation

9.2 kNN Algorithm Example

10 Properties

10.1 Properties of Naïve Bayes Categorization

10.2 Limitations of Naïve Bayes Categorization

10.3 Properties of k-Nearest Neighbor Categorization

10.4 Limitations of k-Nearest Neighbor Categorization

11 Conclusion


Abstract

With the growth of technology and the Internet, managing text as online information has become a natural necessity. Today we search for books and news on the Internet, and many companies and individuals have their own web pages. Whenever there is some information to find, we search for it on the Internet, so a great deal of information has become openly available. A small amount of information can still be classified by hand, but with today's volumes it is becoming difficult to classify documents manually. Hence we need a fast, automatic approach to classify text into various fields. Text classification is gaining importance due to the accessibility of large numbers of electronic documents from a variety of sources. The classification problem has been studied in Natural Language Processing, Data Mining, and Machine Learning, with applications in diverse domains such as newsgroup filtering, document organization, and target marketing. This report mainly focuses on the analysis of the naive Bayes categorization algorithm for automated text classification.

Keywords: Text Classification, Text Categorization, Naive Bayes, Support Vector Machine, Spam Filtering


List of Figures

3.1 Comparison between supervised and unsupervised learning

3.2 Examples of different learning techniques

5.1 Example representing categorization

5.2 Bag-of-words representation

5.3 Document-term matrix representation

6.1 General approach of classification

6.2 Formula for calculating precision

6.3 Formula for calculating recall

7.1 Bayes Theorem

7.2 General Naive Bayes theorem

7.3 Formula for calculating P(c)

7.4 Formula for calculating P(x|c)

7.5 Maximum likelihood estimate

7.6 Estimating p(x_i|c) with a Laplacian prior

7.7 Formula for predicting category

8.1 Support vector machine

9.1 Euclidean distance


9.2 KNN algorithm equation


Chapter 1

Introduction

Categorization is the process in which objects are recognized and differentiated on the basis of various properties known as features. A category indicates a relationship between subjects of knowledge based on their attributes and objects. Hence categorization implies that objects are grouped into various categories for some specific purpose.

Categorization is used for prediction, decision making, and so on, and the objects we classify may be audio, images, video, text, etc. Text categorization is also known as text classification. Text classification is the process of classifying documents with respect to a group of one or more existing categories. Categories are formed according to the concepts, themes, or relations present in the documents' contents. Current research in text classification mainly aims to improve the quality of text representation, increase efficiency, and develop high-quality classifiers.

The text classification process consists of collecting data documents (gathering), data preprocessing (converting raw data to refined data), indexing, term-weighting methods, and classification algorithms (developing classifiers) based on various features.


The basic goal of text categorization is the classification of documents into a number of predecided categories, where each document can be in exactly one, multiple, or no category at all. Machine learning approaches have been actively explored for classification purposes; among these are Naive Bayes classifiers, k-nearest neighbor classifiers, support vector machines, and neural networks.

Services like mail filters, web filters, and online help desks are based on text classification. Mail filters sort business e-mails from spam e-mails by classifying each e-mail as "ordinary mail" or "spam mail." Web filters prevent children from accessing undesirable website content by classifying web sites into categories. Hence, text classification technology is essential for these services to run.

Most research work in the area of text categorization uses supervised learning methods, which depend on a huge amount of labeled training data to achieve better and faster classification. Because resources of labeled training data are scarce, data must be labeled manually before it can be used for classification, and that is a very long and expensive task. On the other hand, there are wide resources of unlabeled training data that can be utilized for text classification.

Recently, various research efforts have tried to establish methods based on unlabeled training data, rather than using labeled data or manually labeling data of the same group; one method that has worked well is keyword-based text categorization. Keyword-based text classification is mainly based on keyword representations of categories and documents.


Chapter 2

Background

Today's world is weighed down with data and information from various sources, and the IT field has made the collection of data easier than ever before. Data mining is a technique for extracting interesting patterns, known features, and knowledge from very large amounts of data. It mainly helps large business organizations.

Recently, data mining has also attracted the whole IT industry. It helps real-world applications convert large amounts of data into meaningful information. Data mining is used in various fields: business, the banking sector, scientific research, intelligence agencies, social media, robotics, and many more. Categorization is one of the data mining tasks.


Chapter 3

Types of Learning Techniques

Machine learning is the ability to learn from observations, previous experiences, and other means, resulting in a system that can continually improve itself to give increased efficiency and better effectiveness.

There are different types of learning techniques:

1. Supervised learning

2. Unsupervised learning

3. Semi-supervised learning

Text categorization using k-nearest neighbor algorithms and Naive Bayes belongs to the supervised learning techniques.

3.1 Supervised Learning

Supervised learning is a technique in which conclusions are drawn from a training set. A training set is a set which contains pairs of input data and the


category labels to which they belong. The training data is initially categorized by experts in order to construct the categorization model.

Once the categorization model is trained, it must be able to assign test data to its appropriate category. Test data is a set of data used for validating the categorization model developed on the basis of the training data set.

Supervised learning problems are divided into "regression" and "classification" problems.

3.1.1 Regression

In a regression problem, we are trying to predict results within a continuous

output, meaning that we are trying to map input variables to some continuous

function.

3.1.2 Classification

In a classification problem, we are instead trying to predict results in a discrete output. In other words, we are trying to map input variables into discrete categories.

3.2 Unsupervised Learning

Unsupervised learning is a technique for finding a function that describes hidden patterns in unlabeled data. Since the data set given to the learner is unlabeled, there is no error or reward signal with which to evaluate a potential solution.


3.2.1 Clustering

We can derive structure in the data by clustering it based on relationships among the variables. With unsupervised learning there is no feedback based on the prediction results, i.e., there is no teacher to correct you. Unsupervised learning is not just about clustering, however; for example, associative memory is also unsupervised learning.

3.3 Comparison between supervised and unsupervised learning

In supervised learning, output data sets are provided and used to train the machine to produce the desired outputs, whereas in unsupervised learning no labeled data sets are provided; instead, the data is clustered into different classes, and there is no desired output.

Figure 3.1: Comparison between supervised and unsupervised learning


3.4 Examples of different learning techniques

Figure 3.2: Examples of different learning techniques


Chapter 4

Process of Classification

A categorization process is a systematic approach to building the categorization model from an input set of data. It requires a learning algorithm to identify a model that captures the relationship between the attribute set and the class label of the input data. The learning algorithm should both fit the input data well and correctly predict the class labels of previously unseen records.

There are various steps involved in this process.

4.1 Data Preprocessing

Data preprocessing is a data mining technique that involves transforming raw data into an understandable format. Raw data is often inconsistent, incomplete, lacking in certain behaviors, and likely to contain many errors. Data preprocessing prepares raw data for further processing. The data goes through several kinds of steps during preprocessing (a small illustrative sketch follows the list):

• Data Cleaning

• Data Integration


• Data Transformation

• Data Discretization
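As an illustration only, here is a minimal preprocessing sketch in Python; the stop-word list and the tokenizer are simplified assumptions for demonstration, not a prescribed pipeline:

    import re

    # A tiny illustrative stop-word list; real pipelines use much larger ones.
    STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}

    def preprocess(document):
        """Clean one raw document: lowercase, tokenize, drop stop words."""
        document = document.lower()
        tokens = re.findall(r"[a-z]+", document)  # keep alphabetic tokens only
        return [t for t in tokens if t not in STOP_WORDS]

    print(preprocess("The quick brown fox jumps over the lazy dog."))
    # -> ['quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog']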

4.2 Training Set

A training set is a set of data used to find potentially predictive relationships. The training data set is the collection of data records whose class labels are already known; it is used to generate the categorization model, which is then applied to the test data set.

4.3 Test Set

A test set is a set of data used to assess the utility and strength of a predictive relationship. The test data set is the collection of records whose class labels are known but which, when given as input to the built classification model, should be assigned their correct class labels by it. The accuracy of the model is determined from the counts of correct and incorrect predictions on the test records.

4.4 Creation of Model

A model is a first draft of some ideas and principles of modelling, built with the expectation of future clarification, development, and revision. It is used for understanding some part of the world; here, the relationship between document features and class labels.


4.5 Algorithm

An algorithm is a self-contained, step-by-step set of operations to be performed. Algorithms exist that perform data processing, automated reasoning, and calculation. In short, an algorithm is a procedure, a set of steps to be followed to solve a problem.

4.6 Classify

This is the classification of the test data set using the model developed on the basis of the training data set, following a particular algorithm. It is the output we need.


Chapter 5

Text Categorization

Categorization is the classification of data for its most effective and most efficient use. The text classification task is defined as the automatic classification of a document into two or more predetermined classes.

5.1 Mathematical Definition of the Text Classification Task

Let (d_j, c_i) ∈ D × C, where D is the collection of documents and C = {c_1, c_2, ..., c_|C|} is the set of predefined categories. The main task of text categorization is to assign a Boolean value to each pair in D × C.

Consider Fig. 5.1, in which D is the domain of documents and C_1, C_2, and C_3 are different categories. D contains three different kinds of documents. After categorization, each document is assigned to its respective category.

Hence, in simple words, the problem of classification can be defined as follows. We have a set of training records D = {X_1, ..., X_N}, such that each record is labeled with a class value drawn from a set of c different discrete values indexed by {1, ..., c}. The training data is then


Figure 5.1: Example representing categorization

used for the construction of a classification model that relates the features in the underlying records to the class labels.

It must be noted here that the frequency of words also plays a major role in the classification process.

5.2 Text Representation Format

The first step in text categorization is to transform documents, which typically are strings of characters, into a representation suitable for the learning algorithm and the classification process.


5.2.1 Bag-of-Words Representation

Information retrieval research suggests that words and word frequencies work well as representation units, and that their ordering in a document is of minor importance for many tasks such as classification. This leads to the conclusion that an attribute–value representation of text is very appropriate for the text classification process.

Figure 5.2: Bag-of-words representation

5.2.2 Document–Term Matrix

A document-term matrix (or, transposed, a term-document matrix) is a mathematical matrix that describes the frequency of the words occurring in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.

Each distinct word w_i corresponds to a feature, with the number of times w_i occurs in the document as its value. To avoid unnecessarily large feature vectors, words are considered as features only if they occur at least 3 times in the training data and are not stop-words like "and", "or", etc.


Figure 5.3: Document-term matrix representation
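As an illustration, a document-term matrix can be built in a few lines of Python, assuming the scikit-learn library is available; the toy corpus below is invented for demonstration:

    from sklearn.feature_extraction.text import CountVectorizer

    # Toy corpus of three short documents (invented for illustration).
    corpus = [
        "the cat sat on the mat",
        "the dog chased the cat",
        "dogs and cats are pets",
    ]

    # Rows correspond to documents, columns to distinct terms, and each
    # entry holds the term's frequency in that document.
    vectorizer = CountVectorizer(stop_words="english")
    dtm = vectorizer.fit_transform(corpus)

    print(vectorizer.get_feature_names_out())  # the terms (columns)
    print(dtm.toarray())                       # the document-term matrix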

5.3 Methods to Classify

There are many categorization techniques in use. They are:

• Bayesian Categorization.

• K Nearest Neighbor Categorization.

• Decision Tree Categorization.

• Rule Based Categorization.

• Support Vector Machines.

• Neural Networks.


Chapter 6

General Approach

Three major categorization techniques are

• Bayesian

• Support Vector Machine

• kNN

Figure 6.1: General approach of classification


6.1 Precision and Recall

Precision and recall values evaluate the performance of the categorization model. Precision measures exactness, whereas recall measures completeness.

Let TP be the number of true positives, i.e., the number of documents correctly labeled as belonging to the category, as agreed by both the experts and the model. Let FP be the number of false positives, i.e., the number of documents wrongly categorized by the model as belonging to the category. Let FN be the number of false negatives, i.e., the number of documents which are not labeled as belonging to the category but should have been.

Hence, Precision is defined as

Figure 6.2: Formula for calculating precision
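In terms of the counts defined above, the standard formula is

    Precision = TP / (TP + FP)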

Recall is defined as

Figure 6.3: Formula for calculating recall
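Likewise, the standard formula is

    Recall = TP / (TP + FN)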


Chapter 7

Bayesian Categorization

Bayesian categorization is a well-known classification technique. It is used to predict class membership probabilities, i.e., the probability that a given record belongs to a specific category, and it is based on Bayes theorem.

Bayes theorem is a simple mathematical formula used for calculating conditional probabilities.

7.1 Bayes Theorem

Let X be a sample data record whose category is not known, and let H be the assumption that the sample X belongs to a specified category C. We need to determine P(H|X), i.e., the probability that the assumption H holds given the data sample X.

Bayes theorem is shown in Figure 7.1. Here P(H|X) is the posterior probability of H conditioned on X. The posterior probability is based on information such as background knowledge, unlike the prior probability P(H), which is independent of the data sample X. P(X|H) is, analogously, the conditional probability of X given H.


Figure 7.1: Bayes Theorem
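In symbols, with the quantities defined above, the theorem states

    P(H|X) = P(X|H) P(H) / P(X)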

If the given data is huge, however, it is difficult to calculate the above probabilities directly. Conditional independence was introduced to overcome this limitation.

7.2 Naive Bayes Equation

Naive Bayes categorization is one of the simplest forms of probabilistic Bayesian categorization. It is based on the assumption that the effect of an attribute value on a given category is independent of the values of the other attributes, which is called conditional independence. This assumption is used to simplify otherwise complex computations.

The Naive Bayes classifier is a probabilistic classifier based on this naive Bayes assumption. From Bayes rule, the posterior probability can be given as

Figure 7.2: General Naive Bayes theorem
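In symbols:

    P(c|x) = P(x|c) P(c) / P(x)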

where x = (x_1, ..., x_n) is the feature vector and c is a category. Assume


that the category c_max yields the maximum value of P(c|x).

The parameter P(c) is estimated as

Figure 7.3: Formula for calculating P(c)
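A standard estimate (notation ours, supplied as an assumption) is

    P(c) = N_c / N

where N_c is the number of training documents labeled with category c and N is the total number of training documents.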

The classification results are not affected if the denominator p(x) is ignored, because it is independent of the categories.

Assuming that the components of the feature vector are statistically independent of each other, p(x|c) can be calculated as

Figure 7.4: Formula for calculating P(x|c)
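That is, under conditional independence,

    p(x|c) = p(x_1|c) · p(x_2|c) · ... · p(x_n|c)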

If maximum likelihood estimation is used, then p(x_i|c) is estimated as in Figure 7.5:

Figure 7.5: Maximum likelihood estimate

Here N(x_i, c) is the joint frequency of x_i and c; in the standard formulation, p(x_i|c) = N(x_i, c) / N(c), where N(c) is the total count for category c. If some feature value x_i never appears in the training data, the probability of any instance containing x_i becomes zero, regardless of the other features in the vector. Therefore, to avoid zero probabilities, Laplacian prior probabilities are used, and p(x_i|c) is estimated as follows:


Figure 7.6: Estimating p(x_i|c) with a Laplacian prior
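A common form of this smoothed estimate (notation ours; this is the usual add-one form, given here as an assumption) is

    p(x_i|c) = (N(x_i, c) + 1) / (N(c) + |V|)

where |V| denotes the number of distinct feature values (the vocabulary size).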

The Naive Bayes classifier then predicts the category c_max with the largest posterior probability:

Figure 7.7: Formula for predicting category
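In symbols,

    c_max = argmax_c  P(c) · p(x_1|c) · ... · p(x_n|c)

To make the chapter concrete, here is a minimal, hypothetical end-to-end sketch in Python using scikit-learn's MultinomialNB (an assumed library choice; the toy corpus and labels are invented, and alpha=1.0 corresponds to the Laplacian prior discussed above):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Toy labeled corpus (invented): spam vs. ordinary mail.
    train_docs = [
        "win money now",
        "cheap pills offer",
        "meeting at noon",
        "project report attached",
    ]
    train_labels = ["spam", "spam", "ordinary", "ordinary"]

    # Bag-of-words features, then Naive Bayes with add-one smoothing.
    vectorizer = CountVectorizer()
    X_train = vectorizer.fit_transform(train_docs)
    model = MultinomialNB(alpha=1.0)  # alpha=1.0: Laplacian prior
    model.fit(X_train, train_labels)

    X_test = vectorizer.transform(["cheap offer win money"])
    print(model.predict(X_test))  # expected: ['spam']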


Chapter 8

Support Vector Machines

A support vector machine (SVM) is a machine learning method that divides the feature space into a positive-examples side and a negative-examples side. It also creates hyperplanes that bound the margin between the positive and negative examples. These hyperplanes serve as the optimal solution based on the concept of structural risk minimization.

8.1 SVM Equation

SVM calculates the optimal hyperplane that supplies the maximum margin, where w·x + b = 0 is the final decision hyperplane for classification. The training examples lying on w·x + b = 1 and w·x + b = -1 are called support vectors.
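In the standard linear formulation (stated here as background; the report does not spell it out), the distance between these two supporting hyperplanes is 2/||w||, so training amounts to

    minimize (1/2) ||w||^2   subject to   y_i (w·x_i + b) >= 1

for every training pair (x_i, y_i) with y_i in {+1, -1}; maximizing the margin is thus equivalent to minimizing ||w||.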


Figure 8.1: Support vector machine


Chapter 9

k-Nearest Neighbor Categorization

Nearest neighbor search is an optimization problem for finding the closest points in a space; it is also called similarity search or closest-point search. For a given set of points S in a space M and a query point q, the problem is to find the closest point in S to q. Usually the distance is measured by the Euclidean distance.

9.1 k-NN Equation

k-Nearest Neighbor (k-NN) categorization is the simplest of all the supervised machine learning techniques, yet it is a widely used method for classification and retrieval. It classifies objects based on the closest training examples in the feature space. It is an instance-based learner, also known as a lazy learning algorithm. A query instance is classified according to the majority category among its k nearest neighbors, and all k nearest neighbors of a query in a database are found by calculating the Euclidean distance. The


neighbors of a query instance are taken from a data set of objects that have already been categorized, i.e., whose categories are previously known.

The Euclidean distance is calculated as

Figure 9.1: Euclidean distance
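For two points p = (p_1, ..., p_n) and q = (q_1, ..., q_n) in an n-dimensional feature space,

    d(p, q) = sqrt( (p_1 - q_1)^2 + ... + (p_n - q_n)^2 )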

9.2 kNN Algorithm Example

The figure below shows the feature space for different values of k.

Figure 9.2: KNN algorithm equation
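In symbols, the majority-vote rule described above can be written (notation ours) as

    y(q) = argmax_c  |{ x_i in N_k(q) : y_i = c }|

where N_k(q) is the set of the k training examples nearest to the query q. A minimal illustrative sketch in Python, assuming scikit-learn is available (the toy documents and labels are invented):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.neighbors import KNeighborsClassifier

    # Toy labeled corpus (invented): finance vs. sports.
    train_docs = [
        "stock market rises",
        "team wins the match",
        "shares fall sharply",
        "player scores a goal",
    ]
    train_labels = ["finance", "sports", "finance", "sports"]

    vectorizer = CountVectorizer()
    X_train = vectorizer.fit_transform(train_docs)

    # k = 3 neighbors; the default Minkowski metric with p = 2 is
    # exactly the Euclidean distance discussed above.
    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X_train, train_labels)

    query = vectorizer.transform(["stock shares rise in the market"])
    print(knn.predict(query))  # expected: ['finance']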


Chapter 10

Properties

10.1 Properties of Naïve Bayes Categorization

• Naïve Bayes categorization is a probabilistic categorization method based on conditional independence between features.

• Naïve Bayes classifies an unknown instance by computing the category which maximizes the posterior probability.

• Naïve Bayes categorization is flexible and robust to errors.

• The prior and the likelihood can be updated with each new training example.

• It outputs a probabilistic hypothesis: not only a classification, but a probability distribution over all categories.

• Naïve Bayes is very efficient, with running time linear in the size of the data.

• It is easy to implement compared with other algorithms.


• Naïve Bayes has low variance and high bias compared to other algorithms.

10.2 Limitations of Naïve Bayes Categorization

• Sometimes the assumption of Conditional Independence is violated by

the real world data.

• It gives poor performance when the features are highly correlated.

• It does not consider the frequency of the word occurrences.

• Another problem is that the features are assumed to be independent of each other; even when words are dependent, each word's contribution is considered individually.

• It cannot be used for solving more complex classification problems.

10.3 Properties of k-Nearest Neighbor Categorization

• Unlike Naïve Bayes, kNN does not rely on prior probabilities.

• kNN computes the similarity between a test instance and the nearest training examples in a collection.

• It does not explicitly compute a generalization or category prototypes.

• It is also called a case-based, instance-based, memory-based, or lazy learning algorithm.


• k-Nearest Neighbor is a robust way to find the k most similar examples and return the majority category of these k instances.

• It can work with relatively little information.

• The nearest neighbor method depends on the chosen similarity or distance metric.

• The k-nearest neighbor algorithm has a potential advantage for problems with a large number of classes.

10.4 Limitations of k-Nearest Neighbor Categorization

• Classification time is too long.

• It is difficult to find the optimal value of k.

• If the training data is large and complex, evaluating the target function may slow down query processing, and irrelevant attributes may mislead the neighbor search.


Chapter 11

Conclusion

We discussed the background of categorization, presented the different methodologies, and explained them theoretically. We covered the naive Bayes algorithm, support vector machines, and the k-nearest neighbour algorithm, and we then discussed the time efficiency, advantages, and disadvantages of the two engines. From our study we observe that the standard precision and recall values of the k-nearest neighbor categorization engine are better than those of the naïve Bayes engine; of the two, kNN shows the better overall performance. Text categorization is an active area of research in the fields of information retrieval and machine learning. In the future, this study can be extended by implementing the categorization engines on larger datasets.

