large scale hierarchical text classification

Large Scale Hierarchical Text Classification using Hadoop MapReduce

Hammad Haleem (10-css-25)

Pankaj Kumar Sharma (10-css-46)

Department of Computer Engineering

Jamia Millia Islamia University

New Delhi India

Table of Contents

S.no. Topic

1 Objective

2 Hierarchical Text Classification (HTC) and dataset

3 About Our Technique

5 About Hadoop and MapReduce

6 Progress Reporting

7 References

OBJECTIVE

To develop and deploy a cost-effective, near-real-time, distributed hierarchical text classification system that can classify documents into a large hierarchy of categories.

What is Hierarchical Text Classification?

▪ Documents are said to follow a hierarchical classification if:

– A single document can belong to more than one category.

– The categories are themselves arranged in a hierarchy, so a single category can contain multiple documents and even multiple child categories.

Datasets Used

Reuters Corpus (RCV1)

• Raw data from news articles.

• Approximately 806,791 documents in 1000+ categories.

Wikipedia Dataset

• The data is in the SVM format and requires very little preprocessing.

• The dataset has 2,400,000 documents in 325,000 categories.
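A minimal sketch of reading one record in a sparse SVM-style format is given below. It assumes each line looks like `label,label feature:value feature:value`, with one or more comma-separated category labels followed by sparse features; the exact layout of the dataset files should be checked against the dataset description, and the function name `parse_libsvm_line` is only illustrative.

```python
def parse_libsvm_line(line):
    """Parse one multi-label sparse record into (labels, features).

    Assumed layout: 'label,label feature:value feature:value ...'.
    Tokens containing ':' are treated as features, all others as labels.
    """
    labels = []
    features = {}
    for token in line.split():
        if ":" in token:
            index, value = token.split(":", 1)
            features[int(index)] = float(value)
        else:
            # A label token may carry a trailing comma, e.g. "3," or "3,7".
            labels.extend(int(part) for part in token.split(",") if part)
    return labels, features


labels, features = parse_libsvm_line("3, 7 1:2 5:1")
```

Keeping each document as a dictionary from feature index to value avoids storing the zero entries of a very large vocabulary.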

OUR APPROACH - DETAILED DESCRIPTION OF ALGORITHM

We divided the development time into two phases.

1. In the initial phase, we set up the environment and wrote helper functions to support further development.

2. In the second phase, we used these functions to perform training and testing on the actual data.

The following sections describe in detail the steps performed in each phase of the project.

INITIAL PHASE: FUNCTIONS WE DEVELOPED

These functions are used frequently throughout the project, so we describe them here.

● Train (TrainingDocumentSet D, CategoryTree C)

○ TrainingDocumentSet D is the set of training documents (i.e., documents already labelled with their corresponding categories).

○ CategoryTree C contains the list of categories and the parent-child relations between them.

● Classify (Document d, CategoryTree C, TrainedClassifier TC)

○ Document d is the document to be classified.

○ CategoryTree C contains the list of categories and the hierarchical parent-child relations between them.

○ TrainedClassifier TC is the output of the Train call.

● Tf-Idf-Calculator (DocumentSet D)

○ DocumentSet D is the set of text documents for which Tf-Idf weights have to be calculated.

● CosineDistance (Document d1, Document d2)

○ Returns the cosine distance between documents d1 and d2.
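As an illustration of the last function, here is a minimal sketch of a cosine distance between two documents. It assumes each document is a sparse vector stored as a dictionary from term to weight, and it defines distance as one minus cosine similarity; both choices are assumptions for this sketch, not necessarily the project's actual implementation.

```python
import math


def cosine_distance(d1, d2):
    """Return 1 - cosine similarity of two sparse vectors (dict: term -> weight)."""
    # Iterate over the shorter vector when computing the dot product.
    if len(d1) > len(d2):
        d1, d2 = d2, d1
    dot = sum(weight * d2.get(term, 0.0) for term, weight in d1.items())
    norm1 = math.sqrt(sum(weight * weight for weight in d1.values()))
    norm2 = math.sqrt(sum(weight * weight for weight in d2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        # An empty document shares nothing with any other document.
        return 1.0
    return 1.0 - dot / (norm1 * norm2)


same_direction = cosine_distance({"hadoop": 1.0}, {"hadoop": 2.0})
no_overlap = cosine_distance({"hadoop": 1.0}, {"reuters": 1.0})
```

Documents pointing in the same direction get a distance close to 0 regardless of length, and documents with no shared terms get a distance of 1.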

HIGH LEVEL VIEW OF THE WHOLE CLASSIFICATION ALGORITHM.

DEEP HIERARCHICAL CLASSIFICATION : TRAINING PHASE

▪ The following diagram gives an overview of the training technique.

▪ The training follows a lazy-learner technique and is quite efficient for this problem.


DCLTH is a lazy learning algorithm, so the training phase is quite straightforward. First, all training documents undergo preprocessing. In this layer, the following activities are performed:

1. Removal of HTML tags and other noisy data from the documents.

2. Conversion of special document formats such as DOC and PDF to a simple word-based format.

3. Stemming, i.e., converting words to their corresponding root words. (We’ll use Porter stemming.)

4. Stop-word removal, i.e., removing common words like “a”, “the”, etc.

5. Representation of the documents in libSVM format.
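The preprocessing steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the stop-word list is a tiny sample, the `vocabulary` mapping from word to feature index is hypothetical, format conversion (step 2) is left out, and stemming (step 3) is omitted because a Porter stemmer would need an external library.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"a", "an", "the", "and", "of", "to", "in", "is"}


def preprocess(html_text):
    """Strip HTML tags, lowercase, tokenize and drop stop words."""
    text = re.sub(r"<[^>]+>", " ", html_text)      # step 1: remove tags
    tokens = re.findall(r"[a-z]+", text.lower())   # crude word tokenizer
    return [t for t in tokens if t not in STOP_WORDS]  # step 4


def to_libsvm(tokens, vocabulary, labels):
    """Encode tokens as 'label,label index:count ...' (step 5).

    `vocabulary` maps word -> integer feature index (hypothetical helper data).
    Words missing from the vocabulary are skipped.
    """
    counts = Counter(vocabulary[t] for t in tokens if t in vocabulary)
    features = " ".join(f"{index}:{count}" for index, count in sorted(counts.items()))
    return f"{','.join(str(label) for label in labels)} {features}"


tokens = preprocess("<p>The cat and the hat</p>")
line = to_libsvm(["cat", "hat", "cat"], {"cat": 1, "hat": 2}, [5])
```

Sorting the feature indices keeps each output line in the ascending order that libSVM-style tools usually expect.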


After preprocessing, the document set is passed to the Tf-Idf calculator, which finds the Tf-Idf value of each word in each document. We use the following definitions of Tf, Idf and Tf-Idf:

Tf(w, d) = f(w, d) / sum(d)

where f(w, d) is the frequency of word w in document d, and sum(d) is the sum of the frequencies of all unique words in d.

Idf(w, d, D) = log(N / (1 + n))

where N is the number of documents in set D, and n is the number of documents in which w appears. 1 is added to avoid a division-by-zero error. Finally,

Tf-Idf(w, d, D) = Tf(w, d) * Idf(w, d, D)

This information is then stored for later use, which completes the training phase of DCLTH.
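The definitions above translate almost directly into code. The sketch below assumes each document is already a list of preprocessed tokens and uses the natural logarithm (the base is not stated in the definitions); the function name `tf_idf` is illustrative.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute Tf-Idf weights for a list of token lists.

    Tf(w, d)  = f(w, d) / sum(d)
    Idf(w, D) = log(N / (1 + n))
    Returns one dict (word -> weight) per document.
    """
    n_docs = len(documents)
    # n: number of documents in which each word appears.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = sum(counts.values())
        weights.append({
            word: (freq / total) * math.log(n_docs / (1 + doc_freq[word]))
            for word, freq in counts.items()
        })
    return weights


corpus = [["a", "b"], ["a", "c"], ["a", "a"], ["d"]]
result = tf_idf(corpus)
```

One consequence of the 1 + n term is that a word appearing in every document gets a slightly negative Idf, since N / (1 + N) is below 1.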

DEEP HIERARCHICAL CLASSIFICATION : TESTING PHASE

❏ The novel Deep Classification approach to Hierarchical Text Classification performs better than other classification algorithms because, instead of applying classification directly over all the categories, it first divides the categories into two groups, namely related and unrelated categories, and then uses the hierarchical information of the related categories in a top-down approach to classify the document.

❏ The testing process has two levels.


LEVEL 1

The aim of this stage is to divide the large category set into two subsets, namely “related” and “unrelated”. For any document d, the number of related categories is much smaller than the number of unrelated categories. The point worth noting is that, from a huge set of categories, only a small number remain on which the real classification takes place. This phase works as follows:

1. The given document is preprocessed (in the same way as explained in the “Training” stage).

2. The processed document is then used to find the k most similar documents. For this we employ cosine distance and the k-NN algorithm.

3. From these k documents we get the set of related categories (called candidate categories).
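The Level 1 search can be sketched as below. The sketch assumes documents are sparse Tf-Idf vectors stored as dictionaries and that the training set is a list of (vector, categories) pairs; it ranks by cosine similarity, which is equivalent to ranking by smallest cosine distance. The names `cosine_similarity` and `candidate_categories` are illustrative only.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors (dict: term -> weight)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def candidate_categories(document, training_set, k):
    """Return the union of the categories of the k most similar training documents.

    `training_set` is a list of (vector, categories) pairs.
    """
    nearest = heapq.nlargest(
        k, training_set, key=lambda item: cosine_similarity(document, item[0])
    )
    candidates = set()
    for _, categories in nearest:
        candidates.update(categories)
    return candidates


training = [
    ({"x": 1.0}, [1, 2]),
    ({"y": 1.0}, [3]),
    ({"x": 1.0, "y": 0.1}, [2, 4]),
]
related = candidate_categories({"x": 1.0}, training, k=2)
```

This brute-force scan compares the query with every training document, which is the part of the work that would be split across cluster nodes.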


LEVEL 2

The aim of the second stage is to work on the candidate categories and the processed document (both obtained from the first stage) and to obtain the most probable categories from them.

This level works as follows:

1. From the given candidate categories, a pruned tree is formed. An example of such a tree is given in the following diagram.

2. Starting from the root of this pruned tree, we apply the standard Naive Bayes classifier to prune the tree further.

3. The algorithm stops when we reach the bottom-most level.

4. The categories remaining in the final pruned tree (called the final categories) are assigned to the document.
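A rough sketch of the Level 2 steps follows. It assumes the category hierarchy is available as a child-to-parent dictionary, builds the pruned tree from the candidate categories and their ancestors, and then walks down from the root. The `score` function is a hypothetical stand-in for the Naive Bayes posterior of a category given the document, and the descent follows only the single best child at each level, which is a simplification.

```python
def build_pruned_tree(candidates, parent):
    """Keep only candidate categories and their ancestors.

    `parent` maps each category to its parent (the root maps to None).
    Returns a dict: node -> set of children kept in the pruned tree.
    """
    children = {}
    for category in candidates:
        node = category
        while node is not None:
            above = parent.get(node)
            children.setdefault(node, set())
            if above is not None:
                children.setdefault(above, set()).add(node)
            node = above
    return children


def classify_top_down(root, children, score):
    """Descend from the root, choosing the best-scoring child at each level."""
    node = root
    while children.get(node):
        # sorted() makes the choice deterministic if two scores tie.
        node = max(sorted(children[node]), key=score)
    return node


parent = {
    "root": None,
    "sports": "root", "science": "root",
    "football": "sports", "tennis": "sports",
    "physics": "science", "biology": "science",
}
tree = build_pruned_tree({"football", "physics"}, parent)

# Hypothetical scores standing in for Naive Bayes posteriors.
scores = {"sports": 0.7, "science": 0.3, "football": 0.9, "physics": 0.8}
final = classify_top_down("root", tree, lambda category: scores[category])
```

Because "tennis" and "biology" are not candidates, they never enter the pruned tree, so the classifier at each level only has to choose among a handful of children.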

DID WE SAY “DISTRIBUTED”?

❏ Most of the techniques under the above two approaches to HTC are sequential and have been run and tested on a single system, not on a cluster.

❏ This project aims to:

❏ Deploy one of these popular approaches to run in parallel on a Hadoop cluster.

❏ Develop our own approach, optimised to perform best in parallel.

Distributed Processing and Hadoop

▪ The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

▪ It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

▪ Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

More from our Hadoop Cluster

1. We showcase some screenshots from the Hadoop cluster in the next few slides.

a. From various panels of the Hadoop platform:

i. Node Manager GUI

ii. DFS GUI

iii. Job Tracker GUI

What is MapReduce and how did Hadoop help us?

▪ MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster.

▪ The model has two steps:

– "Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node.

– "Reduce" step: The master node then collects the answers to all the subproblems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.
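To make the two steps concrete, here is a small in-process simulation of the MapReduce pattern in plain Python, applied to counting in how many documents each word appears (the n needed for Idf). It does not use the Hadoop API; the function names and the single-machine "shuffle" are assumptions made only to show the data flow.

```python
from collections import defaultdict


def map_phase(doc_id, tokens):
    """Map: emit (word, 1) once for every distinct word in one document."""
    return [(word, 1) for word in set(tokens)]


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between the steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reduce: combine the partial counts for one word."""
    return word, sum(counts)


def document_frequencies(corpus):
    """Run map, shuffle and reduce over a dict of doc_id -> token list."""
    pairs = []
    for doc_id, tokens in corpus.items():
        pairs.extend(map_phase(doc_id, tokens))
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


frequencies = document_frequencies({1: ["a", "b", "a"], 2: ["a", "c"]})
```

Each call to `map_phase` depends on a single document and each call to `reduce_phase` on a single word, which is what lets a cluster run many of them at the same time on different nodes.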

PROGRESS REPORT

▪ Setup of the cluster.

▪ Analysis of various algorithms for training and testing.

▪ Training on the dataset.

▪ Testing on the dataset.

▪ Implementing the whole classifier without MapReduce.

▪ Implementing the classifier with MapReduce.

References

1. www.kaggle.com/c/lshtc [Dataset and Problem description]

2. Distributed Hierarchical text classification framework, US Patent US 7809723 B2

3. www.about.reuters.com/researchandstandards/corpus

4. www.hadoop.apache.org [Hadoop docs]

5. Hadoop: The Definitive Guide.

6. DKPro TC http://code.google.com/p/dkpro-tc/

7. RTextTools https://github.com/timjurka/RTextTools

8. DigitalPebble TC https://github.com/DigitalPebble/TextClassification

9. Some other techniques in addition to DCLTH are:

10. Dumais, Susan, and Hao Chen. "Hierarchical classification of Web content." Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 2000.

11. Granitzer, Michael. "Hierarchical text classification using methods from machine learning." Master's Thesis, Graz University of Technology (2003).
