7.1 INTRODUCTION
Clustering is an unsupervised task whereas classification is supervised
in nature. In the context of machine learning, classification of instances
of a dataset is carried out by a classifier after the classifier is made to
learn the model from a training dataset. The training data consists of
instances which are labeled by a human expert. The labels are the
classes into which the instances of the dataset are divided and are fixed
by the human expert. The essence is that human intervention is required
in the form of preparing the training data for the machine to carry out
the task of classification. Clustering of large datasets is widely regarded as a difficult problem, since it attempts to group instances together without the expertise of a human supervisor. The time complexity of algorithms such as K-Medoids is also unacceptably high, even for moderately large datasets.
The work reported in this chapter aims to integrate both clustering
and classification in the context of web page categorization. Clustering is
used in preparing the training data for the classifier rather than using
training data created by a human expert.
In this chapter, a new approach which integrates clustering and classification, called the Integrated Machine Learning Approach (IMLA), is presented. In the process of integrating clustering and classification, the method uses the Find-K algorithm [171] and the modified QuickReduct algorithm [172]. The new method is applied to the domain of automatic web page categorization.
7.2 Methodology of Integrated Machine Learning Approach
The steps involved in the newly proposed technique are as follows:
1. Creating the dataset
2. Finding the number of clusters (K) using the Find-K algorithm
3. Labeling the clustered web pages to create the training dataset for the classifier
4. Reducing the dimensionality of the dataset using the modified QuickReduct algorithm
5. Classifying the remaining dataset
The steps are elaborated in the following subsections.
7.2.1 Creating the Dataset
A dataset consisting of M web page snippets returned by any search engine in response to a given query should initially be created. In order to clearly bring out the differences between the selected web snippets, the keywords forming the query submitted to the search engine should be removed from the snippets. This is done keeping in view the fact that all the web pages contain these keywords even though they belong to different categories. These common words tend to mislead the categorization process and are therefore removed. A small part of the dataset is shown for illustration purposes in Fig. 7.1.
Figure 7.1 A part of the Apple110 Dataset
Once the dataset consisting of the M web snippets is created, a part of it consisting of N web snippets, where N < M, is selected. The selection of this subset is critical to the accuracy of the overall result, since it acts as the training data for the classifier. In a real-world scenario, the instances can be picked randomly and should form a sizeable portion of the initial dataset containing M pages, so that all the categories present are sufficiently represented.
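As a minimal sketch of this preprocessing step (assuming the snippets are plain Python strings; the function name is invented for illustration, not taken from the original implementation), the query keywords can be stripped from each snippet as follows:

```python
import re

def strip_query_terms(snippets, query):
    """Remove the query keywords from every snippet: all returned pages
    contain them, so they carry no discriminating information."""
    terms = [re.escape(t) for t in query.lower().split()]
    pattern = re.compile(r"\b(" + "|".join(terms) + r")\b", re.IGNORECASE)
    # Drop the keywords, then collapse the whitespace left behind.
    return [re.sub(r"\s+", " ", pattern.sub("", s)).strip() for s in snippets]

snippets = ["Buy an Apple iPod online", "Apple pie recipes"]
print(strip_query_terms(snippets, "apple"))  # → ['Buy an iPod online', 'pie recipes']
```

Note that only the exact query words are removed; morphological variants such as "apples" would survive this simple whole-word match.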
7.2.2 Finding the Clusters using Find-K
The first step in creating an automatically labeled training dataset for a classifier is to cluster the data using a partitioning-based clustering algorithm. The reason for using this class of algorithms is that it produces a set of flat, structureless and disjoint clusters, which can be treated as the classes to which the instances of the dataset belong. The biggest challenge in running a partitioning algorithm such as K-Means or K-Medoids, however, is that the number of partitions K must be known a priori, since the algorithm requires it as input. The Find-K algorithm [171], proposed in this work, uses the K-Medoids clustering algorithm to automatically predict the correct number of clusters present in a given text dataset, and it has been used here for exactly this purpose. An example of the result of the clustering task for a dataset consisting of 30 pages and 6 clusters is shown in Fig. 7.2.

CLUSTER 1 : doc11, doc12, doc13, doc14, doc15
CLUSTER 2 : doc24, doc21, doc22, doc23, doc25
CLUSTER 3 : doc28, doc27, doc29, doc30
CLUSTER 4 : doc7, doc6, doc8, doc9, doc10
CLUSTER 5 : doc17, doc2, doc18, doc19, doc20
CLUSTER 6 : doc1, doc3, doc4, doc5, doc16, doc26

Figure 7.2 Clusters created from a 30 instance dataset
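Find-K itself is described in [171]. Purely as a generic illustration of predicting K automatically, the sketch below runs a plain K-Medoids for each candidate K and keeps the value with the best average silhouette width; the deterministic seeding and the silhouette criterion are assumptions of this sketch, not necessarily what Find-K does.

```python
import numpy as np

def farthest_first_medoids(dist, k):
    """Deterministic seeding: start from instance 0, then repeatedly add
    the instance farthest from the medoids chosen so far."""
    medoids = [0]
    while len(medoids) < k:
        medoids.append(int(np.argmax(dist[:, medoids].min(axis=1))))
    return np.array(medoids)

def kmedoids(dist, k, iters=20):
    """Plain K-Medoids on a precomputed distance matrix."""
    medoids = farthest_first_medoids(dist, k)
    for _ in range(iters):
        labels = np.argmin(dist[:, medoids], axis=1)
        new = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if len(members):  # keep the old medoid if a cluster empties out
                new[c] = members[np.argmin(dist[np.ix_(members, members)].sum(axis=1))]
        if np.array_equal(new, medoids):
            break
        medoids = new
    return np.argmin(dist[:, medoids], axis=1)

def mean_silhouette(dist, labels):
    """Average silhouette width: higher means tighter, better-separated clusters."""
    scores = []
    for i in range(len(labels)):
        same = labels == labels[i]
        same[i] = False
        if not same.any():
            scores.append(0.0)
            continue
        a = dist[i, same].mean()                      # mean intra-cluster distance
        b = min(dist[i, labels == c].mean()           # nearest other cluster
                for c in set(labels.tolist()) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Three well-separated point groups; the predicted K should come out as 3.
pts = np.array([[0, 0], [0, 1], [1, 0],
                [10, 10], [10, 11], [11, 10],
                [20, 0], [20, 1], [21, 0]], dtype=float)
dist = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
best_k = max(range(2, 6), key=lambda k: mean_silhouette(dist, kmedoids(dist, k)))
print(best_k)  # → 3
```

The same loop structure applies to text data once the snippets are vectorized and a document-to-document distance matrix is computed.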
7.2.3 Creating the Training Dataset for the Classifier
In a way, the clustering task carried out by an algorithm can be viewed
as similar to the creation of a training data by a human expert. However,
there is a huge difference in the abilities of an algorithm and a human
expert in assimilating the similarity or difference between a pair of web
page snippets. Web snippets are essentially made up of text and it is well
known that human brains are far ahead of machines in the area of
language processing even today. This is precisely the challenge addressed here. Clustering, which is an unsupervised machine learning approach, groups data instances based on their similarity. The measure of similarity is critical to the clustering process, since it indicates whether two instances may belong to the same cluster. The human expert,
on the other hand, depends on the knowledge and experience he or she
has gained over the years in attaching labels to the instance of a dataset.
Once the instances are clustered, the number of the cluster to which an individual instance belongs is attached as a label to that instance. In this way, all the instances are attached with their
corresponding cluster number which acts as the class label. A training
dataset is thus created. A partial training dataset can be seen in Fig.7.3.
It may be noted that the web snippets have been converted into the
vector space model. In the vector space model, the documents are
represented by the document-term matrix. The rows of the matrix
represent the documents and the columns of the matrix represent
the terms present in the total document set. If there is a ‘0’ at a
document-term position, it means that the document does not contain
that term. On the other hand, if ‘1’ is present, it means that the
document contains that particular term in it.
Figure 7.3 A partial training dataset
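The conversion described above can be sketched as follows (a toy illustration; the function name and the whitespace tokenization are assumptions made here, and real snippets would be tokenized more carefully):

```python
def build_training_table(snippets, cluster_labels):
    """Binary vector space model: one row per snippet, one column per term;
    the cluster number is appended as the class label."""
    vocab = sorted({t for s in snippets for t in s.lower().split()})
    rows = []
    for snippet, label in zip(snippets, cluster_labels):
        terms = set(snippet.lower().split())
        rows.append([1 if t in terms else 0 for t in vocab] + [label])
    return vocab + ["class"], rows

header, rows = build_training_table(
    ["ipod nano video", "apple fruit juice", "ipod shuffle"],
    [1, 2, 1])
print(header)
print(rows)
```

Each row is a 0/1 presence vector over the whole vocabulary, with the cluster number in the last column playing the role of the human-assigned class label.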
7.2.4 Dimensionality Reduction Using Modified QuickReduct
Once the dataset is clustered and subsequently labeled, the modified QuickReduct algorithm [172] is applied to the dataset to reduce its dimensionality. A snapshot of the partial dataset after reduction is shown in Fig. 7.4. It should be noted here that the original dataset contained 625 features (terms), whereas the reduced dataset is represented by just 5 features.
Figure 7.4 Dataset with reduced number of features
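The modified variant is given in [172]; the sketch below implements only the standard QuickReduct idea (greedily growing an attribute subset until its rough-set dependency degree matches that of the full attribute set), applied to binary document-term rows of the kind described earlier:

```python
from collections import defaultdict

def dependency(rows, labels, attrs):
    """Rough-set dependency degree: the fraction of instances whose
    equivalence class under `attrs` is pure with respect to the label."""
    decisions = defaultdict(set)
    for row, y in zip(rows, labels):
        decisions[tuple(row[a] for a in attrs)].add(y)
    pure = sum(1 for row, _ in zip(rows, labels)
               if len(decisions[tuple(row[a] for a in attrs)]) == 1)
    return pure / len(rows)

def quickreduct(rows, labels):
    """Greedily add the attribute that most increases the dependency
    degree until the reduct explains the labels as well as all attributes."""
    n = len(rows[0])
    target = dependency(rows, labels, range(n))
    reduct = []
    while dependency(rows, labels, reduct) < target:
        best = max((a for a in range(n) if a not in reduct),
                   key=lambda a: dependency(rows, labels, reduct + [a]))
        reduct.append(best)
    return sorted(reduct)

# Attribute 2 alone decides the label, so it forms a reduct by itself.
rows = [[1, 0, 0], [1, 0, 1], [0, 1, 0], [0, 1, 1]]
labels = [0, 1, 0, 1]
print(quickreduct(rows, labels))  # → [2]
```

On the 625-term dataset of this chapter, the same greedy loop would keep adding terms until the dependency degree of the reduct matches that of the full term set.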
7.2.5 Using a Classifier to Classify the Test Instances
A part of the total dataset is used as the training set and the remaining
part is used to test the accuracy of classification. For the purpose of
classification, an implementation of the well known C4.5 algorithm,
known as J48, from the WEKA toolbox has been used. This particular
implementation of the classifier provides a facility wherein we can train
the classifier with the training data and then obtain the predictions on
the unlabelled test dataset. It is ensured that the internal representation of both the training set and the test set is exactly the same. Fig. 7.5 shows a sample test dataset, where the '?' symbol replaces the class label.
Figure 7.5 A Sample test data set
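The experiments here use WEKA's J48, which runs in Java. Purely as an illustration of the same train-then-predict workflow in Python, the sketch below uses scikit-learn's DecisionTreeClassifier (CART rather than C4.5, so the induced trees may differ) on toy binary term vectors with cluster-derived labels:

```python
from sklearn.tree import DecisionTreeClassifier

X_train = [[1, 0, 1], [1, 0, 0], [0, 1, 0], [0, 1, 1]]
y_train = ["ipod", "ipod", "fruit", "fruit"]   # labels taken from cluster numbers
X_test = [[1, 0, 0], [0, 1, 1]]                # class label unknown ('?')

# Train on the automatically labeled data, then predict the test instances.
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(clf.predict(X_test))  # → ['ipod' 'fruit']
```

As in the chapter, the only requirement is that the training and test instances share exactly the same feature representation.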
7.3 Experimental Setup and Results
In order to obtain a perspective on the performance of IMLA, its results are compared with two traditional machine learning approaches:
1. Clustering the entire dataset using Find-K
2. Classification with a training dataset created by a human expert
In order to carry out the experiments, 110 web snippets returned by the Google search engine in response to the query "apple" have been manually collected. This dataset formed the basis for all the experiments carried out and reported in this work.
7.3.1 CASE I (80 Training, 30 Test)
In this case, 80 out of the 110 web snippets have been selected to form
the training dataset. To begin with, the Find-K algorithm is run on this
dataset and the number of clusters has been found to be 6. This dataset of 80 instances is then turned into a training dataset by attaching the corresponding cluster number/name to the individual instances. The
remaining 30 instances are used as the test data and the results are
noted.
Further, the dataset containing 80 instances is labeled by the
authors and again used as the training set, for comparison purposes.
This is to compare the effectiveness of automatic generation of training
data with the conventional method of human generated training data.
These results are further compared with pure clustering, where the
entire dataset consisting of 110 instances is clustered into 6 clusters.
The confusion matrices obtained after applying IMLA, classification
and clustering are presented in Tables 7.1, 7.2 and 7.3, respectively.
From the confusion matrix, it is very easy to identify the number of true
positives, false positives and false negatives. By examining a row, we can
obtain the true positives and the false negatives. On the other hand, the
column values provide us with the true positives and the false positives.
Table 7.4 contains a summary of the results obtained using the three
different methodologies, i.e., that of the newly proposed integrated
method and the traditional clustering and classification.
The comparison has been based on three parameters, namely, precision, recall and F-Measure, which have been calculated in the following manner:

Precision = True Positives / (True Positives + False Positives)
Recall = True Positives / (True Positives + False Negatives)
F-Measure = (2 × Precision × Recall) / (Precision + Recall)
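These per-class quantities can be read off a confusion matrix programmatically; in the sketch below (the helper name is invented here), rows are actual classes and columns are predicted classes, and the values reproduced for the ipod class match the IMLA column of Case I:

```python
def scores_from_confusion(matrix, classes):
    """Per-class precision, recall and F-measure from a confusion matrix
    whose rows are actual classes and columns are predicted classes."""
    results = {}
    for i, name in enumerate(classes):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp                      # rest of the row
        fp = sum(row[i] for row in matrix) - tp       # rest of the column
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        results[name] = (round(p, 2), round(r, 2), round(f, 2))
    return results

# Confusion matrix of Table 7.1 (IMLA, Case I)
imla = [[5, 0, 0, 0, 0, 0],
        [0, 5, 0, 0, 0, 0],
        [1, 0, 4, 0, 0, 0],
        [0, 0, 0, 5, 0, 0],
        [1, 0, 0, 0, 4, 0],
        [0, 0, 0, 0, 0, 5]]
names = ["ipod", "trailer", "itunes", "laptop", "iphone", "fruit"]
print(scores_from_confusion(imla, names)["ipod"])  # → (0.71, 1.0, 0.83)
```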
It can be seen from the results that for the dataset under
consideration, the average precision is better in the integrated method
when compared to the other two methods. The F-measure value,
however, is better than the traditional classification method but slightly
lower than the pure clustering.
classified as ----> a     b     c     d     e     f
a = ipod            5     0     0     0     0     0
b = trailer         0     5     0     0     0     0
c = itunes          1     0     4     0     0     0
d = laptop          0     0     0     5     0     0
e = iphone          1     0     0     0     4     0
f = fruit           0     0     0     0     0     5

Table 7.1 Confusion Matrix for IMLA-Case I
classified as ----> a     b     c     d     e     f
a = ipod            4     0     1     0     0     0
b = trailer         0     5     0     0     0     0
c = itunes          0     0     5     0     0     0
d = laptop          0     0     0     5     0     0
e = iphone          1     0     0     0     4     0
f = fruit           0     0     0     0     0     5

Table 7.2 Confusion Matrix for Classification-Case I
classified as ----> a     b     c     d     e     f
a = ipod            18    0     1     1     0     0
b = trailer         0     25    0     0     0     0
c = itunes          1     1     13    0     0     0
d = laptop          0     0     0     14    1     0
e = iphone          1     0     0     0     9     0
f = fruit           1     0     0     0     0     25

Table 7.3 Confusion Matrix for Clustering-Case I
Table 7.4 Comparison of Categorization Accuracy (80 Training + 30 Test)

Class   |       Precision        |         Recall         |       F-Measure
        | IMLA  Classif.  Clust. | IMLA  Classif.  Clust. | IMLA  Classif.  Clust.
ipod    | 0.71  0.800     0.900  | 1.00  0.800     0.900  | 0.83  0.800     0.900
trailer | 1.00  1.000     0.961  | 1.00  1.000     1.000  | 1.00  1.000     0.980
itunes  | 1.00  0.833     0.928  | 0.80  1.000     0.867  | 0.88  0.909     0.896
laptop  | 1.00  1.000     0.933  | 1.00  1.000     0.933  | 1.00  1.000     0.933
iphone  | 1.00  1.000     0.900  | 0.80  0.800     0.900  | 0.88  0.889     0.900
fruit   | 1.00  1.000     1.000  | 1.00  1.000     1.000  | 1.00  1.000     1.000
Average | 0.95  0.939     0.937  | 0.93  0.933     0.933  | 0.93  0.933     0.935

(Classif. = Classification, Clust. = Clustering)
7.3.2 CASE II (30 Training, 80 Test)
The same procedure as explained in Case I is carried out here. The
difference is that now the size of the training set is 30 and that of the test
set is 80. The confusion matrix for this case using IMLA is presented in
Table 7.5. The results for all the three methods are presented in Table
7.6. In this case, the newly proposed integrated method clearly outperforms the traditional methods: the values of both precision and recall, and thereby of the F-measure, are much better than those of the pure classification and clustering techniques.
classified as ----> a     b     c     d     e     f
a = ipod            15    0     0     0     0     0
b = trailer         0     20    0     0     0     0
c = itunes          0     1     9     0     0     0
d = laptop          0     0     0     10    0     0
e = iphone          1     0     0     0     4     0
f = fruit           0     0     0     0     0     20

Table 7.5 Confusion Matrix for IMLA-Case II
Table 7.6 Comparison of Categorization Accuracy (30 Training + 80 Test)

Class   |       Precision        |         Recall         |       F-Measure
        | IMLA  Classif.  Clust. | IMLA  Classif.  Clust. | IMLA  Classif.  Clust.
ipod    | 0.93  1.000     0.900  | 1.00  0.867     0.900  | 0.96  0.929     0.900
trailer | 0.95  0.944     0.961  | 1.00  0.850     1.000  | 0.97  0.895     0.980
itunes  | 1.00  1.000     0.928  | 0.90  0.900     0.867  | 0.94  0.947     0.896
laptop  | 1.00  1.000     0.933  | 1.00  0.900     0.933  | 1.00  0.947     0.933
iphone  | 1.00  0.455     0.900  | 0.80  1.000     0.900  | 0.88  0.625     0.900
fruit   | 1.00  1.000     1.000  | 1.00  1.000     1.000  | 1.00  1.000     1.000
Average | 0.98  0.899     0.937  | 0.95  0.919     0.933  | 0.96  0.890     0.934

(Classif. = Classification, Clust. = Clustering)
7.3.3 CASE III (42 Training, 68 Test)
The confusion matrix obtained after classification by the C4.5 decision
tree classifier is shown in Table 7.7. The results for this case are
presented in Table 7.8. In this case, the newly proposed integrated
method exactly equals the performance of classification method and
outperforms the clustering method.
classified as ----> a     b     c     d     e     f
a = ipod            12    0     1     0     0     0
b = trailer         0     8     0     0     0     0
c = itunes          0     0     8     0     0     0
d = laptop          1     0     0     2     0     0
e = iphone          0     0     0     0     18    0
f = fruit           0     0     2     0     0     16

Table 7.7 Confusion Matrix for IMLA-Case III
Table 7.8 Comparison of Categorization Accuracy (42 Training + 68 Test)

Class   |       Precision        |         Recall         |       F-Measure
        | IMLA  Classif.  Clust. | IMLA   Classif.  Clust.| IMLA  Classif.  Clust.
ipod    | 0.92  0.923     0.900  | 0.923  0.923  0.900    | 0.92  0.923     0.900
trailer | 1.00  1.000     0.961  | 0.889  0.889  1.000    | 0.94  0.941     0.980
itunes  | 0.72  0.727     0.928  | 1.00   1.000  0.867    | 0.84  0.842     0.896
laptop  | 1.00  1.000     0.933  | 1.00   1.000  0.933    | 1.00  1.000     0.933
iphone  | 1.00  1.000     0.900  | 0.66   0.667  0.900    | 0.80  0.800     0.900
fruit   | 1.00  1.000     1.000  | 1.00   1.000  1.000    | 1.00  1.000     1.000
Average | 0.94  0.941     0.937  | 0.91   0.913  0.933    | 0.91  0.917     0.934

(Classif. = Classification, Clust. = Clustering)
7.4 Summary
A new approach called IMLA (Integrated Machine Learning Approach), which integrates classification and clustering, has been presented in this chapter. The new method completely automates the process of web page categorization. The accuracy of the results obtained by the newly
proposed method has been compared with that of the results obtained by
the classical methods applied separately. The results are found to be very
encouraging.