
PARAMETRIC COMPARISON BASED ON SPLIT CRITERION ON

CLASSIFICATION ALGORITHM IN STREAM DATA MINING

Ms. Madhu S. Shukla*, Dr. K. H. Wandra**, Mr. Kirit R. Rathod***

*(PG-CE Student, Department of Computer Engineering),

(C.U.Shah College of Engineering and Technology, Gujarat, India)

** (Principal, Department of Computer Engineering),

(C.U.Shah College of Engineering and Technology, Gujarat, India)

*** (Assistant Professor, Department of Computer Engineering)

ABSTRACT

Stream data mining is an emerging research topic. Today, a number of applications generate massive amounts of stream data; examples of such systems are sensor networks, real-time surveillance systems, and telecommunication systems. Such data therefore requires intelligent processing that supports proper analysis and allows the data to be reused in other tasks. Mining stream data is concerned with extracting knowledge structures, represented as models and patterns, from non-stopping streams of information. Classification in stream data mining is based on generating a decision tree, which makes the decision process easier. Given the characteristics of stream data, it becomes essential to handle a large amount of continuous and changing data accurately. In the classification process, attribute selection at the non-leaf decision nodes therefore becomes a critical analytic point. Performance parameters such as classification speed, accuracy, and CPU utilization time can be improved if the split criterion is implemented precisely. This paper presents an implementation of different attribute selection criteria and a comparison between these alternative methods.

Keywords: Stream, Stream Data Mining, Performance Parameters, MOA (Massive Online Analysis), Split Criterion.

International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367 (Print), ISSN 0976-6375 (Online), Volume 4, Issue 2, March-April 2013, pp. 459-470, © IAEME, www.iaeme.com/ijcet.asp


1. INTRODUCTION

The characteristics of stream data also act as its challenges. Because of its huge size, its continuous nature, and the speed with which it changes, stream data requires a real-time response produced after analysis. Since the data is huge, an algorithm that accesses it is restricted to a single scan of the data.

Data mining uses different types of algorithms for various mining tasks such as classification, clustering, and pattern recognition. In the same way, stream data mining uses different algorithms for its mining tasks. Some of the algorithms for classification of stream data are the Hoeffding Tree, VFDT (Very Fast Decision Tree), and CVFDT (Concept-adapting Very Fast Decision Tree). These classification algorithms are based on the Hoeffding bound for decision tree generation: the bound is used to determine how much data must be gathered so that a split decision can be made accurately. CVFDT is additionally able to detect concept drift, which is another challenge in stream data mining. As the size of stream data is extremely large, a method is required for improving the split criterion at the nodes of the decision tree, so that tree generation becomes faster, accuracy improves, and CPU utilization time is reduced. Two different split criteria are evaluated for stream data classification in this paper, and an improvement of the algorithm based on this comparison is carried out as part of the research work.
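To make the role of the Hoeffding bound concrete, the following short Python sketch (our illustration, not code from the paper) computes the bound epsilon = sqrt(R^2 ln(1/delta) / (2n)) that Hoeffding-tree learners use to decide whether the currently best split attribute is reliably better than the second best. The numeric values (confidence level, example count, gains) are purely hypothetical.

```python
import math

def hoeffding_bound(value_range, confidence, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n)).

    value_range (R): range of the split heuristic, e.g. log2(num_classes)
                     for information gain.
    confidence:      1 - delta, the desired confidence level.
    n:               number of examples seen at the leaf.
    """
    delta = 1.0 - confidence
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

# Split when the observed gap between the best and second-best attribute
# exceeds epsilon (illustrative values only).
best_gain, second_gain = 0.38, 0.29
eps = hoeffding_bound(value_range=math.log2(54), confidence=0.95, n=500)
if best_gain - second_gain > eps:
    print("split on the best attribute")
else:
    print("keep collecting examples at this leaf")
```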

As said earlier, stream data is huge in size, so in order to perform analysis we need to take samples of that data so that it can be processed with ease. The samples should be chosen so that whatever data falls inside them is worth analyzing, meaning that the maximum amount of knowledge is extracted from the sampled data.

In this paper, the sampling technique used is an adaptive sliding window within a Hoeffding-bound based tree algorithm.

2. RELATED WORK

Implementing an algorithm for stream data classification demands improvements in resource utilization as well as in accuracy during the ongoing classification process. Here we consider improvements to an algorithm that detects concept drift while classifying the data. Drift detection is done using a windowing technique.

Sliding window: This is an advanced technique that combines detailed analysis of the most recent data items with summarized versions of older ones. The inspiration behind the sliding window is that users are usually more concerned with the analysis of the most recent data streams. Detailed analysis is therefore performed over the most recent data items, while the old ones are only kept in summarized form. This idea has been adopted by many techniques in comprehensive data stream mining systems.
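As a simple illustration of the windowing idea (a minimal sketch of ours, not the paper's implementation), the snippet below keeps per-class counts over a fixed-size sliding window: each arriving example increments the counts, and once the window overflows the oldest example is forgotten by decrementing its counts.

```python
from collections import Counter, deque

class SlidingWindowCounts:
    """Fixed-size sliding window that maintains per-class counts."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.window = deque()        # (x, y) pairs in arrival order
        self.class_counts = Counter()

    def add(self, x, y):
        # Increment statistics for the newest example.
        self.window.append((x, y))
        self.class_counts[y] += 1
        # Forget the effect of the oldest example if the window overflows.
        if len(self.window) > self.max_size:
            _, old_y = self.window.popleft()
            self.class_counts[old_y] -= 1

# Usage: feed a stream of (attributes, label) pairs.
win = SlidingWindowCounts(max_size=3)
for i, label in enumerate(["a", "a", "b", "b"]):
    win.add({"t": i}, label)
print(win.class_counts)   # Counter({'b': 2, 'a': 1}) after the window slides
```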

3. CLASSIFICATION PROCESS

Many data mining algorithms exist in practice. They can be categorized into three types:

1. Classification

2. Clustering

3. Association


A standard classification system normally has three phases:

1. The training phase, during which the model is built using labeled data.
2. The testing phase, during which the model is tested by measuring its classification accuracy on withheld labeled data.
3. The deployment phase, during which the model is used to predict the class of unlabeled data.

The three phases are carried out in sequence; see Figure 3.1 for the standard classification phases.

Fig 3.1: Phases of standard classification systems
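A minimal sketch of these three phases (our illustration; it uses scikit-learn's batch decision tree and the Iris dataset purely as stand-ins, not the stream setting or tools used in this paper):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Phase 1: training, building the model from labeled data.
model = DecisionTreeClassifier().fit(X_train, y_train)

# Phase 2: testing, measuring accuracy on withheld labeled data.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Phase 3: deployment, predicting the class of unlabeled records.
unlabeled = X_test[:5]
print("predictions:", model.predict(unlabeled))
```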

3.1. STREAM DATA MINING

Ordinary classification is usually considered in three phases. In the first phase, a model is built using data, called the training data, for which the property of interest (the class) is already known (labeled data). In the second phase, the model is used to predict the class of data (test data) for which the property of interest is known but which the model has not previously seen. In the third phase, the model is deployed and used to predict the property of interest for unlabeled data.

In stream classification there is only a single stream of data, in which labeled and unlabeled records occur together. The training/testing and deployment phases therefore interleave. Classification of unlabeled records may be required from the beginning of the stream, after some sufficiently long initial sequence of labeled records, at specific moments in time, or for a specific block of records selected by an external analyst.
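The interleaving can be pictured as a test-then-train loop. The sketch below is our simplified illustration, not the paper's code; MajorityClassModel is a deliberately trivial stand-in for a real incremental learner such as a Hoeffding tree.

```python
from collections import Counter

class MajorityClassModel:
    """Trivial incremental learner, used only to make the sketch runnable."""
    def __init__(self):
        self.counts = Counter()
    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None
    def learn(self, x, y):
        self.counts[y] += 1

def process_stream(stream, model):
    """Interleaved stream classification: test-then-train on labeled
    records, predict-only on unlabeled ones."""
    correct = seen = 0
    predictions = []
    for x, y in stream:                      # y is None for unlabeled records
        p = model.predict(x)
        if y is None:
            predictions.append(p)            # deployment: prediction only
        else:
            seen += 1
            correct += (p == y)              # test first ...
            model.learn(x, y)                # ... then train on the same record
    return predictions, correct / max(seen, 1)

stream = [({"t": 0}, "a"), ({"t": 1}, "a"), ({"t": 2}, None), ({"t": 3}, "b")]
print(process_stream(stream, MajorityClassModel()))
```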

4. ATTRIBUTE SELECTION CRITERION IN DECISION TREES

Selection of an appropriate splitting criterion helps improve the performance measurement dimensions. In data stream mining there are three main performance measurement dimensions:

- Accuracy
- The amount of space (computer memory) necessary (model cost, or RAM-hours)
- The time required to learn from training examples and to predict (evaluation time)

These properties may be interdependent: adjusting the time and space used by an algorithm can influence accuracy. By storing more pre-computed information, such as lookup tables, an algorithm can run faster at the expense of space. An algorithm can also run faster by processing less information, either by stopping early or by storing less and thus having less data to process. The more time an algorithm has, the more likely it is that accuracy can be increased.


There are two major types of attribute selection criteria: information gain and the Gini index. The latter is also known as the binary split criterion.

During the late 1970s and 1980s, J. Ross Quinlan, a researcher in machine learning, developed a decision tree algorithm known as ID3 (Iterative Dichotomiser) [1]. ID3 uses information gain for attribute selection. The information gain of an attribute A is given as $Gain(A) = Info(D) - Info_A(D)$. We have developed a new algorithm to calculate information gain; methodologically this algorithm is promising. We have divided the algorithm into two parts: the first part calculates Info(D) and the second part calculates Gain(A).

4.1. Information Gain Calculation

Information gain = (information before the split) - (information after the split).

Entropy: a common way to measure impurity is entropy,

$Entropy(D) = -\sum_{i} p_i \log_2 p_i$

where $p_i$ is the probability of class i, computed as the proportion of class i in the set.

• Entropy comes from information theory: the higher the entropy, the higher the information content.
• For a continuous attribute, candidate split values are taken as the midpoints of adjacent values, $(a_i + a_{i+1})/2$.

Worked example (Figure 4.1): a parent node with 30 instances (14 of one class, 16 of the other) is split into one child with 17 instances (13/4) and one with 13 instances (1/12).

$Entropy(parent) = -\frac{14}{30}\log_2\frac{14}{30} - \frac{16}{30}\log_2\frac{16}{30} = 0.996$

$Entropy(child_1) = -\frac{13}{17}\log_2\frac{13}{17} - \frac{4}{17}\log_2\frac{4}{17} = 0.787$

$Entropy(child_2) = -\frac{1}{13}\log_2\frac{1}{13} - \frac{12}{13}\log_2\frac{12}{13} = 0.391$

Weighted average entropy of the children = $\frac{17}{30}(0.787) + \frac{13}{30}(0.391) = 0.615$

Information gain = entropy(parent) - [weighted average entropy of the children] = 0.996 - 0.615 = 0.38

Figure 4.1: Worked example of the information gain calculation
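The two-part calculation described above (Info(D) first, then Gain(A)) can be sketched in Python as follows. This is our illustration, with the class counts chosen to reproduce the worked example in Figure 4.1.

```python
import math

def entropy(class_counts):
    """Info(D): entropy of a node, from its per-class counts."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total)
                for c in class_counts if c > 0)

def information_gain(parent_counts, children_counts):
    """Gain(A) = Info(D) - Info_A(D), the drop in entropy after a split."""
    total = sum(parent_counts)
    info_a = sum(sum(child) / total * entropy(child)
                 for child in children_counts)
    return entropy(parent_counts) - info_a

# Worked example: 30 instances (14/16) split into children of 17 (13/4)
# and 13 (1/12) instances.
print(entropy([14, 16]))                               # ~0.996
print(information_gain([14, 16], [[13, 4], [1, 12]]))  # ~0.38
```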

4.2. Calculating Gini Index

If a data set T contains examples from n classes, the Gini index gini(T) is defined as

$gini(T) = 1 - \sum_{j=1}^{n} p_j^2$

where $p_j$ is the relative frequency of class j in T. gini(T) is minimized if the classes in T are skewed.

After splitting T into two subsets $T_1$ and $T_2$ with sizes $N_1$ and $N_2$, the Gini index of the split data is defined as

$gini_{split}(T) = \frac{N_1}{N}\,gini(T_1) + \frac{N_2}{N}\,gini(T_2)$

The attribute providing the smallest $gini_{split}(T)$ is chosen to split the node.
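A matching Python sketch for the Gini criterion (again our illustration, reusing the class counts of the worked example above):

```python
def gini(class_counts):
    """gini(T) = 1 - sum(p_j^2) over the classes present in T."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

def gini_split(children_counts):
    """Size-weighted Gini index of the subsets produced by a split."""
    n = sum(sum(child) for child in children_counts)
    return sum(sum(child) / n * gini(child) for child in children_counts)

# Same split as the information gain example: children (13/4) and (1/12).
print(gini([14, 16]))                  # Gini of the parent node, ~0.498
print(gini_split([[13, 4], [1, 12]]))  # Gini of the split, ~0.27
```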


5. METHODOLOGY AND PROPOSED ALGORITHM

CVFDT (Concept-adapting Very Fast Decision Tree) is an extended version of VFDT that retains its speed and accuracy advantages but adds the ability to detect and respond to changes in the example-generating process. Systems built around CVFDT use a sliding window over the dataset to keep the model consistent with recent data. Most systems need to learn a new model from scratch after new data arrives; CVFDT instead continuously monitors the quality of the incoming data and adjusts the parts of the model that are no longer correct. Whenever a new example arrives, CVFDT increments the counts for the new data and decrements the counts for the oldest data in the window. If the concept is stationary, this has no statistical effect. If the concept is changing, however, some splits that previously appeared best will no longer do so, because newer data gives more gain to other attributes. Whenever this occurs, CVFDT grows an alternative sub-tree with the new best attribute at its root. The new sub-tree replaces the old one once it is more accurate on the new data.

5.1 CVFDT ALGORITHM (based on the Hoeffding tree HT)

1. Alternate trees for each node in HT start as empty.
2. Process examples from the stream indefinitely.
3. For each example (x, y):
4. Pass (x, y) down to a set of leaves using HT and all alternate trees of the nodes that (x, y) passes through.
5. Add (x, y) to the sliding window of examples.
6. Remove and forget the effect of the oldest examples if the sliding window overflows.
7. CVFDTGrow: grow HT and its alternate trees as in VFDT.
8. CheckSplitValidity: if f examples have been seen since the last check of the alternate trees, re-examine the validity of the chosen splits.
9. Return HT.

Fig 5.1: Flow of the CVFDT algorithm
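The listing above can be condensed into the following Python outline. This is our simplified sketch, not the authors' implementation: HoeffdingTreeStub and its methods are hypothetical placeholders for the real tree operations, and the statistical details (Hoeffding bound checks, alternate sub-tree bookkeeping) are hidden behind them.

```python
from collections import deque

class HoeffdingTreeStub:
    """Stand-in for the real Hoeffding tree; the real methods would update
    sufficient statistics, grow nodes, and manage alternate sub-trees as
    described in steps 4-8 above."""
    def sort_and_update(self, x, y): pass      # step 4
    def forget(self, x, y): pass               # part of step 6
    def attempt_splits(self): pass             # step 7 (CVFDTGrow)
    def check_split_validity(self): pass       # step 8 (CheckSplitValidity)

def cvfdt(stream, ht, window_size=1000, check_every=100):
    """Simplified outline of the CVFDT main loop (steps 2-9)."""
    window = deque()
    for i, (x, y) in enumerate(stream, start=1):
        ht.sort_and_update(x, y)               # route (x, y) through HT and alternates
        window.append((x, y))                  # step 5: add to the sliding window
        if len(window) > window_size:          # step 6: forget the oldest example
            old_x, old_y = window.popleft()
            ht.forget(old_x, old_y)
        ht.attempt_splits()                    # step 7
        if i % check_every == 0:               # step 8: periodic validity check
            ht.check_split_validity()
    return ht                                  # step 9
```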


6. EXPERIMENTAL ANALYSIS WITH OBSERVATIONS

Different datasets were taken and the CVFDT algorithm was run after importing those datasets into MOA. The various split criteria used in the decision tree approach were also tested with respect to the accuracy of the algorithm. The datasets used here are in ARFF format; some of the data are taken from the repository of the University of California and some from Spanish projects working on stream data.

The datasets taken were as follows:
1) Sensor
2) SEA
3) Random Tree Generator

The readings reported here are for the Sensor data. It contains information (temperature, humidity, light, and sensor voltage) collected from 54 sensors deployed in the Intel Berkeley Research Lab. The whole stream contains consecutive readings recorded over a two-month period (one reading every 1-3 minutes). We used the sensor ID as the class label, so the learning task is to correctly identify the sensor ID (1 out of 54 sensors) purely from the sensor readings and the corresponding recording time. As the data stream flows over time, so do the concepts underlying the stream: for example, lighting during working hours is generally stronger than at night, and the temperature of specific sensors (e.g. in a conference room) may regularly rise during meetings.

Fig 6.1: MIT Computer Science and Artificial Intelligence Lab data repository

As discussed above, an attribute selection measure is a heuristic for selecting the splitting criterion that "best" separates the given data. The two common methods used for it are:

1) The entropy-based method (information gain)
2) The Gini index

6.1 RANDOM TREE GENERATOR DATA SET RESULTS


Instances | Information Gain (accuracy %) | Gini Index (accuracy %)
100000 | 92.6 | 81.7
200000 | 93.0 | 83.0
300000 | 94.7 | 80.1
400000 | 96.3 | 82.2
500000 | 94.8 | 80.9
600000 | 96.9 | 81.9
700000 | 96.9 | 82.6
800000 | 96.7 | 82.1
900000 | 98.7 | 84.0
1000000 | 97.4 | 77.9

Table-I: Comparison of accuracy for the Random Tree Generator data

6.2 SEA DATA SET RESULTS

Instances | Information Gain (accuracy %) | Gini Index (accuracy %)
100000 | 89.8 | 89.3
200000 | 92.1 | 91.6
300000 | 89.6 | 89.3
400000 | 89.1 | 88.9
500000 | 88.5 | 88.5
600000 | 88.8 | 88.1
700000 | 90.6 | 90.6
800000 | 89.5 | 89.3
900000 | 89.1 | 89.0
1000000 | 89.9 | 89.9

Table-II: Comparison of accuracy for the SEA data


6.3 PERFORMANCE ANALYSIS BASED ON SENSOR DATA SET (CPU UTILIZATION)

Learning evaluation instances | Evaluation time (CPU seconds), Info Gain | Evaluation time (CPU seconds), Gini Index
100000 | 6.676843 | 8.704856
200000 | 13.46289 | 18.67332
300000 | 20.23333 | 29.40619
400000 | 26.97257 | 39.87386
500000 | 33.68062 | 49.63952
600000 | 40.40426 | 59.06198
700000 | 47.0499 | 67.70443
800000 | 53.74234 | 78.0941
900000 | 59.93558 | 88.14057
1000000 | 66.79963 | 98.48343
1100000 | 73.27367 | 107.1727
1200000 | 79.27971 | 116.9851
1300000 | 85.53535 | 127.016
1400000 | 91.99379 | 136.6257
1500000 | 98.40543 | 145.2993
1600000 | 104.3803 | 152.9278
1700000 | 110.3083 | 160.0102
1800000 | 116.4859 | 168.1223
1900000 | 121.9928 | 174.8459

Table-III: Comparison of CPU utilization time for the Sensor data


6.4 PERFORMANCE ANALYSIS BASED ON SENSOR DATA SET (ACCURACY)

Learning evaluation instances | Classifications correct (%), Info Gain | Classifications correct (%), Gini Index
100000 | 96.3 | 98.4
200000 | 68.3 | 69.7
300000 | 18.0 | 64.4
400000 | 43.2 | 67.4
500000 | 62.8 | 72.9
600000 | 92.0 | 71.0
700000 | 97.9 | 72.5
800000 | 97.4 | 73.9
900000 | 96.8 | 73.7
1000000 | 80.6 | 68.5
1100000 | 53.6 | 71.2
1200000 | 71.0 | 90.3
1300000 | 84.1 | 73.1
1400000 | 78.5 | 83.9
1500000 | 96.3 | 84.9
1600000 | 50.9 | 84.9
1700000 | 24.0 | 79.0
1800000 | 74.3 | 87.6
1900000 | 98.0 | 97.8

Table-IV: Comparison of accuracy for the Sensor data


6.5 PERFORMANCE ANALYSIS BASED ON SENSOR DATA SET (TREE SIZE)

Learning evaluation instances | Tree size (nodes), Info Gain | Tree size (nodes), Gini Index
100000 | 14 | 126
200000 | 30 | 270
300000 | 44 | 396
400000 | 60 | 530
500000 | 76 | 666
600000 | 88 | 800
700000 | 102 | 938
800000 | 122 | 1076
900000 | 136 | 1214
1000000 | 150 | 1346
1100000 | 172 | 1466
1200000 | 196 | 1602
1300000 | 216 | 1742
1400000 | 226 | 1868
1500000 | 240 | 1998
1600000 | 262 | 2122
1700000 | 282 | 2238
1800000 | 292 | 2352
1900000 | 312 | 2474

Table-V: Comparison of tree size for the Sensor data


6.6 PERFORMANCE ANALYSIS BASED ON SENSOR DATA SET (LEAVES)

Learning evaluation instances | Tree size (leaves), Info Gain | Tree size (leaves), Gini Index
100000 | 7 | 63
200000 | 15 | 135
300000 | 22 | 198
400000 | 30 | 265
500000 | 38 | 333
600000 | 44 | 400
700000 | 51 | 469
800000 | 61 | 538
900000 | 68 | 607
1000000 | 75 | 673
1100000 | 86 | 733
1200000 | 98 | 801
1300000 | 108 | 871
1400000 | 113 | 934
1500000 | 120 | 999
1600000 | 131 | 1061
1700000 | 141 | 1119
1800000 | 146 | 1176
1900000 | 156 | 1237

Table-VI: Comparison of the number of leaves for the Sensor data

6.7 COMPARISON OF ALL PERFORMANCE DIMENSIONS TOGETHER FOR SENSOR DATA

Fig 6.2: Comparison of performance for the Sensor data across all dimensions together


7. CONCLUSION

In this paper, we discussed theoretical aspects and practical results of stream data mining classification algorithms with different split criteria. The comparison over different datasets shows the resulting analysis: Hoeffding trees with the windowing technique spend the least amount of time on learning and, with information gain as the split criterion, achieve higher accuracy than with the Gini index. Memory utilization, accuracy, and CPU utilization, which are crucial factors for stream data, are discussed here with practical observations. Classification generates a decision tree, and the tree generated with information gain as the split criterion is also smaller, as shown in the tables, along with a dramatic change in accuracy and CPU utilization.

REFERENCES

[1] Elena Ikonomovska, Suzana Loskovska and Dejan Gjorgjevik, "A Survey of Stream Data Mining", Eighth National Conference with International Participation - ETAI 2007.
[2] S. Muthukrishnan, "Data Streams: Algorithms and Applications", Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2003.
[3] Mohamed Medhat Gaber, Arkady Zaslavsky and Shonali Krishnaswamy, "Mining Data Streams: A Review", Centre for Distributed Systems and Software Engineering, Monash University, 900 Dandenong Rd, Caulfield East, VIC 3145, Australia.
[4] P. Domingos and G. Hulten, "A General Method for Scaling Up Machine Learning Algorithms and its Application to Clustering", Proceedings of the Eighteenth International Conference on Machine Learning, Williamstown, MA, Morgan Kaufmann, 2001.
[5] H. Kargupta, R. Bhargava, K. Liu, M. Powers, P. Blair, S. Bushra, J. Dull, K. Sarkar, M. Klein, M. Vasa and D. Handy, "VEDAS: A Mobile and Distributed Data Stream Mining System for Real-Time Vehicle Monitoring", Proceedings of the SIAM International Conference on Data Mining, 2004.
[6] Albert Bifet and Ricard Gavaldà, "Adaptive Parameter-free Learning from Evolving Data Streams", Universitat Politècnica de Catalunya, Barcelona, Spain.
[7] Dariusz Brzezinski, "Mining Data Streams with Concept Drift", Master's thesis, Poznan University of Technology.
[8] R. Manickam, D. Boominath and V. Bhuvaneswari, "An Analysis of Data Mining: Past, Present and Future", International Journal of Computer Engineering & Technology (IJCET), Volume 3, Issue 1, 2012, pp. 1-9, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
[9] M. Karthikeyan, M. Suriya Kumar and S. Karthikeyan, "A Literature Review on the Data Mining and Information Security", International Journal of Computer Engineering & Technology (IJCET), Volume 3, Issue 1, 2012, pp. 141-146, ISSN Print: 0976-6367, ISSN Online: 0976-6375.