


An Efficient Data Mining Framework on Hadoop using Java Persistence API

Yang Lai
The Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China.
Graduate University of Chinese Academy of Sciences, Beijing 100039, China.
e-mail: [email protected]

Shi ZhongZhi
The Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China.
e-mail: [email protected]

Abstract—Data indexing is common in data mining when working with high-dimensional, large-scale data sets. Hadoop, a cloud computing project using the MapReduce framework in Java, has become of significant interest in distributed data mining. A feasible distributed data indexing algorithm is proposed for Hadoop data mining, based on ZSCORE binning and inverted indexing and on the Hadoop SequenceFile format. A data mining framework on Hadoop using the Java Persistence API (JPA) and MySQL Cluster is proposed. The framework is elaborated in the implementation of a decision tree algorithm on Hadoop. We compare the data indexing algorithm with Hadoop MapFile indexing, which performs a binary search, in a modest cloud environment. The results show the algorithm is more efficient than naïve MapFile indexing. We compare the JDBC and JPA implementations of the data mining framework. The performance shows the framework is efficient for data mining on Hadoop.

Keywords—Data Mining, Distributed applications, JPA, ORM, Distributed file systems, Cloud computing

I. INTRODUCTION

Many approaches have been proposed for handling high-dimensional and large-scale data, in which query processing is the bottleneck. "Algorithms for knowledge discovery tasks are often based on range searches or nearest neighbor search in multidimensional feature spaces" [1].

Business intelligence and data warehouses can hold a Terabyte or more of data. Cloud computing has emerged to meet the consequently increasing demands of data mining. MapReduce is a programming framework and an associated implementation designed for large data sets. The details of partitioning, scheduling, failure handling and communication are hidden by MapReduce. Users simply define map functions to create intermediate <key, value> tuples, and then reduce functions to merge the tuples for special processing [2].

A concise indexing Hadoop implementation of MapReduce is presented in McCreadie's work [3]. Ralf proposes a basic program skeleton to underlie MapReduce computations [4]. Moretti presents an abstraction for scalable data mining, in which data and computation are distributed in a computing cloud with minimal user effort [5]. Gillick uses Hadoop to implement query-based learning [6].

Most data mining algorithms are based on object-oriented programming, and they run in memory. Han elaborates many of these methods [7]. A link-based structure, such as the X-tree [8], may be suitable for indexing high-dimensional data sets.

However, the following features of the MapReduce framework are unsuitable for data mining. First, regarding globality, map tasks are irrelevant to each other, as are reduce tasks, whereas data mining requires that all of the training data be converted into a global model, such as a decision tree or clustering tree; each task in the MapReduce framework only handles its own partition of the entire data set and outputs its results into the Hadoop distributed file system (HDFS). Second, random-write operations are disallowed by the HDFS, which rules out link-based data models in Hadoop, such as linked lists, trees, and graphs. Finally, the durations of both map and reduce tasks are bound to scan processing: a task ends when its partition of the training dataset has been scanned, whereas data mining requires a persistent model for the subsequent testing phase.

A database is an ideal persistent repository for the objects generated by data mining in Hadoop tasks. To mine high-dimensional and large-scale data on Hadoop, we employ Object-Relation Mapping (ORM), which stores objects whose size may surpass memory limits in a relational database. The Java Persistence API (JPA) provides a persistence model for ORM [9]; it can store all the objects generated by data mining models in a relational database. A distributed database is a suitable solution to ensure robustness in distributed handling. MySQL Cluster is designed to withstand any single point of failure [10], which is consistent with Hadoop.

We reproduced the work of McCreadie [3] and now propose a novel Hadoop indexing implementation for continuous values. Using JPA and MySQL Cluster on Hadoop, we propose an efficient data mining framework, which is elaborated through a decision tree implementation. The distributed computing in Hadoop and the centralized data collection using JPA are combined organically. We also employ a naïve JDBC implementation for comparison with our JPA implementation on Hadoop.

The rest of the paper is organized as follows. In Section 2 we explain the index structures, flowchart and algorithms, and propose the data mining framework. Section 3 describes our experimental setting and results. In Section 4, we offer conclusions and suggest possible directions for future work.



II. METHOD

A. Naïve Data Indexing

Naïve data indexing is necessary for querying data in classification or clustering algorithms. Data indexing can dramatically decrease the complexity of querying. As in McCreadie's work [3], we create an inverted index for continuous values.

1) Definition

An (n × m) dataset is an (n × m) matrix containing n rows of data, each with m columns of float numbers. Let min(i), max(i), sum(i), avg(i), std(i), cnt(i) (i ∈ [1..m]) be the minimum, maximum, sum, average, standard deviation and count of the i-th column, respectively, i.e., the essential statistical measures of the i-th column of the dataset. Let sum2(i) be the dot product of the i-th column with itself, which is used for std(i). From the formula [7]

$ \mathrm{ZSCORE}(x) = \dfrac{x - \mathrm{avg}()}{\mathrm{std}()} $

a new measure is defined:

$ \mathrm{YSCORE}(x) = \mathrm{round}\!\left(\dfrac{(x - \mathrm{avg}()) \times iStep}{\mathrm{std}()}\right) $

where iStep is an experimental integer parameter.

2) Simple Design of the Inverted Index

In general, high-dimensional datasets are always stored as tables in a sophisticated DBMS for most data mining tasks. Table 1 shows the format of a dataset in a DBMS. For data mining, the CSV format is a common data interchange file format.

Table 1 A normal data table.
RecordsNO   Field01   Field02   …
Rec01       Value02   Value01   …
Rec02       Value05   Value01   …
Rec03       Value02   Value04   …
…           …         …         …

Figure 1 shows the flowchart in which the whole procedure for indexing and similarity querying is illustrated:

Discretization calculates the essential statistical measures (Figure 2), then performs YSCORE binning on the FloatData.TXT file (CSV format) and outputs the IntData.TXT file, in which each line is labeled consecutively with a unique number (LineNo).

The encoder transforms the text file IntData.TXT into the Hadoop SequenceFile format IntData.SEQ and generates IntData.DAT. To locate a record rapidly for a query, the record length has to be fixed. IntData.DAT contains an (n × m) matrix, i.e., n rows of records containing m columns of bin IDs, as does IntData.SEQ. The structure of IntData.DAT is shown in Table 2.

With the indexer, the index files Val2NO.DAT and Col2Val.DAT are output with MapReduce from the IntData.SEQ file. To obtain the list of LineNo's for a specific bin, the list is stored in a structural file (Val2NO.DAT), shown in Table 4; the address and the length of the list are stored in another structural file (Col2Val.DAT), shown in Table 3. The two structural files are easy to access by offset.

The searcher, given a query, which should be a vector of m float values, performs YSCORE binning with the same statistical measures and yields a vector of m integers for searching; it then outputs a list of record numbers. The search algorithm is shown in Figure 5.

In intersection, the positions in Val2NO.DAT of each integer of a particular query are extracted from Col2Val.DAT, and the lists of LineNo's for each integer can be found. The result is computed by the intersection of the lists. The algorithm is omitted.
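Although the paper omits this intersection algorithm, as an illustration only, the following is a minimal Java sketch of intersecting two posting lists, assuming the LineNo lists read from Val2NO.DAT are sorted in ascending order (the method name and array representation are ours, not the paper's):

```java
/** Hypothetical helper: intersects two ascending lists of LineNo's. */
public static long[] intersect(long[] a, long[] b) {
    long[] out = new long[Math.min(a.length, b.length)];
    int i = 0, j = 0, k = 0;
    while (i < a.length && j < b.length) {
        if (a[i] < b[j]) {
            i++;                      // advance the smaller side
        } else if (a[i] > b[j]) {
            j++;
        } else {
            out[k++] = a[i];          // common LineNo, keep it
            i++;
            j++;
        }
    }
    return java.util.Arrays.copyOf(out, k);
}
```

Intersecting the lists of all m query bins is then a left fold of this function over the m posting lists, smallest list first.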

Table 2 Structure of IntData.DAT in BNF.
IntData.DAT ::= <Record>
<Record> ::= <Bins>
<Bins> ::= Integer (32-bit binary)

Table 3 Structure of Col2Val.DAT in BNF.
Col2Val.DAT ::= <Field>
<Field> ::= <Bins, Address, LengthLineNo>
<Bins> ::= Integer (32-bit binary)
<Address> ::= Integer (64-bit binary)
<LengthLineNo> ::= Integer (64-bit binary)

Table 4 Structure of Val2NO.DAT in BNF.
Val2NO.DAT ::= <LineNo>
<LineNo> ::= Integer (64-bit binary)

[Figure 1 Flowchart for searching the Hadoop index: FloatData.TXT → Discretization → IntData.TXT → Encoder → IntData.SEQ / IntData.DAT → Indexer → Val2No.DAT / Col2Val.DAT → Searcher → list of LineNo's → Intersection → query result]

3) Statistics phase

In Hadoop, the data sets are scanned once to calculate the four distributive measures min(i), max(i), sum(i) and sum2(i), as well as cnt(i), for the i-th column (Figure 2); the algebraic measures average and standard deviation are then computed from them. Therefore, a (5 × m) array is used for the keys in the MapReduce process: the first row contains the keys of the minimum of a column; the second row contains the keys for the maximum; the third contains the keys for the sum; the fourth contains the keys for the sum of squares; and the fifth contains the keys for the row count.

FUNCTION map (key, values, output)
1. f[1..m] ← parse the values into a float array;
2. iW ← f.length;
3. FOR i = 1 THRU m
   a. IF min > f[i] THEN
      1) min ← f[i];
      2) output.collect(iW*0+i, min);
   b. IF max < f[i] THEN
      1) max ← f[i];
      2) output.collect(iW*1+i, max);
   c. output.collect(iW*2+i, f[i]);
   d. output.collect(iW*3+i, f[i]*f[i]);
4. output.collect(iW*4, 1);

FUNCTION reduce (key, values, output)
5. iCategory ← (int) key.get() / LenFields
6. SWITCH (iCategory)
7. CASE 0: /** min() */
   a. WHILE (values.hasNext())
      1) fTmp ← values.next();
      2) IF (min > fTmp) THEN min ← fTmp;
   b. output.collect(key, min);
   c. BREAK;
8. CASE 1: /** max() */
   a. WHILE (values.hasNext())
      1) fTmp ← values.next();
      2) IF (max < fTmp) THEN max ← fTmp;
   b. output.collect(key, max);
   c. BREAK;
9. CASE 2: /** sum() */
   a. WHILE (values.hasNext())
      1) sum ← sum + values.next();
   b. output.collect(key, sum);
   c. BREAK;
10. CASE 3: /** sum2() */
   a. WHILE (values.hasNext())
      1) sum2 ← sum2 + values.next();
   b. output.collect(key, sum2);
   c. BREAK;
11. CASE 4: /** cnt() */
   a. WHILE (values.hasNext())
      1) cnt ← cnt + values.next();
   b. output.collect(key, cnt);
   c. BREAK;

FUNCTION statistics (input)
12. faStat[][] ← 0; key ← 0; value ← 0; cnt ← 0;
13. WHILE input.hasNext()
   a. <key, value> ← parse the line of input.next();
   b. iCategory ← key / LenFields;
   c. iField ← key MOD LenFields;
   d. IF (key == LenFields * 4) THEN
      1) cnt ← value;
   e. ELSE
      1) faStat[iCategory][iField] ← value;
14. avg[] ← 0; std[] ← 0;
15. FOR i = 1 THRU LenFields
   a. avg(i) ← faStat[2][i] / cnt;
   b. std(i) ← sqrt((faStat[3][i] - faStat[2][i] × faStat[2][i] / cnt) / (cnt - 1));
16. RETURN min(i), max(i), sum(i), avg(i), std(i), cnt;
END OF ALGORITHM STATISTICS

Figure 2 Algorithm Statistics for continuous values in Hadoop

The map phase performs O(N/mapper) work, where N is the number of records in the data set and mapper is the number of map tasks. After step 4, each partition has been reduced to its local values of the five distributive measures. The reduce function merges them into global measures. Keys for the map function are the positions of the next line, and values are the next line of text.

After step 11 in Figure 2, Hadoop creates result files named "part-xxxxx", which contain the five distributive measures for each field; the number of these result files depends on the number of reducers. From step 15, the algebraic measures avg() and std() can be calculated from the five distributive measures.
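As a sketch of steps 15 and 16 and of the YSCORE formula (the class and method names are ours; the paper gives no Java code for this part), the algebraic measures and the bin id of a value could be computed as follows:

```java
/** Hypothetical helper combining the five distributive measures of one column. */
public class ColumnStats {
    public final double avg, std;

    public ColumnStats(double sum, double sum2, long cnt) {
        this.avg = sum / cnt;
        // sample standard deviation from the sum and the sum of squares (step 15.b)
        this.std = Math.sqrt((sum2 - sum * sum / cnt) / (cnt - 1));
    }

    /** YSCORE binning: round(((x - avg) * iStep) / std). */
    public int yscore(float x, int iStep) {
        return (int) Math.round((x - avg) * iStep / std);
    }
}
```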

4) YSCORE phase

Figure 3 shows YSCORE binning of 1,000,000 values drawn from a normal distribution with mean 0 and standard deviation 1. With an iStep of 1, the upper-left panel shows exactly the three-sigma rule: there are only 9 bins, and the maximum count in a bin is near 400,000. When iStep is changed to 8, the upper-right panel shows that the number of bins increases to 36 and the histogram fits the normal distribution curve better. More importantly, the maximum count in a bin drops to about 50,000 at an iStep of 8. When iStep is set to 16 or 32, the maximum drops to 25,000 or 10,000, respectively. These lower counts benefit inverted indexing.

[Figure 3 YSCORE binning of 1,000,000 values from a normal distribution with mean 0 and standard deviation 1. The four panels plot bin counts against YSCORE bins for iStep values of 1, 8, 16, and 32, respectively.]


According to [11], "the central limit theorem (CLT) states conditions under which the mean of a sufficiently large number of independent random variables, each with finite mean and variance, will be approximately normally distributed." Even if the distribution of a field does not follow a normal distribution, in large-scale data sets the values of the field can be put into bins using its mean and deviation when the size approaches a Terabyte.

These interval values are put into bins using YSCORE binning [7]. The inverted index then has the format of Table 5, following Table 1. The first column holds all fields of the original dataset; the second holds the <bin, RecNoArray> tuples.

Table 5 The structure of the inverted index for interval values.
Fields     Inverted Index
Field01    <Bin02, RecNoArray>, <Bin05, RecNoArray>, …
Field02    <Bin01, RecNoArray>, <Bin04, RecNoArray>, …
…          …

The YSCORE binning phase does not need a reduce function, thus eliminating the cost of sorting and transferring; the result files are precisely the partitions on which the map functions work. Keys in the map function are the positions of the next line, and values are the next line of text. The input file is FloatData.TXT and the output file is IntData.TXT, as shown in Figure 1.
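A map-only job of this kind is configured by setting the number of reducers to zero. A minimal sketch with the old (0.19) mapred API follows; IdentityMapper is only a stand-in, since the actual YSCORE binning mapper is not given in the paper, and the paths are illustrative:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class YScoreBinningJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(YScoreBinningJob.class);
        conf.setJobName("yscore-binning");
        // Stand-in mapper; the real job would plug in a mapper that applies
        // YSCORE binning to each input line of FloatData.TXT.
        conf.setMapperClass(IdentityMapper.class);
        conf.setNumReduceTasks(0);   // map-only: no shuffle, sort or reduce phase
        FileInputFormat.setInputPaths(conf, new Path(args[0]));   // e.g. FloatData.TXT
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // e.g. the IntData.TXT directory
        JobClient.runJob(conf);
    }
}
```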

The cumulative distribution function of the standard normal distribution is 50% at 0, 60% at 0.253, 70% at 0.524, 80% at 0.842, and 90% at 1.282. If the distribution is close to the standard normal distribution, YSCORE therefore derives a balanced binning from the ZSCORE value.

After the statistics phase and the YSCORE binning phase, Hadoop can handle discrete data as in the commonly used MapReduce word count example.

B. Improved Data Indexing

Naïve data indexing has a flaw: it uses only one reducer to keep the global id for the keys. This causes a performance problem that is not scalable. A tractable <LongWritable, LongWritable> pair is employed to transform the three variables, the global id, the attribute and the id of the bin, into the intermediate key-value pairs of the JobIndex (Figure 4). In general, neither the number of attributes nor the number of bins of an attribute exceeds 2^32. A LongWritable variable is 64 bits, so the attribute can be put in the upper 32 bits of a LongWritable and the bin number in the lower 32 bits. These attribute-bin LongWritables are used as keys for the sorting phase in Hadoop.

The number of ReducerIndex tasks is set equal to the number of attributes. The method setPartitionerClass() in Hadoop decides to which ReducerIndex task an intermediate key-value pair should be sent, using the upper 32 bits of the key. In a ReducerIndex task, the attribute-bin LongWritables are grouped by the bin number in the lower 32 bits of the key, and the inverted index <Bin, RecNoArray> (Table 5) can then be output easily. Each reduce task in Hadoop generates its own result; therefore, each attribute has its own index file.
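As an illustration of this key layout (our own sketch with an assumed class name, not code from the paper), the attribute and bin id can be packed into one LongWritable and a partitioner can route by the upper 32 bits:

```java
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

/** Sketch of the attribute-bin key layout used by JobIndex (class name assumed). */
public class AttributeBinPartitioner implements Partitioner<LongWritable, LongWritable> {

    /** Attribute in the upper 32 bits, bin id in the lower 32 bits. */
    public static long pack(int attribute, int bin) {
        return ((long) attribute << 32) | (bin & 0xFFFFFFFFL);
    }

    public static int attributeOf(long key) { return (int) (key >>> 32); }

    public static int binOf(long key) { return (int) key; }

    /** Route by attribute so each ReducerIndex task builds the index file of one attribute. */
    public int getPartition(LongWritable key, LongWritable lineNo, int numReduceTasks) {
        return attributeOf(key.get()) % numReduceTasks;
    }

    public void configure(JobConf job) { /* nothing to configure */ }
}
```

Such a class would be registered with conf.setPartitionerClass(AttributeBinPartitioner.class), with the number of reduce tasks set to the number of attributes.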

[Figure 4 Data flow diagram for improved indexing of continuous values in Hadoop: JobDiscret maps SourceDataSet ::= <LongWritable, FloatArrayWritable> through mapperDiscrete tasks to DiscretedData ::= <LongWritable, IntArrayWritable>; JobIndex maps DiscretedData through mapperIndex tasks to IntermediateData ::= <LongWritable, LongWritable> and through ReducerIndex tasks to IndexData ::= <IntWritable, LongArrayWritable>.]

C. Hadoop MapFile Indexing

The MapFile format is a directory consisting of two files, a data file and an index file; both are in fact files of the SequenceFile type. The data file is a normal SequenceFile; the index file holds a few keys at intervals together with their corresponding addresses in the data file. To query a key for its value, a binary search is performed in the index file and a sequential search is performed in the data file within the range defined by the corresponding addresses [12]. The algorithm for searching by MapFile is shown in Figure 5.

FUNCTION SearchMapFile (float[] faQuery)
17. laResult ← null; laTmp ← null; isFirst ← TRUE
18. FOR iField = 1 THRU faQuery.length
   a. iValue ← YSCORE(faQuery[iField]);
   b. Indexer ← new MapFile.Reader(index file of iField);
   c. Indexer.get(iValue, laTmp);
   d. IF (isFirst) THEN
      1) laResult ← laTmp; isFirst ← FALSE
   e. ELSE
      1) laResult ← getIntersection(laResult, laTmp);
19. RETURN laResult
END OF ALGORITHM SearchMapFile

Figure 5 Algorithm of searching by MapFile.
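For concreteness, one bin lookup in such an index might look like the following Java sketch; LongArrayWritable is the paper's own value type, assumed here to be an ArrayWritable subclass, and the index directory path is illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapFile;

public class MapFileLookup {

    /** One possible definition of the paper's LongArrayWritable value type. */
    public static class LongArrayWritable extends ArrayWritable {
        public LongArrayWritable() { super(LongWritable.class); }
    }

    /** Returns the posting list stored for one YSCORE bin, or null if the bin is absent. */
    public static LongArrayWritable lookup(String indexDir, int bin) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        MapFile.Reader reader = new MapFile.Reader(fs, indexDir, conf);
        try {
            LongArrayWritable postings = new LongArrayWritable();
            return (LongArrayWritable) reader.get(new IntWritable(bin), postings);
        } finally {
            reader.close();
        }
    }
}
```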

HBase, a Bigtable-like storage system for the Hadoop Distributed Computing Environment, is based on MapFile [13]. It does not perform well because of the complexity of MapFile.


D. Data mining framework on Hadoop

Although this improved data indexing performs better than MapFile indexing, the overhead of range querying by intersection algorithms is still too high. To apply common object-oriented data mining algorithms in Hadoop, the three problems posed in the introduction, globalization, random-write and duration, have to be handled. For this, we propose a data mining framework on Hadoop using JPA.

1) MySQL Cluster

After many failures, we found that, for data mining with Hadoop, a database is an ideal persistent repository for handling the three problems mentioned. To ensure computing robustness, a distributed database is the suitable solution for distributed handling of high-dimensional and large-scale data. MySQL Cluster is designed to eliminate any single point of failure [10] and is based on open source code; both characteristics make it consistent with data mining on Hadoop.

2) ORM

JPA and naïve JDBC are used to implement the ORM for objects of data mining in a database (Figure 6). To improve the data indexing method, a KD-tree or X-tree may be suitable; however, the decision tree is considered a classical algorithm and better elaborates the framework.

In the data mining framework, a large-scale dataset is scanned by Hadoop. The necessary information is extracted using classical data mining modules, which are embedded in the Hadoop map or reduce tasks. The extracted objects, such as lists, trees, and graphs, are sent automatically into a database, in this case MySQL Cluster. For testing, the test dataset is scanned and handled by each data mining task.

3) Stored procedure

After the objects obtained from data mining are stored in the database, all the details can be finished with stored procedures, such as finding a path between two nodes, returning all the children of a node, and returning the category of a test case. A data mining task only needs to send requests and receive results, which decreases network overhead and latency.

The following code returns all the children of a node in a binary tree (Figure 7).

CREATE PROCEDURE getChild (IN rootid INT)
BEGIN
  -- working tables: chd holds the current level, par the previous level
  DROP TABLE IF EXISTS chd;
  CREATE TEMPORARY TABLE chd (id INT NOT NULL DEFAULT -1) ENGINE = MEMORY;
  DROP TABLE IF EXISTS par;
  CREATE TEMPORARY TABLE par (id INT NOT NULL DEFAULT -1) ENGINE = MEMORY;
  DROP TABLE IF EXISTS result;
  CREATE TABLE result (id INT NOT NULL DEFAULT -1);
  -- seed with the direct children of rootid
  INSERT INTO chd SELECT A.ID FROM btree A WHERE rootid = A.PID;
  WHILE ROW_COUNT() > 0 DO
    TRUNCATE par;
    INSERT INTO par SELECT * FROM chd;
    INSERT INTO result SELECT * FROM chd;
    TRUNCATE chd;
    -- descend one level: children of the nodes collected so far
    INSERT INTO chd SELECT A.ID FROM btree A, par B WHERE B.ID = A.PID;
  END WHILE;
END;

Figure 7 Stored procedure for all children of a node in MySQL
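From a Hadoop task the procedure can be invoked through JDBC; the following is a minimal sketch, where the connection URL, credentials and result-table layout are placeholders rather than the paper's actual configuration:

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public class GetChildClient {
    public static List<Integer> getChildren(int rootId) throws Exception {
        Class.forName("com.mysql.jdbc.Driver");
        Connection con = DriverManager.getConnection(
                "jdbc:mysql://dbhost/dmdb", "user", "password");   // placeholder URL and credentials
        try {
            CallableStatement call = con.prepareCall("{CALL getChild(?)}");
            call.setInt(1, rootId);
            call.execute();                                        // fills the result table
            List<Integer> children = new ArrayList<Integer>();
            Statement st = con.createStatement();
            ResultSet rs = st.executeQuery("SELECT id FROM result");
            while (rs.next()) {
                children.add(rs.getInt(1));
            }
            return children;
        } finally {
            con.close();
        }
    }
}
```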

4) Decision tree implementation

As shown in the textbook [7], a decision tree for continuous values can be built as a binary tree. Let D be a dataset. The entropy of D is given by

$ Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i) \quad (1) $

where m is the number of categories and $p_i$ is the probability of category i. Let A be an attribute of D. The entropy of D based on the partitioning by A is given by

$ Info_A(D) = \frac{|D_1|}{|D|} \times Info(D_1) + \frac{|D_2|}{|D|} \times Info(D_2) \quad (2) $

where Info() is from formula (1), $D_1$ is the part of D that belongs to the left child node, and $D_2$ is the part of D that belongs to the right child node.
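For reference, formulas (1) and (2) can be computed as in the following sketch (our own method names, operating on per-category record counts):

```java
/** Sketch: entropy-based split criterion of formulas (1) and (2). */
public class Entropy {

    /** Info(D) from formula (1); counts[i] is the number of records of category i in D. */
    public static double info(long[] counts) {
        long total = 0;
        for (long c : counts) total += c;
        double info = 0.0;
        for (long c : counts) {
            if (c == 0) continue;              // 0 * log2(0) is taken as 0
            double p = (double) c / total;
            info -= p * (Math.log(p) / Math.log(2));
        }
        return info;
    }

    /** Info_A(D) from formula (2) for a binary split of D into D1 and D2. */
    public static double infoA(long[] countsD1, long[] countsD2) {
        long n1 = sum(countsD1), n2 = sum(countsD2);
        double n = n1 + n2;
        return (n1 / n) * info(countsD1) + (n2 / n) * info(countsD2);
    }

    private static long sum(long[] counts) {
        long s = 0;
        for (long c : counts) s += c;
        return s;
    }
}
```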

Each node in the tree has to contain two members, an ATTRIBUTE and a VALUE. Given a test record (an array of continuous values), the former tells which field should be checked, and the latter tells which child node should be selected. If the test record's value in the member ATTRIBUTE is less than or equal to the member VALUE, the left child node is selected; otherwise the right child node is selected.

In Hadoop, the JobSelect has to select the member ATTRIBUTE and the member VALUE using the entropy criterion (Figure 8). A training data record consists of multiple values and a category. In the mapperSelect tasks, a record is decomposed into AVC_Writable tuples <attribute, value, category>, where the attribute indicates the position of the value in the record and the category indicates the class of the record. For example, a record <3, 5, 7, 9, -1> produces the tuples <1, 3, -1>, <2, 5, -1>, <3, 7, -1>, <4, 9, -1>. These AVC_Writable tuples are the keys of the intermediate key-value pairs; the values are VC_Writable tuples, i.e., the <value, category> part of the AVC_Writable tuples.

[Figure 6 Data flow diagram of the data mining framework on Hadoop: training and testing data are processed by Hadoop data mining tasks, whose model objects are persisted in ndbcluster (MySQL Cluster) via JDBC or JPA.]

Hadoop's method setPartitionerClass() decides to which reducer a key-value pair should be sent, using the attribute in the AVC_Writable tuple. Hadoop's method setOutputValueGroupingComparator() groups the values (VC_Writable tuples) of the same attribute in the ReducerSelect tasks. By default the values are sorted by their keys, so determining the best split value for an attribute requires only one pass through the VC_Writable tuples using formula (2).
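The decomposition described above, which feeds this partitioner and grouping comparator, might look like the following fragment of a mapperSelect task; AVC_Writable and VC_Writable are the paper's own types, and the constructors shown are assumptions:

```java
// Fragment assumed to sit inside the mapperSelect class (old mapred API);
// record[0..m-1] holds the attribute values of one training record,
// category holds its class label.
public void emitTuples(float[] record, int category,
                       org.apache.hadoop.mapred.OutputCollector<AVC_Writable, VC_Writable> output)
        throws java.io.IOException {
    for (int a = 1; a <= record.length; a++) {
        float value = record[a - 1];
        // key carries <attribute, value, category>; value carries <value, category>
        output.collect(new AVC_Writable(a, value, category),
                       new VC_Writable(value, category));
    }
}
```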

[Figure 8 Data flow diagram of the selection step of the decision tree in Hadoop: in JobSelect, SourceDataSet ::= <LongWritable, FloatArrayWritable> is processed by mapperSelect tasks into IntermediateData ::= <AVC_Writable, VC_Writable>, which ReducerSelect tasks turn into SelectData ::= <EntropyWritable, AV_Writable>.]

After the best split values for each attribute are calculated by the ReducerSelect tasks, the key-value pairs <EntropyWritable, AV_Writable> are output and the pair with the minimum expected entropy is selected. Let A0 and V0 be the best attribute-value pair. D1 is the subset of D satisfying A0 <= V0, and D2 is the subset of D satisfying A0 > V0.

Before the split, a node is created in JPA corresponding to D, with the best attribute A0 and the best splitting value V0.

JobSplit contains only mapperSplit tasks, which do not use the method output.collect (Figure 9); instead, each task writes directly to two sequence files in the HDFS. This avoids unwanted I/O overhead. The SourceDataSet is D, the ChildDataLeft is D1 and the ChildDataRight is D2; all of these are in fact directories in HDFS that contain the outputs of every mapperSplit task.
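Writing directly to the two child files could look roughly like this inside a mapperSplit task; the directory names, key/value types and the split test are simplified assumptions rather than the paper's actual code:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

/** Sketch: each mapperSplit task owns one output file per child directory. */
public class SplitWriters {
    private final SequenceFile.Writer left, right;

    public SplitWriters(Configuration conf, String taskId) throws Exception {
        FileSystem fs = FileSystem.get(conf);
        left  = SequenceFile.createWriter(fs, conf, new Path("ChildDataLeft",  taskId),
                                          LongWritable.class, Text.class);
        right = SequenceFile.createWriter(fs, conf, new Path("ChildDataRight", taskId),
                                          LongWritable.class, Text.class);
    }

    /** Route one record to D1 (left) or D2 (right) according to the selected attribute and value. */
    public void write(LongWritable lineNo, Text record, boolean goesLeft) throws Exception {
        (goesLeft ? left : right).append(lineNo, record);
    }

    public void close() throws Exception {
        left.close();
        right.close();
    }
}
```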

The above processing is just for one node. To build a whole decision tree, the processing needs to be repeated recursively. A complete decision tree Hadoop algorithm is shown in Figure 10.

1. nodeD.Path ← D; JPA.persist(nodeD);
2. BuildTree(nodeD);
FUNCTION BuildTree (Path D)
3. IF (all categories in D are the same C) THEN
   JPA.begin(); set nodeD as a leaf; set nodeD.category as C; JPA.commit(); RETURN nodeD;
4. IF (all attributes checked) THEN
   JPA.begin(); nodeD.Leaf ← true; nodeD.category ← max(C){C ∈ D}; JPA.commit(); RETURN nodeD;
5. JPA.begin(); nodeD.attribute, nodeD.value ← JobSelect(D); JPA.commit();
6. JPA.begin(); D1, D2 ← JobSplit(D); JPA.commit();
7. nodeLeft ← BuildTree(D1);
8. nodeRight ← BuildTree(D2);
9. JPA.persist(nodeLeft); JPA.persist(nodeRight);
10. JPA.begin(); nodeD.ChildLeft ← nodeLeft; nodeD.ChildRight ← nodeRight; JPA.commit();
11. RETURN nodeD

Figure 10 Algorithm for building a decision tree on Hadoop

The JPA.persist() function (Figure 10) persists a new node through JPA. The assignments enclosed by JPA.begin() and JPA.commit() are performed inside a JPA transaction and are therefore monitored by JPA. Any node in the cluster shares the objects persisted by JPA.
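A decision-tree node could be mapped with JPA roughly as follows; this is a sketch, and the field names and relationship mapping are our assumptions rather than the paper's actual entity class:

```java
import javax.persistence.CascadeType;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import javax.persistence.OneToOne;

/** Sketch of a persistent decision-tree node. */
@Entity
public class TreeNode {
    @Id
    @GeneratedValue
    private Long id;

    private int attribute;      // index of the attribute tested at this node (A0)
    private float value;        // best splitting value (V0)
    private boolean leaf;
    private int category;       // class label, meaningful only for leaves
    private String path;        // HDFS path of the data partition D for this node

    @OneToOne(cascade = CascadeType.ALL)
    private TreeNode childLeft;

    @OneToOne(cascade = CascadeType.ALL)
    private TreeNode childRight;

    // getters and setters omitted for brevity
}
```

Calling em.persist(node) inside an EntityManager transaction then corresponds to the JPA.persist()/JPA.begin()/JPA.commit() steps of Figure 10.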

III. EXPERIMENTAL SETUP

We evaluated several synthetic datasets and one open dataset.

A. Computing Environment

Experiments were conducted using Linux CentOS 5.0, Hadoop 0.19.1, JDK 1.6, a 1 Gbps switched network, and 4 ASUS RS100-X5 servers (Table 6).

Table 6 Devices for Hadoop computing.
Environment   Parameter
Model name    ASUS RS100-X5
CPU           Dual Core 2 GHz
Cache size    1024 KB
Memory        3.4 GB
NIC           1 Gbps
Hard disk     1 TB

[Figure 9 Data flow diagram of the split step of the decision tree in Hadoop: in JobSplit, SourceDataSet ::= <LongWritable, FloatArrayWritable> is written by mapperSplit tasks directly into ChildDataLeft and ChildDataRight.]


B. Inverted indexing vs MapFile indexing

Two experiments of different sizes are shown in Table 7. A (1,000,000 × 200) synthetic data set was generated at random for Test1; a (102,294 × 117) data set was taken from KDD Cup 2008 [14] for Test2. The procedure followed the design in Section II.A.2. In the "Index Method" row, M means MapFile and I means inverted index. In both tests the inverted index needs fewer seconds for searching than MapFile. The SearcherTime is the average for a single query.

Table 7 Comparison of MapFile and inverted indexing.
Item                      Test1            Test2
Index Method              M       I        M       I
FloatData (MB)            1,614            204
DiscretizationTime (s)    180              59
EncoderTime (s)           482              42
IndexerTime (s)           5835    8473     307     397
SearcherTime (s)          8.25    1.07     5.07    0.68

For a single query, the searching time of the inverted indexing shows a clear advantage.

C. Data mining framework

The JDBC and JPA implementations of the decision tree are compared on the same synthetic dataset (Table 8). The listed times are cumulative over all procedures. To limit the size of the tree, the depth of any tree is limited to 20.

Table 8 Comparison of JDBC and JPA on MySQL Cluster.
Item                          JDBC                 JPA
Size of synthetic data        5,897,367,552 bytes; 5,000,000 records with 117 fields
Number of nodes               501,723 nodes
Accumulated inserting time    2,695.307 seconds    2,981.279 seconds
Time for getChild             257.93 seconds       324.51 seconds

The times in Table 8 can be decreased dramatically if robustness is disregarded; i.e., if the MyISAM engine is used on a single node instead of the NDBCLUSTER engine on 4 nodes, the database performance improves by approximately 25 times.

IV. CONCLUSION

A feasible data indexing algorithm is proposed for high-dimensional, large-scale data sets. It consists of YSCORE binning, Hadoop processing, and improved intersection. Experimental results show that the algorithm is technically feasible.

An efficient data mining framework is proposed for Hadoop; it employs JPA and MySQL Cluster to resolve the problems of globalization, random-write and duration in Hadoop. The distributed computing in Hadoop and the centralized data collection using JPA are combined organically.

In future work we will consider replacing MapFile with our inverted indexing for HBase, in an effort to enhance the performance of HBase in larger-scale data mining. A convenient tool for the data mining framework will also be a focus of future efforts.

ACKNOWLEDGEMENTS

This work is supported by the National Basic Research Priorities Programme (No. 2007CB311004), the National Science Foundation of China (No. 60775035, 60903141, 60933004, 60970088), and the National Science and Technology Support Plan (No. 2006BAC08B06).

REFERENCES

[1] C. Bohm, S. Berchtold, H. P. Kriegel, and U. Michel, "Multidimensional index structures in relational databases," in 1st International Conference on Data Warehousing and Knowledge Discovery (DaWaK 99), Florence, Italy, 1999, pp. 51-70.
[2] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," in 6th Symposium on Operating Systems Design and Implementation (OSDI 04), San Francisco, CA, 2004, pp. 137-149.
[3] R. M. C. McCreadie, C. Macdonald, and I. Ounis, On Single-Pass Indexing with MapReduce. New York: Association for Computing Machinery, 2009.
[4] R. Lammel, "Google's MapReduce programming model - Revisited," Science of Computer Programming, vol. 70, pp. 1-30, Jan. 2008.
[5] C. Moretti, K. Steinhaeuser, D. Thain, and N. V. Chawla, "Scaling Up Classifiers to Cloud Computers," in IEEE International Conference on Data Mining (ICDM), Pisa, Italy, 2008, http://icdm08.isti.cnr.it/Paper-Submissions/32/accepted-papers.
[6] D. Gillick, A. Faria, and J. DeNero, "MapReduce: Distributed Computing for Machine Learning," 2006, http://www.icsi.berkeley.edu/~arlo/publications/gillick_cs262a_proj.pdf.
[7] J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd ed. San Francisco: Morgan Kaufmann, 2006.
[8] S. Berchtold, D. A. Keim, and H. P. Kriegel, "The X-tree: An Index Structure for High-Dimensional Data," Readings in Multimedia Computing and Networking, 2001.
[9] R. Biswas and E. Ort, "Java Persistence API - A Simpler Programming Model for Entity Persistence," http://java.sun.com/developer/technicalArticles/J2EE/jpa/, 2009.
[10] S. Hinz, P. DuBois, J. Stephens, M. M. Brown, and A. Bedford, "MySQL Cluster," http://dev.mysql.com/doc/refman/5.0/en/mysql-cluster-overview.html, 2009.
[11] Wikipedia, "Central limit theorem," 2009, http://en.wikipedia.org/wiki/Central_limit_theorem.
[12] Apache Software Foundation, "MapFile (Hadoop 0.19.0 API)," 2008, http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/io/MapFile.html.
[13] J. Gray, "HBase: Bigtable-like structured storage for Hadoop HDFS," 2008, http://wiki.apache.org/hadoop/Hbase.
[14] S. M. S. USA, "KDD Cup data," 2008, http://www.kddcup2008.com/KDDsite/Data.htm.
