parallel multi-objective ant programming for classification using gpus

33
Parallel Multi-Objective Ant Programming for Classification Using GPUs Alberto Cano a , Juan Luis Olmo a , Sebasti´ an Ventura a,* a Department of Computer Science and Numerical Analysis, University of Cordoba, 14071 Cordoba, Spain Abstract Classification using Ant Programming is a challenging data mining task which demands a great deal of computational resources when handling data sets of high dimensionality. This paper presents a new parallelization approach of an existing multi-objective Ant Programming model for classification, using GPUs and the NVIDIA CUDA programming model. The computational costs of the different steps of the algorithm are evaluated and it is discussed how best to parallelize them. The features of both the CPU parallel and GPU versions of the algorithm are presented. An experimental study is carried out to evaluate the performance and efficiency of the interpreter of the rules, and reports the execution times and speedups regarding variable population size, complexity of the rules mined and dimensionality of the data sets. Experiments measure the original single-threaded and the new multi-threaded CPU and GPU times with different number of GPU devices. The results are reported in terms of the number of Giga GP operations per second of the interpreter (up to 10 billion GPops/s) and the speedup achieved (up to 834x vs CPU, 212x vs 4–threaded CPU). The proposed GPU model is demonstrated to scale efficiently to larger datasets and to multiple GPU devices, which allows of expanding its applicability to significantly more complicated data sets, previously unmanageable by the original algorithm in reasonable time. Keywords: ant programming (AP), ant colony optimization (ACO), parallel computing, GPU, classification * Corresponding author Email addresses: [email protected] (Alberto Cano), [email protected] (Juan Luis Olmo), [email protected] (Sebasti´ an Ventura) Preprint submitted to Journal of Parallel and Distributed Computing November 12, 2012

Upload: independent

Post on 14-May-2023

0 views

Category:

Documents


0 download

TRANSCRIPT

Parallel Multi-Objective Ant Programming forClassification Using GPUs

Alberto Canoa, Juan Luis Olmoa, Sebastian Venturaa,∗

aDepartment of Computer Science and Numerical Analysis,University of Cordoba, 14071 Cordoba, Spain

Abstract

Classification using Ant Programming is a challenging data mining task whichdemands a great deal of computational resources when handling data sets of highdimensionality. This paper presents a new parallelization approach of an existingmulti-objective Ant Programming model for classification, using GPUs and theNVIDIA CUDA programming model. The computational costs of the differentsteps of the algorithm are evaluated and it is discussed how best to parallelizethem. The features of both the CPU parallel and GPU versions of the algorithmare presented. An experimental study is carried out to evaluate the performanceand efficiency of the interpreter of the rules, and reports the execution times andspeedups regarding variable population size, complexity of the rules mined anddimensionality of the data sets. Experiments measure the original single-threadedand the new multi-threaded CPU and GPU times with different number of GPUdevices. The results are reported in terms of the number of Giga GP operationsper second of the interpreter (up to 10 billion GPops/s) and the speedup achieved(up to 834x vs CPU, 212x vs 4–threaded CPU). The proposed GPU model isdemonstrated to scale efficiently to larger datasets and to multiple GPU devices,which allows of expanding its applicability to significantly more complicated datasets, previously unmanageable by the original algorithm in reasonable time.

Keywords: ant programming (AP), ant colony optimization (ACO), parallelcomputing, GPU, classification

∗Corresponding authorEmail addresses: [email protected] (Alberto Cano), [email protected] (Juan Luis Olmo),

[email protected] (Sebastian Ventura)

Preprint submitted to Journal of Parallel and Distributed Computing November 12, 2012

1. Introduction

Classification is a supervised machine learning task which consists in predict-ing the class membership of uncategorised examples, whose label is not known,using the properties of examples in a model learned previously from training ex-amples, whose label was known. Classification tasks include a broad range of realworld application domains: disciplines such as bioinformatics, medical diagno-sis, image recognition, and financial engineering, among others, where domainexperts can use the model learned to support their decisions [26, 36].

A great variety of algorithms and techniques have been used to accomplish thistask [32], including decision trees [21], decision rules [31], naive Bayes [23], sup-port vector machines [24], neural networks [22], genetic algorithms [20], geneticprogramming [53], ant colony optimization (ACO) [48], etc. More recently, antprogramming (AP) has been used to accomplish the classification task with greatsuccess, generating interpretable classifiers made up of IF–THEN rules [5, 18, 43].

The curse of dimensionality is one of the major challenges to the application ofant-based algorithms [40], owing to the long computational time necessary. More-over, focusing on the classification issue, the induction of classifiers is becomingincreasingly complicated, since the capabilities of data generation and collectionin real application domains is growing at an exponential pace. Actually, it wouldbe infeasible, or at least would take a lot of time, to run a sequential classifica-tion algorithm over certain data sets. Therefore, it has become crucial to designparallel algorithms capable of handling these large amounts of data [2, 38].

Concerning the computing architectures used to parallelize ACO algorithms,cluster platforms have been traditionally the option most employed, followed bymultiprocessors and massively parallel computers [50]. Recently, other choiceshave been gaining ground, such as grid computing, multi-core servers, and graphicprocessing units (GPUs) [9, 8, 13].

GPUs are devices with multi-core architectures and parallel processor units,which provide fast parallel hardware for a fraction of the cost of a traditionalparallel system. Actually, since the introduction of the computer unified devicearchitecture (CUDA) in 2007, researchers all over the world have harnessed thepower of the GPU for general purpose computing (GPGPU) [10, 47, 46]. The useof GPGPU has been already studied for speeding up algorithms within the frame-work of evolutionary computation and data mining [15, 28, 45]. Owing to thegreat advantages provided by GPGPU, we would like to explore the performanceof a GPU-based parallelization of the Multi-Objective Grammar-Based Ant Pro-gramming (MOGBAP) algorithm for classification [44], which has demonstrated

2

its good performance at addressing the extraction of classification rules. The ex-perimental study analyses the performance and scalability of the algorithm fromsimple to complex problems, increasing the number of instances and attributes.Experimental results show that its parallelization allows of extending the rangeof applications of this algorithm to other domains that have data sets includinghuge amounts of instances and attributes, domains where the application of theMOGBAP algorithm was therefore extremely difficult until now.

This paper is organized as follows. In the next section, we will present somerelated work on parallel ACO as applied to classification, as well as GPU-basedproposals for ACO. Section 3 presents some details concerning the CUDA plat-form. In Section 4, we will first present the sequential version of the algorithm,then discuss the parallelization of the multi-objective AP algorithm for classifica-tion, and conclude by presenting the GPU design and implementation. Section 5describes the experimental study. The results obtained are discussed in Section 6.Finally, Section 7 presents some concluding remarks.

2. Related Work

Several paradigms and taxonomies have been proposed, aiming at collectingthe different possibilities for parallelizing an ACO algorithm. The most appropri-ate parallelization method depends upon the kind of problem to be solved, the datastructures employed, and the stages which receive more computational time [51].

A recent survey of parallel ACO has been published by Pedemonte et al. [50].In that paper, the authors propose an updated taxonomy to classify software-basedparallel ACO algorithms, providing also an insight into current trends in the field.Since we refer to the categories and strategies of this taxonomy to explain thedecisions adopted in parallelizing our AP algorithm, we will briefly mention thesecategories.

• Master–slave model. A master process takes care of managing the globalinformation (i.e., pheromone matrix, best rule search, reinforcement andevaporation, etc.). The master process distributes tasks that have no ef-fect on the global structures to several slave processes, so that these slavesare in charge of looking for a solution and evaluating it, computing its fit-ness. Three subcategories are considered in this paradigm, depending onthe amount of work performed by the slave processes. In order of increas-ing work and, therefore, more communication between the master and theslaves, these categories are: coarse-grain [57], medium-grain [14] and fine-grain [17].

3

• Cellular model. A single population is divided into small neighbourhoods,each one having its own pheromone matrix. Each individual interacts justwith its neighbours, and neighbourhoods overlap in order to allow goodsolutions to spread to the entire population [49].

• Parallel independent runs model. Independent runs on a set of proces-sors of several sequential ACO algorithms without exchanging communica-tion [4].

• Multicolony model. Several ant colonies explore the search space, ex-changing information periodically to cooperate. Each colony has a separatepheromone matrix [30].

• Hybrid models. Other proposals that share features from two or more par-allel models. [51]

We will now draw attention to the specific ACO parallel proposals for clas-sification rule mining, most of them focused on adapting the original Ant-Mineralgorithm [48] to a parallel environment. The first paper was published by Chenet al., who developed the Parallel Ant Miner algorithm [11]. It is a parallel ACOalgorithm based on the massively parallel processors computational model, whichfollows the message passing method. In that paper, a class label is assigned to eachprocessor, and a group of ants is allocated to each processor in order to search forthe antecedent of the rules, following a coarse-grain master–slave model.

Roozmand and Zamanifar [52] extended the algorithm developed by Chen etal., incorporating the multicolony concept and using an ant colony system. Theirproposal considers several independent colonies of ants, each one in charge ofdiscovering rules related to a given class label. Inside each group, ants are alsoexecuted in parallel, following a coarse-grain master–slave model to build andevaluate solutions. Ants of a given group communicate among themselves, andalso with the best ant of the other groups, to update the pheromone matrices. Theauthors reported a higher speed of convergence than the previous Parallel AntMiner algorithm, also discovering more accurate rules.

Chintalapati et al. [12] parallelized the original Ant-Miner algorithm in acluster environment following a coarse-grain master–slave paradigm. They paral-lelized the rule construction stage and the discretization of the numerical attributesbecause, as they reported, these two tasks required more computational time, sothat their parallel implementation speeds significantly the execution of the algo-rithm. The original Ant-Miner was applied locally in each processor, and both

4

the pheromone information and the best discovered rule were updated once theprocessors submitted their information to the master process. The main differ-ences from the paper by Roozmand and Zamanifar is that the ants can search forrules having different class labels in each processor, instead of just the class la-bel assigned to the processor. In addition, the master process maintains a rule setwhere the discovered rules are stored, in the order of their discovery. The authorsconcluded that the best performance results were achieved when using data setshaving more attributes and using a greater number of ants.

The most recent contribution to the field of classification rule mining usingACO has been the AntMinerGPU algorithm [55], proposed by Weiss, which con-sists in the parallelization of the Ant-Miner+ algorithm [39] by using GPUs. Thehypothesis stated in that paper is that higher quality solutions could be found inless time by parallelizing the algorithm using GPUs, because this allows gener-ating more candidate solutions in each generation. The AntMinerGPU algorithmoffloads the implicit parallel steps of AntMiner+ to a GPU device, and it allocateseach ant its own thread using the GPUs multi-threading capabilities. Experimentsusing two data sets focused on analysing the effects on the accuracy and runningtime caused by changes in the population size. The author concluded that the ac-curacy values obtained by both the GPU and the CPU implementation were verysimilar, achieving a speedup close to 100x for large populations.

Concerning other GPU-based parallelizations of ACO algorithms devoted toproblems other than classification rule mining, we can mention the paper by Jien-ing et al. [29], where a parallel implementation of the ACO Max–Min Ant Sys-tem to solve the travelling salesman problem (TSP) was presented. More recently,Cecilia et al. [9] presented a more extensive and deeper work concerning the par-allelization of ACO algorithms to solve the TSP, where several strategies for ac-celerating algorithms addressing this problem by using the GPU architecture werediscussed.

3. CUDA programming model

Computer unified device architecture (CUDA) [1] is a parallel computing ar-chitecture developed by NVIDIA that allows programmers to take advantage ofthe computing capacity of NVIDIA GPUs in a general purpose manner. TheCUDA programming model executes kernels as batches of parallel threads in asingle instruction multiple data (SIMD) programming style. These kernels com-prise thousands to millions of lightweight GPU threads per each kernel invocation.

5

CUDA’s threads are organized into a two-level hierarchy. At the top level, agrid is organized as a two dimensional array of blocks. At the bottom level of thehierarchy, all threads of a block are organized into a three-dimensional array ofthreads. All the blocks in a grid have the same number of threads, with a maximumof 512 (1.3 CUDA capability devices). The maximum number of thread blocksis 65535 x 65535, so each device can run up to 65535 x 65535 x 512 = 2 · 1012threads per kernel call.

To properly identify threads within the grid, each thread in a thread block hasa unique ID in the form of a three-dimensional coordinate, and each block in agrid also has a unique two-dimensional coordinate.

Thread blocks are executed in streaming multiprocessors. A stream multi-processor can perform zero overhead scheduling to interleave warps (a warp isa group of threads that execute together) and hide the overhead of long-latencyarithmetic and memory operations.

There are four different main memory spaces: global, constant, shared, andlocal. These GPU memories are specialized and have different access times, life-times, and output limitations.

• Global memory: this is a large, long-latency memory that exists physicallyas an off-chip dynamic device memory. Threads can read and write to globalmemory in order to share data, and must write the kernel’s output so that itcan be read even after the kernel terminates. However, a better way to sharedata and improve performance is to take advantage of shared memory.

• Shared memory: this is a small, low-latency memory that exists physicallyas on-chip registers, and its contents are only maintained during the threadblock execution and are discarded when the thread block completes. Ker-nels that read or write to a known range of global memory with spatial ortemporal locality can employ shared memory as a software-managed cache.Such caching potentially reduces global memory bandwidth demands andimproves overall performance.

• Local memory: each thread also has its own local memory space as reg-isters, so the number of registers a thread uses determines the number ofconcurrent threads executed in the multiprocesor, which is called multipro-cessor occupancy. To avoid wasting hundreds of cycles while a thread waitsfor a long-latency global-memory load or store to complete, a common tech-nique is to execute batches of global accesses, one per thread, exploiting thehardware’s warp scheduling to overlap the threads’ access latencies.

6

• Constant memory: this is specialized for situations in which many threadswill read the same data simultaneously. This type of memory stores datawritten by the host thread, is accessed constantly, and does not change dur-ing the execution of the kernel. A value read from the constant cache isbroadcast to all threads in a warp, effectively serving 32 loads from mem-ory with a single-cache access. This enables a fast, single-ported cacheto feed multiple simultaneous memory accesses. The amount of constantmemory is 64 KB.

There are some recommendations for maximum performance [19]. Memoryaccesses must be coalesced as with accesses to global memory. Global memoryresides in device memory and is accessed via 32, 64, or 128-byte segment memorytransactions. When a warp executes an instruction that accesses global memory, itcoalesces the memory accesses of the threads within the warp into one or more ofthese memory transactions, depending on the size of the word accessed by eachthread and the distribution of the memory addresses across the threads. In general,the more transactions are necessary, the more unused words are transferred inaddition to the words accessed by the threads, reducing the instruction throughput.

To maximize global memory throughput, it is essential to maximize this coa-lescing by following the most optimal access patterns [27], using data types thatmeet the size and alignment requirements or padding data in some cases, e.g.,when accessing a two-dimensional array. For these accesses to be fully coalesced,the width of both the thread block and the array must be multiples of the warpsize.

4. GPU-MOGBAP (GPU Multi-Objective Grammar Based Ant Program-ming Algorithm)

In this section, we first introduce the original Multi-Objective Grammar-BasedAnt Programming (MOGBAP) [44] algorithm for classification, so as to take intoaccount its requirements when assessing the options for its parallelization. Then,the parallel design decisions which were adopted will be justified. Finally, thespecific GPU implementation will be presented.

4.1. Serial versionThe MOGBAP algorithm is a multi-objective AP algorithm designed specifi-

cally for multi-classification rule mining. It receives a training set as input, build-ing a classifier that consists of a decision list where the discovered rules are sorted

7

in descending order by their fitness. To classify a new unlabelled instance, theclass assigned will correspond to the consequent of the first rule whose antecedentmatches the instance, or the default rule in case no rule in the decision list coversthe instance. Further and detailed information about the MOGBAP algorithm andits pseudocode can be found in [44].

= attr1

value11

<EXP>

AND <EXP> <COND><COND>

= attr1

value12

...= attr

n

valuenm

AND AND <EXP> <COND> <COND>AND <COND> <COND>

AND AND AND <EXP>

<COND> <COND> <COND>

AND AND <COND>

<COND> <COND>

AND

= attr1 value

11

<COND>

AND

= attr1 value

12

<COND>

...

AND

= attrn value

nm

<COND>

der = 1

der = 2

der = 3

(= attr1 value

12)

Figure 1: Space of states at a depth of 3 derivations. The sample shaded path represents the pathfollowed by an ant. Double-line states represent final states.

This algorithm is based on the use of a context-free grammar that restricts thesearch space, ensuring the validity of any individual generated [56]. Regardingthe generation of new individuals, each individual or ant will encode a rule. Thecreation of a new ant follows the application of the transition rule from the initialstate of the environment, which adopts the shape of a derivation tree, until reach-ing a final state or solution. As an example, Figure 1 shows the derivation treegenerated at a depth of three derivations, and the path followed by a given ant ishighlighted.

It is worth pointing out that the structure of the space of states is not knownbeforehand, and neither is the length of the path followed by a given ant. Actually,the treatment of the environment is a key issue for the algorithm, since it dependsdirectly upon the dimensionality of the data set and the number of derivations per-mitted from the grammar. Therefore, there would be an excessive computationalcost in keeping in memory the whole space of states. To avoid this problem, thealgorithm follows a sequential and incremental build approach. The data struc-ture that represents the space of states is initialized with just the initial state andall possible transitions have the same amount of pheromones. This data structurealso contains attributes that take into account the effects of the evaporation and

8

normalization processes over the environment. The data structure is filled as antsare created, storing there the states of each ants path.

As to the multi-objective scheme [3], the idea behind the strategy designedfor MOGBAP is to discover a separate Pareto front for each class in the dataset, because certain classes are more difficult to predict than others. Actually, ifindividuals from different classes are ranked simultaneously according to Paretodominance, overlapping may occur. The multi-objective strategy can be summa-rized as follows. Once the individuals of the current generation have been createdand evaluated for each objective considered, they are divided into k groups accord-ing to their consequent, k being the number of classes in the training set. Then,each group of individuals is combined with the solutions kept in the correspond-ing Pareto front found in the previous iteration of the algorithm, to rank them allaccording to dominance, finding a new Pareto front for each class. Hence, therewill be k Pareto fronts, and only the non-dominated solutions in them will par-ticipate in the pheromone reinforcement. Finally, the output classifier is made upof the non-dominated individuals that exist in each of the k Pareto fronts oncethe last generation of the algorithm finishes. To select the rules of the classifierappropriately, a niching procedure is carried out over each frontier.

Despite the fact that it is known that hard-to-compute (super-linear-order) fit-ness functions tend to dominate the execution time of bio-inspired and evolution-ary algorithms [34, 41], we have carried out a brief study of the average time takenby each different stage of the subject algorithm, MOGBAP. Bearing in mind thatthe main objective of accelerating this algorithm by a GPU is to be able to ex-ecute large data sets, we have selected a data set of high dimensionality for thisexperiment, the poker data set, which has one million instances and ten attributes.The hardware configuration and experimental settings for this study are the onesdetailed in Section 5. The results are shown in Table 1, where the percentage ofexecution time on average taken by each stage of the algorithm is shown. Theseresults clearly support the assumption that the evaluation phase is the one that re-quires more computational time, nearly 96.3% of the total execution time of thealgorithm. They also point out that the ant creation stage involves approximately3% of the computational time, the niching procedure employed to select the rulesthat make up the final classifier involves around 0.6%, and finally, the sum of theremaining stages scarcely affects the computational time. Thus, the most signifi-cant speedup of the algorithm can be reached by accelerating the evaluation stage,which is achieved by taking advantage of the computing capacity of GPUs. Nev-ertheless, the parallel model explained in detail in this section also addresses theparallelization of other stages by using multi-threading.

9

Table 1: Workload of each phase of the sequential implementation of the MOGBAP algorithm

Stage Time (%)Initialization 3.06E-3Ant creation 3.079Ant evaluation 96.33Multi-objective strategy 3.03E-4Reinforcement 3.79E-4Evaporation 1.35E-4Normalization 1.03E-4Classifier constructionfrom niching procedure 0.585

4.2. Parallel versionThe initialization stage of the MOGBAP algorithm is in charge of starting up

the grammar, taking into account the metadata of the training set used to build theclassifier, and it also initializes the data structure used for the space of states.

As stated above, the data structure of the space of states follows an onlineconstruction approach. Each created ant stores its visited states in this structure,if they are not already there (due to the fact that the state in question had not yetbeen visited). The ant creation phase can be parallelized by using multi-threading,since ants can be created simultaneously, bearing in mind that the path followedby a given ant does not depend on the path followed by the others. However, it isimportant to note that the storage of the states visited in the space of states struc-ture cannot be done by the threads directly since it is a common data structure thatcannot be modified concurrently. Therefore, the update of this structure cannotbe done in parallel an it is done by the master process, which receives the pathsfollowed by each ant created.

The evaluation stage checks the instances covered by each ant and computesthe fitness values for each objective based on the confusion matrix. The parallelimplementation with multi-threading is elemental, evaluating each ant in parallel.

Concerning the multi-objective strategy, the discovery of the Pareto front givena set of individuals cannot be parallelized, since a sequential procedure has tobe followed to find out the dominance relations between them. Nevertheless, asindividuals are grouped by the class consequent they predict and a Pareto front isdiscovered within each group, threads can be used to carry out this strategy, using

10

a thread for assessing the dominance inside each different group.Reinforcement, evaporation, and normalization stages are carried out by the

master process, since these stages require writing to the space of states data struc-ture, and written accesses cannot be concurrent.

The niching process that is carried out over the individuals of each Paretofront can be parallelized using multi-threading, and the resulting individuals arereturned to the master process, which sorts them accordingly in order to build thefinal classifier.

As can be observed, this parallel scheme can be fit into the hierarchical coarse-grain master–slave category [50], since the master process manages the globalinformation structures and delegates tasks to the slaves processes. These tasks arethe generation of individuals, their evaluation, the discovery of the Pareto fronts,and the niching procedure execution, communicating the results back to the masterprocess.

The CPU parallel version of the algorithm is designed to run threads using theJava ExecutorService, which defines a pool of threads dynamically regarding tothe number of available processors. This way, the scalability of the algorithm toprocessors with more cores is automatically performed without user supervision.

4.3. GPU versionThis section presents the particular GPU implementation of the MOGBAP al-

gorithm. Before focusing on the evaluation phase, which is parallelized by GPUs,we first justify why the remaining stages are not parallelized in the same way.

The initialization of the grammar and the space of states was not parallelizedusing multi-threading, so neither is it implemented in the GPU. The most inter-esting discussion concerns the ant creation stage: it was decided to parallelizethis stage using multi-threading, because the transfer of the data structure usedfor the space of states to the GPU is not appropriate. Actually, it is infeasible toobtain an internal parallelization beyond the one obtained by creating each ant inits own thread, because it makes no sense to parallelize the construction of a givenpath over the space of states, as it consists in the sequential selection of transitionsfrom one state to another. Therefore, the transfer costs would potentially outweighthe benefits of creating each ant in the GPU, besides being a complicated imple-mentation due to the employment of the grammar and the number of derivationsallowed for it. The parallel multi-threading implementation in the CPU is simpleand efficient.

The multi-objective strategy was parallelized by using multi-threading, as ex-plained in Section 4.2. It is not possible to parallelize this step further, because

11

MASTER PROCESS SLAVES

(MULTITHREADING)

SLAVES

(GPU)

Figure 2: Computational flow chart of the GPU version

the process of finding the Pareto dominance relations in a set of individuals is aserial process. The same justification could be argued for the niching procedure,since internally it proceeds by sorting individuals according to their fitness andperforming a sequence of operations following this order. Finally, reinforcement,evaporation, normalization, and classifier construction steps have to be carried outby the master process, as stated above.

On the other hand, the parallelization of the evaluation phase using GPUs isknown to perform efficiently [7, 16]. The evaluation of an ant does not dependon the evaluation of the others. Moreover, the process of interpreting the ant’srule and checking whether it covers an instance or not is also independent of the

12

#TP #FN

#FP #TN

Rule 1

. . .I1 I2 I3 II

FPTP TP . . . FN

Rule 2

. . .I1 I2 I3 II

FPTN TN . . . TN

Rule 3

. . .I1 I2 I3 II

TPTP FN . . . FP

Rule N

. . .I1 I2 I3 II

FPTP TP . . . TN. . .

. . .

#TP #FN

#FP #TN

Fitness Rule 1

C overage

Kernel

C onfusion

Matrix

Kernel

Fitness Rule 2 Fitness Rule 33

#TP #FN

#FP #TN

Fitness Rule 3N

#TP #FN

#FP #TN

. . .

Figure 3: GPU kernels

rules and the instances. Therefore, the GPU model proposes that each thread isin charge of interpreting one ant’s rule over one instance. Figure 2 shows thecomputational workflow of the GPU model in which the evaluation process on theGPU is performed using two kernel functions.

4.3.1. Coverage kernelThe coverage kernel interprets the rules, which are expressed in Reverse Polish

Notation (RPN), over the instances of the data set. The kernel is shown at the topof Figure 3 and it is executed using a 2D grid of thread blocks. The length alongthe first dimension is the number of rules. The length along the second dimensiondepends on the number of instances I and the number of threads per block N , i.e.,the number of thread blocks to cover all the instances is ceil(I/N). Thus, the totalnumber of thread blocks in the grid is the product of these two dimensions. Thisnumber is important, as it concerns the scalability of the model in future devices.NVIDIA recommends running at least twice as many thread blocks as the numberof multiprocessors [1].

The threads are designed to access coalesced global memory positions formaximum performance. This way, the threads in a warp request consecutive mem-ory addresses that can be serviced in fewer memory transactions. The rules areallocated to constant memory because this can provide broadcasting to all the

13

threads in a warp. All the threads in a warp evaluate the same rule, but over differ-ent instances. This follows the single instruction multiple data (SIMD) data levelparallelism, which is achieved when each processor performs the same task (ruleevaluation) on different instances of distributed data [37, 54].

Code 1 Coverage kernel

__global__ void coverageKernel(double* instancesData, int* instancesClasschar** rules, unsigned char* result)

{// Instance index using the CUDA built-in variablesint instance = blockDim.y * blockIdx.y + threadIdx.y;

// If the rule covers the instanceif(covers(rules[blockIdx.x], instance)){

// FPfor(int i = 0; i < numberClasses; i++)result[numberInstances * gridDim.x * i + blockIdx.x * numberInstances +

instance] = 2;

// TPresult[numberInstances * gridDim.x * instancesClass[instance] +

blockIdx.x * numberInstances + instance] = 0;}else // If the rules does not cover the instance{// TNfor(int i = 0; i < numberClasses; i++)result[numberInstances * gridDim.x * i + blockIdx.x * numberInstances +

instance] = 1;

// FNresult[numberInstances * gridDim.x * instancesClass[instance] +

blockIdx.x * numberInstances + instance] = 3;}

}

The interpreter of the rules is stack-based, i.e., the operands are pushed onto astack, and when an operation is performed, its operands are popped from the stackand its result pushed back on. The interpreter checks the coverage of the instanceby the rule and the outcome result depends on the predicted class and the trueclass of the instance, resulting in one of these four values: true positive (TP ), falsepositive (FP ), true negative (TN ), and false negative (FN ). Each thread writes theoutcome result to its corresponding memory position in the array which storesall the results, which will be counted by the confusion matrix kernel to computethe fitness values for the rules. The code for the coverage kernel is shown inCode 1. The coverage kernel receives as input three arrays: an array of attributes

14

values (instancesData), an array of class values of the instances of the dataset(instancesClass), and an array of arrays containing the rules to evaluate (rules),whereas it returns the compute results in an array of matching results (results).numberInstances and numberClasses are constant global variables visible by thekernels.

4.3.2. Confusion matrix kernelThe confusion matrix kernel, shown at the bottom of Figure 3, counts the

number of TP , FP , TN , and FN previously calculated for each rule. These val-ues are employed to build the confusion matrix which allows us to compute thefitness values. The process of summing the values from an array is known asreduction [25].

Designing an efficient parallel reduction is non-trivial, because it asks for theparallelization of an inherently sequential task. In fact, NVIDIA proposes sixdifferent approaches [1]. Some of the proposals take advantage of the deviceshared memory. Shared memory provides a small but fast memory shared by allthe threads in a block. This is quite desirable when the threads within a blockrequire synchronization and work together to accomplish a task such as reduction.

The threads are organized to perform a two-level reduction and provide coa-lesced memory accesses, avoiding shared memory bank conflicts. This way, thethreads in a warp request consecutive memory addresses that can be serviced infewer memory transactions.

Each thread performs a partial reduction of the coverage results for an ant,storing the partial counts into shared memory. Once all the items have been par-tially counted, a synchronization barrier is called. The synchronization halts theexecution until both all threads in the thread block have reached this point and allglobal and shared memory accesses are visible to all threads in the block, i.e., allpartial results are confirmed to be properly written in shared memory and theirvalues are visible by all the threads in the block. The synchronization barrier isused to coordinate communication between the threads of the same block. Whensome threads within a block access the same addresses in shared or global mem-ory, there are potential read-after-write, write-after-read, or write-after-write haz-ards for some of these memory accesses. Next, only 4 threads perform the finalsum of the 4 confusion matrix values. At this point, there is no need to call thesynchronization barrier because the threads with idx 0, 1, 2 and 3 are within thesame warp, and, since a warp executes one common instruction at a time, threadswithin a warp are implicitly synchronized according to NVIDIA [1]. Finally, onlyone thread computes the fitness values for the different objectives and writes them

15

Code 2 Confusion matrix kernel__global__ void confusionMatrixKernel(unsigned char* result, jdouble* sespaccuracy,

int* bestClass){

__shared__ int confusionMatrix[512];

// Base index of the threadint base = blockIdx.x * numberInstances + threadIdx.y;// Top index of the threadint top = blockIdx.x * numberInstances + numberInstances - base;int moreCoveredClass = 0;

// Checks the best class for the individualfor(int j = 0; j < numberClasses; j++){confusionMatrix[4*threadIdx.y] = 0;confusionMatrix[4*threadIdx.y+1] = 0;confusionMatrix[4*threadIdx.y+2] = 0;confusionMatrix[4*threadIdx.y+3] = 0;

// Performs the first level reduction of the thread valuesfor(int i = 0; i < top; i += 128)confusionMatrix[4*threadIdx.y +

result[gridDim.x * numberInstances * j + base + i]]++;

__syncthreads();

if(threadIdx.y < 4){// Performs the second level reduction of the half of the sumsfor(int i = 4; i < 512; i+=4)confusionMatrix[threadIdx.y] += confusionMatrix[threadIdx.y + i];

if(threadIdx.y == 0 && confusionMatrix[0] > moreCoveredClass){int tp = confusionMatrix[0], tn = confusionMatrix[1];int fp = confusionMatrix[2], fn = confusionMatrix[3];

moreCoveredClass = tp;sespaccuracy[blockIdx.x] = tp / (tp + fn);sespaccuracy[gridDim.x + blockIdx.x] = tn / (tn + fp);sespaccuracy[gridDim.x + gridDim.x +

blockIdx.x] = (tp+1) /(tp + fp + numberClasses);bestClass[blockIdx.x] = j;

}}

__syncthreads();}

}

back to global memory. All the fitness values are returned in a unique array tosave multiple segmented memory transactions. The code for the confusion matrix

16

kernel is shown in Code 2. The kernel receives as input the array of predictionresults from the coverage kernel (result), and returns two arrays, one contains thesensitivity, specificity and accuracy values (sespaccuracy) which are stored in asingle array and copied in a single memory transfer for efficiency, and the othercontains the best class for the rules (bestClass).

5. Experimental setup

In this section, we will first present the hardware devices and the configurationsettings used in the experimental study. Then, the different experiments designedto evaluate the performance of the proposed model will be presented.

5.1. Hardware configurationThe experiments were run on two PCs, both equipped with an Intel Core i7

quad-core processor running at 2.66 GHz and 12 GB of DDR3-1600 host mem-ory. One PC featured two NVIDIA GeForce GTX 285 video cards equipped with2GB of GDDR3 video RAM and the other one featured two NVIDIA GeForceGTX 480 video cards equipped with 1.5GB of GDDR5 video RAM. The GTX285 GPU comprised 30 multiprocessors and 240 cores whereas the GTX 480GPU comprised 15 multiprocessors and 480 CUDA cores, both clocked at 1.4GHz. The host operating system was GNU/Linux Ubuntu 11.10-3.0.0 64 bit alongwith CUDA runtime 4.2, NVIDIA drivers 302.07, Eclipse integrated developmentenvironment 3.7.0, Java OpenJDK runtime environment 1.6-23 64 bit, and GCCcompiler 4.6.3 (O2 optimization level).

5.2. Configuration settingsThere are two different configuration settings to define: the GPU kernel set-

tings, and the algorithm configuration parameters.On the one hand, the GPU kernels require two parameters which define the

number of threads per block and the number of blocks per grid. The compilerand hardware thread scheduler will schedule instructions as optimally as possibleto avoid register memory bank conflicts. The best results are achieved when thenumber of threads per block is a multiple of 64, usually 128, 256 or 512. TheCUDA GPU occupancy calculator provides information about the GPU multipro-cessor occupancy, the number of threads per block, the registers used per thread,and the shared memory employed by the blocks. One of the keys to good perfor-mance is to keep the multiprocessors on the device as busy as possible. A device

17

in which the work is poorly balanced between the multiprocessors will performsuboptimally.

The coverage kernel requires 13 registers and the confusion matrix kernel re-quires 14 registers and 2104 bytes from shared memory. Table 2 shows the GPUoccupancy for different numbers of threads per block. Occupancy is the ratio ofthe number of active warps per multiprocessor to the maximum number of pos-sible active warps. 128 threads per block constitutes a multiprocessor occupancyof 75%, limited by the shared memory per multiprocessor. 256 and 512 threadsper block yield an occupancy of 100%, in both cases the number of active threadsper multiprocessor is 1024, but they differ in the number of active thread blocksper multiprocessor. We consider that the best option is to employ 256 threadsper block since it provides more active threads blocks per multiprocessor to hidelatency arising from register dependencies and, therefore, a wider range of possi-bilities is given to the dispatcher to issue concurrent blocks to the execution units.

Table 2: Threads per block and multiprocessor occupancyThreads per block 128 256 512Active threads per multiprocessor 768 1024 1024Active warps per multiprocessor 24 32 32Active thread blocks per multiprocessor 6 4 2Occupancy of each multiprocessor 75% 100% 100%

The number of blocks in the grid should be larger than the number of multi-processors in order for all multiprocessors to have at least one block to execute,because the primary concern is keeping the entire GPU busy. The grid size (num-ber of thread blocks) of the coverage kernel depends on the number of rules andthe number of instances. This grid is a 2D matrix of thread blocks, whose sizeis R × (I/256) due to the fact that the thread block size is 256. Thus, the gridsize and the GPU occupancy increase as the number of rules or the number ofinstances increase. This behaviour gives us a first idea of the efficiency of themodel, which increases as the complexity of the problem increases, either with agreater number of instances or demanding larger population sizes, as far as thereare thread blocks enough to fill the GPU occupancy.

On the other hand, the MOGBAP algorithm was run using the parameter con-figuration shown in Table 3, where the first four parameters are mandatory, and theother six parameters–enclosed in square brackets–are optional, having default val-ues. These parameters are the recommended values by the authors of MOGBAPalgorithm in [44].

18

Table 3: MOGBAP parameter configurationName Description ValuenumIterations Number of iterations 100maxDerivations Maximum number of derivations for the grammar 15minCoverage Minimum percentage of instances belonging to

the class predicted by the rule that it should cover 5%[τ0] Initial pheromone amount 1.0[τmin] Minimum pheromone amount 0.1[τmax] Maximum pheromone amount 1.0[ρ] Evaporation rate 0.05[α] Heuristic exponent 0.4[β] Pheromone exponent 1.0

5.3. ExperimentsThe experimental study comprised three sets of experiments. The first experi-

ment evaluated the performance of the interpreter of the rules. The second exper-iment evaluated the performance of the evaluation model and its scalability. Thethird experiment analysed the performance of the model over several real-worlddata sets.

5.3.1. RPN interpreter performance

The performance of an RPN interpreter is often given in terms of the numberof primitives which the system interprets per second, similar to GP interpreters,which report the number of GP operations per second (GPops/s) [6, 34, 35]. Theinterpreter evaluates expression trees, independently of whether they are obtainedfrom Ant Programming, Genetic Programming, or Grammar Guided Genetic Pro-gramming.

The first experiment was designed to evaluate the performance of the RPN in-terpreter, which runs over different numbers of rules, different lengths of rules (itsnumber of conditions), and different numbers of instances. Thus, it established asensitivity analysis of the effect of these parameters on the speed of the interpreter.

5.3.2. Model performance and scalability

The second experiment was designed to analyse the performance of the wholeparallelized model (including kernel execution and data transfer times) by varying

19

the number of rules, the lengths of the rules, and the number of instances. Itis interesting to analyse the scalability of the model regarding the sizes of theaforementioned parameters, especially when they are very low (small populationsizes or data sets with a low number of instances) or very high (large populationsizes of data sets with up to a million instances).

5.3.3. Experiments with UCI data sets

The third experiment evaluated the model on 12 real-world data sets from theUCI data set repository website [42]. These data sets are very varied in their de-gree of complexity, number of classes (2–17), number of attributes (4–42), andnumber of instances (150–1M). Different population sizes are evaluated, to anal-yse the scalability of the proposal, especially on complex and large scale data setswhich justify the use of a larger population for best results.

6. Experimental study

This section will present and discuss the experimental results from the differ-ent experiments. Experimental results reported the average execution time valuesand speedups from 100 different executions. Even the MOGBAP algorithm isnon-deterministic, its behaviour is controlled by a user-defined parameter seedvalue, if this seed is fixed, the algorithm is totally deterministic, i.e., it runs thesame execution paths and produces the same results. In order to avoid high de-viations of the execution times produced by the non-deterministic evolutionarybehaviour, we fixed the seed for the experiments carried out. Therefore, it isguaranteed that the 100 runs of the CPU algorithm and the 100 runs of the GPUalgorithm perform exactly the same computation tasks.

6.1. RPN interpreter performanceAccording to Kumar [33], the speedup is the ratio of the serial execution time

to the parallel execution time of the algorithm. In our experiments, the speedup isthe ratio between the CPU and GPU execution times. Specifically, we measuretwo speedups. The first one is how much faster is the GPU algorithm to theoriginal sequential single-threaded algorithm, the second one is how much fasteris the GPU algorithm to the CPU parallel multi-threaded algorithm. Table 4 showsthe interpreter execution times, the performance in terms of GPops/s, and thespeedups achieved when comparing the performance of the GPU model to thesingle-threaded and multi-threaded CPU implementations. Each row represents

20

the case of the RPN interpretation of R rules, comprising C conditions/attributes,over I instances. A rule represents a conjunction of attribute–value comparison(operator + attribute + value). Thus, the total number of GPops to evaluate is(C × 3 + (C − 1))× I ×R.

As far as the CPU and multi-threaded interpreter performance is concerned,the greater the number of conditions per rule, the greater the number of GP op-erations to interpret and, therefore, the longer the execution time. The efficiencyof the CPU interpreter increases as the complexity of the rule increases, but justwhen the number of rules or instances is low. Actually, when the number of rulesor instances is high, the CPU interpreter’s efficiency decreases as the complexityof the rule (number of conditions) increases, i.e., the number of GPops/s of the

10000

20000

30000

40000

50000

60000

70000

80000

90000

100000

10

20

30

40

50

6x109

7x109

8x109

9x109

1x1010

Conditions

GPop

s/s

Instances

5E+09

6E+09

7E+09

8E+09

9E+09

1E+10

(a) RPN Conditions vs Instances

10000

20000

30000

40000

50000

60000

70000

80000

90000

100000

2550

75100

125150

175200

2,0x109

4,0x109

6,0x109

8,0x109

1,0x1010

Rules

GPop

s/s

Instances

6E+08

3E+09

4E+09

6E+09

8E+09

1E+10

(b) RPN Rules vs Instances

10

20

30

40

50

2550

75100

125150

175200

3x109

4x109

5x109

6x109

7x109

8x109

9x109

Rules

GPop

s/s

Conditio

ns

3E+09

4E+09

5E+09

7E+09

8E+09

9E+09

(c) RPN Rules vs Conditions

Figure 4: RPN interpreter performance

21

Tabl

e4:

RPN

inte

rpre

terp

erfo

rman

ceE

xecu

tion

Tim

e(m

s)G

Pops

/sSp

eedu

pvs

CPU

Spee

dup

vs4

CPU

#R#I

nst

#Con

dG

Pops

CPU

4C

PU1

285

228

51

480

248

0C

PU4

CPU

128

52

285

148

02

480

128

52

285

148

02

480

128

52

285

148

02

480

1010

05

1.90

E04

3.03

1.45

0.14

0.13

0.09

0.05

6.27

E06

1.31

E07

1.36

E08

1.46

E08

2.11

E08

3.80

E08

21.6

423

.31

33.6

760

.60

10.3

611

.15

16.1

129

.00

1010

010

3.90

E04

3.41

1.88

0.15

0.14

0.10

0.05

1.14

E07

2.07

E07

2.60

E08

2.79

E08

3.90

E08

7.80

E08

22.7

324

.36

34.1

068

.20

12.5

313

.43

18.8

037

.60

1010

020

7.90

E04

6.45

3.52

0.28

0.25

0.16

0.09

1.22

E07

2.24

E07

2.82

E08

3.16

E08

4.94

E08

8.78

E08

23.0

425

.80

40.3

171

.67

12.5

714

.08

22.0

039

.11

1010

050

1.99

E05

14.3

25.

840.

340.

320.

250.

141.

39E

073.

41E

075.

85E

086.

22E

087.

96E

081.

42E

0942

.12

44.7

557

.28

102.

2917

.18

18.2

523

.36

41.7

110

1000

05

1.90

E06

30.9

910

.14

1.32

0.71

0.41

0.23

6.13

E07

1.87

E08

1.44

E09

2.68

E09

4.63

E09

8.26

E09

23.4

843

.65

75.5

913

4.74

7.68

14.2

824

.73

44.0

910

1000

010

3.90

E06

90.5

630

.95

2.66

1.43

0.83

0.45

4.31

E07

1.26

E08

1.47

E09

2.73

E09

4.70

E09

8.67

E09

34.0

563

.33

109.

1120

1.24

11.6

421

.64

37.2

968

.78

1010

000

207.

90E

0626

0.97

84.6

25.

122.

751.

640.

883.

03E

079.

34E

071.

54E

092.

87E

094.

82E

098.

98E

0950

.97

94.9

015

9.13

296.

5616

.53

30.7

751

.60

96.1

610

1000

050

1.99

E07

1102

.12

338.

2914

.41

7.69

4.08

2.15

1.81

E07

5.88

E07

1.38

E09

2.59

E09

4.88

E09

9.26

E09

76.4

814

3.32

270.

1351

2.61

23.4

843

.99

82.9

115

7.34

1010

0000

51.

90E

0722

9.20

70.4

113

.24

7.14

3.87

1.95

8.29

E07

2.70

E08

1.44

E09

2.66

E09

4.91

E09

9.74

E09

17.3

132

.10

59.2

211

7.54

5.32

9.86

18.1

936

.11

1010

0000

103.

90E

0774

2.73

227.

3727

.30

14.4

87.

833.

945.

25E

071.

72E

081.

43E

092.

69E

094.

98E

099.

90E

0927

.21

51.2

994

.86

188.

518.

3315

.70

29.0

457

.71

1010

0000

207.

90E

0720

14.3

161

4.66

52.1

727

.92

15.6

57.

863.

92E

071.

29E

081.

51E

092.

83E

095.

05E

091.

01E

1038

.61

72.1

512

8.71

256.

2711

.78

22.0

239

.28

78.2

010

1000

0050

1.99

E08

1506

9.12

4496

.22

143.

7078

.68

39.0

719

.62

1.32

E07

4.43

E07

1.38

E09

2.53

E09

5.09

E09

1.01

E10

104.

8719

1.52

385.

7076

8.05

31.2

957

.15

115.

0822

9.17

5010

05

9.50

E04

2.51

1.48

0.14

0.13

0.10

0.06

3.78

E07

6.42

E07

6.79

E08

7.31

E08

9.50

E08

1.58

E09

17.9

319

.31

25.1

041

.83

10.5

711

.38

14.8

024

.67

5010

010

1.95

E05

3.02

1.64

0.15

0.14

0.11

0.06

6.46

E07

1.19

E08

1.30

E09

1.39

E09

1.77

E09

3.25

E09

20.1

321

.57

27.4

550

.33

10.9

311

.71

14.9

127

.33

5010

020

3.95

E05

13.6

84.

930.

350.

310.

150.

122.

89E

078.

01E

071.

13E

091.

27E

092.

63E

093.

29E

0939

.09

44.1

391

.20

114.

0014

.09

15.9

032

.87

41.0

850

100

509.

95E

0555

.22

16.0

81.

000.

870.

360.

281.

80E

076.

19E

079.

95E

081.

14E

092.

76E

093.

55E

0955

.22

63.4

715

3.39

197.

2116

.08

18.4

844

.67

57.4

350

1000

05

9.50

E06

151.

3444

.32

6.01

3.39

1.97

1.00

6.28

E07

2.14

E08

1.58

E09

2.80

E09

4.82

E09

9.50

E09

25.1

844

.64

76.8

215

1.34

7.37

13.0

722

.50

44.3

250

1000

010

1.95

E07

438.

5212

7.73

12.3

56.

903.

982.

014.

45E

071.

53E

081.

58E

092.

83E

094.

90E

099.

70E

0935

.51

63.5

511

0.18

218.

1710

.34

18.5

132

.09

63.5

550

1000

020

3.95

E07

1095

.45

315.

1523

.95

13.3

27.

943.

993.

61E

071.

25E

081.

65E

092.

97E

094.

97E

099.

90E

0945

.74

82.2

413

7.97

274.

5513

.16

23.6

639

.69

78.9

850

1000

050

9.95

E07

4894

.98

1404

.90

67.2

236

.99

19.8

210

.00

2.03

E07

7.08

E07

1.48

E09

2.69

E09

5.02

E09

9.95

E09

72.8

213

2.33

246.

9748

9.50

20.9

037

.98

70.8

814

0.49

5010

0000

59.

50E

0713

52.8

438

5.82

61.3

235

.10

19.1

99.

617.

02E

072.

46E

081.

55E

092.

71E

094.

95E

099.

89E

0922

.06

38.5

470

.50

140.

776.

2910

.99

20.1

140

.15

5010

0000

101.

95E

0834

55.4

797

8.15

127.

1171

.53

38.9

619

.51

5.64

E07

1.99

E08

1.53

E09

2.73

E09

5.01

E09

9.99

E09

27.1

848

.31

88.6

917

7.11

7.70

13.6

725

.11

50.1

450

1000

0020

3.95

E08

1370

6.57

3841

.16

247.

9213

7.30

77.9

239

.01

2.88

E07

1.03

E08

1.59

E09

2.88

E09

5.07

E09

1.01

E10

55.2

999

.83

175.

9135

1.36

15.4

927

.98

49.3

098

.47

5010

0000

509.

95E

0853

444.

0714

934.

3167

3.37

369.

0419

4.87

97.5

21.

86E

076.

66E

071.

48E

092.

70E

095.

11E

091.

02E

1079

.37

144.

8227

4.25

548.

0322

.18

40.4

776

.64

153.

1410

010

05

1.90

E05

3.71

1.75

0.18

0.14

0.10

0.06

5.12

E07

1.09

E08

1.06

E09

1.36

E09

1.90

E09

3.17

E09

20.6

126

.50

37.1

061

.83

9.72

12.5

017

.50

29.1

710

010

010

3.90

E05

8.15

2.67

0.26

0.19

0.11

0.09

4.79

E07

1.46

E08

1.50

E09

2.05

E09

3.55

E09

4.33

E09

31.3

542

.89

74.0

990

.56

10.2

714

.05

24.2

729

.67

100

100

207.

90E

0524

.94

6.18

0.58

0.36

0.22

0.14

3.17

E07

1.28

E08

1.36

E09

2.19

E09

3.59

E09

5.64

E09

43.0

069

.28

113.

3617

8.14

10.6

617

.17

28.0

944

.14

100

100

501.

99E

0614

9.87

40.1

01.

411.

030.

540.

371.

33E

074.

96E

071.

41E

091.

93E

093.

69E

095.

38E

0910

6.29

145.

5027

7.54

405.

0528

.44

38.9

374

.26

108.

3810

010

000

51.

90E

0722

3.82

60.3

512

.37

6.08

3.91

1.97

8.49

E07

3.15

E08

1.54

E09

3.12

E09

4.86

E09

9.64

E09

18.0

936

.81

57.2

411

3.61

4.88

9.93

15.4

330

.63

100

1000

010

3.90

E07

784.

1120

9.41

24.6

212

.51

7.91

3.98

4.97

E07

1.86

E08

1.58

E09

3.12

E09

4.93

E09

9.80

E09

31.8

562

.68

99.1

319

7.01

8.51

16.7

426

.47

52.6

210

010

000

207.

90E

0721

23.9

356

6.62

48.3

124

.15

15.8

17.

933.

72E

071.

39E

081.

64E

093.

27E

095.

00E

099.

96E

0943

.96

87.9

513

4.34

267.

8311

.73

23.4

635

.84

71.4

510

010

000

501.

99E

0815

638.

7241

51.2

313

4.68

67.7

139

.46

19.8

11.

27E

074.

79E

071.

48E

092.

94E

095.

04E

091.

00E

1011

6.12

230.

9739

6.32

789.

4430

.82

61.3

110

5.20

209.

5510

010

0000

51.

90E

0822

75.2

160

2.67

123.

8661

.79

38.3

319

.21

8.35

E07

3.15

E08

1.53

E09

3.07

E09

4.96

E09

9.89

E09

18.3

736

.82

59.3

611

8.44

4.87

9.75

15.7

231

.37

100

1000

0010

3.90

E08

6481

.91

1707

.42

253.

5512

8.30

77.8

638

.98

6.02

E07

2.28

E08

1.54

E09

3.04

E09

5.01

E09

1.00

E10

25.5

650

.52

83.2

516

6.29

6.73

13.3

121

.93

43.8

010

010

0000

207.

90E

0825

815.

7167

79.2

449

5.46

249.

3015

5.79

77.9

73.

06E

071.

17E

081.

59E

093.

17E

095.

07E

091.

01E

1052

.10

103.

5516

5.71

331.

1013

.68

27.1

943

.52

86.9

510

010

0000

501.

99E

0912

0945

.05

3171

8.90

1350

.74

677.

4938

9.61

194.

901.

65E

076.

27E

071.

47E

092.

94E

095.

11E

091.

02E

1089

.54

178.

5231

0.43

620.

5523

.48

46.8

281

.41

162.

7420

010

05

3.80

E05

5.13

1.60

0.31

0.15

0.14

0.07

7.41

E07

2.38

E08

1.23

E09

2.53

E09

2.71

E09

5.43

E09

16.5

534

.20

36.6

473

.29

5.16

10.6

711

.43

22.8

620

010

010

7.80

E05

14.8

44.

440.

600.

300.

290.

115.

26E

071.

76E

081.

30E

092.

60E

092.

69E

097.

09E

0924

.73

49.4

751

.17

134.

917.

4014

.80

15.3

140

.36

200

100

201.

58E

0651

.75

13.5

21.

170.

590.

570.

223.

05E

071.

17E

081.

35E

092.

68E

092.

77E

097.

18E

0944

.23

87.7

190

.79

235.

2311

.56

22.9

223

.72

61.4

520

010

050

3.98

E06

288.

0474

.38

2.90

1.61

1.35

0.54

1.38

E07

5.35

E07

1.37

E09

2.47

E09

2.95

E09

7.37

E09

99.3

217

8.91

213.

3653

3.41

25.6

546

.20

55.1

013

7.74

200

1000

05

3.80

E07

548.

8514

0.96

24.3

712

.13

7.78

3.90

6.92

E07

2.70

E08

1.56

E09

3.13

E09

4.88

E09

9.74

E09

22.5

245

.25

70.5

514

0.73

5.78

11.6

218

.12

36.1

420

010

000

107.

80E

0713

08.2

232

4.43

50.0

225

.33

15.7

97.

925.

96E

072.

40E

081.

56E

093.

08E

094.

94E

099.

85E

0926

.15

51.6

582

.85

165.

186.

4912

.81

20.5

540

.96

200

1000

020

1.58

E08

4507

.04

1115

.84

97.2

448

.83

31.6

115

.80

3.51

E07

1.42

E08

1.62

E09

3.24

E09

5.00

E09

1.00

E10

46.3

592

.30

142.

5828

5.26

11.4

822

.85

35.3

070

.62

200

1000

050

3.98

E08

2136

4.72

5275

.98

270.

0413

5.74

78.9

639

.51

1.86

E07

7.54

E07

1.47

E09

2.93

E09

5.04

E09

1.01

E10

79.1

215

7.39

270.

5854

0.74

19.5

438

.87

66.8

213

3.54

200

1000

005

3.80

E08

6128

.59

1505

.88

255.

5813

5.39

77.6

439

.35

6.20

E07

2.52

E08

1.49

E09

2.81

E09

4.89

E09

9.66

E09

23.9

845

.27

78.9

415

5.75

5.89

11.1

219

.40

38.2

720

010

0000

107.

80E

0812

653.

1030

53.2

151

0.23

257.

9715

5.70

77.8

86.

16E

072.

55E

081.

53E

093.

02E

095.

01E

091.

00E

1024

.80

49.0

581

.27

162.

475.

9811

.84

19.6

139

.20

200

1000

0020

1.58

E09

4126

9.70

9951

.79

988.

8749

7.35

311.

5215

5.79

3.83

E07

1.59

E08

1.60

E09

3.18

E09

5.07

E09

1.01

E10

41.7

382

.98

132.

4826

4.91

10.0

620

.01

31.9

563

.88

200

1000

0050

3.98

E09

1977

83.7

947

401.

2527

13.0

313

58.1

577

9.08

389.

572.

01E

078.

40E

071.

47E

092.

93E

095.

11E

091.

02E

1072

.90

145.

6325

3.87

507.

7017

.47

34.9

060

.84

121.

68

22

single-threaded and multi-threaded CPU interpreter decreases when dealing withcomplex rules over many instances. We believe that this behaviour is effectivelycaused by saturated CPU cache memory, sinking the performance because largerdata from bigger scenarios do not fit into the cache memory, and only the smallestscenario seems to fit into the cache memory.

The performance of the multi-threaded CPU interpreter is up to 315 millionGPops/s. On the other hand, the GPU implementation obtains high performancein all cases, especially over complex rules and many instances, and achieves up to10 billion GPops/s.

The 480 GPU achieves more consistent performance than the 285 GPU, and itcomprises a 768 KB L2 hardware cache which helps to minimize global memoryreads. It is also noticeable that when using two GPU devices, the performanceof the interpreter is duplicated when there are enough instances to fill the GPU’sthread blocks. The GPops/s for two 480 GPUs are presented in Figure 4, wherea pairwise evaluation is presented in a separate figure considering each possiblecombination between the number of instances, rules, and conditions.

6.2. Model performance and scalabilityTable 5 shows the execution times, the speedups, and the GPU occupancy of

the different GPU devices. Each row represents the case of the fitness evaluation(including coverage kernel, confusion matrix kernel, and data transfers) of R rulesover I instances along 100 generations.

The number of threads within a thread block was optimally set to 256 in theexperimental setup, i.e., a thread block is in charge of computing the coverage ofa rule over a group of 256 instances. Thus, the total number of thread blocks re-quired to compute all the evaluation cases is ceil(I/256)×R. The GPU occupancyis defined as the relation between the number of thread blocks to compute and themaximum number of thread blocks that the GPU hardware is capable of handlingconcurrently. Thus, an occupancy lower than 1 means that the computation re-quirements does not reach the maximum capability of the GPU. The occupancyratio should be over 1 to achieve full performance. The 285 GPU comprises 30multiprocessors and the 480 GPU comprises 15 multiprocessors. Each multipro-cessor can handle 4 active thread blocks when using 256 threads per block. Thus,the number of maximum active thread blocks is 120 and 60 times the numberof GPU devices, i.e., the 480 GPU performs better with fewer and larger threadblocks, achieving higher occupancy.

23

Tabl

e5:

Mod

elpe

rfor

man

ceE

xecu

tion

Tim

e(m

s)Sp

eedu

pvs

CPU

Spee

dup

vs4

CPU

GPU

occu

panc

y#R

#Ins

t#C

ond

Blo

cks

CPU

4C

PU1

285

228

51

480

248

01

285

228

51

480

248

01

285

228

51

480

248

01

285

228

51

480

248

010

100

510

13.5

28.

427.

197.

025.

635.

201.

881.

932.

402.

601.

171.

201.

501.

620.

080.

040.

040.

0210

100

1010

16.9

910

.47

7.42

7.28

5.89

5.41

2.29

2.33

2.88

3.14

1.41

1.44

1.78

1.94

0.08

0.04

0.04

0.02

1010

020

1022

.64

15.6

98.

547.

796.

586.

012.

652.

913.

443.

771.

842.

012.

382.

610.

080.

040.

040.

0210

100

5010

50.3

620

.15

8.76

8.12

7.49

7.12

5.75

6.20

6.72

7.07

2.30

2.48

2.69

2.83

0.08

0.04

0.04

0.02

1010

000

579

067

3.15

259.

0716

.00

13.4

712

.49

10.0

342

.07

49.9

753

.90

67.1

116

.19

19.2

320

.74

25.8

36.

583.

293.

291.

6510

1000

010

790

1034

.94

391.

3917

.76

14.3

413

.57

10.1

758

.27

72.1

776

.27

101.

7622

.04

27.2

928

.84

38.4

86.

583.

293.

291.

6510

1000

020

790

1600

.45

594.

9519

.19

15.3

315

.16

12.0

883

.40

104.

4010

5.57

132.

4931

.00

38.8

139

.24

49.2

56.

583.

293.

291.

6510

1000

050

790

2785

.06

962.

7827

.09

20.7

020

.24

15.6

610

2.81

134.

5413

7.60

177.

8535

.54

46.5

147

.57

61.4

86.

583.

293.

291.

6510

1000

005

7820

5950

.02

2100

.53

101.

0780

.36

74.5

765

.67

58.8

774

.04

79.7

990

.60

20.7

826

.14

28.1

731

.99

65.1

732

.58

32.5

816

.29

1010

0000

1078

2088

47.9

129

03.9

411

9.66

95.4

785

.00

67.8

773

.94

92.6

810

4.09

130.

3724

.27

30.4

234

.16

42.7

965

.17

32.5

832

.58

16.2

910

1000

0020

7820

1291

7.86

4150

.33

142.

9012

1.91

100.

2285

.80

90.4

010

5.96

128.

9015

0.56

29.0

434

.04

41.4

148

.37

65.1

732

.58

32.5

816

.29

1010

0000

5078

2028

036.

6987

76.7

722

1.03

185.

0417

4.94

145.

0012

6.85

151.

5216

0.26

193.

3639

.71

47.4

350

.17

60.5

365

.17

32.5

832

.58

16.2

920

100

520

15.7

89.

767.

467.

136.

956.

502.

122.

212.

272.

431.

311.

371.

401.

500.

170.

080.

080.

0420

100

1020

21.5

112

.47

7.50

7.30

7.23

6.90

2.87

2.95

2.98

3.12

1.66

1.71

1.72

1.81

0.17

0.08

0.08

0.04

2010

020

2040

.41

19.2

18.

747.

857.

537.

044.

625.

155.

375.

742.

202.

452.

552.

730.

170.

080.

080.

0420

100

5020

80.2

033

.07

10.5

99.

229.

157.

487.

578.

708.

7710

.72

3.12

3.59

3.61

4.42

0.17

0.08

0.08

0.04

2010

000

515

8095

9.25

265.

8421

.46

18.1

315

.19

12.1

644

.70

52.9

163

.15

78.8

912

.39

14.6

617

.50

21.8

613

.17

6.58

6.58

3.29

2010

000

1015

8014

02.7

339

7.36

24.4

618

.48

17.8

713

.95

57.3

575

.91

78.5

010

0.55

16.2

521

.50

22.2

428

.48

13.1

76.

586.

583.

2920

1000

020

1580

2315

.95

665.

0724

.83

21.5

020

.32

15.0

193

.27

107.

7211

3.97

154.

2926

.78

30.9

332

.73

44.3

113

.17

6.58

6.58

3.29

2010

000

5015

8043

75.5

811

42.5

633

.72

27.9

526

.14

22.5

212

9.76

156.

5516

7.39

194.

3033

.88

40.8

843

.71

50.7

413

.17

6.58

6.58

3.29

2010

0000

515

640

8614

.43

2129

.28

143.

4310

1.25

94.8

972

.83

60.0

685

.08

90.7

811

8.28

14.8

521

.03

22.4

429

.24

130.

3365

.17

65.1

732

.58

2010

0000

1015

640

1280

8.98

3294

.28

160.

3711

7.21

114.

5280

.79

79.8

710

9.28

111.

8515

8.55

20.5

428

.11

28.7

740

.78

130.

3365

.17

65.1

732

.58

2010

0000

2015

640

2133

1.93

5156

.10

176.

9113

5.19

130.

4110

4.63

120.

5815

7.79

163.

5820

3.88

29.1

538

.14

39.5

449

.28

130.

3365

.17

65.1

732

.58

2010

0000

5015

640

3771

2.71

9433

.25

252.

7621

5.21

197.

1716

7.95

149.

2017

5.24

191.

2722

4.55

37.3

243

.83

47.8

456

.17

130.

3365

.17

65.1

732

.58

5010

05

5028

.33

15.0

611

.37

10.9

08.

257.

192.

492.

603.

433.

941.

321.

381.

832.

090.

420.

210.

210.

1050

100

1050

43.2

121

.45

12.3

311

.22

8.85

7.20

3.50

3.85

4.88

6.00

1.74

1.91

2.42

2.98

0.42

0.21

0.21

0.10

5010

020

5060

.10

30.6

614

.52

12.0

49.

028.

214.

144.

996.

667.

322.

112.

553.

403.

730.

420.

210.

210.

1050

100

5050

113.

2747

.49

14.8

213

.41

10.1

98.

387.

648.

4511

.12

13.5

23.

203.

544.

665.

670.

420.

210.

210.

1050

1000

05

3950

1885

.97

471.

4239

.94

27.5

719

.71

14.1

147

.22

68.4

195

.69

133.

6611

.80

17.1

023

.92

33.4

132

.92

16.4

616

.46

8.23

5010

000

1039

5029

33.9

169

5.09

42.9

931

.28

21.6

114

.93

68.2

593

.80

135.

7719

6.51

16.1

722

.22

32.1

746

.56

32.9

216

.46

16.4

68.

2350

1000

020

3950

4742

.53

1118

.49

47.0

633

.62

23.5

915

.67

100.

7814

1.06

201.

0430

2.65

23.7

733

.27

47.4

171

.38

32.9

216

.46

16.4

68.

2350

1000

050

3950

1007

6.53

2457

.20

64.3

844

.27

27.3

424

.42

156.

5222

7.62

368.

5641

2.63

38.1

755

.50

89.8

810

0.62

32.9

216

.46

16.4

68.

2350

1000

005

3910

018

613.

2345

53.9

233

5.65

210.

6517

2.13

118.

3455

.45

88.3

610

8.13

157.

2913

.57

21.6

226

.46

38.4

832

5.83

162.

9216

2.92

81.4

650

1000

0010

3910

027

135.

5863

08.6

834

7.87

230.

0518

9.54

123.

9178

.00

117.

9614

3.17

218.

9918

.14

27.4

233

.28

50.9

132

5.83

162.

9216

2.92

81.4

650

1000

0020

3910

042

911.

8299

56.2

639

5.40

268.

3221

5.30

154.

2510

8.53

159.

9319

9.31

278.

2025

.18

37.1

146

.24

64.5

532

5.83

162.

9216

2.92

81.4

650

1000

0050

3910

098

992.

8223

412.

6045

3.00

323.

8630

6.60

196.

8421

8.53

305.

6732

2.87

502.

9151

.68

72.2

976

.36

118.

9432

5.83

162.

9216

2.92

81.4

610

010

05

100

45.9

518

.82

12.0

711

.01

9.45

7.90

3.81

4.17

4.86

5.82

1.56

1.71

1.99

2.38

0.83

0.42

0.42

0.21

100

100

1010

061

.32

28.3

612

.78

11.5

09.

758.

374.

805.

336.

297.

332.

222.

472.

913.

390.

830.

420.

420.

2110

010

020

100

96.8

239

.33

15.9

713

.58

10.7

69.

466.

067.

139.

0010

.23

2.46

2.90

3.66

4.16

0.83

0.42

0.42

0.21

100

100

5010

021

0.03

54.7

518

.90

14.6

811

.92

9.75

11.1

114

.31

17.6

221

.54

2.90

3.73

4.59

5.62

0.83

0.42

0.42

0.21

100

1000

05

7900

3590

.06

819.

2257

.74

40.1

935

.89

23.6

262

.18

89.3

310

0.03

151.

9914

.19

20.3

822

.83

34.6

865

.83

32.9

232

.92

16.4

610

010

000

1079

0055

46.3

012

89.7

165

.78

43.6

439

.02

23.9

084

.32

127.

0914

2.14

232.

0619

.61

29.5

533

.05

53.9

665

.83

32.9

232

.92

16.4

610

010

000

2079

0084

96.5

919

14.7

268

.70

46.7

343

.82

29.8

512

3.68

181.

8219

3.90

284.

6427

.87

40.9

743

.70

64.1

465

.83

32.9

232

.92

16.4

610

010

000

5079

0018

677.

7143

47.7

084

.88

56.2

052

.09

34.2

622

0.05

332.

3435

8.57

545.

1851

.22

77.3

683

.47

126.

9065

.83

32.9

232

.92

16.4

610

010

0000

578

200

3409

7.45

7770

.71

546.

3031

9.79

274.

0315

7.12

62.4

210

6.62

124.

4321

7.02

14.2

224

.30

28.3

649

.46

651.

6732

5.83

325.

8316

2.92

100

1000

0010

7820

054

730.

6912

188.

8757

0.31

328.

3431

8.94

174.

4995

.97

166.

6917

1.60

313.

6621

.37

37.1

238

.22

69.8

565

1.67

325.

8332

5.83

162.

9210

010

0000

2078

200

8899

9.51

2045

2.00

590.

7236

0.56

321.

6820

1.46

150.

6624

6.84

276.

6744

1.77

34.6

256

.72

63.5

810

1.52

651.

6732

5.83

325.

8316

2.92

100

1000

0050

7820

019

4084

.86

4371

9.72

673.

7143

4.29

423.

4226

5.87

288.

0844

6.90

458.

3773

0.00

64.8

910

0.67

103.

2516

4.44

651.

6732

5.83

325.

8316

2.92

24

20

40

60

80

100

10

20

30

40

50

100

200

300

400

500

600

700

Conditions

Spe

edup

Rules

0

100

200

300

400

500

600

700

800

(a) Conditions vs Rules

10000

20000

30000

40000

50000

60000

70000

80000

90000

100000

10

20

30

40

50

100

200

300

400

500

600

700

Conditions

Spe

edup

Instances

0

100

200

300

400

500

600

700

800

(b) Conditions vs Instances

20

40

60

80

100

10000200

0030000400

0050000600

0070000800

0090000100

000

100

200

300

400

500

600

700

Instances

Spe

edup

Rules

0

100

200

300

400

500

600

700

800

(c) Instances vs Rules

Figure 5: Model performance and scalability

The CPU times increase as the number of rules, the number of instances, andthe number of attributes increase. The GPU times remain practically constant aslong as the occupancy is lower than 1, e.g., the execution times for 10 rules and100 instances are similar to the times for 100 rules and 100 instances. Interest-ingly, doubling the number of rules over cases with many instances doubles thetime for CPU approaches, but it only involves about +50% time when dealingwith GPUs. The reason for such behaviour is on how the dimensionality of thekernel increases, how rules and instances are stored in the GPU memory, and howmemory spaces of the GPU address memory requests. The GPU kernel explainedin Section 4.3.1 employs a 2D grid of thread block. One dimension increases asthe number of data instances. The other dimension increases as the number ofrules. Results shown in Table 5 indicate that increasing only the number of in-stances, increases the execution time linearly (after GPU reached full occupancy).The reason for such behaviour is that increasing the number of data instances, in-

25

creases the number of global memory reads (there are more threads reading datainstances values from global memory). Global memory transactions are linearperformance, making the execution time to increase linearly. However, the rulesare not stored in global memory, but in constant memory. As indicated in Sec-tion 3, a value read from the constant cache is broadcast to all threads in a warp,effectively serving loads from memory with a single-cache access. This meansthat when we increase the number of rules, reading them from constant mem-ory is not that cost than if we had them in global memory, and memory requestsfrom multiple threads are solved in a single cached access. Therefore, doublingthe number of instances doubles the number of global memory reads, doubles thecomputation work (arithmetic calculations) and results in doubling the executiontimes. On the other hand, doubling the number of rules, do not double the numberof constant memory read transactions, doubles the computation work (arithmeticcalculations) and the result is that the execution time is only about 50% larger(mostly due to the double number of calculations). Figure 5 represents in a sepa-rate figure how the relation between the number of conditions, rules, and instancesaffects the speedup of the algorithm.

6.3. Experiments on UCI data setsTable 6 shows the execution times and the speedups achieved over 12 varied

data sets from the UCI machine learning repository.The proposal demonstrates high performance and efficiency, which both in-

crease as the number of instances and the population size increase. The highestspeedup is achieved over the Connect-4 data set using 200 ants and 2 GTX 480GPUs (834× vs single-threaded CPU, 212× vs multi-threaded CPU). The highestspeedup values over all data sets is 834× for 2 GTX 480 GPUs, 566× for 1 GTX480 GPU, 400× for 2 GTX 285 GPUs and 244× for 1 GTX 285 GPU, all whencompared to original single-threaded CPU implementation. The average of thespeedup values over all data sets is 162× for 2 GTX 480 GPUs, 119× for 1 GTX480 GPU, 99× for 2 GTX 285 GPUs and 68× for 1 GTX 285 GPU, all whencompared to original single-threaded CPU implementation. Smallest speedupswere achieved on small data sets like Iris, but it is interesting to highlight thatthey are always higher than 1× even if the number of instances and rules to par-allelize is very low. The aforementioned assertion is not trivial because it impliesthat it is always advisable to speedup the execution using GPUs independently ofthe problem size, i.e., executing on the GPU (copying data instances, copying therules, performing the evaluation and copying the results back to the host memory)is always faster than executing on the CPU, no matter how small the problem is.

26

Tabl

e6:

UC

Idat

ase

tspe

rfor

man

ceE

xecu

tion

Tim

e(m

s)Sp

eedu

pvs

CPU

Spee

dup

vs4

CPU

Dat

ase

t#I

nst

#Att

#Cla

sses

#RC

PU4

CPU

128

52

285

148

02

480

128

52

285

148

02

480

128

52

285

148

02

480

Iris

150

43

1013

6.42

62.8

055

.31

43.8

938

.86

30.5

22.

473.

113.

514.

471.

141.

431.

622.

0620

202.

3790

.70

56.9

247

.59

42.4

935

.57

3.56

4.25

4.76

5.69

1.59

1.91

2.13

2.55

5038

8.33

156.

0076

.53

66.9

250

.19

40.4

25.

075.

807.

749.

612.

042.

333.

113.

8610

070

2.51

235.

3079

.91

70.7

762

.41

52.3

58.

799.

9311

.26

13.4

22.

943.

323.

774.

4920

015

48.8

943

4.30

108.

3896

.27

94.0

977

.73

14.2

916

.09

16.4

619

.93

4.01

4.51

4.62

5.59

iono

sphe

re35

133

210

877.

5228

2.70

64.1

644

.23

41.5

830

.56

13.6

819

.84

21.1

028

.71

4.41

6.39

6.80

9.25

2013

41.5

841

8.70

69.9

748

.76

43.4

834

.27

19.1

727

.51

30.8

639

.15

5.98

8.59

9.63

12.2

250

2486

.55

770.

0095

.34

79.0

163

.42

52.3

526

.08

31.4

739

.21

47.5

08.

089.

7512

.14

14.7

110

044

49.8

214

11.9

013

4.56

108.

2497

.49

77.6

733

.07

41.1

145

.64

57.2

910

.49

13.0

414

.48

18.1

820

087

68.5

626

91.5

017

5.67

146.

2414

3.25

114.

2749

.91

59.9

661

.21

76.7

415

.32

18.4

018

.79

23.5

5au

stra

lian

690

142

1073

0.28

208.

5059

.90

39.1

938

.58

30.7

212

.19

18.6

318

.93

23.7

73.

485.

325.

406.

7920

1066

.03

315.

3061

.44

50.2

445

.75

39.5

317

.35

21.2

223

.30

26.9

75.

136.

286.

897.

9850

1457

.10

429.

3780

.89

65.7

457

.98

43.3

318

.01

22.1

625

.13

33.6

35.

316.

537.

419.

9110

029

34.1

184

7.62

96.0

074

.44

73.1

158

.88

30.5

639

.42

40.1

349

.83

8.83

11.3

911

.59

14.4

020

058

27.4

017

32.6

511

6.33

95.4

380

.27

63.7

150

.09

61.0

672

.60

91.4

714

.89

18.1

621

.59

27.2

0tic

-tac

-toe

958

92

1096

5.21

330.

6056

.42

45.2

345

.04

34.3

717

.11

21.3

421

.43

28.0

85.

867.

317.

349.

6220

1531

.98

448.

0069

.74

55.1

447

.40

37.7

121

.97

27.7

832

.32

40.6

36.

428.

129.

4511

.88

5021

63.9

361

3.91

81.0

875

.21

55.3

541

.55

26.6

928

.77

39.1

052

.08

7.57

8.16

11.0

914

.78

100

4267

.61

1214

.69

111.

1996

.15

79.4

856

.80

38.3

844

.38

53.6

975

.13

10.9

212

.63

15.2

821

.39

200

8632

.52

2500

.98

149.

4312

4.30

112.

1080

.52

57.7

769

.45

77.0

110

7.21

16.7

420

.12

22.3

131

.06

vow

el99

013

1110

1092

.20

308.

0091

.44

77.0

464

.27

50.8

611

.94

14.1

816

.99

21.4

73.

374.

004.

796.

0620

1774

.42

487.

4010

5.03

91.6

077

.13

60.2

616

.89

19.3

723

.01

29.4

54.

645.

326.

328.

0950

2775

.00

761.

0512

9.53

115.

1283

.32

74.8

221

.42

24.1

133

.31

37.0

95.

886.

619.

1310

.17

100

5336

.31

1458

.00

177.

5514

7.22

120.

6393

.00

30.0

636

.25

44.2

457

.38

8.21

9.90

12.0

915

.68

200

1106

8.93

3262

.87

226.

5718

6.74

166.

8512

6.67

48.8

559

.27

66.3

487

.38

14.4

017

.47

19.5

625

.76

segm

ent

2310

197

1024

13.5

472

2.40

98.2

584

.09

74.0

565

.31

24.5

728

.70

32.5

936

.96

7.35

8.59

9.76

11.0

620

4331

.76

1215

.70

109.

0895

.81

84.6

075

.00

39.7

145

.21

51.2

057

.76

11.1

512

.69

14.3

716

.21

5078

75.8

521

17.4

715

9.91

123.

2812

1.69

97.2

049

.25

63.8

964

.72

81.0

313

.24

17.1

817

.40

21.7

810

014

854.

0539

62.3

225

4.74

189.

5917

3.38

143.

7758

.31

78.3

585

.67

103.

3215

.55

20.9

022

.85

27.5

620

028

757.

0876

30.1

032

3.03

256.

1725

3.75

204.

9089

.02

112.

2611

3.33

140.

3523

.62

29.7

930

.07

37.2

4m

ushr

oom

8124

222

1068

32.7

720

74.4

013

9.68

117.

8298

.64

85.9

348

.92

57.9

969

.27

79.5

214

.85

17.6

121

.03

24.1

420

1383

3.77

3723

.90

168.

1514

1.73

123.

3097

.64

82.2

797

.61

112.

2014

1.68

22.1

526

.27

30.2

038

.14

5028

428.

1371

41.7

430

6.78

230.

5118

0.89

127.

8692

.67

123.

3315

7.16

222.

3423

.28

30.9

839

.48

55.8

610

061

334.

5315

581.

7052

9.06

367.

3427

1.82

191.

7811

5.93

166.

9722

5.64

319.

8229

.45

42.4

257

.32

81.2

520

011

4267

.45

2959

6.69

683.

5644

7.32

432.

4530

1.36

167.

1725

5.45

264.

2337

9.17

43.3

066

.16

68.4

498

.21

kr-v

s-kp

2805

66

1710

4217

.19

1320

.30

84.3

274

.57

72.0

665

.63

50.0

156

.55

58.5

264

.26

15.6

617

.71

18.3

220

.12

2081

76.9

923

94.7

094

.56

89.2

674

.59

68.8

886

.47

91.6

110

9.63

118.

7125

.32

26.8

332

.10

34.7

750

1576

3.84

4179

.39

153.

9513

0.91

104.

7290

.87

102.

4012

0.42

150.

5317

3.48

27.1

531

.93

39.9

145

.99

100

3093

3.37

8058

.31

258.

5516

8.16

163.

6611

7.25

119.

6418

3.95

189.

0126

3.82

31.1

747

.92

49.2

468

.73

200

6402

9.44

1589

9.56

314.

4321

2.62

205.

0416

2.71

203.

6430

1.14

312.

2839

3.52

50.5

774

.78

77.5

497

.72

conn

ect-

467

557

423

1083

831.

3724

665.

1599

3.69

813.

1477

7.72

701.

9584

.36

103.

1010

7.79

119.

4324

.82

30.3

331

.71

35.1

420

1576

74.2

142

737.

9512

95.7

610

01.2

686

2.85

752.

9812

1.68

157.

4818

2.74

209.

4032

.98

42.6

849

.53

56.7

650

4268

49.3

511

0240

.25

2232

.12

1522

.36

1286

.71

924.

1219

1.23

280.

3933

1.74

461.

9049

.39

72.4

185

.68

119.

2910

079

9502

.73

2122

62.8

840

58.3

723

28.9

119

59.1

712

70.7

219

7.00

343.

2940

8.08

629.

1752

.30

91.1

410

8.34

167.

0420

016

2077

5.44

4124

62.9

666

42.0

440

49.8

528

61.7

619

41.3

624

4.02

400.

2156

6.36

834.

8762

.10

101.

8514

4.13

212.

46fa

rs10

0968

298

1098

353.

4228

265.

8114

77.4

011

60.9

910

46.5

586

2.74

66.5

784

.72

93.9

811

4.00

19.1

324

.35

27.0

132

.76

2018

7677

.23

5175

4.57

2100

.94

1532

.54

1238

.43

939.

4989

.33

122.

4615

1.54

199.

7724

.63

33.7

741

.79

55.0

950

4203

09.9

112

1558

.74

3548

.67

2278

.66

1900

.24

1403

.68

118.

4418

4.45

221.

1929

9.43

34.2

553

.35

63.9

786

.60

100

8623

75.2

925

1519

.89

7249

.25

3725

.75

2812

.08

1927

.82

118.

9623

1.46

306.

6744

7.33

34.7

067

.51

89.4

413

0.47

200

1660

939.

9949

1768

.62

1235

0.12

6530

.37

5192

.10

3448

.76

134.

4925

4.34

319.

9048

1.60

39.8

275

.30

94.7

114

2.59

kddc

up49

4020

4110

1057

5386

.93

1622

85.8

010

726.

7192

98.4

277

84.8

554

60.7

253

.64

61.8

873

.91

105.

3715

.13

17.4

520

.85

29.7

220

1149

797.

8732

3639

.32

1350

7.70

9626

.74

8761

.11

7001

.57

85.1

211

9.44

131.

2416

4.22

23.9

633

.62

36.9

446

.22

5021

6184

8.69

7755

97.0

522

450.

5115

376.

5415

054.

2410

005.

0596

.29

140.

5914

3.60

216.

0834

.55

50.4

451

.52

77.5

210

045

4353

4.52

1667

554.

0041

050.

4523

640.

0723

473.

9217

780.

2311

0.68

192.

2019

3.56

255.

5440

.62

70.5

471

.04

93.7

920

086

6465

1.97

3283

987.

9169

195.

6637

846.

5534

379.

0325

260.

9712

5.22

228.

9425

2.03

343.

0147

.46

86.7

795

.52

130.

00po

ker

1025

010

1010

1043

0254

.92

1188

20.5

211

866.

3583

17.3

364

51.3

640

65.0

036

.26

51.7

366

.69

105.

8410

.01

14.2

918

.42

29.2

320

9812

21.1

324

5588

.40

2346

9.48

1618

4.86

8143

.93

6783

.35

41.8

160

.63

120.

4814

4.65

10.4

615

.17

30.1

636

.20

5022

9167

8.70

5914

40.3

732

074.

4720

522.

5713

233.

5597

85.9

171

.45

111.

6717

3.17

234.

1818

.44

28.8

244

.69

60.4

410

047

2714

7.28

1219

990.

6553

438.

2731

803.

5621

340.

4214

043.

6288

.46

148.

6422

1.51

336.

6022

.83

38.3

657

.17

86.8

720

097

2159

3.77

2508

965.

4299

071.

5955

171.

6237

227.

4124

018.

7498

.13

176.

2126

1.14

404.

7525

.32

45.4

867

.40

104.

46

27

Nevertheless, the absolute time gain for such small data (few ms) is not a life-or-death question, but it is significant for large data sets, reducing the executiontime from hours to seconds. The GPU model is not only advisable for reducingthe execution time greatly, but also for solving the problem over larger populationsizes, which may achieve better results.

7. Concluding Remarks

In this paper we have analysed the performance of the MOGBAP algorithmfor classification and proposed a parallel GPU-based implementation to speedupthe evaluation phase, which has been demonstrated to require most of the execu-tion time of the algorithm, specially over large scale data sets. The creation of theants each generation, the multi-objective strategy, and the niching approach forselecting the rules that make up the final classifier have been parallelized usingmulti-threading, because of their internal sequential requirements, which justifiestheir not being included within the GPU. The experimental study carried out hasanalysed the performance of the GPU-based RPN interpreter for the rules minedin terms of the GPops interpreted per seconds, the performance of the evaluationmodel varying the population size, the length of the rule, and the dimensionalityof the data set, and its performance over some well-known data sets from the UCIrepository. The experimental results are evidence for the efficiency and scalabilityof the GPU model using different GPU devices and data set dimensions, achievingan interpreter performance of up to 10 billion GPops/s and an evaluation speedupof up to 834× vs a CPU, and 212× vs a 4–threaded CPU. Furthermore, its par-allelization enabled us to extend the application range of the algorithm to otherdomains that present data sets comprising a huge amount of instances and at-tributes, domains where the application of the MOGBAP algorithm was thereforeextremely difficult until now.

Acknowledgments

This work was supported by the Regional Government of Andalusia and theMinistry of Science and Technology, projects P08-TIC-3720 and TIN-2011-22408,and FEDER funds. This research was also supported by the Spanish Ministry ofEducation under the FPU grant AP2010-0042.

References

[1] NVIDIA CUDA Programming and Best Practices Guide, 2012.

28

[2] E. Alba, M. Tomassini, Parallelism and evolutionary algorithms, IEEETransactions on Evolutionary Computation 6 (2002) 443–462.

[3] D. Angus, C. Woodward, Multiple objective ant colony optimisation, SwarmIntelligence 3 (2009) 69–85.

[4] H. Bai, D. OuYang, X. Li, L. He, H. Yu, Max–Min ant system on GPU withCUDA, in: International Conference on Innovative Computing, Informationand Control (ICICIC) (2009), pp. 801–804.

[5] A.R. Baig, W. Shahzad, S. Khan, F. Altaf, ACO based discovery of compre-hensible and accurate rules from medical datasets, International Journal ofInnovative Computing, Information and Control 7 (2011) 6147–6159.

[6] W. Banzhaf, S. Harding, W.B. Langdon, G. Wilson, Accelerating GeneticProgramming through Graphics Processing Units, in: Genetic ProgrammingTheory and Practice VI, Genetic and Evolutionary Computation (2009), pp.1–19.

[7] A. Cano, A. Zafra, S. Ventura, Speeding up the evaluation phase of GP clas-sification algorithms on GPUs, Soft Computing 16 (2012) 187–202.

[8] J.M. Cecilia, J.M. Garcia, A. Nisbet, M. Amos, M. Ujaldon, Enhancing dataparallelism for ant colony optimisation on GPUs, Journal of Parallel andDistributed Computing (In press).

[9] J.M. Cecilia, J.M. Garcia, M. Ujaldon, A. Nisbet, M. Amos, Parallelizationstrategies for ant colony optimisation on GPUs, in: Parallel and DistributedProcessing Workshops and Phd Forum (2011), pp. 339–346.

[10] S. Che, M. Boyer, J. Meng, D. Tarjan, J.W. Sheaffer, K. Skadron, A per-formance study of general-purpose applications on graphics processors us-ing CUDA, Journal of Parallel and Distributed Computing 68 (2008) 1370–1380.

[11] Y. Chen, L. Chen, L. Tu, Parallel ant colony algorithm for mining classifi-cation rules, IEEE International Conference on Granular Computing (2006)85–90.

[12] J. Chintalapati, A. Maan, S. Priyanka, N. Mangala, V. Jayaraman, Parallelant-miner (PAM) on high performance clusters, in: Swarm Evolutionary andMemetic Computing, LNCS (2010), pp. 270–277.

29

[13] A. Delevacq, P. Delisle, M. Gravel, M. Krajecki, Parallel ant colony opti-mization on graphics processing units, Journal of Parallel and DistributedComputing (2012) In press.

[14] K.F. Doerner, R.F. Hartl, S. Benkner, M. Lucka, Parallel cooperative sav-ings based ant colony optimization—Multiple search and decomposition ap-proaches, Parallel Processing Letters 16 (2006) 351–370.

[15] K.L. Fok, T.T. Wong, M.L. Wong, Evolutionary computing on consumergraphics hardware, IEEE Intelligent Systems 22 (2007) 69–78.

[16] M.A. Franco, N. Krasnogor, J. Bacardit, Speeding up the evaluation of evo-lutionary learning systems using GPGPUs, in: Genetic and EvolutionaryComputation Conference (GECCO) (2010), pp. 1039–1046.

[17] J. Fu, L. Lei, G. Zhou, A parallel ant colony optimization algorithm withgpu-acceleration based on all-in-roulette selection, in: International Work-shop on Advanced Computational Intelligence (IWACI) (2010), pp. 260 –264.

[18] M.F. Ganji, M.S. Abadeh, A fuzzy classification system based on ant colonyoptimization for diabetes disease diagnosis, Expert Systems with Applica-tions 38 (2011) 14650–14659.

[19] M. Garland, S. Le Grand, J. Nickolls, J. Anderson, J. Hardwick, S. Mor-ton, E. Phillips, Y. Zhang, V. Volkov, Parallel computing experiences withCUDA, IEEE Micro 28 (2008) 13–27.

[20] S.U. Guan, F. Zhu, An incremental approach to genetic-algorithms-basedclassification, IEEE Transactions on Systems, Man, and Cybernetics, PartB: Cybernetics 35 (2005) 227–239.

[21] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kauff-man, 2006.

[22] S. Haykin, Neural Networks and Learning Machines, Pearson, 3rd edition,2009.

[23] H.J. Huang, C.N. Hsu, Bayesian classification for data from the same un-known class, IEEE Transactions on Systems, Man, and Cybernetics, Part B:Cybernetics 32 (2002) 137–145.

30

[24] T.M. Huang, V. Kecman, I. Kopriva, Support vector machines in classifi-cation and regression - an introduction, in: Kernel Based Algorithms forMining Huge Data Sets: Supervised, Semi-supervised, and UnsupervisedLearning (Studies in Computational Intelligence), volume 177, 2005, pp.1–47.

[25] W. Hwu, Illinois ECE 498AL: Programming Massively Parallel Processors,Lecture 13: Reductions and their Implementation, 2009.

[26] A.K. Jain, R.P. Duin, J. Mao, Statistical pattern recognition: A review, IEEETransactions on Pattern Analysis and Machine Intelligence 22 (2000) 4–37.

[27] B. Jang, D. Schaa, P. Mistry, D. Kaeli, Exploiting memory access patterns toimprove memory performance in data-parallel architectures, IEEE Transac-tions on Parallel and Distributed Systems 23 (2011) 105–118.

[28] L. Jian, C. Wang, Y. Liu, S. Liang, W. Yi, Y. Shi, Parallel data mining tech-niques on Graphics Processing Unit with Compute Unified Device Architec-ture (CUDA), The Journal of Supercomputing (2011) In press.

[29] W. Jiening, D. Jiankang, Z. Chunfeng, Implementation of ant colony algo-rithm based on GPU, in: Computer Graphics, Imaging and Visualization(2009), pp. 50–53.

[30] R. Jovanovic, M. Tuba, D. Simian, Comparison of different topologies forisland-based multi-colony ant algorithms for the minimum weight vertexcover problem, WSEAS Transactions on Computers 9 (2010) 83–92.

[31] S.B. Kotsiantis, I.D. Zaharakis, P.E. Pintelas, Machine learning: A review ofclassification and combining techniques, Artificial Intelligence Reviews 26(2006) 159–190.

[32] S.R. Kulkarni, G. Lugosi, S.S. Venkatesh, Learning pattern classification—asurvey, IEEE Transactions on Information Theory 44 (1998) 2178–2206.

[33] V. Kumar, A. Gupta, Analyzing scalability of parallel algorithms and archi-tectures, Journal of Parallel and Distributed Computing 22 (1994) 379–391.

[34] W.B. Langdon, A many threaded CUDA interpreter for genetic program-ming, Lecture Notes in Computer Science, in: European Conference on Ge-netic Programming (EuroGP), volume 6021 of LNCS (2010), pp. 146–158.

31

[35] W.B. Langdon, Graphics processing units and genetic programming: Anoverview, Soft Computing 15 (2011) 1657–1669.

[36] D.T. Larose, Discovering Knowledge in Data: An Introduction to Data Min-ing, Wiley, 2005.

[37] A. Leist, D. Playne, K. Hawick, Exploiting graphical processing units fordata-parallel scientific applications, Concurrency Computation Practice andExperience 21 (2009) 2400–2437.

[38] G. Luque, E. Alba, Parallel genetic algorithms: Theory and realworld appli-cations, Studies in Computational Intelligence 367 (2011) 1–183.

[39] D. Martens, M. De Backer, J. Vanthienen, M. Snoeck, B. Baesens, Classi-fication with ant colony optimization, IEEE Transactions on EvolutionaryComputation 11 (2007) 651–665.

[40] I. Michelakos, N. Mallios, E. Papageorgiou, M. Vassilakopoulos, Ant colonyoptimization and data mining, Studies in Computational Intelligence 352(2011) 31–60.

[41] S. Nesmachnow, F. Luna, E. Alba, Time analysis of standard evolutionaryalgorithms as software programs, in: International Conference on IntelligentSystems Design and Applications, ISDA (2011), pp. 271–276.

[42] D.J. Newman, A. Asuncion, UCI machine learning repository, 2007.

[43] J.L. Olmo, J.R. Romero, S. Ventura, Using ant programming guided bygrammar for building rule-based classifiers, IEEE Transactions on Systems,Man, and Cybernetics, Part B: Cybernetics 41 (2011) 1585–1599.

[44] J.L. Olmo, J.R. Romero, S. Ventura, Classification rule mining using ant pro-gramming guided by grammar with multiple Pareto fronts, Soft Computing16 (2012) 2143–2163.

[45] S.N. Omkar, R. Karanth, Rule extraction for classification of acoustic emis-sion signals using ant colony optimisation, Engineering Applications of Ar-tificial Intelligence 21 (2008) 1381–1388.

[46] J.D. Owens, M. Houston, D. Luebke, S. Green, J.E. Stone, J.C. Phillips,GPU Computing, Proceedings of the IEEE 96 (2008) 879–899.

32

[47] J.D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krger, A.E. Lefohn,T.J. Purcell, A survey of general-purpose computation on graphics hardware,Computer Graphics Forum 26 (2007) 80–113.

[48] R. Parpinelli, A.A. Freitas, H.S. Lopes, Data mining with an ant colonyoptimization algorithm, IEEE Transactions on Evolutionary Computation 6(2002) 321–332.

[49] M. Pedemonte, H. Cancela, A cellular ant colony optimisation for the gen-eralised steiner problem, International Journal of Innovative Computing andApplications 2 (2010) 188–201.

[50] M. Pedemonte, S. Nesmachnow, H. Cancela, A survey on parallel ant colonyoptimization, Applied Soft Computing 11 (2011) 5181–5197.

[51] M. Randall, A. Lewis, A parallel implementation of ant colony optimization,Journal of Parallel and Distributed Computing 62 (2002) 1421–1432.

[52] O. Roozmand, K. Zamanifar, Parallel Ant Miner 2, in: International Confer-ence on Artificial Intelligence and Soft Computing (ICAISC), volume 5097of LNCS (2008), pp. 681–692.

[53] K.C. Tan, Q. Yu, C.M. Heng, T.H. Lee, Evolutionary computing for knowl-edge discovery in medical diagnosis, Artificial Intelligence in Medicine 27(2003) 129–154.

[54] D. Tarditi, S. Puri, J. Oglesby, Accelerator: using data parallelism to pro-gram gpus for general-purpose uses, in: Proceedings of the 12th interna-tional conference on Architectural (2006), pp. 325–335.

[55] R.M. Weiss, GPU-accelerated ant colony optimization, in: W. mei W. Hwu(Ed.), GPU Computing Gems, Morgan Kaufmann, 2011, pp. 325–340.

[56] M.L. Wong, K.S. Leung, Data Mining Using Grammar-Based Genetic Pro-gramming and Applications, Kluwer Academic Publishers, Norwell, MA,USA, 2000.

[57] W. Zhu, J. Curry, Parallel ant colony for nonlinear function optimizationwith graphics hardware acceleration, in: IEEE International Conference onSystems, Man and Cybernetics (IEEE SMC) (2009), pp. 1803 –1808.

33