Parallel Multi-Objective Ant Programming for Classification Using GPUs
Alberto Cano^a, Juan Luis Olmo^a, Sebastian Ventura^a,∗
^a Department of Computer Science and Numerical Analysis, University of Cordoba, 14071 Cordoba, Spain
Abstract
Classification using Ant Programming is a challenging data mining task which demands a great deal of computational resources when handling data sets of high dimensionality. This paper presents a new parallelization approach of an existing multi-objective Ant Programming model for classification, using GPUs and the NVIDIA CUDA programming model. The computational costs of the different steps of the algorithm are evaluated and it is discussed how best to parallelize them. The features of both the CPU parallel and GPU versions of the algorithm are presented. An experimental study is carried out to evaluate the performance and efficiency of the interpreter of the rules, reporting the execution times and speedups with respect to variable population size, complexity of the rules mined and dimensionality of the data sets. Experiments measure the original single-threaded and the new multi-threaded CPU and GPU times with different numbers of GPU devices. The results are reported in terms of the number of Giga GP operations per second of the interpreter (up to 10 billion GPops/s) and the speedup achieved (up to 834x vs CPU, 212x vs 4-threaded CPU). The proposed GPU model is demonstrated to scale efficiently to larger data sets and to multiple GPU devices, which allows its applicability to be expanded to significantly more complicated data sets, previously unmanageable by the original algorithm in reasonable time.
Keywords: ant programming (AP), ant colony optimization (ACO), parallel computing, GPU, classification
∗Corresponding author
Email addresses: [email protected] (Alberto Cano), [email protected] (Juan Luis Olmo), [email protected] (Sebastian Ventura)
Preprint submitted to Journal of Parallel and Distributed Computing November 12, 2012
1. Introduction
Classification is a supervised machine learning task which consists in predicting the class membership of uncategorised examples, whose label is not known, using the properties of examples in a model learned previously from training examples, whose label was known. Classification tasks include a broad range of real-world application domains: disciplines such as bioinformatics, medical diagnosis, image recognition, and financial engineering, among others, where domain experts can use the model learned to support their decisions [26, 36].
A great variety of algorithms and techniques have been used to accomplish this task [32], including decision trees [21], decision rules [31], naive Bayes [23], support vector machines [24], neural networks [22], genetic algorithms [20], genetic programming [53], ant colony optimization (ACO) [48], etc. More recently, ant programming (AP) has been used to accomplish the classification task with great success, generating interpretable classifiers made up of IF–THEN rules [5, 18, 43].
The curse of dimensionality is one of the major challenges to the application of ant-based algorithms [40], owing to the long computational time necessary. Moreover, focusing on the classification issue, the induction of classifiers is becoming increasingly complicated, since the capabilities of data generation and collection in real application domains are growing at an exponential pace. Actually, it would be infeasible, or at least would take a lot of time, to run a sequential classification algorithm over certain data sets. Therefore, it has become crucial to design parallel algorithms capable of handling these large amounts of data [2, 38].
Concerning the computing architectures used to parallelize ACO algorithms, cluster platforms have traditionally been the option most employed, followed by multiprocessors and massively parallel computers [50]. Recently, other choices have been gaining ground, such as grid computing, multi-core servers, and graphics processing units (GPUs) [9, 8, 13].
GPUs are devices with multi-core architectures and parallel processor units, which provide fast parallel hardware for a fraction of the cost of a traditional parallel system. Actually, since the introduction of the compute unified device architecture (CUDA) in 2007, researchers all over the world have harnessed the power of the GPU for general purpose computing (GPGPU) [10, 47, 46]. The use of GPGPU has already been studied for speeding up algorithms within the framework of evolutionary computation and data mining [15, 28, 45]. Owing to the great advantages provided by GPGPU, we would like to explore the performance of a GPU-based parallelization of the Multi-Objective Grammar-Based Ant Programming (MOGBAP) algorithm for classification [44], which has demonstrated its good performance at addressing the extraction of classification rules. The experimental study analyses the performance and scalability of the algorithm from simple to complex problems, increasing the number of instances and attributes. Experimental results show that its parallelization allows the range of applications of this algorithm to be extended to other domains that have data sets including huge amounts of instances and attributes, domains where the application of the MOGBAP algorithm was therefore extremely difficult until now.
This paper is organized as follows. In the next section, we will present some related work on parallel ACO as applied to classification, as well as GPU-based proposals for ACO. Section 3 presents some details concerning the CUDA platform. In Section 4, we will first present the sequential version of the algorithm, then discuss the parallelization of the multi-objective AP algorithm for classification, and conclude by presenting the GPU design and implementation. Section 5 describes the experimental study. The results obtained are discussed in Section 6. Finally, Section 7 presents some concluding remarks.
2. Related Work
Several paradigms and taxonomies have been proposed, aiming at collecting the different possibilities for parallelizing an ACO algorithm. The most appropriate parallelization method depends upon the kind of problem to be solved, the data structures employed, and the stages which receive more computational time [51].
A recent survey of parallel ACO has been published by Pedemonte et al. [50]. In that paper, the authors propose an updated taxonomy to classify software-based parallel ACO algorithms, providing also an insight into current trends in the field. Since we refer to the categories and strategies of this taxonomy to explain the decisions adopted in parallelizing our AP algorithm, we will briefly mention these categories.
• Master–slave model. A master process takes care of managing the global information (i.e., pheromone matrix, best rule search, reinforcement and evaporation, etc.). The master process distributes tasks that have no effect on the global structures to several slave processes, so that these slaves are in charge of looking for a solution and evaluating it, computing its fitness. Three subcategories are considered in this paradigm, depending on the amount of work performed by the slave processes. In order of increasing work and, therefore, more communication between the master and the slaves, these categories are: coarse-grain [57], medium-grain [14] and fine-grain [17].
• Cellular model. A single population is divided into small neighbourhoods, each one having its own pheromone matrix. Each individual interacts just with its neighbours, and neighbourhoods overlap in order to allow good solutions to spread to the entire population [49].
• Parallel independent runs model. Independent runs on a set of processors of several sequential ACO algorithms without exchanging communication [4].
• Multicolony model. Several ant colonies explore the search space, exchanging information periodically to cooperate. Each colony has a separate pheromone matrix [30].
• Hybrid models. Other proposals that share features from two or more parallel models [51].
We will now draw attention to the specific ACO parallel proposals for classification rule mining, most of them focused on adapting the original Ant-Miner algorithm [48] to a parallel environment. The first paper was published by Chen et al., who developed the Parallel Ant Miner algorithm [11]. It is a parallel ACO algorithm based on the massively parallel processors computational model, which follows the message passing method. In that paper, a class label is assigned to each processor, and a group of ants is allocated to each processor in order to search for the antecedent of the rules, following a coarse-grain master–slave model.
Roozmand and Zamanifar [52] extended the algorithm developed by Chen et al., incorporating the multicolony concept and using an ant colony system. Their proposal considers several independent colonies of ants, each one in charge of discovering rules related to a given class label. Inside each group, ants are also executed in parallel, following a coarse-grain master–slave model to build and evaluate solutions. Ants of a given group communicate among themselves, and also with the best ant of the other groups, to update the pheromone matrices. The authors reported a higher speed of convergence than the previous Parallel Ant Miner algorithm, also discovering more accurate rules.
Chintalapati et al. [12] parallelized the original Ant-Miner algorithm in a cluster environment following a coarse-grain master–slave paradigm. They parallelized the rule construction stage and the discretization of the numerical attributes because, as they reported, these two tasks required the most computational time, so that their parallel implementation significantly speeds up the execution of the algorithm. The original Ant-Miner was applied locally in each processor, and both the pheromone information and the best discovered rule were updated once the processors submitted their information to the master process. The main difference from the paper by Roozmand and Zamanifar is that the ants can search for rules having different class labels in each processor, instead of just the class label assigned to the processor. In addition, the master process maintains a rule set where the discovered rules are stored, in the order of their discovery. The authors concluded that the best performance results were achieved when using data sets having more attributes and using a greater number of ants.
The most recent contribution to the field of classification rule mining using ACO has been the AntMinerGPU algorithm [55], proposed by Weiss, which consists in the parallelization of the AntMiner+ algorithm [39] by using GPUs. The hypothesis stated in that paper is that higher quality solutions could be found in less time by parallelizing the algorithm using GPUs, because this allows generating more candidate solutions in each generation. The AntMinerGPU algorithm offloads the implicitly parallel steps of AntMiner+ to a GPU device, and it allocates each ant its own thread using the GPU's multi-threading capabilities. Experiments using two data sets focused on analysing the effects on the accuracy and running time caused by changes in the population size. The author concluded that the accuracy values obtained by both the GPU and the CPU implementation were very similar, achieving a speedup close to 100x for large populations.
Concerning other GPU-based parallelizations of ACO algorithms devoted to problems other than classification rule mining, we can mention the paper by Jiening et al. [29], where a parallel implementation of the ACO Max–Min Ant System to solve the travelling salesman problem (TSP) was presented. More recently, Cecilia et al. [9] presented a more extensive and deeper work concerning the parallelization of ACO algorithms to solve the TSP, where several strategies for accelerating algorithms addressing this problem by using the GPU architecture were discussed.
3. CUDA programming model
Compute unified device architecture (CUDA) [1] is a parallel computing architecture developed by NVIDIA that allows programmers to take advantage of the computing capacity of NVIDIA GPUs in a general purpose manner. The CUDA programming model executes kernels as batches of parallel threads in a single instruction multiple data (SIMD) programming style. These kernels comprise thousands to millions of lightweight GPU threads per kernel invocation.
CUDA's threads are organized into a two-level hierarchy. At the top level, a grid is organized as a two-dimensional array of blocks. At the bottom level of the hierarchy, all threads of a block are organized into a three-dimensional array of threads. All the blocks in a grid have the same number of threads, with a maximum of 512 (devices of CUDA compute capability 1.3). The maximum number of thread blocks is 65535 x 65535, so each device can run up to 65535 x 65535 x 512 ≈ 2.2 · 10^12 threads per kernel call.
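The thread limit quoted above can be checked with a few lines of host-side C++. This is only an illustrative sketch: the helper name `maxThreadsPerLaunch` and the constant `kMaxThreads13` are hypothetical, and 64-bit arithmetic is used because the product does not fit in 32 bits.

```cpp
#include <cstdint>

// Upper bound on the number of threads of a single kernel launch, given the
// grid and block limits quoted in the text (hypothetical helper name).
constexpr std::uint64_t maxThreadsPerLaunch(std::uint64_t maxGridX,
                                            std::uint64_t maxGridY,
                                            std::uint64_t maxThreadsPerBlock) {
    return maxGridX * maxGridY * maxThreadsPerBlock;
}

// Limits stated in the text for devices of compute capability 1.3.
constexpr std::uint64_t kMaxThreads13 = maxThreadsPerLaunch(65535, 65535, 512);

// 65535 * 65535 * 512 = 2,198,956,147,200, i.e. roughly 2.2 * 10^12.
static_assert(kMaxThreads13 == 2198956147200ULL, "about 2.2e12 threads");
```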
To properly identify threads within the grid, each thread in a thread block has a unique ID in the form of a three-dimensional coordinate, and each block in a grid also has a unique two-dimensional coordinate.
Thread blocks are executed in streaming multiprocessors. A streaming multiprocessor can perform zero-overhead scheduling to interleave warps (a warp is a group of threads that execute together) and hide the latency of long-latency arithmetic and memory operations.
There are four main memory spaces: global, constant, shared, and local. These GPU memories are specialized and have different access times, lifetimes, and output limitations.
• Global memory: this is a large, long-latency memory that exists physically as an off-chip dynamic device memory. Threads can read and write to global memory in order to share data, and must write the kernel's output to it so that the output can be read even after the kernel terminates. However, a better way to share data and improve performance is to take advantage of shared memory.
• Shared memory: this is a small, low-latency memory that exists physically as on-chip memory, and its contents are only maintained during the thread block execution and are discarded when the thread block completes. Kernels that read or write to a known range of global memory with spatial or temporal locality can employ shared memory as a software-managed cache. Such caching potentially reduces global memory bandwidth demands and improves overall performance.
• Local memory: each thread also has its own local memory space as registers, so the number of registers a thread uses determines the number of concurrent threads executed in the multiprocessor, which is called multiprocessor occupancy. To avoid wasting hundreds of cycles while a thread waits for a long-latency global-memory load or store to complete, a common technique is to execute batches of global accesses, one per thread, exploiting the hardware's warp scheduling to overlap the threads' access latencies.
• Constant memory: this is specialized for situations in which many threads will read the same data simultaneously. This type of memory stores data written by the host thread, is accessed constantly, and does not change during the execution of the kernel. A value read from the constant cache is broadcast to all threads in a warp, effectively serving 32 loads from memory with a single cache access. This enables a fast, single-ported cache to feed multiple simultaneous memory accesses. The amount of constant memory is 64 KB.
There are some recommendations for maximum performance [19]. In particular, accesses to global memory should be coalesced. Global memory resides in device memory and is accessed via 32-, 64-, or 128-byte memory transactions. When a warp executes an instruction that accesses global memory, it coalesces the memory accesses of the threads within the warp into one or more of these memory transactions, depending on the size of the word accessed by each thread and the distribution of the memory addresses across the threads. In general, the more transactions are necessary, the more unused words are transferred in addition to the words accessed by the threads, reducing the instruction throughput.
To maximize global memory throughput, it is essential to maximize this coalescing by following the optimal access patterns [27], using data types that meet the size and alignment requirements or padding data in some cases, e.g., when accessing a two-dimensional array. For these accesses to be fully coalesced, the width of both the thread block and the array must be multiples of the warp size.
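As a rough illustration of the padding mentioned above, the following host-side C++ sketch rounds the row width of a two-dimensional array up to the next multiple of the warp size. The names `paddedWidth`, `paddedIndex` and `kWarpSize` are hypothetical and are not taken from the MOGBAP code; the warp size of 32 threads is an assumption consistent with the devices discussed here.

```cpp
#include <cstddef>

// Assumed warp size (32 threads on the NVIDIA devices discussed here).
constexpr std::size_t kWarpSize = 32;

// Round a row width (in elements) up to the next multiple of the warp size,
// so that every row of a 2D array starts at a warp-aligned offset.
constexpr std::size_t paddedWidth(std::size_t width) {
    return ((width + kWarpSize - 1) / kWarpSize) * kWarpSize;
}

// Linear index of element (row, col) in the padded row-major layout.
constexpr std::size_t paddedIndex(std::size_t row, std::size_t col,
                                  std::size_t width) {
    return row * paddedWidth(width) + col;
}
```

With this layout, a row of 10 elements occupies 32 slots, so consecutive rows start at offsets 0, 32, 64, and so on, at the cost of some unused memory per row.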
4. GPU-MOGBAP (GPU Multi-Objective Grammar-Based Ant Programming Algorithm)
In this section, we first introduce the original Multi-Objective Grammar-Based Ant Programming (MOGBAP) [44] algorithm for classification, so as to take into account its requirements when assessing the options for its parallelization. Then, the parallel design decisions which were adopted will be justified. Finally, the specific GPU implementation will be presented.
4.1. Serial version
The MOGBAP algorithm is a multi-objective AP algorithm designed specifically for multi-classification rule mining. It receives a training set as input, building a classifier that consists of a decision list where the discovered rules are sorted in descending order by their fitness. To classify a new unlabelled instance, the class assigned will correspond to the consequent of the first rule whose antecedent matches the instance, or to the default rule in case no rule in the decision list covers the instance. Further and detailed information about the MOGBAP algorithm and its pseudocode can be found in [44].
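The way a decision list classifies an instance can be sketched in a few lines of C++. This is a minimal illustration under stated assumptions, not the MOGBAP implementation: the `Rule` structure, the `classify` function and the use of a predicate for the antecedent are hypothetical, and the list is assumed to be already sorted by descending fitness.

```cpp
#include <functional>
#include <vector>

// Hypothetical rule representation: an antecedent predicate over the
// attribute values of an instance, plus the class predicted (consequent).
struct Rule {
    std::function<bool(const std::vector<double>&)> antecedent;
    int consequent;
};

// Return the consequent of the first rule whose antecedent matches the
// instance; fall back to the default class when no rule covers it.
// The rules are assumed to be already sorted by descending fitness.
int classify(const std::vector<Rule>& decisionList,
             const std::vector<double>& instance, int defaultClass) {
    for (const Rule& rule : decisionList) {
        if (rule.antecedent(instance)) {
            return rule.consequent;
        }
    }
    return defaultClass;
}
```

Because the first matching rule wins, the order of the list matters: a more specific rule placed after a more general one that also covers the instance is never reached.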
[Figure: derivation tree starting at <EXP>, with states built from AND, <EXP> and <COND> and conditions of the form (= attr1 value11), (= attr1 value12), ..., (= attrn valuenm), shown for derivation levels der = 1, der = 2 and der = 3.]
Figure 1: Space of states at a depth of 3 derivations. The sample shaded path represents the path followed by an ant. Double-line states represent final states.
This algorithm is based on the use of a context-free grammar that restricts the search space, ensuring the validity of any individual generated [56]. Regarding the generation of new individuals, each individual or ant will encode a rule. The creation of a new ant follows the application of the transition rule from the initial state of the environment, which adopts the shape of a derivation tree, until reaching a final state or solution. As an example, Figure 1 shows the derivation tree generated at a depth of three derivations, and the path followed by a given ant is highlighted.
It is worth pointing out that the structure of the space of states is not known beforehand, and neither is the length of the path followed by a given ant. Actually, the treatment of the environment is a key issue for the algorithm, since it depends directly upon the dimensionality of the data set and the number of derivations permitted from the grammar. Therefore, there would be an excessive computational cost in keeping in memory the whole space of states. To avoid this problem, the algorithm follows a sequential and incremental build approach. The data structure that represents the space of states is initialized with just the initial state and all possible transitions have the same amount of pheromones. This data structure also contains attributes that take into account the effects of the evaporation and normalization processes over the environment. The data structure is filled as ants are created, storing there the states of each ant's path.
As to the multi-objective scheme [3], the idea behind the strategy designed for MOGBAP is to discover a separate Pareto front for each class in the data set, because certain classes are more difficult to predict than others. Actually, if individuals from different classes are ranked simultaneously according to Pareto dominance, overlapping may occur. The multi-objective strategy can be summarized as follows. Once the individuals of the current generation have been created and evaluated for each objective considered, they are divided into k groups according to their consequent, k being the number of classes in the training set. Then, each group of individuals is combined with the solutions kept in the corresponding Pareto front found in the previous iteration of the algorithm, to rank them all according to dominance, finding a new Pareto front for each class. Hence, there will be k Pareto fronts, and only the non-dominated solutions in them will participate in the pheromone reinforcement. Finally, the output classifier is made up of the non-dominated individuals that exist in each of the k Pareto fronts once the last generation of the algorithm finishes. To select the rules of the classifier appropriately, a niching procedure is carried out over each frontier.
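The per-class Pareto strategy summarized above can be sketched in C++ as follows. This is only an illustration under stated assumptions: the `Individual` structure and the function names are hypothetical, all objectives are assumed to be maximized, and the front is found with a simple quadratic scan rather than with whatever ranking procedure MOGBAP actually uses.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical individual: the class it predicts and its objective values.
// All objectives are assumed to be maximized.
struct Individual {
    int consequent;
    std::vector<double> objectives;
};

// a dominates b if a is no worse in every objective and better in at least one.
bool dominates(const Individual& a, const Individual& b) {
    bool strictlyBetter = false;
    for (std::size_t i = 0; i < a.objectives.size(); ++i) {
        if (a.objectives[i] < b.objectives[i]) return false;
        if (a.objectives[i] > b.objectives[i]) strictlyBetter = true;
    }
    return strictlyBetter;
}

// Non-dominated subset of a group of individuals (simple O(n^2) scan).
std::vector<Individual> paretoFront(const std::vector<Individual>& group) {
    std::vector<Individual> front;
    for (std::size_t i = 0; i < group.size(); ++i) {
        bool dominated = false;
        for (std::size_t j = 0; j < group.size() && !dominated; ++j) {
            dominated = (i != j) && dominates(group[j], group[i]);
        }
        if (!dominated) front.push_back(group[i]);
    }
    return front;
}

// Merge the new ants with the fronts of the previous generation and compute
// one Pareto front per class (grouping by consequent).
std::map<int, std::vector<Individual>> frontsPerClass(
        const std::vector<Individual>& ants,
        const std::map<int, std::vector<Individual>>& previousFronts) {
    std::map<int, std::vector<Individual>> groups = previousFronts;
    for (const Individual& ant : ants) groups[ant.consequent].push_back(ant);
    std::map<int, std::vector<Individual>> fronts;
    for (const auto& entry : groups) {
        fronts[entry.first] = paretoFront(entry.second);
    }
    return fronts;
}
```

Since each class is handled independently inside `frontsPerClass`, the loop over groups is the natural place to assign one thread per class.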
Although it is known that hard-to-compute (super-linear-order) fitness functions tend to dominate the execution time of bio-inspired and evolutionary algorithms [34, 41], we have carried out a brief study of the average time taken by each stage of the subject algorithm, MOGBAP. Bearing in mind that the main objective of accelerating this algorithm with a GPU is to be able to process large data sets, we have selected a data set of high dimensionality for this experiment, the poker data set, which has one million instances and ten attributes. The hardware configuration and experimental settings for this study are the ones detailed in Section 5. The results are shown in Table 1, which gives the average percentage of execution time taken by each stage of the algorithm. These results clearly support the assumption that the evaluation phase is the one that requires the most computational time, nearly 96.3% of the total execution time of the algorithm. They also point out that the ant creation stage involves approximately 3% of the computational time, the niching procedure employed to select the rules that make up the final classifier involves around 0.6%, and finally, the sum of the remaining stages scarcely affects the computational time. Thus, the most significant speedup of the algorithm can be reached by accelerating the evaluation stage, which is achieved by taking advantage of the computing capacity of GPUs. Nevertheless, the parallel model explained in detail in this section also addresses the parallelization of other stages by using multi-threading.
Table 1: Workload of each phase of the sequential implementation of the MOGBAP algorithm
Stage                                             Time (%)
Initialization                                    3.06E-3
Ant creation                                      3.079
Ant evaluation                                    96.33
Multi-objective strategy                          3.03E-4
Reinforcement                                     3.79E-4
Evaporation                                       1.35E-4
Normalization                                     1.03E-4
Classifier construction from niching procedure    0.585
4.2. Parallel version
The initialization stage of the MOGBAP algorithm is in charge of starting up the grammar, taking into account the metadata of the training set used to build the classifier, and it also initializes the data structure used for the space of states.
As stated above, the data structure of the space of states follows an online construction approach. Each created ant stores its visited states in this structure if they are not already there (that is, if the state in question had not yet been visited). The ant creation phase can be parallelized by using multi-threading, since ants can be created simultaneously, bearing in mind that the path followed by a given ant does not depend on the path followed by the others. However, it is important to note that the storage of the states visited in the space of states structure cannot be done by the threads directly, since it is a common data structure that cannot be modified concurrently. Therefore, the update of this structure cannot be done in parallel and it is done by the master process, which receives the paths followed by each ant created.
The evaluation stage checks the instances covered by each ant and computes the fitness values for each objective based on the confusion matrix. The parallel implementation with multi-threading is straightforward, evaluating each ant in parallel.
Concerning the multi-objective strategy, the discovery of the Pareto front given a set of individuals cannot be parallelized, since a sequential procedure has to be followed to find out the dominance relations between them. Nevertheless, as individuals are grouped by the class consequent they predict and a Pareto front is discovered within each group, threads can be used to carry out this strategy, using a thread for assessing the dominance inside each different group.
Reinforcement, evaporation, and normalization stages are carried out by the master process, since these stages require writing to the space of states data structure, and write accesses cannot be concurrent.
The niching process that is carried out over the individuals of each Pareto front can be parallelized using multi-threading, and the resulting individuals are returned to the master process, which sorts them accordingly in order to build the final classifier.
As can be observed, this parallel scheme fits into the hierarchical coarse-grain master–slave category [50], since the master process manages the global information structures and delegates tasks to the slave processes. These tasks are the generation of individuals, their evaluation, the discovery of the Pareto fronts, and the niching procedure execution, communicating the results back to the master process.
The CPU parallel version of the algorithm is designed to run threads using the Java ExecutorService, which defines a pool of threads dynamically according to the number of available processors. This way, the algorithm scales automatically to processors with more cores without user supervision.
4.3. GPU version
This section presents the particular GPU implementation of the MOGBAP algorithm. Before focusing on the evaluation phase, which is parallelized by GPUs, we first justify why the remaining stages are not parallelized in the same way.
The initialization of the grammar and the space of states was not parallelized using multi-threading, so neither is it implemented in the GPU. The most interesting discussion concerns the ant creation stage: it was decided to parallelize this stage using multi-threading, because the transfer of the data structure used for the space of states to the GPU is not appropriate. Actually, it is infeasible to obtain an internal parallelization beyond the one obtained by creating each ant in its own thread, because it makes no sense to parallelize the construction of a given path over the space of states, as it consists in the sequential selection of transitions from one state to another. Therefore, the transfer costs would potentially outweigh the benefits of creating each ant in the GPU, besides being a complicated implementation due to the employment of the grammar and the number of derivations allowed for it. The parallel multi-threading implementation in the CPU is simple and efficient.
The multi-objective strategy was parallelized by using multi-threading, as explained in Section 4.2. It is not possible to parallelize this step further, because the process of finding the Pareto dominance relations in a set of individuals is a serial process. The same justification could be argued for the niching procedure, since internally it proceeds by sorting individuals according to their fitness and performing a sequence of operations following this order. Finally, reinforcement, evaporation, normalization, and classifier construction steps have to be carried out by the master process, as stated above.
[Figure: flow chart with three lanes: master process, slaves (multi-threading), and slaves (GPU).]
Figure 2: Computational flow chart of the GPU version
On the other hand, the parallelization of the evaluation phase using GPUs is known to perform efficiently [7, 16]. The evaluation of an ant does not depend on the evaluation of the others. Moreover, the process of interpreting the ant's rule and checking whether it covers an instance or not is also independent of the other rules and instances. Therefore, the GPU model proposes that each thread is in charge of interpreting one ant's rule over one instance. Figure 2 shows the computational workflow of the GPU model in which the evaluation process on the GPU is performed using two kernel functions.
[Figure: the coverage kernel (top) produces, for each of Rule 1 to Rule N, one outcome (TP, FP, TN, or FN) per instance I1, I2, I3, ...; the confusion matrix kernel (bottom) reduces these outcomes to the counts #TP, #FN, #FP, #TN and a fitness value for each rule.]
Figure 3: GPU kernels
4.3.1. Coverage kernel
The coverage kernel interprets the rules, which are expressed in Reverse Polish Notation (RPN), over the instances of the data set. The kernel is shown at the top of Figure 3 and it is executed using a 2D grid of thread blocks. The length along the first dimension is the number of rules. The length along the second dimension depends on the number of instances I and the number of threads per block N, i.e., the number of thread blocks needed to cover all the instances is ceil(I/N). Thus, the total number of thread blocks in the grid is the product of these two dimensions. This number is important, as it concerns the scalability of the model in future devices. NVIDIA recommends running at least twice as many thread blocks as the number of multiprocessors [1].
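The grid dimensions described above can be computed with integer arithmetic, as in the following host-side C++ sketch. The `GridDims` structure and the function names are hypothetical stand-ins for the launch configuration, and the block size of 256 threads used in the example afterwards is an assumption, not a value reported in the paper.

```cpp
#include <cstddef>

// Hypothetical launch configuration mirroring a 2D grid of thread blocks.
struct GridDims {
    std::size_t x;  // first dimension: one entry per rule
    std::size_t y;  // second dimension: blocks needed to cover all instances
};

// ceil(I / N) computed with integer arithmetic.
constexpr std::size_t blocksForInstances(std::size_t numInstances,
                                         std::size_t threadsPerBlock) {
    return (numInstances + threadsPerBlock - 1) / threadsPerBlock;
}

// Grid for evaluating numRules rules over numInstances instances.
GridDims coverageGrid(std::size_t numRules, std::size_t numInstances,
                      std::size_t threadsPerBlock) {
    return GridDims{numRules, blocksForInstances(numInstances, threadsPerBlock)};
}

// Total number of thread blocks: the product of the two dimensions.
std::size_t totalBlocks(const GridDims& grid) { return grid.x * grid.y; }
```

For instance, with one million instances and an assumed block size of 256 threads, 3907 blocks are needed along the second dimension, so 100 rules yield 390700 thread blocks in total.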
The threads are designed to access coalesced global memory positions for maximum performance. This way, the threads in a warp request consecutive memory addresses that can be serviced in fewer memory transactions. The rules are allocated to constant memory because this can provide broadcasting to all the threads in a warp. All the threads in a warp evaluate the same rule, but over different instances. This follows the single instruction multiple data (SIMD) data level parallelism, which is achieved when each processor performs the same task (rule evaluation) on different instances of distributed data [37, 54].
Code 1 Coverage kernel
__global__ void coverageKernel(double* instancesData, int* instancesClass,
                               char** rules, unsigned char* result)
{
    // Instance index using the CUDA built-in variables
    int instance = blockDim.y * blockIdx.y + threadIdx.y;

    // Skip padding threads beyond the last instance (the grid has ceil(I/N) blocks)
    if (instance >= numberInstances)
        return;

    // If the rule covers the instance
    if (covers(rules[blockIdx.x], instance))
    {
        // FP for every class ...
        for (int i = 0; i < numberClasses; i++)
            result[numberInstances * gridDim.x * i + blockIdx.x * numberInstances + instance] = 2;

        // ... except the true class of the instance, which is a TP
        result[numberInstances * gridDim.x * instancesClass[instance] + blockIdx.x * numberInstances + instance] = 0;
    }
    else // If the rule does not cover the instance
    {
        // TN for every class ...
        for (int i = 0; i < numberClasses; i++)
            result[numberInstances * gridDim.x * i + blockIdx.x * numberInstances + instance] = 1;

        // ... except the true class of the instance, which is a FN
        result[numberInstances * gridDim.x * instancesClass[instance] + blockIdx.x * numberInstances + instance] = 3;
    }
}
The interpreter of the rules is stack-based, i.e., the operands are pushed onto a stack, and when an operation is performed, its operands are popped from the stack and its result pushed back on. The interpreter checks the coverage of the instance by the rule, and the outcome depends on the predicted class and the true class of the instance, resulting in one of these four values: true positive (TP), false positive (FP), true negative (TN), and false negative (FN). Each thread writes the outcome to its corresponding memory position in the array which stores all the results, which will be counted by the confusion matrix kernel to compute the fitness values for the rules. The code for the coverage kernel is shown in Code 1. The coverage kernel receives as input three arrays: an array of attribute values (instancesData), an array of class values of the instances of the data set (instancesClass), and an array of arrays containing the rules to evaluate (rules), whereas it returns the computed results in an array of matching results (result). numberInstances and numberClasses are constant global variables visible to the kernels.
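A stack-based interpreter of this kind can be sketched on the host side in C++. This is a minimal illustration under stated assumptions: the token encoding (`TokenType`, `Token`) and the functions `covers` and `outcome` shown here are hypothetical, the only operators supported are the equality test and AND, and the actual rule encoding used on the GPU is not described in the text. The outcome codes follow the values written by Code 1.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical token encoding for a rule in Reverse Polish Notation.
enum class TokenType { Attribute, Value, Equals, And };

struct Token {
    TokenType type;
    double value;  // attribute index for Attribute, literal for Value
};

// Evaluate an RPN rule over one instance with an explicit operand stack.
// Returns true when the antecedent covers the instance.
bool covers(const std::vector<Token>& rule, const std::vector<double>& instance) {
    std::vector<double> stack;
    for (const Token& token : rule) {
        switch (token.type) {
            case TokenType::Attribute:
                stack.push_back(instance[static_cast<std::size_t>(token.value)]);
                break;
            case TokenType::Value:
                stack.push_back(token.value);
                break;
            case TokenType::Equals:
            case TokenType::And: {
                // Binary operators pop two operands and push one result.
                double right = stack.back(); stack.pop_back();
                double left = stack.back(); stack.pop_back();
                bool result = (token.type == TokenType::Equals)
                                  ? (left == right)
                                  : (left != 0.0 && right != 0.0);
                stack.push_back(result ? 1.0 : 0.0);
                break;
            }
        }
    }
    return !stack.empty() && stack.back() != 0.0;
}

// Outcome codes as written by Code 1: 0 = TP, 1 = TN, 2 = FP, 3 = FN.
unsigned char outcome(bool covered, int consequentClass, int trueClass) {
    if (covered) return consequentClass == trueClass ? 0 : 2;
    return consequentClass == trueClass ? 3 : 1;
}
```

Under this encoding, the rule "attr0 = 1 AND attr1 = 2" is written as the token sequence attr0, 1, =, attr1, 2, =, AND.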
4.3.2. Confusion matrix kernel
The confusion matrix kernel, shown at the bottom of Figure 3, counts the number of TP, FP, TN, and FN previously calculated for each rule. These values are employed to build the confusion matrix, which allows us to compute the fitness values. The process of summing the values from an array is known as reduction [25].
Designing an efficient parallel reduction is non-trivial, because it asks for the parallelization of an inherently sequential task. In fact, NVIDIA proposes six different approaches [1]. Some of the proposals take advantage of the device shared memory. Shared memory provides a small but fast memory shared by all the threads in a block. This is quite desirable when the threads within a block require synchronization and work together to accomplish a task such as reduction.
The threads are organized to perform a two-level reduction and provide coalesced memory accesses, avoiding shared memory bank conflicts. This way, the threads in a warp request consecutive memory addresses that can be serviced in fewer memory transactions.
Each thread performs a partial reduction of the coverage results for an ant, storing the partial counts in shared memory. Once all the items have been partially counted, a synchronization barrier is called. The barrier halts execution until all threads in the thread block have reached this point and all global and shared memory accesses made by them are visible to all threads in the block, i.e., all partial results are guaranteed to be written to shared memory and visible to every thread in the block. The synchronization barrier is used to coordinate communication between the threads of the same block: when several threads within a block access the same addresses in shared or global memory, there are potential read-after-write, write-after-read, or write-after-write hazards for some of these memory accesses. Next, only 4 threads perform the final sum of the 4 confusion matrix values. At this point, there is no need to call the synchronization barrier because the threads with idx 0, 1, 2 and 3 are within the same warp and, since a warp executes one common instruction at a time, threads within a warp are implicitly synchronized according to NVIDIA [1]. Finally, only one thread computes the fitness values for the different objectives and writes them back to global memory. All the fitness values are returned in a single array to avoid multiple segmented memory transactions. The code for the confusion matrix kernel is shown in Code 2. The kernel receives as input the array of prediction results from the coverage kernel (result) and returns two arrays: one contains the sensitivity, specificity and accuracy values (sespaccuracy), which are stored in a single array and copied in a single memory transfer for efficiency, and the other contains the best class for each rule (bestClass).
Code 2 Confusion matrix kernel
__global__ void confusionMatrixKernel(unsigned char* result, jdouble* sespaccuracy, int* bestClass)
{
  __shared__ int confusionMatrix[512];

  // Base index of the thread
  int base = blockIdx.x * numberInstances + threadIdx.y;
  // Top index of the thread
  int top = blockIdx.x * numberInstances + numberInstances - base;
  int moreCoveredClass = 0;

  // Checks the best class for the individual
  for(int j = 0; j < numberClasses; j++)
  {
    confusionMatrix[4*threadIdx.y] = 0;
    confusionMatrix[4*threadIdx.y+1] = 0;
    confusionMatrix[4*threadIdx.y+2] = 0;
    confusionMatrix[4*threadIdx.y+3] = 0;

    // Performs the first level reduction of the thread values
    for(int i = 0; i < top; i += 128)
      confusionMatrix[4*threadIdx.y + result[gridDim.x * numberInstances * j + base + i]]++;

    __syncthreads();

    if(threadIdx.y < 4)
    {
      // Performs the second level reduction of the partial sums
      for(int i = 4; i < 512; i += 4)
        confusionMatrix[threadIdx.y] += confusionMatrix[threadIdx.y + i];

      if(threadIdx.y == 0 && confusionMatrix[0] > moreCoveredClass)
      {
        int tp = confusionMatrix[0], tn = confusionMatrix[1];
        int fp = confusionMatrix[2], fn = confusionMatrix[3];

        moreCoveredClass = tp;
        // The counts are integers: cast to double so the ratios are not truncated
        sespaccuracy[blockIdx.x] = (double) tp / (tp + fn);
        sespaccuracy[gridDim.x + blockIdx.x] = (double) tn / (tn + fp);
        sespaccuracy[gridDim.x + gridDim.x + blockIdx.x] = (double) (tp + 1) / (tp + fp + numberClasses);
        bestClass[blockIdx.x] = j;
      }
    }

    __syncthreads();
  }
}
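The two-level counting performed by Code 2 can be emulated sequentially to make the data flow explicit. The following Python sketch is an illustration under stated assumptions, not the authors' implementation: the 128 "threads" are inferred from the stride of 128 and the 512-entry shared array in Code 2, the function names are hypothetical, and the guards against empty denominators are an addition of this example.

```python
TP, TN, FP, FN = 0, 1, 2, 3  # outcome codes written by the coverage kernel
THREADS = 128                # inferred from the stride of 128 in Code 2


def confusion_matrix(outcomes, threads=THREADS):
    """Count TP, TN, FP and FN in two levels, mirroring the structure of Code 2."""
    # First level: "thread" t counts the items t, t + threads, t + 2 * threads, ...
    partial = [[0, 0, 0, 0] for _ in range(threads)]
    for t in range(threads):
        for i in range(t, len(outcomes), threads):
            partial[t][outcomes[i]] += 1
    # On the GPU, __syncthreads() separates the two levels.
    # Second level: one "thread" per counter adds up all the partial counts.
    return [sum(partial[t][k] for t in range(threads)) for k in range(4)]


def fitness(tp, tn, fp, fn, number_classes):
    """Sensitivity, specificity and accuracy with the formulas used in Code 2."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # guard added for this sketch
    specificity = tn / (tn + fp) if tn + fp else 0.0  # guard added for this sketch
    accuracy = (tp + 1) / (tp + fp + number_classes)
    return sensitivity, specificity, accuracy


outcomes = [TP] * 30 + [FN] * 10 + [TN] * 50 + [FP] * 10
tp, tn, fp, fn = confusion_matrix(outcomes)
print(tp, tn, fp, fn)                                   # 30 50 10 10
print(fitness(tp, tn, fp, fn, number_classes=2)[0])     # 0.75
```

Because consecutive "threads" read consecutive items in each pass of the first level, the access pattern corresponds to the coalesced reads described in the text.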
5. Experimental setup
In this section, we first present the hardware devices and the configuration settings used in the experimental study. Then, the different experiments designed to evaluate the performance of the proposed model are presented.
5.1. Hardware configuration
The experiments were run on two PCs, both equipped with an Intel Core i7 quad-core processor running at 2.66 GHz and 12 GB of DDR3-1600 host memory. One PC featured two NVIDIA GeForce GTX 285 video cards equipped with 2 GB of GDDR3 video RAM, and the other featured two NVIDIA GeForce GTX 480 video cards equipped with 1.5 GB of GDDR5 video RAM. The GTX 285 GPU comprised 30 multiprocessors and 240 cores, whereas the GTX 480 GPU comprised 15 multiprocessors and 480 CUDA cores, both clocked at 1.4 GHz. The host operating system was GNU/Linux Ubuntu 11.10-3.0.0 64 bit, along with CUDA runtime 4.2, NVIDIA drivers 302.07, Eclipse integrated development environment 3.7.0, Java OpenJDK runtime environment 1.6-23 64 bit, and GCC compiler 4.6.3 (O2 optimization level).
5.2. Configuration settings
There are two different configuration settings to define: the GPU kernel settings and the algorithm configuration parameters.
On the one hand, the GPU kernels require two parameters which define the number of threads per block and the number of blocks per grid. The compiler and the hardware thread scheduler will schedule instructions as optimally as possible to avoid register memory bank conflicts. The best results are achieved when the number of threads per block is a multiple of 64, usually 128, 256 or 512. The CUDA GPU occupancy calculator provides information about the GPU multiprocessor occupancy, the number of threads per block, the registers used per thread, and the shared memory employed by the blocks. One of the keys to good performance is to keep the multiprocessors on the device as busy as possible. A device in which the work is poorly balanced between the multiprocessors will perform suboptimally.
The coverage kernel requires 13 registers, and the confusion matrix kernel requires 14 registers and 2104 bytes of shared memory. Table 2 shows the GPU occupancy for different numbers of threads per block. Occupancy is the ratio of the number of active warps per multiprocessor to the maximum number of possible active warps. 128 threads per block yield a multiprocessor occupancy of 75%, limited by the shared memory per multiprocessor. 256 and 512 threads per block yield an occupancy of 100%; in both cases the number of active threads per multiprocessor is 1024, but they differ in the number of active thread blocks per multiprocessor. We consider that the best option is to employ 256 threads per block, since it provides more active thread blocks per multiprocessor to hide the latency arising from register dependencies and, therefore, gives the dispatcher a wider range of possibilities to issue concurrent blocks to the execution units.
Table 2: Threads per block and multiprocessor occupancy
Threads per block                          128    256    512
Active threads per multiprocessor          768   1024   1024
Active warps per multiprocessor             24     32     32
Active thread blocks per multiprocessor      6      4      2
Occupancy of each multiprocessor           75%   100%   100%
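The figures in Table 2 follow directly from the definition of occupancy given above. The short Python sketch below reproduces them; the limit of 32 active warps per multiprocessor is an assumption inferred from the 1024 active threads reported at 100% occupancy, and the number of active blocks per configuration is taken from the table rather than derived from register or shared memory usage.

```python
WARP_SIZE = 32                     # threads per warp on NVIDIA GPUs
MAX_WARPS_PER_MULTIPROCESSOR = 32  # assumption consistent with Table 2 (1024 threads)


def occupancy(threads_per_block, active_blocks):
    """Return (active threads, active warps, occupancy) for one multiprocessor."""
    active_threads = threads_per_block * active_blocks
    active_warps = active_threads // WARP_SIZE
    return active_threads, active_warps, active_warps / MAX_WARPS_PER_MULTIPROCESSOR


# (threads per block, active thread blocks per multiprocessor) as listed in Table 2
for threads_per_block, active_blocks in [(128, 6), (256, 4), (512, 2)]:
    threads, warps, ratio = occupancy(threads_per_block, active_blocks)
    print(threads_per_block, threads, warps, f"{ratio:.0%}")
```

With 256 threads per block, four blocks are resident at the same time, whereas 512 threads per block reach the same occupancy with only two resident blocks, which is the difference the text uses to justify the choice of 256.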
The number of blocks in the grid should be larger than the number of multiprocessors so that every multiprocessor has at least one block to execute, because the primary concern is keeping the entire GPU busy. The grid size (number of thread blocks) of the coverage kernel depends on the number of rules (R) and the number of instances (I). This grid is a 2D matrix of thread blocks whose size is R × (I/256), since the thread block size is 256. Thus, the grid size and the GPU occupancy increase as the number of rules or the number of instances increases. This behaviour gives a first idea of the efficiency of the model, which increases with the complexity of the problem, whether due to a greater number of instances or to larger population sizes, as long as there are enough thread blocks to fill the GPU.
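The 2D grid described above can be made concrete with a few lines of Python. This is only a sketch: rounding I/256 up to a whole number of blocks and counting the threads that fall past the last instance are assumptions of this example (the paper does not state how the last, partially filled block is handled), and the function names are hypothetical.

```python
import math

THREADS_PER_BLOCK = 256  # thread block size chosen in the text


def coverage_grid(number_rules, number_instances):
    """Grid shape of the coverage kernel: (rules, blocks of instances per rule)."""
    return number_rules, math.ceil(number_instances / THREADS_PER_BLOCK)


def thread_to_instance(block_y, thread_y):
    """Instance index computed by a thread, as in the first statement of Code 1."""
    return THREADS_PER_BLOCK * block_y + thread_y


def idle_threads(number_instances):
    """Threads in the last block of a rule whose index falls past the last instance."""
    blocks = math.ceil(number_instances / THREADS_PER_BLOCK)
    return blocks * THREADS_PER_BLOCK - number_instances


print(coverage_grid(50, 1000))     # (50, 4): 200 thread blocks in total
print(thread_to_instance(3, 231))  # 999, the last of 1000 instances
print(idle_threads(1000))          # 24
```

The first grid dimension selects the rule (blockIdx.x in Code 1) and the second selects a group of 256 instances (blockIdx.y), so adding rules or instances adds whole thread blocks rather than lengthening the work of each thread.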
On the other hand, the MOGBAP algorithm was run using the parameter configuration shown in Table 3, where the first four parameters are mandatory and the other six parameters, enclosed in square brackets, are optional and have default values. These are the values recommended by the authors of the MOGBAP algorithm in [44].
Table 3: MOGBAP parameter configuration
Name            Description                                                Value
numIterations   Number of iterations                                       100
maxDerivations  Maximum number of derivations for the grammar              15
minCoverage     Minimum percentage of instances belonging to the class
                predicted by the rule that it should cover                 5%
[τ0]            Initial pheromone amount                                   1.0
[τmin]          Minimum pheromone amount                                   0.1
[τmax]          Maximum pheromone amount                                   1.0
[ρ]             Evaporation rate                                           0.05
[α]             Heuristic exponent                                         0.4
[β]             Pheromone exponent                                         1.0
5.3. Experiments
The experimental study comprised three sets of experiments. The first experiment evaluated the performance of the interpreter of the rules. The second experiment evaluated the performance of the evaluation model and its scalability. The third experiment analysed the performance of the model over several real-world data sets.
5.3.1. RPN interpreter performance
The performance of an RPN interpreter is often given in terms of the number of primitives which the system interprets per second, similar to GP interpreters, which report the number of GP operations per second (GPops/s) [6, 34, 35]. The interpreter evaluates expression trees, independently of whether they are obtained from Ant Programming, Genetic Programming, or Grammar Guided Genetic Programming.
The first experiment was designed to evaluate the performance of the RPN interpreter, which runs over different numbers of rules, different lengths of rules (their number of conditions), and different numbers of instances. Thus, it established a sensitivity analysis of the effect of these parameters on the speed of the interpreter.
5.3.2. Model performance and scalability
The second experiment was designed to analyse the performance of the whole parallelized model (including kernel execution and data transfer times) by varying the number of rules, the lengths of the rules, and the number of instances. It is interesting to analyse the scalability of the model with respect to the sizes of these parameters, especially when they are very low (small population sizes or data sets with few instances) or very high (large population sizes or data sets with up to a million instances).
5.3.3. Experiments with UCI data sets
The third experiment evaluated the model on 12 real-world data sets from the UCI data set repository website [42]. These data sets vary widely in their degree of complexity, number of classes (2–17), number of attributes (4–42), and number of instances (150–1M). Different population sizes were evaluated to analyse the scalability of the proposal, especially on complex and large-scale data sets, which justify the use of a larger population for best results.
6. Experimental study
This section presents and discusses the results of the different experiments. The reported results are the average execution times and speedups over 100 different executions. Even though the MOGBAP algorithm is non-deterministic, its behaviour is controlled by a user-defined seed parameter: if the seed is fixed, the algorithm is fully deterministic, i.e., it follows the same execution paths and produces the same results. In order to avoid large deviations in the execution times caused by the non-deterministic evolutionary behaviour, we fixed the seed for the experiments carried out. Therefore, it is guaranteed that the 100 runs of the CPU algorithm and the 100 runs of the GPU algorithm perform exactly the same computation tasks.
6.1. RPN interpreter performance
According to Kumar [33], the speedup is the ratio of the serial execution time to the parallel execution time of the algorithm. In our experiments, the speedup is the ratio between the CPU and GPU execution times. Specifically, we measure two speedups: the first is how much faster the GPU algorithm is than the original sequential single-threaded algorithm, and the second is how much faster the GPU algorithm is than the parallel multi-threaded CPU algorithm. Table 4 shows the interpreter execution times, the performance in terms of GPops/s, and the speedups achieved when comparing the performance of the GPU model to the single-threaded and multi-threaded CPU implementations. Each row represents the case of the RPN interpretation of R rules, comprising C conditions/attributes, over I instances. A rule represents a conjunction of attribute–value comparisons (operator + attribute + value). Thus, the total number of GPops to evaluate is (C × 3 + (C − 1)) × I × R.
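As a worked example of these definitions, the Python sketch below computes the number of GPops, the throughput in GPops/s, and the speedup for the smallest case reported in Table 4 (10 rules, 100 instances, 5 conditions; 3.03 ms on the single-threaded CPU and 0.14 ms on one GTX 285). The function names are hypothetical; the formulas and the timings are the ones stated in the text and in the table.

```python
def gp_operations(rules, conditions, instances):
    """Total GPops: three primitives per condition plus one AND per extra condition."""
    return (conditions * 3 + (conditions - 1)) * instances * rules


def gpops_per_second(gpops, time_ms):
    """Throughput of the interpreter given an execution time in milliseconds."""
    return gpops / (time_ms / 1000.0)


def speedup(reference_time, parallel_time):
    """Ratio of the reference execution time to the parallel execution time."""
    return reference_time / parallel_time


gpops = gp_operations(rules=10, conditions=5, instances=100)
print(gpops)                                   # 19000
print(f"{gpops_per_second(gpops, 3.03):.2e}")  # 6.27e+06
print(f"{speedup(3.03, 0.14):.2f}")            # 21.64
```

The three printed values match the GPops, CPU GPops/s and speedup columns of that row, which is a quick way to check how the columns of the table relate to each other.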
As far as the single-threaded and multi-threaded CPU interpreter performance is concerned, the greater the number of conditions per rule, the greater the number of GP operations to interpret and, therefore, the longer the execution time. The efficiency of the CPU interpreter increases as the complexity of the rule increases, but only when the number of rules or instances is low. In fact, when the number of rules or instances is high, the efficiency of the CPU interpreter decreases as the complexity of the rule (number of conditions) increases, i.e., the number of GPops/s of the single-threaded and multi-threaded CPU interpreter decreases when dealing with complex rules over many instances. We believe that this behaviour is caused by saturation of the CPU cache memory, which degrades performance because the larger data of the bigger scenarios do not fit into the cache; only the smallest scenario seems to fit.
Figure 4: RPN interpreter performance (GPops/s). (a) RPN Conditions vs Instances; (b) RPN Rules vs Instances; (c) RPN Rules vs Conditions
Table 4: RPN interpreter performance
                             Execution Time (ms)                                    GPops/s                                               Speedup vs CPU                  Speedup vs 4 CPU
 #R   #Inst  #Cond    GPops        CPU     4 CPU    1x285    2x285   1x480   2x480      CPU    4 CPU    1x285    2x285    1x480    2x480   1x285   2x285   1x480   2x480   1x285   2x285   1x480   2x480
 10     100      5  1.90E04       3.03      1.45     0.14     0.13    0.09    0.05  6.27E06  1.31E07  1.36E08  1.46E08  2.11E08  3.80E08   21.64   23.31   33.67   60.60   10.36   11.15   16.11   29.00
 10     100     10  3.90E04       3.41      1.88     0.15     0.14    0.10    0.05  1.14E07  2.07E07  2.60E08  2.79E08  3.90E08  7.80E08   22.73   24.36   34.10   68.20   12.53   13.43   18.80   37.60
 10     100     20  7.90E04       6.45      3.52     0.28     0.25    0.16    0.09  1.22E07  2.24E07  2.82E08  3.16E08  4.94E08  8.78E08   23.04   25.80   40.31   71.67   12.57   14.08   22.00   39.11
 10     100     50  1.99E05      14.32      5.84     0.34     0.32    0.25    0.14  1.39E07  3.41E07  5.85E08  6.22E08  7.96E08  1.42E09   42.12   44.75   57.28  102.29   17.18   18.25   23.36   41.71
 10   10000      5  1.90E06      30.99     10.14     1.32     0.71    0.41    0.23  6.13E07  1.87E08  1.44E09  2.68E09  4.63E09  8.26E09   23.48   43.65   75.59  134.74    7.68   14.28   24.73   44.09
 10   10000     10  3.90E06      90.56     30.95     2.66     1.43    0.83    0.45  4.31E07  1.26E08  1.47E09  2.73E09  4.70E09  8.67E09   34.05   63.33  109.11  201.24   11.64   21.64   37.29   68.78
 10   10000     20  7.90E06     260.97     84.62     5.12     2.75    1.64    0.88  3.03E07  9.34E07  1.54E09  2.87E09  4.82E09  8.98E09   50.97   94.90  159.13  296.56   16.53   30.77   51.60   96.16
 10   10000     50  1.99E07    1102.12    338.29    14.41     7.69    4.08    2.15  1.81E07  5.88E07  1.38E09  2.59E09  4.88E09  9.26E09   76.48  143.32  270.13  512.61   23.48   43.99   82.91  157.34
 10  100000      5  1.90E07     229.20     70.41    13.24     7.14    3.87    1.95  8.29E07  2.70E08  1.44E09  2.66E09  4.91E09  9.74E09   17.31   32.10   59.22  117.54    5.32    9.86   18.19   36.11
 10  100000     10  3.90E07     742.73    227.37    27.30    14.48    7.83    3.94  5.25E07  1.72E08  1.43E09  2.69E09  4.98E09  9.90E09   27.21   51.29   94.86  188.51    8.33   15.70   29.04   57.71
 10  100000     20  7.90E07    2014.31    614.66    52.17    27.92   15.65    7.86  3.92E07  1.29E08  1.51E09  2.83E09  5.05E09  1.01E10   38.61   72.15  128.71  256.27   11.78   22.02   39.28   78.20
 10  100000     50  1.99E08   15069.12   4496.22   143.70    78.68   39.07   19.62  1.32E07  4.43E07  1.38E09  2.53E09  5.09E09  1.01E10  104.87  191.52  385.70  768.05   31.29   57.15  115.08  229.17
 50     100      5  9.50E04       2.51      1.48     0.14     0.13    0.10    0.06  3.78E07  6.42E07  6.79E08  7.31E08  9.50E08  1.58E09   17.93   19.31   25.10   41.83   10.57   11.38   14.80   24.67
 50     100     10  1.95E05       3.02      1.64     0.15     0.14    0.11    0.06  6.46E07  1.19E08  1.30E09  1.39E09  1.77E09  3.25E09   20.13   21.57   27.45   50.33   10.93   11.71   14.91   27.33
 50     100     20  3.95E05      13.68      4.93     0.35     0.31    0.15    0.12  2.89E07  8.01E07  1.13E09  1.27E09  2.63E09  3.29E09   39.09   44.13   91.20  114.00   14.09   15.90   32.87   41.08
 50     100     50  9.95E05      55.22     16.08     1.00     0.87    0.36    0.28  1.80E07  6.19E07  9.95E08  1.14E09  2.76E09  3.55E09   55.22   63.47  153.39  197.21   16.08   18.48   44.67   57.43
 50   10000      5  9.50E06     151.34     44.32     6.01     3.39    1.97    1.00  6.28E07  2.14E08  1.58E09  2.80E09  4.82E09  9.50E09   25.18   44.64   76.82  151.34    7.37   13.07   22.50   44.32
 50   10000     10  1.95E07     438.52    127.73    12.35     6.90    3.98    2.01  4.45E07  1.53E08  1.58E09  2.83E09  4.90E09  9.70E09   35.51   63.55  110.18  218.17   10.34   18.51   32.09   63.55
 50   10000     20  3.95E07    1095.45    315.15    23.95    13.32    7.94    3.99  3.61E07  1.25E08  1.65E09  2.97E09  4.97E09  9.90E09   45.74   82.24  137.97  274.55   13.16   23.66   39.69   78.98
 50   10000     50  9.95E07    4894.98   1404.90    67.22    36.99   19.82   10.00  2.03E07  7.08E07  1.48E09  2.69E09  5.02E09  9.95E09   72.82  132.33  246.97  489.50   20.90   37.98   70.88  140.49
 50  100000      5  9.50E07    1352.84    385.82    61.32    35.10   19.19    9.61  7.02E07  2.46E08  1.55E09  2.71E09  4.95E09  9.89E09   22.06   38.54   70.50  140.77    6.29   10.99   20.11   40.15
 50  100000     10  1.95E08    3455.47    978.15   127.11    71.53   38.96   19.51  5.64E07  1.99E08  1.53E09  2.73E09  5.01E09  9.99E09   27.18   48.31   88.69  177.11    7.70   13.67   25.11   50.14
 50  100000     20  3.95E08   13706.57   3841.16   247.92   137.30   77.92   39.01  2.88E07  1.03E08  1.59E09  2.88E09  5.07E09  1.01E10   55.29   99.83  175.91  351.36   15.49   27.98   49.30   98.47
 50  100000     50  9.95E08   53444.07  14934.31   673.37   369.04  194.87   97.52  1.86E07  6.66E07  1.48E09  2.70E09  5.11E09  1.02E10   79.37  144.82  274.25  548.03   22.18   40.47   76.64  153.14
100     100      5  1.90E05       3.71      1.75     0.18     0.14    0.10    0.06  5.12E07  1.09E08  1.06E09  1.36E09  1.90E09  3.17E09   20.61   26.50   37.10   61.83    9.72   12.50   17.50   29.17
100     100     10  3.90E05       8.15      2.67     0.26     0.19    0.11    0.09  4.79E07  1.46E08  1.50E09  2.05E09  3.55E09  4.33E09   31.35   42.89   74.09   90.56   10.27   14.05   24.27   29.67
100     100     20  7.90E05      24.94      6.18     0.58     0.36    0.22    0.14  3.17E07  1.28E08  1.36E09  2.19E09  3.59E09  5.64E09   43.00   69.28  113.36  178.14   10.66   17.17   28.09   44.14
100     100     50  1.99E06     149.87     40.10     1.41     1.03    0.54    0.37  1.33E07  4.96E07  1.41E09  1.93E09  3.69E09  5.38E09  106.29  145.50  277.54  405.05   28.44   38.93   74.26  108.38
100   10000      5  1.90E07     223.82     60.35    12.37     6.08    3.91    1.97  8.49E07  3.15E08  1.54E09  3.12E09  4.86E09  9.64E09   18.09   36.81   57.24  113.61    4.88    9.93   15.43   30.63
100   10000     10  3.90E07     784.11    209.41    24.62    12.51    7.91    3.98  4.97E07  1.86E08  1.58E09  3.12E09  4.93E09  9.80E09   31.85   62.68   99.13  197.01    8.51   16.74   26.47   52.62
100   10000     20  7.90E07    2123.93    566.62    48.31    24.15   15.81    7.93  3.72E07  1.39E08  1.64E09  3.27E09  5.00E09  9.96E09   43.96   87.95  134.34  267.83   11.73   23.46   35.84   71.45
100   10000     50  1.99E08   15638.72   4151.23   134.68    67.71   39.46   19.81  1.27E07  4.79E07  1.48E09  2.94E09  5.04E09  1.00E10  116.12  230.97  396.32  789.44   30.82   61.31  105.20  209.55
100  100000      5  1.90E08    2275.21    602.67   123.86    61.79   38.33   19.21  8.35E07  3.15E08  1.53E09  3.07E09  4.96E09  9.89E09   18.37   36.82   59.36  118.44    4.87    9.75   15.72   31.37
100  100000     10  3.90E08    6481.91   1707.42   253.55   128.30   77.86   38.98  6.02E07  2.28E08  1.54E09  3.04E09  5.01E09  1.00E10   25.56   50.52   83.25  166.29    6.73   13.31   21.93   43.80
100  100000     20  7.90E08   25815.71   6779.24   495.46   249.30  155.79   77.97  3.06E07  1.17E08  1.59E09  3.17E09  5.07E09  1.01E10   52.10  103.55  165.71  331.10   13.68   27.19   43.52   86.95
100  100000     50  1.99E09  120945.05  31718.90  1350.74   677.49  389.61  194.90  1.65E07  6.27E07  1.47E09  2.94E09  5.11E09  1.02E10   89.54  178.52  310.43  620.55   23.48   46.82   81.41  162.74
200     100      5  3.80E05       5.13      1.60     0.31     0.15    0.14    0.07  7.41E07  2.38E08  1.23E09  2.53E09  2.71E09  5.43E09   16.55   34.20   36.64   73.29    5.16   10.67   11.43   22.86
200     100     10  7.80E05      14.84      4.44     0.60     0.30    0.29    0.11  5.26E07  1.76E08  1.30E09  2.60E09  2.69E09  7.09E09   24.73   49.47   51.17  134.91    7.40   14.80   15.31   40.36
200     100     20  1.58E06      51.75     13.52     1.17     0.59    0.57    0.22  3.05E07  1.17E08  1.35E09  2.68E09  2.77E09  7.18E09   44.23   87.71   90.79  235.23   11.56   22.92   23.72   61.45
200     100     50  3.98E06     288.04     74.38     2.90     1.61    1.35    0.54  1.38E07  5.35E07  1.37E09  2.47E09  2.95E09  7.37E09   99.32  178.91  213.36  533.41   25.65   46.20   55.10  137.74
200   10000      5  3.80E07     548.85    140.96    24.37    12.13    7.78    3.90  6.92E07  2.70E08  1.56E09  3.13E09  4.88E09  9.74E09   22.52   45.25   70.55  140.73    5.78   11.62   18.12   36.14
200   10000     10  7.80E07    1308.22    324.43    50.02    25.33   15.79    7.92  5.96E07  2.40E08  1.56E09  3.08E09  4.94E09  9.85E09   26.15   51.65   82.85  165.18    6.49   12.81   20.55   40.96
200   10000     20  1.58E08    4507.04   1115.84    97.24    48.83   31.61   15.80  3.51E07  1.42E08  1.62E09  3.24E09  5.00E09  1.00E10   46.35   92.30  142.58  285.26   11.48   22.85   35.30   70.62
200   10000     50  3.98E08   21364.72   5275.98   270.04   135.74   78.96   39.51  1.86E07  7.54E07  1.47E09  2.93E09  5.04E09  1.01E10   79.12  157.39  270.58  540.74   19.54   38.87   66.82  133.54
200  100000      5  3.80E08    6128.59   1505.88   255.58   135.39   77.64   39.35  6.20E07  2.52E08  1.49E09  2.81E09  4.89E09  9.66E09   23.98   45.27   78.94  155.75    5.89   11.12   19.40   38.27
200  100000     10  7.80E08   12653.10   3053.21   510.23   257.97  155.70   77.88  6.16E07  2.55E08  1.53E09  3.02E09  5.01E09  1.00E10   24.80   49.05   81.27  162.47    5.98   11.84   19.61   39.20
200  100000     20  1.58E09   41269.70   9951.79   988.87   497.35  311.52  155.79  3.83E07  1.59E08  1.60E09  3.18E09  5.07E09  1.01E10   41.73   82.98  132.48  264.91   10.06   20.01   31.95   63.88
200  100000     50  3.98E09  197783.79  47401.25  2713.03  1358.15  779.08  389.57  2.01E07  8.40E07  1.47E09  2.93E09  5.11E09  1.02E10   72.90  145.63  253.87  507.70   17.47   34.90   60.84  121.68
The performance of the multi-threaded CPU interpreter is up to 315 million GPops/s. On the other hand, the GPU implementation obtains high performance in all cases, especially over complex rules and many instances, and achieves up to 10 billion GPops/s.
The 480 GPU achieves more consistent performance than the 285 GPU, and it comprises a 768 KB L2 hardware cache which helps to minimize global memory reads. It is also noticeable that, when using two GPU devices, the performance of the interpreter doubles provided that there are enough instances to fill the GPU's thread blocks. The GPops/s for two 480 GPUs are presented in Figure 4, where each subfigure shows a pairwise evaluation of one of the possible combinations of the number of instances, rules, and conditions.
6.2. Model performance and scalability
Table 5 shows the execution times, the speedups, and the GPU occupancy of the different GPU devices. Each row represents the case of the fitness evaluation (including the coverage kernel, the confusion matrix kernel, and data transfers) of R rules over I instances along 100 generations.
The number of threads within a thread block was set to 256 in the experimental setup, i.e., a thread block is in charge of computing the coverage of a rule over a group of 256 instances. Thus, the total number of thread blocks required to compute all the evaluation cases is ceil(I/256) × R. The GPU occupancy is defined as the ratio between the number of thread blocks to compute and the maximum number of thread blocks that the GPU hardware is capable of handling concurrently. Thus, an occupancy lower than 1 means that the computation requirements do not reach the maximum capability of the GPU; the occupancy ratio should be over 1 to achieve full performance. The 285 GPU comprises 30 multiprocessors and the 480 GPU comprises 15 multiprocessors. Each multiprocessor can handle 4 active thread blocks when using 256 threads per block. Thus, the maximum number of active thread blocks is 120 and 60, respectively, times the number of GPU devices, i.e., the 480 GPU performs better with fewer and larger thread blocks, achieving higher occupancy.
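The occupancy ratio defined in this paragraph can be written down as a short Python function. It is a sketch that uses only the figures stated in the text (256 threads per block, 4 active blocks per multiprocessor, 30 multiprocessors for the 285 GPU); the function and parameter names are hypothetical.

```python
import math


def gpu_occupancy_ratio(rules, instances, multiprocessors, devices=1,
                        threads_per_block=256, blocks_per_multiprocessor=4):
    """Return (thread blocks to compute, ratio to the blocks the hardware can run at once)."""
    blocks = math.ceil(instances / threads_per_block) * rules
    max_active_blocks = multiprocessors * blocks_per_multiprocessor * devices
    return blocks, blocks / max_active_blocks


# 10 rules over 100 instances on one 285 GPU (30 multiprocessors)
blocks, ratio = gpu_occupancy_ratio(rules=10, instances=100, multiprocessors=30)
print(blocks, round(ratio, 2))  # 10 0.08

# The same workload split across two 285 GPUs
blocks, ratio = gpu_occupancy_ratio(rules=10, instances=100, multiprocessors=30, devices=2)
print(blocks, round(ratio, 2))  # 10 0.04
```

A ratio well below 1, as in this small case, means that most multiprocessors have nothing to execute, so adding rules or instances costs little extra time until the ratio approaches 1.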
Table 5: Model performance
                            Execution Time (ms)                                  Speedup vs CPU                  Speedup vs 4 CPU                GPU occupancy
 #R   #Inst  #Cond  Blocks        CPU     4 CPU   1x285   2x285   1x480   2x480   1x285   2x285   1x480   2x480   1x285   2x285   1x480   2x480   1x285   2x285   1x480   2x480
 10     100      5      10      13.52      8.42    7.19    7.02    5.63    5.20    1.88    1.93    2.40    2.60    1.17    1.20    1.50    1.62    0.08    0.04    0.04    0.02
 10     100     10      10      16.99     10.47    7.42    7.28    5.89    5.41    2.29    2.33    2.88    3.14    1.41    1.44    1.78    1.94    0.08    0.04    0.04    0.02
 10     100     20      10      22.64     15.69    8.54    7.79    6.58    6.01    2.65    2.91    3.44    3.77    1.84    2.01    2.38    2.61    0.08    0.04    0.04    0.02
 10     100     50      10      50.36     20.15    8.76    8.12    7.49    7.12    5.75    6.20    6.72    7.07    2.30    2.48    2.69    2.83    0.08    0.04    0.04    0.02
 10   10000      5     790     673.15    259.07   16.00   13.47   12.49   10.03   42.07   49.97   53.90   67.11   16.19   19.23   20.74   25.83    6.58    3.29    3.29    1.65
 10   10000     10     790    1034.94    391.39   17.76   14.34   13.57   10.17   58.27   72.17   76.27  101.76   22.04   27.29   28.84   38.48    6.58    3.29    3.29    1.65
 10   10000     20     790    1600.45    594.95   19.19   15.33   15.16   12.08   83.40  104.40  105.57  132.49   31.00   38.81   39.24   49.25    6.58    3.29    3.29    1.65
 10   10000     50     790    2785.06    962.78   27.09   20.70   20.24   15.66  102.81  134.54  137.60  177.85   35.54   46.51   47.57   61.48    6.58    3.29    3.29    1.65
 10  100000      5    7820    5950.02   2100.53  101.07   80.36   74.57   65.67   58.87   74.04   79.79   90.60   20.78   26.14   28.17   31.99   65.17   32.58   32.58   16.29
 10  100000     10    7820    8847.91   2903.94  119.66   95.47   85.00   67.87   73.94   92.68  104.09  130.37   24.27   30.42   34.16   42.79   65.17   32.58   32.58   16.29
 10  100000     20    7820   12917.86   4150.33  142.90  121.91  100.22   85.80   90.40  105.96  128.90  150.56   29.04   34.04   41.41   48.37   65.17   32.58   32.58   16.29
 10  100000     50    7820   28036.69   8776.77  221.03  185.04  174.94  145.00  126.85  151.52  160.26  193.36   39.71   47.43   50.17   60.53   65.17   32.58   32.58   16.29
 20     100      5      20      15.78      9.76    7.46    7.13    6.95    6.50    2.12    2.21    2.27    2.43    1.31    1.37    1.40    1.50    0.17    0.08    0.08    0.04
 20     100     10      20      21.51     12.47    7.50    7.30    7.23    6.90    2.87    2.95    2.98    3.12    1.66    1.71    1.72    1.81    0.17    0.08    0.08    0.04
 20     100     20      20      40.41     19.21    8.74    7.85    7.53    7.04    4.62    5.15    5.37    5.74    2.20    2.45    2.55    2.73    0.17    0.08    0.08    0.04
 20     100     50      20      80.20     33.07   10.59    9.22    9.15    7.48    7.57    8.70    8.77   10.72    3.12    3.59    3.61    4.42    0.17    0.08    0.08    0.04
 20   10000      5    1580     959.25    265.84   21.46   18.13   15.19   12.16   44.70   52.91   63.15   78.89   12.39   14.66   17.50   21.86   13.17    6.58    6.58    3.29
 20   10000     10    1580    1402.73    397.36   24.46   18.48   17.87   13.95   57.35   75.91   78.50  100.55   16.25   21.50   22.24   28.48   13.17    6.58    6.58    3.29
 20   10000     20    1580    2315.95    665.07   24.83   21.50   20.32   15.01   93.27  107.72  113.97  154.29   26.78   30.93   32.73   44.31   13.17    6.58    6.58    3.29
 20   10000     50    1580    4375.58   1142.56   33.72   27.95   26.14   22.52  129.76  156.55  167.39  194.30   33.88   40.88   43.71   50.74   13.17    6.58    6.58    3.29
 20  100000      5   15640    8614.43   2129.28  143.43  101.25   94.89   72.83   60.06   85.08   90.78  118.28   14.85   21.03   22.44   29.24  130.33   65.17   65.17   32.58
 20  100000     10   15640   12808.98   3294.28  160.37  117.21  114.52   80.79   79.87  109.28  111.85  158.55   20.54   28.11   28.77   40.78  130.33   65.17   65.17   32.58
 20  100000     20   15640   21331.93   5156.10  176.91  135.19  130.41  104.63  120.58  157.79  163.58  203.88   29.15   38.14   39.54   49.28  130.33   65.17   65.17   32.58
 20  100000     50   15640   37712.71   9433.25  252.76  215.21  197.17  167.95  149.20  175.24  191.27  224.55   37.32   43.83   47.84   56.17  130.33   65.17   65.17   32.58
 50     100      5      50      28.33     15.06   11.37   10.90    8.25    7.19    2.49    2.60    3.43    3.94    1.32    1.38    1.83    2.09    0.42    0.21    0.21    0.10
 50     100     10      50      43.21     21.45   12.33   11.22    8.85    7.20    3.50    3.85    4.88    6.00    1.74    1.91    2.42    2.98    0.42    0.21    0.21    0.10
 50     100     20      50      60.10     30.66   14.52   12.04    9.02    8.21    4.14    4.99    6.66    7.32    2.11    2.55    3.40    3.73    0.42    0.21    0.21    0.10
 50     100     50      50     113.27     47.49   14.82   13.41   10.19    8.38    7.64    8.45   11.12   13.52    3.20    3.54    4.66    5.67    0.42    0.21    0.21    0.10
 50   10000      5    3950    1885.97    471.42   39.94   27.57   19.71   14.11   47.22   68.41   95.69  133.66   11.80   17.10   23.92   33.41   32.92   16.46   16.46    8.23
 50   10000     10    3950    2933.91    695.09   42.99   31.28   21.61   14.93   68.25   93.80  135.77  196.51   16.17   22.22   32.17   46.56   32.92   16.46   16.46    8.23
 50   10000     20    3950    4742.53   1118.49   47.06   33.62   23.59   15.67  100.78  141.06  201.04  302.65   23.77   33.27   47.41   71.38   32.92   16.46   16.46    8.23
 50   10000     50    3950   10076.53   2457.20   64.38   44.27   27.34   24.42  156.52  227.62  368.56  412.63   38.17   55.50   89.88  100.62   32.92   16.46   16.46    8.23
 50  100000      5   39100   18613.23   4553.92  335.65  210.65  172.13  118.34   55.45   88.36  108.13  157.29   13.57   21.62   26.46   38.48  325.83  162.92  162.92   81.46
 50  100000     10   39100   27135.58   6308.68  347.87  230.05  189.54  123.91   78.00  117.96  143.17  218.99   18.14   27.42   33.28   50.91  325.83  162.92  162.92   81.46
 50  100000     20   39100   42911.82   9956.26  395.40  268.32  215.30  154.25  108.53  159.93  199.31  278.20   25.18   37.11   46.24   64.55  325.83  162.92  162.92   81.46
 50  100000     50   39100   98992.82  23412.60  453.00  323.86  306.60  196.84  218.53  305.67  322.87  502.91   51.68   72.29   76.36  118.94  325.83  162.92  162.92   81.46
100     100      5     100      45.95     18.82   12.07   11.01    9.45    7.90    3.81    4.17    4.86    5.82    1.56    1.71    1.99    2.38    0.83    0.42    0.42    0.21
100     100     10     100      61.32     28.36   12.78   11.50    9.75    8.37    4.80    5.33    6.29    7.33    2.22    2.47    2.91    3.39    0.83    0.42    0.42    0.21
100     100     20     100      96.82     39.33   15.97   13.58   10.76    9.46    6.06    7.13    9.00   10.23    2.46    2.90    3.66    4.16    0.83    0.42    0.42    0.21
100     100     50     100     210.03     54.75   18.90   14.68   11.92    9.75   11.11   14.31   17.62   21.54    2.90    3.73    4.59    5.62    0.83    0.42    0.42    0.21
100   10000      5    7900    3590.06    819.22   57.74   40.19   35.89   23.62   62.18   89.33  100.03  151.99   14.19   20.38   22.83   34.68   65.83   32.92   32.92   16.46
100   10000     10    7900    5546.30   1289.71   65.78   43.64   39.02   23.90   84.32  127.09  142.14  232.06   19.61   29.55   33.05   53.96   65.83   32.92   32.92   16.46
100   10000     20    7900    8496.59   1914.72   68.70   46.73   43.82   29.85  123.68  181.82  193.90  284.64   27.87   40.97   43.70   64.14   65.83   32.92   32.92   16.46
100   10000     50    7900   18677.71   4347.70   84.88   56.20   52.09   34.26  220.05  332.34  358.57  545.18   51.22   77.36   83.47  126.90   65.83   32.92   32.92   16.46
100  100000      5   78200   34097.45   7770.71  546.30  319.79  274.03  157.12   62.42  106.62  124.43  217.02   14.22   24.30   28.36   49.46  651.67  325.83  325.83  162.92
100  100000     10   78200   54730.69  12188.87  570.31  328.34  318.94  174.49   95.97  166.69  171.60  313.66   21.37   37.12   38.22   69.85  651.67  325.83  325.83  162.92
100  100000     20   78200   88999.51  20452.00  590.72  360.56  321.68  201.46  150.66  246.84  276.67  441.77   34.62   56.72   63.58  101.52  651.67  325.83  325.83  162.92
100  100000     50   78200  194084.86  43719.72  673.71  434.29  423.42  265.87  288.08  446.90  458.37  730.00   64.89  100.67  103.25  164.44  651.67  325.83  325.83  162.92
Figure 5: Model performance and scalability (speedup). (a) Conditions vs Rules; (b) Conditions vs Instances; (c) Instances vs Rules
The CPU times increase with the number of rules, the number of instances, and the number of attributes. The GPU times remain practically constant as long as the occupancy is lower than 1; e.g., the execution times for 10 rules and 100 instances are similar to those for 100 rules and 100 instances. Interestingly, doubling the number of rules in cases with many instances doubles the time of the CPU approaches, but increases the GPU time by only about 50%. The reason for this behaviour lies in how the dimensionality of the kernel grows, how rules and instances are stored in GPU memory, and how the different GPU memory spaces serve memory requests. The GPU kernel explained in Section 4.3.1 employs a 2D grid of thread blocks: one dimension grows with the number of data instances and the other with the number of rules. The results shown in Table 5 indicate that increasing only the number of instances increases the execution time linearly (once the GPU has reached full occupancy). This is because more data instances mean more global memory reads (more threads read instance values from global memory), and the cost of global memory transactions grows linearly with their number, so the execution time increases linearly. The rules, however, are stored not in global memory but in constant memory. As indicated in Section 3, a value read from the constant cache is broadcast to all threads in a warp, effectively serving loads from memory with a single cache access. Consequently, when the number of rules increases, reading them from constant memory is less costly than reading them from global memory would be, because memory requests from multiple threads are served by a single cached access. Therefore, doubling the number of instances doubles both the number of global memory reads and the computational work (arithmetic calculations), and hence doubles the execution time. Doubling the number of rules, on the other hand, doubles the computational work but does not double the number of constant memory read transactions, so the execution time is only about 50% larger (mostly due to the doubled number of calculations). Figure 5 shows, in separate panels, how the relation between the number of conditions, rules, and instances affects the speedup of the algorithm.
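The relation between problem size, grid size and occupancy discussed above can be illustrated with a short calculation. The following is a minimal sketch under stated assumptions, not the kernel launch code of the paper: the block size of 128 threads, the function names, and the device capacity `resident_blocks` (the number of thread blocks assumed to run concurrently) are all hypothetical and chosen only for illustration.

```python
import math

# Assumption: 128 threads per block; this section does not state the block size.
THREADS_PER_BLOCK = 128


def grid_blocks(num_rules: int, num_instances: int,
                threads_per_block: int = THREADS_PER_BLOCK) -> int:
    """Number of thread blocks in a 2D grid.

    One grid dimension covers the data instances (one thread per instance),
    the other covers the rules (one row of blocks per rule).
    """
    blocks_x = math.ceil(num_instances / threads_per_block)  # instance dimension
    blocks_y = num_rules                                     # rule dimension
    return blocks_x * blocks_y


def occupancy_ratio(blocks: int, resident_blocks: int) -> float:
    """Launched blocks divided by the blocks the device can hold at once.

    Below 1 the device is under-used, so additional work is absorbed with
    little extra time; above 1 blocks queue and time grows with the workload.
    """
    return blocks / resident_blocks


if __name__ == "__main__":
    # Hypothetical capacity of 120 concurrently resident blocks.
    for rules, instances in [(10, 100), (100, 100), (100, 10_000), (100, 100_000)]:
        blocks = grid_blocks(rules, instances)
        print(rules, instances, blocks, round(occupancy_ratio(blocks, 120), 2))
```

Under these assumptions, 10 and 100 rules over 100 instances both give a ratio below 1, which matches the observation that their GPU times are similar, whereas 10000 or more instances push the ratio far above 1 and the time starts to scale with the workload.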
6.3. Experiments on UCI data sets
Table 6 shows the execution times and the speedups achieved over 12 varied data sets from the UCI machine learning repository.
The proposal demonstrates high performance and efficiency, both of which increase as the number of instances and the population size increase. The highest speedup is achieved on the Connect-4 data set using 200 ants and 2 GTX 480 GPUs (834× vs single-threaded CPU, 212× vs multi-threaded CPU). The highest speedups over all data sets are 834× for 2 GTX 480 GPUs, 566× for 1 GTX 480 GPU, 400× for 2 GTX 285 GPUs and 244× for 1 GTX 285 GPU, all compared to the original single-threaded CPU implementation. The average speedups over all data sets are 162× for 2 GTX 480 GPUs, 119× for 1 GTX 480 GPU, 99× for 2 GTX 285 GPUs and 68× for 1 GTX 285 GPU, again compared to the original single-threaded CPU implementation. The smallest speedups were obtained on small data sets such as Iris, but it is worth highlighting that they are always higher than 1×, even when the number of instances and rules to parallelize is very low. This observation is not trivial: it implies that it is always advisable to speed up the execution using GPUs regardless of the problem size, i.e., in all our experiments executing on the GPU (copying the data instances, copying the rules, performing the evaluation and copying the results back to host memory) was faster than executing on the CPU, no matter how small the problem.
Table 6: UCI data sets performance
                                      Execution time (ms)                                             Speedup vs CPU                  Speedup vs 4CPU
Data set      #Inst #Att #Classes   #R         CPU        4CPU     1 285     2 285     1 480     2 480   1 285   2 285   1 480   2 480   1 285   2 285   1 480   2 480
Iris            150    4        3   10      136.42       62.80     55.31     43.89     38.86     30.52    2.47    3.11    3.51    4.47    1.14    1.43    1.62    2.06
Iris            150    4        3   20      202.37       90.70     56.92     47.59     42.49     35.57    3.56    4.25    4.76    5.69    1.59    1.91    2.13    2.55
Iris            150    4        3   50      388.33      156.00     76.53     66.92     50.19     40.42    5.07    5.80    7.74    9.61    2.04    2.33    3.11    3.86
Iris            150    4        3  100      702.51      235.30     79.91     70.77     62.41     52.35    8.79    9.93   11.26   13.42    2.94    3.32    3.77    4.49
Iris            150    4        3  200     1548.89      434.30    108.38     96.27     94.09     77.73   14.29   16.09   16.46   19.93    4.01    4.51    4.62    5.59
ionosphere      351   33        2   10      877.52      282.70     64.16     44.23     41.58     30.56   13.68   19.84   21.10   28.71    4.41    6.39    6.80    9.25
ionosphere      351   33        2   20     1341.58      418.70     69.97     48.76     43.48     34.27   19.17   27.51   30.86   39.15    5.98    8.59    9.63   12.22
ionosphere      351   33        2   50     2486.55      770.00     95.34     79.01     63.42     52.35   26.08   31.47   39.21   47.50    8.08    9.75   12.14   14.71
ionosphere      351   33        2  100     4449.82     1411.90    134.56    108.24     97.49     77.67   33.07   41.11   45.64   57.29   10.49   13.04   14.48   18.18
ionosphere      351   33        2  200     8768.56     2691.50    175.67    146.24    143.25    114.27   49.91   59.96   61.21   76.74   15.32   18.40   18.79   23.55
australian      690   14        2   10      730.28      208.50     59.90     39.19     38.58     30.72   12.19   18.63   18.93   23.77    3.48    5.32    5.40    6.79
australian      690   14        2   20     1066.03      315.30     61.44     50.24     45.75     39.53   17.35   21.22   23.30   26.97    5.13    6.28    6.89    7.98
australian      690   14        2   50     1457.10      429.37     80.89     65.74     57.98     43.33   18.01   22.16   25.13   33.63    5.31    6.53    7.41    9.91
australian      690   14        2  100     2934.11      847.62     96.00     74.44     73.11     58.88   30.56   39.42   40.13   49.83    8.83   11.39   11.59   14.40
australian      690   14        2  200     5827.40     1732.65    116.33     95.43     80.27     63.71   50.09   61.06   72.60   91.47   14.89   18.16   21.59   27.20
tic-tac-toe     958    9        2   10      965.21      330.60     56.42     45.23     45.04     34.37   17.11   21.34   21.43   28.08    5.86    7.31    7.34    9.62
tic-tac-toe     958    9        2   20     1531.98      448.00     69.74     55.14     47.40     37.71   21.97   27.78   32.32   40.63    6.42    8.12    9.45   11.88
tic-tac-toe     958    9        2   50     2163.93      613.91     81.08     75.21     55.35     41.55   26.69   28.77   39.10   52.08    7.57    8.16   11.09   14.78
tic-tac-toe     958    9        2  100     4267.61     1214.69    111.19     96.15     79.48     56.80   38.38   44.38   53.69   75.13   10.92   12.63   15.28   21.39
tic-tac-toe     958    9        2  200     8632.52     2500.98    149.43    124.30    112.10     80.52   57.77   69.45   77.01  107.21   16.74   20.12   22.31   31.06
vowel           990   13       11   10     1092.20      308.00     91.44     77.04     64.27     50.86   11.94   14.18   16.99   21.47    3.37    4.00    4.79    6.06
vowel           990   13       11   20     1774.42      487.40    105.03     91.60     77.13     60.26   16.89   19.37   23.01   29.45    4.64    5.32    6.32    8.09
vowel           990   13       11   50     2775.00      761.05    129.53    115.12     83.32     74.82   21.42   24.11   33.31   37.09    5.88    6.61    9.13   10.17
vowel           990   13       11  100     5336.31     1458.00    177.55    147.22    120.63     93.00   30.06   36.25   44.24   57.38    8.21    9.90   12.09   15.68
vowel           990   13       11  200    11068.93     3262.87    226.57    186.74    166.85    126.67   48.85   59.27   66.34   87.38   14.40   17.47   19.56   25.76
segment        2310   19        7   10     2413.54      722.40     98.25     84.09     74.05     65.31   24.57   28.70   32.59   36.96    7.35    8.59    9.76   11.06
segment        2310   19        7   20     4331.76     1215.70    109.08     95.81     84.60     75.00   39.71   45.21   51.20   57.76   11.15   12.69   14.37   16.21
segment        2310   19        7   50     7875.85     2117.47    159.91    123.28    121.69     97.20   49.25   63.89   64.72   81.03   13.24   17.18   17.40   21.78
segment        2310   19        7  100    14854.05     3962.32    254.74    189.59    173.38    143.77   58.31   78.35   85.67  103.32   15.55   20.90   22.85   27.56
segment        2310   19        7  200    28757.08     7630.10    323.03    256.17    253.75    204.90   89.02  112.26  113.33  140.35   23.62   29.79   30.07   37.24
mushroom       8124   22        2   10     6832.77     2074.40    139.68    117.82     98.64     85.93   48.92   57.99   69.27   79.52   14.85   17.61   21.03   24.14
mushroom       8124   22        2   20    13833.77     3723.90    168.15    141.73    123.30     97.64   82.27   97.61  112.20  141.68   22.15   26.27   30.20   38.14
mushroom       8124   22        2   50    28428.13     7141.74    306.78    230.51    180.89    127.86   92.67  123.33  157.16  222.34   23.28   30.98   39.48   55.86
mushroom       8124   22        2  100    61334.53    15581.70    529.06    367.34    271.82    191.78  115.93  166.97  225.64  319.82   29.45   42.42   57.32   81.25
mushroom       8124   22        2  200   114267.45    29596.69    683.56    447.32    432.45    301.36  167.17  255.45  264.23  379.17   43.30   66.16   68.44   98.21
kr-vs-kp      28056    6       17   10     4217.19     1320.30     84.32     74.57     72.06     65.63   50.01   56.55   58.52   64.26   15.66   17.71   18.32   20.12
kr-vs-kp      28056    6       17   20     8176.99     2394.70     94.56     89.26     74.59     68.88   86.47   91.61  109.63  118.71   25.32   26.83   32.10   34.77
kr-vs-kp      28056    6       17   50    15763.84     4179.39    153.95    130.91    104.72     90.87  102.40  120.42  150.53  173.48   27.15   31.93   39.91   45.99
kr-vs-kp      28056    6       17  100    30933.37     8058.31    258.55    168.16    163.66    117.25  119.64  183.95  189.01  263.82   31.17   47.92   49.24   68.73
kr-vs-kp      28056    6       17  200    64029.44    15899.56    314.43    212.62    205.04    162.71  203.64  301.14  312.28  393.52   50.57   74.78   77.54   97.72
connect-4     67557   42        3   10    83831.37    24665.15    993.69    813.14    777.72    701.95   84.36  103.10  107.79  119.43   24.82   30.33   31.71   35.14
connect-4     67557   42        3   20   157674.21    42737.95   1295.76   1001.26    862.85    752.98  121.68  157.48  182.74  209.40   32.98   42.68   49.53   56.76
connect-4     67557   42        3   50   426849.35   110240.25   2232.12   1522.36   1286.71    924.12  191.23  280.39  331.74  461.90   49.39   72.41   85.68  119.29
connect-4     67557   42        3  100   799502.73   212262.88   4058.37   2328.91   1959.17   1270.72  197.00  343.29  408.08  629.17   52.30   91.14  108.34  167.04
connect-4     67557   42        3  200  1620775.44   412462.96   6642.04   4049.85   2861.76   1941.36  244.02  400.21  566.36  834.87   62.10  101.85  144.13  212.46
fars         100968   29        8   10    98353.42    28265.81   1477.40   1160.99   1046.55    862.74   66.57   84.72   93.98  114.00   19.13   24.35   27.01   32.76
fars         100968   29        8   20   187677.23    51754.57   2100.94   1532.54   1238.43    939.49   89.33  122.46  151.54  199.77   24.63   33.77   41.79   55.09
fars         100968   29        8   50   420309.91   121558.74   3548.67   2278.66   1900.24   1403.68  118.44  184.45  221.19  299.43   34.25   53.35   63.97   86.60
fars         100968   29        8  100   862375.29   251519.89   7249.25   3725.75   2812.08   1927.82  118.96  231.46  306.67  447.33   34.70   67.51   89.44  130.47
fars         100968   29        8  200  1660939.99   491768.62  12350.12   6530.37   5192.10   3448.76  134.49  254.34  319.90  481.60   39.82   75.30   94.71  142.59
kddcup       494020   41       10   10   575386.93   162285.80  10726.71   9298.42   7784.85   5460.72   53.64   61.88   73.91  105.37   15.13   17.45   20.85   29.72
kddcup       494020   41       10   20  1149797.87   323639.32  13507.70   9626.74   8761.11   7001.57   85.12  119.44  131.24  164.22   23.96   33.62   36.94   46.22
kddcup       494020   41       10   50  2161848.69   775597.05  22450.51  15376.54  15054.24  10005.05   96.29  140.59  143.60  216.08   34.55   50.44   51.52   77.52
kddcup       494020   41       10  100  4543534.52  1667554.00  41050.45  23640.07  23473.92  17780.23  110.68  192.20  193.56  255.54   40.62   70.54   71.04   93.79
kddcup       494020   41       10  200  8664651.97  3283987.91  69195.66  37846.55  34379.03  25260.97  125.22  228.94  252.03  343.01   47.46   86.77   95.52  130.00
poker       1025010   10       10   10   430254.92   118820.52  11866.35   8317.33   6451.36   4065.00   36.26   51.73   66.69  105.84   10.01   14.29   18.42   29.23
poker       1025010   10       10   20   981221.13   245588.40  23469.48  16184.86   8143.93   6783.35   41.81   60.63  120.48  144.65   10.46   15.17   30.16   36.20
poker       1025010   10       10   50  2291678.70   591440.37  32074.47  20522.57  13233.55   9785.91   71.45  111.67  173.17  234.18   18.44   28.82   44.69   60.44
poker       1025010   10       10  100  4727147.28  1219990.65  53438.27  31803.56  21340.42  14043.62   88.46  148.64  221.51  336.60   22.83   38.36   57.17   86.87
poker       1025010   10       10  200  9721593.77  2508965.42  99071.59  55171.62  37227.41  24018.74   98.13  176.21  261.14  404.75   25.32   45.48   67.40  104.46
Nevertheless, the absolute time gain for such small data sets (a few ms) is of little practical importance, whereas it is significant for large data sets, where the execution time is reduced from hours to seconds. The GPU model is advisable not only for greatly reducing the execution time, but also for solving the problem with larger population sizes, which may achieve better results.
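The difference between relative speedup and absolute time gain can be made concrete with two rows of Table 6. The following minimal sketch uses the single-threaded CPU and 2 GTX 480 execution times reported there for Iris (#R = 10) and kddcup (#R = 200); the function and constant names are illustrative and do not belong to the original implementation.

```python
def speedup(reference_ms: float, accelerated_ms: float) -> float:
    """How many times faster the accelerated run is than the reference run."""
    return reference_ms / accelerated_ms


def time_saved_ms(reference_ms: float, accelerated_ms: float) -> float:
    """Absolute time gain in milliseconds."""
    return reference_ms - accelerated_ms


# (single-threaded CPU time, 2 GTX 480 time) in ms, as reported in Table 6.
IRIS_R10 = (136.42, 30.52)
KDDCUP_R200 = (8664651.97, 25260.97)

if __name__ == "__main__":
    for name, (cpu_ms, gpu_ms) in [("Iris", IRIS_R10), ("kddcup", KDDCUP_R200)]:
        print(f"{name}: speedup {speedup(cpu_ms, gpu_ms):.2f}x, "
              f"time saved {time_saved_ms(cpu_ms, gpu_ms) / 1000:.2f} s")
```

For Iris the gain is roughly a tenth of a second, whereas for kddcup the run drops from about 2.4 hours to about 25 seconds.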
7. Concluding Remarks
In this paper we have analysed the performance of the MOGBAP algorithm for classification and proposed a parallel GPU-based implementation to speed up the evaluation phase, which has been shown to account for most of the execution time of the algorithm, especially on large-scale data sets. The creation of the ants in each generation, the multi-objective strategy, and the niching approach for selecting the rules that make up the final classifier have been parallelized using multi-threading rather than on the GPU, because of their internal sequential requirements. The experimental study has analysed the performance of the GPU-based RPN interpreter for the mined rules in terms of the GPops interpreted per second; the performance of the evaluation model when varying the population size, the length of the rules, and the dimensionality of the data set; and its performance on some well-known data sets from the UCI repository. The experimental results demonstrate the efficiency and scalability of the GPU model across different GPU devices and data set dimensions, achieving an interpreter performance of up to 10 billion GPops/s and an evaluation speedup of up to 834× vs a single-threaded CPU and 212× vs a 4-threaded CPU. Furthermore, the parallelization extends the application range of the algorithm to domains whose data sets comprise a huge number of instances and attributes, where applying the MOGBAP algorithm was extremely difficult until now.
Acknowledgments
This work was supported by the Regional Government of Andalusia and the Ministry of Science and Technology, projects P08-TIC-3720 and TIN-2011-22408, and FEDER funds. This research was also supported by the Spanish Ministry of Education under the FPU grant AP2010-0042.
References
[1] NVIDIA CUDA Programming and Best Practices Guide, 2012.
28
[2] E. Alba, M. Tomassini, Parallelism and evolutionary algorithms, IEEETransactions on Evolutionary Computation 6 (2002) 443–462.
[3] D. Angus, C. Woodward, Multiple objective ant colony optimisation, SwarmIntelligence 3 (2009) 69–85.
[4] H. Bai, D. OuYang, X. Li, L. He, H. Yu, Max–Min ant system on GPU withCUDA, in: International Conference on Innovative Computing, Informationand Control (ICICIC) (2009), pp. 801–804.
[5] A.R. Baig, W. Shahzad, S. Khan, F. Altaf, ACO based discovery of compre-hensible and accurate rules from medical datasets, International Journal ofInnovative Computing, Information and Control 7 (2011) 6147–6159.
[6] W. Banzhaf, S. Harding, W.B. Langdon, G. Wilson, Accelerating GeneticProgramming through Graphics Processing Units, in: Genetic ProgrammingTheory and Practice VI, Genetic and Evolutionary Computation (2009), pp.1–19.
[7] A. Cano, A. Zafra, S. Ventura, Speeding up the evaluation phase of GP clas-sification algorithms on GPUs, Soft Computing 16 (2012) 187–202.
[8] J.M. Cecilia, J.M. Garcia, A. Nisbet, M. Amos, M. Ujaldon, Enhancing dataparallelism for ant colony optimisation on GPUs, Journal of Parallel andDistributed Computing (In press).
[9] J.M. Cecilia, J.M. Garcia, M. Ujaldon, A. Nisbet, M. Amos, Parallelizationstrategies for ant colony optimisation on GPUs, in: Parallel and DistributedProcessing Workshops and Phd Forum (2011), pp. 339–346.
[10] S. Che, M. Boyer, J. Meng, D. Tarjan, J.W. Sheaffer, K. Skadron, A per-formance study of general-purpose applications on graphics processors us-ing CUDA, Journal of Parallel and Distributed Computing 68 (2008) 1370–1380.
[11] Y. Chen, L. Chen, L. Tu, Parallel ant colony algorithm for mining classifi-cation rules, IEEE International Conference on Granular Computing (2006)85–90.
[12] J. Chintalapati, A. Maan, S. Priyanka, N. Mangala, V. Jayaraman, Parallelant-miner (PAM) on high performance clusters, in: Swarm Evolutionary andMemetic Computing, LNCS (2010), pp. 270–277.
29
[13] A. Delevacq, P. Delisle, M. Gravel, M. Krajecki, Parallel ant colony opti-mization on graphics processing units, Journal of Parallel and DistributedComputing (2012) In press.
[14] K.F. Doerner, R.F. Hartl, S. Benkner, M. Lucka, Parallel cooperative sav-ings based ant colony optimization—Multiple search and decomposition ap-proaches, Parallel Processing Letters 16 (2006) 351–370.
[15] K.L. Fok, T.T. Wong, M.L. Wong, Evolutionary computing on consumergraphics hardware, IEEE Intelligent Systems 22 (2007) 69–78.
[16] M.A. Franco, N. Krasnogor, J. Bacardit, Speeding up the evaluation of evo-lutionary learning systems using GPGPUs, in: Genetic and EvolutionaryComputation Conference (GECCO) (2010), pp. 1039–1046.
[17] J. Fu, L. Lei, G. Zhou, A parallel ant colony optimization algorithm withgpu-acceleration based on all-in-roulette selection, in: International Work-shop on Advanced Computational Intelligence (IWACI) (2010), pp. 260 –264.
[18] M.F. Ganji, M.S. Abadeh, A fuzzy classification system based on ant colonyoptimization for diabetes disease diagnosis, Expert Systems with Applica-tions 38 (2011) 14650–14659.
[19] M. Garland, S. Le Grand, J. Nickolls, J. Anderson, J. Hardwick, S. Mor-ton, E. Phillips, Y. Zhang, V. Volkov, Parallel computing experiences withCUDA, IEEE Micro 28 (2008) 13–27.
[20] S.U. Guan, F. Zhu, An incremental approach to genetic-algorithms-basedclassification, IEEE Transactions on Systems, Man, and Cybernetics, PartB: Cybernetics 35 (2005) 227–239.
[21] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kauff-man, 2006.
[22] S. Haykin, Neural Networks and Learning Machines, Pearson, 3rd edition,2009.
[23] H.J. Huang, C.N. Hsu, Bayesian classification for data from the same un-known class, IEEE Transactions on Systems, Man, and Cybernetics, Part B:Cybernetics 32 (2002) 137–145.
30
[24] T.M. Huang, V. Kecman, I. Kopriva, Support vector machines in classifi-cation and regression - an introduction, in: Kernel Based Algorithms forMining Huge Data Sets: Supervised, Semi-supervised, and UnsupervisedLearning (Studies in Computational Intelligence), volume 177, 2005, pp.1–47.
[25] W. Hwu, Illinois ECE 498AL: Programming Massively Parallel Processors,Lecture 13: Reductions and their Implementation, 2009.
[26] A.K. Jain, R.P. Duin, J. Mao, Statistical pattern recognition: A review, IEEETransactions on Pattern Analysis and Machine Intelligence 22 (2000) 4–37.
[27] B. Jang, D. Schaa, P. Mistry, D. Kaeli, Exploiting memory access patterns toimprove memory performance in data-parallel architectures, IEEE Transac-tions on Parallel and Distributed Systems 23 (2011) 105–118.
[28] L. Jian, C. Wang, Y. Liu, S. Liang, W. Yi, Y. Shi, Parallel data mining tech-niques on Graphics Processing Unit with Compute Unified Device Architec-ture (CUDA), The Journal of Supercomputing (2011) In press.
[29] W. Jiening, D. Jiankang, Z. Chunfeng, Implementation of ant colony algo-rithm based on GPU, in: Computer Graphics, Imaging and Visualization(2009), pp. 50–53.
[30] R. Jovanovic, M. Tuba, D. Simian, Comparison of different topologies forisland-based multi-colony ant algorithms for the minimum weight vertexcover problem, WSEAS Transactions on Computers 9 (2010) 83–92.
[31] S.B. Kotsiantis, I.D. Zaharakis, P.E. Pintelas, Machine learning: A review ofclassification and combining techniques, Artificial Intelligence Reviews 26(2006) 159–190.
[32] S.R. Kulkarni, G. Lugosi, S.S. Venkatesh, Learning pattern classification—asurvey, IEEE Transactions on Information Theory 44 (1998) 2178–2206.
[33] V. Kumar, A. Gupta, Analyzing scalability of parallel algorithms and archi-tectures, Journal of Parallel and Distributed Computing 22 (1994) 379–391.
[34] W.B. Langdon, A many threaded CUDA interpreter for genetic program-ming, Lecture Notes in Computer Science, in: European Conference on Ge-netic Programming (EuroGP), volume 6021 of LNCS (2010), pp. 146–158.
31
[35] W.B. Langdon, Graphics processing units and genetic programming: Anoverview, Soft Computing 15 (2011) 1657–1669.
[36] D.T. Larose, Discovering Knowledge in Data: An Introduction to Data Min-ing, Wiley, 2005.
[37] A. Leist, D. Playne, K. Hawick, Exploiting graphical processing units fordata-parallel scientific applications, Concurrency Computation Practice andExperience 21 (2009) 2400–2437.
[38] G. Luque, E. Alba, Parallel genetic algorithms: Theory and realworld appli-cations, Studies in Computational Intelligence 367 (2011) 1–183.
[39] D. Martens, M. De Backer, J. Vanthienen, M. Snoeck, B. Baesens, Classi-fication with ant colony optimization, IEEE Transactions on EvolutionaryComputation 11 (2007) 651–665.
[40] I. Michelakos, N. Mallios, E. Papageorgiou, M. Vassilakopoulos, Ant colonyoptimization and data mining, Studies in Computational Intelligence 352(2011) 31–60.
[41] S. Nesmachnow, F. Luna, E. Alba, Time analysis of standard evolutionaryalgorithms as software programs, in: International Conference on IntelligentSystems Design and Applications, ISDA (2011), pp. 271–276.
[42] D.J. Newman, A. Asuncion, UCI machine learning repository, 2007.
[43] J.L. Olmo, J.R. Romero, S. Ventura, Using ant programming guided bygrammar for building rule-based classifiers, IEEE Transactions on Systems,Man, and Cybernetics, Part B: Cybernetics 41 (2011) 1585–1599.
[44] J.L. Olmo, J.R. Romero, S. Ventura, Classification rule mining using ant pro-gramming guided by grammar with multiple Pareto fronts, Soft Computing16 (2012) 2143–2163.
[45] S.N. Omkar, R. Karanth, Rule extraction for classification of acoustic emis-sion signals using ant colony optimisation, Engineering Applications of Ar-tificial Intelligence 21 (2008) 1381–1388.
[46] J.D. Owens, M. Houston, D. Luebke, S. Green, J.E. Stone, J.C. Phillips,GPU Computing, Proceedings of the IEEE 96 (2008) 879–899.
32
[47] J.D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krger, A.E. Lefohn,T.J. Purcell, A survey of general-purpose computation on graphics hardware,Computer Graphics Forum 26 (2007) 80–113.
[48] R. Parpinelli, A.A. Freitas, H.S. Lopes, Data mining with an ant colonyoptimization algorithm, IEEE Transactions on Evolutionary Computation 6(2002) 321–332.
[49] M. Pedemonte, H. Cancela, A cellular ant colony optimisation for the gen-eralised steiner problem, International Journal of Innovative Computing andApplications 2 (2010) 188–201.
[50] M. Pedemonte, S. Nesmachnow, H. Cancela, A survey on parallel ant colonyoptimization, Applied Soft Computing 11 (2011) 5181–5197.
[51] M. Randall, A. Lewis, A parallel implementation of ant colony optimization,Journal of Parallel and Distributed Computing 62 (2002) 1421–1432.
[52] O. Roozmand, K. Zamanifar, Parallel Ant Miner 2, in: International Confer-ence on Artificial Intelligence and Soft Computing (ICAISC), volume 5097of LNCS (2008), pp. 681–692.
[53] K.C. Tan, Q. Yu, C.M. Heng, T.H. Lee, Evolutionary computing for knowl-edge discovery in medical diagnosis, Artificial Intelligence in Medicine 27(2003) 129–154.
[54] D. Tarditi, S. Puri, J. Oglesby, Accelerator: using data parallelism to pro-gram gpus for general-purpose uses, in: Proceedings of the 12th interna-tional conference on Architectural (2006), pp. 325–335.
[55] R.M. Weiss, GPU-accelerated ant colony optimization, in: W. mei W. Hwu(Ed.), GPU Computing Gems, Morgan Kaufmann, 2011, pp. 325–340.
[56] M.L. Wong, K.S. Leung, Data Mining Using Grammar-Based Genetic Pro-gramming and Applications, Kluwer Academic Publishers, Norwell, MA,USA, 2000.
[57] W. Zhu, J. Curry, Parallel ant colony for nonlinear function optimizationwith graphics hardware acceleration, in: IEEE International Conference onSystems, Man and Cybernetics (IEEE SMC) (2009), pp. 1803 –1808.
33