Original Article

The International Journal of High Performance Computing Applications 1–24
© The Author(s) 2014
Reprints and permissions: sagepub.co.uk/journalsPermissions.nav
DOI: 10.1177/1094342014532965
hpc.sagepub.com

Applying semi-synchronised task farming to large-scale computer vision problems

Steven McDonagh, Cigdem Beyan, Phoenix X Huang and Robert B Fisher

Abstract
Distributed compute clusters allow the computing power of heterogeneous (and homogeneous) resources to be utilised to solve large-scale science and engineering problems. One class of problem that has attractive scalability properties, and is therefore often implemented using compute clusters, is task farming (or parameter sweep) applications. A typical characteristic of such applications is that no communication is needed between distributed subtasks during the overall computation. However, interesting large-scale task farming problem instances that do require global communication between subtask sets also exist. We propose a framework called semi-synchronised task farming in order to address problems requiring distributed formulations containing subtasks that alternate between independence and synchronisation. We apply this framework to several large-scale contemporary computer vision problems and present a detailed performance analysis to demonstrate framework scalability.

Semi-synchronised task farming splits a given problem into a number of stages. Each stage involves distributing independent subtasks to be completed in parallel and then making a set of synchronised global operations, based on information retrieved from the distributed results. The results influence the following subtask distribution stage. This subtask distribution followed by result collation process is iterated until overall problem solutions are obtained. We construct a simplified Bulk Synchronous Parallel (BSP) model to formalise this framework and, with this formalisation, we develop a predictive model for overall task completion time. We present experimental benchmark results comparing the performance observed by applying our framework to solve real-world problems on compute clusters with that of solving the tasks in a serial fashion. Furthermore, by assessing the predicted time savings that our framework provides in simulation and validating these predictions on a range of complex problems drawn from real-world computer vision tasks, we are able to reliably predict the performance gain obtained when using a compute cluster to tackle resource-intensive computer vision tasks.

Keywords
Vision, task farming, semi-synchronised

1 Introduction

Many computational tasks that employ serial code are limited by the total CPU time that they require to execute. When the individual tasks that make up an overall computation are independent of each other it is possible to run them simultaneously (in parallel) on different processors. Using this approach has the potential to greatly reduce the wall-clock time (real-world time elapsed from process start to completion) needed to obtain scientific results. Distributing separate runs of the same code while varying model parameters or input data in this way is known as task farming and has been the focus of much work in both cluster and grid computing (Silva et al., 1993; Casanova et al., 1999, 2000). Trivial task farming is a common form of parallelism and relies on the ability to decompose a problem into a number of nearly identical yet independent tasks. Each processor (independent node) runs a local copy of the serial code, often with its own input and output files, and no communication is required between these processes. This form of task farming is well suited to exploring large parameter spaces or large independent data sets.

University of Edinburgh, UK

Corresponding author:

Steven McDonagh, 10 Crichton Street, Edinburgh, EH8 9AB, United Kingdom.
Email: [email protected]


On the assumption that all tasks take a similar amount of time to complete, there are no load imbalance issues and linear scaling can often be achieved in relation to the number of processors employed.

Many interesting problems do however require some level of communication between tasks during distributed execution. In this work we develop a framework to enable semi-synchronised task farming in which an overall computation involves distributing many sets of parallel tasks such that all tasks within a set are independent, yet these tasks must finish before a following task set is able to begin execution. Taking into account a level of communication between tasks has been approached previously with a focus on (e.g.) the scheduling aspects of aperiodically arriving non-independent tasks (Abdelzaher et al., 2004), data staging effects on wide area task farming (Elwasif et al., 2001) and cost-time optimisations of task scheduling (Buyya et al., 2002). Given that we propose to handle global communication between task sets with a post task set completion synchronisation step after a round of concurrent computation, components of the Bulk Synchronous Parallel (BSP) model are a suitable basis for our framework. The BSP model is a bridging model originally proposed by Valiant (1990) and further detail of how to realise our framework and hybrid time prediction model is provided in Section 3.

Numerical algorithms can often be implemented using either task or data parallelism (Hillis et al., 1986; Foster, 1994). Task farming algorithms can be considered a simple subset of task parallel methods that break a problem down into individual segments, such that each problem segment can be solved independently and synchronously on separate compute nodes. The task parallel model typically requires little inter-node communication. Data parallel models conversely share large data sets among multiple compute nodes and then perform similar operations independently on the participating nodes for each element of the data array. Data parallelism therefore typically requires that each processor performs the same task on different pieces of the distributed data. In this way, HPC data parallelism often results in additional communication overhead between nodes and requires high bandwidth and low latency node connectivity. In practice most real parallel computations fall somewhere on a spectrum between task and data parallelism. This is also true of the task farming framework that we introduce (see Section 3).

Computer vision, like many fields, contains algorithms that are challenged by the size of the data sets worked with, the number of parameters that must be estimated or the requirement of highly accurate results. These requirements often result in computationally expensive algorithms that demand time consuming batch processing. One efficient solution for accelerating these processes involves executing algorithms on a cluster of machines rather than on a single compute node or workstation. Our semi-synchronised task farming framework provides a simple form of parallel computation that is able to reduce the wall-clock time required by such computationally expensive tasks that might otherwise take several hours, days or even weeks on a single workstation.

Here we choose computer vision applications as the test bed for our framework. Once an algorithm has been formulated under our framework, we use simple performance modelling to accurately predict overall computation time and therefore the likely speed-up made possible by employing a distributed implementation over a serial approach. In this way we provide a framework that enables straightforward task distribution for problems comprised of many individual tasks that likely require communication upon completion, coupled with a modelling process capable of predicting the available speed benefit of instantiating the distributed implementation.

Our contributions in this paper can be summarised as follows:

• We introduce a simple framework for non-independent task farming based on the BSP model (Valiant, 1990). The framework allows us to formulate problems by dividing them into many independent parallel tasks that also require some level of communication and synchronisation between tasks before an overall solution to the problem can be obtained.

• As part of this framework we develop a computation-time model capable of predicting overall application completion time for problems that are formulated using the task farming framework that we introduce. Providing this simple tool affords a method to reliably predict time requirements and evaluate computation-time and solution-quality trade-offs prior to runtime.

• We apply our semi-synchronised task farming framework to three contemporary computer vision problems and report on our experiences of implementing distributed solutions to these problems, and explore predicted and experimental speed-up available when deploying these implementations on an HPC cluster.

The HPC system that we make use of experimentally is described in Section 3.1. We outline our task farming framework and relate it to the BSP model in Section 3.3. We then introduce performance modelling techniques to facilitate predictions about computational time required for problems formulated under our framework in the remaining part of Section 3. Results from simulation experiments that verify our predictive model are given in Section 4.



Section 5 details the results of implementing several real-world computer vision applications under our framework and these are compared to sequential implementations of the equivalent problems. Finally, Section 6 provides some discussion.

2 Related work

The task farming model of high-level parallelism has been the basis for much HPC cluster based work, with recent examples utilising HTCondor (Thain et al., 2005), Google's MapReduce (Dean and Ghemawat, 2008) and Microsoft's Dryad (Isard et al., 2007). The HTCondor framework is able to harness idle cycles from both a network of non-dedicated desktop workstation nodes (cycle scavenging) and dedicated rack-mounted clusters. The framework then employs these cycles to run coarse-grained distributed parallelisation of computationally intensive tasks. Task farming is also common in data centres; for example, MapReduce and Dryad both make use of task farming to schedule parallel processing on large terabyte-scale datasets. In systems such as these a master process manages the queue of tasks and distributes these tasks amongst the collection of available worker processors. The master process is typically also responsible for handling load balancing and worker node failure. In the current work, master and worker node interaction is handled by the Sun Grid Engine (SGE) (Gentzsch, 2001) using a batch queue system similar to the Condor framework. This queueing system is responsible for accepting, scheduling and managing the distributed execution of our parallel tasks. This approach allows the distribution of arbitrary tasks as there is no requirement for a specialised API. Using SGE to manage our task queueing system allows our developers to concentrate on the image processing aspects of the problems that we investigate.

Using the SGE environment, jobs typically request no interaction during execution unless they contain the integrated ability to find their interaction partners from their dynamically assigned worker node. The semi-synchronised task farming model that we build on top of the SGE layer respects this, such that only after a set of tasks has completed are results collated to make decisions regarding the distribution and form of the following set of tasks. In standard task farming, when a worker node completes a task it will request another from the master node, and our framework also does this until all tasks in a task set are processed. Once all tasks in a task set are finished, the results are collated before the following set of tasks is defined and distributed. In comparison to standard task farming, many task sets likely contribute to a single overall computation under our framework.

Dedicated parallel computer architecture has also been employed to develop computer vision systems. In Revenga et al. (2003), a Beowulf architecture dedicated to real-time processing of video streams for embedded vision systems is proposed and evaluated. The parallel programming model made use of is based on algorithmic skeletons (Cole, 1991). Skeletons are higher-order program constructs that encapsulate common and recurring forms of parallelism to make them available to application developers. Skeleton-based parallel programming methodology offers a partially automated procedure for designing and implementing parallel applications for a specific domain such as image processing. An application developer provides a skeletal parallel program description, such as a task farm, and a set of application specific sequential functions to instantiate the skeleton. The system then makes use of a suite of tools that turn these descriptions into executable parallel code. The system in Revenga et al. (2003) was tested by implementing simple image processing algorithms such as a convolution mask and Sobel filter.

In comparison to classical HPC applications, embedded computer vision on dedicated parallel machines will often be able to offer advantages such as mobile, real-time performance, yet places demands on programmers if no high-level parallel programming models or environments are available, such as skeletons or the SGE that we make use of in this work (see Section 3.1 for further details). If these tools are not available then programmers must explicitly take into account all low-level aspects of parallelism such as task partitioning, data distribution, inter-node communication and load balancing. If developer expertise lies in (for example) image processing, rather than parallel programming, then accounting for these low-level considerations likely results in long and error-prone development cycles.

In contrast to Revenga et al. (2003), here we perform task farming as opposed to low-level data parallelism involving geometric partitioning of images for image processing tasks. This results in a coarser level of abstraction that we apply to higher-level computer vision problems involving much larger data sets where we do not regard real-time performance as a critical factor. It is for this reason that we consider the BSP model a good basis for our framework. The original BSP model considers computation and communication at the level of the entire program. The BSP model is able to achieve this abstraction by renouncing locality as a performance optimisation (Skillicorn et al., 1997). This in turn simplifies many aspects of algorithm design and implementation and does not adversely affect performance for most application domains. Low-level image processing however is an example domain for which locality might be critical, so a BSP based framework is likely not the best choice.

Parallel and distributed computing systems are designed with performance in mind, and significant previous work has been carried out developing approaches for performance modelling and prediction of applications running on HPC systems. In addition to the BSP inspired framework that we build on top of the SGE layer, we also formulate application performance modelling allowing us to predict the run time performance of the parallel algorithms implemented with our framework. Application performance modelling involves assessing application performance through system modelling and is an established field (Hammond et al., 2010). Several examples of where this approach has proven advantageous include: input and code optimisation (Mudalige et al., 2008), efficient scheduling (Spooner et al., 2003) and post-installation performance verification (Kerbyson et al., 2005). The process of modelling itself can be generalised to three basic approaches: modelling based on analytic (mathematical) methods (e.g. LoPC (Frank et al., 1997)), modelling based on tool support and simulation (e.g. DIMEMAS (Labarta et al., 1997), PACE (Nudd et al., 2000)), and a hybrid approach which uses elements of both (e.g. POEMS (Adve et al., 2000), Performance Prophet (Pllana and Fahringer, 2005)). In this work we also choose a hybrid approach and combine basic analytical modelling inherited from the BSP model with traditional code profiling; details of our performance modelling approach are provided in the following section.

3 Semi-synchronised task farming

3.1 HPC experimental implementation

In this work we make use of the Edinburgh Compute and Data Facility (ECDF; see http://www.wiki.ed.ac.uk/display/ecdfwiki/) to test the parallel implementations of the computer vision problems that we investigate. The ECDF is a Linux compute cluster that comprises 130 IBM iDataPlex servers; each server node has two Intel Westmere quad-core processors sharing 24 GB of memory. The system uses SGE as a batch queueing system. By tackling computer vision problems through parallel computation with SGE we show that increasing the number of participating processors reduces the wall-clock time required for algorithms implemented under our semi-synchronised task farming framework (see Section 5 for experimental details). All algorithms are implemented in MATLAB and computation times are recorded using the built-in MATLAB command cputime. We report on the savings due to application speed-up in terms of reduced execution time when running our parallel implementations using many processors compared to employing sequential implementations to perform the same tasks. Our parallel implementations make use of the Distributed Computing Engine (DCE) and Distributed Computing Toolbox (DCT) from MathWorks. These products offer a user-friendly method of parallel programming such that master–slave communication between cluster machines is hidden from the developer, allowing them to focus on domain specific aspects of each problem. Our task farming framework is language independent and we concede that problem instance wall-clock times can likely be reduced further by making use of (e.g.) an alternative compiled language. However, the primary focus of the current work is to provide evidence that the proposed framework is able to formulate problems consistently and reduce wall-clock times predictably, compared to the related serial implementations, regardless of the language used. We leave a study of time critical applications benefiting from (e.g.) compiled languages like C/C++ to future work.

3.2 The Bulk Synchronous Parallel model

The BSP model is a bridging model originally proposed in Valiant (1990). It is a style of parallel programming developed for general purpose parallelism, that is, parallelism across all application areas and a wide range of architectures (McColl, 1993). Intended to be employed for distributed-memory computing, the original model assumes a BSP machine consists of p identical processors. The related semi-synchronised farming framework we propose (Section 3.3) does not strictly enforce a homogeneous resource requirement in comparison. This enables our experimental setup, using IBM iDataPlex servers, to contain similar but not necessarily identical nodes. In accordance with the original BSP model we do assume homogeneous resources during our theoretical performance modelling for simplicity, and we therefore leave a heterogeneous performance modelling treatment to future work. In the original BSP model, each processor has access to its own local memory and processors can typically communicate with each other through an all-to-all network. In this work we make the simplifying assumption that processes only contribute information to a global decision making process at the end of each set of tasks and therefore do not need to communicate with each other directly. A BSP algorithm consists of an arbitrary number of supersteps. During supersteps, no communication between processors may occur and all processes, upon completing their current task, must then wait at a barrier. Once all processes complete their current task a barrier synchronisation step occurs and then the next round of tasks (superstep) can begin. In this fashion a BSP computation proceeds in a series of global supersteps and we utilise these supersteps to model sets of parallel distributed tasks in our framework. To summarise, a superstep typically consists of three components:

1. Concurrent computation: Computation takes place on each of the participating processors p. Processors only make use of data stored in the local processor memory. Here we call each independent process a task. These tasks occur asynchronously of each other.



2. Communication: Processors exchange data between each other. Our framework makes the simplifying assumption that tasks do not need to exchange data with each other individually, yet the result of each local computation contributes to the following barrier synchronisation step (global decision making). This assumption holds for each computer vision application that we investigate (see Section 5).

3. Barrier synchronisation: When each task reaches this point (the barrier), it must wait until all other tasks have finished their required processing. Once all tasks have completed, we make a set of global decisions before the next superstep may begin (the next round of concurrent computation and so on).

3.3 Proposed task farming framework

As noted, our framework involves global communication between task sets during a post task-set-completion synchronisation step following a round of concurrent computation. The components and fundamental properties of the BSP model provide a suitable basis for this framework. Namely, moving from a sequential implementation to describing the use of parallelism with a BSP model requires only a bare minimum of extra information to be supplied. BSP models are also independent of target architecture, making a task farming framework based on BSP portable between distributed architectures. Finally, the performance of a program distributed using a BSP based framework is predictable if a few simple parameters from the target program can be provided (e.g. task-length distribution parameters). This leads to a hybrid performance modelling technique capable of predicting the runtime of algorithms implemented with our framework.

We solve large scale problems by sharing large data sets among multiple processors, yet the semi-synchronised task farming framework, in consonance with a task parallelism model, involves only little inter-node communication between tasks running in parallel. However, similar to data parallelism models, the framework allows us to split these large data sets between compute nodes and perform independent calculations on participating processors in parallel. As the calculations within each task are independent, no information needs to be exchanged between nodes during task runtime and sharing of results is postponed until all tasks in a set have completed. As discussed, once a set of tasks has been completed we are able to collate results and use this information to make decisions relating to how the following round of tasks should be formulated. The outputs from the final round of tasks are combined to provide the global program output. This framework is formally defined in the following pseudocode and Figure 1 depicts the process in diagrammatic form.

Let:
    {I_i^[t] : i = 1..N_t}  be the set of N_t input tasks at superstep t
    {O_i^[t] : i = 1..N_t}  be the set of N_t outputs gained from the tasks completed at superstep t

Input: N_0 tasks at superstep t = 0

terminate := 0
begin
    while (NOT terminate)
        parallel for i = 1 to N_t
            O_i^[t] = process(I_i^[t])
        end
        {I_i^[t+1] : i = 1..N_{t+1}} = recompute_inputs({I_i^[t]}, {O_i^[t]})
        terminate = test_termination_criteria({O_i^[t]})
        t = t + 1
    end
    last = t
    R = combine_outputs({O_i^[last] : i = 1..N_last})
end

Output: R

Figure 1. Our semi-synchronised task farming framework. Light grey superstep nodes indicate task synchronisation and collective global decisions based on information obtained from the previous set of distributed tasks. These decision points influence the input data, form (and possibly the number) of the following set of distributed tasks. Each task in a task set is distributed to an individual processor. The distributed tasks following each superstep are not regarded as having a particular linear order (from left to right or otherwise) and may be mapped to processors in any way.
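For illustration, the control flow above can be sketched in a few lines of Python. This is a minimal sketch only, not the MATLAB/SGE implementation used in our experiments; the process_task, recompute_inputs, test_termination and combine_outputs callables are application-specific placeholders.

    from concurrent.futures import ProcessPoolExecutor

    def task_farm(initial_inputs, process_task, recompute_inputs,
                  test_termination, combine_outputs, workers=20):
        # Semi-synchronised task farming: rounds of independent parallel tasks
        # (supersteps) separated by global synchronisation points.
        inputs = initial_inputs
        with ProcessPoolExecutor(max_workers=workers) as pool:
            while True:
                # Concurrent computation: every task in the current set is independent.
                outputs = list(pool.map(process_task, inputs))
                # Barrier synchronisation: map() returns only once all tasks in the
                # set have completed; results are then collated globally.
                if test_termination(outputs):
                    return combine_outputs(outputs)
                # The collated results influence the form (and possibly the number)
                # of tasks in the following superstep.
                inputs = recompute_inputs(inputs, outputs)

In this sketch pool.map plays the role of the barrier: the following superstep is only formulated once every task in the current set has returned.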



Adding the BSP synchronisation step between task sets gives all tasks in a set the opportunity to collate and communicate information resulting from the completion of their collective execution. The collective results of a task set can influence decisions involving the form, model parameters and possibly the number of tasks making up the following task set input. Once formulated, the following set of tasks can be distributed to the participating processors. It is this process of dispatching multiple rounds of parallel independent tasks, where task formulation may be influenced by information from previous task set results, that we call semi-synchronised task farming. This approach allows us to find distributed solutions to non-trivial problems that require a level of communication between nodes during overall computation while retaining much of the simplicity of the standard task farming model. If all tasks within a task set take a similar amount of time to complete then this allows for simple modelling and task distribution. If however tasks exhibit completion times with high variance, then a smart scheduler (such as SGE) can still be used efficiently to ensure that load balancing is not problematic for our framework. The wall-clock time, now related to both the number of task sets and the number of available processors, is much improved over serial implementations.

The synchronisation aspect allows us to solve problem decompositions that require a level of inter-node communication while retaining the main advantages of a standard task farming approach, such as ease of implementation, level of achievable efficiency (on the assumption that individual tasks in a set require similar time to complete) and, given that existing serial code can often be used with minimal modification, that users can produce solutions without requiring detailed knowledge of (e.g.) MPI techniques. We do however note that if tasks take widely different amounts of execution time then the total wall-clock time of a task set is governed by the slowest process.

3.4 Simulation and analytical hybrid performance modelling

We undertake simple performance modelling to evaluate the distributed job submission behaviour on a CPU cluster, allowing prediction of the run time performance of algorithms realised with our framework. Performance modelling of distributed systems enables an understanding of code and machine behaviour and can be broadly split into two categories: analytical modelling and simulation based techniques. As previously mentioned, analytical models are typically developed through the manual inspection of source code and subsequent formulation of critical path execution time. This approach usually involves the implementation of a modelling framework (e.g. LoPC (Frank et al., 1997)) to reduce the work required by the performance modeller. Analytical approaches are effective yet often require manual analysis of source code, necessitating knowledge of the task domain, implementation languages and communication paradigms.

Here we follow a coarse grained alternative approach of simulation based performance modelling. Many simulation tools exist to support this form of performance modelling (e.g. the DIMEMAS project (Labarta et al., 1997)). Such tools often involve replaying the code being modelled instruction-by-instruction, and the related use of machine resources can then be gathered by the simulator. More recent work such as the WARPP toolkit (Hammond et al., 2009, 2010) makes use of larger computational events (as opposed to instruction based simulation), improving simulator scalability. Here we take a similar approach; instead of using single application instructions we model coarse grained computational blocks. We choose a coarse level of granularity by defining a computational block as one distributed task in our framework. We then obtain run times for these computational blocks through traditional code profiling. An additional advantage of this coarse-grained simulation is that hybrid models (combining analytical and simulation-based approaches) can be built. By combining these coarse-grained computational events with an analytical model typical of the Bulk Synchronous Parallel (BSP) model (Valiant, 1990) we obtain a straightforward hybrid model capable of predicting application run-time for the algorithms that we implement using our task farming framework.

3.5 BSP cost in relation to task farming

The cost of an algorithm represented by the BSP model is defined as follows. The cost of each superstep is determined by the sum of three terms: the cost of the longest running local task w_i, the global communication cost g per message between processors, where the number of messages sent or received by task i is h_i, and the cost of the barrier synchronisation at the end of each superstep, l (which may be negligible and therefore the term is dropped).

The cost of one superstep for p processors is therefore:

$$\max_{i=1}^{p}(w_i) + \max_{i=1}^{p}(h_i g) + l \qquad (1)$$

We make standard simplifying assumptions that we have homogeneous processors and that tasks do not need to exchange data with each other individually or with the master node during each superstep, thus ensuring that h_i = 0 for all i. We assume homogeneous processors for simplicity during our cost treatment but note that, in the current landscape of computation, heterogeneous resources are also common. Although our framework is applicable to heterogeneous resources in practice, we leave a theoretical treatment of heterogeneous processor cost to future work (see Section 4 for related discussion of this point). It is common for equation (1) to be written as $w + hg + l$ where w and h are maxima, and with our simplification this reduces further to $w + l$. The cost of the algorithm is then the sum of the costs of each superstep, where S is the number of supersteps required.

$$W + Hg + Sl = \sum_{s=0}^{S} w_s + 0 + Sl \qquad (2)$$
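As a brief worked illustration of equations (1) and (2), with numbers invented purely for exposition: a single superstep on p = 4 processors with local task costs w_1, ..., w_4 = 10, 12, 9, 11 time units, no inter-task messages (h_i = 0) and a barrier cost l = 2 costs $\max_i(w_i) + l = 12 + 2 = 14$ units; an algorithm composed of S supersteps of this kind would therefore cost on the order of 14S units.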

3.6 Our hybrid BSP simulation

We simulate total parallel algorithm execution times by firstly generating random trials to simulate individual distributed task timings. To simulate a real-world task set, we generate trials from a Gaussian distribution parametrised by the mean time required in practice for a single distributed task to complete and add these to the time cost of barrier synchronisation. Task timing distribution parameters are found through code profiling and making use of the MATLAB function cputime. We assert that this is a reasonable method to simulate task timings as the task farming applications that we investigate all distribute sets of similar length tasks during each superstep. By specifying or observing the number of supersteps required for a given real-world computation and the number of distributed tasks required in each superstep, we are able to approximate the total time required by the parallel algorithm as:

$$\sum_{s=0}^{S} w_s + Sl \qquad (3)$$

where w_s is the longest running local task in superstep s, barrier synchronisation time cost is l and the total number of supersteps is S. In practice we run this simulation over many trials and look at the mean result for an algorithm that requires N_s distributed tasks during each superstep.
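The simulation just described is simple to reproduce. The following Python sketch is illustrative only (it is not the MATLAB code used in this work, and the parameter values are arbitrary): it draws Gaussian task times, takes the longest task per superstep and sums over supersteps as in equation (3), averaging over trials.

    import numpy as np

    def simulate_total_time(w_mean, w_std, n_tasks, n_supersteps,
                            barrier_cost=0.0, n_trials=10000, seed=0):
        # Monte Carlo estimate of total run time under the hybrid BSP model:
        # per-superstep cost is the longest of n_tasks Gaussian task times plus
        # the barrier cost, summed over supersteps (equation (3)).
        rng = np.random.default_rng(seed)
        times = rng.normal(w_mean, w_std, size=(n_trials, n_supersteps, n_tasks))
        longest_per_superstep = times.max(axis=2)      # w_s for each superstep
        totals = longest_per_superstep.sum(axis=1) + n_supersteps * barrier_cost
        return totals.mean()

    # Example with illustrative parameters: w_m = 10, sigma = 1, N_s = 20 tasks
    # per superstep, S = 5 supersteps, negligible barrier cost.
    print(simulate_total_time(10.0, 1.0, 20, 5))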

3.6.1 Limitless CPU node model. As a simple example we take a mean task length of w_m = 10 time units and a task length standard deviation of σ = 1, and simulate an application making use of only a single superstep. We find that, using the additional assumption of limitless computational nodes, as we increase the number of distributed tasks required in the superstep, the difference between the longest task length w_s and the mean task length w_m grows sub-linearly with the number of submitted distributed tasks N (Figure 2). From this simple example we are able to conclude that, not taking into account limited computational resources, if we have an application that benefits from increasing the number of distributed tasks during a superstep (e.g. by an order of magnitude – see for example Section 5.1), we can expect improved results for only a small increase in predicted wall-clock time cost.

Figure 2. Predicted difference between maximum distributed task time and mean task time w_s − w_m, where w_m = 10, σ = 1, for an algorithm distributing N tasks in one superstep.

We can fit this simulated computation time accurately using the standard inverse complementary error function. The complementary error function erfc (also known as the Gaussian error function) provides us with an accurate predictor for the maximum job length w_s increment over the mean job length w_m, in relation to the number of submitted jobs, that we are likely to observe assuming that the true job length distribution resembles a Gaussian distribution. The erfc function is often used in statistical analysis to predict the behaviour of any sample with respect to the population mean. Here we fit our simulation data by applying the inverse erfc to $1/N_s$, where N_s is the number of submitted tasks in superstep s (see Figure 2). The error function erf is defined as:

$$\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_{0}^{x} e^{-t^2} \, dt$$

Then the complementary error function, denoted erfc, and its inverse erfc$^{-1}$ are defined as:

$$\mathrm{erfc}(x) = 1 - \mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_{x}^{\infty} e^{-t^2} \, dt, \qquad \mathrm{erfc}^{-1}(1 - x) = \mathrm{erf}^{-1}(x)$$

The model that empirically fits the simulation for mean task length w_m, with standard deviation σ, distributing N_s tasks in parallel, lets us predict the maximum task time w_s for superstep s as:

$$w_s = w_m + 1.4\sigma \cdot \mathrm{erfc}^{-1}\!\left(\frac{1}{N_s}\right) \qquad (4)$$

The scalar 1.4 is needed to fit our empirical data. We hypothesise that the true scalar value providing the best fit to our empirical curve here is $\sqrt{2}$, but we leave investigation of this to future work. In Figure 2 we use w_m = 10 and σ = 1 and simulate for various task set sizes N_s. If computational resources are not a limiting factor, then once we know the number of distributed tasks N_s required per superstep, and have estimates for w_m and σ, we are able to approximate the expected time w_s required for a single superstep of a given algorithm and, given the number of supersteps, the expected time required for the entire algorithm. This model is valid in cases where the number of available parallel worker processors is equal to or exceeds the number of tasks required per superstep. We have access to 130 iDataPlex servers with multiple CPUs; however in many practical applications this requirement will not hold (the number of tasks per superstep will exceed available participating worker nodes). Therefore we also consider a finite CPU model in the following section.
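Equation (4) can be evaluated directly with a standard scientific library. The sketch below is illustrative only (it is not the code used for our experiments) and uses SciPy's inverse complementary error function.

    from scipy.special import erfcinv

    def predict_superstep_time(w_mean, w_std, n_tasks):
        # Predicted longest task time w_s for a superstep of n_tasks independent
        # Gaussian-length tasks (equation (4)); valid when the number of
        # available worker processors is at least n_tasks.
        return w_mean + 1.4 * w_std * erfcinv(1.0 / n_tasks)

    # With the illustrative values of Section 3.6.1 (w_m = 10, sigma = 1),
    # predict_superstep_time(10.0, 1.0, 100) gives roughly 12.5 time units.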

3.6.2 Finite CPU node model. The previous simulation model does not take into account CPU worker node limits. In this section additional simulations are performed to explore the effect of capping the number of available CPU nodes K in relation to the number of submitted distributed tasks per superstep N_s. This allows us to fit a model that reflects our real distributed system pragmatically. In this case, we assume that N_s > K and therefore each CPU node is responsible for the computation of a number of tasks in sequence in order to complete a superstep. In our task farming framework under SGE, when a CPU worker node completes the computation of the current task then the next task from the set still waiting to be processed will be assigned to the finished core, such that each core is continually utilised until all tasks have been processed. For each simulation trial, the maximum cumulative CPU computation time used by a worker node during a superstep, CPU_s, must now be found. This value is the maximal sum of task computation times assigned to an individual CPU. From this maximum cumulative computation time found during a superstep, we subtract $w_m \cdot (N_s / K)$, where w_m is the mean task length, N_s is the number of parallel tasks making up the superstep and K is the number of participating processors. This effectively subtracts the mean amount of work we expect a CPU to perform per superstep. This mean amount of work per CPU is denoted $CPU_m = w_m \cdot (N_s / K)$. The resulting difference tells us how much more work, than the mean cumulative work, we expect the node assigned the most work to carry out. As a result, CPU_s provides the time we expect the full superstep s to take to complete.

The final point above holds because all CPU worker nodes must be allowed to finish their assigned cumulative task computation before it is possible to synchronise and conclude a superstep s. When accounting for a finite set of CPU worker nodes, we therefore model the time it takes to complete a superstep s as the longest cumulative CPU computation time CPU_s. When accounting for a fixed number of worker nodes K, the model that we find (approximately) empirically fits the simulation data is:

$$CPU_s = \begin{cases} w_m \cdot \left(\dfrac{N_s - \mathrm{mod}(N_s, K)}{K}\right) + w_m & \text{if } \mathrm{mod}(N_s, K) \neq 0 \\[2ex] w_m \cdot \left(\dfrac{N_s}{K}\right) + 1.4\sigma \cdot \mathrm{erfc}^{-1}\!\left(\dfrac{1}{N_s}\right) & \text{if } \mathrm{mod}(N_s, K) = 0 \end{cases} \qquad (5)$$

We model CPU_s as the mean computational work done at each worker, CPU_m, plus some additional work that must be carried out by the CPU that has performed the most work in the current superstep. We model this additional work in the following way: when we consider a finite set of CPU worker nodes, the difference between the longest cumulative CPU computation time CPU_s and the mean cumulative CPU computation time CPU_m is primarily influenced by (a) how evenly the number of distributed tasks N_s is distributed to the number of participating CPU nodes K, and (b) the mean task length w_m. Advanced task farm models (e.g. Poldner and Kuchen, 2008) employ various strategies dictating how tasks should be distributed to workers. Here we take the simple approach that, on the assumption that tasks belonging to a task set have similar length, each task still waiting to be processed will be assigned in turn to the CPU worker node that finishes its current computational work load first. A consequence of this is that if the total number of distributed tasks N_s required by the superstep is exactly divisible by the number of participating CPU nodes K (i.e. mod(N_s, K) = 0) then, excluding cases involving extremely high task length variance σ² in relation to w_m, each CPU will receive an identical number of tasks and therefore the difference between the longest cumulative CPU computation time CPU_s and the mean time CPU_m will be small and only influenced by the number of tasks N_s and the task length variance σ², in a similar fashion to the limitless worker node model. In such cases this small difference is once again accounted for using the erfc$^{-1}$ function as before (see Figure 2 and equation (4)). If, contrarily, the number of tasks N_s divided by the number of participating CPU nodes K leaves a remaining number of tasks that is small in relation to K (i.e. mod(N_s, K) ≪ K) then, again assuming moderate task length variance σ² in relation to w_m, the CPU node completing the most computational work will contain one more task than ⌊N_s/K⌋. We account for this additional task in our model by adding the mean task length w_m (our additional task) to the mean cumulative work done, adjusted by the number of CPU worker nodes that are assigned an additional task such that they must complete ⌊N_s/K⌋ + 1 tasks in total. This models the fact that the difference between CPU_s and CPU_m will be greater when fewer worker nodes are assigned ⌊N_s/K⌋ + 1 tasks to complete, since the true mean work done per CPU will be close to w_m · ⌊N_s/K⌋ when many nodes are completing only ⌊N_s/K⌋ tasks.

The difference between CPU_s and CPU_m is therefore essentially linear in mean task length w_m once N_s, K and σ are known. Intuitively, if mod(N_s, K) is low but non-zero (e.g. equal to one), then the single CPU that is assigned this extra task will be required to complete almost exactly one extra task length of work in comparison to the mean amount of work, CPU_m ≈ w_m · ⌊N_s/K⌋. As mod(N_s, K) grows, the value representing the mean amount of work done per CPU is adjusted accordingly. The special case where mod(N_s, K) = 0, as discussed previously, is expected to add only a constant amount of excess work above the mean for large N_s, similar to the case explored previously using an unbounded K (see Section 3.6.1). We validate this model using empirical simulation data for various K and task length w_m. A sample of these simulation and model prediction results, exploring simulated and predicted times for various K, are found in Figure 3.
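Equation (5) likewise translates into a small helper function. The following Python sketch is our illustrative reading of the model (again not the experimental code), reusing the erfc^{-1} term from equation (4).

    from scipy.special import erfcinv

    def predict_max_cpu_work(w_mean, w_std, n_tasks, n_cpus):
        # Predicted longest cumulative per-CPU work CPU_s for one superstep of
        # n_tasks tasks shared over n_cpus worker nodes (equation (5)).
        if n_tasks % n_cpus != 0:
            # Uneven split: the busiest CPU processes one task more than the
            # evenly divided share.
            return w_mean * ((n_tasks - n_tasks % n_cpus) / n_cpus) + w_mean
        # Even split: each CPU receives n_tasks / n_cpus tasks; the excess over
        # the mean is governed by task length variance, as in equation (4).
        return w_mean * (n_tasks / n_cpus) + 1.4 * w_std * erfcinv(1.0 / n_tasks)

    # Plugging in the Section 4 example (w_m = 462.0 s, sigma = 107.13 s,
    # N_s = 20 tasks on K = 20 CPUs) gives approximately 462.0 + 207.9 = 669.9 s,
    # in line with the model prediction of 669.86 s reported in Table 2.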

Figure 3. (a) We plot the model of the mean work we expect each CPU to carry out, CPU_m (blue line), in terms of overall (log-scale) computation time units for varying K processors. We show using empirical simulation (red line) how the longest CPU queue CPU_s deviates from this value in practice in relation to N_s and K. Our model prediction of the maximum work carried out by a CPU, 'CPU_s Model' (circles plotted for every 10th K value), exhibits how our model is able to account for this. Here we show a simulation distributing N = 250 tasks over one superstep with a mean task length of w_m = 1000, σ = 1. (b) The difference found between model prediction of CPU_s and empirical simulation for each value of K ∈ {1, ..., 250}. We exhibit model prediction error of <10 time units (y-axis) when using a mean job length w_m = 1000 units for each value of K explored. Our prediction makes small periodic errors but this error reduces further as K increases. For the number of CPUs that we make use of in practice (e.g. >20) we see an overall computation time prediction error of <4 time units when using w_m = 1000 units.



4 Hybrid BSP model predictions

In this section, we use our hybrid BSP model (introduced in Sections 3.6.1 and 3.6.2) to predict the expected run time of real-world applications that we distribute to our SGE cluster under our task farming framework. We present the results of submitting jobs under real network and Grid Engine loading conditions and compare job timing results with our predictions to test the validity of the models developed in Section 3.6.

We submit various application configurations to our SGE cluster that involve distributing N_s = 20, 40 and 100 tasks during each superstep in applications making use of S = 5, 10 and 30 supersteps. The applications that we utilise for testing our model contain parallel tasks with cost durations of comparable length by design. Details of the applications we experiment with are given in Section 5. To calculate true overall application time cost we record individual parallel task run times and are therefore able to find the longest running (highest cost) task within each superstep. We then sum the times required for the longest running task w_s in each superstep s such that $\sum_{s=0}^{S} w_s + Sl$ provides the total time needed to execute the parallel application in practice, assuming that all tasks within a superstep are able to run in parallel. With regard to the sample applications that we investigate during this experiment we find that the time cost for the barrier synchronisation steps l is negligible in practice, and therefore we neglect these in the runtime calculation. Although barrier synchronisation is negligible in the sample application investigated here, we note that this is certainly not always the case, and we therefore choose not to oversimplify the model.

We perform repeated trials (n = 10) for each application configuration tested. Here we provide detail of a configuration distributing N_s = 20 tasks during each of 10 supersteps as an example. In this example, we measure mean total real-world cost to be $\sum_{s=0}^{9} w_s = 123.06$ min of parallel computation time, with an average task length of w_m = 462.9 s (~8 min) and a mean parallel task length standard deviation of σ = 107.13 s. The recorded individual task times, across all 10 supersteps from one trial, are shown in Figure 4. Examining the real-world run times of the distributed tasks highlights a slightly heavy-tailed distribution for the particular application employed in this experiment. This typically results in several long runtime outliers that contribute to the total runtime cost using our overall runtime calculation method. For expository purposes we also fit a GEV (Generalised Extreme Value) model to the data here, providing a reasonable fit (i.e. resulting in a slightly lower BIC value of 2343.39 compared to the Gaussian BIC of 2446.78 for this data set). In future work we plan to re-examine our hybrid model using (e.g.) a GEV distribution in place of our current Gaussian timing model to predict run times in cases where this provides a better fit to the independent task times. We also note that one potential route towards accounting for heterogeneous participating processors p during runtime prediction would involve making use of mixture distributions (e.g. a mixed GEV distribution). We leave more sophisticated task time distribution fitting to future work.

Figure 4. Individual parallel task timings across all 10 supersteps from one trial.



We obtain individual runtime costs by profiling the application (detailed in Section 5.1) through the use of the MATLAB function cputime. By additionally including SGE queueing (non-working) time, mean wall-clock time for the application run in this example was 173.46 min (non-working time is attributed to sharing the SGE cluster with other users).

Using the distributed task model that we introduce in equation (5), and assuming that we have sufficient participating processors K to accommodate 20 tasks in parallel, we predict the maximum work performed by a single processor in a superstep to be CPU_s = 669.86 s for this example (an underestimation; the mean value found in practice across n = 10 trials for this configuration is 738.37 s of CPU time). Using S = 10 supersteps, the total runtime predicted by our model for this experiment is therefore 111.6 min. This results in a slight underestimation of the true mean total cost by 11.4 min (~10%) for this distributed configuration. This underestimation is probably explained by the slightly non-Gaussian distribution observed in Figure 4. Results for the predicted and measured job completion times for the distributed configurations investigated in this way are summarised in Tables 1 and 2. In Table 2 we present measured and predicted overall computation time and note that the difference between measured time and our model prediction is always within 11% of the true value. Our approximate model provides a simple yet moderately accurate method for predicting the amount of computational work required by applications formulated under our task farming framework and distributed to SGE, or other queue based, cluster systems. For completeness we contrast the computational time required with the mean wall-clock time used by the cluster in practice. We note that in general wall-clock time is significantly larger than required computational time; however we find that wall-clock time is subject to high variance between trials as we have little control over multi-user cluster wall-clock time. This is due to the queueing aspect of sharing the SGE cluster with other users.

5 Example semi-synchronised task farming applications

We introduce three computationally demanding computer vision problems and propose solutions implemented using our semi-synchronised task farming framework. We focus on simple farming applications that are able to benefit from performing many tasks in parallel yet require some form of communication between rounds of parallel tasks (supersteps). As described previously, these parallel task sets and synchronisation steps make up a larger computational process. The example applications that we study here all share the following properties:

• Large input data set. Our input data sets are large relative to the number of model parameters and control options that dictate the data processing procedures.

• Large number of tasks. The number of tasks N that make up the overall computational process is large and may not be known in advance. Each application launches sets of tasks that are processed in parallel. All tasks in a synchronised superstep must complete before the following round of tasks can begin. Task parameters are defined by fixed model parameters and potentially information resulting from the completion of previous task sets.

• Task independence. Each task is defined by model parameters, the global input data and potentially the task set results from the previous superstep. For tasks that are contained in the same superstep, no dependencies exist between superstep members.

Table 1. Parameter sets used for four different sets of distributed application experiments varying the number of distributed tasks (N_s) and supersteps (S).

                              # CPU nodes (K)   Tasks per superstep (N_s)   Supersteps (S)
Model prediction (eq. (5))    20                20                          10
Measured timing set 1         20                20                          10
Model prediction (eq. (5))    20                20                          30
Measured timing set 2         20                20                          30
Model prediction (eq. (5))    20                40                          5
Measured timing set 3         20                40                          5
Model prediction (eq. (5))    20                100                         5
Measured timing set 4         20                100                         5

Table 2. Distributed application measured timing results and BSP model predictions for four sets of distributed tasks, with rows corresponding to Table 1. We obtain the predicted overall computation time by taking the product of the predicted w_s and the number of supersteps (S). The difference between our overall computation time model predictions and measured results are always within 11% of the true value.

                              True w_m (s)   Task time σ (s)   Predicted w_s (eq. (5)) and true w_s (s)   Overall computation time (min)   Wall-clock time (min)
Model prediction (eq. (5))    N/A            N/A               (462.0 + 207.86) = 669.86                  (669.86 s × 10) = 111.6          N/A
Measured timing set 1         462.0          107.13            738.37                                     123.06                           173.46
Model prediction (eq. (5))    N/A            N/A               (348.17 + 168.02) = 516.19                 (516.19 s × 30) = 258.1          N/A
Measured timing set 2         348.17         86.60             740.0                                      287.4                            434.08
Model prediction (eq. (5))    N/A            N/A               (57.1 + 19.8) = 76.9                       (76.8 s × 5) = 6.40              N/A
Measured timing set 3         57.1           8.95              91.3                                       6.89                             41.3
Model prediction (eq. (5))    N/A            N/A               (214.4 + 96.46) = 310.86                   (310.86 s × 5) = 25.9            N/A
Measured timing set 4         214.4          37.83             353.6                                      27.3                             133.0

5.1 Application 1: Multi-view point cloud registration

5.1.1 Multi-view registration. 3D surface registration can be considered one of the crucial stages of reconstructing 3D object models using information obtained from range images captured from differing object viewpoints. Point correspondences between range images and view order are typically unknown. Aligning pairs of these depth images is a well-studied problem that has resulted in fast and usually reliable algorithms. The generalised problem of globally aligning multiple partial object surfaces is a more complex task that has received less attention, yet remains a fundamental part of extracting complete models from multiple 3D surface measurements for many useful applications such as robot navigation and object reconstruction. This is the multi-view registration problem.

Early solutions to the multi-view registration problem typically proposed defining one view position as an anchor point and then progressively aligning overlapping range scans in a pairwise fashion, such that applying the rigid transforms found at each pairwise step in a chain brings each additional viewpoint into the coordinate frame of the anchor scan, thus obtaining a complete object model. Although straightforward and fairly computationally inexpensive, this technique often results in registration error accumulation and propagation. In an attempt to address this issue, more recent work (Pottmann et al., 2002; Toldo et al., 2010; Torsello et al., 2011) proposes various techniques for aligning all surface viewpoints simultaneously in an attempt to reduce errors and make use of information from all views concurrently. Performing view registration in this fashion is typically able to improve alignment quality by distributing registration errors evenly between overlapping range views. Considering all views simultaneously does however typically incur increased computational cost, as these approaches must, at each iteration, compute the registration error between each range view and some form of reference. A solution to the multi-view registration problem capable of handling large data sets, consisting of many viewpoints, therefore provides a good candidate for a parallelised implementation.

In this paper we present our approach for the simultaneous global registration of depth sensor data from many viewpoints, represented by multiple dense point clouds (McDonagh and Fisher, 2013), implemented in the semi-synchronised task farming framework described in Section 3. This framework allows us to process large numbers of range images per object reconstruction whilst retaining the accurate, high quality view alignment results typical of simultaneous registration approaches.

5.1.2 Simultaneous registration using task farming. Given many partial object views represented by point clouds, with a typical set of seed positions providing a coarse alignment initialisation, we construct a kernel-based density function of the point data to determine an estimation of the sampled surface. Using this surface estimate we define an energy function that implicitly considers the position of all viewpoints simultaneously. We use this estimation of the sampled surface to perform an energy minimisation in the scan pose transform space, on each scan in parallel, to align each viewpoint to the object surface estimate and, implicitly, to each other. After alignment, we recompute the energy function and then re-minimise all scan positions. This process is repeated to convergence. Figure 5 outlines this approach; for more details see McDonagh and Fisher (2013).

Figure 5. Our multi-view registration method. Stages of the algorithm within the dashed line area are distributed to our cluster in parallel.



Since range viewpoints are aligned in parallel we are able to accommodate many view sets without increasing the wall-clock time, unlike typical serial solutions. Utilising many object viewpoints affords benefits over sparse sets of views for the task of object reconstruction, such as better object surface coverage, hole filling and improved reconstructed object detail.

For N viewpoints we define N independent parallel tasks in each superstep and in each of these tasks we use the current pose of the remaining N − 1 scans for the purpose of computing a surface estimate and a related energy function. We allow the final, active scan to move in the transform space by searching for optimal pose parameters. Each parallel task assigns a different viewpoint as the active scan. Independently evaluating the position of each moving scan in relation to the inferred surface, and therefore minimising our energy function, brings the active view into better alignment. After this minimisation has taken place for each viewpoint in parallel, we have N sets of optimal rigid transform parameters: three translation (θx, θy, θz) and three rotation (θα, θβ, θγ) parameters that bring each view into alignment with the estimated surface (and therefore the other views). Once each independent task has found a set of rigid transform parameters (reached the superstep synchronisation barrier), we apply the transform parameters found for each view, thus bringing the entire set into better alignment with one another, completing our barrier synchronisation step. We then redistribute the tasks to perform a re-estimation of the sampled surface, using the new viewpoint positions, for each view in parallel. This typically results in a tighter, more accurate estimation of the surface. We iterate this process for S supersteps until viewpoint registration convergence has been reached. Convergence can be identified by looking at residual point alignment error or the magnitude of the transforms being found by each task optimisation. In practice convergence is usually reached within S = 10 supersteps; however, for the purposes of the timing experiments in Section 4, we use up to S = 30 supersteps.

This optimisation algorithm can be summarised as follows: we define {Vi} as the set of N individual point sets Vi, and Si as the collective surface estimate found using the points in point sets Vj, where j = 1 ... N and j ≠ i. We define our energy function E(·) to evaluate the alignment of 3D points x ∈ Vi in relation to surface estimate Si. Therefore E(Vi, Si) evaluates the current pose of point set Vi in relation to how well registered it is with surface estimate Si. We perform minimisation in the transform space of Vi, evaluating how well the viewpoint is aligned to our surface estimate Si at each iteration step. This minimisation lets us find optimal pose parameters θi for each Vi in parallel. We use these parameters to apply pose transformations Tθi to each point set Vi. This transform optimally aligns point set Vi with the related current surface estimate. In parallel we align each point set Vi to the surface estimate provided by Si. By doing this we implicitly register each viewpoint with all others. We then re-estimate Si from the resulting new poses of {Vi}, and iterate this process to convergence. This algorithm is described using the following pseudocode:

    Input: range scans V1, ..., VN
    begin
        converged := 0
        while (NOT converged)
            parallel for i = 1 ... N
                Si = estimate_surface( ∪_{j=1, j≠i}^{N} Vj )
                θi = argmax_θ E(Tθ(Vi), Si)
            end
            parallel for i = 1 ... N
                Vi = Tθi(Vi)
            end
            converged = test_convergence(V1, ..., VN)
        end
    end
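The Python sketch below is an illustrative rendering of the pseudocode above, not the authors' implementation: a nearest-neighbour distance stands in for the kernel-based density surface estimate, the rigid transform is simplified to translation only, and scipy's general-purpose optimiser replaces the paper's pose-space search. A process pool again plays the role of the cluster, with one task per active view in each superstep.

    import numpy as np
    from multiprocessing import Pool
    from scipy.optimize import minimize
    from scipy.spatial import cKDTree

    def apply_transform(theta, view):
        # Simplified rigid transform: translation only (tx, ty, tz); a full
        # implementation would also apply the three rotation parameters.
        return view + np.asarray(theta[:3])

    def energy(theta, view, surface_tree):
        # Alignment quality of the moved view against the surface estimate,
        # scored here as the negative mean distance to the nearest surface sample
        # (a stand-in for the paper's kernel-based density energy).
        dists, _ = surface_tree.query(apply_transform(theta, view))
        return -dists.mean()

    def align_one_view(args):
        # One distributed task: hold the other N-1 views fixed, estimate the
        # surface, and search pose space for the active view.
        i, views = args
        surface_tree = cKDTree(np.vstack([v for j, v in enumerate(views) if j != i]))
        res = minimize(lambda th: -energy(th, views[i], surface_tree),
                       x0=np.zeros(6), method="Nelder-Mead")
        return res.x

    def register(views, supersteps=10):
        with Pool() as pool:
            for _ in range(supersteps):                  # one superstep per outer iteration
                thetas = pool.map(align_one_view, [(i, views) for i in range(len(views))])
                views = [apply_transform(th, v) for th, v in zip(thetas, views)]  # barrier + global update
        return views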

5.1.3 Experimental setup. We evaluate this parallel alignment strategy quantitatively on synthetic and real range sensor data, where we find that we have competitive registration accuracy with existing frameworks for this task. See McDonagh and Fisher (2013) for registration accuracy results. Here we evaluate application speed-up due to parallelisation. As discussed, we are able to register all views simultaneously by taking advantage of many cluster nodes, and thus distribute the work. Here we explore various distributed task and superstep configurations and look at the performance gained by making use of a distributed system compared to performing the work on a single node. In the case of the single CPU experiments we register each scan serially using an individual cluster node and then find the related surface estimates once rigid transforms have been found for all scans. Figure 6 shows a partial example midway through alignment.

We record runtime results as follows: for single CPU results no job queueing is involved as the algorithm performs the registration of each scan in series until completion. The time reported is the total time required to register N viewpoints in series over S supersteps. For the parallel distributed experiments we measure the time taken in two ways. As discussed in Section 3.1, the distributed system we make use of employs a multi-user job queueing system. Firstly, we measure the wall-clock time by recording the total real-world time required


from the point of submitting our work to the job queue until the job is complete (when the registration of all viewpoints Vi has converged in this case). Here job queueing (non-working) time cost may be incurred by each individual distributed task (the alignment of a single view Vi to the related surface estimate to find the optimal pose transform Tθi). In Table 3 this timing result is referred to as 'ECDF wall-clock time'. The second distributed timing measure excludes this queueing (non-working) time and, for each superstep, finds the maximum task length of an individual distributed task (scan alignment) in a similar measurement process to that outlined in Section 4. The time reported for this second metric is then the sum of the maximum task lengths over the total number of supersteps; we call this the 'distributed ideal time'. We consider this to be an accurate assessment of the computation time required, as each superstep must wait for all member distributed tasks to finish before it may apply the global synchronisation step and then launch the following set of distributed tasks. This second metric excludes real-world queueing time. Furthermore, for this experiment, we have sufficient worker nodes to process all distributed tasks in a superstep concurrently (true in the case of our current HPC cluster). These measurements allow us to compare the optimal theoretical performance gain to the real-world speed-up achieved in practice on our multi-user system.

Figure 6. Top: a planar slice of our energy function through coarsely aligned partial scans (Stanford bunny data set). Bottom: our energy function approximating the underlying surface defined by the coarsely aligned range scans. A zoom of the slice region shows surface function values that are represented by colours increasing from deep blue to red. We align each partial view point cloud with this surface estimate in parallel.

Table 3. Multi-view registration algorithm timing results: single CPU vs distributed cluster.

                              Single CPU (min)   Distributed ECDF        Distributed ECDF    Model prediction     Sp
                                                 wall-clock time (min)   ideal time (min)    (eq. (5)) (min)
    5 views, 1 superstep            37.26              10.77                    8.74               8.37           4.26
    20 views, 1 superstep           95.38              10.89                    7.74               8.28          12.32
    5 views, 5 supersteps          176.06              49.22                   39.12              36.06           4.50
    20 views, 5 supersteps         835.02             185.94                   52.40              49.37          15.94


5.1.4 Performance evaluation. The success of employing an HPC system to solve computationally demanding problems resulting from large real-world data sets depends on the system architecture (e.g. number of available processors) and algorithmic design. The performance of an algorithm on an HPC system can be evaluated by calculating the speed-up provided over a single node or single CPU system. Here we use speed-up Sp and efficiency Ep (equations (6) and (7)) to show the improvement we achieve by formulating computer vision problems under our task farming framework. Assuming that the speed of processors and the network is constant, speed-up (Baker et al., 2006; Eager et al., 1989) is often defined as:

    Sp = T1 / Tp    (6)

where p is the number of participating processors, T1 is the computational time needed for sequential algorithm execution and Tp is the execution time required by the parallel algorithm when making use of p processors. Ideal (linear) speed-up is obtained in the case Sp = p. Although super-linear speed-up is possible in some cases (e.g. due to cache effects in multi-core systems), when using task farming and an HPC cluster we consider linear speed-up as ideal scalability. In the linear speed-up case, doubling the number of processors p will double the speed-up Sp (halving the required execution time Tp). The second, related performance metric we make use of is efficiency (equation (7)). The Ep metric, typically in the range [0 ... 1], attempts to estimate how well utilised p processors are when solving the problem at hand, compared to how much time is spent on activities such as processor communication and synchronisation.

    Ep = Sp / p = T1 / (p Tp)    (7)
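To make the two metrics concrete, the following small Python check works through the 20-view, five-superstep timings from Table 3; the helper names are ours, introduced purely for illustration.

    def speed_up(t_serial, t_parallel):
        # Equation (6): S_p = T_1 / T_p.
        return t_serial / t_parallel

    def efficiency(t_serial, t_parallel, p):
        # Equation (7): E_p = S_p / p = T_1 / (p * T_p).
        return speed_up(t_serial, t_parallel) / p

    # 20 views over 5 supersteps (Table 3): T_1 = 835.02 min, T_20 = 52.40 min.
    print(speed_up(835.02, 52.40))        # ~15.94
    print(efficiency(835.02, 52.40, 20))  # ~0.80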

For our viewpoint registration algorithm, Table 3 shows that, in experiments performing only a single superstep (surface estimation), when we compare the serial and distributed computation times (excluding job queueing time) we are able to achieve significant speed-up in each case (where here p = 5, 20 and the T1, T5 and T20 timings are in minutes), with S5 = 37.26/8.74 = 4.26 and S20 = 95.38/7.74 = 12.32. We note that the experiment aligning fewer viewpoints, using fewer nodes (|{Vi}| = 5, p = 5, S = 1), achieves a result closer to optimal speed-up (and efficiency). We reason that a longer maximum task time (the superstep time) is likely to be observed for the larger experiment (|{Vi}| = 20, p = 20, S = 1) as it contains more distributed tasks per superstep. This point holds in practice here and was explored during our predictive model formulation and related scalability experiments in Section 3.3. Table 3 also shows the same task set sizes (|{Vi}| = 5, 20) but with multiple supersteps (S = 5), which achieve slightly improved speed-up and efficiency performance: S5 = 176.06/39.12 = 4.50 and S20 = 835.02/52.40 = 15.94. Again our hybrid model predictions come within 10% of the measured values in each case and we include ECDF wall-clock time results in the distributed experiments for completeness. The time required to align 20 range image viewpoints over five supersteps using our simultaneous method can be effectively reduced from ~14 h to ~50 min.

5.2 Application 2: Feature selection

5.2.1 Feature selection for classification. The aim of feature selection in computer vision and pattern recognition problems is to obtain a small subset of a larger full set of features which gives (e.g.) accurate classification. The benefits of feature selection are to reduce the dimensionality of the data, which decreases the classification time and decreases the chance of over-fitting during training. It is also important to eliminate irrelevant or redundant features, and even features which might cause inaccurate classification. Popular computer vision applications which utilise feature selection are face recognition (Yang et al., 2007), trajectory analysis (Beyan and Fisher, 2013), image segmentation (Roth and Lange, 2004), gesture recognition (Guo and Dyer, 2003) and medical image processing (McDonagh et al., 2008). In general, feature selection consists of feature subset generation, feature subset evaluation, a stopping criterion and validation of results using the selected final subset (Liu, 2005; Wang et al., 2009).

Feature subset evaluation can be expressed in terms of a criterion, such as maximising a performance measure. The iterations continue until the value of the performance criterion is accepted, which is often when adding additional features reduces performance. Feature subset generation can be divided into two categories (Blum and Langley, 1997): filters and wrappers. Filter approaches do not use a learning algorithm and are usually faster and computationally efficient. Filter approaches rank the features and evaluate them in terms of their benefit/relevance, such as using distance, consistency and mutual information between a feature and the class labels (Huang et al., 2006). On the other hand, wrapper methods use a learning algorithm to evaluate the quality of the feature subset. Wrappers are usually superior in accuracy when compared to filters (Kohavi and John, 1997). In this study, we use the Sequential Forward Feature Selection (SFFS) algorithm (Section 5.2.2), which is a wrapper method with a parallel schema that suits the semi-synchronised task farming framework that we have introduced (see Section 3 and Figure 1).
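For contrast with the wrapper method used in this study, the short Python sketch below illustrates the filter style of subset generation mentioned above: features are ranked by mutual information with the class labels, with no learning algorithm in the loop. The data here are synthetic and the snippet relies on scikit-learn; it is an illustration only, not part of the paper's pipeline.

    import numpy as np
    from sklearn.feature_selection import mutual_info_classif

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))                               # 200 samples, 10 candidate features
    y = (X[:, 3] + 0.1 * rng.normal(size=200) > 0).astype(int)   # label driven mainly by feature 3

    # Filter approach: rank features by mutual information with the class labels,
    # without training any classifier (contrast with the wrapper criterion of SFFS).
    scores = mutual_info_classif(X, y, random_state=0)
    ranking = np.argsort(scores)[::-1]
    print(ranking[:3])   # feature 3 is expected to rank near the top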


5.2.2 Sequential Forward Feature Selection. The forward feature selection procedure begins with an empty feature subset. In the first iteration, it initialises the feature subset by trying features one by one and evaluates the subset in terms of the performance criterion. At the end of the first iteration, the first best feature is selected. In subsequent iterations, the subset of features that was selected in the previous iteration is extended by one of the remaining features. Hence, in the second iteration the feature subsets have two features to be evaluated. After all feature subsets are evaluated, the current best result of the new subset is compared with the previous iteration's best result and, using the stopping criterion, a decision is made to continue to a third iteration or to stop selecting features in order to validate the results. If the decision is to continue, then a similar procedure iterates to produce three features in the candidate subset for the third iteration, four features for the fourth iteration, and so on.

5.2.3 Sequential Forward Feature Selection using task farming. Similar to many other wrapper approaches, the SFFS procedure is computationally expensive, especially if the number of features is large, the learning algorithm has a high time complexity and the required number of iterations is large. Therefore, efficient implementations of this method are needed for many computer vision applications. The procedure that we use to accelerate SFFS is based on the semi-synchronised task farming framework that we present above (see Figure 7: the dashed box shows where we apply task farming). In this context, in each superstep, we first build the subsets and then distribute each subset as a parallel task to be processed using the learning algorithm. After all the distributed tasks finish (the superstep conclusion), we collect them to find the current best criterion value and the feature corresponding to it. The new best feature is selected and becomes a member of all following feature subsets. During this task synchronisation stage, we also apply the stopping criterion to decide whether we are going to continue to select features or not. If the decision is to continue, the new feature subsets are built and the new tasks are distributed. At the following iteration, the number of distributed tasks is one less than at the previous iteration. This distribute-and-collate procedure continues until the value of the performance criterion decreases compared to the previous iteration. When this value decreases, the decision to stop expanding the feature subset is made and the SFFS process is complete. A formal description of our SFFS algorithm using semi-synchronised task farming is as follows:

    Input: N features {fi} = F, evaluation function E
    Output: the selected features S, S ⊆ F
    begin
        converged := 0
        S := {}
        while (NOT converged)
            parallel for fi ∈ F
                evaluate ei = E(S ∪ {fi})
            end
            select j = argmax_i(ei)
            S = S ∪ {fj}
            F = F \ {fj}
            converged = E(S) ≤ E(S \ {fj})
        end
    end
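A schematic Python rendering of this distributed SFFS loop is given below; it is a sketch under the same superstep pattern as before, not the paper's MATLAB implementation. The evaluate argument is the wrapper criterion (for the paper, cross-validated classification accuracy of the affinity-propagation-based learner); each superstep evaluates all one-feature extensions of the current subset in parallel and the synchronisation step keeps the best one.

    from multiprocessing import Pool

    def sffs_task_farm(features, evaluate):
        # `evaluate(subset)` is the wrapper criterion: train the learning algorithm
        # on the candidate subset and return e.g. mean classification accuracy.
        selected, best_score = [], float("-inf")
        remaining = list(features)
        with Pool() as pool:
            while remaining:
                candidates = [selected + [f] for f in remaining]
                scores = pool.map(evaluate, candidates)       # one superstep of parallel tasks
                best = max(range(len(scores)), key=lambda i: scores[i])
                if scores[best] <= best_score:                # criterion no longer improves: stop
                    break
                best_score = scores[best]
                selected.append(remaining.pop(best))          # synchronisation: grow the subset
        return selected, best_score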

5.2.4 Experimental setup. The presented feature selection procedure, formulated under our task farming framework, was tested using a fish trajectory dataset which has 3102 trajectories in total. In this dataset 3043 trajectories are normal (show typical behaviour) while 59 of them are rare behaviours. There are in total 179 trajectory description features, obtained from the curvature scale space (Bashir et al., 2006), moment descriptors (Suk and Flusser, 2004), velocity, acceleration, angle, central distance functions (Bashir et al., 2006), vicinity (Liwicki and Bunke, 2006), and so on, of the trajectories. The aim is to select the feature subset which can best distinguish normal and rare trajectories with high class accuracy. The learning algorithm that we utilise is based on affinity propagation and class labels (see Beyan and Fisher (2013) for details). The experiments were performed using nine-fold cross validation, which constructs the training and testing sets randomly while maintaining an even distribution of normal and abnormal trajectories between folds. Table 4 displays the best feature subset performance after a new feature is selected in each iteration. The performance metric is the average trajectory class classification accuracy. The total number of features chosen for each fold was 3, 2, 2, 6, 2, 5, 2, 3 and 2 respectively, and feature selection stops when the observed average classification accuracy is lower than that of the previous superstep (iteration). The final (best) criterion value for each fold is marked (*) in Table 4.

Figure 7. Steps of feature selection (adapted from Wang et al. (2009)). The dashed box contains the stages where we evaluate the candidate feature subsets independently and in parallel, using our task farming framework.

To evaluate the speed and efficiency of our distributed SFFS algorithm using our task farming framework we compare it to sequential SFFS performed on a single compute node and again make use of the speed-up and efficiency metrics (Section 5.1.4). We test both implementations by varying the total feature pool size ∈ {10, 20, 50, 100, 179} and cap the number of potential new features added to the optimal feature subset by limiting the number of superstep (feature selection) rounds to 2, 6 and 10.

During each feature selection superstep, we employ the learning algorithm: affinity propagation and class labels (see Beyan and Fisher (2013) for details). The results are presented in Table 5 in terms of processing time (minutes). We compare the results obtained using a single CPU (sequential SFFS) to the distributed SFFS implementation, again recording both the case including SGE queueing (ECDF wall-clock time) and the case where it is disregarded (ECDF ideal time). Discounting the SGE queueing time effectively assumes that we have a sufficient number of cluster nodes available to process all feature subset tasks in parallel.

5.2.5 Performance evaluation. The results in Table 5 show that formulating this problem under our task farming framework is again worthwhile, speeding up the completion times of our SFFS application significantly. This is especially true in the cases where the cardinality of the total feature pool (number of parallel tasks) is large, i.e. where |F| = 50, 100 and 179. The single CPU implementation is slower than distributed SFFS in every case, even when taking into account the SGE queueing time. Our distributed SFFS implementation achieves a speed-up of Sp ∈ [2 ... 30] (see Table 5) over the serial timings, with the assumption that sufficient compute nodes are available to process all distributed tasks in parallel. When the SGE queueing time is included we achieve Sp ∈ [2 ... 13] (not shown). In practice this allows us to evaluate a feature set containing (e.g.) 179 features to find an optimal feature subset during training, for the purpose of fish trajectory classification, in ~7 h (excluding queueing time), in comparison to the corresponding serial computation that took 195 h (> 1 week) to complete. Determining optimal feature subsets in this way allows us to construct a fish trajectory classification system capable of > 95% accuracy on over 3000 trajectories during the training stage.

5.3 Application 3: Hierarchical classification

5.3.1 Hierarchical classification method. The final application that we implement under our task farming framework is a hierarchical classification algorithm called the Balance-Guaranteed Optimised Tree (BGOT).

Table 4. The results of applying distributed SFFS to a nine-fold real-world fish trajectory dataset. The table shows average trajectory-class classification accuracies during training for the best performing feature subset of each length, for each fold. Starred values (*) show the best criterion value found for each fold; the following value (to the right of the best value) is obtained when an additional feature is added (producing a lower criterion value by definition, hence the algorithm terminates).

    Fold    Feature subset cardinality (# supersteps)
               1         2         3         4         5         6         7
    1       0.9467    0.9482    0.9497*   0.9305
    2       0.9527    0.9689*   0.9586
    3       0.9305    0.9749*   0.9734
    4       0.8677    0.8841    0.9169    0.9481    0.9585    0.9588*   0.9567
    5       0.8649    0.9586*   0.9481
    6       0.9567    0.9675    0.9704    0.9734    0.9749*   0.9689
    7       0.9438    0.9689*   0.9585
    8       0.9201    0.9689    0.9808*   0.9567
    9       0.9645    0.9822*   0.9438


The BGOT is a classification method that has been shown to perform well when handling data points originating from imbalanced classes (Huang et al., 2012). We use BGOT here for the task of object classification. Using hierarchical classification, data to be classified is pushed down a tree path according to a decision made at each tree node (a classifier) (Cai and Hofmann, 2004; Duan and Sathiya, 2005). This effectively narrows down the classes that a sample is believed to belong to. Each tree leaf node represents a single class and a data point reaching a leaf is assigned to that class. During the training phase, the BGOT method selects effective subsets of predefined image features used at each node of the tree with the goal of maximising the mean classification accuracy among classes arriving at that node. This increases the weight of minority and under-represented classes.

The BGOT algorithm applies two strategies to help control classification error (Wang and Casasent, 2009): (a) apply more accurate classifiers at a higher tree level (earlier) and leave less certain decisions until deeper levels, and (b) keep the hierarchical tree balanced to minimise the maximum tree depth. A hierarchical classifier hhier is designed as a structured node set. Nodes are defined as triples: Nodet = {IDt, Ft, Ct}, where IDt is a unique node number, Ft ⊆ {f1, ..., fm} is a feature subset (chosen by a feature selection procedure (Weston et al., 2000)) that is found to be effective for classifying Ct (a subset of classes). For the classification task we use a multi-class SVM classifier (Chang and Lin, 2011).

An example classification hierarchy with 15 classes is shown in Figure 8. Each node, identified as IDt, illustrates the class separation decision Ct made at that node. The example BGOT is capable of classifying 15 classes by making use of seven classifier nodes and a tree depth of three levels. The first level splits the set of classes into two groups.
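A minimal Python illustration of the node triple Nodet = {IDt, Ft, Ct} and of pushing a sample down the tree is given below. It is a sketch only: the node classifier is a hypothetical callable returning a left/right decision, and the trained SVMs and the 15-class tree of Figure 8 are not reproduced here.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Node:
        # One BGOT node: a unique id (ID_t), the feature subset used at this node (F_t),
        # the classes it discriminates (C_t), a node classifier and its children.
        node_id: int
        features: List[str]
        classes: List[str]
        classifier: object = None                        # e.g. a trained binary SVM deciding left vs right
        children: List["Node"] = field(default_factory=list)

    def classify(node: Node, sample) -> str:
        # Push a sample down the tree; a leaf's single class is the prediction.
        if not node.children:
            return node.classes[0]
        branch = node.classifier(sample)                 # hypothetical callable: 0 = left child, 1 = right child
        return classify(node.children[branch], sample)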

5.3.2 Generating the hierarchical tree. The tree building algorithm chooses the image feature subset that maximises the average classification accuracy for images belonging to the aforementioned two groups. Each class set is then split into two subsets and a new node in the tree is created for each subset. This procedure continues until all nodes contain at most four classes. The automatically generated hierarchical tree (BGOT) chooses the best class set split by exhaustively searching all possible combinations of class splits that maintain a balanced tree (an equal number of classes assigned to each of two child nodes). As a result, there are two parameter sets to search over when building the tree: (a) all possible two-partitions of the classes at each node, and (b) the related optimal feature subset in terms of classification performance. This dual parameter search results in a computationally demanding process and suggests that a parallel approach using our framework would prove advantageous. Parallelising feature subset selection is discussed previously (Section 5.2), so here we focus on the tree construction technique, involving the designation of image classes to tree nodes, that we realise under our task farming framework.

Table 5. Feature selection algorithm training time results (in minutes): single CPU vs distributed cluster. Our timing model accurately predicts the expected ideal distributed time and we again display large speed-up (Sp) gains over the single CPU implementation. The difference between predicted and measured time grows for the large feature set experiments (e.g. 100, 179), where we gain the largest speed-up Sp. One application-specific cause for this discrepancy involves the particular image processing features extracted. When experimenting with more features (100, 179) we include the extraction of computationally expensive image features that result in long individual task times. These outliers do not significantly affect the superstep mean task length wm but do increase the ECDF ideal time by producing large ws. Re-examining our hybrid model with a non-Gaussian individual task time distribution may help to improve these estimates. We again include wall-clock time for completeness.

                                    Single CPU   Distributed ECDF        Distributed ECDF   Model prediction      Sp
                                      (min)      wall-clock time (min)   ideal time (min)   (eq. (5)) (min)
    10 features, 2 supersteps           162             31                      19               18.45            8.53
    10 features, 6 supersteps           412             75                      55               56.14            7.49
    10 features, 10 supersteps          322            153                     132              156.32            2.44
    20 features, 2 supersteps           323             35                      18               18.01           17.94
    20 features, 6 supersteps           888            113                      86               76.40           10.33
    20 features, 10 supersteps          951            211                     172              184.36            5.53
    50 features, 2 supersteps          1045             79                      45               30.91           23.22
    50 features, 6 supersteps          1975            217                     123               93.70           16.05
    50 features, 10 supersteps         3111            526                     248              249.11           12.54
    100 features, 2 supersteps         1749            132                      60               33.80           29.15
    100 features, 6 supersteps         4023            417                     170              107.22           23.66
    100 features, 10 supersteps        6493            957                     303              208.53           21.43
    179 features, 2 supersteps         2548            314                     189               76.24           13.48
    179 features, 6 supersteps         6788           1027                     276              233.53           24.60
    179 features, 10 supersteps       11712           2354                     436              380.40           26.86


5.3.3 Generating a Balance-Guaranteed Optimised Tree using semi-synchronised task farming. In this section, we focus on the part of BGOT generation involving the binary split procedure that finds the best class subset split by exhaustively searching all possible combinations of class subsets. At each non-leaf tree node, the set of classes is split into two groups and an SVM classifier (Cortes and Vapnik, 1995) is trained to separate samples between these two groups. Finding an optimal class split is exponentially complex and sensitive to the number of classes.

In the example provided there are C(15, 8) = 6435 possible combinations to divide the 15 classes (at the top level) into two subsets of cardinality 7 and 8, which then require an additional C(8, 4) = 70 and C(7, 4) = 35 combinations to split the tree at the following level. On average the classifier quality of a subset split takes over two minutes to evaluate, therefore > 250 CPU-hours are required if we wish to run the entire exhaustive evaluation process on a single compute node. This process is therefore a good candidate to make use of our parallel framework.
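The split counts quoted above are plain binomial coefficients and can be checked directly; the short Python snippet below does so and also estimates the serial cost of the top-level superstep alone at roughly two minutes per evaluation (the > 250 CPU-hours figure covers the full exhaustive process).

    from math import comb

    # Top level: divide 15 classes into subsets of size 8 and 7.
    print(comb(15, 8))   # 6435
    # Next level: split the resulting 8-class and 7-class nodes again.
    print(comb(8, 4))    # 70
    print(comb(7, 4))    # 35
    # Rough serial cost of the top-level superstep at ~2 min per split evaluation.
    print(comb(15, 8) * 2 / 60, "CPU-hours for the top-level superstep alone")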

More formally, our tree generation algorithm can be described as follows:

    Input: classes C1 to Cn
    begin
        c := {C1, ..., Cn}
        level := 0
        featureSet := FeatureSelection(c)
        construct(c, level)
    end

    proc construct(c, n)
        if n > MAXDEPTH
            exit
        end
        comment: evaluate classification accuracy on each binary split of classes c in parallel
        parallel for {binary splits of c}
            r = evaluate(c, featureSet)
        end
        comment: ChooseSplit finds the optimal class subset pair based on the set of r evaluations
        [cLeft, cRight] := ChooseSplit({r})
        comment: the maximum leaf node subset size is set to 4 to limit the maximum tree depth
        if size(cLeft) > 4
            construct(cLeft, n + 1)
        end
        if size(cRight) > 4
            construct(cRight, n + 1)
        end
    end

A schematic of the program flow is illustrated in Figure 9. Firstly the algorithm splits the current set of classes c into all combinations of pairs of disjoint subsets with size |c|/2 and then sends each combination to the performance evaluation stage. After evaluating all of the possible splits, the best subset pair, in terms of classification accuracy, is chosen and this split is used to construct two new child tree nodes. This procedure is iterated for both child branches until the stopping criterion is satisfied. Each subset classification accuracy performance evaluation at a given tree level is independent of every other split, and the evaluation tasks do not need to communicate. Furthermore, all tasks have the same work-flow yet have varying input: the subset class member combination. As a result, we find this process a good candidate for our semi-synchronised task farming framework and our HPC cluster. We assign each combination of class set split to a distributed parallel task. Each pair of subsets is then evaluated with an accuracy score in parallel (the accuracy score for each distributed task is found by taking the mean classification accuracy of the two subsets assigned to the task). After all distributed tasks in a superstep have concluded, we collect all of the mean accuracy scores and select the class split with the highest score (our superstep conclusion). Given true positive and false negative classifications, the mean accuracy (recall rate) per distributed task is defined as:

    AR = (1 / |c|) Σ_{j=1}^{|c|} TruePositivej / (TruePositivej + FalseNegativej)    (8)

where |c| is the number of image classes.
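Equation (8) is simply the per-class recall averaged over the classes assigned to a task; the following small Python helper makes the computation explicit, with hypothetical counts used purely for illustration.

    def mean_recall(true_positives, false_negatives):
        # Equation (8): average per-class recall TP_j / (TP_j + FN_j) over |c| classes.
        recalls = [tp / (tp + fn) for tp, fn in zip(true_positives, false_negatives)]
        return sum(recalls) / len(recalls)

    # Hypothetical counts for a 3-class split evaluation:
    print(mean_recall([90, 40, 10], [10, 10, 10]))  # (0.9 + 0.8 + 0.5) / 3 ≈ 0.733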

Figure 8. A classification tree automatically generated by our BGOT algorithm. The hierarchical classification strategy uses seven node classifiers to classify 15 classes (C1, ..., C15).

5.3.4 Experimental setup. We perform species classification experiments using 6875 fish images with a five-fold cross validation procedure. The training and testing sets are isolated such that fish images from the same trajectory sequence (containing the same fish) are not used during both training and testing. We extract 66 different image features for the classification task. These features are a combination of colour, shape and texture properties in varying local spatial areas of the fish images, such as the tail/head/upper/lower body area, as well as features collected from the entire fish body area. SFFS is applied to find an optimal feature subset to provide input for the classification task. We use an SVM variant for the classification task. Since SVMs were originally developed for the binary classification problem, we introduce a one-vs-one strategy with a voting mechanism to convert the binary SVM into a multi-class classifier (Chang and Lin, 2011). The mechanism is based on a classify-and-vote procedure. Specifically, each class is trained in a set of binary classifiers against each other class individually. The optimal BGOT result found is shown in Figure 8, where 15 classes are classified using a tree of depth three. See Huang et al. (2012) for further species classification details.
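The one-vs-one classify-and-vote conversion described above is also available off the shelf; the sketch below uses scikit-learn's SVC (which trains one binary classifier per pair of classes and predicts by voting) on a synthetic three-class toy problem standing in for the per-node species subsets. This is an illustration of the voting scheme only, not the LIBSVM-based setup used in the paper.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    # Toy 3-class problem: three well-separated Gaussian clusters in 2D.
    X = np.vstack([rng.normal(loc=c, size=(30, 2)) for c in (0.0, 3.0, 6.0)])
    y = np.repeat([0, 1, 2], 30)

    # One binary SVM is trained per pair of classes; prediction is by voting,
    # i.e. the one-vs-one conversion of a binary SVM into a multi-class classifier.
    clf = SVC(kernel="rbf", decision_function_shape="ovo").fit(X, y)
    print(clf.predict([[3.1, 2.9]]))   # expected to vote for class 1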

5.3.5 Performance evaluation. We explore the computational time requirements for executing our BGOT algorithm in a similar fashion to the previous applications deployed under our task farming framework. The most expensive superstep for this application is (by far) the initial superstep, involving the evaluation of C(15, 8) = 6435 possible pairs of image class subset splits. This initial step is therefore the section of the application that we focus our timing evaluation on during this experiment. As each subset split takes on average ~2 min of computational time to evaluate, we choose to perform the evaluation of a number of subset combinations in each distributed task. Explicitly, we evaluate the time and efficiency performance using experiments involving the distribution of 1, 25, 50 and 100 tasks in parallel for this large initial superstep. Using 15 image classes, this results in assigning 6435/1, 6435/25, 6435/50 and 6435/100 subset evaluations to each distributed parallel task during each experiment respectively. We focus here on timing results from the initial large superstep and therefore find that queueing (non-working) time will be minimal; we therefore display ECDF ideal time and not wall-clock time in Table 6. We show the ECDF ideal time metric (defined in Section 5.1.3) in Table 6 and note that we are again able to significantly decrease the required processing time, in relation to the single computational node case, by increasing the number of processors p invoked. By increasing the number of tasks distributed in parallel in the superstep (and therefore reducing the number of subset evaluations assigned to each task), we reduce the ECDF ideal time (and therefore increase our speed-up metric) in a near linear fashion, achieving speed-up values of S25 = 14.71, S50 = 27.91 and S100 = 46.42 in practice. While increasing the number of parallel tasks reduces both the ECDF ideal time and wall-clock time metrics in the case of the experiments performed here, we expect to find a limit to the efficiency of doing this in practice. We see from Table 6 that our efficiency metric (defined in Section 5.1.4) begins to drop as we increase the number of parallel tasks (and therefore the number of processors p invoked). For example, using our current multi-user SGE cluster, it is doubtful that assigning only a single two-minute SVM evaluation to each distributed task would provide further improvement as, given that we do not have access to 6435 processors in parallel, queueing time in practice would likely begin to counteract the linear speed-up improvement we observe in the experiments performed here. We leave finding the optimal trade-off between speed-up and efficiency (i.e. the optimal number of image class subset evaluations to assign per distributed task) to future work.

Figure 9. The algorithm to generate our BGOT. At each tree level, we select the optimal disjoint and balanced class subset split by exhaustively searching all possible splitting combinations. Each set of algorithm stages within a dashed area represents a superstep that is distributed to our cluster in parallel.

By applying our task farming framework to this problem we are able to effectively evaluate > 6500 BGOT graphs and find the graph configuration that is able to classify 15 species of fish with the highest accuracy. Using our task farming approach reduces the time needed in practice for this evaluation from > 260 h (using a single compute node) to under 6 h when making use of an SGE cluster (p = 100). By distributing this process with our task farming framework we have been able to easily experiment with and extend our species classification system (e.g. to include further fish species), even though this involves BGOT re-evaluation that would prove extremely time-consuming if only a serial implementation were available.

6 Discussion

In this paper, we formulate a semi-synchronised task farming framework for solving computationally intensive problems where independent problem components can be distributed across an HPC cluster. Results are collated to inform following rounds of task distribution, eventually leading to a global problem solution. Our contributions include the development of a model to predict overall application completion time for problems that are formulated using our framework. We validate this model using simulation and experimental results and find it to be sufficiently accurate, providing a simple tool that can be utilised when planning the time requirements of computationally expensive applications. Further to this, we study the performance enhancement obtained by utilising our framework in practice to guide the algorithmic design of several computationally expensive computer vision problems and compare the throughput using our framework with that of solutions making use of only a single compute node. In each example provided we find near linear speed-up improvements in the number of participating processors p over the related serial implementations. Also, in the case of each real-world problem investigated, we are able to provide model predictions for computation time that are typically within ~10% of the execution time required in practice.

Based on our experimental results we show that processing large data sets using algorithms formulated with our framework, and deployed on an HPC cluster, obtains significant time savings over single node computation due to vast gains in terms of speed-up. We note that in practice the human effort required to move from an original serial algorithm implementation to a distributed task farming application is very reasonable. By making use of SGE to handle the task queueing system and allowing developers to concentrate on domain specific problem aspects, we are typically able to completely convert a serial code on the order of days. By also employing user-friendly languages for parallel programming, master–slave communication is hidden from the developer, allowing them to again focus solely on domain specific problems.

Distributed computing on HPC clusters offers an attractive option for our framework when compared to expensive integrated mainframe solutions. The main advantages of HPC clustering include distributed robustness and the ease of cluster scalability.

Table 6. We generate BGOTs whilst varying the number of potential graph node subset evaluations per distributed task (node). We are able to improve speed-up by increasing the number of participating processors p at the cost of efficiency. The difference between our model predictions and measured computational time costs is within ~10% of the true value.

                                         CPUs (K)   Distributed ECDF     Model prediction      Sp      Ep
                                                    ideal time (hours)   (eq. (5)) (hours)
    6435 subset evaluations per node         1           260.42                N/A             1.00    1.00
    257 subset evaluations per node         25            18.70               20.89           14.71    0.59
    128 subset evaluations per node         50             9.33               10.23           27.91    0.56
    64 subset evaluations per node         100             5.61                5.64           46.42    0.46


When using an HPC cluster to accelerate the rate at which we are able to solve computationally expensive problems, the factors of data set size and algorithm design play important roles in determining the degree of success in parallelising an application. Our framework allows the performance of a distributed program on a given architecture to be predictable. Using our framework, simple timing parameters from the algorithm under evaluation allow us to reason about program design at an early stage.

All implementation examples presented in this work make use of MATLAB and we find that the prerequisites for writing parallel code under the DCT from MathWorks are relatively low. There is no need for the developer to instruct cluster machines how to communicate, which part of the code to execute and how to assemble end results. We find that this provides a straightforward and intuitive approach to parallelising computationally demanding applications in a reasonable time frame. Parallelisation under this simple task farming framework results in potentially huge time savings without requiring extensive task or data parallelism knowledge. Possible extensions and interesting avenues of future work include implementing solutions using our framework with faster compiled languages (e.g. C/C++) and applying such solutions to time critical applications. Additionally, extending our performance modelling treatment to account for heterogeneous processors would likely improve the model's predictive power. Related extensions might take the form of re-examining individual task time fitting using more sophisticated distributions to improve modelling in the heterogeneous processor case (e.g. employing distribution mixtures). Finally, during the experimental work performed here it was noted that in practice there is often contention between speed-up and efficiency. In future we aim to find optimal trade-off generalisations from the specific cases presented here. In summary, this work highlights a range of demanding vision applications that a straightforward parallelisation strategy such as ours can contribute to solving, whilst offering vast computational time savings.

Acknowledgements

We thank Murray Cole and Bastiaan Boom for helpful discussion. We also thank the anonymous reviewers for their valuable comments and suggestions to improve the quality of the paper.

Funding

This work was supported by the Fish4Knowledge project, which is funded by the European Union 7th Framework Programme (grant number FP7/2007-2013) and by the EPSRC (grant number EP/P504902/1).

References

Abdelzaher T, Thaker G and Lardieri P (2004) A feasible region for meeting aperiodic end-to-end deadlines in resource pipelines. In: Proceedings of the 24th international conference on distributed computing systems (ICDCS '04), Tokyo, 23–26 March 2004, pp.436–445. Washington, DC: IEEE Computer Society. Available at: http://dl.acm.org/citation.cfm?id=977400.977975.
Adve V, Bagrodia R, Browne J, et al. (2000) POEMS: End-to-end performance design of large parallel adaptive computational systems. Software Engineering 26(11): 1027–1048.
Baker M, Carpenter B and Shafi A (2006) MPJ Express: Towards thread safe Java HPC. In: IEEE international conference on cluster computing (Cluster 2006), Barcelona, Spain, 25–28 September 2006, pp.25–28.
Bashir F, Khokhar A and Schonfeld D (2006) View-invariant motion trajectory based activity classification and recognition. Multimedia Systems 12(1): 45–54.
Beyan C and Fisher R (2013) Detecting abnormal fish trajectories using clustered and labelled data. In: Proceedings of the IEEE international conference on image processing.
Blum A and Langley P (1997) Selection of relevant features and examples in machine learning. Artificial Intelligence 97(1–2): 245–271.
Buyya R, Murshed M and Abramson D (2002) A deadline and budget constrained cost–time optimisation algorithm for scheduling task farming applications on global grids. Technical report, Monash University. Available at: http://arxiv.org/pdf/cs/0203020.pdf.
Cai L and Hofmann T (2004) Hierarchical document categorization with support vector machines. In: Proceedings of the 13th ACM international conference on information and knowledge management, pp.78–87.
Casanova H, Kim M, Plank J, et al. (1999) Adaptive scheduling for task farming with grid middleware. International Journal of High Performance Computing 13(3): 231–240.
Casanova H, Obertelli G, Berman F, et al. (2000) The AppLeS parameter sweep template: User-level middleware for the grid. In: Proceedings of the 2000 ACM/IEEE conference on supercomputing (Supercomputing '00), Rhode Island, NY, 4–10 November 2000, p.60. Washington, DC: IEEE Computer Society. Available at: http://dl.acm.org/citation.cfm?id=370049.370499.
Chang C and Lin C (2011) LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2(3): 1–27.
Cole M (1991) Algorithmic Skeletons: Structured Management of Parallel Computation. Cambridge, MA: MIT Press.
Cortes C and Vapnik V (1995) Support-vector networks. Machine Learning 20(3): 273–297.
Dean J and Ghemawat S (2008) MapReduce: Simplified data processing on large clusters. Communications of the ACM 51(1): 107–113. doi: 10.1145/1327452.1327492.
Duan K and Sathiya K (2005) Which is the best multiclass SVM method? An empirical study. In: Proceedings of the 6th international conference on multiple classifier systems (MCS '05), pp.278–285.
Eager D, Zahorjan J and Lazowska E (1989) Speedup versus efficiency in parallel systems. IEEE Transactions on Computers 38(3): 408–423. doi: 10.1109/12.21127.


Elwasif W, Plank J and Wolski R (2001) Data staging effects in wide area task farming applications. In: Proceedings of the IEEE international symposium on cluster computing and the grid, Brisbane, Australia.
Foster I (1994) Task parallelism and high-performance languages. IEEE Parallel & Distributed Technology 2(3): 27–36. doi: 10.1109/M-PDT.1994.329794.
Frank M, Agarwal A and Vernon M (1997) LoPC: Modeling contention in parallel algorithms. In: Proceedings of the 6th ACM SIGPLAN symposium on principles and practice of parallel programming (PPOPP '97), pp.276–287. New York, NY: ACM.
Gentzsch W (2001) Sun Grid Engine: Towards creating a compute power grid. In: Proceedings of the 1st international symposium on cluster computing and the grid (CCGRID '01), pp.35–43. Washington, DC: IEEE Computer Society.
Guo G and Dyer C (2003) Simultaneous feature selection and classifier training via linear programming: A case study for face expression recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp.346–352.
Hammond S, Mudalige G, Smith J, et al. (2010) WARPP: A toolkit for simulating high-performance parallel scientific codes. In: Proceedings of the 2nd international ICST conference on simulation tools and techniques. ACM. doi: 10.4108/ICST.SIMUTOOLS2009.5753.
Hammond S, Smith J, Mudalige G, et al. (2009) Predictive simulation of HPC applications. In: Proceedings of the IEEE 23rd international conference on advanced information networking and applications (AINA-09).
Hillis WD and Steele GL (1986) Data parallel algorithms. Communications of the ACM 29(12): 1170–1183. doi: 10.1145/7902.7903.
Huang J, Cai Y and Xu X (2006) A filter approach to feature selection based on mutual information. In: Proceedings of the IEEE international conference on cognitive informatics, pp.84–89.
Huang P, Boom B, He J, et al. (2012) Underwater live fish recognition using balance-guaranteed optimized tree. In: Proceedings of the 11th Asian conference on computer vision (ACCV 2012), Vol. 7724, pp.422–433.
Isard M, Budiu M, Yu Y, et al. (2007) Dryad: Distributed data-parallel programs from sequential building blocks. In: Proceedings of the 2nd ACM SIGOPS/EuroSys European conference on computer systems 2007 (EuroSys '07), pp.59–72. New York: ACM. doi: 10.1145/1272996.1273005.
Kerbyson D, Hoisie A and Wasserman H (2005) Use of predictive performance modeling during large-scale system installation. Parallel Processing Letters 15(4): 387–395. doi: 10.1142/S0129626405002301.
Kohavi R and John G (1997) Wrappers for feature subset selection. Artificial Intelligence 97(1–2): 273–324.
Labarta J, Girona S and Cortes T (1997) Analysing scheduling policies using DIMEMAS. Parallel Computing 23(1).
Liu H (2005) Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering 17(4): 491–502.
Liwicki M and Bunke H (2006) HMM-based on-line recognition of handwritten whiteboard notes. In: Proceedings of the 10th international workshop on frontiers in handwriting recognition, pp.595–599.
McColl W (1993) General purpose parallel computing. In: Gibbons AM and Spirakis P (eds) Lectures on Parallel Computation (Cambridge International Series on Parallel Computation). Cambridge; New York: Cambridge University Press, pp.337–391.
McDonagh S and Fisher R (2013) Simultaneous registration of multi-view range images with adaptive kernel density estimation. Under review.
McDonagh S, Fisher R and Rees J (2008) Using 3D information for classification of non-melanoma skin lesions. In: Proceedings of medical image understanding and analysis, Dundee, pp.164–168.
Mudalige G, Vernon M and Jarvis S (2008) A plug-and-play model for evaluating wavefront computations on parallel architectures. In: Proceedings of the IEEE international symposium on parallel and distributed processing (IPDPS 2008), pp.1–14. doi: 10.1109/IPDPS.2008.4536243.
Nudd G, Kerbyson D, Papaefstathiou E, et al. (2000) PACE: A toolset for the performance prediction of parallel and distributed systems. International Journal of High Performance Computing Applications 14(3): 228–251.
Pllana S and Fahringer T (2005) Performance Prophet: A performance modeling and prediction tool for parallel and distributed programs. In: Proceedings of the 2005 international conference on parallel processing (ICPP-05), Oslo, pp.509–516.
Poldner M and Kuchen H (2008) On implementing the farm skeleton. Parallel Processing Letters: 117–131.
Pottmann H, Leopoldseder S and Hofer M (2002) Simultaneous registration of multiple views of a 3D object. Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences: 265–270.
Revenga P, Serot J, Lazaro J, et al. (2003) A Beowulf-class architecture proposal for real-time embedded vision. In: Proceedings of the 17th international symposium on parallel and distributed processing (IPDPS '03), pp.8–16. Washington, DC: IEEE Computer Society. Available at: http://dl.acm.org/citation.cfm?id=838237.838308.
Roth V and Lange T (2004) Adaptive feature selection in image segmentation. Pattern Recognition (3175): 9–17.
Silva L, Veer B and Silva J (1993) How to get a fault-tolerant farm. In: Proceedings of the world transputer congress, Aachen, Germany, pp.923–938.
Skillicorn D, Hill J and McColl W (1997) Questions and answers about BSP. Scientific Programming 6: 249–274.
Spooner D, Jarvis S, Cao J, et al. (2003) Local grid scheduling techniques using performance prediction. IEE Proceedings – Computers and Digital Techniques 150: 87–96.
Suk T and Flusser J (2004) Graph method for generating affine moment invariants. In: Proceedings of the international conference on pattern recognition, pp.192–195.
Thain D, Tannenbaum T and Livny M (2005) Distributed computing in practice: The Condor experience. Concurrency and Computation: Practice and Experience 17(2–4): 323–356. doi: 10.1002/cpe.938.
Toldo R, Beinat A and Crosilla F (2010) Global registration of multiple point clouds embedding the generalized procrustes analysis into an ICP framework. In: Proceedings of the conference on 3D data processing, visualization and transmission.
Torsello A, Rodola E and Albarelli A (2011) Multi-view registration via graph diffusion of dual quaternions. In: IEEE conference on computer vision and pattern recognition, pp.2441–2448.
Valiant LG (1990) A bridging model for parallel computation. Communications of the ACM 33(8): 103–111. doi: 10.1145/79173.79181.
Wang Q, Li B and Hu J (2009) Feature selection for human resource selection based on affinity propagation and SVM sensitivity analysis. In: Proceedings of the world congress on nature and biologically inspired computing (NaBIC 09), pp.31–36. doi: 10.1109/NABIC.2009.5393596.
Wang Y and Casasent D (2009) A support vector hierarchical method for multi-class classification and rejection. In: Proceedings of the international joint conference on neural networks (IJCNN), pp.3281–3288.
Weston J, Mukherjee S, Chapelle O, et al. (2000) Feature selection for SVMs. Advances in Neural Information Processing Systems 13: 668–674. Available at: http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.102.7476.
Yang A, Wright J, Ma Y, et al. (2007) Feature selection in face recognition: A sparse representation perspective. UC Berkeley Tech Report UCB/EECS-2007-99.

Author biographies

Steven McDonagh received the BSc degree in Computer Science and Artificial Intelligence from The University of Edinburgh in 2008. He is currently a PhD candidate in the School of Informatics, University of Edinburgh. His interests span a variety of topics in computer vision, image processing and machine learning. His current work focuses on the analysis and implementation of multi-view registration algorithms, range data processing and geometric modelling.

Cigdem Beyan received her BEng degree in Computer Engineering from Baskent University, Turkey, in 2008, and her MSc degree in Informatics from the Middle East Technical University, Turkey, in 2010. She is now a PhD candidate in the School of Informatics, Institute of Perception, Action and Behaviour at The University of Edinburgh, UK. She received the Edinburgh Global Overseas Research Scholarship and the Principal's Career Development Scholarship in the career area of teaching. She worked on fish trajectory analysis for the Fish4Knowledge project. Her primary research interests are computer vision and machine learning, behaviour understanding, trajectory analysis, anomaly detection, classification of imbalanced data and active learning.

Phoenix X Huang received his BEng degree in Electronic Engineering in 2006 and his MEng degree in Signal and Information Engineering in 2009 from The University of Science and Technology of China. He is currently working toward the PhD degree in the School of Informatics at the University of Edinburgh. His research interests include object recognition, computer vision and image processing.

Robert B Fisher FIAPR, FBMVA received a BS (Mathematics, California Institute of Technology, 1974), an MS (Computer Science, Stanford, 1978) and a PhD (Edinburgh, 1987). Since then, Bob has been an academic at Edinburgh University. His research covers topics in high-level and 3D computer vision, focussing on reconstructing geometric models from existing examples, which contributed to a spin-off company, Dimensional Imaging. More recently, he has also been researching video sequence understanding, in particular attempting to understand observed animal behaviour. The research has led to 13 authored or edited books and about 250 peer-reviewed scientific articles. He has developed several on-line computer vision resources, with over 1 million hits. Most recently, he has been the coordinator of an EC STREP project acquiring and analysing video data of 1.4 billion fish from about 20 camera-years of undersea video of tropical coral reefs. He is a Fellow of the International Association for Pattern Recognition (2008) and the British Machine Vision Association (2010).
