GPU accelerated pivoting rules for the simplex algorithm

Nikolaos Ploskas, Nikolaos Samaras*

Department of Applied Informatics, School of Information Sciences, University of Macedonia, 156 Egnatia Str., 54006 Thessaloniki, Greece

The Journal of Systems and Software 96 (2014) 1–9. http://dx.doi.org/10.1016/j.jss.2014.04.047. © 2014 Elsevier Inc. All rights reserved.

Article history: Received 1 October 2013; Received in revised form 10 March 2014; Accepted 29 April 2014; Available online 9 May 2014.

Keywords: Linear programming; Simplex algorithm; Pivoting rules; Graphical Processing Unit; MATLAB; Compute Unified Device Architecture

* Corresponding author. Tel.: +30 2310 891866; fax: +30 2310 891879. E-mail address: [email protected] (N. Samaras).

Abstract

Simplex type algorithms perform successive pivoting operations (or iterations) in order to reach the optimal solution. The choice of the pivot element at each iteration is one of the most critical steps in simplex type algorithms. The flexibility of the entering and leaving variable selection allows the development of various pivoting rules. In this paper, we propose GPU-based implementations of some of the most well-known pivoting rules for the revised simplex algorithm on a CPU–GPU computing environment. All pivoting rules have been implemented in MATLAB and CUDA. Computational results on randomly generated optimal dense linear programs and on a set of benchmark problems (Netlib-optimal, Kennington, Netlib-infeasible, Mészáros) are also presented. These results show that the proposed GPU implementations of the pivoting rules outperform the corresponding CPU implementations.

1. Introduction

Linear programming (LP) is perhaps the most important and well-studied optimization problem. Many real-world problems can be formulated as linear programming problems (LPs). LP is the process of minimizing or maximizing a linear objective function z = Σ_{j=1}^{n} c_j x_j subject to a number of linear equality and inequality constraints. The simplex algorithm is the most widely used method for solving LPs. Consider the following linear program (LP.1) in the standard form:

    min  c^T x
    s.t. Ax = b        (LP.1)
         x ≥ 0

where A ∈ R^{m×n}, (c, x) ∈ R^n, b ∈ R^m, and T denotes transposition. We assume that A has full rank, rank(A) = m, m < n. Consequently, the linear system Ax = b is consistent. The simplex algorithm searches for an optimal solution by moving from one feasible solution to another, along the edges of the feasible region. The dual problem associated with (LP.1) is presented in (DP.1):

    max  b^T w
    s.t. A^T w + s = c        (DP.1)
         s ≥ 0

where w ∈ R^m and s ∈ R^n.

A critical step in solving a linear program is the selection of the entering variable in each iteration. Good choices of the entering variable can lead to fast convergence to the optimal solution, while poor choices lead to more iterations and worse execution times, or even to failure to solve the LP. The pivoting rule applied in a simplex type algorithm is the key factor that determines the number of iterations the algorithm performs (Maros and Khaliq, 1999). Different pivoting rules yield different basis sequences in the simplex algorithm. Many pivoting rules have been proposed in the literature. Six of them are implemented and compared in this paper, namely: (i) Bland's rule, (ii) Dantzig's rule, (iii) Greatest Increment Method, (iv) Least Recently Considered Method, (v) Partial Pricing Rule, and (vi) Steepest Edge rule.
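To make the standard form concrete, here is a minimal MATLAB sketch (the tiny data are made up for illustration and are not from the paper) that sets up an instance of (LP.1) and solves it with MATLAB's built-in linprog:

    % Small standard-form LP: min c'*x s.t. Ax = b, x >= 0
    A  = [1 1 1 0;                % m = 2 constraints, n = 4 variables
          2 1 0 1];               % (the last two columns act as slacks)
    b  = [4; 5];
    c  = [-3; -2; 0; 0];
    lb = zeros(4, 1);             % enforces x >= 0
    [x, z] = linprog(c, [], [], A, b, lb, []);   % z is the optimal objective value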
Other well-known pivoting rules include Devex (Harris, 1973), Modified Devex (Benichou et al., 1977), the Steepest Edge approximation scheme (Świetanowski, 1998), Murty's Bard type scheme (Murty, 1974), the Edmonds-Fukuda rule (Fukuda, 1982) and its variants (Clausen, 1987; Wang, 1991; Zhang, 1991; Ziegler, 1990).

Forrest and Goldfarb (1992) presented several new serial implementations of the Steepest Edge pivoting rule and compared them



with Devex variants and Dantzig's rule over large LPs, and concluded that the steepest-edge variants are clearly superior to the Devex variants and Dantzig's rule for solving difficult large-scale LPs. Thomadakis (1994) implemented and compared serial implementations of five pivoting rules: (i) Dantzig's rule, (ii) Bland's rule, (iii) the Least-Recently Considered Method, (iv) the Greatest-Increment Method, and (v) the Steepest-Edge rule. Thomadakis examined the trade-off between the number of iterations and the execution time per iteration and concluded that: (i) Bland's rule has the shortest execution time per iteration, but it usually needs many more iterations than the other pivoting rules to converge to the optimal solution of a linear program, (ii) Dantzig's rule and the Least Recently Considered Method perform comparably, but the latter requires fewer iterations when degenerate pivots exist, (iii) the Greatest Increment Method has the worst execution time per iteration, but it usually needs fewer iterations to converge to the optimal solution, and (iv) the Steepest-Edge rule requires fewer iterations than all the other pivoting rules, and its execution time per iteration is lower than that of the Greatest Increment Method but higher than that of the other three pivoting rules.

Ploskas and Samaras (2013) implemented and compared serial implementations of eight pivoting rules over small- and medium-sized Netlib LPs: (i) Bland's rule, (ii) Dantzig's rule, (iii) Greatest Increment Method, (iv) Least Recently Considered Method, (v) Partial Pricing Rule, (vi) Queue rule, (vii) Stack rule, and (viii) Steepest Edge rule. In contrast to Thomadakis, Ploskas and Samaras also examined the total execution time of the simplex algorithm in relation to the pivoting rule used and concluded that: (i) with a limit of 70,000 iterations, only Dantzig's rule solved all instances of the test set, while Bland's rule solved 45 out of 48 instances, the Greatest Increment Method 46 out of 48, the Least Recently Considered Method 45 out of 48, Partial Pricing 45 out of 48, the Queue rule 41 out of 48, the Stack rule 43 out of 48, and the Steepest Edge rule 46 out of 48, (ii) Dantzig's rule requires the shortest execution time both on average and on almost all instances, while the Steepest Edge rule has the worst execution time both on average and on almost all instances, and (iii) despite its computational cost, the Steepest Edge rule needs the fewest iterations of all the pivoting rules, while Bland's rule is by far the worst pivoting rule in terms of the number of iterations.

The increasing size of real-life LPs demands more computational power and parallel computing capabilities. Recent hardware advances have made it possible to solve large LPs in a short amount of time. Graphical Processing Units (GPUs) have gained a lot of popularity in the past decade and have been applied to scientific computing applications. The GPU is utilized for the data-parallel and computationally intensive portions of an algorithm. Two major general purpose programming languages exist for GPUs: CUDA (Compute Unified Device Architecture) (NVIDIA, 2013) and OpenCL (Open Computing Language) (The Khronos OpenCL Working Group, 2013). These programming languages are based on the stream processing model. CUDA was introduced in late 2006 and is only available on NVIDIA GPUs, while OpenCL was introduced in 2008 and is available on GPUs of different vendors and even on CPUs. In this paper, we have implemented the GPU-based pivoting rules using CUDA, because it is currently more mature.

GPUs have been successfully applied to the solution of LPs using the simplex algorithm. Jung and O'Leary (2008) proposed a CPU–GPU implementation of the Interior Point Method for dense LPs, and their computational results showed a speedup of up to 1.4 on medium-sized Netlib LPs (Gay, 1985) compared to the corresponding CPU implementation. Spampinato and Elster (2009) presented a GPU-based implementation of the revised simplex algorithm using the NVIDIA CUBLAS and NVIDIA LAPACK libraries. Their implementation showed a maximum speedup of 2.5 on large random LPs compared to the corresponding CPU implementation. Bieling et al. (2010) also proposed an implementation of the revised simplex algorithm on a GPU. They compared their GPU-based implementation with the serial GLPK solver and reported a maximum speedup of 18 in single precision. Lalami et al. (2011a) proposed an implementation of the tableau simplex algorithm on a CPU–GPU system. Their computational results on randomly generated dense problems showed a maximum speedup of 12.5 compared to the corresponding CPU implementation. Lalami et al. (2011b) extended their previous work (Lalami et al., 2011a) to a multi-GPU implementation, and their computational results on randomly generated dense problems showed a maximum speedup of 24.5. Li et al. (2011) presented a GPU-based parallel algorithm, based on Gaussian elimination, for large scale LPs that outperforms the CPU-based algorithm. Meyer et al. (2011) proposed a mono- and a multi-GPU implementation of the tableau simplex algorithm and compared their implementation with the serial CLP solver. Their implementation outperformed the CLP solver on large sparse LPs. Gade-Nielsen et al. (2012) presented a GPU-based interior point method; their computational results showed a speedup of 6 on randomly generated dense LPs and a speedup of 2 on randomly generated sparse LPs compared to MATLAB's built-in function linprog. Smith et al. (2012) proposed a GPU-based interior point method, and their computational results showed a maximum speedup of 1.27 on large sparse LPs compared to the corresponding multi-core CPU implementation.

To the best of our knowledge, Forrest's and Goldfarb's paper (Forrest and Goldfarb, 1992), Thomadakis' report (Thomadakis, 1994) and Ploskas' and Samaras' paper (Ploskas and Samaras, 2013) are the only works that compare some of the most widely-used pivoting rules. Yet, there has been no attempt to implement and compare pivoting rules using GPUs. This paper builds on the work of Ploskas and Samaras (2013). The novelty of this paper is that we propose GPU-based implementations of six pivoting rules and perform a detailed computational study on large-scale randomly generated optimal dense LPs and on a set of benchmark LPs from various sources. The aim of the computational study is to establish the practical value of the GPU-based implementations. We do not aim to implement a robust serial simplex solver; this is not achievable here due to the use of dense algebra. Rather, the main aim is to investigate whether some of the well-known pivoting rules are suitable for execution on GPUs. The computational results demonstrate that the proposed GPU implementations outperform the CPU implementations of the six pivoting rules.

The structure of the paper is as follows. In Section 2, six pivoting rules are presented and analyzed. Section 3 presents the GPU-based implementations of these pivoting rules. In Section 4, the computational comparison of the CPU- and GPU-based implementations on a set of randomly generated optimal dense LPs and a set of benchmark instances is presented. Finally, the conclusions of this paper are outlined in Section 5.

2. Pivoting rules

Six pivoting rules are presented in this section: (i) Bland's rule, (ii) Dantzig's rule, (iii) Greatest Increment Method, (iv) Least Recently Considered Method, (v) Partial Pricing Rule, and (vi) Steepest Edge Rule. Some necessary notation should be introduced before the presentation of the aforementioned pivoting rules. Let l be the index of the entering variable and c_l be the difference in the objective value when the non-basic variable x_l is increased by one unit and the basic variables are adjusted appropriately. The reduced cost of a variable is the amount by which its objective coefficient must improve before that variable can take a positive value in an optimal solution.
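For reference, the reduced costs that all six rules inspect can be computed as in the following hedged MATLAB sketch, using the notation introduced in Section 3 (BasisList, holding the indices of the basic variables, is our own auxiliary name and is not defined in the paper):

    % Reduced costs of the non-basic variables (Sn aligned with NonBasicList)
    nb = NonBasicList(:);            % non-basic indices as a column vector
    cB = c(BasisList(:));            % basic objective coefficients, m x 1
    w  = BasisInv' * cB;             % simplex multipliers, m x 1
    Sn = (c(nb) - A(:, nb)' * w)';   % 1 x (n-m) reduced-cost vector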


2.1. Bland's rule

Bland's rule (Bland, 1977) selects as entering variable the first among the eligible ones, i.e. the leftmost among the columns with a negative relative cost coefficient. Whereas Bland's rule avoids cycling, it has been observed in practice that this pivoting rule can lead to stalling, a phenomenon where long degenerate paths are produced.
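Under the notation above (Sn aligned with NonBasicList), Bland's rule reduces to a single MATLAB expression; a hedged sketch:

    % Bland's rule: first (leftmost) eligible column
    index = find(Sn < 0, 1, 'first');   % empty if the current basis is optimal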

2.2. Dantzig's rule

Dantzig's rule or largest coefficient rule (Dantzig, 1963) is the first pivoting rule that was proposed for the simplex algorithm. It has been widely used in simplex implementations (Bazaraa et al., 1990; Papadimitriou and Steiglitz, 1982). This pivoting rule selects the column A_l with the most negative c_l. Dantzig's rule ensures the largest reduction in the objective value per unit increase of the non-basic variable x_l. Although its worst-case complexity is exponential (Klee and Minty, 1972), Dantzig's rule is considered simple but powerful enough to guide the simplex algorithm along short paths (Thomadakis, 1994).
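A hedged MATLAB sketch of Dantzig's rule under the same notation:

    % Dantzig's rule: most negative reduced cost
    [minSn, index] = min(Sn);
    if minSn >= 0
        index = [];   % no eligible column: the current basis is optimal
    end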

2.3. Greatest Increment Method

The entering variable in the Greatest Increment Method (Klee and Minty, 1972) is the one with the largest total objective value improvement. The improvement of the objective value is computed for each non-basic variable, and the variable that offers the largest improvement is selected as the entering variable. This pivoting rule can lead to fast convergence to the optimal solution. However, this advantage is offset by the additional computational cost per iteration. Finally, Gärtner (1995) constructed LPs on which the Greatest Increment Method shows exponential complexity.
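The following hedged serial sketch illustrates the extra work per iteration (notation as in Section 3; this is our reading of the method, not the authors' code):

    % Greatest Increment Method: pick the candidate with the largest
    % total objective decrease (objective changes are <= 0)
    bestDecrease = 0;
    index = 0;                      % 0 means "no eligible column found"
    xb = Xb(:);                     % basic variable values as a column
    for i = 1:length(NonBasicList)
        if Sn(i) < 0                                % eligible candidate
            h = BasisInv * A(:, NonBasicList(i));   % updated column
            mrt = h > 0;                            % rows in the ratio test
            if any(mrt)
                theta = min(xb(mrt) ./ h(mrt));     % step length
                decrease = theta * Sn(i);           % objective change
                if decrease < bestDecrease          % most negative wins
                    bestDecrease = decrease;
                    index = i;
                end
            end
        end
    end

The ratio test inside the loop is what makes each iteration expensive compared to Dantzig's rule.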

2.4. Least Recently Considered Method

In the first iteration of the Least Recently Considered Method (Zadeh, 1980), the entering variable l is selected according to Dantzig's rule (i.e. the column A_l with the most negative c_l). In the following iterations, the method starts searching for the first eligible variable with index greater than l. If l = n, then the Least Recently Considered Method starts searching from the first column again. The Least Recently Considered Method prevents stalling and has been shown to perform fairly well in practice (Thomadakis, 1994). However, its worst-case complexity has not been established yet.
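A hedged sketch of the search performed after the first iteration (which, as described above, uses Dantzig's rule); lrcmLast holds the index selected in the previous iteration:

    % Least Recently Considered Method: first eligible index past lrcmLast
    eligible = find(Sn < 0);                      % ascending candidate indices
    index = min(eligible(eligible > lrcmLast));   % first candidate past lrcmLast
    if isempty(index)
        index = min(eligible);   % wrap around; still empty => optimal basis
    end
    lrcmLast = index;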

2.5. Partial Pricing Rule

Full pricing considers the selection of the entering variable from all eligible variables. Partial pricing methods are variants of the standard rules that take only a part of the non-basic variables into account. In the computational study presented in Section 4, we have implemented the Partial Pricing Rule as a variant of Dantzig's rule. In static partial pricing, the non-basic variables are divided into equal segments of predefined size; the pricing operation is then carried out segment by segment until a negative reduced cost is found. In dynamic partial pricing, the segment size is determined dynamically during the execution of the algorithm.

2.6. Steepest Edge Rule

Steepest Edge Rule or All-Variable Gradient Method (Goldfarb and Reid, 1977) selects as entering variable the variable with the most objective value reduction per unit distance, as shown in Eq. (1):

    d_l = min { c_l / sqrt(1 + Σ_{i=1}^{m} x_il^2) : l = 1, 2, ..., n }        (1)

This pivoting rule can lead to fast convergence to the optimal solution. However, this advantage is debatable due to the additional computational cost. Approximate methods have been proposed in order to improve the computational efficiency of this method (Świetanowski, 1998; Vanderbei, 2001).
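Because the denominators in Eq. (1) are column norms of the updated non-basic columns, the rule vectorizes naturally; a hedged MATLAB sketch (notation as in Section 3):

    % Steepest Edge over all non-basic columns
    Y  = BasisInv * A(:, NonBasicList);  % updated non-basic columns, m x (n-m)
    dj = sqrt(1 + sum(Y.^2, 1));         % denominators of Eq. (1), 1 x (n-m)
    rj = Sn ./ dj;                       % objective reduction per unit distance
    [~, index] = min(rj);                % entering variable (if rj(index) < 0)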

3. GPU accelerated pivoting rules

In this section, we present the GPU-based implementations of the aforementioned pivoting rules, taking advantage of the computational power that modern GPUs offer. The parallel implementations of these pivoting rules are written in MATLAB and CUDA. MATLAB's GPU support was added in version R2011a. MATLAB's GPU library does not support sparse matrices, so all matrices are stored as dense (including zero elements).

3.1. GPU architecture

This section briefly describes the architecture of an NVIDIA GPU in terms of hardware and software. A GPU is a many-core processor with thousands of threads running concurrently. A GPU has many cores aligned in a particular way, forming a single hardware unit. Data-parallel algorithms are well suited for such devices, since the hardware can be classified as SIMT (Single Instruction, Multiple Threads). GPUs outperform CPUs in terms of GFLOPS (giga floating point operations per second). For example, concerning the equipment utilized in the computational study presented in Section 4, a high-end Core i7 processor at 3.46 GHz delivers a peak of 55.36 GFLOPS, while a high-end NVIDIA Quadro 6000 delivers a peak of 1030.4 GFLOPS.

NVIDIA CUDA is an architecture that manages data-parallel computations on a GPU. A CUDA program includes two portions, one that is executed on the CPU and another that is executed on the GPU. The CPU should be viewed as the host, while the GPU should be viewed as a co-processor. The code that can be parallelized is executed on the GPU as kernels, while the rest is executed on the CPU. The CPU starts the execution of each portion of code and invokes a kernel function, whereupon execution moves to the GPU. The connection between CPU memory and GPU memory is a fast PCIe x16 point-to-point link. The code that is executed on the GPU is divided into many threads, each of which executes the same code independently on different data. A thread block is a group of threads that cooperate via shared memory and synchronize their execution to coordinate their memory accesses; a grid consists of a group of thread blocks, and a kernel (the code that results from compilation) is executed on a grid. The NVIDIA Quadro 6000, which was used in our computational experiments, consists of 14 streaming multiprocessors with 32 cores each, resulting in 448 cores in total. A typical use of a stream is that the GPU schedules a memory copy of data from the CPU to the GPU, a kernel launch, and a copy of results from the GPU back to the CPU. A high level description of the GPU architecture is shown in Fig. 1.
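Although the paper's kernels are reached through MATLAB and CUDA, the same host/device pattern can be illustrated from the MATLAB side with the Parallel Computing Toolbox; a hedged sketch (data made up):

    % Host/device round trip: gpuArray moves data to the device, arrayfun
    % compiles the function body into a kernel applied by many threads,
    % and gather copies the result back to the host.
    x  = rand(1e6, 1);                  % host data
    xG = gpuArray(x);                   % CPU -> GPU transfer
    yG = arrayfun(@(v) v^2 + 1, xG);    % executes as a kernel on the GPU
    y  = gather(yG);                    % GPU -> CPU transfer of the results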

3.2. Implementation of GPU-based pivoting rules

Fig. 2 presents the process that is performed in the GPU-based implementation of the simplex algorithm. Firstly, the CPU initializes the algorithm by reading all the necessary data. In the second step, the CPU computes a feasible solution and checks if the linear problem is optimal. If the linear problem is optimal, the algorithm terminates; if it is not, the CPU transfers the adequate variables to the GPU. In step 3, the GPU finds the index of the entering variable according to a pivoting rule and then transfers the index to the CPU. The CPU checks if the linear problem is unbounded, in which case the algorithm terminates. Otherwise, it finds the index of the leaving variable and updates the basis and all the necessary variables. Then, the algorithm continues with the next iteration until a solution is found.
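A hedged MATLAB skeleton of this loop, using Dantzig's rule as the GPU step and eliding the basis update:

    % Skeleton of the iteration loop in Fig. 2 (names follow Section 3)
    while true
        if all(Sn >= 0)                    % step 2: optimality test (CPU)
            break;
        end
        SnG = gpuArray(Sn);                % transfer pricing data to the GPU
        [~, idxG] = min(SnG);              % step 3: pivoting rule on the GPU
        i = gather(idxG);                  % index back to the CPU
        l = NonBasicList(i);
        h = BasisInv * A(:, l);            % ratio test data (CPU)
        if all(h <= 0)
            error('LP.1 is unbounded');    % step 4: unboundedness test
        end
        % ... select the leaving variable, then update BasisInv, Xb, Sn
        % and NonBasicList before the next iteration
    end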

Table 1. GPU-based Bland's rule.
1. transfer to GPU the vector Sn
2. do parallel
3. find the index of the leftmost negative element of vector Sn
4. end parallel
5. transfer the index to CPU

Table 2. GPU-based Dantzig's rule.
1. transfer to GPU the vector Sn
2. do parallel
3. find the index of the minimum element of Sn
4. end parallel
5. transfer the index to CPU

Fig. 1. High level description of GPU architecture.

3.3. Notations

Some necessary notation should be introduced before the presentation of the pseudocode of each pivoting rule. Let A be an m × n matrix. Let NonBasicList be a 1 × (n − m) vector with the indices of the non-basic variables. Let BasisInv be an m × m matrix holding the basis inverse. Let Xb be a 1 × m vector with the values of the basic variables. Let Sn be a vector with the dual slack (reduced cost) values of the non-basic variables, aligned with NonBasicList. Let l be the index of the entering variable. Let ./ denote the element-wise division of two vectors and X' denote the transpose of the matrix X. Let lrcmLast be the index of the last selected entering variable in the Least Recently Considered Method. Let segmentSize be the fixed size of the segments and lastSegment the segment in which the last entering variable was found in the Partial Pricing Rule.

Fig. 2. Flow chart of the GPU-based pivoting rules.

The pseudocodes of all pivoting rules are presented in the following subsections. The pseudocodes include "do parallel" and "end parallel" sections, in which the workload is divided into warps that are executed on a multiprocessor. The warp size on the NVIDIA Quadro 6000 is 32 threads.

3.4. GPU-based Bland’s rule

Table 1 shows the pseudocode of the implementation of Bland's rule on a GPU. Initially, the CPU transfers the vector Sn to the GPU (line 1). Then, the index of the leftmost negative element of Sn is calculated in parallel (lines 2–4). Finally, the GPU transfers the index to the CPU (line 5).

3.5. GPU-based Dantzig's rule

Table 2 shows the pseudocode of the implementation of Dantzig's rule on a GPU. Initially, the CPU transfers the vector Sn to the GPU (line 1). Then, the index of the minimum element of Sn is calculated in parallel (lines 2–4). Finally, the GPU transfers the index to the CPU (line 5).
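Using MATLAB's gpuArray interface, the whole of Table 2 collapses to a few lines; a hedged sketch in which the explicit transfers mirror lines 1 and 5:

    % GPU-based Dantzig's rule
    SnG = gpuArray(Sn);           % line 1: transfer reduced costs to the GPU
    [minSnG, idxG] = min(SnG);    % lines 2-4: parallel reduction on the GPU
    index = gather(idxG);         % line 5: transfer the index back to the CPU
    if gather(minSnG) >= 0
        index = [];               % no negative reduced cost: basis is optimal
    end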

3.6. GPU-based Greatest Increment Method

Table 3 shows the pseudocode of the implementation of the Greatest Increment Method on a GPU. Initially, the CPU transfers the vectors Sn, NonBasicList, Xb and the arrays A and BasisInv to the GPU (line 1). Then, for each candidate variable with a negative cost coefficient, the improvement in the objective value is calculated (lines 4–17). The index of the variable that offers the largest improvement in the objective value is selected (lines 18–19). Finally, the GPU transfers the index to the CPU (line 21).

3.7. GPU-based Least Recently Considered Method

Table 4 shows the pseudocode of the implementation of the Least Recently Considered Method on a GPU. Initially, the CPU transfers the vectors Sn and NonBasicList to the GPU (line 1). In the first iteration of the algorithm, the entering variable is selected according to Dantzig's rule (lines 2–5). In the following iterations, the Least Recently Considered Method finds the index of the first eligible variable with index greater than lrcmLast (lines 6–10). Finally, the GPU transfers the index to the CPU (line 11).


Table 3. GPU-based Greatest Increment Method.
1. transfer to GPU the vectors Sn, NonBasicList, Xb and the arrays A and BasisInv
2. do parallel
3. localMaxDecrease = inf
4. for i = 1:length(NonBasicList)
5. if (Sn(i) < 0)
6. l = NonBasicList(i)
7. h_l = BasisInv * A(:,l)
8. mrt = indices of the positive elements of h_l
9. Xb_hl_div = Xb(mrt) ./ h_l(mrt)
10. theta0 = minimum of the elements of the vector Xb_hl_div
11. currDecrease = theta0 * Sn(i)
12. if (currDecrease < localMaxDecrease)
13. localMaxDecrease = currDecrease
14. index = i
15. end
16. end
17. end
18. maxDecrease = minimum of all the localMaxDecrease values
19. find the index of the variable where maxDecrease occurs
20. end parallel
21. transfer the index to CPU

Table 4. GPU-based Least Recently Considered Method.
1. transfer to GPU the vectors Sn and NonBasicList
2. if (iterations == 1)
3. do parallel
4. find the index of the minimum element of Sn and store it to the variable lrcmLast
5. end parallel
6. else
7. do parallel
8. find the index of the first eligible variable with index greater than lrcmLast and store it to the variable lrcmLast
9. end parallel
10. end
11. transfer the index to CPU



3.8. GPU-based Partial Pricing Rule

Table 5 shows the pseudocode of the implementation of the Partial Pricing Rule on a GPU. Initially, the CPU transfers the values segmentSize and lastSegment and the vector Sn to the GPU (line 1). Dantzig's rule is performed in the segment denoted by the variable lastSegment (line 4). If a negative reduced cost does not exist in the specific segment, then the search moves to the next segment (lines 5–7). Finally, the GPU transfers the index to the CPU (line 10).

Table 5. GPU-based Partial Pricing Rule.
1. transfer to GPU the values segmentSize and lastSegment and the vector Sn
2. while (index == null)
3. do parallel
4. find the index of the minimum element in the lastSegment of Sn
5. if (index == null)
6. lastSegment = lastSegment + 1
7. end
8. end parallel
9. end
10. transfer the index to CPU

3.9. GPU-based Steepest Edge

Table 6 shows the pseudocode of the implementation of the Steepest Edge rule on a GPU. Initially, the CPU transfers the vectors Sn, NonBasicList and the arrays A and BasisInv to the GPU (line 1). Then, the index of the entering variable is calculated according to Eq. (1) (lines 3–6). Finally, the GPU transfers the index to the CPU (line 8).

Table 6. GPU-based Steepest Edge.
1. transfer to GPU the vectors Sn, NonBasicList and the arrays A and BasisInv
2. do parallel
3. Y = BasisInv * A(:,NonBasicList)
4. dj = sqrt(1 + diag(Y' * Y))
5. rj = Sn' ./ dj
6. find the index of the minimum element of the vector rj
7. end parallel
8. transfer the index to CPU
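A hedged gpuArray sketch of Table 6 (names follow Section 3): the heavy matrix product and the column norms run on the GPU, and only the entering index is copied back:

    % GPU-based Steepest Edge
    AG  = gpuArray(A(:, NonBasicList));
    BG  = gpuArray(BasisInv);
    SnG = gpuArray(Sn);
    Y   = BG * AG;                    % updated columns, computed on the GPU
    dj  = sqrt(1 + sum(Y.^2, 1));     % same values as diag(Y'*Y) in Table 6
    rj  = SnG ./ dj;
    [~, idxG] = min(rj);
    index = gather(idxG);             % line 8: transfer the index to the CPU

Note that sum(Y.^2, 1) produces the diagonal of Y'*Y without forming the full (n − m) × (n − m) product, which is the natural way to write line 4 in MATLAB.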

4. Computational results

Computational studies have been widely used to examine the practical efficiency of an algorithm or to compare algorithms. The computational comparison of the aforementioned pivoting rules has been performed on a quad-core Intel Core i7 3.4 GHz with 32 GB of main memory, 8 logical cores, a clock of 3700 MHz, an L1 code cache of 32 KB per core, an L1 data cache of 32 KB per core, an L2 cache of 256 KB per core, an L3 cache of 8 MB and a memory bandwidth of 21 GB/s, running under Microsoft Windows 7 64-bit, and on an NVIDIA Quadro 6000 with 6 GB GDDR5 384-bit memory, a core clock of 574 MHz, a memory clock of 750 MHz and a memory bandwidth of 144 GB/s. The GPU consists of 14 streaming multiprocessors with 32 cores each, resulting in 448 cores in total. The graphics card driver installed in our system is the NVIDIA 64-bit kernel module 306.23. The serial algorithms have been implemented using MATLAB Professional R2012b. MATLAB (MATrix LABoratory) is a powerful programming environment especially designed for matrix computations.

Serial pivoting rules automatically execute on multiple computational threads in order to take advantage of the multiple cores of the CPU. The execution times of all serial pivoting rules therefore already include the performance benefit of MATLAB's inherent multithreading. Execution times of both the CPU- and GPU-based pivoting rules have been measured in seconds using MATLAB's built-in functions tic and toc. Finally, the results of the GPU are very accurate, because the NVIDIA Quadro 6000 is fully IEEE 754-2008 compliant, with fast 32- and 64-bit (double precision) arithmetic.
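A hedged sketch of one way such GPU timings can be taken; the paper states that tic and toc were used, while the wait call (which blocks until pending GPU work finishes so that toc does not under-report) is a methodological assumption on our part:

    gd = gpuDevice();
    tic;
    % ... pivoting-rule code under test, e.g. a GPU-based entering-variable
    % selection followed by gather() of the index ...
    wait(gd);        % ensure all queued GPU work has completed
    elapsed = toc;   % seconds, as reported in the paper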

In the computational study, two test beds were used: (i) randomly generated dense optimal LPs (problem instances have the same number of constraints and variables, and the largest problem tested has 3000 constraints and 3000 variables), and (ii) ten large-scale LPs from the Netlib set (optimal, Kennington and infeasible LPs) (Carolan et al., 1990; Gay, 1985) and the Misc section of the Mészáros collection (Mészáros, 2013). These benchmarks do not have BOUNDS or RANGES sections in their MPS files. Dense LPs were included in the computational study mainly for two reasons: (i) dense LPs have important applications, since some applications, e.g. Dantzig-Wolfe decomposition, lead to dense LPs (Bradley et al., 1999), and (ii) most of the proposed GPU-based simplex implementations report computational results only for dense LPs, so we have included dense LPs for the sake of completeness.
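One standard construction (an assumption; the paper does not spell out its generator) for a random dense LP that is guaranteed feasible and bounded: b = A*x0 with x0 ≥ 0 gives primal feasibility, and c = A'*w + s with s ≥ 0 gives dual feasibility, so an optimal solution exists:

    % Random dense LP with a known feasible, bounded optimum
    m = 500; n = 500;          % matches the smallest instance in Table 8
    A  = randn(m, n);
    x0 = rand(n, 1);           % non-negative feasible point
    b  = A * x0;               % makes Ax = b consistent
    w  = randn(m, 1);
    s  = rand(n, 1);           % non-negative dual slacks
    c  = A' * w + s;           % dual-feasible costs => bounded optimum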

We implemented an MPS reader to read MPS files and translate the data into MATLAB mat files. The Netlib library is a well-known suite containing many real world LPs. Ordóñez and Freund (2003) have shown that 71% of the Netlib LPs are ill-conditioned; hence,

Table 7. Statistics of the Netlib (optimal, Kennington and infeasible LPs) and Mészáros LPs.

Name      Constraints  Variables  Nonzeros of A  Optimal objective value
BNL1      644          1175       6129           1.98E+03
CRE-A     3517         4067       19,054         2.35E+07
CRE-C     3069         3678       16,922         2.52E+07
KLEIN1    55           54         696            Infeasible
KLEIN2    478          54         4585           Infeasible
ROSEN1    520          1544       23,794         −2.76E+04
ROSEN2    1032         3080       47,536         −5.44E+04
SCFXM3    991          1371       7846           5.49E+04
SCTAP3    1481         2480       10,734         1.42E+03
STOCFOR2  2158         2031       9492           −3.90E+04

numerical difficulties may occur. All runs terminated with the correct optimal objective values, matching those in the Netlib index file. For each instance we averaged times over 10 runs. All times reported in the paper include communication times between the CPU and the GPU. Table 7 presents some useful information about the second test bed used in the computational study. The first column includes the name of the problem, the second the number of constraints, the third the number of variables, the fourth the number of nonzero elements of matrix A and the fifth the optimal objective value. The test bed includes 4 optimal and 2 infeasible LPs from Netlib, 2 Kennington LPs and 2 LPs from the Mészáros collection.

Prior to the presentation of the results, we should clarify that our implementations use dense algebra functions, because MATLAB's Parallel Computing Toolbox does not support sparse matrices on GPUs. However, the main aim of this paper is to examine whether some of the widely-used pivoting rules can be implemented using GPUs, not to produce an efficient implementation of the entire algorithm on GPUs. Hence, we used MATLAB and MATLAB's Parallel Computing Toolbox as rapid prototyping tools to implement our algorithms. Of course, our reported times for the CPU-based implementations cannot be compared with state-of-the-art solvers implemented either in other programming languages, e.g. C++, or using sparse linear algebra routines.

In Tables 8–13 and Figs. 3–6, the following abbreviations are used: (i) Bland's rule – BR, (ii) Dantzig's rule – DR, (iii) Greatest Increment Method – GIM, (iv) Least Recently Considered Method – LRCM, (v) Partial Pricing Rule – PPR, and (vi) Steepest Edge rule – SER. Tables 8 and 9 present the iterations needed by each pivoting rule over the randomly generated optimal dense LPs and over the Netlib and Mészáros set of LPs, respectively. On both test beds, the Steepest Edge rule requires fewer iterations than the other rules, whereas Bland's rule is by far the worst pivoting rule in terms of the number of iterations.

Table 8. Number of iterations over the randomly generated optimal dense LPs.

Problem      BR       DR    GIM   LRCM     PPR      SER
500 × 500    8451     542   362   4851     5528     197
1000 × 1000  29,786   1792  894   24,683   25,931   548
1500 × 1500  77,028   3938  1856  54,987   61,243   779
2000 × 2000  84,224   3986  1934  77,139   79,134   1042
2500 × 2500  144,216  5872  4569  101,369  112,041  1180
3000 × 3000  214,875  8130  6431  152,310  160,134  1508
Average      93,097   4043  2674  69,223   74,002   876

Table 9. Number of iterations over the Netlib and Mészáros set of LPs.

Problem   BR         DR       GIM      LRCM       PPR      SER
BNL1      41,816     4577     1570     32,400     15,120   843
CRE-A     39,371     5431     3091     14,231     4821     2066
CRE-C     30,565     5710     3102     22,461     14,735   1736
KLEIN1    102        115      99       238        173      106
KLEIN2    1398       258      221      913        940      272
ROSEN1    12,322     3444     958      6341       10,267   519
ROSEN2    28,282     6877     2103     14,151     22,156   1023
SCFXM3    7411       1167     2313     6684       5948     879
SCTAP3    2459       2215     1655     3195       3396     589
STOCFOR2  6454       1074     1534     10,985     5992     1125
Average   17,018.00  3086.80  1664.60  11,159.90  8354.80  915.80



Tables 10 and 11 present the results from the execution of the serial implementations of the pivoting rules over the randomly generated optimal dense LPs and over the Netlib and Mészáros set of LPs, respectively. Tables 12 and 13 present the results from the execution of the GPU-based implementations of the pivoting rules over the randomly generated optimal dense LPs and over the Netlib and Mészáros set of LPs, respectively. Figs. 3 and 4 present the average speedup for each pivoting rule over the randomly generated optimal dense LPs and over the Netlib and Mészáros set of LPs, respectively.

Fig. 3. Average speedup over the randomly generated optimal dense LPs.

Fig. 4. Average speedup over the Netlib and Mészáros set of LPs.

Table 10. Total execution time (in seconds) of the serial implementations over the randomly generated optimal dense LPs.

Problem      BR          DR       GIM         LRCM       PPR        SER
500 × 500    54.71       3.42     12.85       31.57      36.34      5.58
1000 × 1000  1317.40     62.75    945.34      861.55     906.77     162.43
1500 × 1500  8677.64     433.02   8134.51     6056.12    6834.04    713.43
2000 × 2000  19,057.93   888.48   16,098.19   12,365.41  13,654.98  2034.12
2500 × 2500  59,884.37   2872.15  51,013.45   38,561.09  41,361.77  4761.34
3000 × 3000  148,775.20  7037.62  125,678.43  90,561.65  98,134.15  9451.34
Average      39,627.87   1882.91  33,647.13   24,739.57  26,821.34  2854.71

Table 11. Total execution time (in seconds) of the serial implementations over the Netlib and Mészáros set of LPs.

Problem   BR         DR       GIM        LRCM       PPR        SER
BNL1      436.91     58.29    180.41     337.88     158.14     113.94
CRE-A     28,028.03  4286.12  22,013.45  17,984.34  19,349.18  16,541.09
CRE-C     12,525.00  2583.08  9845.45    7034.98    8045.61    7683.43
KLEIN1    0.02       0.02     0.03       0.04       0.03       0.03
KLEIN2    5.89       1.30     1.22       3.74       3.90       1.51
ROSEN1    102.34     28.51    88.45      52.54      87.89      51.49
ROSEN2    1381.47    345.88   2564.34    694.96     983.23     690.52
SCFXM3    206.21     38.56    167.34     186.28     168.25     156.14
SCTAP3    222.12     214.36   287.41     286.76     309.60     543.28
STOCFOR2  1402.98    258.01   2985.91    2394.38    1362.05    768.93
Average   4431.10    781.41   3813.40    2897.59    3046.79    2655.04

Table 12. Total execution time (in seconds, including communication time) of the GPU-based implementations over the randomly generated optimal dense LPs.

Problem      BR          DR       GIM        LRCM       PPR        SER
500 × 500    45.59       3.16     10.44      28.90      32.35      2.07
1000 × 1000  1029.86     58.61    716.14     842.31     880.05     53.34
1500 × 1500  8592.08     412.31   6025.51    5834.13    6004.89    213.45
2000 × 2000  16,429.25   800.43   13,416.43  11,845.64  12,566.67  501.49
2500 × 2500  52,530.14   2804.20  38,355.97  36,798.88  38,905.64  1040.54
3000 × 3000  123,979.31  6459.08  95,937.73  85,139.04  91,003.85  2305.66
Average      33,767.71   1756.30  25,743.70  23,414.82  24,898.91  686.09

Table 13. Total execution time (in seconds, including communication time) of the GPU-based implementations over the Netlib and Mészáros set of LPs.

Problem   BR         DR       GIM        LRCM       PPR        SER
BNL1      368.08     50.42    152.34     322.87     146.82     17.93
CRE-A     22,098.69  3477.51  17,800.55  16,490.55  17,234.05  2832.34
CRE-C     9899.01    2145.31  8156.44    6500.67    7231.55    1448.00
KLEIN1    0.02       0.02     0.03       0.03       0.03       0.04
KLEIN2    4.96       1.15     1.03       3.56       3.65       1.08
ROSEN1    88.78      24.16    74.98      51.66      81.47      15.45
ROSEN2    1211.45    276.70   2231.01    663.49     910.22     200.05
SCFXM3    172.51     26.66    145.50     177.86     155.87     36.85
SCTAP3    190.41     201.54   245.08     273.88     286.60     95.20
STOCFOR2  1201.40    223.45   2501.58    2156.00    1190.51    354.61
Average   3523.53    642.69   3130.85    2664.06    2724.08    500.16

The results show that only the Steepest Edge rule can be parallelized on a GPU efficiently. All the other pivoting rules show trivial speedups. Figs. 5 and 6 present the average portion of time that each serial pivoting rule needs to calculate the entering variable, relative to the total time, over the randomly generated optimal dense LPs and over the Netlib and Mészáros set of LPs, respectively.

The results of Figs. 5 and 6 highlight that only the Steepest Edge rule's total execution time is dominated by the pivoting time. The other pivoting rules perform trivial pivoting operations, so it is not worth parallelizing them (the Greatest Increment Method also spends a lot of time in the communication between the CPU and the GPU). These findings lead us to conclude that only the Steepest Edge rule can be parallelized profitably. Many parallel CPU implementations of the Steepest Edge rule have been proposed (Bixby and Martin, 2000; Hall and McKinnon, 1995; Thomadakis and Liu, 1996), while Meyer et al. (2011) proposed a multi-GPU implementation of the classical simplex algorithm using Steepest Edge as the pivoting rule.

Fig. 5. Average portion of pivoting time to total execution time over the randomly generated optimal dense LPs.

Fig. 6. Average portion of pivoting time to total execution time over the Netlib and Mészáros set of LPs.



Table 14. Pivoting execution time (in seconds) on the CPU and GPU implementations and speedup of the Steepest Edge rule over the randomly generated optimal dense LPs.

Problem      CPU      GPU     Speedup
500 × 500    4.32     0.87    4.95
1000 × 1000  115.77   20.51   5.64
1500 × 1500  545.33   73.59   7.41
2000 × 2000  1602.21  124.11  12.91
2500 × 2500  3982.61  262.00  15.20
3000 × 3000  7532.62  450.45  16.72
Average      2297.14  155.26  10.47

Table 15. Pivoting execution time (in seconds) on the CPU and GPU implementations and speedup of the Steepest Edge rule over the Netlib and Mészáros set of LPs.

Problem   CPU        GPU      Speedup
BNL1      98.32      9.17     10.72
CRE-A     14,105.19  1078.01  13.08
CRE-C     6503.03    573.55   11.34
KLEIN1    0.02       0.02     1.00
KLEIN2    0.87       0.41     2.15
ROSEN1    40.67      5.65     7.20
ROSEN2    533.34     50.74    10.51
SCFXM3    136.97     10.80    12.69
SCTAP3    485.27     36.99    13.12
STOCFOR2  533.79     51.79    10.31
Average   2243.75    181.71   9.21

Tables 14 and 15 present the time to perform the selection of the entering variable in both the CPU and the GPU implementations of the Steepest Edge rule, along with the speedup, over the randomly generated optimal dense LPs and over the Netlib and Mészáros set of LPs, respectively. Fig. 7 presents the speedup of the total time and of the pivoting time of the Steepest Edge rule over both test beds. The results show a maximum speedup on the pivoting step of 16.72 (on the 3000 × 3000 dense random LP) and averages of 10.47 and 9.21 for the randomly generated optimal dense LPs and the Netlib and Mészáros set of LPs, respectively. The speedup obtained on the pivoting step is not reflected exactly in the speedup of the total time for two reasons: (i) only the selection of the entering variable has been parallelized, and (ii) the communication cost is still expensive (the CPU transfers the adequate variables to the GPU at each iteration). For example, the speedup of the pivoting time achieved for the ROSEN1 linear program is 7.20, while the speedup of the total time is 3.33. The communication cost for ROSEN1 is 4.45 s, approximately one third of the total execution time.

Fig. 7. Speedup of the total and the pivoting time of the Steepest Edge rule.

5. Conclusions

A good choice of pivoting rule in linear optimization solvers can reduce the number of iterations and the solution time of an LP. GPUs have already been applied to the solution of linear optimization algorithms, but GPU-based implementations of pivoting rules had not yet been studied. In this paper, we reviewed and implemented six widely-used pivoting rules and proposed GPU-based implementations for them using MATLAB and CUDA. We performed a computational study on large-scale randomly generated optimal dense LPs, the Netlib set (optimal, Kennington, and infeasible LPs) and the Mészáros set, and found that only the Steepest Edge rule is well suited to GPUs, because the total time of the algorithm is dominated by the time to perform the pivoting. The maximum speedup gained from the pivoting operation of the Steepest Edge rule is 16.72.

The results are very promising and give hope for fast GPU-based implementations of linear programming algorithms. In future work, we plan to port all steps of the revised simplex algorithm to a GPU-based implementation in order to avoid communication costs at each step and fully exploit parallelism. Finally, we plan to implement variants of the Steepest Edge rule using GPUs.

Acknowledgement

The authors thank NVIDIA for their support through the Academic Partnership Program.

References

Bazaraa, M.S., Jarvis, J.J., Sherali, H.D., 1990. Linear Programming and Network Flows. John Wiley & Sons, Inc., Hoboken, NJ.

Benichou, M., Gautier, J., Hentges, G., Ribiere, G., 1977. The efficient solution of large-scale linear programming problems. Math. Program. 13, 280–322.

Bieling, J., Peschlow, P., Martini, P., 2010. An efficient GPU implementation of the revised Simplex method. In: Proceedings of the 24th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2010), Atlanta, USA.

Bixby, R.E., Martin, A., 2000. Parallelizing the dual simplex method. INFORMS J. Comput. 12 (1), 45–56.

Bland, R.G., 1977. New finite pivoting rules for the simplex method. Math. Oper. Res. 2 (2), 103–107.

Bradley, S.P., Fayyad, U.M., Mangasarian, O.L., 1999. Mathematical programming for data mining: formulations and challenges. INFORMS J. Comput. 11 (3), 217–238.

Carolan, W.J., Hill, J.E., Kennington, J.L., Niemi, S., Wichmann, S.J., 1990. An empirical evaluation of the KORBX(r) algorithms for military airlift applications. Oper. Res. 38 (2), 240–248.

Clausen, J., 1987. A note on Edmonds-Fukuda's pivoting rule for the simplex method. Eur. J. Oper. Res. 29, 378–383.

Dantzig, G.B., 1963. Linear Programming and Extensions. Princeton University Press, Princeton, NJ.

Forrest, J.J., Goldfarb, D., 1992. Steepest-edge simplex algorithms for linear programming. Math. Program. 57 (1–3), 341–374.

Fukuda, K., 1982. Oriented matroid programming (Ph.D. thesis). Waterloo University, Waterloo, Ontario, Canada.

Gade-Nielsen, N.F., Jorgensen, J.B., Dammann, B., 2012. MPC toolbox with GPU accelerated optimization algorithms. In: Proceedings of the 10th European Workshop on Advanced Control and Diagnosis (ACD 2012), Copenhagen, Denmark.

Gärtner, B., 1995. Randomized optimization by Simplex-type methods (Ph.D. thesis). Freie Universität, Berlin.

Gay, D.M., 1985. Electronic mail distribution of linear programming test problems. Math. Program. Soc. COAL Newslett. 13, 10–12.

Goldfarb, D., Reid, J.K., 1977. A practicable steepest-edge simplex algorithm. Math. Program. 12 (3), 361–371.

Hall, J.A.J., McKinnon, K.I.M., 1995. An asynchronous parallel revised simplex algorithm. Tech. Rep. MS 95-50. Department of Mathematics and Statistics, University of Edinburgh.

Harris, P.M.J., 1973. Pivot selection methods for the Devex LP code. Math. Program. 5, 1–28.

Jung, J.H., O'Leary, D.P., 2008. Implementing an interior point method for linear programs on a CPU–GPU system. Electron. Trans. Numer. Anal. 28, 174–189.



Klee, V., Minty, G.J., 1972. How good is the simplex algorithm? In: Shisha, O. (Ed.), Inequalities – III. Academic Press Inc., New York and London.

Lalami, M.E., Boyer, V., El-Baz, D., 2011a. Efficient implementation of the simplex method on a CPU–GPU system. In: Proceedings of the 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW 2011), Washington, USA, pp. 1999–2006.

Lalami, M.E., El-Baz, D., Boyer, V., 2011b. Multi GPU implementation of the simplex algorithm. In: Proceedings of the 2011 IEEE 13th International Conference on High Performance Computing and Communications (HPCC), Banff, Canada, pp. 179–186.

Li, J., Lv, R., Hu, X., Jiang, Z., 2011. A GPU-based parallel algorithm for large scale linear programming problem. In: Watada, J., Phillips-Wren, G., Jai, L., Howlett, R.J. (Eds.), Intelligent Decision Technologies, SIST 10. Springer-Verlag, Berlin, pp. 37–46.

Maros, I., Khaliq, M.H., 1999. Advances in design and implementation of optimization software. Eur. J. Oper. Res. 140 (2), 322–337.

Mészáros, C., 2013. Linear programming test problems. http://www.sztaki.hu/~meszaros/public_ftp/lptestset/misc/ (last accessed 05.08.13).

Meyer, X., Albuquerque, P., Chopard, B., 2011. A multi-GPU implementation and performance model for the standard simplex method. In: Proceedings of the 1st International Symposium and 10th Balkan Conference on Operational Research, Thessaloniki, Greece, pp. 312–319.

Murty, K.G., 1974. A note on a Bard type scheme for solving the complementarity problem. Opsearch 11, 123–130.

NVIDIA CUDA, 2013. http://developer.nvidia.com/object/cuda.html (last accessed 05.08.13).

Ordóñez, F., Freund, R., 2003. Computational experience and the explanatory value of condition measures for linear optimization. SIAM J. Optim. 14 (2), 307–333.

Papadimitriou, C.H., Steiglitz, K., 1982. Combinatorial Optimization: Algorithms and Complexity. Prentice-Hall, Inc., New Jersey.

Ploskas, N., Samaras, N., 2013. Computational comparison of pivoting rules for the revised simplex algorithm. In: Proceedings of the XI Balkan Conference on Operational Research, 7–10 September, Belgrade.

Smith, E., Gondzio, J., Hall, J., 2012. GPU acceleration of the matrix-free interior point method. In: Parallel Processing and Applied Mathematics. Springer, Berlin, Heidelberg, pp. 681–689.

Spampinato, D.G., Elster, A.C., 2009. Linear optimization on modern GPUs. In: Proceedings of the 23rd IEEE International Parallel and Distributed Processing Symposium (IPDPS 2009), Rome, Italy.

Świetanowski, A., 1998. A new steepest edge approximation for the simplex method for linear programming. Comput. Optim. Appl. 10 (3), 271–281.

The Khronos OpenCL Working Group, 2013. OpenCL – The open standard for parallel programming of heterogeneous systems. http://www.khronos.org/opencl/ (last accessed 05.08.13).


Thomadakis, M.E., 1994. Implementation and Evaluation of Primal and Dual Simplex Methods with Different Pivot-Selection Techniques in the LPBench Environment. A Research Report. Texas A&M University, Department of Computer Science.

Thomadakis, M.E., Liu, J.C., 1996. An efficient steepest-edge simplex algorithm for SIMD computers. In: Proceedings of the 10th International Conference on Supercomputing, pp. 286–293.

Vanderbei, R.J., 2001. Linear Programming: Foundations and Extensions, 2nd ed. Kluwer Academic Publishers, Boston.

Wang, Z., 1991. A modified version of the Edmonds-Fukuda algorithm for LP problems in the general form. Asia-Pacific J. Oper. Res. 8 (1), 55–61.

Zadeh, N., 1980. What is the worst case behavior of the simplex algorithm? Technical Report. Department of Operations Research, Stanford University.

Zhang, S., 1991. On anti-cycling pivoting rules for the simplex method. Oper. Res. Lett. 10, 189–192.

Ziegler, G.M., 1990. Linear programming in oriented matroids. Technical Report No. 195, Institut für Mathematik, Universität Augsburg, Germany.

Nikolaos Ploskas is a Ph.D. student at the Department of Applied Informatics, School of Information Sciences, University of Macedonia, Greece. He received his Bachelor of Science degree and Master's degree in Computer Systems from the Department of Applied Informatics of the University of Macedonia, Greece. His primary research interests are in mathematical programming, analysis and design of algorithms, parallel and distributed programming, GPU programming, grid technologies and operations research.

Nikolaos Samaras is an Associate Professor at the Department of Applied Informatics, School of Information Sciences, University of Macedonia, Greece. He received his Bachelor of Science degree and Ph.D. degree in Applied Informatics from the University of Macedonia, Greece. His primary research interests are in mathematical programming, experimental evaluation of algorithm performance, GPU computing, computer applications and computational operations research.