
Techniques for Solving Stiff Chemical Kinetics on Graphical Processing Units

Christopher P. Stone∗

Computational Science and Engineering, LLC, Chicago, Illinois 60622

and

Roger L. Davis†

University of California, Davis, California 95616

DOI: 10.2514/1.B34874

The cost of integrating detailed finite rate chemical kinetics mechanisms can be prohibitive in turbulent combustion simulations. Techniques that can significantly reduce these simulation times are of great interest to model developers and the broader propulsion and power community. Two new integration methods using graphical processing units are presented that can rapidly integrate the nonlinear ordinary differential equations at each grid point. The explicit graphical-processing-unit-enabled fourth-order-accurate Runge–Kutta–Fehlberg ordinary differential equation solver achieved a maximum speed up of 20.2x over the baseline implicit fifth-order-accurate DVODE CPU run time for large numbers of independent ordinary differential equation systems with comparable accuracy. The graphical processing unit implementation of DVODE achieved a maximum speed up of 7.7x over the baseline CPU run time. The performance impact of mapping one graphical processing unit thread to each ordinary differential equation system was compared with mapping an entire graphical processing unit thread block per ordinary differential equation (i.e., multiple threads per ordinary differential equation). The one-thread-per-ordinary-differential-equation approach achieved greater overall speed up but only when the number of independent ordinary differential equations was large. The one-block-per-ordinary-differential-equation implementations of Runge–Kutta–Fehlberg and DVODE both achieved lower peak speed ups but outperformed the serial CPU performance with as few as 100 ordinary differential equations. The new graphical-processing-unit-enabled ordinary differential equation solvers demonstrate a method to significantly reduce the computational cost of detailed finite rate combustion simulations.

I. Introduction

The solution of finite rate chemical kinetics problems in computational fluid dynamics (CFD) combustion simulations can overwhelm the available computational resources. Finite rate kinetics combustion simulations must solve Ns additional species conservation equations on top of the traditional mass, momentum, and energy conservation equations. Detailed chemical mechanisms can consist of hundreds of chemical species with thousands of elemental reactions, leading to intractable storage and computational costs. The CFD algorithm can be simplified by solving the chemical mass action separately from the fluid transport. This approximation allows the chemical kinetics problem at each grid cell or vertex to be integrated independently over the specified time step size δt. At each element, the Ns mass fractions and enthalpy (or temperature) are solved as a system of Ns + 1 stiff ordinary differential equations (ODEs). Even with the preceding simplification, the cost of solving the ODEs can be prohibitive. New algorithmic and computer science techniques such as massively parallel computing can significantly reduce the computational cost and are therefore of great interest to combustion simulation developers throughout the propulsion and power community. An alternative approach exploits the available fine-grain parallelism using graphical processing units (GPUs). GPUs are auxiliary devices installed in most workstations and are designed to offload the intensive computations involved in graphics and display rendering. These devices contain an immense amount of raw computing power, more, in fact, than their CPU counterparts. Over the past several years, an intense effort has begun to exploit this available computing resource to solve computationally intensive science and engineering problems. CPUs may contain, at most, tens of cores per package, whereas high-end programmable GPUs commonly contain hundreds and can sustain substantially higher memory bandwidths. However, harnessing this power may require a redesign of many existing algorithms to efficiently use the massively parallel and multithreaded environment of GPUs.

The current GPU application seeks to accelerate the chemical kinetics integration encountered within the counterflow linear eddy model (CF-LEM). The CF-LEM is used to simulate the turbulent combustion found in opposed-flow configurations (e.g., augmenters) [1]. The combustion is modeled using finite rate kinetics to capture complex reaction–diffusion processes such as extinction and reignition. The three-dimensional turbulent fluid mechanics are modeled using a high-resolution one-dimensional (1-D) turbulent mixing model (i.e., the linear eddy model). In this fashion, the turbulent reaction zone is captured at direct numerical simulation (DNS) resolution. An example of a CF-LEM flame profile is shown in Fig. 1. There, the major fuel and oxidizer species are shown along with the temperature. Note that only the reaction zone is shown; the full CF-LEM domain is significantly larger. In this example, the combustion kinetics are modeled using a 19-species reduced C2H4 (ethene) mechanism. See Calhoon et al. [1] for further details on the combustion model.

GPUs have recently been used to solve stiff chemical kinetics problems. Linford et al. [2] investigated atmospheric chemistry problems on NVIDIA Compute Unified Device Architecture (CUDA) GPUs and Cell Broadband Engines (CBE). They found poor scalability on GPUs compared with the CBE due to thread divergence and the large per-problem memory requirement. More recently, Shi et al. [3] studied the performance of GPUs for detailed combustion kinetics. There, reaction mechanisms with over 1000 species were investigated; however, speed up was only realized for complex mechanisms exceeding 300 species. Mechanisms of this size are far too large for practical multidimensional CFD studies with thousands or millions of grid points.

Presented as Paper 2013-369 at the AIAA Aerospace Sciences Meeting, Grapevine, TX, 7–10 January 2013; received 10 December 2012; revision received 1 March 2013; accepted for publication 9 March 2013; published online 10 June 2013. Copyright © 2013 by the American Institute of Aeronautics and Astronautics, Inc. All rights reserved. Copies of this paper may be made for personal or internal use, on condition that the copier pay the $10.00 per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923; include the code 1533-3876/13 and $10.00 in correspondence with the CCC.

*1830 N. Winchester Ave., Unit 305; [email protected].

†Professor, Mechanical and Aerospace Engineering; [email protected].


The same authors [4] presented a hybrid strategy for multidimensional CFD applications in which regions of the computational domain were classified as either stiff or nonstiff. Nonstiff computational cells were integrated in parallel on the GPU, using a stable but low-order-accurate explicit integration method, whereas the stiff computational cells were integrated in serial on the host CPU using a high-order-accurate implicit method. This strategy provided favorable speed ups when the stiff ODE regions were sparse, but it requires a priori knowledge of the stiffness of the kinetics at each grid point to be effective.

The present study differs in several key aspects from the prior investigations described earlier. The key differences are mechanism size, ODE stiffness, and the number of independent ODE problems. A much smaller reaction mechanism (only 19 species) is considered here; this is far smaller than those reported by Shi et al. [3]. Their strategy for detailed kinetics is not practical for the small mechanism of interest here. As mentioned previously, the chemistry at each LEM cell can be integrated independently and in parallel, and this provides the primary source of parallelism. However, the hybrid strategy of Shi et al. [4] cannot be exploited because stiffness at individual LEM cells cannot be easily predicted, due to the effects of turbulent mixing. With these differences in mind, a new GPU integration strategy was designed and demonstrated that can efficiently and accurately integrate thousands of independent ODE systems, regardless of their underlying ODE stiffness.

The following sections provide details on 1) the mathematical basis of the combustion kinetics problem and the ODE integration algorithms, 2) the parallel GPU implementation strategies, and, finally, 3) the performance of the GPU integration methods.

II. Mathematical Model

The LEM combustion model integrates the Ns species mass fractions yk and temperature T equations in time t under the constant pressure assumption at every grid point. This can be expressed mathematically by the following set of coupled ODEs:

\dot{y}_k = \frac{\dot{\omega}_k W_k}{\rho} \qquad (1)

\dot{T} = -\frac{1}{c_p} \sum_{k}^{N_s} h_k \, \dot{\omega}_k \qquad (2)

Here, Wk, hk, and ω̇k are the molar mass, enthalpy, and molar production rate for species k, and cp and ρ are the specific heat at constant pressure and mass density of the system. The net molar production rate from combustion is a highly nonlinear function of pressure p, T, and the species molar concentrations ci. It is also the primary source of the ODE stiffness.
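As a concrete illustration, a minimal CUDA device-function sketch of the right-hand side defined by Eqs. (1) and (2) is shown below. It mirrors the equations exactly as written; the array names (wdot, W, h) and the assumption that the molar production rates and enthalpies have already been evaluated are illustrative, not taken from the actual implementation.

// Minimal sketch of Eqs. (1) and (2): species and temperature time derivatives.
// Assumes wdot[k] (molar production rates), W[k] (molar masses), and h[k]
// (enthalpies) have already been evaluated for all Ns species.
__device__ void lem_rhs_sketch(int Ns, double rho, double cp,
                               const double* wdot, const double* W,
                               const double* h, double* ydot, double* Tdot)
{
    double hsum = 0.0;
    for (int k = 0; k < Ns; ++k) {
        ydot[k] = wdot[k] * W[k] / rho;   // Eq. (1)
        hsum   += h[k] * wdot[k];
    }
    *Tdot = -hsum / cp;                   // Eq. (2)
}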

The reaction rate terms are computed from a reduced 19-species C2H4–air (ethene) reaction mechanism with Nk = 167 reversible elemental reactions and Nqss = 10 quasi-steady-state (QSS) species (i.e., they are not solved in time). Each elemental reaction rate is calculated from a set of temperature- and concentration-dependent factors. The temperature-dependent forward kf and reverse kb rate coefficients for each reaction are computed as

k_f = A \, T^{\alpha} e^{-E_a/(R T)} \qquad (3)

k_b = \frac{k_f}{K_c} \qquad (4)

where Ea is the activation energy, Kc is the equilibrium constant, A is the preexponential rate constant, and α is the temperature dependency. Note that these constants are different for each elemental reaction. The equilibrium constant is computed as a function of the Gibbs energy (Δgk = Δhk − TΔsk), the elemental reaction stoichiometric coefficients νi, and the reference pressure p0:

K_c = \exp\!\left( \sum_{i}^{N_\nu} \frac{\nu_i (-\Delta G_i)}{R T} \right) \left( \frac{p_0}{R T} \right)^{\sum_{i}^{N_\nu} \nu_i} \qquad (5)

where the summation is over the total number of products and reactants in the given elemental reaction.

The net molar rate of progress for each elemental reaction is computed from the preceding temperature-dependent rate coefficients and the concentrations as

r_f = p_{FO} \, c_{TB} \left( k_f \prod_{i}^{N_r} c_i^{\nu_i} \right), \qquad r_b = p_{FO} \, c_{TB} \left( k_b \prod_{i}^{N_p} c_i^{\nu_i} \right) \qquad (6)

Here, Np and Nr are the number of products and reactants for the given elemental reaction, cTB is the third-body concentration, and pFO is the low-pressure falloff factor. Of the 167 reactions in the current mechanism, 24 involve third-body concentrations and 18 include falloff factors. The third-body concentration effect for each relevant reaction is modeled as

c_{TB} = \sum_{i}^{N_s} c_i + \sum_{i}^{N_{ctb}} (A_i - 1) \, c_i \qquad (7)

The falloff reaction rates account for low-pressure effects and are themselves functions of the third-body concentrations. Most of the current falloff reactions follow the four-parameter Troe form of the correction. These all require the calculation of an additional low-pressure forward Arrhenius rate coefficient k_f^0 along with a polynomial of temperature:

p_r = \frac{k_f^{0} \, c_{TB}}{k_f}, \qquad f_c = (1 - P_1) e^{-T/P_2} + P_1 e^{-T/P_3} + e^{-P_4/T}, \qquad p_{FO} = g(f_c) \frac{p_r}{1 + p_r} \qquad (8)

The preceding sequence of rate calculations has neglected the Nqss quasi-steady-state species concentrations. These are computed as a complex nonlinear combination of rf and rb only. The forward and reverse rates are then updated with the new QSS concentrations, following the same form as Eq. (6). Finally, the net molar production rate for species k is computed as the sum over all Nk elemental reactions

\dot{\omega}_k = \sum_{i}^{N_k} \delta\nu_i \, r_i \qquad (9)

where δνi is the net stoichiometric coefficient for species k in reaction i with the net rate of progress ri (i.e., r_i = r_{f,i} − r_{b,i}).
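For reference, a minimal serial sketch of Eq. (9) is shown below; the function and array names (dnu, rf, rb) and the per-species coefficient layout are illustrative assumptions rather than the paper's actual data structures.

// Minimal sketch of Eq. (9): net molar production rate of one species as the
// sum of net rates of progress over the Nk elemental reactions.
// dnu[i] is the net stoichiometric coefficient of this species in reaction i.
__host__ __device__ double net_production_rate(int Nk, const double* dnu,
                                               const double* rf, const double* rb)
{
    double wdot_k = 0.0;
    for (int i = 0; i < Nk; ++i)
        wdot_k += dnu[i] * (rf[i] - rb[i]);   // r_i = r_f,i - r_b,i
    return wdot_k;
}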

chemical species (cp, h, and s) are computed from the NASAseven-term polynomial fit database,

Fig. 1 Example LEM flame profile: species mass fractions Yk (O2, CH3, CH4, CO2, C2H4) and temperature (K/1000) versus nondimensional distance through the flame. Only the reaction zone is shown and is resolved with 241 LEM cells.


c_p/R = a_0 + a_1 T + a_2 T^2 + a_3 T^3 + a_4 T^4

h/R = a_0 T + \frac{a_1 T^2}{2} + \frac{a_2 T^3}{3} + \frac{a_3 T^4}{4} + \frac{a_4 T^5}{5} + a_5

\frac{T s - h}{R T} = a_0 (\ln T - 1) + \frac{a_1 T}{2} + \frac{a_2 T^2}{6} + \frac{a_3 T^3}{12} + \frac{a_4 T^4}{20} - \frac{a_5}{T} + a_6 \qquad (10)
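A minimal sketch of how the first two fits in Eq. (10) might be evaluated for one species is given below; the seven-coefficient array layout a[0..6] is an assumed convention for illustration (in practice a species carries separate coefficient sets for its low- and high-temperature ranges), not the CANTERA or paper data layout.

#include <math.h>

// Sketch of the NASA seven-term polynomial fits in Eq. (10) for one species.
// a[0..6] are the seven fit coefficients for the current temperature range.
__host__ __device__ void nasa7_sketch(const double a[7], double T,
                                      double* cp_over_R, double* h_over_R)
{
    const double T2 = T * T, T3 = T2 * T, T4 = T3 * T;
    *cp_over_R = a[0] + a[1] * T + a[2] * T2 + a[3] * T3 + a[4] * T4;
    *h_over_R  = a[0] * T + a[1] * T2 / 2.0 + a[2] * T3 / 3.0
               + a[3] * T4 / 4.0 + a[4] * T4 * T / 5.0 + a[5];
}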

The parallel implementation of Eqs. (3–10) will strongly influence the computational performance of the ODE integration. Multiple sources of parallelism exist within the LEM application and the governing equations. Notably, each LEM cell represents an independent ODE that can be solved in parallel (i.e., coarse-grain parallelism). On a finer level, the forward and reverse elemental rate coefficients can be calculated in parallel for each individual ODE. The optimal balance between coarse- and fine-grain parallelism is strongly dependent upon the target hardware. Specific details on the implementation of the preceding right-hand side (RHS) equations and the ODE solver implementations will be given later.

The mass fraction and temperature equations are easily combined into a single vector form suitable for solving with a generic ODE integration algorithm. The following form will be used throughout the remainder of this article:

\dot{u}(t) = f(u(t), q), \qquad u(t_0) = u_0(q) \qquad (11)

where f is the RHS vector function, u(t) is the vector of unknowns at some time t, q is a vector of nonintegrated parameters (e.g., pressure), u̇(t) is the first derivative in time of u(t), u0 is a vector of initial conditions, and time t is the independent variable. The length of u is Neq = Ns + 1 unknowns. In the current LEM combustion model context, the integration time in Eq. (11) is taken as the diffusion time step size. That is, the ODEs at each grid point are independently integrated over the same time interval to maintain time accuracy within the LEM model. Note that a different number of integration steps may be required for each different ODE problem. The ODE integration methods for Eq. (11) employed in this study are introduced next.

III. Computational Approaches

The DVODE package is commonly used to solve the stiff kinetics ODEs and is available from Netlib.‡ This algorithm uses a fifth-order-accurate implicit backward difference formula to integrate Eq. (11). DVODE is a legacy FORTRAN77 application and has no parallel computing capabilities. The CVODE ODE solver [5], part of the SUNDIALS nonlinear solver package from Lawrence Livermore National Laboratory, is an updated and enhanced version of DVODE. CVODE is written in C and uses a modular object-oriented programming style. Both the C language and the object-oriented design greatly simplify the implementation of CVODE in CUDA, as detailed later. Note that CVODE provides all of the same functionality as DVODE. CVODE is algorithmically equivalent to DVODE and can be used as a one-to-one replacement. CVODE's performance on the current chemical kinetics application was found to be only marginally faster, 5% maximum, when run serially. With this in mind, CVODE, not DVODE, is used as the basis for the CUDA implementation.

In this study, the DVODE solver was used as the baseline for both solution accuracy and serial CPU performance. The combustion kinetics is modeled using the reduced 19-species ethene mechanism. Thermochemistry properties are computed using the CANTERA§ computational library. CANTERA provides a large set of functions and databases useful for calculating thermodynamic state properties.

The dominant computational costs incurred at each ODE integration step with DVODE and CVODE are the RHS evaluation and the inversion of the Newton iteration matrix, a square Neq × Neq matrix formed from the system Jacobian (i.e., ∂f/∂u). The Jacobian is approximated using finite differences [6], and each approximation requires Neq additional calls to the RHS function. The average total RHS cost per ODE (including the Jacobian estimation) is 83% of the run time with the current ethene mechanism within DVODE.

A traditional CPU implementation of the preceding ODE equations uses a single-threaded (i.e., sequential) algorithm and implements the preceding ODE and RHS equations in a fashion that minimizes the number of operations. With a sequential algorithm, the minimum number of operations tends to equate to the minimum wall clock time. Modern CPUs have large on-chip memory, high clock rates, and relatively large near-chip caches, and so this approach is generally optimal. Minimizing the total number of sequential operations does not necessarily equate to the minimum time on highly parallel hardware such as a GPU.

DVODE and CVODE contain many logical tests on convergence and refinement to minimize the number of necessary RHS evaluations and matrix computations. Many of these sequential optimizations come at the expense of extra storage requirements. This is optimal for CPUs with large amounts of on-chip memory and only a few cores. However, it can lead to poor parallel efficiency in a GPU environment, in which the computational cores are plentiful but memory is scarce.

GPUs perform most efficiently when each thread executes the same instruction on different data. That is, they are optimized for single-instruction, multiple-data (SIMD) execution. Each system of ODEs in the kinetics application will have a unique set of initial conditions (ICs). Because of nonlinearities in the kinetics, small perturbations in the ICs can cause each individual integration to progress differently through the implicit solvers (e.g., different rates of nonlinear convergence and numbers of time steps). This can lead to poor parallel efficiency and degraded performance in the GPU environment. An alternative strategy is to integrate the ODE problems using a high-order explicit algorithm such as an adaptive Runge–Kutta (RK) method. This approach is typically slower in a sequential environment compared with an implicit algorithm for stiff problems due to the necessarily smaller integration step sizes, resulting in more steps and more RHS evaluations. However, this type of algorithm requires minimal logic and may perform better in a parallel environment. Whether or not an explicit algorithm is numerically stable for the current application must be investigated.

An adaptive fourth-order-accurate ODE solver has been created based on the Runge–Kutta–Fehlberg (RKF45) [7] scheme. The RKF45 scheme has the useful property that both fourth- and fifth-order-accurate formulas can be constructed using the same six intermediate RHS evaluation points. The difference between these two schemes gives an order-of-magnitude estimate of the local truncation error (LTE), which can be used to estimate the accuracy of a given solution. The integrated solution can then be monitored for acceptance, and the step size h can be coarsened or refined based on the magnitude of the LTE. In this approach, a fourth-order-accurate solution can be guaranteed.

The step size h is adjusted during the course of the integration to ensure the desired accuracy. A common approach with adaptive RK solvers is to advance the solution first from t = 0 to h in one full step and then again using two steps with step size h/2 [8]. The error estimation is then found from the difference between the h and h/2 solutions. The RKF45 method requires fewer RHS evaluations (approximately half) because the fourth- and fifth-order-accurate schemes share the same RHS evaluation points. As a result, the overall cost of the RKF45 ODE solver is lower than that of other adaptive RK ODE solvers of comparable accuracy.

A solution acceptance criterion similar to that used by DVODE was implemented to ensure that the RKF45 solutions match closely to the baseline ODE solver. Refinement is selected when the weighted root-mean-square (WRMS) truncation error is greater than one. The WRMS error function is defined as

\mathrm{WRMS} = \sqrt{ \frac{1}{N_{eq}} \sum_{i=1}^{N_{eq}} \left( \frac{ u_i^{n+1} - v_i^{n+1} }{ \mathrm{rtol} \, |u_i^{n}| + \mathrm{atol} } \right)^{2} } \qquad (12)

‡Data available online at www.netlib.org [retrieved 01 February 2012].
§Data available online at code.google.com/p/cantera [retrieved 01 August 2011].


where rtol and atol are the user-specified relative and absolute tolerance parameters. Combining the relative and absolute tolerances allows the species and temperature to be compared more fairly even though their values differ by many orders of magnitude. When WRMS < 1, the solution is accepted, h is coarsened, and the next integration may proceed. Otherwise, h is reduced and the integration step is repeated. In both cases, h is adjusted with the following formula:

h_{\mathrm{new}} = h \sqrt[4]{ \frac{1}{2} \frac{1}{\mathrm{WRMS}} } \qquad (13)

The leading one-half factor in the adaption formula biases h toward a smaller value and avoids stagnation when WRMS ≈ 1. We also limit the change in h to 1/4 < h_new/h < 4 to avoid large swings and potential overshoots. Finally, a ceiling is imposed upon h so that a minimum of 10 integration steps are taken.
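A minimal sketch of this acceptance test and step-size update [Eqs. (12) and (13), with the 1/4 to 4 limits described above] is shown below; the function and variable names are illustrative and are not taken from the actual solver source.

#include <math.h>

// Sketch of the RKF45 step-size controller: WRMS error [Eq. (12)] followed by
// the step-size update of Eq. (13), limited to the range [h/4, 4h].
// u is the fourth-order solution, v the embedded fifth-order solution.
__host__ __device__ double wrms_error(int Neq, const double* u, const double* v,
                                      const double* u_old, double rtol, double atol)
{
    double sum = 0.0;
    for (int i = 0; i < Neq; ++i) {
        double w = (u[i] - v[i]) / (rtol * fabs(u_old[i]) + atol);
        sum += w * w;
    }
    return sqrt(sum / Neq);
}

__host__ __device__ double next_step_size(double h, double wrms)
{
    double scale = pow(0.5 / wrms, 0.25);   // Eq. (13)
    scale = fmax(0.25, fmin(scale, 4.0));   // limit 1/4 < h_new/h < 4
    return h * scale;                       // caller accepts the step if wrms < 1, else retries
}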

The initial step size for RKF45 is estimated similarly to DVODE, following Brown et al. [6]. The initial step size h0 is estimated such that the WRMS of the second-order expansion term about u0 is less than one, that is,

\left\| \frac{h_0^2}{2} \, \ddot{u}(t_0) \right\|_{\mathrm{WRMS}} \le 1 \qquad (14)

The first step in DVODE is limited to first-order accuracy, and so this estimation ensures the initial step does not produce excessive errors. This approach is likely overly conservative for RKF45, which is fourth-order accurate for all time steps; however, this conservative approach can only benefit the solution accuracy.

IV. Graphical Processing Unit Implementations

An efficient GPU implementation seeks to minimize the total run time and often requires a different approach than would normally be used to produce an optimal sequential algorithm. For this work, the CVODE and RKF45 ODE solvers and the RHS reaction rate function were ported to run on NVIDIA CUDA GPUs. This section describes two different parallel strategies used to implement the ODE solvers and the RHS function on these GPUs. However, before delving into the details of the parallel implementations, a brief overview of the CUDA GPU architecture and how it differs from common workstation-quality CPUs is necessary. The following overview is based on the NVIDIA M2050 (Fermi) CUDA GPU and the eight-core Advanced Micro Devices (AMD) Opteron 6134 Magny-Cours CPU, both of which are used for benchmarking throughout this study.

A. CUDA

The NVIDIA Fermi M2050 GPU has 14 streaming multiprocessors (SMs) operating at 1.15 GHz. Each SM contains 32 scalar processors (SPs), each capable of processing one single-precision floating-point operation per clock cycle (two cycles are needed per double-precision operation). Therefore, each SM can execute 32 operations per cycle. By comparison, an AMD Opteron 6134 Magny-Cours CPU core operates at 2.3 GHz but is only capable of processing four single-precision operations per cycle using the SIMD processing unit. CPUs also have considerably larger on-chip memory (L1 and L2 on-chip caches). Therefore, the larger processor count (both SMs and SPs) of the CUDA hardware must be efficiently harnessed to overcome the significantly lower clock speeds and on-chip storage. Compounding this difficulty, any performance gains on the GPU must also overcome the overhead incurred when transmitting data from the host CPU to the GPU and vice versa.

A CUDA code, much like a multithreaded CPU application, defines the logical flow of a single thread of execution. CUDA threads are aggregated into thread blocks, where the user controls the number of threads per thread block (e.g., 512 CUDA threads per thread block). Each thread block is mapped by the thread scheduler to an available SM. A key aspect of CUDA is that instructions are issued to thread warps, not individual threads. A warp is 32 threads and constitutes the finest level of practical parallelism. In this manner, each CUDA SM can be viewed as an SIMD processing unit, capable of processing 32 elements per cycle. All 32 threads within a warp can execute the same instruction on different data each clock cycle. Threads within the same warp are able to follow different logical paths in the code, that is, threads can diverge. Because only one instruction can be issued per warp per cycle, thread divergence can lead to extreme performance penalties. In the worst-case scenario, each thread in a warp must execute a different instruction per cycle, reducing the potential performance by a factor of 32x.

CUDA SIMD capabilities. In both methods, multiple ODEs aresolved concurrently on the GPU, but they differ in how they map theODEs to the CUDASMs. In the first method, called the “one-thread”method, one ODE is solved by one CUDA thread. The secondmethod, called the “one-block” method, solves each ODE using anentire thread block. In the one-block method, all the threads within athread blockwork in concert to solve a singleODE.Also, the CUDA-based ODE solvers integrate the ODEs completely on the GPU fromstart to finish. An important design assumption in both methods isthat the number ofODE systems to be solvedNode is large (Node ≫ 1)and that the number of unknowns per ODE is relatively small(Ns � 19 in the current application). This differs considerably fromthe previous GPU combustion kinetics research of Shi et al. [3]. Intheir work, only one ODE system was solved at a time; Ns wasconsiderably larger, ranging from 50 to well over 1000; and, lastly,the DVODE solver executed on the host CPU, whereas only a selectnumber of costly functions executed on the GPU.For each CUDA SM to achieve its peak SIMD processing

potential, data must be loaded efficiently. The aggregate peakbandwidth from GPU global memory is considerably higher thancommodity CPU’s main memory, but this high bandwidth must beshared by tens of SMs and hundreds of SPs concurrently. Recall thateach SM can process 32 elements per cycle, 8x more than a commonCPU, straining the availablememory bandwidth. Therefore, each SMmust load 8x more data to use all of the available computinghardware. CPUs hide slow main memory operations with multiplelevels of caching in per-core L1 and shared L2 caches of varying sizesand speeds. These caches of data are typically hundreds of kilobytesfor L1 caches and several megabytes for L2 caches. CUDA has onlyrudimentary caching capabilities (512 kB L2 and between 16 and48 kB L1 per SM), which can amplify the shared bandwidth issue.CUDA, however, provides several methods to overcome thisbandwidth limitation: coalesced reads/writes to global memory high-speed on-chip (i.e., local to each SM) sharedmemory (SHMEM), andoverprescription of threads.Coalesced reads and writes are highly optimized and are essential

for efficient SIMD thread execution when operating on data stored inglobalmemory.Amemory request is coalesced if all threads in awarpread fromconsecutivememory locations (i.e., contiguous elements inan array). Coalescing is the memory analog of an SIMD operationand the two must be used together to achieve peak performance.Coalescing is best achievedwith proper array formatting (dimension,rank, and padding). Note that this optimization technique is notrestricted to GPUs and is helpful in efficient SIMD execution onCPUs as well.Overprescription of threads is another critical mechanism used to
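As an illustration of this coalescing rule in the one-thread context, the sketch below stores per-ODE data so that consecutive threads (consecutive ODE indices) touch consecutive addresses; the array name u and the two alternative layouts are illustrative assumptions, not the paper's actual data structures.

// Illustrative global-memory layouts for Node ODE systems of Neq unknowns each.
// Coalesced (ODE index fastest): warp lanes ode = 0..31 read consecutive
// addresses for the same unknown index i.
__device__ double load_coalesced(const double* u, int Node, int i, int ode)
{
    return u[i * Node + ode];   // consecutive ode -> consecutive address
}

// Uncoalesced alternative (unknown index fastest): warp lanes read addresses
// separated by a stride of Neq, serializing the memory transactions.
__device__ double load_strided(const double* u, int Neq, int i, int ode)
{
    return u[ode * Neq + i];    // consecutive ode -> stride-Neq access
}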

Overprescription of threads is another critical mechanism used to hide the latency and bandwidth limits when using CUDA global memory. CUDA Fermi allows up to 48 thread warps to run concurrently on each SM. An efficient hardware-based scheduler is used to swap out warps waiting for global memory data with those that are ready to execute (warps can efficiently swap context). For this reason, CUDA applications are encouraged to run tens of thousands of threads concurrently when using global memory to hide the latency.

CUDA SHMEM is a small segment of high-speed memory local to each SM. All threads within the same thread block share this small section of memory and, when combined with appropriate thread synchronization functions, can use it to execute fine-grain parallel algorithms. Note that CUDA SHMEM is distinct from global memory.


CUDA SHMEM is commonly used as a manual, user-defined high-speed data cache. The user's CUDA code 1) explicitly copies data from (slower) global memory into (faster) SHMEM, 2) performs many operations on the data in SHMEM, and 3) copies the results stored in SHMEM back to global memory for long-term storage. For example, yk, T, and p for a given ODE are needed repeatedly throughout the (RHS) reaction rate calculation. These data are first copied from global memory into SHMEM, and then the RHS function is invoked using these SHMEM copies. The ODE state data are then local to the SM and can be fetched quickly, reducing the global memory overhead.
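A minimal sketch of this load–compute–store pattern is given below; the kernel name, the one-ODE-per-block mapping, and the rhs_in_shmem placeholder are illustrative assumptions rather than the paper's actual code.

// Sketch of the shared-memory usage pattern described above: stage the ODE
// state in SHMEM, operate on it there, and write the result back to global memory.
extern __device__ void rhs_in_shmem(int Neq, const double* u_s, double* udot_s);

__global__ void rhs_shmem_sketch(int Neq, const double* u_global, double* udot_global)
{
    extern __shared__ double s[];      // dynamic shared memory
    double* u_s    = s;                // staged state vector
    double* udot_s = s + Neq;          // staged derivative vector
    const int ode  = blockIdx.x;       // one ODE per thread block

    // 1) copy state from global memory into shared memory
    for (int i = threadIdx.x; i < Neq; i += blockDim.x)
        u_s[i] = u_global[ode * Neq + i];
    __syncthreads();

    // 2) evaluate the RHS entirely from the shared-memory copies
    rhs_in_shmem(Neq, u_s, udot_s);
    __syncthreads();

    // 3) copy the result back to global memory
    for (int i = threadIdx.x; i < Neq; i += blockDim.x)
        udot_global[ode * Neq + i] = udot_s[i];
}

A launch such as rhs_shmem_sketch<<<Node, 64, 2 * Neq * sizeof(double)>>>(Neq, d_u, d_udot) would assign one thread block per ODE, in the spirit of the one-block method described next.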

Overprescription is not as critical in well-tuned SHMEM-based CUDA applications because the data are already local to the SM. Efficient use of CUDA SHMEM is more nuanced than global memory; however, good performance is generally obtained if the same rules for CUDA global memory coalescence are followed (i.e., each thread reads/writes to consecutive array elements). This guideline was followed whenever possible. Note that the 64 kB of SHMEM on the CUDA Fermi GPUs is partitioned into two separate partitions, 48 and 16 kB wide. Depending on the run time configuration, these can function as either an automatic L1 cache or the manual SHMEM described earlier. For this reason, the user is limited to only 48 kB of manual SHMEM (with 16 kB of L1 cache) or vice versa. The automatic L1 cache option should be enabled for all applications that do not use the manual SHMEM capabilities. The L1 cache can reduce latency and bandwidth demands in certain applications that use global memory only.

B. Parallel Strategies

Now that the basic Fermi GPU hardware has been described, the differences in the two parallel implementations will be addressed.

As mentioned earlier, one thread is assigned to each ODE in the one-thread implementation. To a large extent, each thread will execute the ODE solver algorithms and the RHS calculations [Eqs. (3–9)] in sequential fashion. However, in the SIMD context, each thread warp can execute 32 different ODEs in parallel. Assuming the hardware scheduler effectively hides all latency, we expect to solve 448 ODEs concurrently (14 SMs × 32 threads per warp) in this design. Double precision reduces this peak throughput by half, and thread divergence will further degrade the potential parallel efficiency.

Each ODE requires a finite amount of storage, the details of which shall be given in later sections. Depending on the CUDA-based ODE solver used, only on the order of 10 ODEs can be stored in the available 48 kB of CUDA SHMEM using the current 19-species ethene mechanism. This is obviously insufficient for the one-thread approach, which requires at least 32 ODEs. Therefore, the one-thread approach does not use any manual SHMEM but uses global memory exclusively. Also, the automatic 48 kB L1 cache configuration is enabled with the one-thread approach. Overprescription is used to hide the global memory latency, and the data storage is designed for coalesced memory transactions. Note that effective overprescription requires tens of thousands of ODEs to be solved concurrently.

The one-block approach attempts to exploit parallelism within each ODE instead of just solving many ODEs in parallel. CUDA supports up to eight thread blocks per SM, which means there is sufficient SHMEM storage to hold eight ODEs per SM. Therefore, the one-block approach uses CUDA SHMEM extensively. Like the one-thread method, warp-level SIMD operations are still exploited, but they are implemented in a very different manner. For example, Eq. (3) must be evaluated for all 167 forward reactions (i.e., there are 167 independent calculations). In the one-block approach with 32 threads per block, each thread will compute at most six elements of the same kf array [i.e., ceiling(167/32)]. In the one-thread method, each thread in a single warp computes all 167 elements of 32 different kf arrays concurrently.

Pseudocode of the kf calculation using the one-thread and one-block approaches is shown in Algorithms 1 and 2, respectively. As can be seen in the one-thread pseudocode, 32 different kf arrays are computed with a loop stride of one. However, only one kf array is computed in the one-block version, and the loop stride is set equal to the number of threads in the thread block. Note that the kf array is defined as two-dimensional in the one-thread version but only 1-D in the one-block version. This further demonstrates that multiple independent arrays, each of length 167, are computed by each thread in the one-thread approach, but only one array is computed in the one-block method. The pseudocodes also highlight the different memory layouts used to optimize the two different approaches. The inner dimension of kf (column-major format) equals the number of threads in the one-thread implementation. This ensures that the 32 array elements of kf saved in each iteration are contiguous and can be coalesced into a single memory transaction. A similar scenario occurs in the one-block method: The 32 array elements saved in each iteration are also contiguous, ensuring fast storage to shared memory.

Efficiently implementing Eq. (3) using either parallel strategy was straightforward because there was no data dependency. However, equations such as Eqs. (7) or (9) are more difficult because they require a summation over all species or all contributing elemental reactions, respectively. This type of array summation, termed a reduction, is required throughout the RHS and the ODE solvers themselves (e.g., vector dot products and norms and matrix lower–upper (LU) decompositions). The one-thread method requires no special attention because the parallelism comes from executing a sequential algorithm on multiple different arrays. There, a straightforward sequential implementation is sufficient. The difficulty only arises in the one-block approach. Here, we implement the array reduction operations (min, max, sum, etc.) using a parallel binary reduction algorithm [9], as sketched below. These algorithms are implemented using shared memory and are capable of computing the sum of N array elements in only log2(N) steps, as opposed to N for a simple sequential summation.
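A minimal shared-memory sum reduction in the spirit of [9] is sketched below for a single thread block; the function name and the power-of-two block-size assumption are illustrative simplifications, not the paper's actual routine.

// Block-wide sum reduction in shared memory: each round halves the number of
// active threads, so blockDim.x values are summed in log2(blockDim.x) steps.
// Assumes blockDim.x is a power of two and scratch[] holds blockDim.x doubles.
__device__ double block_sum_sketch(double my_value, double* scratch)
{
    const int tid = threadIdx.x;
    scratch[tid] = my_value;
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            scratch[tid] += scratch[tid + stride];
        __syncthreads();
    }
    return scratch[0];   // every thread returns the block-wide sum
}

With 64 threads per block, as used by the one-block CUDA-RKF45 implementation described later, the sum completes in six rounds.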

The original ethene RHS function (FORTRAN) source code was fully inlined. For example, the forward Arrhenius rate calculation [Eq. (3)] used 167 different lines of code. There, certain combinations of coefficients were precomputed and the results inlined in the source code using fixed-precision syntax. This led to very efficient CPU execution because it avoided many unnecessary transcendental function calls. However, this optimization strategy negates the SIMD nature of Eq. (3). That is, all 167 kf evaluations can use the same calculation with different coefficients.

Algorithm 1 One-thread method: Each thread executes the same loop on a different kf array

nThreads := 32
nReactions := 167
dim kf(nThreads, nReactions)
dim A(nReactions)
dim Ea(nReactions)
tid := getThreadIndex()
i := 0
while i < nReactions do
    kf(tid, i) := A(i) · T^α(i) · e^(−Ea(i)/(R·T))
    i := i + 1
end while

Algorithm 2 One-block method: All threads operate on one kf array

nThreads := 32
nReactions := 167
dim kf(nReactions)
dim A(nReactions)
dim Ea(nReactions)
tid := getThreadIndex()
i := tid
while i < nReactions do
    kf(i) := A(i) · T^α(i) · e^(−Ea(i)/(R·T))
    i := i + nThreads
end while
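For concreteness, the two pseudocode listings above might be expressed in CUDA roughly as follows; the kernel names, the flattened kf layout, and the assumption that A, alpha, and Ea reside in global memory are illustrative and are not taken from the paper's source code. These kernels only illustrate the kf evaluation mapping, not the complete solvers, which run entirely on the GPU.

#include <math.h>

#define N_REACTIONS 167

// One-thread mapping (Algorithm 1): thread "tid" evaluates all 167 forward rate
// coefficients of its own ODE; kf is stored with the ODE index innermost so that
// the 32 writes issued by a warp in each loop iteration are coalesced.
__global__ void kf_one_thread(int nOdes, const double* A, const double* alpha,
                              const double* Ea, const double* T, double* kf, double R)
{
    const int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= nOdes) return;
    for (int i = 0; i < N_REACTIONS; ++i)
        kf[i * nOdes + tid] = A[i] * pow(T[tid], alpha[i]) * exp(-Ea[i] / (R * T[tid]));
}

// One-block mapping (Algorithm 2): all threads of a block cooperate on the single
// kf array of the ODE assigned to that block, striding by the block size.
__global__ void kf_one_block(const double* A, const double* alpha, const double* Ea,
                             const double* T_per_ode, double* kf_per_ode, double R)
{
    const int ode = blockIdx.x;
    const double T = T_per_ode[ode];
    double* kf = kf_per_ode + ode * N_REACTIONS;
    for (int i = threadIdx.x; i < N_REACTIONS; i += blockDim.x)
        kf[i] = A[i] * pow(T, alpha[i]) * exp(-Ea[i] / (R * T));
}

Launching kf_one_thread with, for example, 512 threads per block over nOdes total threads corresponds to the one-thread mapping, whereas launching kf_one_block with one block per ODE and 32–64 threads per block corresponds to the one-block mapping.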


The original source code with precomputed coefficients was factored back to a generic format to maximize SIMD execution. This allows efficient execution using either the one-thread or one-block strategy. However, the modified coefficients in the kf calculation incurred a level of roundoff due to the finite precision of the original source code. The impact on the accuracy is reported in the results section. Note that no other component of the CUDA implementation imposed a change in the numerical precision aside from that incurred by a different order of operations.

One component of the original ethene mechanism could not be refactored to expose SIMD execution in the one-block method: Evaluating the QSS species concentrations involves a complex nonlinear calculation that contains significant data dependency. This section of the RHS function is executed serially by a single thread in the one-block implementation. This serialization of the QSS calculation will degrade the overall parallel efficiency of the one-block method. Note that this does not impact the SIMD execution of the one-thread method.

C. Parallel Ordinary Differential Equation Solvers

Both the RKF45 and CVODE algorithms were implemented in CUDA following the one-thread and one-block strategies described earlier. Each thread in the one-thread CUDA implementations of RKF45 (CUDA-RKF45) and CVODE (CUDA-CVODE) integrates a separate ODE independently over its specified integration time. Note that each ODE problem may have different time integration limits. As discussed earlier, the one-thread approach takes a nominally sequential algorithm and solves it in a fully SIMD fashion. This approach is straightforward to implement from existing source code, but the performance is entirely dependent upon the number of problems to be solved and the efficient coalescing of global memory operations.

Much of the original CVODE C source code from the SUNDIALS package was reused in the CUDA-CVODE implementation. All SUNDIALS solvers use an object-oriented approach for vector and matrix storage, which greatly simplified the porting to CUDA C. The CUDA-RKF45 one-thread implementation relies almost entirely on the efficiency of the one-thread RHS function implementation. At the end of each explicit integration step, the WRMS truncation error is tested for suitability by each thread. If the solution is acceptable, the thread updates its solution vector and integration time. Thread divergence occurs during the adaption test phase but should only have a minor performance impact compared with the cost of six RHS function evaluations at each trial integration step. In both CUDA-RKF45 and CUDA-CVODE, a thread exits the integration loop once the final integration time is reached. Instructions are only issued per warp, so all threads within a warp must complete their ODE integrations before proceeding to the next set of problems. Note that it is possible that a small number of ODEs may require significantly more integration steps, causing most threads to stall and leading to poor utilization of the CUDA processors. That is, integrating many independent ODEs constitutes an irregular workload.

A dynamic workload scheduler is implemented following the work of Tzeng et al. [10] to optimize the throughput of the irregular workload in the one-thread approach. Work is scheduled in batches of 32 ODEs to match the warp size of the GPU. One thread block with 512 threads per block is launched for each SM (14 in total), and the threads persist until the queue is drained. A benefit of the warp-level queue is that no explicit synchronization is required within each thread block. Note that all SMs are assigned at least one batch of 32 ODEs before additional work can be dequeued. This ensures that all parallel hardware is used and increases the throughput when the number of ODEs is small. A simple queue was implemented using a single global counter initialized to the number of ODEs to be solved. A free thread warp dequeues the next available batch of ODEs by 1) locking the global counter, 2) reading the batch number, 3) decrementing the counter, and, finally, 4) releasing the global counter. These four steps were actually implemented by invoking the "atomic" decrement operation supported by CUDA, as sketched below.
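A minimal sketch of such a warp-level queue is shown below; it uses the modern warp-shuffle broadcast intrinsic for brevity (Fermi-era code would have broadcast through shared memory), and the function and variable names are illustrative assumptions rather than the paper's implementation.

// Persistent-thread work queue sketch: a global counter holds the number of
// 32-ODE batches remaining; lane 0 of each warp atomically claims the next
// batch and broadcasts it to the rest of the warp.
__device__ int g_batches_remaining;   // set by the host to ceil(nOdes/32)

__device__ int dequeue_batch()
{
    const unsigned lane = threadIdx.x % 32u;
    int batch = -1;
    if (lane == 0)
        batch = atomicSub(&g_batches_remaining, 1) - 1;   // claim one batch
    batch = __shfl_sync(0xffffffffu, batch, 0);           // broadcast to the warp
    return batch;                                         // negative -> queue drained
}

__global__ void persistent_solver_sketch()
{
    for (int batch = dequeue_batch(); batch >= 0; batch = dequeue_batch()) {
        const int ode = 32 * batch + (int)(threadIdx.x % 32u);
        // ... integrate ODE "ode" from start to finish here (one-thread method) ...
        (void)ode;
    }
}

The host would initialize g_batches_remaining with cudaMemcpyToSymbol before launching one persistent thread block per SM.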

A global queue such as this can impose a significant bottleneck in the multithreaded GPU environment [10]; however, the high computational cost of the ODE integration easily dominates any minor overhead incurred.

The one-block CUDA-RKF45 implementation uses shared memory for all vector storage. Explicit synchronization is enforced after each RHS function evaluation to avoid race conditions. After each integration step, the WRMS truncation error is computed using the parallel reduction algorithm. The optimal thread block size was found to be 64 for the CUDA-RKF45 one-block implementation.

CUDA-CVODE one-block follows the same approach as the one-block CUDA-RKF45 and RHS implementations. Again, the object-oriented structure of the CVODE vector and matrix operations simplifies the implementation effort. Each of the vector operations (e.g., dot product) is parallelized across the thread block. It is important to note that the number of synchronization points in the CUDA-CVODE one-block implementation is significantly higher than that required for CUDA-RKF45 one-block. Parallel versions of the LU matrix factorization (with partial row pivoting) and backsolve routines were implemented for the CUDA-CVODE one-block algorithm. These two complex algorithms were implemented using only one thread warp and not the entire thread block. Parallelism was only exploited in row or column (vector) operations, and so the maximum possible speed up is limited to only 20. As such, warp-level parallelism was sufficient. Further, the warp-level algorithm reduced the number of explicit synchronization points required during the matrix decomposition and backsolve and increased the overall performance.

The workload manager was modified to dispatch work to thread blocks instead of to warps for the one-block algorithm. Currently, one ODE is dispatched to each thread block per query instead of 32 per warp as in the one-thread implementation. Each thread block is assigned at least one ODE before additional work can be dequeued to maximize the utilization for small numbers of ODEs.

The CUDA versions of the ODE solvers each require a different amount of memory. The RHS function requires a fixed amount of array storage per problem, so each ODE solver must provide sufficient space for the RHS solution vector and scratch space. With double precision, each RHS evaluation requires 3.2 kB. The one-block version requires additional shared memory scratch space for the parallel reduction algorithm. This amounts to an extra 512 B for 64 threads per thread block. Note that these storage requirements are only for array data and exclude local scalar variables.

The CUDA-RKF45 implementations require storage for six vectors of length Neq (i.e., five intermediate solutions and one RHS vector) in addition to the RHS function storage for each ODE problem, for a total of 4.6 kB of shared memory. Recall that each CUDA SM supports up to 48 kB of shared memory and eight thread blocks per SM. Therefore, there is sufficient shared memory storage for the maximum of eight blocks per SM.

The CUDA-CVODE implementations require significantly more memory than the CUDA-RKF45 implementations. The vast majority of the extra storage is due to the Jacobian matrix with size Neq × Neq. The register usage is also at the maximum of 63 per thread. CVODE (and DVODE) reuse previously generated Jacobian matrices if the solution vector has not deviated significantly. This requires that the Jacobian be copied and saved in each instance, doubling the already high storage cost. With this enabled, CUDA-CVODE requires 11.2 kB per problem, of which the two Jacobian matrices and the pivot array account for 57%. CUDA-CVODE one-block ideally uses shared memory exclusively. With the Jacobian-recycling strategy, there was sufficient shared memory for only four blocks per SM. It was found that storing the saved Jacobian and Newton iteration matrix in global memory instead of shared memory significantly increased the overall performance by doubling the number of possible thread blocks per SM to eight. The RHS function dominates the cost of the ODE solvers, so any slowdown caused by the matrices in global memory has minimal impact.

The individual ODEs are solved in parallel within a single CUDA "kernel." A CUDA kernel is initiated on the host device but executed on the GPU in parallel. Before the kernels can be executed, the initial conditions for the ODEs must be transferred from the host computer to the GPU device.


In addition, the final integrated solution must then be transferred back to the host after the kernel is complete. The cost of these two transfers must be factored into the total GPU execution time when comparing to the CPU execution time.

V. Results and Discussion

The performance and accuracy of the new GPU-based RHS function and ODE solvers are reported here. All baseline benchmarks were run on an Air Force Research Laboratory (AFRL) High Performance Computing (HPC) DoD Supercomputing Resource Center (DSRC) utility server GPU node. Each GPU node has one NVIDIA Fermi M2050 GPU and two eight-core AMD Opteron 6134 Magny-Cours (2.3 GHz) CPUs. All source code was compiled with the GCC C++ compiler (ver. 4.1.2), the PGI FORTRAN90 compiler (ver. 11.10), and the NVIDIA CUDA C++ compiler (ver. 4.1.28). It is important to note that all computations were conducted using double precision exclusively. The reported wall clock measurements were averaged over a minimum of 10 samples, and the sample size was adjusted until the standard deviation was less than 1%.

All DVODE baseline benchmarks use only one CPU core (i.e., serial execution), and all GPU benchmarks were executed on only one GPU. It is straightforward to solve the ODEs across all available CPU cores using a simple partitioning scheme. Because GPU–GPU communication is not required, partitioning across multiple GPUs is also straightforward. As such, the one-CPU and one-GPU benchmarks presented here can be scaled linearly by the number of installed CPUs and GPUs.

All benchmarks reported later used ICs taken from a large CF-LEM database containing over 19 million individual samples. Recall that each grid point in a CF-LEM domain constitutes an independent ODE system that must be integrated over a specified time step size. For the CF-LEM application, the CF-LEM database provides the ICs (initial species concentration and temperature) along with the pressure and time step for each ODE to be solved. A broad range of operating conditions (strain rates and fuel-oxidizer ratios) is included in the database so that the ODE stiffness and integration times can vary widely for the individual ODEs.

A. Right-Hand Side Function Performance and Accuracy

Parallel GPU implementations were required for both the RHS function and the ODE solvers themselves. In the current CF-LEM application, the computational cost of the ODE solvers is dominated by the cost of the RHS function evaluation. In this section, the parallel efficiency of the RHS function is examined independent of the ODE solvers.

The wall clock time of the two CUDA RHS function implementations is shown in Fig. 2. The results are given as a function of the number N of RHS functions evaluated in parallel within each kernel. The GPU speed up is reported relative to the CANTERA library run time executed sequentially on a single CPU core. In this test, the ODEs are not actually solved. Instead, the RHS function is called only once for each ODE. The RHS function has little conditional logic and is largely unaffected by variations in the ICs between the ODEs, ensuring that there will be little thread divergence. Therefore, the following RHS performance results should be an indication of the maximum potential ODE solver speed up because the RHS evaluation is the dominant cost and thread divergence, a significant performance penalty, has been removed.

A maximum speed up of 26.8x was obtained using the one-thread approach and 11.8x with the one-block algorithm. The one-block method is approximately 2x slower than the one-thread implementation for large N. Recall that many RHS functions are evaluated in parallel, and so there is more parallelism for higher N. The one-thread approach performs very efficiently under this scenario because thread divergence is minimal and there is sufficient parallelism (tens of thousands of threads) to hide the global memory latency. However, for small N, the one-block strategy is far superior. For example, the one-block strategy is 4.3x faster than the one-thread method for N = 100. More important, the one-block algorithm is faster than the CANTERA CPU run time at this size of N. The one-thread algorithm only becomes faster than the CPU baseline when N exceeds 10^3, whereas the one-block algorithm outperforms the CPU baseline with as few as 100. The lower break-even point of the one-block approach must be balanced against the much faster potential throughput of the one-thread approach when N is very large.

Calculations on the GPU must overcome the combined overhead of data transfer, memory allocation, and kernel launch before the run time savings can be realized. The combined overhead (i.e., all time but the kernel execution time) on the RHS function evaluation is shown in Fig. 2. When compared with the one-thread run time, the CUDA overhead accounts for 6% of the run time for N = 1 and increases to over 17% for N = 10^6. The fact that the relative overhead increases with problem size is easily explained: The one-thread parallel efficiency increases with large N, leading to a higher proportional overhead. Note that the same absolute overhead will be incurred by the full ODE solvers because the memory allocation and data transfer amounts are the same.

The accuracy of the two CUDA RHS algorithms was also measured against the CANTERA baseline. The difference between the two CUDA RHS evaluations and the CANTERA library was computed over 50,000 randomly selected ODEs. The maximum difference over all the species and temperature was 3.7 × 10^-8, measured in the L2 norm. This difference is attributed entirely to the modifications required to vectorize the forward Arrhenius rate coefficients, as described previously. The impact of this numerical difference on the integrated solution will be shown in the next section. Note that there is negligible difference between the one-thread and one-block versions. Any minor numerical difference between the two CUDA implementations is due to the one-block parallel reduction algorithm. The parallel reduction algorithm changes the order of operations, which can change the final numerical value. However, the difference is only on the order of the double-precision roundoff error, a value far below the measured numerical difference.

B. Ordinary Differential Equation Performance: CPU

The accuracy and stability of the new RKF45 ODE solver were assessed before it was accepted as a suitable alternative to the DVODE and CVODE algorithms. A serial CPU implementation of the RKF45 ODE solver was used to integrate 50,000 randomly selected ODEs from the CF-LEM database. The integrated solutions were compared with those from the DVODE baseline solver. The relative and absolute tolerances for both solvers were fixed at 10^-11 and 10^-13, respectively. The difference between the DVODE and RKF45 species and temperature solutions is shown in Fig. 3. Both the L2 and L∞ norms are shown, and the values have been normalized by the square of the integrated DVODE values. The maximum normalized difference of any one component is on the order of 10^-7, and the combined difference over all active species is 10^-12. This small difference shows that the RKF45 algorithm is suitable for the current ethene combustion application. More important, the performance of the RKF45 algorithm can be directly compared with the baseline DVODE and GPU-based CVODE algorithms.

Fig. 2 Run time (solid) and CUDA speed up (dashed) of the CANTERA and CUDA RHS functions versus the number of ODEs (curves: CPU-CANTERA, CUDA-RHS one-thread, CUDA-RHS one-block, and CUDA overhead; wall clock time in seconds).


Figures 4 and 5 show the run time and speed up of the serial CPU and parallel GPU ODE solvers. The speed up is presented as a function of the number of ODEs solved per kernel, Node, and is relative to the serial CPU DVODE run time. Some variation in cost is observed between the codes for small Node due to differences in the ICs; however, these differences are largely averaged away after 10^3 ODEs. Beyond that point, both DVODE and RKF45 scale linearly with Node. Overall, the DVODE run time is approximately 50% faster, despite taking twice the number of integration steps (on average). The cost saving in DVODE comes largely from the reduced number of RHS evaluations per integration step. For instance, DVODE required only 1.78 RHS evaluations per step on average, compared with six for RKF45 at Node = 50,000. The number of failed integration steps in RKF45 was minor at this point: Only 5.6% of the integration steps failed, forcing refinement. As noted previously, the sequential cost-saving measures taken by DVODE (e.g., Jacobian recycling) may prove counterproductive in the many-core GPU environment. However, the CUDA RKF45 implementations must overcome nearly a 50% performance penalty to break even with DVODE.

C. Ordinary Differential Equation Performance: GPU

The performance of the GPU implementations of the CVODE andRKF45ODE solvers is now analyzed relative to the baselineDVODEand RKF45 solvers executed serially on the CPU. Referring again toFig. 5, the CUDA-CVODE one-thread performance is seen to bemany times slower than DVODE until Node exceeds 10

3. After thisbreakeven point, the speed up with the CUDA-CVODE one-threadgrows slowly, eventually reaching a steady 7.7x speed up over thebaseline DVODE CPU solver.CUDA-RKF45 one-thread follows a similar scaling trend but is

consistently 2.3x faster than CUDA-CVODE one-thread over theentire range of Node. The breakeven point with the CUDA-RKF45one-thread solver is between 102 and 103 ODEs; the speed up isalready 2.4x at only 103 ODEs. The maximum speed up for theCUDA-RKF45one-thread solver is 20.2x over the serialDVODE runtime. This 20.2x speed up matches closely to the CUDA RHS speedup previously reported, suggesting that the RHS function is thelimiting factor in the throughput using the CUDA-RKF45 one-threadmethod. Note that CUDA-RKF45 one-thread is 28.6x faster than theserial CPU implementation of RKF45.Both one-thread versions of CUDA-CVODE and CUDA-RKF45

Both one-thread versions of CUDA-CVODE and CUDA-RKF45 suffer from poor performance when N_ode is small. This is consistent with the one-thread RHS-only results shown earlier in Fig. 2. The CUDA-RKF45 one-block breakeven point is only slightly greater than 10 ODEs: far sooner than either one-thread ODE implementation and also sooner than was observed for the one-block RHS-only results. CUDA-RKF45 one-block quickly reaches a maximum speed up of 10.7x relative to DVODE at N_ode ≈ 10^4. Again, the CUDA-RKF45 one-block maximum speed up closely matches the one-block RHS implementation (approximately 11x speed up), clarifying that the RHS function is the limiting factor for both CUDA-RKF45 implementations. The CUDA-CVODE one-block performance is nearly identical to CUDA-RKF45 one-block for small N_ode but achieves only a 7.3x speed up for large N_ode.

The relative overhead cost can be inferred by referring back to Fig. 2. The absolute overhead for the ODE solvers is the same as in the RHS-only performance test. Recall that the RHS function must be called at least 60 times (a minimum of 10 time steps) by the RKF45 ODE solver. There are similar lower limits for CVODE as well. This effectively amortizes the overhead reported in Fig. 2 over many more RHS function evaluations and reduces the relative overhead. The CUDA-CVODE one-block overhead accounts for only 1.5% of the total run time when N_ode is less than 100 and quickly drops below 0.1% for large N_ode. Clearly, the data-transfer and memory-allocation overhead has little impact on the peak CUDA ODE solver performance.
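The contrasting one-block-per-ODE mapping assigns an entire thread block to a single system, so the threads of a block cooperate on the species-level work and per-system differences cannot cause divergence within a warp. The kernel below is an illustrative sketch under the same assumptions as the previous example (fixed-step Euler on an invented linear system, placeholder names), not the authors' implementation; the host-side setup mirrors the previous sketch except that the grid dimension equals the number of ODE systems.

```cpp
// One-block-per-ODE mapping: block b integrates ODE system b, with the
// threads of the block splitting the per-species work. Illustrative only.
#include <cuda_runtime.h>

#define NEQ 8  // equations per system (hypothetical)

__global__ void integrate_one_block(double* y, const double* rate,
                                    int n_ode, double dt, int nsteps)
{
    const int sys = blockIdx.x;               // one ODE system per thread block
    if (sys >= n_ode) return;

    __shared__ double local[NEQ];             // the block's shared copy of the state
    for (int k = threadIdx.x; k < NEQ; k += blockDim.x)
        local[k] = y[sys * NEQ + k];
    __syncthreads();

    for (int s = 0; s < nsteps; ++s) {
        // Species loop is strided across the block's threads; every thread works
        // on the same system, so per-system differences cannot split the warp.
        for (int k = threadIdx.x; k < NEQ; k += blockDim.x)
            local[k] += dt * (-rate[sys] * local[k]);
        __syncthreads();                      // all species updated before the next step
    }

    for (int k = threadIdx.x; k < NEQ; k += blockDim.x)
        y[sys * NEQ + k] = local[k];
}

// Launch with one block per ODE system, e.g.:
//   integrate_one_block<<<n_ode, 32>>>(d_y, d_rate, n_ode, 1e-4, 100);
```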

The preceding benchmarks showed the performance of the various ODE solvers on a database of ICs taken from actual LEM simulations. In these simulations, hundreds of LEM cells are used to discretize the LEM computational domain. Many different LEM simulations are therefore solved concurrently when N_ode is much greater than 10^3. Recall that the LEM can be viewed as a 1-D DNS method, and therefore the concentration profiles and temperature should vary smoothly throughout the domain. For example, Fig. 1 showed a non-premixed combustion simulation with 241 LEM cells.

Fig. 3 Numerical difference between DVODE and RKF45 over 50,000 ODEs, shown in the L2 and L∞ norms (log10 of the error plotted for each species and temperature).

Fig. 4 Total run time of the baseline CPU DVODE and RKF45 solvers and their CUDA counterparts (wall-clock time in seconds versus number of ODEs).

Fig. 5 Speed up of RKF45 and the CUDA ODE solvers relative to the baseline CPU DVODE run time (speed up versus number of ODEs).


Page 9: Techniques for Solving Stiff Chemical Kinetics on Graphical Processing Units

Because the profiles vary smoothly, neighboring LEM cells within each simulation can be expected to have similar ICs. By extension, their integration cost (e.g., number of steps and RHS evaluations) should be similar. In the one-thread parallel processing strategy, neighboring cells are likely solved within the same thread warp, resulting, on average, in lower thread divergence and higher efficiency. (Note that high reaction rate gradients exist through the relatively thin flame-front region and, locally, high thread divergence may persist.) The inverse is also expected: highly dissimilar ICs within each warp should increase thread divergence and degrade performance. This effect is only expected for the one-thread implementations; the one-block strategy was designed explicitly to minimize this effect. Two additional benchmarks were conducted to measure the impact of this LEM cell similarity on the CUDA ODE solver performance.
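The warp argument reduces to a few lines of indexing arithmetic: with one thread per ODE, consecutive LEM cells map to consecutive global thread indices, and the hardware executes every 32 consecutive threads as one warp. The hypothetical helper below simply spells that out.

```cpp
// With one thread per ODE, cell i maps to global thread i, and threads are
// executed in warps of 32. Neighboring cells therefore share a warp, so
// similar ICs (and similar step counts) keep the warp in lockstep.
__device__ int warp_of_cell(int cell_index)
{
    return cell_index / warpSize;   // cells 0-31 -> warp 0, cells 32-63 -> warp 1, ...
}
```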

The same set of ODEs was solved as before; however, the ordering of the ODEs was randomized to remove the implicit similarity between the LEM cells assigned to a given warp. Note that, for each N_ode, the same ODEs were solved, so that the sequential CPU cost is unchanged. The maximum speed up for the CUDA-RKF45 one-thread algorithm dropped from 20.2x to only 13.2x with the randomized database, a decrease of 35%. The CUDA-CVODE one-thread peak performance declined only 13%, from 7.7x to 6.7x relative to the DVODE baseline. This implies that the thread divergence within the CUDA-CVODE one-thread implementation is already higher than in its RKF45 counterpart. Note that neither one-block implementation was affected by the randomization because each ODE is treated independently. That is, the similarity between the ICs only impacts the one-thread efficiency.
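A host-side sketch of this randomization benchmark is given below. It is not the authors' harness; the permutation array, fixed seed, and function name are assumptions. It illustrates the key property of the test: the same set of initial conditions is solved either way, and only the assignment of cells to thread indices (and hence to warps) changes.

```cpp
// Randomize the cell-to-thread assignment while solving the same set of ODEs.
// Illustrative only: 'ic_index' is a hypothetical permutation applied when the
// initial-condition database is packed for the GPU.
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

std::vector<int> make_ordering(int n_ode, bool randomized)
{
    std::vector<int> ic_index(n_ode);
    std::iota(ic_index.begin(), ic_index.end(), 0);   // 0,1,2,... (natural LEM cell order)
    if (randomized) {
        std::mt19937 rng(12345);                      // fixed seed for repeatability
        std::shuffle(ic_index.begin(), ic_index.end(), rng);
    }
    return ic_index;   // thread t integrates the ODE whose IC is ic_index[t]
}
```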

Randomizing the LEM database represented a worst-case scenario. The final test examined the opposite, best-case scenario: perfect similarity between the ICs. For this test, the same ODE is solved repeatedly. This synthetic scenario should give the optimal throughput for the one-thread method because full SIMD vectorization is guaranteed across all ODEs. In fact, the throughput should mimic the earlier RHS-only tests. A single IC was selected from the database that required a similar number of DVODE integration steps and RHS evaluations as the full database on average. In this scenario, CUDA-CVODE one-thread achieved a maximum speed up of 29.2x over the DVODE CPU baseline. The peak performance of CUDA-RKF45 one-thread was 20.4x relative to DVODE, a significantly poorer showing against CUDA-CVODE than was observed earlier. However, the RKF45 CPU run time for this problem is 2.25x higher than the DVODE CPU run time, much higher than the 50% penalty observed over the entire LEM database on average. The CUDA-RKF45 performance relative to the serial RKF45 CPU run time paints a much different picture: CUDA-RKF45 one-thread achieves a peak speed up of 46.2x relative to the RKF45 CPU run time. Again, neither one-block implementation was affected.

VI. Conclusions

This research effort has investigated the performance and accuracy of solving stiff chemical kinetics ordinary differential equations (ODEs) on Compute Unified Device Architecture (CUDA) graphical processing units (GPUs). The mathematical basis of the combustion kinetics ODE formulation and how it impacts a parallel implementation was reviewed. Two different GPU implementations of the right-hand side function (the reaction mechanism) and the two ODE solvers were reviewed in detail.

The two CUDA ODE solvers, CUDA-CVODE and CUDA-Runge–Kutta–Fehlberg (RKF45), both achieved significant speed up compared with the baseline serial DVODE CPU solver. The one-thread-per-ODE implementation of the CUDA-RKF45 ODE solver achieved over 14x speed up compared with DVODE with only 10,000 ODEs. Speed ups of 20.2x were observed when the number of ODEs exceeded 50,000. The one-thread version of CUDA-CVODE achieved a lower speed up, only 7.7x compared with DVODE, but exhibited a similar scaling trend. The one-thread-block-per-ODE CUDA implementations of both ODE solvers achieved lower maximum speed ups compared with their one-thread counterparts. However, both one-thread-block-per-ODE implementations outperformed DVODE with as few as 100 ODEs and achieved peak performance with only 10^3 ODEs, far sooner than the one-thread-per-ODE methods.

The typical linear eddy model (LEM) simulation uses hundreds of cells to resolve the reaction zone. The one-thread-block-per-ODE CUDA ODE solvers, either CVODE or RKF45, offer a potential speed up of between 4x and 7x compared with the baseline serial DVODE if only one LEM domain is solved at a time. Far greater speed up is possible if multiple LEM simulations are solved concurrently. In this manner, the total number of ODEs solved on the GPU can be increased, offering speed ups in excess of 20x compared with DVODE.
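Solving several LEM domains concurrently amounts to packing their cell states into one contiguous batch before a single kernel launch, so the solver sees a large N_ode. The host-side sketch below illustrates the idea; the domain container and field names are invented and do not correspond to the authors' code.

```cpp
// Pack the cells of several LEM domains into one batched state array so a
// single ODE-solver kernel launch sees a large n_ode. Names are illustrative.
#include <vector>

struct LemDomain {
    std::vector<double> state;   // ncells * NEQ values (temperature and species per cell)
};

std::vector<double> pack_batch(const std::vector<LemDomain>& domains)
{
    std::vector<double> batch;
    for (const LemDomain& d : domains)
        batch.insert(batch.end(), d.state.begin(), d.state.end());
    return batch;   // copy to the GPU once and launch the solver over all cells
}
```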

It is interesting to note that the slower RKF45 algorithm outperformed the CPU-optimized DVODE and CVODE algorithms when implemented in the massively parallel GPU environment. The simplified structure of the RKF45 algorithm provided greater parallel efficiency, easily overcoming its lower sequential performance. The maximum throughput of the CUDA-RKF45 algorithm is approximately twice that of the CUDA-CVODE peak. In addition, CUDA-RKF45 reached a speed up of 28.6x compared with the RKF45 CPU implementation.

The best- and worst-case performance of the one-threaded CUDA algorithms was also examined. The worst-case scenario was modeled by randomizing the original LEM database, because adjacent LEM cells tend to have similar ODE integration costs and placing these neighboring ODE problems within the same warp was expected to minimize thread divergence and improve processor utilization. Randomization led to a 35% performance drop for CUDA-RKF45, but only a 13% drop for CUDA-CVODE. The best-case scenario was estimated by forcing thread divergence to zero with a synthetic database, approximated by having all threads solve the same ODE problem. Comparing the original results with the best- and worst-case scenarios, it is estimated that CUDA-RKF45 one-thread achieves 62% of its peak performance (28.6x of a possible 46.2x), whereas CUDA-CVODE one-thread achieves only 26% (7.7x of a possible 29.2x). From this, the authors conclude that the CUDA-CVODE algorithm is highly susceptible to thread divergence caused by differences between the problems within a single warp.

In summary, the potential of GPUs to accelerate stiff ODE integration in computational fluid dynamics (CFD) applications was clearly demonstrated. The results presented in this paper were obtained with only one reduced C2H4 mechanism; however, they should be applicable to a broader range of reaction mechanisms of similar size and complexity. The performance results can be directly extended from the one-dimensional (1-D) LEM application to multidimensional CFD combustion modeling applications that use an operator-splitting approach for chemical kinetics integration. Cell ordering was shown to have a significant impact on performance in the 1-D setting; multidimensional cell-ordering strategies will be required to maximize performance in three-dimensional turbulent combustion simulations with rapid spatial variations. Finally, because the communication overhead was shown to be negligible, the potential speed up should scale linearly across multiple GPUs as well as within a larger GPU-based cluster.

Acknowledgments

Funding for this work was provided by the Department of Defense High Performance Computing Modernization Program's User Productivity Enhancement, Technology Transfer, and Training Program (contract GS04T09DBC0017 through High Performance Technologies, Inc.) under project PP-CFD-KY02-115-P3, titled “Implementing stiff chemical kinetic ODE solvers on GPUs using CUDA.” The authors thank William Calhoon (CRAFT Tech, Inc.) for his assistance with the baseline DVODE solver and for providing the linear eddy model database, and Balu Sekar (Air Force Research Laboratory/RQTC, Wright-Patterson Air Force Base) for his guidance during this research effort. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Air Force Research Laboratory or the U.S. Government.


Page 10: Techniques for Solving Stiff Chemical Kinetics on Graphical Processing Units



J. Oefelein
Associate Editor


