
Available online at www.sciencedirect.com

ScienceDirect

Mathematics and Computers in Simulation 109 (2015) 1–19
www.elsevier.com/locate/matcom

Original articles

Novel 3D GPU based numerical parallel diffusion algorithms in cylindrical coordinates for health care simulation

Beini Jiang a, Weizhong Dai b, Abdul Khaliq b, Michelle Carey c, Xiaobo Zhou d,1, Le Zhang e,f,∗

a Department of Mathematical Sciences, Michigan Tech University, Houghton, MI, USA
b Mathematics and Statistics, College of Engineering and Science, Louisiana Tech University, Ruston, LA, USA
c Department of Biostatistics and Computational Biology, Center for Biodefense Immune Modeling, University of Rochester, 601 Elmwood Avenue, Rochester, NY 14642, USA
d Department of Pathology, The Methodist Hospital Research Institute & Weill Cornell Medical College, 6565 Fannin St, Houston, TX, USA
e College of Computer and Information Science, Southwest University, Chongqing 400715, China
f Department of Biostatistics and Computational Biology, University of Rochester Medical Center, Rochester, NY 14642, USA

Received 29 June 2013; received in revised form 19 February 2014; accepted 11 July 2014
Available online 7 August 2014

Abstract

Modeling diffusion processes, such as drug delivery, bio-heat transfer, and cytokine concentration changes in computational biology research, requires intensive computing resources, as one must employ sequential numerical algorithms to obtain accurate numerical solutions, especially for real-time in vivo 3D simulation. Thus, it is necessary to develop a new numerical algorithm compatible with state-of-the-art computing hardware. The purpose of this article is to integrate graphics processing unit (GPU) technology with the locally-one-dimension (LOD) numerical method for solving partial differential equations, and to develop a novel 3D GPU-based numerical parallel diffusion (GNPD) algorithm in cylindrical coordinates, which can be used in neuromuscular junction research.

To demonstrate the effectiveness and efficiency of the GNPD algorithm, we employed it to approximate the real diffusion of the neurotransmitter through a disk-shaped volume. This disk-shaped volume is the synaptic gap, which connects the neuron and the muscle cell in the neuromuscular junction. Furthermore, we compared the speed and accuracy of the GNPD with a conventional sequential diffusion algorithm. Results show that the GNPD not only significantly accelerates the diffusion solver via GPU-based parallelism, but also greatly increases its accuracy by employing the stream function of the latest Fermi GPU cards. Therefore, the GNPD has great potential to be employed in the design, testing, and implementation of health information systems in the near future.
© 2014 Published by Elsevier B.V. on behalf of International Association for Mathematics and Computers in Simulation (IMACS).

Keywords: Graphics processing unit (GPU); Locally-one-dimension (LOD) method; Parallel computing; Domain decomposition

∗ Correspondence to: Department of Biostatistics and Computational Biology, University of Rochester, 601 Elmwood Avenue Box 630, Rochester, NY 14642, USA. Tel.: +1 585 275 6689; fax: +1 585 273 1031.

E-mail addresses: [email protected] (X. Zhou), Le [email protected] (L. Zhang).
1 Tel.: +1 713 441 8692; fax: +1 713 441 8696.

http://dx.doi.org/10.1016/j.matcom.2014.07.003
0378-4754/© 2014 Published by Elsevier B.V. on behalf of International Association for Mathematics and Computers in Simulation (IMACS).


1. Introduction

Computer simulation is often used in real-time health care situations, such as radiotherapy planning, bio-heat diffusion and real-time workflow optimization. Among these diverse clinical applications, modeling diffusion plays an important role in biomedical engineering research, such as the diffusion of cytokines [2,1,27–30,37–39], heat [9,34–36] and the synaptic transmission of signals [15]. Several sequential numerical schemes, such as the Crank–Nicolson, ADI, and LOD methods [18], have been used to simulate diffusion processes. However, these methods are computationally intensive, and hence real-time simulation with high approximation accuracy is impossible. For example, Athale et al. [2,1] used relatively coarse grids to simulate the diffusion of chemoattractants in the cancer microenvironment. Though these studies [2,1] showed a reduction in computing time, the low simulation accuracy prevented their use in a real cancer modeling system. Also, Dai et al. [9] and Zhang et al. [34–36] employed special numerical algorithms, such as the preconditioned Richardson scheme [3,10], to achieve the specific goals of their 3D biomedical projects by sacrificing accuracy in the x and y dimensions. However, this method is too specific to be widely used for general computational health care research. For this reason, our previous articles [12,13] presented three parallel numerical diffusion algorithms to accelerate a 2D sequential solver in rectangular coordinates by employing graphics processing units (GPUs). These strategies are parallel computing using global memory (PGM); parallel computing using shared memory, global memory and CPU synchronization (PSGMC); and parallel computing using shared memory, global memory and GPU synchronization (PSGMG). The simulation results demonstrate that a GPU-based 2D parallel diffusion solver can greatly improve computing speed while retaining high accuracy.

This research develops a GPU-based parallel diffusion simulator for the neurotransmitter. The neurotransmitter diffuses through a disk-shaped volume, called the synaptic gap, which connects the neuron and the muscle cell in the neuromuscular junction. To model this disk-shaped gap, we use 3D cylindrical coordinates instead of 2D Cartesian grids. However, since cylindrical grids are irregular, with relatively sparser mesh points far from the origin, the approximation accuracy will be lower in the outer region. To resolve this problem, this study employs the stream function of the GPU, which can assign thread blocks of different dimensions to different streams. Based on our promising 2D acceleration results [12,13] and the most recently released features of the GPU [19], we propose the 3D GPU-based numerical parallel diffusion (GNPD) algorithms to accelerate the computing speed and improve the accuracy of a well-developed health care application [15]. This health care application (a neuromuscular diffusion model) [15] employed a mass diffusion–reaction model to simulate synaptic transmission of signals between excitable cells in cylindrical coordinates, where the diffusion–reaction equation was solved by the conventional sequential Crank–Nicolson method. In this paper, we employ the LOD scheme, which, in conjunction with the "stream" function, allows us to decrease the computational cost and increase the accuracy.

It should be noted that, in our experience, the conventional sequential diffusion algorithm requires a considerable amount of computing time. Moreover, because of the nature of cylindrical coordinates, even finer grids cannot provide an accurate numerical solution for a large simulated field, especially for regions far away from the origin. For these reasons, we propose two 3D GPU-based parallel numerical diffusion algorithms, PGM with warp vote function (PGMW) and PSGMC with stream function (PSGMCS), for Neumann boundary conditions [7,15]. Both PGMW and PSGMCS significantly increase the computing speed. More importantly, PSGMCS can increase the solution accuracy via the newly released "stream" function of the GPU for large 3D matrices. Finally, we discuss the advantages of these GPU-based parallel numerical computing algorithms from both a theoretical and a practical point of view. The simulation results demonstrate that PGMW and PSGMCS can greatly increase the performance of the health care application for a large simulated field. Additionally, PSGMCS shows the potential to increase the accuracy of large-matrix computing.

2. Diffusion model and the numerical method

We consider a diffusion equation in cylindrical coordinates as follows [15]:

$$\frac{\partial A}{\partial t} = \frac{D_r}{r}\frac{\partial}{\partial r}\left(r\frac{\partial A}{\partial r}\right) + \frac{D_\Phi}{r^2}\frac{\partial^2 A}{\partial \Phi^2} + D_z\frac{\partial^2 A}{\partial z^2}, \quad (1a)$$


Fig. 1. Configuration of mesh points in the cylindrical diffusion model.

with the boundary condition

$$\frac{\partial A(r,\Phi,0,t)}{\partial z} = \frac{\partial A(r,\Phi,L,t)}{\partial z} = 0, \quad (1b)$$

$$A(r,\Phi \pm 2\pi, z, t) = A(r,\Phi,z,t), \quad (1c)$$

$$A(r_{\max},\Phi,z,t) = 0, \quad (1d)$$

where $D_r$, $D_\Phi$, $D_z$ are the diffusivities along the radial, angular and transverse directions, respectively, and $0 < r < r_{\max}$, $0 \le \Phi \le 2\pi$, $0 < z < L$. Here, $A$ represents the simulated synaptic transmission of signals between excitable cells [15], $r_{\max}$ and $L$ are the radius and height of the 3D field, respectively, and $t$ is time. We employ the Crank–Nicolson method to solve the above diffusion problem [15,18].

2.1. Discrete boundary condition along the transverse direction

The grid along the transverse direction is designed to give a new finite difference scheme for the Neumann boundary condition, as shown in Fig. 1, where the distance between the actual left boundary and $z_1$ is assumed to be $\theta_1\Delta z$, and the distance between the actual right boundary and $z_K$ is $\theta_2\Delta z$ [7,15]. The constants $\theta_1$ and $\theta_2$ can be obtained as follows. First, we denote $A(r,\Phi,z,t)$ as $A(z,t)$ and $\Delta z$ as $h$, respectively, for simplicity. Then the finite difference approximation of $\frac{\partial^2 A(z,t)}{\partial z^2}$ at $z_1$ is written as

$$b\,\frac{\partial^2 A(z_1,t)}{\partial z^2} = \frac{a}{h^2}\left[A(z_2,t) - A(z_1,t)\right] - \frac{1}{h}\,\frac{\partial A(z_1 - \theta_1 h, t)}{\partial z}, \quad (2a)$$

where $a$, $b$, $\theta_1$ are constants to be determined [7,15]. Expanding each term of Eq. (2a) into a Taylor series at $z_1$, we obtain the right-hand side (RHS) of Eq. (2a) as

$$\begin{aligned}
RHS &= \frac{a}{h^2}\left[h A_z(z_1,t) + \frac{h^2}{2}A_{zz}(z_1,t) + \frac{h^3}{6}A_{z^3}(z_1,t)\right] - \frac{1}{h}\left[A_z(z_1,t) - \theta_1 h A_{zz}(z_1,t) + \frac{\theta_1^2 h^2}{2}A_{z^3}(z_1,t)\right] + O(h^2)\\
&= \frac{1}{h}\left[a-1\right]A_z(z_1,t) + \left[\frac{a}{2}+\theta_1\right]A_{zz}(z_1,t) + \frac{h}{2}\left[\frac{a}{3}-\theta_1^2\right]A_{z^3}(z_1,t) + O(h^2). \quad (2b)
\end{aligned}$$

Omitting the truncation error $O(h^2)$ and matching both sides gives

$$a = 1, \qquad b = \frac{1}{2} + \frac{\sqrt{3}}{3}, \qquad \theta_1 = \frac{\sqrt{3}}{3}. \quad (2c)$$

Substituting the values of $a$, $b$, and $\theta_1$ in Eq. (2c) into Eq. (2a), we obtain a second-order finite difference approximation at $z_1$ as

$$\frac{\partial^2 A(z_1,t)}{\partial z^2} \approx \frac{a}{bh^2}\left[A(z_2,t) - A(z_1,t)\right] - \frac{1}{bh}\,\frac{\partial A(z_1 - \theta_1 h, t)}{\partial z}. \quad (2d)$$

Similarly, the finite difference approximation of $\frac{\partial^2 A(z,t)}{\partial z^2}$ at $z_K$ can be expressed as

$$b^*\,\frac{\partial^2 A(z_K,t)}{\partial z^2} = \frac{1}{h}\,\frac{\partial A(z_K + \theta_2 h, t)}{\partial z} - \frac{a^*}{h^2}\left[A(z_K,t) - A(z_{K-1},t)\right], \quad (3a)$$

where $a^*$, $b^*$, $\theta_2$ are constants to be determined. Employing the Taylor series expansion and matching both sides again, we have

$$\begin{aligned}
RHS &= \frac{1}{h}\left[A_z(z_K,t) + \theta_2 h A_{zz}(z_K,t) + \frac{\theta_2^2 h^2}{2}A_{z^3}(z_K,t)\right] - \frac{a^*}{h^2}\left[h A_z(z_K,t) - \frac{h^2}{2}A_{zz}(z_K,t) + \frac{h^3}{6}A_{z^3}(z_K,t)\right] + O(h^2)\\
&= \frac{1}{h}\left[1-a^*\right]A_z(z_K,t) + \left[\frac{a^*}{2}+\theta_2\right]A_{zz}(z_K,t) + \frac{h}{2}\left[\theta_2^2 - \frac{a^*}{3}\right]A_{z^3}(z_K,t) + O(h^2) \quad (3b)
\end{aligned}$$

and

$$a^* = 1, \qquad b^* = \frac{1}{2} + \frac{\sqrt{3}}{3}, \qquad \theta_2 = \frac{\sqrt{3}}{3}. \quad (3c)$$

Then a second-order finite difference approximation at $z_K$ for the right boundary is

$$\frac{\partial^2 A(z_K,t)}{\partial z^2} \approx \frac{1}{b^*h}\,\frac{\partial A(z_K + \theta_2 h, t)}{\partial z} - \frac{a^*}{b^*h^2}\left[A(z_K,t) - A(z_{K-1},t)\right]. \quad (3d)$$

Thus, the grid size and the coordinates of the mesh points along the transverse direction are

$$h = \frac{L}{K + \theta_1 + \theta_2 - 1}, \qquad z_k = (k - 1 + \theta_1)h, \quad k = 1,\ldots,K, \quad (4)$$

where $K$ is the number of interior mesh points. With the Neumann boundary condition in Eq. (1b), Eqs. (2d) and (3d) can be modified into

$$\frac{\partial^2 A(z_1,t)}{\partial z^2} \approx \frac{a}{bh^2}\left[A(z_2,t) - A(z_1,t)\right], \quad (5a)$$

$$\frac{\partial^2 A(z_K,t)}{\partial z^2} \approx -\frac{a^*}{b^*h^2}\left[A(z_K,t) - A(z_{K-1},t)\right]. \quad (5b)$$
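As a concrete illustration of Eqs. (2c), (3c) and (4), the short routine below builds the transverse mesh. It is a minimal sketch (not the authors' code), assuming only that $K$ and $L$ are given and that the caller provides an array of length $K$ for the coordinates.

```cuda
#include <math.h>

/* Sketch: transverse mesh of Section 2.1. theta1 = theta2 = sqrt(3)/3
   come from Eqs. (2c)/(3c); the grid size h and the coordinates
   z_k = (k - 1 + theta1) h come from Eq. (4). */
void build_transverse_mesh(int K, double L, double *z, double *h_out)
{
    const double theta1 = sqrt(3.0) / 3.0;
    const double theta2 = sqrt(3.0) / 3.0;
    const double h = L / ((double)K + theta1 + theta2 - 1.0); /* Eq. (4) */
    for (int k = 1; k <= K; k++)
        z[k - 1] = ((double)(k - 1) + theta1) * h;
    *h_out = h;
}
```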


2.2. Crank–Nicolson scheme [18] for the diffusion equation

Let $r_i = i\Delta r$, $\Phi_j = j\Delta\Phi$, $0 \le i \le I$, $0 \le j \le J-1$, where $\Delta r = r_{\max}/I$ and $\Delta\Phi = 2\pi/J$ are the grid sizes along the radial and angular directions, respectively (as shown in Fig. 1). Based on the mesh design described in the previous section, $z_k = (k - 1 + \theta_1)\Delta z$, $1 \le k \le K$, where $\Delta z = \frac{L}{K + \theta_1 + \theta_2 - 1}$ is the grid size along the transverse direction. We denote $A^n_{i,j,k}$ to be the approximation of $A(r_i, \Phi_j, z_k, n\Delta t)$, where $n$ is the time step and $\Delta t$ is the time increment.

At interior points with $1 \le i \le I-1$, $0 \le j \le J-1$, and $2 \le k \le K-1$, the Crank–Nicolson scheme [18] gives

$$\begin{aligned}
\frac{A^{n+1}_{i,j,k} - A^n_{i,j,k}}{\Delta t} &= \frac{D_r}{2r_i(\Delta r)^2}\left[r_{i+\frac12}\left(A^{n+1}_{i+1,j,k} - A^{n+1}_{i,j,k}\right) - r_{i-\frac12}\left(A^{n+1}_{i,j,k} - A^{n+1}_{i-1,j,k}\right)\right]\\
&\quad + \frac{D_r}{2r_i(\Delta r)^2}\left[r_{i+\frac12}\left(A^n_{i+1,j,k} - A^n_{i,j,k}\right) - r_{i-\frac12}\left(A^n_{i,j,k} - A^n_{i-1,j,k}\right)\right]\\
&\quad + \frac{D_\Phi}{2r_i^2(\Delta\Phi)^2}\left[A^{n+1}_{i,j+1,k} - 2A^{n+1}_{i,j,k} + A^{n+1}_{i,j-1,k}\right] + \frac{D_\Phi}{2r_i^2(\Delta\Phi)^2}\left[A^n_{i,j+1,k} - 2A^n_{i,j,k} + A^n_{i,j-1,k}\right]\\
&\quad + \frac{D_z}{2(\Delta z)^2}\left[A^{n+1}_{i,j,k+1} - 2A^{n+1}_{i,j,k} + A^{n+1}_{i,j,k-1}\right] + \frac{D_z}{2(\Delta z)^2}\left[A^n_{i,j,k+1} - 2A^n_{i,j,k} + A^n_{i,j,k-1}\right]. \quad (6a)
\end{aligned}$$

The finite difference scheme at the location z1 is obtained as follows:

$$\begin{aligned}
\frac{A^{n+1}_{i,j,k} - A^n_{i,j,k}}{\Delta t} &= \frac{D_r}{2r_i(\Delta r)^2}\left[r_{i+\frac12}\left(A^{n+1}_{i+1,j,k} - A^{n+1}_{i,j,k}\right) - r_{i-\frac12}\left(A^{n+1}_{i,j,k} - A^{n+1}_{i-1,j,k}\right)\right]\\
&\quad + \frac{D_r}{2r_i(\Delta r)^2}\left[r_{i+\frac12}\left(A^n_{i+1,j,k} - A^n_{i,j,k}\right) - r_{i-\frac12}\left(A^n_{i,j,k} - A^n_{i-1,j,k}\right)\right]\\
&\quad + \frac{D_\Phi}{2r_i^2(\Delta\Phi)^2}\left[A^{n+1}_{i,j+1,k} - 2A^{n+1}_{i,j,k} + A^{n+1}_{i,j-1,k}\right] + \frac{D_\Phi}{2r_i^2(\Delta\Phi)^2}\left[A^n_{i,j+1,k} - 2A^n_{i,j,k} + A^n_{i,j-1,k}\right]\\
&\quad + \frac{aD_z}{2b(\Delta z)^2}\left[A^{n+1}_{i,j,k+1} - A^{n+1}_{i,j,k}\right] + \frac{aD_z}{2b(\Delta z)^2}\left[A^n_{i,j,k+1} - A^n_{i,j,k}\right], \quad (6b)
\end{aligned}$$

where $1 \le i \le I-1$, $0 \le j \le J-1$, $k = 1$ and $a = 1$, $b = \frac12 + \frac{\sqrt3}{3}$. Similarly, the finite difference scheme for the location $z_K$ gives

$$\begin{aligned}
\frac{A^{n+1}_{i,j,k} - A^n_{i,j,k}}{\Delta t} &= \frac{D_r}{2r_i(\Delta r)^2}\left[r_{i+\frac12}\left(A^{n+1}_{i+1,j,k} - A^{n+1}_{i,j,k}\right) - r_{i-\frac12}\left(A^{n+1}_{i,j,k} - A^{n+1}_{i-1,j,k}\right)\right]\\
&\quad + \frac{D_r}{2r_i(\Delta r)^2}\left[r_{i+\frac12}\left(A^n_{i+1,j,k} - A^n_{i,j,k}\right) - r_{i-\frac12}\left(A^n_{i,j,k} - A^n_{i-1,j,k}\right)\right]\\
&\quad + \frac{D_\Phi}{2r_i^2(\Delta\Phi)^2}\left[A^{n+1}_{i,j+1,k} - 2A^{n+1}_{i,j,k} + A^{n+1}_{i,j-1,k}\right] + \frac{D_\Phi}{2r_i^2(\Delta\Phi)^2}\left[A^n_{i,j+1,k} - 2A^n_{i,j,k} + A^n_{i,j-1,k}\right]\\
&\quad + \frac{a^*D_z}{2b^*(\Delta z)^2}\left[A^{n+1}_{i,j,k-1} - A^{n+1}_{i,j,k}\right] + \frac{a^*D_z}{2b^*(\Delta z)^2}\left[A^n_{i,j,k-1} - A^n_{i,j,k}\right], \quad (6c)
\end{aligned}$$

where $1 \le i \le I-1$, $0 \le j \le J-1$, $k = K$ and $a^* = 1$, $b^* = \frac12 + \frac{\sqrt3}{3}$. The discrete Dirichlet boundary condition along the radial direction is

$$A^n_{0,j,k} = \frac{\Delta\Phi}{2\pi}\sum_{m=0}^{J-1} A^n_{1,m,k}, \qquad A^n_{I,j,k} = 0. \quad (6d)$$

The discrete periodic boundary condition along the angular direction is

$$A^n_{i,-1,k} = A^n_{i,J-1,k}, \qquad A^n_{i,J,k} = A^n_{i,0,k}. \quad (6e)$$

2.3. Locally-one-dimension (LOD) scheme

Letting $\mu_r = \frac{D_r\Delta t}{(\Delta r)^2}$, $\mu_\Phi = \frac{D_\Phi\Delta t}{(\Delta\Phi)^2}$, $\mu_z = \frac{D_z\Delta t}{(\Delta z)^2}$, we can rewrite Eq. (6a) as

$$\left(1 - \frac{\mu_r}{2r_i}\delta^{*2}_r - \frac{\mu_\Phi}{2r_i^2}\delta^2_\Phi - \frac{\mu_z}{2}\delta^2_z\right)A^{n+1}_{i,j,k} = \left(1 + \frac{\mu_r}{2r_i}\delta^{*2}_r + \frac{\mu_\Phi}{2r_i^2}\delta^2_\Phi + \frac{\mu_z}{2}\delta^2_z\right)A^n_{i,j,k}, \quad (7a)$$

where the finite difference operators are defined as

$$\delta^{*2}_r A_{i,j,k} = r_{i+\frac12}\left(A_{i+1,j,k} - A_{i,j,k}\right) - r_{i-\frac12}\left(A_{i,j,k} - A_{i-1,j,k}\right), \quad (7b)$$

$$\delta^2_\Phi A_{i,j,k} = A_{i,j+1,k} - 2A_{i,j,k} + A_{i,j-1,k}, \quad (7c)$$

$$\delta^2_z A_{i,j,k} = A_{i,j,k+1} - 2A_{i,j,k} + A_{i,j,k-1}. \quad (7d)$$

Introducing additional terms of order $O((\Delta t)^2)$, Eq. (7) is modified into

$$\left(1 - \frac{\mu_r}{2r_i}\delta^{*2}_r\right)\left(1 - \frac{\mu_\Phi}{2r_i^2}\delta^2_\Phi\right)\left(1 - \frac{\mu_z}{2}\delta^2_z\right)A^{n+1}_{i,j,k} = \left(1 + \frac{\mu_r}{2r_i}\delta^{*2}_r\right)\left(1 + \frac{\mu_\Phi}{2r_i^2}\delta^2_\Phi\right)\left(1 + \frac{\mu_z}{2}\delta^2_z\right)A^n_{i,j,k}. \quad (8a)$$

Using two intermediate levels in time, we employ the locally-one-dimension (LOD) scheme [18] to solve Eq. (8a) as

$$\left(1 - \frac{\mu_r}{2r_i}\delta^{*2}_r\right)A^{n+*}_{i,j,k} = \left(1 + \frac{\mu_r}{2r_i}\delta^{*2}_r\right)A^n_{i,j,k},$$

$$\left(1 - \frac{\mu_\Phi}{2r_i^2}\delta^2_\Phi\right)A^{n+**}_{i,j,k} = \left(1 + \frac{\mu_\Phi}{2r_i^2}\delta^2_\Phi\right)A^{n+*}_{i,j,k}, \quad (8b)$$

$$\left(1 - \frac{\mu_z}{2}\delta^2_z\right)A^{n+1}_{i,j,k} = \left(1 + \frac{\mu_z}{2}\delta^2_z\right)A^{n+**}_{i,j,k}.$$

When employing the LOD scheme for Eqs. (6b) and (6c), the first two equations dealing with the radial and angular directions are the same as those in Eq. (8b). We only need to modify the third equation dealing with the transverse direction as

$$\left(1 + \frac{a\mu_z}{2b}\right)A^{n+1}_{i,j,k} - \frac{a\mu_z}{2b}A^{n+1}_{i,j,k+1} = \left(1 - \frac{a\mu_z}{2b}\right)A^{n+**}_{i,j,k} + \frac{a\mu_z}{2b}A^{n+**}_{i,j,k+1}, \quad \text{for } k = 1, \quad (8c)$$

$$-\frac{a^*\mu_z}{2b^*}A^{n+1}_{i,j,k-1} + \left(1 + \frac{a^*\mu_z}{2b^*}\right)A^{n+1}_{i,j,k} = \frac{a^*\mu_z}{2b^*}A^{n+**}_{i,j,k-1} + \left(1 - \frac{a^*\mu_z}{2b^*}\right)A^{n+**}_{i,j,k}, \quad \text{for } k = K. \quad (8d)$$

2.4. Implementation of the Thomas algorithm

The Thomas algorithm can efficiently solve a tridiagonal linear system with the following general form [8,12,18]:

$$a_i x_{i-1} + b_i x_i + c_i x_{i+1} = d_i, \quad i = 1, 2, \ldots, N-1, \qquad x_0 = 0, \quad x_N = 0. \quad (9a)$$

First, the forward sweep computes the coefficients by Eq. (9b), and second, the backward substitution produces the solution as Eq. (9c):

$$\beta_k = \begin{cases} \dfrac{c_1}{b_1}, & k = 1,\\[2mm] \dfrac{c_k}{b_k - \beta_{k-1}a_k}, & k = 2, 3, \ldots, N-1, \end{cases} \quad (9b)$$

$$v_k = \begin{cases} \dfrac{d_1}{b_1}, & k = 1,\\[2mm] \dfrac{d_k - v_{k-1}a_k}{b_k - \beta_{k-1}a_k}, & k = 2, 3, \ldots, N, \end{cases}$$

$$x_N = v_N, \qquad x_k = v_k - \beta_k x_{k+1}, \quad k = N-1, N-2, \ldots, 1. \quad (9c)$$

We rewrite Eq. (8) into tridiagonal systems of equations with some modifications and employ the Thomas algorithmto solve the tridiagonal system.
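For reference, a sequential CPU version of this solver is sketched below in the paper's $\beta$/$v$ notation. It is a minimal illustration (not the authors' code), solving Eq. (9a) for $x_1,\ldots,x_{N-1}$ with the boundary values $x_0 = x_N = 0$ of Eq. (9a).

```cuda
#include <stdlib.h>

/* Sketch: sequential Thomas algorithm for the tridiagonal system (9a).
   Arrays are indexed 1..N-1 as in the paper; x must have room for
   indices 0..N. */
void thomas_solve(int N, const double *a, const double *b,
                  const double *c, const double *d, double *x)
{
    double *beta = (double *)malloc((size_t)N * sizeof(double));
    double *v    = (double *)malloc((size_t)N * sizeof(double));

    beta[1] = c[1] / b[1];                 /* forward sweep, Eq. (9b) */
    v[1]    = d[1] / b[1];
    for (int k = 2; k <= N - 1; k++) {
        double denom = b[k] - beta[k - 1] * a[k];
        beta[k] = c[k] / denom;
        v[k]    = (d[k] - v[k - 1] * a[k]) / denom;
    }

    x[0] = 0.0;                            /* boundary values of (9a)  */
    x[N] = 0.0;
    x[N - 1] = v[N - 1];                   /* backward substitution,   */
    for (int k = N - 2; k >= 1; k--)       /* Eq. (9c)                 */
        x[k] = v[k] - beta[k] * x[k + 1];

    free(beta);
    free(v);
}
```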

2.4.1. Radial direction and angular direction

The general form of the system of equations along the radial direction is

$$a_i A^{n+*}_{i-1,j,k} + b_i A^{n+*}_{i,j,k} + c_i A^{n+*}_{i+1,j,k} = d^n_{i,j,k}, \quad 1 \le i \le I-1, \quad (10a)$$

where

$$a_i = -\frac{\mu_r r_{i-1/2}}{2r_i}, \qquad b_i = 1 + \frac{\mu_r}{2r_i}\left(r_{i-1/2} + r_{i+1/2}\right), \qquad c_i = -\frac{\mu_r r_{i+1/2}}{2r_i},$$

$$d^n_{i,j,k} = \frac{\mu_r r_{i-1/2}}{2r_i}A^n_{i-1,j,k} + \left[1 - \frac{\mu_r}{2r_i}\left(r_{i-1/2} + r_{i+1/2}\right)\right]A^n_{i,j,k} + \frac{\mu_r r_{i+1/2}}{2r_i}A^n_{i+1,j,k},$$

$0 \le j \le J-1$, $1 \le k \le K$, with the boundary condition shown in Eq. (6d) and with $d^n_{i,j,k}$ as a known value determined only by $A^n_{i,j,k}$. Since $A^{n+*}_{0,j,k}$ is determined by the value of $A^{n+*}_{1,j,k}$ (Eq. (6d)), the first line of the equation system is non-tridiagonal. To resolve this problem, we implement the Thomas algorithm as follows (Fig. 2):

Step 1: Compute $d^n_{i,j,k}$ with values of $A$ at time step $n$.
Step 2: Initiate the value of $A^{n+*(0)}_{1,j,k}$ with $A^n_{1,j,k}$ and let $l = 1$.
Step 3: Update the boundary values $A^{n+*(l)}_{0,j,k}$ with $A^{n+*(l-1)}_{1,j,k}$ by Eq. (6d).
Step 4: Modify $d^n_{1,j,k}$ with $d^{(l)}_{1,j,k} = d^n_{1,j,k} - a_1 A^{n+*(l)}_{0,j,k}$.
Step 5: Use the Thomas algorithm to solve the tridiagonal system

$$\begin{pmatrix}
b_1 & c_1 & 0 & \cdots & \cdots & 0\\
a_2 & b_2 & c_2 & \ddots & & \vdots\\
0 & a_3 & b_3 & \ddots & \ddots & \vdots\\
\vdots & \ddots & \ddots & \ddots & \ddots & 0\\
\vdots & & \ddots & \ddots & \ddots & c_{I-2}\\
0 & \cdots & \cdots & 0 & a_{I-1} & b_{I-1}
\end{pmatrix}
\begin{pmatrix}
A^{n+*(l)}_{1,j,k}\\ A^{n+*(l)}_{2,j,k}\\ A^{n+*(l)}_{3,j,k}\\ \vdots\\ A^{n+*(l)}_{I-1,j,k}
\end{pmatrix}
=
\begin{pmatrix}
d^{(l)}_{1,j,k}\\ d^n_{2,j,k}\\ d^n_{3,j,k}\\ \vdots\\ d^n_{I-1,j,k}
\end{pmatrix}, \quad (10b)$$

where $d^{(l)}_{1,j,k} = d^n_{1,j,k} - a_1 A^{n+*(l)}_{0,j,k}$, and obtain $A^{n+*(l)}_{i,j,k}$.
Step 6: Let $l = l + 1$ and repeat Steps 3–5 until a criterion for convergence is satisfied. Here, the convergence condition along the radial direction is given by Eq. (10c):

$$\max_{j,k}\left|A^{n+*(l)}_{1,j,k} - A^{n+*(l-1)}_{1,j,k}\right| < TOL. \quad (10c)$$

The procedure to solve the angular direction is similar to that for the radial direction shown in Fig. 2.
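The outer iteration of Steps 1–6 can be written compactly as follows. This is a minimal sequential sketch under our own assumptions (a fixed $k$ slice, illustrative array sizes, and the thomas_solve() routine sketched after Eq. (9c)); it is not the authors' parallel implementation.

```cuda
#include <math.h>
#include <string.h>

#define NI 64                /* radial points I (illustrative size)  */
#define NJ 32                /* angular points J (illustrative size) */
#define TOL 1e-10

/* Sequential Thomas solver sketched after Eq. (9c). */
void thomas_solve(int N, const double *a, const double *b,
                  const double *c, const double *d, double *x);

/* Sketch of Steps 1-6 along the radial direction for one fixed k:
   iterate the r = 0 boundary update of Eq. (6d) and the solve of the
   tridiagonal system (10b) until criterion (10c) is met. A[j][i] holds
   the iterates A^{n+*(l)}; a, b, c, d are the coefficients of Eq. (10a). */
void radial_sweep(const double *a, const double *b, const double *c,
                  double d[NJ][NI + 1], double A[NJ][NI + 1], double dPhi)
{
    const double PI = 3.14159265358979323846;
    double prev[NJ], dmod[NI + 1];
    for (;;) {
        double A0 = 0.0;                   /* Step 3: Eq. (6d) at r = 0 */
        for (int m = 0; m < NJ; m++) {
            prev[m] = A[m][1];
            A0 += A[m][1];
        }
        A0 *= dPhi / (2.0 * PI);
        for (int j = 0; j < NJ; j++) {
            memcpy(dmod, d[j], sizeof dmod);
            dmod[1] -= a[1] * A0;          /* Step 4: d^(l)_{1,j,k}    */
            thomas_solve(NI, a, b, c, dmod, A[j]);  /* Step 5: (10b)   */
            A[j][0] = A0;                  /* keep the Eq. (6d) value  */
        }
        double err = 0.0;                  /* Step 6: Eq. (10c)        */
        for (int j = 0; j < NJ; j++)
            err = fmax(err, fabs(A[j][1] - prev[j]));
        if (err < TOL) break;
    }
}
```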


Fig. 2. Implementation of Thomas algorithm along radial and angular directions.

2.4.2. Transverse direction

Combining Eqs. (8b)–(8d) along the transverse direction, we can see that the system of equations along the transverse direction satisfies the standard form of the tridiagonal system. Thus, we can employ the Thomas algorithm to solve the system directly:

$$\begin{pmatrix}
b_1 & c_1 & 0 & \cdots & \cdots & 0\\
a_2 & b_2 & c_2 & \ddots & & \vdots\\
0 & a_3 & b_3 & \ddots & \ddots & \vdots\\
\vdots & \ddots & \ddots & \ddots & \ddots & 0\\
\vdots & & \ddots & \ddots & \ddots & c_{K-1}\\
0 & \cdots & \cdots & 0 & a_K & b_K
\end{pmatrix}
\begin{pmatrix}
A^{n+1}_{i,j,1}\\ A^{n+1}_{i,j,2}\\ A^{n+1}_{i,j,3}\\ \vdots\\ A^{n+1}_{i,j,K}
\end{pmatrix}
=
\begin{pmatrix}
d^{n+**}_{i,j,1}\\ d^{n+**}_{i,j,2}\\ d^{n+**}_{i,j,3}\\ \vdots\\ d^{n+**}_{i,j,K}
\end{pmatrix},$$

$$b_1 = 1 + \frac{a\mu_z}{2b}, \quad c_1 = -\frac{a\mu_z}{2b}, \quad a_K = -\frac{a^*\mu_z}{2b^*}, \quad b_K = 1 + \frac{a^*\mu_z}{2b^*},$$

$$a_k = -\frac{\mu_z}{2}, \quad b_k = 1 + \mu_z, \quad c_k = -\frac{\mu_z}{2}, \quad 2 \le k \le K-1, \quad (10d)$$

$$d^{n+**}_{i,j,1} = \left(1 - \frac{a\mu_z}{2b}\right)A^{n+**}_{i,j,1} + \frac{a\mu_z}{2b}A^{n+**}_{i,j,2},$$

$$d^{n+**}_{i,j,k} = \frac{\mu_z}{2}A^{n+**}_{i,j,k-1} + (1 - \mu_z)A^{n+**}_{i,j,k} + \frac{\mu_z}{2}A^{n+**}_{i,j,k+1}, \quad 2 \le k \le K-1,$$

$$d^{n+**}_{i,j,K} = \frac{a^*\mu_z}{2b^*}A^{n+**}_{i,j,K-1} + \left(1 - \frac{a^*\mu_z}{2b^*}\right)A^{n+**}_{i,j,K}.$$

Table 1
Features of device memory.

Memory type Scope Device access Host access Location
Global memory All threads and host Read–write Read–write Off chip
Local memory Per-thread Read–write None Off chip
Shared memory Per-block Read–write None On chip
Constant memory All threads and host Read Read–write Off chip
Texture memory All threads and host Read Read–write Off chip
Registers Per-thread Read–write None On chip

3. GPU implementations

This section analyzes the parallel implementations of the domain-decomposition [5,12,25,26] LOD scheme as well as the Thomas algorithm.

3.1. GPU architecture

NVIDIA GPU devices are designed not only to render graphics but also to serve as powerful engines capable of computing with hundreds of cores [20]. The CUDA architecture makes the GPU [11,19] well-suited for highly parallelized computations. A single-instruction, multiple-thread (SIMT) architecture is employed to manage threads [20]. These threads are organized into a two-level hierarchy (Figure 2-1 of the NVIDIA CUDA Programming Guide [20]), and each thread is mapped to one scalar processor core to execute instructions independently. Figure 2-2 of the NVIDIA CUDA Programming Guide [20] shows the device (GPU) memory spaces, including global memory, local memory, shared memory, constant memory, texture memory and registers. Global memory is the largest memory and can be accessed by both the host (CPU) and the device (GPU). It manages the data exchange between CPU and GPU and supports communication between thread blocks. However, global memory has the same access latency as local memory and texture memory, whereas shared memory, registers and constant memory are much faster with relatively limited space. Table 1 exhibits the features of these six memory spaces in detail [21].
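As a minimal illustration of these memory spaces (our own toy example, not code from the paper), the kernel below stages data from off-chip global memory through on-chip per-block shared memory, with scalar variables held in registers:

```cuda
// Toy kernel: x and y live in global memory, tile[] in per-block shared
// memory, and i / alpha in registers (see Table 1).
__global__ void scale_kernel(const double *x, double *y, double alpha, int n)
{
    __shared__ double tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = x[i];          // global -> shared
    __syncthreads();                              // whole block synchronizes
    if (i < n) y[i] = alpha * tile[threadIdx.x];  // shared -> global
}
```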

3.2. Performance analysis of the parallel computing algorithm

The LOD scheme splits the original partial differential equation (PDE) into three separate equations. This facilitates the parallel implementation because each split step is completely one-dimensional, as shown in Eq. (8b).

The LOD scheme consists of two steps to approximate the numerical solution along each dimension: the first computes the right-hand side with an explicit scheme, and the second solves a tridiagonal system by the Thomas algorithm. The steps along the three dimensions differ because of the different boundary conditions and finite difference schemes. Next, we discuss the parallel implementations through an analysis of Eqs. (10a) and (10b).

When the radial dimension (i) is processed, all the computations along the angular (j) and transverse (k) directions are independent, which allows us to compute these two dimensions simultaneously. The aforementioned thread batching of CUDA [16,22] indicates that mapping a grid of thread blocks to a 1D or 2D computational domain is much simpler, since the maximum grid dimension is two [16,22]. Thus, we project the original 3D computational region onto a 2D computational domain [31], as shown in Fig. 3, and solve Eqs. (10a) and (10b) direction by direction.

In Fig. 3, for each (i) along the radial dimension, a 2D J ∗ K matrix is mapped onto a 1D vector of length (J ∗ K), since all the elements can be computed independently at the same time. The 3D I ∗ J ∗ K computational region is thereby transformed into a 2D I ∗ (J ∗ K) domain, which can be decomposed into smaller subdomains, with a performance analysis similar to that of our previously developed 2D parallel algorithms [12].
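The fragment below is a schematic of this flattening (our own sketch; the row-major layout A[(i·J + j)·K + k] and the kernel name are assumptions, not the authors' code):

```cuda
// One thread per (j,k) pair: the J*K independent columns of Fig. 3 are
// flattened to a single thread index t. As a placeholder, the body only
// copies A into d; the full right-hand side d^n_{i,j,k} of Eq. (10a)
// would be evaluated at the marked line.
__global__ void rhs_radial(const double *A, double *d, int I, int J, int K)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;   // 0 .. J*K - 1
    if (t >= J * K) return;
    int j = t / K;                                   // recover (j,k)
    int k = t % K;
    for (int i = 1; i < I; i++)                      // interior points 1..I-1
        d[(i * J + j) * K + k] = A[(i * J + j) * K + k];  // stencil here
}
```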

The 3D parallel computing algorithm employs p1 processors to set up the explicit scheme of the LOD algorithm, while p2 and p3 processors compute the forward-sweep and backward-substitution components of the Thomas algorithm, respectively [12]. Based on the theory of Jordan and Alaghband [14], the efficiency of both the explicit scheme and the Thomas algorithm in the LOD method is 100%.

Fig. 3. Mapping of a 3D computational region to a 2D computational domain (radial direction).

3.3. GPU-based parallel computing algorithms to speed up the diffusion solver

If we let I = m and J = K = m − 2, the theoretical performance analysis shows that, in the ideal condition with an infinite number of processors, p1 equals (m − 2)[(m − 2)(m − 2)] and should be greater than p2 and p3, which equal [(m − 2)(m − 2)]. This is because the elements on the right side of Eqs. (10a) and (10b) are totally independent, whereas the elements along the radial (or angular or transverse) direction depend on each other when the Thomas algorithm is applied to solve the tridiagonal system. Therefore, the Thomas algorithm is the main bottleneck for parallelization of the LOD scheme.

3.3.1. PGM with warp vote function (PGMW)

PGM maps each thread block to one sub-vector of the solution matrix and computes one element of the sub-vector with each thread [12]. PGM is employed to parallelize the LOD method, and the warp vote function any() [20] is applied to speed up the boundary convergence process [12] along the radial and angular directions, as the device-side sketch below illustrates.

PGM facilitates the implementation of the parallel algorithm, but it cannot benefit from the full advantages of the GPU due to the access latency of global memory [6,12,16,20,32]. Moreover, compared with the previous 2D numerical diffusion solver in rectangular coordinates, the current 3D diffusion solver in cylindrical coordinates needs iterations to converge the boundaries along the radial and angular directions. Hence it loses more of the theoretical parallel performance than the 2D diffusion solver in rectangular coordinates. Fig. 4 describes the details of the computing flow when PGMW is employed to parallelize the 3D numerical diffusion solver.
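A minimal device-side sketch of this idea (our own illustration using the Fermi-era warp vote intrinsic __any(); the function name and tolerance handling are assumptions):

```cuda
// Each thread evaluates its own boundary residual from Eq. (10c); the
// warp then votes: the result is nonzero if any lane still exceeds tol,
// so the iteration continues without a round trip to the host.
__device__ int warp_not_converged(double Anew, double Aold, double tol)
{
    return __any(fabs(Anew - Aold) >= tol);
}
```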

3.3.2. PSGMC with stream function (PSGMCS)

As indicated in previous research [12], PSGMC decomposes the entire data set into tiles, and then copies each tile of data into the fast, though relatively limited, shared memory [4,16,17,23,33] by employing the classical alternating Schwarz domain-decomposition strategy [5,12,18,25,26,40]. PSGMC minimizes global memory access and overlaps the true boundary convergence process along the radial and angular directions with the convergence process of the artificial boundaries. The details of PSGMC are illustrated in our previous research [12].


Fig. 4. Flow chart of PGM with warp vote function.

Unlike in rectangular coordinates, the arc length between two mesh points along the angular direction increases with the radius in cylindrical coordinates (Fig. 5a). Correspondingly, the approximation accuracy decreases in regions far away from the origin, especially for large matrices.

To increase the accuracy of the computation, we divide the r–Φ area into inner and outer domains, as in Fig. 5a. It should be noted that we use two sub-domains as a prototype to demonstrate the advantages of the stream function [20].

As a newly released feature of the Fermi GPU 480 cards, the 'stream' function can distribute various numbers of threads to each stream, which allows us to partition the task into two streams and execute each with a different grid size (size of thread block). The protocol combining PSGMC and the stream function aims to accelerate the 3D diffusion model with higher accuracy.


Fig. 5a. Domain decomposition with stream function.

PSGMC partitions the data and utilizes the classical alternating Schwarz domain-decomposition method [5,18,25,26,40], which requires inter-block communication. As shown in Figs. 5a and 5b, the inner domain (Stream 1) must communicate with the outer domain (Stream 2) when the domain decomposition strategy is applied to solve Eq. (10a) along the radial direction. Specifically, $I_0, I_1, \ldots, I_{L-1}$ in Fig. 5a are the numerical solutions along the angular direction in the inner domain, and $O_0, O_1, \ldots, O_{2(L-1)}, O_{2L-1}$ are the adjacent (with respect to the radial direction) solutions belonging to the outer domain. When the inner domain (Stream 1) communicates with the outer domain (Stream 2) as in Fig. 5b, the artificial boundaries for the inner domain $BI_j$ are updated with Eq. (11a), and the artificial boundaries for the outer domain $BO_{2j}$ and $BO_{2j+1}$ are updated with Eqs. (11b) and (11c), where $j = 0, 1, \ldots, L-1$:

$$BI_j = O_{2j}, \quad (11a)$$

$$BO_{2j} = I_j, \quad (11b)$$

$$BO_{2j+1} = (I_j + I_{j+1})/2. \quad (11c)$$
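A device-side sketch of this exchange follows (our own illustration; the array names mirror Fig. 5b, and the guard on the last averaged point of Eq. (11c) is our assumption about the edge case):

```cuda
// One thread per index j: copy inner solutions I_j and outer solutions
// O_{2j} into the artificial boundary arrays BI and BO per Eqs. (11a)-(11c).
__global__ void exchange_boundaries(const double *Iv, const double *Ov,
                                    double *BI, double *BO, int L)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= L) return;
    BI[j]     = Ov[2 * j];                          // Eq. (11a)
    BO[2 * j] = Iv[j];                              // Eq. (11b)
    if (j < L - 1)
        BO[2 * j + 1] = 0.5 * (Iv[j] + Iv[j + 1]);  // Eq. (11c)
}
```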

The domain decomposition method and the data transfer pattern along the angular and transverse directions (Eq. (10d)) are applied in the same way as along the radial direction. Another advantage of employing the 'stream' function is the reduced time for data transfer. A stream in the CUDA architecture is a sequence of operations that are performed in order on the device [20]. Operations in different streams can interleave, which allows data transfers between host and device to be hidden. Asynchronous data transfer (cudaMemcpyAsync()) is used to overlap transfers with kernel execution when the data can be partitioned into chunks and transferred in multiple stages (streams) [20,21]. Each data chunk is processed by launching kernels when it arrives, and execution within a stream occurs sequentially [20]. The timelines of execution for sequential copy-and-execute with only one stream and for staged concurrent copy-and-execute with multiple streams are depicted in Fig. 6a [21].

In addition, data chunks belonging to multiple streams may have different sizes and may be processed with (thread) grids of different dimensions. We take advantage of the stream function and incorporate it into the previously proposed PSGMC to improve the sequential 3D diffusion solver in both computing time and accuracy. Two streams are built in the current design to obtain the numerical solutions with different mesh-point grid sizes. The first stream computes the solution $A(i,j,k)$ for $1 \le i \le \frac{I-2}{2}$, $0 \le j \le J-1$ and $1 \le k \le K$ with the grid sizes $\Delta r$, $\Delta\Phi$ and $\Delta z$, while the second stream computes $A(i,j,k)$ for $\frac{I}{2} \le i \le I-2$, $0 \le j \le 2J-1$ and $1 \le k \le K$ with the grid sizes $\Delta r$, $\frac{\Delta\Phi}{2}$ and $\Delta z$. The basic design is shown in Figs. 6b–6d, and a host-side sketch follows below.
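The host-side pattern is sketched below under our own assumptions: inner_kernel/outer_kernel, the buffer pointers, byte counts, and launch configurations are hypothetical placeholders, and the host buffers are assumed to be page-locked (e.g. from cudaMallocHost) so the asynchronous copies can actually overlap.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernels standing in for the inner/outer domain solvers.
__global__ void inner_kernel(double *a) { /* dr, dPhi, dz grid */ }
__global__ void outer_kernel(double *a) { /* dr, dPhi/2, dz grid */ }

// Two streams, two launch configurations: the outer domain gets twice as
// many angular threads, and copies in one stream may overlap kernel
// execution in the other.
void run_two_streams(double *h_inner, double *h_outer,
                     double *d_inner, double *d_outer,
                     size_t inner_bytes, size_t outer_bytes)
{
    cudaStream_t s[2];
    for (int i = 0; i < 2; i++) cudaStreamCreate(&s[i]);

    cudaMemcpyAsync(d_inner, h_inner, inner_bytes,
                    cudaMemcpyHostToDevice, s[0]);
    inner_kernel<<<64, 256, 0, s[0]>>>(d_inner);   // coarse angular grid

    cudaMemcpyAsync(d_outer, h_outer, outer_bytes,
                    cudaMemcpyHostToDevice, s[1]);
    outer_kernel<<<128, 256, 0, s[1]>>>(d_outer);  // twice the angular threads

    for (int i = 0; i < 2; i++) {
        cudaStreamSynchronize(s[i]);
        cudaStreamDestroy(s[i]);
    }
}
```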


Fig. 5b. Inter block communication between Streams 1 and 2.

Fig. 6a. The timeline of execution for sequential and staged concurrent copy and execute.

3.4. Accuracy analysis of the parallel computing algorithm

The numerical approximation of

$$\frac{\partial A}{\partial t} = \frac{D_r}{r}\frac{\partial}{\partial r}\left(r\frac{\partial A}{\partial r}\right) + \frac{D_\Phi}{r^2}\frac{\partial^2 A}{\partial \Phi^2} + D_z\frac{\partial^2 A}{\partial z^2} \quad \text{in } 0 \le r \le r_{\max},\ 0 < z < L,$$

with

$$\frac{\partial A}{\partial z} = 0 \ \text{ at } z = 0 \text{ and } z = L, \qquad A = 0 \ \text{ at } r = r_{\max}, \quad (12a)$$

$$A(r,\Phi,z,0) = J_1(\varpi_1 r)\sin\Phi\cos\frac{\pi z}{L},$$

where $J_v$ is the Bessel function [24] of order $v$ of the first kind and $\varpi_1$ is the first positive solution of $J_1(\varpi r_{\max}) = 0$, is compared with the analytical solution computed by separation of variables [18,24] when $D_r = D_\Phi = D_z = D$:

$$A(r,\Phi,z,t) = J_1(\varpi_1 r)\sin\Phi\cos\frac{\pi z}{L}\exp\left[-D\left(\varpi_1^2 + \left(\frac{\pi}{L}\right)^2\right)t\right]. \quad (12b)$$

Fig. 6b. Flow chart of PSGMC with stream function along the radial direction.

We will examine in the results section whether the GPU parallel computing algorithm and the sequential CPU computing algorithm have the same order of accuracy.
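For error measurement, the analytical solution (12b) can be evaluated pointwise as sketched below (our own illustration; j1() is the first-order Bessel function of the POSIX C math library, and w1, the first positive root of $J_1(\varpi r_{\max}) = 0$, is assumed to be precomputed, e.g. w1 ≈ 3.8317/rmax):

```cuda
#include <math.h>

/* Sketch: analytical solution (12b) at one mesh point, for comparison
   against the numerical solution when Dr = DPhi = Dz = D. */
double analytic_A(double r, double Phi, double z, double t,
                  double D, double w1, double L)
{
    const double PI = 3.14159265358979323846;
    return j1(w1 * r) * sin(Phi) * cos(PI * z / L)
         * exp(-D * (w1 * w1 + (PI / L) * (PI / L)) * t);
}
```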

4. Results

We employ the GPU-based 3D numerical diffusion solver in cylindrical coordinates to accelerate the sequential computation of the neurotransmitter (acetylcholine) diffusion, which was originally developed by Khaliq et al. [15]. To demonstrate the advantages of the GPU-based 3D numerical diffusion solver, we ignore the acetylcholine receptors. In the experimental computation, the diffusion of acetylcholine is simulated in a disk-shaped volume [15]. 'A' denotes the molar concentration of acetylcholine. The parameters used in the simulation are listed in Table 2 [15].


Fig. 6c. Flow chart of PSGMC with stream function along the angular direction.

Table 2
Parameters used in the simulation.

Name/symbol Value (units)
Dr 0.7 × 10−6 cm2 s−1
DΦ 0.7 × 10−6 cm2 s−1
Dz 0.7 × 10−6 cm2 s−1
rmax 6.5 cm
L 1 cm

4.1. Computing time of PGMW and sequential computing

Fig. 7a shows the computing time of PGMW and sequential computing for different matrix sizes. The results show that PGMW significantly accelerates the sequential algorithm for modest-size matrices, whereas its performance gain is limited for both very large and very small matrices.

4.2. Computing time of PSGMCS and PGMW

Fig. 7b shows that the computing time of PSGMCS is much less than that of PGMW for large matrices, whereas PGMW is faster than PSGMCS for small matrices.

4.3. Accuracy of PSGMCS and PGMW

We compare the numerical solutions given by PSGMCS and PGMW with the analytical solution of Eq. (12b) by computing the total of the absolute solution values over the mesh points of a 64 ∗ 64 ∗ 64 matrix. Fig. 8(a) and Fig. 8(b) list the simulation results for D = 0.7 × 10−6 and D = 0.7 × 10−5, respectively. In this simulation, rmax = 6.5 cm (∆r = 0.1 cm) and L = 1 cm. We can tell from Fig. 8 that both PSGMCS and PGMW give satisfactory numerical approximations of the diffusion equation. We then compare the accuracy of PSGMCS and PGMW, measured by the total relative difference between the numerical and the analytical solution. Fig. 9 shows that PSGMCS obtains higher accuracy than PGMW regardless of the size of ∆r and the value of the diffusivity. In particular, Fig. 9(b) shows that this trend is more pronounced for greater diffusivity and larger ∆r.

Fig. 6d. Flow chart of PSGMC with stream function along the transverse direction.

Fig. 7a. Computing time of PGMW and sequential computing on a logarithmic scale. The x axis represents the inner matrix size and the y axis represents the computing time (logarithmic scale with base 10) in milliseconds. The blue bar represents the computing time of sequential computing and the red bar represents the computing time of PGMW. The number on each bar indicates the speedup over sequential computing.


Fig. 7b. Computing time of PSGMCS and PGMW on a logarithmic scale. The x axis represents the inner matrix size and the y axis represents the computing time (logarithmic scale with base 10) in milliseconds. The blue bar represents the computing time of PGMW and the red bar represents the computing time of PSGMCS. The number on each bar indicates the speedup over PGMW.


Fig. 8. Accuracy of PSGMCS and PGMW measured by the total absolute solution value at each mesh point. The x axis represents the time step and the y axis represents the total of the absolute solution values.


Fig. 9. Accuracy of PGMW and PSGMCS. The x axis represents the grid size along the radial direction and the y axis represents the accuracy measured by the total relative difference at each mesh point. The number on each blue bar indicates how much higher the error of PGMW is than that of PSGMCS.

5. Discussion and conclusion

This research employs GPU-based technology to speed up a well-developed novel numerical diffusion simulator [7,15] with applications to health care, in particular neuroscience research. Two parallel GPU-based numerical diffusion algorithms (PGMW and PSGMCS) are proposed based on [12,13]. The computational cost with respect to matrix size is shown in Fig. 7a. It is evident that PGMW can greatly improve performance for modest-size matrices compared with the sequential algorithm. However, it is not fast for very large matrices, owing to the iterations needed for boundary convergence along the radial and angular directions, nor for small matrices, because it spends considerable computing time on data transfer and operand preparation when employing global memory only.

PSGMCS is developed to simulate diffusion in cylindrical coordinates with high speed and high accuracy. Our previous studies [12] have already demonstrated that PSGMC can relieve the high latency of global memory access and improve performance for large matrices. Here, we employed the 'stream' function, which is the latest released feature of the Fermi GPU 480 card. The 'stream' function not only allows us to parallelize the computation with thread grids of different dimensions, but also to overlap data copying and execution across different streams. With the stream function, PSGMCS is much faster than PGMW (Fig. 7b) for large matrices. As discussed previously, since the arc length between two mesh points along the angular direction increases with the radius in cylindrical coordinates, the accuracy of the computation decreases in regions far away from the origin. To solve this problem, PSGMCS employs two sub-domains as a prototype to demonstrate the advantages of the 'stream' function. Since the 'stream' function can assign more threads to the regions far away from the origin, we can achieve finer grids in the outer domain than in the inner domain, as shown in Fig. 5a. Fig. 8 shows that both PSGMCS and PGMW can achieve high approximation accuracy for modest-size matrices. Fig. 9 reveals that, if we increase the diffusivity and the grid size along the radial direction, PSGMCS achieves higher accuracy than PGMW. These results demonstrate that PSGMCS can significantly increase the computing accuracy in the regions far away from the origin in cylindrical coordinates, if we employ more streams for more sub-domains to process a larger simulated field. However, since the 'stream' function can only provide asynchronous data transfer to increase the efficiency of data transfer, it is hard to employ GPU synchronization to achieve better performance as in our previous research, which synchronized the inter-block communication [12]. For this reason, our future study will try to integrate GPU synchronization into the current protocol to improve the performance of the solver.

Acknowledgments

This work was supported by NIH grants R01LM010185-03, U01HL111560-01 and R01DE022676-01, by the National Natural Science Foundation of China under Grant No. 61372138, as well as by USA NIH grants P30AI078498, HHSN272201000055C and U01 CA166886-01. We would like to acknowledge the members of the Mathematical Department of Michigan Tech University, the Mathematical Department of Louisiana Tech University and the Translational Biosystems Lab of Cornell Medical School for valuable discussions.

References

[1] C.A. Athale, T.S. Deisboeck, The effects of EGF-receptor density on multiscale tumor growth patterns, J. Theoret. Biol. 238 (2006) 771–779.
[2] C. Athale, Y. Mansury, T.S. Deisboeck, Simulating the impact of a molecular 'decision-process' on cellular phenotype and multicellular patterns in brain tumors, J. Theoret. Biol. 233 (2005) 469–481.
[3] B. Bialecki, Preconditioned Richardson and minimal residual iterative methods for piecewise Hermite bicubic orthogonal spline collocation equations, SIAM J. Sci. Comput. 15 (1994) 668–680.
[4] M. Boyer, D. Tarjan, S. Scton, K. Skadron, Accelerating leukocyte tracking using CUDA: A case study in leveraging manycore coprocessors, in: Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing, IEEE Computer Society, Washington, DC, USA, Rome, Italy, 2009.
[5] X.C. Cai, M. Sarkis, A restricted additive Schwarz preconditioner for general sparse linear systems, SIAM J. Sci. Comput. 21 (1999) 792–797.
[6] S. Che, M. Boyer, J.Y. Meng, D. Tarjan, J.W. Sheaffer, K. Skadron, A performance study of general-purpose applications on graphics processors using CUDA, J. Parallel Distrib. Comput. 68 (2008) 1370–1380.
[7] W. Dai, A new accurate finite difference scheme for Neumann (insulated) boundary condition of heat conduction, Int. J. Therm. Sci. 49 (2010) 571–579.
[8] W. Dai, A parallel algorithm for direct solution of large scale five-diagonal linear systems, in: D.H. Bailey (Ed.), Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, SIAM, San Francisco, CA, 1995, p. 875.
[9] W. Dai, A. Bejan, X. Tang, L. Zhang, R. Nassar, Optimal temperature distribution in a three dimensional triple-layered skin structure with embedded vasculature, J. Appl. Phys. 99 (2006).
[10] W. Dai, R. Nassar, A preconditioned Richardson method for solving three-dimensional thin film problems with first order derivatives and variable coefficients, Internat. J. Numer. Methods Heat & Fluid Flow 10 (2000) 477–487.
[11] P.N. Glaskowsky, NVIDIA's Fermi: The First Complete GPU Computing Architecture, 2009.
[12] B. Jiang, A. Struthers, L. Zhang, Z. Sun, Z. Feng, X. Zhao, W. Dai, K. Zhao, X. Zhou, M. Berens, Employing graphics processing unit technology, alternating direction implicit method and domain decomposition to speed up the numerical diffusion solver for the biomedical engineering research, Int. J. Numer. Methods Biomed. Eng. 27 (2011) 1829–1849.
[13] B. Jiang, L. Zhang, W. Zhang, A. Struthers, M.E. Berens, X. Zhou, Accelerate numerical diffusion solver of 2D multiscale and multi-resolution agent-based brain cancer model by employing graphics processing unit technology, in: WORLDCOMP'11, Las Vegas, Nevada, USA, 2011.
[14] H.F. Jordan, G. Alaghband, Fundamentals of Parallel Processing, Pearson Education Inc., Upper Saddle River, NJ 07458, 2003.
[15] A. Khaliq, F. Jenkins, M. DeCoster, W. Dai, A new 3D mass diffusion–reaction model in the neuromuscular junction, J. Comput. Neurosci. 30 (2011) 729–745.
[16] D. Kirk, W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach (Applications of GPU Computing Series), first ed., Morgan Kaufmann, Burlington, MA, 2010.
[17] Y. Liu, W. Huang, J. Johnson, S. Vaidya, GPU accelerated Smith–Waterman, in: Computational Science – ICCS 2006, Pt 4, Proceedings, vol. 3994, 2006, pp. 188–195.
[18] K.W. Morton, D.F. Mayers, Numerical Solution of Partial Differential Equations, second ed., Cambridge University Press, New York, 2008.
[19] NVIDIA, Tuning CUDA Applications for Fermi, NVIDIA, 2010.
[20] NVIDIA, NVIDIA CUDA Programming Guide, NVIDIA, 2009.
[21] NVIDIA, NVIDIA CUDA C Programming Best Practices Guide, NVIDIA, 2009.
[22] NVIDIA, NVIDIA CUDA Programming Guide, NVIDIA, 2008.
[23] NVIDIA, NVIDIA's Next Generation CUDA Compute Architecture: Fermi, NVIDIA, 2009.
[24] M.N. Ozisik, Heat Conduction, second ed., Wiley-Interscience, Malden, MA, 1993.
[25] B. Smith, P. Bjørstad, W. Gropp, Domain Decomposition: Parallel Multilevel Methods for Elliptic Partial Differential Equations, first ed., Cambridge University Press, New York, 2004.
[26] A. St-Cyr, M.J. Gander, S.J. Thomas, Optimized restricted additive Schwarz methods, in: 16th International Conference on Domain Decomposition Methods, New York, 2005.
[27] K.R. Swanson, E.C. Alvord Jr., J.D. Murray, A quantitative model for differential motility of gliomas in grey and white matter, Cell Prolif. 33 (2000) 317–329.
[28] K.R. Swanson, E.C. Alvord Jr., J.D. Murray, Virtual brain tumours (gliomas) enhance the reality of medical imaging and highlight inadequacies of current therapy, Br. J. Cancer 86 (2002) 14–18.
[29] K.R. Swanson, C. Bridge, J.D. Murray, E.C. Alvord Jr., Virtual and real brain tumors: using mathematical modeling to quantify glioma growth and invasion, J. Neurol. Sci. 216 (2003) 1–10.
[30] K.R. Swanson, R.C. Rostomily, E.C. Alvord Jr., A mathematical modelling tool for predicting survival of individual patients following resection of glioblastoma: a proof of principle, Br. J. Cancer 98 (2008) 113–119.
[31] J.C. Thibault, I. Senocak, CUDA implementation of a Navier–Stokes solver on multi-GPU desktop platforms for incompressible flows, in: 47th AIAA Aerospace Sciences Meeting, American Institute of Aeronautics and Astronautics, Orlando, Florida, 2009, pp. 1–15.
[32] V. Volkov, J. Demmel, Benchmarking GPUs to tune dense linear algebra, in: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, IEEE Press, Austin, TX, 2008.
[33] S.C. Xiao, A.M. Aji, W.C. Feng, On the robust mapping of dynamic programming onto a graphics processing unit, in: International Conference on Parallel and Distributed Systems, Shenzhen, China, 2009.
[34] L. Zhang, W. Dai, R. Nassar, A numerical method for optimizing laser power in the irradiation of a 3-D triple-layered cylindrical skin structure, Numer. Heat Transfer 48 (2005) 21–41.
[35] L. Zhang, W. Dai, R. Nassar, A numerical method for obtaining an optimal temperature distribution in a 3-D triple-layered cylindrical skin structure embedded with a blood vessel, Numer. Heat Transfer 49 (2006) 765–784.
[36] L. Zhang, W. Dai, R. Nassar, A numerical algorithm for obtaining an optimal temperature distribution in a 3D triple-layered cylindrical skin structure, Comput. Assist. Mech. Eng. Sci. 14 (2007) 107–125.
[37] L. Zhang, B. Jiang, Y. Wu, C. Strouthos, P.Z. Sun, J. Su, X. Zhou, Developing a multiscale, multi-resolution agent-based brain tumor model by graphics processing units, Theoret. Biol. Med. Modelling 8 (2011).
[38] L. Zhang, C. Strouthos, Z. Wang, T.S. Deisboeck, Simulating brain tumor heterogeneity with a multiscale agent-based model: Linking molecular signatures, phenotypes and expansion rate, Math. Comput. Modelling 49 (2009) 307–319.
[39] L. Zhang, Z. Wang, J.A. Sagotsky, T.S. Deisboeck, Multiscale agent-based cancer modeling, J. Math. Biol. 58 (2009) 545–559.
[40] J.P. Zhu, Solving Partial Differential Equations on Parallel Computers, World Scientific Publishing Co. Pte. Ltd., London, 1994.