
Research Article
An MPI-OpenMP Hybrid Parallel H-LU Direct Solver for Electromagnetic Integral Equations

Han Guo, Jun Hu, and Zaiping Nie

School of Electronic Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan 611731, China

Correspondence should be addressed to Jun Hu; [email protected]

Received 20 July 2014; Revised 12 November 2014; Accepted 15 December 2014

Academic Editor: Wei Hong

Copyright © 2015 Han Guo et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

In this paper we propose a high performance parallel strategy/technique to implement the fast direct solver based on the hierarchical matrices method. Our goal is to directly solve electromagnetic integral equations involving electrically large and geometrically complex targets, which are traditionally difficult to solve by iterative methods. The parallel method of our direct solver features both OpenMP shared memory programming and MPI message passing for running on a computer cluster. With modifications to the core direct-solving algorithm of hierarchical LU factorization, the new fast solver is scalable for parallelized implementation despite its sequential nature. The numerical experiments demonstrate the accuracy and efficiency of the proposed parallel direct solver for analyzing electromagnetic scattering problems of complex 3D objects with nearly 4 million unknowns.

1. Introduction

Computational electromagnetics (CEM), which is driven by the explosive development of computing technology and a wealth of novel fast algorithms in recent years, has become important to the modeling, design, and optimization of antennas, radar, high-frequency electronic devices, and electromagnetic metamaterials. Among the three major approaches for CEM, finite-difference time-domain (FDTD) [1], finite element method (FEM) [2], and method of moments (MoM) [3], MoM has gained a widespread reputation for its good accuracy and built-in boundary conditions. Generally, MoM discretizes the electromagnetic integral equations (IEs) into linear systems and solves them via numerical algebraic methods. Although the original MoM procedure forms a dense system matrix, which is computationally intensive for solving large-scale problems, a variety of fast algorithms have been introduced to accelerate MoM by reducing both its CPU time and memory cost. Well-known fast algorithms for frequency domain IEs, such as the multilevel fast multipole method (MLFMM) [4-6], the multilevel matrix decomposition algorithm (MLMDA) [7-9], the adaptive integral method (AIM) [10], the hierarchical matrices (H-matrices, H2-matrices) method [11-15], multiscale compressed block decomposition (MLCBD) [16, 17], and adaptive cross-approximation (ACA) [18, 19], have remarkably increased the capacity and ability of MoM to analyze radiation/scattering phenomena for electrically large objects.

Traditionally, iterative solvers are employed and combined with fast algorithms to solve the MoM system. Despite the availability of many efficient fast algorithms, there are still some challenges in the iterative solving process for discretized IEs. One major problem is the slow-convergence issue. Due to various reasons, such as complex geometrical shapes of the targets, fine attachments, dense discretization, and/or nonuniform meshing, the spectral condition of the impedance matrix of the discretized IEs can be severely deteriorated. Therefore, the convergence of the iteration slows down significantly, so that an accurate solution cannot be obtained within a reasonable period of time. In order to overcome this difficulty, preconditioning techniques are usually employed to accelerate the convergence. Popular and widely used preconditioners include the diagonal block inverse [20], incomplete Cholesky factorization [21], incomplete LU factorization (ILU) [21, 22], and the sparse approximate inverse (SAI) [23]. However, the effectiveness of preconditioning techniques is problem-dependent and convergence still cannot be guaranteed. For some extreme cases preconditioning brings little alleviation to the original problem.



By contrast, a direct solver such as Gaussian elimination or LU factorization [24] spends a fixed number of operations, which is related only to the size of the impedance/system matrix, namely the number of unknowns N, to obtain the accurate solution of MoM. Although these standard direct solvers are not favored over iterative solvers because of their costly computational complexity O(N^3), a number of fast direct solvers (FDS) have been introduced recently [14-19, 25], which inherit the merit of direct solving of avoiding the slow-convergence issue and have significantly reduced the required CPU time and memory.

So far there are some existing works discussing the parallelization of direct solvers for electromagnetic integral equations [26, 27]. It is notable that parallelizing an FDS is not trivial. Since the algorithms of major FDS share a similar recursive factorization procedure in a multilevel fashion, their implementations are intrinsically sequential in nature. However, they can be parallelized with good scalability by elaborate dissection and distribution. In this paper we propose an MPI-OpenMP hybrid parallel strategy to implement the H-matrices based direct solver with hierarchical LU (H-LU) factorization [15, 28]. This parallel direct solver is designed for running on a computer cluster with multiple computing nodes and a multicore processor in each node. OpenMP shared memory programming [29] is utilized for the parallelization on the multicore processors of each node, while MPI message passing [30] is employed for the distributed computing among all the nodes. The proposed parallel FDS shows good scalability and parallelization efficiency, and numerical experiments demonstrate that it can solve electrically large IE problems involving nearly 4 million unknowns within dozens of hours.
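As a concrete illustration of this hybrid execution model (a minimal sketch of the commonly used MPI+OpenMP pattern described above, not the authors' code), each MPI process is bound to one computing node and spawns a team of OpenMP threads for the intranode work; the requested thread-support level is an assumption of this sketch:

```cpp
// hybrid_init.cpp -- minimal MPI + OpenMP hybrid skeleton (illustrative sketch).
// Typical build:  mpicxx -fopenmp hybrid_init.cpp -o hybrid_init
#include <mpi.h>
#include <omp.h>
#include <cstdio>

int main(int argc, char** argv) {
    // Ask for FUNNELED support: only the master thread of each process calls MPI.
    int provided = 0;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // which node/process am I
    MPI_Comm_size(MPI_COMM_WORLD, &size);   // how many MPI processes in total

    // Intranode parallelism: OpenMP threads on the multicore processor(s).
    #pragma omp parallel
    {
        #pragma omp single
        std::printf("MPI rank %d of %d runs %d OpenMP threads\n",
                    rank, size, omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```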

The rest of this paper is organized as follows. Section 2 reviews the discretized IE we aim to solve and the basic MoM formulas. Section 3 outlines the construction of the H-matrices based IE system, including the spatial grouping of basis functions and the partition of the impedance matrix, as well as the direct solving procedure of H-LU factorization. Section 4 elaborates the parallelizing strategy for the construction and LU factorization of the H-matrices based IE system, with a discussion of its theoretical parallelization efficiency. Section 5 gives some results of numerical tests to show the efficiency, capacity, and accuracy of our solver. Finally, the summary and conclusions are given in Section 6.

2. Electromagnetic Integral Equation and Method of Moments

In this section we give a brief review of the surface IEs and their discretized forms [3] that our parallel FDS deals with. In the case of a 3D electromagnetic field problem, the electric field integral equation (EFIE) is written as

-\hat{\mathbf{t}} \cdot ik\eta \int_{s'} ds'\, \mathbf{J}(\mathbf{r}') \cdot \left( \overline{\overline{\mathbf{I}}} - \frac{\nabla\nabla'}{k^{2}} \right) g(\mathbf{r},\mathbf{r}') = \hat{\mathbf{t}} \cdot \mathbf{E}^{\mathrm{inc}}(\mathbf{r}),   (1)

in which k is the wavenumber, η is the wave impedance, \hat{\mathbf{t}} is the unit tangential vector on the surface of the object, and g(\mathbf{r},\mathbf{r}') is the free space Green function. \mathbf{E}^{\mathrm{inc}}(\mathbf{r}) is the incident electric field and \mathbf{J}(\mathbf{r}') is the excited surface current to be determined. Correspondingly, the 3D magnetic field integral equation (MFIE) is written as

-\frac{\mathbf{J}(\mathbf{r})}{2} + \hat{\mathbf{n}} \times \mathrm{PV}\int_{s} ds'\, \mathbf{J}(\mathbf{r}') \times \nabla' g(\mathbf{r},\mathbf{r}') = -\hat{\mathbf{n}} \times \mathbf{H}^{\mathrm{inc}}(\mathbf{r}),   (2)

in which \hat{\mathbf{n}} is the unit normal vector on the surface of the object, \mathbf{H}^{\mathrm{inc}}(\mathbf{r}) is the incident magnetic field, and PV stands for the Cauchy principal value integration. For a time harmonic electromagnetic plane wave, \mathbf{E}^{\mathrm{inc}} and \mathbf{H}^{\mathrm{inc}} have the relationship

\mathbf{H}^{\mathrm{inc}} = \frac{1}{\eta}\, \hat{\mathbf{k}} \times \mathbf{E}^{\mathrm{inc}},   (3)

where \hat{\mathbf{k}} is the unit vector of the wave propagating direction. In order to cancel the internal resonance that occurs in the solution of the individual EFIE or MFIE, the two equations are usually linearly combined. Specifically, we have the combined field integral equation (CFIE) written as

\mathrm{CFIE} = \alpha \cdot \mathrm{EFIE} + (1-\alpha)\,\eta \cdot \mathrm{MFIE}.   (4)

To solve this IE numerically, first we need to mesh the surface of the object with basic patches. For electromagnetic IEs, triangular and quadrilateral patches are two frequently used types of patches, which are associated with the Rao-Wilton-Glisson (RWG) basis function [31] and the roof-top basis function [32], respectively. By using these basis functions, the undetermined \mathbf{J}(\mathbf{r}) in (1) and (2) can be discretized by a combination of basis functions \mathbf{f}_{i}(\mathbf{r}) (i = 1, 2, ..., N) as

\mathbf{J}(\mathbf{r}) = \sum_{i=1}^{N} I_{i}\, \mathbf{f}_{i}(\mathbf{r}),   (5)

in which I_i are the unknown coefficients and N is the total number of unknowns. By using the Galerkin test, we can form the N × N linear systems for (1) and (2) as

\mathbf{Z}^{\mathrm{EFIE}} \cdot \mathbf{I} = \mathbf{V}^{\mathrm{EFIE}}, \qquad \mathbf{Z}^{\mathrm{MFIE}} \cdot \mathbf{I} = \mathbf{V}^{\mathrm{MFIE}},   (6)

in which the system matrices \mathbf{Z}^{\mathrm{EFIE}} and \mathbf{Z}^{\mathrm{MFIE}} are also called impedance matrices in the IE context. The entries of the system matrices and right-hand side vectors can be calculated numerically through the following formulas:

Z^{\mathrm{EFIE}}_{mn} = -ik\eta \int_{s}\int_{s'} g(\mathbf{r},\mathbf{r}')\left[\mathbf{f}_{m}(\mathbf{r}) \cdot \mathbf{f}_{n}(\mathbf{r}')\right] ds'\,ds + \frac{i\eta}{k} \int_{s}\int_{s'} g(\mathbf{r},\mathbf{r}')\left[\nabla_{s}\cdot\mathbf{f}_{m}(\mathbf{r})\, \nabla_{s'}\cdot\mathbf{f}_{n}(\mathbf{r}')\right] ds'\,ds,

Z^{\mathrm{MFIE}}_{mn} = \frac{1}{2}\int_{s} ds\, \mathbf{f}_{m}(\mathbf{r}) \cdot \mathbf{f}_{n}(\mathbf{r}) + \hat{\mathbf{n}} \cdot \int_{s} ds\, \mathbf{f}_{m}(\mathbf{r}) \times \int_{s'} ds'\, \mathbf{f}_{n}(\mathbf{r}') \times \nabla' g(\mathbf{r},\mathbf{r}'),

V^{\mathrm{EFIE}}_{m} = \int_{s} ds\, \mathbf{f}_{m}(\mathbf{r}) \cdot \mathbf{E}^{\mathrm{inc}}(\mathbf{r}),

V^{\mathrm{MFIE}}_{m} = \int_{s} ds\, \mathbf{f}_{m}(\mathbf{r}) \cdot \hat{\mathbf{n}} \times \mathbf{H}^{\mathrm{inc}}(\mathbf{r}).   (7)

Consequently, the discretized CFIE system matrix and excitation vector are

Z^{\mathrm{CFIE}}_{mn} = \alpha Z^{\mathrm{EFIE}}_{mn} + (1-\alpha)\,\eta\, Z^{\mathrm{MFIE}}_{mn}, \qquad V^{\mathrm{CFIE}}_{m} = \alpha V^{\mathrm{EFIE}}_{m} + (1-\alpha)\,\eta\, V^{\mathrm{MFIE}}_{m}.   (8)

By solving the linear system

\mathbf{Z}^{\mathrm{CFIE}} \cdot \mathbf{I} = \mathbf{V}^{\mathrm{CFIE}},   (9)

we can get the excited surface current \mathbf{J} and the corresponding radiation/scattering field.

3. Fast Direct Solver and Hierarchical LU Factorization

Generally, we can solve the linear system (9) either iteratively or directly. For iterative solvers, the Krylov subspace methods are frequently used, such as CG, BiCG, GMRES, and QMR [33]. Because iterative methods often suffer from slow-convergence issues for complex applications, direct methods have become popular recently. Several researchers have proposed a variety of FDS [14-19, 25], which aim at avoiding the convergence problems of iterative solvers and utilize similar hierarchical inverse/factorization procedures and low rank compression techniques to reduce the computational and storage complexity.

The H-matrices method [11, 12, 15] is a widely known mathematical framework for accelerating both the finite element method (FEM) and the IE method. In the scenario of electromagnetic IEs, the H-matrices method is regarded as the algebraic counterpart of MLFMM [4-6]. The main advantage of the H-matrices technique over MLFMM is that it can directly solve the IE system by exploiting its purely algebraic nature. In this section we briefly review the primary procedures of the IE-based H-matrices method and the algorithm of hierarchical LU factorization, which will be parallelized.

In order to implement the H-matrices method, first the impedance matrix Z^{CFIE} needs to be formed block-wise and compressed by a low rank decomposition scheme. We subdivide the surface of the object hierarchically and, according to the interaction types, form the impedance/forward system matrix in a multilevel pattern. This procedure is very similar to the setting-up step in MLFMM. In contrast to the octree structure of MLFMM, the structure in H-matrices is a binary tree. The subdivision is stopped when the dimension of the smallest subregions covers a few wavelengths. The subblocks in the impedance matrix representing the far-field/weak interactions are compressed by low rank (LR) decomposition schemes. Here we use the adaptive cross-approximation (ACA) algorithm [18, 19] as our LR decomposition scheme, which requires only partial information of the block to be compressed and is thus very efficient.
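For readers unfamiliar with ACA, the following is a minimal sketch (not the authors' code) of the classical partially pivoted ACA loop that such a compression step typically uses; the entry(i, j) callback, the tolerance handling, and the simplified Frobenius-norm estimate are assumptions of this sketch rather than details given in the paper:

```cpp
// aca_sketch.cpp -- partially pivoted ACA building B ~= U*V from on-demand entries.
#include <complex>
#include <functional>
#include <vector>
#include <cmath>
#include <cstddef>

using cplx  = std::complex<double>;
using Entry = std::function<cplx(std::size_t, std::size_t)>;  // B(i, j) on demand

struct LowRank {                       // B (m x n) ~= sum_k u_k * v_k
    std::vector<std::vector<cplx>> U;  // columns u_k, each of length m
    std::vector<std::vector<cplx>> V;  // rows    v_k, each of length n
};

LowRank aca(std::size_t m, std::size_t n, const Entry& entry,
            double tol, std::size_t max_rank)
{
    LowRank lr;
    std::vector<bool> used_row(m, false), used_col(n, false);
    double frob2 = 0.0;          // crude running estimate of ||U V||_F^2
    std::size_t i = 0;           // current pivot row

    for (std::size_t k = 0; k < max_rank && k < std::min(m, n); ++k) {
        used_row[i] = true;
        // Residual row i:  R(i,:) = B(i,:) - sum_l u_l(i) v_l
        std::vector<cplx> row(n);
        for (std::size_t j = 0; j < n; ++j) {
            row[j] = entry(i, j);
            for (std::size_t l = 0; l < lr.U.size(); ++l) row[j] -= lr.U[l][i] * lr.V[l][j];
        }
        // Pivot column: largest residual entry among unused columns.
        std::size_t jp = 0; double best = -1.0;
        for (std::size_t j = 0; j < n; ++j)
            if (!used_col[j] && std::abs(row[j]) > best) { best = std::abs(row[j]); jp = j; }
        if (best <= 1e-14) break;                 // row already well represented
        used_col[jp] = true;
        const cplx pivot = row[jp];
        for (auto& x : row) x /= pivot;           // v_k = R(i,:) / pivot
        // Residual column jp:  u_k = B(:,jp) - sum_l v_l(jp) u_l
        std::vector<cplx> col(m);
        for (std::size_t p = 0; p < m; ++p) {
            col[p] = entry(p, jp);
            for (std::size_t l = 0; l < lr.U.size(); ++l) col[p] -= lr.V[l][jp] * lr.U[l][p];
        }
        double nu = 0.0, nv = 0.0;
        for (const auto& x : col) nu += std::norm(x);
        for (const auto& x : row) nv += std::norm(x);
        frob2 += nu * nv;
        lr.U.push_back(std::move(col));
        lr.V.push_back(std::move(row));
        if (std::sqrt(nu * nv) <= tol * std::sqrt(frob2)) break;   // converged
        // Next pivot row: largest entry of the new column among unused rows.
        best = -1.0;
        for (std::size_t p = 0; p < m; ++p)
            if (!used_row[p] && std::abs(lr.U.back()[p]) > best) { best = std::abs(lr.U.back()[p]); i = p; }
        if (best < 0.0) break;                    // all rows exhausted
    }
    return lr;
}
```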

After the H-matrices based impedance matrix is generated, the hierarchical LU (H-LU) factorization [12, 26, 28] is applied to it, and finally we get an LU decomposed system matrix LU(Z^{CFIE}). During and after the factorization process, subblocks in LR format are kept all the time. Having LU(Z^{CFIE}), we can easily solve the IE with given excitations by a recursive backward substitution procedure. Figure 1 illustrates the overall procedure of the H-matrices based FDS introduced above.

From Z^{CFIE} to LU(Z^{CFIE}), the hierarchical LU factorization is employed. The factorization algorithm is elaborated below. Consider the block LU factorization of the level-1 partitioned Z^{CFIE} matrix, namely

\mathbf{Z}^{\mathrm{CFIE}} = \begin{bmatrix} \mathbf{Z}_{11} & \mathbf{Z}_{12} \\ \mathbf{Z}_{21} & \mathbf{Z}_{22} \end{bmatrix} = \begin{bmatrix} \mathbf{L}_{11} & \\ \mathbf{L}_{21} & \mathbf{L}_{22} \end{bmatrix} \begin{bmatrix} \mathbf{U}_{11} & \mathbf{U}_{12} \\ & \mathbf{U}_{22} \end{bmatrix}.   (10)

We carry out the following procedures in a recursive way: (i) get L_{11} and U_{11} by the LU factorization Z_{11} = L_{11}U_{11}; (ii) get U_{12} by solving the lower triangular system Z_{12} = L_{11}U_{12}; (iii) get L_{21} by solving the upper triangular system Z_{21} = L_{21}U_{11}; (iv) update Z_{22}: Z_{22} = Z_{22} - L_{21}U_{12}; and (v) get L_{22} and U_{22} by the LU factorization Z_{22} = L_{22}U_{22}. Recursion is involved in all of procedures (i)-(v). Procedures (ii) and (iii) (and the similar operations under the hierarchical execution of (i) and (v)) are performed via partitioned forward/backward substitution for hierarchical lower/upper triangular matrices [12, 28]. Procedure (iv) is a vital operation that contributes most of the computing time. This subblock addition-multiplication procedure occurs not only on one or several certain levels but pervades all of the recursive executions. Implementation details about H-matrices compression and recursive LU factorization can be found in [12, 14, 15]. In addition, a similar recursive direct solving process for a noncompressed IE system matrix is discussed in [26].
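To make the recursion in procedures (i)-(v) concrete, the following is a minimal dense sketch of the same recursive block LU pattern (Doolittle form, no pivoting, no LR compression), so it is only the skeleton that H-LU builds on, not the authors' H-matrix implementation; the function names and the leaf-size parameter are assumptions of this sketch:

```cpp
// block_lu_sketch.cpp -- recursive block LU following steps (i)-(v), dense blocks only.
#include <vector>
#include <cstddef>

using Mat = std::vector<std::vector<double>>;   // dense row-major matrix

// (i)/(v) base case: in-place Doolittle LU (unit lower L, upper U) of the
// square diagonal block A[p..p+n)[p..p+n), no pivoting.
static void lu_base(Mat& A, std::size_t p, std::size_t n) {
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t i = k + 1; i < n; ++i) {
            A[p+i][p+k] /= A[p+k][p+k];
            for (std::size_t j = k + 1; j < n; ++j)
                A[p+i][p+j] -= A[p+i][p+k] * A[p+k][p+j];
        }
}

// (ii) solve L11 * U12 = Z12 (forward substitution, unit diagonal):
// L11 = A[p..p+h)[p..p+h), Z12 = A[p..p+h)[c0..c0+w), overwritten by U12.
static void solve_lower(Mat& A, std::size_t p, std::size_t h, std::size_t c0, std::size_t w) {
    for (std::size_t j = 0; j < w; ++j)
        for (std::size_t i = 0; i < h; ++i)
            for (std::size_t k = 0; k < i; ++k)
                A[p+i][c0+j] -= A[p+i][p+k] * A[p+k][c0+j];
}

// (iii) solve L21 * U11 = Z21:
// U11 = A[p..p+h)[p..p+h), Z21 = A[r0..r0+w)[p..p+h), overwritten by L21.
static void solve_upper(Mat& A, std::size_t p, std::size_t h, std::size_t r0, std::size_t w) {
    for (std::size_t i = 0; i < w; ++i)
        for (std::size_t k = 0; k < h; ++k) {
            for (std::size_t l = 0; l < k; ++l)
                A[r0+i][p+k] -= A[r0+i][p+l] * A[p+l][p+k];
            A[r0+i][p+k] /= A[p+k][p+k];
        }
}

// (iv) Z22 -= L21 * U12, the dominant cost; rows are independent, so OpenMP helps.
static void update(Mat& A, std::size_t p, std::size_t h, std::size_t q, std::size_t m) {
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(m); ++i)
        for (std::size_t j = 0; j < m; ++j)
            for (std::size_t k = 0; k < h; ++k)
                A[q+i][q+j] -= A[q+i][p+k] * A[p+k][q+j];
}

// Recursive LU of the square block A[p..p+n)[p..p+n), mirroring (i)-(v).
void block_lu(Mat& A, std::size_t p, std::size_t n, std::size_t leaf = 64) {
    if (n <= leaf) { lu_base(A, p, n); return; }
    const std::size_t h = n / 2;
    block_lu(A, p, h, leaf);                 // (i)   Z11 = L11*U11
    solve_lower(A, p, h, p + h, n - h);      // (ii)  U12
    solve_upper(A, p, h, p + h, n - h);      // (iii) L21
    update(A, p, h, p + h, n - h);           // (iv)  Z22 -= L21*U12
    block_lu(A, p + h, n - h, leaf);         // (v)   Z22 = L22*U22
}
```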

Based on the result of the H-LU factorization, we can easily get the solution for any given excitation, namely right-hand side (RHS) vector, by hierarchical backward substitution. The CPU time cost for backward substitution is trivial compared to that of the H-LU factorization.

4. Parallelization of H-Matrices Based IE Fast Direct Solver

The parallelization of the H-matrices based IE FDS contains two parts: (a) the parallelization of the process of generating the forward/impedance matrix in H-matrices form; (b) the parallelization of the H-LU factorization and backward substitution. In this section we elaborate the approaches for implementing these two parts separately.


[Figure 1: Overall procedure of the H-matrices based FDS. Procedure I: the object with multilevel partition and its geometrical interactions are used to form Z^{CFIE}, whose far-field blocks are compressed by the LR decomposition scheme (ACA) into low rank format B = UV. Procedure II: Z^{CFIE} is decomposed into LU(Z^{CFIE}).]

[Figure 2: Data distribution strategy for the forward/impedance matrix and also for the LU factorized matrix. The matrix is divided into row-blocks Row 01, Row 02, Row 03, ..., Row 2^K, assigned to processes Proc 01, Proc 02, Proc 03, ..., Proc 2^K, respectively.]

4.1. Parallelization of Generating the Forward/Impedance Matrix. Since our FDS is executed on a computer cluster, the computing data must be stored distributively in an appropriate way, especially for massive computing tasks. The main data of the H-matrices based IE FDS are the LR compressed forward/impedance matrix and the LU factorized matrix. In principle, we divide the forward/impedance matrix into multiple row-blocks and each computing node stores one of these row-blocks. This memory distribution strategy is shown in Figure 2. The width of each row-block is not arbitrary but corresponds to the hierarchical partition of the matrix at a certain level K, so that no LR formatted block is split by the division lines that cross the entire row. In practice the width of each row-block is approximately equal and the total number of row-blocks is a power of 2, namely 2^K, where K is an integer less than the depth L of the binary tree structure of the H-matrices. Each row-block is computed and filled locally by the MPI node it is assigned to. Since every subblock is an independent unit, the filling process for these row-blocks can be parallelized over all MPI nodes without any communication or memory overlap. Furthermore, for each MPI node with a multicore processor, the filling of its corresponding blocks can be accelerated by OpenMP shared memory parallel computing (a sketch of this filling pattern follows below). In the case of MoM, OpenMP is very efficient for parallelizing the calculation of impedance elements, since this kind of calculation only depends on the invariant geometrical information and electromagnetic parameters, which are universally shared among all nodes.
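The following sketch illustrates that owner-computes filling pattern under simplifying assumptions that are ours, not the paper's: one MPI process per row-block, an equal split of the N rows, a dense (uncompressed) local block, and a placeholder impedance_entry routine standing in for the actual MoM integration of (7)-(8):

```cpp
// rowblock_fill_sketch.cpp -- each MPI rank fills its own row-block; threads fill entries.
#include <mpi.h>
#include <complex>
#include <vector>
#include <cstddef>

// Placeholder for the actual MoM integration of Z_mn (hypothetical routine).
std::complex<double> impedance_entry(std::size_t m, std::size_t n) {
    return { m == n ? 1.0 : 0.0, 0.0 };      // dummy value, stands in for (7)-(8)
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, nprocs = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);  // ideally 2^K processes

    const std::size_t N     = 4096;          // total number of unknowns (toy size)
    const std::size_t rows  = N / nprocs;    // rows per row-block (assumed to divide N)
    const std::size_t first = rank * rows;   // first global row owned by this rank

    // Local storage: only this rank's row-block (dense here; LR-compressed in the paper).
    std::vector<std::complex<double>> block(rows * N);

    // No communication is needed: every entry depends only on geometry data
    // and electromagnetic parameters that all nodes hold.
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(rows); ++i)
        for (std::size_t j = 0; j < N; ++j)
            block[i * N + j] = impedance_entry(first + i, j);

    MPI_Finalize();
    return 0;
}
```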

4.2. Parallelization of H-LU Factorization. The H-LU factorization is executed immediately after the forward/impedance matrix is generated. Compared with the forward/impedance matrix filling process, the parallelization of H-LU factorization is not straightforward, because its recursive procedure is sequential in nature. In order to put the H-LU factorization into a distributed parallel framework, here we use an incremental strategy to parallelize this process gradually.

During the H-LU factorization, the algorithm is always working/computing on a certain block, which is in LR form, in full form, or an aggregation of them, so we can set a criterion based on the size of the block for determining whether the current procedure should be parallelized. Specifically, assuming that the total number of unknowns is N and we have about 2^K nodes for computing, then if the maximal dimension of the block being processed is larger than N/2^K, MPI parallel computing with multiple nodes is employed; otherwise the process is dealt with by a single node. It should be noted that OpenMP shared memory parallelism is always employed for every node with a multicore processor participating in the computation. In the H-LU factorization process, OpenMP is in charge of the intranode parallelization of standard LU factorization and triangular system solving for full subblocks, of SVD or QR decomposition for LR subblock addition and multiplication, and so forth, which can be implemented through a highly optimized computing package like LAPACK [34].
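Expressed as code, the dispatch rule reads as below; the function and parameter names are ours, and the threshold N/2^K is the one stated above:

```cpp
#include <cstddef>

// Decide whether a block of the given maximal dimension is processed by the
// MPI group (multiple nodes) or by a single node with OpenMP only.
// N: total number of unknowns, K: log2 of the number of MPI nodes.
bool use_mpi_group(std::size_t block_dim, std::size_t N, unsigned K) {
    const std::size_t threshold = N >> K;     // N / 2^K
    return block_dim > threshold;             // large block -> multinode MPI parallelism
}
```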

According to the recursive procedure of H-LU factorization, at the beginning the algorithm penetrates to the finest level of the H-matrices structure, namely level L, to do the LU factorization of the top-left-most block with a dimension of approximately (N/2^L) × (N/2^L). Since L ≥ K, this process is handled by one node. After that, the algorithm recurs to level L-1 to do the LU factorization of the extended top-left block of approximately (N/2^{L-1}) × (N/2^{L-1}) upon the previous factorization. If L-1 ≥ K still holds, this process is also handled by one node. This recursion is kept on until it returns to a level l with l < K, for which MPI parallelism with multiple nodes is employed.

For the sake of convenience, we use H-LU(l) to denote the factorization of a certain block on level l of the H-matrices structure. Recall the H-LU factorization procedure in Section 3. Procedure (i) is H-LU itself, so its parallelization is realized by the other procedures on the finer levels. For procedure (ii) and procedure (iii), we solve the triangular linear systems

\mathbf{L}\mathbf{X} = \mathbf{A},   (11)

\mathbf{Y}\mathbf{U} = \mathbf{B},   (12)

in which \mathbf{L}, \mathbf{U}, \mathbf{A}, and \mathbf{B} are the known matrices, \mathbf{L} and \mathbf{U} being lower and upper triangular, respectively; \mathbf{X} and \mathbf{Y} need to be determined. These two systems can actually be solved recursively through finer triangular systems. For instance, by rewriting (11) as

\begin{bmatrix} \mathbf{L}_{11} & \\ \mathbf{L}_{21} & \mathbf{L}_{22} \end{bmatrix} \begin{bmatrix} \mathbf{X}_{11} & \mathbf{X}_{12} \\ \mathbf{X}_{21} & \mathbf{X}_{22} \end{bmatrix} = \begin{bmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{bmatrix},   (13)

we carry out the following procedures: (i) get X_{11} and X_{12} by solving the lower triangular systems A_{11} = L_{11}X_{11} and A_{12} = L_{11}X_{12}, respectively; (ii) get X_{21} and X_{22} by solving the lower triangular systems A_{21} - L_{21}X_{11} = L_{22}X_{21} and A_{22} - L_{21}X_{12} = L_{22}X_{22}, respectively. From the above procedures we can clearly see that getting the columns [X_{11}\;X_{21}]^{T} and [X_{12}\;X_{22}]^{T} represents two independent processes, which can be parallelized. Similarly, by refining \mathbf{Y} as

\begin{bmatrix} \mathbf{Y}_{11} & \mathbf{Y}_{12} \\ \mathbf{Y}_{21} & \mathbf{Y}_{22} \end{bmatrix},   (14)

getting the rows [Y_{11}\;Y_{12}] and [Y_{21}\;Y_{22}] represents two independent processes. Therefore, if the size of X or Y, namely U_{12} or L_{21} in the H-LU context, is larger than N/2^K, their solving processes will be dealt with by multiple nodes.
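The independence exploited here is that distinct block columns of X (and block rows of Y) never couple. A minimal dense sketch of this idea is given below, using OpenMP threads in place of the MPI nodes described above and plain forward substitution in place of the recursive H-format solves; all names are ours:

```cpp
// trisolve_sketch.cpp -- columns of X in L*X = A are independent, so they can be
// solved concurrently (here by OpenMP threads; in the paper, block columns go to nodes).
#include <vector>
#include <cstddef>

using Mat = std::vector<std::vector<double>>;   // dense row-major matrix

// Solve L * X = A for X, where L is n x n unit lower triangular and A is n x m.
// A is overwritten by X. Each column j is an independent forward substitution.
void solve_lower_parallel(const Mat& L, Mat& A) {
    const std::size_t n = L.size();
    const std::size_t m = A.empty() ? 0 : A[0].size();
    #pragma omp parallel for
    for (std::ptrdiff_t j = 0; j < static_cast<std::ptrdiff_t>(m); ++j)
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t k = 0; k < i; ++k)
                A[i][j] -= L[i][k] * A[k][j];       // unit diagonal: no division
}
```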

After the solving processes of the triangular systems are done, the updating of Z_{22}, namely procedure (iv) of H-LU, takes place. We use UPDATE(l) to denote any of these procedures on level l. Rewriting it as

\mathbf{Z} = \mathbf{Z} - \mathbf{A}\mathbf{B},   (15)

apparently this is a typical matrix addition and multiplication procedure, which can be parallelized through the addition and multiplication of subblocks when necessary:

\begin{bmatrix} \mathbf{Z}_{11} & \mathbf{Z}_{12} \\ \mathbf{Z}_{21} & \mathbf{Z}_{22} \end{bmatrix} = \begin{bmatrix} \mathbf{Z}_{11} & \mathbf{Z}_{12} \\ \mathbf{Z}_{21} & \mathbf{Z}_{22} \end{bmatrix} - \begin{bmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{bmatrix} \begin{bmatrix} \mathbf{B}_{11} & \mathbf{B}_{12} \\ \mathbf{B}_{21} & \mathbf{B}_{22} \end{bmatrix} = \begin{bmatrix} \mathbf{Z}_{11} - \mathbf{A}_{11}\mathbf{B}_{11} - \mathbf{A}_{12}\mathbf{B}_{21} & \mathbf{Z}_{12} - \mathbf{A}_{11}\mathbf{B}_{12} - \mathbf{A}_{12}\mathbf{B}_{22} \\ \mathbf{Z}_{21} - \mathbf{A}_{21}\mathbf{B}_{11} - \mathbf{A}_{22}\mathbf{B}_{21} & \mathbf{Z}_{22} - \mathbf{A}_{21}\mathbf{B}_{12} - \mathbf{A}_{22}\mathbf{B}_{22} \end{bmatrix}.   (16)

In the final form of (16), the matrix computation for each of the four subblocks is an independent process and can be dealt with by different nodes simultaneously. When Z_{22} is updated, the LU factorization is applied to Z_{22} to get L_{22} and U_{22}. This procedure is identical to the H-LU factorization for Z_{11}, so any parallel scheme employed for factorizing Z_{11} is also applicable for factorizing Z_{22}.
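A minimal dense sketch of the 2 x 2 subblock update (16) is shown below, with the four independent subblock computations issued as OpenMP tasks as a thread-level stand-in for the four nodes of the scheme above; the block sizes, names, and task-based dispatch are assumptions of this sketch:

```cpp
// update_sketch.cpp -- Z := Z - A*B done per 2x2 subblock; the four subblock
// updates are independent and are dispatched as separate tasks.
#include <vector>
#include <cstddef>

using Mat = std::vector<std::vector<double>>;

// One subblock of (16): Z_ij -= A_i1*B_1j + A_i2*B_2j, where bi, bj select the
// half-offsets of the target block and h is the half size (n = 2*h assumed even).
static void update_subblock(Mat& Z, const Mat& A, const Mat& B,
                            std::size_t bi, std::size_t bj, std::size_t h) {
    for (std::size_t i = 0; i < h; ++i)
        for (std::size_t j = 0; j < h; ++j) {
            double acc = 0.0;
            for (std::size_t k = 0; k < 2 * h; ++k)      // runs over both A/B halves
                acc += A[bi + i][k] * B[k][bj + j];
            Z[bi + i][bj + j] -= acc;
        }
}

void update_blocked(Mat& Z, const Mat& A, const Mat& B) {
    const std::size_t h = Z.size() / 2;
    #pragma omp parallel
    #pragma omp single
    {
        // Four independent subblock updates, one task each (nodes in the paper's scheme).
        #pragma omp task shared(Z, A, B)
        update_subblock(Z, A, B, 0, 0, h);
        #pragma omp task shared(Z, A, B)
        update_subblock(Z, A, B, 0, h, h);
        #pragma omp task shared(Z, A, B)
        update_subblock(Z, A, B, h, 0, h);
        #pragma omp task shared(Z, A, B)
        update_subblock(Z, A, B, h, h, h);
        #pragma omp taskwait
    }
}
```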

The overall strategy for parallelizing the H-LU factorization is illustrated in Figure 3. Before the LU factorization on level K, H-LU(K), is finished, only one node P_1 takes the computing job. Afterwards, two nodes P_1 and P_2 are used to get U_{12} and L_{21} simultaneously for H-LU(K-1). Then Z_{22} is updated and factorized to get U_{22} and L_{22} by the process H-LU(K). For the triangular solving processes on level K-2, 4 nodes can be used concurrently to get U_{12} and L_{21}, based on the parallelizable processes of solving the upper and lower triangular systems discussed above. Besides, updating Z_{22} can be parallelized by 4 nodes for the 4 square blocks, as enclosed by the blue line in Figure 3. This parallelization can be applied to coarser levels in general. As for the factorization H-LU(l) (l < K), 2^{K-l} nodes can be used for parallelizing the process of solving the triangular systems and 4^{K-l} nodes can be used for parallelizing the procedure of updating Z_{22}. During and after the H-LU factorization, the system matrix is kept in the same distributed format as that used for the forward/impedance matrix, shown in Figure 2. From the perspective of data storage, those row-blocks are updated gradually by the H-LU factorization. It should be noted that, for the nodes participating in parallel computing, the blocks they are dealing with may not belong to the row-blocks they store. Therefore, necessary communication for block transfer occurs among multiple nodes during the factorization.

[Figure 3: Parallel strategy/approach for the H-LU factorization; P_i (i = 1, 2, 3, ...) denotes the i-th node. The diagram shows H-LU(K), then Update(K) and H-LU(K), H-LU(K-1), Update(K-1) and H-LU(K-2), H-LU(K-2), Update(K-2) and H-LU(K-3), and so on as H-LU recurs to upper levels, with nodes P_1 through P_8 assigned to the corresponding blocks.]

4.3. Theoretical Efficiency of MPI Parallelization. To analyze the parallelization efficiency under an ideal circumstance, we assume that the load balance is perfectly tuned and the computer cluster is equipped with a high-speed network; thus there is no latency for remote response and the internode communication time is negligible. Suppose there are P nodes and each node has a constant number of OpenMP threads. The parallelization efficiency is simply defined as

\eta_{P} = \frac{T_{1}}{P\, T_{P}},   (17)

in which T_i is the processing time with i nodes.

Apparently, for the process of generating the forward/impedance matrix, \eta_P could be 100% if the number of operations for filling each row-block under P-node parallelization, O^{rowfill}_P, is 1/P of the number of operations for filling the whole matrix by a single node, O^{fill}_1, since T_1 \propto O^{fill}_1 and T_P \propto O^{rowfill}_P = O^{fill}_1/P hold if the CPU time for a single operation is invariant. This condition can easily be satisfied to a high degree when P is not too large. When P is large, however, dividing the system matrix into P row-blocks may force the depth of the H-matrices structure to be inadequate, which consequently increases the overall complexity of the H-matrices method. In other words,

O^{rowfill}_{P} > \frac{O^{fill}_{1}}{P} \quad (\text{for large } P).   (18)

Under this situation the parallelization becomes inefficient.

For the process of H-LU factorization, by assuming that P = 2^K, the total number of operations under P-node parallelization, O^{LU}_P, is

O^{LU}_{P} \approx P \cdot O[\text{H-LU}(K)] + P(P-1) \cdot O[\text{H-TriSolve}(K)] + \frac{P^{3}}{6}\, O[\text{H-AddProduct}(K)],   (19)

where H-LU(K), H-TriSolve(K), and H-AddProduct(K) denote the processes of LU factorization, triangular system solving, and matrix addition and multiplication in H-format on the K-th level of the H-matrices structure, respectively. Their computational complexity is O(n^{\alpha}) (2 \le \alpha < 3), in which n = N/P. However, because of parallelization, some operations are done simultaneously. Hence the nonoverlapping number of operations O^{T}_P, which represents the actual wall time of processing, is

O^{T}_{P} \approx P \cdot O[\text{H-LU}(K)] + \frac{PK}{2} \cdot O[\text{H-TriSolve}(K)] + \frac{P^{2}}{4}\, O[\text{H-AddProduct}(K)].   (20)

Additionally, the number of operations without parallelization for the H-LU factorization, O^{LU}_1, is definitely no larger than O^{LU}_P, because of the potentially increased overall complexity caused by parallelization. Therefore the parallelization efficiency is

\eta_{P} = \frac{O^{LU}_{1}}{P \cdot O^{T}_{P}} \le \frac{O^{LU}_{P}}{P \cdot O^{T}_{P}}.   (21)

When P grows larger, the result is approximately

\eta_{P} \le \frac{(P^{3}/6)\, O[\text{H-AddProduct}(K)]}{P \cdot (P^{2}/4)\, O[\text{H-AddProduct}(K)]} \approx \frac{2}{3}.   (22)

According to our analysis, for a given size of IE problem there is an optimal P for the best implementation. If P is too small, the solving process cannot be fully parallelized. On the other hand, a larger P makes the width of the row-blocks narrower, which eliminates big LR subblocks and causes the H-matrices structure to be flatter, and consequently impairs the low complexity feature of the H-matrices method.

Similarly, the number of OpenMP threads M has an optimal value too. Too small an M restricts the OpenMP parallelization, and too large an M is also a waste, because the intranode parallel computing only deals with algebraic arithmetic for subblocks of relatively small size; excessive threads will not improve the acceleration and will lower the efficiency.

5. Numerical Results

The cluster we test our code on has 32 physical nodes in total; each node has 64 GBytes of memory and four quad-core Intel Xeon E5-2670 processors with a 2.60 GHz clock rate. For all tested IE cases the discrete mesh size is 0.1λ, in which λ is the wavelength.

5.1. Parallelization Efficiency. First we test the parallelization scalability for solving the scattering problem of PEC spheres of different electric sizes. The radii of these spheres are R = 2.5λ, R = 5.0λ, and R = 10.0λ, respectively, and the total numbers of unknowns for the three cases are 27234, 108726, and 435336, respectively. Figure 4 shows the efficiency of the total solving time for 1, 2, 4, 8, 16, 32, 64, and 128 MPI nodes with 4 OpenMP threads for each node. The definition of parallelization efficiency is the same as formula (17); thus in this test we only consider the efficiency of pure MPI parallelization. Unsurprisingly, for parallelization with a large number of nodes, bigger cases have higher efficiencies because of better load balance and less distortion of the H-matrices structure. Besides, since the H-LU factorization dominates the solving time, the tested efficiencies of the bigger cases are close to the theoretical maximum efficiency \eta^{LU}_P = 2/3, which demonstrates the good parallelizing quality of our code.

[Figure 4: Parallelization efficiency for solving scattering problems of PEC spheres of different electric sizes (R = 2.5λ, 5.0λ, 10.0λ), for 1 to 128 processors.]

Next we investigate the hybrid parallelization efficiency for different MPI nodes (P) / OpenMP threads (M) proportions. We employ all 512 cores in the cluster to solve the 3 PEC sphere cases above in four scenarios, from pure MPI parallelization (1 OpenMP thread for each MPI node, large P/M) to heavy OpenMP parallelization embedded in MPI (16 OpenMP threads for each MPI node, small P/M). Here we slightly change the definition of parallelization efficiency to

\eta_{(P,M)} = \frac{T_{(1,1)}}{P \cdot M \cdot T_{(P,M)}},   (23)

in which T_{(i,j)} denotes the total solving time parallelized by i MPI nodes with j OpenMP threads for each node. The results are presented in Figure 5. We can clearly see that both pure MPI parallelization and heavy OpenMP hybrid parallelization have poorer efficiency compared to moderate OpenMP hybrid parallelization.

[Figure 5: Parallelization efficiency of different MPI nodes/OpenMP threads allocations (512 nodes with 1 thread per node, 128 with 4, 64 with 8, 32 with 16) versus the number of unknowns N (2^14 to 2^19).]

Ideally, a larger P/M should give better efficiency for bigger cases. But in reality, since a larger P/M results in more frequent MPI communication, more time is required for communication, synchronization, and so forth. Besides, a large P also impairs the low complexity feature of the H-matrices method regardless of the size of the problem. Therefore pure MPI parallelization does not show superiority over hybrid parallelization in our test. Regarding the optimal number of OpenMP threads per node, it is mainly determined by the (preset) size of the smallest blocks in the H-matrices, since the OpenMP efficiency strongly relates to the parallelization of full or LR block addition and multiplication at the low levels. For all cases of this test the size of the smallest blocks is between 50 and 200, so the optimal number of OpenMP threads per node turns out to be 4. However, we can expect the optimal number of OpenMP threads to be larger (smaller) if the size of the smallest blocks is larger (smaller). For the case of solving the PEC sphere of R = 10.0λ with 128 nodes and 4 threads per node, the parallelization efficiency in this test is 0.737, which is higher than the 0.562 of the previous test. This is because, by our new definition of efficiency, the OpenMP parallelization is taken into account. These results also indicate the superiority of MPI-OpenMP hybrid parallelization over pure MPI parallelization.

[Figure 6: Memory cost (MB) for different MPI nodes/OpenMP threads proportions versus the number of unknowns N, compared with the O(N^1.5) and O(N^{4/3} log N) trends.]

Figures 6 and 7 show the memory cost and solving time for different problem sizes under different MPI nodes/OpenMP threads proportions. The memory cost complexity generally fits the O(N^1.5) trend, which has also been demonstrated for sequential FDS [15, 17, 19]. Since the parallelization efficiency gradually rises along with the increase of problem size as shown above, the solving time complexity is about O(N^1.84), instead of the O(N^2) exhibited in the sequential scenario [15, 17, 19].

5.2. PEC Sphere. In this part we test the accuracy and efficiency of our proposed parallel FDS and compare its results to analytical solutions. A scattering problem of one PEC sphere with radius R = 30.0λ is solved. After discretization, the total number of unknowns of this case is 3918612. For the hybrid parallelization, 128 MPI nodes are employed with 4 OpenMP threads for each node. The building time for the forward system matrix is 4318.7 seconds, the H-LU factorization time is 157230.2 seconds, and the solving time for each excitation (RHS) is only 36.6 seconds. The total memory cost is 987.1 GBytes. The numerical bistatic RCS results agree with the analytical Mie series very well, as shown in Figure 8.

[Figure 7: Solving time (s) for different MPI nodes/OpenMP threads proportions versus the number of unknowns N, compared with the O(N^2) and O(N^1.84) trends.]

[Figure 8: Bistatic RCS (dBsm) of the PEC sphere with radius 30.0λ (N = 3918612) versus θ, solved by the proposed MPI-OpenMP hybrid parallel H-LU direct solver and compared with the Mie series.]

5.3. Airplane Model. Finally, we test our solver with a more realistic case. The monostatic RCS of an airplane model with dimensions of 60.84λ × 35.00λ × 16.25λ is calculated. After discretization, the total number of unknowns is 3654894. The building time for the forward system matrix is 3548.5 seconds, the H-LU factorization time is 110453.4 seconds, and the peak memory cost is 724.3 GBytes. With the LU form of the system matrix, we directly calculate the monostatic RCS for 10000 vertical incident angles (RHS) in total by backward substitution. The average time cost for calculating each incident angle is 31.2 seconds. The RCS result is compared with that obtained by an FMM iterative solver for 720 incident angles. Although the FMM iterative solver costs only 112.4 GBytes of memory, the average iteration time for solving the back scattering of each angle is 938.7 seconds with the SAI preconditioner [23], which is not practical for solving thousands of RHS. From Figure 9 we can see that these two results agree with each other well.

International Journal of Antennas and Propagation 9

XY

Z

0 20 40 60 80 100 120 140 160 180

minus10

0

10

20

30

40

50

60

70

80

Mon

osta

tic R

CS (d

Bsm

)

MPI-OpenMP parallel

FMM iterative solver

Sweeping 120593

120593

ℋ-LU direct solver

XXYYYYYYY

ZZ

Figure 9 Monostatic RCS of airplane model solved by proposedhybrid parallel FDS and FMM iterative solver

6. Conclusion

In this paper we present an H-matrices based fast direct solver accelerated by both MPI multinode and OpenMP multithread parallel techniques. The macrostructural implementation of the H-matrices method and H-LU factorization is parallelized by MPI, while the microalgebraic computation for matrix blocks and vector segments is parallelized by OpenMP. Despite the sequential nature of the H-matrices direct solving procedure, the proposed hybrid parallel strategy shows good parallelization efficiency. Numerical results also demonstrate excellent accuracy and the superiority of this parallel direct solver in solving problems with massive excitations (RHS).

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by the China Scholarship Council (CSC), the Fundamental Research Funds for the Central Universities of China (E022050205), the NSFC under Grant 61271033, the Programme of Introducing Talents of Discipline to Universities under Grant b07046, and the National Excellent Youth Foundation (NSFC no. 61425010).

References

[1] A. Taflove, Computational Electrodynamics: The Finite-Difference Time-Domain Method, Artech House, Norwood, Mass, USA, 1995.
[2] J.-M. Jin, The Finite Element Method in Electromagnetics, Wiley, New York, NY, USA, 2002.
[3] R. F. Harrington, Field Computation by Moment Methods, MacMillan, New York, NY, USA, 1968.
[4] R. Coifman, V. Rokhlin, and S. Wandzura, "The fast multipole method for the wave equation: a pedestrian prescription," IEEE Antennas and Propagation Magazine, vol. 35, no. 3, pp. 7-12, 1993.
[5] J. Song, C. C. Lu, and W. C. Chew, "Multilevel fast multipole algorithm for electromagnetic scattering by large complex objects," IEEE Transactions on Antennas and Propagation, vol. 45, no. 10, pp. 1488-1493, 1997.
[6] J. Hu, Z. P. Nie, J. Wang, G. X. Zou, and J. Hu, "Multilevel fast multipole algorithm for solving scattering from 3-D electrically large object," Chinese Journal of Radio Science, vol. 19, no. 5, pp. 509-524, 2004.
[7] E. Michielssen and A. Boag, "A multilevel matrix decomposition algorithm for analyzing scattering from large structures," IEEE Transactions on Antennas and Propagation, vol. 44, no. 8, pp. 1086-1093, 1996.
[8] J. M. Rius, J. Parron, E. Ubeda, and J. R. Mosig, "Multilevel matrix decomposition algorithm for analysis of electrically large electromagnetic problems in 3-D," Microwave and Optical Technology Letters, vol. 22, no. 3, pp. 177-182, 1999.
[9] H. Guo, J. Hu, and E. Michielssen, "On MLMDA/butterfly compressibility of inverse integral operators," IEEE Antennas and Wireless Propagation Letters, vol. 12, pp. 31-34, 2013.
[10] E. Bleszynski, M. Bleszynski, and T. Jaroszewicz, "AIM: adaptive integral method for solving large-scale electromagnetic scattering and radiation problems," Radio Science, vol. 31, no. 5, pp. 1225-1251, 1996.
[11] W. Hackbusch and B. Khoromskij, "A sparse matrix arithmetic based on H-matrices. Part I: Introduction to H-matrices," Computing, vol. 62, pp. 89-108, 1999.
[12] S. Borm, L. Grasedyck, and W. Hackbusch, Hierarchical Matrices, Lecture Note 21, Max Planck Institute for Mathematics in the Sciences, 2003.
[13] W. Chai and D. Jiao, "An H2-matrix-based integral-equation solver of reduced complexity and controlled accuracy for solving electrodynamic problems," IEEE Transactions on Antennas and Propagation, vol. 57, no. 10, pp. 3147-3159, 2009.
[14] W. Chai and D. Jiao, "H- and H2-matrix-based fast integral-equation solvers for large-scale electromagnetic analysis," IET Microwaves, Antennas & Propagation, vol. 4, no. 10, pp. 1583-1596, 2010.
[15] H. Guo, J. Hu, H. Shao, and Z. Nie, "Hierarchical matrices method and its application in electromagnetic integral equations," International Journal of Antennas and Propagation, vol. 2012, Article ID 756259, 9 pages, 2012.
[16] A. Heldring, J. M. Rius, J. M. Tamayo, and J. Parron, "Compressed block-decomposition algorithm for fast capacitance extraction," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 27, no. 2, pp. 265-271, 2008.
[17] A. Heldring, J. M. Rius, J. M. Tamayo, J. Parron, and E. Ubeda, "Multiscale compressed block decomposition for fast direct solution of method of moments linear system," IEEE Transactions on Antennas and Propagation, vol. 59, no. 2, pp. 526-536, 2011.
[18] K. Zhao, M. N. Vouvakis, and J.-F. Lee, "The adaptive cross approximation algorithm for accelerated method of moments computations of EMC problems," IEEE Transactions on Electromagnetic Compatibility, vol. 47, no. 4, pp. 763-773, 2005.
[19] J. Shaeffer, "Direct solve of electrically large integral equations for problem sizes to 1 M unknowns," IEEE Transactions on Antennas and Propagation, vol. 56, no. 8, pp. 2306-2313, 2008.
[20] J. Song, "Multilevel fast multipole algorithm for electromagnetic scattering by large complex objects," IEEE Transactions on Antennas and Propagation, vol. 45, no. 10, pp. 1488-1493, 1997.
[21] M. Benzi, "Preconditioning techniques for large linear systems: a survey," Journal of Computational Physics, vol. 182, no. 2, pp. 418-477, 2002.
[22] J. Lee, J. Zhang, and C.-C. Lu, "Incomplete LU preconditioning for large scale dense complex linear systems from electromagnetic wave scattering problems," Journal of Computational Physics, vol. 185, no. 1, pp. 158-175, 2003.
[23] J. Lee, J. Zhang, and C.-C. Lu, "Sparse inverse preconditioning of multilevel fast multipole algorithm for hybrid integral equations in electromagnetics," IEEE Transactions on Antennas and Propagation, vol. 52, no. 9, pp. 2277-2287, 2004.
[24] G. H. Golub and C. F. Van Loan, Matrix Computations, The Johns Hopkins University Press, Baltimore, Md, USA, 1996.
[25] R. J. Adams, Y. Xu, X. Xu, J.-S. Choi, S. D. Gedney, and F. X. Canning, "Modular fast direct electromagnetic analysis using local-global solution modes," IEEE Transactions on Antennas and Propagation, vol. 56, no. 8, pp. 2427-2441, 2008.
[26] Y. Zhang and T. K. Sarkar, Parallel Solution of Integral Equation-Based EM Problems in the Frequency Domain, John Wiley & Sons, Hoboken, NJ, USA, 2009.
[27] J.-Y. Peng, H.-X. Zhou, K.-L. Zheng, et al., "A shared memory-based parallel out-of-core LU solver for matrix equations with application in EM problems," in Proceedings of the 1st International Conference on Computational Problem-Solving (ICCP '10), pp. 149-152, December 2010.
[28] M. Bebendorf, "Hierarchical LU decomposition-based preconditioners for BEM," Computing, vol. 74, no. 3, pp. 225-247, 2005.
[29] L. Dagum and R. Menon, "OpenMP: an industry standard API for shared-memory programming," IEEE Computational Science & Engineering, vol. 5, no. 1, pp. 46-55, 1998.
[30] W. Gropp, E. Lusk, and A. Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface, vol. 1, The MIT Press, 1999.
[31] S. M. Rao, D. R. Wilton, and A. W. Glisson, "Electromagnetic scattering by surfaces of arbitrary shape," IEEE Transactions on Antennas and Propagation, vol. 30, no. 3, pp. 409-418, 1982.
[32] A. W. Glisson and D. R. Wilton, "Simple and efficient numerical methods for problems of electromagnetic radiation and scattering from surfaces," IEEE Transactions on Antennas and Propagation, vol. 28, no. 5, pp. 593-603, 1980.
[33] R. Barrett, M. Berry, T. F. Chan, et al., Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, vol. 43, SIAM, Philadelphia, Pa, USA, 1994.
[34] E. Anderson, Z. Bai, C. Bischof, et al., LAPACK Users' Guide, vol. 9, SIAM, 1999.


Page 2: Research Article An MPI-OpenMP Hybrid Parallel -LU Direct ...downloads.hindawi.com/journals/ijap/2015/615743.pdf · this paper, we proposed an MPI-OpenMP hybrid parallel strategy

2 International Journal of Antennas and Propagation

By contrast direct solving like Gaussian elimination or LUfactorization [24] spends fixed number of operations whoseimpedancesystem is only related to the size of impedancesystemmatrix namely the number of unknowns119873 to obtainthe accurate solution of MoM Although these standarddirect solvers are not favored over iterative solvers becauseof their costly computational complexity 119874(1198733) a numberof fast direct solvers (FDS) are introduced recently [14ndash1925] which inherit the merit of direct solving for avoidingslow-convergence issue and have significantly reduced therequired CPU time and memory

So far there are some existing literatures discussing theparallelization of direct solvers for electromagnetic integralequations [26 27] It is notable that the implementation ofparallelizing FDS is not trivial Since the algorithms of majorFDS share a similar recursive factorization procedure in amultilevel fashion their implementations are intrinsicallysequential in nature However they can be parallelized withgood scalability by elaborate dissection and distribution Inthis paper we proposed an MPI-OpenMP hybrid parallelstrategy to implement the H-matrices based direct solverwith hierarchical LU (H-LU) factorization [15 28] Thisparallel direct solver is designed for running on a computercluster with multiple computing nodes with multicore pro-cessors for each node OpenMP shared memory program-ming [29] is utilized for the parallelization on multicoreprocessors for each node while MPI message passing [30]is employed for the distribution computing among all thenodes The proposed parallel FDS shows good scalabilityand parallelization efficiency and numerical experimentsdemonstrate that it can solve electrical-large IE problemsinvolving nearly 4 million unknowns within dozens of hours

The rest of this paper is organized as follows Section 2reviews the discretized IE we aim to solve and basic MoMformulas Section 3 outlines the construction of H-matricesbased IE system including spatial grouping of basis functionsand the partition of the impedance matrix as well as thedirect solving procedure of H-LU factorization Section 4elaborates the parallelizing strategy for the constructionand LU factorization of H-matrices based IE system withthe discussion of its theoretical parallelization efficiencySection 5 gives some results of numerical tests to show theefficiency capacity and accuracy of our solver Finally thesummary and conclusions are given in Section 6

2 Electromagnetic Integral Equation andMethod of Moments

In this section we give a brief review of surface IEs andtheir discretized forms [3] that our parallel FDS deals withIn the case of 3D electromagnetic field problem electric fieldintegral equation (EFIE) is written as

minust sdot 119894119896120578 int1199041015840

1198891199041015840J (r1015840) sdot (I minus nablanabla

1015840

1198962)119892 (r r1015840) = t sdot Einc

(r) (1)

in which 119896 is the wavenumber 120578 is the wave impedance t isthe unit tangential component on the surface of the object

and119892(r r1015840) is the free spaceGreen functionEinc(r) is the inci-

dent electric field and J(r1015840) is the excited surface current to bedetermined Correspondingly the 3D magnetic field integralequation (MFIE) is written as

minusJ (r)2

+ n times PV int119904

1198891199041015840J (r1015840) times nabla1015840119892 (r r1015840) = minusn timesHinc

(r) (2)

in which n is the normal component on the surface ofthe object Hinc

(r) is the incident magnetic field and PVstands for the Cauchy principal value integration For timeharmonic electromagnetic plane wave Einc andHinc have therelationship

Hinc=1

120578k times Einc

(3)

where k is the unit vector of wave propagating direction Inorder to cancel the internal resonance that occurred in thesolution of individual EFIE or MFIE the two equations areusually linearly combined Specifically we have the combinedfield integral equation (CFIE) written as

CFIE = 120572 sdot EFIE + (1 minus 120572) 120578 sdotMFIE (4)

To solve this IE numerically first we need to mesh thesurfaces of the object with basic patches For electromagneticIEs triangular and quadrilateral patches are two types offrequently used patches which are associated with Rao-Wilton-Gliso (RWG) basis function [31] and roof-top basisfunction [32] respectively By using these basis functions theundetermined J(r) in (1) and (2) can be discretized by thecombination of basis functions f

119894(r) (119894 = 1 2 119873) as

J (r) =119873

sum

119894=1

119868119894f119894(r) (5)

in which 119868119894is the unknown coefficients and 119873 is the total

number of unknowns By using Galerkin test we can formthe119873 times119873 linear systems for (1) and (2) as

ZEFIEsdot I = VEFIE

ZMFIEsdot I = VMFIE

(6)

in which the system matrices ZEFIE and ZMFIE are also calledimpedance matrices in IEs context The entries of the sys-tem matrices and right-hand side vectors can be calculatednumerically through the following formulas

ZEFIE119898119899

= minus119894119896120578int119904

int1199041015840

119892 (r r1015840) [f119898(r) sdot f119899(r1015840)] 1198891199041015840119889119904

+119894120578

119896int119904

int1199041015840

119892 (r r1015840) [nabla119904sdot f119898(r) nabla1199041015840 sdot f119899(r1015840)] 1198891199041015840119889119904

International Journal of Antennas and Propagation 3

ZMFIE119898119899

=1

2int119904

119889119904f119898(r) sdot f119899(r)

+ n sdot int119904

119889119904f119898(r) times int

1199041015840

1198891199041015840f119899(r1015840) times nabla1015840119892 (r r1015840)

VEFIE119898

= int119904

119889119904f119898(r) sdot Einc

(r)

VMFIE119898

= int119904

119889119904f119898(r) sdot n timesHinc

(r) (7)

Consequently the discretized CFIE system matrix and exci-tation vector are

ZCFIE119898119899

= 120572ZEFIE119898119899

+ (1 minus 120572) 120578 sdot ZMFIE119898119899

VCFIE119898

= 120572VEFIE119898

+ (1 minus 120572) 120578 sdot VMFIE119898

(8)

By solving the linear system

ZCFIEsdot I = VCFIE

(9)

we can get the excited surface current J and correspondingradiationscattering field

3 Fast Direct Solver andHierarchical LU Factorization

Generally we can solve the linear system (9) iteratively ordirectly For iterative solvers the Krylov subspace methodsare frequently used such as CG BiCG GMRES and QMR[33] Because iterative methods often suffer from slow-conv-ergence issues for complex applications direct methods havebecome popular recently Several researchers proposed avariety of FDS [14ndash19 25] which aim at avoiding convergenceproblems for iterative solvers and utilize similar hierarchicalinversefactorization procedure and low rank compressiontechniques to reduce the computational and storage complex-ity

H-matrices method [11, 12, 15] is a widely known mathematical framework for accelerating both the finite element method (FEM) and the IE method. In the scenario of electromagnetic IEs, the H-matrices method is regarded as the algebraic counterpart of MLFMM [4-6]. The main advantage of the H-matrices technique over MLFMM is that it can directly solve the IE system by utilizing its purely algebraic property. In this section, we briefly review the primary procedures of the IE-based H-matrices method and the algorithm of hierarchical LU factorization, which will be parallelized.

In order to implement the H-matrices method, first the impedance matrix Z^CFIE needs to be formed blockwise and compressed by a low-rank decomposition scheme. We subdivide the surface of the object hierarchically and, according to their interaction types, form the impedance/forward system matrix in a multilevel pattern. This procedure is very similar to the setting-up step in MLFMM. In contrast to the octree structure of MLFMM, the structure in H-matrices is a binary tree. The subdivision is stopped when the dimension of the smallest subregions covers a few wavelengths. The subblocks in the impedance matrix representing the far-field/weak interactions are compressed by low-rank (LR) decomposition schemes. Here we use the adaptive cross-approximation (ACA) algorithm [18, 19] as our LR decomposition scheme, which only requires partial information of the block to be compressed and is thus very efficient.
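To make the compression step concrete, here is a minimal sketch of a partially pivoted ACA routine in the spirit of [18, 19]. It is illustrative rather than the authors' implementation: the on-demand entry generator `entry(i, j)`, the `LowRank` container, and the simplified Frobenius-norm stopping rule are assumptions made for this example.

```cpp
// Minimal partially pivoted ACA sketch (illustrative, not the authors' code).
// Builds a rank-k approximation  B ~ U * V^T  of an m-by-n block whose entries
// are produced on demand by `entry(i, j)`, with a tolerance-based stop.
#include <cmath>
#include <complex>
#include <cstddef>
#include <functional>
#include <vector>

using cplx = std::complex<double>;

struct LowRank {
    std::size_t m = 0, n = 0, rank = 0;
    std::vector<cplx> U;  // m x rank, column-major
    std::vector<cplx> V;  // n x rank, column-major; block ~ U * V^T
};

LowRank aca(std::size_t m, std::size_t n,
            const std::function<cplx(std::size_t, std::size_t)>& entry,
            double tol, std::size_t max_rank)
{
    LowRank lr;
    lr.m = m; lr.n = n;
    std::vector<char> used_row(m, 0);
    double norm2 = 0.0;   // running estimate of ||U V^T||_F^2 (cross terms neglected)
    std::size_t i = 0;    // current pivot row
    for (std::size_t k = 0; k < max_rank; ++k) {
        // Residual row i:  R(i, :) = B(i, :) - [U V^T](i, :)
        std::vector<cplx> row(n);
        for (std::size_t j = 0; j < n; ++j) {
            cplx s = entry(i, j);
            for (std::size_t l = 0; l < lr.rank; ++l)
                s -= lr.U[l * m + i] * lr.V[l * n + j];
            row[j] = s;
        }
        used_row[i] = 1;
        // Column pivot: largest |R(i, j)|
        std::size_t jp = 0;
        double best = 0.0;
        for (std::size_t j = 0; j < n; ++j)
            if (std::abs(row[j]) > best) { best = std::abs(row[j]); jp = j; }
        if (best < 1e-300) break;     // residual row numerically zero
        // Residual column jp, scaled by the pivot value
        std::vector<cplx> col(m);
        for (std::size_t ii = 0; ii < m; ++ii) {
            cplx s = entry(ii, jp);
            for (std::size_t l = 0; l < lr.rank; ++l)
                s -= lr.U[l * m + ii] * lr.V[l * n + jp];
            col[ii] = s / row[jp];
        }
        // Append the new rank-1 term: U <- [U col], V <- [V row]
        lr.U.insert(lr.U.end(), col.begin(), col.end());
        lr.V.insert(lr.V.end(), row.begin(), row.end());
        ++lr.rank;
        // Stopping criterion: ||col|| * ||row|| <= tol * ||approximation||_F
        double u2 = 0.0, v2 = 0.0;
        for (const auto& x : col) u2 += std::norm(x);
        for (const auto& x : row) v2 += std::norm(x);
        norm2 += u2 * v2;
        if (std::sqrt(u2 * v2) <= tol * std::sqrt(norm2)) break;
        // Next pivot row: largest residual entry of the new column among unused rows
        double cbest = -1.0;
        for (std::size_t ii = 0; ii < m; ++ii)
            if (!used_row[ii] && std::abs(col[ii]) > cbest) { cbest = std::abs(col[ii]); i = ii; }
        if (cbest < 0.0) break;       // every row has already served as a pivot
    }
    return lr;
}
```

In an H-matrices solver, SVD or QR based recompression of such LR factors is commonly applied afterwards (compare the LR block arithmetic mentioned in Section 4.2).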

After the H-matrices based impedance matrix is generated, the hierarchical LU (H-LU) factorization [12, 26, 28] is employed on it, and finally we get an LU decomposed system matrix LU(Z^CFIE). During and after the factorization process, subblocks in LR format are kept all the time. Having LU(Z^CFIE), we can easily solve the IE with given excitations by a recursive backward substitution procedure. Figure 1 illustrates the overall procedure of the H-matrices based FDS introduced above.

From Z^CFIE to LU(Z^CFIE), the hierarchical LU factorization is employed. The factorization algorithm is elaborated below. Consider the block LU factorization of the level-1 partitioned Z^CFIE matrix, namely,

\[ \mathbf{Z}^{\mathrm{CFIE}} = \begin{bmatrix} \mathbf{Z}_{11} & \mathbf{Z}_{12} \\ \mathbf{Z}_{21} & \mathbf{Z}_{22} \end{bmatrix} = \begin{bmatrix} \mathbf{L}_{11} & \\ \mathbf{L}_{21} & \mathbf{L}_{22} \end{bmatrix} \begin{bmatrix} \mathbf{U}_{11} & \mathbf{U}_{12} \\ & \mathbf{U}_{22} \end{bmatrix}. \tag{10} \]

We carry out the following procedures in a recursive way: (i) get L11 and U11 by the LU factorization Z11 = L11 U11; (ii) get U12 by solving the lower triangular system Z12 = L11 U12; (iii) get L21 by solving the upper triangular system Z21 = L21 U11; (iv) update Z22: Z22 = Z22 - L21 U12; and (v) get L22 and U22 by the LU factorization Z22 = L22 U22. Recursion participates in all of procedures (i)-(v). Procedures (ii) and (iii) (and similar operations under the hierarchical execution of (i) and (v)) are performed via partitioned forward/backward substitution for hierarchical lower/upper triangular matrices [12, 28]. Procedure (iv) is a vital operation that contributes most of the computing time cost. This subblock addition-multiplication procedure occurs not only in one or several certain levels but pervades all of the recursive executions. Figure 1 illustrates the overall procedure of the H-matrices based FDS. Implementation details about H-matrices compression and recursive LU factorization can be found in [12, 14, 15]. In addition, a similar recursive direct solving process for a noncompressed IE system matrix is discussed in [26].
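The recursion pattern of procedures (i)-(v) can be seen in the following structural sketch, which applies the same five steps to an ordinary dense matrix (row-major storage, no pivoting). This toy version only illustrates the recursion; the actual H-LU operates on hierarchical subblocks kept in low-rank or full form.

```cpp
// Structural sketch of the recursive block LU in steps (i)-(v), shown on a
// plain dense matrix stored in place (row-major, leading dimension lda).
#include <cstddef>

using Mat = double*;  // pointer to the top-left entry of a (sub)matrix

// In-place LU without pivoting on an n-by-n block (base case).
static void lu_base(Mat a, std::size_t n, std::size_t lda) {
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t i = k + 1; i < n; ++i) {
            a[i * lda + k] /= a[k * lda + k];
            for (std::size_t j = k + 1; j < n; ++j)
                a[i * lda + j] -= a[i * lda + k] * a[k * lda + j];
        }
}

// Solve L * X = A in place (L unit lower triangular, n-by-n; A is n-by-m).
static void trsm_lower(const Mat l, Mat a, std::size_t n, std::size_t m, std::size_t lda) {
    for (std::size_t j = 0; j < m; ++j)
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t k = 0; k < i; ++k)
                a[i * lda + j] -= l[i * lda + k] * a[k * lda + j];
}

// Solve X * U = A in place (U upper triangular, m-by-m; A is n-by-m).
static void trsm_upper(const Mat u, Mat a, std::size_t n, std::size_t m, std::size_t lda) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < m; ++j) {
            for (std::size_t k = 0; k < j; ++k)
                a[i * lda + j] -= a[i * lda + k] * u[k * lda + j];
            a[i * lda + j] /= u[j * lda + j];
        }
}

// A22 -= A21 * A12  (A21 is n2-by-n1, A12 is n1-by-n2, A22 is n2-by-n2).
static void update(const Mat a21, const Mat a12, Mat a22,
                   std::size_t n1, std::size_t n2, std::size_t lda) {
    for (std::size_t i = 0; i < n2; ++i)
        for (std::size_t j = 0; j < n2; ++j)
            for (std::size_t k = 0; k < n1; ++k)
                a22[i * lda + j] -= a21[i * lda + k] * a12[k * lda + j];
}

// Recursive block LU mirroring procedures (i)-(v) of the text.
void block_lu(Mat a, std::size_t n, std::size_t lda, std::size_t leaf = 64) {
    if (n <= leaf) { lu_base(a, n, lda); return; }
    std::size_t n1 = n / 2, n2 = n - n1;
    Mat a11 = a, a12 = a + n1, a21 = a + n1 * lda, a22 = a + n1 * lda + n1;
    block_lu(a11, n1, lda, leaf);            // (i)   Z11 = L11 U11
    trsm_lower(a11, a12, n1, n2, lda);       // (ii)  U12 from L11 U12 = Z12
    trsm_upper(a11, a21, n2, n1, lda);       // (iii) L21 from L21 U11 = Z21
    update(a21, a12, a22, n1, n2, lda);      // (iv)  Z22 -= L21 U12
    block_lu(a22, n2, lda, leaf);            // (v)   Z22 = L22 U22
}
```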

Based on the result of the H-LU factorization, we can easily get the solution for any given excitation, namely, right-hand side (RHS) vector, by hierarchical backward substitution. The CPU time cost for backward substitution is trivial compared to that of the H-LU factorization.

4. Parallelization of H-Matrices Based IE Fast Direct Solver

The parallelization of the H-matrices based IE FDS contains two parts: (a) the parallelization of the process of generating the forward/impedance matrix in H-matrices form; (b) the parallelization of the H-LU factorization and backward substitution. In this section we will elaborate the approaches for implementing these two parts separately.

Figure 1: Overall procedure of the H-matrices based FDS. Procedure I: form Z^CFIE from the multilevel partition of the object according to the geometrical interactions, with far-field blocks compressed by the LRD scheme (ACA) into low-rank format B = UV. Procedure II: decompose Z^CFIE into LU(Z^CFIE).

Figure 2: Data distribution strategy for the forward/impedance matrix and also for the LU factorized matrix (row-blocks Row 01, Row 02, ..., Row 2^K stored by processes Proc 01, Proc 02, ..., Proc 2^K).

4.1. Parallelization of Generating the Forward/Impedance Matrix. Since our FDS is executed on a computer cluster, the computing data must be stored distributively in an appropriate way, especially for massive computing tasks. The main data of the H-matrices based IE FDS are the LR compressed forward/impedance and LU factorized matrices. In principle, for the forward/impedance matrix, we divide it into multiple rows and each computing node stores one of these row-blocks. This memory distribution strategy is shown in Figure 2. The width of each row is not arbitrary but corresponds to the hierarchical partition of the matrix at a certain level K. Therefore, no LR formatted block will be split by the division lines that cross the entire row. In practice, the width of each row is approximately equal and the total number of rows is a power of 2, namely, 2^K; K is an integer which is less than the depth of the binary tree structure of H-matrices, which is denoted by L. Each row-block is computed and filled locally by the MPI node it is assigned to. Since every subblock is an independent unit, the filling process for these row-blocks can be parallelized across all MPI nodes without any communication or memory overlap. Furthermore, for each MPI node with a multicore processor, the filling process of its corresponding blocks can be accelerated by utilizing OpenMP shared-memory parallel computing. In the case of MoM, OpenMP is very efficient for parallelizing the calculation of impedance elements, since this kind of calculation only depends on the invariant geometrical information and electromagnetic parameters, which are universally shared among all nodes.
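As an illustration of this distribution and fill pattern, the sketch below shows one possible way for a rank to select its row-block from a precomputed level-K partition and to fill its subblocks with OpenMP threads. The `row_offset` array, the `SubBlock` type, and `fill_or_compress` are hypothetical names introduced only for this example.

```cpp
// Illustrative sketch of the Section 4.1 fill pattern (hypothetical names, not
// the authors' code).  The level-K cluster partition is assumed available as
// `row_offset` (2^K + 1 entries); rank p owns row-block p and fills its
// subblocks, with OpenMP threads handling independent subblocks concurrently.
// Assumes MPI_Init has already been called.
#include <mpi.h>
#include <omp.h>
#include <cstddef>
#include <vector>

struct SubBlock {                      // one near-field (full) or far-field (LR) block
    std::size_t row0 = 0, col0 = 0, rows = 0, cols = 0;
    // ... full or low-rank storage would live here ...
};

// Placeholder: the real routine would evaluate the MoM integrals for a full
// block or run ACA for an admissible (far-field) block.
inline void fill_or_compress(SubBlock&) {}

void fill_my_row_block(const std::vector<std::size_t>& row_offset,  // size 2^K + 1
                       std::vector<SubBlock>& all_blocks)
{
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    const std::size_t r0 = row_offset[rank];       // first row owned by this rank
    const std::size_t r1 = row_offset[rank + 1];   // one past the last owned row

    // Collect the subblocks belonging to this rank's row-block.
    std::vector<SubBlock*> mine;
    for (auto& b : all_blocks)
        if (b.row0 >= r0 && b.row0 < r1) mine.push_back(&b);

    // Every subblock is an independent unit: no communication, no shared writes.
    // Dynamic scheduling helps because full and LR blocks have very different costs.
    #pragma omp parallel for schedule(dynamic)
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(mine.size()); ++i)
        fill_or_compress(*mine[i]);    // reads only replicated geometry/EM data
}
```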

4.2. Parallelization of H-LU Factorization. The H-LU factorization is executed immediately after the forward/impedance matrix is generated. Compared with the forward/impedance matrix filling process, the parallelization of the H-LU factorization is not straightforward, because its recursive procedure is sequential in nature. In order to put the H-LU factorization into a distributed parallel framework, here we use an incremental strategy to parallelize this process gradually.

During the H-LU factorization, the algorithm is always working/computing on a certain block, which is in LR form, in full form, or an aggregation of them, so we can set a criterion based on the size of the block for determining whether the current procedure should be parallelized. Specifically, assuming that the total number of unknowns is N and we have about 2^K nodes for computing, then if the maximal dimension of the block being processed is larger than N/2^K, MPI parallel computing with multiple nodes is employed; otherwise the process will be dealt with by a single node. It should be noted that OpenMP shared-memory parallelism is always employed for every node with a multicore processor participating in the computing. In the H-LU factorization process, OpenMP is in charge of the intranode parallelization of standard LU factorization, triangular system solving for full subblocks, SVD or QR decomposition for LR subblock addition and multiplication, and so forth, which can be implemented through a highly optimized computing package like LAPACK [34].
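The dispatch criterion just described can be expressed compactly; the helper below is a minimal, illustrative formulation of the rule (the function name and signature are ours, not from the solver).

```cpp
// The dispatch rule of Section 4.2 in code form (illustrative helper): a block
// is processed by multiple MPI nodes only if its largest dimension exceeds
// N / 2^K; otherwise a single node (with OpenMP inside) handles it.
#include <algorithm>
#include <cstddef>

bool use_mpi_parallel(std::size_t block_rows, std::size_t block_cols,
                      std::size_t total_unknowns_N, unsigned K)
{
    const std::size_t threshold = total_unknowns_N >> K;   // N / 2^K
    return std::max(block_rows, block_cols) > threshold;
}
```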

According to the recursive procedure of the H-LU factorization, at the beginning the algorithm will descend into the finest level of the H-matrices structure, namely, level L, to do the LU factorization for the top-left-most block with a dimension of approximately (N/2^L) × (N/2^L). Since L ≥ K, this process is handled by one node. After that, the algorithm will recur to level L − 1 to do the LU factorization for the extended top-left block of approximately (N/2^(L−1)) × (N/2^(L−1)) upon the previous factorization. If L − 1 ≥ K remains true, this process is still handled by one node. This recursion is kept on until it returns to a level l with l < K, for which MPI parallelism with multiple nodes will be employed.

For the sake of convenience, we use H-LU(l) to denote the factorization of a certain block on level l of the H-matrices structure. Recall the H-LU factorization procedure in Section 3. Procedure (i) is H-LU itself, so its parallelization is realized by the other procedures on the finer levels. For procedure (ii) and procedure (iii), we solve the triangular linear systems

\[ \mathbf{L}\mathbf{X} = \mathbf{A}, \tag{11} \]

\[ \mathbf{Y}\mathbf{U} = \mathbf{B}, \tag{12} \]

in which L, U, A, and B are the known matrices, L and U are lower and upper triangular matrices, respectively, and X and Y need to be determined. These two systems can actually be solved recursively through finer triangular systems. For instance, by rewriting (11) as

\[ \begin{bmatrix} \mathbf{L}_{11} & \\ \mathbf{L}_{21} & \mathbf{L}_{22} \end{bmatrix} \begin{bmatrix} \mathbf{X}_{11} & \mathbf{X}_{12} \\ \mathbf{X}_{21} & \mathbf{X}_{22} \end{bmatrix} = \begin{bmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{bmatrix}, \tag{13} \]

we carry out the following procedures: (i) get X11 and X12 by solving the lower triangular systems A11 = L11 X11 and A12 = L11 X12, respectively; (ii) get X21 and X22 by solving the lower triangular systems A21 − L21 X11 = L22 X21 and A22 − L21 X12 = L22 X22, respectively. From the above procedures we can clearly see that getting the columns [X11 X21]^T and [X12 X22]^T represents two independent processes, which can be parallelized. Similarly, by refining Y as

\[ \begin{bmatrix} \mathbf{Y}_{11} & \mathbf{Y}_{12} \\ \mathbf{Y}_{21} & \mathbf{Y}_{22} \end{bmatrix}, \tag{14} \]

getting the rows [Y11 Y12] and [Y21 Y22] represents two independent processes. Therefore, if the size of X or Y, namely, U12 or L21 in the H-LU context, is larger than N/2^K, their solving processes will be dealt with by multiple nodes.
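The independence of the two column groups in (13) can be exploited directly. The sketch below is a dense, unit-lower-triangular toy version that uses two OpenMP sections in place of the two MPI nodes described above; it only illustrates the splitting pattern.

```cpp
// Parallel pattern of (13): solving L * X = A column-group by column-group.
// The groups [X11; X21] and [X12; X22] touch disjoint columns of A, so they
// can be computed independently.  Dense toy version for illustration.
#include <cstddef>
#include <vector>

// Forward substitution on columns [c0, c1) of A (row-major, n x m):
// A(:, c0:c1) <- L^{-1} A(:, c0:c1), with L unit lower triangular (n x n).
static void forward_cols(const std::vector<double>& L, std::vector<double>& A,
                         std::size_t n, std::size_t m, std::size_t c0, std::size_t c1)
{
    for (std::size_t j = c0; j < c1; ++j)
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t k = 0; k < i; ++k)
                A[i * m + j] -= L[i * n + k] * A[k * m + j];
}

void parallel_lower_solve(const std::vector<double>& L, std::vector<double>& A,
                          std::size_t n, std::size_t m)
{
    const std::size_t half = m / 2;
    #pragma omp parallel sections
    {
        #pragma omp section
        forward_cols(L, A, n, m, 0, half);   // columns of [X11; X21]
        #pragma omp section
        forward_cols(L, A, n, m, half, m);   // columns of [X12; X22]
    }
}
```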

After the solving processes of the triangular systems are done, the updating of Z22, namely, procedure (iv) of H-LU, takes place. We here use UPDATE(l) to denote any of these procedures on level l. Rewriting it as

\[ \mathbf{Z} = \mathbf{Z} - \mathbf{A}\mathbf{B}, \tag{15} \]

we see that this is a typical matrix addition and multiplication procedure, which can be parallelized through the addition and multiplication of subblocks when necessary:

\[ \begin{bmatrix} \mathbf{Z}_{11} & \mathbf{Z}_{12} \\ \mathbf{Z}_{21} & \mathbf{Z}_{22} \end{bmatrix} = \begin{bmatrix} \mathbf{Z}_{11} & \mathbf{Z}_{12} \\ \mathbf{Z}_{21} & \mathbf{Z}_{22} \end{bmatrix} - \begin{bmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{bmatrix} \begin{bmatrix} \mathbf{B}_{11} & \mathbf{B}_{12} \\ \mathbf{B}_{21} & \mathbf{B}_{22} \end{bmatrix} = \begin{bmatrix} \mathbf{Z}_{11} - \mathbf{A}_{11}\mathbf{B}_{11} - \mathbf{A}_{12}\mathbf{B}_{21} & \mathbf{Z}_{12} - \mathbf{A}_{11}\mathbf{B}_{12} - \mathbf{A}_{12}\mathbf{B}_{22} \\ \mathbf{Z}_{21} - \mathbf{A}_{21}\mathbf{B}_{11} - \mathbf{A}_{22}\mathbf{B}_{21} & \mathbf{Z}_{22} - \mathbf{A}_{21}\mathbf{B}_{12} - \mathbf{A}_{22}\mathbf{B}_{22} \end{bmatrix}. \tag{16} \]

In the final form of (16), the matrix computation for each of the four subblocks is an independent process and can be dealt with by different nodes simultaneously. When Z22 is updated, the LU factorization is applied to Z22 for getting L22 and U22. This procedure is identical to the H-LU factorization for Z11, so any parallel scheme employed for factorizing Z11 is also applicable for factorizing Z22.
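The four-way independence in (16) maps naturally onto concurrent workers. The following dense toy sketch updates the four quadrants of Z with OpenMP sections; in the solver described here, the four updates would instead be assigned to different MPI nodes and act on LR/full H-matrix subblocks.

```cpp
// Parallel pattern of (16): Z <- Z - A*B split into four quadrant updates that
// are mutually independent.  Dense row-major toy version; all matrices n x n.
#include <cstddef>
#include <vector>

// Z(ri:ri+h, rj:rj+w) -= A(ri:ri+h, :) * B(:, rj:rj+w).
static void quadrant_update(std::vector<double>& Z, const std::vector<double>& A,
                            const std::vector<double>& B, std::size_t n,
                            std::size_t ri, std::size_t rj, std::size_t h, std::size_t w)
{
    for (std::size_t i = ri; i < ri + h; ++i)
        for (std::size_t j = rj; j < rj + w; ++j) {
            double s = 0.0;
            for (std::size_t k = 0; k < n; ++k)
                s += A[i * n + k] * B[k * n + j];
            Z[i * n + j] -= s;
        }
}

void parallel_update(std::vector<double>& Z, const std::vector<double>& A,
                     const std::vector<double>& B, std::size_t n)
{
    const std::size_t h = n / 2;
    // The four quadrants of Z are disjoint, so the sections cannot conflict.
    #pragma omp parallel sections
    {
        #pragma omp section
        quadrant_update(Z, A, B, n, 0, 0, h, h);           // Z11 block
        #pragma omp section
        quadrant_update(Z, A, B, n, 0, h, h, n - h);       // Z12 block
        #pragma omp section
        quadrant_update(Z, A, B, n, h, 0, n - h, h);       // Z21 block
        #pragma omp section
        quadrant_update(Z, A, B, n, h, h, n - h, n - h);   // Z22 block
    }
}
```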

The overall strategy for parallelizing the H-LU factorization is illustrated in Figure 3. Before the LU factorization on level K, H-LU(K), is finished, only one node P1 takes the computing job. Afterwards, two nodes P1 and P2 are used to get U12 and L21 simultaneously for H-LU(K − 1). Then Z22 is updated and factorized for getting U22 and L22 by the process H-LU(K). For the triangular solving process on the K − 2 level, 4 nodes can be used concurrently to get U12 and L21, based on the parallelizable processes of solving the upper and lower triangular systems discussed above. Besides, updating Z22 can be parallelized by 4 nodes for the 4 square blocks enclosed by the blue line in Figure 3. This parallelization can be applied to coarser levels in general. As for the factorization H-LU(l) (l < K), 2^(K−l) nodes can be used for parallelizing the process of solving triangular systems and 4^(K−l) nodes can be used for parallelizing the procedure of updating Z22.

Figure 3: Parallel strategy/approach for H-LU factorization; P_i (i = 1, 2, 3, ...) denotes the ith node. The figure shows the sequence H-LU(K); Update(K) and H-LU(K); H-LU(K − 1); Update(K − 1) and H-LU(K − 1); H-LU(K − 2); Update(K − 2) and H-LU(K − 2); H-LU(K − 3); ..., recurring to upper levels, with nodes P1-P8 assigned to the corresponding blocks.

During and after the H-LU factorization, the system matrix is kept in the same distributed format as that of the forward/impedance matrix shown in Figure 2. From the perspective of data storage, those row-blocks are updated by the H-LU factorization gradually. It should be noted that, for the nodes participating in parallel computing, the blocks they are dealing with may not belong to the row-blocks they store. Therefore, necessary communication for block transfer occurs among multiple nodes during the factorization.

4.3. Theoretical Efficiency of MPI Parallelization. To analyze the parallelization efficiency under an ideal circumstance, we assume that the load balance is perfectly tuned and the computer cluster is equipped with a high-speed network; thus there is no latency for remote response and the internode communication time is negligible. Suppose there are P nodes and each node has a constant number of OpenMP threads. The parallelization efficiency is simply defined as

\[ \eta_P = \frac{T_1}{P\, T_P}, \tag{17} \]

in which T_i is the processing time with i nodes.

Apparently, for the process of generating the forward/impedance matrix, \(\eta_P\) could be 100% if the number of operations for filling each row-block under P-node parallelization, \(O^{\mathrm{rowfill}}_P\), is 1/P of the number of operations for filling the whole matrix by a single node, \(O^{\mathrm{fill}}_1\), since \(T_1 \propto O^{\mathrm{fill}}_1\) and \(T_P \propto O^{\mathrm{rowfill}}_P = O^{\mathrm{fill}}_1 / P\) hold if the CPU time for a single operation is invariant. This condition can be satisfied to a high degree when P is not too large. When P is large, dividing the system matrix into P row-blocks may force the depth of the H-matrices structure to be inadequate, which consequently increases the overall complexity of the H-matrices method. In other words,

\[ O^{\mathrm{rowfill}}_P > \frac{O^{\mathrm{fill}}_1}{P} \quad (\text{for large } P). \tag{18} \]

Under this situation the parallelization becomes inefficient. For the process of H-LU factorization, assuming that P = 2^K, the total number of operations under P-node parallelization, \(O^{\mathrm{LU}}_P\), is

\[ O^{\mathrm{LU}}_P \approx P \cdot O[\mathcal{H}\text{-LU}(K)] + P(P-1) \cdot O[\mathcal{H}\text{-TriSolve}(K)] + \frac{P^3}{6}\, O[\mathcal{H}\text{-AddProduct}(K)], \tag{19} \]

where H-LU(K), H-TriSolve(K), and H-AddProduct(K) denote the processes of LU factorization, triangular system solving, and matrix addition and multiplication in H-format on the Kth level of the H-matrices structure, respectively. Their computational complexity is O(n^α) (2 ≤ α < 3), in which n = N/P. However, because of parallelization, some operations are done simultaneously. Hence the nonoverlapping number of operations \(O^{T}_P\), which represents the actual wall time of processing, is

\[ O^{T}_P \approx P \cdot O[\mathcal{H}\text{-LU}(K)] + \frac{PK}{2} \cdot O[\mathcal{H}\text{-TriSolve}(K)] + \frac{P^2}{4}\, O[\mathcal{H}\text{-AddProduct}(K)]. \tag{20} \]

Additionally, the number of operations without parallelization for the H-LU factorization, \(O^{\mathrm{LU}}_1\), is definitely no larger than \(O^{\mathrm{LU}}_P\), because of the potential increase of the overall complexity caused by parallelization. Therefore, the parallelization efficiency is

\[ \eta_P = \frac{O^{\mathrm{LU}}_1}{P \cdot O^{T}_P} \le \frac{O^{\mathrm{LU}}_P}{P \cdot O^{T}_P}. \tag{21} \]

When P goes larger, the result is approximately

\[ \eta_P \le \frac{(P^3/6)\, O[\mathcal{H}\text{-AddProduct}(K)]}{P \cdot (P^2/4)\, O[\mathcal{H}\text{-AddProduct}(K)]} \approx \frac{2}{3}. \tag{22} \]

According to our analysis, we should say that, for a given size of IE problem, there is an optimal P for the best implementation. If P is too small, the solving process cannot be fully parallelized. On the other hand, a larger P will make the width of the row-blocks narrower, which eliminates big LR subblocks and causes the H-matrices structure to be flatter, consequently impairing the low-complexity feature of the H-matrices method.

Similarly, the number of OpenMP threads M has an optimal value too. Too small an M will restrict the OpenMP parallelization, and too large an M is also a waste, because the intranode parallel computing only deals with algebraic arithmetic for subblocks of relatively small size; excessive threads will not improve its acceleration and will lower the efficiency.

5. Numerical Results

The cluster we test our code on has 32 physical nodes in total; each node has 64 GBytes of memory and four quad-core Intel Xeon E5-2670 processors with a 2.60 GHz clock rate. For all tested IE cases, the discrete mesh size is 0.1λ, in which λ is the wavelength.

5.1. Parallelization Efficiency. First we test the parallelization scalability for solving the scattering problems of PEC spheres with different electric sizes. The radii of these spheres are R = 2.5λ, R = 5.0λ, and R = 10.0λ, respectively. The total numbers of unknowns for these three cases are 27234, 108726, and 435336, respectively. Figure 4 shows the efficiency for the total time solved by 1, 2, 4, 8, 16, 32, 64, and 128 MPI nodes with 4 OpenMP threads for each node. The definition of parallelization efficiency is the same as formula (17); thus in this test we only consider the efficiency of pure MPI parallelization. Unsurprisingly, for parallelization with a large number of nodes, bigger cases have higher efficiencies because of better load balance and less distortion to the H-matrices structure. Besides, since the H-LU factorization dominates the solving time consumption, the tested efficiencies of the bigger cases are close to the theoretical maximum efficiency \(\eta^{\mathrm{LU}}_P = 2/3\), which demonstrates the good parallelizing quality of our code.

Next we investigate the hybrid parallelization efficiency for different MPI nodes (P) / OpenMP threads (M) proportions. We employ all 512 cores in the cluster to solve the 3 PEC sphere cases above in four scenarios, from pure MPI parallelization (1 OpenMP thread for each MPI node, large P/M) to heavy OpenMP parallelization embedded in MPI (16 OpenMP threads for each MPI node, small P/M). Here we slightly change the definition of parallelization efficiency as

\[ \eta_{(P,M)} = \frac{T_{(1,1)}}{P \cdot M \cdot T_{(P,M)}}, \tag{23} \]

in which \(T_{(i,j)}\) denotes the total solving time parallelized by i MPI nodes with j OpenMP threads for each node. The results are presented in Figure 5. We can clearly see that both pure MPI parallelization and heavy OpenMP hybrid parallelization have poorer efficiency compared to that of moderate OpenMP hybrid parallelization.

Ideally, a larger P/M should give better efficiency for bigger cases. But in reality, since a larger P/M can result in more frequent MPI communication, more time is required for communication, synchronization, and so forth. Besides, a large P also impairs the low-complexity feature of the H-matrices method regardless of the size of the problem. Therefore, pure MPI parallelization does not show superiority over hybrid parallelization in our test. Referring to the optimal number of OpenMP threads per node, it is mainly determined by the size of the smallest blocks in the H-matrices, which is preset, since the OpenMP efficiency strongly relates to the parallelization of full or LR block addition and multiplication at the low levels. For all cases of this test, the size of the smallest blocks is between 50 and 200, so their optimal number of OpenMP threads per node turns out to be 4. However, we can expect the optimal number of OpenMP threads to be larger (smaller) if the size of the smallest blocks is larger (smaller). For the case of solving the PEC sphere of R = 10.0λ with 128 nodes and 4 threads per node, the parallelization efficiency in this test is 0.737, which is higher than that of 0.562 in the previous test. This is because, by our new definition of efficiency, the OpenMP parallelization is taken into account. These results also indicate the superiority of MPI-OpenMP hybrid parallelization over pure MPI parallelization.

Figure 4: Parallelization efficiency for solving scattering problems of PEC spheres of different electric sizes (R = 2.5λ, 5.0λ, 10.0λ); efficiency η versus the number of processors (1-128).

Figure 5: Parallelization efficiency of different MPI nodes/OpenMP threads allocations (512 nodes × 1 thread, 128 nodes × 4 threads, 64 nodes × 8 threads, 32 nodes × 16 threads per node); efficiency η versus N = 2^14-2^19.

Figure 6: Memory cost (MB) for different MPI nodes/OpenMP threads proportions versus N = 2^14-2^19, with O(N^1.5) and O(N^{4/3} log N) reference trends.

Figures 6 and 7 show the memory cost and solving time for different problem sizes under different MPI nodes/OpenMP threads proportions. The memory cost complexity generally fits the O(N^1.5) trend, which has also been demonstrated for sequential FDS [15, 17, 19]. Since the parallelization efficiency gradually rises along with the increase of problem size, as shown above, the solving time complexity is about O(N^1.84) instead of the O(N^2) exhibited in the sequential scenario [15, 17, 19].
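As a quick arithmetic illustration of what these fitted trends imply, doubling the problem size scales the costs roughly as

\[ \frac{\mathrm{Mem}(2N)}{\mathrm{Mem}(N)} \approx 2^{1.5} \approx 2.83, \qquad \frac{T_{\mathrm{solve}}(2N)}{T_{\mathrm{solve}}(N)} \approx 2^{1.84} \approx 3.58, \]

compared with \(2^{2} = 4\) for the sequential \(O(N^{2})\) trend.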

5.2. PEC Sphere. In this part we test the accuracy and efficiency of our proposed parallel FDS and compare its results to analytical solutions. A scattering problem of one PEC sphere with radius R = 30.0λ is solved. After discretization, the total number of unknowns of this case is 3918612. For hybrid parallelization, 128 MPI nodes are employed with 4 OpenMP threads for each node. The building time for the forward system matrix is 4318.7 seconds, the H-LU factorization time is 157230.2 seconds, and the solving time for each excitation (RHS) is only 3.66 seconds. The total memory cost is 987.1 GBytes. The numerical bistatic RCS results agree with the analytical Mie series very well, as shown in Figure 8.

Figure 7: Solving time (s) for different MPI nodes/OpenMP threads proportions versus N = 2^14-2^19, with O(N^1.84) and O(N^2) reference trends.

Figure 8: Bistatic RCS (dBsm) of the PEC sphere with R = 30.0λ (N = 3918612) solved by the proposed hybrid parallel H-LU direct solver, compared with the analytical Mie series (bistatic angle θ from 0 to 180 degrees).

5.3. Airplane Model. Finally, we test our solver with a more realistic case. The monostatic RCS of an airplane model with dimensions of 60.84λ × 35.00λ × 16.25λ is calculated. After discretization, the total number of unknowns is 3654894. The building time for the forward system matrix is 3548.5 seconds, the H-LU factorization time is 110453.4 seconds, and the peak memory cost is 724.3 GBytes. With the LU form of the system matrix, we directly calculate the monostatic RCS for 10000 vertical incident angles (RHS) by backward substitution. The average time cost for calculating each incident angle is 3.12 seconds. The RCS result is compared with that obtained by an FMM iterative solver for 720 incident angles. Although the FMM iterative solver only costs 112.4 GBytes of memory, the average iteration time of solving the backscattering for each angle is 93.87 seconds with the SAI preconditioner [23], which is not practical for solving thousands of RHS. From Figure 9 we can see that these two results agree with each other well.

Figure 9: Monostatic RCS (dBsm) of the airplane model (sweeping φ from 0 to 180 degrees) solved by the proposed hybrid parallel H-LU direct solver and by the FMM iterative solver.

6. Conclusion

In this paper we present an H-matrices based fast direct solver accelerated by both MPI multinode and OpenMP multithread parallel techniques. The macrostructural implementation of the H-matrices method and H-LU factorization is parallelized by MPI, while the microalgebraic computation for matrix blocks and vector segments is parallelized by OpenMP. Despite the sequential nature of the H-matrices direct solving procedure, the proposed hybrid parallel strategy shows good parallelization efficiency. Numerical results also demonstrate excellent accuracy and superiority in solving problems with massive excitations (RHS) for this parallel direct solver.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by the China Scholarship Council (CSC), the Fundamental Research Funds for the Central Universities of China (E022050205), the NSFC under Grant 61271033, the Programme of Introducing Talents of Discipline to Universities under Grant b07046, and the National Excellent Youth Foundation (NSFC no. 61425010).

References

[1] A. Taflove, Computational Electrodynamics: The Finite-Difference Time-Domain Method, Artech House, Norwood, Mass, USA, 1995.
[2] J.-M. Jin, The Finite Element Method in Electromagnetics, Wiley, New York, NY, USA, 2002.
[3] R. F. Harrington, Field Computation by Moment Methods, MacMillan, New York, NY, USA, 1968.
[4] R. Coifman, V. Rokhlin, and S. Wandzura, "The fast multipole method for the wave equation: a pedestrian prescription," IEEE Antennas and Propagation Magazine, vol. 35, no. 3, pp. 7-12, 1993.
[5] J. Song, C. C. Lu, and W. C. Chew, "Multilevel fast multipole algorithm for electromagnetic scattering by large complex objects," IEEE Transactions on Antennas and Propagation, vol. 45, no. 10, pp. 1488-1493, 1997.
[6] J. Hu, Z. P. Nie, J. Wang, G. X. Zou, and J. Hu, "Multilevel fast multipole algorithm for solving scattering from 3-D electrically large object," Chinese Journal of Radio Science, vol. 19, no. 5, pp. 509-524, 2004.
[7] E. Michielssen and A. Boag, "A multilevel matrix decomposition algorithm for analyzing scattering from large structures," IEEE Transactions on Antennas and Propagation, vol. 44, no. 8, pp. 1086-1093, 1996.
[8] J. M. Rius, J. Parron, E. Ubeda, and J. R. Mosig, "Multilevel matrix decomposition algorithm for analysis of electrically large electromagnetic problems in 3-D," Microwave and Optical Technology Letters, vol. 22, no. 3, pp. 177-182, 1999.
[9] H. Guo, J. Hu, and E. Michielssen, "On MLMDA/butterfly compressibility of inverse integral operators," IEEE Antennas and Wireless Propagation Letters, vol. 12, pp. 31-34, 2013.
[10] E. Bleszynski, M. Bleszynski, and T. Jaroszewicz, "AIM: adaptive integral method for solving large-scale electromagnetic scattering and radiation problems," Radio Science, vol. 31, no. 5, pp. 1225-1251, 1996.
[11] W. Hackbusch and B. Khoromskij, "A sparse matrix arithmetic based on H-matrices. Part I: Introduction to H-matrices," Computing, vol. 62, pp. 89-108, 1999.
[12] S. Borm, L. Grasedyck, and W. Hackbusch, Hierarchical Matrices, Lecture Note 21, Max Planck Institute for Mathematics in the Sciences, 2003.
[13] W. Chai and D. Jiao, "An H²-matrix-based integral-equation solver of reduced complexity and controlled accuracy for solving electrodynamic problems," IEEE Transactions on Antennas and Propagation, vol. 57, no. 10, pp. 3147-3159, 2009.
[14] W. Chai and D. Jiao, "H- and H²-matrix-based fast integral-equation solvers for large-scale electromagnetic analysis," IET Microwaves, Antennas & Propagation, vol. 4, no. 10, pp. 1583-1596, 2010.
[15] H. Guo, J. Hu, H. Shao, and Z. Nie, "Hierarchical matrices method and its application in electromagnetic integral equations," International Journal of Antennas and Propagation, vol. 2012, Article ID 756259, 9 pages, 2012.
[16] A. Heldring, J. M. Rius, J. M. Tamayo, and J. Parron, "Compressed block-decomposition algorithm for fast capacitance extraction," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 27, no. 2, pp. 265-271, 2008.
[17] A. Heldring, J. M. Rius, J. M. Tamayo, J. Parron, and E. Ubeda, "Multiscale compressed block decomposition for fast direct solution of method of moments linear system," IEEE Transactions on Antennas and Propagation, vol. 59, no. 2, pp. 526-536, 2011.
[18] K. Zhao, M. N. Vouvakis, and J.-F. Lee, "The adaptive cross approximation algorithm for accelerated method of moments computations of EMC problems," IEEE Transactions on Electromagnetic Compatibility, vol. 47, no. 4, pp. 763-773, 2005.
[19] J. Shaeffer, "Direct solve of electrically large integral equations for problem sizes to 1 M unknowns," IEEE Transactions on Antennas and Propagation, vol. 56, no. 8, pp. 2306-2313, 2008.
[20] J. Song, "Multilevel fast multipole algorithm for electromagnetic scattering by large complex objects," IEEE Transactions on Antennas and Propagation, vol. 45, no. 10, pp. 1488-1493, 1997.
[21] M. Benzi, "Preconditioning techniques for large linear systems: a survey," Journal of Computational Physics, vol. 182, no. 2, pp. 418-477, 2002.
[22] J. Lee, J. Zhang, and C.-C. Lu, "Incomplete LU preconditioning for large scale dense complex linear systems from electromagnetic wave scattering problems," Journal of Computational Physics, vol. 185, no. 1, pp. 158-175, 2003.
[23] J. Lee, J. Zhang, and C.-C. Lu, "Sparse inverse preconditioning of multilevel fast multipole algorithm for hybrid integral equations in electromagnetics," IEEE Transactions on Antennas and Propagation, vol. 52, no. 9, pp. 2277-2287, 2004.
[24] G. H. Golub and C. F. V. Loan, Matrix Computations, The Johns Hopkins University Press, Baltimore, Md, USA, 1996.
[25] R. J. Adams, Y. Xu, X. Xu, J.-S. Choi, S. D. Gedney, and F. X. Canning, "Modular fast direct electromagnetic analysis using local-global solution modes," IEEE Transactions on Antennas and Propagation, vol. 56, no. 8, pp. 2427-2441, 2008.
[26] Y. Zhang and T. K. Sarkar, Parallel Solution of Integral Equation-Based EM Problems in the Frequency Domain, John Wiley & Sons, Hoboken, NJ, USA, 2009.
[27] J.-Y. Peng, H.-X. Zhou, K.-L. Zheng, et al., "A shared memory-based parallel out-of-core LU solver for matrix equations with application in EM problems," in Proceedings of the 1st International Conference on Computational Problem-Solving (ICCP '10), pp. 149-152, December 2010.
[28] M. Bebendorf, "Hierarchical LU decomposition-based preconditioners for BEM," Computing, vol. 74, no. 3, pp. 225-247, 2005.
[29] L. Dagum and R. Menon, "OpenMP: an industry standard API for shared-memory programming," IEEE Computational Science & Engineering, vol. 5, no. 1, pp. 46-55, 1998.
[30] W. Gropp, E. Lusk, and A. Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface, vol. 1, The MIT Press, 1999.
[31] S. M. Rao, D. R. Wilton, and A. W. Glisson, "Electromagnetic scattering by surfaces of arbitrary shape," IEEE Transactions on Antennas and Propagation, vol. 30, no. 3, pp. 409-418, 1982.
[32] A. W. Glisson and D. R. Wilton, "Simple and efficient numerical methods for problems of electromagnetic radiation and scattering from surfaces," IEEE Transactions on Antennas and Propagation, vol. 28, no. 5, pp. 593-603, 1980.
[33] R. Barrett, M. Berry, T. F. Chan, et al., Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, vol. 43, SIAM, Philadelphia, Pa, USA, 1994.
[34] E. Anderson, Z. Bai, C. Bischof, et al., LAPACK Users' Guide, vol. 9, SIAM, 1999.

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

Page 3: Research Article An MPI-OpenMP Hybrid Parallel -LU Direct ...downloads.hindawi.com/journals/ijap/2015/615743.pdf · this paper, we proposed an MPI-OpenMP hybrid parallel strategy

International Journal of Antennas and Propagation 3

ZMFIE119898119899

=1

2int119904

119889119904f119898(r) sdot f119899(r)

+ n sdot int119904

119889119904f119898(r) times int

1199041015840

1198891199041015840f119899(r1015840) times nabla1015840119892 (r r1015840)

VEFIE119898

= int119904

119889119904f119898(r) sdot Einc

(r)

VMFIE119898

= int119904

119889119904f119898(r) sdot n timesHinc

(r) (7)

Consequently the discretized CFIE system matrix and exci-tation vector are

ZCFIE119898119899

= 120572ZEFIE119898119899

+ (1 minus 120572) 120578 sdot ZMFIE119898119899

VCFIE119898

= 120572VEFIE119898

+ (1 minus 120572) 120578 sdot VMFIE119898

(8)

By solving the linear system

ZCFIEsdot I = VCFIE

(9)

we can get the excited surface current J and correspondingradiationscattering field

3 Fast Direct Solver andHierarchical LU Factorization

Generally we can solve the linear system (9) iteratively ordirectly For iterative solvers the Krylov subspace methodsare frequently used such as CG BiCG GMRES and QMR[33] Because iterative methods often suffer from slow-conv-ergence issues for complex applications direct methods havebecome popular recently Several researchers proposed avariety of FDS [14ndash19 25] which aim at avoiding convergenceproblems for iterative solvers and utilize similar hierarchicalinversefactorization procedure and low rank compressiontechniques to reduce the computational and storage complex-ity

H-matrices method [11 12 15] is a widely known math-ematical framework for accelerating both finite elementmethod (FEM) and IE method In the scenario of electro-magnetic IEH-matrices method is regarded as the algebraiccounterpart of MLFMM [4ndash6] The most advantage of H-matrices technique over MLFMM is that it can directly solvethe IE system by utilizing its pure algebraic property In thissection we briefly review the primary procedures of IE-basedH-matrices method and the algorithm of hierarchical LUfactorization which will be parallelized

In order to implement H-matrices method firstly theimpedancematrixZCFIE needs to be formed block-wisely andcompressed by low rank decomposition scheme We subdi-vide the surface of the object hierarchically and according totheir interaction types form the impedanceforward systemmatrix in a multilevel pattern This procedure is very similarto the setting-up step in MLFMM By contrast to the octreestructure ofMLFMM the structure inH-matrices is a binarytree The subdivision is stopped when the dimension ofsmallest subregions covers a few wavelengths The subblocks

in the impedance matrix representing the far-fieldweak int-eractions are compressed by low rank (LR) decompositionschemes Here we use the adaptive cross-approximation(ACA) algorithm [18 19] as our LR decomposition schemewhich only requires partial information of the block to becompressed and is thus very efficient

After the H-matrices based impedance matrix is gener-ated the hierarchical LU (H-LU) factorization [12 26 28] isemployed on it and finally we get an LU decomposed systemmatrix LU (ZCFIE

) During and after the factorization processsubblocks in LR format are kept all the time Having theLU (ZCFIE

) we can easily solve the IE with given excitationsby a recursive backward substitution procedure Figure 1illustrates the overall procedure of theH-matrices based FDSintroduced above

From ZCFIE to LU (ZCFIE) the hierarchical LU factoriza-

tion is employed The factorization algorithm is elaboratedbelow Consider the block LU factorization of the level-1partitioned ZCFIE matrix namely

ZCFIE= [

Z11

Z12

Z21

Z22

] = [L11

L21

L22

] [U11

U12

U22

] (10)

We carry out the following procedures in a recursive way (i)get L11

and U11

by LU factorization Z11= L11U11 (ii) get

U12

by solving lower triangular system Z12

= L11U12 (iii)

get L21by solving upper triangular system Z

21= L21U11 (iv)

update Z22 Z22= Z22minus L21U12 and (v) get L

22and U

22by

LU factorization Z22= L22U22 Recursive implementations

are participated among all procedures (i)ndash(v) Procedures(ii) and (iii) (and similar operations under the hierarchicalexecution of (i) and (v)) are performed via partitionedforwardbackward substitution for hierarchical loweruppertriangularmatrices [12 28] Procedure (iv) is a vital operationthat contributes most of the computing time cost Thissubblocks addition-multiplication procedure occurs not onlyin one or several certain levels but pervades in all of therecursive executions Figure 1 illustrates the overall procedureofH-matrices based FDS Implementation details aboutH-matrices compression and recursive LU factorization can befound in [12 14 15] In addition a similar recursive directsolving process for noncompression IE system matrix isdiscussed in [26]

Based on the result of H-LU factorization we can easilyget the solution for any given excitation namely right-handside (RHS) vector by hierarchical backward substitutionTheCPU time cost for backward substitution is trivial comparedto that ofH-LU factorization

4 Parallelization of H-Matrices Based IE FastDirect Solver

The parallelization ofH-matrices based IE FDS contains twoparts (a) the parallelization (of the process) of generatingthe forwardimpedance matrix with H-matrices form (b)the parallelization of H-LU factorization and backwardsubstitution In this section we will elaborate the approachesfor implementing these two parts separately

4 International Journal of Antennas and Propagation

Object withmultilevel partition

Geometricalinteractions

Compressed byLRD scheme

(ACA)

Low rank format

100

100

200

200

300

300

400

400

500

500

600

600

100 200 300 400 500 600

100

200

300

400

500

600

LU( )

Procedure Iform

Procedure IIdecompose ZCFIE

ZCFIE

ZCFIE

ZCFIE B =UV

U

L

Figure 1 Overall procedure ofH-matrices based FDS

100 200 300 400 500 600

100

200

300

400

500

600

Row 01

Row 02

Row 03

Row 2K

Proc 01

Proc 02

Proc 03

Proc 2K

Figure 2 Data distribution strategy for forwardimpedance matrixand also for LU factorized matrix

41 Parallelization of Generating ForwardImpedance MatrixSince our FDS is executed on a computer cluster the com-puting data must be stored distributively in an appropriateway especially for massive computing tasks The main dataof H-matrices based IE FDS are the LR compressed for-wardimpedance and LU factorizedmatrices In principle forforwardimpedance matrix we divide it into multiple rowsand each computing node stores one of these row-blocksThis memory distribution strategy is shown in Figure 2 Thewidth of each row is not arbitrary but in correspondence tothe hierarchical partition of the matrix in certain level 119870Therefore no LR formatted block will be split by the divisionlines that cross the entire row In practice the width of eachrow is approximately equal and the total number of rows isthe power of 2 namely 2119870 119870 is an integer which is less thanthe depth of the binary tree structure ofH-matrices which isdenoted by 119871 Each row-block is computed and filled locallyby eachMPI node it is assigned to Since every subblock is anindependent unit the filling process for these row-blocks can

International Journal of Antennas and Propagation 5

be parallelized for allMPI nodes without any communicationor memory overlap Furthermore for each MPI node withmulticore processor the filling process of its correspondingblocks can be accelerated by utilizing the OpenMP memoryshare parallel computing In the case of MoM OpenMPis very efficient to parallelize the calculation of impedanceelements since this kind of calculation only depends onthe invariant geometrical information and electromagneticparameters which are universally shared among all nodes

42 Parallelization ofH-LU Factorization TheH-LU factor-ization is executed immediately after the forwardimpedancematrix is generated Compared with the forwardimpedancematrix filling process the parallelization ofH-LU factoriza-tion is not straightforward because its recursive procedure issequential in nature In order to putH-LU factorization intoa distributed parallel framework here we use an incrementalstrategy to parallelize this process gradually

During the H-LU factorization the algorithm is alwaysworkingcomputing on a certain block which is in LR form orfull form or the aggregation of them so that we can set a cri-terion based on the size of the block for determining whethercurrent procedure should be parallelized Specifically assum-ing that the (total) number of unknowns is119873 we have about2119870 nodes for computing then if the maximal dimension ofongoing processing block is larger than 1198732119870 MPI parallelcomputing with multiple nodes is employed otherwise theprocess will be dealt with by single node It should be notedthat OpenMP memory share parallel is always employed forevery nodewithmulticore processor participating in comput-ing In theH-LU factorization process OpenMP is in chargeof intranode parallelization of standard LU factorizationtriangular system solving for full subblocks and SVD or QRdecomposition for LR subblocks addition andmultiplicationand so forth which can be implemented through highlyoptimized computing package like LAPACK [34]

According to the recursive procedure of H-LU factor-ization at the beginning the algorithm will penetrate intothe finest level of H-matrices structure namely level 119871 todo the LU factorization for the top-left most block with thedimension of approximately (1198732119871) times (1198732119871) Since 119871 ge 119870this process is handled by one node After that the algorithmwill recur to level 119871 minus 1 to do the LU factorization for theextended top-left block of approximately (1198732119871minus1)times(1198732119871minus1)upon the previous factorization If 119871 minus 1 ge 119870 remains truethis process is still handled by one nodeThis recursion is kepton when it returns to level 119897 that 119897 lt 119870 for whichMPI parallelwith multiple nodes will be employed

For the sake of convenience we use H-LU(119897) to denotethe factorization on certain block in the level 119897 of H-matrices structure Recall theH-LU factorization procedurein Section 3 Procedure (i) isH-LU itself so its parallelizationis realized by other procedures on the finer levels For theprocedure (ii) and procedure (iii) we solve the triangularlinear system as

LX = A (11)

YU = B (12)

in which L U A and B are the known matrices and L Uare lower and upper triangular matrices respectively X andY need to be determined There two systems can actuallybe solved recursively through finer triangular systems Forinstance by rewriting (11) as

[L11

L21

L22

] [X11

X12

X21

X22

] = [A11

A12

A21

A22

] (13)

we carry out the following procedures (i) get X11

and X12

by solving lower triangular systems A11

= L11X11

andA12= L11X12 respectively (ii) get X

21and X

22by solving

lower triangular system A21minus L21X11= L22X21

and A22minus

L21X12

= L22X22 respectively From the above procedures

we can clearly see that getting columns [X11X21]119879 and

[X12X22]119879 represents two independent processes which can

be parallelized Similarly by refining Y as

[Y11

Y12

Y21

Y22

] (14)

getting rows [Y11Y12] and [Y

21Y22] represents two inde-

pendent processes Therefore if the size of X or Y namelyU12

or L21

in the H-LU context is larger than 1198732119870 theirsolving processes will be dealt with by multiple nodes

After solving processes of triangular systems is done theupdating for Z

22 namely the procedure (iv) of H-LU is

taking place We here use UPDATE(119897) to denote any of theseprocedures on level 119897 Rewriting it as

Z = Z minus AB (15)

apparently this is a typicalmatrix addition andmultiplicationprocedure which can be parallelized through the additionand multiplication of subblocks when it is necessary

[Z11

Z12

Z21

Z22

]

= [Z11

Z12

Z21

Z22

] minus [A11

A12

A21

A22

] [B11

B12

B21

B22

]

= [Z11minus A11B11minus A12B21

Z12minus A11B12minus A12B22

Z21minus A21B11minus A22B21

Z22minus A21B12minus A22B22

]

(16)

In the final form of (16) the matrix computation for each ofthe four subblocks is an independent process and can be dealtwith by different nodes simultaneouslyWhenZ

22is updated

the LU factorization is applied to Z22for getting L

22and U

22

This procedure is identical to theH-LU factorization for Z11

so any parallel scheme employed for factorizing Z11

is alsoapplicable for factorizing Z

22

The overall strategy for parallelizing H-LU factorizationis illustrated in Figure 3 Before the LU factorization onlevel 119870mdashH-LU(119870) is finished only one node 119875

1takes the

computing job Afterwards two nodes 1198751and 119875

2are used to

getU12and L

21simultaneously forH-LU(119870minus1) Then Z

22is

updated and factorized for gettingU22and L

22by the process

H-LU(119870) For triangular solving process on the 119870 minus 2 level

6 International Journal of Antennas and Propagation

Update(K minus 1) and

Update (K minus 2)and

middot middot middot

ℋ-LU (K minus 2)

ℋ-LU (K minus 1)

ℋ-LU (K minus 1)

ℋ-LU (K minus 2)

ℋ-LU (K minus 3)

ℋ-LU recurring to upper levelsℋ-LU (K)

Update(K) and

ℋ-LU (K)

P1

P1

P1

P1 P1

P2

P2

P3 P4

P2

P2

P3

P3

P4

P4

P5

P5

P6 P7 P8

Figure 3 Parallel strategyapproach forH-LU factorization 119875119894(119894 =

1 2 3 ) denotes the 119894th node

4 nodes can be used concurrently to get U12

and L21

basedon the parallelizable processes of solving the upper and lowertriangular systems as we discussed above Besides updatingZ22

can be parallelized by 4 nodes for 4 square blocks asenclosed by blue line in Figure 3 This parallelization canbe applied to coarser levels in general As for factorizationH-LU(119897) (119897 lt 119870) 2119870minus119897 nodes can be used for parallelizingthe process of solving triangular systems and 4

119870minus119897 nodescan be used for parallelizing the procedure of updating Z

22

During and after H-LU factorization the system matrix iskept in the form of distributive format as the same as thatfor forwardimpedance matrix shown in Figure 2 Under theperspective of data storage those row-blocks are updated byH-LU factorization gradually It should be noted that for thenodes participating in parallel computing the blocks theyare dealing with may not relate to the row-blocks they storedTherefore necessary communication for blocks transferoccurs among multiple nodes during the factorization

43 Theoretical Efficiency of MPI Parallelization To analyzethe parallelization efficiency under an ideal circumstancewe assume that the load balance is perfectly tuned and thecomputer cluster is equipped with high-speed network thusthere is no latency for remote response and the internodecommunication time is negligible Suppose there are 119875 nodesand each node has a constant number of OpenMP threadsThe parallelization efficiency is simply defined as

120578119875=

1198791

119875119879119875

(17)

in which 119879119894is the processing time with 119894 nodes

Apparently for the process of generating forwardimpe-dance matrix 120578

119875could be 100 if the number of operations

for filling each row-block under 119875-nodes parallelizationOrowfill119875

is 1119875 of the number of operations for filling the whole

matrix Ofill1

by single node since 1198791prop Ofill

1and 119879

119875prop

Orowfill119875

= Ofill119875 are true if the CPU time for taking single

operation is invariant This condition can be easily satisfiedto a high degree when 119875 is not too large When 119875 is largeenough to divide the system matrix into 119875 row-blocks mayforce the depth of H-matrices structure inadequately thatconsequently increases the overall complexity ofH-matricesmethod In other words

Orowfill119875

gtOfill1

119875 (for large 119875) (18)

Under this situation the parallelization becomes inefficientFor the process of H-LU factorization by assuming that

119875 = 2119870 the total number of operations under 119875-nodes

parallelization OLU119875

is

OLU119875asymp 119875 sdot O [H-LU (119870)]

+ 119875 (119875 minus 1) sdot O [H-TriSolve (119870)]

+1198753

6O [H-AddProduct (119870)]

(19)

where H-LU(119870) H-TriSolve(119870) and H-AddProduct(119870)denote the processes for LU factorization triangular systemsolving and matrices addition and multiplication in H-format in the119870th level ofH-matrices structure respectivelyTheir computational complexity is 119874(119899120572) (2 le 120572 lt 3) inwhich 119899 = 119873119875 However because of parallelization someoperations are done simultaneously Hence the nonoverlapnumber of operations O119879

119875that represents the actual wall time

of processing is

O119879

119875asymp 119875 sdot O [H-LU (119870)]

+119875119870

2sdot O [H-TriSolve (119870)]

+1198752

4O [H-AddProduct (119870)]

(20)

Additionally the number of operations without paralleliza-tion for H-LU factorization OLU

1is definitely no larger than

OLU119875

for the potential increased overall complexity by paral-lelization Therefore the parallelization efficiency is

120578119875=

OLU1

119875 sdot O119879119875

leOLU119875

119875 sdot O119879119875

(21)

When 119875 goes larger the result is approximately

120578119875le

(11987536)O [H-AddProduct (119870)]

119875 sdot (11987524)O [H-AddProduct (119870)]asymp2

3 (22)

According to our analysis we should say that for agiven size of IE problems there is an optimal 119875 for bestimplementation If 119875 is too small the solving process cannotbe fully parallelized On the other hand larger 119875 will makethe width of the row-blocks narrower that eliminate big LR

International Journal of Antennas and Propagation 7

subblocks and causeH-matrices structure to be flatter whichwill consequently impair the low complexity feature of H-matrices method

Similarly the number of OpenMP threads119872 has an opti-mal value too Too small119872will restrict OpenMP paralleliza-tion and too large 119872 is also a waste because the intranodeparallel computing only deals with algebraic arithmetic forsubblocks of relatively small size excessive threads will notimprove its acceleration and lower the efficiency

5 Numerical Results

The cluster we test our code on has 32 physical nodes in totaleach node has 64 GBytes of memory and four quad-core IntelXeon E5-2670 processors with 260GHz clock rate For alltested IE cases the discrete mesh size is 01120582 in which 120582 isthe wavelength

51 Parallelization Efficiency First we test the parallelizationscalability for solving scattering problem of PEC sphereswith different electric sizes The radii of these spheres are119877 = 25120582 119877 = 50120582 and 119877 = 100120582 respectivelyThe total numbers of unknowns for these three cases are27234 108726 and 435336 respectively Figure 4 shows theefficiency for the total time solved by 1 2 4 8 16 32 64 and128MPI nodes with 4 OpenMP threads for each node Thedefinition of parallelization efficiency is the same as formula(17) thus in this test we only consider the efficiency of pureMPI parallelization Unsurprisingly for parallelization withlarge number of nodes bigger cases have higher efficienciesbecause of better load balance and less distortion to the H-matrices structure Besides since the H-LU factorizationdominates the solving time consumption these tested effi-ciencies of bigger cases are close to the maximum efficiencyof theoretical one 120578LU

119875= 23 which demonstrates the good

parallelizing quality of our codeNext we investigate the hybrid parallelization efficiency

for different MPI nodes (119875)OpenMP threads (119872) propor-tion We employ all 512 cores in cluster to solve the 3 PECsphere cases above in four scenarios from pure MPI paral-lelization (1 OpenMP thread for each MPI node large 119875119872)to heavy OpenMP parallelization embedded in MPI (16OpenMP threads for each MPI node small 119875119872) Here weslightly change the definition of parallelization efficiency as

120578(119875119872)

=119879(11)

119875 sdot 119872 sdot 119879(119875119872)

(23)

in which 119879(119894119895)

denotes the total solving time parallelized by119894 MPI nodes with 119895 OpenMP threads for each node Theresults are presented in Figure 5 We can clearly see thatboth pure MPI parallelization and heavy OpenMP hybridparallelization have poorer efficiency compared to that ofmoderate OpenMP hybrid parallelization

Ideally larger119875119872 should give better efficiency for biggercases But in reality since larger 119875119872 can result in morefrequent MPI communication more time is required forcommunication and synchronization and so forth Besideslarge119875 also impairs the low complexity feature ofH-matrices

1 2 4 8 16 32 64 12803

04

05

06

07

08

09

10

Para

lleliz

atio

n effi

cien

cy (120578

)

Number of processors

R = 25120582

R = 50120582

R = 100120582

Figure 4 Parallelization efficiency for solving scattering problemsof PEC sphere of different electric sizes

512 nodes 1 thread per node128 nodes 4 threads per node64 nodes 8 threads per node32 nodes 16 threads per node

02

03

04

05

06

07

08Pa

ralle

lizat

ion

effici

ency

(120578)

N

214 215 216 217 218 219

Figure 5 Parallelization efficiency of differentMPI nodesOpenMPthreads allocation

method regardless of the size of problems Therefore pureMPI parallelization does not show superiority over hybridparallelization in our test Referring to the optimal numberof OpenMP threads per node it is mainly determined by thesize of smallest blocks in H-matrices which is preset sinceOpenMP efficiency strongly relates to the parallelization offull or LR blocks addition and multiplication at low levelsFor all cases of this test the size of smallest blocks is between50 and 200 so their optimal number of OpenMP threadsper node turns out to be 4 However we can expect theoptimal number of OpenMP threads to be larger (smaller)if the size of the smallest blocks is larger (smaller) Forthe case of solving PEC sphere of 119877 = 100120582 with 128

8 International Journal of Antennas and Propagation

28

210

212

214

216

Mem

ory

cost

(MB)

512 nodes 1 thread per node128 nodes 4 threads per node64 nodes 8 threads per node32 nodes 16 threads per node

O (N15)

N

214 215 216 217 218 219

O (N43logN)

Figure 6 Memory cost for different MPI nodesOpenMP threadsproportion

nodes 4 threads per node the parallelization efficiency in thistest is 0737 which is higher than that of 0562 in previoustest This is because by our new definition of efficiency theOpenMP parallelization is taken into account These resultsalso indicate the superiority ofMPI-OpenMP hybrid paralle-lization over pure MPI parallelization

Figures 6 and 7 show the memory cost and solving time for different problem sizes under different MPI nodes/OpenMP threads proportions. The memory cost complexity generally fits the O(N^1.5) trend, which has also been demonstrated for sequential FDS [15, 17, 19]. Since the parallelization efficiency gradually rises along with the increase of problem size, as shown above, the solving time complexity is about O(N^1.84) instead of the O(N^2) exhibited in the sequential scenario [15, 17, 19].

5.2. PEC Sphere. In this part we test the accuracy and efficiency of the proposed parallel FDS and compare its results with analytical solutions. A scattering problem of one PEC sphere with radius R = 30.0λ is solved. After discretization, the total number of unknowns of this case is 3,918,612. For hybrid parallelization, 128 MPI nodes are employed with 4 OpenMP threads for each node. The building time for the forward system matrix is 4318.7 seconds, the H-LU factorization time is 157230.2 seconds, and the solving time for each excitation (RHS) is only 3.66 seconds. The total memory cost is 987.1 GBytes. The numerical bistatic RCS results agree with the analytical Mie series very well, as shown in Figure 8.

Figure 7: Solving time (s) versus N for different MPI nodes/OpenMP threads proportions, compared with O(N^2) and O(N^1.84) reference trends.

Figure 8: Bistatic RCS (dBsm) versus θ of the PEC sphere with R = 30.0λ (N = 3,918,612), solved by the proposed hybrid parallel H-LU direct solver and compared with the Mie series (inset: detail for θ from 155° to 180°).

5.3. Airplane Model. Finally, we test our solver with a more realistic case: the monostatic RCS of an airplane model with dimensions of 60.84λ × 35.00λ × 16.25λ is calculated. After discretization, the total number of unknowns is 3,654,894. The building time for the forward system matrix is 3548.5 seconds, the H-LU factorization time is 110453.4 seconds, and the peak memory cost is 724.3 GBytes. With the LU form of the system matrix, we directly calculate the monostatic RCS for 10,000 vertical incident angles (RHS) in total by backward substitution. The average time cost for calculating each incident angle is 3.12 seconds. The RCS result is compared with that obtained by an FMM iterative solver for 720 incident angles. Although the FMM iterative solver only costs 112.4 GBytes of memory, the average iteration time for solving the backscattering at each angle is 93.87 seconds with the SAI preconditioner [23], which is not practical for solving thousands of RHS. From Figure 9 we can see that the two results agree with each other well.


Figure 9: Monostatic RCS (dBsm) of the airplane model versus the sweep angle φ, solved by the proposed hybrid parallel H-LU direct solver (FDS) and by the FMM iterative solver.

6. Conclusion

In this paper we present an H-matrices based fast direct solver accelerated by both MPI multinode and OpenMP multithread parallel techniques. The macrostructural implementation of the H-matrices method and the H-LU factorization is parallelized by MPI, while the microalgebraic computation for matrix blocks and vector segments is parallelized by OpenMP. Despite the sequential nature of the H-matrices direct solving procedure, the proposed hybrid parallel strategy shows good parallelization efficiency. Numerical results also demonstrate the excellent accuracy of this parallel direct solver and its superiority in solving problems with massive excitations (RHS).

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by the China Scholarship Council (CSC), the Fundamental Research Funds for the Central Universities of China (E022050205), the NSFC under Grant 61271033, the Programme of Introducing Talents of Discipline to Universities under Grant b07046, and the National Excellent Youth Foundation (NSFC no. 61425010).

References

[1] A. Taflove, Computational Electrodynamics: The Finite-Difference Time-Domain Method, Artech House, Norwood, Mass, USA, 1995.
[2] J.-M. Jin, The Finite Element Method in Electromagnetics, Wiley, New York, NY, USA, 2002.
[3] R. F. Harrington, Field Computation by Moment Methods, Macmillan, New York, NY, USA, 1968.
[4] R. Coifman, V. Rokhlin, and S. Wandzura, “The fast multipole method for the wave equation: a pedestrian prescription,” IEEE Antennas and Propagation Magazine, vol. 35, no. 3, pp. 7–12, 1993.
[5] J. Song, C. C. Lu, and W. C. Chew, “Multilevel fast multipole algorithm for electromagnetic scattering by large complex objects,” IEEE Transactions on Antennas and Propagation, vol. 45, no. 10, pp. 1488–1493, 1997.
[6] J. Hu, Z. P. Nie, J. Wang, G. X. Zou, and J. Hu, “Multilevel fast multipole algorithm for solving scattering from 3-D electrically large object,” Chinese Journal of Radio Science, vol. 19, no. 5, pp. 509–524, 2004.
[7] E. Michielssen and A. Boag, “A multilevel matrix decomposition algorithm for analyzing scattering from large structures,” IEEE Transactions on Antennas and Propagation, vol. 44, no. 8, pp. 1086–1093, 1996.
[8] J. M. Rius, J. Parron, E. Ubeda, and J. R. Mosig, “Multilevel matrix decomposition algorithm for analysis of electrically large electromagnetic problems in 3-D,” Microwave and Optical Technology Letters, vol. 22, no. 3, pp. 177–182, 1999.
[9] H. Guo, J. Hu, and E. Michielssen, “On MLMDA/butterfly compressibility of inverse integral operators,” IEEE Antennas and Wireless Propagation Letters, vol. 12, pp. 31–34, 2013.
[10] E. Bleszynski, M. Bleszynski, and T. Jaroszewicz, “AIM: adaptive integral method for solving large-scale electromagnetic scattering and radiation problems,” Radio Science, vol. 31, no. 5, pp. 1225–1251, 1996.
[11] W. Hackbusch and B. Khoromskij, “A sparse matrix arithmetic based on H-matrices. Part I: Introduction to H-matrices,” Computing, vol. 62, pp. 89–108, 1999.
[12] S. Borm, L. Grasedyck, and W. Hackbusch, Hierarchical Matrices, Lecture Note 21, Max Planck Institute for Mathematics in the Sciences, 2003.
[13] W. Chai and D. Jiao, “An H2-matrix-based integral-equation solver of reduced complexity and controlled accuracy for solving electrodynamic problems,” IEEE Transactions on Antennas and Propagation, vol. 57, no. 10, pp. 3147–3159, 2009.
[14] W. Chai and D. Jiao, “H- and H2-matrix-based fast integral-equation solvers for large-scale electromagnetic analysis,” IET Microwaves, Antennas & Propagation, vol. 4, no. 10, pp. 1583–1596, 2010.
[15] H. Guo, J. Hu, H. Shao, and Z. Nie, “Hierarchical matrices method and its application in electromagnetic integral equations,” International Journal of Antennas and Propagation, vol. 2012, Article ID 756259, 9 pages, 2012.
[16] A. Heldring, J. M. Rius, J. M. Tamayo, and J. Parron, “Compressed block-decomposition algorithm for fast capacitance extraction,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 27, no. 2, pp. 265–271, 2008.
[17] A. Heldring, J. M. Rius, J. M. Tamayo, J. Parron, and E. Ubeda, “Multiscale compressed block decomposition for fast direct solution of method of moments linear system,” IEEE Transactions on Antennas and Propagation, vol. 59, no. 2, pp. 526–536, 2011.
[18] K. Zhao, M. N. Vouvakis, and J.-F. Lee, “The adaptive cross approximation algorithm for accelerated method of moments computations of EMC problems,” IEEE Transactions on Electromagnetic Compatibility, vol. 47, no. 4, pp. 763–773, 2005.


[19] J. Shaeffer, “Direct solve of electrically large integral equations for problem sizes to 1 M unknowns,” IEEE Transactions on Antennas and Propagation, vol. 56, no. 8, pp. 2306–2313, 2008.
[20] J. Song, “Multilevel fast multipole algorithm for electromagnetic scattering by large complex objects,” IEEE Transactions on Antennas and Propagation, vol. 45, no. 10, pp. 1488–1493, 1997.
[21] M. Benzi, “Preconditioning techniques for large linear systems: a survey,” Journal of Computational Physics, vol. 182, no. 2, pp. 418–477, 2002.
[22] J. Lee, J. Zhang, and C.-C. Lu, “Incomplete LU preconditioning for large scale dense complex linear systems from electromagnetic wave scattering problems,” Journal of Computational Physics, vol. 185, no. 1, pp. 158–175, 2003.
[23] J. Lee, J. Zhang, and C.-C. Lu, “Sparse inverse preconditioning of multilevel fast multipole algorithm for hybrid integral equations in electromagnetics,” IEEE Transactions on Antennas and Propagation, vol. 52, no. 9, pp. 2277–2287, 2004.
[24] G. H. Golub and C. F. Van Loan, Matrix Computations, The Johns Hopkins University Press, Baltimore, Md, USA, 1996.
[25] R. J. Adams, Y. Xu, X. Xu, J.-S. Choi, S. D. Gedney, and F. X. Canning, “Modular fast direct electromagnetic analysis using local-global solution modes,” IEEE Transactions on Antennas and Propagation, vol. 56, no. 8, pp. 2427–2441, 2008.
[26] Y. Zhang and T. K. Sarkar, Parallel Solution of Integral Equation-Based EM Problems in the Frequency Domain, John Wiley & Sons, Hoboken, NJ, USA, 2009.
[27] J.-Y. Peng, H.-X. Zhou, K.-L. Zheng et al., “A shared memory-based parallel out-of-core LU solver for matrix equations with application in EM problems,” in Proceedings of the 1st International Conference on Computational Problem-Solving (ICCP '10), pp. 149–152, December 2010.
[28] M. Bebendorf, “Hierarchical LU decomposition-based preconditioners for BEM,” Computing, vol. 74, no. 3, pp. 225–247, 2005.
[29] L. Dagum and R. Menon, “OpenMP: an industry standard API for shared-memory programming,” IEEE Computational Science & Engineering, vol. 5, no. 1, pp. 46–55, 1998.
[30] W. Gropp, E. Lusk, and A. Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface, vol. 1, The MIT Press, 1999.
[31] S. M. Rao, D. R. Wilton, and A. W. Glisson, “Electromagnetic scattering by surfaces of arbitrary shape,” IEEE Transactions on Antennas and Propagation, vol. 30, no. 3, pp. 409–418, 1982.
[32] A. W. Glisson and D. R. Wilton, “Simple and efficient numerical methods for problems of electromagnetic radiation and scattering from surfaces,” IEEE Transactions on Antennas and Propagation, vol. 28, no. 5, pp. 593–603, 1980.
[33] R. Barrett, M. Berry, T. F. Chan et al., Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, vol. 43, SIAM, Philadelphia, Pa, USA, 1994.
[34] E. Anderson, Z. Bai, C. Bischof et al., LAPACK Users' Guide, vol. 9, SIAM, 1999.



for different MPI nodes (119875)OpenMP threads (119872) propor-tion We employ all 512 cores in cluster to solve the 3 PECsphere cases above in four scenarios from pure MPI paral-lelization (1 OpenMP thread for each MPI node large 119875119872)to heavy OpenMP parallelization embedded in MPI (16OpenMP threads for each MPI node small 119875119872) Here weslightly change the definition of parallelization efficiency as

120578(119875119872)

=119879(11)

119875 sdot 119872 sdot 119879(119875119872)

(23)

in which 119879(119894119895)

denotes the total solving time parallelized by119894 MPI nodes with 119895 OpenMP threads for each node Theresults are presented in Figure 5 We can clearly see thatboth pure MPI parallelization and heavy OpenMP hybridparallelization have poorer efficiency compared to that ofmoderate OpenMP hybrid parallelization

Ideally larger119875119872 should give better efficiency for biggercases But in reality since larger 119875119872 can result in morefrequent MPI communication more time is required forcommunication and synchronization and so forth Besideslarge119875 also impairs the low complexity feature ofH-matrices

1 2 4 8 16 32 64 12803

04

05

06

07

08

09

10

Para

lleliz

atio

n effi

cien

cy (120578

)

Number of processors

R = 25120582

R = 50120582

R = 100120582

Figure 4 Parallelization efficiency for solving scattering problemsof PEC sphere of different electric sizes

512 nodes 1 thread per node128 nodes 4 threads per node64 nodes 8 threads per node32 nodes 16 threads per node

02

03

04

05

06

07

08Pa

ralle

lizat

ion

effici

ency

(120578)

N

214 215 216 217 218 219

Figure 5 Parallelization efficiency of differentMPI nodesOpenMPthreads allocation

method regardless of the size of problems Therefore pureMPI parallelization does not show superiority over hybridparallelization in our test Referring to the optimal numberof OpenMP threads per node it is mainly determined by thesize of smallest blocks in H-matrices which is preset sinceOpenMP efficiency strongly relates to the parallelization offull or LR blocks addition and multiplication at low levelsFor all cases of this test the size of smallest blocks is between50 and 200 so their optimal number of OpenMP threadsper node turns out to be 4 However we can expect theoptimal number of OpenMP threads to be larger (smaller)if the size of the smallest blocks is larger (smaller) Forthe case of solving PEC sphere of 119877 = 100120582 with 128

8 International Journal of Antennas and Propagation

28

210

212

214

216

Mem

ory

cost

(MB)

512 nodes 1 thread per node128 nodes 4 threads per node64 nodes 8 threads per node32 nodes 16 threads per node

O (N15)

N

214 215 216 217 218 219

O (N43logN)

Figure 6 Memory cost for different MPI nodesOpenMP threadsproportion

nodes 4 threads per node the parallelization efficiency in thistest is 0737 which is higher than that of 0562 in previoustest This is because by our new definition of efficiency theOpenMP parallelization is taken into account These resultsalso indicate the superiority ofMPI-OpenMP hybrid paralle-lization over pure MPI parallelization

Figures 6 and 7 show the memory cost and solvingtime for different problem sizes under different MPI nodesOpenMP threads proportion The memory cost complexitygenerally fits the 119874(11987315) trend which has also been demon-strated in sequential FDS [15 17 19] Since the parallelizationefficiency gradually rises along with the increase of problemsize as shown above the solving time complexity is about119874(119873184) instead of 119874(1198732) exhibited in sequential scenario

[15 17 19]

52 PEC Sphere In this part we test the accuracy andefficiency of our proposed parallel FDS and compare itsresults to analysis solutions A scattering problem of one PECsphere with radius 119877 = 300120582 is solved After discretizationthe total number of unknowns of this case is 3918612For hybrid parallelizing 128MPI nodes are employed with4 OpenMP threads for each node The building time forforward systemmatrix is 43187 secondsH-LU factorizationtime is 1572302 seconds and solving time for each excitation(RHS) is only 366 seconds The total memory cost is 9871GBytes The numerical bistatic RCS results agree with theanalytical Mie series very well as shown in Figure 8

53 Airplane Model Finally we test our solver with a morerealistic case The monostatic RCS of an airplane model withthe dimension of 6084120582times3500120582times1625120582 is calculated Afterdiscretization the total number of unknowns is 3654894The building time for forward system matrix is 35485

N

214 215 216 218 21921722

24

26

28

210

212

512 nodes 1 thread per node128 nodes 4 threads per node64 nodes 8 threads per node32 nodes 16 threads per node

O (N2)

O (N184)

Solv

ing

time (

s)

Figure 7 Solving time for different MPI nodesOpenMP threadsproportion

0 20 40 60 80 100 120 140 160 180minus20

minus10

0

10

20

30

40

50

Bist

atic

RCS

(dBs

m)

120579

Mie series

155 160 165 170 175 180

Bist

atic

RCS

(dBs

m)

minus20minus10

01020304050

Radius = 30120582

N = 3918612

MPI-OpenMP parallelℋ-LU direct solver

Figure 8 Bistatic RCS of PEC sphere with 119877 = 300120582 solved byproposed hybrid parallel FDS

seconds H-LU factorization time is 1104534 seconds andthe peak memory cost is 7243 GBytes With the LU form ofthe system matrix we directly calculate the monostatic RCSfor 10000 total vertical incident angles (RHS) by backwardsubstitution The average time cost for calculating each inci-dent angle is 312 seconds The RCS result is compared withthat obtained by FMM iterative solver of 720 incident anglesAlthough FMM iterative solver only costs 1124 GBytes formemory the average iteration time of solving back scatteringfor each angle is 9387 seconds with SAI preconditioner [23]which is not practical for solving thousands of RHS FromFigure 9 we can see that these two results agree with eachother well

International Journal of Antennas and Propagation 9

XY

Z

0 20 40 60 80 100 120 140 160 180

minus10

0

10

20

30

40

50

60

70

80

Mon

osta

tic R

CS (d

Bsm

)

MPI-OpenMP parallel

FMM iterative solver

Sweeping 120593

120593

ℋ-LU direct solver

XXYYYYYYY

ZZ

Figure 9 Monostatic RCS of airplane model solved by proposedhybrid parallel FDS and FMM iterative solver

6 Conclusion

In this paper we present an H-matrices based fast directsolver accelerated by both MPI multinodes and OpenMPmultithreads parallel techniquesThemacrostructural imple-mentation ofH-matrices method andH-LU factorization isparallelized by MPI while the microalgebraic computationfor matrix blocks and vector segments is parallelized byOpenMP Despite the sequential nature ofH-matrices directsolving procedure this proposed hybrid parallel strategyshows good parallelization efficiency Numerical results alsodemonstrate excellent accuracy and superiority in solvingmassive excitations (RHS) problem for this parallel directsolver

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This work is supported by the China Scholarship Council(CSC) the Fundamental Research Funds for the Central Uni-versities of China (E022050205) the NSFC under Grant61271033 the Programmeof IntroducingTalents ofDisciplineto Universities under Grant b07046 and the National Excel-lent Youth Foundation (NSFC no 61425010)

References

[1] A Taflve Computational Electrodynamics The Finite-DifferenceTime-Domain Method Artech House Norwood Mass USA1995

[2] J-M JinThe Finite Element Method in Electromagnetics WileyNew York NY USA 2002

[3] R F Harrington Field Computation by Moment Methods Mac-Millan New York NY USA 1968

[4] R Coifman V Rokhlin and S Wandzura ldquoThe fast multiplemethod for the wave equation a pedestrian prescriptionrdquo IEEEAntennas and Propagation Magazine vol 35 no 3 pp 7ndash121993

[5] J Song C C Lu and W C Chew ldquoMultilevel fast multipolealgorithm for electromagnetic scattering by large complexobjectsrdquo IEEE Transactions on Antennas and Propagation vol45 no 10 pp 1488ndash1493 1997

[6] J Hu Z P Nie J Wang G X Zou and J Hu ldquoMultilevel fastmultipole algorithm for solving scattering from 3-D electricallylarge objectrdquo Chinese Journal of Radio Science vol 19 no 5 pp509ndash524 2004

[7] E Michielssen and A Boag ldquoA multilevel matrix decomposi-tion algorithm for analyzing scattering from large structuresrdquoIEEE Transactions on Antennas and Propagation vol 44 no 8pp 1086ndash1093 1996

[8] J M Rius J Parron E Ubeda and J R Mosig ldquoMultilevelmatrix decomposition algorithm for analysis of electricallylarge electromagnetic problems in 3-DrdquoMicrowave and OpticalTechnology Letters vol 22 no 3 pp 177ndash182 1999

[9] H Guo J Hu and E Michielssen ldquoOn MLMDAbutterflycompressibility of inverse integral operatorsrdquo IEEE Antennasand Wireless Propagation Letters vol 12 pp 31ndash34 2013

[10] E Bleszynski M Bleszynski and T Jaroszewicz ldquoAIM adaptiveintegral method for solving large-scale electromagnetic scatter-ing and radiation problemsrdquo Radio Science vol 31 no 5 pp1225ndash1251 1996

[11] W Hackbusch and B Khoromskij ldquoA sparse matrix arithmeticbased on H-matrices Part I Introduction to H-matricesrdquoComputing vol 62 pp 89ndash108 1999

[12] S Borm L Grasedyck and W Hackbusch Hierarchical Matri-ces Lecture Note 21 Max Planck Institute for Mathematics inthe Sciences 2003

[13] W Chai and D Jiao ldquoAn calH2-matrix-based integral-equationsolver of reduced complexity and controlled accuracy for solv-ing electrodynamic problemsrdquo IEEE Transactions on Antennasand Propagation vol 57 no 10 pp 3147ndash3159 2009

[14] W Chai and D Jiao ldquoH- and H2-matrix-based fast integral-equation solvers for large-scale electromagnetic analysisrdquo IETMicrowaves Antennas amp Propagation vol 4 no 10 pp 1583ndash1596 2010

[15] H Guo J Hu H Shao and Z Nie ldquoHierarchical matricesmethod and its application in electromagnetic integral equa-tionsrdquo International Journal of Antennas and Propagation vol2012 Article ID 756259 9 pages 2012

[16] A Heldring J M Rius J M Tamayo and J Parron ldquoCom-pressed block-decomposition algorithm for fast capacitanceextractionrdquo IEEE Transactions on Computer-Aided Design ofIntegrated Circuits and Systems vol 27 no 2 pp 265ndash271 2008

[17] A Heldring J M Rius J M Tamayo J Parron and EUbeda ldquoMultiscale compressed block decomposition for fastdirect solution of method of moments linear systemrdquo IEEETransactions on Antennas and Propagation vol 59 no 2 pp526ndash536 2011

[18] K Zhao M N Vouvakis and J-F Lee ldquoThe adaptive crossapproximation algorithm for accelerated method of momentscomputations of EMC problemsrdquo IEEE Transactions on Electro-magnetic Compatibility vol 47 no 4 pp 763ndash773 2005

10 International Journal of Antennas and Propagation

[19] J Shaeffer ldquoDirect solve of electrically large integral equationsfor problem sizes to 1 M unknownsrdquo IEEE Transactions onAntennas and Propagation vol 56 no 8 pp 2306ndash2313 2008

[20] J Song ldquoMultilevel fastmultipole algorithm for electromagneticscattering by large complex objectsrdquo IEEE Transactions onAntennas and Propagation vol 45 no 10 pp 1488ndash1493 1997

[21] M Benzi ldquoPreconditioning techniques for large linear systemsa surveyrdquo Journal of Computational Physics vol 182 no 2 pp418ndash477 2002

[22] J Lee J Zhang and C-C Lu ldquoIncomplete LU preconditioningfor large scale dense complex linear systems from electro-magnetic wave scattering problemsrdquo Journal of ComputationalPhysics vol 185 no 1 pp 158ndash175 2003

[23] J Lee J Zhang and C-C Lu ldquoSparse inverse preconditioningof multilevel fast multipole algorithm for hybrid integral equa-tions in electromagneticsrdquo IEEE Transactions on Antennas andPropagation vol 52 no 9 pp 2277ndash2287 2004

[24] G H Golub and C F V LoanMatrix Computations The JohnsHopkins University Press Baltimore Md USA 1996

[25] R J Adams Y Xu X Xu J-s Choi S D Gedney and F XCanning ldquoModular fast direct electromagnetic analysis usinglocal-global solution modesrdquo IEEE Transactions on Antennasand Propagation vol 56 no 8 pp 2427ndash2441 2008

[26] Y Zhang and T K Sarkar Parallel Solution of Integral Equation-Based EM Problems in the Frequency Domain John Wiley ampSons Hoboken NJ USA 2009

[27] J-Y Peng H-X Zhou K-L Zheng et al ldquoA shared memory-based parallel out-of-core LU solver for matrix equations withapplication in EM problemsrdquo in Proceedings of the 1st Interna-tional Conference onComputational Problem-Solving (ICCP rsquo10)pp 149ndash152 December 2010

[28] M Bebendorf ldquoHierarchical LU decomposition-based precon-ditioners for BEMrdquoComputing vol 74 no 3 pp 225ndash247 2005

[29] L Dagum and R Menon ldquoOpenMP an industry standardAPI for shared-memory programmingrdquo IEEE ComputationalScience amp Engineering vol 5 no 1 pp 46ndash55 1998

[30] WGropp E Lusk andA SkjellumUsingMPI Portable ParallelProgrammingwith theMessage-Passing Interface vol 1TheMITPress 1999

[31] S M Rao D R Wilton and A W Glisson ldquoElectromagneticscattering by surfaces of ar-bitrary shaperdquo IEEE Transactions onAntennas and Propagation vol 30 no 3 pp 409ndash418 1982

[32] A W Glisson and D R Wilton ldquoSimple and efficient numer-ical methods for problems of electromagnetic radiation andscattering from surfacesrdquo IEEE Transactions on Antennas andPropagation vol 28 no 5 pp 593ndash603 1980

[33] R Barrett M Berry T F Chan et al Templates for the Solutionof Linear Systems Building Blocks for Iterative Methods vol 43SIAM Philadelphia Pa USA 1994

[34] E Anderson Z Bai C Bischof et al LAPACKUsersrsquo Guide vol9 SAIM 1999

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

Page 5: Research Article An MPI-OpenMP Hybrid Parallel -LU Direct ...downloads.hindawi.com/journals/ijap/2015/615743.pdf · this paper, we proposed an MPI-OpenMP hybrid parallel strategy

International Journal of Antennas and Propagation 5

be parallelized for allMPI nodes without any communicationor memory overlap Furthermore for each MPI node withmulticore processor the filling process of its correspondingblocks can be accelerated by utilizing the OpenMP memoryshare parallel computing In the case of MoM OpenMPis very efficient to parallelize the calculation of impedanceelements since this kind of calculation only depends onthe invariant geometrical information and electromagneticparameters which are universally shared among all nodes

42 Parallelization ofH-LU Factorization TheH-LU factor-ization is executed immediately after the forwardimpedancematrix is generated Compared with the forwardimpedancematrix filling process the parallelization ofH-LU factoriza-tion is not straightforward because its recursive procedure issequential in nature In order to putH-LU factorization intoa distributed parallel framework here we use an incrementalstrategy to parallelize this process gradually

During the H-LU factorization the algorithm is alwaysworkingcomputing on a certain block which is in LR form orfull form or the aggregation of them so that we can set a cri-terion based on the size of the block for determining whethercurrent procedure should be parallelized Specifically assum-ing that the (total) number of unknowns is119873 we have about2119870 nodes for computing then if the maximal dimension ofongoing processing block is larger than 1198732119870 MPI parallelcomputing with multiple nodes is employed otherwise theprocess will be dealt with by single node It should be notedthat OpenMP memory share parallel is always employed forevery nodewithmulticore processor participating in comput-ing In theH-LU factorization process OpenMP is in chargeof intranode parallelization of standard LU factorizationtriangular system solving for full subblocks and SVD or QRdecomposition for LR subblocks addition andmultiplicationand so forth which can be implemented through highlyoptimized computing package like LAPACK [34]

According to the recursive procedure of H-LU factor-ization at the beginning the algorithm will penetrate intothe finest level of H-matrices structure namely level 119871 todo the LU factorization for the top-left most block with thedimension of approximately (1198732119871) times (1198732119871) Since 119871 ge 119870this process is handled by one node After that the algorithmwill recur to level 119871 minus 1 to do the LU factorization for theextended top-left block of approximately (1198732119871minus1)times(1198732119871minus1)upon the previous factorization If 119871 minus 1 ge 119870 remains truethis process is still handled by one nodeThis recursion is kepton when it returns to level 119897 that 119897 lt 119870 for whichMPI parallelwith multiple nodes will be employed

For the sake of convenience we use H-LU(119897) to denotethe factorization on certain block in the level 119897 of H-matrices structure Recall theH-LU factorization procedurein Section 3 Procedure (i) isH-LU itself so its parallelizationis realized by other procedures on the finer levels For theprocedure (ii) and procedure (iii) we solve the triangularlinear system as

LX = A (11)

YU = B (12)

in which L U A and B are the known matrices and L Uare lower and upper triangular matrices respectively X andY need to be determined There two systems can actuallybe solved recursively through finer triangular systems Forinstance by rewriting (11) as

[L11

L21

L22

] [X11

X12

X21

X22

] = [A11

A12

A21

A22

] (13)

we carry out the following procedures (i) get X11

and X12

by solving lower triangular systems A11

= L11X11

andA12= L11X12 respectively (ii) get X

21and X

22by solving

lower triangular system A21minus L21X11= L22X21

and A22minus

L21X12

= L22X22 respectively From the above procedures

we can clearly see that getting columns [X11X21]119879 and

[X12X22]119879 represents two independent processes which can

be parallelized Similarly by refining Y as

[Y11

Y12

Y21

Y22

] (14)

getting rows [Y11Y12] and [Y

21Y22] represents two inde-

pendent processes Therefore if the size of X or Y namelyU12

or L21

in the H-LU context is larger than 1198732119870 theirsolving processes will be dealt with by multiple nodes

After solving processes of triangular systems is done theupdating for Z

22 namely the procedure (iv) of H-LU is

taking place We here use UPDATE(119897) to denote any of theseprocedures on level 119897 Rewriting it as

Z = Z minus AB (15)

apparently this is a typicalmatrix addition andmultiplicationprocedure which can be parallelized through the additionand multiplication of subblocks when it is necessary

[Z11

Z12

Z21

Z22

]

= [Z11

Z12

Z21

Z22

] minus [A11

A12

A21

A22

] [B11

B12

B21

B22

]

= [Z11minus A11B11minus A12B21

Z12minus A11B12minus A12B22

Z21minus A21B11minus A22B21

Z22minus A21B12minus A22B22

]

(16)

In the final form of (16) the matrix computation for each ofthe four subblocks is an independent process and can be dealtwith by different nodes simultaneouslyWhenZ

22is updated

the LU factorization is applied to Z22for getting L

22and U

22

This procedure is identical to theH-LU factorization for Z11

so any parallel scheme employed for factorizing Z11

is alsoapplicable for factorizing Z

22

The overall strategy for parallelizing H-LU factorizationis illustrated in Figure 3 Before the LU factorization onlevel 119870mdashH-LU(119870) is finished only one node 119875

1takes the

computing job Afterwards two nodes 1198751and 119875

2are used to

getU12and L

21simultaneously forH-LU(119870minus1) Then Z

22is

updated and factorized for gettingU22and L

22by the process

H-LU(119870) For triangular solving process on the 119870 minus 2 level

6 International Journal of Antennas and Propagation

Update(K minus 1) and

Update (K minus 2)and

middot middot middot

ℋ-LU (K minus 2)

ℋ-LU (K minus 1)

ℋ-LU (K minus 1)

ℋ-LU (K minus 2)

ℋ-LU (K minus 3)

ℋ-LU recurring to upper levelsℋ-LU (K)

Update(K) and

ℋ-LU (K)

P1

P1

P1

P1 P1

P2

P2

P3 P4

P2

P2

P3

P3

P4

P4

P5

P5

P6 P7 P8

Figure 3 Parallel strategyapproach forH-LU factorization 119875119894(119894 =

1 2 3 ) denotes the 119894th node

4 nodes can be used concurrently to get U12

and L21

basedon the parallelizable processes of solving the upper and lowertriangular systems as we discussed above Besides updatingZ22

can be parallelized by 4 nodes for 4 square blocks asenclosed by blue line in Figure 3 This parallelization canbe applied to coarser levels in general As for factorizationH-LU(119897) (119897 lt 119870) 2119870minus119897 nodes can be used for parallelizingthe process of solving triangular systems and 4

119870minus119897 nodescan be used for parallelizing the procedure of updating Z

22

During and after H-LU factorization the system matrix iskept in the form of distributive format as the same as thatfor forwardimpedance matrix shown in Figure 2 Under theperspective of data storage those row-blocks are updated byH-LU factorization gradually It should be noted that for thenodes participating in parallel computing the blocks theyare dealing with may not relate to the row-blocks they storedTherefore necessary communication for blocks transferoccurs among multiple nodes during the factorization

43 Theoretical Efficiency of MPI Parallelization To analyzethe parallelization efficiency under an ideal circumstancewe assume that the load balance is perfectly tuned and thecomputer cluster is equipped with high-speed network thusthere is no latency for remote response and the internodecommunication time is negligible Suppose there are 119875 nodesand each node has a constant number of OpenMP threadsThe parallelization efficiency is simply defined as

120578119875=

1198791

119875119879119875

(17)

in which 119879119894is the processing time with 119894 nodes

Apparently for the process of generating forwardimpe-dance matrix 120578

119875could be 100 if the number of operations

for filling each row-block under 119875-nodes parallelizationOrowfill119875

is 1119875 of the number of operations for filling the whole

matrix Ofill1

by single node since 1198791prop Ofill

1and 119879

119875prop

Orowfill119875

= Ofill119875 are true if the CPU time for taking single

operation is invariant This condition can be easily satisfiedto a high degree when 119875 is not too large When 119875 is largeenough to divide the system matrix into 119875 row-blocks mayforce the depth of H-matrices structure inadequately thatconsequently increases the overall complexity ofH-matricesmethod In other words

Orowfill119875

gtOfill1

119875 (for large 119875) (18)

Under this situation the parallelization becomes inefficientFor the process of H-LU factorization by assuming that

119875 = 2119870 the total number of operations under 119875-nodes

parallelization OLU119875

is

OLU119875asymp 119875 sdot O [H-LU (119870)]

+ 119875 (119875 minus 1) sdot O [H-TriSolve (119870)]

+1198753

6O [H-AddProduct (119870)]

(19)

where H-LU(119870) H-TriSolve(119870) and H-AddProduct(119870)denote the processes for LU factorization triangular systemsolving and matrices addition and multiplication in H-format in the119870th level ofH-matrices structure respectivelyTheir computational complexity is 119874(119899120572) (2 le 120572 lt 3) inwhich 119899 = 119873119875 However because of parallelization someoperations are done simultaneously Hence the nonoverlapnumber of operations O119879

119875that represents the actual wall time

of processing is

O119879

119875asymp 119875 sdot O [H-LU (119870)]

+119875119870

2sdot O [H-TriSolve (119870)]

+1198752

4O [H-AddProduct (119870)]

(20)

Additionally the number of operations without paralleliza-tion for H-LU factorization OLU

1is definitely no larger than

OLU119875

for the potential increased overall complexity by paral-lelization Therefore the parallelization efficiency is

120578119875=

OLU1

119875 sdot O119879119875

leOLU119875

119875 sdot O119879119875

(21)

When 119875 goes larger the result is approximately

120578119875le

(11987536)O [H-AddProduct (119870)]

119875 sdot (11987524)O [H-AddProduct (119870)]asymp2

3 (22)

According to our analysis we should say that for agiven size of IE problems there is an optimal 119875 for bestimplementation If 119875 is too small the solving process cannotbe fully parallelized On the other hand larger 119875 will makethe width of the row-blocks narrower that eliminate big LR

International Journal of Antennas and Propagation 7

subblocks and causeH-matrices structure to be flatter whichwill consequently impair the low complexity feature of H-matrices method

Similarly the number of OpenMP threads119872 has an opti-mal value too Too small119872will restrict OpenMP paralleliza-tion and too large 119872 is also a waste because the intranodeparallel computing only deals with algebraic arithmetic forsubblocks of relatively small size excessive threads will notimprove its acceleration and lower the efficiency

5 Numerical Results

The cluster we test our code on has 32 physical nodes in totaleach node has 64 GBytes of memory and four quad-core IntelXeon E5-2670 processors with 260GHz clock rate For alltested IE cases the discrete mesh size is 01120582 in which 120582 isthe wavelength

51 Parallelization Efficiency First we test the parallelizationscalability for solving scattering problem of PEC sphereswith different electric sizes The radii of these spheres are119877 = 25120582 119877 = 50120582 and 119877 = 100120582 respectivelyThe total numbers of unknowns for these three cases are27234 108726 and 435336 respectively Figure 4 shows theefficiency for the total time solved by 1 2 4 8 16 32 64 and128MPI nodes with 4 OpenMP threads for each node Thedefinition of parallelization efficiency is the same as formula(17) thus in this test we only consider the efficiency of pureMPI parallelization Unsurprisingly for parallelization withlarge number of nodes bigger cases have higher efficienciesbecause of better load balance and less distortion to the H-matrices structure Besides since the H-LU factorizationdominates the solving time consumption these tested effi-ciencies of bigger cases are close to the maximum efficiencyof theoretical one 120578LU

119875= 23 which demonstrates the good

parallelizing quality of our codeNext we investigate the hybrid parallelization efficiency

for different MPI nodes (119875)OpenMP threads (119872) propor-tion We employ all 512 cores in cluster to solve the 3 PECsphere cases above in four scenarios from pure MPI paral-lelization (1 OpenMP thread for each MPI node large 119875119872)to heavy OpenMP parallelization embedded in MPI (16OpenMP threads for each MPI node small 119875119872) Here weslightly change the definition of parallelization efficiency as

120578(119875119872)

=119879(11)

119875 sdot 119872 sdot 119879(119875119872)

(23)

in which 119879(119894119895)

denotes the total solving time parallelized by119894 MPI nodes with 119895 OpenMP threads for each node Theresults are presented in Figure 5 We can clearly see thatboth pure MPI parallelization and heavy OpenMP hybridparallelization have poorer efficiency compared to that ofmoderate OpenMP hybrid parallelization

Ideally larger119875119872 should give better efficiency for biggercases But in reality since larger 119875119872 can result in morefrequent MPI communication more time is required forcommunication and synchronization and so forth Besideslarge119875 also impairs the low complexity feature ofH-matrices

1 2 4 8 16 32 64 12803

04

05

06

07

08

09

10

Para

lleliz

atio

n effi

cien

cy (120578

)

Number of processors

R = 25120582

R = 50120582

R = 100120582

Figure 4 Parallelization efficiency for solving scattering problemsof PEC sphere of different electric sizes

512 nodes 1 thread per node128 nodes 4 threads per node64 nodes 8 threads per node32 nodes 16 threads per node

02

03

04

05

06

07

08Pa

ralle

lizat

ion

effici

ency

(120578)

N

214 215 216 217 218 219

Figure 5 Parallelization efficiency of differentMPI nodesOpenMPthreads allocation

method regardless of the size of problems Therefore pureMPI parallelization does not show superiority over hybridparallelization in our test Referring to the optimal numberof OpenMP threads per node it is mainly determined by thesize of smallest blocks in H-matrices which is preset sinceOpenMP efficiency strongly relates to the parallelization offull or LR blocks addition and multiplication at low levelsFor all cases of this test the size of smallest blocks is between50 and 200 so their optimal number of OpenMP threadsper node turns out to be 4 However we can expect theoptimal number of OpenMP threads to be larger (smaller)if the size of the smallest blocks is larger (smaller) Forthe case of solving PEC sphere of 119877 = 100120582 with 128


Figure 6: Memory cost (MB) for different MPI nodes/OpenMP threads proportions (512 × 1, 128 × 4, 64 × 8, and 32 × 16), plotted against N (2^14 to 2^19) together with O(N^1.5) and O(N^{4/3} log N) reference curves.

For the case of the PEC sphere with R = 10.0λ solved with 128 nodes and 4 threads per node, the parallelization efficiency in this test is 0.737, higher than the 0.562 obtained in the previous test. This is because, under the new definition of efficiency, the OpenMP parallelization is taken into account. These results also indicate the superiority of MPI-OpenMP hybrid parallelization over pure MPI parallelization.

Figures 6 and 7 show the memory cost and the solving time for different problem sizes under different MPI nodes/OpenMP threads proportions. The memory cost generally follows the O(N^1.5) trend, which has also been demonstrated for sequential FDS [15, 17, 19]. Since the parallelization efficiency gradually rises with increasing problem size, as shown above, the solving time complexity is about O(N^1.84) instead of the O(N^2) exhibited in the sequential scenario [15, 17, 19].
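The quoted exponents can be estimated from measured (N, time) or (N, memory) pairs by a least-squares fit of a straight line in log-log coordinates. The C snippet below demonstrates the procedure; the sample values are made up for illustration (they are not the measurements behind Figures 6 and 7) and merely produce a slope near 1.84.

```c
/* Minimal sketch: estimate the exponent p in t ~ c * N^p from (N, time)
 * samples by least squares on log N versus log t. The time values below
 * are hypothetical, not data from this work. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double N[] = {27234.0, 108726.0, 435336.0};
    const double t[] = {55.0, 700.0, 9000.0};   /* assumed seconds */
    const int n = 3;
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < n; ++i) {
        double x = log(N[i]), y = log(t[i]);
        sx += x; sy += y; sxx += x * x; sxy += x * y;
    }
    double p = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    printf("fitted exponent p = %.2f\n", p);   /* ~1.84 for these samples */
    return 0;
}
```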

5.2. PEC Sphere. In this part we test the accuracy and efficiency of the proposed parallel FDS and compare its results with analytical solutions. A scattering problem of one PEC sphere with radius R = 30.0λ is solved. After discretization, the total number of unknowns for this case is 3,918,612. For hybrid parallelization, 128 MPI nodes are employed with 4 OpenMP threads for each node. The building time for the forward system matrix is 4318.7 seconds, the H-LU factorization time is 157230.2 seconds, and the solving time for each excitation (RHS) is only 3.66 seconds. The total memory cost is 987.1 GBytes. The numerical bistatic RCS results agree very well with the analytical Mie series, as shown in Figure 8.
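Agreement of this kind is typically judged by comparing the computed and analytical RCS curves sample by sample; one simple measure is the RMS deviation in dB over the sweep angles, sketched below. The routine and the tiny data arrays are illustrative only and are not the comparison metric or data used in this paper.

```c
/* Illustrative only: RMS deviation (in dB) between a computed bistatic RCS
 * curve and a reference (e.g., Mie series) sampled at the same angles.
 * The arrays are placeholder values, not data from this work. */
#include <math.h>
#include <stdio.h>

static double rms_dev_db(const double *rcs_db, const double *ref_db, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; ++i) {
        double d = rcs_db[i] - ref_db[i];
        s += d * d;
    }
    return sqrt(s / n);
}

int main(void)
{
    const double solver_db[] = {41.2, 18.7, 23.4, 15.1, 30.9};
    const double mie_db[]    = {41.0, 18.9, 23.3, 15.4, 30.8};
    printf("RMS deviation = %.2f dB\n", rms_dev_db(solver_db, mie_db, 5));
    return 0;
}
```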

5.3. Airplane Model. Finally, we test our solver with a more realistic case. The monostatic RCS of an airplane model with dimensions of 60.84λ × 35.00λ × 16.25λ is calculated. After discretization, the total number of unknowns is 3,654,894.

Figure 7: Solving time (s) for different MPI nodes/OpenMP threads proportions (512 × 1, 128 × 4, 64 × 8, and 32 × 16), plotted against N (2^14 to 2^19) together with O(N^2) and O(N^1.84) reference curves.

Figure 8: Bistatic RCS (dBsm) versus θ (0°–180°) of the PEC sphere with radius R = 30.0λ (N = 3,918,612), solved by the proposed hybrid parallel MPI-OpenMP H-LU direct solver and compared with the Mie series; the inset zooms in on θ = 155°–180°.

The building time for the forward system matrix is 3548.5 seconds, the H-LU factorization time is 110453.4 seconds, and the peak memory cost is 724.3 GBytes. With the LU form of the system matrix, we directly calculate the monostatic RCS for a total of 10,000 vertical incident angles (RHS) by backward substitution. The average time cost for calculating each incident angle is 3.12 seconds. The RCS result is compared with that obtained by an FMM iterative solver for 720 incident angles. Although the FMM iterative solver costs only 112.4 GBytes of memory, its average iteration time for solving the backscattering at each angle is 93.87 seconds with the SAI preconditioner [23], which is not practical for solving thousands of RHS. From Figure 9 we can see that the two results agree with each other well.
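The small per-angle cost reflects the factor-once, solve-many pattern: the expensive factorization is performed a single time, after which each new excitation costs only a forward and a backward substitution. The sketch below shows the same pattern with dense LAPACK routines (dgetrf/dgetrs) on a small real-valued toy matrix; it illustrates the idea only and is not the paper's H-LU code, which operates on compressed hierarchical blocks.

```c
/* Minimal sketch of "factor once, solve many right-hand sides" with dense
 * LAPACK; the toy matrix and RHS are random and not related to the paper. */
#include <stdio.h>
#include <stdlib.h>
#include <lapacke.h>

int main(void)
{
    const lapack_int n = 1000, nrhs = 100;      /* toy sizes */
    double *A = malloc((size_t)n * n * sizeof *A);
    double *B = malloc((size_t)n * nrhs * sizeof *B);
    lapack_int *ipiv = malloc((size_t)n * sizeof *ipiv);
    if (!A || !B || !ipiv) return 1;

    /* Diagonally dominant random matrix A and random right-hand sides B. */
    for (lapack_int i = 0; i < n; ++i)
        for (lapack_int j = 0; j < n; ++j)
            A[i * n + j] = (i == j) ? (double)n : (double)rand() / RAND_MAX;
    for (lapack_int i = 0; i < n * nrhs; ++i)
        B[i] = (double)rand() / RAND_MAX;

    /* O(n^3) LU factorization, paid once. */
    lapack_int info = LAPACKE_dgetrf(LAPACK_ROW_MAJOR, n, n, A, n, ipiv);
    if (info != 0) return 1;

    /* O(n^2) forward/backward substitution per right-hand side. */
    info = LAPACKE_dgetrs(LAPACK_ROW_MAJOR, 'N', n, nrhs, A, n, ipiv, B, nrhs);
    printf("solved %d right-hand sides, info = %d\n", (int)nrhs, (int)info);

    free(A); free(B); free(ipiv);
    return 0;
}
```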


Figure 9: Monostatic RCS (dBsm) of the airplane model versus the sweep angle φ (0°–180°), solved by the proposed hybrid parallel FDS (MPI-OpenMP H-LU direct solver) and by the FMM iterative solver.

6. Conclusion

In this paper we present an H-matrices-based fast direct solver accelerated by both MPI multinode and OpenMP multithread parallel techniques. The macrostructural implementation of the H-matrices method and the H-LU factorization is parallelized by MPI, while the microalgebraic computation for matrix blocks and vector segments is parallelized by OpenMP. Despite the sequential nature of the H-matrices direct solving procedure, the proposed hybrid parallel strategy shows good parallelization efficiency. Numerical results also demonstrate excellent accuracy and the superiority of this parallel direct solver in solving problems with massive numbers of excitations (RHS).

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by the China Scholarship Council (CSC), the Fundamental Research Funds for the Central Universities of China (E022050205), the NSFC under Grant 61271033, the Programme of Introducing Talents of Discipline to Universities under Grant b07046, and the National Excellent Youth Foundation (NSFC no. 61425010).

References

[1] A. Taflove, Computational Electrodynamics: The Finite-Difference Time-Domain Method, Artech House, Norwood, Mass, USA, 1995.
[2] J.-M. Jin, The Finite Element Method in Electromagnetics, Wiley, New York, NY, USA, 2002.
[3] R. F. Harrington, Field Computation by Moment Methods, Macmillan, New York, NY, USA, 1968.
[4] R. Coifman, V. Rokhlin, and S. Wandzura, "The fast multipole method for the wave equation: a pedestrian prescription," IEEE Antennas and Propagation Magazine, vol. 35, no. 3, pp. 7–12, 1993.
[5] J. Song, C. C. Lu, and W. C. Chew, "Multilevel fast multipole algorithm for electromagnetic scattering by large complex objects," IEEE Transactions on Antennas and Propagation, vol. 45, no. 10, pp. 1488–1493, 1997.
[6] J. Hu, Z. P. Nie, J. Wang, G. X. Zou, and J. Hu, "Multilevel fast multipole algorithm for solving scattering from 3-D electrically large object," Chinese Journal of Radio Science, vol. 19, no. 5, pp. 509–524, 2004.
[7] E. Michielssen and A. Boag, "A multilevel matrix decomposition algorithm for analyzing scattering from large structures," IEEE Transactions on Antennas and Propagation, vol. 44, no. 8, pp. 1086–1093, 1996.
[8] J. M. Rius, J. Parron, E. Ubeda, and J. R. Mosig, "Multilevel matrix decomposition algorithm for analysis of electrically large electromagnetic problems in 3-D," Microwave and Optical Technology Letters, vol. 22, no. 3, pp. 177–182, 1999.
[9] H. Guo, J. Hu, and E. Michielssen, "On MLMDA/butterfly compressibility of inverse integral operators," IEEE Antennas and Wireless Propagation Letters, vol. 12, pp. 31–34, 2013.
[10] E. Bleszynski, M. Bleszynski, and T. Jaroszewicz, "AIM: adaptive integral method for solving large-scale electromagnetic scattering and radiation problems," Radio Science, vol. 31, no. 5, pp. 1225–1251, 1996.
[11] W. Hackbusch and B. Khoromskij, "A sparse matrix arithmetic based on H-matrices. Part I: Introduction to H-matrices," Computing, vol. 62, pp. 89–108, 1999.
[12] S. Borm, L. Grasedyck, and W. Hackbusch, Hierarchical Matrices, Lecture Note 21, Max Planck Institute for Mathematics in the Sciences, 2003.
[13] W. Chai and D. Jiao, "An H2-matrix-based integral-equation solver of reduced complexity and controlled accuracy for solving electrodynamic problems," IEEE Transactions on Antennas and Propagation, vol. 57, no. 10, pp. 3147–3159, 2009.
[14] W. Chai and D. Jiao, "H- and H2-matrix-based fast integral-equation solvers for large-scale electromagnetic analysis," IET Microwaves, Antennas & Propagation, vol. 4, no. 10, pp. 1583–1596, 2010.
[15] H. Guo, J. Hu, H. Shao, and Z. Nie, "Hierarchical matrices method and its application in electromagnetic integral equations," International Journal of Antennas and Propagation, vol. 2012, Article ID 756259, 9 pages, 2012.
[16] A. Heldring, J. M. Rius, J. M. Tamayo, and J. Parron, "Compressed block-decomposition algorithm for fast capacitance extraction," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 27, no. 2, pp. 265–271, 2008.
[17] A. Heldring, J. M. Rius, J. M. Tamayo, J. Parron, and E. Ubeda, "Multiscale compressed block decomposition for fast direct solution of method of moments linear system," IEEE Transactions on Antennas and Propagation, vol. 59, no. 2, pp. 526–536, 2011.
[18] K. Zhao, M. N. Vouvakis, and J.-F. Lee, "The adaptive cross approximation algorithm for accelerated method of moments computations of EMC problems," IEEE Transactions on Electromagnetic Compatibility, vol. 47, no. 4, pp. 763–773, 2005.
[19] J. Shaeffer, "Direct solve of electrically large integral equations for problem sizes to 1 M unknowns," IEEE Transactions on Antennas and Propagation, vol. 56, no. 8, pp. 2306–2313, 2008.
[20] J. Song, "Multilevel fast multipole algorithm for electromagnetic scattering by large complex objects," IEEE Transactions on Antennas and Propagation, vol. 45, no. 10, pp. 1488–1493, 1997.
[21] M. Benzi, "Preconditioning techniques for large linear systems: a survey," Journal of Computational Physics, vol. 182, no. 2, pp. 418–477, 2002.
[22] J. Lee, J. Zhang, and C.-C. Lu, "Incomplete LU preconditioning for large scale dense complex linear systems from electromagnetic wave scattering problems," Journal of Computational Physics, vol. 185, no. 1, pp. 158–175, 2003.
[23] J. Lee, J. Zhang, and C.-C. Lu, "Sparse inverse preconditioning of multilevel fast multipole algorithm for hybrid integral equations in electromagnetics," IEEE Transactions on Antennas and Propagation, vol. 52, no. 9, pp. 2277–2287, 2004.
[24] G. H. Golub and C. F. Van Loan, Matrix Computations, The Johns Hopkins University Press, Baltimore, Md, USA, 1996.
[25] R. J. Adams, Y. Xu, X. Xu, J.-S. Choi, S. D. Gedney, and F. X. Canning, "Modular fast direct electromagnetic analysis using local-global solution modes," IEEE Transactions on Antennas and Propagation, vol. 56, no. 8, pp. 2427–2441, 2008.
[26] Y. Zhang and T. K. Sarkar, Parallel Solution of Integral Equation-Based EM Problems in the Frequency Domain, John Wiley & Sons, Hoboken, NJ, USA, 2009.
[27] J.-Y. Peng, H.-X. Zhou, K.-L. Zheng et al., "A shared memory-based parallel out-of-core LU solver for matrix equations with application in EM problems," in Proceedings of the 1st International Conference on Computational Problem-Solving (ICCP '10), pp. 149–152, December 2010.
[28] M. Bebendorf, "Hierarchical LU decomposition-based preconditioners for BEM," Computing, vol. 74, no. 3, pp. 225–247, 2005.
[29] L. Dagum and R. Menon, "OpenMP: an industry standard API for shared-memory programming," IEEE Computational Science & Engineering, vol. 5, no. 1, pp. 46–55, 1998.
[30] W. Gropp, E. Lusk, and A. Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface, vol. 1, The MIT Press, 1999.
[31] S. M. Rao, D. R. Wilton, and A. W. Glisson, "Electromagnetic scattering by surfaces of arbitrary shape," IEEE Transactions on Antennas and Propagation, vol. 30, no. 3, pp. 409–418, 1982.
[32] A. W. Glisson and D. R. Wilton, "Simple and efficient numerical methods for problems of electromagnetic radiation and scattering from surfaces," IEEE Transactions on Antennas and Propagation, vol. 28, no. 5, pp. 593–603, 1980.
[33] R. Barrett, M. Berry, T. F. Chan et al., Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, vol. 43, SIAM, Philadelphia, Pa, USA, 1994.
[34] E. Anderson, Z. Bai, C. Bischof et al., LAPACK Users' Guide, vol. 9, SIAM, 1999.



51 Parallelization Efficiency First we test the parallelizationscalability for solving scattering problem of PEC sphereswith different electric sizes The radii of these spheres are119877 = 25120582 119877 = 50120582 and 119877 = 100120582 respectivelyThe total numbers of unknowns for these three cases are27234 108726 and 435336 respectively Figure 4 shows theefficiency for the total time solved by 1 2 4 8 16 32 64 and128MPI nodes with 4 OpenMP threads for each node Thedefinition of parallelization efficiency is the same as formula(17) thus in this test we only consider the efficiency of pureMPI parallelization Unsurprisingly for parallelization withlarge number of nodes bigger cases have higher efficienciesbecause of better load balance and less distortion to the H-matrices structure Besides since the H-LU factorizationdominates the solving time consumption these tested effi-ciencies of bigger cases are close to the maximum efficiencyof theoretical one 120578LU

119875= 23 which demonstrates the good

parallelizing quality of our codeNext we investigate the hybrid parallelization efficiency

for different MPI nodes (119875)OpenMP threads (119872) propor-tion We employ all 512 cores in cluster to solve the 3 PECsphere cases above in four scenarios from pure MPI paral-lelization (1 OpenMP thread for each MPI node large 119875119872)to heavy OpenMP parallelization embedded in MPI (16OpenMP threads for each MPI node small 119875119872) Here weslightly change the definition of parallelization efficiency as

120578(119875119872)

=119879(11)

119875 sdot 119872 sdot 119879(119875119872)

(23)

in which 119879(119894119895)

denotes the total solving time parallelized by119894 MPI nodes with 119895 OpenMP threads for each node Theresults are presented in Figure 5 We can clearly see thatboth pure MPI parallelization and heavy OpenMP hybridparallelization have poorer efficiency compared to that ofmoderate OpenMP hybrid parallelization

Ideally larger119875119872 should give better efficiency for biggercases But in reality since larger 119875119872 can result in morefrequent MPI communication more time is required forcommunication and synchronization and so forth Besideslarge119875 also impairs the low complexity feature ofH-matrices

1 2 4 8 16 32 64 12803

04

05

06

07

08

09

10

Para

lleliz

atio

n effi

cien

cy (120578

)

Number of processors

R = 25120582

R = 50120582

R = 100120582

Figure 4 Parallelization efficiency for solving scattering problemsof PEC sphere of different electric sizes

512 nodes 1 thread per node128 nodes 4 threads per node64 nodes 8 threads per node32 nodes 16 threads per node

02

03

04

05

06

07

08Pa

ralle

lizat

ion

effici

ency

(120578)

N

214 215 216 217 218 219

Figure 5 Parallelization efficiency of differentMPI nodesOpenMPthreads allocation

method regardless of the size of problems Therefore pureMPI parallelization does not show superiority over hybridparallelization in our test Referring to the optimal numberof OpenMP threads per node it is mainly determined by thesize of smallest blocks in H-matrices which is preset sinceOpenMP efficiency strongly relates to the parallelization offull or LR blocks addition and multiplication at low levelsFor all cases of this test the size of smallest blocks is between50 and 200 so their optimal number of OpenMP threadsper node turns out to be 4 However we can expect theoptimal number of OpenMP threads to be larger (smaller)if the size of the smallest blocks is larger (smaller) Forthe case of solving PEC sphere of 119877 = 100120582 with 128

8 International Journal of Antennas and Propagation

28

210

212

214

216

Mem

ory

cost

(MB)

512 nodes 1 thread per node128 nodes 4 threads per node64 nodes 8 threads per node32 nodes 16 threads per node

O (N15)

N

214 215 216 217 218 219

O (N43logN)

Figure 6 Memory cost for different MPI nodesOpenMP threadsproportion

nodes 4 threads per node the parallelization efficiency in thistest is 0737 which is higher than that of 0562 in previoustest This is because by our new definition of efficiency theOpenMP parallelization is taken into account These resultsalso indicate the superiority ofMPI-OpenMP hybrid paralle-lization over pure MPI parallelization

Figures 6 and 7 show the memory cost and solvingtime for different problem sizes under different MPI nodesOpenMP threads proportion The memory cost complexitygenerally fits the 119874(11987315) trend which has also been demon-strated in sequential FDS [15 17 19] Since the parallelizationefficiency gradually rises along with the increase of problemsize as shown above the solving time complexity is about119874(119873184) instead of 119874(1198732) exhibited in sequential scenario

[15 17 19]

52 PEC Sphere In this part we test the accuracy andefficiency of our proposed parallel FDS and compare itsresults to analysis solutions A scattering problem of one PECsphere with radius 119877 = 300120582 is solved After discretizationthe total number of unknowns of this case is 3918612For hybrid parallelizing 128MPI nodes are employed with4 OpenMP threads for each node The building time forforward systemmatrix is 43187 secondsH-LU factorizationtime is 1572302 seconds and solving time for each excitation(RHS) is only 366 seconds The total memory cost is 9871GBytes The numerical bistatic RCS results agree with theanalytical Mie series very well as shown in Figure 8

53 Airplane Model Finally we test our solver with a morerealistic case The monostatic RCS of an airplane model withthe dimension of 6084120582times3500120582times1625120582 is calculated Afterdiscretization the total number of unknowns is 3654894The building time for forward system matrix is 35485

N

214 215 216 218 21921722

24

26

28

210

212

512 nodes 1 thread per node128 nodes 4 threads per node64 nodes 8 threads per node32 nodes 16 threads per node

O (N2)

O (N184)

Solv

ing

time (

s)

Figure 7 Solving time for different MPI nodesOpenMP threadsproportion

0 20 40 60 80 100 120 140 160 180minus20

minus10

0

10

20

30

40

50

Bist

atic

RCS

(dBs

m)

120579

Mie series

155 160 165 170 175 180

Bist

atic

RCS

(dBs

m)

minus20minus10

01020304050

Radius = 30120582

N = 3918612

MPI-OpenMP parallelℋ-LU direct solver

Figure 8 Bistatic RCS of PEC sphere with 119877 = 300120582 solved byproposed hybrid parallel FDS

seconds H-LU factorization time is 1104534 seconds andthe peak memory cost is 7243 GBytes With the LU form ofthe system matrix we directly calculate the monostatic RCSfor 10000 total vertical incident angles (RHS) by backwardsubstitution The average time cost for calculating each inci-dent angle is 312 seconds The RCS result is compared withthat obtained by FMM iterative solver of 720 incident anglesAlthough FMM iterative solver only costs 1124 GBytes formemory the average iteration time of solving back scatteringfor each angle is 9387 seconds with SAI preconditioner [23]which is not practical for solving thousands of RHS FromFigure 9 we can see that these two results agree with eachother well

International Journal of Antennas and Propagation 9

XY

Z

0 20 40 60 80 100 120 140 160 180

minus10

0

10

20

30

40

50

60

70

80

Mon

osta

tic R

CS (d

Bsm

)

MPI-OpenMP parallel

FMM iterative solver

Sweeping 120593

120593

ℋ-LU direct solver

XXYYYYYYY

ZZ

Figure 9 Monostatic RCS of airplane model solved by proposedhybrid parallel FDS and FMM iterative solver

6 Conclusion

In this paper we present an H-matrices based fast directsolver accelerated by both MPI multinodes and OpenMPmultithreads parallel techniquesThemacrostructural imple-mentation ofH-matrices method andH-LU factorization isparallelized by MPI while the microalgebraic computationfor matrix blocks and vector segments is parallelized byOpenMP Despite the sequential nature ofH-matrices directsolving procedure this proposed hybrid parallel strategyshows good parallelization efficiency Numerical results alsodemonstrate excellent accuracy and superiority in solvingmassive excitations (RHS) problem for this parallel directsolver

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This work is supported by the China Scholarship Council(CSC) the Fundamental Research Funds for the Central Uni-versities of China (E022050205) the NSFC under Grant61271033 the Programmeof IntroducingTalents ofDisciplineto Universities under Grant b07046 and the National Excel-lent Youth Foundation (NSFC no 61425010)

References

[1] A Taflve Computational Electrodynamics The Finite-DifferenceTime-Domain Method Artech House Norwood Mass USA1995

[2] J-M JinThe Finite Element Method in Electromagnetics WileyNew York NY USA 2002

[3] R F Harrington Field Computation by Moment Methods Mac-Millan New York NY USA 1968

[4] R Coifman V Rokhlin and S Wandzura ldquoThe fast multiplemethod for the wave equation a pedestrian prescriptionrdquo IEEEAntennas and Propagation Magazine vol 35 no 3 pp 7ndash121993

[5] J Song C C Lu and W C Chew ldquoMultilevel fast multipolealgorithm for electromagnetic scattering by large complexobjectsrdquo IEEE Transactions on Antennas and Propagation vol45 no 10 pp 1488ndash1493 1997

[6] J Hu Z P Nie J Wang G X Zou and J Hu ldquoMultilevel fastmultipole algorithm for solving scattering from 3-D electricallylarge objectrdquo Chinese Journal of Radio Science vol 19 no 5 pp509ndash524 2004

[7] E Michielssen and A Boag ldquoA multilevel matrix decomposi-tion algorithm for analyzing scattering from large structuresrdquoIEEE Transactions on Antennas and Propagation vol 44 no 8pp 1086ndash1093 1996

[8] J M Rius J Parron E Ubeda and J R Mosig ldquoMultilevelmatrix decomposition algorithm for analysis of electricallylarge electromagnetic problems in 3-DrdquoMicrowave and OpticalTechnology Letters vol 22 no 3 pp 177ndash182 1999

[9] H Guo J Hu and E Michielssen ldquoOn MLMDAbutterflycompressibility of inverse integral operatorsrdquo IEEE Antennasand Wireless Propagation Letters vol 12 pp 31ndash34 2013

[10] E Bleszynski M Bleszynski and T Jaroszewicz ldquoAIM adaptiveintegral method for solving large-scale electromagnetic scatter-ing and radiation problemsrdquo Radio Science vol 31 no 5 pp1225ndash1251 1996

[11] W Hackbusch and B Khoromskij ldquoA sparse matrix arithmeticbased on H-matrices Part I Introduction to H-matricesrdquoComputing vol 62 pp 89ndash108 1999

[12] S Borm L Grasedyck and W Hackbusch Hierarchical Matri-ces Lecture Note 21 Max Planck Institute for Mathematics inthe Sciences 2003

[13] W Chai and D Jiao ldquoAn calH2-matrix-based integral-equationsolver of reduced complexity and controlled accuracy for solv-ing electrodynamic problemsrdquo IEEE Transactions on Antennasand Propagation vol 57 no 10 pp 3147ndash3159 2009

[14] W Chai and D Jiao ldquoH- and H2-matrix-based fast integral-equation solvers for large-scale electromagnetic analysisrdquo IETMicrowaves Antennas amp Propagation vol 4 no 10 pp 1583ndash1596 2010

[15] H Guo J Hu H Shao and Z Nie ldquoHierarchical matricesmethod and its application in electromagnetic integral equa-tionsrdquo International Journal of Antennas and Propagation vol2012 Article ID 756259 9 pages 2012

[16] A Heldring J M Rius J M Tamayo and J Parron ldquoCom-pressed block-decomposition algorithm for fast capacitanceextractionrdquo IEEE Transactions on Computer-Aided Design ofIntegrated Circuits and Systems vol 27 no 2 pp 265ndash271 2008

[17] A Heldring J M Rius J M Tamayo J Parron and EUbeda ldquoMultiscale compressed block decomposition for fastdirect solution of method of moments linear systemrdquo IEEETransactions on Antennas and Propagation vol 59 no 2 pp526ndash536 2011

[18] K Zhao M N Vouvakis and J-F Lee ldquoThe adaptive crossapproximation algorithm for accelerated method of momentscomputations of EMC problemsrdquo IEEE Transactions on Electro-magnetic Compatibility vol 47 no 4 pp 763ndash773 2005

10 International Journal of Antennas and Propagation

[19] J Shaeffer ldquoDirect solve of electrically large integral equationsfor problem sizes to 1 M unknownsrdquo IEEE Transactions onAntennas and Propagation vol 56 no 8 pp 2306ndash2313 2008

[20] J Song ldquoMultilevel fastmultipole algorithm for electromagneticscattering by large complex objectsrdquo IEEE Transactions onAntennas and Propagation vol 45 no 10 pp 1488ndash1493 1997

[21] M Benzi ldquoPreconditioning techniques for large linear systemsa surveyrdquo Journal of Computational Physics vol 182 no 2 pp418ndash477 2002

[22] J Lee J Zhang and C-C Lu ldquoIncomplete LU preconditioningfor large scale dense complex linear systems from electro-magnetic wave scattering problemsrdquo Journal of ComputationalPhysics vol 185 no 1 pp 158ndash175 2003

[23] J Lee J Zhang and C-C Lu ldquoSparse inverse preconditioningof multilevel fast multipole algorithm for hybrid integral equa-tions in electromagneticsrdquo IEEE Transactions on Antennas andPropagation vol 52 no 9 pp 2277ndash2287 2004

[24] G H Golub and C F V LoanMatrix Computations The JohnsHopkins University Press Baltimore Md USA 1996

[25] R J Adams Y Xu X Xu J-s Choi S D Gedney and F XCanning ldquoModular fast direct electromagnetic analysis usinglocal-global solution modesrdquo IEEE Transactions on Antennasand Propagation vol 56 no 8 pp 2427ndash2441 2008

[26] Y Zhang and T K Sarkar Parallel Solution of Integral Equation-Based EM Problems in the Frequency Domain John Wiley ampSons Hoboken NJ USA 2009

[27] J-Y Peng H-X Zhou K-L Zheng et al ldquoA shared memory-based parallel out-of-core LU solver for matrix equations withapplication in EM problemsrdquo in Proceedings of the 1st Interna-tional Conference onComputational Problem-Solving (ICCP rsquo10)pp 149ndash152 December 2010

[28] M Bebendorf ldquoHierarchical LU decomposition-based precon-ditioners for BEMrdquoComputing vol 74 no 3 pp 225ndash247 2005

[29] L Dagum and R Menon ldquoOpenMP an industry standardAPI for shared-memory programmingrdquo IEEE ComputationalScience amp Engineering vol 5 no 1 pp 46ndash55 1998

[30] WGropp E Lusk andA SkjellumUsingMPI Portable ParallelProgrammingwith theMessage-Passing Interface vol 1TheMITPress 1999

[31] S M Rao D R Wilton and A W Glisson ldquoElectromagneticscattering by surfaces of ar-bitrary shaperdquo IEEE Transactions onAntennas and Propagation vol 30 no 3 pp 409ndash418 1982

[32] A W Glisson and D R Wilton ldquoSimple and efficient numer-ical methods for problems of electromagnetic radiation andscattering from surfacesrdquo IEEE Transactions on Antennas andPropagation vol 28 no 5 pp 593ndash603 1980

[33] R Barrett M Berry T F Chan et al Templates for the Solutionof Linear Systems Building Blocks for Iterative Methods vol 43SIAM Philadelphia Pa USA 1994

[34] E Anderson Z Bai C Bischof et al LAPACKUsersrsquo Guide vol9 SAIM 1999

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

Page 7: Research Article An MPI-OpenMP Hybrid Parallel -LU Direct ...downloads.hindawi.com/journals/ijap/2015/615743.pdf · this paper, we proposed an MPI-OpenMP hybrid parallel strategy

International Journal of Antennas and Propagation 7

subblocks and causeH-matrices structure to be flatter whichwill consequently impair the low complexity feature of H-matrices method

Similarly the number of OpenMP threads119872 has an opti-mal value too Too small119872will restrict OpenMP paralleliza-tion and too large 119872 is also a waste because the intranodeparallel computing only deals with algebraic arithmetic forsubblocks of relatively small size excessive threads will notimprove its acceleration and lower the efficiency

5 Numerical Results

The cluster we test our code on has 32 physical nodes in totaleach node has 64 GBytes of memory and four quad-core IntelXeon E5-2670 processors with 260GHz clock rate For alltested IE cases the discrete mesh size is 01120582 in which 120582 isthe wavelength

51 Parallelization Efficiency First we test the parallelizationscalability for solving scattering problem of PEC sphereswith different electric sizes The radii of these spheres are119877 = 25120582 119877 = 50120582 and 119877 = 100120582 respectivelyThe total numbers of unknowns for these three cases are27234 108726 and 435336 respectively Figure 4 shows theefficiency for the total time solved by 1 2 4 8 16 32 64 and128MPI nodes with 4 OpenMP threads for each node Thedefinition of parallelization efficiency is the same as formula(17) thus in this test we only consider the efficiency of pureMPI parallelization Unsurprisingly for parallelization withlarge number of nodes bigger cases have higher efficienciesbecause of better load balance and less distortion to the H-matrices structure Besides since the H-LU factorizationdominates the solving time consumption these tested effi-ciencies of bigger cases are close to the maximum efficiencyof theoretical one 120578LU

119875= 23 which demonstrates the good

parallelizing quality of our codeNext we investigate the hybrid parallelization efficiency

for different MPI nodes (119875)OpenMP threads (119872) propor-tion We employ all 512 cores in cluster to solve the 3 PECsphere cases above in four scenarios from pure MPI paral-lelization (1 OpenMP thread for each MPI node large 119875119872)to heavy OpenMP parallelization embedded in MPI (16OpenMP threads for each MPI node small 119875119872) Here weslightly change the definition of parallelization efficiency as

120578(119875119872)

=119879(11)

119875 sdot 119872 sdot 119879(119875119872)

(23)

in which 119879(119894119895)

denotes the total solving time parallelized by119894 MPI nodes with 119895 OpenMP threads for each node Theresults are presented in Figure 5 We can clearly see thatboth pure MPI parallelization and heavy OpenMP hybridparallelization have poorer efficiency compared to that ofmoderate OpenMP hybrid parallelization

Ideally larger119875119872 should give better efficiency for biggercases But in reality since larger 119875119872 can result in morefrequent MPI communication more time is required forcommunication and synchronization and so forth Besideslarge119875 also impairs the low complexity feature ofH-matrices

1 2 4 8 16 32 64 12803

04

05

06

07

08

09

10

Para

lleliz

atio

n effi

cien

cy (120578

)

Number of processors

R = 25120582

R = 50120582

R = 100120582

Figure 4 Parallelization efficiency for solving scattering problemsof PEC sphere of different electric sizes

512 nodes 1 thread per node128 nodes 4 threads per node64 nodes 8 threads per node32 nodes 16 threads per node

02

03

04

05

06

07

08Pa

ralle

lizat

ion

effici

ency

(120578)

N

214 215 216 217 218 219

Figure 5 Parallelization efficiency of differentMPI nodesOpenMPthreads allocation

method regardless of the size of problems Therefore pureMPI parallelization does not show superiority over hybridparallelization in our test Referring to the optimal numberof OpenMP threads per node it is mainly determined by thesize of smallest blocks in H-matrices which is preset sinceOpenMP efficiency strongly relates to the parallelization offull or LR blocks addition and multiplication at low levelsFor all cases of this test the size of smallest blocks is between50 and 200 so their optimal number of OpenMP threadsper node turns out to be 4 However we can expect theoptimal number of OpenMP threads to be larger (smaller)if the size of the smallest blocks is larger (smaller) Forthe case of solving PEC sphere of 119877 = 100120582 with 128

8 International Journal of Antennas and Propagation

28

210

212

214

216

Mem

ory

cost

(MB)

512 nodes 1 thread per node128 nodes 4 threads per node64 nodes 8 threads per node32 nodes 16 threads per node

O (N15)

N

214 215 216 217 218 219

O (N43logN)

Figure 6 Memory cost for different MPI nodesOpenMP threadsproportion

nodes 4 threads per node the parallelization efficiency in thistest is 0737 which is higher than that of 0562 in previoustest This is because by our new definition of efficiency theOpenMP parallelization is taken into account These resultsalso indicate the superiority ofMPI-OpenMP hybrid paralle-lization over pure MPI parallelization

Figures 6 and 7 show the memory cost and solvingtime for different problem sizes under different MPI nodesOpenMP threads proportion The memory cost complexitygenerally fits the 119874(11987315) trend which has also been demon-strated in sequential FDS [15 17 19] Since the parallelizationefficiency gradually rises along with the increase of problemsize as shown above the solving time complexity is about119874(119873184) instead of 119874(1198732) exhibited in sequential scenario

[15 17 19]

52 PEC Sphere In this part we test the accuracy andefficiency of our proposed parallel FDS and compare itsresults to analysis solutions A scattering problem of one PECsphere with radius 119877 = 300120582 is solved After discretizationthe total number of unknowns of this case is 3918612For hybrid parallelizing 128MPI nodes are employed with4 OpenMP threads for each node The building time forforward systemmatrix is 43187 secondsH-LU factorizationtime is 1572302 seconds and solving time for each excitation(RHS) is only 366 seconds The total memory cost is 9871GBytes The numerical bistatic RCS results agree with theanalytical Mie series very well as shown in Figure 8

53 Airplane Model Finally we test our solver with a morerealistic case The monostatic RCS of an airplane model withthe dimension of 6084120582times3500120582times1625120582 is calculated Afterdiscretization the total number of unknowns is 3654894The building time for forward system matrix is 35485

N

214 215 216 218 21921722

24

26

28

210

212

512 nodes 1 thread per node128 nodes 4 threads per node64 nodes 8 threads per node32 nodes 16 threads per node

O (N2)

O (N184)

Solv

ing

time (

s)

Figure 7 Solving time for different MPI nodesOpenMP threadsproportion

0 20 40 60 80 100 120 140 160 180minus20

minus10

0

10

20

30

40

50

Bist

atic

RCS

(dBs

m)

120579

Mie series

155 160 165 170 175 180

Bist

atic

RCS

(dBs

m)

minus20minus10

01020304050

Radius = 30120582

N = 3918612

MPI-OpenMP parallelℋ-LU direct solver

Figure 8 Bistatic RCS of PEC sphere with 119877 = 300120582 solved byproposed hybrid parallel FDS

seconds H-LU factorization time is 1104534 seconds andthe peak memory cost is 7243 GBytes With the LU form ofthe system matrix we directly calculate the monostatic RCSfor 10000 total vertical incident angles (RHS) by backwardsubstitution The average time cost for calculating each inci-dent angle is 312 seconds The RCS result is compared withthat obtained by FMM iterative solver of 720 incident anglesAlthough FMM iterative solver only costs 1124 GBytes formemory the average iteration time of solving back scatteringfor each angle is 9387 seconds with SAI preconditioner [23]which is not practical for solving thousands of RHS FromFigure 9 we can see that these two results agree with eachother well

International Journal of Antennas and Propagation 9

XY

Z

0 20 40 60 80 100 120 140 160 180

minus10

0

10

20

30

40

50

60

70

80

Mon

osta

tic R

CS (d

Bsm

)

MPI-OpenMP parallel

FMM iterative solver

Sweeping 120593

120593

ℋ-LU direct solver

XXYYYYYYY

ZZ

Figure 9 Monostatic RCS of airplane model solved by proposedhybrid parallel FDS and FMM iterative solver

6 Conclusion

In this paper we present an H-matrices based fast directsolver accelerated by both MPI multinodes and OpenMPmultithreads parallel techniquesThemacrostructural imple-mentation ofH-matrices method andH-LU factorization isparallelized by MPI while the microalgebraic computationfor matrix blocks and vector segments is parallelized byOpenMP Despite the sequential nature ofH-matrices directsolving procedure this proposed hybrid parallel strategyshows good parallelization efficiency Numerical results alsodemonstrate excellent accuracy and superiority in solvingmassive excitations (RHS) problem for this parallel directsolver

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This work is supported by the China Scholarship Council(CSC) the Fundamental Research Funds for the Central Uni-versities of China (E022050205) the NSFC under Grant61271033 the Programmeof IntroducingTalents ofDisciplineto Universities under Grant b07046 and the National Excel-lent Youth Foundation (NSFC no 61425010)

References

[1] A Taflve Computational Electrodynamics The Finite-DifferenceTime-Domain Method Artech House Norwood Mass USA1995

[2] J-M JinThe Finite Element Method in Electromagnetics WileyNew York NY USA 2002

[3] R F Harrington Field Computation by Moment Methods Mac-Millan New York NY USA 1968

[4] R Coifman V Rokhlin and S Wandzura ldquoThe fast multiplemethod for the wave equation a pedestrian prescriptionrdquo IEEEAntennas and Propagation Magazine vol 35 no 3 pp 7ndash121993

[5] J Song C C Lu and W C Chew ldquoMultilevel fast multipolealgorithm for electromagnetic scattering by large complexobjectsrdquo IEEE Transactions on Antennas and Propagation vol45 no 10 pp 1488ndash1493 1997

[6] J Hu Z P Nie J Wang G X Zou and J Hu ldquoMultilevel fastmultipole algorithm for solving scattering from 3-D electricallylarge objectrdquo Chinese Journal of Radio Science vol 19 no 5 pp509ndash524 2004

[7] E Michielssen and A Boag ldquoA multilevel matrix decomposi-tion algorithm for analyzing scattering from large structuresrdquoIEEE Transactions on Antennas and Propagation vol 44 no 8pp 1086ndash1093 1996

[8] J M Rius J Parron E Ubeda and J R Mosig ldquoMultilevelmatrix decomposition algorithm for analysis of electricallylarge electromagnetic problems in 3-DrdquoMicrowave and OpticalTechnology Letters vol 22 no 3 pp 177ndash182 1999

[9] H Guo J Hu and E Michielssen ldquoOn MLMDAbutterflycompressibility of inverse integral operatorsrdquo IEEE Antennasand Wireless Propagation Letters vol 12 pp 31ndash34 2013

[10] E Bleszynski M Bleszynski and T Jaroszewicz ldquoAIM adaptiveintegral method for solving large-scale electromagnetic scatter-ing and radiation problemsrdquo Radio Science vol 31 no 5 pp1225ndash1251 1996

[11] W Hackbusch and B Khoromskij ldquoA sparse matrix arithmeticbased on H-matrices Part I Introduction to H-matricesrdquoComputing vol 62 pp 89ndash108 1999

[12] S Borm L Grasedyck and W Hackbusch Hierarchical Matri-ces Lecture Note 21 Max Planck Institute for Mathematics inthe Sciences 2003

[13] W Chai and D Jiao ldquoAn calH2-matrix-based integral-equationsolver of reduced complexity and controlled accuracy for solv-ing electrodynamic problemsrdquo IEEE Transactions on Antennasand Propagation vol 57 no 10 pp 3147ndash3159 2009

[14] W Chai and D Jiao ldquoH- and H2-matrix-based fast integral-equation solvers for large-scale electromagnetic analysisrdquo IETMicrowaves Antennas amp Propagation vol 4 no 10 pp 1583ndash1596 2010

[15] H Guo J Hu H Shao and Z Nie ldquoHierarchical matricesmethod and its application in electromagnetic integral equa-tionsrdquo International Journal of Antennas and Propagation vol2012 Article ID 756259 9 pages 2012

[16] A Heldring J M Rius J M Tamayo and J Parron ldquoCom-pressed block-decomposition algorithm for fast capacitanceextractionrdquo IEEE Transactions on Computer-Aided Design ofIntegrated Circuits and Systems vol 27 no 2 pp 265ndash271 2008

[17] A Heldring J M Rius J M Tamayo J Parron and EUbeda ldquoMultiscale compressed block decomposition for fastdirect solution of method of moments linear systemrdquo IEEETransactions on Antennas and Propagation vol 59 no 2 pp526ndash536 2011

[18] K Zhao M N Vouvakis and J-F Lee ldquoThe adaptive crossapproximation algorithm for accelerated method of momentscomputations of EMC problemsrdquo IEEE Transactions on Electro-magnetic Compatibility vol 47 no 4 pp 763ndash773 2005

10 International Journal of Antennas and Propagation

[19] J Shaeffer ldquoDirect solve of electrically large integral equationsfor problem sizes to 1 M unknownsrdquo IEEE Transactions onAntennas and Propagation vol 56 no 8 pp 2306ndash2313 2008

[20] J Song ldquoMultilevel fastmultipole algorithm for electromagneticscattering by large complex objectsrdquo IEEE Transactions onAntennas and Propagation vol 45 no 10 pp 1488ndash1493 1997

[21] M Benzi ldquoPreconditioning techniques for large linear systemsa surveyrdquo Journal of Computational Physics vol 182 no 2 pp418ndash477 2002

[22] J Lee J Zhang and C-C Lu ldquoIncomplete LU preconditioningfor large scale dense complex linear systems from electro-magnetic wave scattering problemsrdquo Journal of ComputationalPhysics vol 185 no 1 pp 158ndash175 2003

[23] J Lee J Zhang and C-C Lu ldquoSparse inverse preconditioningof multilevel fast multipole algorithm for hybrid integral equa-tions in electromagneticsrdquo IEEE Transactions on Antennas andPropagation vol 52 no 9 pp 2277ndash2287 2004

[24] G H Golub and C F V LoanMatrix Computations The JohnsHopkins University Press Baltimore Md USA 1996

[25] R J Adams Y Xu X Xu J-s Choi S D Gedney and F XCanning ldquoModular fast direct electromagnetic analysis usinglocal-global solution modesrdquo IEEE Transactions on Antennasand Propagation vol 56 no 8 pp 2427ndash2441 2008

[26] Y Zhang and T K Sarkar Parallel Solution of Integral Equation-Based EM Problems in the Frequency Domain John Wiley ampSons Hoboken NJ USA 2009

[27] J-Y Peng H-X Zhou K-L Zheng et al ldquoA shared memory-based parallel out-of-core LU solver for matrix equations withapplication in EM problemsrdquo in Proceedings of the 1st Interna-tional Conference onComputational Problem-Solving (ICCP rsquo10)pp 149ndash152 December 2010

[28] M Bebendorf ldquoHierarchical LU decomposition-based precon-ditioners for BEMrdquoComputing vol 74 no 3 pp 225ndash247 2005

[29] L Dagum and R Menon ldquoOpenMP an industry standardAPI for shared-memory programmingrdquo IEEE ComputationalScience amp Engineering vol 5 no 1 pp 46ndash55 1998

[30] WGropp E Lusk andA SkjellumUsingMPI Portable ParallelProgrammingwith theMessage-Passing Interface vol 1TheMITPress 1999

[31] S M Rao D R Wilton and A W Glisson ldquoElectromagneticscattering by surfaces of ar-bitrary shaperdquo IEEE Transactions onAntennas and Propagation vol 30 no 3 pp 409ndash418 1982

[32] A W Glisson and D R Wilton ldquoSimple and efficient numer-ical methods for problems of electromagnetic radiation andscattering from surfacesrdquo IEEE Transactions on Antennas andPropagation vol 28 no 5 pp 593ndash603 1980

[33] R Barrett M Berry T F Chan et al Templates for the Solutionof Linear Systems Building Blocks for Iterative Methods vol 43SIAM Philadelphia Pa USA 1994

[34] E Anderson Z Bai C Bischof et al LAPACKUsersrsquo Guide vol9 SAIM 1999

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

Page 8: Research Article An MPI-OpenMP Hybrid Parallel -LU Direct ...downloads.hindawi.com/journals/ijap/2015/615743.pdf · this paper, we proposed an MPI-OpenMP hybrid parallel strategy

8 International Journal of Antennas and Propagation

28

210

212

214

216

Mem

ory

cost

(MB)

512 nodes 1 thread per node128 nodes 4 threads per node64 nodes 8 threads per node32 nodes 16 threads per node

O (N15)

N

214 215 216 217 218 219

O (N43logN)

Figure 6 Memory cost for different MPI nodesOpenMP threadsproportion

nodes 4 threads per node the parallelization efficiency in thistest is 0737 which is higher than that of 0562 in previoustest This is because by our new definition of efficiency theOpenMP parallelization is taken into account These resultsalso indicate the superiority ofMPI-OpenMP hybrid paralle-lization over pure MPI parallelization

Figures 6 and 7 show the memory cost and solvingtime for different problem sizes under different MPI nodesOpenMP threads proportion The memory cost complexitygenerally fits the 119874(11987315) trend which has also been demon-strated in sequential FDS [15 17 19] Since the parallelizationefficiency gradually rises along with the increase of problemsize as shown above the solving time complexity is about119874(119873184) instead of 119874(1198732) exhibited in sequential scenario

[15 17 19]

52 PEC Sphere In this part we test the accuracy andefficiency of our proposed parallel FDS and compare itsresults to analysis solutions A scattering problem of one PECsphere with radius 119877 = 300120582 is solved After discretizationthe total number of unknowns of this case is 3918612For hybrid parallelizing 128MPI nodes are employed with4 OpenMP threads for each node The building time forforward systemmatrix is 43187 secondsH-LU factorizationtime is 1572302 seconds and solving time for each excitation(RHS) is only 366 seconds The total memory cost is 9871GBytes The numerical bistatic RCS results agree with theanalytical Mie series very well as shown in Figure 8

53 Airplane Model Finally we test our solver with a morerealistic case The monostatic RCS of an airplane model withthe dimension of 6084120582times3500120582times1625120582 is calculated Afterdiscretization the total number of unknowns is 3654894The building time for forward system matrix is 35485

N

214 215 216 218 21921722

24

26

28

210

212

512 nodes 1 thread per node128 nodes 4 threads per node64 nodes 8 threads per node32 nodes 16 threads per node

O (N2)

O (N184)

Solv

ing

time (

s)

Figure 7 Solving time for different MPI nodesOpenMP threadsproportion

0 20 40 60 80 100 120 140 160 180minus20

minus10

0

10

20

30

40

50

Bist

atic

RCS

(dBs

m)

120579

Mie series

155 160 165 170 175 180

Bist

atic

RCS

(dBs

m)

minus20minus10

01020304050

Radius = 30120582

N = 3918612

MPI-OpenMP parallelℋ-LU direct solver

Figure 8 Bistatic RCS of PEC sphere with 119877 = 300120582 solved byproposed hybrid parallel FDS

seconds H-LU factorization time is 1104534 seconds andthe peak memory cost is 7243 GBytes With the LU form ofthe system matrix we directly calculate the monostatic RCSfor 10000 total vertical incident angles (RHS) by backwardsubstitution The average time cost for calculating each inci-dent angle is 312 seconds The RCS result is compared withthat obtained by FMM iterative solver of 720 incident anglesAlthough FMM iterative solver only costs 1124 GBytes formemory the average iteration time of solving back scatteringfor each angle is 9387 seconds with SAI preconditioner [23]which is not practical for solving thousands of RHS FromFigure 9 we can see that these two results agree with eachother well


Figure 9: Monostatic RCS (dBsm) of the airplane model versus the sweep angle φ from 0° to 180°, solved by the proposed hybrid parallel H-LU direct solver and by the FMM iterative solver.

6. Conclusion

In this paper we present an H-matrices-based fast direct solver accelerated by both MPI multinode and OpenMP multithread parallel techniques. The macrostructural implementation of the H-matrices method and the H-LU factorization is parallelized by MPI, while the microalgebraic computation on matrix blocks and vector segments is parallelized by OpenMP. Despite the sequential nature of the H-matrices direct solving procedure, the proposed hybrid parallel strategy shows good parallelization efficiency. Numerical results also demonstrate excellent accuracy and the superiority of this parallel direct solver for problems with massive excitations (RHS).
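As a structural illustration of the two-level strategy summarized above, the sketch below shows how MPI ranks can own whole matrix blocks while OpenMP threads parallelize the arithmetic inside each block. It is a generic skeleton under assumed names (process_block, the block counts, and the trivial kernel are placeholders), not the paper's H-LU implementation.

```cpp
#include <mpi.h>
#include <omp.h>
#include <cstdio>
#include <vector>

// Microalgebraic level: OpenMP threads parallelize the per-block dense work.
static void process_block(std::vector<double>& block)
{
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(block.size()); ++i)
        block[i] *= 2.0;   // stand-in for a dense kernel (GEMM, TRSM, ...)
}

int main(int argc, char** argv)
{
    int provided = 0;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Macrostructural level: a simple cyclic assignment of blocks to MPI ranks.
    const int num_blocks = 64, block_len = 1 << 16;
    for (int b = rank; b < num_blocks; b += size) {
        std::vector<double> block(block_len, 1.0);
        process_block(block);
    }

    if (rank == 0)
        std::printf("processed %d blocks on %d ranks, up to %d threads each\n",
                    num_blocks, size, omp_get_max_threads());

    MPI_Finalize();
    return 0;
}
```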

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by the China Scholarship Council (CSC), the Fundamental Research Funds for the Central Universities of China (E022050205), the NSFC under Grant 61271033, the Programme of Introducing Talents of Discipline to Universities under Grant B07046, and the National Excellent Youth Foundation (NSFC no. 61425010).

References

[1] A. Taflove, Computational Electrodynamics: The Finite-Difference Time-Domain Method, Artech House, Norwood, Mass, USA, 1995.
[2] J.-M. Jin, The Finite Element Method in Electromagnetics, Wiley, New York, NY, USA, 2002.
[3] R. F. Harrington, Field Computation by Moment Methods, Macmillan, New York, NY, USA, 1968.
[4] R. Coifman, V. Rokhlin, and S. Wandzura, "The fast multipole method for the wave equation: a pedestrian prescription," IEEE Antennas and Propagation Magazine, vol. 35, no. 3, pp. 7-12, 1993.
[5] J. Song, C. C. Lu, and W. C. Chew, "Multilevel fast multipole algorithm for electromagnetic scattering by large complex objects," IEEE Transactions on Antennas and Propagation, vol. 45, no. 10, pp. 1488-1493, 1997.
[6] J. Hu, Z. P. Nie, J. Wang, G. X. Zou, and J. Hu, "Multilevel fast multipole algorithm for solving scattering from 3-D electrically large object," Chinese Journal of Radio Science, vol. 19, no. 5, pp. 509-524, 2004.
[7] E. Michielssen and A. Boag, "A multilevel matrix decomposition algorithm for analyzing scattering from large structures," IEEE Transactions on Antennas and Propagation, vol. 44, no. 8, pp. 1086-1093, 1996.
[8] J. M. Rius, J. Parron, E. Ubeda, and J. R. Mosig, "Multilevel matrix decomposition algorithm for analysis of electrically large electromagnetic problems in 3-D," Microwave and Optical Technology Letters, vol. 22, no. 3, pp. 177-182, 1999.
[9] H. Guo, J. Hu, and E. Michielssen, "On MLMDA/butterfly compressibility of inverse integral operators," IEEE Antennas and Wireless Propagation Letters, vol. 12, pp. 31-34, 2013.
[10] E. Bleszynski, M. Bleszynski, and T. Jaroszewicz, "AIM: adaptive integral method for solving large-scale electromagnetic scattering and radiation problems," Radio Science, vol. 31, no. 5, pp. 1225-1251, 1996.
[11] W. Hackbusch and B. Khoromskij, "A sparse matrix arithmetic based on H-matrices. Part I: Introduction to H-matrices," Computing, vol. 62, pp. 89-108, 1999.
[12] S. Börm, L. Grasedyck, and W. Hackbusch, Hierarchical Matrices, Lecture Note 21, Max Planck Institute for Mathematics in the Sciences, 2003.
[13] W. Chai and D. Jiao, "An H²-matrix-based integral-equation solver of reduced complexity and controlled accuracy for solving electrodynamic problems," IEEE Transactions on Antennas and Propagation, vol. 57, no. 10, pp. 3147-3159, 2009.
[14] W. Chai and D. Jiao, "H- and H²-matrix-based fast integral-equation solvers for large-scale electromagnetic analysis," IET Microwaves, Antennas & Propagation, vol. 4, no. 10, pp. 1583-1596, 2010.
[15] H. Guo, J. Hu, H. Shao, and Z. Nie, "Hierarchical matrices method and its application in electromagnetic integral equations," International Journal of Antennas and Propagation, vol. 2012, Article ID 756259, 9 pages, 2012.
[16] A. Heldring, J. M. Rius, J. M. Tamayo, and J. Parron, "Compressed block-decomposition algorithm for fast capacitance extraction," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 27, no. 2, pp. 265-271, 2008.
[17] A. Heldring, J. M. Rius, J. M. Tamayo, J. Parron, and E. Ubeda, "Multiscale compressed block decomposition for fast direct solution of method of moments linear system," IEEE Transactions on Antennas and Propagation, vol. 59, no. 2, pp. 526-536, 2011.
[18] K. Zhao, M. N. Vouvakis, and J.-F. Lee, "The adaptive cross approximation algorithm for accelerated method of moments computations of EMC problems," IEEE Transactions on Electromagnetic Compatibility, vol. 47, no. 4, pp. 763-773, 2005.
[19] J. Shaeffer, "Direct solve of electrically large integral equations for problem sizes to 1 M unknowns," IEEE Transactions on Antennas and Propagation, vol. 56, no. 8, pp. 2306-2313, 2008.
[20] J. Song, "Multilevel fast multipole algorithm for electromagnetic scattering by large complex objects," IEEE Transactions on Antennas and Propagation, vol. 45, no. 10, pp. 1488-1493, 1997.
[21] M. Benzi, "Preconditioning techniques for large linear systems: a survey," Journal of Computational Physics, vol. 182, no. 2, pp. 418-477, 2002.
[22] J. Lee, J. Zhang, and C.-C. Lu, "Incomplete LU preconditioning for large scale dense complex linear systems from electromagnetic wave scattering problems," Journal of Computational Physics, vol. 185, no. 1, pp. 158-175, 2003.
[23] J. Lee, J. Zhang, and C.-C. Lu, "Sparse inverse preconditioning of multilevel fast multipole algorithm for hybrid integral equations in electromagnetics," IEEE Transactions on Antennas and Propagation, vol. 52, no. 9, pp. 2277-2287, 2004.
[24] G. H. Golub and C. F. Van Loan, Matrix Computations, The Johns Hopkins University Press, Baltimore, Md, USA, 1996.
[25] R. J. Adams, Y. Xu, X. Xu, J.-S. Choi, S. D. Gedney, and F. X. Canning, "Modular fast direct electromagnetic analysis using local-global solution modes," IEEE Transactions on Antennas and Propagation, vol. 56, no. 8, pp. 2427-2441, 2008.
[26] Y. Zhang and T. K. Sarkar, Parallel Solution of Integral Equation-Based EM Problems in the Frequency Domain, John Wiley & Sons, Hoboken, NJ, USA, 2009.
[27] J.-Y. Peng, H.-X. Zhou, K.-L. Zheng et al., "A shared memory-based parallel out-of-core LU solver for matrix equations with application in EM problems," in Proceedings of the 1st International Conference on Computational Problem-Solving (ICCP '10), pp. 149-152, December 2010.
[28] M. Bebendorf, "Hierarchical LU decomposition-based preconditioners for BEM," Computing, vol. 74, no. 3, pp. 225-247, 2005.
[29] L. Dagum and R. Menon, "OpenMP: an industry standard API for shared-memory programming," IEEE Computational Science & Engineering, vol. 5, no. 1, pp. 46-55, 1998.
[30] W. Gropp, E. Lusk, and A. Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface, vol. 1, The MIT Press, 1999.
[31] S. M. Rao, D. R. Wilton, and A. W. Glisson, "Electromagnetic scattering by surfaces of arbitrary shape," IEEE Transactions on Antennas and Propagation, vol. 30, no. 3, pp. 409-418, 1982.
[32] A. W. Glisson and D. R. Wilton, "Simple and efficient numerical methods for problems of electromagnetic radiation and scattering from surfaces," IEEE Transactions on Antennas and Propagation, vol. 28, no. 5, pp. 593-603, 1980.
[33] R. Barrett, M. Berry, T. F. Chan et al., Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, vol. 43, SIAM, Philadelphia, Pa, USA, 1994.
[34] E. Anderson, Z. Bai, C. Bischof et al., LAPACK Users' Guide, vol. 9, SIAM, 1999.
