Robust & Efficient Parallel Preconditioning Methods in “Multi-Core Era”



Page 1: Robust & Efficient Parallel Preconditioning Methods in “Multi-Core Era”

October 2008
Integrated Predictive Simulation System for Earthquake and Tsunami Disaster
CREST/Japan Science and Technology Agency (JST), http://www-sold.eps.s.u-tokyo.ac.jp/crest

Robust & Efficient Parallel Preconditioning Methods in “Multi-Core Era”

Introduction
In this work, parallel preconditioning methods based on “Hierarchical Interface Decomposition (HID)” and hybrid parallel programming models were applied to finite-element simulations of linear elasticity problems in media with heterogeneous material properties. Reverse Cuthill-McKee reordering with cyclic multicoloring (CM-RCM) was applied for parallelism through OpenMP. The developed code has been tested on the “T2K Open Super Computer (Todai Combined Cluster)” using up to 512 cores. Preconditioners based on HID provide scalable performance and robustness in comparison to conventional localized block Jacobi preconditioners, and the performance of the Hybrid 4x4 parallel programming model is competitive with that of Flat MPI.

Parallel Programming Models on Multi-core Clusters
In order to achieve minimal parallelization overhead on SMP (symmetric multiprocessor) and multi-core clusters, a multi-level hybrid parallel programming model is often employed. In this model, coarse-grained parallelism is achieved through domain decomposition by message passing among nodes, while fine-grained parallelism is obtained via loop-level parallelism inside each node using compiler-based thread parallelization techniques such as OpenMP. Another often-used programming model is the single-level flat MPI model, in which a separate single-threaded MPI process is executed on each core. Generally, the efficiency of each programming model depends on hardware performance, application features, and problem size.
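The following is a minimal, self-contained sketch of the hybrid model in Fortran with MPI and OpenMP (the program and variable names are hypothetical, not taken from the GeoFEM-based code): message passing among processes provides the coarse-grained level, while an OpenMP parallel loop provides the fine-grained level. Running the same program with one thread per MPI process corresponds to the flat MPI model.

program hybrid_sketch
  use mpi
  use omp_lib
  implicit none
  integer, parameter :: N= 100000            ! local vector length per MPI process
  integer :: ierr, provided, rank, nprocs, i
  real(8) :: local_sum, global_sum
  real(8), allocatable :: x(:)

  ! MPI_THREAD_FUNNELED: only the master thread makes MPI calls
  call MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  allocate(x(N))

  ! fine-grained, loop-level parallelism inside each process (OpenMP)
  local_sum= 0.0d0
!$omp parallel do reduction(+:local_sum)
  do i= 1, N
    x(i)= dble(rank*N + i)
    local_sum= local_sum + x(i)*x(i)
  enddo
!$omp end parallel do

  ! coarse-grained parallelism among processes (message passing)
  call MPI_Allreduce(local_sum, global_sum, 1, MPI_DOUBLE_PRECISION, &
                     MPI_SUM, MPI_COMM_WORLD, ierr)

  if (rank == 0) write(*,*) 'processes=', nprocs, &
                 ' threads/process=', omp_get_max_threads(), ' ||x||**2=', global_sum

  deallocate(x)
  call MPI_Finalize(ierr)
end program hybrid_sketch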

Fig.1 Hybrid and Flat MPI Parallel Programming Models


Target Application
In the present work, linear elasticity problems in simple cube geometries of media with heterogeneous material properties are solved using a parallel finite-element method (FEM). Poisson’s ratio is set to 0.25 for all elements, while a heterogeneous distribution of Young’s modulus in each element is calculated by a sequential Gauss algorithm, which is widely used in the area of geostatistics. The minimum and maximum values of Young’s modulus are 10^-3 and 10^3, respectively, with an average value of 1.0. The GPBi-CG (Generalized Product-type method based on Bi-CG) solver with an SGS (Symmetric Gauss-Seidel) preconditioner (SGS/GPBi-CG) was applied. The code is based on the framework for parallel FEM procedures of GeoFEM, and GeoFEM’s local data structure is applied.
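As a rough illustration of the kind of material field used (this is a simplified stand-in for illustration only, not the sequential Gauss algorithm of the original work), the following sketch assigns each element an independent lognormal Young’s modulus clamped to the range [10^-3, 10^3], with Poisson’s ratio fixed at 0.25:

subroutine set_material(NE, young, poisson)
  implicit none
  integer, intent(in)  :: NE                    ! number of elements
  real(8), intent(out) :: young(NE), poisson(NE)
  real(8), parameter :: PI= 3.141592653589793d0
  real(8) :: u1, u2, z
  integer :: ie

  do ie= 1, NE
    call random_number(u1)
    call random_number(u2)
    u1= max(u1, 1.0d-12)                        ! avoid log(0)
    z = sqrt(-2.0d0*log(u1)) * cos(2.0d0*PI*u2) ! standard normal (Box-Muller)
    young(ie)= 10.0d0**z                        ! lognormal, median 1.0
    young(ie)= min(max(young(ie), 1.0d-3), 1.0d3)  ! clamp to [10^-3, 10^3]
    poisson(ie)= 0.25d0                         ! constant Poisson's ratio
  enddo
end subroutine set_material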

Fig.2 Heterogeneous Distribution of Material Property

HID (Hierarchical Interface Decomposition)
Localized block Jacobi ILU/IC preconditioners are widely used for parallel iterative solvers. They provide excellent parallel performance for well-defined problems, although the number of iterations required for convergence gradually increases with the number of processors. However, this preconditioning technique is not robust for ill-conditioned problems with many processors, because it ignores the global effect of external nodes in other domains.
The Parallel Hierarchical Interface Decomposition Algorithm (PHIDAL) [Henon & Saad 2007] provides robustness and scalability for parallel ILU/IC preconditioners. PHIDAL is based on defining a “hierarchical interface decomposition (HID)”. The HID process starts with a partitioning of the graph, with one layer of overlap. The “levels” are defined from this partitioning, with each level consisting of a set of vertex groups. Each vertex group of a given level is a separator for vertex groups of a lower level. (to be continued to the back page)


Page 2: Robust & Efficient Parallel Preconditioning Methods in “Multi-Core Era”


HID (Hierarchical Interface Decomposition) (cont.)
If the unknowns are reordered according to their level numbers, from lowest to highest, the block structure of the reordered matrix is as shown in Fig.3. This block structure leads to natural parallelism when ILU/IC factorizations or forward/backward substitutions are applied.
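A minimal sketch of this level-based renumbering is given below (hypothetical array names; it assumes the level of each vertex, e.g. the number of subdomains sharing it in the decomposition of Fig.3, is already known):

subroutine hid_reorder(N, LEVELtot, level, perm)
  implicit none
  integer, intent(in)  :: N, LEVELtot
  integer, intent(in)  :: level(N)     ! level of each vertex (1, 2, 4, ... as in Fig.3)
  integer, intent(out) :: perm(N)      ! new position of each original vertex
  integer :: i, lev, icou

  icou= 0
  do lev= 1, LEVELtot                  ! lowest level first
    do i= 1, N
      if (level(i) == lev) then
        icou= icou + 1
        perm(i)= icou                  ! vertex i is moved to position icou
      endif
    enddo
  enddo
end subroutine hid_reorder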

(a) Domain Decomposition: four subdomains (0, 1, 2, 3) with interior vertices at Level-1, connectors 0,1 / 0,2 / 1,3 / 2,3 at Level-2, and connector 0,1,2,3 at Level-4

      do lev= 1, LEVELtot
        do ic= 1, COLORtot(lev)
!$omp parallel do private(ip,i,SW1,SW2,SW3,isL,ieL,j,k,X1,X2,X3)
          do ip= 1, PEsmpTOT
            do i= STACKmc(ip-1,ic,lev)+1, STACKmc(ip,ic,lev)
              SW1= WW(3*i-2,R); SW2= WW(3*i-1,R); SW3= WW(3*i  ,R)
              isL= INL(i-1)+1
              ieL= INL(i)
              do j= isL, ieL                       ! lower-triangular part
                k= IAL(j)
                X1= WW(3*k-2,R); X2= WW(3*k-1,R); X3= WW(3*k  ,R)
                SW1= SW1 - AL(9*j-8)*X1 - AL(9*j-7)*X2 - AL(9*j-6)*X3
                SW2= SW2 - AL(9*j-5)*X1 - AL(9*j-4)*X2 - AL(9*j-3)*X3
                SW3= SW3 - AL(9*j-2)*X1 - AL(9*j-1)*X2 - AL(9*j  )*X3
              enddo
              X1= SW1; X2= SW2; X3= SW3            ! solve with the factored 3x3 diagonal block
              X2= X2 - ALU(9*i-5)*X1
              X3= X3 - ALU(9*i-2)*X1 - ALU(9*i-1)*X2
              X3= ALU(9*i  )*X3
              X2= ALU(9*i-4)*( X2 - ALU(9*i-3)*X3 )
              X1= ALU(9*i-8)*( X1 - ALU(9*i-6)*X3 - ALU(9*i-7)*X2 )
              WW(3*i-2,R)= X1; WW(3*i-1,R)= X2; WW(3*i  ,R)= X3
            enddo
          enddo
!$omp end parallel do
        enddo

        call SOLVER_SEND_RECV_3_LEV (lev, …)       ! communications using hierarchical comm. tables
      enddo

Fig.3 Domain/block decomposition of the matrix according to the HID reordering: (a) Domain Decomposition, (b) Matrix Block, (c) Forward Substitution Process of SGS
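The color loop (ic) in the forward substitution above runs over the colors produced by the CM-RCM reordering mentioned in the Introduction. A minimal sketch of the cyclic-multicoloring step is shown below (hypothetical names; it assumes the RCM level sets are already available, and the number of colors must be chosen large enough that vertices of the same color are mutually independent):

subroutine cm_rcm_color(N, NCOLORS, rcm_level, color)
  implicit none
  integer, intent(in)  :: N, NCOLORS
  integer, intent(in)  :: rcm_level(N)   ! RCM level set of each vertex (assumed given)
  integer, intent(out) :: color(N)
  integer :: i

  ! colors are assigned to the RCM level sets cyclically
  do i= 1, N
    color(i)= mod(rcm_level(i)-1, NCOLORS) + 1
  enddo
end subroutine cm_rcm_color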

T2K Open Super Computer (Todai Combined Cluster) (T2K/Tokyo)
The developed code has been tested on the “T2K Open Super Computer (Todai Combined Cluster) (T2K/Tokyo)” at the University of Tokyo. T2K/Tokyo was developed by Hitachi under the “T2K Open Supercomputer Alliance”. It is an AMD quad-core Opteron-based combined cluster system with 952 nodes, 15,232 cores, and 31 TB of memory.


Total peak performance is 140.1 TFLOPS. Each node includes four “sockets” of AMD quad-core Opteron processors (2.3 GHz), as shown in Fig.4. Nodes are connected via a Myrinet-10G network. In the present work, 32 nodes of the system have been used. Because T2K/Tokyo is based on a cc-NUMA architecture, careful design of software and data structures is required for efficient access to local memory.
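One common technique for keeping memory accesses local on such cc-NUMA nodes is first-touch placement, illustrated by the following sketch (a generic example, not necessarily what the developed code does): arrays are initialized with the same OpenMP loop decomposition that the compute loops later use, so that their pages are mapped to the memory local to each thread.

subroutine first_touch_axpy(N, alpha, x, ww)
  implicit none
  integer, intent(in)  :: N
  real(8), intent(in)  :: alpha, x(N)
  real(8), intent(out) :: ww(N)
  integer :: i

  ! First touch: each thread writes the part of ww it will use later, so that
  ! (for a freshly allocated ww) those pages are placed in that thread's local memory.
!$omp parallel do schedule(static)
  do i= 1, N
    ww(i)= 0.0d0
  enddo
!$omp end parallel do

  ! Compute loop with the same static decomposition: accesses stay mostly local.
!$omp parallel do schedule(static)
  do i= 1, N
    ww(i)= ww(i) + alpha*x(i)
  enddo
!$omp end parallel do
end subroutine first_touch_axpy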

Fig.4 Overview of T2K/Tokyo (Entire System and Each Node)
http://www.open-supercomputer.org/

Results & Future Works

Preliminary tests of the developed code have been conducted on T2K/Tokyo using up to 512 cores. Preconditioners based on HID provide scalable performance and robustness in comparison to conventional localized block Jacobi preconditioners, and the performance of the Hybrid 4x4 parallel programming model is competitive with that of Flat MPI. HID-based preconditioning with the hybrid parallel programming model is expected to be a good choice for scalable performance and robustness on more than 10^4 cores of clusters of multi-core processors, if each MPI process is assigned to one socket, as in the Hybrid 4x4 case in this work.

In the present work, no fill-in has been considered in the HID procedures. The improvement of HID over conventional localized block Jacobi preconditioning has therefore not been very significant. More robust preconditioning methods based on HID may be developed by considering fill-ins inside and between connectors for realistic applications with ill-conditioned matrices. Moreover, the developed methods should be further evaluated on various types of clusters with more cores.

[Line graph: Relative Performance (0.80 to 1.80) vs. core# (32, 64, 128, 192, 256, 384, 512) for Flat MPI, Hybrid 4x4, and Hybrid 8x2]

Fig.5 Strong Scaling Test up to 512 cores of T2K/Tokyo

[Bar chart: Relative Performance (0.00 to 1.25) for Flat MPI, Hybrid 4x4, and Hybrid 8x2 at 128 cores and 512 cores]

Relative performance of HID, normalized by the results of localized block Jacobi.

Relative performance of the three parallel programming models (Flat MPI, Hybrid 4x4, and Hybrid 8x2), normalized by the results of Flat MPI.