
Research Papers Issue RP0190, December 2013

Scientific Computing and Operations Division (SCO)

Driving NEMO towards Exascale: introduction of a new software layer in the NEMO stack software

By L. D’Amore, University of Naples Federico II

[email protected]

V. Boccia, INFN, Unit of Naples

[email protected]

L. Carracciuolo, CNR

[email protected]

A. Murli, SPACI & CMCC (until 12/2012)

[email protected]

SUMMARY This paper addresses scientific challenges related to high-level implementation strategies that lead NEMO to effectively exploit the opportunities of exascale systems. We consider two software modules as proof-of-concept: the Sea Surface Height equation solver and the Variational Data Assimilation system, which are components of the NEMO ocean model (OPA). Advantages arising from the introduction of consolidated scientific libraries in NEMO are highlighted: such advantages concern both the improvement of "software quality" (in terms of parameters such as robustness, portability, resilience, etc.) and the reduction of software development time.



INTRODUCTION

The great frontier of computational science is in the challenge posed by high-fidelity simulations of real-world systems, that is, in transforming computational science into a fully predictive science. Earth System Models are typically characterized by multiple, interacting physical processes (multi-physics), interactions that occur on a wide range of both temporal and spatial scales (from 1 to 10 km). Since computational cost increases nonlinearly with higher resolution, it is likely that predictions of environmental change at 1 km resolution would require extreme-scale computers.

The computational challenges that will be faced in making exascale computing a practical reality arise both in the hardware realm and in the software, and will call for potentially revolutionary changes in the ways high performance computing is being used. A co-design methodology approach will be needed, in which the design of hardware, algorithms, programming models and software tools is carried out in a coupled and iterative fashion.

Preparing applications for the transition to exascale systems requires that they are able to face, efficiently and effectively, the abundance of parallelism (also combined in hybrid approaches) and the increase in system faults. As a consequence, parallel algorithms must adapt themselves to the need for increasing amounts of data locality, to the need to obtain much higher factors of fine-grained parallelism as high-end systems support increasing numbers of compute threads, and to the need of re-balancing computation dynamically in response to changing workloads and conditions of the operating environment. Exascale systems bring new computation/communication ratios. Within a node, data transfers between cores are relatively inexpensive, but temporal locality is still important for effective cache use. Across nodes, the relative cost of data transfer is growing. The development of communication-avoiding algorithms that increase the computation/communication ratio is needed.

Thus, applications executing on exascale systems will have to deal with issues related to scalability, adaptivity and, more in general, resilience of the software [3, 12, 6].

Exascale co-design is a very complex undertaking, mainly for the application codes. Many computational scientists have neither the time nor the inclination to become experts in numerical methods and software, preferring to leave software development to computer scientists and mathematicians. In this respect, mathematical software libraries should be used. In this way, domain scientists will be able to use state-of-the-art software components that can be shared across multiple application domains. Since writing software is universally recognized to be time consuming and error prone, scientists will benefit from the availability of software that they can use off the shelf while experimenting with domain-specific challenges, rather than writing their own packages. Computing at exascale will put much heavier demands on algorithms, and especially on programming models, which may be required to handle a number of design choices, including shared-memory based programming models (such as OpenMP) and message passing based programming models (such as MPI). Hence, in order to maximize the availability of these new algorithms to science, the idea is to encapsulate their implementation in reusable libraries.

Hence, the key for exascale co-design of application codes is the software layer that mediates the interaction between applications and hardware. The deployment of application codes by means of scientific libraries is always a “good investment”. This approach introduces the so-called multilevel programming model: the application scientist uses the library in a way that is meaningful to him/her, while the computing scientist who implements the library programs at a level closer to the hardware.

Recent advances in the Portable, Extensible Toolkit for Scientific computing (PETSc) [15] have substantially improved multilevel, multidomain and multiphysics algorithms. These capabilities enable users to investigate the space of linear, nonlinear, and timestepping solvers for more complex simulations, without making premature choices about algorithms and data structures. The strong encapsulation of the PETSc design facilitates runtime composition of hierarchical methods without sacrificing the ability to customize problem-specific components. These capabilities are essential for application codes to evolve over time and to incorporate advances in algorithms for emerging extreme-scale architectures.

Just as crucial in the push toward extreme-scale computing are recent advances in the PETSc design that enable leveraging GPUs in all computational solver phases and the hybrid MPI/pthread programming model. These design advances mean that one does not have to forsake the most mathematically sophisticated, hierarchical solvers in order to utilize GPUs and multicore. Rather, the software logic is independent of the computational kernels running on the accelerator hardware, so that one can easily incorporate new kernels, tuned to a particular new hardware, without rewriting the application or the high-level solver library.

Important progress in separating the control logic of the PETSc software from the computational kernels has already been made. As the community transitions away from an MPI-only model for parallelism, this separation of concerns is crucial because we can avoid a total rewrite of our software base. That is, while good performance at the exascale will require a major overhaul of the code for the computational kernels in PETSc, the high-level control logic is largely hardware-independent and requires only modest refactoring to adapt to new hardware. In other words, we will not need to reimplement from scratch the hundreds of linear, nonlinear, and timestepping solvers encapsulated in PETSc. Of course, an essential complement to new hardware-specific computational kernels is extending the solver libraries to incorporate new algorithms that reduce communication and synchronization, as well as new programming models that explicitly acknowledge data movement and hierarchies of locality. The key point is that such a design enables a separation of concerns for these two fundamental aspects of extreme-scale solvers, thereby making tractable a potentially daunting transition process.

In the path to exascale, solver algorithms must become more sophisticated (built by composing a hierarchy of already highly complex algorithms), not less sophisticated. Thus, enhancing and refactoring existing scalable and flexible solver software libraries, such as PETSc, is the only way to achieve the long-term goal of exascale solvers. Starting from scratch would entail unnecessarily reproducing twenty years of previous work before the more sophisticated solvers could even be reasonably implemented. With the PETSc software, we are already embarking on next-generation algorithms and data structures. Building on this foundation of fundamental composable solver components will also facilitate a paradigm shift that raises the level of abstraction from the simulation of complex systems to the design and uncertainty quantification of these systems.

THIS WORK

Some issues related to algorithm scalability and software resilience are already well known to NEMO developers, but they are only partially faced by means of some investments at the software level aimed at reducing the number of communications [7]. Moreover, the ability of the software to adapt its execution also in the presence of unexpected events (e.g. resource overload, faults, etc.) and to benefit from resource heterogeneity is not yet provided.

In the context of IESP, groups of experts in applications, computer science and computational science are working together to produce new algorithms with features of super-scalability, natural fault tolerance and the ability to adapt their execution to emerging hardware systems. Implementations of those algorithms will be included into already consolidated scientific libraries.

In this work we consider two software modules of NEMO. Here, these are used as the basic tool of our feasibility study, aimed at proving that the introduction of a consolidated scientific library can provide NEMO with several advantages in terms of:

software adaptivity (at least portability) on heterogeneous resources

software robustness

software scalability

We use, as “proof-of-concept”, the Sea Surface Height equation solver and the Variational Data Assimilation system, both used in the NEMO ocean model (OPA) [8].

The OPA-NEMO SSH equation solver  The trend of the sea-surface height (SSH) in the OPA model is modelled by a PDE (see the schema reported in fig. 1). Discretization of the SSH equation, by means of the leap-frog scheme, leads to the solution of linear systems. If we consider the reference configuration of NEMO named ORCA2-LIM, which describes a global model of the ocean interacting with the ice at the poles, the SSH solution represents a small part of the ORCA2-LIM execution.

Figure 1: OPA NEMO working schema
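As a purely schematic illustration (the exact operators of the NEMO free-surface formulation are not reproduced here), at every time step $n$ the solver faces a system of the form

\[ A\,\eta^{\,n+1} = b\!\left(\eta^{\,n},\,\eta^{\,n-1}\right), \]

where $A$ is the matrix arising from the spatial discretization (pentadiagonal, as noted in the Results section) and the right-hand side $b$ collects the known contributions from the previous leap-frog time levels.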

The performance analysis, measuring the time spent by each module activated by the SSH solution, allowed us to observe that:

a part of the execution time is spent for the discretization of operators;

a part of the time is spent for the solution of linear systems using one of the methods that NEMO identifies as “SOR” or “PCG”. These are ad-hoc implementations of the “standard” Preconditioned Conjugate Gradient and Successive Over-Relaxation methods.

The optimization algorithms used by variational data assimilation systems  The variational data assimilation problem is a function minimization problem. In NEMO this operation is performed by means of two algorithms: CG (as in the NEMOVAR software) and L-BFGS (as in the OceanVar and NEMOVAR software).

Figure 2: Rosenbrock function and its minimum

The test case was based on the Rosenbrock function F_Rosenbrock, which is a non-convex function used as a test problem for analyzing the performance of optimization algorithms, because its global minimum is known (see fig. 2) but numerical convergence to that minimum is not easy.
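The text does not spell out the exact variant employed; a standard n-dimensional generalization of the Rosenbrock function, whose global minimum is 0 at x = (1, ..., 1), is

\[ F_{\mathrm{Rosenbrock}}(x) \;=\; \sum_{i=1}^{n-1}\Big[\,100\,\big(x_{i+1}-x_i^{2}\big)^{2} + \big(1-x_i\big)^{2}\Big]. \]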

Moreover, the use of a test function lets us modify the problem size in a flexible way, providing us with a tool to perform more accurate scalability studies (as, for example, the OPA-NEMO GYRE configuration can do).

The L-BFGS algorithm essentially spends its execution time to:

evaluate the function;

find the “search direction” towards the minimum.

More details are reported in [11, 1, 10, 9, 5, 4], where experiences about the implementation of a Variational Data Assimilation schema in an HPC environment (a hybrid architecture equipped with consolidated and robust software libraries) are described.

Figure 3: A representation of a hybrid architecture

The well-known scalability problems of the above-cited algorithms have been faced only in part by NEMO developers, with some investments related to the SOR algorithm [7]. Concerning CG-like algorithms, international groups of experts are working in the IESP context to produce a more scalable variant, e.g. by reducing the number of synchronization points due to global communication operations [12].

THE REFERENCE SOFTWARE ENVIRONMENT: PETSC

PETSc is constantly evolving and it reflects the evolution of newer architectures (multi-node, multi-core, GPU, possibly combined to support hybrid computing paradigms) (see fig. 3).

PETSc is characterized by a considerable endowment of implementations of numerical methods and algorithms, including those used to solve systems of linear equations. PETSc is modular, organized in different levels of abstraction (see fig. 4), and it provides data structures that hide, from the final user, the complexity of the inter-process communications in distributed memory environments, of the interaction with GP-GPU modules, etc.


Figure 4: The hierarchical organization of PETSc

Figure 5: The hierarchical organization of TAO

PETSc provides tools at different levels of abstraction and it lets software developers use PETSc objects at the most suitable level to guarantee high levels of scalability.

There are several packages built on PETSc. Among them, we cite TAO (Toolkit for Advanced Optimization), which is an object-oriented, flexible toolkit with a strong emphasis on the reuse of external tools. TAO is aimed at the solution of large-scale optimization problems on high-performance architectures, providing features such as portability, performance, scalable parallelism, and an interface independent of the architecture [13].

TAO includes a variety of solvers based on optimization algorithms for several classes of problems (unconstrained, bound-constrained, and PDE-constrained minimization, nonlinear least-squares, and complementarity) (see fig. 5). As TAO is built on PETSc, it inherits all PETSc features and its approach in declaring, defining and using objects.

Last but not least, we should mention the possibility of extending PETSc by interfaces to other libraries (e.g. Trilinos, MUMPS, Hypre, etc.).
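As a purely illustrative example (assuming a PETSc build configured with the corresponding external packages, and reusing the hypothetical driver program shown later in fig. 8), a direct solve through MUMPS or an algebraic multigrid preconditioner from Hypre can be selected entirely from the command line:

$ ./PETScSolve -ksp_type preonly -pc_type lu -pc_factor_mat_solver_package mumps
$ ./PETScSolve -ksp_type cg -pc_type hypre -pc_hypre_type boomeramg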

For everything mentioned above, we chose PETSc as the reference scientific library. In particular, the reference hardware/software environment is a set of multiprocessor/multicore nodes, some of them equipped with a graphical accelerator (NVIDIA GPU), on which the software layer is composed of:

the OpenMPI implementation of the MPI-2 standard;

both Fortran and C compilers from the CPU vendor and from the GNU project;

PETSc library;

TAO library;

NETCDF;

CUDA toolkit.

RESULTS

The OPA-NEMO SSH equation solver  Some issues about the software robustness and the algorithm scalability are analyzed.

The focus was on the PETSc object available for the solution of systems of linear equations by using Krylov Subspace Methods (see level 2 of the hierarchical organization of PETSc in fig. 4): the KSP object. This object, like all the other PETSc objects, can be configured at runtime selecting:

the kind of solver (iterative methods: CG, GMRES, SOR, etc.; direct methods: LU or Cholesky factorization, etc.),


the preconditioner (ILU, Jacobi, Block Jacobi, ...),

the values of all the variables defining stopping criteria (for iterative methods),

etc.
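For instance (a purely illustrative command line, with option values chosen only for the example and using the same hypothetical driver of fig. 8), the solver, the preconditioner and the stopping criterion can all be changed without recompiling:

$ ./PETScSolve -ksp_type gmres -pc_type bjacobi -ksp_rtol 1e-8 -ksp_max_it 500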

PETSc's first goal was to provide a simple and consistent way for the user to specify the algebraic system (in general, nonlinear and time-dependent) so that a wide variety of solvers could be explored, thereby enabling application scientists to experiment with diverse algorithms and implementations without requiring a premature commitment to particular data structures and solvers. The system for specification goes far beyond simply requiring the user to provide the Jacobian of a nonlinear system in a particular sparse matrix format. Rather, a set of specifications for how the user-provided code may provide the information needed by implicit multilevel and Newton-based solvers is employed. The specifications are layered, so that if the user code can provide more information or flexibility, more powerful solvers may then be employed. For example, if the user's code can evaluate the nonlinear functions on a set of meshes, then geometric multigrid may be used to solve the Jacobian system.

We substitute the NEMO code solving the linear system Ax = b with another one using the solvers provided by PETSc, performing these steps:

a) identifying the input data of the NEMO modules (the matrix A of the system of linear equations, the vector b of the Right Hand Side (RHS) and the vector x0 of the first approximation of the solution);

b) transformation of these data into the corresponding PETSc objects (Vec and Mat objects);

c) analysis of the characteristics of the linear equations system (rank and conditioning of A, etc.);

d) solution of the linear equations system Ax = b by the Krylov Subspace Methods implemented by the KSP object, properly configured.

Regarding points a) and b), we observed that the matrix A is pentadiagonal of size n×n, where n = 26460. The data structures used in NEMO to represent the matrix A are 5 two-dimensional arrays, each of which represents one of the diagonals of A. Also the vector b of the right-hand side and the vector x0 of the first approximation are represented by 2-dimensional arrays.
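The following is a minimal sketch (not the authors' code) of how step b) could look. The five diagonals are assumed here to have already been copied from the NEMO 2-D work arrays into plain 1-D arrays diag[k] with hypothetical column offsets offs[k]; the concrete matrix type (aij, aijcusparse, ...) is again left to runtime options, and the vectors b and x0 are filled analogously with VecSetValues. Error handling follows the CHKERRQ convention of the figures below.

/* Hypothetical sketch of step b): assembling the pentadiagonal matrix A.
   diag[k][i] is the value of the k-th diagonal in row i, offs[k] its column offset;
   both are assumed to have been extracted from the NEMO 2-D arrays beforehand. */
#include <petscmat.h>

PetscErrorCode BuildMatrix(PetscInt n, PetscScalar *diag[5], const PetscInt offs[5], Mat *A)
{
  PetscErrorCode ierr;
  PetscInt       i, k, j, rstart, rend;

  ierr = MatCreate(PETSC_COMM_WORLD, A); CHKERRQ(ierr);
  ierr = MatSetSizes(*A, PETSC_DECIDE, PETSC_DECIDE, n, n); CHKERRQ(ierr);
  ierr = MatSetFromOptions(*A); CHKERRQ(ierr);   /* matrix type selected at runtime */
  ierr = MatSetUp(*A); CHKERRQ(ierr);

  /* each process inserts only the rows it owns */
  ierr = MatGetOwnershipRange(*A, &rstart, &rend); CHKERRQ(ierr);
  for (i = rstart; i < rend; i++) {
    for (k = 0; k < 5; k++) {
      j = i + offs[k];
      if (j >= 0 && j < n) {
        ierr = MatSetValues(*A, 1, &i, 1, &j, &diag[k][i], INSERT_VALUES); CHKERRQ(ierr);
      }
    }
  }
  ierr = MatAssemblyBegin(*A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
  ierr = MatAssemblyEnd(*A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
  return 0;
}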

Regarding point c), we pointed out that the matrix A is rank deficient, is not symmetric and has a rather large condition number. The system features mentioned above prevent PETSc, and other robust software libraries, from solving the linear system with the “standard” implementations of the SOR and PCG methods. In these circumstances other methods are preferred, such as the method of Least Squares (LS) or GMRES.

To implement the system solver in PETSc we used the objects Vec, Mat and KSP, which are at the bottom levels (the first and second levels) of the library (see fig. 4).

Below are shown some code offprints that describe how the developer can use PETSc objects and how these objects can be configured in a static, semi-static or dynamic way. In fig. 6, lines 4, 9 and 13 are used to configure, at runtime, all the objects already created (at lines 3, 8 and 12).

Note 1: Execution times used in subfigure (a) of fig. 9 were collected on a node with two quad-core Intel [email protected] processors; those used in subfigure (b) on a node with two quad-core Intel [email protected] processors and a Tesla K20c.

In fig. 7, lines 3 and 4 create and partially configure the KSP object with static properties (e.g. with a non-zero initial guess); line 5 completes the solver configuration at runtime by means of suitable command line options; at line 8 the solver is called, and lines 11 and 12 get information about the number of iterations performed and the reason for the solver's convergence.

In fig. 8, examples are shown of the command line arguments used to configure the solver and the other PETSc objects at runtime: at lines 2, 5 and 8 we choose to execute the solver by using OpenMP, native pthreads or CUDA, respectively.

We executed the new implementation of the SSH equation solver:

1. selecting, at runtime, different algorithms to solve the linear system (LS, GMRES, ...)

2. selecting, at runtime, the most suitable solver implementation (MPI-based, OpenMP, CUDA, ...)

obtaining

1. the same solution (software robustness)

2. the same performance (scalability preservation)

3. a more portable version of the SSH equation solver.

Fig. 9 shows the behavior, as a function of the number of cores, of the mean execution time of one iteration of the new implementation of the SSH equation solver in the PETSc environment (see Note 1).

We highlight that:

fig. 9-(a): the times related to the 2-core execution could be affected by the context initialization, which is too relevant with respect to the computing phase for our fixed problem dimension.

fig. 9-(b): the reported results are related also to a combination, in a hybrid approach, of multicore and GP-GPU technologies. Besides, for such a small problem dimension, the advantages provided by using the GP-GPU are not significant.

The L-BFGS algorithm  Some issues about the software scalability and adaptivity are analyzed.

The focus was on the tool available in TAO that implements the L-BFGS algorithm: the TaoSolver. This object, like the PETSc objects, can be configured at runtime selecting:

the kind of optimization algorithm (e.g. Nelder-Mead, LMVM, Newton line-search methods, etc.),

the values of all the variables defining stopping criteria (for iterative methods),

etc.

We implement in TAO the code needed to find the minimum of F_Rosenbrock by means of the TaoSolver object (i.e. with non-zero values for the initial guess).
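A minimal sketch of the missing piece, not reproduced from the authors' code, is the objective/gradient callback that TAO invokes at each LMVM (L-BFGS) iteration; the function name FormFunctionGradient and the sequential-Vec assumption (cf. PETSC_COMM_SELF in fig. 11) are ours, and a real-scalar PETSc build is assumed. It is registered with TaoSetObjectiveAndGradientRoutine(), a routine of the TAO 2.0 API [13], before TaoSolve() is invoked.

/* Hypothetical sketch: objective and gradient of the n-dimensional Rosenbrock
   function, evaluated on a sequential Vec (real-scalar build assumed). */
#include "taosolver.h"   /* TAO 2.0 header */

PetscErrorCode FormFunctionGradient(TaoSolver tao, Vec X, PetscReal *f, Vec G, void *ctx)
{
  PetscErrorCode     ierr;
  PetscInt           i, n;
  PetscReal          fval = 0.0;
  const PetscScalar *x;
  PetscScalar       *g;

  ierr = VecGetSize(X, &n); CHKERRQ(ierr);
  ierr = VecGetArrayRead(X, &x); CHKERRQ(ierr);
  ierr = VecGetArray(G, &g); CHKERRQ(ierr);
  for (i = 0; i < n; i++) g[i] = 0.0;
  for (i = 0; i < n - 1; i++) {
    PetscScalar t1 = x[i+1] - x[i]*x[i], t2 = 1.0 - x[i];
    fval   += 100.0*t1*t1 + t2*t2;        /* F += 100(x_{i+1}-x_i^2)^2 + (1-x_i)^2 */
    g[i]   += -400.0*x[i]*t1 - 2.0*t2;    /* contribution to dF/dx_i     */
    g[i+1] +=  200.0*t1;                  /* contribution to dF/dx_{i+1} */
  }
  ierr = VecRestoreArrayRead(X, &x); CHKERRQ(ierr);
  ierr = VecRestoreArray(G, &g); CHKERRQ(ierr);
  *f = fval;
  return 0;
}

/* Registration, to be placed between TaoCreate() and TaoSolve() (cf. fig. 11):
   ierr = TaoSetObjectiveAndGradientRoutine(tao, FormFunctionGradient, NULL); CHKERRQ(ierr); */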

To implement the code we used the PETSc object Vec (at the first level of TAO) and the TaoSolver object (at the second level of the TAO library) (see fig. 5). Below are shown some code offprints that describe how the developer can use PETSc and TAO objects and how these objects can be configured in a static, semi-static or dynamic way.


1 ......

2 /* Creation and definition of the elements of A */

3 ierr = MatCreate(PETSC_COMM_WORLD,&A); CHKERRQ(ierr);

4 ierr = MatSetFromOptions(A); CHKERRQ(ierr);

5 ......

6

7 /* Creation and definition of the elements of b and x */

8 ierr = VecCreate(PETSC_COMM_WORLD,&b); CHKERRQ(ierr);

9 ierr = VecSetFromOptions(b); CHKERRQ(ierr);

10 ......

11

12 ierr = VecCreate(PETSC_COMM_WORLD,&x); CHKERRQ(ierr);

13 ierr = VecSetFromOptions(x); CHKERRQ(ierr);

Figure 6: PETSc Vec and Mat objects creation and configuration

1 /* Creation and definition of KSP object */

2 ierr = KSPCreate(PETSC_COMM_WORLD,&ksp);CHKERRQ(ierr);

3 ierr = KSPSetOperators(ksp,A,A,SAME_NONZERO_PATTERN);CHKERRQ(ierr);

4 ierr = KSPSetInitialGuessNonzero(ksp,PETSC_TRUE);

5 ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);

6

7 /* Solution of the system by KSP object */

8 ierr = KSPSolve(ksp,b,x);

9

10 /* KSP object status */

11 ierr = KSPGetIterationNumber(ksp,&its);CHKERRQ(ierr);

12 ierr = KSPGetConvergedReason(ksp,&reason); CHKERRQ(ierr);

Figure 7: PETSc KSP object creation and configuration

1 $ ./PETScSolve -ksp_type lsqr -ksp_rtol 1e-7 -threadcomm_nthreads 8

2 -threadcomm_type openmp -log_summary

3

4 $ ./PETScSolve -ksp_type lsqr -ksp_rtol 1e-7 -threadcomm_nthreads 8

5 -threadcomm_type pthread -log_summary

6

7 $ ./PETScSolve -ksp_type lsqr -ksp_rtol 1e-7 -threadcomm_type nothread

8 -log_summary -mat_type seqaijcusparse -vec_type seqcusp

Figure 8: PETSc objects configuration by means of command line options.


Figure 9: Mean execution time of one iteration of the new implementation of the SSH equation solver in the PETSc environment. (a) Comparison between sequential and multicore; (b) comparison between sequential, multicore and GP-GPU.

1 /* Creation and definition of the vector x of the initial guess values*/

2 ierr = VecCreate(PETSC_COMM_WORLD,&x); CHKERRQ(ierr);

3 ierr = VecSetFromOptions(x); CHKERRQ(ierr);

4 /* Begin of the instructions to define the elements of x vector*/

5 ......

6

7 /* End of the instructions to define the elements of x vector*/

Figure 10: PETSc Vec object creation and configuration


1 /* Creation and definition of TaoSolver object */

2 ierr = TaoCreate(PETSC_COMM_SELF,&tao); CHKERRQ(ierr);

3 ierr = TaoSetInitialVector(tao,x); CHKERRQ(ierr);

4 ierr = TaoSetFromOptions(tao);CHKERRQ(ierr);

5 ......

6 /* Solve the application */

7 ierr = TaoSolve(tao); CHKERRQ(ierr);

8

9 /* TaoSolver status */

10 ierr = TaoGetTerminationReason(tao,&reason); CHKERRQ(ierr);

Figure 11: TaoSolver object creation and configuration

1 $ ./TaoOptimize -tao_method tao_lmvm -tao_max_funcs 10000 -threadcomm_nthreads 8

2 -threadcomm_type openmp -log_summary

3

4 $ ./TaoOptimize -tao_method tao_lmvm -tao_max_funcs 10000 -threadcomm_nthreads 8

5 -threadcomm_type pthread -log_summary

6

7 $ ./TaoOptimize -tao_method tao_lmvm -tao_max_funcs 10000

8 -threadcomm_type nothread -log_summary -vec_type seqcusp

Figure 12: TAO and PETSc objects configuration by means of command line options.

Note 2: Execution times used in subfigure (a) of fig. 13 were collected on a cluster of 8 nodes, connected by Infiniband technology, each with two quad-core Intel [email protected] processors; those used in subfigure (b) on a node with two quad-core Intel [email protected] processors and a Tesla C1060.

In fig. 10, line 3 configures at runtime the Vec object already created at line 2.

In fig. 11, lines 2 and 3 create and partially configure the TaoSolver object with static properties (e.g. with a non-zero initial guess); line 4 is used to complete the TaoSolver configuration at runtime by means of suitable command line options; at line 7 we call the solver, and line 10 gets information about the termination of the computations.

In fig. 12, there are some examples of the command line arguments used to configure the TAO solver and the PETSc objects at runtime, by means of the selection of the method (i.e. LMVM: Limited-Memory, Variable-Metric) and the definition of the maximum number of function evaluations (see the -tao_max_funcs option). At lines 2, 5 and 8 we choose to execute the solver by using OpenMP, native pthreads or CUDA, respectively.

Fig. 13 shows the behavior, as a function of the problem dimension, of the execution times of the TAO implementation of the L-BFGS algorithm (see Note 2).

We highlight that:

fig. 13-(a): the execution time reduction is more evident as the problem dimension increases,

fig. 13-(b): here, the advantages provided by using the GP-GPU are more significant due to the higher problem dimension.

CONCLUSIONS

Advancing science in key areas requires the development of next-generation physical models to satisfy the accuracy and fidelity needs of targeted simulations, which in turn places higher demands on computational hardware and software. Application models represent the functional requirements that drive the need for certain numerical algorithms and software implementations.

Science priorities lead to science models, andmodels are implemented in the form of algo-rithms. Algorithm selection is based on var-ious criteria, such as appropriateness, accu-racy, verification, convergence, performance,parallelism and scalability. Models and asso-ciated algorithms are not selected in isolationbut must be evaluated in the context of the ex-isting computer hardware environment. Algo-rithms that perform well on one type of com-puter hardware may become obsolete on newerhardware, so selections must be made care-fully and may change over time. Moving for-ward to exascale will put heavier demands onalgorithms in at least two areas: the need forincreasing amounts of data locality in order toperform computations efficiently, and the needto obtain much higher factors of fine-grainedparallelism as high-end systems support in-creasing numbers of compute threads. As aconsequence, parallel algorithms must adaptto this environment, and new algorithms andimplementations must be developed to extractthe computational capabilities of the new hard-ware. As with science models, the perfor-mance of algorithms can change in two waysas application codes undergo development andnew computer hardware is used. First, al-gorithms themselves can change, motivatedby new models or performance optimizations.Second, algorithms can be executed under dif-ferent specifications, e.g., larger problem sizesor changing accuracy criteria. Both of thesefactors must be taken into account. Significantnew model development, algorithm re-designand science application code reimplementa-tion, supported by (an) exascale-appropriate


Figure 13: Execution time of an implementation of the L-BFGS algorithm in a TAO+PETSc environment. (a) Comparison between sequential and multiprocessor; (b) comparison between sequential, multicore and GP-GPU.


Uncertainty quantification will permeate the exascale science workload. The demand for predictive science results will drive the development of improved approaches for establishing levels of confidence in computational predictions. Both statistical techniques involving large ensemble calculations and other statistical analysis tools will have significantly different dynamic resource allocation requirements than in the past, and the significant code redesign required for the exascale will present an opportunity to embed uncertainty quantification techniques in exascale science applications.

The deployment of an applicative software by means of scientific libraries, such as PETSc, can be considered a “good investment”. Indeed, in the context of IESP, groups of application experts, computing and computational scientists are working together to produce new algorithms with features of super-scalability and natural fault tolerance, able to adapt their execution to emerging hardware systems (not only distributed memory and shared memory systems, but also systems with graphic accelerators - GPUs, eventually combined to support hybrid computing paradigms).

Furthermore, PETSc developers are working, in the exascale context, to produce super-scalable implementations of iterative methods for the solution of linear systems. Similar investments are in progress on other well-known scientific libraries, such as TRILINOS (which is already able to “interact” with PETSc).

Thus, if a library is fault tolerant, robust and scalable, so are all the applications that use it; if a library is able to constantly evolve to reflect the evolution of newer and more powerful architectures, then all the applications relying on it can automatically benefit from all the performance improvements provided (see fig. 14).

Figure 14: Performance evolution of NVIDIA GP-GPU devices.

Another example to be cited in this sense is reported in [14, 2], where we worked to introduce fault tolerance mechanisms and adaptivity features at the library level, so producing high-level software that automatically inherits fault tolerance and adaptivity features.

All of the above leads to the conclusion that:

1. robust and consolidated scientific libraries, like PETSc, are versatile and capable of adapting to different execution environments. Furthermore, PETSc provides a wide variety of numerical tools to solve problems of interest for the oceanographic scientific community. Using PETSc in a future redesign of NEMO will bring advantages both in terms of savings in software development time and in terms of software quality (at least in terms of adaptivity, robustness, scalability and portability). In particular, if the NEMO model is redesigned, it will be possible to introduce also PETSc level-3 objects, obtaining further benefits in terms of the model's global scalability;

2. even if the initial effort for the introduction of scientific computing libraries into high-level software may seem large, on the other hand it ensures the possibility of a longer life for the software (amortizing the initial time investment) and spares the user the introduction of changes at the lowest level that would be necessary to incorporate any changes in the hardware execution environment.

FUTURE WORK

The need for scalable algorithms in an exascale initiative has already been stressed. All indications are that memory will become the rate-limiting factor along the path to exascale, and investments should accordingly be made in designing algorithms with reduced memory requirements.

Our future work includes the deployment of: (i) algorithmically scalable methods, where algorithmically scalable means that the total resources needed to solve the problem (flops plus memory) are proportional to the resources needed to evaluate the associated operator; (ii) high-order methods that perform more computation to obtain greater accuracy for each computational degree of freedom; and (iii) adaptive methods designed to use the smallest possible number of degrees of freedom to obtain the needed level of accuracy.

The basic framework for fully implicit nonlinear solvers is (truncated) Newton's method using (possibly matrix-free) Newton-Krylov techniques. The embodiment of Newton's method in PETSc is the SNES component. SNES uses the Newton approach of solving the nonlinear system using some approximation of the Jacobian of the nonlinear function. The approximation of the Jacobian can be computed in many ways: by matrix-free matrix-vector product application, or by automatic generation of code that computes the Jacobian or applies it to a vector via ADIC or ADIFOR. The incorporation of matrix-free nonlinear solvers is particularly important because this approach eliminates the need to compute the fully coupled Jacobian and yet still enables Newton's method to achieve rapid quadratic convergence. While most applications cannot readily provide full Jacobians of coupled systems, approximations of the various Jacobians are commonly available. Thus, in this context, the term matrix-free means that while there is no explicit storage of the entire sparse Jacobian matrix, there may be explicit storage of portions of the Jacobian. PETSc also includes new capabilities for composing scalable nonlinear solvers such as nonlinear GMRES, nonlinear CG, quasi-Newton, and nonlinear multigrid.
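A purely illustrative sketch of the SNES skeleton mentioned above, not part of the NEMO code, is reported below; the routine names SolveNonlinear and FormFunction are hypothetical. The user supplies only the nonlinear residual, and the matrix-free Jacobian action is requested at runtime with the -snes_mf option, in the same spirit as the runtime configuration used for KSP and TAO in the Results section.

/* Hypothetical sketch: SNES skeleton where the user provides only the residual F(x);
   with the runtime option -snes_mf the Jacobian action is approximated matrix-free,
   so no Jacobian matrix is ever stored. */
#include <petscsnes.h>

extern PetscErrorCode FormFunction(SNES snes, Vec x, Vec f, void *ctx);  /* user residual */

PetscErrorCode SolveNonlinear(PetscInt n)
{
  PetscErrorCode ierr;
  SNES           snes;
  Vec            x, r;

  ierr = VecCreate(PETSC_COMM_WORLD, &x); CHKERRQ(ierr);
  ierr = VecSetSizes(x, PETSC_DECIDE, n); CHKERRQ(ierr);
  ierr = VecSetFromOptions(x); CHKERRQ(ierr);
  ierr = VecDuplicate(x, &r); CHKERRQ(ierr);

  ierr = SNESCreate(PETSC_COMM_WORLD, &snes); CHKERRQ(ierr);
  ierr = SNESSetFunction(snes, r, FormFunction, NULL); CHKERRQ(ierr);
  ierr = SNESSetFromOptions(snes); CHKERRQ(ierr);   /* e.g. -snes_mf -ksp_type gmres */
  ierr = SNESSolve(snes, NULL, x); CHKERRQ(ierr);

  ierr = SNESDestroy(&snes); CHKERRQ(ierr);
  ierr = VecDestroy(&x); CHKERRQ(ierr);
  ierr = VecDestroy(&r); CHKERRQ(ierr);
  return 0;
}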


Bibliography

[1] R. Arcucci. The 4D-Var data assimilation achievements on HPC systems: from the decomposition of the problem to a fully parallel software, 2012. CMCC - 2012 Annual Meeting, June 2012.

[2] V. Boccia, L. Carracciuolo, G. Laccetti, M. Lapegna, and V. Mele. HADAB: enabling fault tolerance in parallel applications running in distributed environments. Lecture Notes in Computer Science, 7203:700–709, 2012. ISSN: 0302-9743.

[3] F. Cappello and D. Wuebbles. G8 ECS: Enabling climate simulation at extreme scale, 2012. G8 Exascale Projects Workshop, 12 November 2012, Salt Lake City, USA.

[4] L. D'Amore, R. Arcucci, L. Carracciuolo, and A. Murli. DD-OceanVar: A domain decomposition fully parallel data assimilation software for the Mediterranean Forecasting System. Procedia Computer Science, Elsevier, 18:1235–1244, 2013. ISSN: 1877-0509.

[5] L. D'Amore, R. Arcucci, L. Marcellino, and A. Murli. HPC computation issues of the incremental 3D variational data assimilation scheme in OceanVar software. Journal of Numerical Analysis, Industrial and Applied Mathematics, 7(3):91–105, 2012. ISSN: 1790-8140.

[6] J. Dongarra, P. Beckman, et al. The International Exascale Software Project roadmap. International Journal of High Performance Computing Applications, 25(1), 2011. ISSN: 1094-3420.

[7] I. Epicoco, S. Mocavero, and G. Aloisio. NEMO-Med: Optimization and improvement of scalability. Technical Report RP0096, CMCC SCO Department, January 2011.

[8] G. Madec. NEMO ocean engine, 2008. Note du Pôle de modélisation, Institut Pierre-Simon Laplace (IPSL), France, No. 27, ISSN 1288-1619.

[9] L. D'Amore, R. Arcucci, L. Carracciuolo, and A. Murli. OceanVar software for use with NEMO: documentation and test guide. Technical Report RP0166, CMCC SCO Department, November 2012.

[10] L. D'Amore, R. Arcucci, L. Carracciuolo, and A. Murli. Ocean data assimilation achievements on HPC systems: experiments on OceanVar in the Mediterranean Sea, 2013. CMCC - 2013 Annual Meeting, June 2013.

[11] L. D'Amore, R. Arcucci, V. Mele, G. Scotti, and A. Murli. Technical documentation: L-BFGS for GPU-CUDA, reference manual and user's guide. Technical Report RP0167, CMCC SCO Department, February 2013.

[12] L. Curfman McInnes. Recent advances in PETSc scalable solvers, 2012. Lectures at the Department of Computer Science of Technische Universität Darmstadt, 2 July 2012.

[13] T. Munson, J. Sarich, S. Wild, S. Benson, and L. Curfman McInnes. TAO 2.0 users manual. Technical Report ANL/MCS-TM-322, Mathematics and Computer Science Division, Argonne National Laboratory, 2012.

[14] A. Murli, V. Boccia, L. Carracciuolo, L. D'Amore, G. Laccetti, and M. Lapegna. Monitoring and migration of a PETSc-based parallel application for medical imaging in a grid computing PSE. IFIP International Federation for Information Processing - Grid-Based Problem Solving Environments, Springer, 239:421–432, 2007. ISBN: 978-0-387-73658-7.

[15] B. Smith. The Portable Extensible Toolkit for Scientific Computing, 2013. Lectures at the Argonne Training Program on Extreme-Scale Computing, August 2013.

© Centro Euro-Mediterraneo sui Cambiamenti Climatici 2014

Visit www.cmcc.it for information on our activities and publications.

The Euro-Mediterranean Centre on Climate Change is a Ltd Company with its registered office and administration in Lecce and local units in Bologna, Venice, Capua, Sassari, Viterbo, Benevento and Milan. The society doesn't pursue profitable ends and aims to realize and manage the Centre, its promotion, and research coordination and different scientific and applied activities in the field of climate change study.