
High Performance Computing: Concepts, Methods & Means

HPC Libraries

Hartmut Kaiser, PhD
Center for Computation & Technology

Louisiana State University

April 19th, 2007

Outline

• Introduction to High Performance Libraries
• Linear Algebra Libraries (BLAS, LAPACK)
• PDE Solvers (PETSc)
• Mesh manipulation and load balancing (METIS/ParMETIS, JOSTLE)
• Special purpose libraries (FFTW)
• General purpose libraries (C++: Boost)
• Summary – Materials for test

2

Outline

• Introduction to High Performance Libraries
• Linear Algebra Libraries (BLAS, LAPACK)
• PDE Solvers (PETSc)
• Mesh manipulation and load balancing (METIS/ParMETIS, JOSTLE)
• Special purpose libraries (FFTW)
• General purpose libraries (C++: Boost)
• Summary – Materials for test

3

Puzzle of the Day

#include <stdio.h>

int main()
{
    int a = 10;
    switch (a) {
        case '1': printf("ONE\n"); break;
        case '2': printf("TWO\n"); break;
        defa1ut: printf("NONE\n");
    }
    return 0;
}

4

If you expect the output of the above program to be NONE, try it out!

Application domains

• Linear algebra
  – BLAS, ATLAS, LAPACK, ScaLAPACK, SLATEC, PIM
• Ordinary and partial differential equations
  – PETSc
• Mesh manipulation and load balancing
  – METIS, ParMETIS, CHACO, JOSTLE, PARTY
• Graph manipulation
  – Boost.Graph library
• Vector/signal/image processing
  – VSIPL, PSSL
• General parallelization
  – MPI, pthreads
• Other domain specific libraries
  – NAMD, NWChem, Fluent, Gaussian, LS-DYNA

5

Application Domain Overview

• Linear Algebra Libraries
  – Provide optimized methods for constructing sets of linear equations, performing operations on them (matrix-matrix products, matrix-vector products) and solving them (factoring, forward & backward substitution).
  – Commonly used libraries include BLAS, ATLAS, LAPACK, ScaLAPACK, PLAPACK.
• PDE Solvers:
  – General-purpose, parallel numerical PDE libraries
  – Usual toolsets include manipulation of sparse data structures, iterative linear system solvers, preconditioners, nonlinear solvers and time-stepping methods.
  – Commonly used libraries for solving PDEs include SAMRAI, PETSc, PARASOL, Overture, among others.

6

Application Domain Overview

• Mesh manipulation and Load Balancing
  – These libraries help in partitioning meshes into roughly equal parts across processors, thereby balancing the workload while minimizing the size of separators and communication costs.
  – Commonly used libraries for this purpose include METIS, ParMETIS, Chaco, JOSTLE, among others.
• Other packages:
  – FFTW: a highly optimized Fourier transform package, including both real and complex multidimensional transforms in sequential, multithreaded, and parallel versions.
  – NAMD: molecular dynamics library available for Unix/Linux, Windows, OS X
  – Fluent: computational fluid dynamics package, used for such applications as environment control systems, propulsion, reactor modeling etc.

7

Outline

• Introduction to High Performance Libraries
• Linear Algebra Libraries (BLAS, LAPACK)
• PDE Solvers (PETSc)
• Mesh manipulation and load balancing (METIS/ParMETIS, JOSTLE)
• Special purpose libraries (FFTW)
• General purpose libraries (C++: Boost)
• Summary – Materials for test

8

BLAS

• (Updated set of) Basic Linear Algebra Subprograms

• The BLAS functionality is divided into three levels:
  – Level 1: contains vector operations of the form y ← αx + y, as well as scalar dot products and vector norms
  – Level 2: contains matrix-vector operations of the form y ← αAx + βy, as well as solving Tx = y for x, with T being triangular
  – Level 3: contains matrix-matrix operations of the form C ← αAB + βC, as well as solves with triangular matrices T. This level contains the widely used General Matrix Multiply (GEMM) operation.

9

BLAS

• Several implementations for different languages exist
  – Reference implementation (F77 and C): http://www.netlib.org/blas/
  – ATLAS, highly optimized for particular processor architectures
  – A generic C++ template class library providing BLAS functionality: uBLAS (http://www.boost.org)
  – Several vendors provide libraries optimized for their architecture (AMD, HP, IBM, Intel, NEC, NVIDIA, Sun)

10

BLAS: F77 naming conventions

• Routine names follow the pattern xYYZZZ:
  – x: data type (S = single real, D = double real, C = single complex, Z = double complex)
  – YY: matrix type (e.g. GE = general, GB = general band, SY = symmetric, TR = triangular, HE = Hermitian)
  – ZZZ: operation performed (e.g. MV = matrix-vector product, MM = matrix-matrix product, SV = triangular solve)

11

BLAS: C naming conventions

• The F77 routine name is changed to lowercase and prefixed with cblas_
• All routines which accept two-dimensional arrays have a new additional first parameter specifying the matrix memory layout (row major or column major)
• Character parameters are replaced by corresponding enum values
• Input arguments are declared const
• Non-complex scalar input parameters are passed by value
• Complex scalar input arguments are passed using a void*
• Arrays are passed by address
• Output scalar arguments are passed by address
• Complex functions become subroutines which return the result via an additional last parameter (void*), appending _sub to the name

12
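To make these conventions concrete, here is a minimal sketch (assuming a CBLAS installation such as the Netlib reference implementation or ATLAS; the matrix values are made up) computing C = αAB + βC with cblas_dgemm. Note the cblas_ prefix, the leading memory-layout argument, the enum parameters, and the const inputs:

#include <cblas.h>
#include <stdio.h>

int main()
{
    const int M = 2, N = 2, K = 2;
    const double A[] = { 1.0, 2.0,    /* M x K, row major */
                         3.0, 4.0 };
    const double B[] = { 5.0, 6.0,    /* K x N, row major */
                         7.0, 8.0 };
    double C[]       = { 0.0, 0.0,    /* M x N, row major */
                         0.0, 0.0 };

    /* F77 DGEMM becomes cblas_dgemm; the first argument selects the
       memory layout, enums replace the F77 character parameters. */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K, 1.0, A, K, B, N, 0.0, C, N);

    printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]);  /* 19 22 / 43 50 */
    return 0;
}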

BLAS Level 1 routines

• Vector operations (xROT, xSWAP, xCOPY, xAXPY etc.)
• Scalar dot products (xDOT etc.)
• Vector norms and related reductions (xNRM2, xASUM, IxAMAX etc.)

13
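A short sketch of the Level 1 calls through the C interface (again assuming CBLAS): DAXPY computes y ← αx + y, DDOT a dot product, and IDAMAX the index of the entry of largest magnitude:

#include <cblas.h>
#include <stdio.h>

int main()
{
    double x[] = { 1.0, -4.0, 2.0 };
    double y[] = { 1.0,  1.0, 1.0 };

    cblas_daxpy(3, 2.0, x, 1, y, 1);          /* y <- 2x + y = {3, -7, 5} */
    double d = cblas_ddot(3, x, 1, y, 1);     /* x . y = 41 */
    int    i = (int)cblas_idamax(3, x, 1);    /* 1, since |x[1]| = 4 is largest */

    printf("x.y = %g, argmax|x_i| = %d\n", d, i);
    return 0;
}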

BLAS Level 2 routines

• Matrix-vector operations (xGEMV, xGBMV, xHEMV, xHBMV etc.)
• Solving Tx = y for x, where T is triangular (xTRSV etc.)
• Rank-1 and rank-2 updates (xGER, xHER etc.)

14

BLAS Level 3 routines

• Matrix-matrix operations (xGEMM etc.)
• Triangular matrix-matrix multiply and solves with triangular matrices (xTRMM, xTRSM)
• Widely used matrix-matrix multiply (xSYMM, xGEMM)

15

Demo 1

• Shows solving a matrix multiplication problem using BLAS expressed in FORTRAN, C, and C++
• Shows the genericity of uBLAS by comparing generic and banded matrix versions
• Shows newmat, a C++ matrix library which uses operator overloading

16

Outline

• Introduction to High Performance Libraries
• Linear Algebra Libraries (BLAS, LAPACK)
• PDE Solvers (PETSc)
• Mesh manipulation and load balancing (METIS/ParMETIS, JOSTLE)
• Special purpose libraries (FFTW)
• General purpose libraries (C++: Boost)
• Summary – Materials for test

17

LAPACK

• Linear Algebra PACKage
  – http://www.netlib.org/lapack/
  – Written in F77
  – Provides routines for:
    • solving systems of simultaneous linear equations,
    • least-squares solutions of linear systems of equations,
    • eigenvalue problems,
    • Householder transformation to implement QR decomposition on a matrix, and
    • singular value problems
  – Was initially designed to run efficiently on shared memory vector machines
  – Depends on BLAS
  – Has been extended for distributed-memory systems (ScaLAPACK and PLAPACK)

18

LAPACK (Architecture)

[figure: LAPACK architecture, layered on top of BLAS]

19

LAPACK naming conventions

• Routine names follow a BLAS-like pattern XYYZZZ: X encodes the data type (S, D, C, Z), YY the matrix type (e.g. GE, SY, TR), and ZZZ the computation performed (e.g. SV = solve; DGESV is the double-precision general linear system solver)

20

Demo 2

• Shows how using a library might speed up the computation considerably

21
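In the spirit of the demo, a small sketch of calling LAPACK from C/C++ through the usual F77 interface (the trailing-underscore symbol name and column-major storage are the common platform conventions; link with -llapack -lblas). DGESV solves AX = B by LU factorization with partial pivoting:

#include <stdio.h>

extern "C" void dgesv_(int *n, int *nrhs, double *a, int *lda,
                       int *ipiv, double *b, int *ldb, int *info);

int main()
{
    /* Solve the 2x2 system { x + 2y = 5, 3x + 4y = 11 }. */
    int n = 2, nrhs = 1, lda = 2, ldb = 2, info;
    int ipiv[2];
    double a[] = { 1.0, 3.0,    /* column-major: first column (1, 3) */
                   2.0, 4.0 };  /*               second column (2, 4) */
    double b[] = { 5.0, 11.0 };

    dgesv_(&n, &nrhs, a, &lda, ipiv, b, &ldb, &info);
    if (info == 0)
        printf("x = %g, y = %g\n", b[0], b[1]);  /* expect x = 1, y = 2 */
    return 0;
}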

Outline

• Introduction to High Performance Libraries
• Linear Algebra Libraries (BLAS, LAPACK)
• PDE Solvers (PETSc)
• Mesh manipulation and load balancing (METIS/ParMETIS, JOSTLE)
• Special purpose libraries (FFTW)
• General purpose libraries (C++: Boost)
• Summary – Materials for test

22

PETSc (pronounced PET-see)

• Portable, Extensible Toolkit for Scientific Computation (http://www-unix.mcs.anl.gov/petsc/petsc-as/)
  – Suite of data structures and routines for the scalable (parallel) solution of scientific applications modeled by partial differential equations (PDEs)
  – Employs the MPI standard for all message-passing communication
  – Intended for use in large-scale application projects
  – Includes a large suite of parallel linear and nonlinear equation solvers
  – Easily used in application codes written in C, C++, Fortran and Python
• Good introduction: http://www-unix.mcs.anl.gov/petsc/petsc-as/documentation/tutorials/nersc02/nersc02.ppt

23

PETSc (general features)

• Features include:
  – Parallel vectors
    • scatters (handles communicating ghost point information)
    • gathers
  – Parallel matrices
    • several sparse storage formats
    • easy, efficient assembly
  – Scalable parallel preconditioners
  – Krylov subspace methods
  – Parallel Newton-based nonlinear solvers
  – Parallel time stepping (ODE) solvers

24

PETSc (Architecture)

[figure: PETSc module architecture and layers of abstraction]

25

PETSc: Component details

• Vector operations (Vec): Provides the vector operations required for setting up and solving large-scale linear and nonlinear problems. Includes easy-to-use parallel scatter and gather operations, as well as special-purpose code for handling ghost points for regular data structures.

• Matrix operations (Mat): A large suite of data structures and code for the manipulation of parallel sparse matrices. Includes four different parallel matrix data structures, each appropriate for a different class of problems.

• Preconditioners (PC): A collection of sequential and parallel preconditioners, including
  – (sequential) ILU(k) (incomplete factorization),
  – LU (lower/upper decomposition),
  – both sequential and parallel block Jacobi, and overlapping additive Schwarz methods

• Time stepping ODE solvers (TS): Code for the time evolution of solutions of PDEs. In addition, provides pseudo-transient continuation techniques for computing steady-state solutions.

26

PETSc: Component details

• Krylov subspace solvers (KSP): Parallel implementations of many popular Krylov subspace iterative methods, including
  – GMRES (Generalized Minimal Residual method),
  – CG (Conjugate Gradient),
  – CGS (Conjugate Gradient Squared),
  – Bi-CG-Stab (BiConjugate Gradient Stabilized),
  – two variants of TFQMR (transpose-free QMR),
  – CR (Conjugate Residuals),
  – LSQR (least-squares solver).
  All are coded so that they are immediately usable with any preconditioner and any matrix data structure, including matrix-free methods.

• Non-linear solvers (SNES): Data-structure-neutral implementations of Newton-like methods for nonlinear systems. Includes both line search and trust region techniques with a single interface. Employs by default the above data structures and linear solvers. Users can set custom monitoring routines, convergence criteria, etc.

27
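To give a feel for how these components fit together, here is a hedged sketch of a complete KSP solve (PETSc 3.x-era API; some call signatures differ across releases, and error checking via CHKERRQ is omitted for brevity). It assembles a 1-D Laplacian and solves Ax = b; the Krylov method and preconditioner are selectable at run time with -ksp_type and -pc_type:

#include <petscksp.h>

int main(int argc, char **argv)
{
    Mat A; Vec x, b; KSP ksp;
    PetscInt n = 10, i;

    PetscInitialize(&argc, &argv, NULL, NULL);

    /* Assemble a tridiagonal 1-D Laplacian as the system matrix (Mat). */
    MatCreate(PETSC_COMM_WORLD, &A);
    MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
    MatSetFromOptions(A);
    MatSetUp(A);
    for (i = 0; i < n; i++) {
        if (i > 0)     MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES);
        if (i < n - 1) MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES);
        MatSetValue(A, i, i, 2.0, INSERT_VALUES);
    }
    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

    /* Vectors (Vec) compatible with A; right-hand side b = 1. */
    MatCreateVecs(A, &x, &b);
    VecSet(b, 1.0);

    /* Krylov solve (KSP); the preconditioner (PC) rides along inside. */
    KSPCreate(PETSC_COMM_WORLD, &ksp);
    KSPSetOperators(ksp, A, A);
    KSPSetFromOptions(ksp);
    KSPSolve(ksp, b, x);

    KSPDestroy(&ksp); MatDestroy(&A); VecDestroy(&x); VecDestroy(&b);
    PetscFinalize();
    return 0;
}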

Outline

• Introduction to High Performance Libraries
• Linear Algebra Libraries (BLAS, LAPACK)
• PDE Solvers (PETSc)
• Mesh manipulation and load balancing (METIS/ParMETIS, JOSTLE)
• Special purpose libraries (FFTW)
• General purpose libraries (C++: Boost)
• Summary – Materials for test

28

Mesh libraries

• Introduction
  – Structured/unstructured meshes
  – Examples
• Mesh decomposition

29

Introduction to Meshes and Grids

• Mesh/Grid: 2D or 3D representation of the computational domain.

• Common 2D meshes are composed of triangular or quadrilateral elements

• Common 3D meshes are composed of hexahedral, tetrahedral or pyramidal elements

30

[Figure: 2D mesh elements (triangle, quadrilateral) and 3D mesh elements (tetrahedron, hexahedron, prism)]

Structured/Unstructured Meshes

Structured grids (meshes):
• Cartesian grids, logically rectangular grids
• Mesh info accessed implicitly using grid point indices
  – Efficient in both computation and storage
• Typically use finite difference discretization

Unstructured meshes:
• Mesh connectivity information must be stored
  – Incurs additional memory and computational cost
• Handle complex geometries and grid adaptivity
• Typically use finite volume or finite element discretization
• Mesh quality becomes a concern

31

Mesh examples

32

Meshes are used for Computation

33

Mesh Decomposition

• The goal is to maximize the interior while minimizing connections between subdomains; that is, to minimize communication.
• Such decomposition problems have been studied in load balancing for parallel computation.
• Lots of choices:
  – METIS, ParMETIS -- University of Minnesota
  – PARTI -- University of Maryland
  – CHACO -- Sandia National Laboratories
  – JOSTLE -- University of Greenwich
  – PARTY -- University of Paderborn
  – SCOTCH -- Université Bordeaux
  – TOP/DOMDEC -- NAS at NASA Ames Research Center

http://www.hlrs.de

34

Mesh Decomposition

• Load balancing
  – Distribute elements evenly across processors.
  – Each processor should have an equal share of the work.
• Communication costs should be minimized.
  – Minimize sub-domain boundary elements.
  – Minimize the number of neighboring domains.
• Distribution should reflect the machine architecture.
  – Communication versus calculation.
  – Bandwidth versus latency.
• Note that optimizing load balance and communication cost simultaneously is an NP-hard problem, so practical partitioners rely on heuristics (a METIS sketch follows after this slide).

http://www.epcc.ed.ac.uk/epcc-tec/documents/meshdecomp-slides/MeshDecomp-13.html

35
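As a sketch of what driving one of the partitioners listed above looks like (METIS 5.x serial API; signatures changed between METIS 4 and 5, so treat this as illustrative): the graph goes in as a CSR adjacency structure and a part number comes back per vertex.

#include <metis.h>
#include <stdio.h>

int main()
{
    /* A 4-cycle: edges 0-1, 1-2, 2-3, 3-0, in CSR form. */
    idx_t nvtxs = 4, ncon = 1, nparts = 2, objval;
    idx_t xadj[]   = { 0, 2, 4, 6, 8 };
    idx_t adjncy[] = { 1, 3, 0, 2, 1, 3, 0, 2 };
    idx_t part[4];

    /* NULLs select unit vertex/edge weights and default options. */
    METIS_PartGraphKway(&nvtxs, &ncon, xadj, adjncy,
                        NULL, NULL, NULL, &nparts,
                        NULL, NULL, NULL, &objval, part);

    printf("edge-cut = %d, parts:", (int)objval);
    for (int i = 0; i < 4; i++) printf(" %d", (int)part[i]);
    printf("\n");
    return 0;
}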

Mesh decomposition

[figure: example mesh decomposition, http://www.hlrs.de]

36

Static and Dynamic Meshes

Static grids (meshes):
• Decomposition need only be carried out once
• Static decomposition may therefore be carried out as a preprocessing step, often done in serial

Dynamic meshes:
• Decomposition must be adapted as the underlying mesh or processor load changes
• Dynamic decomposition therefore becomes part of the calculation itself and cannot be carried out solely as a preprocessing step

37

http://www.epcc.ed.ac.uk/epcc-tec/documents/meshdecomp-slides/MeshDecomp-14.html

HP J6700, 1 CPU -- Solve Time: 13:26 (baseline)

38

Linux Cluster, 2 CPUs -- Solve Time: 5:20, Speed-Up: 2.5x

39

Linux Cluster, 4 CPUs -- Solve Time: 3:07, Speed-Up: 4.3x

40

Linux Cluster, 8 CPUs -- Solve Time: 1:51, Speed-Up: 7.3x

41

Linux Cluster, 16 CPUs -- Solve Time: 1:03, Speed-Up: 12.8x

42

src: Amy Apon, http://www.csce.uark.edu/~aapon/courses/concurrent/notes/marc-ddm.ppt

Speedup due to decomposition

# CPUs    Run-time (s)
     1             806
     2             320
     4             187
     8             111
    16              63

43

Jostle and Metis

44

Jostle

45

Jostle

46

Jostle

47

Metis

48

ParMetis

49

Metis (serial)

50

Comparison

51

(figures on slides 44-51: http://www.hlrs.de)

Outline

• Introduction to High Performance Libraries
• Linear Algebra Libraries (BLAS, LAPACK)
• PDE Solvers (PETSc)
• Mesh manipulation and load balancing (METIS/ParMETIS, JOSTLE)
• Special purpose libraries (FFTW)
• General purpose libraries (C++: Boost)
• Summary – Materials for test

52

FFTW

• Fastest Fourier Transform in the West

• Portable C subroutine library for computing the discrete Fourier transform (DFT) and related transforms, including the discrete cosine/sine transforms (DCT/DST)
• Computes arbitrary size discrete Fourier and Hartley transforms on real or complex data, in one or more dimensions
• Optimized for speed through the application of a special-purpose compiler, genfft (the codelet generator), originally written in OCaml; performance comparable even with vendor-optimized libraries
• Free software, distributed under the GPL; commercial licenses are also available from MIT
• Developed at MIT by Matteo Frigo and Steven G. Johnson
• Won the J. H. Wilkinson Prize for Numerical Software in 1999
• Most recent stable version is 3.1.2 (http://www.fftw.org)

53

Main FFTW Features

• C and FORTRAN interfaces; C++ wrappers available
• Speed, including support for SSE, SSE2, 3DNow! and AltiVec
• Arbitrary size transforms with complexity of O(n·log(n)) (sizes which can be factored into 2, 3, 5 and 7 are most efficient by default, but custom code can also be generated for other sizes if required)
• Even/odd data (DCT/DST), types I-IV
• Can produce pure real output, or process pure real input data
• Efficient handling of multiple, strided transforms (e.g. transformation of multiple arrays at once; one dimension of a multi-dimensional array; one field of a multi-component array)
• Parallel code supporting Cilk, SMP platforms with threads, or MPI
• Ability to save and restore plans optimized for a given platform (through the wisdom mechanism)
• Portable to any platform with a working C compiler

54

FFTW Sample Code

Source: http://www.fftw.org/fftw3.pdf

Computing 1-D complex DFT

55

#include <fftw3.h>
...
{
    fftw_complex *in, *out;
    fftw_plan p;
    ...
    in  = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
    out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
    /* populate in[] with input data */
    ...
    p = fftw_plan_dft_1d(N, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
    ...
    fftw_execute(p); /* repeat as needed */
    /* transform now available in out[] */
    ...
    fftw_destroy_plan(p);
    fftw_free(in); fftw_free(out);
}
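The wisdom mechanism from the feature list can be sketched in a few lines: a tuned plan is exported once and re-imported on later runs, so the planning cost is paid only once per machine (the file name here is hypothetical):

#include <fftw3.h>
#include <stdio.h>

/* Save accumulated wisdom after planning... */
void save_wisdom(void)
{
    FILE *f = fopen("fftw.wisdom", "w");
    if (f) { fftw_export_wisdom_to_file(f); fclose(f); }
}

/* ...and restore it before planning on a later run. */
void load_wisdom(void)
{
    FILE *f = fopen("fftw.wisdom", "r");
    if (f) { fftw_import_wisdom_from_file(f); fclose(f); }
}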

Outline

• Introduction to High Performance Libraries
• Linear Algebra Libraries (BLAS, LAPACK)
• PDE Solvers (PETSc)
• Mesh manipulation and load balancing (METIS/ParMETIS, JOSTLE)
• Special purpose libraries (FFTW)
• General purpose libraries (C++: Boost)
• Summary – Materials for test

56

The Boost Libraries

• What is Boost?
• What's important
• Other stuff

57

What is Boost?

• Data Structures, Containers, Iterators, and Algorithms
• String and Text Processing
• Function Objects and Higher-Order Programming
• Generic Programming and Template Metaprogramming
• Math and Numerics
• Input/Output
• Miscellaneous

• Mostly header only

58

What’s important

• OS abstraction
  – Thread: OS-independent kernel-level thread interface
  – Asio: asynchronous input/output
  – Filesystem: file system operations such as file copy, delete, directory create, file path handling
  – System: OS error code abstraction and handling
  – Program options: handling of command line arguments and parameters
  – Streams: build your own C++ streams
  – DateTime: handling of dates, times and time periods
  – Timer: simple timer object

59

What’s important

• Data types, container types, all extending the STL
  – Pointer containers: allow for pointers in STL containers: ptr_vector<char> instead of vector<char *>
  – Multi index: data structures with multiple indices
  – Constant sized arrays: array<char, 10>, acts like a vector or a plain 'C' array
  – Any: can hold values of any type (if you need polymorphism)
  – Variant: can hold values of any of the types specified at compile time (the 'C' equivalent is a discriminated union; see the sketch below)
  – Optional: can hold a value or nothing
  – Tuple: like a vector or array, but every element may have a different type (similar to a plain struct)
  – Graph library: very sophisticated collection of graph-related data structures and algorithms
    • a parallel version exists (using MPI)

60
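A small sketch of two of the types above (assumes the Boost headers are installed; compiles with any C++ compiler of the era):

#include <boost/array.hpp>
#include <boost/variant.hpp>
#include <iostream>
#include <string>

int main()
{
    // Constant sized array: STL-style interface over a plain 'C' array.
    boost::array<char, 3> a = {{ 'x', 'y', 'z' }};
    std::cout << a.size() << " elements, a[1] = " << a[1] << "\n";

    // Variant: holds any one of the types fixed at compile time.
    boost::variant<int, std::string> v = 42;
    std::cout << boost::get<int>(v) << "\n";
    v = std::string("now a string");
    std::cout << boost::get<std::string>(v) << "\n";
    return 0;
}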

What’s important

• Helper classes
  – Smart pointers: working with pointers without having to worry about memory management (see the sketch below)
  – Memory pools: specialized memory allocation for containers
  – Iterator library: write your own iterator classes with ease (non-trivial otherwise)

61
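For the smart pointers, a minimal boost::shared_ptr sketch: ownership is reference counted, and the pointee is deleted automatically when the last owner disappears.

#include <boost/shared_ptr.hpp>
#include <iostream>

struct Widget {
    ~Widget() { std::cout << "Widget destroyed\n"; }
};

int main()
{
    boost::shared_ptr<Widget> p(new Widget);
    {
        boost::shared_ptr<Widget> q = p;     // reference count is now 2
        std::cout << p.use_count() << "\n";  // prints 2
    }                                        // q leaves scope: count back to 1
    std::cout << p.use_count() << "\n";      // prints 1
    return 0;                                // p destroyed: Widget deleted
}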

Other stuff in Boost

• String and text processing
  – Regex, parsing, format, conversion etc.
• Algorithms
  – String algorithms, FOR_EACH, minmax etc.
• Math and numerics
  – Conversion, interval, random, octonion, quaternion, special functions, rational, uBLAS
• Functional and higher-order programming
  – Bind, lambda, function, ref, signals etc.
• Generic and template metaprogramming
  – Proto, mpl, fusion, phoenix, enable_if etc.
• Testing
  – Unit tests, concept checks, static_assert

62

Conclusion

• Look at Boost first if you need something that is not available in the Standard Library
• Even if it's not in Boost, look around; there are a lot of libraries in preparation for Boost (Boost Sandbox, File Vault)

63

Links

• Boost, current release V1.33.1
  – Web: http://www.boost.org
  – CVS: http://sourceforge.net/projects/boost
• Boost Sandbox
  – CVS: http://sourceforge.net/projects/boost-sandbox
  – File Vault: http://boost-consulting.com/vault/
• Boost mailing lists
  – http://www.boost.org/more/mailing_lists.htm

64

Outlook

Functional specification of an elliptic PDE, discretized by finite volume methods, with a Domain Specific Embedded Language (DSEL) [1]:

equation = sum<vertex_edge>
           [
               sumf<edge_vertex>(0.0, _e)
               [
                   pot * orient(_e, _1)
               ] * A / d * eps
           ] - V * rho

65

References

1. Rene Heinzl, Modern Application Design using Modern Programming Paradigms and a Library-Centric Software Approach, OOPSLA 2006, Workshop on Library Centric Software Design, Portland, Oregon, October 2006.

66

Outline

• Introduction to High Performance Libraries
• Linear Algebra Libraries (BLAS, LAPACK)
• PDE Solvers (PETSc)
• Mesh manipulation and load balancing (METIS/ParMETIS, JOSTLE)
• Special purpose libraries (FFTW)
• General purpose libraries (C++: Boost)
• Summary – Materials for test

67

Summary – Material for the Test

• High performance libraries: 5, 6, 7
• Linear algebra libraries, BLAS: 9, 11, 12
• Linear algebra libraries, LAPACK: 18
• PDE solvers: 23, 24, 26, 27
• Mesh decomposition & load balancing: 30, 31, 34, 35, 37, 44, 45, 46, 48, 49
• FFTW: 53, 54
• Boost: 58, 59, 60, 61, 62