costless software abstractions for parallel hardware system

50
Costless Software Abstractions For Parallel Hardware System Joel Falcou LRI - INRIA - MetaScale SAS Maison de la Simulation - 04/03/2014

Upload: joel-falcou

Post on 27-Jun-2015

336 views

Category:

Technology


2 download

DESCRIPTION

Performing large, intensive or non-trivial computing on array like data structures is one of the most common task in scientific computing, video game development and other fields. This matter of fact is backed up by the large number of tools, languages and libraries to perform such tasks. If we restrict ourselves to C++ based solutions, more than a dozen such libraries exists from BLAS/LAPACK C++ binding to template meta-programming based Blitz++ or Eigen2. If all of these libraries provide good performance or good abstraction, none of them seems to fit the need of so many different user types. Moreover, as parallel system complexity grows, the need to maintain all those components quickly become unwieldy. This talk explores various software design techniques - like Generative Programming, MetaProgramming and Generic Programming - and their application to the implementation of various parallel computing libraries in such a way that: - abstraction and expressiveness are maximized - cost over efficiency is minimized As a conclusion, we'll skim over various applications and see how they can benefit from such tools.

TRANSCRIPT

Page 1: Costless Software Abstractions For Parallel Hardware System

Costless Software Abstractions For Parallel Hardware System

Joel Falcou

LRI - INRIA - MetaScale SAS

Maison de la Simulation -04/03/2014

Page 2: Costless Software Abstractions For Parallel Hardware System

Context

In Scientific Computing ...

� there is Scientific

� Applications are domain driven� Users 6= Developers� Users are reluctant to changes

� there is Computing

� Computing requires performance ...� ... which implies architectures specific tuning� ... which requires expertise� ... which may or may not be available

The ProblemPeople using computers to do science want to do science first.

2 of 34

Page 3: Costless Software Abstractions For Parallel Hardware System

The Problem – and how we want to solve it

The Facts� The ”Library to bind them all” doesn’t exist (or we should have it already)

� All those users want to take advantage of new architectures

� Few of them want to actually handle all the dirty work

The Ends� Provide a ”familiar” interface that let users benefit from parallelism

� Helps compilers to generate better parallel code

� Increase sustainability by decreasing amount of code to write

The Means� Parallel Abstractions: Skeletons

� Efficient Implementation: DSEL

� The Holy Glue: Generative Programming

3 of 34

Page 4: Costless Software Abstractions For Parallel Hardware System

Efficient or Expressive – Choose one

Efficacité

Expressivité

MatlabSciLab

C++JAVA

C, FORTRAN

4 of 34

Page 5: Costless Software Abstractions For Parallel Hardware System

Talk Layout

Introduction

Abstractions

Efficiency

Tools

Conclusion

5 of 34

Page 6: Costless Software Abstractions For Parallel Hardware System

Talk Layout

Introduction

Abstractions

Efficiency

Tools

Conclusion

6 of 34

Page 7: Costless Software Abstractions For Parallel Hardware System

Spotting abstraction when you see one

Processus 1

F1

Processus 7

F3

Processus 3

F2

Processus 4

F2

Processus 5

F2

Processus 2

Distrib.

Processus 6

Collect.

7 of 34

Page 8: Costless Software Abstractions For Parallel Hardware System

Spotting abstraction when you see one

Processus 1

F1

Processus 7

F3

Processus 3

F2

Processus 4

F2

Processus 5

F2

Processus 2

Distrib.

Processus 6

Collect.

FarmPipeline

7 of 34

Page 9: Costless Software Abstractions For Parallel Hardware System

Parallel Skeletons in a nutshell

Basic Principles [COLE 89]

� There are patterns in parallel applications

� Those patterns can be generalized in Skeletons

� Applications are assembled as combination of such patterns

Functionnal point of view

� Skeletons are Higher-Order Functions

� Skeletons support a compositionnal semantic

� Applications become composition of state-less functions

8 of 34

Page 10: Costless Software Abstractions For Parallel Hardware System

Parallel Skeletons in a nutshell

Basic Principles [COLE 89]

� There are patterns in parallel applications

� Those patterns can be generalized in Skeletons

� Applications are assembled as combination of such patterns

Functionnal point of view

� Skeletons are Higher-Order Functions

� Skeletons support a compositionnal semantic

� Applications become composition of state-less functions

8 of 34

Page 11: Costless Software Abstractions For Parallel Hardware System

Classic Parallel Skeletons

Data Parallel Skeletons

� map: Apply a n-ary function in SIMD mode over subset of data

� fold: Perform n-ary reduction over subset of data

� scan: Perform n-ary prefix reduction over subset of data

Task Parallel Skeletons

� par: Independant task execution

� pipe: Task dependency over time

� farm: Load-balancing

9 of 34

Page 12: Costless Software Abstractions For Parallel Hardware System

Classic Parallel Skeletons

Data Parallel Skeletons

� map: Apply a n-ary function in SIMD mode over subset of data

� fold: Perform n-ary reduction over subset of data

� scan: Perform n-ary prefix reduction over subset of data

Task Parallel Skeletons

� par: Independant task execution

� pipe: Task dependency over time

� farm: Load-balancing

9 of 34

Page 13: Costless Software Abstractions For Parallel Hardware System

Why using Parallel Skeletons

Software Abstraction

� Write without bothering with parallel details

� Code is scalable and easy to maintain

� Debuggable, Provable, Certifiable

Hardware Abstraction

� Semantic is set, implementation is free

� Composability ⇒ Hierarchical architecture

10 of 34

Page 14: Costless Software Abstractions For Parallel Hardware System

Why using Parallel Skeletons

Software Abstraction

� Write without bothering with parallel details

� Code is scalable and easy to maintain

� Debuggable, Provable, Certifiable

Hardware Abstraction

� Semantic is set, implementation is free

� Composability ⇒ Hierarchical architecture

10 of 34

Page 15: Costless Software Abstractions For Parallel Hardware System

Talk Layout

Introduction

Abstractions

Efficiency

Tools

Conclusion

11 of 34

Page 16: Costless Software Abstractions For Parallel Hardware System

Generative Programming

Domain SpecificApplication Description

Generative Component Concrete Application

Translator

Parametric Sub-components

12 of 34

Page 17: Costless Software Abstractions For Parallel Hardware System

Generative Programming as a Tool

Available techniques

� Dedicated compilers

� External pre-processing tools

� Languages supporting meta-programming

Definition of Meta-programming

Meta-programming is the writing of computer programs that analyse,transform and generate other programs (or themselves) as their data.

13 of 34

Page 18: Costless Software Abstractions For Parallel Hardware System

Generative Programming as a Tool

Available techniques

� Dedicated compilers

� External pre-processing tools

� Languages supporting meta-programming

Definition of Meta-programming

Meta-programming is the writing of computer programs that analyse,transform and generate other programs (or themselves) as their data.

13 of 34

Page 19: Costless Software Abstractions For Parallel Hardware System

Generative Programming as a Tool

Available techniques

� Dedicated compilers

� External pre-processing tools

� Languages supporting meta-programming

Definition of Meta-programming

Meta-programming is the writing of computer programs that analyse,transform and generate other programs (or themselves) as their data.

13 of 34

Page 20: Costless Software Abstractions For Parallel Hardware System

From Generative to Meta-programming

Meta-programmable languages

� template Haskell

� metaOcaml

� C++

C++ meta-programming

� Relies on the C++ template sub-language

� Handles types and integral constants at compile-time

� Proved to be Turing-complete

14 of 34

Page 21: Costless Software Abstractions For Parallel Hardware System

From Generative to Meta-programming

Meta-programmable languages

� template Haskell

� metaOcaml

� C++

C++ meta-programming

� Relies on the C++ template sub-language

� Handles types and integral constants at compile-time

� Proved to be Turing-complete

14 of 34

Page 22: Costless Software Abstractions For Parallel Hardware System

From Generative to Meta-programming

Meta-programmable languages

� template Haskell

� metaOcaml

� C++

C++ meta-programming

� Relies on the C++ template sub-language

� Handles types and integral constants at compile-time

� Proved to be Turing-complete

14 of 34

Page 23: Costless Software Abstractions For Parallel Hardware System

Domain Specific Embedded Languages

What’s an DSEL ?

� DSL = Domain Specific Language

� Declarative language, easy-to-use, fitting the domain

� DSEL = DSL within a general purpose language

EDSL in C++

� Relies on operator overload abuse (Expression Templates)

� Carry semantic information around code fragment

� Generic implementation become self-aware of optimizations

Exploiting static AST

� At the expression level: code generation

� At the function level: inter-procedural optimization

15 of 34

Page 24: Costless Software Abstractions For Parallel Hardware System

Expression Templates

matrix x(h,w),a(h,w),b(h,w);

x = cos(a) + (b*a);

expr<assign ,expr<matrix&> ,expr<plus , expr<cos ,expr<matrix&> > , expr<multiplies ,expr<matrix&> ,expr<matrix&> > >(x,a,b);

+

*cos

a ab

=

x

#pragma omp parallel forfor(int j=0;j<h;++j){ for(int i=0;i<w;++i) { x(j,i) = cos(a(j,i)) + ( b(j,i) * a(j,i) ); }}

Arbitrary Transforms appliedon the meta-AST

16 of 34

Page 25: Costless Software Abstractions For Parallel Hardware System

Embedded Domain Specific Languages

EDSL in C++

� Relies on operator overload abuse

� Carry semantic information around code fragment

� Generic implementation become self-aware of optimizations

Advantages

� Allow introduction of DSLs without disrupting dev. chain

� Semantic defined as type informations means compile-time resolution

� Access to a large selection of runtime binding

17 of 34

Page 26: Costless Software Abstractions For Parallel Hardware System

Embedded Domain Specific Languages

18 of 34

Page 27: Costless Software Abstractions For Parallel Hardware System

Talk Layout

Introduction

Abstractions

Efficiency

Tools

Conclusion

19 of 34

Page 28: Costless Software Abstractions For Parallel Hardware System

Different Strokes

Objectives

� Apply DSEL generation techniques for different kind of hardware

� Demonstrate low cost of abstractions

� Demonstrate applicability of skeletons

Contributions

� Extend DEMRAL into AA-DEMRAL

� Boost.SIMD

� NT2

20 of 34

Page 29: Costless Software Abstractions For Parallel Hardware System

Architecture Aware Generative Programming

21 of 34

Page 30: Costless Software Abstractions For Parallel Hardware System

Boost.SIMDGeneric Programming for portable SIMDization

� SIMD computation C++ Library� Built with modern C++ style for modern C++ usage� Easy to extend� Easy to adapt to new CPUs

� Goals:� Integrate SIMD computations in standard C++� Make writing of SIMD code easier� Promotes Generic Programming� Provides performance out of the box

22 of 34

Page 31: Costless Software Abstractions For Parallel Hardware System

Boost.SIMDGeneric Programming for portable SIMDization

� SIMD computation C++ Library� Built with modern C++ style for modern C++ usage� Easy to extend� Easy to adapt to new CPUs

� Goals:� Integrate SIMD computations in standard C++� Make writing of SIMD code easier� Promotes Generic Programming� Provides performance out of the box

22 of 34

Page 32: Costless Software Abstractions For Parallel Hardware System

Boost.SIMD

Principles

� pack<T,N> encapsulates the best hardware register for N elements of type T

� pack<T,N> provides a classical value semantic

� Operations onpack<T,N> maps to proper intrinsics

� Support for SIMD standard algortihms

How does it works

� Code is written as for scalar values

� Code can be applied as polymorphic functor over data

� Expression Templates optimize operations

23 of 34

Page 33: Costless Software Abstractions For Parallel Hardware System

Boost.SIMD

Principles

� pack<T,N> encapsulates the best hardware register for N elements of type T

� pack<T,N> provides a classical value semantic

� Operations onpack<T,N> maps to proper intrinsics

� Support for SIMD standard algortihms

How does it works

� Code is written as for scalar values

� Code can be applied as polymorphic functor over data

� Expression Templates optimize operations

23 of 34

Page 34: Costless Software Abstractions For Parallel Hardware System

Boost.SIMD

Current Support

� OS : Linux, Windows, iOS, Android

� Architecture: x86, ARM, PowerPC

� Compilers: g++, clang, icc, MSVC� Extensions:

� x86: SSE2,SSE3,SSSE3,SSE4.x,AVX,XOP,FMA3/4,AVX2� PPC: VMX� ARM: NEON

� In progress:� Xeon Phi MIC� VSX, QPX� NEON2

24 of 34

Page 35: Costless Software Abstractions For Parallel Hardware System

Boost.SIMD - Mandelbrot

pack <int > julia(pack <float > const& a, pack <float > const& b)

{

pack <int > i_t;

std:: size_t i = 0;

pack <int > iter;

float x, y;

do

{

pack <float > x2 = x * x;

pack <float > y2 = y * y;

pack <float > xy = 2 * x * y;

x = x2 - y2 + a;

y = xy + b;

pack <float > m2 = x2 + y2;

iter = selinc(m2 < 4, iter);

i++;

}

while(any(mask) && i < 256);

return iter;

}

25 of 34

Page 36: Costless Software Abstractions For Parallel Hardware System

Boost.SIMD - Mandelbrot

pack <int > julia(pack <float > const& a, pack <float > const& b)

{

pack <int > i_t;

std:: size_t i = 0;

pack <int > iter;

float x, y;

do

{

pack <float > x2 = x * x;

pack <float > y2 = y * y;

pack <float > xy = 2 * x * y;

x = x2 - y2 + a;

y = xy + b;

pack <float > m2 = x2 + y2;

iter = selinc(m2 < 4, iter);

i++;

}

while(any(mask) && i < 256);

return iter;

}

25 of 34

Page 37: Costless Software Abstractions For Parallel Hardware System

Boost.SIMD - Mandelbrot

pack <int > julia(pack <float > const& a, pack <float > const& b)

{

pack <int > i_t;

std:: size_t i = 0;

pack <int > iter;

float x, y;

do

{

pack <float > x2 = x * x;

pack <float > y2 = y * y;

pack <float > xy = 2 * x * y;

x = x2 - y2 + a;

y = xy + b;

pack <float > m2 = x2 + y2;

iter = selinc(m2 < 4, iter);

i++;

}

while(any(mask) && i < 256);

return iter;

}

25 of 34

Page 38: Costless Software Abstractions For Parallel Hardware System

NT2

A Scientific Computing Library

� Provide a simple, Matlab-like interface for users

� Provide high-performance computing entities and primitives

� Easily extendable

Components

� Use Boost.SIMD for in-core optimizations

� Use recursive parallel skeletons for threading

� Code is made independant of architecture and runtime

26 of 34

Page 39: Costless Software Abstractions For Parallel Hardware System

The Numerical Template Toolbox

Efficacité

Expressivité

MatlabSciLab

C++JAVA

C, FORTRAN

NT2

27 of 34

Page 40: Costless Software Abstractions For Parallel Hardware System

The Numerical Template Toolbox

Principles

� table<T,S> is a simple, multidimensional array object that exactly mimicsMatlab array behavior and functionalities

� 500+ functions usable directly either on table or on any scalar values as inMatlab

How does it works

� Take a .m file, copy to a .cpp file

� Add #include <nt2/nt2.hpp> and do cosmetic changes

� Compile the file and link with libnt2.a

28 of 34

Page 41: Costless Software Abstractions For Parallel Hardware System

The Numerical Template Toolbox

Principles

� table<T,S> is a simple, multidimensional array object that exactly mimicsMatlab array behavior and functionalities

� 500+ functions usable directly either on table or on any scalar values as inMatlab

How does it works

� Take a .m file, copy to a .cpp file

� Add #include <nt2/nt2.hpp> and do cosmetic changes

� Compile the file and link with libnt2.a

28 of 34

Page 42: Costless Software Abstractions For Parallel Hardware System

The Numerical Template Toolbox

Principles

� table<T,S> is a simple, multidimensional array object that exactly mimicsMatlab array behavior and functionalities

� 500+ functions usable directly either on table or on any scalar values as inMatlab

How does it works

� Take a .m file, copy to a .cpp file

� Add #include <nt2/nt2.hpp> and do cosmetic changes

� Compile the file and link with libnt2.a

28 of 34

Page 43: Costless Software Abstractions For Parallel Hardware System

The Numerical Template Toolbox

Principles

� table<T,S> is a simple, multidimensional array object that exactly mimicsMatlab array behavior and functionalities

� 500+ functions usable directly either on table or on any scalar values as inMatlab

How does it works

� Take a .m file, copy to a .cpp file

� Add #include <nt2/nt2.hpp> and do cosmetic changes

� Compile the file and link with libnt2.a

28 of 34

Page 44: Costless Software Abstractions For Parallel Hardware System

Performances - Mandelbrodt

29 of 34

Page 45: Costless Software Abstractions For Parallel Hardware System

Performances - LU Decomposition

0 10 20 30 40 50

0

50

100

150

Number of cores

Med

ian

GF

LO

PS

8000× 8000 LU decomposition

NT2

Plasma

30 of 34

Page 46: Costless Software Abstractions For Parallel Hardware System

Performances - linsolve

nt2::tie(x,r) = linsolve(A,b);

Scale C LAPACK C MAGMA NT2 LAPACK NT2 MAGMA

1024× 1024 85.2 85.2 83.1 85.1

2048× 2048 350.7 235.8 348.2 236.0

12000× 1200 735.7 1299.0 734.1 1300.1

31 of 34

Page 47: Costless Software Abstractions For Parallel Hardware System

Talk Layout

Introduction

Abstractions

Efficiency

Tools

Conclusion

32 of 34

Page 48: Costless Software Abstractions For Parallel Hardware System

Let’s round this up!

Parallel Computing for Scientist

� Software Libraries built as Generic and Generative components can solve a largechunk of parallelism related problems while being easy to use.

� Like regular language, EDSL needs informations about the hardware system

� Integrating hardware descriptions as Generic components increases toolsportability and re-targetability

Recent activity

� Follow us on http://www.github.com/MetaScale/nt2

� Prototype for single source GPU support

� Toward a global generic approach to parallelism

33 of 34

Page 49: Costless Software Abstractions For Parallel Hardware System

Let’s round this up!

Parallel Computing for Scientist

� Software Libraries built as Generic and Generative components can solve a largechunk of parallelism related problems while being easy to use.

� Like regular language, EDSL needs informations about the hardware system

� Integrating hardware descriptions as Generic components increases toolsportability and re-targetability

Recent activity

� Follow us on http://www.github.com/MetaScale/nt2

� Prototype for single source GPU support

� Toward a global generic approach to parallelism

33 of 34

Page 50: Costless Software Abstractions For Parallel Hardware System

Thanks for your attention