Designing Architecture-aware Library using Boost.Proto
DESCRIPTION
HAMM and Meeting C++ Talk on our model of architecture description for C++ template metaprogramming.
TRANSCRIPT
![Page 1: Designing Architecture-aware Library using Boost.Proto](https://reader034.vdocuments.mx/reader034/viewer/2022052601/558b3096d8b42a51468b4601/html5/thumbnails/1.jpg)
Designing Architecture-aware Library using Boost.Proto
Joel Falcou, Mathias Gaunard, Pierre Esterie, Eric Jourdanneau
LRI - University Paris Sud XI - CNRS
13/03/2012
![Page 2: Designing Architecture-aware Library using Boost.Proto](https://reader034.vdocuments.mx/reader034/viewer/2022052601/558b3096d8b42a51468b4601/html5/thumbnails/2.jpg)
Context
In Scientific Computing ...
- there is Scientific
  - Applications are domain driven
  - Users ≠ Developers
  - Users are reluctant to change
- there is Computing
  - Computing requires performance ...
  - ... which implies architecture-specific tuning
  - ... which requires expertise
  - ... which may or may not be available
The Problem
People using computers to do science want to do science first.
2 of 34
![Page 7: Designing Architecture-aware Library using Boost.Proto](https://reader034.vdocuments.mx/reader034/viewer/2022052601/558b3096d8b42a51468b4601/html5/thumbnails/7.jpg)
The Problem and how to solve it
The Facts
- The "Library to bind them all" doesn’t exist (or we would have it already)
- The performance of new architectures is underused due to their inherent complexity
- Few people are both experts in parallel programming and SomeOtherField
The Ends
- Find a way to shield users from parallel architecture details
- Find a way to shield developers from parallel architecture details
- Isolate software from hardware evolution
The Means
- Generic Programming
- Template Meta-Programming
- Embedded Domain Specific Languages
3 of 34
![Page 8: Designing Architecture-aware Library using Boost.Proto](https://reader034.vdocuments.mx/reader034/viewer/2022052601/558b3096d8b42a51468b4601/html5/thumbnails/8.jpg)
Talk Layout
Introduction
NT2
Expression Templates v2.0
Supporting GPUs
Conclusion
4 of 34
![Page 9: Designing Architecture-aware Library using Boost.Proto](https://reader034.vdocuments.mx/reader034/viewer/2022052601/558b3096d8b42a51468b4601/html5/thumbnails/9.jpg)
What’s NT2?
A Scientific Computing Library
- Provide a simple, Matlab-like interface for users
- Provide high-performance computing entities and primitives
- Easily extendable
A Research Platform
- Simple framework to add new optimization schemes
- Test bench for EDSL development methodologies
- Test bench for Generic Programming in real life projects
5 of 34
![Page 10: Designing Architecture-aware Library using Boost.Proto](https://reader034.vdocuments.mx/reader034/viewer/2022052601/558b3096d8b42a51468b4601/html5/thumbnails/10.jpg)
The NT2 API
Principles
- table<T,S> is a simple, multidimensional array object that exactly mimics Matlab array behavior and functionalities
- 300+ functions usable directly either on table or on any scalar values, as in Matlab
How does it work
- Take a .m file, copy it to a .cpp file
- Add #include <nt2/nt2.hpp> and make cosmetic changes
- Compile the file and link with libnt2.a
6 of 34
![Page 14: Designing Architecture-aware Library using Boost.Proto](https://reader034.vdocuments.mx/reader034/viewer/2022052601/558b3096d8b42a51468b4601/html5/thumbnails/14.jpg)
Matlab you said?
R = I(:,:,1);
G = I(:,:,2);
B = I(:,:,3);

Y = min(abs(0.299.*R + 0.587.*G + 0.114.*B), 235);
U = min(abs(-0.169.*R - 0.331.*G + 0.5.*B), 240);
V = min(abs(0.5.*R - 0.419.*G - 0.081.*B), 240);
7 of 34
![Page 15: Designing Architecture-aware Library using Boost.Proto](https://reader034.vdocuments.mx/reader034/viewer/2022052601/558b3096d8b42a51468b4601/html5/thumbnails/15.jpg)
Now with NT2
table<double> R = I(_,_,1);
table<double> G = I(_,_,2);
table<double> B = I(_,_,3);
table<double> Y, U, V;

Y = min(abs(0.299*R + 0.587*G + 0.114*B), 235);
U = min(abs(-0.169*R - 0.331*G + 0.5*B), 240);
V = min(abs(0.5*R - 0.419*G - 0.081*B), 240);
8 of 34
![Page 16: Designing Architecture-aware Library using Boost.Proto](https://reader034.vdocuments.mx/reader034/viewer/2022052601/558b3096d8b42a51468b4601/html5/thumbnails/16.jpg)
Now with NT2
table<float, settings(shallow_, of_size_<N,M>)> R = I(_,_,1);
table<float, settings(shallow_, of_size_<N,M>)> G = I(_,_,2);
table<float, settings(shallow_, of_size_<N,M>)> B = I(_,_,3);
table<float, of_size_<N,M> > Y, U, V;

Y = min(abs(0.299*R + 0.587*G + 0.114*B), 235);
U = min(abs(-0.169*R - 0.331*G + 0.5*B), 240);
V = min(abs(0.5*R - 0.419*G - 0.081*B), 240);
8 of 34
![Page 17: Designing Architecture-aware Library using Boost.Proto](https://reader034.vdocuments.mx/reader034/viewer/2022052601/558b3096d8b42a51468b4601/html5/thumbnails/17.jpg)
Some real life applications
9 of 34
![Page 18: Designing Architecture-aware Library using Boost.Proto](https://reader034.vdocuments.mx/reader034/viewer/2022052601/558b3096d8b42a51468b4601/html5/thumbnails/18.jpg)
NT2 before Boost.Proto
EDSL Core
- Code base was 11 KLOC
- 8 KLOC were dedicated to the various Expression Template glue
- 3 KLOC of actual smart stuff
Architectural support
- Altivec and SSE2 extension support
- Some vague pthread support
- Efforts were stagnant: adding a simple feature required changing multiple KLOC of code
10 of 34
![Page 20: Designing Architecture-aware Library using Boost.Proto](https://reader034.vdocuments.mx/reader034/viewer/2022052601/558b3096d8b42a51468b4601/html5/thumbnails/20.jpg)
Embedded Domain Specific Languages
What’s an EDSL?
- DSL = Domain Specific Language
- Declarative language, easy-to-use, fitting the domain
- EDSL = DSL within a general purpose language
EDSL in C++
- Relies on operator overload abuse (Expression Templates)
- Carries semantic information around code fragments
- Generic implementations become aware of optimizations
12 of 34
![Page 21: Designing Architecture-aware Library using Boost.Proto](https://reader034.vdocuments.mx/reader034/viewer/2022052601/558b3096d8b42a51468b4601/html5/thumbnails/21.jpg)
Expression Templates
matrix x(h,w), a(h,w), b(h,w);
x = cos(a) + (b*a);

expr<assign, expr<matrix&>, expr<plus, expr<cos, expr<matrix&> >, expr<multiplies, expr<matrix&>, expr<matrix&> > > >(x,a,b);

[Figure: AST of the expression. The root = has children x and +; + has children cos(a) and *; * has children b and a]

#pragma omp parallel for
for(int j=0;j<h;++j)
{
  for(int i=0;i<w;++i)
  {
    x(j,i) = cos(a(j,i)) + ( b(j,i) * a(j,i) );
  }
}

Arbitrary Transforms applied on the meta-AST
13 of 34
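The expression-template mechanism on this slide can be illustrated with a minimal hand-written sketch. Everything below (the `et` namespace, `vec`, `plus_node`, `times_node`, `cos_node`) is a hypothetical stand-in, not NT2 or Boost.Proto code: the operators only build a tree type, and the single loop in `operator=` evaluates it element by element.

```cpp
#include <cassert>   // used by the checks in the usage note
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical sketch: none of these names come from NT2 or Boost.Proto.
namespace et
{
  // CRTP base identifying any type that is a node of the expression tree
  template<class E> struct expr
  {
    E const& self() const { return static_cast<E const&>(*this); }
  };

  // Nodes keep references to their children: the tree must be consumed
  // in the same full expression that built it.
  template<class L, class R> struct plus_node : expr<plus_node<L,R>>
  {
    L const& lhs; R const& rhs;
    plus_node(L const& l, R const& r) : lhs(l), rhs(r) {}
    double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
  };

  template<class L, class R> struct times_node : expr<times_node<L,R>>
  {
    L const& lhs; R const& rhs;
    times_node(L const& l, R const& r) : lhs(l), rhs(r) {}
    double operator[](std::size_t i) const { return lhs[i] * rhs[i]; }
  };

  template<class A> struct cos_node : expr<cos_node<A>>
  {
    A const& arg;
    explicit cos_node(A const& a) : arg(a) {}
    double operator[](std::size_t i) const { return std::cos(arg[i]); }
  };

  // Terminal: owns the data
  struct vec : expr<vec>
  {
    std::vector<double> data;
    explicit vec(std::size_t n, double v = 0.0) : data(n, v) {}
    double operator[](std::size_t i) const { return data[i]; }

    // Assignment walks the whole tree once, inside a single loop
    template<class E> vec& operator=(expr<E> const& e)
    {
      E const& tree = e.self();
      for(std::size_t i = 0; i < data.size(); ++i) data[i] = tree[i];
      return *this;
    }
  };

  // Operators build tree nodes instead of computing temporaries
  template<class L, class R>
  plus_node<L,R> operator+(expr<L> const& l, expr<R> const& r)
  { return plus_node<L,R>(l.self(), r.self()); }

  template<class L, class R>
  times_node<L,R> operator*(expr<L> const& l, expr<R> const& r)
  { return times_node<L,R>(l.self(), r.self()); }

  template<class A>
  cos_node<A> cos(expr<A> const& a) { return cos_node<A>(a.self()); }
}
```

With `et::vec a(4, 0.0), b(4, 2.0), x(4);`, the statement `x = et::cos(a) + (b * a);` has a right-hand side of type `plus_node<cos_node<vec>, times_node<vec,vec>>` and fills `x` in one pass, with no intermediate vector.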
![Page 22: Designing Architecture-aware Library using Boost.Proto](https://reader034.vdocuments.mx/reader034/viewer/2022052601/558b3096d8b42a51468b4601/html5/thumbnails/22.jpg)
Boost.Proto
What’s Boost.Proto?
- An EDSL for defining EDSLs in C++
- Generalizes Expression Templates
- Easy way to define and test an EDSL
- EDSL = some Grammar Rules + some Semantics
Boost.Proto Benefits
- Fast development process
- Compiler guys are at ease: Grammar + Semantics + Code Generation process
- Easily extendable through Transforms
- Easy to handle: MPL and Fusion compatible
14 of 34
![Page 23: Designing Architecture-aware Library using Boost.Proto](https://reader034.vdocuments.mx/reader034/viewer/2022052601/558b3096d8b42a51468b4601/html5/thumbnails/23.jpg)
Why Boost.Proto?
Benefits
- Scalability: Keep EDSL glue code size low
- Extensibility: Simplify the design of new optimizations
- Debugging: Allow for meaningful error reporting
Remaining Challenges
- Provide a generic "compilation" process
- Select the best function definition with respect to the hardware
- Allow for non-intrusive definition of new passes
15 of 34
![Page 24: Designing Architecture-aware Library using Boost.Proto](https://reader034.vdocuments.mx/reader034/viewer/2022052601/558b3096d8b42a51468b4601/html5/thumbnails/24.jpg)
State of the art EDSL
Usual approaches and results
- Eigen: supports SIMD arithmetic, thread only algorithms
- uBLAS: parallelism comes from potential back-end
What’s the problem?
- Hardware considerations come after the ET code
- Retro-fitting parallelism is hard
- Need for a hardware-aware EDSL
- Back to Generative Programming
16 of 34
![Page 25: Designing Architecture-aware Library using Boost.Proto](https://reader034.vdocuments.mx/reader034/viewer/2022052601/558b3096d8b42a51468b4601/html5/thumbnails/25.jpg)
Principles of Generative Programming
[Figure: generative programming pipeline. Labels: Domain Specific Application Description, Generative Component (Translator, Parametric Sub-components), Concrete Application]
17 of 34
![Page 26: Designing Architecture-aware Library using Boost.Proto](https://reader034.vdocuments.mx/reader034/viewer/2022052601/558b3096d8b42a51468b4601/html5/thumbnails/26.jpg)
Generative Programming v2
18 of 34
![Page 27: Designing Architecture-aware Library using Boost.Proto](https://reader034.vdocuments.mx/reader034/viewer/2022052601/558b3096d8b42a51468b4601/html5/thumbnails/27.jpg)
Our Approach
Principles
- Segment EDSL evaluation in phases
- Each phase uses Proto transforms to advance code generation
- Hardware specifications are active elements in function overloading
- Use Generalized Tag Dispatching (GTD) to select the proper function implementation
Advantages
- Hardware can influence part or all of the evaluation process
- Proto transform syntax is easy to use
- GTD increases opportunities for optimizations
19 of 34
![Page 28: Designing Architecture-aware Library using Boost.Proto](https://reader034.vdocuments.mx/reader034/viewer/2022052601/558b3096d8b42a51468b4601/html5/thumbnails/28.jpg)
Generalized Tag Dispatching
Principles
- Tag dispatching uses type tags to select function overloads
- Classical example: standard Iterators
- We augment the system with:
  - Tag computed from the function identifier
  - Tag computed from the hardware configuration
20 of 34
![Page 30: Designing Architecture-aware Library using Boost.Proto](https://reader034.vdocuments.mx/reader034/viewer/2022052601/558b3096d8b42a51468b4601/html5/thumbnails/30.jpg)
Generalized Tag Dispatching
namespace tag { struct plus_ : elementwise_<plus_> {}; }

functor<tag::plus_> callee;
callee( some_a, some_b );
20 of 34
![Page 31: Designing Architecture-aware Library using Boost.Proto](https://reader034.vdocuments.mx/reader034/viewer/2022052601/558b3096d8b42a51468b4601/html5/thumbnails/31.jpg)
Generalized Tag Dispatching
template<class... A>
auto functor<Tag,Site>::operator()(A&&... args)
{
  dispatch< typename hierarchy_of<Site>::type
          , typename hierarchy_of<Tag>::type(typename hierarchy_of<A>::type...)
          > callee;
  return callee( std::forward<A>(args)... );
}
20 of 34
![Page 32: Designing Architecture-aware Library using Boost.Proto](https://reader034.vdocuments.mx/reader034/viewer/2022052601/558b3096d8b42a51468b4601/html5/thumbnails/32.jpg)
Generalized Tag Dispatching
template<class A0>
struct dispatch< tag::sse_
               , tag::plus_( simd_< real_< float_<A0> > >
                           , simd_< real_< float_<A0> > >
                           )
               >
{
  A0 operator()(A0&& a, A0&& b)
  {
    return _mm_add_ps( a, b );
  }
};
21 of 34
![Page 33: Designing Architecture-aware Library using Boost.Proto](https://reader034.vdocuments.mx/reader034/viewer/2022052601/558b3096d8b42a51468b4601/html5/thumbnails/33.jpg)
Generalized Tag Dispatching
template<class S, class R, class L, class N, class A0>
struct dispatch<openmp_<S>, reduction_<R,L,N>(ast_<A0>)>
{
  typename A0::value_type operator()(A0&& ast)
  {
    // Initial value of the global result
    auto result = functor<N,S>()(as_<typename A0::value_type>());
    #pragma omp parallel
    {
      // Per-thread partial result
      auto local = functor<N,S>()(as_<typename A0::value_type>());
      #pragma omp for nowait
      for(int i=0;i<size(ast);++i)
        functor<R,S>()(local, ast(i));

      // Merge the partial results one thread at a time
      #pragma omp critical
      functor<L,S>()(result, local);
    }

    return result;
  }
};
21 of 34
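The selection rule behind the two dispatch specializations above can be reproduced with plain overloading. The sketch below is a simplified assumption, not the NT2 implementation: the site tags `cpu_`, `sse_`, `avx_`, the function tags and the `impl` overloads are illustrative names, and the implementations return a label instead of computing anything. A site inherits from a more general one, so a missing overload falls back to the closest parent.

```cpp
#include <cassert>   // used by the checks in the usage note
#include <string>

// Hypothetical sketch of dispatching on a (site, function) pair of tags.
namespace gtd
{
  // Hardware sites: each one inherits from a more general parent
  struct cpu_ {};
  struct sse_ : cpu_ {};
  struct avx_ : sse_ {};

  // Function tags, one per function identifier
  struct plus_  {};
  struct times_ {};

  // Implementations, selected by overload resolution on both tags.
  // Only plus_ has a version specific to sse_.
  inline std::string impl(cpu_, plus_)  { return "scalar plus";  }
  inline std::string impl(cpu_, times_) { return "scalar times"; }
  inline std::string impl(sse_, plus_)  { return "sse plus";     }

  // Generic entry point: the caller names the function and the site,
  // the compiler picks the most specific implementation available.
  template<class Tag, class Site>
  struct functor
  {
    std::string operator()() const { return impl(Site(), Tag()); }
  };
}
```

Here `gtd::functor<gtd::plus_, gtd::avx_>()()` selects the `sse_` overload, since `avx_` to `sse_` is a closer derived-to-base conversion than `avx_` to `cpu_`, while `gtd::functor<gtd::times_, gtd::avx_>()()` falls back to the scalar version. In a real library the site would presumably be computed from the compiler flags rather than written by hand.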
![Page 34: Designing Architecture-aware Library using Boost.Proto](https://reader034.vdocuments.mx/reader034/viewer/2022052601/558b3096d8b42a51468b4601/html5/thumbnails/34.jpg)
Multipass EDSL Code Generation
Principles
- Don’t walk the AST directly
- Capture where hardware introduces extension points
- Use GTD to select proper phase implementation
Structure
- Optimize: AST to AST transforms
- Schedule: AST to Forest of Code Generator transforms
- Run: Code Generator to Code transforms
22 of 34
![Page 38: Designing Architecture-aware Library using Boost.Proto](https://reader034.vdocuments.mx/reader034/viewer/2022052601/558b3096d8b42a51468b4601/html5/thumbnails/38.jpg)
Multipass EDSL Code Generation
Optimize
- AST Pattern Matching using Proto grammar
- Replace large sequences of operations by equivalent, faster kernels
- E.g: a+b*c to gemm or fma, x = inv(a)*b to solve(a,x,b)
Schedule
- Segment AST into loop-compatible loop functors
- Use AST node hierarchy (and GTD) for splitting
Run
- Generate code for each AST in the scheduled forest
- Potentially generate runtime calls for runtime optimization
- Classical EDSL code generation happens here on the correct hardware
23 of 34
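As an illustration of the Optimize phase, here is a minimal sketch of an AST to AST rewrite of a+b*c into a fused multiply-add. It uses plain function template overloading instead of a Proto grammar, and every name (`opt`, `term`, `plus`, `times`, `fma_node`, `add`, `mul`, `optimize`, `eval`) is an assumption made for the example. Only the root of the tree is inspected; a real pass would recurse into the children.

```cpp
#include <cassert>      // used by the checks in the usage note
#include <cmath>
#include <type_traits>

// Hypothetical sketch: a type-level pattern match, not Proto's API.
namespace opt
{
  // Tiny AST: terminals hold a value, nodes hold their children by value
  struct term { double v; };
  template<class L, class R> struct plus  { L l; R r; };
  template<class L, class R> struct times { L l; R r; };
  template<class A, class B, class C> struct fma_node { A a; B b; C c; };

  // Tree builders
  template<class L, class R> plus<L,R>  add(L l, R r) { return {l, r}; }
  template<class L, class R> times<L,R> mul(L l, R r) { return {l, r}; }

  // Optimize: by default a tree is left unchanged ...
  template<class E> E optimize(E e) { return e; }

  // ... but the pattern A + B*C is replaced by the fused node fma(B, C, A)
  template<class A, class B, class C>
  fma_node<B,C,A> optimize(plus<A, times<B,C>> e)
  { return {e.r.l, e.r.r, e.l}; }

  // Run: evaluate any tree, fused or not
  inline double eval(term t) { return t.v; }
  template<class L, class R>
  double eval(plus<L,R> e)  { return eval(e.l) + eval(e.r); }
  template<class L, class R>
  double eval(times<L,R> e) { return eval(e.l) * eval(e.r); }
  template<class A, class B, class C>
  double eval(fma_node<A,B,C> e)
  { return std::fma(eval(e.a), eval(e.b), eval(e.c)); }

  // The rewrite happens entirely at compile time, on the type of the tree
  static_assert( std::is_same< decltype(optimize(add(term{}, mul(term{}, term{}))))
                             , fma_node<term,term,term>
                             >::value
               , "a + b*c is rewritten to fma(b, c, a)"
               );
}
```

For terminals holding 1, 2 and 3, `opt::eval(opt::optimize(opt::add(a, opt::mul(b, c))))` goes through `std::fma(2, 3, 1)` and gives the same value as evaluating the original tree.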
![Page 41: Designing Architecture-aware Library using Boost.Proto](https://reader034.vdocuments.mx/reader034/viewer/2022052601/558b3096d8b42a51468b4601/html5/thumbnails/41.jpg)
The new NT2 Structure
State of the code base
- Expression Template handling: 200-300 LOC
- GTD handling: 1 KLOC
- Actual features and function implementation: 10 KLOC
Evolution process
- Adding a new site: 100-500 LOC
- Adding a function: 10-100 LOC
- Time for complete new architecture support: 1 to 3 weeks
24 of 34
![Page 42: Designing Architecture-aware Library using Boost.Proto](https://reader034.vdocuments.mx/reader034/viewer/2022052601/558b3096d8b42a51468b4601/html5/thumbnails/42.jpg)
The new NT2 Structure
24 of 34
![Page 43: Designing Architecture-aware Library using Boost.Proto](https://reader034.vdocuments.mx/reader034/viewer/2022052601/558b3096d8b42a51468b4601/html5/thumbnails/43.jpg)
Some results
Support for OpenMP
- 1 build system script
- 4 new overloads for the run phase
- 1 new allocator to prevent false sharing
Support for Next Gen SIMD
- AVX: 2 weeks of work, functional
- AVX2: prototype ready, works in simulator
- LarBn: prototype ready, works in simulator
25 of 34
![Page 45: Designing Architecture-aware Library using Boost.Proto](https://reader034.vdocuments.mx/reader034/viewer/2022052601/558b3096d8b42a51468b4601/html5/thumbnails/45.jpg)
OpenCL in action
Principle
- Standard for manycore/multicore systems
- Uses runtime kernels and JIT compilation
Example
const char* OpenCLSource[] = {
  "__kernel void VectorAdd(__global int* c, __global int* a, __global int* b, unsigned int nb)",
  "{",
  "  // Index of the elements to add \n",
  "  unsigned int n = get_global_id(0);",
  "  // Sum the n'th element of vectors a and b and store in c \n",
  "  c[n] = 0;",
  "  for (int i=0;i<nb;i++) c[n] += (a[n]+b[n]);",
  "}"
};
27 of 34
![Page 46: Designing Architecture-aware Library using Boost.Proto](https://reader034.vdocuments.mx/reader034/viewer/2022052601/558b3096d8b42a51468b4601/html5/thumbnails/46.jpg)
From Proto to OpenCL
The Schedule Pass
- Split AST as usual on function hierarchy
- Replace each AST by a functor generating the equivalent OpenCL
- Compute a static hash of the tree to remember already compiled expressions
The Run pass
- Stream data at beginning of each kernel
- Handle generic calls and retrieval of compiled kernel
- Fire asynchronous kernel calls over the forest
28 of 34
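The idea of remembering already compiled expressions can be sketched without any OpenCL call. The example below is an assumption-laden stand-in: it keys the cache on `std::type_index` of the expression type rather than on a static hash, "compiling" only means generating a source string, and all names (`cache`, `arg`, `plus`, `times`, `source`, `kernel_cache`) are hypothetical.

```cpp
#include <cassert>        // used by the checks in the usage note
#include <string>
#include <typeindex>
#include <typeinfo>
#include <unordered_map>

// Hypothetical sketch: std::type_index replaces the static hash of the tree,
// and a generated string replaces the JIT-compiled OpenCL kernel.
namespace cache
{
  // Stand-in AST: arg<N> is the N-th kernel argument
  template<int N> struct arg {};
  template<class L, class R> struct plus  {};
  template<class L, class R> struct times {};

  // Generate the kernel body from the type of the tree
  template<int N> std::string source(arg<N>)
  { return "a" + std::to_string(N) + "[i]"; }

  template<class L, class R> std::string source(plus<L,R>)
  { return "(" + source(L()) + " + " + source(R()) + ")"; }

  template<class L, class R> std::string source(times<L,R>)
  { return "(" + source(L()) + " * " + source(R()) + ")"; }

  struct kernel_cache
  {
    std::unordered_map<std::type_index, std::string> kernels;
    int compilations = 0;

    // Return the kernel for this expression type, building it on first use
    template<class Expr> std::string const& get(Expr e)
    {
      std::type_index key(typeid(Expr));
      auto it = kernels.find(key);
      if(it == kernels.end())
      {
        ++compilations;  // stands for the costly runtime compilation
        it = kernels.emplace(key, source(e)).first;
      }
      return it->second;
    }
  };
}
```

Requesting the same expression type twice from one `kernel_cache` increments `compilations` only once, which is the behavior that would make every run after the first one cheap.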
![Page 48: Designing Architecture-aware Library using Boost.Proto](https://reader034.vdocuments.mx/reader034/viewer/2022052601/558b3096d8b42a51468b4601/html5/thumbnails/48.jpg)
Support for OpenCL
NT2 Example
int main()
{
  int nbpoints = 512000;
  table<float, settings(_1D)> xx(ofSize(nbpoints));
  table<float, settings(_1D)> yy(ofSize(nbpoints));

  for (int i=1;i<=nbpoints;i++)
  {
    xx(i) = -nbpoints/2+i;
    yy(i) = 0.0;
  }
  yy = 100*sin(8*3.14*xx/512000);

  std::cout << xx(1) << " " << yy(1);
}
29 of 34
![Page 49: Designing Architecture-aware Library using Boost.Proto](https://reader034.vdocuments.mx/reader034/viewer/2022052601/558b3096d8b42a51468b4601/html5/thumbnails/49.jpg)
Support for OpenCL
NT2 Example
int main()
{
  int nbpoints = 512000;
  table<float, settings(_1D, nt2::gpu)> xx(ofSize(nbpoints));
  table<float, settings(_1D, nt2::gpu)> yy(ofSize(nbpoints));

  for (int i=1;i<=nbpoints;i++)
  {
    xx(i) = -nbpoints/2+i;
    yy(i) = 0.0;
  }
  yy = 100*sin(8*3.14*xx/512000);

  std::cout << xx(1) << " " << yy(1);
}
29 of 34
![Page 50: Designing Architecture-aware Library using Boost.Proto](https://reader034.vdocuments.mx/reader034/viewer/2022052601/558b3096d8b42a51468b4601/html5/thumbnails/50.jpg)
Preliminary results for OpenCL
Generating some signals
for (int i=1;i<=nbpoints;i++)
{
  xx(i) = -nbpoints/2+i;
  yy(i) = 0.0;
}

yy = 100*sin(8*3.14*xx/512000);
yy = yy+0.0001*xx;

for (int ii=1;ii<512000;ii+=15000)
{
  addGaussian(xx,yy,(float)ii,1000);
  yy = 1-yy;
}

yy = 1-yy;
yy = yy*yy;
yy = yy*cos(xx);
30 of 34
![Page 51: Designing Architecture-aware Library using Boost.Proto](https://reader034.vdocuments.mx/reader034/viewer/2022052601/558b3096d8b42a51468b4601/html5/thumbnails/51.jpg)
Preliminary results for OpenCL
Generating some signals
template<typename A>
void addGaussian(A& xx, A& yy, float centre, float largeur)
{
  yy = yy + ( 100000/(val(largeur)*sqrt(2*3.1415)) )
          * exp( -(xx-val(centre))*(xx-val(centre))
                 /(2*val(largeur)*val(largeur))
               );
}
30 of 34
![Page 52: Designing Architecture-aware Library using Boost.Proto](https://reader034.vdocuments.mx/reader034/viewer/2022052601/558b3096d8b42a51468b4601/html5/thumbnails/52.jpg)
Preliminary results for OpenCL
Generating some signals
Configuration       Time
CPU 1 core          520.2 ms
CPU 4 cores         141.3 ms
CPU 4 cores+SIMD    36.1 ms
GPU First Run       878.2 ms
GPU Other Run       13.1 ms
GPU Manual OCL      12.9 ms
- Compares favorably to ViennaCL and AccelerEyes
30 of 34
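The timings above can be turned into speed-ups with a few lines of arithmetic. The sketch below only reuses the six measured values; the `timing` namespace, the constant names and the break-even model (one expensive first GPU run followed by runs at the "Other Run" cost) are assumptions made for illustration.

```cpp
#include <cassert>   // used by the checks in the usage note

// Hypothetical helper: plain arithmetic on the timings reported above.
namespace timing
{
  // Timings in milliseconds, copied from the results above
  constexpr double cpu_1_core       = 520.2;
  constexpr double cpu_4_cores      = 141.3;
  constexpr double cpu_4_cores_simd = 36.1;
  constexpr double gpu_first_run    = 878.2;
  constexpr double gpu_other_run    = 13.1;
  constexpr double gpu_manual_ocl   = 12.9;

  // Speed-up of a configuration relative to the single-core CPU run
  constexpr double speedup(double t) { return cpu_1_core / t; }

  // Smallest number of runs after which the GPU version, first run included,
  // costs less in total than the same number of 4 cores+SIMD runs
  inline int break_even_runs()
  {
    int n = 1;
    while(gpu_first_run + (n - 1) * gpu_other_run >= n * cpu_4_cores_simd)
      ++n;
    return n;
  }
}
```

Under these assumptions, 4 cores give about 3.7x, 4 cores+SIMD about 14.4x and a warm GPU run about 39.7x over one core, the generated kernel is within roughly 2% of the manual OpenCL version, and the GPU pays off from the 38th run onward.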
![Page 54: Designing Architecture-aware Library using Boost.Proto](https://reader034.vdocuments.mx/reader034/viewer/2022052601/558b3096d8b42a51468b4601/html5/thumbnails/54.jpg)
Let’s round this up!
Parallel Computing for Scientists
- Parallel programming tools are required
- EDSLs are an efficient way to design such tools
- But hardware has to stay in the loop
Our claims
- Software Libraries built as Generic and Generative components can solve a large chunk of parallelism related problems while being easy to use.
- Like regular languages, EDSLs need information about the hardware system
- Integrating hardware descriptions as Generic components increases tool portability and re-targetability
32 of 34
![Page 56: Designing Architecture-aware Library using Boost.Proto](https://reader034.vdocuments.mx/reader034/viewer/2022052601/558b3096d8b42a51468b4601/html5/thumbnails/56.jpg)
Current and Future Works
Recent activity
- A public release of NT2 is planned "soon"
- A Matlab to NT2 compiler has been designed
- Prototype for single source CUDA support in NT2
- PhD started on MPI support in NT2
What we’ll be cooking in the future
- std::future is the future
- Experimental research on NT2 for embedded systems
- Toward a global generic approach to parallelism
33 of 34
![Page 58: Designing Architecture-aware Library using Boost.Proto](https://reader034.vdocuments.mx/reader034/viewer/2022052601/558b3096d8b42a51468b4601/html5/thumbnails/58.jpg)
Thanks for your attention