Costless Software Abstractions For Parallel Hardware System


Post on 27-Jun-2015


DESCRIPTION

Performing large, intensive or non-trivial computations on array-like data structures is one of the most common tasks in scientific computing, video game development and other fields. This is backed up by the large number of tools, languages and libraries built to perform such tasks. If we restrict ourselves to C++-based solutions, more than a dozen such libraries exist, from C++ bindings for BLAS/LAPACK to template meta-programming based libraries like Blitz++ or Eigen2. While each of these libraries provides good performance or good abstraction, none of them seems to fit the needs of so many different kinds of users. Moreover, as parallel system complexity grows, maintaining all those components quickly becomes unwieldy. This talk explores various software design techniques - like Generative Programming, Meta-Programming and Generic Programming - and their application to the implementation of various parallel computing libraries in such a way that: - abstraction and expressiveness are maximized - cost over efficiency is minimized. As a conclusion, we'll skim over various applications and see how they can benefit from such tools.

TRANSCRIPT

  • 1. Costless Software Abstractions For Parallel Hardware System
       Joel Falcou (LRI - INRIA - MetaScale SAS)
       Maison de la Simulation, 04/03/2014

2. Context
   In Scientific Computing ... there are scientific applications:
   - Scientific applications are domain driven
   - Users = Developers
   - Users are reluctant to change
   ... and there is computing:
   - Computing requires performance ...
   - ... which implies architecture-specific tuning ...
   - ... which requires expertise ...
   - ... which may or may not be available
   The Problem: people using computers to do science want to do science first.

3. The Problem, and how we want to solve it
   The Facts:
   - The Library to bind them all doesn't exist (or we should have it already)
   - All those users want to take advantage of new architectures
   - Few of them want to actually handle all the dirty work
   The Ends:
   - Provide a familiar interface that lets users benefit from parallelism
   - Help compilers generate better parallel code
   - Increase sustainability by decreasing the amount of code to write
   The Means:
   - Parallel abstractions: Skeletons
   - Efficient implementation: DSELs
   - The holy glue: Generative Programming

4. Efficient or Expressive: choose one
   [Figure: expressiveness vs. efficiency trade-off, placing Matlab/SciLab on the expressive side, C/FORTRAN on the efficient side, and C++/Java in between]

5. Talk Layout: Introduction - Abstractions - Efficiency - Tools - Conclusion

6. Talk Layout (section divider: Abstractions)

7.-8. Spotting abstraction when you see one
   [Figure: a process graph annotated as a pipeline of stages F1, F2, F3, with the middle stage F2 replicated as a farm of processes between a distribute and a collect step]

9. Parallel Skeletons in a nutshell
   Basic Principles [COLE 89]:
   - There are patterns in parallel applications
   - Those patterns can be generalized as Skeletons
   - Applications are assembled as combinations of such patterns
10. Parallel Skeletons in a nutshell (cont.)
    Functional point of view:
    - Skeletons are Higher-Order Functions
    - Skeletons support a compositional semantics
    - Applications become compositions of state-less functions

11.-12. Classic Parallel Skeletons
    Data-parallel skeletons:
    - map: apply an n-ary function in SIMD mode over subsets of data
    - fold: perform an n-ary reduction over subsets of data
    - scan: perform an n-ary prefix reduction over subsets of data
    Task-parallel skeletons:
    - par: independent task execution
    - pipe: task dependency over time
    - farm: load balancing

13.-14. Why use Parallel Skeletons?
    Software abstraction:
    - Write without bothering with parallel details
    - Code is scalable and easy to maintain
    - Debuggable, provable, certifiable
    Hardware abstraction:
    - Semantics are set, implementation is free
    - Composability
    - Hierarchical architecture

15. Talk Layout (section divider: Efficiency)

16. Generative Programming
    [Diagram: a domain-specific application description is fed to a generative component - a translator plus parametric sub-components - which produces the concrete application]

17.-18. Generative Programming as a Tool
    Available techniques:
    - Dedicated compilers
    - External pre-processing tools
    - Languages supporting meta-programming
19. Generative Programming as a Tool (cont.)
    Definition of meta-programming: meta-programming is the writing of computer programs that analyse, transform and generate other programs (or themselves) as their data.

20.-22. From Generative to Meta-programming
    Meta-programmable languages: Template Haskell, MetaOCaml, C++.
    C++ meta-programming:
    - Relies on the C++ template sub-language
    - Handles types and integral constants at compile time
    - Proved to be Turing-complete

23. Domain Specific Embedded Languages
    What's a DSEL?
    - DSL = Domain Specific Language: a declarative, easy-to-use language fitting its domain
    - DSEL = a DSL embedded within a general-purpose language
    EDSLs in C++:
    - Rely on operator overload abuse (Expression Templates)
    - Carry semantic information along with the code fragment
    - Generic implementations become self-aware of optimizations
    Exploiting the static AST:
    - At the expression level: code generation
    - At the function level: inter-procedural optimization

24. Expression Templates
    matrix x(h,w), a(h,w), b(h,w);
    x = cos(a) + (b*a);
    [Diagram: the right-hand side is captured at compile time as an expression tree (AST) rooted at +, with cos(a) and b*a as sub-trees]

25.-35. Boost.SIMD - Mandelbrot
    (The transcript for these slides is garbled in the source; the surviving fragment is the SIMD escape-time kernel below, lightly reconstructed.)
    pack<int> julia( pack<float> const& a, pack<float> const& b )
    {
      std::size_t i = 0;
      pack<int>   iter(0);
      pack<float> x(0), y(0);
      bool        active;
      do
      {
        pack<float> x2 = x * x;
        pack<float> y2 = y * y;
        pack<float> xy = 2 * x * y;
        x = x2 - y2 + a;
        y = xy + b;
        pack<float> m2   = x2 + y2;
        auto        mask = m2 < 4;
        iter   = selinc( mask, iter );
        active = any( mask );
        i++;
      } while( active && i < 256 );
      return iter;
    }
36.-37. Boost.SIMD - Mandelbrot (build-up slides repeating the same kernel)

38. NT2
    A Scientific Computing Library:
    - Provides a simple, Matlab-like interface for users
    - Provides high-performance computing entities and primitives
    - Easily extendable
    Components:
    - Uses Boost.SIMD for in-core optimizations
    - Uses recursive parallel skeletons for threading
    - Code is made independent of architecture and runtime

39. The Numerical Template Toolbox
    [Figure: the expressiveness vs. efficiency chart from slide 4, with NT2 placed above Matlab/SciLab, C++/Java and C/FORTRAN - expressive and efficient at once]

40.-41. The Numerical Template Toolbox
    Principles:
    - table is a simple, multidimensional array object that exactly mimics Matlab array behaviour and functionality
    - 500+ functions usable directly either on table or on any scalar values, as in Matlab
    How does it work?
    - Take a .m file, copy it to a .cpp file
42.-43. The Numerical Template Toolbox (cont.)
    How does it work?
    - Take a .m file, copy it to a .cpp file
    - Add #include and make cosmetic changes
    - Compile the file and link with libnt2.a

44. Performances - Mandelbrot
    [Figure: Mandelbrot benchmark results]

45. Performances - LU Decomposition
    [Figure: median GFLOPS versus number of cores (0 to 50) for an 8000 x 8000 LU decomposition, NT2 against PLASMA]

46. Performances - linsolve
    nt2::tie(x,r) = linsolve(A,b);

    Scale         | C LAPACK | C MAGMA | NT2 LAPACK | NT2 MAGMA
    1024 x 1024   |     85.2 |    85.2 |       83.1 |      85.1
    2048 x 2048   |    350.7 |   235.8 |      348.2 |     236.0
    12000 x 12000 |    735.7 |  1299.0 |      734.1 |    1300.1

47. Talk Layout (section divider: Conclusion)

48.-49. Let's round this up!
    Parallel Computing for Scientists:
    - Software libraries built as generic and generative components can solve a large chunk of parallelism-related problems while being easy to use.
    - Like a regular language, an EDSL needs information about the hardware system
    - Integrating hardware descriptions as generic components increases the tools' portability and re-targetability
    Recent activity:
    - Follow us on http://www.github.com/MetaScale/nt2
    - Prototype for single-source GPU support
    - Toward a global generic approach to parallelism

50. Thanks for your attention