
Page 1:

FP3C

Advanced High Performance Scientific Computing

III of III

Serge G. Petiton

[email protected]

1

Page 2:

Part I: Introduction, programming paradigms and algorithms for basic linear algebra (9am-10:30am and 11am-12:30pm)
1. Brief history and survey of supercomputing
2. Main HPC architectures and programming paradigms
3. Vector and data parallel programming
4. Dense and sparse linear algebra data structures and algorithms

Part II: Krylov methods (1:30pm-3pm)
1. GMRES and ERAM methods
2. The parallel/distributed ERAM and MERAM methods
3. The GMRES/ERAM-LS hybrid asynchronous parallel/distributed methods
4. Auto-tuning of subspace parallel Krylov methods

Part III: Toward exascale programming (3:30pm-5pm)
1. YML/XMP programming paradigms
2. Block Gauss-Jordan method using the YML/XMP framework
3. On the road to Exascale programming

2  

Page 3:

Part III

Towards Exascale Programming

(3:30pm-5pm)

1. YML/XMP programming paradigms
2. Block Gauss-Jordan method using the YML/XMP framework
3. On the road to Exascale programming

3  

Page 4:

Toward graph of tasks/components computing

• Communications have to be minimized: but not all communications have the same cost, in terms of energy and time.

• Latencies between distant cores will be very time consuming: global reductions and other synchronized global operations will become a real bottleneck.

• We have to avoid large inner products, global synchronizations, and other operations involving communications across all the cores. Large-granularity parallelism is required (cf. communication-avoiding (CA) techniques and hybrid methods).

• Graph of tasks/components programming makes it possible to limit these communications to the cores allocated to a given task/component.

• Communications between these tasks and the I/O may be optimized using efficient scheduling and orchestration strategies (asynchronous I/O and others).

• Distributed computing meets parallel computing, as future super(hyper)computers become very hierarchical and communications become more and more important. Scheduling strategies will have to be developed.

4  

Page 5:

Toward graph of tasks/components computing and other computing levels

• Each task/component may be an existing method/software developed for a large part of the cores, but not all of them (so classical or CA methods may be efficient).

• The computation on each core may use multithread optimizations and runtime libraries.
• Accelerator programming may also be optimized at this level.
• We then have the following levels of programming and computing:

– graph of components, already developed or new ones,
– each component is run on a large part of the computer, on a large number of cores,
– on each processor, we may program accelerators,
– on each core, we have a multithread optimization.

• In terms of programming paradigms, we propose: graph of tasks (data-flow oriented) / SPMD or PGAS-like or … / data parallelism.

• We have to allow users to give expertise to the middleware, runtime system and schedulers. Scientific end-users have to be the principal target of the co-design process. Frameworks and languages have to consider them first.

5  

Page 6:

Main properties:

• High level graph description language
• Independent of middleware, hardware and libraries
• A backend for each system or middleware
• Expertise may be proposed by end-users
• May use existing components, possibly through libraries

YML — yml.prism.uvsq.fr
Many people were involved in the development of the open source YML software, from Olivier Delannoye (master internship, Univ. Paris 6, 2000) to today's developers, in particular Miwako Tsuji (Tsukuba), Makarem Dandouna (Univ. Versailles) and Maxime Hugues (INRIA).

6  

Page 7:

Some elements on YML

• The YML1 framework is dedicated to developing and running parallel and distributed applications on clusters, clusters of clusters, and supercomputers (schedulers and middleware would have to be optimized for more integrated computers – cf. the "K" computer and OmniRPC, for example).

• Independent from systems and middleware
– End users can reuse their code with another middleware
– Currently the main backend is OmniRPC3

• Component approach
– Defined in XML
– Three types: Abstract, Implementation (in FORTRAN, C or C++; XMP, ...), Graph (parallelism)
– Reused and optimized

• The parallelism is expressed through a graph description language named Yvette (after the river in Gif-sur-Yvette where the ASCI lab was).

• Deployed in France, Belgium, Ireland, Japan, China, Tunisia, USA.

1 University of Versailles / PRiSM (http://yml.prism.uvsq.fr/)
2 University of Paris XI (http://www.xtremweb.net/)
3 University of Tsukuba (http://www.omni.hpcc.jp/OmniRPC/)

7  

Page 8:

YML Architecture

[Diagram, architecture of the 1.0.5 version: Development Catalog, Workflow Compiler, Component Generator, Binary Generator, Just-in-time Scheduler, Data Repository Server (DRS), Execution Catalog, Middleware client, Backend, Middleware, YML Worker.]

8

Page 9:

Graph description language: Yvette

• Language keywords
– Parallel sections: par section1 // … // sectionN endpar
– Sequential loops: seq (i:=begin;end) do … enddo
– Parallel loops: par (i:=begin;end) do … enddo
– Conditional structure: if (condition) then … else … endif
– Synchronization: wait(event) / notify(event)
– Component call: compute NameOfComponent(args,..,..)
(A minimal example combining these keywords is sketched after this list.)

• Application example with a dense matrix inversion method: the block Gauss-Jordan, …

• Other experiments (including sparse matrix computations): Krylov methods, hybrid methods, …

• 4 types of components:
– Abstract
– Graph
– Implementation
– Execution
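For illustration only, a minimal YvetteML fragment assembled from the keywords above might look as follows (the component names initBlocks and updateBlock are hypothetical, not taken from the course material):

par
  compute initBlocks(A); notify(ready);
//
  wait(ready);
  seq (i:=1;n) do
    compute updateBlock(A,i);
  enddo
endpar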

9  

Page 10:

Components/Tasks Graph Dependency

[Graph legend: begin node, end node, graph node, generic component node, dependence; the graph produces result A.]

par
  compute tache1(..); notify(e1);
//
  forall i=1,n
    compute tache2(i,..);
    migrate matrix(..);
    notify(e2(i));
  end forall
//
  wait(e1 and e2(1));
  par
    compute tache3(..); notify(e3);
  //
    compute tache4(..); notify(e4);
  //
    compute tache5(..);
    control robot(..);
    notify(e5);
    visualize mesh(…);
  end par
//
  wait(e3 and e4 and e5);
  compute tache6(..);
  compute tache7(..);
end par

10  

Page 11:

Abstract Component

<?xml version="1.0" ?>
<component type="abstract" name="prodMat"
           description="Matrix Matrix Product" >
  <params>
    <param name="matrixBkk"  type="Matrix"  mode="in" />
    <param name="matrixAki"  type="Matrix"  mode="inout" />
    <param name="blocksize"  type="integer" mode="in" />
  </params>
</component>

11  

Page 12:

Implementation Component

<?xml version="1.0"?>
<component type="impl" name="prodMat" abstract="prodMat"
           description="Implementation component of a Matrix Product">
  <impl lang="CXX">
    <header />
    <source>
    <![CDATA[
      int i, j, k;
      double ** tempMat;
      // Allocation
      for (k = 0; k < blocksize; k++)
        for (i = 0; i < blocksize; i++)
          for (j = 0; j < blocksize; j++)
            tempMat[i][j] = tempMat[i][j] + matrixBkk.data[i][k] * matrixAki.data[k][j];
      for (i = 0; i < blocksize; i++)
        for (j = 0; j < blocksize; j++)
          matrixAki.data[i][j] = tempMat[i][j];
      // Deallocation
    ]]>
    </source>
    <footer />
  </impl>
</component>

 

12  

Page 13:

Graph component of Block Gauss-Jordan Method

[The YvetteML graph shown on this slide is not reproduced in the transcript.]
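As a rough, hypothetical illustration of what one step k of such a graph can look like in YvetteML (reusing the prodMat abstract component shown on page 11, with invented invBlock and prodDiff component names, and omitting the tests that skip the pivot row and column):

compute invBlock(A(k,k)); notify(pivot);
par (j:=1;p) do
  wait(pivot);
  compute prodMat(A(k,k),A(k,j),blocksize);
  notify(row(j));
enddo
par (i:=1;p) do
  par (j:=1;p) do
    wait(row(j));
    compute prodDiff(A(i,j),A(i,k),A(k,j),blocksize);
  enddo
enddo

A complete graph component would chain p such steps and also update the blocks of the inverse matrix B.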

13  

Page 14:

[Figure: two instances of the restarted method running asynchronously. Each instance cycles through: starting vector, projection, QR + inverse iteration, Ritz elements computation, residual norms and stopping test, then a simple restart or a hybrid restart using results sent to and received from the other instance (send to SM, send/receive, communication times Tcomm a, Tcomm b, TC a, TC b).]

Experiments with the asynchronous hybrid method, MERAM, to compute eigenpairs

14  

Page 15:

Comparison between different configurations on Grid5000

[Plot: "Implementation intra-cluster and inter-cluster (17000)". Vertical axis from 1.00E-12 to 1.00E+03 (log scale, unlabeled in the transcript) versus time in seconds (0 to 140), for three configurations: intra-cluster (Nancy), inter-cluster (condition 1), inter-cluster (condition 4).]

15  

Page 16:

Asynchronous Iterative Restarted Methods (collaboration with Guy Bergère and Ye Zhang)

16  

Page 17:

FP3C

YML/XMP integration:
Mitsuhisa Sato, Serge Petiton, Maxime Hugues, Miwako Tsuji

Framework and Programming for Post-Petascale Computing (FP3C)

A 4-year ANR-JST supported project
Japanese PI: Mitsuhisa Sato
French PI: Serge Petiton

jfli.nii.ac.jp — Projects, and FP3C

17  

Page 18:

YML/XMP/StarPU

• Multi-level programming paradigm proposal:
– Top level for inter-gang/node communications
– Intermediate level within a gang of nodes
– Low level on lot-of-cores processors

• A multi-level parallel programming framework implementation:
– Top: YML, a graph description language and framework
– Intermediate: XMP, a directive-based programming language
– Low: a thread programming paradigm (StarPU)

• First: integrate XMP programming in YML (done)
• Second: integrate XMP and StarPU
• Third: integrate YML/XMP/StarPU

18  

Page 19:

Multi-Level Parallelism Integration: YML-XMP

[Diagram: a YML graph of tasks (TASK 1 … TASK 7); each task is mapped onto a group of nodes, and inside each node lower levels (OpenMP, GPGPU, etc.) can be used.]

for (i = 0; i < n; i++) {
  for (j = 0; j < n; j++) {
    tmp[i][j] = 0.0;
#pragma xmp loop (k) on t(k)
    for (k = 0; k < n; k++) {
      tmp[i][j] += (m1[i][k] * m2[k][j]);
    }
  }
}
#pragma xmp reduction (+:tmp)

Each task is a parallel program running over several nodes. The XMP language can be used to describe such a parallel program easily!

YML provides a workflow programming environment and a high level graph description language called YvetteML.

N-dimensional graphs are available.

19  

Page 20:

XcalableMP (XMP)

• Directive-based language extension for scalable and performance-aware parallel programming
• It provides a base parallel programming model and a compiler infrastructure to extend the base languages with directives
• Source (C+XMP) to source (C+MPI) compiler
• Data mapping and work mapping using templates

#pragma xmp nodes p(4)
#pragma xmp template t(0:7)
#pragma xmp distribute t(block) onto p
int a[8];
#pragma xmp align a[i] with t(i)

int main(){
  int i;
#pragma xmp loop on t(i)
  for (i = 0; i < 8; i++)
    a[i] = i;
}

[Figure: the array a[] is block-distributed over node0, node1, node2 and node3.]

20

Page 21:

XcalableMP: Code Example

int array[YMAX][XMAX];
#pragma xmp nodes p(4)
#pragma xmp template t(YMAX)
#pragma xmp distribute t(block) onto p
#pragma xmp align array[i][*] with t(i)

main(){
  int i, j, res;
  res = 0;
#pragma xmp loop on t(i) reduction(+:res)
  for (i = 0; i < 10; i++)
    for (j = 0; j < 10; j++){
      array[i][j] = func(i, j);
      res += array[i][j];
    }
}

Directives are added to the serial code: incremental parallelization. The nodes, template, distribute and align directives express the data distribution; the loop directive expresses work sharing and data synchronization.

21  

Page 22:

YML-XMP

The number of cores/nodes and the data distribution have to be specified to the scheduler for each XMP task.

Two back-ends will be developed, one based on MPI, the other on OmniRPC (which will be deployed on the "K" computer).

First, the abstract component will be adapted to allow the end-user to give some expertise.

<?xml version="1.0" ?>
<component type="abstract" name="prodMat"
           description="Matrix Matrix Product" >
<!-- here have to be included end-user expertise, for example to allow the scheduler to
     know the number of processors associated with this task, and the data distributions -->
  <params>
    <param name="matrixBkk"  type="Matrix"  mode="in" />
    <param name="matrixAki"  type="Matrix"  mode="inout" />
    <param name="blocksize"  type="integer" mode="in" />
  </params>
</component>
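A possible shape for this extension, mirroring the implementation component extension shown on page 24 (the exact attributes and their placement inside the abstract component are an assumption here, not the actual YML-XMP syntax):

<?xml version="1.0" ?>
<component type="abstract" name="prodMat"
           description="Matrix Matrix Product"
           nodes="CPU:(5,5)" >
  <!-- illustrative end-user expertise for the scheduler: processor topology and data distribution -->
  <distribute>
    <param template="block,block" name="matrixBkk(100,100)" align="[i][j]:(i,j)" />
    <param template="block,block" name="matrixAki(100,100)" align="[i][j]:(i,j)" />
  </distribute>
  <params>
    <param name="matrixBkk"  type="Matrix"  mode="in" />
    <param name="matrixAki"  type="Matrix"  mode="inout" />
    <param name="blocksize"  type="integer" mode="in" />
  </params>
</component>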

22  

Page 23:

YML-XMP process management: OmniRPC extension on MPI

mpirun -n 1 -hostfile host.txt yml_scheduler

[Diagram: the yml_scheduler runs on one MPI process and "kicks" an omrpc-agent on the reserved nodes (node-01, node-02, node-03, …). For each request, the agent uses mpi_comm_spawn to launch the remote executables of the components (e.g. sum.rex, mul.rex) on the nodes.]

23  

Page 24:

Implementation Component Extension

• Topology and number of processors are declared, to be used at compile and run-time
• Data distribution and mapping are declared
• Automatic generation for distributed languages (XMP, CAF, ...)
• Used at run-time to distribute data over processes

<?xml version="1.0"?>
<component type="impl" name="Ex" abstract="Ex" description="Example">
  <impl lang="XMP" nodes="CPU:(5,5)" libs=" " >
    <distribute>
      <param template="block,block" name="A(100,100)"    align="[i][j]:(j,i)" />
      <param template="block"       name="Y(100);X(100)" align="[i]:(i,*)" />
    </distribute>
    <header />
    <source>
    <![CDATA[
      /* Computation code */
    ]]>
    </source>
    <footer />
  </impl>
</component>

 

 24  

Page 25:

<?xml version="1.0"?>
<component type="impl" name="test" abstract="test">
  <impl lang="XMP" nodes="CPU:(4)" >
    <templates>
      <template format="block,block" name="t" size="10,10" nodes="p(2,2)" />
      <template format="block"       name="u" size="20" />
    </templates>
    <distribute>
      <param name="X(10,10)" align="[i][j]:(i,j)" template="t" />
      <param name="Y(10,10)" align="[i][j]:(i,j)" template="t" />
      <param name="Z(20)"    align="[i]:(i)"      template="u" />
    </distribute>

An additional node declaration is allowed. If no node set is declared for a template, the default node declaration, (4) in this case, is used.

25  

Page 26:

XMP & C source code

#pragma xmp nodes _XMP_default_nodes(4)
#pragma xmp nodes p(2,2)
#pragma xmp template t(0:9,0:9)
#pragma xmp distribute t(block,block) onto p

#pragma xmp template u(0:19)
#pragma xmp distribute u(block) onto _XMP_default_nodes

XMP_CMatrix X[10][10];
XMP_CMatrix Y[10][10];
XMP_CVector Z[20];

#pragma xmp align X[i][j] with t(i,j)
#pragma xmp align Y[i][j] with t(i,j)
#pragma xmp align Z[i] with u(i)

...

26  

Page 27:

MERAM vs ERAM, YML/XMP on Hopper

27  

Page 28:

Reusable library: experiments with PETSc, SLEPc and YML

Nahid Emad, University of Versailles
Leroy Drummond, LBNL
Makarem Dandouna, University of Versailles

Poster this evening: Sustainability of Numerical Libraries for Extreme Scale Computing
Nahid Emad (University of Versailles, France), Makarem Dandouna (University of Versailles, France), and Leroy Drummond (LBNL)

28  

Page 29:

Proposed reusable design with PETSc and SLEPc

29  

Page 30:

MERAM/SLEPc/MPI — MERAM/SLEPc/YML

# express the coarse grain parallelism
par (i:=1;nberam) do
  seq (iter:=1;maxiter) do
    par (j:=1;nbProcess[i]) do
      compute SolverProjection(...);
      notify(result[j]);
      …
      wait(result[j]);
    enddo
    compute Solve(...);
    compute reduce(...);
  enddo
enddo

Experiments, MERAM

30  

Page 31:

Scalability of MERAM vs number of co-methods for af23560 on Grid5000

[Two plots: PETSc\MPI and PETSc\YML]

31  

Page 32:

MERAM vs ERAM YML/XMP on Hopper

Nahid Emad, Makarem Dandouna, Leroy Drummond

32  

Page 33:

Scalability of the solution with the matrix size, PETSc\YML (matrix pde490000)

[Two plots: PETSc/MPI and PETSc/YML]

33  

Page 34:

Scalability of the solution with the matrix size, PETSc\YML (matrix pde1000000)

[Two plots: PETSc/MPI and PETSc/YML]

34  

Page 35:

Part III

Towards Exascale Programming

(3:30pm-5pm)

1. YML/XMP programming paradigms
2. Block Gauss-Jordan method using the YML/XMP framework
3. On the road to Exascale programming

35  

Page 36:

Block Gauss-Jordan

A B = B A = I; A is partitioned into p x p blocks A(I,J) of size n x n, so the matrix size is N = p n.

[Figure: the block structure of B at an intermediate stage, showing identity blocks I and zero blocks 0.]

To invert a matrix: 2N^3 operations

36  

Page 37:

[Figure: the p x p grid of n x n blocks at one step of the block Gauss-Jordan method, each block labelled by its task type.]

Task types and complexities (each block holds n^2 64-bit floating point numbers):
1 – Element Gauss-Jordan on the pivot block (LAPACK), cx = 2n^3 + O(n^2)
2 – A = +/- A B, and A = B copies (BLAS3), cx = 2n^3 - n^2
3 – A = A - B C (BLAS3), cx = 2n^3

37  

Page 38:

[Figure: the same p x p block grid, with blocks labelled by task type (1, 2 or 3) for one step.]

Each computing task handles 1 up to 3 blocks at most, so the block size n is limited by the memory of one peer (3 blocks must fit).
Up to (p-1)^2 computing units (core, node, cluster, peers, …) can be used.
We have to use data persistence and to anticipate data migrations.

38
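As a rough order of magnitude (my own arithmetic, using the 32 GB of memory per node quoted for the T2K system on page 47): three n x n blocks of 64-bit numbers occupy 3 x 8 x n^2 bytes, so 24 n^2 <= 32 x 10^9 allows a block size n of up to roughly 36,000 per node.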

Page 39:

One step of the block Gauss-Jordan method (p = 4):
– Element Gauss-Jordan on the pivot block,
– 2p-1 products A = A B (BLAS3),
– (p-1)^2 updates A = A B - C (BLAS3).
Each block holds n^2 double precision floating point numbers = 8 n^2 bytes.
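To make the structure of these block tasks concrete, here is a compact, serial C sketch of in-place block Gauss-Jordan inversion without pivoting (my own illustration, not code from the course; invert_block plays the role of the LAPACK-style element Gauss-Jordan, and the BLAS3-style updates are written as naive loops):

#include <stdlib.h>
#include <string.h>

typedef double *Block;              /* one n x n block, stored row-major */
#define A(I,J) blocks[(I)*p + (J)]  /* the p x p array of blocks         */

/* Task type 1: element Gauss-Jordan, invert one n x n block in place. */
static void invert_block(Block a, int n) {
    for (int k = 0; k < n; k++) {
        double piv = 1.0 / a[k*n + k];
        a[k*n + k] = piv;
        for (int j = 0; j < n; j++) if (j != k) a[k*n + j] *= piv;
        for (int i = 0; i < n; i++) {
            if (i == k) continue;
            for (int j = 0; j < n; j++)
                if (j != k) a[i*n + j] -= a[i*n + k] * a[k*n + j];
            a[i*n + k] = -a[i*n + k] * piv;
        }
    }
}

/* BLAS3-style kernel: C = C + alpha * A * B (about 2 n^3 flops). */
static void gemm_acc(double alpha, Block a, Block b, Block c, int n) {
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                c[i*n + j] += alpha * a[i*n + k] * b[k*n + j];
}

/* In-place block Gauss-Jordan inversion of a (p n) x (p n) matrix. */
void block_gauss_jordan(Block *blocks, int p, int n) {
    size_t bytes = (size_t)n * n * sizeof(double);
    Block tmp = malloc(bytes);
    for (int k = 0; k < p; k++) {
        invert_block(A(k,k), n);                           /* task type 1        */
        for (int j = 0; j < p; j++) {                      /* pivot row          */
            if (j == k) continue;
            memset(tmp, 0, bytes);
            gemm_acc(+1.0, A(k,k), A(k,j), tmp, n);        /* task type 2: A=AB  */
            memcpy(A(k,j), tmp, bytes);
        }
        for (int i = 0; i < p; i++) {
            if (i == k) continue;
            for (int j = 0; j < p; j++) {                  /* trailing blocks    */
                if (j == k) continue;
                gemm_acc(-1.0, A(i,k), A(k,j), A(i,j), n); /* task type 3: A=A-BC */
            }
            memset(tmp, 0, bytes);
            gemm_acc(-1.0, A(i,k), A(k,k), tmp, n);        /* task type 2: A=-AB */
            memcpy(A(i,k), tmp, bytes);
        }
    }
    free(tmp);
}
#undef A

In the YML/XMP version, every call to invert_block and every BLAS3-style block update becomes one task of the graph, executed by an XMP implementation component on its own group of nodes.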

39  

Page 40:

Block-based Gauss-Jordan method

[Figure: read/write pattern on matrices A and B for p = 5 at step k = 2; blocks are labelled 0, 1, 2.1, 2.2, 3.1 or 3.2 according to the step and sub-step in which they are read or written (block size Sb = n).]

UC Berkeley

Page 41:

[Figure: the p x p block grid again, with blocks labelled by task type (1, 2 or 3).]

• Computation of "new" blocks on the computing unit which minimizes communications
• "Update" of a block at step k on the computing unit which updated that block at step k-1
• Data sent to the dedicated computing unit as soon as possible

July 12, 2011  UC Berkeley

Page 42:

[Figure: blocks from two consecutive steps of the method, labelled by task type, that can be computed in parallel.]

Nevertheless, we can compute blocks from several steps of the method in parallel. We have to use an inter- and intra-step dependency graph (3D for block Gauss-Jordan).

42  

Page 43:

YML experimentations on Grid’5000, seen as a cluster of heterogeneous clusters

A large scale Instrument for Computer Science Research

[Map: the Grid'5000 sites (Lille, Nancy, Rennes, Lyon, Bordeaux, Toulouse, Nice, Grenoble), each with roughly 500 to 1000 machines (1000, 518, 500, …), interconnected by the 1 Gbps RENATER network.]

43  

Page 44:

BGJ on a Cluster of clusters composed of 200 processors distributed over 4 clusters (without data migration optimizations)

[Plot: time of execution (s) for the intra-step and inter-step implementations, block size = 1500, versus the number of blocks (2x2 to 8x8), for YML-Intra, YML-Inter and OmniRPC; times range from 0 to about 4500 s.]

44  

Page 45:

Data persistence, data anticipation migration and asynchronous I/O

• Data persistence and anticipated data migration are not yet proposed by the YML scheduler
– As a consequence, the data persistence mechanism is "emulated" by regenerating blocks on nodes
– An overhead has to be added for data management and anticipated migration

• Henri Calandra (TOTAL), Maxime Hugues (then at TOTAL), Serge Petiton (Univ. Lille/CNRS) and Data Direct Networks (DDN) propose a system named ASIODS, which combines the graph and delegation in order to avoid disk contention and obtain better caching: these techniques would be very well adapted to YML, as the graph is available.

45  

Page 46:

BGJ with DP (data persistence) and DMA (data migration anticipation) on 200 processors distributed over 4 clusters (cluster of clusters)

[Plot: time of execution (s) on a cluster of clusters, block size = 1500, versus the number of blocks (2x2 to 8x8), for YML, YML with DP & DMA (emulation), OmniRPC, and OmniRPC with DP & DMA; times range from 0 to about 4500 s.]

46  

Page 47:

Experiments: T2K-Tsukuba, YML-XMP

Node:
  Opteron Barcelona B8000 CPU
  2.3 GHz x 4 flop/cycle x 4 cores x 4 sockets = 147.2 GFLOPS/node
  8 x 4 = 32 GB memory/node

System:
  648 nodes
  95.3 TFLOPS/system
  20.8 TB memory/system
  800 TB RAID-6 Lustre cluster file system
  Fat-tree, full-bisection interconnection, quad-rail InfiniBand

16 cores in a node. Flat MPI.

47

Page 48:

Block Gauss-Jordan on T2K (Tsukuba)

[Plots for block sizes 2048, 1024 and 512.]

• Block Gauss-Jordan
• Matrix size 16384 x 16384
• Various block sizes
• Various numbers of cores per component

48  

Page 49:

Block Gauss-Jordan on Grid'5000, YML-XMP

[Two plots: time (s) versus block count. Left: block size 1000, 2x2 to 9x9 blocks, with 4 or 16 cores per component (times up to about 800 s). Right: block size 2000, 2x2 to 6x6 blocks, with 4 or 16 cores per component (times up to about 1200 s).]

• More cores for a component does not mean more performance
• The amount of data must not be too small
• More parallelism at the high level does not mean a performance increase
• So what? Exchanges through the file system are the bottleneck

49  

Page 50:

Part III

Towards Exascale Programming

(3:30pm-5pm)

1. YML/XMP programming paradigms
2. Block Gauss-Jordan method using the YML/XMP framework
3. On the road to Exascale programming

50  

Page 51:

The future Exaflop barrier: not only another symbolic frontier coming after the Petaflops

• Sustained Petascale applications on a single computer have existed for only a few months
• Gordon Bell award: more than 3 sustained petaflops
• Next frontier: Exascale computing (and how many megawatts???)
• Nevertheless, many challenges will emerge, probably before the announced 100-Petaflop computers and beyond.
• We have to be able to anticipate solutions, to educate scientists as soon as possible for this future programming.
• We have to use the existing emerging platforms and prototypes to imagine the future languages, systems, algorithms, ….
• We have to propose new programming paradigms (SPMD/MPI for 1 million cores and 1 billion threads????), MPI-X or X-MPI or Z-MPI-X???
• We have to propose new languages.
• Co-design and domain application languages and/or high level multi-level languages and frameworks, …….

51  

Page 52:

New methods for future hypercomputers

• We have to imagine new methods for the exascale computers.
• Methods should define the new architectures (co-design), not the old and present methods….
• Many people propose new systems and languages starting from the existing methods and numerical libraries…. but those were developed for MPI-like programming and the SPMD paradigm only, at the "old time" of Moore's law.
• We have to adapt the methods with respect to criteria from the architecture, the arithmetic, the system, the I/O, the latencies, ……; then auto-tuning (at runtime) becomes a general approach.
• We have to hybridize numerical methods to solve large scientific applications, asynchronously, and each of them has to be auto-tuned.
• We have to find smarter criteria, some of them at the application and at the mathematical level, for each method: smart tuning.
• These auto-tuned methods will be correlated: intelligent numerical methods. End-users have to be able to give expertise.

Our research goals are to propose a framework and programming paradigm for Xtrem computing and to introduce well-adapted, modern, smart-tuned hybrid (Krylov) methods.

We are looking for international collaborations.

52  

Page 53:

53  

As a conclusion

• [Graph of components] High level graph description language, framework and environment using component technologies, independent of middleware and heterogeneous hardware (YML)
– It is too expensive to synchronize all the cores and to send a lot of data between distant nodes, except for some parametric-based applications

• [Graph components] Each component may have a graph version for clusters or other parallel or distributed platforms/hypercomputers, or may be an XML component with, at the last level, an MPI execution task

• [Implementation components] Each of these tasks uses a low level programming paradigm for processor computing (data parallelism, flux parallelism, ….), allowing the use of heterogeneous accelerators

• [Abstract components] End-user expertise will be necessary to optimize such programming

• New high level languages MUST be proposed to computational scientists, and lower level languages have to be proposed to develop the other levels of programming.