Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models
Jithin Jose¹, Sreeram Potluri¹, Karen Tomko², and Dhabaleswar K. (DK) Panda¹
¹Network-Based Computing Laboratory, Department of Computer Science and Engineering, The Ohio State University, USA
²Ohio Supercomputer Center, Columbus, Ohio, USA


Page 1:

Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models

Jithin Jose¹, Sreeram Potluri¹, Karen Tomko², and Dhabaleswar K. (DK) Panda¹

¹Network-Based Computing Laboratory, Department of Computer Science and Engineering, The Ohio State University, USA
²Ohio Supercomputer Center, Columbus, Ohio, USA

Page 2:

Outline
• Introduction
• Problem Statement
• Graph500 Benchmark
• Design Details
• Performance Evaluation
• Conclusion & Future Work

Page 3:

Introduction
• MPI: the de-facto programming model for scientific parallel applications
• Offers attractive features for High Performance Computing (HPC) applications: non-blocking, one-sided, etc.
• MPI libraries (such as MVAPICH2, OpenMPI, Intel MPI) have been optimized to the hilt
• Emerging Partitioned Global Address Space (PGAS) models: Unified Parallel C (UPC), OpenSHMEM

Page 4:

Partitioned Global Address Space (PGAS) Models
• PGAS models
  – Shared memory abstraction over distributed systems
  – Global view of data, one-sided operations, better programmability
  – Suited for irregular and dynamic applications
• OpenSHMEM and Unified Parallel C (UPC) are popular PGAS models (see the minimal OpenSHMEM sketch below)
• Will applications be re-written entirely in a PGAS model?

[Diagram: Shared Memory Model (SHMEM, DSM) with processes P1-P3 over shared memory; Distributed Memory Model (MPI, Message Passing Interface) with per-process memories; Partitioned Global Address Space (PGAS) with logical shared memory layered over per-process memories]
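To make the PGAS abstraction above concrete, here is a minimal OpenSHMEM sketch (not from the slides; the variable name and neighbor-exchange pattern are illustrative): each PE writes into a symmetric variable on another PE with a one-sided put, with no matching receive.

/* Minimal OpenSHMEM example: one-sided put into a symmetric variable. */
#include <stdio.h>
#include <shmem.h>

int dest;                              /* symmetric: allocated on every PE */

int main(void)
{
    shmem_init();
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();
    int src  = me;

    /* One-sided write: no receive call is posted on the target PE */
    shmem_int_put(&dest, &src, 1, (me + 1) % npes);

    shmem_barrier_all();               /* ensure all puts are complete and visible */
    printf("PE %d received %d\n", me, dest);

    shmem_finalize();
    return 0;
}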

Page 5:

Hybrid (MPI+PGAS) Programming for Exascale Systems
• Application sub-kernels can be re-written in MPI or PGAS based on their communication characteristics
• Benefits:
  – Best of the distributed memory computing model
  – Best of the shared memory computing model
• Exascale Roadmap*:
  – "Hybrid programming is a practical way to program exascale systems"

* The International Exascale Software Roadmap, J. Dongarra, P. Beckman, et al., International Journal of High Performance Computing Applications, Volume 25, Number 1, 2011, ISSN 1094-3420

[Diagram: an HPC application composed of kernels 1 through N, initially all written in MPI; selected sub-kernels (e.g., Kernel 2, Kernel N) are re-written in PGAS]

Page 6:

Introduction to Graph500
• Graph500 Benchmark
  – Represents data-intensive and irregular applications that use graph algorithm-based processing methods
  – Bioinformatics and life sciences, social networking, data mining, and security/intelligence rely on graph algorithmic methods
  – Exhibits highly irregular and dynamic communication patterns
  – Earlier research has indicated scalability limitations of the MPI-based Graph500 implementations

Page 7:

Problem Statement
• Can a high-performance and scalable Graph500 benchmark be designed using MPI and PGAS models?
• How much performance gain can we expect?
• What will be the strong- and weak-scaling characteristics of such a design?

Page 8:

Outline
• Introduction
• Problem Statement
• Graph500 Benchmark
• Design Details
• Performance Evaluation
• Conclusion & Future Work

Page 9:

Graph500 Benchmark – The Algorithm
• Breadth-First Search (BFS) traversal
• Uses the level-synchronized BFS traversal algorithm (a pseudocode sketch follows below)
  – Each process maintains two queues: CurrQueue and NewQueue
  – Vertices in CurrQueue are traversed, and newly discovered vertices are sent to their owner processes
  – The owner process receives the edge information; if the vertex is not yet visited, it updates the parent information and adds the vertex to NewQueue
  – Queues are swapped at the end of each level
  – Initially, the root vertex is added to CurrQueue
  – The traversal terminates when the queues are empty
• The size of the graph is given by SCALE and Edge Factor (EF): #Vertices = 2^SCALE, #Edges = #Vertices × EF
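The following C sketch illustrates the level-synchronized BFS described above for a single process (queue and array names are illustrative, not the benchmark's actual code); in the distributed benchmark, vertices owned by remote processes are sent to their owners instead of being enqueued locally.

/* Level-synchronized BFS sketch (single process, CSR-style adjacency). */
#include <stdint.h>

#define SCALE        16
#define EDGE_FACTOR  16
#define NVERTS  ((int64_t)1 << SCALE)        /* #Vertices = 2^SCALE      */
#define NEDGES  (NVERTS * EDGE_FACTOR)       /* #Edges = #Vertices * EF  */

static int64_t curr_queue[NVERTS], new_queue[NVERTS];

void bfs(int64_t root, int64_t *parent,
         const int64_t *adj_start, const int64_t *adj)
{
    int64_t curr_len = 0, new_len;

    for (int64_t v = 0; v < NVERTS; v++) parent[v] = -1;
    parent[root] = root;
    curr_queue[curr_len++] = root;            /* root seeds CurrQueue */

    while (curr_len > 0) {                    /* one iteration per BFS level */
        new_len = 0;
        for (int64_t i = 0; i < curr_len; i++) {
            int64_t u = curr_queue[i];
            for (int64_t e = adj_start[u]; e < adj_start[u + 1]; e++) {
                int64_t v = adj[e];
                if (parent[v] == -1) {        /* not yet visited            */
                    parent[v] = u;            /* record parent              */
                    new_queue[new_len++] = v; /* discovered for next level  */
                }
            }
        }
        for (int64_t i = 0; i < new_len; i++) /* swap queues at end of level */
            curr_queue[i] = new_queue[i];
        curr_len = new_len;
    }
}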

Page 10:

MPI-based Graph500 Benchmark
• MPI_Isend / MPI_Test and MPI_Irecv for transferring vertices (see the sketch below)
• Implicit barrier using a zero-length message
• MPI_Allreduce to count the number of NewQueue elements
• Major bottlenecks:
  – Overhead of the send-recv communication model
    • More CPU cycles are consumed, despite using non-blocking operations
    • Most of the time is spent in MPI_Test
  – Implicit linear barrier
    • The linear barrier causes significant overheads
• Other MPI implementations: MPI-CSR, MPI-CSC, MPI-OneSided
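As a rough illustration of the send-recv model whose overheads are discussed above, the sketch below shows the MPI_Isend/MPI_Test pattern for pushing a coalesced batch of vertices to an owner process; buffer sizes, tags, and helper names are assumptions, not the reference code.

/* Sender-side sketch of the MPI send-recv communication model. */
#include <stdint.h>
#include <mpi.h>

void send_vertices(const int64_t *buf, int count, int owner,
                   MPI_Request *req, int *req_active)
{
    /* The previous Isend to this owner must complete before the buffer
     * can be reused; most of the time ends up being spent in MPI_Test. */
    if (*req_active) {
        int done = 0;
        while (!done)
            MPI_Test(req, &done, MPI_STATUS_IGNORE);
    }
    MPI_Isend(buf, count, MPI_INT64_T, owner, 0, MPI_COMM_WORLD, req);
    *req_active = 1;

    /* A zero-length message (count == 0) is used as the implicit barrier
     * signalling the end of a level in the reference implementation. */
}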

Page 11:

Outline
• Introduction
• Problem Statement
• Graph500 Benchmark
• Design Details
• Performance Evaluation
• Conclusion & Future Work

Page 12:

Design Challenges for Hybrid Graph500
• Coordination between sender and receiver processes, and between multiple sender processes
  – How to synchronize while using one-sided communication?
• Memory scalability
  – Size of the receive buffer
• Synchronization at the end of each level
  – Barrier operations simply limit computation-communication overlap
• Load imbalance

Page 13:

Detailed Design
• Communication and coordination using one-sided routines and fetch-add atomic operations
• Buffer structure for efficient computation-communication overlap
• Level synchronization using a non-blocking barrier
• Load balancing

[Diagram: the hybrid Graph500 benchmark (communication coordination using fetch-add/put, efficient computation/communication overlap, non-blocking level synchronization, load balancing) runs over the OpenSHMEM and MPI interfaces of the MVAPICH2-X unified communication runtime, on an InfiniBand network]

Page 14:

MVAPICH2 / MVAPICH2-X Software
• High-performance open-source MPI library for InfiniBand, 10GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-3.0), available since 2002
  – MVAPICH2-X (MPI + PGAS), available since 2012
  – Used by more than 2,000 organizations (HPC centers, industry, and universities) in 70 countries
  – More than 173,000 downloads directly from the OSU site
  – Empowering many TOP500 (Jun '13) clusters
    • 6th-ranked 462,462-core cluster (Stampede) at TACC
    • 19th-ranked 125,980-core cluster (Pleiades) at NASA
    • 21st-ranked 73,278-core cluster (Tsubame 2.0) at Tokyo Institute of Technology
    • and many others
  – Available with the software stacks of many IB, HSE, and server vendors, including Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Partner in the U.S. NSF-TACC Stampede system

Page 15:

Communication and Coordination
• Vertices are transferred using the OpenSHMEM shmem_put routine
• Receive buffers are globally shared
• The receive buffer size depends on the number of local edges that the process owns and on the connectivity
  – The size is independent of system scale
• Atomic fetch-add operations coordinate between the sender and the receiver, and between multiple senders (see the sketch below)
  – Receive buffer indices are globally shared

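A minimal sketch of this coordination scheme is shown below (buffer names and sizes are illustrative): the sender reserves space in the receiver's globally shared buffer with an atomic fetch-add and then transfers the vertices with a one-sided put.

/* Sender-side coordination: fetch-add to reserve space, then shmem_put. */
#include <stdint.h>
#include <shmem.h>

#define RECV_BUF_SIZE (1 << 20)

int64_t recv_buf[RECV_BUF_SIZE];   /* symmetric, globally shared receive buffer */
long    recv_idx = 0;              /* symmetric, globally shared buffer index   */

void send_chunk(int owner_pe, const int64_t *verts, size_t nverts)
{
    /* Atomically reserve nverts slots on the owner; this single fetch-add
     * coordinates the receiver and any number of concurrent senders. */
    long offset = shmem_long_fadd(&recv_idx, (long)nverts, owner_pe);

    /* One-sided transfer into the reserved region; no receive call is needed. */
    shmem_put64(&recv_buf[offset], verts, nverts, owner_pe);
}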

Page 16:

Buffer Structure for Better Overlap
• The receiver process must know when data has arrived
• The buffer structure helps identify incoming data
• The receiving process ensures the arrival of a complete data packet by checking a tail marker, and can then process the packet immediately (a minimal layout sketch is shown below)

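The sketch below shows one possible packet layout with a tail marker (an assumed layout, not necessarily the benchmark's): the sender writes the payload first and the tail flag last, so the receiver can detect a complete packet simply by polling the tail and then process it immediately.

/* Assumed packet layout with a tail marker for arrival detection. */
#include <stdint.h>

#define PKT_VERTS 256

typedef struct {
    int64_t count;                     /* number of valid vertices in data[]  */
    int64_t data[PKT_VERTS];           /* coalesced vertex payload            */
    volatile int64_t tail;             /* written last; non-zero == complete  */
} packet_t;

/* Receiver-side check: once the tail is set, count and data are valid. */
static inline int packet_ready(const packet_t *p)
{
    return p->tail != 0;
}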

Page 17:

Level Synchronization Using a Non-blocking Barrier
• MPI-3 non-blocking barrier (MPI_Ibarrier) for level synchronization
• A process enters the barrier and can still continue to receive and process incoming vertices
• Offers better computation/communication overlap (see the sketch below)

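A sketch of this level synchronization is given below (packet_ready and process_incoming are illustrative helpers): the process enters MPI_Ibarrier and keeps draining incoming vertices until every process has entered the barrier.

/* Level synchronization with the MPI-3 non-blocking barrier. */
#include <mpi.h>

extern int  packet_ready(void);        /* any unprocessed incoming packet?  */
extern void process_incoming(void);    /* consume one incoming packet       */

void end_of_level(MPI_Comm comm)
{
    MPI_Request req;
    int done = 0;

    MPI_Ibarrier(comm, &req);          /* enter the barrier without blocking  */
    while (!done) {
        if (packet_ready())
            process_incoming();        /* overlap: keep working while waiting */
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);
    }
}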

Page 18:

Intra-node Load Balancing
• An overloaded process exposes work
• An idle process picks up the shared work, processes it, and puts it back for post-processing
• Uses the shmem_ptr routine in OpenSHMEM to access shared-memory data directly (a sketch follows below)

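The sketch below illustrates intra-node work stealing via shmem_ptr (the work-queue layout and names are assumptions): an idle PE maps a neighbor's symmetric work queue into its own address space and claims an item with an atomic fetch-add.

/* Intra-node work stealing through shmem_ptr. */
#include <stdint.h>
#include <shmem.h>

#define WORK_MAX 4096

typedef struct {
    long    head;                      /* next index an idle helper may claim */
    long    tail;                      /* one past the last exposed item      */
    int64_t items[WORK_MAX];
} work_queue_t;

work_queue_t wq;                       /* symmetric work queue on every PE */

/* Idle PE tries to steal one item from victim_pe; returns 1 on success. */
int try_steal(int victim_pe, int64_t *item)
{
    work_queue_t *remote = (work_queue_t *)shmem_ptr(&wq, victim_pe);
    if (remote == NULL)                /* victim is not on the same node      */
        return 0;

    long idx = shmem_long_fadd(&wq.head, 1, victim_pe);   /* claim an index   */
    if (idx >= remote->tail)           /* nothing left (over-claim ignored)   */
        return 0;

    *item = remote->items[idx];        /* direct load via the shmem_ptr map   */
    return 1;
}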

Page 19:

Outline
• Introduction
• Problem Statement
• Graph500 Benchmark
• Design Details
• Performance Evaluation
• Conclusion & Future Work

Page 20:

Experiment Setup
• Cluster A (TACC Stampede)
  – Intel Sandy Bridge processors: dual 8-core Xeon sockets (2.70 GHz) with 32 GB RAM
  – Each node is equipped with FDR ConnectX HCAs (54 Gbps data rate) with PCIe Gen3 interfaces
• Cluster B
  – Dual quad-core Xeon processors (2.67 GHz) with 12 GB RAM
  – Each node is equipped with QDR ConnectX HCAs (32 Gbps data rate) with PCIe Gen2 interfaces
• Software stacks
  – Graph500 v2.1.4
  – MVAPICH2-X OpenSHMEM (v1.9a2) and OpenSHMEM over GASNet (v1.20.0)

Page 21:

Graph500 - BFS Traversal Time
• The hybrid design performs better than the MPI implementations
• 4,096 processes
  – 2.2X improvement over MPI-CSR
  – 5X improvement over MPI-Simple
• 8,192 processes
  – 7.6X improvement over MPI-Simple (same communication characteristics)
  – 2.4X improvement over MPI-CSR

[Figure: BFS traversal time (s) vs. number of processes (4K, 8K) for MPI-Simple, MPI-CSC, MPI-CSR, and Hybrid (MPI+OpenSHMEM); the hybrid design is 5X faster than MPI-Simple at 4K processes and 7.6X faster at 8K processes]

Page 22:

Unified Runtime vs. Separate Runtimes
• Hybrid-GASNet uses separate runtimes for MPI and OpenSHMEM
  – Significant performance degradation due to the lack of efficient atomic operations and the overhead of separate runtimes
• For 1,024 processes
  – BFS time for Hybrid-GASNet: 22.8 s
  – BFS time for Hybrid-MV2X: 0.58 s

[Figure: BFS time (s) vs. number of processes (512, 1024) for Hybrid-GASNet, MPI-CSC, MPI-CSR, MPI-Simple, and Hybrid-MV2X]

Page 23:

Load Balancing
• Evaluations using HPCToolkit indicate that the load is being balanced within a node
• Load balancing is limited to within a node
  – Need for post-processing
  – Higher cost of moving data across nodes
• The amount of work is almost equal at each rank!

[Figure: Vertex distribution: number of vertices per rank at BFS levels 1-6, without load balancing vs. with load balancing]

Page 24:

Scalability Analysis

[Figure: Weak scaling: traversed edges per second vs. problem scale (26-29); Strong scaling: traversed edges per second vs. number of processes (1,024-8,192); both for MPI-Simple, MPI-CSC, MPI-CSR, and Hybrid]

• Strong scaling: Graph500 problem scale = 29
• Weak scaling: Graph500 problem scale = 26 per 1,024 processes
• Results indicate good scalability characteristics

Page 25:

Performance at 16K Processes
• Graph size: Scale = 29, EdgeFactor = 16
• Time for BFS traversal:
  – MPI-Simple: 30.5 s
  – MPI-CSR (best-performing MPI version): 3.25 s
  – Hybrid (MPI+OpenSHMEM): 2.24 s
• 13X improvement over MPI-Simple (same communication characteristics)

[Figure: BFS traversal time (s) at 16K processes for MPI-Simple, MPI-CSC, MPI-CSR, and Hybrid; the hybrid design is 13X faster than MPI-Simple]

Page 26:

Outline
• Introduction
• Problem Statement
• Graph500 Benchmark
• Design Details
• Performance Evaluation
• Conclusion & Future Work

Page 27:

Conclusion & Future Work
• Presented a scalable design of the Graph500 benchmark using hybrid MPI+OpenSHMEM
• Identified critical bottlenecks in the MPI-based implementation
• The goal is not to compare programming models, but to demonstrate the benefits of the hybrid model
• Performance highlights
  – At 16,384 cores, the hybrid design achieves a 13X improvement over MPI-Simple and a 2.4X improvement over MPI-CSR
  – Exhibits good scalability characteristics
  – Significant performance improvement over using separate runtimes
• Plan to improve the load-balancing scheme to also consider inter-node balancing
• Plan to evaluate the design at larger scales and to consider real-world applications

Page 28:

Thank You!
{jose, potluri, panda}@cse.ohio-state.edu
{ktomko}@osc.edu

Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
MVAPICH Web Page: http://mvapich.cse.ohio-state.edu/