Opportunities and Challenges of Modern Communication Architectures: Case Study with QsNet

CAC Workshop, Santa Fe, NM, 2004

Sameer Kumar* and Laxmikant V. Kale
Parallel Programming Laboratory
University of Illinois at Urbana-Champaign



Page 2

Outline
• Processor virtualization
• QsNet
• Opportunities
• Performance evaluation of QsNet
• Challenges of QsNet
• Summary

Page 3

Processor Virtualization

Basic idea of processor virtualization:
• User specifies interaction between objects (virtual processors, VPs)
• Runtime system (RTS) maps VPs onto physical processors
• Typically, # virtual processors > # processors
• Embodied in Charm++ and AMPI

[Figure: user view vs. system implementation]
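The mapping step can be illustrated with a minimal sketch. It assumes a plain round-robin placement, which is only one possible policy; the slides do not say how the RTS places VPs, and the function name `map_vps_round_robin` is hypothetical.

```python
from collections import defaultdict


def map_vps_round_robin(num_vps, num_procs):
    """Assign each virtual processor (VP) to a physical processor.

    Round-robin placement is an assumption made for illustration only;
    a real runtime system may use measurement-based load balancing.
    """
    if num_procs <= 0:
        raise ValueError("need at least one physical processor")
    placement = defaultdict(list)
    for vp in range(num_vps):
        placement[vp % num_procs].append(vp)
    return dict(placement)


# 8 VPs on 2 physical processors: a virtualization ratio of 4.
placement = map_vps_round_robin(num_vps=8, num_procs=2)
```

The user code only addresses VPs by index; which physical processor hosts a VP is decided entirely by the mapping function.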

Page 4

QsNet
• Popular interconnect from Quadrics
• Several parallel systems in the Top500 use QsNet
  • Pittsburgh's Lemieux (6 TF)
  • ASCI-Q (20 TF)
• Elite network
• Elan adaptor

Page 5

Elite Network
• 320 MB/s each way after protocol overhead
• Reliable fat-tree network
  • Multiple routes provide fault tolerance
• Adaptive wormhole routing
• 35 ns per hop

Page 6

Elan Network Adaptor

Features
• Low latency (4.5 μs for MPI)
• High bandwidth (320 MB/s per node)

Components
• SPARC processor
• DMA engine
• 64 MB RAM
• On-chip cache

Page 7

Low CPU Overhead

[Figure: latency and CPU overhead (μs, 0 to 35) vs. message size (16 to 4096 bytes)]

CPU overhead is small and does not change much with the message size.

Page 8

Traditional Message Passing

[Figure: timeline for processors P0 and P1 showing send overhead, receive overhead, and idle time]

Traditional message passing does not utilize the low CPU overhead of Elan.

Page 9

Adaptive Overlap

[Figure: timeline for processors P0 and P1, each running VP0 and VP1, showing send overhead and receive overhead]

Processor Virtualization takes full advantage of the low CPU overhead of Elan
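The effect can be sketched with a toy timing model. Everything here is an assumption for illustration: the model, the function name `step_time`, and the numbers are not taken from the slides or from measurements. Each physical processor has a fixed amount of compute per step, split evenly across its VPs, and a VP that has sent a message must wait one network latency before its next step can start. While it waits, the other VPs on the same processor keep computing.

```python
def step_time(compute, latency, vps_per_proc):
    """Toy model of one step on one physical processor.

    compute      -- total compute time per step on this processor
    latency      -- time a VP waits for its message after finishing its chunk
    vps_per_proc -- virtualization ratio (VPs per physical processor)
    """
    if vps_per_proc < 1:
        raise ValueError("need at least one VP per processor")
    chunk = compute / vps_per_proc
    # Time during which other VPs still have work while the first VP waits.
    overlap_window = compute - chunk
    # Any latency not hidden by that window shows up as idle time.
    idle = max(0.0, latency - overlap_window)
    return compute + idle


no_overlap = step_time(compute=100.0, latency=30.0, vps_per_proc=1)
with_overlap = step_time(compute=100.0, latency=30.0, vps_per_proc=8)
```

With one VP per processor the whole latency is exposed as idle time; with eight, the remaining VPs' work covers it in this model.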

Page 10

Benefit of Adaptive Overlap

Problem setup: 3D stencil calculation of size 240^3 run on Lemieux. Shows AMPI with virtualization ratios of 1 and 8.

[Figure: execution time (sec, log scale, 0.001 to 1) vs. number of processors (log scale, 1 to 1000) for AMPI(1) and AMPI(8)]

Page 11

Charm++ Message Driven Execution

[Figure: scheduler loop. Diagram labels: Scheduler, Handler, Pump, Send, Garbage Collection, Tport Send, Post Receives, Receive Message]
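The idea of message-driven execution can be sketched as a scheduler that repeatedly picks the next available message and invokes the handler it names. This is a generic single-process illustration, not the Charm++ or Converse API; the `Scheduler` class and the handler names are hypothetical.

```python
from collections import deque


class Scheduler:
    """Minimal message-driven scheduler: one queue, named handlers."""

    def __init__(self):
        self.queue = deque()
        self.handlers = {}

    def register(self, name, fn):
        self.handlers[name] = fn

    def deliver(self, name, payload):
        # In a real system this would be fed by the network layer.
        self.queue.append((name, payload))

    def run(self):
        """Process messages until the queue is empty; return the count."""
        processed = 0
        while self.queue:
            name, payload = self.queue.popleft()
            self.handlers[name](self, payload)
            processed += 1
        return processed


log = []


def on_ping(sched, n):
    log.append(("ping", n))
    if n > 0:
        sched.deliver("pong", n - 1)


def on_pong(sched, n):
    log.append(("pong", n))
    if n > 0:
        sched.deliver("ping", n - 1)


sched = Scheduler()
sched.register("ping", on_ping)
sched.register("pong", on_pong)
sched.deliver("ping", 2)
count = sched.run()
```

No handler ever blocks waiting for a specific message; control returns to the scheduler after each handler, which is what lets computation for one object overlap with communication for another.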

Page 12

NAMD: A Production MD System
• Written in Charm++
• Fully featured program
• NIH-funded development
• Distributed free of charge (5000+ downloads so far)
• Binaries and source code
• Installed at NSF centers
• Large published simulations (e.g., aquaporin simulation featured in keynote)

Page 13

Scaling NAMD

Several QsNet challenges had to be overcome to scale NAMD.

Page 14

QsNet Challenge: Latency

[Figure: short message latency (μs, 0 to 20) vs. number of receives posted (1, 5, 9, 17, 33) for MPI and Converse]

Applications need to post receives for messages of different sizes.

Page 15

Latency Bottlenecks
• Latency
  • Slow NIC processor with a 100 MHz clock
  • Cache size only 8 KB
    • Traversing a large loop flushes it

Cache misses vs. number of receives posted:

Receives posted   Cache misses
1                 86017
5                 92475
9                 103037
13                174060
17                1008003

Page 16

Managing Latency: Message Combining

Organize processors in a 2D (virtual) mesh.

Phase 1: Processors send messages to their √P − 1 row neighbors.

A message from (x1, y1) to (x2, y2) goes via (x1, y2).

Phase 2: Processors send messages to their √P − 1 column neighbors.

2 * (√P − 1) messages instead of P − 1.
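The routing rule and the message count can be checked with a short sketch. It assumes P is a perfect square and that processor ranks are laid out row-major in the virtual mesh; both the layout and the function names are assumptions made for illustration.

```python
import math


def mesh_route(src, dst, side):
    """Intermediate rank for a message from src to dst.

    With rank = x * side + y, a message from (x1, y1) to (x2, y2)
    is first sent along the row to (x1, y2).
    """
    x1, _y1 = divmod(src, side)
    _x2, y2 = divmod(dst, side)
    return x1 * side + y2


def combined_sends(proc, side):
    """Number of combined messages proc sends in an all-to-all."""
    num_procs = side * side
    # Phase 1: one combined message per distinct row intermediate.
    phase1 = {mesh_route(proc, dst, side) for dst in range(num_procs)} - {proc}
    # Phase 2: one combined message to each other processor in proc's column.
    _x, y = divmod(proc, side)
    phase2 = {row * side + y for row in range(side)} - {proc}
    return len(phase1) + len(phase2)


def messages_per_processor(num_procs):
    side = math.isqrt(num_procs)
    if side * side != num_procs:
        raise ValueError("this sketch assumes P is a perfect square")
    return {"direct": num_procs - 1, "mesh": 2 * (side - 1)}


counts = messages_per_processor(16)
via = mesh_route(src=1, dst=14, side=4)  # (0, 1) -> (3, 2) goes via (0, 2)
```

The trade-off is that each byte travels two hops instead of one, so the scheme pays off for short messages, where the per-message cost dominates.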

Page 17

NAMD PME Performance

[Figure: step time (0 to 140) vs. number of processors (256, 512, 1024) for Mesh, Direct, and Native MPI]

Performance of NAMD with the ATPase molecule. The PME step in NAMD involves a 192 x 144 processor collective operation with 900-byte messages.

Page 18

QsNet Challenge: Bandwidth

QsNet network bandwidth: 320 MB/s

            MB/s
One way     290
Two way     128

PCI/DMA contention restricts bandwidth on Alpha servers.

Page 19

Improving Bandwidth

            Main-Main   Elan-Main   Elan-Elan
One way     290         305         319
Two way     128         305         319

Node bandwidth (MB/s) for different placements of source and destination

Sending messages from Elan memory is faster.

Page 20

QsNet Challenge: Stretched Handlers
• Stretched sends
  • Green superscripts
• Similar stretches observed in the middle of entry methods

[Figure: NAMD timeline, processors vs. time, showing force compute and integrate phases]

Page 21

Stretching Solution

Stretched Sends

Elan Isend blocked when the rendezvous for the previous Isend to any destination had not been acknowledged

Solved the problem by closely working with Quadrics and obtaining a patch

Isend only blocks on the rendezvous of the previous message to the same destination
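The difference between the two blocking rules can be sketched with a toy model. It is not the Elan library or its API: the function name, the fixed acknowledgement delay, and the per-send overhead are all assumptions chosen only to show why blocking on any outstanding rendezvous stretches a sequence of sends.

```python
def isend_finish_time(dests, ack_delay, overhead, per_destination):
    """Time at which the last Isend in `dests` returns to the caller.

    per_destination=False: a send waits for the previous send's ack,
                           whatever its destination (original behavior).
    per_destination=True:  a send waits only for the previous send to
                           the same destination (patched behavior).
    """
    now = 0.0
    last_ack_any = 0.0
    last_ack_by_dest = {}
    for dest in dests:
        if per_destination:
            blocked_until = last_ack_by_dest.get(dest, 0.0)
        else:
            blocked_until = last_ack_any
        start = max(now, blocked_until)
        now = start + overhead          # CPU cost of issuing the send
        ack = now + ack_delay           # rendezvous acknowledged later
        last_ack_any = ack
        last_ack_by_dest[dest] = ack
    return now


sends = [1, 2, 3, 4]  # four sends, each to a different destination
before_patch = isend_finish_time(sends, ack_delay=10.0, overhead=1.0,
                                 per_destination=False)
after_patch = isend_finish_time(sends, ack_delay=10.0, overhead=1.0,
                                per_destination=True)
```

In this model the sender is stalled for most of the sequence under the original rule, which matches the stretched sends seen in the timeline; under the patched rule only repeated sends to the same destination can stall.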

Page 22

Stretching Solution Contd.

Stretches in the middle of entry methods
• Caused by OS daemons
• Using blocking receives minimized these stretches
• Daemons can be scheduled when the processor is idle

Page 23

NAMD With Blocking Receives

[Figure: NAMD timeline with blocking receives, processors vs. time]

Page 24

NAMD Performance on Lemieux

[Figure: NAMD step time (ms, 0 to 30) and performance (GFLOPS, 0 to 1200) vs. number of processors (1000 to 3000)]

Page 25

Summary
• QsNet is an excellent network
• NIC co-processor is ideal for message driven execution
• Programming guidelines
  • Send messages from Elan memory
  • Post a limited number of receives, and post them before the sends
  • Use blocking receives to avoid stretching

Page 26

Future Work
• One sided communication
  • Barrier?
• Persistent one sided communication
  • Reserve buffers on destination