
1

Opportunities and Challenges of Modern Communication Architectures: Case Study with QsNet

CAC Workshop

Santa Fe, NM, 2004

Sameer Kumar* and Laxmikant V. Kale

Parallel Programming Laboratory

University of Illinois at Urbana-Champaign

2

Outline

Processor virtualization
QsNet
Opportunities
Performance Evaluation of QsNet
Challenges of QsNet
Summary

3

Processor Virtualization

Basic idea of processor virtualization
  User specifies interaction between objects (VPs)
  RTS maps VPs onto physical processors
  Typically, # virtual processors > # processors
Embodied in Charm++ and AMPI

[Figure: user view and system implementation]
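
The mapping of VPs onto physical processors can be sketched as follows. This is a minimal illustration that assumes a simple round-robin placement; the function name `map_vps` and the placement policy are hypothetical and are not taken from the Charm++ or AMPI runtime.

```python
def map_vps(num_vps, num_procs):
    """Assign each virtual processor (VP) to a physical processor.

    Assumption: round-robin placement, chosen only for illustration.
    """
    if num_procs <= 0:
        raise ValueError("need at least one physical processor")
    placement = {proc: [] for proc in range(num_procs)}
    for vp in range(num_vps):
        placement[vp % num_procs].append(vp)
    return placement


# Eight VPs on two processors: a virtualization ratio of 4.
print(map_vps(num_vps=8, num_procs=2))
# prints {0: [0, 2, 4, 6], 1: [1, 3, 5, 7]}
```

The user code only names VPs; which physical processor hosts a VP is decided by the mapping, so the ratio can be changed without touching the application logic.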

4

QsNet

Popular interconnect from Quadrics
Several parallel systems in top500 use QsNet
  Pittsburgh’s Lemieux (6TF)
  ASCI-Q (20TF)
Elite network
Elan adaptor

http://www.psc.edu/machines/tcs/lemieux.html

5

Elite Network

320 MB/s each way after protocol
Reliable fat-tree network
  Multiple routes provide fault tolerance
Adaptive wormhole routing
35 ns per hop

6

Elan Network Adaptor

Features
  Low latency (4.5 μs for MPI)
  High bandwidth (320 MB/s/node)
Components
  Sparc processor
  DMA Engine
  64 MB RAM
  On chip cache

7

Low CPU Overhead

[Figure: latency and CPU overhead (us, 0 to 35) vs. message size (bytes, 16 to 4096)]

CPU Overhead is small and does not change much with the message size

8

Traditional Message Passing

[Figure: timeline of P0 and P1 showing send overhead, receive overhead, and idle time]

Traditional message passing does not utilize the low CPU overhead of Elan

9

Adaptive Overlap

[Figure: timeline of P0 and P1, each running VP0 and VP1, showing send overhead and receive overhead]

Processor Virtualization takes full advantage of the low CPU overhead of Elan
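
The effect of adaptive overlap can be illustrated with a toy simulation of one processor. This is a sketch under stated assumptions, not a model of Elan or AMPI: each VP computes for an equal share of the per-iteration work, then needs a message that arrives a fixed `latency` after its compute step ends, and the CPU overhead of sending and receiving is taken as zero. The function name `run_time` and all parameters are hypothetical.

```python
import heapq


def run_time(total_work, latency, vps, iterations):
    """Simulated wall time for one processor hosting `vps` virtual processors.

    Assumptions: work is split evenly across VPs, a VP's next message
    arrives `latency` after its compute step ends, and communication
    costs no CPU time.
    """
    chunk = total_work / vps
    # Heap entries: (time the VP's next message is available, vp id, steps left)
    ready = [(0.0, vp, iterations) for vp in range(vps)]
    heapq.heapify(ready)
    clock = 0.0
    while ready:
        available, vp, left = heapq.heappop(ready)
        # The processor idles only if no VP has a message available yet.
        clock = max(clock, available) + chunk
        if left > 1:
            heapq.heappush(ready, (clock + latency, vp, left - 1))
    return clock


print(run_time(8.0, 2.0, vps=1, iterations=10))  # prints 98.0
print(run_time(8.0, 2.0, vps=8, iterations=10))  # prints 80.0
```

With one VP the processor idles during every message wait; with eight VPs another VP always has work available, so in this model the waits are hidden completely.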

10

Benefit of Adaptive Overlap

Problem setup: 3D stencil calculation of size 240^3 run on Lemieux.

Shows AMPI with virtualization ratio of 1 and 8.

[Figure: execution time (sec, log scale from 0.001 to 1) vs. number of processors (1 to 1000) for AMPI(1) and AMPI(8)]

11

Charm++ Message Driven Execution

[Figure: message-driven execution loop; labels: Handler, Scheduler, Pump, Garbage Collection, Send, Tport Send, Post Receives, Receive Message]
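
Message-driven execution can be sketched with a small scheduler that dequeues messages and invokes the handler registered for each one. This is a minimal illustration only: the class `Scheduler` and the `ping`/`pong` handlers are hypothetical names, and the sketch models just the queue and the handlers, not the network pump, receive posting, or garbage collection of sends that a real runtime performs.

```python
from collections import deque


class Scheduler:
    """Toy message-driven scheduler (hypothetical, for illustration)."""

    def __init__(self):
        self.queue = deque()
        self.handlers = {}

    def register(self, name, fn):
        self.handlers[name] = fn

    def enqueue(self, name, payload):
        # In a real system this message would arrive from the network.
        self.queue.append((name, payload))

    def run(self):
        """Invoke handlers until no messages remain; return the count."""
        processed = 0
        while self.queue:
            name, payload = self.queue.popleft()
            self.handlers[name](self, payload)
            processed += 1
        return processed


log = []


def on_ping(sched, n):
    log.append(("ping", n))
    if n > 0:
        sched.enqueue("pong", n - 1)


def on_pong(sched, n):
    log.append(("pong", n))
    if n > 0:
        sched.enqueue("ping", n - 1)


sched = Scheduler()
sched.register("ping", on_ping)
sched.register("pong", on_pong)
sched.enqueue("ping", 3)
count = sched.run()
print(count, log)
```

No handler ever blocks waiting for a particular message; control returns to the scheduler, which runs whatever work is available next.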

12

NAMD: A Production MD System

• Written in Charm++
• Fully featured program
• NIH-funded development
• Distributed free of charge (5000+ downloads so far)
• Binaries and source code
• Installed at NSF centers
• Large published simulations (e.g., aquaporin simulation featured in keynote)

13

Scaling NAMD

Several QsNet challenges had to be overcome to scale NAMD

14

QsNet Challenge: Latency

[Figure: short message latency (us, 0 to 20) vs. number of receives posted (1 to 33) for MPI and Converse]

Applications need to post receives for messages of different sizes

15

Latency Bottlenecks

Slow NIC processor with a 100 MHz clock
Cache size only 8 KB
  Traversing a large loop flushes it

[Table: cache misses vs. number of receives posted; cell boundaries lost in extraction, raw digits: 1 860175 924759 10303713 17406017 100800 3]

16

Managing Latency: Message Combining

Organize processors in a 2D (virtual) mesh

Phase 1: Processors send messages to sqrt(P) - 1 row neighbors

Message from (x1,y1) to (x2,y2) goes via (x1,y2)

Phase 2: Processors send messages to sqrt(P) - 1 column neighbors

2*(sqrt(P) - 1) messages instead of P - 1
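
The two phases can be sketched as a small simulation of an all-to-all exchange on a square virtual mesh. This is an illustrative sketch, assuming P is a perfect square and that rank r sits at coordinates (r // side, r % side); the function name `mesh_all_to_all` and the data layout are hypothetical and are not the authors' implementation.

```python
import math


def mesh_all_to_all(p):
    """Simulate a combined all-to-all on a sqrt(p) x sqrt(p) virtual mesh.

    Returns (sent, delivered): the number of network messages each rank
    sends, and the list of source ranks whose data reached each rank.
    """
    side = math.isqrt(p)
    if side * side != p:
        raise ValueError("this sketch assumes p is a perfect square")

    def rank(x, y):
        return x * side + y

    sent = [0] * p
    held = [[] for _ in range(p)]  # (src, dst) items held after phase 1

    # Phase 1: (x1, y1) combines everything bound for column y2 into one
    # message and sends it along its row to (x1, y2).
    for src in range(p):
        x1 = src // side
        for y2 in range(side):
            mid = rank(x1, y2)
            held[mid].extend((src, rank(x2, y2)) for x2 in range(side))
            if mid != src:
                sent[src] += 1

    # Phase 2: (x1, y2) combines everything bound for (x2, y2) into one
    # message and sends it along its column.
    delivered = [[] for _ in range(p)]
    for mid in range(p):
        y2 = mid % side
        for x2 in range(side):
            dst = rank(x2, y2)
            delivered[dst].extend(s for (s, d) in held[mid] if d == dst)
            if dst != mid:
                sent[mid] += 1
    return sent, delivered


sent, delivered = mesh_all_to_all(16)
print(max(sent))  # prints 6, i.e. 2*(sqrt(16) - 1), instead of 15
```

Each rank sends fewer but larger messages, which trades extra data forwarding for a lower per-message cost.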

17

NAMD PME Performance

[Figure: step time (0 to 140) vs. number of processors (256, 512, 1024) for Mesh, Direct, and Native MPI]

Performance of NAMD with the ATPase molecule. The PME step in NAMD involves a 192 x 144 processor collective operation with 900-byte messages.

18

QsNet Challenge: Bandwidth

           MB/s
One Way    290
Two Way    128

PCI/DMA contention restricts bandwidth on Alpha servers

QsNet network bandwidth: 320 MB/s

19

Improving Bandwidth

           Main-Main   Elan-Main   Elan-Elan
One Way    290         305         319
Two Way    128         305         319

Node bandwidth (MB/s) for different placements of source and destination

Sending messages from Elan memory is faster
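
As a worked example of what these numbers mean, the sketch below converts the measured bandwidths into transfer times. It assumes that transfer time is simply size divided by bandwidth (latency and protocol overhead are ignored) and that message sizes use the same MB unit as the table; the names `BANDWIDTH_MB_S` and `transfer_time_ms` are hypothetical.

```python
# Measured node bandwidth in MB/s, copied from the table above.
BANDWIDTH_MB_S = {
    ("one-way", "Main-Main"): 290,
    ("one-way", "Elan-Main"): 305,
    ("one-way", "Elan-Elan"): 319,
    ("two-way", "Main-Main"): 128,
    ("two-way", "Elan-Main"): 305,
    ("two-way", "Elan-Elan"): 319,
}


def transfer_time_ms(size_mb, direction, placement):
    """Bandwidth-only estimate of transfer time in milliseconds."""
    return 1000.0 * size_mb / BANDWIDTH_MB_S[(direction, placement)]


main = transfer_time_ms(1, "two-way", "Main-Main")
elan = transfer_time_ms(1, "two-way", "Elan-Elan")
print(main)  # prints 7.8125
print(round(main / elan, 2))  # prints 2.49
```

Under this estimate, a two-way exchange from main memory takes about 2.5 times as long as the same exchange from Elan memory.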

20

QsNet Challenge: Stretched Handlers

Stretched Sends
  Green superscripts
Similar stretches observed in the middle of entry methods

[Figure: NAMD timeline (processors vs. time) showing force compute and integrate phases]

21

Stretching Solution

Stretched Sends
  Elan Isend blocked when the rendezvous for the previous Isend to any destination had not been acknowledged
  Solved the problem by working closely with Quadrics and obtaining a patch
  With the patch, Isend only blocks on the rendezvous of the previous message to the same destination
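
The difference between the two blocking rules can be illustrated with a toy model. It is a sketch under stated assumptions, not the Elan library's actual behavior in detail: every send is assumed to use the rendezvous protocol, each rendezvous is acknowledged a fixed `ack_delay` after the send starts, and the sender does `work_between` units of other work between sends. The function name `total_stall` and its parameters are hypothetical.

```python
def total_stall(destinations, ack_delay, work_between, per_destination):
    """Total time the sender spends blocked inside Isend.

    per_destination=False: block on the previous send to ANY destination
    (the original behavior described above).
    per_destination=True: block only on the previous send to the SAME
    destination (the patched behavior).
    """
    clock = 0
    stall = 0
    ack_at = {}  # key -> time at which the pending rendezvous is acknowledged
    for dst in destinations:
        key = dst if per_destination else "any"
        wait_until = ack_at.get(key, 0)
        if wait_until > clock:
            stall += wait_until - clock
            clock = wait_until
        ack_at[key] = clock + ack_delay
        clock += work_between
    return stall


dests = [0, 1, 2, 3]
print(total_stall(dests, ack_delay=5, work_between=1, per_destination=False))  # prints 12
print(total_stall(dests, ack_delay=5, work_between=1, per_destination=True))   # prints 0
```

In this model, sends to distinct destinations never stall under the patched rule, while back-to-back sends to the same destination still do.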

22

Stretching Solution Contd.

Stretches in the middle of entry methods
  Caused by OS daemons
  Using blocking receives minimized these stretches
  Daemons can be scheduled when processor is idle

23

NAMD With Blocking Receives

[Figure: NAMD timeline (processors vs. time) with blocking receives]

24

NAMD Performance on Lemieux

[Figure: NAMD step time (ms, 0 to 30) and performance (GFLOPS, 0 to 1200) vs. number of processors (1000 to 3000)]

25

Summary

QsNet is an excellent network
NIC co-processor ideal for message driven execution
Programming guidelines
  Send messages from Elan memory
  Post a limited number of receives, and post them before the sends
  Use blocking receives to avoid stretching

26

Future Work

One sided communication
  Barrier?
Persistent one sided communication
  Reserve buffers on destination
