1
Opportunities and Challenges of Modern Communication
Architectures: Case Study with QsNet
CAC Workshop
Santa Fe, NM, 2004
Sameer Kumar* and Laxmikant V. Kale
Parallel Programming Laboratory
University of Illinois at Urbana-Champaign
2
Outline
• Processor virtualization
• QsNet
• Opportunities
• Performance evaluation of QsNet
• Challenges of QsNet
• Summary
3
Processor Virtualization
• Basic idea of processor virtualization
  • User specifies interaction between objects (VPs)
  • RTS maps VPs onto physical processors
  • Typically, # virtual processors > # processors
• Embodied in Charm++ and AMPI
[Figure: user view vs. system implementation]
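The mapping step above can be illustrated with a small sketch. It is written in Python rather than Charm++, the function name `map_vps_round_robin` is hypothetical, and round-robin placement is only an assumed policy chosen for illustration; the slides do not say how the RTS actually places VPs.

```python
def map_vps_round_robin(num_vps, num_procs):
    """Assign each virtual processor (VP) to a physical processor, round-robin.

    Assumption: round-robin is an illustrative policy, not the actual
    placement strategy of the Charm++ or AMPI runtime system.
    """
    if num_procs <= 0:
        raise ValueError("need at least one physical processor")
    placement = {proc: [] for proc in range(num_procs)}
    for vp in range(num_vps):
        placement[vp % num_procs].append(vp)
    return placement


# Eight VPs on two physical processors (more VPs than processors).
placement = map_vps_round_robin(num_vps=8, num_procs=2)
for proc, vps in placement.items():
    print(f"P{proc}: VPs {vps}")
```

The user only ever names VPs; which physical processor hosts a VP is decided by the mapping function, so the runtime is free to change it.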
4
QsNet
• Popular interconnect from Quadrics
• Several parallel systems in the Top500 use QsNet
  • Pittsburgh’s Lemieux (6 TF)
  • ASCI-Q (20 TF)
• Elite network
• Elan adaptor
5
Elite Network
• 320 MB/s each way after protocol overhead
• Reliable fat-tree network
• Multiple routes provide fault tolerance
• Adaptive wormhole routing
• 35 ns per hop
6
Elan Network Adaptor
• Features
  • Low latency (4.5 μs for MPI)
  • High bandwidth (320 MB/s/node)
• Components
  • Sparc processor
  • DMA engine
  • 64 MB RAM
  • On-chip cache
7
Low CPU Overhead
[Figure: latency and CPU overhead (us) vs. message size (bytes), for 16 to 4096 byte messages]
CPU Overhead is small and does not change much with the message size
8
Traditional Message Passing
[Figure: timeline for processors P0 and P1 showing send overhead, receive overhead, and idle time]
Traditional message passing does not utilize the low CPU overhead of Elan
9
Adaptive Overlap
[Figure: timeline for processors P0 and P1, each running VP0 and VP1 in turn, showing send overhead and receive overhead]
Processor Virtualization takes full advantage of the low CPU overhead of Elan
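The contrast between the two timelines can be put in numbers with a toy model. Everything here is an assumption made for illustration: the function name `step_time`, the formula, and the input values are hypothetical and are not taken from the measurements in this talk. The model assumes each of the k VPs on a processor does 1/k of the work, pays a fixed CPU overhead per message, and waits one network latency for its data, while the other VPs keep the processor busy.

```python
def step_time(work_us, latency_us, overhead_us, vps_per_proc):
    """Toy estimate of the time per step on one processor, in microseconds.

    Assumed model (not from the slides):
      * busy time      = total work + one message overhead per VP
      * critical path  = one VP's work + its overhead + network latency
    The step cannot finish faster than either of the two.
    """
    busy = work_us + vps_per_proc * overhead_us
    critical_path = work_us / vps_per_proc + overhead_us + latency_us
    return max(busy, critical_path)


# Hypothetical inputs: 100 us of work, 20 us latency, 2 us CPU overhead.
for k in (1, 8):
    print(f"{k} VP(s) per processor: {step_time(100, 20, 2, k)} us per step")
```

With one VP the latency sits on the critical path as idle time; with several VPs the step time is bounded by the processor's own work plus per-message overhead, which is why a low CPU overhead matters.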
10
Benefit of Adaptive Overlap
Problem setup: 3D stencil calculation of size 240^3, run on Lemieux.
Shows AMPI with virtualization ratios of 1 and 8.
[Figure: execution time (sec, log scale from 0.001 to 1) vs. number of processors (log scale from 1 to 1000), for AMPI(1) and AMPI(8)]
11
Charm++ Message Driven Execution
[Diagram: the scheduler invokes handlers and the pump; labeled steps: Garbage Collection, Send, Tport Send, Post Receives, Receive Message]
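As a rough illustration of message-driven execution, the sketch below shows a scheduler that repeatedly picks the next available message and runs the handler it names. It is a simplified Python model, not Charm++ code: the `Scheduler` class and the `ping`/`pong` handlers are hypothetical, and the queue is purely local, whereas the real scheduler also pumps the network (posting receives, completing sends, and so on).

```python
from collections import deque


class Scheduler:
    """Minimal message-driven scheduler: one queue, handlers looked up by name."""

    def __init__(self):
        self.queue = deque()
        self.handlers = {}

    def register(self, name, fn):
        self.handlers[name] = fn

    def send(self, name, payload):
        # Assumption: delivery is local and immediate; no network is modeled.
        self.queue.append((name, payload))

    def run(self):
        """Process messages until the queue is empty; return how many ran."""
        processed = 0
        while self.queue:
            name, payload = self.queue.popleft()
            self.handlers[name](self, payload)  # handler may send more messages
            processed += 1
        return processed


log = []


def ping(sched, n):
    log.append(("ping", n))
    if n > 0:
        sched.send("pong", n - 1)


def pong(sched, n):
    log.append(("pong", n))
    if n > 0:
        sched.send("ping", n - 1)


sched = Scheduler()
sched.register("ping", ping)
sched.register("pong", pong)
sched.send("ping", 3)
count = sched.run()
print(count, log)
```

No handler ever blocks waiting for a specific message; control always returns to the scheduler, which is what lets other work proceed while a message is in flight.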
12
NAMD: A Production MD System
• Written in Charm++
• Fully featured program
• NIH-funded development
• Distributed free of charge (5000+ downloads so far)
• Binaries and source code
• Installed at NSF centers
• Large published simulations (e.g., aquaporin simulation featured in keynote)
13
Scaling NAMD
Several QsNet challenges had to be overcome to scale NAMD
14
QsNet Challenge: Latency
[Figure: short-message latency (us) vs. number of receives posted (1, 5, 9, 17, 33), for MPI and Converse]
Applications need to post receives for messages of different sizes
15
Latency Bottlenecks
• Slow NIC processor with a 100 MHz clock
• Cache size is only 8 KB
• Traversing a large loop flushes it
Cache Misses vs Number of Receives Posted:
1 860175 924759 10303713 17406017 100800 3
16
Managing Latency: Message Combining
• Organize processors in a 2D (virtual) mesh
• Phase 1: processors send messages to row neighbors (sqrt(P) - 1 messages)
• A message from (x1,y1) to (x2,y2) goes via (x1,y2)
• Phase 2: processors send messages to column neighbors (sqrt(P) - 1 messages)
• 2 * (sqrt(P) - 1) messages instead of P - 1
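The two-phase routing above can be sketched as follows. This is an illustrative Python model, not the library's implementation: the function names are hypothetical, ranks are assumed to be laid out row-major on a square mesh (x is the row, y is the column), and the actual combining of payloads at the intermediate processor is not modeled.

```python
import math


def mesh_route(src, dst, side):
    """Return the (sender, receiver) hops for a message on a side x side mesh.

    Assumption: rank r sits at (x, y) = divmod(r, side), with x the row and
    y the column. The message travels via (x1, y2): first along the
    source's row, then along the destination's column.
    """
    x1, _y1 = divmod(src, side)
    _x2, y2 = divmod(dst, side)
    intermediate = x1 * side + y2
    hops = []
    if intermediate != src:
        hops.append((src, intermediate))   # phase 1: row neighbor
    if intermediate != dst:
        hops.append((intermediate, dst))   # phase 2: column neighbor
    return hops


def messages_per_processor(num_procs):
    """Messages each processor sends: (mesh strategy, direct strategy)."""
    side = math.isqrt(num_procs)
    if side * side != num_procs:
        raise ValueError("this sketch assumes a perfect-square processor count")
    return 2 * (side - 1), num_procs - 1


print(mesh_route(0, 15, side=4))
print(messages_per_processor(1024))
```

Because every message headed for column y2 leaves the source through the same row neighbor, those messages can be combined into one, which is where the reduction from P - 1 to 2 * (sqrt(P) - 1) sends comes from. The price is that each payload is transmitted twice.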
17
NAMD PME Performance
[Figure: NAMD PME step time vs. number of processors (256, 512, 1024) for three strategies: Mesh, Direct, and Native MPI]
Performance of NAMD with the ATPase molecule. The PME step in NAMD involves a 192 x 144 processor collective operation with 900-byte messages.
18
QsNet Challenge: Bandwidth
Bandwidth (MB/s):
  One Way   290
  Two Way   128
PCI/DMA contention restricts bandwidth on Alpha servers
QsNet network bandwidth: 320 MB/s
19
Improving Bandwidth
Node bandwidth (MB/s) for different placements of source and destination:

            Main-Main   Elan-Main   Elan-Elan
One Way     290         305         319
Two Way     128         305         319

Sending messages from Elan memory is faster
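To make the gain concrete, the short sketch below computes the bandwidth ratio of each placement relative to main-memory transfers, using only the figures from the table above. The dictionary layout and the function name `speedup_over_main_memory` are hypothetical.

```python
# Node bandwidth in MB/s, copied from the table above.
NODE_BANDWIDTH_MB_S = {
    "Main-Main": {"one_way": 290, "two_way": 128},
    "Elan-Main": {"one_way": 305, "two_way": 305},
    "Elan-Elan": {"one_way": 319, "two_way": 319},
}


def speedup_over_main_memory(placement, direction):
    """Bandwidth of a placement divided by the Main-Main bandwidth."""
    base = NODE_BANDWIDTH_MB_S["Main-Main"][direction]
    return NODE_BANDWIDTH_MB_S[placement][direction] / base


for placement in ("Elan-Main", "Elan-Elan"):
    for direction in ("one_way", "two_way"):
        ratio = speedup_over_main_memory(placement, direction)
        print(f"{placement} {direction}: {ratio:.2f}x")
```

The one-way improvement is modest, while the two-way case improves by more than a factor of two, consistent with PCI/DMA contention being the limiting factor for main-memory transfers.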
20
QsNet Challenge: Stretched Handlers
Stretched Sends
Green superscripts
Similar stretches observed in the middle of entry methods
[Figure: NAMD timeline, time vs. processors; legend: force compute, integrate]
21
Stretching Solution
Stretched Sends
• Elan Isend blocked when the rendezvous for the previous Isend, to any destination, had not been acknowledged
• Solved the problem by working closely with Quadrics and obtaining a patch
• With the patch, Isend blocks only on the rendezvous of the previous message to the same destination
22
Stretching Solution Contd.
Stretches in the middle of entry methods
• Caused by OS daemons
• Using blocking receives minimized these stretches
• Daemons can be scheduled when the processor is idle
23
NAMD With Blocking Receives
[Figure: NAMD timeline with blocking receives, time vs. processors]
24
NAMD Performance on Lemieux
[Figure: NAMD step time (ms, left axis) and performance (GFLOPS, right axis) vs. number of processors, 1000 to 3000]
25
Summary
• QsNet is an excellent network
• NIC co-processor is ideal for message-driven execution
• Programming guidelines
  • Send messages from Elan memory
  • Post a limited number of receives, and post them before the sends
  • Use blocking receives to avoid stretching
26
Future Work
• One-sided communication
  • Barrier?
• Persistent one-sided communication
  • Reserve buffers on destination