AMLAPI: Active Messages over Low-level Application Programming Interface
Simon Yau, smyau@cs
Tyson Condie, tcondie@cs
Background
AM is a low-level communication architecture for high-performance parallel computing
LAPI is IBM’s version of AM.
Very similar APIs – programs running on an AM platform should be able to run on LAPI.
Use an AMLAPI layer to emulate AM using LAPI.
Similarities
Both are low-level message-passing style architectures.
Both use active messages:
– One node initiates an active message.
– The receiving node executes a handler upon reception of the active message.
Differences
AM virtualizes the network interface with endpoints and bundles – allows multiple threads at each endpoint.
AM requires handlers to be executed in the context of the application program; LAPI handlers execute in the context of the polling thread.
LAPI separates handlers into header and completion handlers.
LAPI uses counters for synchronization (guarantees execution of handlers); AM only guarantees that the network has accepted the data.
AM & LAPI Execution Model
[Timeline diagrams comparing the two models.]
AM execution: the sender sends the message and keeps doing work; the receiver gets the message and executes the handler (and sends a reply).
LAPI execution: the sender sends the message and keeps doing work; the receiver polls, gets the message, and executes the header handler; the footer is then sent and received, the footer handler executes, and both sides keep polling.
To Emulate AM on LAPI
Emulate endpoints and bundles:
– Maintain a list of endpoints per box.
– Each endpoint is represented by the box id and its position in the list.
Associate each endpoint bundle with a task queue:
– An AM is done with a LAPI call which schedules a task on the queue at the remote end.
Design
Sending an AM:
– Package a LAPI message and send it to the receiving node.
– At the receiving node, multiplex the message to the appropriate endpoint and put the associated function pointer, with its arguments, on the task queue.
Receiving an AM:
– When the user polls, check the task queue and execute a task from it.
– Execute only one task, since we do not want the user thread to spend too much time in handlers.
Picture
[Timeline diagram of the emulation: the sender sends the message and keeps doing work; the receiver polls, gets the message, runs the header handler, gets the footer, runs the footer handler, and executes the AM handler on a later poll.]
1. Sender executes AM_Send.
2. Sender piggybacks information about the AM call and executes LAPI_Send.
3. Network ships the message to the receiver.
4. Receiver’s network gets the request message, causing the polling thread to execute the header handler.
5. Header handler allocates buffer space to which the message is copied.
6. LAPI copies the data into the buffer and calls the footer handler.
7. Footer handler posts the AM handler, with the arguments and AM information, on the queue of the destination endpoint.
8. When the user application polls, it pulls the handler from the task queue and executes it.
Evaluation
Platform: SP3
Interconnect:
– Advertised bandwidth = 350 MB/s
– Advertised latency = ~17 microseconds
SMPs:
– 8 × Power3 processor SMPs
– 4 GB of memory per node
Processor:
– Super-scalar, pipelined 64-bit RISC
– 8 instructions per clock at 375 MHz
– 64 KB L1 cache, 8 MB L2 cache
OS:
– AIX with IBM Parallel Environment
Micro benchmarks:
– AMLAPI round trip latency: 473 us
– LAPI round trip latency: 32 us
LAPI & AMLAPI Bandwidth on SP3
[Chart: bandwidth (KB/s, 0–140000) vs. message size (bytes, 0–300000), plotting LAPI bandwidth against AM bandwidth.]
Explanation
Percentage breakdown of overhead at a 262144-byte message:
– LAPI: 51%
– Context switch: 10%
– Packing AM info: 17%
– Copying to endpoint VM: 22%
Time spent on transmission of a message
[Chart: time (ms, 0–2.5) vs. message size (0–300000 bytes), broken down into LAPI (communication), context switch & polling, packing AM info, and copying to the endpoint VM segment.]
Copying data from the message buffer to an endpoint’s VM segment takes up the bulk of the overhead.
Context switching and packing AM info take up the rest.
Since the SP3 node is an SMP, the LAPI thread and the application thread run on different processors; moving data from the LAPI thread’s processor requires invalidating the cache of the processor on which the LAPI thread runs.
Conclusion
Using low-level glue-ware is a viable option for making programs portable if the communication layers match.
Future work:
– Macro benchmarks
– Improve short message latency via the header handler
– “Zero copy” to endpoint VM – make the AM handler run in LAPI context