-
8/3/2019 Low Latency Networking
1/56
Low Latency Networking
Glenford Mapp
Digital Technology Group, Computer Laboratory
http://www.cl.cam.ac.uk/Research/DTG/~gem11
-
What is Latency?
The time taken to send a unit of data
between two points in a network
A low latency network is a network in
which the design of the hardware, systems
and protocols is geared towards
minimizing the time taken to move units of
data between any two points on that network
-
Throughput
Number of bytes of data transferred
per second between two points
Doesn't high throughput imply low latency?
Not necessarily
A bus vs a car travelling along a section of road
Which has the higher throughput?
Which has the lower latency?
-
Throughput vs Latency
In simplest form,
Throughput ~ C / Latency
C = instantaneous capacity: number of units that are handled per operation
So if C is large you can get good throughput even
if your latency is not low
Low latency does not necessarily imply high
throughput if C also gets smaller
ATM is a good example
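
The relation Throughput ~ C / Latency can be tried out on the bus versus car question from the earlier slide. This is only a sketch: the passenger counts and trip times below are made-up assumptions, not figures from the talk.

```python
def throughput(capacity, latency_s):
    """Units moved per second, using Throughput ~ C / Latency."""
    return capacity / latency_s

# Hypothetical numbers: the car carries 4 people and needs 600 s for the
# trip; the bus carries 50 people but needs 900 s for the same trip.
car = throughput(capacity=4, latency_s=600)
bus = throughput(capacity=50, latency_s=900)

print(f"car: latency 600 s, throughput {car:.4f} people/s")
print(f"bus: latency 900 s, throughput {bus:.4f} people/s")
# With these numbers the bus has the higher throughput even though
# the car has the lower latency.
```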
-
Throughput Claims
Look carefully at high throughput claims.
Have they decreased the latency?
Per unit operation is faster: Software -> Hardware (ATM)
Have they increased instantaneous capacity?
Serial -> Parallel; Parallel -> Serial
In most designs we have a mixture of both
Manufacturers will generally allow increased
latency if capacity greatly increases
-
Who cares about latency?
Why is latency important?
Some applications are more affected by latency than by throughput
Voice
Also affected by jitter
Networked Games
Interactive sessions
-
Lessons from Computers
Consider the mainframe in the time-sharing era, 1963-1976
Studies showed that user productivity was reduced by half if the response time from the mainframe increased from 0.5 to 3 seconds
Mainframe optimised for throughput
Maximize the number of people using it
High throughput
-
Lessons from Computers
But as more people logged on, the slower
the machine became, and by noon the
response time would increase markedly, so user productivity would fall
Key factor in the development of PCs
Famous saying: "I love the Alto (first PC) because it does not run
faster at night!"
-
A look at the Internet
Not really designed for low latency
Designed to be adaptable and robust
But the new applications we want the
Internet to support need low latency
Web servers
Voice over IP
Networked Games, etc
-
Components of Network Latency
Hardware
Different hardware capacities and limitations
Ethernet: variable packet size; max 1500 bytes
ATM: uses fixed cells of 53 bytes
Network Routers and Switches
Queueing strategies
Overload/congestion strategy
-
Components of Network Latency
System Latency
Moving the packet between the application and
the network interface
OS latency
The operating system handling the packet
Application Latency
Application must acquire resources (e.g. CPU) in
order to send or consume data
-
Traditional Networking: A closer look
Look at a packet being received by the host
machine and delivered up to the application
At the lowest level, the packet enters the
network interface card (NIC) and ends up in a
buffer or FIFO on the card. The card generates an
interrupt.
-
Traditional Networking contd
Interrupt Handler runs, data is moved into a
system buffer in main memory.
Packet is placed on a receive queue
In Linux there is one network receive queue
Packets from all the network interfaces are placed
on that queue
Packet is marked for system processing
Interrupt Handler ends
-
Traditional Networking contd
System processing
Packet is taken up the protocol stack
IP processing; TCP processing
Connection information associated with the packet is used to find the corresponding socket
Socket ~ Src (IPaddr, TCP port), Dest (IPaddr, TCP port)
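
The socket lookup described above can be sketched as a table keyed by the connection's source and destination (IP address, TCP port) pairs. This is an illustrative model only: the dictionary layout, the function names and the addresses are assumptions, not how any particular kernel stores its sockets.

```python
# Hypothetical connection table keyed by
# (src_ip, src_port, dst_ip, dst_port).
sockets = {}

def register(src_ip, src_port, dst_ip, dst_port, name):
    """Record a connection and give it an empty receive queue."""
    sockets[(src_ip, src_port, dst_ip, dst_port)] = {"name": name, "queue": []}

def demultiplex(packet):
    """Find the socket for a packet and queue the payload on it."""
    key = (packet["src_ip"], packet["src_port"],
           packet["dst_ip"], packet["dst_port"])
    sock = sockets.get(key)
    if sock is None:
        return None  # no matching connection for this packet
    sock["queue"].append(packet["payload"])
    return sock["name"]

register("10.0.0.2", 40000, "10.0.0.1", 80, "web-connection")

pkt = {"src_ip": "10.0.0.2", "src_port": 40000,
       "dst_ip": "10.0.0.1", "dst_port": 80, "payload": b"GET /"}
owner = demultiplex(pkt)
print(owner, sockets[("10.0.0.2", 40000, "10.0.0.1", 80)]["queue"])
```

Every packet for every application passes through this one lookup, which is the protocol-level multiplexing the talk returns to under cross talk.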
-
Traditional Networking contd
Queue the packet on the socket structure
and see if any application threads are
waiting for incoming data
If so, copy the data from the system buffer to
the user buffer and wake up the thread
Application has to wait until it gets the CPU to consume data
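
From the application's side, the last steps look like a thread blocked in a receive call that is woken once data has been queued on its socket. The sketch below uses a local socket pair as a stand-in for the network path, which is an assumption made so that it runs on a single machine.

```python
import socket
import threading

# socketpair() returns two connected sockets in one process; here they
# stand in for the sender and the receiving host.
sender, receiver = socket.socketpair()
received = []

def application_thread():
    # recv() blocks until data is queued on the socket. The data is then
    # copied into a user-space bytes object and this thread is woken up.
    data = receiver.recv(1024)
    received.append(data)

t = threading.Thread(target=application_thread)
t.start()
sender.sendall(b"hello")   # data arrives; the waiting thread can proceed
t.join(timeout=5)

sender.close()
receiver.close()
print(received)
```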
-
[Figure: application layer on top of the socket interface; socket layer in the OS with system buffers; NIC connected to the network]
-
Cross Talk Issues
Interrupt level
While an application is running on the
processor, network interrupts occur on
incoming packets for other processes.
Protocol level
Packets for all applications are multiplexed and
de-multiplexed in the kernel
Application level
All applications must share resources so
sometimes I must wait a long time before I get
the processor.
-
Some ways to improve Traditional Networking
User level network interfaces
UNET - Matt Welsh (1995-1998)
Zero copy architectures
Virtual memory mapping techniques
Vertical Partitioning of Operating Systems
-
UNET
Application has an interface to talk directly to a network device
Doesn't involve the kernel in things like
protocol processing, etc.
Uses per-application message queues to
send and receive data
Novel idea at the time; complicates what applications need to do
-
UNET Endpoint
[Figure: communication segment with send queue, receive queue and free queue]
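
The endpoint structure above (a communication segment plus send, receive and free queues) can be modelled roughly as follows. This is a toy sketch, not the UNET interface: the class, the buffer size and the simulated_nic function are all hypothetical names introduced for illustration.

```python
from collections import deque

BUF_SIZE = 64  # hypothetical size of one buffer in the communication segment

class Endpoint:
    def __init__(self, n_buffers=4):
        self.segment = bytearray(n_buffers * BUF_SIZE)  # communication segment
        self.send_queue = deque()   # descriptors of messages ready to go out
        self.recv_queue = deque()   # descriptors of messages that have arrived
        # Buffers 1..n-1 are offered to the device for incoming data;
        # buffer 0 is kept for composing outgoing messages.
        self.free_queue = deque(i * BUF_SIZE for i in range(1, n_buffers))

    def send(self, data):
        assert len(data) <= BUF_SIZE
        self.segment[0:len(data)] = data          # compose in the segment
        self.send_queue.append((0, len(data)))    # post a descriptor

    def receive(self):
        if not self.recv_queue:
            return None
        offset, length = self.recv_queue.popleft()
        data = bytes(self.segment[offset:offset + length])
        self.free_queue.append(offset)            # hand the buffer back
        return data

def simulated_nic(src, dst):
    """Stand-in for the device: move posted messages from src to dst."""
    while src.send_queue and dst.free_queue:
        offset, length = src.send_queue.popleft()
        dst_off = dst.free_queue.popleft()
        dst.segment[dst_off:dst_off + length] = src.segment[offset:offset + length]
        dst.recv_queue.append((dst_off, length))

a, b = Endpoint(), Endpoint()
a.send(b"ping")
simulated_nic(a, b)
message = b.receive()
print(message)
```

Note that the application itself manages buffers and descriptors here, which illustrates why this style complicates what applications need to do.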
-
Zero-Copy Architecture
No need to copy data up to the application
DMA from network buffers in the NIC
straight into system buffers
Use VM techniques to map the relevant
system buffers into the address space of the
application
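
The difference between copying a buffer and mapping it can be illustrated in user space. The sketch below is only an analogy: a Python memoryview plays the role of the VM mapping and a bytearray plays the role of the system buffer; no DMA or page-table work is involved.

```python
# The bytearray stands in for a system buffer filled by the NIC.
system_buffer = bytearray(b"packet payload from the NIC")

copied = bytes(system_buffer)        # traditional path: the data is copied
mapped = memoryview(system_buffer)   # zero-copy path: same memory, new view

# A later change to the system buffer shows which one shares the memory.
system_buffer[0:6] = b"PACKET"

print(copied[:6])          # the copy still holds the old bytes
print(bytes(mapped[:6]))   # the view sees the new bytes
```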
-
Vertical Partitioning of the OS
So UNET gave applications an abstract
network card so there was less multiplexing
of data.
Why not go all the way and do more
partitioning of OS resources?
So CPU is carefully partitioned; file systems and disk devices also carefully partitioned
-
Pegasus project - Cambridge
Studied system support for multimedia
applications
Developed a new operating system called
Nemesis which adopted a vertical approach
Most of the operating system functions were in
shared libraries which executed in the user's process space
System-wide page table, so no copying
-
Vertical Approach
[Figure: processes and shared libraries in the vertical approach, compared with a normal OS]
-
Why haven't these ideas been universally implemented?
Some were explored
VIA is a hardware idea based on UNET
Replace PCI bus
Devices have receive, send and completion
queues and are connected along a high-speed
serial bus
One or two products out there but fell out of
favour
Infiniband - now popular extension of VIA
-
Ideas not universal
Zero copy and VM ideas explored in some
Operating Systems, e.g. the Spring OS by
Sun. Some ideas made their way into Solaris, and into Windows 2000 and XP via Mach
and NT
Nemesis was too radical for prime time
QoS ideas have been taken up by others
-
But the real reason was...
That processor and network speeds have
been increasing fast enough to keep
traditional networking in the picture.
If you simply want to browse the Web and
read email, then it is OK
However, there is a looming problem
-
Network speeds still going up!
We have gone from 10 Mbps in 1987 to
10 Gbps in 2004 and beyond.
Processor will not be able to keep up
Interrupt rate is phenomenal
Buses like the PCI bus cannot keep up
Move to PCI Express (Switch Fabric)
Workstation can presently saturate the
network but the tide is rapidly turning!
Network traffic will soon be able to cripple your PC
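
A rough calculation shows why the interrupt rate becomes a problem. The sketch assumes full-size 1500-byte packets, one interrupt per packet and no framing overhead; these are simplifying assumptions, not measurements from the talk.

```python
def packets_per_second(link_bps, packet_bytes):
    """Packets per second on a saturated link with fixed-size packets."""
    return link_bps / (packet_bytes * 8)

for name, bps in [("10 Mbps", 10e6), ("1 Gbps", 1e9), ("10 Gbps", 10e9)]:
    pps = packets_per_second(bps, 1500)
    # With one interrupt per packet, pps is also the interrupt rate and
    # 1e6 / pps is the time budget per packet in microseconds.
    print(f"{name}: {pps:,.0f} packets/s, {1e6 / pps:.1f} us per packet")
```

Under these assumptions a saturated 10 Gbps link leaves the host only about a microsecond to deal with each packet.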
-
Shared Memory Model
Data transfer is accomplished by writing to
memory addresses in the local address
space of the process
This data is captured by the local network
card and serialized into packets which are
transferred over the network to the remote machine, which writes the data to remote
addresses.
-
How does it actually work?
A region of the local address space of the
process is mapped to an IO region on the
card. That mapping is usually made using standard memory-mapping techniques.
In Unix the mmap call is used.
Same thing is done on the remote side
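
The mapping step can be tried with Python's mmap module. In this sketch a temporary file stands in for the card's IO region, which is an assumption: with real hardware the region would be exposed by the card's driver. Two mappings of the same region play the local and remote sides.

```python
import mmap
import os
import tempfile

REGION_SIZE = mmap.PAGESIZE

# A temporary file stands in for the IO region on the card.
fd, path = tempfile.mkstemp()
os.ftruncate(fd, REGION_SIZE)

writer = mmap.mmap(fd, REGION_SIZE)   # mapping used by the sending side
reader = mmap.mmap(fd, REGION_SIZE)   # second mapping of the same region

writer[0:5] = b"hello"                # an ordinary memory write...
seen = reader[0:5]                    # ...is visible through the other mapping

writer.close()
reader.close()
os.close(fd)
os.remove(path)
print(seen)
```

Once the mapping is in place, the transfer itself is just a store to memory; no system call is made per message.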
-
Shared Memory Model
[Figure: process VM on each host mapped to its NIC; packets travel between the two NICs]
-
How is the association between the local and remote regions made?
Fixed
In early SMMs, it was fixed.
All processors on the network share the same
region.
Flexible
Needs a communications channel to set up the mapping between regions
-
Fixed SMM
[Figure: process VM space; Proc A, Proc B, Proc C, Proc D]
-
Dynamic SMM
[Figure: process VM space; Proc A, Proc B, Proc C, Proc D]
-
SMM
Been around a long time
Used to communicate between processors in a
cluster.
The SMM is divided into pages, some of
which can be mapped between two
processes while the others can be mapped globally
-
Problems with SMM
Since no interrupts are involved and the OS
is no longer in the loop, it's hard to inform
the remote node that data has been sent and is waiting to be read
Major problem is therefore not the transfer,
but application synchronization
-
Application Synchronization Solutions
Polling:
the receiver keeps polling certain addresses to
see if a data transfer has occurred
This is expensive (wasting local CPU) and only
relevant if there is a real chance of a data
transfer.
Could be used to provide a form of
distributed synchronization - spinning on a
remote address
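
Polling can be sketched with two threads sharing a buffer. The bytearray stands in for the shared memory region, and the layout (byte 0 as a "data ready" flag) is a hypothetical choice made for this example.

```python
import threading
import time

shared = bytearray(16)   # stands in for the shared memory region
FLAG = 0                 # hypothetical layout: byte 0 is a "data ready" flag
polls = [0]              # how many times the receiver looked at the flag
result = []

def receiver():
    while shared[FLAG] == 0:   # spin on the flag address, burning CPU
        polls[0] += 1
    result.append(bytes(shared[1:6]))

t = threading.Thread(target=receiver)
t.start()
time.sleep(0.01)         # the receiver polls while nothing has arrived
shared[1:6] = b"hello"   # write the data first...
shared[FLAG] = 1         # ...then set the flag the receiver is watching
t.join(timeout=5)

print(result, "after", polls[0], "polls")
```

The poll count printed at the end is the cost the slide refers to: every one of those checks used the local CPU without moving any data.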
-
Application Synchronization Solutions
VM signalling
Page fault or access violations
Example: page is only mapped locally when
there is data to be read. If I access the page
when there is no data, then a page fault occurs
and I am blocked until the owner writes to the
page
-
VM Signalling
If I wish to read and there is data to be read
then the page is mapped into my address
space read-only.
If I attempt to write to the page, a page fault
occurs and I am blocked until I can acquire
the write lock for the page
Not scalable, too closely coupled to the VM
system
-
Out-of-Band Signalling
Use a separate channel outside the data
transfer region to signal that data has been
transferred.
For example, writing to a special set of
addresses would cause an interrupt to be
generated at the remote end
-
Out-of-Band Signalling
So you would transfer the data by writing to
your local address
Afterwards you write to a special address
associated with that memory region
An interrupt occurs on the other side and
the OS works out which buffer you are referring to and wakes up the waiting
process
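
The two-step sequence above (write the data, then write to a special address that raises an interrupt) can be modelled with a shared buffer and a separate signal. In this sketch a threading.Event stands in for the special address and the remote interrupt; that substitution is an assumption made so the example runs in one process.

```python
import threading

data_region = bytearray(16)      # shared data region (stand-in)
doorbell = threading.Event()     # stand-in for the special address + interrupt
result = []

def waiting_process():
    # Blocked rather than spinning: no CPU is used until the signal arrives.
    doorbell.wait(timeout=5)
    result.append(bytes(data_region[0:5]))

t = threading.Thread(target=waiting_process)
t.start()

data_region[0:5] = b"hello"      # step 1: transfer the data
doorbell.set()                   # step 2: signal on the separate channel
t.join(timeout=5)

print(result)
```

The receiver no longer wastes CPU on polling, but every transfer now needs a second operation on a separate channel.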
-
Out-of-Band Signalling
Out-of-Band Signalling still involves the
processor to achieve application
synchronization
Adds to the overall transfer latency
Ex. Memory Channel
data transfer 2.9 us; acquire spin lock 120 us
Increases the expense of the NIC
-
History of SMM
Used to be extremely proprietary
DEC Memory Channel best known
Used a fixed shared memory region of 512 MB
divided into 64K pages, each page being 8 KB
Very versatile, can share pages between one or
more processes. Uses broadcast facilities
Average latencies 10-25 us
-
SCI - Scalable Coherent Interface
IEEE Standard 1596-1992
Uses high speed unidirectional links
Parallel links: 16 bits, 500 MHz (8 Gbit/s)
Serial G-Link technology (1 Gbit/s)
Packet-based transfer
header - 16 bytes; data = 0, 16, 64 or 256 bytes
queue and signal interrupts
-
SCI contd
Can do cache-coherency (optional)
Latency < 10 us
Modern cards use 64-bit, 66 MHz buses
(5.33 Gbit/s)
Big player: Dolphin Interconnect
Sun uses their boards to build megaservers
-
Processor Intensive Approach (PIA)
We offload networking by using a processor
on the NIC
Myrinet - best-known exponent
Full duplex data links 2 Gbit/s
Bus: 64-bit 133 MHz PCI-X bus
PC - 255 MHz RISC & Memory
-
Myrinet cont
Packet-based
Header, packet type, payload
Host Computer controls the NIC
runs an MCP program
Myrinet controls around 39 % of the cluster
market
-
Performance
Latency around 6.3 us
Climbs to over 100 us for messages over 10000 bytes
One way throughput 248 MB/s
Messages over 1000 bytes
Two way throughput 489 MB/s
Messages over 10000 bytes
Throughput between Unix processes on
different hosts
1.98 Gbit/s (uni), 3.9 Gbit/s (bi)
-
Comparing SCI and Myrinet
Latencies are about the same
SCI much faster for clusters of 8 or fewer
but slows exponentially as the number of PCs
increases
Myrinet is better for large systems > 64
Software appears more complete with Myrinet
-
Recent developments in Low Latency Systems
Collapsed LAN project (CLAN)
1997 - 2002, AT&T Laboratories-Cambridge
project originally centred around using fibre
technology throughout the building
remoting PCs; just have mouse, keyboard and
display in your office and put the PC in the server room
bought some SCI cards and got some systems
going
-
CLAN project
Faced the application synchronization
problem
Came up with a novel solution called
Tripwire
in-band synchronization
an event is signalled on the receiver when data is written to a special address in the data region
during the data transfer
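
The in-band idea can be modelled as a data region that checks each incoming write against a set of watched addresses. This is a toy sketch and not the CLAN implementation: the Region class and its methods are hypothetical, and the write method stands in for address matching that would be done by the NIC.

```python
import threading

class Region:
    """Shared data region whose writes are checked against tripwires."""

    def __init__(self, size):
        self.mem = bytearray(size)
        self.tripwires = {}   # watched offset -> Event to signal

    def set_tripwire(self, offset):
        event = threading.Event()
        self.tripwires[offset] = event
        return event

    def write(self, offset, data):
        # Stand-in for a write arriving from the network.
        self.mem[offset:offset + len(data)] = data
        for tw_offset, event in self.tripwires.items():
            if offset <= tw_offset < offset + len(data):
                event.set()   # the write touched a watched address

region = Region(64)
last_byte = region.set_tripwire(15)   # fire when byte 15 is written

region.write(0, b"12345678")          # bytes 0..7: tripwire not touched
fired_early = last_byte.is_set()
region.write(8, b"abcdefgh")          # bytes 8..15: tripwire fires
fired_late = last_byte.is_set()

print(fired_early, fired_late)
```

A receiver would block on the returned event, so the notification arrives as part of the data transfer itself with no second write to a separate channel.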
-
Tripwire
[Figure: processes and Tripwire]
-
CLAN Project
Applications can therefore set Tripwires and
be notified when they occur
no spinning, no extra hardware for out-of-band signalling
Latency:
DWORD - RTT = 3.7 us
1 KB IP transfer - 225 Mbit/s, RTT = 100 us
Throughput 910 Mbit/s on a 33 MHz, 32-bit bus
-
Will Low Latency ever make it into the Main Stream?
Some low latency 1 Gigabit/s NICs on the
market
Unfortunately 1 Gigabit/s market is now in
the commodity phase.
Real battle is shaping up in the 10 Gbit/s
market
CLAN project -> Level5Networks -> Solarflare