8/3/2019 Low Latency Networking

1/56

Low Latency Networking

Glenford Mapp

Digital Technology Group, Computer Laboratory

http://www.cl.cam.ac.uk/Research/DTG/~gem11


2/56

What is Latency?

The time taken to send a unit of data between two points in a network

A low latency network is a network in which the design of the hardware, systems and protocols is geared towards minimizing the time taken to move units of data between any two points on that network


3/56

Throughput

Number of bytes of data transferred per second between two points

Doesn't high throughput imply low latency?

Not necessarily

A bus vs a car travelling along a section of road

Which has the higher throughput?

Which has the lower latency?


4/56

Throughput vs Latency

In its simplest form, Throughput ~ C / Latency

C = instantaneous capacity: the number of units that are handled per operation

So if C is large you can get good throughput even if your latency is not low

Low latency does not necessarily imply high throughput if C also gets smaller

ATM is a good example
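
The relation Throughput ~ C / Latency can be tried out on the bus-versus-car question from the earlier slide. The sketch below is only illustrative: the passenger counts and trip times are made-up figures, not measurements, and the function name is hypothetical.

```python
def throughput(capacity_units, latency_s):
    """Units delivered per second when each operation moves
    `capacity_units` units and takes `latency_s` seconds."""
    return capacity_units / latency_s

# Assumed figures, chosen only to illustrate the relation.
bus = throughput(capacity_units=50, latency_s=1800)  # 50 passengers, 30 minute trip
car = throughput(capacity_units=4, latency_s=1200)   # 4 passengers, 20 minute trip

print(f"bus: {bus * 3600:.0f} passengers/hour")
print(f"car: {car * 3600:.0f} passengers/hour")
```

Under these assumptions the car has the lower latency, yet the bus has the higher throughput because its C is much larger.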


5/56

Throughput Claims

Look carefully at high throughput claims.

Have they decreased the latency?

Per unit operation is faster: Software -> Hardware (ATM)

Have they increased instantaneous capacity?

Serial -> Parallel - Parallel -> Serial

In most designs we have a mixture of both

Manufacturers will generally allow increased latency if capacity greatly increases


6/56

Who cares about latency?

Why is latency important?

Some applications are affected more by latency than by throughput

Voice

Also affected by jitter

Networked Games

Interactive sessions


7/56

Lessons from Computers

Consider the mainframe in the time-sharing era, 1963-1976

Studies showed that user productivity was reduced by half if the response time from the mainframe increased from 0.5 to 3 seconds

Mainframe optimised for throughput

Maximize the number of people using it

High throughput


8/56

Lessons from Computers

But the more people logged on, the slower the machine became, and by noon the response time would increase markedly, so user productivity would fall

Key factor in the development of PCs

Famous saying: "I love the Alto (first PC) because it does not run faster at night!"


9/56

A look at the Internet

Not really designed for low latency

Designed to be adaptable and robust

But the new applications we want the Internet to support need low latency

Web servers

Voice over IP

Networked Games, etc


10/56

Components of Network Latency

Hardware

Different hardware capacities and limitations

Ethernet: variable packet size; max 1500 bytes

ATM: uses fixed cells of 53 bytes

Network Routers and Switches

Queueing strategies

Overload / congestion strategy


11/56

Components of Network Latency

System Latency

Moving the packet between the application and the network interface

OS latency

The operating system handling the packet

Application Latency

Application must acquire resources (e.g. CPU) in order to send or consume data


12/56

Traditional Networking: A closer look

Look at a packet being received by the host machine and delivered up to the application

At the lowest level, the packet enters the network interface card (NIC) and ends up in a buffer or FIFO on the card. The card generates an interrupt.


13/56

Traditional Networking contd

Interrupt handler runs; data is moved into a system buffer in main memory.

Packet is placed on a receive queue

In Linux there is one network receive queue

Packets from all the network interfaces are placed on that queue

Packet is marked for system processing

Interrupt Handler ends


14/56

Traditional Networking contd

System processing

Packet is taken up the protocol stack

IP processing; TCP processing

Connection information associated with the packet is used to find the corresponding socket

Socket ~ Src (IP addr, TCP port), Dest (IP addr, TCP port)


15/56

Traditional Networking contd

Queue the packet on the socket structure and see if any application threads are waiting for incoming data

If so, copy the data from the system buffer to the user buffer and wake up the thread

Application has to wait until it gets the CPU to consume data
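
The receive path described over the last few slides can be sketched as a toy simulation. This is not kernel code: the `Socket` class, the dictionary-based packets and all function names are hypothetical, and protocol processing is reduced to a lookup on the (source, destination) address and port pairs.

```python
from collections import deque

class Socket:
    def __init__(self):
        self.queue = deque()      # packets queued on the socket structure
        self.waiting = False      # is an application thread blocked on receive?
        self.user_buffer = None   # where the application sees its data

receive_queue = deque()           # the single system-wide receive queue
sockets = {}                      # (src_ip, src_port, dst_ip, dst_port) -> Socket

def interrupt_handler(nic_fifo):
    """Move packets from the NIC FIFO into system buffers and queue them."""
    while nic_fifo:
        system_buffer = dict(nic_fifo.popleft())   # copy 1: NIC -> main memory
        receive_queue.append(system_buffer)

def system_processing():
    """Take each packet up the stack and hand it to its socket."""
    while receive_queue:
        pkt = receive_queue.popleft()
        key = (pkt["src_ip"], pkt["src_port"], pkt["dst_ip"], pkt["dst_port"])
        sock = sockets.get(key)
        if sock is None:
            continue                               # no matching connection
        sock.queue.append(pkt)
        if sock.waiting:
            # copy 2: system buffer -> user buffer, then wake the thread
            sock.user_buffer = bytes(sock.queue.popleft()["payload"])
            sock.waiting = False

key = ("10.0.0.1", 40000, "10.0.0.2", 80)
sockets[key] = Socket()
sockets[key].waiting = True
nic_fifo = deque([{"src_ip": "10.0.0.1", "src_port": 40000,
                   "dst_ip": "10.0.0.2", "dst_port": 80,
                   "payload": b"hello"}])
interrupt_handler(nic_fifo)
system_processing()
print(sockets[key].user_buffer)
```

Even in this simplified form the data is copied twice and passes through a queue shared by every application before the receiving thread can run.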


16/56


17/56

[Figure: the traditional receive path, showing the application layer, socket interface, socket layer in the OS, system buffers, NIC and network]


18/56

Cross Talk Issues

Interrupt level

While an application is running on the processor, network interrupts occur on incoming packets for other processes.

Protocol level

Packets for all applications are multiplexed and de-multiplexed in the kernel

Application level

All applications must share resources, so sometimes I must wait a long time before I get the processor.


19/56

Some ways to improve Traditional Networking

User level network interfaces

UNET - Matt Welsh (1995-1998)

Zero copy architectures

Virtual memory mapping techniques

Vertical Partitioning of Operating Systems


20/56

UNET

Application has an interface to talk directly to a network device

Doesn't involve the kernel in things like protocol processing, etc.

Uses per-application message queues to send and receive data

Novel idea at the time

Complicates what applications need to do
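
A rough sketch of per-application message queues is given below. It is a toy model rather than the actual UNET interface: the `Endpoint` class, its method names, the buffer sizes and the `wire` function standing in for the NIC are all assumptions made for illustration.

```python
from collections import deque

class Endpoint:
    """Toy per-application endpoint: a communication segment plus send,
    receive and free queues holding buffer offsets or (offset, length)."""

    def __init__(self, segment_size=4096, buf_size=1024):
        self.segment = bytearray(segment_size)
        self.buf_size = buf_size
        self.send_q = deque()
        self.recv_q = deque()
        # Every fixed-size buffer in the segment starts out free.
        self.free_q = deque(range(0, segment_size, buf_size))

    def post_send(self, data):
        """Place data (assumed to fit in one buffer) on the send queue."""
        offset = self.free_q.popleft()
        self.segment[offset:offset + len(data)] = data
        self.send_q.append((offset, len(data)))

    def poll_recv(self):
        """Return the next received message, or None if there is none."""
        if not self.recv_q:
            return None
        offset, length = self.recv_q.popleft()
        data = bytes(self.segment[offset:offset + length])
        self.free_q.append(offset)               # hand the buffer back
        return data

def wire(src, dst):
    """Stand-in for the NIC: deliver every queued send from src to dst."""
    while src.send_q:
        s_off, length = src.send_q.popleft()
        d_off = dst.free_q.popleft()
        dst.segment[d_off:d_off + length] = src.segment[s_off:s_off + length]
        dst.recv_q.append((d_off, length))
        src.free_q.append(s_off)

a, b = Endpoint(), Endpoint()
a.post_send(b"ping")
wire(a, b)
print(b.poll_recv())
```

The application manages its own buffers and queues here, which is the extra work the last bullet refers to.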


21/56

[Figure: UNET endpoint, with a communication segment and send, free and receive queues]


22/56

Zero-Copy Architecture

No need to copy data up to the application

DMA from network buffers on the NIC straight into system buffers

Use VM techniques to map the relevant system buffers into the address space of the application
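
The difference between copying data up to the application and giving it a mapping onto the system buffer can be imitated in user space. In this sketch a `bytearray` stands in for a DMA-filled system buffer and a `memoryview` stands in for the VM mapping; real zero-copy stacks use page mappings, so this is only an analogy.

```python
system_buffer = bytearray(b"HDR:payload-bytes")   # filled by (simulated) DMA

# Copying path: the application receives a private copy of the payload.
copied = bytes(system_buffer[4:])

# Zero-copy path: the application receives a view onto the same memory.
mapped = memoryview(system_buffer)[4:]

# A later same-length update to the underlying buffer.
system_buffer[4:11] = b"PAYLOAD"

print(copied)          # the copy does not see the update
print(bytes(mapped))   # the view does, because no data was duplicated
```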


23/56

Vertical Partitioning of the OS

UNET gave applications an abstract network card, so there was less multiplexing of data.

Why not go all the way and do more partitioning of OS resources?

So the CPU is carefully partitioned; file systems and disk devices are also carefully partitioned


24/56

Pegasus project - Cambridge

Studied system support for multimedia applications

Developed a new operating system called Nemesis which adopted a vertical approach

Most of the operating system functions were in shared libraries which executed in the user's process space

System-wide page table, so no copying


25/56

Vertical Approach

[Figure: the vertical approach, with processes and shared libraries, alongside a normal OS]


26/56

Why haven't these ideas been universally implemented?

Some were explored

VIA is a hardware idea based on UNET

Replace PCI bus

Devices have receive, send and completion queues and are connected along a high-speed serial bus

One or two products out there, but it fell out of favour

Infiniband - now popular extension of VIA


27/56

Ideas not universal

Zero copy and VM ideas explored in some operating systems, e.g. the Spring OS by Sun.

Some ideas made their way into Solaris.

Windows 2000 and XP, via Mach and NT

Nemesis was too radical for prime time

QoS ideas have been taken up by others


28/56

But the real reason was...

That processor and network speeds have been increasing fast enough to keep traditional networking in the picture.

If you simply want to browse the Web and read email, then it is OK

However, there is a looming problem


29/56

Network speeds still going up!

We have gone from 10 Mbit/s in 1987 to 10 Gbit/s in 2004 and beyond.

Processor will not be able to keep up

Interrupt rate is phenomenal

Buses like the PCI bus cannot keep up

Move to PCI Express (switch fabric)

Workstation can presently saturate the network but the tide is rapidly turning!

Network traffic will soon be able to cripple your PC
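
To put a rough number on the interrupt rate, the sketch below works out how many full-size packets per second arrive at each link speed. It assumes 1500-byte packets, one interrupt per packet, and ignores framing overhead, so the figures are order-of-magnitude estimates only.

```python
def packets_per_second(link_bits_per_s, packet_bytes):
    """Packets arriving per second on a fully loaded link."""
    return link_bits_per_s / (packet_bytes * 8)

for name, rate in [("10 Mbit/s", 10e6), ("1 Gbit/s", 1e9), ("10 Gbit/s", 10e9)]:
    pps = packets_per_second(rate, 1500)
    print(f"{name}: {pps:,.0f} packets/s, {1e6 / pps:.2f} us per packet")
```

At 10 Gbit/s this leaves a little over a microsecond per packet for the interrupt handler, protocol processing and the application combined.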


30/56


31/56

Shared Memory Model

Data transfer is accomplished by writing to memory addresses in the local address space of the process

This data is captured by the local network card and serialized into packets which are transferred over the network to the remote machine, which writes the data to remote addresses.


32/56

How does it actually work?

A region of the local address space of the process is mapped to an IO region on the card.

That mapping is usually made using standard memory-mapping techniques.

In Unix the mmap call is used.

Same thing is done on the remote side
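
The mapping step can be illustrated with Python's standard `mmap` module. A real shared-memory NIC would expose its IO region through a device file; since none is available here, a temporary file stands in for the card, which is purely an assumption of this sketch.

```python
import mmap
import tempfile

REGION_SIZE = mmap.PAGESIZE

with tempfile.TemporaryFile() as backing:        # stand-in for the card's IO region
    backing.truncate(REGION_SIZE)                # give the region its size
    region = mmap.mmap(backing.fileno(), REGION_SIZE)

    region[0:5] = b"hello"                       # a plain memory write, no send call
    region.flush()

    backing.seek(0)
    seen_by_card = backing.read(5)               # what the other side of the mapping holds
    region.close()

print(seen_by_card)
```

Once the mapping exists, sending data is just a store into the process's own address space.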


33/56

Shared Memory Model

[Figure: process VM on each host mapped to its NIC, with packets passing between the two NICs]


34/56

How is the association between the local and remote regions made?

Fixed

In early SMMs, it was fixed.

All processors on the network share the same region.

Flexible

Needs a communications channel to set up the mapping between regions


35/56

Fixed SMM

[Figure: process VM space for Proc A, Proc B, Proc C and Proc D]


36/56

Dynamic SMM

[Figure: process VM space for Proc A, Proc B, Proc C and Proc D]


37/56

SMM

Been around a long time

Used to communicate between processors in a cluster.

The SMM is divided into pages, some of which can be mapped between two processes and the rest of which can be mapped globally


38/56

Problems with SMM

Since no interrupts are involved and the OS is no longer in the loop, it's hard to inform the remote node that data has been sent and is waiting to be read

Major problem is therefore not the transfer, but application synchronization


39/56

Application Synchronization Solutions

Polling:

The receiver keeps polling certain addresses to see if a data transfer has occurred

This is expensive (wasting local CPU) and only relevant if there is a real chance of a data transfer.

Could be used to provide a form of distributed synchronization - spinning on a remote address
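
Polling can be sketched with two threads sharing a buffer. The `bytearray` stands in for the mapped region and the layout (a flag byte followed by data) is a hypothetical convention chosen for this example.

```python
import threading
import time

shared = bytearray(16)        # stand-in for the mapped region
FLAG, DATA = 0, 1             # byte 0 is a "data ready" flag, data follows

def sender():
    time.sleep(0.01)
    shared[DATA:DATA + 5] = b"hello"   # write the data first
    shared[FLAG] = 1                   # then set the flag

def receiver_poll():
    spins = 0
    while shared[FLAG] == 0:           # burns CPU until the flag changes
        spins += 1
    return bytes(shared[DATA:DATA + 5]), spins

t = threading.Thread(target=sender)
t.start()
data, spins = receiver_poll()
t.join()
print(data, "after", spins, "polls")
```

The poll count printed at the end is the wasted CPU work the slide refers to; it varies from run to run.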


40/56

Application Synchronization Solutions

VM signalling

Page faults or access violations

Example: a page is only mapped locally when there is data to be read. If I access the page when there is no data, then a page fault occurs and I am blocked until the owner writes to the page


41/56

VM Signalling

If I wish to read and there is data to be read, then the page is mapped into my address space read-only.

If I attempt to write to the page, a page fault occurs and I am blocked until I can acquire the write lock for the page

Not scalable, too closely coupled to the VM system


42/56

Out-of-Band Signalling

Use a separate channel outside the data transfer region to signal that data has been transferred.

For example, writing to a special set of addresses would cause an interrupt to be generated at the remote end


43/56

Out-of-Band Signalling

So you would transfer the data by writing to your local address

Afterwards you would write to a special address associated with that memory region

An interrupt occurs on the other side and the OS works out which buffer you are referring to and wakes up the waiting process
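
The two-step sequence above (write the data, then write to a special address) can be modelled with a separate signalling object. In this sketch a `threading.Event` stands in for the remote interrupt and the OS wake-up; the names `region` and `doorbell` are hypothetical.

```python
import threading

region = bytearray(64)             # the data transfer region
doorbell = threading.Event()       # stand-in for the remote interrupt
result = {}

def receiver():
    doorbell.wait()                # blocked rather than spinning
    result["data"] = bytes(region[0:5])

def sender():
    region[0:5] = b"hello"         # 1. transfer the data
    doorbell.set()                 # 2. write to the "special address"

r = threading.Thread(target=receiver)
r.start()
sender()
r.join()
print(result["data"])
```

The receiver uses no CPU while waiting, but the signal travels on a path separate from the data.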


44/56

Out-of-Band Signalling

Out-of-Band Signalling still involves the processor to achieve application synchronization

Adds to the overall transfer latency

Ex. Memory Channel

data transfer 2.9 us; acquire spin lock 120 us

Increases the expense of the NIC


45/56

History of SMM

Used to be extremely proprietary

DEC Memory Channel best known

Used a fixed shared memory region of 512 MB divided into 64K pages, each page being 8 KB

Very versatile, can share pages between one or more processes.

Uses broadcast facilities

Average latencies 10-25 us


46/56

SCI - Scalable Coherent Interface

IEEE Standard 1596-1992

Uses high speed unidirectional links

Parallel links: 16 bits, 500 MHz (8 Gbit/s)

Serial G-Link technology (1 Gbit/s)

Packet-based transfer

header - 16 bytes; data = 0, 16, 64 or 256 bytes

queue and signal interrupts


47/56

SCI contd

Can do cache-coherency (optional)

Latency < 10 us

Modern cards use 64-bit, 66 MHz buses (5.33 Gbit/s)

Big player: Dolphin Interconnect

Sun uses their boards to build megaservers


48/56

Processor Intensive Approach (PIA)

We offload networking by using a processor on the NIC

Myrinet - most well-known exponent

Full duplex data links: 2 Gbit/s

Bus: 64-bit 133 MHz PCI-X bus

PC - 255 MHz RISC & Memory


49/56

Myrinet contd

Packet-based

Header, packet type, payload

Host Computer controls the NIC

runs a MCP program

Myrinet controls around 39 % of the cluster

market


50/56

Performance

Latency around 6.3 us

Climbs to over 100 us for messages over 10000 bytes

One way throughput 248 MB/s

Messages over 1000 bytes

Two way throughput 489 MB/s

Messages over 10000 bytes

Throughput between Unix processes on different hosts

1.98 Gbit/s (uni); 3.9 Gbit/s (bi)
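
The MB/s and Gbit/s figures on this slide describe the same measurements in different units. A quick check, assuming decimal units (1 MB = 10^6 bytes), is sketched below; the function name is hypothetical.

```python
def mb_per_s_to_gbit_per_s(mb_per_s):
    """Convert megabytes per second to gigabits per second (decimal units)."""
    return mb_per_s * 8 / 1000

one_way = mb_per_s_to_gbit_per_s(248)
two_way = mb_per_s_to_gbit_per_s(489)
print(f"{one_way:.2f} Gbit/s one way, {two_way:.2f} Gbit/s two way")
```

Both results are close to the capacity of the 2 Gbit/s full duplex links quoted earlier.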


51/56

Comparing SCI and Myrinet

Latencies are about the same

SCI much faster for clusters of 8 or less

but slows exponentially as the number of PCs increases

Myrinet is better for large systems > 64

Software appears more complete with Myrinet


52/56

Recent developments in Low Latency Systems

Collapsed LAN project (CLAN)

1997 - 2002, AT&T Laboratories-Cambridge

project originally centred around using fibre technology throughout the building

remoting PCs; just have mouse, keyboard and display in your office and put the PC in the server room

bought some SCI cards and got some systems going


53/56

CLAN project

Faced the application synchronization problem

Came up with a novel solution called Tripwire

in-band synchronization

an event is signalled on the receiver when data is written to a special address in the data region during the data transfer
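
The Tripwire idea can be sketched as a data region that watches the addresses being written. This is a software toy, not the CLAN hardware: the `TripwireRegion` class, its methods and the use of `threading.Event` for notification are all assumptions for illustration.

```python
import threading

class TripwireRegion:
    """Toy data region in which writes to chosen offsets raise an event."""

    def __init__(self, size):
        self.mem = bytearray(size)
        self.tripwires = {}                      # offset -> threading.Event

    def set_tripwire(self, offset):
        """Ask to be notified when `offset` is written."""
        event = threading.Event()
        self.tripwires[offset] = event
        return event

    def remote_write(self, offset, data):
        """Apply an incoming write and fire any tripwire it touches."""
        self.mem[offset:offset + len(data)] = data
        for wire_offset, event in self.tripwires.items():
            if offset <= wire_offset < offset + len(data):
                event.set()

region = TripwireRegion(64)
arrived = region.set_tripwire(offset=4)          # last byte of a 5-byte message

region.remote_write(0, b"hell")                  # does not touch offset 4
print(arrived.is_set())
region.remote_write(4, b"o")                     # touches the tripwire
print(arrived.is_set(), bytes(region.mem[0:5]))
```

The notification is triggered by the data write itself, so no second write to a separate signalling address is needed.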


54/56

Tripwire

[Figure: Tripwire and processes]


55/56

CLAN Project

Applications can therefore set Tripwires and be notified when they occur

no spinning, no extra hardware for out-of-band signalling

Latency:

DWORD - RTT = 3.7 us

1 KB IP transfer - 225 Mbit/s, RTT = 100 us

Throughput 910 Mbit/s

33 MHz, 32-bit bus


56/56

Will Low Latency ever make it into the Mainstream?

Some low latency 1 Gigabit/s NICs on the market

Unfortunately the 1 Gigabit/s market is now in the commodity phase.

Real battle is shaping up in the 10 Gbit/s market

CLAN project -> Level 5 Networks -> Solarflare
