
Time measurement of network data transfer

R. Fantechi, G. Lamanna, 25/5/2011

Outline

• Motivations

• Hardware setup

• Software tools

• Measurements and their (possible) interpretation

• Prospects

Motivations

• Network transfers to L1 and L2 need low latency
– For both TEL62-PC and PC-PC transfers, do we know how large it is?
– For which network protocol is it the best?
– How does it depend on the computer HW?
– How does it depend on the network interface?
– How large are the latency fluctuations? GPUs are sensitive…
– Knowledge of the fluctuations is important to stay within the 1 ms budget
• Standard software monitoring tools give only averages
• Try to use hardware signals, generated at strategic points inside the software
• Correlate signals from the sender with those from the receiver

Hardware setup

• Two PCs with GbE interfaces
– A is a Pentium 4, 2.4 GHz
• Called PCATE
– B is a 2*4-core Xeon
• Called PCGPU
– Direct Ethernet connection on a hidden network
– Each PC is equipped with a parallel-port interface
• It is used to generate timing pulses (a minimal sketch of the pulse generation is given below)
• LeCroy scope
– Time measurements
– Histograms
– Storage of screenshots

[Setup photos: PCATE, PCGPU, and the adapter for the parallel port]
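To make the timing-pulse mechanism concrete before the code slides, here is a minimal sketch, assuming Linux and the legacy parallel-port data register at 0x378 (the address used in the code later on); the ioperm() call and the surrounding scaffolding are assumptions, not taken from the presentation.

#include <stdio.h>
#include <stdlib.h>
#include <sys/io.h>                     /* ioperm(), outb() on x86 Linux */

#define LPT_DATA 0x378                  /* parallel-port data register, as in the code slides */

static void timing_pulse(unsigned char pattern)
{
    outb(pattern, LPT_DATA);            /* raise the chosen data line(s) */
    outb(0x00, LPT_DATA);               /* drop them again: the scope triggers on the edge */
}

int main(void)
{
    if (ioperm(LPT_DATA, 3, 1) < 0) {   /* needs root privileges to get I/O port access */
        perror("ioperm");
        return EXIT_FAILURE;
    }
    timing_pulse(0x01);                 /* same pattern the receiver writes on packet arrival */
    return EXIT_SUCCESS;
}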

Software tools

• Investigate three “protocols”
– Raw Ethernet packets (socket PF_PACKET, SOCK_RAW); a sketch of this variant follows the code examples below
– IP packets (socket PF_INET, SOCK_RAW)
– TCP packets (socket PF_INET, SOCK_STREAM)
• Three pairs of simple senders/receivers
– The sender
• Gets from the command line the packet size, number of packets, delay between packets, and downscaling factor (see later)
• Initializes the socket and goes into a tight loop, with a delay inside
• Inside the loop, before and after the send command, writes a pulse on the parallel port
– The receiver
• After initialization, goes into a receive loop and writes a pulse on the parallel port after having received a packet

Code example - Sender

/* Create raw socket */
sock = socket(AF_INET, SOCK_RAW, PROBE_PROT);
if (sock < 0) {
    perror("opening raw socket");
    exit(1);
}
………………………….
if (iloop < 0) iloop = 1000000000;
for (i = 0; i < iloop; i++) {
    /* every 50th packet carries the pattern the receiver will write on its parallel port */
    if (i % 50 == 0) {
        buf[0] = 0x01;
        out = 0x01; outb(out, 0x378);       /* pulse before the send (downscaled packets only) */
        out = 0x00; outb(out, 0x378);
    } else
        buf[0] = 0x00;
    if (sendto(sock, buf, buflen, 0, (struct sockaddr *)&server, sizeof(struct sockaddr_in)) < 0)
        perror("writing on raw socket");
    out = 0x02; outb(out, 0x378);           /* pulse after the send */
    out = 0x00; outb(out, 0x378);
    for (k = 0; k < conv_time; k++);        /* software delay loop between packets */
}

Code example - Receiver

/* Create raw socket */
sock = socket(AF_INET, SOCK_RAW, PROBE_PROT);
if (sock < 0) {
    perror("opening raw socket");
    exit(1);
}
………………….
int kk = 0;
serv_size = sizeof(server);
do {
    if ((rval = recvfrom(sock, buf, BUFFER_SIZE, 0, (struct sockaddr *)&server, &serv_size)) < 0)
        perror("reading raw socket");
    i = 0;
    if (rval == 0)
        printf("Ending connection\n");
    else {
        if (rval == BUFFER_SIZE) {
            outb(0x01, 0x378);              /* send a pulse after a full-size packet is received */
            outb(0x00, 0x378);
        }
        printf("-->%d\n", rval);
    }
} while (rval != 0);
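The code above covers only the PF_INET raw-socket case. As a hedged sketch of the first “protocol” listed earlier (raw Ethernet packets over PF_PACKET, SOCK_RAW), the socket could be opened and bound to one interface as below; the interface name "eth0" and the EtherType 0x88B5 are illustrative assumptions, not values from the presentation.

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/if_packet.h>            /* struct sockaddr_ll */
#include <net/if.h>                     /* if_nametoindex() */
#include <arpa/inet.h>                  /* htons() */

int open_raw_eth(const char *ifname, unsigned short ethertype)
{
    int sock = socket(PF_PACKET, SOCK_RAW, htons(ethertype));
    if (sock < 0) {
        perror("opening PF_PACKET socket");
        return -1;
    }

    /* bind to a single interface so only frames from the hidden network are seen */
    struct sockaddr_ll sll;
    memset(&sll, 0, sizeof(sll));
    sll.sll_family   = AF_PACKET;
    sll.sll_protocol = htons(ethertype);
    sll.sll_ifindex  = if_nametoindex(ifname);
    if (bind(sock, (struct sockaddr *)&sll, sizeof(sll)) < 0) {
        perror("binding PF_PACKET socket");
        return -1;
    }
    return sock;
}

/* e.g. int sock = open_raw_eth("eth0", 0x88B5);  (0x88B5 is a local/experimental EtherType) */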

Software tools

• Maximum rate
– On the sender, some time is spent on code execution
– The minimum achievable repetition period between packets varies from ~6 μs to ~10 μs
• Depending on machine speed, type of protocol, etc.
• Downscaling factor
– Needed to operate the scope properly at high rates
• If the loop index modulo the downscaling factor is 0, the packet carries the pattern to be written by the receiver on the parallel port, otherwise 0
• Packets are sent at the specified rate, but the scope registers only a fraction of them
• Additional tools used
– Wireshark and tcpdump to check packet arrival
– ifconfig and /proc/interrupts to count packet and interrupt losses

Basic method check

• Are these pulses reliable?
– A simple check: histogram the width of the pulse generated by the sender
– Pulse width: ~1.22 μs, sdev 0.04 μs; watch out for the maximum

Parameters used in the tests

• Packet size
– Small packets (200 bytes) or large packets (1300 bytes)
• Protocols
– The 3 mentioned before
• Delay between packets
– Usually from 10 ms down to the minimum (implemented in the sender as the conv_time busy loop; a calibration sketch follows below)
– Typical sequence: 10, 5, 2, 1 ms, then 100, 50, 20, 10 μs
• Measurements
– Store interesting screenshots
– Record time difference, sigma, max value
• Time difference = time of rx pulse - time of tx pulse
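The delay between packets is produced by the sender's busy loop for (k = 0; k < conv_time; k++);. The slides do not say how conv_time is chosen; below is a minimal sketch, assuming a one-off calibration with clock_gettime() on Linux, of how loop iterations could be converted into microseconds. The function name and probe size are illustrative.

#include <stdio.h>
#include <time.h>

static volatile long sink;                   /* keeps the calibration loop from being optimized away */

long iterations_per_us(void)
{
    const long probe = 10000000;             /* trial iteration count */
    struct timespec t0, t1;
    long k;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (k = 0; k < probe; k++)
        sink = k;                            /* same kind of empty work as the sender's delay loop */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double us = (t1.tv_sec - t0.tv_sec) * 1e6 +
                (t1.tv_nsec - t0.tv_nsec) / 1e3;
    return (long)(probe / us);               /* iterations needed for a 1 us delay */
}

int main(void)
{
    long per_us = iterations_per_us();
    printf("conv_time for a 100 us inter-packet delay: %ld\n", 100 * per_us);
    return 0;
}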

Lost packets and interrupts

• No lost packets observed at any rate
– Checked with ifconfig at source and destination
• Interrupt behaviour via /proc/interrupts (a sketch of reading these counters is given below)
– At high rates the number of interrupts decreases
• Well-known phenomenon of “interrupt coalescence” in the driver
• Packets received too fast are buffered and the CPU is interrupted only once
• For TCP at high rates and 200-byte buffers, interrupts are also reduced because TCP packs many user buffers into one Ethernet packet
• Anyway, measuring TCP performance is more difficult, as the protocol is free to segment user buffers as it likes (i.e. flow control)
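The interrupt counts discussed here were read from /proc/interrupts; a small sketch of how the NIC's counter could be sampled programmatically around a test run is given below. The interrupt-line name "eth0" is an assumption, and the code simply adds up every numeric field on the matching line.

#include <stdio.h>
#include <string.h>

/* Sum the numeric fields on the /proc/interrupts line whose name contains irq_name. */
long nic_interrupts(const char *irq_name)
{
    FILE *f = fopen("/proc/interrupts", "r");
    char line[512];
    long total = -1;

    if (!f) {
        perror("/proc/interrupts");
        return -1;
    }
    while (fgets(line, sizeof(line), f)) {
        if (!strstr(line, irq_name))
            continue;
        char *p = strchr(line, ':');         /* skip the leading "NN:" IRQ number */
        if (!p)
            break;
        p++;
        total = 0;
        long n;
        int used;
        while (sscanf(p, "%ld%n", &n, &used) == 1) {   /* stops at the controller/device names */
            total += n;
            p += used;
        }
        break;
    }
    fclose(f);
    return total;
}

/* e.g. call nic_interrupts("eth0") before and after a run and compare the difference
   with the number of packets sent. */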

RX interrupts - PCGPU: interrupt coalescence
Two examples, at 15 μs (left) and 12 μs (right); 1300 bytes, PCATE->PCGPU

CPU usage (screenshots for sender and receiver)

Time across sendto

Time difference between the pulse after sendto and the one before; both pulses come from the same machine

Time across sendto - Fluctuations

Count how many times the time exceeds 20 μs (with respect to all measured times):
Raw: ~5/26000
IP: ~13/26000
TCP: min ~8/20000 (at 1 ms), max ~402/20000 (at 100 μs) for 1300 bytes; 18/26000 for 200 bytes

Quiet example vs. moving the mouse, on PCATE as sender: the over-threshold count grows from 15 to ~4500

Transfer time

As a function of time, for different buffer sizes; the critical zone is indicated

Transfer time

As a function of packet size, for different delay settings, PCATE->PCGPU

Transfer time

PCATE -> PCGPU, raw, 1300 bytes; screenshots at delays of 5 ms, 2 ms, 1 ms, 500 μs, 200 μs, 100 μs

Transfer time

PCGPU->PCATE; 200 bytes and 1300 bytes; 5 ms and ~8 ms

Transfer time trending

PCGPU->PCATE, raw:
200 bytes at 50 μs, 1000 bytes at 50 μs, 1300 bytes at 40 μs
200 bytes at 20 μs, 1000 bytes at 20 μs, 1300 bytes at 20 μs

Summary

• Hardware timing system
– Reliable, not interfering with the measurement (at the level of at most 10 μs)
• Time spent in the sender
– A fraction (<10%) of the total transfer time
– Varies with the protocol type
– Stable with the packet rate
• Transfer time
– Down to 50 μs it varies little as a function of packet rate
• Between 50 and 120 μs
– Below 20 μs it increases (up to 2 ms) for raw, but not for IP
• This setup is not working below ~10 μs
– Where we are most interested

To be done

• Complete the measurements
– Both directions
– All protocols (TCP, maybe new ones)
• Performance as a function of CPU power
– Use different PCs
– Add load on the machines
• Test multiple interfaces and switches
• Change the sender to an object driven by an FPGA
– TEL62 or TALK
• Investigate different protocol features
– New protocols or switch features of the old ones
• Test more complex transfer software (i.e. TDBIO)
• Some work hopefully to be done by US summer students…