TRANSCRIPT
vSnoop: Improving TCP Throughput in Virtualized Environments
via Acknowledgement Offload
Ardalan Kangarlou, Sahan Gamage, Ramana Kompella, Dongyan Xu
Department of Computer Science, Purdue University
Cloud Computing and HPC
Background and Motivation
Virtualization: a key enabler of cloud computing (Amazon EC2, Eucalyptus)
Increasingly adopted in other real systems:
High performance computing: NERSC's Magellan system
Grid/cyberinfrastructure computing: In-VIGO, Nimbus, Virtuoso
Multiple VMs hosted by one physical host; multiple VMs sharing the same core
Flexibility, scalability, and economy
VM Consolidation: A Common Practice
[Figure: several VMs (VM1-VM4) consolidated on one physical host above a virtualization layer, with an external sender communicating with one of the VMs]
Key Observation: VM consolidation negatively impacts network performance!
Investigating the Problem
Q1: How does CPU sharing affect RTT?
[Figure: a client measuring RTT to a server hosting VMs 1-3; RTT (ms) plotted against the number of consolidated VMs (2 to 5), with wide-area RTTs (US East – West, US East – Europe, US West – Australia) shown for comparison]
RTT increases in proportion to the VM scheduling slice (30ms)
Q2: What is the Cause of the RTT Increase?
[Figure: packets from the sender pass through the device driver and the driver domain (dom0) and wait in per-VM buffers (buf) until the target VM (VM1-VM3) receives its 30ms scheduling slice]
VM scheduling latency dominates virtualization overhead!
[Figure: CDF comparing per-packet dom0 processing time and wait time in the VM's buffer]
Connection to the VM is much slower than to dom0!
Q3: What is the Impact on TCP Throughput?
[Figure: CDF of TCP throughput for connections terminating at dom0 vs. at a VM]
Our Solution: vSnoop
Alleviates the negative effect of VM scheduling on TCP throughput
Implemented within the driver domain to accelerate TCP connections
Does not require any modifications to the VM
Does not violate end-to-end TCP semantics
Applicable across a wide range of VMMs: Xen, VMware, KVM, etc.
TCP Connection to a VM
Sender establishes a TCP connection to VM1
[Figure: handshake timeline between the sender, VM1's buffer in the driver domain, and VM1; the SYN waits in the buffer while VM2 and VM3 consume their scheduling slices, so the SYN,ACK and every subsequent round trip include the VM scheduling latency in the observed RTT]
Key Idea: Acknowledgement Offload
[Figure: the same timeline with vSnoop; the driver domain acknowledges the SYN and subsequent in-order packets on VM1's behalf as soon as they reach the shared buffer, instead of waiting for VM1 to be scheduled]
Faster progress during TCP slow start
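To make the offload step concrete, here is a minimal, self-contained C sketch of the per-packet decision and early-ACK construction. It is only an illustration of the idea: the struct layout, field names, and function name are assumptions, and the real vSnoop operates on actual packet buffers inside dom0's receive path rather than on these toy structures.

```c
/*
 * Illustrative sketch of acknowledgement offload (not the dom0 code):
 * for an in-order TCP segment destined to a VM, acknowledge it on the
 * VM's behalf, provided the per-VM buffer can hold the segment.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct flow_state {
    uint32_t expected_seq;   /* next in-order sequence number from the sender */
    uint32_t last_acked;     /* highest ACK sent on the VM's behalf */
    uint32_t buf_free;       /* free space in the per-VM buffer (bytes) */
};

struct segment {
    uint32_t seq;            /* sequence number of the incoming segment */
    uint32_t len;            /* payload length (bytes) */
};

struct ack {
    uint32_t ack_seq;        /* cumulative acknowledgement number */
    uint16_t window;         /* advertised receive window */
};

/*
 * Returns true and fills 'out' if the segment can be acknowledged early;
 * returns false if the packet should simply be queued for the VM
 * (out-of-order packet or no buffer space).
 */
bool vsnoop_early_ack(struct flow_state *fs, const struct segment *seg,
                      struct ack *out)
{
    if (seg->seq != fs->expected_seq)   /* out-of-order: let the VM handle it */
        return false;
    if (seg->len > fs->buf_free)        /* no room: do not acknowledge */
        return false;

    fs->buf_free     -= seg->len;       /* segment is stored in the VM's buffer */
    fs->expected_seq += seg->len;
    fs->last_acked    = fs->expected_seq;

    out->ack_seq = fs->last_acked;
    /* Never advertise more than the remaining buffer space. */
    out->window  = (uint16_t)(fs->buf_free > 65535 ? 65535 : fs->buf_free);
    return true;
}

int main(void)
{
    struct flow_state fs = { .expected_seq = 1000, .last_acked = 1000,
                             .buf_free = 32768 };
    struct segment seg = { .seq = 1000, .len = 1448 };
    struct ack a;

    if (vsnoop_early_ack(&fs, &seg, &a))
        printf("early ACK: ack=%u win=%u\n", (unsigned)a.ack_seq,
               (unsigned)a.window);
    return 0;
}
```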
vSnoop’s Impact on TCP Flows
TCP slow start: early acknowledgements help connections progress faster; the most significant benefit is for short transfers, which are prevalent in data centers [Kandula IMC'09], [Benson WREN'09] (see the model sketched after this list)
TCP congestion avoidance and fast retransmit: large flows in the steady state can also benefit from vSnoop, though not as much as during slow start
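To see why short transfers gain the most, a rough back-of-the-envelope model helps: during slow start the congestion window roughly doubles every round trip, so a short transfer spends only a handful of round trips in slow start, and every round trip that has to wait for the receiving VM to be scheduled pays tens of milliseconds. The sketch below encodes that model; the MSS, initial window, base RTT, and scheduling delay are assumed values for illustration, not measurements from this work.

```c
/*
 * Back-of-the-envelope slow-start model (assumed parameters, illustration
 * only): cwnd starts at one segment and doubles every round trip until the
 * whole transfer has been sent.  With early ACKs from the driver domain a
 * round trip costs roughly the network RTT; without them it also pays the
 * VM scheduling delay.
 */
#include <stdio.h>

int main(void)
{
    const double mss      = 1448.0;  /* assumed TCP payload per segment (bytes) */
    const double base_rtt = 0.2;     /* assumed LAN round-trip time (ms) */
    const double sched    = 30.0;    /* assumed per-RTT VM scheduling delay (ms) */
    const double transfer = 100e3;   /* a 100KB transfer, as in the evaluation */

    double sent = 0.0, cwnd = 1.0;
    int rounds = 0;
    while (sent < transfer) {        /* one iteration == one round trip */
        sent += cwnd * mss;
        cwnd *= 2.0;                 /* slow-start doubling */
        rounds++;
    }

    printf("slow-start round trips      : %d\n", rounds);
    printf("time with early ACKs (ms)   : %.1f\n", rounds * base_rtt);
    printf("time without early ACKs (ms): %.1f\n", rounds * (base_rtt + sched));
    return 0;
}
```

Under these assumptions the handful of round trips is nearly free when ACKs return at network speed, but costs hundreds of milliseconds when each one waits for the VM to be scheduled; in steady state the window no longer doubles per RTT, which is why the gain for large flows is smaller.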
Challenges
Challenge 1: Out-of-order and special packets (SYN, FIN). Solution: let the VM handle these packets
Challenge 2: Packet loss after vSnoop. Solution: vSnoop acknowledges a packet only if there is room in the buffer
Challenge 3: ACKs generated by the VM. Solution: suppress or rewrite ACKs that vSnoop has already generated (sketched below)
Challenge 4: Throttling the receive window to keep vSnoop online. Solution: adjust the advertised window according to the buffer size (sketched below)
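A minimal C sketch of how Challenges 3 and 4 could be handled on the VM-to-sender (egress) path: drop pure ACKs that only re-acknowledge data vSnoop has already acknowledged, and clamp the advertised receive window to the free space in vSnoop's buffer. The names, struct layout, and simplifications (no sequence-number wraparound, no handling of pure window updates) are assumptions for illustration, not the dom0 implementation.

```c
/* Illustrative sketch of egress ACK handling (Challenges 3 and 4). */
#include <stdbool.h>
#include <stdint.h>

struct egress_state {
    uint32_t last_acked;   /* highest ACK vSnoop has already sent to the sender */
    uint32_t buf_free;     /* free space left in the per-VM vSnoop buffer */
};

struct vm_ack {
    uint32_t ack_seq;      /* acknowledgement number generated by the VM */
    uint16_t window;       /* receive window advertised by the VM */
    bool     has_payload;  /* ACK piggybacked on outgoing data */
};

/*
 * Decide what to do with an ACK generated by the VM:
 *  - suppress pure ACKs covering only data vSnoop already acknowledged,
 *    so the sender does not see spurious duplicate ACKs (Challenge 3);
 *  - otherwise forward it, clamping the advertised window to what the
 *    vSnoop buffer can absorb, so vSnoop is never forced to acknowledge
 *    data it cannot store (Challenge 4).
 * Sequence-number wraparound and pure window updates are ignored here.
 */
bool vsnoop_egress_ack(const struct egress_state *es, struct vm_ack *ack)
{
    if (!ack->has_payload && ack->ack_seq <= es->last_acked)
        return false;                      /* suppress the redundant ACK */

    uint32_t limit = es->buf_free > 65535 ? 65535 : es->buf_free;
    if (ack->window > limit)
        ack->window = (uint16_t)limit;     /* throttle the receive window */
    return true;                           /* forward the (possibly rewritten) ACK */
}
```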
State Machine Maintained Per-Flow
[Figure: per-flow state machine with states Start, Active (online), and No buffer (offline); on packet receipt, an in-order packet with buffer space available moves the flow to Active, where vSnoop issues early acknowledgements for in-order packets; an out-of-order (unexpected-sequence) packet or lack of buffer space takes the flow offline, where vSnoop does not acknowledge and passes packets to the VM; the flow returns to Active once an in-order packet arrives and buffer space is available]
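Read as code, the state machine is a small transition function over two per-packet properties: whether the packet is in order and whether the buffer has space. The C sketch below is an illustrative reconstruction from the slide; the state and function names are assumptions, and recovery details (for example, how a flow leaves the unexpected-sequence state once the VM catches up) are omitted.

```c
/* Illustrative reconstruction of the per-flow state machine (assumed names). */
#include <stdbool.h>

enum vsnoop_state {
    VS_START,           /* flow just created */
    VS_ACTIVE,          /* online: early-ACK in-order packets */
    VS_NO_BUFFER,       /* offline: no room in the per-VM buffer */
    VS_UNEXPECTED_SEQ   /* offline: last packet was out of order */
};

enum vsnoop_action {
    VS_EARLY_ACK,       /* acknowledge on the VM's behalf, then pass to the VM */
    VS_PASS_TO_VM       /* do not acknowledge; let the VM handle the packet */
};

enum vsnoop_action vsnoop_on_packet(enum vsnoop_state *st,
                                    bool in_order, bool buffer_space)
{
    if (!in_order) {
        *st = VS_UNEXPECTED_SEQ;   /* out-of-order packets are never early-ACKed */
        return VS_PASS_TO_VM;
    }
    if (!buffer_space) {
        *st = VS_NO_BUFFER;        /* cannot store the packet, so do not ACK it */
        return VS_PASS_TO_VM;
    }
    *st = VS_ACTIVE;               /* in order and buffer available: stay online */
    return VS_EARLY_ACK;
}
```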
vSnoop Implementation in Xen
[Figure: in the driver domain (dom0), vSnoop sits on the receive path between the software bridge and each VM's netback device, with a per-VM buffer (buf); each guest VM (VM1-VM3) uses the standard netfront driver]
Tuning Netfront
Evaluation
Overheads of vSnoop
TCP throughput speedup
Application speedup: multi-tier web service (RUBiS), MPI benchmarks (Intel MPI Benchmark, High-Performance Linpack)
Evaluation – Setup
VM hosts: 3.06GHz Intel Xeon CPUs, 4GB RAM; only one core/CPU enabled; Xen 3.3 with Linux 2.6.18 for the driver domain (dom0) and the guest VMs
Client machine: 2.4GHz Intel Core 2 Quad CPU, 2GB RAM; Linux 2.6.19
Gigabit Ethernet switch
vSnoop Overhead
Profiling per-packet vSnoop overhead using Xenoprof [Menon VEE'05]
Per-packet CPU overhead for vSnoop routines in dom0:

vSnoop Routine          Single Stream         Multiple Streams
                        Cycles    CPU %       Cycles    CPU %
vSnoop_ingress()        509       3.03        516       3.05
vSnoop_lookup_hash()    74        0.44        91        0.51
vSnoop_build_ack()      52        0.32        52        0.32
vSnoop_egress()         104       0.61        104       0.61

Minimal aggregate CPU overhead
TCP Throughput Improvement
3 VMs consolidated, 1000 transfers of a 100KB file
Configurations: Vanilla Xen, Xen+tuning, Xen+tuning+vSnoop
[Figure: CDF of per-transfer throughput; median 0.192MB/s (Vanilla Xen), 0.778MB/s (Xen+tuning), 6.003MB/s (Xen+tuning+vSnoop)]
30x improvement in median throughput
TCP Throughput: 1 VM/Core
[Figure: normalized throughput vs. transfer size (100MB, 10MB, 1MB, 500KB, 250KB, 100KB, 50KB) for Xen+tuning+vSnoop, Xen+tuning, and Xen]
TCP Throughput: 2 VMs/Core
[Figure: normalized throughput vs. transfer size (100MB, 10MB, 1MB, 500KB, 250KB, 100KB, 50KB) for Xen+tuning+vSnoop, Xen+tuning, and Xen]
TCP Throughput: 3 VMs/Core
[Figure: normalized throughput vs. transfer size (100MB, 10MB, 1MB, 500KB, 250KB, 100KB, 50KB) for Xen+tuning+vSnoop, Xen+tuning, and Xen]
TCP Throughput: 5 VMs/Core
[Figure: normalized throughput vs. transfer size (100MB, 10MB, 1MB, 500KB, 250KB, 100KB, 50KB) for Xen+tuning+vSnoop, Xen+tuning, and Xen]
vSnoop's benefit rises with higher VM consolidation
TCP Throughput: Other Setup Parameters
CPU load for the VMs; number of TCP connections to the VM; driver domain on a separate core; sender being a VM
vSnoop consistently achieves significant TCP throughput improvement
Application-Level Performance: RUBiS
[Figure: RUBiS setup; the Apache and MySQL tiers each run in a VM (dom1) on one of two servers (Server1, Server2), each server also hosting a second VM (dom2) and running vSnoop in dom0; RUBiS client threads run on a separate client machine]
RUBiS Results

RUBiS Operation           Count w/o vSnoop   Count w/ vSnoop   % Gain
Browse                    421                505               19.9%
BrowseCategories          288                357               23.9%
SearchItemsInCategory     3498               4747              35.7%
BrowseRegions             128                141               10.1%
ViewItem                  2892               3776              30.5%
ViewUserInfo              732                846               15.6%
ViewBidHistory            339                398               17.4%
Others                    3939               4815              22.2%
Total                     12237              15585             27.4%
Average Throughput        29 req/s           37 req/s          27.5%
Application-level Performance: MPI Benchmarks
Intel MPI Benchmark: network intensive
High-Performance Linpack: CPU intensive
[Figure: four servers (Server1-Server4), each running vSnoop in dom0 and hosting two VMs (dom1, dom2); one VM per server acts as an MPI node]
Intel MPI Benchmark Results: Broadcast
[Figure: normalized execution time vs. message size (8MB, 4MB, 2MB, 1MB, 512KB, 256KB, 128KB, 64KB) for Xen+tuning+vSnoop, Xen+tuning, and Xen]
40% Improvement
Intel MPI Benchmark Results: All-to-All
[Figure: normalized execution time vs. message size (8MB, 4MB, 2MB, 1MB, 512KB, 256KB, 128KB, 64KB) for Xen+tuning+vSnoop, Xen+tuning, and Xen]
40% Improvement
HPL Benchmark Results
[Figure: Gflops for Xen+tuning+vSnoop vs. Xen across problem size and block size (N, NB) combinations: (8K,16), (8K,8), (8K,4), (8K,2), (6K,16), (6K,8), (6K,4), (6K,2), (4K,16), (4K,8), (4K,4), (4K,2)]
Related Work
Optimizing the virtualized I/O path: Menon et al. [USENIX ATC'06, '08; ASPLOS'09]
Improving intra-host VM communication: XenSocket [Middleware'07], XenLoop [HPDC'08], Fido [USENIX ATC'09], XWAY [VEE'08], IVC [SC'07]
I/O-aware VM scheduling: Govindan et al. [VEE'07], DVT [SoCC'10]
Conclusions
Problem: VM consolidation degrades TCP throughput
Solution: vSnoop, which leverages acknowledgement offloading, does not violate end-to-end TCP semantics, is transparent to applications and the OS in VMs, and is generically applicable to many VMMs
Results: 30x improvement in median TCP throughput, about 30% improvement in the RUBiS benchmark, and 40-50% reduction in execution time for the Intel MPI Benchmark
Thank you.
For more information:
http://friends.cs.purdue.edu/dokuwiki/doku.php?id=vsnoop
Or Google “vSnoop Purdue”
TCP Benchmarks cont.
Testing different scenarios: a) 10 concurrent connections, b) sender also subject to VM scheduling, c) driver domain on a separate core
[Figures a-c: TCP throughput results for each scenario]
TCP Benchmarks cont.
Varying CPU load for the 3 consolidated VMs: 40%, 60%, and 80% CPU load
[Figures: TCP throughput results at each CPU load]