TRANSCRIPT
vSnoop: Improving TCP Throughput in Virtualized Environments
via Acknowledgement Offload
Ardalan Kangarlou, Sahan Gamage, Ramana Kompella, Dongyan Xu
Department of Computer Science, Purdue University
Cloud Computing and HPC
Background and Motivation
Virtualization: a key enabler of cloud computing (Amazon EC2, Eucalyptus)
Increasingly adopted in other real systems:
High performance computing: NERSC's Magellan system
Grid/cyberinfrastructure computing: In-VIGO, Nimbus, Virtuoso
Multiple VMs hosted by one physical host; multiple VMs sharing the same core
Flexibility, scalability, and economy
VM Consolidation: A Common Practice
[Figure: several VMs (VM1-VM4) consolidated on one physical host above a virtualization layer, with an external sender communicating with one of the VMs]
Key Observation: VM consolidation negatively impacts network performance!
Investigating the Problem
Q1: How does CPU sharing affect RTT?
[Figure: a client measuring RTT to a server hosting VMs 1-3; RTT (ms) plotted against the number of consolidated VMs (2 to 5), with wide-area RTTs (US East – West, US East – Europe, US West – Australia) shown for comparison]
RTT increases in proportion to the VM scheduling slice (30ms)
Q2: What is the Cause of the RTT Increase?
[Figure: packets from the sender pass through the device driver and the driver domain (dom0) and wait in per-VM buffers (buf) until the target VM (VM1-VM3) receives its 30ms scheduling slice]
VM scheduling latency dominates virtualization overhead!
[Figure: CDF comparing per-packet dom0 processing time and wait time in the VM's buffer]
Connection to the VM is much slower than to dom0!
Q3: What is the Impact on TCP Throughput?
[Figure: CDF of TCP throughput for connections terminating at dom0 vs. at a VM]
Our Solution: vSnoop
Alleviates the negative effect of VM scheduling on TCP throughput
Implemented within the driver domain to accelerate TCP connections
Does not require any modifications to the VM
Does not violate end-to-end TCP semantics
Applicable across a wide range of VMMs: Xen, VMware, KVM, etc.
TCP Connection to a VM
Sender establishes a TCP connection to VM1
[Figure: handshake timeline between the sender, VM1's buffer in the driver domain, and VM1; the SYN waits in the buffer while VM2 and VM3 consume their scheduling slices, so the SYN,ACK and every subsequent round trip include the VM scheduling latency in the observed RTT]
Key Idea: Acknowledgement Offload
[Figure: the same timeline with vSnoop; the driver domain acknowledges the SYN and subsequent in-order packets on VM1's behalf as soon as they reach the shared buffer, instead of waiting for VM1 to be scheduled]
Faster progress during TCP slow start
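To make the offload step concrete, here is a minimal, self-contained C sketch of the per-packet decision and early-ACK construction. It is only an illustration of the idea: the struct layout, field names, and function name are assumptions, and the real vSnoop operates on actual packet buffers inside dom0's receive path rather than on these toy structures.

```c
/*
 * Illustrative sketch of acknowledgement offload (not the dom0 code):
 * for an in-order TCP segment destined to a VM, acknowledge it on the
 * VM's behalf, provided the per-VM buffer can hold the segment.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct flow_state {
    uint32_t expected_seq;   /* next in-order sequence number from the sender */
    uint32_t last_acked;     /* highest ACK sent on the VM's behalf */
    uint32_t buf_free;       /* free space in the per-VM buffer (bytes) */
};

struct segment {
    uint32_t seq;            /* sequence number of the incoming segment */
    uint32_t len;            /* payload length (bytes) */
};

struct ack {
    uint32_t ack_seq;        /* cumulative acknowledgement number */
    uint16_t window;         /* advertised receive window */
};

/*
 * Returns true and fills 'out' if the segment can be acknowledged early;
 * returns false if the packet should simply be queued for the VM
 * (out-of-order packet or no buffer space).
 */
bool vsnoop_early_ack(struct flow_state *fs, const struct segment *seg,
                      struct ack *out)
{
    if (seg->seq != fs->expected_seq)   /* out-of-order: let the VM handle it */
        return false;
    if (seg->len > fs->buf_free)        /* no room: do not acknowledge */
        return false;

    fs->buf_free     -= seg->len;       /* segment is stored in the VM's buffer */
    fs->expected_seq += seg->len;
    fs->last_acked    = fs->expected_seq;

    out->ack_seq = fs->last_acked;
    /* Never advertise more than the remaining buffer space. */
    out->window  = (uint16_t)(fs->buf_free > 65535 ? 65535 : fs->buf_free);
    return true;
}

int main(void)
{
    struct flow_state fs = { .expected_seq = 1000, .last_acked = 1000,
                             .buf_free = 32768 };
    struct segment seg = { .seq = 1000, .len = 1448 };
    struct ack a;

    if (vsnoop_early_ack(&fs, &seg, &a))
        printf("early ACK: ack=%u win=%u\n", (unsigned)a.ack_seq,
               (unsigned)a.window);
    return 0;
}
```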
vSnoop’s Impact on TCP Flows
TCP slow start: early acknowledgements help connections progress faster; the most significant benefit is for short transfers, which are prevalent in data centers [Kandula IMC'09], [Benson WREN'09] (see the model sketched after this list)
TCP congestion avoidance and fast retransmit: large flows in the steady state can also benefit from vSnoop, though not as much as during slow start
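To see why short transfers gain the most, a rough back-of-the-envelope model helps: during slow start the congestion window roughly doubles every round trip, so a short transfer spends only a handful of round trips in slow start, and every round trip that has to wait for the receiving VM to be scheduled pays tens of milliseconds. The sketch below encodes that model; the MSS, initial window, base RTT, and scheduling delay are assumed values for illustration, not measurements from this work.

```c
/*
 * Back-of-the-envelope slow-start model (assumed parameters, illustration
 * only): cwnd starts at one segment and doubles every round trip until the
 * whole transfer has been sent.  With early ACKs from the driver domain a
 * round trip costs roughly the network RTT; without them it also pays the
 * VM scheduling delay.
 */
#include <stdio.h>

int main(void)
{
    const double mss      = 1448.0;  /* assumed TCP payload per segment (bytes) */
    const double base_rtt = 0.2;     /* assumed LAN round-trip time (ms) */
    const double sched    = 30.0;    /* assumed per-RTT VM scheduling delay (ms) */
    const double transfer = 100e3;   /* a 100KB transfer, as in the evaluation */

    double sent = 0.0, cwnd = 1.0;
    int rounds = 0;
    while (sent < transfer) {        /* one iteration == one round trip */
        sent += cwnd * mss;
        cwnd *= 2.0;                 /* slow-start doubling */
        rounds++;
    }

    printf("slow-start round trips      : %d\n", rounds);
    printf("time with early ACKs (ms)   : %.1f\n", rounds * base_rtt);
    printf("time without early ACKs (ms): %.1f\n", rounds * (base_rtt + sched));
    return 0;
}
```

Under these assumptions the handful of round trips is nearly free when ACKs return at network speed, but costs hundreds of milliseconds when each one waits for the VM to be scheduled; in steady state the window no longer doubles per RTT, which is why the gain for large flows is smaller.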
Challenges
Challenge 1: Out-of-order and special packets (SYN, FIN). Solution: let the VM handle these packets
Challenge 2: Packet loss after vSnoop. Solution: vSnoop acknowledges a packet only if there is room in the buffer
Challenge 3: ACKs generated by the VM. Solution: suppress or rewrite ACKs that vSnoop has already generated (sketched below)
Challenge 4: Throttling the receive window to keep vSnoop online. Solution: adjust the advertised window according to the buffer size (sketched below)
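A minimal C sketch of how Challenges 3 and 4 could be handled on the VM-to-sender (egress) path: drop pure ACKs that only re-acknowledge data vSnoop has already acknowledged, and clamp the advertised receive window to the free space in vSnoop's buffer. The names, struct layout, and simplifications (no sequence-number wraparound, no handling of pure window updates) are assumptions for illustration, not the dom0 implementation.

```c
/* Illustrative sketch of egress ACK handling (Challenges 3 and 4). */
#include <stdbool.h>
#include <stdint.h>

struct egress_state {
    uint32_t last_acked;   /* highest ACK vSnoop has already sent to the sender */
    uint32_t buf_free;     /* free space left in the per-VM vSnoop buffer */
};

struct vm_ack {
    uint32_t ack_seq;      /* acknowledgement number generated by the VM */
    uint16_t window;       /* receive window advertised by the VM */
    bool     has_payload;  /* ACK piggybacked on outgoing data */
};

/*
 * Decide what to do with an ACK generated by the VM:
 *  - suppress pure ACKs covering only data vSnoop already acknowledged,
 *    so the sender does not see spurious duplicate ACKs (Challenge 3);
 *  - otherwise forward it, clamping the advertised window to what the
 *    vSnoop buffer can absorb, so vSnoop is never forced to acknowledge
 *    data it cannot store (Challenge 4).
 * Sequence-number wraparound and pure window updates are ignored here.
 */
bool vsnoop_egress_ack(const struct egress_state *es, struct vm_ack *ack)
{
    if (!ack->has_payload && ack->ack_seq <= es->last_acked)
        return false;                      /* suppress the redundant ACK */

    uint32_t limit = es->buf_free > 65535 ? 65535 : es->buf_free;
    if (ack->window > limit)
        ack->window = (uint16_t)limit;     /* throttle the receive window */
    return true;                           /* forward the (possibly rewritten) ACK */
}
```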
State Machine Maintained Per-Flow
[Figure: per-flow state machine with states Start, Active (online), and No buffer (offline); on packet receipt, an in-order packet with buffer space available moves the flow to Active, where vSnoop issues early acknowledgements for in-order packets; an out-of-order (unexpected-sequence) packet or lack of buffer space takes the flow offline, where vSnoop does not acknowledge and passes packets to the VM; the flow returns to Active once an in-order packet arrives and buffer space is available]
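Read as code, the state machine is a small transition function over two per-packet properties: whether the packet is in order and whether the buffer has space. The C sketch below is an illustrative reconstruction from the slide; the state and function names are assumptions, and recovery details (for example, how a flow leaves the unexpected-sequence state once the VM catches up) are omitted.

```c
/* Illustrative reconstruction of the per-flow state machine (assumed names). */
#include <stdbool.h>

enum vsnoop_state {
    VS_START,           /* flow just created */
    VS_ACTIVE,          /* online: early-ACK in-order packets */
    VS_NO_BUFFER,       /* offline: no room in the per-VM buffer */
    VS_UNEXPECTED_SEQ   /* offline: last packet was out of order */
};

enum vsnoop_action {
    VS_EARLY_ACK,       /* acknowledge on the VM's behalf, then pass to the VM */
    VS_PASS_TO_VM       /* do not acknowledge; let the VM handle the packet */
};

enum vsnoop_action vsnoop_on_packet(enum vsnoop_state *st,
                                    bool in_order, bool buffer_space)
{
    if (!in_order) {
        *st = VS_UNEXPECTED_SEQ;   /* out-of-order packets are never early-ACKed */
        return VS_PASS_TO_VM;
    }
    if (!buffer_space) {
        *st = VS_NO_BUFFER;        /* cannot store the packet, so do not ACK it */
        return VS_PASS_TO_VM;
    }
    *st = VS_ACTIVE;               /* in order and buffer available: stay online */
    return VS_EARLY_ACK;
}
```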
vSnoop Implementation in Xen
[Figure: in the driver domain (dom0), vSnoop sits on the receive path between the software bridge and each VM's netback device, with a per-VM buffer (buf); each guest VM (VM1-VM3) uses the standard netfront driver]
Tuning Netfront
Evaluation
Overheads of vSnoop
TCP throughput speedup
Application speedup: multi-tier web service (RUBiS), MPI benchmarks (Intel MPI Benchmark, High-Performance Linpack)
Evaluation – Setup
VM hosts: 3.06GHz Intel Xeon CPUs, 4GB RAM; only one core/CPU enabled; Xen 3.3 with Linux 2.6.18 for the driver domain (dom0) and the guest VMs
Client machine: 2.4GHz Intel Core 2 Quad CPU, 2GB RAM; Linux 2.6.19
Gigabit Ethernet switch
vSnoop Overhead
Profiling per-packet vSnoop overhead using Xenoprof [Menon VEE'05]
Per-packet CPU overhead for vSnoop routines in dom0:

vSnoop Routine          Single Stream         Multiple Streams
                        Cycles    CPU %       Cycles    CPU %
vSnoop_ingress()        509       3.03        516       3.05
vSnoop_lookup_hash()    74        0.44        91        0.51
vSnoop_build_ack()      52        0.32        52        0.32
vSnoop_egress()         104       0.61        104       0.61

Minimal aggregate CPU overhead
TCP Throughput Improvement
3 VMs consolidated, 1000 transfers of a 100KB file
Configurations: Vanilla Xen, Xen+tuning, Xen+tuning+vSnoop
[Figure: CDF of per-transfer throughput; median 0.192MB/s (Vanilla Xen), 0.778MB/s (Xen+tuning), 6.003MB/s (Xen+tuning+vSnoop)]
30x improvement in median throughput
TCP Throughput: 1 VM/Core
[Figure: normalized throughput vs. transfer size (100MB, 10MB, 1MB, 500KB, 250KB, 100KB, 50KB) for Xen+tuning+vSnoop, Xen+tuning, and Xen]
TCP Throughput: 2 VMs/Core
[Figure: normalized throughput vs. transfer size (100MB, 10MB, 1MB, 500KB, 250KB, 100KB, 50KB) for Xen+tuning+vSnoop, Xen+tuning, and Xen]
TCP Throughput: 3 VMs/Core
[Figure: normalized throughput vs. transfer size (100MB, 10MB, 1MB, 500KB, 250KB, 100KB, 50KB) for Xen+tuning+vSnoop, Xen+tuning, and Xen]
TCP Throughput: 5 VMs/Core
[Figure: normalized throughput vs. transfer size (100MB, 10MB, 1MB, 500KB, 250KB, 100KB, 50KB) for Xen+tuning+vSnoop, Xen+tuning, and Xen]
vSnoop's benefit rises with higher VM consolidation
TCP Throughput: Other Setup Parameters
CPU load for the VMs; number of TCP connections to the VM; driver domain on a separate core; sender being a VM
vSnoop consistently achieves significant TCP throughput improvement
Application-Level Performance: RUBiS
[Figure: RUBiS setup; the Apache and MySQL tiers each run in a VM (dom1) on one of two servers (Server1, Server2), each server also hosting a second VM (dom2) and running vSnoop in dom0; RUBiS client threads run on a separate client machine]
RUBiS Results

RUBiS Operation           Count w/o vSnoop   Count w/ vSnoop   % Gain
Browse                    421                505               19.9%
BrowseCategories          288                357               23.9%
SearchItemsInCategory     3498               4747              35.7%
BrowseRegions             128                141               10.1%
ViewItem                  2892               3776              30.5%
ViewUserInfo              732                846               15.6%
ViewBidHistory            339                398               17.4%
Others                    3939               4815              22.2%
Total                     12237              15585             27.4%
Average Throughput        29 req/s           37 req/s          27.5%
Application-level Performance: MPI Benchmarks
Intel MPI Benchmark: network intensive
High-Performance Linpack: CPU intensive
[Figure: four servers (Server1-Server4), each running vSnoop in dom0 and hosting two VMs (dom1, dom2); one VM per server acts as an MPI node]
Intel MPI Benchmark Results: Broadcast
[Figure: normalized execution time vs. message size (8MB, 4MB, 2MB, 1MB, 512KB, 256KB, 128KB, 64KB) for Xen+tuning+vSnoop, Xen+tuning, and Xen]
40% Improvement
Intel MPI Benchmark Results: All-to-All
[Figure: normalized execution time vs. message size (8MB, 4MB, 2MB, 1MB, 512KB, 256KB, 128KB, 64KB) for Xen+tuning+vSnoop, Xen+tuning, and Xen]
40% Improvement
HPL Benchmark Results
[Figure: Gflops for Xen+tuning+vSnoop vs. Xen across problem size and block size (N, NB) combinations: (8K,16), (8K,8), (8K,4), (8K,2), (6K,16), (6K,8), (6K,4), (6K,2), (4K,16), (4K,8), (4K,4), (4K,2)]
Related Work
Optimizing the virtualized I/O path: Menon et al. [USENIX ATC'06, '08; ASPLOS'09]
Improving intra-host VM communication: XenSocket [Middleware'07], XenLoop [HPDC'08], Fido [USENIX ATC'09], XWAY [VEE'08], IVC [SC'07]
I/O-aware VM scheduling: Govindan et al. [VEE'07], DVT [SoCC'10]
Conclusions
Problem: VM consolidation degrades TCP throughput
Solution: vSnoop, which leverages acknowledgement offloading, does not violate end-to-end TCP semantics, is transparent to applications and the OS in VMs, and is generically applicable to many VMMs
Results: 30x improvement in median TCP throughput, about 30% improvement in the RUBiS benchmark, and 40-50% reduction in execution time for the Intel MPI Benchmark
Thank you.
For more information:
http://friends.cs.purdue.edu/dokuwiki/doku.php?id=vsnoop
Or Google “vSnoop Purdue”
TCP Benchmarks cont.
Testing different scenarios: a) 10 concurrent connections, b) sender also subject to VM scheduling, c) driver domain on a separate core
[Figures a-c: TCP throughput results for each scenario]
TCP Benchmarks cont.
Varying CPU load for the 3 consolidated VMs: 40%, 60%, and 80% CPU load
[Figures: TCP throughput results at each CPU load]