Scaling Xen’s Aggregate Storage Performance
Going double digits on a single host

Felipe Franciosi
XenServer Engineering Performance Team
e-mail: [email protected] | freenode: felipef #xen-api | twitter: @franciozzy

© 2014 Citrix
Agenda

• The dimensions of storage performance
  ๏ What exactly are we trying to measure?
• State of the art
  ๏ blkfront, blkback, blktap2+tapdisk, tapdisk3, qemu-qdisk
  ๏ trade-offs between traditional grant mapping, persistent grants, and grant copy
• Aggregate measurements
  ๏ Pushing the boundaries with very, very fast local storage
• Where to go next?
The Dimensions of Storage Performance
What exactly are we trying to measure?
The Dimensions of Storage Performance

• You have probably seen this:

    # dd if=/dev/sda of=/dev/null bs=1M count=100 iflag=direct
    100+0 records in
    100+0 records out
    104857600 bytes (105 MB) copied, 0.269689 s, 389 MB/s

    # hdparm -t /dev/sda

    /dev/sda:
     Timing buffered disk reads: 1116 MB in 3.00 seconds = 371.70 MB/sec

• The average user will usually:
  ๏ Run a synthetic benchmark on a bare-metal environment
  ๏ Repeat the test on a virtual machine
  ๏ Draw conclusions without seeing the full picture
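A tool like fio makes the dimensions discussed next explicit instead of leaving them implicit; a minimal sketch of a more controlled run against the same disk (device path and runtime are illustrative):

    # fio --name=seqread --filename=/dev/sda --rw=read --bs=1M \
          --ioengine=libaio --direct=1 --iodepth=1 --numjobs=1 \
          --runtime=30 --time_based

Each of these parameters (block size, queue depth, number of jobs, direct vs. buffered IO, sequential vs. random access) is one of the dimensions that the dd and hdparm one-liners above silently fix for you.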
The Dimensions of Storage Performance

[Chart: throughput as a function of log(block size)]
The Dimensions of Storage Performance

[Chart: throughput vs. log(block size), annotated with the many other dimensions that affect the result: sequentiality, number of threads, LBA, IO depth, C/P-state configuration, temperature, IO engine, noise, readahead, and direction (read/write)]
The Dimensions of Storage Performance

• The simplest of all cases:
  ๏ single thread
  ๏ iodepth=1
  ๏ direct IO
  ๏ sequential

• Extra notes:
  ๏ BIOS perf. mode set to OS
  ๏ Fans set to maximum power
  ๏ Xen scaling governor set to Performance (forces P0)
  ๏ Maximum C-state set to 1 (see the xenpm sketch below)
  ๏ No pinning
  ๏ Creedence #87433
  ๏ Kernel 3.10 + Xen 4.4
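On the Xen side, the governor and C-state settings above would typically be applied from dom0 with xenpm, e.g.:

    # Force the Performance governor (P0) and cap deep sleep states
    xenpm set-scaling-governor performance
    xenpm set-max-cstate 1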
The Dimensions of Storage Performance

• Pushing the boundaries a bit:
  ๏ multiple threads
  ๏ iodepth=1
  ๏ direct IO
  ๏ “kind of” sequential

• Extra notes: same test configuration as on the previous slide
The Dimensions of Storage Performance

• Comparing dom0 vs. domU:
  ๏ single thread vs. single VM
  ๏ iodepth=1
  ๏ direct IO
  ๏ sequential

• Extra notes: same test configuration as before
The Dimensions of Storage Performance

• Comparing dom0 vs. domU:
  ๏ many threads vs. many VMs
  ๏ iodepth=1
  ๏ direct IO
  ๏ “kind of” sequential

• Extra notes: same test configuration as before
State of the Art
And the trade-offs between the technologies
State of the Art: traditional grant mapping

[Diagram: in domU, user apps → libc/libaio → vfs → block layer → blkfront; in dom0, blkback → block layer → device driver → VDI. blkfront and blkback communicate via Xen’s blkif protocol (shared memory and event channels). Requests are associated with pages in the guest’s memory space; in dom0 they are associated with foreign pages, mapped via page grants.]

• Pros:
  ๏ no copies involved
  ๏ low-latency alternative (when done in kernel)

• Cons:
  ๏ not “network-safe”
  ๏ hard on grant tables (observable as sketched below)
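Grant-table pressure can be observed from dom0 via Xen’s debug keys; a minimal sketch (the ‘g’ key asks Xen to dump grant-table usage to its console ring):

    xl debug-keys g
    xl dmesg | tail -n 20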
State of the Art: persistent grants

[Diagram: same data path, but with persistent page grants. blkfront memcpy()s data from/to a set of persistently granted pages on demand, so the mappings in dom0 are set up once and reused.]

• Pros:
  ๏ easy on grant tables
  ๏ copies are done on the front end

• Cons:
  ๏ not “network-safe”
  ๏ copies involved
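Whether persistent grants were actually negotiated for a given VBD is visible in xenstore through the standard feature-persistent nodes; a sketch, assuming frontend domain 1 and device 51712 (xvda) as hypothetical IDs:

    # Back end advertises the feature; front end acknowledges it
    xenstore-read /local/domain/0/backend/vbd/1/51712/feature-persistent
    xenstore-read /local/domain/1/device/vbd/51712/feature-persistent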
State of the Art: tapdisk2 + blktap2 + blkback

[Diagram: the domU data path is unchanged (blkfront ↔ blkback via the blkif protocol). In dom0, blktap2 copies the data to local pages and exposes a TAP block device to blkback; tapdisk2 in user space services that device via libaio, vfs, and the block layer down to the VDI.]

• Pros:
  ๏ “network-safe”
  ๏ neat features (VHD)

• Cons:
  ๏ copies involved
  ๏ uses lots of memory
  ๏ hard on grant tables
State of the Art: grant copy (tapdisk3)

[Diagram: no blkback. blkfront speaks the blkif protocol directly to tapdisk3 in dom0 user space, via the “gntdev” and “evtchn” devices. tapdisk3 issues grant-copy commands through gntdev and Xen copies the data across domains; requests are then associated with local pages and serviced via libaio, vfs, and the block layer down to the VDI.]

• Pros:
  ๏ “network-safe”
  ๏ easy on grant tables
  ๏ neat features (VHD)

• Cons:
  ๏ copies involved (back end)
  ๏ uses lots of memory
State of the Art: technologies comparison

                      extra copies    network-safe   low-latency potential    “neat” features            easy on grant tables
  grant mapping       N               N              Y (if done in blkback)   depends (not in blkback)   N
  persistent grants   Y (front end)   N              N                        Y (qcow in qemu-qdisk)     Y
  grant copy          Y (back end)    Y              N                        Y (vhd in tapdisk3)        Y
  blktap2             Y (back end)    Y              N                        Y                          N
Aggregate Measurements
Going double digits on a single host
Aggregate Measurements

• Test environment:
  ๏ Dell PowerEdge R720
    • Intel E5-2643 v2 @ 3.50 GHz (2 sockets, 6 cores/socket, HT enabled)
      - Unless stated otherwise: 24 vCPUs to dom0, 2 vCPUs to each guest
    • 64 GB of RAM
      - Unless stated otherwise: 4 GB to dom0, 512 MB to each guest
    • BIOS settings:
      - Power regulators set to “Performance per Watt (OS)”
      - C-states disabled, Xen scaling governor set to “Performance”
  ๏ Storage:
    • 4 x Micron P320h
    • 2 x Intel P3700
    • 1 x Fusion-io ioDrive2
Aggregate Measurements

[Diagram: each of 7 SSDs is a Storage Repository (SR) carved into 10 logical volumes (lv01…lv10). Each of 10 VMs receives 7 virtual disks, one LV from each SR: VM 01 gets lv01 from SRs 1 to 7, VM 02 gets lv02, …, VM 10 gets lv10.]
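A sketch of how such a layout might be scripted; volume-group names, LV sizes, and UUIDs below are assumptions, and on XenServer the virtual disks would be plugged with the xe CLI:

    # Carve 10 LVs out of each SSD-backed volume group
    for sr in 1 2 3 4 5 6 7; do
        for i in $(seq -w 1 10); do
            lvcreate -L 10G -n lv$i vg_ssd$sr
        done
    done

    # Plug one LV per SR into each VM (UUIDs are placeholders)
    xe vbd-create vm-uuid=$VM01_UUID vdi-uuid=$SR1_LV01_VDI_UUID device=1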
Aggregate Measurements

• Baseline
  ๏ Measurements from dom0
  ๏ Each line corresponds to a group of 7 threads (one for each disk)
  ๏ Some of the drives respond faster for small block sizes and a single thread
Aggregate Measurements

• qemu-qdisk
  ๏ Persistent grants disabled
  ๏ With O_DIRECT
Aggregate Measurements

• qemu-qdisk
  ๏ Persistent grants enabled
  ๏ With O_DIRECT
  ๏ Apparent bottleneck was the single process per VM
Aggregate Measurements

• tapdisk2 + blktap2
  ๏ With O_DIRECT
  ๏ Using blkback from 3.10
  ๏ No persistent grants
  ๏ No indirect IO
Aggregate Measurements

• tapdisk2 + blktap2
  ๏ With O_DIRECT
  ๏ Using blkback from 3.16
  ๏ Persistent grants
  ๏ Indirect IO
  ๏ Apparent bottleneck on some pvspinlock operations
Aggregate Measurements

• blkback 3.16
  ๏ 8 dom0 vCPUs, 6 domU vCPUs
  ๏ Persistent grants
  ๏ Indirect IO
  ๏ Apparent bottleneck on some pvspinlock operations
Aggregate Measurements

• tapdisk3
  ๏ Using grant copy
  ๏ With O_DIRECT
  ๏ Using libaio
  ๏ Apparent bottleneck is vCPU utilisation
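Per-domain vCPU utilisation such as this is easy to watch from dom0 with xentop, e.g.:

    # Batch mode: 5 samples, 2 seconds apart; the CPU(%) column shows per-domain load
    xentop -b -i 5 -d 2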
Where To Go Next?
Areas for improvement
Where To Go Next?

• Single-VBD performance remains problematic
  ๏ [1/3] Latency is too high

[Diagram: throughput vs. log(block size) curve, alongside the request path VM → virtualisation subsystem → VDI → disk, with per-component latencies ranging from ~ns to ~ms]
Where To Go Next?

• Single-VBD performance remains problematic
  ๏ [2/3] IO depth is limited to 32

[Diagram: VM → blkfront → ring of 32 requests → blkback / qdisk / tapdisk → VDI → disk]

• Are these workloads realistic?
• We can use multi-page rings!
• But…
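For illustration only: later upstream kernels expose the ring size as a blkfront parameter. This knob postdates the kernels discussed in this talk, and its name and availability are assumptions that depend on the kernel version:

    # Hypothetical guest kernel command line: request a 16-page ring
    xen_blkfront.max_ring_page_order=4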
Where To Go Next?

• Single-VBD performance remains problematic
  ๏ [3/3] Back end is single-threaded

[Diagram: fio (numjobs=1, iodepth=___, blksz=4k, rw=read) drives the 32-request ring via io_submit()/io_getevents(), through blkfront → blkback / qdisk / tapdisk → VDI; disk capable of ~400k IOPS at 4k sequential reads]

  {numjobs, iodepth, blksz}:
  {1,  1, 4k} =  15k IOPS,  15 %
  {1,  8, 4k} =  70k IOPS,  35 %
  {1, 16, 4k} = 110k IOPS,  55 %
  {1, 24, 4k} = 165k IOPS,  85 %
  {1, 32, 4k} = 190k IOPS, 100 %
  {1, 64, 4k} = 195k IOPS, 100 %
  {5, 32, 4k} = 415k IOPS,  55 % (each)
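The job behind these numbers presumably looks something like the following fio file; the device path is an assumption:

    [global]
    ; io_submit()/io_getevents(), as in the diagram
    ioengine=libaio
    direct=1
    rw=read
    bs=4k

    [vbd]
    ; hypothetical guest virtual block device
    filename=/dev/xvdb
    numjobs=1
    iodepth=32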
Where To Go Next?

• Single-VBD performance remains problematic
  ๏ [3/3] Back end is single-threaded

[Diagram: same fio job, now through the stack: VM → blkfront (32-request ring) → blkback → VDI → disk; disk capable of ~400k IOPS at 4k sequential reads]

  {numjobs, iodepth, blksz}:
  {1,  1, 4k} =  10k IOPS,  30 % (30 % in dom0)
  {1,  8, 4k} =  50k IOPS,  75 % (75 % in dom0)
  {1, 16, 4k} =  70k IOPS, 100 % (100 % in dom0)
  {1, 32, 4k} = 110k IOPS, 120 % (100 % in dom0)
  {4, 32, 4k} = 115k IOPS, 400 % (125 % in dom0)
Where To Go Next?

• Many-VBD performance could be much better:
  ๏ Both persistent grants and grant copy are interesting alternatives:
    • tapdisk3 with grant copy is network-“friendly” and has one process per VBD
    • qdisk with persistent grants does the copy on the front end
  ๏ But both add extra copies to the data path:
    • We should be avoiding copies… :-/
      - Grant operations need to scale better
      - The network retransmission issues need to be addressed

e-mail: [email protected] | freenode: felipef #xen-api | twitter: @franciozzy
Support Slides

• Usage of O_DIRECT with QDISK vs. 1 x Micron P320h
• Number of dom0 vCPUs on Creedence #87433 + blkback from 3.16
• Temperature effects on storage performance
Usage of O_DIRECT in qemu-qdisk

• qemu-qdisk
  ๏ Without O_DIRECT (default)
  ๏ Faster for small block sizes
  ๏ Faster for a single VM
  ๏ Scalability issue (investigation pending)
Usage of O_DIRECT in qemu-qdisk

• qemu-qdisk
  ๏ With O_DIRECT (directiosafe=1)
  ๏ Slower for small block sizes
  ๏ Slower for a single VM
  ๏ Scales much better
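How directiosafe=1 reaches qemu-qdisk depends on the toolstack; a sketch assuming a backend xenstore node written before the VBD is plugged (the node name and path here are assumptions that vary by QEMU/toolstack version):

    # Hypothetical: mark the qdisk backend as safe for O_DIRECT
    xenstore-write /local/domain/0/backend/qdisk/$DOMID/$DEVID/directiosafe 1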
Impact of dom0 vCPU count

• XenServer Creedence #87433
  ๏ Kernel 3.10 + internal patch queue
  ๏ blkback backported from 3.16
  ๏ LVs plugged directly to guests

• Throughput sinks with
  ๏ larger blocks
  ๏ an increased number of guests

• oprofile suggests pvspinlock operations are the bottleneck
Impact of dom0 vCPU count

• XenServer Creedence #87433
  ๏ Kernel 3.10 + internal patch queue
  ๏ blkback backported from 3.16
  ๏ LVs plugged directly to guests

• Giving fewer vCPUs to dom0 improves aggregate throughput
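The dom0 vCPU count is set on the Xen command line at boot; a sketch (the GRUB entry shown is illustrative):

    # GRUB entry (illustrative): cap dom0 at 8 vCPUs and pin them
    multiboot /boot/xen.gz dom0_max_vcpus=8 dom0_vcpus_pin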
Temperature Effects on Storage Performance

• Workload keeps pCPUs busy with large block sizes
• iDRAC Settings > Thermal > Thermal Base Algorithm set to “Maximum Performance”
Temperature Effects on Storage Performance

• Workload keeps pCPUs busy with large block sizes
• iDRAC Settings > Thermal > Thermal Base Algorithm set to “Auto”
• Effects very noticeable with 3 or more guests