Scaling Xen’s Aggregate Storage Performance
Going double digits on a single host

Felipe Franciosi
XenServer Engineering Performance Team
e-mail: [email protected] | freenode: felipef #xen-api | twitter: @franciozzy

© 2014 Citrix
Agenda

• The dimensions of storage performance
  ๏ What exactly are we trying to measure?
• State of the art
  ๏ blkfront, blkback, blktap2+tapdisk, tapdisk3, qemu-qdisk
  ๏ trade-offs between traditional grant mapping, persistent grants, and grant copy
• Aggregate measurements
  ๏ Pushing the boundaries with very, very fast local storage
• Where to go next?
The Dimensions of Storage Performance
What exactly are we trying to measure?
The Dimensions of Storage Performance

• You have probably seen this:

    # dd if=/dev/sda of=/dev/null bs=1M count=100 iflag=direct
    100+0 records in
    100+0 records out
    104857600 bytes (105 MB) copied, 0.269689 s, 389 MB/s

    # hdparm -t /dev/sda

    /dev/sda:
     Timing buffered disk reads: 1116 MB in 3.00 seconds = 371.70 MB/sec

• The average user will usually:
  ๏ Run a synthetic benchmark on a bare-metal environment
  ๏ Repeat the test on a virtual machine
  ๏ Draw conclusions without seeing the full picture
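A tool like fio makes the dimensions discussed next explicit instead of leaving them implicit; a minimal sketch of a more controlled run against the same disk (device path and runtime are illustrative):

    # fio --name=seqread --filename=/dev/sda --rw=read --bs=1M \
          --ioengine=libaio --direct=1 --iodepth=1 --numjobs=1 \
          --runtime=30 --time_based

Each of these parameters (block size, queue depth, number of jobs, direct vs. buffered IO, sequential vs. random access) is one of the dimensions that the dd and hdparm one-liners above silently fix for you.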
The Dimensions of Storage Performance

[Chart: throughput as a function of log(block size)]
The Dimensions of Storage Performance

[Chart: throughput vs. log(block size), annotated with the many other dimensions that affect the result: sequentiality, number of threads, LBA, IO depth, C/P-state configuration, temperature, IO engine, noise, readahead, and direction (read/write)]
The Dimensions of Storage Performance

• The simplest of all cases:
  ๏ single thread
  ๏ iodepth=1
  ๏ direct IO
  ๏ sequential

• Extra notes:
  ๏ BIOS perf. mode set to OS
  ๏ Fans set to maximum power
  ๏ Xen scaling governor set to Performance (forces P0)
  ๏ Maximum C-state set to 1 (see the xenpm sketch below)
  ๏ No pinning
  ๏ Creedence #87433
  ๏ Kernel 3.10 + Xen 4.4
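On the Xen side, the governor and C-state settings above would typically be applied from dom0 with xenpm, e.g.:

    # Force the Performance governor (P0) and cap deep sleep states
    xenpm set-scaling-governor performance
    xenpm set-max-cstate 1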
The Dimensions of Storage Performance

• Pushing the boundaries a bit:
  ๏ multiple threads
  ๏ iodepth=1
  ๏ direct IO
  ๏ “kind of” sequential

• Extra notes: same test configuration as on the previous slide
The Dimensions of Storage Performance

• Comparing dom0 vs. domU:
  ๏ single thread vs. single VM
  ๏ iodepth=1
  ๏ direct IO
  ๏ sequential

• Extra notes: same test configuration as before
The Dimensions of Storage Performance

• Comparing dom0 vs. domU:
  ๏ many threads vs. many VMs
  ๏ iodepth=1
  ๏ direct IO
  ๏ “kind of” sequential

• Extra notes: same test configuration as before
State of the Art
And the trade-offs between the technologies
State of the Art: traditional grant mapping

[Diagram: in domU, user apps → libc/libaio → vfs → block layer → blkfront; in dom0, blkback → block layer → device driver → VDI. blkfront and blkback communicate via Xen’s blkif protocol (shared memory and event channels). Requests are associated with pages in the guest’s memory space; in dom0 they are associated with foreign pages, mapped via page grants.]

• Pros:
  ๏ no copies involved
  ๏ low-latency alternative (when done in kernel)

• Cons:
  ๏ not “network-safe”
  ๏ hard on grant tables (observable as sketched below)
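Grant-table pressure can be observed from dom0 via Xen’s debug keys; a minimal sketch (the ‘g’ key asks Xen to dump grant-table usage to its console ring):

    xl debug-keys g
    xl dmesg | tail -n 20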
State of the Art: persistent grants

[Diagram: same data path, but with persistent page grants. blkfront memcpy()s data from/to a set of persistently granted pages on demand, so the mappings in dom0 are set up once and reused.]

• Pros:
  ๏ easy on grant tables
  ๏ copies are done on the front end

• Cons:
  ๏ not “network-safe”
  ๏ copies involved
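Whether persistent grants were actually negotiated for a given VBD is visible in xenstore through the standard feature-persistent nodes; a sketch, assuming frontend domain 1 and device 51712 (xvda) as hypothetical IDs:

    # Back end advertises the feature; front end acknowledges it
    xenstore-read /local/domain/0/backend/vbd/1/51712/feature-persistent
    xenstore-read /local/domain/1/device/vbd/51712/feature-persistent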
State of the Art: tapdisk2 + blktap2 + blkback

[Diagram: the domU data path is unchanged (blkfront ↔ blkback via the blkif protocol). In dom0, blktap2 copies the data to local pages and exposes a TAP block device to blkback; tapdisk2 in user space services that device via libaio, vfs, and the block layer down to the VDI.]

• Pros:
  ๏ “network-safe”
  ๏ neat features (VHD)

• Cons:
  ๏ copies involved
  ๏ uses lots of memory
  ๏ hard on grant tables
State of the Art: grant copy (tapdisk3)

[Diagram: no blkback. blkfront speaks the blkif protocol directly to tapdisk3 in dom0 user space, via the “gntdev” and “evtchn” devices. tapdisk3 issues grant-copy commands through gntdev and Xen copies the data across domains; requests are then associated with local pages and serviced via libaio, vfs, and the block layer down to the VDI.]

• Pros:
  ๏ “network-safe”
  ๏ easy on grant tables
  ๏ neat features (VHD)

• Cons:
  ๏ copies involved (back end)
  ๏ uses lots of memory
State of the Art: technologies comparison

                      extra copies    network-safe   low-latency potential    “neat” features            easy on grant tables
  grant mapping       N               N              Y (if done in blkback)   depends (not in blkback)   N
  persistent grants   Y (front end)   N              N                        Y (qcow in qemu-qdisk)     Y
  grant copy          Y (back end)    Y              N                        Y (vhd in tapdisk3)        Y
  blktap2             Y (back end)    Y              N                        Y                          N
Aggregate Measurements
Going double digits on a single host
Aggregate Measurements

• Test environment:
  ๏ Dell PowerEdge R720
    • Intel E5-2643 v2 @ 3.50 GHz (2 sockets, 6 cores/socket, HT enabled)
      - Unless stated otherwise: 24 vCPUs to dom0, 2 vCPUs to each guest
    • 64 GB of RAM
      - Unless stated otherwise: 4 GB to dom0, 512 MB to each guest
    • BIOS settings:
      - Power regulators set to “Performance per Watt (OS)”
      - C-states disabled, Xen scaling governor set to “Performance”
  ๏ Storage:
    • 4 x Micron P320h
    • 2 x Intel P3700
    • 1 x Fusion-io ioDrive2
Aggregate Measurements

[Diagram: each of 7 SSDs is a Storage Repository (SR) carved into 10 logical volumes (lv01…lv10). Each of 10 VMs receives 7 virtual disks, one LV from each SR: VM 01 gets lv01 from SRs 1 to 7, VM 02 gets lv02, …, VM 10 gets lv10.]
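A sketch of how such a layout might be scripted; volume-group names, LV sizes, and UUIDs below are assumptions, and on XenServer the virtual disks would be plugged with the xe CLI:

    # Carve 10 LVs out of each SSD-backed volume group
    for sr in 1 2 3 4 5 6 7; do
        for i in $(seq -w 1 10); do
            lvcreate -L 10G -n lv$i vg_ssd$sr
        done
    done

    # Plug one LV per SR into each VM (UUIDs are placeholders)
    xe vbd-create vm-uuid=$VM01_UUID vdi-uuid=$SR1_LV01_VDI_UUID device=1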
Aggregate Measurements

• Baseline
  ๏ Measurements from dom0
  ๏ Each line corresponds to a group of 7 threads (one for each disk)
  ๏ Some of the drives respond faster for small block sizes and a single thread
Aggregate Measurements

• qemu-qdisk
  ๏ Persistent grants disabled
  ๏ With O_DIRECT
Aggregate Measurements

• qemu-qdisk
  ๏ Persistent grants enabled
  ๏ With O_DIRECT
  ๏ Apparent bottleneck was the single process per VM
Aggregate Measurements

• tapdisk2 + blktap2
  ๏ With O_DIRECT
  ๏ Using blkback from 3.10
  ๏ No persistent grants
  ๏ No indirect IO
Aggregate Measurements

• tapdisk2 + blktap2
  ๏ With O_DIRECT
  ๏ Using blkback from 3.16
  ๏ Persistent grants
  ๏ Indirect IO
  ๏ Apparent bottleneck on some pvspinlock operations
Aggregate Measurements

• blkback 3.16
  ๏ 8 dom0 vCPUs, 6 domU vCPUs
  ๏ Persistent grants
  ๏ Indirect IO
  ๏ Apparent bottleneck on some pvspinlock operations
Aggregate Measurements

• tapdisk3
  ๏ Using grant copy
  ๏ With O_DIRECT
  ๏ Using libaio
  ๏ Apparent bottleneck is vCPU utilisation
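Per-domain vCPU utilisation such as this is easy to watch from dom0 with xentop, e.g.:

    # Batch mode: 5 samples, 2 seconds apart; the CPU(%) column shows per-domain load
    xentop -b -i 5 -d 2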
Where To Go Next?
Areas for improvement
Where To Go Next?

• Single-VBD performance remains problematic
  ๏ [1/3] Latency is too high

[Diagram: throughput vs. log(block size) curve, alongside the request path VM → virtualisation subsystem → VDI → disk, with per-component latencies ranging from ~ns to ~ms]
Where To Go Next?

• Single-VBD performance remains problematic
  ๏ [2/3] IO depth is limited to 32

[Diagram: VM → blkfront → ring of 32 requests → blkback / qdisk / tapdisk → VDI → disk]

• Are these workloads realistic?
• We can use multi-page rings!
• But…
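For illustration only: later upstream kernels expose the ring size as a blkfront parameter. This knob postdates the kernels discussed in this talk, and its name and availability are assumptions that depend on the kernel version:

    # Hypothetical guest kernel command line: request a 16-page ring
    xen_blkfront.max_ring_page_order=4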
Where To Go Next?

• Single-VBD performance remains problematic
  ๏ [3/3] Back end is single-threaded

[Diagram: fio (numjobs=1, iodepth=___, blksz=4k, rw=read) drives the 32-request ring via io_submit()/io_getevents(), through blkfront → blkback / qdisk / tapdisk → VDI; disk capable of ~400k IOPS at 4k sequential reads]

  {numjobs, iodepth, blksz}:
  {1,  1, 4k} =  15k IOPS,  15 %
  {1,  8, 4k} =  70k IOPS,  35 %
  {1, 16, 4k} = 110k IOPS,  55 %
  {1, 24, 4k} = 165k IOPS,  85 %
  {1, 32, 4k} = 190k IOPS, 100 %
  {1, 64, 4k} = 195k IOPS, 100 %
  {5, 32, 4k} = 415k IOPS,  55 % (each)
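The job behind these numbers presumably looks something like the following fio file; the device path is an assumption:

    [global]
    ; io_submit()/io_getevents(), as in the diagram
    ioengine=libaio
    direct=1
    rw=read
    bs=4k

    [vbd]
    ; hypothetical guest virtual block device
    filename=/dev/xvdb
    numjobs=1
    iodepth=32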
Where To Go Next?

• Single-VBD performance remains problematic
  ๏ [3/3] Back end is single-threaded

[Diagram: same fio job, now through the stack: VM → blkfront (32-request ring) → blkback → VDI → disk; disk capable of ~400k IOPS at 4k sequential reads]

  {numjobs, iodepth, blksz}:
  {1,  1, 4k} =  10k IOPS,  30 % (30 % in dom0)
  {1,  8, 4k} =  50k IOPS,  75 % (75 % in dom0)
  {1, 16, 4k} =  70k IOPS, 100 % (100 % in dom0)
  {1, 32, 4k} = 110k IOPS, 120 % (100 % in dom0)
  {4, 32, 4k} = 115k IOPS, 400 % (125 % in dom0)
Where To Go Next?

• Many-VBD performance could be much better:
  ๏ Both persistent grants and grant copy are interesting alternatives:
    • tapdisk3 with grant copy is network-“friendly” and has one process per VBD
    • qdisk with persistent grants does the copy on the front end
  ๏ But both add extra copies to the data path:
    • We should be avoiding copies… :-/
      - Grant operations need to scale better
      - The network retransmission issues need to be addressed

e-mail: [email protected] | freenode: felipef #xen-api | twitter: @franciozzy
Support Slides

• Usage of O_DIRECT with QDISK vs. 1 x Micron P320h
• Number of dom0 vCPUs on Creedence #87433 + blkback from 3.16
• Temperature effects on storage performance
Usage of O_DIRECT in qemu-qdisk

• qemu-qdisk
  ๏ Without O_DIRECT (default)
  ๏ Faster for small block sizes
  ๏ Faster for a single VM
  ๏ Scalability issue (investigation pending)
Usage of O_DIRECT in qemu-qdisk

• qemu-qdisk
  ๏ With O_DIRECT (directiosafe=1)
  ๏ Slower for small block sizes
  ๏ Slower for a single VM
  ๏ Scales much better
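How directiosafe=1 reaches qemu-qdisk depends on the toolstack; a sketch assuming a backend xenstore node written before the VBD is plugged (the node name and path here are assumptions that vary by QEMU/toolstack version):

    # Hypothetical: mark the qdisk backend as safe for O_DIRECT
    xenstore-write /local/domain/0/backend/qdisk/$DOMID/$DEVID/directiosafe 1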
Impact of dom0 vCPU count

• XenServer Creedence #87433
  ๏ Kernel 3.10 + internal patch queue
  ๏ blkback backported from 3.16
  ๏ LVs plugged directly to guests

• Throughput sinks with
  ๏ larger blocks
  ๏ an increased number of guests

• oprofile suggests pvspinlock operations are the bottleneck
Impact of dom0 vCPU count

• XenServer Creedence #87433
  ๏ Kernel 3.10 + internal patch queue
  ๏ blkback backported from 3.16
  ๏ LVs plugged directly to guests

• Giving fewer vCPUs to dom0 improves aggregate throughput
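The dom0 vCPU count is set on the Xen command line at boot; a sketch (the GRUB entry shown is illustrative):

    # GRUB entry (illustrative): cap dom0 at 8 vCPUs and pin them
    multiboot /boot/xen.gz dom0_max_vcpus=8 dom0_vcpus_pin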
Temperature Effects on Storage Performance

• Workload keeps pCPUs busy with large block sizes
• iDRAC Settings > Thermal > Thermal Base Algorithm set to “Maximum Performance”
Temperature Effects on Storage Performance

• Workload keeps pCPUs busy with large block sizes
• iDRAC Settings > Thermal > Thermal Base Algorithm set to “Auto”
• Effects very noticeable with 3 or more guests