XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix

DESCRIPTION

Storage systems continue to deliver better performance year after year. High performance solutions are now available off-the-shelf, allowing users to boost their servers with drives capable of achieving several GB/s worth of throughput per host. To fully utilise such devices, workloads with large queue depths are often necessary. In virtual environments, this translates into aggregate workloads coming from multiple virtual machines. Having previously addressed the impact of low latency devices in virtualised platforms, we are now aiming at optimising aggregate workloads. We will discuss the existing memory grant technologies available in Xen and compare trade-offs and performance implications of each: grant mapping, persistent grants and grant copy. For the first time, we will present grant copy as an alternative and show measurements over 7 GB/s, maxing out a set of local SSDs.

TRANSCRIPT

Page 1: XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix

XenServer Engineering Performance Team
Felipe Franciosi

Going double digits on a single host

Scaling Xen’s Aggregate Storage Performance

e-mail: [email protected] freenode: felipef #xen-api twitter: @franciozzy

Page 2: XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix

Agenda

• The dimensions of storage performance
  ๏ What exactly are we trying to measure?

• State of the art
  ๏ blkfront, blkback, blktap2+tapdisk, tapdisk3, qemu-qdisk
  ๏ trade-offs between traditional grant mapping, persistent grants, grant copy

• Aggregate measurements
  ๏ Pushing the boundaries with very, very fast local storage

• Where to go next?

Page 3: XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix

What exactly are we trying to measure?

The Dimensions of Storage Performance

Page 4: XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix

The Dimensions of Storage Performance

• You have probably seen this (a more controllable fio sketch follows this slide):

  # dd if=/dev/sda of=/dev/null bs=1M count=100 iflag=direct
  100+0 records in
  100+0 records out
  104857600 bytes (105 MB) copied, 0.269689 s, 389 MB/s

  # hdparm -t /dev/sda
  /dev/sda:
   Timing buffered disk reads: 1116 MB in 3.00 seconds = 371.70 MB/sec

• The average user will usually:
  ๏ Run a synthetic benchmark on a bare metal environment
  ๏ Repeat the test on a virtual machine
  ๏ Draw conclusions without seeing the full picture
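For workloads where the dimensions discussed next need to be pinned down explicitly, a tool such as fio exposes them on the command line. A minimal sketch, not from the talk (the device path and runtime are illustrative placeholders):

  # fio --name=seqread --filename=/dev/sda --ioengine=libaio --direct=1 \
        --rw=read --bs=1M --iodepth=1 --numjobs=1 --runtime=30 --time_based \
        --group_reporting

Sweeping --bs, --iodepth and --numjobs is what produces the curves discussed on the following slides.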

Page 5: XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix

The Dimensions of Storage Performance

[Chart: throughput as a function of log(block size)]

Page 6: XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix

The Dimensions of Storage Performance

[Chart: throughput vs. log(block size), annotated with the many other dimensions that affect the result: sequentiality, number of threads, LBA, IO depth, C/P-state configuration, temperature, IO engine, noise, readahead, direction (read/write)]

Page 7: XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix

The Dimensions of Storage Performance

• The simplest of all cases:
  ๏ single thread
  ๏ iodepth=1
  ๏ direct IO
  ๏ sequential

• Extra notes:
  ๏ BIOS perf. mode set to OS
  ๏ Fans set to maximum power
  ๏ Xen Scaling Governor set to Performance (forces P0; see the xenpm sketch after this list)
  ๏ Maximum C-State set to 1
  ๏ No pinning
  ๏ Creedence #87433

• Kernel 3.10 + Xen 4.4
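On a plain Xen host, the governor and C-state settings listed above can be applied from dom0 with xenpm; a hedged sketch (XenServer applies equivalent tuning through its own tooling, so treat this as illustrative):

  # xenpm set-scaling-governor performance
  # xenpm set-max-cstate 1
  # xenpm get-cpufreq-para 0

The first command requests the performance governor on the pCPUs (forcing P0), the second caps the deepest C-state at C1, and the third prints pCPU 0's cpufreq parameters so the change can be verified.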

Page 8: XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix

The Dimensions of Storage Performance

• Pushing boundaries a bit:
  ๏ multiple threads
  ๏ iodepth=1
  ๏ direct IO
  ๏ sequential “kind of”

• Extra notes:
  ๏ BIOS perf. mode set to OS
  ๏ Fans set to maximum power
  ๏ Xen Scaling Governor set to Performance (forces P0)
  ๏ Maximum C-State set to 1
  ๏ No pinning
  ๏ Creedence #87433

• Kernel 3.10 + Xen 4.4

Page 9: XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix

The Dimensions of Storage Performance

• Comparing dom0 vs. domU:
  ๏ single thread vs. single VM
  ๏ iodepth=1
  ๏ direct IO
  ๏ sequential

• Extra notes:
  ๏ BIOS perf. mode set to OS
  ๏ Fans set to maximum power
  ๏ Xen Scaling Governor set to Performance (forces P0)
  ๏ Maximum C-State set to 1
  ๏ No pinning
  ๏ Creedence #87433

• Kernel 3.10 + Xen 4.4

Page 10: XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix

The Dimensions of Storage Performance

• Comparing dom0 vs. domU:
  ๏ many threads vs. many VMs
  ๏ iodepth=1
  ๏ direct IO
  ๏ sequential “kind of”

• Extra notes:
  ๏ BIOS perf. mode set to OS
  ๏ Fans set to maximum power
  ๏ Xen Scaling Governor set to Performance (forces P0)
  ๏ Maximum C-State set to 1
  ๏ No pinning
  ๏ Creedence #87433

• Kernel 3.10 + Xen 4.4

Page 11: XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix

And the trade-offs between the technologies

State of the Art

Page 12: XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix

State of the Art: traditional grant mapping

[Diagram: in the domU kernel, blkfront sits under the block layer (user apps → libc/libaio → vfs → block layer) and talks to blkback in the dom0 kernel over Xen's blkif protocol (shared memory ring and event channels). Requests in the guest are associated with pages in the guest's memory space; in dom0 they are associated with foreign pages via page grants before being submitted to the block layer, the device driver and the VDI.]

• Pros:
  ๏ no copies involved
  ๏ low-latency alternative (when done in kernel)

• Cons:
  ๏ not “network-safe”
  ๏ hard on grant tables

Page 13: XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix

State of the Art: persistent grants

[Diagram: same blkfront/blkback data path over the blkif protocol, but blkfront memcpy()s data from/to a set of persistently granted pages on demand; in dom0, requests are associated with those persistent foreign page grants.]

• Pros:
  ๏ easy on grant tables
  ๏ copies on the front end (i.e. paid by the guest rather than by dom0)

• Cons:
  ๏ not “network-safe”
  ๏ copies involved

Page 14: XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix

State of the Art: tapdisk2 + blktap2 + blkback

[Diagram: blkfront still talks to blkback over the blkif protocol, but in dom0 blktap2 copies the data to local pages and exposes a TAP block device; tapdisk2 in dom0 user space then drives the actual VDI through libaio, the vfs and the block layer's device driver.]

• Pros:
  ๏ “network-safe”
  ๏ neat features (VHD)

• Cons:
  ๏ copies involved
  ๏ uses lots of memory
  ๏ hard on grant tables

Page 15: XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix

State of the Art: grant copy

[Diagram: blkfront shares requests over the blkif protocol as before, but there is no blkback; tapdisk3 in dom0 user space issues grant copy commands via the “gntdev” and event channel devices, Xen copies the data across domains into local dom0 pages, and tapdisk3 submits the IO to the VDI through libaio and the block layer.]

• Pros:
  ๏ “network-safe”
  ๏ easy on grant tables
  ๏ neat features (VHD)

• Cons:
  ๏ copies involved (back end)
  ๏ uses lots of memory

Page 16: XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix

State of the Art: technologies comparison

                      extra copies     network-safe   low-latency potential     “neat” features             easy on grant tables
  grant mapping       N                N              Y (if done in blkback)    depends (not in blkback)    N
  persistent grants   Y (front end)    N              N                         Y (qcow in qemu-qdisk)      Y
  grant copy          Y (back end)     Y              N                         Y (vhd in tapdisk3)         Y
  blktap2             Y (back end)     Y              N                         Y                           N

Page 17: XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix

Going double digits on a single host

Aggregate Measurements

Page 18: XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix

Aggregate Measurements

• Test environment:
  ๏ Dell PowerEdge R720
    • Intel Xeon E5-2643 v2 @ 3.50 GHz (2 sockets, 6 cores/socket, HT enabled)
      - Unless stated otherwise: 24 vCPUs to dom0, 2 vCPUs to each guest (see the boot-parameter sketch after this list)
    • 64 GB of RAM
      - Unless stated otherwise: 4 GB to dom0, 512 MB to each guest
    • BIOS settings:
      - Power regulators set to “Performance per Watt (OS)”
      - C-States disabled, Xen Scaling Governor set to “Performance”
  ๏ Storage:
    • 4 x Micron P320h
    • 2 x Intel P3700
    • 1 x Fusion-io ioDrive2
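The dom0 sizing above is normally fixed at boot time. A hedged sketch of the corresponding Xen command-line options, appended to the xen.gz entry in the bootloader configuration (the talk does not show its boot line and XenServer sets these through its installer, so verify the spelling against the Xen command-line documentation):

  dom0_mem=4096M,max:4096M dom0_max_vcpus=24 max_cstate=1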

Page 19: XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix

Aggregate Measurements

[Diagram: each of the 7 SSDs is its own storage repository (SR 1 … SR 7) and is carved into ten logical volumes (lv01 … lv10). Each of the ten guests (VM 01 … VM 10) gets seven virtual disks (vd 1 … vd 7), one LV from each SR: VM 01 uses lv01 of every SR, VM 02 uses lv02, and so on up to VM 10 with lv10.]
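The talk builds this layout with XenServer SRs and VDIs; as a hedged, toolstack-agnostic sketch of the same carving for one SSD (the volume-group name, LV size and device names are illustrative assumptions):

  # vgcreate vg_sr1 /dev/nvme0n1
  # for i in $(seq -w 1 10); do lvcreate -L 40G -n lv$i vg_sr1; done
  # xl block-attach vm01 phy:/dev/vg_sr1/lv01,xvdb,w

Repeating the vgcreate/lvcreate step per SSD and attaching lv<k> of every volume group to VM k reproduces the seven-disks-per-VM arrangement.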

Page 20: XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix

Aggregate Measurements

• Baseline
  ๏ Measurements from dom0
  ๏ Each line corresponds to a group of 7 threads (one for each disk)
  ๏ Some of the drives respond faster for small block sizes and a single thread

Page 21: XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix

Aggregate Measurements

• qemu-qdisk
  ๏ Persistent grants disabled
  ๏ With O_DIRECT

Page 22: XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix

Aggregate Measurements

• qemu-qdisk
  ๏ Persistent grants enabled
  ๏ With O_DIRECT
  ๏ Apparent bottleneck was the single process per VM

Page 23: XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix

Aggregate Measurements

• tapdisk2 + blktap2
  ๏ With O_DIRECT
  ๏ Using blkback from 3.10
  ๏ No persistent grants
  ๏ No indirect-IO

Page 24: XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix

Aggregate Measurements

• tapdisk2 + blktap2
  ๏ With O_DIRECT
  ๏ Using blkback from 3.16
  ๏ Persistent grants
  ๏ Indirect-IO
  ๏ Apparent bottleneck on some pvspinlock operations

Page 25: XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix

Aggregate Measurements

• blkback 3.16
  ๏ 8 dom0 vCPUs
  ๏ 6 domU vCPUs
  ๏ Persistent grants
  ๏ Indirect-IO
  ๏ Apparent bottleneck on some pvspinlock operations

Page 26: XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix

Aggregate Measurements

• tapdisk3
  ๏ Using grant copy
  ๏ With O_DIRECT
  ๏ Using libaio
  ๏ Apparent bottleneck is vCPU utilisation (see the xentop sketch after this list)
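To see where that vCPU time goes, dom0's CPU consumption can be watched live; a hedged sketch rather than the talk's actual methodology:

  # xentop -d 1
  # top -c

xentop reports per-domain CPU utilisation (dom0 included) refreshed every second, while top -c inside dom0 makes the individual tapdisk3 processes, one per VBD, easy to pick out.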

Page 27: XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix

Areas for improvement

Where To Go Next?

Page 28: XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix

Where To Go Next?

• Single-VBD performance remains problematic
  ๏ [1/3] Latency is too high

[Charts: throughput vs. log(block size), plus the IO path VM → virtualisation subsystem → vdi → disk annotated with rough per-hop latencies ranging from ~ns up to ~ms]

Page 29: XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix

Where To Go Next?

• Single-VBD performance remains problematic
  ๏ [2/3] IO depth is limited to 32

[Diagram: VM blkfront → ring of 32 requests → backend (blkback / qdisk / tapdisk) → vdi → disk]

• Are these workloads realistic?
• We can use multi-page rings! (see the ring-size sketch after this list)
• But…
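For reference, kernels that gained multi-page ring support after this talk negotiate the ring size from a blkfront module parameter; a hedged sketch (the parameter name is taken from later upstream xen-blkfront, not the 3.10/3.16 kernels used here, so treat it as an assumption to verify on your system):

  # cat /sys/module/xen_blkfront/parameters/max_ring_page_order

Raising it (e.g. xen_blkfront.max_ring_page_order=4 on the guest kernel command line) lets the shared ring hold more in-flight requests, provided the backend agrees to the larger ring.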

Page 30: XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix

Where To Go Next?

• Single-VBD performance remains problematic
  ๏ [3/3] Backend is single threaded

[Diagram: VM blkfront → ring of 32 requests → backend (qdisk / tapdisk / blkback) → vdi → disk]

fio workload: numjobs = 1, iodepth = ___, blksz = 4k, rw = read, driven through io_submit() / io_getevents() (a reproducible job-file sketch follows this table)
Raw device capability: ~400k IOPS at 4k (sequential reads)

{numjobs, iodepth, blksz}
{1,  1, 4k} =  15k IOPS,  15 %
{1,  8, 4k} =  70k IOPS,  35 %
{1, 16, 4k} = 110k IOPS,  55 %
{1, 24, 4k} = 165k IOPS,  85 %
{1, 32, 4k} = 190k IOPS, 100 %
{1, 64, 4k} = 195k IOPS, 100 %
{5, 32, 4k} = 415k IOPS,  55 % (each)
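A hedged reconstruction of that kind of job as a fio job file (the device path and runtime are placeholders, not taken from the talk):

  ; 4k sequential reads against a single VBD; vary iodepth/numjobs per run
  [global]
  ioengine=libaio
  direct=1
  rw=read
  bs=4k
  runtime=60
  time_based
  group_reporting

  [vbd]
  filename=/dev/xvdb
  numjobs=1
  iodepth=32

Running it inside the guest and sweeping iodepth (and then numjobs) reproduces the progression shown above.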

Page 31: XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix

Where To Go Next?

• Single-VBD performance remains problematic
  ๏ [3/3] Backend is single threaded

[Diagram: same path, this time through blkback: VM blkfront → ring of 32 requests → blkback → vdi → disk]

fio workload: numjobs = 1, iodepth = ___, blksz = 4k, rw = read, driven through io_submit() / io_getevents()
Raw device capability: ~400k IOPS at 4k (sequential reads)

{1,  1, 4k} =  10k IOPS,  30 % (30 % in dom0)
{1,  8, 4k} =  50k IOPS,  75 % (75 % in dom0)
{1, 16, 4k} =  70k IOPS, 100 % (100 % in dom0)
{1, 32, 4k} = 110k IOPS, 120 % (100 % in dom0)
{4, 32, 4k} = 115k IOPS, 400 % (125 % in dom0)

Page 32: XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix

Where To Go Next?

• Many-VBD performance could be much better:
  ๏ Both persistent grants and grant copy are interesting alternatives:
    • tapdisk3 with grant copy is network-“friendly” and has one process per VBD
    • qdisk with persistent grants does the copy on the front end
  ๏ But both add extra copies to the data path:
    • We should be avoiding copies… :-/
      - Grant operations need to scale better
      - The network retransmission issues need to be addressed

Page 33: XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix

e-mail: [email protected] freenode: felipef #xen-api twitter: @franciozzy

Page 34: XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix

Support Slides

• Usage of O_DIRECT with QDISK vs. 1 x Micron P320h
• Number of dom0 vCPUs on Creedence #87433 + blkback from 3.16
• Temperature effects on storage performance

Page 35: XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix

Usage of O_DIRECT in qemu-qdisk

• qemu-qdisk
  ๏ Without O_DIRECT (default)
  ๏ Faster for small block sizes
  ๏ Faster for single-VM
  ๏ Scalability issue (investigation pending)

Page 36: XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix

Usage of O_DIRECT in qemu-qdisk

• qemu-qdisk
  ๏ With O_DIRECT (directiosafe=1; see the disk-spec sketch after this list)
  ๏ Slower for small block sizes
  ๏ Slower for single-VM
  ๏ Scales much better
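For guests managed with plain xl rather than XenServer's toolstack, the same knob is exposed in the disk specification; a hedged sketch, with the target path and vdev purely illustrative:

  disk = [ 'vdev=xvdb, access=rw, backendtype=qdisk, direct-io-safe, target=/dev/vg_sr1/lv01' ]

direct-io-safe tells the qdisk backend that it is safe to open the image with O_DIRECT, trading the small-block and single-VM numbers above for the better aggregate scaling.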

Page 37: XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix

Impact of dom0 vCPU count

• XenServer Creedence #87433
  ๏ Kernel 3.10 + internal PQ
    • blkback backported from 3.16
    • LVs plugged directly to guests

• Throughput sinks with
  ๏ larger blocks
  ๏ increased number of guests

• oprofile suggests pvspinlock

Page 38: XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix

Impact of dom0 vCPU count

• XenServer Creedence #87433
  ๏ Kernel 3.10 + internal PQ
    • blkback backported from 3.16
    • LVs plugged directly to guests

• Giving fewer vCPUs to dom0
• Aggregate throughput improves

Page 39: XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix

Temperature Effects on Storage Performance

• Workload keeps pCPUs busy with large block sizes

• iDRAC Settings > Thermal > Thermal Base Algorithm
  ๏ “Maximum Performance”

Page 40: XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix

Temperature Effects on Storage Performance

• Workload keeps pCPUs busy with large block sizes

• iDRAC Settings > Thermal > Thermal Base Algorithm
  ๏ “Auto”

• Effects very noticeable with 3 or more guests