TRANSCRIPT
Minimizing I/O Latency in Xen-ARM
Seehwan Yoo, Chuck Yoo and OSVirtual
Korea University
Operating Systems Lab. http://os.korea.ac.kr
I/O Latency Issues in Xen-ARM
• Guaranteed I/O latency is essential to Xen-ARM – For mobile communication,
• Broken or missed phone calls, long network re-detection • AP (application processor) and CP (communication processor) are used to guarantee communication
• Virtualization promises transparent execution – Not only instruction execution: access isolation – But also performance: performance isolation
• In practice, I/O latency is still an obstacle – Lack of scheduler support
• Hierarchical scheduling nature i.e. hypervisor cannot impose task scheduling inside a guest OS
• The Credit scheduler does not suit time-sensitive applications – Latency due to the split driver model
• Enhances reliability, but degrades performance (w.r.t. I/O latency) • Inter-domain communication between the IDD and the user domain
• We investigate these issues, and present possible remedies
Related Work
• Credit schedulers in Xen – Task-aware scheduler (Hwanjoo et al., VEE '08)
• Inspects task execution in the guest OS • Adaptively applies the boost mechanism based on VM workload characteristics
– Laxity-based soft real-time scheduler (Min Lee et al., VEE '10) • Laxity calculated from an execution profile • Priority assigned by the remaining time to deadline (laxity)
– Dynamic core allocation (Y. Hu et al., HPDC '10) • Cores allocated by workload characteristics (driver core, fast/slow-tick cores)
• Mobile/embedded hypervisors – OKL4 microvisor: a microkernel-based hypervisor
• Has a verified kernel – from seL4 • Good performance – commercially successful • Real-time optimizations – slice donation, direct switching, reflective scheduling, threaded interrupt handling, etc.
– VirtualLogix VLX: a mobile hypervisor with real-time support
• Shared driver model – device sharing among an RT guest OS and non-RT guest OSs
• Good paravirtualization performance
Background – the Credit in Xen
• A weighted round-robin based fair scheduler • Priorities – BOOST, UNDER, OVER
– Basic principle: preserve fairness among all VCPUs
• Each vcpu periodically receives credit • Credit is debited as the vcpu consumes execution time
– Priority assignment • Remaining credit ≤ 0 → OVER (lowest) • Remaining credit > 0 → UNDER • Event-pending VCPU → BOOST (highest)
– BOOST: for providing low response time • Allows immediate preemption of the current vcpu
Fallacies about BOOST in the Credit Scheduler
• Fallacy 1) A VCPU is always boosted by an I/O event – In fact, the BOOST is sometimes ignored, because a VCPU is boosted only when boosting does not break fairness • 'Not-boosted' vcpus are observed when the vcpu is already runnable
– Example: if a user domain has a CPU job and is waiting for execution, it is not boosted – it will be executed soon anyway, and a tentative BOOST would easily break fairness
• Fallacy 2) BOOST always prioritizes the VCPU – In fact, the BOOST is easily negated, because multiple vcpus can be boosted simultaneously • 'Multi-boost' happens quite often in the split driver model
– Example 1) The driver domain has to be boosted, and then the user domain also needs to be boosted
– Example 2) The driver domain has multiple packets destined to multiple user domains; all the designated user domains are boosted
Xen-ARM’s Latency Characteristic
6
• I/O latency measured along the interrupt path – Preemption latency: until the code is preemptible – VCPU dispatch latency: until the designated VCPU is scheduled – Intra-VM latency: until I/O completion
[Figure: interrupt path from a physical interrupt through the Xen-ARM interrupt handler to the Dom0 I/O task and then the DomU I/O task, annotated with preemption latency, vcpu dispatch latency (at Dom0 dispatch and DomU dispatch), and intra-VM latency in each domain]
Observed Latency along the Interrupt Path
• I/O latency measured throughout the interrupt path – A ping request is sent from an external server to Dom1 – VM settings
• Dom0: the driver domain • Dom1: 20% CPU load + ping receiver • Dom2: 100% CPU load (CPU-burning workload)
– Xen-to-netback latency • Large worst-case latency • Dom0 vcpu dispatch latency + intra-Dom0 latency
– Netback-to-DomU latency • Large average latency • Dom1 vcpu dispatch latency
Observed VCPU Dispatch Latency: Not-Boosted VCPUs
• Experimental setting – Dom1: varying CPU workload – Dom2: CPU-burning workload – Another external host sends pings to Dom1
• Not-boosted vcpus affect I/O latency distribution
– Dom1 CPU workload 20% : almost 90% ping requests are handled within 1ms
– Dom1 CPU workload 40% : 75% ping requests are handled within 1ms
– Dom1 CPU workload 60% : 65% ping requests are handled within 1ms
– Dom1 CPU workload 80% : only 60% ping requests are handled within 1ms
• When Dom1 has more CPU load → larger I/O latency (self-disturbance by not-boosted vcpus)
[Figure: cumulative latency distribution (%) of ping handling by Dom1 CPU load (20/40/60/80%); x-axis: latency (ms), 0-14]
Observed Multi-boost
• At the hypervisor, we counted the number of “schedule out” of the driver domain
• A multi-boost is counted specifically when – the current VCPU is in the BOOST state and is unblocked when scheduled out – implying that another BOOST vcpu exists, in whose favor the current vcpu is scheduled out
• A large number of multi-boosts is observed
vcpu state   priority   sched.-out count
Blocked      BOOST      275
Blocked      UNDER      13
Unblocked    BOOST      664,190
Unblocked    UNDER      49
Intra-VM latency
• Latency from usb_irq to netback
– Schedule outs during dom0 execution
• Reasons – Dom0 does not always have the highest priority – Asynchronous I/O handling: bottom halves, softirqs, tasklets, etc.
[Figure: cumulative latency distributions (0.8-1.0) of Xen-to-usb_irq @ Dom0, Xen-to-netback @ Dom0, and Xen-to-icmp @ Dom1 over 0-80 ms; two panels, with Dom1 at 20% and at 80% CPU load + ping receiver (Dom0: IDD, Dom2: 100% CPU)]
Virtual Interrupt Preemption @ Guest OS
• Interrupt enable/disable is not a physical operation within a guest OS
– local_irq_disable() disables only virtual interrupts
• A physical interrupt can still occur – and might trigger inter-VM scheduling
• Virtual interrupt preemption – Similar to lock-holder preemption – The guest performs driver work with (virtual) interrupts disabled – A physical timer interrupt triggers inter-VM scheduling – The virtual interrupt can be received only after the domain is scheduled again (a delay as large as tens of ms)
void default_idle(void)
{
    local_irq_disable();
    if (!need_resched())
        arch_idle();      /* blocks this domain */
    local_irq_enable();
}

Sequence at the driver domain: virtual interrupts are disabled; a hardware interrupt arrives; the hypervisor only sets a pending bit (because virtual interrupts are disabled); the driver domain is then scheduled out with the IRQ still pending. Note that the driver domain performs extensive I/O operation with interrupts disabled.
Resolving I/O Latency Problems for Xen-ARM
1. Fixed priority assignment – Let the driver domain always run with DRIVER_BOOST, the highest priority
• regardless of the CPU workload • Resolves both the not-boosted-VCPU and multi-boost problems
– RT_BOOST and BOOST for real-time I/O domains
2. Virtual FIQ support – An ARM-specific interrupt optimization – Higher priority than normal IRQ interrupts – Uses the vPSR (virtual program status register)
3. Do not schedule out the driver domain while it has virtual interrupts disabled
– The critical section will finish soon, so the hypervisor should give the driver domain a chance to run
• Resolves virtual interrupt preemption
Enhanced Latency: No Dom1 VCPU Dispatch Latency
[Figure: four cumulative latency-distribution panels (Dom1 CPU load 20/40/60/80% + ping receiver; Dom0: IDD; Dom2: 100% CPU), each plotting Xen-to-netback @ Dom0 and Xen-to-icmp @ Dom1; x-axis: latency (ms), 0-80; y-axis: 0.4-1.0]
Enhanced Interrupt Latency at the Driver Domain
• VCPU dispatch latency – From the interrupt handler at Xen to the hard-IRQ handler at the IDD – Fixed-priority Dom0 vcpu – Virtual FIQ – No dispatch latency is observed!
• Intra-VM latency – From the ISR to netback at the IDD – No virtual interrupt preemption (Dom0 has the highest priority)
• Among 13M interrupts, – 56K were caught where virtual interrupt preemption happened – 8.5M preemptions occurred with the FIQ optimization
[Figure: cumulative latency distribution (0.95-1.0) comparing original vs. new Xen-to-usb_irq and Xen-to-netback at Dom0 over 0-80 ms (Dom0: IDD; Dom1: 80% CPU + ping receiver; Dom2: 100% CPU)]
Enhanced End-User Latency: Overall Result
• Over 1 million ping tests, – the fixed priority lets the driver domain run without additional latency (from inter-VM scheduling)
• Overall latency is largely reduced
– 99% of interrupts are handled within 1 ms
[Figure: cumulative latency distribution (%) of Xen-to-DomU, enhanced vs. original, over 0-60 ms; the enhanced scheduler shows clearly reduced latency]
Conclusion and Possible Future work
• We analyzed I/O latency in Xen-ARM virtual machine – Throughout the interrupt handling path in split driver model
• Two main reasons for long latency – Limitation of ‘BOOST’ in the Xen-ARM’s Credit scheduler
• Not-boosted vcpu • Multi-boost
– The driver domain's virtualized interrupt handling • Virtual interrupt preemption (akin to lock-holder preemption)
• Achieved sub-millisecond latency for 99% of network packet interrupts – DRIVER_BOOST: the highest priority, for the driver domain – Modified the scheduler so that the driver domain is not scheduled out while its virtual interrupts are disabled – Further optimizations (incl. virtual FIQ mode, softirq awareness)
• Possible future work – Multi-core consideration/extension (core allocation, etc.)
• Other scheduler integration – Tight latency guarantees for real-time guest OSs
• The remaining 1% holds the key
Thanks for your attention
Credits to OSvirtual @ oslab, KU
Appendix. Native comparison
• Comparison with the native system – no CPU workload – Slightly reduced average handling latency; largely reduced maximum
Latency (us)   Native    Orig. dom0   Orig. domU   New sched. dom0   New sched. domU
Min            375       459          575          444               584
Avg            532.52    821.48       912.42       576.54            736.06
Max            107456    100782       100964       1883              2208
Stdev          1792.34   4656.26      4009.78      41.84             45.95
[Figure: cumulative latency distribution over 300-1000 us for native ping (eth0), original dom0/domU, and new-scheduler dom0/domU]
Appendix. Fairness?
• Fairness is still good, and high utilization is achieved – CPU-burning jobs' utilization (normalized throughput)
[Figure: normalized throughput (measured / ideal) of Dom1 and Dom2 under the ideal no-I/O case, the original Credit scheduler, and the new scheduler. Setting – Dom1: 20/40/60/80/100% CPU load + ping receiver; Dom2: 100% CPU load. Note that the Credit scheduler is work-conserving]
Appendix. Latency in multi-OS env.
• Three-domain case with differentiated service – Added an RT_BOOST priority between DRIVER_BOOST and BOOST – Dom1 is assumed real-time – Dom2 and Dom3 are general-purpose
[Figure: two cumulative latency distributions (0-120000 us) for 10K ping tests at 10-40 ms intervals across three domains – left: the Credit scheduler's latency distribution; right: the enhanced scheduler's; series: Dom1, Dom2, Dom3]
Appendix. Latency in multi-OS env.
• Three-domain case with differentiated service – Added an RT_BOOST priority between DRIVER_BOOST and BOOST – Dom1 is assumed real-time – Dom2 and Dom3 are general-purpose
[Figure: bar chart (0-100%) comparing per-domain latency distributions, enhanced scheduler vs. Credit: Enh. Dom1/Dom2/Dom3 vs. Credit Dom1/Dom2/Dom3]