TRANSCRIPT
Minimizing I/O Latency in Xen-ARM
Seehwan Yoo, Chuck Yoo and OSVirtual
Korea University
Operating Systems Lab. http://os.korea.ac.kr
I/O Latency Issues in Xen-ARM
• Guaranteed I/O latency is essential to Xen-ARM – For mobile communication,
• Broken or missed phone calls, long network re-detection • AP (application processor) and CP (communication processor) are used to guarantee communication
• Virtualization promises transparent execution – Not only instruction execution: access isolation – But also performance: performance isolation
• In practice, I/O latency is still an obstacle – Lack of scheduler support
• Hierarchical scheduling nature i.e. hypervisor cannot impose task scheduling inside a guest OS
• The Credit scheduler does not suit time-sensitive applications – Latency due to the split driver model
• Enhances reliability, but degrades performance (w.r.t. I/O latency) • Inter-domain communication between the IDD and the user domain
• We investigate these issues, and present possible remedies
Related Work
• Credit schedulers in Xen – Task-aware scheduler (Hwanjoo et al., VEE '08)
• Inspects task execution in the guest OS • Adaptively applies the boost mechanism based on VM workload characteristics
– Laxity-based soft real-time scheduler (Min Lee et al., VEE '10) • Laxity calculated from an execution profile • Priority assigned by the remaining time to deadline (laxity)
– Dynamic core allocation (Y. Hu et al., HPDC '10) • Cores allocated by workload characteristics (driver core, fast/slow-tick cores)
• Mobile/embedded hypervisors – OKL4 microvisor: a microkernel-based hypervisor
• Has a verified kernel – from seL4 • Good performance – commercially successful • Real-time optimizations – slice donation, direct switching, reflective scheduling, threaded interrupt handling, etc.
– VirtualLogix VLX: a mobile hypervisor with real-time support
• Shared driver model – device sharing among an RT guest OS and non-RT guest OSs
• Good paravirtualization performance
Background – the Credit in Xen
• A weighted round-robin based fair scheduler • Priorities – BOOST, UNDER, OVER
– Basic principle: preserve fairness among all VCPUs
• Each vcpu periodically receives credit • Credit is debited as the vcpu consumes execution time
– Priority assignment • Remaining credit ≤ 0 → OVER (lowest) • Remaining credit > 0 → UNDER • Event-pending VCPU → BOOST (highest)
– BOOST: for providing low response time • Allows immediate preemption of the current vcpu
Fallacies about BOOST in the Credit Scheduler
• Fallacy 1) A VCPU is always boosted by an I/O event – In fact, the BOOST is sometimes ignored, because a VCPU is boosted only when boosting does not break fairness • 'Not-boosted' vcpus are observed when the vcpu is already runnable
– Example: if a user domain has a CPU job and is waiting for execution, it is not boosted – it will be executed soon anyway, and a tentative BOOST would easily break fairness
• Fallacy 2) BOOST always prioritizes the VCPU – In fact, the BOOST is easily negated, because multiple vcpus can be boosted simultaneously • 'Multi-boost' happens quite often in the split driver model
– Example 1) The driver domain has to be boosted, and then the user domain also needs to be boosted
– Example 2) The driver domain has multiple packets destined to multiple user domains; all the designated user domains are boosted
Xen-ARM’s Latency Characteristic
6
• I/O latency measured along the interrupt path – Preemption latency: until the code is preemptible – VCPU dispatch latency: until the designated VCPU is scheduled – Intra-VM latency: until I/O completion
[Figure: interrupt path from a physical interrupt through the Xen-ARM interrupt handler to the Dom0 I/O task and then the DomU I/O task, annotated with preemption latency, vcpu dispatch latency (at Dom0 dispatch and DomU dispatch), and intra-VM latency in each domain]
Observed Latency along the Interrupt Path
• I/O latency measured throughout the interrupt path – A ping request is sent from an external server to Dom1 – VM settings
• Dom0: the driver domain • Dom1: 20% CPU load + ping receiver • Dom2: 100% CPU load (CPU-burning workload)
– Xen-to-netback latency • Large worst-case latency • Dom0 vcpu dispatch latency + intra-Dom0 latency
– Netback-to-DomU latency • Large average latency • Dom1 vcpu dispatch latency
Observed VCPU Dispatch Latency: Not-Boosted VCPUs
• Experimental setting – Dom1: varying CPU workload – Dom2: CPU-burning workload – Another external host sends pings to Dom1
• Not-boosted vcpus affect I/O latency distribution
– Dom1 CPU workload 20% : almost 90% ping requests are handled within 1ms
– Dom1 CPU workload 40% : 75% ping requests are handled within 1ms
– Dom1 CPU workload 60% : 65% ping requests are handled within 1ms
– Dom1 CPU workload 80% : only 60% ping requests are handled within 1ms
• When Dom1 has more CPU load → larger I/O latency (self-disturbance by not-boosted vcpus)
[Figure: cumulative latency distribution (%) of ping handling by Dom1 CPU load (20/40/60/80%); x-axis: latency (ms), 0-14]
Observed Multi-boost
• At the hypervisor, we counted the number of “schedule out” of the driver domain
• A multi-boost is counted specifically when – the current VCPU is in the BOOST state and is unblocked when scheduled out – implying that another BOOST vcpu exists, in whose favor the current vcpu is scheduled out
• A large number of multi-boosts is observed
vcpu state   priority   sched.-out count
Blocked      BOOST      275
Blocked      UNDER      13
Unblocked    BOOST      664,190
Unblocked    UNDER      49
Intra-VM latency
• Latency from usb_irq to netback
– Schedule outs during dom0 execution
• Reasons – Dom0 does not always have the highest priority – Asynchronous I/O handling: bottom halves, softirqs, tasklets, etc.
[Figure: cumulative latency distributions (0.8-1.0) of Xen-to-usb_irq @ Dom0, Xen-to-netback @ Dom0, and Xen-to-icmp @ Dom1 over 0-80 ms; two panels, with Dom1 at 20% and at 80% CPU load + ping receiver (Dom0: IDD, Dom2: 100% CPU)]
Virtual Interrupt Preemption @ Guest OS
• Interrupt enable/disable is not a physical operation within a guest OS
– local_irq_disable() disables only virtual interrupts
• A physical interrupt can still occur – and might trigger inter-VM scheduling
• Virtual interrupt preemption – Similar to lock-holder preemption – The guest performs driver work with (virtual) interrupts disabled – A physical timer interrupt triggers inter-VM scheduling – The virtual interrupt can be received only after the domain is scheduled again (a delay as large as tens of ms)
void default_idle(void)
{
    local_irq_disable();
    if (!need_resched())
        arch_idle();      /* blocks this domain */
    local_irq_enable();
}

Sequence at the driver domain: virtual interrupts are disabled; a hardware interrupt arrives; the hypervisor only sets a pending bit (because virtual interrupts are disabled); the driver domain is then scheduled out with the IRQ still pending. Note that the driver domain performs extensive I/O operation with interrupts disabled.
Resolving I/O Latency Problems for Xen-ARM
1. Fixed priority assignment – Let the driver domain always run with DRIVER_BOOST, the highest priority
• regardless of the CPU workload • Resolves both the not-boosted-VCPU and multi-boost problems
– RT_BOOST and BOOST for real-time I/O domains
2. Virtual FIQ support – An ARM-specific interrupt optimization – Higher priority than normal IRQ interrupts – Uses the vPSR (virtual program status register)
3. Do not schedule out the driver domain while it has virtual interrupts disabled
– The critical section will finish soon, so the hypervisor should give the driver domain a chance to run
• Resolves virtual interrupt preemption
Enhanced Latency: No Dom1 VCPU Dispatch Latency
[Figure: four cumulative latency-distribution panels (Dom1 CPU load 20/40/60/80% + ping receiver; Dom0: IDD; Dom2: 100% CPU), each plotting Xen-to-netback @ Dom0 and Xen-to-icmp @ Dom1; x-axis: latency (ms), 0-80; y-axis: 0.4-1.0]
Enhanced Interrupt Latency at the Driver Domain
• VCPU dispatch latency – From the interrupt handler at Xen to the hard-IRQ handler at the IDD – Fixed-priority Dom0 vcpu – Virtual FIQ – No dispatch latency is observed!
• Intra-VM latency – From the ISR to netback at the IDD – No virtual interrupt preemption (Dom0 has the highest priority)
• Among 13M interrupts, – 56K were caught where virtual interrupt preemption happened – 8.5M preemptions occurred with the FIQ optimization
[Figure: cumulative latency distribution (0.95-1.0) comparing original vs. new Xen-to-usb_irq and Xen-to-netback at Dom0 over 0-80 ms (Dom0: IDD; Dom1: 80% CPU + ping receiver; Dom2: 100% CPU)]
Enhanced End-User Latency: Overall Result
• Over 1 million ping tests, – the fixed priority lets the driver domain run without additional latency (from inter-VM scheduling)
• Overall latency is largely reduced
– 99% of interrupts are handled within 1 ms
[Figure: cumulative latency distribution (%) of Xen-to-DomU, enhanced vs. original, over 0-60 ms; the enhanced scheduler shows clearly reduced latency]
Conclusion and Possible Future work
• We analyzed I/O latency in Xen-ARM virtual machine – Throughout the interrupt handling path in split driver model
• Two main reasons for long latency – Limitation of ‘BOOST’ in the Xen-ARM’s Credit scheduler
• Not-boosted vcpu • Multi-boost
– The driver domain's virtualized interrupt handling • Virtual interrupt preemption (akin to lock-holder preemption)
• Achieved sub-millisecond latency for 99% of network packet interrupts – DRIVER_BOOST: the highest priority, for the driver domain – Modified the scheduler so that the driver domain is not scheduled out while its virtual interrupts are disabled – Further optimizations (incl. virtual FIQ mode, softirq awareness)
• Possible future work – Multi-core consideration/extension (core allocation, etc.)
• Other scheduler integration – Tight latency guarantees for real-time guest OSs
• The remaining 1% holds the key
Thanks for your attention
Credits to OSvirtual @ oslab, KU
Appendix. Native comparison
• Comparison with the native system – no CPU workload – Slightly reduced average handling latency; largely reduced maximum
Latency (us)   Native    Orig. dom0   Orig. domU   New sched. dom0   New sched. domU
Min            375       459          575          444               584
Avg            532.52    821.48       912.42       576.54            736.06
Max            107456    100782       100964       1883              2208
Stdev          1792.34   4656.26      4009.78      41.84             45.95
[Figure: cumulative latency distribution over 300-1000 us for native ping (eth0), original dom0/domU, and new-scheduler dom0/domU]
Appendix. Fairness?
• Fairness is still good, and high utilization is achieved – CPU-burning jobs' utilization (normalized throughput)
[Figure: normalized throughput (measured / ideal) of Dom1 and Dom2 under the ideal no-I/O case, the original Credit scheduler, and the new scheduler. Setting – Dom1: 20/40/60/80/100% CPU load + ping receiver; Dom2: 100% CPU load. Note that the Credit scheduler is work-conserving]
Appendix. Latency in multi-OS env.
• Three-domain case with differentiated service – Added an RT_BOOST priority between DRIVER_BOOST and BOOST – Dom1 is assumed real-time – Dom2 and Dom3 are general-purpose
[Figure: two cumulative latency distributions (0-120000 us) for 10K ping tests at 10-40 ms intervals across three domains – left: the Credit scheduler's latency distribution; right: the enhanced scheduler's; series: Dom1, Dom2, Dom3]
Appendix. Latency in multi-OS env.
• Three-domain case with differentiated service – Added an RT_BOOST priority between DRIVER_BOOST and BOOST – Dom1 is assumed real-time – Dom2 and Dom3 are general-purpose
[Figure: bar chart (0-100%) comparing per-domain latency distributions, enhanced scheduler vs. Credit: Enh. Dom1/Dom2/Dom3 vs. Credit Dom1/Dom2/Dom3]