kernel recipes 2015: puma: pooling unused memory in virtual machines for i/o intensive applications
TRANSCRIPT
Puma: Pooling Unused Memory in Virtual Machines for I/O intensive applications
Maxime Lorrillere, Julien Sopena, Sébastien Monnet and Pierre Sens
contact: [email protected]
Kernel Recipes 2015
Maxime Lorrillere Puma Kernel Recipes 2015 1 / 15
Introduction Context
Problem: memory fragmentation
[Diagram, three builds: (1) two physical hosts (Host 1, Host 2) connected by 10GB Ethernet, each with applications, memory split between anonymous pages and page cache, a PFRA (page frame reclaiming algorithm) and a disk; (2) the same hosts each running a KVM hypervisor with two VMs (VM1 and VM2 on Host 1, VM3 and VM4 on Host 2), each VM with its own PFRA and cache, attached through virtio; (3) the same setup with a Swap label added.]
Virtualization allows more flexibility and isolation
Problem: it fragments available memory
⇒ Sharing resources like CPU time is straightforward
⇒ Memory cannot be reassigned as efficiently as CPU time
Introduction Related work
Solution: Memory Ballooning [OSDI’02]
[Diagram, four builds: two hosts connected by 10GB Ethernet, each running a KVM hypervisor with two VMs (VM1 and VM2 on Host 1, VM3 and VM4 on Host 2) attached through virtio; Balloon, Swap and I/O labels appear on the VMs across the builds.]
The host asks a VM to inflate its balloon to return free memory
The host asks a VM to deflate its balloon to get more memory
Limitations
⇒ page cache is still fragmented
⇒ slow to recover
Introduction Related work
Memory Ballooning – Time to recover memory
1 Make a lot of I/O on the first VM
2 Try to allocate the memory (malloc) on the second VM
[Figure: Baseline vs. Auto-ballooning]
⇒ Memory allocations are 20× slower than the baseline
⇒ When it does not crash! (OOM-kill)
Introduction Related work
Our contribution: a cooperative page cache
[Diagram: two hosts connected by 10GB Ethernet, each running a KVM hypervisor with two VMs (VM1 and VM2 on Host 1, VM3 and VM4 on Host 2); Puma instances in the VMs provide a remote page cache reached through virtio or TCP; latency labels: TCP (~100µs) and ~10ms.]
Puma’s approach:
Relies on a fast network between VMs and physical machines
Hypervisor, filesystem and block device agnostic
Handles only clean cache pages
⇒ Writes are generally non-blocking
⇒ Simple consistency scheme
⇒ Fast to recover memory!
Puma design Basics
Puma design
Local page cache eviction – put operation
[Diagram, five builds, VM1 and VM2 each with a PFRA: (1) alloc() on VM1; (2) Reclaim of page P31; (3) put(P31); (4) P31 is transferred to VM2; (5) Store page on VM2. VM1 holds a Metadata entry for 31.]
Typically triggered by a memory allocation
Puma is integrated into the PFRA to detect page cache eviction
Pages are sent asynchronously to avoid slowdowns
Remote pages are stored into the system page cache
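The put path described on this slide can be sketched in user space as follows. This is only an illustration, not Puma's kernel code: the names `PumaClient`, `RemoteCache`, `outbox` and `flush` are hypothetical, the LRU policy stands in for the PFRA, and recording the metadata entry once the page has actually been sent is an assumption. The explicit `flush()` call plays the role of the asynchronous sender, so the allocation path never waits on the network.

```python
from collections import OrderedDict, deque


class RemoteCache:
    """Hypothetical stand-in for the VM that lends its unused memory."""

    def __init__(self):
        self.pages = {}

    def store(self, page_id, data):
        self.pages[page_id] = data


class PumaClient:
    """Toy model of the put path; an LRU dict stands in for the PFRA."""

    def __init__(self, capacity, remote):
        self.capacity = capacity
        self.local = OrderedDict()  # page_id -> (data, dirty), oldest first
        self.metadata = set()       # ids of pages believed to be remote
        self.outbox = deque()       # evicted pages waiting to be sent
        self.disk = {}              # write-back target for dirty pages
        self.remote = remote

    def add_page(self, page_id, data, dirty=False):
        # Step 1: an allocation may push the cache over its capacity.
        self.local[page_id] = (data, dirty)
        self.local.move_to_end(page_id)
        while len(self.local) > self.capacity:
            self._reclaim()

    def _reclaim(self):
        # Step 2: evict the least recently used page.
        victim, (data, dirty) = self.local.popitem(last=False)
        if dirty:
            # Only clean pages are sent: write back first.
            self.disk[victim] = data
        # Step 3: queue the put instead of sending it synchronously.
        self.outbox.append((victim, data))

    def flush(self):
        # Steps 4-5: send queued pages and remember where they went.
        while self.outbox:
            page_id, data = self.outbox.popleft()
            self.remote.store(page_id, data)
            self.metadata.add(page_id)


remote = RemoteCache()
client = PumaClient(capacity=2, remote=remote)
for page_id, data in [(31, b"a"), (32, b"b"), (33, b"c")]:
    client.add_page(page_id, data)
client.flush()
```

After the three insertions, page 31 has been evicted locally, stored remotely by `flush()`, and recorded in the client's metadata.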
Puma design Basics
Puma design
Local page cache miss – get operation
[Diagram, several builds, VM1 and VM2 each with a PFRA: (1) Miss on page P24 on VM1, get(P24); (2) Hit? check against VM1's Metadata (entry 24); (3) req(P24) sent to VM2; (4) Lookup on VM2; (5) P24 is returned to VM1.]
Integrated into the page cache to detect local cache misses
A local cache miss leads to a (synchronous) get operation
Local metadata are used to know if and where a page is in the cache
Exclusive and non-inclusive caching strategies
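A minimal sketch of the get path, under stated assumptions: `RemoteStore`, `get_page` and the dictionary-based caches are hypothetical names for illustration. The two strategies are modelled according to one common reading of the terms: with the exclusive strategy the remote side drops a page once it has been fetched, and with the non-inclusive strategy it keeps its copy. Falling back to disk when the remote side no longer holds the page is also an assumption.

```python
class RemoteStore:
    """Hypothetical stand-in for the VM that holds remote pages."""

    def __init__(self):
        self.pages = {}

    def request(self, page_id, exclusive):
        # Exclusive: hand the page over and forget it.
        # Non-inclusive: keep the copy after serving it.
        if exclusive:
            return self.pages.pop(page_id, None)
        return self.pages.get(page_id)


def get_page(page_id, local, metadata, remote, disk, exclusive=True):
    """Return (data, source), where source is 'local', 'remote' or 'disk'."""
    if page_id in local:
        return local[page_id], "local"
    if page_id in metadata:                        # step 2: "Hit?"
        data = remote.request(page_id, exclusive)  # steps 3-5, synchronous
        if data is None or exclusive:
            # The remote copy is gone, so the local entry is obsolete.
            metadata.discard(page_id)
        if data is not None:
            local[page_id] = data
            return data, "remote"
    # No usable remote copy: read from the block device.
    data = disk[page_id]
    local[page_id] = data
    return data, "disk"


remote = RemoteStore()
remote.pages[24] = b"x"
local, metadata = {}, {24}
disk = {24: b"x", 7: b"y"}

first = get_page(24, local, metadata, remote, disk)   # served remotely
second = get_page(24, local, metadata, remote, disk)  # now a local hit
third = get_page(7, local, metadata, remote, disk)    # never cached: disk
```

Because the metadata is local, a miss on a page that was never sent (page 7 here) goes straight to disk without a network round trip.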
Puma design Sequential I/O
Puma design
Filtering sequential I/O
[Diagram, five builds, two panels. "VM1 - get": (1) get(P24,32) with !Hit in the Metadata; (2) Miss; (3) P24 is tagged S in the Metadata. "VM1 - put": (1) alloc(); (2) Reclaim of P24; (3) put(P24) with P24 tagged S in the Metadata; (4) final step, P24 no longer shown.]
Sequential reads are detected through the read-ahead algorithm
“Sequential pages” are tagged into the metadata
When evicted, sequential pages are simply discarded
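The filtering idea can be sketched as below. Note the swapped-in technique: the slide says detection relies on the kernel's read-ahead algorithm, whereas this sketch uses a simple run-length counter as a stand-in. The class name `SequentialFilter` and the `threshold` parameter are hypothetical.

```python
class SequentialFilter:
    """Tags pages read as part of a long run and filters them on eviction."""

    def __init__(self, threshold=4):
        self.threshold = threshold  # run length that counts as sequential
        self.last_page = None
        self.run_length = 0
        self.sequential = set()     # the "S" tags kept in the metadata

    def on_read(self, page_id):
        # Extend the current run if this page follows the previous one.
        if self.last_page is not None and page_id == self.last_page + 1:
            self.run_length += 1
        else:
            self.run_length = 1
        self.last_page = page_id
        if self.run_length >= self.threshold:
            self.sequential.add(page_id)

    def on_evict(self, page_id):
        """Return True if the page should be sent to the remote cache."""
        if page_id in self.sequential:
            # Sequential pages are simply discarded.
            self.sequential.discard(page_id)
            return False
        return True


f = SequentialFilter(threshold=4)
for page_id in range(10, 18):  # one long sequential run: pages 10..17
    f.on_read(page_id)
f.on_read(100)                 # an isolated random access
```

With this threshold, pages 13 to 17 are tagged, so `f.on_evict(15)` returns False (discard) while `f.on_evict(100)` returns True (send to the remote cache).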
Puma design Details and optimisations
Implementation details and optimisations
Response time
⇒ Puma is temporarily disabled if the response time becomes too high
Memory footprint
⇒ Metadata: amortized 64 bits/page, 2 MB of metadata per GB of cache
Memory recovery
⇒ Remote cache pages are discarded when reclaimed
Memory management: avoiding deadlocks
Atomic memory allocations
Use of pre-allocated memory pools
[Diagram: the put path (alloc(), Reclaim of P31, put(P31), transfer of P31) with a second alloc() shown inside it.]
Consistency
Dirty pages are written to disk before being sent to the cache
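The memory footprint figure can be checked with a short calculation. The 4 KiB page size is an assumption (the slide does not state it), and the helper name `metadata_bytes` is hypothetical.

```python
PAGE_SIZE = 4096             # assumption: 4 KiB pages
METADATA_BITS_PER_PAGE = 64  # amortized figure from the slide

GIB = 1024 ** 3
MIB = 1024 ** 2


def metadata_bytes(cache_bytes, page_size=PAGE_SIZE):
    """Metadata needed to track `cache_bytes` of remote cache."""
    pages = cache_bytes // page_size
    return pages * METADATA_BITS_PER_PAGE // 8


per_gib = metadata_bytes(GIB) / MIB        # 262144 pages * 8 bytes
largest = metadata_bytes(12 * GIB) / MIB   # 12 GB, the largest server size
print(per_gib, largest)  # prints "2.0 24.0"
```

One gigabyte of cache is 262,144 pages of 4 KiB, and 8 bytes per page gives the 2 MB per GB quoted above.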
Evaluation Evaluation Overview
Evaluation Overview
Experiment setup on KVM
Puma server: provides from 512 MB to 12 GB of cache
Puma client: 1 GB
Baseline: a single VM without additional cache
Hosts: Intel Xeon E5-2660v2, 5× 600GB SAS in RAID-0
Benchmarks: Filebench, BLAST, TPC-C, TPC-H, Postmark
Experiments
1 Varying workload on server side
2 Co-localised VMs with a paravirtualised network (virtio)
3 Latency injection
Evaluation Varying workload
Dynamic memory balancing
Comparison with memory ballooning
[Figure: Baseline vs. Auto-ballooning vs. Puma]
High latencies to reclaim memory with memory ballooning (avg: 20ms)
Puma allows memory to be reclaimed at a small cost (avg: 1.8ms)
Evaluation Performance evaluation
Sequential I/O filtering
Unfiltered large sequences may severely degrade performance
Filtering sequential I/O allows us to focus on random accesses
Evaluation Performance evaluation
Performance improvement on database benchmarks
I/Os are a mix of random accesses and medium sized sequences
⇒ Concurrent accesses: sequential accesses are interleaved → slow
⇒ Non-inclusive strategy: pages are kept in cache even if accessed sequentially
Evaluation Latency injection
Network latency management
Latency injection with Netem [LCA’05]
Speedup decreases as we inject network latency between nodes
When the response time is too high, Puma disables itself to avoid a performance drop
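One way to picture the self-disabling behaviour is a small latency guard. Everything here is an assumption for illustration: the slides give neither the threshold nor the averaging method nor how long Puma stays disabled, so the class `LatencyGuard`, the exponentially weighted average and the fixed cooldown are all hypothetical.

```python
class LatencyGuard:
    """Disables the remote cache for a while when responses get too slow."""

    def __init__(self, threshold_s, cooldown_s, alpha=0.5):
        self.threshold_s = threshold_s  # hypothetical latency limit
        self.cooldown_s = cooldown_s    # hypothetical disable duration
        self.alpha = alpha              # weight of the newest sample
        self.avg = None
        self.disabled_until = 0.0

    def record(self, latency_s, now):
        # Exponentially weighted moving average of response times.
        if self.avg is None:
            self.avg = latency_s
        else:
            self.avg = self.alpha * latency_s + (1 - self.alpha) * self.avg
        if self.avg > self.threshold_s:
            # Too slow: fall back to local behaviour for a while.
            self.disabled_until = now + self.cooldown_s
            self.avg = None  # start measuring afresh afterwards

    def enabled(self, now):
        return now >= self.disabled_until


guard = LatencyGuard(threshold_s=0.002, cooldown_s=5.0)
guard.record(0.0001, now=9.0)   # ~100 µs round trip: fine
guard.record(0.010, now=10.0)   # injected latency pushes the average up
```

In this run the guard reports the cache as disabled at `now=12.0` and enabled again from `now=15.0`. Passing the clock in as a parameter keeps the sketch deterministic; real code would read a monotonic clock instead.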
Conclusion
Summary
⇒ Virtualization leads to a fragmentation of the available cache
⇒ Memory ballooning techniques are not able to manage the distribution of the VMs’ page cache
Puma: Pooling Unused memory in virtual MAchines
⇒ It is based on an efficient kernel-level remote caching mechanism
⇒ It handles clean cache pages to quickly recover the memory
⇒ It works with co-localised VMs and remote VMs