kernel recipes 2015: puma: pooling unused memory in virtual machines for i/o intensive applications

Puma: Pooling Unused Memory in Virtual Machines for I/O intensive applications

Maxime Lorrillere, Julien Sopena, Sébastien Monnet and Pierre Sens
contact: [email protected]

Kernel Recipes 2015


Introduction Context

Problem: memory fragmentation

[Figure: two hosts connected by 10GB Ethernet. Each host runs a hypervisor (KVM) with two VMs (VM1 and VM2 on Host 1, VM3 and VM4 on Host 2) using virtio. Each VM runs applications and has its own memory (anonymous pages and page cache) managed by its own PFRA, with disk and swap underneath.]

Virtualization allows more flexibility and isolation

Problem: it fragments available memory
⇒ Sharing resources like CPU time is straightforward
⇒ Memory cannot be reassigned as efficiently as CPU time


Introduction Related work

Solution: Memory Ballooning [OSDI’02]

[Figure: the same two-host setup, with a balloon in each VM. Labels added across the builds: Balloon, I/O, Swap.]

The host asks a VM to inflate its balloon to return free memory
The host asks a VM to deflate its balloon to get more memory

Limitations
⇒ page cache is still fragmented
⇒ slow to recover



Memory Ballooning – Time to recover memory

1. Make a lot of I/O on the first VM
2. Try to allocate the memory (malloc) on the second VM

[Figure: two panels, Baseline and Auto-ballooning]

⇒ Memory allocations are 20× slower than the baseline
⇒ When it does not crash! (OOM-kill)



Our contribution: a cooperative page cache

[Figure: the same two-host setup with Puma running in the VMs. Labels: virtio, TCP (~100µs), Remote page cache, ~10ms.]

Puma’s approach:
Relies on a fast network between VMs and physical machines
Hypervisor, filesystem and block device agnostic
Handles only clean cache pages
⇒ Writes are generally non-blocking
⇒ Simple consistency scheme
⇒ fast to recover memory!
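The two latency labels in the figure suggest why a fast network matters. The sketch below is a back-of-the-envelope model, not a measurement from the talk: it assumes that ~100µs is the cost of fetching a page from the remote cache, that ~10ms is the cost of going to disk instead, and that a page missing from the remote cache goes straight to disk. The function name and its parameters are hypothetical.

```python
def mean_miss_latency_us(remote_hit_ratio, remote_us=100.0, disk_us=10_000.0):
    """Average cost of a local page cache miss, in microseconds.

    remote_hit_ratio: fraction of local misses served by the remote cache.
    remote_us, disk_us: assumed costs, taken from the figure labels.
    """
    if not 0.0 <= remote_hit_ratio <= 1.0:
        raise ValueError("remote_hit_ratio must be between 0 and 1")
    # Remote hits pay the network cost; everything else pays the disk cost.
    return remote_hit_ratio * remote_us + (1.0 - remote_hit_ratio) * disk_us


if __name__ == "__main__":
    for ratio in (0.0, 0.5, 0.9, 1.0):
        print(ratio, mean_miss_latency_us(ratio))
```

Under these assumptions, every additional local miss served remotely replaces a disk access with a much cheaper network round trip, so the average falls linearly with the remote hit ratio.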


Puma design Basics

Puma design
Local page cache eviction – put operation

[Figure: put operation between VM1 and VM2, shown in five steps]
1. alloc() on VM1
2. Reclaim: the PFRA on VM1 selects page P31
3. put(P31)
4. P31 is sent to VM2
5. Store page: VM2 stores P31

Typically triggered by a memory allocation
Puma is integrated into the PFRA to detect page cache eviction
Pages are sent asynchronously to avoid slowdowns
Remote pages are stored into the system page cache
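The put path can be sketched in user space as follows. This is a toy model under stated assumptions, not Puma's kernel code: `LocalCache`, `RemoteCache`, `insert`, and `flush` are hypothetical names, the local cache is a plain LRU, and asynchronous sending is modeled by a queue that is drained later rather than by a real network thread. Recording the page in the metadata at send time is also an assumption.

```python
from collections import OrderedDict, deque


class RemoteCache:
    """Stand-in for the VM that lends its memory (VM2 in the slides)."""

    def __init__(self):
        self.pages = {}

    def store(self, page_id, data):
        self.pages[page_id] = data


class LocalCache:
    """Toy LRU page cache that hands evicted pages to a remote cache."""

    def __init__(self, capacity, remote):
        self.capacity = capacity
        self.remote = remote
        self.pages = OrderedDict()  # LRU order, oldest entry first
        self.metadata = set()       # ids of pages sent to the remote cache
        self.outbox = deque()       # puts waiting to be sent

    def insert(self, page_id, data):
        # Step 1: an allocation needs room, so reclaim until there is space.
        while len(self.pages) >= self.capacity:
            self._reclaim_one()
        self.pages[page_id] = data

    def _reclaim_one(self):
        # Step 2: pick the least recently used page as the victim.
        victim_id, data = self.pages.popitem(last=False)
        # Step 3: queue the put instead of blocking the allocation.
        self.outbox.append((victim_id, data))

    def flush(self):
        # Steps 4 and 5: send queued pages; the remote side stores them.
        sent = 0
        while self.outbox:
            page_id, data = self.outbox.popleft()
            self.remote.store(page_id, data)
            self.metadata.add(page_id)
            sent += 1
        return sent
```

Keeping the send off the allocation path is the point of the queue: `insert` returns as soon as the victim is queued, and `flush` can run later.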



Puma design Basics

Puma design
Local page cache miss – get operation

[Figure: get operation between VM1 and VM2, shown in five steps]
1. Miss on P24 in VM1: get(P24)
2. Hit? VM1 checks its local metadata
3. req(P24) is sent to VM2
4. Lookup on VM2
5. P24 is returned to VM1

Integrated into the page cache to detect local cache misses
A local cache miss leads to a (synchronous) get operation
Local metadata are used to know if and where a page is in the cache
Exclusive and non-inclusive caching strategies
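The get path can be sketched the same way. Again this is an illustrative model with hypothetical names (`read_page`, `RemoteCache.request`, `read_from_disk`), not the kernel implementation. The `exclusive` flag is one reading of the two strategies named on the slide: in exclusive mode the remote copy is dropped once fetched, in non-inclusive mode it is kept.

```python
class RemoteCache:
    """Stand-in for the VM holding remote pages (VM2 in the slides)."""

    def __init__(self):
        self.pages = {}

    def request(self, page_id, exclusive=True):
        # Step 4: lookup. Returns None if the page was reclaimed meanwhile.
        if exclusive:
            return self.pages.pop(page_id, None)
        return self.pages.get(page_id)


def read_page(page_id, local_pages, metadata, remote, read_from_disk,
              exclusive=True):
    """Return (data, source), where source is 'local', 'remote' or 'disk'."""
    if page_id in local_pages:
        return local_pages[page_id], "local"
    # Steps 1 and 2: local miss, so consult the local metadata first.
    if page_id in metadata:
        # Steps 3 to 5: synchronous request to the remote cache.
        data = remote.request(page_id, exclusive)
        if exclusive or data is None:
            metadata.discard(page_id)
        if data is not None:
            local_pages[page_id] = data
            return data, "remote"
    # Not cached remotely: fall back to the block device.
    data = read_from_disk(page_id)
    local_pages[page_id] = data
    return data, "disk"
```

Checking the metadata before sending anything means a page that was never put remotely costs no network round trip.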



Puma design Sequential I/O

Puma design
Filtering sequential I/O

[Figure: two panels. "VM1 - get": get(P24,32) misses locally and is not a hit in the metadata; P24 is then tagged S (sequential) in the metadata. "VM1 - put": alloc() triggers Reclaim of P24; because P24 is tagged S, the put(P24) step ends with the page being dropped.]

Sequential reads are detected through the read-ahead algorithm
“Sequential pages” are tagged into the metadata
When evicted, sequential pages are simply discarded
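A minimal sketch of the filtering rule described above. The names `tag_readahead` and `evict` are hypothetical, and reading get(P24,32) as a read-ahead window of 32 pages starting at P24 is an assumption about the figure. The sketch does not reproduce the kernel's read-ahead heuristics; it only shows the tag-then-discard logic.

```python
def tag_readahead(first_page, window, sequential_pages):
    """Tag every page of a read-ahead window as sequential."""
    for page_id in range(first_page, first_page + window):
        sequential_pages.add(page_id)


def evict(page_id, sequential_pages, put):
    """Evict one page; return True if it was sent to the remote cache."""
    if page_id in sequential_pages:
        # Sequential pages are unlikely to be reused: drop them locally.
        sequential_pages.discard(page_id)
        return False
    put(page_id)
    return True


if __name__ == "__main__":
    tagged = set()
    sent = []
    tag_readahead(24, 32, tagged)
    print(evict(24, tagged, sent.append))   # sequential page
    print(evict(100, tagged, sent.append))  # randomly accessed page
    print(sent)
```

The effect is that large sequential scans do not flush the remote cache, which stays available for randomly accessed pages.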


Puma design Details and optimisations

Implementation details and optimisations

Response time
⇒ Puma is temporarily disabled if the response time becomes too high

Memory footprint
⇒ Metadata: amortized 64 bits/page, 2 MB of metadata per GB of cache

Memory recovery
⇒ Remote cache pages are discarded when reclaimed

Memory management: avoiding deadlocks
Atomic memory allocations
Use of pre-allocated memory pools

[Figure: the put operation (alloc(), Reclaim, put(P31)) with a second alloc() shown while P31 is being sent]

Consistency
Dirty pages are written to disk before being sent to the cache
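The memory footprint figure can be checked with a short calculation. The 4 KiB page size is an assumption (the usual x86 default), not something stated on the slide, and `metadata_bytes` is a hypothetical helper.

```python
PAGE_SIZE = 4096             # bytes per page (assumed)
METADATA_BITS_PER_PAGE = 64  # amortized cost quoted on the slide


def metadata_bytes(cache_bytes, page_size=PAGE_SIZE,
                   bits_per_page=METADATA_BITS_PER_PAGE):
    """Metadata needed to track cache_bytes worth of remote pages."""
    pages = cache_bytes // page_size
    return pages * bits_per_page // 8


if __name__ == "__main__":
    one_gib = 1 << 30
    print(metadata_bytes(one_gib) / (1 << 20), "MiB per GiB of cache")
```

With these values, 1 GiB of cache is 262,144 pages, and at 8 bytes each the metadata comes to 2 MiB, which matches the 2 MB per GB quoted above.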


Evaluation Evaluation Overview

Evaluation Overview

Experiment setup on KVM

Puma server: provides from 512 MB to 12 GB of cache
Puma client: 1 GB
Baseline: a single VM without additional cache
Hosts: Intel Xeon E5-2660v2, 5× 600GB SAS in RAID-0
Benchmarks: Filebench, BLAST, TPC-C, TPC-H, Postmark

Experiments

1. Varying workload on server side
2. Co-localised VMs with a paravirtualised network (virtio)
3. Latency injection


Evaluation Varying workload

Dynamic memory balancing
Comparison with memory ballooning

[Figure: three panels: Baseline, Auto-ballooning, Puma]

High latencies to reclaim memory with memory ballooning (avg: 20ms)
Puma reclaims memory at a small cost (avg: 1.8ms)


Evaluation Performance evaluation

Sequential I/O filtering

Unfiltered large sequences may severely degrade performance
Filtering sequential I/O allows us to focus on random accesses


Evaluation Performance evaluation

Performance improvement on database benchmarks

I/Os are a mix of random accesses and medium-sized sequences
⇒ Concurrent accesses: sequential accesses are interleaved → slow
⇒ Non-inclusive strategy: pages are kept in cache even if accessed sequentially


Evaluation Latency injection

Network latency management
Latency injection with Netem [LCA’05]

Speedup decreases as we inject network latency between nodes
When the response time is too high, Puma disables itself to avoid a performance drop
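One way to picture the self-disabling behaviour is a guard that tracks a moving average of response times. The slides do not say how Puma measures response time or which threshold it uses, so everything below is an assumption: the class name `LatencyGuard`, the exponential moving average, and the threshold and smoothing values are all hypothetical.

```python
class LatencyGuard:
    """Disable remote caching while the average response time is too high."""

    def __init__(self, threshold_us, alpha=0.25):
        self.threshold_us = threshold_us  # hypothetical cut-off
        self.alpha = alpha                # weight given to each new sample
        self.avg_us = None                # no sample seen yet

    def record(self, sample_us):
        """Fold one measured response time into the moving average."""
        if self.avg_us is None:
            self.avg_us = float(sample_us)
        else:
            self.avg_us += self.alpha * (sample_us - self.avg_us)

    @property
    def enabled(self):
        return self.avg_us is None or self.avg_us <= self.threshold_us


if __name__ == "__main__":
    guard = LatencyGuard(threshold_us=1000.0, alpha=0.5)
    for sample in (100, 5000, 100, 100):
        guard.record(sample)
        print(sample, guard.avg_us, guard.enabled)
```

In this sketch the guard only re-enables itself if samples keep arriving while it is disabled, so a caller would need to send occasional probe requests; whether Puma does this is not stated in the slides.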


Conclusion

Conclusion

Summary
⇒ Virtualization leads to a fragmentation of the available cache
⇒ Memory ballooning techniques are not able to manage the distribution of the VMs’ page cache

Puma: Pooling Unused memory in virtual MAchines
⇒ It is based on an efficient kernel-level remote caching mechanism
⇒ It handles clean cache pages to quickly recover the memory
⇒ It works with co-localised VMs and remote VMs

