fujitsu standard tool fujitsu standard tool author kamezawa created date 6/5/2012 9:09:02 pm

45
Copyright 2012 FUJITSU LIMITED Memory Cgroup what’s going on 2012.06.08 KAMEZAWA Hiroyuki [email protected]

Upload: dangdiep

Post on 23-May-2018

215 views

Category:

Documents


1 download

TRANSCRIPT

Copyright 2012 FUJITSU LIMITED

Memory Cgroup what’s going on

2012.06.08 KAMEZAWA Hiroyuki [email protected]

Copyright 2012 FUJITSU LIMITED

History

Features of memcg in linux-3.5-rc1.

Performance brief.

News and TODOs.

Agenda

1

Birth of Memory cgroup

2008.02.07 v2.6.25-rc1.

Copyright 2010 FUJITSU LIMITED 2

Way to RHEL6(2.6.32).

Some basic features are added.

Per-zone LRU, OOM handling, Swap, hierarchy support, softlimit etc…

Major committers

Balbir Singh

Daisuke Nishimura

Hugh Dickins

KOSAKI Motohiro

KAMEZAWA Hiroyuki

Copyright 2010 FUJITSU LIMITED 3

Performance fixes(…2.6.36)

In this era, many many atomic ops were removed.

Some new features.

Moving charge at task moving.

Threshold notifier (by embedded guy.)

Major commiters

Daisuke Nishimura

Kirill A. Shutemov

KAMEZAWA Hiroyuki

Copyright 2010 FUJITSU LIMITED 4

New commiters (..v3.0)

Biggest changes in this era was new skilled committers

Johannes Weiner

Michal Hocko

Ying Han (+ Google team)

Many refactoring and bug-fixes

Johannes and Michal are MAINTAINER now.

More statistics

Copyright 2010 FUJITSU LIMITED 5

Renewal implementation (..v3.5-rc)

Hard works by Johannes and Hugh.

Reduced 60% of memory overhead.

New accoutings

Tcp memcontrol

Huge Page limiting

Recent Commiters • Johannes Weiner, Hugh Dickins, Glauber Costa, Michal

Hocko, Konstantin Khlebnikov, Kirill A. Shutemov, KAMEZAWA Hiroyuki

Copyright 2010 FUJITSU LIMITED 6

Lines of codes.

7 Copyright 2010 FUJITSU LIMITED

0

1000

2000

3000

4000

5000

6000

7000

80002

.6.2

5

2.6

.26

2.6

.27

2.6

.28

2.6

.29

2.6

.30

2.6

.31

2.6

.32

2.6

.33

2.6

.34

2.6

.35

2.6

.36

2.6

.37

2.6

.38

2.6

.39 3

3.1

3.2

3.3

3.4

Lines of codes.

Lines

Memory cgroup files (RHEL6)

8 Copyright 2010 FUJITSU LIMITED

Memory cgroup files (3.5-rc1)

9 Copyright 2010 FUJITSU LIMITED

memcg git tree

Memory cgroup development tree maintained by Michal Hokko

git://github.com/mstsxfx/memcg-devel.git

Copyright 2010 FUJITSU LIMITED 10

Features of memory cgroup

Agenda

Requirements/Basics

Per-memcg memory/swap limiting

Hierarchical or non-Hierarchical(Flat)

LRU implementation

Threshold/OOM notiffier

Tcp memcontrol

Copyright 2010 FUJITSU LIMITED 11

Requirements

Works on all system CONFIG_MMU=y

Uses 16bytes/PAGE_SIZE(4096bytes) on 64bit system to record information

Copyright 2010 FUJITSU LIMITED

General setup=>

Control Group support

12

cgroup

Allows users to make a group of process via cgroup file system interface

Users can control characteristics of the group by read/write cgroup file system.

Copyright 2010 FUJITSU LIMITED

root

admin

user

Tree of cgroup

processes

Attach processes

to cgroup

13

Example: task attach

Copyright 2010 FUJITSU LIMITED

Check libcgroup

For friendly UI 14

Memory/swap limiting.

limit usage of Anon, File Cache

limit of memory+swap usage.

Copyright 2010 FUJITSU LIMITED

Only 300M usage of cache

3.6G file

15

Hierarchical/non-Hierarchical

User can choice accounting policy of tree

Copyright 2010 FUJITSU LIMITED

Usage=600M

200M 400M

400M

?

200M

?

400M

Assume memory usage

in leaf cgroups are 200M

and 400M.

What the parent’s should be ?

Hierarchical mode

0

200M

0

400M

Non Hierarchical mode

choice

16

Example: hierarchical mode.

Copyright 2010 FUJITSU LIMITED

Memory usage is propagated.

Set hierarchical mode.

17

Example: non-hierarhical mode

Copyright 2010 FUJITSU LIMITED

Unset hierarchical mode

No propagation to the parent

Under non-hierarchila mode, parent/child cgroup

are independent from each other. 18

LRU

Memory cgroup reclaims memory when the usage hit limits.

All pages are tracked by linked list, LRU.

At reclaiming memory, memory cgroup scans s its own LRU list and select victim pages.

Copyright 2010 FUJITSU LIMITED

Usage hits

the limit

Linked list of pages

Scan And find

unused pages drop

page

Linked list

19

LRU implementation(1)

Before 3.3…..LRU was Duplicated.

Copyright 2010 FUJITSU LIMITED

Global per-zone-page-LRU-list scans all pages.

LRU of memcg A

LRU of memcg B

LRU of memcg C

Memory cgroup scans its own LRU.

This consumes

16bytes/page 20

LRU implemenations(2)

Since 3.3, LRUs are unified.

Copyright 2010 FUJITSU LIMITED

LRU of memcg A

LRU of memcg B

LRU of memcg C

Global LRU is re-implemented as a group of all

per-memcg-per-zone-LRU linked list.

This saves 16bytes/page. 21

Threshold notification

Notify via eventfd when the usage crosses the specified value.

Copyright 2010 FUJITSU LIMITED

threshold

usage Notify via eventfd

poll()

Check: Documentation/cgroup/cgroup_event_listener.c

22

Example: threshold

Copyright 2010 FUJITSU LIMITED

Wait for usage reaching 300M bytes

on TestCgroup

23

OOM block/notifier

Memory cgroup can

Block OOM-Kill under a memcg

Notification of OOM via eventfd

If OOM-Killer is blocked

All tasks under the cgroup will stop.

Tasks run again if • Some resources are freed.

• Get signal

• Limit is raised.

• A task is moved to other cgroup

Copyright 2010 FUJITSU LIMITED 24

Example: OOM

Copyright 2010 FUJITSU LIMITED

Set limt and wait for OOM

Check status and kill by hand

All tasks are stopped

kill

Run again

25

tcp memcontrol.

Controls memory usage for TCP (3.3) If memory usage hits limit….

• INPUT: packets will be dropped.

•OUTPUT: wait for available memory.

Copyright 2010 FUJITSU LIMITED

For now, this works independent from other memory controls for anon,file,swap

26

Performance

Comparison between 2.6.32 and 3.4

Transparent Hugepage is disabled.

Check overheads of memory cgroup.

create a tree /cgroup/memory/L0/L1/L2/L3/L4/L5/L6/L7/L8/L9/L10with use_hierarchy=1

Run mini-benchmark on root,L1,L2,L10.

Copyright 2010 FUJITSU LIMITED 27

Overhead of hierarchy

If use_hierarchy=1, need to update several counters at once.

Copyright 2010 FUJITSU LIMITED

root

L0

L1

L10

Root cgroup is out of control. No counters

In this test, All usage in L1…L10 are propagated up to L0. Then, the number of counters and Overheads will be big in deep groups.

28

tar -xpf

tar –xpf linux-3.5.tar onto tmpfs. Checking file cache creation overheads in ‘sys’ time

Flat graph is better.

Copyright 2010 FUJITSU LIMITED

4core/1socket

0

0.1

0.2

0.3

0.4

0.5

0.6

2.14

2.15

2.16

2.17

2.18

2.19

2.2

2.21

2.22

2.23

root L0 L1 L10

2.6.32

3.4.1

3.4

2.6.32

29

rm -rf

# rm –rf linux-3.5 on tmpfs

Checking file cache deletion overheads in ‘sys’.

Flat graph is better.

Copyright 2010 FUJITSU LIMITED

4core/1socket

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.63

0.64

0.65

0.66

0.67

0.68

0.69

0.7

0.71

0.72

0.73

root L0 L1 L10

2.6.32

3.4.1

30

Parallel page faults

Causing page fault in parallel.

Copyright 2010 FUJITSU LIMITED

page fault

unmap

process

page fault

unmap

process

page fault

unmap

process

page fault

unmap

process

page fault

unmap

process

Lock contention

in memory cgroup

Each process touch/free 256k mem.

31

Parallel page faults(1)

Copyright 2010 FUJITSU LIMITED

0

5000000

10000000

15000000

20000000

25000000

30000000

35000000

L0L1

L2L10

speed of parallel page faults under memcg # of pagefaults per proc per min.

2.6-t4 3.4-t4 2.6-t3 3.4-t3 2.6-t2 3.4-t2 2.6-t1 3.4-t1

depth

# of proc

1

4

4core/1socket

32

0

5,000,000

10,000,000

15,000,000

20,000,000

25,000,000

30,000,000

35,000,000

40,000,000

45,000,000

50,000,000

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

root

L0

2.2TB alloc/free /min

1.9TB alloc/free /min

Parallel page faults (2)

Copyright 2010 FUJITSU LIMITED

10core/2sockets Pagefaults/proc/min on linux-3.4

# of procs

175Gb alloc/free per min

1.1TB alloc/free /min

33

News & TODO

Many things are still in TODO list…

Split locks.

Hugetlb controls

Kmem(slab) controls

Softlimit renewal

•Idle memcg.

Dirty Throttling

Reduce memory overhead(16bytes)

Per-memcg kswapd.

Copyright 2010 FUJITSU LIMITED 34

Split locks

Now, per-zone-memcg-lru list is maintained.

But, lock is now shared among memcgs.

Implementation is very complicated.

Hugh and Konstantin working on this.

Copyright 2010 FUJITSU LIMITED

LRU of memcg A

LRU of memcg B

LRU of memcg C

lock

35

Hugetlb Controls

Controls for hugetlbfs

Implements per cgroup hugetlbfs quota.

The development was almost finished but the author started re-designing because of rjections by other commiters.

Maybe hugetlbfs cgroup will be added rather than enhancing memcg.

Aneesh is working on this.

Copyright 2010 FUJITSU LIMITED 36

Kmem(slab) controls

Function to account/limit kernel memory

Under development for a half year…

Supporting slab/slub

Discuss: per-page accounting v.s. per-objs

The biggest concern is performance.

•Isolation & Performance…..challenge.

Costa & Suleiman works on this.

•Seems co-operative with Christoph Lameter

Copyright 2010 FUJITSU LIMITED 37

Soft limit renewal

Now, memory cgroup has softlimt

for hinting the system memory reclaim priority.

Some people complaints

isolation using softlimit doesn’t work well.

it doesn’t work as expected..

Total re-implementation is planned.

Ying Han, Johannes, Michal working on this.

Copyright 2010 FUJITSU LIMITED 38

Idle memcg

In softlimit discussion..

Copyright 2010 FUJITSU LIMITED

root

cg1 cg2 cg3

cg1-1

Memory reclaim

• If soft limit is set, kswapd will choose victim cgroup by it. • What happens when a cgroup has been idle for a long time but it doesn’t hit softlimit ?

39

Dirty Throttoling

In todo list since 3 years ago…

Without this :

background write-out doesn’t start enough quick.

Memory recalim will see too much dirty pages.

Patches were posted several times.

New implementation of I/O less dirty throttoling

Need updates.

Copyright 2010 FUJITSU LIMITED 40

Reduce memory overhead.

Now, using 16bytes/page (onx86-64)

4Mbytes/ 1Gbytes.

Copyright 2010 FUJITSU LIMITED

Seems it’s possible to make this 8bytes/page.

merge this into ‘unused’ 8bytes in ‘struct page’…

41

Per-memcg kswapd

Now, memory cgroup has no kswapd.

Run kthread for reduce memory usage when the usage is near to the limit.

Move cpu usage for memory reclaiming from applications to kthread.

By this, memory will be reclaimed in ‘idle’ time and applications will get better latency. • Should be able to control cpu prio of kthread for kswapd?

Old patch shown good result.

Waiting for lock-splitting patches for avoid contention.

Copyright 2010 FUJITSU LIMITED 42

Copyright 2010 FUJITSU LIMITED 43

Parallel Page fault /touch 2M+THP(3)

Increase bufsize as 2M and enable THP.

Copyright 2010 FUJITSU LIMITED

0

50000

100000

150000

200000

250000

300000

350000

400000

450000

500000

0

5000000

10000000

15000000

20000000

25000000

30000000

35000000

40000000

45000000

1 3 5 7 9 11 13 15 17 19

root+4kpage

root+THP

L0 + THP

44