doxlon november 2016: facebook engineering on cgroupv2

25
cgroupv2: Linux’s new unified control group system Chris Down ([email protected]) Production Engineer, Web Foundation

Upload: outlyer

Post on 11-Apr-2017

725 views

Category:

Technology


1 download

TRANSCRIPT

Page 1: DOXLON November 2016: Facebook Engineering on cgroupv2

cgroupv2: Linux’s new unified controlgroup system

Chris Down ([email protected])Production Engineer, Web Foundation

Page 2: DOXLON November 2016: Facebook Engineering on cgroupv2

cgroupv2: Linux’s new unified control group system Chris Down ([email protected])

About me

■ Chris Down■ Production Engineer at Facebook■ Working in Web Foundation■ Working on cgroupv2↔ OS integration, especially inside Facebook■ Dealing with web server and backend service reliability■ Managing >100,000 machines at scale■ Building tools to manage and maintain Facebook’s fleet of machines■ Debugging production incidents, with a focus on Linux internals

Page 3: DOXLON November 2016: Facebook Engineering on cgroupv2

cgroupv2: Linux’s new unified control group system Chris Down ([email protected])

In this talk

■ A short intro to control groups and where/why they are used■ cgroup(v1), what went well, what didn’t■ Why a new major version/API break was needed■ Fundamental design decisions in cgroupv2■ New features and improvements in cgroupv2■ State of cgroupv2, what’s ready, what’s not

Page 4: DOXLON November 2016: Facebook Engineering on cgroupv2

cgroupv2: Linux’s new unified control group system Chris Down ([email protected])

What are cgroups?

■ cgroup == control group■ System for resource management on Linux■ Provides directory hierarchy as main point of interaction (at /sys/fs/cgroup)■ Limit, throttle, manage, and account for resource usage per control group■ Each resource interface is provided by a controller

Page 5: DOXLON November 2016: Facebook Engineering on cgroupv2

cgroupv2: Linux’s new unified control group system Chris Down ([email protected])

Practical uses

■ Isolating core workload from background resource needs■ Web server vs. system processes (eg. Chef, metric collection, etc)■ Time critical work vs. long-term asynchronous jobs

■ Ensuring machines with multiple workloads (eg. containers) do not allow one workloadto overpower the others

Page 6: DOXLON November 2016: Facebook Engineering on cgroupv2

cgroupv2: Linux’s new unified control group system Chris Down ([email protected])

How did this work in cgroupv1?

cgroupv1 has a hierarchy per-resource, for example:

% ls /sys/fs/cgroupcpu/ cpuacct/ cpuset/ devices/ freezer/memory/ net_cls/ pids/

Each resource hierarchy contains cgroups for this resource:

% find /sys/fs/cgroup/pids -type d/sys/fs/cgroup/pids/background.slice/sys/fs/cgroup/pids/background.slice/async.slice/sys/fs/cgroup/pids/workload.slice

Page 7: DOXLON November 2016: Facebook Engineering on cgroupv2

cgroupv2: Linux’s new unified control group system Chris Down ([email protected])

How did this work in cgroupv1?

■ Separate hierarchy/cgroups foreach resource

■ Even if they have the samename, cgroups for eachresource are distinct

■ cgroups can be nested insideeach other

/sys/fs/cgroup

resource A

cgroup 1

cgroup 2

...

resource B

cgroup 3

cgroup 4

...

resource C

cgroup 5

cgroup 6

...

Page 8: DOXLON November 2016: Facebook Engineering on cgroupv2

cgroupv2: Linux’s new unified control group system Chris Down ([email protected])

How did this work in cgroupv1?

■ Limits and accounting are performed per-cgroup■ For example, you might set memory.limit_in_bytes in cgroup 3, if resource B is“memory”.

/sys/fs/cgroup

resource A

cgroup 1

pid 1 pid 2

resource B

cgroup 3

pid 3 pid 4

resource C

cgroup 5

pid 2 pid 3

Page 9: DOXLON November 2016: Facebook Engineering on cgroupv2

cgroupv2: Linux’s new unified control group system Chris Down ([email protected])

How did this work in cgroupv1?

■ One PID is in exactly one cgroup per resource■ For example, PID 2 is in separate cgroups for resource A and C, but in the root cgroupfor resource B since it’s not explicitly assigned

/sys/fs/cgroup

resource A

cgroup 1

pid 1 pid 2

resource B

cgroup 3

pid 3 pid 4

resource C

cgroup 5

pid 2 pid 3

Page 10: DOXLON November 2016: Facebook Engineering on cgroupv2

cgroupv2: Linux’s new unified control group system Chris Down ([email protected])

Hierarchy in cgroupv1

/sys/fs/cgroup

memory/ service 1 memory.limit_in_bytes

cpu/ service 1 cpu.shares

blkio/ service 2 blkio.throttle.read_iops_device

Page 11: DOXLON November 2016: Facebook Engineering on cgroupv2

cgroupv2: Linux’s new unified control group system Chris Down ([email protected])

How does this work in cgroupv2?

cgroupv2 has a unified hierarchy, for example:

% ls /sys/fs/cgroupbackground.slice/ workload.slice/

Each cgroup can support multiple resource domains:

% ls /sys/fs/cgroup/background.sliceasync.slice/ foo.mount/ cgroup.subtree_controlmemory.high memory.max pids.current pids.max

Page 12: DOXLON November 2016: Facebook Engineering on cgroupv2

cgroupv2: Linux’s new unified control group system Chris Down ([email protected])

How does this work in cgroupv2?

■ cgroups are “global” now —not limited to one resource

■ Resources are now opt-in forcgroups

/sys/fs/cgroup

cgroup 1

cgroup 2

...

cgroup 3

cgroup 4

...

cgroup 5

cgroup 6

...

Page 13: DOXLON November 2016: Facebook Engineering on cgroupv2

cgroupv2: Linux’s new unified control group system Chris Down ([email protected])

Hierarchy in cgroupv1

/sys/fs/cgroup

memory/ service 1 memory.limit_in_bytes

cpu/ service 1 cpu.shares

blkio/ service 2 blkio.throttle.read_iops_device

Page 14: DOXLON November 2016: Facebook Engineering on cgroupv2

cgroupv2: Linux’s new unified control group system Chris Down ([email protected])

Hierarchy in cgroupv2

/sys/fs/cgroup

service 1

memory.high ∗

cpu.weight

cgroup.subtree_controlmemory

cpu

service 2io.max (riops key)

cgroup.subtree_control io

∗ will discuss about this vs. memory.max later

Page 15: DOXLON November 2016: Facebook Engineering on cgroupv2

cgroupv2: Linux’s new unified control group system Chris Down ([email protected])

Fundamental differences between v1 and v2

■ Unified hierarchy — resources apply to cgroups now■ Granularity at PID, not TID level■ Focus on simplicity/clarity over ultimate flexibility

Page 16: DOXLON November 2016: Facebook Engineering on cgroupv2

cgroupv2: Linux’s new unified control group system Chris Down ([email protected])

v2 improvements: Tracking of non-immediate charges

Some types of resources are not immediately chargeable (eg. page cache writeback,network packets going in/out)

In v1:■ These resources are not able to be tied to a cgroup■ Charged to the root cgroup for each resource type, so are essentially unlimited

In v2:■ Resources spent in page cache writebacks and network are charged to the correctcgroup

■ Can be considered as part of cgroup limits

Page 17: DOXLON November 2016: Facebook Engineering on cgroupv2

cgroupv2: Linux’s new unified control group system Chris Down ([email protected])

v2 improvements: Communication with backing subsystems

In v1:■ Most actions for non-share based resources reacted violently to hitting thresholds■ For example, in the memory cgroup, the only option was to OOM kill or freeze

In v2:■ Many cgroup controllers inform subsystems before problems occur■ Subsystems can take action to remedy to avoid violent action (eg. forced direct reclaimwith memory.high, window size manipulation on reclaim failure)

■ Much easier to deal with temporary spikes in a resource’s usage

Page 18: DOXLON November 2016: Facebook Engineering on cgroupv2

cgroupv2: Linux’s new unified control group system Chris Down ([email protected])

v2 improvements: Consistency between controllers

In v1:■ Controllers often have inconsistent interfaces■ Some controller hierarchies inherit values from parents, some don’t■ Controllers that have similar limiting methods (eg. io/cpu) have inconsistent APIs

In v2:■ Our crack team of API Design ExpertsTM have ensured your sheer delight∗

∗ well, at least we can pray we didn’t screw it up too badly

Page 19: DOXLON November 2016: Facebook Engineering on cgroupv2

cgroupv2: Linux’s new unified control group system Chris Down ([email protected])

v2 improvements: Some reasonable configurations are now possible

In v1:■ Some limitations could not be fixed due to backwards compatibility (eg. memory limittypes)

■ memory.{,kmem.,kmem.tcp.,memsw.,[...]}limit_in_bytes

In v2:■ Less iterative, more designed up front■ In the case of memory limit types, we now have universal thresholds(memory.{high,max})

Page 20: DOXLON November 2016: Facebook Engineering on cgroupv2

cgroupv2: Linux’s new unified control group system Chris Down ([email protected])

Current support

■ Controllers merged into mainline kernel■ I/O■ Memory■ PID

■ Controllers still pending merge■ CPU (thread: https://bit.ly/cgroupv2cpu)

Disagreements: Process granularity, constraints around pid placement in cgroups

■ Controllers still being worked on■ Freezer

Page 21: DOXLON November 2016: Facebook Engineering on cgroupv2

cgroupv2: Linux’s new unified control group system Chris Down ([email protected])

Chris, is Facebook really using cgroupv2 in production?

Yes! Really!■ Currently rolled out to a non-negligible percentage of web servers■ Running managed with systemd (see Davide Cavalca’s “Deploying systemd at scale” talkat systemd.conf 2016)

■ Already getting better results on spiky workloads■ Better separation of workload services from system services (eg. service routing,metric collection)

Page 22: DOXLON November 2016: Facebook Engineering on cgroupv2

cgroupv2: Linux’s new unified control group system Chris Down ([email protected])

How can I get it?

cgroupv2 is stable since Linux 4.5. Here are some useful kernel commandline flags:■ systemd.unified_cgroup_hierarchy=1

■ systemd will mount /sys/fs/cgroup as cgroupv2■ Available from systemd v226 onwards

■ cgroup_no_v1=all■ The kernel will disable all v1 cgroup controllers■ Available from Linux 4.6 onwards

Mounting (if your init system doesn’t do it for you): % mount -t cgroup2 none /sys/fs/cgroup

Page 23: DOXLON November 2016: Facebook Engineering on cgroupv2

cgroupv2: Linux’s new unified control group system Chris Down ([email protected])

People at Facebook working on cgroupv2

■ FB kernel team working on upstreaming improvements and new features■ FB operating systems team and Web Foundation working on OS integration &production testing

■ Tupperware (scheduler & containerisation) team working on rollout to all containerusers

Page 24: DOXLON November 2016: Facebook Engineering on cgroupv2

cgroupv2: Linux’s new unified control group system Chris Down ([email protected])

Still have questions?

■ cgroupv2 has great user-facing docs: https://bit.ly/cgroupv2■ I’ll be around for questions over pizza😃

Page 25: DOXLON November 2016: Facebook Engineering on cgroupv2