DOXLON November 2016: Facebook Engineering on cgroupv2
TRANSCRIPT
cgroupv2: Linux’s new unified control group system
Chris Down ([email protected])
Production Engineer, Web Foundation
About me
■ Chris Down
■ Production Engineer at Facebook
■ Working in Web Foundation
■ Working on cgroupv2 ↔ OS integration, especially inside Facebook
■ Dealing with web server and backend service reliability
■ Managing >100,000 machines at scale
■ Building tools to manage and maintain Facebook’s fleet of machines
■ Debugging production incidents, with a focus on Linux internals
In this talk
■ A short intro to control groups and where/why they are used
■ cgroup(v1): what went well, what didn’t
■ Why a new major version/API break was needed
■ Fundamental design decisions in cgroupv2
■ New features and improvements in cgroupv2
■ State of cgroupv2: what’s ready, what’s not
What are cgroups?
■ cgroup == control group
■ System for resource management on Linux
■ Provides a directory hierarchy as the main point of interaction (at /sys/fs/cgroup)
■ Limit, throttle, manage, and account for resource usage per control group
■ Each resource interface is provided by a controller
Practical uses
■ Isolating core workload from background resource needs
■ Web server vs. system processes (eg. Chef, metric collection, etc)
■ Time-critical work vs. long-term asynchronous jobs
■ Ensuring machines with multiple workloads (eg. containers) do not allow one workload to overpower the others
How did this work in cgroupv1?
cgroupv1 has a hierarchy per-resource, for example:
% ls /sys/fs/cgroup
cpu/  cpuacct/  cpuset/  devices/  freezer/
memory/  net_cls/  pids/
Each resource hierarchy contains cgroups for this resource:
% find /sys/fs/cgroup/pids -type d
/sys/fs/cgroup/pids/background.slice
/sys/fs/cgroup/pids/background.slice/async.slice
/sys/fs/cgroup/pids/workload.slice
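For instance, limits in v1 are applied by writing to controller files inside one of these per-resource directories. A sketch, assuming root privileges and the workload.slice cgroup from the listing above:

```
% echo 64 > /sys/fs/cgroup/pids/workload.slice/pids.max
% echo $$ > /sys/fs/cgroup/pids/workload.slice/cgroup.procs
% cat /sys/fs/cgroup/pids/workload.slice/pids.max
64
```

Writing a PID to cgroup.procs moves that process into the cgroup; the shell (and its descendants) can now never exceed 64 tasks.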
How did this work in cgroupv1?
■ Separate hierarchy/cgroups for each resource
■ Even if they have the same name, cgroups for each resource are distinct
■ cgroups can be nested inside each other
/sys/fs/cgroup
├── resource A
│   ├── cgroup 1
│   ├── cgroup 2
│   └── ...
├── resource B
│   ├── cgroup 3
│   ├── cgroup 4
│   └── ...
└── resource C
    ├── cgroup 5
    ├── cgroup 6
    └── ...
How did this work in cgroupv1?
■ Limits and accounting are performed per-cgroup
■ For example, you might set memory.limit_in_bytes in cgroup 3, if resource B is “memory”
/sys/fs/cgroup
├── resource A
│   └── cgroup 1
│       └── pid 1, pid 2
├── resource B
│   └── cgroup 3
│       └── pid 3, pid 4
└── resource C
    └── cgroup 5
        └── pid 2, pid 3
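Concretely, a v1 memory limit is just a byte count written into the memory hierarchy. A sketch, assuming root and the hypothetical workload.slice name used earlier:

```
% mkdir /sys/fs/cgroup/memory/workload.slice
% echo $((512 * 1024 * 1024)) > /sys/fs/cgroup/memory/workload.slice/memory.limit_in_bytes
% cat /sys/fs/cgroup/memory/workload.slice/memory.limit_in_bytes
536870912
```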
How did this work in cgroupv1?
■ One PID is in exactly one cgroup per resource
■ For example, PID 2 is in separate cgroups for resource A and C, but in the root cgroup for resource B since it’s not explicitly assigned
/sys/fs/cgroup
├── resource A
│   └── cgroup 1
│       └── pid 1, pid 2
├── resource B
│   └── cgroup 3
│       └── pid 3, pid 4
└── resource C
    └── cgroup 5
        └── pid 2, pid 3
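This per-resource membership is visible in /proc/&lt;pid&gt;/cgroup, one line per mounted v1 hierarchy; a process that was never explicitly assigned shows / (the root cgroup) for that resource. For example, for the current process:

```shell
# Each line is hierarchy-id:controller:path, eg. "4:pids:/workload.slice".
# A path of "/" means the process sits in that resource's root cgroup.
cat /proc/self/cgroup
```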
Hierarchy in cgroupv1
/sys/fs/cgroup
├── memory/
│   └── service 1
│       └── memory.limit_in_bytes
├── cpu/
│   └── service 1
│       └── cpu.shares
└── blkio/
    └── service 2
        └── blkio.throttle.read_iops_device
How does this work in cgroupv2?
cgroupv2 has a unified hierarchy, for example:
% ls /sys/fs/cgroup
background.slice/  workload.slice/
Each cgroup can support multiple resource domains:
% ls /sys/fs/cgroup/background.slice
async.slice/  foo.mount/  cgroup.subtree_control
memory.high  memory.max  pids.current  pids.max
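Resources being opt-in means a parent must enable a controller in its cgroup.subtree_control before its children get that controller’s interface files. A sketch, assuming root and the hypothetical background.slice cgroup from above:

```
% cat /sys/fs/cgroup/cgroup.controllers
io memory pids
% echo '+memory +pids' > /sys/fs/cgroup/cgroup.subtree_control
% cat /sys/fs/cgroup/background.slice/cgroup.controllers
memory pids
```

cgroup.controllers lists what is available to a cgroup; writing +name/-name to the parent’s cgroup.subtree_control toggles what the children can use.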
How does this work in cgroupv2?
■ cgroups are “global” now, not limited to one resource
■ Resources are now opt-in for cgroups
/sys/fs/cgroup
├── cgroup 1
│   ├── cgroup 2
│   └── ...
├── cgroup 3
│   ├── cgroup 4
│   └── ...
└── cgroup 5
    ├── cgroup 6
    └── ...
Hierarchy in cgroupv2
/sys/fs/cgroup
├── service 1
│   ├── memory.high ∗
│   ├── cpu.weight
│   └── cgroup.subtree_control: memory cpu
└── service 2
    ├── io.max (riops key)
    └── cgroup.subtree_control: io
∗ we’ll discuss this vs. memory.max later
Fundamental differences between v1 and v2
■ Unified hierarchy: resources apply to cgroups now
■ Granularity at PID, not TID, level
■ Focus on simplicity/clarity over ultimate flexibility
v2 improvements: Tracking of non-immediate charges
Some types of resources are not immediately chargeable (eg. page cache writeback, network packets going in/out).
In v1:
■ These resources cannot be tied to a cgroup
■ They are charged to the root cgroup for each resource type, so are essentially unlimited
In v2:
■ Resources spent on page cache writeback and networking are charged to the correct cgroup
■ These charges can be counted as part of cgroup limits
v2 improvements: Communication with backing subsystems
In v1:
■ Most actions for non-share-based resources reacted violently when thresholds were hit
■ For example, in the memory cgroup, the only options were to OOM kill or freeze
In v2:
■ Many cgroup controllers inform subsystems before problems occur
■ Subsystems can take remedial action to avoid violent outcomes (eg. forced direct reclaim with memory.high, window size manipulation on reclaim failure)
■ Much easier to deal with temporary spikes in a resource’s usage
v2 improvements: Consistency between controllers
In v1:
■ Controllers often have inconsistent interfaces
■ Some controller hierarchies inherit values from parents, some don’t
■ Controllers that have similar limiting methods (eg. io/cpu) have inconsistent APIs
In v2:
■ Our crack team of API Design Experts™ have ensured your sheer delight∗
∗ well, at least we can pray we didn’t screw it up too badly
v2 improvements: Some reasonable configurations are now possible
In v1:
■ Some limitations could not be fixed due to backwards compatibility (eg. memory limit types)
■ memory.{,kmem.,kmem.tcp.,memsw.,[...]}limit_in_bytes
In v2:
■ Less iterative, more designed up front
■ In the case of memory limit types, we now have universal thresholds (memory.{high,max})
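The difference between the two thresholds: memory.high is the soft limit above which the kernel aggressively reclaims from the cgroup, while memory.max is the hard limit at which the OOM killer engages. A sketch with a hypothetical workload.slice cgroup, as root:

```
% echo 1G > /sys/fs/cgroup/workload.slice/memory.high
% echo 2G > /sys/fs/cgroup/workload.slice/memory.max
% cat /sys/fs/cgroup/workload.slice/memory.high
1073741824
```

Both files take a byte value or the literal string max (the default, meaning no limit).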
Current support
■ Controllers merged into the mainline kernel
  ■ I/O
  ■ Memory
  ■ PID
■ Controllers still pending merge
  ■ CPU (thread: https://bit.ly/cgroupv2cpu)
  ■ Disagreements: process granularity, constraints around PID placement in cgroups
■ Controllers still being worked on
  ■ Freezer
Chris, is Facebook really using cgroupv2 in production?
Yes! Really!
■ Currently rolled out to a non-negligible percentage of web servers
■ Running managed with systemd (see Davide Cavalca’s “Deploying systemd at scale” talk at systemd.conf 2016)
■ Already getting better results on spiky workloads
■ Better separation of workload services from system services (eg. service routing, metric collection)
How can I get it?
cgroupv2 has been stable since Linux 4.5. Here are some useful kernel command line flags:
■ systemd.unified_cgroup_hierarchy=1
  ■ systemd will mount /sys/fs/cgroup as cgroupv2
  ■ Available from systemd v226 onwards
■ cgroup_no_v1=all
  ■ The kernel will disable all v1 cgroup controllers
  ■ Available from Linux 4.6 onwards
Mounting (if your init system doesn’t do it for you):
% mount -t cgroup2 none /sys/fs/cgroup
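To check whether the running kernel can mount cgroupv2 at all, look for the cgroup2 filesystem type:

```shell
# "nodev cgroup2" appears on Linux 4.5 and later;
# "nodev cgroup" alone means only v1 is available.
grep cgroup /proc/filesystems
```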
People at Facebook working on cgroupv2
■ FB kernel team working on upstreaming improvements and new features
■ FB operating systems team and Web Foundation working on OS integration & production testing
■ Tupperware (scheduler & containerisation) team working on rollout to all container users
Still have questions?
■ cgroupv2 has great user-facing docs: https://bit.ly/cgroupv2
■ I’ll be around for questions over pizza 😃