containers @ google
DESCRIPTION
Slides from our presentation at the SF Bay Area Large Scale Production Engineering meetup on Lightweight Containers.
TRANSCRIPT
Google Confidential and Proprietary
Victor Marmol ([email protected])
Rohit Jnagal ([email protected])
Let Me Contain That For You
Containers @ Google
SF Bay Area Large-Scale Production Engineering: Lightweight Containers Meetup
February 20, 2014
Containers in the Wild
● Used to provide VM-like instances
● High density (lower costs) and high performance
● Fast to start
● Migration is hard, but possible
[Figure: User 1, User 2, User 3, and User 4 above a shared Linux Kernel]
The Need for Isolation: A Shared Google Machine
[Figure: a shared machine. Labels: I/O:CPU:Mem Sensitive Task, Front End Task, Back End Task, Alloc, and BACKGROUND TASKS (System Daemons, Batch workload, Soaker workload)]
Containers @ Google
● Container-aware tasks use asymmetric subcontainers
● Provide different guarantees of quality of service
● Overcommit resources to achieve high utilization
● Early users, few namespaces, and near-zero overhead
[Figure: container hierarchy on the Linux Kernel. Labels: Alloc 1, Task 1, Task 2, Sub 1 to Sub 4, SS1 to SS4]
Asymmetric Isolation
Isolating only certain resources (e.g., CPU but not memory).
[Figure: Container 1, Container 2, and Container 3 shown against CPU, Memory, and Net]
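One way to picture asymmetric isolation is as a per-container record in which each resource limit is optional. The sketch below is illustrative only: the `Isolation` class, its field names, and the values assigned to the three containers are assumptions made for this example and do not come from the slide or from lmctfy.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Isolation:
    """Hypothetical per-container limits; None means the resource is not isolated."""
    cpu_millicores: Optional[int] = None
    memory_bytes: Optional[int] = None
    net_kbps: Optional[int] = None


def isolated_resources(spec: Isolation) -> List[str]:
    """Return the names of the resources this container actually isolates."""
    return [name for name, value in vars(spec).items() if value is not None]


# Illustrative assignment: each container isolates a different subset.
containers = {
    "Container 1": Isolation(cpu_millicores=1000),
    "Container 2": Isolation(cpu_millicores=500, memory_bytes=4_096_000),
    "Container 3": Isolation(memory_bytes=1_024_000, net_kbps=100),
}

for name, spec in containers.items():
    print(name, "isolates", isolated_resources(spec))
```

Leaving a field unset is what makes the isolation asymmetric: Container 1 here is constrained on CPU but shares memory and network with its neighbors.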
Containers @ Google Today
● Historically
  ○ 2004: No isolation
  ○ 2006: Cgroups
  ○ Now: Namespaces
● Primarily Linux cgroups + user-space policies and monitoring
● We skipped VMs due to high overhead
● Used everywhere: SaaS, PaaS, IaaS; Android, Chrome OS
● Heterogeneous workloads: Latency, bandwidth, and priority
● High task churn
Goals
● Isolation
  ○ Tasks do not impact each other
  ○ The behavior of a Task is the same regardless of what else is on the machine
● Predictability
  ○ Tasks behave the same each time they run
  ○ Unless they are specifically configured to use "slack"
● Quality of Service
  ○ Different tasks get different quality of resources
● Overcommitment
  ○ Oversell machine resources within QoS guarantees
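The overcommitment goal can be sketched as two checks on a machine: guaranteed amounts must fit within capacity, while the sum of upper limits may exceed it. The sketch assumes that a guarantee maps to a per-task `reservation` and that the oversold amount maps to a per-task `limit`; the `Task` class, both function names, and all numbers are hypothetical and are not taken from the talk.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    name: str
    reservation: int  # amount the task is guaranteed (assumed unit: bytes)
    limit: int        # most the task may use when slack is available


def guarantees_fit(tasks: List[Task], capacity: int) -> bool:
    """True when every guaranteed reservation can be honored at once."""
    return sum(t.reservation for t in tasks) <= capacity


def overcommit_ratio(tasks: List[Task], capacity: int) -> float:
    """Sum of limits divided by capacity; above 1.0 means the machine is oversold."""
    return sum(t.limit for t in tasks) / capacity


tasks = [
    Task("front_end", reservation=1_024_000, limit=4_096_000),
    Task("batch", reservation=512_000, limit=4_096_000),
]
capacity = 4_096_000

print(guarantees_fit(tasks, capacity))    # reservations total 1,536,000
print(overcommit_ratio(tasks, capacity))  # limits total 8,192,000
```

In this example the reservations fit comfortably while the limits add up to twice the capacity, which is the kind of overselling that stays within the guarantees.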
lmctfy: Let Me Contain That For You
Open source containers stack based on Google’s.
github.com/google/lmctfy/
Provides the Container abstraction to higher levels by abstracting away the kernel interfaces.
Motivation
● Existing code, systems, and design around containers
● Problems with LXC
  ○ No abstraction (direct knob exposure)
  ○ No easy way to access programmatically
lmctfy: Let Me Contain That For You
Objectives
● Abstract away enforcement: separate policy from enforcement
● Scalability and parallel access
● Intent-based container specifications
● Asymmetric isolation
● Subcontainer support
● Provides tiers of quality of service
System Layers
● CL1
  ○ Container abstraction and enforcement
  ○ Thin and light layer
  ○ Current lmctfy
● CL2
  ○ Sets policy (QoS, overcommitment)
  ○ Higher level logic, monitoring, and control loops
  ○ Stateful entity
lmctfy: Fine-tuned resource isolation
Current cgroup API is complicated with lots of knobs (each a cgroup file):
Common: 5+ files
  cgroup.clone_children cgroup.event_control cgroup.procs notify_on_release release_agent
CPU: 8+ files
  cpuacct.stat cpuacct.usage cpuacct.usage_percpu cpu.cfs_period_us cpu.cfs_quota_us cpu.rt_period_us cpu.rt_runtime_us cpu.shares cpu.stat
Memory: 12+ files
  memory.failcnt memory.force_empty memory.limit_in_bytes memory.max_usage_in_bytes memory.move_charge_at_immigrate memory.numa_stat memory.oom_control memory.pressure_level memory.soft_limit_in_bytes memory.stat memory.swappiness memory.usage_in_bytes memory.use_hierarchy
Cpuset: 12+ files
  cpuset.cpu_exclusive cpuset.cpus cpuset.mem_exclusive cpuset.mem_hardwall cpuset.memory_migrate cpuset.memory_pressure cpuset.memory_pressure_enabled cpuset.memory_spread_page cpuset.memory_spread_slab cpuset.mems cpuset.sched_load_balance cpuset.sched_relax_domain_level
+DiskIO+Net+...
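To get a feel for how many knobs each controller exposes, the file names can be grouped by the prefix before the first dot. This is a minimal sketch: the function name `group_by_controller` is made up for the example, and the input is a short hand-picked subset of the files named on the slide. On a real machine the names could instead come from listing a cgroup directory, whose mount point varies by system.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_controller(filenames: Iterable[str]) -> Dict[str, List[str]]:
    """Group cgroup control files by the prefix before the first dot."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in filenames:
        prefix, dot, _rest = name.partition(".")
        # Files without a dot (notify_on_release, release_agent) are common files.
        key = prefix if dot else "cgroup"
        groups[key].append(name)
    return dict(groups)


# A subset of the file names listed on the slide.
files = [
    "cgroup.procs", "notify_on_release", "release_agent",
    "cpu.shares", "cpu.cfs_quota_us", "cpuacct.usage",
    "memory.limit_in_bytes", "memory.soft_limit_in_bytes",
    "cpuset.cpus", "cpuset.mems",
]

for controller, names in group_by_controller(files).items():
    print(f"{controller}: {len(names)} files")
```

Grouping by prefix keeps `cpuacct` separate from `cpu`, whereas the slide counts both under CPU.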
Released 0.4.0 (This Week!)
Initial version of lowest layer
● Written entirely in C++
● Delivered as a CLI and a C++ library (C and Go bindings soon)
● Isolation for CPU, memory, and perf event
● Full support for subcontainers
● “Stateless” and lightweight
● Initial support for namespaces, more to come in the next week.
Can be augmented with custom kernel patches
● CPU latency and accounting
● OOM priority
Supported configurations
● Target configuration is well supported
● Designed to be flexible, but we test on a limited set of them
● More target configurations being added
● Contributions to add more are welcome
Container Specifications
message ContainerSpec {
  optional int64 owner = 1;
  optional CpuSpec cpu = 2;
  optional MemorySpec memory = 3;
  optional DiskIoSpec diskio = 4;
  optional NetworkSpec network = 5;
  optional VirtualHost virtualhost = 6;
  ...
}
message CpuSpec {
  optional SchedulingLatency scheduling_latency = 1;
  optional uint64 limit = 2;
  optional uint64 max_limit = 3;
  ...
}
Create: “cpu:<limit:1000 max_limit:2000> memory:<limit:4096000 reservation:1024000>”
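The spec string passed to Create has a simple two-level shape, so it can be read with a small hand-written parser. The sketch below handles only the flat `name:<key:value ...>` form shown on the slide. It is not a general protobuf text-format parser and is not how lmctfy itself reads specs; `parse_spec` is a name invented for this example.

```python
import re
from typing import Dict

# Matches one top-level group such as "cpu:<limit:1000 max_limit:2000>".
_GROUP = re.compile(r"(\w+):<([^<>]*)>")


def parse_spec(text: str) -> Dict[str, Dict[str, str]]:
    """Parse a two-level spec string into {resource: {field: value}}.

    Values are kept as strings; nested groups are not supported.
    """
    spec: Dict[str, Dict[str, str]] = {}
    for resource, body in _GROUP.findall(text):
        fields: Dict[str, str] = {}
        for pair in body.split():
            key, _, value = pair.partition(":")
            fields[key] = value
        spec[resource] = fields
    return spec


spec = parse_spec(
    "cpu:<limit:1000 max_limit:2000> "
    "memory:<limit:4096000 reservation:1024000>"
)
print(spec["cpu"]["max_limit"])
print(spec["memory"]["reservation"])
```

Because each resource is its own group, a spec that omits a resource simply yields no entry for it.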
Cgroup Specifications
Create: “cpu:<limit:1000 max_limit:2000 scheduling_latency:PRIORITY> memory:<limit:4096000 reservation:1024000>”
Equivalent lxc cgroup config:
  lxc.cgroup.cpu.shares = 2048
  lxc.cgroup.cpu.cfs_period_us = 50000
  lxc.cgroup.cpu.cfs_quota_us = 10000
  lxc.cgroup.cpu.lat = 25
  .. cpu performance knobs ..
  lxc.cgroup.memory.limit_in_bytes = 4096000
  lxc.cgroup.memory.soft_limit_in_bytes = 1024000
  .. memory performance knobs ..
C++ API
::containers::lmctfy::ContainerApi
● Create
● Get
● Destroy
● Detect
● InitMachine
::containers::lmctfy::Container
● Update
● Run
● Notifications
● List (threads, PIDs, and subcontainers)
● Stats
● Pause/Resume
● KillAll
CLI is a thin wrapper around the C++ API
Container Names
Path-like hierarchy of container names:
  Absolute: /parent/self
  Relative: self when in /parent
Container Name                     Refers To
/                                  The root top-level container
/sys                               The sys top-level container
/sys/sub                           The sub subcontainer of the sys top-level container
. or ./                            The current container (current relative to the calling process)
..                                 The parent container (parent relative to the calling process)
./foo_container or foo_container   The foo_container subcontainer of the current container
/foo_container                     The foo_container top-level container
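Because container names behave like file paths, the naming rules can be sketched with ordinary POSIX path handling. The function `resolve_name` below is a hypothetical helper written for this example, not part of lmctfy. It takes a name and the absolute name of the calling process's container, and returns the absolute name it refers to.

```python
import posixpath


def resolve_name(name: str, current: str) -> str:
    """Resolve a container name to an absolute name.

    `current` is the absolute name of the calling process's container.
    """
    if not current.startswith("/"):
        raise ValueError("current container must be an absolute name")
    # join() keeps `name` unchanged when it is already absolute;
    # normpath() then collapses "." and ".." components.
    return posixpath.normpath(posixpath.join(current, name))


print(resolve_name("/sys/sub", "/parent"))
print(resolve_name(".", "/parent"))
print(resolve_name("..", "/parent/self"))
print(resolve_name("./foo_container", "/parent"))
print(resolve_name("foo_container", "/parent"))
```

Absolute names ignore the current container entirely, while `./foo_container` and `foo_container` resolve to the same subcontainer.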
Roadmap
Towards Version 1.0
● Improve VirtualHost support
● Root file systems
● Checkpoint restore
● Support and target most major distros
● Fully compatible with Docker’s use of containers
Higher Layer
● Admission control and feasibility checks
● Monitoring, notifications, and statistics
● Tiers of quality of service guarantees
Contributions Welcome!
Repository: https://github.com/google/lmctfy/
Mailing list: [email protected]
Victor Marmol: [email protected]
Rohit Jnagal: [email protected]
Questions?