Linux Control Groups
Abstract: Managing computer resources (i.e. processes/tasks) is an interesting topic because it presents the user with an opportunity hard to ignore.
This area, traditionally handled deep inside the kernel: scheduling processes, deciding how much CPU time a task is allotted, freezing a task, shaping network throughput, limiting how much memory a process can allocate, or managing disk I/O, is now exposed through a hierarchical arrangement called control groups, or cgroups, as I will refer to them from now on.
Linux control groups, as the name suggests, act as containers for the tasks running inside them.
The behaviour of the tasks inside a container is controlled through parameters exposed by the Linux kernel when the user creates the environment. The user can create that environment directly with system calls (i.e. mount/mkdir) or through a user-space library. The library comes with helper programs and daemons that can automatically move tasks into specified groups. Another user-space project which relies on cgroups is LXC (Linux Containers), which provides a virtual system, a concept extremely popular but totally different in implementation from virtualization. Tasks run inside different containers and obey the rules defined for that container.
/                        <- root cgroup
|-- /group1
|   |-- task1
|   |-- task2
|   |-- task3
|   `-- task4
`-- /group2
    `-- /group2/group3
        |-- task50
        `-- task51
Fig 1. Two sub-groups in a hierarchical arrangement
2. An Immediate Consequence
Day-to-day usage of a modern kernel has shown that resource-hungry tasks can trigger, under certain constraints, all kinds of unpredictable situations which can lead to dramatic consequences for the running system. One such situation happens with a modern web client: suppose you want to watch a video and at the same time you are doing video/image editing. Although every web client differs in implementation, the "de facto" standard for viewing videos is through a plug-in, and that plug-in is, under every client (and, for that matter, every operating system), a CPU/memory-hogging task.
Now, what if you want the task doing the image editing to always have a higher priority, or to run entirely on another CPU? With the help of cgroups you can group the parent task of your web client to run on CPU #0 and the task doing the editing to run on CPU #1. But that's not all. You can also specify which memory banks (on Non-Uniform Memory Access machines) a task can use, and how much memory that process can allocate.
2.1 Controllers and Hierarchical Management
Most distribution-provided kernels are somewhat impaired and do not enable access to all controllers by default. The user will need to enable (i.e. recompile) the specific controller in the kernel in order to have access to its features. A controller is a subsystem that makes use of the task-grouping facilities provided by cgroups to treat groups of tasks in a particular way.
Controller | Specs
-----------|------------------------------------------------------
cpu        | specify how much CPU time a process will have
cpuset     | specify the CPU(s) / memory banks where a task can run
ns         | name-spaces
memory     | memory management, OOM, isolation
blkio      | disk I/O
cpuacct    | CPU accounting
net_cls    | tag network packets for traffic shapers (user-space tools such as tc)
devices    | specify access/restrictions to devices
freezer    | freeze (suspend) a task
Fig 2. Controllers and definitions
The hierarchy can be seen as a set of cgroups arranged in a tree-like structure, such that each task in the system is in exactly one of the cgroups in the hierarchy. At any time there may be multiple active hierarchies of task cgroups, and each hierarchy is a partition of all tasks in the system. Finally, according to the above terminology, a control group associates a set of tasks with a set of parameters.
The Linux kernel exports the cgroups feature through virtual file systems. Each time a user or tool wants access to a certain controller, it needs to mount a virtual file system. You can separate controllers on different mount points, or you can have all the controllers share the same mount; which you choose really depends on how you want to manage those controllers.
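As a sketch, the two layouts might be set up like this (the /cgroup paths are my own choice, the commands need root, and the controllers must be compiled into the kernel):

```shell
# One hierarchy with all available controllers attached (Fig. 4 layout):
mkdir -p /cgroup
mount -t cgroup cgroup /cgroup

# ...or, alternatively, one controller per mount point (Fig. 3 layout):
mkdir -p /cgroup/memory /cgroup/blkio
mount -t cgroup -o memory memory /cgroup/memory
mount -t cgroup -o blkio  blkio  /cgroup/blkio
```

A controller can be attached to only one hierarchy at a time, so these two layouts are alternatives, not steps to run together.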
/cgroup (root group)
|-- /cgroup/memory
|   `-- /task1
|-- /cgroup/blkio
|   `-- /task2
|-- /cgroup/cpu
|   `-- /task3
`-- /cgroup/cpuset
    `-- /task4
Fig 3. Separate groups based on controllers.
/cgroup (root group)
|-- /cgroup/task1
|   `-- /task11
|-- /cgroup/task2
|   `-- /task22
|-- /cgroup/task3
|   `-- /task33
`-- /cgroup/task4
    `-- /task44
Fig 4. Subsequent groups inherit the root group.
The difference between Fig. 3 and Fig. 4 lies in how you want to divide or aggregate your tasks. The first is achieved by mounting every controller separately, while the second is done by mounting the root group with all the controllers at once. The hierarchy is extended automatically each time you issue a mkdir system call inside the virtual file system. So, in the first case, a new sub-group created inside "/cgroup/memory" inherits its parent's attributes, which are specific to the controller mounted at that location (i.e. memory in this case). In the other case, a new sub-group created in "/cgroup/task1" will have access to all the subsystems mounted by its parent (which are inherited from the root group).
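Assuming the memory controller is mounted at /cgroup/memory (the group name "group1" is illustrative), creating and populating a sub-group is just file-system manipulation:

```shell
# A mkdir inside the mounted hierarchy creates a sub-group; the
# kernel populates it with the controller's tuning files:
mkdir /cgroup/memory/group1
ls /cgroup/memory/group1        # memory.limit_in_bytes, tasks, ...

# Writing a PID into the "tasks" file moves that task into the
# group; $$ is the current shell's own PID:
echo $$ > /cgroup/memory/group1/tasks
```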
3. Imposing Limitations
3.1 CPU and CPUSET Controllers
These controllers are used to specify which CPU(s) and which memory bank a task can run on, but you can also tune how much CPU share is allotted to a particular group.
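Returning to the web-client example, a minimal sketch (the mount points, the group name "editing", and the EDIT_PID variable are assumptions on my part):

```shell
# Pin the group to CPU #1 and memory node 0; cpuset.cpus and
# cpuset.mems must be initialised before any task can join:
mkdir /cgroup/cpuset/editing
echo 1 > /cgroup/cpuset/editing/cpuset.cpus
echo 0 > /cgroup/cpuset/editing/cpuset.mems
echo "$EDIT_PID" > /cgroup/cpuset/editing/tasks

# With the cpu controller, cpu.shares (default 1024) sets the
# relative CPU time; 2048 gives this group twice the default weight:
mkdir /cgroup/cpu/editing
echo 2048 > /cgroup/cpu/editing/cpu.shares
```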
3.2 Memory Controller
The memory controller can be used to cap the amount of memory a particular task may allocate. Two distinct figures are available: RSS (resident memory) and SWP (swap memory). Summed together they give the total amount of memory a process can allocate inside the sub-group. The accounting is split in two because allocating memory does not mean you can already reference addresses at that location: if you never "use" the allocated memory (with operations such as memcpy), it is not yet backed by a valid physical page (the translation being done by the MMU). This is the SWP field reported by user-space tools such as top, or by applications that rely on libproc to query system information.
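A sketch of capping a group's memory (the group name and the 64/128 MiB figures are arbitrary; the memory.memsw.* files are only present when swap accounting is enabled in the kernel):

```shell
# Limit resident memory to 64 MiB:
echo $((64 * 1024 * 1024)) > /cgroup/memory/group1/memory.limit_in_bytes

# Limit memory + swap to 128 MiB, so at most 64 MiB may spill to swap:
echo $((128 * 1024 * 1024)) > /cgroup/memory/group1/memory.memsw.limit_in_bytes

# Current usage can be read back at any time:
cat /cgroup/memory/group1/memory.usage_in_bytes
```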
Another tunable parameter is the Out-of-Memory killer, which can be disabled inside a sub-group: even if your task keeps trying to allocate memory, it will not be killed by the kernel. The OOM-killer switch, combined with another kernel parameter (vm.oom_kill_allocating_task), can restrict killing to only those tasks that try to allocate memory inside that sub-group, so the user has total control over the tasks running inside the container.
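Disabling the OOM killer for a group might look like this (group name assumed; with the killer off, a task that exceeds its limit is paused rather than killed):

```shell
# Writing 1 disables the OOM killer inside the group:
echo 1 > /cgroup/memory/group1/memory.oom_control
cat /cgroup/memory/group1/memory.oom_control    # shows oom_kill_disable

# System-wide, the sysctl mentioned above makes the kernel kill the
# allocating task itself instead of picking a victim by heuristic:
sysctl -w vm.oom_kill_allocating_task=1
```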
3.3 Disk IO Controller — Classes and Throttle Policies
Until recently (the arrival of 2.6.37), the only method of imposing limitations on block devices was by defining classes. A class can take values between 100 and 1000 and applies only to the device as a whole, as shown in Fig. 5.
/ (root group) -- (blkio.weight = 1000)
|-- /class1
|   |-- task1
|   `-- blkio.weight: 100
`-- /class2
    |-- task2
    `-- blkio.weight: 900
Fig 5. Two different classes for a blkio controller
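The Fig. 5 layout could be created like this (assuming the blkio controller is mounted at /cgroup/blkio):

```shell
# Proportional I/O weights, valid range 100..1000:
mkdir /cgroup/blkio/class1 /cgroup/blkio/class2
echo 100 > /cgroup/blkio/class1/blkio.weight   # ~10% of the bandwidth
echo 900 > /cgroup/blkio/class2/blkio.weight   # ~90% of the bandwidth
```

The weights are relative, so the split each group gets is its weight divided by the sum of the competing weights.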
As seen in Fig. 5, the sub-group "class1" can use only 10 percent of the bandwidth, while tasks running under the "class2" sub-group can use almost the full bandwidth available for that block device. With a newer kernel, on the other hand, you can tune, per type of I/O action (i.e. read/write), which block device is affected and how much bandwidth it may use.
/ (root group)
|-- /class1
|   |-- task1
|   `-- blkio.throttle.read_bps_device: MAJOR:MINOR 414304
`-- /class2
    |-- task1
    `-- blkio.throttle.write_bps_device: MAJOR:MINOR 8194304
Fig 6. Assigning absolute values for different IO actions.
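A sketch of the throttle policy in use (the device numbers 8:0 for /dev/sda and the byte rates here are illustrative; check `ls -l /dev/sda` for the real major:minor pair):

```shell
# Cap reads for class1 at 4 MiB/s on device 8:0:
echo "8:0 $((4 * 1024 * 1024))" > /cgroup/blkio/class1/blkio.throttle.read_bps_device

# Cap writes for class2 at 8 MiB/s on the same device:
echo "8:0 $((8 * 1024 * 1024))" > /cgroup/blkio/class2/blkio.throttle.write_bps_device
```

Unlike the weights, these are absolute ceilings in bytes per second, enforced per device and per I/O direction.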
4. Final Words
Some of these features still have an experimental aura: the throttle policy only entered the mainline kernel in the last quarter of 2010, so it might show unexpected behaviour, as might the memory constraints when the OOM killer kicks in while a process does massive memory allocations. All of this will obviously continue to evolve, and it is unarguable that the features and possibilities presented by control groups can help both the user and the system administrator when managing physical resources in conformity with data resources (i.e. data in/data out).