Camilo Bermúdez

Control Groups and Yoga

In a previous post I tried to explain, successfully I hope, the basic concepts behind control groups in a very simplified way. Then I figured it would be interesting to play around a little with the real thing and put the results here for posterity, in case one day I find myself in desperate need to remember. First, let’s have a quick look at the interface we’re about to deal with.

The Interface

The control groups interface, like others in the kernel, is exposed through a virtual file system in which a hierarchy (remember hierarchies?) is mapped to a hierarchy of directories, and at least one subsystem (you remember subsystems, right?) has to be attached to this file system at mount time. All of this mounting and attaching can be done with a single mount command:

[Image: mount cgroup]

In this example, had I run this command, I would have mounted a hierarchy at the /cgroup/memory mount point and attached the memory subsystem to it. I just didn’t need to, because my system already mounts this and other hierarchies by default at boot time, as you will see.
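For reference, here is a minimal sketch of what such a mount looks like on a cgroup v1 system. The helper name mount_cgroup_hierarchy and the DRY_RUN switch are made up for illustration; only the mount syntax itself is standard, and running it for real requires root.

```shell
# Sketch: mount a cgroup v1 hierarchy with a single subsystem attached.
# mount_cgroup_hierarchy and DRY_RUN are hypothetical names used for illustration.
mount_cgroup_hierarchy() {
    subsystem="$1"      # e.g. memory
    mount_point="$2"    # e.g. /cgroup/memory
    run=""
    # With DRY_RUN=1 the commands are only printed, not executed.
    if [ "${DRY_RUN:-0}" = "1" ]; then run="echo"; fi
    $run mkdir -p "$mount_point"
    $run mount -t cgroup -o "$subsystem" "$subsystem" "$mount_point"
}

# Usage (as root): mount_cgroup_hierarchy memory /cgroup/memory
```

Setting DRY_RUN=1 before the call prints the two commands instead of running them, which is a safe way to see what would happen.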

I can list all the subsystems available in my Linux installation by using the lssubsys tool:

[Image: lssubsys]

The option -a lists all subsystems, attached or not, and the option -m additionally shows the mount point of every subsystem that is currently attached. As you can see, all of the subsystems available in my installation are attached to hierarchies under the path /sys/fs/cgroup; that’s the path under which the hierarchies get mounted at boot time.
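If lssubsys is not installed, roughly the same information can be read from /proc. The sketch below assumes a cgroup v1 system; the function names and the optional file argument are illustrative only, while /proc/cgroups and /proc/mounts are the standard kernel files.

```shell
# Sketch: list enabled cgroup subsystems without lssubsys.
# list_enabled_subsystems is a hypothetical helper; the optional argument
# (default /proc/cgroups) only exists so it can be tried on a sample file.
list_enabled_subsystems() {
    # Columns in /proc/cgroups: subsys_name hierarchy num_cgroups enabled
    awk '!/^#/ && $4 == 1 { print $1 }' "${1:-/proc/cgroups}"
}

# Sketch: show where cgroup hierarchies are mounted (mount point and options).
list_cgroup_mounts() {
    # Columns in /proc/mounts: device mount_point fs_type options dump pass
    awk '$3 == "cgroup" || $3 == "cgroup2" { print $2, $4 }' "${1:-/proc/mounts}"
}

# Usage: list_enabled_subsystems; list_cgroup_mounts
```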

Within these mapped hierarchies every sub-directory represents a concrete control group and the ordinary files within every such sub-directory represent parameters and state associated with the control group that the sub-directory represents.

[Image: create cgroup]

In this example I created a control group called test-cgroup within the hierarchy to which the memory subsystem is attached, and immediately a bunch of files appeared within it. Those are input and output files used by the subsystem and by control groups itself to report state and to configure my control group, and they are all there because the kernel kindly and automagically filled in the gaps for me with default values where necessary.

If I want to alter this new control group in any way, all I have to do is write into the input files that were just created for me. If I want to add a process to this control group, all I have to do is append the process id to the file named tasks; every control group has one of those.
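The three operations described so far (create a control group, write a parameter, add a process) can be sketched as small shell helpers. The function names and the example limit of 64M are made up for illustration; on a real cgroup v1 system the hierarchy would be something like /sys/fs/cgroup/memory and the commands need root.

```shell
# Sketch: create a control group, set one parameter, add a process to it.
# All function names here are hypothetical.
create_cgroup() {
    hierarchy="$1"; name="$2"
    mkdir -p "$hierarchy/$name"          # the kernel populates the new directory
}

set_cgroup_param() {
    cgroup_dir="$1"; param="$2"; value="$3"
    echo "$value" > "$cgroup_dir/$param"
}

add_pid_to_cgroup() {
    cgroup_dir="$1"; pid="$2"
    echo "$pid" >> "$cgroup_dir/tasks"   # one process id per write
}

# Usage (as root, cgroup v1):
#   create_cgroup /sys/fs/cgroup/memory test-cgroup
#   set_cgroup_param /sys/fs/cgroup/memory/test-cgroup memory.limit_in_bytes 64M
#   add_pid_to_cgroup /sys/fs/cgroup/memory/test-cgroup "$$"
```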

The RHEL documentation covers in detail the configuration and output of the subsystems that are built in by default on most Linux distributions.

Finally, to delete the newly created control group you can use the cgdelete tool, which on Ubuntu is included in the cgroup-tools package (a plain rmdir on the empty control group directory works too):

[Image: install cgroup-tools]

then,

[Image: cgdelete]

And it’s gone. cgdelete is invoked by giving it the subsystem and the control group in the form cgdelete subsystem:/relative-path-of-the-cgroup.

Controlling Stress

The preceding was an intentionally deceitful title. By now the least you should have learned is that this is not a blog about yoga or any other stress reduction technique. When I say stress I mean this really cool workload generator I just recently discovered. I think it was developed by Amos Waterland at Harvard, and I found it quite useful for my purposes.

To make this work you need to install stress and cgroup-tools. On Ubuntu, and I think on most Debian derivatives, this should do:

apt-get install stress
apt-get install cgroup-tools

and here we go,

#!/bin/bash

memory_limit="128M"

stress --quiet --vm 4 --vm-bytes "${memory_limit}" --vm-keep &>/dev/null &
trap 'kill -9 $(pgrep -P '$!') &> /dev/null' EXIT
while true; do
    total=0
    for p in $(pgrep -P $!); do
        rss=$(awk '/^VmRSS/ {print $2}' "/proc/$p/status" 2>/dev/null)  # resident set size in kB
        process=$(ps -o ppid= -o pid= -o cgroup= -p "$p" | awk -F , '{print $1}')  # ppid, pid, first cgroup entry
        total=$((total + ${rss:-0}))
        echo "$process ==> $((${rss:-0} / 1024))MB"
    done
    echo "Total ==> $((total / 1024))MB"
    sleep 1  # avoid a busy loop
done

This is a bash script that makes stress fork four worker processes, each of which reserves the amount of memory defined in line 3. For each of the four processes the script prints its parent process id, its process id, the control group it belongs to in the memory hierarchy and the amount of memory it currently holds (its resident set size). It keeps printing in a loop until the script is interrupted, at which point the EXIT trap kills the workers. This is its output:

[Image: terminal output of the script]
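The two lookups the script performs for every process, its resident memory and its control group, can also be done straight from /proc. This is only a sketch under the assumption of a cgroup v1 system: the helper names are hypothetical, and the optional last argument (default /proc) exists only so the functions can be pointed at a sample directory.

```shell
# Sketch: per-process lookups read directly from /proc.
# rss_kb and memory_cgroup are hypothetical helper names.

# Print the resident set size of a process in kB (the VmRSS line of its status file).
rss_kb() {
    awk '/^VmRSS:/ { print $2 }' "${2:-/proc}/$1/status"
}

# Print the control group of a process in the memory hierarchy.
# Lines in /proc/<pid>/cgroup look like "9:memory:/user.slice".
memory_cgroup() {
    awk -F : '$2 ~ /(^|,)memory(,|$)/ { print $3 }' "${2:-/proc}/$1/cgroup"
}

# Usage: rss_kb "$$"; memory_cgroup "$$"
```

Unlike taking the first comma-separated entry of the ps cgroup column, memory_cgroup selects the memory hierarchy explicitly, so it does not depend on the order in which the hierarchies happen to be listed.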

Keep a couple of things in mind. First, the total amount of memory held by the four processes is about 512MB (128MB x 4), as expected. Second, the control group my four processes belong to in the memory hierarchy is called user.slice; it’s a control group created by default by my system at boot time. Look at it, it’s the last directory listed next:

[Image: user.slice cgroup]

now have a look at this modified version of the script,

#!/bin/bash

cgroup_path="/sys/fs/cgroup/memory/cgroup1"
cgroup_memory_limit="256M"
mkdir -p "${cgroup_path}"
pushd "${cgroup_path}" > /dev/null
echo "${cgroup_memory_limit}" > memory.limit_in_bytes
echo $$ > tasks
popd > /dev/null

memory_limit="128M"

stress --quiet --vm 4 --vm-bytes "${memory_limit}" --vm-keep &>/dev/null &
trap 'kill -9 $(pgrep -P '$!') &> /dev/null; cgdelete memory:/cgroup1' EXIT
while true; do
    total=0
    for p in $(pgrep -P $!); do
        rss=$(awk '/^VmRSS/ {print $2}' "/proc/$p/status" 2>/dev/null)  # resident set size in kB
        process=$(ps -o ppid= -o pid= -o cgroup= -p "$p" | awk -F , '{print $1}')  # ppid, pid, first cgroup entry
        total=$((total + ${rss:-0}))
        echo "$process ==> $((${rss:-0} / 1024))MB"
    done
    echo "Total ==> $((total / 1024))MB"
    sleep 1  # avoid a busy loop
done

Only two things changed: lines 3 to 9 were added, and one instruction was added to the trap command in line 14.

  • Line 4 defines the variable holding the memory limit for this control group, 256MB
  • Line 5 creates a new control group called cgroup1 within the memory hierarchy
  • Line 7 sets the limit on the amount of memory that processes in the new control group can use
  • Line 8 adds the current process id to the tasks file, which means that the current process and all the children it forks from now on, including the four stress workers, belong to the new control group cgroup1
  • In line 14 the extra instruction added to the trap command deletes the new control group so that the cleanup is complete

If everything goes according to plan, the four processes forked by stress should now belong to the control group cgroup1, and the total amount of memory they hold should never exceed 256MB. This is the output:

[Image: terminal output of the modified script]

There we go, control groups in action. All processes are now part of the cgroup1 control group, and after a few seconds of running the total memory seems to stabilize at 257MB, almost as expected.
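The total above is a sum of per-process VmRSS values, which is not necessarily the same number the kernel charges to the control group, so a small difference from the limit is not surprising. The control group’s own accounting can be read from its directory while the script is running. The sketch below uses the cgroup v1 file names memory.usage_in_bytes, memory.limit_in_bytes and memory.failcnt; the helper name is hypothetical.

```shell
# Sketch: read a memory control group's own accounting (cgroup v1 file names).
# cgroup_memory_report is a hypothetical helper; pass the control group
# directory, e.g. /sys/fs/cgroup/memory/cgroup1.
cgroup_memory_report() {
    dir="$1"
    usage=$(cat "$dir/memory.usage_in_bytes")   # bytes currently charged
    limit=$(cat "$dir/memory.limit_in_bytes")   # configured limit in bytes
    failcnt=$(cat "$dir/memory.failcnt")        # times the limit was hit
    echo "usage: $(($usage / 1048576))MB of $(($limit / 1048576))MB, failcnt: $failcnt"
}

# Usage (while the script is running): cgroup_memory_report /sys/fs/cgroup/memory/cgroup1
```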

Just like the previous post, this was a well-intentioned attempt at an introduction to the concepts behind control groups; I hope it helps somebody other than my future self. As you can imagine, distilling such a complex subject into a few posts requires extreme oversimplification, so I strongly recommend digging deeper into the kernel docs and the RHEL documentation.

Here’s the source code for this post.