<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
|
|
<HTML><HEAD><TITLE>Man page of CGROUPS</TITLE>
|
|
</HEAD><BODY>
|
|
<H1>CGROUPS</H1>
|
|
Section: Linux Programmer's Manual (7)<BR>Updated: 2019-11-19<BR><A HREF="#index">Index</A>
|
|
<A HREF="/cgi-bin/man/man2html">Return to Main Contents</A><HR>
|
|
|
|
<A NAME="lbAB"> </A>
|
|
<H2>NAME</H2>
|
|
|
|
cgroups - Linux control groups
|
|
<A NAME="lbAC"> </A>
|
|
<H2>DESCRIPTION</H2>
|
|
|
|
Control groups, usually referred to as cgroups,
|
|
are a Linux kernel feature that allows processes to
|
|
be organized into hierarchical groups whose usage of
|
|
various types of resources can then be limited and monitored.
|
|
The kernel's cgroup interface is provided through
|
|
a pseudo-filesystem called cgroupfs.
|
|
Grouping is implemented in the core cgroup kernel code,
|
|
while resource tracking and limits are implemented in
|
|
a set of per-resource-type subsystems (memory, CPU, and so on).
|
|
|
|
<A NAME="lbAD"> </A>
|
|
<H3>Terminology</H3>
|
|
|
|
A
|
|
<I>cgroup</I>
|
|
|
|
is a collection of processes that are bound to a set of
|
|
limits or parameters defined via the cgroup filesystem.
|
|
<P>
|
|
|
|
A
|
|
<I>subsystem</I>
|
|
|
|
is a kernel component that modifies the behavior of
|
|
the processes in a cgroup.
|
|
Various subsystems have been implemented, making it possible to do things
|
|
such as limiting the amount of CPU time and memory available to a cgroup,
|
|
accounting for the CPU time used by a cgroup,
|
|
and freezing and resuming execution of the processes in a cgroup.
|
|
Subsystems are sometimes also known as
|
|
<I>resource controllers</I>
|
|
|
|
(or simply, controllers).
|
|
<P>
|
|
|
|
The cgroups for a controller are arranged in a
|
|
<I>hierarchy</I>.
|
|
|
|
This hierarchy is defined by creating, removing, and
|
|
renaming subdirectories within the cgroup filesystem.
|
|
At each level of the hierarchy, attributes (e.g., limits) can be defined.
|
|
The limits, control, and accounting provided by cgroups generally have
|
|
effect throughout the subhierarchy underneath the cgroup where the
|
|
attributes are defined.
|
|
Thus, for example, the limits placed on
|
|
a cgroup at a higher level in the hierarchy cannot be exceeded
|
|
by descendant cgroups.
|
|
|
|
<A NAME="lbAE"> </A>
|
|
<H3>Cgroups version 1 and version 2</H3>
|
|
|
|
The initial release of the cgroups implementation was in Linux 2.6.24.
|
|
Over time, various cgroup controllers have been added
|
|
to allow the management of various types of resources.
|
|
However, the development of these controllers was largely uncoordinated,
|
|
with the result that many inconsistencies arose between controllers
|
|
and management of the cgroup hierarchies became rather complex.
|
|
(A longer description of these problems can be found in
|
|
the kernel source file
|
|
<I>Documentation/cgroup-v2.txt</I>.)
|
|
|
|
<P>
|
|
|
|
Because of the problems with the initial cgroups implementation
|
|
(cgroups version 1),
|
|
starting in Linux 3.10, work began on a new,
|
|
orthogonal implementation to remedy these problems.
|
|
Initially marked experimental, and hidden behind the
|
|
<I>-o __DEVEL__sane_behavior</I>
|
|
|
|
mount option, the new version (cgroups version 2)
|
|
was eventually made official with the release of Linux 4.5.
|
|
Differences between the two versions are described in the text below.
|
|
<P>
|
|
|
|
Although cgroups v2 is intended as a replacement for cgroups v1,
|
|
the older system continues to exist
|
|
(and for compatibility reasons is unlikely to be removed).
|
|
Currently, cgroups v2 implements only a subset of the controllers
|
|
available in cgroups v1.
|
|
The two systems are implemented so that both v1 controllers and
|
|
v2 controllers can be mounted on the same system.
|
|
Thus, for example, it is possible to use those controllers
|
|
that are supported under version 2,
|
|
while also using version 1 controllers
|
|
where version 2 does not yet support those controllers.
|
|
The only restriction here is that a controller can't be simultaneously
|
|
employed in both a cgroups v1 hierarchy and in the cgroups v2 hierarchy.
|
|
|
|
<A NAME="lbAF"> </A>
|
|
<H2>CGROUPS VERSION 1</H2>
|
|
|
|
Under cgroups v1, each controller may be mounted against a separate
|
|
cgroup filesystem that provides its own hierarchical organization of the
|
|
processes on the system.
|
|
It is also possible to comount multiple (or even all) cgroups v1 controllers
|
|
against the same cgroup filesystem, meaning that the comounted controllers
|
|
manage the same hierarchical organization of processes.
|
|
<P>
|
|
|
|
For each mounted hierarchy,
|
|
the directory tree mirrors the control group hierarchy.
|
|
Each control group is represented by a directory, with each of its child
|
|
control groups represented as a child directory.
|
|
For instance,
|
|
<I>/user/joe/1.session</I>
|
|
|
|
represents control group
|
|
<I>1.session</I>,
|
|
|
|
which is a child of cgroup
|
|
<I>joe</I>,
|
|
|
|
which is a child of
|
|
<I>/user</I>.
|
|
|
|
Under each cgroup directory is a set of files which can be read or
|
|
written to, reflecting resource limits and a few general cgroup
|
|
properties.
|
|
|
|
<A NAME="lbAG"> </A>
|
|
<H3>Tasks (threads) versus processes</H3>
|
|
|
|
In cgroups v1, a distinction is drawn between
|
|
<I>processes</I>
|
|
|
|
and
|
|
<I>tasks</I>.
|
|
|
|
In this view, a process can consist of multiple tasks
|
|
(more commonly called threads, from a user-space perspective,
|
|
and called such in the remainder of this man page).
|
|
In cgroups v1, it is possible to independently manipulate
|
|
the cgroup memberships of the threads in a process.
|
|
<P>
|
|
|
|
The cgroups v1 ability to split threads across different cgroups
|
|
caused problems in some cases.
|
|
For example, it made no sense for the
|
|
<I>memory</I>
|
|
|
|
controller,
|
|
since all of the threads of a process share a single address space.
|
|
Because of these problems,
|
|
the ability to independently manipulate the cgroup memberships
|
|
of the threads in a process was removed in the initial cgroups v2
|
|
implementation, and subsequently restored in a more limited form
|
|
(see the discussion of "thread mode" below).
|
|
|
|
<A NAME="lbAH"> </A>
|
|
<H3>Mounting v1 controllers</H3>
|
|
|
|
The use of cgroups requires a kernel built with the
|
|
<B>CONFIG_CGROUPS</B>
|
|
|
|
option.
|
|
In addition, each of the v1 controllers has an associated
|
|
configuration option that must be set in order to employ that controller.
|
|
<P>
|
|
|
|
In order to use a v1 controller,
|
|
it must be mounted against a cgroup filesystem.
|
|
The usual place for such mounts is under a
|
|
<B><A HREF="/cgi-bin/man/man2html?5+tmpfs">tmpfs</A></B>(5)
|
|
|
|
filesystem mounted at
|
|
<I>/sys/fs/cgroup</I>.
|
|
|
|
Thus, one might mount the
|
|
<I>cpu</I>
|
|
|
|
controller as follows:
|
|
<P>
|
|
|
|
|
|
|
|
mount -t cgroup -o cpu none /sys/fs/cgroup/cpu
|
|
|
|
|
|
<P>
|
|
|
|
It is possible to comount multiple controllers against the same hierarchy.
|
|
For example, here the
|
|
<I>cpu</I>
|
|
|
|
and
|
|
<I>cpuacct</I>
|
|
|
|
controllers are comounted against a single hierarchy:
|
|
<P>
|
|
|
|
|
|
|
|
mount -t cgroup -o cpu,cpuacct none /sys/fs/cgroup/cpu,cpuacct
|
|
|
|
|
|
<P>
|
|
|
|
Comounting controllers has the effect that a process is in the same cgroup for
|
|
all of the comounted controllers.
|
|
Separately mounting controllers allows a process to
|
|
be in cgroup
|
|
<I>/foo1</I>
|
|
|
|
for one controller while being in
|
|
<I>/foo2/foo3</I>
|
|
|
|
for another.
|
|
<P>
|
|
|
|
It is possible to comount all v1 controllers against the same hierarchy:
|
|
<P>
|
|
|
|
|
|
|
|
mount -t cgroup -o all cgroup /sys/fs/cgroup
|
|
|
|
|
|
<P>
|
|
|
|
(One can achieve the same result by omitting
|
|
<I>-o all</I>,
|
|
|
|
since it is the default if no controllers are explicitly specified.)
|
|
<P>
|
|
|
|
It is not possible to mount the same controller
|
|
against multiple cgroup hierarchies.
|
|
For example, it is not possible to mount both the
|
|
<I>cpu</I>
|
|
|
|
and
|
|
<I>cpuacct</I>
|
|
|
|
controllers against one hierarchy, and to mount the
|
|
<I>cpu</I>
|
|
|
|
controller alone against another hierarchy.
|
|
It is possible to create multiple mount points with exactly
|
|
the same set of comounted controllers.
|
|
However, in this case all that results is multiple mount points
|
|
providing a view of the same hierarchy.
|
|
<P>
|
|
|
|
Note that on many systems, the v1 controllers are automatically mounted under
|
|
<I>/sys/fs/cgroup</I>;
|
|
|
|
in particular,
|
|
<B><A HREF="/cgi-bin/man/man2html?1+systemd">systemd</A></B>(1)
|
|
|
|
automatically creates such mount points.
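<P>
One way to see which hierarchies are already mounted is to filter the mount table.
The following is a minimal sketch, assuming the usual
<I>/proc/mounts</I>
field layout (device, mount point, filesystem type, options);
the function name
<I>list_cgroup_mounts</I>
is hypothetical and not part of any standard tool.

```shell
# list_cgroup_mounts [MOUNTS_FILE] [FSTYPE]
# Print "mountpoint options" for each filesystem of type FSTYPE
# (default: cgroup, that is, cgroups v1) listed in MOUNTS_FILE
# (default: /proc/mounts). For v1 mounts, the attached controllers
# appear among the mount options.
list_cgroup_mounts() {
    awk -v fstype="${2:-cgroup}" '$3 == fstype { print $2, $4 }' \
        "${1:-/proc/mounts}"
}
```

<P>
Calling the function with no arguments lists the v1 hierarchies of the running system;
passing
<I>cgroup2</I>
as the second argument lists v2 mounts instead.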
|
|
|
|
<A NAME="lbAI"> </A>
|
|
<H3>Unmounting v1 controllers</H3>
|
|
|
|
A mounted cgroup filesystem can be unmounted using the
|
|
<B><A HREF="/cgi-bin/man/man2html?8+umount">umount</A></B>(8)
|
|
|
|
command, as in the following example:
|
|
<P>
|
|
|
|
|
|
|
|
umount /sys/fs/cgroup/pids
|
|
|
|
|
|
<P>
|
|
|
|
<I>But note well</I>:
|
|
|
|
a cgroup filesystem is unmounted only if it is not busy,
|
|
that is, it has no child cgroups.
|
|
If this is not the case, then the only effect of the
|
|
<B><A HREF="/cgi-bin/man/man2html?8+umount">umount</A></B>(8)
|
|
|
|
is to make the mount invisible.
|
|
Thus, to ensure that the mount point is really removed,
|
|
one must first remove all child cgroups,
|
|
which in turn can be done only after all member processes
|
|
have been moved from those cgroups to the root cgroup.
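<P>
Before unmounting, it can be useful to check whether a hierarchy still has child cgroups.
The following is a rough sketch of such a check;
the function name
<I>cg_has_children</I>
is hypothetical, and the
<I>-mindepth</I>
and
<I>-maxdepth</I>
options are assumed to be supported by the installed
<B>find</B>(1).

```shell
# cg_has_children DIR
# Succeed (exit status 0) if DIR contains at least one subdirectory,
# that is, the cgroup rooted at DIR still has child cgroups.
# Regular files (the cgroup interface files) are ignored.
cg_has_children() {
    find "$1" -mindepth 1 -maxdepth 1 -type d | grep -q .
}
```

<P>
With this helper, an unmount could be guarded as follows:
<I>cg_has_children /sys/fs/cgroup/pids || umount /sys/fs/cgroup/pids</I>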
|
|
|
|
<A NAME="lbAJ"> </A>
|
|
<H3>Cgroups version 1 controllers</H3>
|
|
|
|
Each of the cgroups version 1 controllers is governed
|
|
by a kernel configuration option (listed below).
|
|
Additionally, the availability of the cgroups feature is governed by the
|
|
<B>CONFIG_CGROUPS</B>
|
|
|
|
kernel configuration option.
|
|
<DL COMPACT>
|
|
<DT id="1"><I>cpu</I> (since Linux 2.6.24; <B>CONFIG_CGROUP_SCHED</B>)
|
|
|
|
<DD>
|
|
Cgroups can be guaranteed a minimum number of "CPU shares"
|
|
when a system is busy.
|
|
This does not limit a cgroup's CPU usage if the CPUs are not busy.
|
|
For further information, see
|
|
<I>Documentation/scheduler/sched-design-CFS.txt</I>.
|
|
|
|
<DT id="2"><DD>
|
|
In Linux 3.2,
|
|
this controller was extended to provide CPU "bandwidth" control.
|
|
If the kernel is configured with
|
|
<B>CONFIG_CFS_BANDWIDTH</B>,
|
|
|
|
then within each scheduling period
|
|
(defined via a file in the cgroup directory), it is possible to define
|
|
an upper limit on the CPU time allocated to the processes in a cgroup.
|
|
This upper limit applies even if there is no other competition for the CPU.
|
|
Further information can be found in the kernel source file
|
|
<I>Documentation/scheduler/sched-bwc.txt</I>.
|
|
|
|
<DT id="3"><I>cpuacct</I> (since Linux 2.6.24; <B>CONFIG_CGROUP_CPUACCT</B>)
|
|
|
|
<DD>
|
|
This provides accounting for CPU usage by groups of processes.
|
|
<DT id="4"><DD>
|
|
Further information can be found in the kernel source file
|
|
<I>Documentation/cgroup-v1/cpuacct.txt</I>.
|
|
|
|
<DT id="5"><I>cpuset</I> (since Linux 2.6.24; <B>CONFIG_CPUSETS</B>)
|
|
|
|
<DD>
|
|
This controller can be used to bind the processes in a cgroup to
|
|
a specified set of CPUs and NUMA nodes.
|
|
<DT id="6"><DD>
|
|
Further information can be found in the kernel source file
|
|
<I>Documentation/cgroup-v1/cpusets.txt</I>.
|
|
|
|
<DT id="7"><I>memory</I> (since Linux 2.6.25; <B>CONFIG_MEMCG</B>)
|
|
|
|
<DD>
|
|
The memory controller supports reporting and limiting of process memory, kernel
|
|
memory, and swap used by cgroups.
|
|
<DT id="8"><DD>
|
|
Further information can be found in the kernel source file
|
|
<I>Documentation/cgroup-v1/memory.txt</I>.
|
|
|
|
<DT id="9"><I>devices</I> (since Linux 2.6.26; <B>CONFIG_CGROUP_DEVICE</B>)
|
|
|
|
<DD>
|
|
This supports controlling which processes may create (mknod) devices as
|
|
well as open them for reading or writing.
|
|
The policies may be specified as allow-lists and deny-lists.
|
|
Hierarchy is enforced, so new rules must not
|
|
violate existing rules for the target or ancestor cgroups.
|
|
<DT id="10"><DD>
|
|
Further information can be found in the kernel source file
|
|
<I>Documentation/cgroup-v1/devices.txt</I>.
|
|
|
|
<DT id="11"><I>freezer</I> (since Linux 2.6.28; <B>CONFIG_CGROUP_FREEZER</B>)
|
|
|
|
<DD>
|
|
The
|
|
<I>freezer</I>
|
|
|
|
cgroup can suspend and restore (resume) all processes in a cgroup.
|
|
Freezing a cgroup
|
|
<I>/A</I>
|
|
|
|
also causes its children, for example, processes in
|
|
<I>/A/B</I>,
|
|
|
|
to be frozen.
|
|
<DT id="12"><DD>
|
|
Further information can be found in the kernel source file
|
|
<I>Documentation/cgroup-v1/freezer-subsystem.txt</I>.
|
|
|
|
<DT id="13"><I>net_cls</I> (since Linux 2.6.29; <B>CONFIG_CGROUP_NET_CLASSID</B>)
|
|
|
|
<DD>
|
|
This places a classid, specified for the cgroup, on network packets
|
|
created by a cgroup.
|
|
These classids can then be used in firewall rules,
|
|
as well as used to shape traffic using
|
|
<B><A HREF="/cgi-bin/man/man2html?8+tc">tc</A></B>(8).
|
|
|
|
This applies only to packets
|
|
leaving the cgroup, not to traffic arriving at the cgroup.
|
|
<DT id="14"><DD>
|
|
Further information can be found in the kernel source file
|
|
<I>Documentation/cgroup-v1/net_cls.txt</I>.
|
|
|
|
<DT id="15"><I>blkio</I> (since Linux 2.6.33; <B>CONFIG_BLK_CGROUP</B>)
|
|
|
|
<DD>
|
|
The
|
|
<I>blkio</I>
|
|
|
|
cgroup controls and limits access to specified block devices by
|
|
applying IO control in the form of throttling and upper limits against leaf
|
|
nodes and intermediate nodes in the storage hierarchy.
|
|
<DT id="16"><DD>
|
|
Two policies are available.
|
|
The first is a proportional-weight time-based division
|
|
of disk implemented with CFQ.
|
|
This is in effect for leaf nodes using CFQ.
|
|
The second is a throttling policy which specifies
|
|
upper I/O rate limits on a device.
|
|
<DT id="17"><DD>
|
|
Further information can be found in the kernel source file
|
|
<I>Documentation/cgroup-v1/blkio-controller.txt</I>.
|
|
|
|
<DT id="18"><I>perf_event</I> (since Linux 2.6.39; <B>CONFIG_CGROUP_PERF</B>)
|
|
|
|
<DD>
|
|
This controller allows
|
|
<I>perf</I>
|
|
|
|
monitoring of the set of processes grouped in a cgroup.
|
|
<DT id="19"><DD>
|
|
Further information can be found in the kernel source file
|
|
<I>tools/perf/Documentation/perf-record.txt</I>.
|
|
|
|
<DT id="20"><I>net_prio</I> (since Linux 3.3; <B>CONFIG_CGROUP_NET_PRIO</B>)
|
|
|
|
<DD>
|
|
This allows priorities to be specified, per network interface, for cgroups.
|
|
<DT id="21"><DD>
|
|
Further information can be found in the kernel source file
|
|
<I>Documentation/cgroup-v1/net_prio.txt</I>.
|
|
|
|
<DT id="22"><I>hugetlb</I> (since Linux 3.5; <B>CONFIG_CGROUP_HUGETLB</B>)
|
|
|
|
<DD>
|
|
This supports limiting the use of huge pages by cgroups.
|
|
<DT id="23"><DD>
|
|
Further information can be found in the kernel source file
|
|
<I>Documentation/cgroup-v1/hugetlb.txt</I>.
|
|
|
|
<DT id="24"><I>pids</I> (since Linux 4.3; <B>CONFIG_CGROUP_PIDS</B>)
|
|
|
|
<DD>
|
|
This controller permits limiting the number of processes that may be created
|
|
in a cgroup (and its descendants).
|
|
<DT id="25"><DD>
|
|
Further information can be found in the kernel source file
|
|
<I>Documentation/cgroup-v1/pids.txt</I>.
|
|
|
|
<DT id="26"><I>rdma</I> (since Linux 4.11; <B>CONFIG_CGROUP_RDMA</B>)
|
|
|
|
<DD>
|
|
The RDMA controller permits limiting the use of
|
|
RDMA/IB-specific resources per cgroup.
|
|
<DT id="27"><DD>
|
|
Further information can be found in the kernel source file
|
|
<I>Documentation/cgroup-v1/rdma.txt</I>.
|
|
|
|
|
|
</DL>
|
|
<A NAME="lbAK"> </A>
|
|
<H3>Creating cgroups and moving processes</H3>
|
|
|
|
A cgroup filesystem initially contains a single root cgroup, '/',
|
|
which all processes belong to.
|
|
A new cgroup is created by creating a directory in the cgroup filesystem:
|
|
<P>
|
|
|
|
|
|
|
|
mkdir /sys/fs/cgroup/cpu/cg1
|
|
|
|
|
|
<P>
|
|
|
|
This creates a new empty cgroup.
|
|
<P>
|
|
|
|
A process may be moved to this cgroup by writing its PID into the cgroup's
|
|
<I>cgroup.procs</I>
|
|
|
|
file:
|
|
<P>
|
|
|
|
|
|
|
|
echo $$ > /sys/fs/cgroup/cpu/cg1/cgroup.procs
|
|
|
|
|
|
<P>
|
|
|
|
Only one PID at a time should be written to this file.
|
|
<P>
|
|
|
|
Writing the value 0 to a
|
|
<I>cgroup.procs</I>
|
|
|
|
file causes the writing process to be moved to the corresponding cgroup.
|
|
<P>
|
|
|
|
When writing a PID into the
|
|
<I>cgroup.procs</I> file,
|
|
|
|
all threads in the process are moved into the new cgroup at once.
|
|
<P>
|
|
|
|
Within a hierarchy, a process can be a member of exactly one cgroup.
|
|
Writing a process's PID to a
|
|
<I>cgroup.procs</I>
|
|
|
|
file automatically removes it from the cgroup of
|
|
which it was previously a member.
|
|
<P>
|
|
|
|
The
|
|
<I>cgroup.procs</I>
|
|
|
|
file can be read to obtain a list of the processes that are
|
|
members of a cgroup.
|
|
The returned list of PIDs is not guaranteed to be in order.
|
|
Nor is it guaranteed to be free of duplicates.
|
|
(For example, a PID may be recycled while reading from the list.)
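<P>
Because the list is neither ordered nor guaranteed to be free of duplicates,
a caller that needs a clean list can normalize it.
A minimal sketch follows; the function name
<I>cg_pids</I>
is hypothetical.

```shell
# cg_pids CGROUP_DIR
# Print the member PIDs of the cgroup at CGROUP_DIR,
# sorted numerically with duplicates removed.
cg_pids() {
    sort -nu "$1/cgroup.procs"
}
```

<P>
Note that the result is only a snapshot:
processes may join or leave the cgroup after the file has been read.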
|
|
<P>
|
|
|
|
In cgroups v1, an individual thread can be moved to
|
|
another cgroup by writing its thread ID
|
|
(i.e., the kernel thread ID returned by
|
|
<B><A HREF="/cgi-bin/man/man2html?2+clone">clone</A></B>(2)
|
|
|
|
and
|
|
<B><A HREF="/cgi-bin/man/man2html?2+gettid">gettid</A></B>(2))
|
|
|
|
to the
|
|
<I>tasks</I>
|
|
|
|
file in a cgroup directory.
|
|
This file can be read to discover the set of threads
|
|
that are members of the cgroup.
|
|
|
|
<A NAME="lbAL"> </A>
|
|
<H3>Removing cgroups</H3>
|
|
|
|
To remove a cgroup,
|
|
it must first have no child cgroups and contain no (nonzombie) processes.
|
|
So long as that is the case, one can simply
|
|
remove the corresponding directory pathname.
|
|
Note that files in a cgroup directory cannot and need not be
|
|
removed.
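<P>
Since a cgroup with children cannot be removed,
a whole subtree has to be removed from the bottom up.
The following sketch shows one way to do that, assuming that
none of the cgroups in the subtree still contains processes;
the function name
<I>cg_remove_tree</I>
is hypothetical.

```shell
# cg_remove_tree CGROUP_DIR
# Remove the cgroup at CGROUP_DIR together with all of its descendants.
# -depth makes find visit children before their parents, so each
# rmdir runs on a cgroup that no longer has child cgroups.
# The interface files inside each directory are left alone.
# Succeeds only if CGROUP_DIR is gone afterward.
cg_remove_tree() {
    find "$1" -depth -type d -exec rmdir {} \;
    [ ! -d "$1" ]
}
```

<P>
If any cgroup in the subtree still has member processes, the corresponding
<B>rmdir</B>(1)
fails and the function returns a nonzero status.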
|
|
|
|
<A NAME="lbAM"> </A>
|
|
<H3>Cgroups v1 release notification</H3>
|
|
|
|
Two files can be used to determine whether the kernel provides
|
|
notifications when a cgroup becomes empty.
|
|
A cgroup is considered to be empty when it contains no child
|
|
cgroups and no member processes.
|
|
<P>
|
|
|
|
A special file in the root directory of each cgroup hierarchy,
|
|
<I>release_agent</I>,
|
|
|
|
can be used to register the pathname of a program that may be invoked when
|
|
a cgroup in the hierarchy becomes empty.
|
|
The pathname of the newly empty cgroup (relative to the cgroup mount point)
|
|
is provided as the sole command-line argument when the
|
|
<I>release_agent</I>
|
|
|
|
program is invoked.
|
|
The
|
|
<I>release_agent</I>
|
|
|
|
program might remove the cgroup directory,
|
|
or perhaps repopulate it with a process.
|
|
<P>
|
|
|
|
The default value of the
|
|
<I>release_agent</I>
|
|
|
|
file is empty, meaning that no release agent is invoked.
|
|
<P>
|
|
|
|
The content of the
|
|
<I>release_agent</I>
|
|
|
|
file can also be specified via a mount option when the
|
|
cgroup filesystem is mounted:
|
|
<P>
|
|
|
|
|
|
|
|
mount -o release_agent=pathname ...
|
|
|
|
|
|
<P>
|
|
|
|
Whether or not the
|
|
<I>release_agent</I>
|
|
|
|
program is invoked when a particular cgroup becomes empty is determined
|
|
by the value in the
|
|
<I>notify_on_release</I>
|
|
|
|
file in the corresponding cgroup directory.
|
|
If this file contains the value 0, then the
|
|
<I>release_agent</I>
|
|
|
|
program is not invoked.
|
|
If it contains the value 1, the
|
|
<I>release_agent</I>
|
|
|
|
program is invoked.
|
|
The default value for this file in the root cgroup is 0.
|
|
At the time when a new cgroup is created,
|
|
the value in this file is inherited from the corresponding file
|
|
in the parent cgroup.
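<P>
Putting the two files together, release notification for one cgroup
might be set up as in the following sketch.
The function name
<I>cg_set_release_agent</I>
and the agent pathname used in the usage example are hypothetical.

```shell
# cg_set_release_agent ROOT CGROUP AGENT
# ROOT   - mount point of a cgroups v1 hierarchy
# CGROUP - pathname of a cgroup, relative to ROOT
# AGENT  - pathname of the program to run when CGROUP becomes empty
cg_set_release_agent() {
    # The agent is registered once, in the root of the hierarchy.
    echo "$3" > "$1/release_agent" &&
    # Notification is then switched on for the individual cgroup.
    echo 1 > "$1/$2/notify_on_release"
}
```

<P>
For example:
<I>cg_set_release_agent /sys/fs/cgroup/cpu cg1 /usr/local/sbin/cg-release</I>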
|
|
|
|
<A NAME="lbAN"> </A>
|
|
<H3>Cgroup v1 named hierarchies</H3>
|
|
|
|
In cgroups v1,
|
|
it is possible to mount a cgroup hierarchy that has no attached controllers:
|
|
<P>
|
|
|
|
|
|
|
|
mount -t cgroup -o none,name=somename none /some/mount/point
|
|
|
|
|
|
<P>
|
|
|
|
Multiple instances of such hierarchies can be mounted;
|
|
each hierarchy must have a unique name.
|
|
The only purpose of such hierarchies is to track processes.
|
|
(See the discussion of release notification above.)
|
|
An example of this is the
|
|
<I>name=systemd</I>
|
|
|
|
cgroup hierarchy that is used by
|
|
<B><A HREF="/cgi-bin/man/man2html?1+systemd">systemd</A></B>(1)
|
|
|
|
to track services and user sessions.
|
|
<P>
|
|
|
|
Since Linux 5.0, the
|
|
<I>cgroup_no_v1</I>
|
|
|
|
kernel boot option (described below) can be used to disable cgroup v1
|
|
named hierarchies, by specifying
|
|
<I>cgroup_no_v1=named</I>.
|
|
|
|
<P>
|
|
|
|
<A NAME="lbAO"> </A>
|
|
<H2>CGROUPS VERSION 2</H2>
|
|
|
|
In cgroups v2,
|
|
all mounted controllers reside in a single unified hierarchy.
|
|
While (different) controllers may be simultaneously
|
|
mounted under the v1 and v2 hierarchies,
|
|
it is not possible to mount the same controller simultaneously
|
|
under both the v1 and the v2 hierarchies.
|
|
<P>
|
|
|
|
The new behaviors in cgroups v2 are summarized here,
|
|
and in some cases elaborated in the following subsections.
|
|
<DL COMPACT>
|
|
<DT id="28">1.<DD>
|
|
Cgroups v2 provides a unified hierarchy against
|
|
which all controllers are mounted.
|
|
<DT id="29">2.<DD>
|
|
"Internal" processes are not permitted.
|
|
With the exception of the root cgroup, processes may reside
|
|
only in leaf nodes (cgroups that do not themselves contain child cgroups).
|
|
The details are somewhat more subtle than this, and are described below.
|
|
<DT id="30">3.<DD>
|
|
Active controllers must be specified via the files
|
|
<I>cgroup.controllers</I>
|
|
|
|
and
|
|
<I>cgroup.subtree_control</I>.
|
|
|
|
<DT id="31">4.<DD>
|
|
The
|
|
<I>tasks</I>
|
|
|
|
file has been removed.
|
|
In addition, the
|
|
<I>cgroup.clone_children</I>
|
|
|
|
file that is employed by the
|
|
<I>cpuset</I>
|
|
|
|
controller has been removed.
|
|
<DT id="32">5.<DD>
|
|
An improved mechanism for notification of empty cgroups is provided by the
|
|
<I>cgroup.events</I>
|
|
|
|
file.
|
|
</DL>
|
|
<P>
|
|
|
|
For more changes, see the
|
|
<I>Documentation/cgroup-v2.txt</I>
|
|
|
|
file in the kernel source.
|
|
<P>
|
|
|
|
Some of the new behaviors listed above saw subsequent modification with
|
|
the addition in Linux 4.14 of "thread mode" (described below).
|
|
|
|
<A NAME="lbAP"> </A>
|
|
<H3>Cgroups v2 unified hierarchy</H3>
|
|
|
|
In cgroups v1, the ability to mount different controllers
|
|
against different hierarchies was intended to allow great flexibility
|
|
for application design.
|
|
In practice, though,
|
|
the flexibility turned out to be less useful than expected,
|
|
and in many cases added complexity.
|
|
Therefore, in cgroups v2,
|
|
all available controllers are mounted against a single hierarchy.
|
|
The available controllers are automatically mounted,
|
|
meaning that it is not necessary (or possible) to specify the controllers
|
|
when mounting the cgroup v2 filesystem using a command such as the following:
|
|
<P>
|
|
|
|
|
|
|
|
mount -t cgroup2 none /mnt/cgroup2
|
|
|
|
|
|
<P>
|
|
|
|
A cgroup v2 controller is available only if it is not currently in use
|
|
via a mount against a cgroup v1 hierarchy.
|
|
Or, to put things another way, it is not possible to employ
|
|
the same controller against both a v1 hierarchy and the unified v2 hierarchy.
|
|
This means that it may be necessary first to unmount a v1 controller
|
|
(as described above) before that controller is available in v2.
|
|
Since
|
|
<B><A HREF="/cgi-bin/man/man2html?1+systemd">systemd</A></B>(1)
|
|
|
|
makes heavy use of some v1 controllers by default,
|
|
it can in some cases be simpler to boot the system with
|
|
selected v1 controllers disabled.
|
|
To do this, specify the
|
|
<I>cgroup_no_v1=list</I>
|
|
|
|
option on the kernel boot command line;
|
|
<I>list</I>
|
|
|
|
is a comma-separated list of the names of the controllers to disable,
|
|
or the word
|
|
<I>all</I>
|
|
|
|
to disable all v1 controllers.
|
|
(This situation is correctly handled by
|
|
<B><A HREF="/cgi-bin/man/man2html?1+systemd">systemd</A></B>(1),
|
|
|
|
which falls back to operating without the specified controllers.)
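<P>
To check whether the running kernel was booted with this option,
the boot command line can be inspected.
The following is a small sketch that assumes the command line is available in
<I>/proc/cmdline</I>
as a single line of space-separated words;
the function name
<I>cgroup_no_v1_list</I>
is hypothetical.

```shell
# cgroup_no_v1_list [CMDLINE_FILE]
# Print the value of the cgroup_no_v1= boot option found in
# CMDLINE_FILE (default: /proc/cmdline), or nothing if it is absent.
cgroup_no_v1_list() {
    tr ' ' '\n' < "${1:-/proc/cmdline}" | sed -n 's/^cgroup_no_v1=//p'
}
```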
|
|
<P>
|
|
|
|
Note that on many modern systems,
|
|
<B><A HREF="/cgi-bin/man/man2html?1+systemd">systemd</A></B>(1)
|
|
|
|
automatically mounts the
|
|
<I>cgroup2</I>
|
|
|
|
filesystem at
|
|
<I>/sys/fs/cgroup/unified</I>
|
|
|
|
during the boot process.
|
|
|
|
<A NAME="lbAQ"> </A>
|
|
<H3>Cgroups v2 controllers</H3>
|
|
|
|
The following controllers, documented in the kernel source file
|
|
<I>Documentation/cgroup-v2.txt</I>,
|
|
|
|
are supported in cgroups version 2:
|
|
<DL COMPACT>
|
|
<DT id="33"><I>io</I> (since Linux 4.5)
|
|
|
|
<DD>
|
|
This is the successor of the version 1
|
|
<I>blkio</I>
|
|
|
|
controller.
|
|
<DT id="34"><I>memory</I> (since Linux 4.5)
|
|
|
|
<DD>
|
|
This is the successor of the version 1
|
|
<I>memory</I>
|
|
|
|
controller.
|
|
<DT id="35"><I>pids</I> (since Linux 4.5)
|
|
|
|
<DD>
|
|
This is the same as the version 1
|
|
<I>pids</I>
|
|
|
|
controller.
|
|
<DT id="36"><I>perf_event</I> (since Linux 4.11)
|
|
|
|
<DD>
|
|
This is the same as the version 1
|
|
<I>perf_event</I>
|
|
|
|
controller.
|
|
<DT id="37"><I>rdma</I> (since Linux 4.11)
|
|
|
|
<DD>
|
|
This is the same as the version 1
|
|
<I>rdma</I>
|
|
|
|
controller.
|
|
<DT id="38"><I>cpu</I> (since Linux 4.15)
|
|
|
|
<DD>
|
|
This is the successor to the version 1
|
|
<I>cpu</I>
|
|
|
|
and
|
|
<I>cpuacct</I>
|
|
|
|
controllers.
|
|
<DT id="39"><I>freezer</I> (since Linux 5.2)
|
|
|
|
<DD>
|
|
|
|
This is the successor of the version 1
|
|
<I>freezer</I>
|
|
|
|
controller.
|
|
|
|
</DL>
|
|
<A NAME="lbAR"> </A>
|
|
<H3>Cgroups v2 subtree control</H3>
|
|
|
|
Each cgroup in the v2 hierarchy contains the following two files:
|
|
<DL COMPACT>
|
|
<DT id="40"><I>cgroup.controllers</I>
|
|
|
|
<DD>
|
|
This read-only file exposes a list of the controllers that are
|
|
<I>available</I>
|
|
|
|
in this cgroup.
|
|
The contents of this file match the contents of the
|
|
<I>cgroup.subtree_control</I>
|
|
|
|
file in the parent cgroup.
|
|
<DT id="41"><I>cgroup.subtree_control</I>
|
|
|
|
<DD>
|
|
This is a list of controllers that are
|
|
<I>active</I>
|
|
|
|
(<I>enabled</I>)
|
|
|
|
in the cgroup.
|
|
The set of controllers in this file is a subset of the set in the
|
|
<I>cgroup.controllers</I>
|
|
|
|
file of this cgroup.
|
|
The set of active controllers is modified by writing strings to this file
|
|
containing space-delimited controller names,
|
|
each preceded by '+' (to enable a controller)
|
|
or '-' (to disable a controller), as in the following example:
|
|
<DT id="42"><DD>
|
|
|
|
|
|
echo '+pids -memory' > x/y/cgroup.subtree_control
|
|
|
|
|
|
<DT id="43"><DD>
|
|
An attempt to enable a controller
|
|
that is not present in
|
|
<I>cgroup.controllers</I>
|
|
|
|
leads to an
|
|
<B>ENOENT</B>
|
|
|
|
error when writing to the
|
|
<I>cgroup.subtree_control</I>
|
|
|
|
file.
|
|
</DL>
|
|
<P>
|
|
|
|
Because the list of controllers in
|
|
<I>cgroup.subtree_control</I>
|
|
|
|
is a subset of those in
|
|
<I>cgroup.controllers</I>,
|
|
|
|
a controller that has been disabled in one cgroup in the hierarchy
|
|
can never be re-enabled in the subtree below that cgroup.
|
|
<P>
|
|
|
|
A cgroup's
|
|
<I>cgroup.subtree_control</I>
|
|
|
|
file determines the set of controllers that are exercised in the
|
|
<I>child</I>
|
|
|
|
cgroups.
|
|
When a controller (e.g.,
|
|
<I>pids</I>)
|
|
|
|
is present in the
|
|
<I>cgroup.subtree_control</I>
|
|
|
|
file of a parent cgroup,
|
|
then the corresponding controller-interface files (e.g.,
|
|
<I>pids.max</I>)
|
|
|
|
are automatically created in the children of that cgroup
|
|
and can be used to exert resource control in the child cgroups.
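<P>
The relationship between the two files can be illustrated with a small helper
that refuses to enable a controller that is not available.
This is only a sketch; the function name
<I>cg2_enable</I>
is hypothetical.

```shell
# cg2_enable CGROUP_DIR CONTROLLER
# Enable CONTROLLER for the children of the cgroup at CGROUP_DIR,
# after checking that it is listed in cgroup.controllers.
cg2_enable() {
    # cgroup.controllers holds space-separated names on one line;
    # split them so that grep -x can match a whole name exactly.
    if ! tr ' ' '\n' < "$1/cgroup.controllers" | grep -qxF "$2"; then
        echo "controller '$2' is not available in $1" >&2
        return 1
    fi
    echo "+$2" > "$1/cgroup.subtree_control"
}
```

<P>
After a successful call such as
<I>cg2_enable x/y pids</I>,
the files of the
<I>pids</I>
controller should appear in the child cgroups of
<I>x/y</I>.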
|
|
|
|
<A NAME="lbAS"> </A>
|
|
<H3>Cgroups v2 no internal processes rule</H3>
|
|
|
|
Cgroups v2 enforces a so-called "no internal processes" rule.
|
|
Roughly speaking, this rule means that,
|
|
with the exception of the root cgroup, processes may reside
|
|
only in leaf nodes (cgroups that do not themselves contain child cgroups).
|
|
This avoids the need to decide how to partition resources between
|
|
processes which are members of cgroup A and processes in child cgroups of A.
|
|
<P>
|
|
|
|
For instance, if cgroup
|
|
<I>/cg1/cg2</I>
|
|
|
|
exists, then a process may reside in
|
|
<I>/cg1/cg2</I>,
|
|
|
|
but not in
|
|
<I>/cg1</I>.
|
|
|
|
This is to avoid an ambiguity in cgroups v1
|
|
with respect to the delegation of resources between processes in
|
|
<I>/cg1</I>
|
|
|
|
and its child cgroups.
|
|
The recommended approach in cgroups v2 is to create a subdirectory called
|
|
<I>leaf</I>
|
|
|
|
for any nonleaf cgroup which should contain processes, but no child cgroups.
|
|
Thus, processes which previously would have gone into
|
|
<I>/cg1</I>
|
|
|
|
would now go into
|
|
<I>/cg1/leaf</I>.
|
|
|
|
This has the advantage of making explicit
|
|
the relationship between processes in
|
|
<I>/cg1/leaf</I>
|
|
|
|
and
|
|
<I>/cg1</I>'s
|
|
|
|
other children.
|
|
<P>
|
|
|
|
The "no internal processes" rule is in fact more subtle than stated above.
|
|
More precisely, the rule is that a (nonroot) cgroup can't both
|
|
(1) have member processes, and
|
|
(2) distribute resources into child cgroups, that is, have a nonempty
|
|
<I>cgroup.subtree_control</I>
|
|
|
|
file.
|
|
Thus, it
|
|
<I>is</I>
|
|
|
|
possible for a cgroup to have both member processes and child cgroups,
|
|
but before controllers can be enabled for that cgroup,
|
|
the member processes must be moved out of the cgroup
|
|
(e.g., perhaps into the child cgroups).
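<P>
Moving the member processes into a
<I>leaf</I>
child, as recommended above, might look like the following sketch.
The function name
<I>cg2_move_to_leaf</I>
is hypothetical, and failures for individual PIDs
(for example, because a process has already exited)
are reported but not treated as fatal.

```shell
# cg2_move_to_leaf CGROUP_DIR
# Move every member process of the cgroup at CGROUP_DIR into a
# child cgroup named "leaf", creating that child if necessary.
# The body runs in a subshell, so its variables stay local.
cg2_move_to_leaf() (
    cg=$1
    mkdir -p "$cg/leaf" || exit 1
    # Take a snapshot first: membership changes as processes move.
    pids=$(cat "$cg/cgroup.procs") || exit 1
    for pid in $pids; do
        # Write one PID per write to cgroup.procs.
        echo "$pid" > "$cg/leaf/cgroup.procs" ||
            echo "could not move PID $pid" >&2
    done
)
```

<P>
Once the parent cgroup has no member processes left,
controllers can be enabled in its
<I>cgroup.subtree_control</I>
file.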
|
|
<P>
|
|
|
|
With the Linux 4.14 addition of "thread mode" (described below),
|
|
the "no internal processes" rule has been relaxed in some cases.
|
|
|
|
<A NAME="lbAT"> </A>
|
|
<H3>Cgroups v2 cgroup.events file</H3>
|
|
|
|
Each nonroot cgroup in the v2 hierarchy contains a read-only file,
|
|
<I>cgroup.events</I>,
|
|
|
|
whose contents are key-value pairs
|
|
(delimited by newline characters, with the key and value separated by spaces)
|
|
providing state information about the
|
|
cgroup:
|
|
<P>
|
|
|
|
|
|
|
|
$ <B>cat mygrp/cgroup.events</B>
|
|
populated 1
|
|
frozen 0
|
|
|
|
|
|
<P>
|
|
|
|
The following keys may appear in this file:
|
|
<DL COMPACT>
|
|
<DT id="44"><I>populated</I>
|
|
|
|
<DD>
|
|
The value of this key is either 1,
|
|
if this cgroup or any of its descendants has member processes,
|
|
or otherwise 0.
|
|
<DT id="45"><I>frozen</I> (since Linux 5.2)
|
|
|
|
<DD>
|
|
|
|
The value of this key is 1 if this cgroup is currently frozen,
|
|
or 0 if it is not.
|
|
</DL>
|
|
<P>
|
|
|
|
The
|
|
<I>cgroup.events</I>
|
|
|
|
file can be monitored, in order to receive notification when the value of
|
|
one of its keys changes.
|
|
Such monitoring can be done using
|
|
<B><A HREF="/cgi-bin/man/man2html?7+inotify">inotify</A></B>(7),
|
|
|
|
which notifies changes as
|
|
<B>IN_MODIFY</B>
|
|
|
|
events, or
|
|
<B><A HREF="/cgi-bin/man/man2html?2+poll">poll</A></B>(2),
|
|
|
|
which notifies changes by returning the
|
|
<B>POLLPRI</B>
|
|
|
|
and
|
|
<B>POLLERR</B>
|
|
|
|
bits in the
|
|
<I>revents</I>
|
|
|
|
field.
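<P>
From a shell script, the keys in this file can be read with a small helper.
The sketch below also shows a crude polling loop that waits for a cgroup to become empty;
it is a stand-in for the
<B>inotify</B>(7)
and
<B>poll</B>(2)
techniques described above, not a replacement for them.
Both function names are hypothetical.

```shell
# cg2_event CGROUP_DIR KEY
# Print the value of KEY (for example, populated or frozen)
# from the cgroup.events file of the cgroup at CGROUP_DIR.
cg2_event() {
    awk -v key="$2" '$1 == key { print $2 }' "$1/cgroup.events"
}

# cg2_wait_empty CGROUP_DIR
# Return once the "populated" key is no longer 1, checking once
# per second. Polling is simple but wasteful compared with
# event-based monitoring.
cg2_wait_empty() {
    while [ "$(cg2_event "$1" populated)" = 1 ]; do
        sleep 1
    done
}
```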
|
|
|
|
<A NAME="lbAU"> </A>
|
|
<H3>Cgroup v2 release notification</H3>
|
|
|
|
Cgroups v2 provides a new mechanism for obtaining notification
|
|
when a cgroup becomes empty.
|
|
The cgroups v1
|
|
<I>release_agent</I>
|
|
|
|
and
|
|
<I>notify_on_release</I>
|
|
|
|
files are removed, and replaced by the
|
|
<I>populated</I>
|
|
|
|
key in the
|
|
<I>cgroup.events</I>
|
|
|
|
file.
|
|
This key either has the value 0,
|
|
meaning that the cgroup (and its descendants)
|
|
contain no (nonzombie) member processes,
|
|
or 1, meaning that the cgroup (or one of its descendants)
|
|
contains member processes.
|
|
<P>
|
|
|
|
The cgroups v2 release-notification mechanism
|
|
offers the following advantages over the cgroups v1
|
|
<I>release_agent</I>
|
|
|
|
mechanism:
|
|
<DL COMPACT>
|
|
<DT id="46">*<DD>
|
|
It allows for cheaper notification,
|
|
since a single process can monitor multiple
|
|
<I>cgroup.events</I>
|
|
|
|
files (using the techniques described earlier).
|
|
By contrast, the cgroups v1 mechanism requires the expense of creating
|
|
a process for each notification.
|
|
<DT id="47">*<DD>
|
|
Notification for different cgroup subhierarchies can be delegated
|
|
to different processes.
|
|
By contrast, the cgroups v1 mechanism allows only one release agent
|
|
for an entire hierarchy.
|
|
|
|
</DL>
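<P>

The first advantage listed above (one process monitoring many
<I>cgroup.events</I>
files) might be sketched in Python as follows.
The names
<I>is_populated</I>
and
<I>watch_for_empty</I>
are hypothetical, and the sketch assumes that each watched cgroup is populated when monitoring starts; a cgroup that is already empty is reported only after its file next changes.

```python
import os
import select


def is_populated(text):
    """Return True if the 'populated' key in cgroup.events text is 1."""
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "populated":
            return fields[1] == "1"
    return False


def watch_for_empty(cgroup_dirs, on_empty):
    """Call on_empty(dir) when each of 'cgroup_dirs' becomes empty.

    A single process and a single poll object serve every cgroup.
    Returns once all watched cgroups have been reported.
    """
    poller = select.poll()
    watched = {}                              # fd -> cgroup directory
    for cgroup_dir in cgroup_dirs:
        fd = os.open(os.path.join(cgroup_dir, "cgroup.events"), os.O_RDONLY)
        os.read(fd, 4096)                     # consume the current state
        poller.register(fd, select.POLLPRI)
        watched[fd] = cgroup_dir
    try:
        while watched:
            for fd, _revents in poller.poll():
                os.lseek(fd, 0, os.SEEK_SET)
                if not is_populated(os.read(fd, 4096).decode()):
                    on_empty(watched[fd])
                    poller.unregister(fd)
                    os.close(fd)
                    del watched[fd]
    finally:
        for fd in watched:                    # close anything still open
            os.close(fd)
```

Because the callback receives the cgroup directory, different subhierarchies could equally be handed to different watcher processes.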
|
|
<A NAME="lbAV"> </A>
|
|
<H3>Cgroups v2 cgroup.stat file</H3>
|
|
|
|
|
|
Each cgroup in the v2 hierarchy contains a read-only
|
|
<I>cgroup.stat</I>
|
|
|
|
file (first introduced in Linux 4.14)
|
|
that consists of lines containing key-value pairs.
|
|
The following keys currently appear in this file:
|
|
<DL COMPACT>
|
|
<DT id="48"><I>nr_descendants</I>
|
|
|
|
<DD>
|
|
This is the total number of visible (i.e., living) descendant cgroups
|
|
underneath this cgroup.
|
|
<DT id="49"><I>nr_dying_descendants</I>
|
|
|
|
<DD>
|
|
This is the total number of dying descendant cgroups
|
|
underneath this cgroup.
|
|
A cgroup enters the dying state after being deleted.
|
|
It remains in that state for an undefined period
|
|
(which will depend on system load)
|
|
while resources are freed before the cgroup is destroyed.
|
|
Note that the presence of some cgroups in the dying state is normal,
|
|
and is not indicative of any problem.
|
|
<DT id="50"><DD>
|
|
A process can't be made a member of a dying cgroup,
|
|
and a dying cgroup can't be brought back to life.
|
|
|
|
</DL>
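<P>

The following Python sketch shows one way to read
<I>cgroup.stat</I>
and to relate
<I>nr_descendants</I>
to the directory tree.
The helper names are hypothetical.
On a quiescent system the directory count would normally be expected to agree with
<I>nr_descendants</I>,
although the two are not read atomically.

```python
import os


def parse_cgroup_stat(text):
    """Parse the key-value lines of a cgroup.stat file into a dict."""
    stat = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            stat[fields[0]] = int(fields[1])
    return stat


def count_visible_descendants(cgroup_dir):
    """Count the subdirectories (visible descendant cgroups) below a cgroup."""
    total = 0
    for _dirpath, dirnames, _filenames in os.walk(cgroup_dir):
        total += len(dirnames)
    return total
```

Dying cgroups have no directory, so they can be observed only through
<I>nr_dying_descendants</I>.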
|
|
<A NAME="lbAW"> </A>
|
|
<H3>Limiting the number of descendant cgroups</H3>
|
|
|
|
Each cgroup in the v2 hierarchy contains the following files,
|
|
which can be used to view and set limits on the number
|
|
of descendant cgroups under that cgroup:
|
|
<DL COMPACT>
|
|
<DT id="51"><I>cgroup.max.depth</I> (since Linux 4.14)
|
|
|
|
<DD>
|
|
|
|
This file defines a limit on the depth of nesting of descendant cgroups.
|
|
A value of 0 in this file means that no descendant cgroups can be created.
|
|
An attempt to create a descendant whose nesting level exceeds
|
|
the limit fails
|
|
(<I><A HREF="/cgi-bin/man/man2html?2+mkdir">mkdir</A></I>(2)
|
|
|
|
fails with the error
|
|
<B>EAGAIN</B>).
|
|
|
|
<DT id="52"><DD>
|
|
Writing the string
|
|
<I>"max"</I>
|
|
|
|
to this file means that no limit is imposed.
|
|
The default value in this file is
|
|
<I>"max"</I>.
|
|
|
|
<DT id="53"><I>cgroup.max.descendants</I> (since Linux 4.14)
|
|
|
|
<DD>
|
|
|
|
This file defines a limit on the number of live descendant cgroups that
|
|
this cgroup may have.
|
|
An attempt to create more descendants than allowed by the limit fails
|
|
(<I><A HREF="/cgi-bin/man/man2html?2+mkdir">mkdir</A></I>(2)
|
|
|
|
fails with the error
|
|
<B>EAGAIN</B>).
|
|
|
|
<DT id="54"><DD>
|
|
Writing the string
|
|
<I>"max"</I>
|
|
|
|
to this file means that no limit is imposed.
|
|
The default value in this file is
|
|
<I>"max"</I>.
|
|
|
|
|
|
</DL>
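<P>

The interaction of the two limits can be illustrated with a small Python model.
This is a simplified sketch of the checks as described above, not the kernel's implementation; the function names and the tuple layout are assumptions made for the example.

```python
def parse_limit(text):
    """Convert the contents of cgroup.max.depth or cgroup.max.descendants.

    Returns None for "max" (no limit), otherwise the integer limit.
    """
    value = text.strip()
    return None if value == "max" else int(value)


def mkdir_would_fail(ancestors):
    """Model whether creating a new child cgroup would fail with EAGAIN.

    'ancestors' lists the prospective parent first, then its parent, and
    so on.  Each entry is a tuple (max_depth, max_descendants,
    nr_descendants), where None stands for "max".
    """
    for level, (max_depth, max_desc, nr_desc) in enumerate(ancestors, start=1):
        # The new cgroup would be one more live descendant of this ancestor.
        if max_desc is not None and nr_desc >= max_desc:
            return True
        # The new cgroup would sit 'level' levels below this ancestor.
        if max_depth is not None and level > max_depth:
            return True
    return False
```

For example, a parent whose
<I>cgroup.max.depth</I>
is 0 is modeled as
<I>mkdir_would_fail([(0, None, 0)])</I>,
which returns True.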
|
|
<A NAME="lbAX"> </A>
|
|
<H2>CGROUPS DELEGATION: DELEGATING A HIERARCHY TO A LESS PRIVILEGED USER</H2>
|
|
|
|
In the context of cgroups,
|
|
delegation means passing management of some subtree
|
|
of the cgroup hierarchy to a nonprivileged user.
|
|
Cgroups v1 provides support for delegation based on file permissions
|
|
in the cgroup hierarchy but with less strict containment rules than v2
|
|
(as noted below).
|
|
Cgroups v2 supports delegation with containment by explicit design.
|
|
The focus of the discussion in this section is on delegation in cgroups v2,
|
|
with some differences for cgroups v1 noted along the way.
|
|
<P>
|
|
|
|
Some terminology is required in order to describe delegation.
|
|
A
|
|
<I>delegater</I>
|
|
|
|
is a privileged user (i.e., root) who owns a parent cgroup.
|
|
A
|
|
<I>delegatee</I>
|
|
|
|
is a nonprivileged user who will be granted the permissions needed
|
|
to manage some subhierarchy under that parent cgroup,
|
|
known as the
|
|
<I>delegated subtree</I>.
|
|
|
|
<P>
|
|
|
|
To perform delegation,
|
|
the delegater makes certain directories and files writable by the delegatee,
|
|
typically by changing the ownership of the objects to be the user ID
|
|
of the delegatee.
|
|
Assuming that we want to delegate the hierarchy rooted at (say)
|
|
<I>/dlgt_grp</I>
|
|
|
|
and that there are not yet any child cgroups under that cgroup,
|
|
the ownership of the following is changed to the user ID of the delegatee:
|
|
<DL COMPACT>
|
|
<DT id="55"><I>/dlgt_grp</I>
|
|
|
|
<DD>
|
|
Changing the ownership of the root of the subtree means that any new
|
|
cgroups created under the subtree (and the files they contain)
|
|
will also be owned by the delegatee.
|
|
<DT id="56"><I>/dlgt_grp/cgroup.procs</I>
|
|
|
|
<DD>
|
|
Changing the ownership of this file means that the delegatee
|
|
can move processes into the root of the delegated subtree.
|
|
<DT id="57"><I>/dlgt_grp/cgroup.subtree_control</I> (cgroups v2 only)
|
|
|
|
<DD>
|
|
Changing the ownership of this file means that the delegatee
|
|
can enable controllers (that are present in
|
|
<I>/dlgt_grp/cgroup.controllers</I>)
|
|
|
|
in order to further redistribute resources at lower levels in the subtree.
|
|
(As an alternative to changing the ownership of this file,
|
|
the delegater might instead add selected controllers to this file.)
|
|
<DT id="58"><I>/dlgt_grp/cgroup.threads</I> (cgroups v2 only)
|
|
|
|
<DD>
|
|
Changing the ownership of this file is necessary if a threaded subtree
|
|
is being delegated (see the description of "thread mode", below).
|
|
This permits the delegatee to write thread IDs to the file.
|
|
(The ownership of this file can also be changed when delegating
|
|
a domain subtree, but currently this serves no purpose,
|
|
since, as described below, it is not possible to move a thread between
|
|
domain cgroups by writing its thread ID to the
|
|
<I>cgroup.threads</I>
|
|
|
|
file.)
|
|
<DT id="59"><DD>
|
|
In cgroups v1, the corresponding file that should instead be delegated is the
|
|
<I>tasks</I>
|
|
|
|
file.
|
|
</DL>
|
|
<P>
|
|
|
|
The delegater should
|
|
<I>not</I>
|
|
|
|
change the ownership of any of the controller interface files (e.g.,
|
|
<I>pids.max</I>,
|
|
|
|
<I>memory.high</I>)
|
|
|
|
in
|
|
<I>dlgt_grp</I>.
|
|
|
|
Those files are used from the next level above the delegated subtree
|
|
in order to distribute resources into the subtree,
|
|
and the delegatee should not have permission to change
|
|
the resources that are distributed into the delegated subtree.
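<P>

The ownership changes described above might be scripted along the following lines.
This is a minimal Python sketch: the function name
<I>delegate_subtree</I>
is hypothetical, files that do not exist are skipped, and the controller interface files are deliberately left untouched.

```python
import os


def delegate_subtree(cgroup_dir, uid, gid=-1, v2=True):
    """Give 'uid' ownership of the files needed to manage 'cgroup_dir'.

    A 'gid' of -1 leaves the group ownership unchanged.
    Returns the list of paths whose ownership was changed.
    """
    if v2:
        names = ["cgroup.procs", "cgroup.subtree_control", "cgroup.threads"]
    else:
        names = ["cgroup.procs", "tasks"]     # cgroups v1 uses "tasks"
    os.chown(cgroup_dir, uid, gid)
    changed = [cgroup_dir]
    for name in names:
        path = os.path.join(cgroup_dir, name)
        if os.path.exists(path):
            os.chown(path, uid, gid)
            changed.append(path)
    return changed
```

On a real system this would need to be run by the delegater, with sufficient privilege to change file ownership.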
|
|
<P>
|
|
|
|
See also the discussion of the
|
|
<I>/sys/kernel/cgroup/delegate</I>
|
|
|
|
file in NOTES for information about further delegatable files in cgroups v2.
|
|
<P>
|
|
|
|
After the aforementioned steps have been performed,
|
|
the delegatee can create child cgroups within the delegated subtree
|
|
(the cgroup subdirectories and the files they contain
|
|
will be owned by the delegatee)
|
|
and move processes between cgroups in the subtree.
|
|
If some controllers are present in
|
|
<I>dlgt_grp/cgroup.subtree_control</I>,
|
|
|
|
or the ownership of that file was passed to the delegatee,
|
|
the delegatee can also control the further redistribution
|
|
of the corresponding resources into the delegated subtree.
|
|
|
|
<A NAME="lbAY"> </A>
|
|
<H3>Cgroups v2 delegation: nsdelegate and cgroup namespaces</H3>
|
|
|
|
Starting with Linux 4.13,
|
|
|
|
there is a second way to perform cgroup delegation in the cgroups v2 hierarchy.
|
|
This is done by mounting or remounting the cgroup v2 filesystem with the
|
|
<I>nsdelegate</I>
|
|
|
|
mount option.
|
|
For example, if the cgroup v2 filesystem has already been mounted,
|
|
we can remount it with the
|
|
<I>nsdelegate</I>
|
|
|
|
option as follows:
|
|
<P>
|
|
|
|
|
|
|
|
mount -t cgroup2 -o remount,nsdelegate \
|
|
<BR> none /sys/fs/cgroup/unified
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
<P>
|
|
|
|
The effect of this mount option is to cause cgroup namespaces
|
|
to automatically become delegation boundaries.
|
|
More specifically,
|
|
the following restrictions apply for processes inside the cgroup namespace:
|
|
<DL COMPACT>
|
|
<DT id="60">*<DD>
|
|
Writes to controller interface files in the root directory of the namespace
|
|
will fail with the error
|
|
<B>EPERM</B>.
|
|
|
|
Processes inside the cgroup namespace can still write to delegatable
|
|
files in the root directory of the cgroup namespace such as
|
|
<I>cgroup.procs</I>
|
|
|
|
and
|
|
<I>cgroup.subtree_control</I>,
|
|
|
|
and can create a subhierarchy underneath the root directory.
|
|
<DT id="61">*<DD>
|
|
Attempts to migrate processes across the namespace boundary are denied
|
|
(with the error
|
|
<B>ENOENT</B>).
|
|
|
|
Processes inside the cgroup namespace can still
|
|
(subject to the containment rules described below)
|
|
move processes between cgroups
|
|
<I>within</I>
|
|
|
|
the subhierarchy under the namespace root.
|
|
</DL>
|
|
<P>
|
|
|
|
The ability to define cgroup namespaces as delegation boundaries
|
|
makes cgroup namespaces more useful.
|
|
To understand why, suppose that we already have one cgroup hierarchy
|
|
that has been delegated to a nonprivileged user,
|
|
<I>cecilia</I>,
|
|
|
|
using the older delegation technique described above.
|
|
Suppose further that
|
|
<I>cecilia</I>
|
|
|
|
wanted to further delegate a subhierarchy
|
|
under the existing delegated hierarchy.
|
|
(For example, the delegated hierarchy might be associated with
|
|
an unprivileged container run by
|
|
<I>cecilia</I>.)
|
|
|
|
Even if a cgroup namespace was employed,
|
|
because both hierarchies are owned by the unprivileged user
|
|
<I>cecilia</I>,
|
|
|
|
the following illegitimate actions could be performed:
|
|
<DL COMPACT>
|
|
<DT id="62">*<DD>
|
|
A process in the inferior hierarchy could change the
|
|
resource controller settings in the root directory of that hierarchy.
|
|
(These resource controller settings are intended to allow control to
|
|
be exercised from the
|
|
<I>parent</I>
|
|
|
|
cgroup;
|
|
a process inside the child cgroup should not be allowed to modify them.)
|
|
<DT id="63">*<DD>
|
|
A process inside the inferior hierarchy could move processes
|
|
into and out of the inferior hierarchy if the cgroups in the
|
|
superior hierarchy were somehow visible.
|
|
</DL>
|
|
<P>
|
|
|
|
Employing the
|
|
<I>nsdelegate</I>
|
|
|
|
mount option prevents both of these possibilities.
|
|
<P>
|
|
|
|
The
|
|
<I>nsdelegate</I>
|
|
|
|
mount option only has an effect when performed in
|
|
the initial mount namespace;
|
|
in other mount namespaces, the option is silently ignored.
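<P>

One way to check whether a mounted cgroup v2 filesystem lists the
<I>nsdelegate</I>
option is to inspect text in the format of
<I>/proc/mounts</I>.
The Python sketch below assumes that the option appears in the fourth field of the relevant line; the function names are hypothetical.

```python
def cgroup2_mount_options(mounts_text):
    """Map each cgroup2 mount point to its set of mount options.

    'mounts_text' is text in the format of /proc/mounts.
    """
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[2] == "cgroup2":
            result[fields[1]] = set(fields[3].split(","))
    return result


def nsdelegate_enabled(mounts_text, mount_point):
    """Return True if 'mount_point' is a cgroup2 mount listing nsdelegate."""
    options = cgroup2_mount_options(mounts_text).get(mount_point, set())
    return "nsdelegate" in options
```

A caller would typically pass the contents of
<I>/proc/mounts</I>
and the mount point of the v2 hierarchy.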
|
|
<P>
|
|
|
|
<I>Note</I>:
|
|
|
|
On some systems,
|
|
<B><A HREF="/cgi-bin/man/man2html?1+systemd">systemd</A></B>(1)
|
|
|
|
automatically mounts the cgroup v2 filesystem.
|
|
In order to experiment with the
|
|
<I>nsdelegate</I>
|
|
|
|
operation, it may be useful to boot the kernel with
|
|
the following command-line options:
|
|
<P>
|
|
|
|
|
|
|
|
cgroup_no_v1=all systemd.legacy_systemd_cgroup_controller
|
|
|
|
|
|
<P>
|
|
|
|
These options cause the kernel to boot with the cgroups v1 controllers
|
|
disabled (meaning that the controllers are available in the v2 hierarchy),
|
|
and tell
|
|
<B><A HREF="/cgi-bin/man/man2html?1+systemd">systemd</A></B>(1)
|
|
|
|
not to mount and use the cgroup v2 hierarchy,
|
|
so that the v2 hierarchy can be manually mounted
|
|
with the desired options after boot-up.
|
|
|
|
<A NAME="lbAZ"> </A>
|
|
<H3>Cgroup delegation containment rules</H3>
|
|
|
|
Some delegation
|
|
<I>containment rules</I>
|
|
|
|
ensure that the delegatee can move processes between cgroups within the
|
|
delegated subtree,
|
|
but can't move processes from outside the delegated subtree into
|
|
the subtree or vice versa.
|
|
A nonprivileged process (i.e., the delegatee) can write the PID of
|
|
a "target" process into a
|
|
<I>cgroup.procs</I>
|
|
|
|
file only if all of the following are true:
|
|
<DL COMPACT>
|
|
<DT id="64">*<DD>
|
|
The writer has write permission on the
|
|
<I>cgroup.procs</I>
|
|
|
|
file in the destination cgroup.
|
|
<DT id="65">*<DD>
|
|
The writer has write permission on the
|
|
<I>cgroup.procs</I>
|
|
|
|
file in the nearest common ancestor of the source and destination cgroups.
|
|
Note that in some cases,
|
|
the nearest common ancestor may be the source or destination cgroup itself.
|
|
This requirement is not enforced for cgroups v1 hierarchies,
|
|
with the consequence that containment in v1 is less strict than in v2.
|
|
(For example, in cgroups v1 the user that owns two distinct
|
|
delegated subhierarchies can move a process between the hierarchies.)
|
|
<DT id="66">*<DD>
|
|
If the cgroup v2 filesystem was mounted with the
|
|
<I>nsdelegate</I>
|
|
|
|
option, the writer must be able to see the source and destination cgroups
|
|
from its cgroup namespace.
|
|
<DT id="67">*<DD>
|
|
In cgroups v1:
|
|
the effective UID of the writer (i.e., the delegatee) matches the
|
|
real user ID or the saved set-user-ID of the target process.
|
|
Before Linux 4.11,
|
|
|
|
this requirement also applied in cgroups v2.
|
|
(This was a historical requirement inherited from cgroups v1
|
|
that was later deemed unnecessary,
|
|
since the other rules suffice for containment in cgroups v2.)
|
|
</DL>
|
|
<P>
|
|
|
|
<I>Note</I>:
|
|
|
|
one consequence of these delegation containment rules is that the
|
|
unprivileged delegatee can't place the first process into
|
|
the delegated subtree;
|
|
instead, the delegater must place the first process
|
|
(a process owned by the delegatee) into the delegated subtree.
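<P>

The permission-related containment rules can be illustrated with a small Python model.
This is a sketch only: it represents cgroups as path strings, takes the set of cgroups whose
<I>cgroup.procs</I>
file the writer may write as an input, and does not model cgroup namespace visibility or the cgroups v1 user ID check.
The function names are hypothetical.

```python
import os.path


def nearest_common_ancestor(src, dst):
    """Return the nearest common ancestor of two cgroup paths."""
    return os.path.commonpath([src, dst])


def may_migrate(writable, src, dst, v2=True):
    """Model the permission checks for writing a PID to dst/cgroup.procs.

    'writable' is the set of cgroup paths whose cgroup.procs file the
    writer has permission to write.
    """
    if dst not in writable:
        return False
    # The common-ancestor check is not enforced for cgroups v1.
    if v2 and nearest_common_ancestor(src, dst) not in writable:
        return False
    return True
```

With a delegated subtree rooted at
<I>/dlgt_grp</I>,
a move from a cgroup outside the subtree has a common ancestor outside it, which is why the delegatee cannot place the first process into the subtree.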
|
|
|
|
<A NAME="lbBA"> </A>
|
|
<H2>CGROUPS VERSION 2 THREAD MODE</H2>
|
|
|
|
Among the restrictions imposed by cgroups v2 that were not present
|
|
in cgroups v1 are the following:
|
|
<DL COMPACT>
|
|
<DT id="68">*<DD>
|
|
<I>No thread-granularity control</I>:
|
|
|
|
all of the threads of a process must be in the same cgroup.
|
|
<DT id="69">*<DD>
|
|
<I>No internal processes</I>:
|
|
|
|
a cgroup can't both have member processes and
|
|
exercise controllers on child cgroups.
|
|
</DL>
|
|
<P>
|
|
|
|
Both of these restrictions were added because
|
|
the lack of these restrictions had caused problems
|
|
in cgroups v1.
|
|
In particular, the cgroups v1 ability to allow thread-level granularity
|
|
for cgroup membership made no sense for some controllers.
|
|
(A notable example was the
|
|
<I>memory</I>
|
|
|
|
controller: since threads share an address space,
|
|
it made no sense to split threads across different
|
|
<I>memory</I>
|
|
|
|
cgroups.)
|
|
<P>
|
|
|
|
Notwithstanding the initial design decision in cgroups v2,
|
|
there were use cases for certain controllers, notably the
|
|
<I>cpu</I>
|
|
|
|
controller,
|
|
for which thread-level granularity of control was meaningful and useful.
|
|
To accommodate such use cases, Linux 4.14 added
|
|
<I>thread mode</I>
|
|
|
|
for cgroups v2.
|
|
<P>
|
|
|
|
Thread mode allows the following:
|
|
<DL COMPACT>
|
|
<DT id="70">*<DD>
|
|
The creation of
|
|
<I>threaded subtrees</I>
|
|
|
|
in which the threads of a process may
|
|
be spread across cgroups inside the tree.
|
|
(A threaded subtree may contain multiple multithreaded processes.)
|
|
<DT id="71">*<DD>
|
|
The concept of
|
|
<I>threaded controllers</I>,
|
|
|
|
which can distribute resources across the cgroups in a threaded subtree.
|
|
<DT id="72">*<DD>
|
|
A relaxation of the "no internal processes" rule,
|
|
so that, within a threaded subtree,
|
|
a cgroup can both contain member threads and
|
|
exercise resource control over child cgroups.
|
|
</DL>
|
|
<P>
|
|
|
|
With the addition of thread mode,
|
|
each nonroot cgroup now contains a new file,
|
|
<I>cgroup.type</I>,
|
|
|
|
that exposes, and in some circumstances can be used to change,
|
|
the "type" of a cgroup.
|
|
This file contains one of the following type values:
|
|
<DL COMPACT>
|
|
<DT id="73"><I>domain</I>
|
|
|
|
<DD>
|
|
This is a normal v2 cgroup that provides process-granularity control.
|
|
If a process is a member of this cgroup,
|
|
then all threads of the process are (by definition) in the same cgroup.
|
|
This is the default cgroup type,
|
|
and provides the same behavior that was provided for
|
|
cgroups in the initial cgroups v2 implementation.
|
|
<DT id="74"><I>threaded</I>
|
|
|
|
<DD>
|
|
This cgroup is a member of a threaded subtree.
|
|
Threads can be added to this cgroup,
|
|
and controllers can be enabled for the cgroup.
|
|
<DT id="75"><I>domain threaded</I>
|
|
|
|
<DD>
|
|
This is a domain cgroup that serves as the root of a threaded subtree.
|
|
This cgroup type is also known as "threaded root".
|
|
<DT id="76"><I>domain invalid</I>
|
|
|
|
<DD>
|
|
This is a cgroup inside a threaded subtree
|
|
that is in an "invalid" state.
|
|
Processes can't be added to the cgroup,
|
|
and controllers can't be enabled for the cgroup.
|
|
The only thing that can be done with this cgroup (other than deleting it)
|
|
is to convert it to a
|
|
<I>threaded</I>
|
|
|
|
cgroup by writing the string
|
|
<I>"threaded"</I>
|
|
|
|
to the
|
|
<I>cgroup.type</I>
|
|
|
|
file.
|
|
<DT id="77"><DD>
|
|
The rationale for the existence of this "interim" type
|
|
during the creation of a threaded subtree
|
|
(rather than the kernel simply immediately converting all cgroups
|
|
under the threaded root to the type
|
|
<I>threaded</I>)
|
|
|
|
is to allow for
|
|
possible future extensions to the thread mode model.
|
|
|
|
</DL>
|
|
<A NAME="lbBB"> </A>
|
|
<H3>Threaded versus domain controllers</H3>
|
|
|
|
With the addition of thread mode,
|
|
cgroups v2 now distinguishes two types of resource controllers:
|
|
<DL COMPACT>
|
|
<DT id="78">*<DD>
|
|
<I>Threaded</I>
|
|
|
|
|
|
|
|
controllers: these controllers support thread-granularity for
|
|
resource control and can be enabled inside threaded subtrees,
|
|
with the result that the corresponding controller-interface files
|
|
appear inside the cgroups in the threaded subtree.
|
|
As at Linux 4.19, the following controllers are threaded:
|
|
<I>cpu</I>,
|
|
|
|
<I>perf_event</I>,
|
|
|
|
and
|
|
<I>pids</I>.
|
|
|
|
<DT id="79">*<DD>
|
|
<I>Domain</I>
|
|
|
|
controllers: these controllers support only process granularity
|
|
for resource control.
|
|
From the perspective of a domain controller,
|
|
all threads of a process are always in the same cgroup.
|
|
Domain controllers can't be enabled inside a threaded subtree.
|
|
|
|
</DL>
|
|
<A NAME="lbBC"> </A>
|
|
<H3>Creating a threaded subtree</H3>
|
|
|
|
There are two pathways that lead to the creation of a threaded subtree.
|
|
The first pathway proceeds as follows:
|
|
<DL COMPACT>
|
|
<DT id="80">1.<DD>
|
|
We write the string
|
|
<I>"threaded"</I>
|
|
|
|
to the
|
|
<I>cgroup.type</I>
|
|
|
|
file of a cgroup
|
|
<I>y/z</I>
|
|
|
|
that currently has the type
|
|
<I>domain</I>.
|
|
|
|
This has the following effects:
|
|
<DL COMPACT><DT id="81"><DD>
|
|
<DL COMPACT>
|
|
<DT id="82">*<DD>
|
|
The type of the cgroup
|
|
<I>y/z</I>
|
|
|
|
becomes
|
|
<I>threaded</I>.
|
|
|
|
<DT id="83">*<DD>
|
|
The type of the parent cgroup,
|
|
<I>y</I>,
|
|
|
|
becomes
|
|
<I>domain threaded</I>.
|
|
|
|
The parent cgroup is the root of a threaded subtree
|
|
(also known as the "threaded root").
|
|
<DT id="84">*<DD>
|
|
All other cgroups under
|
|
<I>y</I>
|
|
|
|
that were not already of type
|
|
<I>threaded</I>
|
|
|
|
(because they were inside already existing threaded subtrees
|
|
under the new threaded root)
|
|
are converted to type
|
|
<I>domain invalid</I>.
|
|
|
|
Any subsequently created cgroups under
|
|
<I>y</I>
|
|
|
|
will also have the type
|
|
<I>domain invalid</I>.
|
|
|
|
</DL>
|
|
</DL>
|
|
|
|
<DT id="85">2.<DD>
|
|
We write the string
|
|
<I>"threaded"</I>
|
|
|
|
to each of the
|
|
<I>domain invalid</I>
|
|
|
|
cgroups under
|
|
<I>y</I>,
|
|
|
|
in order to convert them to the type
|
|
<I>threaded</I>.
|
|
|
|
As a consequence of this step, all cgroups under the threaded root
|
|
now have the type
|
|
<I>threaded</I>
|
|
|
|
and the threaded subtree is now fully usable.
|
|
The requirement to write
|
|
<I>"threaded"</I>
|
|
|
|
to each of these cgroups is somewhat cumbersome,
|
|
but allows for possible future extensions to the thread-mode model.
|
|
</DL>
|
|
<P>
|
|
|
|
The second way of creating a threaded subtree is as follows:
|
|
<DL COMPACT>
|
|
<DT id="86">1.<DD>
|
|
In an existing cgroup,
|
|
<I>z</I>,
|
|
|
|
that currently has the type
|
|
<I>domain</I>,
|
|
|
|
we (1) enable one or more threaded controllers and
|
|
(2) make a process a member of
|
|
<I>z</I>.
|
|
|
|
(These two steps can be done in either order.)
|
|
This has the following consequences:
|
|
<DL COMPACT><DT id="87"><DD>
|
|
<DL COMPACT>
|
|
<DT id="88">*<DD>
|
|
The type of
|
|
<I>z</I>
|
|
|
|
becomes
|
|
<I>domain threaded</I>.
|
|
|
|
<DT id="89">*<DD>
|
|
All of the descendant cgroups of
|
|
<I>z</I>
|
|
|
|
that were not already of type
|
|
<I>threaded</I>
|
|
|
|
are converted to type
|
|
<I>domain invalid</I>.
|
|
|
|
</DL>
|
|
</DL>
|
|
|
|
<DT id="90">2.<DD>
|
|
As before, we make the threaded subtree usable by writing the string
|
|
<I>"threaded"</I>
|
|
|
|
to each of the
|
|
<I>domain invalid</I>
|
|
|
|
cgroups under
|
|
<I>z</I>,
|
|
|
|
in order to convert them to the type
|
|
<I>threaded</I>.
|
|
|
|
</DL>
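<P>

The final step of either pathway, writing
<I>"threaded"</I>
to each
<I>domain invalid</I>
cgroup, might be automated as in the following Python sketch.
The function name
<I>make_subtree_usable</I>
is hypothetical.
The sketch relies on
<I>os.walk()</I>
visiting a directory before its subdirectories, so that parents are converted before their children.

```python
import os


def make_subtree_usable(threaded_root):
    """Write "threaded" to every "domain invalid" cgroup below the root.

    Returns the list of converted cgroup directories, parents first.
    """
    converted = []
    for dirpath, _dirnames, _filenames in os.walk(threaded_root):
        if dirpath == threaded_root:
            continue                      # the threaded root keeps its type
        type_file = os.path.join(dirpath, "cgroup.type")
        with open(type_file) as f:
            current = f.read().strip()
        if current == "domain invalid":
            with open(type_file, "w") as f:
                f.write("threaded")
            converted.append(dirpath)
    return converted
```

Cgroups that already have the type
<I>threaded</I>
are left alone, since writing to them would have no effect.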
|
|
<P>
|
|
|
|
One of the consequences of the above pathways to creating a threaded subtree
|
|
is that the threaded root cgroup can be a parent only to
|
|
<I>threaded</I>
|
|
|
|
(and
|
|
<I>domain invalid</I>)
|
|
|
|
cgroups.
|
|
The threaded root cgroup can't be a parent of a
|
|
<I>domain</I>
|
|
|
|
cgroup, and a
|
|
<I>threaded</I>
|
|
|
|
cgroup
|
|
can't have a sibling that is a
|
|
<I>domain</I>
|
|
|
|
cgroup.
|
|
|
|
<A NAME="lbBD"> </A>
|
|
<H3>Using a threaded subtree</H3>
|
|
|
|
Within a threaded subtree, threaded controllers can be enabled
|
|
in each subgroup whose type has been changed to
|
|
<I>threaded</I>;
|
|
|
|
upon doing so, the corresponding controller interface files
|
|
appear in the children of that cgroup.
|
|
<P>
|
|
|
|
A process can be moved into a threaded subtree by writing its PID to the
|
|
<I>cgroup.procs</I>
|
|
|
|
file in one of the cgroups inside the tree.
|
|
This has the effect of making all of the threads
|
|
in the process members of the corresponding cgroup
|
|
and makes the process a member of the threaded subtree.
|
|
The threads of the process can then be spread across
|
|
the threaded subtree by writing their thread IDs (see
|
|
<B><A HREF="/cgi-bin/man/man2html?2+gettid">gettid</A></B>(2))
|
|
|
|
to the
|
|
<I>cgroup.threads</I>
|
|
|
|
files in different cgroups inside the subtree.
|
|
The threads of a process must all reside in the same threaded subtree.
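<P>

Spreading the threads of a process across a threaded subtree could be sketched in Python as follows.
The helper names and the round-robin placement policy are assumptions made for illustration; the thread IDs are taken from the
<I>/proc/[pid]/task</I>
directory.

```python
import os


def thread_ids(pid):
    """Return the thread IDs of process 'pid' (from /proc/[pid]/task)."""
    return sorted(int(name) for name in os.listdir("/proc/%d/task" % pid))


def move_thread(cgroup_dir, tid):
    """Move one thread by writing its ID to cgroup.threads in 'cgroup_dir'."""
    with open(os.path.join(cgroup_dir, "cgroup.threads"), "w") as f:
        f.write(str(tid))


def spread_threads(pid, cgroup_dirs):
    """Distribute the threads of 'pid' round-robin over 'cgroup_dirs'.

    Returns a dict mapping each thread ID to its target cgroup directory.
    """
    placement = {}
    for index, tid in enumerate(thread_ids(pid)):
        target = cgroup_dirs[index % len(cgroup_dirs)]
        move_thread(target, tid)
        placement[tid] = target
    return placement
```

All of the directories passed to
<I>spread_threads</I>
would need to belong to the same threaded subtree.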
|
|
<P>
|
|
|
|
As with writing to
|
|
<I>cgroup.procs</I>,
|
|
|
|
some containment rules apply when writing to the
|
|
<I>cgroup.threads</I>
|
|
|
|
file:
|
|
<DL COMPACT>
|
|
<DT id="91">*<DD>
|
|
The writer must have write permission on the
|
|
<I>cgroup.threads</I>
|
|
file in the destination cgroup.
|
|
<DT id="92">*<DD>
|
|
The writer must have write permission on the
|
|
<I>cgroup.procs</I>
|
|
|
|
file in the common ancestor of the source and destination cgroups.
|
|
(In some cases,
|
|
the common ancestor may be the source or destination cgroup itself.)
|
|
<DT id="93">*<DD>
|
|
The source and destination cgroups must be in the same threaded subtree.
|
|
(Outside a threaded subtree, an attempt to move a thread by writing
|
|
its thread ID to the
|
|
<I>cgroup.threads</I>
|
|
|
|
file in a different
|
|
<I>domain</I>
|
|
|
|
cgroup fails with the error
|
|
<B>EOPNOTSUPP</B>.)
|
|
|
|
</DL>
|
|
<P>
|
|
|
|
The
|
|
<I>cgroup.threads</I>
|
|
|
|
file is present in each cgroup (including
|
|
<I>domain</I>
|
|
|
|
cgroups) and can be read in order to discover the set of threads
|
|
that is present in the cgroup.
|
|
The set of thread IDs obtained when reading this file
|
|
is not guaranteed to be ordered or free of duplicates.
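<P>

Since the list may be unordered and may contain duplicates, a reader will often want to normalize it.
A minimal Python sketch (the function name is hypothetical):

```python
def parse_thread_ids(text):
    """Return the sorted, de-duplicated thread IDs from cgroup.threads text."""
    return sorted({int(token) for token in text.split()})
```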
|
|
<P>
|
|
|
|
The
|
|
<I>cgroup.procs</I>
|
|
|
|
file in the threaded root shows the PIDs of all processes
|
|
that are members of the threaded subtree.
|
|
The
|
|
<I>cgroup.procs</I>
|
|
|
|
files in the other cgroups in the subtree are not readable.
|
|
<P>
|
|
|
|
Domain controllers can't be enabled in a threaded subtree;
|
|
no controller-interface files appear inside the cgroups underneath the
|
|
threaded root.
|
|
From the point of view of a domain controller,
|
|
threaded subtrees are invisible:
|
|
a multithreaded process inside a threaded subtree appears to a domain
|
|
controller as a process that resides in the threaded root cgroup.
|
|
<P>
|
|
|
|
Within a threaded subtree, the "no internal processes" rule does not apply:
|
|
a cgroup can both contain member processes (or threads)
|
|
and exercise controllers on child cgroups.
|
|
|
|
<A NAME="lbBE"> </A>
|
|
<H3>Rules for writing to cgroup.type and creating threaded subtrees</H3>
|
|
|
|
A number of rules apply when writing to the
|
|
<I>cgroup.type</I>
|
|
|
|
file:
|
|
<DL COMPACT>
|
|
<DT id="94">*<DD>
|
|
Only the string
|
|
<I>"threaded"</I>
|
|
|
|
may be written.
|
|
In other words, the only explicit transition that is possible is to convert a
|
|
<I>domain</I>
|
|
|
|
cgroup to type
|
|
<I>threaded</I>.
|
|
|
|
<DT id="95">*<DD>
|
|
The effect of writing
|
|
<I>"threaded"</I>
|
|
|
|
depends on the current value in
|
|
<I>cgroup.type</I>,
|
|
|
|
as follows:
|
|
<DL COMPACT><DT id="96"><DD>
|
|
<DL COMPACT>
|
|
<DT id="97">•<DD>
|
|
<I>domain</I>
|
|
|
|
or
|
|
<I>domain threaded</I>:
|
|
|
|
start the creation of a threaded subtree
|
|
(whose root is the parent of this cgroup) via
|
|
the first of the pathways described above;
|
|
<DT id="98">•<DD>
|
|
<I>domain invalid</I>:
|
|
|
|
convert this cgroup (which is inside a threaded subtree) to a usable (i.e.,
|
|
<I>threaded</I>)
|
|
|
|
state;
|
|
<DT id="99">•<DD>
|
|
<I>threaded</I>:
|
|
|
|
no effect (a "no-op").
|
|
</DL>
|
|
</DL>
|
|
|
|
<DT id="100">*<DD>
|
|
We can't write to a
|
|
<I>cgroup.type</I>
|
|
|
|
file if the parent's type is
|
|
<I>domain invalid</I>.
|
|
|
|
In other words, the cgroups of a threaded subtree must be converted to the
|
|
<I>threaded</I>
|
|
|
|
state in a top-down manner.
|
|
</DL>
|
|
<P>
|
|
|
|
There are also some constraints that must be satisfied
|
|
in order to create a threaded subtree rooted at the cgroup
|
|
<I>x</I>:
|
|
|
|
<DL COMPACT>
|
|
<DT id="101">*<DD>
|
|
There can be no member processes in the descendant cgroups of
|
|
<I>x</I>.
|
|
|
|
(The cgroup
|
|
<I>x</I>
|
|
|
|
can itself have member processes.)
|
|
<DT id="102">*<DD>
|
|
No domain controllers may be enabled in
|
|
<I>x</I>'s
|
|
|
|
<I>cgroup.subtree_control</I>
|
|
|
|
file.
|
|
</DL>
|
|
<P>
|
|
|
|
If any of the above constraints is violated, then an attempt to write
|
|
<I>"threaded"</I>
|
|
|
|
to a
|
|
<I>cgroup.type</I>
|
|
|
|
file fails with the error
|
|
<B>ENOTSUP</B>.
|
|
|
|
|
|
<A NAME="lbBF"> </A>
|
|
<H3>The domain threaded cgroup type</H3>
|
|
|
|
According to the pathways described above,
|
|
the type of a cgroup can change to
|
|
<I>domain threaded</I>
|
|
|
|
in either of the following cases:
|
|
<DL COMPACT>
|
|
<DT id="103">*<DD>
|
|
The string
|
|
<I>"threaded"</I>
|
|
|
|
is written to a child cgroup.
|
|
<DT id="104">*<DD>
|
|
A threaded controller is enabled inside the cgroup and
|
|
a process is made a member of the cgroup.
|
|
</DL>
|
|
<P>
|
|
|
|
A
|
|
<I>domain threaded</I>
|
|
|
|
cgroup,
|
|
<I>x</I>,
|
|
|
|
can revert to the type
|
|
<I>domain</I>
|
|
|
|
if the above conditions no longer hold true; that is, if all
|
|
<I>threaded</I>
|
|
|
|
child cgroups of
|
|
<I>x</I>
|
|
|
|
are removed and either
|
|
<I>x</I>
|
|
|
|
no longer has threaded controllers enabled or
|
|
no longer has member processes.
|
|
<P>
|
|
|
|
When a
|
|
<I>domain threaded</I>
|
|
|
|
cgroup
|
|
<I>x</I>
|
|
|
|
reverts to the type
|
|
<I>domain</I>:
|
|
|
|
<DL COMPACT>
|
|
<DT id="105">*<DD>
|
|
All
|
|
<I>domain invalid</I>
|
|
|
|
descendants of
|
|
<I>x</I>
|
|
|
|
that are not in lower-level threaded subtrees revert to the type
|
|
<I>domain</I>.
|
|
|
|
<DT id="106">*<DD>
|
|
The root cgroups in any lower-level threaded subtrees revert to the type
|
|
<I>domain threaded</I>.
|
|
|
|
|
|
</DL>
|
|
<A NAME="lbBG"> </A>
|
|
<H3>Exceptions for the root cgroup</H3>
|
|
|
|
The root cgroup of the v2 hierarchy is treated exceptionally:
|
|
it can be the parent of both
|
|
<I>domain</I>
|
|
|
|
and
|
|
<I>threaded</I>
|
|
|
|
cgroups.
|
|
If the string
|
|
<I>"threaded"</I>
|
|
|
|
is written to the
|
|
<I>cgroup.type</I>
|
|
|
|
file of one of the children of the root cgroup, then
|
|
<DL COMPACT>
|
|
<DT id="107">*<DD>
|
|
The type of that cgroup becomes
|
|
<I>threaded</I>.
|
|
|
|
<DT id="108">*<DD>
|
|
The type of any descendants of that cgroup that
|
|
are not part of lower-level threaded subtrees changes to
|
|
<I>domain invalid</I>.
|
|
|
|
</DL>
|
|
<P>
|
|
|
|
Note that in this case, there is no cgroup whose type becomes
|
|
<I>domain threaded</I>.
|
|
|
|
(Notionally, the root cgroup can be considered as the threaded root
|
|
for the cgroup whose type was changed to
|
|
<I>threaded</I>.)
|
|
|
|
<P>
|
|
|
|
The aim of this exceptional treatment for the root cgroup is to
|
|
allow a threaded cgroup that employs the
|
|
<I>cpu</I>
|
|
|
|
controller to be placed as high as possible in the hierarchy,
|
|
so as to minimize the (small) cost of traversing the cgroup hierarchy.
|
|
|
|
<A NAME="lbBH"> </A>
|
|
<H3>The cgroups v2 cpu controller and realtime threads</H3>
|
|
|
|
As at Linux 4.19, the cgroups v2
|
|
<I>cpu</I>
|
|
|
|
controller does not support control of realtime threads
|
|
(specifically threads scheduled under any of the policies
|
|
<B>SCHED_FIFO</B>,
|
|
|
|
<B>SCHED_RR</B>,
|
|
|
|
and
|
|
<B>SCHED_DEADLINE</B>;
|
|
|
|
see
|
|
<B><A HREF="/cgi-bin/man/man2html?7+sched">sched</A></B>(7)).
|
|
|
|
Therefore, the
|
|
<I>cpu</I>
|
|
|
|
controller can be enabled in the root cgroup only
|
|
if all realtime threads are in the root cgroup.
|
|
(If there are realtime threads in nonroot cgroups, then a
|
|
<B><A HREF="/cgi-bin/man/man2html?2+write">write</A></B>(2)
|
|
|
|
of the string
|
|
<I>"+cpu"</I>
|
|
|
|
to the
|
|
<I>cgroup.subtree_control</I>
|
|
|
|
file fails with the error
|
|
<B>EINVAL</B>.)
|
|
|
|
<P>
|
|
|
|
On some systems,
|
|
<B><A HREF="/cgi-bin/man/man2html?1+systemd">systemd</A></B>(1)
|
|
|
|
places certain realtime threads in nonroot cgroups in the v2 hierarchy.
|
|
On such systems,
|
|
these threads must first be moved to the root cgroup before the
|
|
<I>cpu</I>
|
|
|
|
controller can be enabled.
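<P>

To find the realtime threads that would need to be moved, something like the following Python sketch could be used.
The helper names are hypothetical, and the value 6 for
<B>SCHED_DEADLINE</B>
is taken from the Linux headers because the Python
<I>os</I>
module does not export that constant.

```python
import os

SCHED_DEADLINE = 6    # Linux value; not provided by the os module


def is_realtime_policy(policy):
    """Return True for SCHED_FIFO, SCHED_RR, or SCHED_DEADLINE."""
    # The reset-on-fork flag may be ORed into the reported policy.
    policy = policy & ~os.SCHED_RESET_ON_FORK
    return policy in (os.SCHED_FIFO, os.SCHED_RR, SCHED_DEADLINE)


def realtime_threads(pid):
    """Return the IDs of the threads of 'pid' that use a realtime policy."""
    found = []
    for name in os.listdir("/proc/%d/task" % pid):
        tid = int(name)
        try:
            if is_realtime_policy(os.sched_getscheduler(tid)):
                found.append(tid)
        except ProcessLookupError:
            pass                          # the thread exited meanwhile
    return found
```

Each thread ID returned could then be written to the appropriate file in the root cgroup before enabling the
<I>cpu</I>
controller.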
|
|
|
|
<A NAME="lbBI"> </A>
|
|
<H2>ERRORS</H2>
|
|
|
|
The following errors can occur for
|
|
<B><A HREF="/cgi-bin/man/man2html?2+mount">mount</A></B>(2):
|
|
|
|
<DL COMPACT>
|
|
<DT id="109"><B>EBUSY</B>
|
|
|
|
<DD>
|
|
An attempt to mount a cgroup version 1 filesystem specified neither the
|
|
<I>name=</I>
|
|
|
|
option (to mount a named hierarchy) nor a controller name (or
|
|
<I>all</I>).
|
|
|
|
</DL>
|
|
<A NAME="lbBJ"> </A>
|
|
<H2>NOTES</H2>
|
|
|
|
A child process created via
|
|
<B><A HREF="/cgi-bin/man/man2html?2+fork">fork</A></B>(2)
|
|
|
|
inherits its parent's cgroup memberships.
|
|
A process's cgroup memberships are preserved across
|
|
<B><A HREF="/cgi-bin/man/man2html?2+execve">execve</A></B>(2).
|
|
|
|
|
|
<A NAME="lbBK"> </A>
|
|
<H3>/proc files</H3>
|
|
|
|
<DL COMPACT>
|
|
<DT id="110"><I>/proc/cgroups</I> (since Linux 2.6.24)
|
|
|
|
<DD>
|
|
This file contains information about the controllers
|
|
that are compiled into the kernel.
|
|
An example of the contents of this file (reformatted for readability)
|
|
is the following:
|
|
<DT id="111"><DD>
|
|
|
|
|
|
<PRE>
#subsys_name    hierarchy      num_cgroups    enabled
cpuset          4              1              1
cpu             8              1              1
cpuacct         8              1              1
blkio           6              1              1
memory          3              1              1
devices         10             84             1
freezer         7              1              1
net_cls         9              1              1
perf_event      5              1              1
net_prio        9              1              1
hugetlb         0              1              0
pids            2              1              1
</PRE>
|
|
|
|
|
|
<DT id="112"><DD>
|
|
The fields in this file are, from left to right:
|
|
<DL COMPACT><DT id="113"><DD>
|
|
<DL COMPACT>
|
|
<DT id="114">1.<DD>
|
|
The name of the controller.
|
|
<DT id="115">2.<DD>
|
|
The unique ID of the cgroup hierarchy on which this controller is mounted.
|
|
If multiple cgroups v1 controllers are bound to the same hierarchy,
|
|
then each will show the same hierarchy ID in this field.
|
|
The value in this field will be 0 if:
|
|
<DL COMPACT><DT id="116"><DD>
|
|
<DL COMPACT>
|
|
<DT id="117">a)<DD>
|
|
the controller is not mounted on a cgroups v1 hierarchy;
|
|
<DT id="118">b)<DD>
|
|
the controller is bound to the cgroups v2 single unified hierarchy; or
|
|
<DT id="119">c)<DD>
|
|
the controller is disabled (see below).
|
|
</DL>
|
|
</DL>
|
|
|
|
<DT id="120">3.<DD>
|
|
The number of control groups in this hierarchy using this controller.
|
|
<DT id="121">4.<DD>
|
|
This field contains the value 1 if this controller is enabled,
|
|
or 0 if it has been disabled (via the
|
|
<I>cgroup_disable</I>
|
|
|
|
kernel command-line boot parameter).
|
|
</DL>
|
|
</DL>
|
|
|
|
<DT id="122"><I>/proc/[pid]/cgroup</I> (since Linux 2.6.24)
|
|
|
|
<DD>
|
|
This file describes the control groups to which the process
|
|
with the corresponding PID belongs.
|
|
The displayed information differs for
|
|
cgroups version 1 and version 2 hierarchies.
|
|
<DT id="123"><DD>
|
|
For each cgroup hierarchy of which the process is a member,
|
|
there is one entry containing three colon-separated fields:
|
|
<DT id="124"><DD>
|
|
|
|
|
|
hierarchy-ID:controller-list:cgroup-path
|
|
|
|
|
|
<DT id="125"><DD>
|
|
For example:
|
|
<DT id="126"><DD>
|
|
|
|
|
|
5:cpuacct,cpu,cpuset:/daemons
|
|
|
|
|
|
<DT id="127"><DD>
|
|
The colon-separated fields are, from left to right:
|
|
<DL COMPACT><DT id="128"><DD>
|
|
<DL COMPACT>
|
|
<DT id="129">1.<DD>
|
|
For cgroups version 1 hierarchies,
|
|
this field contains a unique hierarchy ID number
|
|
that can be matched to a hierarchy ID in
|
|
<I>/proc/cgroups</I>.
|
|
|
|
For the cgroups version 2 hierarchy, this field contains the value 0.
|
|
<DT id="130">2.<DD>
|
|
For cgroups version 1 hierarchies,
|
|
this field contains a comma-separated list of the controllers
|
|
bound to the hierarchy.
|
|
For the cgroups version 2 hierarchy, this field is empty.
|
|
<DT id="131">3.<DD>
|
|
This field contains the pathname of the control group in the hierarchy
|
|
to which the process belongs.
|
|
This pathname is relative to the mount point of the hierarchy.
|
|
</DL>
|
|
</DL>
|
|
|
|
|
|
</DL>
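<P>
The field layouts described above for
<I>/proc/cgroups</I>
and
<I>/proc/[pid]/cgroup</I>
are simple enough to parse by hand.
The following sketch shows one way to do so; the type and function names are hypothetical, and the parsers take the file contents as a string so that they can be tried without a live system.

```python
from typing import NamedTuple


class Controller(NamedTuple):
    name: str          # field 1: controller name
    hierarchy_id: int  # field 2: 0 if unmounted in v1, bound to v2, or disabled
    num_cgroups: int   # field 3: cgroups in this hierarchy using the controller
    enabled: bool      # field 4: False if disabled via cgroup_disable


class Membership(NamedTuple):
    hierarchy_id: int  # 0 for the cgroups v2 hierarchy
    controllers: list  # empty for the cgroups v2 hierarchy
    path: str          # relative to the mount point of the hierarchy


def parse_proc_cgroups(text):
    """Parse the contents of /proc/cgroups into Controller records."""
    controllers = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header line
        name, hierarchy_id, num_cgroups, enabled = line.split()
        controllers.append(
            Controller(name, int(hierarchy_id), int(num_cgroups), enabled == "1")
        )
    return controllers


def parse_pid_cgroup(text):
    """Parse the contents of /proc/[pid]/cgroup into Membership records."""
    memberships = []
    for line in text.splitlines():
        if not line:
            continue
        # Split on the first two colons only, so the path is kept whole.
        hierarchy_id, controller_list, path = line.split(":", 2)
        controllers = controller_list.split(",") if controller_list else []
        memberships.append(Membership(int(hierarchy_id), controllers, path))
    return memberships
```

For instance, applying
<I>parse_pid_cgroup</I>
to the example line shown earlier yields one record with hierarchy ID 5, the controllers
<I>cpuacct</I>,
<I>cpu</I>,
and
<I>cpuset</I>,
and the path
<I>/daemons</I>.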
|
|
<A NAME="lbBL"> </A>
|
|
<H3>/sys/kernel/cgroup files</H3>
|
|
|
|
<DL COMPACT>
|
|
<DT id="132"><I>/sys/kernel/cgroup/delegate</I> (since Linux 4.15)
|
|
|
|
<DD>
|
|
|
|
This file exports a list of the cgroups v2 files
|
|
(one per line) that are delegatable
|
|
(i.e., whose ownership should be changed to the user ID of the delegatee).
|
|
In the future, the set of delegatable files may change or grow,
|
|
and this file provides a way for the kernel to inform
|
|
user-space applications of which files must be delegated.
|
|
As at Linux 4.15, one sees the following when inspecting this file:
|
|
<DT id="133"><DD>
|
|
|
|
|
|
<PRE>
$ <B>cat /sys/kernel/cgroup/delegate</B>
cgroup.procs
cgroup.subtree_control
cgroup.threads
</PRE>
|
|
|
|
|
|
<DT id="134"><I>/sys/kernel/cgroup/features</I> (since Linux 4.15)
|
|
|
|
<DD>
|
|
|
|
Over time, the set of cgroups v2 features that are provided by the
|
|
kernel may change or grow,
|
|
or some features may not be enabled by default.
|
|
This file provides a way for user-space applications to discover what
|
|
features the running kernel supports and has enabled.
|
|
Features are listed one per line:
|
|
<DT id="135"><DD>
|
|
|
|
|
|
<PRE>
$ <B>cat /sys/kernel/cgroup/features</B>
nsdelegate
</PRE>
|
|
|
|
|
|
<DT id="136"><DD>
|
|
The entries that can appear in this file are:
|
|
<DL COMPACT><DT id="137"><DD>
|
|
<DL COMPACT>
|
|
<DT id="138"><I>nsdelegate</I> (since Linux 4.15)
|
|
|
|
<DD>
|
|
The kernel supports the
|
|
<I>nsdelegate</I>
|
|
|
|
mount option.
|
|
</DL>
|
|
</DL>
|
|
|
|
</DL>
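<P>
A program that delegates a cgroup can read these two files instead of hard-coding their contents.
The following is a hedged sketch: the function names are hypothetical, and it assumes that delegation consists of changing the ownership of the cgroup directory itself plus each listed file.
The paths are parameters so that the logic can be exercised against ordinary files.

```python
import os


def delegatable_files(delegate_path="/sys/kernel/cgroup/delegate"):
    """Return the file names listed, one per line, in the delegate file."""
    with open(delegate_path) as f:
        return [line.strip() for line in f if line.strip()]


def supports_feature(name, features_path="/sys/kernel/cgroup/features"):
    """Return True if name is listed in the features file."""
    try:
        with open(features_path) as f:
            return name in f.read().split()
    except FileNotFoundError:
        return False  # no features file, e.g. on kernels before Linux 4.15


def delegate_cgroup(cgroup_dir, uid, gid,
                    delegate_path="/sys/kernel/cgroup/delegate"):
    """Hand cgroup_dir and its delegatable files to uid:gid.

    Returns the names of the files whose ownership was changed.
    """
    os.chown(cgroup_dir, uid, gid)
    changed = []
    for name in delegatable_files(delegate_path):
        path = os.path.join(cgroup_dir, name)
        if os.path.exists(path):  # skip entries this cgroup does not have
            os.chown(path, uid, gid)
            changed.append(name)
    return changed
```

A caller might first check
<I>supports_feature("nsdelegate")</I>
to decide whether the
<I>nsdelegate</I>
mount option is worth requesting.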
|
|
<A NAME="lbBM"> </A>
|
|
<H2>SEE ALSO</H2>
|
|
|
|
<B><A HREF="/cgi-bin/man/man2html?1+prlimit">prlimit</A></B>(1),
|
|
|
|
<B><A HREF="/cgi-bin/man/man2html?1+systemd">systemd</A></B>(1),
|
|
|
|
<B><A HREF="/cgi-bin/man/man2html?1+systemd-cgls">systemd-cgls</A></B>(1),
|
|
|
|
<B><A HREF="/cgi-bin/man/man2html?1+systemd-cgtop">systemd-cgtop</A></B>(1),
|
|
|
|
<B><A HREF="/cgi-bin/man/man2html?2+clone">clone</A></B>(2),
|
|
|
|
<B><A HREF="/cgi-bin/man/man2html?2+ioprio_set">ioprio_set</A></B>(2),
|
|
|
|
<B><A HREF="/cgi-bin/man/man2html?2+perf_event_open">perf_event_open</A></B>(2),
|
|
|
|
<B><A HREF="/cgi-bin/man/man2html?2+setrlimit">setrlimit</A></B>(2),
|
|
|
|
<B><A HREF="/cgi-bin/man/man2html?7+cgroup_namespaces">cgroup_namespaces</A></B>(7),
|
|
|
|
<B><A HREF="/cgi-bin/man/man2html?7+cpuset">cpuset</A></B>(7),
|
|
|
|
<B><A HREF="/cgi-bin/man/man2html?7+namespaces">namespaces</A></B>(7),
|
|
|
|
<B><A HREF="/cgi-bin/man/man2html?7+sched">sched</A></B>(7),
|
|
|
|
<B><A HREF="/cgi-bin/man/man2html?7+user_namespaces">user_namespaces</A></B>(7)
|
|
|
|
<A NAME="lbBN"> </A>
|
|
<H2>COLOPHON</H2>
|
|
|
|
This page is part of release 5.05 of the Linux
|
|
<I>man-pages</I>
|
|
|
|
project.
|
|
A description of the project,
|
|
information about reporting bugs,
|
|
and the latest version of this page,
|
|
can be found at
|
|
<A HREF="https://www.kernel.org/doc/man-pages/">https://www.kernel.org/doc/man-pages/</A>.
|
|
<P>
|
|
|
|
<HR>
|
|
<A NAME="index"> </A><H2>Index</H2>
|
|
<DL>
|
|
<DT id="139"><A HREF="#lbAB">NAME</A><DD>
|
|
<DT id="140"><A HREF="#lbAC">DESCRIPTION</A><DD>
|
|
<DL>
|
|
<DT id="141"><A HREF="#lbAD">Terminology</A><DD>
|
|
<DT id="142"><A HREF="#lbAE">Cgroups version 1 and version 2</A><DD>
|
|
</DL>
|
|
<DT id="143"><A HREF="#lbAF">CGROUPS VERSION 1</A><DD>
|
|
<DL>
|
|
<DT id="144"><A HREF="#lbAG">Tasks (threads) versus processes</A><DD>
|
|
<DT id="145"><A HREF="#lbAH">Mounting v1 controllers</A><DD>
|
|
<DT id="146"><A HREF="#lbAI">Unmounting v1 controllers</A><DD>
|
|
<DT id="147"><A HREF="#lbAJ">Cgroups version 1 controllers</A><DD>
|
|
<DT id="148"><A HREF="#lbAK">Creating cgroups and moving processes</A><DD>
|
|
<DT id="149"><A HREF="#lbAL">Removing cgroups</A><DD>
|
|
<DT id="150"><A HREF="#lbAM">Cgroups v1 release notification</A><DD>
|
|
<DT id="151"><A HREF="#lbAN">Cgroup v1 named hierarchies</A><DD>
|
|
</DL>
|
|
<DT id="152"><A HREF="#lbAO">CGROUPS VERSION 2</A><DD>
|
|
<DL>
|
|
<DT id="153"><A HREF="#lbAP">Cgroups v2 unified hierarchy</A><DD>
|
|
<DT id="154"><A HREF="#lbAQ">Cgroups v2 controllers</A><DD>
|
|
<DT id="155"><A HREF="#lbAR">Cgroups v2 subtree control</A><DD>
|
|
<DT id="156"><A HREF="#lbAS">Cgroups v2 no internal processes rule</A><DD>
|
|
<DT id="157"><A HREF="#lbAT">Cgroups v2 cgroup.events file</A><DD>
|
|
<DT id="158"><A HREF="#lbAU">Cgroup v2 release notification</A><DD>
|
|
<DT id="159"><A HREF="#lbAV">Cgroups v2 cgroup.stat file</A><DD>
|
|
<DT id="160"><A HREF="#lbAW">Limiting the number of descendant cgroups</A><DD>
|
|
</DL>
|
|
<DT id="161"><A HREF="#lbAX">CGROUPS DELEGATION: DELEGATING A HIERARCHY TO A LESS PRIVILEGED USER</A><DD>
|
|
<DL>
|
|
<DT id="162"><A HREF="#lbAY">Cgroups v2 delegation: nsdelegate and cgroup namespaces</A><DD>
|
|
<DT id="163"><A HREF="#lbAZ">Cgroup delegation containment rules</A><DD>
|
|
</DL>
|
|
<DT id="164"><A HREF="#lbBA">CGROUPS VERSION 2 THREAD MODE</A><DD>
|
|
<DL>
|
|
<DT id="165"><A HREF="#lbBB">Threaded versus domain controllers</A><DD>
|
|
<DT id="166"><A HREF="#lbBC">Creating a threaded subtree</A><DD>
|
|
<DT id="167"><A HREF="#lbBD">Using a threaded subtree</A><DD>
|
|
<DT id="168"><A HREF="#lbBE">Rules for writing to cgroup.type and creating threaded subtrees</A><DD>
|
|
<DT id="169"><A HREF="#lbBF">The domain threaded cgroup type</A><DD>
|
|
<DT id="170"><A HREF="#lbBG">Exceptions for the root cgroup</A><DD>
|
|
<DT id="171"><A HREF="#lbBH">The cgroups v2 cpu controller and realtime threads</A><DD>
|
|
</DL>
|
|
<DT id="172"><A HREF="#lbBI">ERRORS</A><DD>
|
|
<DT id="173"><A HREF="#lbBJ">NOTES</A><DD>
|
|
<DL>
|
|
<DT id="174"><A HREF="#lbBK">/proc files</A><DD>
|
|
<DT id="175"><A HREF="#lbBL">/sys/kernel/cgroup files</A><DD>
|
|
</DL>
|
|
<DT id="176"><A HREF="#lbBM">SEE ALSO</A><DD>
|
|
<DT id="177"><A HREF="#lbBN">COLOPHON</A><DD>
|
|
</DL>
|
|
<HR>
|
|
This document was created by
|
|
<A HREF="/cgi-bin/man/man2html">man2html</A>,
|
|
using the manual pages.<BR>
|
|
Time: 00:06:07 GMT, March 31, 2021
|
|
</BODY>
|
|
</HTML>
|