Fixed MTP to work with TWRP

This commit is contained in:
awab228 2018-06-19 23:16:04 +02:00
commit f6dfaef42e
50820 changed files with 20846062 additions and 0 deletions

View file

@ -0,0 +1,26 @@
00-INDEX
- this file
blkio-controller.txt
- Description for Block IO Controller, implementation and usage details.
cgroups.txt
- Control Groups definition, implementation details, examples and API.
cpuacct.txt
- CPU Accounting Controller; account CPU usage for groups of tasks.
cpusets.txt
- documents the cpusets feature; assign CPUs and Mem to a set of tasks.
devices.txt
- Device Whitelist Controller; description, interface and security.
freezer-subsystem.txt
- checkpointing; rationale to not use signals, interface.
hugetlb.txt
- HugeTLB Controller implementation and usage details.
memcg_test.txt
- Memory Resource Controller; implementation details.
memory.txt
- Memory Resource Controller; design, accounting, interface, testing.
net_cls.txt
- Network classifier cgroups details and usages.
net_prio.txt
- Network priority cgroups details and usages.
resource_counter.txt
- Resource Counter API.

View file

@ -0,0 +1,394 @@
Block IO Controller
===================
Overview
========
cgroup subsys "blkio" implements the block io controller. There seems to be
a need of various kinds of IO control policies (like proportional BW, max BW)
both at leaf nodes as well as at intermediate nodes in a storage hierarchy.
Plan is to use the same cgroup based management interface for blkio controller
and based on user options switch IO policies in the background.
Currently two IO control policies are implemented. First one is proportional
weight time based division of disk policy. It is implemented in CFQ. Hence
this policy takes effect only on leaf nodes when CFQ is being used. The second
one is throttling policy which can be used to specify upper IO rate limits
on devices. This policy is implemented in generic block layer and can be
used on leaf nodes as well as higher level logical devices like device mapper.
HOWTO
=====
Proportional Weight division of bandwidth
-----------------------------------------
You can do a very simple testing of running two dd threads in two different
cgroups. Here is what you can do.
- Enable Block IO controller
CONFIG_BLK_CGROUP=y
- Enable group scheduling in CFQ
CONFIG_CFQ_GROUP_IOSCHED=y
- Compile and boot into kernel and mount IO controller (blkio); see
cgroups.txt, Why are cgroups needed?.
mount -t tmpfs cgroup_root /sys/fs/cgroup
mkdir /sys/fs/cgroup/blkio
mount -t cgroup -o blkio none /sys/fs/cgroup/blkio
- Create two cgroups
mkdir -p /sys/fs/cgroup/blkio/test1/ /sys/fs/cgroup/blkio/test2
- Set weights of group test1 and test2
echo 1000 > /sys/fs/cgroup/blkio/test1/blkio.weight
echo 500 > /sys/fs/cgroup/blkio/test2/blkio.weight
- Create two same size files (say 512MB each) on same disk (file1, file2) and
launch two dd threads in different cgroup to read those files.
sync
echo 3 > /proc/sys/vm/drop_caches
dd if=/mnt/sdb/zerofile1 of=/dev/null &
echo $! > /sys/fs/cgroup/blkio/test1/tasks
cat /sys/fs/cgroup/blkio/test1/tasks
dd if=/mnt/sdb/zerofile2 of=/dev/null &
echo $! > /sys/fs/cgroup/blkio/test2/tasks
cat /sys/fs/cgroup/blkio/test2/tasks
- At macro level, first dd should finish first. To get more precise data, keep
on looking at (with the help of script), at blkio.disk_time and
blkio.disk_sectors files of both test1 and test2 groups. This will tell how
much disk time (in milli seconds), each group got and how many secotors each
group dispatched to the disk. We provide fairness in terms of disk time, so
ideally io.disk_time of cgroups should be in proportion to the weight.
Throttling/Upper Limit policy
-----------------------------
- Enable Block IO controller
CONFIG_BLK_CGROUP=y
- Enable throttling in block layer
CONFIG_BLK_DEV_THROTTLING=y
- Mount blkio controller (see cgroups.txt, Why are cgroups needed?)
mount -t cgroup -o blkio none /sys/fs/cgroup/blkio
- Specify a bandwidth rate on particular device for root group. The format
for policy is "<major>:<minor> <bytes_per_second>".
echo "8:16 1048576" > /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device
Above will put a limit of 1MB/second on reads happening for root group
on device having major/minor number 8:16.
- Run dd to read a file and see if rate is throttled to 1MB/s or not.
# dd if=/mnt/common/zerofile of=/dev/null bs=4K count=1024
# iflag=direct
1024+0 records in
1024+0 records out
4194304 bytes (4.2 MB) copied, 4.0001 s, 1.0 MB/s
Limits for writes can be put using blkio.throttle.write_bps_device file.
Hierarchical Cgroups
====================
Both CFQ and throttling implement hierarchy support; however,
throttling's hierarchy support is enabled iff "sane_behavior" is
enabled from cgroup side, which currently is a development option and
not publicly available.
If somebody created a hierarchy like as follows.
root
/ \
test1 test2
|
test3
CFQ by default and throttling with "sane_behavior" will handle the
hierarchy correctly. For details on CFQ hierarchy support, refer to
Documentation/block/cfq-iosched.txt. For throttling, all limits apply
to the whole subtree while all statistics are local to the IOs
directly generated by tasks in that cgroup.
Throttling without "sane_behavior" enabled from cgroup side will
practically treat all groups at same level as if it looks like the
following.
pivot
/ / \ \
root test1 test2 test3
Various user visible config options
===================================
CONFIG_BLK_CGROUP
- Block IO controller.
CONFIG_DEBUG_BLK_CGROUP
- Debug help. Right now some additional stats file show up in cgroup
if this option is enabled.
CONFIG_CFQ_GROUP_IOSCHED
- Enables group scheduling in CFQ. Currently only 1 level of group
creation is allowed.
CONFIG_BLK_DEV_THROTTLING
- Enable block device throttling support in block layer.
Details of cgroup files
=======================
Proportional weight policy files
--------------------------------
- blkio.weight
- Specifies per cgroup weight. This is default weight of the group
on all the devices until and unless overridden by per device rule.
(See blkio.weight_device).
Currently allowed range of weights is from 10 to 1000.
- blkio.weight_device
- One can specify per cgroup per device rules using this interface.
These rules override the default value of group weight as specified
by blkio.weight.
Following is the format.
# echo dev_maj:dev_minor weight > blkio.weight_device
Configure weight=300 on /dev/sdb (8:16) in this cgroup
# echo 8:16 300 > blkio.weight_device
# cat blkio.weight_device
dev weight
8:16 300
Configure weight=500 on /dev/sda (8:0) in this cgroup
# echo 8:0 500 > blkio.weight_device
# cat blkio.weight_device
dev weight
8:0 500
8:16 300
Remove specific weight for /dev/sda in this cgroup
# echo 8:0 0 > blkio.weight_device
# cat blkio.weight_device
dev weight
8:16 300
- blkio.leaf_weight[_device]
- Equivalents of blkio.weight[_device] for the purpose of
deciding how much weight tasks in the given cgroup has while
competing with the cgroup's child cgroups. For details,
please refer to Documentation/block/cfq-iosched.txt.
- blkio.time
- disk time allocated to cgroup per device in milliseconds. First
two fields specify the major and minor number of the device and
third field specifies the disk time allocated to group in
milliseconds.
- blkio.sectors
- number of sectors transferred to/from disk by the group. First
two fields specify the major and minor number of the device and
third field specifies the number of sectors transferred by the
group to/from the device.
- blkio.io_service_bytes
- Number of bytes transferred to/from the disk by the group. These
are further divided by the type of operation - read or write, sync
or async. First two fields specify the major and minor number of the
device, third field specifies the operation type and the fourth field
specifies the number of bytes.
- blkio.io_serviced
- Number of IOs completed to/from the disk by the group. These
are further divided by the type of operation - read or write, sync
or async. First two fields specify the major and minor number of the
device, third field specifies the operation type and the fourth field
specifies the number of IOs.
- blkio.io_service_time
- Total amount of time between request dispatch and request completion
for the IOs done by this cgroup. This is in nanoseconds to make it
meaningful for flash devices too. For devices with queue depth of 1,
this time represents the actual service time. When queue_depth > 1,
that is no longer true as requests may be served out of order. This
may cause the service time for a given IO to include the service time
of multiple IOs when served out of order which may result in total
io_service_time > actual time elapsed. This time is further divided by
the type of operation - read or write, sync or async. First two fields
specify the major and minor number of the device, third field
specifies the operation type and the fourth field specifies the
io_service_time in ns.
- blkio.io_wait_time
- Total amount of time the IOs for this cgroup spent waiting in the
scheduler queues for service. This can be greater than the total time
elapsed since it is cumulative io_wait_time for all IOs. It is not a
measure of total time the cgroup spent waiting but rather a measure of
the wait_time for its individual IOs. For devices with queue_depth > 1
this metric does not include the time spent waiting for service once
the IO is dispatched to the device but till it actually gets serviced
(there might be a time lag here due to re-ordering of requests by the
device). This is in nanoseconds to make it meaningful for flash
devices too. This time is further divided by the type of operation -
read or write, sync or async. First two fields specify the major and
minor number of the device, third field specifies the operation type
and the fourth field specifies the io_wait_time in ns.
- blkio.io_merged
- Total number of bios/requests merged into requests belonging to this
cgroup. This is further divided by the type of operation - read or
write, sync or async.
- blkio.io_queued
- Total number of requests queued up at any given instant for this
cgroup. This is further divided by the type of operation - read or
write, sync or async.
- blkio.avg_queue_size
- Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y.
The average queue size for this cgroup over the entire time of this
cgroup's existence. Queue size samples are taken each time one of the
queues of this cgroup gets a timeslice.
- blkio.group_wait_time
- Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y.
This is the amount of time the cgroup had to wait since it became busy
(i.e., went from 0 to 1 request queued) to get a timeslice for one of
its queues. This is different from the io_wait_time which is the
cumulative total of the amount of time spent by each IO in that cgroup
waiting in the scheduler queue. This is in nanoseconds. If this is
read when the cgroup is in a waiting (for timeslice) state, the stat
will only report the group_wait_time accumulated till the last time it
got a timeslice and will not include the current delta.
- blkio.empty_time
- Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y.
This is the amount of time a cgroup spends without any pending
requests when not being served, i.e., it does not include any time
spent idling for one of the queues of the cgroup. This is in
nanoseconds. If this is read when the cgroup is in an empty state,
the stat will only report the empty_time accumulated till the last
time it had a pending request and will not include the current delta.
- blkio.idle_time
- Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y.
This is the amount of time spent by the IO scheduler idling for a
given cgroup in anticipation of a better request than the existing ones
from other queues/cgroups. This is in nanoseconds. If this is read
when the cgroup is in an idling state, the stat will only report the
idle_time accumulated till the last idle period and will not include
the current delta.
- blkio.dequeue
- Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y. This
gives the statistics about how many a times a group was dequeued
from service tree of the device. First two fields specify the major
and minor number of the device and third field specifies the number
of times a group was dequeued from a particular device.
- blkio.*_recursive
- Recursive version of various stats. These files show the
same information as their non-recursive counterparts but
include stats from all the descendant cgroups.
Throttling/Upper limit policy files
-----------------------------------
- blkio.throttle.read_bps_device
- Specifies upper limit on READ rate from the device. IO rate is
specified in bytes per second. Rules are per device. Following is
the format.
echo "<major>:<minor> <rate_bytes_per_second>" > /cgrp/blkio.throttle.read_bps_device
- blkio.throttle.write_bps_device
- Specifies upper limit on WRITE rate to the device. IO rate is
specified in bytes per second. Rules are per device. Following is
the format.
echo "<major>:<minor> <rate_bytes_per_second>" > /cgrp/blkio.throttle.write_bps_device
- blkio.throttle.read_iops_device
- Specifies upper limit on READ rate from the device. IO rate is
specified in IO per second. Rules are per device. Following is
the format.
echo "<major>:<minor> <rate_io_per_second>" > /cgrp/blkio.throttle.read_iops_device
- blkio.throttle.write_iops_device
- Specifies upper limit on WRITE rate to the device. IO rate is
specified in io per second. Rules are per device. Following is
the format.
echo "<major>:<minor> <rate_io_per_second>" > /cgrp/blkio.throttle.write_iops_device
Note: If both BW and IOPS rules are specified for a device, then IO is
subjected to both the constraints.
- blkio.throttle.io_serviced
- Number of IOs (bio) completed to/from the disk by the group (as
seen by throttling policy). These are further divided by the type
of operation - read or write, sync or async. First two fields specify
the major and minor number of the device, third field specifies the
operation type and the fourth field specifies the number of IOs.
blkio.io_serviced does accounting as seen by CFQ and counts are in
number of requests (struct request). On the other hand,
blkio.throttle.io_serviced counts number of IO in terms of number
of bios as seen by throttling policy. These bios can later be
merged by elevator and total number of requests completed can be
lesser.
- blkio.throttle.io_service_bytes
- Number of bytes transferred to/from the disk by the group. These
are further divided by the type of operation - read or write, sync
or async. First two fields specify the major and minor number of the
device, third field specifies the operation type and the fourth field
specifies the number of bytes.
These numbers should roughly be same as blkio.io_service_bytes as
updated by CFQ. The difference between two is that
blkio.io_service_bytes will not be updated if CFQ is not operating
on request queue.
Common files among various policies
-----------------------------------
- blkio.reset_stats
- Writing an int to this file will result in resetting all the stats
for that cgroup.
CFQ sysfs tunable
=================
/sys/block/<disk>/queue/iosched/slice_idle
------------------------------------------
On a faster hardware CFQ can be slow, especially with sequential workload.
This happens because CFQ idles on a single queue and single queue might not
drive deeper request queue depths to keep the storage busy. In such scenarios
one can try setting slice_idle=0 and that would switch CFQ to IOPS
(IO operations per second) mode on NCQ supporting hardware.
That means CFQ will not idle between cfq queues of a cfq group and hence be
able to driver higher queue depth and achieve better throughput. That also
means that cfq provides fairness among groups in terms of IOPS and not in
terms of disk time.
/sys/block/<disk>/queue/iosched/group_idle
------------------------------------------
If one disables idling on individual cfq queues and cfq service trees by
setting slice_idle=0, group_idle kicks in. That means CFQ will still idle
on the group in an attempt to provide fairness among groups.
By default group_idle is same as slice_idle and does not do anything if
slice_idle is enabled.
One can experience an overall throughput drop if you have created multiple
groups and put applications in that group which are not driving enough
IO to keep disk busy. In that case set group_idle=0, and CFQ will not idle
on individual groups and throughput should improve.
What works
==========
- Currently only sync IO queues are support. All the buffered writes are
still system wide and not per group. Hence we will not see service
differentiation between buffered writes between groups.

View file

@ -0,0 +1,687 @@
CGROUPS
-------
Written by Paul Menage <menage@google.com> based on
Documentation/cgroups/cpusets.txt
Original copyright statements from cpusets.txt:
Portions Copyright (C) 2004 BULL SA.
Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
Modified by Paul Jackson <pj@sgi.com>
Modified by Christoph Lameter <clameter@sgi.com>
CONTENTS:
=========
1. Control Groups
1.1 What are cgroups ?
1.2 Why are cgroups needed ?
1.3 How are cgroups implemented ?
1.4 What does notify_on_release do ?
1.5 What does clone_children do ?
1.6 How do I use cgroups ?
2. Usage Examples and Syntax
2.1 Basic Usage
2.2 Attaching processes
2.3 Mounting hierarchies by name
3. Kernel API
3.1 Overview
3.2 Synchronization
3.3 Subsystem API
4. Extended attributes usage
5. Questions
1. Control Groups
=================
1.1 What are cgroups ?
----------------------
Control Groups provide a mechanism for aggregating/partitioning sets of
tasks, and all their future children, into hierarchical groups with
specialized behaviour.
Definitions:
A *cgroup* associates a set of tasks with a set of parameters for one
or more subsystems.
A *subsystem* is a module that makes use of the task grouping
facilities provided by cgroups to treat groups of tasks in
particular ways. A subsystem is typically a "resource controller" that
schedules a resource or applies per-cgroup limits, but it may be
anything that wants to act on a group of processes, e.g. a
virtualization subsystem.
A *hierarchy* is a set of cgroups arranged in a tree, such that
every task in the system is in exactly one of the cgroups in the
hierarchy, and a set of subsystems; each subsystem has system-specific
state attached to each cgroup in the hierarchy. Each hierarchy has
an instance of the cgroup virtual filesystem associated with it.
At any one time there may be multiple active hierarchies of task
cgroups. Each hierarchy is a partition of all tasks in the system.
User-level code may create and destroy cgroups by name in an
instance of the cgroup virtual file system, specify and query to
which cgroup a task is assigned, and list the task PIDs assigned to
a cgroup. Those creations and assignments only affect the hierarchy
associated with that instance of the cgroup file system.
On their own, the only use for cgroups is for simple job
tracking. The intention is that other subsystems hook into the generic
cgroup support to provide new attributes for cgroups, such as
accounting/limiting the resources which processes in a cgroup can
access. For example, cpusets (see Documentation/cgroups/cpusets.txt) allow
you to associate a set of CPUs and a set of memory nodes with the
tasks in each cgroup.
1.2 Why are cgroups needed ?
----------------------------
There are multiple efforts to provide process aggregations in the
Linux kernel, mainly for resource-tracking purposes. Such efforts
include cpusets, CKRM/ResGroups, UserBeanCounters, and virtual server
namespaces. These all require the basic notion of a
grouping/partitioning of processes, with newly forked processes ending
up in the same group (cgroup) as their parent process.
The kernel cgroup patch provides the minimum essential kernel
mechanisms required to efficiently implement such groups. It has
minimal impact on the system fast paths, and provides hooks for
specific subsystems such as cpusets to provide additional behaviour as
desired.
Multiple hierarchy support is provided to allow for situations where
the division of tasks into cgroups is distinctly different for
different subsystems - having parallel hierarchies allows each
hierarchy to be a natural division of tasks, without having to handle
complex combinations of tasks that would be present if several
unrelated subsystems needed to be forced into the same tree of
cgroups.
At one extreme, each resource controller or subsystem could be in a
separate hierarchy; at the other extreme, all subsystems
would be attached to the same hierarchy.
As an example of a scenario (originally proposed by vatsa@in.ibm.com)
that can benefit from multiple hierarchies, consider a large
university server with various users - students, professors, system
tasks etc. The resource planning for this server could be along the
following lines:
CPU : "Top cpuset"
/ \
CPUSet1 CPUSet2
| |
(Professors) (Students)
In addition (system tasks) are attached to topcpuset (so
that they can run anywhere) with a limit of 20%
Memory : Professors (50%), Students (30%), system (20%)
Disk : Professors (50%), Students (30%), system (20%)
Network : WWW browsing (20%), Network File System (60%), others (20%)
/ \
Professors (15%) students (5%)
Browsers like Firefox/Lynx go into the WWW network class, while (k)nfsd goes
into the NFS network class.
At the same time Firefox/Lynx will share an appropriate CPU/Memory class
depending on who launched it (prof/student).
With the ability to classify tasks differently for different resources
(by putting those resource subsystems in different hierarchies),
the admin can easily set up a script which receives exec notifications
and depending on who is launching the browser he can
# echo browser_pid > /sys/fs/cgroup/<restype>/<userclass>/tasks
With only a single hierarchy, he now would potentially have to create
a separate cgroup for every browser launched and associate it with
appropriate network and other resource class. This may lead to
proliferation of such cgroups.
Also let's say that the administrator would like to give enhanced network
access temporarily to a student's browser (since it is night and the user
wants to do online gaming :)) OR give one of the student's simulation
apps enhanced CPU power.
With ability to write PIDs directly to resource classes, it's just a
matter of:
# echo pid > /sys/fs/cgroup/network/<new_class>/tasks
(after some time)
# echo pid > /sys/fs/cgroup/network/<orig_class>/tasks
Without this ability, the administrator would have to split the cgroup into
multiple separate ones and then associate the new cgroups with the
new resource classes.
1.3 How are cgroups implemented ?
---------------------------------
Control Groups extends the kernel as follows:
- Each task in the system has a reference-counted pointer to a
css_set.
- A css_set contains a set of reference-counted pointers to
cgroup_subsys_state objects, one for each cgroup subsystem
registered in the system. There is no direct link from a task to
the cgroup of which it's a member in each hierarchy, but this
can be determined by following pointers through the
cgroup_subsys_state objects. This is because accessing the
subsystem state is something that's expected to happen frequently
and in performance-critical code, whereas operations that require a
task's actual cgroup assignments (in particular, moving between
cgroups) are less common. A linked list runs through the cg_list
field of each task_struct using the css_set, anchored at
css_set->tasks.
- A cgroup hierarchy filesystem can be mounted for browsing and
manipulation from user space.
- You can list all the tasks (by PID) attached to any cgroup.
The implementation of cgroups requires a few, simple hooks
into the rest of the kernel, none in performance-critical paths:
- in init/main.c, to initialize the root cgroups and initial
css_set at system boot.
- in fork and exit, to attach and detach a task from its css_set.
In addition, a new file system of type "cgroup" may be mounted, to
enable browsing and modifying the cgroups presently known to the
kernel. When mounting a cgroup hierarchy, you may specify a
comma-separated list of subsystems to mount as the filesystem mount
options. By default, mounting the cgroup filesystem attempts to
mount a hierarchy containing all registered subsystems.
If an active hierarchy with exactly the same set of subsystems already
exists, it will be reused for the new mount. If no existing hierarchy
matches, and any of the requested subsystems are in use in an existing
hierarchy, the mount will fail with -EBUSY. Otherwise, a new hierarchy
is activated, associated with the requested subsystems.
It's not currently possible to bind a new subsystem to an active
cgroup hierarchy, or to unbind a subsystem from an active cgroup
hierarchy. This may be possible in future, but is fraught with nasty
error-recovery issues.
When a cgroup filesystem is unmounted, if there are any
child cgroups created below the top-level cgroup, that hierarchy
will remain active even though unmounted; if there are no
child cgroups then the hierarchy will be deactivated.
No new system calls are added for cgroups - all support for
querying and modifying cgroups is via this cgroup file system.
Each task under /proc has an added file named 'cgroup' displaying,
for each active hierarchy, the subsystem names and the cgroup name
as the path relative to the root of the cgroup file system.
Each cgroup is represented by a directory in the cgroup file system
containing the following files describing that cgroup:
- tasks: list of tasks (by PID) attached to that cgroup. This list
is not guaranteed to be sorted. Writing a thread ID into this file
moves the thread into this cgroup.
- cgroup.procs: list of thread group IDs in the cgroup. This list is
not guaranteed to be sorted or free of duplicate TGIDs, and userspace
should sort/uniquify the list if this property is required.
Writing a thread group ID into this file moves all threads in that
group into this cgroup.
- notify_on_release flag: run the release agent on exit?
- release_agent: the path to use for release notifications (this file
exists in the top cgroup only)
Other subsystems such as cpusets may add additional files in each
cgroup dir.
New cgroups are created using the mkdir system call or shell
command. The properties of a cgroup, such as its flags, are
modified by writing to the appropriate file in that cgroups
directory, as listed above.
The named hierarchical structure of nested cgroups allows partitioning
a large system into nested, dynamically changeable, "soft-partitions".
The attachment of each task, automatically inherited at fork by any
children of that task, to a cgroup allows organizing the work load
on a system into related sets of tasks. A task may be re-attached to
any other cgroup, if allowed by the permissions on the necessary
cgroup file system directories.
When a task is moved from one cgroup to another, it gets a new
css_set pointer - if there's an already existing css_set with the
desired collection of cgroups then that group is reused, otherwise a new
css_set is allocated. The appropriate existing css_set is located by
looking into a hash table.
To allow access from a cgroup to the css_sets (and hence tasks)
that comprise it, a set of cg_cgroup_link objects form a lattice;
each cg_cgroup_link is linked into a list of cg_cgroup_links for
a single cgroup on its cgrp_link_list field, and a list of
cg_cgroup_links for a single css_set on its cg_link_list.
Thus the set of tasks in a cgroup can be listed by iterating over
each css_set that references the cgroup, and sub-iterating over
each css_set's task set.
The use of a Linux virtual file system (vfs) to represent the
cgroup hierarchy provides for a familiar permission and name space
for cgroups, with a minimum of additional kernel code.
1.4 What does notify_on_release do ?
------------------------------------
If the notify_on_release flag is enabled (1) in a cgroup, then
whenever the last task in the cgroup leaves (exits or attaches to
some other cgroup) and the last child cgroup of that cgroup
is removed, then the kernel runs the command specified by the contents
of the "release_agent" file in that hierarchy's root directory,
supplying the pathname (relative to the mount point of the cgroup
file system) of the abandoned cgroup. This enables automatic
removal of abandoned cgroups. The default value of
notify_on_release in the root cgroup at system boot is disabled
(0). The default value of other cgroups at creation is the current
value of their parents' notify_on_release settings. The default value of
a cgroup hierarchy's release_agent path is empty.
1.5 What does clone_children do ?
---------------------------------
This flag only affects the cpuset controller. If the clone_children
flag is enabled (1) in a cgroup, a new cpuset cgroup will copy its
configuration from the parent during initialization.
1.6 How do I use cgroups ?
--------------------------
To start a new job that is to be contained within a cgroup, using
the "cpuset" cgroup subsystem, the steps are something like:
1) mount -t tmpfs cgroup_root /sys/fs/cgroup
2) mkdir /sys/fs/cgroup/cpuset
3) mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
4) Create the new cgroup by doing mkdir's and write's (or echo's) in
the /sys/fs/cgroup virtual file system.
5) Start a task that will be the "founding father" of the new job.
6) Attach that task to the new cgroup by writing its PID to the
/sys/fs/cgroup/cpuset/tasks file for that cgroup.
7) fork, exec or clone the job tasks from this founding father task.
For example, the following sequence of commands will setup a cgroup
named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
and then start a subshell 'sh' in that cgroup:
mount -t tmpfs cgroup_root /sys/fs/cgroup
mkdir /sys/fs/cgroup/cpuset
mount -t cgroup cpuset -ocpuset /sys/fs/cgroup/cpuset
cd /sys/fs/cgroup/cpuset
mkdir Charlie
cd Charlie
/bin/echo 2-3 > cpuset.cpus
/bin/echo 1 > cpuset.mems
/bin/echo $$ > tasks
sh
# The subshell 'sh' is now running in cgroup Charlie
# The next line should display '/Charlie'
cat /proc/self/cgroup
2. Usage Examples and Syntax
============================
2.1 Basic Usage
---------------
Creating, modifying, using cgroups can be done through the cgroup
virtual filesystem.
To mount a cgroup hierarchy with all available subsystems, type:
# mount -t cgroup xxx /sys/fs/cgroup
The "xxx" is not interpreted by the cgroup code, but will appear in
/proc/mounts so may be any useful identifying string that you like.
Note: Some subsystems do not work without some user input first. For instance,
if cpusets are enabled the user will have to populate the cpus and mems files
for each new cgroup created before that group can be used.
As explained in section `1.2 Why are cgroups needed?' you should create
different hierarchies of cgroups for each single resource or group of
resources you want to control. Therefore, you should mount a tmpfs on
/sys/fs/cgroup and create directories for each cgroup resource or resource
group.
# mount -t tmpfs cgroup_root /sys/fs/cgroup
# mkdir /sys/fs/cgroup/rg1
To mount a cgroup hierarchy with just the cpuset and memory
subsystems, type:
# mount -t cgroup -o cpuset,memory hier1 /sys/fs/cgroup/rg1
While remounting cgroups is currently supported, it is not recommend
to use it. Remounting allows changing bound subsystems and
release_agent. Rebinding is hardly useful as it only works when the
hierarchy is empty and release_agent itself should be replaced with
conventional fsnotify. The support for remounting will be removed in
the future.
To Specify a hierarchy's release_agent:
# mount -t cgroup -o cpuset,release_agent="/sbin/cpuset_release_agent" \
xxx /sys/fs/cgroup/rg1
Note that specifying 'release_agent' more than once will return failure.
Note that changing the set of subsystems is currently only supported
when the hierarchy consists of a single (root) cgroup. Supporting
the ability to arbitrarily bind/unbind subsystems from an existing
cgroup hierarchy is intended to be implemented in the future.
Then under /sys/fs/cgroup/rg1 you can find a tree that corresponds to the
tree of the cgroups in the system. For instance, /sys/fs/cgroup/rg1
is the cgroup that holds the whole system.
If you want to change the value of release_agent:
# echo "/sbin/new_release_agent" > /sys/fs/cgroup/rg1/release_agent
It can also be changed via remount.
If you want to create a new cgroup under /sys/fs/cgroup/rg1:
# cd /sys/fs/cgroup/rg1
# mkdir my_cgroup
Now you want to do something with this cgroup.
# cd my_cgroup
In this directory you can find several files:
# ls
cgroup.procs notify_on_release tasks
(plus whatever files added by the attached subsystems)
Now attach your shell to this cgroup:
# /bin/echo $$ > tasks
You can also create cgroups inside your cgroup by using mkdir in this
directory.
# mkdir my_sub_cs
To remove a cgroup, just use rmdir:
# rmdir my_sub_cs
This will fail if the cgroup is in use (has cgroups inside, or
has processes attached, or is held alive by other subsystem-specific
reference).
2.2 Attaching processes
-----------------------
# /bin/echo PID > tasks
Note that it is PID, not PIDs. You can only attach ONE task at a time.
If you have several tasks to attach, you have to do it one after another:
# /bin/echo PID1 > tasks
# /bin/echo PID2 > tasks
...
# /bin/echo PIDn > tasks
You can attach the current shell task by echoing 0:
# echo 0 > tasks
You can use the cgroup.procs file instead of the tasks file to move all
threads in a threadgroup at once. Echoing the PID of any task in a
threadgroup to cgroup.procs causes all tasks in that threadgroup to be
attached to the cgroup. Writing 0 to cgroup.procs moves all tasks
in the writing task's threadgroup.
Note: Since every task is always a member of exactly one cgroup in each
mounted hierarchy, to remove a task from its current cgroup you must
move it into a new cgroup (possibly the root cgroup) by writing to the
new cgroup's tasks file.
Note: Due to some restrictions enforced by some cgroup subsystems, moving
a process to another cgroup can fail.
2.3 Mounting hierarchies by name
--------------------------------
Passing the name=<x> option when mounting a cgroups hierarchy
associates the given name with the hierarchy. This can be used when
mounting a pre-existing hierarchy, in order to refer to it by name
rather than by its set of active subsystems. Each hierarchy is either
nameless, or has a unique name.
The name should match [\w.-]+
When passing a name=<x> option for a new hierarchy, you need to
specify subsystems manually; the legacy behaviour of mounting all
subsystems when none are explicitly specified is not supported when
you give a subsystem a name.
The name of the subsystem appears as part of the hierarchy description
in /proc/mounts and /proc/<pid>/cgroups.
3. Kernel API
=============
3.1 Overview
------------
Each kernel subsystem that wants to hook into the generic cgroup
system needs to create a cgroup_subsys object. This contains
various methods, which are callbacks from the cgroup system, along
with a subsystem ID which will be assigned by the cgroup system.
Other fields in the cgroup_subsys object include:
- subsys_id: a unique array index for the subsystem, indicating which
entry in cgroup->subsys[] this subsystem should be managing.
- name: should be initialized to a unique subsystem name. Should be
no longer than MAX_CGROUP_TYPE_NAMELEN.
- early_init: indicate if the subsystem needs early initialization
at system boot.
Each cgroup object created by the system has an array of pointers,
indexed by subsystem ID; this pointer is entirely managed by the
subsystem; the generic cgroup code will never touch this pointer.
3.2 Synchronization
-------------------
There is a global mutex, cgroup_mutex, used by the cgroup
system. This should be taken by anything that wants to modify a
cgroup. It may also be taken to prevent cgroups from being
modified, but more specific locks may be more appropriate in that
situation.
See kernel/cgroup.c for more details.
Subsystems can take/release the cgroup_mutex via the functions
cgroup_lock()/cgroup_unlock().
Accessing a task's cgroup pointer may be done in the following ways:
- while holding cgroup_mutex
- while holding the task's alloc_lock (via task_lock())
- inside an rcu_read_lock() section via rcu_dereference()
3.3 Subsystem API
-----------------
Each subsystem should:
- add an entry in linux/cgroup_subsys.h
- define a cgroup_subsys object called <name>_subsys
If a subsystem can be compiled as a module, it should also have in its
module initcall a call to cgroup_load_subsys(), and in its exitcall a
call to cgroup_unload_subsys(). It should also set its_subsys.module =
THIS_MODULE in its .c file.
Each subsystem may export the following methods. The only mandatory
methods are css_alloc/free. Any others that are null are presumed to
be successful no-ops.
struct cgroup_subsys_state *css_alloc(struct cgroup *cgrp)
(cgroup_mutex held by caller)
Called to allocate a subsystem state object for a cgroup. The
subsystem should allocate its subsystem state object for the passed
cgroup, returning a pointer to the new object on success or a
ERR_PTR() value. On success, the subsystem pointer should point to
a structure of type cgroup_subsys_state (typically embedded in a
larger subsystem-specific object), which will be initialized by the
cgroup system. Note that this will be called at initialization to
create the root subsystem state for this subsystem; this case can be
identified by the passed cgroup object having a NULL parent (since
it's the root of the hierarchy) and may be an appropriate place for
initialization code.
int css_online(struct cgroup *cgrp)
(cgroup_mutex held by caller)
Called after @cgrp successfully completed all allocations and made
visible to cgroup_for_each_child/descendant_*() iterators. The
subsystem may choose to fail creation by returning -errno. This
callback can be used to implement reliable state sharing and
propagation along the hierarchy. See the comment on
cgroup_for_each_descendant_pre() for details.
void css_offline(struct cgroup *cgrp);
(cgroup_mutex held by caller)
This is the counterpart of css_online() and called iff css_online()
has succeeded on @cgrp. This signifies the beginning of the end of
@cgrp. @cgrp is being removed and the subsystem should start dropping
all references it's holding on @cgrp. When all references are dropped,
cgroup removal will proceed to the next step - css_free(). After this
callback, @cgrp should be considered dead to the subsystem.
void css_free(struct cgroup *cgrp)
(cgroup_mutex held by caller)
The cgroup system is about to free @cgrp; the subsystem should free
its subsystem state object. By the time this method is called, @cgrp
is completely unused; @cgrp->parent is still valid. (Note - can also
be called for a newly-created cgroup if an error occurs after this
subsystem's create() method has been called for the new cgroup).
int allow_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
(cgroup_mutex held by caller)
Called prior to moving a task into a cgroup; if the subsystem
returns an error, this will abort the attach operation. Used
to extend the permission checks - if all subsystems in a cgroup
return 0, the attach will be allowed to proceed, even if the
default permission check (root or same user) fails.
int can_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
(cgroup_mutex held by caller)
Called prior to moving one or more tasks into a cgroup; if the
subsystem returns an error, this will abort the attach operation.
@tset contains the tasks to be attached and is guaranteed to have at
least one task in it.
If there are multiple tasks in the taskset, then:
- it's guaranteed that all are from the same thread group
- @tset contains all tasks from the thread group whether or not
they're switching cgroups
- the first task is the leader
Each @tset entry also contains the task's old cgroup and tasks which
aren't switching cgroup can be skipped easily using the
cgroup_taskset_for_each() iterator. Note that this isn't called on a
fork. If this method returns 0 (success) then this should remain valid
while the caller holds cgroup_mutex and it is ensured that either
attach() or cancel_attach() will be called in future.
void css_reset(struct cgroup_subsys_state *css)
(cgroup_mutex held by caller)
An optional operation which should restore @css's configuration to the
initial state. This is currently only used on the unified hierarchy
when a subsystem is disabled on a cgroup through
"cgroup.subtree_control" but should remain enabled because other
subsystems depend on it. cgroup core makes such a css invisible by
removing the associated interface files and invokes this callback so
that the hidden subsystem can return to the initial neutral state.
This prevents unexpected resource control from a hidden css and
ensures that the configuration is in the initial state when it is made
visible again later.
void cancel_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
(cgroup_mutex held by caller)
Called when a task attach operation has failed after can_attach() has succeeded.
A subsystem whose can_attach() has some side-effects should provide this
function, so that the subsystem can implement a rollback. If not, not necessary.
This will be called only about subsystems whose can_attach() operation have
succeeded. The parameters are identical to can_attach().
void attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
(cgroup_mutex held by caller)
Called after the task has been attached to the cgroup, to allow any
post-attachment activity that requires memory allocations or blocking.
The parameters are identical to can_attach().
void fork(struct task_struct *task)
Called when a task is forked into a cgroup.
void exit(struct task_struct *task)
Called during task exit.
void bind(struct cgroup *root)
(cgroup_mutex held by caller)
Called when a cgroup subsystem is rebound to a different hierarchy
and root cgroup. Currently this will only involve movement between
the default hierarchy (which never has sub-cgroups) and a hierarchy
that is being created/destroyed (and hence has no sub-cgroups).
4. Extended attribute usage
===========================
cgroup filesystem supports certain types of extended attributes in its
directories and files. The current supported types are:
- Trusted (XATTR_TRUSTED)
- Security (XATTR_SECURITY)
Both require CAP_SYS_ADMIN capability to set.
Like in tmpfs, the extended attributes in cgroup filesystem are stored
using kernel memory and it's advised to keep the usage at minimum. This
is the reason why user defined extended attributes are not supported, since
any user can do it and there's no limit in the value size.
The current known users for this feature are SELinux to limit cgroup usage
in containers and systemd for assorted meta data like main PID in a cgroup
(systemd creates a cgroup per service).
5. Questions
============
Q: what's up with this '/bin/echo' ?
A: bash's builtin 'echo' command does not check calls to write() against
errors. If you use it in the cgroup file system, you won't be
able to tell whether a command succeeded or failed.
Q: When I attach processes, only the first of the line gets really attached !
A: We can only return one error code per call to write(). So you should also
put only ONE PID.

View file

@ -0,0 +1,49 @@
CPU Accounting Controller
-------------------------
The CPU accounting controller is used to group tasks using cgroups and
account the CPU usage of these groups of tasks.
The CPU accounting controller supports multi-hierarchy groups. An accounting
group accumulates the CPU usage of all of its child groups and the tasks
directly present in its group.
Accounting groups can be created by first mounting the cgroup filesystem.
# mount -t cgroup -ocpuacct none /sys/fs/cgroup
With the above step, the initial or the parent accounting group becomes
visible at /sys/fs/cgroup. At bootup, this group includes all the tasks in
the system. /sys/fs/cgroup/tasks lists the tasks in this cgroup.
/sys/fs/cgroup/cpuacct.usage gives the CPU time (in nanoseconds) obtained
by this group which is essentially the CPU time obtained by all the tasks
in the system.
New accounting groups can be created under the parent group /sys/fs/cgroup.
# cd /sys/fs/cgroup
# mkdir g1
# echo $$ > g1/tasks
The above steps create a new group g1 and move the current shell
process (bash) into it. CPU time consumed by this bash and its children
can be obtained from g1/cpuacct.usage and the same is accumulated in
/sys/fs/cgroup/cpuacct.usage also.
cpuacct.stat file lists a few statistics which further divide the
CPU time obtained by the cgroup into user and system times. Currently
the following statistics are supported:
user: Time spent by tasks of the cgroup in user mode.
system: Time spent by tasks of the cgroup in kernel mode.
user and system are in USER_HZ unit.
cpuacct controller uses percpu_counter interface to collect user and
system times. This has two side effects:
- It is theoretically possible to see wrong values for user and system times.
This is because percpu_counter_read() on 32bit systems isn't safe
against concurrent writes.
- It is possible to see slightly outdated values for user and system times
due to the batch processing nature of percpu_counter.

View file

@ -0,0 +1,833 @@
CPUSETS
-------
Copyright (C) 2004 BULL SA.
Written by Simon.Derr@bull.net
Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
Modified by Paul Jackson <pj@sgi.com>
Modified by Christoph Lameter <clameter@sgi.com>
Modified by Paul Menage <menage@google.com>
Modified by Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
CONTENTS:
=========
1. Cpusets
1.1 What are cpusets ?
1.2 Why are cpusets needed ?
1.3 How are cpusets implemented ?
1.4 What are exclusive cpusets ?
1.5 What is memory_pressure ?
1.6 What is memory spread ?
1.7 What is sched_load_balance ?
1.8 What is sched_relax_domain_level ?
1.9 How do I use cpusets ?
2. Usage Examples and Syntax
2.1 Basic Usage
2.2 Adding/removing cpus
2.3 Setting flags
2.4 Attaching processes
3. Questions
4. Contact
1. Cpusets
==========
1.1 What are cpusets ?
----------------------
Cpusets provide a mechanism for assigning a set of CPUs and Memory
Nodes to a set of tasks. In this document "Memory Node" refers to
an on-line node that contains memory.
Cpusets constrain the CPU and Memory placement of tasks to only
the resources within a task's current cpuset. They form a nested
hierarchy visible in a virtual file system. These are the essential
hooks, beyond what is already present, required to manage dynamic
job placement on large systems.
Cpusets use the generic cgroup subsystem described in
Documentation/cgroups/cgroups.txt.
Requests by a task, using the sched_setaffinity(2) system call to
include CPUs in its CPU affinity mask, and using the mbind(2) and
set_mempolicy(2) system calls to include Memory Nodes in its memory
policy, are both filtered through that task's cpuset, filtering out any
CPUs or Memory Nodes not in that cpuset. The scheduler will not
schedule a task on a CPU that is not allowed in its cpus_allowed
vector, and the kernel page allocator will not allocate a page on a
node that is not allowed in the requesting task's mems_allowed vector.
User level code may create and destroy cpusets by name in the cgroup
virtual file system, manage the attributes and permissions of these
cpusets and which CPUs and Memory Nodes are assigned to each cpuset,
specify and query to which cpuset a task is assigned, and list the
task pids assigned to a cpuset.
1.2 Why are cpusets needed ?
----------------------------
The management of large computer systems, with many processors (CPUs),
complex memory cache hierarchies and multiple Memory Nodes having
non-uniform access times (NUMA) presents additional challenges for
the efficient scheduling and memory placement of processes.
Frequently more modest sized systems can be operated with adequate
efficiency just by letting the operating system automatically share
the available CPU and Memory resources amongst the requesting tasks.
But larger systems, which benefit more from careful processor and
memory placement to reduce memory access times and contention,
and which typically represent a larger investment for the customer,
can benefit from explicitly placing jobs on properly sized subsets of
the system.
This can be especially valuable on:
* Web Servers running multiple instances of the same web application,
* Servers running different applications (for instance, a web server
and a database), or
* NUMA systems running large HPC applications with demanding
performance characteristics.
These subsets, or "soft partitions" must be able to be dynamically
adjusted, as the job mix changes, without impacting other concurrently
executing jobs. The location of the running jobs pages may also be moved
when the memory locations are changed.
The kernel cpuset patch provides the minimum essential kernel
mechanisms required to efficiently implement such subsets. It
leverages existing CPU and Memory Placement facilities in the Linux
kernel to avoid any additional impact on the critical scheduler or
memory allocator code.
1.3 How are cpusets implemented ?
---------------------------------
Cpusets provide a Linux kernel mechanism to constrain which CPUs and
Memory Nodes are used by a process or set of processes.
The Linux kernel already has a pair of mechanisms to specify on which
CPUs a task may be scheduled (sched_setaffinity) and on which Memory
Nodes it may obtain memory (mbind, set_mempolicy).
Cpusets extends these two mechanisms as follows:
- Cpusets are sets of allowed CPUs and Memory Nodes, known to the
kernel.
- Each task in the system is attached to a cpuset, via a pointer
in the task structure to a reference counted cgroup structure.
- Calls to sched_setaffinity are filtered to just those CPUs
allowed in that task's cpuset.
- Calls to mbind and set_mempolicy are filtered to just
those Memory Nodes allowed in that task's cpuset.
- The root cpuset contains all the systems CPUs and Memory
Nodes.
- For any cpuset, one can define child cpusets containing a subset
of the parents CPU and Memory Node resources.
- The hierarchy of cpusets can be mounted at /dev/cpuset, for
browsing and manipulation from user space.
- A cpuset may be marked exclusive, which ensures that no other
cpuset (except direct ancestors and descendants) may contain
any overlapping CPUs or Memory Nodes.
- You can list all the tasks (by pid) attached to any cpuset.
The implementation of cpusets requires a few, simple hooks
into the rest of the kernel, none in performance critical paths:
- in init/main.c, to initialize the root cpuset at system boot.
- in fork and exit, to attach and detach a task from its cpuset.
- in sched_setaffinity, to mask the requested CPUs by what's
allowed in that task's cpuset.
- in sched.c migrate_live_tasks(), to keep migrating tasks within
the CPUs allowed by their cpuset, if possible.
- in the mbind and set_mempolicy system calls, to mask the requested
Memory Nodes by what's allowed in that task's cpuset.
- in page_alloc.c, to restrict memory to allowed nodes.
- in vmscan.c, to restrict page recovery to the current cpuset.
You should mount the "cgroup" filesystem type in order to enable
browsing and modifying the cpusets presently known to the kernel. No
new system calls are added for cpusets - all support for querying and
modifying cpusets is via this cpuset file system.
The /proc/<pid>/status file for each task has four added lines,
displaying the task's cpus_allowed (on which CPUs it may be scheduled)
and mems_allowed (on which Memory Nodes it may obtain memory),
in the two formats seen in the following example:
Cpus_allowed: ffffffff,ffffffff,ffffffff,ffffffff
Cpus_allowed_list: 0-127
Mems_allowed: ffffffff,ffffffff
Mems_allowed_list: 0-63
Each cpuset is represented by a directory in the cgroup file system
containing (on top of the standard cgroup files) the following
files describing that cpuset:
- cpuset.cpus: list of CPUs in that cpuset
- cpuset.mems: list of Memory Nodes in that cpuset
- cpuset.memory_migrate flag: if set, move pages to cpusets nodes
- cpuset.cpu_exclusive flag: is cpu placement exclusive?
- cpuset.mem_exclusive flag: is memory placement exclusive?
- cpuset.mem_hardwall flag: is memory allocation hardwalled
- cpuset.memory_pressure: measure of how much paging pressure in cpuset
- cpuset.memory_spread_page flag: if set, spread page cache evenly on allowed nodes
- cpuset.memory_spread_slab flag: if set, spread slab cache evenly on allowed nodes
- cpuset.sched_load_balance flag: if set, load balance within CPUs on that cpuset
- cpuset.sched_relax_domain_level: the searching range when migrating tasks
In addition, only the root cpuset has the following file:
- cpuset.memory_pressure_enabled flag: compute memory_pressure?
New cpusets are created using the mkdir system call or shell
command. The properties of a cpuset, such as its flags, allowed
CPUs and Memory Nodes, and attached tasks, are modified by writing
to the appropriate file in that cpusets directory, as listed above.
The named hierarchical structure of nested cpusets allows partitioning
a large system into nested, dynamically changeable, "soft-partitions".
The attachment of each task, automatically inherited at fork by any
children of that task, to a cpuset allows organizing the work load
on a system into related sets of tasks such that each set is constrained
to using the CPUs and Memory Nodes of a particular cpuset. A task
may be re-attached to any other cpuset, if allowed by the permissions
on the necessary cpuset file system directories.
Such management of a system "in the large" integrates smoothly with
the detailed placement done on individual tasks and memory regions
using the sched_setaffinity, mbind and set_mempolicy system calls.
The following rules apply to each cpuset:
- Its CPUs and Memory Nodes must be a subset of its parents.
- It can't be marked exclusive unless its parent is.
- If its cpu or memory is exclusive, they may not overlap any sibling.
These rules, and the natural hierarchy of cpusets, enable efficient
enforcement of the exclusive guarantee, without having to scan all
cpusets every time any of them change to ensure nothing overlaps a
exclusive cpuset. Also, the use of a Linux virtual file system (vfs)
to represent the cpuset hierarchy provides for a familiar permission
and name space for cpusets, with a minimum of additional kernel code.
The cpus and mems files in the root (top_cpuset) cpuset are
read-only. The cpus file automatically tracks the value of
cpu_online_mask using a CPU hotplug notifier, and the mems file
automatically tracks the value of node_states[N_MEMORY]--i.e.,
nodes with memory--using the cpuset_track_online_nodes() hook.
1.4 What are exclusive cpusets ?
--------------------------------
If a cpuset is cpu or mem exclusive, no other cpuset, other than
a direct ancestor or descendant, may share any of the same CPUs or
Memory Nodes.
A cpuset that is cpuset.mem_exclusive *or* cpuset.mem_hardwall is "hardwalled",
i.e. it restricts kernel allocations for page, buffer and other data
commonly shared by the kernel across multiple users. All cpusets,
whether hardwalled or not, restrict allocations of memory for user
space. This enables configuring a system so that several independent
jobs can share common kernel data, such as file system pages, while
isolating each job's user allocation in its own cpuset. To do this,
construct a large mem_exclusive cpuset to hold all the jobs, and
construct child, non-mem_exclusive cpusets for each individual job.
Only a small amount of typical kernel memory, such as requests from
interrupt handlers, is allowed to be taken outside even a
mem_exclusive cpuset.
1.5 What is memory_pressure ?
-----------------------------
The memory_pressure of a cpuset provides a simple per-cpuset metric
of the rate that the tasks in a cpuset are attempting to free up in
use memory on the nodes of the cpuset to satisfy additional memory
requests.
This enables batch managers monitoring jobs running in dedicated
cpusets to efficiently detect what level of memory pressure that job
is causing.
This is useful both on tightly managed systems running a wide mix of
submitted jobs, which may choose to terminate or re-prioritize jobs that
are trying to use more memory than allowed on the nodes assigned to them,
and with tightly coupled, long running, massively parallel scientific
computing jobs that will dramatically fail to meet required performance
goals if they start to use more memory than allowed to them.
This mechanism provides a very economical way for the batch manager
to monitor a cpuset for signs of memory pressure. It's up to the
batch manager or other user code to decide what to do about it and
take action.
==> Unless this feature is enabled by writing "1" to the special file
/dev/cpuset/memory_pressure_enabled, the hook in the rebalance
code of __alloc_pages() for this metric reduces to simply noticing
that the cpuset_memory_pressure_enabled flag is zero. So only
systems that enable this feature will compute the metric.
Why a per-cpuset, running average:
Because this meter is per-cpuset, rather than per-task or mm,
the system load imposed by a batch scheduler monitoring this
metric is sharply reduced on large systems, because a scan of
the tasklist can be avoided on each set of queries.
Because this meter is a running average, instead of an accumulating
counter, a batch scheduler can detect memory pressure with a
single read, instead of having to read and accumulate results
for a period of time.
Because this meter is per-cpuset rather than per-task or mm,
the batch scheduler can obtain the key information, memory
pressure in a cpuset, with a single read, rather than having to
query and accumulate results over all the (dynamically changing)
set of tasks in the cpuset.
A per-cpuset simple digital filter (requires a spinlock and 3 words
of data per-cpuset) is kept, and updated by any task attached to that
cpuset, if it enters the synchronous (direct) page reclaim code.
A per-cpuset file provides an integer number representing the recent
(half-life of 10 seconds) rate of direct page reclaims caused by
the tasks in the cpuset, in units of reclaims attempted per second,
times 1000.
1.6 What is memory spread ?
---------------------------
There are two boolean flag files per cpuset that control where the
kernel allocates pages for the file system buffers and related in
kernel data structures. They are called 'cpuset.memory_spread_page' and
'cpuset.memory_spread_slab'.
If the per-cpuset boolean flag file 'cpuset.memory_spread_page' is set, then
the kernel will spread the file system buffers (page cache) evenly
over all the nodes that the faulting task is allowed to use, instead
of preferring to put those pages on the node where the task is running.
If the per-cpuset boolean flag file 'cpuset.memory_spread_slab' is set,
then the kernel will spread some file system related slab caches,
such as for inodes and dentries evenly over all the nodes that the
faulting task is allowed to use, instead of preferring to put those
pages on the node where the task is running.
The setting of these flags does not affect anonymous data segment or
stack segment pages of a task.
By default, both kinds of memory spreading are off, and memory
pages are allocated on the node local to where the task is running,
except perhaps as modified by the task's NUMA mempolicy or cpuset
configuration, so long as sufficient free memory pages are available.
When new cpusets are created, they inherit the memory spread settings
of their parent.
Setting memory spreading causes allocations for the affected page
or slab caches to ignore the task's NUMA mempolicy and be spread
instead. Tasks using mbind() or set_mempolicy() calls to set NUMA
mempolicies will not notice any change in these calls as a result of
their containing task's memory spread settings. If memory spreading
is turned off, then the currently specified NUMA mempolicy once again
applies to memory page allocations.
Both 'cpuset.memory_spread_page' and 'cpuset.memory_spread_slab' are boolean flag
files. By default they contain "0", meaning that the feature is off
for that cpuset. If a "1" is written to that file, then that turns
the named feature on.
The implementation is simple.
Setting the flag 'cpuset.memory_spread_page' turns on a per-process flag
PFA_SPREAD_PAGE for each task that is in that cpuset or subsequently
joins that cpuset. The page allocation calls for the page cache
is modified to perform an inline check for this PFA_SPREAD_PAGE task
flag, and if set, a call to a new routine cpuset_mem_spread_node()
returns the node to prefer for the allocation.
Similarly, setting 'cpuset.memory_spread_slab' turns on the flag
PFA_SPREAD_SLAB, and appropriately marked slab caches will allocate
pages from the node returned by cpuset_mem_spread_node().
The cpuset_mem_spread_node() routine is also simple. It uses the
value of a per-task rotor cpuset_mem_spread_rotor to select the next
node in the current task's mems_allowed to prefer for the allocation.
This memory placement policy is also known (in other contexts) as
round-robin or interleave.
This policy can provide substantial improvements for jobs that need
to place thread local data on the corresponding node, but that need
to access large file system data sets that need to be spread across
the several nodes in the jobs cpuset in order to fit. Without this
policy, especially for jobs that might have one thread reading in the
data set, the memory allocation across the nodes in the jobs cpuset
can become very uneven.
1.7 What is sched_load_balance ?
--------------------------------
The kernel scheduler (kernel/sched/core.c) automatically load balances
tasks. If one CPU is underutilized, kernel code running on that
CPU will look for tasks on other more overloaded CPUs and move those
tasks to itself, within the constraints of such placement mechanisms
as cpusets and sched_setaffinity.
The algorithmic cost of load balancing and its impact on key shared
kernel data structures such as the task list increases more than
linearly with the number of CPUs being balanced. So the scheduler
has support to partition the systems CPUs into a number of sched
domains such that it only load balances within each sched domain.
Each sched domain covers some subset of the CPUs in the system;
no two sched domains overlap; some CPUs might not be in any sched
domain and hence won't be load balanced.
Put simply, it costs less to balance between two smaller sched domains
than one big one, but doing so means that overloads in one of the
two domains won't be load balanced to the other one.
By default, there is one sched domain covering all CPUs, except those
marked isolated using the kernel boot time "isolcpus=" argument.
This default load balancing across all CPUs is not well suited for
the following two situations:
1) On large systems, load balancing across many CPUs is expensive.
If the system is managed using cpusets to place independent jobs
on separate sets of CPUs, full load balancing is unnecessary.
2) Systems supporting realtime on some CPUs need to minimize
system overhead on those CPUs, including avoiding task load
balancing if that is not needed.
When the per-cpuset flag "cpuset.sched_load_balance" is enabled (the default
setting), it requests that all the CPUs in that cpusets allowed 'cpuset.cpus'
be contained in a single sched domain, ensuring that load balancing
can move a task (not otherwised pinned, as by sched_setaffinity)
from any CPU in that cpuset to any other.
When the per-cpuset flag "cpuset.sched_load_balance" is disabled, then the
scheduler will avoid load balancing across the CPUs in that cpuset,
--except-- in so far as is necessary because some overlapping cpuset
has "sched_load_balance" enabled.
So, for example, if the top cpuset has the flag "cpuset.sched_load_balance"
enabled, then the scheduler will have one sched domain covering all
CPUs, and the setting of the "cpuset.sched_load_balance" flag in any other
cpusets won't matter, as we're already fully load balancing.
Therefore in the above two situations, the top cpuset flag
"cpuset.sched_load_balance" should be disabled, and only some of the smaller,
child cpusets have this flag enabled.
When doing this, you don't usually want to leave any unpinned tasks in
the top cpuset that might use non-trivial amounts of CPU, as such tasks
may be artificially constrained to some subset of CPUs, depending on
the particulars of this flag setting in descendant cpusets. Even if
such a task could use spare CPU cycles in some other CPUs, the kernel
scheduler might not consider the possibility of load balancing that
task to that underused CPU.
Of course, tasks pinned to a particular CPU can be left in a cpuset
that disables "cpuset.sched_load_balance" as those tasks aren't going anywhere
else anyway.
There is an impedance mismatch here, between cpusets and sched domains.
Cpusets are hierarchical and nest. Sched domains are flat; they don't
overlap and each CPU is in at most one sched domain.
It is necessary for sched domains to be flat because load balancing
across partially overlapping sets of CPUs would risk unstable dynamics
that would be beyond our understanding. So if each of two partially
overlapping cpusets enables the flag 'cpuset.sched_load_balance', then we
form a single sched domain that is a superset of both. We won't move
a task to a CPU outside it cpuset, but the scheduler load balancing
code might waste some compute cycles considering that possibility.
This mismatch is why there is not a simple one-to-one relation
between which cpusets have the flag "cpuset.sched_load_balance" enabled,
and the sched domain configuration. If a cpuset enables the flag, it
will get balancing across all its CPUs, but if it disables the flag,
it will only be assured of no load balancing if no other overlapping
cpuset enables the flag.
If two cpusets have partially overlapping 'cpuset.cpus' allowed, and only
one of them has this flag enabled, then the other may find its
tasks only partially load balanced, just on the overlapping CPUs.
This is just the general case of the top_cpuset example given a few
paragraphs above. In the general case, as in the top cpuset case,
don't leave tasks that might use non-trivial amounts of CPU in
such partially load balanced cpusets, as they may be artificially
constrained to some subset of the CPUs allowed to them, for lack of
load balancing to the other CPUs.
1.7.1 sched_load_balance implementation details.
------------------------------------------------
The per-cpuset flag 'cpuset.sched_load_balance' defaults to enabled (contrary
to most cpuset flags.) When enabled for a cpuset, the kernel will
ensure that it can load balance across all the CPUs in that cpuset
(makes sure that all the CPUs in the cpus_allowed of that cpuset are
in the same sched domain.)
If two overlapping cpusets both have 'cpuset.sched_load_balance' enabled,
then they will be (must be) both in the same sched domain.
If, as is the default, the top cpuset has 'cpuset.sched_load_balance' enabled,
then by the above that means there is a single sched domain covering
the whole system, regardless of any other cpuset settings.
The kernel commits to user space that it will avoid load balancing
where it can. It will pick as fine a granularity partition of sched
domains as it can while still providing load balancing for any set
of CPUs allowed to a cpuset having 'cpuset.sched_load_balance' enabled.
The internal kernel cpuset to scheduler interface passes from the
cpuset code to the scheduler code a partition of the load balanced
CPUs in the system. This partition is a set of subsets (represented
as an array of struct cpumask) of CPUs, pairwise disjoint, that cover
all the CPUs that must be load balanced.
The cpuset code builds a new such partition and passes it to the
scheduler sched domain setup code, to have the sched domains rebuilt
as necessary, whenever:
- the 'cpuset.sched_load_balance' flag of a cpuset with non-empty CPUs changes,
- or CPUs come or go from a cpuset with this flag enabled,
- or 'cpuset.sched_relax_domain_level' value of a cpuset with non-empty CPUs
and with this flag enabled changes,
- or a cpuset with non-empty CPUs and with this flag enabled is removed,
- or a cpu is offlined/onlined.
This partition exactly defines what sched domains the scheduler should
setup - one sched domain for each element (struct cpumask) in the
partition.
The scheduler remembers the currently active sched domain partitions.
When the scheduler routine partition_sched_domains() is invoked from
the cpuset code to update these sched domains, it compares the new
partition requested with the current, and updates its sched domains,
removing the old and adding the new, for each change.
1.8 What is sched_relax_domain_level ?
--------------------------------------
In sched domain, the scheduler migrates tasks in 2 ways; periodic load
balance on tick, and at time of some schedule events.
When a task is woken up, scheduler try to move the task on idle CPU.
For example, if a task A running on CPU X activates another task B
on the same CPU X, and if CPU Y is X's sibling and performing idle,
then scheduler migrate task B to CPU Y so that task B can start on
CPU Y without waiting task A on CPU X.
And if a CPU run out of tasks in its runqueue, the CPU try to pull
extra tasks from other busy CPUs to help them before it is going to
be idle.
Of course it takes some searching cost to find movable tasks and/or
idle CPUs, the scheduler might not search all CPUs in the domain
every time. In fact, in some architectures, the searching ranges on
events are limited in the same socket or node where the CPU locates,
while the load balance on tick searches all.
For example, assume CPU Z is relatively far from CPU X. Even if CPU Z
is idle while CPU X and the siblings are busy, scheduler can't migrate
woken task B from X to Z since it is out of its searching range.
As the result, task B on CPU X need to wait task A or wait load balance
on the next tick. For some applications in special situation, waiting
1 tick may be too long.
The 'cpuset.sched_relax_domain_level' file allows you to request changing
this searching range as you like. This file takes int value which
indicates size of searching range in levels ideally as follows,
otherwise initial value -1 that indicates the cpuset has no request.
-1 : no request. use system default or follow request of others.
0 : no search.
1 : search siblings (hyperthreads in a core).
2 : search cores in a package.
3 : search cpus in a node [= system wide on non-NUMA system]
( 4 : search nodes in a chunk of node [on NUMA system] )
( 5 : search system wide [on NUMA system] )
The system default is architecture dependent. The system default
can be changed using the relax_domain_level= boot parameter.
This file is per-cpuset and affect the sched domain where the cpuset
belongs to. Therefore if the flag 'cpuset.sched_load_balance' of a cpuset
is disabled, then 'cpuset.sched_relax_domain_level' have no effect since
there is no sched domain belonging the cpuset.
If multiple cpusets are overlapping and hence they form a single sched
domain, the largest value among those is used. Be careful, if one
requests 0 and others are -1 then 0 is used.
Note that modifying this file will have both good and bad effects,
and whether it is acceptable or not depends on your situation.
Don't modify this file if you are not sure.
If your situation is:
- The migration costs between each cpu can be assumed considerably
small(for you) due to your special application's behavior or
special hardware support for CPU cache etc.
- The searching cost doesn't have impact(for you) or you can make
the searching cost enough small by managing cpuset to compact etc.
- The latency is required even it sacrifices cache hit rate etc.
then increasing 'sched_relax_domain_level' would benefit you.
1.9 How do I use cpusets ?
--------------------------
In order to minimize the impact of cpusets on critical kernel
code, such as the scheduler, and due to the fact that the kernel
does not support one task updating the memory placement of another
task directly, the impact on a task of changing its cpuset CPU
or Memory Node placement, or of changing to which cpuset a task
is attached, is subtle.
If a cpuset has its Memory Nodes modified, then for each task attached
to that cpuset, the next time that the kernel attempts to allocate
a page of memory for that task, the kernel will notice the change
in the task's cpuset, and update its per-task memory placement to
remain within the new cpusets memory placement. If the task was using
mempolicy MPOL_BIND, and the nodes to which it was bound overlap with
its new cpuset, then the task will continue to use whatever subset
of MPOL_BIND nodes are still allowed in the new cpuset. If the task
was using MPOL_BIND and now none of its MPOL_BIND nodes are allowed
in the new cpuset, then the task will be essentially treated as if it
was MPOL_BIND bound to the new cpuset (even though its NUMA placement,
as queried by get_mempolicy(), doesn't change). If a task is moved
from one cpuset to another, then the kernel will adjust the task's
memory placement, as above, the next time that the kernel attempts
to allocate a page of memory for that task.
If a cpuset has its 'cpuset.cpus' modified, then each task in that cpuset
will have its allowed CPU placement changed immediately. Similarly,
if a task's pid is written to another cpusets 'cpuset.tasks' file, then its
allowed CPU placement is changed immediately. If such a task had been
bound to some subset of its cpuset using the sched_setaffinity() call,
the task will be allowed to run on any CPU allowed in its new cpuset,
negating the effect of the prior sched_setaffinity() call.
In summary, the memory placement of a task whose cpuset is changed is
updated by the kernel, on the next allocation of a page for that task,
and the processor placement is updated immediately.
Normally, once a page is allocated (given a physical page
of main memory) then that page stays on whatever node it
was allocated, so long as it remains allocated, even if the
cpusets memory placement policy 'cpuset.mems' subsequently changes.
If the cpuset flag file 'cpuset.memory_migrate' is set true, then when
tasks are attached to that cpuset, any pages that task had
allocated to it on nodes in its previous cpuset are migrated
to the task's new cpuset. The relative placement of the page within
the cpuset is preserved during these migration operations if possible.
For example if the page was on the second valid node of the prior cpuset
then the page will be placed on the second valid node of the new cpuset.
Also if 'cpuset.memory_migrate' is set true, then if that cpuset's
'cpuset.mems' file is modified, pages allocated to tasks in that
cpuset, that were on nodes in the previous setting of 'cpuset.mems',
will be moved to nodes in the new setting of 'mems.'
Pages that were not in the task's prior cpuset, or in the cpuset's
prior 'cpuset.mems' setting, will not be moved.
There is an exception to the above. If hotplug functionality is used
to remove all the CPUs that are currently assigned to a cpuset,
then all the tasks in that cpuset will be moved to the nearest ancestor
with non-empty cpus. But the moving of some (or all) tasks might fail if
cpuset is bound with another cgroup subsystem which has some restrictions
on task attaching. In this failing case, those tasks will stay
in the original cpuset, and the kernel will automatically update
their cpus_allowed to allow all online CPUs. When memory hotplug
functionality for removing Memory Nodes is available, a similar exception
is expected to apply there as well. In general, the kernel prefers to
violate cpuset placement, over starving a task that has had all
its allowed CPUs or Memory Nodes taken offline.
There is a second exception to the above. GFP_ATOMIC requests are
kernel internal allocations that must be satisfied, immediately.
The kernel may drop some request, in rare cases even panic, if a
GFP_ATOMIC alloc fails. If the request cannot be satisfied within
the current task's cpuset, then we relax the cpuset, and look for
memory anywhere we can find it. It's better to violate the cpuset
than stress the kernel.
To start a new job that is to be contained within a cpuset, the steps are:
1) mkdir /sys/fs/cgroup/cpuset
2) mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
3) Create the new cpuset by doing mkdir's and write's (or echo's) in
the /sys/fs/cgroup/cpuset virtual file system.
4) Start a task that will be the "founding father" of the new job.
5) Attach that task to the new cpuset by writing its pid to the
/sys/fs/cgroup/cpuset tasks file for that cpuset.
6) fork, exec or clone the job tasks from this founding father task.
For example, the following sequence of commands will setup a cpuset
named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
and then start a subshell 'sh' in that cpuset:
mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
cd /sys/fs/cgroup/cpuset
mkdir Charlie
cd Charlie
/bin/echo 2-3 > cpuset.cpus
/bin/echo 1 > cpuset.mems
/bin/echo $$ > tasks
sh
# The subshell 'sh' is now running in cpuset Charlie
# The next line should display '/Charlie'
cat /proc/self/cpuset
There are ways to query or modify cpusets:
- via the cpuset file system directly, using the various cd, mkdir, echo,
cat, rmdir commands from the shell, or their equivalent from C.
- via the C library libcpuset.
- via the C library libcgroup.
(http://sourceforge.net/projects/libcg/)
- via the python application cset.
(http://code.google.com/p/cpuset/)
The sched_setaffinity calls can also be done at the shell prompt using
SGI's runon or Robert Love's taskset. The mbind and set_mempolicy
calls can be done at the shell prompt using the numactl command
(part of Andi Kleen's numa package).
2. Usage Examples and Syntax
============================
2.1 Basic Usage
---------------
Creating, modifying, using the cpusets can be done through the cpuset
virtual filesystem.
To mount it, type:
# mount -t cgroup -o cpuset cpuset /sys/fs/cgroup/cpuset
Then under /sys/fs/cgroup/cpuset you can find a tree that corresponds to the
tree of the cpusets in the system. For instance, /sys/fs/cgroup/cpuset
is the cpuset that holds the whole system.
If you want to create a new cpuset under /sys/fs/cgroup/cpuset:
# cd /sys/fs/cgroup/cpuset
# mkdir my_cpuset
Now you want to do something with this cpuset.
# cd my_cpuset
In this directory you can find several files:
# ls
cgroup.clone_children cpuset.memory_pressure
cgroup.event_control cpuset.memory_spread_page
cgroup.procs cpuset.memory_spread_slab
cpuset.cpu_exclusive cpuset.mems
cpuset.cpus cpuset.sched_load_balance
cpuset.mem_exclusive cpuset.sched_relax_domain_level
cpuset.mem_hardwall notify_on_release
cpuset.memory_migrate tasks
Reading them will give you information about the state of this cpuset:
the CPUs and Memory Nodes it can use, the processes that are using
it, its properties. By writing to these files you can manipulate
the cpuset.
Set some flags:
# /bin/echo 1 > cpuset.cpu_exclusive
Add some cpus:
# /bin/echo 0-7 > cpuset.cpus
Add some mems:
# /bin/echo 0-7 > cpuset.mems
Now attach your shell to this cpuset:
# /bin/echo $$ > tasks
You can also create cpusets inside your cpuset by using mkdir in this
directory.
# mkdir my_sub_cs
To remove a cpuset, just use rmdir:
# rmdir my_sub_cs
This will fail if the cpuset is in use (has cpusets inside, or has
processes attached).
Note that for legacy reasons, the "cpuset" filesystem exists as a
wrapper around the cgroup filesystem.
The command
mount -t cpuset X /sys/fs/cgroup/cpuset
is equivalent to
mount -t cgroup -ocpuset,noprefix X /sys/fs/cgroup/cpuset
echo "/sbin/cpuset_release_agent" > /sys/fs/cgroup/cpuset/release_agent
2.2 Adding/removing cpus
------------------------
This is the syntax to use when writing in the cpus or mems files
in cpuset directories:
# /bin/echo 1-4 > cpuset.cpus -> set cpus list to cpus 1,2,3,4
# /bin/echo 1,2,3,4 > cpuset.cpus -> set cpus list to cpus 1,2,3,4
To add a CPU to a cpuset, write the new list of CPUs including the
CPU to be added. To add 6 to the above cpuset:
# /bin/echo 1-4,6 > cpuset.cpus -> set cpus list to cpus 1,2,3,4,6
Similarly to remove a CPU from a cpuset, write the new list of CPUs
without the CPU to be removed.
To remove all the CPUs:
# /bin/echo "" > cpuset.cpus -> clear cpus list
2.3 Setting flags
-----------------
The syntax is very simple:
# /bin/echo 1 > cpuset.cpu_exclusive -> set flag 'cpuset.cpu_exclusive'
# /bin/echo 0 > cpuset.cpu_exclusive -> unset flag 'cpuset.cpu_exclusive'
2.4 Attaching processes
-----------------------
# /bin/echo PID > tasks
Note that it is PID, not PIDs. You can only attach ONE task at a time.
If you have several tasks to attach, you have to do it one after another:
# /bin/echo PID1 > tasks
# /bin/echo PID2 > tasks
...
# /bin/echo PIDn > tasks
3. Questions
============
Q: what's up with this '/bin/echo' ?
A: bash's builtin 'echo' command does not check calls to write() against
errors. If you use it in the cpuset file system, you won't be
able to tell whether a command succeeded or failed.
Q: When I attach processes, only the first of the line gets really attached !
A: We can only return one error code per call to write(). So you should also
put only ONE pid.
4. Contact
==========
Web: http://www.bullopensource.org/cpuset

View file

@ -0,0 +1,116 @@
Device Whitelist Controller
1. Description:
Implement a cgroup to track and enforce open and mknod restrictions
on device files. A device cgroup associates a device access
whitelist with each cgroup. A whitelist entry has 4 fields.
'type' is a (all), c (char), or b (block). 'all' means it applies
to all types and all major and minor numbers. Major and minor are
either an integer or * for all. Access is a composition of r
(read), w (write), and m (mknod).
The root device cgroup starts with rwm to 'all'. A child device
cgroup gets a copy of the parent. Administrators can then remove
devices from the whitelist or add new entries. A child cgroup can
never receive a device access which is denied by its parent.
2. User Interface
An entry is added using devices.allow, and removed using
devices.deny. For instance
echo 'c 1:3 mr' > /sys/fs/cgroup/1/devices.allow
allows cgroup 1 to read and mknod the device usually known as
/dev/null. Doing
echo a > /sys/fs/cgroup/1/devices.deny
will remove the default 'a *:* rwm' entry. Doing
echo a > /sys/fs/cgroup/1/devices.allow
will add the 'a *:* rwm' entry to the whitelist.
3. Security
Any task can move itself between cgroups. This clearly won't
suffice, but we can decide the best way to adequately restrict
movement as people get some experience with this. We may just want
to require CAP_SYS_ADMIN, which at least is a separate bit from
CAP_MKNOD. We may want to just refuse moving to a cgroup which
isn't a descendant of the current one. Or we may want to use
CAP_MAC_ADMIN, since we really are trying to lock down root.
CAP_SYS_ADMIN is needed to modify the whitelist or move another
task to a new cgroup. (Again we'll probably want to change that).
A cgroup may not be granted more permissions than the cgroup's
parent has.
4. Hierarchy
device cgroups maintain hierarchy by making sure a cgroup never has more
access permissions than its parent. Every time an entry is written to
a cgroup's devices.deny file, all its children will have that entry removed
from their whitelist and all the locally set whitelist entries will be
re-evaluated. In case one of the locally set whitelist entries would provide
more access than the cgroup's parent, it'll be removed from the whitelist.
Example:
A
/ \
B
group behavior exceptions
A allow "b 8:* rwm", "c 116:1 rw"
B deny "c 1:3 rwm", "c 116:2 rwm", "b 3:* rwm"
If a device is denied in group A:
# echo "c 116:* r" > A/devices.deny
it'll propagate down and after revalidating B's entries, the whitelist entry
"c 116:2 rwm" will be removed:
group whitelist entries denied devices
A all "b 8:* rwm", "c 116:* rw"
B "c 1:3 rwm", "b 3:* rwm" all the rest
In case parent's exceptions change and local exceptions are not allowed
anymore, they'll be deleted.
Notice that new whitelist entries will not be propagated:
A
/ \
B
group whitelist entries denied devices
A "c 1:3 rwm", "c 1:5 r" all the rest
B "c 1:3 rwm", "c 1:5 r" all the rest
when adding "c *:3 rwm":
# echo "c *:3 rwm" >A/devices.allow
the result:
group whitelist entries denied devices
A "c *:3 rwm", "c 1:5 r" all the rest
B "c 1:3 rwm", "c 1:5 r" all the rest
but now it'll be possible to add new entries to B:
# echo "c 2:3 rwm" >B/devices.allow
# echo "c 50:3 r" >B/devices.allow
or even
# echo "c *:3 rwm" >B/devices.allow
Allowing or denying all by writing 'a' to devices.allow or devices.deny will
not be possible once the device cgroups has children.
4.1 Hierarchy (internal implementation)
device cgroups is implemented internally using a behavior (ALLOW, DENY) and a
list of exceptions. The internal state is controlled using the same user
interface to preserve compatibility with the previous whitelist-only
implementation. Removal or addition of exceptions that will reduce the access
to devices will be propagated down the hierarchy.
For every propagated exception, the effective rules will be re-evaluated based
on current parent's access rules.

View file

@ -0,0 +1,123 @@
The cgroup freezer is useful to batch job management system which start
and stop sets of tasks in order to schedule the resources of a machine
according to the desires of a system administrator. This sort of program
is often used on HPC clusters to schedule access to the cluster as a
whole. The cgroup freezer uses cgroups to describe the set of tasks to
be started/stopped by the batch job management system. It also provides
a means to start and stop the tasks composing the job.
The cgroup freezer will also be useful for checkpointing running groups
of tasks. The freezer allows the checkpoint code to obtain a consistent
image of the tasks by attempting to force the tasks in a cgroup into a
quiescent state. Once the tasks are quiescent another task can
walk /proc or invoke a kernel interface to gather information about the
quiesced tasks. Checkpointed tasks can be restarted later should a
recoverable error occur. This also allows the checkpointed tasks to be
migrated between nodes in a cluster by copying the gathered information
to another node and restarting the tasks there.
Sequences of SIGSTOP and SIGCONT are not always sufficient for stopping
and resuming tasks in userspace. Both of these signals are observable
from within the tasks we wish to freeze. While SIGSTOP cannot be caught,
blocked, or ignored it can be seen by waiting or ptracing parent tasks.
SIGCONT is especially unsuitable since it can be caught by the task. Any
programs designed to watch for SIGSTOP and SIGCONT could be broken by
attempting to use SIGSTOP and SIGCONT to stop and resume tasks. We can
demonstrate this problem using nested bash shells:
$ echo $$
16644
$ bash
$ echo $$
16690
From a second, unrelated bash shell:
$ kill -SIGSTOP 16690
$ kill -SIGCONT 16690
<at this point 16690 exits and causes 16644 to exit too>
This happens because bash can observe both signals and choose how it
responds to them.
Another example of a program which catches and responds to these
signals is gdb. In fact any program designed to use ptrace is likely to
have a problem with this method of stopping and resuming tasks.
In contrast, the cgroup freezer uses the kernel freezer code to
prevent the freeze/unfreeze cycle from becoming visible to the tasks
being frozen. This allows the bash example above and gdb to run as
expected.
The cgroup freezer is hierarchical. Freezing a cgroup freezes all
tasks beloning to the cgroup and all its descendant cgroups. Each
cgroup has its own state (self-state) and the state inherited from the
parent (parent-state). Iff both states are THAWED, the cgroup is
THAWED.
The following cgroupfs files are created by cgroup freezer.
* freezer.state: Read-write.
When read, returns the effective state of the cgroup - "THAWED",
"FREEZING" or "FROZEN". This is the combined self and parent-states.
If any is freezing, the cgroup is freezing (FREEZING or FROZEN).
FREEZING cgroup transitions into FROZEN state when all tasks
belonging to the cgroup and its descendants become frozen. Note that
a cgroup reverts to FREEZING from FROZEN after a new task is added
to the cgroup or one of its descendant cgroups until the new task is
frozen.
When written, sets the self-state of the cgroup. Two values are
allowed - "FROZEN" and "THAWED". If FROZEN is written, the cgroup,
if not already freezing, enters FREEZING state along with all its
descendant cgroups.
If THAWED is written, the self-state of the cgroup is changed to
THAWED. Note that the effective state may not change to THAWED if
the parent-state is still freezing. If a cgroup's effective state
becomes THAWED, all its descendants which are freezing because of
the cgroup also leave the freezing state.
* freezer.self_freezing: Read only.
Shows the self-state. 0 if the self-state is THAWED; otherwise, 1.
This value is 1 iff the last write to freezer.state was "FROZEN".
* freezer.parent_freezing: Read only.
Shows the parent-state. 0 if none of the cgroup's ancestors is
frozen; otherwise, 1.
The root cgroup is non-freezable and the above interface files don't
exist.
* Examples of usage :
# mkdir /sys/fs/cgroup/freezer
# mount -t cgroup -ofreezer freezer /sys/fs/cgroup/freezer
# mkdir /sys/fs/cgroup/freezer/0
# echo $some_pid > /sys/fs/cgroup/freezer/0/tasks
to get status of the freezer subsystem :
# cat /sys/fs/cgroup/freezer/0/freezer.state
THAWED
to freeze all tasks in the container :
# echo FROZEN > /sys/fs/cgroup/freezer/0/freezer.state
# cat /sys/fs/cgroup/freezer/0/freezer.state
FREEZING
# cat /sys/fs/cgroup/freezer/0/freezer.state
FROZEN
to unfreeze all tasks in the container :
# echo THAWED > /sys/fs/cgroup/freezer/0/freezer.state
# cat /sys/fs/cgroup/freezer/0/freezer.state
THAWED
This is the basic mechanism which should do the right thing for user space task
in a simple scenario.

View file

@ -0,0 +1,45 @@
HugeTLB Controller
-------------------
The HugeTLB controller allows to limit the HugeTLB usage per control group and
enforces the controller limit during page fault. Since HugeTLB doesn't
support page reclaim, enforcing the limit at page fault time implies that,
the application will get SIGBUS signal if it tries to access HugeTLB pages
beyond its limit. This requires the application to know beforehand how much
HugeTLB pages it would require for its use.
HugeTLB controller can be created by first mounting the cgroup filesystem.
# mount -t cgroup -o hugetlb none /sys/fs/cgroup
With the above step, the initial or the parent HugeTLB group becomes
visible at /sys/fs/cgroup. At bootup, this group includes all the tasks in
the system. /sys/fs/cgroup/tasks lists the tasks in this cgroup.
New groups can be created under the parent group /sys/fs/cgroup.
# cd /sys/fs/cgroup
# mkdir g1
# echo $$ > g1/tasks
The above steps create a new group g1 and move the current shell
process (bash) into it.
Brief summary of control files
hugetlb.<hugepagesize>.limit_in_bytes # set/show limit of "hugepagesize" hugetlb usage
hugetlb.<hugepagesize>.max_usage_in_bytes # show max "hugepagesize" hugetlb usage recorded
hugetlb.<hugepagesize>.usage_in_bytes # show current res_counter usage for "hugepagesize" hugetlb
hugetlb.<hugepagesize>.failcnt # show the number of allocation failure due to HugeTLB limit
For a system supporting two hugepage size (16M and 16G) the control
files include:
hugetlb.16GB.limit_in_bytes
hugetlb.16GB.max_usage_in_bytes
hugetlb.16GB.usage_in_bytes
hugetlb.16GB.failcnt
hugetlb.16MB.limit_in_bytes
hugetlb.16MB.max_usage_in_bytes
hugetlb.16MB.usage_in_bytes
hugetlb.16MB.failcnt

View file

@ -0,0 +1,280 @@
Memory Resource Controller(Memcg) Implementation Memo.
Last Updated: 2010/2
Base Kernel Version: based on 2.6.33-rc7-mm(candidate for 34).
Because VM is getting complex (one of reasons is memcg...), memcg's behavior
is complex. This is a document for memcg's internal behavior.
Please note that implementation details can be changed.
(*) Topics on API should be in Documentation/cgroups/memory.txt)
0. How to record usage ?
2 objects are used.
page_cgroup ....an object per page.
Allocated at boot or memory hotplug. Freed at memory hot removal.
swap_cgroup ... an entry per swp_entry.
Allocated at swapon(). Freed at swapoff().
The page_cgroup has USED bit and double count against a page_cgroup never
occurs. swap_cgroup is used only when a charged page is swapped-out.
1. Charge
a page/swp_entry may be charged (usage += PAGE_SIZE) at
mem_cgroup_try_charge()
2. Uncharge
a page/swp_entry may be uncharged (usage -= PAGE_SIZE) by
mem_cgroup_uncharge()
Called when a page's refcount goes down to 0.
mem_cgroup_uncharge_swap()
Called when swp_entry's refcnt goes down to 0. A charge against swap
disappears.
3. charge-commit-cancel
Memcg pages are charged in two steps:
mem_cgroup_try_charge()
mem_cgroup_commit_charge() or mem_cgroup_cancel_charge()
At try_charge(), there are no flags to say "this page is charged".
at this point, usage += PAGE_SIZE.
At commit(), the page is associated with the memcg.
At cancel(), simply usage -= PAGE_SIZE.
Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
4. Anonymous
Anonymous page is newly allocated at
- page fault into MAP_ANONYMOUS mapping.
- Copy-On-Write.
4.1 Swap-in.
At swap-in, the page is taken from swap-cache. There are 2 cases.
(a) If the SwapCache is newly allocated and read, it has no charges.
(b) If the SwapCache has been mapped by processes, it has been
charged already.
4.2 Swap-out.
At swap-out, typical state transition is below.
(a) add to swap cache. (marked as SwapCache)
swp_entry's refcnt += 1.
(b) fully unmapped.
swp_entry's refcnt += # of ptes.
(c) write back to swap.
(d) delete from swap cache. (remove from SwapCache)
swp_entry's refcnt -= 1.
Finally, at task exit,
(e) zap_pte() is called and swp_entry's refcnt -=1 -> 0.
5. Page Cache
Page Cache is charged at
- add_to_page_cache_locked().
The logic is very clear. (About migration, see below)
Note: __remove_from_page_cache() is called by remove_from_page_cache()
and __remove_mapping().
6. Shmem(tmpfs) Page Cache
The best way to understand shmem's page state transition is to read
mm/shmem.c.
But brief explanation of the behavior of memcg around shmem will be
helpful to understand the logic.
Shmem's page (just leaf page, not direct/indirect block) can be on
- radix-tree of shmem's inode.
- SwapCache.
- Both on radix-tree and SwapCache. This happens at swap-in
and swap-out,
It's charged when...
- A new page is added to shmem's radix-tree.
- A swp page is read. (move a charge from swap_cgroup to page_cgroup)
7. Page Migration
mem_cgroup_migrate()
8. LRU
Each memcg has its own private LRU. Now, its handling is under global
VM's control (means that it's handled under global zone->lru_lock).
Almost all routines around memcg's LRU is called by global LRU's
list management functions under zone->lru_lock().
A special function is mem_cgroup_isolate_pages(). This scans
memcg's private LRU and call __isolate_lru_page() to extract a page
from LRU.
(By __isolate_lru_page(), the page is removed from both of global and
private LRU.)
9. Typical Tests.
Tests for racy cases.
9.1 Small limit to memcg.
When you do test to do racy case, it's good test to set memcg's limit
to be very small rather than GB. Many races found in the test under
xKB or xxMB limits.
(Memory behavior under GB and Memory behavior under MB shows very
different situation.)
9.2 Shmem
Historically, memcg's shmem handling was poor and we saw some amount
of troubles here. This is because shmem is page-cache but can be
SwapCache. Test with shmem/tmpfs is always good test.
9.3 Migration
For NUMA, migration is an another special case. To do easy test, cpuset
is useful. Following is a sample script to do migration.
mount -t cgroup -o cpuset none /opt/cpuset
mkdir /opt/cpuset/01
echo 1 > /opt/cpuset/01/cpuset.cpus
echo 0 > /opt/cpuset/01/cpuset.mems
echo 1 > /opt/cpuset/01/cpuset.memory_migrate
mkdir /opt/cpuset/02
echo 1 > /opt/cpuset/02/cpuset.cpus
echo 1 > /opt/cpuset/02/cpuset.mems
echo 1 > /opt/cpuset/02/cpuset.memory_migrate
In above set, when you moves a task from 01 to 02, page migration to
node 0 to node 1 will occur. Following is a script to migrate all
under cpuset.
--
move_task()
{
for pid in $1
do
/bin/echo $pid >$2/tasks 2>/dev/null
echo -n $pid
echo -n " "
done
echo END
}
G1_TASK=`cat ${G1}/tasks`
G2_TASK=`cat ${G2}/tasks`
move_task "${G1_TASK}" ${G2} &
--
9.4 Memory hotplug.
memory hotplug test is one of good test.
to offline memory, do following.
# echo offline > /sys/devices/system/memory/memoryXXX/state
(XXX is the place of memory)
This is an easy way to test page migration, too.
9.5 mkdir/rmdir
When using hierarchy, mkdir/rmdir test should be done.
Use tests like the following.
echo 1 >/opt/cgroup/01/memory/use_hierarchy
mkdir /opt/cgroup/01/child_a
mkdir /opt/cgroup/01/child_b
set limit to 01.
add limit to 01/child_b
run jobs under child_a and child_b
create/delete following groups at random while jobs are running.
/opt/cgroup/01/child_a/child_aa
/opt/cgroup/01/child_b/child_bb
/opt/cgroup/01/child_c
running new jobs in new group is also good.
9.6 Mount with other subsystems.
Mounting with other subsystems is a good test because there is a
race and lock dependency with other cgroup subsystems.
example)
# mount -t cgroup none /cgroup -o cpuset,memory,cpu,devices
and do task move, mkdir, rmdir etc...under this.
9.7 swapoff.
Besides management of swap is one of complicated parts of memcg,
call path of swap-in at swapoff is not same as usual swap-in path..
It's worth to be tested explicitly.
For example, test like following is good.
(Shell-A)
# mount -t cgroup none /cgroup -o memory
# mkdir /cgroup/test
# echo 40M > /cgroup/test/memory.limit_in_bytes
# echo 0 > /cgroup/test/tasks
Run malloc(100M) program under this. You'll see 60M of swaps.
(Shell-B)
# move all tasks in /cgroup/test to /cgroup
# /sbin/swapoff -a
# rmdir /cgroup/test
# kill malloc task.
Of course, tmpfs v.s. swapoff test should be tested, too.
9.8 OOM-Killer
Out-of-memory caused by memcg's limit will kill tasks under
the memcg. When hierarchy is used, a task under hierarchy
will be killed by the kernel.
In this case, panic_on_oom shouldn't be invoked and tasks
in other groups shouldn't be killed.
It's not difficult to cause OOM under memcg as following.
Case A) when you can swapoff
#swapoff -a
#echo 50M > /memory.limit_in_bytes
run 51M of malloc
Case B) when you use mem+swap limitation.
#echo 50M > memory.limit_in_bytes
#echo 50M > memory.memsw.limit_in_bytes
run 51M of malloc
9.9 Move charges at task migration
Charges associated with a task can be moved along with task migration.
(Shell-A)
#mkdir /cgroup/A
#echo $$ >/cgroup/A/tasks
run some programs which uses some amount of memory in /cgroup/A.
(Shell-B)
#mkdir /cgroup/B
#echo 1 >/cgroup/B/memory.move_charge_at_immigrate
#echo "pid of the program running in group A" >/cgroup/B/tasks
You can see charges have been moved by reading *.usage_in_bytes or
memory.stat of both A and B.
See 8.2 of Documentation/cgroups/memory.txt to see what value should be
written to move_charge_at_immigrate.
9.10 Memory thresholds
Memory controller implements memory thresholds using cgroups notification
API. You can use tools/cgroup/cgroup_event_listener.c to test it.
(Shell-A) Create cgroup and run event listener
# mkdir /cgroup/A
# ./cgroup_event_listener /cgroup/A/memory.usage_in_bytes 5M
(Shell-B) Add task to cgroup and try to allocate and free memory
# echo $$ >/cgroup/A/tasks
# a="$(dd if=/dev/zero bs=1M count=10)"
# a=
You will see message from cgroup_event_listener every time you cross
the thresholds.
Use /cgroup/A/memory.memsw.usage_in_bytes to test memsw thresholds.
It's good idea to test root cgroup as well.

View file

@ -0,0 +1,873 @@
Memory Resource Controller
NOTE: The Memory Resource Controller has generically been referred to as the
memory controller in this document. Do not confuse memory controller
used here with the memory controller that is used in hardware.
(For editors)
In this document:
When we mention a cgroup (cgroupfs's directory) with memory controller,
we call it "memory cgroup". When you see git-log and source code, you'll
see patch's title and function names tend to use "memcg".
In this document, we avoid using it.
Benefits and Purpose of the memory controller
The memory controller isolates the memory behaviour of a group of tasks
from the rest of the system. The article on LWN [12] mentions some probable
uses of the memory controller. The memory controller can be used to
a. Isolate an application or a group of applications
Memory-hungry applications can be isolated and limited to a smaller
amount of memory.
b. Create a cgroup with a limited amount of memory; this can be used
as a good alternative to booting with mem=XXXX.
c. Virtualization solutions can control the amount of memory they want
to assign to a virtual machine instance.
d. A CD/DVD burner could control the amount of memory used by the
rest of the system to ensure that burning does not fail due to lack
of available memory.
e. There are several other use cases; find one or use the controller just
for fun (to learn and hack on the VM subsystem).
Current Status: linux-2.6.34-mmotm(development version of 2010/April)
Features:
- accounting anonymous pages, file caches, swap caches usage and limiting them.
- pages are linked to per-memcg LRU exclusively, and there is no global LRU.
- optionally, memory+swap usage can be accounted and limited.
- hierarchical accounting
- soft limit
- moving (recharging) account at moving a task is selectable.
- usage threshold notifier
- memory pressure notifier
- oom-killer disable knob and oom-notifier
- Root cgroup has no limit controls.
Kernel memory support is a work in progress, and the current version provides
basically functionality. (See Section 2.7)
Brief summary of control files.
tasks # attach a task(thread) and show list of threads
cgroup.procs # show list of processes
cgroup.event_control # an interface for event_fd()
memory.usage_in_bytes # show current res_counter usage for memory
(See 5.5 for details)
memory.memsw.usage_in_bytes # show current res_counter usage for memory+Swap
(See 5.5 for details)
memory.limit_in_bytes # set/show limit of memory usage
memory.memsw.limit_in_bytes # set/show limit of memory+Swap usage
memory.failcnt # show the number of memory usage hits limits
memory.memsw.failcnt # show the number of memory+Swap hits limits
memory.max_usage_in_bytes # show max memory usage recorded
memory.memsw.max_usage_in_bytes # show max memory+Swap usage recorded
memory.soft_limit_in_bytes # set/show soft limit of memory usage
memory.stat # show various statistics
memory.use_hierarchy # set/show hierarchical account enabled
memory.force_empty # trigger forced move charge to parent
memory.pressure_level # set memory pressure notifications
memory.swappiness # set/show swappiness parameter of vmscan
(See sysctl's vm.swappiness)
memory.move_charge_at_immigrate # set/show controls of moving charges
memory.oom_control # set/show oom controls.
memory.numa_stat # show the number of memory usage per numa node
memory.kmem.limit_in_bytes # set/show hard limit for kernel memory
memory.kmem.usage_in_bytes # show current kernel memory allocation
memory.kmem.failcnt # show the number of kernel memory usage hits limits
memory.kmem.max_usage_in_bytes # show max kernel memory usage recorded
memory.kmem.tcp.limit_in_bytes # set/show hard limit for tcp buf memory
memory.kmem.tcp.usage_in_bytes # show current tcp buf memory allocation
memory.kmem.tcp.failcnt # show the number of tcp buf memory usage hits limits
memory.kmem.tcp.max_usage_in_bytes # show max tcp buf memory usage recorded
1. History
The memory controller has a long history. A request for comments for the memory
controller was posted by Balbir Singh [1]. At the time the RFC was posted
there were several implementations for memory control. The goal of the
RFC was to build consensus and agreement for the minimal features required
for memory control. The first RSS controller was posted by Balbir Singh[2]
in Feb 2007. Pavel Emelianov [3][4][5] has since posted three versions of the
RSS controller. At OLS, at the resource management BoF, everyone suggested
that we handle both page cache and RSS together. Another request was raised
to allow user space handling of OOM. The current memory controller is
at version 6; it combines both mapped (RSS) and unmapped Page
Cache Control [11].
2. Memory Control
Memory is a unique resource in the sense that it is present in a limited
amount. If a task requires a lot of CPU processing, the task can spread
its processing over a period of hours, days, months or years, but with
memory, the same physical memory needs to be reused to accomplish the task.
The memory controller implementation has been divided into phases. These
are:
1. Memory controller
2. mlock(2) controller
3. Kernel user memory accounting and slab control
4. user mappings length controller
The memory controller is the first controller developed.
2.1. Design
The core of the design is a counter called the res_counter. The res_counter
tracks the current memory usage and limit of the group of processes associated
with the controller. Each cgroup has a memory controller specific data
structure (mem_cgroup) associated with it.
2.2. Accounting
+--------------------+
| mem_cgroup |
| (res_counter) |
+--------------------+
/ ^ \
/ | \
+---------------+ | +---------------+
| mm_struct | |.... | mm_struct |
| | | | |
+---------------+ | +---------------+
|
+ --------------+
|
+---------------+ +------+--------+
| page +----------> page_cgroup|
| | | |
+---------------+ +---------------+
(Figure 1: Hierarchy of Accounting)
Figure 1 shows the important aspects of the controller
1. Accounting happens per cgroup
2. Each mm_struct knows about which cgroup it belongs to
3. Each page has a pointer to the page_cgroup, which in turn knows the
cgroup it belongs to
The accounting is done as follows: mem_cgroup_charge_common() is invoked to
set up the necessary data structures and check if the cgroup that is being
charged is over its limit. If it is, then reclaim is invoked on the cgroup.
More details can be found in the reclaim section of this document.
If everything goes well, a page meta-data-structure called page_cgroup is
updated. page_cgroup has its own LRU on cgroup.
(*) page_cgroup structure is allocated at boot/memory-hotplug time.
2.2.1 Accounting details
All mapped anon pages (RSS) and cache pages (Page Cache) are accounted.
Some pages which are never reclaimable and will not be on the LRU
are not accounted. We just account pages under usual VM management.
RSS pages are accounted at page_fault unless they've already been accounted
for earlier. A file page will be accounted for as Page Cache when it's
inserted into inode (radix-tree). While it's mapped into the page tables of
processes, duplicate accounting is carefully avoided.
An RSS page is unaccounted when it's fully unmapped. A PageCache page is
unaccounted when it's removed from radix-tree. Even if RSS pages are fully
unmapped (by kswapd), they may exist as SwapCache in the system until they
are really freed. Such SwapCaches are also accounted.
A swapped-in page is not accounted until it's mapped.
Note: The kernel does swapin-readahead and reads multiple swaps at once.
This means swapped-in pages may contain pages for other tasks than a task
causing page fault. So, we avoid accounting at swap-in I/O.
At page migration, accounting information is kept.
Note: we just account pages-on-LRU because our purpose is to control amount
of used pages; not-on-LRU pages tend to be out-of-control from VM view.
2.3 Shared Page Accounting
Shared pages are accounted on the basis of the first touch approach. The
cgroup that first touches a page is accounted for the page. The principle
behind this approach is that a cgroup that aggressively uses a shared
page will eventually get charged for it (once it is uncharged from
the cgroup that brought it in -- this will happen on memory pressure).
But see section 8.2: when moving a task to another cgroup, its pages may
be recharged to the new cgroup, if move_charge_at_immigrate has been chosen.
Exception: If CONFIG_MEMCG_SWAP is not used.
When you do swapoff and make swapped-out pages of shmem(tmpfs) to
be backed into memory in force, charges for pages are accounted against the
caller of swapoff rather than the users of shmem.
2.4 Swap Extension (CONFIG_MEMCG_SWAP)
Swap Extension allows you to record charge for swap. A swapped-in page is
charged back to original page allocator if possible.
When swap is accounted, following files are added.
- memory.memsw.usage_in_bytes.
- memory.memsw.limit_in_bytes.
memsw means memory+swap. Usage of memory+swap is limited by
memsw.limit_in_bytes.
Example: Assume a system with 4G of swap. A task which allocates 6G of memory
(by mistake) under 2G memory limitation will use all swap.
In this case, setting memsw.limit_in_bytes=3G will prevent bad use of swap.
By using the memsw limit, you can avoid system OOM which can be caused by swap
shortage.
* why 'memory+swap' rather than swap.
The global LRU(kswapd) can swap out arbitrary pages. Swap-out means
to move account from memory to swap...there is no change in usage of
memory+swap. In other words, when we want to limit the usage of swap without
affecting global LRU, memory+swap limit is better than just limiting swap from
an OS point of view.
* What happens when a cgroup hits memory.memsw.limit_in_bytes
When a cgroup hits memory.memsw.limit_in_bytes, it's useless to do swap-out
in this cgroup. Then, swap-out will not be done by cgroup routine and file
caches are dropped. But as mentioned above, global LRU can do swapout memory
from it for sanity of the system's memory management state. You can't forbid
it by cgroup.
2.5 Reclaim
Each cgroup maintains a per cgroup LRU which has the same structure as
global VM. When a cgroup goes over its limit, we first try
to reclaim memory from the cgroup so as to make space for the new
pages that the cgroup has touched. If the reclaim is unsuccessful,
an OOM routine is invoked to select and kill the bulkiest task in the
cgroup. (See 10. OOM Control below.)
The reclaim algorithm has not been modified for cgroups, except that
pages that are selected for reclaiming come from the per-cgroup LRU
list.
NOTE: Reclaim does not work for the root cgroup, since we cannot set any
limits on the root cgroup.
Note2: When panic_on_oom is set to "2", the whole system will panic.
When oom event notifier is registered, event will be delivered.
(See oom_control section)
2.6 Locking
lock_page_cgroup()/unlock_page_cgroup() should not be called under
mapping->tree_lock.
Other lock order is following:
PG_locked.
mm->page_table_lock
zone->lru_lock
lock_page_cgroup.
In many cases, just lock_page_cgroup() is called.
per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
zone->lru_lock, it has no lock of its own.
2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
WARNING: Current implementation lacks reclaim support. That means allocation
attempts will fail when close to the limit even if there are plenty of
kmem available for reclaim. That makes this option unusable in real
life so DO NOT SELECT IT unless for development purposes.
With the Kernel memory extension, the Memory Controller is able to limit
the amount of kernel memory used by the system. Kernel memory is fundamentally
different than user memory, since it can't be swapped out, which makes it
possible to DoS the system by consuming too much of this precious resource.
Kernel memory won't be accounted at all until limit on a group is set. This
allows for existing setups to continue working without disruption. The limit
cannot be set if the cgroup have children, or if there are already tasks in the
cgroup. Attempting to set the limit under those conditions will return -EBUSY.
When use_hierarchy == 1 and a group is accounted, its children will
automatically be accounted regardless of their limit value.
After a group is first limited, it will be kept being accounted until it
is removed. The memory limitation itself, can of course be removed by writing
-1 to memory.kmem.limit_in_bytes. In this case, kmem will be accounted, but not
limited.
Kernel memory limits are not imposed for the root cgroup. Usage for the root
cgroup may or may not be accounted. The memory used is accumulated into
memory.kmem.usage_in_bytes, or in a separate counter when it makes sense.
(currently only for tcp).
The main "kmem" counter is fed into the main counter, so kmem charges will
also be visible from the user counter.
Currently no soft limit is implemented for kernel memory. It is future work
to trigger slab reclaim when those limits are reached.
2.7.1 Current Kernel Memory resources accounted
* stack pages: every process consumes some stack pages. By accounting into
kernel memory, we prevent new processes from being created when the kernel
memory usage is too high.
* slab pages: pages allocated by the SLAB or SLUB allocator are tracked. A copy
of each kmem_cache is created every time the cache is touched by the first time
from inside the memcg. The creation is done lazily, so some objects can still be
skipped while the cache is being created. All objects in a slab page should
belong to the same memcg. This only fails to hold when a task is migrated to a
different memcg during the page allocation by the cache.
* sockets memory pressure: some sockets protocols have memory pressure
thresholds. The Memory Controller allows them to be controlled individually
per cgroup, instead of globally.
* tcp memory pressure: sockets memory pressure for the tcp protocol.
2.7.3 Common use cases
Because the "kmem" counter is fed to the main user counter, kernel memory can
never be limited completely independently of user memory. Say "U" is the user
limit, and "K" the kernel limit. There are three possible ways limits can be
set:
U != 0, K = unlimited:
This is the standard memcg limitation mechanism already present before kmem
accounting. Kernel memory is completely ignored.
U != 0, K < U:
Kernel memory is a subset of the user memory. This setup is useful in
deployments where the total amount of memory per-cgroup is overcommited.
Overcommiting kernel memory limits is definitely not recommended, since the
box can still run out of non-reclaimable memory.
In this case, the admin could set up K so that the sum of all groups is
never greater than the total memory, and freely set U at the cost of his
QoS.
U != 0, K >= U:
Since kmem charges will also be fed to the user counter and reclaim will be
triggered for the cgroup for both kinds of memory. This setup gives the
admin a unified view of memory, and it is also useful for people who just
want to track kernel memory usage.
3. User Interface
0. Configuration
a. Enable CONFIG_CGROUPS
b. Enable CONFIG_RESOURCE_COUNTERS
c. Enable CONFIG_MEMCG
d. Enable CONFIG_MEMCG_SWAP (to use swap extension)
d. Enable CONFIG_MEMCG_KMEM (to use kmem extension)
1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?)
# mount -t tmpfs none /sys/fs/cgroup
# mkdir /sys/fs/cgroup/memory
# mount -t cgroup none /sys/fs/cgroup/memory -o memory
2. Make the new group and move bash into it
# mkdir /sys/fs/cgroup/memory/0
# echo $$ > /sys/fs/cgroup/memory/0/tasks
Since now we're in the 0 cgroup, we can alter the memory limit:
# echo 4M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes
NOTE: We can use a suffix (k, K, m, M, g or G) to indicate values in kilo,
mega or gigabytes. (Here, Kilo, Mega, Giga are Kibibytes, Mebibytes, Gibibytes.)
NOTE: We can write "-1" to reset the *.limit_in_bytes(unlimited).
NOTE: We cannot set limits on the root cgroup any more.
# cat /sys/fs/cgroup/memory/0/memory.limit_in_bytes
4194304
We can check the usage:
# cat /sys/fs/cgroup/memory/0/memory.usage_in_bytes
1216512
A successful write to this file does not guarantee a successful setting of
this limit to the value written into the file. This can be due to a
number of factors, such as rounding up to page boundaries or the total
availability of memory on the system. The user is required to re-read
this file after a write to guarantee the value committed by the kernel.
# echo 1 > memory.limit_in_bytes
# cat memory.limit_in_bytes
4096
The memory.failcnt field gives the number of times that the cgroup limit was
exceeded.
The memory.stat file gives accounting information. Now, the number of
caches, RSS and Active pages/Inactive pages are shown.
4. Testing
For testing features and implementation, see memcg_test.txt.
Performance test is also important. To see pure memory controller's overhead,
testing on tmpfs will give you good numbers of small overheads.
Example: do kernel make on tmpfs.
Page-fault scalability is also important. At measuring parallel
page fault test, multi-process test may be better than multi-thread
test because it has noise of shared objects/status.
But the above two are testing extreme situations.
Trying usual test under memory controller is always helpful.
4.1 Troubleshooting
Sometimes a user might find that the application under a cgroup is
terminated by the OOM killer. There are several causes for this:
1. The cgroup limit is too low (just too low to do anything useful)
2. The user is using anonymous memory and swap is turned off or too low
A sync followed by echo 1 > /proc/sys/vm/drop_caches will help get rid of
some of the pages cached in the cgroup (page cache pages).
To know what happens, disabling OOM_Kill as per "10. OOM Control" (below) and
seeing what happens will be helpful.
4.2 Task migration
When a task migrates from one cgroup to another, its charge is not
carried forward by default. The pages allocated from the original cgroup still
remain charged to it, the charge is dropped when the page is freed or
reclaimed.
You can move charges of a task along with task migration.
See 8. "Move charges at task migration"
4.3 Removing a cgroup
A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a
cgroup might have some charge associated with it, even though all
tasks have migrated away from it. (because we charge against pages, not
against tasks.)
We move the stats to root (if use_hierarchy==0) or parent (if
use_hierarchy==1), and no change on the charge except uncharging
from the child.
Charges recorded in swap information is not updated at removal of cgroup.
Recorded information is discarded and a cgroup which uses swap (swapcache)
will be charged as a new owner of it.
About use_hierarchy, see Section 6.
5. Misc. interfaces.
5.1 force_empty
memory.force_empty interface is provided to make cgroup's memory usage empty.
When writing anything to this
# echo 0 > memory.force_empty
the cgroup will be reclaimed and as many pages reclaimed as possible.
The typical use case for this interface is before calling rmdir().
Because rmdir() moves all pages to parent, some out-of-use page caches can be
moved to the parent. If you want to avoid that, force_empty will be useful.
Also, note that when memory.kmem.limit_in_bytes is set the charges due to
kernel pages will still be seen. This is not considered a failure and the
write will still return success. In this case, it is expected that
memory.kmem.usage_in_bytes == memory.usage_in_bytes.
About use_hierarchy, see Section 6.
5.2 stat file
memory.stat file includes following statistics
# per-memory cgroup local status
cache - # of bytes of page cache memory.
rss - # of bytes of anonymous and swap cache memory (includes
transparent hugepages).
rss_huge - # of bytes of anonymous transparent hugepages.
mapped_file - # of bytes of mapped file (includes tmpfs/shmem)
pgpgin - # of charging events to the memory cgroup. The charging
event happens each time a page is accounted as either mapped
anon page(RSS) or cache page(Page Cache) to the cgroup.
pgpgout - # of uncharging events to the memory cgroup. The uncharging
event happens each time a page is unaccounted from the cgroup.
swap - # of bytes of swap usage
writeback - # of bytes of file/anon cache that are queued for syncing to
disk.
inactive_anon - # of bytes of anonymous and swap cache memory on inactive
LRU list.
active_anon - # of bytes of anonymous and swap cache memory on active
LRU list.
inactive_file - # of bytes of file-backed memory on inactive LRU list.
active_file - # of bytes of file-backed memory on active LRU list.
unevictable - # of bytes of memory that cannot be reclaimed (mlocked etc).
# status considering hierarchy (see memory.use_hierarchy settings)
hierarchical_memory_limit - # of bytes of memory limit with regard to hierarchy
under which the memory cgroup is
hierarchical_memsw_limit - # of bytes of memory+swap limit with regard to
hierarchy under which memory cgroup is.
total_<counter> - # hierarchical version of <counter>, which in
addition to the cgroup's own value includes the
sum of all hierarchical children's values of
<counter>, i.e. total_cache
# The following additional stats are dependent on CONFIG_DEBUG_VM.
recent_rotated_anon - VM internal parameter. (see mm/vmscan.c)
recent_rotated_file - VM internal parameter. (see mm/vmscan.c)
recent_scanned_anon - VM internal parameter. (see mm/vmscan.c)
recent_scanned_file - VM internal parameter. (see mm/vmscan.c)
Memo:
recent_rotated means recent frequency of LRU rotation.
recent_scanned means recent # of scans to LRU.
showing for better debug please see the code for meanings.
Note:
Only anonymous and swap cache memory is listed as part of 'rss' stat.
This should not be confused with the true 'resident set size' or the
amount of physical memory used by the cgroup.
'rss + file_mapped" will give you resident set size of cgroup.
(Note: file and shmem may be shared among other cgroups. In that case,
file_mapped is accounted only when the memory cgroup is owner of page
cache.)
5.3 swappiness
Overrides /proc/sys/vm/swappiness for the particular group. The tunable
in the root cgroup corresponds to the global swappiness setting.
Please note that unlike during the global reclaim, limit reclaim
enforces that 0 swappiness really prevents from any swapping even if
there is a swap storage available. This might lead to memcg OOM killer
if there are no file pages to reclaim.
5.4 failcnt
A memory cgroup provides memory.failcnt and memory.memsw.failcnt files.
This failcnt(== failure count) shows the number of times that a usage counter
hit its limit. When a memory cgroup hits a limit, failcnt increases and
memory under it will be reclaimed.
You can reset failcnt by writing 0 to failcnt file.
# echo 0 > .../memory.failcnt
5.5 usage_in_bytes
For efficiency, as other kernel components, memory cgroup uses some optimization
to avoid unnecessary cacheline false sharing. usage_in_bytes is affected by the
method and doesn't show 'exact' value of memory (and swap) usage, it's a fuzz
value for efficient access. (Of course, when necessary, it's synchronized.)
If you want to know more exact memory usage, you should use RSS+CACHE(+SWAP)
value in memory.stat(see 5.2).
5.6 numa_stat
This is similar to numa_maps but operates on a per-memcg basis. This is
useful for providing visibility into the numa locality information within
an memcg since the pages are allowed to be allocated from any physical
node. One of the use cases is evaluating application performance by
combining this information with the application's CPU allocation.
Each memcg's numa_stat file includes "total", "file", "anon" and "unevictable"
per-node page counts including "hierarchical_<counter>" which sums up all
hierarchical children's values in addition to the memcg's own value.
The output format of memory.numa_stat is:
total=<total pages> N0=<node 0 pages> N1=<node 1 pages> ...
file=<total file pages> N0=<node 0 pages> N1=<node 1 pages> ...
anon=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ...
unevictable=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ...
hierarchical_<counter>=<counter pages> N0=<node 0 pages> N1=<node 1 pages> ...
The "total" count is sum of file + anon + unevictable.
6. Hierarchy support
The memory controller supports a deep hierarchy and hierarchical accounting.
The hierarchy is created by creating the appropriate cgroups in the
cgroup filesystem. Consider for example, the following cgroup filesystem
hierarchy
root
/ | \
/ | \
a b c
| \
| \
d e
In the diagram above, with hierarchical accounting enabled, all memory
usage of e, is accounted to its ancestors up until the root (i.e, c and root),
that has memory.use_hierarchy enabled. If one of the ancestors goes over its
limit, the reclaim algorithm reclaims from the tasks in the ancestor and the
children of the ancestor.
6.1 Enabling hierarchical accounting and reclaim
A memory cgroup by default disables the hierarchy feature. Support
can be enabled by writing 1 to memory.use_hierarchy file of the root cgroup
# echo 1 > memory.use_hierarchy
The feature can be disabled by
# echo 0 > memory.use_hierarchy
NOTE1: Enabling/disabling will fail if either the cgroup already has other
cgroups created below it, or if the parent cgroup has use_hierarchy
enabled.
NOTE2: When panic_on_oom is set to "2", the whole system will panic in
case of an OOM event in any cgroup.
7. Soft limits
Soft limits allow for greater sharing of memory. The idea behind soft limits
is to allow control groups to use as much of the memory as needed, provided
a. There is no memory contention
b. They do not exceed their hard limit
When the system detects memory contention or low memory, control groups
are pushed back to their soft limits. If the soft limit of each control
group is very high, they are pushed back as much as possible to make
sure that one control group does not starve the others of memory.
Please note that soft limits is a best-effort feature; it comes with
no guarantees, but it does its best to make sure that when memory is
heavily contended for, memory is allocated based on the soft limit
hints/setup. Currently soft limit based reclaim is set up such that
it gets invoked from balance_pgdat (kswapd).
7.1 Interface
Soft limits can be setup by using the following commands (in this example we
assume a soft limit of 256 MiB)
# echo 256M > memory.soft_limit_in_bytes
If we want to change this to 1G, we can at any time use
# echo 1G > memory.soft_limit_in_bytes
NOTE1: Soft limits take effect over a long period of time, since they involve
reclaiming memory for balancing between memory cgroups
NOTE2: It is recommended to set the soft limit always below the hard limit,
otherwise the hard limit will take precedence.
8. Move charges at task migration
Users can move charges associated with a task along with task migration, that
is, uncharge task's pages from the old cgroup and charge them to the new cgroup.
This feature is not supported in !CONFIG_MMU environments because of lack of
page tables.
8.1 Interface
This feature is disabled by default. It can be enabled (and disabled again) by
writing to memory.move_charge_at_immigrate of the destination cgroup.
If you want to enable it:
# echo (some positive value) > memory.move_charge_at_immigrate
Note: Each bits of move_charge_at_immigrate has its own meaning about what type
of charges should be moved. See 8.2 for details.
Note: Charges are moved only when you move mm->owner, in other words,
a leader of a thread group.
Note: If we cannot find enough space for the task in the destination cgroup, we
try to make space by reclaiming memory. Task migration may fail if we
cannot make enough space.
Note: It can take several seconds if you move charges much.
And if you want disable it again:
# echo 0 > memory.move_charge_at_immigrate
8.2 Type of charges which can be moved
Each bit in move_charge_at_immigrate has its own meaning about what type of
charges should be moved. But in any case, it must be noted that an account of
a page or a swap can be moved only when it is charged to the task's current
(old) memory cgroup.
bit | what type of charges would be moved ?
-----+------------------------------------------------------------------------
0 | A charge of an anonymous page (or swap of it) used by the target task.
| You must enable Swap Extension (see 2.4) to enable move of swap charges.
-----+------------------------------------------------------------------------
1 | A charge of file pages (normal file, tmpfs file (e.g. ipc shared memory)
| and swaps of tmpfs file) mmapped by the target task. Unlike the case of
| anonymous pages, file pages (and swaps) in the range mmapped by the task
| will be moved even if the task hasn't done page fault, i.e. they might
| not be the task's "RSS", but other task's "RSS" that maps the same file.
| And mapcount of the page is ignored (the page can be moved even if
| page_mapcount(page) > 1). You must enable Swap Extension (see 2.4) to
| enable move of swap charges.
8.3 TODO
- All of moving charge operations are done under cgroup_mutex. It's not good
behavior to hold the mutex too long, so we may need some trick.
9. Memory thresholds
Memory cgroup implements memory thresholds using the cgroups notification
API (see cgroups.txt). It allows to register multiple memory and memsw
thresholds and gets notifications when it crosses.
To register a threshold, an application must:
- create an eventfd using eventfd(2);
- open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
- write string like "<event_fd> <fd of memory.usage_in_bytes> <threshold>" to
cgroup.event_control.
Application will be notified through eventfd when memory usage crosses
threshold in any direction.
It's applicable for root and non-root cgroup.
10. OOM Control
memory.oom_control file is for OOM notification and other controls.
Memory cgroup implements OOM notifier using the cgroup notification
API (See cgroups.txt). It allows to register multiple OOM notification
delivery and gets notification when OOM happens.
To register a notifier, an application must:
- create an eventfd using eventfd(2)
- open memory.oom_control file
- write string like "<event_fd> <fd of memory.oom_control>" to
cgroup.event_control
The application will be notified through eventfd when OOM happens.
OOM notification doesn't work for the root cgroup.
You can disable the OOM-killer by writing "1" to memory.oom_control file, as:
#echo 1 > memory.oom_control
If OOM-killer is disabled, tasks under cgroup will hang/sleep
in memory cgroup's OOM-waitqueue when they request accountable memory.
For running them, you have to relax the memory cgroup's OOM status by
* enlarge limit or reduce usage.
To reduce usage,
* kill some tasks.
* move some tasks to other group with account migration.
* remove some files (on tmpfs?)
Then, stopped tasks will work again.
At reading, current status of OOM is shown.
oom_kill_disable 0 or 1 (if 1, oom-killer is disabled)
under_oom 0 or 1 (if 1, the memory cgroup is under OOM, tasks may
be stopped.)
11. Memory Pressure
The pressure level notifications can be used to monitor the memory
allocation cost; based on the pressure, applications can implement
different strategies of managing their memory resources. The pressure
levels are defined as following:
The "low" level means that the system is reclaiming memory for new
allocations. Monitoring this reclaiming activity might be useful for
maintaining cache level. Upon notification, the program (typically
"Activity Manager") might analyze vmstat and act in advance (i.e.
prematurely shutdown unimportant services).
The "medium" level means that the system is experiencing medium memory
pressure, the system might be making swap, paging out active file caches,
etc. Upon this event applications may decide to further analyze
vmstat/zoneinfo/memcg or internal memory usage statistics and free any
resources that can be easily reconstructed or re-read from a disk.
The "critical" level means that the system is actively thrashing, it is
about to out of memory (OOM) or even the in-kernel OOM killer is on its
way to trigger. Applications should do whatever they can to help the
system. It might be too late to consult with vmstat or any other
statistics, so it's advisable to take an immediate action.
The events are propagated upward until the event is handled, i.e. the
events are not pass-through. Here is what this means: for example you have
three cgroups: A->B->C. Now you set up an event listener on cgroups A, B
and C, and suppose group C experiences some pressure. In this situation,
only group C will receive the notification, i.e. groups A and B will not
receive it. This is done to avoid excessive "broadcasting" of messages,
which disturbs the system and which is especially bad if we are low on
memory or thrashing. So, organize the cgroups wisely, or propagate the
events manually (or, ask us to implement the pass-through events,
explaining why would you need them.)
The file memory.pressure_level is only used to setup an eventfd. To
register a notification, an application must:
- create an eventfd using eventfd(2);
- open memory.pressure_level;
- write string like "<event_fd> <fd of memory.pressure_level> <level>"
to cgroup.event_control.
Application will be notified through eventfd when memory pressure is at
the specific level (or higher). Read/write operations to
memory.pressure_level are no implemented.
Test:
Here is a small script example that makes a new cgroup, sets up a
memory limit, sets up a notification in the cgroup and then makes child
cgroup experience a critical pressure:
# cd /sys/fs/cgroup/memory/
# mkdir foo
# cd foo
# cgroup_event_listener memory.pressure_level low &
# echo 8000000 > memory.limit_in_bytes
# echo 8000000 > memory.memsw.limit_in_bytes
# echo $$ > tasks
# dd if=/dev/zero | read x
(Expect a bunch of notifications, and eventually, the oom-killer will
trigger.)
12. TODO
1. Make per-cgroup scanner reclaim not-shared pages first
2. Teach controller to account for shared-pages
3. Start reclamation in the background when the limit is
not yet hit but the usage is getting closer
Summary
Overall, the memory controller has been a stable controller and has been
commented and discussed quite extensively in the community.
References
1. Singh, Balbir. RFC: Memory Controller, http://lwn.net/Articles/206697/
2. Singh, Balbir. Memory Controller (RSS Control),
http://lwn.net/Articles/222762/
3. Emelianov, Pavel. Resource controllers based on process cgroups
http://lkml.org/lkml/2007/3/6/198
4. Emelianov, Pavel. RSS controller based on process cgroups (v2)
http://lkml.org/lkml/2007/4/9/78
5. Emelianov, Pavel. RSS controller based on process cgroups (v3)
http://lkml.org/lkml/2007/5/30/244
6. Menage, Paul. Control Groups v10, http://lwn.net/Articles/236032/
7. Vaidyanathan, Srinivasan, Control Groups: Pagecache accounting and control
subsystem (v3), http://lwn.net/Articles/235534/
8. Singh, Balbir. RSS controller v2 test results (lmbench),
http://lkml.org/lkml/2007/5/17/232
9. Singh, Balbir. RSS controller v2 AIM9 results
http://lkml.org/lkml/2007/5/18/1
10. Singh, Balbir. Memory controller v6 test results,
http://lkml.org/lkml/2007/8/19/36
11. Singh, Balbir. Memory controller introduction (v6),
http://lkml.org/lkml/2007/8/17/69
12. Corbet, Jonathan, Controlling memory use in cgroups,
http://lwn.net/Articles/243795/

View file

@ -0,0 +1,39 @@
Network classifier cgroup
-------------------------
The Network classifier cgroup provides an interface to
tag network packets with a class identifier (classid).
The Traffic Controller (tc) can be used to assign
different priorities to packets from different cgroups.
Also, Netfilter (iptables) can use this tag to perform
actions on such packets.
Creating a net_cls cgroups instance creates a net_cls.classid file.
This net_cls.classid value is initialized to 0.
You can write hexadecimal values to net_cls.classid; the format for these
values is 0xAAAABBBB; AAAA is the major handle number and BBBB
is the minor handle number.
Reading net_cls.classid yields a decimal result.
Example:
mkdir /sys/fs/cgroup/net_cls
mount -t cgroup -onet_cls net_cls /sys/fs/cgroup/net_cls
mkdir /sys/fs/cgroup/net_cls/0
echo 0x100001 > /sys/fs/cgroup/net_cls/0/net_cls.classid
- setting a 10:1 handle.
cat /sys/fs/cgroup/net_cls/0/net_cls.classid
1048577
configuring tc:
tc qdisc add dev eth0 root handle 10: htb
tc class add dev eth0 parent 10: classid 10:1 htb rate 40mbit
- creating traffic class 10:1
tc filter add dev eth0 parent 10: protocol ip prio 10 handle 1: cgroup
configuring iptables, basic example:
iptables -A OUTPUT -m cgroup ! --cgroup 0x100001 -j DROP

View file

@ -0,0 +1,55 @@
Network priority cgroup
-------------------------
The Network priority cgroup provides an interface to allow an administrator to
dynamically set the priority of network traffic generated by various
applications
Nominally, an application would set the priority of its traffic via the
SO_PRIORITY socket option. This however, is not always possible because:
1) The application may not have been coded to set this value
2) The priority of application traffic is often a site-specific administrative
decision rather than an application defined one.
This cgroup allows an administrator to assign a process to a group which defines
the priority of egress traffic on a given interface. Network priority groups can
be created by first mounting the cgroup filesystem.
# mount -t cgroup -onet_prio none /sys/fs/cgroup/net_prio
With the above step, the initial group acting as the parent accounting group
becomes visible at '/sys/fs/cgroup/net_prio'. This group includes all tasks in
the system. '/sys/fs/cgroup/net_prio/tasks' lists the tasks in this cgroup.
Each net_prio cgroup contains two files that are subsystem specific
net_prio.prioidx
This file is read-only, and is simply informative. It contains a unique integer
value that the kernel uses as an internal representation of this cgroup.
net_prio.ifpriomap
This file contains a map of the priorities assigned to traffic originating from
processes in this group and egressing the system on various interfaces. It
contains a list of tuples in the form <ifname priority>. Contents of this file
can be modified by echoing a string into the file using the same tuple format.
for example:
echo "eth0 5" > /sys/fs/cgroups/net_prio/iscsi/net_prio.ifpriomap
This command would force any traffic originating from processes belonging to the
iscsi net_prio cgroup and egressing on interface eth0 to have the priority of
said traffic set to the value 5. The parent accounting group also has a
writeable 'net_prio.ifpriomap' file that can be used to set a system default
priority.
Priorities are set immediately prior to queueing a frame to the device
queueing discipline (qdisc) so priorities will be assigned prior to the hardware
queue selection being made.
One usage for the net_prio cgroup is with mqprio qdisc allowing application
traffic to be steered to hardware/driver based traffic classes. These mappings
can then be managed by administrators or other networking protocols such as
DCBX.
A new net_prio cgroup inherits the parent's configuration.

View file

@ -0,0 +1,197 @@
The Resource Counter
The resource counter, declared at include/linux/res_counter.h,
is supposed to facilitate the resource management by controllers
by providing common stuff for accounting.
This "stuff" includes the res_counter structure and routines
to work with it.
1. Crucial parts of the res_counter structure
a. unsigned long long usage
The usage value shows the amount of a resource that is consumed
by a group at a given time. The units of measurement should be
determined by the controller that uses this counter. E.g. it can
be bytes, items or any other unit the controller operates on.
b. unsigned long long max_usage
The maximal value of the usage over time.
This value is useful when gathering statistical information about
the particular group, as it shows the actual resource requirements
for a particular group, not just some usage snapshot.
c. unsigned long long limit
The maximal allowed amount of resource to consume by the group. In
case the group requests for more resources, so that the usage value
would exceed the limit, the resource allocation is rejected (see
the next section).
d. unsigned long long failcnt
The failcnt stands for "failures counter". This is the number of
resource allocation attempts that failed.
c. spinlock_t lock
Protects changes of the above values.
2. Basic accounting routines
a. void res_counter_init(struct res_counter *rc,
struct res_counter *rc_parent)
Initializes the resource counter. As usual, should be the first
routine called for a new counter.
The struct res_counter *parent can be used to define a hierarchical
child -> parent relationship directly in the res_counter structure,
NULL can be used to define no relationship.
c. int res_counter_charge(struct res_counter *rc, unsigned long val,
struct res_counter **limit_fail_at)
When a resource is about to be allocated it has to be accounted
with the appropriate resource counter (controller should determine
which one to use on its own). This operation is called "charging".
This is not very important which operation - resource allocation
or charging - is performed first, but
* if the allocation is performed first, this may create a
temporary resource over-usage by the time resource counter is
charged;
* if the charging is performed first, then it should be uncharged
on error path (if the one is called).
If the charging fails and a hierarchical dependency exists, the
limit_fail_at parameter is set to the particular res_counter element
where the charging failed.
d. u64 res_counter_uncharge(struct res_counter *rc, unsigned long val)
When a resource is released (freed) it should be de-accounted
from the resource counter it was accounted to. This is called
"uncharging". The return value of this function indicate the amount
of charges still present in the counter.
The _locked routines imply that the res_counter->lock is taken.
e. u64 res_counter_uncharge_until
(struct res_counter *rc, struct res_counter *top,
unsigned long val)
Almost same as res_counter_uncharge() but propagation of uncharge
stops when rc == top. This is useful when kill a res_counter in
child cgroup.
2.1 Other accounting routines
There are more routines that may help you with common needs, like
checking whether the limit is reached or resetting the max_usage
value. They are all declared in include/linux/res_counter.h.
3. Analyzing the resource counter registrations
a. If the failcnt value constantly grows, this means that the counter's
limit is too tight. Either the group is misbehaving and consumes too
many resources, or the configuration is not suitable for the group
and the limit should be increased.
b. The max_usage value can be used to quickly tune the group. One may
set the limits to maximal values and either load the container with
a common pattern or leave one for a while. After this the max_usage
value shows the amount of memory the container would require during
its common activity.
Setting the limit a bit above this value gives a pretty good
configuration that works in most of the cases.
c. If the max_usage is much less than the limit, but the failcnt value
is growing, then the group tries to allocate a big chunk of resource
at once.
d. If the max_usage is much less than the limit, but the failcnt value
is 0, then this group is given too high limit, that it does not
require. It is better to lower the limit a bit leaving more resource
for other groups.
4. Communication with the control groups subsystem (cgroups)
All the resource controllers that are using cgroups and resource counters
should provide files (in the cgroup filesystem) to work with the resource
counter fields. They are recommended to adhere to the following rules:
a. File names
Field name File name
---------------------------------------------------
usage usage_in_<unit_of_measurement>
max_usage max_usage_in_<unit_of_measurement>
limit limit_in_<unit_of_measurement>
failcnt failcnt
lock no file :)
b. Reading from file should show the corresponding field value in the
appropriate format.
c. Writing to file
Field Expected behavior
----------------------------------
usage prohibited
max_usage reset to usage
limit set the limit
failcnt reset to zero
5. Usage example
a. Declare a task group (take a look at cgroups subsystem for this) and
fold a res_counter into it
struct my_group {
struct res_counter res;
<other fields>
}
b. Put hooks in resource allocation/release paths
int alloc_something(...)
{
if (res_counter_charge(res_counter_ptr, amount) < 0)
return -ENOMEM;
<allocate the resource and return to the caller>
}
void release_something(...)
{
res_counter_uncharge(res_counter_ptr, amount);
<release the resource>
}
In order to keep the usage value self-consistent, both the
"res_counter_ptr" and the "amount" in release_something() should be
the same as they were in the alloc_something() when the releasing
resource was allocated.
c. Provide the way to read res_counter values and set them (the cgroups
still can help with it).
c. Compile and run :)

View file

@ -0,0 +1,382 @@
Cgroup unified hierarchy
April, 2014 Tejun Heo <tj@kernel.org>
This document describes the changes made by unified hierarchy and
their rationales. It will eventually be merged into the main cgroup
documentation.
CONTENTS
1. Background
2. Basic Operation
2-1. Mounting
2-2. cgroup.subtree_control
2-3. cgroup.controllers
3. Structural Constraints
3-1. Top-down
3-2. No internal tasks
4. Other Changes
4-1. [Un]populated Notification
4-2. Other Core Changes
4-3. Per-Controller Changes
4-3-1. blkio
4-3-2. cpuset
4-3-3. memory
5. Planned Changes
5-1. CAP for resource control
1. Background
cgroup allows an arbitrary number of hierarchies and each hierarchy
can host any number of controllers. While this seems to provide a
high level of flexibility, it isn't quite useful in practice.
For example, as there is only one instance of each controller, utility
type controllers such as freezer which can be useful in all
hierarchies can only be used in one. The issue is exacerbated by the
fact that controllers can't be moved around once hierarchies are
populated. Another issue is that all controllers bound to a hierarchy
are forced to have exactly the same view of the hierarchy. It isn't
possible to vary the granularity depending on the specific controller.
In practice, these issues heavily limit which controllers can be put
on the same hierarchy and most configurations resort to putting each
controller on its own hierarchy. Only closely related ones, such as
the cpu and cpuacct controllers, make sense to put on the same
hierarchy. This often means that userland ends up managing multiple
similar hierarchies repeating the same steps on each hierarchy
whenever a hierarchy management operation is necessary.
Unfortunately, support for multiple hierarchies comes at a steep cost.
Internal implementation in cgroup core proper is dazzlingly
complicated but more importantly the support for multiple hierarchies
restricts how cgroup is used in general and what controllers can do.
There's no limit on how many hierarchies there may be, which means
that a task's cgroup membership can't be described in finite length.
The key may contain any varying number of entries and is unlimited in
length, which makes it highly awkward to handle and leads to addition
of controllers which exist only to identify membership, which in turn
exacerbates the original problem.
Also, as a controller can't have any expectation regarding what shape
of hierarchies other controllers would be on, each controller has to
assume that all other controllers are operating on completely
orthogonal hierarchies. This makes it impossible, or at least very
cumbersome, for controllers to cooperate with each other.
In most use cases, putting controllers on hierarchies which are
completely orthogonal to each other isn't necessary. What usually is
called for is the ability to have differing levels of granularity
depending on the specific controller. In other words, hierarchy may
be collapsed from leaf towards root when viewed from specific
controllers. For example, a given configuration might not care about
how memory is distributed beyond a certain level while still wanting
to control how CPU cycles are distributed.
Unified hierarchy is the next version of cgroup interface. It aims to
address the aforementioned issues by having more structure while
retaining enough flexibility for most use cases. Various other
general and controller-specific interface issues are also addressed in
the process.
2. Basic Operation
2-1. Mounting
Currently, unified hierarchy can be mounted with the following mount
command. Note that this is still under development and scheduled to
change soon.
mount -t cgroup -o __DEVEL__sane_behavior cgroup $MOUNT_POINT
All controllers which support the unified hierarchy and are not bound
to other hierarchies are automatically bound to unified hierarchy and
show up at the root of it. Controllers which are enabled only in the
root of unified hierarchy can be bound to other hierarchies. This
allows mixing unified hierarchy with the traditional multiple
hierarchies in a fully backward compatible way.
For development purposes, the following boot parameter makes all
controllers to appear on the unified hierarchy whether supported or
not.
cgroup__DEVEL__legacy_files_on_dfl
A controller can be moved across hierarchies only after the controller
is no longer referenced in its current hierarchy. Because per-cgroup
controller states are destroyed asynchronously and controllers may
have lingering references, a controller may not show up immediately on
the unified hierarchy after the final umount of the previous
hierarchy. Similarly, a controller should be fully disabled to be
moved out of the unified hierarchy and it may take some time for the
disabled controller to become available for other hierarchies;
furthermore, due to dependencies among controllers, other controllers
may need to be disabled too.
While useful for development and manual configurations, dynamically
moving controllers between the unified and other hierarchies is
strongly discouraged for production use. It is recommended to decide
the hierarchies and controller associations before starting using the
controllers.
2-2. cgroup.subtree_control
All cgroups on unified hierarchy have a "cgroup.subtree_control" file
which governs which controllers are enabled on the children of the
cgroup. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" file determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in a "cgroup.subtree_control" file declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
When read, the file contains a space-separated list of currently
enabled controllers. A write to the file should contain a
space-separated list of controllers with '+' or '-' prefixed (without
the quotes). Controllers prefixed with '+' are enabled and '-'
disabled. If a controller is listed multiple times, the last entry
wins. The specific operations are executed atomically - either all
succeed or fail.
2-3. cgroup.controllers
Read-only "cgroup.controllers" file contains a space-separated list of
controllers which can be enabled in the cgroup's
"cgroup.subtree_control" file.
In the root cgroup, this lists controllers which are not bound to
other hierarchies and the content changes as controllers are bound to
and unbound from other hierarchies.
In non-root cgroups, the content of this file equals that of the
parent's "cgroup.subtree_control" file as only controllers enabled
from the parent can be used in its children.
3. Structural Constraints
3-1. Top-down
As it doesn't make sense to nest control of an uncontrolled resource,
all non-root "cgroup.subtree_control" files can only contain
controllers which are enabled in the parent's "cgroup.subtree_control"
file. A controller can be enabled only if the parent has the
controller enabled and a controller can't be disabled if one or more
children have it enabled.
3-2. No internal tasks
One long-standing issue that cgroup faces is the competition between
tasks belonging to the parent cgroup and its children cgroups. This
is inherently nasty as two different types of entities compete and
there is no agreed-upon obvious way to handle it. Different
controllers are doing different things.
The cpu controller considers tasks and cgroups as equivalents and maps
nice levels to cgroup weights. This works for some cases but falls
flat when children should be allocated specific ratios of CPU cycles
and the number of internal tasks fluctuates - the ratios constantly
change as the number of competing entities fluctuates. There also are
other issues. The mapping from nice level to weight isn't obvious or
universal, and there are various other knobs which simply aren't
available for tasks.
The blkio controller implicitly creates a hidden leaf node for each
cgroup to host the tasks. The hidden leaf has its own copies of all
the knobs with "leaf_" prefixed. While this allows equivalent control
over internal tasks, it's with serious drawbacks. It always adds an
extra layer of nesting which may not be necessary, makes the interface
messy and significantly complicates the implementation.
The memory controller currently doesn't have a way to control what
happens between internal tasks and child cgroups and the behavior is
not clearly defined. There have been attempts to add ad-hoc behaviors
and knobs to tailor the behavior to specific workloads. Continuing
this direction will lead to problems which will be extremely difficult
to resolve in the long term.
Multiple controllers struggle with internal tasks and came up with
different ways to deal with it; unfortunately, all the approaches in
use now are severely flawed and, furthermore, the widely different
behaviors make cgroup as whole highly inconsistent.
It is clear that this is something which needs to be addressed from
cgroup core proper in a uniform way so that controllers don't need to
worry about it and cgroup as a whole shows a consistent and logical
behavior. To achieve that, unified hierarchy enforces the following
structural constraint:
Except for the root, only cgroups which don't contain any task may
have controllers enabled in their "cgroup.subtree_control" files.
Combined with other properties, this guarantees that, when a
controller is looking at the part of the hierarchy which has it
enabled, tasks are always only on the leaves. This rules out
situations where child cgroups compete against internal tasks of the
parent.
There are two things to note. Firstly, the root cgroup is exempt from
the restriction. Root contains tasks and anonymous resource
consumption which can't be associated with any other cgroup and
requires special treatment from most controllers. How resource
consumption in the root cgroup is governed is up to each controller.
Secondly, the restriction doesn't take effect if there is no enabled
controller in the cgroup's "cgroup.subtree_control" file. This is
important as otherwise it wouldn't be possible to create children of a
populated cgroup. To control resource distribution of a cgroup, the
cgroup must create children and transfer all its tasks to the children
before enabling controllers in its "cgroup.subtree_control" file.
4. Other Changes
4-1. [Un]populated Notification
cgroup users often need a way to determine when a cgroup's
subhierarchy becomes empty so that it can be cleaned up. cgroup
currently provides release_agent for it; unfortunately, this mechanism
is riddled with issues.
- It delivers events by forking and execing a userland binary
specified as the release_agent. This is a long deprecated method of
notification delivery. It's extremely heavy, slow and cumbersome to
integrate with larger infrastructure.
- There is single monitoring point at the root. There's no way to
delegate management of a subtree.
- The event isn't recursive. It triggers when a cgroup doesn't have
any tasks or child cgroups. Events for internal nodes trigger only
after all children are removed. This again makes it impossible to
delegate management of a subtree.
- Events are filtered from the kernel side. A "notify_on_release"
file is used to subscribe to or suppress release events. This is
unnecessarily complicated and probably done this way because event
delivery itself was expensive.
Unified hierarchy implements an interface file "cgroup.populated"
which can be used to monitor whether the cgroup's subhierarchy has
tasks in it or not. Its value is 0 if there is no task in the cgroup
and its descendants; otherwise, 1. poll and [id]notify events are
triggered when the value changes.
This is significantly lighter and simpler and trivially allows
delegating management of subhierarchy - subhierarchy monitoring can
block further propagation simply by putting itself or another process
in the subhierarchy and monitor events that it's interested in from
there without interfering with monitoring higher in the tree.
In unified hierarchy, the release_agent mechanism is no longer
supported and the interface files "release_agent" and
"notify_on_release" do not exist.
4-2. Other Core Changes
- None of the mount options is allowed.
- remount is disallowed.
- rename(2) is disallowed.
- The "tasks" file is removed. Everything should at process
granularity. Use the "cgroup.procs" file instead.
- The "cgroup.procs" file is not sorted. pids will be unique unless
they got recycled in-between reads.
- The "cgroup.clone_children" file is removed.
4-3. Per-Controller Changes
4-3-1. blkio
- blk-throttle becomes properly hierarchical.
4-3-2. cpuset
- Tasks are kept in empty cpusets after hotplug and take on the masks
of the nearest non-empty ancestor, instead of being moved to it.
- A task can be moved into an empty cpuset, and again it takes on the
masks of the nearest non-empty ancestor.
4-3-3. memory
- use_hierarchy is on by default and the cgroup file for the flag is
not created.
5. Planned Changes
5-1. CAP for resource control
Unified hierarchy will require one of the capabilities(7), which is
yet to be decided, for all resource control related knobs. Process
organization operations - creation of sub-cgroups and migration of
processes in sub-hierarchies may be delegated by changing the
ownership and/or permissions on the cgroup directory and
"cgroup.procs" interface file; however, all operations which affect
resource control - writes to a "cgroup.subtree_control" file or any
controller-specific knobs - will require an explicit CAP privilege.
This, in part, is to prevent the cgroup interface from being
inadvertently promoted to programmable API used by non-privileged
binaries. cgroup exposes various aspects of the system in ways which
aren't properly abstracted for direct consumption by regular programs.
This is an administration interface much closer to sysctl knobs than
system calls. Even the basic access model, being filesystem path
based, isn't suitable for direct consumption. There's no way to
access "my cgroup" in a race-free way or make multiple operations
atomic against migration to another cgroup.
Another aspect is that, for better or for worse, the cgroup interface
goes through far less scrutiny than regular interfaces for
unprivileged userland. The upside is that cgroup is able to expose
useful features which may not be suitable for general consumption in a
reasonable time frame. It provides a relatively short path between
internal details and userland-visible interface. Of course, this
shortcut comes with high risk. We go through what we go through for
general kernel APIs for good reasons. It may end up leaking internal
details in a way which can exert significant pain by locking the
kernel into a contract that can't be maintained in a reasonable
manner.
Also, due to the specific nature, cgroup and its controllers don't
tend to attract attention from a wide scope of developers. cgroup's
short history is already fraught with severely mis-designed
interfaces, unnecessary commitments to and exposing of internal
details, broken and dangerous implementations of various features.
Keeping cgroup as an administration interface is both advantageous for
its role and imperative given its nature. Some of the cgroup features
may make sense for unprivileged access. If deemed justified, those
must be further abstracted and implemented as a different interface,
be it a system call or process-private filesystem, and survive through
the scrutiny that any interface for general consumption is required to
go through.
Requiring CAP is not a complete solution but should serve as a
significant deterrent against spraying cgroup usages in non-privileged
programs.