linux:kernel:cgroup:cgroup_management_daemon [TenForward]

Hi,

As I've mentioned several times, I want to write a standalone cgroup management daemon. Basic requirements are that it be a standalone program; that a single instance running on the host be usable from containers nested at any depth; that it not allow escaping one's assigned limits; that it not allow subjugating tasks which do not belong to you; and that, within your limits, you be able to parcel those limits to your tasks as you like.

Additionally, Tejun has specified that we do not want users to be too closely tied to the cgroupfs implementation. Therefore commands will be just a hair more general than specifying cgroupfs filenames and values. I may go so far as to avoid specifying specific controllers, as AFAIK there should be no redundancy in features. On the other hand, I don't want to get too general. So I'm basing the API loosely on the lmctfy command line API.

One of the driving goals is to enable nested lxc as simply and safely as possible. If this project is a success, then a large chunk of code can be removed from lxc. I'm considering this project a part of the larger lxc project, but given how central it is to systems management, that doesn't mean that I'll consider anyone else's needs as less important than our own.

This document consists of two parts. The first describes how I intend the daemon (cgmanager) to be structured and how it will enforce the safety requirements. The second describes the commands which clients will be able to send to the manager. The list of controller keys which can be set is very incomplete at this point, serving mainly to show the approach I was thinking of taking.

Summary

Each 'host' (identified by a separate instance of the Linux kernel) will have exactly one running daemon to manage control groups. This daemon will answer cgroup management requests over a dbus socket, located at /sys/fs/cgroup/manager. This socket can be bind-mounted into various containers, so that one daemon can support the whole system.

Programs will be able to make cgroup requests using dbus calls, or indirectly by linking against lmctfy, which will be modified to use the dbus calls if available.

Outline:

. A single manager, cgmanager, is started on the host, very early
  during boot.  It has very few dependencies, and requires only
  /proc, /run, and /sys to be mounted, with /etc ro.  It will mount
  the cgroup hierarchies in a private namespace and set defaults
  (clone_children, use_hierarchy, sane_behavior, release_agent?).  It
  will open a socket at /sys/fs/cgroup/cgmanager (in a small tmpfs).
  
. A client (requestor 'r') can make cgroup requests over
  /sys/fs/cgroup/manager using dbus calls.  Detailed privilege
  requirements for r are listed below.
. The client request will pertain to an existing or new cgroup A.  r's
  privilege over the cgroup must be checked.  r is said to have
  privilege over A if A is owned by r's uid, or if A's owner is mapped
  into r's user namespace, and r is root in that user namespace.
. The client request may pertain to a victim task v, which may be moved
  to a new cgroup.  In that case r's privilege over both the cgroup
  and v must be checked.  r is said to have privilege over v if v
  is mapped in r's pid namespace, v's uid is mapped into r's user ns,
  and r is root in its userns.  Or if r and v have the same uid
  and v is mapped in r's pid namespace.
. r's credentials will be taken from the socket's peercred, ensuring that
  pid and uid are translated.
. r passes PID(v) as a SCM_CREDENTIALS message, so that cgmanager receives the
  translated global pid.  It will then read UID(v) from /proc/PID(v)/status,
  which is the global uid, and check /proc/PID(r)/uid_map to see whether
  UID(v) is mapped there.
. dbus-send can be enhanced to send a pid as SCM_CREDENTIALS to have
  the kernel translate it for the reader.  Only 'move task v to cgroup
  A' will require a SCM_CREDENTIALS message to be sent.
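
The two credential mechanisms in the outline can be tried out on a plain unix socket. The following is a minimal Python sketch, not cgmanager code: the helper names (`peer_credentials`, `send_pid`, `recv_pid`) are made up for illustration, and it talks over a raw socket rather than dbus. It relies on the Linux `SO_PEERCRED`, `SO_PASSCRED` and `SCM_CREDENTIALS` facilities described in unix(7). Note that an unprivileged sender can only pass its own pid; passing another task's pid needs CAP_SYS_ADMIN.

```python
import os
import socket
import struct

# struct ucred from <sys/socket.h>: pid_t pid; uid_t uid; gid_t gid
UCRED = struct.Struct("iII")


def peer_credentials(sock):
    """Return (pid, uid, gid) of the peer, as recorded by the kernel (SO_PEERCRED)."""
    raw = sock.getsockopt(socket.SOL_SOCKET, socket.SO_PEERCRED, UCRED.size)
    return UCRED.unpack(raw)


def send_pid(sock, pid, payload=b"move"):
    """Send payload with a ucred carrying pid as SCM_CREDENTIALS ancillary data."""
    cred = UCRED.pack(pid, os.getuid(), os.getgid())
    sock.sendmsg([payload], [(socket.SOL_SOCKET, socket.SCM_CREDENTIALS, cred)])


def recv_pid(sock):
    """Receive one message; return (payload, (pid, uid, gid)) as seen by the receiver.

    The receiving socket must already have SO_PASSCRED enabled.
    """
    payload, ancdata, _flags, _addr = sock.recvmsg(4096, socket.CMSG_SPACE(UCRED.size))
    for level, kind, data in ancdata:
        if level == socket.SOL_SOCKET and kind == socket.SCM_CREDENTIALS:
            return payload, UCRED.unpack(data[:UCRED.size])
    raise ValueError("no SCM_CREDENTIALS attached to the message")


if __name__ == "__main__":
    client, daemon = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    daemon.setsockopt(socket.SOL_SOCKET, socket.SO_PASSCRED, 1)
    send_pid(client, os.getpid())
    print(peer_credentials(daemon), recv_pid(daemon))
```

When sender and receiver live in different pid or user namespaces, the values the receiver unpacks are the ones valid in the receiver's namespaces, which is the translation the outline depends on.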

Privilege requirements by action:

  * Requestor of an action (r) over a socket may only make
    changes to cgroups over which it has privilege.
  * Requestors may be limited to a certain #/depth of cgroups
    (to limit memory usage) - DEFER?
  * Cgroup hierarchy is responsible for resource limits
  * A requestor must either be uid 0 in its userns with victim mapped
    into its userns, or the same uid and in same/ancestor pidns as the
    victim
  * If r requests creation of cgroup '/x', /x will be interpreted
    as relative to r's cgroup.  r cannot make changes to cgroups not
    under its own current cgroup.
  * If r is not in the initial user_ns, then it may not change settings
    in its own cgroup, only descendants.  (Not strictly necessary -
    we could require the use of extra cgroups when wanted, as lxc does
    currently)
  * If r requests creation of cgroup '/x', it must have write access
    to its own cgroup  (not strictly necessary)
  * If r requests chown of cgroup /x to uid Y, Y is passed in a
    ucred over the unix socket, and therefore translated to init
    userns.
  * If r requests setting a limit under /x, then
    . either r must be root in its own userns, and UID(/x) be mapped
      into its userns, or else UID(r) == UID(/x)
    . /x must not be / (not strictly necessary, all users know to
      ensure an extra cgroup layer above '/')
    . setns(UIDNS(r)) would not work, due to in-kernel capable() checks
      which won't be satisfied.  Therefore we'll need to do privilege
      checks ourselves, then perform the write as the host root user.
      (see devices.allow/deny).  Further we need to support older kernels
      which don't support setns for pid.
  * If r requests action on victim V, it passes V's pid in a ucred,
    so that gets translated.
    Daemon will verify that V's uid is mapped into r's userns.  Since
    r is either root or the same uid as V, it is allowed to classify.
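
Two of the checks above are mechanical enough to sketch: looking up a host uid in a requestor's uid_map, and interpreting a requested path relative to the requestor's own cgroup. The Python below is an illustration under stated assumptions, not the daemon's implementation: the function names are hypothetical, the uid_map text is assumed to have been read from /proc/PID(r)/uid_map by a process in the initial user namespace (so the second column holds host uids), and the pid namespace check is left out because it is assumed to be covered by the kernel translating the pid in the ucred.

```python
import posixpath


def map_host_uid(uid_map_text, host_uid):
    """Return the uid that host_uid has inside the namespace, or None if unmapped.

    Each uid_map line is "<first id in ns> <first id on host> <count>".
    """
    for line in uid_map_text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        ns_first, host_first, count = (int(f) for f in fields)
        if host_first <= host_uid < host_first + count:
            return ns_first + (host_uid - host_first)
    return None


def has_privilege_over_victim(r_uid, v_uid, r_uid_map_text):
    """r may act on v if both have the same uid, or if r is root in its
    userns and v's uid is mapped into that userns."""
    if r_uid == v_uid:
        return True
    return (map_host_uid(r_uid_map_text, r_uid) == 0
            and map_host_uid(r_uid_map_text, v_uid) is not None)


def resolve_cgroup(requestor_cgroup, requested):
    """Interpret requested (e.g. '/x') relative to r's cgroup; reject escapes."""
    base = posixpath.normpath("/" + requestor_cgroup.strip("/"))
    full = posixpath.normpath(posixpath.join(base, requested.lstrip("/")))
    if full != base and not full.startswith(base.rstrip("/") + "/"):
        raise PermissionError("cgroup path escapes the requestor's cgroup")
    return full
```

For example, with a uid_map of `0 100000 65536`, host uid 100000 is root in the namespace and host uid 101000 is uid 1000 there, while host uid 1000 is not mapped at all. A request for `/x` from a requestor in `/lxc/c1` resolves to `/lxc/c1/x`, and `/../c2` is refused.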

The above addresses

  * creating cgroups
  * chowning cgroups
  * setting cgroup limits
  * moving tasks into cgroups

but does not address a 'cgexec <group> -- command' type of behavior.

To handle that (specifically for upstart), recommend that r do:

    pid = fork();
    if (!pid) {
      /* child: ask the daemon to move us into the cgroup, then exec */
      request_reclassify(cgroup, getpid());
      do_execve();
    }
. Alternatively, the daemon could, if the kernel is new enough, setns to
  the requestor's namespaces to execute a command in a new cgroup.
  The new command would be daemonized to that pid namespace's pid 1.
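
The fork, reclassify, exec pattern recommended for the cgexec case can be written out as follows. This is a sketch in Python for illustration only: `run_in_cgroup` is a made-up name, and `request_reclassify` stands for whatever client call ends up asking the daemon to move a pid, which this document does not define, so it is passed in as a parameter.

```python
import os
import sys


def run_in_cgroup(cgroup, argv, request_reclassify):
    """Fork; the child asks to be moved into cgroup, then execs argv.

    request_reclassify(cgroup, pid) is a hypothetical client call to the
    daemon. Returns the child's exit status (negative signal number if the
    child was killed by a signal).
    """
    pid = os.fork()
    if pid == 0:
        try:
            # The child sends its own pid, so no extra privilege is needed.
            request_reclassify(cgroup, os.getpid())
            os.execvp(argv[0], argv)
        finally:
            # Only reached if the reclassify request or the exec failed.
            os._exit(127)
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)


if __name__ == "__main__":
    # Stand-in for the real request: just report what would be asked.
    def fake_reclassify(cgroup, pid):
        print(f"would move pid {pid} into {cgroup}")

    sys.exit(run_in_cgroup("/x", ["true"], fake_reclassify))
```

Because the child moves itself before exec, the command never runs outside the target cgroup, and the parent does not need privilege over any other task.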

Types of requests:

* r requests creating cgroup A'/A
  . lmctfy/cli/commands/create.cc
  . Verify that UID(r) mapped to 0 in r's userns
  . R=cgroup_of(r)
  . Verify that UID(R) is mapped into r's userns
  . Create R/A'/A
  . chown R/A'/A to UID(r)
* r requests to move task x to cgroup A.
  . lmctfy/cli/commands/enter.cc
  . r must send PID(x) as ancillary message
  . Verify that UID(r) mapped to 0 in r's userns, and UID(x) is mapped into
    that userns
    (is it safe to allow if UID(x) == UID(r))?
  . R=cgroup_of(r)
  . Verify that R/A is owned by UID(r) or UID(x)?  (not sure that's needed)
  . echo PID(x) >> /R/A/tasks
* r requests chown of cgroup A to uid X
  . X is passed in ancillary message
    * ensures it is valid in r's userns
    * maps the userid to host for us
  . Verify that UID(r) mapped to 0 in r's userns
  . R=cgroup_of(r)
  . Chown R/A to X
* r requests setting cgroup A's 'property=value'
  . Verify that either
    * A != ''
    * UID(r) == 0 on host
    In other words, r in a userns may not set root cgroup settings.
  . Verify that UID(r) mapped to 0 in r's userns
  . R=cgroup_of(r)
  . Set property=value for R/A
    * Expect kernel to guarantee hierarchical constraints
* r requests deletion of cgroup A
  . lmctfy/cli/commands/destroy.cc (without -f)
  . same requirements as setting 'property=value'
* r requests purge of cgroup A
  . lmctfy/cli/commands/destroy.cc (with -f)
  . same requirements as setting 'property=value'
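
Once the verification steps have passed, the final filesystem steps of the create and move requests are small. The Python sketch below shows only those last steps, under the assumptions that the privilege checks have already been done, that the requested name has already been confined to the requestor's cgroup, and that `controller_mount` is the directory where one hierarchy is mounted. The function names are hypothetical.

```python
import os


def create_cgroup(controller_mount, requestor_cgroup, name, owner_uid):
    """Create R/<name> (including intermediate levels such as A') and chown
    the new leaf to owner_uid, which is the host uid of the requestor."""
    parent = os.path.join(controller_mount, requestor_cgroup.strip("/"))
    target = os.path.join(parent, name.strip("/"))
    os.makedirs(target, exist_ok=True)
    os.chown(target, owner_uid, -1)  # -1 leaves the group unchanged
    return target


def move_task(cgroup_dir, pid):
    """Equivalent of 'echo PID(x) >> /R/A/tasks'. pid must be the host pid."""
    with open(os.path.join(cgroup_dir, "tasks"), "a") as tasks:
        tasks.write(f"{pid}\n")
```

The daemon performs these operations as host root, which is why every check listed above has to happen before either function is called.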

Long-term we will want the cgroup manager to become more intelligent - to place its own limits on clients, to address cpu and device hotplug, etc. Since we will not be doing that in the first prototype, the daemon will not keep any state about the clients.

Client DBus Message API

<name>: a-zA-Z0-9
<name>: "a-zA-Z0-9 "
<controllerlist>: <controller1>[:controllerlist]
<valueentry>: key:value
<valueentry>: frozen
<valueentry>: thawed
<values>: valueentry[:values]

keys:

{memory,swap}.{limit,soft_limit}
cpus_allowed  # set of allowed cpus
cpus_fraction # % of allowed cpus
cpus_number   # number of allowed cpus
cpu_share_percent   # percent of cpushare
devices_whitelist
devices_blacklist
net_prio_index
net_prio_interface_map
net_classid
hugetlb_limit
blkio_weight
blkio_weight_device
blkio_throttle_{read,write}

readkeys:

devices_list
{memory,swap}.{failcnt,max_use,limitnuma_stat}
hugetlb_max_usage
hugetlb_usage
hugetlb_failcnt
cpuacct_stat
<etc>

Commands:

ListControllers
Create <name> <controllerlist> <values>
Setvalue <name> <values>
Getvalue <name> <readkeys>
ListChildren <name>
ListTasks <name>
ListControllers <name>
Chown <name> <uid>
Chown <name> <uid>:<gid>
Move <pid> <name>  [[ pid is sent as a SCM_CREDENTIALS message ]]
Delete <name>
Delete-force <name>
Kill <name>
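
To make the `<name>` and `<values>` grammar concrete, here is a small Python sketch of how a client or the daemon might validate a name and split a values string. It is one reading of the grammar, not a definitive parser: the function names are made up, and it assumes that values never contain a colon, which the grammar does not say (a device entry such as a major:minor pair would need some escaping rule).

```python
import re

# <name> is either bare alphanumerics or a quoted string that may hold spaces.
NAME_RE = re.compile(r'^(?:[a-zA-Z0-9]+|"[a-zA-Z0-9 ]+")$')

# <valueentry> forms that stand alone instead of being key:value.
FLAGS = {"frozen", "thawed"}


def valid_name(name):
    """Check a cgroup name against the <name> productions."""
    return NAME_RE.match(name) is not None


def parse_values(text):
    """Split '<values>' into a list of (key, value) pairs.

    'frozen' and 'thawed' are returned as (flag, None). Raises ValueError
    when a key has no value.
    """
    tokens = text.split(":")
    entries = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token in FLAGS:
            entries.append((token, None))
            i += 1
        elif token and i + 1 < len(tokens):
            entries.append((token, tokens[i + 1]))
            i += 2
        else:
            raise ValueError(f"missing value for key {token!r}")
    return entries
```

For instance, `parse_values("memory.limit:1G:frozen")` yields the pair `("memory.limit", "1G")` followed by `("frozen", None)`, which could then be carried by a Setvalue or Create command.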