差分

このページの2つのバージョン間の差分を表示します。

--- linux:kernel:cgroup:単一階層構造 [2015/12/14 09:01] – tenforward
+++ linux:kernel:cgroup:単一階層構造 [2016/01/22 08:50] (現在) – tenforward
@@ 行 1: / 行 1: @@
-====== 単一階層構造 ======
+=i===== 単一階層構造 ======
 -1 まで 4.3 カーネルの文書と同期。
@@ 行 26: / 行 26: @@
 -1. Top-down
 -2. No internal tasks
-. Other Changes
+. Delegation
--1. [Un]populated Notification
+-1. Model of delegation
--2. Other Core Changes
+-2. Common ancestor rule
--3. Per-Controller Changes
+. Other Changes
--3-1. blkio
+-1. [Un]populated Notification
--3-2. cpuset
+-2. Other Core Changes
--3-3. memory
+-3. Controller File Conventions
-. Planned Changes
+-3-1. Format
--1. CAP for resource control
+-3-2. Control Knobs
+-4. Per-Controller Changes
+-4-1. io
+-4-2. cpuset
+-4-3. memory
+. Planned Changes
+-1. CAP for resource control
@@ 行 326: / 行 332: @@
 い他の様々なノブが存在する。
-blkio implicitly creates a hidden leaf node for each cgroup to host
+The io controller implicitly creates a hidden leaf node for each
-the tasks.  The hidden leaf has its own copies of all the knobs with
+cgroup to host the tasks.  The hidden leaf has its own copies of all
-"leaf_" prefixed.  While this allows equivalent control over internal
+the knobs with "leaf_" prefixed.  While this allows equivalent control
-tasks, it's with serious drawbacks.  It always adds an extra layer of
+over internal tasks, it's with serious drawbacks.  It always adds an
-nesting which may not be necessary, makes the interface messy and
+extra layer of nesting which may not be necessary, makes the interface
-significantly complicates the implementation.
+messy and significantly complicates the implementation.
-blkio はタスクを扱うために暗黙にそれぞれの cgroup にリーフノードを作成
+io コントローラはタスクを扱うために暗黙にそれぞれの cgroup にリーフノー
-する。この隠れたリーフは自身のコピーとして、頭に "leaf_" と付いた全て
+ドを作成する。この隠れたリーフは自身のコピーとして、頭に "leaf_" と付
-のノブを持つ。これは同等のコントロールが内部タスクにも可能になるが、重
+いた全てのノブを持つ。これは同等のコントロールが内部タスクにも可能にな
-大な欠点も持つ。常に必要ではないかもしれないネストした余分なレイヤーを
+るが、重大な欠点も持つ。常に必要ではないかもしれないネストした余分なレ
-追加し、インターフェースを乱雑にして、実装をかなり複雑にする。
+イヤーを追加し、インターフェースを乱雑にして、実装をかなり複雑にする。
-memory currently doesn't have a way to control what happens between
+The memory controller currently doesn't have a way to control what
-internal tasks and child cgroups and the behavior is not clearly
+happens between internal tasks and child cgroups and the behavior is
-defined.  There have been attempts to add ad-hoc behaviors and knobs
+not clearly defined.  There have been attempts to add ad-hoc behaviors
-to tailor the behavior to specific workloads.  Continuing this
+and knobs to tailor the behavior to specific workloads.  Continuing
-direction will lead to problems which will be extremely difficult to
+this direction will lead to problems which will be extremely difficult
-resolve in the long term.
+to resolve in the long term.
-メモリは現時点では内部タスクと子cgroup間で起こっていることをコントロー
+メモリコントローラは現時点では内部タスクと子cgroup間で起こっていること
-ルする方法はない。そして、振る舞いが明確には定義されていない。特定の作
+をコントロールする方法はない。そして、振る舞いが明確には定義されていな
-業に振る舞いを合わせるためのアドホックな振る舞いとノブが追加されてきた。
+い。特定の作業に振る舞いを合わせるためのアドホックな振る舞いとノブが追
-この方向性を続けることは長期間に渡って解決が非常に難しい問題を引き起こ
+加されてきた。この方向性を続けることは長期間に渡って解決が非常に難しい
-すだろう。
+問題を引き起こすだろう。
 Multiple controllers struggle with internal tasks and came up with
@@ 行 414: / 行 420: @@
 "cgroup.subtree_control" でコントローラを有効にする前に、そのタスクの
 全てを子に移動させなければならない。
-. Other Changes
--1. [Un]populated Notification
+. Delegation
+-1. Model of delegation
+A cgroup can be delegated to a less privileged user by granting write
+access of the directory and its "cgroup.procs" file to the user.  Note
+that the resource control knobs in a given directory concern the
+resources of the parent and thus must not be delegated along with the
+directory.
+cgroup は、そのディレクトリと "cgroup.procs" ファイルへの書き込み権限を
+ユーザに与えることで、より権限の低いユーザに権限委譲できる。あるディレ
+クトリ内のリソースコントロールノブは親のリソースに関係するものなので、
+ディレクトリと一緒に権限委譲してはいけないことに注意が必要である。
+Once delegated, the user can build sub-hierarchy under the directory,
+organize processes as it sees fit and further distribute the resources
+it got from the parent.  The limits and other settings of all resource
+controllers are hierarchical and regardless of what happens in the
+delegated sub-hierarchy, nothing can escape the resource restrictions
+imposed by the parent.
+一度権限委譲すると、そのユーザはディレクトリ以下にサブ階層を構築し、適
+切と考えるようにプロセスを編成し、親から得たリソースをさらに分配できる。
+全てのリソースコントローラの制限と他の設定は階層的であり、委譲されたサ
+ブ階層で何が起ころうとも、いかなるものも親によって課されたリソース制限
+から逃れられない。
+Currently, cgroup doesn't impose any restrictions on the number of
+cgroups in or nesting depth of a delegated sub-hierarchy; however,
+this may in the future be limited explicitly.
+現時点では、cgroup は委譲されたサブ階層内の cgroup の数やネストの深さ
+に何も制限を課していない。しかし、将来は明示的に制限されるかもしれない。
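
As a rough illustration of the delegation model above, the following Python sketch hands over only the cgroup directory and its "cgroup.procs" file, leaving the resource control knobs with the parent. Changing ownership is just one way to grant write access; the helper names, the example uid/gid, and the /sys/fs/cgroup mount point are assumptions made for illustration and do not come from the original text.

```python
import os

def delegation_targets(cgroup_dir):
    """Paths whose ownership is handed to the delegatee.

    Only the directory itself and its "cgroup.procs" file are listed;
    resource control knobs in the directory are deliberately left out
    because they concern the resources of the parent.
    """
    return [cgroup_dir, os.path.join(cgroup_dir, "cgroup.procs")]

def delegate(cgroup_dir, uid, gid):
    """Give uid/gid ownership of the delegation targets (needs privilege)."""
    for path in delegation_targets(cgroup_dir):
        os.chown(path, uid, gid)

# Hypothetical usage, assuming the hierarchy is mounted at /sys/fs/cgroup:
#   delegate("/sys/fs/cgroup/C0", uid=1000, gid=1000)
```
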
+-2. Common ancestor rule
+On the unified hierarchy, to write to a "cgroup.procs" file, in
+addition to the usual write permission to the file and uid match, the
+writer must also have write access to the "cgroup.procs" file of the
+common ancestor of the source and destination cgroups.  This prevents
+delegatees from smuggling processes across disjoint sub-hierarchies.
+単一階層構造では、"cgroup.procs" ファイルに書き込むために、ファイルへ
+の通常の書き込み権と uid マッチに加えて、ライターはソースおよびデスティ
+ネーションの cgroup の共通の祖先の "cgroup.procs" への書き込み権も必要
+です。これにより、権限委譲を受けたユーザが、互いに独立したサブ階層をま
+たいでプロセスをこっそり移動させることを防ぎます。
+Let's say cgroups C0 and C1 have been delegated to user U0 who created
+C00, C01 under C0 and C10 under C1 as follows.
+例えば、cgroup C0 と C1 がユーザ U0 に権限委譲されていて、U0 が以下の
+ように C0 配下に C00 と C01 を、C1 配下に C10 を作ったとしましょう。
+ ~~~~~~~~~~~~~ - C0 - C00
+ ~ cgroup    ~      \ C01
+ ~ hierarchy ~
+ ~~~~~~~~~~~~~ - C1 - C10
+C0 and C1 are separate entities in terms of resource distribution
+regardless of their relative positions in the hierarchy.  The
+resources the processes under C0 are entitled to are controlled by
+C0's ancestors and may be completely different from C1.  It's clear
+that the intention of delegating C0 to U0 is allowing U0 to organize
+the processes under C0 and further control the distribution of C0's
+resources.
+C0 と C1 は、階層内の相対的な位置に関わらず、リソース配分の観点では独
+立したエンティティです。C0 以下のプロセスが得られるリソースは C0 の祖
+先がコントロールしており、C1 とは全く異なるかもしれません。U0 に C0
+を権限委譲する意図が、U0 が C0 以下のプロセスを編成し、さらに C0 のリ
+ソースの分配をコントロールできるようにすることにあるのは明らかです。
+On traditional hierarchies, if a task has write access to "tasks" or
+"cgroup.procs" file of a cgroup and its uid agrees with the target, it
+can move the target to the cgroup.  In the above example, U0 will not
+only be able to move processes in each sub-hierarchy but also across
+the two sub-hierarchies, effectively allowing it to violate the
+organizational and resource restrictions implied by the hierarchical
+structure above C0 and C1.
+これまでの階層構造では、タスクが cgroup の "tasks" もしくは
+"cgroup.procs" ファイルへの書き込み権を持っていて、uid がターゲットに
+一致しているなら、ターゲットをその cgroup に移動できます。先の例でいう
+と、U0 は各サブ階層内でプロセスを移動できるだけでなく、ふたつのサブ階層
+をまたいでもプロセスを移動できます。これは事実上、C0 と C1 より上の階層
+構造が意味する構造上およびリソース上の制限を破れることになります。
+On the unified hierarchy, let's say U0 wants to write the pid of a
+process which has a matching uid and is currently in C10 into
+"C00/cgroup.procs".  U0 obviously has write access to the file and
+migration permission on the process; however, the common ancestor of
+the source cgroup C10 and the destination cgroup C00 is above the
+points of delegation and U0 would not have write access to its
+"cgroup.procs" and thus be denied with -EACCES.
+単一階層構造で、U0 が、uid が一致していて現在 C10 にいるプロセスの pid
+を "C00/cgroup.procs" に書き込みたいとしましょう。U0 は当然このファイル
+への書き込み権と、プロセスを移動する権限を持っています。しかし、ソース
+の cgroup C10 とデスティネーションの cgroup C00 の共通の祖先は権限委譲
+された地点より上にあり、U0 はその "cgroup.procs" への書き込み権を持た
+ないので、書き込みは -EACCES で拒否されます。
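
The common ancestor rule can be modelled in a few lines of Python. This is only a sketch under simplifying assumptions: cgroups are named by their path relative to the hierarchy root, the set `writable_procs` (a hypothetical name) lists the cgroups whose "cgroup.procs" the writer may write to, and the uid match is taken for granted. It is not how the kernel implements the check.

```python
import posixpath

def _norm(path):
    """Normalise a cgroup path so that the hierarchy root is "/"."""
    return "/" + path.strip("/")

def common_ancestor(src, dst):
    """Deepest cgroup that contains both the source and the destination."""
    return posixpath.commonpath([_norm(src), _norm(dst)])

def may_migrate(writable_procs, src, dst):
    """True if a write of a pid from src into dst's cgroup.procs is allowed.

    The writer needs write access to the destination's "cgroup.procs"
    and to the "cgroup.procs" of the common ancestor of src and dst.
    """
    return (_norm(dst) in writable_procs
            and common_ancestor(src, dst) in writable_procs)

# U0 was delegated C0 and C1 and created C00, C01 and C10 below them.
u0 = {"/C0", "/C0/C00", "/C0/C01", "/C1", "/C1/C10"}
print(may_migrate(u0, "C0/C01", "C0/C00"))  # within C0
print(may_migrate(u0, "C1/C10", "C0/C00"))  # across C0 and C1
```

In the second call the common ancestor is the root, which lies above the points of delegation, so the model rejects the move just as the text describes for the -EACCES case.
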
+. Other Changes
+-1. [Un]populated Notification
 cgroup users often need a way to determine when a cgroup's
@@ 行 495: / 行 606: @@
 フェースファイルも存在しない。
 -2. Other Core Changes
 - None of the mount options is allowed.
@@ 行 525: / 行 636: @@
 - "cgroup.clone_children" は消去される。
--3. Per-Controller Changes
+-3. Controller File Conventions
--3-1. blkio
+-3-1. Format
-- blk-throttle becomes properly hierarchical.
+In general, all controller files should be in one of the following
+formats whenever possible.
-- blk-throttle が適切に改造構造となる
+一般的に、全てのコントローラファイルは、可能な限り以下のフォーマットの
+うちのひとつである必要があります。
--3-2. cpuset
+- Values only files
+  値のみのファイル
+  VAL0 VAL1...\n
+- Flat keyed files
+  フラットなキーのファイル
+  KEY0 VAL0\n
+  KEY1 VAL1\n
+  ...
+- Nested keyed files
+  ネストしたキーのファイル
+  KEY0 SUB_KEY0=VAL00 SUB_KEY1=VAL01...
+  KEY1 SUB_KEY0=VAL10 SUB_KEY1=VAL11...
+  ...
+For a writeable file, the format for writing should generally match
+reading; however, controllers may allow omitting later fields or
+implement restricted shortcuts for most common use cases.
+書き込み可能なファイルの場合、書き込みのフォーマットは通常は読み取りの
+フォーマットと一致している必要があります。ただし、コントローラは後ろの
+フィールドの省略を許可したり、最も一般的なユースケース向けに限定的な
+ショートカットを実装したりしても構いません。
+For both flat and nested keyed files, only the values for a single key
+can be written at a time.  For nested keyed files, the sub key pairs
+may be specified in any order and not all pairs have to be specified.
+フラットなキーのファイルとネストしたキーのファイルはどちらも、一度に書
+き込めるのは単一のキーに対する値のみです。ネストしたキーのファイルでは、
+サブキーのペアは任意の順番で指定でき、全てのペアを指定する必要もありま
+せん。
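
The three file formats above are simple enough to parse by hand. The functions below are a minimal sketch, not an official parser: the function names are made up for illustration, all values are kept as strings, and malformed lines are not handled.

```python
def parse_values(text):
    """Values only file: "VAL0 VAL1...\\n" -> list of values."""
    return text.split()

def parse_flat_keyed(text):
    """Flat keyed file: one "KEY VAL" pair per line -> dict."""
    result = {}
    for line in text.splitlines():
        if line.strip():
            key, value = line.split(None, 1)
            result[key] = value.strip()
    return result

def parse_nested_keyed(text):
    """Nested keyed file: "KEY SUB_KEY0=VAL0 SUB_KEY1=VAL1..." per line."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # The sub key pairs may appear in any order, so keep them in a dict.
        result[fields[0]] = dict(pair.split("=", 1) for pair in fields[1:])
    return result

print(parse_values("1 2 3\n"))
print(parse_flat_keyed("KEY0 VAL0\nKEY1 VAL1\n"))
print(parse_nested_keyed("KEY0 SUB_KEY0=VAL00 SUB_KEY1=VAL01\n"))
```
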
+-3-2. Control Knobs
+- Settings for a single feature should generally be implemented in a
+  single file.
+- In general, the root cgroup should be exempt from resource control
+  and thus shouldn't have resource control knobs.
+- If a controller implements ratio based resource distribution, the
+  control knob should be named "weight" and have the range [1, 10000]
+  and 100 should be the default value.  The values are chosen to allow
+  enough and symmetric bias in both directions while keeping it
+  intuitive (the default is 100%).
+- If a controller implements an absolute resource guarantee and/or
+  limit, the control knobs should be named "min" and "max"
+  respectively.  If a controller implements best effort resource
+  guarantee and/or limit, the control knobs should be named "low" and
+  "high" respectively.
+  In the above four control files, the special token "max" should be
+  used to represent upward infinity for both reading and writing.
+- If a setting has configurable default value and specific overrides,
+  the default settings should be keyed with "default" and appear as
+  the first entry in the file.  Specific entries can use "default" as
+  their value to indicate inheritance of the default value.
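
Two of the conventions above lend themselves to a short sketch: the "weight" range [1, 10000] with a default of 100, and the special token "max" standing for upward infinity. The helper names are hypothetical, and representing "max" as `math.inf` is a choice made here for illustration, not something the original text prescribes.

```python
import math

WEIGHT_MIN, WEIGHT_DEFAULT, WEIGHT_MAX = 1, 100, 10000

def check_weight(weight):
    """Return weight if it lies in the conventional range, else raise."""
    if not WEIGHT_MIN <= weight <= WEIGHT_MAX:
        raise ValueError(f"weight {weight} outside [{WEIGHT_MIN}, {WEIGHT_MAX}]")
    return weight

def parse_limit(token):
    """Read a "min"/"max"/"low"/"high" style value; "max" means no limit."""
    token = token.strip()
    return math.inf if token == "max" else int(token)

def format_limit(value):
    """Inverse of parse_limit: infinity is written back as "max"."""
    return "max" if value == math.inf else str(int(value))

# The default sits a factor of 100 away from either end of the range,
# which is the symmetric bias the convention refers to.
print(WEIGHT_DEFAULT / WEIGHT_MIN, WEIGHT_MAX / WEIGHT_DEFAULT)
print(format_limit(parse_limit("max")), format_limit(parse_limit("4096")))
```
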
+-4. Per-Controller Changes
+-4-1. io
+- blkio is renamed to io.  The interface is overhauled anyway.  The
+  new name is more in line with the other two major controllers, cpu
+  and memory, and better suited given that it may be used for cgroup
+  writeback without involving block layer.
+- blkio は io にリネームされました。いずれにせよインターフェースは全面
+  的に見直されています。新しい名前は他のふたつの主要なコントローラであ
+  る cpu と memory の名前とより一貫しており、ブロックレイヤーを介さずに
+  cgroup writeback に使われる可能性があることを考えると、より適切です。
+- Everything including stat is always hierarchical making separate
+  recursive stat files pointless and, as no internal node can have
+  tasks, leaf weights are meaningless.  The operation model is
+  simplified and the interface is overhauled accordingly.
+- stat を含むすべてが常に階層的なので、別途再帰的な stat ファイルを用
+  意する意味はなくなります。また、内部ノードはタスクを持てないので、リー
+  フのウェイトは無意味です。操作モデルは簡素化されており、インターフェー
+  スもそれに応じて見直されています。
+  io.stat
+	The stat file.  The reported stats are from the point where
+	bio's are issued to request_queue.  The stats are counted
+	independent of which policies are enabled.  Each line in the
+	file follows the following format.  More fields may later be
+	added at the end.
+	  $MAJ:$MIN rbytes=$RBYTES wbytes=$WBYTES rios=$RIOS wios=$WIOS
+	統計 (stat) ファイルです。報告される統計は bio が
+	request_queue に対して発行された時点のものです。統計はどのポリ
+	シーが有効になっているかとは独立してカウントされます。ファイル
+	内のそれぞれの行は以下のフォーマットに従います。将来、行末に
+	フィールドが追加される可能性があります。
+	  $MAJ:$MIN rbytes=$RBYTES wbytes=$WBYTES rios=$RIOS wios=$WIOS
+  io.weight
+	The weight setting, currently only available and effective if
+	cfq-iosched is in use for the target device.  The weight is
+	between 1 and 10000 and defaults to 100.  The first line
+	always contains the default weight in the following format to
+	use when per-device setting is missing.
+	ウェイトの設定です。現時点では cfq-iosched がターゲットのデバ
+	イスで使われている場合のみ利用可能で、効果があります。ウェイト
+	は 1 から 10000 の間で、デフォルトは 100 です。最初の行は常に、
+	デバイスごとの設定がない場合に使われるデフォルトのウェイトで、
+	以下のフォーマットです。
+	  default $WEIGHT
+	Subsequent lines list per-device weights of the following
+	format.
+	以降の行は、以下のフォーマットでデバイスごとのウェイトを列挙します。
+	  $MAJ:$MIN $WEIGHT
+	Writing "$WEIGHT" or "default $WEIGHT" changes the default
+	setting.  Writing "$MAJ:$MIN $WEIGHT" sets per-device weight
+	while "$MAJ:$MIN default" clears it.
+	"$WEIGHT" または "default $WEIGHT" を書きこむと、デフォルトの設
+	定が変更されます。"$MAJ:$MIN $WEIGHT" を書きこむとデバイスごと
+	のウェイトが設定され、"$MAJ:$MIN default" を書きこむとクリアさ
+	れます。
+	This file is available only on non-root cgroups.
+	このファイルはルート以外の cgroup でのみ使えます。
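
The four write forms accepted by io.weight can be summarised with a small state model. This is a sketch of the semantics described above, not kernel code: the state is a plain dict keyed by "default" or "$MAJ:$MIN", the function names are hypothetical, and the device number 8:0 is only an example.

```python
def _weight(token):
    """Parse a weight and enforce the documented range 1..10000."""
    weight = int(token)
    if not 1 <= weight <= 10000:
        raise ValueError(f"weight {weight} out of range")
    return weight

def apply_io_weight_write(state, line):
    """Return the state after writing one line to the modelled io.weight."""
    new_state = dict(state)
    fields = line.split()
    if len(fields) == 1:                                # "$WEIGHT"
        new_state["default"] = _weight(fields[0])
    elif len(fields) == 2 and fields[0] == "default":   # "default $WEIGHT"
        new_state["default"] = _weight(fields[1])
    elif len(fields) == 2 and fields[1] == "default":   # "$MAJ:$MIN default"
        new_state.pop(fields[0], None)
    elif len(fields) == 2:                              # "$MAJ:$MIN $WEIGHT"
        new_state[fields[0]] = _weight(fields[1])
    else:
        raise ValueError(f"unrecognised write: {line!r}")
    return new_state

def effective_weight(state, device):
    """Per-device weight if set, otherwise the default weight."""
    return state.get(device, state["default"])

state = {"default": 100}
state = apply_io_weight_write(state, "8:0 500")      # per-device override
state = apply_io_weight_write(state, "200")          # new default
print(effective_weight(state, "8:0"), effective_weight(state, "8:16"))
state = apply_io_weight_write(state, "8:0 default")  # clear the override
print(effective_weight(state, "8:0"))
```
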
+  io.max
+	The maximum bandwidth and/or iops setting, only available if
+	blk-throttle is enabled.  The file is of the following format.
+	帯域幅や IOPS の最大値の設定です。blk-throttle が有効な場合の
+	み使えます。ファイルは以下のフォーマットになります。
+	  $MAJ:$MIN rbps=$RBPS wbps=$WBPS riops=$RIOPS wiops=$WIOPS
+	${R|W}BPS are read/write bytes per second and ${R|W}IOPS are
+	read/write IOs per second.  "max" indicates no limit.  Writing
+	to the file follows the same format but the individual
+	settings may be omitted or specified in any order.
+	${R|W}BPS は秒あたりの読みこみ／書きこみのバイト数で、
+	${R|W}IOPS は秒あたりの読みこみ／書きこみの IO 数です。"max"
+	は制限なしを示します。ファイルへの書きこみは同じフォーマットに
+	従いますが、個別の設定は省略したり、任意の順番で指定できます。
+	This file is available only on non-root cgroups.
+	このファイルはルート cgroup 以外でのみ利用できます。
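
Because individual settings may be omitted or given in any order, a line for io.max can be built from just the limits one wants to change. The helper below is a hypothetical sketch: the function name is made up, `None` is used here to mean "no limit", and 8:0 is only an example device number.

```python
IO_MAX_KEYS = ("rbps", "wbps", "riops", "wiops")

def format_io_max(device, **limits):
    """Build one line for io.max from the given limits.

    Keys that are not passed are simply left out of the line; a value
    of None is written as the token "max".
    """
    unknown = set(limits) - set(IO_MAX_KEYS)
    if unknown:
        raise ValueError(f"unknown io.max keys: {sorted(unknown)}")
    pairs = [f"{key}={'max' if value is None else value}"
             for key, value in limits.items()]
    return " ".join([device] + pairs)

# Limit writes to 1 MiB/s and lift the read IOPS limit on device 8:0.
print(format_io_max("8:0", wbps=1048576, riops=None))
```
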
+-4-2. cpuset
 - Tasks are kept in empty cpusets after hotplug and take on the masks
   of the nearest non-empty ancestor, instead of being moved to it.
-- hotplug 後の空の cpuset 内のタスクは保持され、最も近い祖先に移動する
-  代わりに、最も近い祖先のマスクを引き受ける。
 - A task can be moved into an empty cpuset, and again it takes on the
   masks of the nearest non-empty ancestor.
-- タスクを空の cpuset に移動することは可能である。そしてこの場合も最も
-  近い空でない祖先のマスクを引き受ける。
--3-3. memory
+-4-3. memory
 - use_hierarchy is on by default and the cgroup file for the flag is
   not created.
-- use_hierarchy はデフォルトでオンになる。このフラグ用の cgroup ファイルは生成されない。
+- The original lower boundary, the soft limit, is defined as a limit
+  that is per default unset.  As a result, the set of cgroups that
+  global reclaim prefers is opt-in, rather than opt-out.  The costs
+  for optimizing these mostly negative lookups are so high that the
+  implementation, despite its enormous size, does not even provide the
+  basic desirable behavior.  First off, the soft limit has no
+  hierarchical meaning.  All configured groups are organized in a
+  global rbtree and treated like equal peers, regardless where they
+  are located in the hierarchy.  This makes subtree delegation
+  impossible.  Second, the soft limit reclaim pass is so aggressive
+  that it not just introduces high allocation latencies into the
+  system, but also impacts system performance due to overreclaim, to
+  the point where the feature becomes self-defeating.
+  The memory.low boundary on the other hand is a top-down allocated
+  reserve.  A cgroup enjoys reclaim protection when it and all its
+  ancestors are below their low boundaries, which makes delegation of
+  subtrees possible.  Secondly, new cgroups have no reserve per
+  default and in the common case most cgroups are eligible for the
+  preferred reclaim pass.  This allows the new low boundary to be
+  efficiently implemented with just a minor addition to the generic
+  reclaim code, without the need for out-of-band data structures and
+  reclaim passes.  Because the generic reclaim code considers all
+  cgroups except for the ones running low in the preferred first
+  reclaim pass, overreclaim of individual groups is eliminated as
+  well, resulting in much better overall workload performance.
+- The original high boundary, the hard limit, is defined as a strict
+  limit that can not budge, even if the OOM killer has to be called.
+  But this generally goes against the goal of making the most out of
+  the available memory.  The memory consumption of workloads varies
+  during runtime, and that requires users to overcommit.  But doing
+  that with a strict upper limit requires either a fairly accurate
+  prediction of the working set size or adding slack to the limit.
+  Since working set size estimation is hard and error prone, and
+  getting it wrong results in OOM kills, most users tend to err on the
+  side of a looser limit and end up wasting precious resources.
+  The memory.high boundary on the other hand can be set much more
+  conservatively.  When hit, it throttles allocations by forcing them
+  into direct reclaim to work off the excess, but it never invokes the
+  OOM killer.  As a result, a high boundary that is chosen too
+  aggressively will not terminate the processes, but instead it will
+  lead to gradual performance degradation.  The user can monitor this
+  and make corrections until the minimal memory footprint that still
+  gives acceptable performance is found.
+  In extreme cases, with many concurrent allocations and a complete
+  breakdown of reclaim progress within the group, the high boundary
+  can be exceeded.  But even then it's mostly better to satisfy the
+  allocation from the slack available in other groups or the rest of
+  the system than killing the group.  Otherwise, memory.max is there
+  to limit this type of spillover and ultimately contain buggy or even
+  malicious applications.
+- The original control file names are unwieldy and inconsistent in
+  many different ways.  For example, the upper boundary hit count is
+  exported in the memory.failcnt file, but an OOM event count has to
+  be manually counted by listening to memory.oom_control events, and
+  lower boundary / soft limit events have to be counted by first
+  setting a threshold for that value and then counting those events.
+  Also, usage and limit files encode their units in the filename.
+  That makes the filenames very long, even though this is not
+  information that a user needs to be reminded of every time they type
+  out those names.
+  To address these naming issues, as well as to signal clearly that
+  the new interface carries a new configuration model, the naming
+  conventions in it necessarily differ from the old interface.
+- The original limit files indicate the state of an unset limit with a
+  Very High Number, and a configured limit can be unset by echoing -1
+  into those files.  But that very high number is implementation and
+  architecture dependent and not very descriptive.  And while -1 can
+  be understood as an underflow into the highest possible value, -2 or
+  -10M etc. do not work, so it's not consistent.
+  memory.low, memory.high, and memory.max will use the string "max" to
+  indicate and set the highest possible value.
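
The memory.low rule described above, that a cgroup enjoys reclaim protection only while it and all of its ancestors are below their low boundaries, can be sketched as a walk up the tree. The `Cgroup` class and its fields are hypothetical modelling devices; the root cgroup is left out of the model on the assumption that it has no resource control knobs, and the real reclaim code is considerably more involved.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Cgroup:
    usage: int                          # current memory usage in bytes
    low: int = 0                        # new cgroups have no reserve by default
    parent: Optional["Cgroup"] = None   # None means a child of the root

def is_protected(cgroup):
    """True if cgroup and every modelled ancestor are below their low boundary."""
    node = cgroup
    while node is not None:
        if node.usage >= node.low:
            return False
        node = node.parent
    return True

parent = Cgroup(usage=50, low=100)
child = Cgroup(usage=10, low=20, parent=parent)
print(is_protected(child))   # both are below their boundaries
parent.usage = 150
print(is_protected(child))   # the parent is now above its boundary
```
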
 . Planned Changes