Diff

Shows the differences between two versions of this page.

--- linux:kernel:cgroup:単一階層構造 [2015/12/16 11:15] – tenforward
+++ linux:kernel:cgroup:単一階層構造 [2016/01/22 08:50] (current) – tenforward
@@ Line 1: / Line 1: @@
-====== 単一階層構造 ======
+=i===== 単一階層構造 ======
 Up to -1, in sync with the 4.3 kernel documentation.
@@ Line 26: / Line 26: @@
 -1. Top-down
 -2. No internal tasks
-. Other Changes
+. Delegation
--1. [Un]populated Notification
+-1. Model of delegation
--2. Other Core Changes
+-2. Common ancestor rule
--3. Per-Controller Changes
+. Other Changes
--3-1. blkio
+-1. [Un]populated Notification
--3-2. cpuset
+-2. Other Core Changes
--3-3. memory
+-3. Controller File Conventions
-. Planned Changes
+-3-1. Format
--1. CAP for resource control
+-3-2. Control Knobs
+-4. Per-Controller Changes
+-4-1. io
+-4-2. cpuset
+-4-3. memory
+. Planned Changes
+-1. CAP for resource control
@@ Line 495: / Line 501: @@
 organizational and resource restrictions implied by the hierarchical
 structure above C0 and C1.
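+On the traditional hierarchies, if a task has write access to the
+"tasks" or "cgroup.procs" file of a cgroup and its uid matches the
+target, it can move the target into that cgroup.  In the example above,
+U0 can move processes not only within each sub-hierarchy but also
+across the two sub-hierarchies.  In effect, this lets it violate the
+resource restrictions and structure implied by the hierarchy above C0
+and C1.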
 On the unified hierarchy, let's say U0 wants to write the pid of a
@@ Line 503: / Line 516: @@
 points of delegation and U0 would not have write access to its
 "cgroup.procs" and thus be denied with -EACCES.
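+On the traditional hierarchies, if a task has write access to the
+"tasks" or "cgroup.procs" file of a cgroup and its uid matches the
+target, it can move the target into that cgroup.  In the example above,
+U0 can move processes not only within each sub-hierarchy but also
+across the two sub-hierarchies.  In effect, this lets it violate the
+resource restrictions and structure implied by the hierarchy above C0
+and C1.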
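The unified-hierarchy check described above (a write is denied with -EACCES when the writer lacks write access to "cgroup.procs" of the common ancestor of the source and destination cgroups) can be sketched roughly as follows. This is only an illustrative model, not kernel code: the function names, the `/parent/...` paths, and the `writable_procs` set are all assumptions made for the example.

```python
import posixpath


def common_ancestor(src, dst):
    """Return the deepest cgroup path that contains both src and dst."""
    return posixpath.commonpath([src, dst])


def may_migrate(src, dst, writable_procs):
    """Model of the common ancestor rule.

    writable_procs is a hypothetical set of cgroup paths whose
    "cgroup.procs" file the writer is allowed to write.
    """
    if dst not in writable_procs:
        return False
    # The writer also needs write access to "cgroup.procs" of the
    # common ancestor of the source and destination cgroups.
    return common_ancestor(src, dst) in writable_procs


# Hypothetical layout: C0 and C1 are delegated to U0, /parent is not.
u0_writable = {
    "/parent/C0", "/parent/C0/C00", "/parent/C0/C01",
    "/parent/C1", "/parent/C1/C10",
}
```

With this layout, a move from `/parent/C1/C10` to `/parent/C0/C00` is refused because the common ancestor `/parent` lies above the delegation points, while a move between `/parent/C0/C01` and `/parent/C0/C00` is allowed.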
 . Other Changes
@@ Line 615: / Line 636: @@
 - "cgroup.clone_children" is removed.
----(up to here)---
+-3. Controller File Conventions
--3. Per-Controller Changes
+-3-1. Format
--3-1. blkio
+In general, all controller files should be in one of the following
+formats whenever possible.
-- blk-throttle becomes properly hierarchical.
+Whenever possible, every controller file should be in one of the
+following formats.
-- blk-throttle becomes properly hierarchical.
+- Values only files
+  Values-only files
--3-2. cpuset
+  VAL0 VAL1...\n
+- Flat keyed files
+  Flat keyed files
+  KEY0 VAL0\n
+  KEY1 VAL1\n
+  ...
+- Nested keyed files
+  Nested keyed files
+  KEY0 SUB_KEY0=VAL00 SUB_KEY1=VAL01...
+  KEY1 SUB_KEY0=VAL10 SUB_KEY1=VAL11...
+  ...
+For a writeable file, the format for writing should generally match
+reading; however, controllers may allow omitting later fields or
+implement restricted shortcuts for most common use cases.
+For a writable file, the format for writing should generally match the
+format for reading.  However, a controller may allow later fields to
+be omitted, or may implement restricted shortcuts for the most common
+use cases.
+For both flat and nested keyed files, only the values for a single key
+can be written at a time.  For nested keyed files, the sub key pairs
+may be specified in any order and not all pairs have to be specified.
+For both flat and nested keyed files, only the values for a single key
+can be written at a time.  For nested keyed files, the sub key pairs
+may be specified in any order, and not all pairs have to be
+specified.
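As a rough illustration of the three formats listed above, the following sketch parses each of them into plain Python structures. The function names are invented for this example and values are kept as strings; real controller files may need extra handling.

```python
def parse_values_only(text):
    """Parse a values-only file ("VAL0 VAL1...") into a list of strings."""
    return text.split()


def parse_flat_keyed(text):
    """Parse a flat keyed file ("KEY VAL" per line) into a dict."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, value = line.split(None, 1)
        result[key] = value.strip()
    return result


def parse_nested_keyed(text):
    """Parse a nested keyed file ("KEY SUB_KEY=VAL ..." per line)."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        key, pairs = fields[0], fields[1:]
        # Sub key pairs may come in any order, so store them by name.
        result[key] = dict(pair.split("=", 1) for pair in pairs)
    return result
```

Storing nested sub keys in a dict reflects the rule that sub key pairs may be given in any order and that some may be missing.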
+-3-2. Control Knobs
+- Settings for a single feature should generally be implemented in a
+  single file.
+- In general, the root cgroup should be exempt from resource control
+  and thus shouldn't have resource control knobs.
+- If a controller implements ratio based resource distribution, the
+  control knob should be named "weight" and have the range [1, 10000]
+  and 100 should be the default value.  The values are chosen to allow
+  enough and symmetric bias in both directions while keeping it
+  intuitive (the default is 100%).
+- If a controller implements an absolute resource guarantee and/or
+  limit, the control knobs should be named "min" and "max"
+  respectively.  If a controller implements best effort resource
+  guarantee and/or limit, the control knobs should be named "low" and
+  "high" respectively.
+  In the above four control files, the special token "max" should be
+  used to represent upward infinity for both reading and writing.
+- If a setting has configurable default value and specific overrides,
+  the default settings should be keyed with "default" and appear as
+  the first entry in the file.  Specific entries can use "default" as
+  its value to indicate inheritance of the default value.
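The "weight" range and the special "max" token described above could be handled along these lines. This is a minimal sketch with invented helper names, assuming limits are non-negative integers; it is not taken from any controller implementation. Note that the default of 100 sits a factor of 100 away from both ends of [1, 10000], which is the symmetric bias the text mentions.

```python
import math

WEIGHT_MIN, WEIGHT_MAX, WEIGHT_DEFAULT = 1, 10000, 100


def parse_weight(token):
    """Parse a "weight" value and enforce the range [1, 10000]."""
    weight = int(token)
    if not WEIGHT_MIN <= weight <= WEIGHT_MAX:
        raise ValueError("weight out of range: %r" % token)
    return weight


def parse_limit(token):
    """Parse a min/max/low/high value; "max" means upward infinity."""
    token = token.strip()
    if token == "max":
        return math.inf
    value = int(token)
    if value < 0:
        raise ValueError("limit must not be negative: %r" % token)
    return value


def format_limit(value):
    """Format a limit for writing, using "max" for infinity."""
    return "max" if value == math.inf else str(int(value))
```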
+-4. Per-Controller Changes
+-4-1. io
+- blkio is renamed to io.  The interface is overhauled anyway.  The
+  new name is more in line with the other two major controllers, cpu
+  and memory, and better suited given that it may be used for cgroup
+  writeback without involving block layer.
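+- blkio has been renamed to io.  The interface has been overhauled.
+  The new name is more in line with the other two major controllers,
+  cpu and memory, and is better suited given that the controller may
+  be used for cgroup writeback without involving the block layer.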
+- Everything including stat is always hierarchical making separate
+  recursive stat files pointless and, as no internal node can have
+  tasks, leaf weights are meaningless.  The operation model is
+  simplified and the interface is overhauled accordingly.
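+- Everything, including stat, is always hierarchical, which makes
+  separate recursive stat files pointless.  Since no internal node can
+  have tasks, leaf weights are meaningless.  The operation model is
+  simplified and the interface is overhauled accordingly.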
+  io.stat
+	The stat file.  The reported stats are from the point where
+	bio's are issued to request_queue.  The stats are counted
+	independent of which policies are enabled.  Each line in the
+	file follows the following format.  More fields may later be
+	added at the end.
+	  $MAJ:$MIN rbytes=$RBYTES wbytes=$WBYTES rios=$RIOS wios=$WIOS
+	The stat file.  The reported stats are from the point where
+	bios are issued to request_queue.  The stats are counted
+	independently of which policies are enabled.  Each line in
+	the file follows the format below.  More fields may later be
+	added at the end.
+	  $MAJ:$MIN rbytes=$RBYTES wbytes=$WBYTES rios=$RIOS wios=$WIOS
+  io.weight
+	The weight setting, currently only available and effective if
+	cfq-iosched is in use for the target device.  The weight is
+	between 1 and 10000 and defaults to 100.  The first line
+	always contains the default weight in the following format to
+	use when per-device setting is missing.
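+	The weight setting, currently only available and effective if
+	cfq-iosched is in use for the target device.  The weight is
+	between 1 and 10000 and defaults to 100.  The first line
+	always contains the default weight in the following format,
+	which is used when a per-device setting is missing.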
+	  default $WEIGHT
+	Subsequent lines list per-device weights of the following
+	format.
+	Subsequent lines list per-device weights in the following format.
+	  $MAJ:$MIN $WEIGHT
+	Writing "$WEIGHT" or "default $WEIGHT" changes the default
+	setting.  Writing "$MAJ:$MIN $WEIGHT" sets per-device weight
+	while "$MAJ:$MIN default" clears it.
+	Writing "$WEIGHT" or "default $WEIGHT" changes the default
+	setting.  Writing "$MAJ:$MIN $WEIGHT" sets the per-device
+	weight, while "$MAJ:$MIN default" clears it.
+	This file is available only on non-root cgroups.
+	This file is available only on non-root cgroups.
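The write semantics of io.weight described above can be modelled with a small in-memory sketch. The `state` dictionary layout and the function names are assumptions made for illustration; the code only mimics the documented behaviour and does not touch any real cgroup file.

```python
def _check_weight(token):
    """Validate a weight against the documented range [1, 10000]."""
    weight = int(token)
    if not 1 <= weight <= 10000:
        raise ValueError("weight out of range [1, 10000]: %r" % token)
    return weight


def apply_io_weight_write(state, written):
    """Apply one write to a model of io.weight.

    state is a hypothetical dict: {"default": int, "devices": {dev: int}}.
    """
    fields = written.split()
    if len(fields) == 1:
        # "$WEIGHT" changes the default setting.
        state["default"] = _check_weight(fields[0])
    elif len(fields) == 2 and fields[0] == "default":
        # "default $WEIGHT" also changes the default setting.
        state["default"] = _check_weight(fields[1])
    elif len(fields) == 2 and fields[1] == "default":
        # "$MAJ:$MIN default" clears the per-device weight.
        state["devices"].pop(fields[0], None)
    elif len(fields) == 2:
        # "$MAJ:$MIN $WEIGHT" sets the per-device weight.
        state["devices"][fields[0]] = _check_weight(fields[1])
    else:
        raise ValueError("unrecognized write: %r" % written)
    return state


def render_io_weight(state):
    """Render the model: default line first, then per-device lines."""
    lines = ["default %d" % state["default"]]
    lines += ["%s %d" % (dev, w) for dev, w in sorted(state["devices"].items())]
    return "\n".join(lines) + "\n"
```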
+  io.max
+	The maximum bandwidth and/or iops setting, only available if
+	blk-throttle is enabled.  The file is of the following format.
+	The maximum bandwidth and/or IOPS setting, only available if
+	blk-throttle is enabled.  The file has the following format.
+	  $MAJ:$MIN rbps=$RBPS wbps=$WBPS riops=$RIOPS wiops=$WIOPS
+	${R|W}BPS are read/write bytes per second and ${R|W}IOPS are
+	read/write IOs per second.  "max" indicates no limit.  Writing
+	to the file follows the same format but the individual
+	settings may be omitted or specified in any order.
+	${R|W}BPS are read/write bytes per second and ${R|W}IOPS are
+	read/write IOs per second.  "max" indicates no limit.  Writing
+	to the file follows the same format, but individual settings
+	may be omitted or specified in any order.
+	This file is available only on non-root cgroups.
+	This file is available only on non-root cgroups.
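Because individual settings may be omitted when writing io.max, a line to write can be assembled from only the limits one wants to change. The helper below is a hypothetical sketch: its name, the keyword-argument interface, and the use of `None` to stand for "max" are assumptions for this example.

```python
IO_MAX_KEYS = ("rbps", "wbps", "riops", "wiops")


def format_io_max_write(device, **limits):
    """Build a line for io.max containing only the given settings.

    device is a "$MAJ:$MIN" string.  A limit of None is written as
    "max", meaning no limit.  Settings that are not passed are omitted.
    """
    unknown = set(limits) - set(IO_MAX_KEYS)
    if unknown:
        raise ValueError("unknown io.max keys: %s" % sorted(unknown))
    parts = [device]
    for key in IO_MAX_KEYS:
        if key in limits:
            value = limits[key]
            parts.append("%s=%s" % (key, "max" if value is None else int(value)))
    return " ".join(parts)
```

For example, `format_io_max_write("8:16", wbps=1048576)` produces a line that sets only the write bandwidth limit for device 8:16 and leaves the other settings out of the write.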
+-4-2. cpuset
 - Tasks are kept in empty cpusets after hotplug and take on the masks
   of the nearest non-empty ancestor, instead of being moved to it.
-- After hotplug, tasks in an empty cpuset are kept there and take on
-  the masks of the nearest non-empty ancestor instead of being moved to it.
 - A task can be moved into an empty cpuset, and again it takes on the
   masks of the nearest non-empty ancestor.
-- A task can be moved into an empty cpuset, and in that case too it
-  takes on the masks of the nearest non-empty ancestor.
--3-3. memory
+-4-3. memory
 - use_hierarchy is on by default and the cgroup file for the flag is
   not created.
-- use_hierarchy is on by default, and the cgroup file for the flag is not created.
+- The original lower boundary, the soft limit, is defined as a limit
+  that is per default unset.  As a result, the set of cgroups that
+  global reclaim prefers is opt-in, rather than opt-out.  The costs
+  for optimizing these mostly negative lookups are so high that the
+  implementation, despite its enormous size, does not even provide the
+  basic desirable behavior.  First off, the soft limit has no
+  hierarchical meaning.  All configured groups are organized in a
+  global rbtree and treated like equal peers, regardless where they
+  are located in the hierarchy.  This makes subtree delegation
+  impossible.  Second, the soft limit reclaim pass is so aggressive
+  that it not just introduces high allocation latencies into the
+  system, but also impacts system performance due to overreclaim, to
+  the point where the feature becomes self-defeating.
+  The memory.low boundary on the other hand is a top-down allocated
+  reserve.  A cgroup enjoys reclaim protection when it and all its
+  ancestors are below their low boundaries, which makes delegation of
+  subtrees possible.  Secondly, new cgroups have no reserve per
+  default and in the common case most cgroups are eligible for the
+  preferred reclaim pass.  This allows the new low boundary to be
+  efficiently implemented with just a minor addition to the generic
+  reclaim code, without the need for out-of-band data structures and
+  reclaim passes.  Because the generic reclaim code considers all
+  cgroups except for the ones running low in the preferred first
+  reclaim pass, overreclaim of individual groups is eliminated as
+  well, resulting in much better overall workload performance.
+- The original high boundary, the hard limit, is defined as a strict
+  limit that can not budge, even if the OOM killer has to be called.
+  But this generally goes against the goal of making the most out of
+  the available memory.  The memory consumption of workloads varies
+  during runtime, and that requires users to overcommit.  But doing
+  that with a strict upper limit requires either a fairly accurate
+  prediction of the working set size or adding slack to the limit.
+  Since working set size estimation is hard and error prone, and
+  getting it wrong results in OOM kills, most users tend to err on the
+  side of a looser limit and end up wasting precious resources.
+  The memory.high boundary on the other hand can be set much more
+  conservatively.  When hit, it throttles allocations by forcing them
+  into direct reclaim to work off the excess, but it never invokes the
+  OOM killer.  As a result, a high boundary that is chosen too
+  aggressively will not terminate the processes, but instead it will
+  lead to gradual performance degradation.  The user can monitor this
+  and make corrections until the minimal memory footprint that still
+  gives acceptable performance is found.
+  In extreme cases, with many concurrent allocations and a complete
+  breakdown of reclaim progress within the group, the high boundary
+  can be exceeded.  But even then it's mostly better to satisfy the
+  allocation from the slack available in other groups or the rest of
+  the system than killing the group.  Otherwise, memory.max is there
+  to limit this type of spillover and ultimately contain buggy or even
+  malicious applications.
+- The original control file names are unwieldy and inconsistent in
+  many different ways.  For example, the upper boundary hit count is
+  exported in the memory.failcnt file, but an OOM event count has to
+  be manually counted by listening to memory.oom_control events, and
+  lower boundary / soft limit events have to be counted by first
+  setting a threshold for that value and then counting those events.
+  Also, usage and limit files encode their units in the filename.
+  That makes the filenames very long, even though this is not
+  information that a user needs to be reminded of every time they type
+  out those names.
+  To address these naming issues, as well as to signal clearly that
+  the new interface carries a new configuration model, the naming
+  conventions in it necessarily differ from the old interface.
+- The original limit files indicate the state of an unset limit with a
+  Very High Number, and a configured limit can be unset by echoing -1
+  into those files.  But that very high number is implementation and
+  architecture dependent and not very descriptive.  And while -1 can
+  be understood as an underflow into the highest possible value, -2 or
+  -10M etc. do not work, so it's not consistent.
+  memory.low, memory.high, and memory.max will use the string "max" to
+  indicate and set the highest possible value.
 . Planned Changes