Diff
Shows the differences between two versions of this page.
linux:kernel:cgroup:単一階層構造 [2015/12/14 09:26] – tenforward | linux:kernel:cgroup:単一階層構造 [2016/01/22 08:50] (current) – tenforward | ||
---|---|---|---|
Line 1: | Line 1: | ||
- | ====== Unified Hierarchy (単一階層構造) ====== | + | =i===== Unified Hierarchy (単一階層構造) ====== |
Synced with the 4.3 kernel document up to section 2-1. | Synced with the 4.3 kernel document up to section 2-1. | ||
Line 26: | Line 26: | ||
3-1. Top-down | 3-1. Top-down | ||
3-2. No internal tasks | 3-2. No internal tasks | ||
- | 4. Other Changes | + | 4. Delegation |
- | | + | 4-1. Model of delegation |
- | | + | 4-2. Common ancestor rule |
- | | + | 5. Other Changes |
- | | + | |
- | | + | |
- | | + | |
- | 5. Planned Changes | + | 5-3-1. Format |
- | | + | 5-3-2. Control Knobs |
+ | 5-4. Per-Controller Changes | ||
+ | | ||
+ | | ||
+ | | ||
+ | 6. Planned Changes | ||
+ | | ||
Line 457: | Line 463: | ||
common ancestor of the source and destination cgroups. | common ancestor of the source and destination cgroups. | ||
delegatees from smuggling processes across disjoint sub-hierarchies. | delegatees from smuggling processes across disjoint sub-hierarchies. | ||
+ | |||
+ | On the unified hierarchy, " |
+ | in addition to the usual write permission and uid match, the writer also needs the |
+ | source and destination's common ancestor's " |
Let's say cgroups C0 and C1 have been delegated to user U0 who created | Let's say cgroups C0 and C1 have been delegated to user U0 who created | ||
C00, C01 under C0 and C10 under C1 as follows. | C00, C01 under C0 and C10 under C1 as follows. | ||
+ | |||
+ | For example, suppose cgroups C0 and C1 have been delegated to user U0, who |
+ | created C00 and C01 under C0 and C10 under C1, as follows. |
| | ||
Line 473: | Line 486: | ||
the processes under C0 and further control the distribution of C0's | the processes under C0 and further control the distribution of C0's | ||
resources. | resources. | ||
+ | |||
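+ | C0 and C1 are separate entities in terms of resource distribution, |
+ | regardless of their relative positions in the hierarchy. The resources |
+ | that the processes under C0 are entitled to are controlled by C0's |
+ | ancestors and may be completely different from C1's. Delegating C0 to |
+ | U0 gives U0 permission to organize the processes under C0 and, further, |
+ | to control the distribution of C0's resources. |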
On traditional hierarchies, | On traditional hierarchies, | ||
Line 481: | Line 501: | ||
organizational and resource restrictions implied by the hierarchical | organizational and resource restrictions implied by the hierarchical | ||
structure above C0 and C1. | structure above C0 and C1. | ||
+ | |||
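+ | On traditional hierarchies, if a task has write access to a cgroup's " |
+ | " |
+ | matches, it can move the target into the cgroup. In the example above, |
+ | U0 can move processes not only within each sub-hierarchy but also across |
+ | the two sub-hierarchies, which effectively lets it violate the resource |
+ | restrictions and organization implied by the hierarchy above C0 and C1. |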
On the unified hierarchy, let's say U0 wants to write the pid of a | On the unified hierarchy, let's say U0 wants to write the pid of a | ||
Line 490: | Line 517: | ||
" | " | ||
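- | ---(translated up to here)--- | + | On traditional hierarchies, if a task has write access to a cgroup's " |
+ | " |
+ | matches, it can move the target into the cgroup. In the example above, |
+ | U0 can move processes not only within each sub-hierarchy but also across |
+ | the two sub-hierarchies, which effectively lets it violate the resource |
+ | restrictions and organization implied by the hierarchy above C0 and C1. |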
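The common ancestor rule discussed in this section can be illustrated with a small path computation. The following is only a sketch under stated assumptions: cgroups are modelled as POSIX-style paths, and `writable` is a hypothetical set of cgroups whose process-migration file the writer may write to. The uid check and real file permissions are not modelled, and the function names are invented for illustration.

```python
import posixpath


def common_ancestor(src: str, dst: str) -> str:
    """Return the deepest cgroup path that contains both src and dst."""
    return posixpath.commonpath([src, dst])


def may_migrate(writable: set, src: str, dst: str) -> bool:
    """Simplified model of the rule: the writer needs write access both at
    the destination and at the common ancestor of source and destination.
    `writable` is a hypothetical stand-in for real permission checks."""
    return dst in writable and common_ancestor(src, dst) in writable


# U0 was delegated C0 and C1 (and created C00, C01, C10 beneath them),
# but has no write access at the root that joins the two sub-hierarchies.
u0_writable = {"/C0", "/C0/C00", "/C0/C01", "/C1", "/C1/C10"}

print(may_migrate(u0_writable, "/C0/C01", "/C0/C00"))  # within C0
print(may_migrate(u0_writable, "/C1/C10", "/C0/C00"))  # across C1 and C0
```

In this model a move inside C0 is accepted because the common ancestor is C0 itself, while a move from C10 to C00 is rejected because the common ancestor lies above both delegated sub-hierarchies.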
- | 4. Other Changes | + | 5. Other Changes |
- | 4-1. [Un]populated Notification | + | 5-1. [Un]populated Notification |
cgroup users often need a way to determine when a cgroup' | cgroup users often need a way to determine when a cgroup' | ||
Line 573: | Line 606: | ||
interface files do not exist either. | interface files do not exist either. | ||
- | 4-2. Other Core Changes | + | 5-2. Other Core Changes |
- None of the mount options is allowed. | - None of the mount options is allowed. | ||
Line 603: | Line 636: | ||
- " | - " | ||
- | 4-3. Per-Controller | + | 5-3. Controller |
- | 4-3-1. blkio | + | 5-3-1. Format |
- | - blk-throttle becomes properly hierarchical. | + | In general, all controller files should be in one of the following |
+ | formats whenever possible. | ||
- | - blk-throttle becomes properly hierarchical | + | Whenever possible, all controller files should be in one of the |
+ | following formats. |
- | 4-3-2. cpuset | + | - Values only files |
+ | Files containing only values |
+ | |||
+ | VAL0 VAL1...\n | ||
+ | |||
+ | - Flat keyed files | ||
+ | Files with flat keys |
+ | |||
+ | KEY0 VAL0\n | ||
+ | KEY1 VAL1\n | ||
+ | ... | ||
+ | |||
+ | - Nested keyed files | ||
+ | Files with nested keys |
+ | |||
+ | KEY0 SUB_KEY0=VAL00 SUB_KEY1=VAL01... | ||
+ | KEY1 SUB_KEY0=VAL10 SUB_KEY1=VAL11... | ||
+ | ... | ||
+ | |||
+ | For a writeable file, the format for writing should generally match | ||
+ | reading; however, controllers may allow omitting later fields or | ||
+ | implement restricted shortcuts for most common use cases. | ||
+ | |||
+ | For a writable file, the format for writing should generally match the |
+ | format for reading. However, a controller may allow later fields to be |
+ | omitted, or may implement restricted shortcuts for the most common use |
+ | cases. |
+ | |||
+ | For both flat and nested keyed files, only the values for a single key | ||
+ | can be written at a time. For nested keyed files, the sub key pairs | ||
+ | may be specified in any order and not all pairs have to be specified. | ||
+ | |||
+ | For both flat and nested keyed files, only the values for a single key |
+ | can be written at a time. For nested keyed files, the sub-key pairs may |
+ | be specified in any order, and not all pairs have to be |
+ | specified. |
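The three file formats described above are simple enough to parse with plain string splitting. The following is a minimal sketch, assuming well-formed input; the function names are hypothetical and values are left as strings because their interpretation is controller-specific.

```python
def parse_values_only(text: str) -> list:
    """Parse a values-only file: 'VAL0 VAL1...\\n'."""
    return text.split()


def parse_flat_keyed(text: str) -> dict:
    """Parse a flat keyed file: one 'KEY VAL' pair per line."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        key, value = line.split(None, 1)
        result[key] = value.strip()
    return result


def parse_nested_keyed(text: str) -> dict:
    """Parse a nested keyed file: 'KEY SUB_KEY0=VAL0 SUB_KEY1=VAL1...' per line."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        key, pairs = fields[0], fields[1:]
        # Sub-key pairs may appear in any order, so store them in a dict.
        result[key] = dict(pair.split("=", 1) for pair in pairs)
    return result


sample = "KEY0 SUB_KEY0=VAL00 SUB_KEY1=VAL01\nKEY1 SUB_KEY1=VAL11 SUB_KEY0=VAL10\n"
print(parse_nested_keyed(sample))
```

Storing the sub-keys in a dictionary means the reader does not depend on field order or on every pair being present, which matches the rule for nested keyed files.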
+ | |||
+ | 5-3-2. Control Knobs | ||
+ | |||
+ | - Settings for a single feature should generally be implemented in a | ||
+ | single file. | ||
+ | |||
+ | - In general, the root cgroup should be exempt from resource control | ||
+ | and thus shouldn' | ||
+ | |||
+ | - If a controller implements ratio based resource distribution, | ||
+ | control knob should be named " | ||
+ | and 100 should be the default value. | ||
+ | enough and symmetric bias in both directions while keeping it | ||
+ | intuitive (the default is 100%). | ||
+ | |||
+ | - If a controller implements an absolute resource guarantee and/or | ||
+ | limit, the control knobs should be named " | ||
+ | respectively. | ||
+ | guarantee and/or limit, the control knobs should be named " |
+ | " | ||
+ | |||
+ | In the above four control files, the special token " | ||
+ | used to represent upward infinity for both reading and writing. | ||
+ | |||
+ | - If a setting has configurable default value and specific overrides, | ||
+ | the default settings should be keyed with " | ||
+ | the first entry in the file. Specific entries can use " | ||
+ | its value to indicate inheritance of the default value. | ||
+ | |||
+ | 5-4. Per-Controller Changes | ||
+ | |||
+ | 5-4-1. io | ||
+ | |||
+ | - blkio is renamed to io. The interface is overhauled anyway. | ||
+ | new name is more in line with the other two major controllers, | ||
+ | and memory, and better suited given that it may be used for cgroup | ||
+ | writeback without involving block layer. | ||
+ | |||
+ | - blkio has been renamed to io. The interface has been overhauled. |
+ | The new name is more in line with the other two major controllers, cpu |
+ | and memory, and is better suited given that it may be used for |
+ | cgroup writeback without involving the block layer. |
+ | |||
+ | - Everything including stat is always hierarchical making separate | ||
+ | recursive stat files pointless and, as no internal node can have | ||
+ | tasks, leaf weights are meaningless. | ||
+ | simplified and the interface is overhauled accordingly. | ||
+ | |||
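+ | - Everything, including stat, is always hierarchical, which makes separate |
+ | recursive stat files pointless. As no internal node can have tasks, |
+ | leaf weights are meaningless. The operation model is simplified and the |
+ | interface is overhauled accordingly. |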
+ | |||
+ | io.stat | ||
+ | |||
+ | The stat file. The reported stats are from the point where | ||
+ | bio's are issued to request_queue. | ||
+ | independent of which policies are enabled. | ||
+ | file follows the following format. | ||
+ | added at the end. | ||
+ | |||
+ | $MAJ:$MIN rbytes=$RBYTES wbytes=$WBYTES rios=$RIOS wrios=$WIOS | ||
+ | |||
+ | The stat file. The reported stats are from the point where bios |
+ | are issued to request_queue. The stats are counted independently |
+ | of which policies are enabled. Each line in the file follows the |
+ | format below. More fields may be added at the |
+ | end. |
+ | |||
+ | $MAJ:$MIN rbytes=$RBYTES wbytes=$WBYTES rios=$RIOS wrios=$WIOS | ||
+ | |||
+ | io.weight | ||
+ | |||
+ | The weight setting, currently only available and effective if | ||
+ | cfq-iosched is in use for the target device. | ||
+ | between 1 and 10000 and defaults to 100. The first line | ||
+ | always contains the default weight in the following format to | ||
+ | use when per-device setting is missing. | ||
+ | |||
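+ | The weight setting, currently only available and effective if |
+ | cfq-iosched is in use for the target device. The weight is between |
+ | 1 and 10000 and defaults to 100. The first line always contains |
+ | the default weight in the following format, which is used when a |
+ | per-device setting is missing. |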
+ | |||
+ | default $WEIGHT | ||
+ | |||
+ | Subsequent lines list per-device weights of the following | ||
+ | format. | ||
+ | |||
+ | Subsequent lines list per-device weights in the following format. |
+ | |||
+ | $MAJ:$MIN $WEIGHT | ||
+ | |||
+ | Writing " | ||
+ | setting. | ||
+ | while " | ||
+ | |||
+ | " |
+ | is changed. " |
+ | default" |
+ | |||
+ | This file is available only on non-root cgroups. | ||
+ | |||
+ | This file is available only on non-root cgroups. |
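The io.weight layout described above (a `default $WEIGHT` line followed by `$MAJ:$MIN $WEIGHT` lines) suggests a lookup with fallback to the default. The following is a rough sketch, assuming the file content has already been read into a string; the function names and the sample content are hypothetical.

```python
def parse_io_weight(text: str):
    """Split io.weight content into (default_weight, per_device_weights)."""
    default = None
    per_device = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # ignore blank or unexpected lines in this sketch
        key, weight = fields
        if key == "default":
            default = int(weight)
        else:
            per_device[key] = int(weight)  # key is "$MAJ:$MIN"
    return default, per_device


def effective_weight(text: str, device: str):
    """Return the per-device weight, or the default when none is set."""
    default, per_device = parse_io_weight(text)
    return per_device.get(device, default)


# Hypothetical file content: default weight plus one per-device override.
sample = "default 100\n8:0 200\n"
print(effective_weight(sample, "8:0"))
print(effective_weight(sample, "8:16"))
```

A device without its own line falls back to the default weight, which is the behaviour the text describes for a missing per-device setting.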
+ | |||
+ | io.max | ||
+ | |||
+ | The maximum bandwidth and/or iops setting, only available if | ||
+ | blk-throttle is enabled. | ||
+ | |||
+ | The maximum bandwidth and/or IOPS setting, only available if |
+ | blk-throttle is enabled. The file has the following format. |
+ | |||
+ | $MAJ:$MIN rbps=$RBPS wbps=$WBPS riops=$RIOPS wiops=$WIOPS | ||
+ | |||
+ | ${R|W}BPS are read/write bytes per second and ${R|W}IOPS are | ||
+ | read/write IOs per second. | ||
+ | to the file follows the same format but the individual | ||
+ | settings may be omitted or specified in any order. | ||
+ | |||
+ | ${R|W}BPS are read/write bytes per second, and ${R|W}IOPS |
+ | are read/write IOs per second. " |
+ | Writing to the file follows the same format, but individual |
+ | settings may be omitted or specified in any order. |
+ | |||
+ | This file is available only on non-root cgroups. | ||
+ | |||
+ | This file is available only on non-root cgroups. |
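Because individual io.max settings may be omitted or given in any order when writing, a line for the file can be assembled from whichever limits are known. The following is a small sketch, not a definitive tool: `format_io_max` is a hypothetical helper, the device numbers and limits are made up, and only numeric limits are handled.

```python
IO_MAX_KEYS = ("rbps", "wbps", "riops", "wiops")


def format_io_max(major: int, minor: int, **limits: int) -> str:
    """Build one '$MAJ:$MIN key=value ...' line suitable for writing to io.max.

    Only the settings passed in are emitted; the rest are simply omitted.
    """
    unknown = set(limits) - set(IO_MAX_KEYS)
    if unknown:
        raise ValueError(f"unknown io.max settings: {sorted(unknown)}")
    if not limits:
        raise ValueError("at least one setting is required")
    settings = " ".join(f"{key}={limits[key]}" for key in IO_MAX_KEYS if key in limits)
    return f"{major}:{minor} {settings}"


# Limit writes to 1 MiB/s and reads to 100 IOs per second on device 8:0.
line = format_io_max(8, 0, wbps=1048576, riops=100)
print(line)
```

The resulting string would then be written to the cgroup's io.max file; the write itself is left out here since it needs a mounted hierarchy and appropriate privileges.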
+ | |||
+ | 5-4-2. cpuset | ||
- Tasks are kept in empty cpusets after hotplug and take on the masks | - Tasks are kept in empty cpusets after hotplug and take on the masks | ||
of the nearest non-empty ancestor, instead of being moved to it. | of the nearest non-empty ancestor, instead of being moved to it. | ||
- | | ||
- | - Tasks in cpusets left empty by hotplug are kept there and, instead of being moved to |
- | the nearest ancestor, take on that ancestor's masks. |
- A task can be moved into an empty cpuset, and again it takes on the | - A task can be moved into an empty cpuset, and again it takes on the | ||
masks of the nearest non-empty ancestor. | masks of the nearest non-empty ancestor. | ||
- | - A task can be moved into an empty cpuset, and in that case too it takes on |
- | the masks of the nearest non-empty ancestor. |
- | 4-3-3. memory | + | 5-4-3. memory |
- use_hierarchy is on by default and the cgroup file for the flag is | - use_hierarchy is on by default and the cgroup file for the flag is | ||
not created. | not created. | ||
- | - use_hierarchy is on by default. For this flag, the | + | - The original lower boundary, the soft limit, is defined as a limit |
+ | that is per default unset. | ||
+ | global reclaim prefers is opt-in, rather than opt-out. | ||
+ | for optimizing these mostly negative lookups are so high that the | ||
+ | implementation, | ||
+ | basic desirable behavior. | ||
+ | hierarchical meaning. | ||
+ | global rbtree and treated like equal peers, regardless where they | ||
+ | are located in the hierarchy. | ||
+ | impossible. | ||
+ | that it not just introduces high allocation latencies into the | ||
+ | system, but also impacts system performance due to overreclaim, | ||
+ | the point where the feature becomes self-defeating. | ||
+ | |||
+ | The memory.low boundary on the other hand is a top-down allocated | ||
+ | reserve. | ||
+ | ancestors are below their low boundaries, which makes delegation of | ||
+ | subtrees possible. | ||
+ | default and in the common case most cgroups are eligible for the | ||
+ | preferred reclaim pass. This allows the new low boundary to be | ||
+ | efficiently implemented with just a minor addition to the generic | ||
+ | reclaim code, without the need for out-of-band data structures and | ||
+ | reclaim passes. | ||
+ | cgroups except for the ones running low in the preferred first | ||
+ | reclaim pass, overreclaim of individual groups is eliminated as | ||
+ | well, resulting in much better overall workload performance. | ||
+ | |||
+ | - The original high boundary, the hard limit, is defined as a strict | ||
+ | limit that can not budge, even if the OOM killer has to be called. | ||
+ | But this generally goes against the goal of making the most out of | ||
+ | the available memory. | ||
+ | during runtime, and that requires users to overcommit. | ||
+ | that with a strict upper limit requires either a fairly accurate | ||
+ | prediction of the working set size or adding slack to the limit. | ||
+ | Since working set size estimation is hard and error prone, and | ||
+ | getting it wrong results in OOM kills, most users tend to err on the | ||
+ | side of a looser limit and end up wasting precious resources. | ||
+ | |||
+ | The memory.high boundary on the other hand can be set much more | ||
+ | conservatively. | ||
+ | into direct reclaim to work off the excess, but it never invokes the | ||
+ | OOM killer. | ||
+ | aggressively will not terminate the processes, but instead it will | ||
+ | lead to gradual performance degradation. | ||
+ | and make corrections until the minimal memory footprint that still | ||
+ | gives acceptable performance is found. | ||
+ | |||
+ | In extreme cases, with many concurrent allocations and a complete | ||
+ | breakdown of reclaim progress within the group, the high boundary | ||
+ | can be exceeded. | ||
+ | allocation from the slack available in other groups or the rest of | ||
+ | the system than killing the group. | ||
+ | to limit this type of spillover and ultimately contain buggy or even | ||
+ | malicious applications. | ||
+ | |||
+ | - The original control file names are unwieldy and inconsistent in | ||
+ | many different ways. For example, the upper boundary hit count is | ||
+ | exported in the memory.failcnt file, but an OOM event count has to | ||
+ | be manually counted by listening to memory.oom_control events, and | ||
+ | lower boundary / soft limit events have to be counted by first | ||
+ | setting a threshold for that value and then counting those events. | ||
+ | Also, usage and limit files encode their units in the filename. | ||
+ | That makes the filenames very long, even though this is not | ||
+ | information that a user needs to be reminded of every time they type | ||
+ | out those names. | ||
+ | |||
+ | To address these naming issues, as well as to signal clearly that | ||
+ | the new interface carries a new configuration model, the naming | ||
+ | conventions in it necessarily differ from the old interface. | ||
+ | |||
+ | - The original limit files indicate the state of an unset limit with a | ||
+ | Very High Number, and a configured limit can be unset by echoing -1 | ||
+ | into those files. | ||
+ | architecture dependent and not very descriptive. | ||
+ | be understood as an underflow into the highest possible value, -2 or | ||
+ | -10M etc. do not work, so it's not consistent. | ||
+ | |||
+ | memory.low, memory.high, | ||
+ | indicate and set the highest possible value. | ||
5. Planned Changes | 5. Planned Changes |