Diff
Shows the differences between two versions of this page.
| linux:kernel:cgroup:単一階層構造 [2015/12/16 11:15] – tenforward | linux:kernel:cgroup:単一階層構造 [2016/01/22 08:50] (current) – tenforward | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| - | ====== Unified Hierarchy ====== | + | =i===== Unified Hierarchy ====== | ||
| Synced with the 4.3 kernel documentation up to 2-1. | Synced with the 4.3 kernel documentation up to 2-1. | ||
| Line 26: | Line 26: | ||
| 3-1. Top-down | 3-1. Top-down | ||
| 3-2. No internal tasks | 3-2. No internal tasks | ||
| - | 4. Other Changes | + | 4. Delegation |
| - | | + | 4-1. Model of delegation |
| - | | + | 4-2. Common ancestor rule |
| - | | + | 5. Other Changes |
| - | | + | |
| - | | + | |
| - | | + | |
| - | 5. Planned Changes | + | 5-3-1. Format |
| - | | + | 5-3-2. Control Knobs |
| + | 5-4. Per-Controller Changes | ||
| + | | ||
| + | | ||
| + | | ||
| + | 6. Planned Changes | ||
| + | | ||
| Line 495: | Line 501: | ||
| organizational and resource restrictions implied by the hierarchical | organizational and resource restrictions implied by the hierarchical | ||
| structure above C0 and C1. | structure above C0 and C1. | ||
| + | |||
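| + | On the traditional hierarchies, if a task, on a cgroup's " | ||
| + | " | ||
| + | matches, the task can move the target into the cgroup. In the earlier example, | ||
| + | U0 can move processes not only within each sub-hierarchy but also across | ||
| + | the two sub-hierarchies. In effect, this lets it violate the resource | ||
| + | restrictions and organization implied by the hierarchical structure above C0 and C1. | ||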
| On the unified hierarchy, let's say U0 wants to write the pid of a | On the unified hierarchy, let's say U0 wants to write the pid of a | ||
| Line 503: | Line 516: | ||
| points of delegation and U0 would not have write access to its | points of delegation and U0 would not have write access to its | ||
| " | " | ||
| + | |||
| 5. Other Changes | 5. Other Changes | ||
| Line 615: | Line 636: | ||
| - " | - " | ||
| - | ---(up to here)--- | + | 5-3. Controller File Conventions | ||
| - | 4-3. Per-Controller Changes | + | 5-3-1. Format |
| - | 4-3-1. blkio | + | In general, all controller files should be in one of the following |
| + | formats whenever possible. | ||
| - | - blk-throttle becomes properly hierarchical. | + | Whenever possible, all controller files should be in one of the | ||
| + | following formats. | ||
| - | - blk-throttle becomes properly hierarchical. | + | - Values only files | ||
| + | Values-only files | ||
| - | 4-3-2. cpuset | + | VAL0 VAL1...\n |
| + | |||
| + | - Flat keyed files | ||
| + | Flat keyed files | ||
| + | |||
| + | KEY0 VAL0\n | ||
| + | KEY1 VAL1\n | ||
| + | ... | ||
| + | |||
| + | - Nested keyed files | ||
| + | Nested keyed files | ||
| + | |||
| + | KEY0 SUB_KEY0=VAL00 SUB_KEY1=VAL01... | ||
| + | KEY1 SUB_KEY0=VAL10 SUB_KEY1=VAL11... | ||
| + | ... | ||
| + | |||
| + | For a writeable file, the format for writing should generally match | ||
| + | reading; however, controllers may allow omitting later fields or | ||
| + | implement restricted shortcuts for most common use cases. | ||
| + | |||
| + | For a writable file, the write format should generally match the read | ||
| + | format. However, a controller may allow later fields to be omitted, | ||
| + | or may implement restricted shortcuts for the most common use | ||
| + | cases. | ||
| + | |||
| + | For both flat and nested keyed files, only the values for a single key | ||
| + | can be written at a time. For nested keyed files, the sub key pairs | ||
| + | may be specified in any order and not all pairs have to be specified. | ||
| + | |||
| + | For both flat and nested keyed files, only the values for a single key can | ||
| + | be written at a time. For nested keyed files, the sub key pairs may be | ||
| + | specified in any order, and not all pairs have to be | ||
| + | specified. | ||
| + | |||
| + | 5-3-2. Control Knobs | ||
| + | |||
| + | - Settings for a single feature should generally be implemented in a | ||
| + | single file. | ||
| + | |||
| + | - In general, the root cgroup should be exempt from resource control | ||
| + | and thus shouldn' | ||
| + | |||
| + | - If a controller implements ratio based resource distribution, | ||
| + | control knob should be named " | ||
| + | and 100 should be the default value. | ||
| + | enough and symmetric bias in both directions while keeping it | ||
| + | intuitive (the default is 100%). | ||
| + | |||
| + | - If a controller implements an absolute resource guarantee and/or | ||
| + | limit, the control knobs should be named " | ||
| + | respectively. | ||
| + | guarantee and/or limit, the control knobs should be named " | ||
| + | " | ||
| + | |||
| + | In the above four control files, the special token " | ||
| + | used to represent upward infinity for both reading and writing. | ||
| + | |||
| + | - If a setting has configurable default value and specific overrides, | ||
| + | the default settings should be keyed with " | ||
| + | the first entry in the file. Specific entries can use " | ||
| + | its value to indicate inheritance of the default value. | ||
| + | |||
| + | 5-4. Per-Controller Changes | ||
| + | |||
| + | 5-4-1. io | ||
| + | |||
| + | - blkio is renamed to io. The interface is overhauled anyway. | ||
| + | new name is more in line with the other two major controllers, | ||
| + | and memory, and better suited given that it may be used for cgroup | ||
| + | writeback without involving block layer. | ||
| + | |||
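| + | - blkio has been renamed to io. The interface has been thoroughly | ||
| + | overhauled. The new name is more in line with the other two major controllers, CPU and | ||
| + | memory, and is better suited to its use for cgroup writeback without | ||
| + | involving the block layer. | ||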
| + | |||
| + | - Everything including stat is always hierarchical making separate | ||
| + | recursive stat files pointless and, as no internal node can have | ||
| + | tasks, leaf weights are meaningless. | ||
| + | simplified and the interface is overhauled accordingly. | ||
| + | |||
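| + | - Everything, including stat, is always hierarchical, which makes separate | ||
| + | recursive stat files pointless. Because no internal node can have tasks, leaf | ||
| + | weights are meaningless. The operation model is simplified and the | ||
| + | interface is overhauled accordingly. | ||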
| + | |||
| + | io.stat | ||
| + | |||
| + | The stat file. The reported stats are from the point where | ||
| + | bio's are issued to request_queue. | ||
| + | independent of which policies are enabled. | ||
| + | file follows the following format. | ||
| + | added at the end. | ||
| + | |||
| + | $MAJ:$MIN rbytes=$RBYTES wbytes=$WBYTES rios=$RIOS wrios=$WIOS | ||
| + | |||
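| + | The stat file. The reported stats are from the point where bios are | ||
| + | issued to request_queue. The stats are counted independently of | ||
| + | which policies are enabled. Each line in the file follows the | ||
| + | format below. Additional fields may be added at the | ||
| + | end. | ||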
| + | |||
| + | $MAJ:$MIN rbytes=$RBYTES wbytes=$WBYTES rios=$RIOS wrios=$WIOS | ||
| + | |||
| + | io.weight | ||
| + | |||
| + | The weight setting, currently only available and effective if | ||
| + | cfq-iosched is in use for the target device. | ||
| + | between 1 and 10000 and defaults to 100. The first line | ||
| + | always contains the default weight in the following format to | ||
| + | use when per-device setting is missing. | ||
| + | |||
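| + | The weight setting, currently available and effective only if cfq-iosched | ||
| + | is in use for the target device. The weight is between 1 | ||
| + | and 10000 and defaults to 100. The first line always contains the | ||
| + | default weight in the following format, which is used when a per-device | ||
| + | setting is missing. | ||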
| + | |||
| + | default $WEIGHT | ||
| + | |||
| + | Subsequent lines list per-device weights of the following | ||
| + | format. | ||
| + | |||
| + | Subsequent lines list per-device weights in the following format. | ||
| + | |||
| + | $MAJ:$MIN $WEIGHT | ||
| + | |||
| + | Writing " | ||
| + | setting. | ||
| + | while " | ||
| + | |||
| + | " | ||
| + | is changed. " | ||
| + | default" | ||
| + | |||
| + | This file is available only on non-root cgroups. | ||
| + | |||
| + | This file is available only on non-root cgroups. | ||
| + | |||
| + | io.max | ||
| + | |||
| + | The maximum bandwidth and/or iops setting, only available if | ||
| + | blk-throttle is enabled. | ||
| + | |||
| + | The maximum bandwidth and/or IOPS setting, available only if blk-throttle | ||
| + | is enabled. The file has the following format. | ||
| + | |||
| + | $MAJ:$MIN rbps=$RBPS wbps=$WBPS riops=$RIOPS wiops=$WIOPS | ||
| + | |||
| + | ${R|W}BPS are read/write bytes per second and ${R|W}IOPS are | ||
| + | read/write IOs per second. | ||
| + | to the file follows the same format but the individual | ||
| + | settings may be omitted or specified in any order. | ||
| + | |||
| + | ${R|W}BPS are read/write bytes per second and ${R|W}IOPS | ||
| + | are read/write IOs per second. " | ||
| + | Writing to the file follows the same format, but the individual | ||
| + | settings may be omitted or specified in any order. | ||
| + | |||
| + | This file is available only on non-root cgroups. | ||
| + | |||
| + | This file is available only on non-root cgroups. | ||
| + | |||
| + | 5-4-2. cpuset | ||
| - Tasks are kept in empty cpusets after hotplug and take on the masks | - Tasks are kept in empty cpusets after hotplug and take on the masks | ||
| of the nearest non-empty ancestor, instead of being moved to it. | of the nearest non-empty ancestor, instead of being moved to it. | ||
| - | | ||
| - | - Tasks in empty cpusets are kept there after hotplug and take on the masks of the | ||
| - | nearest ancestor instead of being moved to it. | ||
| - A task can be moved into an empty cpuset, and again it takes on the | - A task can be moved into an empty cpuset, and again it takes on the | ||
| masks of the nearest non-empty ancestor. | masks of the nearest non-empty ancestor. | ||
| - | - A task can be moved into an empty cpuset, and in that case too it takes on the | ||
| - | masks of the nearest non-empty ancestor. | ||
| - | 4-3-3. memory | + | 5-4-3. memory |
| - use_hierarchy is on by default and the cgroup file for the flag is | - use_hierarchy is on by default and the cgroup file for the flag is | ||
| not created. | not created. | ||
| - | - use_hierarchy is on by default. For this flag, the | + | - The original lower boundary, the soft limit, is defined as a limit | ||
| + | that is per default unset. | ||
| + | global reclaim prefers is opt-in, rather than opt-out. | ||
| + | for optimizing these mostly negative lookups are so high that the | ||
| + | implementation, | ||
| + | basic desirable behavior. | ||
| + | hierarchical meaning. | ||
| + | global rbtree and treated like equal peers, regardless where they | ||
| + | are located in the hierarchy. | ||
| + | impossible. | ||
| + | that it not just introduces high allocation latencies into the | ||
| + | system, but also impacts system performance due to overreclaim, | ||
| + | the point where the feature becomes self-defeating. | ||
| + | |||
| + | The memory.low boundary on the other hand is a top-down allocated | ||
| + | reserve. | ||
| + | ancestors are below their low boundaries, which makes delegation of | ||
| + | subtrees possible. | ||
| + | default and in the common case most cgroups are eligible for the | ||
| + | preferred reclaim pass. This allows the new low boundary to be | ||
| + | efficiently implemented with just a minor addition to the generic | ||
| + | reclaim code, without the need for out-of-band data structures and | ||
| + | reclaim passes. | ||
| + | cgroups except for the ones running low in the preferred first | ||
| + | reclaim pass, overreclaim of individual groups is eliminated as | ||
| + | well, resulting in much better overall workload performance. | ||
| + | |||
| + | - The original high boundary, the hard limit, is defined as a strict | ||
| + | limit that can not budge, even if the OOM killer has to be called. | ||
| + | But this generally goes against the goal of making the most out of | ||
| + | the available memory. | ||
| + | during runtime, and that requires users to overcommit. | ||
| + | that with a strict upper limit requires either a fairly accurate | ||
| + | prediction of the working set size or adding slack to the limit. | ||
| + | Since working set size estimation is hard and error prone, and | ||
| + | getting it wrong results in OOM kills, most users tend to err on the | ||
| + | side of a looser limit and end up wasting precious resources. | ||
| + | |||
| + | The memory.high boundary on the other hand can be set much more | ||
| + | conservatively. | ||
| + | into direct reclaim to work off the excess, but it never invokes the | ||
| + | OOM killer. | ||
| + | aggressively will not terminate the processes, but instead it will | ||
| + | lead to gradual performance degradation. | ||
| + | and make corrections until the minimal memory footprint that still | ||
| + | gives acceptable performance is found. | ||
| + | |||
| + | In extreme cases, with many concurrent allocations and a complete | ||
| + | breakdown of reclaim progress within the group, the high boundary | ||
| + | can be exceeded. | ||
| + | allocation from the slack available in other groups or the rest of | ||
| + | the system than killing the group. | ||
| + | to limit this type of spillover and ultimately contain buggy or even | ||
| + | malicious applications. | ||
| + | |||
| + | - The original control file names are unwieldy and inconsistent in | ||
| + | many different ways. For example, the upper boundary hit count is | ||
| + | exported in the memory.failcnt file, but an OOM event count has to | ||
| + | be manually counted by listening to memory.oom_control events, and | ||
| + | lower boundary / soft limit events have to be counted by first | ||
| + | setting a threshold for that value and then counting those events. | ||
| + | Also, usage and limit files encode their units in the filename. | ||
| + | That makes the filenames very long, even though this is not | ||
| + | information that a user needs to be reminded of every time they type | ||
| + | out those names. | ||
| + | |||
| + | To address these naming issues, as well as to signal clearly that | ||
| + | the new interface carries a new configuration model, the naming | ||
| + | conventions in it necessarily differ from the old interface. | ||
| + | |||
| + | - The original limit files indicate the state of an unset limit with a | ||
| + | Very High Number, and a configured limit can be unset by echoing -1 | ||
| + | into those files. | ||
| + | architecture dependent and not very descriptive. | ||
| + | be understood as an underflow into the highest possible value, -2 or | ||
| + | -10M etc. do not work, so it's not consistent. | ||
| + | |||
| + | memory.low, memory.high, | ||
| + | indicate and set the highest possible value. | ||
| 5. Planned Changes | 5. Planned Changes | ||
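
As a rough illustration of the nested keyed format from section 5-3-1 (the layout that io.stat and io.max use), the Python sketch below parses such a file and builds a write line for a single key. The function names, the sample device numbers (8:0, 8:16), and the sample values are assumptions made for this example and do not come from the kernel documentation. Actually writing the line to a cgroup file is left out.

```python
def parse_nested_keyed(text):
    """Parse 'KEY SUB_KEY=VAL ...' lines into {key: {sub_key: value}}.

    Values are kept as strings, so a special token is passed through unchanged.
    """
    entries = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        key, pairs = fields[0], fields[1:]
        # Split each pair on the first '=' only.
        entries[key] = dict(pair.split("=", 1) for pair in pairs)
    return entries


def format_nested_write(key, **sub_values):
    """Build one write line for a single key.

    Only the sub keys that are passed in are emitted, in the order given,
    since sub key pairs may be partial and in any order.
    """
    pairs = " ".join(f"{name}={value}" for name, value in sub_values.items())
    return f"{key} {pairs}"


# Hypothetical io.stat contents for two devices (illustrative values only).
sample = (
    "8:0 rbytes=4096 wbytes=0 rios=1 wrios=0\n"
    "8:16 rbytes=0 wbytes=8192 rios=0 wrios=2\n"
)
stats = parse_nested_keyed(sample)
print(stats["8:16"]["wbytes"])

# A partial io.max style line: only two of the four settings are given.
print(format_nested_write("8:0", wbps=1048576, riops=100))
```

Because only the values for a single key can be written at a time, a caller would issue one such line per write.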