Differences
This page shows the differences between two versions of the page.
linux:kernel:cgroup:単一階層構造 [2015/12/16 11:26] – tenforward | linux:kernel:cgroup:単一階層構造 [2016/01/22 08:50] (current) – tenforward | ||
---|---|---|---|
Line 1: | Line 1: | ||
- | ====== Unified Hierarchy ====== | + | =i===== Unified Hierarchy ======
Synced with the 4.3 kernel documentation up to section 2-1. | Synced with the 4.3 kernel documentation up to section 2-1. | ||
Line 501: | Line 501: | ||
organizational and resource restrictions implied by the hierarchical | organizational and resource restrictions implied by the hierarchical | ||
structure above C0 and C1. | structure above C0 and C1. | ||
+ | |||
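+ | On traditional hierarchies, if a task has write access to a cgroup's " |
+ | " |
+ | agrees with the target, it can move the target to the cgroup. In the example above, |
+ | U0 can move processes not only within each sub-hierarchy but also across the |
+ | two sub-hierarchies. In effect, this lets it violate the organizational and |
+ | resource restrictions implied by the hierarchical structure above C0 and C1. |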
On the unified hierarchy, let's say U0 wants to write the pid of a | On the unified hierarchy, let's say U0 wants to write the pid of a | ||
Line 509: | Line 516: | ||
points of delegation and U0 would not have write access to its | points of delegation and U0 would not have write access to its | ||
" | " | ||
+ | |||
5. Other Changes | 5. Other Changes | ||
Line 696: | Line 711: | ||
its value to indicate inheritance of the default value. | its value to indicate inheritance of the default value. | ||
| | ||
- | ---(up to here)--- |
- | |||
5-4. Per-Controller Changes | 5-4. Per-Controller Changes | ||
- | 4-3-1. blkio | + | 5-4-1. io |
- | - blk-throttle becomes properly hierarchical. | + | - blkio is renamed to io. The interface is overhauled anyway. The |
+ | new name is more in line with the other two major controllers, cpu |
+ | and memory, and better suited given that it may be used for cgroup |
+ | writeback without involving block layer. |
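- | - blk-throttle becomes properly hierarchical. | + | - blkio has been renamed to io. The interface has been thoroughly |
+ | overhauled. The new name is more in line with the other two major controllers, cpu |
+ | and memory, and it is better suited given that the controller may be used for |
+ | cgroup writeback without involving the block layer. |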
- | 4-3-2. cpuset | + | - Everything including stat is always hierarchical making separate |
+ | recursive stat files pointless and, as no internal node can have | ||
+ | tasks, leaf weights are meaningless. The operation model is |
+ | simplified and the interface is overhauled accordingly. | ||
+ | |||
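+ | - Everything, including stat, is always hierarchical, which makes separate |
+ | recursive stat files pointless. Because internal nodes cannot hold tasks, |
+ | leaf weights are meaningless as well. The operation model is simplified |
+ | and the interface is overhauled accordingly. |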
+ | |||
+ | io.stat | ||
+ | |||
+ | The stat file. The reported stats are from the point where |
+ | bio's are issued to request_queue. The stats are counted |
+ | independent of which policies are enabled. Each line in the |
+ | file follows the following format. More fields may later be |
+ | added at the end. |
+ | |||
+ | $MAJ:$MIN rbytes=$RBYTES wbytes=$WBYTES rios=$RIOS wios=$WIOS |
+ | |||
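+ | The statistics (stat) file. The reported stats are counted from the |
+ | point where bios are issued to request_queue. The stats are counted |
+ | independently of which policies are enabled. Each line in the file |
+ | follows the format below. More fields may be added at the end |
+ | later. |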
+ | |||
+ | $MAJ:$MIN rbytes=$RBYTES wbytes=$WBYTES rios=$RIOS wios=$WIOS |
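The line format described for io.stat can be read with a small key-agnostic parser. The following is a minimal sketch, not kernel or library code: `parse_io_stat_line` is a hypothetical helper name, and the sample line is made up for illustration. Because the text says more fields may be added later, the sketch keeps every `key=value` pair it finds instead of hard-coding the field names.

```python
def parse_io_stat_line(line):
    """Split one io.stat-style line into (device, counters).

    The first token is the "$MAJ:$MIN" device id; every following token
    is treated as a key=value pair with an integer value.
    """
    device, *fields = line.split()
    stats = {}
    for field in fields:
        key, _, value = field.partition("=")
        stats[key] = int(value)
    return device, stats


# Made-up sample line, for illustration only.
device, stats = parse_io_stat_line("8:0 rbytes=4096 wbytes=8192 rios=1")
print(device, stats)
```

Keeping the counters in a dictionary means a field appended at the end of the line in a later kernel does not break the parser.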
+ | |||
+ | io.weight | ||
+ | |||
+ | The weight setting, currently only available and effective if | ||
+ | cfq-iosched is in use for the target device. The weight is |
+ | between 1 and 10000 and defaults to 100. The first line | ||
+ | always contains the default weight in the following format to | ||
+ | use when per-device setting is missing. | ||
+ | |||
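+ | The weight setting. It is currently available and effective only when |
+ | cfq-iosched is in use for the target device. The weight ranges from 1 |
+ | to 10000 and defaults to 100. The first line always holds the default |
+ | weight in the format below; it is used when no per-device |
+ | setting exists. |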
+ | |||
+ | default $WEIGHT | ||
+ | |||
+ | Subsequent lines list per-device weights of the following | ||
+ | format. | ||
+ | |||
+ | Subsequent lines list per-device weights in the format below. |
+ | |||
+ | $MAJ:$MIN $WEIGHT | ||
+ | |||
+ | Writing " | ||
+ | setting. | ||
+ | while " | ||
+ | |||
+ | " | ||
+ | が変更されます。" | ||
+ | default" | ||
+ | |||
+ | This file is available only on non-root cgroups. | ||
+ | |||
+ | This file is available only on non-root cgroups. |
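The read format of io.weight described above (a `default $WEIGHT` line followed by `$MAJ:$MIN $WEIGHT` lines, with weights between 1 and 10000) can be sketched as follows. This is only an illustration that works on a string: `parse_io_weight` and `effective_weight` are hypothetical helper names, and the sample contents are invented.

```python
def parse_io_weight(text):
    """Parse io.weight-style contents into (default_weight, per_device)."""
    default = None
    per_device = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        name, weight = parts[0], int(parts[1])
        if not 1 <= weight <= 10000:
            raise ValueError(f"weight out of range: {weight}")
        if name == "default":
            default = weight
        else:
            per_device[name] = weight
    return default, per_device


def effective_weight(default, per_device, device):
    """Fall back to the default weight when no per-device setting exists."""
    return per_device.get(device, default)


# Invented sample contents, for illustration only.
default, per_device = parse_io_weight("default 100\n8:0 200\n")
print(effective_weight(default, per_device, "8:0"))
print(effective_weight(default, per_device, "8:16"))
```

The fallback in `effective_weight` mirrors the statement that the default weight is used when a per-device setting is missing.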
+ | |||
+ | io.max | ||
+ | |||
+ | The maximum bandwidth and/or iops setting, only available if | ||
+ | blk-throttle is enabled. The file is of the following format. |
+ | |||
+ | The maximum bandwidth and/or IOPS setting. It is available only when |
+ | blk-throttle is enabled. The file has the following format. |
+ | |||
+ | $MAJ:$MIN rbps=$RBPS wbps=$WBPS riops=$RIOPS wiops=$WIOPS | ||
+ | |||
+ | ${R|W}BPS are read/write bytes per second and ${R|W}IOPS are | ||
+ | read/write IOs per second. Writing |
+ | to the file follows the same format but the individual | ||
+ | settings may be omitted or specified in any order. | ||
+ | |||
+ | ${R|W}BPS are read/write bytes per second and ${R|W}IOPS |
+ | are read/write IOs per second. " |
+ | Writing to the file follows the same format, but individual |
+ | settings may be omitted or specified in any order. |
+ | |||
+ | This file is available only on non-root cgroups. | ||
+ | |||
+ | This file is available only on non-root cgroups. |
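Since writes to io.max use the same `$MAJ:$MIN key=value ...` format and individual settings may be omitted or given in any order, a line to write can be assembled from whichever limits are needed. The sketch below only builds the string and never touches the cgroup filesystem; `format_io_max` is a hypothetical helper, and the device number and limit values are invented.

```python
IO_MAX_KEYS = ("rbps", "wbps", "riops", "wiops")


def format_io_max(device, **limits):
    """Build an io.max-style line from the limits that were passed in.

    Settings that are not passed are simply left out, and the caller's
    keyword order is preserved.
    """
    unknown = set(limits) - set(IO_MAX_KEYS)
    if unknown:
        raise ValueError(f"unknown io.max keys: {sorted(unknown)}")
    fields = [f"{key}={value}" for key, value in limits.items()]
    return " ".join([device, *fields])


# Invented device and limits, for illustration only.
print(format_io_max("8:0", wbps=1048576, riops=100))
```

Rejecting unknown keys up front is a design choice of this sketch, not something the text requires.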
+ | |||
+ | 5-4-2. cpuset | ||
- Tasks are kept in empty cpusets after hotplug and take on the masks | - Tasks are kept in empty cpusets after hotplug and take on the masks | ||
of the nearest non-empty ancestor, instead of being moved to it. | of the nearest non-empty ancestor, instead of being moved to it. | ||
- | | ||
- | - Tasks in cpusets left empty by hotplug are kept in place and take on |
- | the masks of the nearest non-empty ancestor instead of being moved to it. |
- A task can be moved into an empty cpuset, and again it takes on the | - A task can be moved into an empty cpuset, and again it takes on the | ||
masks of the nearest non-empty ancestor. | masks of the nearest non-empty ancestor. | ||
- | - A task can be moved into an empty cpuset; in that case too it takes on |
- | the masks of the nearest non-empty ancestor. |
- | 4-3-3. memory | + | 5-4-3. memory |
- use_hierarchy is on by default and the cgroup file for the flag is | - use_hierarchy is on by default and the cgroup file for the flag is | ||
not created. | not created. | ||
- | - use_hierarchy is on by default. For this flag, | + | - The original lower boundary, the soft limit, is defined as a limit |
+ | that is per default unset. As a result, the set of cgroups that |
+ | global reclaim prefers is opt-in, rather than opt-out. The costs |
+ | for optimizing these mostly negative lookups are so high that the |
+ | implementation, despite its enormous size, does not even provide the |
+ | basic desirable behavior. First off, the soft limit has no |
+ | hierarchical meaning. All configured groups are organized in a |
+ | global rbtree and treated like equal peers, regardless where they |
+ | are located in the hierarchy. This makes subtree delegation |
+ | impossible. Second, the soft limit reclaim pass is so aggressive |
+ | that it not just introduces high allocation latencies into the |
+ | system, but also impacts system performance due to overreclaim, to |
+ | the point where the feature becomes self-defeating. |
+ | |||
+ | The memory.low boundary on the other hand is a top-down allocated | ||
+ | reserve. A cgroup enjoys reclaim protection when it and all its |
+ | ancestors are below their low boundaries, which makes delegation of |
+ | subtrees possible. Secondly, new cgroups have no reserve per |
+ | default and in the common case most cgroups are eligible for the |
+ | preferred reclaim pass. This allows the new low boundary to be |
+ | efficiently implemented with just a minor addition to the generic |
+ | reclaim code, without the need for out-of-band data structures and |
+ | reclaim passes. Because the generic reclaim code considers all |
+ | cgroups except for the ones running low in the preferred first | ||
+ | reclaim pass, overreclaim of individual groups is eliminated as | ||
+ | well, resulting in much better overall workload performance. | ||
+ | |||
+ | - The original high boundary, the hard limit, is defined as a strict | ||
+ | limit that can not budge, even if the OOM killer has to be called. | ||
+ | But this generally goes against the goal of making the most out of | ||
+ | the available memory. The memory consumption of workloads varies |
+ | during runtime, and that requires users to overcommit. But doing |
+ | that with a strict upper limit requires either a fairly accurate | ||
+ | prediction of the working set size or adding slack to the limit. | ||
+ | Since working set size estimation is hard and error prone, and | ||
+ | getting it wrong results in OOM kills, most users tend to err on the | ||
+ | side of a looser limit and end up wasting precious resources. | ||
+ | |||
+ | The memory.high boundary on the other hand can be set much more | ||
+ | conservatively. When hit, it throttles allocations by forcing them |
+ | into direct reclaim to work off the excess, but it never invokes the |
+ | OOM killer. As a result, a high boundary that is chosen too |
+ | aggressively will not terminate the processes, but instead it will |
+ | lead to gradual performance degradation. The user can monitor this |
+ | and make corrections until the minimal memory footprint that still | ||
+ | gives acceptable performance is found. | ||
+ | |||
+ | In extreme cases, with many concurrent allocations and a complete | ||
+ | breakdown of reclaim progress within the group, the high boundary | ||
+ | can be exceeded. But even then it's mostly better to satisfy the |
+ | allocation from the slack available in other groups or the rest of | ||
+ | the system than killing the group. | ||
+ | to limit this type of spillover and ultimately contain buggy or even | ||
+ | malicious applications. | ||
+ | |||
+ | - The original control file names are unwieldy and inconsistent in | ||
+ | many different ways. For example, the upper boundary hit count is | ||
+ | exported in the memory.failcnt file, but an OOM event count has to | ||
+ | be manually counted by listening to memory.oom_control events, and | ||
+ | lower boundary / soft limit events have to be counted by first | ||
+ | setting a threshold for that value and then counting those events. | ||
+ | Also, usage and limit files encode their units in the filename. | ||
+ | That makes the filenames very long, even though this is not | ||
+ | information that a user needs to be reminded of every time they type | ||
+ | out those names. | ||
+ | |||
+ | To address these naming issues, as well as to signal clearly that | ||
+ | the new interface carries a new configuration model, the naming | ||
+ | conventions in it necessarily differ from the old interface. | ||
+ | |||
+ | - The original limit files indicate the state of an unset limit with a | ||
+ | Very High Number, and a configured limit can be unset by echoing -1 | ||
+ | into those files. But that very high number is implementation and |
+ | architecture dependent and not very descriptive. And while -1 can |
+ | be understood as an underflow into the highest possible value, -2 or | ||
+ | -10M etc. do not work, so it's not consistent. | ||
+ | |||
+ | memory.low, memory.high, | ||
+ | indicate and set the highest possible value. | ||
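The closing paragraph refers to a single value that both indicates and sets the highest possible limit, replacing the old "very high number" and the `-1` convention. The sketch below assumes, as in the upstream cgroup v2 interface, that this value is the literal string `max`; that literal does not appear in the text above. `parse_limit` and `format_limit` are hypothetical helper names, and the byte counts are invented.

```python
import math


def parse_limit(text):
    """Convert a limit file's contents to a number; "max" means no limit.

    Assumption: the unset state is spelled "max" (see the lead-in).
    """
    value = text.strip()
    if value == "max":
        return math.inf
    return int(value)


def format_limit(limit):
    """Inverse of parse_limit: render an unlimited value as "max"."""
    return "max" if limit == math.inf else str(int(limit))


print(parse_limit("max\n"))
print(parse_limit("1048576"))
print(format_limit(math.inf))
```

Representing the unset state as `math.inf` keeps comparisons such as `usage < limit` valid without depending on an architecture-specific sentinel number.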
5. Planned Changes | 5. Planned Changes |