Diff
Shows the differences between two versions of this page.
| linux:kernel:cgroup:単一階層構造 [2015/12/24 10:12] – tenforward | linux:kernel:cgroup:単一階層構造 [2016/01/22 08:50] (current) – tenforward | ||
|---|---|---|---|
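| Line 1: | Line 1: | ||
| - | ====== Unified Hierarchy ====== | + | =i===== Unified Hierarchy ====== |
| Synced with the 4.3 kernel documentation up to section 2-1. | Synced with the 4.3 kernel documentation up to section 2-1. | ||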
| Line 711: | Line 711: | ||
| its value to indicate inheritance of the default value. | its value to indicate inheritance of the default value. | ||
| | | ||
| - | ---(up to here)--- | ||
| - | |||
| 5-4. Per-Controller Changes | 5-4. Per-Controller Changes | ||
| - | 4-3-1. blkio | + | 5-4-1. io |
| - | - blk-throttle becomes properly hierarchical. | + | - blkio is renamed to io. The interface is overhauled anyway. The |
| + | new name is more in line with the other two major controllers, cpu | ||
| + | and memory, and better suited given that it may be used for cgroup | ||
| + | writeback without involving block layer. | ||
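| - | - blk-throttle becomes properly hierarchical | + | - blkio has been renamed to io. The interface has been thoroughly |
| + | overhauled. The new name is more in keeping with the other two major | ||
| + | controllers, CPU and memory, and it is better suited to use for | ||
| + | cgroup writeback that does not go through the block layer. | ||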
| - | 4-3-2. cpuset | + | - Everything including stat is always hierarchical making separate |
| + | recursive stat files pointless and, as no internal node can have | ||
| + | tasks, leaf weights are meaningless. The operation model is | ||
| + | simplified and the interface is overhauled accordingly. | ||
| + | |||
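| + | - Everything, including stat, is always hierarchical, so separate | ||
| + | recursive stat files are pointless. Internal nodes cannot hold tasks, | ||
| + | so leaf weights are meaningless. The operation model is simplified | ||
| + | and the interface is overhauled accordingly. | ||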
| + | |||
| + | io.stat | ||
| + | |||
| + | The stat file. The reported stats are from the point where | ||
| + | bio's are issued to request_queue. The stats are counted | ||
| + | independent of which policies are enabled. Each line in the | ||
| + | file follows the following format. More fields may later be | ||
| + | added at the end. | ||
| + | |||
| + | $MAJ:$MIN rbytes=$RBYTES wbytes=$WBYTES rios=$RIOS wios=$WIOS | ||
| + | |||
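| + | The statistics (stat) file. The reported statistics are counted from | ||
| + | the point where a bio is issued to the request_queue. They are | ||
| + | counted independently of which policies are enabled. Each line in | ||
| + | the file uses the format below. More fields may be added at the end | ||
| + | later. | ||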
| + | |||
| + | $MAJ:$MIN rbytes=$RBYTES wbytes=$WBYTES rios=$RIOS wios=$WIOS | ||
| + | |||
| + | io.weight | ||
| + | |||
| + | The weight setting, currently only available and effective if | ||
| + | cfq-iosched is in use for the target device. The weight is | ||
| + | between 1 and 10000 and defaults to 100. The first line | ||
| + | always contains the default weight in the following format to | ||
| + | use when per-device setting is missing. | ||
| + | |||
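| + | The weight setting. At present it is available and effective only | ||
| + | when cfq-iosched is in use for the target device. The weight is | ||
| + | between 1 and 10000 and defaults to 100. The first line always | ||
| + | holds the default weight in the following format, which is used | ||
| + | when there is no per-device setting. | ||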
| + | |||
| + | default $WEIGHT | ||
| + | |||
| + | Subsequent lines list per-device weights of the following | ||
| + | format. | ||
| + | |||
| + | Subsequent lines list per-device weights in the following format. | ||
| + | |||
| + | $MAJ:$MIN $WEIGHT | ||
| + | |||
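| + | Writing "$WEIGHT" or "default $WEIGHT" changes the default | ||
| + | setting. Writing "$MAJ:$MIN $WEIGHT" sets per-device weight | ||
| + | while "$MAJ:$MIN default" clears it. | ||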
| + | |||
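| + | Writing "$WEIGHT" or "default $WEIGHT" changes the default | ||
| + | setting. Writing "$MAJ:$MIN $WEIGHT" sets a per-device weight, and | ||
| + | "$MAJ:$MIN default" clears it. | ||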
| + | |||
| + | This file is available only on non-root cgroups. | ||
| + | |||
| + | This file can be used only on cgroups other than the root. | ||
| + | |||
| + | io.max | ||
| + | |||
| + | The maximum bandwidth and/or iops setting, only available if | ||
| + | blk-throttle is enabled. The file is of the following format. | ||
| + | |||
| + | The setting for maximum bandwidth and/or IOPS. It is available only | ||
| + | when blk-throttle is enabled. The file has the following format. | ||
| + | |||
| + | $MAJ:$MIN rbps=$RBPS wbps=$WBPS riops=$RIOPS wiops=$WIOPS | ||
| + | |||
| + | ${R|W}BPS are read/write bytes per second and ${R|W}IOPS are | ||
| + | read/write IOs per second. "max" indicates no limit. Writing | ||
| + | to the file follows the same format but the individual | ||
| + | settings may be omitted or specified in any order. | ||
| + | |||
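| + | ${R|W}BPS are read/write bytes per second and ${R|W}IOPS are | ||
| + | read/write I/O operations per second. "max" means no limit. Writes | ||
| + | to the file follow the same format, but individual settings may be | ||
| + | omitted or given in any order. | ||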
| + | |||
| + | This file is available only on non-root cgroups. | ||
| + | |||
| + | This file is available only on cgroups other than the root cgroup. | ||
| + | |||
| + | 5-4-2. cpuset | ||
| - Tasks are kept in empty cpusets after hotplug and take on the masks | - Tasks are kept in empty cpusets after hotplug and take on the masks | ||
| of the nearest non-empty ancestor, instead of being moved to it. | of the nearest non-empty ancestor, instead of being moved to it. | ||
| - | | ||
| - | - Tasks in an empty cpuset are kept there after hotplug and take on | ||
| - | the masks of the nearest ancestor instead of being moved to it. | ||
| - A task can be moved into an empty cpuset, and again it takes on the | - A task can be moved into an empty cpuset, and again it takes on the | ||
| masks of the nearest non-empty ancestor. | masks of the nearest non-empty ancestor. | ||
| - | - A task can be moved into an empty cpuset; in that case too, it | ||
| - | takes on the masks of the nearest non-empty ancestor. | ||
| - | 4-3-3. memory | + | 5-4-3. memory |
| - use_hierarchy is on by default and the cgroup file for the flag is | - use_hierarchy is on by default and the cgroup file for the flag is | ||
| not created. | not created. | ||
| - | - use_hierarchy is on by default. For this flag, the | + | - The original lower boundary, the soft limit, is defined as a limit |
| + | that is per default unset. As a result, the set of cgroups that | ||
| + | global reclaim prefers is opt-in, rather than opt-out. The costs | ||
| + | for optimizing these mostly negative lookups are so high that the | ||
| + | implementation, despite its enormous size, does not even provide the | ||
| + | basic desirable behavior. First off, the soft limit has no | ||
| + | hierarchical meaning. All configured groups are organized in a | ||
| + | global rbtree and treated like equal peers, regardless where they | ||
| + | are located in the hierarchy. This makes subtree delegation | ||
| + | impossible. Second, the soft limit reclaim pass is so aggressive | ||
| + | that it not just introduces high allocation latencies into the | ||
| + | system, but also impacts system performance due to overreclaim, to | ||
| + | the point where the feature becomes self-defeating. | ||
| + | |||
| + | The memory.low boundary on the other hand is a top-down allocated | ||
| + | reserve. A cgroup enjoys reclaim protection when it and all its | ||
| + | ancestors are below their low boundaries, which makes delegation of | ||
| + | subtrees possible. Secondly, new cgroups have no reserve per | ||
| + | default and in the common case most cgroups are eligible for the | ||
| + | preferred reclaim pass. This allows the new low boundary to be | ||
| + | efficiently implemented with just a minor addition to the generic | ||
| + | reclaim code, without the need for out-of-band data structures and | ||
| + | reclaim passes. Because the generic reclaim code considers all | ||
| + | cgroups except for the ones running low in the preferred first | ||
| + | reclaim pass, overreclaim of individual groups is eliminated as | ||
| + | well, resulting in much better overall workload performance. | ||
| + | |||
| + | - The original high boundary, the hard limit, is defined as a strict | ||
| + | limit that can not budge, even if the OOM killer has to be called. | ||
| + | But this generally goes against the goal of making the most out of | ||
| + | the available memory. The memory consumption of workloads varies | ||
| + | during runtime, and that requires users to overcommit. But doing | ||
| + | that with a strict upper limit requires either a fairly accurate | ||
| + | prediction of the working set size or adding slack to the limit. | ||
| + | Since working set size estimation is hard and error prone, and | ||
| + | getting it wrong results in OOM kills, most users tend to err on the | ||
| + | side of a looser limit and end up wasting precious resources. | ||
| + | |||
| + | The memory.high boundary on the other hand can be set much more | ||
| + | conservatively. When hit, it throttles allocations by forcing them | ||
| + | into direct reclaim to work off the excess, but it never invokes the | ||
| + | OOM killer. As a result, a high boundary that is chosen too | ||
| + | aggressively will not terminate the processes, but instead it will | ||
| + | lead to gradual performance degradation. The user can monitor this | ||
| + | and make corrections until the minimal memory footprint that still | ||
| + | gives acceptable performance is found. | ||
| + | |||
| + | In extreme cases, with many concurrent allocations and a complete | ||
| + | breakdown of reclaim progress within the group, the high boundary | ||
| + | can be exceeded. But even then it's mostly better to satisfy the | ||
| + | allocation from the slack available in other groups or the rest of | ||
| + | the system than killing the group. Otherwise, memory.max is there | ||
| + | to limit this type of spillover and ultimately contain buggy or even | ||
| + | malicious applications. | ||
| + | |||
| + | - The original control file names are unwieldy and inconsistent in | ||
| + | many different ways. For example, the upper boundary hit count is | ||
| + | exported in the memory.failcnt file, but an OOM event count has to | ||
| + | be manually counted by listening to memory.oom_control events, and | ||
| + | lower boundary / soft limit events have to be counted by first | ||
| + | setting a threshold for that value and then counting those events. | ||
| + | Also, usage and limit files encode their units in the filename. | ||
| + | That makes the filenames very long, even though this is not | ||
| + | information that a user needs to be reminded of every time they type | ||
| + | out those names. | ||
| + | |||
| + | To address these naming issues, as well as to signal clearly that | ||
| + | the new interface carries a new configuration model, the naming | ||
| + | conventions in it necessarily differ from the old interface. | ||
| + | |||
| + | - The original limit files indicate the state of an unset limit with a | ||
| + | Very High Number, and a configured limit can be unset by echoing -1 | ||
| + | into those files. But that very high number is implementation and | ||
| + | architecture dependent and not very descriptive. And while -1 can | ||
| + | be understood as an underflow into the highest possible value, -2 or | ||
| + | -10M etc. do not work, so it's not consistent. | ||
| + | |||
| + | memory.low, memory.high, and memory.max will use the string "max" to | ||
| + | indicate and set the highest possible value. | ||
| 5. Planned Changes | 5. Planned Changes | ||
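
The "$MAJ:$MIN key=value ..." line format described for io.max under 5-4-1 above, together with the "max" convention for an unset limit, can be illustrated with a small user-space sketch. This is not kernel code and not an official tool: the function names parse_io_line and format_io_max are hypothetical, the device number 8:16 is only an example, and mapping "max" to None is an assumption made here for convenience.

```python
from typing import Dict, Optional, Tuple


def parse_io_line(line: str) -> Tuple[str, Dict[str, Optional[int]]]:
    """Parse one '$MAJ:$MIN key=value ...' line.

    Hypothetical helper: the string "max" (no limit) is mapped to None,
    every other value must be an integer.
    """
    device, *fields = line.split()
    major, sep, minor = device.partition(":")
    if not (sep and major.isdigit() and minor.isdigit()):
        raise ValueError(f"bad device number: {device!r}")
    values: Dict[str, Optional[int]] = {}
    for field in fields:
        key, sep, raw = field.partition("=")
        if not sep:
            raise ValueError(f"bad field: {field!r}")
        values[key] = None if raw == "max" else int(raw)
    return device, values


def format_io_max(device: str, **limits: Optional[int]) -> str:
    """Build a line in the io.max write format.

    Settings that are not passed are simply left out, which the format
    allows; None is written as "max".
    """
    parts = [device]
    for key, value in limits.items():
        parts.append(f"{key}={'max' if value is None else value}")
    return " ".join(parts)


if __name__ == "__main__":
    dev, vals = parse_io_line("8:16 rbps=2097152 wbps=max riops=max wiops=120")
    print(dev, vals)
    print(format_io_max("8:16", wbps=1048576, riops=None))
```

Because individual settings may be omitted or given in any order, the parser keeps the fields in a dictionary rather than relying on position, and the formatter emits only the keys it is given.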