linux:kernel:cgroup:単一階層構造

Old: linux:kernel:cgroup:単一階層構造 [2015/12/24 10:00] tenforward
New: linux:kernel:cgroup:単一階層構造 [2016/01/22 08:50] (current) tenforward
Line 1:
-====== Unified Hierarchy ======
 +=i===== Unified Hierarchy ======
  
 Synced with the 4.3 kernel documentation up to section 2-1.
Line 516:
 points of delegation and U0 would not have write access to its
 "cgroup.procs" and thus be denied with -EACCES.
 +
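 +In the previous hierarchies, a task can move a target into a cgroup if
 +it has write access to the cgroup's "tasks" or "cgroup.procs" file and
 +its uid matches the target's.  In the example above, U0 would be able
 +to move processes not only within each sub-hierarchy but also across
 +the two sub-hierarchies, which would in effect let it violate the
 +resource restrictions and organization implied by the hierarchical
 +structure above C0 and C1.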
 +
  
 5. Other Changes
Line 703 (old), line 711 (new):
   its value to indicate inheritance of the default value.
      
----(up to here)---
-
 5-4. Per-Controller Changes
  
-4-3-1. blkio
 +5-4-1. io
  
-- blk-throttle becomes properly hierarchical.
 +- blkio is renamed to io.  The interface is overhauled anyway.  The
 +  new name is more in line with the other two major controllers, cpu
 +  and memory, and better suited given that it may be used for cgroup
 +  writeback without involving the block layer.
  
-4-3-2. cpuset
 +- Everything including stat is always hierarchical making separate
 +  recursive stat files pointless and, as no internal node can have
 +  tasks, leaf weights are meaningless.  The operation model is
 +  simplified and the interface is overhauled accordingly.
 + 
 +  io.stat 
 + 
 + The stat file.  The reported stats are from the point where
 + bios are issued to request_queue.  The stats are counted
 + independent of which policies are enabled.  Each line in the
 + file has the following format.  More fields may later be
 + added at the end.
 +
 +   $MAJ:$MIN rbytes=$RBYTES wbytes=$WBYTES rios=$RIOS wios=$WIOS
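
As an illustration of the io.stat line format, here is a minimal Python sketch of a parser. The function name and the sample values are made up for this example and do not come from the kernel documentation. Because more fields may be appended later, the parser keeps any key=value pair it finds instead of expecting a fixed set.

```python
def parse_io_stat_line(line):
    """Parse one io.stat line into ((major, minor), {key: int})."""
    device, *fields = line.split()
    major, minor = (int(part) for part in device.split(":"))
    stats = {}
    for field in fields:
        key, _, value = field.partition("=")
        # Keep unknown keys too: more fields may be added at the end.
        stats[key] = int(value)
    return (major, minor), stats


# Hypothetical sample line; the numbers are invented.
sample = "8:0 rbytes=4096 wbytes=8192 rios=1 wios=2"
device, stats = parse_io_stat_line(sample)
print(device, stats["rbytes"], stats["wbytes"])
```

In practice the input would be each line read from the cgroup's io.stat file.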
 + 
 +  io.weight 
 + 
 + The weight setting, currently only available and effective if
 + cfq-iosched is in use for the target device.  The weight is
 + between 1 and 10000 and defaults to 100.  The first line
 + always contains the default weight, which is used when a
 + per-device setting is missing, in the following format.
 + 
 +   default $WEIGHT 
 + 
 + Subsequent lines list per-device weights in the following
 + format.
 + 
 +   $MAJ:$MIN $WEIGHT 
 + 
 + Writing "$WEIGHT" or "default $WEIGHT" changes the default
 + setting.  Writing "$MAJ:$MIN $WEIGHT" sets a per-device weight,
 + while writing "$MAJ:$MIN default" clears it.
 + 
 + This file is available only on non-root cgroups.
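
The read format of io.weight (a "default $WEIGHT" line followed by per-device lines) can be sketched in Python as follows. This is only an illustration under the format described above. The helper names and the sample contents are hypothetical.

```python
def parse_io_weight(text):
    """Return (default_weight, {(major, minor): weight}) from io.weight contents."""
    lines = text.strip().splitlines()
    keyword, default_weight = lines[0].split()
    if keyword != "default":
        raise ValueError("first line must be 'default $WEIGHT'")
    per_device = {}
    for line in lines[1:]:
        device, weight = line.split()
        major, minor = (int(part) for part in device.split(":"))
        per_device[(major, minor)] = int(weight)
    return int(default_weight), per_device


def effective_weight(text, device):
    """Weight that applies to `device`, falling back to the default."""
    default_weight, per_device = parse_io_weight(text)
    return per_device.get(device, default_weight)


# Hypothetical file contents: default 100, device 8:0 overridden to 200.
sample = "default 100\n8:0 200\n"
print(effective_weight(sample, (8, 0)))   # 200
print(effective_weight(sample, (8, 16)))  # 100
```

The fallback in effective_weight mirrors the statement that the default weight is used when a per-device setting is missing.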
 + 
 +  io.max 
 + 
 + The maximum bandwidth and/or iops setting, only available if
 + blk-throttle is enabled.  The file is of the following format.
 + 
 +   $MAJ:$MIN rbps=$RBPS wbps=$WBPS riops=$RIOPS wiops=$WIOPS 
 + 
 + ${R|W}BPS are read/write bytes per second and ${R|W}IOPS are
 + read/write IOs per second.  "max" indicates no limit.  Writing
 + to the file follows the same format but the individual
 + settings may be omitted or specified in any order.
 + 
 + This file is available only on non-root cgroups.
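
The io.max format, including the "max" keyword and the freedom to omit or reorder settings on write, might be handled as in the following Python sketch. The function names are hypothetical, and using None to stand for "no limit" is a choice made for this example only.

```python
IO_MAX_KEYS = ("rbps", "wbps", "riops", "wiops")


def format_io_max(major, minor, **limits):
    """Build a line to write to io.max; a value of None means "max" (no limit)."""
    parts = [f"{major}:{minor}"]
    for key, value in limits.items():
        if key not in IO_MAX_KEYS:
            raise ValueError(f"unknown io.max key: {key}")
        parts.append(f"{key}={'max' if value is None else int(value)}")
    return " ".join(parts)


def parse_io_max_line(line):
    """Parse one io.max line into ((major, minor), {key: int or None})."""
    device, *fields = line.split()
    major, minor = (int(part) for part in device.split(":"))
    limits = {}
    for field in fields:
        key, _, value = field.partition("=")
        limits[key] = None if value == "max" else int(value)
    return (major, minor), limits


# Only two of the four settings are given, in arbitrary order.
print(format_io_max(8, 0, wbps=1048576, riops=None))
```

Settings that are left out of the written line are simply not mentioned, which matches the rule that individual settings may be omitted.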
 + 
 +5-4-2. cpuset
  
 - Tasks are kept in empty cpusets after hotplug and take on the masks
   of the nearest non-empty ancestor, instead of being moved to it.
-
-- Tasks in empty cpusets are kept after hotplug and take on the masks
-  of the nearest ancestor, instead of being moved to it.
  
 - A task can be moved into an empty cpuset, and again it takes on the
   masks of the nearest non-empty ancestor.
  
-- A task can be moved into an empty cpuset, and in that case too it
-  takes on the masks of the nearest non-empty ancestor.
  
-4-3-3. memory
 +5-4-3. memory
  
 - use_hierarchy is on by default and the cgroup file for the flag is
   not created.
  
-- use_hierarchy is on by default.  The cgroup file for this flag is
-  not created.
 +- The original lower boundary, the soft limit, is defined as a limit
 +  that is per default unset.  As a result, the set of cgroups that 
 +  global reclaim prefers is opt-in, rather than opt-out.  The costs 
 +  for optimizing these mostly negative lookups are so high that the 
 +  implementation, despite its enormous size, does not even provide the 
 +  basic desirable behavior.  First off, the soft limit has no 
 +  hierarchical meaning.  All configured groups are organized in a 
 +  global rbtree and treated like equal peers, regardless of where they
 +  are located in the hierarchy.  This makes subtree delegation
 +  impossible.  Second, the soft limit reclaim pass is so aggressive
 +  that it not only introduces high allocation latencies into the
 +  system, but also impacts system performance due to overreclaim, to 
 +  the point where the feature becomes self-defeating. 
 + 
 +  The memory.low boundary on the other hand is a top-down allocated 
 +  reserve.  A cgroup enjoys reclaim protection when it and all its
 +  ancestors are below their low boundaries, which makes delegation of 
 +  subtrees possible.  Secondly, new cgroups have no reserve per 
 +  default and in the common case most cgroups are eligible for the 
 +  preferred reclaim pass.  This allows the new low boundary to be 
 +  efficiently implemented with just a minor addition to the generic 
 +  reclaim code, without the need for out-of-band data structures and 
 +  reclaim passes.  Because the generic reclaim code considers all 
 +  cgroups except for the ones running low in the preferred first 
 +  reclaim pass, overreclaim of individual groups is eliminated as 
 +  well, resulting in much better overall workload performance. 
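
The protection rule for memory.low described above (a cgroup is protected only while it and all of its ancestors are below their low boundaries) can be modelled with a short Python sketch. This is a toy model, not the kernel's implementation. The Cgroup class, its fields and the numbers are assumptions made for illustration, and the root cgroup is deliberately left out of the model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cgroup:
    usage: int                         # current memory usage
    low: int = 0                       # memory.low; no reserve by default
    parent: Optional["Cgroup"] = None  # None marks the top of the modelled subtree


def is_reclaim_protected(cgroup):
    """True if the cgroup and all its modelled ancestors are below their low boundaries."""
    node = cgroup
    while node is not None:
        if node.usage >= node.low:
            return False
        node = node.parent
    return True


# Hypothetical subtree; units are arbitrary.
parent = Cgroup(usage=300, low=512)
print(is_reclaim_protected(Cgroup(usage=100, low=128, parent=parent)))  # True
print(is_reclaim_protected(Cgroup(usage=200, low=128, parent=parent)))  # False
print(is_reclaim_protected(Cgroup(usage=50, parent=parent)))            # False
```

The last case shows why most cgroups stay eligible for the preferred reclaim pass: with no reserve configured, a cgroup is never below its low boundary.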
 + 
 +- The original high boundary, the hard limit, is defined as a strict 
 +  limit that can not budge, even if the OOM killer has to be called. 
 +  But this generally goes against the goal of making the most out of 
 +  the available memory.  The memory consumption of workloads varies 
 +  during runtime, and that requires users to overcommit.  But doing 
 +  that with a strict upper limit requires either a fairly accurate 
 +  prediction of the working set size or adding slack to the limit. 
 +  Since working set size estimation is hard and error prone, and 
 +  getting it wrong results in OOM kills, most users tend to err on the 
 +  side of a looser limit and end up wasting precious resources. 
 + 
 +  The memory.high boundary on the other hand can be set much more 
 +  conservatively.  When hit, it throttles allocations by forcing them 
 +  into direct reclaim to work off the excess, but it never invokes the 
 +  OOM killer.  As a result, a high boundary that is chosen too 
 +  aggressively will not terminate the processes, but instead it will 
 +  lead to gradual performance degradation.  The user can monitor this 
 +  and make corrections until the minimal memory footprint that still 
 +  gives acceptable performance is found. 
 + 
 +  In extreme cases, with many concurrent allocations and a complete 
 +  breakdown of reclaim progress within the group, the high boundary 
 +  can be exceeded.  But even then it's mostly better to satisfy the 
 +  allocation from the slack available in other groups or the rest of 
 +  the system than killing the group.  Otherwise, memory.max is there 
 +  to limit this type of spillover and ultimately contain buggy or even 
 +  malicious applications. 
 + 
 +- The original control file names are unwieldy and inconsistent in 
 +  many different ways.  For example, the upper boundary hit count is 
 +  exported in the memory.failcnt file, but an OOM event count has to 
 +  be manually counted by listening to memory.oom_control events, and 
 +  lower boundary / soft limit events have to be counted by first 
 +  setting a threshold for that value and then counting those events. 
 +  Also, usage and limit files encode their units in the filename. 
 +  That makes the filenames very long, even though this is not 
 +  information that a user needs to be reminded of every time they type 
 +  out those names. 
 + 
 +  To address these naming issues, as well as to signal clearly that 
 +  the new interface carries a new configuration model, the naming 
 +  conventions in it necessarily differ from the old interface. 
 + 
 +- The original limit files indicate the state of an unset limit with a 
 +  Very High Number, and a configured limit can be unset by echoing -1 
 +  into those files.  But that very high number is implementation and 
 +  architecture dependent and not very descriptive.  And while -1 can 
 +  be understood as an underflow into the highest possible value, -2 or 
 +  -10M etc. do not work, so it's not consistent. 
 + 
 +  memory.low, memory.high, and memory.max will use the string "max" to 
 +  indicate and set the highest possible value. 
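
A consumer of memory.low, memory.high and memory.max therefore has to accept the string "max" next to plain numbers. The following Python sketch shows one way to do that. Representing "max" as math.inf and the helper names are choices made for this example, not something the interface prescribes.

```python
import math


def parse_limit(text):
    """Parse the contents of memory.low, memory.high or memory.max."""
    value = text.strip()
    return math.inf if value == "max" else int(value)


def format_limit(limit):
    """Turn a parsed limit back into the string to write to the file."""
    return "max" if limit == math.inf else str(int(limit))


print(parse_limit("max\n"))
print(parse_limit("1048576\n"))
print(format_limit(math.inf))
```

Treating "max" as infinity keeps comparisons such as usage < limit working without a special case.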
  
 5. Planned Changes
  • linux/kernel/cgroup/単一階層構造.1450951205.txt.gz
  • Last updated: 2015/12/24 10:00
  • by tenforward