X-Git-Url: http://nv-tegra.nvidia.com/gitweb/?p=linux-2.6.git;a=blobdiff_plain;f=Documentation%2Fcontrollers%2Fmemory.txt;h=9fe2d0eabe05d9938e79c9b51a766fb0fbdf7d28;hp=b5bbea92a61ad1c0c8ef2d6677cf8e91f7bbc9c3;hb=d13d144309d2e5a3e6ad978b16c1d0226ddc9231;hpb=dfc05c259e424e4160c66eab728f55cc4b53fd75 diff --git a/Documentation/controllers/memory.txt b/Documentation/controllers/memory.txt index b5bbea9..9fe2d0e 100644 --- a/Documentation/controllers/memory.txt +++ b/Documentation/controllers/memory.txt @@ -1,4 +1,8 @@ -Memory Controller +Memory Resource Controller + +NOTE: The Memory Resource Controller has been generically been referred +to as the memory controller in this document. Do not confuse memory controller +used here with the memory controller that is used in hardware. Salient features @@ -108,14 +112,22 @@ the per cgroup LRU. 2.2.1 Accounting details -All mapped pages (RSS) and unmapped user pages (Page Cache) are accounted. -RSS pages are accounted at the time of page_add_*_rmap() unless they've already -been accounted for earlier. A file page will be accounted for as Page Cache; -it's mapped into the page tables of a process, duplicate accounting is carefully -avoided. Page Cache pages are accounted at the time of add_to_page_cache(). -The corresponding routines that remove a page from the page tables or removes -a page from Page Cache is used to decrement the accounting counters of the -cgroup. +All mapped anon pages (RSS) and cache pages (Page Cache) are accounted. +(some pages which never be reclaimable and will not be on global LRU + are not accounted. we just accounts pages under usual vm management.) + +RSS pages are accounted at page_fault unless they've already been accounted +for earlier. A file page will be accounted for as Page Cache when it's +inserted into inode (radix-tree). While it's mapped into the page tables of +processes, duplicate accounting is carefully avoided. + +A RSS page is unaccounted when it's fully unmapped. A PageCache page is +unaccounted when it's removed from radix-tree. + +At page migration, accounting information is kept. + +Note: we just account pages-on-lru because our purpose is to control amount +of used pages. not-on-lru pages are tend to be out-of-control from vm view. 2.3 Shared Page Accounting @@ -125,6 +137,11 @@ behind this approach is that a cgroup that aggressively uses a shared page will eventually get charged for it (once it is uncharged from the cgroup that brought it in -- this will happen on memory pressure). +Exception: When you do swapoff and make swapped-out pages of shmem(tmpfs) to +be backed into memory in force, charges for pages are accounted against the +caller of swapoff rather than the users of shmem. + + 2.4 Reclaim Each cgroup maintains a per cgroup LRU that consists of an active @@ -152,7 +169,7 @@ The memory controller uses the following hierarchy a. Enable CONFIG_CGROUPS b. Enable CONFIG_RESOURCE_COUNTERS -c. Enable CONFIG_CGROUP_MEM_CONT +c. Enable CONFIG_CGROUP_MEM_RES_CTLR 1. Prepare the cgroups # mkdir -p /cgroups @@ -164,20 +181,20 @@ c. Enable CONFIG_CGROUP_MEM_CONT Since now we're in the 0 cgroup, We can alter the memory limit: -# echo -n 4M > /cgroups/0/memory.limit_in_bytes +# echo 4M > /cgroups/0/memory.limit_in_bytes NOTE: We can use a suffix (k, K, m, M, g or G) to indicate values in kilo, mega or gigabytes. # cat /cgroups/0/memory.limit_in_bytes -4194304 Bytes +4194304 NOTE: The interface has now changed to display the usage in bytes instead of pages We can check the usage: # cat /cgroups/0/memory.usage_in_bytes -1216512 Bytes +1216512 A successful write to this file does not guarantee a successful set of this limit to the value written into the file. This can be due to a @@ -185,9 +202,9 @@ number of factors, such as rounding up to page boundaries or the total availability of memory on the system. The user is required to re-read this file after a write to guarantee the value committed by the kernel. -# echo -n 1 > memory.limit_in_bytes +# echo 1 > memory.limit_in_bytes # cat memory.limit_in_bytes -4096 Bytes +4096 The memory.failcnt field gives the number of times that the cgroup limit was exceeded. @@ -195,12 +212,6 @@ exceeded. The memory.stat file gives accounting information. Now, the number of caches, RSS and Active pages/Inactive pages are shown. -The memory.force_empty gives an interface to drop *all* charges by force. - -# echo -n 1 > memory.force_empty - -will drop all charges in cgroup. Currently, this is maintained for test. - 4. Testing Balbir posted lmbench, AIM9, LTP and vmmstress results [10] and [11]. @@ -230,23 +241,36 @@ reclaimed. A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a cgroup might have some charge associated with it, even though all -tasks have migrated away from it. Such charges are automatically dropped at -rmdir() if there are no tasks. +tasks have migrated away from it. +Such charges are freed(at default) or moved to its parent. When moved, +both of RSS and CACHES are moved to parent. +If both of them are busy, rmdir() returns -EBUSY. See 5.1 Also. + +5. Misc. interfaces. + +5.1 force_empty + memory.force_empty interface is provided to make cgroup's memory usage empty. + You can use this interface only when the cgroup has no tasks. + When writing anything to this + + # echo 0 > memory.force_empty -4.4 Choosing what to account -- Page Cache (unmapped) vs RSS (mapped)? + Almost all pages tracked by this memcg will be unmapped and freed. Some of + pages cannot be freed because it's locked or in-use. Such pages are moved + to parent and this cgroup will be empty. But this may return -EBUSY in + some too busy case. -The type of memory accounted by the cgroup can be limited to just -mapped pages by writing "1" to memory.control_type field + Typical use case of this interface is that calling this before rmdir(). + Because rmdir() moves all pages to parent, some out-of-use page caches can be + moved to the parent. If you want to avoid that, force_empty will be useful. -echo -n 1 > memory.control_type -5. TODO +6. TODO 1. Add support for accounting huge pages (as a separate controller) 2. Make per-cgroup scanner reclaim not-shared pages first 3. Teach controller to account for shared-pages -4. Start reclamation when the limit is lowered -5. Start reclamation in the background when the limit is +4. Start reclamation in the background when the limit is not yet hit but the usage is getting closer Summary @@ -262,18 +286,19 @@ References 3. Emelianov, Pavel. Resource controllers based on process cgroups http://lkml.org/lkml/2007/3/6/198 4. Emelianov, Pavel. RSS controller based on process cgroups (v2) - http://lkml.org/lkml/2007/4/9/74 + http://lkml.org/lkml/2007/4/9/78 5. Emelianov, Pavel. RSS controller based on process cgroups (v3) http://lkml.org/lkml/2007/5/30/244 6. Menage, Paul. Control Groups v10, http://lwn.net/Articles/236032/ 7. Vaidyanathan, Srinivasan, Control Groups: Pagecache accounting and control subsystem (v3), http://lwn.net/Articles/235534/ -8. Singh, Balbir. RSS controller V2 test results (lmbench), +8. Singh, Balbir. RSS controller v2 test results (lmbench), http://lkml.org/lkml/2007/5/17/232 -9. Singh, Balbir. RSS controller V2 AIM9 results +9. Singh, Balbir. RSS controller v2 AIM9 results http://lkml.org/lkml/2007/5/18/1 -10. Singh, Balbir. Memory controller v6 results, +10. Singh, Balbir. Memory controller v6 test results, http://lkml.org/lkml/2007/8/19/36 -11. Singh, Balbir. Memory controller v6, http://lkml.org/lkml/2007/8/17/69 +11. Singh, Balbir. Memory controller introduction (v6), + http://lkml.org/lkml/2007/8/17/69 12. Corbet, Jonathan, Controlling memory use in cgroups, http://lwn.net/Articles/243795/