.. _numa_memory_policy:

==================
NUMA Memory Policy
==================

What is NUMA Memory Policy?
============================

In the Linux kernel, "memory policy" determines from which node the kernel will
allocate memory in a NUMA system or in an emulated NUMA system. Linux has
supported platforms with Non-Uniform Memory Access architectures since 2.4.?.
The current memory policy support was added to Linux 2.6 around May 2004. This
document attempts to describe the concepts and APIs of the 2.6 memory policy
support.

Memory policies should not be confused with cpusets
(``Documentation/cgroup-v1/cpusets.txt``),
which are an administrative mechanism for restricting the nodes from which
memory may be allocated by a set of processes. Memory policies are a
programming interface that a NUMA-aware application can take advantage of. When
both cpusets and policies are applied to a task, the restrictions of the cpuset
take priority. See :ref:`Memory Policies and cpusets <mem_pol_and_cpusets>`
below for more details.

Memory Policy Concepts
======================

Scope of Memory Policies
------------------------

The Linux kernel supports *scopes* of memory policy, described here from
most general to most specific:

System Default Policy
    this policy is "hard coded" into the kernel. It is the policy
    that governs all page allocations that aren't controlled by
    one of the more specific policy scopes discussed below. When
    the system is "up and running", the system default policy will
    use "local allocation" described below. However, during boot
    up, the system default policy will be set to interleave
    allocations across all nodes with "sufficient" memory, so as
    not to overload the initial boot node with boot-time
    allocations.

Task/Process Policy
    this is an optional, per-task policy. When defined for a
    specific task, this policy controls all page allocations made
    by or on behalf of the task that aren't controlled by a more
    specific scope. If a task does not define a task policy, then
    all page allocations that would have been controlled by the
    task policy "fall back" to the System Default Policy.

    The task policy applies to the entire address space of a task. Thus,
    it is inheritable, and indeed is inherited, across both fork()
    [clone() w/o the CLONE_VM flag] and exec*(). This allows a parent task
    to establish the task policy for a child task exec()'d from an
    executable image that has no awareness of memory policy. See the
    :ref:`Memory Policy APIs <memory_policy_apis>` section,
    below, for an overview of the system call
    that a task may use to set/change its task/process policy.

    In a multi-threaded task, task policies apply only to the thread
    [Linux kernel task] that installs the policy and any threads
    subsequently created by that thread. Any sibling threads existing
    at the time a new task policy is installed retain their current
    policy.

    A task policy applies only to pages allocated after the policy is
    installed. Any pages already faulted in by the task when the task
    changes its task policy remain where they were allocated based on
    the policy at the time they were allocated.

.. _vma_policy:

VMA Policy
    A "VMA" or "Virtual Memory Area" refers to a range of a task's
    virtual address space. A task may define a specific policy for a range
    of its virtual address space. See the
    :ref:`Memory Policy APIs <memory_policy_apis>` section,
    below, for an overview of the mbind() system call used to set a VMA
    policy.

    A VMA policy will govern the allocation of pages that back
    this region of the address space. Any regions of the task's
    address space that don't have an explicit VMA policy will fall
    back to the task policy, which may itself fall back to the
    System Default Policy.

    VMA policies have a few complicating details:

    * VMA policy applies ONLY to anonymous pages. These include
      pages allocated for anonymous segments, such as the task
      stack and heap, and any regions of the address space
      mmap()ed with the MAP_ANONYMOUS flag. If a VMA policy is
      applied to a file mapping, it will be ignored if the mapping
      used the MAP_SHARED flag. If the file mapping used the
      MAP_PRIVATE flag, the VMA policy will only be applied when
      an anonymous page is allocated on an attempt to write to the
      mapping-- i.e., at Copy-On-Write.

    * VMA policies are shared between all tasks that share a
      virtual address space--a.k.a. threads--independent of when
      the policy is installed; and they are inherited across
      fork(). However, because VMA policies refer to a specific
      region of a task's address space, and because the address
      space is discarded and recreated on exec*(), VMA policies
      are NOT inheritable across exec(). Thus, only NUMA-aware
      applications may use VMA policies.

    * A task may install a new VMA policy on a sub-range of a
      previously mmap()ed region. When this happens, Linux splits
      the existing virtual memory area into 2 or 3 VMAs, each with
      its own policy.

    * By default, VMA policy applies only to pages allocated after
      the policy is installed. Any pages already faulted into the
      VMA range remain where they were allocated based on the
      policy at the time they were allocated. However, since
      2.6.16, Linux supports page migration via the mbind() system
      call, so that page contents can be moved to match a newly
      installed policy.

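The sub-range and migration behaviour described for VMA policies can be
sketched from user space roughly as follows. This is a minimal sketch, not a
reference implementation: ``bind_second_half()`` is a hypothetical helper, the
system call is issued through the raw ``syscall()`` interface, and the
``MPOL_*`` values are repeated from <linux/mempolicy.h> so that the example
builds without the user space NUMA headers. It assumes a Linux system; on
kernels without NUMA support, or in sandboxes that filter mbind(), the call
is expected to fail and the helper simply reports the error.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Values as defined in <linux/mempolicy.h>, repeated for self-containment. */
#ifndef MPOL_BIND
#define MPOL_BIND 2
#endif
#ifndef MPOL_MF_MOVE
#define MPOL_MF_MOVE (1 << 1)
#endif

/*
 * Hypothetical helper: map `npages` anonymous pages, fault them in under the
 * current policy, then install a Bind policy for `node` on the second half
 * of the mapping only and ask the kernel to migrate pages that are already
 * resident there. Returns 0 on success, -1 with errno set on failure.
 */
static int bind_second_half(size_t npages, unsigned int node)
{
    unsigned long nodemask;
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = npages * page;
    size_t half = (npages / 2) * page;
    char *p;
    long rc;
    int saved;

    /* Keep clear of the top bit; see mbind(2) for how maxnode is read. */
    if (npages < 2 || node >= 8 * sizeof(nodemask) - 1) {
        errno = EINVAL;
        return -1;
    }

    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    /* Touch every page so it is allocated before the policy exists. */
    for (size_t i = 0; i < len; i += page)
        p[i] = 1;

    /* Policy on a sub-range: the kernel splits the VMA as described above. */
    nodemask = 1UL << node;
    rc = syscall(SYS_mbind, p + half, len - half, (long)MPOL_BIND,
                 &nodemask, 8 * sizeof(nodemask),
                 (unsigned long)MPOL_MF_MOVE);

    saved = errno;
    munmap(p, len);
    errno = saved;
    return rc == 0 ? 0 : -1;
}
```

A caller might try ``bind_second_half(4, 0)`` and treat a return of -1 as
"policy not installed", checking ``errno`` for the reason.
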
Shared Policy
    Conceptually, shared policies apply to "memory objects" mapped
    shared into one or more tasks' distinct address spaces. An
    application installs shared policies the same way as VMA
    policies--using the mbind() system call specifying a range of
    virtual addresses that map the shared object. However, unlike
    VMA policies, which can be considered to be an attribute of a
    range of a task's address space, shared policies apply
    directly to the shared object. Thus, all tasks that attach to
    the object share the policy, and all pages allocated for the
    shared object, by any task, will obey the shared policy.

    As of 2.6.22, only shared memory segments, created by shmget() or
    mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy. When shared
    policy support was added to Linux, the associated data structures were
    added to hugetlbfs shmem segments. At the time, hugetlbfs did not
    support allocation at fault time--a.k.a. lazy allocation--so hugetlbfs
    shmem segments were never "hooked up" to the shared policy support.
    Although hugetlbfs segments now support lazy allocation, their support
    for shared policy has not been completed.

    As mentioned above in the :ref:`VMA policies <vma_policy>` section,
    allocations of page cache pages for regular files mmap()ed
    with MAP_SHARED ignore any VMA policy installed on the virtual
    address range backed by the shared file mapping. Rather,
    shared page cache pages, including pages backing private
    mappings that have not yet been written by the task, follow
    task policy, if any, else System Default Policy.

    The shared policy infrastructure supports different policies on subset
    ranges of the shared object. However, Linux still splits the VMA of
    the task that installs the policy for each range of distinct policy.
    Thus, different tasks that attach to a shared memory segment can have
    different VMA configurations mapping that one shared object. This
    can be seen by examining the /proc/<pid>/numa_maps of tasks sharing
    a shared memory region, when one task has installed shared policy on
    one or more ranges of the region.

Components of Memory Policies
-----------------------------

A NUMA memory policy consists of a "mode", optional mode flags, and
an optional set of nodes. The mode determines the behavior of the
policy, the optional mode flags determine the behavior of the mode,
and the optional set of nodes can be viewed as the arguments to the
policy behavior.

Internally, memory policies are implemented by a reference counted
structure, struct mempolicy. Details of this structure will be
discussed in context, below, as required to explain the behavior.

NUMA memory policy supports the following 4 behavioral modes:

Default Mode--MPOL_DEFAULT
    This mode is only used in the memory policy APIs. Internally,
    MPOL_DEFAULT is converted to the NULL memory policy in all
    policy scopes. Any existing non-default policy will simply be
    removed when MPOL_DEFAULT is specified. As a result,
    MPOL_DEFAULT means "fall back to the next most specific policy
    scope."

    For example, a NULL or default task policy will fall back to the
    system default policy. A NULL or default vma policy will fall
    back to the task policy.

    When specified in one of the memory policy APIs, the Default mode
    does not use the optional set of nodes.

    It is an error for the set of nodes specified for this policy to
    be non-empty.

MPOL_BIND
    This mode specifies that memory must come from the set of
    nodes specified by the policy. Memory will be allocated from
    the node in the set with sufficient free memory that is
    closest to the node where the allocation takes place.

MPOL_PREFERRED
    This mode specifies that the allocation should be attempted
    from the single node specified in the policy. If that
    allocation fails, the kernel will search other nodes, in order
    of increasing distance from the preferred node based on
    information provided by the platform firmware.

    Internally, the Preferred policy uses a single node--the
    preferred_node member of struct mempolicy. When the internal
    mode flag MPOL_F_LOCAL is set, the preferred_node is ignored
    and the policy is interpreted as local allocation. "Local"
    allocation policy can be viewed as a Preferred policy that
    starts at the node containing the cpu where the allocation
    takes place.

    It is possible for the user to specify that local allocation
    is always preferred by passing an empty nodemask with this
    mode. If an empty nodemask is passed, the policy cannot use
    the MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES flags
    described below.

MPOL_INTERLEAVE
    This mode specifies that page allocations be interleaved, on a
    page granularity, across the nodes specified in the policy.
    This mode also behaves slightly differently, based on the
    context where it is used:

    For allocation of anonymous pages and shared memory pages,
    Interleave mode indexes the set of nodes specified by the
    policy using the page offset of the faulting address into the
    segment [VMA] containing the address modulo the number of
    nodes specified by the policy. It then attempts to allocate a
    page, starting at the selected node, as if the node had been
    specified by a Preferred policy or had been selected by a
    local allocation. That is, allocation will follow the per
    node zonelist.

    For allocation of page cache pages, Interleave mode indexes
    the set of nodes specified by the policy using a node counter
    maintained per task. This counter wraps around to the lowest
    specified node after it reaches the highest specified node.
    This will tend to spread the pages out over the nodes
    specified by the policy based on the order in which they are
    allocated, rather than based on any page offset into an
    address range or file. During system boot up, the temporary
    interleaved system default policy works in this mode.

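The two Interleave node-selection rules described above can be modelled in a
few lines of user space C. This is a simplified sketch and not the kernel's
code: node sets are assumed to fit in a 64-bit mask, and the function names
(``interleave_by_offset()``, ``interleave_next()``) are hypothetical. The
sketch only picks the starting node; the subsequent zonelist fallback is not
modelled.

```c
#include <stdint.h>

/* Number of nodes set in the mask. */
static unsigned int mask_weight(uint64_t mask)
{
    unsigned int n = 0;

    for (; mask; mask &= mask - 1)
        n++;
    return n;
}

/* Id of the n-th (0-based) node set in the mask, or -1 if there is none. */
static int nth_set_node(uint64_t mask, unsigned int n)
{
    for (int node = 0; node < 64; node++) {
        if (mask & (UINT64_C(1) << node)) {
            if (n == 0)
                return node;
            n--;
        }
    }
    return -1;
}

/*
 * Anonymous and shared memory pages: index the node set with the page
 * offset into the VMA, modulo the number of nodes in the policy.
 */
static int interleave_by_offset(uint64_t mask, uint64_t page_offset)
{
    unsigned int w = mask_weight(mask);

    if (w == 0)
        return -1;
    return nth_set_node(mask, (unsigned int)(page_offset % w));
}

/*
 * Page cache pages: advance a per-task counter to the next node in the
 * set, wrapping from the highest node back to the lowest. Pass -1 as
 * `prev` for the first allocation.
 */
static int interleave_next(uint64_t mask, int prev)
{
    for (int i = 1; i <= 64; i++) {
        int node = (prev + i) % 64;

        if (mask & (UINT64_C(1) << node))
            return node;
    }
    return -1;
}
```

With a policy over nodes 1, 3 and 5 (mask ``0x2A``), page offsets 0, 1, 2, 3
select nodes 1, 3, 5, 1, while the counter-based variant steps 1, 3, 5, 1
in allocation order regardless of offset.
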
NUMA memory policy supports the following optional mode flags:

MPOL_F_STATIC_NODES
    This flag specifies that the nodemask passed by
    the user should not be remapped if the task or VMA's set of allowed
    nodes changes after the memory policy has been defined.

    Without this flag, any time a mempolicy is rebound because of a
    change in the set of allowed nodes, the node (Preferred) or
    nodemask (Bind, Interleave) is remapped to the new set of
    allowed nodes. This may result in nodes being used that were
    previously undesired.

    With this flag, if the user-specified nodes overlap with the
    nodes allowed by the task's cpuset, then the memory policy is
    applied to their intersection. If the two sets of nodes do not
    overlap, the Default policy is used.

    For example, consider a task that is attached to a cpuset with
    mems 1-3 that sets an Interleave policy over the same set. If
    the cpuset's mems change to 3-5, the Interleave will now occur
    over nodes 3, 4, and 5. With this flag, however, since only node
    3 is allowed from the user's nodemask, the "interleave" only
    occurs over that node. If no nodes from the user's nodemask are
    now allowed, the Default behavior is used.

    MPOL_F_STATIC_NODES cannot be combined with the
    MPOL_F_RELATIVE_NODES flag. It also cannot be used for
    MPOL_PREFERRED policies that were created with an empty nodemask
    (local allocation).

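The difference between the default rebind and MPOL_F_STATIC_NODES can be
illustrated with plain bit masks. The sketch below is a user space model under
stated assumptions, not kernel code: node sets fit in 64 bits,
``remap_nodes()`` is a hypothetical positional remap (the n-th node of the old
allowed set maps to the n-th node of the new one), and ``static_nodes()`` is
simply the intersection described above, where an empty result stands for
"use the Default policy".

```c
#include <stdint.h>

/* Number of nodes set in the mask. */
static unsigned int mask_weight(uint64_t mask)
{
    unsigned int n = 0;

    for (; mask; mask &= mask - 1)
        n++;
    return n;
}

/* Id of the n-th (0-based) node set in the mask, or -1 if there is none. */
static int nth_set_node(uint64_t mask, unsigned int n)
{
    for (int node = 0; node < 64; node++) {
        if (mask & (UINT64_C(1) << node)) {
            if (n == 0)
                return node;
            n--;
        }
    }
    return -1;
}

/*
 * Default rebind (no flag), modelled as a positional remap: a policy node
 * that was the n-th node of the old allowed set becomes the n-th node of
 * the new allowed set. Nodes outside the old set are left unchanged.
 */
static uint64_t remap_nodes(uint64_t policy, uint64_t old_allowed,
                            uint64_t new_allowed)
{
    unsigned int w = mask_weight(new_allowed);
    uint64_t out = 0;

    if (w == 0)
        return policy;

    for (int node = 0; node < 64; node++) {
        uint64_t bit = UINT64_C(1) << node;

        if (!(policy & bit))
            continue;
        if (!(old_allowed & bit)) {
            out |= bit;
        } else {
            unsigned int pos = mask_weight(old_allowed & (bit - 1));

            out |= UINT64_C(1) << nth_set_node(new_allowed, pos % w);
        }
    }
    return out;
}

/* MPOL_F_STATIC_NODES: keep the user's nodes, limited to what is allowed. */
static uint64_t static_nodes(uint64_t user, uint64_t allowed)
{
    return user & allowed;  /* 0 means: fall back to Default behavior */
}
```

For the example in the text, the user mask 1-3 is ``0x0E`` and the new cpuset
3-5 is ``0x38``: the positional remap yields ``0x38`` (nodes 3, 4, 5), whereas
the static variant yields ``0x08`` (node 3 only).
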
MPOL_F_RELATIVE_NODES
    This flag specifies that the nodemask passed
    by the user will be mapped relative to the task's or VMA's
    set of allowed nodes. The kernel stores the user-passed nodemask,
    and if the set of allowed nodes changes, then that original nodemask
    will be remapped relative to the new set of allowed nodes.

    Without this flag (and without MPOL_F_STATIC_NODES), anytime a
    mempolicy is rebound because of a change in the set of allowed
    nodes, the node (Preferred) or nodemask (Bind, Interleave) is
    remapped to the new set of allowed nodes. That remap may not
    preserve the relative nature of the user's passed nodemask to its
    set of allowed nodes upon successive rebinds: a nodemask of
    1,3,5 may be remapped to 7-9 and then to 1-3 if the set of
    allowed nodes is restored to its original state.

    With this flag, the remap is done so that the node numbers from
    the user's passed nodemask are relative to the set of allowed
    nodes. In other words, if nodes 0, 2, and 4 are set in the user's
    nodemask, the policy will be applied over the first (and in the
    Bind or Interleave case, the third and fifth) nodes in the set of
    allowed nodes. The nodemask passed by the user represents nodes
    relative to the task's or VMA's set of allowed nodes.

    If the user's nodemask includes nodes that are outside the range
    of the new set of allowed nodes (for example, node 5 is set in
    the user's nodemask when the set of allowed nodes is only 0-3),
    then the remap wraps around to the beginning of the nodemask and,
    if not already set, sets the node in the mempolicy nodemask.

    For example, consider a task that is attached to a cpuset with
    mems 2-5 that sets an Interleave policy over the same set with
    MPOL_F_RELATIVE_NODES. If the cpuset's mems change to 3-7, the
    interleave now occurs over nodes 3,5-7. If the cpuset's mems
    then change to 0,2-3,5, then the interleave occurs over nodes
    0,2-3,5.

    Thanks to the consistent remapping, applications preparing
    nodemasks to specify memory policies using this flag should
    disregard their current, actual cpuset imposed memory placement
    and prepare the nodemask as if they were always located on
    memory nodes 0 to N-1, where N is the number of memory nodes the
    policy is intended to manage. Let the kernel then remap to the
    set of memory nodes allowed by the task's cpuset, as that may
    change over time.

    MPOL_F_RELATIVE_NODES cannot be combined with the
    MPOL_F_STATIC_NODES flag. It also cannot be used for
    MPOL_PREFERRED policies that were created with an empty nodemask
    (local allocation).

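The relative remap, including the wrap-around, can be reproduced with a small
user space model. This is a sketch under assumptions rather than the kernel's
implementation: node sets fit in 64 bits, and the helper names
(``fold_nodes()``, ``onto_nodes()``, ``relative_nodes()``) are hypothetical.
The user's mask is first folded modulo the number of allowed nodes, and each
resulting position n is then mapped to the n-th allowed node.

```c
#include <stdint.h>

/* Number of nodes set in the mask. */
static unsigned int mask_weight(uint64_t mask)
{
    unsigned int n = 0;

    for (; mask; mask &= mask - 1)
        n++;
    return n;
}

/* Id of the n-th (0-based) node set in the mask, or -1 if there is none. */
static int nth_set_node(uint64_t mask, unsigned int n)
{
    for (int node = 0; node < 64; node++) {
        if (mask & (UINT64_C(1) << node)) {
            if (n == 0)
                return node;
            n--;
        }
    }
    return -1;
}

/* Wrap every user node into the range 0..w-1 (the "wraps around" step). */
static uint64_t fold_nodes(uint64_t user, unsigned int w)
{
    uint64_t out = 0;

    for (unsigned int node = 0; node < 64; node++)
        if (user & (UINT64_C(1) << node))
            out |= UINT64_C(1) << (node % w);
    return out;
}

/* Map relative position n to the n-th node of the allowed set. */
static uint64_t onto_nodes(uint64_t relative, uint64_t allowed)
{
    uint64_t out = 0;

    for (unsigned int n = 0; n < 64; n++) {
        int node;

        if (!(relative & (UINT64_C(1) << n)))
            continue;
        node = nth_set_node(allowed, n);
        if (node >= 0)
            out |= UINT64_C(1) << node;
    }
    return out;
}

/* Effective nodemask for a user mask under MPOL_F_RELATIVE_NODES. */
static uint64_t relative_nodes(uint64_t user, uint64_t allowed)
{
    unsigned int w = mask_weight(allowed);

    if (w == 0)
        return 0;
    return onto_nodes(fold_nodes(user, w), allowed);
}
```

Applied to the example in the text, a user mask of 2-5 (``0x3C``) with allowed
nodes 3-7 (``0xF8``) gives ``0xE8``, that is nodes 3,5-7; with allowed nodes
0,2-3,5 (``0x2D``) it gives ``0x2D``. Because the result is always recomputed
from the stored user mask, restoring the original allowed set restores the
original placement.
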
Memory Policy Reference Counting
================================

To resolve use/free races, struct mempolicy contains an atomic reference
count field. The internal interfaces mpol_get() and mpol_put() increment and
decrement this reference count, respectively. mpol_put() will only free
the structure back to the mempolicy kmem cache when the reference count
goes to zero.

When a new memory policy is allocated, its reference count is initialized
to '1', representing the reference held by the task that is installing the
new policy. When a pointer to a memory policy structure is stored in another
structure, another reference is added, as the task's reference will be dropped
on completion of the policy installation.

During run-time "usage" of the policy, we attempt to minimize atomic operations
on the reference count, as this can lead to cache lines bouncing between cpus
and NUMA nodes. "Usage" here means one of the following:

1) querying of the policy, either by the task itself [using the get_mempolicy()
   API discussed below] or by another task using the /proc/<pid>/numa_maps
   interface.

2) examination of the policy to determine the policy mode and associated node
   or node lists, if any, for page allocation. This is considered a "hot
   path". Note that for MPOL_BIND, the "usage" extends across the entire
   allocation process, which may sleep during page reclamation, because the
   BIND policy nodemask is used, by reference, to filter ineligible nodes.

We can avoid taking an extra reference during the usages listed above as
follows:

1) we never need to get/free the system default policy as this is never
   changed nor freed, once the system is up and running.

2) for querying the policy, we do not need to take an extra reference on the
   target task's task policy nor vma policies because we always acquire the
   task's mm's mmap_sem for read during the query. The set_mempolicy() and
   mbind() APIs [see below] always acquire the mmap_sem for write when
   installing or replacing task or vma policies. Thus, there is no possibility
   of a task or thread freeing a policy while another task or thread is
   querying it.

3) Page allocation usage of task or vma policy occurs in the fault path where
   we hold the mmap_sem for read. Again, because replacing the task or vma
   policy requires that the mmap_sem be held for write, the policy can't be
   freed out from under us while we're using it for page allocation.

4) Shared policies require special consideration. One task can replace a
   shared memory policy while another task, with a distinct mmap_sem, is
   querying or allocating a page based on the policy. To resolve this
   potential race, the shared policy infrastructure adds an extra reference
   to the shared policy during lookup while holding a spin lock on the shared
   policy management structure. This requires that we drop this extra
   reference when we're finished "using" the policy. We must drop the
   extra reference on shared policies in the same query/allocation paths
   used for non-shared policies. For this reason, shared policies are marked
   as such, and the extra reference is dropped "conditionally"--i.e., only
   for shared policies.

   Because of this extra reference counting, and because we must look up
   shared policies in a tree structure under spinlock, shared policies are
   more expensive to use in the page allocation path. This is especially
   true for shared policies on shared memory regions shared by tasks running
   on different NUMA nodes. This extra overhead can be avoided by always
   falling back to task or system default policy for shared memory regions,
   or by prefaulting the entire shared memory region into memory and locking
   it down. However, this might not be appropriate for all applications.

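The "conditional" drop of the extra reference can be pictured with a toy user
space model. Everything in this sketch is hypothetical (``struct toy_policy``
and the ``toy_*`` functions are illustrative names, not kernel interfaces),
and the locking that protects the real lookup is omitted; it only shows why
non-shared policies avoid atomic operations in the usage path while shared
policies pay for a get/put pair.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy stand-in for a reference counted policy object. */
struct toy_policy {
    atomic_int refcnt;
    bool shared;        /* marks the policy as a shared policy */
};

/* A new policy starts with one reference, held by the installer. */
static struct toy_policy *toy_new(bool shared)
{
    struct toy_policy *pol = malloc(sizeof(*pol));

    if (pol) {
        atomic_init(&pol->refcnt, 1);
        pol->shared = shared;
    }
    return pol;
}

static void toy_get(struct toy_policy *pol)
{
    atomic_fetch_add(&pol->refcnt, 1);
}

/* Drop one reference; free on the last one. Returns true if freed. */
static bool toy_put(struct toy_policy *pol)
{
    if (atomic_fetch_sub(&pol->refcnt, 1) == 1) {
        free(pol);
        return true;
    }
    return false;
}

/*
 * Lookup for "usage": only shared policies take an extra reference,
 * because another task may replace them concurrently.
 */
static struct toy_policy *toy_lookup(struct toy_policy *pol)
{
    if (pol->shared)
        toy_get(pol);
    return pol;
}

/* Conditional put: undo the extra reference only for shared policies. */
static bool toy_cond_put(struct toy_policy *pol)
{
    return pol->shared ? toy_put(pol) : false;
}
```

In this model a task or VMA policy passes through ``toy_lookup()`` and
``toy_cond_put()`` without touching the counter, while a shared policy sees
its count go from 1 to 2 and back to 1.
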
Mike Rapoport | 42f44d1 | 2018-05-08 10:02:08 +0300 | [diff] [blame] | 399 | .. _memory_policy_apis: |
| 400 | |
Mike Rapoport | cb5e437 | 2018-03-21 21:22:29 +0200 | [diff] [blame] | 401 | Memory Policy APIs |
Mike Rapoport | 42f44d1 | 2018-05-08 10:02:08 +0300 | [diff] [blame] | 402 | ================== |
Lee Schermerhorn | 42b88e6 | 2007-08-22 14:01:06 -0700 | [diff] [blame] | 403 | |
| 404 | Linux supports 3 system calls for controlling memory policy. These APIS |
| 405 | always affect only the calling task, the calling task's address space, or |
| 406 | some shared object mapped into the calling task's address space. |
| 407 | |
Mike Rapoport | cb5e437 | 2018-03-21 21:22:29 +0200 | [diff] [blame] | 408 | .. note:: |
| 409 | the headers that define these APIs and the parameter data types for |
| 410 | user space applications reside in a package that is not part of the |
| 411 | Linux kernel. The kernel system call interfaces, with the 'sys\_' |
| 412 | prefix, are defined in <linux/syscalls.h>; the mode and flag |
| 413 | definitions are defined in <linux/mempolicy.h>. |
Lee Schermerhorn | 42b88e6 | 2007-08-22 14:01:06 -0700 | [diff] [blame] | 414 | |
Mike Rapoport | cb5e437 | 2018-03-21 21:22:29 +0200 | [diff] [blame] | 415 | Set [Task] Memory Policy:: |
Lee Schermerhorn | 42b88e6 | 2007-08-22 14:01:06 -0700 | [diff] [blame] | 416 | |
| 417 | long set_mempolicy(int mode, const unsigned long *nmask, |
| 418 | unsigned long maxnode); |
| 419 | |
Sets the calling task's "task/process memory policy" to the mode
specified by the 'mode' argument and the set of nodes defined by
'nmask'.  'nmask' points to a bit mask of node ids containing at least
'maxnode' ids.  Optional mode flags may be passed by combining the
'mode' argument with the flag (for example: MPOL_INTERLEAVE |
MPOL_F_STATIC_NODES).

See the set_mempolicy(2) man page for more details.


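As a minimal sketch of the call described above, the following C fragment
invokes the system call directly through ``syscall(2)`` instead of the
library wrapper, and requests interleaving over node 0 combined with
MPOL_F_STATIC_NODES.  The helper names ``nodemask_set()`` and
``interleave_on_node0()``, the ``BITS_PER_WORD`` macro, and the choice of
node 0 are assumptions made for illustration only.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/mempolicy.h>

/* Hypothetical macro: number of node bits held by one mask word. */
#define BITS_PER_WORD (8 * sizeof(unsigned long))

/* Hypothetical helper: set the bit for 'node' in a node mask array. */
static void nodemask_set(unsigned long *mask, unsigned int node)
{
	mask[node / BITS_PER_WORD] |= 1UL << (node % BITS_PER_WORD);
}

/*
 * Hypothetical helper: ask the kernel to interleave the calling task's
 * future allocations over node 0 only.  Returns 0 on success, or -1 with
 * errno set (for example when the kernel lacks NUMA support or the call
 * is not permitted in the current environment).
 */
static long interleave_on_node0(void)
{
	unsigned long nmask[1] = { 0 };

	nodemask_set(nmask, 0);
	/* The optional mode flag is OR-ed into the 'mode' argument. */
	return syscall(SYS_set_mempolicy,
		       (long)(MPOL_INTERLEAVE | MPOL_F_STATIC_NODES),
		       nmask, (unsigned long)BITS_PER_WORD);
}
```

A caller would check the return value and inspect ``errno`` on failure.
Restoring the default behaviour would use MPOL_DEFAULT with a NULL mask.
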
Get [Task] Memory Policy or Related Information::

	long get_mempolicy(int *mode,
			   unsigned long *nmask, unsigned long maxnode,
			   void *addr, int flags);

Queries the "task/process memory policy" of the calling task, or the
policy or location of a specified virtual address, depending on the
'flags' argument.

See the get_mempolicy(2) man page for more details.


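The two kinds of query mentioned above can be sketched as follows.  The
calls go through ``syscall(2)`` directly.  The helper names
``task_policy_mode()`` and ``node_of_address()`` are hypothetical, and the
sketch assumes the caller does not need the node mask, so NULL and 0 are
passed for 'nmask' and 'maxnode'.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/mempolicy.h>

/*
 * Hypothetical helper: return the calling task's policy mode, or -1 with
 * errno set.  With flags == 0 and addr == NULL the task policy is queried.
 * The returned value may also carry mode flags such as MPOL_F_STATIC_NODES.
 */
static int task_policy_mode(void)
{
	int mode = -1;

	if (syscall(SYS_get_mempolicy, &mode, NULL, 0UL, NULL, 0UL) != 0)
		return -1;
	return mode;
}

/*
 * Hypothetical helper: return the node that currently backs the page
 * containing 'addr', or -1 with errno set.  MPOL_F_NODE | MPOL_F_ADDR
 * makes the kernel store a node id, not a policy mode, in the first
 * argument.
 */
static int node_of_address(void *addr)
{
	int node = -1;

	if (syscall(SYS_get_mempolicy, &node, NULL, 0UL, addr,
		    (unsigned long)(MPOL_F_NODE | MPOL_F_ADDR)) != 0)
		return -1;
	return node;
}
```

Both helpers return -1 on kernels or sandboxes where the call is
unavailable, so callers should treat a negative result as "unknown", not
as a node or mode.
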
Install VMA/Shared Policy for a Range of Task's Address Space::

	long mbind(void *start, unsigned long len, int mode,
		   const unsigned long *nmask, unsigned long maxnode,
		   unsigned flags);

mbind() installs the policy specified by (mode, nmask, maxnode) as a
VMA policy for the range of the calling task's address space specified
by the 'start' and 'len' arguments.  Additional actions may be
requested via the 'flags' argument.

See the mbind(2) man page for more details.

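A minimal sketch of installing a VMA policy on a fresh anonymous mapping
might look like the following.  The helper name ``map_bound_to_node0()``
is hypothetical, node 0 is an arbitrary choice, and no additional actions
are requested (flags is 0).  The policy is installed before any page is
touched so that later page faults allocate under it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <linux/mempolicy.h>

/*
 * Hypothetical helper: map 'len' bytes of private anonymous memory and
 * try to install MPOL_BIND to node 0 as the VMA policy for that range.
 * Returns the mapping, or NULL if mmap() failed.  The result of mbind()
 * (0, or -1 with errno set) is stored in *bind_rc so the caller can
 * decide whether a failed bind is fatal.
 */
static void *map_bound_to_node0(size_t len, long *bind_rc)
{
	unsigned long nmask = 1UL;	/* bit 0 set: node 0 only */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return NULL;

	*bind_rc = syscall(SYS_mbind, p, (unsigned long)len,
			   (unsigned long)MPOL_BIND, &nmask,
			   (unsigned long)(8 * sizeof(nmask)), 0UL);
	return p;
}
```

If the bind fails, the mapping is still usable and simply follows the
task or system default policy.
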
Memory Policy Command Line Interface
====================================

Although not strictly part of the Linux implementation of memory policy,
a command line tool, numactl(8), exists that allows one to:

+ set the task policy for a specified program via set_mempolicy(2), fork(2) and
  exec(2)

+ set the shared policy for a shared memory segment via mbind(2)

The numactl(8) tool is packaged with the run-time version of the library
containing the memory policy system call wrappers.  Some distributions
package the headers and compile-time libraries in a separate development
package.

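The first item in the list above relies on the task policy surviving
fork(2) and exec(2).  The following C sketch shows that general pattern.
It is not numactl's actual implementation.  The helper name
``run_interleaved_on_node0()``, the use of node 0, and the exit codes 126
and 127 are assumptions chosen for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <linux/mempolicy.h>

/*
 * Hypothetical helper: run the program named by argv[0] with a task
 * policy that interleaves over node 0.  Returns the program's exit
 * status, 126 if the policy could not be set in the child, 127 if the
 * exec failed, or -1 if fork/wait failed or the child died abnormally.
 */
static int run_interleaved_on_node0(char *const argv[])
{
	unsigned long nmask = 1UL;	/* bit 0 set: node 0 only */
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;

	if (pid == 0) {
		/* Child: set the task policy, then replace the image. */
		if (syscall(SYS_set_mempolicy, (long)MPOL_INTERLEAVE, &nmask,
			    (unsigned long)(8 * sizeof(nmask))) != 0)
			_exit(126);
		execvp(argv[0], argv);
		_exit(127);
	}

	if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}
```

Setting the policy in the child, not in the parent, keeps the launcher's
own allocations unaffected.
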
.. _mem_pol_and_cpusets:

Memory Policies and cpusets
===========================

Memory policies work within cpusets as described above.  For memory policies
that require a node or set of nodes, the nodes are restricted to the set of
nodes whose memories are allowed by the cpuset constraints.  If the nodemask
specified for the policy contains nodes that are not allowed by the cpuset and
MPOL_F_RELATIVE_NODES is not used, the intersection of the set of nodes
specified for the policy and the set of nodes with memory is used.  If the
result is the empty set, the policy is considered invalid and cannot be
installed.  If MPOL_F_RELATIVE_NODES is used, the policy's nodes are mapped
onto and folded into the task's set of allowed nodes as previously described.

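The two cases in the paragraph above can be modelled with plain bit masks.
The sketch below is a user space illustration under stated assumptions,
not the kernel's code.  It limits node ids to 0-63, treats 'allowed' as
the set of nodes with memory that the cpuset permits, and reads "mapped
onto and folded into" as: wrap each requested node id modulo the number of
allowed nodes, then select the allowed node at that position.  All
function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: count the set bits in a mask. */
static unsigned int weight(uint64_t m)
{
	unsigned int n = 0;

	for (; m; m &= m - 1)
		n++;
	return n;
}

/*
 * Without MPOL_F_RELATIVE_NODES: keep only the requested nodes that are
 * also allowed.  A result of 0 corresponds to an invalid policy.
 */
static uint64_t effective_nodes(uint64_t policy, uint64_t allowed)
{
	return policy & allowed;
}

/*
 * With MPOL_F_RELATIVE_NODES: fold the requested ids into the range
 * [0, weight(allowed)), then map relative id i onto the i-th allowed node.
 */
static uint64_t relative_nodes(uint64_t policy, uint64_t allowed)
{
	unsigned int w = weight(allowed), i, node, nth = 0;
	uint64_t folded = 0, result = 0;

	if (w == 0)
		return 0;

	for (i = 0; i < 64; i++)
		if (policy & (UINT64_C(1) << i))
			folded |= UINT64_C(1) << (i % w);

	for (node = 0; node < 64; node++) {
		if (!(allowed & (UINT64_C(1) << node)))
			continue;
		if (folded & (UINT64_C(1) << nth))
			result |= UINT64_C(1) << node;
		nth++;
	}
	return result;
}
```

For instance, with allowed nodes 3-7 (mask 0xF8) and a requested mask of
nodes 2-5 (0x3C), ``effective_nodes()`` yields nodes 3-5 (0x38), while
``relative_nodes()`` yields nodes 3 and 5-7 (0xE8).
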
The interaction of memory policies and cpusets can be problematic when tasks
in two cpusets share access to a memory region, such as shared memory segments
created by shmget() or mmap() with the MAP_ANONYMOUS and MAP_SHARED flags.  If
any of the tasks installs a shared policy on the region, only nodes whose
memories are allowed in both cpusets may be used in the policy.  Obtaining
this information requires "stepping outside" the memory policy APIs to use the
cpuset information, and requires knowing which cpusets the other tasks that
might attach to the shared region belong to.  Furthermore, if the cpusets'
allowed memory sets are disjoint, "local" allocation is the only valid policy.
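
As a rough sketch of the bookkeeping this implies, the C fragment below
parses two node lists in the usual "0-3,5" notation and computes the nodes
common to both.  Where the lists come from is outside the sketch.  One
possible source, assumed here and not prescribed by this document, is the
``Mems_allowed_list`` field of ``/proc/<pid>/status`` for a task in each
cpuset.  The function names are hypothetical and node ids are limited to
0-63.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse a node list such as "0-3,5" into a bit mask.
 * Parsing stops at end of string or newline.  Returns 0 on success, or -1
 * for malformed input, a reversed range, or a node id above 63.
 */
static int parse_nodelist(const char *s, uint64_t *mask)
{
	*mask = 0;
	while (*s && *s != '\n') {
		char *end;
		unsigned long lo, hi;

		if (*s < '0' || *s > '9')
			return -1;
		lo = strtoul(s, &end, 10);
		hi = lo;
		if (*end == '-') {
			s = end + 1;
			if (*s < '0' || *s > '9')
				return -1;
			hi = strtoul(s, &end, 10);
		}
		if (hi < lo || hi >= 64)
			return -1;
		for (; lo <= hi; lo++)
			*mask |= UINT64_C(1) << lo;
		if (*end == ',')
			end++;
		s = end;
	}
	return 0;
}

/*
 * Nodes usable in a shared policy: those allowed in both cpusets.  A
 * result of 0 means the sets are disjoint, leaving only "local"
 * allocation as a valid choice.
 */
static uint64_t shared_policy_nodes(uint64_t mems_a, uint64_t mems_b)
{
	return mems_a & mems_b;
}
```

With the lists "0-3,5" and "2-7", the common set is nodes 2, 3 and 5.
With "0-1" and "2-3" the result is empty.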