Mirror of https://github.com/AetherDroid/android_kernel_samsung_on5xelte.git
(synced 2025-09-04 23:47:46 -04:00)

Commit f6dfaef42e -- "Fixed MTP to work with TWRP"
50,820 changed files with 20,846,062 additions and 0 deletions

Documentation/vm/00-INDEX | 40 lines (new file)
@@ -0,0 +1,40 @@

00-INDEX
    - this file.
active_mm.txt
    - An explanation from Linus about tsk->active_mm vs tsk->mm.
balance
    - various information on memory balancing.
cleancache.txt
    - Intro to cleancache and page-granularity victim cache.
frontswap.txt
    - Outline of frontswap, part of the transcendent memory frontend.
highmem.txt
    - Outline of highmem and common issues.
hugetlbpage.txt
    - a brief summary of hugetlbpage support in the Linux kernel.
hwpoison.txt
    - explains what hwpoison is.
ksm.txt
    - how to use the Kernel Samepage Merging feature.
numa
    - information about NUMA-specific code in the Linux vm.
numa_memory_policy.txt
    - documentation of concepts and APIs of the 2.6 memory policy support.
overcommit-accounting
    - description of the Linux kernel's overcommit handling modes.
page_migration
    - description of page migration in NUMA systems.
pagemap.txt
    - pagemap, from the userspace perspective.
slub.txt
    - a short user's guide for SLUB.
soft-dirty.txt
    - short explanation of soft-dirty PTEs.
split_page_table_lock
    - separate per-table lock to improve scalability of the old page_table_lock.
transhuge.txt
    - Transparent Hugepage Support, an alternative way of using hugepages.
unevictable-lru.txt
    - Unevictable LRU infrastructure.
zswap.txt
    - Intro to the compressed cache for swap pages.

Documentation/vm/active_mm.txt | 83 lines (new file)
@@ -0,0 +1,83 @@

List:       linux-kernel
Subject:    Re: active_mm
From:       Linus Torvalds <torvalds () transmeta ! com>
Date:       1999-07-30 21:36:24

Cc'd to linux-kernel, because I don't write explanations all that often,
and when I do I feel better about more people reading them.

On Fri, 30 Jul 1999, David Mosberger wrote:
>
> Is there a brief description someplace on how "mm" vs. "active_mm" in
> the task_struct are supposed to be used?  (My apologies if this was
> discussed on the mailing lists---I just returned from vacation and
> wasn't able to follow linux-kernel for a while).

Basically, the new setup is:

 - we have "real address spaces" and "anonymous address spaces". The
   difference is that an anonymous address space doesn't care about the
   user-level page tables at all, so when we do a context switch into an
   anonymous address space we just leave the previous address space
   active.

   The obvious use for an "anonymous address space" is any thread that
   doesn't need any user mappings - all kernel threads basically fall into
   this category, but even "real" threads can temporarily say that for
   some amount of time they are not going to be interested in user space,
   and that the scheduler might as well try to avoid wasting time on
   switching the VM state around. Currently only the old-style bdflush
   sync does that.

 - "tsk->mm" points to the "real address space". For an anonymous process,
   tsk->mm will be NULL, for the logical reason that an anonymous process
   really doesn't _have_ a real address space at all.

 - however, we obviously need to keep track of which address space we
   "stole" for such an anonymous user. For that, we have "tsk->active_mm",
   which shows what the currently active address space is.

   The rule is that for a process with a real address space (ie tsk->mm is
   non-NULL) the active_mm obviously always has to be the same as the real
   one.

   For an anonymous process, tsk->mm == NULL, and tsk->active_mm is the
   "borrowed" mm while the anonymous process is running. When the
   anonymous process gets scheduled away, the borrowed address space is
   returned and cleared.

To support all that, the "struct mm_struct" now has two counters: a
"mm_users" counter that is how many "real address space users" there are,
and a "mm_count" counter that is the number of "lazy" users (ie anonymous
users) plus one if there are any real users.

Usually there is at least one real user, but it could be that the real
user exited on another CPU while a lazy user was still active, so you do
actually get cases where you have an address space that is _only_ used by
lazy users. That is often a short-lived state, because once that thread
gets scheduled away in favour of a real thread, the "zombie" mm gets
released because "mm_users" becomes zero.

Also, a new rule is that _nobody_ ever has "init_mm" as a real MM any
more. "init_mm" should be considered just a "lazy context when no other
context is available", and in fact it is mainly used just at bootup when
no real VM has yet been created. So code that used to check

    if (current->mm == &init_mm)

should generally just do

    if (!current->mm)

instead (which makes more sense anyway - the test is basically one of "do
we have a user context", and is generally done by the page fault handler
and things like that).

Anyway, I put a pre-patch-2.3.13-1 on ftp.kernel.org just a moment ago,
because it slightly changes the interfaces to accommodate the alpha (who
would have thought it, but the alpha actually ends up having one of the
ugliest context switch codes - unlike the other architectures where the MM
and register state is separate, the alpha PALcode joins the two, and you
need to switch both together).

(From http://marc.info/?l=linux-kernel&m=93337278602211&w=2)
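
A minimal sketch of the rules described in the mail, using the field and
counter names from the text above (illustrative only -- the real context
switch code in the scheduler differs in its details):

    /* switching from "prev" to "next" */
    if (!next->mm) {
        /* anonymous task: keep running on the previous address space
         * and take only a "lazy" reference on it */
        next->active_mm = prev->active_mm;
        atomic_inc(&prev->active_mm->mm_count);
    } else {
        /* a real user: mm and active_mm must be one and the same */
        next->active_mm = next->mm;
    }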

Documentation/vm/balance | 93 lines (new file)
@@ -0,0 +1,93 @@

Started Jan 2000 by Kanoj Sarcar <kanoj@sgi.com>

Memory balancing is needed for non __GFP_WAIT as well as for non
__GFP_IO allocations.

There are two reasons to be requesting non __GFP_WAIT allocations:
the caller can not sleep (typically intr context), or does not want
to incur cost overheads of page stealing and possible swap io for
whatever reasons.

__GFP_IO allocation requests are made to prevent file system deadlocks.

In the absence of non sleepable allocation requests, it seems detrimental
to be doing balancing. Page reclamation can be kicked off lazily, that
is, only when needed (aka zone free memory is 0), instead of making it
a proactive process.

That being said, the kernel should try to fulfill requests for direct
mapped pages from the direct mapped pool, instead of falling back on
the dma pool, so as to keep the dma pool filled for dma requests (atomic
or not). A similar argument applies to highmem and direct mapped pages.
OTOH, if there are a lot of free dma pages, it is preferable to satisfy
regular memory requests by allocating one from the dma pool, instead
of incurring the overhead of regular zone balancing.

In 2.2, memory balancing/page reclamation would kick off only when the
_total_ number of free pages fell below 1/64th of total memory. With the
right ratio of dma and regular memory, it is quite possible that balancing
would not be done even when the dma zone was completely empty. 2.2 has
been running production machines of varying memory sizes, and seems to be
doing fine even with the presence of this problem. In 2.3, due to
HIGHMEM, this problem is aggravated.

In 2.3, zone balancing can be done in one of two ways: depending on the
zone size (and possibly of the size of lower class zones), we can decide
at init time how many free pages we should aim for while balancing any
zone. The good part is, while balancing, we do not need to look at sizes
of lower class zones, the bad part is, we might do too frequent balancing
due to ignoring possibly lower usage in the lower class zones. Also,
with a slight change in the allocation routine, it is possible to reduce
the memclass() macro to be a simple equality.

Another possible solution is that we balance only when the free memory
of a zone _and_ all its lower class zones falls below 1/64th of the
total memory in the zone and its lower class zones. This fixes the 2.2
balancing problem, and stays as close to 2.2 behavior as possible. Also,
the balancing algorithm works the same way on the various architectures,
which have different numbers and types of zones. If we wanted to get
fancy, we could assign different weights to free pages in different
zones in the future.

Note that if the size of the regular zone is huge compared to dma zone,
it becomes less significant to consider the free dma pages while
deciding whether to balance the regular zone. The first solution
becomes more attractive then.

The appended patch implements the second solution. It also "fixes" two
problems: first, kswapd is woken up as in 2.2 on low memory conditions
for non-sleepable allocations. Second, the HIGHMEM zone is also balanced,
so as to give a fighting chance for replace_with_highmem() to get a
HIGHMEM page, as well as to ensure that HIGHMEM allocations do not
fall back into regular zone. This also makes sure that HIGHMEM pages
are not leaked (for example, in situations where a HIGHMEM page is in
the swapcache but is not being used by anyone).

kswapd also needs to know about the zones it should balance. kswapd is
primarily needed in a situation where balancing can not be done,
probably because all allocation requests are coming from intr context
and all process contexts are sleeping. For 2.3, kswapd does not really
need to balance the highmem zone, since intr context does not request
highmem pages. kswapd looks at the zone_wake_kswapd field in the zone
structure to decide whether a zone needs balancing.

Page stealing from process memory and shm is done if stealing the page would
alleviate memory pressure on any zone in the page's node that has fallen below
its watermark.

watermark[WMARK_MIN/WMARK_LOW/WMARK_HIGH]/low_on_memory/zone_wake_kswapd: These
are per-zone fields, used to determine when a zone needs to be balanced. When
the number of pages falls below watermark[WMARK_MIN], the hysteresis field
low_on_memory gets set. This stays set till the number of free pages becomes
watermark[WMARK_HIGH]. When low_on_memory is set, page allocation requests will
try to free some pages in the zone (providing GFP_WAIT is set in the request).
Orthogonal to this is the decision to poke kswapd to free some zone pages.
That decision is not hysteresis based, and is done when the number of free
pages is below watermark[WMARK_LOW]; in which case zone_wake_kswapd is also set.
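
In code form, that per-zone decision might look roughly like this (an
illustrative sketch only, with zone_free_pages() and try_to_free_zone_pages()
as hypothetical helpers; it is not the actual allocator source):

    free = zone_free_pages(zone);

    if (free < zone->watermark[WMARK_MIN])
        zone->low_on_memory = 1;          /* hysteresis: stays set ... */
    else if (free >= zone->watermark[WMARK_HIGH])
        zone->low_on_memory = 0;          /* ... until we are back above HIGH */

    if (zone->low_on_memory && (gfp_mask & __GFP_WAIT))
        try_to_free_zone_pages(zone);     /* reclaim in-line for this request */

    if (free < zone->watermark[WMARK_LOW])
        zone->zone_wake_kswapd = 1;       /* no hysteresis: just poke kswapd */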

(Good) Ideas that I have heard:
1. Dynamic experience should influence balancing: number of failed requests
   for a zone can be tracked and fed into the balancing scheme (jalvo@mbay.net)
2. Implement a replace_with_highmem()-like replace_with_regular() to preserve
   dma pages. (lkd@tantalophile.demon.co.uk)

Documentation/vm/cleancache.txt | 279 lines (new file)
@@ -0,0 +1,279 @@

MOTIVATION

Cleancache is a new optional feature provided by the VFS layer that
potentially dramatically increases page cache effectiveness for
many workloads in many environments at a negligible cost.

Cleancache can be thought of as a page-granularity victim cache for clean
pages that the kernel's pageframe replacement algorithm (PFRA) would like
to keep around, but can't since there isn't enough memory.  So when the
PFRA "evicts" a page, it first attempts to use cleancache code to
put the data contained in that page into "transcendent memory", memory
that is not directly accessible or addressable by the kernel and is
of unknown and possibly time-varying size.

Later, when a cleancache-enabled filesystem wishes to access a page
in a file on disk, it first checks cleancache to see if it already
contains it; if it does, the page of data is copied into the kernel
and a disk access is avoided.

Transcendent memory "drivers" for cleancache are currently implemented
in Xen (using hypervisor memory) and zcache (using in-kernel compressed
memory) and other implementations are in development.

FAQs are included below.

IMPLEMENTATION OVERVIEW

A cleancache "backend" that provides transcendent memory registers itself
to the kernel's cleancache "frontend" by calling cleancache_register_ops,
passing a pointer to a cleancache_ops structure with funcs set appropriately.
Note that cleancache_register_ops returns the previous settings so that
chaining can be performed if desired.  The functions provided must conform to
certain semantics as follows:

Most important, cleancache is "ephemeral".  Pages which are copied into
cleancache have an indefinite lifetime which is completely unknowable
by the kernel and so may or may not still be in cleancache at any later time.
Thus, as its name implies, cleancache is not suitable for dirty pages.
Cleancache has complete discretion over what pages to preserve and what
pages to discard and when.

Mounting a cleancache-enabled filesystem should call "init_fs" to obtain a
pool id which, if positive, must be saved in the filesystem's superblock;
a negative return value indicates failure.  A "put_page" will copy a
(presumably about-to-be-evicted) page into cleancache and associate it with
the pool id, a file key, and a page index into the file.  (The combination
of a pool id, a file key, and an index is sometimes called a "handle".)
A "get_page" will copy the page, if found, from cleancache into kernel memory.
An "invalidate_page" will ensure the page no longer is present in cleancache;
an "invalidate_inode" will invalidate all pages associated with the specified
file; and, when a filesystem is unmounted, an "invalidate_fs" will invalidate
all pages in all files specified by the given pool id and also surrender
the pool id.
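
The shape of a backend follows directly from those operation names.  A
minimal sketch (the exact prototypes live in the cleancache frontend header
and may differ from this illustration; "ramcache_*" is a made-up example):

    static int ramcache_init_fs(size_t pagesize)
    {
        /* create a pool and return its id, or a negative value on failure */
        return ramcache_new_pool();             /* hypothetical helper */
    }

    static void ramcache_put_page(int pool_id, struct cleancache_filekey key,
                                  pgoff_t index, struct page *page)
    {
        /* copy the clean, about-to-be-evicted page into backend-owned
         * memory, keyed by the (pool_id, key, index) "handle" */
    }

    static struct cleancache_ops ramcache_ops = {
        .init_fs  = ramcache_init_fs,
        .put_page = ramcache_put_page,
        /* .get_page, .invalidate_page, .invalidate_inode, .invalidate_fs
         * and .init_shared_fs follow the same pattern */
    };

    /* the previous ops are returned so that backends can chain */
    old_ops = cleancache_register_ops(&ramcache_ops);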

An "init_shared_fs", like init_fs, obtains a pool id but tells cleancache
to treat the pool as shared using a 128-bit UUID as a key.  On systems
that may run multiple kernels (such as hard partitioned or virtualized
systems) that may share a clustered filesystem, and where cleancache
may be shared among those kernels, calls to init_shared_fs that specify the
same UUID will receive the same pool id, thus allowing the pages to
be shared.  Note that any security requirements must be imposed outside
of the kernel (e.g. by "tools" that control cleancache).  Or a
cleancache implementation can simply disable shared_init by always
returning a negative value.

If a get_page is successful on a non-shared pool, the page is invalidated
(thus making cleancache an "exclusive" cache).  On a shared pool, the page
is NOT invalidated on a successful get_page so that it remains accessible to
other sharers.  The kernel is responsible for ensuring coherency between
cleancache (shared or not), the page cache, and the filesystem, using
cleancache invalidate operations as required.

Note that cleancache must enforce put-put-get coherency and get-get
coherency.  For the former, if two puts are made to the same handle but
with different data, say AAA by the first put and BBB by the second, a
subsequent get can never return the stale data (AAA).  For get-get coherency,
if a get for a given handle fails, subsequent gets for that handle will
never succeed unless preceded by a successful put with that handle.

Last, cleancache provides no SMP serialization guarantees; if two
different Linux threads are simultaneously putting and invalidating a page
with the same handle, the results are indeterminate.  Callers must
lock the page to ensure serial behavior.

CLEANCACHE PERFORMANCE METRICS

If properly configured, monitoring of cleancache is done via debugfs in
the /sys/kernel/debug/mm/cleancache directory.  The effectiveness of cleancache
can be measured (across all filesystems) with:

succ_gets       - number of gets that were successful
failed_gets     - number of gets that failed
puts            - number of puts attempted (all "succeed")
invalidates     - number of invalidates attempted

A backend implementation may provide additional metrics.

FAQ

1) Where's the value? (Andrew Morton)

Cleancache provides a significant performance benefit to many workloads
in many environments with negligible overhead by improving the
effectiveness of the pagecache.  Clean pagecache pages are
saved in transcendent memory (RAM that is otherwise not directly
addressable to the kernel); fetching those pages later avoids "refaults"
and thus disk reads.

Cleancache (and its sister code "frontswap") provide interfaces for
this transcendent memory (aka "tmem"), which conceptually lies between
fast kernel-directly-addressable RAM and slower DMA/asynchronous devices.
Disallowing direct kernel or userland reads/writes to tmem
is ideal when data is transformed to a different form and size (such
as with compression) or secretly moved (as might be useful for write-
balancing for some RAM-like devices).  Evicted page-cache pages (and
swap pages) are a great use for this kind of slower-than-RAM-but-much-
faster-than-disk transcendent memory, and the cleancache (and frontswap)
"page-object-oriented" specification provides a nice way to read and
write -- and indirectly "name" -- the pages.

In the virtual case, the whole point of virtualization is to statistically
multiplex physical resources across the varying demands of multiple
virtual machines.  This is really hard to do with RAM and efforts to
do it well with no kernel change have essentially failed (except in some
well-publicized special-case workloads).  Cleancache -- and frontswap --
with a fairly small impact on the kernel, provide a huge amount
of flexibility for more dynamic, flexible RAM multiplexing.
Specifically, the Xen Transcendent Memory backend allows otherwise
"fallow" hypervisor-owned RAM to not only be "time-shared" between multiple
virtual machines, but the pages can be compressed and deduplicated to
optimize RAM utilization.  And when guest OS's are induced to surrender
underutilized RAM (e.g. with "self-ballooning"), page cache pages
are the first to go, and cleancache allows those pages to be
saved and reclaimed if overall host system memory conditions allow.

And the identical interface used for cleancache can be used in
physical systems as well.  The zcache driver acts as a memory-hungry
device that stores pages of data in a compressed state.  And
the proposed "RAMster" driver shares RAM across multiple physical
systems.

2) Why does cleancache have its sticky fingers so deep inside the
   filesystems and VFS? (Andrew Morton and Christoph Hellwig)

The core hooks for cleancache in VFS are in most cases a single line
and the minimum set are placed precisely where needed to maintain
coherency (via cleancache_invalidate operations) between cleancache,
the page cache, and disk.  All hooks compile into nothingness if
cleancache is config'ed off and turn into a function-pointer-
compare-to-NULL if config'ed on but no backend claims the ops
functions, or to a compare-struct-element-to-negative if a
backend claims the ops functions but a filesystem doesn't enable
cleancache.

Some filesystems are built entirely on top of VFS and the hooks
in VFS are sufficient, so don't require an "init_fs" hook; the
initial implementation of cleancache didn't provide this hook.
But for some filesystems (such as btrfs), the VFS hooks are
incomplete and one or more hooks in fs-specific code are required.
And for some other filesystems, such as tmpfs, cleancache may
be counterproductive.  So it seemed prudent to require a filesystem
to "opt in" to use cleancache, which requires adding a hook in
each filesystem.  Not all filesystems are supported by cleancache
only because they haven't been tested.  The existing set should
be sufficient to validate the concept, the opt-in approach means
that untested filesystems are not affected, and the hooks in the
existing filesystems should make it very easy to add more
filesystems in the future.

The total impact of the hooks to existing fs and mm files is only
about 40 lines added (not counting comments and blank lines).

3) Why not make cleancache asynchronous and batched so it can
   more easily interface with real devices with DMA instead
   of copying each individual page? (Minchan Kim)

The one-page-at-a-time copy semantics simplifies the implementation
on both the frontend and backend and also allows the backend to
do fancy things on-the-fly like page compression and
page deduplication.  And since the data is "gone" (copied into/out
of the pageframe) before the cleancache get/put call returns,
a great deal of race conditions and potential coherency issues
are avoided.  While the interface seems odd for a "real device"
or for real kernel-addressable RAM, it makes perfect sense for
transcendent memory.

4) Why is non-shared cleancache "exclusive"?  And where is the
   page "invalidated" after a "get"? (Minchan Kim)

The main reason is to free up space in transcendent memory and
to avoid unnecessary cleancache_invalidate calls.  If you want inclusive,
the page can be "put" immediately following the "get".  If
put-after-get for inclusive becomes common, the interface could
be easily extended to add a "get_no_invalidate" call.

The invalidate is done by the cleancache backend implementation.

5) What's the performance impact?

Performance analysis has been presented at OLS'09 and LCA'10.
Briefly, performance gains can be significant on most workloads,
especially when memory pressure is high (e.g. when RAM is
overcommitted in a virtual workload); and because the hooks are
invoked primarily in place of or in addition to a disk read/write,
overhead is negligible even in worst case workloads.  Basically
cleancache replaces I/O with memory-copy-CPU-overhead; on older
single-core systems with slow memory-copy speeds, cleancache
has little value, but in newer multicore machines, especially
consolidated/virtualized machines, it has great value.

6) How do I add cleancache support for filesystem X? (Boaz Harrash)

Filesystems that are well-behaved and conform to certain
restrictions can utilize cleancache simply by making a call to
cleancache_init_fs at mount time.  Unusual, misbehaving, or
poorly layered filesystems must either add additional hooks
and/or undergo extensive additional testing... or should just
not enable the optional cleancache.  (A sketch of the opt-in call
follows the list below.)

Some points for a filesystem to consider:

- The FS should be block-device-based (e.g. a ram-based FS such
  as tmpfs should not enable cleancache)
- To ensure coherency/correctness, the FS must ensure that all
  file removal or truncation operations either go through VFS or
  add hooks to do the equivalent cleancache "invalidate" operations
- To ensure coherency/correctness, either inode numbers must
  be unique across the lifetime of the on-disk file OR the
  FS must provide an "encode_fh" function.
- The FS must call the VFS superblock alloc and deactivate routines
  or add hooks to do the equivalent cleancache calls done there.
- To maximize performance, all pages fetched from the FS should
  go through the do_mpage_readpage routine or the FS should add
  hooks to do the equivalent (cf. btrfs)
- Currently, the FS blocksize must be the same as PAGESIZE.  This
  is not an architectural restriction, but no backends currently
  support anything different.
- A clustered FS should invoke the "shared_init_fs" cleancache
  hook to get best performance for some backends.
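
The opt-in itself is tiny.  A sketch of the cleancache_init_fs call
mentioned in FAQ 6 as it might appear inside a filesystem's mount path
(the surrounding function and the exact argument are illustrative
assumptions, not a specific filesystem's code):

    static int myfs_fill_super(struct super_block *sb, void *data, int silent)
    {
        /* ... normal superblock setup ... */

        /* opt this filesystem in; the resulting pool id is kept with
         * the superblock, as described in the overview above */
        cleancache_init_fs(sb);

        return 0;
    }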

7) Why not use the KVA of the inode as the key? (Christoph Hellwig)

If cleancache would use the inode virtual address instead of
inode/filehandle, the pool id could be eliminated.  But, this
won't work because cleancache retains pagecache data pages
persistently even when the inode has been pruned from the
inode unused list, and only invalidates the data page if the file
gets removed/truncated.  So if cleancache used the inode kva,
there would be potential coherency issues if/when the inode
kva is reused for a different file.  Alternately, if cleancache
invalidated the pages when the inode kva was freed, much of the value
of cleancache would be lost because the cache of pages in cleancache
is potentially much larger than the kernel pagecache and is most
useful if the pages survive inode cache removal.

8) Why is a global variable required?

The cleancache_enabled flag is checked in all of the frequently-used
cleancache hooks.  The alternative is a function call to check a static
variable.  Since cleancache is enabled dynamically at runtime, systems
that don't enable cleancache would suffer thousands (possibly
tens-of-thousands) of unnecessary function calls per second.  So the
global variable allows cleancache to be enabled by default at compile
time, but have insignificant performance impact when cleancache remains
disabled at runtime.

9) Does cleancache work with KVM?

The memory model of KVM is sufficiently different that a cleancache
backend may have less value for KVM.  This remains to be tested,
especially in an overcommitted system.

10) Does cleancache work in userspace?  It sounds useful for
    memory hungry caches like web browsers.  (Jamie Lokier)

No plans yet, though we agree it sounds useful, at least for
apps that bypass the page cache (e.g. O_DIRECT).

Last updated: Dan Magenheimer, April 13 2011

Documentation/vm/frontswap.txt | 278 lines (new file)
@@ -0,0 +1,278 @@

Frontswap provides a "transcendent memory" interface for swap pages.
In some environments, dramatic performance savings may be obtained because
swapped pages are saved in RAM (or a RAM-like device) instead of a swap disk.

(Note, frontswap -- and cleancache (merged at 3.0) -- are the "frontends"
and the only necessary changes to the core kernel for transcendent memory;
all other supporting code -- the "backends" -- is implemented as drivers.
See the LWN.net article "Transcendent memory in a nutshell" for a detailed
overview of frontswap and related kernel parts:
https://lwn.net/Articles/454795/ )

Frontswap is so named because it can be thought of as the opposite of
a "backing" store for a swap device.  The storage is assumed to be
a synchronous concurrency-safe page-oriented "pseudo-RAM device" conforming
to the requirements of transcendent memory (such as Xen's "tmem", or
in-kernel compressed memory, aka "zcache", or future RAM-like devices);
this pseudo-RAM device is not directly accessible or addressable by the
kernel and is of unknown and possibly time-varying size.  The driver
links itself to frontswap by calling frontswap_register_ops to set the
frontswap_ops funcs appropriately and the functions it provides must
conform to certain policies as follows:

An "init" prepares the device to receive frontswap pages associated
with the specified swap device number (aka "type").  A "store" will
copy the page to transcendent memory and associate it with the type and
offset associated with the page.  A "load" will copy the page, if found,
from transcendent memory into kernel memory, but will NOT remove the page
from transcendent memory.  An "invalidate_page" will remove the page
from transcendent memory and an "invalidate_area" will remove ALL pages
associated with the swap type (e.g., like swapoff) and notify the "device"
to refuse further stores with that swap type.
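
Expressed as code, a backend therefore registers a set of operations like
the following (a sketch: the operation names are the ones listed above,
but the exact frontswap_ops prototypes are illustrative and "ramswap_*"
is a made-up example backend):

    static void ramswap_init(unsigned type)
    {
        /* prepare to receive pages for this swap device number */
    }

    static int ramswap_store(unsigned type, pgoff_t offset, struct page *page)
    {
        /* copy the page into the pseudo-RAM device, keyed by (type, offset);
         * return 0 on success, or nonzero to reject the page */
        return -1;
    }

    static int ramswap_load(unsigned type, pgoff_t offset, struct page *page)
    {
        /* fill the page from the copy stored earlier; do NOT remove it */
        return -1;
    }

    static struct frontswap_ops ramswap_ops = {
        .init  = ramswap_init,
        .store = ramswap_store,
        .load  = ramswap_load,
        /* .invalidate_page and .invalidate_area complete the set */
    };

    frontswap_register_ops(&ramswap_ops);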

Once a page is successfully stored, a matching load on the page will normally
succeed.  So when the kernel finds itself in a situation where it needs
to swap out a page, it first attempts to use frontswap.  If the store returns
success, the data has been successfully saved to transcendent memory and
a disk write and, if the data is later read back, a disk read are avoided.
If a store returns failure, transcendent memory has rejected the data, and the
page can be written to swap as usual.

If a backend chooses, frontswap can be configured as a "writethrough
cache" by calling frontswap_writethrough().  In this mode, the reduction
in swap device writes is lost (and also a non-trivial performance advantage)
in order to allow the backend to arbitrarily "reclaim" space used to
store frontswap pages to more completely manage its memory usage.

Note that if a page is stored and the page already exists in transcendent memory
(a "duplicate" store), either the store succeeds and the data is overwritten,
or the store fails AND the page is invalidated.  This ensures stale data may
never be obtained from frontswap.

If properly configured, monitoring of frontswap is done via debugfs in
the /sys/kernel/debug/frontswap directory.  The effectiveness of
frontswap can be measured (across all swap devices) with:

failed_stores   - how many store attempts have failed
loads           - how many loads were attempted (all should succeed)
succ_stores     - how many store attempts have succeeded
invalidates     - how many invalidates were attempted

A backend implementation may provide additional metrics.

FAQ

1) Where's the value?

When a workload starts swapping, performance falls through the floor.
Frontswap significantly increases performance in many such workloads by
providing a clean, dynamic interface to read and write swap pages to
"transcendent memory" that is otherwise not directly addressable to the kernel.
This interface is ideal when data is transformed to a different form
and size (such as with compression) or secretly moved (as might be
useful for write-balancing for some RAM-like devices).  Swap pages (and
evicted page-cache pages) are a great use for this kind of slower-than-RAM-
but-much-faster-than-disk "pseudo-RAM device" and the frontswap (and
cleancache) interface to transcendent memory provides a nice way to read
and write -- and indirectly "name" -- the pages.

Frontswap -- and cleancache -- with a fairly small impact on the kernel,
provide a huge amount of flexibility for more dynamic, flexible RAM
utilization in various system configurations:

In the single kernel case, aka "zcache", pages are compressed and
stored in local memory, thus increasing the total anonymous pages
that can be safely kept in RAM.  Zcache essentially trades off CPU
cycles used in compression/decompression for better memory utilization.
Benchmarks have shown little or no impact when memory pressure is
low while providing a significant performance improvement (25%+)
on some workloads under high memory pressure.

"RAMster" builds on zcache by adding "peer-to-peer" transcendent memory
support for clustered systems.  Frontswap pages are locally compressed
as in zcache, but then "remotified" to another system's RAM.  This
allows RAM to be dynamically load-balanced back-and-forth as needed,
i.e. when system A is overcommitted, it can swap to system B, and
vice versa.  RAMster can also be configured as a memory server so
many servers in a cluster can swap, dynamically as needed, to a single
server configured with a large amount of RAM... without pre-configuring
how much of the RAM is available for each of the clients!

In the virtual case, the whole point of virtualization is to statistically
multiplex physical resources across the varying demands of multiple
virtual machines.  This is really hard to do with RAM and efforts to do
it well with no kernel changes have essentially failed (except in some
well-publicized special-case workloads).
Specifically, the Xen Transcendent Memory backend allows otherwise
"fallow" hypervisor-owned RAM to not only be "time-shared" between multiple
virtual machines, but the pages can be compressed and deduplicated to
optimize RAM utilization.  And when guest OS's are induced to surrender
underutilized RAM (e.g. with "selfballooning"), sudden unexpected
memory pressure may result in swapping; frontswap allows those pages
to be swapped to and from hypervisor RAM (if overall host system memory
conditions allow), thus mitigating the potentially awful performance impact
of unplanned swapping.

A KVM implementation is underway and has been RFC'ed to lkml.  And,
using frontswap, investigation is also underway on the use of NVM as
a memory extension technology.

2) Sure there may be performance advantages in some situations, but
   what's the space/time overhead of frontswap?

If CONFIG_FRONTSWAP is disabled, every frontswap hook compiles into
nothingness and the only overhead is a few extra bytes per swapon'ed
swap device.  If CONFIG_FRONTSWAP is enabled but no frontswap "backend"
registers, there is one extra compare of a global variable against zero
for every swap page read or written.  If CONFIG_FRONTSWAP is enabled
AND a frontswap backend registers AND the backend fails every "store"
request (i.e. provides no memory despite claiming it might),
CPU overhead is still negligible -- and since every frontswap fail
precedes a swap page write-to-disk, the system is highly likely
to be I/O bound and using a small fraction of a percent of a CPU
will be irrelevant anyway.

As for space, if CONFIG_FRONTSWAP is enabled AND a frontswap backend
registers, one bit is allocated for every swap page for every swap
device that is swapon'd.  This is added to the EIGHT bits (which
was sixteen until about 2.6.34) that the kernel already allocates
for every swap page for every swap device that is swapon'd.  (Hugh
Dickins has observed that frontswap could probably steal one of
the existing eight bits, but let's worry about that minor optimization
later.)  For very large swap disks (which are rare) on a standard
4K pagesize, this is 1MB per 32GB swap (32GB of swap is 8M pages of
4K each, and 8M bits add up to 1MB).

When swap pages are stored in transcendent memory instead of written
out to disk, there is a side effect that this may create more memory
pressure that can potentially outweigh the other advantages.  A
backend, such as zcache, must implement policies to carefully (but
dynamically) manage memory limits to ensure this doesn't happen.

3) OK, how about a quick overview of what this frontswap patch does
   in terms that a kernel hacker can grok?

Let's assume that a frontswap "backend" has registered during
kernel initialization; this registration indicates that this
frontswap backend has access to some "memory" that is not directly
accessible by the kernel.  Exactly how much memory it provides is
entirely dynamic and random.

Whenever a swap-device is swapon'd frontswap_init() is called,
passing the swap device number (aka "type") as a parameter.
This notifies frontswap to expect attempts to "store" swap pages
associated with that number.

Whenever the swap subsystem is readying a page to write to a swap
device (cf. swap_writepage()), frontswap_store is called.  Frontswap
consults with the frontswap backend and if the backend says it does NOT
have room, frontswap_store returns -1 and the kernel swaps the page
to the swap device as normal.  Note that the response from the frontswap
backend is unpredictable to the kernel; it may choose to never accept a
page, it could accept every ninth page, or it might accept every
page.  But if the backend does accept a page, the data from the page
has already been copied and associated with the type and offset,
and the backend guarantees the persistence of the data.  In this case,
frontswap sets a bit in the "frontswap_map" for the swap device
corresponding to the page offset on the swap device to which it would
otherwise have written the data.
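
In code form, the swap-out decision just described reduces to something
like the following (a simplified sketch of the hook in the
swap_writepage() path, not the verbatim kernel source; the two helpers
named here are hypothetical):

    if (frontswap_store(page) == 0) {
        /* the backend accepted the page: the matching bit in
         * frontswap_map is now set and no disk write is needed */
        finish_swap_writeback(page);        /* hypothetical cleanup */
    } else {
        /* backend refused (returned nonzero): fall through to the
         * normal block I/O swap-out path */
        write_page_to_swap_device(page);    /* hypothetical */
    }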

When the swap subsystem needs to swap-in a page (swap_readpage()),
it first calls frontswap_load() which checks the frontswap_map to
see if the page was earlier accepted by the frontswap backend.  If
it was, the page of data is filled from the frontswap backend and
the swap-in is complete.  If not, the normal swap-in code is
executed to obtain the page of data from the real swap device.

So every time the frontswap backend accepts a page, a swap device read
and (potentially) a swap device write are replaced by a "frontswap backend
store" and (possibly) a "frontswap backend load", which are presumably much
faster.

4) Can't frontswap be configured as a "special" swap device that is
   just higher priority than any real swap device (e.g. like zswap,
   or maybe swap-over-nbd/NFS)?

No.  First, the existing swap subsystem doesn't allow for any kind of
swap hierarchy.  Perhaps it could be rewritten to accommodate a hierarchy,
but this would require fairly drastic changes.  Even if it were
rewritten, the existing swap subsystem uses the block I/O layer which
assumes a swap device is fixed size and any page in it is linearly
addressable.  Frontswap barely touches the existing swap subsystem,
and works around the constraints of the block I/O subsystem to provide
a great deal of flexibility and dynamicity.

For example, the acceptance of any swap page by the frontswap backend is
entirely unpredictable.  This is critical to the definition of frontswap
backends because it grants completely dynamic discretion to the
backend.  In zcache, one cannot know a priori how compressible a page is.
"Poorly" compressible pages can be rejected, and "poorly" can itself be
defined dynamically depending on current memory constraints.

Further, frontswap is entirely synchronous whereas a real swap
device is, by definition, asynchronous and uses block I/O.  The
block I/O layer is not only unnecessary, but may perform "optimizations"
that are inappropriate for a RAM-oriented device including delaying
the write of some pages for a significant amount of time.  Synchrony is
required to ensure the dynamicity of the backend and to avoid thorny race
conditions that would unnecessarily and greatly complicate frontswap
and/or the block I/O subsystem.  That said, only the initial "store"
and "load" operations need be synchronous.  A separate asynchronous thread
is free to manipulate the pages stored by frontswap.  For example,
the "remotification" thread in RAMster uses standard asynchronous
kernel sockets to move compressed frontswap pages to a remote machine.
Similarly, a KVM guest-side implementation could do in-guest compression
and use "batched" hypercalls.

In a virtualized environment, the dynamicity allows the hypervisor
(or host OS) to do "intelligent overcommit".  For example, it can
choose to accept pages only until host-swapping might be imminent,
then force guests to do their own swapping.

There is a downside to the transcendent memory specifications for
frontswap: Since any "store" might fail, there must always be a real
slot on a real swap device to swap the page.  Thus frontswap must be
implemented as a "shadow" to every swapon'd device with the potential
capability of holding every page that the swap device might have held
and the possibility that it might hold no pages at all.  This means
that frontswap cannot contain more pages than the total of swapon'd
swap devices.  For example, if NO swap device is configured on some
installation, frontswap is useless.  Swapless portable devices
can still use frontswap but a backend for such devices must configure
some kind of "ghost" swap device and ensure that it is never used.

5) Why this weird definition about "duplicate stores"?  If a page
   has been previously successfully stored, can't it always be
   successfully overwritten?

Nearly always it can, but no, sometimes it cannot.  Consider an example
where data is compressed and the original 4K page has been compressed
to 1K.  Now an attempt is made to overwrite the page with data that
is non-compressible and so would take the entire 4K.  But the backend
has no more space.  In this case, the store must be rejected.  Whenever
frontswap rejects a store that would overwrite, it also must invalidate
the old data and ensure that it is no longer accessible.  Since the
swap subsystem then writes the new data to the real swap device,
this is the correct course of action to ensure coherency.

6) What is frontswap_shrink for?

When the (non-frontswap) swap subsystem swaps out a page to a real
swap device, that page is only taking up low-value pre-allocated disk
space.  But if frontswap has placed a page in transcendent memory, that
page may be taking up valuable real estate.  The frontswap_shrink
routine allows code outside of the swap subsystem to force pages out
of the memory managed by frontswap and back into kernel-addressable memory.
For example, in RAMster, a "suction driver" thread will attempt
to "repatriate" pages sent to a remote machine back to the local machine;
this is driven using the frontswap_shrink mechanism when memory pressure
subsides.

7) Why does the frontswap patch create the new include file swapfile.h?

The frontswap code depends on some swap-subsystem-internal data
structures that have, over the years, moved back and forth between
static and global.  This seemed a reasonable compromise: Define
them as global but declare them in a new include file that isn't
included by the large number of source files that include swap.h.

Dan Magenheimer, last updated April 9, 2012

Documentation/vm/highmem.txt | 162 lines (new file)
@@ -0,0 +1,162 @@

====================
HIGH MEMORY HANDLING
====================

By: Peter Zijlstra <a.p.zijlstra@chello.nl>

Contents:

 (*) What is high memory?

 (*) Temporary virtual mappings.

 (*) Using kmap_atomic.

 (*) Cost of temporary mappings.

 (*) i386 PAE.


====================
WHAT IS HIGH MEMORY?
====================

High memory (highmem) is used when the size of physical memory approaches or
exceeds the maximum size of virtual memory.  At that point it becomes
impossible for the kernel to keep all of the available physical memory mapped
at all times.  This means the kernel needs to start using temporary mappings of
the pieces of physical memory that it wants to access.

The part of (physical) memory not covered by a permanent mapping is what we
refer to as 'highmem'.  There are various architecture dependent constraints on
where exactly that border lies.

In the i386 arch, for example, we choose to map the kernel into every process's
VM space so that we don't have to pay the full TLB invalidation costs for
kernel entry/exit.  This means the available virtual memory space (4GiB on
i386) has to be divided between user and kernel space.

The traditional split for architectures using this approach is 3:1, 3GiB for
userspace and the top 1GiB for kernel space:

        +--------+ 0xffffffff
        | Kernel |
        +--------+ 0xc0000000
        |        |
        | User   |
        |        |
        +--------+ 0x00000000

This means that the kernel can at most map 1GiB of physical memory at any one
time, but because we need virtual address space for other things - including
temporary maps to access the rest of the physical memory - the actual direct
map will typically be less (usually around ~896MiB).

Other architectures that have mm context tagged TLBs can have separate kernel
and user maps.  Some hardware (like some ARMs), however, has limited virtual
space when it uses mm context tags.


==========================
TEMPORARY VIRTUAL MAPPINGS
==========================

The kernel contains several ways of creating temporary mappings:

 (*) vmap().  This can be used to make a long duration mapping of multiple
     physical pages into a contiguous virtual space.  It needs global
     synchronization to unmap.

 (*) kmap().  This permits a short duration mapping of a single page.  It needs
     global synchronization, but is amortized somewhat.  It is also prone to
     deadlocks when used in a nested fashion, and so it is not recommended for
     new code.  (A short usage sketch follows this list.)

 (*) kmap_atomic().  This permits a very short duration mapping of a single
     page.  Since the mapping is restricted to the CPU that issued it, it
     performs well, but the issuing task is therefore required to stay on that
     CPU until it has finished, lest some other task displace its mappings.

     kmap_atomic() may also be used by interrupt contexts, since it does not
     sleep and the caller may not sleep until after kunmap_atomic() is called.

     It may be assumed that k[un]map_atomic() won't fail.
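
     For comparison with the kmap_atomic() usage shown in the next section,
     here is a short sketch of using the sleeping kmap() variant (illustrative
     only; "mapping" and "offset" are assumed to be in scope as in the next
     section's example):

        struct page *page = find_get_page(mapping, offset);

        /* May sleep, and the mapping survives a reschedule. */
        void *vaddr = kmap(page);

        memset(vaddr, 0, PAGE_SIZE);

        /* Note: kunmap() takes the page, not the returned address. */
        kunmap(page);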

=================
USING KMAP_ATOMIC
=================

When and where to use kmap_atomic() is straightforward.  It is used when code
wants to access the contents of a page that might be allocated from high memory
(see __GFP_HIGHMEM), for example a page in the pagecache.  The API has two
functions, and they can be used in a manner similar to the following:

        /* Find the page of interest. */
        struct page *page = find_get_page(mapping, offset);

        /* Gain access to the contents of that page. */
        void *vaddr = kmap_atomic(page);

        /* Do something to the contents of that page. */
        memset(vaddr, 0, PAGE_SIZE);

        /* Unmap that page. */
        kunmap_atomic(vaddr);

Note that the kunmap_atomic() call takes the result of the kmap_atomic() call
not the argument.

If you need to map two pages because you want to copy from one page to
another you need to keep the kmap_atomic calls strictly nested, like:

        vaddr1 = kmap_atomic(page1);
        vaddr2 = kmap_atomic(page2);

        memcpy(vaddr1, vaddr2, PAGE_SIZE);

        kunmap_atomic(vaddr2);
        kunmap_atomic(vaddr1);


==========================
COST OF TEMPORARY MAPPINGS
==========================

The cost of creating temporary mappings can be quite high.  The arch has to
manipulate the kernel's page tables, the data TLB and/or the MMU's registers.

If CONFIG_HIGHMEM is not set, then the kernel will try and create a mapping
simply with a bit of arithmetic that will convert the page struct address into
a pointer to the page contents rather than juggling mappings about.  In such a
case, the unmap operation may be a null operation.

If CONFIG_MMU is not set, then there can be no temporary mappings and no
highmem.  In such a case, the arithmetic approach will also be used.


========
i386 PAE
========

The i386 arch, under some circumstances, will permit you to stick up to 64GiB
of RAM into your 32-bit machine.  This has a number of consequences:

 (*) Linux needs a page-frame structure for each page in the system and the
     pageframes need to live in the permanent mapping, which means:

 (*) you can have 896M/sizeof(struct page) page-frames at most; with struct
     page being 32-bytes that would end up being something in the order of 112G
     worth of pages; the kernel, however, needs to store more than just
     page-frames in that memory...

 (*) PAE makes your page tables larger - which slows the system down as more
     data has to be accessed to traverse in TLB fills and the like.  One
     advantage is that PAE has more PTE bits and can provide advanced features
     like NX and PAT.

The general recommendation is that you don't use more than 8GiB on a 32-bit
machine - although more might work for you and your workload, you're pretty
much on your own - don't expect kernel developers to really care much if things
come apart.

Documentation/vm/hugetlbpage.txt | 314 lines (new file)
@@ -0,0 +1,314 @@

The intent of this file is to give a brief summary of hugetlbpage support in
the Linux kernel.  This support is built on top of multiple page size support
that is provided by most modern architectures.  For example, the i386
architecture supports 4K and 4M (2M in PAE mode) page sizes, the ia64
architecture supports multiple page sizes 4K, 8K, 64K, 256K, 1M, 4M, 16M,
256M and ppc64 supports 4K and 16M.  A TLB is a cache of virtual-to-physical
translations.  Typically this is a very scarce resource on a processor.
Operating systems try to make best use of the limited number of TLB resources.
This optimization is more critical now as bigger and bigger physical memories
(several GBs) are more readily available.

Users can use the huge page support in the Linux kernel by either using the mmap
system call or standard SYSV shared memory system calls (shmget, shmat).
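
For the mmap route, a minimal user-space sketch looks like this (an
illustration only: it assumes a kernel and libc that expose MAP_HUGETLB,
and enough huge pages in the pool; the SYSV shared memory route is
discussed later in this file):

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #define LENGTH (256UL * 1024 * 1024)

    #ifndef MAP_HUGETLB
    #define MAP_HUGETLB 0x40000    /* arch-specific; value shown is for x86 */
    #endif

    int main(void)
    {
        void *addr = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (addr == MAP_FAILED) {
            perror("mmap");        /* e.g. empty pool or no hugetlb support */
            return 1;
        }
        memset(addr, 0, LENGTH);   /* touch the range so huge pages are faulted in */
        munmap(addr, LENGTH);
        return 0;
    }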
|
||||
|
||||
First the Linux kernel needs to be built with the CONFIG_HUGETLBFS
|
||||
(present under "File systems") and CONFIG_HUGETLB_PAGE (selected
|
||||
automatically when CONFIG_HUGETLBFS is selected) configuration
|
||||
options.
|
||||
|
||||
The /proc/meminfo file provides information about the total number of
|
||||
persistent hugetlb pages in the kernel's huge page pool. It also displays
|
||||
information about the number of free, reserved and surplus huge pages and the
|
||||
default huge page size. The huge page size is needed for generating the
|
||||
proper alignment and size of the arguments to system calls that map huge page
|
||||
regions.
|
||||
|
||||
The output of "cat /proc/meminfo" will include lines like:
|
||||
|
||||
.....
|
||||
HugePages_Total: vvv
|
||||
HugePages_Free: www
|
||||
HugePages_Rsvd: xxx
|
||||
HugePages_Surp: yyy
|
||||
Hugepagesize: zzz kB
|
||||
|
||||
where:
|
||||
HugePages_Total is the size of the pool of huge pages.
|
||||
HugePages_Free is the number of huge pages in the pool that are not yet
|
||||
allocated.
|
||||
HugePages_Rsvd is short for "reserved," and is the number of huge pages for
|
||||
which a commitment to allocate from the pool has been made,
|
||||
but no allocation has yet been made. Reserved huge pages
|
||||
guarantee that an application will be able to allocate a
|
||||
huge page from the pool of huge pages at fault time.
|
||||
HugePages_Surp is short for "surplus," and is the number of huge pages in
|
||||
the pool above the value in /proc/sys/vm/nr_hugepages. The
|
||||
maximum number of surplus huge pages is controlled by
|
||||
/proc/sys/vm/nr_overcommit_hugepages.
|
||||
|
||||
/proc/filesystems should also show a filesystem of type "hugetlbfs" configured
|
||||
in the kernel.
|
||||
|
||||
/proc/sys/vm/nr_hugepages indicates the current number of "persistent" huge
|
||||
pages in the kernel's huge page pool. "Persistent" huge pages will be
|
||||
returned to the huge page pool when freed by a task. A user with root
|
||||
privileges can dynamically allocate more or free some persistent huge pages
|
||||
by increasing or decreasing the value of 'nr_hugepages'.
|
||||
|
||||
Pages that are used as huge pages are reserved inside the kernel and cannot
|
||||
be used for other purposes. Huge pages cannot be swapped out under
|
||||
memory pressure.
|
||||
|
||||
Once a number of huge pages have been pre-allocated to the kernel huge page
|
||||
pool, a user with appropriate privilege can use either the mmap system call
|
||||
or shared memory system calls to use the huge pages. See the discussion of
|
||||
Using Huge Pages, below.
|
||||
|
||||
The administrator can allocate persistent huge pages on the kernel boot
|
||||
command line by specifying the "hugepages=N" parameter, where 'N' = the
|
||||
number of huge pages requested. This is the most reliable method of
|
||||
allocating huge pages as memory has not yet become fragmented.
|
||||
|
||||
Some platforms support multiple huge page sizes. To allocate huge pages
|
||||
of a specific size, one must precede the huge pages boot command parameters
|
||||
with a huge page size selection parameter "hugepagesz=<size>". <size> must
|
||||
be specified in bytes with optional scale suffix [kKmMgG]. The default huge
|
||||
page size may be selected with the "default_hugepagesz=<size>" boot parameter.
|
||||
|
||||
When multiple huge page sizes are supported, /proc/sys/vm/nr_hugepages
|
||||
indicates the current number of pre-allocated huge pages of the default size.
|
||||
Thus, one can use the following command to dynamically allocate/deallocate
|
||||
default sized persistent huge pages:
|
||||
|
||||
echo 20 > /proc/sys/vm/nr_hugepages
|
||||
|
||||
This command will try to adjust the number of default sized huge pages in the
|
||||
huge page pool to 20, allocating or freeing huge pages, as required.
|
||||
|
||||
On a NUMA platform, the kernel will attempt to distribute the huge page pool
|
||||
over the set of allowed nodes specified by the NUMA memory policy of the
|
||||
task that modifies nr_hugepages. The default for the allowed nodes--when the
|
||||
task has default memory policy--is all on-line nodes with memory. Allowed
|
||||
nodes with insufficient available, contiguous memory for a huge page will be
|
||||
silently skipped when allocating persistent huge pages. See the discussion
|
||||
below of the interaction of task memory policy, cpusets and per node attributes
|
||||
with the allocation and freeing of persistent huge pages.
|
||||
|
||||
The success or failure of huge page allocation depends on the amount of
|
||||
physically contiguous memory that is present in the system at the time of the
|
||||
allocation attempt. If the kernel is unable to allocate huge pages from
|
||||
some nodes in a NUMA system, it will attempt to make up the difference by
|
||||
allocating extra pages on other nodes with sufficient available contiguous
|
||||
memory, if any.
|
||||
|
||||
System administrators may want to put this command in one of the local rc
|
||||
init files. This will enable the kernel to allocate huge pages early in
|
||||
the boot process when the possibility of getting physically contiguous pages
|
||||
is still very high. Administrators can verify the number of huge pages
|
||||
actually allocated by checking the sysctl or meminfo. To check the per node
|
||||
distribution of huge pages in a NUMA system, use:
|
||||
|
||||
cat /sys/devices/system/node/node*/meminfo | fgrep Huge
|
||||
|
||||
/proc/sys/vm/nr_overcommit_hugepages specifies how large the pool of
|
||||
huge pages can grow, if more huge pages than /proc/sys/vm/nr_hugepages are
|
||||
requested by applications. Writing any non-zero value into this file
|
||||
indicates that the hugetlb subsystem is allowed to try to obtain that
|
||||
number of "surplus" huge pages from the kernel's normal page pool, when the
|
||||
persistent huge page pool is exhausted. As these surplus huge pages become
|
||||
unused, they are freed back to the kernel's normal page pool.
|
||||
|
||||
When increasing the huge page pool size via nr_hugepages, any existing surplus
|
||||
pages will first be promoted to persistent huge pages. Then, additional
|
||||
huge pages will be allocated, if necessary and if possible, to fulfill
|
||||
the new persistent huge page pool size.
|
||||
|
||||
The administrator may shrink the pool of persistent huge pages for
|
||||
the default huge page size by setting the nr_hugepages sysctl to a
|
||||
smaller value. The kernel will attempt to balance the freeing of huge pages
|
||||
across all nodes in the memory policy of the task modifying nr_hugepages.
|
||||
Any free huge pages on the selected nodes will be freed back to the kernel's
|
||||
normal page pool.
|
||||
|
||||
Caveat: Shrinking the persistent huge page pool via nr_hugepages such that
|
||||
it becomes less than the number of huge pages in use will convert the balance
|
||||
of the in-use huge pages to surplus huge pages. This will occur even if
|
||||
the number of surplus pages would exceed the overcommit value. As long as
|
||||
this condition holds--that is, until nr_hugepages+nr_overcommit_hugepages is
|
||||
increased sufficiently, or the surplus huge pages go out of use and are freed--
|
||||
no more surplus huge pages will be allowed to be allocated.
|
||||
|
||||
With support for multiple huge page pools at run-time available, much of
|
||||
the huge page userspace interface in /proc/sys/vm has been duplicated in sysfs.
|
||||
The /proc interfaces discussed above have been retained for backwards
|
||||
compatibility. The root huge page control directory in sysfs is:
|
||||
|
||||
/sys/kernel/mm/hugepages
|
||||
|
||||
For each huge page size supported by the running kernel, a subdirectory
|
||||
will exist, of the form:
|
||||
|
||||
hugepages-${size}kB
|
||||
|
||||
Inside each of these directories, the same set of files will exist:
|
||||
|
||||
nr_hugepages
|
||||
nr_hugepages_mempolicy
|
||||
nr_overcommit_hugepages
|
||||
free_hugepages
|
||||
resv_hugepages
|
||||
surplus_hugepages
|
||||
|
||||
which function as described above for the default huge page-sized case.
|
||||
|
||||
|
||||
Interaction of Task Memory Policy with Huge Page Allocation/Freeing
|
||||
===================================================================
|
||||
|
||||
Whether huge pages are allocated and freed via the /proc interface or
|
||||
the /sysfs interface using the nr_hugepages_mempolicy attribute, the NUMA
|
||||
nodes from which huge pages are allocated or freed are controlled by the
|
||||
NUMA memory policy of the task that modifies the nr_hugepages_mempolicy
|
||||
sysctl or attribute. When the nr_hugepages attribute is used, mempolicy
|
||||
is ignored.
|
||||
|
||||
The recommended method to allocate or free huge pages to/from the kernel
|
||||
huge page pool, using the nr_hugepages example above, is:
|
||||
|
||||
numactl --interleave <node-list> echo 20 \
|
||||
>/proc/sys/vm/nr_hugepages_mempolicy
|
||||
|
||||
or, more succinctly:
|
||||
|
||||
numactl -m <node-list> echo 20 >/proc/sys/vm/nr_hugepages_mempolicy
|
||||
|
||||
This will allocate or free abs(20 - nr_hugepages) to or from the nodes
|
||||
specified in <node-list>, depending on whether number of persistent huge pages
|
||||
is initially less than or greater than 20, respectively. No huge pages will be
|
||||
allocated nor freed on any node not included in the specified <node-list>.
|
||||
|
||||
When adjusting the persistent hugepage count via nr_hugepages_mempolicy, any
|
||||
memory policy mode--bind, preferred, local or interleave--may be used. The
|
||||
resulting effect on persistent huge page allocation is as follows:
|
||||
|
||||
1) Regardless of mempolicy mode [see Documentation/vm/numa_memory_policy.txt],
|
||||
persistent huge pages will be distributed across the node or nodes
|
||||
specified in the mempolicy as if "interleave" had been specified.
|
||||
However, if a node in the policy does not contain sufficient contiguous
|
||||
memory for a huge page, the allocation will not "fallback" to the nearest
|
||||
neighbor node with sufficient contiguous memory. To do this would cause
|
||||
undesirable imbalance in the distribution of the huge page pool, or
|
||||
possibly, allocation of persistent huge pages on nodes not allowed by
|
||||
the task's memory policy.
|
||||
|
||||
2) One or more nodes may be specified with the bind or interleave policy.
|
||||
If more than one node is specified with the preferred policy, only the
|
||||
lowest numeric id will be used. Local policy will select the node where
|
||||
the task is running at the time the nodes_allowed mask is constructed.
|
||||
For local policy to be deterministic, the task must be bound to a cpu or
|
||||
cpus in a single node. Otherwise, the task could be migrated to some
|
||||
other node at any time after launch and the resulting node will be
|
||||
indeterminate. Thus, local policy is not very useful for this purpose.
|
||||
Any of the other mempolicy modes may be used to specify a single node.
|
||||
|
||||
3) The nodes allowed mask will be derived from any non-default task mempolicy,
|
||||
whether this policy was set explicitly by the task itself or one of its
|
||||
ancestors, such as numactl. This means that if the task is invoked from a
|
||||
shell with non-default policy, that policy will be used. One can specify a
|
||||
node list of "all" with numactl --interleave or --membind [-m] to achieve
|
||||
interleaving over all nodes in the system or cpuset.
|
||||
|
||||
4) Any task mempolicy specified--e.g., using numactl--will be constrained by
|
||||
the resource limits of any cpuset in which the task runs. Thus, there will
|
||||
be no way for a task with non-default policy running in a cpuset with a
|
||||
subset of the system nodes to allocate huge pages outside the cpuset
|
||||
without first moving to a cpuset that contains all of the desired nodes.
|
||||
|
||||
5) Boot-time huge page allocation attempts to distribute the requested number
|
||||
of huge pages over all on-line nodes with memory.
|
||||
|
||||
Per Node Hugepages Attributes
|
||||
=============================
|
||||
|
||||
A subset of the contents of the root huge page control directory in sysfs,
|
||||
described above, will be replicated under the system device of each
|
||||
NUMA node with memory in:
|
||||
|
||||
/sys/devices/system/node/node[0-9]*/hugepages/
|
||||
|
||||
Under this directory, the subdirectory for each supported huge page size
|
||||
contains the following attribute files:
|
||||
|
||||
nr_hugepages
|
||||
free_hugepages
|
||||
surplus_hugepages
|
||||
|
||||
The free_ and surplus_ attribute files are read-only. They return the number
|
||||
of free and surplus [overcommitted] huge pages, respectively, on the parent
|
||||
node.
|
||||
|
||||
The nr_hugepages attribute returns the total number of huge pages on the
|
||||
specified node. When this attribute is written, the number of persistent huge
|
||||
pages on the parent node will be adjusted to the specified value, if sufficient
|
||||
resources exist, regardless of the task's mempolicy or cpuset constraints.
|
||||
|
||||
Note that the number of overcommit and reserve pages remain global quantities,
|
||||
as we don't know until fault time, when the faulting task's mempolicy is
|
||||
applied, from which node the huge page allocation will be attempted.
|
||||
|
||||
|
||||
Using Huge Pages
|
||||
================
|
||||
|
||||
If the user applications are going to request huge pages using mmap system
|
||||
call, then it is required that the system administrator mount a file system of
|
||||
type hugetlbfs:
|
||||
|
||||
mount -t hugetlbfs \
|
||||
-o uid=<value>,gid=<value>,mode=<value>,size=<value>,nr_inodes=<value> \
|
||||
none /mnt/huge
|
||||
|
||||
This command mounts a (pseudo) filesystem of type hugetlbfs on the directory
|
||||
/mnt/huge. Any files created on /mnt/huge use huge pages. The uid and gid
|
||||
options set the owner and group of the root of the file system. By default
|
||||
the uid and gid of the current process are taken. The mode option sets the
|
||||
mode of the root of the file system to value & 01777. This value is given in octal.
|
||||
By default the value 0755 is picked. The size option sets the maximum value of
|
||||
memory (huge pages) allowed for that filesystem (/mnt/huge). The size is
|
||||
rounded down to HPAGE_SIZE. The option nr_inodes sets the maximum number of
|
||||
inodes that /mnt/huge can use. If the size or nr_inodes option is not
|
||||
provided on command line then no limits are set. For size and nr_inodes
|
||||
options, you can use [G|g]/[M|m]/[K|k] to represent giga/mega/kilo. For
|
||||
example, size=2K has the same meaning as size=2048.
|
||||
|
||||
While read system calls are supported on files that reside on hugetlb
|
||||
file systems, write system calls are not.
|
||||
|
||||
Regular chown, chgrp, and chmod commands (with right permissions) could be
|
||||
used to change the file attributes on hugetlbfs.
|
||||
|
||||
Also, it is important to note that no such mount command is required if the
|
||||
applications are going to use only shmat/shmget system calls or mmap with
|
||||
MAP_HUGETLB. Users who wish to use hugetlb page via shared memory segment
|
||||
should be a member of a supplementary group and system admin needs to
|
||||
configure that gid into /proc/sys/vm/hugetlb_shm_group. It is possible for
|
||||
same or different applications to use any combination of mmaps and shm*
|
||||
calls, though the mount of filesystem will be required for using mmap calls
|
||||
without MAP_HUGETLB. For an example of how to use mmap with MAP_HUGETLB see
|
||||
map_hugetlb.c.
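
For a rough, self-contained sketch of that approach (the maintained example is
the map_hugetlb.c selftest referenced above), the program below maps one region
with MAP_HUGETLB; the 2MB LENGTH is an assumption matching the common default
huge page size, so adjust it if your system's default differs.

/* Hedged sketch: anonymous mapping backed by huge pages via MAP_HUGETLB. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MAP_HUGETLB
#define MAP_HUGETLB 0x40000	/* value from the x86 uapi headers */
#endif

#define LENGTH (2UL * 1024 * 1024)	/* assumed 2MB default huge page size */

int main(void)
{
	void *addr = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	if (addr == MAP_FAILED) {
		perror("mmap");		/* e.g. no huge pages reserved */
		return 1;
	}
	memset(addr, 0, LENGTH);	/* touch the huge page */
	munmap(addr, LENGTH);
	return 0;
}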
|
||||
|
||||
Examples
|
||||
========
|
||||
|
||||
1) map_hugetlb: see tools/testing/selftests/vm/map_hugetlb.c
|
||||
|
||||
2) hugepage-shm: see tools/testing/selftests/vm/hugepage-shm.c
|
||||
|
||||
3) hugepage-mmap: see tools/testing/selftests/vm/hugepage-mmap.c
|
||||
|
||||
4) The libhugetlbfs (http://libhugetlbfs.sourceforge.net) library provides a
|
||||
wide range of userspace tools to help with huge page usability, environment
|
||||
setup, and control. Furthermore it provides useful test cases that should be
|
||||
used when modifying code to ensure no regressions are introduced.
|
187
Documentation/vm/hwpoison.txt
Normal file
@ -0,0 +1,187 @@
What is hwpoison?
|
||||
|
||||
Upcoming Intel CPUs have support for recovering from some memory errors
|
||||
(``MCA recovery''). This requires the OS to declare a page "poisoned",
|
||||
kill the processes associated with it and avoid using it in the future.
|
||||
|
||||
This patchkit implements the necessary infrastructure in the VM.
|
||||
|
||||
To quote the overview comment:
|
||||
|
||||
* High level machine check handler. Handles pages reported by the
|
||||
* hardware as being corrupted usually due to a 2bit ECC memory or cache
|
||||
* failure.
|
||||
*
|
||||
* This focusses on pages detected as corrupted in the background.
|
||||
* When the current CPU tries to consume corruption the currently
|
||||
* running process can just be killed directly instead. This implies
|
||||
* that if the error cannot be handled for some reason it's safe to
|
||||
* just ignore it because no corruption has been consumed yet. Instead
|
||||
* when that happens another machine check will happen.
|
||||
*
|
||||
* Handles page cache pages in various states. The tricky part
|
||||
* here is that we can access any page asynchronous to other VM
|
||||
* users, because memory failures could happen anytime and anywhere,
|
||||
* possibly violating some of their assumptions. This is why this code
|
||||
* has to be extremely careful. Generally it tries to use normal locking
|
||||
* rules, as in get the standard locks, even if that means the
|
||||
* error handling takes potentially a long time.
|
||||
*
|
||||
* Some of the operations here are somewhat inefficient and have non
|
||||
* linear algorithmic complexity, because the data structures have not
|
||||
* been optimized for this case. This is in particular the case
|
||||
* for the mapping from a vma to a process. Since this case is expected
|
||||
* to be rare we hope we can get away with this.
|
||||
|
||||
The code consists of the high level handler in mm/memory-failure.c,
|
||||
a new page poison bit and various checks in the VM to handle poisoned
|
||||
pages.
|
||||
|
||||
The main target right now is KVM guests, but it works for all kinds
|
||||
of applications. KVM support requires a recent qemu-kvm release.
|
||||
|
||||
For the KVM use there was a need for a new signal type so that
|
||||
KVM can inject the machine check into the guest with the proper
|
||||
address. This in theory allows other applications to handle
|
||||
memory failures too. The expectation is that nearly all applications
|
||||
won't do that, but some very specialized ones might.
|
||||
|
||||
---
|
||||
|
||||
There are two (actually three) modes memory failure recovery can be in:
|
||||
|
||||
vm.memory_failure_recovery sysctl set to zero:
|
||||
All memory failures cause a panic. Do not attempt recovery.
|
||||
(on x86 this can be also affected by the tolerant level of the
|
||||
MCE subsystem)
|
||||
|
||||
early kill
|
||||
(can be controlled globally and per process)
|
||||
Send SIGBUS to the application as soon as the error is detected
|
||||
This allows applications that can process memory errors in a gentle
|
||||
way (e.g. drop affected object)
|
||||
This is the mode used by KVM qemu.
|
||||
|
||||
late kill
|
||||
Send SIGBUS when the application runs into the corrupted page.
|
||||
This is best for memory-error-unaware applications and is the default.
|
||||
Note some pages are always handled as late kill.
|
||||
|
||||
---
|
||||
|
||||
User control:
|
||||
|
||||
vm.memory_failure_recovery
|
||||
See sysctl.txt
|
||||
|
||||
vm.memory_failure_early_kill
|
||||
Enable early kill mode globally
|
||||
|
||||
PR_MCE_KILL
|
||||
Set early/late kill mode/revert to system default
|
||||
arg1: PR_MCE_KILL_CLEAR: Revert to system default
|
||||
arg1: PR_MCE_KILL_SET: arg2 defines thread specific mode
|
||||
PR_MCE_KILL_EARLY: Early kill
|
||||
PR_MCE_KILL_LATE: Late kill
|
||||
PR_MCE_KILL_DEFAULT: Use system global default
|
||||
Note that if you want to have a dedicated thread which handles
|
||||
the SIGBUS(BUS_MCEERR_AO) on behalf of the process, you should
|
||||
call prctl(PR_MCE_KILL_EARLY) on the designated thread. Otherwise,
|
||||
the SIGBUS is sent to the main thread.
|
||||
|
||||
PR_MCE_KILL_GET
|
||||
return current mode
|
||||
|
||||
|
||||
---
|
||||
|
||||
Testing:
|
||||
|
||||
madvise(MADV_HWPOISON, ....)
|
||||
(as root)
|
||||
Poison a page in the process for testing
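
A minimal sketch of such a test is shown below; it assumes root privileges and
a kernel built with memory failure support, and simply software-poisons one of
its own anonymous pages.

/* Hedged sketch: software-poison one page of this process (root only). */
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_HWPOISON
#define MADV_HWPOISON 100	/* value from asm-generic/mman-common.h */
#endif

int main(void)
{
	long pagesize = sysconf(_SC_PAGESIZE);
	char *page = mmap(NULL, pagesize, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (page == MAP_FAILED)
		return 1;
	page[0] = 1;				/* make sure the page is populated */
	if (madvise(page, pagesize, MADV_HWPOISON) != 0)
		perror("madvise(MADV_HWPOISON)");
	else
		printf("page at %p is now hwpoisoned\n", page);
	return 0;
}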
|
||||
|
||||
|
||||
hwpoison-inject module through debugfs
|
||||
|
||||
/sys/debug/hwpoison/
|
||||
|
||||
corrupt-pfn
|
||||
|
||||
Inject hwpoison fault at PFN echoed into this file. This does
|
||||
some early filtering to avoid corrupting unintended pages in test suites.
|
||||
|
||||
unpoison-pfn
|
||||
|
||||
Software-unpoison page at PFN echoed into this file. This
|
||||
way a page can be reused again.
|
||||
This only works for Linux injected failures, not for real
|
||||
memory failures.
|
||||
|
||||
Note these injection interfaces are not stable and might change between
|
||||
kernel versions
|
||||
|
||||
corrupt-filter-dev-major
|
||||
corrupt-filter-dev-minor
|
||||
|
||||
Only handle memory failures to pages associated with the file system defined
|
||||
by block device major/minor. -1U is the wildcard value.
|
||||
This should be only used for testing with artificial injection.
|
||||
|
||||
corrupt-filter-memcg
|
||||
|
||||
Limit injection to pages owned by the memcg, specified by the inode number
|
||||
of the memcg.
|
||||
|
||||
Example:
|
||||
mkdir /sys/fs/cgroup/mem/hwpoison
|
||||
|
||||
usemem -m 100 -s 1000 &
|
||||
echo `jobs -p` > /sys/fs/cgroup/mem/hwpoison/tasks
|
||||
|
||||
memcg_ino=$(ls -id /sys/fs/cgroup/mem/hwpoison | cut -f1 -d' ')
|
||||
echo $memcg_ino > /debug/hwpoison/corrupt-filter-memcg
|
||||
|
||||
page-types -p `pidof init` --hwpoison # shall do nothing
|
||||
page-types -p `pidof usemem` --hwpoison # poison its pages
|
||||
|
||||
corrupt-filter-flags-mask
|
||||
corrupt-filter-flags-value
|
||||
|
||||
When specified, only poison pages if ((page_flags & mask) == value).
|
||||
This allows stress testing of many kinds of pages. The page_flags
|
||||
are the same as in /proc/kpageflags. The flag bits are defined in
|
||||
include/linux/kernel-page-flags.h and documented in
|
||||
Documentation/vm/pagemap.txt
|
||||
|
||||
Architecture specific MCE injector
|
||||
|
||||
x86 has mce-inject, mce-test
|
||||
|
||||
Some portable hwpoison test programs are in mce-test; see below.
|
||||
|
||||
---
|
||||
|
||||
References:
|
||||
|
||||
http://halobates.de/mce-lc09-2.pdf
|
||||
Overview presentation from LinuxCon 09
|
||||
|
||||
git://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git
|
||||
Test suite (hwpoison specific portable tests in tsrc)
|
||||
|
||||
git://git.kernel.org/pub/scm/utils/cpu/mce/mce-inject.git
|
||||
x86 specific injector
|
||||
|
||||
|
||||
---
|
||||
|
||||
Limitations:
|
||||
|
||||
- Not all page types are supported and never will be. Most kernel internal
|
||||
objects cannot be recovered, only LRU pages for now.
|
||||
- Right now hugepage support is missing.
|
||||
|
||||
---
|
||||
Andi Kleen, Oct 2009
|
||||
|
97
Documentation/vm/ksm.txt
Normal file
@ -0,0 +1,97 @@
How to use the Kernel Samepage Merging feature
|
||||
----------------------------------------------
|
||||
|
||||
KSM is a memory-saving de-duplication feature, enabled by CONFIG_KSM=y,
|
||||
added to the Linux kernel in 2.6.32. See mm/ksm.c for its implementation,
|
||||
and http://lwn.net/Articles/306704/ and http://lwn.net/Articles/330589/
|
||||
|
||||
The KSM daemon ksmd periodically scans those areas of user memory which
|
||||
have been registered with it, looking for pages of identical content which
|
||||
can be replaced by a single write-protected page (which is automatically
|
||||
copied if a process later wants to update its content).
|
||||
|
||||
KSM was originally developed for use with KVM (where it was known as
|
||||
Kernel Shared Memory), to fit more virtual machines into physical memory,
|
||||
by sharing the data common between them. But it can be useful to any
|
||||
application which generates many instances of the same data.
|
||||
|
||||
KSM only merges anonymous (private) pages, never pagecache (file) pages.
|
||||
KSM's merged pages were originally locked into kernel memory, but can now
|
||||
be swapped out just like other user pages (but sharing is broken when they
|
||||
are swapped back in: ksmd must rediscover their identity and merge again).
|
||||
|
||||
KSM only operates on those areas of address space which an application
|
||||
has advised to be likely candidates for merging, by using the madvise(2)
|
||||
system call: int madvise(addr, length, MADV_MERGEABLE).
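
A rough sketch of that call sequence, assuming CONFIG_KSM=y and a buffer whose
pages happen to be good merge candidates, might look like this:

/* Hedged sketch: mark an anonymous region as a KSM merge candidate. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_MERGEABLE
#define MADV_MERGEABLE 12	/* value from asm-generic/mman-common.h */
#endif

int main(void)
{
	size_t length = 64 * 4096;	/* arbitrary example size */
	char *buf = mmap(NULL, length, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;
	memset(buf, 0x5a, length);	/* identical pages: good KSM candidates */
	if (madvise(buf, length, MADV_MERGEABLE) != 0)
		perror("madvise(MADV_MERGEABLE)");	/* EINVAL if KSM not built in */
	/* ... use buf; ksmd may now merge its identical pages ... */
	return 0;
}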
|
||||
|
||||
The app may call int madvise(addr, length, MADV_UNMERGEABLE) to cancel
|
||||
that advice and restore unshared pages: whereupon KSM unmerges whatever
|
||||
it merged in that range. Note: this unmerging call may suddenly require
|
||||
more memory than is available - possibly failing with EAGAIN, but more
|
||||
probably arousing the Out-Of-Memory killer.
|
||||
|
||||
If KSM is not configured into the running kernel, madvise MADV_MERGEABLE
|
||||
and MADV_UNMERGEABLE simply fail with EINVAL. If the running kernel was
|
||||
built with CONFIG_KSM=y, those calls will normally succeed: even if
|
||||
the KSM daemon is not currently running, MADV_MERGEABLE still registers
|
||||
the range for whenever the KSM daemon is started; even if the range
|
||||
cannot contain any pages which KSM could actually merge; even if
|
||||
MADV_UNMERGEABLE is applied to a range which was never MADV_MERGEABLE.
|
||||
|
||||
Like other madvise calls, they are intended for use on mapped areas of
|
||||
the user address space: they will report ENOMEM if the specified range
|
||||
includes unmapped gaps (though working on the intervening mapped areas),
|
||||
and might fail with EAGAIN if there is not enough memory for internal structures.
|
||||
|
||||
Applications should be considerate in their use of MADV_MERGEABLE,
|
||||
restricting its use to areas likely to benefit. KSM's scans may use a lot
|
||||
of processing power: some installations will disable KSM for that reason.
|
||||
|
||||
The KSM daemon is controlled by sysfs files in /sys/kernel/mm/ksm/,
|
||||
readable by all but writable only by root:
|
||||
|
||||
pages_to_scan - how many present pages to scan before ksmd goes to sleep
|
||||
e.g. "echo 100 > /sys/kernel/mm/ksm/pages_to_scan"
|
||||
Default: 100 (chosen for demonstration purposes)
|
||||
|
||||
sleep_millisecs - how many milliseconds ksmd should sleep before next scan
|
||||
e.g. "echo 20 > /sys/kernel/mm/ksm/sleep_millisecs"
|
||||
Default: 20 (chosen for demonstration purposes)
|
||||
|
||||
merge_across_nodes - specifies if pages from different numa nodes can be merged.
|
||||
When set to 0, ksm merges only pages which physically
|
||||
reside in the memory area of same NUMA node. That brings
|
||||
lower latency to access of shared pages. Systems with more
|
||||
nodes, at significant NUMA distances, are likely to benefit
|
||||
from the lower latency of setting 0. Smaller systems, which
|
||||
need to minimize memory usage, are likely to benefit from
|
||||
the greater sharing of setting 1 (default). You may wish to
|
||||
compare how your system performs under each setting, before
|
||||
deciding on which to use. merge_across_nodes setting can be
|
||||
changed only when there are no ksm shared pages in system:
|
||||
set run 2 to unmerge pages first, then to 1 after changing
|
||||
merge_across_nodes, to remerge according to the new setting.
|
||||
Default: 1 (merging across nodes as in earlier releases)
|
||||
|
||||
run - set 0 to stop ksmd from running but keep merged pages,
|
||||
set 1 to run ksmd e.g. "echo 1 > /sys/kernel/mm/ksm/run",
|
||||
set 2 to stop ksmd and unmerge all pages currently merged,
|
||||
but leave mergeable areas registered for next run
|
||||
Default: 0 (must be changed to 1 to activate KSM,
|
||||
except if CONFIG_SYSFS is disabled)
|
||||
|
||||
The effectiveness of KSM and MADV_MERGEABLE is shown in /sys/kernel/mm/ksm/:
|
||||
|
||||
pages_shared - how many shared pages are being used
|
||||
pages_sharing - how many more sites are sharing them i.e. how much saved
|
||||
pages_unshared - how many pages unique but repeatedly checked for merging
|
||||
pages_volatile - how many pages changing too fast to be placed in a tree
|
||||
full_scans - how many times all mergeable areas have been scanned
|
||||
|
||||
A high ratio of pages_sharing to pages_shared indicates good sharing, but
|
||||
a high ratio of pages_unshared to pages_sharing indicates wasted effort.
|
||||
pages_volatile embraces several different kinds of activity, but a high
|
||||
proportion there would also indicate poor use of madvise MADV_MERGEABLE.
|
||||
|
||||
Izik Eidus,
|
||||
Hugh Dickins, 17 Nov 2009
|
149
Documentation/vm/numa
Normal file
@ -0,0 +1,149 @@
Started Nov 1999 by Kanoj Sarcar <kanoj@sgi.com>
|
||||
|
||||
What is NUMA?
|
||||
|
||||
This question can be answered from a couple of perspectives: the
|
||||
hardware view and the Linux software view.
|
||||
|
||||
From the hardware perspective, a NUMA system is a computer platform that
|
||||
comprises multiple components or assemblies each of which may contain 0
|
||||
or more CPUs, local memory, and/or IO buses. For brevity and to
|
||||
disambiguate the hardware view of these physical components/assemblies
|
||||
from the software abstraction thereof, we'll call the components/assemblies
|
||||
'cells' in this document.
|
||||
|
||||
Each of the 'cells' may be viewed as an SMP [symmetric multi-processor] subset
|
||||
of the system--although some components necessary for a stand-alone SMP system
|
||||
may not be populated on any given cell. The cells of the NUMA system are
|
||||
connected together with some sort of system interconnect--e.g., a crossbar or
|
||||
point-to-point link are common types of NUMA system interconnects. Both of
|
||||
these types of interconnects can be aggregated to create NUMA platforms with
|
||||
cells at multiple distances from other cells.
|
||||
|
||||
For Linux, the NUMA platforms of interest are primarily what is known as Cache
|
||||
Coherent NUMA or ccNUMA systems. With ccNUMA systems, all memory is visible
|
||||
to and accessible from any CPU attached to any cell and cache coherency
|
||||
is handled in hardware by the processor caches and/or the system interconnect.
|
||||
|
||||
Memory access time and effective memory bandwidth varies depending on how far
|
||||
away the cell containing the CPU or IO bus making the memory access is from the
|
||||
cell containing the target memory. For example, access to memory by CPUs
|
||||
attached to the same cell will experience faster access times and higher
|
||||
bandwidths than accesses to memory on other, remote cells. NUMA platforms
|
||||
can have cells at multiple remote distances from any given cell.
|
||||
|
||||
Platform vendors don't build NUMA systems just to make software developers'
|
||||
lives interesting. Rather, this architecture is a means to provide scalable
|
||||
memory bandwidth. However, to achieve scalable memory bandwidth, system and
|
||||
application software must arrange for a large majority of the memory references
|
||||
[cache misses] to be to "local" memory--memory on the same cell, if any--or
|
||||
to the closest cell with memory.
|
||||
|
||||
This leads to the Linux software view of a NUMA system:
|
||||
|
||||
Linux divides the system's hardware resources into multiple software
|
||||
abstractions called "nodes". Linux maps the nodes onto the physical cells
|
||||
of the hardware platform, abstracting away some of the details for some
|
||||
architectures. As with physical cells, software nodes may contain 0 or more
|
||||
CPUs, memory and/or IO buses. And, again, memory accesses to memory on
|
||||
"closer" nodes--nodes that map to closer cells--will generally experience
|
||||
faster access times and higher effective bandwidth than accesses to more
|
||||
remote cells.
|
||||
|
||||
For some architectures, such as x86, Linux will "hide" any node representing a
|
||||
physical cell that has no memory attached, and reassign any CPUs attached to
|
||||
that cell to a node representing a cell that does have memory. Thus, on
|
||||
these architectures, one cannot assume that all CPUs that Linux associates with
|
||||
a given node will see the same local memory access times and bandwidth.
|
||||
|
||||
In addition, for some architectures, again x86 is an example, Linux supports
|
||||
the emulation of additional nodes. For NUMA emulation, linux will carve up
|
||||
the existing nodes--or the system memory for non-NUMA platforms--into multiple
|
||||
nodes. Each emulated node will manage a fraction of the underlying cells'
|
||||
physical memory. NUMA emulation is useful for testing NUMA kernel and
|
||||
application features on non-NUMA platforms, and as a sort of memory resource
|
||||
management mechanism when used together with cpusets.
|
||||
[see Documentation/cgroups/cpusets.txt]
|
||||
|
||||
For each node with memory, Linux constructs an independent memory management
|
||||
subsystem, complete with its own free page lists, in-use page lists, usage
|
||||
statistics and locks to mediate access. In addition, Linux constructs for
|
||||
each memory zone [one or more of DMA, DMA32, NORMAL, HIGH_MEMORY, MOVABLE],
|
||||
an ordered "zonelist". A zonelist specifies the zones/nodes to visit when a
|
||||
selected zone/node cannot satisfy the allocation request. This situation,
|
||||
when a zone has no available memory to satisfy a request, is called
|
||||
"overflow" or "fallback".
|
||||
|
||||
Because some nodes contain multiple zones containing different types of
|
||||
memory, Linux must decide whether to order the zonelists such that allocations
|
||||
fall back to the same zone type on a different node, or to a different zone
|
||||
type on the same node. This is an important consideration because some zones,
|
||||
such as DMA or DMA32, represent relatively scarce resources. Linux chooses
|
||||
a default zonelist order based on the sizes of the various zone types relative
|
||||
to the total memory of the node and the total memory of the system. The
|
||||
default zonelist order may be overridden using the numa_zonelist_order kernel
|
||||
boot parameter or sysctl. [see Documentation/kernel-parameters.txt and
|
||||
Documentation/sysctl/vm.txt]
|
||||
|
||||
By default, Linux will attempt to satisfy memory allocation requests from the
|
||||
node to which the CPU that executes the request is assigned. Specifically,
|
||||
Linux will attempt to allocate from the first node in the appropriate zonelist
|
||||
for the node where the request originates. This is called "local allocation."
|
||||
If the "local" node cannot satisfy the request, the kernel will examine other
|
||||
nodes' zones in the selected zonelist looking for the first zone in the list
|
||||
that can satisfy the request.
|
||||
|
||||
Local allocation will tend to keep subsequent access to the allocated memory
|
||||
"local" to the underlying physical resources and off the system interconnect--
|
||||
as long as the task on whose behalf the kernel allocated some memory does not
|
||||
later migrate away from that memory. The Linux scheduler is aware of the
|
||||
NUMA topology of the platform--embodied in the "scheduling domains" data
|
||||
structures [see Documentation/scheduler/sched-domains.txt]--and the scheduler
|
||||
attempts to minimize task migration to distant scheduling domains. However,
|
||||
the scheduler does not take a task's NUMA footprint into account directly.
|
||||
Thus, under sufficient imbalance, tasks can migrate between nodes, remote
|
||||
from their initial node and kernel data structures.
|
||||
|
||||
System administrators and application designers can restrict a task's migration
|
||||
to improve NUMA locality using various CPU affinity command line interfaces,
|
||||
such as taskset(1) and numactl(1), and program interfaces such as
|
||||
sched_setaffinity(2). Further, one can modify the kernel's default local
|
||||
allocation behavior using Linux NUMA memory policy.
|
||||
[see Documentation/vm/numa_memory_policy.txt.]
|
||||
|
||||
System administrators can restrict the CPUs and nodes' memories that a non-
|
||||
privileged user can specify in the scheduling or NUMA commands and functions
|
||||
using control groups and CPUsets. [see Documentation/cgroups/cpusets.txt]
|
||||
|
||||
On architectures that do not hide memoryless nodes, Linux will include only
|
||||
zones [nodes] with memory in the zonelists. This means that for a memoryless
|
||||
node the "local memory node"--the node of the first zone in CPU's node's
|
||||
zonelist--will not be the node itself. Rather, it will be the node that the
|
||||
kernel selected as the nearest node with memory when it built the zonelists.
|
||||
So, by default, local allocations will succeed with the kernel supplying the
|
||||
closest available memory. This is a consequence of the same mechanism that
|
||||
allows such allocations to fallback to other nearby nodes when a node that
|
||||
does contain memory overflows.
|
||||
|
||||
Some kernel allocations do not want or cannot tolerate this allocation fallback
|
||||
behavior. Rather they want to be sure they get memory from the specified node
|
||||
or get notified that the node has no free memory. This is usually the case when
|
||||
a subsystem allocates per CPU memory resources, for example.
|
||||
|
||||
A typical model for making such an allocation is to obtain the node id of the
|
||||
node to which the "current CPU" is attached using one of the kernel's
|
||||
numa_node_id() or CPU_to_node() functions and then request memory from only
|
||||
the node id returned. When such an allocation fails, the requesting subsystem
|
||||
may revert to its own fallback path. The slab kernel memory allocator is an
|
||||
example of this. Or, the subsystem may choose to disable or not to enable
|
||||
itself on allocation failure. The kernel profiling subsystem is an example of
|
||||
this.
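
A kernel-side sketch of that allocation model (illustrative only; the helpers
named here are the generic NUMA interfaces, not code quoted from this document):

/* Hedged sketch (kernel context): allocate from the local node only. */
#include <linux/gfp.h>
#include <linux/slab.h>
#include <linux/topology.h>

static void *alloc_on_local_node(size_t size)
{
	int nid = numa_node_id();	/* node of the current CPU */

	/*
	 * kmalloc_node() requests memory from the given node; the
	 * __GFP_THISNODE flag forbids falling back to other nodes,
	 * so the caller is notified (NULL) when the node is out of memory.
	 */
	return kmalloc_node(size, GFP_KERNEL | __GFP_THISNODE, nid);
}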
|
||||
|
||||
If the architecture supports--does not hide--memoryless nodes, then CPUs
|
||||
attached to memoryless nodes would always incur the fallback path overhead
|
||||
or some subsystems would fail to initialize if they attempted to allocate
|
||||
memory exclusively from a node without memory. To support such
|
||||
architectures transparently, kernel subsystems can use the numa_mem_id()
|
||||
or cpu_to_mem() function to locate the "local memory node" for the calling or
|
||||
specified CPU. Again, this is the same node from which default, local page
|
||||
allocations will be attempted.
|
452
Documentation/vm/numa_memory_policy.txt
Normal file
@ -0,0 +1,452 @@
|
||||
What is Linux Memory Policy?
|
||||
|
||||
In the Linux kernel, "memory policy" determines from which node the kernel will
|
||||
allocate memory in a NUMA system or in an emulated NUMA system. Linux has
|
||||
supported platforms with Non-Uniform Memory Access architectures since 2.4.?.
|
||||
The current memory policy support was added to Linux 2.6 around May 2004. This
|
||||
document attempts to describe the concepts and APIs of the 2.6 memory policy
|
||||
support.
|
||||
|
||||
Memory policies should not be confused with cpusets
|
||||
(Documentation/cgroups/cpusets.txt)
|
||||
which is an administrative mechanism for restricting the nodes from which
|
||||
memory may be allocated by a set of processes. Memory policies are a
|
||||
programming interface that a NUMA-aware application can take advantage of. When
|
||||
both cpusets and policies are applied to a task, the restrictions of the cpuset
|
||||
take priority. See "MEMORY POLICIES AND CPUSETS" below for more details.
|
||||
|
||||
MEMORY POLICY CONCEPTS
|
||||
|
||||
Scope of Memory Policies
|
||||
|
||||
The Linux kernel supports _scopes_ of memory policy, described here from
|
||||
most general to most specific:
|
||||
|
||||
System Default Policy: this policy is "hard coded" into the kernel. It
|
||||
is the policy that governs all page allocations that aren't controlled
|
||||
by one of the more specific policy scopes discussed below. When the
|
||||
system is "up and running", the system default policy will use "local
|
||||
allocation" described below. However, during boot up, the system
|
||||
default policy will be set to interleave allocations across all nodes
|
||||
with "sufficient" memory, so as not to overload the initial boot node
|
||||
with boot-time allocations.
|
||||
|
||||
Task/Process Policy: this is an optional, per-task policy. When defined
|
||||
for a specific task, this policy controls all page allocations made by or
|
||||
on behalf of the task that aren't controlled by a more specific scope.
|
||||
If a task does not define a task policy, then all page allocations that
|
||||
would have been controlled by the task policy "fall back" to the System
|
||||
Default Policy.
|
||||
|
||||
The task policy applies to the entire address space of a task. Thus,
|
||||
it is inheritable, and indeed is inherited, across both fork()
|
||||
[clone() w/o the CLONE_VM flag] and exec*(). This allows a parent task
|
||||
to establish the task policy for a child task exec()'d from an
|
||||
executable image that has no awareness of memory policy. See the
|
||||
MEMORY POLICY APIS section, below, for an overview of the system call
|
||||
that a task may use to set/change its task/process policy.
|
||||
|
||||
In a multi-threaded task, task policies apply only to the thread
|
||||
[Linux kernel task] that installs the policy and any threads
|
||||
subsequently created by that thread. Any sibling threads existing
|
||||
at the time a new task policy is installed retain their current
|
||||
policy.
|
||||
|
||||
A task policy applies only to pages allocated after the policy is
|
||||
installed. Any pages already faulted in by the task when the task
|
||||
changes its task policy remain where they were allocated based on
|
||||
the policy at the time they were allocated.
|
||||
|
||||
VMA Policy: A "VMA" or "Virtual Memory Area" refers to a range of a task's
|
||||
virtual address space. A task may define a specific policy for a range
|
||||
of its virtual address space. See the MEMORY POLICIES APIS section,
|
||||
below, for an overview of the mbind() system call used to set a VMA
|
||||
policy.
|
||||
|
||||
A VMA policy will govern the allocation of pages that back this region of
|
||||
the address space. Any regions of the task's address space that don't
|
||||
have an explicit VMA policy will fall back to the task policy, which may
|
||||
itself fall back to the System Default Policy.
|
||||
|
||||
VMA policies have a few complicating details:
|
||||
|
||||
VMA policy applies ONLY to anonymous pages. These include pages
|
||||
allocated for anonymous segments, such as the task stack and heap, and
|
||||
any regions of the address space mmap()ed with the MAP_ANONYMOUS flag.
|
||||
If a VMA policy is applied to a file mapping, it will be ignored if
|
||||
the mapping used the MAP_SHARED flag. If the file mapping used the
|
||||
MAP_PRIVATE flag, the VMA policy will only be applied when an
|
||||
anonymous page is allocated on an attempt to write to the mapping--
|
||||
i.e., at Copy-On-Write.
|
||||
|
||||
VMA policies are shared between all tasks that share a virtual address
|
||||
space--a.k.a. threads--independent of when the policy is installed; and
|
||||
they are inherited across fork(). However, because VMA policies refer
|
||||
to a specific region of a task's address space, and because the address
|
||||
space is discarded and recreated on exec*(), VMA policies are NOT
|
||||
inheritable across exec(). Thus, only NUMA-aware applications may
|
||||
use VMA policies.
|
||||
|
||||
A task may install a new VMA policy on a sub-range of a previously
|
||||
mmap()ed region. When this happens, Linux splits the existing virtual
|
||||
memory area into 2 or 3 VMAs, each with its own policy.
|
||||
|
||||
By default, VMA policy applies only to pages allocated after the policy
|
||||
is installed. Any pages already faulted into the VMA range remain
|
||||
where they were allocated based on the policy at the time they were
|
||||
allocated. However, since 2.6.16, Linux supports page migration via
|
||||
the mbind() system call, so that page contents can be moved to match
|
||||
a newly installed policy.
|
||||
|
||||
Shared Policy: Conceptually, shared policies apply to "memory objects"
|
||||
mapped shared into one or more tasks' distinct address spaces. An
|
||||
application installs a shared policy the same way as VMA policies--using
|
||||
the mbind() system call specifying a range of virtual addresses that map
|
||||
the shared object. However, unlike VMA policies, which can be considered
|
||||
to be an attribute of a range of a task's address space, shared policies
|
||||
apply directly to the shared object. Thus, all tasks that attach to the
|
||||
object share the policy, and all pages allocated for the shared object,
|
||||
by any task, will obey the shared policy.
|
||||
|
||||
As of 2.6.22, only shared memory segments, created by shmget() or
|
||||
mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy. When shared
|
||||
policy support was added to Linux, the associated data structures were
|
||||
added to hugetlbfs shmem segments. At the time, hugetlbfs did not
|
||||
support allocation at fault time--a.k.a lazy allocation--so hugetlbfs
|
||||
shmem segments were never "hooked up" to the shared policy support.
|
||||
Although hugetlbfs segments now support lazy allocation, their support
|
||||
for shared policy has not been completed.
|
||||
|
||||
As mentioned above [re: VMA policies], allocations of page cache
|
||||
pages for regular files mmap()ed with MAP_SHARED ignore any VMA
|
||||
policy installed on the virtual address range backed by the shared
|
||||
file mapping. Rather, shared page cache pages, including pages backing
|
||||
private mappings that have not yet been written by the task, follow
|
||||
task policy, if any, else System Default Policy.
|
||||
|
||||
The shared policy infrastructure supports different policies on subset
|
||||
ranges of the shared object. However, Linux still splits the VMA of
|
||||
the task that installs the policy for each range of distinct policy.
|
||||
Thus, different tasks that attach to a shared memory segment can have
|
||||
different VMA configurations mapping that one shared object. This
|
||||
can be seen by examining the /proc/<pid>/numa_maps of tasks sharing
|
||||
a shared memory region, when one task has installed shared policy on
|
||||
one or more ranges of the region.
|
||||
|
||||
Components of Memory Policies
|
||||
|
||||
A Linux memory policy consists of a "mode", optional mode flags, and an
|
||||
optional set of nodes. The mode determines the behavior of the policy,
|
||||
the optional mode flags determine the behavior of the mode, and the
|
||||
optional set of nodes can be viewed as the arguments to the policy
|
||||
behavior.
|
||||
|
||||
Internally, memory policies are implemented by a reference counted
|
||||
structure, struct mempolicy. Details of this structure will be discussed
|
||||
in context, below, as required to explain the behavior.
|
||||
|
||||
Linux memory policy supports the following 4 behavioral modes:
|
||||
|
||||
Default Mode--MPOL_DEFAULT: This mode is only used in the memory
|
||||
policy APIs. Internally, MPOL_DEFAULT is converted to the NULL
|
||||
memory policy in all policy scopes. Any existing non-default policy
|
||||
will simply be removed when MPOL_DEFAULT is specified. As a result,
|
||||
MPOL_DEFAULT means "fall back to the next most specific policy scope."
|
||||
|
||||
For example, a NULL or default task policy will fall back to the
|
||||
system default policy. A NULL or default vma policy will fall
|
||||
back to the task policy.
|
||||
|
||||
When specified in one of the memory policy APIs, the Default mode
|
||||
does not use the optional set of nodes.
|
||||
|
||||
It is an error for the set of nodes specified for this policy to
|
||||
be non-empty.
|
||||
|
||||
MPOL_BIND: This mode specifies that memory must come from the
|
||||
set of nodes specified by the policy. Memory will be allocated from
|
||||
the node in the set with sufficient free memory that is closest to
|
||||
the node where the allocation takes place.
|
||||
|
||||
MPOL_PREFERRED: This mode specifies that the allocation should be
|
||||
attempted from the single node specified in the policy. If that
|
||||
allocation fails, the kernel will search other nodes, in order of
|
||||
increasing distance from the preferred node based on information
|
||||
provided by the platform firmware.
|
||||
|
||||
Internally, the Preferred policy uses a single node--the
|
||||
preferred_node member of struct mempolicy. When the internal
|
||||
mode flag MPOL_F_LOCAL is set, the preferred_node is ignored and
|
||||
the policy is interpreted as local allocation. "Local" allocation
|
||||
policy can be viewed as a Preferred policy that starts at the node
|
||||
containing the cpu where the allocation takes place.
|
||||
|
||||
It is possible for the user to specify that local allocation is
|
||||
always preferred by passing an empty nodemask with this mode.
|
||||
If an empty nodemask is passed, the policy cannot use the
|
||||
MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES flags described
|
||||
below.
|
||||
|
||||
MPOL_INTERLEAVED: This mode specifies that page allocations be
|
||||
interleaved, on a page granularity, across the nodes specified in
|
||||
the policy. This mode also behaves slightly differently, based on
|
||||
the context where it is used:
|
||||
|
||||
For allocation of anonymous pages and shared memory pages,
|
||||
Interleave mode indexes the set of nodes specified by the policy
|
||||
using the page offset of the faulting address into the segment
|
||||
[VMA] containing the address modulo the number of nodes specified
|
||||
by the policy. It then attempts to allocate a page, starting at
|
||||
the selected node, as if the node had been specified by a Preferred
|
||||
policy or had been selected by a local allocation. That is,
|
||||
allocation will follow the per node zonelist.
|
||||
|
||||
For allocation of page cache pages, Interleave mode indexes the set
|
||||
of nodes specified by the policy using a node counter maintained
|
||||
per task. This counter wraps around to the lowest specified node
|
||||
after it reaches the highest specified node. This will tend to
|
||||
spread the pages out over the nodes specified by the policy based
|
||||
on the order in which they are allocated, rather than based on any
|
||||
page offset into an address range or file. During system boot up,
|
||||
the temporary interleaved system default policy works in this
|
||||
mode.
|
||||
|
||||
Linux memory policy supports the following optional mode flags:
|
||||
|
||||
MPOL_F_STATIC_NODES: This flag specifies that the nodemask passed by
|
||||
the user should not be remapped if the task or VMA's set of allowed
|
||||
nodes changes after the memory policy has been defined.
|
||||
|
||||
Without this flag, anytime a mempolicy is rebound because of a
|
||||
change in the set of allowed nodes, the node (Preferred) or
|
||||
nodemask (Bind, Interleave) is remapped to the new set of
|
||||
allowed nodes. This may result in nodes being used that were
|
||||
previously undesired.
|
||||
|
||||
With this flag, if the user-specified nodes overlap with the
|
||||
nodes allowed by the task's cpuset, then the memory policy is
|
||||
applied to their intersection. If the two sets of nodes do not
|
||||
overlap, the Default policy is used.
|
||||
|
||||
For example, consider a task that is attached to a cpuset with
|
||||
mems 1-3 that sets an Interleave policy over the same set. If
|
||||
the cpuset's mems change to 3-5, the Interleave will now occur
|
||||
over nodes 3, 4, and 5. With this flag, however, since only node
|
||||
3 is allowed from the user's nodemask, the "interleave" only
|
||||
occurs over that node. If no nodes from the user's nodemask are
|
||||
now allowed, the Default behavior is used.
|
||||
|
||||
MPOL_F_STATIC_NODES cannot be combined with the
|
||||
MPOL_F_RELATIVE_NODES flag. It also cannot be used for
|
||||
MPOL_PREFERRED policies that were created with an empty nodemask
|
||||
(local allocation).
|
||||
|
||||
MPOL_F_RELATIVE_NODES: This flag specifies that the nodemask passed
|
||||
by the user will be mapped relative to the set of the task or VMA's
|
||||
set of allowed nodes. The kernel stores the user-passed nodemask,
|
||||
and if the allowed nodes changes, then that original nodemask will
|
||||
be remapped relative to the new set of allowed nodes.
|
||||
|
||||
Without this flag (and without MPOL_F_STATIC_NODES), anytime a
|
||||
mempolicy is rebound because of a change in the set of allowed
|
||||
nodes, the node (Preferred) or nodemask (Bind, Interleave) is
|
||||
remapped to the new set of allowed nodes. That remap may not
|
||||
preserve the relative nature of the user's passed nodemask to its
|
||||
set of allowed nodes upon successive rebinds: a nodemask of
|
||||
1,3,5 may be remapped to 7-9 and then to 1-3 if the set of
|
||||
allowed nodes is restored to its original state.
|
||||
|
||||
With this flag, the remap is done so that the node numbers from
|
||||
the user's passed nodemask are relative to the set of allowed
|
||||
nodes. In other words, if nodes 0, 2, and 4 are set in the user's
|
||||
nodemask, the policy will be effected over the first (and in the
|
||||
Bind or Interleave case, the third and fifth) nodes in the set of
|
||||
allowed nodes. The nodemask passed by the user represents nodes
|
||||
relative to task or VMA's set of allowed nodes.
|
||||
|
||||
If the user's nodemask includes nodes that are outside the range
|
||||
of the new set of allowed nodes (for example, node 5 is set in
|
||||
the user's nodemask when the set of allowed nodes is only 0-3),
|
||||
then the remap wraps around to the beginning of the nodemask and,
|
||||
if not already set, sets the node in the mempolicy nodemask.
|
||||
|
||||
For example, consider a task that is attached to a cpuset with
|
||||
mems 2-5 that sets an Interleave policy over the same set with
|
||||
MPOL_F_RELATIVE_NODES. If the cpuset's mems change to 3-7, the
|
||||
interleave now occurs over nodes 3,5-7. If the cpuset's mems
|
||||
then change to 0,2-3,5, then the interleave occurs over nodes
|
||||
0,2-3,5.
|
||||
|
||||
Thanks to the consistent remapping, applications preparing
|
||||
nodemasks to specify memory policies using this flag should
|
||||
disregard their current, actual cpuset imposed memory placement
|
||||
and prepare the nodemask as if they were always located on
|
||||
memory nodes 0 to N-1, where N is the number of memory nodes the
|
||||
policy is intended to manage. Let the kernel then remap to the
|
||||
set of memory nodes allowed by the task's cpuset, as that may
|
||||
change over time.
|
||||
|
||||
MPOL_F_RELATIVE_NODES cannot be combined with the
|
||||
MPOL_F_STATIC_NODES flag. It also cannot be used for
|
||||
MPOL_PREFERRED policies that were created with an empty nodemask
|
||||
(local allocation).
|
||||
|
||||
MEMORY POLICY REFERENCE COUNTING
|
||||
|
||||
To resolve use/free races, struct mempolicy contains an atomic reference
|
||||
count field. Internal interfaces, mpol_get()/mpol_put() increment and
|
||||
decrement this reference count, respectively. mpol_put() will only free
|
||||
the structure back to the mempolicy kmem cache when the reference count
|
||||
goes to zero.
|
||||
|
||||
When a new memory policy is allocated, its reference count is initialized
|
||||
to '1', representing the reference held by the task that is installing the
|
||||
new policy. When a pointer to a memory policy structure is stored in another
|
||||
structure, another reference is added, as the task's reference will be dropped
|
||||
on completion of the policy installation.
|
||||
|
||||
During run-time "usage" of the policy, we attempt to minimize atomic operations
|
||||
on the reference count, as this can lead to cache lines bouncing between cpus
|
||||
and NUMA nodes. "Usage" here means one of the following:
|
||||
|
||||
1) querying of the policy, either by the task itself [using the get_mempolicy()
|
||||
API discussed below] or by another task using the /proc/<pid>/numa_maps
|
||||
interface.
|
||||
|
||||
2) examination of the policy to determine the policy mode and associated node
|
||||
or node lists, if any, for page allocation. This is considered a "hot
|
||||
path". Note that for MPOL_BIND, the "usage" extends across the entire
|
||||
allocation process, which may sleep during page reclamation, because the
|
||||
BIND policy nodemask is used, by reference, to filter ineligible nodes.
|
||||
|
||||
We can avoid taking an extra reference during the usages listed above as
|
||||
follows:
|
||||
|
||||
1) we never need to get/free the system default policy as this is never
|
||||
changed nor freed, once the system is up and running.
|
||||
|
||||
2) for querying the policy, we do not need to take an extra reference on the
|
||||
target task's task policy nor vma policies because we always acquire the
|
||||
task's mm's mmap_sem for read during the query. The set_mempolicy() and
|
||||
mbind() APIs [see below] always acquire the mmap_sem for write when
|
||||
installing or replacing task or vma policies. Thus, there is no possibility
|
||||
of a task or thread freeing a policy while another task or thread is
|
||||
querying it.
|
||||
|
||||
3) Page allocation usage of task or vma policy occurs in the fault path where
|
||||
we hold the mmap_sem for read. Again, because replacing the task or vma
|
||||
policy requires that the mmap_sem be held for write, the policy can't be
|
||||
freed out from under us while we're using it for page allocation.
|
||||
|
||||
4) Shared policies require special consideration. One task can replace a
|
||||
shared memory policy while another task, with a distinct mmap_sem, is
|
||||
querying or allocating a page based on the policy. To resolve this
|
||||
potential race, the shared policy infrastructure adds an extra reference
|
||||
to the shared policy during lookup while holding a spin lock on the shared
|
||||
policy management structure. This requires that we drop this extra
|
||||
reference when we're finished "using" the policy. We must drop the
|
||||
extra reference on shared policies in the same query/allocation paths
|
||||
used for non-shared policies. For this reason, shared policies are marked
|
||||
as such, and the extra reference is dropped "conditionally"--i.e., only
|
||||
for shared policies.
|
||||
|
||||
Because of this extra reference counting, and because we must lookup
|
||||
shared policies in a tree structure under spinlock, shared policies are
|
||||
more expensive to use in the page allocation path. This is especially
|
||||
true for shared policies on shared memory regions shared by tasks running
|
||||
on different NUMA nodes. This extra overhead can be avoided by always
|
||||
falling back to task or system default policy for shared memory regions,
|
||||
or by prefaulting the entire shared memory region into memory and locking
|
||||
it down. However, this might not be appropriate for all applications.
|
||||
|
||||
MEMORY POLICY APIs
|
||||
|
||||
Linux supports 3 system calls for controlling memory policy. These APIS
|
||||
always affect only the calling task, the calling task's address space, or
|
||||
some shared object mapped into the calling task's address space.
|
||||
|
||||
Note: the headers that define these APIs and the parameter data types
|
||||
for user space applications reside in a package that is not part of
|
||||
the Linux kernel. The kernel system call interfaces, with the 'sys_'
|
||||
prefix, are defined in <linux/syscalls.h>; the mode and flag
|
||||
definitions are defined in <linux/mempolicy.h>.
|
||||
|
||||
Set [Task] Memory Policy:
|
||||
|
||||
long set_mempolicy(int mode, const unsigned long *nmask,
|
||||
unsigned long maxnode);
|
||||
|
||||
Set's the calling task's "task/process memory policy" to mode
|
||||
specified by the 'mode' argument and the set of nodes defined
|
||||
by 'nmask'. 'nmask' points to a bit mask of node ids containing
|
||||
at least 'maxnode' ids. Optional mode flags may be passed by
|
||||
combining the 'mode' argument with the flag (for example:
|
||||
MPOL_INTERLEAVE | MPOL_F_STATIC_NODES).
|
||||
|
||||
See the set_mempolicy(2) man page for more details
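
	As a concrete illustration (not part of the original text), the
	following user-space sketch sets an interleave policy for the
	calling task using the set_mempolicy() wrapper declared in
	numaif.h from the numactl/libnuma development package mentioned
	in the note above. The node numbers are assumptions and must
	exist on the running system.

	#include <numaif.h>	/* from the numactl/libnuma package */
	#include <stdio.h>

	int main(void)
	{
		/* Interleave this task's future anonymous allocations
		 * across nodes 0 and 1 (assumed to exist). */
		unsigned long nodemask = (1UL << 0) | (1UL << 1);

		if (set_mempolicy(MPOL_INTERLEAVE, &nodemask,
				  sizeof(nodemask) * 8)) {
			perror("set_mempolicy");
			return 1;
		}
		printf("task policy is now MPOL_INTERLEAVE over nodes 0-1\n");
		return 0;
	}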

Get [Task] Memory Policy or Related Information

	long get_mempolicy(int *mode,
			   const unsigned long *nmask, unsigned long maxnode,
			   void *addr, int flags);

	Queries the "task/process memory policy" of the calling task, or
	the policy or location of a specified virtual address, depending
	on the 'flags' argument.

	See the get_mempolicy(2) man page for more details
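
	For instance, a minimal sketch (not from the original document)
	of the address-based form of the query, which reports the node
	currently backing a given, already-touched address:

	#include <numaif.h>
	#include <stdio.h>

	/* Print the node backing the page that contains 'addr'.  The
	 * page must already have been touched so a physical page exists. */
	static void print_node_of(void *addr)
	{
		int node = -1;

		if (get_mempolicy(&node, NULL, 0, addr,
				  MPOL_F_NODE | MPOL_F_ADDR))
			perror("get_mempolicy");
		else
			printf("%p resides on node %d\n", addr, node);
	}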

Install VMA/Shared Policy for a Range of Task's Address Space

	long mbind(void *start, unsigned long len, int mode,
		   const unsigned long *nmask, unsigned long maxnode,
		   unsigned flags);

	mbind() installs the policy specified by (mode, nmask, maxnodes) as
	a VMA policy for the range of the calling task's address space
	specified by the 'start' and 'len' arguments. Additional actions
	may be requested via the 'flags' argument.

	See the mbind(2) man page for more details.
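
	A minimal sketch (not part of the original document) that binds an
	anonymous shared mapping to a single node via mbind(); the node
	number and the mapping size are assumptions:

	#include <numaif.h>
	#include <stdio.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t len = 4 * 1024 * 1024;
		unsigned long nodemask = 1UL << 0;	/* node 0, assumed */

		void *start = mmap(NULL, len, PROT_READ | PROT_WRITE,
				   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
		if (start == MAP_FAILED) {
			perror("mmap");
			return 1;
		}

		/* Bind the range to node 0 and move any pages that were
		 * already allocated elsewhere (MPOL_MF_MOVE). */
		if (mbind(start, len, MPOL_BIND, &nodemask,
			  sizeof(nodemask) * 8, MPOL_MF_MOVE)) {
			perror("mbind");
			return 1;
		}
		return 0;
	}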

MEMORY POLICY COMMAND LINE INTERFACE

Although not strictly part of the Linux implementation of memory policy,
a command line tool, numactl(8), exists that allows one to:

+ set the task policy for a specified program via set_mempolicy(2), fork(2) and
  exec(2)

+ set the shared policy for a shared memory segment via mbind(2)

The numactl(8) tool is packaged with the run-time version of the library
containing the memory policy system call wrappers. Some distributions
package the headers and compile-time libraries in a separate development
package.


MEMORY POLICIES AND CPUSETS

Memory policies work within cpusets as described above. For memory policies
that require a node or set of nodes, the nodes are restricted to the set of
nodes whose memories are allowed by the cpuset constraints. If the nodemask
specified for the policy contains nodes that are not allowed by the cpuset and
MPOL_F_RELATIVE_NODES is not used, the intersection of the set of nodes
specified for the policy and the set of nodes with memory is used. If the
result is the empty set, the policy is considered invalid and cannot be
installed. If MPOL_F_RELATIVE_NODES is used, the policy's nodes are mapped
onto and folded into the task's set of allowed nodes as previously described.

The interaction of memory policies and cpusets can be problematic when tasks
in two cpusets share access to a memory region, such as shared memory segments
created by shmget() or mmap() with the MAP_ANONYMOUS and MAP_SHARED flags. If
any of the tasks install shared policy on the region, only nodes whose
memories are allowed in both cpusets may be used in the policies. Obtaining
this information requires "stepping outside" the memory policy APIs to use the
cpuset information, and requires that one know in what cpusets other tasks
might be attaching to the shared region. Furthermore, if the cpusets' allowed
memory sets are disjoint, "local" allocation is the only valid policy.

Documentation/vm/overcommit-accounting
Normal file
@@ -0,0 +1,80 @@

The Linux kernel supports the following overcommit handling modes

0	-	Heuristic overcommit handling. Obvious overcommits of
		address space are refused. Used for a typical system. It
		ensures a seriously wild allocation fails while allowing
		overcommit to reduce swap usage. root is allowed to
		allocate slightly more memory in this mode. This is the
		default.

1	-	Always overcommit. Appropriate for some scientific
		applications. Classic example is code using sparse arrays
		and just relying on the virtual memory consisting almost
		entirely of zero pages.

2	-	Don't overcommit. The total address space commit
		for the system is not permitted to exceed swap + a
		configurable amount (default is 50%) of physical RAM.
		Depending on the amount you use, in most situations
		this means a process will not be killed while accessing
		pages but will receive errors on memory allocation as
		appropriate.

		Useful for applications that want to guarantee their
		memory allocations will be available in the future
		without having to initialize every page.

The overcommit policy is set via the sysctl `vm.overcommit_memory'.

The overcommit amount can be set via `vm.overcommit_ratio' (percentage)
or `vm.overcommit_kbytes' (absolute value).

The current overcommit limit and amount committed are viewable in
/proc/meminfo as CommitLimit and Committed_AS respectively.
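
For reference, a minimal user-space sketch (not part of the original text)
that prints those two /proc/meminfo fields; in mode 2 the CommitLimit value
is derived from swap plus the configured ratio or kbytes value described
above:

	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		char line[256];
		FILE *f = fopen("/proc/meminfo", "r");

		if (!f) {
			perror("/proc/meminfo");
			return 1;
		}
		/* Print only the overcommit limit and the amount committed. */
		while (fgets(line, sizeof(line), f))
			if (!strncmp(line, "CommitLimit:", 12) ||
			    !strncmp(line, "Committed_AS:", 13))
				fputs(line, stdout);
		fclose(f);
		return 0;
	}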

Gotchas
-------

The C language stack growth does an implicit mremap. If you want absolute
guarantees and run close to the edge you MUST mmap your stack for the
largest size you think you will need. For typical stack usage this does
not matter much, but it's a corner case if you really, really care.

In mode 2 the MAP_NORESERVE flag is ignored.


How It Works
------------

The overcommit is based on the following rules

For a file backed map
	SHARED or READ-only	-	0 cost (the file is the map not swap)
	PRIVATE WRITABLE	-	size of mapping per instance

For an anonymous or /dev/zero map
	SHARED			-	size of mapping
	PRIVATE READ-only	-	0 cost (but of little use)
	PRIVATE WRITABLE	-	size of mapping per instance

Additional accounting
	Pages made writable copies by mmap
	shmfs memory drawn from the same pool

Status
------

o	We account mmap memory mappings
o	We account mprotect changes in commit
o	We account mremap changes in size
o	We account brk
o	We account munmap
o	We report the commit status in /proc
o	Account and check on fork
o	Review stack handling/building on exec
o	SHMfs accounting
o	Implement actual limit enforcement

To Do
-----
o	Account ptrace pages (this is hard)

Documentation/vm/page_migration
Normal file
@@ -0,0 +1,149 @@

Page migration
--------------

Page migration allows the moving of the physical location of pages between
nodes in a NUMA system while the process is running. This means that the
virtual addresses that the process sees do not change. However, the
system rearranges the physical location of those pages.

The main intent of page migration is to reduce the latency of memory access
by moving pages near to the processor where the process accessing that memory
is running.

Page migration allows a process to manually relocate the node on which its
pages are located through the MF_MOVE and MF_MOVE_ALL options while setting
a new memory policy via mbind(). The pages of a process can also be relocated
from another process using the sys_migrate_pages() function call. The
migrate_pages function call takes two sets of nodes and moves pages of a
process that are located on the from nodes to the destination nodes.
Page migration functions are provided by the numactl package by Andi Kleen
(a version later than 0.9.3 is required; get it from
ftp://oss.sgi.com/www/projects/libnuma/download/). numactl provides libnuma,
which provides an interface similar to other NUMA functionality for page
migration. cat /proc/<pid>/numa_maps allows an easy review of where the
pages of a process are located. See also the numa_maps documentation in the
proc(5) man page.

Manual migration is useful if, for example, the scheduler has relocated
a process to a processor on a distant node. A batch scheduler or an
administrator may detect the situation and move the pages of the process
nearer to the new processor. The kernel itself only provides
manual page migration support. Automatic page migration may be implemented
through user space processes that move pages. A special function call
"move_pages" allows the moving of individual pages within a process.
A NUMA profiler may, for example, obtain a log showing frequent off-node
accesses and may use the result to move pages to more advantageous
locations.
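
A minimal user-space sketch (not in the original document) of that
move_pages interface, using the wrapper declared in numaif.h from the
numactl/libnuma package; the target node number is an assumption:

	#include <numaif.h>	/* move_pages() wrapper */
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		long psz = sysconf(_SC_PAGESIZE);
		char *buf;
		void *pages[1];
		int dest = 0;		/* target node, assumed to exist */
		int status = -1;

		if (posix_memalign((void **)&buf, psz, psz))
			return 1;
		memset(buf, 0, psz);	/* touch it so a physical page exists */

		pages[0] = buf;
		/* pid 0 means "the calling process"; status reports the node
		 * the page ended up on, or a negative errno value. */
		if (move_pages(0, 1, pages, &dest, &status, MPOL_MF_MOVE))
			perror("move_pages");
		else
			printf("page now on node %d\n", status);
		free(buf);
		return 0;
	}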

Larger installations usually partition the system using cpusets into
sections of nodes. Paul Jackson has equipped cpusets with the ability to
move pages when a task is moved to another cpuset (see
Documentation/cgroups/cpusets.txt).
Cpusets allow the automation of process locality. If a task is moved to
a new cpuset then all its pages are moved with it as well, so that the
performance of the process does not degrade dramatically. The pages
of processes in a cpuset are also moved if the allowed memory nodes of a
cpuset are changed.

Page migration preserves the relative location of pages within a group of
nodes for all migration techniques, so that a particular memory allocation
pattern is maintained even after migrating a process. This is necessary in
order to preserve the memory latencies: processes will run with similar
performance after migration.

Page migration occurs in several steps. First comes a high level
description for those trying to use migrate_pages() from the kernel
(for userspace usage see Andi Kleen's numactl package mentioned above),
and then a low level description of how the details work.

A. In kernel use of migrate_pages()
-----------------------------------

1. Remove pages from the LRU.

   Lists of pages to be migrated are generated by scanning over
   pages and moving them into lists. This is done by
   calling isolate_lru_page().
   Calling isolate_lru_page increases the references to the page
   so that it cannot vanish while the page migration occurs.
   It also prevents the swapper or other scans from encountering
   the page.

2. We need to have a function of type new_page_t that can be
   passed to migrate_pages(). This function should figure out
   how to allocate the correct new page given the old page.

3. The migrate_pages() function is called which attempts
   to do the migration. It will call the function to allocate
   the new page for each page that is considered for
   moving.

B. How migrate_pages() works
----------------------------

migrate_pages() does several passes over its list of pages. A page is moved
if all references to the page are removable at the time. The page has
already been removed from the LRU via isolate_lru_page() and the refcount
is increased so that the page cannot be freed while page migration occurs.

Steps:

1. Lock the page to be migrated.

2. Ensure that writeback is complete.

3. Prep the new page that we want to move to. It is locked
   and set to not being uptodate so that all accesses to the new
   page immediately lock while the move is in progress.

4. The new page is prepped with some settings from the old page so that
   accesses to the new page will discover a page with the correct settings.

5. All the page table references to the page are converted
   to migration entries or dropped (nonlinear vmas).
   This decreases the mapcount of the page. If the resulting
   mapcount is not zero then we do not migrate the page.
   All user space processes that attempt to access the page
   will now wait on the page lock.

6. The radix tree lock is taken. This will cause all processes trying
   to access the page via the mapping to block on the radix tree spinlock.

7. The refcount of the page is examined and we back out if references remain;
   otherwise we know that we are the only one referencing this page.

8. The radix tree is checked and if it does not contain the pointer to this
   page then we back out because someone else modified the radix tree.

9. The radix tree is changed to point to the new page.

10. The reference count of the old page is dropped because the radix tree
    reference is gone. A reference to the new page is established because
    the new page is referenced by the radix tree.

11. The radix tree lock is dropped. With that, lookups in the mapping
    become possible again. Processes will move from spinning on the tree_lock
    to sleeping on the locked new page.

12. The page contents are copied to the new page.

13. The remaining page flags are copied to the new page.

14. The old page flags are cleared to indicate that the page does
    not provide any information anymore.

15. Queued up writeback on the new page is triggered.

16. If migration entries were inserted into the page tables, they are now
    replaced with real ptes. Doing so will enable access for user space
    processes not already waiting for the page lock.

17. The page locks are dropped from the old and new page.
    Processes waiting on the page lock will redo their page faults
    and will reach the new page.

18. The new page is moved to the LRU and can be scanned by the swapper
    etc. again.

Christoph Lameter, May 8, 2006.

Documentation/vm/pagemap.txt
Normal file
@@ -0,0 +1,152 @@

pagemap, from the userspace perspective
---------------------------------------

pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow
userspace programs to examine the page tables and related information by
reading files in /proc.

There are three components to pagemap:

 * /proc/pid/pagemap. This file lets a userspace process find out which
   physical frame each virtual page is mapped to. It contains one 64-bit
   value for each virtual page, containing the following data (from
   fs/proc/task_mmu.c, above pagemap_read):

    * Bits 0-54  page frame number (PFN) if present
    * Bits 0-4   swap type if swapped
    * Bits 5-54  swap offset if swapped
    * Bit  55    pte is soft-dirty (see Documentation/vm/soft-dirty.txt)
    * Bits 56-60 zero
    * Bit  61    page is file-page or shared-anon
    * Bit  62    page swapped
    * Bit  63    page present

   If the page is not present but in swap, then the PFN contains an
   encoding of the swap file number and the page's offset into the
   swap. Unmapped pages return a null PFN. This allows determining
   precisely which pages are mapped (or in swap) and comparing mapped
   pages between processes.

   Efficient users of this interface will use /proc/pid/maps to
   determine which areas of memory are actually mapped and llseek to
   skip over unmapped regions.

 * /proc/kpagecount. This file contains a 64-bit count of the number of
   times each page is mapped, indexed by PFN.

 * /proc/kpageflags. This file contains a 64-bit set of flags for each
   page, indexed by PFN.

   The flags are (from fs/proc/page.c, above kpageflags_read):

     0. LOCKED
     1. ERROR
     2. REFERENCED
     3. UPTODATE
     4. DIRTY
     5. LRU
     6. ACTIVE
     7. SLAB
     8. WRITEBACK
     9. RECLAIM
    10. BUDDY
    11. MMAP
    12. ANON
    13. SWAPCACHE
    14. SWAPBACKED
    15. COMPOUND_HEAD
    16. COMPOUND_TAIL
    17. HUGE
    18. UNEVICTABLE
    19. HWPOISON
    20. NOPAGE
    21. KSM
    22. THP

Short descriptions of the page flags:

 0. LOCKED
    page is being locked for exclusive access, eg. by undergoing read/write IO

 7. SLAB
    page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator
    When a compound page is used, SLUB/SLQB will only set this flag on the
    head page; SLOB will not flag it at all.

10. BUDDY
    a free memory block managed by the buddy system allocator
    The buddy system organizes free memory in blocks of various orders.
    An order N block has 2^N physically contiguous pages, with the BUDDY flag
    set for and _only_ for the first page.

15. COMPOUND_HEAD
16. COMPOUND_TAIL
    A compound page with order N consists of 2^N physically contiguous pages.
    A compound page with order 2 takes the form of "HTTT", where H denotes its
    head page and T denotes its tail page(s). The major consumers of compound
    pages are hugeTLB pages (Documentation/vm/hugetlbpage.txt), the SLUB etc.
    memory allocators and various device drivers. However, in this interface,
    only huge/giga pages are made visible to end users.

17. HUGE
    this is an integral part of a HugeTLB page

19. HWPOISON
    hardware detected memory corruption on this page: don't touch the data!

20. NOPAGE
    no page frame exists at the requested address

21. KSM
    identical memory pages dynamically shared between one or more processes

22. THP
    contiguous pages which construct transparent hugepages

    [IO related page flags]
 1. ERROR      IO error occurred
 3. UPTODATE   page has up-to-date data
               ie. for file backed page: (in-memory data revision >= on-disk one)
 4. DIRTY      page has been written to, hence contains new data
               ie. for file backed page: (in-memory data revision > on-disk one)
 8. WRITEBACK  page is being synced to disk

    [LRU related page flags]
 5. LRU         page is in one of the LRU lists
 6. ACTIVE      page is in the active LRU list
18. UNEVICTABLE page is in the unevictable (non-)LRU list
                It is somehow pinned and not a candidate for LRU page reclaims,
                eg. ramfs pages, shmctl(SHM_LOCK) and mlock() memory segments
 2. REFERENCED  page has been referenced since last LRU list enqueue/requeue
 9. RECLAIM     page will be reclaimed soon after its pageout IO completed
11. MMAP        a memory mapped page
12. ANON        a memory mapped page that is not part of a file
13. SWAPCACHE   page is mapped to swap space, ie. has an associated swap entry
14. SWAPBACKED  page is backed by swap/RAM

The page-types tool in tools/vm can be used to query the above flags.

Using pagemap to do something useful:

The general procedure for using pagemap to find out about a process' memory
usage goes like this:

 1. Read /proc/pid/maps to determine which parts of the memory space are
    mapped to what.
 2. Select the maps you are interested in -- all of them, or a particular
    library, or the stack or the heap, etc.
 3. Open /proc/pid/pagemap and seek to the pages you would like to examine.
 4. Read a u64 for each page from pagemap.
 5. Open /proc/kpagecount and/or /proc/kpageflags. For each PFN you just
    read, seek to that entry in the file, and read the data you want.

For example, to find the "unique set size" (USS), which is the amount of
memory that a process is using that is not shared with any other process,
you can go through every map in the process, find the PFNs, look those up
in kpagecount, and tally up the number of pages that are only referenced
once.
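
A minimal sketch (not part of the original document) of steps 3 and 4 above,
reading the pagemap entry for one address of the calling process and decoding
the bit layout listed earlier; a kpagecount or kpageflags lookup would follow
the same pread() pattern, indexed by the extracted PFN:

	#include <fcntl.h>
	#include <inttypes.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <unistd.h>

	/* Return the 64-bit pagemap entry for the page containing 'vaddr'
	 * in the calling process, or 0 on error. */
	static uint64_t pagemap_entry(void *vaddr)
	{
		uint64_t entry = 0;
		long psz = sysconf(_SC_PAGESIZE);
		int fd = open("/proc/self/pagemap", O_RDONLY);

		if (fd < 0)
			return 0;
		if (pread(fd, &entry, sizeof(entry),
			  ((uintptr_t)vaddr / psz) * sizeof(entry)) !=
		    sizeof(entry))
			entry = 0;
		close(fd);
		return entry;
	}

	int main(void)
	{
		int x = 42;			/* something mapped to inspect */
		uint64_t e = pagemap_entry(&x);

		if (e & (1ULL << 63))		/* bit 63: page present */
			printf("present, PFN 0x%" PRIx64 "\n",
			       e & ((1ULL << 55) - 1));	/* bits 0-54 */
		else if (e & (1ULL << 62))	/* bit 62: page swapped */
			printf("swapped out\n");
		else
			printf("not present\n");
		return 0;
	}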

Other notes:

Reading from any of the files will return -EINVAL if you are not starting
the read on an 8-byte boundary (e.g., if you sought an odd number of bytes
into the file), or if the size of the read is not a multiple of 8 bytes.

Documentation/vm/remap_file_pages.txt
Normal file
@@ -0,0 +1,28 @@

The remap_file_pages() system call is used to create a nonlinear mapping,
that is, a mapping in which the pages of the file are mapped into a
nonsequential order in memory. The advantage of using remap_file_pages()
over using repeated calls to mmap(2) is that the former approach does not
require the kernel to create additional VMA (Virtual Memory Area) data
structures.
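
A minimal user-space sketch (not from the original text) of such a nonlinear
mapping; the file name is a hypothetical placeholder and must be at least
three pages long:

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		long psz = sysconf(_SC_PAGESIZE);
		int fd = open("data.bin", O_RDWR);	/* hypothetical file */
		if (fd < 0) {
			perror("open");
			return 1;
		}

		/* One VMA covering 3 pages of the file, linear at first. */
		char *p = mmap(NULL, 3 * psz, PROT_READ | PROT_WRITE,
			       MAP_SHARED, fd, 0);
		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}

		/* Make the first page of the mapping refer to file page 2
		 * instead, without creating a second VMA.  'prot' must be 0
		 * and the offset is in units of the page size. */
		if (remap_file_pages(p, psz, 0, 2, 0)) {
			perror("remap_file_pages");
			return 1;
		}

		printf("first byte of the remapped page: %c\n", p[0]);
		munmap(p, 3 * psz);
		close(fd);
		return 0;
	}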

Supporting nonlinear mappings requires a significant amount of non-trivial
code in the kernel virtual memory subsystem, including hot paths. Also, to
make nonlinear mappings work, the kernel needs a way to distinguish normal
page table entries from entries with a file offset (pte_file). The kernel
reserves a flag in the PTE for this purpose. PTE flags are a scarce resource,
especially on some CPU architectures. It would be nice to free up the flag
for other usage.

Fortunately, there are not many users of remap_file_pages() in the wild.
The only known user is an enterprise RDBMS implementation that uses the
syscall on 32-bit systems to map files bigger than can linearly fit into a
32-bit virtual address space. This use-case is not critical anymore since
64-bit systems are widely available.

The plan is to deprecate the syscall and replace it with an emulation.
The emulation will create new VMAs instead of nonlinear mappings. It's
going to work slower for rare users of remap_file_pages(), but the ABI is
preserved.

One side effect of the emulation (apart from performance) is that a user can
hit the vm.max_map_count limit more easily due to the additional VMAs. See
the comment for DEFAULT_MAX_MAP_COUNT for more details on the limit.

Documentation/vm/slub.txt
Normal file
@@ -0,0 +1,283 @@

Short users guide for SLUB
--------------------------

The basic philosophy of SLUB is very different from SLAB. SLAB
requires rebuilding the kernel to activate debug options for all
slab caches. SLUB always includes full debugging but it is off by default.
SLUB can enable debugging only for selected slabs in order to avoid
an impact on overall system performance which may make a bug more
difficult to find.

In order to switch debugging on one can add an option "slub_debug"
to the kernel command line. That will enable full debugging for
all slabs.

Typically one would then use the "slabinfo" command to get statistical
data and perform operations on the slabs. By default slabinfo only lists
slabs that have data in them. See "slabinfo -h" for more options when
running the command. slabinfo can be compiled with

gcc -o slabinfo tools/vm/slabinfo.c

Some of the modes of operation of slabinfo require that slub debugging
be enabled on the command line. F.e. no tracking information will be
available without debugging on, and validation can only partially
be performed if debugging was not switched on.

Some more sophisticated uses of slub_debug:
-------------------------------------------

Parameters may be given to slub_debug. If none is specified then full
debugging is enabled. Format:

slub_debug=<Debug-Options>	Enable options for all slabs
slub_debug=<Debug-Options>,<slab name>
				Enable options only for select slabs

Possible debug options are
	F		Sanity checks on (enables SLAB_DEBUG_FREE. Sorry
			SLAB legacy issues)
	Z		Red zoning
	P		Poisoning (object and padding)
	U		User tracking (free and alloc)
	T		Trace (please only use on single slabs)
	A		Toggle failslab filter mark for the cache
	O		Switch debugging off for caches that would have
			caused higher minimum slab orders
	-		Switch all debugging off (useful if the kernel is
			configured with CONFIG_SLUB_DEBUG_ON)

F.e. in order to boot just with sanity checks and red zoning one would specify:

	slub_debug=FZ

Trying to find an issue in the dentry cache? Try

	slub_debug=,dentry

to only enable debugging on the dentry cache.

Red zoning and tracking may realign the slab. We can just apply sanity checks
to the dentry cache with

	slub_debug=F,dentry

Debugging options may require the minimum possible slab order to increase as
a result of storing the metadata (for example, caches with PAGE_SIZE object
sizes). This has a higher likelihood of resulting in slab allocation errors
in low memory situations or if there's high fragmentation of memory. To
switch off debugging for such caches by default, use

	slub_debug=O

In case you forgot to enable debugging on the kernel command line: it is
possible to enable debugging manually when the kernel is up. Look at the
contents of:

/sys/kernel/slab/<slab name>/

Look at the writable files. Writing 1 to them will enable the
corresponding debug option. All options can be set on a slab that does
not contain objects. If the slab already contains objects then only sanity
checks and tracing may be enabled. The other options may cause the
realignment of objects.

Careful with tracing: it may spew out lots of information and never stop if
used on the wrong slab.

Slab merging
------------

If no debug options are specified then SLUB may merge similar slabs together
in order to reduce overhead and increase cache hotness of objects.
slabinfo -a displays which slabs were merged together.

Slab validation
---------------

SLUB can validate all objects if the kernel was booted with slub_debug. In
order to do so you must have the slabinfo tool. Then you can do

slabinfo -v

which will test all objects. Output will be generated to the syslog.

This also works in a more limited way if boot was without slab debug.
In that case slabinfo -v simply tests all reachable objects. Usually
these are in the cpu slabs and the partial slabs. Full slabs are not
tracked by SLUB in a non debug situation.

Getting more performance
------------------------

To some degree SLUB's performance is limited by the need to take the
list_lock once in a while to deal with partial slabs. That overhead is
governed by the order of the allocation for each slab. The allocations
can be influenced by kernel parameters:

slub_min_objects=x		(default 4)
slub_min_order=x		(default 0)
slub_max_order=x		(default 3 (PAGE_ALLOC_COSTLY_ORDER))

slub_min_objects allows one to specify how many objects must at least fit
into one slab in order for the allocation order to be acceptable.
In general slub will be able to perform this number of allocations
on a slab without consulting centralized resources (list_lock) where
contention may occur.

slub_min_order specifies a minimum order of slabs. It has a similar effect
to slub_min_objects.

slub_max_order specifies the order at which slub_min_objects should no
longer be checked. This is useful to avoid SLUB trying to generate
super large order pages to fit slub_min_objects of a slab cache with
large object sizes into one high order page. Setting the command line
parameter debug_guardpage_minorder=N (N > 0) forces slub_max_order to 0,
which causes slabs to be allocated at the minimum possible order.

SLUB Debug output
-----------------

Here is a sample of slub debug output:

====================================================================
BUG kmalloc-8: Redzone overwritten
--------------------------------------------------------------------

INFO: 0xc90f6d28-0xc90f6d2b. First byte 0x00 instead of 0xcc
INFO: Slab 0xc528c530 flags=0x400000c3 inuse=61 fp=0xc90f6d58
INFO: Object 0xc90f6d20 @offset=3360 fp=0xc90f6d58
INFO: Allocated in get_modalias+0x61/0xf5 age=53 cpu=1 pid=554

Bytes b4 0xc90f6d10:  00 00 00 00 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a  ........ZZZZZZZZ
  Object 0xc90f6d20:  31 30 31 39 2e 30 30 35                          1019.005
 Redzone 0xc90f6d28:  00 cc cc cc                                      .
 Padding 0xc90f6d50:  5a 5a 5a 5a 5a 5a 5a 5a                          ZZZZZZZZ

  [<c010523d>] dump_trace+0x63/0x1eb
  [<c01053df>] show_trace_log_lvl+0x1a/0x2f
  [<c010601d>] show_trace+0x12/0x14
  [<c0106035>] dump_stack+0x16/0x18
  [<c017e0fa>] object_err+0x143/0x14b
  [<c017e2cc>] check_object+0x66/0x234
  [<c017eb43>] __slab_free+0x239/0x384
  [<c017f446>] kfree+0xa6/0xc6
  [<c02e2335>] get_modalias+0xb9/0xf5
  [<c02e23b7>] dmi_dev_uevent+0x27/0x3c
  [<c027866a>] dev_uevent+0x1ad/0x1da
  [<c0205024>] kobject_uevent_env+0x20a/0x45b
  [<c020527f>] kobject_uevent+0xa/0xf
  [<c02779f1>] store_uevent+0x4f/0x58
  [<c027758e>] dev_attr_store+0x29/0x2f
  [<c01bec4f>] sysfs_write_file+0x16e/0x19c
  [<c0183ba7>] vfs_write+0xd1/0x15a
  [<c01841d7>] sys_write+0x3d/0x72
  [<c0104112>] sysenter_past_esp+0x5f/0x99
  [<b7f7b410>] 0xb7f7b410
=======================

FIX kmalloc-8: Restoring Redzone 0xc90f6d28-0xc90f6d2b=0xcc

If SLUB encounters a corrupted object (full detection requires the kernel
to be booted with slub_debug) then the following output will be dumped
into the syslog:

1. Description of the problem encountered

This will be a message in the system log starting with

===============================================
BUG <slab cache affected>: <What went wrong>
-----------------------------------------------

INFO: <corruption start>-<corruption_end> <more info>
INFO: Slab <address> <slab information>
INFO: Object <address> <object information>
INFO: Allocated in <kernel function> age=<jiffies since alloc> cpu=<allocated by
	cpu> pid=<pid of the process>
INFO: Freed in <kernel function> age=<jiffies since free> cpu=<freed by cpu>
	pid=<pid of the process>

(Object allocation / free information is only available if SLAB_STORE_USER is
set for the slab. slub_debug sets that option)

2. The object contents if an object was involved.

Various types of lines can follow the BUG SLUB line:

Bytes b4 <address> : <bytes>
	Shows a few bytes before the object where the problem was detected.
	Can be useful if the corruption does not stop with the start of the
	object.

Object <address> : <bytes>
	The bytes of the object. If the object is inactive then the bytes
	typically contain poison values. Any non-poison value shows a
	corruption by a write after free.

Redzone <address> : <bytes>
	The Redzone following the object. The Redzone is used to detect
	writes after the object. All bytes should always have the same
	value. If there is any deviation then it is due to a write after
	the object boundary.

	(Redzone information is only available if SLAB_RED_ZONE is set.
	slub_debug sets that option)

Padding <address> : <bytes>
	Unused data to fill up the space in order to get the next object
	properly aligned. In the debug case we make sure that there are
	at least 4 bytes of padding. This allows the detection of writes
	before the object.

3. A stackdump

The stackdump describes the location where the error was detected. The cause
of the corruption may more likely be found by looking at the function that
allocated or freed the object.

4. Report on how the problem was dealt with in order to ensure the continued
operation of the system.

These are messages in the system log beginning with

FIX <slab cache affected>: <corrective action taken>

In the above sample SLUB found that the Redzone of an active object had
been overwritten. Here a string of 8 characters was written into a slab that
has the length of 8 characters. However, an 8 character string needs a
terminating 0. That zero has overwritten the first byte of the Redzone field.
After reporting the details of the issue encountered, the FIX SLUB message
tells us that SLUB has restored the Redzone to its proper value and then
system operations continue.

Emergency operations:
---------------------

Minimal debugging (sanity checks alone) can be enabled by booting with

	slub_debug=F

This will generally be enough to enable the resiliency features of slub,
which will keep the system running even if a bad kernel component keeps
corrupting objects. This may be important for production systems.
Performance will be impacted by the sanity checks and there will be a
continual stream of error messages to the syslog but no additional memory
will be used (unlike full debugging).

No guarantees. The kernel component still needs to be fixed. Performance
may be optimized further by locating the slab that experiences corruption
and enabling debugging only for that cache, i.e.

	slub_debug=F,dentry

If the corruption occurs by writing after the end of the object then it
may be advisable to enable a Redzone to avoid corrupting the beginning
of other objects:

	slub_debug=FZ,dentry

Christoph Lameter, May 30, 2007

Documentation/vm/soft-dirty.txt
Normal file
@@ -0,0 +1,43 @@

SOFT-DIRTY PTEs

The soft-dirty is a bit on a PTE which helps to track which pages a task
writes to. In order to do this tracking one should

 1. Clear soft-dirty bits from the task's PTEs.

    This is done by writing "4" into the /proc/PID/clear_refs file of the
    task in question.

 2. Wait some time.

 3. Read soft-dirty bits from the PTEs.

    This is done by reading from the /proc/PID/pagemap. The bit 55 of the
    64-bit qword is the soft-dirty one. If set, the respective PTE was
    written to since step 1.
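
    A small user-space sketch (not part of the original text) that exercises
    the three steps on the calling process itself; the static buffer is simply
    a convenient page to watch:

	#include <fcntl.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		long psz = sysconf(_SC_PAGESIZE);
		static char buf[4096];		/* the page we will watch */
		uint64_t entry = 0;
		int fd;

		/* Step 1: clear the soft-dirty bits. */
		fd = open("/proc/self/clear_refs", O_WRONLY);
		if (fd < 0)
			return 1;
		write(fd, "4", 1);
		close(fd);

		/* Step 2: "wait some time" -- here we simply dirty the page. */
		buf[0] = 1;

		/* Step 3: read the pagemap entry and test bit 55. */
		fd = open("/proc/self/pagemap", O_RDONLY);
		if (fd < 0)
			return 1;
		pread(fd, &entry, sizeof(entry),
		      ((uintptr_t)buf / psz) * sizeof(entry));
		close(fd);

		printf("soft-dirty: %s\n", (entry >> 55) & 1 ? "yes" : "no");
		return 0;
	}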

Internally, to do this tracking, the writable bit is cleared from PTEs
when the soft-dirty bit is cleared. So, after this, when the task tries to
modify a page at some virtual address the #PF occurs and the kernel sets
the soft-dirty bit on the respective PTE.

Note that although all of the task's address space is marked as r/o after the
soft-dirty bits are cleared, the #PF-s that occur after that are processed
fast. This is so since the pages are still mapped to physical memory, and thus
all the kernel does is find this fact out and put both the writable and
soft-dirty bits on the PTE.

While in most cases tracking memory changes by #PF-s is more than enough,
there is still a scenario where we can lose soft-dirty bits -- a task
unmaps a previously mapped memory region and then maps a new one at exactly
the same place. When unmap is called, the kernel internally clears PTE values
including soft-dirty bits. To notify user space applications about such
memory region renewal the kernel always marks new memory regions (and
expanded regions) as soft dirty.

This feature is actively used by the checkpoint-restore project. You
can find more details about it on http://criu.org


-- Pavel Emelyanov, Apr 9, 2013

Documentation/vm/split_page_table_lock
Normal file
@@ -0,0 +1,94 @@

Split page table lock
=====================

Originally, the mm->page_table_lock spinlock protected all page tables of the
mm_struct. But this approach leads to poor page fault scalability of
multi-threaded applications due to high contention on the lock. To improve
scalability, the split page table lock was introduced.

With the split page table lock we have a separate per-table lock to serialize
access to the table. At the moment we use the split lock for PTE and PMD
tables. Access to higher level tables is protected by mm->page_table_lock.

There are helpers to lock/unlock a table and other accessor functions (a
usage sketch follows the list):
 - pte_offset_map_lock()
	maps pte and takes PTE table lock, returns pointer to the taken
	lock;
 - pte_unmap_unlock()
	unlocks and unmaps PTE table;
 - pte_alloc_map_lock()
	allocates PTE table if needed and takes the lock, returns pointer
	to the taken lock or NULL if allocation failed;
 - pte_lockptr()
	returns pointer to PTE table lock;
 - pmd_lock()
	takes PMD table lock, returns pointer to taken lock;
 - pmd_lockptr()
	returns pointer to PMD table lock;
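
As an illustration only (not from the original document), a kernel-internal
sketch of how a page-table walker typically pairs the first two helpers; it
assumes the caller already holds mmap_sem and has a valid pmd entry:

	#include <linux/mm.h>

	/* Check whether the PTE for 'addr' is present, taking and releasing
	 * the per-table lock around the access. */
	static int query_pte(struct mm_struct *mm, unsigned long addr,
			     pmd_t *pmd)
	{
		spinlock_t *ptl;
		int present;

		/* Maps the PTE page and takes its split lock (or
		 * mm->page_table_lock when split locks are disabled). */
		pte_t *pte = pte_offset_map_lock(mm, pmd, addr, &ptl);

		present = pte_present(*pte);

		/* Drops the same lock and unmaps the PTE page. */
		pte_unmap_unlock(pte, ptl);
		return present;
	}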

Split page table lock for PTE tables is enabled at compile time if
CONFIG_SPLIT_PTLOCK_CPUS (usually 4) is less than or equal to NR_CPUS.
If the split lock is disabled, all tables are guarded by mm->page_table_lock.

Split page table lock for PMD tables is enabled if it's enabled for PTE
tables and the architecture supports it (see below).

Hugetlb and split page table lock
---------------------------------

Hugetlb can support several page sizes. We use the split lock only for the
PMD level, but not for the PUD.

Hugetlb-specific helpers:
 - huge_pte_lock()
	takes pmd split lock for PMD_SIZE page, mm->page_table_lock
	otherwise;
 - huge_pte_lockptr()
	returns pointer to table lock;

Support of split page table lock by an architecture
----------------------------------------------------

There's no need for special enabling of the PTE split page table lock:
everything required is done by pgtable_page_ctor() and pgtable_page_dtor(),
which must be called on PTE table allocation / freeing.

Make sure the architecture doesn't use the slab allocator for page table
allocation: slab uses page->slab_cache and page->first_page for its pages.
These fields share storage with page->ptl.

PMD split lock only makes sense if you have more than two page table
levels.

PMD split lock enabling requires a pgtable_pmd_page_ctor() call on PMD table
allocation and pgtable_pmd_page_dtor() on freeing.

Allocation usually happens in pmd_alloc_one(), freeing in pmd_free() and
pmd_free_tlb(), but make sure you cover all PMD table allocation / freeing
paths: i.e. X86_PAE preallocates a few PMDs on pgd_alloc().

With everything in place you can set CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK.

NOTE: pgtable_page_ctor() and pgtable_pmd_page_ctor() can fail -- this must
be handled properly.

page->ptl
---------

page->ptl is used to access the split page table lock, where 'page' is the
struct page of the page containing the table. It shares storage with
page->private (and a few other fields in the union).

To avoid increasing the size of struct page and to get the best performance,
we use a trick:
 - if spinlock_t fits into long, we use page->ptl as the spinlock, so we
   can avoid indirect access and save a cache line.
 - if the size of spinlock_t is bigger than the size of long, we use page->ptl
   as a pointer to spinlock_t and allocate it dynamically. This allows using
   the split lock with DEBUG_SPINLOCK or DEBUG_LOCK_ALLOC enabled, but costs
   one more cache line for the indirect access;

The spinlock_t is allocated in pgtable_page_ctor() for PTE tables and in
pgtable_pmd_page_ctor() for PMD tables.

Please, never access page->ptl directly -- use the appropriate helper.
376
Documentation/vm/transhuge.txt
Normal file
376
Documentation/vm/transhuge.txt
Normal file
|
@ -0,0 +1,376 @@
|
|||
= Transparent Hugepage Support =
|
||||
|
||||
== Objective ==
|
||||
|
||||
Performance critical computing applications dealing with large memory
|
||||
working sets are already running on top of libhugetlbfs and in turn
|
||||
hugetlbfs. Transparent Hugepage Support is an alternative means of
|
||||
using huge pages for the backing of virtual memory with huge pages
|
||||
that supports the automatic promotion and demotion of page sizes and
|
||||
without the shortcomings of hugetlbfs.
|
||||
|
||||
Currently it only works for anonymous memory mappings but in the
|
||||
future it can expand over the pagecache layer starting with tmpfs.
|
||||
|
||||
The reason applications are running faster is because of two
|
||||
factors. The first factor is almost completely irrelevant and it's not
|
||||
of significant interest because it'll also have the downside of
|
||||
requiring larger clear-page copy-page in page faults which is a
|
||||
potentially negative effect. The first factor consists in taking a
|
||||
single page fault for each 2M virtual region touched by userland (so
|
||||
reducing the enter/exit kernel frequency by a 512 times factor). This
|
||||
only matters the first time the memory is accessed for the lifetime of
|
||||
a memory mapping. The second long lasting and much more important
|
||||
factor will affect all subsequent accesses to the memory for the whole
|
||||
runtime of the application. The second factor consist of two
|
||||
components: 1) the TLB miss will run faster (especially with
|
||||
virtualization using nested pagetables but almost always also on bare
|
||||
metal without virtualization) and 2) a single TLB entry will be
|
||||
mapping a much larger amount of virtual memory in turn reducing the
|
||||
number of TLB misses. With virtualization and nested pagetables the
|
||||
TLB can be mapped of larger size only if both KVM and the Linux guest
|
||||
are using hugepages but a significant speedup already happens if only
|
||||
one of the two is using hugepages just because of the fact the TLB
|
||||
miss is going to run faster.
|
||||
|
||||
== Design ==
|
||||
|
||||
- "graceful fallback": mm components which don't have transparent
|
||||
hugepage knowledge fall back to breaking a transparent hugepage and
|
||||
working on the regular pages and their respective regular pmd/pte
|
||||
mappings
|
||||
|
||||
- if a hugepage allocation fails because of memory fragmentation,
|
||||
regular pages should be gracefully allocated instead and mixed in
|
||||
the same vma without any failure or significant delay and without
|
||||
userland noticing
|
||||
|
||||
- if some task quits and more hugepages become available (either
|
||||
immediately in the buddy or through the VM), guest physical memory
|
||||
backed by regular pages should be relocated on hugepages
|
||||
automatically (with khugepaged)
|
||||
|
||||
- it doesn't require memory reservation and in turn it uses hugepages
|
||||
whenever possible (the only possible reservation here is kernelcore=
|
||||
to avoid unmovable pages to fragment all the memory but such a tweak
|
||||
is not specific to transparent hugepage support and it's a generic
|
||||
feature that applies to all dynamic high order allocations in the
|
||||
kernel)
|
||||
|
||||
- this initial support only offers the feature in the anonymous memory
|
||||
regions but it'd be ideal to move it to tmpfs and the pagecache
|
||||
later
|
||||
|
||||
Transparent Hugepage Support maximizes the usefulness of free memory
|
||||
if compared to the reservation approach of hugetlbfs by allowing all
|
||||
unused memory to be used as cache or other movable (or even unmovable
|
||||
entities). It doesn't require reservation to prevent hugepage
|
||||
allocation failures to be noticeable from userland. It allows paging
|
||||
and all other advanced VM features to be available on the
|
||||
hugepages. It requires no modifications for applications to take
|
||||
advantage of it.
|
||||
|
||||
Applications however can be further optimized to take advantage of
|
||||
this feature, like for example they've been optimized before to avoid
|
||||
a flood of mmap system calls for every malloc(4k). Optimizing userland
|
||||
is by far not mandatory and khugepaged already can take care of long
|
||||
lived page allocations even for hugepage unaware applications that
|
||||
deals with large amounts of memory.
|
||||
|
||||
In certain cases when hugepages are enabled system wide, application
|
||||
may end up allocating more memory resources. An application may mmap a
|
||||
large region but only touch 1 byte of it, in that case a 2M page might
|
||||
be allocated instead of a 4k page for no good. This is why it's
|
||||
possible to disable hugepages system-wide and to only have them inside
|
||||
MADV_HUGEPAGE madvise regions.
|
||||
|
||||
Embedded systems should enable hugepages only inside madvise regions
|
||||
to eliminate any risk of wasting any precious byte of memory and to
|
||||
only run faster.
|
||||
|
||||
Applications that gets a lot of benefit from hugepages and that don't
|
||||
risk to lose memory by using hugepages, should use
|
||||
madvise(MADV_HUGEPAGE) on their critical mmapped regions.
|
||||
|
||||
== sysfs ==
|
||||
|
||||
Transparent Hugepage Support can be entirely disabled (mostly for
|
||||
debugging purposes) or only enabled inside MADV_HUGEPAGE regions (to
|
||||
avoid the risk of consuming more memory resources) or enabled system
|
||||
wide. This can be achieved with one of:
|
||||
|
||||
echo always >/sys/kernel/mm/transparent_hugepage/enabled
|
||||
echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
|
||||
echo never >/sys/kernel/mm/transparent_hugepage/enabled
|
||||
|
||||
It's also possible to limit defrag efforts in the VM to generate
|
||||
hugepages in case they're not immediately free to madvise regions or
|
||||
to never try to defrag memory and simply fallback to regular pages
|
||||
unless hugepages are immediately available. Clearly if we spend CPU
|
||||
time to defrag memory, we would expect to gain even more by the fact
|
||||
we use hugepages later instead of regular pages. This isn't always
|
||||
guaranteed, but it may be more likely in case the allocation is for a
|
||||
MADV_HUGEPAGE region.
|
||||
|
||||
echo always >/sys/kernel/mm/transparent_hugepage/defrag
|
||||
echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
|
||||
echo never >/sys/kernel/mm/transparent_hugepage/defrag
|
||||
|
||||
By default kernel tries to use huge zero page on read page fault.
|
||||
It's possible to disable huge zero page by writing 0 or enable it
|
||||
back by writing 1:
|
||||
|
||||
echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page
|
||||
echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page
|
||||
|
||||
khugepaged will be automatically started when
|
||||
transparent_hugepage/enabled is set to "always" or "madvise, and it'll
|
||||
be automatically shutdown if it's set to "never".
|
||||
|
||||
khugepaged runs usually at low frequency so while one may not want to
|
||||
invoke defrag algorithms synchronously during the page faults, it
|
||||
should be worth invoking defrag at least in khugepaged. However it's
|
||||
also possible to disable defrag in khugepaged by writing 0 or enable
|
||||
defrag in khugepaged by writing 1:
|
||||
|
||||
echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
|
||||
echo 1 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
|
||||
|
||||
You can also control how many pages khugepaged should scan at each
|
||||
pass:
|
||||
|
||||
/sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan
|
||||
|
||||
and how many milliseconds to wait in khugepaged between each pass (you
|
||||
can set this to 0 to run khugepaged at 100% utilization of one core):
|
||||
|
||||
/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
|
||||
|
||||
and how many milliseconds to wait in khugepaged if there's an hugepage
|
||||
allocation failure to throttle the next allocation attempt.
|
||||
|
||||
/sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
|
||||
|
||||
The khugepaged progress can be seen in the number of pages collapsed:
|
||||
|
||||
/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed
|
||||
|
||||
for each pass:
|
||||
|
||||
/sys/kernel/mm/transparent_hugepage/khugepaged/full_scans
|
||||
|
||||
== Boot parameter ==
|
||||
|
||||
You can change the sysfs boot time defaults of Transparent Hugepage
|
||||
Support by passing the parameter "transparent_hugepage=always" or
|
||||
"transparent_hugepage=madvise" or "transparent_hugepage=never"
|
||||
(without "") to the kernel command line.
|
||||
|
||||
== Need of application restart ==
|
||||
|
||||
The transparent_hugepage/enabled values only affect future
|
||||
behavior. So to make them effective you need to restart any
|
||||
application that could have been using hugepages. This also applies to
|
||||
the regions registered in khugepaged.
|
||||
|
||||
== Monitoring usage ==
|
||||
|
||||
The number of transparent huge pages currently used by the system is
|
||||
available by reading the AnonHugePages field in /proc/meminfo. To
|
||||
identify what applications are using transparent huge pages, it is
|
||||
necessary to read /proc/PID/smaps and count the AnonHugePages fields
|
||||
for each mapping. Note that reading the smaps file is expensive and
|
||||
reading it frequently will incur overhead.
|
||||
|
||||
There are a number of counters in /proc/vmstat that may be used to
|
||||
monitor how successfully the system is providing huge pages for use.
|
||||
|
||||
thp_fault_alloc is incremented every time a huge page is successfully
|
||||
allocated to handle a page fault. This applies to both the
|
||||
first time a page is faulted and for COW faults.
|
||||
|
||||
thp_collapse_alloc is incremented by khugepaged when it has found
|
||||
a range of pages to collapse into one huge page and has
|
||||
successfully allocated a new huge page to store the data.
|
||||
|
||||
thp_fault_fallback is incremented if a page fault fails to allocate
|
||||
a huge page and instead falls back to using small pages.
|
||||
|
||||
thp_collapse_alloc_failed is incremented if khugepaged found a range
|
||||
of pages that should be collapsed into one huge page but failed
|
||||
the allocation.
|
||||
|
||||
thp_split is incremented every time a huge page is split into base
|
||||
pages. This can happen for a variety of reasons but a common
|
||||
reason is that a huge page is old and is being reclaimed.
|
||||
|
||||
thp_zero_page_alloc is incremented every time a huge zero page is
|
||||
successfully allocated. It includes allocations which where
|
||||
dropped due race with other allocation. Note, it doesn't count
|
||||
every map of the huge zero page, only its allocation.
|
||||
|
||||
thp_zero_page_alloc_failed is incremented if kernel fails to allocate
|
||||
huge zero page and falls back to using small pages.
|
||||
|
||||
As the system ages, allocating huge pages may be expensive as the
|
||||
system uses memory compaction to copy data around memory to free a
|
||||
huge page for use. There are some counters in /proc/vmstat to help
|
||||
monitor this overhead.
|
||||
|
||||
compact_stall is incremented every time a process stalls to run
|
||||
memory compaction so that a huge page is free for use.
|
||||
|
||||
compact_success is incremented if the system compacted memory and
|
||||
freed a huge page for use.
|
||||
|
||||
compact_fail is incremented if the system tries to compact memory
|
||||
but failed.
|
||||
|
||||
compact_pages_moved is incremented each time a page is moved. If
|
||||
this value is increasing rapidly, it implies that the system
|
||||
is copying a lot of data to satisfy the huge page allocation.
|
||||
It is possible that the cost of copying exceeds any savings
|
||||
from reduced TLB misses.
|
||||
|
||||
compact_pagemigrate_failed is incremented when the underlying mechanism
|
||||
for moving a page failed.
|
||||
|
||||
compact_blocks_moved is incremented each time memory compaction examines
|
||||
a huge page aligned range of pages.
|
||||
|
||||
It is possible to establish how long the stalls were using the function
|
||||
tracer to record how long was spent in __alloc_pages_nodemask and
|
||||
using the mm_page_alloc tracepoint to identify which allocations were
|
||||
for huge pages.
|
||||
|
||||
== get_user_pages and follow_page ==
|
||||
|
||||
get_user_pages and follow_page if run on a hugepage, will return the
|
||||
head or tail pages as usual (exactly as they would do on
|
||||
hugetlbfs). Most gup users will only care about the actual physical
|
||||
address of the page and its temporary pinning to release after the I/O
|
||||
is complete, so they won't ever notice the fact the page is huge. But
|
||||
if any driver is going to mangle over the page structure of the tail
|
||||
page (like for checking page->mapping or other bits that are relevant
|
||||
for the head page and not the tail page), it should be updated to jump
|
||||
to check head page instead (while serializing properly against
|
||||
split_huge_page() to avoid the head and tail pages to disappear from
|
||||
under it, see the futex code to see an example of that, hugetlbfs also
|
||||
needed special handling in futex code for similar reasons).
|
||||
|
||||
NOTE: these aren't new constraints to the GUP API, and they match the
|
||||
same constrains that applies to hugetlbfs too, so any driver capable
|
||||
of handling GUP on hugetlbfs will also work fine on transparent
|
||||
hugepage backed mappings.
|
||||
|
||||
In case you can't handle compound pages if they're returned by
|
||||
follow_page, the FOLL_SPLIT bit can be specified as parameter to
|
||||
follow_page, so that it will split the hugepages before returning
|
||||
them. Migration for example passes FOLL_SPLIT as parameter to
|
||||
follow_page because it's not hugepage aware and in fact it can't work
|
||||
at all on hugetlbfs (but it instead works fine on transparent
|
||||
hugepages thanks to FOLL_SPLIT). migration simply can't deal with
|
||||
hugepages being returned (as it's not only checking the pfn of the
|
||||
page and pinning it during the copy but it pretends to migrate the
|
||||
memory in regular page sizes and with regular pte/pmd mappings).
|
||||
|
||||
== Optimizing the applications ==
|
||||
|
||||
To be guaranteed that the kernel will map a 2M page immediately in any
|
||||
memory region, the mmap region has to be hugepage naturally
|
||||
aligned. posix_memalign() can provide that guarantee.
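
For illustration, a minimal userspace sketch (the buffer size and the 2M
huge page size below are assumptions for the example, not requirements):

    #define _GNU_SOURCE
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>

    #define HPAGE_SIZE (2UL * 1024 * 1024)  /* assumed 2M hugepage size */

    int main(void)
    {
            void *buf;
            size_t len = 8 * HPAGE_SIZE;

            /* hugepage-aligned, so the kernel can back it with 2M pages */
            if (posix_memalign(&buf, HPAGE_SIZE, len))
                    return 1;

            /* optionally ask for THP explicitly on this range */
            madvise(buf, len, MADV_HUGEPAGE);

            memset(buf, 0, len);    /* touch the memory */
            free(buf);
            return 0;
    }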
|
||||
|
||||
== Hugetlbfs ==
|
||||
|
||||
You can use hugetlbfs on a kernel that has transparent hugepage
|
||||
support enabled just fine as always. No difference can be noted in
|
||||
hugetlbfs other than there will be less overall fragmentation. All
|
||||
usual features belonging to hugetlbfs are preserved and
|
||||
unaffected. libhugetlbfs will also work fine as usual.
|
||||
|
||||
== Graceful fallback ==
|
||||
|
||||
Code walking pagetables but unaware of huge pmds can simply call
|
||||
split_huge_page_pmd(vma, addr, pmd) where the pmd is the one returned by
|
||||
pmd_offset. It's trivial to make the code transparent hugepage aware
|
||||
by just grepping for "pmd_offset" and adding split_huge_page_pmd where
|
||||
missing after pmd_offset returns the pmd. Thanks to the graceful
|
||||
fallback design, with a one-liner change, you can avoid writing
|
||||
hundreds if not thousands of lines of complex code to make your code
|
||||
hugepage aware.
|
||||
|
||||
If you're not walking pagetables but you run into a physical hugepage
|
||||
but you can't handle it natively in your code, you can split it by
|
||||
calling split_huge_page(page). This is what the Linux VM does before
|
||||
it tries to swap out the hugepage, for example.
|
||||
|
||||
Example to make mremap.c transparent hugepage aware with a one-liner
|
||||
change:
|
||||
|
||||
diff --git a/mm/mremap.c b/mm/mremap.c
|
||||
--- a/mm/mremap.c
|
||||
+++ b/mm/mremap.c
|
||||
@@ -41,6 +41,7 @@ static pmd_t *get_old_pmd(struct mm_stru
|
||||
return NULL;
|
||||
|
||||
pmd = pmd_offset(pud, addr);
|
||||
+ split_huge_page_pmd(vma, addr, pmd);
|
||||
if (pmd_none_or_clear_bad(pmd))
|
||||
return NULL;
|
||||
|
||||
== Locking in hugepage aware code ==
|
||||
|
||||
We want as much code as possible hugepage aware, as calling
|
||||
split_huge_page() or split_huge_page_pmd() has a cost.
|
||||
|
||||
To make pagetable walks huge pmd aware, all you need to do is to call
|
||||
pmd_trans_huge() on the pmd returned by pmd_offset. You must hold the
|
||||
mmap_sem in read (or write) mode to be sure a huge pmd cannot be
|
||||
created from under you by khugepaged (khugepaged collapse_huge_page
|
||||
takes the mmap_sem in write mode in addition to the anon_vma lock). If
|
||||
pmd_trans_huge returns false, you just fall back to the old code
|
||||
paths. If instead pmd_trans_huge returns true, you have to take the
|
||||
mm->page_table_lock and re-run pmd_trans_huge. Taking the
|
||||
page_table_lock will prevent the huge pmd from being converted into a
|
||||
regular pmd from under you (split_huge_page can run in parallel to the
|
||||
pagetable walk). If the second pmd_trans_huge returns false, you
|
||||
should just drop the page_table_lock and fallback to the old code as
|
||||
before. Otherwise you should run pmd_trans_splitting on the pmd. In
|
||||
case pmd_trans_splitting returns true, it means split_huge_page is
|
||||
already in the middle of splitting the page. So if pmd_trans_splitting
|
||||
returns true it's enough to drop the page_table_lock and call
|
||||
wait_split_huge_page and then fall back to the old code paths. You are
|
||||
guaranteed that, by the time wait_split_huge_page returns, the pmd isn't
|
||||
huge anymore. If pmd_trans_splitting returns false, you can proceed to
|
||||
process the huge pmd and the hugepage natively. Once finished you can
|
||||
drop the page_table_lock.
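
Schematically, and only as a sketch of the rules above (this is a code
fragment, not a drop-in kernel function; it assumes mmap_sem is already
held, that mm, vma, pud and addr come from the surrounding walk, and uses
the helpers named in this document):

    pmd = pmd_offset(pud, addr);
    if (pmd_trans_huge(*pmd)) {
            spin_lock(&mm->page_table_lock);
            if (pmd_trans_huge(*pmd)) {
                    if (pmd_trans_splitting(*pmd)) {
                            /* split_huge_page is splitting it right now */
                            spin_unlock(&mm->page_table_lock);
                            wait_split_huge_page(vma->anon_vma, pmd);
                            /* now fall back to the regular pte walk */
                    } else {
                            /* process the huge pmd and hugepage natively */
                            spin_unlock(&mm->page_table_lock);
                            return;
                    }
            } else {
                    /* it was split from under us: fall back */
                    spin_unlock(&mm->page_table_lock);
            }
    }
    /* regular, non-huge code path continues here */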
|
||||
|
||||
== compound_lock, get_user_pages and put_page ==
|
||||
|
||||
split_huge_page internally has to distribute the refcounts in the head
|
||||
page to the tail pages before clearing all PG_head/tail bits from the
|
||||
page structures. It can do that easily for refcounts taken by huge pmd
|
||||
mappings. But the GUP API as created by hugetlbfs (that returns head
|
||||
and tail pages if running get_user_pages on an address backed by any
|
||||
hugepage), requires the refcount to be accounted on the tail pages and
|
||||
not only in the head pages, if we want to be able to run
|
||||
split_huge_page while there are gup pins established on any tail
|
||||
page. Failure to be able to run split_huge_page if there's any gup pin
|
||||
on any tail page would mean having to split all hugepages upfront in
|
||||
get_user_pages which is unacceptable as too many gup users are
|
||||
performance critical and they must work natively on hugepages like
|
||||
they work natively on hugetlbfs already (hugetlbfs is simpler because
|
||||
hugetlbfs pages cannot be split so there wouldn't be requirement of
|
||||
accounting the pins on the tail pages for hugetlbfs). If we didn't
|
||||
account the gup refcounts on the tail pages during gup, we wouldn't know
|
||||
anymore which tail page is pinned by gup and which is not while we run
|
||||
split_huge_page. But we still have to add the gup pin to the head page
|
||||
too, to know when we can free the compound page in case it's never
|
||||
split during its lifetime. That requires changing not just
|
||||
get_page, but put_page as well so that when put_page runs on a tail
|
||||
page (and only on a tail page) it will find its respective head page,
|
||||
and then it will decrease the head page refcount in addition to the
|
||||
tail page refcount. To obtain a head page reliably and to decrease its
|
||||
refcount without race conditions, put_page has to serialize against
|
||||
__split_huge_page_refcount using a special per-page lock called
|
||||
compound_lock.
|
686
Documentation/vm/unevictable-lru.txt
Normal file
|
@ -0,0 +1,686 @@
|
|||
==============================
|
||||
UNEVICTABLE LRU INFRASTRUCTURE
|
||||
==============================
|
||||
|
||||
========
|
||||
CONTENTS
|
||||
========
|
||||
|
||||
(*) The Unevictable LRU
|
||||
|
||||
- The unevictable page list.
|
||||
- Memory control group interaction.
|
||||
- Marking address spaces unevictable.
|
||||
- Detecting Unevictable Pages.
|
||||
- vmscan's handling of unevictable pages.
|
||||
|
||||
(*) mlock()'d pages.
|
||||
|
||||
- History.
|
||||
- Basic management.
|
||||
- mlock()/mlockall() system call handling.
|
||||
- Filtering special vmas.
|
||||
- munlock()/munlockall() system call handling.
|
||||
- Migrating mlocked pages.
|
||||
- mmap(MAP_LOCKED) system call handling.
|
||||
- munmap()/exit()/exec() system call handling.
|
||||
- try_to_unmap().
|
||||
- try_to_munlock() reverse map scan.
|
||||
- Page reclaim in shrink_*_list().
|
||||
|
||||
|
||||
============
|
||||
INTRODUCTION
|
||||
============
|
||||
|
||||
This document describes the Linux memory manager's "Unevictable LRU"
|
||||
infrastructure and the use of this to manage several types of "unevictable"
|
||||
pages.
|
||||
|
||||
The document attempts to provide the overall rationale behind this mechanism
|
||||
and the rationale for some of the design decisions that drove the
|
||||
implementation. The latter design rationale is discussed in the context of an
|
||||
implementation description. Admittedly, one can obtain the implementation
|
||||
details - the "what does it do?" - by reading the code. One hopes that the
|
||||
descriptions below add value by providing the answer to "why does it do that?".
|
||||
|
||||
|
||||
===================
|
||||
THE UNEVICTABLE LRU
|
||||
===================
|
||||
|
||||
The Unevictable LRU facility adds an additional LRU list to track unevictable
|
||||
pages and to hide these pages from vmscan. This mechanism is based on a patch
|
||||
by Larry Woodman of Red Hat to address several scalability problems with page
|
||||
reclaim in Linux. The problems have been observed at customer sites on large
|
||||
memory x86_64 systems.
|
||||
|
||||
To illustrate this with an example, a non-NUMA x86_64 platform with 128GB of
|
||||
main memory will have over 32 million 4k pages in a single zone. When a large
|
||||
fraction of these pages are not evictable for any reason [see below], vmscan
|
||||
will spend a lot of time scanning the LRU lists looking for the small fraction
|
||||
of pages that are evictable. This can result in a situation where all CPUs are
|
||||
spending 100% of their time in vmscan for hours or days on end, with the system
|
||||
completely unresponsive.
|
||||
|
||||
The unevictable list addresses the following classes of unevictable pages:
|
||||
|
||||
(*) Those owned by ramfs.
|
||||
|
||||
(*) Those mapped into SHM_LOCK'd shared memory regions.
|
||||
|
||||
(*) Those mapped into VM_LOCKED [mlock()ed] VMAs.
|
||||
|
||||
The infrastructure may also be able to handle other conditions that make pages
|
||||
unevictable, either by definition or by circumstance, in the future.
|
||||
|
||||
|
||||
THE UNEVICTABLE PAGE LIST
|
||||
-------------------------
|
||||
|
||||
The Unevictable LRU infrastructure consists of an additional, per-zone, LRU list
|
||||
called the "unevictable" list and an associated page flag, PG_unevictable, to
|
||||
indicate that the page is being managed on the unevictable list.
|
||||
|
||||
The PG_unevictable flag is analogous to, and mutually exclusive with, the
|
||||
PG_active flag in that it indicates on which LRU list a page resides when
|
||||
PG_lru is set.
|
||||
|
||||
The Unevictable LRU infrastructure maintains unevictable pages on an additional
|
||||
LRU list for a few reasons:
|
||||
|
||||
(1) We get to "treat unevictable pages just like we treat other pages in the
|
||||
system - which means we get to use the same code to manipulate them, the
|
||||
same code to isolate them (for migrate, etc.), the same code to keep track
|
||||
of the statistics, etc..." [Rik van Riel]
|
||||
|
||||
(2) We want to be able to migrate unevictable pages between nodes for memory
|
||||
defragmentation, workload management and memory hotplug. The Linux kernel
|
||||
can only migrate pages that it can successfully isolate from the LRU
|
||||
lists. If we were to maintain pages elsewhere than on an LRU-like list,
|
||||
where they can be found by isolate_lru_page(), we would prevent their
|
||||
migration, unless we reworked migration code to find the unevictable pages
|
||||
itself.
|
||||
|
||||
|
||||
The unevictable list does not differentiate between file-backed and anonymous,
|
||||
swap-backed pages. This differentiation is only important while the pages are,
|
||||
in fact, evictable.
|
||||
|
||||
The unevictable list benefits from the "arrayification" of the per-zone LRU
|
||||
lists and statistics originally proposed and posted by Christoph Lameter.
|
||||
|
||||
The unevictable list does not use the LRU pagevec mechanism. Rather,
|
||||
unevictable pages are placed directly on the page's zone's unevictable list
|
||||
under the zone lru_lock. This allows us to prevent the stranding of pages on
|
||||
the unevictable list when one task has the page isolated from the LRU and other
|
||||
tasks are changing the "evictability" state of the page.
|
||||
|
||||
|
||||
MEMORY CONTROL GROUP INTERACTION
|
||||
--------------------------------
|
||||
|
||||
The unevictable LRU facility interacts with the memory control group [aka
|
||||
memory controller; see Documentation/cgroups/memory.txt] by extending the
|
||||
lru_list enum.
|
||||
|
||||
The memory controller data structure automatically gets a per-zone unevictable
|
||||
list as a result of the "arrayification" of the per-zone LRU lists (one per
|
||||
lru_list enum element). The memory controller tracks the movement of pages to
|
||||
and from the unevictable list.
|
||||
|
||||
When a memory control group comes under memory pressure, the controller will
|
||||
not attempt to reclaim pages on the unevictable list. This has a couple of
|
||||
effects:
|
||||
|
||||
(1) Because the pages are "hidden" from reclaim on the unevictable list, the
|
||||
reclaim process can be more efficient, dealing only with pages that have a
|
||||
chance of being reclaimed.
|
||||
|
||||
(2) On the other hand, if too many of the pages charged to the control group
|
||||
are unevictable, the evictable portion of the working set of the tasks in
|
||||
the control group may not fit into the available memory. This can cause
|
||||
the control group to thrash or to OOM-kill tasks.
|
||||
|
||||
|
||||
MARKING ADDRESS SPACES UNEVICTABLE
|
||||
----------------------------------
|
||||
|
||||
For facilities such as ramfs none of the pages attached to the address space
|
||||
may be evicted. To prevent eviction of any such pages, the AS_UNEVICTABLE
|
||||
address space flag is provided, and this can be manipulated by a filesystem
|
||||
using a number of wrapper functions:
|
||||
|
||||
(*) void mapping_set_unevictable(struct address_space *mapping);
|
||||
|
||||
Mark the address space as being completely unevictable.
|
||||
|
||||
(*) void mapping_clear_unevictable(struct address_space *mapping);
|
||||
|
||||
Mark the address space as being evictable.
|
||||
|
||||
(*) int mapping_unevictable(struct address_space *mapping);
|
||||
|
||||
Query the address space, and return true if it is completely
|
||||
unevictable.
|
||||
|
||||
These are currently used in two places in the kernel:
|
||||
|
||||
(1) By ramfs to mark the address spaces of its inodes when they are created,
|
||||
and this mark remains for the life of the inode.
|
||||
|
||||
(2) By SYSV SHM to mark SHM_LOCK'd address spaces until SHM_UNLOCK is called.
|
||||
|
||||
Note that SHM_LOCK is not required to page in the locked pages if they're
|
||||
swapped out; the application must touch the pages manually if it wants to
|
||||
ensure they're in memory.
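
For illustration, a filesystem would typically mark the mapping when the
inode is created, loosely along these lines (a sketch modeled on what
ramfs does; the function name example_get_inode() is made up for this
example):

    #include <linux/fs.h>
    #include <linux/pagemap.h>

    struct inode *example_get_inode(struct super_block *sb, umode_t mode)
    {
            struct inode *inode = new_inode(sb);

            if (inode) {
                    inode->i_mode = mode;
                    mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
                    /* none of these pages may ever be evicted */
                    mapping_set_unevictable(inode->i_mapping);
            }
            return inode;
    }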
|
||||
|
||||
|
||||
DETECTING UNEVICTABLE PAGES
|
||||
---------------------------
|
||||
|
||||
The function page_evictable() in vmscan.c determines whether a page is
|
||||
evictable or not using the query function outlined above [see section "Marking
|
||||
address spaces unevictable"] to check the AS_UNEVICTABLE flag.
|
||||
|
||||
For address spaces that are so marked after being populated (as SHM regions
|
||||
might be), the lock action (eg: SHM_LOCK) can be lazy, and need not populate
|
||||
the page tables for the region as does, for example, mlock(), nor need it make
|
||||
any special effort to push any pages in the SHM_LOCK'd area to the unevictable
|
||||
list. Instead, vmscan will do this if and when it encounters the pages during
|
||||
a reclamation scan.
|
||||
|
||||
On an unlock action (such as SHM_UNLOCK), the unlocker (eg: shmctl()) must scan
|
||||
the pages in the region and "rescue" them from the unevictable list if no other
|
||||
condition is keeping them unevictable. If an unevictable region is destroyed,
|
||||
the pages are also "rescued" from the unevictable list in the process of
|
||||
freeing them.
|
||||
|
||||
page_evictable() also checks for mlocked pages by testing an additional page
|
||||
flag, PG_mlocked (as wrapped by PageMlocked()), which is set when a page is
|
||||
faulted into a VM_LOCKED vma, or found in a vma being VM_LOCKED.
|
||||
|
||||
|
||||
VMSCAN'S HANDLING OF UNEVICTABLE PAGES
|
||||
--------------------------------------
|
||||
|
||||
If unevictable pages are culled in the fault path, or moved to the unevictable
|
||||
list at mlock() or mmap() time, vmscan will not encounter the pages until they
|
||||
have become evictable again (via munlock() for example) and have been "rescued"
|
||||
from the unevictable list. However, there may be situations where we decide,
|
||||
for the sake of expediency, to leave an unevictable page on one of the regular
|
||||
active/inactive LRU lists for vmscan to deal with. vmscan checks for such
|
||||
pages in all of the shrink_{active|inactive|page}_list() functions and will
|
||||
"cull" such pages that it encounters: that is, it diverts those pages to the
|
||||
unevictable list for the zone being scanned.
|
||||
|
||||
There may be situations where a page is mapped into a VM_LOCKED VMA, but the
|
||||
page is not marked as PG_mlocked. Such pages will make it all the way to
|
||||
shrink_page_list() where they will be detected when vmscan walks the reverse
|
||||
map in try_to_unmap(). If try_to_unmap() returns SWAP_MLOCK,
|
||||
shrink_page_list() will cull the page at that point.
|
||||
|
||||
To "cull" an unevictable page, vmscan simply puts the page back on the LRU list
|
||||
using putback_lru_page() - the inverse operation to isolate_lru_page() - after
|
||||
dropping the page lock. Because the condition which makes the page unevictable
|
||||
may change once the page is unlocked, putback_lru_page() will recheck the
|
||||
unevictable state of a page that it places on the unevictable list. If the
|
||||
page has become unevictable, putback_lru_page() removes it from the list and
|
||||
retries, including the page_unevictable() test. Because such a race is a rare
|
||||
event and movement of pages onto the unevictable list should be rare, these
|
||||
extra evictability checks should not occur in the majority of calls to
|
||||
putback_lru_page().
|
||||
|
||||
|
||||
=============
|
||||
MLOCKED PAGES
|
||||
=============
|
||||
|
||||
The unevictable page list is also useful for mlock(), in addition to ramfs and
|
||||
SYSV SHM. Note that mlock() is only available in CONFIG_MMU=y situations; in
|
||||
NOMMU situations, all mappings are effectively mlocked.
|
||||
|
||||
|
||||
HISTORY
|
||||
-------
|
||||
|
||||
The "Unevictable mlocked Pages" infrastructure is based on work originally
|
||||
posted by Nick Piggin in an RFC patch entitled "mm: mlocked pages off LRU".
|
||||
Nick posted his patch as an alternative to a patch posted by Christoph Lameter
|
||||
to achieve the same objective: hiding mlocked pages from vmscan.
|
||||
|
||||
In Nick's patch, he used one of the struct page LRU list link fields as a count
|
||||
of VM_LOCKED VMAs that map the page. This use of the link field for a count
|
||||
prevented the management of the pages on an LRU list, and thus mlocked pages
|
||||
were not migratable as isolate_lru_page() could not find them, and the LRU list
|
||||
link field was not available to the migration subsystem.
|
||||
|
||||
Nick resolved this by putting mlocked pages back on the lru list before
|
||||
attempting to isolate them, thus abandoning the count of VM_LOCKED VMAs. When
|
||||
Nick's patch was integrated with the Unevictable LRU work, the count was
|
||||
replaced by walking the reverse map to determine whether any VM_LOCKED VMAs
|
||||
mapped the page. More on this below.
|
||||
|
||||
|
||||
BASIC MANAGEMENT
|
||||
----------------
|
||||
|
||||
mlocked pages - pages mapped into a VM_LOCKED VMA - are a class of unevictable
|
||||
pages. When such a page has been "noticed" by the memory management subsystem,
|
||||
the page is marked with the PG_mlocked flag. This can be manipulated using the
|
||||
PageMlocked() functions.
|
||||
|
||||
A PG_mlocked page will be placed on the unevictable list when it is added to
|
||||
the LRU. Such pages can be "noticed" by memory management in several places:
|
||||
|
||||
(1) in the mlock()/mlockall() system call handlers;
|
||||
|
||||
(2) in the mmap() system call handler when mmapping a region with the
|
||||
MAP_LOCKED flag;
|
||||
|
||||
(3) mmapping a region in a task that has called mlockall() with the MCL_FUTURE
|
||||
flag
|
||||
|
||||
(4) in the fault path, if mlocked pages are "culled" in the fault path,
|
||||
and when a VM_LOCKED stack segment is expanded; or
|
||||
|
||||
(5) as mentioned above, in vmscan:shrink_page_list() when attempting to
|
||||
reclaim a page in a VM_LOCKED VMA via try_to_unmap()
|
||||
|
||||
all of which result in the VM_LOCKED flag being set for the VMA if it doesn't
|
||||
already have it set.
|
||||
|
||||
mlocked pages become unlocked and rescued from the unevictable list when:
|
||||
|
||||
(1) mapped in a range unlocked via the munlock()/munlockall() system calls;
|
||||
|
||||
(2) munmap()'d out of the last VM_LOCKED VMA that maps the page, including
|
||||
unmapping at task exit;
|
||||
|
||||
(3) when the page is truncated from the last VM_LOCKED VMA of an mmapped file;
|
||||
or
|
||||
|
||||
(4) before a page is COW'd in a VM_LOCKED VMA.
|
||||
|
||||
|
||||
mlock()/mlockall() SYSTEM CALL HANDLING
|
||||
---------------------------------------
|
||||
|
||||
Both [do_]mlock() and [do_]mlockall() system call handlers call mlock_fixup()
|
||||
for each VMA in the range specified by the call. In the case of mlockall(),
|
||||
this is the entire active address space of the task. Note that mlock_fixup()
|
||||
is used for both mlocking and munlocking a range of memory. A call to mlock()
|
||||
an already VM_LOCKED VMA, or to munlock() a VMA that is not VM_LOCKED is
|
||||
treated as a no-op, and mlock_fixup() simply returns.
|
||||
|
||||
If the VMA passes some filtering as described in "Filtering Special Vmas"
|
||||
below, mlock_fixup() will attempt to merge the VMA with its neighbors or split
|
||||
off a subset of the VMA if the range does not cover the entire VMA. Once the
|
||||
VMA has been merged or split or neither, mlock_fixup() will call
|
||||
__mlock_vma_pages_range() to fault in the pages via get_user_pages() and to
|
||||
mark the pages as mlocked via mlock_vma_page().
|
||||
|
||||
Note that the VMA being mlocked might be mapped with PROT_NONE. In this case,
|
||||
get_user_pages() will be unable to fault in the pages. That's okay. If pages
|
||||
do end up getting faulted into this VM_LOCKED VMA, we'll handle them in the
|
||||
fault path or in vmscan.
|
||||
|
||||
Also note that a page returned by get_user_pages() could be truncated or
|
||||
migrated out from under us, while we're trying to mlock it. To detect this,
|
||||
__mlock_vma_pages_range() checks page_mapping() after acquiring the page lock.
|
||||
If the page is still associated with its mapping, we'll go ahead and call
|
||||
mlock_vma_page(). If the mapping is gone, we just unlock the page and move on.
|
||||
In the worst case, this will result in a page mapped in a VM_LOCKED VMA
|
||||
remaining on a normal LRU list without being PageMlocked(). Again, vmscan will
|
||||
detect and cull such pages.
|
||||
|
||||
mlock_vma_page() will call TestSetPageMlocked() for each page returned by
|
||||
get_user_pages(). We use TestSetPageMlocked() because the page might already
|
||||
be mlocked by another task/VMA and we don't want to do extra work. We
|
||||
especially do not want to count an mlocked page more than once in the
|
||||
statistics. If the page was already mlocked, mlock_vma_page() need do nothing
|
||||
more.
|
||||
|
||||
If the page was NOT already mlocked, mlock_vma_page() attempts to isolate the
|
||||
page from the LRU, as it is likely on the appropriate active or inactive list
|
||||
at that time. If the isolate_lru_page() succeeds, mlock_vma_page() will put
|
||||
back the page - by calling putback_lru_page() - which will notice that the page
|
||||
is now mlocked and divert the page to the zone's unevictable list. If
|
||||
mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle
|
||||
it later if and when it attempts to reclaim the page.
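
Schematically, the flow just described looks roughly like this (a sketch,
not the exact kernel function, assuming the helpers named in this
document):

    #include <linux/mm.h>
    #include <linux/swap.h>

    static void mlock_vma_page_sketch(struct page *page)
    {
            BUG_ON(!PageLocked(page));

            if (!TestSetPageMlocked(page)) {
                    /* first time this page is seen mlocked: account it */
                    inc_zone_page_state(page, NR_MLOCK);
                    if (!isolate_lru_page(page))
                            /* putback sees PG_mlocked and picks the
                               unevictable list */
                            putback_lru_page(page);
                    /* if isolation fails, vmscan culls the page later */
            }
    }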
|
||||
|
||||
|
||||
FILTERING SPECIAL VMAS
|
||||
----------------------
|
||||
|
||||
mlock_fixup() filters several classes of "special" VMAs:
|
||||
|
||||
1) VMAs with VM_IO or VM_PFNMAP set are skipped entirely. The pages behind
|
||||
these mappings are inherently pinned, so we don't need to mark them as
|
||||
mlocked. In any case, most of these pages have no struct page on which
to mark the mlocked state. Because of this, get_user_pages() will fail for these VMAs,
|
||||
so there is no sense in attempting to visit them.
|
||||
|
||||
2) VMAs mapping hugetlbfs page are already effectively pinned into memory. We
|
||||
neither need nor want to mlock() these pages. However, to preserve the
|
||||
prior behavior of mlock() - before the unevictable/mlock changes -
|
||||
mlock_fixup() will call make_pages_present() in the hugetlbfs VMA range to
|
||||
allocate the huge pages and populate the ptes.
|
||||
|
||||
3) VMAs with VM_DONTEXPAND are generally userspace mappings of kernel pages,
|
||||
such as the VDSO page, relay channel pages, etc. These pages
|
||||
are inherently unevictable and are not managed on the LRU lists.
|
||||
mlock_fixup() treats these VMAs the same as hugetlbfs VMAs. It calls
|
||||
make_pages_present() to populate the ptes.
|
||||
|
||||
Note that for all of these special VMAs, mlock_fixup() does not set the
|
||||
VM_LOCKED flag. Therefore, we won't have to deal with them later during
|
||||
munlock(), munmap() or task exit. Neither does mlock_fixup() account these
|
||||
VMAs against the task's "locked_vm".
|
||||
|
||||
|
||||
munlock()/munlockall() SYSTEM CALL HANDLING
|
||||
-------------------------------------------
|
||||
|
||||
The munlock() and munlockall() system calls are handled by the same functions -
|
||||
do_mlock[all]() - as the mlock() and mlockall() system calls with the unlock vs
|
||||
lock operation indicated by an argument. So, these system calls are also
|
||||
handled by mlock_fixup(). Again, if called for an already munlocked VMA,
|
||||
mlock_fixup() simply returns. Because of the VMA filtering discussed above,
|
||||
VM_LOCKED will not be set in any "special" VMAs. So, these VMAs will be
|
||||
ignored for munlock.
|
||||
|
||||
If the VMA is VM_LOCKED, mlock_fixup() again attempts to merge or split off the
|
||||
specified range. The range is then munlocked via the function
|
||||
__mlock_vma_pages_range() - the same function used to mlock a VMA range -
|
||||
passing a flag to indicate that munlock() is being performed.
|
||||
|
||||
Because the VMA access protections could have been changed to PROT_NONE after
|
||||
faulting in and mlocking pages, get_user_pages() was unreliable for visiting
|
||||
these pages for munlocking. Because we don't want to leave pages mlocked,
|
||||
get_user_pages() was enhanced to accept a flag to ignore the permissions when
|
||||
fetching the pages - all of which should be resident as a result of previous
|
||||
mlocking.
|
||||
|
||||
For munlock(), __mlock_vma_pages_range() unlocks individual pages by calling
|
||||
munlock_vma_page(). munlock_vma_page() unconditionally clears the PG_mlocked
|
||||
flag using TestClearPageMlocked(). As with mlock_vma_page(),
|
||||
munlock_vma_page() uses the Test*PageMlocked() function to handle the case where
|
||||
the page might have already been unlocked by another task. If the page was
|
||||
mlocked, munlock_vma_page() updates the zone statistics for the number of
|
||||
mlocked pages. Note, however, that at this point we haven't checked whether
|
||||
the page is mapped by other VM_LOCKED VMAs.
|
||||
|
||||
We can't call try_to_munlock(), the function that walks the reverse map to
|
||||
check for other VM_LOCKED VMAs, without first isolating the page from the LRU.
|
||||
try_to_munlock() is a variant of try_to_unmap() and thus requires that the page
|
||||
not be on an LRU list [more on these below]. However, the call to
|
||||
isolate_lru_page() could fail, in which case we couldn't try_to_munlock(). So,
|
||||
we go ahead and clear PG_mlocked up front, as this might be the only chance we
|
||||
have. If we can successfully isolate the page, we go ahead and
|
||||
try_to_munlock(), which will restore the PG_mlocked flag and update the zone
|
||||
page statistics if it finds another VMA holding the page mlocked. If we fail
|
||||
to isolate the page, we'll have left a potentially mlocked page on the LRU.
|
||||
This is fine, because we'll catch it later if and when vmscan tries to reclaim
|
||||
the page. This should be relatively rare.
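
The corresponding munlock flow, again only as a sketch of the description
above (not the exact kernel function, using the same headers and helpers
as the mlock sketch earlier):

    static void munlock_vma_page_sketch(struct page *page)
    {
            BUG_ON(!PageLocked(page));

            if (TestClearPageMlocked(page)) {
                    dec_zone_page_state(page, NR_MLOCK);
                    if (!isolate_lru_page(page)) {
                            /* re-mlocks the page (and fixes the stats) if
                               another VM_LOCKED VMA still maps it */
                            try_to_munlock(page);
                            putback_lru_page(page);
                    }
                    /* if isolation failed, a potentially mlocked page is
                       left on the LRU for vmscan to re-cull later */
            }
    }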
|
||||
|
||||
|
||||
MIGRATING MLOCKED PAGES
|
||||
-----------------------
|
||||
|
||||
A page that is being migrated has been isolated from the LRU lists and is held
|
||||
locked across unmapping of the page, updating the page's address space entry
|
||||
and copying the contents and state, until the page table entry has been
|
||||
replaced with an entry that refers to the new page. Linux supports migration
|
||||
of mlocked pages and other unevictable pages. This involves simply moving the
|
||||
PG_mlocked and PG_unevictable states from the old page to the new page.
|
||||
|
||||
Note that page migration can race with mlocking or munlocking of the same page.
|
||||
This has been discussed from the mlock/munlock perspective in the respective
|
||||
sections above. Both processes (migration and m[un]locking) hold the page
|
||||
locked. This provides the first level of synchronization. Page migration
|
||||
zeros out the page_mapping of the old page before unlocking it, so m[un]lock
|
||||
can skip these pages by testing the page mapping under page lock.
|
||||
|
||||
To complete page migration, we place the new and old pages back onto the LRU
|
||||
after dropping the page lock. The "unneeded" page - old page on success, new
|
||||
page on failure - will be freed when the reference count held by the migration
|
||||
process is released. To ensure that we don't strand pages on the unevictable
|
||||
list because of a race between munlock and migration, page migration uses the
|
||||
putback_lru_page() function to add migrated pages back to the LRU.
|
||||
|
||||
|
||||
mmap(MAP_LOCKED) SYSTEM CALL HANDLING
|
||||
-------------------------------------
|
||||
|
||||
In addition to the mlock()/mlockall() system calls, an application can request
|
||||
that a region of memory be mlocked by supplying the MAP_LOCKED flag to the mmap()
|
||||
call. Furthermore, any mmap() call or brk() call that expands the heap by a
|
||||
task that has previously called mlockall() with the MCL_FUTURE flag will result
|
||||
in the newly mapped memory being mlocked. Before the unevictable/mlock
|
||||
changes, the kernel simply called make_pages_present() to allocate pages and
|
||||
populate the page table.
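
From userspace, the two paths described here look like this (illustrative
only; the mapping size is arbitrary):

    #define _GNU_SOURCE
    #include <sys/mman.h>

    int main(void)
    {
            /* a locked (and therefore populated) anonymous mapping... */
            void *p = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, -1, 0);

            /* ...or ask that all future mappings be mlocked automatically */
            mlockall(MCL_FUTURE);

            return p == MAP_FAILED;
    }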
|
||||
|
||||
To mlock a range of memory under the unevictable/mlock infrastructure, the
|
||||
mmap() handler and task address space expansion functions call
|
||||
mlock_vma_pages_range() specifying the vma and the address range to mlock.
|
||||
mlock_vma_pages_range() filters VMAs like mlock_fixup(), as described above in
|
||||
"Filtering Special VMAs". It will clear the VM_LOCKED flag, which will have
|
||||
already been set by the caller, in filtered VMAs. Thus these VMAs need not be
|
||||
visited for munlock when the region is unmapped.
|
||||
|
||||
For "normal" VMAs, mlock_vma_pages_range() calls __mlock_vma_pages_range() to
|
||||
fault/allocate the pages and mlock them. Again, like mlock_fixup(),
|
||||
mlock_vma_pages_range() downgrades the mmap semaphore to read mode before
|
||||
attempting to fault/allocate and mlock the pages and "upgrades" the semaphore
|
||||
back to write mode before returning.
|
||||
|
||||
The callers of mlock_vma_pages_range() will have already added the memory range
|
||||
to be mlocked to the task's "locked_vm". To account for filtered VMAs,
|
||||
mlock_vma_pages_range() returns the number of pages NOT mlocked. All of the
|
||||
callers then subtract a non-negative return value from the task's locked_vm. A
|
||||
negative return value represents an error - for example, from get_user_pages()
|
||||
attempting to fault in a VMA with PROT_NONE access. In this case, we leave the
|
||||
memory range accounted as locked_vm, as the protections could be changed later
|
||||
and pages allocated into that region.
|
||||
|
||||
|
||||
munmap()/exit()/exec() SYSTEM CALL HANDLING
|
||||
-------------------------------------------
|
||||
|
||||
When unmapping an mlocked region of memory, whether by an explicit call to
|
||||
munmap() or via an internal unmap from exit() or exec() processing, we must
|
||||
munlock the pages if we're removing the last VM_LOCKED VMA that maps the pages.
|
||||
Before the unevictable/mlock changes, mlocking did not mark the pages in any
|
||||
way, so unmapping them required no processing.
|
||||
|
||||
To munlock a range of memory under the unevictable/mlock infrastructure, the
|
||||
munmap() handler and task address space call tear down function
|
||||
munlock_vma_pages_all(). The name reflects the observation that one always
|
||||
specifies the entire VMA range when munlock()ing during unmap of a region.
|
||||
Because of the VMA filtering when mlocking() regions, only "normal" VMAs that
|
||||
actually contain mlocked pages will be passed to munlock_vma_pages_all().
|
||||
|
||||
munlock_vma_pages_all() clears the VM_LOCKED VMA flag and, like mlock_fixup()
|
||||
for the munlock case, calls __munlock_vma_pages_range() to walk the page table
|
||||
for the VMA's memory range and munlock_vma_page() each resident page mapped by
|
||||
the VMA. This effectively munlocks the page, but only if this is the last
|
||||
VM_LOCKED VMA that maps the page.
|
||||
|
||||
|
||||
try_to_unmap()
|
||||
--------------
|
||||
|
||||
Pages can, of course, be mapped into multiple VMAs. Some of these VMAs may
|
||||
have VM_LOCKED flag set. It is possible for a page mapped into one or more
|
||||
VM_LOCKED VMAs not to have the PG_mlocked flag set and therefore reside on one
|
||||
of the active or inactive LRU lists. This could happen if, for example, a task
|
||||
in the process of munlocking the page could not isolate the page from the LRU.
|
||||
As a result, vmscan/shrink_page_list() might encounter such a page as described
|
||||
in section "vmscan's handling of unevictable pages". To handle this situation,
|
||||
try_to_unmap() checks for VM_LOCKED VMAs while it is walking a page's reverse
|
||||
map.
|
||||
|
||||
try_to_unmap() is always called, either by vmscan for reclaim or for page
|
||||
migration, with the argument page locked and isolated from the LRU. Separate
|
||||
functions handle anonymous and mapped file pages, as these types of pages have
|
||||
different reverse map mechanisms.
|
||||
|
||||
(*) try_to_unmap_anon()
|
||||
|
||||
To unmap anonymous pages, each VMA in the list anchored in the anon_vma
|
||||
must be visited - at least until a VM_LOCKED VMA is encountered. If the
|
||||
page is being unmapped for migration, VM_LOCKED VMAs do not stop the
|
||||
process because mlocked pages are migratable. However, for reclaim, if
|
||||
the page is mapped into a VM_LOCKED VMA, the scan stops.
|
||||
|
||||
try_to_unmap_anon() attempts to acquire in read mode the mmap semaphore of
|
||||
the mm_struct to which the VMA belongs. If this is successful, it will
|
||||
mlock the page via mlock_vma_page() - we wouldn't have gotten to
|
||||
try_to_unmap_anon() if the page were already mlocked - and will return
|
||||
SWAP_MLOCK, indicating that the page is unevictable.
|
||||
|
||||
If the mmap semaphore cannot be acquired, we are not sure whether the page
|
||||
is really unevictable or not. In this case, try_to_unmap_anon() will
|
||||
return SWAP_AGAIN.
|
||||
|
||||
(*) try_to_unmap_file() - linear mappings
|
||||
|
||||
Unmapping of a mapped file page works the same as for anonymous mappings,
|
||||
except that the scan visits all VMAs that map the page's index/page offset
|
||||
in the page's mapping's reverse map priority search tree. It also visits
|
||||
each VMA in the page's mapping's non-linear list, if the list is
|
||||
non-empty.
|
||||
|
||||
As for anonymous pages, on encountering a VM_LOCKED VMA for a mapped file
|
||||
page, try_to_unmap_file() will attempt to acquire the associated
|
||||
mm_struct's mmap semaphore to mlock the page, returning SWAP_MLOCK if this
|
||||
is successful, and SWAP_AGAIN, if not.
|
||||
|
||||
(*) try_to_unmap_file() - non-linear mappings
|
||||
|
||||
If a page's mapping contains a non-empty non-linear mapping VMA list, then
|
||||
try_to_un{map|lock}() must also visit each VMA in that list to determine
|
||||
whether the page is mapped in a VM_LOCKED VMA. Again, the scan must visit
|
||||
all VMAs in the non-linear list to ensure that the page is not/should not
|
||||
be mlocked.
|
||||
|
||||
If a VM_LOCKED VMA is found in the list, the scan could terminate.
|
||||
However, there is no easy way to determine whether the page is actually
|
||||
mapped in a given VMA - either for unmapping or testing whether the
|
||||
VM_LOCKED VMA actually pins the page.
|
||||
|
||||
try_to_unmap_file() handles non-linear mappings by scanning a certain
|
||||
number of pages - a "cluster" - in each non-linear VMA associated with the
|
||||
page's mapping, for each file mapped page that vmscan tries to unmap. If
|
||||
this happens to unmap the page we're trying to unmap, try_to_unmap() will
|
||||
notice this on return (page_mapcount(page) will be 0) and return
|
||||
SWAP_SUCCESS. Otherwise, it will return SWAP_AGAIN, causing vmscan to
|
||||
recirculate this page. We take advantage of the cluster scan in
|
||||
try_to_unmap_cluster() as follows:
|
||||
|
||||
For each non-linear VMA, try_to_unmap_cluster() attempts to acquire the
|
||||
mmap semaphore of the associated mm_struct for read without blocking.
|
||||
|
||||
If this attempt is successful and the VMA is VM_LOCKED,
|
||||
try_to_unmap_cluster() will retain the mmap semaphore for the scan;
|
||||
otherwise it drops it here.
|
||||
|
||||
Then, for each page in the cluster, if we're holding the mmap semaphore
|
||||
for a locked VMA, try_to_unmap_cluster() calls mlock_vma_page() to
|
||||
mlock the page. This call is a no-op if the page is already locked,
|
||||
but will mlock any pages in the non-linear mapping that happen to be
|
||||
unlocked.
|
||||
|
||||
If one of the pages so mlocked is the page passed in to try_to_unmap(),
|
||||
try_to_unmap_cluster() will return SWAP_MLOCK, rather than the default
|
||||
SWAP_AGAIN. This will allow vmscan to cull the page, rather than
|
||||
recirculating it on the inactive list.
|
||||
|
||||
Again, if try_to_unmap_cluster() cannot acquire the VMA's mmap sem, it
|
||||
returns SWAP_AGAIN, indicating that the page is mapped by a VM_LOCKED
|
||||
VMA, but couldn't be mlocked.
|
||||
|
||||
|
||||
try_to_munlock() REVERSE MAP SCAN
|
||||
---------------------------------
|
||||
|
||||
[!] TODO/FIXME: a better name might be page_mlocked() - analogous to the
|
||||
page_referenced() reverse map walker.
|
||||
|
||||
When munlock_vma_page() [see section "munlock()/munlockall() System Call
|
||||
Handling" above] tries to munlock a page, it needs to determine whether or not
|
||||
the page is mapped by any VM_LOCKED VMA without actually attempting to unmap
|
||||
all PTEs from the page. For this purpose, the unevictable/mlock infrastructure
|
||||
introduced a variant of try_to_unmap() called try_to_munlock().
|
||||
|
||||
try_to_munlock() calls the same functions as try_to_unmap() for anonymous and
|
||||
mapped file pages with an additional argument specifying unlock versus unmap
|
||||
processing. Again, these functions walk the respective reverse maps looking
|
||||
for VM_LOCKED VMAs. When such a VMA is found for anonymous pages and file
|
||||
pages mapped in linear VMAs, as in the try_to_unmap() case, the functions
|
||||
attempt to acquire the associated mmap semaphore, mlock the page via
|
||||
mlock_vma_page() and return SWAP_MLOCK. This effectively undoes the
|
||||
pre-clearing of the page's PG_mlocked done by munlock_vma_page.
|
||||
|
||||
If try_to_unmap() is unable to acquire a VM_LOCKED VMA's associated mmap
|
||||
semaphore, it will return SWAP_AGAIN. This will allow shrink_page_list() to
|
||||
recycle the page on the inactive list and hope that it has better luck with the
|
||||
page next time.
|
||||
|
||||
For file pages mapped into non-linear VMAs, the try_to_munlock() logic works
|
||||
slightly differently. On encountering a VM_LOCKED non-linear VMA that might
|
||||
map the page, try_to_munlock() returns SWAP_AGAIN without actually mlocking the
|
||||
page. munlock_vma_page() will just leave the page unlocked and let vmscan deal
|
||||
with it - the usual fallback position.
|
||||
|
||||
Note that try_to_munlock()'s reverse map walk must visit every VMA in a page's
|
||||
reverse map to determine that a page is NOT mapped into any VM_LOCKED VMA.
|
||||
However, the scan can terminate when it encounters a VM_LOCKED VMA and can
|
||||
successfully acquire the VMA's mmap semaphore for read and mlock the page.
|
||||
Although try_to_munlock() might be called a great many times when munlocking a
|
||||
large region or tearing down a large address space that has been mlocked via
|
||||
mlockall(), overall this is a fairly rare event.
|
||||
|
||||
|
||||
PAGE RECLAIM IN shrink_*_list()
|
||||
-------------------------------
|
||||
|
||||
shrink_active_list() culls any obviously unevictable pages - i.e.
|
||||
!page_evictable(page) - diverting these to the unevictable list.
|
||||
However, shrink_active_list() only sees unevictable pages that made it onto the
|
||||
active/inactive lru lists. Note that these pages do not have PageUnevictable
|
||||
set - otherwise they would be on the unevictable list and shrink_active_list
|
||||
would never see them.
|
||||
|
||||
Some examples of these unevictable pages on the LRU lists are:
|
||||
|
||||
(1) ramfs pages that have been placed on the LRU lists when first allocated.
|
||||
|
||||
(2) SHM_LOCK'd shared memory pages. shmctl(SHM_LOCK) does not attempt to
|
||||
allocate or fault in the pages in the shared memory region. This happens
|
||||
when an application accesses the page the first time after SHM_LOCK'ing
|
||||
the segment.
|
||||
|
||||
(3) mlocked pages that could not be isolated from the LRU and moved to the
|
||||
unevictable list in mlock_vma_page().
|
||||
|
||||
(4) Pages mapped into multiple VM_LOCKED VMAs, but try_to_munlock() couldn't
|
||||
acquire the VMA's mmap semaphore to test the flags and set PageMlocked.
|
||||
munlock_vma_page() was forced to let the page back on to the normal LRU
|
||||
list for vmscan to handle.
|
||||
|
||||
shrink_inactive_list() also diverts any unevictable pages that it finds on the
|
||||
inactive lists to the appropriate zone's unevictable list.
|
||||
|
||||
shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd
|
||||
after shrink_active_list() had moved them to the inactive list, or pages mapped
|
||||
into VM_LOCKED VMAs that munlock_vma_page() couldn't isolate from the LRU to
|
||||
recheck via try_to_munlock(). shrink_inactive_list() won't notice the latter,
|
||||
but will pass them on to shrink_page_list().
|
||||
|
||||
shrink_page_list() again culls obviously unevictable pages that it could
|
||||
encounter for similar reasons to shrink_inactive_list(). Pages mapped into
|
||||
VM_LOCKED VMAs but without PG_mlocked set will make it all the way to
|
||||
try_to_unmap(). shrink_page_list() will divert them to the unevictable list
|
||||
when try_to_unmap() returns SWAP_MLOCK, as discussed above.
|
68
Documentation/vm/zswap.txt
Normal file
|
@ -0,0 +1,68 @@
|
|||
Overview:
|
||||
|
||||
Zswap is a lightweight compressed cache for swap pages. It takes pages that are
|
||||
in the process of being swapped out and attempts to compress them into a
|
||||
dynamically allocated RAM-based memory pool. zswap basically trades CPU cycles
|
||||
for potentially reduced swap I/O. This trade-off can also result in a
|
||||
significant performance improvement if reads from the compressed cache are
|
||||
faster than reads from a swap device.
|
||||
|
||||
NOTE: Zswap is a new feature as of v3.11 and interacts heavily with memory
|
||||
reclaim. This interaction has not been fully explored on the large set of
|
||||
potential configurations and workloads that exist. For this reason, zswap
|
||||
is a work in progress and should be considered experimental.
|
||||
|
||||
Some potential benefits:
|
||||
* Desktop/laptop users with limited RAM capacities can mitigate the
|
||||
performance impact of swapping.
|
||||
* Overcommitted guests that share a common I/O resource can
|
||||
dramatically reduce their swap I/O pressure, avoiding heavy handed I/O
|
||||
throttling by the hypervisor. This allows more work to get done with less
|
||||
impact to the guest workload and guests sharing the I/O subsystem
|
||||
* Users with SSDs as swap devices can extend the life of the device by
|
||||
drastically reducing life-shortening writes.
|
||||
|
||||
Zswap evicts pages from compressed cache on an LRU basis to the backing swap
|
||||
device when the compressed pool reaches its size limit. This requirement had
|
||||
been identified in prior community discussions.
|
||||
|
||||
To enable zswap, the "enabled" attribute must be set to 1 at boot time. e.g.
|
||||
zswap.enabled=1
|
||||
|
||||
Design:
|
||||
|
||||
Zswap receives pages for compression through the Frontswap API and is able to
|
||||
evict pages from its own compressed pool on an LRU basis and write them back to
|
||||
the backing swap device in the case that the compressed pool is full.
|
||||
|
||||
Zswap makes use of zbud for managing the compressed memory pool. Each
|
||||
allocation in zbud is not directly accessible by address. Rather, a handle is
|
||||
returned by the allocation routine and that handle must be mapped before being
|
||||
accessed. The compressed memory pool grows on demand and shrinks as compressed
|
||||
pages are freed. The pool is not preallocated.
|
||||
|
||||
When a swap page is passed from frontswap to zswap, zswap maintains a mapping
|
||||
of the swap entry, a combination of the swap type and swap offset, to the zbud
|
||||
handle that references that compressed swap page. This mapping is achieved
|
||||
with a red-black tree per swap type. The swap offset is the search key for the
|
||||
tree nodes.
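
Conceptually, each tree node looks something like the following (field
names simplified for illustration; this is not the exact zswap structure):

    #include <linux/rbtree.h>
    #include <linux/types.h>

    struct zswap_entry_sketch {
            struct rb_node rbnode;  /* link in the per-swap-type rbtree */
            pgoff_t offset;         /* swap offset: the search key */
            unsigned long handle;   /* zbud handle of the compressed data */
            unsigned int length;    /* compressed length in bytes */
            int refcount;
    };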
|
||||
|
||||
During a page fault on a PTE that is a swap entry, frontswap calls the zswap
|
||||
load function to decompress the page into the page allocated by the page fault
|
||||
handler.
|
||||
|
||||
Once there are no PTEs referencing a swap page stored in zswap (i.e. the count
|
||||
in the swap_map goes to 0) the swap code calls the zswap invalidate function,
|
||||
via frontswap, to free the compressed entry.
|
||||
|
||||
Zswap seeks to be simple in its policies. Sysfs attributes allow for one user
|
||||
controlled policy:
|
||||
* max_pool_percent - The maximum percentage of memory that the compressed
|
||||
pool can occupy.
|
||||
|
||||
Zswap allows the compressor to be selected at kernel boot time by setting the
|
||||
"compressor" attribute. The default compressor is lzo. e.g.
|
||||
zswap.compressor=deflate
|
||||
|
||||
A debugfs interface is provided for various statistics about pool size, number
|
||||
of pages stored, and various counters for the reasons pages are rejected.
|