Fixed MTP to work with TWRP

This commit is contained in:
awab228 2018-06-19 23:16:04 +02:00
commit f6dfaef42e
50820 changed files with 20846062 additions and 0 deletions

695
mm/Kconfig Normal file
View file

@ -0,0 +1,695 @@
config SELECT_MEMORY_MODEL
def_bool y
depends on ARCH_SELECT_MEMORY_MODEL
choice
prompt "Memory model"
depends on SELECT_MEMORY_MODEL
default DISCONTIGMEM_MANUAL if ARCH_DISCONTIGMEM_DEFAULT
default SPARSEMEM_MANUAL if ARCH_SPARSEMEM_DEFAULT
default FLATMEM_MANUAL
config FLATMEM_MANUAL
bool "Flat Memory"
depends on !(ARCH_DISCONTIGMEM_ENABLE || ARCH_SPARSEMEM_ENABLE) || ARCH_FLATMEM_ENABLE
help
This option allows you to change some of the ways that
Linux manages its memory internally. Most users will
only have one option here: FLATMEM. This is normal
and a correct option.
Some users of more advanced features like NUMA and
memory hotplug may have different options here.
DISCONTIGMEM is a more mature, better tested system,
but is incompatible with memory hotplug and may suffer
decreased performance over SPARSEMEM. If unsure between
"Sparse Memory" and "Discontiguous Memory", choose
"Discontiguous Memory".
If unsure, choose this option (Flat Memory) over any other.
config DISCONTIGMEM_MANUAL
bool "Discontiguous Memory"
depends on ARCH_DISCONTIGMEM_ENABLE
help
This option provides enhanced support for discontiguous
memory systems, over FLATMEM. These systems have holes
in their physical address spaces, and this option provides
more efficient handling of these holes. However, the vast
majority of hardware has quite flat address spaces, and
can have degraded performance from the extra overhead that
this option imposes.
Many NUMA configurations will have this as the only option.
If unsure, choose "Flat Memory" over this option.
config SPARSEMEM_MANUAL
bool "Sparse Memory"
depends on ARCH_SPARSEMEM_ENABLE
help
This will be the only option for some systems, including
memory hotplug systems. This is normal.
For many other systems, this will be an alternative to
"Discontiguous Memory". This option provides some potential
performance benefits, along with decreased code complexity,
but it is newer, and more experimental.
If unsure, choose "Discontiguous Memory" or "Flat Memory"
over this option.
endchoice
config DISCONTIGMEM
def_bool y
depends on (!SELECT_MEMORY_MODEL && ARCH_DISCONTIGMEM_ENABLE) || DISCONTIGMEM_MANUAL
config SPARSEMEM
def_bool y
depends on (!SELECT_MEMORY_MODEL && ARCH_SPARSEMEM_ENABLE) || SPARSEMEM_MANUAL
config FLATMEM
def_bool y
depends on (!DISCONTIGMEM && !SPARSEMEM) || FLATMEM_MANUAL
config FLAT_NODE_MEM_MAP
def_bool y
depends on !SPARSEMEM
#
# Both the NUMA code and DISCONTIGMEM use arrays of pg_data_t's
# to represent different areas of memory. This variable allows
# those dependencies to exist individually.
#
config NEED_MULTIPLE_NODES
def_bool y
depends on DISCONTIGMEM || NUMA
config HAVE_MEMORY_PRESENT
def_bool y
depends on ARCH_HAVE_MEMORY_PRESENT || SPARSEMEM
#
# SPARSEMEM_EXTREME (which is the default) does some bootmem
# allocations when memory_present() is called. If this cannot
# be done on your architecture, select this option. However,
# statically allocating the mem_section[] array can potentially
# consume vast quantities of .bss, so be careful.
#
# This option will also potentially produce smaller runtime code
# with gcc 3.4 and later.
#
config SPARSEMEM_STATIC
bool
#
# Architecture platforms which require a two level mem_section in SPARSEMEM
# must select this option. This is usually for architecture platforms with
# an extremely sparse physical address space.
#
config SPARSEMEM_EXTREME
def_bool y
depends on SPARSEMEM && !SPARSEMEM_STATIC
config SPARSEMEM_VMEMMAP_ENABLE
bool
config SPARSEMEM_ALLOC_MEM_MAP_TOGETHER
def_bool y
depends on SPARSEMEM && X86_64
config SPARSEMEM_VMEMMAP
bool "Sparse Memory virtual memmap"
depends on SPARSEMEM && SPARSEMEM_VMEMMAP_ENABLE
default y
help
SPARSEMEM_VMEMMAP uses a virtually mapped memmap to optimise
pfn_to_page and page_to_pfn operations. This is the most
efficient option when sufficient kernel resources are available.
config HAVE_MEMBLOCK
boolean
config HAVE_MEMBLOCK_NODE_MAP
boolean
config HAVE_MEMBLOCK_PHYS_MAP
boolean
config HAVE_GENERIC_RCU_GUP
boolean
config ARCH_DISCARD_MEMBLOCK
boolean
config NO_BOOTMEM
boolean
config MEMORY_ISOLATION
boolean
config MOVABLE_NODE
boolean "Enable to assign a node which has only movable memory"
depends on HAVE_MEMBLOCK
depends on NO_BOOTMEM
depends on X86_64
depends on NUMA
default n
help
Allow a node to have only movable memory. Pages used by the kernel,
such as direct mapping pages cannot be migrated. So the corresponding
memory device cannot be hotplugged. This option allows the following
two things:
- When the system is booting, node full of hotpluggable memory can
be arranged to have only movable memory so that the whole node can
be hot-removed. (need movable_node boot option specified).
- After the system is up, the option allows users to online all the
memory of a node as movable memory so that the whole node can be
hot-removed.
Users who don't use the memory hotplug feature are fine with this
option on since they don't specify movable_node boot option or they
don't online memory as movable.
Say Y here if you want to hotplug a whole node.
Say N here if you want kernel to use memory on all nodes evenly.
#
# Only be set on architectures that have completely implemented memory hotplug
# feature. If you are not sure, don't touch it.
#
config HAVE_BOOTMEM_INFO_NODE
def_bool n
# eventually, we can have this option just 'select SPARSEMEM'
config MEMORY_HOTPLUG
bool "Allow for memory hot-add"
depends on SPARSEMEM || X86_64_ACPI_NUMA
depends on ARCH_ENABLE_MEMORY_HOTPLUG
depends on (IA64 || X86 || PPC_BOOK3S_64 || SUPERH || S390)
config MEMORY_HOTPLUG_SPARSE
def_bool y
depends on SPARSEMEM && MEMORY_HOTPLUG
config MEMORY_HOTREMOVE
bool "Allow for memory hot remove"
select MEMORY_ISOLATION
select HAVE_BOOTMEM_INFO_NODE if (X86_64 || PPC64)
depends on MEMORY_HOTPLUG && ARCH_ENABLE_MEMORY_HOTREMOVE
depends on MIGRATION
#
# If we have space for more page flags then we can enable additional
# optimizations and functionality.
#
# Regular Sparsemem takes page flag bits for the sectionid if it does not
# use a virtual memmap. Disable extended page flags for 32 bit platforms
# that require the use of a sectionid in the page flags.
#
config PAGEFLAGS_EXTENDED
def_bool y
depends on 64BIT || SPARSEMEM_VMEMMAP || !SPARSEMEM
# Heavily threaded applications may benefit from splitting the mm-wide
# page_table_lock, so that faults on different parts of the user address
# space can be handled with less contention: split it at this NR_CPUS.
# Default to 4 for wider testing, though 8 might be more appropriate.
# ARM's adjust_pte (unused if VIPT) depends on mm-wide page_table_lock.
# PA-RISC 7xxx's spinlock_t would enlarge struct page from 32 to 44 bytes.
# DEBUG_SPINLOCK and DEBUG_LOCK_ALLOC spinlock_t also enlarge struct page.
#
config SPLIT_PTLOCK_CPUS
int
default "999999" if !MMU
default "999999" if ARM && !CPU_CACHE_VIPT
default "999999" if PARISC && !PA20
default "4"
config ARCH_ENABLE_SPLIT_PMD_PTLOCK
boolean
#
# support for memory balloon
config MEMORY_BALLOON
boolean
#
# support for memory balloon compaction
config BALLOON_COMPACTION
bool "Allow for balloon memory compaction/migration"
def_bool y
depends on COMPACTION && MEMORY_BALLOON
help
Memory fragmentation introduced by ballooning might reduce
significantly the number of 2MB contiguous memory blocks that can be
used within a guest, thus imposing performance penalties associated
with the reduced number of transparent huge pages that could be used
by the guest workload. Allowing the compaction & migration for memory
pages enlisted as being part of memory balloon devices avoids the
scenario aforementioned and helps improving memory defragmentation.
#
# support for memory compaction
config COMPACTION
bool "Allow for memory compaction"
def_bool y
select MIGRATION
depends on MMU
help
Allows the compaction of memory for the allocation of huge pages.
#
# support for page migration
#
config MIGRATION
bool "Page migration"
def_bool y
depends on (NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE || COMPACTION || CMA) && MMU
help
Allows the migration of the physical location of pages of processes
while the virtual addresses are not changed. This is useful in
two situations. The first is on NUMA systems to put pages nearer
to the processors accessing. The second is when allocating huge
pages as migration can relocate pages to satisfy a huge page
allocation instead of reclaiming.
config ARCH_ENABLE_HUGEPAGE_MIGRATION
boolean
config PHYS_ADDR_T_64BIT
def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
config ZONE_DMA_FLAG
int
default "0" if !ZONE_DMA
default "1"
config BOUNCE
bool "Enable bounce buffers"
default y
depends on BLOCK && MMU && (ZONE_DMA || HIGHMEM)
help
Enable bounce buffers for devices that cannot access
the full range of memory available to the CPU. Enabled
by default when ZONE_DMA or HIGHMEM is selected, but you
may say n to override this.
# On the 'tile' arch, USB OHCI needs the bounce pool since tilegx will often
# have more than 4GB of memory, but we don't currently use the IOTLB to present
# a 32-bit address to OHCI. So we need to use a bounce pool instead.
#
# We also use the bounce pool to provide stable page writes for jbd. jbd
# initiates buffer writeback without locking the page or setting PG_writeback,
# and fixing that behavior (a second time; jbd2 doesn't have this problem) is
# a major rework effort. Instead, use the bounce buffer to snapshot pages
# (until jbd goes away). The only jbd user is ext3.
config NEED_BOUNCE_POOL
bool
default y if (TILE && USB_OHCI_HCD) || (BLK_DEV_INTEGRITY && JBD)
config NR_QUICK
int
depends on QUICKLIST
default "2" if AVR32
default "1"
config VIRT_TO_BUS
bool
help
An architecture should select this if it implements the
deprecated interface virt_to_bus(). All new architectures
should probably not select this.
config MMU_NOTIFIER
bool
config KSM
bool "Enable KSM for page merging"
depends on MMU
help
Enable Kernel Samepage Merging: KSM periodically scans those areas
of an application's address space that an app has advised may be
mergeable. When it finds pages of identical content, it replaces
the many instances by a single page with that content, so
saving memory until one or another app needs to modify the content.
Recommended for use with KVM, or with other duplicative applications.
See Documentation/vm/ksm.txt for more information: KSM is inactive
until a program has madvised that an area is MADV_MERGEABLE, and
root has set /sys/kernel/mm/ksm/run to 1 (if CONFIG_SYSFS is set).
config DEFAULT_MMAP_MIN_ADDR
int "Low address space to protect from user allocation"
depends on MMU
default 4096
help
This is the portion of low virtual memory which should be protected
from userspace allocation. Keeping a user from writing to low pages
can help reduce the impact of kernel NULL pointer bugs.
For most ia64, ppc64 and x86 users with lots of address space
a value of 65536 is reasonable and should cause no problems.
On arm and other archs it should not be higher than 32768.
Programs which use vm86 functionality or have some need to map
this low address space will need CAP_SYS_RAWIO or disable this
protection by setting the value to 0.
This value can be changed after boot using the
/proc/sys/vm/mmap_min_addr tunable.
config ARCH_SUPPORTS_MEMORY_FAILURE
bool
config MEMORY_FAILURE
depends on MMU
depends on ARCH_SUPPORTS_MEMORY_FAILURE
bool "Enable recovery from hardware memory errors"
select MEMORY_ISOLATION
help
Enables code to recover from some memory failures on systems
with MCA recovery. This allows a system to continue running
even when some of its memory has uncorrected errors. This requires
special hardware support and typically ECC memory.
config HWPOISON_INJECT
tristate "HWPoison pages injector"
depends on MEMORY_FAILURE && DEBUG_KERNEL && PROC_FS
select PROC_PAGE_MONITOR
config NOMMU_INITIAL_TRIM_EXCESS
int "Turn on mmap() excess space trimming before booting"
depends on !MMU
default 1
help
The NOMMU mmap() frequently needs to allocate large contiguous chunks
of memory on which to store mappings, but it can only ask the system
allocator for chunks in 2^N*PAGE_SIZE amounts - which is frequently
more than it requires. To deal with this, mmap() is able to trim off
the excess and return it to the allocator.
If trimming is enabled, the excess is trimmed off and returned to the
system allocator, which can cause extra fragmentation, particularly
if there are a lot of transient processes.
If trimming is disabled, the excess is kept, but not used, which for
long-term mappings means that the space is wasted.
Trimming can be dynamically controlled through a sysctl option
(/proc/sys/vm/nr_trim_pages) which specifies the minimum number of
excess pages there must be before trimming should occur, or zero if
no trimming is to occur.
This option specifies the initial value of this option. The default
of 1 says that all excess pages should be trimmed.
See Documentation/nommu-mmap.txt for more information.
config TRANSPARENT_HUGEPAGE
bool "Transparent Hugepage Support"
depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE
select COMPACTION
help
Transparent Hugepages allows the kernel to use huge pages and
huge tlb transparently to the applications whenever possible.
This feature can improve computing performance to certain
applications by speeding up page faults during memory
allocation, by reducing the number of tlb misses and by speeding
up the pagetable walking.
If memory constrained on embedded, you may want to say N.
choice
prompt "Transparent Hugepage Support sysfs defaults"
depends on TRANSPARENT_HUGEPAGE
default TRANSPARENT_HUGEPAGE_ALWAYS
help
Selects the sysfs defaults for Transparent Hugepage Support.
config TRANSPARENT_HUGEPAGE_ALWAYS
bool "always"
help
Enabling Transparent Hugepage always, can increase the
memory footprint of applications without a guaranteed
benefit but it will work automatically for all applications.
config TRANSPARENT_HUGEPAGE_MADVISE
bool "madvise"
help
Enabling Transparent Hugepage madvise, will only provide a
performance improvement benefit to the applications using
madvise(MADV_HUGEPAGE) but it won't risk to increase the
memory footprint of applications without a guaranteed
benefit.
endchoice
#
# UP and nommu archs use km based percpu allocator
#
config NEED_PER_CPU_KM
depends on !SMP
bool
default y
config CLEANCACHE
bool "Enable cleancache driver to cache clean pages if tmem is present"
default n
help
Cleancache can be thought of as a page-granularity victim cache
for clean pages that the kernel's pageframe replacement algorithm
(PFRA) would like to keep around, but can't since there isn't enough
memory. So when the PFRA "evicts" a page, it first attempts to use
cleancache code to put the data contained in that page into
"transcendent memory", memory that is not directly accessible or
addressable by the kernel and is of unknown and possibly
time-varying size. And when a cleancache-enabled
filesystem wishes to access a page in a file on disk, it first
checks cleancache to see if it already contains it; if it does,
the page is copied into the kernel and a disk access is avoided.
When a transcendent memory driver is available (such as zcache or
Xen transcendent memory), a significant I/O reduction
may be achieved. When none is available, all cleancache calls
are reduced to a single pointer-compare-against-NULL resulting
in a negligible performance hit.
If unsure, say Y to enable cleancache
config FRONTSWAP
bool "Enable frontswap to cache swap pages if tmem is present"
depends on SWAP
default n
help
Frontswap is so named because it can be thought of as the opposite
of a "backing" store for a swap device. The data is stored into
"transcendent memory", memory that is not directly accessible or
addressable by the kernel and is of unknown and possibly
time-varying size. When space in transcendent memory is available,
a significant swap I/O reduction may be achieved. When none is
available, all frontswap calls are reduced to a single pointer-
compare-against-NULL resulting in a negligible performance hit
and swap data is stored as normal on the matching swap device.
If unsure, say Y to enable frontswap.
config CMA
bool "Contiguous Memory Allocator"
depends on HAVE_MEMBLOCK && MMU
select MIGRATION
select MEMORY_ISOLATION
help
This enables the Contiguous Memory Allocator which allows other
subsystems to allocate big physically-contiguous blocks of memory.
CMA reserves a region of memory and allows only movable pages to
be allocated from it. This way, the kernel can use the memory for
pagecache and when a subsystem requests for contiguous area, the
allocated pages are migrated away to serve the contiguous request.
If unsure, say "n".
config CMA_DEBUG
bool "CMA debug messages (DEVELOPMENT)"
depends on DEBUG_KERNEL && CMA
help
Turns on debug messages in CMA. This produces KERN_DEBUG
messages for every CMA call as well as various messages while
processing calls such as dma_alloc_from_contiguous().
This option does not affect warning and error messages.
config CMA_PINPAGE_MIGRATION
bool "CMA pinned page migration (EXPERIMENTAL)"
depends on CMA
default n
help
Turns on cma page migration for pinned page.
config CMA_AREAS
int "Maximum count of the CMA areas"
depends on CMA
default 7
help
CMA allows to create CMA areas for particular purpose, mainly,
used as device private area. This parameter sets the maximum
number of CMA area in the system.
If unsure, leave the default value "7".
config MEM_SOFT_DIRTY
bool "Track memory changes"
depends on CHECKPOINT_RESTORE && HAVE_ARCH_SOFT_DIRTY && PROC_FS
select PROC_PAGE_MONITOR
help
This option enables memory changes tracking by introducing a
soft-dirty bit on pte-s. This bit it set when someone writes
into a page just as regular dirty bit, but unlike the latter
it can be cleared by hands.
See Documentation/vm/soft-dirty.txt for more details.
config ZSWAP
bool "Compressed cache for swap pages (EXPERIMENTAL)"
depends on FRONTSWAP && CRYPTO=y
select CRYPTO_LZO
select CRYPTO_LZ4
select ZPOOL
default n
help
A lightweight compressed cache for swap pages. It takes
pages that are in the process of being swapped out and attempts to
compress them into a dynamically allocated RAM-based memory pool.
This can result in a significant I/O reduction on swap device and,
in the case where decompressing from RAM is faster that swap device
reads, can also improve workload performance.
This is marked experimental because it is a new feature (as of
v3.11) that interacts heavily with memory reclaim. While these
interactions don't cause any known issues on simple memory setups,
they have not be fully explored on the large set of potential
configurations and workloads that exist.
config ZSWAP_COMPACTION
bool "Enable runtime zswap compaction"
depends on ZSWAP && ZPOOL && ZSMALLOC
default n
config ZSWAP_ENABLE_WRITEBACK
bool "Enable zswap writeback"
depends on ZSWAP
default n
config ZPOOL
tristate "Common API for compressed memory storage"
default n
help
Compressed memory storage API. This allows using either zbud or
zsmalloc.
config ZBUD
tristate "Low density storage for compressed pages"
default n
help
A special purpose allocator for storing compressed pages.
It is designed to store up to two compressed pages per physical
page. While this design limits storage density, it has simple and
deterministic reclaim properties that make it preferable to a higher
density approach when reclaim will be used.
config ZSMALLOC
tristate "Memory allocator for compressed pages"
depends on MMU
default n
help
zsmalloc is a slab-based memory allocator designed to store
compressed RAM pages. zsmalloc uses virtual memory mapping
in order to reduce fragmentation. However, this results in a
non-standard allocator interface where a handle, not a pointer, is
returned by an alloc(). This handle must be mapped in order to
access the allocated space.
config PGTABLE_MAPPING
bool "Use page table mapping to access object in zsmalloc"
depends on ZSMALLOC
help
By default, zsmalloc uses a copy-based object mapping method to
access allocations that span two pages. However, if a particular
architecture (ex, ARM) performs VM mapping faster than copying,
then you should select this. This causes zsmalloc to use page table
mapping rather than copying for object mapping.
You can check speed with zsmalloc benchmark:
https://github.com/spartacus06/zsmapbench
config ZSMALLOC_STAT
bool "Export zsmalloc statistics"
depends on ZSMALLOC
select DEBUG_FS
help
This option enables code in the zsmalloc to collect various
statistics about whats happening in zsmalloc and exports that
information to userspace via debugfs.
If unsure, say N.
config ZSMALLOC_OBJ_SEQ
bool "Record the sequence number of each object in itself"
depends on ZSMALLOC
default n
help
This option records the sequence number of each zs object in the object.
The recorded number will be used for reclaim_zspage.
If unsure, say N.
config DIRECT_RECLAIM_FILE_PAGES_ONLY
bool "Reclaim file pages only on direct reclaim path"
depends on ZSWAP
default n
config INCREASE_MAXIMUM_SWAPPINESS
bool "Allow swappiness to be set up to 200"
depends on ZSWAP
default n
config FIX_INACTIVE_RATIO
bool "Fix active:inactive anon ratio to 1:1"
depends on ZSWAP
default n
config TIGHT_PGDAT_BALANCE
bool "Set more tight balanced condition to kswapd"
depends on ZSWAP
default n
config SWAP_ENABLE_READAHEAD
bool "Enable readahead on page swap in"
depends on SWAP
default y
help
When a page fault occurs, adjacent pages of SWAP_CLUSTER_MAX are
also paged in expecting those pages will be used in near future.
This behaviour is good at disk-based system, but not on in-memory
compression (e.g. zram).
config GENERIC_EARLY_IOREMAP
bool
config MAX_STACK_SIZE_MB
int "Maximum user stack size for 32-bit processes (MB)"
default 80
range 8 256 if METAG
range 8 2048
depends on STACK_GROWSUP && (!64BIT || COMPAT)
help
This is the maximum stack size in Megabytes in the VM layout of 32-bit
user processes when the stack grows upwards (currently only on parisc
and metag arch). The stack will be located at the highest memory
address minus the given value, unless the RLIMIT_STACK hard limit is
changed to a smaller value in which case that is used.
A sane initial value is 80 MB.
config MMAP_READAROUND_LIMIT
int "Limit mmap readaround upperbound"
default 0
help
Inappropriate mmap readaround size can hurt device performance
during the sluggish situation. Add the hard upper-limit for
mmap readaround.

29
mm/Kconfig.debug Normal file
View file

@ -0,0 +1,29 @@
config DEBUG_PAGEALLOC
bool "Debug page memory allocations"
depends on DEBUG_KERNEL
depends on !HIBERNATION || ARCH_SUPPORTS_DEBUG_PAGEALLOC && !PPC && !SPARC
depends on !KMEMCHECK
select PAGE_POISONING if !ARCH_SUPPORTS_DEBUG_PAGEALLOC
select PAGE_GUARD if ARCH_SUPPORTS_DEBUG_PAGEALLOC
---help---
Unmap pages from the kernel linear mapping after free_pages().
This results in a large slowdown, but helps to find certain types
of memory corruption.
For architectures which don't enable ARCH_SUPPORTS_DEBUG_PAGEALLOC,
fill the pages with poison patterns after free_pages() and verify
the patterns before alloc_pages(). Additionally,
this option cannot be enabled in combination with hibernation as
that would result in incorrect warnings of memory corruption after
a resume because free pages are not saved to the suspend image.
config WANT_PAGE_DEBUG_FLAGS
bool
config PAGE_POISONING
bool
select WANT_PAGE_DEBUG_FLAGS
config PAGE_GUARD
bool
select WANT_PAGE_DEBUG_FLAGS

71
mm/Makefile Normal file
View file

@ -0,0 +1,71 @@
#
# Makefile for the linux memory manager.
#
mmu-y := nommu.o
mmu-$(CONFIG_MMU) := fremap.o gup.o highmem.o memory.o mincore.o \
mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
vmalloc.o pagewalk.o pgtable-generic.o
ifdef CONFIG_CROSS_MEMORY_ATTACH
mmu-$(CONFIG_MMU) += process_vm_access.o
endif
obj-y := filemap.o mempool.o oom_kill.o \
maccess.o page_alloc.o page-writeback.o \
readahead.o swap.o truncate.o vmscan.o shmem.o \
util.o mmzone.o vmstat.o backing-dev.o \
mm_init.o mmu_context.o percpu.o slab_common.o \
compaction.o vmacache.o \
interval_tree.o list_lru.o workingset.o \
iov_iter.o debug.o $(mmu-y)
obj-y += init-mm.o
ifdef CONFIG_NO_BOOTMEM
obj-y += nobootmem.o
else
obj-y += bootmem.o
endif
obj-$(CONFIG_ADVISE_SYSCALLS) += fadvise.o
ifdef CONFIG_MMU
obj-$(CONFIG_ADVISE_SYSCALLS) += madvise.o
endif
obj-$(CONFIG_HAVE_MEMBLOCK) += memblock.o
obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o
obj-$(CONFIG_FRONTSWAP) += frontswap.o
obj-$(CONFIG_ZSWAP) += zswap.o
obj-$(CONFIG_HAS_DMA) += dmapool.o
obj-$(CONFIG_HUGETLBFS) += hugetlb.o
obj-$(CONFIG_NUMA) += mempolicy.o
obj-$(CONFIG_SPARSEMEM) += sparse.o
obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
obj-$(CONFIG_SLOB) += slob.o
obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
obj-$(CONFIG_KSM) += ksm.o
obj-$(CONFIG_PAGE_POISONING) += debug-pagealloc.o
obj-$(CONFIG_SLAB) += slab.o
obj-$(CONFIG_SLUB) += slub.o
obj-$(CONFIG_KMEMCHECK) += kmemcheck.o
obj-$(CONFIG_FAILSLAB) += failslab.o
obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
obj-$(CONFIG_FS_XIP) += filemap_xip.o
obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
obj-$(CONFIG_MEMCG) += memcontrol.o page_cgroup.o vmpressure.o
obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
obj-$(CONFIG_CLEANCACHE) += cleancache.o
obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o
obj-$(CONFIG_ZPOOL) += zpool.o
obj-$(CONFIG_ZBUD) += zbud.o
obj-$(CONFIG_ZSMALLOC) += zsmalloc.o
obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o
obj-$(CONFIG_CMA) += cma.o
obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o

668
mm/backing-dev.c Normal file
View file

@ -0,0 +1,668 @@
#include <linux/wait.h>
#include <linux/backing-dev.h>
#include <linux/kthread.h>
#include <linux/freezer.h>
#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/module.h>
#include <linux/writeback.h>
#include <linux/device.h>
#include <trace/events/writeback.h>
static atomic_long_t bdi_seq = ATOMIC_LONG_INIT(0);
struct backing_dev_info default_backing_dev_info = {
.name = "default",
.ra_pages = VM_MAX_READAHEAD * 1024 / PAGE_CACHE_SIZE,
.state = 0,
.capabilities = BDI_CAP_MAP_COPY,
};
EXPORT_SYMBOL_GPL(default_backing_dev_info);
struct backing_dev_info noop_backing_dev_info = {
.name = "noop",
.capabilities = BDI_CAP_NO_ACCT_AND_WRITEBACK,
};
EXPORT_SYMBOL_GPL(noop_backing_dev_info);
static struct class *bdi_class;
/*
* bdi_lock protects updates to bdi_list. bdi_list has RCU reader side
* locking.
*/
DEFINE_SPINLOCK(bdi_lock);
LIST_HEAD(bdi_list);
/* bdi_wq serves all asynchronous writeback tasks */
struct workqueue_struct *bdi_wq;
static void bdi_lock_two(struct bdi_writeback *wb1, struct bdi_writeback *wb2)
{
if (wb1 < wb2) {
spin_lock(&wb1->list_lock);
spin_lock_nested(&wb2->list_lock, 1);
} else {
spin_lock(&wb2->list_lock);
spin_lock_nested(&wb1->list_lock, 1);
}
}
#ifdef CONFIG_DEBUG_FS
#include <linux/debugfs.h>
#include <linux/seq_file.h>
static struct dentry *bdi_debug_root;
static void bdi_debug_init(void)
{
bdi_debug_root = debugfs_create_dir("bdi", NULL);
}
static int bdi_debug_stats_show(struct seq_file *m, void *v)
{
struct backing_dev_info *bdi = m->private;
struct bdi_writeback *wb = &bdi->wb;
unsigned long background_thresh;
unsigned long dirty_thresh;
unsigned long bdi_thresh;
unsigned long nr_dirty, nr_io, nr_more_io;
struct inode *inode;
nr_dirty = nr_io = nr_more_io = 0;
spin_lock(&wb->list_lock);
list_for_each_entry(inode, &wb->b_dirty, i_wb_list)
nr_dirty++;
list_for_each_entry(inode, &wb->b_io, i_wb_list)
nr_io++;
list_for_each_entry(inode, &wb->b_more_io, i_wb_list)
nr_more_io++;
spin_unlock(&wb->list_lock);
global_dirty_limits(&background_thresh, &dirty_thresh);
bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
#define K(x) ((x) << (PAGE_SHIFT - 10))
seq_printf(m,
"BdiWriteback: %10lu kB\n"
"BdiReclaimable: %10lu kB\n"
"BdiDirtyThresh: %10lu kB\n"
"DirtyThresh: %10lu kB\n"
"BackgroundThresh: %10lu kB\n"
"BdiDirtied: %10lu kB\n"
"BdiWritten: %10lu kB\n"
"BdiWriteBandwidth: %10lu kBps\n"
"b_dirty: %10lu\n"
"b_io: %10lu\n"
"b_more_io: %10lu\n"
"bdi_list: %10u\n"
"state: %10lx\n",
(unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
(unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)),
K(bdi_thresh),
K(dirty_thresh),
K(background_thresh),
(unsigned long) K(bdi_stat(bdi, BDI_DIRTIED)),
(unsigned long) K(bdi_stat(bdi, BDI_WRITTEN)),
(unsigned long) K(bdi->write_bandwidth),
nr_dirty,
nr_io,
nr_more_io,
!list_empty(&bdi->bdi_list), bdi->state);
#undef K
return 0;
}
static int bdi_debug_stats_open(struct inode *inode, struct file *file)
{
return single_open(file, bdi_debug_stats_show, inode->i_private);
}
static const struct file_operations bdi_debug_stats_fops = {
.open = bdi_debug_stats_open,
.read = seq_read,
.llseek = seq_lseek,
.release = single_release,
};
static void bdi_debug_register(struct backing_dev_info *bdi, const char *name)
{
bdi->debug_dir = debugfs_create_dir(name, bdi_debug_root);
bdi->debug_stats = debugfs_create_file("stats", 0444, bdi->debug_dir,
bdi, &bdi_debug_stats_fops);
}
static void bdi_debug_unregister(struct backing_dev_info *bdi)
{
debugfs_remove(bdi->debug_stats);
debugfs_remove(bdi->debug_dir);
}
#else
static inline void bdi_debug_init(void)
{
}
static inline void bdi_debug_register(struct backing_dev_info *bdi,
const char *name)
{
}
static inline void bdi_debug_unregister(struct backing_dev_info *bdi)
{
}
#endif
static ssize_t read_ahead_kb_store(struct device *dev,
struct device_attribute *attr,
const char *buf, size_t count)
{
struct backing_dev_info *bdi = dev_get_drvdata(dev);
unsigned long read_ahead_kb;
ssize_t ret;
ret = kstrtoul(buf, 10, &read_ahead_kb);
if (ret < 0)
return ret;
bdi->ra_pages = read_ahead_kb >> (PAGE_SHIFT - 10);
return count;
}
#define K(pages) ((pages) << (PAGE_SHIFT - 10))
#define BDI_SHOW(name, expr) \
static ssize_t name##_show(struct device *dev, \
struct device_attribute *attr, char *page) \
{ \
struct backing_dev_info *bdi = dev_get_drvdata(dev); \
\
return snprintf(page, PAGE_SIZE-1, "%lld\n", (long long)expr); \
} \
static DEVICE_ATTR_RW(name);
BDI_SHOW(read_ahead_kb, K(bdi->ra_pages))
static ssize_t min_ratio_store(struct device *dev,
struct device_attribute *attr, const char *buf, size_t count)
{
struct backing_dev_info *bdi = dev_get_drvdata(dev);
unsigned int ratio;
ssize_t ret;
ret = kstrtouint(buf, 10, &ratio);
if (ret < 0)
return ret;
ret = bdi_set_min_ratio(bdi, ratio);
if (!ret)
ret = count;
return ret;
}
BDI_SHOW(min_ratio, bdi->min_ratio)
static ssize_t max_ratio_store(struct device *dev,
struct device_attribute *attr, const char *buf, size_t count)
{
struct backing_dev_info *bdi = dev_get_drvdata(dev);
unsigned int ratio;
ssize_t ret;
ret = kstrtouint(buf, 10, &ratio);
if (ret < 0)
return ret;
ret = bdi_set_max_ratio(bdi, ratio);
if (!ret)
ret = count;
return ret;
}
BDI_SHOW(max_ratio, bdi->max_ratio)
static ssize_t stable_pages_required_show(struct device *dev,
struct device_attribute *attr,
char *page)
{
struct backing_dev_info *bdi = dev_get_drvdata(dev);
return snprintf(page, PAGE_SIZE-1, "%d\n",
bdi_cap_stable_pages_required(bdi) ? 1 : 0);
}
static DEVICE_ATTR_RO(stable_pages_required);
static struct attribute *bdi_dev_attrs[] = {
&dev_attr_read_ahead_kb.attr,
&dev_attr_min_ratio.attr,
&dev_attr_max_ratio.attr,
&dev_attr_stable_pages_required.attr,
NULL,
};
ATTRIBUTE_GROUPS(bdi_dev);
static __init int bdi_class_init(void)
{
bdi_class = class_create(THIS_MODULE, "bdi");
if (IS_ERR(bdi_class))
return PTR_ERR(bdi_class);
bdi_class->dev_groups = bdi_dev_groups;
bdi_debug_init();
return 0;
}
postcore_initcall(bdi_class_init);
static int __init default_bdi_init(void)
{
int err;
bdi_wq = alloc_workqueue("writeback", WQ_MEM_RECLAIM | WQ_FREEZABLE |
WQ_UNBOUND | WQ_SYSFS, 0);
if (!bdi_wq)
return -ENOMEM;
err = bdi_init(&default_backing_dev_info);
if (!err)
bdi_register(&default_backing_dev_info, NULL, "default");
err = bdi_init(&noop_backing_dev_info);
return err;
}
subsys_initcall(default_bdi_init);
int bdi_has_dirty_io(struct backing_dev_info *bdi)
{
return wb_has_dirty_io(&bdi->wb);
}
/*
* This function is used when the first inode for this bdi is marked dirty. It
* wakes-up the corresponding bdi thread which should then take care of the
* periodic background write-out of dirty inodes. Since the write-out would
* starts only 'dirty_writeback_interval' centisecs from now anyway, we just
* set up a timer which wakes the bdi thread up later.
*
* Note, we wouldn't bother setting up the timer, but this function is on the
* fast-path (used by '__mark_inode_dirty()'), so we save few context switches
* by delaying the wake-up.
*
* We have to be careful not to postpone flush work if it is scheduled for
* earlier. Thus we use queue_delayed_work().
*/
void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi)
{
unsigned long timeout;
timeout = msecs_to_jiffies(dirty_writeback_interval * 10);
spin_lock_bh(&bdi->wb_lock);
if (test_bit(BDI_registered, &bdi->state))
queue_delayed_work(bdi_wq, &bdi->wb.dwork, timeout);
spin_unlock_bh(&bdi->wb_lock);
}
/*
* Remove bdi from bdi_list, and ensure that it is no longer visible
*/
static void bdi_remove_from_list(struct backing_dev_info *bdi)
{
spin_lock_bh(&bdi_lock);
list_del_rcu(&bdi->bdi_list);
spin_unlock_bh(&bdi_lock);
synchronize_rcu_expedited();
}
int bdi_register(struct backing_dev_info *bdi, struct device *parent,
const char *fmt, ...)
{
va_list args;
struct device *dev;
if (bdi->dev) /* The driver needs to use separate queues per device */
return 0;
va_start(args, fmt);
dev = device_create_vargs(bdi_class, parent, MKDEV(0, 0), bdi, fmt, args);
va_end(args);
if (IS_ERR(dev))
return PTR_ERR(dev);
bdi->dev = dev;
bdi_debug_register(bdi, dev_name(dev));
set_bit(BDI_registered, &bdi->state);
spin_lock_bh(&bdi_lock);
list_add_tail_rcu(&bdi->bdi_list, &bdi_list);
spin_unlock_bh(&bdi_lock);
trace_writeback_bdi_register(bdi);
return 0;
}
EXPORT_SYMBOL(bdi_register);
int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev)
{
return bdi_register(bdi, NULL, "%u:%u", MAJOR(dev), MINOR(dev));
}
EXPORT_SYMBOL(bdi_register_dev);
/*
* Remove bdi from the global list and shutdown any threads we have running
*/
static void bdi_wb_shutdown(struct backing_dev_info *bdi)
{
if (!bdi_cap_writeback_dirty(bdi))
return;
/*
* Make sure nobody finds us on the bdi_list anymore
*/
bdi_remove_from_list(bdi);
/* Make sure nobody queues further work */
spin_lock_bh(&bdi->wb_lock);
clear_bit(BDI_registered, &bdi->state);
spin_unlock_bh(&bdi->wb_lock);
/*
* Drain work list and shutdown the delayed_work. At this point,
* @bdi->bdi_list is empty telling bdi_Writeback_workfn() that @bdi
* is dying and its work_list needs to be drained no matter what.
*/
mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);
flush_delayed_work(&bdi->wb.dwork);
WARN_ON(!list_empty(&bdi->work_list));
WARN_ON(delayed_work_pending(&bdi->wb.dwork));
}
/*
* This bdi is going away now, make sure that no super_blocks point to it
*/
static void bdi_prune_sb(struct backing_dev_info *bdi)
{
struct super_block *sb;
spin_lock(&sb_lock);
list_for_each_entry(sb, &super_blocks, s_list) {
if (sb->s_bdi == bdi)
sb->s_bdi = &default_backing_dev_info;
}
spin_unlock(&sb_lock);
}
void bdi_unregister(struct backing_dev_info *bdi)
{
if (bdi->dev) {
bdi_set_min_ratio(bdi, 0);
trace_writeback_bdi_unregister(bdi);
bdi_prune_sb(bdi);
bdi_wb_shutdown(bdi);
bdi_debug_unregister(bdi);
device_unregister(bdi->dev);
bdi->dev = NULL;
}
}
EXPORT_SYMBOL(bdi_unregister);
static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
{
memset(wb, 0, sizeof(*wb));
wb->bdi = bdi;
wb->last_old_flush = jiffies;
INIT_LIST_HEAD(&wb->b_dirty);
INIT_LIST_HEAD(&wb->b_io);
INIT_LIST_HEAD(&wb->b_more_io);
spin_lock_init(&wb->list_lock);
INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn);
}
/*
* Initial write bandwidth: 100 MB/s
*/
#define INIT_BW (100 << (20 - PAGE_SHIFT))
int bdi_init(struct backing_dev_info *bdi)
{
int i, err;
bdi->dev = NULL;
bdi->min_ratio = 0;
bdi->max_ratio = 100;
bdi->max_prop_frac = FPROP_FRAC_BASE;
spin_lock_init(&bdi->wb_lock);
INIT_LIST_HEAD(&bdi->bdi_list);
INIT_LIST_HEAD(&bdi->work_list);
bdi_wb_init(&bdi->wb, bdi);
for (i = 0; i < NR_BDI_STAT_ITEMS; i++) {
err = percpu_counter_init(&bdi->bdi_stat[i], 0, GFP_KERNEL);
if (err)
goto err;
}
bdi->dirty_exceeded = 0;
bdi->bw_time_stamp = jiffies;
bdi->written_stamp = 0;
bdi->balanced_dirty_ratelimit = INIT_BW;
bdi->dirty_ratelimit = INIT_BW;
bdi->write_bandwidth = INIT_BW;
bdi->avg_write_bandwidth = INIT_BW;
err = fprop_local_init_percpu(&bdi->completions, GFP_KERNEL);
if (err) {
err:
while (i--)
percpu_counter_destroy(&bdi->bdi_stat[i]);
}
return err;
}
EXPORT_SYMBOL(bdi_init);
void bdi_destroy(struct backing_dev_info *bdi)
{
int i;
/*
* Splice our entries to the default_backing_dev_info. This
* condition shouldn't happen. @wb must be empty at this point and
* dirty inodes on it might cause other issues. This workaround is
* added by ce5f8e779519 ("writeback: splice dirty inode entries to
* default bdi on bdi_destroy()") without root-causing the issue.
*
* http://lkml.kernel.org/g/1253038617-30204-11-git-send-email-jens.axboe@oracle.com
* http://thread.gmane.org/gmane.linux.file-systems/35341/focus=35350
*
* We should probably add WARN_ON() to find out whether it still
* happens and track it down if so.
*/
if (bdi_has_dirty_io(bdi)) {
struct bdi_writeback *dst = &default_backing_dev_info.wb;
bdi_lock_two(&bdi->wb, dst);
list_splice(&bdi->wb.b_dirty, &dst->b_dirty);
list_splice(&bdi->wb.b_io, &dst->b_io);
list_splice(&bdi->wb.b_more_io, &dst->b_more_io);
spin_unlock(&bdi->wb.list_lock);
spin_unlock(&dst->list_lock);
}
bdi_unregister(bdi);
WARN_ON(delayed_work_pending(&bdi->wb.dwork));
for (i = 0; i < NR_BDI_STAT_ITEMS; i++)
percpu_counter_destroy(&bdi->bdi_stat[i]);
fprop_local_destroy_percpu(&bdi->completions);
}
EXPORT_SYMBOL(bdi_destroy);
/*
* For use from filesystems to quickly init and register a bdi associated
* with dirty writeback
*/
int bdi_setup_and_register(struct backing_dev_info *bdi, char *name,
unsigned int cap)
{
int err;
bdi->name = name;
bdi->capabilities = cap;
err = bdi_init(bdi);
if (err)
return err;
err = bdi_register(bdi, NULL, "%.28s-%ld", name,
atomic_long_inc_return(&bdi_seq));
if (err) {
bdi_destroy(bdi);
return err;
}
return 0;
}
EXPORT_SYMBOL(bdi_setup_and_register);
static wait_queue_head_t congestion_wqh[2] = {
__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
};
static atomic_t nr_bdi_congested[2];
void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
{
enum bdi_state bit;
wait_queue_head_t *wqh = &congestion_wqh[sync];
bit = sync ? BDI_sync_congested : BDI_async_congested;
if (test_and_clear_bit(bit, &bdi->state))
atomic_dec(&nr_bdi_congested[sync]);
smp_mb__after_atomic();
if (waitqueue_active(wqh))
wake_up(wqh);
}
EXPORT_SYMBOL(clear_bdi_congested);
void set_bdi_congested(struct backing_dev_info *bdi, int sync)
{
enum bdi_state bit;
bit = sync ? BDI_sync_congested : BDI_async_congested;
if (!test_and_set_bit(bit, &bdi->state))
atomic_inc(&nr_bdi_congested[sync]);
}
EXPORT_SYMBOL(set_bdi_congested);
/**
* congestion_wait - wait for a backing_dev to become uncongested
* @sync: SYNC or ASYNC IO
* @timeout: timeout in jiffies
*
* Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
* write congestion. If no backing_devs are congested then just wait for the
* next write to be completed.
*/
long congestion_wait(int sync, long timeout)
{
long ret;
unsigned long start = jiffies;
DEFINE_WAIT(wait);
wait_queue_head_t *wqh = &congestion_wqh[sync];
prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
ret = io_schedule_timeout(timeout);
finish_wait(wqh, &wait);
trace_writeback_congestion_wait(jiffies_to_usecs(timeout),
jiffies_to_usecs(jiffies - start));
return ret;
}
EXPORT_SYMBOL(congestion_wait);
/**
* wait_iff_congested - Conditionally wait for a backing_dev to become uncongested or a zone to complete writes
* @zone: A zone to check if it is heavily congested
* @sync: SYNC or ASYNC IO
* @timeout: timeout in jiffies
*
* In the event of a congested backing_dev (any backing_dev) and the given
* @zone has experienced recent congestion, this waits for up to @timeout
* jiffies for either a BDI to exit congestion of the given @sync queue
* or a write to complete.
*
* In the absence of zone congestion, cond_resched() is called to yield
* the processor if necessary but otherwise does not sleep.
*
* The return value is 0 if the sleep is for the full timeout. Otherwise,
* it is the number of jiffies that were still remaining when the function
* returned. return_value == timeout implies the function did not sleep.
*/
long wait_iff_congested(struct zone *zone, int sync, long timeout)
{
long ret;
unsigned long start = jiffies;
DEFINE_WAIT(wait);
wait_queue_head_t *wqh = &congestion_wqh[sync];
/*
* If there is no congestion, or heavy congestion is not being
* encountered in the current zone, yield if necessary instead
* of sleeping on the congestion queue
*/
if (atomic_read(&nr_bdi_congested[sync]) == 0 ||
!test_bit(ZONE_CONGESTED, &zone->flags)) {
cond_resched();
/* In case we scheduled, work out time remaining */
ret = timeout - (jiffies - start);
if (ret < 0)
ret = 0;
goto out;
}
/* Sleep until uncongested or a write happens */
prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
ret = io_schedule_timeout(timeout);
finish_wait(wqh, &wait);
out:
trace_writeback_wait_iff_congested(jiffies_to_usecs(timeout),
jiffies_to_usecs(jiffies - start));
return ret;
}
EXPORT_SYMBOL(wait_iff_congested);
int pdflush_proc_obsolete(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp, loff_t *ppos)
{
char kbuf[] = "0\n";
if (*ppos || *lenp < sizeof(kbuf)) {
*lenp = 0;
return 0;
}
if (copy_to_user(buffer, kbuf, sizeof(kbuf)))
return -EFAULT;
printk_once(KERN_WARNING "%s exported in /proc is scheduled for removal\n",
table->procname);
*lenp = 2;
*ppos += *lenp;
return 2;
}

221
mm/balloon_compaction.c Normal file
View file

@ -0,0 +1,221 @@
/*
* mm/balloon_compaction.c
*
* Common interface for making balloon pages movable by compaction.
*
* Copyright (C) 2012, Red Hat, Inc. Rafael Aquini <aquini@redhat.com>
*/
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/export.h>
#include <linux/balloon_compaction.h>
/*
* balloon_page_enqueue - allocates a new page and inserts it into the balloon
* page list.
* @b_dev_info: balloon device decriptor where we will insert a new page to
*
* Driver must call it to properly allocate a new enlisted balloon page
* before definetively removing it from the guest system.
* This function returns the page address for the recently enqueued page or
* NULL in the case we fail to allocate a new page this turn.
*/
struct page *balloon_page_enqueue(struct balloon_dev_info *b_dev_info)
{
unsigned long flags;
struct page *page = alloc_page(balloon_mapping_gfp_mask() |
__GFP_NOMEMALLOC | __GFP_NORETRY);
if (!page)
return NULL;
/*
* Block others from accessing the 'page' when we get around to
* establishing additional references. We should be the only one
* holding a reference to the 'page' at this point.
*/
BUG_ON(!trylock_page(page));
spin_lock_irqsave(&b_dev_info->pages_lock, flags);
balloon_page_insert(b_dev_info, page);
__count_vm_event(BALLOON_INFLATE);
spin_unlock_irqrestore(&b_dev_info->pages_lock, flags);
unlock_page(page);
return page;
}
EXPORT_SYMBOL_GPL(balloon_page_enqueue);
/*
* balloon_page_dequeue - removes a page from balloon's page list and returns
* the its address to allow the driver release the page.
* @b_dev_info: balloon device decriptor where we will grab a page from.
*
* Driver must call it to properly de-allocate a previous enlisted balloon page
* before definetively releasing it back to the guest system.
* This function returns the page address for the recently dequeued page or
* NULL in the case we find balloon's page list temporarily empty due to
* compaction isolated pages.
*/
struct page *balloon_page_dequeue(struct balloon_dev_info *b_dev_info)
{
struct page *page, *tmp;
unsigned long flags;
bool dequeued_page;
dequeued_page = false;
list_for_each_entry_safe(page, tmp, &b_dev_info->pages, lru) {
/*
* Block others from accessing the 'page' while we get around
* establishing additional references and preparing the 'page'
* to be released by the balloon driver.
*/
if (trylock_page(page)) {
#ifdef CONFIG_BALLOON_COMPACTION
if (!PagePrivate(page)) {
/* raced with isolation */
unlock_page(page);
continue;
}
#endif
spin_lock_irqsave(&b_dev_info->pages_lock, flags);
balloon_page_delete(page);
__count_vm_event(BALLOON_DEFLATE);
spin_unlock_irqrestore(&b_dev_info->pages_lock, flags);
unlock_page(page);
dequeued_page = true;
break;
}
}
if (!dequeued_page) {
/*
* If we are unable to dequeue a balloon page because the page
* list is empty and there is no isolated pages, then something
* went out of track and some balloon pages are lost.
* BUG() here, otherwise the balloon driver may get stuck into
* an infinite loop while attempting to release all its pages.
*/
spin_lock_irqsave(&b_dev_info->pages_lock, flags);
if (unlikely(list_empty(&b_dev_info->pages) &&
!b_dev_info->isolated_pages))
BUG();
spin_unlock_irqrestore(&b_dev_info->pages_lock, flags);
page = NULL;
}
return page;
}
EXPORT_SYMBOL_GPL(balloon_page_dequeue);
#ifdef CONFIG_BALLOON_COMPACTION
static inline void __isolate_balloon_page(struct page *page)
{
struct balloon_dev_info *b_dev_info = balloon_page_device(page);
unsigned long flags;
spin_lock_irqsave(&b_dev_info->pages_lock, flags);
ClearPagePrivate(page);
list_del(&page->lru);
b_dev_info->isolated_pages++;
spin_unlock_irqrestore(&b_dev_info->pages_lock, flags);
}
static inline void __putback_balloon_page(struct page *page)
{
struct balloon_dev_info *b_dev_info = balloon_page_device(page);
unsigned long flags;
spin_lock_irqsave(&b_dev_info->pages_lock, flags);
SetPagePrivate(page);
list_add(&page->lru, &b_dev_info->pages);
b_dev_info->isolated_pages--;
spin_unlock_irqrestore(&b_dev_info->pages_lock, flags);
}
/* __isolate_lru_page() counterpart for a ballooned page */
bool balloon_page_isolate(struct page *page)
{
/*
* Avoid burning cycles with pages that are yet under __free_pages(),
* or just got freed under us.
*
* In case we 'win' a race for a balloon page being freed under us and
* raise its refcount preventing __free_pages() from doing its job
* the put_page() at the end of this block will take care of
* release this page, thus avoiding a nasty leakage.
*/
if (likely(get_page_unless_zero(page))) {
/*
* As balloon pages are not isolated from LRU lists, concurrent
* compaction threads can race against page migration functions
* as well as race against the balloon driver releasing a page.
*
* In order to avoid having an already isolated balloon page
* being (wrongly) re-isolated while it is under migration,
* or to avoid attempting to isolate pages being released by
* the balloon driver, lets be sure we have the page lock
* before proceeding with the balloon page isolation steps.
*/
if (likely(trylock_page(page))) {
/*
* A ballooned page, by default, has PagePrivate set.
* Prevent concurrent compaction threads from isolating
* an already isolated balloon page by clearing it.
*/
if (balloon_page_movable(page)) {
__isolate_balloon_page(page);
unlock_page(page);
return true;
}
unlock_page(page);
}
put_page(page);
}
return false;
}
/* putback_lru_page() counterpart for a ballooned page */
void balloon_page_putback(struct page *page)
{
/*
* 'lock_page()' stabilizes the page and prevents races against
* concurrent isolation threads attempting to re-isolate it.
*/
lock_page(page);
if (__is_movable_balloon_page(page)) {
__putback_balloon_page(page);
/* drop the extra ref count taken for page isolation */
put_page(page);
} else {
WARN_ON(1);
dump_page(page, "not movable balloon page");
}
unlock_page(page);
}
/* move_to_new_page() counterpart for a ballooned page */
int balloon_page_migrate(struct page *newpage,
struct page *page, enum migrate_mode mode)
{
struct balloon_dev_info *balloon = balloon_page_device(page);
int rc = -EAGAIN;
/*
* Block others from accessing the 'newpage' when we get around to
* establishing additional references. We should be the only one
* holding a reference to the 'newpage' at this point.
*/
BUG_ON(!trylock_page(newpage));
if (WARN_ON(!__is_movable_balloon_page(page))) {
dump_page(page, "not movable balloon page");
unlock_page(newpage);
return rc;
}
if (balloon && balloon->migratepage)
rc = balloon->migratepage(balloon, newpage, page, mode);
unlock_page(newpage);
return rc;
}
#endif /* CONFIG_BALLOON_COMPACTION */

861
mm/bootmem.c Normal file
View file

@ -0,0 +1,861 @@
/*
* bootmem - A boot-time physical memory allocator and configurator
*
* Copyright (C) 1999 Ingo Molnar
* 1999 Kanoj Sarcar, SGI
* 2008 Johannes Weiner
*
* Access to this subsystem has to be serialized externally (which is true
* for the boot process anyway).
*/
#include <linux/init.h>
#include <linux/pfn.h>
#include <linux/slab.h>
#include <linux/bootmem.h>
#include <linux/export.h>
#include <linux/kmemleak.h>
#include <linux/range.h>
#include <linux/memblock.h>
#include <linux/bug.h>
#include <linux/io.h>
#include <asm/processor.h>
#include "internal.h"
#ifndef CONFIG_NEED_MULTIPLE_NODES
struct pglist_data __refdata contig_page_data = {
.bdata = &bootmem_node_data[0]
};
EXPORT_SYMBOL(contig_page_data);
#endif
unsigned long max_low_pfn;
unsigned long min_low_pfn;
unsigned long max_pfn;
bootmem_data_t bootmem_node_data[MAX_NUMNODES] __initdata;
static struct list_head bdata_list __initdata = LIST_HEAD_INIT(bdata_list);
static int bootmem_debug;
static int __init bootmem_debug_setup(char *buf)
{
bootmem_debug = 1;
return 0;
}
early_param("bootmem_debug", bootmem_debug_setup);
#define bdebug(fmt, args...) ({ \
if (unlikely(bootmem_debug)) \
printk(KERN_INFO \
"bootmem::%s " fmt, \
__func__, ## args); \
})
static unsigned long __init bootmap_bytes(unsigned long pages)
{
unsigned long bytes = DIV_ROUND_UP(pages, 8);
return ALIGN(bytes, sizeof(long));
}
/**
* bootmem_bootmap_pages - calculate bitmap size in pages
* @pages: number of pages the bitmap has to represent
*/
unsigned long __init bootmem_bootmap_pages(unsigned long pages)
{
unsigned long bytes = bootmap_bytes(pages);
return PAGE_ALIGN(bytes) >> PAGE_SHIFT;
}
/*
* link bdata in order
*/
static void __init link_bootmem(bootmem_data_t *bdata)
{
bootmem_data_t *ent;
list_for_each_entry(ent, &bdata_list, list) {
if (bdata->node_min_pfn < ent->node_min_pfn) {
list_add_tail(&bdata->list, &ent->list);
return;
}
}
list_add_tail(&bdata->list, &bdata_list);
}
/*
* Called once to set up the allocator itself.
*/
static unsigned long __init init_bootmem_core(bootmem_data_t *bdata,
unsigned long mapstart, unsigned long start, unsigned long end)
{
unsigned long mapsize;
mminit_validate_memmodel_limits(&start, &end);
bdata->node_bootmem_map = phys_to_virt(PFN_PHYS(mapstart));
bdata->node_min_pfn = start;
bdata->node_low_pfn = end;
link_bootmem(bdata);
/*
* Initially all pages are reserved - setup_arch() has to
* register free RAM areas explicitly.
*/
mapsize = bootmap_bytes(end - start);
memset(bdata->node_bootmem_map, 0xff, mapsize);
bdebug("nid=%td start=%lx map=%lx end=%lx mapsize=%lx\n",
bdata - bootmem_node_data, start, mapstart, end, mapsize);
return mapsize;
}
/**
* init_bootmem_node - register a node as boot memory
* @pgdat: node to register
* @freepfn: pfn where the bitmap for this node is to be placed
* @startpfn: first pfn on the node
* @endpfn: first pfn after the node
*
* Returns the number of bytes needed to hold the bitmap for this node.
*/
unsigned long __init init_bootmem_node(pg_data_t *pgdat, unsigned long freepfn,
unsigned long startpfn, unsigned long endpfn)
{
return init_bootmem_core(pgdat->bdata, freepfn, startpfn, endpfn);
}
/**
* init_bootmem - register boot memory
* @start: pfn where the bitmap is to be placed
* @pages: number of available physical pages
*
* Returns the number of bytes needed to hold the bitmap.
*/
unsigned long __init init_bootmem(unsigned long start, unsigned long pages)
{
max_low_pfn = pages;
min_low_pfn = start;
return init_bootmem_core(NODE_DATA(0)->bdata, start, 0, pages);
}
/*
* free_bootmem_late - free bootmem pages directly to page allocator
* @addr: starting physical address of the range
* @size: size of the range in bytes
*
* This is only useful when the bootmem allocator has already been torn
* down, but we are still initializing the system. Pages are given directly
* to the page allocator, no bootmem metadata is updated because it is gone.
*/
void __init free_bootmem_late(unsigned long physaddr, unsigned long size)
{
unsigned long cursor, end;
kmemleak_free_part(__va(physaddr), size);
cursor = PFN_UP(physaddr);
end = PFN_DOWN(physaddr + size);
for (; cursor < end; cursor++) {
__free_pages_bootmem(pfn_to_page(cursor), 0);
totalram_pages++;
}
}
static unsigned long __init free_all_bootmem_core(bootmem_data_t *bdata)
{
struct page *page;
unsigned long *map, start, end, pages, count = 0;
if (!bdata->node_bootmem_map)
return 0;
map = bdata->node_bootmem_map;
start = bdata->node_min_pfn;
end = bdata->node_low_pfn;
bdebug("nid=%td start=%lx end=%lx\n",
bdata - bootmem_node_data, start, end);
while (start < end) {
unsigned long idx, vec;
unsigned shift;
idx = start - bdata->node_min_pfn;
shift = idx & (BITS_PER_LONG - 1);
/*
* vec holds at most BITS_PER_LONG map bits,
* bit 0 corresponds to start.
*/
vec = ~map[idx / BITS_PER_LONG];
if (shift) {
vec >>= shift;
if (end - start >= BITS_PER_LONG)
vec |= ~map[idx / BITS_PER_LONG + 1] <<
(BITS_PER_LONG - shift);
}
/*
* If we have a properly aligned and fully unreserved
* BITS_PER_LONG block of pages in front of us, free
* it in one go.
*/
if (IS_ALIGNED(start, BITS_PER_LONG) && vec == ~0UL) {
int order = ilog2(BITS_PER_LONG);
__free_pages_bootmem(pfn_to_page(start), order);
count += BITS_PER_LONG;
start += BITS_PER_LONG;
} else {
unsigned long cur = start;
start = ALIGN(start + 1, BITS_PER_LONG);
while (vec && cur != start) {
if (vec & 1) {
page = pfn_to_page(cur);
__free_pages_bootmem(page, 0);
count++;
}
vec >>= 1;
++cur;
}
}
}
page = virt_to_page(bdata->node_bootmem_map);
pages = bdata->node_low_pfn - bdata->node_min_pfn;
pages = bootmem_bootmap_pages(pages);
count += pages;
while (pages--)
__free_pages_bootmem(page++, 0);
bdebug("nid=%td released=%lx\n", bdata - bootmem_node_data, count);
return count;
}
static int reset_managed_pages_done __initdata;
void reset_node_managed_pages(pg_data_t *pgdat)
{
struct zone *z;
for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++)
z->managed_pages = 0;
}
void __init reset_all_zones_managed_pages(void)
{
struct pglist_data *pgdat;
if (reset_managed_pages_done)
return;
for_each_online_pgdat(pgdat)
reset_node_managed_pages(pgdat);
reset_managed_pages_done = 1;
}
/**
* free_all_bootmem - release free pages to the buddy allocator
*
* Returns the number of pages actually released.
*/
unsigned long __init free_all_bootmem(void)
{
unsigned long total_pages = 0;
bootmem_data_t *bdata;
reset_all_zones_managed_pages();
list_for_each_entry(bdata, &bdata_list, list)
total_pages += free_all_bootmem_core(bdata);
totalram_pages += total_pages;
return total_pages;
}
static void __init __free(bootmem_data_t *bdata,
unsigned long sidx, unsigned long eidx)
{
unsigned long idx;
bdebug("nid=%td start=%lx end=%lx\n", bdata - bootmem_node_data,
sidx + bdata->node_min_pfn,
eidx + bdata->node_min_pfn);
if (bdata->hint_idx > sidx)
bdata->hint_idx = sidx;
for (idx = sidx; idx < eidx; idx++)
if (!test_and_clear_bit(idx, bdata->node_bootmem_map))
BUG();
}
static int __init __reserve(bootmem_data_t *bdata, unsigned long sidx,
unsigned long eidx, int flags)
{
unsigned long idx;
int exclusive = flags & BOOTMEM_EXCLUSIVE;
bdebug("nid=%td start=%lx end=%lx flags=%x\n",
bdata - bootmem_node_data,
sidx + bdata->node_min_pfn,
eidx + bdata->node_min_pfn,
flags);
for (idx = sidx; idx < eidx; idx++)
if (test_and_set_bit(idx, bdata->node_bootmem_map)) {
if (exclusive) {
__free(bdata, sidx, idx);
return -EBUSY;
}
bdebug("silent double reserve of PFN %lx\n",
idx + bdata->node_min_pfn);
}
return 0;
}
static int __init mark_bootmem_node(bootmem_data_t *bdata,
unsigned long start, unsigned long end,
int reserve, int flags)
{
unsigned long sidx, eidx;
bdebug("nid=%td start=%lx end=%lx reserve=%d flags=%x\n",
bdata - bootmem_node_data, start, end, reserve, flags);
BUG_ON(start < bdata->node_min_pfn);
BUG_ON(end > bdata->node_low_pfn);
sidx = start - bdata->node_min_pfn;
eidx = end - bdata->node_min_pfn;
if (reserve)
return __reserve(bdata, sidx, eidx, flags);
else
__free(bdata, sidx, eidx);
return 0;
}
static int __init mark_bootmem(unsigned long start, unsigned long end,
int reserve, int flags)
{
unsigned long pos;
bootmem_data_t *bdata;
pos = start;
list_for_each_entry(bdata, &bdata_list, list) {
int err;
unsigned long max;
if (pos < bdata->node_min_pfn ||
pos >= bdata->node_low_pfn) {
BUG_ON(pos != start);
continue;
}
max = min(bdata->node_low_pfn, end);
err = mark_bootmem_node(bdata, pos, max, reserve, flags);
if (reserve && err) {
mark_bootmem(start, pos, 0, 0);
return err;
}
if (max == end)
return 0;
pos = bdata->node_low_pfn;
}
BUG();
}
/**
* free_bootmem_node - mark a page range as usable
* @pgdat: node the range resides on
* @physaddr: starting address of the range
* @size: size of the range in bytes
*
* Partial pages will be considered reserved and left as they are.
*
* The range must reside completely on the specified node.
*/
void __init free_bootmem_node(pg_data_t *pgdat, unsigned long physaddr,
unsigned long size)
{
unsigned long start, end;
kmemleak_free_part(__va(physaddr), size);
start = PFN_UP(physaddr);
end = PFN_DOWN(physaddr + size);
mark_bootmem_node(pgdat->bdata, start, end, 0, 0);
}
/**
* free_bootmem - mark a page range as usable
* @addr: starting physical address of the range
* @size: size of the range in bytes
*
* Partial pages will be considered reserved and left as they are.
*
* The range must be contiguous but may span node boundaries.
*/
void __init free_bootmem(unsigned long physaddr, unsigned long size)
{
unsigned long start, end;
kmemleak_free_part(__va(physaddr), size);
start = PFN_UP(physaddr);
end = PFN_DOWN(physaddr + size);
mark_bootmem(start, end, 0, 0);
}
/**
* reserve_bootmem_node - mark a page range as reserved
* @pgdat: node the range resides on
* @physaddr: starting address of the range
* @size: size of the range in bytes
* @flags: reservation flags (see linux/bootmem.h)
*
* Partial pages will be reserved.
*
* The range must reside completely on the specified node.
*/
int __init reserve_bootmem_node(pg_data_t *pgdat, unsigned long physaddr,
unsigned long size, int flags)
{
unsigned long start, end;
start = PFN_DOWN(physaddr);
end = PFN_UP(physaddr + size);
return mark_bootmem_node(pgdat->bdata, start, end, 1, flags);
}
/**
* reserve_bootmem - mark a page range as reserved
* @addr: starting address of the range
* @size: size of the range in bytes
* @flags: reservation flags (see linux/bootmem.h)
*
* Partial pages will be reserved.
*
* The range must be contiguous but may span node boundaries.
*/
int __init reserve_bootmem(unsigned long addr, unsigned long size,
int flags)
{
unsigned long start, end;
start = PFN_DOWN(addr);
end = PFN_UP(addr + size);
return mark_bootmem(start, end, 1, flags);
}
static unsigned long __init align_idx(struct bootmem_data *bdata,
unsigned long idx, unsigned long step)
{
unsigned long base = bdata->node_min_pfn;
/*
* Align the index with respect to the node start so that the
* combination of both satisfies the requested alignment.
*/
return ALIGN(base + idx, step) - base;
}
static unsigned long __init align_off(struct bootmem_data *bdata,
unsigned long off, unsigned long align)
{
unsigned long base = PFN_PHYS(bdata->node_min_pfn);
/* Same as align_idx for byte offsets */
return ALIGN(base + off, align) - base;
}
static void * __init alloc_bootmem_bdata(struct bootmem_data *bdata,
unsigned long size, unsigned long align,
unsigned long goal, unsigned long limit)
{
unsigned long fallback = 0;
unsigned long min, max, start, sidx, midx, step;
bdebug("nid=%td size=%lx [%lu pages] align=%lx goal=%lx limit=%lx\n",
bdata - bootmem_node_data, size, PAGE_ALIGN(size) >> PAGE_SHIFT,
align, goal, limit);
BUG_ON(!size);
BUG_ON(align & (align - 1));
BUG_ON(limit && goal + size > limit);
if (!bdata->node_bootmem_map)
return NULL;
min = bdata->node_min_pfn;
max = bdata->node_low_pfn;
goal >>= PAGE_SHIFT;
limit >>= PAGE_SHIFT;
if (limit && max > limit)
max = limit;
if (max <= min)
return NULL;
step = max(align >> PAGE_SHIFT, 1UL);
if (goal && min < goal && goal < max)
start = ALIGN(goal, step);
else
start = ALIGN(min, step);
sidx = start - bdata->node_min_pfn;
midx = max - bdata->node_min_pfn;
if (bdata->hint_idx > sidx) {
/*
* Handle the valid case of sidx being zero and still
* catch the fallback below.
*/
fallback = sidx + 1;
sidx = align_idx(bdata, bdata->hint_idx, step);
}
while (1) {
int merge;
void *region;
unsigned long eidx, i, start_off, end_off;
find_block:
sidx = find_next_zero_bit(bdata->node_bootmem_map, midx, sidx);
sidx = align_idx(bdata, sidx, step);
eidx = sidx + PFN_UP(size);
if (sidx >= midx || eidx > midx)
break;
for (i = sidx; i < eidx; i++)
if (test_bit(i, bdata->node_bootmem_map)) {
sidx = align_idx(bdata, i, step);
if (sidx == i)
sidx += step;
goto find_block;
}
if (bdata->last_end_off & (PAGE_SIZE - 1) &&
PFN_DOWN(bdata->last_end_off) + 1 == sidx)
start_off = align_off(bdata, bdata->last_end_off, align);
else
start_off = PFN_PHYS(sidx);
merge = PFN_DOWN(start_off) < sidx;
end_off = start_off + size;
bdata->last_end_off = end_off;
bdata->hint_idx = PFN_UP(end_off);
/*
* Reserve the area now:
*/
if (__reserve(bdata, PFN_DOWN(start_off) + merge,
PFN_UP(end_off), BOOTMEM_EXCLUSIVE))
BUG();
region = phys_to_virt(PFN_PHYS(bdata->node_min_pfn) +
start_off);
memset(region, 0, size);
/*
* The min_count is set to 0 so that bootmem allocated blocks
* are never reported as leaks.
*/
kmemleak_alloc(region, size, 0, 0);
return region;
}
if (fallback) {
sidx = align_idx(bdata, fallback - 1, step);
fallback = 0;
goto find_block;
}
return NULL;
}
static void * __init alloc_bootmem_core(unsigned long size,
unsigned long align,
unsigned long goal,
unsigned long limit)
{
bootmem_data_t *bdata;
void *region;
if (WARN_ON_ONCE(slab_is_available()))
return kzalloc(size, GFP_NOWAIT);
list_for_each_entry(bdata, &bdata_list, list) {
if (goal && bdata->node_low_pfn <= PFN_DOWN(goal))
continue;
if (limit && bdata->node_min_pfn >= PFN_DOWN(limit))
break;
region = alloc_bootmem_bdata(bdata, size, align, goal, limit);
if (region)
return region;
}
return NULL;
}
static void * __init ___alloc_bootmem_nopanic(unsigned long size,
unsigned long align,
unsigned long goal,
unsigned long limit)
{
void *ptr;
restart:
ptr = alloc_bootmem_core(size, align, goal, limit);
if (ptr)
return ptr;
if (goal) {
goal = 0;
goto restart;
}
return NULL;
}
/**
* __alloc_bootmem_nopanic - allocate boot memory without panicking
* @size: size of the request in bytes
* @align: alignment of the region
* @goal: preferred starting address of the region
*
* The goal is dropped if it can not be satisfied and the allocation will
* fall back to memory below @goal.
*
* Allocation may happen on any node in the system.
*
* Returns NULL on failure.
*/
void * __init __alloc_bootmem_nopanic(unsigned long size, unsigned long align,
unsigned long goal)
{
unsigned long limit = 0;
return ___alloc_bootmem_nopanic(size, align, goal, limit);
}
static void * __init ___alloc_bootmem(unsigned long size, unsigned long align,
unsigned long goal, unsigned long limit)
{
void *mem = ___alloc_bootmem_nopanic(size, align, goal, limit);
if (mem)
return mem;
/*
* Whoops, we cannot satisfy the allocation request.
*/
printk(KERN_ALERT "bootmem alloc of %lu bytes failed!\n", size);
panic("Out of memory");
return NULL;
}
/**
* __alloc_bootmem - allocate boot memory
* @size: size of the request in bytes
* @align: alignment of the region
* @goal: preferred starting address of the region
*
* The goal is dropped if it can not be satisfied and the allocation will
* fall back to memory below @goal.
*
* Allocation may happen on any node in the system.
*
* The function panics if the request can not be satisfied.
*/
void * __init __alloc_bootmem(unsigned long size, unsigned long align,
unsigned long goal)
{
unsigned long limit = 0;
return ___alloc_bootmem(size, align, goal, limit);
}
void * __init ___alloc_bootmem_node_nopanic(pg_data_t *pgdat,
unsigned long size, unsigned long align,
unsigned long goal, unsigned long limit)
{
void *ptr;
if (WARN_ON_ONCE(slab_is_available()))
return kzalloc(size, GFP_NOWAIT);
again:
/* do not panic in alloc_bootmem_bdata() */
if (limit && goal + size > limit)
limit = 0;
ptr = alloc_bootmem_bdata(pgdat->bdata, size, align, goal, limit);
if (ptr)
return ptr;
ptr = alloc_bootmem_core(size, align, goal, limit);
if (ptr)
return ptr;
if (goal) {
goal = 0;
goto again;
}
return NULL;
}
void * __init __alloc_bootmem_node_nopanic(pg_data_t *pgdat, unsigned long size,
unsigned long align, unsigned long goal)
{
if (WARN_ON_ONCE(slab_is_available()))
return kzalloc_node(size, GFP_NOWAIT, pgdat->node_id);
return ___alloc_bootmem_node_nopanic(pgdat, size, align, goal, 0);
}
void * __init ___alloc_bootmem_node(pg_data_t *pgdat, unsigned long size,
unsigned long align, unsigned long goal,
unsigned long limit)
{
void *ptr;
ptr = ___alloc_bootmem_node_nopanic(pgdat, size, align, goal, 0);
if (ptr)
return ptr;
printk(KERN_ALERT "bootmem alloc of %lu bytes failed!\n", size);
panic("Out of memory");
return NULL;
}
/**
* __alloc_bootmem_node - allocate boot memory from a specific node
* @pgdat: node to allocate from
* @size: size of the request in bytes
* @align: alignment of the region
* @goal: preferred starting address of the region
*
* The goal is dropped if it can not be satisfied and the allocation will
* fall back to memory below @goal.
*
* Allocation may fall back to any node in the system if the specified node
* can not hold the requested memory.
*
* The function panics if the request can not be satisfied.
*/
void * __init __alloc_bootmem_node(pg_data_t *pgdat, unsigned long size,
unsigned long align, unsigned long goal)
{
if (WARN_ON_ONCE(slab_is_available()))
return kzalloc_node(size, GFP_NOWAIT, pgdat->node_id);
return ___alloc_bootmem_node(pgdat, size, align, goal, 0);
}
void * __init __alloc_bootmem_node_high(pg_data_t *pgdat, unsigned long size,
unsigned long align, unsigned long goal)
{
#ifdef MAX_DMA32_PFN
unsigned long end_pfn;
if (WARN_ON_ONCE(slab_is_available()))
return kzalloc_node(size, GFP_NOWAIT, pgdat->node_id);
/* update goal according ...MAX_DMA32_PFN */
end_pfn = pgdat_end_pfn(pgdat);
if (end_pfn > MAX_DMA32_PFN + (128 >> (20 - PAGE_SHIFT)) &&
(goal >> PAGE_SHIFT) < MAX_DMA32_PFN) {
void *ptr;
unsigned long new_goal;
new_goal = MAX_DMA32_PFN << PAGE_SHIFT;
ptr = alloc_bootmem_bdata(pgdat->bdata, size, align,
new_goal, 0);
if (ptr)
return ptr;
}
#endif
return __alloc_bootmem_node(pgdat, size, align, goal);
}
#ifndef ARCH_LOW_ADDRESS_LIMIT
#define ARCH_LOW_ADDRESS_LIMIT 0xffffffffUL
#endif
/**
* __alloc_bootmem_low - allocate low boot memory
* @size: size of the request in bytes
* @align: alignment of the region
* @goal: preferred starting address of the region
*
* The goal is dropped if it can not be satisfied and the allocation will
* fall back to memory below @goal.
*
* Allocation may happen on any node in the system.
*
* The function panics if the request can not be satisfied.
*/
void * __init __alloc_bootmem_low(unsigned long size, unsigned long align,
unsigned long goal)
{
return ___alloc_bootmem(size, align, goal, ARCH_LOW_ADDRESS_LIMIT);
}
void * __init __alloc_bootmem_low_nopanic(unsigned long size,
unsigned long align,
unsigned long goal)
{
return ___alloc_bootmem_nopanic(size, align, goal,
ARCH_LOW_ADDRESS_LIMIT);
}
/**
* __alloc_bootmem_low_node - allocate low boot memory from a specific node
* @pgdat: node to allocate from
* @size: size of the request in bytes
* @align: alignment of the region
* @goal: preferred starting address of the region
*
* The goal is dropped if it can not be satisfied and the allocation will
* fall back to memory below @goal.
*
* Allocation may fall back to any node in the system if the specified node
* can not hold the requested memory.
*
* The function panics if the request can not be satisfied.
*/
void * __init __alloc_bootmem_low_node(pg_data_t *pgdat, unsigned long size,
unsigned long align, unsigned long goal)
{
if (WARN_ON_ONCE(slab_is_available()))
return kzalloc_node(size, GFP_NOWAIT, pgdat->node_id);
return ___alloc_bootmem_node(pgdat, size, align,
goal, ARCH_LOW_ADDRESS_LIMIT);
}

409
mm/cleancache.c Normal file
View file

@ -0,0 +1,409 @@
/*
* Cleancache frontend
*
* This code provides the generic "frontend" layer to call a matching
* "backend" driver implementation of cleancache. See
* Documentation/vm/cleancache.txt for more information.
*
* Copyright (C) 2009-2010 Oracle Corp. All rights reserved.
* Author: Dan Magenheimer
*
* This work is licensed under the terms of the GNU GPL, version 2.
*/
#include <linux/module.h>
#include <linux/fs.h>
#include <linux/exportfs.h>
#include <linux/mm.h>
#include <linux/debugfs.h>
#include <linux/cleancache.h>
/*
* cleancache_ops is set by cleancache_ops_register to contain the pointers
* to the cleancache "backend" implementation functions.
*/
static struct cleancache_ops *cleancache_ops __read_mostly;
/*
* Counters available via /sys/kernel/debug/frontswap (if debugfs is
* properly configured. These are for information only so are not protected
* against increment races.
*/
static u64 cleancache_succ_gets;
static u64 cleancache_failed_gets;
static u64 cleancache_puts;
static u64 cleancache_invalidates;
/*
* When no backend is registered all calls to init_fs and init_shared_fs
* are registered and fake poolids (FAKE_FS_POOLID_OFFSET or
* FAKE_SHARED_FS_POOLID_OFFSET, plus offset in the respective array
* [shared_|]fs_poolid_map) are given to the respective super block
* (sb->cleancache_poolid) and no tmem_pools are created. When a backend
* registers with cleancache the previous calls to init_fs and init_shared_fs
* are executed to create tmem_pools and set the respective poolids. While no
* backend is registered all "puts", "gets" and "flushes" are ignored or failed.
*/
#define MAX_INITIALIZABLE_FS 32
#define FAKE_FS_POOLID_OFFSET 1000
#define FAKE_SHARED_FS_POOLID_OFFSET 2000
#define FS_NO_BACKEND (-1)
#define FS_UNKNOWN (-2)
static int fs_poolid_map[MAX_INITIALIZABLE_FS];
static int shared_fs_poolid_map[MAX_INITIALIZABLE_FS];
static char *uuids[MAX_INITIALIZABLE_FS];
/*
* Mutex for the [shared_|]fs_poolid_map to guard against multiple threads
* invoking umount (and ending in __cleancache_invalidate_fs) and also multiple
* threads calling mount (and ending up in __cleancache_init_[shared|]fs).
*/
static DEFINE_MUTEX(poolid_mutex);
/*
* When set to false (default) all calls to the cleancache functions, except
* the __cleancache_invalidate_fs and __cleancache_init_[shared|]fs are guarded
* by the if (!cleancache_ops) return. This means multiple threads (from
* different filesystems) will be checking cleancache_ops. The usage of a
* bool instead of a atomic_t or a bool guarded by a spinlock is OK - we are
* OK if the time between the backend's have been initialized (and
* cleancache_ops has been set to not NULL) and when the filesystems start
* actually calling the backends. The inverse (when unloading) is obviously
* not good - but this shim does not do that (yet).
*/
/*
* The backends and filesystems work all asynchronously. This is b/c the
* backends can be built as modules.
* The usual sequence of events is:
* a) mount / -> __cleancache_init_fs is called. We set the
* [shared_|]fs_poolid_map and uuids for.
*
* b). user does I/Os -> we call the rest of __cleancache_* functions
* which return immediately as cleancache_ops is false.
*
* c). modprobe zcache -> cleancache_register_ops. We init the backend
* and set cleancache_ops to true, and for any fs_poolid_map
* (which is set by __cleancache_init_fs) we initialize the poolid.
*
* d). user does I/Os -> now that cleancache_ops is true all the
* __cleancache_* functions can call the backend. They all check
* that fs_poolid_map is valid and if so invoke the backend.
*
* e). umount / -> __cleancache_invalidate_fs, the fs_poolid_map is
* reset (which is the second check in the __cleancache_* ops
* to call the backend).
*
* The sequence of event could also be c), followed by a), and d). and e). The
* c) would not happen anymore. There is also the chance of c), and one thread
* doing a) + d), and another doing e). For that case we depend on the
* filesystem calling __cleancache_invalidate_fs in the proper sequence (so
* that it handles all I/Os before it invalidates the fs (which is last part
* of unmounting process).
*
* Note: The acute reader will notice that there is no "rmmod zcache" case.
* This is b/c the functionality for that is not yet implemented and when
* done, will require some extra locking not yet devised.
*/
/*
* Register operations for cleancache, returning previous thus allowing
* detection of multiple backends and possible nesting.
*/
struct cleancache_ops *cleancache_register_ops(struct cleancache_ops *ops)
{
struct cleancache_ops *old = cleancache_ops;
int i;
mutex_lock(&poolid_mutex);
for (i = 0; i < MAX_INITIALIZABLE_FS; i++) {
if (fs_poolid_map[i] == FS_NO_BACKEND)
fs_poolid_map[i] = ops->init_fs(PAGE_SIZE);
if (shared_fs_poolid_map[i] == FS_NO_BACKEND)
shared_fs_poolid_map[i] = ops->init_shared_fs
(uuids[i], PAGE_SIZE);
}
/*
* We MUST set cleancache_ops _after_ we have called the backends
* init_fs or init_shared_fs functions. Otherwise the compiler might
* re-order where cleancache_ops is set in this function.
*/
barrier();
cleancache_ops = ops;
mutex_unlock(&poolid_mutex);
return old;
}
EXPORT_SYMBOL(cleancache_register_ops);
/* Called by a cleancache-enabled filesystem at time of mount */
void __cleancache_init_fs(struct super_block *sb)
{
int i;
mutex_lock(&poolid_mutex);
for (i = 0; i < MAX_INITIALIZABLE_FS; i++) {
if (fs_poolid_map[i] == FS_UNKNOWN) {
sb->cleancache_poolid = i + FAKE_FS_POOLID_OFFSET;
if (cleancache_ops)
fs_poolid_map[i] = cleancache_ops->init_fs(PAGE_SIZE);
else
fs_poolid_map[i] = FS_NO_BACKEND;
break;
}
}
mutex_unlock(&poolid_mutex);
}
EXPORT_SYMBOL(__cleancache_init_fs);
/* Called by a cleancache-enabled clustered filesystem at time of mount */
void __cleancache_init_shared_fs(char *uuid, struct super_block *sb)
{
int i;
mutex_lock(&poolid_mutex);
for (i = 0; i < MAX_INITIALIZABLE_FS; i++) {
if (shared_fs_poolid_map[i] == FS_UNKNOWN) {
sb->cleancache_poolid = i + FAKE_SHARED_FS_POOLID_OFFSET;
uuids[i] = uuid;
if (cleancache_ops)
shared_fs_poolid_map[i] = cleancache_ops->init_shared_fs
(uuid, PAGE_SIZE);
else
shared_fs_poolid_map[i] = FS_NO_BACKEND;
break;
}
}
mutex_unlock(&poolid_mutex);
}
EXPORT_SYMBOL(__cleancache_init_shared_fs);
/*
* If the filesystem uses exportable filehandles, use the filehandle as
* the key, else use the inode number.
*/
static int cleancache_get_key(struct inode *inode,
struct cleancache_filekey *key)
{
int (*fhfn)(struct inode *, __u32 *fh, int *, struct inode *);
int len = 0, maxlen = CLEANCACHE_KEY_MAX;
struct super_block *sb = inode->i_sb;
key->u.ino = inode->i_ino;
if (sb->s_export_op != NULL) {
fhfn = sb->s_export_op->encode_fh;
if (fhfn) {
len = (*fhfn)(inode, &key->u.fh[0], &maxlen, NULL);
if (len <= FILEID_ROOT || len == FILEID_INVALID)
return -1;
if (maxlen > CLEANCACHE_KEY_MAX)
return -1;
}
}
return 0;
}
/*
* Returns a pool_id that is associated with a given fake poolid.
*/
static int get_poolid_from_fake(int fake_pool_id)
{
if (fake_pool_id >= FAKE_SHARED_FS_POOLID_OFFSET)
return shared_fs_poolid_map[fake_pool_id -
FAKE_SHARED_FS_POOLID_OFFSET];
else if (fake_pool_id >= FAKE_FS_POOLID_OFFSET)
return fs_poolid_map[fake_pool_id - FAKE_FS_POOLID_OFFSET];
return FS_NO_BACKEND;
}
/*
* "Get" data from cleancache associated with the poolid/inode/index
* that were specified when the data was put to cleanache and, if
* successful, use it to fill the specified page with data and return 0.
* The pageframe is unchanged and returns -1 if the get fails.
* Page must be locked by caller.
*
* The function has two checks before any action is taken - whether
* a backend is registered and whether the sb->cleancache_poolid
* is correct.
*/
int __cleancache_get_page(struct page *page)
{
int ret = -1;
int pool_id;
int fake_pool_id;
struct cleancache_filekey key = { .u.key = { 0 } };
if (!cleancache_ops) {
cleancache_failed_gets++;
goto out;
}
VM_BUG_ON_PAGE(!PageLocked(page), page);
fake_pool_id = page->mapping->host->i_sb->cleancache_poolid;
if (fake_pool_id < 0)
goto out;
pool_id = get_poolid_from_fake(fake_pool_id);
if (cleancache_get_key(page->mapping->host, &key) < 0)
goto out;
if (pool_id >= 0)
ret = cleancache_ops->get_page(pool_id,
key, page->index, page);
if (ret == 0)
cleancache_succ_gets++;
else
cleancache_failed_gets++;
out:
return ret;
}
EXPORT_SYMBOL(__cleancache_get_page);
/*
* "Put" data from a page to cleancache and associate it with the
* (previously-obtained per-filesystem) poolid and the page's,
* inode and page index. Page must be locked. Note that a put_page
* always "succeeds", though a subsequent get_page may succeed or fail.
*
* The function has two checks before any action is taken - whether
* a backend is registered and whether the sb->cleancache_poolid
* is correct.
*/
void __cleancache_put_page(struct page *page)
{
int pool_id;
int fake_pool_id;
struct cleancache_filekey key = { .u.key = { 0 } };
if (!cleancache_ops) {
cleancache_puts++;
return;
}
VM_BUG_ON_PAGE(!PageLocked(page), page);
fake_pool_id = page->mapping->host->i_sb->cleancache_poolid;
if (fake_pool_id < 0)
return;
pool_id = get_poolid_from_fake(fake_pool_id);
if (pool_id >= 0 &&
cleancache_get_key(page->mapping->host, &key) >= 0) {
cleancache_ops->put_page(pool_id, key, page->index, page);
cleancache_puts++;
}
}
EXPORT_SYMBOL(__cleancache_put_page);
/*
* Invalidate any data from cleancache associated with the poolid and the
* page's inode and page index so that a subsequent "get" will fail.
*
* The function has two checks before any action is taken - whether
* a backend is registered and whether the sb->cleancache_poolid
* is correct.
*/
void __cleancache_invalidate_page(struct address_space *mapping,
struct page *page)
{
/* careful... page->mapping is NULL sometimes when this is called */
int pool_id;
int fake_pool_id = mapping->host->i_sb->cleancache_poolid;
struct cleancache_filekey key = { .u.key = { 0 } };
if (!cleancache_ops)
return;
if (fake_pool_id >= 0) {
pool_id = get_poolid_from_fake(fake_pool_id);
if (pool_id < 0)
return;
VM_BUG_ON_PAGE(!PageLocked(page), page);
if (cleancache_get_key(mapping->host, &key) >= 0) {
cleancache_ops->invalidate_page(pool_id,
key, page->index);
cleancache_invalidates++;
}
}
}
EXPORT_SYMBOL(__cleancache_invalidate_page);
/*
* Invalidate all data from cleancache associated with the poolid and the
* mappings's inode so that all subsequent gets to this poolid/inode
* will fail.
*
* The function has two checks before any action is taken - whether
* a backend is registered and whether the sb->cleancache_poolid
* is correct.
*/
void __cleancache_invalidate_inode(struct address_space *mapping)
{
int pool_id;
int fake_pool_id = mapping->host->i_sb->cleancache_poolid;
struct cleancache_filekey key = { .u.key = { 0 } };
if (!cleancache_ops)
return;
if (fake_pool_id < 0)
return;
pool_id = get_poolid_from_fake(fake_pool_id);
if (pool_id >= 0 && cleancache_get_key(mapping->host, &key) >= 0)
cleancache_ops->invalidate_inode(pool_id, key);
}
EXPORT_SYMBOL(__cleancache_invalidate_inode);
/*
* Called by any cleancache-enabled filesystem at time of unmount;
* note that pool_id is surrendered and may be returned by a subsequent
* cleancache_init_fs or cleancache_init_shared_fs.
*/
void __cleancache_invalidate_fs(struct super_block *sb)
{
int index;
int fake_pool_id = sb->cleancache_poolid;
int old_poolid = fake_pool_id;
mutex_lock(&poolid_mutex);
if (fake_pool_id >= FAKE_SHARED_FS_POOLID_OFFSET) {
index = fake_pool_id - FAKE_SHARED_FS_POOLID_OFFSET;
old_poolid = shared_fs_poolid_map[index];
shared_fs_poolid_map[index] = FS_UNKNOWN;
uuids[index] = NULL;
} else if (fake_pool_id >= FAKE_FS_POOLID_OFFSET) {
index = fake_pool_id - FAKE_FS_POOLID_OFFSET;
old_poolid = fs_poolid_map[index];
fs_poolid_map[index] = FS_UNKNOWN;
}
sb->cleancache_poolid = -1;
if (cleancache_ops)
cleancache_ops->invalidate_fs(old_poolid);
mutex_unlock(&poolid_mutex);
}
EXPORT_SYMBOL(__cleancache_invalidate_fs);
static int __init init_cleancache(void)
{
int i;
#ifdef CONFIG_DEBUG_FS
struct dentry *root = debugfs_create_dir("cleancache", NULL);
if (root == NULL)
return -ENXIO;
debugfs_create_u64("succ_gets", S_IRUGO, root, &cleancache_succ_gets);
debugfs_create_u64("failed_gets", S_IRUGO,
root, &cleancache_failed_gets);
debugfs_create_u64("puts", S_IRUGO, root, &cleancache_puts);
debugfs_create_u64("invalidates", S_IRUGO,
root, &cleancache_invalidates);
#endif
for (i = 0; i < MAX_INITIALIZABLE_FS; i++) {
fs_poolid_map[i] = FS_UNKNOWN;
shared_fs_poolid_map[i] = FS_UNKNOWN;
}
return 0;
}
module_init(init_cleancache)

430
mm/cma.c Normal file
View file

@ -0,0 +1,430 @@
/*
* Contiguous Memory Allocator
*
* Copyright (c) 2010-2011 by Samsung Electronics.
* Copyright IBM Corporation, 2013
* Copyright LG Electronics Inc., 2014
* Written by:
* Marek Szyprowski <m.szyprowski@samsung.com>
* Michal Nazarewicz <mina86@mina86.com>
* Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
* Joonsoo Kim <iamjoonsoo.kim@lge.com>
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License as
* published by the Free Software Foundation; either version 2 of the
* License or (at your optional) any later version of the license.
*/
#define pr_fmt(fmt) "cma: " fmt
#ifdef CONFIG_CMA_DEBUG
#ifndef DEBUG
# define DEBUG
#endif
#endif
#include <linux/memblock.h>
#include <linux/err.h>
#include <linux/mm.h>
#include <linux/mutex.h>
#include <linux/sizes.h>
#include <linux/slab.h>
#include <linux/log2.h>
#include <linux/cma.h>
#include <linux/highmem.h>
struct cma {
unsigned long base_pfn;
unsigned long count;
unsigned long *bitmap;
unsigned int order_per_bit; /* Order of pages represented by one bit */
struct mutex lock;
};
static struct cma cma_areas[MAX_CMA_AREAS];
static unsigned cma_area_count;
static DEFINE_MUTEX(cma_mutex);
phys_addr_t cma_get_base(struct cma *cma)
{
return PFN_PHYS(cma->base_pfn);
}
unsigned long cma_get_size(struct cma *cma)
{
return cma->count << PAGE_SHIFT;
}
static unsigned long cma_bitmap_aligned_mask(struct cma *cma, int align_order)
{
if (align_order <= cma->order_per_bit)
return 0;
return (1UL << (align_order - cma->order_per_bit)) - 1;
}
static unsigned long cma_bitmap_maxno(struct cma *cma)
{
return cma->count >> cma->order_per_bit;
}
static unsigned long cma_bitmap_pages_to_bits(struct cma *cma,
unsigned long pages)
{
return ALIGN(pages, 1UL << cma->order_per_bit) >> cma->order_per_bit;
}
static void cma_clear_bitmap(struct cma *cma, unsigned long pfn, int count)
{
unsigned long bitmap_no, bitmap_count;
bitmap_no = (pfn - cma->base_pfn) >> cma->order_per_bit;
bitmap_count = cma_bitmap_pages_to_bits(cma, count);
mutex_lock(&cma->lock);
bitmap_clear(cma->bitmap, bitmap_no, bitmap_count);
mutex_unlock(&cma->lock);
}
static int __init cma_activate_area(struct cma *cma)
{
int bitmap_size = BITS_TO_LONGS(cma_bitmap_maxno(cma)) * sizeof(long);
unsigned long base_pfn = cma->base_pfn, pfn = base_pfn;
unsigned i = cma->count >> pageblock_order;
struct zone *zone;
cma->bitmap = kzalloc(bitmap_size, GFP_KERNEL);
if (!cma->bitmap)
return -ENOMEM;
WARN_ON_ONCE(!pfn_valid(pfn));
zone = page_zone(pfn_to_page(pfn));
do {
unsigned j;
base_pfn = pfn;
for (j = pageblock_nr_pages; j; --j, pfn++) {
WARN_ON_ONCE(!pfn_valid(pfn));
/*
* alloc_contig_range requires the pfn range
* specified to be in the same zone. Make this
* simple by forcing the entire CMA resv range
* to be in the same zone.
*/
if (page_zone(pfn_to_page(pfn)) != zone)
goto err;
}
init_cma_reserved_pageblock(pfn_to_page(base_pfn));
} while (--i);
mutex_init(&cma->lock);
return 0;
err:
kfree(cma->bitmap);
cma->count = 0;
return -EINVAL;
}
static int __init cma_init_reserved_areas(void)
{
int i;
for (i = 0; i < cma_area_count; i++) {
int ret = cma_activate_area(&cma_areas[i]);
if (ret)
return ret;
}
return 0;
}
core_initcall(cma_init_reserved_areas);
/**
* cma_init_reserved_mem() - create custom contiguous area from reserved memory
* @base: Base address of the reserved area
* @size: Size of the reserved area (in bytes),
* @order_per_bit: Order of pages represented by one bit on bitmap.
* @res_cma: Pointer to store the created cma region.
*
* This function creates custom contiguous area from already reserved memory.
*/
int __init cma_init_reserved_mem(phys_addr_t base, phys_addr_t size,
int order_per_bit, struct cma **res_cma)
{
struct cma *cma;
phys_addr_t alignment;
/* Sanity checks */
if (cma_area_count == ARRAY_SIZE(cma_areas)) {
pr_err("Not enough slots for CMA reserved regions!\n");
return -ENOSPC;
}
if (!size || !memblock_is_region_reserved(base, size))
return -EINVAL;
/* ensure minimal alignment requied by mm core */
alignment = PAGE_SIZE << max(MAX_ORDER - 1, pageblock_order);
/* alignment should be aligned with order_per_bit */
if (!IS_ALIGNED(alignment >> PAGE_SHIFT, 1 << order_per_bit))
return -EINVAL;
if (ALIGN(base, alignment) != base || ALIGN(size, alignment) != size)
return -EINVAL;
/*
* Each reserved area must be initialised later, when more kernel
* subsystems (like slab allocator) are available.
*/
cma = &cma_areas[cma_area_count];
cma->base_pfn = PFN_DOWN(base);
cma->count = size >> PAGE_SHIFT;
cma->order_per_bit = order_per_bit;
*res_cma = cma;
cma_area_count++;
return 0;
}
/**
* cma_declare_contiguous() - reserve custom contiguous area
* @base: Base address of the reserved area optional, use 0 for any
* @size: Size of the reserved area (in bytes),
* @limit: End address of the reserved memory (optional, 0 for any).
* @alignment: Alignment for the CMA area, should be power of 2 or zero
* @order_per_bit: Order of pages represented by one bit on bitmap.
* @fixed: hint about where to place the reserved area
* @res_cma: Pointer to store the created cma region.
*
* This function reserves memory from early allocator. It should be
* called by arch specific code once the early allocator (memblock or bootmem)
* has been activated and all other subsystems have already allocated/reserved
* memory. This function allows to create custom reserved areas.
*
* If @fixed is true, reserve contiguous area at exactly @base. If false,
* reserve in range from @base to @limit.
*/
int __init cma_declare_contiguous(phys_addr_t base,
phys_addr_t size, phys_addr_t limit,
phys_addr_t alignment, unsigned int order_per_bit,
bool fixed, struct cma **res_cma)
{
phys_addr_t memblock_end = memblock_end_of_DRAM();
phys_addr_t highmem_start;
int ret = 0;
#ifdef CONFIG_X86
/*
* high_memory isn't direct mapped memory so retrieving its physical
* address isn't appropriate. But it would be useful to check the
* physical address of the highmem boundary so it's justfiable to get
* the physical address from it. On x86 there is a validation check for
* this case, so the following workaround is needed to avoid it.
*/
highmem_start = __pa_nodebug(high_memory);
#else
highmem_start = __pa(high_memory);
#endif
pr_debug("%s(size %pa, base %pa, limit %pa alignment %pa)\n",
__func__, &size, &base, &limit, &alignment);
if (cma_area_count == ARRAY_SIZE(cma_areas)) {
pr_err("Not enough slots for CMA reserved regions!\n");
return -ENOSPC;
}
if (!size)
return -EINVAL;
if (alignment && !is_power_of_2(alignment))
return -EINVAL;
/*
* Sanitise input arguments.
* Pages both ends in CMA area could be merged into adjacent unmovable
* migratetype page by page allocator's buddy algorithm. In the case,
* you couldn't get a contiguous memory, which is not what we want.
*/
alignment = max(alignment,
(phys_addr_t)PAGE_SIZE << max(MAX_ORDER - 1, pageblock_order));
base = ALIGN(base, alignment);
size = ALIGN(size, alignment);
limit &= ~(alignment - 1);
if (!base)
fixed = false;
/* size should be aligned with order_per_bit */
if (!IS_ALIGNED(size >> PAGE_SHIFT, 1 << order_per_bit))
return -EINVAL;
/*
* If allocating at a fixed base the request region must not cross the
* low/high memory boundary.
*/
if (fixed && base < highmem_start && base + size > highmem_start) {
ret = -EINVAL;
pr_err("Region at %pa defined on low/high memory boundary (%pa)\n",
&base, &highmem_start);
goto err;
}
/*
* If the limit is unspecified or above the memblock end, its effective
* value will be the memblock end. Set it explicitly to simplify further
* checks.
*/
if (limit == 0 || limit > memblock_end)
limit = memblock_end;
/* Reserve memory */
if (fixed) {
if (memblock_is_region_reserved(base, size) ||
memblock_reserve(base, size) < 0) {
ret = -EBUSY;
goto err;
}
} else {
phys_addr_t addr = 0;
/*
* All pages in the reserved area must come from the same zone.
* If the requested region crosses the low/high memory boundary,
* try allocating from high memory first and fall back to low
* memory in case of failure.
*/
if (base < highmem_start && limit > highmem_start) {
addr = memblock_alloc_range(size, alignment,
highmem_start, limit);
limit = highmem_start;
}
if (!addr) {
addr = memblock_alloc_range(size, alignment, base,
limit);
if (!addr) {
ret = -ENOMEM;
goto err;
}
}
base = addr;
}
ret = cma_init_reserved_mem(base, size, order_per_bit, res_cma);
if (ret)
goto err;
pr_info("Reserved %ld MiB at %pa\n", (unsigned long)size / SZ_1M,
&base);
return 0;
err:
pr_err("Failed to reserve %ld MiB\n", (unsigned long)size / SZ_1M);
return ret;
}
/**
* cma_alloc() - allocate pages from contiguous area
* @cma: Contiguous memory region for which the allocation is performed.
* @count: Requested number of pages.
* @align: Requested alignment of pages (in PAGE_SIZE order).
*
* This function allocates part of contiguous memory on specific
* contiguous memory area.
*/
struct page *cma_alloc(struct cma *cma, int count, unsigned int align)
{
unsigned long mask, pfn, start = 0;
unsigned long bitmap_maxno, bitmap_no, bitmap_count;
struct page *page = NULL;
int ret;
if (!cma || !cma->count)
return NULL;
pr_debug("%s(cma %p, count %d, align %d)\n", __func__, (void *)cma,
count, align);
if (!count)
return NULL;
mask = cma_bitmap_aligned_mask(cma, align);
bitmap_maxno = cma_bitmap_maxno(cma);
bitmap_count = cma_bitmap_pages_to_bits(cma, count);
for (;;) {
mutex_lock(&cma->lock);
bitmap_no = bitmap_find_next_zero_area(cma->bitmap,
bitmap_maxno, start, bitmap_count, mask);
if (bitmap_no >= bitmap_maxno) {
mutex_unlock(&cma->lock);
break;
}
bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
/*
* It's safe to drop the lock here. We've marked this region for
* our exclusive use. If the migration fails we will take the
* lock again and unmark it.
*/
mutex_unlock(&cma->lock);
pfn = cma->base_pfn + (bitmap_no << cma->order_per_bit);
mutex_lock(&cma_mutex);
ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA);
mutex_unlock(&cma_mutex);
if (ret == 0) {
page = pfn_to_page(pfn);
break;
}
cma_clear_bitmap(cma, pfn, count);
if (ret != -EBUSY)
break;
pr_debug("%s(): memory range at %p is busy, retrying\n",
__func__, pfn_to_page(pfn));
/* try again with a bit different memory target */
start = bitmap_no + mask + 1;
}
pr_debug("%s(): returned %p\n", __func__, page);
return page;
}
/**
* cma_release() - release allocated pages
* @cma: Contiguous memory region for which the allocation is performed.
* @pages: Allocated pages.
* @count: Number of allocated pages.
*
* This function releases memory allocated by alloc_cma().
* It returns false when provided pages do not belong to contiguous area and
* true otherwise.
*/
bool cma_release(struct cma *cma, struct page *pages, int count)
{
unsigned long pfn;
if (!cma || !pages)
return false;
pr_debug("%s(page %p)\n", __func__, (void *)pages);
pfn = page_to_pfn(pages);
if (pfn < cma->base_pfn || pfn >= cma->base_pfn + cma->count)
return false;
VM_BUG_ON(pfn + count > cma->base_pfn + cma->count);
free_contig_range(pfn, count);
cma_clear_bitmap(cma, pfn, count);
return true;
}

1540
mm/compaction.c Normal file

File diff suppressed because it is too large Load diff

102
mm/debug-pagealloc.c Normal file
View file

@ -0,0 +1,102 @@
#include <linux/kernel.h>
#include <linux/string.h>
#include <linux/mm.h>
#include <linux/highmem.h>
#include <linux/page-debug-flags.h>
#include <linux/poison.h>
#include <linux/ratelimit.h>
static inline void set_page_poison(struct page *page)
{
__set_bit(PAGE_DEBUG_FLAG_POISON, &page->debug_flags);
}
static inline void clear_page_poison(struct page *page)
{
__clear_bit(PAGE_DEBUG_FLAG_POISON, &page->debug_flags);
}
static inline bool page_poison(struct page *page)
{
return test_bit(PAGE_DEBUG_FLAG_POISON, &page->debug_flags);
}
static void poison_page(struct page *page)
{
void *addr = kmap_atomic(page);
set_page_poison(page);
memset(addr, PAGE_POISON, PAGE_SIZE);
kunmap_atomic(addr);
}
static void poison_pages(struct page *page, int n)
{
int i;
for (i = 0; i < n; i++)
poison_page(page + i);
}
static bool single_bit_flip(unsigned char a, unsigned char b)
{
unsigned char error = a ^ b;
return error && !(error & (error - 1));
}
static void check_poison_mem(unsigned char *mem, size_t bytes)
{
static DEFINE_RATELIMIT_STATE(ratelimit, 5 * HZ, 10);
unsigned char *start;
unsigned char *end;
start = memchr_inv(mem, PAGE_POISON, bytes);
if (!start)
return;
for (end = mem + bytes - 1; end > start; end--) {
if (*end != PAGE_POISON)
break;
}
if (!__ratelimit(&ratelimit))
return;
else if (start == end && single_bit_flip(*start, PAGE_POISON))
printk(KERN_ERR "pagealloc: single bit error\n");
else
printk(KERN_ERR "pagealloc: memory corruption\n");
print_hex_dump(KERN_ERR, "", DUMP_PREFIX_ADDRESS, 16, 1, start,
end - start + 1, 1);
dump_stack();
}
static void unpoison_page(struct page *page)
{
void *addr;
if (!page_poison(page))
return;
addr = kmap_atomic(page);
check_poison_mem(addr, PAGE_SIZE);
clear_page_poison(page);
kunmap_atomic(addr);
}
static void unpoison_pages(struct page *page, int n)
{
int i;
for (i = 0; i < n; i++)
unpoison_page(page + i);
}
void kernel_map_pages(struct page *page, int numpages, int enable)
{
if (enable)
unpoison_pages(page, numpages);
else
poison_pages(page, numpages);
}

237
mm/debug.c Normal file
View file

@ -0,0 +1,237 @@
/*
* mm/debug.c
*
* mm/ specific debug routines.
*
*/
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/ftrace_event.h>
#include <linux/memcontrol.h>
static const struct trace_print_flags pageflag_names[] = {
{1UL << PG_locked, "locked" },
{1UL << PG_error, "error" },
{1UL << PG_referenced, "referenced" },
{1UL << PG_uptodate, "uptodate" },
{1UL << PG_dirty, "dirty" },
{1UL << PG_lru, "lru" },
{1UL << PG_active, "active" },
{1UL << PG_slab, "slab" },
{1UL << PG_owner_priv_1, "owner_priv_1" },
{1UL << PG_arch_1, "arch_1" },
{1UL << PG_reserved, "reserved" },
{1UL << PG_private, "private" },
{1UL << PG_private_2, "private_2" },
{1UL << PG_writeback, "writeback" },
#ifdef CONFIG_PAGEFLAGS_EXTENDED
{1UL << PG_head, "head" },
{1UL << PG_tail, "tail" },
#else
{1UL << PG_compound, "compound" },
#endif
{1UL << PG_swapcache, "swapcache" },
{1UL << PG_mappedtodisk, "mappedtodisk" },
{1UL << PG_reclaim, "reclaim" },
{1UL << PG_swapbacked, "swapbacked" },
{1UL << PG_unevictable, "unevictable" },
#ifdef CONFIG_MMU
{1UL << PG_mlocked, "mlocked" },
#endif
#ifdef CONFIG_ARCH_USES_PG_UNCACHED
{1UL << PG_uncached, "uncached" },
#endif
#ifdef CONFIG_MEMORY_FAILURE
{1UL << PG_hwpoison, "hwpoison" },
#endif
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
{1UL << PG_compound_lock, "compound_lock" },
#endif
};
static void dump_flags(unsigned long flags,
const struct trace_print_flags *names, int count)
{
const char *delim = "";
unsigned long mask;
int i;
pr_emerg("flags: %#lx(", flags);
/* remove zone id */
flags &= (1UL << NR_PAGEFLAGS) - 1;
for (i = 0; i < count && flags; i++) {
mask = names[i].mask;
if ((flags & mask) != mask)
continue;
flags &= ~mask;
pr_cont("%s%s", delim, names[i].name);
delim = "|";
}
/* check for left over flags */
if (flags)
pr_cont("%s%#lx", delim, flags);
pr_cont(")\n");
}
void dump_page_badflags(struct page *page, const char *reason,
unsigned long badflags)
{
pr_emerg("page:%p count:%d mapcount:%d mapping:%p index:%#lx\n",
page, atomic_read(&page->_count), page_mapcount(page),
page->mapping, page->index);
BUILD_BUG_ON(ARRAY_SIZE(pageflag_names) != __NR_PAGEFLAGS);
dump_flags(page->flags, pageflag_names, ARRAY_SIZE(pageflag_names));
if (reason)
pr_alert("page dumped because: %s\n", reason);
if (page->flags & badflags) {
pr_alert("bad because of flags:\n");
dump_flags(page->flags & badflags,
pageflag_names, ARRAY_SIZE(pageflag_names));
}
mem_cgroup_print_bad_page(page);
}
void dump_page(struct page *page, const char *reason)
{
dump_page_badflags(page, reason, 0);
}
EXPORT_SYMBOL(dump_page);
#ifdef CONFIG_DEBUG_VM
static const struct trace_print_flags vmaflags_names[] = {
{VM_READ, "read" },
{VM_WRITE, "write" },
{VM_EXEC, "exec" },
{VM_SHARED, "shared" },
{VM_MAYREAD, "mayread" },
{VM_MAYWRITE, "maywrite" },
{VM_MAYEXEC, "mayexec" },
{VM_MAYSHARE, "mayshare" },
{VM_GROWSDOWN, "growsdown" },
{VM_PFNMAP, "pfnmap" },
{VM_DENYWRITE, "denywrite" },
{VM_LOCKED, "locked" },
{VM_IO, "io" },
{VM_SEQ_READ, "seqread" },
{VM_RAND_READ, "randread" },
{VM_DONTCOPY, "dontcopy" },
{VM_DONTEXPAND, "dontexpand" },
{VM_ACCOUNT, "account" },
{VM_NORESERVE, "noreserve" },
{VM_HUGETLB, "hugetlb" },
{VM_NONLINEAR, "nonlinear" },
#if defined(CONFIG_X86)
{VM_PAT, "pat" },
#elif defined(CONFIG_PPC)
{VM_SAO, "sao" },
#elif defined(CONFIG_PARISC) || defined(CONFIG_METAG) || defined(CONFIG_IA64)
{VM_GROWSUP, "growsup" },
#elif !defined(CONFIG_MMU)
{VM_MAPPED_COPY, "mappedcopy" },
#else
{VM_ARCH_1, "arch_1" },
#endif
{VM_DONTDUMP, "dontdump" },
#ifdef CONFIG_MEM_SOFT_DIRTY
{VM_SOFTDIRTY, "softdirty" },
#endif
{VM_MIXEDMAP, "mixedmap" },
{VM_HUGEPAGE, "hugepage" },
{VM_NOHUGEPAGE, "nohugepage" },
{VM_MERGEABLE, "mergeable" },
};
void dump_vma(const struct vm_area_struct *vma)
{
pr_emerg("vma %p start %p end %p\n"
"next %p prev %p mm %p\n"
"prot %lx anon_vma %p vm_ops %p\n"
"pgoff %lx file %p private_data %p\n",
vma, (void *)vma->vm_start, (void *)vma->vm_end, vma->vm_next,
vma->vm_prev, vma->vm_mm,
(unsigned long)pgprot_val(vma->vm_page_prot),
vma->anon_vma, vma->vm_ops, vma->vm_pgoff,
vma->vm_file, vma->vm_private_data);
dump_flags(vma->vm_flags, vmaflags_names, ARRAY_SIZE(vmaflags_names));
}
EXPORT_SYMBOL(dump_vma);
void dump_mm(const struct mm_struct *mm)
{
pr_emerg("mm %p mmap %p seqnum %d task_size %lu\n"
#ifdef CONFIG_MMU
"get_unmapped_area %p\n"
#endif
"mmap_base %lu mmap_legacy_base %lu highest_vm_end %lu\n"
"pgd %p mm_users %d mm_count %d nr_ptes %lu map_count %d\n"
"hiwater_rss %lx hiwater_vm %lx total_vm %lx locked_vm %lx\n"
"pinned_vm %lx shared_vm %lx exec_vm %lx stack_vm %lx\n"
"start_code %lx end_code %lx start_data %lx end_data %lx\n"
"start_brk %lx brk %lx start_stack %lx\n"
"arg_start %lx arg_end %lx env_start %lx env_end %lx\n"
"binfmt %p flags %lx core_state %p\n"
#ifdef CONFIG_AIO
"ioctx_table %p\n"
#endif
#ifdef CONFIG_MEMCG
"owner %p "
#endif
"exe_file %p\n"
#ifdef CONFIG_MMU_NOTIFIER
"mmu_notifier_mm %p\n"
#endif
#ifdef CONFIG_NUMA_BALANCING
"numa_next_scan %lu numa_scan_offset %lu numa_scan_seq %d\n"
#endif
#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION)
"tlb_flush_pending %d\n"
#endif
"%s", /* This is here to hold the comma */
mm, mm->mmap, mm->vmacache_seqnum, mm->task_size,
#ifdef CONFIG_MMU
mm->get_unmapped_area,
#endif
mm->mmap_base, mm->mmap_legacy_base, mm->highest_vm_end,
mm->pgd, atomic_read(&mm->mm_users),
atomic_read(&mm->mm_count),
atomic_long_read((atomic_long_t *)&mm->nr_ptes),
mm->map_count,
mm->hiwater_rss, mm->hiwater_vm, mm->total_vm, mm->locked_vm,
mm->pinned_vm, mm->shared_vm, mm->exec_vm, mm->stack_vm,
mm->start_code, mm->end_code, mm->start_data, mm->end_data,
mm->start_brk, mm->brk, mm->start_stack,
mm->arg_start, mm->arg_end, mm->env_start, mm->env_end,
mm->binfmt, mm->flags, mm->core_state,
#ifdef CONFIG_AIO
mm->ioctx_table,
#endif
#ifdef CONFIG_MEMCG
mm->owner,
#endif
mm->exe_file,
#ifdef CONFIG_MMU_NOTIFIER
mm->mmu_notifier_mm,
#endif
#ifdef CONFIG_NUMA_BALANCING
mm->numa_next_scan, mm->numa_scan_offset, mm->numa_scan_seq,
#endif
#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION)
mm->tlb_flush_pending,
#endif
"" /* This is here to not have a comma! */
);
dump_flags(mm->def_flags, vmaflags_names,
ARRAY_SIZE(vmaflags_names));
}
#endif /* CONFIG_DEBUG_VM */

529
mm/dmapool.c Normal file
View file

@ -0,0 +1,529 @@
/*
* DMA Pool allocator
*
* Copyright 2001 David Brownell
* Copyright 2007 Intel Corporation
* Author: Matthew Wilcox <willy@linux.intel.com>
*
* This software may be redistributed and/or modified under the terms of
* the GNU General Public License ("GPL") version 2 as published by the
* Free Software Foundation.
*
* This allocator returns small blocks of a given size which are DMA-able by
* the given device. It uses the dma_alloc_coherent page allocator to get
* new pages, then splits them up into blocks of the required size.
* Many older drivers still have their own code to do this.
*
* The current design of this allocator is fairly simple. The pool is
* represented by the 'struct dma_pool' which keeps a doubly-linked list of
* allocated pages. Each page in the page_list is split into blocks of at
* least 'size' bytes. Free blocks are tracked in an unsorted singly-linked
* list of free blocks within the page. Used blocks aren't tracked, but we
* keep a count of how many are currently allocated from each page.
*/
#include <linux/device.h>
#include <linux/dma-mapping.h>
#include <linux/dmapool.h>
#include <linux/kernel.h>
#include <linux/list.h>
#include <linux/export.h>
#include <linux/mutex.h>
#include <linux/poison.h>
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/stat.h>
#include <linux/spinlock.h>
#include <linux/string.h>
#include <linux/types.h>
#include <linux/wait.h>
#if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB_DEBUG_ON)
#define DMAPOOL_DEBUG 1
#endif
struct dma_pool { /* the pool */
struct list_head page_list;
spinlock_t lock;
size_t size;
struct device *dev;
size_t allocation;
size_t boundary;
char name[32];
struct list_head pools;
};
struct dma_page { /* cacheable header for 'allocation' bytes */
struct list_head page_list;
void *vaddr;
dma_addr_t dma;
unsigned int in_use;
unsigned int offset;
};
static DEFINE_MUTEX(pools_lock);
static DEFINE_MUTEX(pools_reg_lock);
static ssize_t
show_pools(struct device *dev, struct device_attribute *attr, char *buf)
{
unsigned temp;
unsigned size;
char *next;
struct dma_page *page;
struct dma_pool *pool;
next = buf;
size = PAGE_SIZE;
temp = scnprintf(next, size, "poolinfo - 0.1\n");
size -= temp;
next += temp;
mutex_lock(&pools_lock);
list_for_each_entry(pool, &dev->dma_pools, pools) {
unsigned pages = 0;
unsigned blocks = 0;
spin_lock_irq(&pool->lock);
list_for_each_entry(page, &pool->page_list, page_list) {
pages++;
blocks += page->in_use;
}
spin_unlock_irq(&pool->lock);
/* per-pool info, no real statistics yet */
temp = scnprintf(next, size, "%-16s %4u %4Zu %4Zu %2u\n",
pool->name, blocks,
pages * (pool->allocation / pool->size),
pool->size, pages);
size -= temp;
next += temp;
}
mutex_unlock(&pools_lock);
return PAGE_SIZE - size;
}
static DEVICE_ATTR(pools, S_IRUGO, show_pools, NULL);
/**
* dma_pool_create - Creates a pool of consistent memory blocks, for dma.
* @name: name of pool, for diagnostics
* @dev: device that will be doing the DMA
* @size: size of the blocks in this pool.
* @align: alignment requirement for blocks; must be a power of two
* @boundary: returned blocks won't cross this power of two boundary
* Context: !in_interrupt()
*
* Returns a dma allocation pool with the requested characteristics, or
* null if one can't be created. Given one of these pools, dma_pool_alloc()
* may be used to allocate memory. Such memory will all have "consistent"
* DMA mappings, accessible by the device and its driver without using
* cache flushing primitives. The actual size of blocks allocated may be
* larger than requested because of alignment.
*
* If @boundary is nonzero, objects returned from dma_pool_alloc() won't
* cross that size boundary. This is useful for devices which have
* addressing restrictions on individual DMA transfers, such as not crossing
* boundaries of 4KBytes.
*/
struct dma_pool *dma_pool_create(const char *name, struct device *dev,
size_t size, size_t align, size_t boundary)
{
struct dma_pool *retval;
size_t allocation;
bool empty = false;
if (align == 0)
align = 1;
else if (align & (align - 1))
return NULL;
if (size == 0)
return NULL;
else if (size < 4)
size = 4;
if ((size % align) != 0)
size = ALIGN(size, align);
allocation = max_t(size_t, size, PAGE_SIZE);
if (!boundary)
boundary = allocation;
else if ((boundary < size) || (boundary & (boundary - 1)))
return NULL;
retval = kmalloc_node(sizeof(*retval), GFP_KERNEL, dev_to_node(dev));
if (!retval)
return retval;
strlcpy(retval->name, name, sizeof(retval->name));
retval->dev = dev;
INIT_LIST_HEAD(&retval->page_list);
spin_lock_init(&retval->lock);
retval->size = size;
retval->boundary = boundary;
retval->allocation = allocation;
INIT_LIST_HEAD(&retval->pools);
/*
* pools_lock ensures that the ->dma_pools list does not get corrupted.
* pools_reg_lock ensures that there is not a race between
* dma_pool_create() and dma_pool_destroy() or within dma_pool_create()
* when the first invocation of dma_pool_create() failed on
* device_create_file() and the second assumes that it has been done (I
* know it is a short window).
*/
mutex_lock(&pools_reg_lock);
mutex_lock(&pools_lock);
if (list_empty(&dev->dma_pools))
empty = true;
list_add(&retval->pools, &dev->dma_pools);
mutex_unlock(&pools_lock);
if (empty) {
int err;
err = device_create_file(dev, &dev_attr_pools);
if (err) {
mutex_lock(&pools_lock);
list_del(&retval->pools);
mutex_unlock(&pools_lock);
mutex_unlock(&pools_reg_lock);
kfree(retval);
return NULL;
}
}
mutex_unlock(&pools_reg_lock);
return retval;
}
EXPORT_SYMBOL(dma_pool_create);
static void pool_initialise_page(struct dma_pool *pool, struct dma_page *page)
{
unsigned int offset = 0;
unsigned int next_boundary = pool->boundary;
do {
unsigned int next = offset + pool->size;
if (unlikely((next + pool->size) >= next_boundary)) {
next = next_boundary;
next_boundary += pool->boundary;
}
*(int *)(page->vaddr + offset) = next;
offset = next;
} while (offset < pool->allocation);
}
static struct dma_page *pool_alloc_page(struct dma_pool *pool, gfp_t mem_flags)
{
struct dma_page *page;
page = kmalloc(sizeof(*page), mem_flags);
if (!page)
return NULL;
page->vaddr = dma_alloc_coherent(pool->dev, pool->allocation,
&page->dma, mem_flags);
if (page->vaddr) {
#ifdef DMAPOOL_DEBUG
memset(page->vaddr, POOL_POISON_FREED, pool->allocation);
#endif
pool_initialise_page(pool, page);
page->in_use = 0;
page->offset = 0;
} else {
kfree(page);
page = NULL;
}
return page;
}
static inline int is_page_busy(struct dma_page *page)
{
return page->in_use != 0;
}
static void pool_free_page(struct dma_pool *pool, struct dma_page *page)
{
dma_addr_t dma = page->dma;
#ifdef DMAPOOL_DEBUG
memset(page->vaddr, POOL_POISON_FREED, pool->allocation);
#endif
dma_free_coherent(pool->dev, pool->allocation, page->vaddr, dma);
list_del(&page->page_list);
kfree(page);
}
/**
* dma_pool_destroy - destroys a pool of dma memory blocks.
* @pool: dma pool that will be destroyed
* Context: !in_interrupt()
*
* Caller guarantees that no more memory from the pool is in use,
* and that nothing will try to use the pool after this call.
*/
void dma_pool_destroy(struct dma_pool *pool)
{
bool empty = false;
mutex_lock(&pools_reg_lock);
mutex_lock(&pools_lock);
list_del(&pool->pools);
if (pool->dev && list_empty(&pool->dev->dma_pools))
empty = true;
mutex_unlock(&pools_lock);
if (empty)
device_remove_file(pool->dev, &dev_attr_pools);
mutex_unlock(&pools_reg_lock);
while (!list_empty(&pool->page_list)) {
struct dma_page *page;
page = list_entry(pool->page_list.next,
struct dma_page, page_list);
if (is_page_busy(page)) {
if (pool->dev)
dev_err(pool->dev,
"dma_pool_destroy %s, %p busy\n",
pool->name, page->vaddr);
else
printk(KERN_ERR
"dma_pool_destroy %s, %p busy\n",
pool->name, page->vaddr);
/* leak the still-in-use consistent memory */
list_del(&page->page_list);
kfree(page);
} else
pool_free_page(pool, page);
}
kfree(pool);
}
EXPORT_SYMBOL(dma_pool_destroy);
/**
* dma_pool_alloc - get a block of consistent memory
* @pool: dma pool that will produce the block
* @mem_flags: GFP_* bitmask
* @handle: pointer to dma address of block
*
* This returns the kernel virtual address of a currently unused block,
* and reports its dma address through the handle.
* If such a memory block can't be allocated, %NULL is returned.
*/
void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags,
dma_addr_t *handle)
{
unsigned long flags;
struct dma_page *page;
size_t offset;
void *retval;
might_sleep_if(mem_flags & __GFP_WAIT);
spin_lock_irqsave(&pool->lock, flags);
list_for_each_entry(page, &pool->page_list, page_list) {
if (page->offset < pool->allocation)
goto ready;
}
/* pool_alloc_page() might sleep, so temporarily drop &pool->lock */
spin_unlock_irqrestore(&pool->lock, flags);
page = pool_alloc_page(pool, mem_flags);
if (!page)
return NULL;
spin_lock_irqsave(&pool->lock, flags);
list_add(&page->page_list, &pool->page_list);
ready:
page->in_use++;
offset = page->offset;
page->offset = *(int *)(page->vaddr + offset);
retval = offset + page->vaddr;
*handle = offset + page->dma;
#ifdef DMAPOOL_DEBUG
{
int i;
u8 *data = retval;
/* page->offset is stored in first 4 bytes */
for (i = sizeof(page->offset); i < pool->size; i++) {
if (data[i] == POOL_POISON_FREED)
continue;
if (pool->dev)
dev_err(pool->dev,
"dma_pool_alloc %s, %p (corrupted)\n",
pool->name, retval);
else
pr_err("dma_pool_alloc %s, %p (corrupted)\n",
pool->name, retval);
/*
* Dump the first 4 bytes even if they are not
* POOL_POISON_FREED
*/
print_hex_dump(KERN_ERR, "", DUMP_PREFIX_OFFSET, 16, 1,
data, pool->size, 1);
break;
}
}
memset(retval, POOL_POISON_ALLOCATED, pool->size);
#endif
spin_unlock_irqrestore(&pool->lock, flags);
return retval;
}
EXPORT_SYMBOL(dma_pool_alloc);
static struct dma_page *pool_find_page(struct dma_pool *pool, dma_addr_t dma)
{
struct dma_page *page;
list_for_each_entry(page, &pool->page_list, page_list) {
if (dma < page->dma)
continue;
if (dma < (page->dma + pool->allocation))
return page;
}
return NULL;
}
/**
* dma_pool_free - put block back into dma pool
* @pool: the dma pool holding the block
* @vaddr: virtual address of block
* @dma: dma address of block
*
* Caller promises neither device nor driver will again touch this block
* unless it is first re-allocated.
*/
void dma_pool_free(struct dma_pool *pool, void *vaddr, dma_addr_t dma)
{
struct dma_page *page;
unsigned long flags;
unsigned int offset;
spin_lock_irqsave(&pool->lock, flags);
page = pool_find_page(pool, dma);
if (!page) {
spin_unlock_irqrestore(&pool->lock, flags);
if (pool->dev)
dev_err(pool->dev,
"dma_pool_free %s, %p/%lx (bad dma)\n",
pool->name, vaddr, (unsigned long)dma);
else
printk(KERN_ERR "dma_pool_free %s, %p/%lx (bad dma)\n",
pool->name, vaddr, (unsigned long)dma);
return;
}
offset = vaddr - page->vaddr;
#ifdef DMAPOOL_DEBUG
if ((dma - page->dma) != offset) {
spin_unlock_irqrestore(&pool->lock, flags);
if (pool->dev)
dev_err(pool->dev,
"dma_pool_free %s, %p (bad vaddr)/%Lx\n",
pool->name, vaddr, (unsigned long long)dma);
else
printk(KERN_ERR
"dma_pool_free %s, %p (bad vaddr)/%Lx\n",
pool->name, vaddr, (unsigned long long)dma);
return;
}
{
unsigned int chain = page->offset;
while (chain < pool->allocation) {
if (chain != offset) {
chain = *(int *)(page->vaddr + chain);
continue;
}
spin_unlock_irqrestore(&pool->lock, flags);
if (pool->dev)
dev_err(pool->dev, "dma_pool_free %s, dma %Lx "
"already free\n", pool->name,
(unsigned long long)dma);
else
printk(KERN_ERR "dma_pool_free %s, dma %Lx "
"already free\n", pool->name,
(unsigned long long)dma);
return;
}
}
memset(vaddr, POOL_POISON_FREED, pool->size);
#endif
page->in_use--;
*(int *)vaddr = page->offset;
page->offset = offset;
/*
* Resist a temptation to do
* if (!is_page_busy(page)) pool_free_page(pool, page);
* Better have a few empty pages hang around.
*/
spin_unlock_irqrestore(&pool->lock, flags);
}
EXPORT_SYMBOL(dma_pool_free);
/*
* Managed DMA pool
*/
static void dmam_pool_release(struct device *dev, void *res)
{
struct dma_pool *pool = *(struct dma_pool **)res;
dma_pool_destroy(pool);
}
static int dmam_pool_match(struct device *dev, void *res, void *match_data)
{
return *(struct dma_pool **)res == match_data;
}
/**
* dmam_pool_create - Managed dma_pool_create()
* @name: name of pool, for diagnostics
* @dev: device that will be doing the DMA
* @size: size of the blocks in this pool.
* @align: alignment requirement for blocks; must be a power of two
* @allocation: returned blocks won't cross this boundary (or zero)
*
* Managed dma_pool_create(). DMA pool created with this function is
* automatically destroyed on driver detach.
*/
struct dma_pool *dmam_pool_create(const char *name, struct device *dev,
size_t size, size_t align, size_t allocation)
{
struct dma_pool **ptr, *pool;
ptr = devres_alloc(dmam_pool_release, sizeof(*ptr), GFP_KERNEL);
if (!ptr)
return NULL;
pool = *ptr = dma_pool_create(name, dev, size, align, allocation);
if (pool)
devres_add(dev, ptr);
else
devres_free(ptr);
return pool;
}
EXPORT_SYMBOL(dmam_pool_create);
/**
* dmam_pool_destroy - Managed dma_pool_destroy()
* @pool: dma pool that will be destroyed
*
* Managed dma_pool_destroy().
*/
void dmam_pool_destroy(struct dma_pool *pool)
{
struct device *dev = pool->dev;
WARN_ON(devres_release(dev, dmam_pool_release, dmam_pool_match, pool));
}
EXPORT_SYMBOL(dmam_pool_destroy);

245
mm/early_ioremap.c Normal file
View file

@ -0,0 +1,245 @@
/*
* Provide common bits of early_ioremap() support for architectures needing
* temporary mappings during boot before ioremap() is available.
*
* This is mostly a direct copy of the x86 early_ioremap implementation.
*
* (C) Copyright 1995 1996, 2014 Linus Torvalds
*
*/
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/io.h>
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>
#include <asm/fixmap.h>
#ifdef CONFIG_MMU
static int early_ioremap_debug __initdata;
static int __init early_ioremap_debug_setup(char *str)
{
early_ioremap_debug = 1;
return 0;
}
early_param("early_ioremap_debug", early_ioremap_debug_setup);
static int after_paging_init __initdata;
void __init __weak early_ioremap_shutdown(void)
{
}
void __init early_ioremap_reset(void)
{
early_ioremap_shutdown();
after_paging_init = 1;
}
/*
* Generally, ioremap() is available after paging_init() has been called.
* Architectures wanting to allow early_ioremap after paging_init() can
* define __late_set_fixmap and __late_clear_fixmap to do the right thing.
*/
#ifndef __late_set_fixmap
static inline void __init __late_set_fixmap(enum fixed_addresses idx,
phys_addr_t phys, pgprot_t prot)
{
BUG();
}
#endif
#ifndef __late_clear_fixmap
static inline void __init __late_clear_fixmap(enum fixed_addresses idx)
{
BUG();
}
#endif
static void __iomem *prev_map[FIX_BTMAPS_SLOTS] __initdata;
static unsigned long prev_size[FIX_BTMAPS_SLOTS] __initdata;
static unsigned long slot_virt[FIX_BTMAPS_SLOTS] __initdata;
void __init early_ioremap_setup(void)
{
int i;
for (i = 0; i < FIX_BTMAPS_SLOTS; i++)
if (WARN_ON(prev_map[i]))
break;
for (i = 0; i < FIX_BTMAPS_SLOTS; i++)
slot_virt[i] = __fix_to_virt(FIX_BTMAP_BEGIN - NR_FIX_BTMAPS*i);
}
static int __init check_early_ioremap_leak(void)
{
int count = 0;
int i;
for (i = 0; i < FIX_BTMAPS_SLOTS; i++)
if (prev_map[i])
count++;
if (WARN(count, KERN_WARNING
"Debug warning: early ioremap leak of %d areas detected.\n"
"please boot with early_ioremap_debug and report the dmesg.\n",
count))
return 1;
return 0;
}
late_initcall(check_early_ioremap_leak);
static void __init __iomem *
__early_ioremap(resource_size_t phys_addr, unsigned long size, pgprot_t prot)
{
unsigned long offset;
resource_size_t last_addr;
unsigned int nrpages;
enum fixed_addresses idx;
int i, slot;
WARN_ON(system_state != SYSTEM_BOOTING);
slot = -1;
for (i = 0; i < FIX_BTMAPS_SLOTS; i++) {
if (!prev_map[i]) {
slot = i;
break;
}
}
if (WARN(slot < 0, "%s(%08llx, %08lx) not found slot\n",
__func__, (u64)phys_addr, size))
return NULL;
/* Don't allow wraparound or zero size */
last_addr = phys_addr + size - 1;
if (WARN_ON(!size || last_addr < phys_addr))
return NULL;
prev_size[slot] = size;
/*
* Mappings have to be page-aligned
*/
offset = phys_addr & ~PAGE_MASK;
phys_addr &= PAGE_MASK;
size = PAGE_ALIGN(last_addr + 1) - phys_addr;
/*
* Mappings have to fit in the FIX_BTMAP area.
*/
nrpages = size >> PAGE_SHIFT;
if (WARN_ON(nrpages > NR_FIX_BTMAPS))
return NULL;
/*
* Ok, go for it..
*/
idx = FIX_BTMAP_BEGIN - NR_FIX_BTMAPS*slot;
while (nrpages > 0) {
if (after_paging_init)
__late_set_fixmap(idx, phys_addr, prot);
else
__early_set_fixmap(idx, phys_addr, prot);
phys_addr += PAGE_SIZE;
--idx;
--nrpages;
}
WARN(early_ioremap_debug, "%s(%08llx, %08lx) [%d] => %08lx + %08lx\n",
__func__, (u64)phys_addr, size, slot, offset, slot_virt[slot]);
prev_map[slot] = (void __iomem *)(offset + slot_virt[slot]);
return prev_map[slot];
}
void __init early_iounmap(void __iomem *addr, unsigned long size)
{
unsigned long virt_addr;
unsigned long offset;
unsigned int nrpages;
enum fixed_addresses idx;
int i, slot;
slot = -1;
for (i = 0; i < FIX_BTMAPS_SLOTS; i++) {
if (prev_map[i] == addr) {
slot = i;
break;
}
}
if (WARN(slot < 0, "early_iounmap(%p, %08lx) not found slot\n",
addr, size))
return;
if (WARN(prev_size[slot] != size,
"early_iounmap(%p, %08lx) [%d] size not consistent %08lx\n",
addr, size, slot, prev_size[slot]))
return;
WARN(early_ioremap_debug, "early_iounmap(%p, %08lx) [%d]\n",
addr, size, slot);
virt_addr = (unsigned long)addr;
if (WARN_ON(virt_addr < fix_to_virt(FIX_BTMAP_BEGIN)))
return;
offset = virt_addr & ~PAGE_MASK;
nrpages = PAGE_ALIGN(offset + size) >> PAGE_SHIFT;
idx = FIX_BTMAP_BEGIN - NR_FIX_BTMAPS*slot;
while (nrpages > 0) {
if (after_paging_init)
__late_clear_fixmap(idx);
else
__early_set_fixmap(idx, 0, FIXMAP_PAGE_CLEAR);
--idx;
--nrpages;
}
prev_map[slot] = NULL;
}
/* Remap an IO device */
void __init __iomem *
early_ioremap(resource_size_t phys_addr, unsigned long size)
{
return __early_ioremap(phys_addr, size, FIXMAP_PAGE_IO);
}
/* Remap memory */
void __init *
early_memremap(resource_size_t phys_addr, unsigned long size)
{
return (__force void *)__early_ioremap(phys_addr, size,
FIXMAP_PAGE_NORMAL);
}
#else /* CONFIG_MMU */
void __init __iomem *
early_ioremap(resource_size_t phys_addr, unsigned long size)
{
return (__force void __iomem *)phys_addr;
}
/* Remap memory */
void __init *
early_memremap(resource_size_t phys_addr, unsigned long size)
{
return (void *)phys_addr;
}
void __init early_iounmap(void __iomem *addr, unsigned long size)
{
}
#endif /* CONFIG_MMU */
void __init early_memunmap(void *addr, unsigned long size)
{
early_iounmap((__force void __iomem *)addr, size);
}

156
mm/fadvise.c Normal file
View file

@ -0,0 +1,156 @@
/*
* mm/fadvise.c
*
* Copyright (C) 2002, Linus Torvalds
*
* 11Jan2003 Andrew Morton
* Initial version.
*/
#include <linux/kernel.h>
#include <linux/file.h>
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/backing-dev.h>
#include <linux/pagevec.h>
#include <linux/fadvise.h>
#include <linux/writeback.h>
#include <linux/syscalls.h>
#include <linux/swap.h>
#include <asm/unistd.h>
/*
* POSIX_FADV_WILLNEED could set PG_Referenced, and POSIX_FADV_NOREUSE could
* deactivate the pages and clear PG_Referenced.
*/
SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
{
struct fd f = fdget(fd);
struct address_space *mapping;
struct backing_dev_info *bdi;
loff_t endbyte; /* inclusive */
pgoff_t start_index;
pgoff_t end_index;
unsigned long nrpages;
int ret = 0;
if (!f.file)
return -EBADF;
if (S_ISFIFO(file_inode(f.file)->i_mode)) {
ret = -ESPIPE;
goto out;
}
mapping = f.file->f_mapping;
if (!mapping || len < 0) {
ret = -EINVAL;
goto out;
}
if (mapping->a_ops->get_xip_mem) {
switch (advice) {
case POSIX_FADV_NORMAL:
case POSIX_FADV_RANDOM:
case POSIX_FADV_SEQUENTIAL:
case POSIX_FADV_WILLNEED:
case POSIX_FADV_NOREUSE:
case POSIX_FADV_DONTNEED:
/* no bad return value, but ignore advice */
break;
default:
ret = -EINVAL;
}
goto out;
}
/* Careful about overflows. Len == 0 means "as much as possible" */
endbyte = offset + len;
if (!len || endbyte < len)
endbyte = -1;
else
endbyte--; /* inclusive */
bdi = mapping->backing_dev_info;
switch (advice) {
case POSIX_FADV_NORMAL:
f.file->f_ra.ra_pages = bdi->ra_pages;
spin_lock(&f.file->f_lock);
f.file->f_mode &= ~FMODE_RANDOM;
spin_unlock(&f.file->f_lock);
break;
case POSIX_FADV_RANDOM:
spin_lock(&f.file->f_lock);
f.file->f_mode |= FMODE_RANDOM;
spin_unlock(&f.file->f_lock);
break;
case POSIX_FADV_SEQUENTIAL:
f.file->f_ra.ra_pages = bdi->ra_pages * 2;
spin_lock(&f.file->f_lock);
f.file->f_mode &= ~FMODE_RANDOM;
spin_unlock(&f.file->f_lock);
break;
case POSIX_FADV_WILLNEED:
/* First and last PARTIAL page! */
start_index = offset >> PAGE_CACHE_SHIFT;
end_index = endbyte >> PAGE_CACHE_SHIFT;
/* Careful about overflow on the "+1" */
nrpages = end_index - start_index + 1;
if (!nrpages)
nrpages = ~0UL;
/*
* Ignore return value because fadvise() shall return
* success even if filesystem can't retrieve a hint,
*/
force_page_cache_readahead(mapping, f.file, start_index,
nrpages);
break;
case POSIX_FADV_NOREUSE:
break;
case POSIX_FADV_DONTNEED:
if (!bdi_write_congested(mapping->backing_dev_info))
__filemap_fdatawrite_range(mapping, offset, endbyte,
WB_SYNC_NONE);
/* First and last FULL page! */
start_index = (offset+(PAGE_CACHE_SIZE-1)) >> PAGE_CACHE_SHIFT;
end_index = (endbyte >> PAGE_CACHE_SHIFT);
if (end_index >= start_index) {
unsigned long count = invalidate_mapping_pages(mapping,
start_index, end_index);
/*
* If fewer pages were invalidated than expected then
* it is possible that some of the pages were on
* a per-cpu pagevec for a remote CPU. Drain all
* pagevecs and try again.
*/
if (count < (end_index - start_index + 1)) {
lru_add_drain_all();
invalidate_mapping_pages(mapping, start_index,
end_index);
}
}
break;
default:
ret = -EINVAL;
}
out:
fdput(f);
return ret;
}
#ifdef __ARCH_WANT_SYS_FADVISE64
SYSCALL_DEFINE4(fadvise64, int, fd, loff_t, offset, size_t, len, int, advice)
{
return sys_fadvise64_64(fd, offset, len, advice);
}
#endif

60
mm/failslab.c Normal file
View file

@ -0,0 +1,60 @@
#include <linux/fault-inject.h>
#include <linux/slab.h>
static struct {
struct fault_attr attr;
u32 ignore_gfp_wait;
int cache_filter;
} failslab = {
.attr = FAULT_ATTR_INITIALIZER,
.ignore_gfp_wait = 1,
.cache_filter = 0,
};
bool should_failslab(size_t size, gfp_t gfpflags, unsigned long cache_flags)
{
if (gfpflags & __GFP_NOFAIL)
return false;
if (failslab.ignore_gfp_wait && (gfpflags & __GFP_WAIT))
return false;
if (failslab.cache_filter && !(cache_flags & SLAB_FAILSLAB))
return false;
return should_fail(&failslab.attr, size);
}
static int __init setup_failslab(char *str)
{
return setup_fault_attr(&failslab.attr, str);
}
__setup("failslab=", setup_failslab);
#ifdef CONFIG_FAULT_INJECTION_DEBUG_FS
static int __init failslab_debugfs_init(void)
{
struct dentry *dir;
umode_t mode = S_IFREG | S_IRUSR | S_IWUSR;
dir = fault_create_debugfs_attr("failslab", NULL, &failslab.attr);
if (IS_ERR(dir))
return PTR_ERR(dir);
if (!debugfs_create_bool("ignore-gfp-wait", mode, dir,
&failslab.ignore_gfp_wait))
goto fail;
if (!debugfs_create_bool("cache-filter", mode, dir,
&failslab.cache_filter))
goto fail;
return 0;
fail:
debugfs_remove_recursive(dir);
return -ENOMEM;
}
late_initcall(failslab_debugfs_init);
#endif /* CONFIG_FAULT_INJECTION_DEBUG_FS */

2722
mm/filemap.c Normal file

File diff suppressed because it is too large Load diff

483
mm/filemap_xip.c Normal file
View file

@ -0,0 +1,483 @@
/*
* linux/mm/filemap_xip.c
*
* Copyright (C) 2005 IBM Corporation
* Author: Carsten Otte <cotte@de.ibm.com>
*
* derived from linux/mm/filemap.c - Copyright (C) Linus Torvalds
*
*/
#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/export.h>
#include <linux/uio.h>
#include <linux/rmap.h>
#include <linux/mmu_notifier.h>
#include <linux/sched.h>
#include <linux/seqlock.h>
#include <linux/mutex.h>
#include <linux/gfp.h>
#include <asm/tlbflush.h>
#include <asm/io.h>
/*
* We do use our own empty page to avoid interference with other users
* of ZERO_PAGE(), such as /dev/zero
*/
static DEFINE_MUTEX(xip_sparse_mutex);
static seqcount_t xip_sparse_seq = SEQCNT_ZERO(xip_sparse_seq);
static struct page *__xip_sparse_page;
/* called under xip_sparse_mutex */
static struct page *xip_sparse_page(void)
{
if (!__xip_sparse_page) {
struct page *page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
if (page)
__xip_sparse_page = page;
}
return __xip_sparse_page;
}
/*
* This is a file read routine for execute in place files, and uses
* the mapping->a_ops->get_xip_mem() function for the actual low-level
* stuff.
*
* Note the struct file* is not used at all. It may be NULL.
*/
static ssize_t
do_xip_mapping_read(struct address_space *mapping,
struct file_ra_state *_ra,
struct file *filp,
char __user *buf,
size_t len,
loff_t *ppos)
{
struct inode *inode = mapping->host;
pgoff_t index, end_index;
unsigned long offset;
loff_t isize, pos;
size_t copied = 0, error = 0;
BUG_ON(!mapping->a_ops->get_xip_mem);
pos = *ppos;
index = pos >> PAGE_CACHE_SHIFT;
offset = pos & ~PAGE_CACHE_MASK;
isize = i_size_read(inode);
if (!isize)
goto out;
end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
do {
unsigned long nr, left;
void *xip_mem;
unsigned long xip_pfn;
int zero = 0;
/* nr is the maximum number of bytes to copy from this page */
nr = PAGE_CACHE_SIZE;
if (index >= end_index) {
if (index > end_index)
goto out;
nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1;
if (nr <= offset) {
goto out;
}
}
nr = nr - offset;
if (nr > len - copied)
nr = len - copied;
error = mapping->a_ops->get_xip_mem(mapping, index, 0,
&xip_mem, &xip_pfn);
if (unlikely(error)) {
if (error == -ENODATA) {
/* sparse */
zero = 1;
} else
goto out;
}
/* If users can be writing to this page using arbitrary
* virtual addresses, take care about potential aliasing
* before reading the page on the kernel side.
*/
if (mapping_writably_mapped(mapping))
/* address based flush */ ;
/*
* Ok, we have the mem, so now we can copy it to user space...
*
* The actor routine returns how many bytes were actually used..
* NOTE! This may not be the same as how much of a user buffer
* we filled up (we may be padding etc), so we can only update
* "pos" here (the actor routine has to update the user buffer
* pointers and the remaining count).
*/
if (!zero)
left = __copy_to_user(buf+copied, xip_mem+offset, nr);
else
left = __clear_user(buf + copied, nr);
if (left) {
error = -EFAULT;
goto out;
}
copied += (nr - left);
offset += (nr - left);
index += offset >> PAGE_CACHE_SHIFT;
offset &= ~PAGE_CACHE_MASK;
} while (copied < len);
out:
*ppos = pos + copied;
if (filp)
file_accessed(filp);
return (copied ? copied : error);
}
ssize_t
xip_file_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos)
{
if (!access_ok(VERIFY_WRITE, buf, len))
return -EFAULT;
return do_xip_mapping_read(filp->f_mapping, &filp->f_ra, filp,
buf, len, ppos);
}
EXPORT_SYMBOL_GPL(xip_file_read);
/*
* __xip_unmap is invoked from xip_unmap and
* xip_write
*
* This function walks all vmas of the address_space and unmaps the
* __xip_sparse_page when found at pgoff.
*/
static void
__xip_unmap (struct address_space * mapping,
unsigned long pgoff)
{
struct vm_area_struct *vma;
struct mm_struct *mm;
unsigned long address;
pte_t *pte;
pte_t pteval;
spinlock_t *ptl;
struct page *page;
unsigned count;
int locked = 0;
count = read_seqcount_begin(&xip_sparse_seq);
page = __xip_sparse_page;
if (!page)
return;
retry:
mutex_lock(&mapping->i_mmap_mutex);
vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
mm = vma->vm_mm;
address = vma->vm_start +
((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
BUG_ON(address < vma->vm_start || address >= vma->vm_end);
pte = page_check_address(page, mm, address, &ptl, 1);
if (pte) {
/* Nuke the page table entry. */
flush_cache_page(vma, address, pte_pfn(*pte));
pteval = ptep_clear_flush(vma, address, pte);
page_remove_rmap(page);
dec_mm_counter(mm, MM_FILEPAGES);
BUG_ON(pte_dirty(pteval));
pte_unmap_unlock(pte, ptl);
/* must invalidate_page _before_ freeing the page */
mmu_notifier_invalidate_page(mm, address);
page_cache_release(page);
}
}
mutex_unlock(&mapping->i_mmap_mutex);
if (locked) {
mutex_unlock(&xip_sparse_mutex);
} else if (read_seqcount_retry(&xip_sparse_seq, count)) {
mutex_lock(&xip_sparse_mutex);
locked = 1;
goto retry;
}
}
/*
* xip_fault() is invoked via the vma operations vector for a
* mapped memory region to read in file data during a page fault.
*
* This function is derived from filemap_fault, but used for execute in place
*/
static int xip_file_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
struct file *file = vma->vm_file;
struct address_space *mapping = file->f_mapping;
struct inode *inode = mapping->host;
pgoff_t size;
void *xip_mem;
unsigned long xip_pfn;
struct page *page;
int error;
/* XXX: are VM_FAULT_ codes OK? */
again:
size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
if (vmf->pgoff >= size)
return VM_FAULT_SIGBUS;
error = mapping->a_ops->get_xip_mem(mapping, vmf->pgoff, 0,
&xip_mem, &xip_pfn);
if (likely(!error))
goto found;
if (error != -ENODATA)
return VM_FAULT_OOM;
/* sparse block */
if ((vma->vm_flags & (VM_WRITE | VM_MAYWRITE)) &&
(vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) &&
(!(mapping->host->i_sb->s_flags & MS_RDONLY))) {
int err;
/* maybe shared writable, allocate new block */
mutex_lock(&xip_sparse_mutex);
error = mapping->a_ops->get_xip_mem(mapping, vmf->pgoff, 1,
&xip_mem, &xip_pfn);
mutex_unlock(&xip_sparse_mutex);
if (error)
return VM_FAULT_SIGBUS;
/* unmap sparse mappings at pgoff from all other vmas */
__xip_unmap(mapping, vmf->pgoff);
found:
err = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address,
xip_pfn);
if (err == -ENOMEM)
return VM_FAULT_OOM;
/*
* err == -EBUSY is fine, we've raced against another thread
* that faulted-in the same page
*/
if (err != -EBUSY)
BUG_ON(err);
return VM_FAULT_NOPAGE;
} else {
int err, ret = VM_FAULT_OOM;
mutex_lock(&xip_sparse_mutex);
write_seqcount_begin(&xip_sparse_seq);
error = mapping->a_ops->get_xip_mem(mapping, vmf->pgoff, 0,
&xip_mem, &xip_pfn);
if (unlikely(!error)) {
write_seqcount_end(&xip_sparse_seq);
mutex_unlock(&xip_sparse_mutex);
goto again;
}
if (error != -ENODATA)
goto out;
/* not shared and writable, use xip_sparse_page() */
page = xip_sparse_page();
if (!page)
goto out;
err = vm_insert_page(vma, (unsigned long)vmf->virtual_address,
page);
if (err == -ENOMEM)
goto out;
ret = VM_FAULT_NOPAGE;
out:
write_seqcount_end(&xip_sparse_seq);
mutex_unlock(&xip_sparse_mutex);
return ret;
}
}
static const struct vm_operations_struct xip_file_vm_ops = {
.fault = xip_file_fault,
.page_mkwrite = filemap_page_mkwrite,
.remap_pages = generic_file_remap_pages,
};
int xip_file_mmap(struct file * file, struct vm_area_struct * vma)
{
BUG_ON(!file->f_mapping->a_ops->get_xip_mem);
file_accessed(file);
vma->vm_ops = &xip_file_vm_ops;
vma->vm_flags |= VM_MIXEDMAP;
return 0;
}
EXPORT_SYMBOL_GPL(xip_file_mmap);
static ssize_t
__xip_file_write(struct file *filp, const char __user *buf,
size_t count, loff_t pos, loff_t *ppos)
{
struct address_space * mapping = filp->f_mapping;
const struct address_space_operations *a_ops = mapping->a_ops;
struct inode *inode = mapping->host;
long status = 0;
size_t bytes;
ssize_t written = 0;
BUG_ON(!mapping->a_ops->get_xip_mem);
do {
unsigned long index;
unsigned long offset;
size_t copied;
void *xip_mem;
unsigned long xip_pfn;
offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
index = pos >> PAGE_CACHE_SHIFT;
bytes = PAGE_CACHE_SIZE - offset;
if (bytes > count)
bytes = count;
status = a_ops->get_xip_mem(mapping, index, 0,
&xip_mem, &xip_pfn);
if (status == -ENODATA) {
/* we allocate a new page unmap it */
mutex_lock(&xip_sparse_mutex);
status = a_ops->get_xip_mem(mapping, index, 1,
&xip_mem, &xip_pfn);
mutex_unlock(&xip_sparse_mutex);
if (!status)
/* unmap page at pgoff from all other vmas */
__xip_unmap(mapping, index);
}
if (status)
break;
copied = bytes -
__copy_from_user_nocache(xip_mem + offset, buf, bytes);
if (likely(copied > 0)) {
status = copied;
if (status >= 0) {
written += status;
count -= status;
pos += status;
buf += status;
}
}
if (unlikely(copied != bytes))
if (status >= 0)
status = -EFAULT;
if (status < 0)
break;
} while (count);
*ppos = pos;
/*
* No need to use i_size_read() here, the i_size
* cannot change under us because we hold i_mutex.
*/
if (pos > inode->i_size) {
i_size_write(inode, pos);
mark_inode_dirty(inode);
}
return written ? written : status;
}
ssize_t
xip_file_write(struct file *filp, const char __user *buf, size_t len,
loff_t *ppos)
{
struct address_space *mapping = filp->f_mapping;
struct inode *inode = mapping->host;
size_t count;
loff_t pos;
ssize_t ret;
mutex_lock(&inode->i_mutex);
if (!access_ok(VERIFY_READ, buf, len)) {
ret=-EFAULT;
goto out_up;
}
pos = *ppos;
count = len;
/* We can write back this queue in page reclaim */
current->backing_dev_info = mapping->backing_dev_info;
ret = generic_write_checks(filp, &pos, &count, S_ISBLK(inode->i_mode));
if (ret)
goto out_backing;
if (count == 0)
goto out_backing;
ret = file_remove_suid(filp);
if (ret)
goto out_backing;
ret = file_update_time(filp);
if (ret)
goto out_backing;
ret = __xip_file_write (filp, buf, count, pos, ppos);
out_backing:
current->backing_dev_info = NULL;
out_up:
mutex_unlock(&inode->i_mutex);
return ret;
}
EXPORT_SYMBOL_GPL(xip_file_write);
/*
* truncate a page used for execute in place
* functionality is analog to block_truncate_page but does use get_xip_mem
* to get the page instead of page cache
*/
int
xip_truncate_page(struct address_space *mapping, loff_t from)
{
pgoff_t index = from >> PAGE_CACHE_SHIFT;
unsigned offset = from & (PAGE_CACHE_SIZE-1);
unsigned blocksize;
unsigned length;
void *xip_mem;
unsigned long xip_pfn;
int err;
BUG_ON(!mapping->a_ops->get_xip_mem);
blocksize = 1 << mapping->host->i_blkbits;
length = offset & (blocksize - 1);
/* Block boundary? Nothing to do */
if (!length)
return 0;
length = blocksize - length;
err = mapping->a_ops->get_xip_mem(mapping, index, 0,
&xip_mem, &xip_pfn);
if (unlikely(err)) {
if (err == -ENODATA)
/* Hole? No need to truncate */
return 0;
else
return err;
}
memset(xip_mem + offset, 0, length);
return 0;
}
EXPORT_SYMBOL_GPL(xip_truncate_page);

283
mm/fremap.c Normal file
View file

@ -0,0 +1,283 @@
/*
* linux/mm/fremap.c
*
* Explicit pagetable population and nonlinear (random) mappings support.
*
* started by Ingo Molnar, Copyright (C) 2002, 2003
*/
#include <linux/export.h>
#include <linux/backing-dev.h>
#include <linux/mm.h>
#include <linux/swap.h>
#include <linux/file.h>
#include <linux/mman.h>
#include <linux/pagemap.h>
#include <linux/swapops.h>
#include <linux/rmap.h>
#include <linux/syscalls.h>
#include <linux/mmu_notifier.h>
#include <asm/mmu_context.h>
#include <asm/cacheflush.h>
#include <asm/tlbflush.h>
#include "internal.h"
static int mm_counter(struct page *page)
{
return PageAnon(page) ? MM_ANONPAGES : MM_FILEPAGES;
}
static void zap_pte(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep)
{
pte_t pte = *ptep;
struct page *page;
swp_entry_t entry;
if (pte_present(pte)) {
flush_cache_page(vma, addr, pte_pfn(pte));
pte = ptep_clear_flush(vma, addr, ptep);
page = vm_normal_page(vma, addr, pte);
if (page) {
if (pte_dirty(pte))
set_page_dirty(page);
update_hiwater_rss(mm);
dec_mm_counter(mm, mm_counter(page));
page_remove_rmap(page);
page_cache_release(page);
}
} else { /* zap_pte() is not called when pte_none() */
if (!pte_file(pte)) {
update_hiwater_rss(mm);
entry = pte_to_swp_entry(pte);
if (non_swap_entry(entry)) {
if (is_migration_entry(entry)) {
page = migration_entry_to_page(entry);
dec_mm_counter(mm, mm_counter(page));
}
} else {
free_swap_and_cache(entry);
dec_mm_counter(mm, MM_SWAPENTS);
}
}
pte_clear_not_present_full(mm, addr, ptep, 0);
}
}
/*
* Install a file pte to a given virtual memory address, release any
* previously existing mapping.
*/
static int install_file_pte(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, unsigned long pgoff, pgprot_t prot)
{
int err = -ENOMEM;
pte_t *pte, ptfile;
spinlock_t *ptl;
pte = get_locked_pte(mm, addr, &ptl);
if (!pte)
goto out;
ptfile = pgoff_to_pte(pgoff);
if (!pte_none(*pte))
zap_pte(mm, vma, addr, pte);
set_pte_at(mm, addr, pte, pte_file_mksoft_dirty(ptfile));
/*
* We don't need to run update_mmu_cache() here because the "file pte"
* being installed by install_file_pte() is not a real pte - it's a
* non-present entry (like a swap entry), noting what file offset should
* be mapped there when there's a fault (in a non-linear vma where
* that's not obvious).
*/
pte_unmap_unlock(pte, ptl);
err = 0;
out:
return err;
}
int generic_file_remap_pages(struct vm_area_struct *vma, unsigned long addr,
unsigned long size, pgoff_t pgoff)
{
struct mm_struct *mm = vma->vm_mm;
int err;
do {
err = install_file_pte(mm, vma, addr, pgoff, vma->vm_page_prot);
if (err)
return err;
size -= PAGE_SIZE;
addr += PAGE_SIZE;
pgoff++;
} while (size);
return 0;
}
EXPORT_SYMBOL(generic_file_remap_pages);
/**
* sys_remap_file_pages - remap arbitrary pages of an existing VM_SHARED vma
* @start: start of the remapped virtual memory range
* @size: size of the remapped virtual memory range
* @prot: new protection bits of the range (see NOTE)
* @pgoff: to-be-mapped page of the backing store file
* @flags: 0 or MAP_NONBLOCKED - the later will cause no IO.
*
* sys_remap_file_pages remaps arbitrary pages of an existing VM_SHARED vma
* (shared backing store file).
*
* This syscall works purely via pagetables, so it's the most efficient
* way to map the same (large) file into a given virtual window. Unlike
* mmap()/mremap() it does not create any new vmas. The new mappings are
* also safe across swapout.
*
* NOTE: the @prot parameter right now is ignored (but must be zero),
* and the vma's default protection is used. Arbitrary protections
* might be implemented in the future.
*/
SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
unsigned long, prot, unsigned long, pgoff, unsigned long, flags)
{
struct mm_struct *mm = current->mm;
struct address_space *mapping;
struct vm_area_struct *vma;
int err = -EINVAL;
int has_write_lock = 0;
vm_flags_t vm_flags = 0;
pr_warn_once("%s (%d) uses deprecated remap_file_pages() syscall. "
"See Documentation/vm/remap_file_pages.txt.\n",
current->comm, current->pid);
if (prot)
return err;
/*
* Sanitize the syscall parameters:
*/
start = start & PAGE_MASK;
size = size & PAGE_MASK;
/* Does the address range wrap, or is the span zero-sized? */
if (start + size <= start)
return err;
/* Does pgoff wrap? */
if (pgoff + (size >> PAGE_SHIFT) < pgoff)
return err;
/* Can we represent this offset inside this architecture's pte's? */
#if PTE_FILE_MAX_BITS < BITS_PER_LONG
if (pgoff + (size >> PAGE_SHIFT) >= (1UL << PTE_FILE_MAX_BITS))
return err;
#endif
/* We need down_write() to change vma->vm_flags. */
down_read(&mm->mmap_sem);
retry:
vma = find_vma(mm, start);
/*
* Make sure the vma is shared, that it supports prefaulting,
* and that the remapped range is valid and fully within
* the single existing vma.
*/
if (!vma || !(vma->vm_flags & VM_SHARED))
goto out;
if (!vma->vm_ops || !vma->vm_ops->remap_pages)
goto out;
if (start < vma->vm_start || start + size > vma->vm_end)
goto out;
/* Must set VM_NONLINEAR before any pages are populated. */
if (!(vma->vm_flags & VM_NONLINEAR)) {
/*
* vm_private_data is used as a swapout cursor
* in a VM_NONLINEAR vma.
*/
if (vma->vm_private_data)
goto out;
/* Don't need a nonlinear mapping, exit success */
if (pgoff == linear_page_index(vma, start)) {
err = 0;
goto out;
}
if (!has_write_lock) {
get_write_lock:
up_read(&mm->mmap_sem);
down_write(&mm->mmap_sem);
has_write_lock = 1;
goto retry;
}
mapping = vma->vm_file->f_mapping;
/*
* page_mkclean doesn't work on nonlinear vmas, so if
* dirty pages need to be accounted, emulate with linear
* vmas.
*/
if (mapping_cap_account_dirty(mapping)) {
unsigned long addr;
struct file *file = get_file(vma->vm_file);
/* mmap_region may free vma; grab the info now */
vm_flags = vma->vm_flags;
addr = mmap_region(file, start, size, vm_flags, pgoff);
fput(file);
if (IS_ERR_VALUE(addr)) {
err = addr;
} else {
BUG_ON(addr != start);
err = 0;
}
goto out_freed;
}
mutex_lock(&mapping->i_mmap_mutex);
flush_dcache_mmap_lock(mapping);
vma->vm_flags |= VM_NONLINEAR;
vma_interval_tree_remove(vma, &mapping->i_mmap);
vma_nonlinear_insert(vma, &mapping->i_mmap_nonlinear);
flush_dcache_mmap_unlock(mapping);
mutex_unlock(&mapping->i_mmap_mutex);
}
if (vma->vm_flags & VM_LOCKED) {
/*
* drop PG_Mlocked flag for over-mapped range
*/
if (!has_write_lock)
goto get_write_lock;
vm_flags = vma->vm_flags;
munlock_vma_pages_range(vma, start, start + size);
vma->vm_flags = vm_flags;
}
mmu_notifier_invalidate_range_start(mm, start, start + size);
err = vma->vm_ops->remap_pages(vma, start, size, pgoff);
mmu_notifier_invalidate_range_end(mm, start, start + size);
/*
* We can't clear VM_NONLINEAR because we'd have to do
* it after ->populate completes, and that would prevent
* downgrading the lock. (Locks can't be upgraded).
*/
out:
if (vma)
vm_flags = vma->vm_flags;
out_freed:
if (likely(!has_write_lock))
up_read(&mm->mmap_sem);
else
up_write(&mm->mmap_sem);
if (!err && ((vm_flags & VM_LOCKED) || !(flags & MAP_NONBLOCK)))
mm_populate(start, size);
return err;
}

457
mm/frontswap.c Normal file
View file

@ -0,0 +1,457 @@
/*
* Frontswap frontend
*
* This code provides the generic "frontend" layer to call a matching
* "backend" driver implementation of frontswap. See
* Documentation/vm/frontswap.txt for more information.
*
* Copyright (C) 2009-2012 Oracle Corp. All rights reserved.
* Author: Dan Magenheimer
*
* This work is licensed under the terms of the GNU GPL, version 2.
*/
#include <linux/mman.h>
#include <linux/swap.h>
#include <linux/swapops.h>
#include <linux/security.h>
#include <linux/module.h>
#include <linux/debugfs.h>
#include <linux/frontswap.h>
#include <linux/swapfile.h>
/*
* frontswap_ops is set by frontswap_register_ops to contain the pointers
* to the frontswap "backend" implementation functions.
*/
static struct frontswap_ops *frontswap_ops __read_mostly;
/*
* If enabled, frontswap_store will return failure even on success. As
* a result, the swap subsystem will always write the page to swap, in
* effect converting frontswap into a writethrough cache. In this mode,
* there is no direct reduction in swap writes, but a frontswap backend
* can unilaterally "reclaim" any pages in use with no data loss, thus
* providing increases control over maximum memory usage due to frontswap.
*/
static bool frontswap_writethrough_enabled __read_mostly;
/*
* If enabled, the underlying tmem implementation is capable of doing
* exclusive gets, so frontswap_load, on a successful tmem_get must
* mark the page as no longer in frontswap AND mark it dirty.
*/
static bool frontswap_tmem_exclusive_gets_enabled __read_mostly;
#ifdef CONFIG_DEBUG_FS
/*
* Counters available via /sys/kernel/debug/frontswap (if debugfs is
* properly configured). These are for information only so are not protected
* against increment races.
*/
static u64 frontswap_loads;
static u64 frontswap_succ_stores;
static u64 frontswap_failed_stores;
static u64 frontswap_invalidates;
static inline void inc_frontswap_loads(void) {
frontswap_loads++;
}
static inline void inc_frontswap_succ_stores(void) {
frontswap_succ_stores++;
}
static inline void inc_frontswap_failed_stores(void) {
frontswap_failed_stores++;
}
static inline void inc_frontswap_invalidates(void) {
frontswap_invalidates++;
}
#else
static inline void inc_frontswap_loads(void) { }
static inline void inc_frontswap_succ_stores(void) { }
static inline void inc_frontswap_failed_stores(void) { }
static inline void inc_frontswap_invalidates(void) { }
#endif
/*
* Due to the asynchronous nature of the backends loading potentially
* _after_ the swap system has been activated, we have chokepoints
* on all frontswap functions to not call the backend until the backend
* has registered.
*
* Specifically when no backend is registered (nobody called
* frontswap_register_ops) all calls to frontswap_init (which is done via
* swapon -> enable_swap_info -> frontswap_init) are registered and remembered
* (via the setting of need_init bitmap) but fail to create tmem_pools. When a
* backend registers with frontswap at some later point the previous
* calls to frontswap_init are executed (by iterating over the need_init
* bitmap) to create tmem_pools and set the respective poolids. All of that is
* guarded by us using atomic bit operations on the 'need_init' bitmap.
*
* This would not guards us against the user deciding to call swapoff right as
* we are calling the backend to initialize (so swapon is in action).
* Fortunatly for us, the swapon_mutex has been taked by the callee so we are
* OK. The other scenario where calls to frontswap_store (called via
* swap_writepage) is racing with frontswap_invalidate_area (called via
* swapoff) is again guarded by the swap subsystem.
*
* While no backend is registered all calls to frontswap_[store|load|
* invalidate_area|invalidate_page] are ignored or fail.
*
* The time between the backend being registered and the swap file system
* calling the backend (via the frontswap_* functions) is indeterminate as
* frontswap_ops is not atomic_t (or a value guarded by a spinlock).
* That is OK as we are comfortable missing some of these calls to the newly
* registered backend.
*
* Obviously the opposite (unloading the backend) must be done after all
* the frontswap_[store|load|invalidate_area|invalidate_page] start
* ignorning or failing the requests - at which point frontswap_ops
* would have to be made in some fashion atomic.
*/
static DECLARE_BITMAP(need_init, MAX_SWAPFILES);
/*
* Register operations for frontswap, returning previous thus allowing
* detection of multiple backends and possible nesting.
*/
struct frontswap_ops *frontswap_register_ops(struct frontswap_ops *ops)
{
struct frontswap_ops *old = frontswap_ops;
int i;
for (i = 0; i < MAX_SWAPFILES; i++) {
if (test_and_clear_bit(i, need_init)) {
struct swap_info_struct *sis = swap_info[i];
/* __frontswap_init _should_ have set it! */
if (!sis->frontswap_map)
return ERR_PTR(-EINVAL);
ops->init(i);
}
}
/*
* We MUST have frontswap_ops set _after_ the frontswap_init's
* have been called. Otherwise __frontswap_store might fail. Hence
* the barrier to make sure compiler does not re-order us.
*/
barrier();
frontswap_ops = ops;
return old;
}
EXPORT_SYMBOL(frontswap_register_ops);
/*
* Enable/disable frontswap writethrough (see above).
*/
void frontswap_writethrough(bool enable)
{
frontswap_writethrough_enabled = enable;
}
EXPORT_SYMBOL(frontswap_writethrough);
/*
* Enable/disable frontswap exclusive gets (see above).
*/
void frontswap_tmem_exclusive_gets(bool enable)
{
frontswap_tmem_exclusive_gets_enabled = enable;
}
EXPORT_SYMBOL(frontswap_tmem_exclusive_gets);
/*
* Called when a swap device is swapon'd.
*/
void __frontswap_init(unsigned type, unsigned long *map)
{
struct swap_info_struct *sis = swap_info[type];
BUG_ON(sis == NULL);
/*
* p->frontswap is a bitmap that we MUST have to figure out which page
* has gone in frontswap. Without it there is no point of continuing.
*/
if (WARN_ON(!map))
return;
/*
* Irregardless of whether the frontswap backend has been loaded
* before this function or it will be later, we _MUST_ have the
* p->frontswap set to something valid to work properly.
*/
frontswap_map_set(sis, map);
if (frontswap_ops)
frontswap_ops->init(type);
else {
BUG_ON(type > MAX_SWAPFILES);
set_bit(type, need_init);
}
}
EXPORT_SYMBOL(__frontswap_init);
bool __frontswap_test(struct swap_info_struct *sis,
pgoff_t offset)
{
bool ret = false;
if (frontswap_ops && sis->frontswap_map)
ret = test_bit(offset, sis->frontswap_map);
return ret;
}
EXPORT_SYMBOL(__frontswap_test);
static inline void __frontswap_clear(struct swap_info_struct *sis,
pgoff_t offset)
{
clear_bit(offset, sis->frontswap_map);
atomic_dec(&sis->frontswap_pages);
}
/*
* "Store" data from a page to frontswap and associate it with the page's
* swaptype and offset. Page must be locked and in the swap cache.
* If frontswap already contains a page with matching swaptype and
* offset, the frontswap implementation may either overwrite the data and
* return success or invalidate the page from frontswap and return failure.
*/
int __frontswap_store(struct page *page)
{
int ret = -1, dup = 0;
swp_entry_t entry = { .val = page_private(page), };
int type = swp_type(entry);
struct swap_info_struct *sis = swap_info[type];
pgoff_t offset = swp_offset(entry);
/*
* Return if no backend registed.
* Don't need to inc frontswap_failed_stores here.
*/
if (!frontswap_ops)
return ret;
BUG_ON(!PageLocked(page));
BUG_ON(sis == NULL);
if (__frontswap_test(sis, offset))
dup = 1;
ret = frontswap_ops->store(type, offset, page);
if (ret == 0) {
set_bit(offset, sis->frontswap_map);
inc_frontswap_succ_stores();
if (!dup)
atomic_inc(&sis->frontswap_pages);
} else {
/*
failed dup always results in automatic invalidate of
the (older) page from frontswap
*/
inc_frontswap_failed_stores();
if (dup) {
__frontswap_clear(sis, offset);
frontswap_ops->invalidate_page(type, offset);
}
}
if (frontswap_writethrough_enabled)
/* report failure so swap also writes to swap device */
ret = -1;
return ret;
}
EXPORT_SYMBOL(__frontswap_store);
/*
* "Get" data from frontswap associated with swaptype and offset that were
* specified when the data was put to frontswap and use it to fill the
* specified page with data. Page must be locked and in the swap cache.
*/
int __frontswap_load(struct page *page)
{
int ret = -1;
swp_entry_t entry = { .val = page_private(page), };
int type = swp_type(entry);
struct swap_info_struct *sis = swap_info[type];
pgoff_t offset = swp_offset(entry);
BUG_ON(!PageLocked(page));
BUG_ON(sis == NULL);
/*
* __frontswap_test() will check whether there is backend registered
*/
if (__frontswap_test(sis, offset))
ret = frontswap_ops->load(type, offset, page);
if (ret == 0) {
inc_frontswap_loads();
if (frontswap_tmem_exclusive_gets_enabled) {
SetPageDirty(page);
__frontswap_clear(sis, offset);
}
}
return ret;
}
EXPORT_SYMBOL(__frontswap_load);
/*
* Invalidate any data from frontswap associated with the specified swaptype
* and offset so that a subsequent "get" will fail.
*/
void __frontswap_invalidate_page(unsigned type, pgoff_t offset)
{
struct swap_info_struct *sis = swap_info[type];
BUG_ON(sis == NULL);
/*
* __frontswap_test() will check whether there is backend registered
*/
if (__frontswap_test(sis, offset)) {
frontswap_ops->invalidate_page(type, offset);
__frontswap_clear(sis, offset);
inc_frontswap_invalidates();
}
}
EXPORT_SYMBOL(__frontswap_invalidate_page);
/*
* Invalidate all data from frontswap associated with all offsets for the
* specified swaptype.
*/
void __frontswap_invalidate_area(unsigned type)
{
struct swap_info_struct *sis = swap_info[type];
if (frontswap_ops) {
BUG_ON(sis == NULL);
if (sis->frontswap_map == NULL)
return;
frontswap_ops->invalidate_area(type);
atomic_set(&sis->frontswap_pages, 0);
bitmap_zero(sis->frontswap_map, sis->max);
}
clear_bit(type, need_init);
}
EXPORT_SYMBOL(__frontswap_invalidate_area);
static unsigned long __frontswap_curr_pages(void)
{
unsigned long totalpages = 0;
struct swap_info_struct *si = NULL;
assert_spin_locked(&swap_lock);
plist_for_each_entry(si, &swap_active_head, list)
totalpages += atomic_read(&si->frontswap_pages);
return totalpages;
}
static int __frontswap_unuse_pages(unsigned long total, unsigned long *unused,
int *swapid)
{
int ret = -EINVAL;
struct swap_info_struct *si = NULL;
int si_frontswap_pages;
unsigned long total_pages_to_unuse = total;
unsigned long pages = 0, pages_to_unuse = 0;
assert_spin_locked(&swap_lock);
plist_for_each_entry(si, &swap_active_head, list) {
si_frontswap_pages = atomic_read(&si->frontswap_pages);
if (total_pages_to_unuse < si_frontswap_pages) {
pages = pages_to_unuse = total_pages_to_unuse;
} else {
pages = si_frontswap_pages;
pages_to_unuse = 0; /* unuse all */
}
/* ensure there is enough RAM to fetch pages from frontswap */
if (security_vm_enough_memory_mm(current->mm, pages)) {
ret = -ENOMEM;
continue;
}
vm_unacct_memory(pages);
*unused = pages_to_unuse;
*swapid = si->type;
ret = 0;
break;
}
return ret;
}
/*
* Used to check if it's necessory and feasible to unuse pages.
* Return 1 when nothing to do, 0 when need to shink pages,
* error code when there is an error.
*/
static int __frontswap_shrink(unsigned long target_pages,
unsigned long *pages_to_unuse,
int *type)
{
unsigned long total_pages = 0, total_pages_to_unuse;
assert_spin_locked(&swap_lock);
total_pages = __frontswap_curr_pages();
if (total_pages <= target_pages) {
/* Nothing to do */
*pages_to_unuse = 0;
return 1;
}
total_pages_to_unuse = total_pages - target_pages;
return __frontswap_unuse_pages(total_pages_to_unuse, pages_to_unuse, type);
}
/*
* Frontswap, like a true swap device, may unnecessarily retain pages
* under certain circumstances; "shrink" frontswap is essentially a
* "partial swapoff" and works by calling try_to_unuse to attempt to
* unuse enough frontswap pages to attempt to -- subject to memory
* constraints -- reduce the number of pages in frontswap to the
* number given in the parameter target_pages.
*/
void frontswap_shrink(unsigned long target_pages)
{
unsigned long pages_to_unuse = 0;
int uninitialized_var(type), ret;
/*
* we don't want to hold swap_lock while doing a very
* lengthy try_to_unuse, but swap_list may change
* so restart scan from swap_active_head each time
*/
spin_lock(&swap_lock);
ret = __frontswap_shrink(target_pages, &pages_to_unuse, &type);
spin_unlock(&swap_lock);
if (ret == 0)
try_to_unuse(type, true, pages_to_unuse);
return;
}
EXPORT_SYMBOL(frontswap_shrink);
/*
* Count and return the number of frontswap pages across all
* swap devices. This is exported so that backend drivers can
* determine current usage without reading debugfs.
*/
unsigned long frontswap_curr_pages(void)
{
unsigned long totalpages = 0;
spin_lock(&swap_lock);
totalpages = __frontswap_curr_pages();
spin_unlock(&swap_lock);
return totalpages;
}
EXPORT_SYMBOL(frontswap_curr_pages);
static int __init init_frontswap(void)
{
#ifdef CONFIG_DEBUG_FS
struct dentry *root = debugfs_create_dir("frontswap", NULL);
if (root == NULL)
return -ENXIO;
debugfs_create_u64("loads", S_IRUGO, root, &frontswap_loads);
debugfs_create_u64("succ_stores", S_IRUGO, root, &frontswap_succ_stores);
debugfs_create_u64("failed_stores", S_IRUGO, root,
&frontswap_failed_stores);
debugfs_create_u64("invalidates", S_IRUGO,
root, &frontswap_invalidates);
#endif
return 0;
}
module_init(init_frontswap);

1124
mm/gup.c Normal file

File diff suppressed because it is too large Load diff

488
mm/highmem.c Normal file
View file

@ -0,0 +1,488 @@
/*
* High memory handling common code and variables.
*
* (C) 1999 Andrea Arcangeli, SuSE GmbH, andrea@suse.de
* Gerhard Wichert, Siemens AG, Gerhard.Wichert@pdb.siemens.de
*
*
* Redesigned the x86 32-bit VM architecture to deal with
* 64-bit physical space. With current x86 CPUs this
* means up to 64 Gigabytes physical RAM.
*
* Rewrote high memory support to move the page cache into
* high memory. Implemented permanent (schedulable) kmaps
* based on Linus' idea.
*
* Copyright (C) 1999 Ingo Molnar <mingo@redhat.com>
*/
#include <linux/mm.h>
#include <linux/export.h>
#include <linux/swap.h>
#include <linux/bio.h>
#include <linux/pagemap.h>
#include <linux/mempool.h>
#include <linux/blkdev.h>
#include <linux/init.h>
#include <linux/hash.h>
#include <linux/highmem.h>
#include <linux/kgdb.h>
#include <asm/tlbflush.h>
#if defined(CONFIG_HIGHMEM) || defined(CONFIG_X86_32)
DEFINE_PER_CPU(int, __kmap_atomic_idx);
#endif
/*
* Virtual_count is not a pure "count".
* 0 means that it is not mapped, and has not been mapped
* since a TLB flush - it is usable.
* 1 means that there are no users, but it has been mapped
* since the last TLB flush - so we can't use it.
* n means that there are (n-1) current users of it.
*/
#ifdef CONFIG_HIGHMEM
/*
* Architecture with aliasing data cache may define the following family of
* helper functions in its asm/highmem.h to control cache color of virtual
* addresses where physical memory pages are mapped by kmap.
*/
#ifndef get_pkmap_color
/*
* Determine color of virtual address where the page should be mapped.
*/
static inline unsigned int get_pkmap_color(struct page *page)
{
return 0;
}
#define get_pkmap_color get_pkmap_color
/*
* Get next index for mapping inside PKMAP region for page with given color.
*/
static inline unsigned int get_next_pkmap_nr(unsigned int color)
{
static unsigned int last_pkmap_nr;
last_pkmap_nr = (last_pkmap_nr + 1) & LAST_PKMAP_MASK;
return last_pkmap_nr;
}
/*
* Determine if page index inside PKMAP region (pkmap_nr) of given color
* has wrapped around PKMAP region end. When this happens an attempt to
* flush all unused PKMAP slots is made.
*/
static inline int no_more_pkmaps(unsigned int pkmap_nr, unsigned int color)
{
return pkmap_nr == 0;
}
/*
* Get the number of PKMAP entries of the given color. If no free slot is
* found after checking that many entries, kmap will sleep waiting for
* someone to call kunmap and free PKMAP slot.
*/
static inline int get_pkmap_entries_count(unsigned int color)
{
return LAST_PKMAP;
}
/*
* Get head of a wait queue for PKMAP entries of the given color.
* Wait queues for different mapping colors should be independent to avoid
* unnecessary wakeups caused by freeing of slots of other colors.
*/
static inline wait_queue_head_t *get_pkmap_wait_queue_head(unsigned int color)
{
static DECLARE_WAIT_QUEUE_HEAD(pkmap_map_wait);
return &pkmap_map_wait;
}
#endif
unsigned long totalhigh_pages __read_mostly;
EXPORT_SYMBOL(totalhigh_pages);
EXPORT_PER_CPU_SYMBOL(__kmap_atomic_idx);
unsigned int nr_free_highpages (void)
{
pg_data_t *pgdat;
unsigned int pages = 0;
for_each_online_pgdat(pgdat) {
pages += zone_page_state(&pgdat->node_zones[ZONE_HIGHMEM],
NR_FREE_PAGES);
if (zone_movable_is_highmem())
pages += zone_page_state(
&pgdat->node_zones[ZONE_MOVABLE],
NR_FREE_PAGES);
}
return pages;
}
static int pkmap_count[LAST_PKMAP];
static __cacheline_aligned_in_smp DEFINE_SPINLOCK(kmap_lock);
pte_t * pkmap_page_table;
/*
* Most architectures have no use for kmap_high_get(), so let's abstract
* the disabling of IRQ out of the locking in that case to save on a
* potential useless overhead.
*/
#ifdef ARCH_NEEDS_KMAP_HIGH_GET
#define lock_kmap() spin_lock_irq(&kmap_lock)
#define unlock_kmap() spin_unlock_irq(&kmap_lock)
#define lock_kmap_any(flags) spin_lock_irqsave(&kmap_lock, flags)
#define unlock_kmap_any(flags) spin_unlock_irqrestore(&kmap_lock, flags)
#else
#define lock_kmap() spin_lock(&kmap_lock)
#define unlock_kmap() spin_unlock(&kmap_lock)
#define lock_kmap_any(flags) \
do { spin_lock(&kmap_lock); (void)(flags); } while (0)
#define unlock_kmap_any(flags) \
do { spin_unlock(&kmap_lock); (void)(flags); } while (0)
#endif
struct page *kmap_to_page(void *vaddr)
{
unsigned long addr = (unsigned long)vaddr;
if (addr >= PKMAP_ADDR(0) && addr < PKMAP_ADDR(LAST_PKMAP)) {
int i = PKMAP_NR(addr);
return pte_page(pkmap_page_table[i]);
}
return virt_to_page(addr);
}
EXPORT_SYMBOL(kmap_to_page);
static void flush_all_zero_pkmaps(void)
{
int i;
int need_flush = 0;
flush_cache_kmaps();
for (i = 0; i < LAST_PKMAP; i++) {
struct page *page;
/*
* zero means we don't have anything to do,
* >1 means that it is still in use. Only
* a count of 1 means that it is free but
* needs to be unmapped
*/
if (pkmap_count[i] != 1)
continue;
pkmap_count[i] = 0;
/* sanity check */
BUG_ON(pte_none(pkmap_page_table[i]));
/*
* Don't need an atomic fetch-and-clear op here;
* no-one has the page mapped, and cannot get at
* its virtual address (and hence PTE) without first
* getting the kmap_lock (which is held here).
* So no dangers, even with speculative execution.
*/
page = pte_page(pkmap_page_table[i]);
pte_clear(&init_mm, PKMAP_ADDR(i), &pkmap_page_table[i]);
set_page_address(page, NULL);
need_flush = 1;
}
if (need_flush)
flush_tlb_kernel_range(PKMAP_ADDR(0), PKMAP_ADDR(LAST_PKMAP));
}
/**
* kmap_flush_unused - flush all unused kmap mappings in order to remove stray mappings
*/
void kmap_flush_unused(void)
{
lock_kmap();
flush_all_zero_pkmaps();
unlock_kmap();
}
static inline unsigned long map_new_virtual(struct page *page)
{
unsigned long vaddr;
int count;
unsigned int last_pkmap_nr;
unsigned int color = get_pkmap_color(page);
start:
count = get_pkmap_entries_count(color);
/* Find an empty entry */
for (;;) {
last_pkmap_nr = get_next_pkmap_nr(color);
if (no_more_pkmaps(last_pkmap_nr, color)) {
flush_all_zero_pkmaps();
count = get_pkmap_entries_count(color);
}
if (!pkmap_count[last_pkmap_nr])
break; /* Found a usable entry */
if (--count)
continue;
/*
* Sleep for somebody else to unmap their entries
*/
{
DECLARE_WAITQUEUE(wait, current);
wait_queue_head_t *pkmap_map_wait =
get_pkmap_wait_queue_head(color);
__set_current_state(TASK_UNINTERRUPTIBLE);
add_wait_queue(pkmap_map_wait, &wait);
unlock_kmap();
schedule();
remove_wait_queue(pkmap_map_wait, &wait);
lock_kmap();
/* Somebody else might have mapped it while we slept */
if (page_address(page))
return (unsigned long)page_address(page);
/* Re-start */
goto start;
}
}
vaddr = PKMAP_ADDR(last_pkmap_nr);
set_pte_at(&init_mm, vaddr,
&(pkmap_page_table[last_pkmap_nr]), mk_pte(page, kmap_prot));
pkmap_count[last_pkmap_nr] = 1;
set_page_address(page, (void *)vaddr);
return vaddr;
}
/**
* kmap_high - map a highmem page into memory
* @page: &struct page to map
*
* Returns the page's virtual memory address.
*
* We cannot call this from interrupts, as it may block.
*/
void *kmap_high(struct page *page)
{
unsigned long vaddr;
/*
* For highmem pages, we can't trust "virtual" until
* after we have the lock.
*/
lock_kmap();
vaddr = (unsigned long)page_address(page);
if (!vaddr)
vaddr = map_new_virtual(page);
pkmap_count[PKMAP_NR(vaddr)]++;
BUG_ON(pkmap_count[PKMAP_NR(vaddr)] < 2);
unlock_kmap();
return (void*) vaddr;
}
EXPORT_SYMBOL(kmap_high);
#ifdef ARCH_NEEDS_KMAP_HIGH_GET
/**
* kmap_high_get - pin a highmem page into memory
* @page: &struct page to pin
*
* Returns the page's current virtual memory address, or NULL if no mapping
* exists. If and only if a non null address is returned then a
* matching call to kunmap_high() is necessary.
*
* This can be called from any context.
*/
void *kmap_high_get(struct page *page)
{
unsigned long vaddr, flags;
lock_kmap_any(flags);
vaddr = (unsigned long)page_address(page);
if (vaddr) {
BUG_ON(pkmap_count[PKMAP_NR(vaddr)] < 1);
pkmap_count[PKMAP_NR(vaddr)]++;
}
unlock_kmap_any(flags);
return (void*) vaddr;
}
#endif
/**
* kunmap_high - unmap a highmem page into memory
* @page: &struct page to unmap
*
* If ARCH_NEEDS_KMAP_HIGH_GET is not defined then this may be called
* only from user context.
*/
void kunmap_high(struct page *page)
{
unsigned long vaddr;
unsigned long nr;
unsigned long flags;
int need_wakeup;
unsigned int color = get_pkmap_color(page);
wait_queue_head_t *pkmap_map_wait;
lock_kmap_any(flags);
vaddr = (unsigned long)page_address(page);
BUG_ON(!vaddr);
nr = PKMAP_NR(vaddr);
/*
* A count must never go down to zero
* without a TLB flush!
*/
need_wakeup = 0;
switch (--pkmap_count[nr]) {
case 0:
BUG();
case 1:
/*
* Avoid an unnecessary wake_up() function call.
* The common case is pkmap_count[] == 1, but
* no waiters.
* The tasks queued in the wait-queue are guarded
* by both the lock in the wait-queue-head and by
* the kmap_lock. As the kmap_lock is held here,
* no need for the wait-queue-head's lock. Simply
* test if the queue is empty.
*/
pkmap_map_wait = get_pkmap_wait_queue_head(color);
need_wakeup = waitqueue_active(pkmap_map_wait);
}
unlock_kmap_any(flags);
/* do wake-up, if needed, race-free outside of the spin lock */
if (need_wakeup)
wake_up(pkmap_map_wait);
}
EXPORT_SYMBOL(kunmap_high);
#endif
#if defined(HASHED_PAGE_VIRTUAL)
#define PA_HASH_ORDER 7
/*
* Describes one page->virtual association
*/
struct page_address_map {
struct page *page;
void *virtual;
struct list_head list;
};
static struct page_address_map page_address_maps[LAST_PKMAP];
/*
* Hash table bucket
*/
static struct page_address_slot {
struct list_head lh; /* List of page_address_maps */
spinlock_t lock; /* Protect this bucket's list */
} ____cacheline_aligned_in_smp page_address_htable[1<<PA_HASH_ORDER];
static struct page_address_slot *page_slot(const struct page *page)
{
return &page_address_htable[hash_ptr(page, PA_HASH_ORDER)];
}
/**
* page_address - get the mapped virtual address of a page
* @page: &struct page to get the virtual address of
*
* Returns the page's virtual address.
*/
void *page_address(const struct page *page)
{
unsigned long flags;
void *ret;
struct page_address_slot *pas;
if (!PageHighMem(page))
return lowmem_page_address(page);
pas = page_slot(page);
ret = NULL;
spin_lock_irqsave(&pas->lock, flags);
if (!list_empty(&pas->lh)) {
struct page_address_map *pam;
list_for_each_entry(pam, &pas->lh, list) {
if (pam->page == page) {
ret = pam->virtual;
goto done;
}
}
}
done:
spin_unlock_irqrestore(&pas->lock, flags);
return ret;
}
EXPORT_SYMBOL(page_address);
/**
* set_page_address - set a page's virtual address
* @page: &struct page to set
* @virtual: virtual address to use
*/
void set_page_address(struct page *page, void *virtual)
{
unsigned long flags;
struct page_address_slot *pas;
struct page_address_map *pam;
BUG_ON(!PageHighMem(page));
pas = page_slot(page);
if (virtual) { /* Add */
pam = &page_address_maps[PKMAP_NR((unsigned long)virtual)];
pam->page = page;
pam->virtual = virtual;
spin_lock_irqsave(&pas->lock, flags);
list_add_tail(&pam->list, &pas->lh);
spin_unlock_irqrestore(&pas->lock, flags);
} else { /* Remove */
spin_lock_irqsave(&pas->lock, flags);
list_for_each_entry(pam, &pas->lh, list) {
if (pam->page == page) {
list_del(&pam->list);
spin_unlock_irqrestore(&pas->lock, flags);
goto done;
}
}
spin_unlock_irqrestore(&pas->lock, flags);
}
done:
return;
}
void __init page_address_init(void)
{
int i;
for (i = 0; i < ARRAY_SIZE(page_address_htable); i++) {
INIT_LIST_HEAD(&page_address_htable[i].lh);
spin_lock_init(&page_address_htable[i].lock);
}
}
#endif /* defined(CONFIG_HIGHMEM) && !defined(WANT_PAGE_VIRTUAL) */

2980
mm/huge_memory.c Normal file

File diff suppressed because it is too large Load diff

3840
mm/hugetlb.c Normal file

File diff suppressed because it is too large Load diff

409
mm/hugetlb_cgroup.c Normal file
View file

@ -0,0 +1,409 @@
/*
*
* Copyright IBM Corporation, 2012
* Author Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
*
* This program is free software; you can redistribute it and/or modify it
* under the terms of version 2.1 of the GNU Lesser General Public License
* as published by the Free Software Foundation.
*
* This program is distributed in the hope that it would be useful, but
* WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
*
*/
#include <linux/cgroup.h>
#include <linux/slab.h>
#include <linux/hugetlb.h>
#include <linux/hugetlb_cgroup.h>
struct hugetlb_cgroup {
struct cgroup_subsys_state css;
/*
* the counter to account for hugepages from hugetlb.
*/
struct res_counter hugepage[HUGE_MAX_HSTATE];
};
#define MEMFILE_PRIVATE(x, val) (((x) << 16) | (val))
#define MEMFILE_IDX(val) (((val) >> 16) & 0xffff)
#define MEMFILE_ATTR(val) ((val) & 0xffff)
static struct hugetlb_cgroup *root_h_cgroup __read_mostly;
static inline
struct hugetlb_cgroup *hugetlb_cgroup_from_css(struct cgroup_subsys_state *s)
{
return s ? container_of(s, struct hugetlb_cgroup, css) : NULL;
}
static inline
struct hugetlb_cgroup *hugetlb_cgroup_from_task(struct task_struct *task)
{
return hugetlb_cgroup_from_css(task_css(task, hugetlb_cgrp_id));
}
static inline bool hugetlb_cgroup_is_root(struct hugetlb_cgroup *h_cg)
{
return (h_cg == root_h_cgroup);
}
static inline struct hugetlb_cgroup *
parent_hugetlb_cgroup(struct hugetlb_cgroup *h_cg)
{
return hugetlb_cgroup_from_css(h_cg->css.parent);
}
static inline bool hugetlb_cgroup_have_usage(struct hugetlb_cgroup *h_cg)
{
int idx;
for (idx = 0; idx < hugetlb_max_hstate; idx++) {
if ((res_counter_read_u64(&h_cg->hugepage[idx], RES_USAGE)) > 0)
return true;
}
return false;
}
static struct cgroup_subsys_state *
hugetlb_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
{
struct hugetlb_cgroup *parent_h_cgroup = hugetlb_cgroup_from_css(parent_css);
struct hugetlb_cgroup *h_cgroup;
int idx;
h_cgroup = kzalloc(sizeof(*h_cgroup), GFP_KERNEL);
if (!h_cgroup)
return ERR_PTR(-ENOMEM);
if (parent_h_cgroup) {
for (idx = 0; idx < HUGE_MAX_HSTATE; idx++)
res_counter_init(&h_cgroup->hugepage[idx],
&parent_h_cgroup->hugepage[idx]);
} else {
root_h_cgroup = h_cgroup;
for (idx = 0; idx < HUGE_MAX_HSTATE; idx++)
res_counter_init(&h_cgroup->hugepage[idx], NULL);
}
return &h_cgroup->css;
}
static void hugetlb_cgroup_css_free(struct cgroup_subsys_state *css)
{
struct hugetlb_cgroup *h_cgroup;
h_cgroup = hugetlb_cgroup_from_css(css);
kfree(h_cgroup);
}
/*
* Should be called with hugetlb_lock held.
* Since we are holding hugetlb_lock, pages cannot get moved from
* active list or uncharged from the cgroup, So no need to get
* page reference and test for page active here. This function
* cannot fail.
*/
static void hugetlb_cgroup_move_parent(int idx, struct hugetlb_cgroup *h_cg,
struct page *page)
{
int csize;
struct res_counter *counter;
struct res_counter *fail_res;
struct hugetlb_cgroup *page_hcg;
struct hugetlb_cgroup *parent = parent_hugetlb_cgroup(h_cg);
page_hcg = hugetlb_cgroup_from_page(page);
/*
* We can have pages in active list without any cgroup
* ie, hugepage with less than 3 pages. We can safely
* ignore those pages.
*/
if (!page_hcg || page_hcg != h_cg)
goto out;
csize = PAGE_SIZE << compound_order(page);
if (!parent) {
parent = root_h_cgroup;
/* root has no limit */
res_counter_charge_nofail(&parent->hugepage[idx],
csize, &fail_res);
}
counter = &h_cg->hugepage[idx];
res_counter_uncharge_until(counter, counter->parent, csize);
set_hugetlb_cgroup(page, parent);
out:
return;
}
/*
* Force the hugetlb cgroup to empty the hugetlb resources by moving them to
* the parent cgroup.
*/
static void hugetlb_cgroup_css_offline(struct cgroup_subsys_state *css)
{
struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css);
struct hstate *h;
struct page *page;
int idx = 0;
do {
for_each_hstate(h) {
spin_lock(&hugetlb_lock);
list_for_each_entry(page, &h->hugepage_activelist, lru)
hugetlb_cgroup_move_parent(idx, h_cg, page);
spin_unlock(&hugetlb_lock);
idx++;
}
cond_resched();
} while (hugetlb_cgroup_have_usage(h_cg));
}
int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
struct hugetlb_cgroup **ptr)
{
int ret = 0;
struct res_counter *fail_res;
struct hugetlb_cgroup *h_cg = NULL;
unsigned long csize = nr_pages * PAGE_SIZE;
if (hugetlb_cgroup_disabled())
goto done;
/*
* We don't charge any cgroup if the compound page have less
* than 3 pages.
*/
if (huge_page_order(&hstates[idx]) < HUGETLB_CGROUP_MIN_ORDER)
goto done;
again:
rcu_read_lock();
h_cg = hugetlb_cgroup_from_task(current);
if (!css_tryget_online(&h_cg->css)) {
rcu_read_unlock();
goto again;
}
rcu_read_unlock();
ret = res_counter_charge(&h_cg->hugepage[idx], csize, &fail_res);
css_put(&h_cg->css);
done:
*ptr = h_cg;
return ret;
}
/* Should be called with hugetlb_lock held */
void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
struct hugetlb_cgroup *h_cg,
struct page *page)
{
if (hugetlb_cgroup_disabled() || !h_cg)
return;
set_hugetlb_cgroup(page, h_cg);
return;
}
/*
* Should be called with hugetlb_lock held
*/
void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
struct page *page)
{
struct hugetlb_cgroup *h_cg;
unsigned long csize = nr_pages * PAGE_SIZE;
if (hugetlb_cgroup_disabled())
return;
lockdep_assert_held(&hugetlb_lock);
h_cg = hugetlb_cgroup_from_page(page);
if (unlikely(!h_cg))
return;
set_hugetlb_cgroup(page, NULL);
res_counter_uncharge(&h_cg->hugepage[idx], csize);
return;
}
void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
struct hugetlb_cgroup *h_cg)
{
unsigned long csize = nr_pages * PAGE_SIZE;
if (hugetlb_cgroup_disabled() || !h_cg)
return;
if (huge_page_order(&hstates[idx]) < HUGETLB_CGROUP_MIN_ORDER)
return;
res_counter_uncharge(&h_cg->hugepage[idx], csize);
return;
}
static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css,
struct cftype *cft)
{
int idx, name;
struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css);
idx = MEMFILE_IDX(cft->private);
name = MEMFILE_ATTR(cft->private);
return res_counter_read_u64(&h_cg->hugepage[idx], name);
}
static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off)
{
int idx, name, ret;
unsigned long long val;
struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));
buf = strstrip(buf);
idx = MEMFILE_IDX(of_cft(of)->private);
name = MEMFILE_ATTR(of_cft(of)->private);
switch (name) {
case RES_LIMIT:
if (hugetlb_cgroup_is_root(h_cg)) {
/* Can't set limit on root */
ret = -EINVAL;
break;
}
/* This function does all necessary parse...reuse it */
ret = res_counter_memparse_write_strategy(buf, &val);
if (ret)
break;
val = ALIGN(val, 1ULL << huge_page_shift(&hstates[idx]));
ret = res_counter_set_limit(&h_cg->hugepage[idx], val);
break;
default:
ret = -EINVAL;
break;
}
return ret ?: nbytes;
}
static ssize_t hugetlb_cgroup_reset(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off)
{
int idx, name, ret = 0;
struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));
idx = MEMFILE_IDX(of_cft(of)->private);
name = MEMFILE_ATTR(of_cft(of)->private);
switch (name) {
case RES_MAX_USAGE:
res_counter_reset_max(&h_cg->hugepage[idx]);
break;
case RES_FAILCNT:
res_counter_reset_failcnt(&h_cg->hugepage[idx]);
break;
default:
ret = -EINVAL;
break;
}
return ret ?: nbytes;
}
static char *mem_fmt(char *buf, int size, unsigned long hsize)
{
if (hsize >= (1UL << 30))
snprintf(buf, size, "%luGB", hsize >> 30);
else if (hsize >= (1UL << 20))
snprintf(buf, size, "%luMB", hsize >> 20);
else
snprintf(buf, size, "%luKB", hsize >> 10);
return buf;
}
static void __init __hugetlb_cgroup_file_init(int idx)
{
char buf[32];
struct cftype *cft;
struct hstate *h = &hstates[idx];
/* format the size */
mem_fmt(buf, 32, huge_page_size(h));
/* Add the limit file */
cft = &h->cgroup_files[0];
snprintf(cft->name, MAX_CFTYPE_NAME, "%s.limit_in_bytes", buf);
cft->private = MEMFILE_PRIVATE(idx, RES_LIMIT);
cft->read_u64 = hugetlb_cgroup_read_u64;
cft->write = hugetlb_cgroup_write;
/* Add the usage file */
cft = &h->cgroup_files[1];
snprintf(cft->name, MAX_CFTYPE_NAME, "%s.usage_in_bytes", buf);
cft->private = MEMFILE_PRIVATE(idx, RES_USAGE);
cft->read_u64 = hugetlb_cgroup_read_u64;
/* Add the MAX usage file */
cft = &h->cgroup_files[2];
snprintf(cft->name, MAX_CFTYPE_NAME, "%s.max_usage_in_bytes", buf);
cft->private = MEMFILE_PRIVATE(idx, RES_MAX_USAGE);
cft->write = hugetlb_cgroup_reset;
cft->read_u64 = hugetlb_cgroup_read_u64;
/* Add the failcntfile */
cft = &h->cgroup_files[3];
snprintf(cft->name, MAX_CFTYPE_NAME, "%s.failcnt", buf);
cft->private = MEMFILE_PRIVATE(idx, RES_FAILCNT);
cft->write = hugetlb_cgroup_reset;
cft->read_u64 = hugetlb_cgroup_read_u64;
/* NULL terminate the last cft */
cft = &h->cgroup_files[4];
memset(cft, 0, sizeof(*cft));
WARN_ON(cgroup_add_legacy_cftypes(&hugetlb_cgrp_subsys,
h->cgroup_files));
}
void __init hugetlb_cgroup_file_init(void)
{
struct hstate *h;
for_each_hstate(h) {
/*
* Add cgroup control files only if the huge page consists
* of more than two normal pages. This is because we use
* page[2].lru.next for storing cgroup details.
*/
if (huge_page_order(h) >= HUGETLB_CGROUP_MIN_ORDER)
__hugetlb_cgroup_file_init(hstate_index(h));
}
}
/*
* hugetlb_lock will make sure a parallel cgroup rmdir won't happen
* when we migrate hugepages
*/
void hugetlb_cgroup_migrate(struct page *oldhpage, struct page *newhpage)
{
struct hugetlb_cgroup *h_cg;
struct hstate *h = page_hstate(oldhpage);
if (hugetlb_cgroup_disabled())
return;
VM_BUG_ON_PAGE(!PageHuge(oldhpage), oldhpage);
spin_lock(&hugetlb_lock);
h_cg = hugetlb_cgroup_from_page(oldhpage);
set_hugetlb_cgroup(oldhpage, NULL);
/* move the h_cg details to new cgroup */
set_hugetlb_cgroup(newhpage, h_cg);
list_move(&newhpage->lru, &h->hugepage_activelist);
spin_unlock(&hugetlb_lock);
return;
}
struct cgroup_subsys hugetlb_cgrp_subsys = {
.css_alloc = hugetlb_cgroup_css_alloc,
.css_offline = hugetlb_cgroup_css_offline,
.css_free = hugetlb_cgroup_css_free,
};

141
mm/hwpoison-inject.c Normal file
View file

@ -0,0 +1,141 @@
/* Inject a hwpoison memory failure on a arbitrary pfn */
#include <linux/module.h>
#include <linux/debugfs.h>
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/swap.h>
#include <linux/pagemap.h>
#include <linux/hugetlb.h>
#include "internal.h"
static struct dentry *hwpoison_dir;
static int hwpoison_inject(void *data, u64 val)
{
unsigned long pfn = val;
struct page *p;
struct page *hpage;
int err;
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
if (!pfn_valid(pfn))
return -ENXIO;
p = pfn_to_page(pfn);
hpage = compound_head(p);
/*
* This implies unable to support free buddy pages.
*/
if (!get_page_unless_zero(hpage))
return 0;
if (!hwpoison_filter_enable)
goto inject;
if (!PageLRU(p) && !PageHuge(p))
shake_page(p, 0);
/*
* This implies unable to support non-LRU pages.
*/
if (!PageLRU(p) && !PageHuge(p))
return 0;
/*
* do a racy check with elevated page count, to make sure PG_hwpoison
* will only be set for the targeted owner (or on a free page).
* We temporarily take page lock for try_get_mem_cgroup_from_page().
* memory_failure() will redo the check reliably inside page lock.
*/
lock_page(hpage);
err = hwpoison_filter(hpage);
unlock_page(hpage);
if (err)
return 0;
inject:
pr_info("Injecting memory failure at pfn %#lx\n", pfn);
return memory_failure(pfn, 18, MF_COUNT_INCREASED);
}
static int hwpoison_unpoison(void *data, u64 val)
{
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
return unpoison_memory(val);
}
DEFINE_SIMPLE_ATTRIBUTE(hwpoison_fops, NULL, hwpoison_inject, "%lli\n");
DEFINE_SIMPLE_ATTRIBUTE(unpoison_fops, NULL, hwpoison_unpoison, "%lli\n");
static void pfn_inject_exit(void)
{
debugfs_remove_recursive(hwpoison_dir);
}
static int pfn_inject_init(void)
{
struct dentry *dentry;
hwpoison_dir = debugfs_create_dir("hwpoison", NULL);
if (hwpoison_dir == NULL)
return -ENOMEM;
/*
* Note that the below poison/unpoison interfaces do not involve
* hardware status change, hence do not require hardware support.
* They are mainly for testing hwpoison in software level.
*/
dentry = debugfs_create_file("corrupt-pfn", 0200, hwpoison_dir,
NULL, &hwpoison_fops);
if (!dentry)
goto fail;
dentry = debugfs_create_file("unpoison-pfn", 0200, hwpoison_dir,
NULL, &unpoison_fops);
if (!dentry)
goto fail;
dentry = debugfs_create_u32("corrupt-filter-enable", 0600,
hwpoison_dir, &hwpoison_filter_enable);
if (!dentry)
goto fail;
dentry = debugfs_create_u32("corrupt-filter-dev-major", 0600,
hwpoison_dir, &hwpoison_filter_dev_major);
if (!dentry)
goto fail;
dentry = debugfs_create_u32("corrupt-filter-dev-minor", 0600,
hwpoison_dir, &hwpoison_filter_dev_minor);
if (!dentry)
goto fail;
dentry = debugfs_create_u64("corrupt-filter-flags-mask", 0600,
hwpoison_dir, &hwpoison_filter_flags_mask);
if (!dentry)
goto fail;
dentry = debugfs_create_u64("corrupt-filter-flags-value", 0600,
hwpoison_dir, &hwpoison_filter_flags_value);
if (!dentry)
goto fail;
#ifdef CONFIG_MEMCG_SWAP
dentry = debugfs_create_u64("corrupt-filter-memcg", 0600,
hwpoison_dir, &hwpoison_filter_memcg);
if (!dentry)
goto fail;
#endif
return 0;
fail:
pfn_inject_exit();
return -ENOMEM;
}
module_init(pfn_inject_init);
module_exit(pfn_inject_exit);
MODULE_LICENSE("GPL");

25
mm/init-mm.c Normal file
View file

@ -0,0 +1,25 @@
#include <linux/mm_types.h>
#include <linux/rbtree.h>
#include <linux/rwsem.h>
#include <linux/spinlock.h>
#include <linux/list.h>
#include <linux/cpumask.h>
#include <linux/atomic.h>
#include <asm/pgtable.h>
#include <asm/mmu.h>
#ifndef INIT_MM_CONTEXT
#define INIT_MM_CONTEXT(name)
#endif
struct mm_struct init_mm = {
.mm_rb = RB_ROOT,
.pgd = swapper_pg_dir,
.mm_users = ATOMIC_INIT(2),
.mm_count = ATOMIC_INIT(1),
.mmap_sem = __RWSEM_INITIALIZER(init_mm.mmap_sem),
.page_table_lock = __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
.mmlist = LIST_HEAD_INIT(init_mm.mmlist),
INIT_MM_CONTEXT(init_mm)
};

413
mm/internal.h Normal file
View file

@ -0,0 +1,413 @@
/* internal.h: mm/ internal definitions
*
* Copyright (C) 2004 Red Hat, Inc. All Rights Reserved.
* Written by David Howells (dhowells@redhat.com)
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License
* as published by the Free Software Foundation; either version
* 2 of the License, or (at your option) any later version.
*/
#ifndef __MM_INTERNAL_H
#define __MM_INTERNAL_H
#include <linux/fs.h>
#include <linux/mm.h>
void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
unsigned long floor, unsigned long ceiling);
static inline void set_page_count(struct page *page, int v)
{
atomic_set(&page->_count, v);
}
extern int __do_page_cache_readahead(struct address_space *mapping,
struct file *filp, pgoff_t offset, unsigned long nr_to_read,
unsigned long lookahead_size);
/*
* Submit IO for the read-ahead request in file_ra_state.
*/
static inline unsigned long ra_submit(struct file_ra_state *ra,
struct address_space *mapping, struct file *filp)
{
return __do_page_cache_readahead(mapping, filp,
ra->start, ra->size, ra->async_size);
}
/*
* Turn a non-refcounted page (->_count == 0) into refcounted with
* a count of one.
*/
static inline void set_page_refcounted(struct page *page)
{
VM_BUG_ON_PAGE(PageTail(page), page);
VM_BUG_ON_PAGE(atomic_read(&page->_count), page);
set_page_count(page, 1);
}
static inline void __get_page_tail_foll(struct page *page,
bool get_page_head)
{
/*
* If we're getting a tail page, the elevated page->_count is
* required only in the head page and we will elevate the head
* page->_count and tail page->_mapcount.
*
* We elevate page_tail->_mapcount for tail pages to force
* page_tail->_count to be zero at all times to avoid getting
* false positives from get_page_unless_zero() with
* speculative page access (like in
* page_cache_get_speculative()) on tail pages.
*/
VM_BUG_ON_PAGE(atomic_read(&page->first_page->_count) <= 0, page);
if (get_page_head)
atomic_inc(&page->first_page->_count);
get_huge_page_tail(page);
}
/*
* This is meant to be called as the FOLL_GET operation of
* follow_page() and it must be called while holding the proper PT
* lock while the pte (or pmd_trans_huge) is still mapping the page.
*/
static inline void get_page_foll(struct page *page)
{
if (unlikely(PageTail(page)))
/*
* This is safe only because
* __split_huge_page_refcount() can't run under
* get_page_foll() because we hold the proper PT lock.
*/
__get_page_tail_foll(page, true);
else {
/*
* Getting a normal page or the head of a compound page
* requires to already have an elevated page->_count.
*/
VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
atomic_inc(&page->_count);
}
}
extern unsigned long highest_memmap_pfn;
/*
* in mm/vmscan.c:
*/
extern int isolate_lru_page(struct page *page);
extern void putback_lru_page(struct page *page);
extern bool zone_reclaimable(struct zone *zone);
/*
* in mm/rmap.c:
*/
extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
/*
* in mm/page_alloc.c
*/
/*
* Locate the struct page for both the matching buddy in our
* pair (buddy1) and the combined O(n+1) page they form (page).
*
* 1) Any buddy B1 will have an order O twin B2 which satisfies
* the following equation:
* B2 = B1 ^ (1 << O)
* For example, if the starting buddy (buddy2) is #8 its order
* 1 buddy is #10:
* B2 = 8 ^ (1 << 1) = 8 ^ 2 = 10
*
* 2) Any buddy B will have an order O+1 parent P which
* satisfies the following equation:
* P = B & ~(1 << O)
*
* Assumption: *_mem_map is contiguous at least up to MAX_ORDER
*/
static inline unsigned long
__find_buddy_index(unsigned long page_idx, unsigned int order)
{
return page_idx ^ (1 << order);
}
extern int __isolate_free_page(struct page *page, unsigned int order);
extern void __free_pages_bootmem(struct page *page, unsigned int order);
extern void prep_compound_page(struct page *page, unsigned long order);
#ifdef CONFIG_MEMORY_FAILURE
extern bool is_free_buddy_page(struct page *page);
#endif
extern int user_min_free_kbytes;
#if defined CONFIG_COMPACTION || defined CONFIG_CMA
/*
* in mm/compaction.c
*/
/*
* compact_control is used to track pages being migrated and the free pages
* they are being migrated to during memory compaction. The free_pfn starts
* at the end of a zone and migrate_pfn begins at the start. Movable pages
* are moved to the end of a zone during a compaction run and the run
* completes when free_pfn <= migrate_pfn
*/
struct compact_control {
struct list_head freepages; /* List of free pages to migrate to */
struct list_head migratepages; /* List of pages being migrated */
unsigned long nr_freepages; /* Number of isolated free pages */
unsigned long nr_migratepages; /* Number of pages to migrate */
unsigned long free_pfn; /* isolate_freepages search base */
unsigned long migrate_pfn; /* isolate_migratepages search base */
enum migrate_mode mode; /* Async or sync migration mode */
bool ignore_skip_hint; /* Scan blocks even if marked skip */
bool finished_update_free; /* True when the zone cached pfns are
* no longer being updated
*/
bool finished_update_migrate;
int order; /* order a direct compactor needs */
const gfp_t gfp_mask; /* gfp mask of a direct compactor */
struct zone *zone;
int contended; /* Signal need_sched() or lock
* contention detected during
* compaction
*/
};
unsigned long
isolate_freepages_range(struct compact_control *cc,
unsigned long start_pfn, unsigned long end_pfn);
unsigned long
isolate_migratepages_range(struct compact_control *cc,
unsigned long low_pfn, unsigned long end_pfn);
#endif
/*
* This function returns the order of a free page in the buddy system. In
* general, page_zone(page)->lock must be held by the caller to prevent the
* page from being allocated in parallel and returning garbage as the order.
* If a caller does not hold page_zone(page)->lock, it must guarantee that the
* page cannot be allocated or merged in parallel. Alternatively, it must
* handle invalid values gracefully, and use page_order_unsafe() below.
*/
static inline unsigned long page_order(struct page *page)
{
/* PageBuddy() must be checked by the caller */
return page_private(page);
}
/*
* Like page_order(), but for callers who cannot afford to hold the zone lock.
* PageBuddy() should be checked first by the caller to minimize race window,
* and invalid values must be handled gracefully.
*
* ACCESS_ONCE is used so that if the caller assigns the result into a local
* variable and e.g. tests it for valid range before using, the compiler cannot
* decide to remove the variable and inline the page_private(page) multiple
* times, potentially observing different values in the tests and the actual
* use of the result.
*/
#define page_order_unsafe(page) ACCESS_ONCE(page_private(page))
static inline bool is_cow_mapping(vm_flags_t flags)
{
return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}
/* mm/util.c */
void __vma_link_list(struct mm_struct *mm, struct vm_area_struct *vma,
struct vm_area_struct *prev, struct rb_node *rb_parent);
#ifdef CONFIG_MMU
extern long __mlock_vma_pages_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end, int *nonblocking);
extern void munlock_vma_pages_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end);
static inline void munlock_vma_pages_all(struct vm_area_struct *vma)
{
munlock_vma_pages_range(vma, vma->vm_start, vma->vm_end);
}
/*
* must be called with vma's mmap_sem held for read or write, and page locked.
*/
extern void mlock_vma_page(struct page *page);
extern unsigned int munlock_vma_page(struct page *page);
/*
* Clear the page's PageMlocked(). This can be useful in a situation where
* we want to unconditionally remove a page from the pagecache -- e.g.,
* on truncation or freeing.
*
* It is legal to call this function for any page, mlocked or not.
* If called for a page that is still mapped by mlocked vmas, all we do
* is revert to lazy LRU behaviour -- semantics are not broken.
*/
extern void clear_page_mlock(struct page *page);
/*
* mlock_migrate_page - called only from migrate_page_copy() to
* migrate the Mlocked page flag; update statistics.
*/
static inline void mlock_migrate_page(struct page *newpage, struct page *page)
{
if (TestClearPageMlocked(page)) {
unsigned long flags;
int nr_pages = hpage_nr_pages(page);
local_irq_save(flags);
__mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
SetPageMlocked(newpage);
__mod_zone_page_state(page_zone(newpage), NR_MLOCK, nr_pages);
local_irq_restore(flags);
}
}
extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
extern unsigned long vma_address(struct page *page,
struct vm_area_struct *vma);
#endif
#else /* !CONFIG_MMU */
static inline void clear_page_mlock(struct page *page) { }
static inline void mlock_vma_page(struct page *page) { }
static inline void mlock_migrate_page(struct page *new, struct page *old) { }
#endif /* !CONFIG_MMU */
/*
* Return the mem_map entry representing the 'offset' subpage within
* the maximally aligned gigantic page 'base'. Handle any discontiguity
* in the mem_map at MAX_ORDER_NR_PAGES boundaries.
*/
static inline struct page *mem_map_offset(struct page *base, int offset)
{
if (unlikely(offset >= MAX_ORDER_NR_PAGES))
return nth_page(base, offset);
return base + offset;
}
/*
* Iterator over all subpages within the maximally aligned gigantic
* page 'base'. Handle any discontiguity in the mem_map.
*/
static inline struct page *mem_map_next(struct page *iter,
struct page *base, int offset)
{
if (unlikely((offset & (MAX_ORDER_NR_PAGES - 1)) == 0)) {
unsigned long pfn = page_to_pfn(base) + offset;
if (!pfn_valid(pfn))
return NULL;
return pfn_to_page(pfn);
}
return iter + 1;
}
/*
* FLATMEM and DISCONTIGMEM configurations use alloc_bootmem_node,
* so all functions starting at paging_init should be marked __init
* in those cases. SPARSEMEM, however, allows for memory hotplug,
* and alloc_bootmem_node is not used.
*/
#ifdef CONFIG_SPARSEMEM
#define __paginginit __meminit
#else
#define __paginginit __init
#endif
/* Memory initialisation debug and verification */
enum mminit_level {
MMINIT_WARNING,
MMINIT_VERIFY,
MMINIT_TRACE
};
#ifdef CONFIG_DEBUG_MEMORY_INIT
extern int mminit_loglevel;
#define mminit_dprintk(level, prefix, fmt, arg...) \
do { \
if (level < mminit_loglevel) { \
printk(level <= MMINIT_WARNING ? KERN_WARNING : KERN_DEBUG); \
printk(KERN_CONT "mminit::" prefix " " fmt, ##arg); \
} \
} while (0)
extern void mminit_verify_pageflags_layout(void);
extern void mminit_verify_page_links(struct page *page,
enum zone_type zone, unsigned long nid, unsigned long pfn);
extern void mminit_verify_zonelist(void);
#else
static inline void mminit_dprintk(enum mminit_level level,
const char *prefix, const char *fmt, ...)
{
}
static inline void mminit_verify_pageflags_layout(void)
{
}
static inline void mminit_verify_page_links(struct page *page,
enum zone_type zone, unsigned long nid, unsigned long pfn)
{
}
static inline void mminit_verify_zonelist(void)
{
}
#endif /* CONFIG_DEBUG_MEMORY_INIT */
/* mminit_validate_memmodel_limits is independent of CONFIG_DEBUG_MEMORY_INIT */
#if defined(CONFIG_SPARSEMEM)
extern void mminit_validate_memmodel_limits(unsigned long *start_pfn,
unsigned long *end_pfn);
#else
static inline void mminit_validate_memmodel_limits(unsigned long *start_pfn,
unsigned long *end_pfn)
{
}
#endif /* CONFIG_SPARSEMEM */
#define ZONE_RECLAIM_NOSCAN -2
#define ZONE_RECLAIM_FULL -1
#define ZONE_RECLAIM_SOME 0
#define ZONE_RECLAIM_SUCCESS 1
extern int hwpoison_filter(struct page *p);
extern u32 hwpoison_filter_dev_major;
extern u32 hwpoison_filter_dev_minor;
extern u64 hwpoison_filter_flags_mask;
extern u64 hwpoison_filter_flags_value;
extern u64 hwpoison_filter_memcg;
extern u32 hwpoison_filter_enable;
extern unsigned long vm_mmap_pgoff(struct file *, unsigned long,
unsigned long, unsigned long,
unsigned long, unsigned long);
extern void set_pageblock_order(void);
unsigned long reclaim_clean_pages_from_list(struct zone *zone,
struct list_head *page_list);
/* The ALLOC_WMARK bits are used as an index to zone->watermark */
#define ALLOC_WMARK_MIN WMARK_MIN
#define ALLOC_WMARK_LOW WMARK_LOW
#define ALLOC_WMARK_HIGH WMARK_HIGH
#define ALLOC_NO_WATERMARKS 0x04 /* don't check watermarks at all */
/* Mask to get the watermark bits */
#define ALLOC_WMARK_MASK (ALLOC_NO_WATERMARKS-1)
#define ALLOC_HARDER 0x10 /* try to alloc harder */
#define ALLOC_HIGH 0x20 /* __GFP_HIGH set */
#define ALLOC_CPUSET 0x40 /* check for correct cpuset */
#define ALLOC_CMA 0x80 /* allow allocations from CMA areas */
#define ALLOC_FAIR 0x100 /* fair zone allocation */
#endif /* __MM_INTERNAL_H */

112
mm/interval_tree.c Normal file
View file

@ -0,0 +1,112 @@
/*
* mm/interval_tree.c - interval tree for mapping->i_mmap
*
* Copyright (C) 2012, Michel Lespinasse <walken@google.com>
*
* This file is released under the GPL v2.
*/
#include <linux/mm.h>
#include <linux/fs.h>
#include <linux/rmap.h>
#include <linux/interval_tree_generic.h>
static inline unsigned long vma_start_pgoff(struct vm_area_struct *v)
{
return v->vm_pgoff;
}
static inline unsigned long vma_last_pgoff(struct vm_area_struct *v)
{
return v->vm_pgoff + ((v->vm_end - v->vm_start) >> PAGE_SHIFT) - 1;
}
INTERVAL_TREE_DEFINE(struct vm_area_struct, shared.linear.rb,
unsigned long, shared.linear.rb_subtree_last,
vma_start_pgoff, vma_last_pgoff,, vma_interval_tree)
/* Insert node immediately after prev in the interval tree */
void vma_interval_tree_insert_after(struct vm_area_struct *node,
struct vm_area_struct *prev,
struct rb_root *root)
{
struct rb_node **link;
struct vm_area_struct *parent;
unsigned long last = vma_last_pgoff(node);
VM_BUG_ON_VMA(vma_start_pgoff(node) != vma_start_pgoff(prev), node);
if (!prev->shared.linear.rb.rb_right) {
parent = prev;
link = &prev->shared.linear.rb.rb_right;
} else {
parent = rb_entry(prev->shared.linear.rb.rb_right,
struct vm_area_struct, shared.linear.rb);
if (parent->shared.linear.rb_subtree_last < last)
parent->shared.linear.rb_subtree_last = last;
while (parent->shared.linear.rb.rb_left) {
parent = rb_entry(parent->shared.linear.rb.rb_left,
struct vm_area_struct, shared.linear.rb);
if (parent->shared.linear.rb_subtree_last < last)
parent->shared.linear.rb_subtree_last = last;
}
link = &parent->shared.linear.rb.rb_left;
}
node->shared.linear.rb_subtree_last = last;
rb_link_node(&node->shared.linear.rb, &parent->shared.linear.rb, link);
rb_insert_augmented(&node->shared.linear.rb, root,
&vma_interval_tree_augment);
}
static inline unsigned long avc_start_pgoff(struct anon_vma_chain *avc)
{
return vma_start_pgoff(avc->vma);
}
static inline unsigned long avc_last_pgoff(struct anon_vma_chain *avc)
{
return vma_last_pgoff(avc->vma);
}
INTERVAL_TREE_DEFINE(struct anon_vma_chain, rb, unsigned long, rb_subtree_last,
avc_start_pgoff, avc_last_pgoff,
static inline, __anon_vma_interval_tree)
void anon_vma_interval_tree_insert(struct anon_vma_chain *node,
struct rb_root *root)
{
#ifdef CONFIG_DEBUG_VM_RB
node->cached_vma_start = avc_start_pgoff(node);
node->cached_vma_last = avc_last_pgoff(node);
#endif
__anon_vma_interval_tree_insert(node, root);
}
void anon_vma_interval_tree_remove(struct anon_vma_chain *node,
struct rb_root *root)
{
__anon_vma_interval_tree_remove(node, root);
}
struct anon_vma_chain *
anon_vma_interval_tree_iter_first(struct rb_root *root,
unsigned long first, unsigned long last)
{
return __anon_vma_interval_tree_iter_first(root, first, last);
}
struct anon_vma_chain *
anon_vma_interval_tree_iter_next(struct anon_vma_chain *node,
unsigned long first, unsigned long last)
{
return __anon_vma_interval_tree_iter_next(node, first, last);
}
#ifdef CONFIG_DEBUG_VM_RB
void anon_vma_interval_tree_verify(struct anon_vma_chain *node)
{
WARN_ON_ONCE(node->cached_vma_start != avc_start_pgoff(node));
WARN_ON_ONCE(node->cached_vma_last != avc_last_pgoff(node));
}
#endif

958
mm/iov_iter.c Normal file
View file

@ -0,0 +1,958 @@
#include <linux/export.h>
#include <linux/uio.h>
#include <linux/pagemap.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>
static size_t copy_to_iter_iovec(void *from, size_t bytes, struct iov_iter *i)
{
size_t skip, copy, left, wanted;
const struct iovec *iov;
char __user *buf;
if (unlikely(bytes > i->count))
bytes = i->count;
if (unlikely(!bytes))
return 0;
wanted = bytes;
iov = i->iov;
skip = i->iov_offset;
buf = iov->iov_base + skip;
copy = min(bytes, iov->iov_len - skip);
left = __copy_to_user(buf, from, copy);
copy -= left;
skip += copy;
from += copy;
bytes -= copy;
while (unlikely(!left && bytes)) {
iov++;
buf = iov->iov_base;
copy = min(bytes, iov->iov_len);
left = __copy_to_user(buf, from, copy);
copy -= left;
skip = copy;
from += copy;
bytes -= copy;
}
if (skip == iov->iov_len) {
iov++;
skip = 0;
}
i->count -= wanted - bytes;
i->nr_segs -= iov - i->iov;
i->iov = iov;
i->iov_offset = skip;
return wanted - bytes;
}
static size_t copy_from_iter_iovec(void *to, size_t bytes, struct iov_iter *i)
{
size_t skip, copy, left, wanted;
const struct iovec *iov;
char __user *buf;
if (unlikely(bytes > i->count))
bytes = i->count;
if (unlikely(!bytes))
return 0;
wanted = bytes;
iov = i->iov;
skip = i->iov_offset;
buf = iov->iov_base + skip;
copy = min(bytes, iov->iov_len - skip);
left = __copy_from_user(to, buf, copy);
copy -= left;
skip += copy;
to += copy;
bytes -= copy;
while (unlikely(!left && bytes)) {
iov++;
buf = iov->iov_base;
copy = min(bytes, iov->iov_len);
left = __copy_from_user(to, buf, copy);
copy -= left;
skip = copy;
to += copy;
bytes -= copy;
}
if (skip == iov->iov_len) {
iov++;
skip = 0;
}
i->count -= wanted - bytes;
i->nr_segs -= iov - i->iov;
i->iov = iov;
i->iov_offset = skip;
return wanted - bytes;
}
static size_t copy_page_to_iter_iovec(struct page *page, size_t offset, size_t bytes,
struct iov_iter *i)
{
size_t skip, copy, left, wanted;
const struct iovec *iov;
char __user *buf;
void *kaddr, *from;
if (unlikely(bytes > i->count))
bytes = i->count;
if (unlikely(!bytes))
return 0;
wanted = bytes;
iov = i->iov;
skip = i->iov_offset;
buf = iov->iov_base + skip;
copy = min(bytes, iov->iov_len - skip);
if (!fault_in_pages_writeable(buf, copy)) {
kaddr = kmap_atomic(page);
from = kaddr + offset;
/* first chunk, usually the only one */
left = __copy_to_user_inatomic(buf, from, copy);
copy -= left;
skip += copy;
from += copy;
bytes -= copy;
while (unlikely(!left && bytes)) {
iov++;
buf = iov->iov_base;
copy = min(bytes, iov->iov_len);
left = __copy_to_user_inatomic(buf, from, copy);
copy -= left;
skip = copy;
from += copy;
bytes -= copy;
}
if (likely(!bytes)) {
kunmap_atomic(kaddr);
goto done;
}
offset = from - kaddr;
buf += copy;
kunmap_atomic(kaddr);
copy = min(bytes, iov->iov_len - skip);
}
/* Too bad - revert to non-atomic kmap */
kaddr = kmap(page);
from = kaddr + offset;
left = __copy_to_user(buf, from, copy);
copy -= left;
skip += copy;
from += copy;
bytes -= copy;
while (unlikely(!left && bytes)) {
iov++;
buf = iov->iov_base;
copy = min(bytes, iov->iov_len);
left = __copy_to_user(buf, from, copy);
copy -= left;
skip = copy;
from += copy;
bytes -= copy;
}
kunmap(page);
done:
if (skip == iov->iov_len) {
iov++;
skip = 0;
}
i->count -= wanted - bytes;
i->nr_segs -= iov - i->iov;
i->iov = iov;
i->iov_offset = skip;
return wanted - bytes;
}
static size_t copy_page_from_iter_iovec(struct page *page, size_t offset, size_t bytes,
struct iov_iter *i)
{
size_t skip, copy, left, wanted;
const struct iovec *iov;
char __user *buf;
void *kaddr, *to;
if (unlikely(bytes > i->count))
bytes = i->count;
if (unlikely(!bytes))
return 0;
wanted = bytes;
iov = i->iov;
skip = i->iov_offset;
buf = iov->iov_base + skip;
copy = min(bytes, iov->iov_len - skip);
if (!fault_in_pages_readable(buf, copy)) {
kaddr = kmap_atomic(page);
to = kaddr + offset;
/* first chunk, usually the only one */
left = __copy_from_user_inatomic(to, buf, copy);
copy -= left;
skip += copy;
to += copy;
bytes -= copy;
while (unlikely(!left && bytes)) {
iov++;
buf = iov->iov_base;
copy = min(bytes, iov->iov_len);
left = __copy_from_user_inatomic(to, buf, copy);
copy -= left;
skip = copy;
to += copy;
bytes -= copy;
}
if (likely(!bytes)) {
kunmap_atomic(kaddr);
goto done;
}
offset = to - kaddr;
buf += copy;
kunmap_atomic(kaddr);
copy = min(bytes, iov->iov_len - skip);
}
/* Too bad - revert to non-atomic kmap */
kaddr = kmap(page);
to = kaddr + offset;
left = __copy_from_user(to, buf, copy);
copy -= left;
skip += copy;
to += copy;
bytes -= copy;
while (unlikely(!left && bytes)) {
iov++;
buf = iov->iov_base;
copy = min(bytes, iov->iov_len);
left = __copy_from_user(to, buf, copy);
copy -= left;
skip = copy;
to += copy;
bytes -= copy;
}
kunmap(page);
done:
if (skip == iov->iov_len) {
iov++;
skip = 0;
}
i->count -= wanted - bytes;
i->nr_segs -= iov - i->iov;
i->iov = iov;
i->iov_offset = skip;
return wanted - bytes;
}
static size_t zero_iovec(size_t bytes, struct iov_iter *i)
{
size_t skip, copy, left, wanted;
const struct iovec *iov;
char __user *buf;
if (unlikely(bytes > i->count))
bytes = i->count;
if (unlikely(!bytes))
return 0;
wanted = bytes;
iov = i->iov;
skip = i->iov_offset;
buf = iov->iov_base + skip;
copy = min(bytes, iov->iov_len - skip);
left = __clear_user(buf, copy);
copy -= left;
skip += copy;
bytes -= copy;
while (unlikely(!left && bytes)) {
iov++;
buf = iov->iov_base;
copy = min(bytes, iov->iov_len);
left = __clear_user(buf, copy);
copy -= left;
skip = copy;
bytes -= copy;
}
if (skip == iov->iov_len) {
iov++;
skip = 0;
}
i->count -= wanted - bytes;
i->nr_segs -= iov - i->iov;
i->iov = iov;
i->iov_offset = skip;
return wanted - bytes;
}
static size_t __iovec_copy_from_user_inatomic(char *vaddr,
const struct iovec *iov, size_t base, size_t bytes)
{
size_t copied = 0, left = 0;
while (bytes) {
char __user *buf = iov->iov_base + base;
int copy = min(bytes, iov->iov_len - base);
base = 0;
left = __copy_from_user_inatomic(vaddr, buf, copy);
copied += copy;
bytes -= copy;
vaddr += copy;
iov++;
if (unlikely(left))
break;
}
return copied - left;
}
/*
* Copy as much as we can into the page and return the number of bytes which
* were successfully copied. If a fault is encountered then return the number of
* bytes which were copied.
*/
static size_t copy_from_user_atomic_iovec(struct page *page,
struct iov_iter *i, unsigned long offset, size_t bytes)
{
char *kaddr;
size_t copied;
kaddr = kmap_atomic(page);
if (likely(i->nr_segs == 1)) {
int left;
char __user *buf = i->iov->iov_base + i->iov_offset;
left = __copy_from_user_inatomic(kaddr + offset, buf, bytes);
copied = bytes - left;
} else {
copied = __iovec_copy_from_user_inatomic(kaddr + offset,
i->iov, i->iov_offset, bytes);
}
kunmap_atomic(kaddr);
return copied;
}
static void advance_iovec(struct iov_iter *i, size_t bytes)
{
BUG_ON(i->count < bytes);
if (likely(i->nr_segs == 1)) {
i->iov_offset += bytes;
i->count -= bytes;
} else {
const struct iovec *iov = i->iov;
size_t base = i->iov_offset;
unsigned long nr_segs = i->nr_segs;
/*
* The !iov->iov_len check ensures we skip over unlikely
* zero-length segments (without overruning the iovec).
*/
while (bytes || unlikely(i->count && !iov->iov_len)) {
int copy;
copy = min(bytes, iov->iov_len - base);
BUG_ON(!i->count || i->count < copy);
i->count -= copy;
bytes -= copy;
base += copy;
if (iov->iov_len == base) {
iov++;
nr_segs--;
base = 0;
}
}
i->iov = iov;
i->iov_offset = base;
i->nr_segs = nr_segs;
}
}
/*
* Fault in the first iovec of the given iov_iter, to a maximum length
* of bytes. Returns 0 on success, or non-zero if the memory could not be
* accessed (ie. because it is an invalid address).
*
* writev-intensive code may want this to prefault several iovecs -- that
* would be possible (callers must not rely on the fact that _only_ the
* first iovec will be faulted with the current implementation).
*/
int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes)
{
if (!(i->type & ITER_BVEC)) {
char __user *buf = i->iov->iov_base + i->iov_offset;
bytes = min(bytes, i->iov->iov_len - i->iov_offset);
return fault_in_pages_readable(buf, bytes);
}
return 0;
}
EXPORT_SYMBOL(iov_iter_fault_in_readable);
static unsigned long alignment_iovec(const struct iov_iter *i)
{
const struct iovec *iov = i->iov;
unsigned long res;
size_t size = i->count;
size_t n;
if (!size)
return 0;
res = (unsigned long)iov->iov_base + i->iov_offset;
n = iov->iov_len - i->iov_offset;
if (n >= size)
return res | size;
size -= n;
res |= n;
while (size > (++iov)->iov_len) {
res |= (unsigned long)iov->iov_base | iov->iov_len;
size -= iov->iov_len;
}
res |= (unsigned long)iov->iov_base | size;
return res;
}
void iov_iter_init(struct iov_iter *i, int direction,
const struct iovec *iov, unsigned long nr_segs,
size_t count)
{
/* It will get better. Eventually... */
if (segment_eq(get_fs(), KERNEL_DS))
direction |= ITER_KVEC;
i->type = direction;
i->iov = iov;
i->nr_segs = nr_segs;
i->iov_offset = 0;
i->count = count;
}
EXPORT_SYMBOL(iov_iter_init);
static ssize_t get_pages_iovec(struct iov_iter *i,
struct page **pages, size_t maxsize, unsigned maxpages,
size_t *start)
{
size_t offset = i->iov_offset;
const struct iovec *iov = i->iov;
size_t len;
unsigned long addr;
int n;
int res;
len = iov->iov_len - offset;
if (len > i->count)
len = i->count;
if (len > maxsize)
len = maxsize;
addr = (unsigned long)iov->iov_base + offset;
len += *start = addr & (PAGE_SIZE - 1);
if (len > maxpages * PAGE_SIZE)
len = maxpages * PAGE_SIZE;
addr &= ~(PAGE_SIZE - 1);
n = (len + PAGE_SIZE - 1) / PAGE_SIZE;
res = get_user_pages_fast(addr, n, (i->type & WRITE) != WRITE, pages);
if (unlikely(res < 0))
return res;
return (res == n ? len : res * PAGE_SIZE) - *start;
}
static ssize_t get_pages_alloc_iovec(struct iov_iter *i,
struct page ***pages, size_t maxsize,
size_t *start)
{
size_t offset = i->iov_offset;
const struct iovec *iov = i->iov;
size_t len;
unsigned long addr;
void *p;
int n;
int res;
len = iov->iov_len - offset;
if (len > i->count)
len = i->count;
if (len > maxsize)
len = maxsize;
addr = (unsigned long)iov->iov_base + offset;
len += *start = addr & (PAGE_SIZE - 1);
addr &= ~(PAGE_SIZE - 1);
n = (len + PAGE_SIZE - 1) / PAGE_SIZE;
p = kmalloc(n * sizeof(struct page *), GFP_KERNEL);
if (!p)
p = vmalloc(n * sizeof(struct page *));
if (!p)
return -ENOMEM;
res = get_user_pages_fast(addr, n, (i->type & WRITE) != WRITE, p);
if (unlikely(res < 0)) {
kvfree(p);
return res;
}
*pages = p;
return (res == n ? len : res * PAGE_SIZE) - *start;
}
static int iov_iter_npages_iovec(const struct iov_iter *i, int maxpages)
{
size_t offset = i->iov_offset;
size_t size = i->count;
const struct iovec *iov = i->iov;
int npages = 0;
int n;
for (n = 0; size && n < i->nr_segs; n++, iov++) {
unsigned long addr = (unsigned long)iov->iov_base + offset;
size_t len = iov->iov_len - offset;
offset = 0;
if (unlikely(!len)) /* empty segment */
continue;
if (len > size)
len = size;
npages += (addr + len + PAGE_SIZE - 1) / PAGE_SIZE
- addr / PAGE_SIZE;
if (npages >= maxpages) /* don't bother going further */
return maxpages;
size -= len;
offset = 0;
}
return min(npages, maxpages);
}
static void memcpy_from_page(char *to, struct page *page, size_t offset, size_t len)
{
char *from = kmap_atomic(page);
memcpy(to, from + offset, len);
kunmap_atomic(from);
}
static void memcpy_to_page(struct page *page, size_t offset, char *from, size_t len)
{
char *to = kmap_atomic(page);
memcpy(to + offset, from, len);
kunmap_atomic(to);
}
static void memzero_page(struct page *page, size_t offset, size_t len)
{
char *addr = kmap_atomic(page);
memset(addr + offset, 0, len);
kunmap_atomic(addr);
}
static size_t copy_to_iter_bvec(void *from, size_t bytes, struct iov_iter *i)
{
size_t skip, copy, wanted;
const struct bio_vec *bvec;
if (unlikely(bytes > i->count))
bytes = i->count;
if (unlikely(!bytes))
return 0;
wanted = bytes;
bvec = i->bvec;
skip = i->iov_offset;
copy = min_t(size_t, bytes, bvec->bv_len - skip);
memcpy_to_page(bvec->bv_page, skip + bvec->bv_offset, from, copy);
skip += copy;
from += copy;
bytes -= copy;
while (bytes) {
bvec++;
copy = min(bytes, (size_t)bvec->bv_len);
memcpy_to_page(bvec->bv_page, bvec->bv_offset, from, copy);
skip = copy;
from += copy;
bytes -= copy;
}
if (skip == bvec->bv_len) {
bvec++;
skip = 0;
}
i->count -= wanted - bytes;
i->nr_segs -= bvec - i->bvec;
i->bvec = bvec;
i->iov_offset = skip;
return wanted - bytes;
}
static size_t copy_from_iter_bvec(void *to, size_t bytes, struct iov_iter *i)
{
size_t skip, copy, wanted;
const struct bio_vec *bvec;
if (unlikely(bytes > i->count))
bytes = i->count;
if (unlikely(!bytes))
return 0;
wanted = bytes;
bvec = i->bvec;
skip = i->iov_offset;
copy = min(bytes, bvec->bv_len - skip);
memcpy_from_page(to, bvec->bv_page, bvec->bv_offset + skip, copy);
to += copy;
skip += copy;
bytes -= copy;
while (bytes) {
bvec++;
copy = min(bytes, (size_t)bvec->bv_len);
memcpy_from_page(to, bvec->bv_page, bvec->bv_offset, copy);
skip = copy;
to += copy;
bytes -= copy;
}
if (skip == bvec->bv_len) {
bvec++;
skip = 0;
}
i->count -= wanted;
i->nr_segs -= bvec - i->bvec;
i->bvec = bvec;
i->iov_offset = skip;
return wanted;
}
static size_t copy_page_to_iter_bvec(struct page *page, size_t offset,
size_t bytes, struct iov_iter *i)
{
void *kaddr = kmap_atomic(page);
size_t wanted = copy_to_iter_bvec(kaddr + offset, bytes, i);
kunmap_atomic(kaddr);
return wanted;
}
static size_t copy_page_from_iter_bvec(struct page *page, size_t offset,
size_t bytes, struct iov_iter *i)
{
void *kaddr = kmap_atomic(page);
size_t wanted = copy_from_iter_bvec(kaddr + offset, bytes, i);
kunmap_atomic(kaddr);
return wanted;
}
static size_t zero_bvec(size_t bytes, struct iov_iter *i)
{
size_t skip, copy, wanted;
const struct bio_vec *bvec;
if (unlikely(bytes > i->count))
bytes = i->count;
if (unlikely(!bytes))
return 0;
wanted = bytes;
bvec = i->bvec;
skip = i->iov_offset;
copy = min_t(size_t, bytes, bvec->bv_len - skip);
memzero_page(bvec->bv_page, skip + bvec->bv_offset, copy);
skip += copy;
bytes -= copy;
while (bytes) {
bvec++;
copy = min(bytes, (size_t)bvec->bv_len);
memzero_page(bvec->bv_page, bvec->bv_offset, copy);
skip = copy;
bytes -= copy;
}
if (skip == bvec->bv_len) {
bvec++;
skip = 0;
}
i->count -= wanted - bytes;
i->nr_segs -= bvec - i->bvec;
i->bvec = bvec;
i->iov_offset = skip;
return wanted - bytes;
}
static size_t copy_from_user_bvec(struct page *page,
struct iov_iter *i, unsigned long offset, size_t bytes)
{
char *kaddr;
size_t left;
const struct bio_vec *bvec;
size_t base = i->iov_offset;
kaddr = kmap_atomic(page);
for (left = bytes, bvec = i->bvec; left; bvec++, base = 0) {
size_t copy = min(left, bvec->bv_len - base);
if (!bvec->bv_len)
continue;
memcpy_from_page(kaddr + offset, bvec->bv_page,
bvec->bv_offset + base, copy);
offset += copy;
left -= copy;
}
kunmap_atomic(kaddr);
return bytes;
}
static void advance_bvec(struct iov_iter *i, size_t bytes)
{
BUG_ON(i->count < bytes);
if (likely(i->nr_segs == 1)) {
i->iov_offset += bytes;
i->count -= bytes;
} else {
const struct bio_vec *bvec = i->bvec;
size_t base = i->iov_offset;
unsigned long nr_segs = i->nr_segs;
/*
* The !iov->iov_len check ensures we skip over unlikely
* zero-length segments (without overruning the iovec).
*/
while (bytes || unlikely(i->count && !bvec->bv_len)) {
int copy;
copy = min(bytes, bvec->bv_len - base);
BUG_ON(!i->count || i->count < copy);
i->count -= copy;
bytes -= copy;
base += copy;
if (bvec->bv_len == base) {
bvec++;
nr_segs--;
base = 0;
}
}
i->bvec = bvec;
i->iov_offset = base;
i->nr_segs = nr_segs;
}
}
static unsigned long alignment_bvec(const struct iov_iter *i)
{
const struct bio_vec *bvec = i->bvec;
unsigned long res;
size_t size = i->count;
size_t n;
if (!size)
return 0;
res = bvec->bv_offset + i->iov_offset;
n = bvec->bv_len - i->iov_offset;
if (n >= size)
return res | size;
size -= n;
res |= n;
while (size > (++bvec)->bv_len) {
res |= bvec->bv_offset | bvec->bv_len;
size -= bvec->bv_len;
}
res |= bvec->bv_offset | size;
return res;
}
static ssize_t get_pages_bvec(struct iov_iter *i,
struct page **pages, size_t maxsize, unsigned maxpages,
size_t *start)
{
const struct bio_vec *bvec = i->bvec;
size_t len = bvec->bv_len - i->iov_offset;
if (len > i->count)
len = i->count;
if (len > maxsize)
len = maxsize;
/* can't be more than PAGE_SIZE */
*start = bvec->bv_offset + i->iov_offset;
get_page(*pages = bvec->bv_page);
return len;
}
static ssize_t get_pages_alloc_bvec(struct iov_iter *i,
struct page ***pages, size_t maxsize,
size_t *start)
{
const struct bio_vec *bvec = i->bvec;
size_t len = bvec->bv_len - i->iov_offset;
if (len > i->count)
len = i->count;
if (len > maxsize)
len = maxsize;
*start = bvec->bv_offset + i->iov_offset;
*pages = kmalloc(sizeof(struct page *), GFP_KERNEL);
if (!*pages)
return -ENOMEM;
get_page(**pages = bvec->bv_page);
return len;
}
static int iov_iter_npages_bvec(const struct iov_iter *i, int maxpages)
{
size_t offset = i->iov_offset;
size_t size = i->count;
const struct bio_vec *bvec = i->bvec;
int npages = 0;
int n;
for (n = 0; size && n < i->nr_segs; n++, bvec++) {
size_t len = bvec->bv_len - offset;
offset = 0;
if (unlikely(!len)) /* empty segment */
continue;
if (len > size)
len = size;
npages++;
if (npages >= maxpages) /* don't bother going further */
return maxpages;
size -= len;
offset = 0;
}
return min(npages, maxpages);
}
size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
struct iov_iter *i)
{
if (i->type & ITER_BVEC)
return copy_page_to_iter_bvec(page, offset, bytes, i);
else
return copy_page_to_iter_iovec(page, offset, bytes, i);
}
EXPORT_SYMBOL(copy_page_to_iter);
size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
struct iov_iter *i)
{
if (i->type & ITER_BVEC)
return copy_page_from_iter_bvec(page, offset, bytes, i);
else
return copy_page_from_iter_iovec(page, offset, bytes, i);
}
EXPORT_SYMBOL(copy_page_from_iter);
size_t copy_to_iter(void *addr, size_t bytes, struct iov_iter *i)
{
if (i->type & ITER_BVEC)
return copy_to_iter_bvec(addr, bytes, i);
else
return copy_to_iter_iovec(addr, bytes, i);
}
EXPORT_SYMBOL(copy_to_iter);
size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
{
if (i->type & ITER_BVEC)
return copy_from_iter_bvec(addr, bytes, i);
else
return copy_from_iter_iovec(addr, bytes, i);
}
EXPORT_SYMBOL(copy_from_iter);
size_t iov_iter_zero(size_t bytes, struct iov_iter *i)
{
if (i->type & ITER_BVEC) {
return zero_bvec(bytes, i);
} else {
return zero_iovec(bytes, i);
}
}
EXPORT_SYMBOL(iov_iter_zero);
size_t iov_iter_copy_from_user_atomic(struct page *page,
struct iov_iter *i, unsigned long offset, size_t bytes)
{
if (i->type & ITER_BVEC)
return copy_from_user_bvec(page, i, offset, bytes);
else
return copy_from_user_atomic_iovec(page, i, offset, bytes);
}
EXPORT_SYMBOL(iov_iter_copy_from_user_atomic);
void iov_iter_advance(struct iov_iter *i, size_t size)
{
if (i->type & ITER_BVEC)
advance_bvec(i, size);
else
advance_iovec(i, size);
}
EXPORT_SYMBOL(iov_iter_advance);
/*
* Return the count of just the current iov_iter segment.
*/
size_t iov_iter_single_seg_count(const struct iov_iter *i)
{
if (i->nr_segs == 1)
return i->count;
else if (i->type & ITER_BVEC)
return min(i->count, i->bvec->bv_len - i->iov_offset);
else
return min(i->count, i->iov->iov_len - i->iov_offset);
}
EXPORT_SYMBOL(iov_iter_single_seg_count);
unsigned long iov_iter_alignment(const struct iov_iter *i)
{
if (i->type & ITER_BVEC)
return alignment_bvec(i);
else
return alignment_iovec(i);
}
EXPORT_SYMBOL(iov_iter_alignment);
ssize_t iov_iter_get_pages(struct iov_iter *i,
struct page **pages, size_t maxsize, unsigned maxpages,
size_t *start)
{
if (i->type & ITER_BVEC)
return get_pages_bvec(i, pages, maxsize, maxpages, start);
else
return get_pages_iovec(i, pages, maxsize, maxpages, start);
}
EXPORT_SYMBOL(iov_iter_get_pages);
ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
struct page ***pages, size_t maxsize,
size_t *start)
{
if (i->type & ITER_BVEC)
return get_pages_alloc_bvec(i, pages, maxsize, start);
else
return get_pages_alloc_iovec(i, pages, maxsize, start);
}
EXPORT_SYMBOL(iov_iter_get_pages_alloc);
int iov_iter_npages(const struct iov_iter *i, int maxpages)
{
if (i->type & ITER_BVEC)
return iov_iter_npages_bvec(i, maxpages);
else
return iov_iter_npages_iovec(i, maxpages);
}
EXPORT_SYMBOL(iov_iter_npages);

123
mm/kmemcheck.c Normal file
View file

@ -0,0 +1,123 @@
#include <linux/gfp.h>
#include <linux/mm_types.h>
#include <linux/mm.h>
#include <linux/slab.h>
#include "slab.h"
#include <linux/kmemcheck.h>
void kmemcheck_alloc_shadow(struct page *page, int order, gfp_t flags, int node)
{
struct page *shadow;
int pages;
int i;
pages = 1 << order;
/*
* With kmemcheck enabled, we need to allocate a memory area for the
* shadow bits as well.
*/
shadow = alloc_pages_node(node, flags | __GFP_NOTRACK, order);
if (!shadow) {
if (printk_ratelimit())
printk(KERN_ERR "kmemcheck: failed to allocate "
"shadow bitmap\n");
return;
}
for(i = 0; i < pages; ++i)
page[i].shadow = page_address(&shadow[i]);
/*
* Mark it as non-present for the MMU so that our accesses to
* this memory will trigger a page fault and let us analyze
* the memory accesses.
*/
kmemcheck_hide_pages(page, pages);
}
void kmemcheck_free_shadow(struct page *page, int order)
{
struct page *shadow;
int pages;
int i;
if (!kmemcheck_page_is_tracked(page))
return;
pages = 1 << order;
kmemcheck_show_pages(page, pages);
shadow = virt_to_page(page[0].shadow);
for(i = 0; i < pages; ++i)
page[i].shadow = NULL;
__free_pages(shadow, order);
}
void kmemcheck_slab_alloc(struct kmem_cache *s, gfp_t gfpflags, void *object,
size_t size)
{
/*
* Has already been memset(), which initializes the shadow for us
* as well.
*/
if (gfpflags & __GFP_ZERO)
return;
/* No need to initialize the shadow of a non-tracked slab. */
if (s->flags & SLAB_NOTRACK)
return;
if (!kmemcheck_enabled || gfpflags & __GFP_NOTRACK) {
/*
* Allow notracked objects to be allocated from
* tracked caches. Note however that these objects
* will still get page faults on access, they just
* won't ever be flagged as uninitialized. If page
* faults are not acceptable, the slab cache itself
* should be marked NOTRACK.
*/
kmemcheck_mark_initialized(object, size);
} else if (!s->ctor) {
/*
* New objects should be marked uninitialized before
* they're returned to the called.
*/
kmemcheck_mark_uninitialized(object, size);
}
}
void kmemcheck_slab_free(struct kmem_cache *s, void *object, size_t size)
{
/* TODO: RCU freeing is unsupported for now; hide false positives. */
if (!s->ctor && !(s->flags & SLAB_DESTROY_BY_RCU))
kmemcheck_mark_freed(object, size);
}
void kmemcheck_pagealloc_alloc(struct page *page, unsigned int order,
gfp_t gfpflags)
{
int pages;
if (gfpflags & (__GFP_HIGHMEM | __GFP_NOTRACK))
return;
pages = 1 << order;
/*
* NOTE: We choose to track GFP_ZERO pages too; in fact, they
* can become uninitialized by copying uninitialized memory
* into them.
*/
/* XXX: Can use zone->node for node? */
kmemcheck_alloc_shadow(page, order, gfpflags, -1);
if (gfpflags & __GFP_ZERO)
kmemcheck_mark_initialized_pages(page, pages);
else
kmemcheck_mark_uninitialized_pages(page, pages);
}

111
mm/kmemleak-test.c Normal file
View file

@ -0,0 +1,111 @@
/*
* mm/kmemleak-test.c
*
* Copyright (C) 2008 ARM Limited
* Written by Catalin Marinas <catalin.marinas@arm.com>
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License version 2 as
* published by the Free Software Foundation.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software
* Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
*/
#define pr_fmt(fmt) "kmemleak: " fmt
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>
#include <linux/list.h>
#include <linux/percpu.h>
#include <linux/fdtable.h>
#include <linux/kmemleak.h>
struct test_node {
long header[25];
struct list_head list;
long footer[25];
};
static LIST_HEAD(test_list);
static DEFINE_PER_CPU(void *, kmemleak_test_pointer);
/*
* Some very simple testing. This function needs to be extended for
* proper testing.
*/
static int __init kmemleak_test_init(void)
{
struct test_node *elem;
int i;
printk(KERN_INFO "Kmemleak testing\n");
/* make some orphan objects */
pr_info("kmalloc(32) = %p\n", kmalloc(32, GFP_KERNEL));
pr_info("kmalloc(32) = %p\n", kmalloc(32, GFP_KERNEL));
pr_info("kmalloc(1024) = %p\n", kmalloc(1024, GFP_KERNEL));
pr_info("kmalloc(1024) = %p\n", kmalloc(1024, GFP_KERNEL));
pr_info("kmalloc(2048) = %p\n", kmalloc(2048, GFP_KERNEL));
pr_info("kmalloc(2048) = %p\n", kmalloc(2048, GFP_KERNEL));
pr_info("kmalloc(4096) = %p\n", kmalloc(4096, GFP_KERNEL));
pr_info("kmalloc(4096) = %p\n", kmalloc(4096, GFP_KERNEL));
#ifndef CONFIG_MODULES
pr_info("kmem_cache_alloc(files_cachep) = %p\n",
kmem_cache_alloc(files_cachep, GFP_KERNEL));
pr_info("kmem_cache_alloc(files_cachep) = %p\n",
kmem_cache_alloc(files_cachep, GFP_KERNEL));
#endif
pr_info("vmalloc(64) = %p\n", vmalloc(64));
pr_info("vmalloc(64) = %p\n", vmalloc(64));
pr_info("vmalloc(64) = %p\n", vmalloc(64));
pr_info("vmalloc(64) = %p\n", vmalloc(64));
pr_info("vmalloc(64) = %p\n", vmalloc(64));
/*
* Add elements to a list. They should only appear as orphan
* after the module is removed.
*/
for (i = 0; i < 10; i++) {
elem = kzalloc(sizeof(*elem), GFP_KERNEL);
pr_info("kzalloc(sizeof(*elem)) = %p\n", elem);
if (!elem)
return -ENOMEM;
INIT_LIST_HEAD(&elem->list);
list_add_tail(&elem->list, &test_list);
}
for_each_possible_cpu(i) {
per_cpu(kmemleak_test_pointer, i) = kmalloc(129, GFP_KERNEL);
pr_info("kmalloc(129) = %p\n",
per_cpu(kmemleak_test_pointer, i));
}
return 0;
}
module_init(kmemleak_test_init);
static void __exit kmemleak_test_exit(void)
{
struct test_node *elem, *tmp;
/*
* Remove the list elements without actually freeing the
* memory.
*/
list_for_each_entry_safe(elem, tmp, &test_list, list)
list_del(&elem->list);
}
module_exit(kmemleak_test_exit);
MODULE_LICENSE("GPL");

1920
mm/kmemleak.c Normal file

File diff suppressed because it is too large Load diff

2341
mm/ksm.c Normal file

File diff suppressed because it is too large Load diff

152
mm/list_lru.c Normal file
View file

@ -0,0 +1,152 @@
/*
* Copyright (c) 2013 Red Hat, Inc. and Parallels Inc. All rights reserved.
* Authors: David Chinner and Glauber Costa
*
* Generic LRU infrastructure
*/
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/mm.h>
#include <linux/list_lru.h>
#include <linux/slab.h>
bool list_lru_add(struct list_lru *lru, struct list_head *item)
{
int nid = page_to_nid(virt_to_page(item));
struct list_lru_node *nlru = &lru->node[nid];
spin_lock(&nlru->lock);
WARN_ON_ONCE(nlru->nr_items < 0);
if (list_empty(item)) {
list_add_tail(item, &nlru->list);
if (nlru->nr_items++ == 0)
node_set(nid, lru->active_nodes);
spin_unlock(&nlru->lock);
return true;
}
spin_unlock(&nlru->lock);
return false;
}
EXPORT_SYMBOL_GPL(list_lru_add);
bool list_lru_del(struct list_lru *lru, struct list_head *item)
{
int nid = page_to_nid(virt_to_page(item));
struct list_lru_node *nlru = &lru->node[nid];
spin_lock(&nlru->lock);
if (!list_empty(item)) {
list_del_init(item);
if (--nlru->nr_items == 0)
node_clear(nid, lru->active_nodes);
WARN_ON_ONCE(nlru->nr_items < 0);
spin_unlock(&nlru->lock);
return true;
}
spin_unlock(&nlru->lock);
return false;
}
EXPORT_SYMBOL_GPL(list_lru_del);
unsigned long
list_lru_count_node(struct list_lru *lru, int nid)
{
unsigned long count = 0;
struct list_lru_node *nlru = &lru->node[nid];
spin_lock(&nlru->lock);
WARN_ON_ONCE(nlru->nr_items < 0);
count += nlru->nr_items;
spin_unlock(&nlru->lock);
return count;
}
EXPORT_SYMBOL_GPL(list_lru_count_node);
unsigned long
list_lru_walk_node(struct list_lru *lru, int nid, list_lru_walk_cb isolate,
void *cb_arg, unsigned long *nr_to_walk)
{
struct list_lru_node *nlru = &lru->node[nid];
struct list_head *item, *n;
unsigned long isolated = 0;
spin_lock(&nlru->lock);
restart:
list_for_each_safe(item, n, &nlru->list) {
enum lru_status ret;
/*
* decrement nr_to_walk first so that we don't livelock if we
* get stuck on large numbesr of LRU_RETRY items
*/
if (!*nr_to_walk)
break;
--*nr_to_walk;
ret = isolate(item, &nlru->lock, cb_arg);
switch (ret) {
case LRU_REMOVED_RETRY:
assert_spin_locked(&nlru->lock);
case LRU_REMOVED:
if (--nlru->nr_items == 0)
node_clear(nid, lru->active_nodes);
WARN_ON_ONCE(nlru->nr_items < 0);
isolated++;
/*
* If the lru lock has been dropped, our list
* traversal is now invalid and so we have to
* restart from scratch.
*/
if (ret == LRU_REMOVED_RETRY)
goto restart;
break;
case LRU_ROTATE:
list_move_tail(item, &nlru->list);
break;
case LRU_SKIP:
break;
case LRU_RETRY:
/*
* The lru lock has been dropped, our list traversal is
* now invalid and so we have to restart from scratch.
*/
assert_spin_locked(&nlru->lock);
goto restart;
default:
BUG();
}
}
spin_unlock(&nlru->lock);
return isolated;
}
EXPORT_SYMBOL_GPL(list_lru_walk_node);
int list_lru_init_key(struct list_lru *lru, struct lock_class_key *key)
{
int i;
size_t size = sizeof(*lru->node) * nr_node_ids;
lru->node = kzalloc(size, GFP_KERNEL);
if (!lru->node)
return -ENOMEM;
nodes_clear(lru->active_nodes);
for (i = 0; i < nr_node_ids; i++) {
spin_lock_init(&lru->node[i].lock);
if (key)
lockdep_set_class(&lru->node[i].lock, key);
INIT_LIST_HEAD(&lru->node[i].list);
lru->node[i].nr_items = 0;
}
return 0;
}
EXPORT_SYMBOL_GPL(list_lru_init_key);
void list_lru_destroy(struct list_lru *lru)
{
kfree(lru->node);
}
EXPORT_SYMBOL_GPL(list_lru_destroy);

62
mm/maccess.c Normal file
View file

@ -0,0 +1,62 @@
/*
* Access kernel memory without faulting.
*/
#include <linux/export.h>
#include <linux/mm.h>
#include <linux/uaccess.h>
/**
* probe_kernel_read(): safely attempt to read from a location
* @dst: pointer to the buffer that shall take the data
* @src: address to read from
* @size: size of the data chunk
*
* Safely read from address @src to the buffer at @dst. If a kernel fault
* happens, handle that and return -EFAULT.
*/
long __weak probe_kernel_read(void *dst, const void *src, size_t size)
__attribute__((alias("__probe_kernel_read")));
long __probe_kernel_read(void *dst, const void *src, size_t size)
{
long ret;
mm_segment_t old_fs = get_fs();
set_fs(KERNEL_DS);
pagefault_disable();
ret = __copy_from_user_inatomic(dst,
(__force const void __user *)src, size);
pagefault_enable();
set_fs(old_fs);
return ret ? -EFAULT : 0;
}
EXPORT_SYMBOL_GPL(probe_kernel_read);
/**
* probe_kernel_write(): safely attempt to write to a location
* @dst: address to write to
* @src: pointer to the data that shall be written
* @size: size of the data chunk
*
* Safely write to address @dst from the buffer at @src. If a kernel fault
* happens, handle that and return -EFAULT.
*/
long __weak probe_kernel_write(void *dst, const void *src, size_t size)
__attribute__((alias("__probe_kernel_write")));
long __probe_kernel_write(void *dst, const void *src, size_t size)
{
long ret;
mm_segment_t old_fs = get_fs();
set_fs(KERNEL_DS);
pagefault_disable();
ret = __copy_to_user_inatomic((__force void __user *)dst, src, size);
pagefault_enable();
set_fs(old_fs);
return ret ? -EFAULT : 0;
}
EXPORT_SYMBOL_GPL(probe_kernel_write);

554
mm/madvise.c Normal file
View file

@ -0,0 +1,554 @@
/*
* linux/mm/madvise.c
*
* Copyright (C) 1999 Linus Torvalds
* Copyright (C) 2002 Christoph Hellwig
*/
#include <linux/mman.h>
#include <linux/pagemap.h>
#include <linux/syscalls.h>
#include <linux/mempolicy.h>
#include <linux/page-isolation.h>
#include <linux/hugetlb.h>
#include <linux/falloc.h>
#include <linux/sched.h>
#include <linux/ksm.h>
#include <linux/fs.h>
#include <linux/file.h>
#include <linux/blkdev.h>
#include <linux/swap.h>
#include <linux/swapops.h>
/*
* Any behaviour which results in changes to the vma->vm_flags needs to
* take mmap_sem for writing. Others, which simply traverse vmas, need
* to only take it for reading.
*/
static int madvise_need_mmap_write(int behavior)
{
switch (behavior) {
case MADV_REMOVE:
case MADV_WILLNEED:
case MADV_DONTNEED:
return 0;
default:
/* be safe, default to 1. list exceptions explicitly */
return 1;
}
}
/*
* We can potentially split a vm area into separate
* areas, each area with its own behavior.
*/
static long madvise_behavior(struct vm_area_struct *vma,
struct vm_area_struct **prev,
unsigned long start, unsigned long end, int behavior)
{
struct mm_struct *mm = vma->vm_mm;
int error = 0;
pgoff_t pgoff;
unsigned long new_flags = vma->vm_flags;
switch (behavior) {
case MADV_NORMAL:
new_flags = new_flags & ~VM_RAND_READ & ~VM_SEQ_READ;
break;
case MADV_SEQUENTIAL:
new_flags = (new_flags & ~VM_RAND_READ) | VM_SEQ_READ;
break;
case MADV_RANDOM:
new_flags = (new_flags & ~VM_SEQ_READ) | VM_RAND_READ;
break;
case MADV_DONTFORK:
new_flags |= VM_DONTCOPY;
break;
case MADV_DOFORK:
if (vma->vm_flags & VM_IO) {
error = -EINVAL;
goto out;
}
new_flags &= ~VM_DONTCOPY;
break;
case MADV_DONTDUMP:
new_flags |= VM_DONTDUMP;
break;
case MADV_DODUMP:
if (new_flags & VM_SPECIAL) {
error = -EINVAL;
goto out;
}
new_flags &= ~VM_DONTDUMP;
break;
case MADV_MERGEABLE:
case MADV_UNMERGEABLE:
error = ksm_madvise(vma, start, end, behavior, &new_flags);
if (error)
goto out;
break;
case MADV_HUGEPAGE:
case MADV_NOHUGEPAGE:
error = hugepage_madvise(vma, &new_flags, behavior);
if (error)
goto out;
break;
}
if (new_flags == vma->vm_flags) {
*prev = vma;
goto out;
}
pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
*prev = vma_merge(mm, *prev, start, end, new_flags, vma->anon_vma,
vma->vm_file, pgoff, vma_policy(vma),
vma_get_anon_name(vma));
if (*prev) {
vma = *prev;
goto success;
}
*prev = vma;
if (start != vma->vm_start) {
error = split_vma(mm, vma, start, 1);
if (error)
goto out;
}
if (end != vma->vm_end) {
error = split_vma(mm, vma, end, 0);
if (error)
goto out;
}
success:
/*
* vm_flags is protected by the mmap_sem held in write mode.
*/
vma->vm_flags = new_flags;
out:
if (error == -ENOMEM)
error = -EAGAIN;
return error;
}
#ifdef CONFIG_SWAP
static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
unsigned long end, struct mm_walk *walk)
{
pte_t *orig_pte;
struct vm_area_struct *vma = walk->private;
unsigned long index;
if (pmd_none_or_trans_huge_or_clear_bad(pmd))
return 0;
for (index = start; index != end; index += PAGE_SIZE) {
pte_t pte;
swp_entry_t entry;
struct page *page;
spinlock_t *ptl;
orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
pte = *(orig_pte + ((index - start) / PAGE_SIZE));
pte_unmap_unlock(orig_pte, ptl);
if (pte_present(pte) || pte_none(pte) || pte_file(pte))
continue;
entry = pte_to_swp_entry(pte);
if (unlikely(non_swap_entry(entry)))
continue;
page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE,
vma, index);
if (page)
page_cache_release(page);
}
return 0;
}
static void force_swapin_readahead(struct vm_area_struct *vma,
unsigned long start, unsigned long end)
{
struct mm_walk walk = {
.mm = vma->vm_mm,
.pmd_entry = swapin_walk_pmd_entry,
.private = vma,
};
walk_page_range(start, end, &walk);
lru_add_drain(); /* Push any new pages onto the LRU now */
}
static void force_shm_swapin_readahead(struct vm_area_struct *vma,
unsigned long start, unsigned long end,
struct address_space *mapping)
{
pgoff_t index;
struct page *page;
swp_entry_t swap;
for (; start < end; start += PAGE_SIZE) {
index = ((start - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
page = find_get_entry(mapping, index);
if (!radix_tree_exceptional_entry(page)) {
if (page)
page_cache_release(page);
continue;
}
swap = radix_to_swp_entry(page);
page = read_swap_cache_async(swap, GFP_HIGHUSER_MOVABLE,
NULL, 0);
if (page)
page_cache_release(page);
}
lru_add_drain(); /* Push any new pages onto the LRU now */
}
#endif /* CONFIG_SWAP */
/*
* Schedule all required I/O operations. Do not wait for completion.
*/
static long madvise_willneed(struct vm_area_struct *vma,
struct vm_area_struct **prev,
unsigned long start, unsigned long end)
{
struct file *file = vma->vm_file;
#ifdef CONFIG_SWAP
if (!file || mapping_cap_swap_backed(file->f_mapping)) {
*prev = vma;
if (!file)
force_swapin_readahead(vma, start, end);
else
force_shm_swapin_readahead(vma, start, end,
file->f_mapping);
return 0;
}
#endif
if (!file)
return -EBADF;
if (file->f_mapping->a_ops->get_xip_mem) {
/* no bad return value, but ignore advice */
return 0;
}
*prev = vma;
start = ((start - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
if (end > vma->vm_end)
end = vma->vm_end;
end = ((end - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
force_page_cache_readahead(file->f_mapping, file, start, end - start);
return 0;
}
/*
* Application no longer needs these pages. If the pages are dirty,
* it's OK to just throw them away. The app will be more careful about
* data it wants to keep. Be sure to free swap resources too. The
* zap_page_range call sets things up for shrink_active_list to actually free
* these pages later if no one else has touched them in the meantime,
* although we could add these pages to a global reuse list for
* shrink_active_list to pick up before reclaiming other pages.
*
* NB: This interface discards data rather than pushes it out to swap,
* as some implementations do. This has performance implications for
* applications like large transactional databases which want to discard
* pages in anonymous maps after committing to backing store the data
* that was kept in them. There is no reason to write this data out to
* the swap area if the application is discarding it.
*
* An interface that causes the system to free clean pages and flush
* dirty pages is already available as msync(MS_INVALIDATE).
*/
static long madvise_dontneed(struct vm_area_struct *vma,
struct vm_area_struct **prev,
unsigned long start, unsigned long end)
{
*prev = vma;
if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
return -EINVAL;
if (unlikely(vma->vm_flags & VM_NONLINEAR)) {
struct zap_details details = {
.nonlinear_vma = vma,
.last_index = ULONG_MAX,
};
zap_page_range(vma, start, end - start, &details);
} else
zap_page_range(vma, start, end - start, NULL);
return 0;
}
/*
* Application wants to free up the pages and associated backing store.
* This is effectively punching a hole into the middle of a file.
*/
static long madvise_remove(struct vm_area_struct *vma,
struct vm_area_struct **prev,
unsigned long start, unsigned long end)
{
loff_t offset;
int error;
struct file *f;
*prev = NULL; /* tell sys_madvise we drop mmap_sem */
if (vma->vm_flags & (VM_LOCKED|VM_NONLINEAR|VM_HUGETLB))
return -EINVAL;
f = vma->vm_file;
if (!f || !f->f_mapping || !f->f_mapping->host) {
return -EINVAL;
}
if ((vma->vm_flags & (VM_SHARED|VM_WRITE)) != (VM_SHARED|VM_WRITE))
return -EACCES;
offset = (loff_t)(start - vma->vm_start)
+ ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
/*
* Filesystem's fallocate may need to take i_mutex. We need to
* explicitly grab a reference because the vma (and hence the
* vma's reference to the file) can go away as soon as we drop
* mmap_sem.
*/
get_file(f);
up_read(&current->mm->mmap_sem);
error = do_fallocate(f,
FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
offset, end - start);
fput(f);
down_read(&current->mm->mmap_sem);
return error;
}
#ifdef CONFIG_MEMORY_FAILURE
/*
* Error injection support for memory error handling.
*/
static int madvise_hwpoison(int bhv, unsigned long start, unsigned long end)
{
struct page *p;
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
for (; start < end; start += PAGE_SIZE <<
compound_order(compound_head(p))) {
int ret;
ret = get_user_pages_fast(start, 1, 0, &p);
if (ret != 1)
return ret;
if (PageHWPoison(p)) {
put_page(p);
continue;
}
if (bhv == MADV_SOFT_OFFLINE) {
pr_info("Soft offlining page %#lx at %#lx\n",
page_to_pfn(p), start);
ret = soft_offline_page(p, MF_COUNT_INCREASED);
if (ret)
return ret;
continue;
}
pr_info("Injecting memory failure for page %#lx at %#lx\n",
page_to_pfn(p), start);
/* Ignore return value for now */
memory_failure(page_to_pfn(p), 0, MF_COUNT_INCREASED);
}
return 0;
}
#endif
static long
madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
unsigned long start, unsigned long end, int behavior)
{
switch (behavior) {
case MADV_REMOVE:
return madvise_remove(vma, prev, start, end);
case MADV_WILLNEED:
return madvise_willneed(vma, prev, start, end);
case MADV_DONTNEED:
return madvise_dontneed(vma, prev, start, end);
default:
return madvise_behavior(vma, prev, start, end, behavior);
}
}
static int
madvise_behavior_valid(int behavior)
{
switch (behavior) {
case MADV_DOFORK:
case MADV_DONTFORK:
case MADV_NORMAL:
case MADV_SEQUENTIAL:
case MADV_RANDOM:
case MADV_REMOVE:
case MADV_WILLNEED:
case MADV_DONTNEED:
#ifdef CONFIG_KSM
case MADV_MERGEABLE:
case MADV_UNMERGEABLE:
#endif
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
case MADV_HUGEPAGE:
case MADV_NOHUGEPAGE:
#endif
case MADV_DONTDUMP:
case MADV_DODUMP:
return 1;
default:
return 0;
}
}
/*
* The madvise(2) system call.
*
* Applications can use madvise() to advise the kernel how it should
* handle paging I/O in this VM area. The idea is to help the kernel
* use appropriate read-ahead and caching techniques. The information
* provided is advisory only, and can be safely disregarded by the
* kernel without affecting the correct operation of the application.
*
* behavior values:
* MADV_NORMAL - the default behavior is to read clusters. This
* results in some read-ahead and read-behind.
* MADV_RANDOM - the system should read the minimum amount of data
* on any access, since it is unlikely that the appli-
* cation will need more than what it asks for.
* MADV_SEQUENTIAL - pages in the given range will probably be accessed
* once, so they can be aggressively read ahead, and
* can be freed soon after they are accessed.
* MADV_WILLNEED - the application is notifying the system to read
* some pages ahead.
* MADV_DONTNEED - the application is finished with the given range,
* so the kernel can free resources associated with it.
* MADV_REMOVE - the application wants to free up the given range of
* pages and associated backing store.
* MADV_DONTFORK - omit this area from child's address space when forking:
* typically, to avoid COWing pages pinned by get_user_pages().
* MADV_DOFORK - cancel MADV_DONTFORK: no longer omit this area when forking.
* MADV_MERGEABLE - the application recommends that KSM try to merge pages in
* this area with pages of identical content from other such areas.
* MADV_UNMERGEABLE- cancel MADV_MERGEABLE: no longer merge pages with others.
*
* return values:
* zero - success
* -EINVAL - start + len < 0, start is not page-aligned,
* "behavior" is not a valid value, or application
* is attempting to release locked or shared pages.
* -ENOMEM - addresses in the specified range are not currently
* mapped, or are outside the AS of the process.
* -EIO - an I/O error occurred while paging in data.
* -EBADF - map exists, but area maps something that isn't a file.
* -EAGAIN - a kernel resource was temporarily unavailable.
*/
SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
{
unsigned long end, tmp;
struct vm_area_struct *vma, *prev;
int unmapped_error = 0;
int error = -EINVAL;
int write;
size_t len;
struct blk_plug plug;
#ifdef CONFIG_MEMORY_FAILURE
if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
return madvise_hwpoison(behavior, start, start+len_in);
#endif
if (!madvise_behavior_valid(behavior))
return error;
if (start & ~PAGE_MASK)
return error;
len = (len_in + ~PAGE_MASK) & PAGE_MASK;
/* Check to see whether len was rounded up from small -ve to zero */
if (len_in && !len)
return error;
end = start + len;
if (end < start)
return error;
error = 0;
if (end == start)
return error;
write = madvise_need_mmap_write(behavior);
if (write)
down_write(&current->mm->mmap_sem);
else
down_read(&current->mm->mmap_sem);
/*
* If the interval [start,end) covers some unmapped address
* ranges, just ignore them, but return -ENOMEM at the end.
* - different from the way of handling in mlock etc.
*/
vma = find_vma_prev(current->mm, start, &prev);
if (vma && start > vma->vm_start)
prev = vma;
blk_start_plug(&plug);
for (;;) {
/* Still start < end. */
error = -ENOMEM;
if (!vma)
goto out;
/* Here start < (end|vma->vm_end). */
if (start < vma->vm_start) {
unmapped_error = -ENOMEM;
start = vma->vm_start;
if (start >= end)
goto out;
}
/* Here vma->vm_start <= start < (end|vma->vm_end) */
tmp = vma->vm_end;
if (end < tmp)
tmp = end;
/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
error = madvise_vma(vma, &prev, start, tmp, behavior);
if (error)
goto out;
start = tmp;
if (prev && start < prev->vm_end)
start = prev->vm_end;
error = unmapped_error;
if (start >= end)
goto out;
if (prev)
vma = prev->vm_next;
else /* madvise_remove dropped mmap_sem */
vma = find_vma(current->mm, start);
}
out:
blk_finish_plug(&plug);
if (write)
up_write(&current->mm->mmap_sem);
else
up_read(&current->mm->mmap_sem);
return error;
}

1594
mm/memblock.c Normal file

File diff suppressed because it is too large Load diff

6667
mm/memcontrol.c Normal file

File diff suppressed because it is too large Load diff

1746
mm/memory-failure.c Normal file

File diff suppressed because it is too large Load diff

3830
mm/memory.c Normal file

File diff suppressed because it is too large Load diff

2022
mm/memory_hotplug.c Normal file

File diff suppressed because it is too large Load diff

2856
mm/mempolicy.c Normal file

File diff suppressed because it is too large Load diff

379
mm/mempool.c Normal file
View file

@ -0,0 +1,379 @@
/*
* linux/mm/mempool.c
*
* memory buffer pool support. Such pools are mostly used
* for guaranteed, deadlock-free memory allocations during
* extreme VM load.
*
* started by Ingo Molnar, Copyright (C) 2001
*/
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/kmemleak.h>
#include <linux/export.h>
#include <linux/mempool.h>
#include <linux/blkdev.h>
#include <linux/writeback.h>
static void add_element(mempool_t *pool, void *element)
{
BUG_ON(pool->curr_nr >= pool->min_nr);
pool->elements[pool->curr_nr++] = element;
}
static void *remove_element(mempool_t *pool)
{
BUG_ON(pool->curr_nr <= 0);
return pool->elements[--pool->curr_nr];
}
/**
* mempool_destroy - deallocate a memory pool
* @pool: pointer to the memory pool which was allocated via
* mempool_create().
*
* Free all reserved elements in @pool and @pool itself. This function
* only sleeps if the free_fn() function sleeps.
*/
void mempool_destroy(mempool_t *pool)
{
while (pool->curr_nr) {
void *element = remove_element(pool);
pool->free(element, pool->pool_data);
}
kfree(pool->elements);
kfree(pool);
}
EXPORT_SYMBOL(mempool_destroy);
/**
* mempool_create - create a memory pool
* @min_nr: the minimum number of elements guaranteed to be
* allocated for this pool.
* @alloc_fn: user-defined element-allocation function.
* @free_fn: user-defined element-freeing function.
* @pool_data: optional private data available to the user-defined functions.
*
* this function creates and allocates a guaranteed size, preallocated
* memory pool. The pool can be used from the mempool_alloc() and mempool_free()
* functions. This function might sleep. Both the alloc_fn() and the free_fn()
* functions might sleep - as long as the mempool_alloc() function is not called
* from IRQ contexts.
*/
mempool_t *mempool_create(int min_nr, mempool_alloc_t *alloc_fn,
mempool_free_t *free_fn, void *pool_data)
{
return mempool_create_node(min_nr,alloc_fn,free_fn, pool_data,
GFP_KERNEL, NUMA_NO_NODE);
}
EXPORT_SYMBOL(mempool_create);
mempool_t *mempool_create_node(int min_nr, mempool_alloc_t *alloc_fn,
mempool_free_t *free_fn, void *pool_data,
gfp_t gfp_mask, int node_id)
{
mempool_t *pool;
pool = kzalloc_node(sizeof(*pool), gfp_mask, node_id);
if (!pool)
return NULL;
pool->elements = kmalloc_node(min_nr * sizeof(void *),
gfp_mask, node_id);
if (!pool->elements) {
kfree(pool);
return NULL;
}
spin_lock_init(&pool->lock);
pool->min_nr = min_nr;
pool->pool_data = pool_data;
init_waitqueue_head(&pool->wait);
pool->alloc = alloc_fn;
pool->free = free_fn;
/*
* First pre-allocate the guaranteed number of buffers.
*/
while (pool->curr_nr < pool->min_nr) {
void *element;
element = pool->alloc(gfp_mask, pool->pool_data);
if (unlikely(!element)) {
mempool_destroy(pool);
return NULL;
}
add_element(pool, element);
}
return pool;
}
EXPORT_SYMBOL(mempool_create_node);
/**
* mempool_resize - resize an existing memory pool
* @pool: pointer to the memory pool which was allocated via
* mempool_create().
* @new_min_nr: the new minimum number of elements guaranteed to be
* allocated for this pool.
* @gfp_mask: the usual allocation bitmask.
*
* This function shrinks/grows the pool. In the case of growing,
* it cannot be guaranteed that the pool will be grown to the new
* size immediately, but new mempool_free() calls will refill it.
*
* Note, the caller must guarantee that no mempool_destroy is called
* while this function is running. mempool_alloc() & mempool_free()
* might be called (eg. from IRQ contexts) while this function executes.
*/
int mempool_resize(mempool_t *pool, int new_min_nr, gfp_t gfp_mask)
{
void *element;
void **new_elements;
unsigned long flags;
BUG_ON(new_min_nr <= 0);
spin_lock_irqsave(&pool->lock, flags);
if (new_min_nr <= pool->min_nr) {
while (new_min_nr < pool->curr_nr) {
element = remove_element(pool);
spin_unlock_irqrestore(&pool->lock, flags);
pool->free(element, pool->pool_data);
spin_lock_irqsave(&pool->lock, flags);
}
pool->min_nr = new_min_nr;
goto out_unlock;
}
spin_unlock_irqrestore(&pool->lock, flags);
/* Grow the pool */
new_elements = kmalloc(new_min_nr * sizeof(*new_elements), gfp_mask);
if (!new_elements)
return -ENOMEM;
spin_lock_irqsave(&pool->lock, flags);
if (unlikely(new_min_nr <= pool->min_nr)) {
/* Raced, other resize will do our work */
spin_unlock_irqrestore(&pool->lock, flags);
kfree(new_elements);
goto out;
}
memcpy(new_elements, pool->elements,
pool->curr_nr * sizeof(*new_elements));
kfree(pool->elements);
pool->elements = new_elements;
pool->min_nr = new_min_nr;
while (pool->curr_nr < pool->min_nr) {
spin_unlock_irqrestore(&pool->lock, flags);
element = pool->alloc(gfp_mask, pool->pool_data);
if (!element)
goto out;
spin_lock_irqsave(&pool->lock, flags);
if (pool->curr_nr < pool->min_nr) {
add_element(pool, element);
} else {
spin_unlock_irqrestore(&pool->lock, flags);
pool->free(element, pool->pool_data); /* Raced */
goto out;
}
}
out_unlock:
spin_unlock_irqrestore(&pool->lock, flags);
out:
return 0;
}
EXPORT_SYMBOL(mempool_resize);
/**
* mempool_alloc - allocate an element from a specific memory pool
* @pool: pointer to the memory pool which was allocated via
* mempool_create().
* @gfp_mask: the usual allocation bitmask.
*
* this function only sleeps if the alloc_fn() function sleeps or
* returns NULL. Note that due to preallocation, this function
* *never* fails when called from process contexts. (it might
* fail if called from an IRQ context.)
* Note: using __GFP_ZERO is not supported.
*/
void * mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
{
void *element;
unsigned long flags;
wait_queue_t wait;
gfp_t gfp_temp;
VM_WARN_ON_ONCE(gfp_mask & __GFP_ZERO);
might_sleep_if(gfp_mask & __GFP_WAIT);
gfp_mask |= __GFP_NOMEMALLOC; /* don't allocate emergency reserves */
gfp_mask |= __GFP_NORETRY; /* don't loop in __alloc_pages */
gfp_mask |= __GFP_NOWARN; /* failures are OK */
gfp_temp = gfp_mask & ~(__GFP_WAIT|__GFP_IO);
repeat_alloc:
element = pool->alloc(gfp_temp, pool->pool_data);
if (likely(element != NULL))
return element;
spin_lock_irqsave(&pool->lock, flags);
if (likely(pool->curr_nr)) {
element = remove_element(pool);
spin_unlock_irqrestore(&pool->lock, flags);
/* paired with rmb in mempool_free(), read comment there */
smp_wmb();
/*
* Update the allocation stack trace as this is more useful
* for debugging.
*/
kmemleak_update_trace(element);
return element;
}
/*
* We use gfp mask w/o __GFP_WAIT or IO for the first round. If
* alloc failed with that and @pool was empty, retry immediately.
*/
if (gfp_temp != gfp_mask) {
spin_unlock_irqrestore(&pool->lock, flags);
gfp_temp = gfp_mask;
goto repeat_alloc;
}
/* We must not sleep if !__GFP_WAIT */
if (!(gfp_mask & __GFP_WAIT)) {
spin_unlock_irqrestore(&pool->lock, flags);
return NULL;
}
/* Let's wait for someone else to return an element to @pool */
init_wait(&wait);
prepare_to_wait(&pool->wait, &wait, TASK_UNINTERRUPTIBLE);
spin_unlock_irqrestore(&pool->lock, flags);
/*
* FIXME: this should be io_schedule(). The timeout is there as a
* workaround for some DM problems in 2.6.18.
*/
io_schedule_timeout(5*HZ);
finish_wait(&pool->wait, &wait);
goto repeat_alloc;
}
EXPORT_SYMBOL(mempool_alloc);
/**
* mempool_free - return an element to the pool.
* @element: pool element pointer.
* @pool: pointer to the memory pool which was allocated via
* mempool_create().
*
* this function only sleeps if the free_fn() function sleeps.
*/
void mempool_free(void *element, mempool_t *pool)
{
unsigned long flags;
if (unlikely(element == NULL))
return;
/*
* Paired with the wmb in mempool_alloc(). The preceding read is
* for @element and the following @pool->curr_nr. This ensures
* that the visible value of @pool->curr_nr is from after the
* allocation of @element. This is necessary for fringe cases
* where @element was passed to this task without going through
* barriers.
*
* For example, assume @p is %NULL at the beginning and one task
* performs "p = mempool_alloc(...);" while another task is doing
* "while (!p) cpu_relax(); mempool_free(p, ...);". This function
* may end up using curr_nr value which is from before allocation
* of @p without the following rmb.
*/
smp_rmb();
/*
* For correctness, we need a test which is guaranteed to trigger
* if curr_nr + #allocated == min_nr. Testing curr_nr < min_nr
* without locking achieves that and refilling as soon as possible
* is desirable.
*
* Because curr_nr visible here is always a value after the
* allocation of @element, any task which decremented curr_nr below
* min_nr is guaranteed to see curr_nr < min_nr unless curr_nr gets
* incremented to min_nr afterwards. If curr_nr gets incremented
* to min_nr after the allocation of @element, the elements
* allocated after that are subject to the same guarantee.
*
* Waiters happen iff curr_nr is 0 and the above guarantee also
* ensures that there will be frees which return elements to the
* pool waking up the waiters.
*/
if (unlikely(pool->curr_nr < pool->min_nr)) {
spin_lock_irqsave(&pool->lock, flags);
if (likely(pool->curr_nr < pool->min_nr)) {
add_element(pool, element);
spin_unlock_irqrestore(&pool->lock, flags);
wake_up(&pool->wait);
return;
}
spin_unlock_irqrestore(&pool->lock, flags);
}
pool->free(element, pool->pool_data);
}
EXPORT_SYMBOL(mempool_free);
/*
* A commonly used alloc and free fn.
*/
void *mempool_alloc_slab(gfp_t gfp_mask, void *pool_data)
{
struct kmem_cache *mem = pool_data;
return kmem_cache_alloc(mem, gfp_mask);
}
EXPORT_SYMBOL(mempool_alloc_slab);
void mempool_free_slab(void *element, void *pool_data)
{
struct kmem_cache *mem = pool_data;
kmem_cache_free(mem, element);
}
EXPORT_SYMBOL(mempool_free_slab);
/*
* A commonly used alloc and free fn that kmalloc/kfrees the amount of memory
* specified by pool_data
*/
void *mempool_kmalloc(gfp_t gfp_mask, void *pool_data)
{
size_t size = (size_t)pool_data;
return kmalloc(size, gfp_mask);
}
EXPORT_SYMBOL(mempool_kmalloc);
void mempool_kfree(void *element, void *pool_data)
{
kfree(element);
}
EXPORT_SYMBOL(mempool_kfree);
/*
* A simple mempool-backed page allocator that allocates pages
* of the order specified by pool_data.
*/
void *mempool_alloc_pages(gfp_t gfp_mask, void *pool_data)
{
int order = (int)(long)pool_data;
return alloc_pages(gfp_mask, order);
}
EXPORT_SYMBOL(mempool_alloc_pages);
void mempool_free_pages(void *element, void *pool_data)
{
int order = (int)(long)pool_data;
__free_pages(element, order);
}
EXPORT_SYMBOL(mempool_free_pages);

1913
mm/migrate.c Normal file

File diff suppressed because it is too large Load diff

317
mm/mincore.c Normal file
View file

@ -0,0 +1,317 @@
/*
* linux/mm/mincore.c
*
* Copyright (C) 1994-2006 Linus Torvalds
*/
/*
* The mincore() system call.
*/
#include <linux/pagemap.h>
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/mman.h>
#include <linux/syscalls.h>
#include <linux/swap.h>
#include <linux/swapops.h>
#include <linux/hugetlb.h>
#include <asm/uaccess.h>
#include <asm/pgtable.h>
static void mincore_hugetlb_page_range(struct vm_area_struct *vma,
unsigned long addr, unsigned long end,
unsigned char *vec)
{
#ifdef CONFIG_HUGETLB_PAGE
struct hstate *h;
h = hstate_vma(vma);
while (1) {
unsigned char present;
pte_t *ptep;
/*
* Huge pages are always in RAM for now, but
* theoretically it needs to be checked.
*/
ptep = huge_pte_offset(current->mm,
addr & huge_page_mask(h));
present = ptep && !huge_pte_none(huge_ptep_get(ptep));
while (1) {
*vec = present;
vec++;
addr += PAGE_SIZE;
if (addr == end)
return;
/* check hugepage border */
if (!(addr & ~huge_page_mask(h)))
break;
}
}
#else
BUG();
#endif
}
/*
* Later we can get more picky about what "in core" means precisely.
* For now, simply check to see if the page is in the page cache,
* and is up to date; i.e. that no page-in operation would be required
* at this time if an application were to map and access this page.
*/
static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff)
{
unsigned char present = 0;
struct page *page;
/*
* When tmpfs swaps out a page from a file, any process mapping that
* file will not get a swp_entry_t in its pte, but rather it is like
* any other file mapping (ie. marked !present and faulted in with
* tmpfs's .fault). So swapped out tmpfs mappings are tested here.
*/
#ifdef CONFIG_SWAP
if (shmem_mapping(mapping)) {
page = find_get_entry(mapping, pgoff);
/*
* shmem/tmpfs may return swap: account for swapcache
* page too.
*/
if (radix_tree_exceptional_entry(page)) {
swp_entry_t swp = radix_to_swp_entry(page);
page = find_get_page(swap_address_space(swp), swp.val);
}
} else
page = find_get_page(mapping, pgoff);
#else
page = find_get_page(mapping, pgoff);
#endif
if (page) {
present = PageUptodate(page);
page_cache_release(page);
}
return present;
}
static void mincore_unmapped_range(struct vm_area_struct *vma,
unsigned long addr, unsigned long end,
unsigned char *vec)
{
unsigned long nr = (end - addr) >> PAGE_SHIFT;
int i;
if (vma->vm_file) {
pgoff_t pgoff;
pgoff = linear_page_index(vma, addr);
for (i = 0; i < nr; i++, pgoff++)
vec[i] = mincore_page(vma->vm_file->f_mapping, pgoff);
} else {
for (i = 0; i < nr; i++)
vec[i] = 0;
}
}
static void mincore_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end,
unsigned char *vec)
{
unsigned long next;
spinlock_t *ptl;
pte_t *ptep;
ptep = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
do {
pte_t pte = *ptep;
pgoff_t pgoff;
next = addr + PAGE_SIZE;
if (pte_none(pte))
mincore_unmapped_range(vma, addr, next, vec);
else if (pte_present(pte))
*vec = 1;
else if (pte_file(pte)) {
pgoff = pte_to_pgoff(pte);
*vec = mincore_page(vma->vm_file->f_mapping, pgoff);
} else { /* pte is a swap entry */
swp_entry_t entry = pte_to_swp_entry(pte);
if (is_migration_entry(entry)) {
/* migration entries are always uptodate */
*vec = 1;
} else {
#ifdef CONFIG_SWAP
pgoff = entry.val;
*vec = mincore_page(swap_address_space(entry),
pgoff);
#else
WARN_ON(1);
*vec = 1;
#endif
}
}
vec++;
} while (ptep++, addr = next, addr != end);
pte_unmap_unlock(ptep - 1, ptl);
}
static void mincore_pmd_range(struct vm_area_struct *vma, pud_t *pud,
unsigned long addr, unsigned long end,
unsigned char *vec)
{
unsigned long next;
pmd_t *pmd;
pmd = pmd_offset(pud, addr);
do {
next = pmd_addr_end(addr, end);
if (pmd_trans_huge(*pmd)) {
if (mincore_huge_pmd(vma, pmd, addr, next, vec)) {
vec += (next - addr) >> PAGE_SHIFT;
continue;
}
/* fall through */
}
if (pmd_none_or_trans_huge_or_clear_bad(pmd))
mincore_unmapped_range(vma, addr, next, vec);
else
mincore_pte_range(vma, pmd, addr, next, vec);
vec += (next - addr) >> PAGE_SHIFT;
} while (pmd++, addr = next, addr != end);
}
static void mincore_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
unsigned long addr, unsigned long end,
unsigned char *vec)
{
unsigned long next;
pud_t *pud;
pud = pud_offset(pgd, addr);
do {
next = pud_addr_end(addr, end);
if (pud_none_or_clear_bad(pud))
mincore_unmapped_range(vma, addr, next, vec);
else
mincore_pmd_range(vma, pud, addr, next, vec);
vec += (next - addr) >> PAGE_SHIFT;
} while (pud++, addr = next, addr != end);
}
static void mincore_page_range(struct vm_area_struct *vma,
unsigned long addr, unsigned long end,
unsigned char *vec)
{
unsigned long next;
pgd_t *pgd;
pgd = pgd_offset(vma->vm_mm, addr);
do {
next = pgd_addr_end(addr, end);
if (pgd_none_or_clear_bad(pgd))
mincore_unmapped_range(vma, addr, next, vec);
else
mincore_pud_range(vma, pgd, addr, next, vec);
vec += (next - addr) >> PAGE_SHIFT;
} while (pgd++, addr = next, addr != end);
}
/*
* Do a chunk of "sys_mincore()". We've already checked
* all the arguments, we hold the mmap semaphore: we should
* just return the amount of info we're asked for.
*/
static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *vec)
{
struct vm_area_struct *vma;
unsigned long end;
vma = find_vma(current->mm, addr);
if (!vma || addr < vma->vm_start)
return -ENOMEM;
end = min(vma->vm_end, addr + (pages << PAGE_SHIFT));
if (is_vm_hugetlb_page(vma))
mincore_hugetlb_page_range(vma, addr, end, vec);
else
mincore_page_range(vma, addr, end, vec);
return (end - addr) >> PAGE_SHIFT;
}
/*
* The mincore(2) system call.
*
* mincore() returns the memory residency status of the pages in the
* current process's address space specified by [addr, addr + len).
* The status is returned in a vector of bytes. The least significant
* bit of each byte is 1 if the referenced page is in memory, otherwise
* it is zero.
*
* Because the status of a page can change after mincore() checks it
* but before it returns to the application, the returned vector may
* contain stale information. Only locked pages are guaranteed to
* remain in memory.
*
* return values:
* zero - success
* -EFAULT - vec points to an illegal address
* -EINVAL - addr is not a multiple of PAGE_CACHE_SIZE
* -ENOMEM - Addresses in the range [addr, addr + len] are
* invalid for the address space of this process, or
* specify one or more pages which are not currently
* mapped
* -EAGAIN - A kernel resource was temporarily unavailable.
*/
SYSCALL_DEFINE3(mincore, unsigned long, start, size_t, len,
unsigned char __user *, vec)
{
long retval;
unsigned long pages;
unsigned char *tmp;
/* Check the start address: needs to be page-aligned.. */
if (start & ~PAGE_CACHE_MASK)
return -EINVAL;
/* ..and we need to be passed a valid user-space range */
if (!access_ok(VERIFY_READ, (void __user *) start, len))
return -ENOMEM;
/* This also avoids any overflows on PAGE_CACHE_ALIGN */
pages = len >> PAGE_SHIFT;
pages += (len & ~PAGE_MASK) != 0;
if (!access_ok(VERIFY_WRITE, vec, pages))
return -EFAULT;
tmp = (void *) __get_free_page(GFP_USER);
if (!tmp)
return -EAGAIN;
retval = 0;
while (pages) {
/*
* Do at most PAGE_SIZE entries per iteration, due to
* the temporary buffer size.
*/
down_read(&current->mm->mmap_sem);
retval = do_mincore(start, min(pages, PAGE_SIZE), tmp);
up_read(&current->mm->mmap_sem);
if (retval <= 0)
break;
if (copy_to_user(vec, tmp, retval)) {
retval = -EFAULT;
break;
}
pages -= retval;
vec += retval;
start += retval << PAGE_SHIFT;
retval = 0;
}
free_page((unsigned long) tmp);
return retval;
}

874
mm/mlock.c Normal file
View file

@ -0,0 +1,874 @@
/*
* linux/mm/mlock.c
*
* (C) Copyright 1995 Linus Torvalds
* (C) Copyright 2002 Christoph Hellwig
*/
#include <linux/capability.h>
#include <linux/mman.h>
#include <linux/mm.h>
#include <linux/swap.h>
#include <linux/swapops.h>
#include <linux/pagemap.h>
#include <linux/pagevec.h>
#include <linux/mempolicy.h>
#include <linux/syscalls.h>
#include <linux/sched.h>
#include <linux/export.h>
#include <linux/rmap.h>
#include <linux/mmzone.h>
#include <linux/hugetlb.h>
#include <linux/memcontrol.h>
#include <linux/mm_inline.h>
#include "internal.h"
int can_do_mlock(void)
{
if (capable(CAP_IPC_LOCK))
return 1;
if (rlimit(RLIMIT_MEMLOCK) != 0)
return 1;
return 0;
}
EXPORT_SYMBOL(can_do_mlock);
/*
* Mlocked pages are marked with PageMlocked() flag for efficient testing
* in vmscan and, possibly, the fault path; and to support semi-accurate
* statistics.
*
* An mlocked page [PageMlocked(page)] is unevictable. As such, it will
* be placed on the LRU "unevictable" list, rather than the [in]active lists.
* The unevictable list is an LRU sibling list to the [in]active lists.
* PageUnevictable is set to indicate the unevictable state.
*
* When lazy mlocking via vmscan, it is important to ensure that the
* vma's VM_LOCKED status is not concurrently being modified, otherwise we
* may have mlocked a page that is being munlocked. So lazy mlock must take
* the mmap_sem for read, and verify that the vma really is locked
* (see mm/rmap.c).
*/
/*
* LRU accounting for clear_page_mlock()
*/
void clear_page_mlock(struct page *page)
{
if (!TestClearPageMlocked(page))
return;
mod_zone_page_state(page_zone(page), NR_MLOCK,
-hpage_nr_pages(page));
count_vm_event(UNEVICTABLE_PGCLEARED);
if (!isolate_lru_page(page)) {
putback_lru_page(page);
} else {
/*
* We lost the race. the page already moved to evictable list.
*/
if (PageUnevictable(page))
count_vm_event(UNEVICTABLE_PGSTRANDED);
}
}
/*
* Mark page as mlocked if not already.
* If page on LRU, isolate and putback to move to unevictable list.
*/
void mlock_vma_page(struct page *page)
{
/* Serialize with page migration */
BUG_ON(!PageLocked(page));
if (!TestSetPageMlocked(page)) {
mod_zone_page_state(page_zone(page), NR_MLOCK,
hpage_nr_pages(page));
count_vm_event(UNEVICTABLE_PGMLOCKED);
if (!isolate_lru_page(page))
putback_lru_page(page);
}
}
/*
* Isolate a page from LRU with optional get_page() pin.
* Assumes lru_lock already held and page already pinned.
*/
static bool __munlock_isolate_lru_page(struct page *page, bool getpage)
{
if (PageLRU(page)) {
struct lruvec *lruvec;
lruvec = mem_cgroup_page_lruvec(page, page_zone(page));
if (getpage)
get_page(page);
ClearPageLRU(page);
del_page_from_lru_list(page, lruvec, page_lru(page));
return true;
}
return false;
}
/*
* Finish munlock after successful page isolation
*
* Page must be locked. This is a wrapper for try_to_munlock()
* and putback_lru_page() with munlock accounting.
*/
static void __munlock_isolated_page(struct page *page)
{
int ret = SWAP_AGAIN;
/*
* Optimization: if the page was mapped just once, that's our mapping
* and we don't need to check all the other vmas.
*/
if (page_mapcount(page) > 1)
ret = try_to_munlock(page);
/* Did try_to_unlock() succeed or punt? */
if (ret != SWAP_MLOCK)
count_vm_event(UNEVICTABLE_PGMUNLOCKED);
putback_lru_page(page);
}
/*
* Accounting for page isolation fail during munlock
*
* Performs accounting when page isolation fails in munlock. There is nothing
* else to do because it means some other task has already removed the page
* from the LRU. putback_lru_page() will take care of removing the page from
* the unevictable list, if necessary. vmscan [page_referenced()] will move
* the page back to the unevictable list if some other vma has it mlocked.
*/
static void __munlock_isolation_failed(struct page *page)
{
if (PageUnevictable(page))
__count_vm_event(UNEVICTABLE_PGSTRANDED);
else
__count_vm_event(UNEVICTABLE_PGMUNLOCKED);
}
/**
* munlock_vma_page - munlock a vma page
* @page - page to be unlocked, either a normal page or THP page head
*
* returns the size of the page as a page mask (0 for normal page,
* HPAGE_PMD_NR - 1 for THP head page)
*
* called from munlock()/munmap() path with page supposedly on the LRU.
* When we munlock a page, because the vma where we found the page is being
* munlock()ed or munmap()ed, we want to check whether other vmas hold the
* page locked so that we can leave it on the unevictable lru list and not
* bother vmscan with it. However, to walk the page's rmap list in
* try_to_munlock() we must isolate the page from the LRU. If some other
* task has removed the page from the LRU, we won't be able to do that.
* So we clear the PageMlocked as we might not get another chance. If we
* can't isolate the page, we leave it for putback_lru_page() and vmscan
* [page_referenced()/try_to_unmap()] to deal with.
*/
unsigned int munlock_vma_page(struct page *page)
{
unsigned int nr_pages;
struct zone *zone = page_zone(page);
/* For try_to_munlock() and to serialize with page migration */
BUG_ON(!PageLocked(page));
/*
* Serialize with any parallel __split_huge_page_refcount() which
* might otherwise copy PageMlocked to part of the tail pages before
* we clear it in the head page. It also stabilizes hpage_nr_pages().
*/
spin_lock_irq(&zone->lru_lock);
nr_pages = hpage_nr_pages(page);
if (!TestClearPageMlocked(page))
goto unlock_out;
__mod_zone_page_state(zone, NR_MLOCK, -nr_pages);
if (__munlock_isolate_lru_page(page, true)) {
spin_unlock_irq(&zone->lru_lock);
__munlock_isolated_page(page);
goto out;
}
__munlock_isolation_failed(page);
unlock_out:
spin_unlock_irq(&zone->lru_lock);
out:
return nr_pages - 1;
}
/**
* __mlock_vma_pages_range() - mlock a range of pages in the vma.
* @vma: target vma
* @start: start address
* @end: end address
* @nonblocking:
*
* This takes care of making the pages present too.
*
* return 0 on success, negative error code on error.
*
* vma->vm_mm->mmap_sem must be held.
*
* If @nonblocking is NULL, it may be held for read or write and will
* be unperturbed.
*
* If @nonblocking is non-NULL, it must held for read only and may be
* released. If it's released, *@nonblocking will be set to 0.
*/
long __mlock_vma_pages_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end, int *nonblocking)
{
struct mm_struct *mm = vma->vm_mm;
unsigned long nr_pages = (end - start) / PAGE_SIZE;
int gup_flags;
VM_BUG_ON(start & ~PAGE_MASK);
VM_BUG_ON(end & ~PAGE_MASK);
VM_BUG_ON_VMA(start < vma->vm_start, vma);
VM_BUG_ON_VMA(end > vma->vm_end, vma);
VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_sem), mm);
gup_flags = FOLL_TOUCH | FOLL_MLOCK;
/*
* We want to touch writable mappings with a write fault in order
* to break COW, except for shared mappings because these don't COW
* and we would not want to dirty them for nothing.
*/
if ((vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE)
gup_flags |= FOLL_WRITE;
/*
* We want mlock to succeed for regions that have any permissions
* other than PROT_NONE.
*/
if (vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC))
gup_flags |= FOLL_FORCE;
/*
* We made sure addr is within a VMA, so the following will
* not result in a stack expansion that recurses back here.
*/
return __get_user_pages(current, mm, start, nr_pages, gup_flags,
NULL, NULL, nonblocking);
}
/*
* convert get_user_pages() return value to posix mlock() error
*/
static int __mlock_posix_error_return(long retval)
{
if (retval == -EFAULT)
retval = -ENOMEM;
else if (retval == -ENOMEM)
retval = -EAGAIN;
return retval;
}
/*
* Prepare page for fast batched LRU putback via putback_lru_evictable_pagevec()
*
* The fast path is available only for evictable pages with single mapping.
* Then we can bypass the per-cpu pvec and get better performance.
* when mapcount > 1 we need try_to_munlock() which can fail.
* when !page_evictable(), we need the full redo logic of putback_lru_page to
* avoid leaving evictable page in unevictable list.
*
* In case of success, @page is added to @pvec and @pgrescued is incremented
* in case that the page was previously unevictable. @page is also unlocked.
*/
static bool __putback_lru_fast_prepare(struct page *page, struct pagevec *pvec,
int *pgrescued)
{
VM_BUG_ON_PAGE(PageLRU(page), page);
VM_BUG_ON_PAGE(!PageLocked(page), page);
if (page_mapcount(page) <= 1 && page_evictable(page)) {
pagevec_add(pvec, page);
if (TestClearPageUnevictable(page))
(*pgrescued)++;
unlock_page(page);
return true;
}
return false;
}
/*
* Putback multiple evictable pages to the LRU
*
* Batched putback of evictable pages that bypasses the per-cpu pvec. Some of
* the pages might have meanwhile become unevictable but that is OK.
*/
static void __putback_lru_fast(struct pagevec *pvec, int pgrescued)
{
count_vm_events(UNEVICTABLE_PGMUNLOCKED, pagevec_count(pvec));
/*
*__pagevec_lru_add() calls release_pages() so we don't call
* put_page() explicitly
*/
__pagevec_lru_add(pvec);
count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
}
/*
* Munlock a batch of pages from the same zone
*
* The work is split to two main phases. First phase clears the Mlocked flag
* and attempts to isolate the pages, all under a single zone lru lock.
* The second phase finishes the munlock only for pages where isolation
* succeeded.
*
* Note that the pagevec may be modified during the process.
*/
static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
{
int i;
int nr = pagevec_count(pvec);
int delta_munlocked;
struct pagevec pvec_putback;
int pgrescued = 0;
pagevec_init(&pvec_putback, 0);
/* Phase 1: page isolation */
spin_lock_irq(&zone->lru_lock);
for (i = 0; i < nr; i++) {
struct page *page = pvec->pages[i];
if (TestClearPageMlocked(page)) {
/*
* We already have pin from follow_page_mask()
* so we can spare the get_page() here.
*/
if (__munlock_isolate_lru_page(page, false))
continue;
else
__munlock_isolation_failed(page);
}
/*
* We won't be munlocking this page in the next phase
* but we still need to release the follow_page_mask()
* pin. We cannot do it under lru_lock however. If it's
* the last pin, __page_cache_release() would deadlock.
*/
pagevec_add(&pvec_putback, pvec->pages[i]);
pvec->pages[i] = NULL;
}
delta_munlocked = -nr + pagevec_count(&pvec_putback);
__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
spin_unlock_irq(&zone->lru_lock);
/* Now we can release pins of pages that we are not munlocking */
pagevec_release(&pvec_putback);
/* Phase 2: page munlock */
for (i = 0; i < nr; i++) {
struct page *page = pvec->pages[i];
if (page) {
lock_page(page);
if (!__putback_lru_fast_prepare(page, &pvec_putback,
&pgrescued)) {
/*
* Slow path. We don't want to lose the last
* pin before unlock_page()
*/
get_page(page); /* for putback_lru_page() */
__munlock_isolated_page(page);
unlock_page(page);
put_page(page); /* from follow_page_mask() */
}
}
}
/*
* Phase 3: page putback for pages that qualified for the fast path
* This will also call put_page() to return pin from follow_page_mask()
*/
if (pagevec_count(&pvec_putback))
__putback_lru_fast(&pvec_putback, pgrescued);
}
/*
* Fill up pagevec for __munlock_pagevec using pte walk
*
* The function expects that the struct page corresponding to @start address is
* a non-TPH page already pinned and in the @pvec, and that it belongs to @zone.
*
* The rest of @pvec is filled by subsequent pages within the same pmd and same
* zone, as long as the pte's are present and vm_normal_page() succeeds. These
* pages also get pinned.
*
* Returns the address of the next page that should be scanned. This equals
* @start + PAGE_SIZE when no page could be added by the pte walk.
*/
static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
struct vm_area_struct *vma, int zoneid, unsigned long start,
unsigned long end)
{
pte_t *pte;
spinlock_t *ptl;
/*
* Initialize pte walk starting at the already pinned page where we
* are sure that there is a pte, as it was pinned under the same
* mmap_sem write op.
*/
pte = get_locked_pte(vma->vm_mm, start, &ptl);
/* Make sure we do not cross the page table boundary */
end = pgd_addr_end(start, end);
end = pud_addr_end(start, end);
end = pmd_addr_end(start, end);
/* The page next to the pinned page is the first we will try to get */
start += PAGE_SIZE;
while (start < end) {
struct page *page = NULL;
pte++;
if (pte_present(*pte))
page = vm_normal_page(vma, start, *pte);
/*
* Break if page could not be obtained or the page's node+zone does not
* match
*/
if (!page || page_zone_id(page) != zoneid)
break;
get_page(page);
/*
* Increase the address that will be returned *before* the
* eventual break due to pvec becoming full by adding the page
*/
start += PAGE_SIZE;
if (pagevec_add(pvec, page) == 0)
break;
}
pte_unmap_unlock(pte, ptl);
return start;
}
/*
* munlock_vma_pages_range() - munlock all pages in the vma range.'
* @vma - vma containing range to be munlock()ed.
* @start - start address in @vma of the range
* @end - end of range in @vma.
*
* For mremap(), munmap() and exit().
*
* Called with @vma VM_LOCKED.
*
* Returns with VM_LOCKED cleared. Callers must be prepared to
* deal with this.
*
* We don't save and restore VM_LOCKED here because pages are
* still on lru. In unmap path, pages might be scanned by reclaim
* and re-mlocked by try_to_{munlock|unmap} before we unmap and
* free them. This will result in freeing mlocked pages.
*/
void munlock_vma_pages_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end)
{
vma->vm_flags &= ~VM_LOCKED;
while (start < end) {
struct page *page = NULL;
unsigned int page_mask;
unsigned long page_increm;
struct pagevec pvec;
struct zone *zone;
int zoneid;
pagevec_init(&pvec, 0);
/*
* Although FOLL_DUMP is intended for get_dump_page(),
* it just so happens that its special treatment of the
* ZERO_PAGE (returning an error instead of doing get_page)
* suits munlock very well (and if somehow an abnormal page
* has sneaked into the range, we won't oops here: great).
*/
page = follow_page_mask(vma, start, FOLL_GET | FOLL_DUMP,
&page_mask);
if (page && !IS_ERR(page)) {
if (PageTransHuge(page)) {
lock_page(page);
/*
* Any THP page found by follow_page_mask() may
* have gotten split before reaching
* munlock_vma_page(), so we need to recompute
* the page_mask here.
*/
page_mask = munlock_vma_page(page);
unlock_page(page);
put_page(page); /* follow_page_mask() */
} else {
/*
* Non-huge pages are handled in batches via
* pagevec. The pin from follow_page_mask()
* prevents them from collapsing by THP.
*/
pagevec_add(&pvec, page);
zone = page_zone(page);
zoneid = page_zone_id(page);
/*
* Try to fill the rest of pagevec using fast
* pte walk. This will also update start to
* the next page to process. Then munlock the
* pagevec.
*/
start = __munlock_pagevec_fill(&pvec, vma,
zoneid, start, end);
__munlock_pagevec(&pvec, zone);
goto next;
}
}
/* It's a bug to munlock in the middle of a THP page */
VM_BUG_ON((start >> PAGE_SHIFT) & page_mask);
page_increm = 1 + page_mask;
start += page_increm * PAGE_SIZE;
next:
cond_resched();
}
}
/*
* mlock_fixup - handle mlock[all]/munlock[all] requests.
*
* Filters out "special" vmas -- VM_LOCKED never gets set for these, and
* munlock is a no-op. However, for some special vmas, we go ahead and
* populate the ptes.
*
* For vmas that pass the filters, merge/split as appropriate.
*/
static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
unsigned long start, unsigned long end, vm_flags_t newflags)
{
struct mm_struct *mm = vma->vm_mm;
pgoff_t pgoff;
int nr_pages;
int ret = 0;
int lock = !!(newflags & VM_LOCKED);
if (newflags == vma->vm_flags || (vma->vm_flags & VM_SPECIAL) ||
is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm))
goto out; /* don't set VM_LOCKED, don't count */
pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
*prev = vma_merge(mm, *prev, start, end, newflags, vma->anon_vma,
vma->vm_file, pgoff, vma_policy(vma),
vma_get_anon_name(vma));
if (*prev) {
vma = *prev;
goto success;
}
if (start != vma->vm_start) {
ret = split_vma(mm, vma, start, 1);
if (ret)
goto out;
}
if (end != vma->vm_end) {
ret = split_vma(mm, vma, end, 0);
if (ret)
goto out;
}
success:
/*
* Keep track of amount of locked VM.
*/
nr_pages = (end - start) >> PAGE_SHIFT;
if (!lock)
nr_pages = -nr_pages;
mm->locked_vm += nr_pages;
/*
* vm_flags is protected by the mmap_sem held in write mode.
* It's okay if try_to_unmap_one unmaps a page just after we
* set VM_LOCKED, __mlock_vma_pages_range will bring it back.
*/
if (lock)
vma->vm_flags = newflags;
else
munlock_vma_pages_range(vma, start, end);
out:
*prev = vma;
return ret;
}
static int do_mlock(unsigned long start, size_t len, int on)
{
unsigned long nstart, end, tmp;
struct vm_area_struct * vma, * prev;
int error;
VM_BUG_ON(start & ~PAGE_MASK);
VM_BUG_ON(len != PAGE_ALIGN(len));
end = start + len;
if (end < start)
return -EINVAL;
if (end == start)
return 0;
vma = find_vma(current->mm, start);
if (!vma || vma->vm_start > start)
return -ENOMEM;
prev = vma->vm_prev;
if (start > vma->vm_start)
prev = vma;
for (nstart = start ; ; ) {
vm_flags_t newflags;
/* Here we know that vma->vm_start <= nstart < vma->vm_end. */
newflags = vma->vm_flags & ~VM_LOCKED;
if (on)
newflags |= VM_LOCKED;
tmp = vma->vm_end;
if (tmp > end)
tmp = end;
error = mlock_fixup(vma, &prev, nstart, tmp, newflags);
if (error)
break;
nstart = tmp;
if (nstart < prev->vm_end)
nstart = prev->vm_end;
if (nstart >= end)
break;
vma = prev->vm_next;
if (!vma || vma->vm_start != nstart) {
error = -ENOMEM;
break;
}
}
return error;
}
/*
* __mm_populate - populate and/or mlock pages within a range of address space.
*
* This is used to implement mlock() and the MAP_POPULATE / MAP_LOCKED mmap
* flags. VMAs must be already marked with the desired vm_flags, and
* mmap_sem must not be held.
*/
int __mm_populate(unsigned long start, unsigned long len, int ignore_errors)
{
struct mm_struct *mm = current->mm;
unsigned long end, nstart, nend;
struct vm_area_struct *vma = NULL;
int locked = 0;
long ret = 0;
VM_BUG_ON(start & ~PAGE_MASK);
VM_BUG_ON(len != PAGE_ALIGN(len));
end = start + len;
for (nstart = start; nstart < end; nstart = nend) {
/*
* We want to fault in pages for [nstart; end) address range.
* Find first corresponding VMA.
*/
if (!locked) {
locked = 1;
down_read(&mm->mmap_sem);
vma = find_vma(mm, nstart);
} else if (nstart >= vma->vm_end)
vma = vma->vm_next;
if (!vma || vma->vm_start >= end)
break;
/*
* Set [nstart; nend) to intersection of desired address
* range with the first VMA. Also, skip undesirable VMA types.
*/
nend = min(end, vma->vm_end);
if (vma->vm_flags & (VM_IO | VM_PFNMAP))
continue;
if (nstart < vma->vm_start)
nstart = vma->vm_start;
/*
* Now fault in a range of pages. __mlock_vma_pages_range()
* double checks the vma flags, so that it won't mlock pages
* if the vma was already munlocked.
*/
ret = __mlock_vma_pages_range(vma, nstart, nend, &locked);
if (ret < 0) {
if (ignore_errors) {
ret = 0;
continue; /* continue at next VMA */
}
ret = __mlock_posix_error_return(ret);
break;
}
nend = nstart + ret * PAGE_SIZE;
ret = 0;
}
if (locked)
up_read(&mm->mmap_sem);
return ret; /* 0 or negative error code */
}
SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
{
unsigned long locked;
unsigned long lock_limit;
int error = -ENOMEM;
if (!can_do_mlock())
return -EPERM;
lru_add_drain_all(); /* flush pagevec */
len = PAGE_ALIGN(len + (start & ~PAGE_MASK));
start &= PAGE_MASK;
lock_limit = rlimit(RLIMIT_MEMLOCK);
lock_limit >>= PAGE_SHIFT;
locked = len >> PAGE_SHIFT;
down_write(&current->mm->mmap_sem);
locked += current->mm->locked_vm;
/* check against resource limits */
if ((locked <= lock_limit) || capable(CAP_IPC_LOCK))
error = do_mlock(start, len, 1);
up_write(&current->mm->mmap_sem);
if (!error)
error = __mm_populate(start, len, 0);
return error;
}
SYSCALL_DEFINE2(munlock, unsigned long, start, size_t, len)
{
int ret;
len = PAGE_ALIGN(len + (start & ~PAGE_MASK));
start &= PAGE_MASK;
down_write(&current->mm->mmap_sem);
ret = do_mlock(start, len, 0);
up_write(&current->mm->mmap_sem);
return ret;
}
static int do_mlockall(int flags)
{
struct vm_area_struct * vma, * prev = NULL;
if (flags & MCL_FUTURE)
current->mm->def_flags |= VM_LOCKED;
else
current->mm->def_flags &= ~VM_LOCKED;
if (flags == MCL_FUTURE)
goto out;
for (vma = current->mm->mmap; vma ; vma = prev->vm_next) {
vm_flags_t newflags;
newflags = vma->vm_flags & ~VM_LOCKED;
if (flags & MCL_CURRENT)
newflags |= VM_LOCKED;
/* Ignore errors */
mlock_fixup(vma, &prev, vma->vm_start, vma->vm_end, newflags);
cond_resched_rcu_qs();
}
out:
return 0;
}
SYSCALL_DEFINE1(mlockall, int, flags)
{
unsigned long lock_limit;
int ret = -EINVAL;
if (!flags || (flags & ~(MCL_CURRENT | MCL_FUTURE)))
goto out;
ret = -EPERM;
if (!can_do_mlock())
goto out;
if (flags & MCL_CURRENT)
lru_add_drain_all(); /* flush pagevec */
lock_limit = rlimit(RLIMIT_MEMLOCK);
lock_limit >>= PAGE_SHIFT;
ret = -ENOMEM;
down_write(&current->mm->mmap_sem);
if (!(flags & MCL_CURRENT) || (current->mm->total_vm <= lock_limit) ||
capable(CAP_IPC_LOCK))
ret = do_mlockall(flags);
up_write(&current->mm->mmap_sem);
if (!ret && (flags & MCL_CURRENT))
mm_populate(0, TASK_SIZE);
out:
return ret;
}
SYSCALL_DEFINE0(munlockall)
{
int ret;
down_write(&current->mm->mmap_sem);
ret = do_mlockall(0);
up_write(&current->mm->mmap_sem);
return ret;
}
/*
* Objects with different lifetime than processes (SHM_LOCK and SHM_HUGETLB
* shm segments) get accounted against the user_struct instead.
*/
static DEFINE_SPINLOCK(shmlock_user_lock);
int user_shm_lock(size_t size, struct user_struct *user)
{
unsigned long lock_limit, locked;
int allowed = 0;
locked = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
lock_limit = rlimit(RLIMIT_MEMLOCK);
if (lock_limit == RLIM_INFINITY)
allowed = 1;
lock_limit >>= PAGE_SHIFT;
spin_lock(&shmlock_user_lock);
if (!allowed &&
locked + user->locked_shm > lock_limit && !capable(CAP_IPC_LOCK))
goto out;
get_uid(user);
user->locked_shm += locked;
allowed = 1;
out:
spin_unlock(&shmlock_user_lock);
return allowed;
}
void user_shm_unlock(size_t size, struct user_struct *user)
{
spin_lock(&shmlock_user_lock);
user->locked_shm -= (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
spin_unlock(&shmlock_user_lock);
free_uid(user);
}

205
mm/mm_init.c Normal file
View file

@ -0,0 +1,205 @@
/*
* mm_init.c - Memory initialisation verification and debugging
*
* Copyright 2008 IBM Corporation, 2008
* Author Mel Gorman <mel@csn.ul.ie>
*
*/
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/kobject.h>
#include <linux/export.h>
#include <linux/memory.h>
#include <linux/notifier.h>
#include "internal.h"
#ifdef CONFIG_DEBUG_MEMORY_INIT
int mminit_loglevel;
#ifndef SECTIONS_SHIFT
#define SECTIONS_SHIFT 0
#endif
/* The zonelists are simply reported, validation is manual. */
void mminit_verify_zonelist(void)
{
int nid;
if (mminit_loglevel < MMINIT_VERIFY)
return;
for_each_online_node(nid) {
pg_data_t *pgdat = NODE_DATA(nid);
struct zone *zone;
struct zoneref *z;
struct zonelist *zonelist;
int i, listid, zoneid;
BUG_ON(MAX_ZONELISTS > 2);
for (i = 0; i < MAX_ZONELISTS * MAX_NR_ZONES; i++) {
/* Identify the zone and nodelist */
zoneid = i % MAX_NR_ZONES;
listid = i / MAX_NR_ZONES;
zonelist = &pgdat->node_zonelists[listid];
zone = &pgdat->node_zones[zoneid];
if (!populated_zone(zone))
continue;
/* Print information about the zonelist */
printk(KERN_DEBUG "mminit::zonelist %s %d:%s = ",
listid > 0 ? "thisnode" : "general", nid,
zone->name);
/* Iterate the zonelist */
for_each_zone_zonelist(zone, z, zonelist, zoneid) {
#ifdef CONFIG_NUMA
printk(KERN_CONT "%d:%s ",
zone->node, zone->name);
#else
printk(KERN_CONT "0:%s ", zone->name);
#endif /* CONFIG_NUMA */
}
printk(KERN_CONT "\n");
}
}
}
void __init mminit_verify_pageflags_layout(void)
{
int shift, width;
unsigned long or_mask, add_mask;
shift = 8 * sizeof(unsigned long);
width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH - LAST_CPUPID_SHIFT;
mminit_dprintk(MMINIT_TRACE, "pageflags_layout_widths",
"Section %d Node %d Zone %d Lastcpupid %d Flags %d\n",
SECTIONS_WIDTH,
NODES_WIDTH,
ZONES_WIDTH,
LAST_CPUPID_WIDTH,
NR_PAGEFLAGS);
mminit_dprintk(MMINIT_TRACE, "pageflags_layout_shifts",
"Section %d Node %d Zone %d Lastcpupid %d\n",
SECTIONS_SHIFT,
NODES_SHIFT,
ZONES_SHIFT,
LAST_CPUPID_SHIFT);
mminit_dprintk(MMINIT_TRACE, "pageflags_layout_pgshifts",
"Section %lu Node %lu Zone %lu Lastcpupid %lu\n",
(unsigned long)SECTIONS_PGSHIFT,
(unsigned long)NODES_PGSHIFT,
(unsigned long)ZONES_PGSHIFT,
(unsigned long)LAST_CPUPID_PGSHIFT);
mminit_dprintk(MMINIT_TRACE, "pageflags_layout_nodezoneid",
"Node/Zone ID: %lu -> %lu\n",
(unsigned long)(ZONEID_PGOFF + ZONEID_SHIFT),
(unsigned long)ZONEID_PGOFF);
mminit_dprintk(MMINIT_TRACE, "pageflags_layout_usage",
"location: %d -> %d layout %d -> %d unused %d -> %d page-flags\n",
shift, width, width, NR_PAGEFLAGS, NR_PAGEFLAGS, 0);
#ifdef NODE_NOT_IN_PAGE_FLAGS
mminit_dprintk(MMINIT_TRACE, "pageflags_layout_nodeflags",
"Node not in page flags");
#endif
#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
mminit_dprintk(MMINIT_TRACE, "pageflags_layout_nodeflags",
"Last cpupid not in page flags");
#endif
if (SECTIONS_WIDTH) {
shift -= SECTIONS_WIDTH;
BUG_ON(shift != SECTIONS_PGSHIFT);
}
if (NODES_WIDTH) {
shift -= NODES_WIDTH;
BUG_ON(shift != NODES_PGSHIFT);
}
if (ZONES_WIDTH) {
shift -= ZONES_WIDTH;
BUG_ON(shift != ZONES_PGSHIFT);
}
/* Check for bitmask overlaps */
or_mask = (ZONES_MASK << ZONES_PGSHIFT) |
(NODES_MASK << NODES_PGSHIFT) |
(SECTIONS_MASK << SECTIONS_PGSHIFT);
add_mask = (ZONES_MASK << ZONES_PGSHIFT) +
(NODES_MASK << NODES_PGSHIFT) +
(SECTIONS_MASK << SECTIONS_PGSHIFT);
BUG_ON(or_mask != add_mask);
}
void __meminit mminit_verify_page_links(struct page *page, enum zone_type zone,
unsigned long nid, unsigned long pfn)
{
BUG_ON(page_to_nid(page) != nid);
BUG_ON(page_zonenum(page) != zone);
BUG_ON(page_to_pfn(page) != pfn);
}
static __init int set_mminit_loglevel(char *str)
{
get_option(&str, &mminit_loglevel);
return 0;
}
early_param("mminit_loglevel", set_mminit_loglevel);
#endif /* CONFIG_DEBUG_MEMORY_INIT */
struct kobject *mm_kobj;
EXPORT_SYMBOL_GPL(mm_kobj);
#ifdef CONFIG_SMP
s32 vm_committed_as_batch = 32;
static void __meminit mm_compute_batch(void)
{
u64 memsized_batch;
s32 nr = num_present_cpus();
s32 batch = max_t(s32, nr*2, 32);
/* batch size set to 0.4% of (total memory/#cpus), or max int32 */
memsized_batch = min_t(u64, (totalram_pages/nr)/256, 0x7fffffff);
vm_committed_as_batch = max_t(s32, memsized_batch, batch);
}
static int __meminit mm_compute_batch_notifier(struct notifier_block *self,
unsigned long action, void *arg)
{
switch (action) {
case MEM_ONLINE:
case MEM_OFFLINE:
mm_compute_batch();
default:
break;
}
return NOTIFY_OK;
}
static struct notifier_block compute_batch_nb __meminitdata = {
.notifier_call = mm_compute_batch_notifier,
.priority = IPC_CALLBACK_PRI, /* use lowest priority */
};
static int __init mm_compute_batch_init(void)
{
mm_compute_batch();
register_hotmemory_notifier(&compute_batch_nb);
return 0;
}
__initcall(mm_compute_batch_init);
#endif
static int __init mm_sysfs_init(void)
{
mm_kobj = kobject_create_and_add("mm", kernel_kobj);
if (!mm_kobj)
return -ENOMEM;
return 0;
}
postcore_initcall(mm_sysfs_init);

3348
mm/mmap.c Normal file

File diff suppressed because it is too large Load diff

62
mm/mmu_context.c Normal file
View file

@ -0,0 +1,62 @@
/* Copyright (C) 2009 Red Hat, Inc.
*
* See ../COPYING for licensing terms.
*/
#include <linux/mm.h>
#include <linux/mmu_context.h>
#include <linux/export.h>
#include <linux/sched.h>
#include <asm/mmu_context.h>
/*
* use_mm
* Makes the calling kernel thread take on the specified
* mm context.
* (Note: this routine is intended to be called only
* from a kernel thread context)
*/
void use_mm(struct mm_struct *mm)
{
struct mm_struct *active_mm;
struct task_struct *tsk = current;
task_lock(tsk);
active_mm = tsk->active_mm;
if (active_mm != mm) {
atomic_inc(&mm->mm_count);
tsk->active_mm = mm;
}
tsk->mm = mm;
switch_mm(active_mm, mm, tsk);
task_unlock(tsk);
#ifdef finish_arch_post_lock_switch
finish_arch_post_lock_switch();
#endif
if (active_mm != mm)
mmdrop(active_mm);
}
EXPORT_SYMBOL_GPL(use_mm);
/*
* unuse_mm
* Reverses the effect of use_mm, i.e. releases the
* specified mm context which was earlier taken on
* by the calling kernel thread
* (Note: this routine is intended to be called only
* from a kernel thread context)
*/
void unuse_mm(struct mm_struct *mm)
{
struct task_struct *tsk = current;
task_lock(tsk);
sync_mm_rss(mm);
tsk->mm = NULL;
/* active_mm is still 'mm' */
enter_lazy_tlb(mm, tsk);
task_unlock(tsk);
}
EXPORT_SYMBOL_GPL(unuse_mm);

371
mm/mmu_notifier.c Normal file
View file

@ -0,0 +1,371 @@
/*
* linux/mm/mmu_notifier.c
*
* Copyright (C) 2008 Qumranet, Inc.
* Copyright (C) 2008 SGI
* Christoph Lameter <clameter@sgi.com>
*
* This work is licensed under the terms of the GNU GPL, version 2. See
* the COPYING file in the top-level directory.
*/
#include <linux/rculist.h>
#include <linux/mmu_notifier.h>
#include <linux/export.h>
#include <linux/mm.h>
#include <linux/err.h>
#include <linux/srcu.h>
#include <linux/rcupdate.h>
#include <linux/sched.h>
#include <linux/slab.h>
/* global SRCU for all MMs */
static struct srcu_struct srcu;
/*
* This function allows mmu_notifier::release callback to delay a call to
* a function that will free appropriate resources. The function must be
* quick and must not block.
*/
void mmu_notifier_call_srcu(struct rcu_head *rcu,
void (*func)(struct rcu_head *rcu))
{
call_srcu(&srcu, rcu, func);
}
EXPORT_SYMBOL_GPL(mmu_notifier_call_srcu);
void mmu_notifier_synchronize(void)
{
/* Wait for any running method to finish. */
srcu_barrier(&srcu);
}
EXPORT_SYMBOL_GPL(mmu_notifier_synchronize);
/*
* This function can't run concurrently against mmu_notifier_register
* because mm->mm_users > 0 during mmu_notifier_register and exit_mmap
* runs with mm_users == 0. Other tasks may still invoke mmu notifiers
* in parallel despite there being no task using this mm any more,
* through the vmas outside of the exit_mmap context, such as with
* vmtruncate. This serializes against mmu_notifier_unregister with
* the mmu_notifier_mm->lock in addition to SRCU and it serializes
* against the other mmu notifiers with SRCU. struct mmu_notifier_mm
* can't go away from under us as exit_mmap holds an mm_count pin
* itself.
*/
void __mmu_notifier_release(struct mm_struct *mm)
{
struct mmu_notifier *mn;
int id;
/*
* SRCU here will block mmu_notifier_unregister until
* ->release returns.
*/
id = srcu_read_lock(&srcu);
hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist)
/*
* If ->release runs before mmu_notifier_unregister it must be
* handled, as it's the only way for the driver to flush all
* existing sptes and stop the driver from establishing any more
* sptes before all the pages in the mm are freed.
*/
if (mn->ops->release)
mn->ops->release(mn, mm);
spin_lock(&mm->mmu_notifier_mm->lock);
while (unlikely(!hlist_empty(&mm->mmu_notifier_mm->list))) {
mn = hlist_entry(mm->mmu_notifier_mm->list.first,
struct mmu_notifier,
hlist);
/*
* We arrived before mmu_notifier_unregister so
* mmu_notifier_unregister will do nothing other than to wait
* for ->release to finish and for mmu_notifier_unregister to
* return.
*/
hlist_del_init_rcu(&mn->hlist);
}
spin_unlock(&mm->mmu_notifier_mm->lock);
srcu_read_unlock(&srcu, id);
/*
* synchronize_srcu here prevents mmu_notifier_release from returning to
* exit_mmap (which would proceed with freeing all pages in the mm)
* until the ->release method returns, if it was invoked by
* mmu_notifier_unregister.
*
* The mmu_notifier_mm can't go away from under us because one mm_count
* is held by exit_mmap.
*/
synchronize_srcu(&srcu);
}
/*
* If no young bitflag is supported by the hardware, ->clear_flush_young can
* unmap the address and return 1 or 0 depending if the mapping previously
* existed or not.
*/
int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
unsigned long start,
unsigned long end)
{
struct mmu_notifier *mn;
int young = 0, id;
id = srcu_read_lock(&srcu);
hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
if (mn->ops->clear_flush_young)
young |= mn->ops->clear_flush_young(mn, mm, start, end);
}
srcu_read_unlock(&srcu, id);
return young;
}
int __mmu_notifier_test_young(struct mm_struct *mm,
unsigned long address)
{
struct mmu_notifier *mn;
int young = 0, id;
id = srcu_read_lock(&srcu);
hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
if (mn->ops->test_young) {
young = mn->ops->test_young(mn, mm, address);
if (young)
break;
}
}
srcu_read_unlock(&srcu, id);
return young;
}
void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
pte_t pte)
{
struct mmu_notifier *mn;
int id;
id = srcu_read_lock(&srcu);
hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
if (mn->ops->change_pte)
mn->ops->change_pte(mn, mm, address, pte);
}
srcu_read_unlock(&srcu, id);
}
void __mmu_notifier_invalidate_page(struct mm_struct *mm,
unsigned long address)
{
struct mmu_notifier *mn;
int id;
id = srcu_read_lock(&srcu);
hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
if (mn->ops->invalidate_page)
mn->ops->invalidate_page(mn, mm, address);
}
srcu_read_unlock(&srcu, id);
}
void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
unsigned long start, unsigned long end)
{
struct mmu_notifier *mn;
int id;
id = srcu_read_lock(&srcu);
hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
if (mn->ops->invalidate_range_start)
mn->ops->invalidate_range_start(mn, mm, start, end);
}
srcu_read_unlock(&srcu, id);
}
EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
unsigned long start, unsigned long end)
{
struct mmu_notifier *mn;
int id;
id = srcu_read_lock(&srcu);
hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
if (mn->ops->invalidate_range_end)
mn->ops->invalidate_range_end(mn, mm, start, end);
}
srcu_read_unlock(&srcu, id);
}
EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_end);
static int do_mmu_notifier_register(struct mmu_notifier *mn,
struct mm_struct *mm,
int take_mmap_sem)
{
struct mmu_notifier_mm *mmu_notifier_mm;
int ret;
BUG_ON(atomic_read(&mm->mm_users) <= 0);
/*
* Verify that mmu_notifier_init() already run and the global srcu is
* initialized.
*/
BUG_ON(!srcu.per_cpu_ref);
ret = -ENOMEM;
mmu_notifier_mm = kmalloc(sizeof(struct mmu_notifier_mm), GFP_KERNEL);
if (unlikely(!mmu_notifier_mm))
goto out;
if (take_mmap_sem)
down_write(&mm->mmap_sem);
ret = mm_take_all_locks(mm);
if (unlikely(ret))
goto out_clean;
if (!mm_has_notifiers(mm)) {
INIT_HLIST_HEAD(&mmu_notifier_mm->list);
spin_lock_init(&mmu_notifier_mm->lock);
mm->mmu_notifier_mm = mmu_notifier_mm;
mmu_notifier_mm = NULL;
}
atomic_inc(&mm->mm_count);
/*
* Serialize the update against mmu_notifier_unregister. A
* side note: mmu_notifier_release can't run concurrently with
* us because we hold the mm_users pin (either implicitly as
* current->mm or explicitly with get_task_mm() or similar).
* We can't race against any other mmu notifier method either
* thanks to mm_take_all_locks().
*/
spin_lock(&mm->mmu_notifier_mm->lock);
hlist_add_head(&mn->hlist, &mm->mmu_notifier_mm->list);
spin_unlock(&mm->mmu_notifier_mm->lock);
mm_drop_all_locks(mm);
out_clean:
if (take_mmap_sem)
up_write(&mm->mmap_sem);
kfree(mmu_notifier_mm);
out:
BUG_ON(atomic_read(&mm->mm_users) <= 0);
return ret;
}
/*
* Must not hold mmap_sem nor any other VM related lock when calling
* this registration function. Must also ensure mm_users can't go down
* to zero while this runs to avoid races with mmu_notifier_release,
* so mm has to be current->mm or the mm should be pinned safely such
* as with get_task_mm(). If the mm is not current->mm, the mm_users
* pin should be released by calling mmput after mmu_notifier_register
* returns. mmu_notifier_unregister must be always called to
* unregister the notifier. mm_count is automatically pinned to allow
* mmu_notifier_unregister to safely run at any time later, before or
* after exit_mmap. ->release will always be called before exit_mmap
* frees the pages.
*/
int mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
{
return do_mmu_notifier_register(mn, mm, 1);
}
EXPORT_SYMBOL_GPL(mmu_notifier_register);
/*
* Same as mmu_notifier_register but here the caller must hold the
* mmap_sem in write mode.
*/
int __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
{
return do_mmu_notifier_register(mn, mm, 0);
}
EXPORT_SYMBOL_GPL(__mmu_notifier_register);
/* this is called after the last mmu_notifier_unregister() returned */
void __mmu_notifier_mm_destroy(struct mm_struct *mm)
{
BUG_ON(!hlist_empty(&mm->mmu_notifier_mm->list));
kfree(mm->mmu_notifier_mm);
mm->mmu_notifier_mm = LIST_POISON1; /* debug */
}
/*
* This releases the mm_count pin automatically and frees the mm
* structure if it was the last user of it. It serializes against
* running mmu notifiers with SRCU and against mmu_notifier_unregister
* with the unregister lock + SRCU. All sptes must be dropped before
* calling mmu_notifier_unregister. ->release or any other notifier
* method may be invoked concurrently with mmu_notifier_unregister,
* and only after mmu_notifier_unregister returned we're guaranteed
* that ->release or any other method can't run anymore.
*/
void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
{
BUG_ON(atomic_read(&mm->mm_count) <= 0);
if (!hlist_unhashed(&mn->hlist)) {
/*
* SRCU here will force exit_mmap to wait for ->release to
* finish before freeing the pages.
*/
int id;
id = srcu_read_lock(&srcu);
/*
* exit_mmap will block in mmu_notifier_release to guarantee
* that ->release is called before freeing the pages.
*/
if (mn->ops->release)
mn->ops->release(mn, mm);
srcu_read_unlock(&srcu, id);
spin_lock(&mm->mmu_notifier_mm->lock);
/*
* Can not use list_del_rcu() since __mmu_notifier_release
* can delete it before we hold the lock.
*/
hlist_del_init_rcu(&mn->hlist);
spin_unlock(&mm->mmu_notifier_mm->lock);
}
/*
* Wait for any running method to finish, of course including
* ->release if it was run by mmu_notifier_release instead of us.
*/
synchronize_srcu(&srcu);
BUG_ON(atomic_read(&mm->mm_count) <= 0);
mmdrop(mm);
}
EXPORT_SYMBOL_GPL(mmu_notifier_unregister);
/*
* Same as mmu_notifier_unregister but no callback and no srcu synchronization.
*/
void mmu_notifier_unregister_no_release(struct mmu_notifier *mn,
struct mm_struct *mm)
{
spin_lock(&mm->mmu_notifier_mm->lock);
/*
* Can not use list_del_rcu() since __mmu_notifier_release
* can delete it before we hold the lock.
*/
hlist_del_init_rcu(&mn->hlist);
spin_unlock(&mm->mmu_notifier_mm->lock);
BUG_ON(atomic_read(&mm->mm_count) <= 0);
mmdrop(mm);
}
EXPORT_SYMBOL_GPL(mmu_notifier_unregister_no_release);
static int __init mmu_notifier_init(void)
{
return init_srcu_struct(&srcu);
}
subsys_initcall(mmu_notifier_init);

116
mm/mmzone.c Normal file
View file

@ -0,0 +1,116 @@
/*
* linux/mm/mmzone.c
*
* management codes for pgdats, zones and page flags
*/
#include <linux/stddef.h>
#include <linux/mm.h>
#include <linux/mmzone.h>
struct pglist_data *first_online_pgdat(void)
{
return NODE_DATA(first_online_node);
}
struct pglist_data *next_online_pgdat(struct pglist_data *pgdat)
{
int nid = next_online_node(pgdat->node_id);
if (nid == MAX_NUMNODES)
return NULL;
return NODE_DATA(nid);
}
/*
* next_zone - helper magic for for_each_zone()
*/
struct zone *next_zone(struct zone *zone)
{
pg_data_t *pgdat = zone->zone_pgdat;
if (zone < pgdat->node_zones + MAX_NR_ZONES - 1)
zone++;
else {
pgdat = next_online_pgdat(pgdat);
if (pgdat)
zone = pgdat->node_zones;
else
zone = NULL;
}
return zone;
}
static inline int zref_in_nodemask(struct zoneref *zref, nodemask_t *nodes)
{
#ifdef CONFIG_NUMA
return node_isset(zonelist_node_idx(zref), *nodes);
#else
return 1;
#endif /* CONFIG_NUMA */
}
/* Returns the next zone at or below highest_zoneidx in a zonelist */
struct zoneref *next_zones_zonelist(struct zoneref *z,
enum zone_type highest_zoneidx,
nodemask_t *nodes,
struct zone **zone)
{
/*
* Find the next suitable zone to use for the allocation.
* Only filter based on nodemask if it's set
*/
if (likely(nodes == NULL))
while (zonelist_zone_idx(z) > highest_zoneidx)
z++;
else
while (zonelist_zone_idx(z) > highest_zoneidx ||
(z->zone && !zref_in_nodemask(z, nodes)))
z++;
*zone = zonelist_zone(z);
return z;
}
#ifdef CONFIG_ARCH_HAS_HOLES_MEMORYMODEL
int memmap_valid_within(unsigned long pfn,
struct page *page, struct zone *zone)
{
if (page_to_pfn(page) != pfn)
return 0;
if (page_zone(page) != zone)
return 0;
return 1;
}
#endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */
void lruvec_init(struct lruvec *lruvec)
{
enum lru_list lru;
memset(lruvec, 0, sizeof(struct lruvec));
for_each_lru(lru)
INIT_LIST_HEAD(&lruvec->lists[lru]);
}
#if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS)
int page_cpupid_xchg_last(struct page *page, int cpupid)
{
unsigned long old_flags, flags;
int last_cpupid;
do {
old_flags = flags = page->flags;
last_cpupid = page_cpupid_last(page);
flags &= ~(LAST_CPUPID_MASK << LAST_CPUPID_PGSHIFT);
flags |= (cpupid & LAST_CPUPID_MASK) << LAST_CPUPID_PGSHIFT;
} while (unlikely(cmpxchg(&page->flags, old_flags, flags) != old_flags));
return last_cpupid;
}
#endif

433
mm/mprotect.c Normal file
View file

@ -0,0 +1,433 @@
/*
* mm/mprotect.c
*
* (C) Copyright 1994 Linus Torvalds
* (C) Copyright 2002 Christoph Hellwig
*
* Address space accounting code <alan@lxorguk.ukuu.org.uk>
* (C) Copyright 2002 Red Hat Inc, All Rights Reserved
*/
#include <linux/mm.h>
#include <linux/hugetlb.h>
#include <linux/shm.h>
#include <linux/mman.h>
#include <linux/fs.h>
#include <linux/highmem.h>
#include <linux/security.h>
#include <linux/mempolicy.h>
#include <linux/personality.h>
#include <linux/syscalls.h>
#include <linux/swap.h>
#include <linux/swapops.h>
#include <linux/mmu_notifier.h>
#include <linux/migrate.h>
#include <linux/perf_event.h>
#include <linux/ksm.h>
#include <asm/uaccess.h>
#include <asm/pgtable.h>
#include <asm/cacheflush.h>
#include <asm/tlbflush.h>
/*
* For a prot_numa update we only hold mmap_sem for read so there is a
* potential race with faulting where a pmd was temporarily none. This
* function checks for a transhuge pmd under the appropriate lock. It
* returns a pte if it was successfully locked or NULL if it raced with
* a transhuge insertion.
*/
static pte_t *lock_pte_protection(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, int prot_numa, spinlock_t **ptl)
{
pte_t *pte;
spinlock_t *pmdl;
/* !prot_numa is protected by mmap_sem held for write */
if (!prot_numa)
return pte_offset_map_lock(vma->vm_mm, pmd, addr, ptl);
pmdl = pmd_lock(vma->vm_mm, pmd);
if (unlikely(pmd_trans_huge(*pmd) || pmd_none(*pmd))) {
spin_unlock(pmdl);
return NULL;
}
pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, ptl);
spin_unlock(pmdl);
return pte;
}
static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end, pgprot_t newprot,
int dirty_accountable, int prot_numa)
{
struct mm_struct *mm = vma->vm_mm;
pte_t *pte, oldpte;
spinlock_t *ptl;
unsigned long pages = 0;
pte = lock_pte_protection(vma, pmd, addr, prot_numa, &ptl);
if (!pte)
return 0;
arch_enter_lazy_mmu_mode();
do {
oldpte = *pte;
if (pte_present(oldpte)) {
pte_t ptent;
bool updated = false;
if (!prot_numa) {
ptent = ptep_modify_prot_start(mm, addr, pte);
if (pte_numa(ptent))
ptent = pte_mknonnuma(ptent);
ptent = pte_modify(ptent, newprot);
/*
* Avoid taking write faults for pages we
* know to be dirty.
*/
if (dirty_accountable && pte_dirty(ptent) &&
(pte_soft_dirty(ptent) ||
!(vma->vm_flags & VM_SOFTDIRTY)))
ptent = pte_mkwrite(ptent);
ptep_modify_prot_commit(mm, addr, pte, ptent);
updated = true;
} else {
struct page *page;
page = vm_normal_page(vma, addr, oldpte);
if (page && !PageKsm(page)) {
if (!pte_numa(oldpte)) {
ptep_set_numa(mm, addr, pte);
updated = true;
}
}
}
if (updated)
pages++;
} else if (IS_ENABLED(CONFIG_MIGRATION) && !pte_file(oldpte)) {
swp_entry_t entry = pte_to_swp_entry(oldpte);
if (is_write_migration_entry(entry)) {
pte_t newpte;
/*
* A protection check is difficult so
* just be safe and disable write
*/
make_migration_entry_read(&entry);
newpte = swp_entry_to_pte(entry);
if (pte_swp_soft_dirty(oldpte))
newpte = pte_swp_mksoft_dirty(newpte);
set_pte_at(mm, addr, pte, newpte);
pages++;
}
}
} while (pte++, addr += PAGE_SIZE, addr != end);
arch_leave_lazy_mmu_mode();
pte_unmap_unlock(pte - 1, ptl);
return pages;
}
static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
pud_t *pud, unsigned long addr, unsigned long end,
pgprot_t newprot, int dirty_accountable, int prot_numa)
{
pmd_t *pmd;
struct mm_struct *mm = vma->vm_mm;
unsigned long next;
unsigned long pages = 0;
unsigned long nr_huge_updates = 0;
unsigned long mni_start = 0;
pmd = pmd_offset(pud, addr);
do {
unsigned long this_pages;
next = pmd_addr_end(addr, end);
if (!pmd_trans_huge(*pmd) && pmd_none_or_clear_bad(pmd))
continue;
/* invoke the mmu notifier if the pmd is populated */
if (!mni_start) {
mni_start = addr;
mmu_notifier_invalidate_range_start(mm, mni_start, end);
}
if (pmd_trans_huge(*pmd)) {
if (next - addr != HPAGE_PMD_SIZE)
split_huge_page_pmd(vma, addr, pmd);
else {
int nr_ptes = change_huge_pmd(vma, pmd, addr,
newprot, prot_numa);
if (nr_ptes) {
if (nr_ptes == HPAGE_PMD_NR) {
pages += HPAGE_PMD_NR;
nr_huge_updates++;
}
/* huge pmd was handled */
continue;
}
}
/* fall through, the trans huge pmd just split */
}
this_pages = change_pte_range(vma, pmd, addr, next, newprot,
dirty_accountable, prot_numa);
pages += this_pages;
} while (pmd++, addr = next, addr != end);
if (mni_start)
mmu_notifier_invalidate_range_end(mm, mni_start, end);
if (nr_huge_updates)
count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
return pages;
}
static inline unsigned long change_pud_range(struct vm_area_struct *vma,
pgd_t *pgd, unsigned long addr, unsigned long end,
pgprot_t newprot, int dirty_accountable, int prot_numa)
{
pud_t *pud;
unsigned long next;
unsigned long pages = 0;
pud = pud_offset(pgd, addr);
do {
next = pud_addr_end(addr, end);
if (pud_none_or_clear_bad(pud))
continue;
pages += change_pmd_range(vma, pud, addr, next, newprot,
dirty_accountable, prot_numa);
} while (pud++, addr = next, addr != end);
return pages;
}
static unsigned long change_protection_range(struct vm_area_struct *vma,
unsigned long addr, unsigned long end, pgprot_t newprot,
int dirty_accountable, int prot_numa)
{
struct mm_struct *mm = vma->vm_mm;
pgd_t *pgd;
unsigned long next;
unsigned long start = addr;
unsigned long pages = 0;
BUG_ON(addr >= end);
pgd = pgd_offset(mm, addr);
flush_cache_range(vma, addr, end);
set_tlb_flush_pending(mm);
do {
next = pgd_addr_end(addr, end);
if (pgd_none_or_clear_bad(pgd))
continue;
pages += change_pud_range(vma, pgd, addr, next, newprot,
dirty_accountable, prot_numa);
} while (pgd++, addr = next, addr != end);
/* Only flush the TLB if we actually modified any entries: */
if (pages)
flush_tlb_range(vma, start, end);
clear_tlb_flush_pending(mm);
return pages;
}
unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
unsigned long end, pgprot_t newprot,
int dirty_accountable, int prot_numa)
{
unsigned long pages;
if (is_vm_hugetlb_page(vma))
pages = hugetlb_change_protection(vma, start, end, newprot);
else
pages = change_protection_range(vma, start, end, newprot, dirty_accountable, prot_numa);
return pages;
}
int
mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
unsigned long start, unsigned long end, unsigned long newflags)
{
struct mm_struct *mm = vma->vm_mm;
unsigned long oldflags = vma->vm_flags;
long nrpages = (end - start) >> PAGE_SHIFT;
unsigned long charged = 0;
pgoff_t pgoff;
int error;
int dirty_accountable = 0;
if (newflags == oldflags) {
*pprev = vma;
return 0;
}
/*
* If we make a private mapping writable we increase our commit;
* but (without finer accounting) cannot reduce our commit if we
* make it unwritable again. hugetlb mapping were accounted for
* even if read-only so there is no need to account for them here
*/
if (newflags & VM_WRITE) {
if (!(oldflags & (VM_ACCOUNT|VM_WRITE|VM_HUGETLB|
VM_SHARED|VM_NORESERVE))) {
charged = nrpages;
if (security_vm_enough_memory_mm(mm, charged))
return -ENOMEM;
newflags |= VM_ACCOUNT;
}
}
/*
* First try to merge with previous and/or next vma.
*/
pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
*pprev = vma_merge(mm, *pprev, start, end, newflags,
vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
vma_get_anon_name(vma));
if (*pprev) {
vma = *pprev;
goto success;
}
*pprev = vma;
if (start != vma->vm_start) {
error = split_vma(mm, vma, start, 1);
if (error)
goto fail;
}
if (end != vma->vm_end) {
error = split_vma(mm, vma, end, 0);
if (error)
goto fail;
}
success:
/*
* vm_flags and vm_page_prot are protected by the mmap_sem
* held in write mode.
*/
vma->vm_flags = newflags;
dirty_accountable = vma_wants_writenotify(vma);
vma_set_page_prot(vma);
change_protection(vma, start, end, vma->vm_page_prot,
dirty_accountable, 0);
vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);
vm_stat_account(mm, newflags, vma->vm_file, nrpages);
perf_event_mmap(vma);
return 0;
fail:
vm_unacct_memory(charged);
return error;
}
SYSCALL_DEFINE3(mprotect, unsigned long, start, size_t, len,
unsigned long, prot)
{
unsigned long vm_flags, nstart, end, tmp, reqprot;
struct vm_area_struct *vma, *prev;
int error = -EINVAL;
const int grows = prot & (PROT_GROWSDOWN|PROT_GROWSUP);
prot &= ~(PROT_GROWSDOWN|PROT_GROWSUP);
if (grows == (PROT_GROWSDOWN|PROT_GROWSUP)) /* can't be both */
return -EINVAL;
if (start & ~PAGE_MASK)
return -EINVAL;
if (!len)
return 0;
len = PAGE_ALIGN(len);
end = start + len;
if (end <= start)
return -ENOMEM;
if (!arch_validate_prot(prot))
return -EINVAL;
reqprot = prot;
/*
* Does the application expect PROT_READ to imply PROT_EXEC:
*/
if ((prot & PROT_READ) && (current->personality & READ_IMPLIES_EXEC))
prot |= PROT_EXEC;
vm_flags = calc_vm_prot_bits(prot);
down_write(&current->mm->mmap_sem);
vma = find_vma(current->mm, start);
error = -ENOMEM;
if (!vma)
goto out;
prev = vma->vm_prev;
if (unlikely(grows & PROT_GROWSDOWN)) {
if (vma->vm_start >= end)
goto out;
start = vma->vm_start;
error = -EINVAL;
if (!(vma->vm_flags & VM_GROWSDOWN))
goto out;
} else {
if (vma->vm_start > start)
goto out;
if (unlikely(grows & PROT_GROWSUP)) {
end = vma->vm_end;
error = -EINVAL;
if (!(vma->vm_flags & VM_GROWSUP))
goto out;
}
}
if (start > vma->vm_start)
prev = vma;
for (nstart = start ; ; ) {
unsigned long newflags;
/* Here we know that vma->vm_start <= nstart < vma->vm_end. */
newflags = vm_flags;
newflags |= (vma->vm_flags & ~(VM_READ | VM_WRITE | VM_EXEC));
/* newflags >> 4 shift VM_MAY% in place of VM_% */
if ((newflags & ~(newflags >> 4)) & (VM_READ | VM_WRITE | VM_EXEC)) {
error = -EACCES;
goto out;
}
error = security_file_mprotect(vma, reqprot, prot);
if (error)
goto out;
tmp = vma->vm_end;
if (tmp > end)
tmp = end;
error = mprotect_fixup(vma, &prev, nstart, tmp, newflags);
if (error)
goto out;
nstart = tmp;
if (nstart < prev->vm_end)
nstart = prev->vm_end;
if (nstart >= end)
goto out;
vma = prev->vm_next;
if (!vma || vma->vm_start != nstart) {
error = -ENOMEM;
goto out;
}
}
out:
up_write(&current->mm->mmap_sem);
return error;
}

588
mm/mremap.c Normal file
View file

@ -0,0 +1,588 @@
/*
* mm/mremap.c
*
* (C) Copyright 1996 Linus Torvalds
*
* Address space accounting code <alan@lxorguk.ukuu.org.uk>
* (C) Copyright 2002 Red Hat Inc, All Rights Reserved
*/
#include <linux/mm.h>
#include <linux/hugetlb.h>
#include <linux/shm.h>
#include <linux/ksm.h>
#include <linux/mman.h>
#include <linux/swap.h>
#include <linux/capability.h>
#include <linux/fs.h>
#include <linux/swapops.h>
#include <linux/highmem.h>
#include <linux/security.h>
#include <linux/syscalls.h>
#include <linux/mmu_notifier.h>
#include <linux/sched/sysctl.h>
#include <linux/uaccess.h>
#include <asm/cacheflush.h>
#include <asm/tlbflush.h>
#include "internal.h"
static pmd_t *get_old_pmd(struct mm_struct *mm, unsigned long addr)
{
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
pgd = pgd_offset(mm, addr);
if (pgd_none_or_clear_bad(pgd))
return NULL;
pud = pud_offset(pgd, addr);
if (pud_none_or_clear_bad(pud))
return NULL;
pmd = pmd_offset(pud, addr);
if (pmd_none(*pmd))
return NULL;
return pmd;
}
static pmd_t *alloc_new_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr)
{
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
pgd = pgd_offset(mm, addr);
pud = pud_alloc(mm, pgd, addr);
if (!pud)
return NULL;
pmd = pmd_alloc(mm, pud, addr);
if (!pmd)
return NULL;
VM_BUG_ON(pmd_trans_huge(*pmd));
return pmd;
}
static pte_t move_soft_dirty_pte(pte_t pte)
{
/*
* Set soft dirty bit so we can notice
* in userspace the ptes were moved.
*/
#ifdef CONFIG_MEM_SOFT_DIRTY
if (pte_present(pte))
pte = pte_mksoft_dirty(pte);
else if (is_swap_pte(pte))
pte = pte_swp_mksoft_dirty(pte);
else if (pte_file(pte))
pte = pte_file_mksoft_dirty(pte);
#endif
return pte;
}
static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
unsigned long old_addr, unsigned long old_end,
struct vm_area_struct *new_vma, pmd_t *new_pmd,
unsigned long new_addr, bool need_rmap_locks)
{
struct address_space *mapping = NULL;
struct anon_vma *anon_vma = NULL;
struct mm_struct *mm = vma->vm_mm;
pte_t *old_pte, *new_pte, pte;
spinlock_t *old_ptl, *new_ptl;
/*
* When need_rmap_locks is true, we take the i_mmap_mutex and anon_vma
* locks to ensure that rmap will always observe either the old or the
* new ptes. This is the easiest way to avoid races with
* truncate_pagecache(), page migration, etc...
*
* When need_rmap_locks is false, we use other ways to avoid
* such races:
*
* - During exec() shift_arg_pages(), we use a specially tagged vma
* which rmap call sites look for using is_vma_temporary_stack().
*
* - During mremap(), new_vma is often known to be placed after vma
* in rmap traversal order. This ensures rmap will always observe
* either the old pte, or the new pte, or both (the page table locks
* serialize access to individual ptes, but only rmap traversal
* order guarantees that we won't miss both the old and new ptes).
*/
if (need_rmap_locks) {
if (vma->vm_file) {
mapping = vma->vm_file->f_mapping;
mutex_lock(&mapping->i_mmap_mutex);
}
if (vma->anon_vma) {
anon_vma = vma->anon_vma;
anon_vma_lock_write(anon_vma);
}
}
/*
* We don't have to worry about the ordering of src and dst
* pte locks because exclusive mmap_sem prevents deadlock.
*/
old_pte = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
new_pte = pte_offset_map(new_pmd, new_addr);
new_ptl = pte_lockptr(mm, new_pmd);
if (new_ptl != old_ptl)
spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
arch_enter_lazy_mmu_mode();
for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
new_pte++, new_addr += PAGE_SIZE) {
if (pte_none(*old_pte))
continue;
pte = ptep_get_and_clear(mm, old_addr, old_pte);
pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
pte = move_soft_dirty_pte(pte);
set_pte_at(mm, new_addr, new_pte, pte);
}
arch_leave_lazy_mmu_mode();
if (new_ptl != old_ptl)
spin_unlock(new_ptl);
pte_unmap(new_pte - 1);
pte_unmap_unlock(old_pte - 1, old_ptl);
if (anon_vma)
anon_vma_unlock_write(anon_vma);
if (mapping)
mutex_unlock(&mapping->i_mmap_mutex);
}
#define LATENCY_LIMIT (64 * PAGE_SIZE)
unsigned long move_page_tables(struct vm_area_struct *vma,
unsigned long old_addr, struct vm_area_struct *new_vma,
unsigned long new_addr, unsigned long len,
bool need_rmap_locks)
{
unsigned long extent, next, old_end;
pmd_t *old_pmd, *new_pmd;
bool need_flush = false;
unsigned long mmun_start; /* For mmu_notifiers */
unsigned long mmun_end; /* For mmu_notifiers */
old_end = old_addr + len;
flush_cache_range(vma, old_addr, old_end);
mmun_start = old_addr;
mmun_end = old_end;
mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start, mmun_end);
for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
cond_resched();
next = (old_addr + PMD_SIZE) & PMD_MASK;
/* even if next overflowed, extent below will be ok */
extent = next - old_addr;
if (extent > old_end - old_addr)
extent = old_end - old_addr;
old_pmd = get_old_pmd(vma->vm_mm, old_addr);
if (!old_pmd)
continue;
new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
if (!new_pmd)
break;
if (pmd_trans_huge(*old_pmd)) {
int err = 0;
if (extent == HPAGE_PMD_SIZE) {
VM_BUG_ON_VMA(vma->vm_file || !vma->anon_vma,
vma);
/* See comment in move_ptes() */
if (need_rmap_locks)
anon_vma_lock_write(vma->anon_vma);
err = move_huge_pmd(vma, new_vma, old_addr,
new_addr, old_end,
old_pmd, new_pmd);
if (need_rmap_locks)
anon_vma_unlock_write(vma->anon_vma);
}
if (err > 0) {
need_flush = true;
continue;
} else if (!err) {
split_huge_page_pmd(vma, old_addr, old_pmd);
}
VM_BUG_ON(pmd_trans_huge(*old_pmd));
}
if (pmd_none(*new_pmd) && __pte_alloc(new_vma->vm_mm, new_vma,
new_pmd, new_addr))
break;
next = (new_addr + PMD_SIZE) & PMD_MASK;
if (extent > next - new_addr)
extent = next - new_addr;
if (extent > LATENCY_LIMIT)
extent = LATENCY_LIMIT;
move_ptes(vma, old_pmd, old_addr, old_addr + extent,
new_vma, new_pmd, new_addr, need_rmap_locks);
need_flush = true;
}
if (likely(need_flush))
flush_tlb_range(vma, old_end-len, old_addr);
mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end);
return len + old_addr - old_end; /* how much done */
}
static unsigned long move_vma(struct vm_area_struct *vma,
unsigned long old_addr, unsigned long old_len,
unsigned long new_len, unsigned long new_addr, bool *locked)
{
struct mm_struct *mm = vma->vm_mm;
struct vm_area_struct *new_vma;
unsigned long vm_flags = vma->vm_flags;
unsigned long new_pgoff;
unsigned long moved_len;
unsigned long excess = 0;
unsigned long hiwater_vm;
int split = 0;
int err;
bool need_rmap_locks;
/*
* We'd prefer to avoid failure later on in do_munmap:
* which may split one vma into three before unmapping.
*/
if (mm->map_count >= sysctl_max_map_count - 3)
return -ENOMEM;
/*
* Advise KSM to break any KSM pages in the area to be moved:
* it would be confusing if they were to turn up at the new
* location, where they happen to coincide with different KSM
* pages recently unmapped. But leave vma->vm_flags as it was,
* so KSM can come around to merge on vma and new_vma afterwards.
*/
err = ksm_madvise(vma, old_addr, old_addr + old_len,
MADV_UNMERGEABLE, &vm_flags);
if (err)
return err;
new_pgoff = vma->vm_pgoff + ((old_addr - vma->vm_start) >> PAGE_SHIFT);
new_vma = copy_vma(&vma, new_addr, new_len, new_pgoff,
&need_rmap_locks);
if (!new_vma)
return -ENOMEM;
moved_len = move_page_tables(vma, old_addr, new_vma, new_addr, old_len,
need_rmap_locks);
if (moved_len < old_len) {
/*
* On error, move entries back from new area to old,
* which will succeed since page tables still there,
* and then proceed to unmap new area instead of old.
*/
move_page_tables(new_vma, new_addr, vma, old_addr, moved_len,
true);
vma = new_vma;
old_len = new_len;
old_addr = new_addr;
new_addr = -ENOMEM;
}
/* Conceal VM_ACCOUNT so old reservation is not undone */
if (vm_flags & VM_ACCOUNT) {
vma->vm_flags &= ~VM_ACCOUNT;
excess = vma->vm_end - vma->vm_start - old_len;
if (old_addr > vma->vm_start &&
old_addr + old_len < vma->vm_end)
split = 1;
}
/*
* If we failed to move page tables we still do total_vm increment
* since do_munmap() will decrement it by old_len == new_len.
*
* Since total_vm is about to be raised artificially high for a
* moment, we need to restore high watermark afterwards: if stats
* are taken meanwhile, total_vm and hiwater_vm appear too high.
* If this were a serious issue, we'd add a flag to do_munmap().
*/
hiwater_vm = mm->hiwater_vm;
vm_stat_account(mm, vma->vm_flags, vma->vm_file, new_len>>PAGE_SHIFT);
if (do_munmap(mm, old_addr, old_len) < 0) {
/* OOM: unable to split vma, just get accounts right */
vm_unacct_memory(excess >> PAGE_SHIFT);
excess = 0;
}
mm->hiwater_vm = hiwater_vm;
/* Restore VM_ACCOUNT if one or two pieces of vma left */
if (excess) {
vma->vm_flags |= VM_ACCOUNT;
if (split)
vma->vm_next->vm_flags |= VM_ACCOUNT;
}
if (vm_flags & VM_LOCKED) {
mm->locked_vm += new_len >> PAGE_SHIFT;
*locked = true;
}
return new_addr;
}
static struct vm_area_struct *vma_to_resize(unsigned long addr,
unsigned long old_len, unsigned long new_len, unsigned long *p)
{
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma = find_vma(mm, addr);
if (!vma || vma->vm_start > addr)
goto Efault;
if (is_vm_hugetlb_page(vma))
goto Einval;
/* We can't remap across vm area boundaries */
if (old_len > vma->vm_end - addr)
goto Efault;
/* Need to be careful about a growing mapping */
if (new_len > old_len) {
unsigned long pgoff;
if (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP))
goto Efault;
pgoff = (addr - vma->vm_start) >> PAGE_SHIFT;
pgoff += vma->vm_pgoff;
if (pgoff + (new_len >> PAGE_SHIFT) < pgoff)
goto Einval;
}
if (vma->vm_flags & VM_LOCKED) {
unsigned long locked, lock_limit;
locked = mm->locked_vm << PAGE_SHIFT;
lock_limit = rlimit(RLIMIT_MEMLOCK);
locked += new_len - old_len;
if (locked > lock_limit && !capable(CAP_IPC_LOCK))
goto Eagain;
}
if (!may_expand_vm(mm, (new_len - old_len) >> PAGE_SHIFT))
goto Enomem;
if (vma->vm_flags & VM_ACCOUNT) {
unsigned long charged = (new_len - old_len) >> PAGE_SHIFT;
if (security_vm_enough_memory_mm(mm, charged))
goto Efault;
*p = charged;
}
return vma;
Efault: /* very odd choice for most of the cases, but... */
return ERR_PTR(-EFAULT);
Einval:
return ERR_PTR(-EINVAL);
Enomem:
return ERR_PTR(-ENOMEM);
Eagain:
return ERR_PTR(-EAGAIN);
}
static unsigned long mremap_to(unsigned long addr, unsigned long old_len,
unsigned long new_addr, unsigned long new_len, bool *locked)
{
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
unsigned long ret = -EINVAL;
unsigned long charged = 0;
unsigned long map_flags;
if (new_addr & ~PAGE_MASK)
goto out;
if (new_len > TASK_SIZE || new_addr > TASK_SIZE - new_len)
goto out;
/* Check if the location we're moving into overlaps the
* old location at all, and fail if it does.
*/
if ((new_addr <= addr) && (new_addr+new_len) > addr)
goto out;
if ((addr <= new_addr) && (addr+old_len) > new_addr)
goto out;
ret = do_munmap(mm, new_addr, new_len);
if (ret)
goto out;
if (old_len >= new_len) {
ret = do_munmap(mm, addr+new_len, old_len - new_len);
if (ret && old_len != new_len)
goto out;
old_len = new_len;
}
vma = vma_to_resize(addr, old_len, new_len, &charged);
if (IS_ERR(vma)) {
ret = PTR_ERR(vma);
goto out;
}
map_flags = MAP_FIXED;
if (vma->vm_flags & VM_MAYSHARE)
map_flags |= MAP_SHARED;
ret = get_unmapped_area(vma->vm_file, new_addr, new_len, vma->vm_pgoff +
((addr - vma->vm_start) >> PAGE_SHIFT),
map_flags);
if (ret & ~PAGE_MASK)
goto out1;
ret = move_vma(vma, addr, old_len, new_len, new_addr, locked);
if (!(ret & ~PAGE_MASK))
goto out;
out1:
vm_unacct_memory(charged);
out:
return ret;
}
static int vma_expandable(struct vm_area_struct *vma, unsigned long delta)
{
unsigned long end = vma->vm_end + delta;
if (end < vma->vm_end) /* overflow */
return 0;
if (vma->vm_next && vma->vm_next->vm_start < end) /* intersection */
return 0;
if (get_unmapped_area(NULL, vma->vm_start, end - vma->vm_start,
0, MAP_FIXED) & ~PAGE_MASK)
return 0;
return 1;
}
/*
* Expand (or shrink) an existing mapping, potentially moving it at the
* same time (controlled by the MREMAP_MAYMOVE flag and available VM space)
*
* MREMAP_FIXED option added 5-Dec-1999 by Benjamin LaHaise
* This option implies MREMAP_MAYMOVE.
*/
SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
unsigned long, new_len, unsigned long, flags,
unsigned long, new_addr)
{
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
unsigned long ret = -EINVAL;
unsigned long charged = 0;
bool locked = false;
if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE))
return ret;
if (flags & MREMAP_FIXED && !(flags & MREMAP_MAYMOVE))
return ret;
if (addr & ~PAGE_MASK)
return ret;
old_len = PAGE_ALIGN(old_len);
new_len = PAGE_ALIGN(new_len);
/*
* We allow a zero old-len as a special case
* for DOS-emu "duplicate shm area" thing. But
* a zero new-len is nonsensical.
*/
if (!new_len)
return ret;
down_write(&current->mm->mmap_sem);
if (flags & MREMAP_FIXED) {
ret = mremap_to(addr, old_len, new_addr, new_len,
&locked);
goto out;
}
/*
* Always allow a shrinking remap: that just unmaps
* the unnecessary pages..
* do_munmap does all the needed commit accounting
*/
if (old_len >= new_len) {
ret = do_munmap(mm, addr+new_len, old_len - new_len);
if (ret && old_len != new_len)
goto out;
ret = addr;
goto out;
}
/*
* Ok, we need to grow..
*/
vma = vma_to_resize(addr, old_len, new_len, &charged);
if (IS_ERR(vma)) {
ret = PTR_ERR(vma);
goto out;
}
/* old_len exactly to the end of the area..
*/
if (old_len == vma->vm_end - addr) {
/* can we just expand the current mapping? */
if (vma_expandable(vma, new_len - old_len)) {
int pages = (new_len - old_len) >> PAGE_SHIFT;
if (vma_adjust(vma, vma->vm_start, addr + new_len,
vma->vm_pgoff, NULL)) {
ret = -ENOMEM;
goto out;
}
vm_stat_account(mm, vma->vm_flags, vma->vm_file, pages);
if (vma->vm_flags & VM_LOCKED) {
mm->locked_vm += pages;
locked = true;
new_addr = addr;
}
ret = addr;
goto out;
}
}
/*
* We weren't able to just expand or shrink the area,
* we need to create a new one and move it..
*/
ret = -ENOMEM;
if (flags & MREMAP_MAYMOVE) {
unsigned long map_flags = 0;
if (vma->vm_flags & VM_MAYSHARE)
map_flags |= MAP_SHARED;
new_addr = get_unmapped_area(vma->vm_file, 0, new_len,
vma->vm_pgoff +
((addr - vma->vm_start) >> PAGE_SHIFT),
map_flags);
if (new_addr & ~PAGE_MASK) {
ret = new_addr;
goto out;
}
ret = move_vma(vma, addr, old_len, new_len, new_addr, &locked);
}
out:
if (ret & ~PAGE_MASK)
vm_unacct_memory(charged);
up_write(&current->mm->mmap_sem);
if (locked && new_len > old_len)
mm_populate(new_addr + old_len, new_len - old_len);
return ret;
}

110
mm/msync.c Normal file
View file

@ -0,0 +1,110 @@
/*
* linux/mm/msync.c
*
* Copyright (C) 1994-1999 Linus Torvalds
*/
/*
* The msync() system call.
*/
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/mman.h>
#include <linux/file.h>
#include <linux/syscalls.h>
#include <linux/sched.h>
/*
* MS_SYNC syncs the entire file - including mappings.
*
* MS_ASYNC does not start I/O (it used to, up to 2.5.67).
* Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
* Now it doesn't do anything, since dirty pages are properly tracked.
*
* The application may now run fsync() to
* write out the dirty pages and wait on the writeout and check the result.
* Or the application may run fadvise(FADV_DONTNEED) against the fd to start
* async writeout immediately.
* So by _not_ starting I/O in MS_ASYNC we provide complete flexibility to
* applications.
*/
SYSCALL_DEFINE3(msync, unsigned long, start, size_t, len, int, flags)
{
unsigned long end;
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
int unmapped_error = 0;
int error = -EINVAL;
if (flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
goto out;
if (start & ~PAGE_MASK)
goto out;
if ((flags & MS_ASYNC) && (flags & MS_SYNC))
goto out;
error = -ENOMEM;
len = (len + ~PAGE_MASK) & PAGE_MASK;
end = start + len;
if (end < start)
goto out;
error = 0;
if (end == start)
goto out;
/*
* If the interval [start,end) covers some unmapped address ranges,
* just ignore them, but return -ENOMEM at the end.
*/
down_read(&mm->mmap_sem);
vma = find_vma(mm, start);
for (;;) {
struct file *file;
loff_t fstart, fend;
/* Still start < end. */
error = -ENOMEM;
if (!vma)
goto out_unlock;
/* Here start < vma->vm_end. */
if (start < vma->vm_start) {
start = vma->vm_start;
if (start >= end)
goto out_unlock;
unmapped_error = -ENOMEM;
}
/* Here vma->vm_start <= start < vma->vm_end. */
if ((flags & MS_INVALIDATE) &&
(vma->vm_flags & VM_LOCKED)) {
error = -EBUSY;
goto out_unlock;
}
file = vma->vm_file;
fstart = (start - vma->vm_start) +
((loff_t)vma->vm_pgoff << PAGE_SHIFT);
fend = fstart + (min(end, vma->vm_end) - start) - 1;
start = vma->vm_end;
if ((flags & MS_SYNC) && file &&
(vma->vm_flags & VM_SHARED)) {
get_file(file);
up_read(&mm->mmap_sem);
if (vma->vm_flags & VM_NONLINEAR)
error = vfs_fsync(file, 1);
else
error = vfs_fsync_range(file, fstart, fend, 1);
fput(file);
if (error || start >= end)
goto out;
down_read(&mm->mmap_sem);
vma = find_vma(mm, start);
} else {
if (start >= end) {
error = 0;
goto out_unlock;
}
vma = vma->vm_next;
}
}
out_unlock:
up_read(&mm->mmap_sem);
out:
return error ? : unmapped_error;
}

438
mm/nobootmem.c Normal file
View file

@ -0,0 +1,438 @@
/*
* bootmem - A boot-time physical memory allocator and configurator
*
* Copyright (C) 1999 Ingo Molnar
* 1999 Kanoj Sarcar, SGI
* 2008 Johannes Weiner
*
* Access to this subsystem has to be serialized externally (which is true
* for the boot process anyway).
*/
#include <linux/init.h>
#include <linux/pfn.h>
#include <linux/slab.h>
#include <linux/bootmem.h>
#include <linux/export.h>
#include <linux/kmemleak.h>
#include <linux/range.h>
#include <linux/memblock.h>
#include <asm/bug.h>
#include <asm/io.h>
#include <asm/processor.h>
#include "internal.h"
#ifndef CONFIG_NEED_MULTIPLE_NODES
struct pglist_data __refdata contig_page_data;
EXPORT_SYMBOL(contig_page_data);
#endif
unsigned long max_low_pfn;
unsigned long min_low_pfn;
unsigned long max_pfn;
static void * __init __alloc_memory_core_early(int nid, u64 size, u64 align,
u64 goal, u64 limit)
{
void *ptr;
u64 addr;
if (limit > memblock.current_limit)
limit = memblock.current_limit;
addr = memblock_find_in_range_node(size, align, goal, limit, nid);
if (!addr)
return NULL;
if (memblock_reserve(addr, size))
return NULL;
ptr = phys_to_virt(addr);
memset(ptr, 0, size);
/*
* The min_count is set to 0 so that bootmem allocated blocks
* are never reported as leaks.
*/
kmemleak_alloc(ptr, size, 0, 0);
return ptr;
}
/*
* free_bootmem_late - free bootmem pages directly to page allocator
* @addr: starting address of the range
* @size: size of the range in bytes
*
* This is only useful when the bootmem allocator has already been torn
* down, but we are still initializing the system. Pages are given directly
* to the page allocator, no bootmem metadata is updated because it is gone.
*/
void __init free_bootmem_late(unsigned long addr, unsigned long size)
{
unsigned long cursor, end;
kmemleak_free_part(__va(addr), size);
cursor = PFN_UP(addr);
end = PFN_DOWN(addr + size);
for (; cursor < end; cursor++) {
__free_pages_bootmem(pfn_to_page(cursor), 0);
totalram_pages++;
}
}
static void __init __free_pages_memory(unsigned long start, unsigned long end)
{
int order;
while (start < end) {
order = min(MAX_ORDER - 1UL, __ffs(start));
while (start + (1UL << order) > end)
order--;
__free_pages_bootmem(pfn_to_page(start), order);
start += (1UL << order);
}
}
static unsigned long __init __free_memory_core(phys_addr_t start,
phys_addr_t end)
{
unsigned long start_pfn = PFN_UP(start);
unsigned long end_pfn = min_t(unsigned long,
PFN_DOWN(end), max_low_pfn);
if (start_pfn > end_pfn)
return 0;
__free_pages_memory(start_pfn, end_pfn);
return end_pfn - start_pfn;
}
static unsigned long __init free_low_memory_core_early(void)
{
unsigned long count = 0;
phys_addr_t start, end;
u64 i;
memblock_clear_hotplug(0, -1);
for_each_free_mem_range(i, NUMA_NO_NODE, &start, &end, NULL)
count += __free_memory_core(start, end);
#ifdef CONFIG_ARCH_DISCARD_MEMBLOCK
{
phys_addr_t size;
/* Free memblock.reserved array if it was allocated */
size = get_allocated_memblock_reserved_regions_info(&start);
if (size)
count += __free_memory_core(start, start + size);
/* Free memblock.memory array if it was allocated */
size = get_allocated_memblock_memory_regions_info(&start);
if (size)
count += __free_memory_core(start, start + size);
}
#endif
return count;
}
static int reset_managed_pages_done __initdata;
void reset_node_managed_pages(pg_data_t *pgdat)
{
struct zone *z;
for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++)
z->managed_pages = 0;
}
void __init reset_all_zones_managed_pages(void)
{
struct pglist_data *pgdat;
if (reset_managed_pages_done)
return;
for_each_online_pgdat(pgdat)
reset_node_managed_pages(pgdat);
reset_managed_pages_done = 1;
}
/**
* free_all_bootmem - release free pages to the buddy allocator
*
* Returns the number of pages actually released.
*/
unsigned long __init free_all_bootmem(void)
{
unsigned long pages;
reset_all_zones_managed_pages();
/*
* We need to use NUMA_NO_NODE instead of NODE_DATA(0)->node_id
* because in some case like Node0 doesn't have RAM installed
* low ram will be on Node1
*/
pages = free_low_memory_core_early();
totalram_pages += pages;
return pages;
}
/**
* free_bootmem_node - mark a page range as usable
* @pgdat: node the range resides on
* @physaddr: starting address of the range
* @size: size of the range in bytes
*
* Partial pages will be considered reserved and left as they are.
*
* The range must reside completely on the specified node.
*/
void __init free_bootmem_node(pg_data_t *pgdat, unsigned long physaddr,
unsigned long size)
{
memblock_free(physaddr, size);
}
/**
* free_bootmem - mark a page range as usable
* @addr: starting address of the range
* @size: size of the range in bytes
*
* Partial pages will be considered reserved and left as they are.
*
* The range must be contiguous but may span node boundaries.
*/
void __init free_bootmem(unsigned long addr, unsigned long size)
{
memblock_free(addr, size);
}
static void * __init ___alloc_bootmem_nopanic(unsigned long size,
unsigned long align,
unsigned long goal,
unsigned long limit)
{
void *ptr;
if (WARN_ON_ONCE(slab_is_available()))
return kzalloc(size, GFP_NOWAIT);
restart:
ptr = __alloc_memory_core_early(NUMA_NO_NODE, size, align, goal, limit);
if (ptr)
return ptr;
if (goal != 0) {
goal = 0;
goto restart;
}
return NULL;
}
/**
* __alloc_bootmem_nopanic - allocate boot memory without panicking
* @size: size of the request in bytes
* @align: alignment of the region
* @goal: preferred starting address of the region
*
* The goal is dropped if it can not be satisfied and the allocation will
* fall back to memory below @goal.
*
* Allocation may happen on any node in the system.
*
* Returns NULL on failure.
*/
void * __init __alloc_bootmem_nopanic(unsigned long size, unsigned long align,
unsigned long goal)
{
unsigned long limit = -1UL;
return ___alloc_bootmem_nopanic(size, align, goal, limit);
}
static void * __init ___alloc_bootmem(unsigned long size, unsigned long align,
unsigned long goal, unsigned long limit)
{
void *mem = ___alloc_bootmem_nopanic(size, align, goal, limit);
if (mem)
return mem;
/*
* Whoops, we cannot satisfy the allocation request.
*/
printk(KERN_ALERT "bootmem alloc of %lu bytes failed!\n", size);
panic("Out of memory");
return NULL;
}
/**
* __alloc_bootmem - allocate boot memory
* @size: size of the request in bytes
* @align: alignment of the region
* @goal: preferred starting address of the region
*
* The goal is dropped if it can not be satisfied and the allocation will
* fall back to memory below @goal.
*
* Allocation may happen on any node in the system.
*
* The function panics if the request can not be satisfied.
*/
void * __init __alloc_bootmem(unsigned long size, unsigned long align,
unsigned long goal)
{
unsigned long limit = -1UL;
return ___alloc_bootmem(size, align, goal, limit);
}
void * __init ___alloc_bootmem_node_nopanic(pg_data_t *pgdat,
unsigned long size,
unsigned long align,
unsigned long goal,
unsigned long limit)
{
void *ptr;
again:
ptr = __alloc_memory_core_early(pgdat->node_id, size, align,
goal, limit);
if (ptr)
return ptr;
ptr = __alloc_memory_core_early(NUMA_NO_NODE, size, align,
goal, limit);
if (ptr)
return ptr;
if (goal) {
goal = 0;
goto again;
}
return NULL;
}
void * __init __alloc_bootmem_node_nopanic(pg_data_t *pgdat, unsigned long size,
unsigned long align, unsigned long goal)
{
if (WARN_ON_ONCE(slab_is_available()))
return kzalloc_node(size, GFP_NOWAIT, pgdat->node_id);
return ___alloc_bootmem_node_nopanic(pgdat, size, align, goal, 0);
}
static void * __init ___alloc_bootmem_node(pg_data_t *pgdat, unsigned long size,
unsigned long align, unsigned long goal,
unsigned long limit)
{
void *ptr;
ptr = ___alloc_bootmem_node_nopanic(pgdat, size, align, goal, limit);
if (ptr)
return ptr;
printk(KERN_ALERT "bootmem alloc of %lu bytes failed!\n", size);
panic("Out of memory");
return NULL;
}
/**
* __alloc_bootmem_node - allocate boot memory from a specific node
* @pgdat: node to allocate from
* @size: size of the request in bytes
* @align: alignment of the region
* @goal: preferred starting address of the region
*
* The goal is dropped if it can not be satisfied and the allocation will
* fall back to memory below @goal.
*
* Allocation may fall back to any node in the system if the specified node
* can not hold the requested memory.
*
* The function panics if the request can not be satisfied.
*/
void * __init __alloc_bootmem_node(pg_data_t *pgdat, unsigned long size,
unsigned long align, unsigned long goal)
{
if (WARN_ON_ONCE(slab_is_available()))
return kzalloc_node(size, GFP_NOWAIT, pgdat->node_id);
return ___alloc_bootmem_node(pgdat, size, align, goal, 0);
}
void * __init __alloc_bootmem_node_high(pg_data_t *pgdat, unsigned long size,
unsigned long align, unsigned long goal)
{
return __alloc_bootmem_node(pgdat, size, align, goal);
}
#ifndef ARCH_LOW_ADDRESS_LIMIT
#define ARCH_LOW_ADDRESS_LIMIT 0xffffffffUL
#endif
/**
* __alloc_bootmem_low - allocate low boot memory
* @size: size of the request in bytes
* @align: alignment of the region
* @goal: preferred starting address of the region
*
* The goal is dropped if it can not be satisfied and the allocation will
* fall back to memory below @goal.
*
* Allocation may happen on any node in the system.
*
* The function panics if the request can not be satisfied.
*/
void * __init __alloc_bootmem_low(unsigned long size, unsigned long align,
unsigned long goal)
{
return ___alloc_bootmem(size, align, goal, ARCH_LOW_ADDRESS_LIMIT);
}
void * __init __alloc_bootmem_low_nopanic(unsigned long size,
unsigned long align,
unsigned long goal)
{
return ___alloc_bootmem_nopanic(size, align, goal,
ARCH_LOW_ADDRESS_LIMIT);
}
/**
* __alloc_bootmem_low_node - allocate low boot memory from a specific node
* @pgdat: node to allocate from
* @size: size of the request in bytes
* @align: alignment of the region
* @goal: preferred starting address of the region
*
* The goal is dropped if it can not be satisfied and the allocation will
* fall back to memory below @goal.
*
* Allocation may fall back to any node in the system if the specified node
* can not hold the requested memory.
*
* The function panics if the request can not be satisfied.
*/
void * __init __alloc_bootmem_low_node(pg_data_t *pgdat, unsigned long size,
unsigned long align, unsigned long goal)
{
if (WARN_ON_ONCE(slab_is_available()))
return kzalloc_node(size, GFP_NOWAIT, pgdat->node_id);
return ___alloc_bootmem_node(pgdat, size, align, goal,
ARCH_LOW_ADDRESS_LIMIT);
}

2177
mm/nommu.c Normal file

File diff suppressed because it is too large Load diff

754
mm/oom_kill.c Normal file
View file

@ -0,0 +1,754 @@
/*
* linux/mm/oom_kill.c
*
* Copyright (C) 1998,2000 Rik van Riel
* Thanks go out to Claus Fischer for some serious inspiration and
* for goading me into coding this file...
* Copyright (C) 2010 Google, Inc.
* Rewritten by David Rientjes
*
* The routines in this file are used to kill a process when
* we're seriously out of memory. This gets called from __alloc_pages()
* in mm/page_alloc.c when we really run out of memory.
*
* Since we won't call these routines often (on a well-configured
* machine) this file will double as a 'coding guide' and a signpost
* for newbie kernel hackers. It features several pointers to major
* kernel subsystems and hints as to where to find out what things do.
*/
#include <linux/oom.h>
#include <linux/mm.h>
#include <linux/err.h>
#include <linux/gfp.h>
#include <linux/sched.h>
#include <linux/swap.h>
#include <linux/timex.h>
#include <linux/jiffies.h>
#include <linux/cpuset.h>
#include <linux/export.h>
#include <linux/notifier.h>
#include <linux/memcontrol.h>
#include <linux/mempolicy.h>
#include <linux/security.h>
#include <linux/ptrace.h>
#include <linux/freezer.h>
#include <linux/ftrace.h>
#include <linux/ratelimit.h>
#include <linux/compaction.h>
#include "internal.h"
#define CREATE_TRACE_POINTS
#include <trace/events/oom.h>
int sysctl_panic_on_oom;
int sysctl_oom_kill_allocating_task;
int sysctl_oom_dump_tasks = 1;
static DEFINE_SPINLOCK(zone_scan_lock);
#ifdef CONFIG_NUMA
/**
* has_intersects_mems_allowed() - check task eligiblity for kill
* @start: task struct of which task to consider
* @mask: nodemask passed to page allocator for mempolicy ooms
*
* Task eligibility is determined by whether or not a candidate task, @tsk,
* shares the same mempolicy nodes as current if it is bound by such a policy
* and whether or not it has the same set of allowed cpuset nodes.
*/
static bool has_intersects_mems_allowed(struct task_struct *start,
const nodemask_t *mask)
{
struct task_struct *tsk;
bool ret = false;
rcu_read_lock();
for_each_thread(start, tsk) {
if (mask) {
/*
* If this is a mempolicy constrained oom, tsk's
* cpuset is irrelevant. Only return true if its
* mempolicy intersects current, otherwise it may be
* needlessly killed.
*/
ret = mempolicy_nodemask_intersects(tsk, mask);
} else {
/*
* This is not a mempolicy constrained oom, so only
* check the mems of tsk's cpuset.
*/
ret = cpuset_mems_allowed_intersects(current, tsk);
}
if (ret)
break;
}
rcu_read_unlock();
return ret;
}
#else
static bool has_intersects_mems_allowed(struct task_struct *tsk,
const nodemask_t *mask)
{
return true;
}
#endif /* CONFIG_NUMA */
/*
* The process p may have detached its own ->mm while exiting or through
* use_mm(), but one or more of its subthreads may still have a valid
* pointer. Return p, or any of its subthreads with a valid ->mm, with
* task_lock() held.
*/
struct task_struct *find_lock_task_mm(struct task_struct *p)
{
struct task_struct *t;
rcu_read_lock();
for_each_thread(p, t) {
task_lock(t);
if (likely(t->mm))
goto found;
task_unlock(t);
}
t = NULL;
found:
rcu_read_unlock();
return t;
}
/* return true if the task is not adequate as candidate victim task. */
static bool oom_unkillable_task(struct task_struct *p,
const struct mem_cgroup *memcg, const nodemask_t *nodemask)
{
if (is_global_init(p))
return true;
if (p->flags & PF_KTHREAD)
return true;
/* When mem_cgroup_out_of_memory() and p is not member of the group */
if (memcg && !task_in_mem_cgroup(p, memcg))
return true;
/* p may not have freeable memory in nodemask */
if (!has_intersects_mems_allowed(p, nodemask))
return true;
return false;
}
/**
* oom_badness - heuristic function to determine which candidate task to kill
* @p: task struct of which task we should calculate
* @totalpages: total present RAM allowed for page allocation
*
* The heuristic for determining which task to kill is made to be as simple and
* predictable as possible. The goal is to return the highest value for the
* task consuming the most memory to avoid subsequent oom failures.
*/
unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
const nodemask_t *nodemask, unsigned long totalpages)
{
long points;
long adj;
if (oom_unkillable_task(p, memcg, nodemask))
return 0;
p = find_lock_task_mm(p);
if (!p)
return 0;
adj = (long)p->signal->oom_score_adj;
if (adj == OOM_SCORE_ADJ_MIN) {
task_unlock(p);
return 0;
}
/*
* The baseline for the badness score is the proportion of RAM that each
* task's rss, pagetable and swap space use.
*/
points = get_mm_rss(p->mm) + atomic_long_read(&p->mm->nr_ptes) +
get_mm_counter(p->mm, MM_SWAPENTS);
task_unlock(p);
/*
* Root processes get 3% bonus, just like the __vm_enough_memory()
* implementation used by LSMs.
*/
if (has_capability_noaudit(p, CAP_SYS_ADMIN))
points -= (points * 3) / 100;
/* Normalize to oom_score_adj units */
adj *= totalpages / 1000;
points += adj;
/*
* Never return 0 for an eligible task regardless of the root bonus and
* oom_score_adj (oom_score_adj can't be OOM_SCORE_ADJ_MIN here).
*/
return points > 0 ? points : 1;
}
/*
* Determine the type of allocation constraint.
*/
#ifdef CONFIG_NUMA
static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
gfp_t gfp_mask, nodemask_t *nodemask,
unsigned long *totalpages)
{
struct zone *zone;
struct zoneref *z;
enum zone_type high_zoneidx = gfp_zone(gfp_mask);
bool cpuset_limited = false;
int nid;
/* Default to all available memory */
*totalpages = totalram_pages + total_swap_pages;
if (!zonelist)
return CONSTRAINT_NONE;
/*
* Reach here only when __GFP_NOFAIL is used. So, we should avoid
* to kill current.We have to random task kill in this case.
* Hopefully, CONSTRAINT_THISNODE...but no way to handle it, now.
*/
if (gfp_mask & __GFP_THISNODE)
return CONSTRAINT_NONE;
/*
* This is not a __GFP_THISNODE allocation, so a truncated nodemask in
* the page allocator means a mempolicy is in effect. Cpuset policy
* is enforced in get_page_from_freelist().
*/
if (nodemask && !nodes_subset(node_states[N_MEMORY], *nodemask)) {
*totalpages = total_swap_pages;
for_each_node_mask(nid, *nodemask)
*totalpages += node_spanned_pages(nid);
return CONSTRAINT_MEMORY_POLICY;
}
/* Check this allocation failure is caused by cpuset's wall function */
for_each_zone_zonelist_nodemask(zone, z, zonelist,
high_zoneidx, nodemask)
if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
cpuset_limited = true;
if (cpuset_limited) {
*totalpages = total_swap_pages;
for_each_node_mask(nid, cpuset_current_mems_allowed)
*totalpages += node_spanned_pages(nid);
return CONSTRAINT_CPUSET;
}
return CONSTRAINT_NONE;
}
#else
static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
gfp_t gfp_mask, nodemask_t *nodemask,
unsigned long *totalpages)
{
*totalpages = totalram_pages + total_swap_pages;
return CONSTRAINT_NONE;
}
#endif
enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
unsigned long totalpages, const nodemask_t *nodemask,
bool force_kill)
{
if (oom_unkillable_task(task, NULL, nodemask))
return OOM_SCAN_CONTINUE;
/*
* This task already has access to memory reserves and is being killed.
* Don't allow any other task to have access to the reserves.
*/
if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
if (unlikely(frozen(task)))
__thaw_task(task);
if (!force_kill)
return OOM_SCAN_ABORT;
}
if (!task->mm)
return OOM_SCAN_CONTINUE;
/*
* If task is allocating a lot of memory and has been marked to be
* killed first if it triggers an oom, then select it.
*/
if (oom_task_origin(task))
return OOM_SCAN_SELECT;
if (task->flags & PF_EXITING && !force_kill) {
/*
* If this task is not being ptraced on exit, then wait for it
* to finish before killing some other task unnecessarily.
*/
if (!(task->group_leader->ptrace & PT_TRACE_EXIT))
return OOM_SCAN_ABORT;
}
return OOM_SCAN_OK;
}
/*
* Simple selection loop. We chose the process with the highest
* number of 'points'. Returns -1 on scan abort.
*
* (not docbooked, we don't want this one cluttering up the manual)
*/
static struct task_struct *select_bad_process(unsigned int *ppoints,
unsigned long totalpages, const nodemask_t *nodemask,
bool force_kill)
{
struct task_struct *g, *p;
struct task_struct *chosen = NULL;
unsigned long chosen_points = 0;
rcu_read_lock();
for_each_process_thread(g, p) {
unsigned int points;
switch (oom_scan_process_thread(p, totalpages, nodemask,
force_kill)) {
case OOM_SCAN_SELECT:
chosen = p;
chosen_points = ULONG_MAX;
/* fall through */
case OOM_SCAN_CONTINUE:
continue;
case OOM_SCAN_ABORT:
rcu_read_unlock();
return (struct task_struct *)(-1UL);
case OOM_SCAN_OK:
break;
};
points = oom_badness(p, NULL, nodemask, totalpages);
if (!points || points < chosen_points)
continue;
/* Prefer thread group leaders for display purposes */
if (points == chosen_points && thread_group_leader(chosen))
continue;
chosen = p;
chosen_points = points;
}
if (chosen)
get_task_struct(chosen);
rcu_read_unlock();
*ppoints = chosen_points * 1000 / totalpages;
return chosen;
}
/**
* dump_tasks - dump current memory state of all system tasks
* @memcg: current's memory controller, if constrained
* @nodemask: nodemask passed to page allocator for mempolicy ooms
*
* Dumps the current memory state of all eligible tasks. Tasks not in the same
* memcg, not in the same cpuset, or bound to a disjoint set of mempolicy nodes
* are not shown.
* State information includes task's pid, uid, tgid, vm size, rss, nr_ptes,
* swapents, oom_score_adj value, and name.
*/
static void dump_tasks(const struct mem_cgroup *memcg, const nodemask_t *nodemask)
{
struct task_struct *p;
struct task_struct *task;
pr_info("[ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name\n");
rcu_read_lock();
for_each_process(p) {
if (oom_unkillable_task(p, memcg, nodemask))
continue;
task = find_lock_task_mm(p);
if (!task) {
/*
* This is a kthread or all of p's threads have already
* detached their mm's. There's no need to report
* them; they can't be oom killed anyway.
*/
continue;
}
pr_info("[%5d] %5d %5d %8lu %8lu %7ld %8lu %5hd %s\n",
task->pid, from_kuid(&init_user_ns, task_uid(task)),
task->tgid, task->mm->total_vm, get_mm_rss(task->mm),
atomic_long_read(&task->mm->nr_ptes),
get_mm_counter(task->mm, MM_SWAPENTS),
task->signal->oom_score_adj, task->comm);
task_unlock(task);
}
rcu_read_unlock();
}
static BLOCKING_NOTIFIER_HEAD(oomdebug_notify_list);
int register_oomdebug_notifier(struct notifier_block *nb)
{
return blocking_notifier_chain_register(&oomdebug_notify_list, nb);
}
EXPORT_SYMBOL_GPL(register_oomdebug_notifier);
int unregister_oomdebug_notifier(struct notifier_block *nb)
{
return blocking_notifier_chain_unregister(&oomdebug_notify_list, nb);
}
EXPORT_SYMBOL_GPL(unregister_oomdebug_notifier);
static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
struct mem_cgroup *memcg, const nodemask_t *nodemask)
{
struct zone *zone;
struct zoneref *z;
int alloc_flags = 0;
int ret = 0;
task_lock(current);
pr_warning("%s invoked oom-killer: gfp_mask=0x%x, order=%d, "
"oom_score_adj=%hd\n",
current->comm, gfp_mask, order,
current->signal->oom_score_adj);
cpuset_print_task_mems_allowed(current);
task_unlock(current);
dump_stack();
if (memcg)
mem_cgroup_print_oom_info(memcg, p);
else {
show_mem(SHOW_MEM_FILTER_NODES);
#ifdef CONFIG_CMA
if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
alloc_flags |= ALLOC_CMA;
#endif
for_each_zone_zonelist_nodemask(zone, z, node_zonelist(0,
gfp_mask), gfp_zone(gfp_mask), NULL) {
pr_info("%s: HIGH wmark ok? (%s)"
" LOW wmark ok? %s)"
" MIN wmark ok? (%s)"
" compaction_suitable %lu\n", zone->name,
(zone_watermark_ok(zone, order,
high_wmark_pages(zone), 0, alloc_flags) ? "yes" : "no"),
(zone_watermark_ok(zone, order,
low_wmark_pages(zone), 0, alloc_flags) ? "yes" : "no"),
(zone_watermark_ok(zone, order,
min_wmark_pages(zone), 0, alloc_flags) ? "yes" : "no"),
compaction_suitable(zone, order));
}
}
if (sysctl_oom_dump_tasks)
dump_tasks(memcg, nodemask);
blocking_notifier_call_chain(&oomdebug_notify_list, 0, &ret);
}
/*
* Number of OOM killer invocations (including memcg OOM killer).
* Primarily used by PM freezer to check for potential races with
* OOM killed frozen task.
*/
static atomic_t oom_kills = ATOMIC_INIT(0);
int oom_kills_count(void)
{
return atomic_read(&oom_kills);
}
void note_oom_kill(void)
{
atomic_inc(&oom_kills);
}
#define K(x) ((x) << (PAGE_SHIFT-10))
/*
* Must be called while holding a reference to p, which will be released upon
* returning.
*/
void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
unsigned int points, unsigned long totalpages,
struct mem_cgroup *memcg, nodemask_t *nodemask,
const char *message)
{
struct task_struct *victim = p;
struct task_struct *child;
struct task_struct *t;
struct mm_struct *mm;
unsigned int victim_points = 0;
static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
DEFAULT_RATELIMIT_BURST);
/*
* If the task is already exiting, don't alarm the sysadmin or kill
* its children or threads, just set TIF_MEMDIE so it can die quickly
*/
if (p->flags & PF_EXITING) {
set_tsk_thread_flag(p, TIF_MEMDIE);
put_task_struct(p);
return;
}
if (__ratelimit(&oom_rs))
dump_header(p, gfp_mask, order, memcg, nodemask);
task_lock(p);
pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
message, task_pid_nr(p), p->comm, points);
task_unlock(p);
/*
* If any of p's children has a different mm and is eligible for kill,
* the one with the highest oom_badness() score is sacrificed for its
* parent. This attempts to lose the minimal amount of work done while
* still freeing memory.
*/
read_lock(&tasklist_lock);
for_each_thread(p, t) {
list_for_each_entry(child, &t->children, sibling) {
unsigned int child_points;
if (child->mm == p->mm)
continue;
/*
* oom_badness() returns 0 if the thread is unkillable
*/
child_points = oom_badness(child, memcg, nodemask,
totalpages);
if (child_points > victim_points) {
put_task_struct(victim);
victim = child;
victim_points = child_points;
get_task_struct(victim);
}
}
}
read_unlock(&tasklist_lock);
p = find_lock_task_mm(victim);
if (!p) {
put_task_struct(victim);
return;
} else if (victim != p) {
get_task_struct(p);
put_task_struct(victim);
victim = p;
}
/* mm cannot safely be dereferenced after task_unlock(victim) */
mm = victim->mm;
pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
K(get_mm_counter(victim->mm, MM_ANONPAGES)),
K(get_mm_counter(victim->mm, MM_FILEPAGES)));
task_unlock(victim);
/*
* Kill all user processes sharing victim->mm in other thread groups, if
* any. They don't get access to memory reserves, though, to avoid
* depletion of all memory. This prevents mm->mmap_sem livelock when an
* oom killed thread cannot exit because it requires the semaphore and
* its contended by another thread trying to allocate memory itself.
* That thread will now get access to memory reserves since it has a
* pending fatal signal.
*/
rcu_read_lock();
for_each_process(p)
if (p->mm == mm && !same_thread_group(p, victim) &&
!(p->flags & PF_KTHREAD)) {
if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
continue;
task_lock(p); /* Protect ->comm from prctl() */
pr_err("Kill process %d (%s) sharing same memory\n",
task_pid_nr(p), p->comm);
task_unlock(p);
do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
}
rcu_read_unlock();
set_tsk_thread_flag(victim, TIF_MEMDIE);
do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
put_task_struct(victim);
}
#undef K
/*
* Determines whether the kernel must panic because of the panic_on_oom sysctl.
*/
void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
int order, const nodemask_t *nodemask)
{
if (likely(!sysctl_panic_on_oom))
return;
if (sysctl_panic_on_oom != 2) {
/*
* panic_on_oom == 1 only affects CONSTRAINT_NONE, the kernel
* does not panic for cpuset, mempolicy, or memcg allocation
* failures.
*/
if (constraint != CONSTRAINT_NONE)
return;
}
dump_header(NULL, gfp_mask, order, NULL, nodemask);
panic("Out of memory: %s panic_on_oom is enabled\n",
sysctl_panic_on_oom == 2 ? "compulsory" : "system-wide");
}
static BLOCKING_NOTIFIER_HEAD(oom_notify_list);
int register_oom_notifier(struct notifier_block *nb)
{
return blocking_notifier_chain_register(&oom_notify_list, nb);
}
EXPORT_SYMBOL_GPL(register_oom_notifier);
int unregister_oom_notifier(struct notifier_block *nb)
{
return blocking_notifier_chain_unregister(&oom_notify_list, nb);
}
EXPORT_SYMBOL_GPL(unregister_oom_notifier);
/*
* Try to acquire the OOM killer lock for the zones in zonelist. Returns zero
* if a parallel OOM killing is already taking place that includes a zone in
* the zonelist. Otherwise, locks all zones in the zonelist and returns 1.
*/
bool oom_zonelist_trylock(struct zonelist *zonelist, gfp_t gfp_mask)
{
struct zoneref *z;
struct zone *zone;
bool ret = true;
spin_lock(&zone_scan_lock);
for_each_zone_zonelist(zone, z, zonelist, gfp_zone(gfp_mask))
if (test_bit(ZONE_OOM_LOCKED, &zone->flags)) {
ret = false;
goto out;
}
/*
* Lock each zone in the zonelist under zone_scan_lock so a parallel
* call to oom_zonelist_trylock() doesn't succeed when it shouldn't.
*/
for_each_zone_zonelist(zone, z, zonelist, gfp_zone(gfp_mask))
set_bit(ZONE_OOM_LOCKED, &zone->flags);
out:
spin_unlock(&zone_scan_lock);
return ret;
}
/*
* Clears the ZONE_OOM_LOCKED flag for all zones in the zonelist so that failed
* allocation attempts with zonelists containing them may now recall the OOM
* killer, if necessary.
*/
void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask)
{
struct zoneref *z;
struct zone *zone;
spin_lock(&zone_scan_lock);
for_each_zone_zonelist(zone, z, zonelist, gfp_zone(gfp_mask))
clear_bit(ZONE_OOM_LOCKED, &zone->flags);
spin_unlock(&zone_scan_lock);
}
/**
* out_of_memory - kill the "best" process when we run out of memory
* @zonelist: zonelist pointer
* @gfp_mask: memory allocation flags
* @order: amount of memory being requested as a power of 2
* @nodemask: nodemask passed to page allocator
* @force_kill: true if a task must be killed, even if others are exiting
*
* If we run out of memory, we have the choice between either
* killing a random task (bad), letting the system crash (worse)
* OR try to be smart about which process to kill. Note that we
* don't have to be perfect here, we just have to be good.
*/
void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
int order, nodemask_t *nodemask, bool force_kill)
{
const nodemask_t *mpol_mask;
struct task_struct *p;
unsigned long totalpages;
unsigned long freed = 0;
unsigned int uninitialized_var(points);
enum oom_constraint constraint = CONSTRAINT_NONE;
int killed = 0;
blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
if (freed > 0)
/* Got some memory back in the last second. */
return;
/*
* If current has a pending SIGKILL or is exiting, then automatically
* select it. The goal is to allow it to allocate so that it may
* quickly exit and free its memory.
*/
if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
set_thread_flag(TIF_MEMDIE);
return;
}
/*
* Check if there were limitations on the allocation (only relevant for
* NUMA) that may require different handling.
*/
constraint = constrained_alloc(zonelist, gfp_mask, nodemask,
&totalpages);
mpol_mask = (constraint == CONSTRAINT_MEMORY_POLICY) ? nodemask : NULL;
check_panic_on_oom(constraint, gfp_mask, order, mpol_mask);
if (sysctl_oom_kill_allocating_task && current->mm &&
!oom_unkillable_task(current, NULL, nodemask) &&
current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
get_task_struct(current);
oom_kill_process(current, gfp_mask, order, 0, totalpages, NULL,
nodemask,
"Out of memory (oom_kill_allocating_task)");
goto out;
}
p = select_bad_process(&points, totalpages, mpol_mask, force_kill);
/* Found nothing?!?! Either we hang forever, or we panic. */
if (!p) {
dump_header(NULL, gfp_mask, order, NULL, mpol_mask);
panic("Out of memory and no killable processes...\n");
}
if (p != (void *)-1UL) {
oom_kill_process(p, gfp_mask, order, points, totalpages, NULL,
nodemask, "Out of memory");
killed = 1;
}
out:
/*
* Give the killed threads a good chance of exiting before trying to
* allocate memory again.
*/
if (killed)
schedule_timeout_killable(1);
}
/*
* The pagefault handler calls here because it is out of memory, so kill a
* memory-hogging task. If any populated zone has ZONE_OOM_LOCKED set, a
* parallel oom killing is already in progress so do nothing.
*/
void pagefault_out_of_memory(void)
{
struct zonelist *zonelist;
if (mem_cgroup_oom_synchronize(true))
return;
zonelist = node_zonelist(first_memory_node, GFP_KERNEL);
if (oom_zonelist_trylock(zonelist, GFP_KERNEL)) {
out_of_memory(NULL, 0, 0, NULL, false);
oom_zonelist_unlock(zonelist, GFP_KERNEL);
}
}

2420
mm/page-writeback.c Normal file

File diff suppressed because it is too large Load diff

6666
mm/page_alloc.c Normal file

File diff suppressed because it is too large Load diff

530
mm/page_cgroup.c Normal file
View file

@ -0,0 +1,530 @@
#include <linux/mm.h>
#include <linux/mmzone.h>
#include <linux/bootmem.h>
#include <linux/bit_spinlock.h>
#include <linux/page_cgroup.h>
#include <linux/hash.h>
#include <linux/slab.h>
#include <linux/memory.h>
#include <linux/vmalloc.h>
#include <linux/cgroup.h>
#include <linux/swapops.h>
#include <linux/kmemleak.h>
static unsigned long total_usage;
#if !defined(CONFIG_SPARSEMEM)
void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
{
pgdat->node_page_cgroup = NULL;
}
struct page_cgroup *lookup_page_cgroup(struct page *page)
{
unsigned long pfn = page_to_pfn(page);
unsigned long offset;
struct page_cgroup *base;
base = NODE_DATA(page_to_nid(page))->node_page_cgroup;
#ifdef CONFIG_DEBUG_VM
/*
* The sanity checks the page allocator does upon freeing a
* page can reach here before the page_cgroup arrays are
* allocated when feeding a range of pages to the allocator
* for the first time during bootup or memory hotplug.
*/
if (unlikely(!base))
return NULL;
#endif
offset = pfn - NODE_DATA(page_to_nid(page))->node_start_pfn;
return base + offset;
}
static int __init alloc_node_page_cgroup(int nid)
{
struct page_cgroup *base;
unsigned long table_size;
unsigned long nr_pages;
nr_pages = NODE_DATA(nid)->node_spanned_pages;
if (!nr_pages)
return 0;
table_size = sizeof(struct page_cgroup) * nr_pages;
base = memblock_virt_alloc_try_nid_nopanic(
table_size, PAGE_SIZE, __pa(MAX_DMA_ADDRESS),
BOOTMEM_ALLOC_ACCESSIBLE, nid);
if (!base)
return -ENOMEM;
NODE_DATA(nid)->node_page_cgroup = base;
total_usage += table_size;
return 0;
}
void __init page_cgroup_init_flatmem(void)
{
int nid, fail;
if (mem_cgroup_disabled())
return;
for_each_online_node(nid) {
fail = alloc_node_page_cgroup(nid);
if (fail)
goto fail;
}
printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
printk(KERN_INFO "please try 'cgroup_disable=memory' option if you"
" don't want memory cgroups\n");
return;
fail:
printk(KERN_CRIT "allocation of page_cgroup failed.\n");
printk(KERN_CRIT "please try 'cgroup_disable=memory' boot option\n");
panic("Out of memory");
}
#else /* CONFIG_FLAT_NODE_MEM_MAP */
struct page_cgroup *lookup_page_cgroup(struct page *page)
{
unsigned long pfn = page_to_pfn(page);
struct mem_section *section = __pfn_to_section(pfn);
#ifdef CONFIG_DEBUG_VM
/*
* The sanity checks the page allocator does upon freeing a
* page can reach here before the page_cgroup arrays are
* allocated when feeding a range of pages to the allocator
* for the first time during bootup or memory hotplug.
*/
if (!section->page_cgroup)
return NULL;
#endif
return section->page_cgroup + pfn;
}
static void *__meminit alloc_page_cgroup(size_t size, int nid)
{
gfp_t flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN;
void *addr = NULL;
addr = alloc_pages_exact_nid(nid, size, flags);
if (addr) {
kmemleak_alloc(addr, size, 1, flags);
return addr;
}
if (node_state(nid, N_HIGH_MEMORY))
addr = vzalloc_node(size, nid);
else
addr = vzalloc(size);
return addr;
}
static int __meminit init_section_page_cgroup(unsigned long pfn, int nid)
{
struct mem_section *section;
struct page_cgroup *base;
unsigned long table_size;
section = __pfn_to_section(pfn);
if (section->page_cgroup)
return 0;
table_size = sizeof(struct page_cgroup) * PAGES_PER_SECTION;
base = alloc_page_cgroup(table_size, nid);
/*
* The value stored in section->page_cgroup is (base - pfn)
* and it does not point to the memory block allocated above,
* causing kmemleak false positives.
*/
kmemleak_not_leak(base);
if (!base) {
printk(KERN_ERR "page cgroup allocation failure\n");
return -ENOMEM;
}
/*
* The passed "pfn" may not be aligned to SECTION. For the calculation
* we need to apply a mask.
*/
pfn &= PAGE_SECTION_MASK;
section->page_cgroup = base - pfn;
total_usage += table_size;
return 0;
}
#ifdef CONFIG_MEMORY_HOTPLUG
static void free_page_cgroup(void *addr)
{
if (is_vmalloc_addr(addr)) {
vfree(addr);
} else {
struct page *page = virt_to_page(addr);
size_t table_size =
sizeof(struct page_cgroup) * PAGES_PER_SECTION;
BUG_ON(PageReserved(page));
kmemleak_free(addr);
free_pages_exact(addr, table_size);
}
}
static void __free_page_cgroup(unsigned long pfn)
{
struct mem_section *ms;
struct page_cgroup *base;
ms = __pfn_to_section(pfn);
if (!ms || !ms->page_cgroup)
return;
base = ms->page_cgroup + pfn;
free_page_cgroup(base);
ms->page_cgroup = NULL;
}
static int __meminit online_page_cgroup(unsigned long start_pfn,
unsigned long nr_pages,
int nid)
{
unsigned long start, end, pfn;
int fail = 0;
start = SECTION_ALIGN_DOWN(start_pfn);
end = SECTION_ALIGN_UP(start_pfn + nr_pages);
if (nid == -1) {
/*
* In this case, "nid" already exists and contains valid memory.
* "start_pfn" passed to us is a pfn which is an arg for
* online__pages(), and start_pfn should exist.
*/
nid = pfn_to_nid(start_pfn);
VM_BUG_ON(!node_state(nid, N_ONLINE));
}
for (pfn = start; !fail && pfn < end; pfn += PAGES_PER_SECTION) {
if (!pfn_present(pfn))
continue;
fail = init_section_page_cgroup(pfn, nid);
}
if (!fail)
return 0;
/* rollback */
for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION)
__free_page_cgroup(pfn);
return -ENOMEM;
}
static int __meminit offline_page_cgroup(unsigned long start_pfn,
unsigned long nr_pages, int nid)
{
unsigned long start, end, pfn;
start = SECTION_ALIGN_DOWN(start_pfn);
end = SECTION_ALIGN_UP(start_pfn + nr_pages);
for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION)
__free_page_cgroup(pfn);
return 0;
}
static int __meminit page_cgroup_callback(struct notifier_block *self,
unsigned long action, void *arg)
{
struct memory_notify *mn = arg;
int ret = 0;
switch (action) {
case MEM_GOING_ONLINE:
ret = online_page_cgroup(mn->start_pfn,
mn->nr_pages, mn->status_change_nid);
break;
case MEM_OFFLINE:
offline_page_cgroup(mn->start_pfn,
mn->nr_pages, mn->status_change_nid);
break;
case MEM_CANCEL_ONLINE:
offline_page_cgroup(mn->start_pfn,
mn->nr_pages, mn->status_change_nid);
break;
case MEM_GOING_OFFLINE:
break;
case MEM_ONLINE:
case MEM_CANCEL_OFFLINE:
break;
}
return notifier_from_errno(ret);
}
#endif
void __init page_cgroup_init(void)
{
unsigned long pfn;
int nid;
if (mem_cgroup_disabled())
return;
for_each_node_state(nid, N_MEMORY) {
unsigned long start_pfn, end_pfn;
start_pfn = node_start_pfn(nid);
end_pfn = node_end_pfn(nid);
/*
* start_pfn and end_pfn may not be aligned to SECTION and the
* page->flags of out of node pages are not initialized. So we
* scan [start_pfn, the biggest section's pfn < end_pfn) here.
*/
for (pfn = start_pfn;
pfn < end_pfn;
pfn = ALIGN(pfn + 1, PAGES_PER_SECTION)) {
if (!pfn_valid(pfn))
continue;
/*
* Nodes's pfns can be overlapping.
* We know some arch can have a nodes layout such as
* -------------pfn-------------->
* N0 | N1 | N2 | N0 | N1 | N2|....
*/
if (pfn_to_nid(pfn) != nid)
continue;
if (init_section_page_cgroup(pfn, nid))
goto oom;
}
}
hotplug_memory_notifier(page_cgroup_callback, 0);
printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
printk(KERN_INFO "please try 'cgroup_disable=memory' option if you "
"don't want memory cgroups\n");
return;
oom:
printk(KERN_CRIT "try 'cgroup_disable=memory' boot option\n");
panic("Out of memory");
}
void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
{
return;
}
#endif
#ifdef CONFIG_MEMCG_SWAP
static DEFINE_MUTEX(swap_cgroup_mutex);
struct swap_cgroup_ctrl {
struct page **map;
unsigned long length;
spinlock_t lock;
};
static struct swap_cgroup_ctrl swap_cgroup_ctrl[MAX_SWAPFILES];
struct swap_cgroup {
unsigned short id;
};
#define SC_PER_PAGE (PAGE_SIZE/sizeof(struct swap_cgroup))
/*
* SwapCgroup implements "lookup" and "exchange" operations.
* In typical usage, this swap_cgroup is accessed via memcg's charge/uncharge
* against SwapCache. At swap_free(), this is accessed directly from swap.
*
* This means,
* - we have no race in "exchange" when we're accessed via SwapCache because
* SwapCache(and its swp_entry) is under lock.
* - When called via swap_free(), there is no user of this entry and no race.
* Then, we don't need lock around "exchange".
*
* TODO: we can push these buffers out to HIGHMEM.
*/
/*
* allocate buffer for swap_cgroup.
*/
static int swap_cgroup_prepare(int type)
{
struct page *page;
struct swap_cgroup_ctrl *ctrl;
unsigned long idx, max;
ctrl = &swap_cgroup_ctrl[type];
for (idx = 0; idx < ctrl->length; idx++) {
page = alloc_page(GFP_KERNEL | __GFP_ZERO);
if (!page)
goto not_enough_page;
ctrl->map[idx] = page;
}
return 0;
not_enough_page:
max = idx;
for (idx = 0; idx < max; idx++)
__free_page(ctrl->map[idx]);
return -ENOMEM;
}
static struct swap_cgroup *lookup_swap_cgroup(swp_entry_t ent,
struct swap_cgroup_ctrl **ctrlp)
{
pgoff_t offset = swp_offset(ent);
struct swap_cgroup_ctrl *ctrl;
struct page *mappage;
struct swap_cgroup *sc;
ctrl = &swap_cgroup_ctrl[swp_type(ent)];
if (ctrlp)
*ctrlp = ctrl;
mappage = ctrl->map[offset / SC_PER_PAGE];
sc = page_address(mappage);
return sc + offset % SC_PER_PAGE;
}
/**
* swap_cgroup_cmpxchg - cmpxchg mem_cgroup's id for this swp_entry.
* @ent: swap entry to be cmpxchged
* @old: old id
* @new: new id
*
* Returns old id at success, 0 at failure.
* (There is no mem_cgroup using 0 as its id)
*/
unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
unsigned short old, unsigned short new)
{
struct swap_cgroup_ctrl *ctrl;
struct swap_cgroup *sc;
unsigned long flags;
unsigned short retval;
sc = lookup_swap_cgroup(ent, &ctrl);
spin_lock_irqsave(&ctrl->lock, flags);
retval = sc->id;
if (retval == old)
sc->id = new;
else
retval = 0;
spin_unlock_irqrestore(&ctrl->lock, flags);
return retval;
}
/**
* swap_cgroup_record - record mem_cgroup for this swp_entry.
* @ent: swap entry to be recorded into
* @id: mem_cgroup to be recorded
*
* Returns old value at success, 0 at failure.
* (Of course, old value can be 0.)
*/
unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id)
{
struct swap_cgroup_ctrl *ctrl;
struct swap_cgroup *sc;
unsigned short old;
unsigned long flags;
sc = lookup_swap_cgroup(ent, &ctrl);
spin_lock_irqsave(&ctrl->lock, flags);
old = sc->id;
sc->id = id;
spin_unlock_irqrestore(&ctrl->lock, flags);
return old;
}
/**
* lookup_swap_cgroup_id - lookup mem_cgroup id tied to swap entry
* @ent: swap entry to be looked up.
*
* Returns ID of mem_cgroup at success. 0 at failure. (0 is invalid ID)
*/
unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
{
return lookup_swap_cgroup(ent, NULL)->id;
}
int swap_cgroup_swapon(int type, unsigned long max_pages)
{
void *array;
unsigned long array_size;
unsigned long length;
struct swap_cgroup_ctrl *ctrl;
if (!do_swap_account)
return 0;
length = DIV_ROUND_UP(max_pages, SC_PER_PAGE);
array_size = length * sizeof(void *);
array = vzalloc(array_size);
if (!array)
goto nomem;
ctrl = &swap_cgroup_ctrl[type];
mutex_lock(&swap_cgroup_mutex);
ctrl->length = length;
ctrl->map = array;
spin_lock_init(&ctrl->lock);
if (swap_cgroup_prepare(type)) {
/* memory shortage */
ctrl->map = NULL;
ctrl->length = 0;
mutex_unlock(&swap_cgroup_mutex);
vfree(array);
goto nomem;
}
mutex_unlock(&swap_cgroup_mutex);
return 0;
nomem:
printk(KERN_INFO "couldn't allocate enough memory for swap_cgroup.\n");
printk(KERN_INFO
"swap_cgroup can be disabled by swapaccount=0 boot option\n");
return -ENOMEM;
}
void swap_cgroup_swapoff(int type)
{
struct page **map;
unsigned long i, length;
struct swap_cgroup_ctrl *ctrl;
if (!do_swap_account)
return;
mutex_lock(&swap_cgroup_mutex);
ctrl = &swap_cgroup_ctrl[type];
map = ctrl->map;
length = ctrl->length;
ctrl->map = NULL;
ctrl->length = 0;
mutex_unlock(&swap_cgroup_mutex);
if (map) {
for (i = 0; i < length; i++) {
struct page *page = map[i];
if (page)
__free_page(page);
}
vfree(map);
}
}
#endif

397
mm/page_io.c Normal file
View file

@ -0,0 +1,397 @@
/*
* linux/mm/page_io.c
*
* Copyright (C) 1991, 1992, 1993, 1994 Linus Torvalds
*
* Swap reorganised 29.12.95,
* Asynchronous swapping added 30.12.95. Stephen Tweedie
* Removed race in async swapping. 14.4.1996. Bruno Haible
* Add swap of shared pages through the page cache. 20.2.1998. Stephen Tweedie
* Always use brw_page, life becomes simpler. 12 May 1998 Eric Biederman
*/
#include <linux/mm.h>
#include <linux/kernel_stat.h>
#include <linux/gfp.h>
#include <linux/pagemap.h>
#include <linux/swap.h>
#include <linux/bio.h>
#include <linux/swapops.h>
#include <linux/buffer_head.h>
#include <linux/writeback.h>
#include <linux/frontswap.h>
#include <linux/aio.h>
#include <linux/blkdev.h>
#include <asm/pgtable.h>
static struct bio *get_swap_bio(gfp_t gfp_flags,
struct page *page, bio_end_io_t end_io)
{
struct bio *bio;
bio = bio_alloc(gfp_flags, 1);
if (bio) {
bio->bi_iter.bi_sector = map_swap_page(page, &bio->bi_bdev);
bio->bi_iter.bi_sector <<= PAGE_SHIFT - 9;
bio->bi_io_vec[0].bv_page = page;
bio->bi_io_vec[0].bv_len = PAGE_SIZE;
bio->bi_io_vec[0].bv_offset = 0;
bio->bi_vcnt = 1;
bio->bi_iter.bi_size = PAGE_SIZE;
bio->bi_end_io = end_io;
}
return bio;
}
void end_swap_bio_write(struct bio *bio, int err)
{
const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct page *page = bio->bi_io_vec[0].bv_page;
if (!uptodate) {
SetPageError(page);
/*
* We failed to write the page out to swap-space.
* Re-dirty the page in order to avoid it being reclaimed.
* Also print a dire warning that things will go BAD (tm)
* very quickly.
*
* Also clear PG_reclaim to avoid rotate_reclaimable_page()
*/
set_page_dirty(page);
#ifndef CONFIG_VNSWAP
printk(KERN_ALERT "Write-error on swap-device (%u:%u:%Lu)\n",
imajor(bio->bi_bdev->bd_inode),
iminor(bio->bi_bdev->bd_inode),
(unsigned long long)bio->bi_iter.bi_sector);
#endif
ClearPageReclaim(page);
}
end_page_writeback(page);
bio_put(bio);
}
void end_swap_bio_read(struct bio *bio, int err)
{
const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct page *page = bio->bi_io_vec[0].bv_page;
if (!uptodate) {
SetPageError(page);
ClearPageUptodate(page);
printk(KERN_ALERT "Read-error on swap-device (%u:%u:%Lu)\n",
imajor(bio->bi_bdev->bd_inode),
iminor(bio->bi_bdev->bd_inode),
(unsigned long long)bio->bi_iter.bi_sector);
goto out;
}
SetPageUptodate(page);
/*
* There is no guarantee that the page is in swap cache - the software
* suspend code (at least) uses end_swap_bio_read() against a non-
* swapcache page. So we must check PG_swapcache before proceeding with
* this optimization.
*/
if (likely(PageSwapCache(page))) {
struct swap_info_struct *sis;
sis = page_swap_info(page);
if (sis->flags & SWP_BLKDEV) {
/*
* The swap subsystem performs lazy swap slot freeing,
* expecting that the page will be swapped out again.
* So we can avoid an unnecessary write if the page
* isn't redirtied.
* This is good for real swap storage because we can
* reduce unnecessary I/O and enhance wear-leveling
* if an SSD is used as the as swap device.
* But if in-memory swap device (eg zram) is used,
* this causes a duplicated copy between uncompressed
* data in VM-owned memory and compressed data in
* zram-owned memory. So let's free zram-owned memory
* and make the VM-owned decompressed page *dirty*,
* so the page should be swapped out somewhere again if
* we again wish to reclaim it.
*/
struct gendisk *disk = sis->bdev->bd_disk;
if (disk->fops->swap_slot_free_notify) {
swp_entry_t entry;
unsigned long offset;
entry.val = page_private(page);
offset = swp_offset(entry);
SetPageDirty(page);
disk->fops->swap_slot_free_notify(sis->bdev,
offset);
}
}
}
out:
unlock_page(page);
bio_put(bio);
}
int generic_swapfile_activate(struct swap_info_struct *sis,
struct file *swap_file,
sector_t *span)
{
struct address_space *mapping = swap_file->f_mapping;
struct inode *inode = mapping->host;
unsigned blocks_per_page;
unsigned long page_no;
unsigned blkbits;
sector_t probe_block;
sector_t last_block;
sector_t lowest_block = -1;
sector_t highest_block = 0;
int nr_extents = 0;
int ret;
blkbits = inode->i_blkbits;
blocks_per_page = PAGE_SIZE >> blkbits;
/*
* Map all the blocks into the extent list. This code doesn't try
* to be very smart.
*/
probe_block = 0;
page_no = 0;
last_block = i_size_read(inode) >> blkbits;
while ((probe_block + blocks_per_page) <= last_block &&
page_no < sis->max) {
unsigned block_in_page;
sector_t first_block;
first_block = bmap(inode, probe_block);
if (first_block == 0)
goto bad_bmap;
/*
* It must be PAGE_SIZE aligned on-disk
*/
if (first_block & (blocks_per_page - 1)) {
probe_block++;
goto reprobe;
}
for (block_in_page = 1; block_in_page < blocks_per_page;
block_in_page++) {
sector_t block;
block = bmap(inode, probe_block + block_in_page);
if (block == 0)
goto bad_bmap;
if (block != first_block + block_in_page) {
/* Discontiguity */
probe_block++;
goto reprobe;
}
}
first_block >>= (PAGE_SHIFT - blkbits);
if (page_no) { /* exclude the header page */
if (first_block < lowest_block)
lowest_block = first_block;
if (first_block > highest_block)
highest_block = first_block;
}
/*
* We found a PAGE_SIZE-length, PAGE_SIZE-aligned run of blocks
*/
ret = add_swap_extent(sis, page_no, 1, first_block);
if (ret < 0)
goto out;
nr_extents += ret;
page_no++;
probe_block += blocks_per_page;
reprobe:
continue;
}
ret = nr_extents;
*span = 1 + highest_block - lowest_block;
if (page_no == 0)
page_no = 1; /* force Empty message */
sis->max = page_no;
sis->pages = page_no - 1;
sis->highest_bit = page_no - 1;
out:
return ret;
bad_bmap:
printk(KERN_ERR "swapon: swapfile has holes\n");
ret = -EINVAL;
goto out;
}
/*
* We may have stale swap cache pages in memory: notice
* them here and get rid of the unnecessary final write.
*/
int swap_writepage(struct page *page, struct writeback_control *wbc)
{
int ret = 0;
if (try_to_free_swap(page)) {
unlock_page(page);
goto out;
}
if (frontswap_store(page) == 0) {
set_page_writeback(page);
unlock_page(page);
end_page_writeback(page);
goto out;
}
#ifdef CONFIG_VNSWAP
set_page_dirty(page);
ClearPageReclaim(page);
unlock_page(page);
#else
ret = __swap_writepage(page, wbc, end_swap_bio_write);
#endif
out:
return ret;
}
static sector_t swap_page_sector(struct page *page)
{
return (sector_t)__page_file_index(page) << (PAGE_CACHE_SHIFT - 9);
}
int __swap_writepage(struct page *page, struct writeback_control *wbc,
void (*end_write_func)(struct bio *, int))
{
struct bio *bio;
int ret, rw = WRITE;
struct swap_info_struct *sis = page_swap_info(page);
if (sis->flags & SWP_FILE) {
struct kiocb kiocb;
struct file *swap_file = sis->swap_file;
struct address_space *mapping = swap_file->f_mapping;
struct bio_vec bv = {
.bv_page = page,
.bv_len = PAGE_SIZE,
.bv_offset = 0
};
struct iov_iter from = {
.type = ITER_BVEC | WRITE,
.count = PAGE_SIZE,
.iov_offset = 0,
.nr_segs = 1,
};
from.bvec = &bv; /* older gcc versions are broken */
init_sync_kiocb(&kiocb, swap_file);
kiocb.ki_pos = page_file_offset(page);
kiocb.ki_nbytes = PAGE_SIZE;
set_page_writeback(page);
unlock_page(page);
ret = mapping->a_ops->direct_IO(ITER_BVEC | WRITE,
&kiocb, &from,
kiocb.ki_pos);
if (ret == PAGE_SIZE) {
count_vm_event(PSWPOUT);
ret = 0;
} else {
/*
* In the case of swap-over-nfs, this can be a
* temporary failure if the system has limited
* memory for allocating transmit buffers.
* Mark the page dirty and avoid
* rotate_reclaimable_page but rate-limit the
* messages but do not flag PageError like
* the normal direct-to-bio case as it could
* be temporary.
*/
set_page_dirty(page);
ClearPageReclaim(page);
pr_err_ratelimited("Write error on dio swapfile (%Lu)\n",
page_file_offset(page));
}
end_page_writeback(page);
return ret;
}
ret = bdev_write_page(sis->bdev, swap_page_sector(page), page, wbc);
if (!ret) {
count_vm_event(PSWPOUT);
return 0;
}
ret = 0;
bio = get_swap_bio(GFP_NOIO, page, end_write_func);
if (bio == NULL) {
set_page_dirty(page);
unlock_page(page);
ret = -ENOMEM;
goto out;
}
if (wbc->sync_mode == WB_SYNC_ALL)
rw |= REQ_SYNC;
count_vm_event(PSWPOUT);
set_page_writeback(page);
unlock_page(page);
submit_bio(rw, bio);
out:
return ret;
}
int swap_readpage(struct page *page)
{
struct bio *bio;
int ret = 0;
struct swap_info_struct *sis = page_swap_info(page);
VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(PageUptodate(page), page);
if (frontswap_load(page) == 0) {
SetPageUptodate(page);
unlock_page(page);
goto out;
}
if (sis->flags & SWP_FILE) {
struct file *swap_file = sis->swap_file;
struct address_space *mapping = swap_file->f_mapping;
ret = mapping->a_ops->readpage(swap_file, page);
if (!ret)
count_vm_event(PSWPIN);
return ret;
}
ret = bdev_read_page(sis->bdev, swap_page_sector(page), page);
if (!ret) {
count_vm_event(PSWPIN);
return 0;
}
ret = 0;
bio = get_swap_bio(GFP_KERNEL, page, end_swap_bio_read);
if (bio == NULL) {
unlock_page(page);
ret = -ENOMEM;
goto out;
}
count_vm_event(PSWPIN);
submit_bio(READ, bio);
out:
return ret;
}
int swap_set_page_dirty(struct page *page)
{
struct swap_info_struct *sis = page_swap_info(page);
if (sis->flags & SWP_FILE) {
struct address_space *mapping = sis->swap_file->f_mapping;
return mapping->a_ops->set_page_dirty(page);
} else {
return __set_page_dirty_no_writeback(page);
}
}

313
mm/page_isolation.c Normal file
View file

@ -0,0 +1,313 @@
/*
* linux/mm/page_isolation.c
*/
#include <linux/mm.h>
#include <linux/page-isolation.h>
#include <linux/pageblock-flags.h>
#include <linux/memory.h>
#include <linux/hugetlb.h>
#include "internal.h"
int set_migratetype_isolate(struct page *page, bool skip_hwpoisoned_pages)
{
struct zone *zone;
unsigned long flags, pfn;
struct memory_isolate_notify arg;
int notifier_ret;
int ret = -EBUSY;
zone = page_zone(page);
spin_lock_irqsave(&zone->lock, flags);
pfn = page_to_pfn(page);
arg.start_pfn = pfn;
arg.nr_pages = pageblock_nr_pages;
arg.pages_found = 0;
/*
* It may be possible to isolate a pageblock even if the
* migratetype is not MIGRATE_MOVABLE. The memory isolation
* notifier chain is used by balloon drivers to return the
* number of pages in a range that are held by the balloon
* driver to shrink memory. If all the pages are accounted for
* by balloons, are free, or on the LRU, isolation can continue.
* Later, for example, when memory hotplug notifier runs, these
* pages reported as "can be isolated" should be isolated(freed)
* by the balloon driver through the memory notifier chain.
*/
notifier_ret = memory_isolate_notify(MEM_ISOLATE_COUNT, &arg);
notifier_ret = notifier_to_errno(notifier_ret);
if (notifier_ret)
goto out;
/*
* FIXME: Now, memory hotplug doesn't call shrink_slab() by itself.
* We just check MOVABLE pages.
*/
if (!has_unmovable_pages(zone, page, arg.pages_found,
skip_hwpoisoned_pages))
ret = 0;
/*
* immobile means "not-on-lru" paes. If immobile is larger than
* removable-by-driver pages reported by notifier, we'll fail.
*/
out:
if (!ret) {
unsigned long nr_pages;
int migratetype = get_pageblock_migratetype(page);
set_pageblock_migratetype(page, MIGRATE_ISOLATE);
zone->nr_isolate_pageblock++;
nr_pages = move_freepages_block(zone, page, MIGRATE_ISOLATE);
__mod_zone_freepage_state(zone, -nr_pages, migratetype);
}
spin_unlock_irqrestore(&zone->lock, flags);
if (!ret)
drain_all_pages();
return ret;
}
void unset_migratetype_isolate(struct page *page, unsigned migratetype)
{
struct zone *zone;
unsigned long flags, nr_pages;
struct page *isolated_page = NULL;
unsigned int order;
unsigned long page_idx, buddy_idx;
struct page *buddy;
zone = page_zone(page);
spin_lock_irqsave(&zone->lock, flags);
if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE)
goto out;
/*
* Because freepage with more than pageblock_order on isolated
* pageblock is restricted to merge due to freepage counting problem,
* it is possible that there is free buddy page.
* move_freepages_block() doesn't care of merge so we need other
* approach in order to merge them. Isolation and free will make
* these pages to be merged.
*/
if (PageBuddy(page)) {
order = page_order(page);
if (order >= pageblock_order) {
page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);
buddy_idx = __find_buddy_index(page_idx, order);
buddy = page + (buddy_idx - page_idx);
if (!is_migrate_isolate_page(buddy)) {
__isolate_free_page(page, order);
kernel_map_pages(page, (1 << order), 1);
set_page_refcounted(page);
isolated_page = page;
}
}
}
/*
* If we isolate freepage with more than pageblock_order, there
* should be no freepage in the range, so we could avoid costly
* pageblock scanning for freepage moving.
*/
if (!isolated_page) {
nr_pages = move_freepages_block(zone, page, migratetype);
__mod_zone_freepage_state(zone, nr_pages, migratetype);
}
set_pageblock_migratetype(page, migratetype);
zone->nr_isolate_pageblock--;
out:
spin_unlock_irqrestore(&zone->lock, flags);
if (isolated_page)
__free_pages(isolated_page, order);
}
static inline struct page *
__first_valid_page(unsigned long pfn, unsigned long nr_pages)
{
int i;
for (i = 0; i < nr_pages; i++)
if (pfn_valid_within(pfn + i))
break;
if (unlikely(i == nr_pages))
return NULL;
return pfn_to_page(pfn + i);
}
/*
* start_isolate_page_range() -- make page-allocation-type of range of pages
* to be MIGRATE_ISOLATE.
* @start_pfn: The lower PFN of the range to be isolated.
* @end_pfn: The upper PFN of the range to be isolated.
* @migratetype: migrate type to set in error recovery.
*
* Making page-allocation-type to be MIGRATE_ISOLATE means free pages in
* the range will never be allocated. Any free pages and pages freed in the
* future will not be allocated again.
*
* start_pfn/end_pfn must be aligned to pageblock_order.
* Returns 0 on success and -EBUSY if any part of range cannot be isolated.
*/
int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
unsigned migratetype, bool skip_hwpoisoned_pages)
{
unsigned long pfn;
unsigned long undo_pfn;
struct page *page;
BUG_ON((start_pfn) & (pageblock_nr_pages - 1));
BUG_ON((end_pfn) & (pageblock_nr_pages - 1));
for (pfn = start_pfn;
pfn < end_pfn;
pfn += pageblock_nr_pages) {
page = __first_valid_page(pfn, pageblock_nr_pages);
if (page &&
set_migratetype_isolate(page, skip_hwpoisoned_pages)) {
undo_pfn = pfn;
goto undo;
}
}
return 0;
undo:
for (pfn = start_pfn;
pfn < undo_pfn;
pfn += pageblock_nr_pages)
unset_migratetype_isolate(pfn_to_page(pfn), migratetype);
return -EBUSY;
}
/*
* Make isolated pages available again.
*/
int undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
unsigned migratetype)
{
unsigned long pfn;
struct page *page;
BUG_ON((start_pfn) & (pageblock_nr_pages - 1));
BUG_ON((end_pfn) & (pageblock_nr_pages - 1));
for (pfn = start_pfn;
pfn < end_pfn;
pfn += pageblock_nr_pages) {
page = __first_valid_page(pfn, pageblock_nr_pages);
if (!page || get_pageblock_migratetype(page) != MIGRATE_ISOLATE)
continue;
unset_migratetype_isolate(page, migratetype);
}
return 0;
}
/*
* Test all pages in the range is free(means isolated) or not.
* all pages in [start_pfn...end_pfn) must be in the same zone.
* zone->lock must be held before call this.
*
* Returns 1 if all pages in the range are isolated.
*/
static int
__test_page_isolated_in_pageblock(unsigned long pfn, unsigned long end_pfn,
bool skip_hwpoisoned_pages)
{
struct page *page;
while (pfn < end_pfn) {
if (!pfn_valid_within(pfn)) {
pfn++;
continue;
}
page = pfn_to_page(pfn);
if (PageBuddy(page)) {
/*
* If race between isolatation and allocation happens,
* some free pages could be in MIGRATE_MOVABLE list
* although pageblock's migratation type of the page
* is MIGRATE_ISOLATE. Catch it and move the page into
* MIGRATE_ISOLATE list.
*/
if (get_freepage_migratetype(page) != MIGRATE_ISOLATE) {
struct page *end_page;
end_page = page + (1 << page_order(page)) - 1;
move_freepages(page_zone(page), page, end_page,
MIGRATE_ISOLATE);
}
pfn += 1 << page_order(page);
}
else if (page_count(page) == 0 &&
get_freepage_migratetype(page) == MIGRATE_ISOLATE)
pfn += 1;
else if (skip_hwpoisoned_pages && PageHWPoison(page)) {
/*
* The HWPoisoned page may be not in buddy
* system, and page_count() is not 0.
*/
pfn++;
continue;
}
else
break;
}
if (pfn < end_pfn)
return 0;
return 1;
}
int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn,
bool skip_hwpoisoned_pages)
{
unsigned long pfn, flags;
struct page *page;
struct zone *zone;
int ret;
/*
* Note: pageblock_nr_pages != MAX_ORDER. Then, chunks of free pages
* are not aligned to pageblock_nr_pages.
* Then we just check migratetype first.
*/
for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
page = __first_valid_page(pfn, pageblock_nr_pages);
if (page && get_pageblock_migratetype(page) != MIGRATE_ISOLATE)
break;
}
page = __first_valid_page(start_pfn, end_pfn - start_pfn);
if ((pfn < end_pfn) || !page)
return -EBUSY;
/* Check all pages are free or marked as ISOLATED */
zone = page_zone(page);
spin_lock_irqsave(&zone->lock, flags);
ret = __test_page_isolated_in_pageblock(start_pfn, end_pfn,
skip_hwpoisoned_pages);
spin_unlock_irqrestore(&zone->lock, flags);
return ret ? 0 : -EBUSY;
}
struct page *alloc_migrate_target(struct page *page, unsigned long private,
int **resultp)
{
gfp_t gfp_mask = GFP_USER | __GFP_MOVABLE;
/*
* TODO: allocate a destination hugepage from a nearest neighbor node,
* accordance with memory policy of the user process if possible. For
* now as a simple work-around, we use the next node for destination.
*/
if (PageHuge(page)) {
nodemask_t src = nodemask_of_node(page_to_nid(page));
nodemask_t dst;
nodes_complement(dst, src);
return alloc_huge_page_node(page_hstate(compound_head(page)),
next_node(page_to_nid(page), dst));
}
if (PageHighMem(page))
gfp_mask |= __GFP_HIGHMEM;
return alloc_page(gfp_mask);
}

251
mm/pagewalk.c Normal file
View file

@ -0,0 +1,251 @@
#include <linux/mm.h>
#include <linux/highmem.h>
#include <linux/sched.h>
#include <linux/hugetlb.h>
static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
struct mm_walk *walk)
{
pte_t *pte;
int err = 0;
pte = pte_offset_map(pmd, addr);
for (;;) {
err = walk->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
if (err)
break;
addr += PAGE_SIZE;
if (addr == end)
break;
pte++;
}
pte_unmap(pte);
return err;
}
static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
struct mm_walk *walk)
{
pmd_t *pmd;
unsigned long next;
int err = 0;
pmd = pmd_offset(pud, addr);
do {
again:
next = pmd_addr_end(addr, end);
if (pmd_none(*pmd)) {
if (walk->pte_hole)
err = walk->pte_hole(addr, next, walk);
if (err)
break;
continue;
}
/*
* This implies that each ->pmd_entry() handler
* needs to know about pmd_trans_huge() pmds
*/
if (walk->pmd_entry)
err = walk->pmd_entry(pmd, addr, next, walk);
if (err)
break;
/*
* Check this here so we only break down trans_huge
* pages when we _need_ to
*/
if (!walk->pte_entry)
continue;
split_huge_page_pmd_mm(walk->mm, addr, pmd);
if (pmd_none_or_trans_huge_or_clear_bad(pmd))
goto again;
err = walk_pte_range(pmd, addr, next, walk);
if (err)
break;
} while (pmd++, addr = next, addr != end);
return err;
}
static int walk_pud_range(pgd_t *pgd, unsigned long addr, unsigned long end,
struct mm_walk *walk)
{
pud_t *pud;
unsigned long next;
int err = 0;
pud = pud_offset(pgd, addr);
do {
next = pud_addr_end(addr, end);
if (pud_none_or_clear_bad(pud)) {
if (walk->pte_hole)
err = walk->pte_hole(addr, next, walk);
if (err)
break;
continue;
}
if (walk->pud_entry)
err = walk->pud_entry(pud, addr, next, walk);
if (!err && (walk->pmd_entry || walk->pte_entry))
err = walk_pmd_range(pud, addr, next, walk);
if (err)
break;
} while (pud++, addr = next, addr != end);
return err;
}
#ifdef CONFIG_HUGETLB_PAGE
static unsigned long hugetlb_entry_end(struct hstate *h, unsigned long addr,
unsigned long end)
{
unsigned long boundary = (addr & huge_page_mask(h)) + huge_page_size(h);
return boundary < end ? boundary : end;
}
static int walk_hugetlb_range(struct vm_area_struct *vma,
unsigned long addr, unsigned long end,
struct mm_walk *walk)
{
struct hstate *h = hstate_vma(vma);
unsigned long next;
unsigned long hmask = huge_page_mask(h);
pte_t *pte;
int err = 0;
do {
next = hugetlb_entry_end(h, addr, end);
pte = huge_pte_offset(walk->mm, addr & hmask);
if (pte && walk->hugetlb_entry)
err = walk->hugetlb_entry(pte, hmask, addr, next, walk);
if (err)
return err;
} while (addr = next, addr != end);
return 0;
}
#else /* CONFIG_HUGETLB_PAGE */
static int walk_hugetlb_range(struct vm_area_struct *vma,
unsigned long addr, unsigned long end,
struct mm_walk *walk)
{
return 0;
}
#endif /* CONFIG_HUGETLB_PAGE */
/**
* walk_page_range - walk a memory map's page tables with a callback
* @addr: starting address
* @end: ending address
* @walk: set of callbacks to invoke for each level of the tree
*
* Recursively walk the page table for the memory area in a VMA,
* calling supplied callbacks. Callbacks are called in-order (first
* PGD, first PUD, first PMD, first PTE, second PTE... second PMD,
* etc.). If lower-level callbacks are omitted, walking depth is reduced.
*
* Each callback receives an entry pointer and the start and end of the
* associated range, and a copy of the original mm_walk for access to
* the ->private or ->mm fields.
*
* Usually no locks are taken, but splitting transparent huge page may
* take page table lock. And the bottom level iterator will map PTE
* directories from highmem if necessary.
*
* If any callback returns a non-zero value, the walk is aborted and
* the return value is propagated back to the caller. Otherwise 0 is returned.
*
* walk->mm->mmap_sem must be held for at least read if walk->hugetlb_entry
* is !NULL.
*/
int walk_page_range(unsigned long addr, unsigned long end,
struct mm_walk *walk)
{
pgd_t *pgd;
unsigned long next;
int err = 0;
if (addr >= end)
return err;
if (!walk->mm)
return -EINVAL;
VM_BUG_ON_MM(!rwsem_is_locked(&walk->mm->mmap_sem), walk->mm);
pgd = pgd_offset(walk->mm, addr);
do {
struct vm_area_struct *vma = NULL;
next = pgd_addr_end(addr, end);
/*
* This function was not intended to be vma based.
* But there are vma special cases to be handled:
* - hugetlb vma's
* - VM_PFNMAP vma's
*/
vma = find_vma(walk->mm, addr);
if (vma) {
/*
* There are no page structures backing a VM_PFNMAP
* range, so do not allow split_huge_page_pmd().
*/
if ((vma->vm_start <= addr) &&
(vma->vm_flags & VM_PFNMAP)) {
if (walk->pte_hole)
err = walk->pte_hole(addr, next, walk);
if (err)
break;
pgd = pgd_offset(walk->mm, next);
continue;
}
/*
* Handle hugetlb vma individually because pagetable
* walk for the hugetlb page is dependent on the
* architecture and we can't handled it in the same
* manner as non-huge pages.
*/
if (walk->hugetlb_entry && (vma->vm_start <= addr) &&
is_vm_hugetlb_page(vma)) {
if (vma->vm_end < next)
next = vma->vm_end;
/*
* Hugepage is very tightly coupled with vma,
* so walk through hugetlb entries within a
* given vma.
*/
err = walk_hugetlb_range(vma, addr, next, walk);
if (err)
break;
pgd = pgd_offset(walk->mm, next);
continue;
}
}
if (pgd_none_or_clear_bad(pgd)) {
if (walk->pte_hole)
err = walk->pte_hole(addr, next, walk);
if (err)
break;
pgd++;
continue;
}
if (walk->pgd_entry)
err = walk->pgd_entry(pgd, addr, next, walk);
if (!err &&
(walk->pud_entry || walk->pmd_entry || walk->pte_entry))
err = walk_pud_range(pgd, addr, next, walk);
if (err)
break;
pgd++;
} while (addr = next, addr < end);
return err;
}

110
mm/percpu-km.c Normal file
View file

@ -0,0 +1,110 @@
/*
* mm/percpu-km.c - kernel memory based chunk allocation
*
* Copyright (C) 2010 SUSE Linux Products GmbH
* Copyright (C) 2010 Tejun Heo <tj@kernel.org>
*
* This file is released under the GPLv2.
*
* Chunks are allocated as a contiguous kernel memory using gfp
* allocation. This is to be used on nommu architectures.
*
* To use percpu-km,
*
* - define CONFIG_NEED_PER_CPU_KM from the arch Kconfig.
*
* - CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK must not be defined. It's
* not compatible with PER_CPU_KM. EMBED_FIRST_CHUNK should work
* fine.
*
* - NUMA is not supported. When setting up the first chunk,
* @cpu_distance_fn should be NULL or report all CPUs to be nearer
* than or at LOCAL_DISTANCE.
*
* - It's best if the chunk size is power of two multiple of
* PAGE_SIZE. Because each chunk is allocated as a contiguous
* kernel memory block using alloc_pages(), memory will be wasted if
* chunk size is not aligned. percpu-km code will whine about it.
*/
#if defined(CONFIG_SMP) && defined(CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK)
#error "contiguous percpu allocation is incompatible with paged first chunk"
#endif
#include <linux/log2.h>
static int pcpu_populate_chunk(struct pcpu_chunk *chunk,
int page_start, int page_end)
{
return 0;
}
static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk,
int page_start, int page_end)
{
/* nada */
}
static struct pcpu_chunk *pcpu_create_chunk(void)
{
const int nr_pages = pcpu_group_sizes[0] >> PAGE_SHIFT;
struct pcpu_chunk *chunk;
struct page *pages;
int i;
chunk = pcpu_alloc_chunk();
if (!chunk)
return NULL;
pages = alloc_pages(GFP_KERNEL, order_base_2(nr_pages));
if (!pages) {
pcpu_free_chunk(chunk);
return NULL;
}
for (i = 0; i < nr_pages; i++)
pcpu_set_page_chunk(nth_page(pages, i), chunk);
chunk->data = pages;
chunk->base_addr = page_address(pages) - pcpu_group_offsets[0];
spin_lock_irq(&pcpu_lock);
pcpu_chunk_populated(chunk, 0, nr_pages);
spin_unlock_irq(&pcpu_lock);
return chunk;
}
static void pcpu_destroy_chunk(struct pcpu_chunk *chunk)
{
const int nr_pages = pcpu_group_sizes[0] >> PAGE_SHIFT;
if (chunk && chunk->data)
__free_pages(chunk->data, order_base_2(nr_pages));
pcpu_free_chunk(chunk);
}
static struct page *pcpu_addr_to_page(void *addr)
{
return virt_to_page(addr);
}
static int __init pcpu_verify_alloc_info(const struct pcpu_alloc_info *ai)
{
size_t nr_pages, alloc_pages;
/* all units must be in a single group */
if (ai->nr_groups != 1) {
printk(KERN_CRIT "percpu: can't handle more than one groups\n");
return -EINVAL;
}
nr_pages = (ai->groups[0].nr_units * ai->unit_size) >> PAGE_SHIFT;
alloc_pages = roundup_pow_of_two(nr_pages);
if (alloc_pages > nr_pages)
printk(KERN_WARNING "percpu: wasting %zu pages per chunk\n",
alloc_pages - nr_pages);
return 0;
}

366
mm/percpu-vm.c Normal file
View file

@ -0,0 +1,366 @@
/*
* mm/percpu-vm.c - vmalloc area based chunk allocation
*
* Copyright (C) 2010 SUSE Linux Products GmbH
* Copyright (C) 2010 Tejun Heo <tj@kernel.org>
*
* This file is released under the GPLv2.
*
* Chunks are mapped into vmalloc areas and populated page by page.
* This is the default chunk allocator.
*/
static struct page *pcpu_chunk_page(struct pcpu_chunk *chunk,
unsigned int cpu, int page_idx)
{
/* must not be used on pre-mapped chunk */
WARN_ON(chunk->immutable);
return vmalloc_to_page((void *)pcpu_chunk_addr(chunk, cpu, page_idx));
}
/**
* pcpu_get_pages - get temp pages array
* @chunk: chunk of interest
*
* Returns pointer to array of pointers to struct page which can be indexed
* with pcpu_page_idx(). Note that there is only one array and accesses
* should be serialized by pcpu_alloc_mutex.
*
* RETURNS:
* Pointer to temp pages array on success.
*/
static struct page **pcpu_get_pages(struct pcpu_chunk *chunk_alloc)
{
static struct page **pages;
size_t pages_size = pcpu_nr_units * pcpu_unit_pages * sizeof(pages[0]);
lockdep_assert_held(&pcpu_alloc_mutex);
if (!pages)
pages = pcpu_mem_zalloc(pages_size);
return pages;
}
/**
* pcpu_free_pages - free pages which were allocated for @chunk
* @chunk: chunk pages were allocated for
* @pages: array of pages to be freed, indexed by pcpu_page_idx()
* @page_start: page index of the first page to be freed
* @page_end: page index of the last page to be freed + 1
*
* Free pages [@page_start and @page_end) in @pages for all units.
* The pages were allocated for @chunk.
*/
static void pcpu_free_pages(struct pcpu_chunk *chunk,
struct page **pages, int page_start, int page_end)
{
unsigned int cpu;
int i;
for_each_possible_cpu(cpu) {
for (i = page_start; i < page_end; i++) {
struct page *page = pages[pcpu_page_idx(cpu, i)];
if (page)
__free_page(page);
}
}
}
/**
* pcpu_alloc_pages - allocates pages for @chunk
* @chunk: target chunk
* @pages: array to put the allocated pages into, indexed by pcpu_page_idx()
* @page_start: page index of the first page to be allocated
* @page_end: page index of the last page to be allocated + 1
*
* Allocate pages [@page_start,@page_end) into @pages for all units.
* The allocation is for @chunk. Percpu core doesn't care about the
* content of @pages and will pass it verbatim to pcpu_map_pages().
*/
static int pcpu_alloc_pages(struct pcpu_chunk *chunk,
struct page **pages, int page_start, int page_end)
{
const gfp_t gfp = GFP_KERNEL | __GFP_HIGHMEM | __GFP_COLD;
unsigned int cpu, tcpu;
int i;
for_each_possible_cpu(cpu) {
for (i = page_start; i < page_end; i++) {
struct page **pagep = &pages[pcpu_page_idx(cpu, i)];
*pagep = alloc_pages_node(cpu_to_node(cpu), gfp, 0);
if (!*pagep)
goto err;
}
}
return 0;
err:
while (--i >= page_start)
__free_page(pages[pcpu_page_idx(cpu, i)]);
for_each_possible_cpu(tcpu) {
if (tcpu == cpu)
break;
for (i = page_start; i < page_end; i++)
__free_page(pages[pcpu_page_idx(tcpu, i)]);
}
return -ENOMEM;
}
/**
* pcpu_pre_unmap_flush - flush cache prior to unmapping
* @chunk: chunk the regions to be flushed belongs to
* @page_start: page index of the first page to be flushed
* @page_end: page index of the last page to be flushed + 1
*
* Pages in [@page_start,@page_end) of @chunk are about to be
* unmapped. Flush cache. As each flushing trial can be very
* expensive, issue flush on the whole region at once rather than
* doing it for each cpu. This could be an overkill but is more
* scalable.
*/
static void pcpu_pre_unmap_flush(struct pcpu_chunk *chunk,
int page_start, int page_end)
{
flush_cache_vunmap(
pcpu_chunk_addr(chunk, pcpu_low_unit_cpu, page_start),
pcpu_chunk_addr(chunk, pcpu_high_unit_cpu, page_end));
}
static void __pcpu_unmap_pages(unsigned long addr, int nr_pages)
{
unmap_kernel_range_noflush(addr, nr_pages << PAGE_SHIFT);
}
/**
* pcpu_unmap_pages - unmap pages out of a pcpu_chunk
* @chunk: chunk of interest
* @pages: pages array which can be used to pass information to free
* @page_start: page index of the first page to unmap
* @page_end: page index of the last page to unmap + 1
*
* For each cpu, unmap pages [@page_start,@page_end) out of @chunk.
* Corresponding elements in @pages were cleared by the caller and can
* be used to carry information to pcpu_free_pages() which will be
* called after all unmaps are finished. The caller should call
* proper pre/post flush functions.
*/
static void pcpu_unmap_pages(struct pcpu_chunk *chunk,
struct page **pages, int page_start, int page_end)
{
unsigned int cpu;
int i;
for_each_possible_cpu(cpu) {
for (i = page_start; i < page_end; i++) {
struct page *page;
page = pcpu_chunk_page(chunk, cpu, i);
WARN_ON(!page);
pages[pcpu_page_idx(cpu, i)] = page;
}
__pcpu_unmap_pages(pcpu_chunk_addr(chunk, cpu, page_start),
page_end - page_start);
}
}
/**
* pcpu_post_unmap_tlb_flush - flush TLB after unmapping
* @chunk: pcpu_chunk the regions to be flushed belong to
* @page_start: page index of the first page to be flushed
* @page_end: page index of the last page to be flushed + 1
*
* Pages [@page_start,@page_end) of @chunk have been unmapped. Flush
* TLB for the regions. This can be skipped if the area is to be
* returned to vmalloc as vmalloc will handle TLB flushing lazily.
*
* As with pcpu_pre_unmap_flush(), TLB flushing also is done at once
* for the whole region.
*/
static void pcpu_post_unmap_tlb_flush(struct pcpu_chunk *chunk,
int page_start, int page_end)
{
flush_tlb_kernel_range(
pcpu_chunk_addr(chunk, pcpu_low_unit_cpu, page_start),
pcpu_chunk_addr(chunk, pcpu_high_unit_cpu, page_end));
}
static int __pcpu_map_pages(unsigned long addr, struct page **pages,
int nr_pages)
{
return map_kernel_range_noflush(addr, nr_pages << PAGE_SHIFT,
PAGE_KERNEL, pages);
}
/**
* pcpu_map_pages - map pages into a pcpu_chunk
* @chunk: chunk of interest
* @pages: pages array containing pages to be mapped
* @page_start: page index of the first page to map
* @page_end: page index of the last page to map + 1
*
* For each cpu, map pages [@page_start,@page_end) into @chunk. The
* caller is responsible for calling pcpu_post_map_flush() after all
* mappings are complete.
*
* This function is responsible for setting up whatever is necessary for
* reverse lookup (addr -> chunk).
*/
static int pcpu_map_pages(struct pcpu_chunk *chunk,
struct page **pages, int page_start, int page_end)
{
unsigned int cpu, tcpu;
int i, err;
for_each_possible_cpu(cpu) {
err = __pcpu_map_pages(pcpu_chunk_addr(chunk, cpu, page_start),
&pages[pcpu_page_idx(cpu, page_start)],
page_end - page_start);
if (err < 0)
goto err;
for (i = page_start; i < page_end; i++)
pcpu_set_page_chunk(pages[pcpu_page_idx(cpu, i)],
chunk);
}
return 0;
err:
for_each_possible_cpu(tcpu) {
if (tcpu == cpu)
break;
__pcpu_unmap_pages(pcpu_chunk_addr(chunk, tcpu, page_start),
page_end - page_start);
}
pcpu_post_unmap_tlb_flush(chunk, page_start, page_end);
return err;
}
/**
* pcpu_post_map_flush - flush cache after mapping
* @chunk: pcpu_chunk the regions to be flushed belong to
* @page_start: page index of the first page to be flushed
* @page_end: page index of the last page to be flushed + 1
*
* Pages [@page_start,@page_end) of @chunk have been mapped. Flush
* cache.
*
* As with pcpu_pre_unmap_flush(), TLB flushing also is done at once
* for the whole region.
*/
static void pcpu_post_map_flush(struct pcpu_chunk *chunk,
int page_start, int page_end)
{
flush_cache_vmap(
pcpu_chunk_addr(chunk, pcpu_low_unit_cpu, page_start),
pcpu_chunk_addr(chunk, pcpu_high_unit_cpu, page_end));
}
/**
* pcpu_populate_chunk - populate and map an area of a pcpu_chunk
* @chunk: chunk of interest
* @page_start: the start page
* @page_end: the end page
*
* For each cpu, populate and map pages [@page_start,@page_end) into
* @chunk.
*
* CONTEXT:
* pcpu_alloc_mutex, does GFP_KERNEL allocation.
*/
static int pcpu_populate_chunk(struct pcpu_chunk *chunk,
int page_start, int page_end)
{
struct page **pages;
pages = pcpu_get_pages(chunk);
if (!pages)
return -ENOMEM;
if (pcpu_alloc_pages(chunk, pages, page_start, page_end))
return -ENOMEM;
if (pcpu_map_pages(chunk, pages, page_start, page_end)) {
pcpu_free_pages(chunk, pages, page_start, page_end);
return -ENOMEM;
}
pcpu_post_map_flush(chunk, page_start, page_end);
return 0;
}
/**
* pcpu_depopulate_chunk - depopulate and unmap an area of a pcpu_chunk
* @chunk: chunk to depopulate
* @page_start: the start page
* @page_end: the end page
*
* For each cpu, depopulate and unmap pages [@page_start,@page_end)
* from @chunk.
*
* CONTEXT:
* pcpu_alloc_mutex.
*/
static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk,
int page_start, int page_end)
{
struct page **pages;
/*
* If control reaches here, there must have been at least one
* successful population attempt so the temp pages array must
* be available now.
*/
pages = pcpu_get_pages(chunk);
BUG_ON(!pages);
/* unmap and free */
pcpu_pre_unmap_flush(chunk, page_start, page_end);
pcpu_unmap_pages(chunk, pages, page_start, page_end);
/* no need to flush tlb, vmalloc will handle it lazily */
pcpu_free_pages(chunk, pages, page_start, page_end);
}
static struct pcpu_chunk *pcpu_create_chunk(void)
{
struct pcpu_chunk *chunk;
struct vm_struct **vms;
chunk = pcpu_alloc_chunk();
if (!chunk)
return NULL;
vms = pcpu_get_vm_areas(pcpu_group_offsets, pcpu_group_sizes,
pcpu_nr_groups, pcpu_atom_size);
if (!vms) {
pcpu_free_chunk(chunk);
return NULL;
}
chunk->data = vms;
chunk->base_addr = vms[0]->addr - pcpu_group_offsets[0];
return chunk;
}
static void pcpu_destroy_chunk(struct pcpu_chunk *chunk)
{
if (chunk && chunk->data)
pcpu_free_vm_areas(chunk->data, pcpu_nr_groups);
pcpu_free_chunk(chunk);
}
static struct page *pcpu_addr_to_page(void *addr)
{
return vmalloc_to_page(addr);
}
static int __init pcpu_verify_alloc_info(const struct pcpu_alloc_info *ai)
{
/* no extra restriction */
return 0;
}

2297
mm/percpu.c Normal file

File diff suppressed because it is too large Load diff

202
mm/pgtable-generic.c Normal file
View file

@ -0,0 +1,202 @@
/*
* mm/pgtable-generic.c
*
* Generic pgtable methods declared in asm-generic/pgtable.h
*
* Copyright (C) 2010 Linus Torvalds
*/
#include <linux/pagemap.h>
#include <asm/tlb.h>
#include <asm-generic/pgtable.h>
/*
* If a p?d_bad entry is found while walking page tables, report
* the error, before resetting entry to p?d_none. Usually (but
* very seldom) called out from the p?d_none_or_clear_bad macros.
*/
void pgd_clear_bad(pgd_t *pgd)
{
pgd_ERROR(*pgd);
pgd_clear(pgd);
}
void pud_clear_bad(pud_t *pud)
{
pud_ERROR(*pud);
pud_clear(pud);
}
void pmd_clear_bad(pmd_t *pmd)
{
pmd_ERROR(*pmd);
pmd_clear(pmd);
}
#ifndef __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
/*
* Only sets the access flags (dirty, accessed), as well as write
* permission. Furthermore, we know it always gets set to a "more
* permissive" setting, which allows most architectures to optimize
* this. We return whether the PTE actually changed, which in turn
* instructs the caller to do things like update__mmu_cache. This
* used to be done in the caller, but sparc needs minor faults to
* force that call on sun4c so we changed this macro slightly
*/
int ptep_set_access_flags(struct vm_area_struct *vma,
unsigned long address, pte_t *ptep,
pte_t entry, int dirty)
{
int changed = !pte_same(*ptep, entry);
if (changed) {
set_pte_at(vma->vm_mm, address, ptep, entry);
flush_tlb_fix_spurious_fault(vma, address);
}
return changed;
}
#endif
#ifndef __HAVE_ARCH_PMDP_SET_ACCESS_FLAGS
int pmdp_set_access_flags(struct vm_area_struct *vma,
unsigned long address, pmd_t *pmdp,
pmd_t entry, int dirty)
{
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
int changed = !pmd_same(*pmdp, entry);
VM_BUG_ON(address & ~HPAGE_PMD_MASK);
if (changed) {
set_pmd_at(vma->vm_mm, address, pmdp, entry);
flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
}
return changed;
#else /* CONFIG_TRANSPARENT_HUGEPAGE */
BUG();
return 0;
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
}
#endif
#ifndef __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
int ptep_clear_flush_young(struct vm_area_struct *vma,
unsigned long address, pte_t *ptep)
{
int young;
young = ptep_test_and_clear_young(vma, address, ptep);
if (young)
flush_tlb_page(vma, address);
return young;
}
#endif
#ifndef __HAVE_ARCH_PMDP_CLEAR_YOUNG_FLUSH
int pmdp_clear_flush_young(struct vm_area_struct *vma,
unsigned long address, pmd_t *pmdp)
{
int young;
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
VM_BUG_ON(address & ~HPAGE_PMD_MASK);
#else
BUG();
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
young = pmdp_test_and_clear_young(vma, address, pmdp);
if (young)
flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
return young;
}
#endif
#ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH
pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
pte_t *ptep)
{
struct mm_struct *mm = (vma)->vm_mm;
pte_t pte;
pte = ptep_get_and_clear(mm, address, ptep);
if (pte_accessible(mm, pte))
flush_tlb_page(vma, address);
return pte;
}
#endif
#ifndef __HAVE_ARCH_PMDP_CLEAR_FLUSH
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
pmd_t pmdp_clear_flush(struct vm_area_struct *vma, unsigned long address,
pmd_t *pmdp)
{
pmd_t pmd;
VM_BUG_ON(address & ~HPAGE_PMD_MASK);
pmd = pmdp_get_and_clear(vma->vm_mm, address, pmdp);
flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
return pmd;
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
#endif
#ifndef __HAVE_ARCH_PMDP_SPLITTING_FLUSH
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
pmd_t *pmdp)
{
pmd_t pmd = pmd_mksplitting(*pmdp);
VM_BUG_ON(address & ~HPAGE_PMD_MASK);
set_pmd_at(vma->vm_mm, address, pmdp, pmd);
/* tlb flush only to serialize against gup-fast */
flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
#endif
#ifndef __HAVE_ARCH_PGTABLE_DEPOSIT
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
pgtable_t pgtable)
{
assert_spin_locked(pmd_lockptr(mm, pmdp));
/* FIFO */
if (!pmd_huge_pte(mm, pmdp))
INIT_LIST_HEAD(&pgtable->lru);
else
list_add(&pgtable->lru, &pmd_huge_pte(mm, pmdp)->lru);
pmd_huge_pte(mm, pmdp) = pgtable;
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
#endif
#ifndef __HAVE_ARCH_PGTABLE_WITHDRAW
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
/* no "address" argument so destroys page coloring of some arch */
pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
{
pgtable_t pgtable;
assert_spin_locked(pmd_lockptr(mm, pmdp));
/* FIFO */
pgtable = pmd_huge_pte(mm, pmdp);
if (list_empty(&pgtable->lru))
pmd_huge_pte(mm, pmdp) = NULL;
else {
pmd_huge_pte(mm, pmdp) = list_entry(pgtable->lru.next,
struct page, lru);
list_del(&pgtable->lru);
}
return pgtable;
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
#endif
#ifndef __HAVE_ARCH_PMDP_INVALIDATE
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
pmd_t *pmdp)
{
pmd_t entry = *pmdp;
if (pmd_numa(entry))
entry = pmd_mknonnuma(entry);
set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry));
flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
#endif

379
mm/process_vm_access.c Normal file
View file

@ -0,0 +1,379 @@
/*
* linux/mm/process_vm_access.c
*
* Copyright (C) 2010-2011 Christopher Yeoh <cyeoh@au1.ibm.com>, IBM Corp.
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License
* as published by the Free Software Foundation; either version
* 2 of the License, or (at your option) any later version.
*/
#include <linux/mm.h>
#include <linux/uio.h>
#include <linux/sched.h>
#include <linux/highmem.h>
#include <linux/ptrace.h>
#include <linux/slab.h>
#include <linux/syscalls.h>
#ifdef CONFIG_COMPAT
#include <linux/compat.h>
#endif
/**
* process_vm_rw_pages - read/write pages from task specified
* @pages: array of pointers to pages we want to copy
* @start_offset: offset in page to start copying from/to
* @len: number of bytes to copy
* @iter: where to copy to/from locally
* @vm_write: 0 means copy from, 1 means copy to
* Returns 0 on success, error code otherwise
*/
static int process_vm_rw_pages(struct page **pages,
unsigned offset,
size_t len,
struct iov_iter *iter,
int vm_write)
{
/* Do the copy for each page */
while (len && iov_iter_count(iter)) {
struct page *page = *pages++;
size_t copy = PAGE_SIZE - offset;
size_t copied;
if (copy > len)
copy = len;
if (vm_write) {
copied = copy_page_from_iter(page, offset, copy, iter);
set_page_dirty_lock(page);
} else {
copied = copy_page_to_iter(page, offset, copy, iter);
}
len -= copied;
if (copied < copy && iov_iter_count(iter))
return -EFAULT;
offset = 0;
}
return 0;
}
/* Maximum number of pages kmalloc'd to hold struct page's during copy */
#define PVM_MAX_KMALLOC_PAGES (PAGE_SIZE * 2)
/**
* process_vm_rw_single_vec - read/write pages from task specified
* @addr: start memory address of target process
* @len: size of area to copy to/from
* @iter: where to copy to/from locally
* @process_pages: struct pages area that can store at least
* nr_pages_to_copy struct page pointers
* @mm: mm for task
* @task: task to read/write from
* @vm_write: 0 means copy from, 1 means copy to
* Returns 0 on success or on failure error code
*/
static int process_vm_rw_single_vec(unsigned long addr,
unsigned long len,
struct iov_iter *iter,
struct page **process_pages,
struct mm_struct *mm,
struct task_struct *task,
int vm_write)
{
unsigned long pa = addr & PAGE_MASK;
unsigned long start_offset = addr - pa;
unsigned long nr_pages;
ssize_t rc = 0;
unsigned long max_pages_per_loop = PVM_MAX_KMALLOC_PAGES
/ sizeof(struct pages *);
/* Work out address and page range required */
if (len == 0)
return 0;
nr_pages = (addr + len - 1) / PAGE_SIZE - addr / PAGE_SIZE + 1;
while (!rc && nr_pages && iov_iter_count(iter)) {
int pages = min(nr_pages, max_pages_per_loop);
size_t bytes;
/* Get the pages we're interested in */
down_read(&mm->mmap_sem);
pages = get_user_pages(task, mm, pa, pages,
vm_write, 0, process_pages, NULL);
up_read(&mm->mmap_sem);
if (pages <= 0)
return -EFAULT;
bytes = pages * PAGE_SIZE - start_offset;
if (bytes > len)
bytes = len;
rc = process_vm_rw_pages(process_pages,
start_offset, bytes, iter,
vm_write);
len -= bytes;
start_offset = 0;
nr_pages -= pages;
pa += pages * PAGE_SIZE;
while (pages)
put_page(process_pages[--pages]);
}
return rc;
}
/* Maximum number of entries for process pages array
which lives on stack */
#define PVM_MAX_PP_ARRAY_COUNT 16
/**
* process_vm_rw_core - core of reading/writing pages from task specified
* @pid: PID of process to read/write from/to
* @iter: where to copy to/from locally
* @rvec: iovec array specifying where to copy to/from in the other process
* @riovcnt: size of rvec array
* @flags: currently unused
* @vm_write: 0 if reading from other process, 1 if writing to other process
* Returns the number of bytes read/written or error code. May
* return less bytes than expected if an error occurs during the copying
* process.
*/
static ssize_t process_vm_rw_core(pid_t pid, struct iov_iter *iter,
const struct iovec *rvec,
unsigned long riovcnt,
unsigned long flags, int vm_write)
{
struct task_struct *task;
struct page *pp_stack[PVM_MAX_PP_ARRAY_COUNT];
struct page **process_pages = pp_stack;
struct mm_struct *mm;
unsigned long i;
ssize_t rc = 0;
unsigned long nr_pages = 0;
unsigned long nr_pages_iov;
ssize_t iov_len;
size_t total_len = iov_iter_count(iter);
/*
* Work out how many pages of struct pages we're going to need
* when eventually calling get_user_pages
*/
for (i = 0; i < riovcnt; i++) {
iov_len = rvec[i].iov_len;
if (iov_len > 0) {
nr_pages_iov = ((unsigned long)rvec[i].iov_base
+ iov_len)
/ PAGE_SIZE - (unsigned long)rvec[i].iov_base
/ PAGE_SIZE + 1;
nr_pages = max(nr_pages, nr_pages_iov);
}
}
if (nr_pages == 0)
return 0;
if (nr_pages > PVM_MAX_PP_ARRAY_COUNT) {
/* For reliability don't try to kmalloc more than
2 pages worth */
process_pages = kmalloc(min_t(size_t, PVM_MAX_KMALLOC_PAGES,
sizeof(struct pages *)*nr_pages),
GFP_KERNEL);
if (!process_pages)
return -ENOMEM;
}
/* Get process information */
rcu_read_lock();
task = find_task_by_vpid(pid);
if (task)
get_task_struct(task);
rcu_read_unlock();
if (!task) {
rc = -ESRCH;
goto free_proc_pages;
}
mm = mm_access(task, PTRACE_MODE_ATTACH);
if (!mm || IS_ERR(mm)) {
rc = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
/*
* Explicitly map EACCES to EPERM as EPERM is a more a
* appropriate error code for process_vw_readv/writev
*/
if (rc == -EACCES)
rc = -EPERM;
goto put_task_struct;
}
for (i = 0; i < riovcnt && iov_iter_count(iter) && !rc; i++)
rc = process_vm_rw_single_vec(
(unsigned long)rvec[i].iov_base, rvec[i].iov_len,
iter, process_pages, mm, task, vm_write);
/* copied = space before - space after */
total_len -= iov_iter_count(iter);
/* If we have managed to copy any data at all then
we return the number of bytes copied. Otherwise
we return the error code */
if (total_len)
rc = total_len;
mmput(mm);
put_task_struct:
put_task_struct(task);
free_proc_pages:
if (process_pages != pp_stack)
kfree(process_pages);
return rc;
}
/**
* process_vm_rw - check iovecs before calling core routine
* @pid: PID of process to read/write from/to
* @lvec: iovec array specifying where to copy to/from locally
* @liovcnt: size of lvec array
* @rvec: iovec array specifying where to copy to/from in the other process
* @riovcnt: size of rvec array
* @flags: currently unused
* @vm_write: 0 if reading from other process, 1 if writing to other process
* Returns the number of bytes read/written or error code. May
* return less bytes than expected if an error occurs during the copying
* process.
*/
static ssize_t process_vm_rw(pid_t pid,
const struct iovec __user *lvec,
unsigned long liovcnt,
const struct iovec __user *rvec,
unsigned long riovcnt,
unsigned long flags, int vm_write)
{
struct iovec iovstack_l[UIO_FASTIOV];
struct iovec iovstack_r[UIO_FASTIOV];
struct iovec *iov_l = iovstack_l;
struct iovec *iov_r = iovstack_r;
struct iov_iter iter;
ssize_t rc;
if (flags != 0)
return -EINVAL;
/* Check iovecs */
if (vm_write)
rc = rw_copy_check_uvector(WRITE, lvec, liovcnt, UIO_FASTIOV,
iovstack_l, &iov_l);
else
rc = rw_copy_check_uvector(READ, lvec, liovcnt, UIO_FASTIOV,
iovstack_l, &iov_l);
if (rc <= 0)
goto free_iovecs;
iov_iter_init(&iter, vm_write ? WRITE : READ, iov_l, liovcnt, rc);
rc = rw_copy_check_uvector(CHECK_IOVEC_ONLY, rvec, riovcnt, UIO_FASTIOV,
iovstack_r, &iov_r);
if (rc <= 0)
goto free_iovecs;
rc = process_vm_rw_core(pid, &iter, iov_r, riovcnt, flags, vm_write);
free_iovecs:
if (iov_r != iovstack_r)
kfree(iov_r);
if (iov_l != iovstack_l)
kfree(iov_l);
return rc;
}
SYSCALL_DEFINE6(process_vm_readv, pid_t, pid, const struct iovec __user *, lvec,
unsigned long, liovcnt, const struct iovec __user *, rvec,
unsigned long, riovcnt, unsigned long, flags)
{
return process_vm_rw(pid, lvec, liovcnt, rvec, riovcnt, flags, 0);
}
SYSCALL_DEFINE6(process_vm_writev, pid_t, pid,
const struct iovec __user *, lvec,
unsigned long, liovcnt, const struct iovec __user *, rvec,
unsigned long, riovcnt, unsigned long, flags)
{
return process_vm_rw(pid, lvec, liovcnt, rvec, riovcnt, flags, 1);
}
#ifdef CONFIG_COMPAT
static ssize_t
compat_process_vm_rw(compat_pid_t pid,
const struct compat_iovec __user *lvec,
unsigned long liovcnt,
const struct compat_iovec __user *rvec,
unsigned long riovcnt,
unsigned long flags, int vm_write)
{
struct iovec iovstack_l[UIO_FASTIOV];
struct iovec iovstack_r[UIO_FASTIOV];
struct iovec *iov_l = iovstack_l;
struct iovec *iov_r = iovstack_r;
struct iov_iter iter;
ssize_t rc = -EFAULT;
if (flags != 0)
return -EINVAL;
if (vm_write)
rc = compat_rw_copy_check_uvector(WRITE, lvec, liovcnt,
UIO_FASTIOV, iovstack_l,
&iov_l);
else
rc = compat_rw_copy_check_uvector(READ, lvec, liovcnt,
UIO_FASTIOV, iovstack_l,
&iov_l);
if (rc <= 0)
goto free_iovecs;
iov_iter_init(&iter, vm_write ? WRITE : READ, iov_l, liovcnt, rc);
rc = compat_rw_copy_check_uvector(CHECK_IOVEC_ONLY, rvec, riovcnt,
UIO_FASTIOV, iovstack_r,
&iov_r);
if (rc <= 0)
goto free_iovecs;
rc = process_vm_rw_core(pid, &iter, iov_r, riovcnt, flags, vm_write);
free_iovecs:
if (iov_r != iovstack_r)
kfree(iov_r);
if (iov_l != iovstack_l)
kfree(iov_l);
return rc;
}
COMPAT_SYSCALL_DEFINE6(process_vm_readv, compat_pid_t, pid,
const struct compat_iovec __user *, lvec,
compat_ulong_t, liovcnt,
const struct compat_iovec __user *, rvec,
compat_ulong_t, riovcnt,
compat_ulong_t, flags)
{
return compat_process_vm_rw(pid, lvec, liovcnt, rvec,
riovcnt, flags, 0);
}
COMPAT_SYSCALL_DEFINE6(process_vm_writev, compat_pid_t, pid,
const struct compat_iovec __user *, lvec,
compat_ulong_t, liovcnt,
const struct compat_iovec __user *, rvec,
compat_ulong_t, riovcnt,
compat_ulong_t, flags)
{
return compat_process_vm_rw(pid, lvec, liovcnt, rvec,
riovcnt, flags, 1);
}
#endif

102
mm/quicklist.c Normal file
View file

@ -0,0 +1,102 @@
/*
* Quicklist support.
*
* Quicklists are light weight lists of pages that have a defined state
* on alloc and free. Pages must be in the quicklist specific defined state
* (zero by default) when the page is freed. It seems that the initial idea
* for such lists first came from Dave Miller and then various other people
* improved on it.
*
* Copyright (C) 2007 SGI,
* Christoph Lameter <clameter@sgi.com>
* Generalized, added support for multiple lists and
* constructors / destructors.
*/
#include <linux/kernel.h>
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/mmzone.h>
#include <linux/quicklist.h>
DEFINE_PER_CPU(struct quicklist [CONFIG_NR_QUICK], quicklist);
#define FRACTION_OF_NODE_MEM 16
static unsigned long max_pages(unsigned long min_pages)
{
unsigned long node_free_pages, max;
int node = numa_node_id();
struct zone *zones = NODE_DATA(node)->node_zones;
int num_cpus_on_node;
node_free_pages =
#ifdef CONFIG_ZONE_DMA
zone_page_state(&zones[ZONE_DMA], NR_FREE_PAGES) +
#endif
#ifdef CONFIG_ZONE_DMA32
zone_page_state(&zones[ZONE_DMA32], NR_FREE_PAGES) +
#endif
zone_page_state(&zones[ZONE_NORMAL], NR_FREE_PAGES);
max = node_free_pages / FRACTION_OF_NODE_MEM;
num_cpus_on_node = cpumask_weight(cpumask_of_node(node));
max /= num_cpus_on_node;
return max(max, min_pages);
}
static long min_pages_to_free(struct quicklist *q,
unsigned long min_pages, long max_free)
{
long pages_to_free;
pages_to_free = q->nr_pages - max_pages(min_pages);
return min(pages_to_free, max_free);
}
/*
* Trim down the number of pages in the quicklist
*/
void quicklist_trim(int nr, void (*dtor)(void *),
unsigned long min_pages, unsigned long max_free)
{
long pages_to_free;
struct quicklist *q;
q = &get_cpu_var(quicklist)[nr];
if (q->nr_pages > min_pages) {
pages_to_free = min_pages_to_free(q, min_pages, max_free);
while (pages_to_free > 0) {
/*
* We pass a gfp_t of 0 to quicklist_alloc here
* because we will never call into the page allocator.
*/
void *p = quicklist_alloc(nr, 0, NULL);
if (dtor)
dtor(p);
free_page((unsigned long)p);
pages_to_free--;
}
}
put_cpu_var(quicklist);
}
unsigned long quicklist_total_size(void)
{
unsigned long count = 0;
int cpu;
struct quicklist *ql, *q;
for_each_online_cpu(cpu) {
ql = per_cpu(quicklist, cpu);
for (q = ql; q < ql + CONFIG_NR_QUICK; q++)
count += q->nr_pages;
}
return count;
}

580
mm/readahead.c Normal file
View file

@ -0,0 +1,580 @@
/*
* mm/readahead.c - address_space-level file readahead.
*
* Copyright (C) 2002, Linus Torvalds
*
* 09Apr2002 Andrew Morton
* Initial version.
*/
#include <linux/kernel.h>
#include <linux/gfp.h>
#include <linux/export.h>
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/task_io_accounting_ops.h>
#include <linux/pagevec.h>
#include <linux/pagemap.h>
#include <linux/syscalls.h>
#include <linux/file.h>
#include "internal.h"
/*
* Initialise a struct file's readahead state. Assumes that the caller has
* memset *ra to zero.
*/
void
file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping)
{
ra->ra_pages = mapping->backing_dev_info->ra_pages;
ra->prev_pos = -1;
}
EXPORT_SYMBOL_GPL(file_ra_state_init);
#define list_to_page(head) (list_entry((head)->prev, struct page, lru))
/*
* see if a page needs releasing upon read_cache_pages() failure
* - the caller of read_cache_pages() may have set PG_private or PG_fscache
* before calling, such as the NFS fs marking pages that are cached locally
* on disk, thus we need to give the fs a chance to clean up in the event of
* an error
*/
static void read_cache_pages_invalidate_page(struct address_space *mapping,
struct page *page)
{
if (page_has_private(page)) {
if (!trylock_page(page))
BUG();
page->mapping = mapping;
do_invalidatepage(page, 0, PAGE_CACHE_SIZE);
page->mapping = NULL;
unlock_page(page);
}
page_cache_release(page);
}
/*
* release a list of pages, invalidating them first if need be
*/
static void read_cache_pages_invalidate_pages(struct address_space *mapping,
struct list_head *pages)
{
struct page *victim;
while (!list_empty(pages)) {
victim = list_to_page(pages);
list_del(&victim->lru);
read_cache_pages_invalidate_page(mapping, victim);
}
}
/**
* read_cache_pages - populate an address space with some pages & start reads against them
* @mapping: the address_space
* @pages: The address of a list_head which contains the target pages. These
* pages have their ->index populated and are otherwise uninitialised.
* @filler: callback routine for filling a single page.
* @data: private data for the callback routine.
*
* Hides the details of the LRU cache etc from the filesystems.
*/
int read_cache_pages(struct address_space *mapping, struct list_head *pages,
int (*filler)(void *, struct page *), void *data)
{
struct page *page;
int ret = 0;
while (!list_empty(pages)) {
page = list_to_page(pages);
list_del(&page->lru);
if (add_to_page_cache_lru(page, mapping,
page->index, GFP_KERNEL)) {
read_cache_pages_invalidate_page(mapping, page);
continue;
}
page_cache_release(page);
ret = filler(data, page);
if (unlikely(ret)) {
read_cache_pages_invalidate_pages(mapping, pages);
break;
}
task_io_account_read(PAGE_CACHE_SIZE);
}
return ret;
}
EXPORT_SYMBOL(read_cache_pages);
static int read_pages(struct address_space *mapping, struct file *filp,
struct list_head *pages, unsigned nr_pages)
{
struct blk_plug plug;
unsigned page_idx;
int ret;
blk_start_plug(&plug);
if (mapping->a_ops->readpages) {
ret = mapping->a_ops->readpages(filp, mapping, pages, nr_pages);
/* Clean up the remaining pages */
put_pages_list(pages);
goto out;
}
for (page_idx = 0; page_idx < nr_pages; page_idx++) {
struct page *page = list_to_page(pages);
list_del(&page->lru);
if (!add_to_page_cache_lru(page, mapping,
page->index, GFP_KERNEL)) {
mapping->a_ops->readpage(filp, page);
}
page_cache_release(page);
}
ret = 0;
out:
blk_finish_plug(&plug);
return ret;
}
/*
* __do_page_cache_readahead() actually reads a chunk of disk. It allocates all
* the pages first, then submits them all for I/O. This avoids the very bad
* behaviour which would occur if page allocations are causing VM writeback.
* We really don't want to intermingle reads and writes like that.
*
* Returns the number of pages requested, or the maximum amount of I/O allowed.
*/
int __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
pgoff_t offset, unsigned long nr_to_read,
unsigned long lookahead_size)
{
struct inode *inode = mapping->host;
struct page *page;
unsigned long end_index; /* The last page we want to read */
LIST_HEAD(page_pool);
int page_idx;
int ret = 0;
loff_t isize = i_size_read(inode);
if (isize == 0)
goto out;
end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);
/*
* Preallocate as many pages as we will need.
*/
for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
pgoff_t page_offset = offset + page_idx;
if (page_offset > end_index)
break;
rcu_read_lock();
page = radix_tree_lookup(&mapping->page_tree, page_offset);
rcu_read_unlock();
if (page && !radix_tree_exceptional_entry(page))
continue;
page = page_cache_alloc_readahead(mapping);
if (!page)
break;
page->index = page_offset;
list_add(&page->lru, &page_pool);
if (page_idx == nr_to_read - lookahead_size)
SetPageReadahead(page);
ret++;
}
/*
* Now start the IO. We ignore I/O errors - if the page is not
* uptodate then the caller will launch readpage again, and
* will then handle the error.
*/
if (ret)
read_pages(mapping, filp, &page_pool, ret);
BUG_ON(!list_empty(&page_pool));
out:
return ret;
}
/*
* Chunk the readahead into 2 megabyte units, so that we don't pin too much
* memory at once.
*/
int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
pgoff_t offset, unsigned long nr_to_read)
{
if (unlikely(!mapping->a_ops->readpage && !mapping->a_ops->readpages))
return -EINVAL;
nr_to_read = max_sane_readahead(nr_to_read);
while (nr_to_read) {
int err;
unsigned long this_chunk = (2 * 1024 * 1024) / PAGE_CACHE_SIZE;
if (this_chunk > nr_to_read)
this_chunk = nr_to_read;
err = __do_page_cache_readahead(mapping, filp,
offset, this_chunk, 0);
if (err < 0)
return err;
offset += this_chunk;
nr_to_read -= this_chunk;
}
return 0;
}
#define MAX_READAHEAD ((512*4096)/PAGE_CACHE_SIZE)
/*
* Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
* sensible upper limit.
*/
unsigned long max_sane_readahead(unsigned long nr)
{
return min(nr, MAX_READAHEAD);
}
/*
* Set the initial window size, round to next power of 2 and square
* for small size, x 4 for medium, and x 2 for large
* for 128k (32 page) max ra
* 1-8 page = 32k initial, > 8 page = 128k initial
*/
static unsigned long get_init_ra_size(unsigned long size, unsigned long max)
{
unsigned long newsize = roundup_pow_of_two(size);
if (newsize <= max / 32)
newsize = newsize * 4;
else if (newsize <= max / 4)
newsize = newsize * 2;
else
newsize = max;
return newsize;
}
/*
* Get the previous window size, ramp it up, and
* return it as the new window size.
*/
static unsigned long get_next_ra_size(struct file_ra_state *ra,
unsigned long max)
{
unsigned long cur = ra->size;
unsigned long newsize;
if (cur < max / 16)
newsize = 4 * cur;
else
newsize = 2 * cur;
return min(newsize, max);
}
/*
* On-demand readahead design.
*
* The fields in struct file_ra_state represent the most-recently-executed
* readahead attempt:
*
* |<----- async_size ---------|
* |------------------- size -------------------->|
* |==================#===========================|
* ^start ^page marked with PG_readahead
*
* To overlap application thinking time and disk I/O time, we do
* `readahead pipelining': Do not wait until the application consumed all
* readahead pages and stalled on the missing page at readahead_index;
* Instead, submit an asynchronous readahead I/O as soon as there are
* only async_size pages left in the readahead window. Normally async_size
* will be equal to size, for maximum pipelining.
*
* In interleaved sequential reads, concurrent streams on the same fd can
* be invalidating each other's readahead state. So we flag the new readahead
* page at (start+size-async_size) with PG_readahead, and use it as readahead
* indicator. The flag won't be set on already cached pages, to avoid the
* readahead-for-nothing fuss, saving pointless page cache lookups.
*
* prev_pos tracks the last visited byte in the _previous_ read request.
* It should be maintained by the caller, and will be used for detecting
* small random reads. Note that the readahead algorithm checks loosely
* for sequential patterns. Hence interleaved reads might be served as
* sequential ones.
*
* There is a special-case: if the first page which the application tries to
* read happens to be the first page of the file, it is assumed that a linear
* read is about to happen and the window is immediately set to the initial size
* based on I/O request size and the max_readahead.
*
* The code ramps up the readahead size aggressively at first, but slow down as
* it approaches max_readhead.
*/
/*
* Count contiguously cached pages from @offset-1 to @offset-@max,
* this count is a conservative estimation of
* - length of the sequential read sequence, or
* - thrashing threshold in memory tight systems
*/
static pgoff_t count_history_pages(struct address_space *mapping,
pgoff_t offset, unsigned long max)
{
pgoff_t head;
rcu_read_lock();
head = page_cache_prev_hole(mapping, offset - 1, max);
rcu_read_unlock();
return offset - 1 - head;
}
/*
* page cache context based read-ahead
*/
static int try_context_readahead(struct address_space *mapping,
struct file_ra_state *ra,
pgoff_t offset,
unsigned long req_size,
unsigned long max)
{
pgoff_t size;
size = count_history_pages(mapping, offset, max);
/*
* not enough history pages:
* it could be a random read
*/
if (size <= req_size)
return 0;
/*
* starts from beginning of file:
* it is a strong indication of long-run stream (or whole-file-read)
*/
if (size >= offset)
size *= 2;
ra->start = offset;
ra->size = min(size + req_size, max);
ra->async_size = 1;
return 1;
}
/*
* A minimal readahead algorithm for trivial sequential/random reads.
*/
static unsigned long
ondemand_readahead(struct address_space *mapping,
struct file_ra_state *ra, struct file *filp,
bool hit_readahead_marker, pgoff_t offset,
unsigned long req_size)
{
unsigned long max = max_sane_readahead(ra->ra_pages);
pgoff_t prev_offset;
/*
* start of file
*/
if (!offset)
goto initial_readahead;
/*
* It's the expected callback offset, assume sequential access.
* Ramp up sizes, and push forward the readahead window.
*/
if ((offset == (ra->start + ra->size - ra->async_size) ||
offset == (ra->start + ra->size))) {
ra->start += ra->size;
ra->size = get_next_ra_size(ra, max);
ra->async_size = ra->size;
goto readit;
}
/*
* Hit a marked page without valid readahead state.
* E.g. interleaved reads.
* Query the pagecache for async_size, which normally equals to
* readahead size. Ramp it up and use it as the new readahead size.
*/
if (hit_readahead_marker) {
pgoff_t start;
rcu_read_lock();
start = page_cache_next_hole(mapping, offset + 1, max);
rcu_read_unlock();
if (!start || start - offset > max)
return 0;
ra->start = start;
ra->size = start - offset; /* old async_size */
ra->size += req_size;
ra->size = get_next_ra_size(ra, max);
ra->async_size = ra->size;
goto readit;
}
/*
* oversize read
*/
if (req_size > max)
goto initial_readahead;
/*
* sequential cache miss
* trivial case: (offset - prev_offset) == 1
* unaligned reads: (offset - prev_offset) == 0
*/
prev_offset = (unsigned long long)ra->prev_pos >> PAGE_CACHE_SHIFT;
if (offset - prev_offset <= 1UL)
goto initial_readahead;
/*
* Query the page cache and look for the traces(cached history pages)
* that a sequential stream would leave behind.
*/
if (try_context_readahead(mapping, ra, offset, req_size, max))
goto readit;
/*
* standalone, small random read
* Read as is, and do not pollute the readahead state.
*/
return __do_page_cache_readahead(mapping, filp, offset, req_size, 0);
initial_readahead:
ra->start = offset;
ra->size = get_init_ra_size(req_size, max);
ra->async_size = ra->size > req_size ? ra->size - req_size : ra->size;
readit:
/*
* Will this read hit the readahead marker made by itself?
* If so, trigger the readahead marker hit now, and merge
* the resulted next readahead window into the current one.
*/
if (offset == ra->start && ra->size == ra->async_size) {
ra->async_size = get_next_ra_size(ra, max);
ra->size += ra->async_size;
}
return ra_submit(ra, mapping, filp);
}
/**
* page_cache_sync_readahead - generic file readahead
* @mapping: address_space which holds the pagecache and I/O vectors
* @ra: file_ra_state which holds the readahead state
* @filp: passed on to ->readpage() and ->readpages()
* @offset: start offset into @mapping, in pagecache page-sized units
* @req_size: hint: total size of the read which the caller is performing in
* pagecache pages
*
* page_cache_sync_readahead() should be called when a cache miss happened:
* it will submit the read. The readahead logic may decide to piggyback more
* pages onto the read request if access patterns suggest it will improve
* performance.
*/
void page_cache_sync_readahead(struct address_space *mapping,
struct file_ra_state *ra, struct file *filp,
pgoff_t offset, unsigned long req_size)
{
/* no read-ahead */
if (!ra->ra_pages)
return;
/* be dumb */
if (filp && (filp->f_mode & FMODE_RANDOM)) {
force_page_cache_readahead(mapping, filp, offset, req_size);
return;
}
/* do read-ahead */
ondemand_readahead(mapping, ra, filp, false, offset, req_size);
}
EXPORT_SYMBOL_GPL(page_cache_sync_readahead);
/**
* page_cache_async_readahead - file readahead for marked pages
* @mapping: address_space which holds the pagecache and I/O vectors
* @ra: file_ra_state which holds the readahead state
* @filp: passed on to ->readpage() and ->readpages()
* @page: the page at @offset which has the PG_readahead flag set
* @offset: start offset into @mapping, in pagecache page-sized units
* @req_size: hint: total size of the read which the caller is performing in
* pagecache pages
*
* page_cache_async_readahead() should be called when a page is used which
* has the PG_readahead flag; this is a marker to suggest that the application
* has used up enough of the readahead window that we should start pulling in
* more pages.
*/
void
page_cache_async_readahead(struct address_space *mapping,
struct file_ra_state *ra, struct file *filp,
struct page *page, pgoff_t offset,
unsigned long req_size)
{
/* no read-ahead */
if (!ra->ra_pages)
return;
/*
* Same bit is used for PG_readahead and PG_reclaim.
*/
if (PageWriteback(page))
return;
ClearPageReadahead(page);
/*
* Defer asynchronous read-ahead on IO congestion.
*/
if (bdi_read_congested(mapping->backing_dev_info))
return;
/* do read-ahead */
ondemand_readahead(mapping, ra, filp, true, offset, req_size);
}
EXPORT_SYMBOL_GPL(page_cache_async_readahead);
static ssize_t
do_readahead(struct address_space *mapping, struct file *filp,
pgoff_t index, unsigned long nr)
{
if (!mapping || !mapping->a_ops)
return -EINVAL;
return force_page_cache_readahead(mapping, filp, index, nr);
}
SYSCALL_DEFINE3(readahead, int, fd, loff_t, offset, size_t, count)
{
ssize_t ret;
struct fd f;
ret = -EBADF;
f = fdget(fd);
if (f.file) {
if (f.file->f_mode & FMODE_READ) {
struct address_space *mapping = f.file->f_mapping;
pgoff_t start = offset >> PAGE_CACHE_SHIFT;
pgoff_t end = (offset + count - 1) >> PAGE_CACHE_SHIFT;
unsigned long len = end - start + 1;
ret = do_readahead(mapping, f.file, start, len);
}
fdput(f);
}
return ret;
}

1821
mm/rmap.c Normal file

File diff suppressed because it is too large Load diff

3473
mm/shmem.c Normal file

File diff suppressed because it is too large Load diff

4230
mm/slab.c Normal file

File diff suppressed because it is too large Load diff

363
mm/slab.h Normal file
View file

@ -0,0 +1,363 @@
#ifndef MM_SLAB_H
#define MM_SLAB_H
/*
* Internal slab definitions
*/
#ifdef CONFIG_SLOB
/*
* Common fields provided in kmem_cache by all slab allocators
* This struct is either used directly by the allocator (SLOB)
* or the allocator must include definitions for all fields
* provided in kmem_cache_common in their definition of kmem_cache.
*
* Once we can do anonymous structs (C11 standard) we could put a
* anonymous struct definition in these allocators so that the
* separate allocations in the kmem_cache structure of SLAB and
* SLUB is no longer needed.
*/
struct kmem_cache {
unsigned int object_size;/* The original size of the object */
unsigned int size; /* The aligned/padded/added on size */
unsigned int align; /* Alignment as calculated */
unsigned long flags; /* Active flags on the slab */
const char *name; /* Slab name for sysfs */
int refcount; /* Use counter */
void (*ctor)(void *); /* Called on object slot creation */
struct list_head list; /* List of all slab caches on the system */
};
#endif /* CONFIG_SLOB */
#ifdef CONFIG_SLAB
#include <linux/slab_def.h>
#endif
#ifdef CONFIG_SLUB
#include <linux/slub_def.h>
#endif
#include <linux/memcontrol.h>
/*
* State of the slab allocator.
*
* This is used to describe the states of the allocator during bootup.
* Allocators use this to gradually bootstrap themselves. Most allocators
* have the problem that the structures used for managing slab caches are
* allocated from slab caches themselves.
*/
enum slab_state {
DOWN, /* No slab functionality yet */
PARTIAL, /* SLUB: kmem_cache_node available */
PARTIAL_NODE, /* SLAB: kmalloc size for node struct available */
UP, /* Slab caches usable but not all extras yet */
FULL /* Everything is working */
};
extern enum slab_state slab_state;
/* The slab cache mutex protects the management structures during changes */
extern struct mutex slab_mutex;
/* The list of all slab caches on the system */
extern struct list_head slab_caches;
/* The slab cache that manages slab cache information */
extern struct kmem_cache *kmem_cache;
unsigned long calculate_alignment(unsigned long flags,
unsigned long align, unsigned long size);
#ifndef CONFIG_SLOB
/* Kmalloc array related functions */
void create_kmalloc_caches(unsigned long);
/* Find the kmalloc slab corresponding for a certain size */
struct kmem_cache *kmalloc_slab(size_t, gfp_t);
#endif
/* Functions provided by the slab allocators */
extern int __kmem_cache_create(struct kmem_cache *, unsigned long flags);
extern struct kmem_cache *create_kmalloc_cache(const char *name, size_t size,
unsigned long flags);
extern void create_boot_cache(struct kmem_cache *, const char *name,
size_t size, unsigned long flags);
struct mem_cgroup;
int slab_unmergeable(struct kmem_cache *s);
struct kmem_cache *find_mergeable(size_t size, size_t align,
unsigned long flags, const char *name, void (*ctor)(void *));
#ifndef CONFIG_SLOB
struct kmem_cache *
__kmem_cache_alias(const char *name, size_t size, size_t align,
unsigned long flags, void (*ctor)(void *));
unsigned long kmem_cache_flags(unsigned long object_size,
unsigned long flags, const char *name,
void (*ctor)(void *));
#else
static inline struct kmem_cache *
__kmem_cache_alias(const char *name, size_t size, size_t align,
unsigned long flags, void (*ctor)(void *))
{ return NULL; }
static inline unsigned long kmem_cache_flags(unsigned long object_size,
unsigned long flags, const char *name,
void (*ctor)(void *))
{
return flags;
}
#endif
/* Legal flag mask for kmem_cache_create(), for various configurations */
#define SLAB_CORE_FLAGS (SLAB_HWCACHE_ALIGN | SLAB_CACHE_DMA | SLAB_PANIC | \
SLAB_DESTROY_BY_RCU | SLAB_DEBUG_OBJECTS )
#if defined(CONFIG_DEBUG_SLAB)
#define SLAB_DEBUG_FLAGS (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER)
#elif defined(CONFIG_SLUB_DEBUG)
#define SLAB_DEBUG_FLAGS (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER | \
SLAB_TRACE | SLAB_DEBUG_FREE)
#else
#define SLAB_DEBUG_FLAGS (0)
#endif
#if defined(CONFIG_SLAB)
#define SLAB_CACHE_FLAGS (SLAB_MEM_SPREAD | SLAB_NOLEAKTRACE | \
SLAB_RECLAIM_ACCOUNT | SLAB_TEMPORARY | SLAB_NOTRACK)
#elif defined(CONFIG_SLUB)
#define SLAB_CACHE_FLAGS (SLAB_NOLEAKTRACE | SLAB_RECLAIM_ACCOUNT | \
SLAB_TEMPORARY | SLAB_NOTRACK)
#else
#define SLAB_CACHE_FLAGS (0)
#endif
#define CACHE_CREATE_MASK (SLAB_CORE_FLAGS | SLAB_DEBUG_FLAGS | SLAB_CACHE_FLAGS)
int __kmem_cache_shutdown(struct kmem_cache *);
int __kmem_cache_shrink(struct kmem_cache *);
void slab_kmem_cache_release(struct kmem_cache *);
struct seq_file;
struct file;
struct slabinfo {
unsigned long active_objs;
unsigned long num_objs;
unsigned long active_slabs;
unsigned long num_slabs;
unsigned long shared_avail;
unsigned int limit;
unsigned int batchcount;
unsigned int shared;
unsigned int objects_per_slab;
unsigned int cache_order;
};
void get_slabinfo(struct kmem_cache *s, struct slabinfo *sinfo);
void slabinfo_show_stats(struct seq_file *m, struct kmem_cache *s);
ssize_t slabinfo_write(struct file *file, const char __user *buffer,
size_t count, loff_t *ppos);
#ifdef CONFIG_MEMCG_KMEM
static inline bool is_root_cache(struct kmem_cache *s)
{
return !s->memcg_params || s->memcg_params->is_root_cache;
}
static inline bool slab_equal_or_root(struct kmem_cache *s,
struct kmem_cache *p)
{
return (p == s) ||
(s->memcg_params && (p == s->memcg_params->root_cache));
}
/*
* We use suffixes to the name in memcg because we can't have caches
* created in the system with the same name. But when we print them
* locally, better refer to them with the base name
*/
static inline const char *cache_name(struct kmem_cache *s)
{
if (!is_root_cache(s))
return s->memcg_params->root_cache->name;
return s->name;
}
/*
* Note, we protect with RCU only the memcg_caches array, not per-memcg caches.
* That said the caller must assure the memcg's cache won't go away. Since once
* created a memcg's cache is destroyed only along with the root cache, it is
* true if we are going to allocate from the cache or hold a reference to the
* root cache by other means. Otherwise, we should hold either the slab_mutex
* or the memcg's slab_caches_mutex while calling this function and accessing
* the returned value.
*/
static inline struct kmem_cache *
cache_from_memcg_idx(struct kmem_cache *s, int idx)
{
struct kmem_cache *cachep;
struct memcg_cache_params *params;
if (!s->memcg_params)
return NULL;
rcu_read_lock();
params = rcu_dereference(s->memcg_params);
cachep = params->memcg_caches[idx];
rcu_read_unlock();
/*
* Make sure we will access the up-to-date value. The code updating
* memcg_caches issues a write barrier to match this (see
* memcg_register_cache()).
*/
smp_read_barrier_depends();
return cachep;
}
static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
{
if (is_root_cache(s))
return s;
return s->memcg_params->root_cache;
}
static __always_inline int memcg_charge_slab(struct kmem_cache *s,
gfp_t gfp, int order)
{
if (!memcg_kmem_enabled())
return 0;
if (is_root_cache(s))
return 0;
return __memcg_charge_slab(s, gfp, order);
}
static __always_inline void memcg_uncharge_slab(struct kmem_cache *s, int order)
{
if (!memcg_kmem_enabled())
return;
if (is_root_cache(s))
return;
__memcg_uncharge_slab(s, order);
}
#else
static inline bool is_root_cache(struct kmem_cache *s)
{
return true;
}
static inline bool slab_equal_or_root(struct kmem_cache *s,
struct kmem_cache *p)
{
return true;
}
static inline const char *cache_name(struct kmem_cache *s)
{
return s->name;
}
static inline struct kmem_cache *
cache_from_memcg_idx(struct kmem_cache *s, int idx)
{
return NULL;
}
static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
{
return s;
}
static inline int memcg_charge_slab(struct kmem_cache *s, gfp_t gfp, int order)
{
return 0;
}
static inline void memcg_uncharge_slab(struct kmem_cache *s, int order)
{
}
#endif
static inline struct kmem_cache *cache_from_obj(struct kmem_cache *s, void *x)
{
struct kmem_cache *cachep;
struct page *page;
/*
* When kmemcg is not being used, both assignments should return the
* same value. but we don't want to pay the assignment price in that
* case. If it is not compiled in, the compiler should be smart enough
* to not do even the assignment. In that case, slab_equal_or_root
* will also be a constant.
*/
if (!memcg_kmem_enabled() && !unlikely(s->flags & SLAB_DEBUG_FREE))
return s;
page = virt_to_head_page(x);
cachep = page->slab_cache;
if (slab_equal_or_root(cachep, s))
return cachep;
pr_err("%s: Wrong slab cache. %s but object is from %s\n",
__func__, cachep->name, s->name);
WARN_ON_ONCE(1);
return s;
}
#ifndef CONFIG_SLOB
/*
* The slab lists for all objects.
*/
struct kmem_cache_node {
spinlock_t list_lock;
#ifdef CONFIG_SLAB
struct list_head slabs_partial; /* partial list first, better asm code */
struct list_head slabs_full;
struct list_head slabs_free;
unsigned long free_objects;
unsigned int free_limit;
unsigned int colour_next; /* Per-node cache coloring */
struct array_cache *shared; /* shared per node */
struct alien_cache **alien; /* on other nodes */
unsigned long next_reap; /* updated without locking */
int free_touched; /* updated without locking */
#endif
#ifdef CONFIG_SLUB
unsigned long nr_partial;
struct list_head partial;
#ifdef CONFIG_SLUB_DEBUG
atomic_long_t nr_slabs;
atomic_long_t total_objects;
struct list_head full;
#endif
#endif
};
static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
{
return s->node[node];
}
/*
* Iterator over all nodes. The body will be executed for each node that has
* a kmem_cache_node structure allocated (which is true for all online nodes)
*/
#define for_each_kmem_cache_node(__s, __node, __n) \
for (__node = 0; __node < nr_node_ids; __node++) \
if ((__n = get_node(__s, __node)))
#endif
void *slab_next(struct seq_file *m, void *p, loff_t *pos);
void slab_stop(struct seq_file *m, void *p);
#endif /* MM_SLAB_H */

1054
mm/slab_common.c Normal file

File diff suppressed because it is too large Load diff

642
mm/slob.c Normal file
View file

@ -0,0 +1,642 @@
/*
* SLOB Allocator: Simple List Of Blocks
*
* Matt Mackall <mpm@selenic.com> 12/30/03
*
* NUMA support by Paul Mundt, 2007.
*
* How SLOB works:
*
* The core of SLOB is a traditional K&R style heap allocator, with
* support for returning aligned objects. The granularity of this
* allocator is as little as 2 bytes, however typically most architectures
* will require 4 bytes on 32-bit and 8 bytes on 64-bit.
*
* The slob heap is a set of linked list of pages from alloc_pages(),
* and within each page, there is a singly-linked list of free blocks
* (slob_t). The heap is grown on demand. To reduce fragmentation,
* heap pages are segregated into three lists, with objects less than
* 256 bytes, objects less than 1024 bytes, and all other objects.
*
* Allocation from heap involves first searching for a page with
* sufficient free blocks (using a next-fit-like approach) followed by
* a first-fit scan of the page. Deallocation inserts objects back
* into the free list in address order, so this is effectively an
* address-ordered first fit.
*
* Above this is an implementation of kmalloc/kfree. Blocks returned
* from kmalloc are prepended with a 4-byte header with the kmalloc size.
* If kmalloc is asked for objects of PAGE_SIZE or larger, it calls
* alloc_pages() directly, allocating compound pages so the page order
* does not have to be separately tracked.
* These objects are detected in kfree() because PageSlab()
* is false for them.
*
* SLAB is emulated on top of SLOB by simply calling constructors and
* destructors for every SLAB allocation. Objects are returned with the
* 4-byte alignment unless the SLAB_HWCACHE_ALIGN flag is set, in which
* case the low-level allocator will fragment blocks to create the proper
* alignment. Again, objects of page-size or greater are allocated by
* calling alloc_pages(). As SLAB objects know their size, no separate
* size bookkeeping is necessary and there is essentially no allocation
* space overhead, and compound pages aren't needed for multi-page
* allocations.
*
* NUMA support in SLOB is fairly simplistic, pushing most of the real
* logic down to the page allocator, and simply doing the node accounting
* on the upper levels. In the event that a node id is explicitly
* provided, alloc_pages_exact_node() with the specified node id is used
* instead. The common case (or when the node id isn't explicitly provided)
* will default to the current node, as per numa_node_id().
*
* Node aware pages are still inserted in to the global freelist, and
* these are scanned for by matching against the node id encoded in the
* page flags. As a result, block allocations that can be satisfied from
* the freelist will only be done so on pages residing on the same node,
* in order to prevent random node placement.
*/
#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/mm.h>
#include <linux/swap.h> /* struct reclaim_state */
#include <linux/cache.h>
#include <linux/init.h>
#include <linux/export.h>
#include <linux/rcupdate.h>
#include <linux/list.h>
#include <linux/kmemleak.h>
#include <trace/events/kmem.h>
#include <linux/atomic.h>
#include "slab.h"
/*
* slob_block has a field 'units', which indicates size of block if +ve,
* or offset of next block if -ve (in SLOB_UNITs).
*
* Free blocks of size 1 unit simply contain the offset of the next block.
* Those with larger size contain their size in the first SLOB_UNIT of
* memory, and the offset of the next free block in the second SLOB_UNIT.
*/
#if PAGE_SIZE <= (32767 * 2)
typedef s16 slobidx_t;
#else
typedef s32 slobidx_t;
#endif
struct slob_block {
slobidx_t units;
};
typedef struct slob_block slob_t;
/*
* All partially free slob pages go on these lists.
*/
#define SLOB_BREAK1 256
#define SLOB_BREAK2 1024
static LIST_HEAD(free_slob_small);
static LIST_HEAD(free_slob_medium);
static LIST_HEAD(free_slob_large);
/*
* slob_page_free: true for pages on free_slob_pages list.
*/
static inline int slob_page_free(struct page *sp)
{
return PageSlobFree(sp);
}
static void set_slob_page_free(struct page *sp, struct list_head *list)
{
list_add(&sp->lru, list);
__SetPageSlobFree(sp);
}
static inline void clear_slob_page_free(struct page *sp)
{
list_del(&sp->lru);
__ClearPageSlobFree(sp);
}
#define SLOB_UNIT sizeof(slob_t)
#define SLOB_UNITS(size) DIV_ROUND_UP(size, SLOB_UNIT)
/*
* struct slob_rcu is inserted at the tail of allocated slob blocks, which
* were created with a SLAB_DESTROY_BY_RCU slab. slob_rcu is used to free
* the block using call_rcu.
*/
struct slob_rcu {
struct rcu_head head;
int size;
};
/*
* slob_lock protects all slob allocator structures.
*/
static DEFINE_SPINLOCK(slob_lock);
/*
* Encode the given size and next info into a free slob block s.
*/
static void set_slob(slob_t *s, slobidx_t size, slob_t *next)
{
slob_t *base = (slob_t *)((unsigned long)s & PAGE_MASK);
slobidx_t offset = next - base;
if (size > 1) {
s[0].units = size;
s[1].units = offset;
} else
s[0].units = -offset;
}
/*
* Return the size of a slob block.
*/
static slobidx_t slob_units(slob_t *s)
{
if (s->units > 0)
return s->units;
return 1;
}
/*
* Return the next free slob block pointer after this one.
*/
static slob_t *slob_next(slob_t *s)
{
slob_t *base = (slob_t *)((unsigned long)s & PAGE_MASK);
slobidx_t next;
if (s[0].units < 0)
next = -s[0].units;
else
next = s[1].units;
return base+next;
}
/*
* Returns true if s is the last free block in its page.
*/
static int slob_last(slob_t *s)
{
return !((unsigned long)slob_next(s) & ~PAGE_MASK);
}
static void *slob_new_pages(gfp_t gfp, int order, int node)
{
void *page;
#ifdef CONFIG_NUMA
if (node != NUMA_NO_NODE)
page = alloc_pages_exact_node(node, gfp, order);
else
#endif
page = alloc_pages(gfp, order);
if (!page)
return NULL;
return page_address(page);
}
static void slob_free_pages(void *b, int order)
{
if (current->reclaim_state)
current->reclaim_state->reclaimed_slab += 1 << order;
free_pages((unsigned long)b, order);
}
/*
* Allocate a slob block within a given slob_page sp.
*/
static void *slob_page_alloc(struct page *sp, size_t size, int align)
{
slob_t *prev, *cur, *aligned = NULL;
int delta = 0, units = SLOB_UNITS(size);
for (prev = NULL, cur = sp->freelist; ; prev = cur, cur = slob_next(cur)) {
slobidx_t avail = slob_units(cur);
if (align) {
aligned = (slob_t *)ALIGN((unsigned long)cur, align);
delta = aligned - cur;
}
if (avail >= units + delta) { /* room enough? */
slob_t *next;
if (delta) { /* need to fragment head to align? */
next = slob_next(cur);
set_slob(aligned, avail - delta, next);
set_slob(cur, delta, aligned);
prev = cur;
cur = aligned;
avail = slob_units(cur);
}
next = slob_next(cur);
if (avail == units) { /* exact fit? unlink. */
if (prev)
set_slob(prev, slob_units(prev), next);
else
sp->freelist = next;
} else { /* fragment */
if (prev)
set_slob(prev, slob_units(prev), cur + units);
else
sp->freelist = cur + units;
set_slob(cur + units, avail - units, next);
}
sp->units -= units;
if (!sp->units)
clear_slob_page_free(sp);
return cur;
}
if (slob_last(cur))
return NULL;
}
}
/*
* slob_alloc: entry point into the slob allocator.
*/
static void *slob_alloc(size_t size, gfp_t gfp, int align, int node)
{
struct page *sp;
struct list_head *prev;
struct list_head *slob_list;
slob_t *b = NULL;
unsigned long flags;
if (size < SLOB_BREAK1)
slob_list = &free_slob_small;
else if (size < SLOB_BREAK2)
slob_list = &free_slob_medium;
else
slob_list = &free_slob_large;
spin_lock_irqsave(&slob_lock, flags);
/* Iterate through each partially free page, try to find room */
list_for_each_entry(sp, slob_list, lru) {
#ifdef CONFIG_NUMA
/*
* If there's a node specification, search for a partial
* page with a matching node id in the freelist.
*/
if (node != NUMA_NO_NODE && page_to_nid(sp) != node)
continue;
#endif
/* Enough room on this page? */
if (sp->units < SLOB_UNITS(size))
continue;
/* Attempt to alloc */
prev = sp->lru.prev;
b = slob_page_alloc(sp, size, align);
if (!b)
continue;
/* Improve fragment distribution and reduce our average
* search time by starting our next search here. (see
* Knuth vol 1, sec 2.5, pg 449) */
if (prev != slob_list->prev &&
slob_list->next != prev->next)
list_move_tail(slob_list, prev->next);
break;
}
spin_unlock_irqrestore(&slob_lock, flags);
/* Not enough space: must allocate a new page */
if (!b) {
b = slob_new_pages(gfp & ~__GFP_ZERO, 0, node);
if (!b)
return NULL;
sp = virt_to_page(b);
__SetPageSlab(sp);
spin_lock_irqsave(&slob_lock, flags);
sp->units = SLOB_UNITS(PAGE_SIZE);
sp->freelist = b;
INIT_LIST_HEAD(&sp->lru);
set_slob(b, SLOB_UNITS(PAGE_SIZE), b + SLOB_UNITS(PAGE_SIZE));
set_slob_page_free(sp, slob_list);
b = slob_page_alloc(sp, size, align);
BUG_ON(!b);
spin_unlock_irqrestore(&slob_lock, flags);
}
if (unlikely((gfp & __GFP_ZERO) && b))
memset(b, 0, size);
return b;
}
/*
* slob_free: entry point into the slob allocator.
*/
static void slob_free(void *block, int size)
{
struct page *sp;
slob_t *prev, *next, *b = (slob_t *)block;
slobidx_t units;
unsigned long flags;
struct list_head *slob_list;
if (unlikely(ZERO_OR_NULL_PTR(block)))
return;
BUG_ON(!size);
sp = virt_to_page(block);
units = SLOB_UNITS(size);
spin_lock_irqsave(&slob_lock, flags);
if (sp->units + units == SLOB_UNITS(PAGE_SIZE)) {
/* Go directly to page allocator. Do not pass slob allocator */
if (slob_page_free(sp))
clear_slob_page_free(sp);
spin_unlock_irqrestore(&slob_lock, flags);
__ClearPageSlab(sp);
page_mapcount_reset(sp);
slob_free_pages(b, 0);
return;
}
if (!slob_page_free(sp)) {
/* This slob page is about to become partially free. Easy! */
sp->units = units;
sp->freelist = b;
set_slob(b, units,
(void *)((unsigned long)(b +
SLOB_UNITS(PAGE_SIZE)) & PAGE_MASK));
if (size < SLOB_BREAK1)
slob_list = &free_slob_small;
else if (size < SLOB_BREAK2)
slob_list = &free_slob_medium;
else
slob_list = &free_slob_large;
set_slob_page_free(sp, slob_list);
goto out;
}
/*
* Otherwise the page is already partially free, so find reinsertion
* point.
*/
sp->units += units;
if (b < (slob_t *)sp->freelist) {
if (b + units == sp->freelist) {
units += slob_units(sp->freelist);
sp->freelist = slob_next(sp->freelist);
}
set_slob(b, units, sp->freelist);
sp->freelist = b;
} else {
prev = sp->freelist;
next = slob_next(prev);
while (b > next) {
prev = next;
next = slob_next(prev);
}
if (!slob_last(prev) && b + units == next) {
units += slob_units(next);
set_slob(b, units, slob_next(next));
} else
set_slob(b, units, next);
if (prev + slob_units(prev) == b) {
units = slob_units(b) + slob_units(prev);
set_slob(prev, units, slob_next(b));
} else
set_slob(prev, slob_units(prev), b);
}
out:
spin_unlock_irqrestore(&slob_lock, flags);
}
/*
* End of slob allocator proper. Begin kmem_cache_alloc and kmalloc frontend.
*/
static __always_inline void *
__do_kmalloc_node(size_t size, gfp_t gfp, int node, unsigned long caller)
{
unsigned int *m;
int align = max_t(size_t, ARCH_KMALLOC_MINALIGN, ARCH_SLAB_MINALIGN);
void *ret;
gfp &= gfp_allowed_mask;
lockdep_trace_alloc(gfp);
if (size < PAGE_SIZE - align) {
if (!size)
return ZERO_SIZE_PTR;
m = slob_alloc(size + align, gfp, align, node);
if (!m)
return NULL;
*m = size;
ret = (void *)m + align;
trace_kmalloc_node(caller, ret,
size, size + align, gfp, node);
} else {
unsigned int order = get_order(size);
if (likely(order))
gfp |= __GFP_COMP;
ret = slob_new_pages(gfp, order, node);
trace_kmalloc_node(caller, ret,
size, PAGE_SIZE << order, gfp, node);
}
kmemleak_alloc(ret, size, 1, gfp);
return ret;
}
void *__kmalloc(size_t size, gfp_t gfp)
{
return __do_kmalloc_node(size, gfp, NUMA_NO_NODE, _RET_IP_);
}
EXPORT_SYMBOL(__kmalloc);
void *__kmalloc_track_caller(size_t size, gfp_t gfp, unsigned long caller)
{
return __do_kmalloc_node(size, gfp, NUMA_NO_NODE, caller);
}
#ifdef CONFIG_NUMA
void *__kmalloc_node_track_caller(size_t size, gfp_t gfp,
int node, unsigned long caller)
{
return __do_kmalloc_node(size, gfp, node, caller);
}
#endif
void kfree(const void *block)
{
struct page *sp;
trace_kfree(_RET_IP_, block);
if (unlikely(ZERO_OR_NULL_PTR(block)))
return;
kmemleak_free(block);
sp = virt_to_page(block);
if (PageSlab(sp)) {
int align = max_t(size_t, ARCH_KMALLOC_MINALIGN, ARCH_SLAB_MINALIGN);
unsigned int *m = (unsigned int *)(block - align);
slob_free(m, *m + align);
} else
__free_pages(sp, compound_order(sp));
}
EXPORT_SYMBOL(kfree);
/* can't use ksize for kmem_cache_alloc memory, only kmalloc */
size_t ksize(const void *block)
{
struct page *sp;
int align;
unsigned int *m;
BUG_ON(!block);
if (unlikely(block == ZERO_SIZE_PTR))
return 0;
sp = virt_to_page(block);
if (unlikely(!PageSlab(sp)))
return PAGE_SIZE << compound_order(sp);
align = max_t(size_t, ARCH_KMALLOC_MINALIGN, ARCH_SLAB_MINALIGN);
m = (unsigned int *)(block - align);
return SLOB_UNITS(*m) * SLOB_UNIT;
}
EXPORT_SYMBOL(ksize);
int __kmem_cache_create(struct kmem_cache *c, unsigned long flags)
{
if (flags & SLAB_DESTROY_BY_RCU) {
/* leave room for rcu footer at the end of object */
c->size += sizeof(struct slob_rcu);
}
c->flags = flags;
return 0;
}
void *slob_alloc_node(struct kmem_cache *c, gfp_t flags, int node)
{
void *b;
flags &= gfp_allowed_mask;
lockdep_trace_alloc(flags);
if (c->size < PAGE_SIZE) {
b = slob_alloc(c->size, flags, c->align, node);
trace_kmem_cache_alloc_node(_RET_IP_, b, c->object_size,
SLOB_UNITS(c->size) * SLOB_UNIT,
flags, node);
} else {
b = slob_new_pages(flags, get_order(c->size), node);
trace_kmem_cache_alloc_node(_RET_IP_, b, c->object_size,
PAGE_SIZE << get_order(c->size),
flags, node);
}
if (b && c->ctor)
c->ctor(b);
kmemleak_alloc_recursive(b, c->size, 1, c->flags, flags);
return b;
}
EXPORT_SYMBOL(slob_alloc_node);
void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags)
{
return slob_alloc_node(cachep, flags, NUMA_NO_NODE);
}
EXPORT_SYMBOL(kmem_cache_alloc);
#ifdef CONFIG_NUMA
void *__kmalloc_node(size_t size, gfp_t gfp, int node)
{
return __do_kmalloc_node(size, gfp, node, _RET_IP_);
}
EXPORT_SYMBOL(__kmalloc_node);
void *kmem_cache_alloc_node(struct kmem_cache *cachep, gfp_t gfp, int node)
{
return slob_alloc_node(cachep, gfp, node);
}
EXPORT_SYMBOL(kmem_cache_alloc_node);
#endif
static void __kmem_cache_free(void *b, int size)
{
if (size < PAGE_SIZE)
slob_free(b, size);
else
slob_free_pages(b, get_order(size));
}
static void kmem_rcu_free(struct rcu_head *head)
{
struct slob_rcu *slob_rcu = (struct slob_rcu *)head;
void *b = (void *)slob_rcu - (slob_rcu->size - sizeof(struct slob_rcu));
__kmem_cache_free(b, slob_rcu->size);
}
void kmem_cache_free(struct kmem_cache *c, void *b)
{
kmemleak_free_recursive(b, c->flags);
if (unlikely(c->flags & SLAB_DESTROY_BY_RCU)) {
struct slob_rcu *slob_rcu;
slob_rcu = b + (c->size - sizeof(struct slob_rcu));
slob_rcu->size = c->size;
call_rcu(&slob_rcu->head, kmem_rcu_free);
} else {
__kmem_cache_free(b, c->size);
}
trace_kmem_cache_free(_RET_IP_, b);
}
EXPORT_SYMBOL(kmem_cache_free);
int __kmem_cache_shutdown(struct kmem_cache *c)
{
/* No way to check for remaining objects */
return 0;
}
int __kmem_cache_shrink(struct kmem_cache *d)
{
return 0;
}
struct kmem_cache kmem_cache_boot = {
.name = "kmem_cache",
.size = sizeof(struct kmem_cache),
.flags = SLAB_PANIC,
.align = ARCH_KMALLOC_MINALIGN,
};
void __init kmem_cache_init(void)
{
kmem_cache = &kmem_cache_boot;
slab_state = UP;
}
void __init kmem_cache_init_late(void)
{
slab_state = FULL;
}

5257
mm/slub.c Normal file

File diff suppressed because it is too large Load diff

253
mm/sparse-vmemmap.c Normal file
View file

@ -0,0 +1,253 @@
/*
* Virtual Memory Map support
*
* (C) 2007 sgi. Christoph Lameter.
*
* Virtual memory maps allow VM primitives pfn_to_page, page_to_pfn,
* virt_to_page, page_address() to be implemented as a base offset
* calculation without memory access.
*
* However, virtual mappings need a page table and TLBs. Many Linux
* architectures already map their physical space using 1-1 mappings
* via TLBs. For those arches the virtual memory map is essentially
* for free if we use the same page size as the 1-1 mappings. In that
* case the overhead consists of a few additional pages that are
* allocated to create a view of memory for vmemmap.
*
* The architecture is expected to provide a vmemmap_populate() function
* to instantiate the mapping.
*/
#include <linux/mm.h>
#include <linux/mmzone.h>
#include <linux/bootmem.h>
#include <linux/highmem.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/vmalloc.h>
#include <linux/sched.h>
#include <asm/dma.h>
#include <asm/pgalloc.h>
#include <asm/pgtable.h>
/*
* Allocate a block of memory to be used to back the virtual memory map
* or to back the page tables that are used to create the mapping.
* Uses the main allocators if they are available, else bootmem.
*/
static void * __init_refok __earlyonly_bootmem_alloc(int node,
unsigned long size,
unsigned long align,
unsigned long goal)
{
return memblock_virt_alloc_try_nid(size, align, goal,
BOOTMEM_ALLOC_ACCESSIBLE, node);
}
static void *vmemmap_buf;
static void *vmemmap_buf_end;
void * __meminit vmemmap_alloc_block(unsigned long size, int node)
{
/* If the main allocator is up use that, fallback to bootmem. */
if (slab_is_available()) {
struct page *page;
if (node_state(node, N_HIGH_MEMORY))
page = alloc_pages_node(
node, GFP_KERNEL | __GFP_ZERO | __GFP_REPEAT,
get_order(size));
else
page = alloc_pages(
GFP_KERNEL | __GFP_ZERO | __GFP_REPEAT,
get_order(size));
if (page)
return page_address(page);
return NULL;
} else
return __earlyonly_bootmem_alloc(node, size, size,
__pa(MAX_DMA_ADDRESS));
}
/* need to make sure size is all the same during early stage */
void * __meminit vmemmap_alloc_block_buf(unsigned long size, int node)
{
void *ptr;
if (!vmemmap_buf)
return vmemmap_alloc_block(size, node);
/* take the from buf */
ptr = (void *)ALIGN((unsigned long)vmemmap_buf, size);
if (ptr + size > vmemmap_buf_end)
return vmemmap_alloc_block(size, node);
vmemmap_buf = ptr + size;
return ptr;
}
void __meminit vmemmap_verify(pte_t *pte, int node,
unsigned long start, unsigned long end)
{
unsigned long pfn = pte_pfn(*pte);
int actual_node = early_pfn_to_nid(pfn);
if (node_distance(actual_node, node) > LOCAL_DISTANCE)
printk(KERN_WARNING "[%lx-%lx] potential offnode "
"page_structs\n", start, end - 1);
}
pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node)
{
pte_t *pte = pte_offset_kernel(pmd, addr);
if (pte_none(*pte)) {
pte_t entry;
void *p = vmemmap_alloc_block_buf(PAGE_SIZE, node);
if (!p)
return NULL;
entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL);
set_pte_at(&init_mm, addr, pte, entry);
}
return pte;
}
pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node)
{
pmd_t *pmd = pmd_offset(pud, addr);
if (pmd_none(*pmd)) {
void *p = vmemmap_alloc_block(PAGE_SIZE, node);
if (!p)
return NULL;
pmd_populate_kernel(&init_mm, pmd, p);
}
return pmd;
}
pud_t * __meminit vmemmap_pud_populate(pgd_t *pgd, unsigned long addr, int node)
{
#ifdef CONFIG_TIMA_RKP
int rkp_do = 0;
#endif
void *p = NULL ;
pud_t *pud = pud_offset(pgd, addr);
if (pud_none(*pud)) {
#ifdef CONFIG_TIMA_RKP
#ifdef CONFIG_KNOX_KAP
if (boot_mode_security)
#endif
rkp_do = 1;
if( rkp_do ){
p = rkp_ro_alloc();
if (!p)
p = vmemmap_alloc_block(PAGE_SIZE, node);
}else{
p = vmemmap_alloc_block(PAGE_SIZE, node);
}
#else /* !CONFIG_TIMA_RKP */
p = vmemmap_alloc_block(PAGE_SIZE, node);
#endif
if (!p)
return NULL;
pud_populate(&init_mm, pud, p);
}
return pud;
}
pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
{
pgd_t *pgd = pgd_offset_k(addr);
if (pgd_none(*pgd)) {
void *p = vmemmap_alloc_block(PAGE_SIZE, node);
if (!p)
return NULL;
pgd_populate(&init_mm, pgd, p);
}
return pgd;
}
int __meminit vmemmap_populate_basepages(unsigned long start,
unsigned long end, int node)
{
unsigned long addr = start;
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
for (; addr < end; addr += PAGE_SIZE) {
pgd = vmemmap_pgd_populate(addr, node);
if (!pgd)
return -ENOMEM;
pud = vmemmap_pud_populate(pgd, addr, node);
if (!pud)
return -ENOMEM;
pmd = vmemmap_pmd_populate(pud, addr, node);
if (!pmd)
return -ENOMEM;
pte = vmemmap_pte_populate(pmd, addr, node);
if (!pte)
return -ENOMEM;
vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
}
return 0;
}
struct page * __meminit sparse_mem_map_populate(unsigned long pnum, int nid)
{
unsigned long start;
unsigned long end;
struct page *map;
map = pfn_to_page(pnum * PAGES_PER_SECTION);
start = (unsigned long)map;
end = (unsigned long)(map + PAGES_PER_SECTION);
if (vmemmap_populate(start, end, nid))
return NULL;
return map;
}
void __init sparse_mem_maps_populate_node(struct page **map_map,
unsigned long pnum_begin,
unsigned long pnum_end,
unsigned long map_count, int nodeid)
{
unsigned long pnum;
unsigned long size = sizeof(struct page) * PAGES_PER_SECTION;
void *vmemmap_buf_start;
size = ALIGN(size, PMD_SIZE);
vmemmap_buf_start = __earlyonly_bootmem_alloc(nodeid, size * map_count,
PMD_SIZE, __pa(MAX_DMA_ADDRESS));
if (vmemmap_buf_start) {
vmemmap_buf = vmemmap_buf_start;
vmemmap_buf_end = vmemmap_buf_start + size * map_count;
}
for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
struct mem_section *ms;
if (!present_section_nr(pnum))
continue;
map_map[pnum] = sparse_mem_map_populate(pnum, nodeid);
if (map_map[pnum])
continue;
ms = __nr_to_section(pnum);
printk(KERN_ERR "%s: sparsemem memory map backing failed "
"some memory will not be available.\n", __func__);
ms->section_mem_map = 0;
}
if (vmemmap_buf_start) {
/* need to free left buf */
memblock_free_early(__pa(vmemmap_buf),
vmemmap_buf_end - vmemmap_buf);
vmemmap_buf = NULL;
vmemmap_buf_end = NULL;
}
}

811
mm/sparse.c Normal file
View file

@ -0,0 +1,811 @@
/*
* sparse memory mappings.
*/
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/mmzone.h>
#include <linux/bootmem.h>
#include <linux/compiler.h>
#include <linux/highmem.h>
#include <linux/export.h>
#include <linux/spinlock.h>
#include <linux/vmalloc.h>
#include "internal.h"
#include <asm/dma.h>
#include <asm/pgalloc.h>
#include <asm/pgtable.h>
/*
* Permanent SPARSEMEM data:
*
* 1) mem_section - memory sections, mem_map's for valid memory
*/
#ifdef CONFIG_SPARSEMEM_EXTREME
struct mem_section *mem_section[NR_SECTION_ROOTS]
____cacheline_internodealigned_in_smp;
#else
struct mem_section mem_section[NR_SECTION_ROOTS][SECTIONS_PER_ROOT]
____cacheline_internodealigned_in_smp;
#endif
EXPORT_SYMBOL(mem_section);
#ifdef NODE_NOT_IN_PAGE_FLAGS
/*
* If we did not store the node number in the page then we have to
* do a lookup in the section_to_node_table in order to find which
* node the page belongs to.
*/
#if MAX_NUMNODES <= 256
static u8 section_to_node_table[NR_MEM_SECTIONS] __cacheline_aligned;
#else
static u16 section_to_node_table[NR_MEM_SECTIONS] __cacheline_aligned;
#endif
int page_to_nid(const struct page *page)
{
return section_to_node_table[page_to_section(page)];
}
EXPORT_SYMBOL(page_to_nid);
static void set_section_nid(unsigned long section_nr, int nid)
{
section_to_node_table[section_nr] = nid;
}
#else /* !NODE_NOT_IN_PAGE_FLAGS */
static inline void set_section_nid(unsigned long section_nr, int nid)
{
}
#endif
#ifdef CONFIG_SPARSEMEM_EXTREME
static struct mem_section noinline __init_refok *sparse_index_alloc(int nid)
{
struct mem_section *section = NULL;
unsigned long array_size = SECTIONS_PER_ROOT *
sizeof(struct mem_section);
if (slab_is_available()) {
if (node_state(nid, N_HIGH_MEMORY))
section = kzalloc_node(array_size, GFP_KERNEL, nid);
else
section = kzalloc(array_size, GFP_KERNEL);
} else {
section = memblock_virt_alloc_node(array_size, nid);
}
return section;
}
static int __meminit sparse_index_init(unsigned long section_nr, int nid)
{
unsigned long root = SECTION_NR_TO_ROOT(section_nr);
struct mem_section *section;
if (mem_section[root])
return -EEXIST;
section = sparse_index_alloc(nid);
if (!section)
return -ENOMEM;
mem_section[root] = section;
return 0;
}
#else /* !SPARSEMEM_EXTREME */
static inline int sparse_index_init(unsigned long section_nr, int nid)
{
return 0;
}
#endif
/*
* Although written for the SPARSEMEM_EXTREME case, this happens
* to also work for the flat array case because
* NR_SECTION_ROOTS==NR_MEM_SECTIONS.
*/
int __section_nr(struct mem_section* ms)
{
unsigned long root_nr;
struct mem_section* root;
for (root_nr = 0; root_nr < NR_SECTION_ROOTS; root_nr++) {
root = __nr_to_section(root_nr * SECTIONS_PER_ROOT);
if (!root)
continue;
if ((ms >= root) && (ms < (root + SECTIONS_PER_ROOT)))
break;
}
VM_BUG_ON(root_nr == NR_SECTION_ROOTS);
return (root_nr * SECTIONS_PER_ROOT) + (ms - root);
}
/*
* During early boot, before section_mem_map is used for an actual
* mem_map, we use section_mem_map to store the section's NUMA
* node. This keeps us from having to use another data structure. The
* node information is cleared just before we store the real mem_map.
*/
static inline unsigned long sparse_encode_early_nid(int nid)
{
return (nid << SECTION_NID_SHIFT);
}
static inline int sparse_early_nid(struct mem_section *section)
{
return (section->section_mem_map >> SECTION_NID_SHIFT);
}
/* Validate the physical addressing limitations of the model */
void __meminit mminit_validate_memmodel_limits(unsigned long *start_pfn,
unsigned long *end_pfn)
{
unsigned long max_sparsemem_pfn = 1UL << (MAX_PHYSMEM_BITS-PAGE_SHIFT);
/*
* Sanity checks - do not allow an architecture to pass
* in larger pfns than the maximum scope of sparsemem:
*/
if (*start_pfn > max_sparsemem_pfn) {
mminit_dprintk(MMINIT_WARNING, "pfnvalidation",
"Start of range %lu -> %lu exceeds SPARSEMEM max %lu\n",
*start_pfn, *end_pfn, max_sparsemem_pfn);
WARN_ON_ONCE(1);
*start_pfn = max_sparsemem_pfn;
*end_pfn = max_sparsemem_pfn;
} else if (*end_pfn > max_sparsemem_pfn) {
mminit_dprintk(MMINIT_WARNING, "pfnvalidation",
"End of range %lu -> %lu exceeds SPARSEMEM max %lu\n",
*start_pfn, *end_pfn, max_sparsemem_pfn);
WARN_ON_ONCE(1);
*end_pfn = max_sparsemem_pfn;
}
}
/* Record a memory area against a node. */
void __init memory_present(int nid, unsigned long start, unsigned long end)
{
unsigned long pfn;
start &= PAGE_SECTION_MASK;
mminit_validate_memmodel_limits(&start, &end);
for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
unsigned long section = pfn_to_section_nr(pfn);
struct mem_section *ms;
sparse_index_init(section, nid);
set_section_nid(section, nid);
ms = __nr_to_section(section);
if (!ms->section_mem_map)
ms->section_mem_map = sparse_encode_early_nid(nid) |
SECTION_MARKED_PRESENT;
}
}
/*
* Only used by the i386 NUMA architecures, but relatively
* generic code.
*/
unsigned long __init node_memmap_size_bytes(int nid, unsigned long start_pfn,
unsigned long end_pfn)
{
unsigned long pfn;
unsigned long nr_pages = 0;
mminit_validate_memmodel_limits(&start_pfn, &end_pfn);
for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
if (nid != early_pfn_to_nid(pfn))
continue;
if (pfn_present(pfn))
nr_pages += PAGES_PER_SECTION;
}
return nr_pages * sizeof(struct page);
}
/*
* Subtle, we encode the real pfn into the mem_map such that
* the identity pfn - section_mem_map will return the actual
* physical page frame number.
*/
static unsigned long sparse_encode_mem_map(struct page *mem_map, unsigned long pnum)
{
return (unsigned long)(mem_map - (section_nr_to_pfn(pnum)));
}
/*
* Decode mem_map from the coded memmap
*/
struct page *sparse_decode_mem_map(unsigned long coded_mem_map, unsigned long pnum)
{
/* mask off the extra low bits of information */
coded_mem_map &= SECTION_MAP_MASK;
return ((struct page *)coded_mem_map) + section_nr_to_pfn(pnum);
}
static int __meminit sparse_init_one_section(struct mem_section *ms,
unsigned long pnum, struct page *mem_map,
unsigned long *pageblock_bitmap)
{
if (!present_section(ms))
return -EINVAL;
ms->section_mem_map &= ~SECTION_MAP_MASK;
ms->section_mem_map |= sparse_encode_mem_map(mem_map, pnum) |
SECTION_HAS_MEM_MAP;
ms->pageblock_flags = pageblock_bitmap;
return 1;
}
unsigned long usemap_size(void)
{
unsigned long size_bytes;
size_bytes = roundup(SECTION_BLOCKFLAGS_BITS, 8) / 8;
size_bytes = roundup(size_bytes, sizeof(unsigned long));
return size_bytes;
}
#ifdef CONFIG_MEMORY_HOTPLUG
static unsigned long *__kmalloc_section_usemap(void)
{
return kmalloc(usemap_size(), GFP_KERNEL);
}
#endif /* CONFIG_MEMORY_HOTPLUG */
#ifdef CONFIG_MEMORY_HOTREMOVE
static unsigned long * __init
sparse_early_usemaps_alloc_pgdat_section(struct pglist_data *pgdat,
unsigned long size)
{
unsigned long goal, limit;
unsigned long *p;
int nid;
/*
* A page may contain usemaps for other sections preventing the
* page being freed and making a section unremovable while
* other sections referencing the usemap remain active. Similarly,
* a pgdat can prevent a section being removed. If section A
* contains a pgdat and section B contains the usemap, both
* sections become inter-dependent. This allocates usemaps
* from the same section as the pgdat where possible to avoid
* this problem.
*/
goal = __pa(pgdat) & (PAGE_SECTION_MASK << PAGE_SHIFT);
limit = goal + (1UL << PA_SECTION_SHIFT);
nid = early_pfn_to_nid(goal >> PAGE_SHIFT);
again:
p = memblock_virt_alloc_try_nid_nopanic(size,
SMP_CACHE_BYTES, goal, limit,
nid);
if (!p && limit) {
limit = 0;
goto again;
}
return p;
}
static void __init check_usemap_section_nr(int nid, unsigned long *usemap)
{
unsigned long usemap_snr, pgdat_snr;
static unsigned long old_usemap_snr = NR_MEM_SECTIONS;
static unsigned long old_pgdat_snr = NR_MEM_SECTIONS;
struct pglist_data *pgdat = NODE_DATA(nid);
int usemap_nid;
usemap_snr = pfn_to_section_nr(__pa(usemap) >> PAGE_SHIFT);
pgdat_snr = pfn_to_section_nr(__pa(pgdat) >> PAGE_SHIFT);
if (usemap_snr == pgdat_snr)
return;
if (old_usemap_snr == usemap_snr && old_pgdat_snr == pgdat_snr)
/* skip redundant message */
return;
old_usemap_snr = usemap_snr;
old_pgdat_snr = pgdat_snr;
usemap_nid = sparse_early_nid(__nr_to_section(usemap_snr));
if (usemap_nid != nid) {
printk(KERN_INFO
"node %d must be removed before remove section %ld\n",
nid, usemap_snr);
return;
}
/*
* There is a circular dependency.
* Some platforms allow un-removable section because they will just
* gather other removable sections for dynamic partitioning.
* Just notify un-removable section's number here.
*/
printk(KERN_INFO "Section %ld and %ld (node %d)", usemap_snr,
pgdat_snr, nid);
printk(KERN_CONT
" have a circular dependency on usemap and pgdat allocations\n");
}
#else
static unsigned long * __init
sparse_early_usemaps_alloc_pgdat_section(struct pglist_data *pgdat,
unsigned long size)
{
return memblock_virt_alloc_node_nopanic(size, pgdat->node_id);
}
static void __init check_usemap_section_nr(int nid, unsigned long *usemap)
{
}
#endif /* CONFIG_MEMORY_HOTREMOVE */
static void __init sparse_early_usemaps_alloc_node(void *data,
unsigned long pnum_begin,
unsigned long pnum_end,
unsigned long usemap_count, int nodeid)
{
void *usemap;
unsigned long pnum;
unsigned long **usemap_map = (unsigned long **)data;
int size = usemap_size();
usemap = sparse_early_usemaps_alloc_pgdat_section(NODE_DATA(nodeid),
size * usemap_count);
if (!usemap) {
printk(KERN_WARNING "%s: allocation failed\n", __func__);
return;
}
for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
if (!present_section_nr(pnum))
continue;
usemap_map[pnum] = usemap;
usemap += size;
check_usemap_section_nr(nodeid, usemap_map[pnum]);
}
}
#ifndef CONFIG_SPARSEMEM_VMEMMAP
struct page __init *sparse_mem_map_populate(unsigned long pnum, int nid)
{
struct page *map;
unsigned long size;
map = alloc_remap(nid, sizeof(struct page) * PAGES_PER_SECTION);
if (map)
return map;
size = PAGE_ALIGN(sizeof(struct page) * PAGES_PER_SECTION);
map = memblock_virt_alloc_try_nid(size,
PAGE_SIZE, __pa(MAX_DMA_ADDRESS),
BOOTMEM_ALLOC_ACCESSIBLE, nid);
return map;
}
void __init sparse_mem_maps_populate_node(struct page **map_map,
unsigned long pnum_begin,
unsigned long pnum_end,
unsigned long map_count, int nodeid)
{
void *map;
unsigned long pnum;
unsigned long size = sizeof(struct page) * PAGES_PER_SECTION;
map = alloc_remap(nodeid, size * map_count);
if (map) {
for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
if (!present_section_nr(pnum))
continue;
map_map[pnum] = map;
map += size;
}
return;
}
size = PAGE_ALIGN(size);
map = memblock_virt_alloc_try_nid(size * map_count,
PAGE_SIZE, __pa(MAX_DMA_ADDRESS),
BOOTMEM_ALLOC_ACCESSIBLE, nodeid);
if (map) {
for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
if (!present_section_nr(pnum))
continue;
map_map[pnum] = map;
map += size;
}
return;
}
/* fallback */
for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
struct mem_section *ms;
if (!present_section_nr(pnum))
continue;
map_map[pnum] = sparse_mem_map_populate(pnum, nodeid);
if (map_map[pnum])
continue;
ms = __nr_to_section(pnum);
printk(KERN_ERR "%s: sparsemem memory map backing failed "
"some memory will not be available.\n", __func__);
ms->section_mem_map = 0;
}
}
#endif /* !CONFIG_SPARSEMEM_VMEMMAP */
#ifdef CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER
static void __init sparse_early_mem_maps_alloc_node(void *data,
unsigned long pnum_begin,
unsigned long pnum_end,
unsigned long map_count, int nodeid)
{
struct page **map_map = (struct page **)data;
sparse_mem_maps_populate_node(map_map, pnum_begin, pnum_end,
map_count, nodeid);
}
#else
static struct page __init *sparse_early_mem_map_alloc(unsigned long pnum)
{
struct page *map;
struct mem_section *ms = __nr_to_section(pnum);
int nid = sparse_early_nid(ms);
map = sparse_mem_map_populate(pnum, nid);
if (map)
return map;
printk(KERN_ERR "%s: sparsemem memory map backing failed "
"some memory will not be available.\n", __func__);
ms->section_mem_map = 0;
return NULL;
}
#endif
void __weak __meminit vmemmap_populate_print_last(void)
{
}
/**
* alloc_usemap_and_memmap - memory alloction for pageblock flags and vmemmap
* @map: usemap_map for pageblock flags or mmap_map for vmemmap
*/
static void __init alloc_usemap_and_memmap(void (*alloc_func)
(void *, unsigned long, unsigned long,
unsigned long, int), void *data)
{
unsigned long pnum;
unsigned long map_count;
int nodeid_begin = 0;
unsigned long pnum_begin = 0;
for (pnum = 0; pnum < NR_MEM_SECTIONS; pnum++) {
struct mem_section *ms;
if (!present_section_nr(pnum))
continue;
ms = __nr_to_section(pnum);
nodeid_begin = sparse_early_nid(ms);
pnum_begin = pnum;
break;
}
map_count = 1;
for (pnum = pnum_begin + 1; pnum < NR_MEM_SECTIONS; pnum++) {
struct mem_section *ms;
int nodeid;
if (!present_section_nr(pnum))
continue;
ms = __nr_to_section(pnum);
nodeid = sparse_early_nid(ms);
if (nodeid == nodeid_begin) {
map_count++;
continue;
}
/* ok, we need to take cake of from pnum_begin to pnum - 1*/
alloc_func(data, pnum_begin, pnum,
map_count, nodeid_begin);
/* new start, update count etc*/
nodeid_begin = nodeid;
pnum_begin = pnum;
map_count = 1;
}
/* ok, last chunk */
alloc_func(data, pnum_begin, NR_MEM_SECTIONS,
map_count, nodeid_begin);
}
/*
* Allocate the accumulated non-linear sections, allocate a mem_map
* for each and record the physical to section mapping.
*/
void __init sparse_init(void)
{
unsigned long pnum;
struct page *map;
unsigned long *usemap;
unsigned long **usemap_map;
int size;
#ifdef CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER
int size2;
struct page **map_map;
#endif
/* see include/linux/mmzone.h 'struct mem_section' definition */
BUILD_BUG_ON(!is_power_of_2(sizeof(struct mem_section)));
/* Setup pageblock_order for HUGETLB_PAGE_SIZE_VARIABLE */
set_pageblock_order();
/*
* map is using big page (aka 2M in x86 64 bit)
* usemap is less one page (aka 24 bytes)
* so alloc 2M (with 2M align) and 24 bytes in turn will
* make next 2M slip to one more 2M later.
* then in big system, the memory will have a lot of holes...
* here try to allocate 2M pages continuously.
*
* powerpc need to call sparse_init_one_section right after each
* sparse_early_mem_map_alloc, so allocate usemap_map at first.
*/
size = sizeof(unsigned long *) * NR_MEM_SECTIONS;
usemap_map = memblock_virt_alloc(size, 0);
if (!usemap_map)
panic("can not allocate usemap_map\n");
alloc_usemap_and_memmap(sparse_early_usemaps_alloc_node,
(void *)usemap_map);
#ifdef CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER
size2 = sizeof(struct page *) * NR_MEM_SECTIONS;
map_map = memblock_virt_alloc(size2, 0);
if (!map_map)
panic("can not allocate map_map\n");
alloc_usemap_and_memmap(sparse_early_mem_maps_alloc_node,
(void *)map_map);
#endif
for (pnum = 0; pnum < NR_MEM_SECTIONS; pnum++) {
if (!present_section_nr(pnum))
continue;
usemap = usemap_map[pnum];
if (!usemap)
continue;
#ifdef CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER
map = map_map[pnum];
#else
map = sparse_early_mem_map_alloc(pnum);
#endif
if (!map)
continue;
sparse_init_one_section(__nr_to_section(pnum), pnum, map,
usemap);
}
vmemmap_populate_print_last();
#ifdef CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER
memblock_free_early(__pa(map_map), size2);
#endif
memblock_free_early(__pa(usemap_map), size);
}
#ifdef CONFIG_MEMORY_HOTPLUG
#ifdef CONFIG_SPARSEMEM_VMEMMAP
static inline struct page *kmalloc_section_memmap(unsigned long pnum, int nid)
{
/* This will make the necessary allocations eventually. */
return sparse_mem_map_populate(pnum, nid);
}
static void __kfree_section_memmap(struct page *memmap)
{
unsigned long start = (unsigned long)memmap;
unsigned long end = (unsigned long)(memmap + PAGES_PER_SECTION);
vmemmap_free(start, end);
}
#ifdef CONFIG_MEMORY_HOTREMOVE
static void free_map_bootmem(struct page *memmap)
{
unsigned long start = (unsigned long)memmap;
unsigned long end = (unsigned long)(memmap + PAGES_PER_SECTION);
vmemmap_free(start, end);
}
#endif /* CONFIG_MEMORY_HOTREMOVE */
#else
static struct page *__kmalloc_section_memmap(void)
{
struct page *page, *ret;
unsigned long memmap_size = sizeof(struct page) * PAGES_PER_SECTION;
page = alloc_pages(GFP_KERNEL|__GFP_NOWARN, get_order(memmap_size));
if (page)
goto got_map_page;
ret = vmalloc(memmap_size);
if (ret)
goto got_map_ptr;
return NULL;
got_map_page:
ret = (struct page *)pfn_to_kaddr(page_to_pfn(page));
got_map_ptr:
return ret;
}
static inline struct page *kmalloc_section_memmap(unsigned long pnum, int nid)
{
return __kmalloc_section_memmap();
}
static void __kfree_section_memmap(struct page *memmap)
{
if (is_vmalloc_addr(memmap))
vfree(memmap);
else
free_pages((unsigned long)memmap,
get_order(sizeof(struct page) * PAGES_PER_SECTION));
}
#ifdef CONFIG_MEMORY_HOTREMOVE
static void free_map_bootmem(struct page *memmap)
{
unsigned long maps_section_nr, removing_section_nr, i;
unsigned long magic, nr_pages;
struct page *page = virt_to_page(memmap);
nr_pages = PAGE_ALIGN(PAGES_PER_SECTION * sizeof(struct page))
>> PAGE_SHIFT;
for (i = 0; i < nr_pages; i++, page++) {
magic = (unsigned long) page->lru.next;
BUG_ON(magic == NODE_INFO);
maps_section_nr = pfn_to_section_nr(page_to_pfn(page));
removing_section_nr = page->private;
/*
* When this function is called, the removing section is
* logical offlined state. This means all pages are isolated
* from page allocator. If removing section's memmap is placed
* on the same section, it must not be freed.
* If it is freed, page allocator may allocate it which will
* be removed physically soon.
*/
if (maps_section_nr != removing_section_nr)
put_page_bootmem(page);
}
}
#endif /* CONFIG_MEMORY_HOTREMOVE */
#endif /* CONFIG_SPARSEMEM_VMEMMAP */
/*
* returns the number of sections whose mem_maps were properly
* set. If this is <=0, then that means that the passed-in
* map was not consumed and must be freed.
*/
int __meminit sparse_add_one_section(struct zone *zone, unsigned long start_pfn)
{
unsigned long section_nr = pfn_to_section_nr(start_pfn);
struct pglist_data *pgdat = zone->zone_pgdat;
struct mem_section *ms;
struct page *memmap;
unsigned long *usemap;
unsigned long flags;
int ret;
/*
* no locking for this, because it does its own
* plus, it does a kmalloc
*/
ret = sparse_index_init(section_nr, pgdat->node_id);
if (ret < 0 && ret != -EEXIST)
return ret;
memmap = kmalloc_section_memmap(section_nr, pgdat->node_id);
if (!memmap)
return -ENOMEM;
usemap = __kmalloc_section_usemap();
if (!usemap) {
__kfree_section_memmap(memmap);
return -ENOMEM;
}
pgdat_resize_lock(pgdat, &flags);
ms = __pfn_to_section(start_pfn);
if (ms->section_mem_map & SECTION_MARKED_PRESENT) {
ret = -EEXIST;
goto out;
}
memset(memmap, 0, sizeof(struct page) * PAGES_PER_SECTION);
ms->section_mem_map |= SECTION_MARKED_PRESENT;
ret = sparse_init_one_section(ms, section_nr, memmap, usemap);
out:
pgdat_resize_unlock(pgdat, &flags);
if (ret <= 0) {
kfree(usemap);
__kfree_section_memmap(memmap);
}
return ret;
}
#ifdef CONFIG_MEMORY_HOTREMOVE
#ifdef CONFIG_MEMORY_FAILURE
static void clear_hwpoisoned_pages(struct page *memmap, int nr_pages)
{
int i;
if (!memmap)
return;
for (i = 0; i < PAGES_PER_SECTION; i++) {
if (PageHWPoison(&memmap[i])) {
atomic_long_sub(1, &num_poisoned_pages);
ClearPageHWPoison(&memmap[i]);
}
}
}
#else
static inline void clear_hwpoisoned_pages(struct page *memmap, int nr_pages)
{
}
#endif
static void free_section_usemap(struct page *memmap, unsigned long *usemap)
{
struct page *usemap_page;
if (!usemap)
return;
usemap_page = virt_to_page(usemap);
/*
* Check to see if allocation came from hot-plug-add
*/
if (PageSlab(usemap_page) || PageCompound(usemap_page)) {
kfree(usemap);
if (memmap)
__kfree_section_memmap(memmap);
return;
}
/*
* The usemap came from bootmem. This is packed with other usemaps
* on the section which has pgdat at boot time. Just keep it as is now.
*/
if (memmap)
free_map_bootmem(memmap);
}
void sparse_remove_one_section(struct zone *zone, struct mem_section *ms)
{
struct page *memmap = NULL;
unsigned long *usemap = NULL, flags;
struct pglist_data *pgdat = zone->zone_pgdat;
pgdat_resize_lock(pgdat, &flags);
if (ms->section_mem_map) {
usemap = ms->pageblock_flags;
memmap = sparse_decode_mem_map(ms->section_mem_map,
__section_nr(ms));
ms->section_mem_map = 0;
ms->pageblock_flags = NULL;
}
pgdat_resize_unlock(pgdat, &flags);
clear_hwpoisoned_pages(memmap, PAGES_PER_SECTION);
free_section_usemap(memmap, usemap);
}
#endif /* CONFIG_MEMORY_HOTREMOVE */
#endif /* CONFIG_MEMORY_HOTPLUG */

1158
mm/swap.c Normal file

File diff suppressed because it is too large Load diff

497
mm/swap_state.c Normal file
View file

@ -0,0 +1,497 @@
/*
* linux/mm/swap_state.c
*
* Copyright (C) 1991, 1992, 1993, 1994 Linus Torvalds
* Swap reorganised 29.12.95, Stephen Tweedie
*
* Rewritten to use page cache, (C) 1998 Stephen Tweedie
*/
#include <linux/mm.h>
#include <linux/gfp.h>
#include <linux/kernel_stat.h>
#include <linux/swap.h>
#include <linux/swapops.h>
#include <linux/init.h>
#include <linux/pagemap.h>
#include <linux/backing-dev.h>
#include <linux/blkdev.h>
#include <linux/pagevec.h>
#include <linux/migrate.h>
#include <linux/page_cgroup.h>
#include <asm/pgtable.h>
/*
* swapper_space is a fiction, retained to simplify the path through
* vmscan's shrink_page_list.
*/
static const struct address_space_operations swap_aops = {
.writepage = swap_writepage,
.set_page_dirty = swap_set_page_dirty,
#ifdef CONFIG_MIGRATION
.migratepage = migrate_page,
#endif
};
static struct backing_dev_info swap_backing_dev_info = {
.name = "swap",
.capabilities = BDI_CAP_NO_ACCT_AND_WRITEBACK | BDI_CAP_SWAP_BACKED,
};
struct address_space swapper_spaces[MAX_SWAPFILES] = {
[0 ... MAX_SWAPFILES - 1] = {
.page_tree = RADIX_TREE_INIT(GFP_ATOMIC|__GFP_NOWARN),
.i_mmap_writable = ATOMIC_INIT(0),
.a_ops = &swap_aops,
.backing_dev_info = &swap_backing_dev_info,
}
};
#define INC_CACHE_INFO(x) do { swap_cache_info.x++; } while (0)
static struct {
unsigned long add_total;
unsigned long del_total;
unsigned long find_success;
unsigned long find_total;
} swap_cache_info;
unsigned long total_swapcache_pages(void)
{
int i;
unsigned long ret = 0;
for (i = 0; i < MAX_SWAPFILES; i++)
ret += swapper_spaces[i].nrpages;
return ret;
}
static atomic_t swapin_readahead_hits = ATOMIC_INIT(4);
void show_swap_cache_info(void)
{
printk("%lu pages in swap cache\n", total_swapcache_pages());
printk("Swap cache stats: add %lu, delete %lu, find %lu/%lu\n",
swap_cache_info.add_total, swap_cache_info.del_total,
swap_cache_info.find_success, swap_cache_info.find_total);
printk("Free swap = %ldkB\n",
get_nr_swap_pages() << (PAGE_SHIFT - 10));
printk("Total swap = %lukB\n", total_swap_pages << (PAGE_SHIFT - 10));
}
/*
* __add_to_swap_cache resembles add_to_page_cache_locked on swapper_space,
* but sets SwapCache flag and private instead of mapping and index.
*/
int __add_to_swap_cache(struct page *page, swp_entry_t entry)
{
int error;
struct address_space *address_space;
VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(PageSwapCache(page), page);
VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
page_cache_get(page);
SetPageSwapCache(page);
set_page_private(page, entry.val);
address_space = swap_address_space(entry);
spin_lock_irq(&address_space->tree_lock);
error = radix_tree_insert(&address_space->page_tree,
entry.val, page);
if (likely(!error)) {
address_space->nrpages++;
__inc_zone_page_state(page, NR_FILE_PAGES);
INC_CACHE_INFO(add_total);
}
spin_unlock_irq(&address_space->tree_lock);
if (unlikely(error)) {
/*
* Only the context which have set SWAP_HAS_CACHE flag
* would call add_to_swap_cache().
* So add_to_swap_cache() doesn't returns -EEXIST.
*/
VM_BUG_ON(error == -EEXIST);
set_page_private(page, 0UL);
ClearPageSwapCache(page);
page_cache_release(page);
}
return error;
}
int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp_mask)
{
int error;
error = radix_tree_maybe_preload(gfp_mask);
if (!error) {
error = __add_to_swap_cache(page, entry);
radix_tree_preload_end();
}
return error;
}
/*
* This must be called only on pages that have
* been verified to be in the swap cache.
*/
void __delete_from_swap_cache(struct page *page)
{
swp_entry_t entry;
struct address_space *address_space;
VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(!PageSwapCache(page), page);
VM_BUG_ON_PAGE(PageWriteback(page), page);
entry.val = page_private(page);
address_space = swap_address_space(entry);
radix_tree_delete(&address_space->page_tree, page_private(page));
set_page_private(page, 0);
ClearPageSwapCache(page);
address_space->nrpages--;
__dec_zone_page_state(page, NR_FILE_PAGES);
INC_CACHE_INFO(del_total);
}
/**
* add_to_swap - allocate swap space for a page
* @page: page we want to move to swap
*
* Allocate swap space for the page and add the page to the
* swap cache. Caller needs to hold the page lock.
*/
int add_to_swap(struct page *page, struct list_head *list)
{
swp_entry_t entry;
int err;
VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(!PageUptodate(page), page);
entry = get_swap_page();
if (!entry.val)
return 0;
if (unlikely(PageTransHuge(page)))
if (unlikely(split_huge_page_to_list(page, list))) {
swapcache_free(entry);
return 0;
}
/*
* Radix-tree node allocations from PF_MEMALLOC contexts could
* completely exhaust the page allocator. __GFP_NOMEMALLOC
* stops emergency reserves from being allocated.
*
* TODO: this could cause a theoretical memory reclaim
* deadlock in the swap out path.
*/
/*
* Add it to the swap cache and mark it dirty
*/
err = add_to_swap_cache(page, entry,
__GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN);
if (!err) { /* Success */
SetPageDirty(page);
return 1;
} else { /* -ENOMEM radix-tree allocation failure */
/*
* add_to_swap_cache() doesn't return -EEXIST, so we can safely
* clear SWAP_HAS_CACHE flag.
*/
swapcache_free(entry);
return 0;
}
}
/*
* This must be called only on pages that have
* been verified to be in the swap cache and locked.
* It will never put the page into the free list,
* the caller has a reference on the page.
*/
void delete_from_swap_cache(struct page *page)
{
swp_entry_t entry;
struct address_space *address_space;
entry.val = page_private(page);
address_space = swap_address_space(entry);
spin_lock_irq(&address_space->tree_lock);
__delete_from_swap_cache(page);
spin_unlock_irq(&address_space->tree_lock);
swapcache_free(entry);
page_cache_release(page);
}
/*
* If we are the only user, then try to free up the swap cache.
*
* Its ok to check for PageSwapCache without the page lock
* here because we are going to recheck again inside
* try_to_free_swap() _with_ the lock.
* - Marcelo
*/
static inline void free_swap_cache(struct page *page)
{
if (PageSwapCache(page) && !page_mapped(page) && trylock_page(page)) {
try_to_free_swap(page);
unlock_page(page);
}
}
/*
* Perform a free_page(), also freeing any swap cache associated with
* this page if it is the last user of the page.
*/
void free_page_and_swap_cache(struct page *page)
{
free_swap_cache(page);
page_cache_release(page);
}
/*
* Passed an array of pages, drop them all from swapcache and then release
* them. They are removed from the LRU and freed if this is their last use.
*/
void free_pages_and_swap_cache(struct page **pages, int nr)
{
struct page **pagep = pages;
int i;
lru_add_drain();
for (i = 0; i < nr; i++)
free_swap_cache(pagep[i]);
release_pages(pagep, nr, false);
}
/*
* Lookup a swap entry in the swap cache. A found page will be returned
* unlocked and with its refcount incremented - we rely on the kernel
* lock getting page table operations atomic even if we drop the page
* lock before returning.
*/
struct page * lookup_swap_cache(swp_entry_t entry)
{
struct page *page;
page = find_get_page(swap_address_space(entry), entry.val);
if (page) {
INC_CACHE_INFO(find_success);
if (TestClearPageReadahead(page))
atomic_inc(&swapin_readahead_hits);
}
INC_CACHE_INFO(find_total);
return page;
}
/*
* Locate a page of swap in physical memory, reserving swap cache space
* and reading the disk if it is not already cached.
* A failure return means that either the page allocation failed or that
* the swap entry is no longer in use.
*/
struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
struct vm_area_struct *vma, unsigned long addr)
{
struct page *found_page, *new_page = NULL;
int err;
do {
/*
* First check the swap cache. Since this is normally
* called after lookup_swap_cache() failed, re-calling
* that would confuse statistics.
*/
found_page = find_get_page(swap_address_space(entry),
entry.val);
if (found_page)
break;
/*
* Get a new page to read into from swap.
*/
if (!new_page) {
new_page = alloc_page_vma(gfp_mask, vma, addr);
if (!new_page)
break; /* Out of memory */
}
/*
* call radix_tree_preload() while we can wait.
*/
err = radix_tree_maybe_preload(gfp_mask & GFP_KERNEL);
if (err)
break;
/*
* Swap entry may have been freed since our caller observed it.
*/
err = swapcache_prepare(entry);
if (err == -EEXIST) {
radix_tree_preload_end();
/*
* We might race against get_swap_page() and stumble
* across a SWAP_HAS_CACHE swap_map entry whose page
* has not been brought into the swapcache yet, while
* the other end is scheduled away waiting on discard
* I/O completion at scan_swap_map().
*
* In order to avoid turning this transitory state
* into a permanent loop around this -EEXIST case
* if !CONFIG_PREEMPT and the I/O completion happens
* to be waiting on the CPU waitqueue where we are now
* busy looping, we just conditionally invoke the
* scheduler here, if there are some more important
* tasks to run.
*/
cond_resched();
continue;
}
if (err) { /* swp entry is obsolete ? */
radix_tree_preload_end();
break;
}
/* May fail (-ENOMEM) if radix-tree node allocation failed. */
__set_page_locked(new_page);
SetPageSwapBacked(new_page);
err = __add_to_swap_cache(new_page, entry);
if (likely(!err)) {
radix_tree_preload_end();
/*
* Initiate read into locked page and return.
*/
lru_cache_add_anon(new_page);
swap_readpage(new_page);
return new_page;
}
radix_tree_preload_end();
ClearPageSwapBacked(new_page);
__clear_page_locked(new_page);
/*
* add_to_swap_cache() doesn't return -EEXIST, so we can safely
* clear SWAP_HAS_CACHE flag.
*/
swapcache_free(entry);
} while (err != -ENOMEM);
if (new_page)
page_cache_release(new_page);
return found_page;
}
#ifdef CONFIG_SWAP_ENABLE_READAHEAD
static unsigned long swapin_nr_pages(unsigned long offset)
{
static unsigned long prev_offset;
unsigned int pages, max_pages, last_ra;
static atomic_t last_readahead_pages;
max_pages = 1 << ACCESS_ONCE(page_cluster);
if (max_pages <= 1)
return 1;
/*
* This heuristic has been found to work well on both sequential and
* random loads, swapping to hard disk or to SSD: please don't ask
* what the "+ 2" means, it just happens to work well, that's all.
*/
pages = atomic_xchg(&swapin_readahead_hits, 0) + 2;
if (pages == 2) {
/*
* We can have no readahead hits to judge by: but must not get
* stuck here forever, so check for an adjacent offset instead
* (and don't even bother to check whether swap type is same).
*/
if (offset != prev_offset + 1 && offset != prev_offset - 1)
pages = 1;
prev_offset = offset;
} else {
unsigned int roundup = 4;
while (roundup < pages)
roundup <<= 1;
pages = roundup;
}
if (pages > max_pages)
pages = max_pages;
/* Don't shrink readahead too fast */
last_ra = atomic_read(&last_readahead_pages) / 2;
if (pages < last_ra)
pages = last_ra;
atomic_set(&last_readahead_pages, pages);
return pages;
}
#endif
/**
* swapin_readahead - swap in pages in hope we need them soon
* @entry: swap entry of this memory
* @gfp_mask: memory allocation flags
* @vma: user vma this address belongs to
* @addr: target address for mempolicy
*
* Returns the struct page for entry and addr, after queueing swapin.
*
* Primitive swap readahead code. We simply read an aligned block of
* (1 << page_cluster) entries in the swap area. This method is chosen
* because it doesn't cost us any seek time. We also make sure to queue
* the 'original' request together with the readahead ones...
*
* This has been extended to use the NUMA policies from the mm triggering
* the readahead.
*
* Caller must hold down_read on the vma->vm_mm if vma is not NULL.
*/
struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
struct vm_area_struct *vma, unsigned long addr)
{
#ifdef CONFIG_SWAP_ENABLE_READAHEAD
struct page *page;
unsigned long entry_offset = swp_offset(entry);
unsigned long offset = entry_offset;
unsigned long start_offset, end_offset;
unsigned long mask;
struct blk_plug plug;
mask = swapin_nr_pages(offset) - 1;
if (!mask)
goto skip;
/* Read a page_cluster sized and aligned cluster around offset. */
start_offset = offset & ~mask;
end_offset = offset | mask;
if (!start_offset) /* First page is swap header. */
start_offset++;
blk_start_plug(&plug);
for (offset = start_offset; offset <= end_offset ; offset++) {
/* Ok, do the async read-ahead now */
page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
gfp_mask, vma, addr);
if (!page)
continue;
if (offset != entry_offset)
SetPageReadahead(page);
page_cache_release(page);
}
blk_finish_plug(&plug);
lru_add_drain(); /* Push any new pages onto the LRU now */
skip:
#endif
return read_swap_cache_async(entry, gfp_mask, vma, addr);
}

2982
mm/swapfile.c Normal file

File diff suppressed because it is too large Load diff

820
mm/truncate.c Normal file
View file

@ -0,0 +1,820 @@
/*
* mm/truncate.c - code for taking down pages from address_spaces
*
* Copyright (C) 2002, Linus Torvalds
*
* 10Sep2002 Andrew Morton
* Initial version.
*/
#include <linux/kernel.h>
#include <linux/backing-dev.h>
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/swap.h>
#include <linux/export.h>
#include <linux/pagemap.h>
#include <linux/highmem.h>
#include <linux/pagevec.h>
#include <linux/task_io_accounting_ops.h>
#include <linux/buffer_head.h> /* grr. try_to_release_page,
do_invalidatepage */
#include <linux/cleancache.h>
#include <linux/rmap.h>
#include "internal.h"
static void clear_exceptional_entry(struct address_space *mapping,
pgoff_t index, void *entry)
{
struct radix_tree_node *node;
void **slot;
/* Handled by shmem itself */
if (shmem_mapping(mapping))
return;
spin_lock_irq(&mapping->tree_lock);
/*
* Regular page slots are stabilized by the page lock even
* without the tree itself locked. These unlocked entries
* need verification under the tree lock.
*/
if (!__radix_tree_lookup(&mapping->page_tree, index, &node, &slot))
goto unlock;
if (*slot != entry)
goto unlock;
radix_tree_replace_slot(slot, NULL);
mapping->nrshadows--;
if (!node)
goto unlock;
workingset_node_shadows_dec(node);
/*
* Don't track node without shadow entries.
*
* Avoid acquiring the list_lru lock if already untracked.
* The list_empty() test is safe as node->private_list is
* protected by mapping->tree_lock.
*/
if (!workingset_node_shadows(node) &&
!list_empty(&node->private_list))
list_lru_del(&workingset_shadow_nodes, &node->private_list);
__radix_tree_delete_node(&mapping->page_tree, node);
unlock:
spin_unlock_irq(&mapping->tree_lock);
}
/**
* do_invalidatepage - invalidate part or all of a page
* @page: the page which is affected
* @offset: start of the range to invalidate
* @length: length of the range to invalidate
*
* do_invalidatepage() is called when all or part of the page has become
* invalidated by a truncate operation.
*
* do_invalidatepage() does not have to release all buffers, but it must
* ensure that no dirty buffer is left outside @offset and that no I/O
* is underway against any of the blocks which are outside the truncation
* point. Because the caller is about to free (and possibly reuse) those
* blocks on-disk.
*/
void do_invalidatepage(struct page *page, unsigned int offset,
unsigned int length)
{
void (*invalidatepage)(struct page *, unsigned int, unsigned int);
invalidatepage = page->mapping->a_ops->invalidatepage;
#ifdef CONFIG_BLOCK
if (!invalidatepage)
invalidatepage = block_invalidatepage;
#endif
if (invalidatepage)
(*invalidatepage)(page, offset, length);
}
/*
* This cancels just the dirty bit on the kernel page itself, it
* does NOT actually remove dirty bits on any mmap's that may be
* around. It also leaves the page tagged dirty, so any sync
* activity will still find it on the dirty lists, and in particular,
* clear_page_dirty_for_io() will still look at the dirty bits in
* the VM.
*
* Doing this should *normally* only ever be done when a page
* is truncated, and is not actually mapped anywhere at all. However,
* fs/buffer.c does this when it notices that somebody has cleaned
* out all the buffers on a page without actually doing it through
* the VM. Can you say "ext3 is horribly ugly"? Tought you could.
*/
void cancel_dirty_page(struct page *page, unsigned int account_size)
{
if (TestClearPageDirty(page)) {
struct address_space *mapping = page->mapping;
if (mapping && mapping_cap_account_dirty(mapping)) {
dec_zone_page_state(page, NR_FILE_DIRTY);
dec_bdi_stat(mapping->backing_dev_info,
BDI_RECLAIMABLE);
if (account_size)
task_io_account_cancelled_write(account_size);
}
}
}
EXPORT_SYMBOL(cancel_dirty_page);
/*
* If truncate cannot remove the fs-private metadata from the page, the page
* becomes orphaned. It will be left on the LRU and may even be mapped into
* user pagetables if we're racing with filemap_fault().
*
* We need to bale out if page->mapping is no longer equal to the original
* mapping. This happens a) when the VM reclaimed the page while we waited on
* its lock, b) when a concurrent invalidate_mapping_pages got there first and
* c) when tmpfs swizzles a page between a tmpfs inode and swapper_space.
*/
static int
truncate_complete_page(struct address_space *mapping, struct page *page)
{
if (page->mapping != mapping)
return -EIO;
if (page_has_private(page))
do_invalidatepage(page, 0, PAGE_CACHE_SIZE);
cancel_dirty_page(page, PAGE_CACHE_SIZE);
ClearPageMappedToDisk(page);
delete_from_page_cache(page);
return 0;
}
/*
* This is for invalidate_mapping_pages(). That function can be called at
* any time, and is not supposed to throw away dirty pages. But pages can
* be marked dirty at any time too, so use remove_mapping which safely
* discards clean, unused pages.
*
* Returns non-zero if the page was successfully invalidated.
*/
static int
invalidate_complete_page(struct address_space *mapping, struct page *page)
{
int ret;
if (page->mapping != mapping)
return 0;
if (page_has_private(page) && !try_to_release_page(page, 0))
return 0;
ret = remove_mapping(mapping, page);
return ret;
}
int truncate_inode_page(struct address_space *mapping, struct page *page)
{
if (page_mapped(page)) {
unmap_mapping_range(mapping,
(loff_t)page->index << PAGE_CACHE_SHIFT,
PAGE_CACHE_SIZE, 0);
}
return truncate_complete_page(mapping, page);
}
/*
* Used to get rid of pages on hardware memory corruption.
*/
int generic_error_remove_page(struct address_space *mapping, struct page *page)
{
if (!mapping)
return -EINVAL;
/*
* Only punch for normal data pages for now.
* Handling other types like directories would need more auditing.
*/
if (!S_ISREG(mapping->host->i_mode))
return -EIO;
return truncate_inode_page(mapping, page);
}
EXPORT_SYMBOL(generic_error_remove_page);
/*
* Safely invalidate one page from its pagecache mapping.
* It only drops clean, unused pages. The page must be locked.
*
* Returns 1 if the page is successfully invalidated, otherwise 0.
*/
int invalidate_inode_page(struct page *page)
{
struct address_space *mapping = page_mapping(page);
if (!mapping)
return 0;
if (PageDirty(page) || PageWriteback(page))
return 0;
if (page_mapped(page))
return 0;
return invalidate_complete_page(mapping, page);
}
/**
* truncate_inode_pages_range - truncate range of pages specified by start & end byte offsets
* @mapping: mapping to truncate
* @lstart: offset from which to truncate
* @lend: offset to which to truncate (inclusive)
*
* Truncate the page cache, removing the pages that are between
* specified offsets (and zeroing out partial pages
* if lstart or lend + 1 is not page aligned).
*
* Truncate takes two passes - the first pass is nonblocking. It will not
* block on page locks and it will not block on writeback. The second pass
* will wait. This is to prevent as much IO as possible in the affected region.
* The first pass will remove most pages, so the search cost of the second pass
* is low.
*
* We pass down the cache-hot hint to the page freeing code. Even if the
* mapping is large, it is probably the case that the final pages are the most
* recently touched, and freeing happens in ascending file offset order.
*
* Note that since ->invalidatepage() accepts range to invalidate
* truncate_inode_pages_range is able to handle cases where lend + 1 is not
* page aligned properly.
*/
void truncate_inode_pages_range(struct address_space *mapping,
loff_t lstart, loff_t lend)
{
pgoff_t start; /* inclusive */
pgoff_t end; /* exclusive */
unsigned int partial_start; /* inclusive */
unsigned int partial_end; /* exclusive */
struct pagevec pvec;
pgoff_t indices[PAGEVEC_SIZE];
pgoff_t index;
int i;
cleancache_invalidate_inode(mapping);
if (mapping->nrpages == 0 && mapping->nrshadows == 0)
return;
/* Offsets within partial pages */
partial_start = lstart & (PAGE_CACHE_SIZE - 1);
partial_end = (lend + 1) & (PAGE_CACHE_SIZE - 1);
/*
* 'start' and 'end' always covers the range of pages to be fully
* truncated. Partial pages are covered with 'partial_start' at the
* start of the range and 'partial_end' at the end of the range.
* Note that 'end' is exclusive while 'lend' is inclusive.
*/
start = (lstart + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
if (lend == -1)
/*
* lend == -1 indicates end-of-file so we have to set 'end'
* to the highest possible pgoff_t and since the type is
* unsigned we're using -1.
*/
end = -1;
else
end = (lend + 1) >> PAGE_CACHE_SHIFT;
pagevec_init(&pvec, 0);
index = start;
while (index < end && pagevec_lookup_entries(&pvec, mapping, index,
min(end - index, (pgoff_t)PAGEVEC_SIZE),
indices)) {
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
/* We rely upon deletion not changing page->index */
index = indices[i];
if (index >= end)
break;
if (radix_tree_exceptional_entry(page)) {
clear_exceptional_entry(mapping, index, page);
continue;
}
if (!trylock_page(page))
continue;
WARN_ON(page->index != index);
if (PageWriteback(page)) {
unlock_page(page);
continue;
}
truncate_inode_page(mapping, page);
unlock_page(page);
}
pagevec_remove_exceptionals(&pvec);
pagevec_release(&pvec);
cond_resched();
index++;
}
if (partial_start) {
struct page *page = find_lock_page(mapping, start - 1);
if (page) {
unsigned int top = PAGE_CACHE_SIZE;
if (start > end) {
/* Truncation within a single page */
top = partial_end;
partial_end = 0;
}
wait_on_page_writeback(page);
zero_user_segment(page, partial_start, top);
cleancache_invalidate_page(mapping, page);
if (page_has_private(page))
do_invalidatepage(page, partial_start,
top - partial_start);
unlock_page(page);
page_cache_release(page);
}
}
if (partial_end) {
struct page *page = find_lock_page(mapping, end);
if (page) {
wait_on_page_writeback(page);
zero_user_segment(page, 0, partial_end);
cleancache_invalidate_page(mapping, page);
if (page_has_private(page))
do_invalidatepage(page, 0,
partial_end);
unlock_page(page);
page_cache_release(page);
}
}
/*
* If the truncation happened within a single page no pages
* will be released, just zeroed, so we can bail out now.
*/
if (start >= end)
return;
index = start;
for ( ; ; ) {
cond_resched();
if (!pagevec_lookup_entries(&pvec, mapping, index,
min(end - index, (pgoff_t)PAGEVEC_SIZE), indices)) {
/* If all gone from start onwards, we're done */
if (index == start)
break;
/* Otherwise restart to make sure all gone */
index = start;
continue;
}
if (index == start && indices[0] >= end) {
/* All gone out of hole to be punched, we're done */
pagevec_remove_exceptionals(&pvec);
pagevec_release(&pvec);
break;
}
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
/* We rely upon deletion not changing page->index */
index = indices[i];
if (index >= end) {
/* Restart punch to make sure all gone */
index = start - 1;
break;
}
if (radix_tree_exceptional_entry(page)) {
clear_exceptional_entry(mapping, index, page);
continue;
}
lock_page(page);
WARN_ON(page->index != index);
wait_on_page_writeback(page);
truncate_inode_page(mapping, page);
unlock_page(page);
}
pagevec_remove_exceptionals(&pvec);
pagevec_release(&pvec);
index++;
}
cleancache_invalidate_inode(mapping);
}
EXPORT_SYMBOL(truncate_inode_pages_range);
/**
* truncate_inode_pages - truncate *all* the pages from an offset
* @mapping: mapping to truncate
* @lstart: offset from which to truncate
*
* Called under (and serialised by) inode->i_mutex.
*
* Note: When this function returns, there can be a page in the process of
* deletion (inside __delete_from_page_cache()) in the specified range. Thus
* mapping->nrpages can be non-zero when this function returns even after
* truncation of the whole mapping.
*/
void truncate_inode_pages(struct address_space *mapping, loff_t lstart)
{
truncate_inode_pages_range(mapping, lstart, (loff_t)-1);
}
EXPORT_SYMBOL(truncate_inode_pages);
/**
* truncate_inode_pages_final - truncate *all* pages before inode dies
* @mapping: mapping to truncate
*
* Called under (and serialized by) inode->i_mutex.
*
* Filesystems have to use this in the .evict_inode path to inform the
* VM that this is the final truncate and the inode is going away.
*/
void truncate_inode_pages_final(struct address_space *mapping)
{
unsigned long nrshadows;
unsigned long nrpages;
/*
* Page reclaim can not participate in regular inode lifetime
* management (can't call iput()) and thus can race with the
* inode teardown. Tell it when the address space is exiting,
* so that it does not install eviction information after the
* final truncate has begun.
*/
mapping_set_exiting(mapping);
/*
* When reclaim installs eviction entries, it increases
* nrshadows first, then decreases nrpages. Make sure we see
* this in the right order or we might miss an entry.
*/
nrpages = mapping->nrpages;
smp_rmb();
nrshadows = mapping->nrshadows;
if (nrpages || nrshadows) {
/*
* As truncation uses a lockless tree lookup, cycle
* the tree lock to make sure any ongoing tree
* modification that does not see AS_EXITING is
* completed before starting the final truncate.
*/
spin_lock_irq(&mapping->tree_lock);
spin_unlock_irq(&mapping->tree_lock);
truncate_inode_pages(mapping, 0);
}
}
EXPORT_SYMBOL(truncate_inode_pages_final);
/**
* invalidate_mapping_pages - Invalidate all the unlocked pages of one inode
* @mapping: the address_space which holds the pages to invalidate
* @start: the offset 'from' which to invalidate
* @end: the offset 'to' which to invalidate (inclusive)
*
* This function only removes the unlocked pages, if you want to
* remove all the pages of one inode, you must call truncate_inode_pages.
*
* invalidate_mapping_pages() will not block on IO activity. It will not
* invalidate pages which are dirty, locked, under writeback or mapped into
* pagetables.
*/
unsigned long invalidate_mapping_pages(struct address_space *mapping,
pgoff_t start, pgoff_t end)
{
pgoff_t indices[PAGEVEC_SIZE];
struct pagevec pvec;
pgoff_t index = start;
unsigned long ret;
unsigned long count = 0;
int i;
pagevec_init(&pvec, 0);
while (index <= end && pagevec_lookup_entries(&pvec, mapping, index,
min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
indices)) {
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
/* We rely upon deletion not changing page->index */
index = indices[i];
if (index > end)
break;
if (radix_tree_exceptional_entry(page)) {
clear_exceptional_entry(mapping, index, page);
continue;
}
if (!trylock_page(page))
continue;
WARN_ON(page->index != index);
ret = invalidate_inode_page(page);
unlock_page(page);
/*
* Invalidation is a hint that the page is no longer
* of interest and try to speed up its reclaim.
*/
if (!ret)
deactivate_page(page);
count += ret;
}
pagevec_remove_exceptionals(&pvec);
pagevec_release(&pvec);
cond_resched();
index++;
}
return count;
}
EXPORT_SYMBOL(invalidate_mapping_pages);
/*
* This is like invalidate_complete_page(), except it ignores the page's
* refcount. We do this because invalidate_inode_pages2() needs stronger
* invalidation guarantees, and cannot afford to leave pages behind because
* shrink_page_list() has a temp ref on them, or because they're transiently
* sitting in the lru_cache_add() pagevecs.
*/
static int
invalidate_complete_page2(struct address_space *mapping, struct page *page)
{
if (page->mapping != mapping)
return 0;
if (page_has_private(page) && !try_to_release_page(page, GFP_KERNEL))
return 0;
spin_lock_irq(&mapping->tree_lock);
if (PageDirty(page))
goto failed;
BUG_ON(page_has_private(page));
__delete_from_page_cache(page, NULL);
spin_unlock_irq(&mapping->tree_lock);
if (mapping->a_ops->freepage)
mapping->a_ops->freepage(page);
page_cache_release(page); /* pagecache ref */
return 1;
failed:
spin_unlock_irq(&mapping->tree_lock);
return 0;
}
static int do_launder_page(struct address_space *mapping, struct page *page)
{
if (!PageDirty(page))
return 0;
if (page->mapping != mapping || mapping->a_ops->launder_page == NULL)
return 0;
return mapping->a_ops->launder_page(page);
}
/**
* invalidate_inode_pages2_range - remove range of pages from an address_space
* @mapping: the address_space
* @start: the page offset 'from' which to invalidate
* @end: the page offset 'to' which to invalidate (inclusive)
*
* Any pages which are found to be mapped into pagetables are unmapped prior to
* invalidation.
*
* Returns -EBUSY if any pages could not be invalidated.
*/
int invalidate_inode_pages2_range(struct address_space *mapping,
pgoff_t start, pgoff_t end)
{
pgoff_t indices[PAGEVEC_SIZE];
struct pagevec pvec;
pgoff_t index;
int i;
int ret = 0;
int ret2 = 0;
int did_range_unmap = 0;
cleancache_invalidate_inode(mapping);
pagevec_init(&pvec, 0);
index = start;
while (index <= end && pagevec_lookup_entries(&pvec, mapping, index,
min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
indices)) {
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
/* We rely upon deletion not changing page->index */
index = indices[i];
if (index > end)
break;
if (radix_tree_exceptional_entry(page)) {
clear_exceptional_entry(mapping, index, page);
continue;
}
lock_page(page);
WARN_ON(page->index != index);
if (page->mapping != mapping) {
unlock_page(page);
continue;
}
wait_on_page_writeback(page);
if (page_mapped(page)) {
if (!did_range_unmap) {
/*
* Zap the rest of the file in one hit.
*/
unmap_mapping_range(mapping,
(loff_t)index << PAGE_CACHE_SHIFT,
(loff_t)(1 + end - index)
<< PAGE_CACHE_SHIFT,
0);
did_range_unmap = 1;
} else {
/*
* Just zap this page
*/
unmap_mapping_range(mapping,
(loff_t)index << PAGE_CACHE_SHIFT,
PAGE_CACHE_SIZE, 0);
}
}
BUG_ON(page_mapped(page));
ret2 = do_launder_page(mapping, page);
if (ret2 == 0) {
if (!invalidate_complete_page2(mapping, page))
ret2 = -EBUSY;
}
if (ret2 < 0)
ret = ret2;
unlock_page(page);
}
pagevec_remove_exceptionals(&pvec);
pagevec_release(&pvec);
cond_resched();
index++;
}
cleancache_invalidate_inode(mapping);
return ret;
}
EXPORT_SYMBOL_GPL(invalidate_inode_pages2_range);
/**
* invalidate_inode_pages2 - remove all pages from an address_space
* @mapping: the address_space
*
* Any pages which are found to be mapped into pagetables are unmapped prior to
* invalidation.
*
* Returns -EBUSY if any pages could not be invalidated.
*/
int invalidate_inode_pages2(struct address_space *mapping)
{
return invalidate_inode_pages2_range(mapping, 0, -1);
}
EXPORT_SYMBOL_GPL(invalidate_inode_pages2);
/**
* truncate_pagecache - unmap and remove pagecache that has been truncated
* @inode: inode
* @newsize: new file size
*
* inode's new i_size must already be written before truncate_pagecache
* is called.
*
* This function should typically be called before the filesystem
* releases resources associated with the freed range (eg. deallocates
* blocks). This way, pagecache will always stay logically coherent
* with on-disk format, and the filesystem would not have to deal with
* situations such as writepage being called for a page that has already
* had its underlying blocks deallocated.
*/
void truncate_pagecache(struct inode *inode, loff_t newsize)
{
struct address_space *mapping = inode->i_mapping;
loff_t holebegin = round_up(newsize, PAGE_SIZE);
/*
* unmap_mapping_range is called twice, first simply for
* efficiency so that truncate_inode_pages does fewer
* single-page unmaps. However after this first call, and
* before truncate_inode_pages finishes, it is possible for
* private pages to be COWed, which remain after
* truncate_inode_pages finishes, hence the second
* unmap_mapping_range call must be made for correctness.
*/
unmap_mapping_range(mapping, holebegin, 0, 1);
truncate_inode_pages(mapping, newsize);
unmap_mapping_range(mapping, holebegin, 0, 1);
}
EXPORT_SYMBOL(truncate_pagecache);
/**
* truncate_setsize - update inode and pagecache for a new file size
* @inode: inode
* @newsize: new file size
*
* truncate_setsize updates i_size and performs pagecache truncation (if
* necessary) to @newsize. It will be typically be called from the filesystem's
* setattr function when ATTR_SIZE is passed in.
*
* Must be called with a lock serializing truncates and writes (generally
* i_mutex but e.g. xfs uses a different lock) and before all filesystem
* specific block truncation has been performed.
*/
void truncate_setsize(struct inode *inode, loff_t newsize)
{
loff_t oldsize = inode->i_size;
i_size_write(inode, newsize);
if (newsize > oldsize)
pagecache_isize_extended(inode, oldsize, newsize);
truncate_pagecache(inode, newsize);
}
EXPORT_SYMBOL(truncate_setsize);
/**
* pagecache_isize_extended - update pagecache after extension of i_size
* @inode: inode for which i_size was extended
* @from: original inode size
* @to: new inode size
*
* Handle extension of inode size either caused by extending truncate or by
* write starting after current i_size. We mark the page straddling current
* i_size RO so that page_mkwrite() is called on the nearest write access to
* the page. This way filesystem can be sure that page_mkwrite() is called on
* the page before user writes to the page via mmap after the i_size has been
* changed.
*
* The function must be called after i_size is updated so that page fault
* coming after we unlock the page will already see the new i_size.
* The function must be called while we still hold i_mutex - this not only
* makes sure i_size is stable but also that userspace cannot observe new
* i_size value before we are prepared to store mmap writes at new inode size.
*/
void pagecache_isize_extended(struct inode *inode, loff_t from, loff_t to)
{
int bsize = 1 << inode->i_blkbits;
loff_t rounded_from;
struct page *page;
pgoff_t index;
WARN_ON(to > inode->i_size);
if (from >= to || bsize == PAGE_CACHE_SIZE)
return;
/* Page straddling @from will not have any hole block created? */
rounded_from = round_up(from, bsize);
if (to <= rounded_from || !(rounded_from & (PAGE_CACHE_SIZE - 1)))
return;
index = from >> PAGE_CACHE_SHIFT;
page = find_lock_page(inode->i_mapping, index);
/* Page not cached? Nothing to do */
if (!page)
return;
/*
* See clear_page_dirty_for_io() for details why set_page_dirty()
* is needed.
*/
if (page_mkclean(page))
set_page_dirty(page);
unlock_page(page);
page_cache_release(page);
}
EXPORT_SYMBOL(pagecache_isize_extended);
/**
* truncate_pagecache_range - unmap and remove pagecache that is hole-punched
* @inode: inode
* @lstart: offset of beginning of hole
* @lend: offset of last byte of hole
*
* This function should typically be called before the filesystem
* releases resources associated with the freed range (eg. deallocates
* blocks). This way, pagecache will always stay logically coherent
* with on-disk format, and the filesystem would not have to deal with
* situations such as writepage being called for a page that has already
* had its underlying blocks deallocated.
*/
void truncate_pagecache_range(struct inode *inode, loff_t lstart, loff_t lend)
{
struct address_space *mapping = inode->i_mapping;
loff_t unmap_start = round_up(lstart, PAGE_SIZE);
loff_t unmap_end = round_down(1 + lend, PAGE_SIZE) - 1;
/*
* This rounding is currently just for example: unmap_mapping_range
* expands its hole outwards, whereas we want it to contract the hole
* inwards. However, existing callers of truncate_pagecache_range are
* doing their own page rounding first. Note that unmap_mapping_range
* allows holelen 0 for all, and we allow lend -1 for end of file.
*/
/*
* Unlike in truncate_pagecache, unmap_mapping_range is called only
* once (before truncating pagecache), and without "even_cows" flag:
* hole-punching should not remove private COWed pages from the hole.
*/
if ((u64)unmap_end > (u64)unmap_start)
unmap_mapping_range(mapping, unmap_start,
1 + unmap_end - unmap_start, 0);
truncate_inode_pages_range(mapping, lstart, lend);
}
EXPORT_SYMBOL(truncate_pagecache_range);

402
mm/util.c Normal file
View file

@ -0,0 +1,402 @@
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/string.h>
#include <linux/compiler.h>
#include <linux/export.h>
#include <linux/err.h>
#include <linux/sched.h>
#include <linux/security.h>
#include <linux/swap.h>
#include <linux/swapops.h>
#include <linux/mman.h>
#include <linux/hugetlb.h>
#include <linux/vmalloc.h>
#include <asm/uaccess.h>
#include "internal.h"
/**
* kstrdup - allocate space for and copy an existing string
* @s: the string to duplicate
* @gfp: the GFP mask used in the kmalloc() call when allocating memory
*/
char *kstrdup(const char *s, gfp_t gfp)
{
size_t len;
char *buf;
if (!s)
return NULL;
len = strlen(s) + 1;
buf = kmalloc_track_caller(len, gfp);
if (buf)
memcpy(buf, s, len);
return buf;
}
EXPORT_SYMBOL(kstrdup);
/**
* kstrndup - allocate space for and copy an existing string
* @s: the string to duplicate
* @max: read at most @max chars from @s
* @gfp: the GFP mask used in the kmalloc() call when allocating memory
*/
char *kstrndup(const char *s, size_t max, gfp_t gfp)
{
size_t len;
char *buf;
if (!s)
return NULL;
len = strnlen(s, max);
buf = kmalloc_track_caller(len+1, gfp);
if (buf) {
memcpy(buf, s, len);
buf[len] = '\0';
}
return buf;
}
EXPORT_SYMBOL(kstrndup);
/**
* kmemdup - duplicate region of memory
*
* @src: memory region to duplicate
* @len: memory region length
* @gfp: GFP mask to use
*/
void *kmemdup(const void *src, size_t len, gfp_t gfp)
{
void *p;
p = kmalloc_track_caller(len, gfp);
if (p)
memcpy(p, src, len);
return p;
}
EXPORT_SYMBOL(kmemdup);
/**
* memdup_user - duplicate memory region from user space
*
* @src: source address in user space
* @len: number of bytes to copy
*
* Returns an ERR_PTR() on failure.
*/
void *memdup_user(const void __user *src, size_t len)
{
void *p;
/*
* Always use GFP_KERNEL, since copy_from_user() can sleep and
* cause pagefault, which makes it pointless to use GFP_NOFS
* or GFP_ATOMIC.
*/
p = kmalloc_track_caller(len, GFP_KERNEL);
if (!p)
return ERR_PTR(-ENOMEM);
if (copy_from_user(p, src, len)) {
kfree(p);
return ERR_PTR(-EFAULT);
}
return p;
}
EXPORT_SYMBOL(memdup_user);
/*
* strndup_user - duplicate an existing string from user space
* @s: The string to duplicate
* @n: Maximum number of bytes to copy, including the trailing NUL.
*/
char *strndup_user(const char __user *s, long n)
{
char *p;
long length;
length = strnlen_user(s, n);
if (!length)
return ERR_PTR(-EFAULT);
if (length > n)
return ERR_PTR(-EINVAL);
p = memdup_user(s, length);
if (IS_ERR(p))
return p;
p[length - 1] = '\0';
return p;
}
EXPORT_SYMBOL(strndup_user);
void __vma_link_list(struct mm_struct *mm, struct vm_area_struct *vma,
struct vm_area_struct *prev, struct rb_node *rb_parent)
{
struct vm_area_struct *next;
vma->vm_prev = prev;
if (prev) {
next = prev->vm_next;
prev->vm_next = vma;
} else {
mm->mmap = vma;
if (rb_parent)
next = rb_entry(rb_parent,
struct vm_area_struct, vm_rb);
else
next = NULL;
}
vma->vm_next = next;
if (next)
next->vm_prev = vma;
}
/* Check if the vma is being used as a stack by this task */
static int vm_is_stack_for_task(struct task_struct *t,
struct vm_area_struct *vma)
{
return (vma->vm_start <= KSTK_ESP(t) && vma->vm_end >= KSTK_ESP(t));
}
/*
* Check if the vma is being used as a stack.
* If is_group is non-zero, check in the entire thread group or else
* just check in the current task. Returns the task_struct of the task
* that the vma is stack for. Must be called under rcu_read_lock().
*/
struct task_struct *task_of_stack(struct task_struct *task,
struct vm_area_struct *vma, bool in_group)
{
if (vm_is_stack_for_task(task, vma))
return task;
if (in_group) {
struct task_struct *t;
for_each_thread(task, t) {
if (vm_is_stack_for_task(t, vma))
return t;
}
}
return NULL;
}
#if defined(CONFIG_MMU) && !defined(HAVE_ARCH_PICK_MMAP_LAYOUT)
void arch_pick_mmap_layout(struct mm_struct *mm)
{
mm->mmap_base = TASK_UNMAPPED_BASE;
mm->get_unmapped_area = arch_get_unmapped_area;
}
#endif
/*
* Like get_user_pages_fast() except its IRQ-safe in that it won't fall
* back to the regular GUP.
* If the architecture not support this function, simply return with no
* page pinned
*/
int __weak __get_user_pages_fast(unsigned long start,
int nr_pages, int write, struct page **pages)
{
return 0;
}
EXPORT_SYMBOL_GPL(__get_user_pages_fast);
/**
* get_user_pages_fast() - pin user pages in memory
* @start: starting user address
* @nr_pages: number of pages from start to pin
* @write: whether pages will be written to
* @pages: array that receives pointers to the pages pinned.
* Should be at least nr_pages long.
*
* Returns number of pages pinned. This may be fewer than the number
* requested. If nr_pages is 0 or negative, returns 0. If no pages
* were pinned, returns -errno.
*
* get_user_pages_fast provides equivalent functionality to get_user_pages,
* operating on current and current->mm, with force=0 and vma=NULL. However
* unlike get_user_pages, it must be called without mmap_sem held.
*
* get_user_pages_fast may take mmap_sem and page table locks, so no
* assumptions can be made about lack of locking. get_user_pages_fast is to be
* implemented in a way that is advantageous (vs get_user_pages()) when the
* user memory area is already faulted in and present in ptes. However if the
* pages have to be faulted in, it may turn out to be slightly slower so
* callers need to carefully consider what to use. On many architectures,
* get_user_pages_fast simply falls back to get_user_pages.
*/
int __weak get_user_pages_fast(unsigned long start,
int nr_pages, int write, struct page **pages)
{
struct mm_struct *mm = current->mm;
int ret;
down_read(&mm->mmap_sem);
ret = get_user_pages(current, mm, start, nr_pages,
write, 0, pages, NULL);
up_read(&mm->mmap_sem);
return ret;
}
EXPORT_SYMBOL_GPL(get_user_pages_fast);
unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
unsigned long len, unsigned long prot,
unsigned long flag, unsigned long pgoff)
{
unsigned long ret;
struct mm_struct *mm = current->mm;
unsigned long populate;
ret = security_mmap_file(file, prot, flag);
if (!ret) {
down_write(&mm->mmap_sem);
ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff,
&populate);
up_write(&mm->mmap_sem);
if (populate)
mm_populate(ret, populate);
}
return ret;
}
unsigned long vm_mmap(struct file *file, unsigned long addr,
unsigned long len, unsigned long prot,
unsigned long flag, unsigned long offset)
{
if (unlikely(offset + PAGE_ALIGN(len) < offset))
return -EINVAL;
if (unlikely(offset & ~PAGE_MASK))
return -EINVAL;
return vm_mmap_pgoff(file, addr, len, prot, flag, offset >> PAGE_SHIFT);
}
EXPORT_SYMBOL(vm_mmap);
void kvfree(const void *addr)
{
if (is_vmalloc_addr(addr))
vfree(addr);
else
kfree(addr);
}
EXPORT_SYMBOL(kvfree);
struct address_space *page_mapping(struct page *page)
{
struct address_space *mapping = page->mapping;
/* This happens if someone calls flush_dcache_page on slab page */
if (unlikely(PageSlab(page)))
return NULL;
if (unlikely(PageSwapCache(page))) {
swp_entry_t entry;
entry.val = page_private(page);
mapping = swap_address_space(entry);
} else if ((unsigned long)mapping & PAGE_MAPPING_ANON)
mapping = NULL;
return mapping;
}
int overcommit_ratio_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos)
{
int ret;
ret = proc_dointvec(table, write, buffer, lenp, ppos);
if (ret == 0 && write)
sysctl_overcommit_kbytes = 0;
return ret;
}
int overcommit_kbytes_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos)
{
int ret;
ret = proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
if (ret == 0 && write)
sysctl_overcommit_ratio = 0;
return ret;
}
/*
* Committed memory limit enforced when OVERCOMMIT_NEVER policy is used
*/
unsigned long vm_commit_limit(void)
{
unsigned long allowed;
if (sysctl_overcommit_kbytes)
allowed = sysctl_overcommit_kbytes >> (PAGE_SHIFT - 10);
else
allowed = ((totalram_pages - hugetlb_total_pages())
* sysctl_overcommit_ratio / 100);
allowed += total_swap_pages;
return allowed;
}
/**
* get_cmdline() - copy the cmdline value to a buffer.
* @task: the task whose cmdline value to copy.
* @buffer: the buffer to copy to.
* @buflen: the length of the buffer. Larger cmdline values are truncated
* to this length.
* Returns the size of the cmdline field copied. Note that the copy does
* not guarantee an ending NULL byte.
*/
int get_cmdline(struct task_struct *task, char *buffer, int buflen)
{
int res = 0;
unsigned int len;
struct mm_struct *mm = get_task_mm(task);
if (!mm)
goto out;
if (!mm->arg_end)
goto out_mm; /* Shh! No looking before we're done */
len = mm->arg_end - mm->arg_start;
if (len > buflen)
len = buflen;
res = access_process_vm(task, mm->arg_start, buffer, len, 0);
/*
* If the nul at the end of args has been overwritten, then
* assume application is using setproctitle(3).
*/
if (res > 0 && buffer[res-1] != '\0' && len < buflen) {
len = strnlen(buffer, res);
if (len < res) {
res = len;
} else {
len = mm->env_end - mm->env_start;
if (len > buflen - res)
len = buflen - res;
res += access_process_vm(task, mm->env_start,
buffer+res, len, 0);
res = strnlen(buffer, res);
}
}
out_mm:
mmput(mm);
out:
return res;
}

132
mm/vmacache.c Normal file
View file

@ -0,0 +1,132 @@
/*
* Copyright (C) 2014 Davidlohr Bueso.
*/
#include <linux/sched.h>
#include <linux/mm.h>
#include <linux/vmacache.h>
/*
* Flush vma caches for threads that share a given mm.
*
* The operation is safe because the caller holds the mmap_sem
* exclusively and other threads accessing the vma cache will
* have mmap_sem held at least for read, so no extra locking
* is required to maintain the vma cache.
*/
void vmacache_flush_all(struct mm_struct *mm)
{
struct task_struct *g, *p;
/*
* Single threaded tasks need not iterate the entire
* list of process. We can avoid the flushing as well
* since the mm's seqnum was increased and don't have
* to worry about other threads' seqnum. Current's
* flush will occur upon the next lookup.
*/
if (atomic_read(&mm->mm_users) == 1)
return;
rcu_read_lock();
for_each_process_thread(g, p) {
/*
* Only flush the vmacache pointers as the
* mm seqnum is already set and curr's will
* be set upon invalidation when the next
* lookup is done.
*/
if (mm == p->mm)
vmacache_flush(p);
}
rcu_read_unlock();
}
/*
* This task may be accessing a foreign mm via (for example)
* get_user_pages()->find_vma(). The vmacache is task-local and this
* task's vmacache pertains to a different mm (ie, its own). There is
* nothing we can do here.
*
* Also handle the case where a kernel thread has adopted this mm via use_mm().
* That kernel thread's vmacache is not applicable to this mm.
*/
static bool vmacache_valid_mm(struct mm_struct *mm)
{
return current->mm == mm && !(current->flags & PF_KTHREAD);
}
void vmacache_update(unsigned long addr, struct vm_area_struct *newvma)
{
if (vmacache_valid_mm(newvma->vm_mm))
current->vmacache[VMACACHE_HASH(addr)] = newvma;
}
static bool vmacache_valid(struct mm_struct *mm)
{
struct task_struct *curr;
if (!vmacache_valid_mm(mm))
return false;
curr = current;
if (mm->vmacache_seqnum != curr->vmacache_seqnum) {
/*
* First attempt will always be invalid, initialize
* the new cache for this task here.
*/
curr->vmacache_seqnum = mm->vmacache_seqnum;
vmacache_flush(curr);
return false;
}
return true;
}
struct vm_area_struct *vmacache_find(struct mm_struct *mm, unsigned long addr)
{
int i;
if (!vmacache_valid(mm))
return NULL;
count_vm_vmacache_event(VMACACHE_FIND_CALLS);
for (i = 0; i < VMACACHE_SIZE; i++) {
struct vm_area_struct *vma = current->vmacache[i];
if (!vma)
continue;
if (WARN_ON_ONCE(vma->vm_mm != mm))
break;
if (vma->vm_start <= addr && vma->vm_end > addr) {
count_vm_vmacache_event(VMACACHE_FIND_HITS);
return vma;
}
}
return NULL;
}
#ifndef CONFIG_MMU
struct vm_area_struct *vmacache_find_exact(struct mm_struct *mm,
unsigned long start,
unsigned long end)
{
int i;
if (!vmacache_valid(mm))
return NULL;
count_vm_vmacache_event(VMACACHE_FIND_CALLS);
for (i = 0; i < VMACACHE_SIZE; i++) {
struct vm_area_struct *vma = current->vmacache[i];
if (vma && vma->vm_start == start && vma->vm_end == end) {
count_vm_vmacache_event(VMACACHE_FIND_HITS);
return vma;
}
}
return NULL;
}
#endif

2718
mm/vmalloc.c Normal file

File diff suppressed because it is too large Load diff

382
mm/vmpressure.c Normal file
View file

@ -0,0 +1,382 @@
/*
* Linux VM pressure
*
* Copyright 2012 Linaro Ltd.
* Anton Vorontsov <anton.vorontsov@linaro.org>
*
* Based on ideas from Andrew Morton, David Rientjes, KOSAKI Motohiro,
* Leonid Moiseichuk, Mel Gorman, Minchan Kim and Pekka Enberg.
*
* This program is free software; you can redistribute it and/or modify it
* under the terms of the GNU General Public License version 2 as published
* by the Free Software Foundation.
*/
#include <linux/cgroup.h>
#include <linux/fs.h>
#include <linux/log2.h>
#include <linux/sched.h>
#include <linux/mm.h>
#include <linux/vmstat.h>
#include <linux/eventfd.h>
#include <linux/slab.h>
#include <linux/swap.h>
#include <linux/printk.h>
#include <linux/vmpressure.h>
/*
* The window size (vmpressure_win) is the number of scanned pages before
* we try to analyze scanned/reclaimed ratio. So the window is used as a
* rate-limit tunable for the "low" level notification, and also for
* averaging the ratio for medium/critical levels. Using small window
* sizes can cause lot of false positives, but too big window size will
* delay the notifications.
*
* As the vmscan reclaimer logic works with chunks which are multiple of
* SWAP_CLUSTER_MAX, it makes sense to use it for the window size as well.
*
* TODO: Make the window size depend on machine size, as we do for vmstat
* thresholds. Currently we set it to 512 pages (2MB for 4KB pages).
*/
static const unsigned long vmpressure_win = SWAP_CLUSTER_MAX * 16;
/*
* These thresholds are used when we account memory pressure through
* scanned/reclaimed ratio. The current values were chosen empirically. In
* essence, they are percents: the higher the value, the more number
* unsuccessful reclaims there were.
*/
static const unsigned int vmpressure_level_med = 60;
static const unsigned int vmpressure_level_critical = 95;
/*
* When there are too little pages left to scan, vmpressure() may miss the
* critical pressure as number of pages will be less than "window size".
* However, in that case the vmscan priority will raise fast as the
* reclaimer will try to scan LRUs more deeply.
*
* The vmscan logic considers these special priorities:
*
* prio == DEF_PRIORITY (12): reclaimer starts with that value
* prio <= DEF_PRIORITY - 2 : kswapd becomes somewhat overwhelmed
* prio == 0 : close to OOM, kernel scans every page in an lru
*
* Any value in this range is acceptable for this tunable (i.e. from 12 to
* 0). Current value for the vmpressure_level_critical_prio is chosen
* empirically, but the number, in essence, means that we consider
* critical level when scanning depth is ~10% of the lru size (vmscan
* scans 'lru_size >> prio' pages, so it is actually 12.5%, or one
* eights).
*/
static const unsigned int vmpressure_level_critical_prio = ilog2(100 / 10);
static struct vmpressure *work_to_vmpressure(struct work_struct *work)
{
return container_of(work, struct vmpressure, work);
}
static struct vmpressure *vmpressure_parent(struct vmpressure *vmpr)
{
struct cgroup_subsys_state *css = vmpressure_to_css(vmpr);
struct mem_cgroup *memcg = mem_cgroup_from_css(css);
memcg = parent_mem_cgroup(memcg);
if (!memcg)
return NULL;
return memcg_to_vmpressure(memcg);
}
enum vmpressure_levels {
VMPRESSURE_LOW = 0,
VMPRESSURE_MEDIUM,
VMPRESSURE_CRITICAL,
VMPRESSURE_NUM_LEVELS,
};
static const char * const vmpressure_str_levels[] = {
[VMPRESSURE_LOW] = "low",
[VMPRESSURE_MEDIUM] = "medium",
[VMPRESSURE_CRITICAL] = "critical",
};
static enum vmpressure_levels vmpressure_level(unsigned long pressure)
{
if (pressure >= vmpressure_level_critical)
return VMPRESSURE_CRITICAL;
else if (pressure >= vmpressure_level_med)
return VMPRESSURE_MEDIUM;
return VMPRESSURE_LOW;
}
static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
unsigned long reclaimed)
{
unsigned long scale = scanned + reclaimed;
unsigned long pressure;
/*
* We calculate the ratio (in percents) of how many pages were
* scanned vs. reclaimed in a given time frame (window). Note that
* time is in VM reclaimer's "ticks", i.e. number of pages
* scanned. This makes it possible to set desired reaction time
* and serves as a ratelimit.
*/
pressure = scale - (reclaimed * scale / scanned);
pressure = pressure * 100 / scale;
pr_debug("%s: %3lu (s: %lu r: %lu)\n", __func__, pressure,
scanned, reclaimed);
return vmpressure_level(pressure);
}
struct vmpressure_event {
struct eventfd_ctx *efd;
enum vmpressure_levels level;
struct list_head node;
};
static bool vmpressure_event(struct vmpressure *vmpr,
unsigned long scanned, unsigned long reclaimed)
{
struct vmpressure_event *ev;
enum vmpressure_levels level;
bool signalled = false;
level = vmpressure_calc_level(scanned, reclaimed);
mutex_lock(&vmpr->events_lock);
list_for_each_entry(ev, &vmpr->events, node) {
if (level >= ev->level) {
eventfd_signal(ev->efd, 1);
signalled = true;
}
}
mutex_unlock(&vmpr->events_lock);
return signalled;
}
static void vmpressure_work_fn(struct work_struct *work)
{
struct vmpressure *vmpr = work_to_vmpressure(work);
unsigned long scanned;
unsigned long reclaimed;
spin_lock(&vmpr->sr_lock);
/*
* Several contexts might be calling vmpressure(), so it is
* possible that the work was rescheduled again before the old
* work context cleared the counters. In that case we will run
* just after the old work returns, but then scanned might be zero
* here. No need for any locks here since we don't care if
* vmpr->reclaimed is in sync.
*/
scanned = vmpr->scanned;
if (!scanned) {
spin_unlock(&vmpr->sr_lock);
return;
}
reclaimed = vmpr->reclaimed;
vmpr->scanned = 0;
vmpr->reclaimed = 0;
spin_unlock(&vmpr->sr_lock);
do {
if (vmpressure_event(vmpr, scanned, reclaimed))
break;
/*
* If not handled, propagate the event upward into the
* hierarchy.
*/
} while ((vmpr = vmpressure_parent(vmpr)));
}
/**
* vmpressure() - Account memory pressure through scanned/reclaimed ratio
* @gfp: reclaimer's gfp mask
* @memcg: cgroup memory controller handle
* @scanned: number of pages scanned
* @reclaimed: number of pages reclaimed
*
* This function should be called from the vmscan reclaim path to account
* "instantaneous" memory pressure (scanned/reclaimed ratio). The raw
* pressure index is then further refined and averaged over time.
*
* This function does not return any value.
*/
void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
unsigned long scanned, unsigned long reclaimed)
{
struct vmpressure *vmpr = memcg_to_vmpressure(memcg);
/*
* Here we only want to account pressure that userland is able to
* help us with. For example, suppose that DMA zone is under
* pressure; if we notify userland about that kind of pressure,
* then it will be mostly a waste as it will trigger unnecessary
* freeing of memory by userland (since userland is more likely to
* have HIGHMEM/MOVABLE pages instead of the DMA fallback). That
* is why we include only movable, highmem and FS/IO pages.
* Indirect reclaim (kswapd) sets sc->gfp_mask to GFP_KERNEL, so
* we account it too.
*/
if (!(gfp & (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_IO | __GFP_FS)))
return;
/*
* If we got here with no pages scanned, then that is an indicator
* that reclaimer was unable to find any shrinkable LRUs at the
* current scanning depth. But it does not mean that we should
* report the critical pressure, yet. If the scanning priority
* (scanning depth) goes too high (deep), we will be notified
* through vmpressure_prio(). But so far, keep calm.
*/
if (!scanned)
return;
spin_lock(&vmpr->sr_lock);
vmpr->scanned += scanned;
vmpr->reclaimed += reclaimed;
scanned = vmpr->scanned;
spin_unlock(&vmpr->sr_lock);
if (scanned < vmpressure_win)
return;
schedule_work(&vmpr->work);
}
/**
* vmpressure_prio() - Account memory pressure through reclaimer priority level
* @gfp: reclaimer's gfp mask
* @memcg: cgroup memory controller handle
* @prio: reclaimer's priority
*
* This function should be called from the reclaim path every time when
* the vmscan's reclaiming priority (scanning depth) changes.
*
* This function does not return any value.
*/
void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
{
/*
* We only use prio for accounting critical level. For more info
* see comment for vmpressure_level_critical_prio variable above.
*/
if (prio > vmpressure_level_critical_prio)
return;
/*
* OK, the prio is below the threshold, updating vmpressure
* information before shrinker dives into long shrinking of long
* range vmscan. Passing scanned = vmpressure_win, reclaimed = 0
* to the vmpressure() basically means that we signal 'critical'
* level.
*/
vmpressure(gfp, memcg, vmpressure_win, 0);
}
/**
* vmpressure_register_event() - Bind vmpressure notifications to an eventfd
* @memcg: memcg that is interested in vmpressure notifications
* @eventfd: eventfd context to link notifications with
* @args: event arguments (used to set up a pressure level threshold)
*
* This function associates eventfd context with the vmpressure
* infrastructure, so that the notifications will be delivered to the
* @eventfd. The @args parameter is a string that denotes pressure level
* threshold (one of vmpressure_str_levels, i.e. "low", "medium", or
* "critical").
*
* To be used as memcg event method.
*/
int vmpressure_register_event(struct mem_cgroup *memcg,
struct eventfd_ctx *eventfd, const char *args)
{
struct vmpressure *vmpr = memcg_to_vmpressure(memcg);
struct vmpressure_event *ev;
int level;
for (level = 0; level < VMPRESSURE_NUM_LEVELS; level++) {
if (!strcmp(vmpressure_str_levels[level], args))
break;
}
if (level >= VMPRESSURE_NUM_LEVELS)
return -EINVAL;
ev = kzalloc(sizeof(*ev), GFP_KERNEL);
if (!ev)
return -ENOMEM;
ev->efd = eventfd;
ev->level = level;
mutex_lock(&vmpr->events_lock);
list_add(&ev->node, &vmpr->events);
mutex_unlock(&vmpr->events_lock);
return 0;
}
/**
* vmpressure_unregister_event() - Unbind eventfd from vmpressure
* @memcg: memcg handle
* @eventfd: eventfd context that was used to link vmpressure with the @cg
*
* This function does internal manipulations to detach the @eventfd from
* the vmpressure notifications, and then frees internal resources
* associated with the @eventfd (but the @eventfd itself is not freed).
*
* To be used as memcg event method.
*/
void vmpressure_unregister_event(struct mem_cgroup *memcg,
struct eventfd_ctx *eventfd)
{
struct vmpressure *vmpr = memcg_to_vmpressure(memcg);
struct vmpressure_event *ev;
mutex_lock(&vmpr->events_lock);
list_for_each_entry(ev, &vmpr->events, node) {
if (ev->efd != eventfd)
continue;
list_del(&ev->node);
kfree(ev);
break;
}
mutex_unlock(&vmpr->events_lock);
}
/**
* vmpressure_init() - Initialize vmpressure control structure
* @vmpr: Structure to be initialized
*
* This function should be called on every allocated vmpressure structure
* before any usage.
*/
void vmpressure_init(struct vmpressure *vmpr)
{
spin_lock_init(&vmpr->sr_lock);
mutex_init(&vmpr->events_lock);
INIT_LIST_HEAD(&vmpr->events);
INIT_WORK(&vmpr->work, vmpressure_work_fn);
}
/**
* vmpressure_cleanup() - shuts down vmpressure control structure
* @vmpr: Structure to be cleaned up
*
* This function should be called before the structure in which it is
* embedded is cleaned up.
*/
void vmpressure_cleanup(struct vmpressure *vmpr)
{
/*
* Make sure there is no pending work before eventfd infrastructure
* goes away.
*/
flush_work(&vmpr->work);
}

3865
mm/vmscan.c Normal file

File diff suppressed because it is too large Load diff

1597
mm/vmstat.c Normal file

File diff suppressed because it is too large Load diff

414
mm/workingset.c Normal file
View file

@ -0,0 +1,414 @@
/*
* Workingset detection
*
* Copyright (C) 2013 Red Hat, Inc., Johannes Weiner
*/
#include <linux/memcontrol.h>
#include <linux/writeback.h>
#include <linux/pagemap.h>
#include <linux/atomic.h>
#include <linux/module.h>
#include <linux/swap.h>
#include <linux/fs.h>
#include <linux/mm.h>
/*
* Double CLOCK lists
*
* Per zone, two clock lists are maintained for file pages: the
* inactive and the active list. Freshly faulted pages start out at
* the head of the inactive list and page reclaim scans pages from the
* tail. Pages that are accessed multiple times on the inactive list
* are promoted to the active list, to protect them from reclaim,
* whereas active pages are demoted to the inactive list when the
* active list grows too big.
*
* fault ------------------------+
* |
* +--------------+ | +-------------+
* reclaim <- | inactive | <-+-- demotion | active | <--+
* +--------------+ +-------------+ |
* | |
* +-------------- promotion ------------------+
*
*
* Access frequency and refault distance
*
* A workload is thrashing when its pages are frequently used but they
* are evicted from the inactive list every time before another access
* would have promoted them to the active list.
*
* In cases where the average access distance between thrashing pages
* is bigger than the size of memory there is nothing that can be
* done - the thrashing set could never fit into memory under any
* circumstance.
*
* However, the average access distance could be bigger than the
* inactive list, yet smaller than the size of memory. In this case,
* the set could fit into memory if it weren't for the currently
* active pages - which may be used more, hopefully less frequently:
*
* +-memory available to cache-+
* | |
* +-inactive------+-active----+
* a b | c d e f g h i | J K L M N |
* +---------------+-----------+
*
* It is prohibitively expensive to accurately track access frequency
* of pages. But a reasonable approximation can be made to measure
* thrashing on the inactive list, after which refaulting pages can be
* activated optimistically to compete with the existing active pages.
*
* Approximating inactive page access frequency - Observations:
*
* 1. When a page is accessed for the first time, it is added to the
* head of the inactive list, slides every existing inactive page
* towards the tail by one slot, and pushes the current tail page
* out of memory.
*
* 2. When a page is accessed for the second time, it is promoted to
* the active list, shrinking the inactive list by one slot. This
* also slides all inactive pages that were faulted into the cache
* more recently than the activated page towards the tail of the
* inactive list.
*
* Thus:
*
* 1. The sum of evictions and activations between any two points in
* time indicate the minimum number of inactive pages accessed in
* between.
*
* 2. Moving one inactive page N page slots towards the tail of the
* list requires at least N inactive page accesses.
*
* Combining these:
*
* 1. When a page is finally evicted from memory, the number of
* inactive pages accessed while the page was in cache is at least
* the number of page slots on the inactive list.
*
* 2. In addition, measuring the sum of evictions and activations (E)
* at the time of a page's eviction, and comparing it to another
* reading (R) at the time the page faults back into memory tells
* the minimum number of accesses while the page was not cached.
* This is called the refault distance.
*
* Because the first access of the page was the fault and the second
* access the refault, we combine the in-cache distance with the
* out-of-cache distance to get the complete minimum access distance
* of this page:
*
* NR_inactive + (R - E)
*
* And knowing the minimum access distance of a page, we can easily
* tell if the page would be able to stay in cache assuming all page
* slots in the cache were available:
*
* NR_inactive + (R - E) <= NR_inactive + NR_active
*
* which can be further simplified to
*
* (R - E) <= NR_active
*
* Put into words, the refault distance (out-of-cache) can be seen as
* a deficit in inactive list space (in-cache). If the inactive list
* had (R - E) more page slots, the page would not have been evicted
* in between accesses, but activated instead. And on a full system,
* the only thing eating into inactive list space is active pages.
*
*
* Activating refaulting pages
*
* All that is known about the active list is that the pages have been
* accessed more than once in the past. This means that at any given
* time there is actually a good chance that pages on the active list
* are no longer in active use.
*
* So when a refault distance of (R - E) is observed and there are at
* least (R - E) active pages, the refaulting page is activated
* optimistically in the hope that (R - E) active pages are actually
* used less frequently than the refaulting page - or even not used at
* all anymore.
*
* If this is wrong and demotion kicks in, the pages which are truly
* used more frequently will be reactivated while the less frequently
* used once will be evicted from memory.
*
* But if this is right, the stale pages will be pushed out of memory
* and the used pages get to stay in cache.
*
*
* Implementation
*
* For each zone's file LRU lists, a counter for inactive evictions
* and activations is maintained (zone->inactive_age).
*
* On eviction, a snapshot of this counter (along with some bits to
* identify the zone) is stored in the now empty page cache radix tree
* slot of the evicted page. This is called a shadow entry.
*
* On cache misses for which there are shadow entries, an eligible
* refault distance will immediately activate the refaulting page.
*/
static void *pack_shadow(unsigned long eviction, struct zone *zone)
{
eviction = (eviction << NODES_SHIFT) | zone_to_nid(zone);
eviction = (eviction << ZONES_SHIFT) | zone_idx(zone);
eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);
return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
}
static void unpack_shadow(void *shadow,
struct zone **zone,
unsigned long *distance)
{
unsigned long entry = (unsigned long)shadow;
unsigned long eviction;
unsigned long refault;
unsigned long mask;
int zid, nid;
entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
zid = entry & ((1UL << ZONES_SHIFT) - 1);
entry >>= ZONES_SHIFT;
nid = entry & ((1UL << NODES_SHIFT) - 1);
entry >>= NODES_SHIFT;
eviction = entry;
*zone = NODE_DATA(nid)->node_zones + zid;
refault = atomic_long_read(&(*zone)->inactive_age);
mask = ~0UL >> (NODES_SHIFT + ZONES_SHIFT +
RADIX_TREE_EXCEPTIONAL_SHIFT);
/*
* The unsigned subtraction here gives an accurate distance
* across inactive_age overflows in most cases.
*
* There is a special case: usually, shadow entries have a
* short lifetime and are either refaulted or reclaimed along
* with the inode before they get too old. But it is not
* impossible for the inactive_age to lap a shadow entry in
* the field, which can then can result in a false small
* refault distance, leading to a false activation should this
* old entry actually refault again. However, earlier kernels
* used to deactivate unconditionally with *every* reclaim
* invocation for the longest time, so the occasional
* inappropriate activation leading to pressure on the active
* list is not a problem.
*/
*distance = (refault - eviction) & mask;
}
/**
* workingset_eviction - note the eviction of a page from memory
* @mapping: address space the page was backing
* @page: the page being evicted
*
* Returns a shadow entry to be stored in @mapping->page_tree in place
* of the evicted @page so that a later refault can be detected.
*/
void *workingset_eviction(struct address_space *mapping, struct page *page)
{
struct zone *zone = page_zone(page);
unsigned long eviction;
eviction = atomic_long_inc_return(&zone->inactive_age);
return pack_shadow(eviction, zone);
}
/**
* workingset_refault - evaluate the refault of a previously evicted page
* @shadow: shadow entry of the evicted page
*
* Calculates and evaluates the refault distance of the previously
* evicted page in the context of the zone it was allocated in.
*
* Returns %true if the page should be activated, %false otherwise.
*/
bool workingset_refault(void *shadow)
{
unsigned long refault_distance;
struct zone *zone;
unpack_shadow(shadow, &zone, &refault_distance);
inc_zone_state(zone, WORKINGSET_REFAULT);
if (refault_distance <= zone_page_state(zone, NR_ACTIVE_FILE)) {
inc_zone_state(zone, WORKINGSET_ACTIVATE);
return true;
}
return false;
}
/**
* workingset_activation - note a page activation
* @page: page that is being activated
*/
void workingset_activation(struct page *page)
{
atomic_long_inc(&page_zone(page)->inactive_age);
}
/*
* Shadow entries reflect the share of the working set that does not
* fit into memory, so their number depends on the access pattern of
* the workload. In most cases, they will refault or get reclaimed
* along with the inode, but a (malicious) workload that streams
* through files with a total size several times that of available
* memory, while preventing the inodes from being reclaimed, can
* create excessive amounts of shadow nodes. To keep a lid on this,
* track shadow nodes and reclaim them when they grow way past the
* point where they would still be useful.
*/
struct list_lru workingset_shadow_nodes;
static unsigned long count_shadow_nodes(struct shrinker *shrinker,
struct shrink_control *sc)
{
unsigned long shadow_nodes;
unsigned long max_nodes;
unsigned long pages;
/* list_lru lock nests inside IRQ-safe mapping->tree_lock */
local_irq_disable();
shadow_nodes = list_lru_count_node(&workingset_shadow_nodes, sc->nid);
local_irq_enable();
pages = node_present_pages(sc->nid);
/*
* Active cache pages are limited to 50% of memory, and shadow
* entries that represent a refault distance bigger than that
* do not have any effect. Limit the number of shadow nodes
* such that shadow entries do not exceed the number of active
* cache pages, assuming a worst-case node population density
* of 1/8th on average.
*
* On 64-bit with 7 radix_tree_nodes per page and 64 slots
* each, this will reclaim shadow entries when they consume
* ~2% of available memory:
*
* PAGE_SIZE / radix_tree_nodes / node_entries / PAGE_SIZE
*/
max_nodes = pages >> (1 + RADIX_TREE_MAP_SHIFT - 3);
if (shadow_nodes <= max_nodes)
return 0;
return shadow_nodes - max_nodes;
}
static enum lru_status shadow_lru_isolate(struct list_head *item,
spinlock_t *lru_lock,
void *arg)
{
struct address_space *mapping;
struct radix_tree_node *node;
unsigned int i;
int ret;
/*
* Page cache insertions and deletions synchroneously maintain
* the shadow node LRU under the mapping->tree_lock and the
* lru_lock. Because the page cache tree is emptied before
* the inode can be destroyed, holding the lru_lock pins any
* address_space that has radix tree nodes on the LRU.
*
* We can then safely transition to the mapping->tree_lock to
* pin only the address_space of the particular node we want
* to reclaim, take the node off-LRU, and drop the lru_lock.
*/
node = container_of(item, struct radix_tree_node, private_list);
mapping = node->private_data;
/* Coming from the list, invert the lock order */
if (!spin_trylock(&mapping->tree_lock)) {
spin_unlock(lru_lock);
ret = LRU_RETRY;
goto out;
}
list_del_init(item);
spin_unlock(lru_lock);
/*
* The nodes should only contain one or more shadow entries,
* no pages, so we expect to be able to remove them all and
* delete and free the empty node afterwards.
*/
BUG_ON(!node->count);
BUG_ON(node->count & RADIX_TREE_COUNT_MASK);
for (i = 0; i < RADIX_TREE_MAP_SIZE; i++) {
if (node->slots[i]) {
BUG_ON(!radix_tree_exceptional_entry(node->slots[i]));
node->slots[i] = NULL;
BUG_ON(node->count < (1U << RADIX_TREE_COUNT_SHIFT));
node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
BUG_ON(!mapping->nrshadows);
mapping->nrshadows--;
}
}
BUG_ON(node->count);
inc_zone_state(page_zone(virt_to_page(node)), WORKINGSET_NODERECLAIM);
if (!__radix_tree_delete_node(&mapping->page_tree, node))
BUG();
spin_unlock(&mapping->tree_lock);
ret = LRU_REMOVED_RETRY;
out:
local_irq_enable();
cond_resched();
local_irq_disable();
spin_lock(lru_lock);
return ret;
}
static unsigned long scan_shadow_nodes(struct shrinker *shrinker,
struct shrink_control *sc)
{
unsigned long ret;
/* list_lru lock nests inside IRQ-safe mapping->tree_lock */
local_irq_disable();
ret = list_lru_walk_node(&workingset_shadow_nodes, sc->nid,
shadow_lru_isolate, NULL, &sc->nr_to_scan);
local_irq_enable();
return ret;
}
static struct shrinker workingset_shadow_shrinker = {
.count_objects = count_shadow_nodes,
.scan_objects = scan_shadow_nodes,
.seeks = DEFAULT_SEEKS,
.flags = SHRINKER_NUMA_AWARE,
};
/*
* Our list_lru->lock is IRQ-safe as it nests inside the IRQ-safe
* mapping->tree_lock.
*/
static struct lock_class_key shadow_nodes_key;
static int __init workingset_init(void)
{
int ret;
ret = list_lru_init_key(&workingset_shadow_nodes, &shadow_nodes_key);
if (ret)
goto err;
ret = register_shrinker(&workingset_shadow_shrinker);
if (ret)
goto err_list_lru;
return 0;
err_list_lru:
list_lru_destroy(&workingset_shadow_nodes);
err:
return ret;
}
module_init(workingset_init);

636
mm/zbud.c Normal file
View file

@ -0,0 +1,636 @@
/*
* zbud.c
*
* Copyright (C) 2013, Seth Jennings, IBM
*
* Concepts based on zcache internal zbud allocator by Dan Magenheimer.
*
* zbud is an special purpose allocator for storing compressed pages. Contrary
* to what its name may suggest, zbud is not a buddy allocator, but rather an
* allocator that "buddies" two compressed pages together in a single memory
* page.
*
* While this design limits storage density, it has simple and deterministic
* reclaim properties that make it preferable to a higher density approach when
* reclaim will be used.
*
* zbud works by storing compressed pages, or "zpages", together in pairs in a
* single memory page called a "zbud page". The first buddy is "left
* justified" at the beginning of the zbud page, and the last buddy is "right
* justified" at the end of the zbud page. The benefit is that if either
* buddy is freed, the freed buddy space, coalesced with whatever slack space
* that existed between the buddies, results in the largest possible free region
* within the zbud page.
*
* zbud also provides an attractive lower bound on density. The ratio of zpages
* to zbud pages can not be less than 1. This ensures that zbud can never "do
* harm" by using more pages to store zpages than the uncompressed zpages would
* have used on their own.
*
* zbud pages are divided into "chunks". The size of the chunks is fixed at
* compile time and determined by NCHUNKS_ORDER below. Dividing zbud pages
* into chunks allows organizing unbuddied zbud pages into a manageable number
* of unbuddied lists according to the number of free chunks available in the
* zbud page.
*
* The zbud API differs from that of conventional allocators in that the
* allocation function, zbud_alloc(), returns an opaque handle to the user,
* not a dereferenceable pointer. The user must map the handle using
* zbud_map() in order to get a usable pointer by which to access the
* allocation data and unmap the handle with zbud_unmap() when operations
* on the allocation data are complete.
*/
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
#include <linux/atomic.h>
#include <linux/list.h>
#include <linux/mm.h>
#include <linux/module.h>
#include <linux/preempt.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/zbud.h>
#include <linux/zpool.h>
/*****************
* Structures
*****************/
/*
* NCHUNKS_ORDER determines the internal allocation granularity, effectively
* adjusting internal fragmentation. It also determines the number of
* freelists maintained in each pool. NCHUNKS_ORDER of 6 means that the
* allocation granularity will be in chunks of size PAGE_SIZE/64. As one chunk
* in allocated page is occupied by zbud header, NCHUNKS will be calculated to
* 63 which shows the max number of free chunks in zbud page, also there will be
* 63 freelists per pool.
*/
#define NCHUNKS_ORDER 6
#define CHUNK_SHIFT (PAGE_SHIFT - NCHUNKS_ORDER)
#define CHUNK_SIZE (1 << CHUNK_SHIFT)
#define ZHDR_SIZE_ALIGNED CHUNK_SIZE
#define NCHUNKS ((PAGE_SIZE - ZHDR_SIZE_ALIGNED) >> CHUNK_SHIFT)
/**
* struct zbud_pool - stores metadata for each zbud pool
* @lock: protects all pool fields and first|last_chunk fields of any
* zbud page in the pool
* @unbuddied: array of lists tracking zbud pages that only contain one buddy;
* the lists each zbud page is added to depends on the size of
* its free region.
* @buddied: list tracking the zbud pages that contain two buddies;
* these zbud pages are full
* @lru: list tracking the zbud pages in LRU order by most recently
* added buddy.
* @pages_nr: number of zbud pages in the pool.
* @ops: pointer to a structure of user defined operations specified at
* pool creation time.
*
* This structure is allocated at pool creation time and maintains metadata
* pertaining to a particular zbud pool.
*/
struct zbud_pool {
spinlock_t lock;
struct list_head unbuddied[NCHUNKS];
struct list_head buddied;
struct list_head lru;
u64 pages_nr;
struct zbud_ops *ops;
};
/*
* struct zbud_header - zbud page metadata occupying the first chunk of each
* zbud page.
* @buddy: links the zbud page into the unbuddied/buddied lists in the pool
* @lru: links the zbud page into the lru list in the pool
* @first_chunks: the size of the first buddy in chunks, 0 if free
* @last_chunks: the size of the last buddy in chunks, 0 if free
*/
struct zbud_header {
struct list_head buddy;
struct list_head lru;
unsigned int first_chunks;
unsigned int last_chunks;
bool under_reclaim;
};
/*****************
* zpool
****************/
#ifdef CONFIG_ZPOOL
static int zbud_zpool_evict(struct zbud_pool *pool, unsigned long handle)
{
return zpool_evict(pool, handle);
}
static struct zbud_ops zbud_zpool_ops = {
.evict = zbud_zpool_evict
};
static void *zbud_zpool_create(char *name, gfp_t gfp,
struct zpool_ops *zpool_ops)
{
return zbud_create_pool(gfp, zpool_ops ? &zbud_zpool_ops : NULL);
}
static void zbud_zpool_destroy(void *pool)
{
zbud_destroy_pool(pool);
}
static int zbud_zpool_malloc(void *pool, size_t size, gfp_t gfp,
unsigned long *handle)
{
return zbud_alloc(pool, size, gfp, handle);
}
static void zbud_zpool_free(void *pool, unsigned long handle)
{
zbud_free(pool, handle);
}
static int zbud_zpool_shrink(void *pool, unsigned int pages,
unsigned int *reclaimed)
{
unsigned int total = 0;
int ret = -EINVAL;
while (total < pages) {
ret = zbud_reclaim_page(pool, 8);
if (ret < 0)
break;
total++;
}
if (reclaimed)
*reclaimed = total;
return ret;
}
static void *zbud_zpool_map(void *pool, unsigned long handle,
enum zpool_mapmode mm)
{
return zbud_map(pool, handle);
}
static void zbud_zpool_unmap(void *pool, unsigned long handle)
{
zbud_unmap(pool, handle);
}
static u64 zbud_zpool_total_size(void *pool)
{
return zbud_get_pool_size(pool) * PAGE_SIZE;
}
static unsigned long zbud_zpool_compact(void *pool)
{
return 0;
}
static bool zbud_zpool_compactable(void *pool, unsigned int pages)
{
return false;
}
static struct zpool_driver zbud_zpool_driver = {
.type = "zbud",
.owner = THIS_MODULE,
.create = zbud_zpool_create,
.destroy = zbud_zpool_destroy,
.malloc = zbud_zpool_malloc,
.free = zbud_zpool_free,
.shrink = zbud_zpool_shrink,
.map = zbud_zpool_map,
.unmap = zbud_zpool_unmap,
.total_size = zbud_zpool_total_size,
.compact = zbud_zpool_compact,
.compactable = zbud_zpool_compactable,
};
MODULE_ALIAS("zpool-zbud");
#endif /* CONFIG_ZPOOL */
/*****************
* Helpers
*****************/
/* Just to make the code easier to read */
enum buddy {
FIRST,
LAST
};
/* Converts an allocation size in bytes to size in zbud chunks */
static int size_to_chunks(size_t size)
{
return (size + CHUNK_SIZE - 1) >> CHUNK_SHIFT;
}
#define for_each_unbuddied_list(_iter, _begin) \
for ((_iter) = (_begin); (_iter) < NCHUNKS; (_iter)++)
/* Initializes the zbud header of a newly allocated zbud page */
static struct zbud_header *init_zbud_page(struct page *page)
{
struct zbud_header *zhdr = page_address(page);
zhdr->first_chunks = 0;
zhdr->last_chunks = 0;
INIT_LIST_HEAD(&zhdr->buddy);
INIT_LIST_HEAD(&zhdr->lru);
zhdr->under_reclaim = 0;
return zhdr;
}
/* Resets the struct page fields and frees the page */
static void free_zbud_page(struct zbud_header *zhdr)
{
__free_page(virt_to_page(zhdr));
}
/*
* Encodes the handle of a particular buddy within a zbud page
* Pool lock should be held as this function accesses first|last_chunks
*/
static unsigned long encode_handle(struct zbud_header *zhdr, enum buddy bud)
{
unsigned long handle;
/*
* For now, the encoded handle is actually just the pointer to the data
* but this might not always be the case. A little information hiding.
* Add CHUNK_SIZE to the handle if it is the first allocation to jump
* over the zbud header in the first chunk.
*/
handle = (unsigned long)zhdr;
if (bud == FIRST)
/* skip over zbud header */
handle += ZHDR_SIZE_ALIGNED;
else /* bud == LAST */
handle += PAGE_SIZE - (zhdr->last_chunks << CHUNK_SHIFT);
return handle;
}
/* Returns the zbud page where a given handle is stored */
static struct zbud_header *handle_to_zbud_header(unsigned long handle)
{
return (struct zbud_header *)(handle & PAGE_MASK);
}
/* Returns the number of free chunks in a zbud page */
static int num_free_chunks(struct zbud_header *zhdr)
{
/*
* Rather than branch for different situations, just use the fact that
* free buddies have a length of zero to simplify everything.
*/
return NCHUNKS - zhdr->first_chunks - zhdr->last_chunks;
}
/*****************
* API Functions
*****************/
/**
* zbud_create_pool() - create a new zbud pool
* @gfp: gfp flags when allocating the zbud pool structure
* @ops: user-defined operations for the zbud pool
*
* Return: pointer to the new zbud pool or NULL if the metadata allocation
* failed.
*/
struct zbud_pool *zbud_create_pool(gfp_t gfp, struct zbud_ops *ops)
{
struct zbud_pool *pool;
int i;
pool = kmalloc(sizeof(struct zbud_pool), gfp);
if (!pool)
return NULL;
spin_lock_init(&pool->lock);
for_each_unbuddied_list(i, 0)
INIT_LIST_HEAD(&pool->unbuddied[i]);
INIT_LIST_HEAD(&pool->buddied);
INIT_LIST_HEAD(&pool->lru);
pool->pages_nr = 0;
pool->ops = ops;
return pool;
}
/**
* zbud_destroy_pool() - destroys an existing zbud pool
* @pool: the zbud pool to be destroyed
*
* The pool should be emptied before this function is called.
*/
void zbud_destroy_pool(struct zbud_pool *pool)
{
kfree(pool);
}
/**
* zbud_alloc() - allocates a region of a given size
* @pool: zbud pool from which to allocate
* @size: size in bytes of the desired allocation
* @gfp: gfp flags used if the pool needs to grow
* @handle: handle of the new allocation
*
* This function will attempt to find a free region in the pool large enough to
* satisfy the allocation request. A search of the unbuddied lists is
* performed first. If no suitable free region is found, then a new page is
* allocated and added to the pool to satisfy the request.
*
* gfp should not set __GFP_HIGHMEM as highmem pages cannot be used
* as zbud pool pages.
*
* Return: 0 if success and handle is set, otherwise -EINVAL if the size or
* gfp arguments are invalid or -ENOMEM if the pool was unable to allocate
* a new page.
*/
int zbud_alloc(struct zbud_pool *pool, size_t size, gfp_t gfp,
unsigned long *handle)
{
int chunks, i, freechunks;
struct zbud_header *zhdr = NULL;
enum buddy bud;
struct page *page;
if (!size || (gfp & __GFP_HIGHMEM))
return -EINVAL;
if (size > PAGE_SIZE - ZHDR_SIZE_ALIGNED - CHUNK_SIZE)
return -ENOSPC;
chunks = size_to_chunks(size);
spin_lock(&pool->lock);
/* First, try to find an unbuddied zbud page. */
zhdr = NULL;
for_each_unbuddied_list(i, chunks) {
if (!list_empty(&pool->unbuddied[i])) {
zhdr = list_first_entry(&pool->unbuddied[i],
struct zbud_header, buddy);
list_del(&zhdr->buddy);
if (zhdr->first_chunks == 0)
bud = FIRST;
else
bud = LAST;
goto found;
}
}
/* Couldn't find unbuddied zbud page, create new one */
spin_unlock(&pool->lock);
page = alloc_page(gfp);
if (!page)
return -ENOMEM;
spin_lock(&pool->lock);
pool->pages_nr++;
zhdr = init_zbud_page(page);
bud = FIRST;
found:
if (bud == FIRST)
zhdr->first_chunks = chunks;
else
zhdr->last_chunks = chunks;
if (zhdr->first_chunks == 0 || zhdr->last_chunks == 0) {
/* Add to unbuddied list */
freechunks = num_free_chunks(zhdr);
list_add(&zhdr->buddy, &pool->unbuddied[freechunks]);
} else {
/* Add to buddied list */
list_add(&zhdr->buddy, &pool->buddied);
}
/* Add/move zbud page to beginning of LRU */
if (!list_empty(&zhdr->lru))
list_del(&zhdr->lru);
list_add(&zhdr->lru, &pool->lru);
*handle = encode_handle(zhdr, bud);
spin_unlock(&pool->lock);
return 0;
}
/**
* zbud_free() - frees the allocation associated with the given handle
* @pool: pool in which the allocation resided
* @handle: handle associated with the allocation returned by zbud_alloc()
*
* In the case that the zbud page in which the allocation resides is under
* reclaim, as indicated by the PG_reclaim flag being set, this function
* only sets the first|last_chunks to 0. The page is actually freed
* once both buddies are evicted (see zbud_reclaim_page() below).
*/
void zbud_free(struct zbud_pool *pool, unsigned long handle)
{
struct zbud_header *zhdr;
int freechunks;
spin_lock(&pool->lock);
zhdr = handle_to_zbud_header(handle);
/* If first buddy, handle will be page aligned */
if ((handle - ZHDR_SIZE_ALIGNED) & ~PAGE_MASK)
zhdr->last_chunks = 0;
else
zhdr->first_chunks = 0;
if (zhdr->under_reclaim) {
/* zbud page is under reclaim, reclaim will free */
spin_unlock(&pool->lock);
return;
}
/* Remove from existing buddy list */
list_del(&zhdr->buddy);
if (zhdr->first_chunks == 0 && zhdr->last_chunks == 0) {
/* zbud page is empty, free */
list_del(&zhdr->lru);
free_zbud_page(zhdr);
pool->pages_nr--;
} else {
/* Add to unbuddied list */
freechunks = num_free_chunks(zhdr);
list_add(&zhdr->buddy, &pool->unbuddied[freechunks]);
}
spin_unlock(&pool->lock);
}
#define list_tail_entry(ptr, type, member) \
list_entry((ptr)->prev, type, member)
/**
* zbud_reclaim_page() - evicts allocations from a pool page and frees it
* @pool: pool from which a page will attempt to be evicted
* @retires: number of pages on the LRU list for which eviction will
* be attempted before failing
*
* zbud reclaim is different from normal system reclaim in that the reclaim is
* done from the bottom, up. This is because only the bottom layer, zbud, has
* information on how the allocations are organized within each zbud page. This
* has the potential to create interesting locking situations between zbud and
* the user, however.
*
* To avoid these, this is how zbud_reclaim_page() should be called:
* The user detects a page should be reclaimed and calls zbud_reclaim_page().
* zbud_reclaim_page() will remove a zbud page from the pool LRU list and call
* the user-defined eviction handler with the pool and handle as arguments.
*
* If the handle can not be evicted, the eviction handler should return
* non-zero. zbud_reclaim_page() will add the zbud page back to the
* appropriate list and try the next zbud page on the LRU up to
* a user defined number of retries.
*
* If the handle is successfully evicted, the eviction handler should
* return 0 _and_ should have called zbud_free() on the handle. zbud_free()
* contains logic to delay freeing the page if the page is under reclaim,
* as indicated by the setting of the PG_reclaim flag on the underlying page.
*
* If all buddies in the zbud page are successfully evicted, then the
* zbud page can be freed.
*
* Returns: 0 if page is successfully freed, otherwise -EINVAL if there are
* no pages to evict or an eviction handler is not registered, -EAGAIN if
* the retry limit was hit.
*/
int zbud_reclaim_page(struct zbud_pool *pool, unsigned int retries)
{
int i, ret, freechunks;
struct zbud_header *zhdr;
unsigned long first_handle = 0, last_handle = 0;
spin_lock(&pool->lock);
if (!pool->ops || !pool->ops->evict || list_empty(&pool->lru) ||
retries == 0) {
spin_unlock(&pool->lock);
return -EINVAL;
}
for (i = 0; i < retries; i++) {
zhdr = list_tail_entry(&pool->lru, struct zbud_header, lru);
list_del(&zhdr->lru);
list_del(&zhdr->buddy);
/* Protect zbud page against free */
zhdr->under_reclaim = true;
/*
* We need encode the handles before unlocking, since we can
* race with free that will set (first|last)_chunks to 0
*/
first_handle = 0;
last_handle = 0;
if (zhdr->first_chunks)
first_handle = encode_handle(zhdr, FIRST);
if (zhdr->last_chunks)
last_handle = encode_handle(zhdr, LAST);
spin_unlock(&pool->lock);
/* Issue the eviction callback(s) */
if (first_handle) {
ret = pool->ops->evict(pool, first_handle);
if (ret)
goto next;
}
if (last_handle) {
ret = pool->ops->evict(pool, last_handle);
if (ret)
goto next;
}
next:
spin_lock(&pool->lock);
zhdr->under_reclaim = false;
if (zhdr->first_chunks == 0 && zhdr->last_chunks == 0) {
/*
* Both buddies are now free, free the zbud page and
* return success.
*/
free_zbud_page(zhdr);
pool->pages_nr--;
spin_unlock(&pool->lock);
return 0;
} else if (zhdr->first_chunks == 0 ||
zhdr->last_chunks == 0) {
/* add to unbuddied list */
freechunks = num_free_chunks(zhdr);
list_add(&zhdr->buddy, &pool->unbuddied[freechunks]);
} else {
/* add to buddied list */
list_add(&zhdr->buddy, &pool->buddied);
}
/* add to beginning of LRU */
list_add(&zhdr->lru, &pool->lru);
}
spin_unlock(&pool->lock);
return -EAGAIN;
}
/**
* zbud_map() - maps the allocation associated with the given handle
* @pool: pool in which the allocation resides
* @handle: handle associated with the allocation to be mapped
*
* While trivial for zbud, the mapping functions for others allocators
* implementing this allocation API could have more complex information encoded
* in the handle and could create temporary mappings to make the data
* accessible to the user.
*
* Returns: a pointer to the mapped allocation
*/
void *zbud_map(struct zbud_pool *pool, unsigned long handle)
{
return (void *)(handle);
}
/**
* zbud_unmap() - maps the allocation associated with the given handle
* @pool: pool in which the allocation resides
* @handle: handle associated with the allocation to be unmapped
*/
void zbud_unmap(struct zbud_pool *pool, unsigned long handle)
{
}
/**
* zbud_get_pool_size() - gets the zbud pool size in pages
* @pool: pool whose size is being queried
*
* Returns: size in pages of the given pool. The pool lock need not be
* taken to access pages_nr.
*/
u64 zbud_get_pool_size(struct zbud_pool *pool)
{
return pool->pages_nr;
}
static int __init init_zbud(void)
{
/* Make sure the zbud header will fit in one chunk */
BUILD_BUG_ON(sizeof(struct zbud_header) > ZHDR_SIZE_ALIGNED);
pr_info("loaded\n");
#ifdef CONFIG_ZPOOL
zpool_register_driver(&zbud_zpool_driver);
#endif
return 0;
}
static void __exit exit_zbud(void)
{
#ifdef CONFIG_ZPOOL
zpool_unregister_driver(&zbud_zpool_driver);
#endif
pr_info("unloaded\n");
}
module_init(init_zbud);
module_exit(exit_zbud);
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Seth Jennings <sjenning@linux.vnet.ibm.com>");
MODULE_DESCRIPTION("Buddy Allocator for Compressed Pages");

376
mm/zpool.c Normal file
View file

@ -0,0 +1,376 @@
/*
* zpool memory storage api
*
* Copyright (C) 2014 Dan Streetman
*
* This is a common frontend for memory storage pool implementations.
* Typically, this is used to store compressed memory.
*/
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
#include <linux/list.h>
#include <linux/types.h>
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/module.h>
#include <linux/zpool.h>
struct zpool {
char *type;
struct zpool_driver *driver;
void *pool;
struct zpool_ops *ops;
struct list_head list;
};
static LIST_HEAD(drivers_head);
static DEFINE_SPINLOCK(drivers_lock);
static LIST_HEAD(pools_head);
static DEFINE_SPINLOCK(pools_lock);
/**
* zpool_register_driver() - register a zpool implementation.
* @driver: driver to register
*/
void zpool_register_driver(struct zpool_driver *driver)
{
spin_lock(&drivers_lock);
atomic_set(&driver->refcount, 0);
list_add(&driver->list, &drivers_head);
spin_unlock(&drivers_lock);
}
EXPORT_SYMBOL(zpool_register_driver);
/**
* zpool_unregister_driver() - unregister a zpool implementation.
* @driver: driver to unregister.
*
* Module usage counting is used to prevent using a driver
* while/after unloading, so if this is called from module
* exit function, this should never fail; if called from
* other than the module exit function, and this returns
* failure, the driver is in use and must remain available.
*/
int zpool_unregister_driver(struct zpool_driver *driver)
{
int ret = 0, refcount;
spin_lock(&drivers_lock);
refcount = atomic_read(&driver->refcount);
WARN_ON(refcount < 0);
if (refcount > 0)
ret = -EBUSY;
else
list_del(&driver->list);
spin_unlock(&drivers_lock);
return ret;
}
EXPORT_SYMBOL(zpool_unregister_driver);
/**
* zpool_evict() - evict callback from a zpool implementation.
* @pool: pool to evict from.
* @handle: handle to evict.
*
* This can be used by zpool implementations to call the
* user's evict zpool_ops struct evict callback.
*/
int zpool_evict(void *pool, unsigned long handle)
{
struct zpool *zpool;
spin_lock(&pools_lock);
list_for_each_entry(zpool, &pools_head, list) {
if (zpool->pool == pool) {
spin_unlock(&pools_lock);
if (!zpool->ops || !zpool->ops->evict)
return -EINVAL;
return zpool->ops->evict(zpool, handle);
}
}
spin_unlock(&pools_lock);
return -ENOENT;
}
EXPORT_SYMBOL(zpool_evict);
static struct zpool_driver *zpool_get_driver(char *type)
{
struct zpool_driver *driver;
spin_lock(&drivers_lock);
list_for_each_entry(driver, &drivers_head, list) {
if (!strcmp(driver->type, type)) {
bool got = try_module_get(driver->owner);
if (got)
atomic_inc(&driver->refcount);
spin_unlock(&drivers_lock);
return got ? driver : NULL;
}
}
spin_unlock(&drivers_lock);
return NULL;
}
static void zpool_put_driver(struct zpool_driver *driver)
{
atomic_dec(&driver->refcount);
module_put(driver->owner);
}
/**
* zpool_create_pool() - Create a new zpool
* @type The type of the zpool to create (e.g. zbud, zsmalloc)
* @name The name of the zpool (e.g. zram0, zswap)
* @gfp The GFP flags to use when allocating the pool.
* @ops The optional ops callback.
*
* This creates a new zpool of the specified type. The gfp flags will be
* used when allocating memory, if the implementation supports it. If the
* ops param is NULL, then the created zpool will not be shrinkable.
*
* Implementations must guarantee this to be thread-safe.
*
* Returns: New zpool on success, NULL on failure.
*/
struct zpool *zpool_create_pool(char *type, char *name, gfp_t gfp,
struct zpool_ops *ops)
{
struct zpool_driver *driver;
struct zpool *zpool;
pr_info("creating pool type %s\n", type);
driver = zpool_get_driver(type);
if (!driver) {
request_module("zpool-%s", type);
driver = zpool_get_driver(type);
}
if (!driver) {
pr_err("no driver for type %s\n", type);
return NULL;
}
zpool = kmalloc(sizeof(*zpool), GFP_KERNEL);
if (!zpool) {
pr_err("couldn't create zpool - out of memory\n");
zpool_put_driver(driver);
return NULL;
}
zpool->type = driver->type;
zpool->driver = driver;
zpool->pool = driver->create(name, gfp, ops);
zpool->ops = ops;
if (!zpool->pool) {
pr_err("couldn't create %s pool\n", type);
zpool_put_driver(driver);
kfree(zpool);
return NULL;
}
pr_info("created %s pool\n", type);
spin_lock(&pools_lock);
list_add(&zpool->list, &pools_head);
spin_unlock(&pools_lock);
return zpool;
}
/**
* zpool_destroy_pool() - Destroy a zpool
* @pool The zpool to destroy.
*
* Implementations must guarantee this to be thread-safe,
* however only when destroying different pools. The same
* pool should only be destroyed once, and should not be used
* after it is destroyed.
*
* This destroys an existing zpool. The zpool should not be in use.
*/
void zpool_destroy_pool(struct zpool *zpool)
{
pr_info("destroying pool type %s\n", zpool->type);
spin_lock(&pools_lock);
list_del(&zpool->list);
spin_unlock(&pools_lock);
zpool->driver->destroy(zpool->pool);
zpool_put_driver(zpool->driver);
kfree(zpool);
}
/**
* zpool_get_type() - Get the type of the zpool
* @pool The zpool to check
*
* This returns the type of the pool.
*
* Implementations must guarantee this to be thread-safe.
*
* Returns: The type of zpool.
*/
char *zpool_get_type(struct zpool *zpool)
{
return zpool->type;
}
/**
* zpool_malloc() - Allocate memory
* @pool The zpool to allocate from.
* @size The amount of memory to allocate.
* @gfp The GFP flags to use when allocating memory.
* @handle Pointer to the handle to set
*
* This allocates the requested amount of memory from the pool.
* The gfp flags will be used when allocating memory, if the
* implementation supports it. The provided @handle will be
* set to the allocated object handle.
*
* Implementations must guarantee this to be thread-safe.
*
* Returns: 0 on success, negative value on error.
*/
int zpool_malloc(struct zpool *zpool, size_t size, gfp_t gfp,
unsigned long *handle)
{
return zpool->driver->malloc(zpool->pool, size, gfp, handle);
}
/**
* zpool_free() - Free previously allocated memory
* @pool The zpool that allocated the memory.
* @handle The handle to the memory to free.
*
* This frees previously allocated memory. This does not guarantee
* that the pool will actually free memory, only that the memory
* in the pool will become available for use by the pool.
*
* Implementations must guarantee this to be thread-safe,
* however only when freeing different handles. The same
* handle should only be freed once, and should not be used
* after freeing.
*/
void zpool_free(struct zpool *zpool, unsigned long handle)
{
zpool->driver->free(zpool->pool, handle);
}
/**
* zpool_shrink() - Shrink the pool size
* @pool The zpool to shrink.
* @pages The number of pages to shrink the pool.
* @reclaimed The number of pages successfully evicted.
*
* This attempts to shrink the actual memory size of the pool
* by evicting currently used handle(s). If the pool was
* created with no zpool_ops, or the evict call fails for any
* of the handles, this will fail. If non-NULL, the @reclaimed
* parameter will be set to the number of pages reclaimed,
* which may be more than the number of pages requested.
*
* Implementations must guarantee this to be thread-safe.
*
* Returns: 0 on success, negative value on error/failure.
*/
int zpool_shrink(struct zpool *zpool, unsigned int pages,
unsigned int *reclaimed)
{
return zpool->driver->shrink(zpool->pool, pages, reclaimed);
}
/**
* zpool_map_handle() - Map a previously allocated handle into memory
* @pool The zpool that the handle was allocated from
* @handle The handle to map
* @mm How the memory should be mapped
*
* This maps a previously allocated handle into memory. The @mm
* param indicates to the implementation how the memory will be
* used, i.e. read-only, write-only, read-write. If the
* implementation does not support it, the memory will be treated
* as read-write.
*
* This may hold locks, disable interrupts, and/or preemption,
* and the zpool_unmap_handle() must be called to undo those
* actions. The code that uses the mapped handle should complete
* its operatons on the mapped handle memory quickly and unmap
* as soon as possible. As the implementation may use per-cpu
* data, multiple handles should not be mapped concurrently on
* any cpu.
*
* Returns: A pointer to the handle's mapped memory area.
*/
void *zpool_map_handle(struct zpool *zpool, unsigned long handle,
enum zpool_mapmode mapmode)
{
return zpool->driver->map(zpool->pool, handle, mapmode);
}
/**
* zpool_unmap_handle() - Unmap a previously mapped handle
* @pool The zpool that the handle was allocated from
* @handle The handle to unmap
*
* This unmaps a previously mapped handle. Any locks or other
* actions that the implementation took in zpool_map_handle()
* will be undone here. The memory area returned from
* zpool_map_handle() should no longer be used after this.
*/
void zpool_unmap_handle(struct zpool *zpool, unsigned long handle)
{
zpool->driver->unmap(zpool->pool, handle);
}
/**
* zpool_get_total_size() - The total size of the pool
* @pool The zpool to check
*
* This returns the total size in bytes of the pool.
*
* Returns: Total size of the zpool in bytes.
*/
u64 zpool_get_total_size(struct zpool *zpool)
{
return zpool->driver->total_size(zpool->pool);
}
unsigned long zpool_compact(struct zpool *zpool)
{
return zpool->driver->compact(zpool->pool);
}
bool zpool_compactable(struct zpool *zpool, unsigned int pages)
{
return zpool->driver->compactable(zpool->pool, pages);
}
static int __init init_zpool(void)
{
pr_info("loaded\n");
return 0;
}
static void __exit exit_zpool(void)
{
pr_info("unloaded\n");
}
module_init(init_zpool);
module_exit(exit_zpool);
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Dan Streetman <ddstreet@ieee.org>");
MODULE_DESCRIPTION("Common API for compressed memory storage");

2452
mm/zsmalloc.c Normal file

File diff suppressed because it is too large Load diff

1326
mm/zswap.c Normal file

File diff suppressed because it is too large Load diff