Fixed MTP to work with TWRP

This commit is contained in:
awab228 2018-06-19 23:16:04 +02:00
commit f6dfaef42e
50820 changed files with 20846062 additions and 0 deletions

435
drivers/md/Kconfig Normal file
View file

@ -0,0 +1,435 @@
#
# Block device driver configuration
#
menuconfig MD
bool "Multiple devices driver support (RAID and LVM)"
depends on BLOCK
help
Support multiple physical spindles through a single logical device.
Required for RAID and logical volume management.
if MD
config BLK_DEV_MD
tristate "RAID support"
---help---
This driver lets you combine several hard disk partitions into one
logical block device. This can be used to simply append one
partition to another one or to combine several redundant hard disks
into a RAID1/4/5 device so as to provide protection against hard
disk failures. This is called "Software RAID" since the combining of
the partitions is done by the kernel. "Hardware RAID" means that the
combining is done by a dedicated controller; if you have such a
controller, you do not need to say Y here.
More information about Software RAID on Linux is contained in the
Software RAID mini-HOWTO, available from
<http://www.tldp.org/docs.html#howto>. There you will also learn
where to get the supporting user space utilities raidtools.
If unsure, say N.
config MD_AUTODETECT
bool "Autodetect RAID arrays during kernel boot"
depends on BLK_DEV_MD=y
default y
---help---
If you say Y here, then the kernel will try to autodetect raid
arrays as part of its boot process.
If you don't use raid and say Y, this autodetection can cause
a several-second delay in the boot time due to various
synchronisation steps that are part of this step.
If unsure, say Y.
config MD_LINEAR
tristate "Linear (append) mode"
depends on BLK_DEV_MD
---help---
If you say Y here, then your multiple devices driver will be able to
use the so-called linear mode, i.e. it will combine the hard disk
partitions by simply appending one to the other.
To compile this as a module, choose M here: the module
will be called linear.
If unsure, say Y.
config MD_RAID0
tristate "RAID-0 (striping) mode"
depends on BLK_DEV_MD
---help---
If you say Y here, then your multiple devices driver will be able to
use the so-called raid0 mode, i.e. it will combine the hard disk
partitions into one logical device in such a fashion as to fill them
up evenly, one chunk here and one chunk there. This will increase
the throughput rate if the partitions reside on distinct disks.
Information about Software RAID on Linux is contained in the
Software-RAID mini-HOWTO, available from
<http://www.tldp.org/docs.html#howto>. There you will also
learn where to get the supporting user space utilities raidtools.
To compile this as a module, choose M here: the module
will be called raid0.
If unsure, say Y.
config MD_RAID1
tristate "RAID-1 (mirroring) mode"
depends on BLK_DEV_MD
---help---
A RAID-1 set consists of several disk drives which are exact copies
of each other. In the event of a mirror failure, the RAID driver
will continue to use the operational mirrors in the set, providing
an error free MD (multiple device) to the higher levels of the
kernel. In a set with N drives, the available space is the capacity
of a single drive, and the set protects against a failure of (N - 1)
drives.
Information about Software RAID on Linux is contained in the
Software-RAID mini-HOWTO, available from
<http://www.tldp.org/docs.html#howto>. There you will also
learn where to get the supporting user space utilities raidtools.
If you want to use such a RAID-1 set, say Y. To compile this code
as a module, choose M here: the module will be called raid1.
If unsure, say Y.
config MD_RAID10
tristate "RAID-10 (mirrored striping) mode"
depends on BLK_DEV_MD
---help---
RAID-10 provides a combination of striping (RAID-0) and
mirroring (RAID-1) with easier configuration and more flexible
layout.
Unlike RAID-0, but like RAID-1, RAID-10 requires all devices to
be the same size (or at least, only as much as the smallest device
will be used).
RAID-10 provides a variety of layouts that provide different levels
of redundancy and performance.
RAID-10 requires mdadm-1.7.0 or later, available at:
ftp://ftp.kernel.org/pub/linux/utils/raid/mdadm/
If unsure, say Y.
config MD_RAID456
tristate "RAID-4/RAID-5/RAID-6 mode"
depends on BLK_DEV_MD
select RAID6_PQ
select ASYNC_MEMCPY
select ASYNC_XOR
select ASYNC_PQ
select ASYNC_RAID6_RECOV
---help---
A RAID-5 set of N drives with a capacity of C MB per drive provides
the capacity of C * (N - 1) MB, and protects against a failure
of a single drive. For a given sector (row) number, (N - 1) drives
contain data sectors, and one drive contains the parity protection.
For a RAID-4 set, the parity blocks are present on a single drive,
while a RAID-5 set distributes the parity across the drives in one
of the available parity distribution methods.
A RAID-6 set of N drives with a capacity of C MB per drive
provides the capacity of C * (N - 2) MB, and protects
against a failure of any two drives. For a given sector
(row) number, (N - 2) drives contain data sectors, and two
drives contains two independent redundancy syndromes. Like
RAID-5, RAID-6 distributes the syndromes across the drives
in one of the available parity distribution methods.
Information about Software RAID on Linux is contained in the
Software-RAID mini-HOWTO, available from
<http://www.tldp.org/docs.html#howto>. There you will also
learn where to get the supporting user space utilities raidtools.
If you want to use such a RAID-4/RAID-5/RAID-6 set, say Y. To
compile this code as a module, choose M here: the module
will be called raid456.
If unsure, say Y.
config MD_MULTIPATH
tristate "Multipath I/O support"
depends on BLK_DEV_MD
help
MD_MULTIPATH provides a simple multi-path personality for use
the MD framework. It is not under active development. New
projects should consider using DM_MULTIPATH which has more
features and more testing.
If unsure, say N.
config MD_FAULTY
tristate "Faulty test module for MD"
depends on BLK_DEV_MD
help
The "faulty" module allows for a block device that occasionally returns
read or write errors. It is useful for testing.
In unsure, say N.
source "drivers/md/bcache/Kconfig"
config BLK_DEV_DM_BUILTIN
boolean
config BLK_DEV_DM
tristate "Device mapper support"
select BLK_DEV_DM_BUILTIN
---help---
Device-mapper is a low level volume manager. It works by allowing
people to specify mappings for ranges of logical sectors. Various
mapping types are available, in addition people may write their own
modules containing custom mappings if they wish.
Higher level volume managers such as LVM2 use this driver.
To compile this as a module, choose M here: the module will be
called dm-mod.
If unsure, say N.
config DM_DEBUG
boolean "Device mapper debugging support"
depends on BLK_DEV_DM
---help---
Enable this for messages that may help debug device-mapper problems.
If unsure, say N.
config DM_BUFIO
tristate
depends on BLK_DEV_DM
---help---
This interface allows you to do buffered I/O on a device and acts
as a cache, holding recently-read blocks in memory and performing
delayed writes.
config DM_BIO_PRISON
tristate
depends on BLK_DEV_DM
---help---
Some bio locking schemes used by other device-mapper targets
including thin provisioning.
source "drivers/md/persistent-data/Kconfig"
config DM_CRYPT
tristate "Crypt target support"
depends on BLK_DEV_DM
select CRYPTO
select CRYPTO_CBC
---help---
This device-mapper target allows you to create a device that
transparently encrypts the data on it. You'll need to activate
the ciphers you're going to use in the cryptoapi configuration.
Information on how to use dm-crypt can be found on
<http://www.saout.de/misc/dm-crypt/>
To compile this code as a module, choose M here: the module will
be called dm-crypt.
If unsure, say N.
config DM_SNAPSHOT
tristate "Snapshot target"
depends on BLK_DEV_DM
select DM_BUFIO
---help---
Allow volume managers to take writable snapshots of a device.
config DM_THIN_PROVISIONING
tristate "Thin provisioning target"
depends on BLK_DEV_DM
select DM_PERSISTENT_DATA
select DM_BIO_PRISON
---help---
Provides thin provisioning and snapshots that share a data store.
config DM_CACHE
tristate "Cache target (EXPERIMENTAL)"
depends on BLK_DEV_DM
default n
select DM_PERSISTENT_DATA
select DM_BIO_PRISON
---help---
dm-cache attempts to improve performance of a block device by
moving frequently used data to a smaller, higher performance
device. Different 'policy' plugins can be used to change the
algorithms used to select which blocks are promoted, demoted,
cleaned etc. It supports writeback and writethrough modes.
config DM_CACHE_MQ
tristate "MQ Cache Policy (EXPERIMENTAL)"
depends on DM_CACHE
default y
---help---
A cache policy that uses a multiqueue ordered by recent hit
count to select which blocks should be promoted and demoted.
This is meant to be a general purpose policy. It prioritises
reads over writes.
config DM_CACHE_CLEANER
tristate "Cleaner Cache Policy (EXPERIMENTAL)"
depends on DM_CACHE
default y
---help---
A simple cache policy that writes back all data to the
origin. Used when decommissioning a dm-cache.
config DM_ERA
tristate "Era target (EXPERIMENTAL)"
depends on BLK_DEV_DM
default n
select DM_PERSISTENT_DATA
select DM_BIO_PRISON
---help---
dm-era tracks which parts of a block device are written to
over time. Useful for maintaining cache coherency when using
vendor snapshots.
config DM_MIRROR
tristate "Mirror target"
depends on BLK_DEV_DM
---help---
Allow volume managers to mirror logical volumes, also
needed for live data migration tools such as 'pvmove'.
config DM_LOG_USERSPACE
tristate "Mirror userspace logging"
depends on DM_MIRROR && NET
select CONNECTOR
---help---
The userspace logging module provides a mechanism for
relaying the dm-dirty-log API to userspace. Log designs
which are more suited to userspace implementation (e.g.
shared storage logs) or experimental logs can be implemented
by leveraging this framework.
config DM_RAID
tristate "RAID 1/4/5/6/10 target"
depends on BLK_DEV_DM
select MD_RAID1
select MD_RAID10
select MD_RAID456
select BLK_DEV_MD
---help---
A dm target that supports RAID1, RAID10, RAID4, RAID5 and RAID6 mappings
A RAID-5 set of N drives with a capacity of C MB per drive provides
the capacity of C * (N - 1) MB, and protects against a failure
of a single drive. For a given sector (row) number, (N - 1) drives
contain data sectors, and one drive contains the parity protection.
For a RAID-4 set, the parity blocks are present on a single drive,
while a RAID-5 set distributes the parity across the drives in one
of the available parity distribution methods.
A RAID-6 set of N drives with a capacity of C MB per drive
provides the capacity of C * (N - 2) MB, and protects
against a failure of any two drives. For a given sector
(row) number, (N - 2) drives contain data sectors, and two
drives contains two independent redundancy syndromes. Like
RAID-5, RAID-6 distributes the syndromes across the drives
in one of the available parity distribution methods.
config DM_ZERO
tristate "Zero target"
depends on BLK_DEV_DM
---help---
A target that discards writes, and returns all zeroes for
reads. Useful in some recovery situations.
config DM_MULTIPATH
tristate "Multipath target"
depends on BLK_DEV_DM
# nasty syntax but means make DM_MULTIPATH independent
# of SCSI_DH if the latter isn't defined but if
# it is, DM_MULTIPATH must depend on it. We get a build
# error if SCSI_DH=m and DM_MULTIPATH=y
depends on SCSI_DH || !SCSI_DH
---help---
Allow volume managers to support multipath hardware.
config DM_MULTIPATH_QL
tristate "I/O Path Selector based on the number of in-flight I/Os"
depends on DM_MULTIPATH
---help---
This path selector is a dynamic load balancer which selects
the path with the least number of in-flight I/Os.
If unsure, say N.
config DM_MULTIPATH_ST
tristate "I/O Path Selector based on the service time"
depends on DM_MULTIPATH
---help---
This path selector is a dynamic load balancer which selects
the path expected to complete the incoming I/O in the shortest
time.
If unsure, say N.
config DM_DELAY
tristate "I/O delaying target"
depends on BLK_DEV_DM
---help---
A target that delays reads and/or writes and can send
them to different devices. Useful for testing.
If unsure, say N.
config DM_UEVENT
bool "DM uevents"
depends on BLK_DEV_DM
---help---
Generate udev events for DM events.
config DM_FLAKEY
tristate "Flakey target"
depends on BLK_DEV_DM
---help---
A target that intermittently fails I/O for debugging purposes.
config DM_VERITY
tristate "Verity target support"
depends on BLK_DEV_DM
select CRYPTO
select CRYPTO_HASH
select DM_BUFIO
---help---
This device-mapper target creates a read-only device that
transparently validates the data on one underlying device against
a pre-generated tree of cryptographic checksums stored on a second
device.
You'll need to activate the digests you're going to use in the
cryptoapi configuration.
To compile this code as a module, choose M here: the module will
be called dm-verity.
If unsure, say N.
config DM_SWITCH
tristate "Switch target support (EXPERIMENTAL)"
depends on BLK_DEV_DM
---help---
This device-mapper target creates a device that supports an arbitrary
mapping of fixed-size regions of I/O across a fixed set of paths.
The path used for any specific region can be switched dynamically
by sending the target a message.
To compile this code as a module, choose M here: the module will
be called dm-switch.
If unsure, say N.
endif # MD

65
drivers/md/Makefile Normal file
View file

@ -0,0 +1,65 @@
#
# Makefile for the kernel software RAID and LVM drivers.
#
ifeq ($(SEC_BUILD_CONF_QUICK_DMVERITY),true)
EXTRA_CFLAGS += -DVERIFY_META_ONLY=true
endif
dm-mod-y += dm.o dm-table.o dm-target.o dm-linear.o dm-stripe.o \
dm-ioctl.o dm-io.o dm-kcopyd.o dm-sysfs.o dm-stats.o
dm-multipath-y += dm-path-selector.o dm-mpath.o
dm-snapshot-y += dm-snap.o dm-exception-store.o dm-snap-transient.o \
dm-snap-persistent.o
dm-mirror-y += dm-raid1.o
dm-log-userspace-y \
+= dm-log-userspace-base.o dm-log-userspace-transfer.o
dm-thin-pool-y += dm-thin.o dm-thin-metadata.o
dm-cache-y += dm-cache-target.o dm-cache-metadata.o dm-cache-policy.o
dm-cache-mq-y += dm-cache-policy-mq.o
dm-cache-cleaner-y += dm-cache-policy-cleaner.o
dm-era-y += dm-era-target.o
md-mod-y += md.o bitmap.o
raid456-y += raid5.o
# Note: link order is important. All raid personalities
# and must come before md.o, as they each initialise
# themselves, and md.o may use the personalities when it
# auto-initialised.
obj-$(CONFIG_MD_LINEAR) += linear.o
obj-$(CONFIG_MD_RAID0) += raid0.o
obj-$(CONFIG_MD_RAID1) += raid1.o
obj-$(CONFIG_MD_RAID10) += raid10.o
obj-$(CONFIG_MD_RAID456) += raid456.o
obj-$(CONFIG_MD_MULTIPATH) += multipath.o
obj-$(CONFIG_MD_FAULTY) += faulty.o
obj-$(CONFIG_BCACHE) += bcache/
obj-$(CONFIG_BLK_DEV_MD) += md-mod.o
obj-$(CONFIG_BLK_DEV_DM) += dm-mod.o
obj-$(CONFIG_BLK_DEV_DM_BUILTIN) += dm-builtin.o
obj-$(CONFIG_DM_BUFIO) += dm-bufio.o
obj-$(CONFIG_DM_BIO_PRISON) += dm-bio-prison.o
obj-$(CONFIG_DM_CRYPT) += dm-crypt.o
obj-$(CONFIG_DM_DELAY) += dm-delay.o
obj-$(CONFIG_DM_FLAKEY) += dm-flakey.o
obj-$(CONFIG_DM_MULTIPATH) += dm-multipath.o dm-round-robin.o
obj-$(CONFIG_DM_MULTIPATH_QL) += dm-queue-length.o
obj-$(CONFIG_DM_MULTIPATH_ST) += dm-service-time.o
obj-$(CONFIG_DM_SWITCH) += dm-switch.o
obj-$(CONFIG_DM_SNAPSHOT) += dm-snapshot.o
obj-$(CONFIG_DM_PERSISTENT_DATA) += persistent-data/
obj-$(CONFIG_DM_MIRROR) += dm-mirror.o dm-log.o dm-region-hash.o
obj-$(CONFIG_DM_LOG_USERSPACE) += dm-log-userspace.o
obj-$(CONFIG_DM_ZERO) += dm-zero.o
obj-$(CONFIG_DM_RAID) += dm-raid.o
obj-$(CONFIG_DM_THIN_PROVISIONING) += dm-thin-pool.o
obj-$(CONFIG_DM_VERITY) += dm-verity.o
obj-$(CONFIG_DM_CACHE) += dm-cache.o
obj-$(CONFIG_DM_VERITY) += dm-dirty.o
obj-$(CONFIG_DM_CACHE_MQ) += dm-cache-mq.o
obj-$(CONFIG_DM_CACHE_CLEANER) += dm-cache-cleaner.o
obj-$(CONFIG_DM_ERA) += dm-era.o
ifeq ($(CONFIG_DM_UEVENT),y)
dm-mod-objs += dm-uevent.o
endif

26
drivers/md/bcache/Kconfig Normal file
View file

@ -0,0 +1,26 @@
config BCACHE
tristate "Block device as cache"
---help---
Allows a block device to be used as cache for other devices; uses
a btree for indexing and the layout is optimized for SSDs.
See Documentation/bcache.txt for details.
config BCACHE_DEBUG
bool "Bcache debugging"
depends on BCACHE
---help---
Don't select this option unless you're a developer
Enables extra debugging tools, allows expensive runtime checks to be
turned on.
config BCACHE_CLOSURES_DEBUG
bool "Debug closures"
depends on BCACHE
select DEBUG_FS
---help---
Keeps all active closures in a linked list and provides a debugfs
interface to list them, which makes it possible to see asynchronous
operations that get stuck.

View file

@ -0,0 +1,8 @@
obj-$(CONFIG_BCACHE) += bcache.o
bcache-y := alloc.o bset.o btree.o closure.o debug.o extents.o\
io.o journal.o movinggc.o request.o stats.o super.o sysfs.o trace.o\
util.o writeback.o
CFLAGS_request.o += -Iblock

696
drivers/md/bcache/alloc.c Normal file
View file

@ -0,0 +1,696 @@
/*
* Primary bucket allocation code
*
* Copyright 2012 Google, Inc.
*
* Allocation in bcache is done in terms of buckets:
*
* Each bucket has associated an 8 bit gen; this gen corresponds to the gen in
* btree pointers - they must match for the pointer to be considered valid.
*
* Thus (assuming a bucket has no dirty data or metadata in it) we can reuse a
* bucket simply by incrementing its gen.
*
* The gens (along with the priorities; it's really the gens are important but
* the code is named as if it's the priorities) are written in an arbitrary list
* of buckets on disk, with a pointer to them in the journal header.
*
* When we invalidate a bucket, we have to write its new gen to disk and wait
* for that write to complete before we use it - otherwise after a crash we
* could have pointers that appeared to be good but pointed to data that had
* been overwritten.
*
* Since the gens and priorities are all stored contiguously on disk, we can
* batch this up: We fill up the free_inc list with freshly invalidated buckets,
* call prio_write(), and when prio_write() finishes we pull buckets off the
* free_inc list and optionally discard them.
*
* free_inc isn't the only freelist - if it was, we'd often to sleep while
* priorities and gens were being written before we could allocate. c->free is a
* smaller freelist, and buckets on that list are always ready to be used.
*
* If we've got discards enabled, that happens when a bucket moves from the
* free_inc list to the free list.
*
* There is another freelist, because sometimes we have buckets that we know
* have nothing pointing into them - these we can reuse without waiting for
* priorities to be rewritten. These come from freed btree nodes and buckets
* that garbage collection discovered no longer had valid keys pointing into
* them (because they were overwritten). That's the unused list - buckets on the
* unused list move to the free list, optionally being discarded in the process.
*
* It's also important to ensure that gens don't wrap around - with respect to
* either the oldest gen in the btree or the gen on disk. This is quite
* difficult to do in practice, but we explicitly guard against it anyways - if
* a bucket is in danger of wrapping around we simply skip invalidating it that
* time around, and we garbage collect or rewrite the priorities sooner than we
* would have otherwise.
*
* bch_bucket_alloc() allocates a single bucket from a specific cache.
*
* bch_bucket_alloc_set() allocates one or more buckets from different caches
* out of a cache set.
*
* free_some_buckets() drives all the processes described above. It's called
* from bch_bucket_alloc() and a few other places that need to make sure free
* buckets are ready.
*
* invalidate_buckets_(lru|fifo)() find buckets that are available to be
* invalidated, and then invalidate them and stick them on the free_inc list -
* in either lru or fifo order.
*/
#include "bcache.h"
#include "btree.h"
#include <linux/blkdev.h>
#include <linux/freezer.h>
#include <linux/kthread.h>
#include <linux/random.h>
#include <trace/events/bcache.h>
/* Bucket heap / gen */
uint8_t bch_inc_gen(struct cache *ca, struct bucket *b)
{
uint8_t ret = ++b->gen;
ca->set->need_gc = max(ca->set->need_gc, bucket_gc_gen(b));
WARN_ON_ONCE(ca->set->need_gc > BUCKET_GC_GEN_MAX);
return ret;
}
void bch_rescale_priorities(struct cache_set *c, int sectors)
{
struct cache *ca;
struct bucket *b;
unsigned next = c->nbuckets * c->sb.bucket_size / 1024;
unsigned i;
int r;
atomic_sub(sectors, &c->rescale);
do {
r = atomic_read(&c->rescale);
if (r >= 0)
return;
} while (atomic_cmpxchg(&c->rescale, r, r + next) != r);
mutex_lock(&c->bucket_lock);
c->min_prio = USHRT_MAX;
for_each_cache(ca, c, i)
for_each_bucket(b, ca)
if (b->prio &&
b->prio != BTREE_PRIO &&
!atomic_read(&b->pin)) {
b->prio--;
c->min_prio = min(c->min_prio, b->prio);
}
mutex_unlock(&c->bucket_lock);
}
/*
* Background allocation thread: scans for buckets to be invalidated,
* invalidates them, rewrites prios/gens (marking them as invalidated on disk),
* then optionally issues discard commands to the newly free buckets, then puts
* them on the various freelists.
*/
static inline bool can_inc_bucket_gen(struct bucket *b)
{
return bucket_gc_gen(b) < BUCKET_GC_GEN_MAX;
}
bool bch_can_invalidate_bucket(struct cache *ca, struct bucket *b)
{
BUG_ON(!ca->set->gc_mark_valid);
return (!GC_MARK(b) ||
GC_MARK(b) == GC_MARK_RECLAIMABLE) &&
!atomic_read(&b->pin) &&
can_inc_bucket_gen(b);
}
void __bch_invalidate_one_bucket(struct cache *ca, struct bucket *b)
{
lockdep_assert_held(&ca->set->bucket_lock);
BUG_ON(GC_MARK(b) && GC_MARK(b) != GC_MARK_RECLAIMABLE);
if (GC_SECTORS_USED(b))
trace_bcache_invalidate(ca, b - ca->buckets);
bch_inc_gen(ca, b);
b->prio = INITIAL_PRIO;
atomic_inc(&b->pin);
}
static void bch_invalidate_one_bucket(struct cache *ca, struct bucket *b)
{
__bch_invalidate_one_bucket(ca, b);
fifo_push(&ca->free_inc, b - ca->buckets);
}
/*
* Determines what order we're going to reuse buckets, smallest bucket_prio()
* first: we also take into account the number of sectors of live data in that
* bucket, and in order for that multiply to make sense we have to scale bucket
*
* Thus, we scale the bucket priorities so that the bucket with the smallest
* prio is worth 1/8th of what INITIAL_PRIO is worth.
*/
#define bucket_prio(b) \
({ \
unsigned min_prio = (INITIAL_PRIO - ca->set->min_prio) / 8; \
\
(b->prio - ca->set->min_prio + min_prio) * GC_SECTORS_USED(b); \
})
#define bucket_max_cmp(l, r) (bucket_prio(l) < bucket_prio(r))
#define bucket_min_cmp(l, r) (bucket_prio(l) > bucket_prio(r))
static void invalidate_buckets_lru(struct cache *ca)
{
struct bucket *b;
ssize_t i;
ca->heap.used = 0;
for_each_bucket(b, ca) {
if (!bch_can_invalidate_bucket(ca, b))
continue;
if (!heap_full(&ca->heap))
heap_add(&ca->heap, b, bucket_max_cmp);
else if (bucket_max_cmp(b, heap_peek(&ca->heap))) {
ca->heap.data[0] = b;
heap_sift(&ca->heap, 0, bucket_max_cmp);
}
}
for (i = ca->heap.used / 2 - 1; i >= 0; --i)
heap_sift(&ca->heap, i, bucket_min_cmp);
while (!fifo_full(&ca->free_inc)) {
if (!heap_pop(&ca->heap, b, bucket_min_cmp)) {
/*
* We don't want to be calling invalidate_buckets()
* multiple times when it can't do anything
*/
ca->invalidate_needs_gc = 1;
wake_up_gc(ca->set);
return;
}
bch_invalidate_one_bucket(ca, b);
}
}
static void invalidate_buckets_fifo(struct cache *ca)
{
struct bucket *b;
size_t checked = 0;
while (!fifo_full(&ca->free_inc)) {
if (ca->fifo_last_bucket < ca->sb.first_bucket ||
ca->fifo_last_bucket >= ca->sb.nbuckets)
ca->fifo_last_bucket = ca->sb.first_bucket;
b = ca->buckets + ca->fifo_last_bucket++;
if (bch_can_invalidate_bucket(ca, b))
bch_invalidate_one_bucket(ca, b);
if (++checked >= ca->sb.nbuckets) {
ca->invalidate_needs_gc = 1;
wake_up_gc(ca->set);
return;
}
}
}
static void invalidate_buckets_random(struct cache *ca)
{
struct bucket *b;
size_t checked = 0;
while (!fifo_full(&ca->free_inc)) {
size_t n;
get_random_bytes(&n, sizeof(n));
n %= (size_t) (ca->sb.nbuckets - ca->sb.first_bucket);
n += ca->sb.first_bucket;
b = ca->buckets + n;
if (bch_can_invalidate_bucket(ca, b))
bch_invalidate_one_bucket(ca, b);
if (++checked >= ca->sb.nbuckets / 2) {
ca->invalidate_needs_gc = 1;
wake_up_gc(ca->set);
return;
}
}
}
static void invalidate_buckets(struct cache *ca)
{
BUG_ON(ca->invalidate_needs_gc);
switch (CACHE_REPLACEMENT(&ca->sb)) {
case CACHE_REPLACEMENT_LRU:
invalidate_buckets_lru(ca);
break;
case CACHE_REPLACEMENT_FIFO:
invalidate_buckets_fifo(ca);
break;
case CACHE_REPLACEMENT_RANDOM:
invalidate_buckets_random(ca);
break;
}
}
#define allocator_wait(ca, cond) \
do { \
while (1) { \
set_current_state(TASK_INTERRUPTIBLE); \
if (cond) \
break; \
\
mutex_unlock(&(ca)->set->bucket_lock); \
if (kthread_should_stop()) \
return 0; \
\
try_to_freeze(); \
schedule(); \
mutex_lock(&(ca)->set->bucket_lock); \
} \
__set_current_state(TASK_RUNNING); \
} while (0)
static int bch_allocator_push(struct cache *ca, long bucket)
{
unsigned i;
/* Prios/gens are actually the most important reserve */
if (fifo_push(&ca->free[RESERVE_PRIO], bucket))
return true;
for (i = 0; i < RESERVE_NR; i++)
if (fifo_push(&ca->free[i], bucket))
return true;
return false;
}
static int bch_allocator_thread(void *arg)
{
struct cache *ca = arg;
mutex_lock(&ca->set->bucket_lock);
while (1) {
/*
* First, we pull buckets off of the unused and free_inc lists,
* possibly issue discards to them, then we add the bucket to
* the free list:
*/
while (!fifo_empty(&ca->free_inc)) {
long bucket;
fifo_pop(&ca->free_inc, bucket);
if (ca->discard) {
mutex_unlock(&ca->set->bucket_lock);
blkdev_issue_discard(ca->bdev,
bucket_to_sector(ca->set, bucket),
ca->sb.bucket_size, GFP_KERNEL, 0);
mutex_lock(&ca->set->bucket_lock);
}
allocator_wait(ca, bch_allocator_push(ca, bucket));
wake_up(&ca->set->btree_cache_wait);
wake_up(&ca->set->bucket_wait);
}
/*
* We've run out of free buckets, we need to find some buckets
* we can invalidate. First, invalidate them in memory and add
* them to the free_inc list:
*/
retry_invalidate:
allocator_wait(ca, ca->set->gc_mark_valid &&
!ca->invalidate_needs_gc);
invalidate_buckets(ca);
/*
* Now, we write their new gens to disk so we can start writing
* new stuff to them:
*/
allocator_wait(ca, !atomic_read(&ca->set->prio_blocked));
if (CACHE_SYNC(&ca->set->sb)) {
/*
* This could deadlock if an allocation with a btree
* node locked ever blocked - having the btree node
* locked would block garbage collection, but here we're
* waiting on garbage collection before we invalidate
* and free anything.
*
* But this should be safe since the btree code always
* uses btree_check_reserve() before allocating now, and
* if it fails it blocks without btree nodes locked.
*/
if (!fifo_full(&ca->free_inc))
goto retry_invalidate;
bch_prio_write(ca);
}
}
}
/* Allocation */
long bch_bucket_alloc(struct cache *ca, unsigned reserve, bool wait)
{
DEFINE_WAIT(w);
struct bucket *b;
long r;
/* fastpath */
if (fifo_pop(&ca->free[RESERVE_NONE], r) ||
fifo_pop(&ca->free[reserve], r))
goto out;
if (!wait) {
trace_bcache_alloc_fail(ca, reserve);
return -1;
}
do {
prepare_to_wait(&ca->set->bucket_wait, &w,
TASK_UNINTERRUPTIBLE);
mutex_unlock(&ca->set->bucket_lock);
schedule();
mutex_lock(&ca->set->bucket_lock);
} while (!fifo_pop(&ca->free[RESERVE_NONE], r) &&
!fifo_pop(&ca->free[reserve], r));
finish_wait(&ca->set->bucket_wait, &w);
out:
wake_up_process(ca->alloc_thread);
trace_bcache_alloc(ca, reserve);
if (expensive_debug_checks(ca->set)) {
size_t iter;
long i;
unsigned j;
for (iter = 0; iter < prio_buckets(ca) * 2; iter++)
BUG_ON(ca->prio_buckets[iter] == (uint64_t) r);
for (j = 0; j < RESERVE_NR; j++)
fifo_for_each(i, &ca->free[j], iter)
BUG_ON(i == r);
fifo_for_each(i, &ca->free_inc, iter)
BUG_ON(i == r);
}
b = ca->buckets + r;
BUG_ON(atomic_read(&b->pin) != 1);
SET_GC_SECTORS_USED(b, ca->sb.bucket_size);
if (reserve <= RESERVE_PRIO) {
SET_GC_MARK(b, GC_MARK_METADATA);
SET_GC_MOVE(b, 0);
b->prio = BTREE_PRIO;
} else {
SET_GC_MARK(b, GC_MARK_RECLAIMABLE);
SET_GC_MOVE(b, 0);
b->prio = INITIAL_PRIO;
}
return r;
}
void __bch_bucket_free(struct cache *ca, struct bucket *b)
{
SET_GC_MARK(b, 0);
SET_GC_SECTORS_USED(b, 0);
}
void bch_bucket_free(struct cache_set *c, struct bkey *k)
{
unsigned i;
for (i = 0; i < KEY_PTRS(k); i++)
__bch_bucket_free(PTR_CACHE(c, k, i),
PTR_BUCKET(c, k, i));
}
int __bch_bucket_alloc_set(struct cache_set *c, unsigned reserve,
struct bkey *k, int n, bool wait)
{
int i;
lockdep_assert_held(&c->bucket_lock);
BUG_ON(!n || n > c->caches_loaded || n > 8);
bkey_init(k);
/* sort by free space/prio of oldest data in caches */
for (i = 0; i < n; i++) {
struct cache *ca = c->cache_by_alloc[i];
long b = bch_bucket_alloc(ca, reserve, wait);
if (b == -1)
goto err;
k->ptr[i] = PTR(ca->buckets[b].gen,
bucket_to_sector(c, b),
ca->sb.nr_this_dev);
SET_KEY_PTRS(k, i + 1);
}
return 0;
err:
bch_bucket_free(c, k);
bkey_put(c, k);
return -1;
}
int bch_bucket_alloc_set(struct cache_set *c, unsigned reserve,
struct bkey *k, int n, bool wait)
{
int ret;
mutex_lock(&c->bucket_lock);
ret = __bch_bucket_alloc_set(c, reserve, k, n, wait);
mutex_unlock(&c->bucket_lock);
return ret;
}
/* Sector allocator */
struct open_bucket {
struct list_head list;
unsigned last_write_point;
unsigned sectors_free;
BKEY_PADDED(key);
};
/*
* We keep multiple buckets open for writes, and try to segregate different
* write streams for better cache utilization: first we look for a bucket where
* the last write to it was sequential with the current write, and failing that
* we look for a bucket that was last used by the same task.
*
* The ideas is if you've got multiple tasks pulling data into the cache at the
* same time, you'll get better cache utilization if you try to segregate their
* data and preserve locality.
*
* For example, say you've starting Firefox at the same time you're copying a
* bunch of files. Firefox will likely end up being fairly hot and stay in the
* cache awhile, but the data you copied might not be; if you wrote all that
* data to the same buckets it'd get invalidated at the same time.
*
* Both of those tasks will be doing fairly random IO so we can't rely on
* detecting sequential IO to segregate their data, but going off of the task
* should be a sane heuristic.
*/
static struct open_bucket *pick_data_bucket(struct cache_set *c,
const struct bkey *search,
unsigned write_point,
struct bkey *alloc)
{
struct open_bucket *ret, *ret_task = NULL;
list_for_each_entry_reverse(ret, &c->data_buckets, list)
if (!bkey_cmp(&ret->key, search))
goto found;
else if (ret->last_write_point == write_point)
ret_task = ret;
ret = ret_task ?: list_first_entry(&c->data_buckets,
struct open_bucket, list);
found:
if (!ret->sectors_free && KEY_PTRS(alloc)) {
ret->sectors_free = c->sb.bucket_size;
bkey_copy(&ret->key, alloc);
bkey_init(alloc);
}
if (!ret->sectors_free)
ret = NULL;
return ret;
}
/*
* Allocates some space in the cache to write to, and k to point to the newly
* allocated space, and updates KEY_SIZE(k) and KEY_OFFSET(k) (to point to the
* end of the newly allocated space).
*
* May allocate fewer sectors than @sectors, KEY_SIZE(k) indicates how many
* sectors were actually allocated.
*
* If s->writeback is true, will not fail.
*/
bool bch_alloc_sectors(struct cache_set *c, struct bkey *k, unsigned sectors,
unsigned write_point, unsigned write_prio, bool wait)
{
struct open_bucket *b;
BKEY_PADDED(key) alloc;
unsigned i;
/*
* We might have to allocate a new bucket, which we can't do with a
* spinlock held. So if we have to allocate, we drop the lock, allocate
* and then retry. KEY_PTRS() indicates whether alloc points to
* allocated bucket(s).
*/
bkey_init(&alloc.key);
spin_lock(&c->data_bucket_lock);
while (!(b = pick_data_bucket(c, k, write_point, &alloc.key))) {
unsigned watermark = write_prio
? RESERVE_MOVINGGC
: RESERVE_NONE;
spin_unlock(&c->data_bucket_lock);
if (bch_bucket_alloc_set(c, watermark, &alloc.key, 1, wait))
return false;
spin_lock(&c->data_bucket_lock);
}
/*
* If we had to allocate, we might race and not need to allocate the
* second time we call find_data_bucket(). If we allocated a bucket but
* didn't use it, drop the refcount bch_bucket_alloc_set() took:
*/
if (KEY_PTRS(&alloc.key))
bkey_put(c, &alloc.key);
for (i = 0; i < KEY_PTRS(&b->key); i++)
EBUG_ON(ptr_stale(c, &b->key, i));
/* Set up the pointer to the space we're allocating: */
for (i = 0; i < KEY_PTRS(&b->key); i++)
k->ptr[i] = b->key.ptr[i];
sectors = min(sectors, b->sectors_free);
SET_KEY_OFFSET(k, KEY_OFFSET(k) + sectors);
SET_KEY_SIZE(k, sectors);
SET_KEY_PTRS(k, KEY_PTRS(&b->key));
/*
* Move b to the end of the lru, and keep track of what this bucket was
* last used for:
*/
list_move_tail(&b->list, &c->data_buckets);
bkey_copy_key(&b->key, k);
b->last_write_point = write_point;
b->sectors_free -= sectors;
for (i = 0; i < KEY_PTRS(&b->key); i++) {
SET_PTR_OFFSET(&b->key, i, PTR_OFFSET(&b->key, i) + sectors);
atomic_long_add(sectors,
&PTR_CACHE(c, &b->key, i)->sectors_written);
}
if (b->sectors_free < c->sb.block_size)
b->sectors_free = 0;
/*
* k takes refcounts on the buckets it points to until it's inserted
* into the btree, but if we're done with this bucket we just transfer
* get_data_bucket()'s refcount.
*/
if (b->sectors_free)
for (i = 0; i < KEY_PTRS(&b->key); i++)
atomic_inc(&PTR_BUCKET(c, &b->key, i)->pin);
spin_unlock(&c->data_bucket_lock);
return true;
}
/* Init */
void bch_open_buckets_free(struct cache_set *c)
{
struct open_bucket *b;
while (!list_empty(&c->data_buckets)) {
b = list_first_entry(&c->data_buckets,
struct open_bucket, list);
list_del(&b->list);
kfree(b);
}
}
int bch_open_buckets_alloc(struct cache_set *c)
{
int i;
spin_lock_init(&c->data_bucket_lock);
for (i = 0; i < 6; i++) {
struct open_bucket *b = kzalloc(sizeof(*b), GFP_KERNEL);
if (!b)
return -ENOMEM;
list_add(&b->list, &c->data_buckets);
}
return 0;
}
int bch_cache_allocator_start(struct cache *ca)
{
struct task_struct *k = kthread_run(bch_allocator_thread,
ca, "bcache_allocator");
if (IS_ERR(k))
return PTR_ERR(k);
ca->alloc_thread = k;
return 0;
}

946
drivers/md/bcache/bcache.h Normal file
View file

@ -0,0 +1,946 @@
#ifndef _BCACHE_H
#define _BCACHE_H
/*
* SOME HIGH LEVEL CODE DOCUMENTATION:
*
* Bcache mostly works with cache sets, cache devices, and backing devices.
*
* Support for multiple cache devices hasn't quite been finished off yet, but
* it's about 95% plumbed through. A cache set and its cache devices is sort of
* like a md raid array and its component devices. Most of the code doesn't care
* about individual cache devices, the main abstraction is the cache set.
*
* Multiple cache devices is intended to give us the ability to mirror dirty
* cached data and metadata, without mirroring clean cached data.
*
* Backing devices are different, in that they have a lifetime independent of a
* cache set. When you register a newly formatted backing device it'll come up
* in passthrough mode, and then you can attach and detach a backing device from
* a cache set at runtime - while it's mounted and in use. Detaching implicitly
* invalidates any cached data for that backing device.
*
* A cache set can have multiple (many) backing devices attached to it.
*
* There's also flash only volumes - this is the reason for the distinction
* between struct cached_dev and struct bcache_device. A flash only volume
* works much like a bcache device that has a backing device, except the
* "cached" data is always dirty. The end result is that we get thin
* provisioning with very little additional code.
*
* Flash only volumes work but they're not production ready because the moving
* garbage collector needs more work. More on that later.
*
* BUCKETS/ALLOCATION:
*
* Bcache is primarily designed for caching, which means that in normal
* operation all of our available space will be allocated. Thus, we need an
* efficient way of deleting things from the cache so we can write new things to
* it.
*
* To do this, we first divide the cache device up into buckets. A bucket is the
* unit of allocation; they're typically around 1 mb - anywhere from 128k to 2M+
* works efficiently.
*
* Each bucket has a 16 bit priority, and an 8 bit generation associated with
* it. The gens and priorities for all the buckets are stored contiguously and
* packed on disk (in a linked list of buckets - aside from the superblock, all
* of bcache's metadata is stored in buckets).
*
* The priority is used to implement an LRU. We reset a bucket's priority when
* we allocate it or on cache it, and every so often we decrement the priority
* of each bucket. It could be used to implement something more sophisticated,
* if anyone ever gets around to it.
*
* The generation is used for invalidating buckets. Each pointer also has an 8
* bit generation embedded in it; for a pointer to be considered valid, its gen
* must match the gen of the bucket it points into. Thus, to reuse a bucket all
* we have to do is increment its gen (and write its new gen to disk; we batch
* this up).
*
* Bcache is entirely COW - we never write twice to a bucket, even buckets that
* contain metadata (including btree nodes).
*
* THE BTREE:
*
* Bcache is in large part design around the btree.
*
* At a high level, the btree is just an index of key -> ptr tuples.
*
* Keys represent extents, and thus have a size field. Keys also have a variable
* number of pointers attached to them (potentially zero, which is handy for
* invalidating the cache).
*
* The key itself is an inode:offset pair. The inode number corresponds to a
* backing device or a flash only volume. The offset is the ending offset of the
* extent within the inode - not the starting offset; this makes lookups
* slightly more convenient.
*
* Pointers contain the cache device id, the offset on that device, and an 8 bit
* generation number. More on the gen later.
*
* Index lookups are not fully abstracted - cache lookups in particular are
* still somewhat mixed in with the btree code, but things are headed in that
* direction.
*
* Updates are fairly well abstracted, though. There are two different ways of
* updating the btree; insert and replace.
*
* BTREE_INSERT will just take a list of keys and insert them into the btree -
* overwriting (possibly only partially) any extents they overlap with. This is
* used to update the index after a write.
*
* BTREE_REPLACE is really cmpxchg(); it inserts a key into the btree iff it is
* overwriting a key that matches another given key. This is used for inserting
* data into the cache after a cache miss, and for background writeback, and for
* the moving garbage collector.
*
* There is no "delete" operation; deleting things from the index is
* accomplished by either by invalidating pointers (by incrementing a bucket's
* gen) or by inserting a key with 0 pointers - which will overwrite anything
* previously present at that location in the index.
*
* This means that there are always stale/invalid keys in the btree. They're
* filtered out by the code that iterates through a btree node, and removed when
* a btree node is rewritten.
*
* BTREE NODES:
*
* Our unit of allocation is a bucket, and we we can't arbitrarily allocate and
* free smaller than a bucket - so, that's how big our btree nodes are.
*
* (If buckets are really big we'll only use part of the bucket for a btree node
* - no less than 1/4th - but a bucket still contains no more than a single
* btree node. I'd actually like to change this, but for now we rely on the
* bucket's gen for deleting btree nodes when we rewrite/split a node.)
*
* Anyways, btree nodes are big - big enough to be inefficient with a textbook
* btree implementation.
*
* The way this is solved is that btree nodes are internally log structured; we
* can append new keys to an existing btree node without rewriting it. This
* means each set of keys we write is sorted, but the node is not.
*
* We maintain this log structure in memory - keeping 1Mb of keys sorted would
* be expensive, and we have to distinguish between the keys we have written and
* the keys we haven't. So to do a lookup in a btree node, we have to search
* each sorted set. But we do merge written sets together lazily, so the cost of
* these extra searches is quite low (normally most of the keys in a btree node
* will be in one big set, and then there'll be one or two sets that are much
* smaller).
*
* This log structure makes bcache's btree more of a hybrid between a
* conventional btree and a compacting data structure, with some of the
* advantages of both.
*
* GARBAGE COLLECTION:
*
* We can't just invalidate any bucket - it might contain dirty data or
* metadata. If it once contained dirty data, other writes might overwrite it
* later, leaving no valid pointers into that bucket in the index.
*
* Thus, the primary purpose of garbage collection is to find buckets to reuse.
* It also counts how much valid data it each bucket currently contains, so that
* allocation can reuse buckets sooner when they've been mostly overwritten.
*
* It also does some things that are really internal to the btree
* implementation. If a btree node contains pointers that are stale by more than
* some threshold, it rewrites the btree node to avoid the bucket's generation
* wrapping around. It also merges adjacent btree nodes if they're empty enough.
*
* THE JOURNAL:
*
* Bcache's journal is not necessary for consistency; we always strictly
* order metadata writes so that the btree and everything else is consistent on
* disk in the event of an unclean shutdown, and in fact bcache had writeback
* caching (with recovery from unclean shutdown) before journalling was
* implemented.
*
* Rather, the journal is purely a performance optimization; we can't complete a
* write until we've updated the index on disk, otherwise the cache would be
* inconsistent in the event of an unclean shutdown. This means that without the
* journal, on random write workloads we constantly have to update all the leaf
* nodes in the btree, and those writes will be mostly empty (appending at most
* a few keys each) - highly inefficient in terms of amount of metadata writes,
* and it puts more strain on the various btree resorting/compacting code.
*
* The journal is just a log of keys we've inserted; on startup we just reinsert
* all the keys in the open journal entries. That means that when we're updating
* a node in the btree, we can wait until a 4k block of keys fills up before
* writing them out.
*
* For simplicity, we only journal updates to leaf nodes; updates to parent
* nodes are rare enough (since our leaf nodes are huge) that it wasn't worth
* the complexity to deal with journalling them (in particular, journal replay)
* - updates to non leaf nodes just happen synchronously (see btree_split()).
*/
#define pr_fmt(fmt) "bcache: %s() " fmt "\n", __func__
#include <linux/bcache.h>
#include <linux/bio.h>
#include <linux/kobject.h>
#include <linux/list.h>
#include <linux/mutex.h>
#include <linux/rbtree.h>
#include <linux/rwsem.h>
#include <linux/types.h>
#include <linux/workqueue.h>
#include "bset.h"
#include "util.h"
#include "closure.h"
struct bucket {
atomic_t pin;
uint16_t prio;
uint8_t gen;
uint8_t last_gc; /* Most out of date gen in the btree */
uint16_t gc_mark; /* Bitfield used by GC. See below for field */
};
/*
* I'd use bitfields for these, but I don't trust the compiler not to screw me
* as multiple threads touch struct bucket without locking
*/
BITMASK(GC_MARK, struct bucket, gc_mark, 0, 2);
#define GC_MARK_RECLAIMABLE 1
#define GC_MARK_DIRTY 2
#define GC_MARK_METADATA 3
#define GC_SECTORS_USED_SIZE 13
#define MAX_GC_SECTORS_USED (~(~0ULL << GC_SECTORS_USED_SIZE))
BITMASK(GC_SECTORS_USED, struct bucket, gc_mark, 2, GC_SECTORS_USED_SIZE);
BITMASK(GC_MOVE, struct bucket, gc_mark, 15, 1);
#include "journal.h"
#include "stats.h"
struct search;
struct btree;
struct keybuf;
struct keybuf_key {
struct rb_node node;
BKEY_PADDED(key);
void *private;
};
struct keybuf {
struct bkey last_scanned;
spinlock_t lock;
/*
* Beginning and end of range in rb tree - so that we can skip taking
* lock and checking the rb tree when we need to check for overlapping
* keys.
*/
struct bkey start;
struct bkey end;
struct rb_root keys;
#define KEYBUF_NR 500
DECLARE_ARRAY_ALLOCATOR(struct keybuf_key, freelist, KEYBUF_NR);
};
struct bio_split_pool {
struct bio_set *bio_split;
mempool_t *bio_split_hook;
};
struct bio_split_hook {
struct closure cl;
struct bio_split_pool *p;
struct bio *bio;
bio_end_io_t *bi_end_io;
void *bi_private;
};
struct bcache_device {
struct closure cl;
struct kobject kobj;
struct cache_set *c;
unsigned id;
#define BCACHEDEVNAME_SIZE 12
char name[BCACHEDEVNAME_SIZE];
struct gendisk *disk;
unsigned long flags;
#define BCACHE_DEV_CLOSING 0
#define BCACHE_DEV_DETACHING 1
#define BCACHE_DEV_UNLINK_DONE 2
unsigned nr_stripes;
unsigned stripe_size;
atomic_t *stripe_sectors_dirty;
unsigned long *full_dirty_stripes;
unsigned long sectors_dirty_last;
long sectors_dirty_derivative;
struct bio_set *bio_split;
unsigned data_csum:1;
int (*cache_miss)(struct btree *, struct search *,
struct bio *, unsigned);
int (*ioctl) (struct bcache_device *, fmode_t, unsigned, unsigned long);
struct bio_split_pool bio_split_hook;
};
struct io {
/* Used to track sequential IO so it can be skipped */
struct hlist_node hash;
struct list_head lru;
unsigned long jiffies;
unsigned sequential;
sector_t last;
};
struct cached_dev {
struct list_head list;
struct bcache_device disk;
struct block_device *bdev;
struct cache_sb sb;
struct bio sb_bio;
struct bio_vec sb_bv[1];
struct closure sb_write;
struct semaphore sb_write_mutex;
/* Refcount on the cache set. Always nonzero when we're caching. */
atomic_t count;
struct work_struct detach;
/*
* Device might not be running if it's dirty and the cache set hasn't
* showed up yet.
*/
atomic_t running;
/*
* Writes take a shared lock from start to finish; scanning for dirty
* data to refill the rb tree requires an exclusive lock.
*/
struct rw_semaphore writeback_lock;
/*
* Nonzero, and writeback has a refcount (d->count), iff there is dirty
* data in the cache. Protected by writeback_lock; must have an
* shared lock to set and exclusive lock to clear.
*/
atomic_t has_dirty;
struct bch_ratelimit writeback_rate;
struct delayed_work writeback_rate_update;
/*
* Internal to the writeback code, so read_dirty() can keep track of
* where it's at.
*/
sector_t last_read;
/* Limit number of writeback bios in flight */
struct semaphore in_flight;
struct task_struct *writeback_thread;
struct keybuf writeback_keys;
/* For tracking sequential IO */
#define RECENT_IO_BITS 7
#define RECENT_IO (1 << RECENT_IO_BITS)
struct io io[RECENT_IO];
struct hlist_head io_hash[RECENT_IO + 1];
struct list_head io_lru;
spinlock_t io_lock;
struct cache_accounting accounting;
/* The rest of this all shows up in sysfs */
unsigned sequential_cutoff;
unsigned readahead;
unsigned verify:1;
unsigned bypass_torture_test:1;
unsigned partial_stripes_expensive:1;
unsigned writeback_metadata:1;
unsigned writeback_running:1;
unsigned char writeback_percent;
unsigned writeback_delay;
uint64_t writeback_rate_target;
int64_t writeback_rate_proportional;
int64_t writeback_rate_derivative;
int64_t writeback_rate_change;
unsigned writeback_rate_update_seconds;
unsigned writeback_rate_d_term;
unsigned writeback_rate_p_term_inverse;
};
enum alloc_reserve {
RESERVE_BTREE,
RESERVE_PRIO,
RESERVE_MOVINGGC,
RESERVE_NONE,
RESERVE_NR,
};
struct cache {
struct cache_set *set;
struct cache_sb sb;
struct bio sb_bio;
struct bio_vec sb_bv[1];
struct kobject kobj;
struct block_device *bdev;
struct task_struct *alloc_thread;
struct closure prio;
struct prio_set *disk_buckets;
/*
* When allocating new buckets, prio_write() gets first dibs - since we
* may not be allocate at all without writing priorities and gens.
* prio_buckets[] contains the last buckets we wrote priorities to (so
* gc can mark them as metadata), prio_next[] contains the buckets
* allocated for the next prio write.
*/
uint64_t *prio_buckets;
uint64_t *prio_last_buckets;
/*
* free: Buckets that are ready to be used
*
* free_inc: Incoming buckets - these are buckets that currently have
* cached data in them, and we can't reuse them until after we write
* their new gen to disk. After prio_write() finishes writing the new
* gens/prios, they'll be moved to the free list (and possibly discarded
* in the process)
*/
DECLARE_FIFO(long, free)[RESERVE_NR];
DECLARE_FIFO(long, free_inc);
size_t fifo_last_bucket;
/* Allocation stuff: */
struct bucket *buckets;
DECLARE_HEAP(struct bucket *, heap);
/*
* If nonzero, we know we aren't going to find any buckets to invalidate
* until a gc finishes - otherwise we could pointlessly burn a ton of
* cpu
*/
unsigned invalidate_needs_gc:1;
bool discard; /* Get rid of? */
struct journal_device journal;
/* The rest of this all shows up in sysfs */
#define IO_ERROR_SHIFT 20
atomic_t io_errors;
atomic_t io_count;
atomic_long_t meta_sectors_written;
atomic_long_t btree_sectors_written;
atomic_long_t sectors_written;
struct bio_split_pool bio_split_hook;
};
struct gc_stat {
size_t nodes;
size_t key_bytes;
size_t nkeys;
uint64_t data; /* sectors */
unsigned in_use; /* percent */
};
/*
* Flag bits, for how the cache set is shutting down, and what phase it's at:
*
* CACHE_SET_UNREGISTERING means we're not just shutting down, we're detaching
* all the backing devices first (their cached data gets invalidated, and they
* won't automatically reattach).
*
* CACHE_SET_STOPPING always gets set first when we're closing down a cache set;
* we'll continue to run normally for awhile with CACHE_SET_STOPPING set (i.e.
* flushing dirty data).
*
* CACHE_SET_RUNNING means all cache devices have been registered and journal
* replay is complete.
*/
#define CACHE_SET_UNREGISTERING 0
#define CACHE_SET_STOPPING 1
#define CACHE_SET_RUNNING 2
struct cache_set {
struct closure cl;
struct list_head list;
struct kobject kobj;
struct kobject internal;
struct dentry *debug;
struct cache_accounting accounting;
unsigned long flags;
struct cache_sb sb;
struct cache *cache[MAX_CACHES_PER_SET];
struct cache *cache_by_alloc[MAX_CACHES_PER_SET];
int caches_loaded;
struct bcache_device **devices;
struct list_head cached_devs;
uint64_t cached_dev_sectors;
struct closure caching;
struct closure sb_write;
struct semaphore sb_write_mutex;
mempool_t *search;
mempool_t *bio_meta;
struct bio_set *bio_split;
/* For the btree cache */
struct shrinker shrink;
/* For the btree cache and anything allocation related */
struct mutex bucket_lock;
/* log2(bucket_size), in sectors */
unsigned short bucket_bits;
/* log2(block_size), in sectors */
unsigned short block_bits;
/*
* Default number of pages for a new btree node - may be less than a
* full bucket
*/
unsigned btree_pages;
/*
* Lists of struct btrees; lru is the list for structs that have memory
* allocated for actual btree node, freed is for structs that do not.
*
* We never free a struct btree, except on shutdown - we just put it on
* the btree_cache_freed list and reuse it later. This simplifies the
* code, and it doesn't cost us much memory as the memory usage is
* dominated by buffers that hold the actual btree node data and those
* can be freed - and the number of struct btrees allocated is
* effectively bounded.
*
* btree_cache_freeable effectively is a small cache - we use it because
* high order page allocations can be rather expensive, and it's quite
* common to delete and allocate btree nodes in quick succession. It
* should never grow past ~2-3 nodes in practice.
*/
struct list_head btree_cache;
struct list_head btree_cache_freeable;
struct list_head btree_cache_freed;
/* Number of elements in btree_cache + btree_cache_freeable lists */
unsigned btree_cache_used;
/*
* If we need to allocate memory for a new btree node and that
* allocation fails, we can cannibalize another node in the btree cache
* to satisfy the allocation - lock to guarantee only one thread does
* this at a time:
*/
wait_queue_head_t btree_cache_wait;
struct task_struct *btree_cache_alloc_lock;
/*
* When we free a btree node, we increment the gen of the bucket the
* node is in - but we can't rewrite the prios and gens until we
* finished whatever it is we were doing, otherwise after a crash the
* btree node would be freed but for say a split, we might not have the
* pointers to the new nodes inserted into the btree yet.
*
* This is a refcount that blocks prio_write() until the new keys are
* written.
*/
atomic_t prio_blocked;
wait_queue_head_t bucket_wait;
/*
* For any bio we don't skip we subtract the number of sectors from
* rescale; when it hits 0 we rescale all the bucket priorities.
*/
atomic_t rescale;
/*
* When we invalidate buckets, we use both the priority and the amount
* of good data to determine which buckets to reuse first - to weight
* those together consistently we keep track of the smallest nonzero
* priority of any bucket.
*/
uint16_t min_prio;
/*
* max(gen - last_gc) for all buckets. When it gets too big we have to gc
* to keep gens from wrapping around.
*/
uint8_t need_gc;
struct gc_stat gc_stats;
size_t nbuckets;
struct task_struct *gc_thread;
/* Where in the btree gc currently is */
struct bkey gc_done;
/*
* The allocation code needs gc_mark in struct bucket to be correct, but
* it's not while a gc is in progress. Protected by bucket_lock.
*/
int gc_mark_valid;
/* Counts how many sectors bio_insert has added to the cache */
atomic_t sectors_to_gc;
wait_queue_head_t moving_gc_wait;
struct keybuf moving_gc_keys;
/* Number of moving GC bios in flight */
struct semaphore moving_in_flight;
struct workqueue_struct *moving_gc_wq;
struct btree *root;
#ifdef CONFIG_BCACHE_DEBUG
struct btree *verify_data;
struct bset *verify_ondisk;
struct mutex verify_lock;
#endif
unsigned nr_uuids;
struct uuid_entry *uuids;
BKEY_PADDED(uuid_bucket);
struct closure uuid_write;
struct semaphore uuid_write_mutex;
/*
* A btree node on disk could have too many bsets for an iterator to fit
* on the stack - have to dynamically allocate them
*/
mempool_t *fill_iter;
struct bset_sort_state sort;
/* List of buckets we're currently writing data to */
struct list_head data_buckets;
spinlock_t data_bucket_lock;
struct journal journal;
#define CONGESTED_MAX 1024
unsigned congested_last_us;
atomic_t congested;
/* The rest of this all shows up in sysfs */
unsigned congested_read_threshold_us;
unsigned congested_write_threshold_us;
struct time_stats btree_gc_time;
struct time_stats btree_split_time;
struct time_stats btree_read_time;
atomic_long_t cache_read_races;
atomic_long_t writeback_keys_done;
atomic_long_t writeback_keys_failed;
enum {
ON_ERROR_UNREGISTER,
ON_ERROR_PANIC,
} on_error;
unsigned error_limit;
unsigned error_decay;
unsigned short journal_delay_ms;
bool expensive_debug_checks;
unsigned verify:1;
unsigned key_merging_disabled:1;
unsigned gc_always_rewrite:1;
unsigned shrinker_disabled:1;
unsigned copy_gc_enabled:1;
#define BUCKET_HASH_BITS 12
struct hlist_head bucket_hash[1 << BUCKET_HASH_BITS];
};
struct bbio {
unsigned submit_time_us;
union {
struct bkey key;
uint64_t _pad[3];
/*
* We only need pad = 3 here because we only ever carry around a
* single pointer - i.e. the pointer we're doing io to/from.
*/
};
struct bio bio;
};
#define BTREE_PRIO USHRT_MAX
#define INITIAL_PRIO 32768U
#define btree_bytes(c) ((c)->btree_pages * PAGE_SIZE)
#define btree_blocks(b) \
((unsigned) (KEY_SIZE(&b->key) >> (b)->c->block_bits))
#define btree_default_blocks(c) \
((unsigned) ((PAGE_SECTORS * (c)->btree_pages) >> (c)->block_bits))
#define bucket_pages(c) ((c)->sb.bucket_size / PAGE_SECTORS)
#define bucket_bytes(c) ((c)->sb.bucket_size << 9)
#define block_bytes(c) ((c)->sb.block_size << 9)
#define prios_per_bucket(c) \
((bucket_bytes(c) - sizeof(struct prio_set)) / \
sizeof(struct bucket_disk))
#define prio_buckets(c) \
DIV_ROUND_UP((size_t) (c)->sb.nbuckets, prios_per_bucket(c))
static inline size_t sector_to_bucket(struct cache_set *c, sector_t s)
{
return s >> c->bucket_bits;
}
static inline sector_t bucket_to_sector(struct cache_set *c, size_t b)
{
return ((sector_t) b) << c->bucket_bits;
}
static inline sector_t bucket_remainder(struct cache_set *c, sector_t s)
{
return s & (c->sb.bucket_size - 1);
}
static inline struct cache *PTR_CACHE(struct cache_set *c,
const struct bkey *k,
unsigned ptr)
{
return c->cache[PTR_DEV(k, ptr)];
}
static inline size_t PTR_BUCKET_NR(struct cache_set *c,
const struct bkey *k,
unsigned ptr)
{
return sector_to_bucket(c, PTR_OFFSET(k, ptr));
}
static inline struct bucket *PTR_BUCKET(struct cache_set *c,
const struct bkey *k,
unsigned ptr)
{
return PTR_CACHE(c, k, ptr)->buckets + PTR_BUCKET_NR(c, k, ptr);
}
static inline uint8_t gen_after(uint8_t a, uint8_t b)
{
uint8_t r = a - b;
return r > 128U ? 0 : r;
}
static inline uint8_t ptr_stale(struct cache_set *c, const struct bkey *k,
unsigned i)
{
return gen_after(PTR_BUCKET(c, k, i)->gen, PTR_GEN(k, i));
}
static inline bool ptr_available(struct cache_set *c, const struct bkey *k,
unsigned i)
{
return (PTR_DEV(k, i) < MAX_CACHES_PER_SET) && PTR_CACHE(c, k, i);
}
/* Btree key macros */
/*
* This is used for various on disk data structures - cache_sb, prio_set, bset,
* jset: The checksum is _always_ the first 8 bytes of these structs
*/
#define csum_set(i) \
bch_crc64(((void *) (i)) + sizeof(uint64_t), \
((void *) bset_bkey_last(i)) - \
(((void *) (i)) + sizeof(uint64_t)))
/* Error handling macros */
#define btree_bug(b, ...) \
do { \
if (bch_cache_set_error((b)->c, __VA_ARGS__)) \
dump_stack(); \
} while (0)
#define cache_bug(c, ...) \
do { \
if (bch_cache_set_error(c, __VA_ARGS__)) \
dump_stack(); \
} while (0)
#define btree_bug_on(cond, b, ...) \
do { \
if (cond) \
btree_bug(b, __VA_ARGS__); \
} while (0)
#define cache_bug_on(cond, c, ...) \
do { \
if (cond) \
cache_bug(c, __VA_ARGS__); \
} while (0)
#define cache_set_err_on(cond, c, ...) \
do { \
if (cond) \
bch_cache_set_error(c, __VA_ARGS__); \
} while (0)
/* Looping macros */
#define for_each_cache(ca, cs, iter) \
for (iter = 0; ca = cs->cache[iter], iter < (cs)->sb.nr_in_set; iter++)
#define for_each_bucket(b, ca) \
for (b = (ca)->buckets + (ca)->sb.first_bucket; \
b < (ca)->buckets + (ca)->sb.nbuckets; b++)
static inline void cached_dev_put(struct cached_dev *dc)
{
if (atomic_dec_and_test(&dc->count))
schedule_work(&dc->detach);
}
static inline bool cached_dev_get(struct cached_dev *dc)
{
if (!atomic_inc_not_zero(&dc->count))
return false;
/* Paired with the mb in cached_dev_attach */
smp_mb__after_atomic();
return true;
}
/*
* bucket_gc_gen() returns the difference between the bucket's current gen and
* the oldest gen of any pointer into that bucket in the btree (last_gc).
*/
static inline uint8_t bucket_gc_gen(struct bucket *b)
{
return b->gen - b->last_gc;
}
#define BUCKET_GC_GEN_MAX 96U
#define kobj_attribute_write(n, fn) \
static struct kobj_attribute ksysfs_##n = __ATTR(n, S_IWUSR, NULL, fn)
#define kobj_attribute_rw(n, show, store) \
static struct kobj_attribute ksysfs_##n = \
__ATTR(n, S_IWUSR|S_IRUSR, show, store)
static inline void wake_up_allocators(struct cache_set *c)
{
struct cache *ca;
unsigned i;
for_each_cache(ca, c, i)
wake_up_process(ca->alloc_thread);
}
/* Forward declarations */
void bch_count_io_errors(struct cache *, int, const char *);
void bch_bbio_count_io_errors(struct cache_set *, struct bio *,
int, const char *);
void bch_bbio_endio(struct cache_set *, struct bio *, int, const char *);
void bch_bbio_free(struct bio *, struct cache_set *);
struct bio *bch_bbio_alloc(struct cache_set *);
void bch_generic_make_request(struct bio *, struct bio_split_pool *);
void __bch_submit_bbio(struct bio *, struct cache_set *);
void bch_submit_bbio(struct bio *, struct cache_set *, struct bkey *, unsigned);
uint8_t bch_inc_gen(struct cache *, struct bucket *);
void bch_rescale_priorities(struct cache_set *, int);
bool bch_can_invalidate_bucket(struct cache *, struct bucket *);
void __bch_invalidate_one_bucket(struct cache *, struct bucket *);
void __bch_bucket_free(struct cache *, struct bucket *);
void bch_bucket_free(struct cache_set *, struct bkey *);
long bch_bucket_alloc(struct cache *, unsigned, bool);
int __bch_bucket_alloc_set(struct cache_set *, unsigned,
struct bkey *, int, bool);
int bch_bucket_alloc_set(struct cache_set *, unsigned,
struct bkey *, int, bool);
bool bch_alloc_sectors(struct cache_set *, struct bkey *, unsigned,
unsigned, unsigned, bool);
__printf(2, 3)
bool bch_cache_set_error(struct cache_set *, const char *, ...);
void bch_prio_write(struct cache *);
void bch_write_bdev_super(struct cached_dev *, struct closure *);
extern struct workqueue_struct *bcache_wq;
extern const char * const bch_cache_modes[];
extern struct mutex bch_register_lock;
extern struct list_head bch_cache_sets;
extern struct kobj_type bch_cached_dev_ktype;
extern struct kobj_type bch_flash_dev_ktype;
extern struct kobj_type bch_cache_set_ktype;
extern struct kobj_type bch_cache_set_internal_ktype;
extern struct kobj_type bch_cache_ktype;
void bch_cached_dev_release(struct kobject *);
void bch_flash_dev_release(struct kobject *);
void bch_cache_set_release(struct kobject *);
void bch_cache_release(struct kobject *);
int bch_uuid_write(struct cache_set *);
void bcache_write_super(struct cache_set *);
int bch_flash_dev_create(struct cache_set *c, uint64_t size);
int bch_cached_dev_attach(struct cached_dev *, struct cache_set *);
void bch_cached_dev_detach(struct cached_dev *);
void bch_cached_dev_run(struct cached_dev *);
void bcache_device_stop(struct bcache_device *);
void bch_cache_set_unregister(struct cache_set *);
void bch_cache_set_stop(struct cache_set *);
struct cache_set *bch_cache_set_alloc(struct cache_sb *);
void bch_btree_cache_free(struct cache_set *);
int bch_btree_cache_alloc(struct cache_set *);
void bch_moving_init_cache_set(struct cache_set *);
int bch_open_buckets_alloc(struct cache_set *);
void bch_open_buckets_free(struct cache_set *);
int bch_cache_allocator_start(struct cache *ca);
void bch_debug_exit(void);
int bch_debug_init(struct kobject *);
void bch_request_exit(void);
int bch_request_init(void);
#endif /* _BCACHE_H */

1331
drivers/md/bcache/bset.c Normal file

File diff suppressed because it is too large Load diff

566
drivers/md/bcache/bset.h Normal file
View file

@ -0,0 +1,566 @@
#ifndef _BCACHE_BSET_H
#define _BCACHE_BSET_H
#include <linux/bcache.h>
#include <linux/kernel.h>
#include <linux/types.h>
#include "util.h" /* for time_stats */
/*
* BKEYS:
*
* A bkey contains a key, a size field, a variable number of pointers, and some
* ancillary flag bits.
*
* We use two different functions for validating bkeys, bch_ptr_invalid and
* bch_ptr_bad().
*
* bch_ptr_invalid() primarily filters out keys and pointers that would be
* invalid due to some sort of bug, whereas bch_ptr_bad() filters out keys and
* pointer that occur in normal practice but don't point to real data.
*
* The one exception to the rule that ptr_invalid() filters out invalid keys is
* that it also filters out keys of size 0 - these are keys that have been
* completely overwritten. It'd be safe to delete these in memory while leaving
* them on disk, just unnecessary work - so we filter them out when resorting
* instead.
*
* We can't filter out stale keys when we're resorting, because garbage
* collection needs to find them to ensure bucket gens don't wrap around -
* unless we're rewriting the btree node those stale keys still exist on disk.
*
* We also implement functions here for removing some number of sectors from the
* front or the back of a bkey - this is mainly used for fixing overlapping
* extents, by removing the overlapping sectors from the older key.
*
* BSETS:
*
* A bset is an array of bkeys laid out contiguously in memory in sorted order,
* along with a header. A btree node is made up of a number of these, written at
* different times.
*
* There could be many of them on disk, but we never allow there to be more than
* 4 in memory - we lazily resort as needed.
*
* We implement code here for creating and maintaining auxiliary search trees
* (described below) for searching an individial bset, and on top of that we
* implement a btree iterator.
*
* BTREE ITERATOR:
*
* Most of the code in bcache doesn't care about an individual bset - it needs
* to search entire btree nodes and iterate over them in sorted order.
*
* The btree iterator code serves both functions; it iterates through the keys
* in a btree node in sorted order, starting from either keys after a specific
* point (if you pass it a search key) or the start of the btree node.
*
* AUXILIARY SEARCH TREES:
*
* Since keys are variable length, we can't use a binary search on a bset - we
* wouldn't be able to find the start of the next key. But binary searches are
* slow anyways, due to terrible cache behaviour; bcache originally used binary
* searches and that code topped out at under 50k lookups/second.
*
* So we need to construct some sort of lookup table. Since we only insert keys
* into the last (unwritten) set, most of the keys within a given btree node are
* usually in sets that are mostly constant. We use two different types of
* lookup tables to take advantage of this.
*
* Both lookup tables share in common that they don't index every key in the
* set; they index one key every BSET_CACHELINE bytes, and then a linear search
* is used for the rest.
*
* For sets that have been written to disk and are no longer being inserted
* into, we construct a binary search tree in an array - traversing a binary
* search tree in an array gives excellent locality of reference and is very
* fast, since both children of any node are adjacent to each other in memory
* (and their grandchildren, and great grandchildren...) - this means
* prefetching can be used to great effect.
*
* It's quite useful performance wise to keep these nodes small - not just
* because they're more likely to be in L2, but also because we can prefetch
* more nodes on a single cacheline and thus prefetch more iterations in advance
* when traversing this tree.
*
* Nodes in the auxiliary search tree must contain both a key to compare against
* (we don't want to fetch the key from the set, that would defeat the purpose),
* and a pointer to the key. We use a few tricks to compress both of these.
*
* To compress the pointer, we take advantage of the fact that one node in the
* search tree corresponds to precisely BSET_CACHELINE bytes in the set. We have
* a function (to_inorder()) that takes the index of a node in a binary tree and
* returns what its index would be in an inorder traversal, so we only have to
* store the low bits of the offset.
*
* The key is 84 bits (KEY_DEV + key->key, the offset on the device). To
* compress that, we take advantage of the fact that when we're traversing the
* search tree at every iteration we know that both our search key and the key
* we're looking for lie within some range - bounded by our previous
* comparisons. (We special case the start of a search so that this is true even
* at the root of the tree).
*
* So we know the key we're looking for is between a and b, and a and b don't
* differ higher than bit 50, we don't need to check anything higher than bit
* 50.
*
* We don't usually need the rest of the bits, either; we only need enough bits
* to partition the key range we're currently checking. Consider key n - the
* key our auxiliary search tree node corresponds to, and key p, the key
* immediately preceding n. The lowest bit we need to store in the auxiliary
* search tree is the highest bit that differs between n and p.
*
* Note that this could be bit 0 - we might sometimes need all 80 bits to do the
* comparison. But we'd really like our nodes in the auxiliary search tree to be
* of fixed size.
*
* The solution is to make them fixed size, and when we're constructing a node
* check if p and n differed in the bits we needed them to. If they don't we
* flag that node, and when doing lookups we fallback to comparing against the
* real key. As long as this doesn't happen to often (and it seems to reliably
* happen a bit less than 1% of the time), we win - even on failures, that key
* is then more likely to be in cache than if we were doing binary searches all
* the way, since we're touching so much less memory.
*
* The keys in the auxiliary search tree are stored in (software) floating
* point, with an exponent and a mantissa. The exponent needs to be big enough
* to address all the bits in the original key, but the number of bits in the
* mantissa is somewhat arbitrary; more bits just gets us fewer failures.
*
* We need 7 bits for the exponent and 3 bits for the key's offset (since keys
* are 8 byte aligned); using 22 bits for the mantissa means a node is 4 bytes.
* We need one node per 128 bytes in the btree node, which means the auxiliary
* search trees take up 3% as much memory as the btree itself.
*
* Constructing these auxiliary search trees is moderately expensive, and we
* don't want to be constantly rebuilding the search tree for the last set
* whenever we insert another key into it. For the unwritten set, we use a much
* simpler lookup table - it's just a flat array, so index i in the lookup table
* corresponds to the i range of BSET_CACHELINE bytes in the set. Indexing
* within each byte range works the same as with the auxiliary search trees.
*
* These are much easier to keep up to date when we insert a key - we do it
* somewhat lazily; when we shift a key up we usually just increment the pointer
* to it, only when it would overflow do we go to the trouble of finding the
* first key in that range of bytes again.
*/
struct btree_keys;
struct btree_iter;
struct btree_iter_set;
struct bkey_float;
#define MAX_BSETS 4U
struct bset_tree {
/*
* We construct a binary tree in an array as if the array
* started at 1, so that things line up on the same cachelines
* better: see comments in bset.c at cacheline_to_bkey() for
* details
*/
/* size of the binary tree and prev array */
unsigned size;
/* function of size - precalculated for to_inorder() */
unsigned extra;
/* copy of the last key in the set */
struct bkey end;
struct bkey_float *tree;
/*
* The nodes in the bset tree point to specific keys - this
* array holds the sizes of the previous key.
*
* Conceptually it's a member of struct bkey_float, but we want
* to keep bkey_float to 4 bytes and prev isn't used in the fast
* path.
*/
uint8_t *prev;
/* The actual btree node, with pointers to each sorted set */
struct bset *data;
};
struct btree_keys_ops {
bool (*sort_cmp)(struct btree_iter_set,
struct btree_iter_set);
struct bkey *(*sort_fixup)(struct btree_iter *, struct bkey *);
bool (*insert_fixup)(struct btree_keys *, struct bkey *,
struct btree_iter *, struct bkey *);
bool (*key_invalid)(struct btree_keys *,
const struct bkey *);
bool (*key_bad)(struct btree_keys *, const struct bkey *);
bool (*key_merge)(struct btree_keys *,
struct bkey *, struct bkey *);
void (*key_to_text)(char *, size_t, const struct bkey *);
void (*key_dump)(struct btree_keys *, const struct bkey *);
/*
* Only used for deciding whether to use START_KEY(k) or just the key
* itself in a couple places
*/
bool is_extents;
};
struct btree_keys {
const struct btree_keys_ops *ops;
uint8_t page_order;
uint8_t nsets;
unsigned last_set_unwritten:1;
bool *expensive_debug_checks;
/*
* Sets of sorted keys - the real btree node - plus a binary search tree
*
* set[0] is special; set[0]->tree, set[0]->prev and set[0]->data point
* to the memory we have allocated for this btree node. Additionally,
* set[0]->data points to the entire btree node as it exists on disk.
*/
struct bset_tree set[MAX_BSETS];
};
static inline struct bset_tree *bset_tree_last(struct btree_keys *b)
{
return b->set + b->nsets;
}
static inline bool bset_written(struct btree_keys *b, struct bset_tree *t)
{
return t <= b->set + b->nsets - b->last_set_unwritten;
}
static inline bool bkey_written(struct btree_keys *b, struct bkey *k)
{
return !b->last_set_unwritten || k < b->set[b->nsets].data->start;
}
static inline unsigned bset_byte_offset(struct btree_keys *b, struct bset *i)
{
return ((size_t) i) - ((size_t) b->set->data);
}
static inline unsigned bset_sector_offset(struct btree_keys *b, struct bset *i)
{
return bset_byte_offset(b, i) >> 9;
}
#define __set_bytes(i, k) (sizeof(*(i)) + (k) * sizeof(uint64_t))
#define set_bytes(i) __set_bytes(i, i->keys)
#define __set_blocks(i, k, block_bytes) \
DIV_ROUND_UP(__set_bytes(i, k), block_bytes)
#define set_blocks(i, block_bytes) \
__set_blocks(i, (i)->keys, block_bytes)
static inline size_t bch_btree_keys_u64s_remaining(struct btree_keys *b)
{
struct bset_tree *t = bset_tree_last(b);
BUG_ON((PAGE_SIZE << b->page_order) <
(bset_byte_offset(b, t->data) + set_bytes(t->data)));
if (!b->last_set_unwritten)
return 0;
return ((PAGE_SIZE << b->page_order) -
(bset_byte_offset(b, t->data) + set_bytes(t->data))) /
sizeof(u64);
}
static inline struct bset *bset_next_set(struct btree_keys *b,
unsigned block_bytes)
{
struct bset *i = bset_tree_last(b)->data;
return ((void *) i) + roundup(set_bytes(i), block_bytes);
}
void bch_btree_keys_free(struct btree_keys *);
int bch_btree_keys_alloc(struct btree_keys *, unsigned, gfp_t);
void bch_btree_keys_init(struct btree_keys *, const struct btree_keys_ops *,
bool *);
void bch_bset_init_next(struct btree_keys *, struct bset *, uint64_t);
void bch_bset_build_written_tree(struct btree_keys *);
void bch_bset_fix_invalidated_key(struct btree_keys *, struct bkey *);
bool bch_bkey_try_merge(struct btree_keys *, struct bkey *, struct bkey *);
void bch_bset_insert(struct btree_keys *, struct bkey *, struct bkey *);
unsigned bch_btree_insert_key(struct btree_keys *, struct bkey *,
struct bkey *);
enum {
BTREE_INSERT_STATUS_NO_INSERT = 0,
BTREE_INSERT_STATUS_INSERT,
BTREE_INSERT_STATUS_BACK_MERGE,
BTREE_INSERT_STATUS_OVERWROTE,
BTREE_INSERT_STATUS_FRONT_MERGE,
};
/* Btree key iteration */
struct btree_iter {
size_t size, used;
#ifdef CONFIG_BCACHE_DEBUG
struct btree_keys *b;
#endif
struct btree_iter_set {
struct bkey *k, *end;
} data[MAX_BSETS];
};
typedef bool (*ptr_filter_fn)(struct btree_keys *, const struct bkey *);
struct bkey *bch_btree_iter_next(struct btree_iter *);
struct bkey *bch_btree_iter_next_filter(struct btree_iter *,
struct btree_keys *, ptr_filter_fn);
void bch_btree_iter_push(struct btree_iter *, struct bkey *, struct bkey *);
struct bkey *bch_btree_iter_init(struct btree_keys *, struct btree_iter *,
struct bkey *);
struct bkey *__bch_bset_search(struct btree_keys *, struct bset_tree *,
const struct bkey *);
/*
* Returns the first key that is strictly greater than search
*/
static inline struct bkey *bch_bset_search(struct btree_keys *b,
struct bset_tree *t,
const struct bkey *search)
{
return search ? __bch_bset_search(b, t, search) : t->data->start;
}
#define for_each_key_filter(b, k, iter, filter) \
for (bch_btree_iter_init((b), (iter), NULL); \
((k) = bch_btree_iter_next_filter((iter), (b), filter));)
#define for_each_key(b, k, iter) \
for (bch_btree_iter_init((b), (iter), NULL); \
((k) = bch_btree_iter_next(iter));)
/* Sorting */
struct bset_sort_state {
mempool_t *pool;
unsigned page_order;
unsigned crit_factor;
struct time_stats time;
};
void bch_bset_sort_state_free(struct bset_sort_state *);
int bch_bset_sort_state_init(struct bset_sort_state *, unsigned);
void bch_btree_sort_lazy(struct btree_keys *, struct bset_sort_state *);
void bch_btree_sort_into(struct btree_keys *, struct btree_keys *,
struct bset_sort_state *);
void bch_btree_sort_and_fix_extents(struct btree_keys *, struct btree_iter *,
struct bset_sort_state *);
void bch_btree_sort_partial(struct btree_keys *, unsigned,
struct bset_sort_state *);
static inline void bch_btree_sort(struct btree_keys *b,
struct bset_sort_state *state)
{
bch_btree_sort_partial(b, 0, state);
}
struct bset_stats {
size_t sets_written, sets_unwritten;
size_t bytes_written, bytes_unwritten;
size_t floats, failed;
};
void bch_btree_keys_stats(struct btree_keys *, struct bset_stats *);
/* Bkey utility code */
#define bset_bkey_last(i) bkey_idx((struct bkey *) (i)->d, (i)->keys)
static inline struct bkey *bset_bkey_idx(struct bset *i, unsigned idx)
{
return bkey_idx(i->start, idx);
}
static inline void bkey_init(struct bkey *k)
{
*k = ZERO_KEY;
}
static __always_inline int64_t bkey_cmp(const struct bkey *l,
const struct bkey *r)
{
return unlikely(KEY_INODE(l) != KEY_INODE(r))
? (int64_t) KEY_INODE(l) - (int64_t) KEY_INODE(r)
: (int64_t) KEY_OFFSET(l) - (int64_t) KEY_OFFSET(r);
}
void bch_bkey_copy_single_ptr(struct bkey *, const struct bkey *,
unsigned);
bool __bch_cut_front(const struct bkey *, struct bkey *);
bool __bch_cut_back(const struct bkey *, struct bkey *);
static inline bool bch_cut_front(const struct bkey *where, struct bkey *k)
{
BUG_ON(bkey_cmp(where, k) > 0);
return __bch_cut_front(where, k);
}
static inline bool bch_cut_back(const struct bkey *where, struct bkey *k)
{
BUG_ON(bkey_cmp(where, &START_KEY(k)) < 0);
return __bch_cut_back(where, k);
}
#define PRECEDING_KEY(_k) \
({ \
struct bkey *_ret = NULL; \
\
if (KEY_INODE(_k) || KEY_OFFSET(_k)) { \
_ret = &KEY(KEY_INODE(_k), KEY_OFFSET(_k), 0); \
\
if (!_ret->low) \
_ret->high--; \
_ret->low--; \
} \
\
_ret; \
})
static inline bool bch_ptr_invalid(struct btree_keys *b, const struct bkey *k)
{
return b->ops->key_invalid(b, k);
}
static inline bool bch_ptr_bad(struct btree_keys *b, const struct bkey *k)
{
return b->ops->key_bad(b, k);
}
static inline void bch_bkey_to_text(struct btree_keys *b, char *buf,
size_t size, const struct bkey *k)
{
return b->ops->key_to_text(buf, size, k);
}
static inline bool bch_bkey_equal_header(const struct bkey *l,
const struct bkey *r)
{
return (KEY_DIRTY(l) == KEY_DIRTY(r) &&
KEY_PTRS(l) == KEY_PTRS(r) &&
KEY_CSUM(l) == KEY_CSUM(r));
}
/* Keylists */
struct keylist {
union {
struct bkey *keys;
uint64_t *keys_p;
};
union {
struct bkey *top;
uint64_t *top_p;
};
/* Enough room for btree_split's keys without realloc */
#define KEYLIST_INLINE 16
uint64_t inline_keys[KEYLIST_INLINE];
};
static inline void bch_keylist_init(struct keylist *l)
{
l->top_p = l->keys_p = l->inline_keys;
}
static inline void bch_keylist_init_single(struct keylist *l, struct bkey *k)
{
l->keys = k;
l->top = bkey_next(k);
}
static inline void bch_keylist_push(struct keylist *l)
{
l->top = bkey_next(l->top);
}
static inline void bch_keylist_add(struct keylist *l, struct bkey *k)
{
bkey_copy(l->top, k);
bch_keylist_push(l);
}
static inline bool bch_keylist_empty(struct keylist *l)
{
return l->top == l->keys;
}
static inline void bch_keylist_reset(struct keylist *l)
{
l->top = l->keys;
}
static inline void bch_keylist_free(struct keylist *l)
{
if (l->keys_p != l->inline_keys)
kfree(l->keys_p);
}
static inline size_t bch_keylist_nkeys(struct keylist *l)
{
return l->top_p - l->keys_p;
}
static inline size_t bch_keylist_bytes(struct keylist *l)
{
return bch_keylist_nkeys(l) * sizeof(uint64_t);
}
struct bkey *bch_keylist_pop(struct keylist *);
void bch_keylist_pop_front(struct keylist *);
int __bch_keylist_realloc(struct keylist *, unsigned);
/* Debug stuff */
#ifdef CONFIG_BCACHE_DEBUG
int __bch_count_data(struct btree_keys *);
void __bch_check_keys(struct btree_keys *, const char *, ...);
void bch_dump_bset(struct btree_keys *, struct bset *, unsigned);
void bch_dump_bucket(struct btree_keys *);
#else
static inline int __bch_count_data(struct btree_keys *b) { return -1; }
static inline void __bch_check_keys(struct btree_keys *b, const char *fmt, ...) {}
static inline void bch_dump_bucket(struct btree_keys *b) {}
void bch_dump_bset(struct btree_keys *, struct bset *, unsigned);
#endif
static inline bool btree_keys_expensive_checks(struct btree_keys *b)
{
#ifdef CONFIG_BCACHE_DEBUG
return *b->expensive_debug_checks;
#else
return false;
#endif
}
static inline int bch_count_data(struct btree_keys *b)
{
return btree_keys_expensive_checks(b) ? __bch_count_data(b) : -1;
}
#define bch_check_keys(b, ...) \
do { \
if (btree_keys_expensive_checks(b)) \
__bch_check_keys(b, __VA_ARGS__); \
} while (0)
#endif

2530
drivers/md/bcache/btree.c Normal file

File diff suppressed because it is too large Load diff

310
drivers/md/bcache/btree.h Normal file
View file

@ -0,0 +1,310 @@
#ifndef _BCACHE_BTREE_H
#define _BCACHE_BTREE_H
/*
* THE BTREE:
*
* At a high level, bcache's btree is relatively standard b+ tree. All keys and
* pointers are in the leaves; interior nodes only have pointers to the child
* nodes.
*
* In the interior nodes, a struct bkey always points to a child btree node, and
* the key is the highest key in the child node - except that the highest key in
* an interior node is always MAX_KEY. The size field refers to the size on disk
* of the child node - this would allow us to have variable sized btree nodes
* (handy for keeping the depth of the btree 1 by expanding just the root).
*
* Btree nodes are themselves log structured, but this is hidden fairly
* thoroughly. Btree nodes on disk will in practice have extents that overlap
* (because they were written at different times), but in memory we never have
* overlapping extents - when we read in a btree node from disk, the first thing
* we do is resort all the sets of keys with a mergesort, and in the same pass
* we check for overlapping extents and adjust them appropriately.
*
* struct btree_op is a central interface to the btree code. It's used for
* specifying read vs. write locking, and the embedded closure is used for
* waiting on IO or reserve memory.
*
* BTREE CACHE:
*
* Btree nodes are cached in memory; traversing the btree might require reading
* in btree nodes which is handled mostly transparently.
*
* bch_btree_node_get() looks up a btree node in the cache and reads it in from
* disk if necessary. This function is almost never called directly though - the
* btree() macro is used to get a btree node, call some function on it, and
* unlock the node after the function returns.
*
* The root is special cased - it's taken out of the cache's lru (thus pinning
* it in memory), so we can find the root of the btree by just dereferencing a
* pointer instead of looking it up in the cache. This makes locking a bit
* tricky, since the root pointer is protected by the lock in the btree node it
* points to - the btree_root() macro handles this.
*
* In various places we must be able to allocate memory for multiple btree nodes
* in order to make forward progress. To do this we use the btree cache itself
* as a reserve; if __get_free_pages() fails, we'll find a node in the btree
* cache we can reuse. We can't allow more than one thread to be doing this at a
* time, so there's a lock, implemented by a pointer to the btree_op closure -
* this allows the btree_root() macro to implicitly release this lock.
*
* BTREE IO:
*
* Btree nodes never have to be explicitly read in; bch_btree_node_get() handles
* this.
*
* For writing, we have two btree_write structs embeddded in struct btree - one
* write in flight, and one being set up, and we toggle between them.
*
* Writing is done with a single function - bch_btree_write() really serves two
* different purposes and should be broken up into two different functions. When
* passing now = false, it merely indicates that the node is now dirty - calling
* it ensures that the dirty keys will be written at some point in the future.
*
* When passing now = true, bch_btree_write() causes a write to happen
* "immediately" (if there was already a write in flight, it'll cause the write
* to happen as soon as the previous write completes). It returns immediately
* though - but it takes a refcount on the closure in struct btree_op you passed
* to it, so a closure_sync() later can be used to wait for the write to
* complete.
*
* This is handy because btree_split() and garbage collection can issue writes
* in parallel, reducing the amount of time they have to hold write locks.
*
* LOCKING:
*
* When traversing the btree, we may need write locks starting at some level -
* inserting a key into the btree will typically only require a write lock on
* the leaf node.
*
* This is specified with the lock field in struct btree_op; lock = 0 means we
* take write locks at level <= 0, i.e. only leaf nodes. bch_btree_node_get()
* checks this field and returns the node with the appropriate lock held.
*
* If, after traversing the btree, the insertion code discovers it has to split
* then it must restart from the root and take new locks - to do this it changes
* the lock field and returns -EINTR, which causes the btree_root() macro to
* loop.
*
* Handling cache misses require a different mechanism for upgrading to a write
* lock. We do cache lookups with only a read lock held, but if we get a cache
* miss and we wish to insert this data into the cache, we have to insert a
* placeholder key to detect races - otherwise, we could race with a write and
* overwrite the data that was just written to the cache with stale data from
* the backing device.
*
* For this we use a sequence number that write locks and unlocks increment - to
* insert the check key it unlocks the btree node and then takes a write lock,
* and fails if the sequence number doesn't match.
*/
#include "bset.h"
#include "debug.h"
struct btree_write {
atomic_t *journal;
/* If btree_split() frees a btree node, it writes a new pointer to that
* btree node indicating it was freed; it takes a refcount on
* c->prio_blocked because we can't write the gens until the new
* pointer is on disk. This allows btree_write_endio() to release the
* refcount that btree_split() took.
*/
int prio_blocked;
};
struct btree {
/* Hottest entries first */
struct hlist_node hash;
/* Key/pointer for this btree node */
BKEY_PADDED(key);
/* Single bit - set when accessed, cleared by shrinker */
unsigned long accessed;
unsigned long seq;
struct rw_semaphore lock;
struct cache_set *c;
struct btree *parent;
struct mutex write_lock;
unsigned long flags;
uint16_t written; /* would be nice to kill */
uint8_t level;
struct btree_keys keys;
/* For outstanding btree writes, used as a lock - protects write_idx */
struct closure io;
struct semaphore io_mutex;
struct list_head list;
struct delayed_work work;
struct btree_write writes[2];
struct bio *bio;
};
#define BTREE_FLAG(flag) \
static inline bool btree_node_ ## flag(struct btree *b) \
{ return test_bit(BTREE_NODE_ ## flag, &b->flags); } \
\
static inline void set_btree_node_ ## flag(struct btree *b) \
{ set_bit(BTREE_NODE_ ## flag, &b->flags); } \
enum btree_flags {
BTREE_NODE_io_error,
BTREE_NODE_dirty,
BTREE_NODE_write_idx,
};
BTREE_FLAG(io_error);
BTREE_FLAG(dirty);
BTREE_FLAG(write_idx);
static inline struct btree_write *btree_current_write(struct btree *b)
{
return b->writes + btree_node_write_idx(b);
}
static inline struct btree_write *btree_prev_write(struct btree *b)
{
return b->writes + (btree_node_write_idx(b) ^ 1);
}
static inline struct bset *btree_bset_first(struct btree *b)
{
return b->keys.set->data;
}
static inline struct bset *btree_bset_last(struct btree *b)
{
return bset_tree_last(&b->keys)->data;
}
static inline unsigned bset_block_offset(struct btree *b, struct bset *i)
{
return bset_sector_offset(&b->keys, i) >> b->c->block_bits;
}
static inline void set_gc_sectors(struct cache_set *c)
{
atomic_set(&c->sectors_to_gc, c->sb.bucket_size * c->nbuckets / 16);
}
void bkey_put(struct cache_set *c, struct bkey *k);
/* Looping macros */
#define for_each_cached_btree(b, c, iter) \
for (iter = 0; \
iter < ARRAY_SIZE((c)->bucket_hash); \
iter++) \
hlist_for_each_entry_rcu((b), (c)->bucket_hash + iter, hash)
/* Recursing down the btree */
struct btree_op {
/* for waiting on btree reserve in btree_split() */
wait_queue_t wait;
/* Btree level at which we start taking write locks */
short lock;
unsigned insert_collision:1;
};
static inline void bch_btree_op_init(struct btree_op *op, int write_lock_level)
{
memset(op, 0, sizeof(struct btree_op));
init_wait(&op->wait);
op->lock = write_lock_level;
}
static inline void rw_lock(bool w, struct btree *b, int level)
{
w ? down_write_nested(&b->lock, level + 1)
: down_read_nested(&b->lock, level + 1);
if (w)
b->seq++;
}
static inline void rw_unlock(bool w, struct btree *b)
{
if (w)
b->seq++;
(w ? up_write : up_read)(&b->lock);
}
void bch_btree_node_read_done(struct btree *);
void __bch_btree_node_write(struct btree *, struct closure *);
void bch_btree_node_write(struct btree *, struct closure *);
void bch_btree_set_root(struct btree *);
struct btree *__bch_btree_node_alloc(struct cache_set *, struct btree_op *,
int, bool, struct btree *);
struct btree *bch_btree_node_get(struct cache_set *, struct btree_op *,
struct bkey *, int, bool, struct btree *);
int bch_btree_insert_check_key(struct btree *, struct btree_op *,
struct bkey *);
int bch_btree_insert(struct cache_set *, struct keylist *,
atomic_t *, struct bkey *);
int bch_gc_thread_start(struct cache_set *);
void bch_initial_gc_finish(struct cache_set *);
void bch_moving_gc(struct cache_set *);
int bch_btree_check(struct cache_set *);
void bch_initial_mark_key(struct cache_set *, int, struct bkey *);
static inline void wake_up_gc(struct cache_set *c)
{
if (c->gc_thread)
wake_up_process(c->gc_thread);
}
#define MAP_DONE 0
#define MAP_CONTINUE 1
#define MAP_ALL_NODES 0
#define MAP_LEAF_NODES 1
#define MAP_END_KEY 1
typedef int (btree_map_nodes_fn)(struct btree_op *, struct btree *);
int __bch_btree_map_nodes(struct btree_op *, struct cache_set *,
struct bkey *, btree_map_nodes_fn *, int);
static inline int bch_btree_map_nodes(struct btree_op *op, struct cache_set *c,
struct bkey *from, btree_map_nodes_fn *fn)
{
return __bch_btree_map_nodes(op, c, from, fn, MAP_ALL_NODES);
}
static inline int bch_btree_map_leaf_nodes(struct btree_op *op,
struct cache_set *c,
struct bkey *from,
btree_map_nodes_fn *fn)
{
return __bch_btree_map_nodes(op, c, from, fn, MAP_LEAF_NODES);
}
typedef int (btree_map_keys_fn)(struct btree_op *, struct btree *,
struct bkey *);
int bch_btree_map_keys(struct btree_op *, struct cache_set *,
struct bkey *, btree_map_keys_fn *, int);
typedef bool (keybuf_pred_fn)(struct keybuf *, struct bkey *);
void bch_keybuf_init(struct keybuf *);
void bch_refill_keybuf(struct cache_set *, struct keybuf *,
struct bkey *, keybuf_pred_fn *);
bool bch_keybuf_check_overlapping(struct keybuf *, struct bkey *,
struct bkey *);
void bch_keybuf_del(struct keybuf *, struct keybuf_key *);
struct keybuf_key *bch_keybuf_next(struct keybuf *);
struct keybuf_key *bch_keybuf_next_rescan(struct cache_set *, struct keybuf *,
struct bkey *, keybuf_pred_fn *);
#endif

222
drivers/md/bcache/closure.c Normal file
View file

@ -0,0 +1,222 @@
/*
* Asynchronous refcounty things
*
* Copyright 2010, 2011 Kent Overstreet <kent.overstreet@gmail.com>
* Copyright 2012 Google, Inc.
*/
#include <linux/debugfs.h>
#include <linux/module.h>
#include <linux/seq_file.h>
#include "closure.h"
static inline void closure_put_after_sub(struct closure *cl, int flags)
{
int r = flags & CLOSURE_REMAINING_MASK;
BUG_ON(flags & CLOSURE_GUARD_MASK);
BUG_ON(!r && (flags & ~CLOSURE_DESTRUCTOR));
/* Must deliver precisely one wakeup */
if (r == 1 && (flags & CLOSURE_SLEEPING))
wake_up_process(cl->task);
if (!r) {
if (cl->fn && !(flags & CLOSURE_DESTRUCTOR)) {
atomic_set(&cl->remaining,
CLOSURE_REMAINING_INITIALIZER);
closure_queue(cl);
} else {
struct closure *parent = cl->parent;
closure_fn *destructor = cl->fn;
closure_debug_destroy(cl);
if (destructor)
destructor(cl);
if (parent)
closure_put(parent);
}
}
}
/* For clearing flags with the same atomic op as a put */
void closure_sub(struct closure *cl, int v)
{
closure_put_after_sub(cl, atomic_sub_return(v, &cl->remaining));
}
EXPORT_SYMBOL(closure_sub);
/**
* closure_put - decrement a closure's refcount
*/
void closure_put(struct closure *cl)
{
closure_put_after_sub(cl, atomic_dec_return(&cl->remaining));
}
EXPORT_SYMBOL(closure_put);
/**
* closure_wake_up - wake up all closures on a wait list, without memory barrier
*/
void __closure_wake_up(struct closure_waitlist *wait_list)
{
struct llist_node *list;
struct closure *cl;
struct llist_node *reverse = NULL;
list = llist_del_all(&wait_list->list);
/* We first reverse the list to preserve FIFO ordering and fairness */
while (list) {
struct llist_node *t = list;
list = llist_next(list);
t->next = reverse;
reverse = t;
}
/* Then do the wakeups */
while (reverse) {
cl = container_of(reverse, struct closure, list);
reverse = llist_next(reverse);
closure_set_waiting(cl, 0);
closure_sub(cl, CLOSURE_WAITING + 1);
}
}
EXPORT_SYMBOL(__closure_wake_up);
/**
* closure_wait - add a closure to a waitlist
*
* @waitlist will own a ref on @cl, which will be released when
* closure_wake_up() is called on @waitlist.
*
*/
bool closure_wait(struct closure_waitlist *waitlist, struct closure *cl)
{
if (atomic_read(&cl->remaining) & CLOSURE_WAITING)
return false;
closure_set_waiting(cl, _RET_IP_);
atomic_add(CLOSURE_WAITING + 1, &cl->remaining);
llist_add(&cl->list, &waitlist->list);
return true;
}
EXPORT_SYMBOL(closure_wait);
/**
* closure_sync - sleep until a closure a closure has nothing left to wait on
*
* Sleeps until the refcount hits 1 - the thread that's running the closure owns
* the last refcount.
*/
void closure_sync(struct closure *cl)
{
while (1) {
__closure_start_sleep(cl);
closure_set_ret_ip(cl);
if ((atomic_read(&cl->remaining) &
CLOSURE_REMAINING_MASK) == 1)
break;
schedule();
}
__closure_end_sleep(cl);
}
EXPORT_SYMBOL(closure_sync);
#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
static LIST_HEAD(closure_list);
static DEFINE_SPINLOCK(closure_list_lock);
void closure_debug_create(struct closure *cl)
{
unsigned long flags;
BUG_ON(cl->magic == CLOSURE_MAGIC_ALIVE);
cl->magic = CLOSURE_MAGIC_ALIVE;
spin_lock_irqsave(&closure_list_lock, flags);
list_add(&cl->all, &closure_list);
spin_unlock_irqrestore(&closure_list_lock, flags);
}
EXPORT_SYMBOL(closure_debug_create);
void closure_debug_destroy(struct closure *cl)
{
unsigned long flags;
BUG_ON(cl->magic != CLOSURE_MAGIC_ALIVE);
cl->magic = CLOSURE_MAGIC_DEAD;
spin_lock_irqsave(&closure_list_lock, flags);
list_del(&cl->all);
spin_unlock_irqrestore(&closure_list_lock, flags);
}
EXPORT_SYMBOL(closure_debug_destroy);
static struct dentry *debug;
#define work_data_bits(work) ((unsigned long *)(&(work)->data))
static int debug_seq_show(struct seq_file *f, void *data)
{
struct closure *cl;
spin_lock_irq(&closure_list_lock);
list_for_each_entry(cl, &closure_list, all) {
int r = atomic_read(&cl->remaining);
seq_printf(f, "%p: %pF -> %pf p %p r %i ",
cl, (void *) cl->ip, cl->fn, cl->parent,
r & CLOSURE_REMAINING_MASK);
seq_printf(f, "%s%s%s%s\n",
test_bit(WORK_STRUCT_PENDING,
work_data_bits(&cl->work)) ? "Q" : "",
r & CLOSURE_RUNNING ? "R" : "",
r & CLOSURE_STACK ? "S" : "",
r & CLOSURE_SLEEPING ? "Sl" : "");
if (r & CLOSURE_WAITING)
seq_printf(f, " W %pF\n",
(void *) cl->waiting_on);
seq_printf(f, "\n");
}
spin_unlock_irq(&closure_list_lock);
return 0;
}
static int debug_seq_open(struct inode *inode, struct file *file)
{
return single_open(file, debug_seq_show, NULL);
}
static const struct file_operations debug_ops = {
.owner = THIS_MODULE,
.open = debug_seq_open,
.read = seq_read,
.release = single_release
};
void __init closure_debug_init(void)
{
debug = debugfs_create_file("closures", 0400, NULL, NULL, &debug_ops);
}
#endif
MODULE_AUTHOR("Kent Overstreet <koverstreet@google.com>");
MODULE_LICENSE("GPL");

386
drivers/md/bcache/closure.h Normal file
View file

@ -0,0 +1,386 @@
#ifndef _LINUX_CLOSURE_H
#define _LINUX_CLOSURE_H
#include <linux/llist.h>
#include <linux/sched.h>
#include <linux/workqueue.h>
/*
* Closure is perhaps the most overused and abused term in computer science, but
* since I've been unable to come up with anything better you're stuck with it
* again.
*
* What are closures?
*
* They embed a refcount. The basic idea is they count "things that are in
* progress" - in flight bios, some other thread that's doing something else -
* anything you might want to wait on.
*
* The refcount may be manipulated with closure_get() and closure_put().
* closure_put() is where many of the interesting things happen, when it causes
* the refcount to go to 0.
*
* Closures can be used to wait on things both synchronously and asynchronously,
* and synchronous and asynchronous use can be mixed without restriction. To
* wait synchronously, use closure_sync() - you will sleep until your closure's
* refcount hits 1.
*
* To wait asynchronously, use
* continue_at(cl, next_function, workqueue);
*
* passing it, as you might expect, the function to run when nothing is pending
* and the workqueue to run that function out of.
*
* continue_at() also, critically, is a macro that returns the calling function.
* There's good reason for this.
*
* To use safely closures asynchronously, they must always have a refcount while
* they are running owned by the thread that is running them. Otherwise, suppose
* you submit some bios and wish to have a function run when they all complete:
*
* foo_endio(struct bio *bio, int error)
* {
* closure_put(cl);
* }
*
* closure_init(cl);
*
* do_stuff();
* closure_get(cl);
* bio1->bi_endio = foo_endio;
* bio_submit(bio1);
*
* do_more_stuff();
* closure_get(cl);
* bio2->bi_endio = foo_endio;
* bio_submit(bio2);
*
* continue_at(cl, complete_some_read, system_wq);
*
* If closure's refcount started at 0, complete_some_read() could run before the
* second bio was submitted - which is almost always not what you want! More
* importantly, it wouldn't be possible to say whether the original thread or
* complete_some_read()'s thread owned the closure - and whatever state it was
* associated with!
*
* So, closure_init() initializes a closure's refcount to 1 - and when a
* closure_fn is run, the refcount will be reset to 1 first.
*
* Then, the rule is - if you got the refcount with closure_get(), release it
* with closure_put() (i.e, in a bio->bi_endio function). If you have a refcount
* on a closure because you called closure_init() or you were run out of a
* closure - _always_ use continue_at(). Doing so consistently will help
* eliminate an entire class of particularly pernicious races.
*
* Lastly, you might have a wait list dedicated to a specific event, and have no
* need for specifying the condition - you just want to wait until someone runs
* closure_wake_up() on the appropriate wait list. In that case, just use
* closure_wait(). It will return either true or false, depending on whether the
* closure was already on a wait list or not - a closure can only be on one wait
* list at a time.
*
* Parents:
*
* closure_init() takes two arguments - it takes the closure to initialize, and
* a (possibly null) parent.
*
* If parent is non null, the new closure will have a refcount for its lifetime;
* a closure is considered to be "finished" when its refcount hits 0 and the
* function to run is null. Hence
*
* continue_at(cl, NULL, NULL);
*
* returns up the (spaghetti) stack of closures, precisely like normal return
* returns up the C stack. continue_at() with non null fn is better thought of
* as doing a tail call.
*
* All this implies that a closure should typically be embedded in a particular
* struct (which its refcount will normally control the lifetime of), and that
* struct can very much be thought of as a stack frame.
*/
struct closure;
typedef void (closure_fn) (struct closure *);
struct closure_waitlist {
struct llist_head list;
};
enum closure_state {
/*
* CLOSURE_WAITING: Set iff the closure is on a waitlist. Must be set by
* the thread that owns the closure, and cleared by the thread that's
* waking up the closure.
*
* CLOSURE_SLEEPING: Must be set before a thread uses a closure to sleep
* - indicates that cl->task is valid and closure_put() may wake it up.
* Only set or cleared by the thread that owns the closure.
*
* The rest are for debugging and don't affect behaviour:
*
* CLOSURE_RUNNING: Set when a closure is running (i.e. by
* closure_init() and when closure_put() runs then next function), and
* must be cleared before remaining hits 0. Primarily to help guard
* against incorrect usage and accidentally transferring references.
* continue_at() and closure_return() clear it for you, if you're doing
* something unusual you can use closure_set_dead() which also helps
* annotate where references are being transferred.
*
* CLOSURE_STACK: Sanity check - remaining should never hit 0 on a
* closure with this flag set
*/
CLOSURE_BITS_START = (1 << 23),
CLOSURE_DESTRUCTOR = (1 << 23),
CLOSURE_WAITING = (1 << 25),
CLOSURE_SLEEPING = (1 << 27),
CLOSURE_RUNNING = (1 << 29),
CLOSURE_STACK = (1 << 31),
};
#define CLOSURE_GUARD_MASK \
((CLOSURE_DESTRUCTOR|CLOSURE_WAITING|CLOSURE_SLEEPING| \
CLOSURE_RUNNING|CLOSURE_STACK) << 1)
#define CLOSURE_REMAINING_MASK (CLOSURE_BITS_START - 1)
#define CLOSURE_REMAINING_INITIALIZER (1|CLOSURE_RUNNING)
struct closure {
union {
struct {
struct workqueue_struct *wq;
struct task_struct *task;
struct llist_node list;
closure_fn *fn;
};
struct work_struct work;
};
struct closure *parent;
atomic_t remaining;
#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
#define CLOSURE_MAGIC_DEAD 0xc054dead
#define CLOSURE_MAGIC_ALIVE 0xc054a11e
unsigned magic;
struct list_head all;
unsigned long ip;
unsigned long waiting_on;
#endif
};
void closure_sub(struct closure *cl, int v);
void closure_put(struct closure *cl);
void __closure_wake_up(struct closure_waitlist *list);
bool closure_wait(struct closure_waitlist *list, struct closure *cl);
void closure_sync(struct closure *cl);
#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
void closure_debug_init(void);
void closure_debug_create(struct closure *cl);
void closure_debug_destroy(struct closure *cl);
#else
static inline void closure_debug_init(void) {}
static inline void closure_debug_create(struct closure *cl) {}
static inline void closure_debug_destroy(struct closure *cl) {}
#endif
static inline void closure_set_ip(struct closure *cl)
{
#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
cl->ip = _THIS_IP_;
#endif
}
static inline void closure_set_ret_ip(struct closure *cl)
{
#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
cl->ip = _RET_IP_;
#endif
}
static inline void closure_set_waiting(struct closure *cl, unsigned long f)
{
#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
cl->waiting_on = f;
#endif
}
static inline void __closure_end_sleep(struct closure *cl)
{
__set_current_state(TASK_RUNNING);
if (atomic_read(&cl->remaining) & CLOSURE_SLEEPING)
atomic_sub(CLOSURE_SLEEPING, &cl->remaining);
}
static inline void __closure_start_sleep(struct closure *cl)
{
closure_set_ip(cl);
cl->task = current;
set_current_state(TASK_UNINTERRUPTIBLE);
if (!(atomic_read(&cl->remaining) & CLOSURE_SLEEPING))
atomic_add(CLOSURE_SLEEPING, &cl->remaining);
}
static inline void closure_set_stopped(struct closure *cl)
{
atomic_sub(CLOSURE_RUNNING, &cl->remaining);
}
static inline void set_closure_fn(struct closure *cl, closure_fn *fn,
struct workqueue_struct *wq)
{
BUG_ON(object_is_on_stack(cl));
closure_set_ip(cl);
cl->fn = fn;
cl->wq = wq;
/* between atomic_dec() in closure_put() */
smp_mb__before_atomic();
}
static inline void closure_queue(struct closure *cl)
{
struct workqueue_struct *wq = cl->wq;
if (wq) {
INIT_WORK(&cl->work, cl->work.func);
BUG_ON(!queue_work(wq, &cl->work));
} else
cl->fn(cl);
}
/**
* closure_get - increment a closure's refcount
*/
static inline void closure_get(struct closure *cl)
{
#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
BUG_ON((atomic_inc_return(&cl->remaining) &
CLOSURE_REMAINING_MASK) <= 1);
#else
atomic_inc(&cl->remaining);
#endif
}
/**
* closure_init - Initialize a closure, setting the refcount to 1
* @cl: closure to initialize
* @parent: parent of the new closure. cl will take a refcount on it for its
* lifetime; may be NULL.
*/
static inline void closure_init(struct closure *cl, struct closure *parent)
{
memset(cl, 0, sizeof(struct closure));
cl->parent = parent;
if (parent)
closure_get(parent);
atomic_set(&cl->remaining, CLOSURE_REMAINING_INITIALIZER);
closure_debug_create(cl);
closure_set_ip(cl);
}
static inline void closure_init_stack(struct closure *cl)
{
memset(cl, 0, sizeof(struct closure));
atomic_set(&cl->remaining, CLOSURE_REMAINING_INITIALIZER|CLOSURE_STACK);
}
/**
* closure_wake_up - wake up all closures on a wait list.
*/
static inline void closure_wake_up(struct closure_waitlist *list)
{
smp_mb();
__closure_wake_up(list);
}
/**
* continue_at - jump to another function with barrier
*
* After @cl is no longer waiting on anything (i.e. all outstanding refs have
* been dropped with closure_put()), it will resume execution at @fn running out
* of @wq (or, if @wq is NULL, @fn will be called by closure_put() directly).
*
* NOTE: This macro expands to a return in the calling function!
*
* This is because after calling continue_at() you no longer have a ref on @cl,
* and whatever @cl owns may be freed out from under you - a running closure fn
* has a ref on its own closure which continue_at() drops.
*/
#define continue_at(_cl, _fn, _wq) \
do { \
set_closure_fn(_cl, _fn, _wq); \
closure_sub(_cl, CLOSURE_RUNNING + 1); \
return; \
} while (0)
/**
* closure_return - finish execution of a closure
*
* This is used to indicate that @cl is finished: when all outstanding refs on
* @cl have been dropped @cl's ref on its parent closure (as passed to
* closure_init()) will be dropped, if one was specified - thus this can be
* thought of as returning to the parent closure.
*/
#define closure_return(_cl) continue_at((_cl), NULL, NULL)
/**
* continue_at_nobarrier - jump to another function without barrier
*
* Causes @fn to be executed out of @cl, in @wq context (or called directly if
* @wq is NULL).
*
* NOTE: like continue_at(), this macro expands to a return in the caller!
*
* The ref the caller of continue_at_nobarrier() had on @cl is now owned by @fn,
* thus it's not safe to touch anything protected by @cl after a
* continue_at_nobarrier().
*/
#define continue_at_nobarrier(_cl, _fn, _wq) \
do { \
set_closure_fn(_cl, _fn, _wq); \
closure_queue(_cl); \
return; \
} while (0)
/**
* closure_return - finish execution of a closure, with destructor
*
* Works like closure_return(), except @destructor will be called when all
* outstanding refs on @cl have been dropped; @destructor may be used to safely
* free the memory occupied by @cl, and it is called with the ref on the parent
* closure still held - so @destructor could safely return an item to a
* freelist protected by @cl's parent.
*/
#define closure_return_with_destructor(_cl, _destructor) \
do { \
set_closure_fn(_cl, _destructor, NULL); \
closure_sub(_cl, CLOSURE_RUNNING - CLOSURE_DESTRUCTOR + 1); \
return; \
} while (0)
/**
* closure_call - execute @fn out of a new, uninitialized closure
*
* Typically used when running out of one closure, and we want to run @fn
* asynchronously out of a new closure - @parent will then wait for @cl to
* finish.
*/
static inline void closure_call(struct closure *cl, closure_fn fn,
struct workqueue_struct *wq,
struct closure *parent)
{
closure_init(cl, parent);
continue_at_nobarrier(cl, fn, wq);
}
#endif /* _LINUX_CLOSURE_H */

252
drivers/md/bcache/debug.c Normal file
View file

@ -0,0 +1,252 @@
/*
* Assorted bcache debug code
*
* Copyright 2010, 2011 Kent Overstreet <kent.overstreet@gmail.com>
* Copyright 2012 Google, Inc.
*/
#include "bcache.h"
#include "btree.h"
#include "debug.h"
#include "extents.h"
#include <linux/console.h>
#include <linux/debugfs.h>
#include <linux/module.h>
#include <linux/random.h>
#include <linux/seq_file.h>
static struct dentry *debug;
#ifdef CONFIG_BCACHE_DEBUG
#define for_each_written_bset(b, start, i) \
for (i = (start); \
(void *) i < (void *) (start) + (KEY_SIZE(&b->key) << 9) &&\
i->seq == (start)->seq; \
i = (void *) i + set_blocks(i, block_bytes(b->c)) * \
block_bytes(b->c))
void bch_btree_verify(struct btree *b)
{
struct btree *v = b->c->verify_data;
struct bset *ondisk, *sorted, *inmemory;
struct bio *bio;
if (!b->c->verify || !b->c->verify_ondisk)
return;
down(&b->io_mutex);
mutex_lock(&b->c->verify_lock);
ondisk = b->c->verify_ondisk;
sorted = b->c->verify_data->keys.set->data;
inmemory = b->keys.set->data;
bkey_copy(&v->key, &b->key);
v->written = 0;
v->level = b->level;
v->keys.ops = b->keys.ops;
bio = bch_bbio_alloc(b->c);
bio->bi_bdev = PTR_CACHE(b->c, &b->key, 0)->bdev;
bio->bi_iter.bi_sector = PTR_OFFSET(&b->key, 0);
bio->bi_iter.bi_size = KEY_SIZE(&v->key) << 9;
bch_bio_map(bio, sorted);
submit_bio_wait(REQ_META|READ_SYNC, bio);
bch_bbio_free(bio, b->c);
memcpy(ondisk, sorted, KEY_SIZE(&v->key) << 9);
bch_btree_node_read_done(v);
sorted = v->keys.set->data;
if (inmemory->keys != sorted->keys ||
memcmp(inmemory->start,
sorted->start,
(void *) bset_bkey_last(inmemory) - (void *) inmemory->start)) {
struct bset *i;
unsigned j;
console_lock();
printk(KERN_ERR "*** in memory:\n");
bch_dump_bset(&b->keys, inmemory, 0);
printk(KERN_ERR "*** read back in:\n");
bch_dump_bset(&v->keys, sorted, 0);
for_each_written_bset(b, ondisk, i) {
unsigned block = ((void *) i - (void *) ondisk) /
block_bytes(b->c);
printk(KERN_ERR "*** on disk block %u:\n", block);
bch_dump_bset(&b->keys, i, block);
}
printk(KERN_ERR "*** block %zu not written\n",
((void *) i - (void *) ondisk) / block_bytes(b->c));
for (j = 0; j < inmemory->keys; j++)
if (inmemory->d[j] != sorted->d[j])
break;
printk(KERN_ERR "b->written %u\n", b->written);
console_unlock();
panic("verify failed at %u\n", j);
}
mutex_unlock(&b->c->verify_lock);
up(&b->io_mutex);
}
void bch_data_verify(struct cached_dev *dc, struct bio *bio)
{
char name[BDEVNAME_SIZE];
struct bio *check;
struct bio_vec bv, *bv2;
struct bvec_iter iter;
int i;
check = bio_clone(bio, GFP_NOIO);
if (!check)
return;
if (bio_alloc_pages(check, GFP_NOIO))
goto out_put;
submit_bio_wait(READ_SYNC, check);
bio_for_each_segment(bv, bio, iter) {
void *p1 = kmap_atomic(bv.bv_page);
void *p2 = page_address(check->bi_io_vec[iter.bi_idx].bv_page);
cache_set_err_on(memcmp(p1 + bv.bv_offset,
p2 + bv.bv_offset,
bv.bv_len),
dc->disk.c,
"verify failed at dev %s sector %llu",
bdevname(dc->bdev, name),
(uint64_t) bio->bi_iter.bi_sector);
kunmap_atomic(p1);
}
bio_for_each_segment_all(bv2, check, i)
__free_page(bv2->bv_page);
out_put:
bio_put(check);
}
#endif
#ifdef CONFIG_DEBUG_FS
/* XXX: cache set refcounting */
struct dump_iterator {
char buf[PAGE_SIZE];
size_t bytes;
struct cache_set *c;
struct keybuf keys;
};
static bool dump_pred(struct keybuf *buf, struct bkey *k)
{
return true;
}
static ssize_t bch_dump_read(struct file *file, char __user *buf,
size_t size, loff_t *ppos)
{
struct dump_iterator *i = file->private_data;
ssize_t ret = 0;
char kbuf[80];
while (size) {
struct keybuf_key *w;
unsigned bytes = min(i->bytes, size);
int err = copy_to_user(buf, i->buf, bytes);
if (err)
return err;
ret += bytes;
buf += bytes;
size -= bytes;
i->bytes -= bytes;
memmove(i->buf, i->buf + bytes, i->bytes);
if (i->bytes)
break;
w = bch_keybuf_next_rescan(i->c, &i->keys, &MAX_KEY, dump_pred);
if (!w)
break;
bch_extent_to_text(kbuf, sizeof(kbuf), &w->key);
i->bytes = snprintf(i->buf, PAGE_SIZE, "%s\n", kbuf);
bch_keybuf_del(&i->keys, w);
}
return ret;
}
static int bch_dump_open(struct inode *inode, struct file *file)
{
struct cache_set *c = inode->i_private;
struct dump_iterator *i;
i = kzalloc(sizeof(struct dump_iterator), GFP_KERNEL);
if (!i)
return -ENOMEM;
file->private_data = i;
i->c = c;
bch_keybuf_init(&i->keys);
i->keys.last_scanned = KEY(0, 0, 0);
return 0;
}
static int bch_dump_release(struct inode *inode, struct file *file)
{
kfree(file->private_data);
return 0;
}
static const struct file_operations cache_set_debug_ops = {
.owner = THIS_MODULE,
.open = bch_dump_open,
.read = bch_dump_read,
.release = bch_dump_release
};
void bch_debug_init_cache_set(struct cache_set *c)
{
if (!IS_ERR_OR_NULL(debug)) {
char name[50];
snprintf(name, 50, "bcache-%pU", c->sb.set_uuid);
c->debug = debugfs_create_file(name, 0400, debug, c,
&cache_set_debug_ops);
}
}
#endif
void bch_debug_exit(void)
{
if (!IS_ERR_OR_NULL(debug))
debugfs_remove_recursive(debug);
}
int __init bch_debug_init(struct kobject *kobj)
{
int ret = 0;
debug = debugfs_create_dir("bcache", NULL);
return ret;
}

34
drivers/md/bcache/debug.h Normal file
View file

@ -0,0 +1,34 @@
#ifndef _BCACHE_DEBUG_H
#define _BCACHE_DEBUG_H
struct bio;
struct cached_dev;
struct cache_set;
#ifdef CONFIG_BCACHE_DEBUG
void bch_btree_verify(struct btree *);
void bch_data_verify(struct cached_dev *, struct bio *);
#define expensive_debug_checks(c) ((c)->expensive_debug_checks)
#define key_merging_disabled(c) ((c)->key_merging_disabled)
#define bypass_torture_test(d) ((d)->bypass_torture_test)
#else /* DEBUG */
static inline void bch_btree_verify(struct btree *b) {}
static inline void bch_data_verify(struct cached_dev *dc, struct bio *bio) {}
#define expensive_debug_checks(c) 0
#define key_merging_disabled(c) 0
#define bypass_torture_test(d) 0
#endif
#ifdef CONFIG_DEBUG_FS
void bch_debug_init_cache_set(struct cache_set *);
#else
static inline void bch_debug_init_cache_set(struct cache_set *c) {}
#endif
#endif

625
drivers/md/bcache/extents.c Normal file
View file

@ -0,0 +1,625 @@
/*
* Copyright (C) 2010 Kent Overstreet <kent.overstreet@gmail.com>
*
* Uses a block device as cache for other block devices; optimized for SSDs.
* All allocation is done in buckets, which should match the erase block size
* of the device.
*
* Buckets containing cached data are kept on a heap sorted by priority;
* bucket priority is increased on cache hit, and periodically all the buckets
* on the heap have their priority scaled down. This currently is just used as
* an LRU but in the future should allow for more intelligent heuristics.
*
* Buckets have an 8 bit counter; freeing is accomplished by incrementing the
* counter. Garbage collection is used to remove stale pointers.
*
* Indexing is done via a btree; nodes are not necessarily fully sorted, rather
* as keys are inserted we only sort the pages that have not yet been written.
* When garbage collection is run, we resort the entire node.
*
* All configuration is done via sysfs; see Documentation/bcache.txt.
*/
#include "bcache.h"
#include "btree.h"
#include "debug.h"
#include "extents.h"
#include "writeback.h"
static void sort_key_next(struct btree_iter *iter,
struct btree_iter_set *i)
{
i->k = bkey_next(i->k);
if (i->k == i->end)
*i = iter->data[--iter->used];
}
static bool bch_key_sort_cmp(struct btree_iter_set l,
struct btree_iter_set r)
{
int64_t c = bkey_cmp(l.k, r.k);
return c ? c > 0 : l.k < r.k;
}
static bool __ptr_invalid(struct cache_set *c, const struct bkey *k)
{
unsigned i;
for (i = 0; i < KEY_PTRS(k); i++)
if (ptr_available(c, k, i)) {
struct cache *ca = PTR_CACHE(c, k, i);
size_t bucket = PTR_BUCKET_NR(c, k, i);
size_t r = bucket_remainder(c, PTR_OFFSET(k, i));
if (KEY_SIZE(k) + r > c->sb.bucket_size ||
bucket < ca->sb.first_bucket ||
bucket >= ca->sb.nbuckets)
return true;
}
return false;
}
/* Common among btree and extent ptrs */
static const char *bch_ptr_status(struct cache_set *c, const struct bkey *k)
{
unsigned i;
for (i = 0; i < KEY_PTRS(k); i++)
if (ptr_available(c, k, i)) {
struct cache *ca = PTR_CACHE(c, k, i);
size_t bucket = PTR_BUCKET_NR(c, k, i);
size_t r = bucket_remainder(c, PTR_OFFSET(k, i));
if (KEY_SIZE(k) + r > c->sb.bucket_size)
return "bad, length too big";
if (bucket < ca->sb.first_bucket)
return "bad, short offset";
if (bucket >= ca->sb.nbuckets)
return "bad, offset past end of device";
if (ptr_stale(c, k, i))
return "stale";
}
if (!bkey_cmp(k, &ZERO_KEY))
return "bad, null key";
if (!KEY_PTRS(k))
return "bad, no pointers";
if (!KEY_SIZE(k))
return "zeroed key";
return "";
}
void bch_extent_to_text(char *buf, size_t size, const struct bkey *k)
{
unsigned i = 0;
char *out = buf, *end = buf + size;
#define p(...) (out += scnprintf(out, end - out, __VA_ARGS__))
p("%llu:%llu len %llu -> [", KEY_INODE(k), KEY_START(k), KEY_SIZE(k));
for (i = 0; i < KEY_PTRS(k); i++) {
if (i)
p(", ");
if (PTR_DEV(k, i) == PTR_CHECK_DEV)
p("check dev");
else
p("%llu:%llu gen %llu", PTR_DEV(k, i),
PTR_OFFSET(k, i), PTR_GEN(k, i));
}
p("]");
if (KEY_DIRTY(k))
p(" dirty");
if (KEY_CSUM(k))
p(" cs%llu %llx", KEY_CSUM(k), k->ptr[1]);
#undef p
}
static void bch_bkey_dump(struct btree_keys *keys, const struct bkey *k)
{
struct btree *b = container_of(keys, struct btree, keys);
unsigned j;
char buf[80];
bch_extent_to_text(buf, sizeof(buf), k);
printk(" %s", buf);
for (j = 0; j < KEY_PTRS(k); j++) {
size_t n = PTR_BUCKET_NR(b->c, k, j);
printk(" bucket %zu", n);
if (n >= b->c->sb.first_bucket && n < b->c->sb.nbuckets)
printk(" prio %i",
PTR_BUCKET(b->c, k, j)->prio);
}
printk(" %s\n", bch_ptr_status(b->c, k));
}
/* Btree ptrs */
bool __bch_btree_ptr_invalid(struct cache_set *c, const struct bkey *k)
{
char buf[80];
if (!KEY_PTRS(k) || !KEY_SIZE(k) || KEY_DIRTY(k))
goto bad;
if (__ptr_invalid(c, k))
goto bad;
return false;
bad:
bch_extent_to_text(buf, sizeof(buf), k);
cache_bug(c, "spotted btree ptr %s: %s", buf, bch_ptr_status(c, k));
return true;
}
static bool bch_btree_ptr_invalid(struct btree_keys *bk, const struct bkey *k)
{
struct btree *b = container_of(bk, struct btree, keys);
return __bch_btree_ptr_invalid(b->c, k);
}
static bool btree_ptr_bad_expensive(struct btree *b, const struct bkey *k)
{
unsigned i;
char buf[80];
struct bucket *g;
if (mutex_trylock(&b->c->bucket_lock)) {
for (i = 0; i < KEY_PTRS(k); i++)
if (ptr_available(b->c, k, i)) {
g = PTR_BUCKET(b->c, k, i);
if (KEY_DIRTY(k) ||
g->prio != BTREE_PRIO ||
(b->c->gc_mark_valid &&
GC_MARK(g) != GC_MARK_METADATA))
goto err;
}
mutex_unlock(&b->c->bucket_lock);
}
return false;
err:
mutex_unlock(&b->c->bucket_lock);
bch_extent_to_text(buf, sizeof(buf), k);
btree_bug(b,
"inconsistent btree pointer %s: bucket %zi pin %i prio %i gen %i last_gc %i mark %llu",
buf, PTR_BUCKET_NR(b->c, k, i), atomic_read(&g->pin),
g->prio, g->gen, g->last_gc, GC_MARK(g));
return true;
}
static bool bch_btree_ptr_bad(struct btree_keys *bk, const struct bkey *k)
{
struct btree *b = container_of(bk, struct btree, keys);
unsigned i;
if (!bkey_cmp(k, &ZERO_KEY) ||
!KEY_PTRS(k) ||
bch_ptr_invalid(bk, k))
return true;
for (i = 0; i < KEY_PTRS(k); i++)
if (!ptr_available(b->c, k, i) ||
ptr_stale(b->c, k, i))
return true;
if (expensive_debug_checks(b->c) &&
btree_ptr_bad_expensive(b, k))
return true;
return false;
}
static bool bch_btree_ptr_insert_fixup(struct btree_keys *bk,
struct bkey *insert,
struct btree_iter *iter,
struct bkey *replace_key)
{
struct btree *b = container_of(bk, struct btree, keys);
if (!KEY_OFFSET(insert))
btree_current_write(b)->prio_blocked++;
return false;
}
const struct btree_keys_ops bch_btree_keys_ops = {
.sort_cmp = bch_key_sort_cmp,
.insert_fixup = bch_btree_ptr_insert_fixup,
.key_invalid = bch_btree_ptr_invalid,
.key_bad = bch_btree_ptr_bad,
.key_to_text = bch_extent_to_text,
.key_dump = bch_bkey_dump,
};
/* Extents */
/*
* Returns true if l > r - unless l == r, in which case returns true if l is
* older than r.
*
* Necessary for btree_sort_fixup() - if there are multiple keys that compare
* equal in different sets, we have to process them newest to oldest.
*/
static bool bch_extent_sort_cmp(struct btree_iter_set l,
struct btree_iter_set r)
{
int64_t c = bkey_cmp(&START_KEY(l.k), &START_KEY(r.k));
return c ? c > 0 : l.k < r.k;
}
static struct bkey *bch_extent_sort_fixup(struct btree_iter *iter,
struct bkey *tmp)
{
while (iter->used > 1) {
struct btree_iter_set *top = iter->data, *i = top + 1;
if (iter->used > 2 &&
bch_extent_sort_cmp(i[0], i[1]))
i++;
if (bkey_cmp(top->k, &START_KEY(i->k)) <= 0)
break;
if (!KEY_SIZE(i->k)) {
sort_key_next(iter, i);
heap_sift(iter, i - top, bch_extent_sort_cmp);
continue;
}
if (top->k > i->k) {
if (bkey_cmp(top->k, i->k) >= 0)
sort_key_next(iter, i);
else
bch_cut_front(top->k, i->k);
heap_sift(iter, i - top, bch_extent_sort_cmp);
} else {
/* can't happen because of comparison func */
BUG_ON(!bkey_cmp(&START_KEY(top->k), &START_KEY(i->k)));
if (bkey_cmp(i->k, top->k) < 0) {
bkey_copy(tmp, top->k);
bch_cut_back(&START_KEY(i->k), tmp);
bch_cut_front(i->k, top->k);
heap_sift(iter, 0, bch_extent_sort_cmp);
return tmp;
} else {
bch_cut_back(&START_KEY(i->k), top->k);
}
}
}
return NULL;
}
static void bch_subtract_dirty(struct bkey *k,
struct cache_set *c,
uint64_t offset,
int sectors)
{
if (KEY_DIRTY(k))
bcache_dev_sectors_dirty_add(c, KEY_INODE(k),
offset, -sectors);
}
static bool bch_extent_insert_fixup(struct btree_keys *b,
struct bkey *insert,
struct btree_iter *iter,
struct bkey *replace_key)
{
struct cache_set *c = container_of(b, struct btree, keys)->c;
uint64_t old_offset;
unsigned old_size, sectors_found = 0;
BUG_ON(!KEY_OFFSET(insert));
BUG_ON(!KEY_SIZE(insert));
while (1) {
struct bkey *k = bch_btree_iter_next(iter);
if (!k)
break;
if (bkey_cmp(&START_KEY(k), insert) >= 0) {
if (KEY_SIZE(k))
break;
else
continue;
}
if (bkey_cmp(k, &START_KEY(insert)) <= 0)
continue;
old_offset = KEY_START(k);
old_size = KEY_SIZE(k);
/*
* We might overlap with 0 size extents; we can't skip these
* because if they're in the set we're inserting to we have to
* adjust them so they don't overlap with the key we're
* inserting. But we don't want to check them for replace
* operations.
*/
if (replace_key && KEY_SIZE(k)) {
/*
* k might have been split since we inserted/found the
* key we're replacing
*/
unsigned i;
uint64_t offset = KEY_START(k) -
KEY_START(replace_key);
/* But it must be a subset of the replace key */
if (KEY_START(k) < KEY_START(replace_key) ||
KEY_OFFSET(k) > KEY_OFFSET(replace_key))
goto check_failed;
/* We didn't find a key that we were supposed to */
if (KEY_START(k) > KEY_START(insert) + sectors_found)
goto check_failed;
if (!bch_bkey_equal_header(k, replace_key))
goto check_failed;
/* skip past gen */
offset <<= 8;
BUG_ON(!KEY_PTRS(replace_key));
for (i = 0; i < KEY_PTRS(replace_key); i++)
if (k->ptr[i] != replace_key->ptr[i] + offset)
goto check_failed;
sectors_found = KEY_OFFSET(k) - KEY_START(insert);
}
if (bkey_cmp(insert, k) < 0 &&
bkey_cmp(&START_KEY(insert), &START_KEY(k)) > 0) {
/*
* We overlapped in the middle of an existing key: that
* means we have to split the old key. But we have to do
* slightly different things depending on whether the
* old key has been written out yet.
*/
struct bkey *top;
bch_subtract_dirty(k, c, KEY_START(insert),
KEY_SIZE(insert));
if (bkey_written(b, k)) {
/*
* We insert a new key to cover the top of the
* old key, and the old key is modified in place
* to represent the bottom split.
*
* It's completely arbitrary whether the new key
* is the top or the bottom, but it has to match
* up with what btree_sort_fixup() does - it
* doesn't check for this kind of overlap, it
* depends on us inserting a new key for the top
* here.
*/
top = bch_bset_search(b, bset_tree_last(b),
insert);
bch_bset_insert(b, top, k);
} else {
BKEY_PADDED(key) temp;
bkey_copy(&temp.key, k);
bch_bset_insert(b, k, &temp.key);
top = bkey_next(k);
}
bch_cut_front(insert, top);
bch_cut_back(&START_KEY(insert), k);
bch_bset_fix_invalidated_key(b, k);
goto out;
}
if (bkey_cmp(insert, k) < 0) {
bch_cut_front(insert, k);
} else {
if (bkey_cmp(&START_KEY(insert), &START_KEY(k)) > 0)
old_offset = KEY_START(insert);
if (bkey_written(b, k) &&
bkey_cmp(&START_KEY(insert), &START_KEY(k)) <= 0) {
/*
* Completely overwrote, so we don't have to
* invalidate the binary search tree
*/
bch_cut_front(k, k);
} else {
__bch_cut_back(&START_KEY(insert), k);
bch_bset_fix_invalidated_key(b, k);
}
}
bch_subtract_dirty(k, c, old_offset, old_size - KEY_SIZE(k));
}
check_failed:
if (replace_key) {
if (!sectors_found) {
return true;
} else if (sectors_found < KEY_SIZE(insert)) {
SET_KEY_OFFSET(insert, KEY_OFFSET(insert) -
(KEY_SIZE(insert) - sectors_found));
SET_KEY_SIZE(insert, sectors_found);
}
}
out:
if (KEY_DIRTY(insert))
bcache_dev_sectors_dirty_add(c, KEY_INODE(insert),
KEY_START(insert),
KEY_SIZE(insert));
return false;
}
bool __bch_extent_invalid(struct cache_set *c, const struct bkey *k)
{
char buf[80];
if (!KEY_SIZE(k))
return true;
if (KEY_SIZE(k) > KEY_OFFSET(k))
goto bad;
if (__ptr_invalid(c, k))
goto bad;
return false;
bad:
bch_extent_to_text(buf, sizeof(buf), k);
cache_bug(c, "spotted extent %s: %s", buf, bch_ptr_status(c, k));
return true;
}
static bool bch_extent_invalid(struct btree_keys *bk, const struct bkey *k)
{
struct btree *b = container_of(bk, struct btree, keys);
return __bch_extent_invalid(b->c, k);
}
static bool bch_extent_bad_expensive(struct btree *b, const struct bkey *k,
unsigned ptr)
{
struct bucket *g = PTR_BUCKET(b->c, k, ptr);
char buf[80];
if (mutex_trylock(&b->c->bucket_lock)) {
if (b->c->gc_mark_valid &&
(!GC_MARK(g) ||
GC_MARK(g) == GC_MARK_METADATA ||
(GC_MARK(g) != GC_MARK_DIRTY && KEY_DIRTY(k))))
goto err;
if (g->prio == BTREE_PRIO)
goto err;
mutex_unlock(&b->c->bucket_lock);
}
return false;
err:
mutex_unlock(&b->c->bucket_lock);
bch_extent_to_text(buf, sizeof(buf), k);
btree_bug(b,
"inconsistent extent pointer %s:\nbucket %zu pin %i prio %i gen %i last_gc %i mark %llu",
buf, PTR_BUCKET_NR(b->c, k, ptr), atomic_read(&g->pin),
g->prio, g->gen, g->last_gc, GC_MARK(g));
return true;
}
static bool bch_extent_bad(struct btree_keys *bk, const struct bkey *k)
{
struct btree *b = container_of(bk, struct btree, keys);
struct bucket *g;
unsigned i, stale;
if (!KEY_PTRS(k) ||
bch_extent_invalid(bk, k))
return true;
for (i = 0; i < KEY_PTRS(k); i++)
if (!ptr_available(b->c, k, i))
return true;
if (!expensive_debug_checks(b->c) && KEY_DIRTY(k))
return false;
for (i = 0; i < KEY_PTRS(k); i++) {
g = PTR_BUCKET(b->c, k, i);
stale = ptr_stale(b->c, k, i);
btree_bug_on(stale > 96, b,
"key too stale: %i, need_gc %u",
stale, b->c->need_gc);
btree_bug_on(stale && KEY_DIRTY(k) && KEY_SIZE(k),
b, "stale dirty pointer");
if (stale)
return true;
if (expensive_debug_checks(b->c) &&
bch_extent_bad_expensive(b, k, i))
return true;
}
return false;
}
static uint64_t merge_chksums(struct bkey *l, struct bkey *r)
{
return (l->ptr[KEY_PTRS(l)] + r->ptr[KEY_PTRS(r)]) &
~((uint64_t)1 << 63);
}
static bool bch_extent_merge(struct btree_keys *bk, struct bkey *l, struct bkey *r)
{
struct btree *b = container_of(bk, struct btree, keys);
unsigned i;
if (key_merging_disabled(b->c))
return false;
for (i = 0; i < KEY_PTRS(l); i++)
if (l->ptr[i] + PTR(0, KEY_SIZE(l), 0) != r->ptr[i] ||
PTR_BUCKET_NR(b->c, l, i) != PTR_BUCKET_NR(b->c, r, i))
return false;
/* Keys with no pointers aren't restricted to one bucket and could
* overflow KEY_SIZE
*/
if (KEY_SIZE(l) + KEY_SIZE(r) > USHRT_MAX) {
SET_KEY_OFFSET(l, KEY_OFFSET(l) + USHRT_MAX - KEY_SIZE(l));
SET_KEY_SIZE(l, USHRT_MAX);
bch_cut_front(l, r);
return false;
}
if (KEY_CSUM(l)) {
if (KEY_CSUM(r))
l->ptr[KEY_PTRS(l)] = merge_chksums(l, r);
else
SET_KEY_CSUM(l, 0);
}
SET_KEY_OFFSET(l, KEY_OFFSET(l) + KEY_SIZE(r));
SET_KEY_SIZE(l, KEY_SIZE(l) + KEY_SIZE(r));
return true;
}
const struct btree_keys_ops bch_extent_keys_ops = {
.sort_cmp = bch_extent_sort_cmp,
.sort_fixup = bch_extent_sort_fixup,
.insert_fixup = bch_extent_insert_fixup,
.key_invalid = bch_extent_invalid,
.key_bad = bch_extent_bad,
.key_merge = bch_extent_merge,
.key_to_text = bch_extent_to_text,
.key_dump = bch_bkey_dump,
.is_extents = true,
};

View file

@ -0,0 +1,14 @@
#ifndef _BCACHE_EXTENTS_H
#define _BCACHE_EXTENTS_H
extern const struct btree_keys_ops bch_btree_keys_ops;
extern const struct btree_keys_ops bch_extent_keys_ops;
struct bkey;
struct cache_set;
void bch_extent_to_text(char *, size_t, const struct bkey *);
bool __bch_btree_ptr_invalid(struct cache_set *, const struct bkey *);
bool __bch_extent_invalid(struct cache_set *, const struct bkey *);
#endif /* _BCACHE_EXTENTS_H */

243
drivers/md/bcache/io.c Normal file
View file

@ -0,0 +1,243 @@
/*
* Some low level IO code, and hacks for various block layer limitations
*
* Copyright 2010, 2011 Kent Overstreet <kent.overstreet@gmail.com>
* Copyright 2012 Google, Inc.
*/
#include "bcache.h"
#include "bset.h"
#include "debug.h"
#include <linux/blkdev.h>
static unsigned bch_bio_max_sectors(struct bio *bio)
{
struct request_queue *q = bdev_get_queue(bio->bi_bdev);
struct bio_vec bv;
struct bvec_iter iter;
unsigned ret = 0, seg = 0;
if (bio->bi_rw & REQ_DISCARD)
return min(bio_sectors(bio), q->limits.max_discard_sectors);
bio_for_each_segment(bv, bio, iter) {
struct bvec_merge_data bvm = {
.bi_bdev = bio->bi_bdev,
.bi_sector = bio->bi_iter.bi_sector,
.bi_size = ret << 9,
.bi_rw = bio->bi_rw,
};
if (seg == min_t(unsigned, BIO_MAX_PAGES,
queue_max_segments(q)))
break;
if (q->merge_bvec_fn &&
q->merge_bvec_fn(q, &bvm, &bv) < (int) bv.bv_len)
break;
seg++;
ret += bv.bv_len >> 9;
}
ret = min(ret, queue_max_sectors(q));
WARN_ON(!ret);
ret = max_t(int, ret, bio_iovec(bio).bv_len >> 9);
return ret;
}
static void bch_bio_submit_split_done(struct closure *cl)
{
struct bio_split_hook *s = container_of(cl, struct bio_split_hook, cl);
s->bio->bi_end_io = s->bi_end_io;
s->bio->bi_private = s->bi_private;
bio_endio_nodec(s->bio, 0);
closure_debug_destroy(&s->cl);
mempool_free(s, s->p->bio_split_hook);
}
static void bch_bio_submit_split_endio(struct bio *bio, int error)
{
struct closure *cl = bio->bi_private;
struct bio_split_hook *s = container_of(cl, struct bio_split_hook, cl);
if (error)
clear_bit(BIO_UPTODATE, &s->bio->bi_flags);
bio_put(bio);
closure_put(cl);
}
void bch_generic_make_request(struct bio *bio, struct bio_split_pool *p)
{
struct bio_split_hook *s;
struct bio *n;
if (!bio_has_data(bio) && !(bio->bi_rw & REQ_DISCARD))
goto submit;
if (bio_sectors(bio) <= bch_bio_max_sectors(bio))
goto submit;
s = mempool_alloc(p->bio_split_hook, GFP_NOIO);
closure_init(&s->cl, NULL);
s->bio = bio;
s->p = p;
s->bi_end_io = bio->bi_end_io;
s->bi_private = bio->bi_private;
bio_get(bio);
do {
n = bio_next_split(bio, bch_bio_max_sectors(bio),
GFP_NOIO, s->p->bio_split);
n->bi_end_io = bch_bio_submit_split_endio;
n->bi_private = &s->cl;
closure_get(&s->cl);
generic_make_request(n);
} while (n != bio);
continue_at(&s->cl, bch_bio_submit_split_done, NULL);
submit:
generic_make_request(bio);
}
/* Bios with headers */
void bch_bbio_free(struct bio *bio, struct cache_set *c)
{
struct bbio *b = container_of(bio, struct bbio, bio);
mempool_free(b, c->bio_meta);
}
struct bio *bch_bbio_alloc(struct cache_set *c)
{
struct bbio *b = mempool_alloc(c->bio_meta, GFP_NOIO);
struct bio *bio = &b->bio;
bio_init(bio);
bio->bi_flags |= BIO_POOL_NONE << BIO_POOL_OFFSET;
bio->bi_max_vecs = bucket_pages(c);
bio->bi_io_vec = bio->bi_inline_vecs;
return bio;
}
void __bch_submit_bbio(struct bio *bio, struct cache_set *c)
{
struct bbio *b = container_of(bio, struct bbio, bio);
bio->bi_iter.bi_sector = PTR_OFFSET(&b->key, 0);
bio->bi_bdev = PTR_CACHE(c, &b->key, 0)->bdev;
b->submit_time_us = local_clock_us();
closure_bio_submit(bio, bio->bi_private, PTR_CACHE(c, &b->key, 0));
}
void bch_submit_bbio(struct bio *bio, struct cache_set *c,
struct bkey *k, unsigned ptr)
{
struct bbio *b = container_of(bio, struct bbio, bio);
bch_bkey_copy_single_ptr(&b->key, k, ptr);
__bch_submit_bbio(bio, c);
}
/* IO errors */
void bch_count_io_errors(struct cache *ca, int error, const char *m)
{
/*
* The halflife of an error is:
* log2(1/2)/log2(127/128) * refresh ~= 88 * refresh
*/
if (ca->set->error_decay) {
unsigned count = atomic_inc_return(&ca->io_count);
while (count > ca->set->error_decay) {
unsigned errors;
unsigned old = count;
unsigned new = count - ca->set->error_decay;
/*
* First we subtract refresh from count; each time we
* succesfully do so, we rescale the errors once:
*/
count = atomic_cmpxchg(&ca->io_count, old, new);
if (count == old) {
count = new;
errors = atomic_read(&ca->io_errors);
do {
old = errors;
new = ((uint64_t) errors * 127) / 128;
errors = atomic_cmpxchg(&ca->io_errors,
old, new);
} while (old != errors);
}
}
}
if (error) {
char buf[BDEVNAME_SIZE];
unsigned errors = atomic_add_return(1 << IO_ERROR_SHIFT,
&ca->io_errors);
errors >>= IO_ERROR_SHIFT;
if (errors < ca->set->error_limit)
pr_err("%s: IO error on %s, recovering",
bdevname(ca->bdev, buf), m);
else
bch_cache_set_error(ca->set,
"%s: too many IO errors %s",
bdevname(ca->bdev, buf), m);
}
}
void bch_bbio_count_io_errors(struct cache_set *c, struct bio *bio,
int error, const char *m)
{
struct bbio *b = container_of(bio, struct bbio, bio);
struct cache *ca = PTR_CACHE(c, &b->key, 0);
unsigned threshold = bio->bi_rw & REQ_WRITE
? c->congested_write_threshold_us
: c->congested_read_threshold_us;
if (threshold) {
unsigned t = local_clock_us();
int us = t - b->submit_time_us;
int congested = atomic_read(&c->congested);
if (us > (int) threshold) {
int ms = us / 1024;
c->congested_last_us = t;
ms = min(ms, CONGESTED_MAX + congested);
atomic_sub(ms, &c->congested);
} else if (congested < 0)
atomic_inc(&c->congested);
}
bch_count_io_errors(ca, error, m);
}
void bch_bbio_endio(struct cache_set *c, struct bio *bio,
int error, const char *m)
{
struct closure *cl = bio->bi_private;
bch_bbio_count_io_errors(c, bio, error, m);
bio_put(bio);
closure_put(cl);
}

821
drivers/md/bcache/journal.c Normal file
View file

@ -0,0 +1,821 @@
/*
* bcache journalling code, for btree insertions
*
* Copyright 2012 Google, Inc.
*/
#include "bcache.h"
#include "btree.h"
#include "debug.h"
#include "extents.h"
#include <trace/events/bcache.h>
/*
* Journal replay/recovery:
*
* This code is all driven from run_cache_set(); we first read the journal
* entries, do some other stuff, then we mark all the keys in the journal
* entries (same as garbage collection would), then we replay them - reinserting
* them into the cache in precisely the same order as they appear in the
* journal.
*
* We only journal keys that go in leaf nodes, which simplifies things quite a
* bit.
*/
static void journal_read_endio(struct bio *bio, int error)
{
struct closure *cl = bio->bi_private;
closure_put(cl);
}
static int journal_read_bucket(struct cache *ca, struct list_head *list,
unsigned bucket_index)
{
struct journal_device *ja = &ca->journal;
struct bio *bio = &ja->bio;
struct journal_replay *i;
struct jset *j, *data = ca->set->journal.w[0].data;
struct closure cl;
unsigned len, left, offset = 0;
int ret = 0;
sector_t bucket = bucket_to_sector(ca->set, ca->sb.d[bucket_index]);
closure_init_stack(&cl);
pr_debug("reading %u", bucket_index);
while (offset < ca->sb.bucket_size) {
reread: left = ca->sb.bucket_size - offset;
len = min_t(unsigned, left, PAGE_SECTORS << JSET_BITS);
bio_reset(bio);
bio->bi_iter.bi_sector = bucket + offset;
bio->bi_bdev = ca->bdev;
bio->bi_rw = READ;
bio->bi_iter.bi_size = len << 9;
bio->bi_end_io = journal_read_endio;
bio->bi_private = &cl;
bch_bio_map(bio, data);
closure_bio_submit(bio, &cl, ca);
closure_sync(&cl);
/* This function could be simpler now since we no longer write
* journal entries that overlap bucket boundaries; this means
* the start of a bucket will always have a valid journal entry
* if it has any journal entries at all.
*/
j = data;
while (len) {
struct list_head *where;
size_t blocks, bytes = set_bytes(j);
if (j->magic != jset_magic(&ca->sb)) {
pr_debug("%u: bad magic", bucket_index);
return ret;
}
if (bytes > left << 9 ||
bytes > PAGE_SIZE << JSET_BITS) {
pr_info("%u: too big, %zu bytes, offset %u",
bucket_index, bytes, offset);
return ret;
}
if (bytes > len << 9)
goto reread;
if (j->csum != csum_set(j)) {
pr_info("%u: bad csum, %zu bytes, offset %u",
bucket_index, bytes, offset);
return ret;
}
blocks = set_blocks(j, block_bytes(ca->set));
while (!list_empty(list)) {
i = list_first_entry(list,
struct journal_replay, list);
if (i->j.seq >= j->last_seq)
break;
list_del(&i->list);
kfree(i);
}
list_for_each_entry_reverse(i, list, list) {
if (j->seq == i->j.seq)
goto next_set;
if (j->seq < i->j.last_seq)
goto next_set;
if (j->seq > i->j.seq) {
where = &i->list;
goto add;
}
}
where = list;
add:
i = kmalloc(offsetof(struct journal_replay, j) +
bytes, GFP_KERNEL);
if (!i)
return -ENOMEM;
memcpy(&i->j, j, bytes);
list_add(&i->list, where);
ret = 1;
ja->seq[bucket_index] = j->seq;
next_set:
offset += blocks * ca->sb.block_size;
len -= blocks * ca->sb.block_size;
j = ((void *) j) + blocks * block_bytes(ca);
}
}
return ret;
}
int bch_journal_read(struct cache_set *c, struct list_head *list)
{
#define read_bucket(b) \
({ \
int ret = journal_read_bucket(ca, list, b); \
__set_bit(b, bitmap); \
if (ret < 0) \
return ret; \
ret; \
})
struct cache *ca;
unsigned iter;
for_each_cache(ca, c, iter) {
struct journal_device *ja = &ca->journal;
unsigned long bitmap[SB_JOURNAL_BUCKETS / BITS_PER_LONG];
unsigned i, l, r, m;
uint64_t seq;
bitmap_zero(bitmap, SB_JOURNAL_BUCKETS);
pr_debug("%u journal buckets", ca->sb.njournal_buckets);
/*
* Read journal buckets ordered by golden ratio hash to quickly
* find a sequence of buckets with valid journal entries
*/
for (i = 0; i < ca->sb.njournal_buckets; i++) {
l = (i * 2654435769U) % ca->sb.njournal_buckets;
if (test_bit(l, bitmap))
break;
if (read_bucket(l))
goto bsearch;
}
/*
* If that fails, check all the buckets we haven't checked
* already
*/
pr_debug("falling back to linear search");
for (l = find_first_zero_bit(bitmap, ca->sb.njournal_buckets);
l < ca->sb.njournal_buckets;
l = find_next_zero_bit(bitmap, ca->sb.njournal_buckets, l + 1))
if (read_bucket(l))
goto bsearch;
/* no journal entries on this device? */
if (l == ca->sb.njournal_buckets)
continue;
bsearch:
BUG_ON(list_empty(list));
/* Binary search */
m = l;
r = find_next_bit(bitmap, ca->sb.njournal_buckets, l + 1);
pr_debug("starting binary search, l %u r %u", l, r);
while (l + 1 < r) {
seq = list_entry(list->prev, struct journal_replay,
list)->j.seq;
m = (l + r) >> 1;
read_bucket(m);
if (seq != list_entry(list->prev, struct journal_replay,
list)->j.seq)
l = m;
else
r = m;
}
/*
* Read buckets in reverse order until we stop finding more
* journal entries
*/
pr_debug("finishing up: m %u njournal_buckets %u",
m, ca->sb.njournal_buckets);
l = m;
while (1) {
if (!l--)
l = ca->sb.njournal_buckets - 1;
if (l == m)
break;
if (test_bit(l, bitmap))
continue;
if (!read_bucket(l))
break;
}
seq = 0;
for (i = 0; i < ca->sb.njournal_buckets; i++)
if (ja->seq[i] > seq) {
seq = ja->seq[i];
/*
* When journal_reclaim() goes to allocate for
* the first time, it'll use the bucket after
* ja->cur_idx
*/
ja->cur_idx = i;
ja->last_idx = ja->discard_idx = (i + 1) %
ca->sb.njournal_buckets;
}
}
if (!list_empty(list))
c->journal.seq = list_entry(list->prev,
struct journal_replay,
list)->j.seq;
return 0;
#undef read_bucket
}
void bch_journal_mark(struct cache_set *c, struct list_head *list)
{
atomic_t p = { 0 };
struct bkey *k;
struct journal_replay *i;
struct journal *j = &c->journal;
uint64_t last = j->seq;
/*
* journal.pin should never fill up - we never write a journal
* entry when it would fill up. But if for some reason it does, we
* iterate over the list in reverse order so that we can just skip that
* refcount instead of bugging.
*/
list_for_each_entry_reverse(i, list, list) {
BUG_ON(last < i->j.seq);
i->pin = NULL;
while (last-- != i->j.seq)
if (fifo_free(&j->pin) > 1) {
fifo_push_front(&j->pin, p);
atomic_set(&fifo_front(&j->pin), 0);
}
if (fifo_free(&j->pin) > 1) {
fifo_push_front(&j->pin, p);
i->pin = &fifo_front(&j->pin);
atomic_set(i->pin, 1);
}
for (k = i->j.start;
k < bset_bkey_last(&i->j);
k = bkey_next(k))
if (!__bch_extent_invalid(c, k)) {
unsigned j;
for (j = 0; j < KEY_PTRS(k); j++)
if (ptr_available(c, k, j))
atomic_inc(&PTR_BUCKET(c, k, j)->pin);
bch_initial_mark_key(c, 0, k);
}
}
}
int bch_journal_replay(struct cache_set *s, struct list_head *list)
{
int ret = 0, keys = 0, entries = 0;
struct bkey *k;
struct journal_replay *i =
list_entry(list->prev, struct journal_replay, list);
uint64_t start = i->j.last_seq, end = i->j.seq, n = start;
struct keylist keylist;
list_for_each_entry(i, list, list) {
BUG_ON(i->pin && atomic_read(i->pin) != 1);
cache_set_err_on(n != i->j.seq, s,
"bcache: journal entries %llu-%llu missing! (replaying %llu-%llu)",
n, i->j.seq - 1, start, end);
for (k = i->j.start;
k < bset_bkey_last(&i->j);
k = bkey_next(k)) {
trace_bcache_journal_replay_key(k);
bch_keylist_init_single(&keylist, k);
ret = bch_btree_insert(s, &keylist, i->pin, NULL);
if (ret)
goto err;
BUG_ON(!bch_keylist_empty(&keylist));
keys++;
cond_resched();
}
if (i->pin)
atomic_dec(i->pin);
n = i->j.seq + 1;
entries++;
}
pr_info("journal replay done, %i keys in %i entries, seq %llu",
keys, entries, end);
err:
while (!list_empty(list)) {
i = list_first_entry(list, struct journal_replay, list);
list_del(&i->list);
kfree(i);
}
return ret;
}
/* Journalling */
static void btree_flush_write(struct cache_set *c)
{
/*
* Try to find the btree node with that references the oldest journal
* entry, best is our current candidate and is locked if non NULL:
*/
struct btree *b, *best;
unsigned i;
retry:
best = NULL;
for_each_cached_btree(b, c, i)
if (btree_current_write(b)->journal) {
if (!best)
best = b;
else if (journal_pin_cmp(c,
btree_current_write(best)->journal,
btree_current_write(b)->journal)) {
best = b;
}
}
b = best;
if (b) {
mutex_lock(&b->write_lock);
if (!btree_current_write(b)->journal) {
mutex_unlock(&b->write_lock);
/* We raced */
goto retry;
}
__bch_btree_node_write(b, NULL);
mutex_unlock(&b->write_lock);
}
}
#define last_seq(j) ((j)->seq - fifo_used(&(j)->pin) + 1)
static void journal_discard_endio(struct bio *bio, int error)
{
struct journal_device *ja =
container_of(bio, struct journal_device, discard_bio);
struct cache *ca = container_of(ja, struct cache, journal);
atomic_set(&ja->discard_in_flight, DISCARD_DONE);
closure_wake_up(&ca->set->journal.wait);
closure_put(&ca->set->cl);
}
static void journal_discard_work(struct work_struct *work)
{
struct journal_device *ja =
container_of(work, struct journal_device, discard_work);
submit_bio(0, &ja->discard_bio);
}
static void do_journal_discard(struct cache *ca)
{
struct journal_device *ja = &ca->journal;
struct bio *bio = &ja->discard_bio;
if (!ca->discard) {
ja->discard_idx = ja->last_idx;
return;
}
switch (atomic_read(&ja->discard_in_flight)) {
case DISCARD_IN_FLIGHT:
return;
case DISCARD_DONE:
ja->discard_idx = (ja->discard_idx + 1) %
ca->sb.njournal_buckets;
atomic_set(&ja->discard_in_flight, DISCARD_READY);
/* fallthrough */
case DISCARD_READY:
if (ja->discard_idx == ja->last_idx)
return;
atomic_set(&ja->discard_in_flight, DISCARD_IN_FLIGHT);
bio_init(bio);
bio->bi_iter.bi_sector = bucket_to_sector(ca->set,
ca->sb.d[ja->discard_idx]);
bio->bi_bdev = ca->bdev;
bio->bi_rw = REQ_WRITE|REQ_DISCARD;
bio->bi_max_vecs = 1;
bio->bi_io_vec = bio->bi_inline_vecs;
bio->bi_iter.bi_size = bucket_bytes(ca);
bio->bi_end_io = journal_discard_endio;
closure_get(&ca->set->cl);
INIT_WORK(&ja->discard_work, journal_discard_work);
schedule_work(&ja->discard_work);
}
}
static void journal_reclaim(struct cache_set *c)
{
struct bkey *k = &c->journal.key;
struct cache *ca;
uint64_t last_seq;
unsigned iter, n = 0;
atomic_t p;
while (!atomic_read(&fifo_front(&c->journal.pin)))
fifo_pop(&c->journal.pin, p);
last_seq = last_seq(&c->journal);
/* Update last_idx */
for_each_cache(ca, c, iter) {
struct journal_device *ja = &ca->journal;
while (ja->last_idx != ja->cur_idx &&
ja->seq[ja->last_idx] < last_seq)
ja->last_idx = (ja->last_idx + 1) %
ca->sb.njournal_buckets;
}
for_each_cache(ca, c, iter)
do_journal_discard(ca);
if (c->journal.blocks_free)
goto out;
/*
* Allocate:
* XXX: Sort by free journal space
*/
for_each_cache(ca, c, iter) {
struct journal_device *ja = &ca->journal;
unsigned next = (ja->cur_idx + 1) % ca->sb.njournal_buckets;
/* No space available on this device */
if (next == ja->discard_idx)
continue;
ja->cur_idx = next;
k->ptr[n++] = PTR(0,
bucket_to_sector(c, ca->sb.d[ja->cur_idx]),
ca->sb.nr_this_dev);
}
bkey_init(k);
SET_KEY_PTRS(k, n);
if (n)
c->journal.blocks_free = c->sb.bucket_size >> c->block_bits;
out:
if (!journal_full(&c->journal))
__closure_wake_up(&c->journal.wait);
}
void bch_journal_next(struct journal *j)
{
atomic_t p = { 1 };
j->cur = (j->cur == j->w)
? &j->w[1]
: &j->w[0];
/*
* The fifo_push() needs to happen at the same time as j->seq is
* incremented for last_seq() to be calculated correctly
*/
BUG_ON(!fifo_push(&j->pin, p));
atomic_set(&fifo_back(&j->pin), 1);
j->cur->data->seq = ++j->seq;
j->cur->dirty = false;
j->cur->need_write = false;
j->cur->data->keys = 0;
if (fifo_full(&j->pin))
pr_debug("journal_pin full (%zu)", fifo_used(&j->pin));
}
static void journal_write_endio(struct bio *bio, int error)
{
struct journal_write *w = bio->bi_private;
cache_set_err_on(error, w->c, "journal io error");
closure_put(&w->c->journal.io);
}
static void journal_write(struct closure *);
static void journal_write_done(struct closure *cl)
{
struct journal *j = container_of(cl, struct journal, io);
struct journal_write *w = (j->cur == j->w)
? &j->w[1]
: &j->w[0];
__closure_wake_up(&w->wait);
continue_at_nobarrier(cl, journal_write, system_wq);
}
static void journal_write_unlock(struct closure *cl)
{
struct cache_set *c = container_of(cl, struct cache_set, journal.io);
c->journal.io_in_flight = 0;
spin_unlock(&c->journal.lock);
}
static void journal_write_unlocked(struct closure *cl)
__releases(c->journal.lock)
{
struct cache_set *c = container_of(cl, struct cache_set, journal.io);
struct cache *ca;
struct journal_write *w = c->journal.cur;
struct bkey *k = &c->journal.key;
unsigned i, sectors = set_blocks(w->data, block_bytes(c)) *
c->sb.block_size;
struct bio *bio;
struct bio_list list;
bio_list_init(&list);
if (!w->need_write) {
closure_return_with_destructor(cl, journal_write_unlock);
} else if (journal_full(&c->journal)) {
journal_reclaim(c);
spin_unlock(&c->journal.lock);
btree_flush_write(c);
continue_at(cl, journal_write, system_wq);
}
c->journal.blocks_free -= set_blocks(w->data, block_bytes(c));
w->data->btree_level = c->root->level;
bkey_copy(&w->data->btree_root, &c->root->key);
bkey_copy(&w->data->uuid_bucket, &c->uuid_bucket);
for_each_cache(ca, c, i)
w->data->prio_bucket[ca->sb.nr_this_dev] = ca->prio_buckets[0];
w->data->magic = jset_magic(&c->sb);
w->data->version = BCACHE_JSET_VERSION;
w->data->last_seq = last_seq(&c->journal);
w->data->csum = csum_set(w->data);
for (i = 0; i < KEY_PTRS(k); i++) {
ca = PTR_CACHE(c, k, i);
bio = &ca->journal.bio;
atomic_long_add(sectors, &ca->meta_sectors_written);
bio_reset(bio);
bio->bi_iter.bi_sector = PTR_OFFSET(k, i);
bio->bi_bdev = ca->bdev;
bio->bi_rw = REQ_WRITE|REQ_SYNC|REQ_META|REQ_FLUSH|REQ_FUA;
bio->bi_iter.bi_size = sectors << 9;
bio->bi_end_io = journal_write_endio;
bio->bi_private = w;
bch_bio_map(bio, w->data);
trace_bcache_journal_write(bio);
bio_list_add(&list, bio);
SET_PTR_OFFSET(k, i, PTR_OFFSET(k, i) + sectors);
ca->journal.seq[ca->journal.cur_idx] = w->data->seq;
}
atomic_dec_bug(&fifo_back(&c->journal.pin));
bch_journal_next(&c->journal);
journal_reclaim(c);
spin_unlock(&c->journal.lock);
while ((bio = bio_list_pop(&list)))
closure_bio_submit(bio, cl, c->cache[0]);
continue_at(cl, journal_write_done, NULL);
}
static void journal_write(struct closure *cl)
{
struct cache_set *c = container_of(cl, struct cache_set, journal.io);
spin_lock(&c->journal.lock);
journal_write_unlocked(cl);
}
static void journal_try_write(struct cache_set *c)
__releases(c->journal.lock)
{
struct closure *cl = &c->journal.io;
struct journal_write *w = c->journal.cur;
w->need_write = true;
if (!c->journal.io_in_flight) {
c->journal.io_in_flight = 1;
closure_call(cl, journal_write_unlocked, NULL, &c->cl);
} else {
spin_unlock(&c->journal.lock);
}
}
static struct journal_write *journal_wait_for_write(struct cache_set *c,
unsigned nkeys)
{
size_t sectors;
struct closure cl;
bool wait = false;
closure_init_stack(&cl);
spin_lock(&c->journal.lock);
while (1) {
struct journal_write *w = c->journal.cur;
sectors = __set_blocks(w->data, w->data->keys + nkeys,
block_bytes(c)) * c->sb.block_size;
if (sectors <= min_t(size_t,
c->journal.blocks_free * c->sb.block_size,
PAGE_SECTORS << JSET_BITS))
return w;
if (wait)
closure_wait(&c->journal.wait, &cl);
if (!journal_full(&c->journal)) {
if (wait)
trace_bcache_journal_entry_full(c);
/*
* XXX: If we were inserting so many keys that they
* won't fit in an _empty_ journal write, we'll
* deadlock. For now, handle this in
* bch_keylist_realloc() - but something to think about.
*/
BUG_ON(!w->data->keys);
journal_try_write(c); /* unlocks */
} else {
if (wait)
trace_bcache_journal_full(c);
journal_reclaim(c);
spin_unlock(&c->journal.lock);
btree_flush_write(c);
}
closure_sync(&cl);
spin_lock(&c->journal.lock);
wait = true;
}
}
static void journal_write_work(struct work_struct *work)
{
struct cache_set *c = container_of(to_delayed_work(work),
struct cache_set,
journal.work);
spin_lock(&c->journal.lock);
if (c->journal.cur->dirty)
journal_try_write(c);
else
spin_unlock(&c->journal.lock);
}
/*
* Entry point to the journalling code - bio_insert() and btree_invalidate()
* pass bch_journal() a list of keys to be journalled, and then
* bch_journal() hands those same keys off to btree_insert_async()
*/
atomic_t *bch_journal(struct cache_set *c,
struct keylist *keys,
struct closure *parent)
{
struct journal_write *w;
atomic_t *ret;
if (!CACHE_SYNC(&c->sb))
return NULL;
w = journal_wait_for_write(c, bch_keylist_nkeys(keys));
memcpy(bset_bkey_last(w->data), keys->keys, bch_keylist_bytes(keys));
w->data->keys += bch_keylist_nkeys(keys);
ret = &fifo_back(&c->journal.pin);
atomic_inc(ret);
if (parent) {
closure_wait(&w->wait, parent);
journal_try_write(c);
} else if (!w->dirty) {
w->dirty = true;
schedule_delayed_work(&c->journal.work,
msecs_to_jiffies(c->journal_delay_ms));
spin_unlock(&c->journal.lock);
} else {
spin_unlock(&c->journal.lock);
}
return ret;
}
void bch_journal_meta(struct cache_set *c, struct closure *cl)
{
struct keylist keys;
atomic_t *ref;
bch_keylist_init(&keys);
ref = bch_journal(c, &keys, cl);
if (ref)
atomic_dec_bug(ref);
}
void bch_journal_free(struct cache_set *c)
{
free_pages((unsigned long) c->journal.w[1].data, JSET_BITS);
free_pages((unsigned long) c->journal.w[0].data, JSET_BITS);
free_fifo(&c->journal.pin);
}
int bch_journal_alloc(struct cache_set *c)
{
struct journal *j = &c->journal;
spin_lock_init(&j->lock);
INIT_DELAYED_WORK(&j->work, journal_write_work);
c->journal_delay_ms = 100;
j->w[0].c = c;
j->w[1].c = c;
if (!(init_fifo(&j->pin, JOURNAL_PIN, GFP_KERNEL)) ||
!(j->w[0].data = (void *) __get_free_pages(GFP_KERNEL, JSET_BITS)) ||
!(j->w[1].data = (void *) __get_free_pages(GFP_KERNEL, JSET_BITS)))
return -ENOMEM;
return 0;
}

179
drivers/md/bcache/journal.h Normal file
View file

@ -0,0 +1,179 @@
#ifndef _BCACHE_JOURNAL_H
#define _BCACHE_JOURNAL_H
/*
* THE JOURNAL:
*
* The journal is treated as a circular buffer of buckets - a journal entry
* never spans two buckets. This means (not implemented yet) we can resize the
* journal at runtime, and will be needed for bcache on raw flash support.
*
* Journal entries contain a list of keys, ordered by the time they were
* inserted; thus journal replay just has to reinsert the keys.
*
* We also keep some things in the journal header that are logically part of the
* superblock - all the things that are frequently updated. This is for future
* bcache on raw flash support; the superblock (which will become another
* journal) can't be moved or wear leveled, so it contains just enough
* information to find the main journal, and the superblock only has to be
* rewritten when we want to move/wear level the main journal.
*
* Currently, we don't journal BTREE_REPLACE operations - this will hopefully be
* fixed eventually. This isn't a bug - BTREE_REPLACE is used for insertions
* from cache misses, which don't have to be journaled, and for writeback and
* moving gc we work around it by flushing the btree to disk before updating the
* gc information. But it is a potential issue with incremental garbage
* collection, and it's fragile.
*
* OPEN JOURNAL ENTRIES:
*
* Each journal entry contains, in the header, the sequence number of the last
* journal entry still open - i.e. that has keys that haven't been flushed to
* disk in the btree.
*
* We track this by maintaining a refcount for every open journal entry, in a
* fifo; each entry in the fifo corresponds to a particular journal
* entry/sequence number. When the refcount at the tail of the fifo goes to
* zero, we pop it off - thus, the size of the fifo tells us the number of open
* journal entries
*
* We take a refcount on a journal entry when we add some keys to a journal
* entry that we're going to insert (held by struct btree_op), and then when we
* insert those keys into the btree the btree write we're setting up takes a
* copy of that refcount (held by struct btree_write). That refcount is dropped
* when the btree write completes.
*
* A struct btree_write can only hold a refcount on a single journal entry, but
* might contain keys for many journal entries - we handle this by making sure
* it always has a refcount on the _oldest_ journal entry of all the journal
* entries it has keys for.
*
* JOURNAL RECLAIM:
*
* As mentioned previously, our fifo of refcounts tells us the number of open
* journal entries; from that and the current journal sequence number we compute
* last_seq - the oldest journal entry we still need. We write last_seq in each
* journal entry, and we also have to keep track of where it exists on disk so
* we don't overwrite it when we loop around the journal.
*
* To do that we track, for each journal bucket, the sequence number of the
* newest journal entry it contains - if we don't need that journal entry we
* don't need anything in that bucket anymore. From that we track the last
* journal bucket we still need; all this is tracked in struct journal_device
* and updated by journal_reclaim().
*
* JOURNAL FILLING UP:
*
* There are two ways the journal could fill up; either we could run out of
* space to write to, or we could have too many open journal entries and run out
* of room in the fifo of refcounts. Since those refcounts are decremented
* without any locking we can't safely resize that fifo, so we handle it the
* same way.
*
* If the journal fills up, we start flushing dirty btree nodes until we can
* allocate space for a journal write again - preferentially flushing btree
* nodes that are pinning the oldest journal entries first.
*/
/*
* Only used for holding the journal entries we read in btree_journal_read()
* during cache_registration
*/
struct journal_replay {
struct list_head list;
atomic_t *pin;
struct jset j;
};
/*
* We put two of these in struct journal; we used them for writes to the
* journal that are being staged or in flight.
*/
struct journal_write {
struct jset *data;
#define JSET_BITS 3
struct cache_set *c;
struct closure_waitlist wait;
bool dirty;
bool need_write;
};
/* Embedded in struct cache_set */
struct journal {
spinlock_t lock;
/* used when waiting because the journal was full */
struct closure_waitlist wait;
struct closure io;
int io_in_flight;
struct delayed_work work;
/* Number of blocks free in the bucket(s) we're currently writing to */
unsigned blocks_free;
uint64_t seq;
DECLARE_FIFO(atomic_t, pin);
BKEY_PADDED(key);
struct journal_write w[2], *cur;
};
/*
* Embedded in struct cache. First three fields refer to the array of journal
* buckets, in cache_sb.
*/
struct journal_device {
/*
* For each journal bucket, contains the max sequence number of the
* journal writes it contains - so we know when a bucket can be reused.
*/
uint64_t seq[SB_JOURNAL_BUCKETS];
/* Journal bucket we're currently writing to */
unsigned cur_idx;
/* Last journal bucket that still contains an open journal entry */
unsigned last_idx;
/* Next journal bucket to be discarded */
unsigned discard_idx;
#define DISCARD_READY 0
#define DISCARD_IN_FLIGHT 1
#define DISCARD_DONE 2
/* 1 - discard in flight, -1 - discard completed */
atomic_t discard_in_flight;
struct work_struct discard_work;
struct bio discard_bio;
struct bio_vec discard_bv;
/* Bio for journal reads/writes to this device */
struct bio bio;
struct bio_vec bv[8];
};
#define journal_pin_cmp(c, l, r) \
(fifo_idx(&(c)->journal.pin, (l)) > fifo_idx(&(c)->journal.pin, (r)))
#define JOURNAL_PIN 20000
#define journal_full(j) \
(!(j)->blocks_free || fifo_free(&(j)->pin) <= 1)
struct closure;
struct cache_set;
struct btree_op;
struct keylist;
atomic_t *bch_journal(struct cache_set *, struct keylist *, struct closure *);
void bch_journal_next(struct journal *);
void bch_journal_mark(struct cache_set *, struct list_head *);
void bch_journal_meta(struct cache_set *, struct closure *);
int bch_journal_read(struct cache_set *, struct list_head *);
int bch_journal_replay(struct cache_set *, struct list_head *);
void bch_journal_free(struct cache_set *);
int bch_journal_alloc(struct cache_set *);
#endif /* _BCACHE_JOURNAL_H */

View file

@ -0,0 +1,256 @@
/*
* Moving/copying garbage collector
*
* Copyright 2012 Google, Inc.
*/
#include "bcache.h"
#include "btree.h"
#include "debug.h"
#include "request.h"
#include <trace/events/bcache.h>
struct moving_io {
struct closure cl;
struct keybuf_key *w;
struct data_insert_op op;
struct bbio bio;
};
static bool moving_pred(struct keybuf *buf, struct bkey *k)
{
struct cache_set *c = container_of(buf, struct cache_set,
moving_gc_keys);
unsigned i;
for (i = 0; i < KEY_PTRS(k); i++)
if (ptr_available(c, k, i) &&
GC_MOVE(PTR_BUCKET(c, k, i)))
return true;
return false;
}
/* Moving GC - IO loop */
static void moving_io_destructor(struct closure *cl)
{
struct moving_io *io = container_of(cl, struct moving_io, cl);
kfree(io);
}
static void write_moving_finish(struct closure *cl)
{
struct moving_io *io = container_of(cl, struct moving_io, cl);
struct bio *bio = &io->bio.bio;
struct bio_vec *bv;
int i;
bio_for_each_segment_all(bv, bio, i)
__free_page(bv->bv_page);
if (io->op.replace_collision)
trace_bcache_gc_copy_collision(&io->w->key);
bch_keybuf_del(&io->op.c->moving_gc_keys, io->w);
up(&io->op.c->moving_in_flight);
closure_return_with_destructor(cl, moving_io_destructor);
}
static void read_moving_endio(struct bio *bio, int error)
{
struct bbio *b = container_of(bio, struct bbio, bio);
struct moving_io *io = container_of(bio->bi_private,
struct moving_io, cl);
if (error)
io->op.error = error;
else if (!KEY_DIRTY(&b->key) &&
ptr_stale(io->op.c, &b->key, 0)) {
io->op.error = -EINTR;
}
bch_bbio_endio(io->op.c, bio, error, "reading data to move");
}
static void moving_init(struct moving_io *io)
{
struct bio *bio = &io->bio.bio;
bio_init(bio);
bio_get(bio);
bio_set_prio(bio, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0));
bio->bi_iter.bi_size = KEY_SIZE(&io->w->key) << 9;
bio->bi_max_vecs = DIV_ROUND_UP(KEY_SIZE(&io->w->key),
PAGE_SECTORS);
bio->bi_private = &io->cl;
bio->bi_io_vec = bio->bi_inline_vecs;
bch_bio_map(bio, NULL);
}
static void write_moving(struct closure *cl)
{
struct moving_io *io = container_of(cl, struct moving_io, cl);
struct data_insert_op *op = &io->op;
if (!op->error) {
moving_init(io);
io->bio.bio.bi_iter.bi_sector = KEY_START(&io->w->key);
op->write_prio = 1;
op->bio = &io->bio.bio;
op->writeback = KEY_DIRTY(&io->w->key);
op->csum = KEY_CSUM(&io->w->key);
bkey_copy(&op->replace_key, &io->w->key);
op->replace = true;
closure_call(&op->cl, bch_data_insert, NULL, cl);
}
continue_at(cl, write_moving_finish, op->wq);
}
static void read_moving_submit(struct closure *cl)
{
struct moving_io *io = container_of(cl, struct moving_io, cl);
struct bio *bio = &io->bio.bio;
bch_submit_bbio(bio, io->op.c, &io->w->key, 0);
continue_at(cl, write_moving, io->op.wq);
}
static void read_moving(struct cache_set *c)
{
struct keybuf_key *w;
struct moving_io *io;
struct bio *bio;
struct closure cl;
closure_init_stack(&cl);
/* XXX: if we error, background writeback could stall indefinitely */
while (!test_bit(CACHE_SET_STOPPING, &c->flags)) {
w = bch_keybuf_next_rescan(c, &c->moving_gc_keys,
&MAX_KEY, moving_pred);
if (!w)
break;
if (ptr_stale(c, &w->key, 0)) {
bch_keybuf_del(&c->moving_gc_keys, w);
continue;
}
io = kzalloc(sizeof(struct moving_io) + sizeof(struct bio_vec)
* DIV_ROUND_UP(KEY_SIZE(&w->key), PAGE_SECTORS),
GFP_KERNEL);
if (!io)
goto err;
w->private = io;
io->w = w;
io->op.inode = KEY_INODE(&w->key);
io->op.c = c;
io->op.wq = c->moving_gc_wq;
moving_init(io);
bio = &io->bio.bio;
bio->bi_rw = READ;
bio->bi_end_io = read_moving_endio;
if (bio_alloc_pages(bio, GFP_KERNEL))
goto err;
trace_bcache_gc_copy(&w->key);
down(&c->moving_in_flight);
closure_call(&io->cl, read_moving_submit, NULL, &cl);
}
if (0) {
err: if (!IS_ERR_OR_NULL(w->private))
kfree(w->private);
bch_keybuf_del(&c->moving_gc_keys, w);
}
closure_sync(&cl);
}
static bool bucket_cmp(struct bucket *l, struct bucket *r)
{
return GC_SECTORS_USED(l) < GC_SECTORS_USED(r);
}
static unsigned bucket_heap_top(struct cache *ca)
{
struct bucket *b;
return (b = heap_peek(&ca->heap)) ? GC_SECTORS_USED(b) : 0;
}
void bch_moving_gc(struct cache_set *c)
{
struct cache *ca;
struct bucket *b;
unsigned i;
if (!c->copy_gc_enabled)
return;
mutex_lock(&c->bucket_lock);
for_each_cache(ca, c, i) {
unsigned sectors_to_move = 0;
unsigned reserve_sectors = ca->sb.bucket_size *
fifo_used(&ca->free[RESERVE_MOVINGGC]);
ca->heap.used = 0;
for_each_bucket(b, ca) {
if (GC_MARK(b) == GC_MARK_METADATA ||
!GC_SECTORS_USED(b) ||
GC_SECTORS_USED(b) == ca->sb.bucket_size ||
atomic_read(&b->pin))
continue;
if (!heap_full(&ca->heap)) {
sectors_to_move += GC_SECTORS_USED(b);
heap_add(&ca->heap, b, bucket_cmp);
} else if (bucket_cmp(b, heap_peek(&ca->heap))) {
sectors_to_move -= bucket_heap_top(ca);
sectors_to_move += GC_SECTORS_USED(b);
ca->heap.data[0] = b;
heap_sift(&ca->heap, 0, bucket_cmp);
}
}
while (sectors_to_move > reserve_sectors) {
heap_pop(&ca->heap, b, bucket_cmp);
sectors_to_move -= GC_SECTORS_USED(b);
}
while (heap_pop(&ca->heap, b, bucket_cmp))
SET_GC_MOVE(b, 1);
}
mutex_unlock(&c->bucket_lock);
c->moving_gc_keys.last_scanned = ZERO_KEY;
read_moving(c);
}
void bch_moving_init_cache_set(struct cache_set *c)
{
bch_keybuf_init(&c->moving_gc_keys);
sema_init(&c->moving_in_flight, 64);
}

1160
drivers/md/bcache/request.c Normal file

File diff suppressed because it is too large Load diff

View file

@ -0,0 +1,43 @@
#ifndef _BCACHE_REQUEST_H_
#define _BCACHE_REQUEST_H_
struct data_insert_op {
struct closure cl;
struct cache_set *c;
struct bio *bio;
struct workqueue_struct *wq;
unsigned inode;
uint16_t write_point;
uint16_t write_prio;
short error;
union {
uint16_t flags;
struct {
unsigned bypass:1;
unsigned writeback:1;
unsigned flush_journal:1;
unsigned csum:1;
unsigned replace:1;
unsigned replace_collision:1;
unsigned insert_data_done:1;
};
};
struct keylist insert_keys;
BKEY_PADDED(replace_key);
};
unsigned bch_get_congested(struct cache_set *);
void bch_data_insert(struct closure *cl);
void bch_cached_dev_request_init(struct cached_dev *dc);
void bch_flash_dev_request_init(struct bcache_device *d);
extern struct kmem_cache *bch_search_cache, *bch_passthrough_cache;
#endif /* _BCACHE_REQUEST_H_ */

241
drivers/md/bcache/stats.c Normal file
View file

@ -0,0 +1,241 @@
/*
* bcache stats code
*
* Copyright 2012 Google, Inc.
*/
#include "bcache.h"
#include "stats.h"
#include "btree.h"
#include "sysfs.h"
/*
* We keep absolute totals of various statistics, and addionally a set of three
* rolling averages.
*
* Every so often, a timer goes off and rescales the rolling averages.
* accounting_rescale[] is how many times the timer has to go off before we
* rescale each set of numbers; that gets us half lives of 5 minutes, one hour,
* and one day.
*
* accounting_delay is how often the timer goes off - 22 times in 5 minutes,
* and accounting_weight is what we use to rescale:
*
* pow(31 / 32, 22) ~= 1/2
*
* So that we don't have to increment each set of numbers every time we (say)
* get a cache hit, we increment a single atomic_t in acc->collector, and when
* the rescale function runs it resets the atomic counter to 0 and adds its
* old value to each of the exported numbers.
*
* To reduce rounding error, the numbers in struct cache_stats are all
* stored left shifted by 16, and scaled back in the sysfs show() function.
*/
static const unsigned DAY_RESCALE = 288;
static const unsigned HOUR_RESCALE = 12;
static const unsigned FIVE_MINUTE_RESCALE = 1;
static const unsigned accounting_delay = (HZ * 300) / 22;
static const unsigned accounting_weight = 32;
/* sysfs reading/writing */
read_attribute(cache_hits);
read_attribute(cache_misses);
read_attribute(cache_bypass_hits);
read_attribute(cache_bypass_misses);
read_attribute(cache_hit_ratio);
read_attribute(cache_readaheads);
read_attribute(cache_miss_collisions);
read_attribute(bypassed);
SHOW(bch_stats)
{
struct cache_stats *s =
container_of(kobj, struct cache_stats, kobj);
#define var(stat) (s->stat >> 16)
var_print(cache_hits);
var_print(cache_misses);
var_print(cache_bypass_hits);
var_print(cache_bypass_misses);
sysfs_print(cache_hit_ratio,
DIV_SAFE(var(cache_hits) * 100,
var(cache_hits) + var(cache_misses)));
var_print(cache_readaheads);
var_print(cache_miss_collisions);
sysfs_hprint(bypassed, var(sectors_bypassed) << 9);
#undef var
return 0;
}
STORE(bch_stats)
{
return size;
}
static void bch_stats_release(struct kobject *k)
{
}
static struct attribute *bch_stats_files[] = {
&sysfs_cache_hits,
&sysfs_cache_misses,
&sysfs_cache_bypass_hits,
&sysfs_cache_bypass_misses,
&sysfs_cache_hit_ratio,
&sysfs_cache_readaheads,
&sysfs_cache_miss_collisions,
&sysfs_bypassed,
NULL
};
static KTYPE(bch_stats);
int bch_cache_accounting_add_kobjs(struct cache_accounting *acc,
struct kobject *parent)
{
int ret = kobject_add(&acc->total.kobj, parent,
"stats_total");
ret = ret ?: kobject_add(&acc->five_minute.kobj, parent,
"stats_five_minute");
ret = ret ?: kobject_add(&acc->hour.kobj, parent,
"stats_hour");
ret = ret ?: kobject_add(&acc->day.kobj, parent,
"stats_day");
return ret;
}
void bch_cache_accounting_clear(struct cache_accounting *acc)
{
memset(&acc->total.cache_hits,
0,
sizeof(unsigned long) * 7);
}
void bch_cache_accounting_destroy(struct cache_accounting *acc)
{
kobject_put(&acc->total.kobj);
kobject_put(&acc->five_minute.kobj);
kobject_put(&acc->hour.kobj);
kobject_put(&acc->day.kobj);
atomic_set(&acc->closing, 1);
if (del_timer_sync(&acc->timer))
closure_return(&acc->cl);
}
/* EWMA scaling */
static void scale_stat(unsigned long *stat)
{
*stat = ewma_add(*stat, 0, accounting_weight, 0);
}
static void scale_stats(struct cache_stats *stats, unsigned long rescale_at)
{
if (++stats->rescale == rescale_at) {
stats->rescale = 0;
scale_stat(&stats->cache_hits);
scale_stat(&stats->cache_misses);
scale_stat(&stats->cache_bypass_hits);
scale_stat(&stats->cache_bypass_misses);
scale_stat(&stats->cache_readaheads);
scale_stat(&stats->cache_miss_collisions);
scale_stat(&stats->sectors_bypassed);
}
}
static void scale_accounting(unsigned long data)
{
struct cache_accounting *acc = (struct cache_accounting *) data;
#define move_stat(name) do { \
unsigned t = atomic_xchg(&acc->collector.name, 0); \
t <<= 16; \
acc->five_minute.name += t; \
acc->hour.name += t; \
acc->day.name += t; \
acc->total.name += t; \
} while (0)
move_stat(cache_hits);
move_stat(cache_misses);
move_stat(cache_bypass_hits);
move_stat(cache_bypass_misses);
move_stat(cache_readaheads);
move_stat(cache_miss_collisions);
move_stat(sectors_bypassed);
scale_stats(&acc->total, 0);
scale_stats(&acc->day, DAY_RESCALE);
scale_stats(&acc->hour, HOUR_RESCALE);
scale_stats(&acc->five_minute, FIVE_MINUTE_RESCALE);
acc->timer.expires += accounting_delay;
if (!atomic_read(&acc->closing))
add_timer(&acc->timer);
else
closure_return(&acc->cl);
}
static void mark_cache_stats(struct cache_stat_collector *stats,
bool hit, bool bypass)
{
if (!bypass)
if (hit)
atomic_inc(&stats->cache_hits);
else
atomic_inc(&stats->cache_misses);
else
if (hit)
atomic_inc(&stats->cache_bypass_hits);
else
atomic_inc(&stats->cache_bypass_misses);
}
void bch_mark_cache_accounting(struct cache_set *c, struct bcache_device *d,
bool hit, bool bypass)
{
struct cached_dev *dc = container_of(d, struct cached_dev, disk);
mark_cache_stats(&dc->accounting.collector, hit, bypass);
mark_cache_stats(&c->accounting.collector, hit, bypass);
}
void bch_mark_cache_readahead(struct cache_set *c, struct bcache_device *d)
{
struct cached_dev *dc = container_of(d, struct cached_dev, disk);
atomic_inc(&dc->accounting.collector.cache_readaheads);
atomic_inc(&c->accounting.collector.cache_readaheads);
}
void bch_mark_cache_miss_collision(struct cache_set *c, struct bcache_device *d)
{
struct cached_dev *dc = container_of(d, struct cached_dev, disk);
atomic_inc(&dc->accounting.collector.cache_miss_collisions);
atomic_inc(&c->accounting.collector.cache_miss_collisions);
}
void bch_mark_sectors_bypassed(struct cache_set *c, struct cached_dev *dc,
int sectors)
{
atomic_add(sectors, &dc->accounting.collector.sectors_bypassed);
atomic_add(sectors, &c->accounting.collector.sectors_bypassed);
}
void bch_cache_accounting_init(struct cache_accounting *acc,
struct closure *parent)
{
kobject_init(&acc->total.kobj, &bch_stats_ktype);
kobject_init(&acc->five_minute.kobj, &bch_stats_ktype);
kobject_init(&acc->hour.kobj, &bch_stats_ktype);
kobject_init(&acc->day.kobj, &bch_stats_ktype);
closure_init(&acc->cl, parent);
init_timer(&acc->timer);
acc->timer.expires = jiffies + accounting_delay;
acc->timer.data = (unsigned long) acc;
acc->timer.function = scale_accounting;
add_timer(&acc->timer);
}

61
drivers/md/bcache/stats.h Normal file
View file

@ -0,0 +1,61 @@
#ifndef _BCACHE_STATS_H_
#define _BCACHE_STATS_H_
struct cache_stat_collector {
atomic_t cache_hits;
atomic_t cache_misses;
atomic_t cache_bypass_hits;
atomic_t cache_bypass_misses;
atomic_t cache_readaheads;
atomic_t cache_miss_collisions;
atomic_t sectors_bypassed;
};
struct cache_stats {
struct kobject kobj;
unsigned long cache_hits;
unsigned long cache_misses;
unsigned long cache_bypass_hits;
unsigned long cache_bypass_misses;
unsigned long cache_readaheads;
unsigned long cache_miss_collisions;
unsigned long sectors_bypassed;
unsigned rescale;
};
struct cache_accounting {
struct closure cl;
struct timer_list timer;
atomic_t closing;
struct cache_stat_collector collector;
struct cache_stats total;
struct cache_stats five_minute;
struct cache_stats hour;
struct cache_stats day;
};
struct cache_set;
struct cached_dev;
struct bcache_device;
void bch_cache_accounting_init(struct cache_accounting *acc,
struct closure *parent);
int bch_cache_accounting_add_kobjs(struct cache_accounting *acc,
struct kobject *parent);
void bch_cache_accounting_clear(struct cache_accounting *acc);
void bch_cache_accounting_destroy(struct cache_accounting *acc);
void bch_mark_cache_accounting(struct cache_set *, struct bcache_device *,
bool, bool);
void bch_mark_cache_readahead(struct cache_set *, struct bcache_device *);
void bch_mark_cache_miss_collision(struct cache_set *, struct bcache_device *);
void bch_mark_sectors_bypassed(struct cache_set *, struct cached_dev *, int);
#endif /* _BCACHE_STATS_H_ */

2120
drivers/md/bcache/super.c Normal file

File diff suppressed because it is too large Load diff

898
drivers/md/bcache/sysfs.c Normal file
View file

@ -0,0 +1,898 @@
/*
* bcache sysfs interfaces
*
* Copyright 2010, 2011 Kent Overstreet <kent.overstreet@gmail.com>
* Copyright 2012 Google, Inc.
*/
#include "bcache.h"
#include "sysfs.h"
#include "btree.h"
#include "request.h"
#include "writeback.h"
#include <linux/blkdev.h>
#include <linux/sort.h>
static const char * const cache_replacement_policies[] = {
"lru",
"fifo",
"random",
NULL
};
static const char * const error_actions[] = {
"unregister",
"panic",
NULL
};
write_attribute(attach);
write_attribute(detach);
write_attribute(unregister);
write_attribute(stop);
write_attribute(clear_stats);
write_attribute(trigger_gc);
write_attribute(prune_cache);
write_attribute(flash_vol_create);
read_attribute(bucket_size);
read_attribute(block_size);
read_attribute(nbuckets);
read_attribute(tree_depth);
read_attribute(root_usage_percent);
read_attribute(priority_stats);
read_attribute(btree_cache_size);
read_attribute(btree_cache_max_chain);
read_attribute(cache_available_percent);
read_attribute(written);
read_attribute(btree_written);
read_attribute(metadata_written);
read_attribute(active_journal_entries);
sysfs_time_stats_attribute(btree_gc, sec, ms);
sysfs_time_stats_attribute(btree_split, sec, us);
sysfs_time_stats_attribute(btree_sort, ms, us);
sysfs_time_stats_attribute(btree_read, ms, us);
read_attribute(btree_nodes);
read_attribute(btree_used_percent);
read_attribute(average_key_size);
read_attribute(dirty_data);
read_attribute(bset_tree_stats);
read_attribute(state);
read_attribute(cache_read_races);
read_attribute(writeback_keys_done);
read_attribute(writeback_keys_failed);
read_attribute(io_errors);
read_attribute(congested);
rw_attribute(congested_read_threshold_us);
rw_attribute(congested_write_threshold_us);
rw_attribute(sequential_cutoff);
rw_attribute(data_csum);
rw_attribute(cache_mode);
rw_attribute(writeback_metadata);
rw_attribute(writeback_running);
rw_attribute(writeback_percent);
rw_attribute(writeback_delay);
rw_attribute(writeback_rate);
rw_attribute(writeback_rate_update_seconds);
rw_attribute(writeback_rate_d_term);
rw_attribute(writeback_rate_p_term_inverse);
read_attribute(writeback_rate_debug);
read_attribute(stripe_size);
read_attribute(partial_stripes_expensive);
rw_attribute(synchronous);
rw_attribute(journal_delay_ms);
rw_attribute(discard);
rw_attribute(running);
rw_attribute(label);
rw_attribute(readahead);
rw_attribute(errors);
rw_attribute(io_error_limit);
rw_attribute(io_error_halflife);
rw_attribute(verify);
rw_attribute(bypass_torture_test);
rw_attribute(key_merging_disabled);
rw_attribute(gc_always_rewrite);
rw_attribute(expensive_debug_checks);
rw_attribute(cache_replacement_policy);
rw_attribute(btree_shrinker_disabled);
rw_attribute(copy_gc_enabled);
rw_attribute(size);
SHOW(__bch_cached_dev)
{
struct cached_dev *dc = container_of(kobj, struct cached_dev,
disk.kobj);
const char *states[] = { "no cache", "clean", "dirty", "inconsistent" };
#define var(stat) (dc->stat)
if (attr == &sysfs_cache_mode)
return bch_snprint_string_list(buf, PAGE_SIZE,
bch_cache_modes + 1,
BDEV_CACHE_MODE(&dc->sb));
sysfs_printf(data_csum, "%i", dc->disk.data_csum);
var_printf(verify, "%i");
var_printf(bypass_torture_test, "%i");
var_printf(writeback_metadata, "%i");
var_printf(writeback_running, "%i");
var_print(writeback_delay);
var_print(writeback_percent);
sysfs_hprint(writeback_rate, dc->writeback_rate.rate << 9);
var_print(writeback_rate_update_seconds);
var_print(writeback_rate_d_term);
var_print(writeback_rate_p_term_inverse);
if (attr == &sysfs_writeback_rate_debug) {
char rate[20];
char dirty[20];
char target[20];
char proportional[20];
char derivative[20];
char change[20];
s64 next_io;
bch_hprint(rate, dc->writeback_rate.rate << 9);
bch_hprint(dirty, bcache_dev_sectors_dirty(&dc->disk) << 9);
bch_hprint(target, dc->writeback_rate_target << 9);
bch_hprint(proportional,dc->writeback_rate_proportional << 9);
bch_hprint(derivative, dc->writeback_rate_derivative << 9);
bch_hprint(change, dc->writeback_rate_change << 9);
next_io = div64_s64(dc->writeback_rate.next - local_clock(),
NSEC_PER_MSEC);
return sprintf(buf,
"rate:\t\t%s/sec\n"
"dirty:\t\t%s\n"
"target:\t\t%s\n"
"proportional:\t%s\n"
"derivative:\t%s\n"
"change:\t\t%s/sec\n"
"next io:\t%llims\n",
rate, dirty, target, proportional,
derivative, change, next_io);
}
sysfs_hprint(dirty_data,
bcache_dev_sectors_dirty(&dc->disk) << 9);
sysfs_hprint(stripe_size, dc->disk.stripe_size << 9);
var_printf(partial_stripes_expensive, "%u");
var_hprint(sequential_cutoff);
var_hprint(readahead);
sysfs_print(running, atomic_read(&dc->running));
sysfs_print(state, states[BDEV_STATE(&dc->sb)]);
if (attr == &sysfs_label) {
memcpy(buf, dc->sb.label, SB_LABEL_SIZE);
buf[SB_LABEL_SIZE + 1] = '\0';
strcat(buf, "\n");
return strlen(buf);
}
#undef var
return 0;
}
SHOW_LOCKED(bch_cached_dev)
STORE(__cached_dev)
{
struct cached_dev *dc = container_of(kobj, struct cached_dev,
disk.kobj);
unsigned v = size;
struct cache_set *c;
struct kobj_uevent_env *env;
#define d_strtoul(var) sysfs_strtoul(var, dc->var)
#define d_strtoul_nonzero(var) sysfs_strtoul_clamp(var, dc->var, 1, INT_MAX)
#define d_strtoi_h(var) sysfs_hatoi(var, dc->var)
sysfs_strtoul(data_csum, dc->disk.data_csum);
d_strtoul(verify);
d_strtoul(bypass_torture_test);
d_strtoul(writeback_metadata);
d_strtoul(writeback_running);
d_strtoul(writeback_delay);
sysfs_strtoul_clamp(writeback_percent, dc->writeback_percent, 0, 40);
sysfs_strtoul_clamp(writeback_rate,
dc->writeback_rate.rate, 1, INT_MAX);
d_strtoul_nonzero(writeback_rate_update_seconds);
d_strtoul(writeback_rate_d_term);
d_strtoul_nonzero(writeback_rate_p_term_inverse);
d_strtoi_h(sequential_cutoff);
d_strtoi_h(readahead);
if (attr == &sysfs_clear_stats)
bch_cache_accounting_clear(&dc->accounting);
if (attr == &sysfs_running &&
strtoul_or_return(buf))
bch_cached_dev_run(dc);
if (attr == &sysfs_cache_mode) {
ssize_t v = bch_read_string_list(buf, bch_cache_modes + 1);
if (v < 0)
return v;
if ((unsigned) v != BDEV_CACHE_MODE(&dc->sb)) {
SET_BDEV_CACHE_MODE(&dc->sb, v);
bch_write_bdev_super(dc, NULL);
}
}
if (attr == &sysfs_label) {
if (size > SB_LABEL_SIZE)
return -EINVAL;
memcpy(dc->sb.label, buf, size);
if (size < SB_LABEL_SIZE)
dc->sb.label[size] = '\0';
if (size && dc->sb.label[size - 1] == '\n')
dc->sb.label[size - 1] = '\0';
bch_write_bdev_super(dc, NULL);
if (dc->disk.c) {
memcpy(dc->disk.c->uuids[dc->disk.id].label,
buf, SB_LABEL_SIZE);
bch_uuid_write(dc->disk.c);
}
env = kzalloc(sizeof(struct kobj_uevent_env), GFP_KERNEL);
if (!env)
return -ENOMEM;
add_uevent_var(env, "DRIVER=bcache");
add_uevent_var(env, "CACHED_UUID=%pU", dc->sb.uuid),
add_uevent_var(env, "CACHED_LABEL=%s", buf);
kobject_uevent_env(
&disk_to_dev(dc->disk.disk)->kobj, KOBJ_CHANGE, env->envp);
kfree(env);
}
if (attr == &sysfs_attach) {
if (bch_parse_uuid(buf, dc->sb.set_uuid) < 16)
return -EINVAL;
list_for_each_entry(c, &bch_cache_sets, list) {
v = bch_cached_dev_attach(dc, c);
if (!v)
return size;
}
pr_err("Can't attach %s: cache set not found", buf);
size = v;
}
if (attr == &sysfs_detach && dc->disk.c)
bch_cached_dev_detach(dc);
if (attr == &sysfs_stop)
bcache_device_stop(&dc->disk);
return size;
}
STORE(bch_cached_dev)
{
struct cached_dev *dc = container_of(kobj, struct cached_dev,
disk.kobj);
mutex_lock(&bch_register_lock);
size = __cached_dev_store(kobj, attr, buf, size);
if (attr == &sysfs_writeback_running)
bch_writeback_queue(dc);
if (attr == &sysfs_writeback_percent)
schedule_delayed_work(&dc->writeback_rate_update,
dc->writeback_rate_update_seconds * HZ);
mutex_unlock(&bch_register_lock);
return size;
}
static struct attribute *bch_cached_dev_files[] = {
&sysfs_attach,
&sysfs_detach,
&sysfs_stop,
#if 0
&sysfs_data_csum,
#endif
&sysfs_cache_mode,
&sysfs_writeback_metadata,
&sysfs_writeback_running,
&sysfs_writeback_delay,
&sysfs_writeback_percent,
&sysfs_writeback_rate,
&sysfs_writeback_rate_update_seconds,
&sysfs_writeback_rate_d_term,
&sysfs_writeback_rate_p_term_inverse,
&sysfs_writeback_rate_debug,
&sysfs_dirty_data,
&sysfs_stripe_size,
&sysfs_partial_stripes_expensive,
&sysfs_sequential_cutoff,
&sysfs_clear_stats,
&sysfs_running,
&sysfs_state,
&sysfs_label,
&sysfs_readahead,
#ifdef CONFIG_BCACHE_DEBUG
&sysfs_verify,
&sysfs_bypass_torture_test,
#endif
NULL
};
KTYPE(bch_cached_dev);
SHOW(bch_flash_dev)
{
struct bcache_device *d = container_of(kobj, struct bcache_device,
kobj);
struct uuid_entry *u = &d->c->uuids[d->id];
sysfs_printf(data_csum, "%i", d->data_csum);
sysfs_hprint(size, u->sectors << 9);
if (attr == &sysfs_label) {
memcpy(buf, u->label, SB_LABEL_SIZE);
buf[SB_LABEL_SIZE + 1] = '\0';
strcat(buf, "\n");
return strlen(buf);
}
return 0;
}
STORE(__bch_flash_dev)
{
struct bcache_device *d = container_of(kobj, struct bcache_device,
kobj);
struct uuid_entry *u = &d->c->uuids[d->id];
sysfs_strtoul(data_csum, d->data_csum);
if (attr == &sysfs_size) {
uint64_t v;
strtoi_h_or_return(buf, v);
u->sectors = v >> 9;
bch_uuid_write(d->c);
set_capacity(d->disk, u->sectors);
}
if (attr == &sysfs_label) {
memcpy(u->label, buf, SB_LABEL_SIZE);
bch_uuid_write(d->c);
}
if (attr == &sysfs_unregister) {
set_bit(BCACHE_DEV_DETACHING, &d->flags);
bcache_device_stop(d);
}
return size;
}
STORE_LOCKED(bch_flash_dev)
static struct attribute *bch_flash_dev_files[] = {
&sysfs_unregister,
#if 0
&sysfs_data_csum,
#endif
&sysfs_label,
&sysfs_size,
NULL
};
KTYPE(bch_flash_dev);
struct bset_stats_op {
struct btree_op op;
size_t nodes;
struct bset_stats stats;
};
static int bch_btree_bset_stats(struct btree_op *b_op, struct btree *b)
{
struct bset_stats_op *op = container_of(b_op, struct bset_stats_op, op);
op->nodes++;
bch_btree_keys_stats(&b->keys, &op->stats);
return MAP_CONTINUE;
}
static int bch_bset_print_stats(struct cache_set *c, char *buf)
{
struct bset_stats_op op;
int ret;
memset(&op, 0, sizeof(op));
bch_btree_op_init(&op.op, -1);
ret = bch_btree_map_nodes(&op.op, c, &ZERO_KEY, bch_btree_bset_stats);
if (ret < 0)
return ret;
return snprintf(buf, PAGE_SIZE,
"btree nodes: %zu\n"
"written sets: %zu\n"
"unwritten sets: %zu\n"
"written key bytes: %zu\n"
"unwritten key bytes: %zu\n"
"floats: %zu\n"
"failed: %zu\n",
op.nodes,
op.stats.sets_written, op.stats.sets_unwritten,
op.stats.bytes_written, op.stats.bytes_unwritten,
op.stats.floats, op.stats.failed);
}
static unsigned bch_root_usage(struct cache_set *c)
{
unsigned bytes = 0;
struct bkey *k;
struct btree *b;
struct btree_iter iter;
goto lock_root;
do {
rw_unlock(false, b);
lock_root:
b = c->root;
rw_lock(false, b, b->level);
} while (b != c->root);
for_each_key_filter(&b->keys, k, &iter, bch_ptr_bad)
bytes += bkey_bytes(k);
rw_unlock(false, b);
return (bytes * 100) / btree_bytes(c);
}
static size_t bch_cache_size(struct cache_set *c)
{
size_t ret = 0;
struct btree *b;
mutex_lock(&c->bucket_lock);
list_for_each_entry(b, &c->btree_cache, list)
ret += 1 << (b->keys.page_order + PAGE_SHIFT);
mutex_unlock(&c->bucket_lock);
return ret;
}
static unsigned bch_cache_max_chain(struct cache_set *c)
{
unsigned ret = 0;
struct hlist_head *h;
mutex_lock(&c->bucket_lock);
for (h = c->bucket_hash;
h < c->bucket_hash + (1 << BUCKET_HASH_BITS);
h++) {
unsigned i = 0;
struct hlist_node *p;
hlist_for_each(p, h)
i++;
ret = max(ret, i);
}
mutex_unlock(&c->bucket_lock);
return ret;
}
static unsigned bch_btree_used(struct cache_set *c)
{
return div64_u64(c->gc_stats.key_bytes * 100,
(c->gc_stats.nodes ?: 1) * btree_bytes(c));
}
static unsigned bch_average_key_size(struct cache_set *c)
{
return c->gc_stats.nkeys
? div64_u64(c->gc_stats.data, c->gc_stats.nkeys)
: 0;
}
SHOW(__bch_cache_set)
{
struct cache_set *c = container_of(kobj, struct cache_set, kobj);
sysfs_print(synchronous, CACHE_SYNC(&c->sb));
sysfs_print(journal_delay_ms, c->journal_delay_ms);
sysfs_hprint(bucket_size, bucket_bytes(c));
sysfs_hprint(block_size, block_bytes(c));
sysfs_print(tree_depth, c->root->level);
sysfs_print(root_usage_percent, bch_root_usage(c));
sysfs_hprint(btree_cache_size, bch_cache_size(c));
sysfs_print(btree_cache_max_chain, bch_cache_max_chain(c));
sysfs_print(cache_available_percent, 100 - c->gc_stats.in_use);
sysfs_print_time_stats(&c->btree_gc_time, btree_gc, sec, ms);
sysfs_print_time_stats(&c->btree_split_time, btree_split, sec, us);
sysfs_print_time_stats(&c->sort.time, btree_sort, ms, us);
sysfs_print_time_stats(&c->btree_read_time, btree_read, ms, us);
sysfs_print(btree_used_percent, bch_btree_used(c));
sysfs_print(btree_nodes, c->gc_stats.nodes);
sysfs_hprint(average_key_size, bch_average_key_size(c));
sysfs_print(cache_read_races,
atomic_long_read(&c->cache_read_races));
sysfs_print(writeback_keys_done,
atomic_long_read(&c->writeback_keys_done));
sysfs_print(writeback_keys_failed,
atomic_long_read(&c->writeback_keys_failed));
if (attr == &sysfs_errors)
return bch_snprint_string_list(buf, PAGE_SIZE, error_actions,
c->on_error);
/* See count_io_errors for why 88 */
sysfs_print(io_error_halflife, c->error_decay * 88);
sysfs_print(io_error_limit, c->error_limit >> IO_ERROR_SHIFT);
sysfs_hprint(congested,
((uint64_t) bch_get_congested(c)) << 9);
sysfs_print(congested_read_threshold_us,
c->congested_read_threshold_us);
sysfs_print(congested_write_threshold_us,
c->congested_write_threshold_us);
sysfs_print(active_journal_entries, fifo_used(&c->journal.pin));
sysfs_printf(verify, "%i", c->verify);
sysfs_printf(key_merging_disabled, "%i", c->key_merging_disabled);
sysfs_printf(expensive_debug_checks,
"%i", c->expensive_debug_checks);
sysfs_printf(gc_always_rewrite, "%i", c->gc_always_rewrite);
sysfs_printf(btree_shrinker_disabled, "%i", c->shrinker_disabled);
sysfs_printf(copy_gc_enabled, "%i", c->copy_gc_enabled);
if (attr == &sysfs_bset_tree_stats)
return bch_bset_print_stats(c, buf);
return 0;
}
SHOW_LOCKED(bch_cache_set)
STORE(__bch_cache_set)
{
struct cache_set *c = container_of(kobj, struct cache_set, kobj);
if (attr == &sysfs_unregister)
bch_cache_set_unregister(c);
if (attr == &sysfs_stop)
bch_cache_set_stop(c);
if (attr == &sysfs_synchronous) {
bool sync = strtoul_or_return(buf);
if (sync != CACHE_SYNC(&c->sb)) {
SET_CACHE_SYNC(&c->sb, sync);
bcache_write_super(c);
}
}
if (attr == &sysfs_flash_vol_create) {
int r;
uint64_t v;
strtoi_h_or_return(buf, v);
r = bch_flash_dev_create(c, v);
if (r)
return r;
}
if (attr == &sysfs_clear_stats) {
atomic_long_set(&c->writeback_keys_done, 0);
atomic_long_set(&c->writeback_keys_failed, 0);
memset(&c->gc_stats, 0, sizeof(struct gc_stat));
bch_cache_accounting_clear(&c->accounting);
}
if (attr == &sysfs_trigger_gc)
wake_up_gc(c);
if (attr == &sysfs_prune_cache) {
struct shrink_control sc;
sc.gfp_mask = GFP_KERNEL;
sc.nr_to_scan = strtoul_or_return(buf);
c->shrink.scan_objects(&c->shrink, &sc);
}
sysfs_strtoul(congested_read_threshold_us,
c->congested_read_threshold_us);
sysfs_strtoul(congested_write_threshold_us,
c->congested_write_threshold_us);
if (attr == &sysfs_errors) {
ssize_t v = bch_read_string_list(buf, error_actions);
if (v < 0)
return v;
c->on_error = v;
}
if (attr == &sysfs_io_error_limit)
c->error_limit = strtoul_or_return(buf) << IO_ERROR_SHIFT;
/* See count_io_errors() for why 88 */
if (attr == &sysfs_io_error_halflife)
c->error_decay = strtoul_or_return(buf) / 88;
sysfs_strtoul(journal_delay_ms, c->journal_delay_ms);
sysfs_strtoul(verify, c->verify);
sysfs_strtoul(key_merging_disabled, c->key_merging_disabled);
sysfs_strtoul(expensive_debug_checks, c->expensive_debug_checks);
sysfs_strtoul(gc_always_rewrite, c->gc_always_rewrite);
sysfs_strtoul(btree_shrinker_disabled, c->shrinker_disabled);
sysfs_strtoul(copy_gc_enabled, c->copy_gc_enabled);
return size;
}
STORE_LOCKED(bch_cache_set)
SHOW(bch_cache_set_internal)
{
struct cache_set *c = container_of(kobj, struct cache_set, internal);
return bch_cache_set_show(&c->kobj, attr, buf);
}
STORE(bch_cache_set_internal)
{
struct cache_set *c = container_of(kobj, struct cache_set, internal);
return bch_cache_set_store(&c->kobj, attr, buf, size);
}
static void bch_cache_set_internal_release(struct kobject *k)
{
}
static struct attribute *bch_cache_set_files[] = {
&sysfs_unregister,
&sysfs_stop,
&sysfs_synchronous,
&sysfs_journal_delay_ms,
&sysfs_flash_vol_create,
&sysfs_bucket_size,
&sysfs_block_size,
&sysfs_tree_depth,
&sysfs_root_usage_percent,
&sysfs_btree_cache_size,
&sysfs_cache_available_percent,
&sysfs_average_key_size,
&sysfs_errors,
&sysfs_io_error_limit,
&sysfs_io_error_halflife,
&sysfs_congested,
&sysfs_congested_read_threshold_us,
&sysfs_congested_write_threshold_us,
&sysfs_clear_stats,
NULL
};
KTYPE(bch_cache_set);
static struct attribute *bch_cache_set_internal_files[] = {
&sysfs_active_journal_entries,
sysfs_time_stats_attribute_list(btree_gc, sec, ms)
sysfs_time_stats_attribute_list(btree_split, sec, us)
sysfs_time_stats_attribute_list(btree_sort, ms, us)
sysfs_time_stats_attribute_list(btree_read, ms, us)
&sysfs_btree_nodes,
&sysfs_btree_used_percent,
&sysfs_btree_cache_max_chain,
&sysfs_bset_tree_stats,
&sysfs_cache_read_races,
&sysfs_writeback_keys_done,
&sysfs_writeback_keys_failed,
&sysfs_trigger_gc,
&sysfs_prune_cache,
#ifdef CONFIG_BCACHE_DEBUG
&sysfs_verify,
&sysfs_key_merging_disabled,
&sysfs_expensive_debug_checks,
#endif
&sysfs_gc_always_rewrite,
&sysfs_btree_shrinker_disabled,
&sysfs_copy_gc_enabled,
NULL
};
KTYPE(bch_cache_set_internal);
SHOW(__bch_cache)
{
struct cache *ca = container_of(kobj, struct cache, kobj);
sysfs_hprint(bucket_size, bucket_bytes(ca));
sysfs_hprint(block_size, block_bytes(ca));
sysfs_print(nbuckets, ca->sb.nbuckets);
sysfs_print(discard, ca->discard);
sysfs_hprint(written, atomic_long_read(&ca->sectors_written) << 9);
sysfs_hprint(btree_written,
atomic_long_read(&ca->btree_sectors_written) << 9);
sysfs_hprint(metadata_written,
(atomic_long_read(&ca->meta_sectors_written) +
atomic_long_read(&ca->btree_sectors_written)) << 9);
sysfs_print(io_errors,
atomic_read(&ca->io_errors) >> IO_ERROR_SHIFT);
if (attr == &sysfs_cache_replacement_policy)
return bch_snprint_string_list(buf, PAGE_SIZE,
cache_replacement_policies,
CACHE_REPLACEMENT(&ca->sb));
if (attr == &sysfs_priority_stats) {
int cmp(const void *l, const void *r)
{ return *((uint16_t *) r) - *((uint16_t *) l); }
struct bucket *b;
size_t n = ca->sb.nbuckets, i;
size_t unused = 0, available = 0, dirty = 0, meta = 0;
uint64_t sum = 0;
/* Compute 31 quantiles */
uint16_t q[31], *p, *cached;
ssize_t ret;
cached = p = vmalloc(ca->sb.nbuckets * sizeof(uint16_t));
if (!p)
return -ENOMEM;
mutex_lock(&ca->set->bucket_lock);
for_each_bucket(b, ca) {
if (!GC_SECTORS_USED(b))
unused++;
if (GC_MARK(b) == GC_MARK_RECLAIMABLE)
available++;
if (GC_MARK(b) == GC_MARK_DIRTY)
dirty++;
if (GC_MARK(b) == GC_MARK_METADATA)
meta++;
}
for (i = ca->sb.first_bucket; i < n; i++)
p[i] = ca->buckets[i].prio;
mutex_unlock(&ca->set->bucket_lock);
sort(p, n, sizeof(uint16_t), cmp, NULL);
while (n &&
!cached[n - 1])
--n;
unused = ca->sb.nbuckets - n;
while (cached < p + n &&
*cached == BTREE_PRIO)
cached++, n--;
for (i = 0; i < n; i++)
sum += INITIAL_PRIO - cached[i];
if (n)
do_div(sum, n);
for (i = 0; i < ARRAY_SIZE(q); i++)
q[i] = INITIAL_PRIO - cached[n * (i + 1) /
(ARRAY_SIZE(q) + 1)];
vfree(p);
ret = scnprintf(buf, PAGE_SIZE,
"Unused: %zu%%\n"
"Clean: %zu%%\n"
"Dirty: %zu%%\n"
"Metadata: %zu%%\n"
"Average: %llu\n"
"Sectors per Q: %zu\n"
"Quantiles: [",
unused * 100 / (size_t) ca->sb.nbuckets,
available * 100 / (size_t) ca->sb.nbuckets,
dirty * 100 / (size_t) ca->sb.nbuckets,
meta * 100 / (size_t) ca->sb.nbuckets, sum,
n * ca->sb.bucket_size / (ARRAY_SIZE(q) + 1));
for (i = 0; i < ARRAY_SIZE(q); i++)
ret += scnprintf(buf + ret, PAGE_SIZE - ret,
"%u ", q[i]);
ret--;
ret += scnprintf(buf + ret, PAGE_SIZE - ret, "]\n");
return ret;
}
return 0;
}
SHOW_LOCKED(bch_cache)
STORE(__bch_cache)
{
struct cache *ca = container_of(kobj, struct cache, kobj);
if (attr == &sysfs_discard) {
bool v = strtoul_or_return(buf);
if (blk_queue_discard(bdev_get_queue(ca->bdev)))
ca->discard = v;
if (v != CACHE_DISCARD(&ca->sb)) {
SET_CACHE_DISCARD(&ca->sb, v);
bcache_write_super(ca->set);
}
}
if (attr == &sysfs_cache_replacement_policy) {
ssize_t v = bch_read_string_list(buf, cache_replacement_policies);
if (v < 0)
return v;
if ((unsigned) v != CACHE_REPLACEMENT(&ca->sb)) {
mutex_lock(&ca->set->bucket_lock);
SET_CACHE_REPLACEMENT(&ca->sb, v);
mutex_unlock(&ca->set->bucket_lock);
bcache_write_super(ca->set);
}
}
if (attr == &sysfs_clear_stats) {
atomic_long_set(&ca->sectors_written, 0);
atomic_long_set(&ca->btree_sectors_written, 0);
atomic_long_set(&ca->meta_sectors_written, 0);
atomic_set(&ca->io_count, 0);
atomic_set(&ca->io_errors, 0);
}
return size;
}
STORE_LOCKED(bch_cache)
static struct attribute *bch_cache_files[] = {
&sysfs_bucket_size,
&sysfs_block_size,
&sysfs_nbuckets,
&sysfs_priority_stats,
&sysfs_discard,
&sysfs_written,
&sysfs_btree_written,
&sysfs_metadata_written,
&sysfs_io_errors,
&sysfs_clear_stats,
&sysfs_cache_replacement_policy,
NULL
};
KTYPE(bch_cache);

110
drivers/md/bcache/sysfs.h Normal file
View file

@ -0,0 +1,110 @@
#ifndef _BCACHE_SYSFS_H_
#define _BCACHE_SYSFS_H_
#define KTYPE(type) \
struct kobj_type type ## _ktype = { \
.release = type ## _release, \
.sysfs_ops = &((const struct sysfs_ops) { \
.show = type ## _show, \
.store = type ## _store \
}), \
.default_attrs = type ## _files \
}
#define SHOW(fn) \
static ssize_t fn ## _show(struct kobject *kobj, struct attribute *attr,\
char *buf) \
#define STORE(fn) \
static ssize_t fn ## _store(struct kobject *kobj, struct attribute *attr,\
const char *buf, size_t size) \
#define SHOW_LOCKED(fn) \
SHOW(fn) \
{ \
ssize_t ret; \
mutex_lock(&bch_register_lock); \
ret = __ ## fn ## _show(kobj, attr, buf); \
mutex_unlock(&bch_register_lock); \
return ret; \
}
#define STORE_LOCKED(fn) \
STORE(fn) \
{ \
ssize_t ret; \
mutex_lock(&bch_register_lock); \
ret = __ ## fn ## _store(kobj, attr, buf, size); \
mutex_unlock(&bch_register_lock); \
return ret; \
}
#define __sysfs_attribute(_name, _mode) \
static struct attribute sysfs_##_name = \
{ .name = #_name, .mode = _mode }
#define write_attribute(n) __sysfs_attribute(n, S_IWUSR)
#define read_attribute(n) __sysfs_attribute(n, S_IRUGO)
#define rw_attribute(n) __sysfs_attribute(n, S_IRUGO|S_IWUSR)
#define sysfs_printf(file, fmt, ...) \
do { \
if (attr == &sysfs_ ## file) \
return snprintf(buf, PAGE_SIZE, fmt "\n", __VA_ARGS__); \
} while (0)
#define sysfs_print(file, var) \
do { \
if (attr == &sysfs_ ## file) \
return snprint(buf, PAGE_SIZE, var); \
} while (0)
#define sysfs_hprint(file, val) \
do { \
if (attr == &sysfs_ ## file) { \
ssize_t ret = bch_hprint(buf, val); \
strcat(buf, "\n"); \
return ret + 1; \
} \
} while (0)
#define var_printf(_var, fmt) sysfs_printf(_var, fmt, var(_var))
#define var_print(_var) sysfs_print(_var, var(_var))
#define var_hprint(_var) sysfs_hprint(_var, var(_var))
#define sysfs_strtoul(file, var) \
do { \
if (attr == &sysfs_ ## file) \
return strtoul_safe(buf, var) ?: (ssize_t) size; \
} while (0)
#define sysfs_strtoul_clamp(file, var, min, max) \
do { \
if (attr == &sysfs_ ## file) \
return strtoul_safe_clamp(buf, var, min, max) \
?: (ssize_t) size; \
} while (0)
#define strtoul_or_return(cp) \
({ \
unsigned long _v; \
int _r = kstrtoul(cp, 10, &_v); \
if (_r) \
return _r; \
_v; \
})
#define strtoi_h_or_return(cp, v) \
do { \
int _r = strtoi_h(cp, &v); \
if (_r) \
return _r; \
} while (0)
#define sysfs_hatoi(file, var) \
do { \
if (attr == &sysfs_ ## file) \
return strtoi_h(buf, &var) ?: (ssize_t) size; \
} while (0)
#endif /* _BCACHE_SYSFS_H_ */

52
drivers/md/bcache/trace.c Normal file
View file

@ -0,0 +1,52 @@
#include "bcache.h"
#include "btree.h"
#include <linux/blktrace_api.h>
#include <linux/module.h>
#define CREATE_TRACE_POINTS
#include <trace/events/bcache.h>
EXPORT_TRACEPOINT_SYMBOL_GPL(bcache_request_start);
EXPORT_TRACEPOINT_SYMBOL_GPL(bcache_request_end);
EXPORT_TRACEPOINT_SYMBOL_GPL(bcache_bypass_sequential);
EXPORT_TRACEPOINT_SYMBOL_GPL(bcache_bypass_congested);
EXPORT_TRACEPOINT_SYMBOL_GPL(bcache_read);
EXPORT_TRACEPOINT_SYMBOL_GPL(bcache_write);
EXPORT_TRACEPOINT_SYMBOL_GPL(bcache_read_retry);
EXPORT_TRACEPOINT_SYMBOL_GPL(bcache_cache_insert);
EXPORT_TRACEPOINT_SYMBOL_GPL(bcache_journal_replay_key);
EXPORT_TRACEPOINT_SYMBOL_GPL(bcache_journal_write);
EXPORT_TRACEPOINT_SYMBOL_GPL(bcache_journal_full);
EXPORT_TRACEPOINT_SYMBOL_GPL(bcache_journal_entry_full);
EXPORT_TRACEPOINT_SYMBOL_GPL(bcache_btree_cache_cannibalize);
EXPORT_TRACEPOINT_SYMBOL_GPL(bcache_btree_read);
EXPORT_TRACEPOINT_SYMBOL_GPL(bcache_btree_write);
EXPORT_TRACEPOINT_SYMBOL_GPL(bcache_btree_node_alloc);
EXPORT_TRACEPOINT_SYMBOL_GPL(bcache_btree_node_alloc_fail);
EXPORT_TRACEPOINT_SYMBOL_GPL(bcache_btree_node_free);
EXPORT_TRACEPOINT_SYMBOL_GPL(bcache_btree_gc_coalesce);
EXPORT_TRACEPOINT_SYMBOL_GPL(bcache_gc_start);
EXPORT_TRACEPOINT_SYMBOL_GPL(bcache_gc_end);
EXPORT_TRACEPOINT_SYMBOL_GPL(bcache_gc_copy);
EXPORT_TRACEPOINT_SYMBOL_GPL(bcache_gc_copy_collision);
EXPORT_TRACEPOINT_SYMBOL_GPL(bcache_btree_insert_key);
EXPORT_TRACEPOINT_SYMBOL_GPL(bcache_btree_node_split);
EXPORT_TRACEPOINT_SYMBOL_GPL(bcache_btree_node_compact);
EXPORT_TRACEPOINT_SYMBOL_GPL(bcache_btree_set_root);
EXPORT_TRACEPOINT_SYMBOL_GPL(bcache_invalidate);
EXPORT_TRACEPOINT_SYMBOL_GPL(bcache_alloc_fail);
EXPORT_TRACEPOINT_SYMBOL_GPL(bcache_writeback);
EXPORT_TRACEPOINT_SYMBOL_GPL(bcache_writeback_collision);

381
drivers/md/bcache/util.c Normal file
View file

@ -0,0 +1,381 @@
/*
* random utiility code, for bcache but in theory not specific to bcache
*
* Copyright 2010, 2011 Kent Overstreet <kent.overstreet@gmail.com>
* Copyright 2012 Google, Inc.
*/
#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/ctype.h>
#include <linux/debugfs.h>
#include <linux/module.h>
#include <linux/seq_file.h>
#include <linux/types.h>
#include "util.h"
#define simple_strtoint(c, end, base) simple_strtol(c, end, base)
#define simple_strtouint(c, end, base) simple_strtoul(c, end, base)
#define STRTO_H(name, type) \
int bch_ ## name ## _h(const char *cp, type *res) \
{ \
int u = 0; \
char *e; \
type i = simple_ ## name(cp, &e, 10); \
\
switch (tolower(*e)) { \
default: \
return -EINVAL; \
case 'y': \
case 'z': \
u++; \
case 'e': \
u++; \
case 'p': \
u++; \
case 't': \
u++; \
case 'g': \
u++; \
case 'm': \
u++; \
case 'k': \
u++; \
if (e++ == cp) \
return -EINVAL; \
case '\n': \
case '\0': \
if (*e == '\n') \
e++; \
} \
\
if (*e) \
return -EINVAL; \
\
while (u--) { \
if ((type) ~0 > 0 && \
(type) ~0 / 1024 <= i) \
return -EINVAL; \
if ((i > 0 && ANYSINT_MAX(type) / 1024 < i) || \
(i < 0 && -ANYSINT_MAX(type) / 1024 > i)) \
return -EINVAL; \
i *= 1024; \
} \
\
*res = i; \
return 0; \
} \
STRTO_H(strtoint, int)
STRTO_H(strtouint, unsigned int)
STRTO_H(strtoll, long long)
STRTO_H(strtoull, unsigned long long)
ssize_t bch_hprint(char *buf, int64_t v)
{
static const char units[] = "?kMGTPEZY";
char dec[4] = "";
int u, t = 0;
for (u = 0; v >= 1024 || v <= -1024; u++) {
t = v & ~(~0 << 10);
v >>= 10;
}
if (!u)
return sprintf(buf, "%llu", v);
if (v < 100 && v > -100)
snprintf(dec, sizeof(dec), ".%i", t / 100);
return sprintf(buf, "%lli%s%c", v, dec, units[u]);
}
ssize_t bch_snprint_string_list(char *buf, size_t size, const char * const list[],
size_t selected)
{
char *out = buf;
size_t i;
for (i = 0; list[i]; i++)
out += snprintf(out, buf + size - out,
i == selected ? "[%s] " : "%s ", list[i]);
out[-1] = '\n';
return out - buf;
}
ssize_t bch_read_string_list(const char *buf, const char * const list[])
{
size_t i;
char *s, *d = kstrndup(buf, PAGE_SIZE - 1, GFP_KERNEL);
if (!d)
return -ENOMEM;
s = strim(d);
for (i = 0; list[i]; i++)
if (!strcmp(list[i], s))
break;
kfree(d);
if (!list[i])
return -EINVAL;
return i;
}
bool bch_is_zero(const char *p, size_t n)
{
size_t i;
for (i = 0; i < n; i++)
if (p[i])
return false;
return true;
}
int bch_parse_uuid(const char *s, char *uuid)
{
size_t i, j, x;
memset(uuid, 0, 16);
for (i = 0, j = 0;
i < strspn(s, "-0123456789:ABCDEFabcdef") && j < 32;
i++) {
x = s[i] | 32;
switch (x) {
case '0'...'9':
x -= '0';
break;
case 'a'...'f':
x -= 'a' - 10;
break;
default:
continue;
}
if (!(j & 1))
x <<= 4;
uuid[j++ >> 1] |= x;
}
return i;
}
void bch_time_stats_update(struct time_stats *stats, uint64_t start_time)
{
uint64_t now, duration, last;
spin_lock(&stats->lock);
now = local_clock();
duration = time_after64(now, start_time)
? now - start_time : 0;
last = time_after64(now, stats->last)
? now - stats->last : 0;
stats->max_duration = max(stats->max_duration, duration);
if (stats->last) {
ewma_add(stats->average_duration, duration, 8, 8);
if (stats->average_frequency)
ewma_add(stats->average_frequency, last, 8, 8);
else
stats->average_frequency = last << 8;
} else {
stats->average_duration = duration << 8;
}
stats->last = now ?: 1;
spin_unlock(&stats->lock);
}
/**
* bch_next_delay() - increment @d by the amount of work done, and return how
* long to delay until the next time to do some work.
*
* @d - the struct bch_ratelimit to update
* @done - the amount of work done, in arbitrary units
*
* Returns the amount of time to delay by, in jiffies
*/
uint64_t bch_next_delay(struct bch_ratelimit *d, uint64_t done)
{
uint64_t now = local_clock();
d->next += div_u64(done * NSEC_PER_SEC, d->rate);
if (time_before64(now + NSEC_PER_SEC, d->next))
d->next = now + NSEC_PER_SEC;
if (time_after64(now - NSEC_PER_SEC * 2, d->next))
d->next = now - NSEC_PER_SEC * 2;
return time_after64(d->next, now)
? div_u64(d->next - now, NSEC_PER_SEC / HZ)
: 0;
}
void bch_bio_map(struct bio *bio, void *base)
{
size_t size = bio->bi_iter.bi_size;
struct bio_vec *bv = bio->bi_io_vec;
BUG_ON(!bio->bi_iter.bi_size);
BUG_ON(bio->bi_vcnt);
bv->bv_offset = base ? ((unsigned long) base) % PAGE_SIZE : 0;
goto start;
for (; size; bio->bi_vcnt++, bv++) {
bv->bv_offset = 0;
start: bv->bv_len = min_t(size_t, PAGE_SIZE - bv->bv_offset,
size);
if (base) {
bv->bv_page = is_vmalloc_addr(base)
? vmalloc_to_page(base)
: virt_to_page(base);
base += bv->bv_len;
}
size -= bv->bv_len;
}
}
/*
* Portions Copyright (c) 1996-2001, PostgreSQL Global Development Group (Any
* use permitted, subject to terms of PostgreSQL license; see.)
* If we have a 64-bit integer type, then a 64-bit CRC looks just like the
* usual sort of implementation. (See Ross Williams' excellent introduction
* A PAINLESS GUIDE TO CRC ERROR DETECTION ALGORITHMS, available from
* ftp://ftp.rocksoft.com/papers/crc_v3.txt or several other net sites.)
* If we have no working 64-bit type, then fake it with two 32-bit registers.
*
* The present implementation is a normal (not "reflected", in Williams'
* terms) 64-bit CRC, using initial all-ones register contents and a final
* bit inversion. The chosen polynomial is borrowed from the DLT1 spec
* (ECMA-182, available from http://www.ecma.ch/ecma1/STAND/ECMA-182.HTM):
*
* x^64 + x^62 + x^57 + x^55 + x^54 + x^53 + x^52 + x^47 + x^46 + x^45 +
* x^40 + x^39 + x^38 + x^37 + x^35 + x^33 + x^32 + x^31 + x^29 + x^27 +
* x^24 + x^23 + x^22 + x^21 + x^19 + x^17 + x^13 + x^12 + x^10 + x^9 +
* x^7 + x^4 + x + 1
*/
static const uint64_t crc_table[256] = {
0x0000000000000000ULL, 0x42F0E1EBA9EA3693ULL, 0x85E1C3D753D46D26ULL,
0xC711223CFA3E5BB5ULL, 0x493366450E42ECDFULL, 0x0BC387AEA7A8DA4CULL,
0xCCD2A5925D9681F9ULL, 0x8E224479F47CB76AULL, 0x9266CC8A1C85D9BEULL,
0xD0962D61B56FEF2DULL, 0x17870F5D4F51B498ULL, 0x5577EEB6E6BB820BULL,
0xDB55AACF12C73561ULL, 0x99A54B24BB2D03F2ULL, 0x5EB4691841135847ULL,
0x1C4488F3E8F96ED4ULL, 0x663D78FF90E185EFULL, 0x24CD9914390BB37CULL,
0xE3DCBB28C335E8C9ULL, 0xA12C5AC36ADFDE5AULL, 0x2F0E1EBA9EA36930ULL,
0x6DFEFF5137495FA3ULL, 0xAAEFDD6DCD770416ULL, 0xE81F3C86649D3285ULL,
0xF45BB4758C645C51ULL, 0xB6AB559E258E6AC2ULL, 0x71BA77A2DFB03177ULL,
0x334A9649765A07E4ULL, 0xBD68D2308226B08EULL, 0xFF9833DB2BCC861DULL,
0x388911E7D1F2DDA8ULL, 0x7A79F00C7818EB3BULL, 0xCC7AF1FF21C30BDEULL,
0x8E8A101488293D4DULL, 0x499B3228721766F8ULL, 0x0B6BD3C3DBFD506BULL,
0x854997BA2F81E701ULL, 0xC7B97651866BD192ULL, 0x00A8546D7C558A27ULL,
0x4258B586D5BFBCB4ULL, 0x5E1C3D753D46D260ULL, 0x1CECDC9E94ACE4F3ULL,
0xDBFDFEA26E92BF46ULL, 0x990D1F49C77889D5ULL, 0x172F5B3033043EBFULL,
0x55DFBADB9AEE082CULL, 0x92CE98E760D05399ULL, 0xD03E790CC93A650AULL,
0xAA478900B1228E31ULL, 0xE8B768EB18C8B8A2ULL, 0x2FA64AD7E2F6E317ULL,
0x6D56AB3C4B1CD584ULL, 0xE374EF45BF6062EEULL, 0xA1840EAE168A547DULL,
0x66952C92ECB40FC8ULL, 0x2465CD79455E395BULL, 0x3821458AADA7578FULL,
0x7AD1A461044D611CULL, 0xBDC0865DFE733AA9ULL, 0xFF3067B657990C3AULL,
0x711223CFA3E5BB50ULL, 0x33E2C2240A0F8DC3ULL, 0xF4F3E018F031D676ULL,
0xB60301F359DBE0E5ULL, 0xDA050215EA6C212FULL, 0x98F5E3FE438617BCULL,
0x5FE4C1C2B9B84C09ULL, 0x1D14202910527A9AULL, 0x93366450E42ECDF0ULL,
0xD1C685BB4DC4FB63ULL, 0x16D7A787B7FAA0D6ULL, 0x5427466C1E109645ULL,
0x4863CE9FF6E9F891ULL, 0x0A932F745F03CE02ULL, 0xCD820D48A53D95B7ULL,
0x8F72ECA30CD7A324ULL, 0x0150A8DAF8AB144EULL, 0x43A04931514122DDULL,
0x84B16B0DAB7F7968ULL, 0xC6418AE602954FFBULL, 0xBC387AEA7A8DA4C0ULL,
0xFEC89B01D3679253ULL, 0x39D9B93D2959C9E6ULL, 0x7B2958D680B3FF75ULL,
0xF50B1CAF74CF481FULL, 0xB7FBFD44DD257E8CULL, 0x70EADF78271B2539ULL,
0x321A3E938EF113AAULL, 0x2E5EB66066087D7EULL, 0x6CAE578BCFE24BEDULL,
0xABBF75B735DC1058ULL, 0xE94F945C9C3626CBULL, 0x676DD025684A91A1ULL,
0x259D31CEC1A0A732ULL, 0xE28C13F23B9EFC87ULL, 0xA07CF2199274CA14ULL,
0x167FF3EACBAF2AF1ULL, 0x548F120162451C62ULL, 0x939E303D987B47D7ULL,
0xD16ED1D631917144ULL, 0x5F4C95AFC5EDC62EULL, 0x1DBC74446C07F0BDULL,
0xDAAD56789639AB08ULL, 0x985DB7933FD39D9BULL, 0x84193F60D72AF34FULL,
0xC6E9DE8B7EC0C5DCULL, 0x01F8FCB784FE9E69ULL, 0x43081D5C2D14A8FAULL,
0xCD2A5925D9681F90ULL, 0x8FDAB8CE70822903ULL, 0x48CB9AF28ABC72B6ULL,
0x0A3B7B1923564425ULL, 0x70428B155B4EAF1EULL, 0x32B26AFEF2A4998DULL,
0xF5A348C2089AC238ULL, 0xB753A929A170F4ABULL, 0x3971ED50550C43C1ULL,
0x7B810CBBFCE67552ULL, 0xBC902E8706D82EE7ULL, 0xFE60CF6CAF321874ULL,
0xE224479F47CB76A0ULL, 0xA0D4A674EE214033ULL, 0x67C58448141F1B86ULL,
0x253565A3BDF52D15ULL, 0xAB1721DA49899A7FULL, 0xE9E7C031E063ACECULL,
0x2EF6E20D1A5DF759ULL, 0x6C0603E6B3B7C1CAULL, 0xF6FAE5C07D3274CDULL,
0xB40A042BD4D8425EULL, 0x731B26172EE619EBULL, 0x31EBC7FC870C2F78ULL,
0xBFC9838573709812ULL, 0xFD39626EDA9AAE81ULL, 0x3A28405220A4F534ULL,
0x78D8A1B9894EC3A7ULL, 0x649C294A61B7AD73ULL, 0x266CC8A1C85D9BE0ULL,
0xE17DEA9D3263C055ULL, 0xA38D0B769B89F6C6ULL, 0x2DAF4F0F6FF541ACULL,
0x6F5FAEE4C61F773FULL, 0xA84E8CD83C212C8AULL, 0xEABE6D3395CB1A19ULL,
0x90C79D3FEDD3F122ULL, 0xD2377CD44439C7B1ULL, 0x15265EE8BE079C04ULL,
0x57D6BF0317EDAA97ULL, 0xD9F4FB7AE3911DFDULL, 0x9B041A914A7B2B6EULL,
0x5C1538ADB04570DBULL, 0x1EE5D94619AF4648ULL, 0x02A151B5F156289CULL,
0x4051B05E58BC1E0FULL, 0x87409262A28245BAULL, 0xC5B073890B687329ULL,
0x4B9237F0FF14C443ULL, 0x0962D61B56FEF2D0ULL, 0xCE73F427ACC0A965ULL,
0x8C8315CC052A9FF6ULL, 0x3A80143F5CF17F13ULL, 0x7870F5D4F51B4980ULL,
0xBF61D7E80F251235ULL, 0xFD913603A6CF24A6ULL, 0x73B3727A52B393CCULL,
0x31439391FB59A55FULL, 0xF652B1AD0167FEEAULL, 0xB4A25046A88DC879ULL,
0xA8E6D8B54074A6ADULL, 0xEA16395EE99E903EULL, 0x2D071B6213A0CB8BULL,
0x6FF7FA89BA4AFD18ULL, 0xE1D5BEF04E364A72ULL, 0xA3255F1BE7DC7CE1ULL,
0x64347D271DE22754ULL, 0x26C49CCCB40811C7ULL, 0x5CBD6CC0CC10FAFCULL,
0x1E4D8D2B65FACC6FULL, 0xD95CAF179FC497DAULL, 0x9BAC4EFC362EA149ULL,
0x158E0A85C2521623ULL, 0x577EEB6E6BB820B0ULL, 0x906FC95291867B05ULL,
0xD29F28B9386C4D96ULL, 0xCEDBA04AD0952342ULL, 0x8C2B41A1797F15D1ULL,
0x4B3A639D83414E64ULL, 0x09CA82762AAB78F7ULL, 0x87E8C60FDED7CF9DULL,
0xC51827E4773DF90EULL, 0x020905D88D03A2BBULL, 0x40F9E43324E99428ULL,
0x2CFFE7D5975E55E2ULL, 0x6E0F063E3EB46371ULL, 0xA91E2402C48A38C4ULL,
0xEBEEC5E96D600E57ULL, 0x65CC8190991CB93DULL, 0x273C607B30F68FAEULL,
0xE02D4247CAC8D41BULL, 0xA2DDA3AC6322E288ULL, 0xBE992B5F8BDB8C5CULL,
0xFC69CAB42231BACFULL, 0x3B78E888D80FE17AULL, 0x7988096371E5D7E9ULL,
0xF7AA4D1A85996083ULL, 0xB55AACF12C735610ULL, 0x724B8ECDD64D0DA5ULL,
0x30BB6F267FA73B36ULL, 0x4AC29F2A07BFD00DULL, 0x08327EC1AE55E69EULL,
0xCF235CFD546BBD2BULL, 0x8DD3BD16FD818BB8ULL, 0x03F1F96F09FD3CD2ULL,
0x41011884A0170A41ULL, 0x86103AB85A2951F4ULL, 0xC4E0DB53F3C36767ULL,
0xD8A453A01B3A09B3ULL, 0x9A54B24BB2D03F20ULL, 0x5D45907748EE6495ULL,
0x1FB5719CE1045206ULL, 0x919735E51578E56CULL, 0xD367D40EBC92D3FFULL,
0x1476F63246AC884AULL, 0x568617D9EF46BED9ULL, 0xE085162AB69D5E3CULL,
0xA275F7C11F7768AFULL, 0x6564D5FDE549331AULL, 0x279434164CA30589ULL,
0xA9B6706FB8DFB2E3ULL, 0xEB46918411358470ULL, 0x2C57B3B8EB0BDFC5ULL,
0x6EA7525342E1E956ULL, 0x72E3DAA0AA188782ULL, 0x30133B4B03F2B111ULL,
0xF7021977F9CCEAA4ULL, 0xB5F2F89C5026DC37ULL, 0x3BD0BCE5A45A6B5DULL,
0x79205D0E0DB05DCEULL, 0xBE317F32F78E067BULL, 0xFCC19ED95E6430E8ULL,
0x86B86ED5267CDBD3ULL, 0xC4488F3E8F96ED40ULL, 0x0359AD0275A8B6F5ULL,
0x41A94CE9DC428066ULL, 0xCF8B0890283E370CULL, 0x8D7BE97B81D4019FULL,
0x4A6ACB477BEA5A2AULL, 0x089A2AACD2006CB9ULL, 0x14DEA25F3AF9026DULL,
0x562E43B4931334FEULL, 0x913F6188692D6F4BULL, 0xD3CF8063C0C759D8ULL,
0x5DEDC41A34BBEEB2ULL, 0x1F1D25F19D51D821ULL, 0xD80C07CD676F8394ULL,
0x9AFCE626CE85B507ULL,
};
uint64_t bch_crc64_update(uint64_t crc, const void *_data, size_t len)
{
const unsigned char *data = _data;
while (len--) {
int i = ((int) (crc >> 56) ^ *data++) & 0xFF;
crc = crc_table[i] ^ (crc << 8);
}
return crc;
}
uint64_t bch_crc64(const void *data, size_t len)
{
uint64_t crc = 0xffffffffffffffffULL;
crc = bch_crc64_update(crc, data, len);
return crc ^ 0xffffffffffffffffULL;
}

588
drivers/md/bcache/util.h Normal file
View file

@ -0,0 +1,588 @@
#ifndef _BCACHE_UTIL_H
#define _BCACHE_UTIL_H
#include <linux/blkdev.h>
#include <linux/errno.h>
#include <linux/kernel.h>
#include <linux/llist.h>
#include <linux/ratelimit.h>
#include <linux/vmalloc.h>
#include <linux/workqueue.h>
#include "closure.h"
#define PAGE_SECTORS (PAGE_SIZE / 512)
struct closure;
#ifdef CONFIG_BCACHE_DEBUG
#define EBUG_ON(cond) BUG_ON(cond)
#define atomic_dec_bug(v) BUG_ON(atomic_dec_return(v) < 0)
#define atomic_inc_bug(v, i) BUG_ON(atomic_inc_return(v) <= i)
#else /* DEBUG */
#define EBUG_ON(cond) do { if (cond); } while (0)
#define atomic_dec_bug(v) atomic_dec(v)
#define atomic_inc_bug(v, i) atomic_inc(v)
#endif
#define DECLARE_HEAP(type, name) \
struct { \
size_t size, used; \
type *data; \
} name
#define init_heap(heap, _size, gfp) \
({ \
size_t _bytes; \
(heap)->used = 0; \
(heap)->size = (_size); \
_bytes = (heap)->size * sizeof(*(heap)->data); \
(heap)->data = NULL; \
if (_bytes < KMALLOC_MAX_SIZE) \
(heap)->data = kmalloc(_bytes, (gfp)); \
if ((!(heap)->data) && ((gfp) & GFP_KERNEL)) \
(heap)->data = vmalloc(_bytes); \
(heap)->data; \
})
#define free_heap(heap) \
do { \
if (is_vmalloc_addr((heap)->data)) \
vfree((heap)->data); \
else \
kfree((heap)->data); \
(heap)->data = NULL; \
} while (0)
#define heap_swap(h, i, j) swap((h)->data[i], (h)->data[j])
#define heap_sift(h, i, cmp) \
do { \
size_t _r, _j = i; \
\
for (; _j * 2 + 1 < (h)->used; _j = _r) { \
_r = _j * 2 + 1; \
if (_r + 1 < (h)->used && \
cmp((h)->data[_r], (h)->data[_r + 1])) \
_r++; \
\
if (cmp((h)->data[_r], (h)->data[_j])) \
break; \
heap_swap(h, _r, _j); \
} \
} while (0)
#define heap_sift_down(h, i, cmp) \
do { \
while (i) { \
size_t p = (i - 1) / 2; \
if (cmp((h)->data[i], (h)->data[p])) \
break; \
heap_swap(h, i, p); \
i = p; \
} \
} while (0)
#define heap_add(h, d, cmp) \
({ \
bool _r = !heap_full(h); \
if (_r) { \
size_t _i = (h)->used++; \
(h)->data[_i] = d; \
\
heap_sift_down(h, _i, cmp); \
heap_sift(h, _i, cmp); \
} \
_r; \
})
#define heap_pop(h, d, cmp) \
({ \
bool _r = (h)->used; \
if (_r) { \
(d) = (h)->data[0]; \
(h)->used--; \
heap_swap(h, 0, (h)->used); \
heap_sift(h, 0, cmp); \
} \
_r; \
})
#define heap_peek(h) ((h)->used ? (h)->data[0] : NULL)
#define heap_full(h) ((h)->used == (h)->size)
#define DECLARE_FIFO(type, name) \
struct { \
size_t front, back, size, mask; \
type *data; \
} name
#define fifo_for_each(c, fifo, iter) \
for (iter = (fifo)->front; \
c = (fifo)->data[iter], iter != (fifo)->back; \
iter = (iter + 1) & (fifo)->mask)
#define __init_fifo(fifo, gfp) \
({ \
size_t _allocated_size, _bytes; \
BUG_ON(!(fifo)->size); \
\
_allocated_size = roundup_pow_of_two((fifo)->size + 1); \
_bytes = _allocated_size * sizeof(*(fifo)->data); \
\
(fifo)->mask = _allocated_size - 1; \
(fifo)->front = (fifo)->back = 0; \
(fifo)->data = NULL; \
\
if (_bytes < KMALLOC_MAX_SIZE) \
(fifo)->data = kmalloc(_bytes, (gfp)); \
if ((!(fifo)->data) && ((gfp) & GFP_KERNEL)) \
(fifo)->data = vmalloc(_bytes); \
(fifo)->data; \
})
#define init_fifo_exact(fifo, _size, gfp) \
({ \
(fifo)->size = (_size); \
__init_fifo(fifo, gfp); \
})
#define init_fifo(fifo, _size, gfp) \
({ \
(fifo)->size = (_size); \
if ((fifo)->size > 4) \
(fifo)->size = roundup_pow_of_two((fifo)->size) - 1; \
__init_fifo(fifo, gfp); \
})
#define free_fifo(fifo) \
do { \
if (is_vmalloc_addr((fifo)->data)) \
vfree((fifo)->data); \
else \
kfree((fifo)->data); \
(fifo)->data = NULL; \
} while (0)
#define fifo_used(fifo) (((fifo)->back - (fifo)->front) & (fifo)->mask)
#define fifo_free(fifo) ((fifo)->size - fifo_used(fifo))
#define fifo_empty(fifo) (!fifo_used(fifo))
#define fifo_full(fifo) (!fifo_free(fifo))
#define fifo_front(fifo) ((fifo)->data[(fifo)->front])
#define fifo_back(fifo) \
((fifo)->data[((fifo)->back - 1) & (fifo)->mask])
#define fifo_idx(fifo, p) (((p) - &fifo_front(fifo)) & (fifo)->mask)
#define fifo_push_back(fifo, i) \
({ \
bool _r = !fifo_full((fifo)); \
if (_r) { \
(fifo)->data[(fifo)->back++] = (i); \
(fifo)->back &= (fifo)->mask; \
} \
_r; \
})
#define fifo_pop_front(fifo, i) \
({ \
bool _r = !fifo_empty((fifo)); \
if (_r) { \
(i) = (fifo)->data[(fifo)->front++]; \
(fifo)->front &= (fifo)->mask; \
} \
_r; \
})
#define fifo_push_front(fifo, i) \
({ \
bool _r = !fifo_full((fifo)); \
if (_r) { \
--(fifo)->front; \
(fifo)->front &= (fifo)->mask; \
(fifo)->data[(fifo)->front] = (i); \
} \
_r; \
})
#define fifo_pop_back(fifo, i) \
({ \
bool _r = !fifo_empty((fifo)); \
if (_r) { \
--(fifo)->back; \
(fifo)->back &= (fifo)->mask; \
(i) = (fifo)->data[(fifo)->back] \
} \
_r; \
})
#define fifo_push(fifo, i) fifo_push_back(fifo, (i))
#define fifo_pop(fifo, i) fifo_pop_front(fifo, (i))
#define fifo_swap(l, r) \
do { \
swap((l)->front, (r)->front); \
swap((l)->back, (r)->back); \
swap((l)->size, (r)->size); \
swap((l)->mask, (r)->mask); \
swap((l)->data, (r)->data); \
} while (0)
#define fifo_move(dest, src) \
do { \
typeof(*((dest)->data)) _t; \
while (!fifo_full(dest) && \
fifo_pop(src, _t)) \
fifo_push(dest, _t); \
} while (0)
/*
* Simple array based allocator - preallocates a number of elements and you can
* never allocate more than that, also has no locking.
*
* Handy because if you know you only need a fixed number of elements you don't
* have to worry about memory allocation failure, and sometimes a mempool isn't
* what you want.
*
* We treat the free elements as entries in a singly linked list, and the
* freelist as a stack - allocating and freeing push and pop off the freelist.
*/
#define DECLARE_ARRAY_ALLOCATOR(type, name, size) \
struct { \
type *freelist; \
type data[size]; \
} name
#define array_alloc(array) \
({ \
typeof((array)->freelist) _ret = (array)->freelist; \
\
if (_ret) \
(array)->freelist = *((typeof((array)->freelist) *) _ret);\
\
_ret; \
})
#define array_free(array, ptr) \
do { \
typeof((array)->freelist) _ptr = ptr; \
\
*((typeof((array)->freelist) *) _ptr) = (array)->freelist; \
(array)->freelist = _ptr; \
} while (0)
#define array_allocator_init(array) \
do { \
typeof((array)->freelist) _i; \
\
BUILD_BUG_ON(sizeof((array)->data[0]) < sizeof(void *)); \
(array)->freelist = NULL; \
\
for (_i = (array)->data; \
_i < (array)->data + ARRAY_SIZE((array)->data); \
_i++) \
array_free(array, _i); \
} while (0)
#define array_freelist_empty(array) ((array)->freelist == NULL)
#define ANYSINT_MAX(t) \
((((t) 1 << (sizeof(t) * 8 - 2)) - (t) 1) * (t) 2 + (t) 1)
int bch_strtoint_h(const char *, int *);
int bch_strtouint_h(const char *, unsigned int *);
int bch_strtoll_h(const char *, long long *);
int bch_strtoull_h(const char *, unsigned long long *);
static inline int bch_strtol_h(const char *cp, long *res)
{
#if BITS_PER_LONG == 32
return bch_strtoint_h(cp, (int *) res);
#else
return bch_strtoll_h(cp, (long long *) res);
#endif
}
static inline int bch_strtoul_h(const char *cp, long *res)
{
#if BITS_PER_LONG == 32
return bch_strtouint_h(cp, (unsigned int *) res);
#else
return bch_strtoull_h(cp, (unsigned long long *) res);
#endif
}
#define strtoi_h(cp, res) \
(__builtin_types_compatible_p(typeof(*res), int) \
? bch_strtoint_h(cp, (void *) res) \
: __builtin_types_compatible_p(typeof(*res), long) \
? bch_strtol_h(cp, (void *) res) \
: __builtin_types_compatible_p(typeof(*res), long long) \
? bch_strtoll_h(cp, (void *) res) \
: __builtin_types_compatible_p(typeof(*res), unsigned int) \
? bch_strtouint_h(cp, (void *) res) \
: __builtin_types_compatible_p(typeof(*res), unsigned long) \
? bch_strtoul_h(cp, (void *) res) \
: __builtin_types_compatible_p(typeof(*res), unsigned long long)\
? bch_strtoull_h(cp, (void *) res) : -EINVAL)
#define strtoul_safe(cp, var) \
({ \
unsigned long _v; \
int _r = kstrtoul(cp, 10, &_v); \
if (!_r) \
var = _v; \
_r; \
})
#define strtoul_safe_clamp(cp, var, min, max) \
({ \
unsigned long _v; \
int _r = kstrtoul(cp, 10, &_v); \
if (!_r) \
var = clamp_t(typeof(var), _v, min, max); \
_r; \
})
#define snprint(buf, size, var) \
snprintf(buf, size, \
__builtin_types_compatible_p(typeof(var), int) \
? "%i\n" : \
__builtin_types_compatible_p(typeof(var), unsigned) \
? "%u\n" : \
__builtin_types_compatible_p(typeof(var), long) \
? "%li\n" : \
__builtin_types_compatible_p(typeof(var), unsigned long)\
? "%lu\n" : \
__builtin_types_compatible_p(typeof(var), int64_t) \
? "%lli\n" : \
__builtin_types_compatible_p(typeof(var), uint64_t) \
? "%llu\n" : \
__builtin_types_compatible_p(typeof(var), const char *) \
? "%s\n" : "%i\n", var)
ssize_t bch_hprint(char *buf, int64_t v);
bool bch_is_zero(const char *p, size_t n);
int bch_parse_uuid(const char *s, char *uuid);
ssize_t bch_snprint_string_list(char *buf, size_t size, const char * const list[],
size_t selected);
ssize_t bch_read_string_list(const char *buf, const char * const list[]);
struct time_stats {
spinlock_t lock;
/*
* all fields are in nanoseconds, averages are ewmas stored left shifted
* by 8
*/
uint64_t max_duration;
uint64_t average_duration;
uint64_t average_frequency;
uint64_t last;
};
void bch_time_stats_update(struct time_stats *stats, uint64_t time);
static inline unsigned local_clock_us(void)
{
return local_clock() >> 10;
}
#define NSEC_PER_ns 1L
#define NSEC_PER_us NSEC_PER_USEC
#define NSEC_PER_ms NSEC_PER_MSEC
#define NSEC_PER_sec NSEC_PER_SEC
#define __print_time_stat(stats, name, stat, units) \
sysfs_print(name ## _ ## stat ## _ ## units, \
div_u64((stats)->stat >> 8, NSEC_PER_ ## units))
#define sysfs_print_time_stats(stats, name, \
frequency_units, \
duration_units) \
do { \
__print_time_stat(stats, name, \
average_frequency, frequency_units); \
__print_time_stat(stats, name, \
average_duration, duration_units); \
sysfs_print(name ## _ ##max_duration ## _ ## duration_units, \
div_u64((stats)->max_duration, NSEC_PER_ ## duration_units));\
\
sysfs_print(name ## _last_ ## frequency_units, (stats)->last \
? div_s64(local_clock() - (stats)->last, \
NSEC_PER_ ## frequency_units) \
: -1LL); \
} while (0)
#define sysfs_time_stats_attribute(name, \
frequency_units, \
duration_units) \
read_attribute(name ## _average_frequency_ ## frequency_units); \
read_attribute(name ## _average_duration_ ## duration_units); \
read_attribute(name ## _max_duration_ ## duration_units); \
read_attribute(name ## _last_ ## frequency_units)
#define sysfs_time_stats_attribute_list(name, \
frequency_units, \
duration_units) \
&sysfs_ ## name ## _average_frequency_ ## frequency_units, \
&sysfs_ ## name ## _average_duration_ ## duration_units, \
&sysfs_ ## name ## _max_duration_ ## duration_units, \
&sysfs_ ## name ## _last_ ## frequency_units,
#define ewma_add(ewma, val, weight, factor) \
({ \
(ewma) *= (weight) - 1; \
(ewma) += (val) << factor; \
(ewma) /= (weight); \
(ewma) >> factor; \
})
struct bch_ratelimit {
/* Next time we want to do some work, in nanoseconds */
uint64_t next;
/*
* Rate at which we want to do work, in units per nanosecond
* The units here correspond to the units passed to bch_next_delay()
*/
unsigned rate;
};
static inline void bch_ratelimit_reset(struct bch_ratelimit *d)
{
d->next = local_clock();
}
uint64_t bch_next_delay(struct bch_ratelimit *d, uint64_t done);
#define __DIV_SAFE(n, d, zero) \
({ \
typeof(n) _n = (n); \
typeof(d) _d = (d); \
_d ? _n / _d : zero; \
})
#define DIV_SAFE(n, d) __DIV_SAFE(n, d, 0)
#define container_of_or_null(ptr, type, member) \
({ \
typeof(ptr) _ptr = ptr; \
_ptr ? container_of(_ptr, type, member) : NULL; \
})
#define RB_INSERT(root, new, member, cmp) \
({ \
__label__ dup; \
struct rb_node **n = &(root)->rb_node, *parent = NULL; \
typeof(new) this; \
int res, ret = -1; \
\
while (*n) { \
parent = *n; \
this = container_of(*n, typeof(*(new)), member); \
res = cmp(new, this); \
if (!res) \
goto dup; \
n = res < 0 \
? &(*n)->rb_left \
: &(*n)->rb_right; \
} \
\
rb_link_node(&(new)->member, parent, n); \
rb_insert_color(&(new)->member, root); \
ret = 0; \
dup: \
ret; \
})
#define RB_SEARCH(root, search, member, cmp) \
({ \
struct rb_node *n = (root)->rb_node; \
typeof(&(search)) this, ret = NULL; \
int res; \
\
while (n) { \
this = container_of(n, typeof(search), member); \
res = cmp(&(search), this); \
if (!res) { \
ret = this; \
break; \
} \
n = res < 0 \
? n->rb_left \
: n->rb_right; \
} \
ret; \
})
#define RB_GREATER(root, search, member, cmp) \
({ \
struct rb_node *n = (root)->rb_node; \
typeof(&(search)) this, ret = NULL; \
int res; \
\
while (n) { \
this = container_of(n, typeof(search), member); \
res = cmp(&(search), this); \
if (res < 0) { \
ret = this; \
n = n->rb_left; \
} else \
n = n->rb_right; \
} \
ret; \
})
#define RB_FIRST(root, type, member) \
container_of_or_null(rb_first(root), type, member)
#define RB_LAST(root, type, member) \
container_of_or_null(rb_last(root), type, member)
#define RB_NEXT(ptr, member) \
container_of_or_null(rb_next(&(ptr)->member), typeof(*ptr), member)
#define RB_PREV(ptr, member) \
container_of_or_null(rb_prev(&(ptr)->member), typeof(*ptr), member)
/* Does linear interpolation between powers of two */
static inline unsigned fract_exp_two(unsigned x, unsigned fract_bits)
{
unsigned fract = x & ~(~0 << fract_bits);
x >>= fract_bits;
x = 1 << x;
x += (x * fract) >> fract_bits;
return x;
}
void bch_bio_map(struct bio *bio, void *base);
static inline sector_t bdev_sectors(struct block_device *bdev)
{
return bdev->bd_inode->i_size >> 9;
}
#define closure_bio_submit(bio, cl, dev) \
do { \
closure_get(cl); \
bch_generic_make_request(bio, &(dev)->bio_split_hook); \
} while (0)
uint64_t bch_crc64_update(uint64_t, const void *, size_t);
uint64_t bch_crc64(const void *, size_t);
#endif /* _BCACHE_UTIL_H */

View file

@ -0,0 +1,513 @@
/*
* background writeback - scan btree for dirty data and write it to the backing
* device
*
* Copyright 2010, 2011 Kent Overstreet <kent.overstreet@gmail.com>
* Copyright 2012 Google, Inc.
*/
#include "bcache.h"
#include "btree.h"
#include "debug.h"
#include "writeback.h"
#include <linux/delay.h>
#include <linux/freezer.h>
#include <linux/kthread.h>
#include <trace/events/bcache.h>
/* Rate limiting */
static void __update_writeback_rate(struct cached_dev *dc)
{
struct cache_set *c = dc->disk.c;
uint64_t cache_sectors = c->nbuckets * c->sb.bucket_size;
uint64_t cache_dirty_target =
div_u64(cache_sectors * dc->writeback_percent, 100);
int64_t target = div64_u64(cache_dirty_target * bdev_sectors(dc->bdev),
c->cached_dev_sectors);
/* PD controller */
int64_t dirty = bcache_dev_sectors_dirty(&dc->disk);
int64_t derivative = dirty - dc->disk.sectors_dirty_last;
int64_t proportional = dirty - target;
int64_t change;
dc->disk.sectors_dirty_last = dirty;
/* Scale to sectors per second */
proportional *= dc->writeback_rate_update_seconds;
proportional = div_s64(proportional, dc->writeback_rate_p_term_inverse);
derivative = div_s64(derivative, dc->writeback_rate_update_seconds);
derivative = ewma_add(dc->disk.sectors_dirty_derivative, derivative,
(dc->writeback_rate_d_term /
dc->writeback_rate_update_seconds) ?: 1, 0);
derivative *= dc->writeback_rate_d_term;
derivative = div_s64(derivative, dc->writeback_rate_p_term_inverse);
change = proportional + derivative;
/* Don't increase writeback rate if the device isn't keeping up */
if (change > 0 &&
time_after64(local_clock(),
dc->writeback_rate.next + NSEC_PER_MSEC))
change = 0;
dc->writeback_rate.rate =
clamp_t(int64_t, (int64_t) dc->writeback_rate.rate + change,
1, NSEC_PER_MSEC);
dc->writeback_rate_proportional = proportional;
dc->writeback_rate_derivative = derivative;
dc->writeback_rate_change = change;
dc->writeback_rate_target = target;
}
static void update_writeback_rate(struct work_struct *work)
{
struct cached_dev *dc = container_of(to_delayed_work(work),
struct cached_dev,
writeback_rate_update);
down_read(&dc->writeback_lock);
if (atomic_read(&dc->has_dirty) &&
dc->writeback_percent)
__update_writeback_rate(dc);
up_read(&dc->writeback_lock);
schedule_delayed_work(&dc->writeback_rate_update,
dc->writeback_rate_update_seconds * HZ);
}
static unsigned writeback_delay(struct cached_dev *dc, unsigned sectors)
{
if (test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags) ||
!dc->writeback_percent)
return 0;
return bch_next_delay(&dc->writeback_rate, sectors);
}
struct dirty_io {
struct closure cl;
struct cached_dev *dc;
struct bio bio;
};
static void dirty_init(struct keybuf_key *w)
{
struct dirty_io *io = w->private;
struct bio *bio = &io->bio;
bio_init(bio);
if (!io->dc->writeback_percent)
bio_set_prio(bio, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0));
bio->bi_iter.bi_size = KEY_SIZE(&w->key) << 9;
bio->bi_max_vecs = DIV_ROUND_UP(KEY_SIZE(&w->key), PAGE_SECTORS);
bio->bi_private = w;
bio->bi_io_vec = bio->bi_inline_vecs;
bch_bio_map(bio, NULL);
}
static void dirty_io_destructor(struct closure *cl)
{
struct dirty_io *io = container_of(cl, struct dirty_io, cl);
kfree(io);
}
static void write_dirty_finish(struct closure *cl)
{
struct dirty_io *io = container_of(cl, struct dirty_io, cl);
struct keybuf_key *w = io->bio.bi_private;
struct cached_dev *dc = io->dc;
struct bio_vec *bv;
int i;
bio_for_each_segment_all(bv, &io->bio, i)
__free_page(bv->bv_page);
/* This is kind of a dumb way of signalling errors. */
if (KEY_DIRTY(&w->key)) {
int ret;
unsigned i;
struct keylist keys;
bch_keylist_init(&keys);
bkey_copy(keys.top, &w->key);
SET_KEY_DIRTY(keys.top, false);
bch_keylist_push(&keys);
for (i = 0; i < KEY_PTRS(&w->key); i++)
atomic_inc(&PTR_BUCKET(dc->disk.c, &w->key, i)->pin);
ret = bch_btree_insert(dc->disk.c, &keys, NULL, &w->key);
if (ret)
trace_bcache_writeback_collision(&w->key);
atomic_long_inc(ret
? &dc->disk.c->writeback_keys_failed
: &dc->disk.c->writeback_keys_done);
}
bch_keybuf_del(&dc->writeback_keys, w);
up(&dc->in_flight);
closure_return_with_destructor(cl, dirty_io_destructor);
}
static void dirty_endio(struct bio *bio, int error)
{
struct keybuf_key *w = bio->bi_private;
struct dirty_io *io = w->private;
if (error)
SET_KEY_DIRTY(&w->key, false);
closure_put(&io->cl);
}
static void write_dirty(struct closure *cl)
{
struct dirty_io *io = container_of(cl, struct dirty_io, cl);
struct keybuf_key *w = io->bio.bi_private;
dirty_init(w);
io->bio.bi_rw = WRITE;
io->bio.bi_iter.bi_sector = KEY_START(&w->key);
io->bio.bi_bdev = io->dc->bdev;
io->bio.bi_end_io = dirty_endio;
closure_bio_submit(&io->bio, cl, &io->dc->disk);
continue_at(cl, write_dirty_finish, system_wq);
}
static void read_dirty_endio(struct bio *bio, int error)
{
struct keybuf_key *w = bio->bi_private;
struct dirty_io *io = w->private;
bch_count_io_errors(PTR_CACHE(io->dc->disk.c, &w->key, 0),
error, "reading dirty data from cache");
dirty_endio(bio, error);
}
static void read_dirty_submit(struct closure *cl)
{
struct dirty_io *io = container_of(cl, struct dirty_io, cl);
closure_bio_submit(&io->bio, cl, &io->dc->disk);
continue_at(cl, write_dirty, system_wq);
}
static void read_dirty(struct cached_dev *dc)
{
unsigned delay = 0;
struct keybuf_key *w;
struct dirty_io *io;
struct closure cl;
closure_init_stack(&cl);
/*
* XXX: if we error, background writeback just spins. Should use some
* mempools.
*/
while (!kthread_should_stop()) {
try_to_freeze();
w = bch_keybuf_next(&dc->writeback_keys);
if (!w)
break;
BUG_ON(ptr_stale(dc->disk.c, &w->key, 0));
if (KEY_START(&w->key) != dc->last_read ||
jiffies_to_msecs(delay) > 50)
while (!kthread_should_stop() && delay)
delay = schedule_timeout_interruptible(delay);
dc->last_read = KEY_OFFSET(&w->key);
io = kzalloc(sizeof(struct dirty_io) + sizeof(struct bio_vec)
* DIV_ROUND_UP(KEY_SIZE(&w->key), PAGE_SECTORS),
GFP_KERNEL);
if (!io)
goto err;
w->private = io;
io->dc = dc;
dirty_init(w);
io->bio.bi_iter.bi_sector = PTR_OFFSET(&w->key, 0);
io->bio.bi_bdev = PTR_CACHE(dc->disk.c,
&w->key, 0)->bdev;
io->bio.bi_rw = READ;
io->bio.bi_end_io = read_dirty_endio;
if (bio_alloc_pages(&io->bio, GFP_KERNEL))
goto err_free;
trace_bcache_writeback(&w->key);
down(&dc->in_flight);
closure_call(&io->cl, read_dirty_submit, NULL, &cl);
delay = writeback_delay(dc, KEY_SIZE(&w->key));
}
if (0) {
err_free:
kfree(w->private);
err:
bch_keybuf_del(&dc->writeback_keys, w);
}
/*
* Wait for outstanding writeback IOs to finish (and keybuf slots to be
* freed) before refilling again
*/
closure_sync(&cl);
}
/* Scan for dirty data */
void bcache_dev_sectors_dirty_add(struct cache_set *c, unsigned inode,
uint64_t offset, int nr_sectors)
{
struct bcache_device *d = c->devices[inode];
unsigned stripe_offset, stripe, sectors_dirty;
if (!d)
return;
stripe = offset_to_stripe(d, offset);
stripe_offset = offset & (d->stripe_size - 1);
while (nr_sectors) {
int s = min_t(unsigned, abs(nr_sectors),
d->stripe_size - stripe_offset);
if (nr_sectors < 0)
s = -s;
if (stripe >= d->nr_stripes)
return;
sectors_dirty = atomic_add_return(s,
d->stripe_sectors_dirty + stripe);
if (sectors_dirty == d->stripe_size)
set_bit(stripe, d->full_dirty_stripes);
else
clear_bit(stripe, d->full_dirty_stripes);
nr_sectors -= s;
stripe_offset = 0;
stripe++;
}
}
static bool dirty_pred(struct keybuf *buf, struct bkey *k)
{
return KEY_DIRTY(k);
}
static void refill_full_stripes(struct cached_dev *dc)
{
struct keybuf *buf = &dc->writeback_keys;
unsigned start_stripe, stripe, next_stripe;
bool wrapped = false;
stripe = offset_to_stripe(&dc->disk, KEY_OFFSET(&buf->last_scanned));
if (stripe >= dc->disk.nr_stripes)
stripe = 0;
start_stripe = stripe;
while (1) {
stripe = find_next_bit(dc->disk.full_dirty_stripes,
dc->disk.nr_stripes, stripe);
if (stripe == dc->disk.nr_stripes)
goto next;
next_stripe = find_next_zero_bit(dc->disk.full_dirty_stripes,
dc->disk.nr_stripes, stripe);
buf->last_scanned = KEY(dc->disk.id,
stripe * dc->disk.stripe_size, 0);
bch_refill_keybuf(dc->disk.c, buf,
&KEY(dc->disk.id,
next_stripe * dc->disk.stripe_size, 0),
dirty_pred);
if (array_freelist_empty(&buf->freelist))
return;
stripe = next_stripe;
next:
if (wrapped && stripe > start_stripe)
return;
if (stripe == dc->disk.nr_stripes) {
stripe = 0;
wrapped = true;
}
}
}
static bool refill_dirty(struct cached_dev *dc)
{
struct keybuf *buf = &dc->writeback_keys;
struct bkey end = KEY(dc->disk.id, MAX_KEY_OFFSET, 0);
bool searched_from_start = false;
if (dc->partial_stripes_expensive) {
refill_full_stripes(dc);
if (array_freelist_empty(&buf->freelist))
return false;
}
if (bkey_cmp(&buf->last_scanned, &end) >= 0) {
buf->last_scanned = KEY(dc->disk.id, 0, 0);
searched_from_start = true;
}
bch_refill_keybuf(dc->disk.c, buf, &end, dirty_pred);
return bkey_cmp(&buf->last_scanned, &end) >= 0 && searched_from_start;
}
static int bch_writeback_thread(void *arg)
{
struct cached_dev *dc = arg;
bool searched_full_index;
while (!kthread_should_stop()) {
down_write(&dc->writeback_lock);
if (!atomic_read(&dc->has_dirty) ||
(!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags) &&
!dc->writeback_running)) {
up_write(&dc->writeback_lock);
set_current_state(TASK_INTERRUPTIBLE);
if (kthread_should_stop())
return 0;
try_to_freeze();
schedule();
continue;
}
searched_full_index = refill_dirty(dc);
if (searched_full_index &&
RB_EMPTY_ROOT(&dc->writeback_keys.keys)) {
atomic_set(&dc->has_dirty, 0);
cached_dev_put(dc);
SET_BDEV_STATE(&dc->sb, BDEV_STATE_CLEAN);
bch_write_bdev_super(dc, NULL);
}
up_write(&dc->writeback_lock);
bch_ratelimit_reset(&dc->writeback_rate);
read_dirty(dc);
if (searched_full_index) {
unsigned delay = dc->writeback_delay * HZ;
while (delay &&
!kthread_should_stop() &&
!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags))
delay = schedule_timeout_interruptible(delay);
}
}
return 0;
}
/* Init */
struct sectors_dirty_init {
struct btree_op op;
unsigned inode;
};
static int sectors_dirty_init_fn(struct btree_op *_op, struct btree *b,
struct bkey *k)
{
struct sectors_dirty_init *op = container_of(_op,
struct sectors_dirty_init, op);
if (KEY_INODE(k) > op->inode)
return MAP_DONE;
if (KEY_DIRTY(k))
bcache_dev_sectors_dirty_add(b->c, KEY_INODE(k),
KEY_START(k), KEY_SIZE(k));
return MAP_CONTINUE;
}
void bch_sectors_dirty_init(struct cached_dev *dc)
{
struct sectors_dirty_init op;
bch_btree_op_init(&op.op, -1);
op.inode = dc->disk.id;
bch_btree_map_keys(&op.op, dc->disk.c, &KEY(op.inode, 0, 0),
sectors_dirty_init_fn, 0);
dc->disk.sectors_dirty_last = bcache_dev_sectors_dirty(&dc->disk);
}
void bch_cached_dev_writeback_init(struct cached_dev *dc)
{
sema_init(&dc->in_flight, 64);
init_rwsem(&dc->writeback_lock);
bch_keybuf_init(&dc->writeback_keys);
dc->writeback_metadata = true;
dc->writeback_running = true;
dc->writeback_percent = 10;
dc->writeback_delay = 30;
dc->writeback_rate.rate = 1024;
dc->writeback_rate_update_seconds = 5;
dc->writeback_rate_d_term = 30;
dc->writeback_rate_p_term_inverse = 6000;
INIT_DELAYED_WORK(&dc->writeback_rate_update, update_writeback_rate);
}
int bch_cached_dev_writeback_start(struct cached_dev *dc)
{
dc->writeback_thread = kthread_create(bch_writeback_thread, dc,
"bcache_writeback");
if (IS_ERR(dc->writeback_thread))
return PTR_ERR(dc->writeback_thread);
schedule_delayed_work(&dc->writeback_rate_update,
dc->writeback_rate_update_seconds * HZ);
bch_writeback_queue(dc);
return 0;
}

View file

@ -0,0 +1,91 @@
#ifndef _BCACHE_WRITEBACK_H
#define _BCACHE_WRITEBACK_H
#define CUTOFF_WRITEBACK 40
#define CUTOFF_WRITEBACK_SYNC 70
static inline uint64_t bcache_dev_sectors_dirty(struct bcache_device *d)
{
uint64_t i, ret = 0;
for (i = 0; i < d->nr_stripes; i++)
ret += atomic_read(d->stripe_sectors_dirty + i);
return ret;
}
static inline unsigned offset_to_stripe(struct bcache_device *d,
uint64_t offset)
{
do_div(offset, d->stripe_size);
return offset;
}
static inline bool bcache_dev_stripe_dirty(struct cached_dev *dc,
uint64_t offset,
unsigned nr_sectors)
{
unsigned stripe = offset_to_stripe(&dc->disk, offset);
while (1) {
if (atomic_read(dc->disk.stripe_sectors_dirty + stripe))
return true;
if (nr_sectors <= dc->disk.stripe_size)
return false;
nr_sectors -= dc->disk.stripe_size;
stripe++;
}
}
static inline bool should_writeback(struct cached_dev *dc, struct bio *bio,
unsigned cache_mode, bool would_skip)
{
unsigned in_use = dc->disk.c->gc_stats.in_use;
if (cache_mode != CACHE_MODE_WRITEBACK ||
test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags) ||
in_use > CUTOFF_WRITEBACK_SYNC)
return false;
if (dc->partial_stripes_expensive &&
bcache_dev_stripe_dirty(dc, bio->bi_iter.bi_sector,
bio_sectors(bio)))
return true;
if (would_skip)
return false;
return bio->bi_rw & REQ_SYNC ||
in_use <= CUTOFF_WRITEBACK;
}
static inline void bch_writeback_queue(struct cached_dev *dc)
{
wake_up_process(dc->writeback_thread);
}
static inline void bch_writeback_add(struct cached_dev *dc)
{
if (!atomic_read(&dc->has_dirty) &&
!atomic_xchg(&dc->has_dirty, 1)) {
atomic_inc(&dc->count);
if (BDEV_STATE(&dc->sb) != BDEV_STATE_DIRTY) {
SET_BDEV_STATE(&dc->sb, BDEV_STATE_DIRTY);
/* XXX: should do this synchronously */
bch_write_bdev_super(dc, NULL);
}
bch_writeback_queue(dc);
}
}
void bcache_dev_sectors_dirty_add(struct cache_set *, unsigned, uint64_t, int);
void bch_sectors_dirty_init(struct cached_dev *dc);
void bch_cached_dev_writeback_init(struct cached_dev *);
int bch_cached_dev_writeback_start(struct cached_dev *);
#endif

2261
drivers/md/bitmap.c Normal file

File diff suppressed because it is too large Load diff

265
drivers/md/bitmap.h Normal file
View file

@ -0,0 +1,265 @@
/*
* bitmap.h: Copyright (C) Peter T. Breuer (ptb@ot.uc3m.es) 2003
*
* additions: Copyright (C) 2003-2004, Paul Clements, SteelEye Technology, Inc.
*/
#ifndef BITMAP_H
#define BITMAP_H 1
#define BITMAP_MAJOR_LO 3
/* version 4 insists the bitmap is in little-endian order
* with version 3, it is host-endian which is non-portable
*/
#define BITMAP_MAJOR_HI 4
#define BITMAP_MAJOR_HOSTENDIAN 3
/*
* in-memory bitmap:
*
* Use 16 bit block counters to track pending writes to each "chunk".
* The 2 high order bits are special-purpose, the first is a flag indicating
* whether a resync is needed. The second is a flag indicating whether a
* resync is active.
* This means that the counter is actually 14 bits:
*
* +--------+--------+------------------------------------------------+
* | resync | resync | counter |
* | needed | active | |
* | (0-1) | (0-1) | (0-16383) |
* +--------+--------+------------------------------------------------+
*
* The "resync needed" bit is set when:
* a '1' bit is read from storage at startup.
* a write request fails on some drives
* a resync is aborted on a chunk with 'resync active' set
* It is cleared (and resync-active set) when a resync starts across all drives
* of the chunk.
*
*
* The "resync active" bit is set when:
* a resync is started on all drives, and resync_needed is set.
* resync_needed will be cleared (as long as resync_active wasn't already set).
* It is cleared when a resync completes.
*
* The counter counts pending write requests, plus the on-disk bit.
* When the counter is '1' and the resync bits are clear, the on-disk
* bit can be cleared as well, thus setting the counter to 0.
* When we set a bit, or in the counter (to start a write), if the fields is
* 0, we first set the disk bit and set the counter to 1.
*
* If the counter is 0, the on-disk bit is clear and the stipe is clean
* Anything that dirties the stipe pushes the counter to 2 (at least)
* and sets the on-disk bit (lazily).
* If a periodic sweep find the counter at 2, it is decremented to 1.
* If the sweep find the counter at 1, the on-disk bit is cleared and the
* counter goes to zero.
*
* Also, we'll hijack the "map" pointer itself and use it as two 16 bit block
* counters as a fallback when "page" memory cannot be allocated:
*
* Normal case (page memory allocated):
*
* page pointer (32-bit)
*
* [ ] ------+
* |
* +-------> [ ][ ]..[ ] (4096 byte page == 2048 counters)
* c1 c2 c2048
*
* Hijacked case (page memory allocation failed):
*
* hijacked page pointer (32-bit)
*
* [ ][ ] (no page memory allocated)
* counter #1 (16-bit) counter #2 (16-bit)
*
*/
#ifdef __KERNEL__
#define PAGE_BITS (PAGE_SIZE << 3)
#define PAGE_BIT_SHIFT (PAGE_SHIFT + 3)
typedef __u16 bitmap_counter_t;
#define COUNTER_BITS 16
#define COUNTER_BIT_SHIFT 4
#define COUNTER_BYTE_SHIFT (COUNTER_BIT_SHIFT - 3)
#define NEEDED_MASK ((bitmap_counter_t) (1 << (COUNTER_BITS - 1)))
#define RESYNC_MASK ((bitmap_counter_t) (1 << (COUNTER_BITS - 2)))
#define COUNTER_MAX ((bitmap_counter_t) RESYNC_MASK - 1)
#define NEEDED(x) (((bitmap_counter_t) x) & NEEDED_MASK)
#define RESYNC(x) (((bitmap_counter_t) x) & RESYNC_MASK)
#define COUNTER(x) (((bitmap_counter_t) x) & COUNTER_MAX)
/* how many counters per page? */
#define PAGE_COUNTER_RATIO (PAGE_BITS / COUNTER_BITS)
/* same, except a shift value for more efficient bitops */
#define PAGE_COUNTER_SHIFT (PAGE_BIT_SHIFT - COUNTER_BIT_SHIFT)
/* same, except a mask value for more efficient bitops */
#define PAGE_COUNTER_MASK (PAGE_COUNTER_RATIO - 1)
#define BITMAP_BLOCK_SHIFT 9
#endif
/*
* bitmap structures:
*/
#define BITMAP_MAGIC 0x6d746962
/* use these for bitmap->flags and bitmap->sb->state bit-fields */
enum bitmap_state {
BITMAP_STALE = 1, /* the bitmap file is out of date or had -EIO */
BITMAP_WRITE_ERROR = 2, /* A write error has occurred */
BITMAP_HOSTENDIAN =15,
};
/* the superblock at the front of the bitmap file -- little endian */
typedef struct bitmap_super_s {
__le32 magic; /* 0 BITMAP_MAGIC */
__le32 version; /* 4 the bitmap major for now, could change... */
__u8 uuid[16]; /* 8 128 bit uuid - must match md device uuid */
__le64 events; /* 24 event counter for the bitmap (1)*/
__le64 events_cleared;/*32 event counter when last bit cleared (2) */
__le64 sync_size; /* 40 the size of the md device's sync range(3) */
__le32 state; /* 48 bitmap state information */
__le32 chunksize; /* 52 the bitmap chunk size in bytes */
__le32 daemon_sleep; /* 56 seconds between disk flushes */
__le32 write_behind; /* 60 number of outstanding write-behind writes */
__le32 sectors_reserved; /* 64 number of 512-byte sectors that are
* reserved for the bitmap. */
__u8 pad[256 - 68]; /* set to zero */
} bitmap_super_t;
/* notes:
* (1) This event counter is updated before the eventcounter in the md superblock
* When a bitmap is loaded, it is only accepted if this event counter is equal
* to, or one greater than, the event counter in the superblock.
* (2) This event counter is updated when the other one is *if*and*only*if* the
* array is not degraded. As bits are not cleared when the array is degraded,
* this represents the last time that any bits were cleared.
* If a device is being added that has an event count with this value or
* higher, it is accepted as conforming to the bitmap.
* (3)This is the number of sectors represented by the bitmap, and is the range that
* resync happens across. For raid1 and raid5/6 it is the size of individual
* devices. For raid10 it is the size of the array.
*/
#ifdef __KERNEL__
/* the in-memory bitmap is represented by bitmap_pages */
struct bitmap_page {
/*
* map points to the actual memory page
*/
char *map;
/*
* in emergencies (when map cannot be alloced), hijack the map
* pointer and use it as two counters itself
*/
unsigned int hijacked:1;
/*
* If any counter in this page is '1' or '2' - and so could be
* cleared then that page is marked as 'pending'
*/
unsigned int pending:1;
/*
* count of dirty bits on the page
*/
unsigned int count:30;
};
/* the main bitmap structure - one per mddev */
struct bitmap {
struct bitmap_counts {
spinlock_t lock;
struct bitmap_page *bp;
unsigned long pages; /* total number of pages
* in the bitmap */
unsigned long missing_pages; /* number of pages
* not yet allocated */
unsigned long chunkshift; /* chunksize = 2^chunkshift
* (for bitops) */
unsigned long chunks; /* Total number of data
* chunks for the array */
} counts;
struct mddev *mddev; /* the md device that the bitmap is for */
__u64 events_cleared;
int need_sync;
struct bitmap_storage {
struct file *file; /* backing disk file */
struct page *sb_page; /* cached copy of the bitmap
* file superblock */
struct page **filemap; /* list of cache pages for
* the file */
unsigned long *filemap_attr; /* attributes associated
* w/ filemap pages */
unsigned long file_pages; /* number of pages in the file*/
unsigned long bytes; /* total bytes in the bitmap */
} storage;
unsigned long flags;
int allclean;
atomic_t behind_writes;
unsigned long behind_writes_used; /* highest actual value at runtime */
/*
* the bitmap daemon - periodically wakes up and sweeps the bitmap
* file, cleaning up bits and flushing out pages to disk as necessary
*/
unsigned long daemon_lastrun; /* jiffies of last run */
unsigned long last_end_sync; /* when we lasted called end_sync to
* update bitmap with resync progress */
atomic_t pending_writes; /* pending writes to the bitmap file */
wait_queue_head_t write_wait;
wait_queue_head_t overflow_wait;
wait_queue_head_t behind_wait;
struct kernfs_node *sysfs_can_clear;
};
/* the bitmap API */
/* these are used only by md/bitmap */
int bitmap_create(struct mddev *mddev);
int bitmap_load(struct mddev *mddev);
void bitmap_flush(struct mddev *mddev);
void bitmap_destroy(struct mddev *mddev);
void bitmap_print_sb(struct bitmap *bitmap);
void bitmap_update_sb(struct bitmap *bitmap);
void bitmap_status(struct seq_file *seq, struct bitmap *bitmap);
int bitmap_setallbits(struct bitmap *bitmap);
void bitmap_write_all(struct bitmap *bitmap);
void bitmap_dirty_bits(struct bitmap *bitmap, unsigned long s, unsigned long e);
/* these are exported */
int bitmap_startwrite(struct bitmap *bitmap, sector_t offset,
unsigned long sectors, int behind);
void bitmap_endwrite(struct bitmap *bitmap, sector_t offset,
unsigned long sectors, int success, int behind);
int bitmap_start_sync(struct bitmap *bitmap, sector_t offset, sector_t *blocks, int degraded);
void bitmap_end_sync(struct bitmap *bitmap, sector_t offset, sector_t *blocks, int aborted);
void bitmap_close_sync(struct bitmap *bitmap);
void bitmap_cond_end_sync(struct bitmap *bitmap, sector_t sector);
void bitmap_unplug(struct bitmap *bitmap);
void bitmap_daemon_work(struct mddev *mddev);
int bitmap_resize(struct bitmap *bitmap, sector_t blocks,
int chunksize, int init);
#endif
#endif

408
drivers/md/dm-bio-prison.c Normal file
View file

@ -0,0 +1,408 @@
/*
* Copyright (C) 2012 Red Hat, Inc.
*
* This file is released under the GPL.
*/
#include "dm.h"
#include "dm-bio-prison.h"
#include <linux/spinlock.h>
#include <linux/mempool.h>
#include <linux/module.h>
#include <linux/slab.h>
/*----------------------------------------------------------------*/
struct bucket {
spinlock_t lock;
struct hlist_head cells;
};
struct dm_bio_prison {
mempool_t *cell_pool;
unsigned nr_buckets;
unsigned hash_mask;
struct bucket *buckets;
};
/*----------------------------------------------------------------*/
static uint32_t calc_nr_buckets(unsigned nr_cells)
{
uint32_t n = 128;
nr_cells /= 4;
nr_cells = min(nr_cells, 8192u);
while (n < nr_cells)
n <<= 1;
return n;
}
static struct kmem_cache *_cell_cache;
static void init_bucket(struct bucket *b)
{
spin_lock_init(&b->lock);
INIT_HLIST_HEAD(&b->cells);
}
/*
* @nr_cells should be the number of cells you want in use _concurrently_.
* Don't confuse it with the number of distinct keys.
*/
struct dm_bio_prison *dm_bio_prison_create(unsigned nr_cells)
{
unsigned i;
uint32_t nr_buckets = calc_nr_buckets(nr_cells);
size_t len = sizeof(struct dm_bio_prison) +
(sizeof(struct bucket) * nr_buckets);
struct dm_bio_prison *prison = kmalloc(len, GFP_KERNEL);
if (!prison)
return NULL;
prison->cell_pool = mempool_create_slab_pool(nr_cells, _cell_cache);
if (!prison->cell_pool) {
kfree(prison);
return NULL;
}
prison->nr_buckets = nr_buckets;
prison->hash_mask = nr_buckets - 1;
prison->buckets = (struct bucket *) (prison + 1);
for (i = 0; i < nr_buckets; i++)
init_bucket(prison->buckets + i);
return prison;
}
EXPORT_SYMBOL_GPL(dm_bio_prison_create);
void dm_bio_prison_destroy(struct dm_bio_prison *prison)
{
mempool_destroy(prison->cell_pool);
kfree(prison);
}
EXPORT_SYMBOL_GPL(dm_bio_prison_destroy);
struct dm_bio_prison_cell *dm_bio_prison_alloc_cell(struct dm_bio_prison *prison, gfp_t gfp)
{
return mempool_alloc(prison->cell_pool, gfp);
}
EXPORT_SYMBOL_GPL(dm_bio_prison_alloc_cell);
void dm_bio_prison_free_cell(struct dm_bio_prison *prison,
struct dm_bio_prison_cell *cell)
{
mempool_free(cell, prison->cell_pool);
}
EXPORT_SYMBOL_GPL(dm_bio_prison_free_cell);
static uint32_t hash_key(struct dm_bio_prison *prison, struct dm_cell_key *key)
{
const unsigned long BIG_PRIME = 4294967291UL;
uint64_t hash = key->block * BIG_PRIME;
return (uint32_t) (hash & prison->hash_mask);
}
static int keys_equal(struct dm_cell_key *lhs, struct dm_cell_key *rhs)
{
return (lhs->virtual == rhs->virtual) &&
(lhs->dev == rhs->dev) &&
(lhs->block == rhs->block);
}
static struct bucket *get_bucket(struct dm_bio_prison *prison,
struct dm_cell_key *key)
{
return prison->buckets + hash_key(prison, key);
}
static struct dm_bio_prison_cell *__search_bucket(struct bucket *b,
struct dm_cell_key *key)
{
struct dm_bio_prison_cell *cell;
hlist_for_each_entry(cell, &b->cells, list)
if (keys_equal(&cell->key, key))
return cell;
return NULL;
}
static void __setup_new_cell(struct bucket *b,
struct dm_cell_key *key,
struct bio *holder,
struct dm_bio_prison_cell *cell)
{
memcpy(&cell->key, key, sizeof(cell->key));
cell->holder = holder;
bio_list_init(&cell->bios);
hlist_add_head(&cell->list, &b->cells);
}
static int __bio_detain(struct bucket *b,
struct dm_cell_key *key,
struct bio *inmate,
struct dm_bio_prison_cell *cell_prealloc,
struct dm_bio_prison_cell **cell_result)
{
struct dm_bio_prison_cell *cell;
cell = __search_bucket(b, key);
if (cell) {
if (inmate)
bio_list_add(&cell->bios, inmate);
*cell_result = cell;
return 1;
}
__setup_new_cell(b, key, inmate, cell_prealloc);
*cell_result = cell_prealloc;
return 0;
}
static int bio_detain(struct dm_bio_prison *prison,
struct dm_cell_key *key,
struct bio *inmate,
struct dm_bio_prison_cell *cell_prealloc,
struct dm_bio_prison_cell **cell_result)
{
int r;
unsigned long flags;
struct bucket *b = get_bucket(prison, key);
spin_lock_irqsave(&b->lock, flags);
r = __bio_detain(b, key, inmate, cell_prealloc, cell_result);
spin_unlock_irqrestore(&b->lock, flags);
return r;
}
int dm_bio_detain(struct dm_bio_prison *prison,
struct dm_cell_key *key,
struct bio *inmate,
struct dm_bio_prison_cell *cell_prealloc,
struct dm_bio_prison_cell **cell_result)
{
return bio_detain(prison, key, inmate, cell_prealloc, cell_result);
}
EXPORT_SYMBOL_GPL(dm_bio_detain);
int dm_get_cell(struct dm_bio_prison *prison,
struct dm_cell_key *key,
struct dm_bio_prison_cell *cell_prealloc,
struct dm_bio_prison_cell **cell_result)
{
return bio_detain(prison, key, NULL, cell_prealloc, cell_result);
}
EXPORT_SYMBOL_GPL(dm_get_cell);
/*
* @inmates must have been initialised prior to this call
*/
static void __cell_release(struct dm_bio_prison_cell *cell,
struct bio_list *inmates)
{
hlist_del(&cell->list);
if (inmates) {
if (cell->holder)
bio_list_add(inmates, cell->holder);
bio_list_merge(inmates, &cell->bios);
}
}
void dm_cell_release(struct dm_bio_prison *prison,
struct dm_bio_prison_cell *cell,
struct bio_list *bios)
{
unsigned long flags;
struct bucket *b = get_bucket(prison, &cell->key);
spin_lock_irqsave(&b->lock, flags);
__cell_release(cell, bios);
spin_unlock_irqrestore(&b->lock, flags);
}
EXPORT_SYMBOL_GPL(dm_cell_release);
/*
* Sometimes we don't want the holder, just the additional bios.
*/
static void __cell_release_no_holder(struct dm_bio_prison_cell *cell,
struct bio_list *inmates)
{
hlist_del(&cell->list);
bio_list_merge(inmates, &cell->bios);
}
void dm_cell_release_no_holder(struct dm_bio_prison *prison,
struct dm_bio_prison_cell *cell,
struct bio_list *inmates)
{
unsigned long flags;
struct bucket *b = get_bucket(prison, &cell->key);
spin_lock_irqsave(&b->lock, flags);
__cell_release_no_holder(cell, inmates);
spin_unlock_irqrestore(&b->lock, flags);
}
EXPORT_SYMBOL_GPL(dm_cell_release_no_holder);
void dm_cell_error(struct dm_bio_prison *prison,
struct dm_bio_prison_cell *cell, int error)
{
struct bio_list bios;
struct bio *bio;
bio_list_init(&bios);
dm_cell_release(prison, cell, &bios);
while ((bio = bio_list_pop(&bios)))
bio_endio(bio, error);
}
EXPORT_SYMBOL_GPL(dm_cell_error);
/*----------------------------------------------------------------*/
#define DEFERRED_SET_SIZE 64
struct dm_deferred_entry {
struct dm_deferred_set *ds;
unsigned count;
struct list_head work_items;
};
struct dm_deferred_set {
spinlock_t lock;
unsigned current_entry;
unsigned sweeper;
struct dm_deferred_entry entries[DEFERRED_SET_SIZE];
};
struct dm_deferred_set *dm_deferred_set_create(void)
{
int i;
struct dm_deferred_set *ds;
ds = kmalloc(sizeof(*ds), GFP_KERNEL);
if (!ds)
return NULL;
spin_lock_init(&ds->lock);
ds->current_entry = 0;
ds->sweeper = 0;
for (i = 0; i < DEFERRED_SET_SIZE; i++) {
ds->entries[i].ds = ds;
ds->entries[i].count = 0;
INIT_LIST_HEAD(&ds->entries[i].work_items);
}
return ds;
}
EXPORT_SYMBOL_GPL(dm_deferred_set_create);
void dm_deferred_set_destroy(struct dm_deferred_set *ds)
{
kfree(ds);
}
EXPORT_SYMBOL_GPL(dm_deferred_set_destroy);
struct dm_deferred_entry *dm_deferred_entry_inc(struct dm_deferred_set *ds)
{
unsigned long flags;
struct dm_deferred_entry *entry;
spin_lock_irqsave(&ds->lock, flags);
entry = ds->entries + ds->current_entry;
entry->count++;
spin_unlock_irqrestore(&ds->lock, flags);
return entry;
}
EXPORT_SYMBOL_GPL(dm_deferred_entry_inc);
static unsigned ds_next(unsigned index)
{
return (index + 1) % DEFERRED_SET_SIZE;
}
static void __sweep(struct dm_deferred_set *ds, struct list_head *head)
{
while ((ds->sweeper != ds->current_entry) &&
!ds->entries[ds->sweeper].count) {
list_splice_init(&ds->entries[ds->sweeper].work_items, head);
ds->sweeper = ds_next(ds->sweeper);
}
if ((ds->sweeper == ds->current_entry) && !ds->entries[ds->sweeper].count)
list_splice_init(&ds->entries[ds->sweeper].work_items, head);
}
void dm_deferred_entry_dec(struct dm_deferred_entry *entry, struct list_head *head)
{
unsigned long flags;
spin_lock_irqsave(&entry->ds->lock, flags);
BUG_ON(!entry->count);
--entry->count;
__sweep(entry->ds, head);
spin_unlock_irqrestore(&entry->ds->lock, flags);
}
EXPORT_SYMBOL_GPL(dm_deferred_entry_dec);
/*
* Returns 1 if deferred or 0 if no pending items to delay job.
*/
int dm_deferred_set_add_work(struct dm_deferred_set *ds, struct list_head *work)
{
int r = 1;
unsigned long flags;
unsigned next_entry;
spin_lock_irqsave(&ds->lock, flags);
if ((ds->sweeper == ds->current_entry) &&
!ds->entries[ds->current_entry].count)
r = 0;
else {
list_add(work, &ds->entries[ds->current_entry].work_items);
next_entry = ds_next(ds->current_entry);
if (!ds->entries[next_entry].count)
ds->current_entry = next_entry;
}
spin_unlock_irqrestore(&ds->lock, flags);
return r;
}
EXPORT_SYMBOL_GPL(dm_deferred_set_add_work);
/*----------------------------------------------------------------*/
static int __init dm_bio_prison_init(void)
{
_cell_cache = KMEM_CACHE(dm_bio_prison_cell, 0);
if (!_cell_cache)
return -ENOMEM;
return 0;
}
static void __exit dm_bio_prison_exit(void)
{
kmem_cache_destroy(_cell_cache);
_cell_cache = NULL;
}
/*
* module hooks
*/
module_init(dm_bio_prison_init);
module_exit(dm_bio_prison_exit);
MODULE_DESCRIPTION(DM_NAME " bio prison");
MODULE_AUTHOR("Joe Thornber <dm-devel@redhat.com>");
MODULE_LICENSE("GPL");

111
drivers/md/dm-bio-prison.h Normal file
View file

@ -0,0 +1,111 @@
/*
* Copyright (C) 2011-2012 Red Hat, Inc.
*
* This file is released under the GPL.
*/
#ifndef DM_BIO_PRISON_H
#define DM_BIO_PRISON_H
#include "persistent-data/dm-block-manager.h" /* FIXME: for dm_block_t */
#include "dm-thin-metadata.h" /* FIXME: for dm_thin_id */
#include <linux/list.h>
#include <linux/bio.h>
/*----------------------------------------------------------------*/
/*
* Sometimes we can't deal with a bio straight away. We put them in prison
* where they can't cause any mischief. Bios are put in a cell identified
* by a key, multiple bios can be in the same cell. When the cell is
* subsequently unlocked the bios become available.
*/
struct dm_bio_prison;
/* FIXME: this needs to be more abstract */
struct dm_cell_key {
int virtual;
dm_thin_id dev;
dm_block_t block;
};
/*
* Treat this as opaque, only in header so callers can manage allocation
* themselves.
*/
struct dm_bio_prison_cell {
struct hlist_node list;
struct dm_cell_key key;
struct bio *holder;
struct bio_list bios;
};
struct dm_bio_prison *dm_bio_prison_create(unsigned nr_cells);
void dm_bio_prison_destroy(struct dm_bio_prison *prison);
/*
* These two functions just wrap a mempool. This is a transitory step:
* Eventually all bio prison clients should manage their own cell memory.
*
* Like mempool_alloc(), dm_bio_prison_alloc_cell() can only fail if called
* in interrupt context or passed GFP_NOWAIT.
*/
struct dm_bio_prison_cell *dm_bio_prison_alloc_cell(struct dm_bio_prison *prison,
gfp_t gfp);
void dm_bio_prison_free_cell(struct dm_bio_prison *prison,
struct dm_bio_prison_cell *cell);
/*
* Creates, or retrieves a cell for the given key.
*
* Returns 1 if pre-existing cell returned, zero if new cell created using
* @cell_prealloc.
*/
int dm_get_cell(struct dm_bio_prison *prison,
struct dm_cell_key *key,
struct dm_bio_prison_cell *cell_prealloc,
struct dm_bio_prison_cell **cell_result);
/*
* An atomic op that combines retrieving a cell, and adding a bio to it.
*
* Returns 1 if the cell was already held, 0 if @inmate is the new holder.
*/
int dm_bio_detain(struct dm_bio_prison *prison,
struct dm_cell_key *key,
struct bio *inmate,
struct dm_bio_prison_cell *cell_prealloc,
struct dm_bio_prison_cell **cell_result);
void dm_cell_release(struct dm_bio_prison *prison,
struct dm_bio_prison_cell *cell,
struct bio_list *bios);
void dm_cell_release_no_holder(struct dm_bio_prison *prison,
struct dm_bio_prison_cell *cell,
struct bio_list *inmates);
void dm_cell_error(struct dm_bio_prison *prison,
struct dm_bio_prison_cell *cell, int error);
/*----------------------------------------------------------------*/
/*
* We use the deferred set to keep track of pending reads to shared blocks.
* We do this to ensure the new mapping caused by a write isn't performed
* until these prior reads have completed. Otherwise the insertion of the
* new mapping could free the old block that the read bios are mapped to.
*/
struct dm_deferred_set;
struct dm_deferred_entry;
struct dm_deferred_set *dm_deferred_set_create(void);
void dm_deferred_set_destroy(struct dm_deferred_set *ds);
struct dm_deferred_entry *dm_deferred_entry_inc(struct dm_deferred_set *ds);
void dm_deferred_entry_dec(struct dm_deferred_entry *entry, struct list_head *head);
int dm_deferred_set_add_work(struct dm_deferred_set *ds, struct list_head *work);
/*----------------------------------------------------------------*/
#endif

View file

@ -0,0 +1,40 @@
/*
* Copyright (C) 2004-2005 Red Hat, Inc. All rights reserved.
*
* This file is released under the GPL.
*/
#ifndef DM_BIO_RECORD_H
#define DM_BIO_RECORD_H
#include <linux/bio.h>
/*
* There are lots of mutable fields in the bio struct that get
* changed by the lower levels of the block layer. Some targets,
* such as multipath, may wish to resubmit a bio on error. The
* functions in this file help the target record and restore the
* original bio state.
*/
struct dm_bio_details {
struct block_device *bi_bdev;
unsigned long bi_flags;
struct bvec_iter bi_iter;
};
static inline void dm_bio_record(struct dm_bio_details *bd, struct bio *bio)
{
bd->bi_bdev = bio->bi_bdev;
bd->bi_flags = bio->bi_flags;
bd->bi_iter = bio->bi_iter;
}
static inline void dm_bio_restore(struct dm_bio_details *bd, struct bio *bio)
{
bio->bi_bdev = bd->bi_bdev;
bio->bi_flags = bd->bi_flags;
bio->bi_iter = bd->bi_iter;
}
#endif

1872
drivers/md/dm-bufio.c Normal file

File diff suppressed because it is too large Load diff

132
drivers/md/dm-bufio.h Normal file
View file

@ -0,0 +1,132 @@
/*
* Copyright (C) 2009-2011 Red Hat, Inc.
*
* Author: Mikulas Patocka <mpatocka@redhat.com>
*
* This file is released under the GPL.
*/
#ifndef DM_BUFIO_H
#define DM_BUFIO_H
#include <linux/blkdev.h>
#include <linux/types.h>
/*----------------------------------------------------------------*/
struct dm_bufio_client;
struct dm_buffer;
/*
* Create a buffered IO cache on a given device
*/
struct dm_bufio_client *
dm_bufio_client_create(struct block_device *bdev, unsigned block_size,
unsigned reserved_buffers, unsigned aux_size,
void (*alloc_callback)(struct dm_buffer *),
void (*write_callback)(struct dm_buffer *));
/*
* Release a buffered IO cache.
*/
void dm_bufio_client_destroy(struct dm_bufio_client *c);
/*
* WARNING: to avoid deadlocks, these conditions are observed:
*
* - At most one thread can hold at most "reserved_buffers" simultaneously.
* - Each other threads can hold at most one buffer.
* - Threads which call only dm_bufio_get can hold unlimited number of
* buffers.
*/
/*
* Read a given block from disk. Returns pointer to data. Returns a
* pointer to dm_buffer that can be used to release the buffer or to make
* it dirty.
*/
void *dm_bufio_read(struct dm_bufio_client *c, sector_t block,
struct dm_buffer **bp);
/*
* Like dm_bufio_read, but return buffer from cache, don't read
* it. If the buffer is not in the cache, return NULL.
*/
void *dm_bufio_get(struct dm_bufio_client *c, sector_t block,
struct dm_buffer **bp);
/*
* Like dm_bufio_read, but don't read anything from the disk. It is
* expected that the caller initializes the buffer and marks it dirty.
*/
void *dm_bufio_new(struct dm_bufio_client *c, sector_t block,
struct dm_buffer **bp);
/*
* Prefetch the specified blocks to the cache.
* The function starts to read the blocks and returns without waiting for
* I/O to finish.
*/
void dm_bufio_prefetch(struct dm_bufio_client *c,
sector_t block, unsigned n_blocks);
/*
* Release a reference obtained with dm_bufio_{read,get,new}. The data
* pointer and dm_buffer pointer is no longer valid after this call.
*/
void dm_bufio_release(struct dm_buffer *b);
/*
* Mark a buffer dirty. It should be called after the buffer is modified.
*
* In case of memory pressure, the buffer may be written after
* dm_bufio_mark_buffer_dirty, but before dm_bufio_write_dirty_buffers. So
* dm_bufio_write_dirty_buffers guarantees that the buffer is on-disk but
* the actual writing may occur earlier.
*/
void dm_bufio_mark_buffer_dirty(struct dm_buffer *b);
/*
* Initiate writing of dirty buffers, without waiting for completion.
*/
void dm_bufio_write_dirty_buffers_async(struct dm_bufio_client *c);
/*
* Write all dirty buffers. Guarantees that all dirty buffers created prior
* to this call are on disk when this call exits.
*/
int dm_bufio_write_dirty_buffers(struct dm_bufio_client *c);
/*
* Send an empty write barrier to the device to flush hardware disk cache.
*/
int dm_bufio_issue_flush(struct dm_bufio_client *c);
/*
* Like dm_bufio_release but also move the buffer to the new
* block. dm_bufio_write_dirty_buffers is needed to commit the new block.
*/
void dm_bufio_release_move(struct dm_buffer *b, sector_t new_block);
/*
* Free the given buffer.
* This is just a hint, if the buffer is in use or dirty, this function
* does nothing.
*/
void dm_bufio_forget(struct dm_bufio_client *c, sector_t block);
/*
* Set the minimum number of buffers before cleanup happens.
*/
void dm_bufio_set_minimum_buffers(struct dm_bufio_client *c, unsigned n);
unsigned dm_bufio_get_block_size(struct dm_bufio_client *c);
sector_t dm_bufio_get_device_size(struct dm_bufio_client *c);
sector_t dm_bufio_get_block_number(struct dm_buffer *b);
void *dm_bufio_get_block_data(struct dm_buffer *b);
void *dm_bufio_get_aux_data(struct dm_buffer *b);
struct dm_bufio_client *dm_bufio_get_client(struct dm_buffer *b);
/*----------------------------------------------------------------*/
#endif

48
drivers/md/dm-builtin.c Normal file
View file

@ -0,0 +1,48 @@
#include "dm.h"
/*
* The kobject release method must not be placed in the module itself,
* otherwise we are subject to module unload races.
*
* The release method is called when the last reference to the kobject is
* dropped. It may be called by any other kernel code that drops the last
* reference.
*
* The release method suffers from module unload race. We may prevent the
* module from being unloaded at the start of the release method (using
* increased module reference count or synchronizing against the release
* method), however there is no way to prevent the module from being
* unloaded at the end of the release method.
*
* If this code were placed in the dm module, the following race may
* happen:
* 1. Some other process takes a reference to dm kobject
* 2. The user issues ioctl function to unload the dm device
* 3. dm_sysfs_exit calls kobject_put, however the object is not released
* because of the other reference taken at step 1
* 4. dm_sysfs_exit waits on the completion
* 5. The other process that took the reference in step 1 drops it,
* dm_kobject_release is called from this process
* 6. dm_kobject_release calls complete()
* 7. a reschedule happens before dm_kobject_release returns
* 8. dm_sysfs_exit continues, the dm device is unloaded, module reference
* count is decremented
* 9. The user unloads the dm module
* 10. The other process that was rescheduled in step 7 continues to run,
* it is now executing code in unloaded module, so it crashes
*
* Note that if the process that takes the foreign reference to dm kobject
* has a low priority and the system is sufficiently loaded with
* higher-priority processes that prevent the low-priority process from
* being scheduled long enough, this bug may really happen.
*
* In order to fix this module unload race, we place the release method
* into a helper code that is compiled directly into the kernel.
*/
void dm_kobject_release(struct kobject *kobj)
{
complete(dm_get_completion_from_kobject(kobj));
}
EXPORT_SYMBOL(dm_kobject_release);

View file

@ -0,0 +1,43 @@
/*
* Copyright (C) 2012 Red Hat, Inc.
*
* This file is released under the GPL.
*/
#ifndef DM_CACHE_BLOCK_TYPES_H
#define DM_CACHE_BLOCK_TYPES_H
#include "persistent-data/dm-block-manager.h"
/*----------------------------------------------------------------*/
/*
* It's helpful to get sparse to differentiate between indexes into the
* origin device, indexes into the cache device, and indexes into the
* discard bitset.
*/
typedef dm_block_t __bitwise__ dm_oblock_t;
typedef uint32_t __bitwise__ dm_cblock_t;
static inline dm_oblock_t to_oblock(dm_block_t b)
{
return (__force dm_oblock_t) b;
}
static inline dm_block_t from_oblock(dm_oblock_t b)
{
return (__force dm_block_t) b;
}
static inline dm_cblock_t to_cblock(uint32_t b)
{
return (__force dm_cblock_t) b;
}
static inline uint32_t from_cblock(dm_cblock_t b)
{
return (__force uint32_t) b;
}
#endif /* DM_CACHE_BLOCK_TYPES_H */

File diff suppressed because it is too large Load diff

View file

@ -0,0 +1,138 @@
/*
* Copyright (C) 2012 Red Hat, Inc.
*
* This file is released under the GPL.
*/
#ifndef DM_CACHE_METADATA_H
#define DM_CACHE_METADATA_H
#include "dm-cache-block-types.h"
#include "dm-cache-policy-internal.h"
#include "persistent-data/dm-space-map-metadata.h"
/*----------------------------------------------------------------*/
#define DM_CACHE_METADATA_BLOCK_SIZE DM_SM_METADATA_BLOCK_SIZE
/* FIXME: remove this restriction */
/*
* The metadata device is currently limited in size.
*/
#define DM_CACHE_METADATA_MAX_SECTORS DM_SM_METADATA_MAX_SECTORS
/*
* A metadata device larger than 16GB triggers a warning.
*/
#define DM_CACHE_METADATA_MAX_SECTORS_WARNING (16 * (1024 * 1024 * 1024 >> SECTOR_SHIFT))
/*----------------------------------------------------------------*/
/*
* Ext[234]-style compat feature flags.
*
* A new feature which old metadata will still be compatible with should
* define a DM_CACHE_FEATURE_COMPAT_* flag (rarely useful).
*
* A new feature that is not compatible with old code should define a
* DM_CACHE_FEATURE_INCOMPAT_* flag and guard the relevant code with
* that flag.
*
* A new feature that is not compatible with old code accessing the
* metadata RDWR should define a DM_CACHE_FEATURE_RO_COMPAT_* flag and
* guard the relevant code with that flag.
*
* As these various flags are defined they should be added to the
* following masks.
*/
#define DM_CACHE_FEATURE_COMPAT_SUPP 0UL
#define DM_CACHE_FEATURE_COMPAT_RO_SUPP 0UL
#define DM_CACHE_FEATURE_INCOMPAT_SUPP 0UL
/*
* Reopens or creates a new, empty metadata volume.
* Returns an ERR_PTR on failure.
*/
struct dm_cache_metadata *dm_cache_metadata_open(struct block_device *bdev,
sector_t data_block_size,
bool may_format_device,
size_t policy_hint_size);
void dm_cache_metadata_close(struct dm_cache_metadata *cmd);
/*
* The metadata needs to know how many cache blocks there are. We don't
* care about the origin, assuming the core target is giving us valid
* origin blocks to map to.
*/
int dm_cache_resize(struct dm_cache_metadata *cmd, dm_cblock_t new_cache_size);
dm_cblock_t dm_cache_size(struct dm_cache_metadata *cmd);
int dm_cache_discard_bitset_resize(struct dm_cache_metadata *cmd,
sector_t discard_block_size,
dm_oblock_t new_nr_entries);
typedef int (*load_discard_fn)(void *context, sector_t discard_block_size,
dm_oblock_t dblock, bool discarded);
int dm_cache_load_discards(struct dm_cache_metadata *cmd,
load_discard_fn fn, void *context);
int dm_cache_set_discard(struct dm_cache_metadata *cmd, dm_oblock_t dblock, bool discard);
int dm_cache_remove_mapping(struct dm_cache_metadata *cmd, dm_cblock_t cblock);
int dm_cache_insert_mapping(struct dm_cache_metadata *cmd, dm_cblock_t cblock, dm_oblock_t oblock);
int dm_cache_changed_this_transaction(struct dm_cache_metadata *cmd);
typedef int (*load_mapping_fn)(void *context, dm_oblock_t oblock,
dm_cblock_t cblock, bool dirty,
uint32_t hint, bool hint_valid);
int dm_cache_load_mappings(struct dm_cache_metadata *cmd,
struct dm_cache_policy *policy,
load_mapping_fn fn,
void *context);
int dm_cache_set_dirty(struct dm_cache_metadata *cmd, dm_cblock_t cblock, bool dirty);
struct dm_cache_statistics {
uint32_t read_hits;
uint32_t read_misses;
uint32_t write_hits;
uint32_t write_misses;
};
void dm_cache_metadata_get_stats(struct dm_cache_metadata *cmd,
struct dm_cache_statistics *stats);
void dm_cache_metadata_set_stats(struct dm_cache_metadata *cmd,
struct dm_cache_statistics *stats);
int dm_cache_commit(struct dm_cache_metadata *cmd, bool clean_shutdown);
int dm_cache_get_free_metadata_block_count(struct dm_cache_metadata *cmd,
dm_block_t *result);
int dm_cache_get_metadata_dev_size(struct dm_cache_metadata *cmd,
dm_block_t *result);
void dm_cache_dump(struct dm_cache_metadata *cmd);
/*
* The policy is invited to save a 32bit hint value for every cblock (eg,
* for a hit count). These are stored against the policy name. If
* policies are changed, then hints will be lost. If the machine crashes,
* hints will be lost.
*
* The hints are indexed by the cblock, but many policies will not
* neccessarily have a fast way of accessing efficiently via cblock. So
* rather than querying the policy for each cblock, we let it walk its data
* structures and fill in the hints in whatever order it wishes.
*/
int dm_cache_write_hints(struct dm_cache_metadata *cmd, struct dm_cache_policy *p);
/*
* Query method. Are all the blocks in the cache clean?
*/
int dm_cache_metadata_all_clean(struct dm_cache_metadata *cmd, bool *result);
/*----------------------------------------------------------------*/
#endif /* DM_CACHE_METADATA_H */

View file

@ -0,0 +1,467 @@
/*
* Copyright (C) 2012 Red Hat. All rights reserved.
*
* writeback cache policy supporting flushing out dirty cache blocks.
*
* This file is released under the GPL.
*/
#include "dm-cache-policy.h"
#include "dm.h"
#include <linux/hash.h>
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>
/*----------------------------------------------------------------*/
#define DM_MSG_PREFIX "cache cleaner"
/* Cache entry struct. */
struct wb_cache_entry {
struct list_head list;
struct hlist_node hlist;
dm_oblock_t oblock;
dm_cblock_t cblock;
bool dirty:1;
bool pending:1;
};
struct hash {
struct hlist_head *table;
dm_block_t hash_bits;
unsigned nr_buckets;
};
struct policy {
struct dm_cache_policy policy;
spinlock_t lock;
struct list_head free;
struct list_head clean;
struct list_head clean_pending;
struct list_head dirty;
/*
* We know exactly how many cblocks will be needed,
* so we can allocate them up front.
*/
dm_cblock_t cache_size, nr_cblocks_allocated;
struct wb_cache_entry *cblocks;
struct hash chash;
};
/*----------------------------------------------------------------------------*/
/*
* Low-level functions.
*/
static unsigned next_power(unsigned n, unsigned min)
{
return roundup_pow_of_two(max(n, min));
}
static struct policy *to_policy(struct dm_cache_policy *p)
{
return container_of(p, struct policy, policy);
}
static struct list_head *list_pop(struct list_head *q)
{
struct list_head *r = q->next;
list_del(r);
return r;
}
/*----------------------------------------------------------------------------*/
/* Allocate/free various resources. */
static int alloc_hash(struct hash *hash, unsigned elts)
{
hash->nr_buckets = next_power(elts >> 4, 16);
hash->hash_bits = ffs(hash->nr_buckets) - 1;
hash->table = vzalloc(sizeof(*hash->table) * hash->nr_buckets);
return hash->table ? 0 : -ENOMEM;
}
static void free_hash(struct hash *hash)
{
vfree(hash->table);
}
static int alloc_cache_blocks_with_hash(struct policy *p, dm_cblock_t cache_size)
{
int r = -ENOMEM;
p->cblocks = vzalloc(sizeof(*p->cblocks) * from_cblock(cache_size));
if (p->cblocks) {
unsigned u = from_cblock(cache_size);
while (u--)
list_add(&p->cblocks[u].list, &p->free);
p->nr_cblocks_allocated = 0;
/* Cache entries hash. */
r = alloc_hash(&p->chash, from_cblock(cache_size));
if (r)
vfree(p->cblocks);
}
return r;
}
static void free_cache_blocks_and_hash(struct policy *p)
{
free_hash(&p->chash);
vfree(p->cblocks);
}
static struct wb_cache_entry *alloc_cache_entry(struct policy *p)
{
struct wb_cache_entry *e;
BUG_ON(from_cblock(p->nr_cblocks_allocated) >= from_cblock(p->cache_size));
e = list_entry(list_pop(&p->free), struct wb_cache_entry, list);
p->nr_cblocks_allocated = to_cblock(from_cblock(p->nr_cblocks_allocated) + 1);
return e;
}
/*----------------------------------------------------------------------------*/
/* Hash functions (lookup, insert, remove). */
static struct wb_cache_entry *lookup_cache_entry(struct policy *p, dm_oblock_t oblock)
{
struct hash *hash = &p->chash;
unsigned h = hash_64(from_oblock(oblock), hash->hash_bits);
struct wb_cache_entry *cur;
struct hlist_head *bucket = &hash->table[h];
hlist_for_each_entry(cur, bucket, hlist) {
if (cur->oblock == oblock) {
/* Move upfront bucket for faster access. */
hlist_del(&cur->hlist);
hlist_add_head(&cur->hlist, bucket);
return cur;
}
}
return NULL;
}
static void insert_cache_hash_entry(struct policy *p, struct wb_cache_entry *e)
{
unsigned h = hash_64(from_oblock(e->oblock), p->chash.hash_bits);
hlist_add_head(&e->hlist, &p->chash.table[h]);
}
static void remove_cache_hash_entry(struct wb_cache_entry *e)
{
hlist_del(&e->hlist);
}
/* Public interface (see dm-cache-policy.h */
static int wb_map(struct dm_cache_policy *pe, dm_oblock_t oblock,
bool can_block, bool can_migrate, bool discarded_oblock,
struct bio *bio, struct policy_result *result)
{
struct policy *p = to_policy(pe);
struct wb_cache_entry *e;
unsigned long flags;
result->op = POLICY_MISS;
if (can_block)
spin_lock_irqsave(&p->lock, flags);
else if (!spin_trylock_irqsave(&p->lock, flags))
return -EWOULDBLOCK;
e = lookup_cache_entry(p, oblock);
if (e) {
result->op = POLICY_HIT;
result->cblock = e->cblock;
}
spin_unlock_irqrestore(&p->lock, flags);
return 0;
}
static int wb_lookup(struct dm_cache_policy *pe, dm_oblock_t oblock, dm_cblock_t *cblock)
{
int r;
struct policy *p = to_policy(pe);
struct wb_cache_entry *e;
unsigned long flags;
if (!spin_trylock_irqsave(&p->lock, flags))
return -EWOULDBLOCK;
e = lookup_cache_entry(p, oblock);
if (e) {
*cblock = e->cblock;
r = 0;
} else
r = -ENOENT;
spin_unlock_irqrestore(&p->lock, flags);
return r;
}
static void __set_clear_dirty(struct dm_cache_policy *pe, dm_oblock_t oblock, bool set)
{
struct policy *p = to_policy(pe);
struct wb_cache_entry *e;
e = lookup_cache_entry(p, oblock);
BUG_ON(!e);
if (set) {
if (!e->dirty) {
e->dirty = true;
list_move(&e->list, &p->dirty);
}
} else {
if (e->dirty) {
e->pending = false;
e->dirty = false;
list_move(&e->list, &p->clean);
}
}
}
static void wb_set_dirty(struct dm_cache_policy *pe, dm_oblock_t oblock)
{
struct policy *p = to_policy(pe);
unsigned long flags;
spin_lock_irqsave(&p->lock, flags);
__set_clear_dirty(pe, oblock, true);
spin_unlock_irqrestore(&p->lock, flags);
}
static void wb_clear_dirty(struct dm_cache_policy *pe, dm_oblock_t oblock)
{
struct policy *p = to_policy(pe);
unsigned long flags;
spin_lock_irqsave(&p->lock, flags);
__set_clear_dirty(pe, oblock, false);
spin_unlock_irqrestore(&p->lock, flags);
}
static void add_cache_entry(struct policy *p, struct wb_cache_entry *e)
{
insert_cache_hash_entry(p, e);
if (e->dirty)
list_add(&e->list, &p->dirty);
else
list_add(&e->list, &p->clean);
}
static int wb_load_mapping(struct dm_cache_policy *pe,
dm_oblock_t oblock, dm_cblock_t cblock,
uint32_t hint, bool hint_valid)
{
int r;
struct policy *p = to_policy(pe);
struct wb_cache_entry *e = alloc_cache_entry(p);
if (e) {
e->cblock = cblock;
e->oblock = oblock;
e->dirty = false; /* blocks default to clean */
add_cache_entry(p, e);
r = 0;
} else
r = -ENOMEM;
return r;
}
static void wb_destroy(struct dm_cache_policy *pe)
{
struct policy *p = to_policy(pe);
free_cache_blocks_and_hash(p);
kfree(p);
}
static struct wb_cache_entry *__wb_force_remove_mapping(struct policy *p, dm_oblock_t oblock)
{
struct wb_cache_entry *r = lookup_cache_entry(p, oblock);
BUG_ON(!r);
remove_cache_hash_entry(r);
list_del(&r->list);
return r;
}
static void wb_remove_mapping(struct dm_cache_policy *pe, dm_oblock_t oblock)
{
struct policy *p = to_policy(pe);
struct wb_cache_entry *e;
unsigned long flags;
spin_lock_irqsave(&p->lock, flags);
e = __wb_force_remove_mapping(p, oblock);
list_add_tail(&e->list, &p->free);
BUG_ON(!from_cblock(p->nr_cblocks_allocated));
p->nr_cblocks_allocated = to_cblock(from_cblock(p->nr_cblocks_allocated) - 1);
spin_unlock_irqrestore(&p->lock, flags);
}
static void wb_force_mapping(struct dm_cache_policy *pe,
dm_oblock_t current_oblock, dm_oblock_t oblock)
{
struct policy *p = to_policy(pe);
struct wb_cache_entry *e;
unsigned long flags;
spin_lock_irqsave(&p->lock, flags);
e = __wb_force_remove_mapping(p, current_oblock);
e->oblock = oblock;
add_cache_entry(p, e);
spin_unlock_irqrestore(&p->lock, flags);
}
static struct wb_cache_entry *get_next_dirty_entry(struct policy *p)
{
struct list_head *l;
struct wb_cache_entry *r;
if (list_empty(&p->dirty))
return NULL;
l = list_pop(&p->dirty);
r = container_of(l, struct wb_cache_entry, list);
list_add(l, &p->clean_pending);
return r;
}
static int wb_writeback_work(struct dm_cache_policy *pe,
dm_oblock_t *oblock,
dm_cblock_t *cblock)
{
int r = -ENOENT;
struct policy *p = to_policy(pe);
struct wb_cache_entry *e;
unsigned long flags;
spin_lock_irqsave(&p->lock, flags);
e = get_next_dirty_entry(p);
if (e) {
*oblock = e->oblock;
*cblock = e->cblock;
r = 0;
}
spin_unlock_irqrestore(&p->lock, flags);
return r;
}
static dm_cblock_t wb_residency(struct dm_cache_policy *pe)
{
return to_policy(pe)->nr_cblocks_allocated;
}
/* Init the policy plugin interface function pointers. */
static void init_policy_functions(struct policy *p)
{
p->policy.destroy = wb_destroy;
p->policy.map = wb_map;
p->policy.lookup = wb_lookup;
p->policy.set_dirty = wb_set_dirty;
p->policy.clear_dirty = wb_clear_dirty;
p->policy.load_mapping = wb_load_mapping;
p->policy.walk_mappings = NULL;
p->policy.remove_mapping = wb_remove_mapping;
p->policy.writeback_work = wb_writeback_work;
p->policy.force_mapping = wb_force_mapping;
p->policy.residency = wb_residency;
p->policy.tick = NULL;
}
static struct dm_cache_policy *wb_create(dm_cblock_t cache_size,
sector_t origin_size,
sector_t cache_block_size)
{
int r;
struct policy *p = kzalloc(sizeof(*p), GFP_KERNEL);
if (!p)
return NULL;
init_policy_functions(p);
INIT_LIST_HEAD(&p->free);
INIT_LIST_HEAD(&p->clean);
INIT_LIST_HEAD(&p->clean_pending);
INIT_LIST_HEAD(&p->dirty);
p->cache_size = cache_size;
spin_lock_init(&p->lock);
/* Allocate cache entry structs and add them to free list. */
r = alloc_cache_blocks_with_hash(p, cache_size);
if (!r)
return &p->policy;
kfree(p);
return NULL;
}
/*----------------------------------------------------------------------------*/
static struct dm_cache_policy_type wb_policy_type = {
.name = "cleaner",
.version = {1, 0, 0},
.hint_size = 0,
.owner = THIS_MODULE,
.create = wb_create
};
static int __init wb_init(void)
{
int r = dm_cache_policy_register(&wb_policy_type);
if (r < 0)
DMERR("register failed %d", r);
else
DMINFO("version %u.%u.%u loaded",
wb_policy_type.version[0],
wb_policy_type.version[1],
wb_policy_type.version[2]);
return r;
}
static void __exit wb_exit(void)
{
dm_cache_policy_unregister(&wb_policy_type);
}
module_init(wb_init);
module_exit(wb_exit);
MODULE_AUTHOR("Heinz Mauelshagen <dm-devel@redhat.com>");
MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("cleaner cache policy");

View file

@ -0,0 +1,131 @@
/*
* Copyright (C) 2012 Red Hat. All rights reserved.
*
* This file is released under the GPL.
*/
#ifndef DM_CACHE_POLICY_INTERNAL_H
#define DM_CACHE_POLICY_INTERNAL_H
#include "dm-cache-policy.h"
/*----------------------------------------------------------------*/
/*
* Little inline functions that simplify calling the policy methods.
*/
static inline int policy_map(struct dm_cache_policy *p, dm_oblock_t oblock,
bool can_block, bool can_migrate, bool discarded_oblock,
struct bio *bio, struct policy_result *result)
{
return p->map(p, oblock, can_block, can_migrate, discarded_oblock, bio, result);
}
static inline int policy_lookup(struct dm_cache_policy *p, dm_oblock_t oblock, dm_cblock_t *cblock)
{
BUG_ON(!p->lookup);
return p->lookup(p, oblock, cblock);
}
static inline void policy_set_dirty(struct dm_cache_policy *p, dm_oblock_t oblock)
{
if (p->set_dirty)
p->set_dirty(p, oblock);
}
static inline void policy_clear_dirty(struct dm_cache_policy *p, dm_oblock_t oblock)
{
if (p->clear_dirty)
p->clear_dirty(p, oblock);
}
static inline int policy_load_mapping(struct dm_cache_policy *p,
dm_oblock_t oblock, dm_cblock_t cblock,
uint32_t hint, bool hint_valid)
{
return p->load_mapping(p, oblock, cblock, hint, hint_valid);
}
static inline int policy_walk_mappings(struct dm_cache_policy *p,
policy_walk_fn fn, void *context)
{
return p->walk_mappings ? p->walk_mappings(p, fn, context) : 0;
}
static inline int policy_writeback_work(struct dm_cache_policy *p,
dm_oblock_t *oblock,
dm_cblock_t *cblock)
{
return p->writeback_work ? p->writeback_work(p, oblock, cblock) : -ENOENT;
}
static inline void policy_remove_mapping(struct dm_cache_policy *p, dm_oblock_t oblock)
{
p->remove_mapping(p, oblock);
}
static inline int policy_remove_cblock(struct dm_cache_policy *p, dm_cblock_t cblock)
{
return p->remove_cblock(p, cblock);
}
static inline void policy_force_mapping(struct dm_cache_policy *p,
dm_oblock_t current_oblock, dm_oblock_t new_oblock)
{
return p->force_mapping(p, current_oblock, new_oblock);
}
static inline dm_cblock_t policy_residency(struct dm_cache_policy *p)
{
return p->residency(p);
}
static inline void policy_tick(struct dm_cache_policy *p)
{
if (p->tick)
return p->tick(p);
}
static inline int policy_emit_config_values(struct dm_cache_policy *p, char *result, unsigned maxlen)
{
ssize_t sz = 0;
if (p->emit_config_values)
return p->emit_config_values(p, result, maxlen);
DMEMIT("0");
return 0;
}
static inline int policy_set_config_value(struct dm_cache_policy *p,
const char *key, const char *value)
{
return p->set_config_value ? p->set_config_value(p, key, value) : -EINVAL;
}
/*----------------------------------------------------------------*/
/*
* Creates a new cache policy given a policy name, a cache size, an origin size and the block size.
*/
struct dm_cache_policy *dm_cache_policy_create(const char *name, dm_cblock_t cache_size,
sector_t origin_size, sector_t block_size);
/*
* Destroys the policy. This drops references to the policy module as well
* as calling it's destroy method. So always use this rather than calling
* the policy->destroy method directly.
*/
void dm_cache_policy_destroy(struct dm_cache_policy *p);
/*
* In case we've forgotten.
*/
const char *dm_cache_policy_get_name(struct dm_cache_policy *p);
const unsigned *dm_cache_policy_get_version(struct dm_cache_policy *p);
size_t dm_cache_policy_get_hint_size(struct dm_cache_policy *p);
/*----------------------------------------------------------------*/
#endif /* DM_CACHE_POLICY_INTERNAL_H */

File diff suppressed because it is too large Load diff

View file

@ -0,0 +1,173 @@
/*
* Copyright (C) 2012 Red Hat. All rights reserved.
*
* This file is released under the GPL.
*/
#include "dm-cache-policy-internal.h"
#include "dm.h"
#include <linux/module.h>
#include <linux/slab.h>
/*----------------------------------------------------------------*/
#define DM_MSG_PREFIX "cache-policy"
static DEFINE_SPINLOCK(register_lock);
static LIST_HEAD(register_list);
static struct dm_cache_policy_type *__find_policy(const char *name)
{
struct dm_cache_policy_type *t;
list_for_each_entry(t, &register_list, list)
if (!strcmp(t->name, name))
return t;
return NULL;
}
static struct dm_cache_policy_type *__get_policy_once(const char *name)
{
struct dm_cache_policy_type *t = __find_policy(name);
if (t && !try_module_get(t->owner)) {
DMWARN("couldn't get module %s", name);
t = ERR_PTR(-EINVAL);
}
return t;
}
static struct dm_cache_policy_type *get_policy_once(const char *name)
{
struct dm_cache_policy_type *t;
spin_lock(&register_lock);
t = __get_policy_once(name);
spin_unlock(&register_lock);
return t;
}
static struct dm_cache_policy_type *get_policy(const char *name)
{
struct dm_cache_policy_type *t;
t = get_policy_once(name);
if (IS_ERR(t))
return NULL;
if (t)
return t;
request_module("dm-cache-%s", name);
t = get_policy_once(name);
if (IS_ERR(t))
return NULL;
return t;
}
static void put_policy(struct dm_cache_policy_type *t)
{
module_put(t->owner);
}
int dm_cache_policy_register(struct dm_cache_policy_type *type)
{
int r;
/* One size fits all for now */
if (type->hint_size != 0 && type->hint_size != 4) {
DMWARN("hint size must be 0 or 4 but %llu supplied.", (unsigned long long) type->hint_size);
return -EINVAL;
}
spin_lock(&register_lock);
if (__find_policy(type->name)) {
DMWARN("attempt to register policy under duplicate name %s", type->name);
r = -EINVAL;
} else {
list_add(&type->list, &register_list);
r = 0;
}
spin_unlock(&register_lock);
return r;
}
EXPORT_SYMBOL_GPL(dm_cache_policy_register);
void dm_cache_policy_unregister(struct dm_cache_policy_type *type)
{
spin_lock(&register_lock);
list_del_init(&type->list);
spin_unlock(&register_lock);
}
EXPORT_SYMBOL_GPL(dm_cache_policy_unregister);
struct dm_cache_policy *dm_cache_policy_create(const char *name,
dm_cblock_t cache_size,
sector_t origin_size,
sector_t cache_block_size)
{
struct dm_cache_policy *p = NULL;
struct dm_cache_policy_type *type;
type = get_policy(name);
if (!type) {
DMWARN("unknown policy type");
return ERR_PTR(-EINVAL);
}
p = type->create(cache_size, origin_size, cache_block_size);
if (!p) {
put_policy(type);
return ERR_PTR(-ENOMEM);
}
p->private = type;
return p;
}
EXPORT_SYMBOL_GPL(dm_cache_policy_create);
void dm_cache_policy_destroy(struct dm_cache_policy *p)
{
struct dm_cache_policy_type *t = p->private;
p->destroy(p);
put_policy(t);
}
EXPORT_SYMBOL_GPL(dm_cache_policy_destroy);
const char *dm_cache_policy_get_name(struct dm_cache_policy *p)
{
struct dm_cache_policy_type *t = p->private;
/* if t->real is set then an alias was used (e.g. "default") */
if (t->real)
return t->real->name;
return t->name;
}
EXPORT_SYMBOL_GPL(dm_cache_policy_get_name);
const unsigned *dm_cache_policy_get_version(struct dm_cache_policy *p)
{
struct dm_cache_policy_type *t = p->private;
return t->version;
}
EXPORT_SYMBOL_GPL(dm_cache_policy_get_version);
size_t dm_cache_policy_get_hint_size(struct dm_cache_policy *p)
{
struct dm_cache_policy_type *t = p->private;
return t->hint_size;
}
EXPORT_SYMBOL_GPL(dm_cache_policy_get_hint_size);
/*----------------------------------------------------------------*/

View file

@ -0,0 +1,249 @@
/*
* Copyright (C) 2012 Red Hat. All rights reserved.
*
* This file is released under the GPL.
*/
#ifndef DM_CACHE_POLICY_H
#define DM_CACHE_POLICY_H
#include "dm-cache-block-types.h"
#include <linux/device-mapper.h>
/*----------------------------------------------------------------*/
/* FIXME: make it clear which methods are optional. Get debug policy to
* double check this at start.
*/
/*
* The cache policy makes the important decisions about which blocks get to
* live on the faster cache device.
*
* When the core target has to remap a bio it calls the 'map' method of the
* policy. This returns an instruction telling the core target what to do.
*
* POLICY_HIT:
* That block is in the cache. Remap to the cache and carry on.
*
* POLICY_MISS:
* This block is on the origin device. Remap and carry on.
*
* POLICY_NEW:
* This block is currently on the origin device, but the policy wants to
* move it. The core should:
*
* - hold any further io to this origin block
* - copy the origin to the given cache block
* - release all the held blocks
* - remap the original block to the cache
*
* POLICY_REPLACE:
* This block is currently on the origin device. The policy wants to
* move it to the cache, with the added complication that the destination
* cache block needs a writeback first. The core should:
*
* - hold any further io to this origin block
* - hold any further io to the origin block that's being written back
* - writeback
* - copy new block to cache
* - release held blocks
* - remap bio to cache and reissue.
*
* Should the core run into trouble while processing a POLICY_NEW or
* POLICY_REPLACE instruction it will roll back the policies mapping using
* remove_mapping() or force_mapping(). These methods must not fail. This
* approach avoids having transactional semantics in the policy (ie, the
* core informing the policy when a migration is complete), and hence makes
* it easier to write new policies.
*
* In general policy methods should never block, except in the case of the
* map function when can_migrate is set. So be careful to implement using
* bounded, preallocated memory.
*/
enum policy_operation {
POLICY_HIT,
POLICY_MISS,
POLICY_NEW,
POLICY_REPLACE
};
/*
* This is the instruction passed back to the core target.
*/
struct policy_result {
enum policy_operation op;
dm_oblock_t old_oblock; /* POLICY_REPLACE */
dm_cblock_t cblock; /* POLICY_HIT, POLICY_NEW, POLICY_REPLACE */
};
typedef int (*policy_walk_fn)(void *context, dm_cblock_t cblock,
dm_oblock_t oblock, uint32_t hint);
/*
* The cache policy object. Just a bunch of methods. It is envisaged that
* this structure will be embedded in a bigger, policy specific structure
* (ie. use container_of()).
*/
struct dm_cache_policy {
/*
* FIXME: make it clear which methods are optional, and which may
* block.
*/
/*
* Destroys this object.
*/
void (*destroy)(struct dm_cache_policy *p);
/*
* See large comment above.
*
* oblock - the origin block we're interested in.
*
* can_block - indicates whether the current thread is allowed to
* block. -EWOULDBLOCK returned if it can't and would.
*
* can_migrate - gives permission for POLICY_NEW or POLICY_REPLACE
* instructions. If denied and the policy would have
* returned one of these instructions it should
* return -EWOULDBLOCK.
*
* discarded_oblock - indicates whether the whole origin block is
* in a discarded state (FIXME: better to tell the
* policy about this sooner, so it can recycle that
* cache block if it wants.)
* bio - the bio that triggered this call.
* result - gets filled in with the instruction.
*
* May only return 0, or -EWOULDBLOCK (if !can_migrate)
*/
int (*map)(struct dm_cache_policy *p, dm_oblock_t oblock,
bool can_block, bool can_migrate, bool discarded_oblock,
struct bio *bio, struct policy_result *result);
/*
* Sometimes we want to see if a block is in the cache, without
* triggering any update of stats. (ie. it's not a real hit).
*
* Must not block.
*
* Returns 0 if in cache, -ENOENT if not, < 0 for other errors
* (-EWOULDBLOCK would be typical).
*/
int (*lookup)(struct dm_cache_policy *p, dm_oblock_t oblock, dm_cblock_t *cblock);
void (*set_dirty)(struct dm_cache_policy *p, dm_oblock_t oblock);
void (*clear_dirty)(struct dm_cache_policy *p, dm_oblock_t oblock);
/*
* Called when a cache target is first created. Used to load a
* mapping from the metadata device into the policy.
*/
int (*load_mapping)(struct dm_cache_policy *p, dm_oblock_t oblock,
dm_cblock_t cblock, uint32_t hint, bool hint_valid);
int (*walk_mappings)(struct dm_cache_policy *p, policy_walk_fn fn,
void *context);
/*
* Override functions used on the error paths of the core target.
* They must succeed.
*/
void (*remove_mapping)(struct dm_cache_policy *p, dm_oblock_t oblock);
void (*force_mapping)(struct dm_cache_policy *p, dm_oblock_t current_oblock,
dm_oblock_t new_oblock);
/*
* This is called via the invalidate_cblocks message. It is
* possible the particular cblock has already been removed due to a
* write io in passthrough mode. In which case this should return
* -ENODATA.
*/
int (*remove_cblock)(struct dm_cache_policy *p, dm_cblock_t cblock);
/*
* Provide a dirty block to be written back by the core target.
*
* Returns:
*
* 0 and @cblock,@oblock: block to write back provided
*
* -ENODATA: no dirty blocks available
*/
int (*writeback_work)(struct dm_cache_policy *p, dm_oblock_t *oblock, dm_cblock_t *cblock);
/*
* How full is the cache?
*/
dm_cblock_t (*residency)(struct dm_cache_policy *p);
/*
* Because of where we sit in the block layer, we can be asked to
* map a lot of little bios that are all in the same block (no
* queue merging has occurred). To stop the policy being fooled by
* these the core target sends regular tick() calls to the policy.
* The policy should only count an entry as hit once per tick.
*/
void (*tick)(struct dm_cache_policy *p);
/*
* Configuration.
*/
int (*emit_config_values)(struct dm_cache_policy *p,
char *result, unsigned maxlen);
int (*set_config_value)(struct dm_cache_policy *p,
const char *key, const char *value);
/*
* Book keeping ptr for the policy register, not for general use.
*/
void *private;
};
/*----------------------------------------------------------------*/
/*
* We maintain a little register of the different policy types.
*/
#define CACHE_POLICY_NAME_SIZE 16
#define CACHE_POLICY_VERSION_SIZE 3
struct dm_cache_policy_type {
/* For use by the register code only. */
struct list_head list;
/*
* Policy writers should fill in these fields. The name field is
* what gets passed on the target line to select your policy.
*/
char name[CACHE_POLICY_NAME_SIZE];
unsigned version[CACHE_POLICY_VERSION_SIZE];
/*
* For use by an alias dm_cache_policy_type to point to the
* real dm_cache_policy_type.
*/
struct dm_cache_policy_type *real;
/*
* Policies may store a hint for each each cache block.
* Currently the size of this hint must be 0 or 4 bytes but we
* expect to relax this in future.
*/
size_t hint_size;
struct module *owner;
struct dm_cache_policy *(*create)(dm_cblock_t cache_size,
sector_t origin_size,
sector_t block_size);
};
int dm_cache_policy_register(struct dm_cache_policy_type *type);
void dm_cache_policy_unregister(struct dm_cache_policy_type *type);
/*----------------------------------------------------------------*/
#endif /* DM_CACHE_POLICY_H */

3169
drivers/md/dm-cache-target.c Normal file

File diff suppressed because it is too large Load diff

2175
drivers/md/dm-crypt.c Normal file

File diff suppressed because it is too large Load diff

373
drivers/md/dm-delay.c Normal file
View file

@ -0,0 +1,373 @@
/*
* Copyright (C) 2005-2007 Red Hat GmbH
*
* A target that delays reads and/or writes and can send
* them to different devices.
*
* This file is released under the GPL.
*/
#include <linux/module.h>
#include <linux/init.h>
#include <linux/blkdev.h>
#include <linux/bio.h>
#include <linux/slab.h>
#include <linux/device-mapper.h>
#define DM_MSG_PREFIX "delay"
struct delay_c {
struct timer_list delay_timer;
struct mutex timer_lock;
struct workqueue_struct *kdelayd_wq;
struct work_struct flush_expired_bios;
struct list_head delayed_bios;
atomic_t may_delay;
struct dm_dev *dev_read;
sector_t start_read;
unsigned read_delay;
unsigned reads;
struct dm_dev *dev_write;
sector_t start_write;
unsigned write_delay;
unsigned writes;
};
struct dm_delay_info {
struct delay_c *context;
struct list_head list;
unsigned long expires;
};
static DEFINE_MUTEX(delayed_bios_lock);
static void handle_delayed_timer(unsigned long data)
{
struct delay_c *dc = (struct delay_c *)data;
queue_work(dc->kdelayd_wq, &dc->flush_expired_bios);
}
static void queue_timeout(struct delay_c *dc, unsigned long expires)
{
mutex_lock(&dc->timer_lock);
if (!timer_pending(&dc->delay_timer) || expires < dc->delay_timer.expires)
mod_timer(&dc->delay_timer, expires);
mutex_unlock(&dc->timer_lock);
}
static void flush_bios(struct bio *bio)
{
struct bio *n;
while (bio) {
n = bio->bi_next;
bio->bi_next = NULL;
generic_make_request(bio);
bio = n;
}
}
static struct bio *flush_delayed_bios(struct delay_c *dc, int flush_all)
{
struct dm_delay_info *delayed, *next;
unsigned long next_expires = 0;
int start_timer = 0;
struct bio_list flush_bios = { };
mutex_lock(&delayed_bios_lock);
list_for_each_entry_safe(delayed, next, &dc->delayed_bios, list) {
if (flush_all || time_after_eq(jiffies, delayed->expires)) {
struct bio *bio = dm_bio_from_per_bio_data(delayed,
sizeof(struct dm_delay_info));
list_del(&delayed->list);
bio_list_add(&flush_bios, bio);
if ((bio_data_dir(bio) == WRITE))
delayed->context->writes--;
else
delayed->context->reads--;
continue;
}
if (!start_timer) {
start_timer = 1;
next_expires = delayed->expires;
} else
next_expires = min(next_expires, delayed->expires);
}
mutex_unlock(&delayed_bios_lock);
if (start_timer)
queue_timeout(dc, next_expires);
return bio_list_get(&flush_bios);
}
static void flush_expired_bios(struct work_struct *work)
{
struct delay_c *dc;
dc = container_of(work, struct delay_c, flush_expired_bios);
flush_bios(flush_delayed_bios(dc, 0));
}
/*
* Mapping parameters:
* <device> <offset> <delay> [<write_device> <write_offset> <write_delay>]
*
* With separate write parameters, the first set is only used for reads.
* Delays are specified in milliseconds.
*/
static int delay_ctr(struct dm_target *ti, unsigned int argc, char **argv)
{
struct delay_c *dc;
unsigned long long tmpll;
char dummy;
if (argc != 3 && argc != 6) {
ti->error = "requires exactly 3 or 6 arguments";
return -EINVAL;
}
dc = kmalloc(sizeof(*dc), GFP_KERNEL);
if (!dc) {
ti->error = "Cannot allocate context";
return -ENOMEM;
}
dc->reads = dc->writes = 0;
if (sscanf(argv[1], "%llu%c", &tmpll, &dummy) != 1) {
ti->error = "Invalid device sector";
goto bad;
}
dc->start_read = tmpll;
if (sscanf(argv[2], "%u%c", &dc->read_delay, &dummy) != 1) {
ti->error = "Invalid delay";
goto bad;
}
if (dm_get_device(ti, argv[0], dm_table_get_mode(ti->table),
&dc->dev_read)) {
ti->error = "Device lookup failed";
goto bad;
}
dc->dev_write = NULL;
if (argc == 3)
goto out;
if (sscanf(argv[4], "%llu%c", &tmpll, &dummy) != 1) {
ti->error = "Invalid write device sector";
goto bad_dev_read;
}
dc->start_write = tmpll;
if (sscanf(argv[5], "%u%c", &dc->write_delay, &dummy) != 1) {
ti->error = "Invalid write delay";
goto bad_dev_read;
}
if (dm_get_device(ti, argv[3], dm_table_get_mode(ti->table),
&dc->dev_write)) {
ti->error = "Write device lookup failed";
goto bad_dev_read;
}
out:
dc->kdelayd_wq = alloc_workqueue("kdelayd", WQ_MEM_RECLAIM, 0);
if (!dc->kdelayd_wq) {
DMERR("Couldn't start kdelayd");
goto bad_queue;
}
setup_timer(&dc->delay_timer, handle_delayed_timer, (unsigned long)dc);
INIT_WORK(&dc->flush_expired_bios, flush_expired_bios);
INIT_LIST_HEAD(&dc->delayed_bios);
mutex_init(&dc->timer_lock);
atomic_set(&dc->may_delay, 1);
ti->num_flush_bios = 1;
ti->num_discard_bios = 1;
ti->per_bio_data_size = sizeof(struct dm_delay_info);
ti->private = dc;
return 0;
bad_queue:
if (dc->dev_write)
dm_put_device(ti, dc->dev_write);
bad_dev_read:
dm_put_device(ti, dc->dev_read);
bad:
kfree(dc);
return -EINVAL;
}
static void delay_dtr(struct dm_target *ti)
{
struct delay_c *dc = ti->private;
destroy_workqueue(dc->kdelayd_wq);
dm_put_device(ti, dc->dev_read);
if (dc->dev_write)
dm_put_device(ti, dc->dev_write);
kfree(dc);
}
static int delay_bio(struct delay_c *dc, int delay, struct bio *bio)
{
struct dm_delay_info *delayed;
unsigned long expires = 0;
if (!delay || !atomic_read(&dc->may_delay))
return 1;
delayed = dm_per_bio_data(bio, sizeof(struct dm_delay_info));
delayed->context = dc;
delayed->expires = expires = jiffies + (delay * HZ / 1000);
mutex_lock(&delayed_bios_lock);
if (bio_data_dir(bio) == WRITE)
dc->writes++;
else
dc->reads++;
list_add_tail(&delayed->list, &dc->delayed_bios);
mutex_unlock(&delayed_bios_lock);
queue_timeout(dc, expires);
return 0;
}
static void delay_presuspend(struct dm_target *ti)
{
struct delay_c *dc = ti->private;
atomic_set(&dc->may_delay, 0);
del_timer_sync(&dc->delay_timer);
flush_bios(flush_delayed_bios(dc, 1));
}
static void delay_resume(struct dm_target *ti)
{
struct delay_c *dc = ti->private;
atomic_set(&dc->may_delay, 1);
}
static int delay_map(struct dm_target *ti, struct bio *bio)
{
struct delay_c *dc = ti->private;
if ((bio_data_dir(bio) == WRITE) && (dc->dev_write)) {
bio->bi_bdev = dc->dev_write->bdev;
if (bio_sectors(bio))
bio->bi_iter.bi_sector = dc->start_write +
dm_target_offset(ti, bio->bi_iter.bi_sector);
return delay_bio(dc, dc->write_delay, bio);
}
bio->bi_bdev = dc->dev_read->bdev;
bio->bi_iter.bi_sector = dc->start_read +
dm_target_offset(ti, bio->bi_iter.bi_sector);
return delay_bio(dc, dc->read_delay, bio);
}
static void delay_status(struct dm_target *ti, status_type_t type,
unsigned status_flags, char *result, unsigned maxlen)
{
struct delay_c *dc = ti->private;
int sz = 0;
switch (type) {
case STATUSTYPE_INFO:
DMEMIT("%u %u", dc->reads, dc->writes);
break;
case STATUSTYPE_TABLE:
DMEMIT("%s %llu %u", dc->dev_read->name,
(unsigned long long) dc->start_read,
dc->read_delay);
if (dc->dev_write)
DMEMIT(" %s %llu %u", dc->dev_write->name,
(unsigned long long) dc->start_write,
dc->write_delay);
break;
}
}
static int delay_iterate_devices(struct dm_target *ti,
iterate_devices_callout_fn fn, void *data)
{
struct delay_c *dc = ti->private;
int ret = 0;
ret = fn(ti, dc->dev_read, dc->start_read, ti->len, data);
if (ret)
goto out;
if (dc->dev_write)
ret = fn(ti, dc->dev_write, dc->start_write, ti->len, data);
out:
return ret;
}
static struct target_type delay_target = {
.name = "delay",
.version = {1, 2, 1},
.module = THIS_MODULE,
.ctr = delay_ctr,
.dtr = delay_dtr,
.map = delay_map,
.presuspend = delay_presuspend,
.resume = delay_resume,
.status = delay_status,
.iterate_devices = delay_iterate_devices,
};
static int __init dm_delay_init(void)
{
int r;
r = dm_register_target(&delay_target);
if (r < 0) {
DMERR("register failed %d", r);
goto bad_register;
}
return 0;
bad_register:
return r;
}
static void __exit dm_delay_exit(void)
{
dm_unregister_target(&delay_target);
}
/* Module hooks */
module_init(dm_delay_init);
module_exit(dm_delay_exit);
MODULE_DESCRIPTION(DM_NAME " delay target");
MODULE_AUTHOR("Heinz Mauelshagen <mauelshagen@redhat.com>");
MODULE_LICENSE("GPL");

768
drivers/md/dm-dirty.c Executable file
View file

@ -0,0 +1,768 @@
/*
* Copyright (C) 2012 Red Hat, Inc.
*
* Author: Mikulas Patocka <mpatocka@redhat.com>
*
* Based on Chromium dm-verity driver (C) 2011 The Chromium OS Authors
*
* This file is released under the GPLv2.
*/
/* Author: Jitesh Shah <j1.shah@sta.samsung.com>
* 27 March 2014.
*
* Updated dm-verity file to include write-able target. dm-dirty.
* Write-able target for dm-verity.
*/
#include "dm-bufio.h"
#include <linux/module.h>
#include <linux/device-mapper.h>
#include <crypto/hash.h>
#include <linux/vmalloc.h>
#include <linux/string.h>
#define REBUILD_TREE 1
/* Copied from dm-verity.c. Name also retained */
#define DM_VERITY_MEMPOOL_SIZE 64
#define DM_VERITY_MAX_LEVELS 63
struct dm_verity {
struct dm_dev *data_dev;
struct dm_dev *hash_dev;
struct dm_target *ti;
struct dm_bufio_client *bufio;
char *alg_name;
struct crypto_shash *tfm;
u8 *root_digest; /* digest of the root block */
u8 *salt; /* salt: its size is salt_size */
unsigned salt_size;
sector_t data_start; /* data offset in 512-byte sectors */
sector_t hash_start; /* hash start in blocks */
sector_t data_blocks; /* the number of data blocks */
sector_t hash_blocks; /* the number of hash blocks */
unsigned char data_dev_block_bits; /* log2(data blocksize) */
unsigned char hash_dev_block_bits; /* log2(hash blocksize) */
unsigned char hash_per_block_bits; /* log2(hashes in hash block) */
unsigned char levels; /* the number of tree levels */
unsigned char version;
unsigned digest_size; /* digest size for the current hash algorithm */
unsigned shash_descsize;/* the size of temporary space for crypto */
int hash_failed; /* set to 1 if hash of any block failed */
mempool_t *io_mempool; /* mempool of struct dm_verity_io */
mempool_t *vec_mempool; /* mempool of bio vector */
struct workqueue_struct *verify_wq;
/* starting blocks for each tree level. 0 is the lowest level. */
sector_t hash_level_block[DM_VERITY_MAX_LEVELS];
};
static inline u64 sector_to_block(struct dm_verity *v, u64 sector);
static sector_t verity_position_at_level(struct dm_verity *v, sector_t block,
int level)
{
return block >> (level * v->hash_per_block_bits);
}
static void verity_hash_at_level(struct dm_verity *v, sector_t block, int level,
sector_t *hash_block, unsigned *offset)
{
sector_t position = verity_position_at_level(v, block, level);
unsigned idx;
*hash_block = v->hash_level_block[level] + (position >> v->hash_per_block_bits);
if (!offset)
return;
idx = position & ((1 << v->hash_per_block_bits) - 1);
if (!v->version)
*offset = idx * v->digest_size;
else
*offset = idx << (v->hash_dev_block_bits - v->hash_per_block_bits);
}
/* End of copied stuff */
struct dm_dirty_io {
struct dm_verity *v;
void *orig_bi_private;
struct bio *bio;
bio_end_io_t *orig_bi_end_io;
struct work_struct work;
struct bvec_iter iter;
};
#define DM_MSG_PREFIX "dirty"
static void dirty_finish_io(struct dm_dirty_io *io, struct bio *bio, int error)
{
bio->bi_end_io = io->orig_bi_end_io;
bio->bi_private = io->orig_bi_private;
bio_endio_nodec(bio, error);
}
static void dirty_hash(struct bio *bio, int error)
{
struct bio_vec *iovec;
int i, ret;
u8* page;
u8* result;
struct shash_desc *desc;
struct dm_dirty_io *io = bio->bi_private;
struct dm_buffer *buf = NULL;
u8* hash_block_buf;
sector_t hash_block;
unsigned offset;
struct dm_verity *v = io->v;
unsigned short idx = io->iter.bi_idx;
/* At this point only WRITE bios and non-error bios are allowed.
* bio_data_dir() and error should have been checked before.
*/
/* TODO: Assumption here is that the FS sends block sized buffers. This seems to be
* true for ext4 but should not be assumed. Re-factor this code to remove that
* assumption.
*/
for(i = 0; i < (io->iter.bi_size >> v->data_dev_block_bits); i++) {
/* Loop over all blocks */
verity_hash_at_level(v, sector_to_block(v, io->iter.bi_sector) + (sector_t)i, 0, &hash_block, &offset);
hash_block_buf = dm_bufio_read(v->bufio, hash_block, &buf);
if (unlikely(IS_ERR(hash_block_buf))) {
DMERR("Error getting hash block");
buf = NULL;
goto free_all_res;
}
desc = (struct shash_desc *)(io + 1);
desc->tfm = v->tfm;
desc->flags = CRYPTO_TFM_REQ_MAY_SLEEP;
ret = crypto_shash_init(desc);
if (ret < 0) {
DMERR("Error initing hash");
goto free_all_res;
}
ret = crypto_shash_update(desc, v->salt, v->salt_size);
if (ret < 0) {
DMERR("Error updating salt hash");
goto free_all_res;
}
//iovec = (bio)->bi_io_vec + i;
iovec = bio_iovec_idx(bio, idx + i);
/*
if (idx + i >= bio->bi_vcnt) {
DMERR("Excedding vcnt: idx:%hu vcnt:%hu i:%d", idx, bio->bi_vcnt, i);
goto free_all_res;
}
*/
page = kmap_atomic(iovec->bv_page);
if(page == NULL) {
DMERR("Failed to kmap dirty page");
goto free_bufio;
}
if(iovec->bv_len != (1 << v->data_dev_block_bits)) {
DMERR("Non-page data len: %u offset: %u", iovec->bv_len, iovec->bv_offset);
}
ret = crypto_shash_update(desc, page + iovec->bv_offset, iovec->bv_len);
if (ret < 0) {
DMERR("Error updating data hash");
}
kunmap_atomic(page);
result = hash_block_buf + offset;
ret = crypto_shash_final(desc, result);
if (ret < 0) {
DMERR("Error finalizing hash");
}
if (0 == sector_to_block(v, io->iter.bi_sector) + (sector_t)i) {
// for data block 0, we use dummy hash
memset(result, 1, v->digest_size);
}
dm_bufio_mark_buffer_dirty(buf);
free_bufio:
dm_bufio_release(buf);
buf = NULL;
}
free_all_res:
if (buf)
dm_bufio_release(buf);
dirty_finish_io(io, bio, error);
}
static void dirty_work(struct work_struct *w)
{
struct dm_dirty_io *io = container_of(w, struct dm_dirty_io, work);
dirty_hash(io->bio, 0);
}
static void dirty_end_io(struct bio *bio, int error)
{
struct dm_dirty_io *io = bio->bi_private;
if (error) {
/* If there was an error writing the data, do not recalculate the hash.
* Just go ahead and finish the bio
*/
dirty_finish_io(io, bio, error);
return;
}
INIT_WORK(&io->work, dirty_work);
queue_work(io->v->verify_wq, &io->work);
}
static inline u64 sector_to_block(struct dm_verity *v, u64 sector)
{
return (sector >> (v->data_dev_block_bits - SECTOR_SHIFT));
}
static inline unsigned bytes_to_block(struct dm_verity *v, unsigned bytes)
{
return (bytes >> v->data_dev_block_bits);
}
static int dirty_map(struct dm_target *ti, struct bio *bio)
{
struct dm_dirty_io *io;
struct dm_verity *v = ti->private;
/* If a WRITE coems with FLUSH flag set, we don't need to do anything special.
* Its not the end of the world if the hash is not updated to the disk right
* away.
*/
if (bio_data_dir(bio) != WRITE || bio->bi_iter.bi_size == 0)
goto process_bio;
if (bio->bi_iter.bi_size & ((1 << v->data_dev_block_bits) - 1) ||
(bio->bi_iter.bi_sector & ((1 << (v->data_dev_block_bits - SECTOR_SHIFT)) - 1))) {
DMERR("Not page size IO: size:%u, sector:%llu", bio->bi_iter.bi_size, (long long unsigned int)bio->bi_iter.bi_sector);
return -EIO;
}
if(sector_to_block(v, bio->bi_iter.bi_sector) + bytes_to_block(v, bio->bi_iter.bi_size) > v->data_blocks) {
DMERR("Out of range. sector:%llu, size:%u", (long long unsigned int)bio->bi_iter.bi_sector, bio->bi_iter.bi_size);
return -EIO;
}
/* Allocate an io structure and save required info */
io = mempool_alloc(v->io_mempool, GFP_NOIO);
io->v = v;
io->bio = bio;
io->orig_bi_end_io = bio->bi_end_io;
io->orig_bi_private = bio->bi_private;
io->iter = bio->bi_iter;
/* When bio is done. bi_sector points to offset from start of device, NOT the partition
* Size is changed to 0 (the residual size). so save these values beforehand.
*/
/* Link io to bio structure now and assign an end function */
bio->bi_private = io;
bio->bi_end_io = dirty_end_io;
process_bio:
/* Submit bio now */
bio->bi_bdev = v->data_dev->bdev;
submit_bio(bio->bi_rw, bio);
return DM_MAPIO_SUBMITTED;
}
static sector_t dirty_nr_blocks_at_level(struct dm_verity *v, int level)
{
if (level == 0) {
return v->hash_blocks - v->hash_level_block[0];
} else
return v->hash_level_block[level - 1] - v->hash_level_block[level];
}
static void dirty_one_level_up(struct dm_verity *v, sector_t block, int source_level, sector_t *hash_block, unsigned *offset)
{
sector_t block_offset;
block_offset = block - v->hash_level_block[source_level];
*hash_block = v->hash_level_block[source_level + 1] + (block_offset >> v->hash_per_block_bits);
*offset = (block_offset & ((1 << v->hash_per_block_bits)-1)) << (v->hash_dev_block_bits - v->hash_per_block_bits);
}
static int dirty_update_hash(struct dm_verity *v, sector_t block, struct shash_desc *desc, u8 *result)
{
struct dm_buffer *buf = NULL;
u8* hash_block_buf;
int ret = -1;
hash_block_buf = dm_bufio_read(v->bufio, block, &buf);
if (unlikely(IS_ERR(hash_block_buf))) {
DMERR("Error getting hash block in dtr");
buf = NULL;
return ret;
}
desc->tfm = v->tfm;
desc->flags = CRYPTO_TFM_REQ_MAY_SLEEP;
ret = crypto_shash_init(desc);
if (ret < 0) {
DMERR("Error initing hash");
goto free_all_res;
}
ret = crypto_shash_update(desc, v->salt, v->salt_size);
if (ret < 0) {
DMERR("Error updating salt hash");
goto free_all_res;
}
/* Sadly assumption is hash_block size = 4096. FIX THIS */
ret = crypto_shash_update(desc, hash_block_buf, 1 << v->hash_dev_block_bits);
if (ret < 0) {
DMERR("Error updating hash");
goto free_all_res;
}
ret = crypto_shash_final(desc, result);
if (ret < 0) {
DMERR("Error finalizing hash");
goto free_all_res;
}
ret = 0;
free_all_res:
dm_bufio_release(buf);
return ret;
}
static int finish_verity_tree(struct dm_verity *v)
{
sector_t hash_block, nr_blocks, block_num;
unsigned offset;
int i, ret = -1;
struct shash_desc *desc;
struct dm_buffer *buf = NULL;
u8* hash_block_buf;
u8 *result;
desc = kmalloc(sizeof(struct shash_desc) + crypto_shash_descsize(v->tfm) + v->digest_size, GFP_KERNEL);
if(NULL == desc) {
DMERR("Error allocating desc mem");
return ret;
}
result = (u8*)desc + sizeof(struct shash_desc) + crypto_shash_descsize(v->tfm);
for(i = 1; i < v->levels; i++) {
nr_blocks = dirty_nr_blocks_at_level(v, i - 1);
for(block_num = v->hash_level_block[i - 1]; nr_blocks > 0; nr_blocks--,block_num++) {
dirty_one_level_up(v, block_num, i - 1, &hash_block, &offset);
if (dirty_update_hash(v, block_num, desc, result)) {
DMERR("Error calculating hash for block %llu", (long long unsigned int)block_num);
goto free_desc;
}
hash_block_buf = dm_bufio_read(v->bufio, hash_block, &buf);
if (unlikely(IS_ERR(hash_block_buf))) {
DMERR("Error getting hash block %llu in dtr", (long long unsigned int)hash_block);
buf = NULL;
goto free_desc;
}
memcpy(hash_block_buf + offset, result, v->digest_size);
dm_bufio_mark_buffer_dirty(buf);
dm_bufio_release(buf);
}
}
/* Get last level node */
hash_block = v->hash_level_block[v->levels - 1];
/* Calculate root hash */
if(dirty_update_hash(v, hash_block, desc, result)) {
DMERR("Failed to calculate root hash. Block: %llu", (long long unsigned int)hash_block);
goto free_desc;
}
memcpy(v->root_digest, result, v->digest_size);
ret = 0;
free_desc:
kfree(desc);
return ret;
}
static void dirty_status(struct dm_target *ti, status_type_t type,
unsigned status_flags, char *result, unsigned maxlen)
{
return;
}
static int dirty_ioctl(struct dm_target *ti, unsigned cmd,
unsigned long arg)
{
struct dm_verity *v = ti->private;
struct {
unsigned int size;
int ret;
} output;
switch(cmd) {
case REBUILD_TREE:
if(__copy_from_user(&output.size, (void *)arg, sizeof(unsigned int))) {
DMINFO("Error in copy_from_user");
}
if (output.size < v->digest_size) {
DMERR("Allocated size (%u) less than root digest size (%u)", output.size,
v->digest_size);
output.size = 0;
output.ret = -1;
goto make_output;
}
/* All checks done. Lets finalize the hash */
if (finish_verity_tree(v)) {
output.ret = -1;
goto make_output;
}
output.size = v->digest_size;
output.ret = 0;
if(copy_to_user((void *)(arg + sizeof(output)), v->root_digest, v->digest_size)) {
DMERR("Failed to copy root digest");
output.ret = -1;
}
make_output:
if(copy_to_user((void *)arg, &output, sizeof(output)))
DMERR("Failed to copy to user");
break;
default:
DMERR("Unknown cmd: %u", cmd);
}
return 0;
}
static void dirty_dtr(struct dm_target *ti)
{
struct dm_verity *v = ti->private;
DMINFO("dirty destructor called");
if (v->verify_wq)
destroy_workqueue(v->verify_wq);
if (v->io_mempool)
mempool_destroy(v->io_mempool);
if (v->bufio)
dm_bufio_write_dirty_buffers(v->bufio);
if (v->bufio)
dm_bufio_client_destroy(v->bufio);
kfree(v->salt);
kfree(v->root_digest);
if (v->tfm)
crypto_free_shash(v->tfm);
kfree(v->alg_name);
if (v->hash_dev)
dm_put_device(ti, v->hash_dev);
if (v->data_dev)
dm_put_device(ti, v->data_dev);
kfree(v);
}
static int dirty_ctr(struct dm_target *ti, unsigned argc, char **argv)
{
struct dm_verity *v;
unsigned num;
unsigned long long num_ll;
int r;
int i;
sector_t hash_position;
u32 nr_hash_blocks;
char dummy;
v = kzalloc(sizeof(struct dm_verity), GFP_KERNEL);
if (!v) {
ti->error = "Cannot allocate verity structure";
return -ENOMEM;
}
ti->private = v;
v->ti = ti;
if (argc != 10) {
ti->error = "Invalid argument count: exactly 10 arguments required";
r = -EINVAL;
goto bad;
}
if (sscanf(argv[0], "%d%c", &num, &dummy) != 1 ||
num > 1) {
ti->error = "Invalid version";
r = -EINVAL;
goto bad;
}
v->version = num;
r = dm_get_device(ti, argv[1], FMODE_READ | FMODE_WRITE, &v->data_dev);
if (r) {
ti->error = "Data device lookup failed";
goto bad;
}
r = dm_get_device(ti, argv[2], FMODE_READ | FMODE_WRITE, &v->hash_dev);
if (r) {
ti->error = "Data device lookup failed";
goto bad;
}
if (sscanf(argv[3], "%u%c", &num, &dummy) != 1 ||
!num || (num & (num - 1)) ||
num < bdev_logical_block_size(v->data_dev->bdev) ||
num > PAGE_SIZE) {
ti->error = "Invalid data device block size";
r = -EINVAL;
goto bad;
}
v->data_dev_block_bits = ffs(num) - 1;
if (sscanf(argv[4], "%u%c", &num, &dummy) != 1 ||
!num || (num & (num - 1)) ||
num < bdev_logical_block_size(v->hash_dev->bdev) ||
num > INT_MAX) {
ti->error = "Invalid hash device block size";
r = -EINVAL;
goto bad;
}
v->hash_dev_block_bits = ffs(num) - 1;
if (sscanf(argv[5], "%llu%c", &num_ll, &dummy) != 1 ||
(sector_t)(num_ll << (v->data_dev_block_bits - SECTOR_SHIFT))
>> (v->data_dev_block_bits - SECTOR_SHIFT) != num_ll) {
ti->error = "Invalid data blocks";
r = -EINVAL;
goto bad;
}
v->data_blocks = num_ll;
if (ti->len > (v->data_blocks << (v->data_dev_block_bits - SECTOR_SHIFT))) {
ti->error = "Data device is too small";
r = -EINVAL;
goto bad;
}
if (sscanf(argv[6], "%llu%c", &num_ll, &dummy) != 1 ||
(sector_t)(num_ll << (v->hash_dev_block_bits - SECTOR_SHIFT))
>> (v->hash_dev_block_bits - SECTOR_SHIFT) != num_ll) {
ti->error = "Invalid hash start";
r = -EINVAL;
goto bad;
}
v->hash_start = num_ll;
v->alg_name = kstrdup(argv[7], GFP_KERNEL);
if (!v->alg_name) {
ti->error = "Cannot allocate algorithm name";
r = -ENOMEM;
goto bad;
}
v->tfm = crypto_alloc_shash(v->alg_name, 0, 0);
if (IS_ERR(v->tfm)) {
ti->error = "Cannot initialize hash function";
r = PTR_ERR(v->tfm);
v->tfm = NULL;
goto bad;
}
v->digest_size = crypto_shash_digestsize(v->tfm);
if ((1 << v->hash_dev_block_bits) < v->digest_size * 2) {
ti->error = "Digest size too big";
r = -EINVAL;
goto bad;
}
v->shash_descsize =
sizeof(struct shash_desc) + crypto_shash_descsize(v->tfm);
v->root_digest = kmalloc(v->digest_size, GFP_KERNEL);
if (!v->root_digest) {
ti->error = "Cannot allocate root digest";
r = -ENOMEM;
goto bad;
}
if (strlen(argv[8]) != v->digest_size * 2 ||
hex2bin(v->root_digest, argv[8], v->digest_size)) {
ti->error = "Invalid root digest";
r = -EINVAL;
goto bad;
}
if (strcmp(argv[9], "-")) {
v->salt_size = strlen(argv[9]) / 2;
v->salt = kmalloc(v->salt_size, GFP_KERNEL);
if (!v->salt) {
ti->error = "Cannot allocate salt";
r = -ENOMEM;
goto bad;
}
if (strlen(argv[9]) != v->salt_size * 2 ||
hex2bin(v->salt, argv[9], v->salt_size)) {
ti->error = "Invalid salt";
r = -EINVAL;
goto bad;
}
}
v->hash_per_block_bits =
fls((1 << v->hash_dev_block_bits) / v->digest_size) - 1;
v->levels = 0;
if (v->data_blocks)
while (v->hash_per_block_bits * v->levels < 64 &&
(unsigned long long)(v->data_blocks - 1) >>
(v->hash_per_block_bits * v->levels))
v->levels++;
if (v->levels > DM_VERITY_MAX_LEVELS) {
ti->error = "Too many tree levels";
r = -E2BIG;
goto bad;
}
hash_position = v->hash_start;
for (i = v->levels - 1; i >= 0; i--) {
sector_t s;
v->hash_level_block[i] = hash_position;
s = verity_position_at_level(v, v->data_blocks, i);
s = (s >> v->hash_per_block_bits) +
!!(s & ((1 << v->hash_per_block_bits) - 1));
if (hash_position + s < hash_position) {
ti->error = "Hash device offset overflow";
r = -E2BIG;
goto bad;
}
hash_position += s;
}
v->hash_blocks = hash_position;
/* Following typecast will only work for Small samsung disks */
nr_hash_blocks = (u32)(hash_position - v->hash_start);
v->bufio = dm_bufio_client_create(v->hash_dev->bdev,
1 << v->hash_dev_block_bits, nr_hash_blocks, 0,
NULL, NULL);
if (IS_ERR(v->bufio)) {
ti->error = "Cannot initialize dm-bufio";
r = PTR_ERR(v->bufio);
v->bufio = NULL;
goto bad;
}
if (dm_bufio_get_device_size(v->bufio) < v->hash_blocks) {
ti->error = "Hash device is too small";
r = -E2BIG;
goto bad;
}
/* We have reserved nr_hash_blocks. Prefetch all */
dm_bufio_prefetch(v->bufio, v->hash_start, nr_hash_blocks);
v->io_mempool = mempool_create_kmalloc_pool(DM_VERITY_MEMPOOL_SIZE,
sizeof(struct dm_dirty_io) + v->shash_descsize + v->digest_size);
if (!v->io_mempool) {
ti->error = "Cannot allocate io mempool";
r = -ENOMEM;
goto bad;
}
/* WQ_UNBOUND greatly improves performance when running on ramdisk */
v->verify_wq = alloc_workqueue("kdirtyd", WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM | WQ_UNBOUND, 1 /* num_online_cpus() */);
if (!v->verify_wq) {
ti->error = "Cannot allocate workqueue";
r = -ENOMEM;
goto bad;
}
DMINFO("dirty constructor succeeded\n");
return 0;
bad:
dirty_dtr(ti);
return r;
}
static int dirty_iterate_devices(struct dm_target *ti,
iterate_devices_callout_fn fn, void *data)
{
struct dm_verity *v = ti->private;
return fn(ti, v->data_dev, v->data_start, ti->len, data);
}
static void dirty_io_hints(struct dm_target *ti, struct queue_limits *limits)
{
struct dm_verity *v = ti->private;
if (limits->logical_block_size < 1 << v->data_dev_block_bits) {
limits->logical_block_size = 1 << v->data_dev_block_bits;
}
if (limits->physical_block_size < 1 << v->data_dev_block_bits) {
limits->physical_block_size = 1 << v->data_dev_block_bits;
}
blk_limits_io_min(limits, limits->logical_block_size);
}
static struct target_type dirty_target = {
.name = "dirty",
.version = {1, 0, 0},
.module = THIS_MODULE,
.ctr = dirty_ctr,
.dtr = dirty_dtr,
.map = dirty_map,
.status = dirty_status,
.ioctl = dirty_ioctl,
.iterate_devices = dirty_iterate_devices,
.io_hints = dirty_io_hints,
};
static int __init dm_dirty_init(void)
{
int ret;
ret = dm_register_target(&dirty_target);
if (ret < 0)
DMERR("dirty register failed %d", ret);
return ret;
}
static void __exit dm_dirty_exit(void)
{
dm_unregister_target(&dirty_target);
}
module_init(dm_dirty_init);
module_exit(dm_dirty_exit);
MODULE_AUTHOR("Jitesh Shah <j1.shah@sta.samsung.com>");
MODULE_DESCRIPTION(DM_NAME " update target for dm-verity");
MODULE_LICENSE("GPL");

1747
drivers/md/dm-era-target.c Normal file

File diff suppressed because it is too large Load diff

View file

@ -0,0 +1,290 @@
/*
* Copyright (C) 2001-2002 Sistina Software (UK) Limited.
* Copyright (C) 2006-2008 Red Hat GmbH
*
* This file is released under the GPL.
*/
#include "dm-exception-store.h"
#include <linux/ctype.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/vmalloc.h>
#include <linux/module.h>
#include <linux/slab.h>
#define DM_MSG_PREFIX "snapshot exception stores"
static LIST_HEAD(_exception_store_types);
static DEFINE_SPINLOCK(_lock);
static struct dm_exception_store_type *__find_exception_store_type(const char *name)
{
struct dm_exception_store_type *type;
list_for_each_entry(type, &_exception_store_types, list)
if (!strcmp(name, type->name))
return type;
return NULL;
}
static struct dm_exception_store_type *_get_exception_store_type(const char *name)
{
struct dm_exception_store_type *type;
spin_lock(&_lock);
type = __find_exception_store_type(name);
if (type && !try_module_get(type->module))
type = NULL;
spin_unlock(&_lock);
return type;
}
/*
* get_type
* @type_name
*
* Attempt to retrieve the dm_exception_store_type by name. If not already
* available, attempt to load the appropriate module.
*
* Exstore modules are named "dm-exstore-" followed by the 'type_name'.
* Modules may contain multiple types.
* This function will first try the module "dm-exstore-<type_name>",
* then truncate 'type_name' on the last '-' and try again.
*
* For example, if type_name was "clustered-shared", it would search
* 'dm-exstore-clustered-shared' then 'dm-exstore-clustered'.
*
* 'dm-exception-store-<type_name>' is too long of a name in my
* opinion, which is why I've chosen to have the files
* containing exception store implementations be 'dm-exstore-<type_name>'.
* If you want your module to be autoloaded, you will follow this
* naming convention.
*
* Returns: dm_exception_store_type* on success, NULL on failure
*/
static struct dm_exception_store_type *get_type(const char *type_name)
{
char *p, *type_name_dup;
struct dm_exception_store_type *type;
type = _get_exception_store_type(type_name);
if (type)
return type;
type_name_dup = kstrdup(type_name, GFP_KERNEL);
if (!type_name_dup) {
DMERR("No memory left to attempt load for \"%s\"", type_name);
return NULL;
}
while (request_module("dm-exstore-%s", type_name_dup) ||
!(type = _get_exception_store_type(type_name))) {
p = strrchr(type_name_dup, '-');
if (!p)
break;
p[0] = '\0';
}
if (!type)
DMWARN("Module for exstore type \"%s\" not found.", type_name);
kfree(type_name_dup);
return type;
}
static void put_type(struct dm_exception_store_type *type)
{
spin_lock(&_lock);
module_put(type->module);
spin_unlock(&_lock);
}
int dm_exception_store_type_register(struct dm_exception_store_type *type)
{
int r = 0;
spin_lock(&_lock);
if (!__find_exception_store_type(type->name))
list_add(&type->list, &_exception_store_types);
else
r = -EEXIST;
spin_unlock(&_lock);
return r;
}
EXPORT_SYMBOL(dm_exception_store_type_register);
int dm_exception_store_type_unregister(struct dm_exception_store_type *type)
{
spin_lock(&_lock);
if (!__find_exception_store_type(type->name)) {
spin_unlock(&_lock);
return -EINVAL;
}
list_del(&type->list);
spin_unlock(&_lock);
return 0;
}
EXPORT_SYMBOL(dm_exception_store_type_unregister);
static int set_chunk_size(struct dm_exception_store *store,
const char *chunk_size_arg, char **error)
{
unsigned chunk_size;
if (kstrtouint(chunk_size_arg, 10, &chunk_size)) {
*error = "Invalid chunk size";
return -EINVAL;
}
if (!chunk_size) {
store->chunk_size = store->chunk_mask = store->chunk_shift = 0;
return 0;
}
return dm_exception_store_set_chunk_size(store, chunk_size, error);
}
int dm_exception_store_set_chunk_size(struct dm_exception_store *store,
unsigned chunk_size,
char **error)
{
/* Check chunk_size is a power of 2 */
if (!is_power_of_2(chunk_size)) {
*error = "Chunk size is not a power of 2";
return -EINVAL;
}
/* Validate the chunk size against the device block size */
if (chunk_size %
(bdev_logical_block_size(dm_snap_cow(store->snap)->bdev) >> 9) ||
chunk_size %
(bdev_logical_block_size(dm_snap_origin(store->snap)->bdev) >> 9)) {
*error = "Chunk size is not a multiple of device blocksize";
return -EINVAL;
}
if (chunk_size > INT_MAX >> SECTOR_SHIFT) {
*error = "Chunk size is too high";
return -EINVAL;
}
store->chunk_size = chunk_size;
store->chunk_mask = chunk_size - 1;
store->chunk_shift = ffs(chunk_size) - 1;
return 0;
}
int dm_exception_store_create(struct dm_target *ti, int argc, char **argv,
struct dm_snapshot *snap,
unsigned *args_used,
struct dm_exception_store **store)
{
int r = 0;
struct dm_exception_store_type *type = NULL;
struct dm_exception_store *tmp_store;
char persistent;
if (argc < 2) {
ti->error = "Insufficient exception store arguments";
return -EINVAL;
}
tmp_store = kmalloc(sizeof(*tmp_store), GFP_KERNEL);
if (!tmp_store) {
ti->error = "Exception store allocation failed";
return -ENOMEM;
}
persistent = toupper(*argv[0]);
if (persistent == 'P')
type = get_type("P");
else if (persistent == 'N')
type = get_type("N");
else {
ti->error = "Persistent flag is not P or N";
r = -EINVAL;
goto bad_type;
}
if (!type) {
ti->error = "Exception store type not recognised";
r = -EINVAL;
goto bad_type;
}
tmp_store->type = type;
tmp_store->snap = snap;
r = set_chunk_size(tmp_store, argv[1], &ti->error);
if (r)
goto bad;
r = type->ctr(tmp_store, 0, NULL);
if (r) {
ti->error = "Exception store type constructor failed";
goto bad;
}
*args_used = 2;
*store = tmp_store;
return 0;
bad:
put_type(type);
bad_type:
kfree(tmp_store);
return r;
}
EXPORT_SYMBOL(dm_exception_store_create);
void dm_exception_store_destroy(struct dm_exception_store *store)
{
store->type->dtr(store);
put_type(store->type);
kfree(store);
}
EXPORT_SYMBOL(dm_exception_store_destroy);
int dm_exception_store_init(void)
{
int r;
r = dm_transient_snapshot_init();
if (r) {
DMERR("Unable to register transient exception store type.");
goto transient_fail;
}
r = dm_persistent_snapshot_init();
if (r) {
DMERR("Unable to register persistent exception store type");
goto persistent_fail;
}
return 0;
persistent_fail:
dm_transient_snapshot_exit();
transient_fail:
return r;
}
void dm_exception_store_exit(void)
{
dm_persistent_snapshot_exit();
dm_transient_snapshot_exit();
}

View file

@ -0,0 +1,227 @@
/*
* Copyright (C) 2001-2002 Sistina Software (UK) Limited.
* Copyright (C) 2008 Red Hat, Inc. All rights reserved.
*
* Device-mapper snapshot exception store.
*
* This file is released under the GPL.
*/
#ifndef _LINUX_DM_EXCEPTION_STORE
#define _LINUX_DM_EXCEPTION_STORE
#include <linux/blkdev.h>
#include <linux/device-mapper.h>
/*
* The snapshot code deals with largish chunks of the disk at a
* time. Typically 32k - 512k.
*/
typedef sector_t chunk_t;
/*
* An exception is used where an old chunk of data has been
* replaced by a new one.
* If chunk_t is 64 bits in size, the top 8 bits of new_chunk hold the number
* of chunks that follow contiguously. Remaining bits hold the number of the
* chunk within the device.
*/
struct dm_exception {
struct list_head hash_list;
chunk_t old_chunk;
chunk_t new_chunk;
};
/*
* Abstraction to handle the meta/layout of exception stores (the
* COW device).
*/
struct dm_exception_store;
struct dm_exception_store_type {
const char *name;
struct module *module;
int (*ctr) (struct dm_exception_store *store,
unsigned argc, char **argv);
/*
* Destroys this object when you've finished with it.
*/
void (*dtr) (struct dm_exception_store *store);
/*
* The target shouldn't read the COW device until this is
* called. As exceptions are read from the COW, they are
* reported back via the callback.
*/
int (*read_metadata) (struct dm_exception_store *store,
int (*callback)(void *callback_context,
chunk_t old, chunk_t new),
void *callback_context);
/*
* Find somewhere to store the next exception.
*/
int (*prepare_exception) (struct dm_exception_store *store,
struct dm_exception *e);
/*
* Update the metadata with this exception.
*/
void (*commit_exception) (struct dm_exception_store *store,
struct dm_exception *e,
void (*callback) (void *, int success),
void *callback_context);
/*
* Returns 0 if the exception store is empty.
*
* If there are exceptions still to be merged, sets
* *last_old_chunk and *last_new_chunk to the most recent
* still-to-be-merged chunk and returns the number of
* consecutive previous ones.
*/
int (*prepare_merge) (struct dm_exception_store *store,
chunk_t *last_old_chunk, chunk_t *last_new_chunk);
/*
* Clear the last n exceptions.
* nr_merged must be <= the value returned by prepare_merge.
*/
int (*commit_merge) (struct dm_exception_store *store, int nr_merged);
/*
* The snapshot is invalid, note this in the metadata.
*/
void (*drop_snapshot) (struct dm_exception_store *store);
unsigned (*status) (struct dm_exception_store *store,
status_type_t status, char *result,
unsigned maxlen);
/*
* Return how full the snapshot is.
*/
void (*usage) (struct dm_exception_store *store,
sector_t *total_sectors, sector_t *sectors_allocated,
sector_t *metadata_sectors);
/* For internal device-mapper use only. */
struct list_head list;
};
struct dm_snapshot;
struct dm_exception_store {
struct dm_exception_store_type *type;
struct dm_snapshot *snap;
/* Size of data blocks saved - must be a power of 2 */
unsigned chunk_size;
unsigned chunk_mask;
unsigned chunk_shift;
void *context;
};
/*
* Obtain the origin or cow device used by a given snapshot.
*/
struct dm_dev *dm_snap_origin(struct dm_snapshot *snap);
struct dm_dev *dm_snap_cow(struct dm_snapshot *snap);
/*
* Funtions to manipulate consecutive chunks
*/
# if defined(CONFIG_LBDAF) || (BITS_PER_LONG == 64)
# define DM_CHUNK_CONSECUTIVE_BITS 8
# define DM_CHUNK_NUMBER_BITS 56
static inline chunk_t dm_chunk_number(chunk_t chunk)
{
return chunk & (chunk_t)((1ULL << DM_CHUNK_NUMBER_BITS) - 1ULL);
}
static inline unsigned dm_consecutive_chunk_count(struct dm_exception *e)
{
return e->new_chunk >> DM_CHUNK_NUMBER_BITS;
}
static inline void dm_consecutive_chunk_count_inc(struct dm_exception *e)
{
e->new_chunk += (1ULL << DM_CHUNK_NUMBER_BITS);
BUG_ON(!dm_consecutive_chunk_count(e));
}
static inline void dm_consecutive_chunk_count_dec(struct dm_exception *e)
{
BUG_ON(!dm_consecutive_chunk_count(e));
e->new_chunk -= (1ULL << DM_CHUNK_NUMBER_BITS);
}
# else
# define DM_CHUNK_CONSECUTIVE_BITS 0
static inline chunk_t dm_chunk_number(chunk_t chunk)
{
return chunk;
}
static inline unsigned dm_consecutive_chunk_count(struct dm_exception *e)
{
return 0;
}
static inline void dm_consecutive_chunk_count_inc(struct dm_exception *e)
{
}
static inline void dm_consecutive_chunk_count_dec(struct dm_exception *e)
{
}
# endif
/*
* Return the number of sectors in the device.
*/
static inline sector_t get_dev_size(struct block_device *bdev)
{
return i_size_read(bdev->bd_inode) >> SECTOR_SHIFT;
}
static inline chunk_t sector_to_chunk(struct dm_exception_store *store,
sector_t sector)
{
return sector >> store->chunk_shift;
}
int dm_exception_store_type_register(struct dm_exception_store_type *type);
int dm_exception_store_type_unregister(struct dm_exception_store_type *type);
int dm_exception_store_set_chunk_size(struct dm_exception_store *store,
unsigned chunk_size,
char **error);
int dm_exception_store_create(struct dm_target *ti, int argc, char **argv,
struct dm_snapshot *snap,
unsigned *args_used,
struct dm_exception_store **store);
void dm_exception_store_destroy(struct dm_exception_store *store);
int dm_exception_store_init(void);
void dm_exception_store_exit(void);
/*
* Two exception store implementations.
*/
int dm_persistent_snapshot_init(void);
void dm_persistent_snapshot_exit(void);
int dm_transient_snapshot_init(void);
void dm_transient_snapshot_exit(void);
#endif /* _LINUX_DM_EXCEPTION_STORE */

447
drivers/md/dm-flakey.c Normal file
View file

@ -0,0 +1,447 @@
/*
* Copyright (C) 2003 Sistina Software (UK) Limited.
* Copyright (C) 2004, 2010-2011 Red Hat, Inc. All rights reserved.
*
* This file is released under the GPL.
*/
#include <linux/device-mapper.h>
#include <linux/module.h>
#include <linux/init.h>
#include <linux/blkdev.h>
#include <linux/bio.h>
#include <linux/slab.h>
#define DM_MSG_PREFIX "flakey"
#define all_corrupt_bio_flags_match(bio, fc) \
(((bio)->bi_rw & (fc)->corrupt_bio_flags) == (fc)->corrupt_bio_flags)
/*
* Flakey: Used for testing only, simulates intermittent,
* catastrophic device failure.
*/
struct flakey_c {
struct dm_dev *dev;
unsigned long start_time;
sector_t start;
unsigned up_interval;
unsigned down_interval;
unsigned long flags;
unsigned corrupt_bio_byte;
unsigned corrupt_bio_rw;
unsigned corrupt_bio_value;
unsigned corrupt_bio_flags;
};
enum feature_flag_bits {
DROP_WRITES
};
struct per_bio_data {
bool bio_submitted;
};
static int parse_features(struct dm_arg_set *as, struct flakey_c *fc,
struct dm_target *ti)
{
int r;
unsigned argc;
const char *arg_name;
static struct dm_arg _args[] = {
{0, 6, "Invalid number of feature args"},
{1, UINT_MAX, "Invalid corrupt bio byte"},
{0, 255, "Invalid corrupt value to write into bio byte (0-255)"},
{0, UINT_MAX, "Invalid corrupt bio flags mask"},
};
/* No feature arguments supplied. */
if (!as->argc)
return 0;
r = dm_read_arg_group(_args, as, &argc, &ti->error);
if (r)
return r;
while (argc) {
arg_name = dm_shift_arg(as);
argc--;
/*
* drop_writes
*/
if (!strcasecmp(arg_name, "drop_writes")) {
if (test_and_set_bit(DROP_WRITES, &fc->flags)) {
ti->error = "Feature drop_writes duplicated";
return -EINVAL;
}
continue;
}
/*
* corrupt_bio_byte <Nth_byte> <direction> <value> <bio_flags>
*/
if (!strcasecmp(arg_name, "corrupt_bio_byte")) {
if (!argc) {
ti->error = "Feature corrupt_bio_byte requires parameters";
return -EINVAL;
}
r = dm_read_arg(_args + 1, as, &fc->corrupt_bio_byte, &ti->error);
if (r)
return r;
argc--;
/*
* Direction r or w?
*/
arg_name = dm_shift_arg(as);
if (!strcasecmp(arg_name, "w"))
fc->corrupt_bio_rw = WRITE;
else if (!strcasecmp(arg_name, "r"))
fc->corrupt_bio_rw = READ;
else {
ti->error = "Invalid corrupt bio direction (r or w)";
return -EINVAL;
}
argc--;
/*
* Value of byte (0-255) to write in place of correct one.
*/
r = dm_read_arg(_args + 2, as, &fc->corrupt_bio_value, &ti->error);
if (r)
return r;
argc--;
/*
* Only corrupt bios with these flags set.
*/
r = dm_read_arg(_args + 3, as, &fc->corrupt_bio_flags, &ti->error);
if (r)
return r;
argc--;
continue;
}
ti->error = "Unrecognised flakey feature requested";
return -EINVAL;
}
if (test_bit(DROP_WRITES, &fc->flags) && (fc->corrupt_bio_rw == WRITE)) {
ti->error = "drop_writes is incompatible with corrupt_bio_byte with the WRITE flag set";
return -EINVAL;
}
return 0;
}
/*
* Construct a flakey mapping:
* <dev_path> <offset> <up interval> <down interval> [<#feature args> [<arg>]*]
*
* Feature args:
* [drop_writes]
* [corrupt_bio_byte <Nth_byte> <direction> <value> <bio_flags>]
*
* Nth_byte starts from 1 for the first byte.
* Direction is r for READ or w for WRITE.
* bio_flags is ignored if 0.
*/
static int flakey_ctr(struct dm_target *ti, unsigned int argc, char **argv)
{
static struct dm_arg _args[] = {
{0, UINT_MAX, "Invalid up interval"},
{0, UINT_MAX, "Invalid down interval"},
};
int r;
struct flakey_c *fc;
unsigned long long tmpll;
struct dm_arg_set as;
const char *devname;
char dummy;
as.argc = argc;
as.argv = argv;
if (argc < 4) {
ti->error = "Invalid argument count";
return -EINVAL;
}
fc = kzalloc(sizeof(*fc), GFP_KERNEL);
if (!fc) {
ti->error = "Cannot allocate context";
return -ENOMEM;
}
fc->start_time = jiffies;
devname = dm_shift_arg(&as);
if (sscanf(dm_shift_arg(&as), "%llu%c", &tmpll, &dummy) != 1) {
ti->error = "Invalid device sector";
goto bad;
}
fc->start = tmpll;
r = dm_read_arg(_args, &as, &fc->up_interval, &ti->error);
if (r)
goto bad;
r = dm_read_arg(_args, &as, &fc->down_interval, &ti->error);
if (r)
goto bad;
if (!(fc->up_interval + fc->down_interval)) {
ti->error = "Total (up + down) interval is zero";
goto bad;
}
if (fc->up_interval + fc->down_interval < fc->up_interval) {
ti->error = "Interval overflow";
goto bad;
}
r = parse_features(&as, fc, ti);
if (r)
goto bad;
if (dm_get_device(ti, devname, dm_table_get_mode(ti->table), &fc->dev)) {
ti->error = "Device lookup failed";
goto bad;
}
ti->num_flush_bios = 1;
ti->num_discard_bios = 1;
ti->per_bio_data_size = sizeof(struct per_bio_data);
ti->private = fc;
return 0;
bad:
kfree(fc);
return -EINVAL;
}
static void flakey_dtr(struct dm_target *ti)
{
struct flakey_c *fc = ti->private;
dm_put_device(ti, fc->dev);
kfree(fc);
}
static sector_t flakey_map_sector(struct dm_target *ti, sector_t bi_sector)
{
struct flakey_c *fc = ti->private;
return fc->start + dm_target_offset(ti, bi_sector);
}
static void flakey_map_bio(struct dm_target *ti, struct bio *bio)
{
struct flakey_c *fc = ti->private;
bio->bi_bdev = fc->dev->bdev;
if (bio_sectors(bio))
bio->bi_iter.bi_sector =
flakey_map_sector(ti, bio->bi_iter.bi_sector);
}
static void corrupt_bio_data(struct bio *bio, struct flakey_c *fc)
{
unsigned bio_bytes = bio_cur_bytes(bio);
char *data = bio_data(bio);
/*
* Overwrite the Nth byte of the data returned.
*/
if (data && bio_bytes >= fc->corrupt_bio_byte) {
data[fc->corrupt_bio_byte - 1] = fc->corrupt_bio_value;
DMDEBUG("Corrupting data bio=%p by writing %u to byte %u "
"(rw=%c bi_rw=%lu bi_sector=%llu cur_bytes=%u)\n",
bio, fc->corrupt_bio_value, fc->corrupt_bio_byte,
(bio_data_dir(bio) == WRITE) ? 'w' : 'r', bio->bi_rw,
(unsigned long long)bio->bi_iter.bi_sector, bio_bytes);
}
}
static int flakey_map(struct dm_target *ti, struct bio *bio)
{
struct flakey_c *fc = ti->private;
unsigned elapsed;
struct per_bio_data *pb = dm_per_bio_data(bio, sizeof(struct per_bio_data));
pb->bio_submitted = false;
/* Are we alive ? */
elapsed = (jiffies - fc->start_time) / HZ;
if (elapsed % (fc->up_interval + fc->down_interval) >= fc->up_interval) {
/*
* Flag this bio as submitted while down.
*/
pb->bio_submitted = true;
/*
* Map reads as normal.
*/
if (bio_data_dir(bio) == READ)
goto map_bio;
/*
* Drop writes?
*/
if (test_bit(DROP_WRITES, &fc->flags)) {
bio_endio(bio, 0);
return DM_MAPIO_SUBMITTED;
}
/*
* Corrupt matching writes.
*/
if (fc->corrupt_bio_byte && (fc->corrupt_bio_rw == WRITE)) {
if (all_corrupt_bio_flags_match(bio, fc))
corrupt_bio_data(bio, fc);
goto map_bio;
}
/*
* By default, error all I/O.
*/
return -EIO;
}
map_bio:
flakey_map_bio(ti, bio);
return DM_MAPIO_REMAPPED;
}
static int flakey_end_io(struct dm_target *ti, struct bio *bio, int error)
{
struct flakey_c *fc = ti->private;
struct per_bio_data *pb = dm_per_bio_data(bio, sizeof(struct per_bio_data));
/*
* Corrupt successful READs while in down state.
* If flags were specified, only corrupt those that match.
*/
if (fc->corrupt_bio_byte && !error && pb->bio_submitted &&
(bio_data_dir(bio) == READ) && (fc->corrupt_bio_rw == READ) &&
all_corrupt_bio_flags_match(bio, fc))
corrupt_bio_data(bio, fc);
return error;
}
static void flakey_status(struct dm_target *ti, status_type_t type,
unsigned status_flags, char *result, unsigned maxlen)
{
unsigned sz = 0;
struct flakey_c *fc = ti->private;
unsigned drop_writes;
switch (type) {
case STATUSTYPE_INFO:
result[0] = '\0';
break;
case STATUSTYPE_TABLE:
DMEMIT("%s %llu %u %u ", fc->dev->name,
(unsigned long long)fc->start, fc->up_interval,
fc->down_interval);
drop_writes = test_bit(DROP_WRITES, &fc->flags);
DMEMIT("%u ", drop_writes + (fc->corrupt_bio_byte > 0) * 5);
if (drop_writes)
DMEMIT("drop_writes ");
if (fc->corrupt_bio_byte)
DMEMIT("corrupt_bio_byte %u %c %u %u ",
fc->corrupt_bio_byte,
(fc->corrupt_bio_rw == WRITE) ? 'w' : 'r',
fc->corrupt_bio_value, fc->corrupt_bio_flags);
break;
}
}
static int flakey_ioctl(struct dm_target *ti, unsigned int cmd, unsigned long arg)
{
struct flakey_c *fc = ti->private;
struct dm_dev *dev = fc->dev;
int r = 0;
/*
* Only pass ioctls through if the device sizes match exactly.
*/
if (fc->start ||
ti->len != i_size_read(dev->bdev->bd_inode) >> SECTOR_SHIFT)
r = scsi_verify_blk_ioctl(NULL, cmd);
return r ? : __blkdev_driver_ioctl(dev->bdev, dev->mode, cmd, arg);
}
static int flakey_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
struct bio_vec *biovec, int max_size)
{
struct flakey_c *fc = ti->private;
struct request_queue *q = bdev_get_queue(fc->dev->bdev);
if (!q->merge_bvec_fn)
return max_size;
bvm->bi_bdev = fc->dev->bdev;
bvm->bi_sector = flakey_map_sector(ti, bvm->bi_sector);
return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
}
static int flakey_iterate_devices(struct dm_target *ti, iterate_devices_callout_fn fn, void *data)
{
struct flakey_c *fc = ti->private;
return fn(ti, fc->dev, fc->start, ti->len, data);
}
static struct target_type flakey_target = {
.name = "flakey",
.version = {1, 3, 1},
.module = THIS_MODULE,
.ctr = flakey_ctr,
.dtr = flakey_dtr,
.map = flakey_map,
.end_io = flakey_end_io,
.status = flakey_status,
.ioctl = flakey_ioctl,
.merge = flakey_merge,
.iterate_devices = flakey_iterate_devices,
};
static int __init dm_flakey_init(void)
{
int r = dm_register_target(&flakey_target);
if (r < 0)
DMERR("register failed %d", r);
return r;
}
static void __exit dm_flakey_exit(void)
{
dm_unregister_target(&flakey_target);
}
/* Module hooks */
module_init(dm_flakey_init);
module_exit(dm_flakey_exit);
MODULE_DESCRIPTION(DM_NAME " flakey target");
MODULE_AUTHOR("Joe Thornber <dm-devel@redhat.com>");
MODULE_LICENSE("GPL");

540
drivers/md/dm-io.c Normal file
View file

@ -0,0 +1,540 @@
/*
* Copyright (C) 2003 Sistina Software
* Copyright (C) 2006 Red Hat GmbH
*
* This file is released under the GPL.
*/
#include "dm.h"
#include <linux/device-mapper.h>
#include <linux/bio.h>
#include <linux/completion.h>
#include <linux/mempool.h>
#include <linux/module.h>
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/dm-io.h>
#define DM_MSG_PREFIX "io"
#define DM_IO_MAX_REGIONS BITS_PER_LONG
struct dm_io_client {
mempool_t *pool;
struct bio_set *bios;
};
/*
* Aligning 'struct io' reduces the number of bits required to store
* its address. Refer to store_io_and_region_in_bio() below.
*/
struct io {
unsigned long error_bits;
atomic_t count;
struct dm_io_client *client;
io_notify_fn callback;
void *context;
void *vma_invalidate_address;
unsigned long vma_invalidate_size;
} __attribute__((aligned(DM_IO_MAX_REGIONS)));
static struct kmem_cache *_dm_io_cache;
/*
* Create a client with mempool and bioset.
*/
struct dm_io_client *dm_io_client_create(void)
{
struct dm_io_client *client;
unsigned min_ios = dm_get_reserved_bio_based_ios();
client = kmalloc(sizeof(*client), GFP_KERNEL);
if (!client)
return ERR_PTR(-ENOMEM);
client->pool = mempool_create_slab_pool(min_ios, _dm_io_cache);
if (!client->pool)
goto bad;
client->bios = bioset_create(min_ios, 0);
if (!client->bios)
goto bad;
return client;
bad:
if (client->pool)
mempool_destroy(client->pool);
kfree(client);
return ERR_PTR(-ENOMEM);
}
EXPORT_SYMBOL(dm_io_client_create);
void dm_io_client_destroy(struct dm_io_client *client)
{
mempool_destroy(client->pool);
bioset_free(client->bios);
kfree(client);
}
EXPORT_SYMBOL(dm_io_client_destroy);
/*-----------------------------------------------------------------
* We need to keep track of which region a bio is doing io for.
* To avoid a memory allocation to store just 5 or 6 bits, we
* ensure the 'struct io' pointer is aligned so enough low bits are
* always zero and then combine it with the region number directly in
* bi_private.
*---------------------------------------------------------------*/
static void store_io_and_region_in_bio(struct bio *bio, struct io *io,
unsigned region)
{
if (unlikely(!IS_ALIGNED((unsigned long)io, DM_IO_MAX_REGIONS))) {
DMCRIT("Unaligned struct io pointer %p", io);
BUG();
}
bio->bi_private = (void *)((unsigned long)io | region);
}
static void retrieve_io_and_region_from_bio(struct bio *bio, struct io **io,
unsigned *region)
{
unsigned long val = (unsigned long)bio->bi_private;
*io = (void *)(val & -(unsigned long)DM_IO_MAX_REGIONS);
*region = val & (DM_IO_MAX_REGIONS - 1);
}
/*-----------------------------------------------------------------
* We need an io object to keep track of the number of bios that
* have been dispatched for a particular io.
*---------------------------------------------------------------*/
static void complete_io(struct io *io)
{
unsigned long error_bits = io->error_bits;
io_notify_fn fn = io->callback;
void *context = io->context;
if (io->vma_invalidate_size)
invalidate_kernel_vmap_range(io->vma_invalidate_address,
io->vma_invalidate_size);
mempool_free(io, io->client->pool);
fn(error_bits, context);
}
static void dec_count(struct io *io, unsigned int region, int error)
{
if (error)
set_bit(region, &io->error_bits);
if (atomic_dec_and_test(&io->count))
complete_io(io);
}
static void endio(struct bio *bio, int error)
{
struct io *io;
unsigned region;
if (error && bio_data_dir(bio) == READ)
zero_fill_bio(bio);
/*
* The bio destructor in bio_put() may use the io object.
*/
retrieve_io_and_region_from_bio(bio, &io, &region);
bio_put(bio);
dec_count(io, region, error);
}
/*-----------------------------------------------------------------
* These little objects provide an abstraction for getting a new
* destination page for io.
*---------------------------------------------------------------*/
struct dpages {
void (*get_page)(struct dpages *dp,
struct page **p, unsigned long *len, unsigned *offset);
void (*next_page)(struct dpages *dp);
unsigned context_u;
void *context_ptr;
void *vma_invalidate_address;
unsigned long vma_invalidate_size;
};
/*
* Functions for getting the pages from a list.
*/
static void list_get_page(struct dpages *dp,
struct page **p, unsigned long *len, unsigned *offset)
{
unsigned o = dp->context_u;
struct page_list *pl = (struct page_list *) dp->context_ptr;
*p = pl->page;
*len = PAGE_SIZE - o;
*offset = o;
}
static void list_next_page(struct dpages *dp)
{
struct page_list *pl = (struct page_list *) dp->context_ptr;
dp->context_ptr = pl->next;
dp->context_u = 0;
}
static void list_dp_init(struct dpages *dp, struct page_list *pl, unsigned offset)
{
dp->get_page = list_get_page;
dp->next_page = list_next_page;
dp->context_u = offset;
dp->context_ptr = pl;
}
/*
* Functions for getting the pages from a bvec.
*/
static void bio_get_page(struct dpages *dp, struct page **p,
unsigned long *len, unsigned *offset)
{
struct bio_vec *bvec = dp->context_ptr;
*p = bvec->bv_page;
*len = bvec->bv_len - dp->context_u;
*offset = bvec->bv_offset + dp->context_u;
}
static void bio_next_page(struct dpages *dp)
{
struct bio_vec *bvec = dp->context_ptr;
dp->context_ptr = bvec + 1;
dp->context_u = 0;
}
static void bio_dp_init(struct dpages *dp, struct bio *bio)
{
dp->get_page = bio_get_page;
dp->next_page = bio_next_page;
dp->context_ptr = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
dp->context_u = bio->bi_iter.bi_bvec_done;
}
/*
* Functions for getting the pages from a VMA.
*/
static void vm_get_page(struct dpages *dp,
struct page **p, unsigned long *len, unsigned *offset)
{
*p = vmalloc_to_page(dp->context_ptr);
*offset = dp->context_u;
*len = PAGE_SIZE - dp->context_u;
}
static void vm_next_page(struct dpages *dp)
{
dp->context_ptr += PAGE_SIZE - dp->context_u;
dp->context_u = 0;
}
static void vm_dp_init(struct dpages *dp, void *data)
{
dp->get_page = vm_get_page;
dp->next_page = vm_next_page;
dp->context_u = ((unsigned long) data) & (PAGE_SIZE - 1);
dp->context_ptr = data;
}
/*
* Functions for getting the pages from kernel memory.
*/
static void km_get_page(struct dpages *dp, struct page **p, unsigned long *len,
unsigned *offset)
{
*p = virt_to_page(dp->context_ptr);
*offset = dp->context_u;
*len = PAGE_SIZE - dp->context_u;
}
static void km_next_page(struct dpages *dp)
{
dp->context_ptr += PAGE_SIZE - dp->context_u;
dp->context_u = 0;
}
static void km_dp_init(struct dpages *dp, void *data)
{
dp->get_page = km_get_page;
dp->next_page = km_next_page;
dp->context_u = ((unsigned long) data) & (PAGE_SIZE - 1);
dp->context_ptr = data;
}
/*-----------------------------------------------------------------
* IO routines that accept a list of pages.
*---------------------------------------------------------------*/
static void do_region(int rw, unsigned region, struct dm_io_region *where,
struct dpages *dp, struct io *io)
{
struct bio *bio;
struct page *page;
unsigned long len;
unsigned offset;
unsigned num_bvecs;
sector_t remaining = where->count;
struct request_queue *q = bdev_get_queue(where->bdev);
unsigned short logical_block_size = queue_logical_block_size(q);
sector_t num_sectors;
unsigned int uninitialized_var(special_cmd_max_sectors);
/*
* Reject unsupported discard and write same requests.
*/
if (rw & REQ_DISCARD)
special_cmd_max_sectors = q->limits.max_discard_sectors;
else if (rw & REQ_WRITE_SAME)
special_cmd_max_sectors = q->limits.max_write_same_sectors;
if ((rw & (REQ_DISCARD | REQ_WRITE_SAME)) && special_cmd_max_sectors == 0) {
dec_count(io, region, -EOPNOTSUPP);
return;
}
/*
* where->count may be zero if rw holds a flush and we need to
* send a zero-sized flush.
*/
do {
/*
* Allocate a suitably sized-bio.
*/
if ((rw & REQ_DISCARD) || (rw & REQ_WRITE_SAME))
num_bvecs = 1;
else
num_bvecs = min_t(int, bio_get_nr_vecs(where->bdev),
dm_sector_div_up(remaining, (PAGE_SIZE >> SECTOR_SHIFT)));
bio = bio_alloc_bioset(GFP_NOIO, num_bvecs, io->client->bios);
bio->bi_iter.bi_sector = where->sector + (where->count - remaining);
bio->bi_bdev = where->bdev;
bio->bi_end_io = endio;
store_io_and_region_in_bio(bio, io, region);
if (rw & REQ_DISCARD) {
num_sectors = min_t(sector_t, special_cmd_max_sectors, remaining);
bio->bi_iter.bi_size = num_sectors << SECTOR_SHIFT;
remaining -= num_sectors;
} else if (rw & REQ_WRITE_SAME) {
/*
* WRITE SAME only uses a single page.
*/
dp->get_page(dp, &page, &len, &offset);
bio_add_page(bio, page, logical_block_size, offset);
num_sectors = min_t(sector_t, special_cmd_max_sectors, remaining);
bio->bi_iter.bi_size = num_sectors << SECTOR_SHIFT;
offset = 0;
remaining -= num_sectors;
dp->next_page(dp);
} else while (remaining) {
/*
* Try and add as many pages as possible.
*/
dp->get_page(dp, &page, &len, &offset);
len = min(len, to_bytes(remaining));
if (!bio_add_page(bio, page, len, offset))
break;
offset = 0;
remaining -= to_sector(len);
dp->next_page(dp);
}
atomic_inc(&io->count);
submit_bio(rw, bio);
} while (remaining);
}
static void dispatch_io(int rw, unsigned int num_regions,
struct dm_io_region *where, struct dpages *dp,
struct io *io, int sync)
{
int i;
struct dpages old_pages = *dp;
BUG_ON(num_regions > DM_IO_MAX_REGIONS);
if (sync)
rw |= REQ_SYNC;
/*
* For multiple regions we need to be careful to rewind
* the dp object for each call to do_region.
*/
for (i = 0; i < num_regions; i++) {
*dp = old_pages;
if (where[i].count || (rw & REQ_FLUSH))
do_region(rw, i, where + i, dp, io);
}
/*
* Drop the extra reference that we were holding to avoid
* the io being completed too early.
*/
dec_count(io, 0, 0);
}
struct sync_io {
unsigned long error_bits;
struct completion wait;
};
static void sync_io_complete(unsigned long error, void *context)
{
struct sync_io *sio = context;
sio->error_bits = error;
complete(&sio->wait);
}
static int sync_io(struct dm_io_client *client, unsigned int num_regions,
struct dm_io_region *where, int rw, struct dpages *dp,
unsigned long *error_bits)
{
struct io *io;
struct sync_io sio;
if (num_regions > 1 && (rw & RW_MASK) != WRITE) {
WARN_ON(1);
return -EIO;
}
init_completion(&sio.wait);
io = mempool_alloc(client->pool, GFP_NOIO);
io->error_bits = 0;
atomic_set(&io->count, 1); /* see dispatch_io() */
io->client = client;
io->callback = sync_io_complete;
io->context = &sio;
io->vma_invalidate_address = dp->vma_invalidate_address;
io->vma_invalidate_size = dp->vma_invalidate_size;
dispatch_io(rw, num_regions, where, dp, io, 1);
wait_for_completion_io(&sio.wait);
if (error_bits)
*error_bits = sio.error_bits;
return sio.error_bits ? -EIO : 0;
}
static int async_io(struct dm_io_client *client, unsigned int num_regions,
struct dm_io_region *where, int rw, struct dpages *dp,
io_notify_fn fn, void *context)
{
struct io *io;
if (num_regions > 1 && (rw & RW_MASK) != WRITE) {
WARN_ON(1);
fn(1, context);
return -EIO;
}
io = mempool_alloc(client->pool, GFP_NOIO);
io->error_bits = 0;
atomic_set(&io->count, 1); /* see dispatch_io() */
io->client = client;
io->callback = fn;
io->context = context;
io->vma_invalidate_address = dp->vma_invalidate_address;
io->vma_invalidate_size = dp->vma_invalidate_size;
dispatch_io(rw, num_regions, where, dp, io, 0);
return 0;
}
static int dp_init(struct dm_io_request *io_req, struct dpages *dp,
unsigned long size)
{
/* Set up dpages based on memory type */
dp->vma_invalidate_address = NULL;
dp->vma_invalidate_size = 0;
switch (io_req->mem.type) {
case DM_IO_PAGE_LIST:
list_dp_init(dp, io_req->mem.ptr.pl, io_req->mem.offset);
break;
case DM_IO_BIO:
bio_dp_init(dp, io_req->mem.ptr.bio);
break;
case DM_IO_VMA:
flush_kernel_vmap_range(io_req->mem.ptr.vma, size);
if ((io_req->bi_rw & RW_MASK) == READ) {
dp->vma_invalidate_address = io_req->mem.ptr.vma;
dp->vma_invalidate_size = size;
}
vm_dp_init(dp, io_req->mem.ptr.vma);
break;
case DM_IO_KMEM:
km_dp_init(dp, io_req->mem.ptr.addr);
break;
default:
return -EINVAL;
}
return 0;
}
/*
* New collapsed (a)synchronous interface.
*
* If the IO is asynchronous (i.e. it has notify.fn), you must either unplug
* the queue with blk_unplug() some time later or set REQ_SYNC in io_req->bi_rw.
* If you fail to do one of these, the IO will be submitted to the disk after
* q->unplug_delay, which defaults to 3ms in blk-settings.c.
*/
int dm_io(struct dm_io_request *io_req, unsigned num_regions,
struct dm_io_region *where, unsigned long *sync_error_bits)
{
int r;
struct dpages dp;
r = dp_init(io_req, &dp, (unsigned long)where->count << SECTOR_SHIFT);
if (r)
return r;
if (!io_req->notify.fn)
return sync_io(io_req->client, num_regions, where,
io_req->bi_rw, &dp, sync_error_bits);
return async_io(io_req->client, num_regions, where, io_req->bi_rw,
&dp, io_req->notify.fn, io_req->notify.context);
}
EXPORT_SYMBOL(dm_io);
int __init dm_io_init(void)
{
_dm_io_cache = KMEM_CACHE(io, 0);
if (!_dm_io_cache)
return -ENOMEM;
return 0;
}
void dm_io_exit(void)
{
kmem_cache_destroy(_dm_io_cache);
_dm_io_cache = NULL;
}

1954
drivers/md/dm-ioctl.c Normal file

File diff suppressed because it is too large Load diff

884
drivers/md/dm-kcopyd.c Normal file
View file

@ -0,0 +1,884 @@
/*
* Copyright (C) 2002 Sistina Software (UK) Limited.
* Copyright (C) 2006 Red Hat GmbH
*
* This file is released under the GPL.
*
* Kcopyd provides a simple interface for copying an area of one
* block-device to one or more other block-devices, with an asynchronous
* completion notification.
*/
#include <linux/types.h>
#include <linux/atomic.h>
#include <linux/blkdev.h>
#include <linux/fs.h>
#include <linux/init.h>
#include <linux/list.h>
#include <linux/mempool.h>
#include <linux/module.h>
#include <linux/pagemap.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>
#include <linux/workqueue.h>
#include <linux/mutex.h>
#include <linux/delay.h>
#include <linux/device-mapper.h>
#include <linux/dm-kcopyd.h>
#include "dm.h"
#define SUB_JOB_SIZE 128
#define SPLIT_COUNT 8
#define MIN_JOBS 8
#define RESERVE_PAGES (DIV_ROUND_UP(SUB_JOB_SIZE << SECTOR_SHIFT, PAGE_SIZE))
/*-----------------------------------------------------------------
* Each kcopyd client has its own little pool of preallocated
* pages for kcopyd io.
*---------------------------------------------------------------*/
struct dm_kcopyd_client {
struct page_list *pages;
unsigned nr_reserved_pages;
unsigned nr_free_pages;
struct dm_io_client *io_client;
wait_queue_head_t destroyq;
atomic_t nr_jobs;
mempool_t *job_pool;
struct workqueue_struct *kcopyd_wq;
struct work_struct kcopyd_work;
struct dm_kcopyd_throttle *throttle;
/*
* We maintain three lists of jobs:
*
* i) jobs waiting for pages
* ii) jobs that have pages, and are waiting for the io to be issued.
* iii) jobs that have completed.
*
* All three of these are protected by job_lock.
*/
spinlock_t job_lock;
struct list_head complete_jobs;
struct list_head io_jobs;
struct list_head pages_jobs;
};
static struct page_list zero_page_list;
static DEFINE_SPINLOCK(throttle_spinlock);
/*
* IO/IDLE accounting slowly decays after (1 << ACCOUNT_INTERVAL_SHIFT) period.
* When total_period >= (1 << ACCOUNT_INTERVAL_SHIFT) the counters are divided
* by 2.
*/
#define ACCOUNT_INTERVAL_SHIFT SHIFT_HZ
/*
* Sleep this number of milliseconds.
*
* The value was decided experimentally.
* Smaller values seem to cause an increased copy rate above the limit.
* The reason for this is unknown but possibly due to jiffies rounding errors
* or read/write cache inside the disk.
*/
#define SLEEP_MSEC 100
/*
* Maximum number of sleep events. There is a theoretical livelock if more
* kcopyd clients do work simultaneously which this limit avoids.
*/
#define MAX_SLEEPS 10
static void io_job_start(struct dm_kcopyd_throttle *t)
{
unsigned throttle, now, difference;
int slept = 0, skew;
if (unlikely(!t))
return;
try_again:
spin_lock_irq(&throttle_spinlock);
throttle = ACCESS_ONCE(t->throttle);
if (likely(throttle >= 100))
goto skip_limit;
now = jiffies;
difference = now - t->last_jiffies;
t->last_jiffies = now;
if (t->num_io_jobs)
t->io_period += difference;
t->total_period += difference;
/*
* Maintain sane values if we got a temporary overflow.
*/
if (unlikely(t->io_period > t->total_period))
t->io_period = t->total_period;
if (unlikely(t->total_period >= (1 << ACCOUNT_INTERVAL_SHIFT))) {
int shift = fls(t->total_period >> ACCOUNT_INTERVAL_SHIFT);
t->total_period >>= shift;
t->io_period >>= shift;
}
skew = t->io_period - throttle * t->total_period / 100;
if (unlikely(skew > 0) && slept < MAX_SLEEPS) {
slept++;
spin_unlock_irq(&throttle_spinlock);
msleep(SLEEP_MSEC);
goto try_again;
}
skip_limit:
t->num_io_jobs++;
spin_unlock_irq(&throttle_spinlock);
}
static void io_job_finish(struct dm_kcopyd_throttle *t)
{
unsigned long flags;
if (unlikely(!t))
return;
spin_lock_irqsave(&throttle_spinlock, flags);
t->num_io_jobs--;
if (likely(ACCESS_ONCE(t->throttle) >= 100))
goto skip_limit;
if (!t->num_io_jobs) {
unsigned now, difference;
now = jiffies;
difference = now - t->last_jiffies;
t->last_jiffies = now;
t->io_period += difference;
t->total_period += difference;
/*
* Maintain sane values if we got a temporary overflow.
*/
if (unlikely(t->io_period > t->total_period))
t->io_period = t->total_period;
}
skip_limit:
spin_unlock_irqrestore(&throttle_spinlock, flags);
}
static void wake(struct dm_kcopyd_client *kc)
{
queue_work(kc->kcopyd_wq, &kc->kcopyd_work);
}
/*
* Obtain one page for the use of kcopyd.
*/
static struct page_list *alloc_pl(gfp_t gfp)
{
struct page_list *pl;
pl = kmalloc(sizeof(*pl), gfp);
if (!pl)
return NULL;
pl->page = alloc_page(gfp);
if (!pl->page) {
kfree(pl);
return NULL;
}
return pl;
}
static void free_pl(struct page_list *pl)
{
__free_page(pl->page);
kfree(pl);
}
/*
* Add the provided pages to a client's free page list, releasing
* back to the system any beyond the reserved_pages limit.
*/
static void kcopyd_put_pages(struct dm_kcopyd_client *kc, struct page_list *pl)
{
struct page_list *next;
do {
next = pl->next;
if (kc->nr_free_pages >= kc->nr_reserved_pages)
free_pl(pl);
else {
pl->next = kc->pages;
kc->pages = pl;
kc->nr_free_pages++;
}
pl = next;
} while (pl);
}
static int kcopyd_get_pages(struct dm_kcopyd_client *kc,
unsigned int nr, struct page_list **pages)
{
struct page_list *pl;
*pages = NULL;
do {
pl = alloc_pl(__GFP_NOWARN | __GFP_NORETRY);
if (unlikely(!pl)) {
/* Use reserved pages */
pl = kc->pages;
if (unlikely(!pl))
goto out_of_memory;
kc->pages = pl->next;
kc->nr_free_pages--;
}
pl->next = *pages;
*pages = pl;
} while (--nr);
return 0;
out_of_memory:
if (*pages)
kcopyd_put_pages(kc, *pages);
return -ENOMEM;
}
/*
* These three functions resize the page pool.
*/
static void drop_pages(struct page_list *pl)
{
struct page_list *next;
while (pl) {
next = pl->next;
free_pl(pl);
pl = next;
}
}
/*
* Allocate and reserve nr_pages for the use of a specific client.
*/
static int client_reserve_pages(struct dm_kcopyd_client *kc, unsigned nr_pages)
{
unsigned i;
struct page_list *pl = NULL, *next;
for (i = 0; i < nr_pages; i++) {
next = alloc_pl(GFP_KERNEL);
if (!next) {
if (pl)
drop_pages(pl);
return -ENOMEM;
}
next->next = pl;
pl = next;
}
kc->nr_reserved_pages += nr_pages;
kcopyd_put_pages(kc, pl);
return 0;
}
static void client_free_pages(struct dm_kcopyd_client *kc)
{
BUG_ON(kc->nr_free_pages != kc->nr_reserved_pages);
drop_pages(kc->pages);
kc->pages = NULL;
kc->nr_free_pages = kc->nr_reserved_pages = 0;
}
/*-----------------------------------------------------------------
* kcopyd_jobs need to be allocated by the *clients* of kcopyd,
* for this reason we use a mempool to prevent the client from
* ever having to do io (which could cause a deadlock).
*---------------------------------------------------------------*/
struct kcopyd_job {
struct dm_kcopyd_client *kc;
struct list_head list;
unsigned long flags;
/*
* Error state of the job.
*/
int read_err;
unsigned long write_err;
/*
* Either READ or WRITE
*/
int rw;
struct dm_io_region source;
/*
* The destinations for the transfer.
*/
unsigned int num_dests;
struct dm_io_region dests[DM_KCOPYD_MAX_REGIONS];
struct page_list *pages;
/*
* Set this to ensure you are notified when the job has
* completed. 'context' is for callback to use.
*/
dm_kcopyd_notify_fn fn;
void *context;
/*
* These fields are only used if the job has been split
* into more manageable parts.
*/
struct mutex lock;
atomic_t sub_jobs;
sector_t progress;
struct kcopyd_job *master_job;
};
static struct kmem_cache *_job_cache;
int __init dm_kcopyd_init(void)
{
_job_cache = kmem_cache_create("kcopyd_job",
sizeof(struct kcopyd_job) * (SPLIT_COUNT + 1),
__alignof__(struct kcopyd_job), 0, NULL);
if (!_job_cache)
return -ENOMEM;
zero_page_list.next = &zero_page_list;
zero_page_list.page = ZERO_PAGE(0);
return 0;
}
void dm_kcopyd_exit(void)
{
kmem_cache_destroy(_job_cache);
_job_cache = NULL;
}
/*
* Functions to push and pop a job onto the head of a given job
* list.
*/
static struct kcopyd_job *pop(struct list_head *jobs,
struct dm_kcopyd_client *kc)
{
struct kcopyd_job *job = NULL;
unsigned long flags;
spin_lock_irqsave(&kc->job_lock, flags);
if (!list_empty(jobs)) {
job = list_entry(jobs->next, struct kcopyd_job, list);
list_del(&job->list);
}
spin_unlock_irqrestore(&kc->job_lock, flags);
return job;
}
static void push(struct list_head *jobs, struct kcopyd_job *job)
{
unsigned long flags;
struct dm_kcopyd_client *kc = job->kc;
spin_lock_irqsave(&kc->job_lock, flags);
list_add_tail(&job->list, jobs);
spin_unlock_irqrestore(&kc->job_lock, flags);
}
static void push_head(struct list_head *jobs, struct kcopyd_job *job)
{
unsigned long flags;
struct dm_kcopyd_client *kc = job->kc;
spin_lock_irqsave(&kc->job_lock, flags);
list_add(&job->list, jobs);
spin_unlock_irqrestore(&kc->job_lock, flags);
}
/*
* These three functions process 1 item from the corresponding
* job list.
*
* They return:
* < 0: error
* 0: success
* > 0: can't process yet.
*/
static int run_complete_job(struct kcopyd_job *job)
{
void *context = job->context;
int read_err = job->read_err;
unsigned long write_err = job->write_err;
dm_kcopyd_notify_fn fn = job->fn;
struct dm_kcopyd_client *kc = job->kc;
if (job->pages && job->pages != &zero_page_list)
kcopyd_put_pages(kc, job->pages);
/*
* If this is the master job, the sub jobs have already
* completed so we can free everything.
*/
if (job->master_job == job)
mempool_free(job, kc->job_pool);
fn(read_err, write_err, context);
if (atomic_dec_and_test(&kc->nr_jobs))
wake_up(&kc->destroyq);
return 0;
}
static void complete_io(unsigned long error, void *context)
{
struct kcopyd_job *job = (struct kcopyd_job *) context;
struct dm_kcopyd_client *kc = job->kc;
io_job_finish(kc->throttle);
if (error) {
if (job->rw & WRITE)
job->write_err |= error;
else
job->read_err = 1;
if (!test_bit(DM_KCOPYD_IGNORE_ERROR, &job->flags)) {
push(&kc->complete_jobs, job);
wake(kc);
return;
}
}
if (job->rw & WRITE)
push(&kc->complete_jobs, job);
else {
job->rw = WRITE;
push(&kc->io_jobs, job);
}
wake(kc);
}
/*
* Request io on as many buffer heads as we can currently get for
* a particular job.
*/
static int run_io_job(struct kcopyd_job *job)
{
int r;
struct dm_io_request io_req = {
.bi_rw = job->rw,
.mem.type = DM_IO_PAGE_LIST,
.mem.ptr.pl = job->pages,
.mem.offset = 0,
.notify.fn = complete_io,
.notify.context = job,
.client = job->kc->io_client,
};
io_job_start(job->kc->throttle);
if (job->rw == READ)
r = dm_io(&io_req, 1, &job->source, NULL);
else
r = dm_io(&io_req, job->num_dests, job->dests, NULL);
return r;
}
static int run_pages_job(struct kcopyd_job *job)
{
int r;
unsigned nr_pages = dm_div_up(job->dests[0].count, PAGE_SIZE >> 9);
r = kcopyd_get_pages(job->kc, nr_pages, &job->pages);
if (!r) {
/* this job is ready for io */
push(&job->kc->io_jobs, job);
return 0;
}
if (r == -ENOMEM)
/* can't complete now */
return 1;
return r;
}
/*
* Run through a list for as long as possible. Returns the count
* of successful jobs.
*/
static int process_jobs(struct list_head *jobs, struct dm_kcopyd_client *kc,
int (*fn) (struct kcopyd_job *))
{
struct kcopyd_job *job;
int r, count = 0;
while ((job = pop(jobs, kc))) {
r = fn(job);
if (r < 0) {
/* error this rogue job */
if (job->rw & WRITE)
job->write_err = (unsigned long) -1L;
else
job->read_err = 1;
push(&kc->complete_jobs, job);
break;
}
if (r > 0) {
/*
* We couldn't service this job ATM, so
* push this job back onto the list.
*/
push_head(jobs, job);
break;
}
count++;
}
return count;
}
/*
* kcopyd does this every time it's woken up.
*/
static void do_work(struct work_struct *work)
{
struct dm_kcopyd_client *kc = container_of(work,
struct dm_kcopyd_client, kcopyd_work);
struct blk_plug plug;
/*
* The order that these are called is *very* important.
* complete jobs can free some pages for pages jobs.
* Pages jobs when successful will jump onto the io jobs
* list. io jobs call wake when they complete and it all
* starts again.
*/
blk_start_plug(&plug);
process_jobs(&kc->complete_jobs, kc, run_complete_job);
process_jobs(&kc->pages_jobs, kc, run_pages_job);
process_jobs(&kc->io_jobs, kc, run_io_job);
blk_finish_plug(&plug);
}
/*
* If we are copying a small region we just dispatch a single job
* to do the copy, otherwise the io has to be split up into many
* jobs.
*/
static void dispatch_job(struct kcopyd_job *job)
{
struct dm_kcopyd_client *kc = job->kc;
atomic_inc(&kc->nr_jobs);
if (unlikely(!job->source.count))
push(&kc->complete_jobs, job);
else if (job->pages == &zero_page_list)
push(&kc->io_jobs, job);
else
push(&kc->pages_jobs, job);
wake(kc);
}
static void segment_complete(int read_err, unsigned long write_err,
void *context)
{
/* FIXME: tidy this function */
sector_t progress = 0;
sector_t count = 0;
struct kcopyd_job *sub_job = (struct kcopyd_job *) context;
struct kcopyd_job *job = sub_job->master_job;
struct dm_kcopyd_client *kc = job->kc;
mutex_lock(&job->lock);
/* update the error */
if (read_err)
job->read_err = 1;
if (write_err)
job->write_err |= write_err;
/*
* Only dispatch more work if there hasn't been an error.
*/
if ((!job->read_err && !job->write_err) ||
test_bit(DM_KCOPYD_IGNORE_ERROR, &job->flags)) {
/* get the next chunk of work */
progress = job->progress;
count = job->source.count - progress;
if (count) {
if (count > SUB_JOB_SIZE)
count = SUB_JOB_SIZE;
job->progress += count;
}
}
mutex_unlock(&job->lock);
if (count) {
int i;
*sub_job = *job;
sub_job->source.sector += progress;
sub_job->source.count = count;
for (i = 0; i < job->num_dests; i++) {
sub_job->dests[i].sector += progress;
sub_job->dests[i].count = count;
}
sub_job->fn = segment_complete;
sub_job->context = sub_job;
dispatch_job(sub_job);
} else if (atomic_dec_and_test(&job->sub_jobs)) {
/*
* Queue the completion callback to the kcopyd thread.
*
* Some callers assume that all the completions are called
* from a single thread and don't race with each other.
*
* We must not call the callback directly here because this
* code may not be executing in the thread.
*/
push(&kc->complete_jobs, job);
wake(kc);
}
}
/*
* Create some sub jobs to share the work between them.
*/
static void split_job(struct kcopyd_job *master_job)
{
int i;
atomic_inc(&master_job->kc->nr_jobs);
atomic_set(&master_job->sub_jobs, SPLIT_COUNT);
for (i = 0; i < SPLIT_COUNT; i++) {
master_job[i + 1].master_job = master_job;
segment_complete(0, 0u, &master_job[i + 1]);
}
}
int dm_kcopyd_copy(struct dm_kcopyd_client *kc, struct dm_io_region *from,
unsigned int num_dests, struct dm_io_region *dests,
unsigned int flags, dm_kcopyd_notify_fn fn, void *context)
{
struct kcopyd_job *job;
int i;
/*
* Allocate an array of jobs consisting of one master job
* followed by SPLIT_COUNT sub jobs.
*/
job = mempool_alloc(kc->job_pool, GFP_NOIO);
/*
* set up for the read.
*/
job->kc = kc;
job->flags = flags;
job->read_err = 0;
job->write_err = 0;
job->num_dests = num_dests;
memcpy(&job->dests, dests, sizeof(*dests) * num_dests);
if (from) {
job->source = *from;
job->pages = NULL;
job->rw = READ;
} else {
memset(&job->source, 0, sizeof job->source);
job->source.count = job->dests[0].count;
job->pages = &zero_page_list;
/*
* Use WRITE SAME to optimize zeroing if all dests support it.
*/
job->rw = WRITE | REQ_WRITE_SAME;
for (i = 0; i < job->num_dests; i++)
if (!bdev_write_same(job->dests[i].bdev)) {
job->rw = WRITE;
break;
}
}
job->fn = fn;
job->context = context;
job->master_job = job;
if (job->source.count <= SUB_JOB_SIZE)
dispatch_job(job);
else {
mutex_init(&job->lock);
job->progress = 0;
split_job(job);
}
return 0;
}
EXPORT_SYMBOL(dm_kcopyd_copy);
int dm_kcopyd_zero(struct dm_kcopyd_client *kc,
unsigned num_dests, struct dm_io_region *dests,
unsigned flags, dm_kcopyd_notify_fn fn, void *context)
{
return dm_kcopyd_copy(kc, NULL, num_dests, dests, flags, fn, context);
}
EXPORT_SYMBOL(dm_kcopyd_zero);
void *dm_kcopyd_prepare_callback(struct dm_kcopyd_client *kc,
dm_kcopyd_notify_fn fn, void *context)
{
struct kcopyd_job *job;
job = mempool_alloc(kc->job_pool, GFP_NOIO);
memset(job, 0, sizeof(struct kcopyd_job));
job->kc = kc;
job->fn = fn;
job->context = context;
job->master_job = job;
atomic_inc(&kc->nr_jobs);
return job;
}
EXPORT_SYMBOL(dm_kcopyd_prepare_callback);
void dm_kcopyd_do_callback(void *j, int read_err, unsigned long write_err)
{
struct kcopyd_job *job = j;
struct dm_kcopyd_client *kc = job->kc;
job->read_err = read_err;
job->write_err = write_err;
push(&kc->complete_jobs, job);
wake(kc);
}
EXPORT_SYMBOL(dm_kcopyd_do_callback);
/*
* Cancels a kcopyd job, eg. someone might be deactivating a
* mirror.
*/
#if 0
int kcopyd_cancel(struct kcopyd_job *job, int block)
{
/* FIXME: finish */
return -1;
}
#endif /* 0 */
/*-----------------------------------------------------------------
* Client setup
*---------------------------------------------------------------*/
struct dm_kcopyd_client *dm_kcopyd_client_create(struct dm_kcopyd_throttle *throttle)
{
int r = -ENOMEM;
struct dm_kcopyd_client *kc;
kc = kmalloc(sizeof(*kc), GFP_KERNEL);
if (!kc)
return ERR_PTR(-ENOMEM);
spin_lock_init(&kc->job_lock);
INIT_LIST_HEAD(&kc->complete_jobs);
INIT_LIST_HEAD(&kc->io_jobs);
INIT_LIST_HEAD(&kc->pages_jobs);
kc->throttle = throttle;
kc->job_pool = mempool_create_slab_pool(MIN_JOBS, _job_cache);
if (!kc->job_pool)
goto bad_slab;
INIT_WORK(&kc->kcopyd_work, do_work);
kc->kcopyd_wq = alloc_workqueue("kcopyd", WQ_MEM_RECLAIM, 0);
if (!kc->kcopyd_wq)
goto bad_workqueue;
kc->pages = NULL;
kc->nr_reserved_pages = kc->nr_free_pages = 0;
r = client_reserve_pages(kc, RESERVE_PAGES);
if (r)
goto bad_client_pages;
kc->io_client = dm_io_client_create();
if (IS_ERR(kc->io_client)) {
r = PTR_ERR(kc->io_client);
goto bad_io_client;
}
init_waitqueue_head(&kc->destroyq);
atomic_set(&kc->nr_jobs, 0);
return kc;
bad_io_client:
client_free_pages(kc);
bad_client_pages:
destroy_workqueue(kc->kcopyd_wq);
bad_workqueue:
mempool_destroy(kc->job_pool);
bad_slab:
kfree(kc);
return ERR_PTR(r);
}
EXPORT_SYMBOL(dm_kcopyd_client_create);
void dm_kcopyd_client_destroy(struct dm_kcopyd_client *kc)
{
/* Wait for completion of all jobs submitted by this client. */
wait_event(kc->destroyq, !atomic_read(&kc->nr_jobs));
BUG_ON(!list_empty(&kc->complete_jobs));
BUG_ON(!list_empty(&kc->io_jobs));
BUG_ON(!list_empty(&kc->pages_jobs));
destroy_workqueue(kc->kcopyd_wq);
dm_io_client_destroy(kc->io_client);
client_free_pages(kc);
mempool_destroy(kc->job_pool);
kfree(kc);
}
EXPORT_SYMBOL(dm_kcopyd_client_destroy);

182
drivers/md/dm-linear.c Normal file
View file

@ -0,0 +1,182 @@
/*
* Copyright (C) 2001-2003 Sistina Software (UK) Limited.
*
* This file is released under the GPL.
*/
#include "dm.h"
#include <linux/module.h>
#include <linux/init.h>
#include <linux/blkdev.h>
#include <linux/bio.h>
#include <linux/slab.h>
#include <linux/device-mapper.h>
#define DM_MSG_PREFIX "linear"
/*
* Linear: maps a linear range of a device.
*/
struct linear_c {
struct dm_dev *dev;
sector_t start;
};
/*
* Construct a linear mapping: <dev_path> <offset>
*/
static int linear_ctr(struct dm_target *ti, unsigned int argc, char **argv)
{
struct linear_c *lc;
unsigned long long tmp;
char dummy;
if (argc != 2) {
ti->error = "Invalid argument count";
return -EINVAL;
}
lc = kmalloc(sizeof(*lc), GFP_KERNEL);
if (lc == NULL) {
ti->error = "dm-linear: Cannot allocate linear context";
return -ENOMEM;
}
if (sscanf(argv[1], "%llu%c", &tmp, &dummy) != 1) {
ti->error = "dm-linear: Invalid device sector";
goto bad;
}
lc->start = tmp;
if (dm_get_device(ti, argv[0], dm_table_get_mode(ti->table), &lc->dev)) {
ti->error = "dm-linear: Device lookup failed";
goto bad;
}
ti->num_flush_bios = 1;
ti->num_discard_bios = 1;
ti->num_write_same_bios = 1;
ti->private = lc;
return 0;
bad:
kfree(lc);
return -EINVAL;
}
static void linear_dtr(struct dm_target *ti)
{
struct linear_c *lc = (struct linear_c *) ti->private;
dm_put_device(ti, lc->dev);
kfree(lc);
}
static sector_t linear_map_sector(struct dm_target *ti, sector_t bi_sector)
{
struct linear_c *lc = ti->private;
return lc->start + dm_target_offset(ti, bi_sector);
}
static void linear_map_bio(struct dm_target *ti, struct bio *bio)
{
struct linear_c *lc = ti->private;
bio->bi_bdev = lc->dev->bdev;
if (bio_sectors(bio))
bio->bi_iter.bi_sector =
linear_map_sector(ti, bio->bi_iter.bi_sector);
}
static int linear_map(struct dm_target *ti, struct bio *bio)
{
linear_map_bio(ti, bio);
return DM_MAPIO_REMAPPED;
}
static void linear_status(struct dm_target *ti, status_type_t type,
unsigned status_flags, char *result, unsigned maxlen)
{
struct linear_c *lc = (struct linear_c *) ti->private;
switch (type) {
case STATUSTYPE_INFO:
result[0] = '\0';
break;
case STATUSTYPE_TABLE:
snprintf(result, maxlen, "%s %llu", lc->dev->name,
(unsigned long long)lc->start);
break;
}
}
static int linear_ioctl(struct dm_target *ti, unsigned int cmd,
unsigned long arg)
{
struct linear_c *lc = (struct linear_c *) ti->private;
struct dm_dev *dev = lc->dev;
int r = 0;
/*
* Only pass ioctls through if the device sizes match exactly.
*/
if (lc->start ||
ti->len != i_size_read(dev->bdev->bd_inode) >> SECTOR_SHIFT)
r = scsi_verify_blk_ioctl(NULL, cmd);
return r ? : __blkdev_driver_ioctl(dev->bdev, dev->mode, cmd, arg);
}
static int linear_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
struct bio_vec *biovec, int max_size)
{
struct linear_c *lc = ti->private;
struct request_queue *q = bdev_get_queue(lc->dev->bdev);
if (!q->merge_bvec_fn)
return max_size;
bvm->bi_bdev = lc->dev->bdev;
bvm->bi_sector = linear_map_sector(ti, bvm->bi_sector);
return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
}
static int linear_iterate_devices(struct dm_target *ti,
iterate_devices_callout_fn fn, void *data)
{
struct linear_c *lc = ti->private;
return fn(ti, lc->dev, lc->start, ti->len, data);
}
static struct target_type linear_target = {
.name = "linear",
.version = {1, 2, 1},
.module = THIS_MODULE,
.ctr = linear_ctr,
.dtr = linear_dtr,
.map = linear_map,
.status = linear_status,
.ioctl = linear_ioctl,
.merge = linear_merge,
.iterate_devices = linear_iterate_devices,
};
int __init dm_linear_init(void)
{
int r = dm_register_target(&linear_target);
if (r < 0)
DMERR("register failed %d", r);
return r;
}
void dm_linear_exit(void)
{
dm_unregister_target(&linear_target);
}

View file

@ -0,0 +1,930 @@
/*
* Copyright (C) 2006-2009 Red Hat, Inc.
*
* This file is released under the LGPL.
*/
#include <linux/bio.h>
#include <linux/slab.h>
#include <linux/dm-dirty-log.h>
#include <linux/device-mapper.h>
#include <linux/dm-log-userspace.h>
#include <linux/module.h>
#include <linux/workqueue.h>
#include "dm-log-userspace-transfer.h"
#define DM_LOG_USERSPACE_VSN "1.3.0"
struct flush_entry {
int type;
region_t region;
struct list_head list;
};
/*
* This limit on the number of mark and clear request is, to a degree,
* arbitrary. However, there is some basis for the choice in the limits
* imposed on the size of data payload by dm-log-userspace-transfer.c:
* dm_consult_userspace().
*/
#define MAX_FLUSH_GROUP_COUNT 32
struct log_c {
struct dm_target *ti;
struct dm_dev *log_dev;
uint32_t region_size;
region_t region_count;
uint64_t luid;
char uuid[DM_UUID_LEN];
char *usr_argv_str;
uint32_t usr_argc;
/*
* in_sync_hint gets set when doing is_remote_recovering. It
* represents the first region that needs recovery. IOW, the
* first zero bit of sync_bits. This can be useful for to limit
* traffic for calls like is_remote_recovering and get_resync_work,
* but be take care in its use for anything else.
*/
uint64_t in_sync_hint;
/*
* Mark and clear requests are held until a flush is issued
* so that we can group, and thereby limit, the amount of
* network traffic between kernel and userspace. The 'flush_lock'
* is used to protect these lists.
*/
spinlock_t flush_lock;
struct list_head mark_list;
struct list_head clear_list;
/*
* Workqueue for flush of clear region requests.
*/
struct workqueue_struct *dmlog_wq;
struct delayed_work flush_log_work;
atomic_t sched_flush;
/*
* Combine userspace flush and mark requests for efficiency.
*/
uint32_t integrated_flush;
};
static mempool_t *flush_entry_pool;
static void *flush_entry_alloc(gfp_t gfp_mask, void *pool_data)
{
return kmalloc(sizeof(struct flush_entry), gfp_mask);
}
static void flush_entry_free(void *element, void *pool_data)
{
kfree(element);
}
static int userspace_do_request(struct log_c *lc, const char *uuid,
int request_type, char *data, size_t data_size,
char *rdata, size_t *rdata_size)
{
int r;
/*
* If the server isn't there, -ESRCH is returned,
* and we must keep trying until the server is
* restored.
*/
retry:
r = dm_consult_userspace(uuid, lc->luid, request_type, data,
data_size, rdata, rdata_size);
if (r != -ESRCH)
return r;
DMERR(" Userspace log server not found.");
while (1) {
set_current_state(TASK_INTERRUPTIBLE);
schedule_timeout(2*HZ);
DMWARN("Attempting to contact userspace log server...");
r = dm_consult_userspace(uuid, lc->luid, DM_ULOG_CTR,
lc->usr_argv_str,
strlen(lc->usr_argv_str) + 1,
NULL, NULL);
if (!r)
break;
}
DMINFO("Reconnected to userspace log server... DM_ULOG_CTR complete");
r = dm_consult_userspace(uuid, lc->luid, DM_ULOG_RESUME, NULL,
0, NULL, NULL);
if (!r)
goto retry;
DMERR("Error trying to resume userspace log: %d", r);
return -ESRCH;
}
static int build_constructor_string(struct dm_target *ti,
unsigned argc, char **argv,
char **ctr_str)
{
int i, str_size;
char *str = NULL;
*ctr_str = NULL;
/*
* Determine overall size of the string.
*/
for (i = 0, str_size = 0; i < argc; i++)
str_size += strlen(argv[i]) + 1; /* +1 for space between args */
str_size += 20; /* Max number of chars in a printed u64 number */
str = kzalloc(str_size, GFP_KERNEL);
if (!str) {
DMWARN("Unable to allocate memory for constructor string");
return -ENOMEM;
}
str_size = sprintf(str, "%llu", (unsigned long long)ti->len);
for (i = 0; i < argc; i++)
str_size += sprintf(str + str_size, " %s", argv[i]);
*ctr_str = str;
return str_size;
}
static void do_flush(struct work_struct *work)
{
int r;
struct log_c *lc = container_of(work, struct log_c, flush_log_work.work);
atomic_set(&lc->sched_flush, 0);
r = userspace_do_request(lc, lc->uuid, DM_ULOG_FLUSH, NULL, 0, NULL, NULL);
if (r)
dm_table_event(lc->ti->table);
}
/*
* userspace_ctr
*
* argv contains:
* <UUID> [integrated_flush] <other args>
* Where 'other args' are the userspace implementation-specific log
* arguments.
*
* Example:
* <UUID> [integrated_flush] clustered-disk <arg count> <log dev>
* <region_size> [[no]sync]
*
* This module strips off the <UUID> and uses it for identification
* purposes when communicating with userspace about a log.
*
* If integrated_flush is defined, the kernel combines flush
* and mark requests.
*
* The rest of the line, beginning with 'clustered-disk', is passed
* to the userspace ctr function.
*/
static int userspace_ctr(struct dm_dirty_log *log, struct dm_target *ti,
unsigned argc, char **argv)
{
int r = 0;
int str_size;
char *ctr_str = NULL;
struct log_c *lc = NULL;
uint64_t rdata;
size_t rdata_size = sizeof(rdata);
char *devices_rdata = NULL;
size_t devices_rdata_size = DM_NAME_LEN;
if (argc < 3) {
DMWARN("Too few arguments to userspace dirty log");
return -EINVAL;
}
lc = kzalloc(sizeof(*lc), GFP_KERNEL);
if (!lc) {
DMWARN("Unable to allocate userspace log context.");
return -ENOMEM;
}
/* The ptr value is sufficient for local unique id */
lc->luid = (unsigned long)lc;
lc->ti = ti;
if (strlen(argv[0]) > (DM_UUID_LEN - 1)) {
DMWARN("UUID argument too long.");
kfree(lc);
return -EINVAL;
}
lc->usr_argc = argc;
strncpy(lc->uuid, argv[0], DM_UUID_LEN);
argc--;
argv++;
spin_lock_init(&lc->flush_lock);
INIT_LIST_HEAD(&lc->mark_list);
INIT_LIST_HEAD(&lc->clear_list);
if (!strcasecmp(argv[0], "integrated_flush")) {
lc->integrated_flush = 1;
argc--;
argv++;
}
str_size = build_constructor_string(ti, argc, argv, &ctr_str);
if (str_size < 0) {
kfree(lc);
return str_size;
}
devices_rdata = kzalloc(devices_rdata_size, GFP_KERNEL);
if (!devices_rdata) {
DMERR("Failed to allocate memory for device information");
r = -ENOMEM;
goto out;
}
/*
* Send table string and get back any opened device.
*/
r = dm_consult_userspace(lc->uuid, lc->luid, DM_ULOG_CTR,
ctr_str, str_size,
devices_rdata, &devices_rdata_size);
if (r < 0) {
if (r == -ESRCH)
DMERR("Userspace log server not found");
else
DMERR("Userspace log server failed to create log");
goto out;
}
/* Since the region size does not change, get it now */
rdata_size = sizeof(rdata);
r = dm_consult_userspace(lc->uuid, lc->luid, DM_ULOG_GET_REGION_SIZE,
NULL, 0, (char *)&rdata, &rdata_size);
if (r) {
DMERR("Failed to get region size of dirty log");
goto out;
}
lc->region_size = (uint32_t)rdata;
lc->region_count = dm_sector_div_up(ti->len, lc->region_size);
if (devices_rdata_size) {
if (devices_rdata[devices_rdata_size - 1] != '\0') {
DMERR("DM_ULOG_CTR device return string not properly terminated");
r = -EINVAL;
goto out;
}
r = dm_get_device(ti, devices_rdata,
dm_table_get_mode(ti->table), &lc->log_dev);
if (r)
DMERR("Failed to register %s with device-mapper",
devices_rdata);
}
if (lc->integrated_flush) {
lc->dmlog_wq = alloc_workqueue("dmlogd", WQ_MEM_RECLAIM, 0);
if (!lc->dmlog_wq) {
DMERR("couldn't start dmlogd");
r = -ENOMEM;
goto out;
}
INIT_DELAYED_WORK(&lc->flush_log_work, do_flush);
atomic_set(&lc->sched_flush, 0);
}
out:
kfree(devices_rdata);
if (r) {
kfree(lc);
kfree(ctr_str);
} else {
lc->usr_argv_str = ctr_str;
log->context = lc;
}
return r;
}
static void userspace_dtr(struct dm_dirty_log *log)
{
struct log_c *lc = log->context;
if (lc->integrated_flush) {
/* flush workqueue */
if (atomic_read(&lc->sched_flush))
flush_delayed_work(&lc->flush_log_work);
destroy_workqueue(lc->dmlog_wq);
}
(void) dm_consult_userspace(lc->uuid, lc->luid, DM_ULOG_DTR,
NULL, 0, NULL, NULL);
if (lc->log_dev)
dm_put_device(lc->ti, lc->log_dev);
kfree(lc->usr_argv_str);
kfree(lc);
return;
}
static int userspace_presuspend(struct dm_dirty_log *log)
{
int r;
struct log_c *lc = log->context;
r = dm_consult_userspace(lc->uuid, lc->luid, DM_ULOG_PRESUSPEND,
NULL, 0, NULL, NULL);
return r;
}
static int userspace_postsuspend(struct dm_dirty_log *log)
{
int r;
struct log_c *lc = log->context;
/*
* Run planned flush earlier.
*/
if (lc->integrated_flush && atomic_read(&lc->sched_flush))
flush_delayed_work(&lc->flush_log_work);
r = dm_consult_userspace(lc->uuid, lc->luid, DM_ULOG_POSTSUSPEND,
NULL, 0, NULL, NULL);
return r;
}
static int userspace_resume(struct dm_dirty_log *log)
{
int r;
struct log_c *lc = log->context;
lc->in_sync_hint = 0;
r = dm_consult_userspace(lc->uuid, lc->luid, DM_ULOG_RESUME,
NULL, 0, NULL, NULL);
return r;
}
static uint32_t userspace_get_region_size(struct dm_dirty_log *log)
{
struct log_c *lc = log->context;
return lc->region_size;
}
/*
* userspace_is_clean
*
* Check whether a region is clean. If there is any sort of
* failure when consulting the server, we return not clean.
*
* Returns: 1 if clean, 0 otherwise
*/
static int userspace_is_clean(struct dm_dirty_log *log, region_t region)
{
int r;
uint64_t region64 = (uint64_t)region;
int64_t is_clean;
size_t rdata_size;
struct log_c *lc = log->context;
rdata_size = sizeof(is_clean);
r = userspace_do_request(lc, lc->uuid, DM_ULOG_IS_CLEAN,
(char *)&region64, sizeof(region64),
(char *)&is_clean, &rdata_size);
return (r) ? 0 : (int)is_clean;
}
/*
* userspace_in_sync
*
* Check if the region is in-sync. If there is any sort
* of failure when consulting the server, we assume that
* the region is not in sync.
*
* If 'can_block' is set, return immediately
*
* Returns: 1 if in-sync, 0 if not-in-sync, -EWOULDBLOCK
*/
static int userspace_in_sync(struct dm_dirty_log *log, region_t region,
int can_block)
{
int r;
uint64_t region64 = region;
int64_t in_sync;
size_t rdata_size;
struct log_c *lc = log->context;
/*
* We can never respond directly - even if in_sync_hint is
* set. This is because another machine could see a device
* failure and mark the region out-of-sync. If we don't go
* to userspace to ask, we might think the region is in-sync
* and allow a read to pick up data that is stale. (This is
* very unlikely if a device actually fails; but it is very
* likely if a connection to one device from one machine fails.)
*
* There still might be a problem if the mirror caches the region
* state as in-sync... but then this call would not be made. So,
* that is a mirror problem.
*/
if (!can_block)
return -EWOULDBLOCK;
rdata_size = sizeof(in_sync);
r = userspace_do_request(lc, lc->uuid, DM_ULOG_IN_SYNC,
(char *)&region64, sizeof(region64),
(char *)&in_sync, &rdata_size);
return (r) ? 0 : (int)in_sync;
}
static int flush_one_by_one(struct log_c *lc, struct list_head *flush_list)
{
int r = 0;
struct flush_entry *fe;
list_for_each_entry(fe, flush_list, list) {
r = userspace_do_request(lc, lc->uuid, fe->type,
(char *)&fe->region,
sizeof(fe->region),
NULL, NULL);
if (r)
break;
}
return r;
}
static int flush_by_group(struct log_c *lc, struct list_head *flush_list,
int flush_with_payload)
{
int r = 0;
int count;
uint32_t type = 0;
struct flush_entry *fe, *tmp_fe;
LIST_HEAD(tmp_list);
uint64_t group[MAX_FLUSH_GROUP_COUNT];
/*
* Group process the requests
*/
while (!list_empty(flush_list)) {
count = 0;
list_for_each_entry_safe(fe, tmp_fe, flush_list, list) {
group[count] = fe->region;
count++;
list_move(&fe->list, &tmp_list);
type = fe->type;
if (count >= MAX_FLUSH_GROUP_COUNT)
break;
}
if (flush_with_payload) {
r = userspace_do_request(lc, lc->uuid, DM_ULOG_FLUSH,
(char *)(group),
count * sizeof(uint64_t),
NULL, NULL);
/*
* Integrated flush failed.
*/
if (r)
break;
} else {
r = userspace_do_request(lc, lc->uuid, type,
(char *)(group),
count * sizeof(uint64_t),
NULL, NULL);
if (r) {
/*
* Group send failed. Attempt one-by-one.
*/
list_splice_init(&tmp_list, flush_list);
r = flush_one_by_one(lc, flush_list);
break;
}
}
}
/*
* Must collect flush_entrys that were successfully processed
* as a group so that they will be free'd by the caller.
*/
list_splice_init(&tmp_list, flush_list);
return r;
}
/*
* userspace_flush
*
* This function is ok to block.
* The flush happens in two stages. First, it sends all
* clear/mark requests that are on the list. Then it
* tells the server to commit them. This gives the
* server a chance to optimise the commit, instead of
* doing it for every request.
*
* Additionally, we could implement another thread that
* sends the requests up to the server - reducing the
* load on flush. Then the flush would have less in
* the list and be responsible for the finishing commit.
*
* Returns: 0 on success, < 0 on failure
*/
static int userspace_flush(struct dm_dirty_log *log)
{
int r = 0;
unsigned long flags;
struct log_c *lc = log->context;
LIST_HEAD(mark_list);
LIST_HEAD(clear_list);
int mark_list_is_empty;
int clear_list_is_empty;
struct flush_entry *fe, *tmp_fe;
spin_lock_irqsave(&lc->flush_lock, flags);
list_splice_init(&lc->mark_list, &mark_list);
list_splice_init(&lc->clear_list, &clear_list);
spin_unlock_irqrestore(&lc->flush_lock, flags);
mark_list_is_empty = list_empty(&mark_list);
clear_list_is_empty = list_empty(&clear_list);
if (mark_list_is_empty && clear_list_is_empty)
return 0;
r = flush_by_group(lc, &clear_list, 0);
if (r)
goto out;
if (!lc->integrated_flush) {
r = flush_by_group(lc, &mark_list, 0);
if (r)
goto out;
r = userspace_do_request(lc, lc->uuid, DM_ULOG_FLUSH,
NULL, 0, NULL, NULL);
goto out;
}
/*
* Send integrated flush request with mark_list as payload.
*/
r = flush_by_group(lc, &mark_list, 1);
if (r)
goto out;
if (mark_list_is_empty && !atomic_read(&lc->sched_flush)) {
/*
* When there are only clear region requests,
* we schedule a flush in the future.
*/
queue_delayed_work(lc->dmlog_wq, &lc->flush_log_work, 3 * HZ);
atomic_set(&lc->sched_flush, 1);
} else {
/*
* Cancel pending flush because we
* have already flushed in mark_region.
*/
cancel_delayed_work(&lc->flush_log_work);
atomic_set(&lc->sched_flush, 0);
}
out:
/*
* We can safely remove these entries, even after failure.
* Calling code will receive an error and will know that
* the log facility has failed.
*/
list_for_each_entry_safe(fe, tmp_fe, &mark_list, list) {
list_del(&fe->list);
mempool_free(fe, flush_entry_pool);
}
list_for_each_entry_safe(fe, tmp_fe, &clear_list, list) {
list_del(&fe->list);
mempool_free(fe, flush_entry_pool);
}
if (r)
dm_table_event(lc->ti->table);
return r;
}
/*
* userspace_mark_region
*
* This function should avoid blocking unless absolutely required.
* (Memory allocation is valid for blocking.)
*/
static void userspace_mark_region(struct dm_dirty_log *log, region_t region)
{
unsigned long flags;
struct log_c *lc = log->context;
struct flush_entry *fe;
/* Wait for an allocation, but _never_ fail */
fe = mempool_alloc(flush_entry_pool, GFP_NOIO);
BUG_ON(!fe);
spin_lock_irqsave(&lc->flush_lock, flags);
fe->type = DM_ULOG_MARK_REGION;
fe->region = region;
list_add(&fe->list, &lc->mark_list);
spin_unlock_irqrestore(&lc->flush_lock, flags);
return;
}
/*
* userspace_clear_region
*
* This function must not block.
* So, the alloc can't block. In the worst case, it is ok to
* fail. It would simply mean we can't clear the region.
* Does nothing to current sync context, but does mean
* the region will be re-sync'ed on a reload of the mirror
* even though it is in-sync.
*/
static void userspace_clear_region(struct dm_dirty_log *log, region_t region)
{
unsigned long flags;
struct log_c *lc = log->context;
struct flush_entry *fe;
/*
* If we fail to allocate, we skip the clearing of
* the region. This doesn't hurt us in any way, except
* to cause the region to be resync'ed when the
* device is activated next time.
*/
fe = mempool_alloc(flush_entry_pool, GFP_ATOMIC);
if (!fe) {
DMERR("Failed to allocate memory to clear region.");
return;
}
spin_lock_irqsave(&lc->flush_lock, flags);
fe->type = DM_ULOG_CLEAR_REGION;
fe->region = region;
list_add(&fe->list, &lc->clear_list);
spin_unlock_irqrestore(&lc->flush_lock, flags);
return;
}
/*
* userspace_get_resync_work
*
* Get a region that needs recovery. It is valid to return
* an error for this function.
*
* Returns: 1 if region filled, 0 if no work, <0 on error
*/
static int userspace_get_resync_work(struct dm_dirty_log *log, region_t *region)
{
int r;
size_t rdata_size;
struct log_c *lc = log->context;
struct {
int64_t i; /* 64-bit for mix arch compatibility */
region_t r;
} pkg;
if (lc->in_sync_hint >= lc->region_count)
return 0;
rdata_size = sizeof(pkg);
r = userspace_do_request(lc, lc->uuid, DM_ULOG_GET_RESYNC_WORK,
NULL, 0, (char *)&pkg, &rdata_size);
*region = pkg.r;
return (r) ? r : (int)pkg.i;
}
/*
* userspace_set_region_sync
*
* Set the sync status of a given region. This function
* must not fail.
*/
static void userspace_set_region_sync(struct dm_dirty_log *log,
region_t region, int in_sync)
{
int r;
struct log_c *lc = log->context;
struct {
region_t r;
int64_t i;
} pkg;
pkg.r = region;
pkg.i = (int64_t)in_sync;
r = userspace_do_request(lc, lc->uuid, DM_ULOG_SET_REGION_SYNC,
(char *)&pkg, sizeof(pkg), NULL, NULL);
/*
* It would be nice to be able to report failures.
* However, it is easy emough to detect and resolve.
*/
return;
}
/*
* userspace_get_sync_count
*
* If there is any sort of failure when consulting the server,
* we assume that the sync count is zero.
*
* Returns: sync count on success, 0 on failure
*/
static region_t userspace_get_sync_count(struct dm_dirty_log *log)
{
int r;
size_t rdata_size;
uint64_t sync_count;
struct log_c *lc = log->context;
rdata_size = sizeof(sync_count);
r = userspace_do_request(lc, lc->uuid, DM_ULOG_GET_SYNC_COUNT,
NULL, 0, (char *)&sync_count, &rdata_size);
if (r)
return 0;
if (sync_count >= lc->region_count)
lc->in_sync_hint = lc->region_count;
return (region_t)sync_count;
}
/*
* userspace_status
*
* Returns: amount of space consumed
*/
static int userspace_status(struct dm_dirty_log *log, status_type_t status_type,
char *result, unsigned maxlen)
{
int r = 0;
char *table_args;
size_t sz = (size_t)maxlen;
struct log_c *lc = log->context;
switch (status_type) {
case STATUSTYPE_INFO:
r = userspace_do_request(lc, lc->uuid, DM_ULOG_STATUS_INFO,
NULL, 0, result, &sz);
if (r) {
sz = 0;
DMEMIT("%s 1 COM_FAILURE", log->type->name);
}
break;
case STATUSTYPE_TABLE:
sz = 0;
table_args = strchr(lc->usr_argv_str, ' ');
BUG_ON(!table_args); /* There will always be a ' ' */
table_args++;
DMEMIT("%s %u %s ", log->type->name, lc->usr_argc, lc->uuid);
if (lc->integrated_flush)
DMEMIT("integrated_flush ");
DMEMIT("%s ", table_args);
break;
}
return (r) ? 0 : (int)sz;
}
/*
* userspace_is_remote_recovering
*
* Returns: 1 if region recovering, 0 otherwise
*/
static int userspace_is_remote_recovering(struct dm_dirty_log *log,
region_t region)
{
int r;
uint64_t region64 = region;
struct log_c *lc = log->context;
static unsigned long long limit;
struct {
int64_t is_recovering;
uint64_t in_sync_hint;
} pkg;
size_t rdata_size = sizeof(pkg);
/*
* Once the mirror has been reported to be in-sync,
* it will never again ask for recovery work. So,
* we can safely say there is not a remote machine
* recovering if the device is in-sync. (in_sync_hint
* must be reset at resume time.)
*/
if (region < lc->in_sync_hint)
return 0;
else if (jiffies < limit)
return 1;
limit = jiffies + (HZ / 4);
r = userspace_do_request(lc, lc->uuid, DM_ULOG_IS_REMOTE_RECOVERING,
(char *)&region64, sizeof(region64),
(char *)&pkg, &rdata_size);
if (r)
return 1;
lc->in_sync_hint = pkg.in_sync_hint;
return (int)pkg.is_recovering;
}
static struct dm_dirty_log_type _userspace_type = {
.name = "userspace",
.module = THIS_MODULE,
.ctr = userspace_ctr,
.dtr = userspace_dtr,
.presuspend = userspace_presuspend,
.postsuspend = userspace_postsuspend,
.resume = userspace_resume,
.get_region_size = userspace_get_region_size,
.is_clean = userspace_is_clean,
.in_sync = userspace_in_sync,
.flush = userspace_flush,
.mark_region = userspace_mark_region,
.clear_region = userspace_clear_region,
.get_resync_work = userspace_get_resync_work,
.set_region_sync = userspace_set_region_sync,
.get_sync_count = userspace_get_sync_count,
.status = userspace_status,
.is_remote_recovering = userspace_is_remote_recovering,
};
static int __init userspace_dirty_log_init(void)
{
int r = 0;
flush_entry_pool = mempool_create(100, flush_entry_alloc,
flush_entry_free, NULL);
if (!flush_entry_pool) {
DMWARN("Unable to create flush_entry_pool: No memory.");
return -ENOMEM;
}
r = dm_ulog_tfr_init();
if (r) {
DMWARN("Unable to initialize userspace log communications");
mempool_destroy(flush_entry_pool);
return r;
}
r = dm_dirty_log_type_register(&_userspace_type);
if (r) {
DMWARN("Couldn't register userspace dirty log type");
dm_ulog_tfr_exit();
mempool_destroy(flush_entry_pool);
return r;
}
DMINFO("version " DM_LOG_USERSPACE_VSN " loaded");
return 0;
}
static void __exit userspace_dirty_log_exit(void)
{
dm_dirty_log_type_unregister(&_userspace_type);
dm_ulog_tfr_exit();
mempool_destroy(flush_entry_pool);
DMINFO("version " DM_LOG_USERSPACE_VSN " unloaded");
return;
}
module_init(userspace_dirty_log_init);
module_exit(userspace_dirty_log_exit);
MODULE_DESCRIPTION(DM_NAME " userspace dirty log link");
MODULE_AUTHOR("Jonathan Brassow <dm-devel@redhat.com>");
MODULE_LICENSE("GPL");

View file

@ -0,0 +1,286 @@
/*
* Copyright (C) 2006-2009 Red Hat, Inc.
*
* This file is released under the LGPL.
*/
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/slab.h>
#include <net/sock.h>
#include <linux/workqueue.h>
#include <linux/connector.h>
#include <linux/device-mapper.h>
#include <linux/dm-log-userspace.h>
#include "dm-log-userspace-transfer.h"
static uint32_t dm_ulog_seq;
/*
* Netlink/Connector is an unreliable protocol. How long should
* we wait for a response before assuming it was lost and retrying?
* (If we do receive a response after this time, it will be discarded
* and the response to the resent request will be waited for.
*/
#define DM_ULOG_RETRY_TIMEOUT (15 * HZ)
/*
* Pre-allocated space for speed
*/
#define DM_ULOG_PREALLOCED_SIZE 512
static struct cn_msg *prealloced_cn_msg;
static struct dm_ulog_request *prealloced_ulog_tfr;
static struct cb_id ulog_cn_id = {
.idx = CN_IDX_DM,
.val = CN_VAL_DM_USERSPACE_LOG
};
static DEFINE_MUTEX(dm_ulog_lock);
struct receiving_pkg {
struct list_head list;
struct completion complete;
uint32_t seq;
int error;
size_t *data_size;
char *data;
};
static DEFINE_SPINLOCK(receiving_list_lock);
static struct list_head receiving_list;
static int dm_ulog_sendto_server(struct dm_ulog_request *tfr)
{
int r;
struct cn_msg *msg = prealloced_cn_msg;
memset(msg, 0, sizeof(struct cn_msg));
msg->id.idx = ulog_cn_id.idx;
msg->id.val = ulog_cn_id.val;
msg->ack = 0;
msg->seq = tfr->seq;
msg->len = sizeof(struct dm_ulog_request) + tfr->data_size;
r = cn_netlink_send(msg, 0, 0, gfp_any());
return r;
}
/*
* Parameters for this function can be either msg or tfr, but not
* both. This function fills in the reply for a waiting request.
* If just msg is given, then the reply is simply an ACK from userspace
* that the request was received.
*
* Returns: 0 on success, -ENOENT on failure
*/
static int fill_pkg(struct cn_msg *msg, struct dm_ulog_request *tfr)
{
uint32_t rtn_seq = (msg) ? msg->seq : (tfr) ? tfr->seq : 0;
struct receiving_pkg *pkg;
/*
* The 'receiving_pkg' entries in this list are statically
* allocated on the stack in 'dm_consult_userspace'.
* Each process that is waiting for a reply from the user
* space server will have an entry in this list.
*
* We are safe to do it this way because the stack space
* is unique to each process, but still addressable by
* other processes.
*/
list_for_each_entry(pkg, &receiving_list, list) {
if (rtn_seq != pkg->seq)
continue;
if (msg) {
pkg->error = -msg->ack;
/*
* If we are trying again, we will need to know our
* storage capacity. Otherwise, along with the
* error code, we make explicit that we have no data.
*/
if (pkg->error != -EAGAIN)
*(pkg->data_size) = 0;
} else if (tfr->data_size > *(pkg->data_size)) {
DMERR("Insufficient space to receive package [%u] "
"(%u vs %zu)", tfr->request_type,
tfr->data_size, *(pkg->data_size));
*(pkg->data_size) = 0;
pkg->error = -ENOSPC;
} else {
pkg->error = tfr->error;
memcpy(pkg->data, tfr->data, tfr->data_size);
*(pkg->data_size) = tfr->data_size;
}
complete(&pkg->complete);
return 0;
}
return -ENOENT;
}
/*
* This is the connector callback that delivers data
* that was sent from userspace.
*/
static void cn_ulog_callback(struct cn_msg *msg, struct netlink_skb_parms *nsp)
{
struct dm_ulog_request *tfr = (struct dm_ulog_request *)(msg + 1);
if (!capable(CAP_SYS_ADMIN))
return;
spin_lock(&receiving_list_lock);
if (msg->len == 0)
fill_pkg(msg, NULL);
else if (msg->len < sizeof(*tfr))
DMERR("Incomplete message received (expected %u, got %u): [%u]",
(unsigned)sizeof(*tfr), msg->len, msg->seq);
else
fill_pkg(NULL, tfr);
spin_unlock(&receiving_list_lock);
}
/**
* dm_consult_userspace
* @uuid: log's universal unique identifier (must be DM_UUID_LEN in size)
* @luid: log's local unique identifier
* @request_type: found in include/linux/dm-log-userspace.h
* @data: data to tx to the server
* @data_size: size of data in bytes
* @rdata: place to put return data from server
* @rdata_size: value-result (amount of space given/amount of space used)
*
* rdata_size is undefined on failure.
*
* Memory used to communicate with userspace is zero'ed
* before populating to ensure that no unwanted bits leak
* from kernel space to user-space. All userspace log communications
* between kernel and user space go through this function.
*
* Returns: 0 on success, -EXXX on failure
**/
int dm_consult_userspace(const char *uuid, uint64_t luid, int request_type,
char *data, size_t data_size,
char *rdata, size_t *rdata_size)
{
int r = 0;
size_t dummy = 0;
int overhead_size = sizeof(struct dm_ulog_request) + sizeof(struct cn_msg);
struct dm_ulog_request *tfr = prealloced_ulog_tfr;
struct receiving_pkg pkg;
/*
* Given the space needed to hold the 'struct cn_msg' and
* 'struct dm_ulog_request' - do we have enough payload
* space remaining?
*/
if (data_size > (DM_ULOG_PREALLOCED_SIZE - overhead_size)) {
DMINFO("Size of tfr exceeds preallocated size");
return -EINVAL;
}
if (!rdata_size)
rdata_size = &dummy;
resend:
/*
* We serialize the sending of requests so we can
* use the preallocated space.
*/
mutex_lock(&dm_ulog_lock);
memset(tfr, 0, DM_ULOG_PREALLOCED_SIZE - sizeof(struct cn_msg));
memcpy(tfr->uuid, uuid, DM_UUID_LEN);
tfr->version = DM_ULOG_REQUEST_VERSION;
tfr->luid = luid;
tfr->seq = dm_ulog_seq++;
/*
* Must be valid request type (all other bits set to
* zero). This reserves other bits for possible future
* use.
*/
tfr->request_type = request_type & DM_ULOG_REQUEST_MASK;
tfr->data_size = data_size;
if (data && data_size)
memcpy(tfr->data, data, data_size);
memset(&pkg, 0, sizeof(pkg));
init_completion(&pkg.complete);
pkg.seq = tfr->seq;
pkg.data_size = rdata_size;
pkg.data = rdata;
spin_lock(&receiving_list_lock);
list_add(&(pkg.list), &receiving_list);
spin_unlock(&receiving_list_lock);
r = dm_ulog_sendto_server(tfr);
mutex_unlock(&dm_ulog_lock);
if (r) {
DMERR("Unable to send log request [%u] to userspace: %d",
request_type, r);
spin_lock(&receiving_list_lock);
list_del_init(&(pkg.list));
spin_unlock(&receiving_list_lock);
goto out;
}
r = wait_for_completion_timeout(&(pkg.complete), DM_ULOG_RETRY_TIMEOUT);
spin_lock(&receiving_list_lock);
list_del_init(&(pkg.list));
spin_unlock(&receiving_list_lock);
if (!r) {
DMWARN("[%s] Request timed out: [%u/%u] - retrying",
(strlen(uuid) > 8) ?
(uuid + (strlen(uuid) - 8)) : (uuid),
request_type, pkg.seq);
goto resend;
}
r = pkg.error;
if (r == -EAGAIN)
goto resend;
out:
return r;
}
int dm_ulog_tfr_init(void)
{
int r;
void *prealloced;
INIT_LIST_HEAD(&receiving_list);
prealloced = kmalloc(DM_ULOG_PREALLOCED_SIZE, GFP_KERNEL);
if (!prealloced)
return -ENOMEM;
prealloced_cn_msg = prealloced;
prealloced_ulog_tfr = prealloced + sizeof(struct cn_msg);
r = cn_add_callback(&ulog_cn_id, "dmlogusr", cn_ulog_callback);
if (r) {
kfree(prealloced_cn_msg);
return r;
}
return 0;
}
void dm_ulog_tfr_exit(void)
{
cn_del_callback(&ulog_cn_id);
kfree(prealloced_cn_msg);
}

View file

@ -0,0 +1,18 @@
/*
* Copyright (C) 2006-2009 Red Hat, Inc.
*
* This file is released under the LGPL.
*/
#ifndef __DM_LOG_USERSPACE_TRANSFER_H__
#define __DM_LOG_USERSPACE_TRANSFER_H__
#define DM_MSG_PREFIX "dm-log-userspace"
int dm_ulog_tfr_init(void);
void dm_ulog_tfr_exit(void);
int dm_consult_userspace(const char *uuid, uint64_t luid, int request_type,
char *data, size_t data_size,
char *rdata, size_t *rdata_size);
#endif /* __DM_LOG_USERSPACE_TRANSFER_H__ */

888
drivers/md/dm-log.c Normal file
View file

@ -0,0 +1,888 @@
/*
* Copyright (C) 2003 Sistina Software
* Copyright (C) 2004-2008 Red Hat, Inc. All rights reserved.
*
* This file is released under the LGPL.
*/
#include <linux/init.h>
#include <linux/slab.h>
#include <linux/module.h>
#include <linux/vmalloc.h>
#include <linux/dm-io.h>
#include <linux/dm-dirty-log.h>
#include <linux/device-mapper.h>
#define DM_MSG_PREFIX "dirty region log"
static LIST_HEAD(_log_types);
static DEFINE_SPINLOCK(_lock);
static struct dm_dirty_log_type *__find_dirty_log_type(const char *name)
{
struct dm_dirty_log_type *log_type;
list_for_each_entry(log_type, &_log_types, list)
if (!strcmp(name, log_type->name))
return log_type;
return NULL;
}
static struct dm_dirty_log_type *_get_dirty_log_type(const char *name)
{
struct dm_dirty_log_type *log_type;
spin_lock(&_lock);
log_type = __find_dirty_log_type(name);
if (log_type && !try_module_get(log_type->module))
log_type = NULL;
spin_unlock(&_lock);
return log_type;
}
/*
* get_type
* @type_name
*
* Attempt to retrieve the dm_dirty_log_type by name. If not already
* available, attempt to load the appropriate module.
*
* Log modules are named "dm-log-" followed by the 'type_name'.
* Modules may contain multiple types.
* This function will first try the module "dm-log-<type_name>",
* then truncate 'type_name' on the last '-' and try again.
*
* For example, if type_name was "clustered-disk", it would search
* 'dm-log-clustered-disk' then 'dm-log-clustered'.
*
* Returns: dirty_log_type* on success, NULL on failure
*/
static struct dm_dirty_log_type *get_type(const char *type_name)
{
char *p, *type_name_dup;
struct dm_dirty_log_type *log_type;
if (!type_name)
return NULL;
log_type = _get_dirty_log_type(type_name);
if (log_type)
return log_type;
type_name_dup = kstrdup(type_name, GFP_KERNEL);
if (!type_name_dup) {
DMWARN("No memory left to attempt log module load for \"%s\"",
type_name);
return NULL;
}
while (request_module("dm-log-%s", type_name_dup) ||
!(log_type = _get_dirty_log_type(type_name))) {
p = strrchr(type_name_dup, '-');
if (!p)
break;
p[0] = '\0';
}
if (!log_type)
DMWARN("Module for logging type \"%s\" not found.", type_name);
kfree(type_name_dup);
return log_type;
}
static void put_type(struct dm_dirty_log_type *type)
{
if (!type)
return;
spin_lock(&_lock);
if (!__find_dirty_log_type(type->name))
goto out;
module_put(type->module);
out:
spin_unlock(&_lock);
}
int dm_dirty_log_type_register(struct dm_dirty_log_type *type)
{
int r = 0;
spin_lock(&_lock);
if (!__find_dirty_log_type(type->name))
list_add(&type->list, &_log_types);
else
r = -EEXIST;
spin_unlock(&_lock);
return r;
}
EXPORT_SYMBOL(dm_dirty_log_type_register);
int dm_dirty_log_type_unregister(struct dm_dirty_log_type *type)
{
spin_lock(&_lock);
if (!__find_dirty_log_type(type->name)) {
spin_unlock(&_lock);
return -EINVAL;
}
list_del(&type->list);
spin_unlock(&_lock);
return 0;
}
EXPORT_SYMBOL(dm_dirty_log_type_unregister);
struct dm_dirty_log *dm_dirty_log_create(const char *type_name,
struct dm_target *ti,
int (*flush_callback_fn)(struct dm_target *ti),
unsigned int argc, char **argv)
{
struct dm_dirty_log_type *type;
struct dm_dirty_log *log;
log = kmalloc(sizeof(*log), GFP_KERNEL);
if (!log)
return NULL;
type = get_type(type_name);
if (!type) {
kfree(log);
return NULL;
}
log->flush_callback_fn = flush_callback_fn;
log->type = type;
if (type->ctr(log, ti, argc, argv)) {
kfree(log);
put_type(type);
return NULL;
}
return log;
}
EXPORT_SYMBOL(dm_dirty_log_create);
void dm_dirty_log_destroy(struct dm_dirty_log *log)
{
log->type->dtr(log);
put_type(log->type);
kfree(log);
}
EXPORT_SYMBOL(dm_dirty_log_destroy);
/*-----------------------------------------------------------------
* Persistent and core logs share a lot of their implementation.
* FIXME: need a reload method to be called from a resume
*---------------------------------------------------------------*/
/*
* Magic for persistent mirrors: "MiRr"
*/
#define MIRROR_MAGIC 0x4D695272
/*
* The on-disk version of the metadata.
*/
#define MIRROR_DISK_VERSION 2
#define LOG_OFFSET 2
struct log_header_disk {
__le32 magic;
/*
* Simple, incrementing version. no backward
* compatibility.
*/
__le32 version;
__le64 nr_regions;
} __packed;
struct log_header_core {
uint32_t magic;
uint32_t version;
uint64_t nr_regions;
};
struct log_c {
struct dm_target *ti;
int touched_dirtied;
int touched_cleaned;
int flush_failed;
uint32_t region_size;
unsigned int region_count;
region_t sync_count;
unsigned bitset_uint32_count;
uint32_t *clean_bits;
uint32_t *sync_bits;
uint32_t *recovering_bits; /* FIXME: this seems excessive */
int sync_search;
/* Resync flag */
enum sync {
DEFAULTSYNC, /* Synchronize if necessary */
NOSYNC, /* Devices known to be already in sync */
FORCESYNC, /* Force a sync to happen */
} sync;
struct dm_io_request io_req;
/*
* Disk log fields
*/
int log_dev_failed;
int log_dev_flush_failed;
struct dm_dev *log_dev;
struct log_header_core header;
struct dm_io_region header_location;
struct log_header_disk *disk_header;
};
/*
* The touched member needs to be updated every time we access
* one of the bitsets.
*/
static inline int log_test_bit(uint32_t *bs, unsigned bit)
{
return test_bit_le(bit, bs) ? 1 : 0;
}
static inline void log_set_bit(struct log_c *l,
uint32_t *bs, unsigned bit)
{
__set_bit_le(bit, bs);
l->touched_cleaned = 1;
}
static inline void log_clear_bit(struct log_c *l,
uint32_t *bs, unsigned bit)
{
__clear_bit_le(bit, bs);
l->touched_dirtied = 1;
}
/*----------------------------------------------------------------
* Header IO
*--------------------------------------------------------------*/
static void header_to_disk(struct log_header_core *core, struct log_header_disk *disk)
{
disk->magic = cpu_to_le32(core->magic);
disk->version = cpu_to_le32(core->version);
disk->nr_regions = cpu_to_le64(core->nr_regions);
}
static void header_from_disk(struct log_header_core *core, struct log_header_disk *disk)
{
core->magic = le32_to_cpu(disk->magic);
core->version = le32_to_cpu(disk->version);
core->nr_regions = le64_to_cpu(disk->nr_regions);
}
static int rw_header(struct log_c *lc, int rw)
{
lc->io_req.bi_rw = rw;
return dm_io(&lc->io_req, 1, &lc->header_location, NULL);
}
static int flush_header(struct log_c *lc)
{
struct dm_io_region null_location = {
.bdev = lc->header_location.bdev,
.sector = 0,
.count = 0,
};
lc->io_req.bi_rw = WRITE_FLUSH;
return dm_io(&lc->io_req, 1, &null_location, NULL);
}
static int read_header(struct log_c *log)
{
int r;
r = rw_header(log, READ);
if (r)
return r;
header_from_disk(&log->header, log->disk_header);
/* New log required? */
if (log->sync != DEFAULTSYNC || log->header.magic != MIRROR_MAGIC) {
log->header.magic = MIRROR_MAGIC;
log->header.version = MIRROR_DISK_VERSION;
log->header.nr_regions = 0;
}
#ifdef __LITTLE_ENDIAN
if (log->header.version == 1)
log->header.version = 2;
#endif
if (log->header.version != MIRROR_DISK_VERSION) {
DMWARN("incompatible disk log version");
return -EINVAL;
}
return 0;
}
static int _check_region_size(struct dm_target *ti, uint32_t region_size)
{
if (region_size < 2 || region_size > ti->len)
return 0;
if (!is_power_of_2(region_size))
return 0;
return 1;
}
/*----------------------------------------------------------------
* core log constructor/destructor
*
* argv contains region_size followed optionally by [no]sync
*--------------------------------------------------------------*/
#define BYTE_SHIFT 3
static int create_log_context(struct dm_dirty_log *log, struct dm_target *ti,
unsigned int argc, char **argv,
struct dm_dev *dev)
{
enum sync sync = DEFAULTSYNC;
struct log_c *lc;
uint32_t region_size;
unsigned int region_count;
size_t bitset_size, buf_size;
int r;
char dummy;
if (argc < 1 || argc > 2) {
DMWARN("wrong number of arguments to dirty region log");
return -EINVAL;
}
if (argc > 1) {
if (!strcmp(argv[1], "sync"))
sync = FORCESYNC;
else if (!strcmp(argv[1], "nosync"))
sync = NOSYNC;
else {
DMWARN("unrecognised sync argument to "
"dirty region log: %s", argv[1]);
return -EINVAL;
}
}
if (sscanf(argv[0], "%u%c", &region_size, &dummy) != 1 ||
!_check_region_size(ti, region_size)) {
DMWARN("invalid region size %s", argv[0]);
return -EINVAL;
}
region_count = dm_sector_div_up(ti->len, region_size);
lc = kmalloc(sizeof(*lc), GFP_KERNEL);
if (!lc) {
DMWARN("couldn't allocate core log");
return -ENOMEM;
}
lc->ti = ti;
lc->touched_dirtied = 0;
lc->touched_cleaned = 0;
lc->flush_failed = 0;
lc->region_size = region_size;
lc->region_count = region_count;
lc->sync = sync;
/*
* Work out how many "unsigned long"s we need to hold the bitset.
*/
bitset_size = dm_round_up(region_count,
sizeof(*lc->clean_bits) << BYTE_SHIFT);
bitset_size >>= BYTE_SHIFT;
lc->bitset_uint32_count = bitset_size / sizeof(*lc->clean_bits);
/*
* Disk log?
*/
if (!dev) {
lc->clean_bits = vmalloc(bitset_size);
if (!lc->clean_bits) {
DMWARN("couldn't allocate clean bitset");
kfree(lc);
return -ENOMEM;
}
lc->disk_header = NULL;
} else {
lc->log_dev = dev;
lc->log_dev_failed = 0;
lc->log_dev_flush_failed = 0;
lc->header_location.bdev = lc->log_dev->bdev;
lc->header_location.sector = 0;
/*
* Buffer holds both header and bitset.
*/
buf_size =
dm_round_up((LOG_OFFSET << SECTOR_SHIFT) + bitset_size,
bdev_logical_block_size(lc->header_location.
bdev));
if (buf_size > i_size_read(dev->bdev->bd_inode)) {
DMWARN("log device %s too small: need %llu bytes",
dev->name, (unsigned long long)buf_size);
kfree(lc);
return -EINVAL;
}
lc->header_location.count = buf_size >> SECTOR_SHIFT;
lc->io_req.mem.type = DM_IO_VMA;
lc->io_req.notify.fn = NULL;
lc->io_req.client = dm_io_client_create();
if (IS_ERR(lc->io_req.client)) {
r = PTR_ERR(lc->io_req.client);
DMWARN("couldn't allocate disk io client");
kfree(lc);
return r;
}
lc->disk_header = vmalloc(buf_size);
if (!lc->disk_header) {
DMWARN("couldn't allocate disk log buffer");
dm_io_client_destroy(lc->io_req.client);
kfree(lc);
return -ENOMEM;
}
lc->io_req.mem.ptr.vma = lc->disk_header;
lc->clean_bits = (void *)lc->disk_header +
(LOG_OFFSET << SECTOR_SHIFT);
}
memset(lc->clean_bits, -1, bitset_size);
lc->sync_bits = vmalloc(bitset_size);
if (!lc->sync_bits) {
DMWARN("couldn't allocate sync bitset");
if (!dev)
vfree(lc->clean_bits);
else
dm_io_client_destroy(lc->io_req.client);
vfree(lc->disk_header);
kfree(lc);
return -ENOMEM;
}
memset(lc->sync_bits, (sync == NOSYNC) ? -1 : 0, bitset_size);
lc->sync_count = (sync == NOSYNC) ? region_count : 0;
lc->recovering_bits = vzalloc(bitset_size);
if (!lc->recovering_bits) {
DMWARN("couldn't allocate sync bitset");
vfree(lc->sync_bits);
if (!dev)
vfree(lc->clean_bits);
else
dm_io_client_destroy(lc->io_req.client);
vfree(lc->disk_header);
kfree(lc);
return -ENOMEM;
}
lc->sync_search = 0;
log->context = lc;
return 0;
}
static int core_ctr(struct dm_dirty_log *log, struct dm_target *ti,
unsigned int argc, char **argv)
{
return create_log_context(log, ti, argc, argv, NULL);
}
static void destroy_log_context(struct log_c *lc)
{
vfree(lc->sync_bits);
vfree(lc->recovering_bits);
kfree(lc);
}
static void core_dtr(struct dm_dirty_log *log)
{
struct log_c *lc = (struct log_c *) log->context;
vfree(lc->clean_bits);
destroy_log_context(lc);
}
/*----------------------------------------------------------------
* disk log constructor/destructor
*
* argv contains log_device region_size followed optionally by [no]sync
*--------------------------------------------------------------*/
static int disk_ctr(struct dm_dirty_log *log, struct dm_target *ti,
unsigned int argc, char **argv)
{
int r;
struct dm_dev *dev;
if (argc < 2 || argc > 3) {
DMWARN("wrong number of arguments to disk dirty region log");
return -EINVAL;
}
r = dm_get_device(ti, argv[0], dm_table_get_mode(ti->table), &dev);
if (r)
return r;
r = create_log_context(log, ti, argc - 1, argv + 1, dev);
if (r) {
dm_put_device(ti, dev);
return r;
}
return 0;
}
static void disk_dtr(struct dm_dirty_log *log)
{
struct log_c *lc = (struct log_c *) log->context;
dm_put_device(lc->ti, lc->log_dev);
vfree(lc->disk_header);
dm_io_client_destroy(lc->io_req.client);
destroy_log_context(lc);
}
static void fail_log_device(struct log_c *lc)
{
if (lc->log_dev_failed)
return;
lc->log_dev_failed = 1;
dm_table_event(lc->ti->table);
}
static int disk_resume(struct dm_dirty_log *log)
{
int r;
unsigned i;
struct log_c *lc = (struct log_c *) log->context;
size_t size = lc->bitset_uint32_count * sizeof(uint32_t);
/* read the disk header */
r = read_header(lc);
if (r) {
DMWARN("%s: Failed to read header on dirty region log device",
lc->log_dev->name);
fail_log_device(lc);
/*
* If the log device cannot be read, we must assume
* all regions are out-of-sync. If we simply return
* here, the state will be uninitialized and could
* lead us to return 'in-sync' status for regions
* that are actually 'out-of-sync'.
*/
lc->header.nr_regions = 0;
}
/* set or clear any new bits -- device has grown */
if (lc->sync == NOSYNC)
for (i = lc->header.nr_regions; i < lc->region_count; i++)
/* FIXME: amazingly inefficient */
log_set_bit(lc, lc->clean_bits, i);
else
for (i = lc->header.nr_regions; i < lc->region_count; i++)
/* FIXME: amazingly inefficient */
log_clear_bit(lc, lc->clean_bits, i);
/* clear any old bits -- device has shrunk */
for (i = lc->region_count; i % (sizeof(*lc->clean_bits) << BYTE_SHIFT); i++)
log_clear_bit(lc, lc->clean_bits, i);
/* copy clean across to sync */
memcpy(lc->sync_bits, lc->clean_bits, size);
lc->sync_count = memweight(lc->clean_bits,
lc->bitset_uint32_count * sizeof(uint32_t));
lc->sync_search = 0;
/* set the correct number of regions in the header */
lc->header.nr_regions = lc->region_count;
header_to_disk(&lc->header, lc->disk_header);
/* write the new header */
r = rw_header(lc, WRITE);
if (!r) {
r = flush_header(lc);
if (r)
lc->log_dev_flush_failed = 1;
}
if (r) {
DMWARN("%s: Failed to write header on dirty region log device",
lc->log_dev->name);
fail_log_device(lc);
}
return r;
}
static uint32_t core_get_region_size(struct dm_dirty_log *log)
{
struct log_c *lc = (struct log_c *) log->context;
return lc->region_size;
}
static int core_resume(struct dm_dirty_log *log)
{
struct log_c *lc = (struct log_c *) log->context;
lc->sync_search = 0;
return 0;
}
static int core_is_clean(struct dm_dirty_log *log, region_t region)
{
struct log_c *lc = (struct log_c *) log->context;
return log_test_bit(lc->clean_bits, region);
}
static int core_in_sync(struct dm_dirty_log *log, region_t region, int block)
{
struct log_c *lc = (struct log_c *) log->context;
return log_test_bit(lc->sync_bits, region);
}
static int core_flush(struct dm_dirty_log *log)
{
/* no op */
return 0;
}
static int disk_flush(struct dm_dirty_log *log)
{
int r, i;
struct log_c *lc = log->context;
/* only write if the log has changed */
if (!lc->touched_cleaned && !lc->touched_dirtied)
return 0;
if (lc->touched_cleaned && log->flush_callback_fn &&
log->flush_callback_fn(lc->ti)) {
/*
* At this point it is impossible to determine which
* regions are clean and which are dirty (without
* re-reading the log off disk). So mark all of them
* dirty.
*/
lc->flush_failed = 1;
for (i = 0; i < lc->region_count; i++)
log_clear_bit(lc, lc->clean_bits, i);
}
r = rw_header(lc, WRITE);
if (r)
fail_log_device(lc);
else {
if (lc->touched_dirtied) {
r = flush_header(lc);
if (r) {
lc->log_dev_flush_failed = 1;
fail_log_device(lc);
} else
lc->touched_dirtied = 0;
}
lc->touched_cleaned = 0;
}
return r;
}
static void core_mark_region(struct dm_dirty_log *log, region_t region)
{
struct log_c *lc = (struct log_c *) log->context;
log_clear_bit(lc, lc->clean_bits, region);
}
static void core_clear_region(struct dm_dirty_log *log, region_t region)
{
struct log_c *lc = (struct log_c *) log->context;
if (likely(!lc->flush_failed))
log_set_bit(lc, lc->clean_bits, region);
}
static int core_get_resync_work(struct dm_dirty_log *log, region_t *region)
{
struct log_c *lc = (struct log_c *) log->context;
if (lc->sync_search >= lc->region_count)
return 0;
do {
*region = find_next_zero_bit_le(lc->sync_bits,
lc->region_count,
lc->sync_search);
lc->sync_search = *region + 1;
if (*region >= lc->region_count)
return 0;
} while (log_test_bit(lc->recovering_bits, *region));
log_set_bit(lc, lc->recovering_bits, *region);
return 1;
}
static void core_set_region_sync(struct dm_dirty_log *log, region_t region,
int in_sync)
{
struct log_c *lc = (struct log_c *) log->context;
log_clear_bit(lc, lc->recovering_bits, region);
if (in_sync) {
log_set_bit(lc, lc->sync_bits, region);
lc->sync_count++;
} else if (log_test_bit(lc->sync_bits, region)) {
lc->sync_count--;
log_clear_bit(lc, lc->sync_bits, region);
}
}
static region_t core_get_sync_count(struct dm_dirty_log *log)
{
struct log_c *lc = (struct log_c *) log->context;
return lc->sync_count;
}
#define DMEMIT_SYNC \
if (lc->sync != DEFAULTSYNC) \
DMEMIT("%ssync ", lc->sync == NOSYNC ? "no" : "")
static int core_status(struct dm_dirty_log *log, status_type_t status,
char *result, unsigned int maxlen)
{
int sz = 0;
struct log_c *lc = log->context;
switch(status) {
case STATUSTYPE_INFO:
DMEMIT("1 %s", log->type->name);
break;
case STATUSTYPE_TABLE:
DMEMIT("%s %u %u ", log->type->name,
lc->sync == DEFAULTSYNC ? 1 : 2, lc->region_size);
DMEMIT_SYNC;
}
return sz;
}
static int disk_status(struct dm_dirty_log *log, status_type_t status,
char *result, unsigned int maxlen)
{
int sz = 0;
struct log_c *lc = log->context;
switch(status) {
case STATUSTYPE_INFO:
DMEMIT("3 %s %s %c", log->type->name, lc->log_dev->name,
lc->log_dev_flush_failed ? 'F' :
lc->log_dev_failed ? 'D' :
'A');
break;
case STATUSTYPE_TABLE:
DMEMIT("%s %u %s %u ", log->type->name,
lc->sync == DEFAULTSYNC ? 2 : 3, lc->log_dev->name,
lc->region_size);
DMEMIT_SYNC;
}
return sz;
}
static struct dm_dirty_log_type _core_type = {
.name = "core",
.module = THIS_MODULE,
.ctr = core_ctr,
.dtr = core_dtr,
.resume = core_resume,
.get_region_size = core_get_region_size,
.is_clean = core_is_clean,
.in_sync = core_in_sync,
.flush = core_flush,
.mark_region = core_mark_region,
.clear_region = core_clear_region,
.get_resync_work = core_get_resync_work,
.set_region_sync = core_set_region_sync,
.get_sync_count = core_get_sync_count,
.status = core_status,
};
static struct dm_dirty_log_type _disk_type = {
.name = "disk",
.module = THIS_MODULE,
.ctr = disk_ctr,
.dtr = disk_dtr,
.postsuspend = disk_flush,
.resume = disk_resume,
.get_region_size = core_get_region_size,
.is_clean = core_is_clean,
.in_sync = core_in_sync,
.flush = disk_flush,
.mark_region = core_mark_region,
.clear_region = core_clear_region,
.get_resync_work = core_get_resync_work,
.set_region_sync = core_set_region_sync,
.get_sync_count = core_get_sync_count,
.status = disk_status,
};
static int __init dm_dirty_log_init(void)
{
int r;
r = dm_dirty_log_type_register(&_core_type);
if (r)
DMWARN("couldn't register core log");
r = dm_dirty_log_type_register(&_disk_type);
if (r) {
DMWARN("couldn't register disk type");
dm_dirty_log_type_unregister(&_core_type);
}
return r;
}
static void __exit dm_dirty_log_exit(void)
{
dm_dirty_log_type_unregister(&_disk_type);
dm_dirty_log_type_unregister(&_core_type);
}
module_init(dm_dirty_log_init);
module_exit(dm_dirty_log_exit);
MODULE_DESCRIPTION(DM_NAME " dirty region log");
MODULE_AUTHOR("Joe Thornber, Heinz Mauelshagen <dm-devel@redhat.com>");
MODULE_LICENSE("GPL");

1746
drivers/md/dm-mpath.c Normal file

File diff suppressed because it is too large Load diff

22
drivers/md/dm-mpath.h Normal file
View file

@ -0,0 +1,22 @@
/*
* Copyright (C) 2004 Red Hat, Inc. All rights reserved.
*
* This file is released under the GPL.
*
* Multipath.
*/
#ifndef DM_MPATH_H
#define DM_MPATH_H
struct dm_dev;
struct dm_path {
struct dm_dev *dev; /* Read-only */
void *pscontext; /* For path-selector use */
};
/* Callback for hwh_pg_init_fn to use when complete */
void dm_pg_init_complete(struct dm_path *path, unsigned err_flags);
#endif

View file

@ -0,0 +1,140 @@
/*
* Copyright (C) 2003 Sistina Software.
* Copyright (C) 2004 Red Hat, Inc. All rights reserved.
*
* Module Author: Heinz Mauelshagen
*
* This file is released under the GPL.
*
* Path selector registration.
*/
#include <linux/device-mapper.h>
#include <linux/module.h>
#include "dm-path-selector.h"
#include <linux/slab.h>
struct ps_internal {
struct path_selector_type pst;
struct list_head list;
};
#define pst_to_psi(__pst) container_of((__pst), struct ps_internal, pst)
static LIST_HEAD(_path_selectors);
static DECLARE_RWSEM(_ps_lock);
static struct ps_internal *__find_path_selector_type(const char *name)
{
struct ps_internal *psi;
list_for_each_entry(psi, &_path_selectors, list) {
if (!strcmp(name, psi->pst.name))
return psi;
}
return NULL;
}
static struct ps_internal *get_path_selector(const char *name)
{
struct ps_internal *psi;
down_read(&_ps_lock);
psi = __find_path_selector_type(name);
if (psi && !try_module_get(psi->pst.module))
psi = NULL;
up_read(&_ps_lock);
return psi;
}
struct path_selector_type *dm_get_path_selector(const char *name)
{
struct ps_internal *psi;
if (!name)
return NULL;
psi = get_path_selector(name);
if (!psi) {
request_module("dm-%s", name);
psi = get_path_selector(name);
}
return psi ? &psi->pst : NULL;
}
void dm_put_path_selector(struct path_selector_type *pst)
{
struct ps_internal *psi;
if (!pst)
return;
down_read(&_ps_lock);
psi = __find_path_selector_type(pst->name);
if (!psi)
goto out;
module_put(psi->pst.module);
out:
up_read(&_ps_lock);
}
static struct ps_internal *_alloc_path_selector(struct path_selector_type *pst)
{
struct ps_internal *psi = kzalloc(sizeof(*psi), GFP_KERNEL);
if (psi)
psi->pst = *pst;
return psi;
}
int dm_register_path_selector(struct path_selector_type *pst)
{
int r = 0;
struct ps_internal *psi = _alloc_path_selector(pst);
if (!psi)
return -ENOMEM;
down_write(&_ps_lock);
if (__find_path_selector_type(pst->name)) {
kfree(psi);
r = -EEXIST;
} else
list_add(&psi->list, &_path_selectors);
up_write(&_ps_lock);
return r;
}
int dm_unregister_path_selector(struct path_selector_type *pst)
{
struct ps_internal *psi;
down_write(&_ps_lock);
psi = __find_path_selector_type(pst->name);
if (!psi) {
up_write(&_ps_lock);
return -EINVAL;
}
list_del(&psi->list);
up_write(&_ps_lock);
kfree(psi);
return 0;
}
EXPORT_SYMBOL_GPL(dm_register_path_selector);
EXPORT_SYMBOL_GPL(dm_unregister_path_selector);

View file

@ -0,0 +1,97 @@
/*
* Copyright (C) 2003 Sistina Software.
* Copyright (C) 2004 Red Hat, Inc. All rights reserved.
*
* Module Author: Heinz Mauelshagen
*
* This file is released under the GPL.
*
* Path-Selector registration.
*/
#ifndef DM_PATH_SELECTOR_H
#define DM_PATH_SELECTOR_H
#include <linux/device-mapper.h>
#include "dm-mpath.h"
/*
* We provide an abstraction for the code that chooses which path
* to send some io down.
*/
struct path_selector_type;
struct path_selector {
struct path_selector_type *type;
void *context;
};
/* Information about a path selector type */
struct path_selector_type {
char *name;
struct module *module;
unsigned int table_args;
unsigned int info_args;
/*
* Constructs a path selector object, takes custom arguments
*/
int (*create) (struct path_selector *ps, unsigned argc, char **argv);
void (*destroy) (struct path_selector *ps);
/*
* Add an opaque path object, along with some selector specific
* path args (eg, path priority).
*/
int (*add_path) (struct path_selector *ps, struct dm_path *path,
int argc, char **argv, char **error);
/*
* Chooses a path for this io, if no paths are available then
* NULL will be returned.
*
* repeat_count is the number of times to use the path before
* calling the function again. 0 means don't call it again unless
* the path fails.
*/
struct dm_path *(*select_path) (struct path_selector *ps,
unsigned *repeat_count,
size_t nr_bytes);
/*
* Notify the selector that a path has failed.
*/
void (*fail_path) (struct path_selector *ps, struct dm_path *p);
/*
* Ask selector to reinstate a path.
*/
int (*reinstate_path) (struct path_selector *ps, struct dm_path *p);
/*
* Table content based on parameters added in ps_add_path_fn
* or path selector status
*/
int (*status) (struct path_selector *ps, struct dm_path *path,
status_type_t type, char *result, unsigned int maxlen);
int (*start_io) (struct path_selector *ps, struct dm_path *path,
size_t nr_bytes);
int (*end_io) (struct path_selector *ps, struct dm_path *path,
size_t nr_bytes);
};
/* Register a path selector */
int dm_register_path_selector(struct path_selector_type *type);
/* Unregister a path selector */
int dm_unregister_path_selector(struct path_selector_type *type);
/* Returns a registered path selector type */
struct path_selector_type *dm_get_path_selector(const char *name);
/* Releases a path selector */
void dm_put_path_selector(struct path_selector_type *pst);
#endif

View file

@ -0,0 +1,264 @@
/*
* Copyright (C) 2004-2005 IBM Corp. All Rights Reserved.
* Copyright (C) 2006-2009 NEC Corporation.
*
* dm-queue-length.c
*
* Module Author: Stefan Bader, IBM
* Modified by: Kiyoshi Ueda, NEC
*
* This file is released under the GPL.
*
* queue-length path selector - choose a path with the least number of
* in-flight I/Os.
*/
#include "dm.h"
#include "dm-path-selector.h"
#include <linux/slab.h>
#include <linux/ctype.h>
#include <linux/errno.h>
#include <linux/module.h>
#include <linux/atomic.h>
#define DM_MSG_PREFIX "multipath queue-length"
#define QL_MIN_IO 128
#define QL_VERSION "0.1.0"
struct selector {
struct list_head valid_paths;
struct list_head failed_paths;
};
struct path_info {
struct list_head list;
struct dm_path *path;
unsigned repeat_count;
atomic_t qlen; /* the number of in-flight I/Os */
};
static struct selector *alloc_selector(void)
{
struct selector *s = kmalloc(sizeof(*s), GFP_KERNEL);
if (s) {
INIT_LIST_HEAD(&s->valid_paths);
INIT_LIST_HEAD(&s->failed_paths);
}
return s;
}
static int ql_create(struct path_selector *ps, unsigned argc, char **argv)
{
struct selector *s = alloc_selector();
if (!s)
return -ENOMEM;
ps->context = s;
return 0;
}
static void ql_free_paths(struct list_head *paths)
{
struct path_info *pi, *next;
list_for_each_entry_safe(pi, next, paths, list) {
list_del(&pi->list);
kfree(pi);
}
}
static void ql_destroy(struct path_selector *ps)
{
struct selector *s = ps->context;
ql_free_paths(&s->valid_paths);
ql_free_paths(&s->failed_paths);
kfree(s);
ps->context = NULL;
}
static int ql_status(struct path_selector *ps, struct dm_path *path,
status_type_t type, char *result, unsigned maxlen)
{
unsigned sz = 0;
struct path_info *pi;
/* When called with NULL path, return selector status/args. */
if (!path)
DMEMIT("0 ");
else {
pi = path->pscontext;
switch (type) {
case STATUSTYPE_INFO:
DMEMIT("%d ", atomic_read(&pi->qlen));
break;
case STATUSTYPE_TABLE:
DMEMIT("%u ", pi->repeat_count);
break;
}
}
return sz;
}
static int ql_add_path(struct path_selector *ps, struct dm_path *path,
int argc, char **argv, char **error)
{
struct selector *s = ps->context;
struct path_info *pi;
unsigned repeat_count = QL_MIN_IO;
char dummy;
/*
* Arguments: [<repeat_count>]
* <repeat_count>: The number of I/Os before switching path.
* If not given, default (QL_MIN_IO) is used.
*/
if (argc > 1) {
*error = "queue-length ps: incorrect number of arguments";
return -EINVAL;
}
if ((argc == 1) && (sscanf(argv[0], "%u%c", &repeat_count, &dummy) != 1)) {
*error = "queue-length ps: invalid repeat count";
return -EINVAL;
}
/* Allocate the path information structure */
pi = kmalloc(sizeof(*pi), GFP_KERNEL);
if (!pi) {
*error = "queue-length ps: Error allocating path information";
return -ENOMEM;
}
pi->path = path;
pi->repeat_count = repeat_count;
atomic_set(&pi->qlen, 0);
path->pscontext = pi;
list_add_tail(&pi->list, &s->valid_paths);
return 0;
}
static void ql_fail_path(struct path_selector *ps, struct dm_path *path)
{
struct selector *s = ps->context;
struct path_info *pi = path->pscontext;
list_move(&pi->list, &s->failed_paths);
}
static int ql_reinstate_path(struct path_selector *ps, struct dm_path *path)
{
struct selector *s = ps->context;
struct path_info *pi = path->pscontext;
list_move_tail(&pi->list, &s->valid_paths);
return 0;
}
/*
* Select a path having the minimum number of in-flight I/Os
*/
static struct dm_path *ql_select_path(struct path_selector *ps,
unsigned *repeat_count, size_t nr_bytes)
{
struct selector *s = ps->context;
struct path_info *pi = NULL, *best = NULL;
if (list_empty(&s->valid_paths))
return NULL;
/* Change preferred (first in list) path to evenly balance. */
list_move_tail(s->valid_paths.next, &s->valid_paths);
list_for_each_entry(pi, &s->valid_paths, list) {
if (!best ||
(atomic_read(&pi->qlen) < atomic_read(&best->qlen)))
best = pi;
if (!atomic_read(&best->qlen))
break;
}
if (!best)
return NULL;
*repeat_count = best->repeat_count;
return best->path;
}
static int ql_start_io(struct path_selector *ps, struct dm_path *path,
size_t nr_bytes)
{
struct path_info *pi = path->pscontext;
atomic_inc(&pi->qlen);
return 0;
}
static int ql_end_io(struct path_selector *ps, struct dm_path *path,
size_t nr_bytes)
{
struct path_info *pi = path->pscontext;
atomic_dec(&pi->qlen);
return 0;
}
static struct path_selector_type ql_ps = {
.name = "queue-length",
.module = THIS_MODULE,
.table_args = 1,
.info_args = 1,
.create = ql_create,
.destroy = ql_destroy,
.status = ql_status,
.add_path = ql_add_path,
.fail_path = ql_fail_path,
.reinstate_path = ql_reinstate_path,
.select_path = ql_select_path,
.start_io = ql_start_io,
.end_io = ql_end_io,
};
static int __init dm_ql_init(void)
{
int r = dm_register_path_selector(&ql_ps);
if (r < 0)
DMERR("register failed %d", r);
DMINFO("version " QL_VERSION " loaded");
return r;
}
static void __exit dm_ql_exit(void)
{
int r = dm_unregister_path_selector(&ql_ps);
if (r < 0)
DMERR("unregister failed %d", r);
}
module_init(dm_ql_init);
module_exit(dm_ql_exit);
MODULE_AUTHOR("Stefan Bader <Stefan.Bader at de.ibm.com>");
MODULE_DESCRIPTION(
"(C) Copyright IBM Corp. 2004,2005 All Rights Reserved.\n"
DM_NAME " path selector to balance the number of in-flight I/Os"
);
MODULE_LICENSE("GPL");

1756
drivers/md/dm-raid.c Normal file

File diff suppressed because it is too large Load diff

1458
drivers/md/dm-raid1.c Normal file

File diff suppressed because it is too large Load diff

724
drivers/md/dm-region-hash.c Normal file
View file

@ -0,0 +1,724 @@
/*
* Copyright (C) 2003 Sistina Software Limited.
* Copyright (C) 2004-2008 Red Hat, Inc. All rights reserved.
*
* This file is released under the GPL.
*/
#include <linux/dm-dirty-log.h>
#include <linux/dm-region-hash.h>
#include <linux/ctype.h>
#include <linux/init.h>
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>
#include "dm.h"
#define DM_MSG_PREFIX "region hash"
/*-----------------------------------------------------------------
* Region hash
*
* The mirror splits itself up into discrete regions. Each
* region can be in one of three states: clean, dirty,
* nosync. There is no need to put clean regions in the hash.
*
* In addition to being present in the hash table a region _may_
* be present on one of three lists.
*
* clean_regions: Regions on this list have no io pending to
* them, they are in sync, we are no longer interested in them,
* they are dull. dm_rh_update_states() will remove them from the
* hash table.
*
* quiesced_regions: These regions have been spun down, ready
* for recovery. rh_recovery_start() will remove regions from
* this list and hand them to kmirrord, which will schedule the
* recovery io with kcopyd.
*
* recovered_regions: Regions that kcopyd has successfully
* recovered. dm_rh_update_states() will now schedule any delayed
* io, up the recovery_count, and remove the region from the
* hash.
*
* There are 2 locks:
* A rw spin lock 'hash_lock' protects just the hash table,
* this is never held in write mode from interrupt context,
* which I believe means that we only have to disable irqs when
* doing a write lock.
*
* An ordinary spin lock 'region_lock' that protects the three
* lists in the region_hash, with the 'state', 'list' and
* 'delayed_bios' fields of the regions. This is used from irq
* context, so all other uses will have to suspend local irqs.
*---------------------------------------------------------------*/
struct dm_region_hash {
uint32_t region_size;
unsigned region_shift;
/* holds persistent region state */
struct dm_dirty_log *log;
/* hash table */
rwlock_t hash_lock;
mempool_t *region_pool;
unsigned mask;
unsigned nr_buckets;
unsigned prime;
unsigned shift;
struct list_head *buckets;
unsigned max_recovery; /* Max # of regions to recover in parallel */
spinlock_t region_lock;
atomic_t recovery_in_flight;
struct semaphore recovery_count;
struct list_head clean_regions;
struct list_head quiesced_regions;
struct list_head recovered_regions;
struct list_head failed_recovered_regions;
/*
* If there was a flush failure no regions can be marked clean.
*/
int flush_failure;
void *context;
sector_t target_begin;
/* Callback function to schedule bios writes */
void (*dispatch_bios)(void *context, struct bio_list *bios);
/* Callback function to wakeup callers worker thread. */
void (*wakeup_workers)(void *context);
/* Callback function to wakeup callers recovery waiters. */
void (*wakeup_all_recovery_waiters)(void *context);
};
struct dm_region {
struct dm_region_hash *rh; /* FIXME: can we get rid of this ? */
region_t key;
int state;
struct list_head hash_list;
struct list_head list;
atomic_t pending;
struct bio_list delayed_bios;
};
/*
* Conversion fns
*/
static region_t dm_rh_sector_to_region(struct dm_region_hash *rh, sector_t sector)
{
return sector >> rh->region_shift;
}
sector_t dm_rh_region_to_sector(struct dm_region_hash *rh, region_t region)
{
return region << rh->region_shift;
}
EXPORT_SYMBOL_GPL(dm_rh_region_to_sector);
region_t dm_rh_bio_to_region(struct dm_region_hash *rh, struct bio *bio)
{
return dm_rh_sector_to_region(rh, bio->bi_iter.bi_sector -
rh->target_begin);
}
EXPORT_SYMBOL_GPL(dm_rh_bio_to_region);
void *dm_rh_region_context(struct dm_region *reg)
{
return reg->rh->context;
}
EXPORT_SYMBOL_GPL(dm_rh_region_context);
region_t dm_rh_get_region_key(struct dm_region *reg)
{
return reg->key;
}
EXPORT_SYMBOL_GPL(dm_rh_get_region_key);
sector_t dm_rh_get_region_size(struct dm_region_hash *rh)
{
return rh->region_size;
}
EXPORT_SYMBOL_GPL(dm_rh_get_region_size);
/*
* FIXME: shall we pass in a structure instead of all these args to
* dm_region_hash_create()????
*/
#define RH_HASH_MULT 2654435387U
#define RH_HASH_SHIFT 12
#define MIN_REGIONS 64
struct dm_region_hash *dm_region_hash_create(
void *context, void (*dispatch_bios)(void *context,
struct bio_list *bios),
void (*wakeup_workers)(void *context),
void (*wakeup_all_recovery_waiters)(void *context),
sector_t target_begin, unsigned max_recovery,
struct dm_dirty_log *log, uint32_t region_size,
region_t nr_regions)
{
struct dm_region_hash *rh;
unsigned nr_buckets, max_buckets;
size_t i;
/*
* Calculate a suitable number of buckets for our hash
* table.
*/
max_buckets = nr_regions >> 6;
for (nr_buckets = 128u; nr_buckets < max_buckets; nr_buckets <<= 1)
;
nr_buckets >>= 1;
rh = kmalloc(sizeof(*rh), GFP_KERNEL);
if (!rh) {
DMERR("unable to allocate region hash memory");
return ERR_PTR(-ENOMEM);
}
rh->context = context;
rh->dispatch_bios = dispatch_bios;
rh->wakeup_workers = wakeup_workers;
rh->wakeup_all_recovery_waiters = wakeup_all_recovery_waiters;
rh->target_begin = target_begin;
rh->max_recovery = max_recovery;
rh->log = log;
rh->region_size = region_size;
rh->region_shift = ffs(region_size) - 1;
rwlock_init(&rh->hash_lock);
rh->mask = nr_buckets - 1;
rh->nr_buckets = nr_buckets;
rh->shift = RH_HASH_SHIFT;
rh->prime = RH_HASH_MULT;
rh->buckets = vmalloc(nr_buckets * sizeof(*rh->buckets));
if (!rh->buckets) {
DMERR("unable to allocate region hash bucket memory");
kfree(rh);
return ERR_PTR(-ENOMEM);
}
for (i = 0; i < nr_buckets; i++)
INIT_LIST_HEAD(rh->buckets + i);
spin_lock_init(&rh->region_lock);
sema_init(&rh->recovery_count, 0);
atomic_set(&rh->recovery_in_flight, 0);
INIT_LIST_HEAD(&rh->clean_regions);
INIT_LIST_HEAD(&rh->quiesced_regions);
INIT_LIST_HEAD(&rh->recovered_regions);
INIT_LIST_HEAD(&rh->failed_recovered_regions);
rh->flush_failure = 0;
rh->region_pool = mempool_create_kmalloc_pool(MIN_REGIONS,
sizeof(struct dm_region));
if (!rh->region_pool) {
vfree(rh->buckets);
kfree(rh);
rh = ERR_PTR(-ENOMEM);
}
return rh;
}
EXPORT_SYMBOL_GPL(dm_region_hash_create);
void dm_region_hash_destroy(struct dm_region_hash *rh)
{
unsigned h;
struct dm_region *reg, *nreg;
BUG_ON(!list_empty(&rh->quiesced_regions));
for (h = 0; h < rh->nr_buckets; h++) {
list_for_each_entry_safe(reg, nreg, rh->buckets + h,
hash_list) {
BUG_ON(atomic_read(&reg->pending));
mempool_free(reg, rh->region_pool);
}
}
if (rh->log)
dm_dirty_log_destroy(rh->log);
if (rh->region_pool)
mempool_destroy(rh->region_pool);
vfree(rh->buckets);
kfree(rh);
}
EXPORT_SYMBOL_GPL(dm_region_hash_destroy);
struct dm_dirty_log *dm_rh_dirty_log(struct dm_region_hash *rh)
{
return rh->log;
}
EXPORT_SYMBOL_GPL(dm_rh_dirty_log);
static unsigned rh_hash(struct dm_region_hash *rh, region_t region)
{
return (unsigned) ((region * rh->prime) >> rh->shift) & rh->mask;
}
static struct dm_region *__rh_lookup(struct dm_region_hash *rh, region_t region)
{
struct dm_region *reg;
struct list_head *bucket = rh->buckets + rh_hash(rh, region);
list_for_each_entry(reg, bucket, hash_list)
if (reg->key == region)
return reg;
return NULL;
}
static void __rh_insert(struct dm_region_hash *rh, struct dm_region *reg)
{
list_add(&reg->hash_list, rh->buckets + rh_hash(rh, reg->key));
}
static struct dm_region *__rh_alloc(struct dm_region_hash *rh, region_t region)
{
struct dm_region *reg, *nreg;
nreg = mempool_alloc(rh->region_pool, GFP_ATOMIC);
if (unlikely(!nreg))
nreg = kmalloc(sizeof(*nreg), GFP_NOIO | __GFP_NOFAIL);
nreg->state = rh->log->type->in_sync(rh->log, region, 1) ?
DM_RH_CLEAN : DM_RH_NOSYNC;
nreg->rh = rh;
nreg->key = region;
INIT_LIST_HEAD(&nreg->list);
atomic_set(&nreg->pending, 0);
bio_list_init(&nreg->delayed_bios);
write_lock_irq(&rh->hash_lock);
reg = __rh_lookup(rh, region);
if (reg)
/* We lost the race. */
mempool_free(nreg, rh->region_pool);
else {
__rh_insert(rh, nreg);
if (nreg->state == DM_RH_CLEAN) {
spin_lock(&rh->region_lock);
list_add(&nreg->list, &rh->clean_regions);
spin_unlock(&rh->region_lock);
}
reg = nreg;
}
write_unlock_irq(&rh->hash_lock);
return reg;
}
static struct dm_region *__rh_find(struct dm_region_hash *rh, region_t region)
{
struct dm_region *reg;
reg = __rh_lookup(rh, region);
if (!reg) {
read_unlock(&rh->hash_lock);
reg = __rh_alloc(rh, region);
read_lock(&rh->hash_lock);
}
return reg;
}
int dm_rh_get_state(struct dm_region_hash *rh, region_t region, int may_block)
{
int r;
struct dm_region *reg;
read_lock(&rh->hash_lock);
reg = __rh_lookup(rh, region);
read_unlock(&rh->hash_lock);
if (reg)
return reg->state;
/*
* The region wasn't in the hash, so we fall back to the
* dirty log.
*/
r = rh->log->type->in_sync(rh->log, region, may_block);
/*
* Any error from the dirty log (eg. -EWOULDBLOCK) gets
* taken as a DM_RH_NOSYNC
*/
return r == 1 ? DM_RH_CLEAN : DM_RH_NOSYNC;
}
EXPORT_SYMBOL_GPL(dm_rh_get_state);
static void complete_resync_work(struct dm_region *reg, int success)
{
struct dm_region_hash *rh = reg->rh;
rh->log->type->set_region_sync(rh->log, reg->key, success);
/*
* Dispatch the bios before we call 'wake_up_all'.
* This is important because if we are suspending,
* we want to know that recovery is complete and
* the work queue is flushed. If we wake_up_all
* before we dispatch_bios (queue bios and call wake()),
* then we risk suspending before the work queue
* has been properly flushed.
*/
rh->dispatch_bios(rh->context, &reg->delayed_bios);
if (atomic_dec_and_test(&rh->recovery_in_flight))
rh->wakeup_all_recovery_waiters(rh->context);
up(&rh->recovery_count);
}
/* dm_rh_mark_nosync
* @ms
* @bio
*
* The bio was written on some mirror(s) but failed on other mirror(s).
* We can successfully endio the bio but should avoid the region being
* marked clean by setting the state DM_RH_NOSYNC.
*
* This function is _not_ safe in interrupt context!
*/
void dm_rh_mark_nosync(struct dm_region_hash *rh, struct bio *bio)
{
unsigned long flags;
struct dm_dirty_log *log = rh->log;
struct dm_region *reg;
region_t region = dm_rh_bio_to_region(rh, bio);
int recovering = 0;
if (bio->bi_rw & REQ_FLUSH) {
rh->flush_failure = 1;
return;
}
if (bio->bi_rw & REQ_DISCARD)
return;
/* We must inform the log that the sync count has changed. */
log->type->set_region_sync(log, region, 0);
read_lock(&rh->hash_lock);
reg = __rh_find(rh, region);
read_unlock(&rh->hash_lock);
/* region hash entry should exist because write was in-flight */
BUG_ON(!reg);
BUG_ON(!list_empty(&reg->list));
spin_lock_irqsave(&rh->region_lock, flags);
/*
* Possible cases:
* 1) DM_RH_DIRTY
* 2) DM_RH_NOSYNC: was dirty, other preceding writes failed
* 3) DM_RH_RECOVERING: flushing pending writes
* Either case, the region should have not been connected to list.
*/
recovering = (reg->state == DM_RH_RECOVERING);
reg->state = DM_RH_NOSYNC;
BUG_ON(!list_empty(&reg->list));
spin_unlock_irqrestore(&rh->region_lock, flags);
if (recovering)
complete_resync_work(reg, 0);
}
EXPORT_SYMBOL_GPL(dm_rh_mark_nosync);
void dm_rh_update_states(struct dm_region_hash *rh, int errors_handled)
{
struct dm_region *reg, *next;
LIST_HEAD(clean);
LIST_HEAD(recovered);
LIST_HEAD(failed_recovered);
/*
* Quickly grab the lists.
*/
write_lock_irq(&rh->hash_lock);
spin_lock(&rh->region_lock);
if (!list_empty(&rh->clean_regions)) {
list_splice_init(&rh->clean_regions, &clean);
list_for_each_entry(reg, &clean, list)
list_del(&reg->hash_list);
}
if (!list_empty(&rh->recovered_regions)) {
list_splice_init(&rh->recovered_regions, &recovered);
list_for_each_entry(reg, &recovered, list)
list_del(&reg->hash_list);
}
if (!list_empty(&rh->failed_recovered_regions)) {
list_splice_init(&rh->failed_recovered_regions,
&failed_recovered);
list_for_each_entry(reg, &failed_recovered, list)
list_del(&reg->hash_list);
}
spin_unlock(&rh->region_lock);
write_unlock_irq(&rh->hash_lock);
/*
* All the regions on the recovered and clean lists have
* now been pulled out of the system, so no need to do
* any more locking.
*/
list_for_each_entry_safe(reg, next, &recovered, list) {
rh->log->type->clear_region(rh->log, reg->key);
complete_resync_work(reg, 1);
mempool_free(reg, rh->region_pool);
}
list_for_each_entry_safe(reg, next, &failed_recovered, list) {
complete_resync_work(reg, errors_handled ? 0 : 1);
mempool_free(reg, rh->region_pool);
}
list_for_each_entry_safe(reg, next, &clean, list) {
rh->log->type->clear_region(rh->log, reg->key);
mempool_free(reg, rh->region_pool);
}
rh->log->type->flush(rh->log);
}
EXPORT_SYMBOL_GPL(dm_rh_update_states);
static void rh_inc(struct dm_region_hash *rh, region_t region)
{
struct dm_region *reg;
read_lock(&rh->hash_lock);
reg = __rh_find(rh, region);
spin_lock_irq(&rh->region_lock);
atomic_inc(&reg->pending);
if (reg->state == DM_RH_CLEAN) {
reg->state = DM_RH_DIRTY;
list_del_init(&reg->list); /* take off the clean list */
spin_unlock_irq(&rh->region_lock);
rh->log->type->mark_region(rh->log, reg->key);
} else
spin_unlock_irq(&rh->region_lock);
read_unlock(&rh->hash_lock);
}
void dm_rh_inc_pending(struct dm_region_hash *rh, struct bio_list *bios)
{
struct bio *bio;
for (bio = bios->head; bio; bio = bio->bi_next) {
if (bio->bi_rw & (REQ_FLUSH | REQ_DISCARD))
continue;
rh_inc(rh, dm_rh_bio_to_region(rh, bio));
}
}
EXPORT_SYMBOL_GPL(dm_rh_inc_pending);
void dm_rh_dec(struct dm_region_hash *rh, region_t region)
{
unsigned long flags;
struct dm_region *reg;
int should_wake = 0;
read_lock(&rh->hash_lock);
reg = __rh_lookup(rh, region);
read_unlock(&rh->hash_lock);
spin_lock_irqsave(&rh->region_lock, flags);
if (atomic_dec_and_test(&reg->pending)) {
/*
* There is no pending I/O for this region.
* We can move the region to corresponding list for next action.
* At this point, the region is not yet connected to any list.
*
* If the state is DM_RH_NOSYNC, the region should be kept off
* from clean list.
* The hash entry for DM_RH_NOSYNC will remain in memory
* until the region is recovered or the map is reloaded.
*/
/* do nothing for DM_RH_NOSYNC */
if (unlikely(rh->flush_failure)) {
/*
* If a write flush failed some time ago, we
* don't know whether or not this write made it
* to the disk, so we must resync the device.
*/
reg->state = DM_RH_NOSYNC;
} else if (reg->state == DM_RH_RECOVERING) {
list_add_tail(&reg->list, &rh->quiesced_regions);
} else if (reg->state == DM_RH_DIRTY) {
reg->state = DM_RH_CLEAN;
list_add(&reg->list, &rh->clean_regions);
}
should_wake = 1;
}
spin_unlock_irqrestore(&rh->region_lock, flags);
if (should_wake)
rh->wakeup_workers(rh->context);
}
EXPORT_SYMBOL_GPL(dm_rh_dec);
/*
* Starts quiescing a region in preparation for recovery.
*/
static int __rh_recovery_prepare(struct dm_region_hash *rh)
{
int r;
region_t region;
struct dm_region *reg;
/*
* Ask the dirty log what's next.
*/
r = rh->log->type->get_resync_work(rh->log, &region);
if (r <= 0)
return r;
/*
* Get this region, and start it quiescing by setting the
* recovering flag.
*/
read_lock(&rh->hash_lock);
reg = __rh_find(rh, region);
read_unlock(&rh->hash_lock);
spin_lock_irq(&rh->region_lock);
reg->state = DM_RH_RECOVERING;
/* Already quiesced ? */
if (atomic_read(&reg->pending))
list_del_init(&reg->list);
else
list_move(&reg->list, &rh->quiesced_regions);
spin_unlock_irq(&rh->region_lock);
return 1;
}
void dm_rh_recovery_prepare(struct dm_region_hash *rh)
{
/* Extra reference to avoid race with dm_rh_stop_recovery */
atomic_inc(&rh->recovery_in_flight);
while (!down_trylock(&rh->recovery_count)) {
atomic_inc(&rh->recovery_in_flight);
if (__rh_recovery_prepare(rh) <= 0) {
atomic_dec(&rh->recovery_in_flight);
up(&rh->recovery_count);
break;
}
}
/* Drop the extra reference */
if (atomic_dec_and_test(&rh->recovery_in_flight))
rh->wakeup_all_recovery_waiters(rh->context);
}
EXPORT_SYMBOL_GPL(dm_rh_recovery_prepare);
/*
* Returns any quiesced regions.
*/
struct dm_region *dm_rh_recovery_start(struct dm_region_hash *rh)
{
struct dm_region *reg = NULL;
spin_lock_irq(&rh->region_lock);
if (!list_empty(&rh->quiesced_regions)) {
reg = list_entry(rh->quiesced_regions.next,
struct dm_region, list);
list_del_init(&reg->list); /* remove from the quiesced list */
}
spin_unlock_irq(&rh->region_lock);
return reg;
}
EXPORT_SYMBOL_GPL(dm_rh_recovery_start);
void dm_rh_recovery_end(struct dm_region *reg, int success)
{
struct dm_region_hash *rh = reg->rh;
spin_lock_irq(&rh->region_lock);
if (success)
list_add(&reg->list, &reg->rh->recovered_regions);
else
list_add(&reg->list, &reg->rh->failed_recovered_regions);
spin_unlock_irq(&rh->region_lock);
rh->wakeup_workers(rh->context);
}
EXPORT_SYMBOL_GPL(dm_rh_recovery_end);
/* Return recovery in flight count. */
int dm_rh_recovery_in_flight(struct dm_region_hash *rh)
{
return atomic_read(&rh->recovery_in_flight);
}
EXPORT_SYMBOL_GPL(dm_rh_recovery_in_flight);
int dm_rh_flush(struct dm_region_hash *rh)
{
return rh->log->type->flush(rh->log);
}
EXPORT_SYMBOL_GPL(dm_rh_flush);
void dm_rh_delay(struct dm_region_hash *rh, struct bio *bio)
{
struct dm_region *reg;
read_lock(&rh->hash_lock);
reg = __rh_find(rh, dm_rh_bio_to_region(rh, bio));
bio_list_add(&reg->delayed_bios, bio);
read_unlock(&rh->hash_lock);
}
EXPORT_SYMBOL_GPL(dm_rh_delay);
void dm_rh_stop_recovery(struct dm_region_hash *rh)
{
int i;
/* wait for any recovering regions */
for (i = 0; i < rh->max_recovery; i++)
down(&rh->recovery_count);
}
EXPORT_SYMBOL_GPL(dm_rh_stop_recovery);
void dm_rh_start_recovery(struct dm_region_hash *rh)
{
int i;
for (i = 0; i < rh->max_recovery; i++)
up(&rh->recovery_count);
rh->wakeup_workers(rh->context);
}
EXPORT_SYMBOL_GPL(dm_rh_start_recovery);
MODULE_DESCRIPTION(DM_NAME " region hash");
MODULE_AUTHOR("Joe Thornber/Heinz Mauelshagen <dm-devel@redhat.com>");
MODULE_LICENSE("GPL");

219
drivers/md/dm-round-robin.c Normal file
View file

@ -0,0 +1,219 @@
/*
* Copyright (C) 2003 Sistina Software.
* Copyright (C) 2004-2005 Red Hat, Inc. All rights reserved.
*
* Module Author: Heinz Mauelshagen
*
* This file is released under the GPL.
*
* Round-robin path selector.
*/
#include <linux/device-mapper.h>
#include "dm-path-selector.h"
#include <linux/slab.h>
#include <linux/module.h>
#define DM_MSG_PREFIX "multipath round-robin"
/*-----------------------------------------------------------------
* Path-handling code, paths are held in lists
*---------------------------------------------------------------*/
struct path_info {
struct list_head list;
struct dm_path *path;
unsigned repeat_count;
};
static void free_paths(struct list_head *paths)
{
struct path_info *pi, *next;
list_for_each_entry_safe(pi, next, paths, list) {
list_del(&pi->list);
kfree(pi);
}
}
/*-----------------------------------------------------------------
* Round-robin selector
*---------------------------------------------------------------*/
#define RR_MIN_IO 1000
struct selector {
struct list_head valid_paths;
struct list_head invalid_paths;
};
static struct selector *alloc_selector(void)
{
struct selector *s = kmalloc(sizeof(*s), GFP_KERNEL);
if (s) {
INIT_LIST_HEAD(&s->valid_paths);
INIT_LIST_HEAD(&s->invalid_paths);
}
return s;
}
static int rr_create(struct path_selector *ps, unsigned argc, char **argv)
{
struct selector *s;
s = alloc_selector();
if (!s)
return -ENOMEM;
ps->context = s;
return 0;
}
static void rr_destroy(struct path_selector *ps)
{
struct selector *s = (struct selector *) ps->context;
free_paths(&s->valid_paths);
free_paths(&s->invalid_paths);
kfree(s);
ps->context = NULL;
}
static int rr_status(struct path_selector *ps, struct dm_path *path,
status_type_t type, char *result, unsigned int maxlen)
{
struct path_info *pi;
int sz = 0;
if (!path)
DMEMIT("0 ");
else {
switch(type) {
case STATUSTYPE_INFO:
break;
case STATUSTYPE_TABLE:
pi = path->pscontext;
DMEMIT("%u ", pi->repeat_count);
break;
}
}
return sz;
}
/*
* Called during initialisation to register each path with an
* optional repeat_count.
*/
static int rr_add_path(struct path_selector *ps, struct dm_path *path,
int argc, char **argv, char **error)
{
struct selector *s = (struct selector *) ps->context;
struct path_info *pi;
unsigned repeat_count = RR_MIN_IO;
char dummy;
if (argc > 1) {
*error = "round-robin ps: incorrect number of arguments";
return -EINVAL;
}
/* First path argument is number of I/Os before switching path */
if ((argc == 1) && (sscanf(argv[0], "%u%c", &repeat_count, &dummy) != 1)) {
*error = "round-robin ps: invalid repeat count";
return -EINVAL;
}
/* allocate the path */
pi = kmalloc(sizeof(*pi), GFP_KERNEL);
if (!pi) {
*error = "round-robin ps: Error allocating path context";
return -ENOMEM;
}
pi->path = path;
pi->repeat_count = repeat_count;
path->pscontext = pi;
list_add_tail(&pi->list, &s->valid_paths);
return 0;
}
static void rr_fail_path(struct path_selector *ps, struct dm_path *p)
{
struct selector *s = (struct selector *) ps->context;
struct path_info *pi = p->pscontext;
list_move(&pi->list, &s->invalid_paths);
}
static int rr_reinstate_path(struct path_selector *ps, struct dm_path *p)
{
struct selector *s = (struct selector *) ps->context;
struct path_info *pi = p->pscontext;
list_move(&pi->list, &s->valid_paths);
return 0;
}
static struct dm_path *rr_select_path(struct path_selector *ps,
unsigned *repeat_count, size_t nr_bytes)
{
struct selector *s = (struct selector *) ps->context;
struct path_info *pi = NULL;
if (!list_empty(&s->valid_paths)) {
pi = list_entry(s->valid_paths.next, struct path_info, list);
list_move_tail(&pi->list, &s->valid_paths);
*repeat_count = pi->repeat_count;
}
return pi ? pi->path : NULL;
}
static struct path_selector_type rr_ps = {
.name = "round-robin",
.module = THIS_MODULE,
.table_args = 1,
.info_args = 0,
.create = rr_create,
.destroy = rr_destroy,
.status = rr_status,
.add_path = rr_add_path,
.fail_path = rr_fail_path,
.reinstate_path = rr_reinstate_path,
.select_path = rr_select_path,
};
static int __init dm_rr_init(void)
{
int r = dm_register_path_selector(&rr_ps);
if (r < 0)
DMERR("register failed %d", r);
DMINFO("version 1.0.0 loaded");
return r;
}
static void __exit dm_rr_exit(void)
{
int r = dm_unregister_path_selector(&rr_ps);
if (r < 0)
DMERR("unregister failed %d", r);
}
module_init(dm_rr_init);
module_exit(dm_rr_exit);
MODULE_DESCRIPTION(DM_NAME " round-robin multipath path selector");
MODULE_AUTHOR("Sistina Software <dm-devel@redhat.com>");
MODULE_LICENSE("GPL");

View file

@ -0,0 +1,343 @@
/*
* Copyright (C) 2007-2009 NEC Corporation. All Rights Reserved.
*
* Module Author: Kiyoshi Ueda
*
* This file is released under the GPL.
*
* Throughput oriented path selector.
*/
#include "dm.h"
#include "dm-path-selector.h"
#include <linux/slab.h>
#include <linux/module.h>
#define DM_MSG_PREFIX "multipath service-time"
#define ST_MIN_IO 1
#define ST_MAX_RELATIVE_THROUGHPUT 100
#define ST_MAX_RELATIVE_THROUGHPUT_SHIFT 7
#define ST_MAX_INFLIGHT_SIZE ((size_t)-1 >> ST_MAX_RELATIVE_THROUGHPUT_SHIFT)
#define ST_VERSION "0.2.0"
struct selector {
struct list_head valid_paths;
struct list_head failed_paths;
};
struct path_info {
struct list_head list;
struct dm_path *path;
unsigned repeat_count;
unsigned relative_throughput;
atomic_t in_flight_size; /* Total size of in-flight I/Os */
};
static struct selector *alloc_selector(void)
{
struct selector *s = kmalloc(sizeof(*s), GFP_KERNEL);
if (s) {
INIT_LIST_HEAD(&s->valid_paths);
INIT_LIST_HEAD(&s->failed_paths);
}
return s;
}
static int st_create(struct path_selector *ps, unsigned argc, char **argv)
{
struct selector *s = alloc_selector();
if (!s)
return -ENOMEM;
ps->context = s;
return 0;
}
static void free_paths(struct list_head *paths)
{
struct path_info *pi, *next;
list_for_each_entry_safe(pi, next, paths, list) {
list_del(&pi->list);
kfree(pi);
}
}
static void st_destroy(struct path_selector *ps)
{
struct selector *s = ps->context;
free_paths(&s->valid_paths);
free_paths(&s->failed_paths);
kfree(s);
ps->context = NULL;
}
static int st_status(struct path_selector *ps, struct dm_path *path,
status_type_t type, char *result, unsigned maxlen)
{
unsigned sz = 0;
struct path_info *pi;
if (!path)
DMEMIT("0 ");
else {
pi = path->pscontext;
switch (type) {
case STATUSTYPE_INFO:
DMEMIT("%d %u ", atomic_read(&pi->in_flight_size),
pi->relative_throughput);
break;
case STATUSTYPE_TABLE:
DMEMIT("%u %u ", pi->repeat_count,
pi->relative_throughput);
break;
}
}
return sz;
}
static int st_add_path(struct path_selector *ps, struct dm_path *path,
int argc, char **argv, char **error)
{
struct selector *s = ps->context;
struct path_info *pi;
unsigned repeat_count = ST_MIN_IO;
unsigned relative_throughput = 1;
char dummy;
/*
* Arguments: [<repeat_count> [<relative_throughput>]]
* <repeat_count>: The number of I/Os before switching path.
* If not given, default (ST_MIN_IO) is used.
* <relative_throughput>: The relative throughput value of
* the path among all paths in the path-group.
* The valid range: 0-<ST_MAX_RELATIVE_THROUGHPUT>
* If not given, minimum value '1' is used.
* If '0' is given, the path isn't selected while
* other paths having a positive value are
* available.
*/
if (argc > 2) {
*error = "service-time ps: incorrect number of arguments";
return -EINVAL;
}
if (argc && (sscanf(argv[0], "%u%c", &repeat_count, &dummy) != 1)) {
*error = "service-time ps: invalid repeat count";
return -EINVAL;
}
if ((argc == 2) &&
(sscanf(argv[1], "%u%c", &relative_throughput, &dummy) != 1 ||
relative_throughput > ST_MAX_RELATIVE_THROUGHPUT)) {
*error = "service-time ps: invalid relative_throughput value";
return -EINVAL;
}
/* allocate the path */
pi = kmalloc(sizeof(*pi), GFP_KERNEL);
if (!pi) {
*error = "service-time ps: Error allocating path context";
return -ENOMEM;
}
pi->path = path;
pi->repeat_count = repeat_count;
pi->relative_throughput = relative_throughput;
atomic_set(&pi->in_flight_size, 0);
path->pscontext = pi;
list_add_tail(&pi->list, &s->valid_paths);
return 0;
}
static void st_fail_path(struct path_selector *ps, struct dm_path *path)
{
struct selector *s = ps->context;
struct path_info *pi = path->pscontext;
list_move(&pi->list, &s->failed_paths);
}
static int st_reinstate_path(struct path_selector *ps, struct dm_path *path)
{
struct selector *s = ps->context;
struct path_info *pi = path->pscontext;
list_move_tail(&pi->list, &s->valid_paths);
return 0;
}
/*
* Compare the estimated service time of 2 paths, pi1 and pi2,
* for the incoming I/O.
*
* Returns:
* < 0 : pi1 is better
* 0 : no difference between pi1 and pi2
* > 0 : pi2 is better
*
* Description:
* Basically, the service time is estimated by:
* ('pi->in-flight-size' + 'incoming') / 'pi->relative_throughput'
* To reduce the calculation, some optimizations are made.
* (See comments inline)
*/
static int st_compare_load(struct path_info *pi1, struct path_info *pi2,
size_t incoming)
{
size_t sz1, sz2, st1, st2;
sz1 = atomic_read(&pi1->in_flight_size);
sz2 = atomic_read(&pi2->in_flight_size);
/*
* Case 1: Both have same throughput value. Choose less loaded path.
*/
if (pi1->relative_throughput == pi2->relative_throughput)
return sz1 - sz2;
/*
* Case 2a: Both have same load. Choose higher throughput path.
* Case 2b: One path has no throughput value. Choose the other one.
*/
if (sz1 == sz2 ||
!pi1->relative_throughput || !pi2->relative_throughput)
return pi2->relative_throughput - pi1->relative_throughput;
/*
* Case 3: Calculate service time. Choose faster path.
* Service time using pi1:
* st1 = (sz1 + incoming) / pi1->relative_throughput
* Service time using pi2:
* st2 = (sz2 + incoming) / pi2->relative_throughput
*
* To avoid the division, transform the expression to use
* multiplication.
* Because ->relative_throughput > 0 here, if st1 < st2,
* the expressions below are the same meaning:
* (sz1 + incoming) / pi1->relative_throughput <
* (sz2 + incoming) / pi2->relative_throughput
* (sz1 + incoming) * pi2->relative_throughput <
* (sz2 + incoming) * pi1->relative_throughput
* So use the later one.
*/
sz1 += incoming;
sz2 += incoming;
if (unlikely(sz1 >= ST_MAX_INFLIGHT_SIZE ||
sz2 >= ST_MAX_INFLIGHT_SIZE)) {
/*
* Size may be too big for multiplying pi->relative_throughput
* and overflow.
* To avoid the overflow and mis-selection, shift down both.
*/
sz1 >>= ST_MAX_RELATIVE_THROUGHPUT_SHIFT;
sz2 >>= ST_MAX_RELATIVE_THROUGHPUT_SHIFT;
}
st1 = sz1 * pi2->relative_throughput;
st2 = sz2 * pi1->relative_throughput;
if (st1 != st2)
return st1 - st2;
/*
* Case 4: Service time is equal. Choose higher throughput path.
*/
return pi2->relative_throughput - pi1->relative_throughput;
}
static struct dm_path *st_select_path(struct path_selector *ps,
unsigned *repeat_count, size_t nr_bytes)
{
struct selector *s = ps->context;
struct path_info *pi = NULL, *best = NULL;
if (list_empty(&s->valid_paths))
return NULL;
/* Change preferred (first in list) path to evenly balance. */
list_move_tail(s->valid_paths.next, &s->valid_paths);
list_for_each_entry(pi, &s->valid_paths, list)
if (!best || (st_compare_load(pi, best, nr_bytes) < 0))
best = pi;
if (!best)
return NULL;
*repeat_count = best->repeat_count;
return best->path;
}
static int st_start_io(struct path_selector *ps, struct dm_path *path,
size_t nr_bytes)
{
struct path_info *pi = path->pscontext;
atomic_add(nr_bytes, &pi->in_flight_size);
return 0;
}
static int st_end_io(struct path_selector *ps, struct dm_path *path,
size_t nr_bytes)
{
struct path_info *pi = path->pscontext;
atomic_sub(nr_bytes, &pi->in_flight_size);
return 0;
}
static struct path_selector_type st_ps = {
.name = "service-time",
.module = THIS_MODULE,
.table_args = 2,
.info_args = 2,
.create = st_create,
.destroy = st_destroy,
.status = st_status,
.add_path = st_add_path,
.fail_path = st_fail_path,
.reinstate_path = st_reinstate_path,
.select_path = st_select_path,
.start_io = st_start_io,
.end_io = st_end_io,
};
static int __init dm_st_init(void)
{
int r = dm_register_path_selector(&st_ps);
if (r < 0)
DMERR("register failed %d", r);
DMINFO("version " ST_VERSION " loaded");
return r;
}
static void __exit dm_st_exit(void)
{
int r = dm_unregister_path_selector(&st_ps);
if (r < 0)
DMERR("unregister failed %d", r);
}
module_init(dm_st_init);
module_exit(dm_st_exit);
MODULE_DESCRIPTION(DM_NAME " throughput oriented path selector");
MODULE_AUTHOR("Kiyoshi Ueda <k-ueda@ct.jp.nec.com>");
MODULE_LICENSE("GPL");

View file

@ -0,0 +1,958 @@
/*
* Copyright (C) 2001-2002 Sistina Software (UK) Limited.
* Copyright (C) 2006-2008 Red Hat GmbH
*
* This file is released under the GPL.
*/
#include "dm-exception-store.h"
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/vmalloc.h>
#include <linux/export.h>
#include <linux/slab.h>
#include <linux/dm-io.h>
#include "dm-bufio.h"
#define DM_MSG_PREFIX "persistent snapshot"
#define DM_CHUNK_SIZE_DEFAULT_SECTORS 32 /* 16KB */
#define DM_PREFETCH_CHUNKS 12
/*-----------------------------------------------------------------
* Persistent snapshots, by persistent we mean that the snapshot
* will survive a reboot.
*---------------------------------------------------------------*/
/*
* We need to store a record of which parts of the origin have
* been copied to the snapshot device. The snapshot code
* requires that we copy exception chunks to chunk aligned areas
* of the COW store. It makes sense therefore, to store the
* metadata in chunk size blocks.
*
* There is no backward or forward compatibility implemented,
* snapshots with different disk versions than the kernel will
* not be usable. It is expected that "lvcreate" will blank out
* the start of a fresh COW device before calling the snapshot
* constructor.
*
* The first chunk of the COW device just contains the header.
* After this there is a chunk filled with exception metadata,
* followed by as many exception chunks as can fit in the
* metadata areas.
*
* All on disk structures are in little-endian format. The end
* of the exceptions info is indicated by an exception with a
* new_chunk of 0, which is invalid since it would point to the
* header chunk.
*/
/*
* Magic for persistent snapshots: "SnAp" - Feeble isn't it.
*/
#define SNAP_MAGIC 0x70416e53
/*
* The on-disk version of the metadata.
*/
#define SNAPSHOT_DISK_VERSION 1
#define NUM_SNAPSHOT_HDR_CHUNKS 1
struct disk_header {
__le32 magic;
/*
* Is this snapshot valid. There is no way of recovering
* an invalid snapshot.
*/
__le32 valid;
/*
* Simple, incrementing version. no backward
* compatibility.
*/
__le32 version;
/* In sectors */
__le32 chunk_size;
} __packed;
struct disk_exception {
__le64 old_chunk;
__le64 new_chunk;
} __packed;
struct core_exception {
uint64_t old_chunk;
uint64_t new_chunk;
};
struct commit_callback {
void (*callback)(void *, int success);
void *context;
};
/*
* The top level structure for a persistent exception store.
*/
struct pstore {
struct dm_exception_store *store;
int version;
int valid;
uint32_t exceptions_per_area;
/*
* Now that we have an asynchronous kcopyd there is no
* need for large chunk sizes, so it wont hurt to have a
* whole chunks worth of metadata in memory at once.
*/
void *area;
/*
* An area of zeros used to clear the next area.
*/
void *zero_area;
/*
* An area used for header. The header can be written
* concurrently with metadata (when invalidating the snapshot),
* so it needs a separate buffer.
*/
void *header_area;
/*
* Used to keep track of which metadata area the data in
* 'chunk' refers to.
*/
chunk_t current_area;
/*
* The next free chunk for an exception.
*
* When creating exceptions, all the chunks here and above are
* free. It holds the next chunk to be allocated. On rare
* occasions (e.g. after a system crash) holes can be left in
* the exception store because chunks can be committed out of
* order.
*
* When merging exceptions, it does not necessarily mean all the
* chunks here and above are free. It holds the value it would
* have held if all chunks had been committed in order of
* allocation. Consequently the value may occasionally be
* slightly too low, but since it's only used for 'status' and
* it can never reach its minimum value too early this doesn't
* matter.
*/
chunk_t next_free;
/*
* The index of next free exception in the current
* metadata area.
*/
uint32_t current_committed;
atomic_t pending_count;
uint32_t callback_count;
struct commit_callback *callbacks;
struct dm_io_client *io_client;
struct workqueue_struct *metadata_wq;
};
static int alloc_area(struct pstore *ps)
{
int r = -ENOMEM;
size_t len;
len = ps->store->chunk_size << SECTOR_SHIFT;
/*
* Allocate the chunk_size block of memory that will hold
* a single metadata area.
*/
ps->area = vmalloc(len);
if (!ps->area)
goto err_area;
ps->zero_area = vzalloc(len);
if (!ps->zero_area)
goto err_zero_area;
ps->header_area = vmalloc(len);
if (!ps->header_area)
goto err_header_area;
return 0;
err_header_area:
vfree(ps->zero_area);
err_zero_area:
vfree(ps->area);
err_area:
return r;
}
static void free_area(struct pstore *ps)
{
if (ps->area)
vfree(ps->area);
ps->area = NULL;
if (ps->zero_area)
vfree(ps->zero_area);
ps->zero_area = NULL;
if (ps->header_area)
vfree(ps->header_area);
ps->header_area = NULL;
}
struct mdata_req {
struct dm_io_region *where;
struct dm_io_request *io_req;
struct work_struct work;
int result;
};
static void do_metadata(struct work_struct *work)
{
struct mdata_req *req = container_of(work, struct mdata_req, work);
req->result = dm_io(req->io_req, 1, req->where, NULL);
}
/*
* Read or write a chunk aligned and sized block of data from a device.
*/
static int chunk_io(struct pstore *ps, void *area, chunk_t chunk, int rw,
int metadata)
{
struct dm_io_region where = {
.bdev = dm_snap_cow(ps->store->snap)->bdev,
.sector = ps->store->chunk_size * chunk,
.count = ps->store->chunk_size,
};
struct dm_io_request io_req = {
.bi_rw = rw,
.mem.type = DM_IO_VMA,
.mem.ptr.vma = area,
.client = ps->io_client,
.notify.fn = NULL,
};
struct mdata_req req;
if (!metadata)
return dm_io(&io_req, 1, &where, NULL);
req.where = &where;
req.io_req = &io_req;
/*
* Issue the synchronous I/O from a different thread
* to avoid generic_make_request recursion.
*/
INIT_WORK_ONSTACK(&req.work, do_metadata);
queue_work(ps->metadata_wq, &req.work);
flush_workqueue(ps->metadata_wq);
destroy_work_on_stack(&req.work);
return req.result;
}
/*
* Convert a metadata area index to a chunk index.
*/
static chunk_t area_location(struct pstore *ps, chunk_t area)
{
return NUM_SNAPSHOT_HDR_CHUNKS + ((ps->exceptions_per_area + 1) * area);
}
static void skip_metadata(struct pstore *ps)
{
uint32_t stride = ps->exceptions_per_area + 1;
chunk_t next_free = ps->next_free;
if (sector_div(next_free, stride) == NUM_SNAPSHOT_HDR_CHUNKS)
ps->next_free++;
}
/*
* Read or write a metadata area. Remembering to skip the first
* chunk which holds the header.
*/
static int area_io(struct pstore *ps, int rw)
{
int r;
chunk_t chunk;
chunk = area_location(ps, ps->current_area);
r = chunk_io(ps, ps->area, chunk, rw, 0);
if (r)
return r;
return 0;
}
static void zero_memory_area(struct pstore *ps)
{
memset(ps->area, 0, ps->store->chunk_size << SECTOR_SHIFT);
}
static int zero_disk_area(struct pstore *ps, chunk_t area)
{
return chunk_io(ps, ps->zero_area, area_location(ps, area), WRITE, 0);
}
static int read_header(struct pstore *ps, int *new_snapshot)
{
int r;
struct disk_header *dh;
unsigned chunk_size;
int chunk_size_supplied = 1;
char *chunk_err;
/*
* Use default chunk size (or logical_block_size, if larger)
* if none supplied
*/
if (!ps->store->chunk_size) {
ps->store->chunk_size = max(DM_CHUNK_SIZE_DEFAULT_SECTORS,
bdev_logical_block_size(dm_snap_cow(ps->store->snap)->
bdev) >> 9);
ps->store->chunk_mask = ps->store->chunk_size - 1;
ps->store->chunk_shift = ffs(ps->store->chunk_size) - 1;
chunk_size_supplied = 0;
}
ps->io_client = dm_io_client_create();
if (IS_ERR(ps->io_client))
return PTR_ERR(ps->io_client);
r = alloc_area(ps);
if (r)
return r;
r = chunk_io(ps, ps->header_area, 0, READ, 1);
if (r)
goto bad;
dh = ps->header_area;
if (le32_to_cpu(dh->magic) == 0) {
*new_snapshot = 1;
return 0;
}
if (le32_to_cpu(dh->magic) != SNAP_MAGIC) {
DMWARN("Invalid or corrupt snapshot");
r = -ENXIO;
goto bad;
}
*new_snapshot = 0;
ps->valid = le32_to_cpu(dh->valid);
ps->version = le32_to_cpu(dh->version);
chunk_size = le32_to_cpu(dh->chunk_size);
if (ps->store->chunk_size == chunk_size)
return 0;
if (chunk_size_supplied)
DMWARN("chunk size %u in device metadata overrides "
"table chunk size of %u.",
chunk_size, ps->store->chunk_size);
/* We had a bogus chunk_size. Fix stuff up. */
free_area(ps);
r = dm_exception_store_set_chunk_size(ps->store, chunk_size,
&chunk_err);
if (r) {
DMERR("invalid on-disk chunk size %u: %s.",
chunk_size, chunk_err);
return r;
}
r = alloc_area(ps);
return r;
bad:
free_area(ps);
return r;
}
static int write_header(struct pstore *ps)
{
struct disk_header *dh;
memset(ps->header_area, 0, ps->store->chunk_size << SECTOR_SHIFT);
dh = ps->header_area;
dh->magic = cpu_to_le32(SNAP_MAGIC);
dh->valid = cpu_to_le32(ps->valid);
dh->version = cpu_to_le32(ps->version);
dh->chunk_size = cpu_to_le32(ps->store->chunk_size);
return chunk_io(ps, ps->header_area, 0, WRITE, 1);
}
/*
* Access functions for the disk exceptions, these do the endian conversions.
*/
static struct disk_exception *get_exception(struct pstore *ps, void *ps_area,
uint32_t index)
{
BUG_ON(index >= ps->exceptions_per_area);
return ((struct disk_exception *) ps_area) + index;
}
static void read_exception(struct pstore *ps, void *ps_area,
uint32_t index, struct core_exception *result)
{
struct disk_exception *de = get_exception(ps, ps_area, index);
/* copy it */
result->old_chunk = le64_to_cpu(de->old_chunk);
result->new_chunk = le64_to_cpu(de->new_chunk);
}
static void write_exception(struct pstore *ps,
uint32_t index, struct core_exception *e)
{
struct disk_exception *de = get_exception(ps, ps->area, index);
/* copy it */
de->old_chunk = cpu_to_le64(e->old_chunk);
de->new_chunk = cpu_to_le64(e->new_chunk);
}
static void clear_exception(struct pstore *ps, uint32_t index)
{
struct disk_exception *de = get_exception(ps, ps->area, index);
/* clear it */
de->old_chunk = 0;
de->new_chunk = 0;
}
/*
* Registers the exceptions that are present in the current area.
* 'full' is filled in to indicate if the area has been
* filled.
*/
static int insert_exceptions(struct pstore *ps, void *ps_area,
int (*callback)(void *callback_context,
chunk_t old, chunk_t new),
void *callback_context,
int *full)
{
int r;
unsigned int i;
struct core_exception e;
/* presume the area is full */
*full = 1;
for (i = 0; i < ps->exceptions_per_area; i++) {
read_exception(ps, ps_area, i, &e);
/*
* If the new_chunk is pointing at the start of
* the COW device, where the first metadata area
* is we know that we've hit the end of the
* exceptions. Therefore the area is not full.
*/
if (e.new_chunk == 0LL) {
ps->current_committed = i;
*full = 0;
break;
}
/*
* Keep track of the start of the free chunks.
*/
if (ps->next_free <= e.new_chunk)
ps->next_free = e.new_chunk + 1;
/*
* Otherwise we add the exception to the snapshot.
*/
r = callback(callback_context, e.old_chunk, e.new_chunk);
if (r)
return r;
}
return 0;
}
static int read_exceptions(struct pstore *ps,
int (*callback)(void *callback_context, chunk_t old,
chunk_t new),
void *callback_context)
{
int r, full = 1;
struct dm_bufio_client *client;
chunk_t prefetch_area = 0;
client = dm_bufio_client_create(dm_snap_cow(ps->store->snap)->bdev,
ps->store->chunk_size << SECTOR_SHIFT,
1, 0, NULL, NULL);
if (IS_ERR(client))
return PTR_ERR(client);
/*
* Setup for one current buffer + desired readahead buffers.
*/
dm_bufio_set_minimum_buffers(client, 1 + DM_PREFETCH_CHUNKS);
/*
* Keeping reading chunks and inserting exceptions until
* we find a partially full area.
*/
for (ps->current_area = 0; full; ps->current_area++) {
struct dm_buffer *bp;
void *area;
chunk_t chunk;
if (unlikely(prefetch_area < ps->current_area))
prefetch_area = ps->current_area;
if (DM_PREFETCH_CHUNKS) do {
chunk_t pf_chunk = area_location(ps, prefetch_area);
if (unlikely(pf_chunk >= dm_bufio_get_device_size(client)))
break;
dm_bufio_prefetch(client, pf_chunk, 1);
prefetch_area++;
if (unlikely(!prefetch_area))
break;
} while (prefetch_area <= ps->current_area + DM_PREFETCH_CHUNKS);
chunk = area_location(ps, ps->current_area);
area = dm_bufio_read(client, chunk, &bp);
if (unlikely(IS_ERR(area))) {
r = PTR_ERR(area);
goto ret_destroy_bufio;
}
r = insert_exceptions(ps, area, callback, callback_context,
&full);
if (!full)
memcpy(ps->area, area, ps->store->chunk_size << SECTOR_SHIFT);
dm_bufio_release(bp);
dm_bufio_forget(client, chunk);
if (unlikely(r))
goto ret_destroy_bufio;
}
ps->current_area--;
skip_metadata(ps);
r = 0;
ret_destroy_bufio:
dm_bufio_client_destroy(client);
return r;
}
static struct pstore *get_info(struct dm_exception_store *store)
{
return (struct pstore *) store->context;
}
static void persistent_usage(struct dm_exception_store *store,
sector_t *total_sectors,
sector_t *sectors_allocated,
sector_t *metadata_sectors)
{
struct pstore *ps = get_info(store);
*sectors_allocated = ps->next_free * store->chunk_size;
*total_sectors = get_dev_size(dm_snap_cow(store->snap)->bdev);
/*
* First chunk is the fixed header.
* Then there are (ps->current_area + 1) metadata chunks, each one
* separated from the next by ps->exceptions_per_area data chunks.
*/
*metadata_sectors = (ps->current_area + 1 + NUM_SNAPSHOT_HDR_CHUNKS) *
store->chunk_size;
}
static void persistent_dtr(struct dm_exception_store *store)
{
struct pstore *ps = get_info(store);
destroy_workqueue(ps->metadata_wq);
/* Created in read_header */
if (ps->io_client)
dm_io_client_destroy(ps->io_client);
free_area(ps);
/* Allocated in persistent_read_metadata */
if (ps->callbacks)
vfree(ps->callbacks);
kfree(ps);
}
static int persistent_read_metadata(struct dm_exception_store *store,
int (*callback)(void *callback_context,
chunk_t old, chunk_t new),
void *callback_context)
{
int r, uninitialized_var(new_snapshot);
struct pstore *ps = get_info(store);
/*
* Read the snapshot header.
*/
r = read_header(ps, &new_snapshot);
if (r)
return r;
/*
* Now we know correct chunk_size, complete the initialisation.
*/
ps->exceptions_per_area = (ps->store->chunk_size << SECTOR_SHIFT) /
sizeof(struct disk_exception);
ps->callbacks = dm_vcalloc(ps->exceptions_per_area,
sizeof(*ps->callbacks));
if (!ps->callbacks)
return -ENOMEM;
/*
* Do we need to setup a new snapshot ?
*/
if (new_snapshot) {
r = write_header(ps);
if (r) {
DMWARN("write_header failed");
return r;
}
ps->current_area = 0;
zero_memory_area(ps);
r = zero_disk_area(ps, 0);
if (r)
DMWARN("zero_disk_area(0) failed");
return r;
}
/*
* Sanity checks.
*/
if (ps->version != SNAPSHOT_DISK_VERSION) {
DMWARN("unable to handle snapshot disk version %d",
ps->version);
return -EINVAL;
}
/*
* Metadata are valid, but snapshot is invalidated
*/
if (!ps->valid)
return 1;
/*
* Read the metadata.
*/
r = read_exceptions(ps, callback, callback_context);
return r;
}
static int persistent_prepare_exception(struct dm_exception_store *store,
struct dm_exception *e)
{
struct pstore *ps = get_info(store);
sector_t size = get_dev_size(dm_snap_cow(store->snap)->bdev);
/* Is there enough room ? */
if (size < ((ps->next_free + 1) * store->chunk_size))
return -ENOSPC;
e->new_chunk = ps->next_free;
/*
* Move onto the next free pending, making sure to take
* into account the location of the metadata chunks.
*/
ps->next_free++;
skip_metadata(ps);
atomic_inc(&ps->pending_count);
return 0;
}
static void persistent_commit_exception(struct dm_exception_store *store,
struct dm_exception *e,
void (*callback) (void *, int success),
void *callback_context)
{
unsigned int i;
struct pstore *ps = get_info(store);
struct core_exception ce;
struct commit_callback *cb;
ce.old_chunk = e->old_chunk;
ce.new_chunk = e->new_chunk;
write_exception(ps, ps->current_committed++, &ce);
/*
* Add the callback to the back of the array. This code
* is the only place where the callback array is
* manipulated, and we know that it will never be called
* multiple times concurrently.
*/
cb = ps->callbacks + ps->callback_count++;
cb->callback = callback;
cb->context = callback_context;
/*
* If there are exceptions in flight and we have not yet
* filled this metadata area there's nothing more to do.
*/
if (!atomic_dec_and_test(&ps->pending_count) &&
(ps->current_committed != ps->exceptions_per_area))
return;
/*
* If we completely filled the current area, then wipe the next one.
*/
if ((ps->current_committed == ps->exceptions_per_area) &&
zero_disk_area(ps, ps->current_area + 1))
ps->valid = 0;
/*
* Commit exceptions to disk.
*/
if (ps->valid && area_io(ps, WRITE_FLUSH_FUA))
ps->valid = 0;
/*
* Advance to the next area if this one is full.
*/
if (ps->current_committed == ps->exceptions_per_area) {
ps->current_committed = 0;
ps->current_area++;
zero_memory_area(ps);
}
for (i = 0; i < ps->callback_count; i++) {
cb = ps->callbacks + i;
cb->callback(cb->context, ps->valid);
}
ps->callback_count = 0;
}
static int persistent_prepare_merge(struct dm_exception_store *store,
chunk_t *last_old_chunk,
chunk_t *last_new_chunk)
{
struct pstore *ps = get_info(store);
struct core_exception ce;
int nr_consecutive;
int r;
/*
* When current area is empty, move back to preceding area.
*/
if (!ps->current_committed) {
/*
* Have we finished?
*/
if (!ps->current_area)
return 0;
ps->current_area--;
r = area_io(ps, READ);
if (r < 0)
return r;
ps->current_committed = ps->exceptions_per_area;
}
read_exception(ps, ps->area, ps->current_committed - 1, &ce);
*last_old_chunk = ce.old_chunk;
*last_new_chunk = ce.new_chunk;
/*
* Find number of consecutive chunks within the current area,
* working backwards.
*/
for (nr_consecutive = 1; nr_consecutive < ps->current_committed;
nr_consecutive++) {
read_exception(ps, ps->area,
ps->current_committed - 1 - nr_consecutive, &ce);
if (ce.old_chunk != *last_old_chunk - nr_consecutive ||
ce.new_chunk != *last_new_chunk - nr_consecutive)
break;
}
return nr_consecutive;
}
static int persistent_commit_merge(struct dm_exception_store *store,
int nr_merged)
{
int r, i;
struct pstore *ps = get_info(store);
BUG_ON(nr_merged > ps->current_committed);
for (i = 0; i < nr_merged; i++)
clear_exception(ps, ps->current_committed - 1 - i);
r = area_io(ps, WRITE_FLUSH_FUA);
if (r < 0)
return r;
ps->current_committed -= nr_merged;
/*
* At this stage, only persistent_usage() uses ps->next_free, so
* we make no attempt to keep ps->next_free strictly accurate
* as exceptions may have been committed out-of-order originally.
* Once a snapshot has become merging, we set it to the value it
* would have held had all the exceptions been committed in order.
*
* ps->current_area does not get reduced by prepare_merge() until
* after commit_merge() has removed the nr_merged previous exceptions.
*/
ps->next_free = area_location(ps, ps->current_area) +
ps->current_committed + 1;
return 0;
}
static void persistent_drop_snapshot(struct dm_exception_store *store)
{
struct pstore *ps = get_info(store);
ps->valid = 0;
if (write_header(ps))
DMWARN("write header failed");
}
static int persistent_ctr(struct dm_exception_store *store,
unsigned argc, char **argv)
{
struct pstore *ps;
/* allocate the pstore */
ps = kzalloc(sizeof(*ps), GFP_KERNEL);
if (!ps)
return -ENOMEM;
ps->store = store;
ps->valid = 1;
ps->version = SNAPSHOT_DISK_VERSION;
ps->area = NULL;
ps->zero_area = NULL;
ps->header_area = NULL;
ps->next_free = NUM_SNAPSHOT_HDR_CHUNKS + 1; /* header and 1st area */
ps->current_committed = 0;
ps->callback_count = 0;
atomic_set(&ps->pending_count, 0);
ps->callbacks = NULL;
ps->metadata_wq = alloc_workqueue("ksnaphd", WQ_MEM_RECLAIM, 0);
if (!ps->metadata_wq) {
kfree(ps);
DMERR("couldn't start header metadata update thread");
return -ENOMEM;
}
store->context = ps;
return 0;
}
static unsigned persistent_status(struct dm_exception_store *store,
status_type_t status, char *result,
unsigned maxlen)
{
unsigned sz = 0;
switch (status) {
case STATUSTYPE_INFO:
break;
case STATUSTYPE_TABLE:
DMEMIT(" P %llu", (unsigned long long)store->chunk_size);
}
return sz;
}
static struct dm_exception_store_type _persistent_type = {
.name = "persistent",
.module = THIS_MODULE,
.ctr = persistent_ctr,
.dtr = persistent_dtr,
.read_metadata = persistent_read_metadata,
.prepare_exception = persistent_prepare_exception,
.commit_exception = persistent_commit_exception,
.prepare_merge = persistent_prepare_merge,
.commit_merge = persistent_commit_merge,
.drop_snapshot = persistent_drop_snapshot,
.usage = persistent_usage,
.status = persistent_status,
};
static struct dm_exception_store_type _persistent_compat_type = {
.name = "P",
.module = THIS_MODULE,
.ctr = persistent_ctr,
.dtr = persistent_dtr,
.read_metadata = persistent_read_metadata,
.prepare_exception = persistent_prepare_exception,
.commit_exception = persistent_commit_exception,
.prepare_merge = persistent_prepare_merge,
.commit_merge = persistent_commit_merge,
.drop_snapshot = persistent_drop_snapshot,
.usage = persistent_usage,
.status = persistent_status,
};
int dm_persistent_snapshot_init(void)
{
int r;
r = dm_exception_store_type_register(&_persistent_type);
if (r) {
DMERR("Unable to register persistent exception store type");
return r;
}
r = dm_exception_store_type_register(&_persistent_compat_type);
if (r) {
DMERR("Unable to register old-style persistent exception "
"store type");
dm_exception_store_type_unregister(&_persistent_type);
return r;
}
return r;
}
void dm_persistent_snapshot_exit(void)
{
dm_exception_store_type_unregister(&_persistent_type);
dm_exception_store_type_unregister(&_persistent_compat_type);
}

View file

@ -0,0 +1,153 @@
/*
* Copyright (C) 2001-2002 Sistina Software (UK) Limited.
* Copyright (C) 2006-2008 Red Hat GmbH
*
* This file is released under the GPL.
*/
#include "dm-exception-store.h"
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/vmalloc.h>
#include <linux/export.h>
#include <linux/slab.h>
#include <linux/dm-io.h>
#define DM_MSG_PREFIX "transient snapshot"
/*-----------------------------------------------------------------
* Implementation of the store for non-persistent snapshots.
*---------------------------------------------------------------*/
struct transient_c {
sector_t next_free;
};
static void transient_dtr(struct dm_exception_store *store)
{
kfree(store->context);
}
static int transient_read_metadata(struct dm_exception_store *store,
int (*callback)(void *callback_context,
chunk_t old, chunk_t new),
void *callback_context)
{
return 0;
}
static int transient_prepare_exception(struct dm_exception_store *store,
struct dm_exception *e)
{
struct transient_c *tc = store->context;
sector_t size = get_dev_size(dm_snap_cow(store->snap)->bdev);
if (size < (tc->next_free + store->chunk_size))
return -1;
e->new_chunk = sector_to_chunk(store, tc->next_free);
tc->next_free += store->chunk_size;
return 0;
}
static void transient_commit_exception(struct dm_exception_store *store,
struct dm_exception *e,
void (*callback) (void *, int success),
void *callback_context)
{
/* Just succeed */
callback(callback_context, 1);
}
static void transient_usage(struct dm_exception_store *store,
sector_t *total_sectors,
sector_t *sectors_allocated,
sector_t *metadata_sectors)
{
*sectors_allocated = ((struct transient_c *) store->context)->next_free;
*total_sectors = get_dev_size(dm_snap_cow(store->snap)->bdev);
*metadata_sectors = 0;
}
static int transient_ctr(struct dm_exception_store *store,
unsigned argc, char **argv)
{
struct transient_c *tc;
tc = kmalloc(sizeof(struct transient_c), GFP_KERNEL);
if (!tc)
return -ENOMEM;
tc->next_free = 0;
store->context = tc;
return 0;
}
static unsigned transient_status(struct dm_exception_store *store,
status_type_t status, char *result,
unsigned maxlen)
{
unsigned sz = 0;
switch (status) {
case STATUSTYPE_INFO:
break;
case STATUSTYPE_TABLE:
DMEMIT(" N %llu", (unsigned long long)store->chunk_size);
}
return sz;
}
static struct dm_exception_store_type _transient_type = {
.name = "transient",
.module = THIS_MODULE,
.ctr = transient_ctr,
.dtr = transient_dtr,
.read_metadata = transient_read_metadata,
.prepare_exception = transient_prepare_exception,
.commit_exception = transient_commit_exception,
.usage = transient_usage,
.status = transient_status,
};
static struct dm_exception_store_type _transient_compat_type = {
.name = "N",
.module = THIS_MODULE,
.ctr = transient_ctr,
.dtr = transient_dtr,
.read_metadata = transient_read_metadata,
.prepare_exception = transient_prepare_exception,
.commit_exception = transient_commit_exception,
.usage = transient_usage,
.status = transient_status,
};
int dm_transient_snapshot_init(void)
{
int r;
r = dm_exception_store_type_register(&_transient_type);
if (r) {
DMWARN("Unable to register transient exception store type");
return r;
}
r = dm_exception_store_type_register(&_transient_compat_type);
if (r) {
DMWARN("Unable to register old-style transient "
"exception store type");
dm_exception_store_type_unregister(&_transient_type);
return r;
}
return r;
}
void dm_transient_snapshot_exit(void)
{
dm_exception_store_type_unregister(&_transient_type);
dm_exception_store_type_unregister(&_transient_compat_type);
}

2486
drivers/md/dm-snap.c Normal file

File diff suppressed because it is too large Load diff

981
drivers/md/dm-stats.c Normal file
View file

@ -0,0 +1,981 @@
#include <linux/errno.h>
#include <linux/numa.h>
#include <linux/slab.h>
#include <linux/rculist.h>
#include <linux/threads.h>
#include <linux/preempt.h>
#include <linux/irqflags.h>
#include <linux/vmalloc.h>
#include <linux/mm.h>
#include <linux/module.h>
#include <linux/device-mapper.h>
#include "dm.h"
#include "dm-stats.h"
#define DM_MSG_PREFIX "stats"
static int dm_stat_need_rcu_barrier;
/*
* Using 64-bit values to avoid overflow (which is a
* problem that block/genhd.c's IO accounting has).
*/
struct dm_stat_percpu {
unsigned long long sectors[2];
unsigned long long ios[2];
unsigned long long merges[2];
unsigned long long ticks[2];
unsigned long long io_ticks[2];
unsigned long long io_ticks_total;
unsigned long long time_in_queue;
};
struct dm_stat_shared {
atomic_t in_flight[2];
unsigned long stamp;
struct dm_stat_percpu tmp;
};
struct dm_stat {
struct list_head list_entry;
int id;
size_t n_entries;
sector_t start;
sector_t end;
sector_t step;
const char *program_id;
const char *aux_data;
struct rcu_head rcu_head;
size_t shared_alloc_size;
size_t percpu_alloc_size;
struct dm_stat_percpu *stat_percpu[NR_CPUS];
struct dm_stat_shared stat_shared[0];
};
struct dm_stats_last_position {
sector_t last_sector;
unsigned last_rw;
};
/*
* A typo on the command line could possibly make the kernel run out of memory
* and crash. To prevent the crash we account all used memory. We fail if we
* exhaust 1/4 of all memory or 1/2 of vmalloc space.
*/
#define DM_STATS_MEMORY_FACTOR 4
#define DM_STATS_VMALLOC_FACTOR 2
static DEFINE_SPINLOCK(shared_memory_lock);
static unsigned long shared_memory_amount;
static bool __check_shared_memory(size_t alloc_size)
{
size_t a;
a = shared_memory_amount + alloc_size;
if (a < shared_memory_amount)
return false;
if (a >> PAGE_SHIFT > totalram_pages / DM_STATS_MEMORY_FACTOR)
return false;
#ifdef CONFIG_MMU
if (a > (VMALLOC_END - VMALLOC_START) / DM_STATS_VMALLOC_FACTOR)
return false;
#endif
return true;
}
static bool check_shared_memory(size_t alloc_size)
{
bool ret;
spin_lock_irq(&shared_memory_lock);
ret = __check_shared_memory(alloc_size);
spin_unlock_irq(&shared_memory_lock);
return ret;
}
static bool claim_shared_memory(size_t alloc_size)
{
spin_lock_irq(&shared_memory_lock);
if (!__check_shared_memory(alloc_size)) {
spin_unlock_irq(&shared_memory_lock);
return false;
}
shared_memory_amount += alloc_size;
spin_unlock_irq(&shared_memory_lock);
return true;
}
static void free_shared_memory(size_t alloc_size)
{
unsigned long flags;
spin_lock_irqsave(&shared_memory_lock, flags);
if (WARN_ON_ONCE(shared_memory_amount < alloc_size)) {
spin_unlock_irqrestore(&shared_memory_lock, flags);
DMCRIT("Memory usage accounting bug.");
return;
}
shared_memory_amount -= alloc_size;
spin_unlock_irqrestore(&shared_memory_lock, flags);
}
static void *dm_kvzalloc(size_t alloc_size, int node)
{
void *p;
if (!claim_shared_memory(alloc_size))
return NULL;
if (alloc_size <= KMALLOC_MAX_SIZE) {
p = kzalloc_node(alloc_size, GFP_KERNEL | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN, node);
if (p)
return p;
}
p = vzalloc_node(alloc_size, node);
if (p)
return p;
free_shared_memory(alloc_size);
return NULL;
}
static void dm_kvfree(void *ptr, size_t alloc_size)
{
if (!ptr)
return;
free_shared_memory(alloc_size);
if (is_vmalloc_addr(ptr))
vfree(ptr);
else
kfree(ptr);
}
static void dm_stat_free(struct rcu_head *head)
{
int cpu;
struct dm_stat *s = container_of(head, struct dm_stat, rcu_head);
kfree(s->program_id);
kfree(s->aux_data);
for_each_possible_cpu(cpu)
dm_kvfree(s->stat_percpu[cpu], s->percpu_alloc_size);
dm_kvfree(s, s->shared_alloc_size);
}
static int dm_stat_in_flight(struct dm_stat_shared *shared)
{
return atomic_read(&shared->in_flight[READ]) +
atomic_read(&shared->in_flight[WRITE]);
}
void dm_stats_init(struct dm_stats *stats)
{
int cpu;
struct dm_stats_last_position *last;
mutex_init(&stats->mutex);
INIT_LIST_HEAD(&stats->list);
stats->last = alloc_percpu(struct dm_stats_last_position);
for_each_possible_cpu(cpu) {
last = per_cpu_ptr(stats->last, cpu);
last->last_sector = (sector_t)ULLONG_MAX;
last->last_rw = UINT_MAX;
}
}
void dm_stats_cleanup(struct dm_stats *stats)
{
size_t ni;
struct dm_stat *s;
struct dm_stat_shared *shared;
while (!list_empty(&stats->list)) {
s = container_of(stats->list.next, struct dm_stat, list_entry);
list_del(&s->list_entry);
for (ni = 0; ni < s->n_entries; ni++) {
shared = &s->stat_shared[ni];
if (WARN_ON(dm_stat_in_flight(shared))) {
DMCRIT("leaked in-flight counter at index %lu "
"(start %llu, end %llu, step %llu): reads %d, writes %d",
(unsigned long)ni,
(unsigned long long)s->start,
(unsigned long long)s->end,
(unsigned long long)s->step,
atomic_read(&shared->in_flight[READ]),
atomic_read(&shared->in_flight[WRITE]));
}
}
dm_stat_free(&s->rcu_head);
}
free_percpu(stats->last);
}
static int dm_stats_create(struct dm_stats *stats, sector_t start, sector_t end,
sector_t step, const char *program_id, const char *aux_data,
void (*suspend_callback)(struct mapped_device *),
void (*resume_callback)(struct mapped_device *),
struct mapped_device *md)
{
struct list_head *l;
struct dm_stat *s, *tmp_s;
sector_t n_entries;
size_t ni;
size_t shared_alloc_size;
size_t percpu_alloc_size;
struct dm_stat_percpu *p;
int cpu;
int ret_id;
int r;
if (end < start || !step)
return -EINVAL;
n_entries = end - start;
if (dm_sector_div64(n_entries, step))
n_entries++;
if (n_entries != (size_t)n_entries || !(size_t)(n_entries + 1))
return -EOVERFLOW;
shared_alloc_size = sizeof(struct dm_stat) + (size_t)n_entries * sizeof(struct dm_stat_shared);
if ((shared_alloc_size - sizeof(struct dm_stat)) / sizeof(struct dm_stat_shared) != n_entries)
return -EOVERFLOW;
percpu_alloc_size = (size_t)n_entries * sizeof(struct dm_stat_percpu);
if (percpu_alloc_size / sizeof(struct dm_stat_percpu) != n_entries)
return -EOVERFLOW;
if (!check_shared_memory(shared_alloc_size + num_possible_cpus() * percpu_alloc_size))
return -ENOMEM;
s = dm_kvzalloc(shared_alloc_size, NUMA_NO_NODE);
if (!s)
return -ENOMEM;
s->n_entries = n_entries;
s->start = start;
s->end = end;
s->step = step;
s->shared_alloc_size = shared_alloc_size;
s->percpu_alloc_size = percpu_alloc_size;
s->program_id = kstrdup(program_id, GFP_KERNEL);
if (!s->program_id) {
r = -ENOMEM;
goto out;
}
s->aux_data = kstrdup(aux_data, GFP_KERNEL);
if (!s->aux_data) {
r = -ENOMEM;
goto out;
}
for (ni = 0; ni < n_entries; ni++) {
atomic_set(&s->stat_shared[ni].in_flight[READ], 0);
atomic_set(&s->stat_shared[ni].in_flight[WRITE], 0);
}
for_each_possible_cpu(cpu) {
p = dm_kvzalloc(percpu_alloc_size, cpu_to_node(cpu));
if (!p) {
r = -ENOMEM;
goto out;
}
s->stat_percpu[cpu] = p;
}
/*
* Suspend/resume to make sure there is no i/o in flight,
* so that newly created statistics will be exact.
*
* (note: we couldn't suspend earlier because we must not
* allocate memory while suspended)
*/
suspend_callback(md);
mutex_lock(&stats->mutex);
s->id = 0;
list_for_each(l, &stats->list) {
tmp_s = container_of(l, struct dm_stat, list_entry);
if (WARN_ON(tmp_s->id < s->id)) {
r = -EINVAL;
goto out_unlock_resume;
}
if (tmp_s->id > s->id)
break;
if (unlikely(s->id == INT_MAX)) {
r = -ENFILE;
goto out_unlock_resume;
}
s->id++;
}
ret_id = s->id;
list_add_tail_rcu(&s->list_entry, l);
mutex_unlock(&stats->mutex);
resume_callback(md);
return ret_id;
out_unlock_resume:
mutex_unlock(&stats->mutex);
resume_callback(md);
out:
dm_stat_free(&s->rcu_head);
return r;
}
static struct dm_stat *__dm_stats_find(struct dm_stats *stats, int id)
{
struct dm_stat *s;
list_for_each_entry(s, &stats->list, list_entry) {
if (s->id > id)
break;
if (s->id == id)
return s;
}
return NULL;
}
static int dm_stats_delete(struct dm_stats *stats, int id)
{
struct dm_stat *s;
int cpu;
mutex_lock(&stats->mutex);
s = __dm_stats_find(stats, id);
if (!s) {
mutex_unlock(&stats->mutex);
return -ENOENT;
}
list_del_rcu(&s->list_entry);
mutex_unlock(&stats->mutex);
/*
* vfree can't be called from RCU callback
*/
for_each_possible_cpu(cpu)
if (is_vmalloc_addr(s->stat_percpu))
goto do_sync_free;
if (is_vmalloc_addr(s)) {
do_sync_free:
synchronize_rcu_expedited();
dm_stat_free(&s->rcu_head);
} else {
ACCESS_ONCE(dm_stat_need_rcu_barrier) = 1;
call_rcu(&s->rcu_head, dm_stat_free);
}
return 0;
}
static int dm_stats_list(struct dm_stats *stats, const char *program,
char *result, unsigned maxlen)
{
struct dm_stat *s;
sector_t len;
unsigned sz = 0;
/*
* Output format:
* <region_id>: <start_sector>+<length> <step> <program_id> <aux_data>
*/
mutex_lock(&stats->mutex);
list_for_each_entry(s, &stats->list, list_entry) {
if (!program || !strcmp(program, s->program_id)) {
len = s->end - s->start;
DMEMIT("%d: %llu+%llu %llu %s %s\n", s->id,
(unsigned long long)s->start,
(unsigned long long)len,
(unsigned long long)s->step,
s->program_id,
s->aux_data);
}
}
mutex_unlock(&stats->mutex);
return 1;
}
static void dm_stat_round(struct dm_stat_shared *shared, struct dm_stat_percpu *p)
{
/*
* This is racy, but so is part_round_stats_single.
*/
unsigned long now = jiffies;
unsigned in_flight_read;
unsigned in_flight_write;
unsigned long difference = now - shared->stamp;
if (!difference)
return;
in_flight_read = (unsigned)atomic_read(&shared->in_flight[READ]);
in_flight_write = (unsigned)atomic_read(&shared->in_flight[WRITE]);
if (in_flight_read)
p->io_ticks[READ] += difference;
if (in_flight_write)
p->io_ticks[WRITE] += difference;
if (in_flight_read + in_flight_write) {
p->io_ticks_total += difference;
p->time_in_queue += (in_flight_read + in_flight_write) * difference;
}
shared->stamp = now;
}
static void dm_stat_for_entry(struct dm_stat *s, size_t entry,
unsigned long bi_rw, sector_t len, bool merged,
bool end, unsigned long duration)
{
unsigned long idx = bi_rw & REQ_WRITE;
struct dm_stat_shared *shared = &s->stat_shared[entry];
struct dm_stat_percpu *p;
/*
* For strict correctness we should use local_irq_save/restore
* instead of preempt_disable/enable.
*
* preempt_disable/enable is racy if the driver finishes bios
* from non-interrupt context as well as from interrupt context
* or from more different interrupts.
*
* On 64-bit architectures the race only results in not counting some
* events, so it is acceptable. On 32-bit architectures the race could
* cause the counter going off by 2^32, so we need to do proper locking
* there.
*
* part_stat_lock()/part_stat_unlock() have this race too.
*/
#if BITS_PER_LONG == 32
unsigned long flags;
local_irq_save(flags);
#else
preempt_disable();
#endif
p = &s->stat_percpu[smp_processor_id()][entry];
if (!end) {
dm_stat_round(shared, p);
atomic_inc(&shared->in_flight[idx]);
} else {
dm_stat_round(shared, p);
atomic_dec(&shared->in_flight[idx]);
p->sectors[idx] += len;
p->ios[idx] += 1;
p->merges[idx] += merged;
p->ticks[idx] += duration;
}
#if BITS_PER_LONG == 32
local_irq_restore(flags);
#else
preempt_enable();
#endif
}
static void __dm_stat_bio(struct dm_stat *s, unsigned long bi_rw,
sector_t bi_sector, sector_t end_sector,
bool end, unsigned long duration,
struct dm_stats_aux *stats_aux)
{
sector_t rel_sector, offset, todo, fragment_len;
size_t entry;
if (end_sector <= s->start || bi_sector >= s->end)
return;
if (unlikely(bi_sector < s->start)) {
rel_sector = 0;
todo = end_sector - s->start;
} else {
rel_sector = bi_sector - s->start;
todo = end_sector - bi_sector;
}
if (unlikely(end_sector > s->end))
todo -= (end_sector - s->end);
offset = dm_sector_div64(rel_sector, s->step);
entry = rel_sector;
do {
if (WARN_ON_ONCE(entry >= s->n_entries)) {
DMCRIT("Invalid area access in region id %d", s->id);
return;
}
fragment_len = todo;
if (fragment_len > s->step - offset)
fragment_len = s->step - offset;
dm_stat_for_entry(s, entry, bi_rw, fragment_len,
stats_aux->merged, end, duration);
todo -= fragment_len;
entry++;
offset = 0;
} while (unlikely(todo != 0));
}
void dm_stats_account_io(struct dm_stats *stats, unsigned long bi_rw,
sector_t bi_sector, unsigned bi_sectors, bool end,
unsigned long duration, struct dm_stats_aux *stats_aux)
{
struct dm_stat *s;
sector_t end_sector;
struct dm_stats_last_position *last;
if (unlikely(!bi_sectors))
return;
end_sector = bi_sector + bi_sectors;
if (!end) {
/*
* A race condition can at worst result in the merged flag being
* misrepresented, so we don't have to disable preemption here.
*/
last = raw_cpu_ptr(stats->last);
stats_aux->merged =
(bi_sector == (ACCESS_ONCE(last->last_sector) &&
((bi_rw & (REQ_WRITE | REQ_DISCARD)) ==
(ACCESS_ONCE(last->last_rw) & (REQ_WRITE | REQ_DISCARD)))
));
ACCESS_ONCE(last->last_sector) = end_sector;
ACCESS_ONCE(last->last_rw) = bi_rw;
}
rcu_read_lock();
list_for_each_entry_rcu(s, &stats->list, list_entry)
__dm_stat_bio(s, bi_rw, bi_sector, end_sector, end, duration, stats_aux);
rcu_read_unlock();
}
static void __dm_stat_init_temporary_percpu_totals(struct dm_stat_shared *shared,
struct dm_stat *s, size_t x)
{
int cpu;
struct dm_stat_percpu *p;
local_irq_disable();
p = &s->stat_percpu[smp_processor_id()][x];
dm_stat_round(shared, p);
local_irq_enable();
memset(&shared->tmp, 0, sizeof(shared->tmp));
for_each_possible_cpu(cpu) {
p = &s->stat_percpu[cpu][x];
shared->tmp.sectors[READ] += ACCESS_ONCE(p->sectors[READ]);
shared->tmp.sectors[WRITE] += ACCESS_ONCE(p->sectors[WRITE]);
shared->tmp.ios[READ] += ACCESS_ONCE(p->ios[READ]);
shared->tmp.ios[WRITE] += ACCESS_ONCE(p->ios[WRITE]);
shared->tmp.merges[READ] += ACCESS_ONCE(p->merges[READ]);
shared->tmp.merges[WRITE] += ACCESS_ONCE(p->merges[WRITE]);
shared->tmp.ticks[READ] += ACCESS_ONCE(p->ticks[READ]);
shared->tmp.ticks[WRITE] += ACCESS_ONCE(p->ticks[WRITE]);
shared->tmp.io_ticks[READ] += ACCESS_ONCE(p->io_ticks[READ]);
shared->tmp.io_ticks[WRITE] += ACCESS_ONCE(p->io_ticks[WRITE]);
shared->tmp.io_ticks_total += ACCESS_ONCE(p->io_ticks_total);
shared->tmp.time_in_queue += ACCESS_ONCE(p->time_in_queue);
}
}
static void __dm_stat_clear(struct dm_stat *s, size_t idx_start, size_t idx_end,
bool init_tmp_percpu_totals)
{
size_t x;
struct dm_stat_shared *shared;
struct dm_stat_percpu *p;
for (x = idx_start; x < idx_end; x++) {
shared = &s->stat_shared[x];
if (init_tmp_percpu_totals)
__dm_stat_init_temporary_percpu_totals(shared, s, x);
local_irq_disable();
p = &s->stat_percpu[smp_processor_id()][x];
p->sectors[READ] -= shared->tmp.sectors[READ];
p->sectors[WRITE] -= shared->tmp.sectors[WRITE];
p->ios[READ] -= shared->tmp.ios[READ];
p->ios[WRITE] -= shared->tmp.ios[WRITE];
p->merges[READ] -= shared->tmp.merges[READ];
p->merges[WRITE] -= shared->tmp.merges[WRITE];
p->ticks[READ] -= shared->tmp.ticks[READ];
p->ticks[WRITE] -= shared->tmp.ticks[WRITE];
p->io_ticks[READ] -= shared->tmp.io_ticks[READ];
p->io_ticks[WRITE] -= shared->tmp.io_ticks[WRITE];
p->io_ticks_total -= shared->tmp.io_ticks_total;
p->time_in_queue -= shared->tmp.time_in_queue;
local_irq_enable();
}
}
static int dm_stats_clear(struct dm_stats *stats, int id)
{
struct dm_stat *s;
mutex_lock(&stats->mutex);
s = __dm_stats_find(stats, id);
if (!s) {
mutex_unlock(&stats->mutex);
return -ENOENT;
}
__dm_stat_clear(s, 0, s->n_entries, true);
mutex_unlock(&stats->mutex);
return 1;
}
/*
* This is like jiffies_to_msec, but works for 64-bit values.
*/
static unsigned long long dm_jiffies_to_msec64(unsigned long long j)
{
unsigned long long result = 0;
unsigned mult;
if (j)
result = jiffies_to_msecs(j & 0x3fffff);
if (j >= 1 << 22) {
mult = jiffies_to_msecs(1 << 22);
result += (unsigned long long)mult * (unsigned long long)jiffies_to_msecs((j >> 22) & 0x3fffff);
}
if (j >= 1ULL << 44)
result += (unsigned long long)mult * (unsigned long long)mult * (unsigned long long)jiffies_to_msecs(j >> 44);
return result;
}
static int dm_stats_print(struct dm_stats *stats, int id,
size_t idx_start, size_t idx_len,
bool clear, char *result, unsigned maxlen)
{
unsigned sz = 0;
struct dm_stat *s;
size_t x;
sector_t start, end, step;
size_t idx_end;
struct dm_stat_shared *shared;
/*
* Output format:
* <start_sector>+<length> counters
*/
mutex_lock(&stats->mutex);
s = __dm_stats_find(stats, id);
if (!s) {
mutex_unlock(&stats->mutex);
return -ENOENT;
}
idx_end = idx_start + idx_len;
if (idx_end < idx_start ||
idx_end > s->n_entries)
idx_end = s->n_entries;
if (idx_start > idx_end)
idx_start = idx_end;
step = s->step;
start = s->start + (step * idx_start);
for (x = idx_start; x < idx_end; x++, start = end) {
shared = &s->stat_shared[x];
end = start + step;
if (unlikely(end > s->end))
end = s->end;
__dm_stat_init_temporary_percpu_totals(shared, s, x);
DMEMIT("%llu+%llu %llu %llu %llu %llu %llu %llu %llu %llu %d %llu %llu %llu %llu\n",
(unsigned long long)start,
(unsigned long long)step,
shared->tmp.ios[READ],
shared->tmp.merges[READ],
shared->tmp.sectors[READ],
dm_jiffies_to_msec64(shared->tmp.ticks[READ]),
shared->tmp.ios[WRITE],
shared->tmp.merges[WRITE],
shared->tmp.sectors[WRITE],
dm_jiffies_to_msec64(shared->tmp.ticks[WRITE]),
dm_stat_in_flight(shared),
dm_jiffies_to_msec64(shared->tmp.io_ticks_total),
dm_jiffies_to_msec64(shared->tmp.time_in_queue),
dm_jiffies_to_msec64(shared->tmp.io_ticks[READ]),
dm_jiffies_to_msec64(shared->tmp.io_ticks[WRITE]));
if (unlikely(sz + 1 >= maxlen))
goto buffer_overflow;
}
if (clear)
__dm_stat_clear(s, idx_start, idx_end, false);
buffer_overflow:
mutex_unlock(&stats->mutex);
return 1;
}
static int dm_stats_set_aux(struct dm_stats *stats, int id, const char *aux_data)
{
struct dm_stat *s;
const char *new_aux_data;
mutex_lock(&stats->mutex);
s = __dm_stats_find(stats, id);
if (!s) {
mutex_unlock(&stats->mutex);
return -ENOENT;
}
new_aux_data = kstrdup(aux_data, GFP_KERNEL);
if (!new_aux_data) {
mutex_unlock(&stats->mutex);
return -ENOMEM;
}
kfree(s->aux_data);
s->aux_data = new_aux_data;
mutex_unlock(&stats->mutex);
return 0;
}
static int message_stats_create(struct mapped_device *md,
unsigned argc, char **argv,
char *result, unsigned maxlen)
{
int id;
char dummy;
unsigned long long start, end, len, step;
unsigned divisor;
const char *program_id, *aux_data;
/*
* Input format:
* <range> <step> [<program_id> [<aux_data>]]
*/
if (argc < 3 || argc > 5)
return -EINVAL;
if (!strcmp(argv[1], "-")) {
start = 0;
len = dm_get_size(md);
if (!len)
len = 1;
} else if (sscanf(argv[1], "%llu+%llu%c", &start, &len, &dummy) != 2 ||
start != (sector_t)start || len != (sector_t)len)
return -EINVAL;
end = start + len;
if (start >= end)
return -EINVAL;
if (sscanf(argv[2], "/%u%c", &divisor, &dummy) == 1) {
step = end - start;
if (do_div(step, divisor))
step++;
if (!step)
step = 1;
} else if (sscanf(argv[2], "%llu%c", &step, &dummy) != 1 ||
step != (sector_t)step || !step)
return -EINVAL;
program_id = "-";
aux_data = "-";
if (argc > 3)
program_id = argv[3];
if (argc > 4)
aux_data = argv[4];
/*
* If a buffer overflow happens after we created the region,
* it's too late (the userspace would retry with a larger
* buffer, but the region id that caused the overflow is already
* leaked). So we must detect buffer overflow in advance.
*/
snprintf(result, maxlen, "%d", INT_MAX);
if (dm_message_test_buffer_overflow(result, maxlen))
return 1;
id = dm_stats_create(dm_get_stats(md), start, end, step, program_id, aux_data,
dm_internal_suspend, dm_internal_resume, md);
if (id < 0)
return id;
snprintf(result, maxlen, "%d", id);
return 1;
}
static int message_stats_delete(struct mapped_device *md,
unsigned argc, char **argv)
{
int id;
char dummy;
if (argc != 2)
return -EINVAL;
if (sscanf(argv[1], "%d%c", &id, &dummy) != 1 || id < 0)
return -EINVAL;
return dm_stats_delete(dm_get_stats(md), id);
}
static int message_stats_clear(struct mapped_device *md,
unsigned argc, char **argv)
{
int id;
char dummy;
if (argc != 2)
return -EINVAL;
if (sscanf(argv[1], "%d%c", &id, &dummy) != 1 || id < 0)
return -EINVAL;
return dm_stats_clear(dm_get_stats(md), id);
}
static int message_stats_list(struct mapped_device *md,
unsigned argc, char **argv,
char *result, unsigned maxlen)
{
int r;
const char *program = NULL;
if (argc < 1 || argc > 2)
return -EINVAL;
if (argc > 1) {
program = kstrdup(argv[1], GFP_KERNEL);
if (!program)
return -ENOMEM;
}
r = dm_stats_list(dm_get_stats(md), program, result, maxlen);
kfree(program);
return r;
}
static int message_stats_print(struct mapped_device *md,
unsigned argc, char **argv, bool clear,
char *result, unsigned maxlen)
{
int id;
char dummy;
unsigned long idx_start = 0, idx_len = ULONG_MAX;
if (argc != 2 && argc != 4)
return -EINVAL;
if (sscanf(argv[1], "%d%c", &id, &dummy) != 1 || id < 0)
return -EINVAL;
if (argc > 3) {
if (strcmp(argv[2], "-") &&
sscanf(argv[2], "%lu%c", &idx_start, &dummy) != 1)
return -EINVAL;
if (strcmp(argv[3], "-") &&
sscanf(argv[3], "%lu%c", &idx_len, &dummy) != 1)
return -EINVAL;
}
return dm_stats_print(dm_get_stats(md), id, idx_start, idx_len, clear,
result, maxlen);
}
static int message_stats_set_aux(struct mapped_device *md,
unsigned argc, char **argv)
{
int id;
char dummy;
if (argc != 3)
return -EINVAL;
if (sscanf(argv[1], "%d%c", &id, &dummy) != 1 || id < 0)
return -EINVAL;
return dm_stats_set_aux(dm_get_stats(md), id, argv[2]);
}
int dm_stats_message(struct mapped_device *md, unsigned argc, char **argv,
char *result, unsigned maxlen)
{
int r;
if (dm_request_based(md)) {
DMWARN("Statistics are only supported for bio-based devices");
return -EOPNOTSUPP;
}
/* All messages here must start with '@' */
if (!strcasecmp(argv[0], "@stats_create"))
r = message_stats_create(md, argc, argv, result, maxlen);
else if (!strcasecmp(argv[0], "@stats_delete"))
r = message_stats_delete(md, argc, argv);
else if (!strcasecmp(argv[0], "@stats_clear"))
r = message_stats_clear(md, argc, argv);
else if (!strcasecmp(argv[0], "@stats_list"))
r = message_stats_list(md, argc, argv, result, maxlen);
else if (!strcasecmp(argv[0], "@stats_print"))
r = message_stats_print(md, argc, argv, false, result, maxlen);
else if (!strcasecmp(argv[0], "@stats_print_clear"))
r = message_stats_print(md, argc, argv, true, result, maxlen);
else if (!strcasecmp(argv[0], "@stats_set_aux"))
r = message_stats_set_aux(md, argc, argv);
else
return 2; /* this wasn't a stats message */
if (r == -EINVAL)
DMWARN("Invalid parameters for message %s", argv[0]);
return r;
}
int __init dm_statistics_init(void)
{
shared_memory_amount = 0;
dm_stat_need_rcu_barrier = 0;
return 0;
}
void dm_statistics_exit(void)
{
if (dm_stat_need_rcu_barrier)
rcu_barrier();
if (WARN_ON(shared_memory_amount))
DMCRIT("shared_memory_amount leaked: %lu", shared_memory_amount);
}
module_param_named(stats_current_allocated_bytes, shared_memory_amount, ulong, S_IRUGO);
MODULE_PARM_DESC(stats_current_allocated_bytes, "Memory currently used by statistics");

40
drivers/md/dm-stats.h Normal file
View file

@ -0,0 +1,40 @@
#ifndef DM_STATS_H
#define DM_STATS_H
#include <linux/types.h>
#include <linux/mutex.h>
#include <linux/list.h>
int dm_statistics_init(void);
void dm_statistics_exit(void);
struct dm_stats {
struct mutex mutex;
struct list_head list; /* list of struct dm_stat */
struct dm_stats_last_position __percpu *last;
sector_t last_sector;
unsigned last_rw;
};
struct dm_stats_aux {
bool merged;
};
void dm_stats_init(struct dm_stats *st);
void dm_stats_cleanup(struct dm_stats *st);
struct mapped_device;
int dm_stats_message(struct mapped_device *md, unsigned argc, char **argv,
char *result, unsigned maxlen);
void dm_stats_account_io(struct dm_stats *stats, unsigned long bi_rw,
sector_t bi_sector, unsigned bi_sectors, bool end,
unsigned long duration, struct dm_stats_aux *aux);
static inline bool dm_stats_used(struct dm_stats *st)
{
return !list_empty(&st->list);
}
#endif

465
drivers/md/dm-stripe.c Normal file
View file

@ -0,0 +1,465 @@
/*
* Copyright (C) 2001-2003 Sistina Software (UK) Limited.
*
* This file is released under the GPL.
*/
#include "dm.h"
#include <linux/device-mapper.h>
#include <linux/module.h>
#include <linux/init.h>
#include <linux/blkdev.h>
#include <linux/bio.h>
#include <linux/slab.h>
#include <linux/log2.h>
#define DM_MSG_PREFIX "striped"
#define DM_IO_ERROR_THRESHOLD 15
struct stripe {
struct dm_dev *dev;
sector_t physical_start;
atomic_t error_count;
};
struct stripe_c {
uint32_t stripes;
int stripes_shift;
/* The size of this target / num. stripes */
sector_t stripe_width;
uint32_t chunk_size;
int chunk_size_shift;
/* Needed for handling events */
struct dm_target *ti;
/* Work struct used for triggering events*/
struct work_struct trigger_event;
struct stripe stripe[0];
};
/*
* An event is triggered whenever a drive
* drops out of a stripe volume.
*/
static void trigger_event(struct work_struct *work)
{
struct stripe_c *sc = container_of(work, struct stripe_c,
trigger_event);
dm_table_event(sc->ti->table);
}
static inline struct stripe_c *alloc_context(unsigned int stripes)
{
size_t len;
if (dm_array_too_big(sizeof(struct stripe_c), sizeof(struct stripe),
stripes))
return NULL;
len = sizeof(struct stripe_c) + (sizeof(struct stripe) * stripes);
return kmalloc(len, GFP_KERNEL);
}
/*
* Parse a single <dev> <sector> pair
*/
static int get_stripe(struct dm_target *ti, struct stripe_c *sc,
unsigned int stripe, char **argv)
{
unsigned long long start;
char dummy;
if (sscanf(argv[1], "%llu%c", &start, &dummy) != 1)
return -EINVAL;
if (dm_get_device(ti, argv[0], dm_table_get_mode(ti->table),
&sc->stripe[stripe].dev))
return -ENXIO;
sc->stripe[stripe].physical_start = start;
return 0;
}
/*
* Construct a striped mapping.
* <number of stripes> <chunk size> [<dev_path> <offset>]+
*/
static int stripe_ctr(struct dm_target *ti, unsigned int argc, char **argv)
{
struct stripe_c *sc;
sector_t width, tmp_len;
uint32_t stripes;
uint32_t chunk_size;
int r;
unsigned int i;
if (argc < 2) {
ti->error = "Not enough arguments";
return -EINVAL;
}
if (kstrtouint(argv[0], 10, &stripes) || !stripes) {
ti->error = "Invalid stripe count";
return -EINVAL;
}
if (kstrtouint(argv[1], 10, &chunk_size) || !chunk_size) {
ti->error = "Invalid chunk_size";
return -EINVAL;
}
width = ti->len;
if (sector_div(width, stripes)) {
ti->error = "Target length not divisible by "
"number of stripes";
return -EINVAL;
}
tmp_len = width;
if (sector_div(tmp_len, chunk_size)) {
ti->error = "Target length not divisible by "
"chunk size";
return -EINVAL;
}
/*
* Do we have enough arguments for that many stripes ?
*/
if (argc != (2 + 2 * stripes)) {
ti->error = "Not enough destinations "
"specified";
return -EINVAL;
}
sc = alloc_context(stripes);
if (!sc) {
ti->error = "Memory allocation for striped context "
"failed";
return -ENOMEM;
}
INIT_WORK(&sc->trigger_event, trigger_event);
/* Set pointer to dm target; used in trigger_event */
sc->ti = ti;
sc->stripes = stripes;
sc->stripe_width = width;
if (stripes & (stripes - 1))
sc->stripes_shift = -1;
else
sc->stripes_shift = __ffs(stripes);
r = dm_set_target_max_io_len(ti, chunk_size);
if (r) {
kfree(sc);
return r;
}
ti->num_flush_bios = stripes;
ti->num_discard_bios = stripes;
ti->num_write_same_bios = stripes;
sc->chunk_size = chunk_size;
if (chunk_size & (chunk_size - 1))
sc->chunk_size_shift = -1;
else
sc->chunk_size_shift = __ffs(chunk_size);
/*
* Get the stripe destinations.
*/
for (i = 0; i < stripes; i++) {
argv += 2;
r = get_stripe(ti, sc, i, argv);
if (r < 0) {
ti->error = "Couldn't parse stripe destination";
while (i--)
dm_put_device(ti, sc->stripe[i].dev);
kfree(sc);
return r;
}
atomic_set(&(sc->stripe[i].error_count), 0);
}
ti->private = sc;
return 0;
}
static void stripe_dtr(struct dm_target *ti)
{
unsigned int i;
struct stripe_c *sc = (struct stripe_c *) ti->private;
for (i = 0; i < sc->stripes; i++)
dm_put_device(ti, sc->stripe[i].dev);
flush_work(&sc->trigger_event);
kfree(sc);
}
static void stripe_map_sector(struct stripe_c *sc, sector_t sector,
uint32_t *stripe, sector_t *result)
{
sector_t chunk = dm_target_offset(sc->ti, sector);
sector_t chunk_offset;
if (sc->chunk_size_shift < 0)
chunk_offset = sector_div(chunk, sc->chunk_size);
else {
chunk_offset = chunk & (sc->chunk_size - 1);
chunk >>= sc->chunk_size_shift;
}
if (sc->stripes_shift < 0)
*stripe = sector_div(chunk, sc->stripes);
else {
*stripe = chunk & (sc->stripes - 1);
chunk >>= sc->stripes_shift;
}
if (sc->chunk_size_shift < 0)
chunk *= sc->chunk_size;
else
chunk <<= sc->chunk_size_shift;
*result = chunk + chunk_offset;
}
static void stripe_map_range_sector(struct stripe_c *sc, sector_t sector,
uint32_t target_stripe, sector_t *result)
{
uint32_t stripe;
stripe_map_sector(sc, sector, &stripe, result);
if (stripe == target_stripe)
return;
/* round down */
sector = *result;
if (sc->chunk_size_shift < 0)
*result -= sector_div(sector, sc->chunk_size);
else
*result = sector & ~(sector_t)(sc->chunk_size - 1);
if (target_stripe < stripe)
*result += sc->chunk_size; /* next chunk */
}
static int stripe_map_range(struct stripe_c *sc, struct bio *bio,
uint32_t target_stripe)
{
sector_t begin, end;
stripe_map_range_sector(sc, bio->bi_iter.bi_sector,
target_stripe, &begin);
stripe_map_range_sector(sc, bio_end_sector(bio),
target_stripe, &end);
if (begin < end) {
bio->bi_bdev = sc->stripe[target_stripe].dev->bdev;
bio->bi_iter.bi_sector = begin +
sc->stripe[target_stripe].physical_start;
bio->bi_iter.bi_size = to_bytes(end - begin);
return DM_MAPIO_REMAPPED;
} else {
/* The range doesn't map to the target stripe */
bio_endio(bio, 0);
return DM_MAPIO_SUBMITTED;
}
}
static int stripe_map(struct dm_target *ti, struct bio *bio)
{
struct stripe_c *sc = ti->private;
uint32_t stripe;
unsigned target_bio_nr;
if (bio->bi_rw & REQ_FLUSH) {
target_bio_nr = dm_bio_get_target_bio_nr(bio);
BUG_ON(target_bio_nr >= sc->stripes);
bio->bi_bdev = sc->stripe[target_bio_nr].dev->bdev;
return DM_MAPIO_REMAPPED;
}
if (unlikely(bio->bi_rw & REQ_DISCARD) ||
unlikely(bio->bi_rw & REQ_WRITE_SAME)) {
target_bio_nr = dm_bio_get_target_bio_nr(bio);
BUG_ON(target_bio_nr >= sc->stripes);
return stripe_map_range(sc, bio, target_bio_nr);
}
stripe_map_sector(sc, bio->bi_iter.bi_sector,
&stripe, &bio->bi_iter.bi_sector);
bio->bi_iter.bi_sector += sc->stripe[stripe].physical_start;
bio->bi_bdev = sc->stripe[stripe].dev->bdev;
return DM_MAPIO_REMAPPED;
}
/*
* Stripe status:
*
* INFO
* #stripes [stripe_name <stripe_name>] [group word count]
* [error count 'A|D' <error count 'A|D'>]
*
* TABLE
* #stripes [stripe chunk size]
* [stripe_name physical_start <stripe_name physical_start>]
*
*/
static void stripe_status(struct dm_target *ti, status_type_t type,
unsigned status_flags, char *result, unsigned maxlen)
{
struct stripe_c *sc = (struct stripe_c *) ti->private;
char buffer[sc->stripes + 1];
unsigned int sz = 0;
unsigned int i;
switch (type) {
case STATUSTYPE_INFO:
DMEMIT("%d ", sc->stripes);
for (i = 0; i < sc->stripes; i++) {
DMEMIT("%s ", sc->stripe[i].dev->name);
buffer[i] = atomic_read(&(sc->stripe[i].error_count)) ?
'D' : 'A';
}
buffer[i] = '\0';
DMEMIT("1 %s", buffer);
break;
case STATUSTYPE_TABLE:
DMEMIT("%d %llu", sc->stripes,
(unsigned long long)sc->chunk_size);
for (i = 0; i < sc->stripes; i++)
DMEMIT(" %s %llu", sc->stripe[i].dev->name,
(unsigned long long)sc->stripe[i].physical_start);
break;
}
}
static int stripe_end_io(struct dm_target *ti, struct bio *bio, int error)
{
unsigned i;
char major_minor[16];
struct stripe_c *sc = ti->private;
if (!error)
return 0; /* I/O complete */
if ((error == -EWOULDBLOCK) && (bio->bi_rw & REQ_RAHEAD))
return error;
if (error == -EOPNOTSUPP)
return error;
memset(major_minor, 0, sizeof(major_minor));
sprintf(major_minor, "%d:%d",
MAJOR(disk_devt(bio->bi_bdev->bd_disk)),
MINOR(disk_devt(bio->bi_bdev->bd_disk)));
/*
* Test to see which stripe drive triggered the event
* and increment error count for all stripes on that device.
* If the error count for a given device exceeds the threshold
* value we will no longer trigger any further events.
*/
for (i = 0; i < sc->stripes; i++)
if (!strcmp(sc->stripe[i].dev->name, major_minor)) {
atomic_inc(&(sc->stripe[i].error_count));
if (atomic_read(&(sc->stripe[i].error_count)) <
DM_IO_ERROR_THRESHOLD)
schedule_work(&sc->trigger_event);
}
return error;
}
static int stripe_iterate_devices(struct dm_target *ti,
iterate_devices_callout_fn fn, void *data)
{
struct stripe_c *sc = ti->private;
int ret = 0;
unsigned i = 0;
do {
ret = fn(ti, sc->stripe[i].dev,
sc->stripe[i].physical_start,
sc->stripe_width, data);
} while (!ret && ++i < sc->stripes);
return ret;
}
static void stripe_io_hints(struct dm_target *ti,
struct queue_limits *limits)
{
struct stripe_c *sc = ti->private;
unsigned chunk_size = sc->chunk_size << SECTOR_SHIFT;
blk_limits_io_min(limits, chunk_size);
blk_limits_io_opt(limits, chunk_size * sc->stripes);
}
static int stripe_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
struct bio_vec *biovec, int max_size)
{
struct stripe_c *sc = ti->private;
sector_t bvm_sector = bvm->bi_sector;
uint32_t stripe;
struct request_queue *q;
stripe_map_sector(sc, bvm_sector, &stripe, &bvm_sector);
q = bdev_get_queue(sc->stripe[stripe].dev->bdev);
if (!q->merge_bvec_fn)
return max_size;
bvm->bi_bdev = sc->stripe[stripe].dev->bdev;
bvm->bi_sector = sc->stripe[stripe].physical_start + bvm_sector;
return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
}
static struct target_type stripe_target = {
.name = "striped",
.version = {1, 5, 1},
.module = THIS_MODULE,
.ctr = stripe_ctr,
.dtr = stripe_dtr,
.map = stripe_map,
.end_io = stripe_end_io,
.status = stripe_status,
.iterate_devices = stripe_iterate_devices,
.io_hints = stripe_io_hints,
.merge = stripe_merge,
};
int __init dm_stripe_init(void)
{
int r;
r = dm_register_target(&stripe_target);
if (r < 0) {
DMWARN("target registration failed");
return r;
}
return r;
}
void dm_stripe_exit(void)
{
dm_unregister_target(&stripe_target);
}

591
drivers/md/dm-switch.c Normal file
View file

@ -0,0 +1,591 @@
/*
* Copyright (C) 2010-2012 by Dell Inc. All rights reserved.
* Copyright (C) 2011-2013 Red Hat, Inc.
*
* This file is released under the GPL.
*
* dm-switch is a device-mapper target that maps IO to underlying block
* devices efficiently when there are a large number of fixed-sized
* address regions but there is no simple pattern to allow for a compact
* mapping representation such as dm-stripe.
*/
#include <linux/device-mapper.h>
#include <linux/module.h>
#include <linux/init.h>
#include <linux/vmalloc.h>
#define DM_MSG_PREFIX "switch"
/*
* One region_table_slot_t holds <region_entries_per_slot> region table
* entries each of which is <region_table_entry_bits> in size.
*/
typedef unsigned long region_table_slot_t;
/*
* A device with the offset to its start sector.
*/
struct switch_path {
struct dm_dev *dmdev;
sector_t start;
};
/*
* Context block for a dm switch device.
*/
struct switch_ctx {
struct dm_target *ti;
unsigned nr_paths; /* Number of paths in path_list. */
unsigned region_size; /* Region size in 512-byte sectors */
unsigned long nr_regions; /* Number of regions making up the device */
signed char region_size_bits; /* log2 of region_size or -1 */
unsigned char region_table_entry_bits; /* Number of bits in one region table entry */
unsigned char region_entries_per_slot; /* Number of entries in one region table slot */
signed char region_entries_per_slot_bits; /* log2 of region_entries_per_slot or -1 */
region_table_slot_t *region_table; /* Region table */
/*
* Array of dm devices to switch between.
*/
struct switch_path path_list[0];
};
static struct switch_ctx *alloc_switch_ctx(struct dm_target *ti, unsigned nr_paths,
unsigned region_size)
{
struct switch_ctx *sctx;
sctx = kzalloc(sizeof(struct switch_ctx) + nr_paths * sizeof(struct switch_path),
GFP_KERNEL);
if (!sctx)
return NULL;
sctx->ti = ti;
sctx->region_size = region_size;
ti->private = sctx;
return sctx;
}
static int alloc_region_table(struct dm_target *ti, unsigned nr_paths)
{
struct switch_ctx *sctx = ti->private;
sector_t nr_regions = ti->len;
sector_t nr_slots;
if (!(sctx->region_size & (sctx->region_size - 1)))
sctx->region_size_bits = __ffs(sctx->region_size);
else
sctx->region_size_bits = -1;
sctx->region_table_entry_bits = 1;
while (sctx->region_table_entry_bits < sizeof(region_table_slot_t) * 8 &&
(region_table_slot_t)1 << sctx->region_table_entry_bits < nr_paths)
sctx->region_table_entry_bits++;
sctx->region_entries_per_slot = (sizeof(region_table_slot_t) * 8) / sctx->region_table_entry_bits;
if (!(sctx->region_entries_per_slot & (sctx->region_entries_per_slot - 1)))
sctx->region_entries_per_slot_bits = __ffs(sctx->region_entries_per_slot);
else
sctx->region_entries_per_slot_bits = -1;
if (sector_div(nr_regions, sctx->region_size))
nr_regions++;
sctx->nr_regions = nr_regions;
if (sctx->nr_regions != nr_regions || sctx->nr_regions >= ULONG_MAX) {
ti->error = "Region table too large";
return -EINVAL;
}
nr_slots = nr_regions;
if (sector_div(nr_slots, sctx->region_entries_per_slot))
nr_slots++;
if (nr_slots > ULONG_MAX / sizeof(region_table_slot_t)) {
ti->error = "Region table too large";
return -EINVAL;
}
sctx->region_table = vmalloc(nr_slots * sizeof(region_table_slot_t));
if (!sctx->region_table) {
ti->error = "Cannot allocate region table";
return -ENOMEM;
}
return 0;
}
static void switch_get_position(struct switch_ctx *sctx, unsigned long region_nr,
unsigned long *region_index, unsigned *bit)
{
if (sctx->region_entries_per_slot_bits >= 0) {
*region_index = region_nr >> sctx->region_entries_per_slot_bits;
*bit = region_nr & (sctx->region_entries_per_slot - 1);
} else {
*region_index = region_nr / sctx->region_entries_per_slot;
*bit = region_nr % sctx->region_entries_per_slot;
}
*bit *= sctx->region_table_entry_bits;
}
static unsigned switch_region_table_read(struct switch_ctx *sctx, unsigned long region_nr)
{
unsigned long region_index;
unsigned bit;
switch_get_position(sctx, region_nr, &region_index, &bit);
return (ACCESS_ONCE(sctx->region_table[region_index]) >> bit) &
((1 << sctx->region_table_entry_bits) - 1);
}
/*
* Find which path to use at given offset.
*/
static unsigned switch_get_path_nr(struct switch_ctx *sctx, sector_t offset)
{
unsigned path_nr;
sector_t p;
p = offset;
if (sctx->region_size_bits >= 0)
p >>= sctx->region_size_bits;
else
sector_div(p, sctx->region_size);
path_nr = switch_region_table_read(sctx, p);
/* This can only happen if the processor uses non-atomic stores. */
if (unlikely(path_nr >= sctx->nr_paths))
path_nr = 0;
return path_nr;
}
static void switch_region_table_write(struct switch_ctx *sctx, unsigned long region_nr,
unsigned value)
{
unsigned long region_index;
unsigned bit;
region_table_slot_t pte;
switch_get_position(sctx, region_nr, &region_index, &bit);
pte = sctx->region_table[region_index];
pte &= ~((((region_table_slot_t)1 << sctx->region_table_entry_bits) - 1) << bit);
pte |= (region_table_slot_t)value << bit;
sctx->region_table[region_index] = pte;
}
/*
* Fill the region table with an initial round robin pattern.
*/
static void initialise_region_table(struct switch_ctx *sctx)
{
unsigned path_nr = 0;
unsigned long region_nr;
for (region_nr = 0; region_nr < sctx->nr_regions; region_nr++) {
switch_region_table_write(sctx, region_nr, path_nr);
if (++path_nr >= sctx->nr_paths)
path_nr = 0;
}
}
static int parse_path(struct dm_arg_set *as, struct dm_target *ti)
{
struct switch_ctx *sctx = ti->private;
unsigned long long start;
int r;
r = dm_get_device(ti, dm_shift_arg(as), dm_table_get_mode(ti->table),
&sctx->path_list[sctx->nr_paths].dmdev);
if (r) {
ti->error = "Device lookup failed";
return r;
}
if (kstrtoull(dm_shift_arg(as), 10, &start) || start != (sector_t)start) {
ti->error = "Invalid device starting offset";
dm_put_device(ti, sctx->path_list[sctx->nr_paths].dmdev);
return -EINVAL;
}
sctx->path_list[sctx->nr_paths].start = start;
sctx->nr_paths++;
return 0;
}
/*
* Destructor: Don't free the dm_target, just the ti->private data (if any).
*/
static void switch_dtr(struct dm_target *ti)
{
struct switch_ctx *sctx = ti->private;
while (sctx->nr_paths--)
dm_put_device(ti, sctx->path_list[sctx->nr_paths].dmdev);
vfree(sctx->region_table);
kfree(sctx);
}
/*
* Constructor arguments:
* <num_paths> <region_size> <num_optional_args> [<optional_args>...]
* [<dev_path> <offset>]+
*
* Optional args are to allow for future extension: currently this
* parameter must be 0.
*/
static int switch_ctr(struct dm_target *ti, unsigned argc, char **argv)
{
static struct dm_arg _args[] = {
{1, (KMALLOC_MAX_SIZE - sizeof(struct switch_ctx)) / sizeof(struct switch_path), "Invalid number of paths"},
{1, UINT_MAX, "Invalid region size"},
{0, 0, "Invalid number of optional args"},
};
struct switch_ctx *sctx;
struct dm_arg_set as;
unsigned nr_paths, region_size, nr_optional_args;
int r;
as.argc = argc;
as.argv = argv;
r = dm_read_arg(_args, &as, &nr_paths, &ti->error);
if (r)
return -EINVAL;
r = dm_read_arg(_args + 1, &as, &region_size, &ti->error);
if (r)
return r;
r = dm_read_arg_group(_args + 2, &as, &nr_optional_args, &ti->error);
if (r)
return r;
/* parse optional arguments here, if we add any */
if (as.argc != nr_paths * 2) {
ti->error = "Incorrect number of path arguments";
return -EINVAL;
}
sctx = alloc_switch_ctx(ti, nr_paths, region_size);
if (!sctx) {
ti->error = "Cannot allocate redirection context";
return -ENOMEM;
}
r = dm_set_target_max_io_len(ti, region_size);
if (r)
goto error;
while (as.argc) {
r = parse_path(&as, ti);
if (r)
goto error;
}
r = alloc_region_table(ti, nr_paths);
if (r)
goto error;
initialise_region_table(sctx);
/* For UNMAP, sending the request down any path is sufficient */
ti->num_discard_bios = 1;
return 0;
error:
switch_dtr(ti);
return r;
}
static int switch_map(struct dm_target *ti, struct bio *bio)
{
struct switch_ctx *sctx = ti->private;
sector_t offset = dm_target_offset(ti, bio->bi_iter.bi_sector);
unsigned path_nr = switch_get_path_nr(sctx, offset);
bio->bi_bdev = sctx->path_list[path_nr].dmdev->bdev;
bio->bi_iter.bi_sector = sctx->path_list[path_nr].start + offset;
return DM_MAPIO_REMAPPED;
}
/*
* We need to parse hex numbers in the message as quickly as possible.
*
* This table-based hex parser improves performance.
* It improves a time to load 1000000 entries compared to the condition-based
* parser.
* table-based parser condition-based parser
* PA-RISC 0.29s 0.31s
* Opteron 0.0495s 0.0498s
*/
static const unsigned char hex_table[256] = {
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 255, 255, 255, 255, 255, 255,
255, 10, 11, 12, 13, 14, 15, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 10, 11, 12, 13, 14, 15, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255
};
static __always_inline unsigned long parse_hex(const char **string)
{
unsigned char d;
unsigned long r = 0;
while ((d = hex_table[(unsigned char)**string]) < 16) {
r = (r << 4) | d;
(*string)++;
}
return r;
}
static int process_set_region_mappings(struct switch_ctx *sctx,
unsigned argc, char **argv)
{
unsigned i;
unsigned long region_index = 0;
for (i = 1; i < argc; i++) {
unsigned long path_nr;
const char *string = argv[i];
if ((*string & 0xdf) == 'R') {
unsigned long cycle_length, num_write;
string++;
if (unlikely(*string == ',')) {
DMWARN("invalid set_region_mappings argument: '%s'", argv[i]);
return -EINVAL;
}
cycle_length = parse_hex(&string);
if (unlikely(*string != ',')) {
DMWARN("invalid set_region_mappings argument: '%s'", argv[i]);
return -EINVAL;
}
string++;
if (unlikely(!*string)) {
DMWARN("invalid set_region_mappings argument: '%s'", argv[i]);
return -EINVAL;
}
num_write = parse_hex(&string);
if (unlikely(*string)) {
DMWARN("invalid set_region_mappings argument: '%s'", argv[i]);
return -EINVAL;
}
if (unlikely(!cycle_length) || unlikely(cycle_length - 1 > region_index)) {
DMWARN("invalid set_region_mappings cycle length: %lu > %lu",
cycle_length - 1, region_index);
return -EINVAL;
}
if (unlikely(region_index + num_write < region_index) ||
unlikely(region_index + num_write >= sctx->nr_regions)) {
DMWARN("invalid set_region_mappings region number: %lu + %lu >= %lu",
region_index, num_write, sctx->nr_regions);
return -EINVAL;
}
while (num_write--) {
region_index++;
path_nr = switch_region_table_read(sctx, region_index - cycle_length);
switch_region_table_write(sctx, region_index, path_nr);
}
continue;
}
if (*string == ':')
region_index++;
else {
region_index = parse_hex(&string);
if (unlikely(*string != ':')) {
DMWARN("invalid set_region_mappings argument: '%s'", argv[i]);
return -EINVAL;
}
}
string++;
if (unlikely(!*string)) {
DMWARN("invalid set_region_mappings argument: '%s'", argv[i]);
return -EINVAL;
}
path_nr = parse_hex(&string);
if (unlikely(*string)) {
DMWARN("invalid set_region_mappings argument: '%s'", argv[i]);
return -EINVAL;
}
if (unlikely(region_index >= sctx->nr_regions)) {
DMWARN("invalid set_region_mappings region number: %lu >= %lu", region_index, sctx->nr_regions);
return -EINVAL;
}
if (unlikely(path_nr >= sctx->nr_paths)) {
DMWARN("invalid set_region_mappings device: %lu >= %u", path_nr, sctx->nr_paths);
return -EINVAL;
}
switch_region_table_write(sctx, region_index, path_nr);
}
return 0;
}
/*
* Messages are processed one-at-a-time.
*
* Only set_region_mappings is supported.
*/
static int switch_message(struct dm_target *ti, unsigned argc, char **argv)
{
static DEFINE_MUTEX(message_mutex);
struct switch_ctx *sctx = ti->private;
int r = -EINVAL;
mutex_lock(&message_mutex);
if (!strcasecmp(argv[0], "set_region_mappings"))
r = process_set_region_mappings(sctx, argc, argv);
else
DMWARN("Unrecognised message received.");
mutex_unlock(&message_mutex);
return r;
}
static void switch_status(struct dm_target *ti, status_type_t type,
unsigned status_flags, char *result, unsigned maxlen)
{
struct switch_ctx *sctx = ti->private;
unsigned sz = 0;
int path_nr;
switch (type) {
case STATUSTYPE_INFO:
result[0] = '\0';
break;
case STATUSTYPE_TABLE:
DMEMIT("%u %u 0", sctx->nr_paths, sctx->region_size);
for (path_nr = 0; path_nr < sctx->nr_paths; path_nr++)
DMEMIT(" %s %llu", sctx->path_list[path_nr].dmdev->name,
(unsigned long long)sctx->path_list[path_nr].start);
break;
}
}
/*
* Switch ioctl:
*
* Passthrough all ioctls to the path for sector 0
*/
static int switch_ioctl(struct dm_target *ti, unsigned cmd,
unsigned long arg)
{
struct switch_ctx *sctx = ti->private;
struct block_device *bdev;
fmode_t mode;
unsigned path_nr;
int r = 0;
path_nr = switch_get_path_nr(sctx, 0);
bdev = sctx->path_list[path_nr].dmdev->bdev;
mode = sctx->path_list[path_nr].dmdev->mode;
/*
* Only pass ioctls through if the device sizes match exactly.
*/
if (ti->len + sctx->path_list[path_nr].start != i_size_read(bdev->bd_inode) >> SECTOR_SHIFT)
r = scsi_verify_blk_ioctl(NULL, cmd);
return r ? : __blkdev_driver_ioctl(bdev, mode, cmd, arg);
}
static int switch_iterate_devices(struct dm_target *ti,
iterate_devices_callout_fn fn, void *data)
{
struct switch_ctx *sctx = ti->private;
int path_nr;
int r;
for (path_nr = 0; path_nr < sctx->nr_paths; path_nr++) {
r = fn(ti, sctx->path_list[path_nr].dmdev,
sctx->path_list[path_nr].start, ti->len, data);
if (r)
return r;
}
return 0;
}
static struct target_type switch_target = {
.name = "switch",
.version = {1, 1, 0},
.module = THIS_MODULE,
.ctr = switch_ctr,
.dtr = switch_dtr,
.map = switch_map,
.message = switch_message,
.status = switch_status,
.ioctl = switch_ioctl,
.iterate_devices = switch_iterate_devices,
};
static int __init dm_switch_init(void)
{
int r;
r = dm_register_target(&switch_target);
if (r < 0)
DMERR("dm_register_target() failed %d", r);
return r;
}
static void __exit dm_switch_exit(void)
{
dm_unregister_target(&switch_target);
}
module_init(dm_switch_init);
module_exit(dm_switch_exit);
MODULE_DESCRIPTION(DM_NAME " dynamic path switching target");
MODULE_AUTHOR("Kevin D. O'Kelley <Kevin_OKelley@dell.com>");
MODULE_AUTHOR("Narendran Ganapathy <Narendran_Ganapathy@dell.com>");
MODULE_AUTHOR("Jim Ramsay <Jim_Ramsay@dell.com>");
MODULE_AUTHOR("Mikulas Patocka <mpatocka@redhat.com>");
MODULE_LICENSE("GPL");

111
drivers/md/dm-sysfs.c Normal file
View file

@ -0,0 +1,111 @@
/*
* Copyright (C) 2008 Red Hat, Inc. All rights reserved.
*
* This file is released under the GPL.
*/
#include <linux/sysfs.h>
#include <linux/dm-ioctl.h>
#include "dm.h"
struct dm_sysfs_attr {
struct attribute attr;
ssize_t (*show)(struct mapped_device *, char *);
ssize_t (*store)(struct mapped_device *, char *);
};
#define DM_ATTR_RO(_name) \
struct dm_sysfs_attr dm_attr_##_name = \
__ATTR(_name, S_IRUGO, dm_attr_##_name##_show, NULL)
static ssize_t dm_attr_show(struct kobject *kobj, struct attribute *attr,
char *page)
{
struct dm_sysfs_attr *dm_attr;
struct mapped_device *md;
ssize_t ret;
dm_attr = container_of(attr, struct dm_sysfs_attr, attr);
if (!dm_attr->show)
return -EIO;
md = dm_get_from_kobject(kobj);
if (!md)
return -EINVAL;
ret = dm_attr->show(md, page);
dm_put(md);
return ret;
}
static ssize_t dm_attr_name_show(struct mapped_device *md, char *buf)
{
if (dm_copy_name_and_uuid(md, buf, NULL))
return -EIO;
strcat(buf, "\n");
return strlen(buf);
}
static ssize_t dm_attr_uuid_show(struct mapped_device *md, char *buf)
{
if (dm_copy_name_and_uuid(md, NULL, buf))
return -EIO;
strcat(buf, "\n");
return strlen(buf);
}
static ssize_t dm_attr_suspended_show(struct mapped_device *md, char *buf)
{
sprintf(buf, "%d\n", dm_suspended_md(md));
return strlen(buf);
}
static DM_ATTR_RO(name);
static DM_ATTR_RO(uuid);
static DM_ATTR_RO(suspended);
static struct attribute *dm_attrs[] = {
&dm_attr_name.attr,
&dm_attr_uuid.attr,
&dm_attr_suspended.attr,
NULL,
};
static const struct sysfs_ops dm_sysfs_ops = {
.show = dm_attr_show,
};
/*
* dm kobject is embedded in mapped_device structure
* no need to define release function here
*/
static struct kobj_type dm_ktype = {
.sysfs_ops = &dm_sysfs_ops,
.default_attrs = dm_attrs,
.release = dm_kobject_release,
};
/*
* Initialize kobj
* because nobody using md yet, no need to call explicit dm_get/put
*/
int dm_sysfs_init(struct mapped_device *md)
{
return kobject_init_and_add(dm_kobject(md), &dm_ktype,
&disk_to_dev(dm_disk(md))->kobj,
"%s", "dm");
}
/*
* Remove kobj, called after all references removed
*/
void dm_sysfs_exit(struct mapped_device *md)
{
struct kobject *kobj = dm_kobject(md);
kobject_put(kobj);
wait_for_completion(dm_get_completion_from_kobject(kobj));
}

1654
drivers/md/dm-table.c Normal file

File diff suppressed because it is too large Load diff

160
drivers/md/dm-target.c Normal file
View file

@ -0,0 +1,160 @@
/*
* Copyright (C) 2001 Sistina Software (UK) Limited
*
* This file is released under the GPL.
*/
#include "dm.h"
#include <linux/module.h>
#include <linux/init.h>
#include <linux/kmod.h>
#include <linux/bio.h>
#define DM_MSG_PREFIX "target"
static LIST_HEAD(_targets);
static DECLARE_RWSEM(_lock);
#define DM_MOD_NAME_SIZE 32
static inline struct target_type *__find_target_type(const char *name)
{
struct target_type *tt;
list_for_each_entry(tt, &_targets, list)
if (!strcmp(name, tt->name))
return tt;
return NULL;
}
static struct target_type *get_target_type(const char *name)
{
struct target_type *tt;
down_read(&_lock);
tt = __find_target_type(name);
if (tt && !try_module_get(tt->module))
tt = NULL;
up_read(&_lock);
return tt;
}
static void load_module(const char *name)
{
request_module("dm-%s", name);
}
struct target_type *dm_get_target_type(const char *name)
{
struct target_type *tt = get_target_type(name);
if (!tt) {
load_module(name);
tt = get_target_type(name);
}
return tt;
}
void dm_put_target_type(struct target_type *tt)
{
down_read(&_lock);
module_put(tt->module);
up_read(&_lock);
}
int dm_target_iterate(void (*iter_func)(struct target_type *tt,
void *param), void *param)
{
struct target_type *tt;
down_read(&_lock);
list_for_each_entry(tt, &_targets, list)
iter_func(tt, param);
up_read(&_lock);
return 0;
}
int dm_register_target(struct target_type *tt)
{
int rv = 0;
down_write(&_lock);
if (__find_target_type(tt->name))
rv = -EEXIST;
else
list_add(&tt->list, &_targets);
up_write(&_lock);
return rv;
}
void dm_unregister_target(struct target_type *tt)
{
down_write(&_lock);
if (!__find_target_type(tt->name)) {
DMCRIT("Unregistering unrecognised target: %s", tt->name);
BUG();
}
list_del(&tt->list);
up_write(&_lock);
}
/*
* io-err: always fails an io, useful for bringing
* up LVs that have holes in them.
*/
static int io_err_ctr(struct dm_target *tt, unsigned int argc, char **args)
{
/*
* Return error for discards instead of -EOPNOTSUPP
*/
tt->num_discard_bios = 1;
return 0;
}
static void io_err_dtr(struct dm_target *tt)
{
/* empty */
}
static int io_err_map(struct dm_target *tt, struct bio *bio)
{
return -EIO;
}
static int io_err_map_rq(struct dm_target *ti, struct request *clone,
union map_info *map_context)
{
return -EIO;
}
static struct target_type error_target = {
.name = "error",
.version = {1, 2, 0},
.ctr = io_err_ctr,
.dtr = io_err_dtr,
.map = io_err_map,
.map_rq = io_err_map_rq,
};
int __init dm_target_init(void)
{
return dm_register_target(&error_target);
}
void dm_target_exit(void)
{
dm_unregister_target(&error_target);
}
EXPORT_SYMBOL(dm_register_target);
EXPORT_SYMBOL(dm_unregister_target);

File diff suppressed because it is too large Load diff

View file

@ -0,0 +1,218 @@
/*
* Copyright (C) 2010-2011 Red Hat, Inc.
*
* This file is released under the GPL.
*/
#ifndef DM_THIN_METADATA_H
#define DM_THIN_METADATA_H
#include "persistent-data/dm-block-manager.h"
#include "persistent-data/dm-space-map.h"
#include "persistent-data/dm-space-map-metadata.h"
#define THIN_METADATA_BLOCK_SIZE DM_SM_METADATA_BLOCK_SIZE
/*
* The metadata device is currently limited in size.
*/
#define THIN_METADATA_MAX_SECTORS DM_SM_METADATA_MAX_SECTORS
/*
* A metadata device larger than 16GB triggers a warning.
*/
#define THIN_METADATA_MAX_SECTORS_WARNING (16 * (1024 * 1024 * 1024 >> SECTOR_SHIFT))
/*----------------------------------------------------------------*/
/*
* Thin metadata superblock flags.
*/
#define THIN_METADATA_NEEDS_CHECK_FLAG (1 << 0)
struct dm_pool_metadata;
struct dm_thin_device;
/*
* Device identifier
*/
typedef uint64_t dm_thin_id;
/*
* Reopens or creates a new, empty metadata volume.
*/
struct dm_pool_metadata *dm_pool_metadata_open(struct block_device *bdev,
sector_t data_block_size,
bool format_device);
int dm_pool_metadata_close(struct dm_pool_metadata *pmd);
/*
* Compat feature flags. Any incompat flags beyond the ones
* specified below will prevent use of the thin metadata.
*/
#define THIN_FEATURE_COMPAT_SUPP 0UL
#define THIN_FEATURE_COMPAT_RO_SUPP 0UL
#define THIN_FEATURE_INCOMPAT_SUPP 0UL
/*
* Device creation/deletion.
*/
int dm_pool_create_thin(struct dm_pool_metadata *pmd, dm_thin_id dev);
/*
* An internal snapshot.
*
* You can only snapshot a quiesced origin i.e. one that is either
* suspended or not instanced at all.
*/
int dm_pool_create_snap(struct dm_pool_metadata *pmd, dm_thin_id dev,
dm_thin_id origin);
/*
* Deletes a virtual device from the metadata. It _is_ safe to call this
* when that device is open. Operations on that device will just start
* failing. You still need to call close() on the device.
*/
int dm_pool_delete_thin_device(struct dm_pool_metadata *pmd,
dm_thin_id dev);
/*
* Commits _all_ metadata changes: device creation, deletion, mapping
* updates.
*/
int dm_pool_commit_metadata(struct dm_pool_metadata *pmd);
/*
* Discards all uncommitted changes. Rereads the superblock, rolling back
* to the last good transaction. Thin devices remain open.
* dm_thin_aborted_changes() tells you if they had uncommitted changes.
*
* If this call fails it's only useful to call dm_pool_metadata_close().
* All other methods will fail with -EINVAL.
*/
int dm_pool_abort_metadata(struct dm_pool_metadata *pmd);
/*
* Set/get userspace transaction id.
*/
int dm_pool_set_metadata_transaction_id(struct dm_pool_metadata *pmd,
uint64_t current_id,
uint64_t new_id);
int dm_pool_get_metadata_transaction_id(struct dm_pool_metadata *pmd,
uint64_t *result);
/*
* Hold/get root for userspace transaction.
*
* The metadata snapshot is a copy of the current superblock (minus the
* space maps). Userland can access the data structures for READ
* operations only. A small performance hit is incurred by providing this
* copy of the metadata to userland due to extra copy-on-write operations
* on the metadata nodes. Release this as soon as you finish with it.
*/
int dm_pool_reserve_metadata_snap(struct dm_pool_metadata *pmd);
int dm_pool_release_metadata_snap(struct dm_pool_metadata *pmd);
int dm_pool_get_metadata_snap(struct dm_pool_metadata *pmd,
dm_block_t *result);
/*
* Actions on a single virtual device.
*/
/*
* Opening the same device more than once will fail with -EBUSY.
*/
int dm_pool_open_thin_device(struct dm_pool_metadata *pmd, dm_thin_id dev,
struct dm_thin_device **td);
int dm_pool_close_thin_device(struct dm_thin_device *td);
dm_thin_id dm_thin_dev_id(struct dm_thin_device *td);
struct dm_thin_lookup_result {
dm_block_t block;
bool shared:1;
};
/*
* Returns:
* -EWOULDBLOCK iff @can_block is set and would block.
* -ENODATA iff that mapping is not present.
* 0 success
*/
int dm_thin_find_block(struct dm_thin_device *td, dm_block_t block,
int can_block, struct dm_thin_lookup_result *result);
/*
* Obtain an unused block.
*/
int dm_pool_alloc_data_block(struct dm_pool_metadata *pmd, dm_block_t *result);
/*
* Insert or remove block.
*/
int dm_thin_insert_block(struct dm_thin_device *td, dm_block_t block,
dm_block_t data_block);
int dm_thin_remove_block(struct dm_thin_device *td, dm_block_t block);
/*
* Queries.
*/
bool dm_thin_changed_this_transaction(struct dm_thin_device *td);
bool dm_pool_changed_this_transaction(struct dm_pool_metadata *pmd);
bool dm_thin_aborted_changes(struct dm_thin_device *td);
int dm_thin_get_highest_mapped_block(struct dm_thin_device *td,
dm_block_t *highest_mapped);
int dm_thin_get_mapped_count(struct dm_thin_device *td, dm_block_t *result);
int dm_pool_get_free_block_count(struct dm_pool_metadata *pmd,
dm_block_t *result);
int dm_pool_get_free_metadata_block_count(struct dm_pool_metadata *pmd,
dm_block_t *result);
int dm_pool_get_metadata_dev_size(struct dm_pool_metadata *pmd,
dm_block_t *result);
int dm_pool_get_data_block_size(struct dm_pool_metadata *pmd, sector_t *result);
int dm_pool_get_data_dev_size(struct dm_pool_metadata *pmd, dm_block_t *result);
int dm_pool_block_is_used(struct dm_pool_metadata *pmd, dm_block_t b, bool *result);
/*
* Returns -ENOSPC if the new size is too small and already allocated
* blocks would be lost.
*/
int dm_pool_resize_data_dev(struct dm_pool_metadata *pmd, dm_block_t new_size);
int dm_pool_resize_metadata_dev(struct dm_pool_metadata *pmd, dm_block_t new_size);
/*
* Flicks the underlying block manager into read only mode, so you know
* that nothing is changing.
*/
void dm_pool_metadata_read_only(struct dm_pool_metadata *pmd);
void dm_pool_metadata_read_write(struct dm_pool_metadata *pmd);
int dm_pool_register_metadata_threshold(struct dm_pool_metadata *pmd,
dm_block_t threshold,
dm_sm_threshold_fn fn,
void *context);
/*
* Updates the superblock immediately.
*/
int dm_pool_metadata_set_needs_check(struct dm_pool_metadata *pmd);
bool dm_pool_metadata_needs_check(struct dm_pool_metadata *pmd);
/*----------------------------------------------------------------*/
#endif

3642
drivers/md/dm-thin.c Normal file

File diff suppressed because it is too large Load diff

219
drivers/md/dm-uevent.c Normal file
View file

@ -0,0 +1,219 @@
/*
* Device Mapper Uevent Support (dm-uevent)
*
* This program is free software; you can redistribute it and/or modify it
* under the terms of the GNU General Public License as published by the
* Free Software Foundation; either version 2 of the License, or (at your
* option) any later version.
*
* This program is distributed in the hope that it will be useful, but
* WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* General Public License for more details.
*
* You should have received a copy of the GNU General Public License along
* with this program; if not, write to the Free Software Foundation, Inc.,
* 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
*
* Copyright IBM Corporation, 2007
* Author: Mike Anderson <andmike@linux.vnet.ibm.com>
*/
#include <linux/list.h>
#include <linux/slab.h>
#include <linux/kobject.h>
#include <linux/dm-ioctl.h>
#include <linux/export.h>
#include "dm.h"
#include "dm-uevent.h"
#define DM_MSG_PREFIX "uevent"
static const struct {
enum dm_uevent_type type;
enum kobject_action action;
char *name;
} _dm_uevent_type_names[] = {
{DM_UEVENT_PATH_FAILED, KOBJ_CHANGE, "PATH_FAILED"},
{DM_UEVENT_PATH_REINSTATED, KOBJ_CHANGE, "PATH_REINSTATED"},
};
static struct kmem_cache *_dm_event_cache;
struct dm_uevent {
struct mapped_device *md;
enum kobject_action action;
struct kobj_uevent_env ku_env;
struct list_head elist;
char name[DM_NAME_LEN];
char uuid[DM_UUID_LEN];
};
static void dm_uevent_free(struct dm_uevent *event)
{
kmem_cache_free(_dm_event_cache, event);
}
static struct dm_uevent *dm_uevent_alloc(struct mapped_device *md)
{
struct dm_uevent *event;
event = kmem_cache_zalloc(_dm_event_cache, GFP_ATOMIC);
if (!event)
return NULL;
INIT_LIST_HEAD(&event->elist);
event->md = md;
return event;
}
static struct dm_uevent *dm_build_path_uevent(struct mapped_device *md,
struct dm_target *ti,
enum kobject_action action,
const char *dm_action,
const char *path,
unsigned nr_valid_paths)
{
struct dm_uevent *event;
event = dm_uevent_alloc(md);
if (!event) {
DMERR("%s: dm_uevent_alloc() failed", __func__);
goto err_nomem;
}
event->action = action;
if (add_uevent_var(&event->ku_env, "DM_TARGET=%s", ti->type->name)) {
DMERR("%s: add_uevent_var() for DM_TARGET failed",
__func__);
goto err_add;
}
if (add_uevent_var(&event->ku_env, "DM_ACTION=%s", dm_action)) {
DMERR("%s: add_uevent_var() for DM_ACTION failed",
__func__);
goto err_add;
}
if (add_uevent_var(&event->ku_env, "DM_SEQNUM=%u",
dm_next_uevent_seq(md))) {
DMERR("%s: add_uevent_var() for DM_SEQNUM failed",
__func__);
goto err_add;
}
if (add_uevent_var(&event->ku_env, "DM_PATH=%s", path)) {
DMERR("%s: add_uevent_var() for DM_PATH failed", __func__);
goto err_add;
}
if (add_uevent_var(&event->ku_env, "DM_NR_VALID_PATHS=%d",
nr_valid_paths)) {
DMERR("%s: add_uevent_var() for DM_NR_VALID_PATHS failed",
__func__);
goto err_add;
}
return event;
err_add:
dm_uevent_free(event);
err_nomem:
return ERR_PTR(-ENOMEM);
}
/**
* dm_send_uevents - send uevents for given list
*
* @events: list of events to send
* @kobj: kobject generating event
*
*/
void dm_send_uevents(struct list_head *events, struct kobject *kobj)
{
int r;
struct dm_uevent *event, *next;
list_for_each_entry_safe(event, next, events, elist) {
list_del_init(&event->elist);
/*
* When a device is being removed this copy fails and we
* discard these unsent events.
*/
if (dm_copy_name_and_uuid(event->md, event->name,
event->uuid)) {
DMINFO("%s: skipping sending uevent for lost device",
__func__);
goto uevent_free;
}
if (add_uevent_var(&event->ku_env, "DM_NAME=%s", event->name)) {
DMERR("%s: add_uevent_var() for DM_NAME failed",
__func__);
goto uevent_free;
}
if (add_uevent_var(&event->ku_env, "DM_UUID=%s", event->uuid)) {
DMERR("%s: add_uevent_var() for DM_UUID failed",
__func__);
goto uevent_free;
}
r = kobject_uevent_env(kobj, event->action, event->ku_env.envp);
if (r)
DMERR("%s: kobject_uevent_env failed", __func__);
uevent_free:
dm_uevent_free(event);
}
}
EXPORT_SYMBOL_GPL(dm_send_uevents);
/**
* dm_path_uevent - called to create a new path event and queue it
*
* @event_type: path event type enum
* @ti: pointer to a dm_target
* @path: string containing pathname
* @nr_valid_paths: number of valid paths remaining
*
*/
void dm_path_uevent(enum dm_uevent_type event_type, struct dm_target *ti,
const char *path, unsigned nr_valid_paths)
{
struct mapped_device *md = dm_table_get_md(ti->table);
struct dm_uevent *event;
if (event_type >= ARRAY_SIZE(_dm_uevent_type_names)) {
DMERR("%s: Invalid event_type %d", __func__, event_type);
return;
}
event = dm_build_path_uevent(md, ti,
_dm_uevent_type_names[event_type].action,
_dm_uevent_type_names[event_type].name,
path, nr_valid_paths);
if (IS_ERR(event))
return;
dm_uevent_add(md, &event->elist);
}
EXPORT_SYMBOL_GPL(dm_path_uevent);
int dm_uevent_init(void)
{
_dm_event_cache = KMEM_CACHE(dm_uevent, 0);
if (!_dm_event_cache)
return -ENOMEM;
DMINFO("version 1.0.3");
return 0;
}
void dm_uevent_exit(void)
{
kmem_cache_destroy(_dm_event_cache);
}

59
drivers/md/dm-uevent.h Normal file
View file

@ -0,0 +1,59 @@
/*
* Device Mapper Uevent Support
*
* This program is free software; you can redistribute it and/or modify it
* under the terms of the GNU General Public License as published by the
* Free Software Foundation; either version 2 of the License, or (at your
* option) any later version.
*
* This program is distributed in the hope that it will be useful, but
* WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* General Public License for more details.
*
* You should have received a copy of the GNU General Public License along
* with this program; if not, write to the Free Software Foundation, Inc.,
* 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
*
* Copyright IBM Corporation, 2007
* Author: Mike Anderson <andmike@linux.vnet.ibm.com>
*/
#ifndef DM_UEVENT_H
#define DM_UEVENT_H
enum dm_uevent_type {
DM_UEVENT_PATH_FAILED,
DM_UEVENT_PATH_REINSTATED,
};
#ifdef CONFIG_DM_UEVENT
extern int dm_uevent_init(void);
extern void dm_uevent_exit(void);
extern void dm_send_uevents(struct list_head *events, struct kobject *kobj);
extern void dm_path_uevent(enum dm_uevent_type event_type,
struct dm_target *ti, const char *path,
unsigned nr_valid_paths);
#else
static inline int dm_uevent_init(void)
{
return 0;
}
static inline void dm_uevent_exit(void)
{
}
static inline void dm_send_uevents(struct list_head *events,
struct kobject *kobj)
{
}
static inline void dm_path_uevent(enum dm_uevent_type event_type,
struct dm_target *ti, const char *path,
unsigned nr_valid_paths)
{
}
#endif /* CONFIG_DM_UEVENT */
#endif /* DM_UEVENT_H */

1028
drivers/md/dm-verity.c Normal file

File diff suppressed because it is too large Load diff

84
drivers/md/dm-zero.c Normal file
View file

@ -0,0 +1,84 @@
/*
* Copyright (C) 2003 Jana Saout <jana@saout.de>
*
* This file is released under the GPL.
*/
#include <linux/device-mapper.h>
#include <linux/module.h>
#include <linux/init.h>
#include <linux/bio.h>
#define DM_MSG_PREFIX "zero"
/*
* Construct a dummy mapping that only returns zeros
*/
static int zero_ctr(struct dm_target *ti, unsigned int argc, char **argv)
{
if (argc != 0) {
ti->error = "No arguments required";
return -EINVAL;
}
/*
* Silently drop discards, avoiding -EOPNOTSUPP.
*/
ti->num_discard_bios = 1;
return 0;
}
/*
* Return zeros only on reads
*/
static int zero_map(struct dm_target *ti, struct bio *bio)
{
switch(bio_rw(bio)) {
case READ:
zero_fill_bio(bio);
break;
case READA:
/* readahead of null bytes only wastes buffer cache */
return -EIO;
case WRITE:
/* writes get silently dropped */
break;
}
bio_endio(bio, 0);
/* accepted bio, don't make new request */
return DM_MAPIO_SUBMITTED;
}
static struct target_type zero_target = {
.name = "zero",
.version = {1, 1, 0},
.module = THIS_MODULE,
.ctr = zero_ctr,
.map = zero_map,
};
static int __init dm_zero_init(void)
{
int r = dm_register_target(&zero_target);
if (r < 0)
DMERR("register failed %d", r);
return r;
}
static void __exit dm_zero_exit(void)
{
dm_unregister_target(&zero_target);
}
module_init(dm_zero_init)
module_exit(dm_zero_exit)
MODULE_AUTHOR("Jana Saout <jana@saout.de>");
MODULE_DESCRIPTION(DM_NAME " dummy target returning zeros");
MODULE_LICENSE("GPL");

3092
drivers/md/dm.c Normal file

File diff suppressed because it is too large Load diff

225
drivers/md/dm.h Normal file
View file

@ -0,0 +1,225 @@
/*
* Internal header file for device mapper
*
* Copyright (C) 2001, 2002 Sistina Software
* Copyright (C) 2004-2006 Red Hat, Inc. All rights reserved.
*
* This file is released under the LGPL.
*/
#ifndef DM_INTERNAL_H
#define DM_INTERNAL_H
#include <linux/fs.h>
#include <linux/device-mapper.h>
#include <linux/list.h>
#include <linux/blkdev.h>
#include <linux/hdreg.h>
#include <linux/completion.h>
#include <linux/kobject.h>
#include "dm-stats.h"
/*
* Suspend feature flags
*/
#define DM_SUSPEND_LOCKFS_FLAG (1 << 0)
#define DM_SUSPEND_NOFLUSH_FLAG (1 << 1)
/*
* Status feature flags
*/
#define DM_STATUS_NOFLUSH_FLAG (1 << 0)
/*
* Type of table and mapped_device's mempool
*/
#define DM_TYPE_NONE 0
#define DM_TYPE_BIO_BASED 1
#define DM_TYPE_REQUEST_BASED 2
/*
* List of devices that a metadevice uses and should open/close.
*/
struct dm_dev_internal {
struct list_head list;
atomic_t count;
struct dm_dev *dm_dev;
};
struct dm_table;
struct dm_md_mempools;
/*-----------------------------------------------------------------
* Internal table functions.
*---------------------------------------------------------------*/
void dm_table_destroy(struct dm_table *t);
void dm_table_event_callback(struct dm_table *t,
void (*fn)(void *), void *context);
struct dm_target *dm_table_get_target(struct dm_table *t, unsigned int index);
struct dm_target *dm_table_find_target(struct dm_table *t, sector_t sector);
bool dm_table_has_no_data_devices(struct dm_table *table);
int dm_calculate_queue_limits(struct dm_table *table,
struct queue_limits *limits);
void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
struct queue_limits *limits);
struct list_head *dm_table_get_devices(struct dm_table *t);
void dm_table_presuspend_targets(struct dm_table *t);
void dm_table_postsuspend_targets(struct dm_table *t);
int dm_table_resume_targets(struct dm_table *t);
int dm_table_any_congested(struct dm_table *t, int bdi_bits);
int dm_table_any_busy_target(struct dm_table *t);
unsigned dm_table_get_type(struct dm_table *t);
struct target_type *dm_table_get_immutable_target_type(struct dm_table *t);
bool dm_table_request_based(struct dm_table *t);
void dm_table_free_md_mempools(struct dm_table *t);
struct dm_md_mempools *dm_table_get_md_mempools(struct dm_table *t);
int dm_queue_merge_is_compulsory(struct request_queue *q);
void dm_lock_md_type(struct mapped_device *md);
void dm_unlock_md_type(struct mapped_device *md);
void dm_set_md_type(struct mapped_device *md, unsigned type);
unsigned dm_get_md_type(struct mapped_device *md);
struct target_type *dm_get_immutable_target_type(struct mapped_device *md);
int dm_setup_md_queue(struct mapped_device *md);
/*
* To check the return value from dm_table_find_target().
*/
#define dm_target_is_valid(t) ((t)->table)
/*
* To check whether the target type is bio-based or not (request-based).
*/
#define dm_target_bio_based(t) ((t)->type->map != NULL)
/*
* To check whether the target type is request-based or not (bio-based).
*/
#define dm_target_request_based(t) ((t)->type->map_rq != NULL)
/*
* To check whether the target type is a hybrid (capable of being
* either request-based or bio-based).
*/
#define dm_target_hybrid(t) (dm_target_bio_based(t) && dm_target_request_based(t))
/*-----------------------------------------------------------------
* A registry of target types.
*---------------------------------------------------------------*/
int dm_target_init(void);
void dm_target_exit(void);
struct target_type *dm_get_target_type(const char *name);
void dm_put_target_type(struct target_type *tt);
int dm_target_iterate(void (*iter_func)(struct target_type *tt,
void *param), void *param);
int dm_split_args(int *argc, char ***argvp, char *input);
/*
* Is this mapped_device being deleted?
*/
int dm_deleting_md(struct mapped_device *md);
/*
* Is this mapped_device suspended?
*/
int dm_suspended_md(struct mapped_device *md);
/*
* Test if the device is scheduled for deferred remove.
*/
int dm_test_deferred_remove_flag(struct mapped_device *md);
/*
* Try to remove devices marked for deferred removal.
*/
void dm_deferred_remove(void);
/*
* The device-mapper can be driven through one of two interfaces;
* ioctl or filesystem, depending which patch you have applied.
*/
int dm_interface_init(void);
void dm_interface_exit(void);
/*
* sysfs interface
*/
struct dm_kobject_holder {
struct kobject kobj;
struct completion completion;
};
static inline struct completion *dm_get_completion_from_kobject(struct kobject *kobj)
{
return &container_of(kobj, struct dm_kobject_holder, kobj)->completion;
}
int dm_sysfs_init(struct mapped_device *md);
void dm_sysfs_exit(struct mapped_device *md);
struct kobject *dm_kobject(struct mapped_device *md);
struct mapped_device *dm_get_from_kobject(struct kobject *kobj);
/*
* The kobject helper
*/
void dm_kobject_release(struct kobject *kobj);
/*
* Targets for linear and striped mappings
*/
int dm_linear_init(void);
void dm_linear_exit(void);
int dm_stripe_init(void);
void dm_stripe_exit(void);
/*
* mapped_device operations
*/
void dm_destroy(struct mapped_device *md);
void dm_destroy_immediate(struct mapped_device *md);
int dm_open_count(struct mapped_device *md);
int dm_lock_for_deletion(struct mapped_device *md, bool mark_deferred, bool only_deferred);
int dm_cancel_deferred_remove(struct mapped_device *md);
int dm_request_based(struct mapped_device *md);
sector_t dm_get_size(struct mapped_device *md);
struct request_queue *dm_get_md_queue(struct mapped_device *md);
int dm_get_table_device(struct mapped_device *md, dev_t dev, fmode_t mode,
struct dm_dev **result);
void dm_put_table_device(struct mapped_device *md, struct dm_dev *d);
struct dm_stats *dm_get_stats(struct mapped_device *md);
int dm_kobject_uevent(struct mapped_device *md, enum kobject_action action,
unsigned cookie);
void dm_internal_suspend(struct mapped_device *md);
void dm_internal_resume(struct mapped_device *md);
int dm_io_init(void);
void dm_io_exit(void);
int dm_kcopyd_init(void);
void dm_kcopyd_exit(void);
/*
* Mempool operations
*/
struct dm_md_mempools *dm_alloc_md_mempools(unsigned type, unsigned integrity, unsigned per_bio_data_size);
void dm_free_md_mempools(struct dm_md_mempools *pools);
/*
* Helpers that are used by DM core
*/
unsigned dm_get_reserved_bio_based_ios(void);
unsigned dm_get_reserved_rq_based_ios(void);
static inline bool dm_message_test_buffer_overflow(char *result, unsigned maxlen)
{
return !maxlen || strlen(result) + 1 >= maxlen;
}
#endif

373
drivers/md/faulty.c Normal file
View file

@ -0,0 +1,373 @@
/*
* faulty.c : Multiple Devices driver for Linux
*
* Copyright (C) 2004 Neil Brown
*
* fautly-device-simulator personality for md
*
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2, or (at your option)
* any later version.
*
* You should have received a copy of the GNU General Public License
* (for example /usr/src/linux/COPYING); if not, write to the Free
* Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
*/
/*
* The "faulty" personality causes some requests to fail.
*
* Possible failure modes are:
* reads fail "randomly" but succeed on retry
* writes fail "randomly" but succeed on retry
* reads for some address fail and then persist until a write
* reads for some address fail and then persist irrespective of write
* writes for some address fail and persist
* all writes fail
*
* Different modes can be active at a time, but only
* one can be set at array creation. Others can be added later.
* A mode can be one-shot or recurrent with the recurrence being
* once in every N requests.
* The bottom 5 bits of the "layout" indicate the mode. The
* remainder indicate a period, or 0 for one-shot.
*
* There is an implementation limit on the number of concurrently
* persisting-faulty blocks. When a new fault is requested that would
* exceed the limit, it is ignored.
* All current faults can be clear using a layout of "0".
*
* Requests are always sent to the device. If they are to fail,
* we clone the bio and insert a new b_end_io into the chain.
*/
#define WriteTransient 0
#define ReadTransient 1
#define WritePersistent 2
#define ReadPersistent 3
#define WriteAll 4 /* doesn't go to device */
#define ReadFixable 5
#define Modes 6
#define ClearErrors 31
#define ClearFaults 30
#define AllPersist 100 /* internal use only */
#define NoPersist 101
#define ModeMask 0x1f
#define ModeShift 5
#define MaxFault 50
#include <linux/blkdev.h>
#include <linux/module.h>
#include <linux/raid/md_u.h>
#include <linux/slab.h>
#include "md.h"
#include <linux/seq_file.h>
static void faulty_fail(struct bio *bio, int error)
{
struct bio *b = bio->bi_private;
b->bi_iter.bi_size = bio->bi_iter.bi_size;
b->bi_iter.bi_sector = bio->bi_iter.bi_sector;
bio_put(bio);
bio_io_error(b);
}
struct faulty_conf {
int period[Modes];
atomic_t counters[Modes];
sector_t faults[MaxFault];
int modes[MaxFault];
int nfaults;
struct md_rdev *rdev;
};
static int check_mode(struct faulty_conf *conf, int mode)
{
if (conf->period[mode] == 0 &&
atomic_read(&conf->counters[mode]) <= 0)
return 0; /* no failure, no decrement */
if (atomic_dec_and_test(&conf->counters[mode])) {
if (conf->period[mode])
atomic_set(&conf->counters[mode], conf->period[mode]);
return 1;
}
return 0;
}
static int check_sector(struct faulty_conf *conf, sector_t start, sector_t end, int dir)
{
/* If we find a ReadFixable sector, we fix it ... */
int i;
for (i=0; i<conf->nfaults; i++)
if (conf->faults[i] >= start &&
conf->faults[i] < end) {
/* found it ... */
switch (conf->modes[i] * 2 + dir) {
case WritePersistent*2+WRITE: return 1;
case ReadPersistent*2+READ: return 1;
case ReadFixable*2+READ: return 1;
case ReadFixable*2+WRITE:
conf->modes[i] = NoPersist;
return 0;
case AllPersist*2+READ:
case AllPersist*2+WRITE: return 1;
default:
return 0;
}
}
return 0;
}
static void add_sector(struct faulty_conf *conf, sector_t start, int mode)
{
int i;
int n = conf->nfaults;
for (i=0; i<conf->nfaults; i++)
if (conf->faults[i] == start) {
switch(mode) {
case NoPersist: conf->modes[i] = mode; return;
case WritePersistent:
if (conf->modes[i] == ReadPersistent ||
conf->modes[i] == ReadFixable)
conf->modes[i] = AllPersist;
else
conf->modes[i] = WritePersistent;
return;
case ReadPersistent:
if (conf->modes[i] == WritePersistent)
conf->modes[i] = AllPersist;
else
conf->modes[i] = ReadPersistent;
return;
case ReadFixable:
if (conf->modes[i] == WritePersistent ||
conf->modes[i] == ReadPersistent)
conf->modes[i] = AllPersist;
else
conf->modes[i] = ReadFixable;
return;
}
} else if (conf->modes[i] == NoPersist)
n = i;
if (n >= MaxFault)
return;
conf->faults[n] = start;
conf->modes[n] = mode;
if (conf->nfaults == n)
conf->nfaults = n+1;
}
static void make_request(struct mddev *mddev, struct bio *bio)
{
struct faulty_conf *conf = mddev->private;
int failit = 0;
if (bio_data_dir(bio) == WRITE) {
/* write request */
if (atomic_read(&conf->counters[WriteAll])) {
/* special case - don't decrement, don't generic_make_request,
* just fail immediately
*/
bio_endio(bio, -EIO);
return;
}
if (check_sector(conf, bio->bi_iter.bi_sector,
bio_end_sector(bio), WRITE))
failit = 1;
if (check_mode(conf, WritePersistent)) {
add_sector(conf, bio->bi_iter.bi_sector,
WritePersistent);
failit = 1;
}
if (check_mode(conf, WriteTransient))
failit = 1;
} else {
/* read request */
if (check_sector(conf, bio->bi_iter.bi_sector,
bio_end_sector(bio), READ))
failit = 1;
if (check_mode(conf, ReadTransient))
failit = 1;
if (check_mode(conf, ReadPersistent)) {
add_sector(conf, bio->bi_iter.bi_sector,
ReadPersistent);
failit = 1;
}
if (check_mode(conf, ReadFixable)) {
add_sector(conf, bio->bi_iter.bi_sector,
ReadFixable);
failit = 1;
}
}
if (failit) {
struct bio *b = bio_clone_mddev(bio, GFP_NOIO, mddev);
b->bi_bdev = conf->rdev->bdev;
b->bi_private = bio;
b->bi_end_io = faulty_fail;
bio = b;
} else
bio->bi_bdev = conf->rdev->bdev;
generic_make_request(bio);
}
static void status(struct seq_file *seq, struct mddev *mddev)
{
struct faulty_conf *conf = mddev->private;
int n;
if ((n=atomic_read(&conf->counters[WriteTransient])) != 0)
seq_printf(seq, " WriteTransient=%d(%d)",
n, conf->period[WriteTransient]);
if ((n=atomic_read(&conf->counters[ReadTransient])) != 0)
seq_printf(seq, " ReadTransient=%d(%d)",
n, conf->period[ReadTransient]);
if ((n=atomic_read(&conf->counters[WritePersistent])) != 0)
seq_printf(seq, " WritePersistent=%d(%d)",
n, conf->period[WritePersistent]);
if ((n=atomic_read(&conf->counters[ReadPersistent])) != 0)
seq_printf(seq, " ReadPersistent=%d(%d)",
n, conf->period[ReadPersistent]);
if ((n=atomic_read(&conf->counters[ReadFixable])) != 0)
seq_printf(seq, " ReadFixable=%d(%d)",
n, conf->period[ReadFixable]);
if ((n=atomic_read(&conf->counters[WriteAll])) != 0)
seq_printf(seq, " WriteAll");
seq_printf(seq, " nfaults=%d", conf->nfaults);
}
static int reshape(struct mddev *mddev)
{
int mode = mddev->new_layout & ModeMask;
int count = mddev->new_layout >> ModeShift;
struct faulty_conf *conf = mddev->private;
if (mddev->new_layout < 0)
return 0;
/* new layout */
if (mode == ClearFaults)
conf->nfaults = 0;
else if (mode == ClearErrors) {
int i;
for (i=0 ; i < Modes ; i++) {
conf->period[i] = 0;
atomic_set(&conf->counters[i], 0);
}
} else if (mode < Modes) {
conf->period[mode] = count;
if (!count) count++;
atomic_set(&conf->counters[mode], count);
} else
return -EINVAL;
mddev->new_layout = -1;
mddev->layout = -1; /* makes sure further changes come through */
return 0;
}
static sector_t faulty_size(struct mddev *mddev, sector_t sectors, int raid_disks)
{
WARN_ONCE(raid_disks,
"%s does not support generic reshape\n", __func__);
if (sectors == 0)
return mddev->dev_sectors;
return sectors;
}
static int run(struct mddev *mddev)
{
struct md_rdev *rdev;
int i;
struct faulty_conf *conf;
if (md_check_no_bitmap(mddev))
return -EINVAL;
conf = kmalloc(sizeof(*conf), GFP_KERNEL);
if (!conf)
return -ENOMEM;
for (i=0; i<Modes; i++) {
atomic_set(&conf->counters[i], 0);
conf->period[i] = 0;
}
conf->nfaults = 0;
rdev_for_each(rdev, mddev) {
conf->rdev = rdev;
disk_stack_limits(mddev->gendisk, rdev->bdev,
rdev->data_offset << 9);
}
md_set_array_sectors(mddev, faulty_size(mddev, 0, 0));
mddev->private = conf;
reshape(mddev);
return 0;
}
static int stop(struct mddev *mddev)
{
struct faulty_conf *conf = mddev->private;
kfree(conf);
mddev->private = NULL;
return 0;
}
static struct md_personality faulty_personality =
{
.name = "faulty",
.level = LEVEL_FAULTY,
.owner = THIS_MODULE,
.make_request = make_request,
.run = run,
.stop = stop,
.status = status,
.check_reshape = reshape,
.size = faulty_size,
};
static int __init raid_init(void)
{
return register_md_personality(&faulty_personality);
}
static void raid_exit(void)
{
unregister_md_personality(&faulty_personality);
}
module_init(raid_init);
module_exit(raid_exit);
MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Fault injection personality for MD");
MODULE_ALIAS("md-personality-10"); /* faulty */
MODULE_ALIAS("md-faulty");
MODULE_ALIAS("md-level--5");

387
drivers/md/linear.c Normal file
View file

@ -0,0 +1,387 @@
/*
linear.c : Multiple Devices driver for Linux
Copyright (C) 1994-96 Marc ZYNGIER
<zyngier@ufr-info-p7.ibp.fr> or
<maz@gloups.fdn.fr>
Linear mode management functions.
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2, or (at your option)
any later version.
You should have received a copy of the GNU General Public License
(for example /usr/src/linux/COPYING); if not, write to the Free
Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
*/
#include <linux/blkdev.h>
#include <linux/raid/md_u.h>
#include <linux/seq_file.h>
#include <linux/module.h>
#include <linux/slab.h>
#include "md.h"
#include "linear.h"
/*
* find which device holds a particular offset
*/
static inline struct dev_info *which_dev(struct mddev *mddev, sector_t sector)
{
int lo, mid, hi;
struct linear_conf *conf;
lo = 0;
hi = mddev->raid_disks - 1;
conf = rcu_dereference(mddev->private);
/*
* Binary Search
*/
while (hi > lo) {
mid = (hi + lo) / 2;
if (sector < conf->disks[mid].end_sector)
hi = mid;
else
lo = mid + 1;
}
return conf->disks + lo;
}
/**
* linear_mergeable_bvec -- tell bio layer if two requests can be merged
* @q: request queue
* @bvm: properties of new bio
* @biovec: the request that could be merged to it.
*
* Return amount of bytes we can take at this offset
*/
static int linear_mergeable_bvec(struct request_queue *q,
struct bvec_merge_data *bvm,
struct bio_vec *biovec)
{
struct mddev *mddev = q->queuedata;
struct dev_info *dev0;
unsigned long maxsectors, bio_sectors = bvm->bi_size >> 9;
sector_t sector = bvm->bi_sector + get_start_sect(bvm->bi_bdev);
int maxbytes = biovec->bv_len;
struct request_queue *subq;
rcu_read_lock();
dev0 = which_dev(mddev, sector);
maxsectors = dev0->end_sector - sector;
subq = bdev_get_queue(dev0->rdev->bdev);
if (subq->merge_bvec_fn) {
bvm->bi_bdev = dev0->rdev->bdev;
bvm->bi_sector -= dev0->end_sector - dev0->rdev->sectors;
maxbytes = min(maxbytes, subq->merge_bvec_fn(subq, bvm,
biovec));
}
rcu_read_unlock();
if (maxsectors < bio_sectors)
maxsectors = 0;
else
maxsectors -= bio_sectors;
if (maxsectors <= (PAGE_SIZE >> 9 ) && bio_sectors == 0)
return maxbytes;
if (maxsectors > (maxbytes >> 9))
return maxbytes;
else
return maxsectors << 9;
}
static int linear_congested(void *data, int bits)
{
struct mddev *mddev = data;
struct linear_conf *conf;
int i, ret = 0;
if (mddev_congested(mddev, bits))
return 1;
rcu_read_lock();
conf = rcu_dereference(mddev->private);
for (i = 0; i < mddev->raid_disks && !ret ; i++) {
struct request_queue *q = bdev_get_queue(conf->disks[i].rdev->bdev);
ret |= bdi_congested(&q->backing_dev_info, bits);
}
rcu_read_unlock();
return ret;
}
static sector_t linear_size(struct mddev *mddev, sector_t sectors, int raid_disks)
{
struct linear_conf *conf;
sector_t array_sectors;
rcu_read_lock();
conf = rcu_dereference(mddev->private);
WARN_ONCE(sectors || raid_disks,
"%s does not support generic reshape\n", __func__);
array_sectors = conf->array_sectors;
rcu_read_unlock();
return array_sectors;
}
static struct linear_conf *linear_conf(struct mddev *mddev, int raid_disks)
{
struct linear_conf *conf;
struct md_rdev *rdev;
int i, cnt;
bool discard_supported = false;
conf = kzalloc (sizeof (*conf) + raid_disks*sizeof(struct dev_info),
GFP_KERNEL);
if (!conf)
return NULL;
cnt = 0;
conf->array_sectors = 0;
rdev_for_each(rdev, mddev) {
int j = rdev->raid_disk;
struct dev_info *disk = conf->disks + j;
sector_t sectors;
if (j < 0 || j >= raid_disks || disk->rdev) {
printk(KERN_ERR "md/linear:%s: disk numbering problem. Aborting!\n",
mdname(mddev));
goto out;
}
disk->rdev = rdev;
if (mddev->chunk_sectors) {
sectors = rdev->sectors;
sector_div(sectors, mddev->chunk_sectors);
rdev->sectors = sectors * mddev->chunk_sectors;
}
disk_stack_limits(mddev->gendisk, rdev->bdev,
rdev->data_offset << 9);
conf->array_sectors += rdev->sectors;
cnt++;
if (blk_queue_discard(bdev_get_queue(rdev->bdev)))
discard_supported = true;
}
if (cnt != raid_disks) {
printk(KERN_ERR "md/linear:%s: not enough drives present. Aborting!\n",
mdname(mddev));
goto out;
}
if (!discard_supported)
queue_flag_clear_unlocked(QUEUE_FLAG_DISCARD, mddev->queue);
else
queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, mddev->queue);
/*
* Here we calculate the device offsets.
*/
conf->disks[0].end_sector = conf->disks[0].rdev->sectors;
for (i = 1; i < raid_disks; i++)
conf->disks[i].end_sector =
conf->disks[i-1].end_sector +
conf->disks[i].rdev->sectors;
return conf;
out:
kfree(conf);
return NULL;
}
static int linear_run (struct mddev *mddev)
{
struct linear_conf *conf;
int ret;
if (md_check_no_bitmap(mddev))
return -EINVAL;
conf = linear_conf(mddev, mddev->raid_disks);
if (!conf)
return 1;
mddev->private = conf;
md_set_array_sectors(mddev, linear_size(mddev, 0, 0));
blk_queue_merge_bvec(mddev->queue, linear_mergeable_bvec);
mddev->queue->backing_dev_info.congested_fn = linear_congested;
mddev->queue->backing_dev_info.congested_data = mddev;
ret = md_integrity_register(mddev);
if (ret) {
kfree(conf);
mddev->private = NULL;
}
return ret;
}
static int linear_add(struct mddev *mddev, struct md_rdev *rdev)
{
/* Adding a drive to a linear array allows the array to grow.
* It is permitted if the new drive has a matching superblock
* already on it, with raid_disk equal to raid_disks.
* It is achieved by creating a new linear_private_data structure
* and swapping it in in-place of the current one.
* The current one is never freed until the array is stopped.
* This avoids races.
*/
struct linear_conf *newconf, *oldconf;
if (rdev->saved_raid_disk != mddev->raid_disks)
return -EINVAL;
rdev->raid_disk = rdev->saved_raid_disk;
rdev->saved_raid_disk = -1;
newconf = linear_conf(mddev,mddev->raid_disks+1);
if (!newconf)
return -ENOMEM;
oldconf = rcu_dereference_protected(mddev->private,
lockdep_is_held(
&mddev->reconfig_mutex));
mddev->raid_disks++;
rcu_assign_pointer(mddev->private, newconf);
md_set_array_sectors(mddev, linear_size(mddev, 0, 0));
set_capacity(mddev->gendisk, mddev->array_sectors);
revalidate_disk(mddev->gendisk);
kfree_rcu(oldconf, rcu);
return 0;
}
static int linear_stop (struct mddev *mddev)
{
struct linear_conf *conf =
rcu_dereference_protected(mddev->private,
lockdep_is_held(
&mddev->reconfig_mutex));
/*
* We do not require rcu protection here since
* we hold reconfig_mutex for both linear_add and
* linear_stop, so they cannot race.
* We should make sure any old 'conf's are properly
* freed though.
*/
rcu_barrier();
blk_sync_queue(mddev->queue); /* the unplug fn references 'conf'*/
kfree(conf);
mddev->private = NULL;
return 0;
}
static void linear_make_request(struct mddev *mddev, struct bio *bio)
{
char b[BDEVNAME_SIZE];
struct dev_info *tmp_dev;
struct bio *split;
sector_t start_sector, end_sector, data_offset;
if (unlikely(bio->bi_rw & REQ_FLUSH)) {
md_flush_request(mddev, bio);
return;
}
do {
rcu_read_lock();
tmp_dev = which_dev(mddev, bio->bi_iter.bi_sector);
start_sector = tmp_dev->end_sector - tmp_dev->rdev->sectors;
end_sector = tmp_dev->end_sector;
data_offset = tmp_dev->rdev->data_offset;
bio->bi_bdev = tmp_dev->rdev->bdev;
rcu_read_unlock();
if (unlikely(bio->bi_iter.bi_sector >= end_sector ||
bio->bi_iter.bi_sector < start_sector))
goto out_of_bounds;
if (unlikely(bio_end_sector(bio) > end_sector)) {
/* This bio crosses a device boundary, so we have to
* split it.
*/
split = bio_split(bio, end_sector -
bio->bi_iter.bi_sector,
GFP_NOIO, fs_bio_set);
bio_chain(split, bio);
} else {
split = bio;
}
split->bi_iter.bi_sector = split->bi_iter.bi_sector -
start_sector + data_offset;
if (unlikely((split->bi_rw & REQ_DISCARD) &&
!blk_queue_discard(bdev_get_queue(split->bi_bdev)))) {
/* Just ignore it */
bio_endio(split, 0);
} else
generic_make_request(split);
} while (split != bio);
return;
out_of_bounds:
printk(KERN_ERR
"md/linear:%s: make_request: Sector %llu out of bounds on "
"dev %s: %llu sectors, offset %llu\n",
mdname(mddev),
(unsigned long long)bio->bi_iter.bi_sector,
bdevname(tmp_dev->rdev->bdev, b),
(unsigned long long)tmp_dev->rdev->sectors,
(unsigned long long)start_sector);
bio_io_error(bio);
}
static void linear_status (struct seq_file *seq, struct mddev *mddev)
{
seq_printf(seq, " %dk rounding", mddev->chunk_sectors / 2);
}
static struct md_personality linear_personality =
{
.name = "linear",
.level = LEVEL_LINEAR,
.owner = THIS_MODULE,
.make_request = linear_make_request,
.run = linear_run,
.stop = linear_stop,
.status = linear_status,
.hot_add_disk = linear_add,
.size = linear_size,
};
static int __init linear_init (void)
{
return register_md_personality (&linear_personality);
}
static void linear_exit (void)
{
unregister_md_personality (&linear_personality);
}
module_init(linear_init);
module_exit(linear_exit);
MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Linear device concatenation personality for MD");
MODULE_ALIAS("md-personality-1"); /* LINEAR - deprecated*/
MODULE_ALIAS("md-linear");
MODULE_ALIAS("md-level--1");

15
drivers/md/linear.h Normal file
View file

@ -0,0 +1,15 @@
#ifndef _LINEAR_H
#define _LINEAR_H
struct dev_info {
struct md_rdev *rdev;
sector_t end_sector;
};
struct linear_conf
{
struct rcu_head rcu;
sector_t array_sectors;
struct dev_info disks[0];
};
#endif

8549
drivers/md/md.c Normal file

File diff suppressed because it is too large Load diff

627
drivers/md/md.h Normal file
View file

@ -0,0 +1,627 @@
/*
md.h : kernel internal structure of the Linux MD driver
Copyright (C) 1996-98 Ingo Molnar, Gadi Oxman
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2, or (at your option)
any later version.
You should have received a copy of the GNU General Public License
(for example /usr/src/linux/COPYING); if not, write to the Free
Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
*/
#ifndef _MD_MD_H
#define _MD_MD_H
#include <linux/blkdev.h>
#include <linux/kobject.h>
#include <linux/list.h>
#include <linux/mm.h>
#include <linux/mutex.h>
#include <linux/timer.h>
#include <linux/wait.h>
#include <linux/workqueue.h>
#define MaxSector (~(sector_t)0)
/* Bad block numbers are stored sorted in a single page.
* 64bits is used for each block or extent.
* 54 bits are sector number, 9 bits are extent size,
* 1 bit is an 'acknowledged' flag.
*/
#define MD_MAX_BADBLOCKS (PAGE_SIZE/8)
/*
* MD's 'extended' device
*/
struct md_rdev {
struct list_head same_set; /* RAID devices within the same set */
sector_t sectors; /* Device size (in 512bytes sectors) */
struct mddev *mddev; /* RAID array if running */
int last_events; /* IO event timestamp */
/*
* If meta_bdev is non-NULL, it means that a separate device is
* being used to store the metadata (superblock/bitmap) which
* would otherwise be contained on the same device as the data (bdev).
*/
struct block_device *meta_bdev;
struct block_device *bdev; /* block device handle */
struct page *sb_page, *bb_page;
int sb_loaded;
__u64 sb_events;
sector_t data_offset; /* start of data in array */
sector_t new_data_offset;/* only relevant while reshaping */
sector_t sb_start; /* offset of the super block (in 512byte sectors) */
int sb_size; /* bytes in the superblock */
int preferred_minor; /* autorun support */
struct kobject kobj;
/* A device can be in one of three states based on two flags:
* Not working: faulty==1 in_sync==0
* Fully working: faulty==0 in_sync==1
* Working, but not
* in sync with array
* faulty==0 in_sync==0
*
* It can never have faulty==1, in_sync==1
* This reduces the burden of testing multiple flags in many cases
*/
unsigned long flags; /* bit set of 'enum flag_bits' bits. */
wait_queue_head_t blocked_wait;
int desc_nr; /* descriptor index in the superblock */
int raid_disk; /* role of device in array */
int new_raid_disk; /* role that the device will have in
* the array after a level-change completes.
*/
int saved_raid_disk; /* role that device used to have in the
* array and could again if we did a partial
* resync from the bitmap
*/
sector_t recovery_offset;/* If this device has been partially
* recovered, this is where we were
* up to.
*/
atomic_t nr_pending; /* number of pending requests.
* only maintained for arrays that
* support hot removal
*/
atomic_t read_errors; /* number of consecutive read errors that
* we have tried to ignore.
*/
struct timespec last_read_error; /* monotonic time since our
* last read error
*/
atomic_t corrected_errors; /* number of corrected read errors,
* for reporting to userspace and storing
* in superblock.
*/
struct work_struct del_work; /* used for delayed sysfs removal */
struct kernfs_node *sysfs_state; /* handle for 'state'
* sysfs entry */
struct badblocks {
int count; /* count of bad blocks */
int unacked_exist; /* there probably are unacknowledged
* bad blocks. This is only cleared
* when a read discovers none
*/
int shift; /* shift from sectors to block size
* a -ve shift means badblocks are
* disabled.*/
u64 *page; /* badblock list */
int changed;
seqlock_t lock;
sector_t sector;
sector_t size; /* in sectors */
} badblocks;
};
enum flag_bits {
Faulty, /* device is known to have a fault */
In_sync, /* device is in_sync with rest of array */
Bitmap_sync, /* ..actually, not quite In_sync. Need a
* bitmap-based recovery to get fully in sync
*/
Unmerged, /* device is being added to array and should
* be considerred for bvec_merge_fn but not
* yet for actual IO
*/
WriteMostly, /* Avoid reading if at all possible */
AutoDetected, /* added by auto-detect */
Blocked, /* An error occurred but has not yet
* been acknowledged by the metadata
* handler, so don't allow writes
* until it is cleared */
WriteErrorSeen, /* A write error has been seen on this
* device
*/
FaultRecorded, /* Intermediate state for clearing
* Blocked. The Fault is/will-be
* recorded in the metadata, but that
* metadata hasn't been stored safely
* on disk yet.
*/
BlockedBadBlocks, /* A writer is blocked because they
* found an unacknowledged bad-block.
* This can safely be cleared at any
* time, and the writer will re-check.
* It may be set at any time, and at
* worst the writer will timeout and
* re-check. So setting it as
* accurately as possible is good, but
* not absolutely critical.
*/
WantReplacement, /* This device is a candidate to be
* hot-replaced, either because it has
* reported some faults, or because
* of explicit request.
*/
Replacement, /* This device is a replacement for
* a want_replacement device with same
* raid_disk number.
*/
};
#define BB_LEN_MASK (0x00000000000001FFULL)
#define BB_OFFSET_MASK (0x7FFFFFFFFFFFFE00ULL)
#define BB_ACK_MASK (0x8000000000000000ULL)
#define BB_MAX_LEN 512
#define BB_OFFSET(x) (((x) & BB_OFFSET_MASK) >> 9)
#define BB_LEN(x) (((x) & BB_LEN_MASK) + 1)
#define BB_ACK(x) (!!((x) & BB_ACK_MASK))
#define BB_MAKE(a, l, ack) (((a)<<9) | ((l)-1) | ((u64)(!!(ack)) << 63))
extern int md_is_badblock(struct badblocks *bb, sector_t s, int sectors,
sector_t *first_bad, int *bad_sectors);
static inline int is_badblock(struct md_rdev *rdev, sector_t s, int sectors,
sector_t *first_bad, int *bad_sectors)
{
if (unlikely(rdev->badblocks.count)) {
int rv = md_is_badblock(&rdev->badblocks, rdev->data_offset + s,
sectors,
first_bad, bad_sectors);
if (rv)
*first_bad -= rdev->data_offset;
return rv;
}
return 0;
}
extern int rdev_set_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
int is_new);
extern int rdev_clear_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
int is_new);
extern void md_ack_all_badblocks(struct badblocks *bb);
struct mddev {
void *private;
struct md_personality *pers;
dev_t unit;
int md_minor;
struct list_head disks;
unsigned long flags;
#define MD_CHANGE_DEVS 0 /* Some device status has changed */
#define MD_CHANGE_CLEAN 1 /* transition to or from 'clean' */
#define MD_CHANGE_PENDING 2 /* switch from 'clean' to 'active' in progress */
#define MD_UPDATE_SB_FLAGS (1 | 2 | 4) /* If these are set, md_update_sb needed */
#define MD_ARRAY_FIRST_USE 3 /* First use of array, needs initialization */
#define MD_STILL_CLOSED 4 /* If set, then array has not been opened since
* md_ioctl checked on it.
*/
int suspended;
atomic_t active_io;
int ro;
int sysfs_active; /* set when sysfs deletes
* are happening, so run/
* takeover/stop are not safe
*/
int ready; /* See when safe to pass
* IO requests down */
struct gendisk *gendisk;
struct kobject kobj;
int hold_active;
#define UNTIL_IOCTL 1
#define UNTIL_STOP 2
/* Superblock information */
int major_version,
minor_version,
patch_version;
int persistent;
int external; /* metadata is
* managed externally */
char metadata_type[17]; /* externally set*/
int chunk_sectors;
time_t ctime, utime;
int level, layout;
char clevel[16];
int raid_disks;
int max_disks;
sector_t dev_sectors; /* used size of
* component devices */
sector_t array_sectors; /* exported array size */
int external_size; /* size managed
* externally */
__u64 events;
/* If the last 'event' was simply a clean->dirty transition, and
* we didn't write it to the spares, then it is safe and simple
* to just decrement the event count on a dirty->clean transition.
* So we record that possibility here.
*/
int can_decrease_events;
char uuid[16];
/* If the array is being reshaped, we need to record the
* new shape and an indication of where we are up to.
* This is written to the superblock.
* If reshape_position is MaxSector, then no reshape is happening (yet).
*/
sector_t reshape_position;
int delta_disks, new_level, new_layout;
int new_chunk_sectors;
int reshape_backwards;
struct md_thread *thread; /* management thread */
struct md_thread *sync_thread; /* doing resync or reconstruct */
/* 'last_sync_action' is initialized to "none". It is set when a
* sync operation (i.e "data-check", "requested-resync", "resync",
* "recovery", or "reshape") is started. It holds this value even
* when the sync thread is "frozen" (interrupted) or "idle" (stopped
* or finished). It is overwritten when a new sync operation is begun.
*/
char *last_sync_action;
sector_t curr_resync; /* last block scheduled */
/* As resync requests can complete out of order, we cannot easily track
* how much resync has been completed. So we occasionally pause until
* everything completes, then set curr_resync_completed to curr_resync.
* As such it may be well behind the real resync mark, but it is a value
* we are certain of.
*/
sector_t curr_resync_completed;
unsigned long resync_mark; /* a recent timestamp */
sector_t resync_mark_cnt;/* blocks written at resync_mark */
sector_t curr_mark_cnt; /* blocks scheduled now */
sector_t resync_max_sectors; /* may be set by personality */
atomic64_t resync_mismatches; /* count of sectors where
* parity/replica mismatch found
*/
/* allow user-space to request suspension of IO to regions of the array */
sector_t suspend_lo;
sector_t suspend_hi;
/* if zero, use the system-wide default */
int sync_speed_min;
int sync_speed_max;
/* resync even though the same disks are shared among md-devices */
int parallel_resync;
int ok_start_degraded;
/* recovery/resync flags
* NEEDED: we might need to start a resync/recover
* RUNNING: a thread is running, or about to be started
* SYNC: actually doing a resync, not a recovery
* RECOVER: doing recovery, or need to try it.
* INTR: resync needs to be aborted for some reason
* DONE: thread is done and is waiting to be reaped
* REQUEST: user-space has requested a sync (used with SYNC)
* CHECK: user-space request for check-only, no repair
* RESHAPE: A reshape is happening
* ERROR: sync-action interrupted because io-error
*
* If neither SYNC or RESHAPE are set, then it is a recovery.
*/
#define MD_RECOVERY_RUNNING 0
#define MD_RECOVERY_SYNC 1
#define MD_RECOVERY_RECOVER 2
#define MD_RECOVERY_INTR 3
#define MD_RECOVERY_DONE 4
#define MD_RECOVERY_NEEDED 5
#define MD_RECOVERY_REQUESTED 6
#define MD_RECOVERY_CHECK 7
#define MD_RECOVERY_RESHAPE 8
#define MD_RECOVERY_FROZEN 9
#define MD_RECOVERY_ERROR 10
unsigned long recovery;
/* If a RAID personality determines that recovery (of a particular
* device) will fail due to a read error on the source device, it
* takes a copy of this number and does not attempt recovery again
* until this number changes.
*/
int recovery_disabled;
int in_sync; /* know to not need resync */
/* 'open_mutex' avoids races between 'md_open' and 'do_md_stop', so
* that we are never stopping an array while it is open.
* 'reconfig_mutex' protects all other reconfiguration.
* These locks are separate due to conflicting interactions
* with bdev->bd_mutex.
* Lock ordering is:
* reconfig_mutex -> bd_mutex : e.g. do_md_run -> revalidate_disk
* bd_mutex -> open_mutex: e.g. __blkdev_get -> md_open
*/
struct mutex open_mutex;
struct mutex reconfig_mutex;
atomic_t active; /* general refcount */
atomic_t openers; /* number of active opens */
int changed; /* True if we might need to
* reread partition info */
int degraded; /* whether md should consider
* adding a spare
*/
int merge_check_needed; /* at least one
* member device
* has a
* merge_bvec_fn */
atomic_t recovery_active; /* blocks scheduled, but not written */
wait_queue_head_t recovery_wait;
sector_t recovery_cp;
sector_t resync_min; /* user requested sync
* starts here */
sector_t resync_max; /* resync should pause
* when it gets here */
struct kernfs_node *sysfs_state; /* handle for 'array_state'
* file in sysfs.
*/
struct kernfs_node *sysfs_action; /* handle for 'sync_action' */
struct work_struct del_work; /* used for delayed sysfs removal */
spinlock_t write_lock;
wait_queue_head_t sb_wait; /* for waiting on superblock updates */
atomic_t pending_writes; /* number of active superblock writes */
unsigned int safemode; /* if set, update "clean" superblock
* when no writes pending.
*/
unsigned int safemode_delay;
struct timer_list safemode_timer;
atomic_t writes_pending;
struct request_queue *queue; /* for plugging ... */
struct bitmap *bitmap; /* the bitmap for the device */
struct {
struct file *file; /* the bitmap file */
loff_t offset; /* offset from superblock of
* start of bitmap. May be
* negative, but not '0'
* For external metadata, offset
* from start of device.
*/
unsigned long space; /* space available at this offset */
loff_t default_offset; /* this is the offset to use when
* hot-adding a bitmap. It should
* eventually be settable by sysfs.
*/
unsigned long default_space; /* space available at
* default offset */
struct mutex mutex;
unsigned long chunksize;
unsigned long daemon_sleep; /* how many jiffies between updates? */
unsigned long max_write_behind; /* write-behind mode */
int external;
} bitmap_info;
atomic_t max_corr_read_errors; /* max read retries */
struct list_head all_mddevs;
struct attribute_group *to_remove;
struct bio_set *bio_set;
/* Generic flush handling.
* The last to finish preflush schedules a worker to submit
* the rest of the request (without the REQ_FLUSH flag).
*/
struct bio *flush_bio;
atomic_t flush_pending;
struct work_struct flush_work;
struct work_struct event_work; /* used by dm to report failure event */
void (*sync_super)(struct mddev *mddev, struct md_rdev *rdev);
};
static inline void rdev_dec_pending(struct md_rdev *rdev, struct mddev *mddev)
{
int faulty = test_bit(Faulty, &rdev->flags);
if (atomic_dec_and_test(&rdev->nr_pending) && faulty)
set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
}
static inline void md_sync_acct(struct block_device *bdev, unsigned long nr_sectors)
{
atomic_add(nr_sectors, &bdev->bd_contains->bd_disk->sync_io);
}
struct md_personality
{
char *name;
int level;
struct list_head list;
struct module *owner;
void (*make_request)(struct mddev *mddev, struct bio *bio);
int (*run)(struct mddev *mddev);
int (*stop)(struct mddev *mddev);
void (*status)(struct seq_file *seq, struct mddev *mddev);
/* error_handler must set ->faulty and clear ->in_sync
* if appropriate, and should abort recovery if needed
*/
void (*error_handler)(struct mddev *mddev, struct md_rdev *rdev);
int (*hot_add_disk) (struct mddev *mddev, struct md_rdev *rdev);
int (*hot_remove_disk) (struct mddev *mddev, struct md_rdev *rdev);
int (*spare_active) (struct mddev *mddev);
sector_t (*sync_request)(struct mddev *mddev, sector_t sector_nr, int *skipped, int go_faster);
int (*resize) (struct mddev *mddev, sector_t sectors);
sector_t (*size) (struct mddev *mddev, sector_t sectors, int raid_disks);
int (*check_reshape) (struct mddev *mddev);
int (*start_reshape) (struct mddev *mddev);
void (*finish_reshape) (struct mddev *mddev);
/* quiesce moves between quiescence states
* 0 - fully active
* 1 - no new requests allowed
* others - reserved
*/
void (*quiesce) (struct mddev *mddev, int state);
/* takeover is used to transition an array from one
* personality to another. The new personality must be able
* to handle the data in the current layout.
* e.g. 2drive raid1 -> 2drive raid5
* ndrive raid5 -> degraded n+1drive raid6 with special layout
* If the takeover succeeds, a new 'private' structure is returned.
* This needs to be installed and then ->run used to activate the
* array.
*/
void *(*takeover) (struct mddev *mddev);
};
struct md_sysfs_entry {
struct attribute attr;
ssize_t (*show)(struct mddev *, char *);
ssize_t (*store)(struct mddev *, const char *, size_t);
};
extern struct attribute_group md_bitmap_group;
static inline struct kernfs_node *sysfs_get_dirent_safe(struct kernfs_node *sd, char *name)
{
if (sd)
return sysfs_get_dirent(sd, name);
return sd;
}
static inline void sysfs_notify_dirent_safe(struct kernfs_node *sd)
{
if (sd)
sysfs_notify_dirent(sd);
}
static inline char * mdname (struct mddev * mddev)
{
return mddev->gendisk ? mddev->gendisk->disk_name : "mdX";
}
static inline int sysfs_link_rdev(struct mddev *mddev, struct md_rdev *rdev)
{
char nm[20];
if (!test_bit(Replacement, &rdev->flags) && mddev->kobj.sd) {
sprintf(nm, "rd%d", rdev->raid_disk);
return sysfs_create_link(&mddev->kobj, &rdev->kobj, nm);
} else
return 0;
}
static inline void sysfs_unlink_rdev(struct mddev *mddev, struct md_rdev *rdev)
{
char nm[20];
if (!test_bit(Replacement, &rdev->flags) && mddev->kobj.sd) {
sprintf(nm, "rd%d", rdev->raid_disk);
sysfs_remove_link(&mddev->kobj, nm);
}
}
/*
* iterates through some rdev ringlist. It's safe to remove the
* current 'rdev'. Dont touch 'tmp' though.
*/
#define rdev_for_each_list(rdev, tmp, head) \
list_for_each_entry_safe(rdev, tmp, head, same_set)
/*
* iterates through the 'same array disks' ringlist
*/
#define rdev_for_each(rdev, mddev) \
list_for_each_entry(rdev, &((mddev)->disks), same_set)
#define rdev_for_each_safe(rdev, tmp, mddev) \
list_for_each_entry_safe(rdev, tmp, &((mddev)->disks), same_set)
#define rdev_for_each_rcu(rdev, mddev) \
list_for_each_entry_rcu(rdev, &((mddev)->disks), same_set)
struct md_thread {
void (*run) (struct md_thread *thread);
struct mddev *mddev;
wait_queue_head_t wqueue;
unsigned long flags;
struct task_struct *tsk;
unsigned long timeout;
void *private;
};
#define THREAD_WAKEUP 0
static inline void safe_put_page(struct page *p)
{
if (p) put_page(p);
}
extern int register_md_personality(struct md_personality *p);
extern int unregister_md_personality(struct md_personality *p);
extern struct md_thread *md_register_thread(
void (*run)(struct md_thread *thread),
struct mddev *mddev,
const char *name);
extern void md_unregister_thread(struct md_thread **threadp);
extern void md_wakeup_thread(struct md_thread *thread);
extern void md_check_recovery(struct mddev *mddev);
extern void md_reap_sync_thread(struct mddev *mddev);
extern void md_write_start(struct mddev *mddev, struct bio *bi);
extern void md_write_end(struct mddev *mddev);
extern void md_done_sync(struct mddev *mddev, int blocks, int ok);
extern void md_error(struct mddev *mddev, struct md_rdev *rdev);
extern void md_finish_reshape(struct mddev *mddev);
extern int mddev_congested(struct mddev *mddev, int bits);
extern void md_flush_request(struct mddev *mddev, struct bio *bio);
extern void md_super_write(struct mddev *mddev, struct md_rdev *rdev,
sector_t sector, int size, struct page *page);
extern void md_super_wait(struct mddev *mddev);
extern int sync_page_io(struct md_rdev *rdev, sector_t sector, int size,
struct page *page, int rw, bool metadata_op);
extern void md_do_sync(struct md_thread *thread);
extern void md_new_event(struct mddev *mddev);
extern int md_allow_write(struct mddev *mddev);
extern void md_wait_for_blocked_rdev(struct md_rdev *rdev, struct mddev *mddev);
extern void md_set_array_sectors(struct mddev *mddev, sector_t array_sectors);
extern int md_check_no_bitmap(struct mddev *mddev);
extern int md_integrity_register(struct mddev *mddev);
extern void md_integrity_add_rdev(struct md_rdev *rdev, struct mddev *mddev);
extern int strict_strtoul_scaled(const char *cp, unsigned long *res, int scale);
extern void mddev_init(struct mddev *mddev);
extern int md_run(struct mddev *mddev);
extern void md_stop(struct mddev *mddev);
extern void md_stop_writes(struct mddev *mddev);
extern int md_rdev_init(struct md_rdev *rdev);
extern void md_rdev_clear(struct md_rdev *rdev);
extern void mddev_suspend(struct mddev *mddev);
extern void mddev_resume(struct mddev *mddev);
extern struct bio *bio_clone_mddev(struct bio *bio, gfp_t gfp_mask,
struct mddev *mddev);
extern struct bio *bio_alloc_mddev(gfp_t gfp_mask, int nr_iovecs,
struct mddev *mddev);
extern void md_unplug(struct blk_plug_cb *cb, bool from_schedule);
static inline int mddev_check_plugged(struct mddev *mddev)
{
return !!blk_check_plugged(md_unplug, mddev,
sizeof(struct blk_plug_cb));
}
#endif /* _MD_MD_H */

554
drivers/md/multipath.c Normal file
View file

@ -0,0 +1,554 @@
/*
* multipath.c : Multiple Devices driver for Linux
*
* Copyright (C) 1999, 2000, 2001 Ingo Molnar, Red Hat
*
* Copyright (C) 1996, 1997, 1998 Ingo Molnar, Miguel de Icaza, Gadi Oxman
*
* MULTIPATH management functions.
*
* derived from raid1.c.
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2, or (at your option)
* any later version.
*
* You should have received a copy of the GNU General Public License
* (for example /usr/src/linux/COPYING); if not, write to the Free
* Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
*/
#include <linux/blkdev.h>
#include <linux/module.h>
#include <linux/raid/md_u.h>
#include <linux/seq_file.h>
#include <linux/slab.h>
#include "md.h"
#include "multipath.h"
#define MAX_WORK_PER_DISK 128
#define NR_RESERVED_BUFS 32
static int multipath_map (struct mpconf *conf)
{
int i, disks = conf->raid_disks;
/*
* Later we do read balancing on the read side
* now we use the first available disk.
*/
rcu_read_lock();
for (i = 0; i < disks; i++) {
struct md_rdev *rdev = rcu_dereference(conf->multipaths[i].rdev);
if (rdev && test_bit(In_sync, &rdev->flags)) {
atomic_inc(&rdev->nr_pending);
rcu_read_unlock();
return i;
}
}
rcu_read_unlock();
printk(KERN_ERR "multipath_map(): no more operational IO paths?\n");
return (-1);
}
static void multipath_reschedule_retry (struct multipath_bh *mp_bh)
{
unsigned long flags;
struct mddev *mddev = mp_bh->mddev;
struct mpconf *conf = mddev->private;
spin_lock_irqsave(&conf->device_lock, flags);
list_add(&mp_bh->retry_list, &conf->retry_list);
spin_unlock_irqrestore(&conf->device_lock, flags);
md_wakeup_thread(mddev->thread);
}
/*
* multipath_end_bh_io() is called when we have finished servicing a multipathed
* operation and are ready to return a success/failure code to the buffer
* cache layer.
*/
static void multipath_end_bh_io (struct multipath_bh *mp_bh, int err)
{
struct bio *bio = mp_bh->master_bio;
struct mpconf *conf = mp_bh->mddev->private;
bio_endio(bio, err);
mempool_free(mp_bh, conf->pool);
}
static void multipath_end_request(struct bio *bio, int error)
{
int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct multipath_bh *mp_bh = bio->bi_private;
struct mpconf *conf = mp_bh->mddev->private;
struct md_rdev *rdev = conf->multipaths[mp_bh->path].rdev;
if (uptodate)
multipath_end_bh_io(mp_bh, 0);
else if (!(bio->bi_rw & REQ_RAHEAD)) {
/*
* oops, IO error:
*/
char b[BDEVNAME_SIZE];
md_error (mp_bh->mddev, rdev);
printk(KERN_ERR "multipath: %s: rescheduling sector %llu\n",
bdevname(rdev->bdev,b),
(unsigned long long)bio->bi_iter.bi_sector);
multipath_reschedule_retry(mp_bh);
} else
multipath_end_bh_io(mp_bh, error);
rdev_dec_pending(rdev, conf->mddev);
}
static void multipath_make_request(struct mddev *mddev, struct bio * bio)
{
struct mpconf *conf = mddev->private;
struct multipath_bh * mp_bh;
struct multipath_info *multipath;
if (unlikely(bio->bi_rw & REQ_FLUSH)) {
md_flush_request(mddev, bio);
return;
}
mp_bh = mempool_alloc(conf->pool, GFP_NOIO);
mp_bh->master_bio = bio;
mp_bh->mddev = mddev;
mp_bh->path = multipath_map(conf);
if (mp_bh->path < 0) {
bio_endio(bio, -EIO);
mempool_free(mp_bh, conf->pool);
return;
}
multipath = conf->multipaths + mp_bh->path;
mp_bh->bio = *bio;
mp_bh->bio.bi_iter.bi_sector += multipath->rdev->data_offset;
mp_bh->bio.bi_bdev = multipath->rdev->bdev;
mp_bh->bio.bi_rw |= REQ_FAILFAST_TRANSPORT;
mp_bh->bio.bi_end_io = multipath_end_request;
mp_bh->bio.bi_private = mp_bh;
generic_make_request(&mp_bh->bio);
return;
}
static void multipath_status (struct seq_file *seq, struct mddev *mddev)
{
struct mpconf *conf = mddev->private;
int i;
seq_printf (seq, " [%d/%d] [", conf->raid_disks,
conf->raid_disks - mddev->degraded);
for (i = 0; i < conf->raid_disks; i++)
seq_printf (seq, "%s",
conf->multipaths[i].rdev &&
test_bit(In_sync, &conf->multipaths[i].rdev->flags) ? "U" : "_");
seq_printf (seq, "]");
}
static int multipath_congested(void *data, int bits)
{
struct mddev *mddev = data;
struct mpconf *conf = mddev->private;
int i, ret = 0;
if (mddev_congested(mddev, bits))
return 1;
rcu_read_lock();
for (i = 0; i < mddev->raid_disks ; i++) {
struct md_rdev *rdev = rcu_dereference(conf->multipaths[i].rdev);
if (rdev && !test_bit(Faulty, &rdev->flags)) {
struct request_queue *q = bdev_get_queue(rdev->bdev);
ret |= bdi_congested(&q->backing_dev_info, bits);
/* Just like multipath_map, we just check the
* first available device
*/
break;
}
}
rcu_read_unlock();
return ret;
}
/*
* Careful, this can execute in IRQ contexts as well!
*/
static void multipath_error (struct mddev *mddev, struct md_rdev *rdev)
{
struct mpconf *conf = mddev->private;
char b[BDEVNAME_SIZE];
if (conf->raid_disks - mddev->degraded <= 1) {
/*
* Uh oh, we can do nothing if this is our last path, but
* first check if this is a queued request for a device
* which has just failed.
*/
printk(KERN_ALERT
"multipath: only one IO path left and IO error.\n");
/* leave it active... it's all we have */
return;
}
/*
* Mark disk as unusable
*/
if (test_and_clear_bit(In_sync, &rdev->flags)) {
unsigned long flags;
spin_lock_irqsave(&conf->device_lock, flags);
mddev->degraded++;
spin_unlock_irqrestore(&conf->device_lock, flags);
}
set_bit(Faulty, &rdev->flags);
set_bit(MD_CHANGE_DEVS, &mddev->flags);
printk(KERN_ALERT "multipath: IO failure on %s,"
" disabling IO path.\n"
"multipath: Operation continuing"
" on %d IO paths.\n",
bdevname(rdev->bdev, b),
conf->raid_disks - mddev->degraded);
}
static void print_multipath_conf (struct mpconf *conf)
{
int i;
struct multipath_info *tmp;
printk("MULTIPATH conf printout:\n");
if (!conf) {
printk("(conf==NULL)\n");
return;
}
printk(" --- wd:%d rd:%d\n", conf->raid_disks - conf->mddev->degraded,
conf->raid_disks);
for (i = 0; i < conf->raid_disks; i++) {
char b[BDEVNAME_SIZE];
tmp = conf->multipaths + i;
if (tmp->rdev)
printk(" disk%d, o:%d, dev:%s\n",
i,!test_bit(Faulty, &tmp->rdev->flags),
bdevname(tmp->rdev->bdev,b));
}
}
static int multipath_add_disk(struct mddev *mddev, struct md_rdev *rdev)
{
struct mpconf *conf = mddev->private;
struct request_queue *q;
int err = -EEXIST;
int path;
struct multipath_info *p;
int first = 0;
int last = mddev->raid_disks - 1;
if (rdev->raid_disk >= 0)
first = last = rdev->raid_disk;
print_multipath_conf(conf);
for (path = first; path <= last; path++)
if ((p=conf->multipaths+path)->rdev == NULL) {
q = rdev->bdev->bd_disk->queue;
disk_stack_limits(mddev->gendisk, rdev->bdev,
rdev->data_offset << 9);
/* as we don't honour merge_bvec_fn, we must never risk
* violating it, so limit ->max_segments to one, lying
* within a single page.
* (Note: it is very unlikely that a device with
* merge_bvec_fn will be involved in multipath.)
*/
if (q->merge_bvec_fn) {
blk_queue_max_segments(mddev->queue, 1);
blk_queue_segment_boundary(mddev->queue,
PAGE_CACHE_SIZE - 1);
}
spin_lock_irq(&conf->device_lock);
mddev->degraded--;
rdev->raid_disk = path;
set_bit(In_sync, &rdev->flags);
spin_unlock_irq(&conf->device_lock);
rcu_assign_pointer(p->rdev, rdev);
err = 0;
md_integrity_add_rdev(rdev, mddev);
break;
}
print_multipath_conf(conf);
return err;
}
static int multipath_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
{
struct mpconf *conf = mddev->private;
int err = 0;
int number = rdev->raid_disk;
struct multipath_info *p = conf->multipaths + number;
print_multipath_conf(conf);
if (rdev == p->rdev) {
if (test_bit(In_sync, &rdev->flags) ||
atomic_read(&rdev->nr_pending)) {
printk(KERN_ERR "hot-remove-disk, slot %d is identified"
" but is still operational!\n", number);
err = -EBUSY;
goto abort;
}
p->rdev = NULL;
synchronize_rcu();
if (atomic_read(&rdev->nr_pending)) {
/* lost the race, try later */
err = -EBUSY;
p->rdev = rdev;
goto abort;
}
err = md_integrity_register(mddev);
}
abort:
print_multipath_conf(conf);
return err;
}
/*
* This is a kernel thread which:
*
* 1. Retries failed read operations on working multipaths.
* 2. Updates the raid superblock when problems encounter.
* 3. Performs writes following reads for array syncronising.
*/
static void multipathd(struct md_thread *thread)
{
struct mddev *mddev = thread->mddev;
struct multipath_bh *mp_bh;
struct bio *bio;
unsigned long flags;
struct mpconf *conf = mddev->private;
struct list_head *head = &conf->retry_list;
md_check_recovery(mddev);
for (;;) {
char b[BDEVNAME_SIZE];
spin_lock_irqsave(&conf->device_lock, flags);
if (list_empty(head))
break;
mp_bh = list_entry(head->prev, struct multipath_bh, retry_list);
list_del(head->prev);
spin_unlock_irqrestore(&conf->device_lock, flags);
bio = &mp_bh->bio;
bio->bi_iter.bi_sector = mp_bh->master_bio->bi_iter.bi_sector;
if ((mp_bh->path = multipath_map (conf))<0) {
printk(KERN_ALERT "multipath: %s: unrecoverable IO read"
" error for block %llu\n",
bdevname(bio->bi_bdev,b),
(unsigned long long)bio->bi_iter.bi_sector);
multipath_end_bh_io(mp_bh, -EIO);
} else {
printk(KERN_ERR "multipath: %s: redirecting sector %llu"
" to another IO path\n",
bdevname(bio->bi_bdev,b),
(unsigned long long)bio->bi_iter.bi_sector);
*bio = *(mp_bh->master_bio);
bio->bi_iter.bi_sector +=
conf->multipaths[mp_bh->path].rdev->data_offset;
bio->bi_bdev = conf->multipaths[mp_bh->path].rdev->bdev;
bio->bi_rw |= REQ_FAILFAST_TRANSPORT;
bio->bi_end_io = multipath_end_request;
bio->bi_private = mp_bh;
generic_make_request(bio);
}
}
spin_unlock_irqrestore(&conf->device_lock, flags);
}
static sector_t multipath_size(struct mddev *mddev, sector_t sectors, int raid_disks)
{
WARN_ONCE(sectors || raid_disks,
"%s does not support generic reshape\n", __func__);
return mddev->dev_sectors;
}
static int multipath_run (struct mddev *mddev)
{
struct mpconf *conf;
int disk_idx;
struct multipath_info *disk;
struct md_rdev *rdev;
int working_disks;
if (md_check_no_bitmap(mddev))
return -EINVAL;
if (mddev->level != LEVEL_MULTIPATH) {
printk("multipath: %s: raid level not set to multipath IO (%d)\n",
mdname(mddev), mddev->level);
goto out;
}
/*
* copy the already verified devices into our private MULTIPATH
* bookkeeping area. [whatever we allocate in multipath_run(),
* should be freed in multipath_stop()]
*/
conf = kzalloc(sizeof(struct mpconf), GFP_KERNEL);
mddev->private = conf;
if (!conf) {
printk(KERN_ERR
"multipath: couldn't allocate memory for %s\n",
mdname(mddev));
goto out;
}
conf->multipaths = kzalloc(sizeof(struct multipath_info)*mddev->raid_disks,
GFP_KERNEL);
if (!conf->multipaths) {
printk(KERN_ERR
"multipath: couldn't allocate memory for %s\n",
mdname(mddev));
goto out_free_conf;
}
working_disks = 0;
rdev_for_each(rdev, mddev) {
disk_idx = rdev->raid_disk;
if (disk_idx < 0 ||
disk_idx >= mddev->raid_disks)
continue;
disk = conf->multipaths + disk_idx;
disk->rdev = rdev;
disk_stack_limits(mddev->gendisk, rdev->bdev,
rdev->data_offset << 9);
/* as we don't honour merge_bvec_fn, we must never risk
* violating it, not that we ever expect a device with
* a merge_bvec_fn to be involved in multipath */
if (rdev->bdev->bd_disk->queue->merge_bvec_fn) {
blk_queue_max_segments(mddev->queue, 1);
blk_queue_segment_boundary(mddev->queue,
PAGE_CACHE_SIZE - 1);
}
if (!test_bit(Faulty, &rdev->flags))
working_disks++;
}
conf->raid_disks = mddev->raid_disks;
conf->mddev = mddev;
spin_lock_init(&conf->device_lock);
INIT_LIST_HEAD(&conf->retry_list);
if (!working_disks) {
printk(KERN_ERR "multipath: no operational IO paths for %s\n",
mdname(mddev));
goto out_free_conf;
}
mddev->degraded = conf->raid_disks - working_disks;
conf->pool = mempool_create_kmalloc_pool(NR_RESERVED_BUFS,
sizeof(struct multipath_bh));
if (conf->pool == NULL) {
printk(KERN_ERR
"multipath: couldn't allocate memory for %s\n",
mdname(mddev));
goto out_free_conf;
}
{
mddev->thread = md_register_thread(multipathd, mddev,
"multipath");
if (!mddev->thread) {
printk(KERN_ERR "multipath: couldn't allocate thread"
" for %s\n", mdname(mddev));
goto out_free_conf;
}
}
printk(KERN_INFO
"multipath: array %s active with %d out of %d IO paths\n",
mdname(mddev), conf->raid_disks - mddev->degraded,
mddev->raid_disks);
/*
* Ok, everything is just fine now
*/
md_set_array_sectors(mddev, multipath_size(mddev, 0, 0));
mddev->queue->backing_dev_info.congested_fn = multipath_congested;
mddev->queue->backing_dev_info.congested_data = mddev;
if (md_integrity_register(mddev))
goto out_free_conf;
return 0;
out_free_conf:
if (conf->pool)
mempool_destroy(conf->pool);
kfree(conf->multipaths);
kfree(conf);
mddev->private = NULL;
out:
return -EIO;
}
static int multipath_stop (struct mddev *mddev)
{
struct mpconf *conf = mddev->private;
md_unregister_thread(&mddev->thread);
blk_sync_queue(mddev->queue); /* the unplug fn references 'conf'*/
mempool_destroy(conf->pool);
kfree(conf->multipaths);
kfree(conf);
mddev->private = NULL;
return 0;
}
static struct md_personality multipath_personality =
{
.name = "multipath",
.level = LEVEL_MULTIPATH,
.owner = THIS_MODULE,
.make_request = multipath_make_request,
.run = multipath_run,
.stop = multipath_stop,
.status = multipath_status,
.error_handler = multipath_error,
.hot_add_disk = multipath_add_disk,
.hot_remove_disk= multipath_remove_disk,
.size = multipath_size,
};
static int __init multipath_init (void)
{
return register_md_personality (&multipath_personality);
}
static void __exit multipath_exit (void)
{
unregister_md_personality (&multipath_personality);
}
module_init(multipath_init);
module_exit(multipath_exit);
MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("simple multi-path personality for MD");
MODULE_ALIAS("md-personality-7"); /* MULTIPATH */
MODULE_ALIAS("md-multipath");
MODULE_ALIAS("md-level--4");

31
drivers/md/multipath.h Normal file
View file

@ -0,0 +1,31 @@
#ifndef _MULTIPATH_H
#define _MULTIPATH_H
struct multipath_info {
struct md_rdev *rdev;
};
struct mpconf {
struct mddev *mddev;
struct multipath_info *multipaths;
int raid_disks;
spinlock_t device_lock;
struct list_head retry_list;
mempool_t *pool;
};
/*
* this is our 'private' 'collective' MULTIPATH buffer head.
* it contains information about what kind of IO operations were started
* for this MULTIPATH operation, and about their status:
*/
struct multipath_bh {
struct mddev *mddev;
struct bio *master_bio;
struct bio bio;
int path;
struct list_head retry_list;
};
#endif

Some files were not shown because too many files have changed in this diff Show more