Fixed MTP to work with TWRP

This commit is contained in:
awab228 2018-06-19 23:16:04 +02:00
commit f6dfaef42e
50820 changed files with 20846062 additions and 0 deletions

View file

@ -0,0 +1,28 @@
00-INDEX
- This file
biodoc.txt
- Notes on the Generic Block Layer Rewrite in Linux 2.5
capability.txt
- Generic Block Device Capability (/sys/block/<device>/capability)
cfq-iosched.txt
- CFQ IO scheduler tunables
cmdline-partition.txt
- how to specify block device partitions on kernel command line
data-integrity.txt
- Block data integrity
deadline-iosched.txt
- Deadline IO scheduler tunables
ioprio.txt
- Block io priorities (in CFQ scheduler)
null_blk.txt
- Null block for block-layer benchmarking.
queue-sysfs.txt
- Queue's sysfs entries
request.txt
- The members of struct request (in include/linux/blkdev.h)
stat.txt
- Block layer statistics in /sys/block/<device>/stat
switching-sched.txt
- Switching I/O schedulers at runtime
writeback_cache_control.txt
- Control of volatile write back caches

File diff suppressed because it is too large Load diff

View file

@ -0,0 +1,111 @@
Immutable biovecs and biovec iterators:
=======================================
Kent Overstreet <kmo@daterainc.com>
As of 3.13, biovecs should never be modified after a bio has been submitted.
Instead, we have a new struct bvec_iter which represents a range of a biovec -
the iterator will be modified as the bio is completed, not the biovec.
More specifically, old code that needed to partially complete a bio would
update bi_sector and bi_size, and advance bi_idx to the next biovec. If it
ended up partway through a biovec, it would increment bv_offset and decrement
bv_len by the number of bytes completed in that biovec.
In the new scheme of things, everything that must be mutated in order to
partially complete a bio is segregated into struct bvec_iter: bi_sector,
bi_size and bi_idx have been moved there; and instead of modifying bv_offset
and bv_len, struct bvec_iter has bi_bvec_done, which represents the number of
bytes completed in the current bvec.
There are a bunch of new helper macros for hiding the gory details - in
particular, presenting the illusion of partially completed biovecs so that
normal code doesn't have to deal with bi_bvec_done.
* Driver code should no longer refer to biovecs directly; we now have
bio_iovec() and bio_iovec_iter() macros that return literal struct biovecs,
constructed from the raw biovecs but taking into account bi_bvec_done and
bi_size.
bio_for_each_segment() has been updated to take a bvec_iter argument
instead of an integer (that corresponded to bi_idx); for a lot of code the
conversion just required changing the types of the arguments to
bio_for_each_segment().
* Advancing a bvec_iter is done with bio_advance_iter(); bio_advance() is a
wrapper around bio_advance_iter() that operates on bio->bi_iter, and also
advances the bio integrity's iter if present.
There is a lower level advance function - bvec_iter_advance() - which takes
a pointer to a biovec, not a bio; this is used by the bio integrity code.
What's all this get us?
=======================
Having a real iterator, and making biovecs immutable, has a number of
advantages:
* Before, iterating over bios was very awkward when you weren't processing
exactly one bvec at a time - for example, bio_copy_data() in fs/bio.c,
which copies the contents of one bio into another. Because the biovecs
wouldn't necessarily be the same size, the old code was tricky convoluted -
it had to walk two different bios at the same time, keeping both bi_idx and
and offset into the current biovec for each.
The new code is much more straightforward - have a look. This sort of
pattern comes up in a lot of places; a lot of drivers were essentially open
coding bvec iterators before, and having common implementation considerably
simplifies a lot of code.
* Before, any code that might need to use the biovec after the bio had been
completed (perhaps to copy the data somewhere else, or perhaps to resubmit
it somewhere else if there was an error) had to save the entire bvec array
- again, this was being done in a fair number of places.
* Biovecs can be shared between multiple bios - a bvec iter can represent an
arbitrary range of an existing biovec, both starting and ending midway
through biovecs. This is what enables efficient splitting of arbitrary
bios. Note that this means we _only_ use bi_size to determine when we've
reached the end of a bio, not bi_vcnt - and the bio_iovec() macro takes
bi_size into account when constructing biovecs.
* Splitting bios is now much simpler. The old bio_split() didn't even work on
bios with more than a single bvec! Now, we can efficiently split arbitrary
size bios - because the new bio can share the old bio's biovec.
Care must be taken to ensure the biovec isn't freed while the split bio is
still using it, in case the original bio completes first, though. Using
bio_chain() when splitting bios helps with this.
* Submitting partially completed bios is now perfectly fine - this comes up
occasionally in stacking block drivers and various code (e.g. md and
bcache) had some ugly workarounds for this.
It used to be the case that submitting a partially completed bio would work
fine to _most_ devices, but since accessing the raw bvec array was the
norm, not all drivers would respect bi_idx and those would break. Now,
since all drivers _must_ go through the bvec iterator - and have been
audited to make sure they are - submitting partially completed bios is
perfectly fine.
Other implications:
===================
* Almost all usage of bi_idx is now incorrect and has been removed; instead,
where previously you would have used bi_idx you'd now use a bvec_iter,
probably passing it to one of the helper macros.
I.e. instead of using bio_iovec_idx() (or bio->bi_iovec[bio->bi_idx]), you
now use bio_iter_iovec(), which takes a bvec_iter and returns a
literal struct bio_vec - constructed on the fly from the raw biovec but
taking into account bi_bvec_done (and bi_size).
* bi_vcnt can't be trusted or relied upon by driver code - i.e. anything that
doesn't actually own the bio. The reason is twofold: firstly, it's not
actually needed for iterating over the bio anymore - we only use bi_size.
Secondly, when cloning a bio and reusing (a portion of) the original bio's
biovec, in order to calculate bi_vcnt for the new bio we'd have to iterate
over all the biovecs in the new bio - which is silly as it's not needed.
So, don't use bi_vcnt anymore.

View file

@ -0,0 +1,15 @@
Generic Block Device Capability
===============================================================================
This file documents the sysfs file block/<disk>/capability
capability is a hex word indicating which capabilities a specific disk
supports. For more information on bits not listed here, see
include/linux/genhd.h
Capability Value
-------------------------------------------------------------------------------
GENHD_FL_MEDIA_CHANGE_NOTIFY 4
When this bit is set, the disk supports Asynchronous Notification
of media change events. These events will be broadcast to user
space via kernel uevent.

View file

@ -0,0 +1,292 @@
CFQ (Complete Fairness Queueing)
===============================
The main aim of CFQ scheduler is to provide a fair allocation of the disk
I/O bandwidth for all the processes which requests an I/O operation.
CFQ maintains the per process queue for the processes which request I/O
operation(synchronous requests). In case of asynchronous requests, all the
requests from all the processes are batched together according to their
process's I/O priority.
CFQ ioscheduler tunables
========================
slice_idle
----------
This specifies how long CFQ should idle for next request on certain cfq queues
(for sequential workloads) and service trees (for random workloads) before
queue is expired and CFQ selects next queue to dispatch from.
By default slice_idle is a non-zero value. That means by default we idle on
queues/service trees. This can be very helpful on highly seeky media like
single spindle SATA/SAS disks where we can cut down on overall number of
seeks and see improved throughput.
Setting slice_idle to 0 will remove all the idling on queues/service tree
level and one should see an overall improved throughput on faster storage
devices like multiple SATA/SAS disks in hardware RAID configuration. The down
side is that isolation provided from WRITES also goes down and notion of
IO priority becomes weaker.
So depending on storage and workload, it might be useful to set slice_idle=0.
In general I think for SATA/SAS disks and software RAID of SATA/SAS disks
keeping slice_idle enabled should be useful. For any configurations where
there are multiple spindles behind single LUN (Host based hardware RAID
controller or for storage arrays), setting slice_idle=0 might end up in better
throughput and acceptable latencies.
back_seek_max
-------------
This specifies, given in Kbytes, the maximum "distance" for backward seeking.
The distance is the amount of space from the current head location to the
sectors that are backward in terms of distance.
This parameter allows the scheduler to anticipate requests in the "backward"
direction and consider them as being the "next" if they are within this
distance from the current head location.
back_seek_penalty
-----------------
This parameter is used to compute the cost of backward seeking. If the
backward distance of request is just 1/back_seek_penalty from a "front"
request, then the seeking cost of two requests is considered equivalent.
So scheduler will not bias toward one or the other request (otherwise scheduler
will bias toward front request). Default value of back_seek_penalty is 2.
fifo_expire_async
-----------------
This parameter is used to set the timeout of asynchronous requests. Default
value of this is 248ms.
fifo_expire_sync
----------------
This parameter is used to set the timeout of synchronous requests. Default
value of this is 124ms. In case to favor synchronous requests over asynchronous
one, this value should be decreased relative to fifo_expire_async.
group_idle
-----------
This parameter forces idling at the CFQ group level instead of CFQ
queue level. This was introduced after a bottleneck was observed
in higher end storage due to idle on sequential queue and allow dispatch
from a single queue. The idea with this parameter is that it can be run with
slice_idle=0 and group_idle=8, so that idling does not happen on individual
queues in the group but happens overall on the group and thus still keeps the
IO controller working.
Not idling on individual queues in the group will dispatch requests from
multiple queues in the group at the same time and achieve higher throughput
on higher end storage.
Default value for this parameter is 8ms.
latency
-------
This parameter is used to enable/disable the latency mode of the CFQ
scheduler. If latency mode (called low_latency) is enabled, CFQ tries
to recompute the slice time for each process based on the target_latency set
for the system. This favors fairness over throughput. Disabling low
latency (setting it to 0) ignores target latency, allowing each process in the
system to get a full time slice.
By default low latency mode is enabled.
target_latency
--------------
This parameter is used to calculate the time slice for a process if cfq's
latency mode is enabled. It will ensure that sync requests have an estimated
latency. But if sequential workload is higher(e.g. sequential read),
then to meet the latency constraints, throughput may decrease because of less
time for each process to issue I/O request before the cfq queue is switched.
Though this can be overcome by disabling the latency_mode, it may increase
the read latency for some applications. This parameter allows for changing
target_latency through the sysfs interface which can provide the balanced
throughput and read latency.
Default value for target_latency is 300ms.
slice_async
-----------
This parameter is same as of slice_sync but for asynchronous queue. The
default value is 40ms.
slice_async_rq
--------------
This parameter is used to limit the dispatching of asynchronous request to
device request queue in queue's slice time. The maximum number of request that
are allowed to be dispatched also depends upon the io priority. Default value
for this is 2.
slice_sync
----------
When a queue is selected for execution, the queues IO requests are only
executed for a certain amount of time(time_slice) before switching to another
queue. This parameter is used to calculate the time slice of synchronous
queue.
time_slice is computed using the below equation:-
time_slice = slice_sync + (slice_sync/5 * (4 - prio)). To increase the
time_slice of synchronous queue, increase the value of slice_sync. Default
value is 100ms.
quantum
-------
This specifies the number of request dispatched to the device queue. In a
queue's time slice, a request will not be dispatched if the number of request
in the device exceeds this parameter. This parameter is used for synchronous
request.
In case of storage with several disk, this setting can limit the parallel
processing of request. Therefore, increasing the value can improve the
performance although this can cause the latency of some I/O to increase due
to more number of requests.
CFQ Group scheduling
====================
CFQ supports blkio cgroup and has "blkio." prefixed files in each
blkio cgroup directory. It is weight-based and there are four knobs
for configuration - weight[_device] and leaf_weight[_device].
Internal cgroup nodes (the ones with children) can also have tasks in
them, so the former two configure how much proportion the cgroup as a
whole is entitled to at its parent's level while the latter two
configure how much proportion the tasks in the cgroup have compared to
its direct children.
Another way to think about it is assuming that each internal node has
an implicit leaf child node which hosts all the tasks whose weight is
configured by leaf_weight[_device]. Let's assume a blkio hierarchy
composed of five cgroups - root, A, B, AA and AB - with the following
weights where the names represent the hierarchy.
weight leaf_weight
root : 125 125
A : 500 750
B : 250 500
AA : 500 500
AB : 1000 500
root never has a parent making its weight is meaningless. For backward
compatibility, weight is always kept in sync with leaf_weight. B, AA
and AB have no child and thus its tasks have no children cgroup to
compete with. They always get 100% of what the cgroup won at the
parent level. Considering only the weights which matter, the hierarchy
looks like the following.
root
/ | \
A B leaf
500 250 125
/ | \
AA AB leaf
500 1000 750
If all cgroups have active IOs and competing with each other, disk
time will be distributed like the following.
Distribution below root. The total active weight at this level is
A:500 + B:250 + C:125 = 875.
root-leaf : 125 / 875 =~ 14%
A : 500 / 875 =~ 57%
B(-leaf) : 250 / 875 =~ 28%
A has children and further distributes its 57% among the children and
the implicit leaf node. The total active weight at this level is
AA:500 + AB:1000 + A-leaf:750 = 2250.
A-leaf : ( 750 / 2250) * A =~ 19%
AA(-leaf) : ( 500 / 2250) * A =~ 12%
AB(-leaf) : (1000 / 2250) * A =~ 25%
CFQ IOPS Mode for group scheduling
===================================
Basic CFQ design is to provide priority based time slices. Higher priority
process gets bigger time slice and lower priority process gets smaller time
slice. Measuring time becomes harder if storage is fast and supports NCQ and
it would be better to dispatch multiple requests from multiple cfq queues in
request queue at a time. In such scenario, it is not possible to measure time
consumed by single queue accurately.
What is possible though is to measure number of requests dispatched from a
single queue and also allow dispatch from multiple cfq queue at the same time.
This effectively becomes the fairness in terms of IOPS (IO operations per
second).
If one sets slice_idle=0 and if storage supports NCQ, CFQ internally switches
to IOPS mode and starts providing fairness in terms of number of requests
dispatched. Note that this mode switching takes effect only for group
scheduling. For non-cgroup users nothing should change.
CFQ IO scheduler Idling Theory
===============================
Idling on a queue is primarily about waiting for the next request to come
on same queue after completion of a request. In this process CFQ will not
dispatch requests from other cfq queues even if requests are pending there.
The rationale behind idling is that it can cut down on number of seeks
on rotational media. For example, if a process is doing dependent
sequential reads (next read will come on only after completion of previous
one), then not dispatching request from other queue should help as we
did not move the disk head and kept on dispatching sequential IO from
one queue.
CFQ has following service trees and various queues are put on these trees.
sync-idle sync-noidle async
All cfq queues doing synchronous sequential IO go on to sync-idle tree.
On this tree we idle on each queue individually.
All synchronous non-sequential queues go on sync-noidle tree. Also any
request which are marked with REQ_NOIDLE go on this service tree. On this
tree we do not idle on individual queues instead idle on the whole group
of queues or the tree. So if there are 4 queues waiting for IO to dispatch
we will idle only once last queue has dispatched the IO and there is
no more IO on this service tree.
All async writes go on async service tree. There is no idling on async
queues.
CFQ has some optimizations for SSDs and if it detects a non-rotational
media which can support higher queue depth (multiple requests at in
flight at a time), then it cuts down on idling of individual queues and
all the queues move to sync-noidle tree and only tree idle remains. This
tree idling provides isolation with buffered write queues on async tree.
FAQ
===
Q1. Why to idle at all on queues marked with REQ_NOIDLE.
A1. We only do tree idle (all queues on sync-noidle tree) on queues marked
with REQ_NOIDLE. This helps in providing isolation with all the sync-idle
queues. Otherwise in presence of many sequential readers, other
synchronous IO might not get fair share of disk.
For example, if there are 10 sequential readers doing IO and they get
100ms each. If a REQ_NOIDLE request comes in, it will be scheduled
roughly after 1 second. If after completion of REQ_NOIDLE request we
do not idle, and after a couple of milli seconds a another REQ_NOIDLE
request comes in, again it will be scheduled after 1second. Repeat it
and notice how a workload can lose its disk share and suffer due to
multiple sequential readers.
fsync can generate dependent IO where bunch of data is written in the
context of fsync, and later some journaling data is written. Journaling
data comes in only after fsync has finished its IO (atleast for ext4
that seemed to be the case). Now if one decides not to idle on fsync
thread due to REQ_NOIDLE, then next journaling write will not get
scheduled for another second. A process doing small fsync, will suffer
badly in presence of multiple sequential readers.
Hence doing tree idling on threads using REQ_NOIDLE flag on requests
provides isolation from multiple sequential readers and at the same
time we do not idle on individual threads.
Q2. When to specify REQ_NOIDLE
A2. I would think whenever one is doing synchronous write and not expecting
more writes to be dispatched from same context soon, should be able
to specify REQ_NOIDLE on writes and that probably should work well for
most of the cases.

View file

@ -0,0 +1,39 @@
Embedded device command line partition parsing
=====================================================================
Support for reading the block device partition table from the command line.
It is typically used for fixed block (eMMC) embedded devices.
It has no MBR, so saves storage space. Bootloader can be easily accessed
by absolute address of data on the block device.
Users can easily change the partition.
The format for the command line is just like mtdparts:
blkdevparts=<blkdev-def>[;<blkdev-def>]
<blkdev-def> := <blkdev-id>:<partdef>[,<partdef>]
<partdef> := <size>[@<offset>](part-name)
<blkdev-id>
block device disk name, embedded device used fixed block device,
it's disk name also fixed. such as: mmcblk0, mmcblk1, mmcblk0boot0.
<size>
partition size, in bytes, such as: 512, 1m, 1G.
<offset>
partition start address, in bytes.
(part-name)
partition name, kernel send uevent with "PARTNAME". application can create
a link to block device partition with the name "PARTNAME".
user space application can access partition by partition name.
Example:
eMMC disk name is "mmcblk0" and "mmcblk0boot0"
bootargs:
'blkdevparts=mmcblk0:1G(data0),1G(data1),-;mmcblk0boot0:1m(boot),-(kernel)'
dmesg:
mmcblk0: p1(data0) p2(data1) p3()
mmcblk0boot0: p1(boot) p2(kernel)

View file

@ -0,0 +1,283 @@
----------------------------------------------------------------------
1. INTRODUCTION
Modern filesystems feature checksumming of data and metadata to
protect against data corruption. However, the detection of the
corruption is done at read time which could potentially be months
after the data was written. At that point the original data that the
application tried to write is most likely lost.
The solution is to ensure that the disk is actually storing what the
application meant it to. Recent additions to both the SCSI family
protocols (SBC Data Integrity Field, SCC protection proposal) as well
as SATA/T13 (External Path Protection) try to remedy this by adding
support for appending integrity metadata to an I/O. The integrity
metadata (or protection information in SCSI terminology) includes a
checksum for each sector as well as an incrementing counter that
ensures the individual sectors are written in the right order. And
for some protection schemes also that the I/O is written to the right
place on disk.
Current storage controllers and devices implement various protective
measures, for instance checksumming and scrubbing. But these
technologies are working in their own isolated domains or at best
between adjacent nodes in the I/O path. The interesting thing about
DIF and the other integrity extensions is that the protection format
is well defined and every node in the I/O path can verify the
integrity of the I/O and reject it if corruption is detected. This
allows not only corruption prevention but also isolation of the point
of failure.
----------------------------------------------------------------------
2. THE DATA INTEGRITY EXTENSIONS
As written, the protocol extensions only protect the path between
controller and storage device. However, many controllers actually
allow the operating system to interact with the integrity metadata
(IMD). We have been working with several FC/SAS HBA vendors to enable
the protection information to be transferred to and from their
controllers.
The SCSI Data Integrity Field works by appending 8 bytes of protection
information to each sector. The data + integrity metadata is stored
in 520 byte sectors on disk. Data + IMD are interleaved when
transferred between the controller and target. The T13 proposal is
similar.
Because it is highly inconvenient for operating systems to deal with
520 (and 4104) byte sectors, we approached several HBA vendors and
encouraged them to allow separation of the data and integrity metadata
scatter-gather lists.
The controller will interleave the buffers on write and split them on
read. This means that Linux can DMA the data buffers to and from
host memory without changes to the page cache.
Also, the 16-bit CRC checksum mandated by both the SCSI and SATA specs
is somewhat heavy to compute in software. Benchmarks found that
calculating this checksum had a significant impact on system
performance for a number of workloads. Some controllers allow a
lighter-weight checksum to be used when interfacing with the operating
system. Emulex, for instance, supports the TCP/IP checksum instead.
The IP checksum received from the OS is converted to the 16-bit CRC
when writing and vice versa. This allows the integrity metadata to be
generated by Linux or the application at very low cost (comparable to
software RAID5).
The IP checksum is weaker than the CRC in terms of detecting bit
errors. However, the strength is really in the separation of the data
buffers and the integrity metadata. These two distinct buffers must
match up for an I/O to complete.
The separation of the data and integrity metadata buffers as well as
the choice in checksums is referred to as the Data Integrity
Extensions. As these extensions are outside the scope of the protocol
bodies (T10, T13), Oracle and its partners are trying to standardize
them within the Storage Networking Industry Association.
----------------------------------------------------------------------
3. KERNEL CHANGES
The data integrity framework in Linux enables protection information
to be pinned to I/Os and sent to/received from controllers that
support it.
The advantage to the integrity extensions in SCSI and SATA is that
they enable us to protect the entire path from application to storage
device. However, at the same time this is also the biggest
disadvantage. It means that the protection information must be in a
format that can be understood by the disk.
Generally Linux/POSIX applications are agnostic to the intricacies of
the storage devices they are accessing. The virtual filesystem switch
and the block layer make things like hardware sector size and
transport protocols completely transparent to the application.
However, this level of detail is required when preparing the
protection information to send to a disk. Consequently, the very
concept of an end-to-end protection scheme is a layering violation.
It is completely unreasonable for an application to be aware whether
it is accessing a SCSI or SATA disk.
The data integrity support implemented in Linux attempts to hide this
from the application. As far as the application (and to some extent
the kernel) is concerned, the integrity metadata is opaque information
that's attached to the I/O.
The current implementation allows the block layer to automatically
generate the protection information for any I/O. Eventually the
intent is to move the integrity metadata calculation to userspace for
user data. Metadata and other I/O that originates within the kernel
will still use the automatic generation interface.
Some storage devices allow each hardware sector to be tagged with a
16-bit value. The owner of this tag space is the owner of the block
device. I.e. the filesystem in most cases. The filesystem can use
this extra space to tag sectors as they see fit. Because the tag
space is limited, the block interface allows tagging bigger chunks by
way of interleaving. This way, 8*16 bits of information can be
attached to a typical 4KB filesystem block.
This also means that applications such as fsck and mkfs will need
access to manipulate the tags from user space. A passthrough
interface for this is being worked on.
----------------------------------------------------------------------
4. BLOCK LAYER IMPLEMENTATION DETAILS
4.1 BIO
The data integrity patches add a new field to struct bio when
CONFIG_BLK_DEV_INTEGRITY is enabled. bio_integrity(bio) returns a
pointer to a struct bip which contains the bio integrity payload.
Essentially a bip is a trimmed down struct bio which holds a bio_vec
containing the integrity metadata and the required housekeeping
information (bvec pool, vector count, etc.)
A kernel subsystem can enable data integrity protection on a bio by
calling bio_integrity_alloc(bio). This will allocate and attach the
bip to the bio.
Individual pages containing integrity metadata can subsequently be
attached using bio_integrity_add_page().
bio_free() will automatically free the bip.
4.2 BLOCK DEVICE
Because the format of the protection data is tied to the physical
disk, each block device has been extended with a block integrity
profile (struct blk_integrity). This optional profile is registered
with the block layer using blk_integrity_register().
The profile contains callback functions for generating and verifying
the protection data, as well as getting and setting application tags.
The profile also contains a few constants to aid in completing,
merging and splitting the integrity metadata.
Layered block devices will need to pick a profile that's appropriate
for all subdevices. blk_integrity_compare() can help with that. DM
and MD linear, RAID0 and RAID1 are currently supported. RAID4/5/6
will require extra work due to the application tag.
----------------------------------------------------------------------
5.0 BLOCK LAYER INTEGRITY API
5.1 NORMAL FILESYSTEM
The normal filesystem is unaware that the underlying block device
is capable of sending/receiving integrity metadata. The IMD will
be automatically generated by the block layer at submit_bio() time
in case of a WRITE. A READ request will cause the I/O integrity
to be verified upon completion.
IMD generation and verification can be toggled using the
/sys/block/<bdev>/integrity/write_generate
and
/sys/block/<bdev>/integrity/read_verify
flags.
5.2 INTEGRITY-AWARE FILESYSTEM
A filesystem that is integrity-aware can prepare I/Os with IMD
attached. It can also use the application tag space if this is
supported by the block device.
int bio_integrity_prep(bio);
To generate IMD for WRITE and to set up buffers for READ, the
filesystem must call bio_integrity_prep(bio).
Prior to calling this function, the bio data direction and start
sector must be set, and the bio should have all data pages
added. It is up to the caller to ensure that the bio does not
change while I/O is in progress.
bio_integrity_prep() should only be called if
bio_integrity_enabled() returned 1.
5.3 PASSING EXISTING INTEGRITY METADATA
Filesystems that either generate their own integrity metadata or
are capable of transferring IMD from user space can use the
following calls:
struct bip * bio_integrity_alloc(bio, gfp_mask, nr_pages);
Allocates the bio integrity payload and hangs it off of the bio.
nr_pages indicate how many pages of protection data need to be
stored in the integrity bio_vec list (similar to bio_alloc()).
The integrity payload will be freed at bio_free() time.
int bio_integrity_add_page(bio, page, len, offset);
Attaches a page containing integrity metadata to an existing
bio. The bio must have an existing bip,
i.e. bio_integrity_alloc() must have been called. For a WRITE,
the integrity metadata in the pages must be in a format
understood by the target device with the notable exception that
the sector numbers will be remapped as the request traverses the
I/O stack. This implies that the pages added using this call
will be modified during I/O! The first reference tag in the
integrity metadata must have a value of bip->bip_sector.
Pages can be added using bio_integrity_add_page() as long as
there is room in the bip bio_vec array (nr_pages).
Upon completion of a READ operation, the attached pages will
contain the integrity metadata received from the storage device.
It is up to the receiver to process them and verify data
integrity upon completion.
5.4 REGISTERING A BLOCK DEVICE AS CAPABLE OF EXCHANGING INTEGRITY
METADATA
To enable integrity exchange on a block device the gendisk must be
registered as capable:
int blk_integrity_register(gendisk, blk_integrity);
The blk_integrity struct is a template and should contain the
following:
static struct blk_integrity my_profile = {
.name = "STANDARDSBODY-TYPE-VARIANT-CSUM",
.generate_fn = my_generate_fn,
.verify_fn = my_verify_fn,
.tuple_size = sizeof(struct my_tuple_size),
.tag_size = <tag bytes per hw sector>,
};
'name' is a text string which will be visible in sysfs. This is
part of the userland API so chose it carefully and never change
it. The format is standards body-type-variant.
E.g. T10-DIF-TYPE1-IP or T13-EPP-0-CRC.
'generate_fn' generates appropriate integrity metadata (for WRITE).
'verify_fn' verifies that the data buffer matches the integrity
metadata.
'tuple_size' must be set to match the size of the integrity
metadata per sector. I.e. 8 for DIF and EPP.
'tag_size' must be set to identify how many bytes of tag space
are available per hardware sector. For DIF this is either 2 or
0 depending on the value of the Control Mode Page ATO bit.
----------------------------------------------------------------------
2007-12-24 Martin K. Petersen <martin.petersen@oracle.com>

View file

@ -0,0 +1,75 @@
Deadline IO scheduler tunables
==============================
This little file attempts to document how the deadline io scheduler works.
In particular, it will clarify the meaning of the exposed tunables that may be
of interest to power users.
Selecting IO schedulers
-----------------------
Refer to Documentation/block/switching-sched.txt for information on
selecting an io scheduler on a per-device basis.
********************************************************************************
read_expire (in ms)
-----------
The goal of the deadline io scheduler is to attempt to guarantee a start
service time for a request. As we focus mainly on read latencies, this is
tunable. When a read request first enters the io scheduler, it is assigned
a deadline that is the current time + the read_expire value in units of
milliseconds.
write_expire (in ms)
-----------
Similar to read_expire mentioned above, but for writes.
fifo_batch (number of requests)
----------
Requests are grouped into ``batches'' of a particular data direction (read or
write) which are serviced in increasing sector order. To limit extra seeking,
deadline expiries are only checked between batches. fifo_batch controls the
maximum number of requests per batch.
This parameter tunes the balance between per-request latency and aggregate
throughput. When low latency is the primary concern, smaller is better (where
a value of 1 yields first-come first-served behaviour). Increasing fifo_batch
generally improves throughput, at the cost of latency variation.
writes_starved (number of dispatches)
--------------
When we have to move requests from the io scheduler queue to the block
device dispatch queue, we always give a preference to reads. However, we
don't want to starve writes indefinitely either. So writes_starved controls
how many times we give preference to reads over writes. When that has been
done writes_starved number of times, we dispatch some writes based on the
same criteria as reads.
front_merges (bool)
------------
Sometimes it happens that a request enters the io scheduler that is contiguous
with a request that is already on the queue. Either it fits in the back of that
request, or it fits at the front. That is called either a back merge candidate
or a front merge candidate. Due to the way files are typically laid out,
back merges are much more common than front merges. For some work loads, you
may even know that it is a waste of time to spend any time attempting to
front merge requests. Setting front_merges to 0 disables this functionality.
Front merges may still occur due to the cached last_merge hint, but since
that comes at basically 0 cost we leave that on. We simply disable the
rbtree front sector lookup when the io scheduler merge function is called.
Nov 11 2002, Jens Axboe <jens.axboe@oracle.com>

View file

@ -0,0 +1,183 @@
Block io priorities
===================
Intro
-----
With the introduction of cfq v3 (aka cfq-ts or time sliced cfq), basic io
priorities are supported for reads on files. This enables users to io nice
processes or process groups, similar to what has been possible with cpu
scheduling for ages. This document mainly details the current possibilities
with cfq; other io schedulers do not support io priorities thus far.
Scheduling classes
------------------
CFQ implements three generic scheduling classes that determine how io is
served for a process.
IOPRIO_CLASS_RT: This is the realtime io class. This scheduling class is given
higher priority than any other in the system, processes from this class are
given first access to the disk every time. Thus it needs to be used with some
care, one io RT process can starve the entire system. Within the RT class,
there are 8 levels of class data that determine exactly how much time this
process needs the disk for on each service. In the future this might change
to be more directly mappable to performance, by passing in a wanted data
rate instead.
IOPRIO_CLASS_BE: This is the best-effort scheduling class, which is the default
for any process that hasn't set a specific io priority. The class data
determines how much io bandwidth the process will get, it's directly mappable
to the cpu nice levels just more coarsely implemented. 0 is the highest
BE prio level, 7 is the lowest. The mapping between cpu nice level and io
nice level is determined as: io_nice = (cpu_nice + 20) / 5.
IOPRIO_CLASS_IDLE: This is the idle scheduling class, processes running at this
level only get io time when no one else needs the disk. The idle class has no
class data, since it doesn't really apply here.
Tools
-----
See below for a sample ionice tool. Usage:
# ionice -c<class> -n<level> -p<pid>
If pid isn't given, the current process is assumed. IO priority settings
are inherited on fork, so you can use ionice to start the process at a given
level:
# ionice -c2 -n0 /bin/ls
will run ls at the best-effort scheduling class at the highest priority.
For a running process, you can give the pid instead:
# ionice -c1 -n2 -p100
will change pid 100 to run at the realtime scheduling class, at priority 2.
---> snip ionice.c tool <---
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <getopt.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <asm/unistd.h>
extern int sys_ioprio_set(int, int, int);
extern int sys_ioprio_get(int, int);
#if defined(__i386__)
#define __NR_ioprio_set 289
#define __NR_ioprio_get 290
#elif defined(__ppc__)
#define __NR_ioprio_set 273
#define __NR_ioprio_get 274
#elif defined(__x86_64__)
#define __NR_ioprio_set 251
#define __NR_ioprio_get 252
#elif defined(__ia64__)
#define __NR_ioprio_set 1274
#define __NR_ioprio_get 1275
#else
#error "Unsupported arch"
#endif
static inline int ioprio_set(int which, int who, int ioprio)
{
return syscall(__NR_ioprio_set, which, who, ioprio);
}
static inline int ioprio_get(int which, int who)
{
return syscall(__NR_ioprio_get, which, who);
}
enum {
IOPRIO_CLASS_NONE,
IOPRIO_CLASS_RT,
IOPRIO_CLASS_BE,
IOPRIO_CLASS_IDLE,
};
enum {
IOPRIO_WHO_PROCESS = 1,
IOPRIO_WHO_PGRP,
IOPRIO_WHO_USER,
};
#define IOPRIO_CLASS_SHIFT 13
const char *to_prio[] = { "none", "realtime", "best-effort", "idle", };
int main(int argc, char *argv[])
{
int ioprio = 4, set = 0, ioprio_class = IOPRIO_CLASS_BE;
int c, pid = 0;
while ((c = getopt(argc, argv, "+n:c:p:")) != EOF) {
switch (c) {
case 'n':
ioprio = strtol(optarg, NULL, 10);
set = 1;
break;
case 'c':
ioprio_class = strtol(optarg, NULL, 10);
set = 1;
break;
case 'p':
pid = strtol(optarg, NULL, 10);
break;
}
}
switch (ioprio_class) {
case IOPRIO_CLASS_NONE:
ioprio_class = IOPRIO_CLASS_BE;
break;
case IOPRIO_CLASS_RT:
case IOPRIO_CLASS_BE:
break;
case IOPRIO_CLASS_IDLE:
ioprio = 7;
break;
default:
printf("bad prio class %d\n", ioprio_class);
return 1;
}
if (!set) {
if (!pid && argv[optind])
pid = strtol(argv[optind], NULL, 10);
ioprio = ioprio_get(IOPRIO_WHO_PROCESS, pid);
printf("pid=%d, %d\n", pid, ioprio);
if (ioprio == -1)
perror("ioprio_get");
else {
ioprio_class = ioprio >> IOPRIO_CLASS_SHIFT;
ioprio = ioprio & 0xff;
printf("%s: prio %d\n", to_prio[ioprio_class], ioprio);
}
} else {
if (ioprio_set(IOPRIO_WHO_PROCESS, pid, ioprio | ioprio_class << IOPRIO_CLASS_SHIFT) == -1) {
perror("ioprio_set");
return 1;
}
if (argv[optind])
execvp(argv[optind], &argv[optind]);
}
return 0;
}
---> snip ionice.c tool <---
March 11 2005, Jens Axboe <jens.axboe@oracle.com>

View file

@ -0,0 +1,72 @@
Null block device driver
================================================================================
I. Overview
The null block device (/dev/nullb*) is used for benchmarking the various
block-layer implementations. It emulates a block device of X gigabytes in size.
The following instances are possible:
Single-queue block-layer
- Request-based.
- Single submission queue per device.
- Implements IO scheduling algorithms (CFQ, Deadline, noop).
Multi-queue block-layer
- Request-based.
- Configurable submission queues per device.
No block-layer (Known as bio-based)
- Bio-based. IO requests are submitted directly to the device driver.
- Directly accepts bio data structure and returns them.
All of them have a completion queue for each core in the system.
II. Module parameters applicable for all instances:
queue_mode=[0-2]: Default: 2-Multi-queue
Selects which block-layer the module should instantiate with.
0: Bio-based.
1: Single-queue.
2: Multi-queue.
home_node=[0--nr_nodes]: Default: NUMA_NO_NODE
Selects what CPU node the data structures are allocated from.
gb=[Size in GB]: Default: 250GB
The size of the device reported to the system.
bs=[Block size (in bytes)]: Default: 512 bytes
The block size reported to the system.
nr_devices=[Number of devices]: Default: 2
Number of block devices instantiated. They are instantiated as /dev/nullb0,
etc.
irqmode=[0-2]: Default: 1-Soft-irq
The completion mode used for completing IOs to the block-layer.
0: None.
1: Soft-irq. Uses IPI to complete IOs across CPU nodes. Simulates the overhead
when IOs are issued from another CPU node than the home the device is
connected to.
2: Timer: Waits a specific period (completion_nsec) for each IO before
completion.
completion_nsec=[ns]: Default: 10.000ns
Combined with irqmode=2 (timer). The time each completion event must wait.
submit_queues=[0..nr_cpus]:
The number of submission queues attached to the device driver. If unset, it
defaults to 1 on single-queue and bio-based instances. For multi-queue,
it is ignored when use_per_node_hctx module parameter is 1.
hw_queue_depth=[0..qdepth]: Default: 64
The hardware queue depth of the device.
III: Multi-queue specific parameters
use_per_node_hctx=[0/1]: Default: 0
0: The number of submit queues are set to the value of the submit_queues
parameter.
1: The multi-queue block layer is instantiated with a hardware dispatch
queue for each CPU node in the system.

View file

@ -0,0 +1,138 @@
Queue sysfs files
=================
This text file will detail the queue files that are located in the sysfs tree
for each block device. Note that stacked devices typically do not export
any settings, since their queue merely functions are a remapping target.
These files are the ones found in the /sys/block/xxx/queue/ directory.
Files denoted with a RO postfix are readonly and the RW postfix means
read-write.
add_random (RW)
----------------
This file allows to turn off the disk entropy contribution. Default
value of this file is '1'(on).
discard_granularity (RO)
-----------------------
This shows the size of internal allocation of the device in bytes, if
reported by the device. A value of '0' means device does not support
the discard functionality.
discard_max_bytes (RO)
----------------------
Devices that support discard functionality may have internal limits on
the number of bytes that can be trimmed or unmapped in a single operation.
The discard_max_bytes parameter is set by the device driver to the maximum
number of bytes that can be discarded in a single operation. Discard
requests issued to the device must not exceed this limit. A discard_max_bytes
value of 0 means that the device does not support discard functionality.
discard_zeroes_data (RO)
------------------------
When read, this file will show if the discarded block are zeroed by the
device or not. If its value is '1' the blocks are zeroed otherwise not.
hw_sector_size (RO)
-------------------
This is the hardware sector size of the device, in bytes.
iostats (RW)
-------------
This file is used to control (on/off) the iostats accounting of the
disk.
logical_block_size (RO)
-----------------------
This is the logcal block size of the device, in bytes.
max_hw_sectors_kb (RO)
----------------------
This is the maximum number of kilobytes supported in a single data transfer.
max_integrity_segments (RO)
---------------------------
When read, this file shows the max limit of integrity segments as
set by block layer which a hardware controller can handle.
max_sectors_kb (RW)
-------------------
This is the maximum number of kilobytes that the block layer will allow
for a filesystem request. Must be smaller than or equal to the maximum
size allowed by the hardware.
max_segments (RO)
-----------------
Maximum number of segments of the device.
max_segment_size (RO)
---------------------
Maximum segment size of the device.
minimum_io_size (RO)
--------------------
This is the smallest preferred IO size reported by the device.
nomerges (RW)
-------------
This enables the user to disable the lookup logic involved with IO
merging requests in the block layer. By default (0) all merges are
enabled. When set to 1 only simple one-hit merges will be tried. When
set to 2 no merge algorithms will be tried (including one-hit or more
complex tree/hash lookups).
nr_requests (RW)
----------------
This controls how many requests may be allocated in the block layer for
read or write requests. Note that the total allocated number may be twice
this amount, since it applies only to reads or writes (not the accumulated
sum).
To avoid priority inversion through request starvation, a request
queue maintains a separate request pool per each cgroup when
CONFIG_BLK_CGROUP is enabled, and this parameter applies to each such
per-block-cgroup request pool. IOW, if there are N block cgroups,
each request queue may have up to N request pools, each independently
regulated by nr_requests.
optimal_io_size (RO)
--------------------
This is the optimal IO size reported by the device.
physical_block_size (RO)
------------------------
This is the physical block size of device, in bytes.
read_ahead_kb (RW)
------------------
Maximum number of kilobytes to read-ahead for filesystems on this block
device.
rotational (RW)
---------------
This file is used to stat if the device is of rotational type or
non-rotational type.
rq_affinity (RW)
----------------
If this option is '1', the block layer will migrate request completions to the
cpu "group" that originally submitted the request. For some workloads this
provides a significant reduction in CPU cycles due to caching effects.
For storage configurations that need to maximize distribution of completion
processing setting this option to '2' forces the completion to run on the
requesting cpu (bypassing the "group" aggregation logic).
scheduler (RW)
--------------
When read, this file will display the current and available IO schedulers
for this block device. The currently active IO scheduler will be enclosed
in [] brackets. Writing an IO scheduler name to this file will switch
control of this block device to that new IO scheduler. Note that writing
an IO scheduler name to this file will attempt to load that IO scheduler
module, if it isn't already present in the system.
Jens Axboe <jens.axboe@oracle.com>, February 2009

View file

@ -0,0 +1,88 @@
struct request documentation
Jens Axboe <jens.axboe@oracle.com> 27/05/02
1.0
Index
2.0 Struct request members classification
2.1 struct request members explanation
3.0
2.0
Short explanation of request members
Classification flags:
D driver member
B block layer member
I I/O scheduler member
Unless an entry contains a D classification, a device driver must not access
this member. Some members may contain D classifications, but should only be
access through certain macros or functions (eg ->flags).
<linux/blkdev.h>
2.1
Member Flag Comment
------ ---- -------
struct list_head queuelist BI Organization on various internal
queues
void *elevator_private I I/O scheduler private data
unsigned char cmd[16] D Driver can use this for setting up
a cdb before execution, see
blk_queue_prep_rq
unsigned long flags DBI Contains info about data direction,
request type, etc.
int rq_status D Request status bits
kdev_t rq_dev DBI Target device
int errors DB Error counts
sector_t sector DBI Target location
unsigned long hard_nr_sectors B Used to keep sector sane
unsigned long nr_sectors DBI Total number of sectors in request
unsigned long hard_nr_sectors B Used to keep nr_sectors sane
unsigned short nr_phys_segments DB Number of physical scatter gather
segments in a request
unsigned short nr_hw_segments DB Number of hardware scatter gather
segments in a request
unsigned int current_nr_sectors DB Number of sectors in first segment
of request
unsigned int hard_cur_sectors B Used to keep current_nr_sectors sane
int tag DB TCQ tag, if assigned
void *special D Free to be used by driver
char *buffer D Map of first segment, also see
section on bouncing SECTION
struct completion *waiting D Can be used by driver to get signalled
on request completion
struct bio *bio DBI First bio in request
struct bio *biotail DBI Last bio in request
struct request_queue *q DB Request queue this request belongs to
struct request_list *rl B Request list this request came from

View file

@ -0,0 +1,82 @@
Block layer statistics in /sys/block/<dev>/stat
===============================================
This file documents the contents of the /sys/block/<dev>/stat file.
The stat file provides several statistics about the state of block
device <dev>.
Q. Why are there multiple statistics in a single file? Doesn't sysfs
normally contain a single value per file?
A. By having a single file, the kernel can guarantee that the statistics
represent a consistent snapshot of the state of the device. If the
statistics were exported as multiple files containing one statistic
each, it would be impossible to guarantee that a set of readings
represent a single point in time.
The stat file consists of a single line of text containing 11 decimal
values separated by whitespace. The fields are summarized in the
following table, and described in more detail below.
Name units description
---- ----- -----------
read I/Os requests number of read I/Os processed
read merges requests number of read I/Os merged with in-queue I/O
read sectors sectors number of sectors read
read ticks milliseconds total wait time for read requests
write I/Os requests number of write I/Os processed
write merges requests number of write I/Os merged with in-queue I/O
write sectors sectors number of sectors written
write ticks milliseconds total wait time for write requests
in_flight requests number of I/Os currently in flight
io_ticks milliseconds total time this block device has been active
time_in_queue milliseconds total wait time for all requests
read I/Os, write I/Os
=====================
These values increment when an I/O request completes.
read merges, write merges
=========================
These values increment when an I/O request is merged with an
already-queued I/O request.
read sectors, write sectors
===========================
These values count the number of sectors read from or written to this
block device. The "sectors" in question are the standard UNIX 512-byte
sectors, not any device- or filesystem-specific block size. The
counters are incremented when the I/O completes.
read ticks, write ticks
=======================
These values count the number of milliseconds that I/O requests have
waited on this block device. If there are multiple I/O requests waiting,
these values will increase at a rate greater than 1000/second; for
example, if 60 read requests wait for an average of 30 ms, the read_ticks
field will increase by 60*30 = 1800.
in_flight
=========
This value counts the number of I/O requests that have been issued to
the device driver but have not yet completed. It does not include I/O
requests that are in the queue but not yet issued to the device driver.
io_ticks
========
This value counts the number of milliseconds during which the device has
had I/O requests queued.
time_in_queue
=============
This value counts the number of milliseconds that I/O requests have waited
on this block device. If there are multiple I/O requests waiting, this
value will increase as the product of the number of milliseconds times the
number of requests waiting (see "read ticks" above for an example).

View file

@ -0,0 +1,37 @@
To choose IO schedulers at boot time, use the argument 'elevator=deadline'.
'noop' and 'cfq' (the default) are also available. IO schedulers are assigned
globally at boot time only presently.
Each io queue has a set of io scheduler tunables associated with it. These
tunables control how the io scheduler works. You can find these entries
in:
/sys/block/<device>/queue/iosched
assuming that you have sysfs mounted on /sys. If you don't have sysfs mounted,
you can do so by typing:
# mount none /sys -t sysfs
As of the Linux 2.6.10 kernel, it is now possible to change the
IO scheduler for a given block device on the fly (thus making it possible,
for instance, to set the CFQ scheduler for the system default, but
set a specific device to use the deadline or noop schedulers - which
can improve that device's throughput).
To set a specific scheduler, simply do this:
echo SCHEDNAME > /sys/block/DEV/queue/scheduler
where SCHEDNAME is the name of a defined IO scheduler, and DEV is the
device name (hda, hdb, sga, or whatever you happen to have).
The list of defined schedulers can be found by simply doing
a "cat /sys/block/DEV/queue/scheduler" - the list of valid names
will be displayed, with the currently selected scheduler in brackets:
# cat /sys/block/hda/queue/scheduler
noop deadline [cfq]
# echo deadline > /sys/block/hda/queue/scheduler
# cat /sys/block/hda/queue/scheduler
noop [deadline] cfq

View file

@ -0,0 +1,86 @@
Explicit volatile write back cache control
=====================================
Introduction
------------
Many storage devices, especially in the consumer market, come with volatile
write back caches. That means the devices signal I/O completion to the
operating system before data actually has hit the non-volatile storage. This
behavior obviously speeds up various workloads, but it means the operating
system needs to force data out to the non-volatile storage when it performs
a data integrity operation like fsync, sync or an unmount.
The Linux block layer provides two simple mechanisms that let filesystems
control the caching behavior of the storage device. These mechanisms are
a forced cache flush, and the Force Unit Access (FUA) flag for requests.
Explicit cache flushes
----------------------
The REQ_FLUSH flag can be OR ed into the r/w flags of a bio submitted from
the filesystem and will make sure the volatile cache of the storage device
has been flushed before the actual I/O operation is started. This explicitly
guarantees that previously completed write requests are on non-volatile
storage before the flagged bio starts. In addition the REQ_FLUSH flag can be
set on an otherwise empty bio structure, which causes only an explicit cache
flush without any dependent I/O. It is recommend to use
the blkdev_issue_flush() helper for a pure cache flush.
Forced Unit Access
-----------------
The REQ_FUA flag can be OR ed into the r/w flags of a bio submitted from the
filesystem and will make sure that I/O completion for this request is only
signaled after the data has been committed to non-volatile storage.
Implementation details for filesystems
--------------------------------------
Filesystems can simply set the REQ_FLUSH and REQ_FUA bits and do not have to
worry if the underlying devices need any explicit cache flushing and how
the Forced Unit Access is implemented. The REQ_FLUSH and REQ_FUA flags
may both be set on a single bio.
Implementation details for make_request_fn based block drivers
--------------------------------------------------------------
These drivers will always see the REQ_FLUSH and REQ_FUA bits as they sit
directly below the submit_bio interface. For remapping drivers the REQ_FUA
bits need to be propagated to underlying devices, and a global flush needs
to be implemented for bios with the REQ_FLUSH bit set. For real device
drivers that do not have a volatile cache the REQ_FLUSH and REQ_FUA bits
on non-empty bios can simply be ignored, and REQ_FLUSH requests without
data can be completed successfully without doing any work. Drivers for
devices with volatile caches need to implement the support for these
flags themselves without any help from the block layer.
Implementation details for request_fn based block drivers
--------------------------------------------------------------
For devices that do not support volatile write caches there is no driver
support required, the block layer completes empty REQ_FLUSH requests before
entering the driver and strips off the REQ_FLUSH and REQ_FUA bits from
requests that have a payload. For devices with volatile write caches the
driver needs to tell the block layer that it supports flushing caches by
doing:
blk_queue_flush(sdkp->disk->queue, REQ_FLUSH);
and handle empty REQ_FLUSH requests in its prep_fn/request_fn. Note that
REQ_FLUSH requests with a payload are automatically turned into a sequence
of an empty REQ_FLUSH request followed by the actual write by the block
layer. For devices that also support the FUA bit the block layer needs
to be told to pass through the REQ_FUA bit using:
blk_queue_flush(sdkp->disk->queue, REQ_FLUSH | REQ_FUA);
and the driver must handle write requests that have the REQ_FUA bit set
in prep_fn/request_fn. If the FUA bit is not natively supported the block
layer turns it into an empty REQ_FLUSH request after the actual write.