Mirror of https://github.com/AetherDroid/android_kernel_samsung_on5xelte.git
Synced 2025-09-05 07:57:45 -04:00

Commit f6dfaef42e: "Fixed MTP to work with TWRP"
50820 changed files with 20846062 additions and 0 deletions

Documentation/block/00-INDEX (new file, 28 lines)

00-INDEX
	- This file
biodoc.txt
	- Notes on the Generic Block Layer Rewrite in Linux 2.5
capability.txt
	- Generic Block Device Capability (/sys/block/<device>/capability)
cfq-iosched.txt
	- CFQ IO scheduler tunables
cmdline-partition.txt
	- how to specify block device partitions on kernel command line
data-integrity.txt
	- Block data integrity
deadline-iosched.txt
	- Deadline IO scheduler tunables
ioprio.txt
	- Block io priorities (in CFQ scheduler)
null_blk.txt
	- Null block for block-layer benchmarking.
queue-sysfs.txt
	- Queue's sysfs entries
request.txt
	- The members of struct request (in include/linux/blkdev.h)
stat.txt
	- Block layer statistics in /sys/block/<device>/stat
switching-sched.txt
	- Switching I/O schedulers at runtime
writeback_cache_control.txt
	- Control of volatile write back caches

Documentation/block/biodoc.txt (new file, 1198 lines)
File diff suppressed because it is too large.

Documentation/block/biovecs.txt (new file, 111 lines)

Immutable biovecs and biovec iterators:
=======================================

Kent Overstreet <kmo@daterainc.com>

As of 3.13, biovecs should never be modified after a bio has been submitted.
Instead, we have a new struct bvec_iter which represents a range of a biovec -
the iterator will be modified as the bio is completed, not the biovec.

More specifically, old code that needed to partially complete a bio would
update bi_sector and bi_size, and advance bi_idx to the next biovec. If it
ended up partway through a biovec, it would increment bv_offset and decrement
bv_len by the number of bytes completed in that biovec.

In the new scheme of things, everything that must be mutated in order to
partially complete a bio is segregated into struct bvec_iter: bi_sector,
bi_size and bi_idx have been moved there; and instead of modifying bv_offset
and bv_len, struct bvec_iter has bi_bvec_done, which represents the number of
bytes completed in the current bvec.

There are a bunch of new helper macros for hiding the gory details - in
particular, presenting the illusion of partially completed biovecs so that
normal code doesn't have to deal with bi_bvec_done.

 * Driver code should no longer refer to biovecs directly; we now have
   bio_iovec() and bio_iter_iovec() macros that return literal struct biovecs,
   constructed from the raw biovecs but taking into account bi_bvec_done and
   bi_size.

   bio_for_each_segment() has been updated to take a bvec_iter argument
   instead of an integer (that corresponded to bi_idx); for a lot of code the
   conversion just required changing the types of the arguments to
   bio_for_each_segment().
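
   As a minimal illustrative sketch (not verbatim kernel code; assumes
   <linux/bio.h> and a 3.13+ kernel), walking a bio's data with the new
   interface looks like this - note that 'bvec' is a struct bio_vec value
   constructed on the fly, so nothing in the bio itself is modified:

	struct bio_vec bvec;
	struct bvec_iter iter;

	bio_for_each_segment(bvec, bio, iter) {
		char *p = kmap_atomic(bvec.bv_page);

		/* bvec.bv_len bytes of data start at p + bvec.bv_offset */
		kunmap_atomic(p);
	}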

 * Advancing a bvec_iter is done with bio_advance_iter(); bio_advance() is a
   wrapper around bio_advance_iter() that operates on bio->bi_iter, and also
   advances the bio integrity's iter if present.

   There is a lower level advance function - bvec_iter_advance() - which takes
   a pointer to a biovec, not a bio; this is used by the bio integrity code.

What does all this get us?
==========================

Having a real iterator, and making biovecs immutable, has a number of
advantages:

 * Before, iterating over bios was very awkward when you weren't processing
   exactly one bvec at a time - for example, bio_copy_data() in fs/bio.c,
   which copies the contents of one bio into another. Because the biovecs
   wouldn't necessarily be the same size, the old code was tricky and
   convoluted - it had to walk two different bios at the same time, keeping
   both bi_idx and an offset into the current biovec for each.

   The new code is much more straightforward - have a look. This sort of
   pattern comes up in a lot of places; a lot of drivers were essentially
   open coding bvec iterators before, and having a common implementation
   considerably simplifies a lot of code.

 * Before, any code that might need to use the biovec after the bio had been
   completed (perhaps to copy the data somewhere else, or perhaps to resubmit
   it somewhere else if there was an error) had to save the entire bvec array
   - again, this was being done in a fair number of places.

 * Biovecs can be shared between multiple bios - a bvec iter can represent an
   arbitrary range of an existing biovec, both starting and ending midway
   through biovecs. This is what enables efficient splitting of arbitrary
   bios. Note that this means we _only_ use bi_size to determine when we've
   reached the end of a bio, not bi_vcnt - and the bio_iovec() macro takes
   bi_size into account when constructing biovecs.

 * Splitting bios is now much simpler. The old bio_split() didn't even work on
   bios with more than a single bvec! Now, we can efficiently split arbitrary
   size bios - because the new bio can share the old bio's biovec.

   Care must be taken to ensure the biovec isn't freed while the split bio is
   still using it, in case the original bio completes first, though. Using
   bio_chain() when splitting bios helps with this, as sketched below.
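
   A hedged sketch of the split-and-chain pattern (not verbatim kernel
   code; 'bio' and 'sectors' are assumed to come from the caller):

	struct bio *split;

	if (sectors < bio_sectors(bio)) {
		split = bio_split(bio, sectors, GFP_NOIO, fs_bio_set);
		bio_chain(split, bio);	/* bio completes only after split */
		generic_make_request(split);	/* submit the front half */
	}
	generic_make_request(bio);	/* remainder shares the old biovec */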

 * Submitting partially completed bios is now perfectly fine - this comes up
   occasionally in stacking block drivers, and various code (e.g. md and
   bcache) had some ugly workarounds for this.

   It used to be the case that submitting a partially completed bio would work
   fine to _most_ devices, but since accessing the raw bvec array was the
   norm, not all drivers would respect bi_idx and those would break. Now,
   since all drivers _must_ go through the bvec iterator - and have been
   audited to make sure they are - submitting partially completed bios is
   perfectly fine.

Other implications:
===================

 * Almost all usage of bi_idx is now incorrect and has been removed; instead,
   where previously you would have used bi_idx you'd now use a bvec_iter,
   probably passing it to one of the helper macros.

   I.e. instead of using bio_iovec_idx() (or bio->bi_io_vec[bio->bi_idx]), you
   now use bio_iter_iovec(), which takes a bvec_iter and returns a
   literal struct bio_vec - constructed on the fly from the raw biovec but
   taking into account bi_bvec_done (and bi_size).

 * bi_vcnt can't be trusted or relied upon by driver code - i.e. anything that
   doesn't actually own the bio. The reason is twofold: firstly, it's not
   actually needed for iterating over the bio anymore - we only use bi_size.
   Secondly, when cloning a bio and reusing (a portion of) the original bio's
   biovec, in order to calculate bi_vcnt for the new bio we'd have to iterate
   over all the biovecs in the new bio - which is silly as it's not needed.

   So, don't use bi_vcnt anymore.

Documentation/block/capability.txt (new file, 15 lines)

Generic Block Device Capability
===============================================================================
This file documents the sysfs file block/<disk>/capability

capability is a hex word indicating which capabilities a specific disk
supports. For more information on bits not listed here, see
include/linux/genhd.h

Capability				Value
-------------------------------------------------------------------------------
GENHD_FL_MEDIA_CHANGE_NOTIFY		4
	When this bit is set, the disk supports Asynchronous Notification
	of media change events. These events will be broadcast to user
	space via kernel uevent.

Documentation/block/cfq-iosched.txt (new file, 292 lines)

CFQ (Complete Fairness Queueing)
===============================

The main aim of the CFQ scheduler is to provide a fair allocation of the disk
I/O bandwidth for all the processes which request an I/O operation.

CFQ maintains a per-process queue for the processes which request I/O
operations (synchronous requests). In the case of asynchronous requests, all
the requests from all the processes are batched together according to their
process's I/O priority.

CFQ ioscheduler tunables
========================

slice_idle
----------
This specifies how long CFQ should idle for the next request on certain cfq
queues (for sequential workloads) and service trees (for random workloads)
before the queue is expired and CFQ selects the next queue to dispatch from.

By default slice_idle is a non-zero value. That means by default we idle on
queues/service trees. This can be very helpful on highly seeky media like
single spindle SATA/SAS disks where we can cut down on the overall number of
seeks and see improved throughput.

Setting slice_idle to 0 will remove all the idling on queues/service tree
level and one should see an overall improved throughput on faster storage
devices like multiple SATA/SAS disks in hardware RAID configuration. The down
side is that the isolation provided from WRITES also goes down and the notion
of IO priority becomes weaker.

So depending on storage and workload, it might be useful to set slice_idle=0.
In general I think for SATA/SAS disks and software RAID of SATA/SAS disks,
keeping slice_idle enabled should be useful. For any configuration where
there are multiple spindles behind a single LUN (host based hardware RAID
controller or storage arrays), setting slice_idle=0 might end up in better
throughput and acceptable latencies.

back_seek_max
-------------
This specifies, given in Kbytes, the maximum "distance" for backward seeking.
The distance is the amount of space from the current head location to the
sectors that are backward in terms of distance.

This parameter allows the scheduler to anticipate requests in the "backward"
direction and consider them as being the "next" if they are within this
distance from the current head location.

back_seek_penalty
-----------------
This parameter is used to compute the cost of backward seeking. If the
backward distance of a request is just 1/back_seek_penalty from a "front"
request, then the seeking cost of the two requests is considered equivalent.

So the scheduler will not bias toward one or the other request (otherwise the
scheduler will bias toward the front request). The default value of
back_seek_penalty is 2.

fifo_expire_async
-----------------
This parameter is used to set the timeout of asynchronous requests. The
default value is 248ms.

fifo_expire_sync
----------------
This parameter is used to set the timeout of synchronous requests. The
default value is 124ms. To favor synchronous requests over asynchronous ones,
this value should be decreased relative to fifo_expire_async.

group_idle
-----------
This parameter forces idling at the CFQ group level instead of the CFQ queue
level. This was introduced after a bottleneck was observed in higher end
storage due to idling on a sequential queue allowing dispatch from only a
single queue at a time. The idea with this parameter is that it can be run
with slice_idle=0 and group_idle=8, so that idling does not happen on
individual queues in the group but happens overall on the group, and thus
still keeps the IO controller working.
Not idling on individual queues in the group will dispatch requests from
multiple queues in the group at the same time and achieve higher throughput
on higher end storage.

The default value for this parameter is 8ms.

latency
-------
This parameter is used to enable/disable the latency mode of the CFQ
scheduler. If latency mode (called low_latency) is enabled, CFQ tries
to recompute the slice time for each process based on the target_latency set
for the system. This favors fairness over throughput. Disabling low
latency (setting it to 0) ignores target latency, allowing each process in the
system to get a full time slice.

By default low latency mode is enabled.

target_latency
--------------
This parameter is used to calculate the time slice for a process if cfq's
latency mode is enabled. It will ensure that sync requests have an estimated
latency. But if the sequential workload is higher (e.g. sequential reads),
then to meet the latency constraints, throughput may decrease because of less
time for each process to issue I/O requests before the cfq queue is switched.

Though this can be overcome by disabling low_latency, it may increase
the read latency for some applications. This parameter allows for changing
target_latency through the sysfs interface, which can provide balanced
throughput and read latency.

The default value for target_latency is 300ms.

slice_async
-----------
This parameter is the same as slice_sync but for the asynchronous queue. The
default value is 40ms.

slice_async_rq
--------------
This parameter is used to limit the dispatching of asynchronous requests to
the device request queue in the queue's slice time. The maximum number of
requests that are allowed to be dispatched also depends upon the io priority.
The default value for this is 2.

slice_sync
----------
When a queue is selected for execution, the queue's IO requests are only
executed for a certain amount of time (time_slice) before switching to another
queue. This parameter is used to calculate the time slice of the synchronous
queue.

time_slice is computed using the below equation:
time_slice = slice_sync + (slice_sync/5 * (4 - prio)). To increase the
time_slice of the synchronous queue, increase the value of slice_sync. The
default value is 100ms.
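
For example, with the default slice_sync of 100ms, a queue at the default
io priority of 4 gets a slice of 100 + (100/5 * (4 - 4)) = 100ms, while a
queue at the highest priority, 0, gets 100 + (100/5 * (4 - 0)) = 180ms.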

quantum
-------
This specifies the number of requests dispatched to the device queue. In a
queue's time slice, a request will not be dispatched if the number of requests
in the device exceeds this parameter. This parameter is used for synchronous
requests.

In the case of storage with several disks, this setting can limit the parallel
processing of requests. Therefore, increasing the value can improve
performance, although this can cause the latency of some I/O to increase due
to the larger number of requests.

CFQ Group scheduling
====================

CFQ supports blkio cgroup and has "blkio." prefixed files in each
blkio cgroup directory. It is weight-based and there are four knobs
for configuration - weight[_device] and leaf_weight[_device].
Internal cgroup nodes (the ones with children) can also have tasks in
them, so the former two configure how much proportion the cgroup as a
whole is entitled to at its parent's level while the latter two
configure how much proportion the tasks in the cgroup have compared to
its direct children.

Another way to think about it is assuming that each internal node has
an implicit leaf child node which hosts all the tasks whose weight is
configured by leaf_weight[_device]. Let's assume a blkio hierarchy
composed of five cgroups - root, A, B, AA and AB - with the following
weights where the names represent the hierarchy.

        weight leaf_weight
 root :  125    125
 A    :  500    750
 B    :  250    500
 AA   :  500    500
 AB   : 1000    500

root never has a parent, so its weight is meaningless. For backward
compatibility, weight is always kept in sync with leaf_weight. B, AA
and AB have no children and thus their tasks have no children cgroups to
compete with. They always get 100% of what the cgroup won at the
parent level. Considering only the weights which matter, the hierarchy
looks like the following.

          root
       /    |   \
      A     B    leaf
     500   250   125
   /  |  \
  AA  AB  leaf
 500 1000  750

If all cgroups have active IOs and are competing with each other, disk
time will be distributed like the following.

Distribution below root. The total active weight at this level is
A:500 + B:250 + root-leaf:125 = 875.

 root-leaf :  125 /  875      =~ 14%
 A         :  500 /  875      =~ 57%
 B(-leaf)  :  250 /  875      =~ 28%

A has children and further distributes its 57% among the children and
the implicit leaf node. The total active weight at this level is
AA:500 + AB:1000 + A-leaf:750 = 2250.

 A-leaf    : ( 750 / 2250) * A =~ 19%
 AA(-leaf) : ( 500 / 2250) * A =~ 12%
 AB(-leaf) : (1000 / 2250) * A =~ 25%

CFQ IOPS Mode for group scheduling
===================================
The basic CFQ design is to provide priority based time slices. A higher
priority process gets a bigger time slice and a lower priority process gets a
smaller time slice. Measuring time becomes harder if storage is fast and
supports NCQ, and it would be better to dispatch multiple requests from
multiple cfq queues in the request queue at a time. In such a scenario, it is
not possible to measure the time consumed by a single queue accurately.

What is possible though is to measure the number of requests dispatched from a
single queue and also allow dispatch from multiple cfq queues at the same
time. This effectively becomes fairness in terms of IOPS (IO operations per
second).

If one sets slice_idle=0 and if storage supports NCQ, CFQ internally switches
to IOPS mode and starts providing fairness in terms of number of requests
dispatched. Note that this mode switching takes effect only for group
scheduling. For non-cgroup users nothing should change.

CFQ IO scheduler Idling Theory
===============================
Idling on a queue is primarily about waiting for the next request to come
in on the same queue after completion of a request. In this process CFQ will
not dispatch requests from other cfq queues even if requests are pending
there.

The rationale behind idling is that it can cut down on the number of seeks
on rotational media. For example, if a process is doing dependent
sequential reads (the next read comes in only after completion of the
previous one), then not dispatching requests from another queue should help,
as we did not move the disk head and kept on dispatching sequential IO from
one queue.

CFQ has the following service trees, and various queues are put on these
trees.

	sync-idle	sync-noidle	async

All cfq queues doing synchronous sequential IO go on to the sync-idle tree.
On this tree we idle on each queue individually.

All synchronous non-sequential queues go on the sync-noidle tree. Also any
requests which are marked with REQ_NOIDLE go on this service tree. On this
tree we do not idle on individual queues; instead we idle on the whole group
of queues or the tree. So if there are 4 queues waiting for IO to dispatch,
we will idle only once the last queue has dispatched its IO and there is
no more IO on this service tree.

All async writes go on the async service tree. There is no idling on async
queues.

CFQ has some optimizations for SSDs: if it detects a non-rotational
medium which can support a higher queue depth (multiple requests in
flight at a time), then it cuts down on idling of individual queues and
all the queues move to the sync-noidle tree and only tree idle remains. This
tree idling provides isolation with buffered write queues on the async tree.

FAQ
===
Q1. Why idle at all on queues marked with REQ_NOIDLE?

A1. We only do tree idle (all queues on the sync-noidle tree) on queues
marked with REQ_NOIDLE. This helps in providing isolation with all the
sync-idle queues. Otherwise, in the presence of many sequential readers,
other synchronous IO might not get a fair share of the disk.

For example, say there are 10 sequential readers doing IO and they get
100ms each. If a REQ_NOIDLE request comes in, it will be scheduled
roughly after 1 second. If after completion of the REQ_NOIDLE request we
do not idle, and after a couple of milliseconds another REQ_NOIDLE
request comes in, again it will be scheduled after 1 second. Repeat this
and notice how a workload can lose its disk share and suffer due to
multiple sequential readers.

fsync can generate dependent IO where a bunch of data is written in the
context of fsync, and later some journaling data is written. Journaling
data comes in only after fsync has finished its IO (at least for ext4
that seemed to be the case). Now if one decides not to idle on the fsync
thread due to REQ_NOIDLE, then the next journaling write will not get
scheduled for another second. A process doing small fsyncs will suffer
badly in the presence of multiple sequential readers.

Hence doing tree idling on threads using the REQ_NOIDLE flag on requests
provides isolation from multiple sequential readers and at the same
time we do not idle on individual threads.

Q2. When to specify REQ_NOIDLE?
A2. Whenever one is doing a synchronous write and not expecting more writes
to be dispatched from the same context soon, one should be able to specify
REQ_NOIDLE on the writes, and that should probably work well for most cases.

Documentation/block/cmdline-partition.txt (new file, 39 lines)

Embedded device command line partition parsing
=====================================================================

Support for reading the block device partition table from the command line.
It is typically used for fixed block (eMMC) embedded devices.
Such a device has no MBR, which saves storage space, and the bootloader can
easily access data on the block device by absolute address.
Users can easily change the partitioning.

The format for the command line is just like mtdparts:

blkdevparts=<blkdev-def>[;<blkdev-def>]
  <blkdev-def> := <blkdev-id>:<partdef>[,<partdef>]
    <partdef> := <size>[@<offset>](part-name)

<blkdev-id>
    block device disk name. Embedded devices use a fixed block device,
    whose disk name is also fixed, such as: mmcblk0, mmcblk1, mmcblk0boot0.

<size>
    partition size, in bytes, such as: 512, 1m, 1G.

<offset>
    partition start address, in bytes.

(part-name)
    partition name. The kernel sends a uevent with "PARTNAME", and an
    application can create a link to the block device partition with the
    name "PARTNAME". User space applications can then access the partition
    by partition name.

Example:
    eMMC disk names are "mmcblk0" and "mmcblk0boot0".

    bootargs:
    'blkdevparts=mmcblk0:1G(data0),1G(data1),-;mmcblk0boot0:1m(boot),-(kernel)'

    dmesg:
    mmcblk0: p1(data0) p2(data1) p3()
    mmcblk0boot0: p1(boot) p2(kernel)

Documentation/block/data-integrity.txt (new file, 283 lines)

----------------------------------------------------------------------
1. INTRODUCTION

Modern filesystems feature checksumming of data and metadata to
protect against data corruption. However, the detection of the
corruption is done at read time which could potentially be months
after the data was written. At that point the original data that the
application tried to write is most likely lost.

The solution is to ensure that the disk is actually storing what the
application meant it to. Recent additions to both the SCSI family
protocols (SBC Data Integrity Field, SCC protection proposal) as well
as SATA/T13 (External Path Protection) try to remedy this by adding
support for appending integrity metadata to an I/O. The integrity
metadata (or protection information in SCSI terminology) includes a
checksum for each sector as well as an incrementing counter that
ensures the individual sectors are written in the right order. And
for some protection schemes also that the I/O is written to the right
place on disk.

Current storage controllers and devices implement various protective
measures, for instance checksumming and scrubbing. But these
technologies are working in their own isolated domains or at best
between adjacent nodes in the I/O path. The interesting thing about
DIF and the other integrity extensions is that the protection format
is well defined and every node in the I/O path can verify the
integrity of the I/O and reject it if corruption is detected. This
allows not only corruption prevention but also isolation of the point
of failure.

----------------------------------------------------------------------
2. THE DATA INTEGRITY EXTENSIONS

As written, the protocol extensions only protect the path between
controller and storage device. However, many controllers actually
allow the operating system to interact with the integrity metadata
(IMD). We have been working with several FC/SAS HBA vendors to enable
the protection information to be transferred to and from their
controllers.

The SCSI Data Integrity Field works by appending 8 bytes of protection
information to each sector. The data + integrity metadata is stored
in 520 byte sectors on disk. Data + IMD are interleaved when
transferred between the controller and target. The T13 proposal is
similar.

Because it is highly inconvenient for operating systems to deal with
520 (and 4104) byte sectors, we approached several HBA vendors and
encouraged them to allow separation of the data and integrity metadata
scatter-gather lists.

The controller will interleave the buffers on write and split them on
read. This means that Linux can DMA the data buffers to and from
host memory without changes to the page cache.

Also, the 16-bit CRC checksum mandated by both the SCSI and SATA specs
is somewhat heavy to compute in software. Benchmarks found that
calculating this checksum had a significant impact on system
performance for a number of workloads. Some controllers allow a
lighter-weight checksum to be used when interfacing with the operating
system. Emulex, for instance, supports the TCP/IP checksum instead.
The IP checksum received from the OS is converted to the 16-bit CRC
when writing and vice versa. This allows the integrity metadata to be
generated by Linux or the application at very low cost (comparable to
software RAID5).

The IP checksum is weaker than the CRC in terms of detecting bit
errors. However, the strength is really in the separation of the data
buffers and the integrity metadata. These two distinct buffers must
match up for an I/O to complete.

The separation of the data and integrity metadata buffers as well as
the choice in checksums is referred to as the Data Integrity
Extensions. As these extensions are outside the scope of the protocol
bodies (T10, T13), Oracle and its partners are trying to standardize
them within the Storage Networking Industry Association.

----------------------------------------------------------------------
3. KERNEL CHANGES

The data integrity framework in Linux enables protection information
to be pinned to I/Os and sent to/received from controllers that
support it.

The advantage to the integrity extensions in SCSI and SATA is that
they enable us to protect the entire path from application to storage
device. However, at the same time this is also the biggest
disadvantage. It means that the protection information must be in a
format that can be understood by the disk.

Generally Linux/POSIX applications are agnostic to the intricacies of
the storage devices they are accessing. The virtual filesystem switch
and the block layer make things like hardware sector size and
transport protocols completely transparent to the application.

However, this level of detail is required when preparing the
protection information to send to a disk. Consequently, the very
concept of an end-to-end protection scheme is a layering violation.
It is completely unreasonable for an application to be aware whether
it is accessing a SCSI or SATA disk.

The data integrity support implemented in Linux attempts to hide this
from the application. As far as the application (and to some extent
the kernel) is concerned, the integrity metadata is opaque information
that's attached to the I/O.

The current implementation allows the block layer to automatically
generate the protection information for any I/O. Eventually the
intent is to move the integrity metadata calculation to userspace for
user data. Metadata and other I/O that originates within the kernel
will still use the automatic generation interface.

Some storage devices allow each hardware sector to be tagged with a
16-bit value. The owner of this tag space is the owner of the block
device. I.e. the filesystem in most cases. The filesystem can use
this extra space to tag sectors as they see fit. Because the tag
space is limited, the block interface allows tagging bigger chunks by
way of interleaving. This way, 8*16 bits of information can be
attached to a typical 4KB filesystem block.

This also means that applications such as fsck and mkfs will need
access to manipulate the tags from user space. A passthrough
interface for this is being worked on.


----------------------------------------------------------------------
4. BLOCK LAYER IMPLEMENTATION DETAILS

4.1 BIO

The data integrity patches add a new field to struct bio when
CONFIG_BLK_DEV_INTEGRITY is enabled. bio_integrity(bio) returns a
pointer to a struct bip which contains the bio integrity payload.
Essentially a bip is a trimmed down struct bio which holds a bio_vec
containing the integrity metadata and the required housekeeping
information (bvec pool, vector count, etc.)

A kernel subsystem can enable data integrity protection on a bio by
calling bio_integrity_alloc(bio). This will allocate and attach the
bip to the bio.

Individual pages containing integrity metadata can subsequently be
attached using bio_integrity_add_page().

bio_free() will automatically free the bip.


4.2 BLOCK DEVICE

Because the format of the protection data is tied to the physical
disk, each block device has been extended with a block integrity
profile (struct blk_integrity). This optional profile is registered
with the block layer using blk_integrity_register().

The profile contains callback functions for generating and verifying
the protection data, as well as getting and setting application tags.
The profile also contains a few constants to aid in completing,
merging and splitting the integrity metadata.

Layered block devices will need to pick a profile that's appropriate
for all subdevices. blk_integrity_compare() can help with that. DM
and MD linear, RAID0 and RAID1 are currently supported. RAID4/5/6
will require extra work due to the application tag.


----------------------------------------------------------------------
5.0 BLOCK LAYER INTEGRITY API

5.1 NORMAL FILESYSTEM

The normal filesystem is unaware that the underlying block device
is capable of sending/receiving integrity metadata. The IMD will
be automatically generated by the block layer at submit_bio() time
in case of a WRITE. A READ request will cause the I/O integrity
to be verified upon completion.

IMD generation and verification can be toggled using the

	/sys/block/<bdev>/integrity/write_generate

and

	/sys/block/<bdev>/integrity/read_verify

flags.


5.2 INTEGRITY-AWARE FILESYSTEM

A filesystem that is integrity-aware can prepare I/Os with IMD
attached. It can also use the application tag space if this is
supported by the block device.


	int bio_integrity_prep(bio);

To generate IMD for WRITE and to set up buffers for READ, the
filesystem must call bio_integrity_prep(bio).

Prior to calling this function, the bio data direction and start
sector must be set, and the bio should have all data pages
added. It is up to the caller to ensure that the bio does not
change while I/O is in progress.

bio_integrity_prep() should only be called if
bio_integrity_enabled() returned 1.
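
A minimal sketch of that call sequence (not verbatim kernel code;
'rw' and 'bio' are assumed to be set up by the caller):

	if (bio_integrity_enabled(bio)) {
		if (bio_integrity_prep(bio)) {
			bio_endio(bio, -EIO);
			return;
		}
	}
	submit_bio(rw, bio);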


5.3 PASSING EXISTING INTEGRITY METADATA

Filesystems that either generate their own integrity metadata or
are capable of transferring IMD from user space can use the
following calls:


	struct bip * bio_integrity_alloc(bio, gfp_mask, nr_pages);

Allocates the bio integrity payload and hangs it off of the bio.
nr_pages indicates how many pages of protection data need to be
stored in the integrity bio_vec list (similar to bio_alloc()).

The integrity payload will be freed at bio_free() time.


	int bio_integrity_add_page(bio, page, len, offset);

Attaches a page containing integrity metadata to an existing
bio. The bio must have an existing bip,
i.e. bio_integrity_alloc() must have been called. For a WRITE,
the integrity metadata in the pages must be in a format
understood by the target device with the notable exception that
the sector numbers will be remapped as the request traverses the
I/O stack. This implies that the pages added using this call
will be modified during I/O! The first reference tag in the
integrity metadata must have a value of bip->bip_sector.

Pages can be added using bio_integrity_add_page() as long as
there is room in the bip bio_vec array (nr_pages).

Upon completion of a READ operation, the attached pages will
contain the integrity metadata received from the storage device.
It is up to the receiver to process them and verify data
integrity upon completion.
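
A hedged sketch of the sequence (not verbatim kernel code; 'prot_page',
'prot_len' and 'prot_offset' are hypothetical names for the caller's
pre-generated protection buffer):

	struct bio_integrity_payload *bip;

	bip = bio_integrity_alloc(bio, GFP_NOIO, 1);
	if (!bip)
		return -ENOMEM;

	if (bio_integrity_add_page(bio, prot_page, prot_len,
				   prot_offset) < prot_len)
		return -EIO;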


5.4 REGISTERING A BLOCK DEVICE AS CAPABLE OF EXCHANGING INTEGRITY
    METADATA

To enable integrity exchange on a block device the gendisk must be
registered as capable:

	int blk_integrity_register(gendisk, blk_integrity);

The blk_integrity struct is a template and should contain the
following:

	static struct blk_integrity my_profile = {
		.name			= "STANDARDSBODY-TYPE-VARIANT-CSUM",
		.generate_fn		= my_generate_fn,
		.verify_fn		= my_verify_fn,
		.tuple_size		= sizeof(struct my_tuple_size),
		.tag_size		= <tag bytes per hw sector>,
	};

'name' is a text string which will be visible in sysfs. This is
part of the userland API so choose it carefully and never change
it. The format is standards body-type-variant.
E.g. T10-DIF-TYPE1-IP or T13-EPP-0-CRC.

'generate_fn' generates appropriate integrity metadata (for WRITE).

'verify_fn' verifies that the data buffer matches the integrity
metadata.

'tuple_size' must be set to match the size of the integrity
metadata per sector. I.e. 8 for DIF and EPP.

'tag_size' must be set to identify how many bytes of tag space
are available per hardware sector. For DIF this is either 2 or
0 depending on the value of the Control Mode Page ATO bit.
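
A hypothetical registration from a driver's initialization path, where
'disk' is the driver's struct gendisk, might look like:

	if (blk_integrity_register(disk, &my_profile))
		pr_err("failed to register integrity profile\n");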

----------------------------------------------------------------------
2007-12-24 Martin K. Petersen <martin.petersen@oracle.com>

Documentation/block/deadline-iosched.txt (new file, 75 lines)

Deadline IO scheduler tunables
==============================

This little file attempts to document how the deadline io scheduler works.
In particular, it will clarify the meaning of the exposed tunables that may be
of interest to power users.

Selecting IO schedulers
-----------------------
Refer to Documentation/block/switching-sched.txt for information on
selecting an io scheduler on a per-device basis.


********************************************************************************


read_expire (in ms)
-----------

The goal of the deadline io scheduler is to attempt to guarantee a start
service time for a request. As we focus mainly on read latencies, this is
tunable. When a read request first enters the io scheduler, it is assigned
a deadline that is the current time + the read_expire value in units of
milliseconds.


write_expire (in ms)
-----------

Similar to read_expire mentioned above, but for writes.


fifo_batch (number of requests)
----------

Requests are grouped into ``batches'' of a particular data direction (read or
write) which are serviced in increasing sector order. To limit extra seeking,
deadline expiries are only checked between batches. fifo_batch controls the
maximum number of requests per batch.

This parameter tunes the balance between per-request latency and aggregate
throughput. When low latency is the primary concern, smaller is better (where
a value of 1 yields first-come first-served behaviour). Increasing fifo_batch
generally improves throughput, at the cost of latency variation.


writes_starved (number of dispatches)
--------------

When we have to move requests from the io scheduler queue to the block
device dispatch queue, we always give a preference to reads. However, we
don't want to starve writes indefinitely either. So writes_starved controls
how many times we give preference to reads over writes. When that has been
done writes_starved number of times, we dispatch some writes based on the
same criteria as reads.


front_merges (bool)
------------

Sometimes it happens that a request enters the io scheduler that is contiguous
with a request that is already on the queue. Either it fits in the back of that
request, or it fits at the front. That is called either a back merge candidate
or a front merge candidate. Due to the way files are typically laid out,
back merges are much more common than front merges. For some workloads, you
may even know that it is a waste of time to spend any time attempting to
front merge requests. Setting front_merges to 0 disables this functionality.
Front merges may still occur due to the cached last_merge hint, but since
that comes at basically 0 cost we leave that on. We simply disable the
rbtree front sector lookup when the io scheduler merge function is called.


Nov 11 2002, Jens Axboe <jens.axboe@oracle.com>

Documentation/block/ioprio.txt (new file, 183 lines)

Block io priorities
===================


Intro
-----

With the introduction of cfq v3 (aka cfq-ts or time sliced cfq), basic io
priorities are supported for reads on files. This enables users to io nice
processes or process groups, similar to what has been possible with cpu
scheduling for ages. This document mainly details the current possibilities
with cfq; other io schedulers do not support io priorities thus far.

Scheduling classes
------------------

CFQ implements three generic scheduling classes that determine how io is
served for a process.

IOPRIO_CLASS_RT: This is the realtime io class. This scheduling class is given
higher priority than any other in the system, processes from this class are
given first access to the disk every time. Thus it needs to be used with some
care: one io RT process can starve the entire system. Within the RT class,
there are 8 levels of class data that determine exactly how much time this
process needs the disk for on each service. In the future this might change
to be more directly mappable to performance, by passing in a wanted data
rate instead.

IOPRIO_CLASS_BE: This is the best-effort scheduling class, which is the default
for any process that hasn't set a specific io priority. The class data
determines how much io bandwidth the process will get; it's directly mappable
to the cpu nice levels, just more coarsely implemented. 0 is the highest
BE prio level, 7 is the lowest. The mapping between cpu nice level and io
nice level is determined as: io_nice = (cpu_nice + 20) / 5.
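
For example, a process at the default cpu nice level of 0 maps to io nice
level (0 + 20) / 5 = 4, the middle of the best-effort range.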

IOPRIO_CLASS_IDLE: This is the idle scheduling class, processes running at this
level only get io time when no one else needs the disk. The idle class has no
class data, since it doesn't really apply here.

Tools
-----

See below for a sample ionice tool. Usage:

# ionice -c<class> -n<level> -p<pid>

If pid isn't given, the current process is assumed. IO priority settings
are inherited on fork, so you can use ionice to start the process at a given
level:

# ionice -c2 -n0 /bin/ls

will run ls at the best-effort scheduling class at the highest priority.
For a running process, you can give the pid instead:

# ionice -c1 -n2 -p100

will change pid 100 to run at the realtime scheduling class, at priority 2.

---> snip ionice.c tool <---

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <getopt.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <asm/unistd.h>

extern int sys_ioprio_set(int, int, int);
extern int sys_ioprio_get(int, int);

#if defined(__i386__)
#define __NR_ioprio_set		289
#define __NR_ioprio_get		290
#elif defined(__ppc__)
#define __NR_ioprio_set		273
#define __NR_ioprio_get		274
#elif defined(__x86_64__)
#define __NR_ioprio_set		251
#define __NR_ioprio_get		252
#elif defined(__ia64__)
#define __NR_ioprio_set		1274
#define __NR_ioprio_get		1275
#else
#error "Unsupported arch"
#endif

static inline int ioprio_set(int which, int who, int ioprio)
{
	return syscall(__NR_ioprio_set, which, who, ioprio);
}

static inline int ioprio_get(int which, int who)
{
	return syscall(__NR_ioprio_get, which, who);
}

enum {
	IOPRIO_CLASS_NONE,
	IOPRIO_CLASS_RT,
	IOPRIO_CLASS_BE,
	IOPRIO_CLASS_IDLE,
};

enum {
	IOPRIO_WHO_PROCESS = 1,
	IOPRIO_WHO_PGRP,
	IOPRIO_WHO_USER,
};

#define IOPRIO_CLASS_SHIFT	13

const char *to_prio[] = { "none", "realtime", "best-effort", "idle", };

int main(int argc, char *argv[])
{
	int ioprio = 4, set = 0, ioprio_class = IOPRIO_CLASS_BE;
	int c, pid = 0;

	while ((c = getopt(argc, argv, "+n:c:p:")) != EOF) {
		switch (c) {
		case 'n':
			ioprio = strtol(optarg, NULL, 10);
			set = 1;
			break;
		case 'c':
			ioprio_class = strtol(optarg, NULL, 10);
			set = 1;
			break;
		case 'p':
			pid = strtol(optarg, NULL, 10);
			break;
		}
	}

	switch (ioprio_class) {
	case IOPRIO_CLASS_NONE:
		ioprio_class = IOPRIO_CLASS_BE;
		break;
	case IOPRIO_CLASS_RT:
	case IOPRIO_CLASS_BE:
		break;
	case IOPRIO_CLASS_IDLE:
		ioprio = 7;
		break;
	default:
		printf("bad prio class %d\n", ioprio_class);
		return 1;
	}

	if (!set) {
		if (!pid && argv[optind])
			pid = strtol(argv[optind], NULL, 10);

		ioprio = ioprio_get(IOPRIO_WHO_PROCESS, pid);

		printf("pid=%d, %d\n", pid, ioprio);

		if (ioprio == -1)
			perror("ioprio_get");
		else {
			ioprio_class = ioprio >> IOPRIO_CLASS_SHIFT;
			ioprio = ioprio & 0xff;
			printf("%s: prio %d\n", to_prio[ioprio_class], ioprio);
		}
	} else {
		if (ioprio_set(IOPRIO_WHO_PROCESS, pid, ioprio | ioprio_class << IOPRIO_CLASS_SHIFT) == -1) {
			perror("ioprio_set");
			return 1;
		}

		if (argv[optind])
			execvp(argv[optind], &argv[optind]);
	}

	return 0;
}

---> snip ionice.c tool <---


March 11 2005, Jens Axboe <jens.axboe@oracle.com>

Documentation/block/null_blk.txt (new file, 72 lines)

Null block device driver
================================================================================

I. Overview

The null block device (/dev/nullb*) is used for benchmarking the various
block-layer implementations. It emulates a block device of X gigabytes in
size. The following instances are possible:

  Single-queue block-layer
    - Request-based.
    - Single submission queue per device.
    - Implements IO scheduling algorithms (CFQ, Deadline, noop).
  Multi-queue block-layer
    - Request-based.
    - Configurable submission queues per device.
  No block-layer (known as bio-based)
    - Bio-based. IO requests are submitted directly to the device driver.
    - Directly accepts bio data structures and returns them.

All of them have a completion queue for each core in the system.

II. Module parameters applicable for all instances:

queue_mode=[0-2]: Default: 2-Multi-queue
  Selects which block-layer the module should instantiate with.

  0: Bio-based.
  1: Single-queue.
  2: Multi-queue.

home_node=[0--nr_nodes]: Default: NUMA_NO_NODE
  Selects what CPU node the data structures are allocated from.

gb=[Size in GB]: Default: 250GB
  The size of the device reported to the system.

bs=[Block size (in bytes)]: Default: 512 bytes
  The block size reported to the system.

nr_devices=[Number of devices]: Default: 2
  Number of block devices instantiated. They are instantiated as /dev/nullb0,
  etc.

irqmode=[0-2]: Default: 1-Soft-irq
  The completion mode used for completing IOs to the block-layer.

  0: None.
  1: Soft-irq. Uses IPI to complete IOs across CPU nodes. Simulates the
     overhead when IOs are issued from another CPU node than the home node
     the device is connected to.
  2: Timer: Waits a specific period (completion_nsec) for each IO before
     completion.

completion_nsec=[ns]: Default: 10,000ns
  Combined with irqmode=2 (timer). The time each completion event must wait.

submit_queues=[0..nr_cpus]:
  The number of submission queues attached to the device driver. If unset, it
  defaults to 1 on single-queue and bio-based instances. For multi-queue,
  it is ignored when the use_per_node_hctx module parameter is 1.

hw_queue_depth=[0..qdepth]: Default: 64
  The hardware queue depth of the device.

III: Multi-queue specific parameters

use_per_node_hctx=[0/1]: Default: 0
  0: The number of submit queues is set to the value of the submit_queues
     parameter.
  1: The multi-queue block layer is instantiated with a hardware dispatch
     queue for each CPU node in the system.
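
As a hypothetical usage example, the following loads the driver with two
4GB multi-queue devices using a 4096-byte block size and four submission
queues each:

  # modprobe null_blk queue_mode=2 nr_devices=2 gb=4 bs=4096 submit_queues=4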

Documentation/block/queue-sysfs.txt (new file, 138 lines)
|
@ -0,0 +1,138 @@
|
|||
Queue sysfs files
|
||||
=================
|
||||
|
||||
This text file will detail the queue files that are located in the sysfs tree
|
||||
for each block device. Note that stacked devices typically do not export
|
||||
any settings, since their queue merely functions are a remapping target.
|
||||
These files are the ones found in the /sys/block/xxx/queue/ directory.
|
||||
|
||||
Files denoted with a RO postfix are readonly and the RW postfix means
|
||||
read-write.
|
||||
|
||||
add_random (RW)
|
||||
----------------
|
||||
This file allows to turn off the disk entropy contribution. Default
|
||||
value of this file is '1'(on).
|
||||
|
||||
discard_granularity (RO)
|
||||
-----------------------
|
||||
This shows the size of internal allocation of the device in bytes, if
|
||||
reported by the device. A value of '0' means device does not support
|
||||
the discard functionality.
|
||||
|
||||
discard_max_bytes (RO)
|
||||
----------------------
|
||||
Devices that support discard functionality may have internal limits on
|
||||
the number of bytes that can be trimmed or unmapped in a single operation.
|
||||
The discard_max_bytes parameter is set by the device driver to the maximum
|
||||
number of bytes that can be discarded in a single operation. Discard
|
||||
requests issued to the device must not exceed this limit. A discard_max_bytes
|
||||
value of 0 means that the device does not support discard functionality.
|
||||
|
||||
discard_zeroes_data (RO)
|
||||
------------------------
|
||||
When read, this file will show if the discarded block are zeroed by the
|
||||
device or not. If its value is '1' the blocks are zeroed otherwise not.
|
||||
|
||||
hw_sector_size (RO)
|
||||
-------------------
|
||||
This is the hardware sector size of the device, in bytes.
|
||||
|
||||
iostats (RW)
|
||||
-------------
|
||||
This file is used to control (on/off) the iostats accounting of the
|
||||
disk.
|
||||
|
||||
logical_block_size (RO)
|
||||
-----------------------
|
||||
This is the logcal block size of the device, in bytes.
|
||||
|
||||
max_hw_sectors_kb (RO)
|
||||
----------------------
|
||||
This is the maximum number of kilobytes supported in a single data transfer.
|
||||
|
||||
max_integrity_segments (RO)
|
||||
---------------------------
|
||||
When read, this file shows the max limit of integrity segments as
|
||||
set by block layer which a hardware controller can handle.
|
||||
|
||||
max_sectors_kb (RW)
|
||||
-------------------
|
||||
This is the maximum number of kilobytes that the block layer will allow
|
||||
for a filesystem request. Must be smaller than or equal to the maximum
|
||||
size allowed by the hardware.
|
||||
|
||||
max_segments (RO)
|
||||
-----------------
|
||||
Maximum number of segments of the device.
|
||||
|
||||
max_segment_size (RO)
|
||||
---------------------
|
||||
Maximum segment size of the device.
|
||||
|
||||
minimum_io_size (RO)
|
||||
--------------------
|
||||
This is the smallest preferred IO size reported by the device.
|
||||
|
||||
nomerges (RW)
|
||||
-------------
|
||||
This enables the user to disable the lookup logic involved with IO
|
||||
merging requests in the block layer. By default (0) all merges are
|
||||
enabled. When set to 1 only simple one-hit merges will be tried. When
|
||||
set to 2 no merge algorithms will be tried (including one-hit or more
|
||||
complex tree/hash lookups).
|
||||
|
||||
nr_requests (RW)
|
||||
----------------
|
||||
This controls how many requests may be allocated in the block layer for
|
||||
read or write requests. Note that the total allocated number may be twice
|
||||
this amount, since it applies only to reads or writes (not the accumulated
|
||||
sum).
|
||||
|
||||
To avoid priority inversion through request starvation, a request
|
||||
queue maintains a separate request pool per each cgroup when
|
||||
CONFIG_BLK_CGROUP is enabled, and this parameter applies to each such
|
||||
per-block-cgroup request pool. IOW, if there are N block cgroups,
|
||||
each request queue may have up to N request pools, each independently
|
||||
regulated by nr_requests.
|
||||
|
||||
optimal_io_size (RO)
|
||||
--------------------
|
||||
This is the optimal IO size reported by the device.
|
||||
|
||||
physical_block_size (RO)
|
||||
------------------------
|
||||
This is the physical block size of device, in bytes.
|
||||
|
||||
read_ahead_kb (RW)
|
||||
------------------
|
||||
Maximum number of kilobytes to read-ahead for filesystems on this block
|
||||
device.
|
||||
|
||||
rotational (RW)
|
||||
---------------
|
||||
This file is used to stat if the device is of rotational type or
|
||||
non-rotational type.
|
||||
|
||||
rq_affinity (RW)
|
||||
----------------
|
||||
If this option is '1', the block layer will migrate request completions to the
|
||||
cpu "group" that originally submitted the request. For some workloads this
|
||||
provides a significant reduction in CPU cycles due to caching effects.
|
||||
|
||||
For storage configurations that need to maximize distribution of completion
|
||||
processing setting this option to '2' forces the completion to run on the
|
||||
requesting cpu (bypassing the "group" aggregation logic).

scheduler (RW)
--------------
When read, this file will display the current and available IO schedulers
for this block device. The currently active IO scheduler will be enclosed
in [] brackets. Writing an IO scheduler name to this file will switch
control of this block device to that new IO scheduler. Note that writing
an IO scheduler name to this file will attempt to load that IO scheduler
module, if it isn't already present in the system.


Jens Axboe <jens.axboe@oracle.com>, February 2009

88
Documentation/block/request.txt
Normal file

@ -0,0 +1,88 @@
struct request documentation

Jens Axboe <jens.axboe@oracle.com> 27/05/02

1.0
Index

2.0 Struct request members classification

	2.1 struct request members explanation

3.0


2.0
Short explanation of request members

Classification flags:

	D	driver member
	B	block layer member
	I	I/O scheduler member

Unless an entry contains a D classification, a device driver must not access
this member. Some members may contain D classifications, but should only be
accessed through certain macros or functions (eg ->flags).

<linux/blkdev.h>

2.1
Member				Flag	Comment
------				----	-------

struct list_head queuelist	BI	Organization on various internal
					queues

void *elevator_private		I	I/O scheduler private data

unsigned char cmd[16]		D	Driver can use this for setting up
					a cdb before execution, see
					blk_queue_prep_rq

unsigned long flags		DBI	Contains info about data direction,
					request type, etc.

int rq_status			D	Request status bits

kdev_t rq_dev			DBI	Target device

int errors			DB	Error counts

sector_t sector			DBI	Target location

sector_t hard_sector		B	Used to keep sector sane

unsigned long nr_sectors	DBI	Total number of sectors in request

unsigned long hard_nr_sectors	B	Used to keep nr_sectors sane

unsigned short nr_phys_segments	DB	Number of physical scatter gather
					segments in a request

unsigned short nr_hw_segments	DB	Number of hardware scatter gather
					segments in a request

unsigned int current_nr_sectors	DB	Number of sectors in first segment
					of request

unsigned int hard_cur_sectors	B	Used to keep current_nr_sectors sane

int tag				DB	TCQ tag, if assigned

void *special			D	Free to be used by driver

char *buffer			D	Map of first segment, also see
					section on bouncing SECTION

struct completion *waiting	D	Can be used by driver to get signalled
					on request completion

struct bio *bio			DBI	First bio in request

struct bio *biotail		DBI	Last bio in request

struct request_queue *q		DB	Request queue this request belongs to

struct request_list *rl		B	Request list this request came from

82
Documentation/block/stat.txt
Normal file

@ -0,0 +1,82 @@
Block layer statistics in /sys/block/<dev>/stat
===============================================

This file documents the contents of the /sys/block/<dev>/stat file.

The stat file provides several statistics about the state of block
device <dev>.

Q. Why are there multiple statistics in a single file?  Doesn't sysfs
   normally contain a single value per file?
A. By having a single file, the kernel can guarantee that the statistics
   represent a consistent snapshot of the state of the device.  If the
   statistics were exported as multiple files containing one statistic
   each, it would be impossible to guarantee that a set of readings
   represent a single point in time.

The stat file consists of a single line of text containing 11 decimal
values separated by whitespace.  The fields are summarized in the
following table, and described in more detail below; a short example of
reading the file follows the table.

Name            units         description
----            -----         -----------
read I/Os       requests      number of read I/Os processed
read merges     requests      number of read I/Os merged with in-queue I/O
read sectors    sectors       number of sectors read
read ticks      milliseconds  total wait time for read requests
write I/Os      requests      number of write I/Os processed
write merges    requests      number of write I/Os merged with in-queue I/O
write sectors   sectors       number of sectors written
write ticks     milliseconds  total wait time for write requests
in_flight       requests      number of I/Os currently in flight
io_ticks        milliseconds  total time this block device has been active
time_in_queue   milliseconds  total wait time for all requests
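
As promised above, a short illustration of reading the file (sda and all
numbers are made up; the 11 fields appear in the table's order):

# cat /sys/block/sda/stat
446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160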

read I/Os, write I/Os
=====================

These values increment when an I/O request completes.

read merges, write merges
=========================

These values increment when an I/O request is merged with an
already-queued I/O request.

read sectors, write sectors
===========================

These values count the number of sectors read from or written to this
block device.  The "sectors" in question are the standard UNIX 512-byte
sectors, not any device- or filesystem-specific block size.  The
counters are incremented when the I/O completes.

read ticks, write ticks
=======================

These values count the number of milliseconds that I/O requests have
waited on this block device.  If there are multiple I/O requests waiting,
these values will increase at a rate greater than 1000/second; for
example, if 60 read requests wait for an average of 30 ms, the read_ticks
field will increase by 60*30 = 1800.
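
Since the counters are cumulative, dividing read ticks by read I/Os gives
the average wait per read since boot; a rough shell sketch (field numbers
follow the table above, device name illustrative):

# awk '{ printf "avg read wait: %.1f ms\n", $4 / $1 }' /sys/block/sda/stat
avg read wait: 9.8 ms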

in_flight
=========

This value counts the number of I/O requests that have been issued to
the device driver but have not yet completed.  It does not include I/O
requests that are in the queue but not yet issued to the device driver.

io_ticks
========

This value counts the number of milliseconds during which the device has
had I/O requests queued.

time_in_queue
=============

This value counts the number of milliseconds that I/O requests have waited
on this block device.  If there are multiple I/O requests waiting, this
value will increase as the product of the number of milliseconds times the
number of requests waiting (see "read ticks" above for an example).
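
Sampling io_ticks (field 10) twice gives an approximate utilization
figure; a hedged sketch assuming a one-second interval and an
illustrative device name:

# t1=$(awk '{print $10}' /sys/block/sda/stat); sleep 1
# t2=$(awk '{print $10}' /sys/block/sda/stat)
# echo "utilization: $(( (t2 - t1) / 10 ))%"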

37
Documentation/block/switching-sched.txt
Normal file

@ -0,0 +1,37 @@
To choose IO schedulers at boot time, use the argument 'elevator=deadline'.
'noop' and 'cfq' (the default) are also available. The boot-time argument
selects the IO scheduler globally, for all block devices at once; individual
devices can be switched at runtime as described below.
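
For instance, a boot loader entry passing the argument might look like
this (kernel path and root device are illustrative):

kernel /boot/vmlinuz root=/dev/sda1 ro elevator=deadline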

Each io queue has a set of io scheduler tunables associated with it. These
tunables control how the io scheduler works. You can find these entries
in:

/sys/block/<device>/queue/iosched

assuming that you have sysfs mounted on /sys. If you don't have sysfs mounted,
you can do so by typing:

# mount none /sys -t sysfs

As of the Linux 2.6.10 kernel, it is now possible to change the
IO scheduler for a given block device on the fly (thus making it possible,
for instance, to set the CFQ scheduler for the system default, but
set a specific device to use the deadline or noop schedulers - which
can improve that device's throughput).

To set a specific scheduler, simply do this:

echo SCHEDNAME > /sys/block/DEV/queue/scheduler

where SCHEDNAME is the name of a defined IO scheduler, and DEV is the
device name (hda, hdb, sda, or whatever you happen to have).

The list of defined schedulers can be found by simply doing
a "cat /sys/block/DEV/queue/scheduler" - the list of valid names
will be displayed, with the currently selected scheduler in brackets:

# cat /sys/block/hda/queue/scheduler
noop deadline [cfq]
# echo deadline > /sys/block/hda/queue/scheduler
# cat /sys/block/hda/queue/scheduler
noop [deadline] cfq

86
Documentation/block/writeback_cache_control.txt
Normal file

@ -0,0 +1,86 @@

Explicit volatile write back cache control
==========================================

Introduction
------------

Many storage devices, especially in the consumer market, come with volatile
write back caches. That means the devices signal I/O completion to the
operating system before data actually has hit the non-volatile storage. This
behavior obviously speeds up various workloads, but it means the operating
system needs to force data out to the non-volatile storage when it performs
a data integrity operation like fsync, sync or an unmount.

The Linux block layer provides two simple mechanisms that let filesystems
control the caching behavior of the storage device. These mechanisms are
a forced cache flush, and the Force Unit Access (FUA) flag for requests.


Explicit cache flushes
----------------------

The REQ_FLUSH flag can be OR'd into the r/w flags of a bio submitted from
the filesystem and will make sure the volatile cache of the storage device
has been flushed before the actual I/O operation is started.  This explicitly
guarantees that previously completed write requests are on non-volatile
storage before the flagged bio starts. In addition the REQ_FLUSH flag can be
set on an otherwise empty bio structure, which causes only an explicit cache
flush without any dependent I/O.  It is recommended to use
the blkdev_issue_flush() helper for a pure cache flush.


Forced Unit Access
------------------

The REQ_FUA flag can be OR'd into the r/w flags of a bio submitted from the
filesystem and will make sure that I/O completion for this request is only
signaled after the data has been committed to non-volatile storage.


Implementation details for filesystems
--------------------------------------

Filesystems can simply set the REQ_FLUSH and REQ_FUA bits and do not have to
worry if the underlying devices need any explicit cache flushing and how
the Forced Unit Access is implemented. The REQ_FLUSH and REQ_FUA flags
may both be set on a single bio.


Implementation details for make_request_fn based block drivers
---------------------------------------------------------------

These drivers will always see the REQ_FLUSH and REQ_FUA bits as they sit
directly below the submit_bio interface. For remapping drivers the REQ_FUA
bits need to be propagated to underlying devices, and a global flush needs
to be implemented for bios with the REQ_FLUSH bit set. For real device
drivers that do not have a volatile cache the REQ_FLUSH and REQ_FUA bits
on non-empty bios can simply be ignored, and REQ_FLUSH requests without
data can be completed successfully without doing any work. Drivers for
devices with volatile caches need to implement the support for these
flags themselves without any help from the block layer.


Implementation details for request_fn based block drivers
----------------------------------------------------------

For devices that do not support volatile write caches there is no driver
support required; the block layer completes empty REQ_FLUSH requests before
entering the driver and strips off the REQ_FLUSH and REQ_FUA bits from
requests that have a payload. For devices with volatile write caches the
driver needs to tell the block layer that it supports flushing caches by
doing:

	blk_queue_flush(sdkp->disk->queue, REQ_FLUSH);

and handle empty REQ_FLUSH requests in its prep_fn/request_fn.  Note that
REQ_FLUSH requests with a payload are automatically turned into a sequence
of an empty REQ_FLUSH request followed by the actual write by the block
layer.  For devices that also support the FUA bit the block layer needs
to be told to pass through the REQ_FUA bit using:

	blk_queue_flush(sdkp->disk->queue, REQ_FLUSH | REQ_FUA);

and the driver must handle write requests that have the REQ_FUA bit set
in prep_fn/request_fn.  If the FUA bit is not natively supported the block
layer turns it into an empty REQ_FLUSH request after the actual write.