Fixed MTP to work with TWRP

This commit is contained in:
awab228 2018-06-19 23:16:04 +02:00
commit f6dfaef42e
50820 changed files with 20846062 additions and 0 deletions


@ -0,0 +1,158 @@
00-INDEX
- this file (info on some of the filesystems supported by linux).
Locking
- info on locking rules as they pertain to Linux VFS.
Makefile
- Makefile for building the filesystems part of DocBook.
9p.txt
- 9p (v9fs) is an implementation of the Plan 9 remote fs protocol.
adfs.txt
- info and mount options for the Acorn Advanced Disc Filing System.
afs.txt
- info and examples for the distributed AFS (Andrew File System) fs.
affs.txt
- info and mount options for the Amiga Fast File System.
autofs4-mount-control.txt
- info on device control operations for autofs4 module.
automount-support.txt
- information about filesystem automount support.
befs.txt
- information about the BeOS filesystem for Linux.
bfs.txt
- info for the SCO UnixWare Boot Filesystem (BFS).
btrfs.txt
- info for the BTRFS filesystem.
caching/
- directory containing filesystem cache documentation.
ceph.txt
- info for the Ceph Distributed File System.
cifs/
- directory containing CIFS filesystem documentation and example code.
coda.txt
- description of the CODA filesystem.
configfs/
- directory containing configfs documentation and example code.
cramfs.txt
- info on the cram filesystem for small storage (ROMs etc).
debugfs.txt
- info on the debugfs filesystem.
devpts.txt
- info on the devpts filesystem.
directory-locking
- info about the locking scheme used for directory operations.
dlmfs.txt
- info on the userspace interface to the OCFS2 DLM.
dnotify.txt
- info about directory notification in Linux.
dnotify_test.c
- example program for dnotify.
ecryptfs.txt
- docs on eCryptfs: stacked cryptographic filesystem for Linux.
efivarfs.txt
- info for the efivarfs filesystem.
exofs.txt
- info, usage, mount options, design about EXOFS.
ext2.txt
- info, mount options and specifications for the Ext2 filesystem.
ext3.txt
- info, mount options and specifications for the Ext3 filesystem.
ext4.txt
- info, mount options and specifications for the Ext4 filesystem.
f2fs.txt
- info and mount options for the F2FS filesystem.
fiemap.txt
- info on fiemap ioctl.
files.txt
- info on file management in the Linux kernel.
fuse.txt
- info on the Filesystem in User SpacE including mount options.
gfs2-glocks.txt
- info on the Global File System 2 - Glock internal locking rules.
gfs2-uevents.txt
- info on the Global File System 2 - uevents.
gfs2.txt
- info on the Global File System 2.
hfs.txt
- info on the Macintosh HFS Filesystem for Linux.
hfsplus.txt
- info on the Macintosh HFSPlus Filesystem for Linux.
hpfs.txt
- info and mount options for the OS/2 HPFS.
inotify.txt
- info on the powerful yet simple file change notification system.
isofs.txt
- info and mount options for the ISO 9660 (CDROM) filesystem.
jfs.txt
- info and mount options for the JFS filesystem.
locks.txt
- info on file locking implementations, flock() vs. fcntl(), etc.
logfs.txt
- info on the LogFS flash filesystem.
mandatory-locking.txt
- info on the Linux implementation of Sys V mandatory file locking.
ncpfs.txt
- info on Novell Netware(tm) filesystem using NCP protocol.
nfs/
- nfs-related documentation.
nilfs2.txt
- info and mount options for the NILFS2 filesystem.
ntfs.txt
- info and mount options for the NTFS filesystem (Windows NT).
ocfs2.txt
- info and mount options for the OCFS2 clustered filesystem.
omfs.txt
- info on the Optimized MPEG FileSystem.
path-lookup.txt
- info on path walking and name lookup locking.
pohmelfs/
- directory containing pohmelfs filesystem documentation.
porting
- various information on filesystem porting.
proc.txt
- info on Linux's /proc filesystem.
qnx6.txt
- info on the QNX6 filesystem.
quota.txt
- info on Quota subsystem.
ramfs-rootfs-initramfs.txt
- info on the 'in memory' filesystems ramfs, rootfs and initramfs.
relay.txt
- info on relay, for efficient streaming from kernel to user space.
romfs.txt
- description of the ROMFS filesystem.
seq_file.txt
- how to use the seq_file API.
sharedsubtree.txt
- a description of shared subtrees for namespaces.
spufs.txt
- info and mount options for the SPU filesystem used on Cell.
squashfs.txt
- info on the squashfs filesystem.
sysfs-pci.txt
- info on accessing PCI device resources through sysfs.
sysfs-tagging.txt
- info on sysfs tagging to avoid duplicates.
sysfs.txt
- info on sysfs, a ram-based filesystem for exporting kernel objects.
sysv-fs.txt
- info on the SystemV/V7/Xenix/Coherent filesystem.
tmpfs.txt
- info on tmpfs, a filesystem that holds all files in virtual memory.
ubifs.txt
- info on the Unsorted Block Images FileSystem.
udf.txt
- info and mount options for the UDF filesystem.
ufs.txt
- info on the ufs filesystem.
vfat.txt
- info on using the VFAT filesystem used in Windows NT and Windows 95.
vfs.txt
- overview of the Virtual File System.
xfs-delayed-logging-design.txt
- info on the XFS Delayed Logging Design.
xfs-self-describing-metadata.txt
- info on XFS Self Describing Metadata.
xfs.txt
- info and mount options for the XFS filesystem.
xip.txt
- info on execute-in-place for file mappings.


@ -0,0 +1,161 @@
v9fs: Plan 9 Resource Sharing for Linux
=======================================
ABOUT
=====
v9fs is a Unix implementation of the Plan 9 9p remote filesystem protocol.
This software was originally developed by Ron Minnich <rminnich@sandia.gov>
and Maya Gokhale. Additional development by Greg Watson
<gwatson@lanl.gov> and most recently Eric Van Hensbergen
<ericvh@gmail.com>, Latchesar Ionkov <lucho@ionkov.net> and Russ Cox
<rsc@swtch.com>.
The best detailed explanation of the Linux implementation and applications of
the 9p client is available in the form of a USENIX paper:
http://www.usenix.org/events/usenix05/tech/freenix/hensbergen.html
Other applications are described in the following papers:
* XCPU & Clustering
http://xcpu.org/papers/xcpu-talk.pdf
* KVMFS: control file system for KVM
http://xcpu.org/papers/kvmfs.pdf
* CellFS: A New Programming Model for the Cell BE
http://xcpu.org/papers/cellfs-talk.pdf
* PROSE I/O: Using 9p to enable Application Partitions
http://plan9.escet.urjc.es/iwp9/cready/PROSE_iwp9_2006.pdf
* VirtFS: A Virtualization Aware File System pass-through
http://goo.gl/3WPDg
USAGE
=====
For remote file server:
mount -t 9p 10.10.1.2 /mnt/9
For Plan 9 From User Space applications (http://swtch.com/plan9)
mount -t 9p `namespace`/acme /mnt/9 -o trans=unix,uname=$USER
For server running on QEMU host with virtio transport:
mount -t 9p -o trans=virtio <mount_tag> /mnt/9
where mount_tag is the tag associated by the server to each of the exported
mount points. Each 9P export is seen by the client as a virtio device with an
associated "mount_tag" property. Available mount tags can be
seen by reading /sys/bus/virtio/drivers/9pnet_virtio/virtio<n>/mount_tag files.
OPTIONS
=======
trans=name select an alternative transport. Valid options are
currently:
unix - specifying a named pipe mount point
tcp - specifying a normal TCP/IP connection
fd - use passed file descriptors for connection
(see rfdno and wfdno)
virtio - connect to the next virtio channel available
(from QEMU with trans_virtio module)
rdma - connect to a specified RDMA channel
uname=name user name to attempt mount as on the remote server. The
server may override or ignore this value. Certain user
names may require authentication.
aname=name aname specifies the file tree to access when the server is
offering several exported file systems.
cache=mode specifies a caching policy. By default, no caches are used.
none = default no cache policy, metadata and data
alike are synchronous.
loose = no attempts are made at consistency,
intended for exclusive, read-only mounts
fscache = use FS-Cache for a persistent, read-only
cache backend.
mmap = minimal cache that is only used for read-write
mmap. Nothing else is cached; otherwise behaves like cache=none
debug=n specifies debug level. The debug level is a bitmask.
0x01 = display verbose error messages
0x02 = developer debug (DEBUG_CURRENT)
0x04 = display 9p trace
0x08 = display VFS trace
0x10 = display Marshalling debug
0x20 = display RPC debug
0x40 = display transport debug
0x80 = display allocation debug
0x100 = display protocol message debug
0x200 = display Fid debug
0x400 = display packet debug
0x800 = display fscache tracing debug
rfdno=n the file descriptor for reading with trans=fd
wfdno=n the file descriptor for writing with trans=fd
msize=n the number of bytes to use for 9p packet payload
port=n port to connect to on the remote server
noextend force legacy mode (no 9p2000.u or 9p2000.L semantics)
version=name Select 9P protocol version. Valid options are:
9p2000 - Legacy mode (same as noextend)
9p2000.u - Use 9P2000.u protocol
9p2000.L - Use 9P2000.L protocol
dfltuid attempt to mount as a particular uid
dfltgid attempt to mount with a particular gid
afid security channel - used by Plan 9 authentication protocols
nodevmap do not map special files - represent them as normal files.
This can be used to share devices/named pipes/sockets between
hosts. This functionality will be expanded in later versions.
access there are four access modes.
user = if a user tries to access a file on v9fs
filesystem for the first time, v9fs sends an
attach command (Tattach) for that user.
This is the default mode.
<uid> = allows only user with uid=<uid> to access
the files on the mounted filesystem
any = v9fs does single attach and performs all
operations as one user
client = ACL based access check on the 9p client
side for access validation
cachetag cache tag to use for the specified persistent cache.
cache tags for existing cache sessions can be listed at
/sys/fs/9p/caches. (applies only to cache=fscache)
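The options above are combined into a single comma-separated string at
mount time. As a hedged illustration (the server address, port, export
path and option values below are made up for the example), the same
thing expressed through the mount(2) system call:

  #include <stdio.h>
  #include <sys/mount.h>

  /*
   * Hypothetical example: mount a 9p export over TCP, selecting the
   * protocol version, msize, caching policy and remote file tree.
   */
  int main(void)
  {
          const char *opts = "trans=tcp,port=564,version=9p2000.L,"
                             "msize=65536,cache=loose,aname=/export";

          if (mount("10.10.1.2", "/mnt/9", "9p", 0, opts) != 0) {
                  perror("mount");
                  return 1;
          }
          return 0;
  }

The equivalent mount(8) invocation simply passes the same option string
with -o.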
RESOURCES
=========
Protocol specifications are maintained on github:
http://ericvh.github.com/9p-rfc/
9p client and server implementations are listed on
http://9p.cat-v.org/implementations
A 9p2000.L server is being developed by LLNL and can be found
at http://code.google.com/p/diod/
There are user and developer mailing lists available through the v9fs project
on sourceforge (http://sourceforge.net/projects/v9fs).
News and other information is maintained on a Wiki.
(http://sf.net/apps/mediawiki/v9fs/index.php).
Bug reports are best issued via the mailing list.
For more information on the Plan 9 Operating System check out
http://plan9.bell-labs.com/plan9
For information on Plan 9 from User Space (Plan 9 applications and libraries
ported to Linux/BSD/OSX/etc) check out http://swtch.com/plan9


@ -0,0 +1,577 @@
The text below describes the locking rules for VFS-related methods.
It is (believed to be) up-to-date. *Please*, if you change anything in
prototypes or locking protocols - update this file. And update the relevant
instances in the tree, don't leave that to maintainers of filesystems/devices/
etc. At the very least, put the list of dubious cases in the end of this file.
Don't turn it into a log - maintainers of out-of-the-tree code are supposed to
be able to use diff(1).
Thing currently missing here: socket operations. Alexey?
--------------------------- dentry_operations --------------------------
prototypes:
int (*d_revalidate)(struct dentry *, unsigned int);
int (*d_weak_revalidate)(struct dentry *, unsigned int);
int (*d_hash)(const struct dentry *, struct qstr *);
int (*d_compare)(const struct dentry *, const struct dentry *,
unsigned int, const char *, const struct qstr *);
int (*d_delete)(struct dentry *);
void (*d_release)(struct dentry *);
void (*d_prune)(struct dentry *);
void (*d_iput)(struct dentry *, struct inode *);
char *(*d_dname)(struct dentry *dentry, char *buffer, int buflen);
struct vfsmount *(*d_automount)(struct path *path);
int (*d_manage)(struct dentry *, bool);
locking rules:
                   rename_lock  ->d_lock  may block       rcu-walk
d_revalidate:      no           no        yes (ref-walk)  maybe
d_weak_revalidate: no           no        yes             no
d_hash:            no           no        no              maybe
d_compare:         yes          no        no              maybe
d_delete:          no           yes       no              no
d_release:         no           no        yes             no
d_prune:           no           yes       no              no
d_iput:            no           no        yes             no
d_dname:           no           no        no              no
d_automount:       no           no        yes             no
d_manage:          no           no        yes (ref-walk)  maybe
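As an illustration of the rcu-walk column above, a minimal sketch of a
->d_revalidate() instance that may need to block and therefore opts out
of rcu-walk mode (the examplefs name is hypothetical):

  #include <linux/dcache.h>
  #include <linux/namei.h>
  #include <linux/errno.h>

  /* Minimal sketch: refuse to work in rcu-walk mode, where blocking is
   * forbidden, and let the VFS retry the lookup in ref-walk mode. */
  static int examplefs_d_revalidate(struct dentry *dentry, unsigned int flags)
  {
          if (flags & LOOKUP_RCU)
                  return -ECHILD;        /* may need to block; ask for ref-walk */

          /*
           * ref-walk mode: blocking operations (for example a round trip
           * to a server) are allowed here.  Return 0 if the dentry turned
           * out to be stale, 1 if it is still valid.
           */
          return 1;
  }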
--------------------------- inode_operations ---------------------------
prototypes:
int (*create) (struct inode *,struct dentry *,umode_t, bool);
struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
int (*link) (struct dentry *,struct inode *,struct dentry *);
int (*unlink) (struct inode *,struct dentry *);
int (*symlink) (struct inode *,struct dentry *,const char *);
int (*mkdir) (struct inode *,struct dentry *,umode_t);
int (*rmdir) (struct inode *,struct dentry *);
int (*mknod) (struct inode *,struct dentry *,umode_t,dev_t);
int (*rename) (struct inode *, struct dentry *,
struct inode *, struct dentry *);
int (*rename2) (struct inode *, struct dentry *,
struct inode *, struct dentry *, unsigned int);
int (*readlink) (struct dentry *, char __user *,int);
void * (*follow_link) (struct dentry *, struct nameidata *);
void (*put_link) (struct dentry *, struct nameidata *, void *);
void (*truncate) (struct inode *);
int (*permission) (struct inode *, int, unsigned int);
int (*get_acl)(struct inode *, int);
int (*setattr) (struct dentry *, struct iattr *);
int (*getattr) (struct vfsmount *, struct dentry *, struct kstat *);
int (*setxattr) (struct dentry *, const char *,const void *,size_t,int);
ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
ssize_t (*listxattr) (struct dentry *, char *, size_t);
int (*removexattr) (struct dentry *, const char *);
int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start, u64 len);
void (*update_time)(struct inode *, struct timespec *, int);
int (*atomic_open)(struct inode *, struct dentry *,
struct file *, unsigned open_flag,
umode_t create_mode, int *opened);
int (*tmpfile) (struct inode *, struct dentry *, umode_t);
int (*dentry_open)(struct dentry *, struct file *, const struct cred *);
locking rules:
all may block
i_mutex(inode)
lookup: yes
create: yes
link: yes (both)
mknod: yes
symlink: yes
mkdir: yes
unlink: yes (both)
rmdir: yes (both) (see below)
rename: yes (all) (see below)
rename2: yes (all) (see below)
readlink: no
follow_link: no
put_link: no
setattr: yes
permission: no (may not block if called in rcu-walk mode)
get_acl: no
getattr: no
setxattr: yes
getxattr: no
listxattr: no
removexattr: yes
fiemap: no
update_time: no
atomic_open: yes
tmpfile: no
dentry_open: no
Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_mutex on
victim.
cross-directory ->rename() and rename2() have (per-superblock)
->s_vfs_rename_sem.
See Documentation/filesystems/directory-locking for more detailed discussion
of the locking scheme for directory operations.
--------------------------- super_operations ---------------------------
prototypes:
struct inode *(*alloc_inode)(struct super_block *sb);
void (*destroy_inode)(struct inode *);
void (*dirty_inode) (struct inode *, int flags);
int (*write_inode) (struct inode *, struct writeback_control *wbc);
int (*drop_inode) (struct inode *);
void (*evict_inode) (struct inode *);
void (*put_super) (struct super_block *);
int (*sync_fs)(struct super_block *sb, int wait);
int (*freeze_fs) (struct super_block *);
int (*unfreeze_fs) (struct super_block *);
int (*statfs) (struct dentry *, struct kstatfs *);
int (*remount_fs) (struct super_block *, int *, char *);
void (*umount_begin) (struct super_block *);
int (*show_options)(struct seq_file *, struct dentry *);
ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
locking rules:
All may block [not true, see below]
s_umount
alloc_inode:
destroy_inode:
dirty_inode:
write_inode:
drop_inode: !!!inode->i_lock!!!
evict_inode:
put_super: write
sync_fs: read
freeze_fs: write
unfreeze_fs: write
statfs: maybe(read) (see below)
remount_fs: write
umount_begin: no
show_options: no (namespace_sem)
quota_read: no (see below)
quota_write: no (see below)
bdev_try_to_free_page: no (see below)
->statfs() has s_umount (shared) when called by ustat(2) (native or
compat), but that's an accident of bad API; s_umount is used to pin
the superblock down when we only have dev_t given us by userland to
identify the superblock. Everything else (statfs(), fstatfs(), etc.)
doesn't hold it when calling ->statfs() - superblock is pinned down
by resolving the pathname passed to syscall.
->quota_read() and ->quota_write() functions are both guaranteed to
be the only ones operating on the quota file by the quota code (via
dqio_sem) (unless an admin really wants to screw up something and
writes to quota files with quotas on). For other details about locking
see also dquot_operations section.
->bdev_try_to_free_page is called from the ->releasepage handler of
the block device inode. See there for more details.
--------------------------- file_system_type ---------------------------
prototypes:
int (*get_sb) (struct file_system_type *, int,
const char *, void *, struct vfsmount *);
struct dentry *(*mount) (struct file_system_type *, int,
const char *, void *);
void (*kill_sb) (struct super_block *);
locking rules:
may block
mount yes
kill_sb yes
->mount() returns ERR_PTR or the root dentry; its superblock should be locked
on return.
->kill_sb() takes a write-locked superblock, does all shutdown work on it,
unlocks and drops the reference.
--------------------------- address_space_operations --------------------------
prototypes:
int (*writepage)(struct page *page, struct writeback_control *wbc);
int (*readpage)(struct file *, struct page *);
int (*sync_page)(struct page *);
int (*writepages)(struct address_space *, struct writeback_control *);
int (*set_page_dirty)(struct page *page);
int (*readpages)(struct file *filp, struct address_space *mapping,
struct list_head *pages, unsigned nr_pages);
int (*write_begin)(struct file *, struct address_space *mapping,
loff_t pos, unsigned len, unsigned flags,
struct page **pagep, void **fsdata);
int (*write_end)(struct file *, struct address_space *mapping,
loff_t pos, unsigned len, unsigned copied,
struct page *page, void *fsdata);
sector_t (*bmap)(struct address_space *, sector_t);
void (*invalidatepage) (struct page *, unsigned int, unsigned int);
int (*releasepage) (struct page *, int);
void (*freepage)(struct page *);
int (*direct_IO)(int, struct kiocb *, struct iov_iter *iter, loff_t offset);
int (*get_xip_mem)(struct address_space *, pgoff_t, int, void **,
unsigned long *);
int (*migratepage)(struct address_space *, struct page *, struct page *);
int (*launder_page)(struct page *);
int (*is_partially_uptodate)(struct page *, unsigned long, unsigned long);
int (*error_remove_page)(struct address_space *, struct page *);
int (*swap_activate)(struct file *);
int (*swap_deactivate)(struct file *);
locking rules:
All except set_page_dirty and freepage may block

                        PageLocked(page)          i_mutex
writepage:              yes, unlocks (see below)
readpage:               yes, unlocks
sync_page:              maybe
writepages:
set_page_dirty          no
readpages:
write_begin:            locks the page            yes
write_end:              yes, unlocks              yes
bmap:
invalidatepage:         yes
releasepage:            yes
freepage:               yes
direct_IO:
get_xip_mem:            maybe
migratepage:            yes (both)
launder_page:           yes
is_partially_uptodate:  yes
error_remove_page:      yes
swap_activate:          no
swap_deactivate:        no
->write_begin(), ->write_end(), ->sync_page() and ->readpage()
may be called from the request handler (/dev/loop).
->readpage() unlocks the page, either synchronously or via I/O
completion.
->readpages() populates the pagecache with the passed pages and starts
I/O against them. They come unlocked upon I/O completion.
->writepage() is used for two purposes: for "memory cleansing" and for
"sync". These are quite different operations and the behaviour may differ
depending upon the mode.
If writepage is called for sync (wbc->sync_mode != WB_SYNC_NONE) then
it *must* start I/O against the page, even if that would involve
blocking on in-progress I/O.
If writepage is called for memory cleansing (sync_mode ==
WB_SYNC_NONE) then its role is to get as much writeout underway as
possible. So writepage should try to avoid blocking against
currently-in-progress I/O.
If the filesystem is not called for "sync" and it determines that it
would need to block against in-progress I/O to be able to start new I/O
against the page the filesystem should redirty the page with
redirty_page_for_writepage(), then unlock the page and return zero.
This may also be done to avoid internal deadlocks, but rarely.
If the filesystem is called for sync then it must wait on any
in-progress I/O and then start new I/O.
The filesystem should unlock the page synchronously, before returning to the
caller, unless ->writepage() returns the special AOP_WRITEPAGE_ACTIVATE
value. AOP_WRITEPAGE_ACTIVATE means that the page cannot really be written out
currently, and VM should stop calling ->writepage() on this page for some
time. VM does this by moving page to the head of the active list, hence the
name.
Unless the filesystem is going to redirty_page_for_writepage(), unlock the page
and return zero, writepage *must* run set_page_writeback() against the page,
followed by unlocking it. Once set_page_writeback() has been run against the
page, write I/O can be submitted and the write I/O completion handler must run
end_page_writeback() once the I/O is complete. If no I/O is submitted, the
filesystem must run end_page_writeback() against the page before returning from
writepage.
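A minimal sketch of a ->writepage() instance following the rules above;
the examplefs name and the would_block condition are placeholders, and a
real implementation would normally submit asynchronous I/O and call
end_page_writeback() from its I/O completion handler instead of inline:

  #include <linux/fs.h>
  #include <linux/pagemap.h>
  #include <linux/writeback.h>

  static int examplefs_writepage(struct page *page, struct writeback_control *wbc)
  {
          /* Hypothetical condition: writing this page right now would
           * mean blocking on I/O that is already in flight. */
          bool would_block = false;

          if (wbc->sync_mode == WB_SYNC_NONE && would_block) {
                  /* memory cleansing: do not block, just try again later */
                  redirty_page_for_writepage(wbc, page);
                  unlock_page(page);
                  return 0;
          }

          set_page_writeback(page);
          unlock_page(page);      /* page may be unlocked once under writeback */

          /*
           * A real implementation submits I/O here; end_page_writeback()
           * must then run exactly once, either here (synchronous I/O) or
           * from the I/O completion handler.
           */
          end_page_writeback(page);
          return 0;
  }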
That is: after 2.5.12, pages which are under writeout are *not* locked. Note,
if the filesystem needs the page to be locked during writeout, that is ok, too,
the page is allowed to be unlocked at any point in time between the calls to
set_page_writeback() and end_page_writeback().
Note, failure to run either redirty_page_for_writepage() or the combination of
set_page_writeback()/end_page_writeback() on a page submitted to writepage
will leave the page itself marked clean but it will be tagged as dirty in the
radix tree. This incoherency can lead to all sorts of hard-to-debug problems
in the filesystem like having dirty inodes at umount and losing written data.
->sync_page() locking rules are not well-defined - usually it is called
with lock on page, but that is not guaranteed. Considering the currently
existing instances of this method ->sync_page() itself doesn't look
well-defined...
->writepages() is used for periodic writeback and for syscall-initiated
sync operations. The address_space should start I/O against at least
*nr_to_write pages. *nr_to_write must be decremented for each page which is
written. The address_space implementation may write more (or less) pages
than *nr_to_write asks for, but it should try to be reasonably close. If
nr_to_write is NULL, all dirty pages must be written.
writepages should _only_ write pages which are present on
mapping->io_pages.
->set_page_dirty() is called from various places in the kernel
when the target page is marked as needing writeback. It may be called
under spinlock (it cannot block) and is sometimes called with the page
not locked.
->bmap() is currently used by legacy ioctl() (FIBMAP) provided by some
filesystems and by the swapper. The latter will eventually go away. Please,
keep it that way and don't breed new callers.
->invalidatepage() is called when the filesystem must attempt to drop
some or all of the buffers from the page when it is being truncated. It
returns zero on success. If ->invalidatepage is zero, the kernel uses
block_invalidatepage() instead.
->releasepage() is called when the kernel is about to try to drop the
buffers from the page in preparation for freeing it. It returns zero to
indicate that the buffers are (or may be) freeable. If ->releasepage is zero,
the kernel assumes that the fs has no private interest in the buffers.
->freepage() is called when the kernel is done dropping the page
from the page cache.
->launder_page() may be called prior to releasing a page if
it is still found to be dirty. It returns zero if the page was successfully
cleaned, or an error value if not. Note that in order to prevent the page
getting mapped back in and redirtied, it needs to be kept locked
across the entire operation.
->swap_activate will be called with a non-zero argument on
files backing (non block device backed) swapfiles. A return value
of zero indicates success, in which case this file can be used for
backing swapspace. The swapspace operations will be proxied to the
address space operations.
->swap_deactivate() will be called in the sys_swapoff()
path after ->swap_activate() returned success.
----------------------- file_lock_operations ------------------------------
prototypes:
void (*fl_copy_lock)(struct file_lock *, struct file_lock *);
void (*fl_release_private)(struct file_lock *);
locking rules:
                     inode->i_lock  may block
fl_copy_lock:        yes            no
fl_release_private:  maybe          maybe[1]
[1]: ->fl_release_private for flock or POSIX locks is currently allowed
to block. Leases however can still be freed while the i_lock is held and
so fl_release_private called on a lease should not block.
----------------------- lock_manager_operations ---------------------------
prototypes:
int (*lm_compare_owner)(struct file_lock *, struct file_lock *);
unsigned long (*lm_owner_key)(struct file_lock *);
void (*lm_notify)(struct file_lock *); /* unblock callback */
int (*lm_grant)(struct file_lock *, struct file_lock *, int);
void (*lm_break)(struct file_lock *); /* break_lease callback */
int (*lm_change)(struct file_lock **, int);
locking rules:
                   inode->i_lock  blocked_lock_lock  may block
lm_compare_owner:  yes[1]         maybe              no
lm_owner_key       yes[1]         yes                no
lm_notify:         yes            yes                no
lm_grant:          no             no                 no
lm_break:          yes            no                 no
lm_change          yes            no                 no
[1]: ->lm_compare_owner and ->lm_owner_key are generally called with
*an* inode->i_lock held. It may not be the i_lock of the inode
associated with either file_lock argument! This is the case with deadlock
detection, since the code has to chase down the owners of locks that may
be entirely unrelated to the one on which the lock is being acquired.
For deadlock detection however, the blocked_lock_lock is also held. The
fact that these locks are held ensures that the file_locks do not
disappear out from under you while doing the comparison or generating an
owner key.
--------------------------- buffer_head -----------------------------------
prototypes:
void (*b_end_io)(struct buffer_head *bh, int uptodate);
locking rules:
called from interrupts. In other words, extreme care is needed here.
bh is locked, but that's all warranties we have here. Currently only RAID1,
highmem, fs/buffer.c, and fs/ntfs/aops.c are providing these. Block devices
call this method upon the IO completion.
--------------------------- block_device_operations -----------------------
prototypes:
int (*open) (struct block_device *, fmode_t);
int (*release) (struct gendisk *, fmode_t);
int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
int (*direct_access) (struct block_device *, sector_t, void **, unsigned long *);
int (*media_changed) (struct gendisk *);
void (*unlock_native_capacity) (struct gendisk *);
int (*revalidate_disk) (struct gendisk *);
int (*getgeo)(struct block_device *, struct hd_geometry *);
void (*swap_slot_free_notify) (struct block_device *, unsigned long);
locking rules:
bd_mutex
open: yes
release: yes
ioctl: no
compat_ioctl: no
direct_access: no
media_changed: no
unlock_native_capacity: no
revalidate_disk: no
getgeo: no
swap_slot_free_notify: no (see below)
media_changed, unlock_native_capacity and revalidate_disk are called only from
check_disk_change().
swap_slot_free_notify is called with swap_lock and sometimes the page lock
held.
--------------------------- file_operations -------------------------------
prototypes:
loff_t (*llseek) (struct file *, loff_t, int);
ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
int (*iterate) (struct file *, struct dir_context *);
unsigned int (*poll) (struct file *, struct poll_table_struct *);
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
int (*mmap) (struct file *, struct vm_area_struct *);
int (*open) (struct inode *, struct file *);
int (*flush) (struct file *);
int (*release) (struct inode *, struct file *);
int (*fsync) (struct file *, loff_t start, loff_t end, int datasync);
int (*aio_fsync) (struct kiocb *, int datasync);
int (*fasync) (int, struct file *, int);
int (*lock) (struct file *, int, struct file_lock *);
ssize_t (*readv) (struct file *, const struct iovec *, unsigned long,
loff_t *);
ssize_t (*writev) (struct file *, const struct iovec *, unsigned long,
loff_t *);
ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t,
void __user *);
ssize_t (*sendpage) (struct file *, struct page *, int, size_t,
loff_t *, int);
unsigned long (*get_unmapped_area)(struct file *, unsigned long,
unsigned long, unsigned long, unsigned long);
int (*check_flags)(int);
int (*flock) (struct file *, int, struct file_lock *);
ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *,
size_t, unsigned int);
ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *,
size_t, unsigned int);
int (*setlease)(struct file *, long, struct file_lock **, void **);
long (*fallocate)(struct file *, int, loff_t, loff_t);
locking rules:
All may block.
->llseek() locking has moved from llseek to the individual llseek
implementations. If your fs is not using generic_file_llseek, you
need to acquire and release the appropriate locks in your ->llseek().
For many filesystems, it is probably safe to acquire the inode
mutex or just to use i_size_read() instead.
Note: this does not protect the file->f_pos against concurrent modifications
since this is something the userspace has to take care about.
->fasync() is responsible for maintaining the FASYNC bit in filp->f_flags.
Most instances call fasync_helper(), which does that maintenance, so it's
not normally something one needs to worry about. Return values > 0 will be
mapped to zero in the VFS layer.
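A typical ->fasync() instance is little more than a call to
fasync_helper(); a sketch, with hypothetical examplefs naming:

  #include <linux/fs.h>

  /* Hypothetical per-device asynchronous notification list. */
  static struct fasync_struct *examplefs_fasync_list;

  static int examplefs_fasync(int fd, struct file *filp, int on)
  {
          /* fasync_helper() maintains FASYNC in filp->f_flags for us */
          return fasync_helper(fd, filp, on, &examplefs_fasync_list);
  }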
->readdir() and ->ioctl() on directories must be changed. Ideally we would
move ->readdir() to inode_operations and use a separate method for directory
->ioctl() or kill the latter completely. One of the problems is that for
anything that resembles union-mount we won't have a struct file for all
components. And there are other reasons why the current interface is a mess...
->read on directories probably must go away - we should just enforce -EISDIR
in sys_read() and friends.
->setlease operations should call generic_setlease() before or after setting
the lease within the individual filesystem to record the result of the
operation.
--------------------------- dquot_operations -------------------------------
prototypes:
int (*write_dquot) (struct dquot *);
int (*acquire_dquot) (struct dquot *);
int (*release_dquot) (struct dquot *);
int (*mark_dirty) (struct dquot *);
int (*write_info) (struct super_block *, int);
These operations are intended to be more or less wrapping functions that ensure
a proper locking wrt the filesystem and call the generic quota operations.
What filesystem should expect from the generic quota functions:
                FS recursion  Held locks when called
write_dquot:    yes           dqonoff_sem or dqptr_sem
acquire_dquot:  yes           dqonoff_sem or dqptr_sem
release_dquot:  yes           dqonoff_sem or dqptr_sem
mark_dirty:     no            -
write_info:     yes           dqonoff_sem
FS recursion means calling ->quota_read() and ->quota_write() from superblock
operations.
More details about quota locking can be found in fs/dquot.c.
--------------------------- vm_operations_struct -----------------------------
prototypes:
void (*open)(struct vm_area_struct*);
void (*close)(struct vm_area_struct*);
int (*fault)(struct vm_area_struct*, struct vm_fault *);
void (*map_pages)(struct vm_area_struct *, struct vm_fault *);
int (*page_mkwrite)(struct vm_area_struct *, struct vm_fault *);
int (*access)(struct vm_area_struct *, unsigned long, void*, int, int);
locking rules:
               mmap_sem  PageLocked(page)
open:          yes
close:         yes
fault:         yes       can return with page locked
map_pages:     yes
page_mkwrite:  yes       can return with page locked
access:        yes
->fault() is called when a previously not present pte is about
to be faulted in. The filesystem must find and return the page associated
with the passed in "pgoff" in the vm_fault structure. If it is possible that
the page may be truncated and/or invalidated, then the filesystem must lock
the page, then ensure it is not already truncated (the page lock will block
subsequent truncate), and then return with VM_FAULT_LOCKED, and the page
locked. The VM will unlock the page.
->map_pages() is called when VM asks to map easily accessible pages.
Filesystem should find and map pages associated with offsets from "pgoff"
till "max_pgoff". ->map_pages() is called with page table locked and must
not block. If it's not possible to reach a page without blocking,
filesystem should skip it. Filesystem should use do_set_pte() to setup
page table entry. Pointer to entry associated with offset "pgoff" is
passed in "pte" field in vm_fault structure. Pointers to entries for other
offsets should be calculated relative to "pte".
->page_mkwrite() is called when a previously read-only pte is
about to become writeable. The filesystem again must ensure that there are
no truncate/invalidate races, and then return with the page locked. If
the page has been truncated, the filesystem should not look up a new page
like the ->fault() handler, but simply return with VM_FAULT_NOPAGE, which
will cause the VM to retry the fault.
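A sketch of a ->page_mkwrite() instance following that protocol (the
examplefs naming is hypothetical; real filesystems usually also allocate
blocks or reserve space at this point):

  #include <linux/fs.h>
  #include <linux/mm.h>
  #include <linux/pagemap.h>

  static int examplefs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
  {
          struct page *page = vmf->page;
          struct inode *inode = file_inode(vma->vm_file);

          lock_page(page);
          if (page->mapping != inode->i_mapping) {
                  /* raced with truncate: tell the VM to retry the fault */
                  unlock_page(page);
                  return VM_FAULT_NOPAGE;
          }

          /* filesystem-specific work (block allocation etc.) would go here */

          return VM_FAULT_LOCKED; /* return with the page still locked */
  }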
->access() is called when get_user_pages() fails in
access_process_vm(), typically used to debug a process through
/proc/pid/mem or ptrace. This function is needed only for
VM_IO | VM_PFNMAP VMAs.
================================================================================
Dubious stuff
(if you break something or notice that it is broken and do not fix it yourself
- at least put it here)


@ -0,0 +1,7 @@
subdir-y := configfs
# List of programs to build
hostprogs-y := dnotify_test
# Tell kbuild to always build the programs
always := $(hostprogs-y)


@ -0,0 +1,75 @@
Mount options for ADFS
----------------------
uid=nnn All files in the partition will be owned by
user id nnn. Default 0 (root).
gid=nnn All files in the partition will be in group
nnn. Default 0 (root).
ownmask=nnn The permission mask for ADFS 'owner' permissions
will be nnn. Default 0700.
othmask=nnn The permission mask for ADFS 'other' permissions
will be nnn. Default 0077.
ftsuffix=n When ftsuffix=0, no file type suffix will be applied.
When ftsuffix=1, a hexadecimal suffix corresponding to
the RISC OS file type will be added. Default 0.
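As a hedged illustration of combining these options (the device, mount
point and id values are made up for the example), the equivalent of a
mount command expressed through the mount(2) system call:

  #include <stdio.h>
  #include <sys/mount.h>

  /* Hypothetical example: mount an ADFS partition read-only, owned by
   * uid/gid 1000, with RISC OS filetype suffixes enabled. */
  int main(void)
  {
          if (mount("/dev/sdb1", "/mnt/adfs", "adfs", MS_RDONLY,
                    "uid=1000,gid=1000,ownmask=0700,othmask=0077,ftsuffix=1") != 0) {
                  perror("mount");
                  return 1;
          }
          return 0;
  }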
Mapping of ADFS permissions to Linux permissions
------------------------------------------------
ADFS permissions consist of the following:
Owner read
Owner write
Other read
Other write
(In older versions, an 'execute' permission did exist, but this
does not hold the same meaning as the Linux 'execute' permission
and is now obsolete).
The mapping is performed as follows:
Owner read -> -r--r--r--
Owner write -> --w--w--w-
Owner read and filetype UnixExec -> ---x--x--x
These are then masked by ownmask, eg 700 -> -rwx------
Possible owner mode permissions -> -rwx------
Other read -> -r--r--r--
Other write -> --w--w--w-
Other read and filetype UnixExec -> ---x--x--x
These are then masked by othmask, eg 077 -> ----rwxrwx
Possible other mode permissions -> ----rwxrwx
Hence, with the default masks, if a file is owner read/write, and
not a UnixExec filetype, then the permissions will be:
-rw-------
However, if the masks were ownmask=0770,othmask=0007, then this would
be modified to:
-rw-rw----
There is no restriction on what you can do with these masks. You may,
for example, wish that either of the read bits gives read access to the
file for everyone while keeping the default write protection (ownmask=0755,othmask=0577):
-rw-r--r--
You can therefore tailor the permission translation to whatever you
want the permissions to be under Linux.
RISC OS file type suffix
------------------------
RISC OS file types are stored in bits 19..8 of the file load address.
To enable non-RISC OS systems to be used to store files without losing
file type information, a file naming convention was devised (initially
for use with NFS) such that a hexadecimal suffix of the form ,xyz
denoted the file type: e.g. BasicFile,ffb is a BASIC (0xffb) file. This
naming convention is now also used by RISC OS emulators such as RPCEmu.
Mounting an ADFS disc with option ftsuffix=1 will cause appropriate file
type suffixes to be appended to file names read from a directory. If the
ftsuffix option is zero or omitted, no file type suffixes will be added.


@ -0,0 +1,222 @@
Overview of Amiga Filesystems
=============================
Not all varieties of the Amiga filesystems are supported for reading and
writing. The Amiga currently knows six different filesystems:
DOS\0 The old or original filesystem, not really suited for
hard disks and normally not used on them, either.
Supported read/write.
DOS\1 The original Fast File System. Supported read/write.
DOS\2 The old "international" filesystem. International means that
a bug has been fixed so that accented ("international") letters
in file names are case-insensitive, as they ought to be.
Supported read/write.
DOS\3 The "international" Fast File System. Supported read/write.
DOS\4 The original filesystem with directory cache. The directory
cache speeds up directory accesses on floppies considerably,
but slows down file creation/deletion. Doesn't make much
sense on hard disks. Supported read only.
DOS\5 The Fast File System with directory cache. Supported read only.
All of the above filesystems allow block sizes from 512 to 32K bytes.
Supported block sizes are: 512, 1024, 2048 and 4096 bytes. Larger blocks
speed up almost everything at the expense of wasted disk space. The speed
gain above 4K seems not really worth the price, so you don't lose too
much here, either.
The muFS (multi user File System) equivalents of the above file systems
are supported, too.
Mount options for the AFFS
==========================
protect If this option is set, the protection bits cannot be altered.
setuid[=uid] This sets the owner of all files and directories in the file
system to uid or the uid of the current user, respectively.
setgid[=gid] Same as above, but for gid.
mode=mode Sets the mode flags to the given (octal) value, regardless
of the original permissions. Directories will get an x
permission if the corresponding r bit is set.
This is useful since most of the plain AmigaOS files
will map to 600.
nofilenametruncate
The file system will return an error when a filename exceeds the
standard maximum filename length (30 characters).
reserved=num Sets the number of reserved blocks at the start of the
partition to num. You should never need this option.
Default is 2.
root=block Sets the block number of the root block. This should never
be necessary.
bs=blksize Sets the blocksize to blksize. Valid block sizes are 512,
1024, 2048 and 4096. Like the root option, this should
never be necessary, as the affs can figure it out itself.
quiet The file system will not return an error for disallowed
mode changes.
verbose The volume name, file system type and block size will
be written to the syslog when the filesystem is mounted.
mufs The filesystem is really a muFS, although it doesn't
identify itself as one. This option is necessary if
the filesystem wasn't formatted as muFS, but is used
as one.
prefix=path Path will be prefixed to every absolute path name of
symbolic links on an AFFS partition. Default = "/".
(See below.)
volume=name When symbolic links with an absolute path are created
on an AFFS partition, name will be prepended as the
volume name. Default = "" (empty string).
(See below.)
Handling of the Users/Groups and protection flags
=================================================
Amiga -> Linux:
The Amiga protection flags RWEDRWEDHSPARWED are handled as follows:
- R maps to r for user, group and others. On directories, R implies x.
- If both W and D are allowed, w will be set.
- E maps to x.
- H and P are always retained and ignored under Linux.
- A is always reset when a file is written to.
User id and group id will be used unless set[gu]id are given as mount
options. Since most of the Amiga file systems are single user systems
they will be owned by root. The root directory (the mount point) of the
Amiga filesystem will be owned by the user who actually mounts the
filesystem (the root directory doesn't have uid/gid fields).
Linux -> Amiga:
The Linux rwxrwxrwx file mode is handled as follows:
- r permission will set R for user, group and others.
- w permission will set W and D for user, group and others.
- x permission of the user will set E for plain files.
- All other flags (suid, sgid, ...) are ignored and will
not be retained.
Newly created files and directories will get the user and group ID
of the current user and a mode according to the umask.
Symbolic links
==============
Although the Amiga and Linux file systems resemble each other, there
are some, not always subtle, differences. One of them becomes apparent
with symbolic links. While Linux has a file system with exactly one
root directory, the Amiga has a separate root directory for each
file system (for example, partition, floppy disk, ...). With the Amiga,
these entities are called "volumes". They have symbolic names which
can be used to access them. Thus, symbolic links can point to a
different volume. AFFS turns the volume name into a directory name
and prepends the prefix path (see prefix option) to it.
Example:
You mount all your Amiga partitions under /amiga/<volume> (where
<volume> is the name of the volume), and you give the option
"prefix=/amiga/" when mounting all your AFFS partitions. (They
might be "User", "WB" and "Graphics", the mount points /amiga/User,
/amiga/WB and /amiga/Graphics). A symbolic link referring to
"User:sc/include/dos/dos.h" will be followed to
"/amiga/User/sc/include/dos/dos.h".
Examples
========
Command line:
mount Archive/Amiga/Workbench3.1.adf /mnt -t affs -o loop,verbose
mount /dev/sda3 /Amiga -t affs
/etc/fstab entry:
/dev/sdb5 /amiga/Workbench affs noauto,user,exec,verbose 0 0
IMPORTANT NOTE
==============
If you boot Windows 95 (don't know about 3.x, 98 and NT) while you
have an Amiga harddisk connected to your PC, it will overwrite
the bytes 0x00dc..0x00df of block 0 with garbage, thus invalidating
the Rigid Disk Block. Sheer luck has it that this is an unused
area of the RDB, so only the checksum doesn't match anymore.
Linux will ignore this garbage and recognize the RDB anyway, but
before you connect that drive to your Amiga again, you must
restore or repair your RDB. So please do make a backup copy of it
before booting Windows!
If the damage is already done, the following should fix the RDB
(where <disk> is the device name).
DO AT YOUR OWN RISK:
# back up block 0, which holds the Rigid Disk Block
dd if=/dev/<disk> of=rdb.tmp count=1
cp rdb.tmp rdb.fixed
# zero the four garbage bytes at offsets 0x00dc..0x00df (220..223)
dd if=/dev/zero of=rdb.fixed bs=1 seek=220 count=4
# write the repaired block back to the disk
dd if=rdb.fixed of=/dev/<disk>
Bugs, Restrictions, Caveats
===========================
Quite a few things may not work as advertised. Not everything is
tested, though several hundred MB have been read and written using
this fs. For a most up-to-date list of bugs please consult
fs/affs/Changes.
By default, filenames are truncated to 30 characters without warning.
The 'nofilenametruncate' mount option can change that behavior.
Case is ignored by the affs in filename matching, but Linux shells
do care about case. Example (with /wb being an affs mounted fs):
rm /wb/WRONGCASE
will remove /wb/wrongcase, but
rm /wb/WR*
will not since the names are matched by the shell.
The block allocation is designed for hard disk partitions. If more
than 1 process writes to a (small) diskette, the blocks are allocated
in an ugly way (but the real AFFS doesn't do much better). This
is also true when space gets tight.
You cannot execute programs on an OFS (Old File System), since the
program files cannot be memory mapped due to the 488 byte blocks.
For the same reason you cannot mount an image on such a filesystem
via the loopback device.
The bitmap valid flag in the root block may not be accurate when the
system crashes while an affs partition is mounted. There's currently
no way to fix a garbled filesystem without an Amiga (disk validator)
or manually (who would do this?). Maybe later.
If you mount affs partitions on system startup, you may want to tell
fsck that the fs should not be checked (place a '0' in the sixth field
of /etc/fstab).
It's not possible to read floppy disks with a normal PC or workstation
due to an incompatibility with the Amiga floppy controller.
If you are interested in an Amiga Emulator for Linux, look at
http://web.archive.org/web/*/http://www.freiburg.linux.de/~uae/


@ -0,0 +1,247 @@
====================
kAFS: AFS FILESYSTEM
====================
Contents:
- Overview.
- Usage.
- Mountpoints.
- Proc filesystem.
- The cell database.
- Security.
- Examples.
========
OVERVIEW
========
This filesystem provides a fairly simple secure AFS filesystem driver. It is
under development and does not yet provide the full feature set. The features
it does support include:
(*) Security (currently only AFS kaserver and KerberosIV tickets).
(*) File reading and writing.
(*) Automounting.
(*) Local caching (via fscache).
It does not yet support the following AFS features:
(*) pioctl() system call.
===========
COMPILATION
===========
The filesystem should be enabled by turning on the kernel configuration
options:
CONFIG_AF_RXRPC - The RxRPC protocol transport
CONFIG_RXKAD - The RxRPC Kerberos security handler
CONFIG_AFS - The AFS filesystem
Additionally, the following can be turned on to aid debugging:
CONFIG_AF_RXRPC_DEBUG - Permit AF_RXRPC debugging to be enabled
CONFIG_AFS_DEBUG - Permit AFS debugging to be enabled
They permit the debugging messages to be turned on dynamically by manipulating
the masks in the following files:
/sys/module/af_rxrpc/parameters/debug
/sys/module/kafs/parameters/debug
=====
USAGE
=====
When inserting the driver modules the root cell must be specified along with a
list of volume location server IP addresses:
modprobe af_rxrpc
modprobe rxkad
modprobe kafs rootcell=cambridge.redhat.com:172.16.18.73:172.16.18.91
The first module is the AF_RXRPC network protocol driver. This provides the
RxRPC remote operation protocol and may also be accessed from userspace. See:
Documentation/networking/rxrpc.txt
The second module is the kerberos RxRPC security driver, and the third module
is the actual filesystem driver for the AFS filesystem.
Once the module has been loaded, more modules can be added by the following
procedure:
echo add grand.central.org 18.9.48.14:128.2.203.61:130.237.48.87 >/proc/fs/afs/cells
Where the parameters to the "add" command are the name of a cell and a list of
volume location servers within that cell, with the latter separated by colons.
Filesystems can be mounted anywhere by commands similar to the following:
mount -t afs "%cambridge.redhat.com:root.afs." /afs
mount -t afs "#cambridge.redhat.com:root.cell." /afs/cambridge
mount -t afs "#root.afs." /afs
mount -t afs "#root.cell." /afs/cambridge
Where the initial character is either a hash or a percent symbol depending on
whether you definitely want a R/W volume (hash) or whether you'd prefer a R/O
volume, but are willing to use a R/W volume instead (percent).
The name of the volume can be suffixed with ".backup" or ".readonly" to
specify connection to only volumes of those types.
The name of the cell is optional, and if not given during a mount, then the
named volume will be looked up in the cell specified during modprobe.
Additional cells can be added through /proc (see later section).
===========
MOUNTPOINTS
===========
AFS has a concept of mountpoints. In AFS terms, these are specially formatted
symbolic links (of the same form as the "device name" passed to mount). kAFS
presents these to the user as directories that have a follow-link capability
(ie: symbolic link semantics). If anyone attempts to access them, they will
automatically cause the target volume to be mounted (if possible) on that site.
Automatically mounted filesystems will be automatically unmounted approximately
twenty minutes after they were last used. Alternatively they can be unmounted
directly with the umount() system call.
Manually unmounting an AFS volume will cause any idle submounts upon it to be
culled first. If all are culled, then the requested volume will also be
unmounted, otherwise error EBUSY will be returned.
This can be used by the administrator to attempt to unmount the whole AFS tree
mounted on /afs in one go by doing:
umount /afs
===============
PROC FILESYSTEM
===============
The AFS module creates a "/proc/fs/afs/" directory and populates it:
(*) A "cells" file that lists cells currently known to the afs module and
their usage counts:
[root@andromeda ~]# cat /proc/fs/afs/cells
USE NAME
3 cambridge.redhat.com
(*) A directory per cell that contains files that list volume location
servers, volumes, and active servers known within that cell.
[root@andromeda ~]# cat /proc/fs/afs/cambridge.redhat.com/servers
USE ADDR STATE
4 172.16.18.91 0
[root@andromeda ~]# cat /proc/fs/afs/cambridge.redhat.com/vlservers
ADDRESS
172.16.18.91
[root@andromeda ~]# cat /proc/fs/afs/cambridge.redhat.com/volumes
USE STT VLID[0] VLID[1] VLID[2] NAME
1 Val 20000000 20000001 20000002 root.afs
=================
THE CELL DATABASE
=================
The filesystem maintains an internal database of all the cells it knows and the
IP addresses of the volume location servers for those cells. The cell to which
the system belongs is added to the database when modprobe is performed by the
"rootcell=" argument or, if compiled in, using a "kafs.rootcell=" argument on
the kernel command line.
Further cells can be added by commands similar to the following:
echo add CELLNAME VLADDR[:VLADDR][:VLADDR]... >/proc/fs/afs/cells
echo add grand.central.org 18.9.48.14:128.2.203.61:130.237.48.87 >/proc/fs/afs/cells
No other cell database operations are available at this time.
========
SECURITY
========
Secure operations are initiated by acquiring a key using the klog program. A
very primitive klog program is available at:
http://people.redhat.com/~dhowells/rxrpc/klog.c
This should be compiled by:
make klog LDLIBS="-lcrypto -lcrypt -lkrb4 -lkeyutils"
And then run as:
./klog
Assuming it's successful, this adds a key of type RxRPC, named for the service
and cell, eg: "afs@<cellname>". This can be viewed with the keyctl program or
by cat'ing /proc/keys:
[root@andromeda ~]# keyctl show
Session Keyring
-3 --alswrv 0 0 keyring: _ses.3268
2 --alswrv 0 0 \_ keyring: _uid.0
111416553 --als--v 0 0 \_ rxrpc: afs@CAMBRIDGE.REDHAT.COM
Currently the username, realm, password and proposed ticket lifetime are
compiled in to the program.
It is not required to acquire a key before using AFS facilities, but if one is
not acquired then all operations will be governed by the anonymous user parts
of the ACLs.
If a key is acquired, then all AFS operations, including mounts and automounts,
made by a possessor of that key will be secured with that key.
If a file is opened with a particular key and then the file descriptor is
passed to a process that doesn't have that key (perhaps over an AF_UNIX
socket), then the operations on the file will be made with the key that was used to
open the file.
========
EXAMPLES
========
Here's what I use to test this. Some of the names and IP addresses are local
to my internal DNS. My "root.afs" partition has a mount point within it for
some public volumes.
insmod /tmp/rxrpc.o
insmod /tmp/rxkad.o
insmod /tmp/kafs.o rootcell=cambridge.redhat.com:172.16.18.91
mount -t afs \%root.afs. /afs
mount -t afs \%cambridge.redhat.com:root.cell. /afs/cambridge.redhat.com/
echo add grand.central.org 18.9.48.14:128.2.203.61:130.237.48.87 > /proc/fs/afs/cells
mount -t afs "#grand.central.org:root.cell." /afs/grand.central.org/
mount -t afs "#grand.central.org:root.archive." /afs/grand.central.org/archive
mount -t afs "#grand.central.org:root.contrib." /afs/grand.central.org/contrib
mount -t afs "#grand.central.org:root.doc." /afs/grand.central.org/doc
mount -t afs "#grand.central.org:root.project." /afs/grand.central.org/project
mount -t afs "#grand.central.org:root.service." /afs/grand.central.org/service
mount -t afs "#grand.central.org:root.software." /afs/grand.central.org/software
mount -t afs "#grand.central.org:root.user." /afs/grand.central.org/user
umount /afs
rmmod kafs
rmmod rxkad
rmmod rxrpc


@ -0,0 +1,393 @@
Miscellaneous Device control operations for the autofs4 kernel module
====================================================================
The problem
===========
There is a problem with active restarts in autofs (that is to say
restarting autofs when there are busy mounts).
During normal operation autofs uses a file descriptor opened on the
directory that is being managed in order to be able to issue control
operations. Using a file descriptor gives ioctl operations access to
autofs specific information stored in the super block. The operations
are things such as setting an autofs mount catatonic, setting the
expire timeout and requesting expire checks. As is explained below,
certain types of autofs triggered mounts can end up covering an autofs
mount itself, which prevents us from using open(2) to obtain a
file descriptor for these operations if we don't already have one open.
Currently autofs uses "umount -l" (lazy umount) to clear active mounts
at restart. While using lazy umount works for most cases, anything that
needs to walk back up the mount tree to construct a path, such as
getcwd(2) and the proc file system /proc/<pid>/cwd, no longer works
because the point from which the path is constructed has been detached
from the mount tree.
The actual problem with autofs is that it can't reconnect to existing
mounts. Immediately one thinks that just adding the ability to remount
autofs file systems would solve it, but alas, that can't work. This is
because autofs direct mounts and the implementation of "on demand mount
and expire" of nested mount trees have the file system mounted directly
on top of the mount trigger directory dentry.
For example, there are two types of automount maps, direct (in the kernel
module source you will see a third type called an offset, which is just
a direct mount in disguise) and indirect.
Here is a master map with direct and indirect map entries:
/- /etc/auto.direct
/test /etc/auto.indirect
and the corresponding map files:
/etc/auto.direct:
/automount/dparse/g6 budgie:/autofs/export1
/automount/dparse/g1 shark:/autofs/export1
and so on.
/etc/auto.indirect:
g1 shark:/autofs/export1
g6 budgie:/autofs/export1
and so on.
For the above indirect map an autofs file system is mounted on /test and
mounts are triggered for each sub-directory key by the inode lookup
operation. So we see a mount of shark:/autofs/export1 on /test/g1, for
example.
The way that direct mounts are handled is by making an autofs mount on
each full path, such as /automount/dparse/g1, and using it as a mount
trigger. So when we walk on the path we mount shark:/autofs/export1 "on
top of this mount point". Since these are always directories we can
use the follow_link inode operation to trigger the mount.
But, each entry in direct and indirect maps can have offsets (making
them multi-mount map entries).
For example, an indirect mount map entry could also be:
g1 \
/ shark:/autofs/export5/testing/test \
/s1 shark:/autofs/export/testing/test/s1 \
/s2 shark:/autofs/export5/testing/test/s2 \
/s1/ss1 shark:/autofs/export1 \
/s2/ss2 shark:/autofs/export2
and similarly a direct mount map entry could also be:
/automount/dparse/g1 \
/ shark:/autofs/export5/testing/test \
/s1 shark:/autofs/export/testing/test/s1 \
/s2 shark:/autofs/export5/testing/test/s2 \
/s1/ss1 shark:/autofs/export2 \
/s2/ss2 shark:/autofs/export2
One of the issues with version 4 of autofs was that, when mounting an
entry with a large number of offsets, possibly with nesting, we needed
to mount and umount all of the offsets as a single unit. Not really a
problem, except for people with a large number of offsets in map entries.
This mechanism is used for the well known "hosts" map and we have seen
cases (in 2.4) where the available number of mounts is exhausted or
where the number of privileged ports available is exhausted.
In version 5 we mount only as we go down the tree of offsets and
similarly for expiring them which resolves the above problem. There is
somewhat more detail to the implementation but it isn't needed for the
sake of the problem explanation. The one important detail is that these
offsets are implemented using the same mechanism as the direct mounts
above and so the mount points can be covered by a mount.
The current autofs implementation uses an ioctl file descriptor opened
on the mount point for control operations. The references held by the
descriptor are accounted for in checks made to determine if a mount is
in use, and the descriptor is also used to access autofs file system
information held in the mount super block. So the use of a file handle
needs to be retained.
The Solution
============
To be able to restart autofs leaving existing direct, indirect and
offset mounts in place we need to be able to obtain a file handle
for these potentially covered autofs mount points. Rather than just
implement an isolated operation it was decided to re-implement the
existing ioctl interface and add new operations to provide this
functionality.
In addition, to be able to reconstruct a mount tree that has busy mounts,
the uid and gid of the last user that triggered the mount needs to be
available because these can be used as macro substitution variables in
autofs maps. They are recorded at mount request time and an operation
has been added to retrieve them.
Since we're re-implementing the control interface, a couple of other
problems with the existing interface have been addressed. First, when
a mount or expire operation completes a status is returned to the
kernel by either a "send ready" or a "send fail" operation. The
"send fail" operation of the ioctl interface could only ever send
ENOENT so the re-implementation allows user space to send an actual
status. Another expensive operation in user space, for those using
very large maps, is discovering if a mount is present. Usually this
involves scanning /proc/mounts and since it needs to be done quite
often it can introduce significant overhead when there are many entries
in the mount table. An operation to lookup the mount status of a mount
point dentry (covered or not) has also been added.
Current kernel development policy recommends avoiding the use of the
ioctl mechanism in favor of systems such as Netlink. An implementation
using this system was attempted to evaluate its suitability and it was
found to be inadequate, in this case. The Generic Netlink system was
used for this as raw Netlink would lead to a significant increase in
complexity. There's no question that the Generic Netlink system is an
elegant solution for common case ioctl functions but it's not a complete
replacement probably because its primary purpose in life is to be a
message bus implementation rather than specifically an ioctl replacement.
While it would be possible to work around this there is one concern
that led to the decision to not use it. This is that the autofs
expire in the daemon has become far too complex because umount
candidates are enumerated, almost for no other reason than to "count"
the number of times to call the expire ioctl. This involves scanning
the mount table which has proved to be a big overhead for users with
large maps. The best way to improve this is to try and get back to the
way the expire was done long ago. That is, when an expire request is
issued for a mount (file handle) we should continually call back to
the daemon until we can't umount any more mounts, then return the
appropriate status to the daemon. At the moment we just expire one
mount at a time. A Generic Netlink implementation would exclude this
possibility for future development due to the requirements of the
message bus architecture.
autofs4 Miscellaneous Device mount control interface
====================================================
The control interface is accessed by opening a device node, typically /dev/autofs.
All the ioctls use a common structure to pass the needed parameter
information and return operation results:
struct autofs_dev_ioctl {
__u32 ver_major;
__u32 ver_minor;
__u32 size; /* total size of data passed in
* including this struct */
__s32 ioctlfd; /* automount command fd */
__u32 arg1; /* Command parameters */
__u32 arg2;
char path[0];
};
The ioctlfd field is a mount point file descriptor of an autofs mount
point. It is returned by the open call and is used by all calls except
the check for whether a given path is a mount point, where it may
optionally be used to check a specific mount corresponding to a given
mount point file descriptor, and when requesting the uid and gid of the
last successful mount on a directory within the autofs file system.
The fields arg1 and arg2 are used to communicate parameters and results of
calls made as described below.
The path field is used to pass a path where it is needed and the size field
is used to account for the increased structure length when translating the
structure sent from user space.
This structure can be initialized before setting specific fields by using
the void function call init_autofs_dev_ioctl(struct autofs_dev_ioctl *).
All of the ioctls perform a copy of this structure from user space to
kernel space and return -EINVAL if the size parameter is smaller than
the structure size itself, -ENOMEM if the kernel memory allocation fails
or -EFAULT if the copy itself fails. Other checks include a version check
of the compiled in user space version against the module version and a
mismatch results in a -EINVAL return. If the size field is greater than
the structure size then a path is assumed to be present and is checked to
ensure it begins with a "/" and is NULL terminated, otherwise -EINVAL is
returned. Following these checks, for all ioctl commands except
AUTOFS_DEV_IOCTL_VERSION_CMD, AUTOFS_DEV_IOCTL_OPENMOUNT_CMD and
AUTOFS_DEV_IOCTL_CLOSEMOUNT_CMD the ioctlfd is validated and if it is
not a valid descriptor or doesn't correspond to an autofs mount point
an error of -EBADF, -ENOTTY or -EINVAL (not an autofs descriptor) is
returned.
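As an illustration (a sketch only, not code from the autofs distribution),
a user space caller might query the interface version like this. The
request macro and the init_autofs_dev_ioctl() helper described above are
assumed to come from the <linux/auto_dev-ioctl.h> header:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <linux/auto_dev-ioctl.h>

    int main(void)
    {
        struct autofs_dev_ioctl param;
        int devfd = open("/dev/autofs", O_RDONLY);

        if (devfd == -1)
            return 1;

        /* fills in ver_major/ver_minor, size and ioctlfd = -1 */
        init_autofs_dev_ioctl(&param);

        if (ioctl(devfd, AUTOFS_DEV_IOCTL_VERSION, &param) == -1) {
            perror("AUTOFS_DEV_IOCTL_VERSION");
            return 1;
        }

        printf("device ioctl interface version %u.%u\n",
               param.ver_major, param.ver_minor);
        return 0;
    }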
The ioctls
==========
An example of an implementation which uses this interface can be seen
in autofs version 5.0.4 and later in file lib/dev-ioctl-lib.c of the
distribution tar available for download from kernel.org in directory
/pub/linux/daemons/autofs/v5.
The device node ioctl operations implemented by this interface are:
AUTOFS_DEV_IOCTL_VERSION
------------------------
Get the major and minor version of the autofs4 device ioctl kernel module
implementation. It requires an initialized struct autofs_dev_ioctl as an
input parameter and sets the version information in the passed in structure.
It returns 0 on success or the error -EINVAL if a version mismatch is
detected.
AUTOFS_DEV_IOCTL_PROTOVER_CMD and AUTOFS_DEV_IOCTL_PROTOSUBVER_CMD
------------------------------------------------------------------
Get the major and minor version of the autofs4 protocol version understood
by the loaded module. This call requires an initialized struct autofs_dev_ioctl
with the ioctlfd field set to a valid autofs mount point descriptor
and sets the requested version number in structure field arg1. These
commands return 0 on success or one of the negative error codes if
validation fails.
AUTOFS_DEV_IOCTL_OPENMOUNT and AUTOFS_DEV_IOCTL_CLOSEMOUNT
----------------------------------------------------------
Obtain and release a file descriptor for an autofs managed mount point
path. The open call requires an initialized struct autofs_dev_ioctl with
the path field set and the size field adjusted appropriately as well
as the arg1 field set to the device number of the autofs mount. The
device number can be obtained from the mount options shown in
/proc/mounts. The close call requires an initialized struct
autofs_dev_ioctl with the ioctlfd field set to the descriptor obtained
from the open call. The release of the file descriptor can also be done
with close(2) so any open descriptors will also be closed at process exit.
The close call is included in the implemented operations largely for
completeness and to provide for a consistent user space implementation.
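A sketch of how a daemon might obtain an ioctlfd for a possibly covered
autofs mount point follows. The helper name is illustrative; in the
implementation the new descriptor is passed back in the ioctlfd field:

    /* needs <stdlib.h>, <string.h>, <sys/ioctl.h> and <linux/auto_dev-ioctl.h> */
    static int open_ioctlfd(int devfd, const char *path, unsigned int devid)
    {
        struct autofs_dev_ioctl *param;
        size_t size = sizeof(*param) + strlen(path) + 1;
        int ioctlfd = -1;

        param = malloc(size);
        if (!param)
            return -1;

        init_autofs_dev_ioctl(param);
        param->size = size;      /* struct plus NUL terminated path */
        param->arg1 = devid;     /* device number of the autofs mount */
        strcpy(param->path, path);

        if (ioctl(devfd, AUTOFS_DEV_IOCTL_OPENMOUNT, param) != -1)
            ioctlfd = param->ioctlfd;

        free(param);
        return ioctlfd;
    }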
AUTOFS_DEV_IOCTL_READY_CMD and AUTOFS_DEV_IOCTL_FAIL_CMD
--------------------------------------------------------
Return mount and expire result status from user space to the kernel.
Both of these calls require an initialized struct autofs_dev_ioctl
with the ioctlfd field set to the descriptor obtained from the open
call and the arg1 field set to the wait queue token number, received
by user space in the foregoing mount or expire request. The arg2 field
is set to the status to be returned. For the ready call this is always
0 and for the fail call it is set to the errno of the operation.
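For example (a sketch, with an illustrative helper name), a daemon
reporting a failed mount with a meaningful error might do:

    /* needs <sys/ioctl.h> and <linux/auto_dev-ioctl.h> */
    static int send_fail(int devfd, int ioctlfd, unsigned int token, int err)
    {
        struct autofs_dev_ioctl param;

        init_autofs_dev_ioctl(&param);
        param.ioctlfd = ioctlfd;
        param.arg1 = token;    /* wait queue token from the request packet */
        param.arg2 = err;      /* e.g. ENOENT, EACCES, ... */

        return ioctl(devfd, AUTOFS_DEV_IOCTL_FAIL, &param);
    }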
AUTOFS_DEV_IOCTL_SETPIPEFD_CMD
------------------------------
Set the pipe file descriptor used for kernel communication to the daemon.
Normally this is set at mount time using an option but when reconnecting
to an existing mount we need to use this to tell the autofs mount about
the new kernel pipe descriptor. In order to protect mounts against
incorrectly setting the pipe descriptor we also require that the autofs
mount be catatonic (see next call).
The call requires an initialized struct autofs_dev_ioctl with the
ioctlfd field set to the descriptor obtained from the open call and
the arg1 field set to descriptor of the pipe. On success the call
also sets the process group id used to identify the controlling process
(eg. the owning automount(8) daemon) to the process group of the caller.
AUTOFS_DEV_IOCTL_CATATONIC_CMD
------------------------------
Make the autofs mount point catatonic. The autofs mount will no longer
issue mount requests, the kernel communication pipe descriptor is released
and any remaining waits in the queue released.
The call requires an initialized struct autofs_dev_ioctl with the
ioctlfd field set to the descriptor obtained from the open call.
AUTOFS_DEV_IOCTL_TIMEOUT_CMD
----------------------------
Set the expire timeout for mounts within an autofs mount point.
The call requires an initialized struct autofs_dev_ioctl with the
ioctlfd field set to the descriptor obtained from the open call.
AUTOFS_DEV_IOCTL_REQUESTER_CMD
------------------------------
Return the uid and gid of the last process to successfully trigger the
mount on the given path dentry.
The call requires an initialized struct autofs_dev_ioctl with the path
field set to the mount point in question and the size field adjusted
appropriately as well as the arg1 field set to the device number of the
containing autofs mount. Upon return the struct field arg1 contains the
uid and arg2 the gid.
When reconstructing an autofs mount tree with active mounts we need to
re-connect to mounts that may have used the original process uid and
gid (or string variations of them) for mount lookups within the map entry.
This call provides the ability to obtain this uid and gid so they may be
used by user space for the mount map lookups.
AUTOFS_DEV_IOCTL_EXPIRE_CMD
---------------------------
Issue an expire request to the kernel for an autofs mount. Typically
this ioctl is called until no further expire candidates are found.
The call requires an initialized struct autofs_dev_ioctl with the
ioctlfd field set to the descriptor obtained from the open call. In
addition an immediate expire, independent of the mount timeout, can be
requested by setting the arg1 field to 1. If no expire candidates can
be found the ioctl returns -1 with errno set to EAGAIN.
This call causes the kernel module to check the mount corresponding
to the given ioctlfd for mounts that can be expired, issues an expire
request back to the daemon and waits for completion.
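A sketch of the resulting expire loop (names are illustrative) might
look like:

    /* needs <errno.h>, <stdio.h>, <sys/ioctl.h> and <linux/auto_dev-ioctl.h> */
    static void expire_all(int devfd, int ioctlfd, int immediate)
    {
        struct autofs_dev_ioctl param;

        for (;;) {
            init_autofs_dev_ioctl(&param);
            param.ioctlfd = ioctlfd;
            param.arg1 = immediate;    /* 1 => ignore the expire timeout */

            /* blocks while the daemon (in another thread) handles the
             * resulting expire notification */
            if (ioctl(devfd, AUTOFS_DEV_IOCTL_EXPIRE, &param) == -1) {
                if (errno != EAGAIN)
                    perror("AUTOFS_DEV_IOCTL_EXPIRE");
                break;    /* EAGAIN: no more expire candidates */
            }
        }
    }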
AUTOFS_DEV_IOCTL_ASKUMOUNT_CMD
------------------------------
Checks if an autofs mount point is in use.
The call requires an initialized struct autofs_dev_ioctl with the
ioctlfd field set to the descriptor obtained from the open call and
it returns the result in the arg1 field, 1 for busy and 0 otherwise.
AUTOFS_DEV_IOCTL_ISMOUNTPOINT_CMD
---------------------------------
Check if the given path is a mountpoint.
The call requires an initialized struct autofs_dev_ioctl. There are two
possible variations. Both use the path field set to the path of the mount
point to check and the size field adjusted appropriately. One uses the
ioctlfd field to identify a specific mount point to check while the other
variation uses the path and optionally arg1 set to an autofs mount type.
The call returns 1 if this is a mount point and sets arg1 to the device
number of the mount and field arg2 to the relevant super block magic
number (described below) or 0 if it isn't a mountpoint. In both cases
the device number (as returned by new_encode_dev()) is returned
in field arg1.
If supplied with a file descriptor we're looking for a specific mount,
not necessarily at the top of the mounted stack. In this case the path
the descriptor corresponds to is considered a mountpoint if it is itself
a mountpoint or contains a mount, such as a multi-mount without a root
mount. In this case we return 1 if the descriptor corresponds to a mount
point and also returns the super magic of the covering mount if there
is one or 0 if it isn't a mountpoint.
If a path is supplied (and the ioctlfd field is set to -1) then the path
is looked up and is checked to see if it is the root of a mount. If a
type is also given we are looking for a particular autofs mount and if
a match isn't found a fail is returned. If the located path is the
root of a mount 1 is returned along with the super magic of the mount
or 0 otherwise.
View file
@ -0,0 +1,520 @@
<head>
<style> p { max-width:50em} ol, ul {max-width: 40em}</style>
</head>
autofs - how it works
=====================
Purpose
-------
The goal of autofs is to provide on-demand mounting and race free
automatic unmounting of various other filesystems. This provides two
key advantages:
1. There is no need to delay boot until all filesystems that
might be needed are mounted. Processes that try to access those
slow filesystems might be delayed but other processes can
continue freely. This is particularly important for
network filesystems (e.g. NFS) or filesystems stored on
media with a media-changing robot.
2. The names and locations of filesystems can be stored in
a remote database and can change at any time. The content
in that data base at the time of access will be used to provide
a target for the access. The interpretation of names in the
filesystem can even be programmatic rather than database-backed,
allowing wildcards for example, and can vary based on the user who
first accessed a name.
Context
-------
The "autofs4" filesystem module is only one part of an autofs system.
There also needs to be a user-space program which looks up names
and mounts filesystems. This will often be the "automount" program,
though other tools including "systemd" can make use of "autofs4".
This document describes only the kernel module and the interactions
required with any user-space program. Subsequent text refers to this
as the "automount daemon" or simply "the daemon".
"autofs4" is a Linux kernel module with provides the "autofs"
filesystem type. Several "autofs" filesystems can be mounted and they
can each be managed separately, or all managed by the same daemon.
Content
-------
An autofs filesystem can contain 3 sorts of objects: directories,
symbolic links and mount traps. Mount traps are directories with
extra properties as described in the next section.
Objects can only be created by the automount daemon: symlinks are
created with a regular `symlink` system call, while directories and
mount traps are created with `mkdir`. The determination of whether a
directory should be a mount trap or not is quite _ad hoc_, largely for
historical reasons, and is determined in part by the
*direct*/*indirect*/*offset* mount options, and the *maxproto* mount option.
If neither the *direct* nor *offset* mount options are given (so the
mount is considered to be *indirect*), then the root directory is
always a regular directory, otherwise it is a mount trap when it is
empty and a regular directory when not empty. Note that *direct* and
*offset* are treated identically so a concise summary is that the root
directory is a mount trap only if the filesystem is mounted *direct*
and the root is empty.
Directories created in the root directory are mount traps only if the
filesystem is mounted *indirect* and they are empty.
Directories further down the tree depend on the *maxproto* mount
option and particularly whether it is less than five or not.
When *maxproto* is five, no directories further down the
tree are ever mount traps; they are always regular directories. When
the *maxproto* is four (or three), these directories are mount traps
precisely when they are empty.
So: non-empty (i.e. non-leaf) directories are never mount traps. Empty
directories are sometimes mount traps, and sometimes not depending on
where in the tree they are (root, top level, or lower), the *maxproto*,
and whether the mount was *indirect* or not.
Mount Traps
---------------
A core element of the implementation of autofs is the Mount Traps
which are provided by the Linux VFS. Any directory provided by a
filesystem can be designated as a trap. This involves two separate
features that work together to allow autofs to do its job.
**DCACHE_NEED_AUTOMOUNT**
If a dentry has the DCACHE_NEED_AUTOMOUNT flag set (which gets set if
the inode has S_AUTOMOUNT set, or can be set directly) then it is
(potentially) a mount trap. Any access to this directory beyond a
"`stat`" will (normally) cause the `d_op->d_automount()` dentry operation
to be called. The task of this method is to find the filesystem that
should be mounted on the directory and to return it. The VFS is
responsible for actually mounting the root of this filesystem on the
directory.
autofs doesn't find the filesystem itself but sends a message to the
automount daemon asking it to find and mount the filesystem. The
autofs `d_automount` method then waits for the daemon to report that
everything is ready. It will then return "`NULL`" indicating that the
mount has already happened. The VFS doesn't try to mount anything but
follows down the mount that is already there.
This functionality is sufficient for some users of mount traps such
as NFS which creates traps so that mountpoints on the server can be
reflected on the client. However it is not sufficient for autofs. As
mounting onto a directory is considered to be "beyond a `stat`", the
automount daemon would not be able to mount a filesystem on the 'trap'
directory without some way to avoid getting caught in the trap. For
that purpose there is another flag.
**DCACHE_MANAGE_TRANSIT**
If a dentry has DCACHE_MANAGE_TRANSIT set then two very different but
related behaviors are invoked, both using the `d_op->d_manage()`
dentry operation.
Firstly, before checking to see if any filesystem is mounted on the
directory, d_manage() will be called with the `rcu_walk` parameter set
to `false`. It may return one of three things:
- A return value of zero indicates that there is nothing special
about this dentry and normal checks for mounts and automounts
should proceed.
autofs normally returns zero, but first waits for any
expiry (automatic unmounting of the mounted filesystem) to
complete. This avoids races.
- A return value of `-EISDIR` tells the VFS to ignore any mounts
on the directory and to not consider calling `->d_automount()`.
This effectively disables the **DCACHE_NEED_AUTOMOUNT** flag
causing the directory not to be a mount trap after all.
autofs returns this if it detects that the process performing the
lookup is the automount daemon and that the mount has been
requested but has not yet completed. How it determines this is
discussed later. This allows the automount daemon not to get
caught in the mount trap.
There is a subtlety here. It is possible that a second autofs
filesystem can be mounted below the first and for both of them to
be managed by the same daemon. For the daemon to be able to mount
something on the second it must be able to "walk" down past the
first. This means that d_manage cannot *always* return -EISDIR for
the automount daemon. It must only return it when a mount has
been requested, but has not yet completed.
`d_manage` also returns `-EISDIR` if the dentry shouldn't be a
mount trap, either because it is a symbolic link or because it is
not empty.
- Any other negative value is treated as an error and returned
to the caller.
autofs can return
- -ENOENT if the automount daemon failed to mount anything,
- -ENOMEM if it ran out of memory,
- -EINTR if a signal arrived while waiting for expiry to
complete
- or any other error sent down by the automount daemon.
The second use case only occurs during an "RCU-walk" and so `rcu_walk`
will be set.
An RCU-walk is a fast and lightweight process for walking down a
filename path (i.e. it is like running on tip-toes). RCU-walk cannot
cope with all situations so when it finds a difficulty it falls back
to "REF-walk", which is slower but more robust.
RCU-walk will never call `->d_automount`; the filesystems must already
be mounted or RCU-walk cannot handle the path.
To determine if a mount-trap is safe for RCU-walk mode it calls
`->d_manage()` with `rcu_walk` set to `true`.
In this case `d_manage()` must avoid blocking and should avoid taking
spinlocks if at all possible. Its sole purpose is to determine if it
would be safe to follow down into any mounted directory and the only
reason that it might not be is if an expiry of the mount is
underway.
In the `rcu_walk` case, `d_manage()` cannot return -EISDIR to tell the
VFS that this is a directory that doesn't require d_automount. If
`rcu_walk` sees a dentry with DCACHE_NEED_AUTOMOUNT set but nothing
mounted, it *will* fall back to REF-walk. `d_manage()` cannot make the
VFS remain in RCU-walk mode, but can only tell it to get out of
RCU-walk mode by returning `-ECHILD`.
So `d_manage()`, when called with `rcu_walk` set, should return
-ECHILD if there is any reason to believe it is unsafe to follow down
into the mounted filesystem, and otherwise should return 0.
autofs will return `-ECHILD` if an expiry of the filesystem has been
initiated or is being considered, otherwise it returns 0.
Mountpoint expiry
-----------------
The VFS has a mechanism for automatically expiring unused mounts,
much as it can expire any unused dentry information from the dcache.
This is guided by the MNT_SHRINKABLE flag. This only applies to
mounts that were created by `d_automount()` returning a filesystem to be
mounted. As autofs doesn't return such a filesystem but leaves the
mounting to the automount daemon, it must involve the automount daemon
in unmounting as well. This also means that autofs has more control
of expiry.
The VFS also supports "expiry" of mounts using the MNT_EXPIRE flag to
the `umount` system call. Unmounting with MNT_EXPIRE will fail unless
a previous attempt had been made, and the filesystem has been inactive
and untouched since that previous attempt. autofs4 does not depend on
this but has its own internal tracking of whether filesystems were
recently used. This allows individual names in the autofs directory
to expire separately.
With version 4 of the protocol, the automount daemon can try to
unmount any filesystems mounted on the autofs filesystem or remove any
symbolic links or empty directories any time it likes. If the unmount
or removal is successful the filesystem will be returned to the state
it was before the mount or creation, so that any access of the name
will trigger normal auto-mount processing. In particular, `rmdir` and
`unlink` do not leave negative entries in the dcache as a normal
filesystem would, so an attempt to access a recently-removed object is
passed to autofs for handling.
With version 5, this is not safe except for unmounting from top-level
directories. As lower-level directories are never mount traps, other
processes will see an empty directory as soon as the filesystem is
unmounted. So it is generally safest to use the autofs expiry
protocol described below.
Normally the daemon only wants to remove entries which haven't been
used for a while. For this purpose autofs maintains a "`last_used`"
time stamp on each directory or symlink. For symlinks it genuinely
does record the last time the symlink was "used" or followed to find
out where it points to. For directories the field is a slight
misnomer. It actually records the last time that autofs checked if
the directory or one of its descendants was busy and found that it
was. This is just as useful and doesn't require updating the field so
often.
The daemon is able to ask autofs if anything is due to be expired,
using an `ioctl` as discussed later. For a *direct* mount, autofs
considers if the entire mount-tree can be unmounted or not. For an
*indirect* mount, autofs considers each of the names in the top level
directory to determine if any of those can be unmounted and cleaned
up.
There is an option with indirect mounts to consider each of the leaves
that has been mounted on instead of considering the top-level names.
This is intended for compatibility with version 4 of autofs and should
be considered as deprecated.
When autofs considers a directory it checks the `last_used` time and
compares it with the "timeout" value set when the filesystem was
mounted, though this check is ignored in some cases. It also checks if
the directory or anything below it is in use. For symbolic links,
only the `last_used` time is ever considered.
If both checks indicate that the directory or symlink can be expired,
an action is taken.
There are two ways to ask autofs to consider expiry. The first is to
use the **AUTOFS_IOC_EXPIRE** ioctl. This only works for indirect
mounts. If it finds something in the root directory to expire it will
return the name of that thing. Once a name has been returned the
automount daemon needs to unmount any filesystems mounted below the
name normally. As described above, this is unsafe for non-toplevel
mounts in a version-5 autofs. For this reason the current `automountd`
does not use this ioctl.
The second mechanism uses either the **AUTOFS_DEV_IOCTL_EXPIRE_CMD** or
the **AUTOFS_IOC_EXPIRE_MULTI** ioctl. This will work for both direct and
indirect mounts. If it selects an object to expire, it will notify
the daemon using the notification mechanism described below. This
will block until the daemon acknowledges the expiry notification.
This implies that the "`EXPIRE`" ioctl must be sent from a different
thread than the one which handles notification.
While the ioctl is blocking, the entry is marked as "expiring" and
`d_manage` will block until the daemon affirms that the unmount has
completed (together with removing any directories that might have been
necessary), or has been aborted.
Communicating with autofs: detecting the daemon
-----------------------------------------------
There are several forms of communication between the automount daemon
and the filesystem. As we have already seen, the daemon can create and
remove directories and symlinks using normal filesystem operations.
autofs knows whether a process requesting some operation is the daemon
or not based on its process-group id number (see getpgid(2)).
When an autofs filesystem is mounted the pgid of the mounting
process is recorded unless the "pgrp=" option is given, in which
case that number is recorded instead. Any request arriving from a
process in that process group is considered to come from the daemon.
If the daemon ever has to be stopped and restarted a new pgid can be
provided through an ioctl as will be described below.
Communicating with autofs: the event pipe
-----------------------------------------
When an autofs filesystem is mounted, the 'write' end of a pipe must
be passed using the 'fd=' mount option. autofs will write
notification messages to this pipe for the daemon to respond to.
For version 5, the format of the message is:
struct autofs_v5_packet {
int proto_version; /* Protocol version */
int type; /* Type of packet */
autofs_wqt_t wait_queue_token;
__u32 dev;
__u64 ino;
__u32 uid;
__u32 gid;
__u32 pid;
__u32 tgid;
__u32 len;
char name[NAME_MAX+1];
};
where the type is one of
autofs_ptype_missing_indirect
autofs_ptype_expire_indirect
autofs_ptype_missing_direct
autofs_ptype_expire_direct
so messages can indicate that a name is missing (something tried to
access it but it isn't there) or that it has been selected for expiry.
The pipe will be set to "packet mode" (equivalent to passing
`O_DIRECT` to _pipe2(2)_) so that a read from the pipe will return at
most one packet, and any unread portion of a packet will be discarded.
The `wait_queue_token` is a unique number which can identify a
particular request to be acknowledged. When a message is sent over
the pipe the affected dentry is marked as either "active" or
"expiring" and other accesses to it block until the message is
acknowledged using one of the ioctls below and the relevant
`wait_queue_token`.
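A sketch of the daemon side (not the actual `automount` implementation)
could read one packet and acknowledge it with the wait queue token,
using the root-directory ioctls described in the next section.
Structure, type and ioctl names are those found in `<linux/auto_fs4.h>`:

    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/auto_fs4.h>

    /* pipefd: read end of the notification pipe
     * rootfd: descriptor open on the autofs root directory */
    static void handle_one_packet(int pipefd, int rootfd)
    {
        union autofs_v5_packet_union pkt;

        if (read(pipefd, &pkt, sizeof(pkt)) <= 0)
            return;

        switch (pkt.hdr.type) {
        case autofs_ptype_missing_indirect:
        case autofs_ptype_missing_direct:
            /* look up pkt.v5_packet.name in the map and mount it,
             * then acknowledge (or use AUTOFS_IOC_FAIL on error) */
            ioctl(rootfd, AUTOFS_IOC_READY,
                  pkt.v5_packet.wait_queue_token);
            break;
        case autofs_ptype_expire_indirect:
        case autofs_ptype_expire_direct:
            /* umount the expired entry, then acknowledge */
            ioctl(rootfd, AUTOFS_IOC_READY,
                  pkt.v5_packet.wait_queue_token);
            break;
        }
    }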
Communicating with autofs: root directory ioctls
------------------------------------------------
The root directory of an autofs filesystem will respond to a number of
ioctls. The process issuing the ioctl must have the CAP_SYS_ADMIN
capability, or must be the automount daemon.
The available ioctl commands are:
- **AUTOFS_IOC_READY**: a notification has been handled. The argument
to the ioctl command is the "wait_queue_token" number
corresponding to the notification being acknowledged.
- **AUTOFS_IOC_FAIL**: similar to above, but indicates failure with
the error code `ENOENT`.
- **AUTOFS_IOC_CATATONIC**: Causes the autofs to enter "catatonic"
mode meaning that it stops sending notifications to the daemon.
This mode is also entered if a write to the pipe fails.
- **AUTOFS_IOC_PROTOVER**: This returns the protocol version in use.
- **AUTOFS_IOC_PROTOSUBVER**: Returns the protocol sub-version which
is really a version number for the implementation. It is
currently 2.
- **AUTOFS_IOC_SETTIMEOUT**: This passes a pointer to an unsigned
long. The value is used to set the timeout for expiry, and
the current timeout value is stored back through the pointer.
- **AUTOFS_IOC_ASKUMOUNT**: Returns, in the pointed-to `int`, 1 if
the filesystem could be unmounted. This is only a hint as
the situation could change at any instant. This call can be
used to avoid a more expensive full unmount attempt.
- **AUTOFS_IOC_EXPIRE**: as described above, this asks if there is
anything suitable to expire. A pointer to a packet:
struct autofs_packet_expire_multi {
int proto_version; /* Protocol version */
int type; /* Type of packet */
autofs_wqt_t wait_queue_token;
int len;
char name[NAME_MAX+1];
};
is required. This is filled in with the name of something
that can be unmounted or removed. If nothing can be expired,
`errno` is set to `EAGAIN`. Even though a `wait_queue_token`
is present in the structure, no "wait queue" is established
and no acknowledgment is needed.
- **AUTOFS_IOC_EXPIRE_MULTI**: This is similar to
**AUTOFS_IOC_EXPIRE** except that it causes notification to be
sent to the daemon, and it blocks until the daemon acknowledges.
The argument is an integer which can contain two different flags.
**AUTOFS_EXP_IMMEDIATE** causes `last_used` time to be ignored
and objects are expired if they are not in use.
**AUTOFS_EXP_LEAVES** will select a leaf rather than a top-level
name to expire. This is only safe when *maxproto* is 4.
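As a brief illustration of the simpler of these ioctls (a sketch only,
with illustrative names), a process with a descriptor open on the autofs
root directory could adjust the expire timeout and ask whether an
unmount would currently succeed:

    #include <sys/ioctl.h>
    #include <linux/auto_fs4.h>

    static void tune_mount(int rootfd)
    {
        unsigned long timeout = 300;    /* new timeout in seconds; the
                                         * previous value is written back */
        int may_umount = 0;

        ioctl(rootfd, AUTOFS_IOC_SETTIMEOUT, &timeout);
        ioctl(rootfd, AUTOFS_IOC_ASKUMOUNT, &may_umount);
        /* may_umount == 1: nothing is busy, an unmount could succeed now */
    }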
Communicating with autofs: char-device ioctls
---------------------------------------------
It is not always possible to open the root of an autofs filesystem,
particularly a *direct* mounted filesystem. If the automount daemon
is restarted there is no way for it to regain control of existing
mounts using any of the above communication channels. To address this
need there is a "miscellaneous" character device (major 10, minor 235)
which can be used to communicate directly with the autofs filesystem.
It requires CAP_SYS_ADMIN for access.
The `ioctl`s that can be used on this device are described in a separate
document `autofs4-mount-control.txt`, and are summarized briefly here.
Each ioctl is passed a pointer to an `autofs_dev_ioctl` structure:
struct autofs_dev_ioctl {
__u32 ver_major;
__u32 ver_minor;
__u32 size; /* total size of data passed in
* including this struct */
__s32 ioctlfd; /* automount command fd */
__u32 arg1; /* Command parameters */
__u32 arg2;
char path[0];
};
For the **OPEN_MOUNT** and **IS_MOUNTPOINT** commands, the target
filesystem is identified by the `path`. All other commands identify
the filesystem by the `ioctlfd` which is a file descriptor open on the
root, and which can be returned by **OPEN_MOUNT**.
The `ver_major` and `ver_minor` are in/out parameters which check that
the requested version is supported, and report the maximum version
that the kernel module can support.
Commands are:
- **AUTOFS_DEV_IOCTL_VERSION_CMD**: does nothing, except validate and
set version numbers.
- **AUTOFS_DEV_IOCTL_OPENMOUNT_CMD**: return an open file descriptor
on the root of an autofs filesystem. The filesystem is identified
by name and device number, which is stored in `arg1`. Device
numbers for existing filesystems can be found in
`/proc/self/mountinfo`.
- **AUTOFS_DEV_IOCTL_CLOSEMOUNT_CMD**: same as `close(ioctlfd)`.
- **AUTOFS_DEV_IOCTL_SETPIPEFD_CMD**: if the filesystem is in
catatonic mode, this can provide the write end of a new pipe
in `arg1` to re-establish communication with a daemon. The
process group of the calling process is used to identify the
daemon.
- **AUTOFS_DEV_IOCTL_REQUESTER_CMD**: `path` should be a
name within the filesystem that has been auto-mounted on.
arg1 is the dev number of the underlying autofs. On successful
return, `arg1` and `arg2` will be the UID and GID of the process
which triggered that mount.
- **AUTOFS_DEV_IOCTL_ISMOUNTPOINT_CMD**: Check if path is a
mountpoint of a particular type - see separate documentation for
details.
- **AUTOFS_DEV_IOCTL_PROTOVER_CMD**:
- **AUTOFS_DEV_IOCTL_PROTOSUBVER_CMD**:
- **AUTOFS_DEV_IOCTL_READY_CMD**:
- **AUTOFS_DEV_IOCTL_FAIL_CMD**:
- **AUTOFS_DEV_IOCTL_CATATONIC_CMD**:
- **AUTOFS_DEV_IOCTL_TIMEOUT_CMD**:
- **AUTOFS_DEV_IOCTL_EXPIRE_CMD**:
- **AUTOFS_DEV_IOCTL_ASKUMOUNT_CMD**: These all have the same
function as the similarly named **AUTOFS_IOC** ioctls, except
that **FAIL** can be given an explicit error number in `arg1`
instead of assuming `ENOENT`, and this **EXPIRE** command
corresponds to **AUTOFS_IOC_EXPIRE_MULTI**.
Catatonic mode
--------------
As mentioned, an autofs mount can enter "catatonic" mode. This
happens if a write to the notification pipe fails, or if it is
explicitly requested by an `ioctl`.
When entering catatonic mode, the pipe is closed and any pending
notifications are acknowledged with the error `ENOENT`.
Once in catatonic mode attempts to access non-existing names will
result in `ENOENT` while attempts to access existing directories will
be treated in the same way as if they came from the daemon, so mount
traps will not fire.
When the filesystem is mounted a _uid_ and _gid_ can be given which
set the ownership of directories and symbolic links. When the
filesystem is in catatonic mode, any process with a matching UID can
create directories or symlinks in the root directory, but not in other
directories.
Catatonic mode can only be left via the
**AUTOFS_DEV_IOCTL_OPENMOUNT_CMD** ioctl on `/dev/autofs`.
autofs, name spaces, and shared mounts
--------------------------------------
With bind mounts and name spaces it is possible for an autofs
filesystem to appear at multiple places in one or more filesystem
name spaces. For this to work sensibly, the autofs filesystem should
always be mounted "shared". e.g.
> `mount --make-shared /autofs/mount/point`
The automount daemon is only able to manage a single mount location for
an autofs filesystem and if mounts on that are not 'shared', other
locations will not behave as expected. In particular access to those
other locations will likely result in the `ELOOP` error
> Too many levels of symbolic links
View file
@ -0,0 +1,118 @@
Support is available for filesystems that wish to do automounting (such
as kAFS which can be found in fs/afs/). This facility includes allowing
in-kernel mounts to be performed and mountpoint degradation to be
requested. The latter can also be requested by userspace.
======================
IN-KERNEL AUTOMOUNTING
======================
A filesystem can now mount another filesystem on one of its directories by the
following procedure:
(1) Give the directory a follow_link() operation.
When the directory is accessed, the follow_link op will be called, and
it will be provided with the location of the mountpoint in the nameidata
structure (vfsmount and dentry).
(2) Have the follow_link() op do the following steps:
(a) Call vfs_kern_mount() to call the appropriate filesystem to set up a
superblock and gain a vfsmount structure representing it.
(b) Copy the nameidata provided as an argument and substitute the dentry
argument into the copy.
(c) Call do_add_mount() to install the new vfsmount into the namespace's
mountpoint tree, thus making it accessible to userspace. Use the
nameidata set up in (b) as the destination.
If the mountpoint will be automatically expired, then do_add_mount()
should also be given the location of an expiration list (see further
down).
(d) Release the path in the nameidata argument and substitute in the new
vfsmount and its root dentry. The ref counts on these will need
incrementing.
Then from userspace, you can just do something like:
[root@andromeda root]# mount -t afs \#root.afs. /afs
[root@andromeda root]# ls /afs
asd cambridge cambridge.redhat.com grand.central.org
[root@andromeda root]# ls /afs/cambridge
afsdoc
[root@andromeda root]# ls /afs/cambridge/afsdoc/
ChangeLog html LICENSE pdf RELNOTES-1.2.2
And then if you look in the mountpoint catalogue, you'll see something like:
[root@andromeda root]# cat /proc/mounts
...
#root.afs. /afs afs rw 0 0
#root.cell. /afs/cambridge.redhat.com afs rw 0 0
#afsdoc. /afs/cambridge.redhat.com/afsdoc afs rw 0 0
===========================
AUTOMATIC MOUNTPOINT EXPIRY
===========================
Automatic expiration of mountpoints is easy, provided you've mounted the
mountpoint to be expired in the automounting procedure outlined above.
To do expiration, you need to follow these steps:
(3) Create at least one list off which the vfsmounts to be expired can be
hung. Access to this list will be governed by the vfsmount_lock.
(4) In step (2c) above, the call to do_add_mount() should be provided with a
pointer to this list. It will hang the vfsmount off of it if it succeeds.
(5) When you want mountpoints to be expired, call mark_mounts_for_expiry()
with a pointer to this list. This will process the list, marking every
vfsmount thereon for potential expiry on the next call.
If a vfsmount was already flagged for expiry, and if its usage count is 1
(it's only referenced by its parent vfsmount), then it will be deleted
from the namespace and thrown away (effectively unmounted).
It may prove simplest to call this at regular intervals, using
some sort of timed event to drive it.
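For instance (a sketch using illustrative names and the current
workqueue API, which differs slightly in older kernels), a filesystem
could drive this from a periodic work item:

    #include <linux/list.h>
    #include <linux/mount.h>
    #include <linux/workqueue.h>

    static LIST_HEAD(my_expiry_list);    /* list handed to do_add_mount() */

    static void my_expiry_work_fn(struct work_struct *work);
    static DECLARE_DELAYED_WORK(my_expiry_work, my_expiry_work_fn);

    static void my_expiry_work_fn(struct work_struct *work)
    {
        /* Mark every vfsmount on the list; anything already marked and
         * untouched since the previous pass is unmounted. */
        mark_mounts_for_expiry(&my_expiry_list);
        schedule_delayed_work(&my_expiry_work, 60 * HZ);
    }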
The expiration flag is cleared by calls to mntput. This means that expiration
will only happen on the second expiration request after the last time the
mountpoint was accessed.
If a mountpoint is moved, it gets removed from the expiration list. If a bind
mount is made on an expirable mount, the new vfsmount will not be on the
expiration list and will not expire.
If a namespace is copied, all mountpoints contained therein will be copied,
and the copies of those that are on an expiration list will be added to the
same expiration list.
=======================
USERSPACE DRIVEN EXPIRY
=======================
As an alternative, it is possible for userspace to request expiry of any
mountpoint (though some will be rejected - the current process's idea of the
rootfs for example). It does this by passing the MNT_EXPIRE flag to
umount(). This flag is considered incompatible with MNT_FORCE and MNT_DETACH.
If the mountpoint in question is referenced by something other than
umount() or its parent mountpoint, an EBUSY error will be returned and the
mountpoint will not be marked for expiration or unmounted.
If the mountpoint was not already marked for expiry at that time, an EAGAIN
error will be given and it won't be unmounted.
Otherwise if it was already marked and it wasn't referenced, unmounting will
take place as usual.
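A user space helper implementing this two-step expiry (a sketch; the
helper name is illustrative) might look like:

    #include <errno.h>
    #include <sys/mount.h>

    /* Returns 0 if the mount was expired, 1 if it has now been marked
     * and should be retried later, -1 if it is busy or on error. */
    static int try_expire(const char *mountpoint)
    {
        if (umount2(mountpoint, MNT_EXPIRE) == 0)
            return 0;       /* marked earlier and untouched since */
        if (errno == EAGAIN)
            return 1;       /* not previously marked; try again later */
        return -1;          /* EBUSY or some other error */
    }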
Again, the expiration flag is cleared every time anything other than umount()
looks at a mountpoint.
View file
@ -0,0 +1,117 @@
BeOS filesystem for Linux
Document last updated: Dec 6, 2001
WARNING
=======
Make sure you understand that this is alpha software. This means that the
implementation is neither complete nor well-tested.
I DISCLAIM ALL RESPONSIBILITY FOR ANY POSSIBLE BAD EFFECTS OF THIS CODE!
LICENSE
=======
This software is covered by the GNU General Public License.
See the file COPYING for the complete text of the license.
Or the GNU website: <http://www.gnu.org/licenses/licenses.html>
AUTHOR
======
The largest part of the code written by Will Dyson <will_dyson@pobox.com>
He has been working on the code since Aug 13, 2001. See the changelog for
details.
Original Author: Makoto Kato <m_kato@ga2.so-net.ne.jp>
His original code can still be found at:
<http://hp.vector.co.jp/authors/VA008030/bfs/>
Does anyone know of a more current email address for Makoto? He doesn't
respond to the address given above...
This filesystem doesn't have a maintainer.
WHAT IS THIS DRIVER?
====================
This module implements the native filesystem of BeOS http://www.beincorporated.com/
for the linux 2.4.1 and later kernels. Currently it is a read-only
implementation.
Which is it, BFS or BEFS?
=========================
Be, Inc said, "BeOS Filesystem is officially called BFS, not BeFS".
But Unixware Boot Filesystem is called bfs, too. And they are already in
the kernel. Because of this naming conflict, on Linux the BeOS
filesystem is called befs.
HOW TO INSTALL
==============
step 1. Install the BeFS patch into the source code tree of linux.
Apply the patchfile to your kernel source tree.
Assuming that your kernel source is in /foo/bar/linux and the patchfile
is called patch-befs-xxx, you would do the following:
cd /foo/bar/linux
patch -p1 < /path/to/patch-befs-xxx
if the patching step fails (i.e. there are rejected hunks), you can try to
figure it out yourself (it shouldn't be hard), or mail the maintainer
(Will Dyson <will_dyson@pobox.com>) for help.
step 2. Configuration & make kernel
The linux kernel has many compile-time options. Most of them are beyond the
scope of this document. I suggest the Kernel-HOWTO document as a good general
reference on this topic. http://www.linuxdocs.org/HOWTOs/Kernel-HOWTO-4.html
However, to use the BeFS module, you must enable it at configure time.
cd /foo/bar/linux
make menuconfig (or xconfig)
The BeFS module is not a standard part of the linux kernel, so you must first
enable support for experimental code under the "Code maturity level" menu.
Then, under the "Filesystems" menu will be an option called "BeFS
filesystem (experimental)", or something like that. Enable that option
(it is fine to make it a module).
Save your kernel configuration and then build your kernel.
step 3. Install
See the kernel howto <http://www.linux.com/howto/Kernel-HOWTO.html> for
instructions on this critical step.
USING BFS
=========
To use the BeOS filesystem, use filesystem type 'befs'.
ex)
mount -t befs /dev/fd0 /beos
MOUNT OPTIONS
=============
uid=nnn All files in the partition will be owned by user id nnn.
gid=nnn All files in the partition will be in group nnn.
iocharset=xxx Use xxx as the name of the NLS translation table.
debug The driver will output debugging information to the syslog.
HOW TO GET THE LATEST VERSION
=============================
The latest version is currently available at:
<http://befs-driver.sourceforge.net/>
ANY KNOWN BUGS?
===============
As of Jan 20, 2002:
None
SPECIAL THANKS
==============
Dominic Giampalo ... Writing "Practical file system design with Be filesystem"
Hiroyuki Yamada ... Testing LinuxPPC.
View file
@ -0,0 +1,57 @@
BFS FILESYSTEM FOR LINUX
========================
The BFS filesystem is used by SCO UnixWare OS for the /stand slice, which
usually contains the kernel image and a few other files required for the
boot process.
In order to access /stand partition under Linux you obviously need to
know the partition number and the kernel must support UnixWare disk slices
(CONFIG_UNIXWARE_DISKLABEL config option). However BFS support does not
depend on having UnixWare disklabel support because one can also mount
BFS filesystem via loopback:
# losetup /dev/loop0 stand.img
# mount -t bfs /dev/loop0 /mnt/stand
where stand.img is a file containing the image of BFS filesystem.
When you have finished using it and have unmounted it, you also need to deallocate the
/dev/loop0 device by:
# losetup -d /dev/loop0
You can simplify mounting by just typing:
# mount -t bfs -o loop stand.img /mnt/stand
this will allocate the first available loopback device (and load loop.o
kernel module if necessary) automatically. If the loopback driver is not
loaded automatically, make sure that you have compiled the module and
that modprobe is functioning. Beware that umount will not deallocate
/dev/loopN device if /etc/mtab file on your system is a symbolic link to
/proc/mounts. You will need to do it manually using "-d" switch of
losetup(8). Read losetup(8) manpage for more info.
To create the BFS image under UnixWare you need to find out first which
slice contains it. The command prtvtoc(1M) is your friend:
# prtvtoc /dev/rdsk/c0b0t0d0s0
(assuming your root disk is on target=0, lun=0, bus=0, controller=0). Then you
look for the slice with tag "STAND", which is usually slice 10. With this
information you can use dd(1) to create the BFS image:
# umount /stand
# dd if=/dev/rdsk/c0b0t0d0sa of=stand.img bs=512
Just in case, you can verify that you have done the right thing by checking
the magic number:
# od -Ad -tx4 stand.img | more
The first 4 bytes should be 0x1badface.
If you have any patches, questions or suggestions regarding this BFS
implementation please contact the author:
Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
View file
@ -0,0 +1,270 @@
BTRFS
=====
Btrfs is a copy on write filesystem for Linux aimed at
implementing advanced features while focusing on fault tolerance,
repair and easy administration. Initially developed by Oracle, Btrfs
is licensed under the GPL and open for contribution from anyone.
Linux has a wealth of filesystems to choose from, but we are facing a
number of challenges with scaling to the large storage subsystems that
are becoming common in today's data centers. Filesystems need to scale
in their ability to address and manage large storage, and also in
their ability to detect, repair and tolerate errors in the data stored
on disk. Btrfs is under heavy development, and is not suitable for
any uses other than benchmarking and review. The Btrfs disk format is
not yet finalized.
The main Btrfs features include:
* Extent based file storage (2^64 max file size)
* Space efficient packing of small files
* Space efficient indexed directories
* Dynamic inode allocation
* Writable snapshots
* Subvolumes (separate internal filesystem roots)
* Object level mirroring and striping
* Checksums on data and metadata (multiple algorithms available)
* Compression
* Integrated multiple device support, with several raid algorithms
* Online filesystem check (not yet implemented)
* Very fast offline filesystem check
* Efficient incremental backup and FS mirroring (not yet implemented)
* Online filesystem defragmentation
Mount Options
=============
When mounting a btrfs filesystem, the following options are accepted.
Options with (*) are default options and will not show in the mount options.
alloc_start=<bytes>
Debugging option to force all block allocations above a certain
byte threshold on each block device. The value is specified in
bytes, optionally with a K, M, or G suffix, case insensitive.
Default is 1MB.
noautodefrag(*)
autodefrag
Disable/enable auto defragmentation.
Auto defragmentation detects small random writes into files and queues
them up for the defrag process. Works best for small files;
Not well suited for large database workloads.
check_int
check_int_data
check_int_print_mask=<value>
These debugging options control the behavior of the integrity checking
module (the BTRFS_FS_CHECK_INTEGRITY config option is required).
check_int enables the integrity checker module, which examines all
block write requests to ensure on-disk consistency, at a large
memory and CPU cost.
check_int_data includes extent data in the integrity checks, and
implies the check_int option.
check_int_print_mask takes a bitmask of BTRFSIC_PRINT_MASK_* values
as defined in fs/btrfs/check-integrity.c, to control the integrity
checker module behavior.
See comments at the top of fs/btrfs/check-integrity.c for more info.
commit=<seconds>
Set the interval of periodic commit, 30 seconds by default. Higher
values defer data being synced to permanent storage with obvious
consequences when the system crashes. The upper bound is not forced,
but a warning is printed if it's more than 300 seconds (5 minutes).
compress
compress=<type>
compress-force
compress-force=<type>
Control BTRFS file data compression. Type may be specified as "zlib"
"lzo" or "no" (for no compression, used for remounting). If no type
is specified, zlib is used. If compress-force is specified,
all files will be compressed, whether or not they compress well.
If compression is enabled, nodatacow and nodatasum are disabled.
degraded
Allow mounts to continue with missing devices. A read-write mount may
fail with too many devices missing, for example if a stripe member
is completely missing.
device=<devicepath>
Specify a device during mount so that ioctls on the control device
can be avoided. Especially useful when trying to mount a multi-device
setup as root. May be specified multiple times for multiple devices.
nodiscard(*)
discard
Disable/enable discard mount option.
Discard issues frequent commands to let the block device reclaim space
freed by the filesystem.
This is useful for SSD devices, thinly provisioned
LUNs and virtual machine images, but may have a significant
performance impact. (The fstrim command is also available to
initiate batch trims from userspace).
noenospc_debug(*)
enospc_debug
Disable/enable debugging option to be more verbose in some ENOSPC conditions.
fatal_errors=<action>
Action to take when encountering a fatal error:
"bug" - BUG() on a fatal error. This is the default.
"panic" - panic() on a fatal error.
noflushoncommit(*)
flushoncommit
The 'flushoncommit' mount option forces any data dirtied by a write in a
prior transaction to commit as part of the current commit. This makes
the committed state a fully consistent view of the file system from the
application's perspective (i.e., it includes all completed file system
operations). This was previously the behavior only when a snapshot is
created.
inode_cache
Enable free inode number caching. Defaults to off due to an overflow
problem when the free space crcs don't fit inside a single page.
max_inline=<bytes>
Specify the maximum amount of space, in bytes, that can be inlined in
a metadata B-tree leaf. The value is specified in bytes, optionally
with a K, M, or G suffix, case insensitive. In practice, this value
is limited by the root sector size, with some space unavailable due
to leaf headers. For a 4k sectorsize, max inline data is ~3900 bytes.
metadata_ratio=<value>
Specify that 1 metadata chunk should be allocated after every <value>
data chunks. Off by default.
acl(*)
noacl
Enable/disable support for Posix Access Control Lists (ACLs). See the
acl(5) manual page for more information about ACLs.
barrier(*)
nobarrier
Enable/disable the use of block layer write barriers. Write barriers
ensure that certain IOs make it through the device cache and are on
persistent storage. If disabled on a device with a volatile
(non-battery-backed) write-back cache, nobarrier option will lead to
filesystem corruption on a system crash or power loss.
datacow(*)
nodatacow
Enable/disable data copy-on-write for newly created files.
Nodatacow implies nodatasum, and disables all compression.
datasum(*)
nodatasum
Enable/disable data checksumming for newly created files.
Datasum implies datacow.
treelog(*)
notreelog
Enable/disable the tree logging used for fsync and O_SYNC writes.
recovery
Enable autorecovery attempts if a bad tree root is found at mount time.
Currently this scans a list of several previous tree roots and tries to
use the first readable.
rescan_uuid_tree
Force check and rebuild procedure of the UUID tree. This should not
normally be needed.
skip_balance
Skip automatic resume of interrupted balance operation after mount.
May be resumed with "btrfs balance resume."
space_cache (*)
Enable the on-disk freespace cache.
nospace_cache
Disable freespace cache loading without clearing the cache.
clear_cache
Force clearing and rebuilding of the disk space cache if something
has gone wrong.
ssd
nossd
ssd_spread
Options to control ssd allocation schemes. By default, BTRFS will
enable or disable ssd allocation heuristics depending on whether a
rotational or nonrotational disk is in use. The ssd and nossd options
can override this autodetection.
The ssd_spread mount option attempts to allocate into big chunks
of unused space, and may perform better on low-end ssds. ssd_spread
implies ssd, enabling all other ssd heuristics as well.
subvol=<path>
Mount subvolume at <path> rather than the root subvolume. <path> is
relative to the top level subvolume.
subvolid=<ID>
Mount subvolume specified by an ID number rather than the root subvolume.
This allows mounting of subvolumes which are not in the root of the mounted
filesystem.
You can use "btrfs subvolume list" to see subvolume ID numbers.
subvolrootid=<objectid> (deprecated)
Mount subvolume specified by <objectid> rather than the root subvolume.
This allows mounting of subvolumes which are not in the root of the mounted
filesystem.
You can use "btrfs subvolume show " to see the object ID for a subvolume.
thread_pool=<number>
The number of worker threads to allocate. The default number is equal
to the number of CPUs + 2, or 8, whichever is smaller.
user_subvol_rm_allowed
Allow subvolumes to be deleted by a non-root user. Use with caution.
MAILING LIST
============
There is a Btrfs mailing list hosted on vger.kernel.org. You can
find details on how to subscribe here:
http://vger.kernel.org/vger-lists.html#linux-btrfs
Mailing list archives are available from gmane:
http://dir.gmane.org/gmane.comp.file-systems.btrfs
IRC
===
Discussion of Btrfs also occurs on the #btrfs channel of the Freenode
IRC network.
UTILITIES
=========
Userspace tools for creating and manipulating Btrfs file systems are
available from the git repository at the following location:
http://git.kernel.org/?p=linux/kernel/git/mason/btrfs-progs.git
git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs.git
These include the following tools:
* mkfs.btrfs: create a filesystem
* btrfs: a single tool to manage the filesystems, refer to the manpage for more details
* 'btrfsck' or 'btrfs check': do a consistency check of the filesystem
Other tools for specific tasks:
* btrfs-convert: in-place conversion from ext2/3/4 filesystems
* btrfs-image: dump filesystem metadata for debugging
View file
@ -0,0 +1,703 @@
==========================
FS-CACHE CACHE BACKEND API
==========================
The FS-Cache system provides an API by which actual caches can be supplied to
FS-Cache for it to then serve out to network filesystems and other interested
parties.
This API is declared in <linux/fscache-cache.h>.
====================================
INITIALISING AND REGISTERING A CACHE
====================================
To start off, a cache definition must be initialised and registered for each
cache the backend wants to make available. For instance, CacheFS does this in
the fill_super() operation on mounting.
The cache definition (struct fscache_cache) should be initialised by calling:
void fscache_init_cache(struct fscache_cache *cache,
struct fscache_cache_ops *ops,
const char *idfmt,
...);
Where:
(*) "cache" is a pointer to the cache definition;
(*) "ops" is a pointer to the table of operations that the backend supports on
this cache; and
(*) "idfmt" is a format and printf-style arguments for constructing a label
for the cache.
The cache should then be registered with FS-Cache by passing a pointer to the
previously initialised cache definition to:
int fscache_add_cache(struct fscache_cache *cache,
struct fscache_object *fsdef,
const char *tagname);
Two extra arguments should also be supplied:
(*) "fsdef" which should point to the object representation for the FS-Cache
master index in this cache. Netfs primary index entries will be created
here. FS-Cache keeps the caller's reference to the index object if
successful and will release it upon withdrawal of the cache.
(*) "tagname" which, if given, should be a text string naming this cache. If
this is NULL, the identifier will be used instead. For CacheFS, the
identifier is set to name the underlying block device and the tag can be
supplied by mount.
This function may return -ENOMEM if it ran out of memory or -EEXIST if the tag
is already in use. 0 will be returned on success.
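As a rough, hedged sketch (not taken from any particular backend),
registration might look like the following; "mycache", "mycache_ops",
"fsdef_object" and "mycache_register" are hypothetical backend-private
names:

	/*
	 * Hedged sketch of cache registration.  A real backend would do
	 * something like this from its fill_super() routine.
	 */
	#include <linux/fscache-cache.h>

	static struct fscache_cache mycache;
	static struct fscache_cache_ops mycache_ops;	/* table filled in elsewhere */

	static int mycache_register(struct fscache_object *fsdef_object,
				    const char *bdev_name, const char *tag)
	{
		int ret;

		/* Label the cache after the backing block device. */
		fscache_init_cache(&mycache, &mycache_ops, "mycache(%s)", bdev_name);

		/* Hand FS-Cache the object backing its master index. */
		ret = fscache_add_cache(&mycache, fsdef_object, tag);
		if (ret < 0)
			return ret;	/* -ENOMEM or -EEXIST */
		return 0;
	}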
=====================
UNREGISTERING A CACHE
=====================
A cache can be withdrawn from the system by calling this function with a
pointer to the cache definition:
void fscache_withdraw_cache(struct fscache_cache *cache);
In CacheFS's case, this is called by put_super().
========
SECURITY
========
The cache methods are executed in one of two contexts:
(1) that of the userspace process that issued the netfs operation that caused
the cache method to be invoked, or
(2) that of one of the processes in the FS-Cache thread pool.
In either case, this may not be an appropriate context in which to access the
cache.
The calling process's fsuid, fsgid and SELinux security identities may need to
be masqueraded for the duration of the cache driver's access to the cache.
This is left to the cache to handle; FS-Cache makes no effort in this regard.
===================================
CONTROL AND STATISTICS PRESENTATION
===================================
The cache may present data to the outside world through FS-Cache's interfaces
in sysfs and procfs - the former for control and the latter for statistics.
A sysfs directory called /sys/fs/fscache/<cachetag>/ is created if CONFIG_SYSFS
is enabled. This is accessible through the kobject struct fscache_cache::kobj
and is for use by the cache as it sees fit.
========================
RELEVANT DATA STRUCTURES
========================
(*) Index/Data file FS-Cache representation cookie:
struct fscache_cookie {
struct fscache_object_def *def;
struct fscache_netfs *netfs;
void *netfs_data;
...
};
The fields that might be of use to the backend describe the object
definition, the netfs definition and the netfs's data for this cookie.
The object definition contains functions supplied by the netfs for loading
and matching index entries; these are required to provide some of the
cache operations.
(*) In-cache object representation:
struct fscache_object {
int debug_id;
enum {
FSCACHE_OBJECT_RECYCLING,
...
} state;
spinlock_t lock;
struct fscache_cache *cache;
struct fscache_cookie *cookie;
...
};
Structures of this type should be allocated by the cache backend and
passed to FS-Cache when requested by the appropriate cache operation. In
the case of CacheFS, they're embedded in CacheFS's internal object
structures.
The debug_id is a simple integer that can be used in debugging messages
that refer to a particular object. In such a case it should be printed
using "OBJ%x" to be consistent with FS-Cache.
Each object contains a pointer to the cookie that represents the object it
is backing. An object should be retired when put_object() is called if it is
in state FSCACHE_OBJECT_RECYCLING. The fscache_object struct should be
initialised by calling fscache_object_init(object).
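As a hedged illustration of the embedding described above, a backend-private
object might look like this; "mycache_object" is a hypothetical name:

	/*
	 * Hedged sketch: embed the FS-Cache representation in the backend's
	 * own object structure, as CacheFS does.
	 */
	struct mycache_object {
		struct fscache_object	fscache;	/* FS-Cache's view of the object */
		/* ... backend-private state (backing inode, keys, etc.) ... */
	};

	static void mycache_object_setup(struct mycache_object *obj)
	{
		fscache_object_init(&obj->fscache);
	}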
(*) FS-Cache operation record:
struct fscache_operation {
atomic_t usage;
struct fscache_object *object;
unsigned long flags;
#define FSCACHE_OP_EXCLUSIVE
void (*processor)(struct fscache_operation *op);
void (*release)(struct fscache_operation *op);
...
};
FS-Cache has a pool of threads that it uses to give CPU time to the
various asynchronous operations that need to be done as part of driving
the cache. These are represented by the above structure. The processor
method is called to give the op CPU time, and the release method to get
rid of it when its usage count reaches 0.
An operation can be made exclusive upon an object by setting the
appropriate flag before enqueuing it with fscache_enqueue_operation(). If
an operation needs more processing time, it should be enqueued again.
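A hedged sketch of making an operation exclusive before queueing it follows;
how FSCACHE_OP_EXCLUSIVE is applied to the flags word (set_bit() is assumed
here) may vary:

	/*
	 * Hedged sketch: mark an operation exclusive on its object, then
	 * queue it.  set_bit() on op->flags is an assumption about how
	 * FSCACHE_OP_EXCLUSIVE is meant to be applied.
	 */
	static void mycache_queue_exclusive(struct fscache_operation *op)
	{
		set_bit(FSCACHE_OP_EXCLUSIVE, &op->flags);
		fscache_enqueue_operation(op);
	}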
(*) FS-Cache retrieval operation record:
struct fscache_retrieval {
struct fscache_operation op;
struct address_space *mapping;
struct list_head *to_do;
...
};
A structure of this type is allocated by FS-Cache to record retrieval and
allocation requests made by the netfs. This struct is then passed to the
backend to do the operation. The backend may get extra refs to it by
calling fscache_get_retrieval() and refs may be discarded by calling
fscache_put_retrieval().
A retrieval operation can be used by the backend to do retrieval work. To
do this, the retrieval->op.processor method pointer should be set
appropriately by the backend and fscache_enqueue_retrieval() called to
submit it to the thread pool. CacheFiles, for example, uses this to queue
page examination when it detects PG_lock being cleared.
The to_do field is an empty list available for the cache backend to use as
it sees fit.
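A hedged sketch of giving a retrieval record CPU time follows; the worker
routine "mycache_retrieval_work" is hypothetical and its page handling is
elided:

	/*
	 * Hedged sketch: set a processor on a retrieval record and queue it
	 * to the FS-Cache thread pool.
	 */
	static void mycache_retrieval_work(struct fscache_operation *_op)
	{
		struct fscache_retrieval *op =
			container_of(_op, struct fscache_retrieval, op);

		/* ... examine pages / start or finish I/O here ... */

		fscache_put_retrieval(op);	/* drop the ref taken below */
	}

	static void mycache_kick_retrieval(struct fscache_retrieval *op)
	{
		fscache_get_retrieval(op);
		op->op.processor = mycache_retrieval_work;
		fscache_enqueue_retrieval(op);
	}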
(*) FS-Cache storage operation record:
struct fscache_storage {
struct fscache_operation op;
pgoff_t store_limit;
...
};
A structure of this type is allocated by FS-Cache to record outstanding
writes to be made. FS-Cache itself enqueues this operation and invokes
the write_page() method on the object at appropriate times to effect
storage.
================
CACHE OPERATIONS
================
The cache backend provides FS-Cache with a table of operations that can be
performed on the denizens of the cache. These are held in a structure of type:
struct fscache_cache_ops
(*) Name of cache provider [mandatory]:
const char *name
This isn't strictly an operation, but should be pointed at a string naming
the backend.
(*) Allocate a new object [mandatory]:
struct fscache_object *(*alloc_object)(struct fscache_cache *cache,
struct fscache_cookie *cookie)
This method is used to allocate a cache object representation to back a
cookie in a particular cache. fscache_object_init() should be called on
the object to initialise it prior to returning.
This function may also be used to parse the index key to be used for
multiple lookup calls to turn it into a more convenient form. FS-Cache
will call the lookup_complete() method to allow the cache to release the
form once lookup is complete or aborted.
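A hedged sketch of an alloc_object() implementation, reusing the hypothetical
mycache_object structure sketched earlier; kzalloc()/GFP_KERNEL and the
assignment of the cache and cookie pointers here are assumptions:

	/*
	 * Hedged sketch of alloc_object().  Key parsing and error paths
	 * beyond allocation failure are elided.
	 */
	static struct fscache_object *
	mycache_alloc_object(struct fscache_cache *cache,
			     struct fscache_cookie *cookie)
	{
		struct mycache_object *obj;

		obj = kzalloc(sizeof(*obj), GFP_KERNEL);
		if (!obj)
			return NULL;

		fscache_object_init(&obj->fscache);
		obj->fscache.cache = cache;	/* assumed to be the backend's job */
		obj->fscache.cookie = cookie;
		return &obj->fscache;
	}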
(*) Look up and create object [mandatory]:
void (*lookup_object)(struct fscache_object *object)
This method is used to look up an object, given that the object is already
allocated and attached to the cookie. This should instantiate that object
in the cache if it can.
The method should call fscache_object_lookup_negative() as soon as
possible if it determines the object doesn't exist in the cache. If the
object is found to exist and the netfs indicates that it is valid then
fscache_obtained_object() should be called once the object is in a
position to have data stored in it. Similarly, fscache_obtained_object()
should also be called once a non-present object has been created.
If a lookup error occurs, fscache_object_lookup_error() should be called
to abort the lookup of that object.
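A hedged sketch of the state-notification calls a lookup_object()
implementation is expected to make; the on-disk lookup and creation helpers
are hypothetical, only the fscache_*() calls are taken from this document:

	static int mycache_find_on_disk(struct fscache_object *object);
	static int mycache_create_on_disk(struct fscache_object *object);

	static void mycache_lookup_object(struct fscache_object *object)
	{
		int ret = mycache_find_on_disk(object);

		if (ret == -ENOENT) {
			/* Not present: let reads pending on lookup go to the server. */
			fscache_object_lookup_negative(object);
			if (mycache_create_on_disk(object) < 0)
				goto error;
		} else if (ret < 0) {
			goto error;
		}

		/* The object can now have data stored in it. */
		fscache_obtained_object(object);
		return;

	error:
		fscache_object_lookup_error(object);
	}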
(*) Release lookup data [mandatory]:
void (*lookup_complete)(struct fscache_object *object)
This method is called to ask the cache to release any resources it was
using to perform a lookup.
(*) Increment object refcount [mandatory]:
struct fscache_object *(*grab_object)(struct fscache_object *object)
This method is called to increment the reference count on an object. It
may fail (for instance if the cache is being withdrawn) by returning NULL.
It should return the object pointer if successful.
(*) Lock/Unlock object [mandatory]:
void (*lock_object)(struct fscache_object *object)
void (*unlock_object)(struct fscache_object *object)
These methods are used to exclusively lock an object. It must be possible
to schedule with the lock held, so a spinlock isn't sufficient.
(*) Pin/Unpin object [optional]:
int (*pin_object)(struct fscache_object *object)
void (*unpin_object)(struct fscache_object *object)
These methods are used to pin an object into the cache. Once pinned an
object cannot be reclaimed to make space. Return -ENOSPC if there's not
enough space in the cache to permit this.
(*) Check coherency state of an object [mandatory]:
int (*check_consistency)(struct fscache_object *object)
This method is called to have the cache check the saved auxiliary data of
the object against the netfs's idea of the state. 0 should be returned
if they're consistent and -ESTALE otherwise. -ENOMEM and -ERESTARTSYS
may also be returned.
(*) Update object [mandatory]:
int (*update_object)(struct fscache_object *object)
This is called to update the index entry for the specified object. The
new information should be in object->cookie->netfs_data. This can be
obtained by calling object->cookie->def->get_aux()/get_attr().
(*) Invalidate data object [mandatory]:
int (*invalidate_object)(struct fscache_operation *op)
This is called to invalidate a data object (as pointed to by op->object).
All the data stored for this object should be discarded and an
attr_changed operation should be performed. The caller will follow up
with an object update operation.
fscache_op_complete() must be called on op before returning.
(*) Discard object [mandatory]:
void (*drop_object)(struct fscache_object *object)
This method is called to indicate that an object has been unbound from its
cookie, and that the cache should release the object's resources and
retire it if it's in state FSCACHE_OBJECT_RECYCLING.
This method should not attempt to release any references held by the
caller. The caller will invoke the put_object() method as appropriate.
(*) Release object reference [mandatory]:
void (*put_object)(struct fscache_object *object)
This method is used to discard a reference to an object. The object may
be freed when all the references to it are released.
(*) Synchronise a cache [mandatory]:
void (*sync)(struct fscache_cache *cache)
This is called to ask the backend to synchronise a cache with its backing
device.
(*) Dissociate a cache [mandatory]:
void (*dissociate_pages)(struct fscache_cache *cache)
This is called to ask a cache to perform any page dissociations as part of
cache withdrawal.
(*) Notification that the attributes on a netfs file changed [mandatory]:
int (*attr_changed)(struct fscache_object *object);
This is called to indicate to the cache that certain attributes on a netfs
file have changed (for example the maximum size a file may reach). The
cache can read these from the netfs by calling the cookie's get_attr()
method.
The cache may use the file size information to reserve space on the cache.
It should also call fscache_set_store_limit() to indicate to FS-Cache the
highest byte it's willing to store for an object.
This method may return -ve if an error occurred or the cache object cannot
be expanded. In such a case, the object will be withdrawn from service.
This operation is run asynchronously from FS-Cache's thread pool, and
storage and retrieval operations from the netfs are excluded during the
execution of this operation.
(*) Reserve cache space for an object's data [optional]:
int (*reserve_space)(struct fscache_object *object, loff_t size);
This is called to request that cache space be reserved to hold the data
for an object and the metadata used to track it. Zero size should be
taken as request to cancel a reservation.
This should return 0 if successful, -ENOSPC if there isn't enough space
available, or -ENOMEM or -EIO on other errors.
The reservation may exceed the current size of the object, thus permitting
future expansion. If the amount of space consumed by an object would
exceed the reservation, it's permitted to refuse requests to allocate
pages, but not required. An object may be pruned down to its reservation
size if larger than that already.
(*) Request page be read from cache [mandatory]:
int (*read_or_alloc_page)(struct fscache_retrieval *op,
struct page *page,
gfp_t gfp)
This is called to attempt to read a netfs page from the cache, or to
reserve a backing block if not. FS-Cache will have done as much checking
as it can before calling, but most of the work belongs to the backend.
If there's no page in the cache, then -ENODATA should be returned if the
backend managed to reserve a backing block; -ENOBUFS or -ENOMEM if it
didn't.
If there is suitable data in the cache, then a read operation should be
queued and 0 returned. When the read finishes, fscache_end_io() should be
called.
fscache_mark_pages_cached() should be called for the page if any cache
metadata is retained. This will indicate to the netfs that the page needs
explicit uncaching. This operation takes a pagevec, thus allowing several
pages to be marked at once.
The retrieval record pointed to by op should be retained for each page
queued and released when I/O on the page has been formally ended.
fscache_get/put_retrieval() are available for this purpose.
The retrieval record may be used to get CPU time via the FS-Cache thread
pool. If this is desired, the op->op.processor should be set to point to
the appropriate processing routine, and fscache_enqueue_retrieval() should
be called at an appropriate point to request CPU time. For instance, the
retrieval routine could be enqueued upon the completion of a disk read.
The to_do field in the retrieval record is provided to aid in this.
If an I/O error occurs, fscache_io_error() should be called and -ENOBUFS
returned if possible or fscache_end_io() called with a suitable error
code.
fscache_put_retrieval() should be called after a page or pages are dealt
with. This will complete the operation when all pages are dealt with.
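A heavily simplified, hedged sketch of the return-value contract described
above; the mycache_*() helpers are hypothetical, and the completion path of
the queued read is assumed to call fscache_end_io() and
fscache_put_retrieval():

	static bool mycache_page_present(struct fscache_object *obj, struct page *page);
	static int mycache_reserve_block(struct fscache_object *obj, struct page *page);
	static int mycache_queue_read(struct fscache_retrieval *op, struct page *page);

	static int mycache_read_or_alloc_page(struct fscache_retrieval *op,
					      struct page *page,
					      gfp_t gfp)
	{
		struct fscache_object *object = op->op.object;

		if (!mycache_page_present(object, page)) {
			if (mycache_reserve_block(object, page) == 0)
				return -ENODATA;	/* block reserved, no data yet */
			return -ENOBUFS;		/* nothing could be reserved */
		}

		/* Data is present: queue asynchronous I/O and return 0. */
		fscache_get_retrieval(op);
		if (mycache_queue_read(op, page) < 0) {
			fscache_put_retrieval(op);
			return -ENOBUFS;
		}
		return 0;
	}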
(*) Request pages be read from cache [mandatory]:
int (*read_or_alloc_pages)(struct fscache_retrieval *op,
struct list_head *pages,
unsigned *nr_pages,
gfp_t gfp)
This is like the read_or_alloc_page() method, except it is handed a list
of pages instead of one page. Any pages on which a read operation is
started must be added to the page cache for the specified mapping and also
to the LRU. Such pages must also be removed from the pages list and
*nr_pages decremented per page.
If there was an error such as -ENOMEM, then that should be returned; else
if one or more pages couldn't be read or allocated, then -ENOBUFS should
be returned; else if one or more pages couldn't be read, then -ENODATA
should be returned. If all the pages are dispatched then 0 should be
returned.
(*) Request page be allocated in the cache [mandatory]:
int (*allocate_page)(struct fscache_retrieval *op,
struct page *page,
gfp_t gfp)
This is like the read_or_alloc_page() method, except that it shouldn't
read from the cache, even if there's data there that could be retrieved.
It should, however, set up any internal metadata required such that
the write_page() method can write to the cache.
If there's no backing block available, then -ENOBUFS should be returned
(or -ENOMEM if there were other problems). If a block is successfully
allocated, then the netfs page should be marked and 0 returned.
(*) Request pages be allocated in the cache [mandatory]:
int (*allocate_pages)(struct fscache_retrieval *op,
struct list_head *pages,
unsigned *nr_pages,
gfp_t gfp)
This is a multiple-page version of the allocate_page() method. pages and
nr_pages should be treated as for the read_or_alloc_pages() method.
(*) Request page be written to cache [mandatory]:
int (*write_page)(struct fscache_storage *op,
struct page *page);
This is called to write from a page on which there was a previously
successful read_or_alloc_page() call or similar. FS-Cache filters out
pages that don't have mappings.
This method is called asynchronously from the FS-Cache thread pool. It is
not required to actually store anything, provided -ENODATA is then
returned to the next read of this page.
If an error occurred, then a negative error code should be returned,
otherwise zero should be returned. FS-Cache will take appropriate action
in response to an error, such as withdrawing this object.
If this method returns success then FS-Cache will inform the netfs
appropriately.
(*) Discard retained per-page metadata [mandatory]:
void (*uncache_page)(struct fscache_object *object, struct page *page)
This is called when a netfs page is being evicted from the pagecache. The
cache backend should tear down any internal representation or tracking it
maintains for this page.
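Tying the above together, a backend's operation table might be populated
roughly as follows. This is a hedged sketch only: every mycache_*() routine
named here is hypothetical, their declarations are elided, and the exact set
of fields depends on the kernel version:

	static struct fscache_cache_ops mycache_ops = {
		.name			= "mycache",
		.alloc_object		= mycache_alloc_object,
		.lookup_object		= mycache_lookup_object,
		.lookup_complete	= mycache_lookup_complete,
		.grab_object		= mycache_grab_object,
		.lock_object		= mycache_lock_object,
		.unlock_object		= mycache_unlock_object,
		.check_consistency	= mycache_check_consistency,
		.update_object		= mycache_update_object,
		.invalidate_object	= mycache_invalidate_object,
		.drop_object		= mycache_drop_object,
		.put_object		= mycache_put_object,
		.sync			= mycache_sync,
		.dissociate_pages	= mycache_dissociate_pages,
		.attr_changed		= mycache_attr_changed,
		.read_or_alloc_page	= mycache_read_or_alloc_page,
		.read_or_alloc_pages	= mycache_read_or_alloc_pages,
		.allocate_page		= mycache_allocate_page,
		.allocate_pages		= mycache_allocate_pages,
		.write_page		= mycache_write_page,
		.uncache_page		= mycache_uncache_page,
	};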
==================
FS-CACHE UTILITIES
==================
FS-Cache provides some utilities that a cache backend may make use of:
(*) Note occurrence of an I/O error in a cache:
void fscache_io_error(struct fscache_cache *cache)
This tells FS-Cache that an I/O error occurred in the cache. After this
has been called, only resource dissociation operations (object and page
release) will be passed from the netfs to the cache backend for the
specified cache.
This does not actually withdraw the cache. That must be done separately.
(*) Invoke the retrieval I/O completion function:
void fscache_end_io(struct fscache_retrieval *op, struct page *page,
int error);
This is called to note the end of an attempt to retrieve a page. The
error value should be 0 if successful and an error otherwise.
(*) Record that one or more pages being retrieved or allocated have been dealt
with:
void fscache_retrieval_complete(struct fscache_retrieval *op,
int n_pages);
This is called to record the fact that one or more pages have been dealt
with and are no longer the concern of this operation. When the number of
pages remaining in the operation reaches 0, the operation will be
completed.
(*) Record operation completion:
void fscache_op_complete(struct fscache_operation *op);
This is called to record the completion of an operation. This deducts
this operation from the parent object's run state, potentially permitting
one or more pending operations to start running.
(*) Set highest store limit:
void fscache_set_store_limit(struct fscache_object *object,
loff_t i_size);
This sets the limit FS-Cache imposes on the highest byte it's willing to
try and store for a netfs. Any page over this limit is automatically
rejected by fscache_read_alloc_page() and co with -ENOBUFS.
(*) Mark pages as being cached:
void fscache_mark_pages_cached(struct fscache_retrieval *op,
struct pagevec *pagevec);
This marks a set of pages as being cached. After this has been called,
the netfs must call fscache_uncache_page() to unmark the pages.
(*) Perform coherency check on an object:
enum fscache_checkaux fscache_check_aux(struct fscache_object *object,
const void *data,
uint16_t datalen);
This asks the netfs to perform a coherency check on an object that has
just been looked up. The cookie attached to the object will determine the
netfs to use. data and datalen should specify where the auxiliary data
retrieved from the cache can be found.
One of three values will be returned:
(*) FSCACHE_CHECKAUX_OKAY
The coherency data indicates the object is valid as is.
(*) FSCACHE_CHECKAUX_NEEDS_UPDATE
The coherency data needs updating, but otherwise the object is
valid.
(*) FSCACHE_CHECKAUX_OBSOLETE
The coherency data indicates that the object is obsolete and should
be discarded.
(*) Initialise a freshly allocated object:
void fscache_object_init(struct fscache_object *object);
This initialises all the fields in an object representation.
(*) Indicate the destruction of an object:
void fscache_object_destroyed(struct fscache_cache *cache);
This must be called to inform FS-Cache that an object that belonged to a
cache has been destroyed and deallocated. This will allow continuation
of the cache withdrawal process when it is stopped pending destruction of
all the objects.
(*) Indicate negative lookup on an object:
void fscache_object_lookup_negative(struct fscache_object *object);
This is called to indicate to FS-Cache that a lookup process for an object
found a negative result.
This changes the state of an object to permit reads pending on lookup
completion to go off and start fetching data from the netfs server as it's
known at this point that there can't be any data in the cache.
This may be called multiple times on an object. Only the first call is
significant - all subsequent calls are ignored.
(*) Indicate an object has been obtained:
void fscache_obtained_object(struct fscache_object *object);
This is called to indicate to FS-Cache that a lookup process for an object
produced a positive result, or that an object was created. This should
only be called once for any particular object.
This changes the state of an object to indicate:
(1) if no call to fscache_object_lookup_negative() has been made on
this object, that there may be data available, and that reads can
now go and look for it; and
(2) that writes may now proceed against this object.
(*) Indicate that object lookup failed:
void fscache_object_lookup_error(struct fscache_object *object);
This marks an object as having encountered a fatal error (usually EIO)
and causes it to move into a state whereby it will be withdrawn as soon
as possible.
(*) Get and release references on a retrieval record:
void fscache_get_retrieval(struct fscache_retrieval *op);
void fscache_put_retrieval(struct fscache_retrieval *op);
These two functions are used to retain a retrieval record whilst doing
asynchronous data retrieval and block allocation.
(*) Enqueue a retrieval record for processing.
void fscache_enqueue_retrieval(struct fscache_retrieval *op);
This enqueues a retrieval record for processing by the FS-Cache thread
pool. One of the threads in the pool will invoke the retrieval record's
op->op.processor callback function. This function may be called from
within the callback function.
(*) List of object state names:
const char *fscache_object_states[];
For debugging purposes, this may be used to turn the state that an object
is in into a text string for display purposes.


@ -0,0 +1,501 @@
===============================================
CacheFiles: CACHE ON ALREADY MOUNTED FILESYSTEM
===============================================
Contents:
(*) Overview.
(*) Requirements.
(*) Configuration.
(*) Starting the cache.
(*) Things to avoid.
(*) Cache culling.
(*) Cache structure.
(*) Security model and SELinux.
(*) A note on security.
(*) Statistical information.
(*) Debugging.
========
OVERVIEW
========
CacheFiles is a caching backend that uses a directory on an already mounted
filesystem of a local type (such as Ext3) as its cache.
CacheFiles uses a userspace daemon to do some of the cache management - such as
reaping stale nodes and culling. This is called cachefilesd and lives in
/sbin.
The filesystem and data integrity of the cache are only as good as those of the
filesystem providing the backing services. Note that CacheFiles does not
attempt to journal anything since the journalling interfaces of the various
filesystems are very specific in nature.
CacheFiles creates a misc character device - "/dev/cachefiles" - that is used
to communicate with the daemon. Only one thing may have this open at once,
and whilst it is open, a cache is at least partially in existence. The daemon
opens this and sends commands down it to control the cache.
CacheFiles is currently limited to a single cache.
CacheFiles attempts to maintain at least a certain percentage of free space on
the filesystem, shrinking the cache by culling the objects it contains to make
space if necessary - see the "Cache Culling" section. This means it can be
placed on the same medium as a live set of data, and will expand to make use of
spare space and automatically contract when the set of data requires more
space.
============
REQUIREMENTS
============
The use of CacheFiles and its daemon requires the following features to be
available in the system and in the cache filesystem:
- dnotify.
- extended attributes (xattrs).
- openat() and friends.
- bmap() support on files in the filesystem (FIBMAP ioctl).
- The use of bmap() to detect a partial page at the end of the file.
It is strongly recommended that the "dir_index" option is enabled on Ext3
filesystems being used as a cache.
=============
CONFIGURATION
=============
The cache is configured by a script in /etc/cachefilesd.conf. These commands
set up the cache ready for use. The following script commands are available
(an example configuration is shown after this list):
(*) brun <N>%
(*) bcull <N>%
(*) bstop <N>%
(*) frun <N>%
(*) fcull <N>%
(*) fstop <N>%
Configure the culling limits. Optional. See the section on culling below.
The defaults are 7% (run), 5% (cull) and 1% (stop) respectively.
The commands beginning with a 'b' are file space (block) limits, those
beginning with an 'f' are file count limits.
(*) dir <path>
Specify the directory containing the root of the cache. Mandatory.
(*) tag <name>
Specify a tag to FS-Cache to use in distinguishing multiple caches.
Optional. The default is "CacheFiles".
(*) debug <mask>
Specify a numeric bitmask to control debugging in the kernel module.
Optional. The default is zero (all off). The following values can be
OR'd into the mask to collect various information:
1 Turn on trace of function entry (_enter() macros)
2 Turn on trace of function exit (_leave() macros)
4 Turn on trace of internal debug points (_debug())
This mask can also be set through sysfs, e.g.:
	echo 5 >/sys/module/cachefiles/parameters/debug
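Putting the commands above together, an /etc/cachefilesd.conf might look
roughly like the following; the path, tag, percentages and the use of '#'
for comments are illustrative assumptions, not defaults:

	# Illustrative example only - values are assumptions, not defaults.
	dir /var/fscache
	tag mycache
	brun 10%
	bcull 7%
	bstop 3%
	frun 10%
	fcull 7%
	fstop 3%
	#debug 1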
==================
STARTING THE CACHE
==================
The cache is started by running the daemon. The daemon opens the cache device,
configures the cache and tells it to begin caching. At that point the cache
binds to fscache and the cache becomes live.
The daemon is run as follows:
/sbin/cachefilesd [-d]* [-s] [-n] [-f <configfile>]
The flags are:
(*) -d
Increase the debugging level. This can be specified multiple times and
is cumulative with itself.
(*) -s
Send messages to stderr instead of syslog.
(*) -n
Don't daemonise and go into background.
(*) -f <configfile>
Use an alternative configuration file rather than the default one.
===============
THINGS TO AVOID
===============
Do not mount other things within the cache as this will cause problems. The
kernel module contains its own very cut-down path walking facility that ignores
mountpoints, but the daemon can't avoid them.
Do not create, rename or unlink files and directories in the cache whilst the
cache is active, as this may cause the state to become uncertain.
Renaming files in the cache might make objects appear to be other objects (the
filename is part of the lookup key).
Do not change or remove the extended attributes attached to cache files by the
cache as this will cause the cache state management to get confused.
Do not create files or directories in the cache, lest the cache get confused or
serve incorrect data.
Do not chmod files in the cache. The module creates things with minimal
permissions to prevent random users being able to access them directly.
=============
CACHE CULLING
=============
The cache may need culling occasionally to make space. This involves
discarding objects from the cache that have been used less recently than
anything else. Culling is based on the access time of data objects. Empty
directories are culled if not in use.
Cache culling is done on the basis of the percentage of blocks and the
percentage of files available in the underlying filesystem. There are six
"limits":
(*) brun
(*) frun
If the amount of free space and the number of available files in the cache
rises above both these limits, then culling is turned off.
(*) bcull
(*) fcull
If the amount of available space or the number of available files in the
cache falls below either of these limits, then culling is started.
(*) bstop
(*) fstop
If the amount of available space or the number of available files in the
cache falls below either of these limits, then no further allocation of
disk space or files is permitted until culling has raised things above
these limits again.
These must be configured thusly:
0 <= bstop < bcull < brun < 100
0 <= fstop < fcull < frun < 100
Note that these are percentages of available space and available files, and do
_not_ appear as 100 minus the percentage displayed by the "df" program.
The userspace daemon scans the cache to build up a table of cullable objects.
These are then culled in least recently used order. A new scan of the cache is
started as soon as space is made in the table. Objects will be skipped if
their atimes have changed or if the kernel module says it is still using them.
===============
CACHE STRUCTURE
===============
The CacheFiles module will create two directories in the directory it was
given:
(*) cache/
(*) graveyard/
The active cache objects all reside in the first directory. The CacheFiles
kernel module moves any retired or culled objects that it can't simply unlink
to the graveyard from which the daemon will actually delete them.
The daemon uses dnotify to monitor the graveyard directory, and will delete
anything that appears therein.
The module represents index objects as directories with the filename "I..." or
"J...". Note that the "cache/" directory is itself a special index.
Data objects are represented as files if they have no children, or directories
if they do. Their filenames all begin "D..." or "E...". If represented as a
directory, data objects will have a file in the directory called "data" that
actually holds the data.
Special objects are similar to data objects, except their filenames begin
"S..." or "T...".
If an object has children, then it will be represented as a directory.
Immediately in the representative directory are a collection of directories
named for hash values of the child object keys with an '@' prepended. Into
this directory, if possible, will be placed the representations of the child
objects:
INDEX INDEX INDEX DATA FILES
========= ========== ================================= ================
cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400
cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...DB1ry
cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...N22ry
cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...FP1ry
If the key is so long that it exceeds NAME_MAX with the decorations added on to
it, then it will be cut into pieces, the first few of which will be used to
make a nest of directories, and the last one of which will be the objects
inside the last directory. The names of the intermediate directories will have
'+' prepended:
J1223/@23/+xy...z/+kl...m/Epqr
Note that keys are raw data, and not only may they exceed NAME_MAX in size,
they may also contain things like '/' and NUL characters, and so they may not
be suitable for turning directly into a filename.
To handle this, CacheFiles will use a suitably printable filename directly and
"base-64" encode ones that aren't directly suitable. The two versions of
object filenames indicate the encoding:
OBJECT TYPE PRINTABLE ENCODED
=============== =============== ===============
Index "I..." "J..."
Data "D..." "E..."
Special "S..." "T..."
Intermediate directories are always "@" or "+" as appropriate.
Each object in the cache has an extended attribute label that holds the object
type ID (required to distinguish special objects) and the auxiliary data from
the netfs. The latter is used to detect stale objects in the cache and update
or retire them.
Note that CacheFiles will erase from the cache any file it doesn't recognise or
any file of an incorrect type (such as a FIFO file or a device file).
==========================
SECURITY MODEL AND SELINUX
==========================
CacheFiles is implemented to deal properly with the LSM security features of
the Linux kernel and the SELinux facility.
One of the problems that CacheFiles faces is that it is generally acting on
behalf of a process, and running in that process's context, and that includes a
security context that is not appropriate for accessing the cache - either
because the files in the cache are inaccessible to that process, or because if
the process creates a file in the cache, that file may be inaccessible to other
processes.
The way CacheFiles works is to temporarily change the security context (fsuid,
fsgid and actor security label) that the process acts as - without changing the
security context of the process when it is the target of an operation performed by
some other process (so signalling and suchlike still work correctly).
When the CacheFiles module is asked to bind to its cache, it:
(1) Finds the security label attached to the root cache directory and uses
that as the security label with which it will create files. By default,
this is:
cachefiles_var_t
(2) Finds the security label of the process which issued the bind request
(presumed to be the cachefilesd daemon), which by default will be:
cachefilesd_t
and asks LSM to supply a security ID as which it should act given the
daemon's label. By default, this will be:
cachefiles_kernel_t
SELinux transitions the daemon's security ID to the module's security ID
based on a rule of this form in the policy.
type_transition <daemon's-ID> kernel_t : process <module's-ID>;
For instance:
type_transition cachefilesd_t kernel_t : process cachefiles_kernel_t;
The module's security ID gives it permission to create, move and remove files
and directories in the cache, to find and access directories and files in the
cache, to set and access extended attributes on cache objects, and to read and
write files in the cache.
The daemon's security ID gives it only a very restricted set of permissions: it
may scan directories, stat files and erase files and directories. It may
not read or write files in the cache, and so it is precluded from accessing the
data cached therein; nor is it permitted to create new files in the cache.
There are policy source files available in:
http://people.redhat.com/~dhowells/fscache/cachefilesd-0.8.tar.bz2
and later versions. In that tarball, see the files:
cachefilesd.te
cachefilesd.fc
cachefilesd.if
They are built and installed directly by the RPM.
If a non-RPM based system is being used, then copy the above files to their own
directory and run:
make -f /usr/share/selinux/devel/Makefile
semodule -i cachefilesd.pp
You will need checkpolicy and selinux-policy-devel installed prior to the
build.
By default, the cache is located in /var/fscache, but if it is desirable that
it should be elsewhere, then either the above policy files must be altered, or
an auxiliary policy must be installed to label the alternate location of the
cache.
For instructions on how to add an auxiliary policy to enable the cache to be
located elsewhere when SELinux is in enforcing mode, please see:
/usr/share/doc/cachefilesd-*/move-cache.txt
which is installed with the cachefilesd rpm; alternatively, the document can be
found in the sources.
==================
A NOTE ON SECURITY
==================
CacheFiles makes use of the split security in the task_struct. It allocates
its own task_security structure, and redirects current->cred to point to it
when it acts on behalf of another process, in that process's context.
The reason it does this is that it calls vfs_mkdir() and suchlike rather than
bypassing security and calling inode ops directly. Therefore the VFS and LSM
may deny the CacheFiles access to the cache data because under some
circumstances the caching code is running in the security context of whatever
process issued the original syscall on the netfs.
Furthermore, should CacheFiles create a file or directory, the security
parameters with which that object is created (UID, GID, security label) would be
derived from the process that issued the system call, thus potentially
preventing other processes from accessing the cache - including CacheFiles's
cache management daemon (cachefilesd).
What is required is to temporarily override the security of the process that
issued the system call. We can't, however, just do an in-place change of the
security data as that affects the process as an object, not just as a subject.
This means it may lose signals or ptrace events for example, and affects what
the process looks like in /proc.
So CacheFiles makes use of a logical split in the security between the
objective security (task->real_cred) and the subjective security (task->cred).
The objective security holds the intrinsic security properties of a process and
is never overridden. This is what appears in /proc, and is what is used when a
process is the target of an operation by some other process (SIGKILL for
example).
The subjective security holds the active security properties of a process, and
may be overridden. This is not seen externally, and is used when a process
acts upon another object, for example SIGKILLing another process or opening a
file.
LSM hooks exist that allow SELinux (or Smack or whatever) to reject a request
for CacheFiles to run in a context of a specific security label, or to create
files and directories with another security label.
=======================
STATISTICAL INFORMATION
=======================
If FS-Cache is compiled with the following option enabled:
CONFIG_CACHEFILES_HISTOGRAM=y
then it will gather certain statistics and display them through a proc file.
(*) /proc/fs/cachefiles/histogram
cat /proc/fs/cachefiles/histogram
JIFS SECS LOOKUPS MKDIRS CREATES
===== ===== ========= ========= =========
This shows a breakdown, by the length of time taken (from 0 jiffies to
HZ-1 jiffies), of how often a variety of tasks took that long to run. The
columns are as follows:
COLUMN TIME MEASUREMENT
======= =======================================================
LOOKUPS Length of time to perform a lookup on the backing fs
MKDIRS Length of time to perform a mkdir on the backing fs
CREATES Length of time to perform a create on the backing fs
Each row shows the number of events that took a particular range of times.
Each step is 1 jiffy in size. The JIFS column indicates the particular
jiffy range covered, and the SECS field the equivalent number of seconds.
=========
DEBUGGING
=========
If CONFIG_CACHEFILES_DEBUG is enabled, the CacheFiles facility can have runtime
debugging enabled by adjusting the value in:
/sys/module/cachefiles/parameters/debug
This is a bitmask of debugging streams to enable:
BIT VALUE STREAM POINT
======= ======= =============================== =======================
0 1 General Function entry trace
1 2 Function exit trace
2 4 General
The appropriate set of values should be OR'd together and the result written to
the control file. For example:
	echo 1 >/sys/module/cachefiles/parameters/debug
will turn on function entry debugging.


@ -0,0 +1,443 @@
==========================
General Filesystem Caching
==========================
========
OVERVIEW
========
This facility is a general purpose cache for network filesystems, though it
could be used for caching other things such as ISO9660 filesystems too.
FS-Cache mediates between cache backends (such as CacheFS) and network
filesystems:
+---------+
| | +--------------+
| NFS |--+ | |
| | | +-->| CacheFS |
+---------+ | +----------+ | | /dev/hda5 |
| | | | +--------------+
+---------+ +-->| | |
| | | |--+
| AFS |----->| FS-Cache |
| | | |--+
+---------+ +-->| | |
| | | | +--------------+
+---------+ | +----------+ | | |
| | | +-->| CacheFiles |
| ISOFS |--+ | /var/cache |
| | +--------------+
+---------+
Or to look at it another way, FS-Cache is a module that provides a caching
facility to a network filesystem such that the cache is transparent to the
user:
+---------+
| |
| Server |
| |
+---------+
| NETWORK
~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
| +----------+
V | |
+---------+ | |
| | | |
| NFS |----->| FS-Cache |
| | | |--+
+---------+ | | | +--------------+ +--------------+
| | | | | | | |
V +----------+ +-->| CacheFiles |-->| Ext3 |
+---------+ | /var/cache | | /dev/sda6 |
| | +--------------+ +--------------+
| VFS | ^ ^
| | | |
+---------+ +--------------+ |
| KERNEL SPACE | |
~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|~~~~~~|~~~~
| USER SPACE | |
V | |
+---------+ +--------------+
| | | |
| Process | | cachefilesd |
| | | |
+---------+ +--------------+
FS-Cache does not follow the idea of completely loading every netfs file
opened in its entirety into a cache before permitting it to be accessed and
then serving the pages out of that cache rather than the netfs inode because:
(1) It must be practical to operate without a cache.
(2) The size of any accessible file must not be limited to the size of the
cache.
(3) The combined size of all opened files (this includes mapped libraries)
must not be limited to the size of the cache.
(4) The user should not be forced to download an entire file just to do a
one-off access of a small portion of it (such as might be done with the
"file" program).
It instead serves the cache out in PAGE_SIZE chunks as and when requested by
the netfs('s) using it.
FS-Cache provides the following facilities:
(1) More than one cache can be used at once. Caches can be selected
explicitly by use of tags.
(2) Caches can be added / removed at any time.
(3) The netfs is provided with an interface that allows either party to
withdraw caching facilities from a file (required for (2)).
(4) The interface to the netfs returns as few errors as possible, preferring
rather to let the netfs remain oblivious.
(5) Cookies are used to represent indices, files and other objects to the
netfs. The simplest cookie is just a NULL pointer - indicating nothing
cached there.
(6) The netfs is allowed to propose - dynamically - any index hierarchy it
desires, though it must be aware that the index search function is
recursive, stack space is limited, and indices can only be children of
indices.
(7) Data I/O is done direct to and from the netfs's pages. The netfs
indicates that page A is at index B of the data-file represented by cookie
C, and that it should be read or written. The cache backend may or may
not start I/O on that page, but if it does, a netfs callback will be
invoked to indicate completion. The I/O may be either synchronous or
asynchronous.
(8) Cookies can be "retired" upon release. At this point FS-Cache will mark
them as obsolete and the index hierarchy rooted at that point will get
recycled.
(9) The netfs provides a "match" function for index searches. In addition to
saying whether a match was made or not, this can also specify that an
entry should be updated or deleted.
(10) As much as possible is done asynchronously.
FS-Cache maintains a virtual indexing tree in which all indices, files, objects
and pages are kept. Bits of this tree may actually reside in one or more
caches.
FSDEF
|
+------------------------------------+
| |
NFS AFS
| |
+--------------------------+ +-----------+
| | | |
homedir mirror afs.org redhat.com
| | |
+------------+ +---------------+ +----------+
| | | | | |
00001 00002 00007 00125 vol00001 vol00002
| | | | |
+---+---+ +-----+ +---+ +------+------+ +-----+----+
| | | | | | | | | | | | |
PG0 PG1 PG2 PG0 XATTR PG0 PG1 DIRENT DIRENT DIRENT R/W R/O Bak
| |
PG0 +-------+
| |
00001 00003
|
+---+---+
| | |
PG0 PG1 PG2
In the example above, you can see two netfs's being backed: NFS and AFS. These
have different index hierarchies:
(*) The NFS primary index contains per-server indices. Each server index is
indexed by NFS file handles to get data file objects. Each data file
object can have an array of pages, but may also have further child
objects, such as extended attributes and directory entries. Extended
attribute objects themselves have page-array contents.
(*) The AFS primary index contains per-cell indices. Each cell index contains
per-logical-volume indices. Each volume index contains up to three
indices for the read-write, read-only and backup mirrors of those volumes.
Each of these contains vnode data file objects, each of which contains an
array of pages.
The very top index is the FS-Cache master index in which individual netfs's
have entries.
Any index object may reside in more than one cache, provided it only has index
children. Any index with non-index object children will be assumed to only
reside in one cache.
The netfs API to FS-Cache can be found in:
Documentation/filesystems/caching/netfs-api.txt
The cache backend API to FS-Cache can be found in:
Documentation/filesystems/caching/backend-api.txt
A description of the internal representations and object state machine can be
found in:
Documentation/filesystems/caching/object.txt
=======================
STATISTICAL INFORMATION
=======================
If FS-Cache is compiled with the following options enabled:
CONFIG_FSCACHE_STATS=y
CONFIG_FSCACHE_HISTOGRAM=y
then it will gather certain statistics and display them through a number of
proc files.
(*) /proc/fs/fscache/stats
This shows counts of a number of events that can happen in FS-Cache:
CLASS EVENT MEANING
======= ======= =======================================================
Cookies idx=N Number of index cookies allocated
dat=N Number of data storage cookies allocated
spc=N Number of special cookies allocated
Objects alc=N Number of objects allocated
nal=N Number of object allocation failures
avl=N Number of objects that reached the available state
ded=N Number of objects that reached the dead state
ChkAux non=N Number of objects that didn't have a coherency check
ok=N Number of objects that passed a coherency check
upd=N Number of objects that needed a coherency data update
obs=N Number of objects that were declared obsolete
Pages mrk=N Number of pages marked as being cached
unc=N Number of uncache page requests seen
Acquire n=N Number of acquire cookie requests seen
nul=N Number of acq reqs given a NULL parent
noc=N Number of acq reqs rejected due to no cache available
ok=N Number of acq reqs succeeded
nbf=N Number of acq reqs rejected due to error
oom=N Number of acq reqs failed on ENOMEM
Lookups n=N Number of lookup calls made on cache backends
neg=N Number of negative lookups made
pos=N Number of positive lookups made
crt=N Number of objects created by lookup
tmo=N Number of lookups timed out and requeued
Updates n=N Number of update cookie requests seen
nul=N Number of upd reqs given a NULL parent
run=N Number of upd reqs granted CPU time
Relinqs n=N Number of relinquish cookie requests seen
nul=N Number of rlq reqs given a NULL parent
wcr=N Number of rlq reqs waited on completion of creation
AttrChg n=N Number of attribute changed requests seen
ok=N Number of attr changed requests queued
nbf=N Number of attr changed rejected -ENOBUFS
oom=N Number of attr changed failed -ENOMEM
run=N Number of attr changed ops given CPU time
Allocs n=N Number of allocation requests seen
ok=N Number of successful alloc reqs
wt=N Number of alloc reqs that waited on lookup completion
nbf=N Number of alloc reqs rejected -ENOBUFS
int=N Number of alloc reqs aborted -ERESTARTSYS
ops=N Number of alloc reqs submitted
owt=N Number of alloc reqs waited for CPU time
abt=N Number of alloc reqs aborted due to object death
Retrvls n=N Number of retrieval (read) requests seen
ok=N Number of successful retr reqs
wt=N Number of retr reqs that waited on lookup completion
nod=N Number of retr reqs returned -ENODATA
nbf=N Number of retr reqs rejected -ENOBUFS
int=N Number of retr reqs aborted -ERESTARTSYS
oom=N Number of retr reqs failed -ENOMEM
ops=N Number of retr reqs submitted
owt=N Number of retr reqs waited for CPU time
abt=N Number of retr reqs aborted due to object death
Stores n=N Number of storage (write) requests seen
ok=N Number of successful store reqs
agn=N Number of store reqs on a page already pending storage
nbf=N Number of store reqs rejected -ENOBUFS
oom=N Number of store reqs failed -ENOMEM
ops=N Number of store reqs submitted
run=N Number of store reqs granted CPU time
pgs=N Number of pages given store req processing time
rxd=N Number of store reqs deleted from tracking tree
olm=N Number of store reqs over store limit
VmScan nos=N Number of release reqs against pages with no pending store
gon=N Number of release reqs against pages stored by time lock granted
bsy=N Number of release reqs ignored due to in-progress store
can=N Number of page stores cancelled due to release req
Ops pend=N Number of times async ops added to pending queues
run=N Number of times async ops given CPU time
enq=N Number of times async ops queued for processing
can=N Number of async ops cancelled
rej=N Number of async ops rejected due to object lookup/create failure
dfr=N Number of async ops queued for deferred release
rel=N Number of async ops released
gc=N Number of deferred-release async ops garbage collected
CacheOp alo=N Number of in-progress alloc_object() cache ops
luo=N Number of in-progress lookup_object() cache ops
luc=N Number of in-progress lookup_complete() cache ops
gro=N Number of in-progress grab_object() cache ops
upo=N Number of in-progress update_object() cache ops
dro=N Number of in-progress drop_object() cache ops
pto=N Number of in-progress put_object() cache ops
syn=N Number of in-progress sync_cache() cache ops
atc=N Number of in-progress attr_changed() cache ops
rap=N Number of in-progress read_or_alloc_page() cache ops
ras=N Number of in-progress read_or_alloc_pages() cache ops
alp=N Number of in-progress allocate_page() cache ops
als=N Number of in-progress allocate_pages() cache ops
wrp=N Number of in-progress write_page() cache ops
ucp=N Number of in-progress uncache_page() cache ops
dsp=N Number of in-progress dissociate_pages() cache ops
(*) /proc/fs/fscache/histogram
cat /proc/fs/fscache/histogram
JIFS SECS OBJ INST OP RUNS OBJ RUNS RETRV DLY RETRIEVLS
===== ===== ========= ========= ========= ========= =========
This shows a breakdown, by the length of time taken (from 0 jiffies to
HZ-1 jiffies), of how often a variety of tasks took that long to run. The
columns are as follows:
COLUMN TIME MEASUREMENT
======= =======================================================
OBJ INST Length of time to instantiate an object
OP RUNS Length of time a call to process an operation took
OBJ RUNS Length of time a call to process an object event took
RETRV DLY Time between requesting a read and lookup completing
RETRIEVLS Time between beginning and end of a retrieval
Each row shows the number of events that took a particular range of times.
Each step is 1 jiffy in size. The JIFS column indicates the particular
jiffy range covered, and the SECS field the equivalent number of seconds.
===========
OBJECT LIST
===========
If CONFIG_FSCACHE_OBJECT_LIST is enabled, the FS-Cache facility will maintain a
list of all the objects currently allocated and allow them to be viewed
through:
/proc/fs/fscache/objects
This will look something like:
[root@andromeda ~]# head /proc/fs/fscache/objects
OBJECT PARENT STAT CHLDN OPS OOP IPR EX READS EM EV F S | NETFS_COOKIE_DEF TY FL NETFS_DATA OBJECT_KEY, AUX_DATA
======== ======== ==== ===== === === === == ===== == == = = | ================ == == ================ ================
17e4b 2 ACTV 0 0 0 0 0 0 7b 4 0 0 | NFS.fh DT 0 ffff88001dd82820 010006017edcf8bbc93b43298fdfbe71e50b57b13a172c0117f38472, e567634700000000000000000000000063f2404a000000000000000000000000c9030000000000000000000063f2404a
1693a 2 ACTV 0 0 0 0 0 0 7b 4 0 0 | NFS.fh DT 0 ffff88002db23380 010006017edcf8bbc93b43298fdfbe71e50b57b1e0162c01a2df0ea6, 420ebc4a000000000000000000000000420ebc4a0000000000000000000000000e1801000000000000000000420ebc4a
where the first set of columns before the '|' describe the object:
COLUMN DESCRIPTION
======= ===============================================================
OBJECT Object debugging ID (appears as OBJ%x in some debug messages)
PARENT Debugging ID of parent object
STAT Object state
CHLDN Number of child objects of this object
OPS Number of outstanding operations on this object
OOP Number of outstanding child object management operations
IPR
EX Number of outstanding exclusive operations
READS Number of outstanding read operations
EM Object's event mask
EV Events raised on this object
F Object flags
S Object work item busy state mask (1:pending 2:running)
and the second set of columns describe the object's cookie, if present:
COLUMN DESCRIPTION
=============== =======================================================
NETFS_COOKIE_DEF Name of netfs cookie definition
TY Cookie type (IX - index, DT - data, hex - special)
FL Cookie flags
NETFS_DATA Netfs private data stored in the cookie
OBJECT_KEY Object key } 1 column, with separating comma
AUX_DATA Object aux data } presence may be configured
The data shown may be filtered by attaching a key to an appropriate keyring
before viewing the file. Something like:
keyctl add user fscache:objlist <restrictions> @s
where <restrictions> are a selection of the following letters:
K Show hexdump of object key (don't show if not given)
A Show hexdump of object aux data (don't show if not given)
and the following paired letters:
C Show objects that have a cookie
c Show objects that don't have a cookie
B Show objects that are busy
b Show objects that aren't busy
W Show objects that have pending writes
w Show objects that don't have pending writes
R Show objects that have outstanding reads
r Show objects that don't have outstanding reads
S Show objects that have work queued
s Show objects that don't have work queued
If neither side of a letter pair is given, then both are implied. For example:
keyctl add user fscache:objlist KB @s
shows objects that are busy, and lists their object keys, but does not dump
their auxiliary data. It also implies "CcWwRrSs", but as 'B' is given, 'b' is
not implied.
By default all objects and all fields will be shown.
=========
DEBUGGING
=========
If CONFIG_FSCACHE_DEBUG is enabled, the FS-Cache facility can have runtime
debugging enabled by adjusting the value in:
/sys/module/fscache/parameters/debug
This is a bitmask of debugging streams to enable:
BIT VALUE STREAM POINT
======= ======= =============================== =======================
0 1 Cache management Function entry trace
1 2 Function exit trace
2 4 General
3 8 Cookie management Function entry trace
4 16 Function exit trace
5 32 General
6 64 Page handling Function entry trace
7 128 Function exit trace
8 256 General
9 512 Operation management Function entry trace
10 1024 Function exit trace
11 2048 General
The appropriate set of values should be OR'd together and the result written to
the control file. For example:
echo $((1|8|64)) >/sys/module/fscache/parameters/debug
will turn on all function entry debugging.


@ -0,0 +1,915 @@
===============================
FS-CACHE NETWORK FILESYSTEM API
===============================
There's an API by which a network filesystem can make use of the FS-Cache
facilities. This is based around a number of principles:
(1) Caches can store a number of different object types. There are two main
object types: indices and files. The first is a special type used by
FS-Cache to make finding objects faster and to make retiring of groups of
objects easier.
(2) Every index, file or other object is represented by a cookie. This cookie
may or may not have anything associated with it, but the netfs doesn't
need to care.
(3) Barring the top-level index (one entry per cached netfs), the index
hierarchy for each netfs is structured according to the whim of the netfs.
This API is declared in <linux/fscache.h>.
This document contains the following sections:
(1) Network filesystem definition
(2) Index definition
(3) Object definition
(4) Network filesystem (un)registration
(5) Cache tag lookup
(6) Index registration
(7) Data file registration
(8) Miscellaneous object registration
(9) Setting the data file size
(10) Page alloc/read/write
(11) Page uncaching
(12) Index and data file consistency
(13) Cookie enablement
(14) Miscellaneous cookie operations
(15) Cookie unregistration
(16) Index invalidation
(17) Data file invalidation
(18) FS-Cache specific page flags.
=============================
NETWORK FILESYSTEM DEFINITION
=============================
FS-Cache needs a description of the network filesystem. This is specified
using a record of the following structure:
struct fscache_netfs {
uint32_t version;
const char *name;
struct fscache_cookie *primary_index;
...
};
The first two fields should be filled in before registration, and the third
will be filled in by the registration function; any other fields should just be
ignored and are for internal use only.
The fields are:
(1) The name of the netfs (used as the key in the toplevel index).
(2) The version of the netfs (if the name matches but the version doesn't, the
entire in-cache hierarchy for this netfs will be scrapped and begun
afresh).
(3) The cookie representing the primary index will be allocated according to
another parameter passed into the registration function.
For example, kAFS (linux/fs/afs/) uses the following definitions to describe
itself:
struct fscache_netfs afs_cache_netfs = {
.version = 0,
.name = "afs",
};
================
INDEX DEFINITION
================
Indices are used for two purposes:
(1) To aid the finding of a file based on a series of keys (such as AFS's
"cell", "volume ID", "vnode ID").
(2) To make it easier to discard a subset of all the files cached based around
a particular key - for instance to mirror the removal of an AFS volume.
However, since it's unlikely that any two netfs's are going to want to define
their index hierarchies in quite the same way, FS-Cache tries to impose as few
restraints as possible on how an index is structured and where it is placed in
the tree. The netfs can even mix indices and data files at the same level, but
it's not recommended.
Each index entry consists of a key of indeterminate length plus some auxiliary
data, also of indeterminate length.
There are some limits on indices:
(1) Any index containing non-index objects should be restricted to a single
cache. Any such objects created within an index will be created in the
first cache only. The cache in which an index is created can be
controlled by cache tags (see below).
(2) The entry data must be atomically journallable, so it is limited to about
400 bytes at present. At least 400 bytes will be available.
(3) The depth of the index tree should be judged with care as the search
function is recursive. Too many layers will run the kernel out of stack.
=================
OBJECT DEFINITION
=================
To define an object, a structure of the following type should be filled out:
struct fscache_cookie_def
{
uint8_t name[16];
uint8_t type;
struct fscache_cache_tag *(*select_cache)(
const void *parent_netfs_data,
const void *cookie_netfs_data);
uint16_t (*get_key)(const void *cookie_netfs_data,
void *buffer,
uint16_t bufmax);
void (*get_attr)(const void *cookie_netfs_data,
uint64_t *size);
uint16_t (*get_aux)(const void *cookie_netfs_data,
void *buffer,
uint16_t bufmax);
enum fscache_checkaux (*check_aux)(void *cookie_netfs_data,
const void *data,
uint16_t datalen);
void (*get_context)(void *cookie_netfs_data, void *context);
void (*put_context)(void *cookie_netfs_data, void *context);
void (*mark_pages_cached)(void *cookie_netfs_data,
struct address_space *mapping,
struct pagevec *cached_pvec);
void (*now_uncached)(void *cookie_netfs_data);
};
This has the following fields:
(1) The type of the object [mandatory].
This is one of the following values:
(*) FSCACHE_COOKIE_TYPE_INDEX
This defines an index, which is a special FS-Cache type.
(*) FSCACHE_COOKIE_TYPE_DATAFILE
This defines an ordinary data file.
(*) Any other value between 2 and 255
This defines an extraordinary object such as an XATTR.
(2) The name of the object type (NUL terminated unless all 16 chars are used)
[optional].
(3) A function to select the cache in which to store an index [optional].
This function is invoked when an index needs to be instantiated in a cache
during the instantiation of a non-index object. Only the immediate index
parent for the non-index object will be queried. Any indices above that
in the hierarchy may be stored in multiple caches. This function does not
need to be supplied for any non-index object or any index that will only
have index children.
If this function is not supplied or if it returns NULL then the first
cache in the parent's list will be chosen, or failing that, the first
cache in the master list.
(4) A function to retrieve an object's key from the netfs [mandatory].
This function will be called with the netfs data that was passed to the
cookie acquisition function and the maximum length of key data that it may
provide. It should write the required key data into the given buffer and
return the quantity it wrote.
(5) A function to retrieve attribute data from the netfs [optional].
This function will be called with the netfs data that was passed to the
cookie acquisition function. It should return the size of the file if
this is a data file. The size may be used to govern how much cache must
be reserved for this file in the cache.
If the function is absent, a file size of 0 is assumed.
(6) A function to retrieve auxiliary data from the netfs [optional].
This function will be called with the netfs data that was passed to the
cookie acquisition function and the maximum length of auxiliary data that
it may provide. It should write the auxiliary data into the given buffer
and return the quantity it wrote.
If this function is absent, the auxiliary data length will be set to 0.
The length of the auxiliary data buffer may be dependent on the key
length. A netfs mustn't rely on being able to provide more than 400 bytes
for both.
(7) A function to check the auxiliary data [optional].
This function will be called to check that a match found in the cache for
this object is valid. For instance with AFS it could check the auxiliary
data against the data version number returned by the server to determine
whether the index entry in a cache is still valid.
If this function is absent, it will be assumed that matching objects in a
cache are always valid.
If present, the function should return one of the following values:
(*) FSCACHE_CHECKAUX_OKAY - the entry is okay as is
(*) FSCACHE_CHECKAUX_NEEDS_UPDATE - the entry requires update
(*) FSCACHE_CHECKAUX_OBSOLETE - the entry should be deleted
This function can also be used to extract data from the auxiliary data in
the cache and copy it into the netfs's structures.
(8) A pair of functions to manage contexts for the completion callback
[optional].
The cache read/write functions are passed a context which is then passed
to the I/O completion callback function. To ensure this context remains
valid until after the I/O completion is called, two functions may be
provided: one to get an extra reference on the context, and one to drop a
reference to it.
If the context is not used or is a type of object that won't go out of
scope, then these functions are not required. These functions are not
required for indices as indices may not contain data. These functions may
be called in interrupt context and so may not sleep.
(9) A function to mark a page as retaining cache metadata [optional].
This is called by the cache to indicate that it is retaining in-memory
information for this page and that the netfs should uncache the page when
it has finished. This does not indicate whether there's data on the disk
or not. Note that several pages at once may be presented for marking.
The PG_fscache bit is set on the pages before this function would be
called, so the function need not be provided if this is sufficient.
This function is not required for indices as they're not permitted data.
(10) A function to unmark all the pages retaining cache metadata [mandatory].
This is called by FS-Cache to indicate that a backing store is being
unbound from a cookie and that all the marks on the pages should be
cleared to prevent confusion. Note that the cache will have torn down all
its tracking information so that the pages don't need to be explicitly
uncached.
This function is not required for indices as they're not permitted data.
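
To illustrate how these operations fit together, here is a rough sketch of a
cookie definition for a data file belonging to a hypothetical netfs. The
"myfs" names and the myfs_inode structure are invented for the example; only
the fscache_cookie_def layout, the FSCACHE_COOKIE_TYPE_* constants and the
FSCACHE_CHECKAUX_* return values come from the descriptions above. The key is
a file ID and the auxiliary data is a data version number used by check_aux()
for coherency checking:

	#include <linux/string.h>
	#include <linux/fscache.h>

	struct myfs_inode {
		uint64_t		fid;		/* key: unique file ID */
		uint64_t		data_version;	/* auxiliary coherency data */
		uint64_t		size;
		struct fscache_cookie	*cache;
	};

	static uint16_t myfs_get_key(const void *cookie_netfs_data,
				     void *buffer, uint16_t bufmax)
	{
		const struct myfs_inode *inode = cookie_netfs_data;

		if (bufmax < sizeof(inode->fid))
			return 0;
		memcpy(buffer, &inode->fid, sizeof(inode->fid));
		return sizeof(inode->fid);
	}

	static void myfs_get_attr(const void *cookie_netfs_data, uint64_t *size)
	{
		const struct myfs_inode *inode = cookie_netfs_data;

		*size = inode->size;
	}

	static uint16_t myfs_get_aux(const void *cookie_netfs_data,
				     void *buffer, uint16_t bufmax)
	{
		const struct myfs_inode *inode = cookie_netfs_data;

		if (bufmax < sizeof(inode->data_version))
			return 0;
		memcpy(buffer, &inode->data_version, sizeof(inode->data_version));
		return sizeof(inode->data_version);
	}

	static enum fscache_checkaux myfs_check_aux(void *cookie_netfs_data,
						    const void *data,
						    uint16_t datalen)
	{
		struct myfs_inode *inode = cookie_netfs_data;
		uint64_t cached_version;

		if (datalen != sizeof(cached_version))
			return FSCACHE_CHECKAUX_OBSOLETE;
		memcpy(&cached_version, data, sizeof(cached_version));
		return cached_version == inode->data_version ?
			FSCACHE_CHECKAUX_OKAY : FSCACHE_CHECKAUX_OBSOLETE;
	}

	static const struct fscache_cookie_def myfs_file_cache_def = {
		.name		= "myfs.file",
		.type		= FSCACHE_COOKIE_TYPE_DATAFILE,
		.get_key	= myfs_get_key,
		.get_attr	= myfs_get_attr,
		.get_aux	= myfs_get_aux,
		.check_aux	= myfs_check_aux,
	};

Since this is a data file definition, select_cache(), the context functions and
the page-marking functions are left unset, so the defaults described above
apply.
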
===================================
NETWORK FILESYSTEM (UN)REGISTRATION
===================================
The first step is to declare the network filesystem to the cache. This also
involves specifying the layout of the primary index (for AFS, this would be the
"cell" level).
The registration function is:
int fscache_register_netfs(struct fscache_netfs *netfs);
It just takes a pointer to the netfs definition. It returns 0 or an error as
appropriate.
For kAFS, registration is done as follows:
ret = fscache_register_netfs(&afs_cache_netfs);
The last step is, of course, unregistration:
void fscache_unregister_netfs(struct fscache_netfs *netfs);
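
Typically registration happens in the netfs's module init routine and
unregistration in its exit routine. A minimal sketch, assuming a hypothetical
"myfs" module; only the fscache_netfs structure and the two fscache_*_netfs()
calls come from this section:

	#include <linux/module.h>
	#include <linux/fscache.h>

	static struct fscache_netfs myfs_cache_netfs = {
		.name		= "myfs",
		.version	= 1,
	};

	static int __init myfs_init(void)
	{
		/* Creates the "myfs" entry in the top-level index and fills
		 * in myfs_cache_netfs.primary_index. */
		return fscache_register_netfs(&myfs_cache_netfs);
	}

	static void __exit myfs_exit(void)
	{
		fscache_unregister_netfs(&myfs_cache_netfs);
	}

	module_init(myfs_init);
	module_exit(myfs_exit);
	MODULE_LICENSE("GPL");
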
================
CACHE TAG LOOKUP
================
FS-Cache permits the use of more than one cache. To permit particular index
subtrees to be bound to particular caches, the second step is to look up cache
representation tags. This step is optional; it can be left entirely up to
FS-Cache as to which cache should be used. The problem with doing that is that
FS-Cache will always pick the first cache that was registered.
To get the representation for a named tag:
struct fscache_cache_tag *fscache_lookup_cache_tag(const char *name);
This takes a text string as the name and returns a representation of a tag. It
will never return an error. It may return a dummy tag, however, if it runs out
of memory; this will inhibit caching with this tag.
Any representation so obtained must be released by passing it to this function:
void fscache_release_cache_tag(struct fscache_cache_tag *tag);
The tag will be retrieved by FS-Cache when it calls the object definition
operation select_cache().
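
For example, a netfs might look up a tag named by a mount option when a
superblock is set up, return it from its select_cache() operation, and release
it at unmount. A sketch with invented myfs names; the "mycache" tag name is
purely illustrative:

	static struct fscache_cache_tag *myfs_cache_tag;

	static void myfs_get_cache_tag(const char *name)
	{
		/* Never fails, though it may return a dummy tag that
		 * inhibits caching if memory is tight. */
		myfs_cache_tag = fscache_lookup_cache_tag(name);
	}

	static void myfs_put_cache_tag(void)
	{
		fscache_release_cache_tag(myfs_cache_tag);
	}

The cookie definition's select_cache() operation would then simply return
myfs_cache_tag for the indices that should live in that cache.
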
==================
INDEX REGISTRATION
==================
The third step is to inform FS-Cache about part of an index hierarchy that can
be used to locate files. This is done by requesting a cookie for each index in
the path to the file:
struct fscache_cookie *
fscache_acquire_cookie(struct fscache_cookie *parent,
const struct fscache_cookie_def *def,
void *netfs_data,
bool enable);
This function creates an index entry in the index represented by parent,
filling in the index entry by calling the operations pointed to by def.
Note that this function never returns an error - all errors are handled
internally. It may, however, return NULL to indicate no cookie. It is quite
acceptable to pass this token back to this function as the parent to another
acquisition (or even to the relinquish cookie, read page and write page
functions - see below).
Note also that no indices are actually created in a cache until a non-index
object needs to be created somewhere down the hierarchy. Furthermore, an index
may be created in several different caches independently at different times.
This is all handled transparently, and the netfs doesn't see any of it.
A cookie will be created in the disabled state if enabled is false. A cookie
must be enabled to do anything with it. A disabled cookie can be enabled by
calling fscache_enable_cookie() (see below).
For example, with AFS, a cell would be added to the primary index. This index
entry would have a dependent inode containing a volume location index for the
volume mappings within this cell:
cell->cache =
fscache_acquire_cookie(afs_cache_netfs.primary_index,
&afs_cell_cache_index_def,
cell, true);
Then when a volume location was accessed, it would be entered into the cell's
index and an inode would be allocated that acts as a volume type and hash chain
combination:
vlocation->cache =
fscache_acquire_cookie(cell->cache,
&afs_vlocation_cache_index_def,
vlocation, true);
And then a particular flavour of volume (R/O for example) could be added to
that index, creating another index for vnodes (AFS inode equivalents):
volume->cache =
fscache_acquire_cookie(vlocation->cache,
&afs_volume_cache_index_def,
volume, true);
======================
DATA FILE REGISTRATION
======================
The fourth step is to request a data file be created in the cache. This is
identical to index cookie acquisition. The only difference is that the type in
the object definition should be something other than index type.
vnode->cache =
fscache_acquire_cookie(volume->cache,
&afs_vnode_cache_object_def,
vnode, true);
=================================
MISCELLANEOUS OBJECT REGISTRATION
=================================
An optional step is to request an object of miscellaneous type be created in
the cache. This is almost identical to index cookie acquisition. The only
difference is that the type in the object definition should be something other
than index type. Whilst the parent object could be an index, it's more likely
it would be some other type of object such as a data file.
xattr->cache =
fscache_acquire_cookie(vnode->cache,
&afs_xattr_cache_object_def,
xattr, true);
Miscellaneous objects might be used to store extended attributes or directory
entries for example.
==========================
SETTING THE DATA FILE SIZE
==========================
The fifth step is to set the physical attributes of the file, such as its size.
This doesn't automatically reserve any space in the cache, but permits the
cache to adjust its metadata for data tracking appropriately:
int fscache_attr_changed(struct fscache_cookie *cookie);
The cache will return -ENOBUFS if there is no backing cache or if there is no
space to allocate any extra metadata required in the cache. The attributes
will be accessed with the get_attr() cookie definition operation.
Note that attempts to read or write data pages in the cache over this size may
be rebuffed with -ENOBUFS.
This operation schedules an attribute adjustment to happen asynchronously at
some point in the future, and as such, it may happen after the function returns
to the caller. The attribute adjustment excludes read and write operations.
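
For instance, a netfs would typically call this from its setattr or truncate
path after updating its own idea of the file size. A sketch reusing the
hypothetical myfs_inode from earlier; only fscache_attr_changed() is part of
the API:

	static void myfs_set_size(struct myfs_inode *inode, uint64_t newsize)
	{
		inode->size = newsize;

		/* Let the cache adjust its metadata; a -ENOBUFS return just
		 * means there's no usable cache space, which the netfs can
		 * safely ignore. */
		if (inode->cache)
			fscache_attr_changed(inode->cache);
	}
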
=====================
PAGE ALLOC/READ/WRITE
=====================
And the sixth step is to store and retrieve pages in the cache. There are
three functions that are used to do this.
Note:
(1) A page should not be re-read or re-allocated without uncaching it first.
(2) A read or allocated page must be uncached when the netfs page is released
from the pagecache.
(3) A page should only be written to the cache if previously read or allocated.
This permits the cache to maintain its page tracking in proper order.
PAGE READ
---------
Firstly, the netfs should ask FS-Cache to examine the caches and read the
contents cached for a particular page of a particular file if present, or else
allocate space to store the contents if not:
typedef
void (*fscache_rw_complete_t)(struct page *page,
void *context,
int error);
int fscache_read_or_alloc_page(struct fscache_cookie *cookie,
struct page *page,
fscache_rw_complete_t end_io_func,
void *context,
gfp_t gfp);
The cookie argument must specify a cookie for an object that isn't an index,
the page specified will have the data loaded into it (and is also used to
specify the page number), and the gfp argument is used to control how any
memory allocations made are satisfied.
If the cookie indicates the inode is not cached:
(1) The function will return -ENOBUFS.
Else if there's a copy of the page resident in the cache:
(1) The mark_pages_cached() cookie operation will be called on that page.
(2) The function will submit a request to read the data from the cache's
backing device directly into the page specified.
(3) The function will return 0.
(4) When the read is complete, end_io_func() will be invoked with:
(*) The netfs data supplied when the cookie was created.
(*) The page descriptor.
(*) The context argument passed to the above function. This will be
maintained with the get_context/put_context functions mentioned above.
(*) An argument that's 0 on success or negative for an error code.
If an error occurs, it should be assumed that the page contains no usable
data. fscache_readpages_cancel() may need to be called.
end_io_func() will be called in process context if the read results in
an error, but it might be called in interrupt context if the read is
successful.
Otherwise, if there's not a copy available in cache, but the cache may be able
to store the page:
(1) The mark_pages_cached() cookie operation will be called on that page.
(2) A block may be reserved in the cache and attached to the object at the
appropriate place.
(3) The function will return -ENODATA.
This function may also return -ENOMEM or -EINTR, in which case it won't have
read any data from the cache.
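
Putting this together, a readpage() implementation might try the cache first
and fall back to the server. This is only a sketch: the myfs names, the use of
i_private to reach the hypothetical myfs_inode, and myfs_readpage_from_server()
are all invented; the return-value handling follows the rules above:

	static void myfs_cache_read_complete(struct page *page, void *context,
					     int error)
	{
		if (error)
			SetPageError(page);
		else
			SetPageUptodate(page);
		unlock_page(page);
	}

	static int myfs_readpage(struct file *file, struct page *page)
	{
		struct myfs_inode *inode = file_inode(file)->i_private;
		int ret;

		ret = fscache_read_or_alloc_page(inode->cache, page,
						 myfs_cache_read_complete,
						 NULL, GFP_KERNEL);
		switch (ret) {
		case 0:
			/* Read submitted; the callback completes the page. */
			return 0;
		case -ENODATA:
			/* A block was allocated but holds no data: read from
			 * the server, then fscache_write_page() the result. */
		case -ENOBUFS:
		default:
			/* Not cached (or -ENOMEM/-EINTR): read from the
			 * server as usual. */
			return myfs_readpage_from_server(file, page);
		}
	}
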
PAGE ALLOCATE
-------------
Alternatively, if there's not expected to be any data in the cache for a page
because the file has been extended, a block can simply be allocated instead:
int fscache_alloc_page(struct fscache_cookie *cookie,
struct page *page,
gfp_t gfp);
This is similar to the fscache_read_or_alloc_page() function, except that it
never reads from the cache. It will return 0 if a block has been allocated,
rather than -ENODATA as the other would. One or the other must be performed
before writing to the cache.
The mark_pages_cached() cookie operation will be called on the page if
successful.
PAGE WRITE
----------
Secondly, if the netfs changes the contents of the page (either due to an
initial download or if a user performs a write), then the page should be
written back to the cache:
int fscache_write_page(struct fscache_cookie *cookie,
struct page *page,
gfp_t gfp);
The cookie argument must specify a data file cookie, the page specified should
contain the data to be written (and is also used to specify the page number),
and the gfp argument is used to control how any memory allocations made are
satisfied.
The page must have first been read or allocated successfully and must not have
been uncached before writing is performed.
If the cookie indicates the inode is not cached then:
(1) The function will return -ENOBUFS.
Else if space can be allocated in the cache to hold this page:
(1) PG_fscache_write will be set on the page.
(2) The function will submit a request to write the data to cache's backing
device directly from the page specified.
(3) The function will return 0.
(4) When the write is complete PG_fscache_write is cleared on the page and
anyone waiting for that bit will be woken up.
Else if there's no space available in the cache, -ENOBUFS will be returned. It
is also possible for the PG_fscache_write bit to be cleared when no write took
place if unforeseen circumstances arose (such as a disk error).
Writing takes place asynchronously.
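
As an illustration, after fetching a page from the server a netfs might copy
it into the cache as below. If the write can't even be started, the sketch
simply gives up the cache's interest in the page so that its marks are cleared
(the myfs names are again hypothetical):

	static void myfs_write_to_cache(struct myfs_inode *inode,
					struct page *page)
	{
		if (fscache_write_page(inode->cache, page, GFP_KERNEL) != 0)
			fscache_uncache_page(inode->cache, page);
	}
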
MULTIPLE PAGE READ
------------------
A facility is provided to read several pages at once, as requested by the
readpages() address space operation:
int fscache_read_or_alloc_pages(struct fscache_cookie *cookie,
struct address_space *mapping,
struct list_head *pages,
int *nr_pages,
fscache_rw_complete_t end_io_func,
void *context,
gfp_t gfp);
This works in a similar way to fscache_read_or_alloc_page(), except:
(1) Any page it can retrieve data for is removed from pages and nr_pages and
dispatched for reading to the disk. Reads of adjacent pages on disk may
be merged for greater efficiency.
(2) The mark_pages_cached() cookie operation will be called on several pages
at once if they're being read or allocated.
(3) If there was a general error, then that error will be returned.
Else if some pages couldn't be allocated or read, then -ENOBUFS will be
returned.
Else if some pages couldn't be read but were allocated, then -ENODATA will
be returned.
Otherwise, if all pages had reads dispatched, then 0 will be returned, the
list will be empty and *nr_pages will be 0.
(4) end_io_func will be called once for each page being read as the reads
complete. It will be called in process context if error != 0, but it may
be called in interrupt context if there is no error.
Note that a return of -ENODATA, -ENOBUFS or any other error does not preclude
some of the pages being read and some being allocated. Those pages will have
been marked appropriately and will need uncaching.
CANCELLATION OF UNREAD PAGES
----------------------------
If one or more pages are passed to fscache_read_or_alloc_pages() but not then
read from the cache and also not read from the underlying filesystem then
those pages will need to have any marks and reservations removed. This can be
done by calling:
void fscache_readpages_cancel(struct fscache_cookie *cookie,
struct list_head *pages);
prior to returning to the caller. The cookie argument should be as passed to
fscache_read_or_alloc_pages(). Every page in the pages list will be examined
and any that have PG_fscache set will be uncached.
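
For example, a readpages() implementation might let the cache satisfy what it
can, send the remainder to the server, and cancel the leftover marks if that
fails. A sketch reusing the hypothetical helpers from the readpage example
above; myfs_readpages_from_server() is invented:

	static int myfs_readpages(struct file *file,
				  struct address_space *mapping,
				  struct list_head *pages, unsigned nr_pages)
	{
		struct myfs_inode *inode = file_inode(file)->i_private;
		int n = nr_pages;
		int ret;

		ret = fscache_read_or_alloc_pages(inode->cache, mapping,
						  pages, &n,
						  myfs_cache_read_complete,
						  NULL, GFP_KERNEL);
		if (ret == 0 && n == 0)
			return 0;	/* everything came from the cache */

		/* Read whatever is left from the server; on failure, drop
		 * the cache's marks and reservations on the unread pages. */
		ret = myfs_readpages_from_server(file, mapping, pages, n);
		if (ret < 0)
			fscache_readpages_cancel(inode->cache, pages);
		return ret;
	}
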
==============
PAGE UNCACHING
==============
To uncache a page, this function should be called:
void fscache_uncache_page(struct fscache_cookie *cookie,
struct page *page);
This function permits the cache to release any in-memory representation it
might be holding for this netfs page. This function must be called once for
each page on which the read or write page functions above have been called to
make sure the cache's in-memory tracking information gets torn down.
Note that pages can't be explicitly deleted from a data file. The whole
data file must be retired (see the relinquish cookie function below).
Furthermore, note that this does not cancel the asynchronous read or write
operation started by the read/alloc and write functions, so the page
invalidation functions must use:
bool fscache_check_page_write(struct fscache_cookie *cookie,
struct page *page);
to see if a page is being written to the cache, and:
void fscache_wait_on_page_write(struct fscache_cookie *cookie,
struct page *page);
to wait for it to finish if it is.
When releasepage() is being implemented, a special FS-Cache function exists to
manage the heuristics of coping with vmscan trying to eject pages, which may
conflict with the cache trying to write pages to the cache (which may itself
need to allocate memory):
bool fscache_maybe_release_page(struct fscache_cookie *cookie,
struct page *page,
gfp_t gfp);
This takes the netfs cookie, and the page and gfp arguments as supplied to
releasepage(). It will return false if the page cannot be released yet for
some reason and if it returns true, the page has been uncached and can now be
released.
To make a page available for release, this function may wait for an outstanding
storage request to complete, or it may attempt to cancel the storage request -
in which case the page will not be stored in the cache this time.
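
A releasepage() implementation would typically check PG_fscache and defer to
this helper. A minimal sketch under the same hypothetical myfs assumptions as
earlier:

	static int myfs_releasepage(struct page *page, gfp_t gfp)
	{
		struct myfs_inode *inode = page->mapping->host->i_private;

		/* If the cache still has an interest in this page, ask
		 * FS-Cache whether it can be let go yet. */
		if (PageFsCache(page) &&
		    !fscache_maybe_release_page(inode->cache, page, gfp))
			return 0;	/* not releasable yet */

		return 1;		/* page may be released */
	}
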
BULK INODE PAGE UNCACHE
-----------------------
A convenience routine is provided to perform an uncache on all the pages
attached to an inode. This assumes that the pages on the inode correspond on a
1:1 basis with the pages in the cache.
void fscache_uncache_all_inode_pages(struct fscache_cookie *cookie,
struct inode *inode);
This takes the netfs cookie that the pages were cached with and the inode that
the pages are attached to. This function will wait for pages to finish being
written to the cache and for the cache to finish with the page generally. No
error is returned.
===============================
INDEX AND DATA FILE CONSISTENCY
===============================
To find out whether auxiliary data for an object is up to date within the
cache, the following function can be called:
int fscache_check_consistency(struct fscache_cookie *cookie)
This will call back to the netfs to check whether the auxiliary data associated
with a cookie is correct. It returns 0 if it is and -ESTALE if it isn't; it
may also return -ENOMEM and -ERESTARTSYS.
To request an update of the index data for an index or other object, the
following function should be called:
void fscache_update_cookie(struct fscache_cookie *cookie);
This function will refer back to the netfs_data pointer stored in the cookie by
the acquisition function to obtain the data to write into each revised index
entry. The update method in the parent index definition will be called to
transfer the data.
Note that partial updates may happen automatically at other times, such as when
data blocks are added to a data file object.
=================
COOKIE ENABLEMENT
=================
Cookies exist in one of two states: enabled and disabled. If a cookie is
disabled, it ignores all attempts to acquire child cookies; check, update or
invalidate its state; allocate, read or write backing pages - though it is
still possible to uncache pages and relinquish the cookie.
The initial enablement state is set by fscache_acquire_cookie(), but the cookie
can be enabled or disabled later. To disable a cookie, call:
void fscache_disable_cookie(struct fscache_cookie *cookie,
bool invalidate);
If the cookie is not already disabled, this locks the cookie against other
enable and disable ops, marks the cookie as being disabled, discards or
invalidates any backing objects and waits for cessation of activity on any
associated object before unlocking the cookie.
All possible failures are handled internally. The caller should consider
calling fscache_uncache_all_inode_pages() afterwards to make sure all page
markings are cleared up.
Cookies can be enabled or reenabled with:
void fscache_enable_cookie(struct fscache_cookie *cookie,
bool (*can_enable)(void *data),
void *data)
If the cookie is not already enabled, this locks the cookie against other
enable and disable ops, invokes can_enable() and, if the cookie is not an index
cookie, will begin the procedure of acquiring backing objects.
The optional can_enable() function is passed the data argument and returns a
ruling as to whether or not enablement should actually be permitted to begin.
All possible failures are handled internally. The cookie will only be marked
as enabled if provisional backing objects are allocated.
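
For example, a netfs that lets caching be switched per file might wrap these
calls as follows. The myfs helpers are hypothetical; a real can_enable()
function would consult whatever netfs state governs whether the file should be
cached:

	static bool myfs_can_enable(void *data)
	{
		struct myfs_inode *inode = data;

		/* A real netfs would check its own state here, e.g. a
		 * per-file "don't cache" flag. */
		return inode != NULL;
	}

	static void myfs_set_caching(struct myfs_inode *inode, bool on)
	{
		if (on)
			fscache_enable_cookie(inode->cache, myfs_can_enable,
					      inode);
		else
			fscache_disable_cookie(inode->cache, false);
	}
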
===============================
MISCELLANEOUS COOKIE OPERATIONS
===============================
There are a number of operations that can be used to control cookies:
(*) Cookie pinning:
int fscache_pin_cookie(struct fscache_cookie *cookie);
void fscache_unpin_cookie(struct fscache_cookie *cookie);
These operations permit data cookies to be pinned into the cache and to
have the pinning removed. They are not permitted on index cookies.
The pinning function will return 0 if successful, -ENOBUFS if the cookie
isn't backed by a cache, -EOPNOTSUPP if the cache doesn't support pinning,
-ENOSPC if there isn't enough space to honour the operation, -ENOMEM or
-EIO if there's any other problem.
(*) Data space reservation:
int fscache_reserve_space(struct fscache_cookie *cookie, loff_t size);
This permits a netfs to request cache space be reserved to store up to the
given amount of a file. It is permitted to ask for more than the current
size of the file to allow for future file expansion.
If size is given as zero then the reservation will be cancelled.
The function will return 0 if successful, -ENOBUFS if the cookie isn't
backed by a cache, -EOPNOTSUPP if the cache doesn't support reservations,
-ENOSPC if there isn't enough space to honour the operation, -ENOMEM or
-EIO if there's any other problem.
Note that this doesn't pin an object in a cache; it can still be culled to
make space if it's not in use.
=====================
COOKIE UNREGISTRATION
=====================
To get rid of a cookie, this function should be called:
void fscache_relinquish_cookie(struct fscache_cookie *cookie,
bool retire);
If retire is non-zero, then the object will be marked for recycling, and all
copies of it will be removed from all active caches in which it is present.
Not only that but all child objects will also be retired.
If retire is zero, then the object may be available again when next the
acquisition function is called. Retirement here will overrule the pinning on a
cookie.
One very important note - relinquish must NOT be called for a cookie unless all
the cookies for "child" indices, objects and pages have been relinquished
first.
==================
INDEX INVALIDATION
==================
There is no direct way to invalidate an index subtree. To do this, the caller
should relinquish and retire the cookie they have, and then acquire a new one.
======================
DATA FILE INVALIDATION
======================
Sometimes it will be necessary to invalidate an object that contains data.
Typically this will be necessary when the server tells the netfs of a foreign
change - at which point the netfs has to throw away all the state it had for an
inode and reload from the server.
To indicate that a cache object should be invalidated, the following function
can be called:
void fscache_invalidate(struct fscache_cookie *cookie);
This can be called with spinlocks held as it defers the work to a thread pool.
All extant storage, retrieval and attribute change ops at this point are
cancelled and discarded. Some future operations will be rejected until the
cache has had a chance to insert a barrier in the operations queue. After
that, operations will be queued again behind the invalidation operation.
The invalidation operation will perform an attribute change operation and an
auxiliary data update operation as it is very likely these will have changed.
Using the following function, the netfs can wait for the invalidation operation
to have reached a point at which it can start submitting ordinary operations
once again:
void fscache_wait_on_invalidate(struct fscache_cookie *cookie);
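
For instance, when the server reports a foreign change to a file, a netfs
might record the new coherency data, invalidate the cache object and wait
before issuing further cache I/O. A sketch using the hypothetical myfs_inode
from earlier:

	static void myfs_note_foreign_change(struct myfs_inode *inode,
					     uint64_t new_version)
	{
		inode->data_version = new_version;

		/* Discard the cached data for this file and wait until the
		 * cache is ready to accept ordinary operations again. */
		fscache_invalidate(inode->cache);
		fscache_wait_on_invalidate(inode->cache);
	}
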
===========================
FS-CACHE SPECIFIC PAGE FLAG
===========================
FS-Cache makes use of a page flag, PG_private_2, for its own purpose. This is
given the alternative name PG_fscache.
PG_fscache is used to indicate that the page is known by the cache, and that
the cache must be informed if the page is going to go away. It's an indication
to the netfs that the cache has an interest in this page, where an interest may
be a pointer to it, resources allocated or reserved for it, or I/O in progress
upon it.
The netfs can use this information in methods such as releasepage() to
determine whether it needs to uncache a page or update it.
Furthermore, if this bit is set, releasepage() and invalidatepage() operations
will be called on a page to get rid of it, even if PG_private is not set. This
allows caching to be attempted on a page before read_cache_pages() is called
after fscache_read_or_alloc_pages(), as the former will try to release pages
it was given under certain circumstances.
This bit does not overlap with bits such as PG_private. This means that FS-Cache
can be used with a filesystem that uses the block buffering code.
There are a number of operations defined on this flag:
int PageFsCache(struct page *page);
void SetPageFsCache(struct page *page)
void ClearPageFsCache(struct page *page)
int TestSetPageFsCache(struct page *page)
int TestClearPageFsCache(struct page *page)
These functions are bit test, bit set, bit clear, bit test and set and bit
test and clear operations on PG_fscache.

View file

@ -0,0 +1,320 @@
====================================================
IN-KERNEL CACHE OBJECT REPRESENTATION AND MANAGEMENT
====================================================
By: David Howells <dhowells@redhat.com>
Contents:
(*) Representation
(*) Object management state machine.
- Provision of cpu time.
- Locking simplification.
(*) The set of states.
(*) The set of events.
==============
REPRESENTATION
==============
FS-Cache maintains an in-kernel representation of each object that a netfs is
currently interested in. Such objects are represented by the fscache_cookie
struct and are referred to as cookies.
FS-Cache also maintains a separate in-kernel representation of the objects that
a cache backend is currently actively caching. Such objects are represented by
the fscache_object struct. The cache backends allocate these upon request, and
are expected to embed them in their own representations. These are referred to
as objects.
There is a 1:N relationship between cookies and objects. A cookie may be
represented by multiple objects - an index may exist in more than one cache -
or even by no objects (it may not be cached).
Furthermore, both cookies and objects are hierarchical. The two hierarchies
correspond, but the cookies tree is a superset of the union of the object trees
of multiple caches:
NETFS INDEX TREE : CACHE 1 : CACHE 2
: :
: +-----------+ :
+----------->| IObject | :
+-----------+ | : +-----------+ :
| ICookie |-------+ : | :
+-----------+ | : | : +-----------+
| +------------------------------>| IObject |
| : | : +-----------+
| : V : |
| : +-----------+ : |
V +----------->| IObject | : |
+-----------+ | : +-----------+ : |
| ICookie |-------+ : | : V
+-----------+ | : | : +-----------+
| +------------------------------>| IObject |
+-----+-----+ : | : +-----------+
| | : | : |
V | : V : |
+-----------+ | : +-----------+ : |
| ICookie |------------------------->| IObject | : |
+-----------+ | : +-----------+ : |
| V : | : V
| +-----------+ : | : +-----------+
| | ICookie |-------------------------------->| IObject |
| +-----------+ : | : +-----------+
V | : V : |
+-----------+ | : +-----------+ : |
| DCookie |------------------------->| DObject | : |
+-----------+ | : +-----------+ : |
| : : |
+-------+-------+ : : |
| | : : |
V V : : V
+-----------+ +-----------+ : : +-----------+
| DCookie | | DCookie |------------------------>| DObject |
+-----------+ +-----------+ : : +-----------+
: :
In the above illustration, ICookie and IObject represent indices and DCookie
and DObject represent data storage objects. Indices may have representation in
multiple caches, but currently, non-index objects may not. Objects of any type
may also be entirely unrepresented.
As far as the netfs API goes, the netfs is only actually permitted to see
pointers to the cookies. The cookies themselves and any objects attached to
those cookies are hidden from it.
===============================
OBJECT MANAGEMENT STATE MACHINE
===============================
Within FS-Cache, each active object is managed by its own individual state
machine. The state for an object is kept in the fscache_object struct, in
object->state. A cookie may point to a set of objects that are in different
states.
Each state has an action associated with it that is invoked when the machine
wakes up in that state. There are four logical sets of states:
(1) Preparation: states that wait for the parent objects to become ready. The
representations are hierarchical, and it is expected that an object must
be created or accessed with respect to its parent object.
(2) Initialisation: states that perform lookups in the cache and validate
what's found and that create on disk any missing metadata.
(3) Normal running: states that allow netfs operations on objects to proceed
and that update the state of objects.
(4) Termination: states that detach objects from their netfs cookies, that
delete objects from disk, that handle disk and system errors and that free
up in-memory resources.
In most cases, transitioning between states is in response to signalled events.
When a state has finished processing, it will usually set the mask of events in
which it is interested (object->event_mask) and relinquish the worker thread.
Then when an event is raised (by calling fscache_raise_event()), if the event
is not masked, the object will be queued for processing (by calling
fscache_enqueue_object()).
PROVISION OF CPU TIME
---------------------
The work to be done by the various states was given CPU time by the threads of
the slow work facility. This was used in preference to the workqueue facility
because:
(1) Threads may be completely occupied for very long periods of time by a
particular work item. These state actions may be doing sequences of
synchronous, journalled disk accesses (lookup, mkdir, create, setxattr,
getxattr, truncate, unlink, rmdir, rename).
(2) Threads may do little actual work, but may rather spend a lot of time
sleeping on I/O. This means that single-threaded and 1-per-CPU-threaded
workqueues don't necessarily have the right numbers of threads.
LOCKING SIMPLIFICATION
----------------------
Because only one worker thread may be operating on any particular object's
state machine at once, this simplifies the locking, particularly with respect
to disconnecting the netfs's representation of a cache object (fscache_cookie)
from the cache backend's representation (fscache_object) - which may be
requested from either end.
=================
THE SET OF STATES
=================
The object state machine has a set of states that it can be in. There are
preparation states in which the object sets itself up and waits for its parent
object to transit to a state that allows access to its children:
(1) State FSCACHE_OBJECT_INIT.
Initialise the object and wait for the parent object to become active. In
the cache, it is expected that it will not be possible to look an object
up from the parent object, until that parent object itself has been looked
up.
There are initialisation states in which the object sets itself up and accesses
disk for the object metadata:
(2) State FSCACHE_OBJECT_LOOKING_UP.
Look up the object on disk, using the parent as a starting point.
FS-Cache expects the cache backend to probe the cache to see whether this
object is represented there, and if it is, to see if it's valid (coherency
management).
The cache should call fscache_object_lookup_negative() to indicate lookup
failure for whatever reason, and should call fscache_obtained_object() to
indicate success.
At the completion of lookup, FS-Cache will let the netfs go ahead with
read operations, no matter whether the file is yet cached. If not yet
cached, read operations will be immediately rejected with ENODATA until
the first known page is uncached - as up to that point there can be no data
to be read out of the cache for that file that isn't currently also held
in the pagecache.
(3) State FSCACHE_OBJECT_CREATING.
Create an object on disk, using the parent as a starting point. This
happens if the lookup failed to find the object, or if the object's
coherency data indicated what's on disk is out of date. In this state,
FS-Cache expects the cache to create the object on disk.
The cache should call fscache_obtained_object() if creation completes
successfully, fscache_object_lookup_negative() otherwise.
At the completion of creation, FS-Cache will start processing write
operations the netfs has queued for an object. If creation failed, the
write ops will be transparently discarded, and nothing recorded in the
cache.
There are some normal running states in which the object spends its time
servicing netfs requests:
(4) State FSCACHE_OBJECT_AVAILABLE.
A transient state in which pending operations are started, child objects
are permitted to advance from FSCACHE_OBJECT_INIT state, and temporary
lookup data is freed.
(5) State FSCACHE_OBJECT_ACTIVE.
The normal running state. In this state, requests the netfs makes will be
passed on to the cache.
(6) State FSCACHE_OBJECT_INVALIDATING.
The object is undergoing invalidation. When the state comes here, it
discards all pending read, write and attribute change operations as it is
going to clear out the cache entirely and reinitialise it. It will then
continue to the FSCACHE_OBJECT_UPDATING state.
(7) State FSCACHE_OBJECT_UPDATING.
The state machine comes here to update the object in the cache from the
netfs's records. This involves updating the auxiliary data that is used
to maintain coherency.
And there are terminal states in which an object cleans itself up, deallocates
memory and potentially deletes stuff from disk:
(8) State FSCACHE_OBJECT_LC_DYING.
The object comes here if it is dying because of a lookup or creation
error. This would be due to a disk error or system error of some sort.
Temporary data is cleaned up, and the parent is released.
(9) State FSCACHE_OBJECT_DYING.
The object comes here if it is dying due to an error, because its parent
cookie has been relinquished by the netfs or because the cache is being
withdrawn.
Any child objects waiting on this one are given CPU time so that they too
can destroy themselves. This object waits for all its children to go away
before advancing to the next state.
(10) State FSCACHE_OBJECT_ABORT_INIT.
The object comes to this state if it was waiting on its parent in
FSCACHE_OBJECT_INIT, but its parent died. The object will destroy itself
so that the parent may proceed from the FSCACHE_OBJECT_DYING state.
(11) State FSCACHE_OBJECT_RELEASING.
(12) State FSCACHE_OBJECT_RECYCLING.
The object comes to one of these two states when dying once it is rid of
all its children, if it is dying because the netfs relinquished its
cookie. In the first state, the cached data is expected to persist, and
in the second it will be deleted.
(13) State FSCACHE_OBJECT_WITHDRAWING.
The object transits to this state if the cache decides it wants to
withdraw the object from service, perhaps to make space, but also due to
error or just because the whole cache is being withdrawn.
(14) State FSCACHE_OBJECT_DEAD.
The object transits to this state when the in-memory object record is
ready to be deleted. The object processor shouldn't ever see an object in
this state.
THE SET OF EVENTS
-----------------
There are a number of events that can be raised to an object state machine:
(*) FSCACHE_OBJECT_EV_UPDATE
The netfs requested that an object be updated. The state machine will ask
the cache backend to update the object, and the cache backend will ask the
netfs for details of the change through its cookie definition ops.
(*) FSCACHE_OBJECT_EV_CLEARED
This is signalled in two circumstances:
(a) when an object's last child object is dropped and
(b) when the last operation outstanding on an object is completed.
This is used to proceed from the dying state.
(*) FSCACHE_OBJECT_EV_ERROR
This is signalled when an I/O error occurs during the processing of some
object.
(*) FSCACHE_OBJECT_EV_RELEASE
(*) FSCACHE_OBJECT_EV_RETIRE
These are signalled when the netfs relinquishes a cookie it was using.
The event selected depends on whether the netfs asks for the backing
object to be retired (deleted) or retained.
(*) FSCACHE_OBJECT_EV_WITHDRAW
This is signalled when the cache backend wants to withdraw an object.
This means that the object will have to be detached from the netfs's
cookie.
Because the withdrawing, releasing and retiring events are all handled by the
object state machine, it doesn't matter if there's a collision with both ends
trying to sever the connection at the same time. The state machine can just
pick which one it wants to honour, and dealing with that one covers the other.

View file

@ -0,0 +1,213 @@
================================
ASYNCHRONOUS OPERATIONS HANDLING
================================
By: David Howells <dhowells@redhat.com>
Contents:
(*) Overview.
(*) Operation record initialisation.
(*) Parameters.
(*) Procedure.
(*) Asynchronous callback.
========
OVERVIEW
========
FS-Cache has an asynchronous operations handling facility that it uses for its
data storage and retrieval routines. Its operations are represented by
fscache_operation structs, though these are usually embedded into some other
structure.
This facility is available to and expected to be used by the cache backends,
and FS-Cache will create operations and pass them off to the appropriate cache
backend for completion.
To make use of this facility, <linux/fscache-cache.h> should be #included.
===============================
OPERATION RECORD INITIALISATION
===============================
An operation is recorded in an fscache_operation struct:
struct fscache_operation {
union {
struct work_struct fast_work;
struct slow_work slow_work;
};
unsigned long flags;
fscache_operation_processor_t processor;
...
};
Someone wanting to issue an operation should allocate something with this
struct embedded in it. They should initialise it by calling:
void fscache_operation_init(struct fscache_operation *op,
fscache_operation_release_t release);
with the operation to be initialised and the release function to use.
The op->flags parameter should be set to indicate the CPU time provision and
the exclusivity (see the Parameters section).
The op->fast_work, op->slow_work and op->processor fields should be set as
appropriate for the CPU time provision (see the Parameters section).
FSCACHE_OP_WAITING may be set in op->flags prior to each submission of the
operation and waited for afterwards.
==========
PARAMETERS
==========
There are a number of parameters that can be set in the operation record's flag
parameter. There are three options for the provision of CPU time in these
operations:
(1) The operation may be done synchronously (FSCACHE_OP_MYTHREAD). A thread
may decide it wants to handle an operation itself without deferring it to
another thread.
This is, for example, used in read operations for calling readpages() on
the backing filesystem in CacheFiles. Although readpages() does an
asynchronous data fetch, the determination of whether pages exist is done
synchronously - and the netfs does not proceed until this has been
determined.
If this option is to be used, FSCACHE_OP_WAITING must be set in op->flags
before submitting the operation, and the operating thread must wait for it
to be cleared before proceeding:
wait_on_bit(&op->flags, FSCACHE_OP_WAITING,
TASK_UNINTERRUPTIBLE);
(2) The operation may be fast asynchronous (FSCACHE_OP_FAST), in which case it
will be given to keventd to process. Such an operation is not permitted
to sleep on I/O.
This is, for example, used by CacheFiles to copy data from a backing fs
page to a netfs page after the backing fs has read the page in.
If this option is used, op->fast_work and op->processor must be
initialised before submitting the operation:
INIT_WORK(&op->fast_work, do_some_work);
(3) The operation may be slow asynchronous (FSCACHE_OP_SLOW), in which case it
will be given to the slow work facility to process. Such an operation is
permitted to sleep on I/O.
This is, for example, used by FS-Cache to handle background writes of
pages that have just been fetched from a remote server.
If this option is used, op->slow_work and op->processor must be
initialised before submitting the operation:
fscache_operation_init_slow(op, processor)
Furthermore, operations may be one of two types:
(1) Exclusive (FSCACHE_OP_EXCLUSIVE). Operations of this type may not run in
conjunction with any other operation on the object being operated upon.
An example of this is the attribute change operation, in which the file
being written to may need truncation.
(2) Shareable. Operations of this type may be running simultaneously. It's
up to the operation implementation to prevent interference between operations
running at the same time.
=========
PROCEDURE
=========
Operations are used through the following procedure:
(1) The submitting thread must allocate the operation and initialise it
itself. Normally this would be part of a more specific structure with the
generic op embedded within.
(2) The submitting thread must then submit the operation for processing using
one of the following two functions:
int fscache_submit_op(struct fscache_object *object,
struct fscache_operation *op);
int fscache_submit_exclusive_op(struct fscache_object *object,
struct fscache_operation *op);
The first function should be used to submit non-exclusive ops and the
second to submit exclusive ones. The caller must still set the
FSCACHE_OP_EXCLUSIVE flag.
If successful, both functions will assign the operation to the specified
object and return 0. -ENOBUFS will be returned if the object specified is
permanently unavailable.
The operation manager will defer operations on an object that is still
undergoing lookup or creation. The operation will also be deferred if an
operation of conflicting exclusivity is in progress on the object.
If the operation is asynchronous, the manager will retain a reference to
it, so the caller should put their reference to it by passing it to:
void fscache_put_operation(struct fscache_operation *op);
(3) If the submitting thread wants to do the work itself, and has marked the
operation with FSCACHE_OP_MYTHREAD, then it should monitor
FSCACHE_OP_WAITING as described above and check the state of the object if
necessary (the object might have died whilst the thread was waiting).
When it has finished doing its processing, it should call
fscache_op_complete() and fscache_put_operation() on it.
(4) The operation holds an effective lock upon the object, preventing other
exclusive ops conflicting until it is released. The operation can be
enqueued for further immediate asynchronous processing by adjusting the
CPU time provisioning option if necessary, eg:
op->flags &= ~FSCACHE_OP_TYPE;
op->flags |= FSCACHE_OP_FAST;
and calling:
void fscache_enqueue_operation(struct fscache_operation *op)
This can be used to allow other things to have use of the worker thread
pools.
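
As a rough sketch of steps (1) and (2), a cache backend might embed the
generic record in its own storage operation and submit it like this. The
mycache names are invented and the error path is simplified; only the
fscache_* calls and flags come from this document:

	#include <linux/slab.h>
	#include <linux/fscache-cache.h>

	struct mycache_storage_op {
		struct fscache_operation op;
		/* backend-specific fields would go here */
	};

	static void mycache_storage_release(struct fscache_operation *_op)
	{
		kfree(container_of(_op, struct mycache_storage_op, op));
	}

	static void mycache_store_processor(struct fscache_operation *_op)
	{
		/* ... copy the data to the backing store here ... */
	}

	static int mycache_start_storage(struct fscache_object *object)
	{
		struct mycache_storage_op *op;
		int ret;

		op = kzalloc(sizeof(*op), GFP_KERNEL);
		if (!op)
			return -ENOMEM;

		/* Step (1): initialise the embedded generic operation for
		 * slow asynchronous processing (see Parameters above). */
		fscache_operation_init(&op->op, mycache_storage_release);
		op->op.flags = FSCACHE_OP_SLOW;
		fscache_operation_init_slow(&op->op, mycache_store_processor);

		/* Step (2): submit it; on success the manager keeps its own
		 * reference for asynchronous processing, so drop ours. */
		ret = fscache_submit_op(object, &op->op);
		if (ret < 0) {
			/* Never accepted, so free it ourselves. */
			mycache_storage_release(&op->op);
			return ret;
		}

		fscache_put_operation(&op->op);
		return 0;
	}
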
=====================
ASYNCHRONOUS CALLBACK
=====================
When used in asynchronous mode, the worker thread pool will invoke the
processor method with a pointer to the operation. This should then get at the
container struct by using container_of():
static void fscache_write_op(struct fscache_operation *_op)
{
struct fscache_storage *op =
container_of(_op, struct fscache_storage, op);
...
}
The caller holds a reference on the operation, and will invoke
fscache_put_operation() when the processor function returns. The processor
function is at liberty to call fscache_enqueue_operation() or to take extra
references.

View file

@ -0,0 +1,148 @@
Ceph Distributed File System
============================
Ceph is a distributed network file system designed to provide good
performance, reliability, and scalability.
Basic features include:
* POSIX semantics
* Seamless scaling from 1 to many thousands of nodes
* High availability and reliability. No single point of failure.
* N-way replication of data across storage nodes
* Fast recovery from node failures
* Automatic rebalancing of data on node addition/removal
* Easy deployment: most FS components are userspace daemons
Also,
* Flexible snapshots (on any directory)
* Recursive accounting (nested files, directories, bytes)
In contrast to cluster filesystems like GFS, OCFS2, and GPFS that rely
on symmetric access by all clients to shared block devices, Ceph
separates data and metadata management into independent server
clusters, similar to Lustre. Unlike Lustre, however, metadata and
storage nodes run entirely as user space daemons. Storage nodes
utilize btrfs to store data objects, leveraging its advanced features
(checksumming, metadata replication, etc.). File data is striped
across storage nodes in large chunks to distribute workload and
facilitate high throughputs. When storage nodes fail, data is
re-replicated in a distributed fashion by the storage nodes themselves
(with some minimal coordination from a cluster monitor), making the
system extremely efficient and scalable.
Metadata servers effectively form a large, consistent, distributed
in-memory cache above the file namespace that is extremely scalable,
dynamically redistributes metadata in response to workload changes,
and can tolerate arbitrary (well, non-Byzantine) node failures. The
metadata server takes a somewhat unconventional approach to metadata
storage to significantly improve performance for common workloads. In
particular, inodes with only a single link are embedded in
directories, allowing entire directories of dentries and inodes to be
loaded into its cache with a single I/O operation. The contents of
extremely large directories can be fragmented and managed by
independent metadata servers, allowing scalable concurrent access.
The system offers automatic data rebalancing/migration when scaling
from a small cluster of just a few nodes to many hundreds, without
requiring an administrator to carve the data set into static volumes or
go through the tedious process of migrating data between servers.
When the file system approaches full, new nodes can be easily added
and things will "just work."
Ceph includes a flexible snapshot mechanism that allows a user to create
a snapshot on any subdirectory (and its nested contents) in the
system. Snapshot creation and deletion are as simple as 'mkdir
.snap/foo' and 'rmdir .snap/foo'.
Ceph also provides some recursive accounting on directories for nested
files and bytes. That is, a 'getfattr -d foo' on any directory in the
system will reveal the total number of nested regular files and
subdirectories, and a summation of all nested file sizes. This makes
the identification of large disk space consumers relatively quick, as
no 'du' or similar recursive scan of the file system is required.
Mount Syntax
============
The basic mount syntax is:
# mount -t ceph monip[:port][,monip2[:port]...]:/[subdir] mnt
You only need to specify a single monitor, as the client will get the
full list when it connects. (However, if the monitor you specify
happens to be down, the mount won't succeed.) The port can be left
off if the monitor is using the default. So if the monitor is at
1.2.3.4,
# mount -t ceph 1.2.3.4:/ /mnt/ceph
is sufficient. If /sbin/mount.ceph is installed, a hostname can be
used instead of an IP address.
Mount Options
=============
ip=A.B.C.D[:N]
Specify the IP and/or port the client should bind to locally.
There is normally not much reason to do this. If the IP is not
specified, the client's IP address is determined by looking at the
address its connection to the monitor originates from.
wsize=X
Specify the maximum write size in bytes. By default there is no
maximum. Ceph will normally size writes based on the file stripe
size.
rsize=X
Specify the maximum readahead.
mount_timeout=X
Specify the timeout value for mount (in seconds), in the case
of a non-responsive Ceph file system. The default is 30
seconds.
rbytes
When stat() is called on a directory, set st_size to 'rbytes',
the summation of file sizes over all files nested beneath that
directory. This is the default.
norbytes
When stat() is called on a directory, set st_size to the
number of entries in that directory.
nocrc
Disable CRC32C calculation for data writes. If set, the storage node
must rely on TCP's error correction to detect data corruption
in the data payload.
dcache
Use the dcache contents to perform negative lookups and
readdir when the client has the entire directory contents in
its cache. (This does not change correctness; the client uses
cached metadata only when a lease or capability ensures it is
valid.)
nodcache
Do not use the dcache as above. This avoids a significant amount of
complex code, sacrificing performance without affecting correctness,
and is useful for tracking down bugs.
noasyncreaddir
Do not use the dcache as above for readdir.
More Information
================
For more information on Ceph, see the home page at
http://ceph.newdream.net/
The Linux kernel client source tree is available at
git://ceph.newdream.net/git/ceph-client.git
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git
and the source for the full system is at
git://ceph.newdream.net/git/ceph.git

View file

@ -0,0 +1,57 @@
Original Author
===============
Steve French (sfrench@samba.org)
The author wishes to express his appreciation and thanks to:
Andrew Tridgell (Samba team) for his early suggestions about smb/cifs VFS
improvements. Thanks to IBM for allowing me time and test resources to pursue
this project, to Jim McDonough from IBM (and the Samba Team) for his help, to
the IBM Linux JFS team for explaining many esoteric Linux filesystem features.
Jeremy Allison of the Samba team has done invaluable work in adding the server
side of the original CIFS Unix extensions and reviewing and implementing
portions of the newer CIFS POSIX extensions into the Samba 3 file server. Thanks to
Dave Boutcher of IBM Rochester (author of the OS/400 smb/cifs filesystem client)
for proving years ago that very good smb/cifs clients could be done on Unix-like
operating systems. Volker Lendecke, Andrew Tridgell, Urban Widmark, John
Newbigin and others for their work on the Linux smbfs module. Thanks to
the other members of the Storage Network Industry Association CIFS Technical
Workgroup for their work specifying this highly complex protocol and finally
thanks to the Samba team for their technical advice and encouragement.
Patch Contributors
------------------
Zwane Mwaikambo
Andi Kleen
Amrut Joshi
Shobhit Dayal
Sergey Vlasov
Richard Hughes
Yury Umanets
Mark Hamzy (for some of the early cifs IPv6 work)
Domen Puncer
Jesper Juhl (in particular for lots of whitespace/formatting cleanup)
Vince Negri and Dave Stahl (for finding an important caching bug)
Adrian Bunk (kcalloc cleanups)
Miklos Szeredi
Kazeon team for various fixes especially for 2.4 version.
Asser Ferno (Change Notify support)
Shaggy (Dave Kleikamp) for innumerable small fs suggestions and some good cleanup
Gunter Kukkukk (testing and suggestions for support of old servers)
Igor Mammedov (DFS support)
Jeff Layton (many, many fixes, as well as great work on the cifs Kerberos code)
Scott Lovenberg
Pavel Shilovsky (for great work adding SMB2 support, and various SMB3 features)
Test case and Bug Report contributors
-------------------------------------
Thanks to those in the community who have submitted detailed bug reports
and debug of problems they have found: Jochen Dolze, David Blaine,
Rene Scharfe, Martin Josefsson, Alexander Wild, Anthony Liguori,
Lars Muller, Urban Widmark, Massimiliano Ferrero, Howard Owen,
Olaf Kirch, Kieron Briggs, Nick Millington and others. Also special
mention to the Stanford Checker (SWAT) which pointed out many minor
bugs in error paths. Valuable suggestions also have come from Al Viro
and Dave Miller.
And thanks to the IBM LTC and Power test teams and SuSE testers for
finding multiple bugs during excellent stress test runs.

File diff suppressed because it is too large

View file

@ -0,0 +1,753 @@
The CIFS VFS support for Linux supports many advanced network filesystem
features such as hierarchical dfs like namespace, hardlinks, locking and more.
It was designed to comply with the SNIA CIFS Technical Reference (which
supersedes the 1992 X/Open SMB Standard) and to provide best-practice,
practical interoperability with Windows 2000, Windows XP, Samba and equivalent
servers. This code was developed in participation with the Protocol Freedom
Information Foundation.
Please see
http://protocolfreedom.org/ and
http://samba.org/samba/PFIF/
for more details.
For questions or bug reports please contact:
sfrench@samba.org (sfrench@us.ibm.com)
Build instructions:
==================
For Linux 2.4:
1) Get the kernel source (e.g. from http://www.kernel.org)
and download the cifs vfs source (see the project page
at http://us1.samba.org/samba/Linux_CIFS_client.html),
then change directory into the top of the kernel directory
and patch the kernel (e.g. "patch -p1 < cifs_24.patch")
to add the cifs vfs to your kernel configure options, if
it has not already been added (e.g. current SuSE and UL
users do not need to apply the cifs_24.patch since the cifs vfs is
already in the kernel configure menu). Then
mkdir linux/fs/cifs and copy the current cifs vfs files from
the cifs download to your kernel build directory, e.g.
cp <cifs_download_dir>/fs/cifs/* <kernel_download_dir>/fs/cifs
2) make menuconfig (or make xconfig)
3) select cifs from within the network filesystem choices
4) save and exit
5) make dep
6) make modules (or "make" if CIFS VFS not to be built as a module)
For Linux 2.6:
1) Download the kernel (e.g. from http://www.kernel.org)
and change directory into the top of the kernel directory tree
(e.g. /usr/src/linux-2.5.73)
2) make menuconfig (or make xconfig)
3) select cifs from within the network filesystem choices
4) save and exit
5) make
Installation instructions:
=========================
If you have built the CIFS vfs as a module (successfully), simply
type "make modules_install" (or if you prefer, manually copy the file to
the modules directory e.g. /lib/modules/2.4.10-4GB/kernel/fs/cifs/cifs.o).
If you have built the CIFS vfs into the kernel itself, follow the instructions
for your distribution on how to install a new kernel (usually you
would simply type "make install").
If you do not have the utility mount.cifs (in the Samba 3.0 source tree and on
the CIFS VFS web site) copy it to the same directory in which mount.smbfs and
similar files reside (usually /sbin). Although the helper software is not
required, mount.cifs is recommended. Eventually the Samba 3.0 utility program
"net" may also be helpful since it may someday provide easier mount syntax for
users who are used to Windows e.g.
net use <mount point> <UNC name or cifs URL>
Note that running the Winbind pam/nss module (logon service) on all of your
Linux clients is useful in mapping Uids and Gids consistently across the
domain to the proper network user. The mount.cifs mount helper can be
trivially built from Samba 3.0 or later source e.g. by executing:
gcc samba/source/client/mount.cifs.c -o mount.cifs
If cifs is built as a module, then the size and number of network buffers
and maximum number of simultaneous requests to one server can be configured.
Changing these from their defaults is not recommended. By executing modinfo
modinfo kernel/fs/cifs/cifs.ko
on kernel/fs/cifs/cifs.ko the list of configuration changes that can be made
at module initialization time (by running insmod cifs.ko) can be seen.
Allowing User Mounts
====================
To permit users to mount and unmount over directories they own is possible
with the cifs vfs. A way to enable such mounting is to mark the mount.cifs
utility as suid (e.g. "chmod +s /sbin/mount.cifs"). To enable users to
umount shares they mount requires
1) mount.cifs version 1.4 or later
2) an entry for the share in /etc/fstab indicating that a user may
unmount it e.g.
//server/usersharename /mnt/username cifs user 0 0
Note that when the mount.cifs utility is run suid (allowing user mounts),
in order to reduce risks, the "nosuid" mount flag is passed in on mount to
disallow execution of an suid program mounted on the remote target.
When mount is executed as root, nosuid is not passed in by default,
and execution of suid programs on the remote target would be enabled
by default. This can be changed, as with nfs and other filesystems,
by simply specifying "nosuid" among the mount options. For user mounts
though to be able to pass the suid flag to mount requires rebuilding
mount.cifs with the following flag:
gcc samba/source/client/mount.cifs.c -DCIFS_ALLOW_USR_SUID -o mount.cifs
There is a corresponding manual page for cifs mounting in the Samba 3.0 and
later source tree in docs/manpages/mount.cifs.8
Allowing User Unmounts
======================
To permit users to unmount directories that they have user mounted (see above),
the utility umount.cifs may be used. It may be invoked directly, or if
umount.cifs is placed in /sbin, umount can invoke the cifs umount helper
(at least for most versions of the umount utility) for umount of cifs
mounts, unless umount is invoked with -i (which will avoid invoking a umount
helper). As with mount.cifs, to enable user unmounts umount.cifs must be marked
as suid (e.g. "chmod +s /sbin/umount.cifs") or equivalent (some distributions
allow adding entries to the /etc/permissions file to achieve the
equivalent suid effect). For this utility to succeed the target path
must be a cifs mount, and the uid of the current user must match the uid
of the user who mounted the resource.
Also note that the customary way of allowing user mounts and unmounts is
(instead of using mount.cifs and umount.cifs as suid) to add a line
to the file /etc/fstab for each //server/share you wish to mount, but
this can become unwieldy when potential mount targets include many
or unpredictable UNC names.
Samba Considerations
====================
To get the maximum benefit from the CIFS VFS, we recommend using a server that
supports the SNIA CIFS Unix Extensions standard (e.g. Samba 2.2.5 or later or
Samba 3.0) but the CIFS vfs works fine with a wide variety of CIFS servers.
Note that uid, gid and file permissions will display default values if you do
not have a server that supports the Unix extensions for CIFS (such as Samba
2.2.5 or later). To enable the Unix CIFS Extensions in the Samba server, add
the line:
unix extensions = yes
to your smb.conf file on the server. Note that the following smb.conf settings
are also useful (on the Samba server) when the majority of clients are Unix or
Linux:
case sensitive = yes
delete readonly = yes
ea support = yes
Note that server ea support is required for supporting xattrs from the Linux
cifs client, and that EA support is present in later versions of Samba (e.g.
3.0.6 and later). EA support also works in all versions of Windows, at least to
shares on NTFS filesystems. Extended Attribute (xattr) support is an optional
feature of most Linux filesystems which may require enabling via
make menuconfig. Client support for extended attributes (user xattr) can be
disabled on a per-mount basis by specifying "nouser_xattr" on mount.
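For example, once a share is mounted, user xattrs can be exercised with the
standard attr tools (the mount point and attribute name below are only
illustrative):
setfattr -n user.origin -v "archive7" /mnt/cifs/somefile
getfattr -n user.origin /mnt/cifs/somefile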
The CIFS client can get and set POSIX ACLs (getfacl, setfacl) to Samba servers
version 3.10 and later. Setting POSIX ACLs requires enabling both XATTR and
then POSIX support in the CIFS configuration options when building the cifs
module. POSIX ACL support can be disabled on a per-mount basis by specifying
"noacl" on mount.
Some administrators may want to change Samba's smb.conf "map archive" and
"create mask" parameters from the default. Unless the create mask is changed
newly created files can end up with an unnecessarily restrictive default mode,
which may not be what you want, although if the CIFS Unix extensions are
enabled on the server and client, subsequent setattr calls (e.g. chmod) can
fix the mode. Note that creating special devices (mknod) remotely
may require specifying a mkdev function to Samba if you are not using
Samba 3.0.6 or later. For more information on these see the manual pages
("man smb.conf") on the Samba server system. Note that the cifs vfs,
unlike the smbfs vfs, does not read the smb.conf on the client system
(the few optional settings are passed in on mount via -o parameters instead).
Note that Samba 2.2.7 or later includes a fix that allows the CIFS VFS to delete
open files (required for strict POSIX compliance). Windows Servers already
supported this feature. Samba server does not allow symlinks that refer to files
outside of the share, so in Samba versions prior to 3.0.6, most symlinks to
files with absolute paths (ie beginning with slash) such as:
ln -s /mnt/foo bar
would be forbidden. Samba 3.0.6 server or later includes the ability to create
such symlinks safely by converting unsafe symlinks (ie symlinks to server
files that are outside of the share) to a samba specific format on the server
that is ignored by local server applications and non-cifs clients and that will
not be traversed by the Samba server. This is opaque to the Linux client
application using the cifs vfs. Absolute symlinks will work to Samba 3.0.5 or
later, but only for remote clients using the CIFS Unix extensions, and will
be invisible to Windows clients and typically will not affect local
applications running on the same server as Samba.
Use instructions:
================
Once the CIFS VFS support is built into the kernel or installed as a module
(cifs.o), you can use mount syntax like the following to access Samba or Windows
servers:
mount -t cifs //9.53.216.11/e$ /mnt -o user=myname,pass=mypassword
Before -o the option -v may be specified to make the mount.cifs
mount helper display the mount steps more verbosely.
After -o the following commonly used cifs vfs specific options
are supported:
user=<username>
pass=<password>
domain=<domain name>
Other cifs mount options are described below. Use of TCP names (in addition to
ip addresses) is available if the mount helper (mount.cifs) is installed. If
you do not trust the server to which you are mounted, or if you do not have
cifs signing enabled (and the physical network is insecure), consider use
of the standard mount options "noexec" and "nosuid" to reduce the risk of
running an altered binary on your local system (downloaded from a hostile server
or altered by a hostile router).
Although mounting using format corresponding to the CIFS URL specification is
not possible in mount.cifs yet, it is possible to use an alternate format
for the server and sharename (which is somewhat similar to NFS style mount
syntax) instead of the more widely used UNC format (i.e. \\server\share):
mount -t cifs tcp_name_of_server:share_name /mnt -o user=myname,pass=mypasswd
When using the mount helper mount.cifs, passwords may be specified via alternate
mechanisms, instead of specifying it after -o using the normal "pass=" syntax
on the command line:
1) By including it in a credential file. Specify credentials=filename as one
of the mount options. Credential files contain two lines
username=someuser
password=your_password
2) By specifying the password in the PASSWD environment variable (similarly
the user name can be taken from the USER environment variable).
3) By specifying the password in a file by name via PASSWD_FILE
4) By specifying the password in a file by file descriptor via PASSWD_FD
If no password is provided, mount.cifs will prompt for password entry; an
example of the credential file approach follows.
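A minimal credential file workflow might look like the following (the file
location and share name are only examples); keeping the file readable by root
only is advisable since it holds a cleartext password:
echo "username=someuser" > /root/.smbcredentials
echo "password=your_password" >> /root/.smbcredentials
chmod 600 /root/.smbcredentials
mount -t cifs //server/share /mnt -o credentials=/root/.smbcredentials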
Restrictions
============
Servers must support either "pure-TCP" (port 445 TCP/IP CIFS connections) or RFC
1001/1002 support for "Netbios-Over-TCP/IP." This is not likely to be a
problem as most servers support this.
Valid filenames differ between Windows and Linux. Windows typically restricts
filenames which contain certain reserved characters (e.g. the character ':'
which is used to delimit the beginning of a stream name by Windows), while
Linux allows a slightly wider set of valid characters in filenames. Windows
servers can remap such characters when an explicit mapping is specified in
the Server's registry. Samba starting with version 3.10 will allow such
filenames (ie those which contain valid Linux characters, which normally
would be forbidden for Windows/CIFS semantics) as long as the server is
configured for Unix Extensions (and the client has not disabled
/proc/fs/cifs/LinuxExtensionsEnabled).
CIFS VFS Mount Options
======================
A partial list of the supported mount options follows:
user The user name to use when trying to establish
the CIFS session.
password The user password. If the mount helper is
installed, the user will be prompted for password
if not supplied.
ip The ip address of the target server
unc The target server Universal Network Name (export) to
mount.
domain Set the SMB/CIFS workgroup name prepended to the
username during CIFS session establishment
forceuid Set the default uid for inodes to the uid
passed in on mount. For mounts to servers
which do support the CIFS Unix extensions, such as a
properly configured Samba server, the server provides
the uid, gid and mode so this parameter should not be
specified unless the server and clients uid and gid
numbering differ. If the server and client are in the
same domain (e.g. running winbind or nss_ldap) and
the server supports the Unix Extensions then the uid
and gid can be retrieved from the server (and uid
and gid would not have to be specified on the mount).
For servers which do not support the CIFS Unix
extensions, the default uid (and gid) returned on lookup
of existing files will be the uid (gid) of the person
who executed the mount (root, except when mount.cifs
is configured setuid for user mounts) unless the "uid="
(gid) mount option is specified. Also note that permission
checks (authorization checks) on accesses to a file occur
at the server, but there are cases in which an administrator
may want to restrict at the client as well. For those
servers which do not report a uid/gid owner
(such as Windows), permissions can also be checked at the
client, and a crude form of client side permission checking
can be enabled by specifying file_mode and dir_mode on
the client. (default)
forcegid (similar to above but for the groupid instead of uid) (default)
noforceuid Fill in file owner information (uid) by requesting it from
the server if possible. With this option, the value given in
the uid= option (on mount) will only be used if the server
can not support returning uids on inodes.
noforcegid (similar to above but for the group owner, gid, instead of uid)
uid Set the default uid for inodes, and indicate to the
cifs kernel driver which local user mounted. If the server
supports the unix extensions the default uid is
not used to fill in the owner fields of inodes (files)
unless the "forceuid" parameter is specified.
gid Set the default gid for inodes (similar to above).
file_mode If CIFS Unix extensions are not supported by the server
this overrides the default mode for file inodes.
fsc Enable local disk caching using FS-Cache (off by default). This
option could be useful to improve performance on a slow link,
heavily loaded server and/or network where reading from the
disk is faster than reading from the server (over the network).
This could also impact scalability positively as the
number of calls to the server are reduced. However, local
caching is not suitable for all workloads for e.g. read-once
type workloads. So, you need to consider carefully your
workload/scenario before using this option. Currently, local
disk caching is functional for CIFS files opened as read-only.
dir_mode If CIFS Unix extensions are not supported by the server
this overrides the default mode for directory inodes.
port attempt to contact the server on this tcp port, before
trying the usual ports (port 445, then 139).
iocharset Codepage used to convert local path names to and from
Unicode. Unicode is used by default for network path
names if the server supports it. If iocharset is
not specified then the nls_default specified
during the local client kernel build will be used.
If server does not support Unicode, this parameter is
unused.
rsize default read size (usually 16K). The client currently
can not use rsize larger than CIFSMaxBufSize. CIFSMaxBufSize
defaults to 16K and may be changed (from 8K to the maximum
kmalloc size allowed by your kernel) at module install time
for cifs.ko. Setting CIFSMaxBufSize to a very large value
will cause cifs to use more memory and may reduce performance
in some cases. To use rsize greater than 127K (the original
cifs protocol maximum) also requires that the server support
a new Unix Capability flag (for very large read) which some
newer servers (e.g. Samba 3.0.26 or later) do. rsize can be
set from a minimum of 2048 to a maximum of 130048 (127K or
CIFSMaxBufSize, whichever is smaller)
wsize default write size (default 57344)
maximum wsize currently allowed by CIFS is 57344 (fourteen
4096 byte pages)
actimeo=n attribute cache timeout in seconds (default 1 second).
After this timeout, the cifs client requests fresh attribute
information from the server. This option allows tuning of the
attribute cache timeout to suit the workload needs. Shorter
timeouts mean better cache coherency, but an increased number
of calls to the server. Longer timeouts mean a reduced number
of calls to the server at the expense of less strict cache
coherency checks (i.e. incorrect attribute cache for a short
period of time).
rw mount the network share read-write (note that the
server may still consider the share read-only)
ro mount network share read-only
version used to distinguish different versions of the
mount helper utility (not typically needed)
sep if first mount option (after the -o), overrides
the comma as the separator between the mount
parms. e.g.
-o user=myname,password=mypassword,domain=mydom
could be passed instead with period as the separator by
-o sep=.user=myname.password=mypassword.domain=mydom
this might be useful when comma is contained within username
or password or domain. This option is less important
when the cifs mount helper cifs.mount (version 1.1 or later)
is used.
nosuid Do not allow remote executables with the suid bit
set to be executed. This is only meaningful for mounts
to servers such as Samba which support the CIFS Unix Extensions.
If you do not trust the servers in your network (your mount
targets) it is recommended that you specify this option for
greater security.
exec Permit execution of binaries on the mount.
noexec Do not permit execution of binaries on the mount.
dev Recognize block devices on the remote mount.
nodev Do not recognize devices on the remote mount.
suid Allow remote files on this mountpoint with suid enabled to
be executed (default for mounts when executed as root,
nosuid is default for user mounts).
credentials Although ignored by the cifs kernel component, it is used by
the mount helper, mount.cifs. When mount.cifs is installed it
opens and reads the credential file specified in order
to obtain the userid and password arguments which are passed to
the cifs vfs.
guest Although ignored by the kernel component, the mount.cifs
mount helper will not prompt the user for a password
if guest is specified on the mount options. If no
password is specified a null password will be used.
perm Client does permission checks (vfs_permission check of uid
and gid of the file against the mode and desired operation),
Note that this is in addition to the normal ACL check on the
target machine done by the server software.
Client permission checking is enabled by default.
noperm Client does not do permission checks. This can expose
files on this mount to access by other users on the local
client system. It is typically only needed when the server
supports the CIFS Unix Extensions but the UIDs/GIDs on the
client and server system do not match closely enough to allow
access by the user doing the mount, but it may be useful with
non CIFS Unix Extension mounts for cases in which the default
mode is specified on the mount but is not to be enforced on the
client (e.g. perhaps when MultiUserMount is enabled)
Note that this does not affect the normal ACL check on the
target machine done by the server software (of the server
ACL against the user name provided at mount time).
serverino Use server's inode numbers instead of generating automatically
incrementing inode numbers on the client. Although this will
make it easier to spot hardlinked files (as they will have
the same inode numbers) and inode numbers may be persistent,
note that the server does not guarantee that the inode numbers
are unique if multiple server side mounts are exported under a
single share (since inode numbers on the servers might not
be unique if multiple filesystems are mounted under the same
shared higher level directory). Note that some older servers
(e.g. pre-Windows 2000) do not support returning UniqueIDs
or the CIFS Unix Extensions equivalent and for those
this mount option will have no effect. Exporting cifs mounts
under nfsd requires this mount option on the cifs mount.
This is now the default if server supports the
required network operation.
noserverino Client generates inode numbers (rather than using the actual ones
from the server). These inode numbers will vary after
unmount or reboot which can confuse some applications,
but not all server filesystems support unique inode
numbers.
setuids If the CIFS Unix extensions are negotiated with the server
the client will attempt to set the effective uid and gid of
the local process on newly created files, directories, and
devices (create, mkdir, mknod). If the CIFS Unix Extensions
are not negotiated, for newly created files and directories
instead of using the default uid and gid specified on
the mount, cache the new file's uid and gid locally which means
that the uid for the file can change when the inode is
reloaded (or the user remounts the share).
nosetuids The client will not attempt to set the uid and gid on
on newly created files, directories, and devices (create,
mkdir, mknod) which will result in the server setting the
uid and gid to the default (usually the server uid of the
user who mounted the share). Letting the server (rather than
the client) set the uid and gid is the default. If the CIFS
Unix Extensions are not negotiated then the uid and gid for
new files will appear to be the uid (gid) of the mounter or the
uid (gid) parameter specified on the mount.
netbiosname When mounting to servers via port 139, specifies the RFC1001
source name to use to represent the client netbios machine
name when doing the RFC1001 netbios session initialize.
direct Do not do inode data caching on files opened on this mount.
This precludes mmapping files on this mount. In some cases
with fast networks and little or no caching benefits on the
client (e.g. when the application is doing large sequential
reads bigger than page size without rereading the same data)
this can provide better performance than the default
behavior which caches reads (readahead) and writes
(writebehind) through the local Linux client pagecache
if oplock (caching token) is granted and held. Note that
direct allows write operations larger than page size
to be sent to the server.
strictcache Use for switching on strict cache mode. In this mode the
client reads from the cache as long as it holds an Oplock Level II;
otherwise it reads from the server. All written data are stored
in the cache, but if the client does not hold an Exclusive Oplock,
it also writes the data to the server.
rwpidforward Forward the pid of the process that opened a file to any read or write
operation on that file. This prevents applications like WINE
from failing on reads and writes when mandatory brlock style is used.
acl Allow setfacl and getfacl to manage posix ACLs if server
supports them. (default)
noacl Do not allow setfacl and getfacl calls on this mount
user_xattr Allow getting and setting user xattrs (those attributes whose
name begins with "user." or "os2.") as OS/2 EAs (extended
attributes) to the server. This allows support of the
setfattr and getfattr utilities. (default)
nouser_xattr Do not allow getfattr/setfattr to get/set/list xattrs
mapchars Translate six of the seven reserved characters (not backslash)
*?<>|:
to the remap range (above 0xF000), which also
allows the CIFS client to recognize files created with
such characters by Windows's POSIX emulation. This can
also be useful when mounting to most versions of Samba
(which also forbids creating and opening files
whose names contain any of these seven characters).
This has no effect if the server does not support
Unicode on the wire.
nomapchars Do not translate any of these seven characters (default).
nocase Request case insensitive path name matching (case
sensitive is the default if the server supports it).
(mount option "ignorecase" is identical to "nocase")
posixpaths If CIFS Unix extensions are supported, attempt to
negotiate posix path name support which allows certain
characters forbidden in typical CIFS filenames, without
requiring remapping. (default)
noposixpaths If CIFS Unix extensions are supported, do not request
posix path name support (this may cause servers to
reject creating files with certain reserved characters).
nounix Disable the CIFS Unix Extensions for this mount (tree
connection). This is rarely needed, but it may be useful
in order to turn off multiple settings all at once (ie
posix acls, posix locks, posix paths, symlink support
and retrieving uids/gids/mode from the server) or to
work around a bug in servers which implement the Unix
Extensions.
nobrl Do not send byte range lock requests to the server.
This is necessary for certain applications that break
with cifs style mandatory byte range locks (and most
cifs servers do not yet support requesting advisory
byte range locks).
forcemandatorylock Even if the server supports posix (advisory) byte range
locking, send only mandatory lock requests. For some
(presumably rare) applications, originally coded for
DOS/Windows, which require Windows style mandatory byte range
locking, they may be able to take advantage of this option,
forcing the cifs client to only send mandatory locks
even if the cifs server would support posix advisory locks.
"forcemand" is accepted as a shorter form of this mount
option.
nostrictsync If this mount option is set, when an application does an
fsync call then the cifs client does not send an SMB Flush
to the server (to force the server to write all dirty data
for this file immediately to disk), although cifs still sends
all dirty (cached) file data to the server and waits for the
server to respond to the write. Since SMB Flush can be
very slow, and some servers may be reliable enough (to risk
delaying slightly flushing the data to disk on the server),
turning on this option may be useful to improve performance for
applications that fsync too much, at a small risk of server
crash. If this mount option is not set, by default cifs will
send an SMB flush request (and wait for a response) on every
fsync call.
nodfs Disable DFS (global name space support) even if the
server claims to support it. This can help work around
a problem with parsing of DFS paths with Samba server
versions 3.0.24 and 3.0.25.
remount remount the share (often used to change from ro to rw mounts
or vice versa)
cifsacl Report mode bits (e.g. on stat) based on the Windows ACL for
the file. (EXPERIMENTAL)
servern Specify the server's netbios name (RFC1001 name) to use
when attempting to setup a session to the server.
This is needed for mounting to some older servers (such
as OS/2 or Windows 98 and Windows ME) since they do not
support a default server name. A server name can be up
to 15 characters long and is usually uppercased.
sfu When the CIFS Unix Extensions are not negotiated, attempt to
create device files and fifos in a format compatible with
Services for Unix (SFU). In addition retrieve bits 10-12
of the mode via the SETFILEBITS extended attribute (as
SFU does). In the future the bottom 9 bits of the
mode also will be emulated using queries of the security
descriptor (ACL).
mfsymlinks Enable support for Minshall+French symlinks
(see http://wiki.samba.org/index.php/UNIX_Extensions#Minshall.2BFrench_symlinks)
This option is ignored when specified together with the
'sfu' option. Minshall+French symlinks are used even if
the server supports the CIFS Unix Extensions.
sign Must use packet signing (helps avoid unwanted data modification
by intermediate systems in the route). Note that signing
does not work with lanman or plaintext authentication.
seal Must seal (encrypt) all data on this mounted share before
sending on the network. Requires support for Unix Extensions.
Note that this differs from the sign mount option in that it
causes encryption of data sent over this mounted share but other
shares mounted to the same server are unaffected.
locallease This option is rarely needed. Fcntl F_SETLEASE is
used by some applications such as Samba and NFSv4 server to
check to see whether a file is cacheable. CIFS has no way
to explicitly request a lease, but can check whether a file
is cacheable (oplocked). Unfortunately, even if a file
is not oplocked, it could still be cacheable (ie cifs client
could grant fcntl leases if no other local processes are using
the file) for cases for example such as when the server does not
support oplocks and the user is sure that the only updates to
the file will be from this client. Specifying this mount option
will allow the cifs client to check for leases (only) locally
for files which are not oplocked instead of denying leases
in that case. (EXPERIMENTAL)
sec Security mode (an example mount command follows this option list).
Allowed values are:
none attempt to connect as a null user (no name)
krb5 Use Kerberos version 5 authentication
krb5i Use Kerberos authentication and packet signing
ntlm Use NTLM password hashing (default)
ntlmi Use NTLM password hashing with signing (if
/proc/fs/cifs/PacketSigningEnabled on or if
server requires signing also can be the default)
ntlmv2 Use NTLMv2 password hashing
ntlmv2i Use NTLMv2 password hashing with packet signing
lanman (if configured in kernel config) use older
lanman hash
hard Retry file operations if server is not responding
soft Limit retries to unresponsive servers (usually only
one retry) before returning an error. (default)
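For instance, to require NTLMv2 with packet signing for a particular mount
(the server and share names below are placeholders):
mount -t cifs //server/share /mnt -o user=myname,sec=ntlmv2i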
The mount.cifs mount helper also accepts a few mount options before -o
including:
-S take password from stdin (equivalent to setting the environment
variable "PASSWD_FD=0"
-V print mount.cifs version
-? display simple usage information
With most 2.6 kernel versions of modutils, the version of the cifs kernel
module can be displayed via modinfo.
Misc /proc/fs/cifs Flags and Debug Info
=======================================
Informational pseudo-files:
DebugData Displays information about active CIFS sessions and
shares, features enabled as well as the cifs.ko
version.
Stats Lists summary resource usage information as well as per
share statistics, if CONFIG_CIFS_STATS is enabled
in the kernel configuration.
Configuration pseudo-files:
PacketSigningEnabled If set to one, cifs packet signing is enabled
and will be used if the server requires
it. If set to two, cifs packet signing is
required even if the server considers packet
signing optional. (default 1)
SecurityFlags Flags which control security negotiation and
also packet signing. Authentication (may/must)
flags (e.g. for NTLM and/or NTLMv2) may be combined with
the signing flags. Specifying two different password
hashing mechanisms (as "must use") on the other hand
does not make much sense. Default flags are
0x07007
(NTLM, NTLMv2 and packet signing allowed). The maximum
allowable flags if you want to allow mounts to servers
using weaker password hashes is 0x37037 (lanman,
plaintext, ntlm, ntlmv2, signing allowed). Some
SecurityFlags require the corresponding menuconfig
options to be enabled (lanman and plaintext require
CONFIG_CIFS_WEAK_PW_HASH for example). Enabling
plaintext authentication currently requires also
enabling lanman authentication in the security flags
because the cifs module only supports sending
plaintext passwords using the older lanman dialect
form of the session setup SMB. (e.g. for authentication
using plain text passwords, set the SecurityFlags
to 0x30030):
may use packet signing 0x00001
must use packet signing 0x01001
may use NTLM (most common password hash) 0x00002
must use NTLM 0x02002
may use NTLMv2 0x00004
must use NTLMv2 0x04004
may use Kerberos security 0x00008
must use Kerberos 0x08008
may use lanman (weak) password hash 0x00010
must use lanman password hash 0x10010
may use plaintext passwords 0x00020
must use plaintext passwords 0x20020
(reserved for future packet encryption) 0x00040
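For example, using the flag values listed above, NTLMv2 could be made
mandatory with:
echo 0x04004 > /proc/fs/cifs/SecurityFlags
and the default behavior restored with:
echo 0x07007 > /proc/fs/cifs/SecurityFlags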
cifsFYI If set to non-zero value, additional debug information
will be logged to the system error log. This field
contains three flags controlling different classes of
debugging entries. The maximum value it can be set
to is 7 which enables all debugging points (default 0).
Some debugging statements are not compiled into the
cifs kernel unless CONFIG_CIFS_DEBUG2 is enabled in the
kernel configuration. cifsFYI may be set to one or
more of the following flags (7 sets them all):
log cifs informational messages 0x01
log return codes from cifs entry points 0x02
log slow responses (ie which take longer than 1 second)
CONFIG_CIFS_STATS2 must be enabled in .config 0x04
traceSMB If set to one, debug information is logged to the
system error log with the start of smb requests
and responses (default 0)
LookupCacheEnable If set to one, inode information is kept cached
for one second improving performance of lookups
(default 1)
OplockEnabled If set to one, safe distributed caching is enabled.
(default 1)
LinuxExtensionsEnabled If set to one then the client will attempt to
use the CIFS "UNIX" extensions which are optional
protocol enhancements that allow CIFS servers
to return accurate UID/GID information as well
as support symbolic links. If you use servers
such as Samba that support the CIFS Unix
extensions but do not want to use symbolic link
support and want to map the uid and gid fields
to values supplied at mount (rather than the
actual values), then set this to zero. (default 1)
These experimental features and tracing can be enabled by changing flags in
/proc/fs/cifs (after the cifs module has been installed or built into the
kernel, e.g. insmod cifs). To enable a feature set it to 1 e.g. to enable
tracing to the kernel message log type:
echo 7 > /proc/fs/cifs/cifsFYI
cifsFYI functions as a bit mask. Setting it to 1 enables additional kernel
logging of various informational messages. 2 enables logging of non-zero
SMB return codes while 4 enables logging of requests that take longer
than one second to complete (except for byte range lock requests).
Setting it to 4 requires defining CONFIG_CIFS_STATS2 manually in the
source code (typically by setting it in the beginning of cifsglob.h),
and setting it to seven enables all three. Finally, tracing
the start of smb requests and responses can be enabled via:
echo 1 > /proc/fs/cifs/traceSMB
Per share (per client mount) statistics are available in /proc/fs/cifs/Stats
if the kernel was configured with cifs statistics enabled. The statistics
represent the number of successful (ie non-zero return code from the server)
SMB responses to some of the more common commands (open, delete, mkdir etc.).
Also recorded is the total bytes read and bytes written to the server for
that share. Note that due to client caching effects this can be less than the
number of bytes read and written by the application running on the client.
The statistics for the number of total SMBs and oplock breaks are different in
that they represent all for that share, not just those for which the server
returned success.
Also note that "cat /proc/fs/cifs/DebugData" will display information about
the active sessions and the shares that are mounted.
Enabling Kerberos (extended security) works but requires version 1.2 or later
of the helper program cifs.upcall to be present and to be configured in the
/etc/request-key.conf file. The cifs.upcall helper program is from the Samba
project (http://www.samba.org). NTLM, NTLMv2 and LANMAN support do not
require this helper. Note that NTLMv2 security (which does not require the
cifs.upcall helper program), instead of using Kerberos, is sufficient for
some use cases.
DFS support allows transparent redirection to shares in an MS-DFS name space.
In addition, DFS support for target shares which are specified as UNC
names which begin with host names (rather than IP addresses) requires
a user space helper (such as cifs.upcall) to be present in order to
translate host names to ip addresses, and the user space helper must also
be configured in the file /etc/request-key.conf. Samba, Windows servers and
many NAS appliances support DFS as a way of constructing a global name
space to ease network configuration and improve reliability.
To use cifs Kerberos and DFS support, the Linux keyutils package should be
installed and something like the following lines should be added to the
/etc/request-key.conf file:
create cifs.spnego * * /usr/local/sbin/cifs.upcall %k
create dns_resolver * * /usr/local/sbin/cifs.upcall %k
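Once cifs.upcall is configured as above, a Kerberos mount can be requested
via the sec mount option described earlier (server, share and user names are
placeholders):
mount -t cifs //server/share /mnt -o sec=krb5,user=myname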
CIFS kernel module parameters
=============================
These module parameters can be specified or modified either at the time of
module loading or at runtime by using the interface
/sys/module/cifs/parameters/<param>
i.e. echo "value" > /sys/module/cifs/parameters/<param>
1. enable_oplocks - Enable or disable oplocks. Oplocks are enabled by default.
[Y/y/1]. To disable use any of [N/n/0].
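For example, oplocks could be disabled either when loading the module or
later at runtime:
modprobe cifs enable_oplocks=0
echo 0 > /sys/module/cifs/parameters/enable_oplocks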

View file

@ -0,0 +1,104 @@
Version 2.03 August 1, 2014
A Partial List of Missing Features
==================================
Contributions are welcome. There are plenty of opportunities
for visible, important contributions to this module. Here
is a partial list of the known problems and missing features:
a) SMB3 (and SMB3.02) missing optional features:
- RDMA
- multichannel (started)
- directory leases (improved metadata caching)
- T10 copy offload (copy chunk is only mechanism supported)
- encrypted shares
b) improved sparse file support
c) Directory entry caching relies on a 1 second timer, rather than
using FindNotify or equivalent. - (started)
d) quota support (needs minor kernel change since quota calls
do not make it to network filesystems or deviceless filesystems)
e) improve support for very old servers (OS/2 and Win9x for example)
Including support for changing the time remotely (utimes command).
f) hook lower into the sockets api (as NFS/SunRPC does) to avoid the
extra copy in/out of the socket buffers in some cases.
g) Better optimize open (and pathbased setfilesize) to reduce the
oplock breaks coming from windows srv. Piggyback identical file
opens on top of each other by incrementing reference count rather
than resending (helps reduce server resource utilization and avoid
spurious oplock breaks).
h) Add support for storing symlink info to Windows servers
in the Extended Attribute format their SFU clients would recognize.
i) Finish inotify support so kde and gnome file list windows
will autorefresh (partially complete by Asser). Needs minor kernel
vfs change to support removing D_NOTIFY on a file.
j) Add GUI tool to configure /proc/fs/cifs settings and for display of
the CIFS statistics (started)
k) implement support for security and trusted categories of xattrs
(requires minor protocol extension) to enable better support for SELINUX
l) Implement O_DIRECT flag on open (already supported on mount)
m) Create UID mapping facility so server UIDs can be mapped on a per
mount or a per server basis to client UIDs or nobody if no mapping
exists. This is helpful when Unix extensions are negotiated to
allow better permission checking when UIDs differ on the server
and client. Add new protocol request to the CIFS protocol
standard for asking the server for the corresponding name of a
particular uid.
n) DOS attrs - returned as pseudo-xattr in Samba format (check VFAT and NTFS for this too)
o) mount check for unmatched uids
p) Add support for new vfs entry point for fallocate
q) Add tools to take advantage of cifs/smb3 specific ioctls and features
such as "CopyChunk" (fast server side file copy)
r) encrypted file support
s) improved stats gathering, tools (perhaps integration with nfsometer?)
t) allow setting more NTFS/SMB3 file attributes remotely (currently limited to compressed
file attribute via chflags)
u) mount helper GUI (to simplify the various configuration options on mount)
KNOWN BUGS
====================================
See http://bugzilla.samba.org - search on product "CifsVFS" for
current bug list. Also check http://bugzilla.kernel.org (Product = File System, Component = CIFS)
1) existing symbolic links (Windows reparse points) are recognized but
can not be created remotely. They are implemented for Samba and those that
support the CIFS Unix extensions, although earlier versions of Samba
overly restrict the pathnames.
2) follow_link and readdir code does not follow dfs junctions
but recognizes them
Misc testing to do
==================
1) check out max path names and max path name components against various server
types. Try nested symlinks (8 deep). Return max path name in stat -f information
2) Improve xfstest's cifs enablement and adapt xfstests where needed to test
cifs better
3) Additional performance testing and optimization using iozone and similar -
there are some easy changes that can be done to parallelize sequential writes,
and when signing is disabled to request larger read sizes (larger than
negotiated size) and send larger write sizes to modern servers.
4) More exhaustively test against less common servers

View file

@ -0,0 +1,31 @@
This is the client VFS module for the Common Internet File System
(CIFS) protocol which is the successor to the Server Message Block
(SMB) protocol, the native file sharing mechanism for most early
PC operating systems. New and improved versions of CIFS are now
called SMB2 and SMB3. These dialects are also supported by the
CIFS VFS module. CIFS is fully supported by network
file servers such as Windows 2000, 2003, 2008 and 2012
as well as by Samba (which provides excellent CIFS
server support for Linux and many other operating systems), so
this network filesystem client can mount to a wide variety of
servers.
The intent of this module is to provide the most advanced network
file system function for CIFS compliant servers, including better
POSIX compliance, secure per-user session establishment, high
performance safe distributed caching (oplock), optional packet
signing, large files, Unicode support and other internationalization
improvements. Since both Samba server and this filesystem client support
the CIFS Unix extensions, the combination can provide a reasonable
alternative to NFSv4 for fileserving in some Linux to Linux environments,
not just in Linux to Windows environments.
This filesystem has a mount utility (mount.cifs) that can be obtained from
https://ftp.samba.org/pub/linux-cifs/cifs-utils/
It must be installed in the directory with the other mount helpers.
For more information on the module see the project wiki page at
https://wiki.samba.org/index.php/LinuxCIFS_utils

View file

@ -0,0 +1,62 @@
#!/usr/bin/perl -w
#
# winucase_convert.pl -- convert "Windows 8 Upper Case Mapping Table.txt" to
# a two-level set of C arrays.
#
# Copyright 2013: Jeff Layton <jlayton@redhat.com>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
#
while(<>) {
next if (!/^0x(..)(..)\t0x(....)\t/);
$firstchar = hex($1);
$secondchar = hex($2);
$uppercase = hex($3);
$top[$firstchar][$secondchar] = $uppercase;
}
for ($i = 0; $i < 256; $i++) {
next if (!$top[$i]);
printf("static const wchar_t t2_%2.2x[256] = {", $i);
for ($j = 0; $j < 256; $j++) {
if (($j % 8) == 0) {
print "\n\t";
} else {
print " ";
}
printf("0x%4.4x,", $top[$i][$j] ? $top[$i][$j] : 0);
}
print "\n};\n\n";
}
printf("static const wchar_t *const toplevel[256] = {", $i);
for ($i = 0; $i < 256; $i++) {
if (($i % 8) == 0) {
print "\n\t";
} elsif ($top[$i]) {
print " ";
} else {
print " ";
}
if ($top[$i]) {
printf("t2_%2.2x,", $i);
} else {
print "NULL,";
}
}
print "\n};\n\n";

File diff suppressed because it is too large

View file

@ -0,0 +1,3 @@
ifneq ($(CONFIG_CONFIGFS_FS),)
obj-m += configfs_example_explicit.o configfs_example_macros.o
endif

View file

@ -0,0 +1,484 @@
configfs - Userspace-driven kernel object configuration.
Joel Becker <joel.becker@oracle.com>
Updated: 31 March 2005
Copyright (c) 2005 Oracle Corporation,
Joel Becker <joel.becker@oracle.com>
[What is configfs?]
configfs is a ram-based filesystem that provides the converse of
sysfs's functionality. Where sysfs is a filesystem-based view of
kernel objects, configfs is a filesystem-based manager of kernel
objects, or config_items.
With sysfs, an object is created in kernel (for example, when a device
is discovered) and it is registered with sysfs. Its attributes then
appear in sysfs, allowing userspace to read the attributes via
readdir(3)/read(2). It may allow some attributes to be modified via
write(2). The important point is that the object is created and
destroyed in kernel, the kernel controls the lifecycle of the sysfs
representation, and sysfs is merely a window on all this.
A configfs config_item is created via an explicit userspace operation:
mkdir(2). It is destroyed via rmdir(2). The attributes appear at
mkdir(2) time, and can be read or modified via read(2) and write(2).
As with sysfs, readdir(3) queries the list of items and/or attributes.
symlink(2) can be used to group items together. Unlike sysfs, the
lifetime of the representation is completely driven by userspace. The
kernel modules backing the items must respond to this.
Both sysfs and configfs can and should exist together on the same
system. One is not a replacement for the other.
[Using configfs]
configfs can be compiled as a module or into the kernel. You can access
it by doing
mount -t configfs none /config
The configfs tree will be empty unless client modules are also loaded.
These are modules that register their item types with configfs as
subsystems. Once a client subsystem is loaded, it will appear as a
subdirectory (or more than one) under /config. Like sysfs, the
configfs tree is always there, whether mounted on /config or not.
An item is created via mkdir(2). The item's attributes will also
appear at this time. readdir(3) can determine what the attributes are,
read(2) can query their default values, and write(2) can store new
values. Like sysfs, attributes should be ASCII text files, preferably
with only one value per file. The same efficiency caveats from sysfs
apply. Don't mix more than one attribute in one attribute file.
Like sysfs, configfs expects write(2) to store the entire buffer at
once. When writing to configfs attributes, userspace processes should
first read the entire file, modify the portions they wish to change, and
then write the entire buffer back. Attribute files have a maximum size
of one page (PAGE_SIZE, 4096 on i386).
When an item needs to be destroyed, remove it with rmdir(2). An
item cannot be destroyed if any other item has a link to it (via
symlink(2)). Links can be removed via unlink(2).
[Configuring FakeNBD: an Example]
Imagine there's a Network Block Device (NBD) driver that allows you to
access remote block devices. Call it FakeNBD. FakeNBD uses configfs
for its configuration. Obviously, there will be a nice program that
sysadmins use to configure FakeNBD, but somehow that program has to tell
the driver about it. Here's where configfs comes in.
When the FakeNBD driver is loaded, it registers itself with configfs.
readdir(3) sees this just fine:
# ls /config
fakenbd
A fakenbd connection can be created with mkdir(2). The name is
arbitrary, but likely the tool will make some use of the name. Perhaps
it is a uuid or a disk name:
# mkdir /config/fakenbd/disk1
# ls /config/fakenbd/disk1
target device rw
The target attribute contains the IP address of the server FakeNBD will
connect to. The device attribute is the device on the server.
Predictably, the rw attribute determines whether the connection is
read-only or read-write.
# echo 10.0.0.1 > /config/fakenbd/disk1/target
# echo /dev/sda1 > /config/fakenbd/disk1/device
# echo 1 > /config/fakenbd/disk1/rw
That's it. That's all there is. Now the device is configured, via the
shell no less.
[Coding With configfs]
Every object in configfs is a config_item. A config_item reflects an
object in the subsystem. It has attributes that match values on that
object. configfs handles the filesystem representation of that object
and its attributes, allowing the subsystem to ignore all but the
basic show/store interaction.
Items are created and destroyed inside a config_group. A group is a
collection of items that share the same attributes and operations.
Items are created by mkdir(2) and removed by rmdir(2), but configfs
handles that. The group has a set of operations to perform these tasks.
A subsystem is the top level of a client module. During initialization,
the client module registers the subsystem with configfs, the subsystem
appears as a directory at the top of the configfs filesystem. A
subsystem is also a config_group, and can do everything a config_group
can.
[struct config_item]
struct config_item {
char *ci_name;
char ci_namebuf[UOBJ_NAME_LEN];
struct kref ci_kref;
struct list_head ci_entry;
struct config_item *ci_parent;
struct config_group *ci_group;
struct config_item_type *ci_type;
struct dentry *ci_dentry;
};
void config_item_init(struct config_item *);
void config_item_init_type_name(struct config_item *,
const char *name,
struct config_item_type *type);
struct config_item *config_item_get(struct config_item *);
void config_item_put(struct config_item *);
Generally, struct config_item is embedded in a container structure, a
structure that actually represents what the subsystem is doing. The
config_item portion of that structure is how the object interacts with
configfs.
Whether statically defined in a source file or created by a parent
config_group, a config_item must have one of the _init() functions
called on it. This initializes the reference count and sets up the
appropriate fields.
All users of a config_item should have a reference on it via
config_item_get(), and drop the reference when they are done via
config_item_put().
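As a rough sketch (the structure and function names below are made up for
illustration and are not taken from the example code), a subsystem would
typically wrap config_item like this:
struct fakenbd_device {
	struct config_item item;	/* the embedded configfs object */
	int rw;				/* subsystem-specific state */
};
static inline struct fakenbd_device *to_fakenbd(struct config_item *item)
{
	/* recover the container structure from the embedded config_item */
	return container_of(item, struct fakenbd_device, item);
}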
By itself, a config_item cannot do much more than appear in configfs.
Usually a subsystem wants the item to display and/or store attributes,
among other things. For that, it needs a type.
[struct config_item_type]
struct configfs_item_operations {
void (*release)(struct config_item *);
ssize_t (*show_attribute)(struct config_item *,
struct configfs_attribute *,
char *);
ssize_t (*store_attribute)(struct config_item *,
struct configfs_attribute *,
const char *, size_t);
int (*allow_link)(struct config_item *src,
struct config_item *target);
int (*drop_link)(struct config_item *src,
struct config_item *target);
};
struct config_item_type {
struct module *ct_owner;
struct configfs_item_operations *ct_item_ops;
struct configfs_group_operations *ct_group_ops;
struct configfs_attribute **ct_attrs;
};
The most basic function of a config_item_type is to define what
operations can be performed on a config_item. All items that have been
allocated dynamically will need to provide the ct_item_ops->release()
method. This method is called when the config_item's reference count
reaches zero. Items that wish to display an attribute need to provide
the ct_item_ops->show_attribute() method. Similarly, storing a new
attribute value uses the store_attribute() method.
[struct configfs_attribute]
struct configfs_attribute {
char *ca_name;
struct module *ca_owner;
umode_t ca_mode;
};
When a config_item wants an attribute to appear as a file in the item's
configfs directory, it must define a configfs_attribute describing it.
It then adds the attribute to the NULL-terminated array
config_item_type->ct_attrs. When the item appears in configfs, the
attribute file will appear with the configfs_attribute->ca_name
filename. configfs_attribute->ca_mode specifies the file permissions.
If an attribute is readable and the config_item provides a
ct_item_ops->show_attribute() method, that method will be called
whenever userspace asks for a read(2) on the attribute. The converse
will happen for write(2).
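Continuing the hypothetical FakeNBD sketch above, an attribute, the
NULL-terminated attribute array, and a show method might look roughly like
the following (names and modes are illustrative only):
static struct configfs_attribute fakenbd_attr_rw = {
	.ca_owner = THIS_MODULE,
	.ca_name = "rw",
	.ca_mode = S_IRUGO | S_IWUSR,
};
static struct configfs_attribute *fakenbd_attrs[] = {
	&fakenbd_attr_rw,
	NULL,
};
static ssize_t fakenbd_show_attribute(struct config_item *item,
				      struct configfs_attribute *attr,
				      char *page)
{
	/* print the current value into the one-page buffer */
	return sprintf(page, "%d\n", to_fakenbd(item)->rw);
}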
[struct config_group]
A config_item cannot live in a vacuum. The only way one can be created
is via mkdir(2) on a config_group. This will trigger creation of a
child item.
struct config_group {
struct config_item cg_item;
struct list_head cg_children;
struct configfs_subsystem *cg_subsys;
struct config_group **default_groups;
};
void config_group_init(struct config_group *group);
void config_group_init_type_name(struct config_group *group,
const char *name,
struct config_item_type *type);
The config_group structure contains a config_item. Properly configuring
that item means that a group can behave as an item in its own right.
However, it can do more: it can create child items or groups. This is
accomplished via the group operations specified on the group's
config_item_type.
struct configfs_group_operations {
struct config_item *(*make_item)(struct config_group *group,
const char *name);
struct config_group *(*make_group)(struct config_group *group,
const char *name);
int (*commit_item)(struct config_item *item);
void (*disconnect_notify)(struct config_group *group,
struct config_item *item);
void (*drop_item)(struct config_group *group,
struct config_item *item);
};
A group creates child items by providing the
ct_group_ops->make_item() method. If provided, this method is called from mkdir(2) in the group's directory. The subsystem allocates a new
config_item (or more likely, its container structure), initializes it,
and returns it to configfs. Configfs will then populate the filesystem
tree to reflect the new item.
If the subsystem wants the child to be a group itself, the subsystem
provides ct_group_ops->make_group(). Everything else behaves the same,
using the group _init() functions on the group.
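A make_item() implementation for the hypothetical FakeNBD sketch could look
roughly like this (fakenbd_item_type is assumed to be a config_item_type
defined elsewhere; returning NULL signals an allocation failure):
static struct config_item *fakenbd_make_item(struct config_group *group,
					     const char *name)
{
	struct fakenbd_device *dev;

	dev = kzalloc(sizeof(*dev), GFP_KERNEL);
	if (!dev)
		return NULL;

	/* set up refcount, name and type for the new child item */
	config_item_init_type_name(&dev->item, name, &fakenbd_item_type);
	return &dev->item;
}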
Finally, when userspace calls rmdir(2) on the item or group,
ct_group_ops->drop_item() is called. As a config_group is also a
config_item, it is not necessary for a separate drop_group() method.
The subsystem must config_item_put() the reference that was initialized
upon item allocation. If a subsystem has no work to do, it may omit
the ct_group_ops->drop_item() method, and configfs will call
config_item_put() on the item on behalf of the subsystem.
IMPORTANT: drop_item() is void, and as such cannot fail. When rmdir(2)
is called, configfs WILL remove the item from the filesystem tree
(assuming that it has no children to keep it busy). The subsystem is
responsible for responding to this. If the subsystem has references to
the item in other threads, the memory is safe. It may take some time
for the item to actually disappear from the subsystem's usage. But it
is gone from configfs.
When drop_item() is called, the item's linkage has already been torn
down. It no longer has a reference on its parent and has no place in
the item hierarchy. If a client needs to do some cleanup before this
teardown happens, the subsystem can implement the
ct_group_ops->disconnect_notify() method. The method is called after
configfs has removed the item from the filesystem view but before the
item is removed from its parent group. Like drop_item(),
disconnect_notify() is void and cannot fail. Client subsystems should
not drop any references here, as they still must do it in drop_item().
A config_group cannot be removed while it still has child items. This
is implemented in the configfs rmdir(2) code. ->drop_item() will not be
called, as the item has not been dropped. rmdir(2) will fail, as the
directory is not empty.
[struct configfs_subsystem]
A subsystem must register itself, usually at module_init time. This
tells configfs to make the subsystem appear in the file tree.
struct configfs_subsystem {
struct config_group su_group;
struct mutex su_mutex;
};
int configfs_register_subsystem(struct configfs_subsystem *subsys);
void configfs_unregister_subsystem(struct configfs_subsystem *subsys);
A subsystem consists of a toplevel config_group and a mutex.
The group is where child config_items are created. For a subsystem,
this group is usually defined statically. Before calling
configfs_register_subsystem(), the subsystem must have initialized the
group via the usual group _init() functions, and it must also have
initialized the mutex.
When the register call returns, the subsystem is live, and it
will be visible via configfs. At that point, mkdir(2) can be called and
the subsystem must be ready for it.
[An Example]
The best example of these basic concepts is the simple_children
subsystem/group and the simple_child item in configfs_example_explicit.c
and configfs_example_macros.c. It shows a trivial object displaying and
storing an attribute, and a simple group creating and destroying these
children.
The only difference between configfs_example_explicit.c and
configfs_example_macros.c is how the attributes of the childless item
are defined. The childless item has extended attributes, each with
their own show()/store() operation. This follows a convention commonly
used in sysfs. configfs_example_explicit.c creates these attributes
by explicitly defining the structures involved. Conversely
configfs_example_macros.c uses some convenience macros from configfs.h
to define the attributes. These macros are similar to their sysfs
counterparts.
[Hierarchy Navigation and the Subsystem Mutex]
There is an extra bonus that configfs provides. The config_groups and
config_items are arranged in a hierarchy due to the fact that they
appear in a filesystem. A subsystem is NEVER to touch the filesystem
parts, but the subsystem might be interested in this hierarchy. For
this reason, the hierarchy is mirrored via the config_group->cg_children
and config_item->ci_parent structure members.
A subsystem can navigate the cg_children list and the ci_parent pointer
to see the tree created by the subsystem. This can race with configfs'
management of the hierarchy, so configfs uses the subsystem mutex to
protect modifications. Whenever a subsystem wants to navigate the
hierarchy, it must do so under the protection of the subsystem
mutex.
A subsystem will be prevented from acquiring the mutex while a newly
allocated item has not been linked into this hierarchy. Similarly, it
will not be able to acquire the mutex while a dropping item has not
yet been unlinked. This means that an item's ci_parent pointer will
never be NULL while the item is in configfs, and that an item will only
be in its parent's cg_children list for the same duration. This allows
a subsystem to trust ci_parent and cg_children while they hold the
mutex.
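As a loose sketch (the my_subsys_list_children() name and the printk are
purely illustrative), a subsystem might walk its direct children under the
subsystem mutex like this:

	static void my_subsys_list_children(struct configfs_subsystem *subsys)
	{
		struct config_group *group = &subsys->su_group;
		struct config_item *item;

		mutex_lock(&subsys->su_mutex);
		list_for_each_entry(item, &group->cg_children, ci_entry)
			printk(KERN_INFO "child: %s\n", config_item_name(item));
		mutex_unlock(&subsys->su_mutex);
	}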
[Item Aggregation Via symlink(2)]
configfs provides a simple group via the group->item parent/child
relationship. Often, however, a larger environment requires aggregation
outside of the parent/child connection. This is implemented via
symlink(2).
A config_item may provide the ct_item_ops->allow_link() and
ct_item_ops->drop_link() methods. If the ->allow_link() method exists,
symlink(2) may be called with the config_item as the source of the link.
These links are only allowed between configfs config_items. Any
symlink(2) attempt outside the configfs filesystem will be denied.
When symlink(2) is called, the source config_item's ->allow_link()
method is called with itself and a target item. If the source item
allows linking to target item, it returns 0. A source item may wish to
reject a link if it only wants links to a certain type of object (say,
in its own subsystem).
When unlink(2) is called on the symbolic link, the source item is
notified via the ->drop_link() method. Like the ->drop_item() method,
this is a void function and cannot return failure. The subsystem is
responsible for responding to the change.
A config_item cannot be removed while it links to any other item, nor
can it be removed while an item links to it. Dangling symlinks are not
allowed in configfs.
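A minimal, hypothetical sketch of an ->allow_link() method that accepts only
targets of one expected type could look like the following; "my_target_type"
is assumed to be a config_item_type defined elsewhere by the subsystem:

	static int my_source_allow_link(struct config_item *src,
					struct config_item *target)
	{
		/* Only accept links to items of the expected type. */
		if (target->ci_type != &my_target_type)
			return -EINVAL;

		/* Record the association in src's container here, if needed. */
		return 0;
	}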
[Automatically Created Subgroups]
A new config_group may want to have two types of child config_items.
While this could be codified by magic names in ->make_item(), it is much
more explicit to have a method whereby userspace sees this divergence.
Rather than have a group where some items behave differently than
others, configfs provides a method whereby one or many subgroups are
automatically created inside the parent at its creation. Thus,
mkdir("parent") results in "parent", "parent/subgroup1", up through
"parent/subgroupN". Items of type 1 can now be created in
"parent/subgroup1", and items of type N can be created in
"parent/subgroupN".
These automatic subgroups, or default groups, do not preclude other
children of the parent group. If ct_group_ops->make_group() exists,
other child groups can be created on the parent group directly.
A configfs subsystem specifies default groups by filling in the
NULL-terminated array default_groups on the config_group structure.
Each group in that array is populated in the configfs tree at the same
time as the parent group. Similarly, they are removed at the same time
as the parent. No extra notification is provided. When a ->drop_item()
method call notifies the subsystem that the parent group is going away,
that notification also covers every default group child of that parent group.
As a consequence of this, default_groups cannot be removed directly via
rmdir(2). They also are not considered when rmdir(2) on the parent
group is checking for children.
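As an illustrative sketch (all names invented for this document), a parent
group could publish one fixed subgroup by filling in default_groups before
the parent is registered or returned from ->make_group():

	static struct config_item_type my_fixed_type = {
		.ct_owner = THIS_MODULE,
	};

	static struct config_group my_fixed_group;

	static struct config_group *my_default_groups[] = {
		&my_fixed_group,
		NULL,		/* the array must be NULL-terminated */
	};

	static void my_parent_setup(struct config_group *parent)
	{
		config_group_init_type_name(&my_fixed_group, "subgroup1",
					    &my_fixed_type);
		parent->default_groups = my_default_groups;
	}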
[Dependent Subsystems]
Sometimes other drivers depend on particular configfs items. For
example, ocfs2 mounts depend on a heartbeat region item. If that
region item is removed with rmdir(2), the ocfs2 mount must BUG or go
readonly. Not happy.
configfs provides two additional API calls: configfs_depend_item() and
configfs_undepend_item(). A client driver can call
configfs_depend_item() on an existing item to tell configfs that it is
depended on. configfs will then return -EBUSY from rmdir(2) for that
item. When the item is no longer depended on, the client driver calls
configfs_undepend_item() on it.
These APIs cannot be called underneath any configfs callbacks, as
they will conflict. They can block and allocate. A client driver
probably shouldn't call them of its own accord; rather, it should
provide an API that external subsystems call.
How does this work? Imagine the ocfs2 mount process. When it mounts,
it asks for a heartbeat region item. This is done via a call into the
heartbeat code. Inside the heartbeat code, the region item is looked
up. Here, the heartbeat code calls configfs_depend_item(). If it
succeeds, then heartbeat knows the region is safe to give to ocfs2.
If it fails, it was being torn down anyway, and heartbeat can gracefully
pass up an error.
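A hypothetical client-facing wrapper, loosely following the heartbeat example
above (the my_region_* names are invented), might look like:

	int my_region_get(struct configfs_subsystem *subsys,
			  struct config_item *region)
	{
		/* May block and allocate; never call from a configfs callback. */
		return configfs_depend_item(subsys, region);
	}

	void my_region_put(struct configfs_subsystem *subsys,
			   struct config_item *region)
	{
		configfs_undepend_item(subsys, region);
	}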
[Committable Items]
NOTE: Committable items are currently unimplemented.
Some config_items cannot have a valid initial state. That is, no
default values can be specified for the item's attributes such that the
item can do its work. Userspace must configure one or more attributes,
after which the subsystem can start whatever entity this item
represents.
Consider the FakeNBD device from above. Without a target address *and*
a target device, the subsystem has no idea what block device to import.
The simple example assumes that the subsystem merely waits until all the
appropriate attributes are configured, and then connects. This will,
indeed, work, but now every attribute store must check if the attributes
are initialized. Every attribute store must fire off the connection if
that condition is met.
Far better would be an explicit action notifying the subsystem that the
config_item is ready to go. More importantly, an explicit action allows
the subsystem to provide feedback as to whether the attributes are
initialized in a way that makes sense. configfs provides this as
committable items.
configfs still uses only normal filesystem operations. An item is
committed via rename(2). The item is moved from a directory where it
can be modified to a directory where it cannot.
Any group that provides the ct_group_ops->commit_item() method has
committable items. When this group appears in configfs, mkdir(2) will
not work directly in the group. Instead, the group will have two
subdirectories: "live" and "pending". The "live" directory does not
support mkdir(2) or rmdir(2) either. It only allows rename(2). The
"pending" directory does allow mkdir(2) and rmdir(2). An item is
created in the "pending" directory. Its attributes can be modified at
will. Userspace commits the item by renaming it into the "live"
directory. At this point, the subsystem receives the ->commit_item()
callback. If all required attributes are filled to satisfaction, the
method returns zero and the item is moved to the "live" directory.
As rmdir(2) does not work in the "live" directory, an item must be
shutdown, or "uncommitted". Again, this is done via rename(2), this
time from the "live" directory back to the "pending" one. The subsystem
is notified by the ct_group_ops->uncommit_object() method.


@ -0,0 +1,483 @@
/*
* vim: noexpandtab ts=8 sts=0 sw=8:
*
* configfs_example_explicit.c - This file is a demonstration module
* containing a number of configfs subsystems. It explicitly defines
* each structure without using the helper macros defined in
* configfs.h.
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public
* License as published by the Free Software Foundation; either
* version 2 of the License, or (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* General Public License for more details.
*
* You should have received a copy of the GNU General Public
* License along with this program; if not, write to the
* Free Software Foundation, Inc., 59 Temple Place - Suite 330,
* Boston, MA 021110-1307, USA.
*
* Based on sysfs:
* sysfs is Copyright (C) 2001, 2002, 2003 Patrick Mochel
*
* configfs Copyright (C) 2005 Oracle. All rights reserved.
*/
#include <linux/init.h>
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/configfs.h>
/*
* 01-childless
*
* This first example is a childless subsystem. It cannot create
* any config_items. It just has attributes.
*
* Note that we are enclosing the configfs_subsystem inside a container.
* This is not necessary if a subsystem has no attributes directly
* on the subsystem. See the next example, 02-simple-children, for
* such a subsystem.
*/
struct childless {
struct configfs_subsystem subsys;
int showme;
int storeme;
};
struct childless_attribute {
struct configfs_attribute attr;
ssize_t (*show)(struct childless *, char *);
ssize_t (*store)(struct childless *, const char *, size_t);
};
static inline struct childless *to_childless(struct config_item *item)
{
return item ? container_of(to_configfs_subsystem(to_config_group(item)), struct childless, subsys) : NULL;
}
static ssize_t childless_showme_read(struct childless *childless,
char *page)
{
ssize_t pos;
pos = sprintf(page, "%d\n", childless->showme);
childless->showme++;
return pos;
}
static ssize_t childless_storeme_read(struct childless *childless,
char *page)
{
return sprintf(page, "%d\n", childless->storeme);
}
static ssize_t childless_storeme_write(struct childless *childless,
const char *page,
size_t count)
{
unsigned long tmp;
char *p = (char *) page;
tmp = simple_strtoul(p, &p, 10);
if ((*p != '\0') && (*p != '\n'))
return -EINVAL;
if (tmp > INT_MAX)
return -ERANGE;
childless->storeme = tmp;
return count;
}
static ssize_t childless_description_read(struct childless *childless,
char *page)
{
return sprintf(page,
"[01-childless]\n"
"\n"
"The childless subsystem is the simplest possible subsystem in\n"
"configfs. It does not support the creation of child config_items.\n"
"It only has a few attributes. In fact, it isn't much different\n"
"than a directory in /proc.\n");
}
static struct childless_attribute childless_attr_showme = {
.attr = { .ca_owner = THIS_MODULE, .ca_name = "showme", .ca_mode = S_IRUGO },
.show = childless_showme_read,
};
static struct childless_attribute childless_attr_storeme = {
.attr = { .ca_owner = THIS_MODULE, .ca_name = "storeme", .ca_mode = S_IRUGO | S_IWUSR },
.show = childless_storeme_read,
.store = childless_storeme_write,
};
static struct childless_attribute childless_attr_description = {
.attr = { .ca_owner = THIS_MODULE, .ca_name = "description", .ca_mode = S_IRUGO },
.show = childless_description_read,
};
static struct configfs_attribute *childless_attrs[] = {
&childless_attr_showme.attr,
&childless_attr_storeme.attr,
&childless_attr_description.attr,
NULL,
};
static ssize_t childless_attr_show(struct config_item *item,
struct configfs_attribute *attr,
char *page)
{
struct childless *childless = to_childless(item);
struct childless_attribute *childless_attr =
container_of(attr, struct childless_attribute, attr);
ssize_t ret = 0;
if (childless_attr->show)
ret = childless_attr->show(childless, page);
return ret;
}
static ssize_t childless_attr_store(struct config_item *item,
struct configfs_attribute *attr,
const char *page, size_t count)
{
struct childless *childless = to_childless(item);
struct childless_attribute *childless_attr =
container_of(attr, struct childless_attribute, attr);
ssize_t ret = -EINVAL;
if (childless_attr->store)
ret = childless_attr->store(childless, page, count);
return ret;
}
static struct configfs_item_operations childless_item_ops = {
.show_attribute = childless_attr_show,
.store_attribute = childless_attr_store,
};
static struct config_item_type childless_type = {
.ct_item_ops = &childless_item_ops,
.ct_attrs = childless_attrs,
.ct_owner = THIS_MODULE,
};
static struct childless childless_subsys = {
.subsys = {
.su_group = {
.cg_item = {
.ci_namebuf = "01-childless",
.ci_type = &childless_type,
},
},
},
};
/* ----------------------------------------------------------------- */
/*
* 02-simple-children
*
* This example merely has a simple one-attribute child. Note that
* there is no extra attribute structure, as the child's attribute is
* known from the get-go. Also, there is no container for the
* subsystem, as it has no attributes of its own.
*/
struct simple_child {
struct config_item item;
int storeme;
};
static inline struct simple_child *to_simple_child(struct config_item *item)
{
return item ? container_of(item, struct simple_child, item) : NULL;
}
static struct configfs_attribute simple_child_attr_storeme = {
.ca_owner = THIS_MODULE,
.ca_name = "storeme",
.ca_mode = S_IRUGO | S_IWUSR,
};
static struct configfs_attribute *simple_child_attrs[] = {
&simple_child_attr_storeme,
NULL,
};
static ssize_t simple_child_attr_show(struct config_item *item,
struct configfs_attribute *attr,
char *page)
{
ssize_t count;
struct simple_child *simple_child = to_simple_child(item);
count = sprintf(page, "%d\n", simple_child->storeme);
return count;
}
static ssize_t simple_child_attr_store(struct config_item *item,
struct configfs_attribute *attr,
const char *page, size_t count)
{
struct simple_child *simple_child = to_simple_child(item);
unsigned long tmp;
char *p = (char *) page;
tmp = simple_strtoul(p, &p, 10);
if (!p || (*p && (*p != '\n')))
return -EINVAL;
if (tmp > INT_MAX)
return -ERANGE;
simple_child->storeme = tmp;
return count;
}
static void simple_child_release(struct config_item *item)
{
kfree(to_simple_child(item));
}
static struct configfs_item_operations simple_child_item_ops = {
.release = simple_child_release,
.show_attribute = simple_child_attr_show,
.store_attribute = simple_child_attr_store,
};
static struct config_item_type simple_child_type = {
.ct_item_ops = &simple_child_item_ops,
.ct_attrs = simple_child_attrs,
.ct_owner = THIS_MODULE,
};
struct simple_children {
struct config_group group;
};
static inline struct simple_children *to_simple_children(struct config_item *item)
{
return item ? container_of(to_config_group(item), struct simple_children, group) : NULL;
}
static struct config_item *simple_children_make_item(struct config_group *group, const char *name)
{
struct simple_child *simple_child;
simple_child = kzalloc(sizeof(struct simple_child), GFP_KERNEL);
if (!simple_child)
return ERR_PTR(-ENOMEM);
config_item_init_type_name(&simple_child->item, name,
&simple_child_type);
simple_child->storeme = 0;
return &simple_child->item;
}
static struct configfs_attribute simple_children_attr_description = {
.ca_owner = THIS_MODULE,
.ca_name = "description",
.ca_mode = S_IRUGO,
};
static struct configfs_attribute *simple_children_attrs[] = {
&simple_children_attr_description,
NULL,
};
static ssize_t simple_children_attr_show(struct config_item *item,
struct configfs_attribute *attr,
char *page)
{
return sprintf(page,
"[02-simple-children]\n"
"\n"
"This subsystem allows the creation of child config_items. These\n"
"items have only one attribute that is readable and writeable.\n");
}
static void simple_children_release(struct config_item *item)
{
kfree(to_simple_children(item));
}
static struct configfs_item_operations simple_children_item_ops = {
.release = simple_children_release,
.show_attribute = simple_children_attr_show,
};
/*
* Note that, since no extra work is required on ->drop_item(),
* no ->drop_item() is provided.
*/
static struct configfs_group_operations simple_children_group_ops = {
.make_item = simple_children_make_item,
};
static struct config_item_type simple_children_type = {
.ct_item_ops = &simple_children_item_ops,
.ct_group_ops = &simple_children_group_ops,
.ct_attrs = simple_children_attrs,
.ct_owner = THIS_MODULE,
};
static struct configfs_subsystem simple_children_subsys = {
.su_group = {
.cg_item = {
.ci_namebuf = "02-simple-children",
.ci_type = &simple_children_type,
},
},
};
/* ----------------------------------------------------------------- */
/*
* 03-group-children
*
* This example reuses the simple_children group from above. However,
* the simple_children group is not the subsystem itself, it is a
* child of the subsystem. Creation of a group in the subsystem creates
* a new simple_children group. That group can then have simple_child
* children of its own.
*/
static struct config_group *group_children_make_group(struct config_group *group, const char *name)
{
struct simple_children *simple_children;
simple_children = kzalloc(sizeof(struct simple_children),
GFP_KERNEL);
if (!simple_children)
return ERR_PTR(-ENOMEM);
config_group_init_type_name(&simple_children->group, name,
&simple_children_type);
return &simple_children->group;
}
static struct configfs_attribute group_children_attr_description = {
.ca_owner = THIS_MODULE,
.ca_name = "description",
.ca_mode = S_IRUGO,
};
static struct configfs_attribute *group_children_attrs[] = {
&group_children_attr_description,
NULL,
};
static ssize_t group_children_attr_show(struct config_item *item,
struct configfs_attribute *attr,
char *page)
{
return sprintf(page,
"[03-group-children]\n"
"\n"
"This subsystem allows the creation of child config_groups. These\n"
"groups are like the subsystem simple-children.\n");
}
static struct configfs_item_operations group_children_item_ops = {
.show_attribute = group_children_attr_show,
};
/*
* Note that, since no extra work is required on ->drop_item(),
* no ->drop_item() is provided.
*/
static struct configfs_group_operations group_children_group_ops = {
.make_group = group_children_make_group,
};
static struct config_item_type group_children_type = {
.ct_item_ops = &group_children_item_ops,
.ct_group_ops = &group_children_group_ops,
.ct_attrs = group_children_attrs,
.ct_owner = THIS_MODULE,
};
static struct configfs_subsystem group_children_subsys = {
.su_group = {
.cg_item = {
.ci_namebuf = "03-group-children",
.ci_type = &group_children_type,
},
},
};
/* ----------------------------------------------------------------- */
/*
* We're now done with our subsystem definitions.
* For convenience in this module, here's a list of them all. It
* allows the init function to easily register them. Most modules
* will only have one subsystem, and will only call register_subsystem
* on it directly.
*/
static struct configfs_subsystem *example_subsys[] = {
&childless_subsys.subsys,
&simple_children_subsys,
&group_children_subsys,
NULL,
};
static int __init configfs_example_init(void)
{
int ret;
int i;
struct configfs_subsystem *subsys;
for (i = 0; example_subsys[i]; i++) {
subsys = example_subsys[i];
config_group_init(&subsys->su_group);
mutex_init(&subsys->su_mutex);
ret = configfs_register_subsystem(subsys);
if (ret) {
printk(KERN_ERR "Error %d while registering subsystem %s\n",
ret,
subsys->su_group.cg_item.ci_namebuf);
goto out_unregister;
}
}
return 0;
out_unregister:
for (i--; i >= 0; i--)
configfs_unregister_subsystem(example_subsys[i]);
return ret;
}
static void __exit configfs_example_exit(void)
{
int i;
for (i = 0; example_subsys[i]; i++)
configfs_unregister_subsystem(example_subsys[i]);
}
module_init(configfs_example_init);
module_exit(configfs_example_exit);
MODULE_LICENSE("GPL");


@ -0,0 +1,446 @@
/*
* vim: noexpandtab ts=8 sts=0 sw=8:
*
* configfs_example_macros.c - This file is a demonstration module
* containing a number of configfs subsystems. It uses the helper
* macros defined by configfs.h
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public
* License as published by the Free Software Foundation; either
* version 2 of the License, or (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* General Public License for more details.
*
* You should have received a copy of the GNU General Public
* License along with this program; if not, write to the
* Free Software Foundation, Inc., 59 Temple Place - Suite 330,
* Boston, MA 021110-1307, USA.
*
* Based on sysfs:
* sysfs is Copyright (C) 2001, 2002, 2003 Patrick Mochel
*
* configfs Copyright (C) 2005 Oracle. All rights reserved.
*/
#include <linux/init.h>
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/configfs.h>
/*
* 01-childless
*
* This first example is a childless subsystem. It cannot create
* any config_items. It just has attributes.
*
* Note that we are enclosing the configfs_subsystem inside a container.
* This is not necessary if a subsystem has no attributes directly
* on the subsystem. See the next example, 02-simple-children, for
* such a subsystem.
*/
struct childless {
struct configfs_subsystem subsys;
int showme;
int storeme;
};
static inline struct childless *to_childless(struct config_item *item)
{
return item ? container_of(to_configfs_subsystem(to_config_group(item)), struct childless, subsys) : NULL;
}
CONFIGFS_ATTR_STRUCT(childless);
#define CHILDLESS_ATTR(_name, _mode, _show, _store) \
struct childless_attribute childless_attr_##_name = __CONFIGFS_ATTR(_name, _mode, _show, _store)
#define CHILDLESS_ATTR_RO(_name, _show) \
struct childless_attribute childless_attr_##_name = __CONFIGFS_ATTR_RO(_name, _show);
static ssize_t childless_showme_read(struct childless *childless,
char *page)
{
ssize_t pos;
pos = sprintf(page, "%d\n", childless->showme);
childless->showme++;
return pos;
}
static ssize_t childless_storeme_read(struct childless *childless,
char *page)
{
return sprintf(page, "%d\n", childless->storeme);
}
static ssize_t childless_storeme_write(struct childless *childless,
const char *page,
size_t count)
{
unsigned long tmp;
char *p = (char *) page;
tmp = simple_strtoul(p, &p, 10);
if (!p || (*p && (*p != '\n')))
return -EINVAL;
if (tmp > INT_MAX)
return -ERANGE;
childless->storeme = tmp;
return count;
}
static ssize_t childless_description_read(struct childless *childless,
char *page)
{
return sprintf(page,
"[01-childless]\n"
"\n"
"The childless subsystem is the simplest possible subsystem in\n"
"configfs. It does not support the creation of child config_items.\n"
"It only has a few attributes. In fact, it isn't much different\n"
"than a directory in /proc.\n");
}
CHILDLESS_ATTR_RO(showme, childless_showme_read);
CHILDLESS_ATTR(storeme, S_IRUGO | S_IWUSR, childless_storeme_read,
childless_storeme_write);
CHILDLESS_ATTR_RO(description, childless_description_read);
static struct configfs_attribute *childless_attrs[] = {
&childless_attr_showme.attr,
&childless_attr_storeme.attr,
&childless_attr_description.attr,
NULL,
};
CONFIGFS_ATTR_OPS(childless);
static struct configfs_item_operations childless_item_ops = {
.show_attribute = childless_attr_show,
.store_attribute = childless_attr_store,
};
static struct config_item_type childless_type = {
.ct_item_ops = &childless_item_ops,
.ct_attrs = childless_attrs,
.ct_owner = THIS_MODULE,
};
static struct childless childless_subsys = {
.subsys = {
.su_group = {
.cg_item = {
.ci_namebuf = "01-childless",
.ci_type = &childless_type,
},
},
},
};
/* ----------------------------------------------------------------- */
/*
* 02-simple-children
*
* This example merely has a simple one-attribute child. Note that
* there is no extra attribute structure, as the child's attribute is
* known from the get-go. Also, there is no container for the
* subsystem, as it has no attributes of its own.
*/
struct simple_child {
struct config_item item;
int storeme;
};
static inline struct simple_child *to_simple_child(struct config_item *item)
{
return item ? container_of(item, struct simple_child, item) : NULL;
}
static struct configfs_attribute simple_child_attr_storeme = {
.ca_owner = THIS_MODULE,
.ca_name = "storeme",
.ca_mode = S_IRUGO | S_IWUSR,
};
static struct configfs_attribute *simple_child_attrs[] = {
&simple_child_attr_storeme,
NULL,
};
static ssize_t simple_child_attr_show(struct config_item *item,
struct configfs_attribute *attr,
char *page)
{
ssize_t count;
struct simple_child *simple_child = to_simple_child(item);
count = sprintf(page, "%d\n", simple_child->storeme);
return count;
}
static ssize_t simple_child_attr_store(struct config_item *item,
struct configfs_attribute *attr,
const char *page, size_t count)
{
struct simple_child *simple_child = to_simple_child(item);
unsigned long tmp;
char *p = (char *) page;
tmp = simple_strtoul(p, &p, 10);
if (!p || (*p && (*p != '\n')))
return -EINVAL;
if (tmp > INT_MAX)
return -ERANGE;
simple_child->storeme = tmp;
return count;
}
static void simple_child_release(struct config_item *item)
{
kfree(to_simple_child(item));
}
static struct configfs_item_operations simple_child_item_ops = {
.release = simple_child_release,
.show_attribute = simple_child_attr_show,
.store_attribute = simple_child_attr_store,
};
static struct config_item_type simple_child_type = {
.ct_item_ops = &simple_child_item_ops,
.ct_attrs = simple_child_attrs,
.ct_owner = THIS_MODULE,
};
struct simple_children {
struct config_group group;
};
static inline struct simple_children *to_simple_children(struct config_item *item)
{
return item ? container_of(to_config_group(item), struct simple_children, group) : NULL;
}
static struct config_item *simple_children_make_item(struct config_group *group, const char *name)
{
struct simple_child *simple_child;
simple_child = kzalloc(sizeof(struct simple_child), GFP_KERNEL);
if (!simple_child)
return ERR_PTR(-ENOMEM);
config_item_init_type_name(&simple_child->item, name,
&simple_child_type);
simple_child->storeme = 0;
return &simple_child->item;
}
static struct configfs_attribute simple_children_attr_description = {
.ca_owner = THIS_MODULE,
.ca_name = "description",
.ca_mode = S_IRUGO,
};
static struct configfs_attribute *simple_children_attrs[] = {
&simple_children_attr_description,
NULL,
};
static ssize_t simple_children_attr_show(struct config_item *item,
struct configfs_attribute *attr,
char *page)
{
return sprintf(page,
"[02-simple-children]\n"
"\n"
"This subsystem allows the creation of child config_items. These\n"
"items have only one attribute that is readable and writeable.\n");
}
static void simple_children_release(struct config_item *item)
{
kfree(to_simple_children(item));
}
static struct configfs_item_operations simple_children_item_ops = {
.release = simple_children_release,
.show_attribute = simple_children_attr_show,
};
/*
* Note that, since no extra work is required on ->drop_item(),
* no ->drop_item() is provided.
*/
static struct configfs_group_operations simple_children_group_ops = {
.make_item = simple_children_make_item,
};
static struct config_item_type simple_children_type = {
.ct_item_ops = &simple_children_item_ops,
.ct_group_ops = &simple_children_group_ops,
.ct_attrs = simple_children_attrs,
.ct_owner = THIS_MODULE,
};
static struct configfs_subsystem simple_children_subsys = {
.su_group = {
.cg_item = {
.ci_namebuf = "02-simple-children",
.ci_type = &simple_children_type,
},
},
};
/* ----------------------------------------------------------------- */
/*
* 03-group-children
*
* This example reuses the simple_children group from above. However,
* the simple_children group is not the subsystem itself, it is a
* child of the subsystem. Creation of a group in the subsystem creates
* a new simple_children group. That group can then have simple_child
* children of its own.
*/
static struct config_group *group_children_make_group(struct config_group *group, const char *name)
{
struct simple_children *simple_children;
simple_children = kzalloc(sizeof(struct simple_children),
GFP_KERNEL);
if (!simple_children)
return ERR_PTR(-ENOMEM);
config_group_init_type_name(&simple_children->group, name,
&simple_children_type);
return &simple_children->group;
}
static struct configfs_attribute group_children_attr_description = {
.ca_owner = THIS_MODULE,
.ca_name = "description",
.ca_mode = S_IRUGO,
};
static struct configfs_attribute *group_children_attrs[] = {
&group_children_attr_description,
NULL,
};
static ssize_t group_children_attr_show(struct config_item *item,
struct configfs_attribute *attr,
char *page)
{
return sprintf(page,
"[03-group-children]\n"
"\n"
"This subsystem allows the creation of child config_groups. These\n"
"groups are like the subsystem simple-children.\n");
}
static struct configfs_item_operations group_children_item_ops = {
.show_attribute = group_children_attr_show,
};
/*
* Note that, since no extra work is required on ->drop_item(),
* no ->drop_item() is provided.
*/
static struct configfs_group_operations group_children_group_ops = {
.make_group = group_children_make_group,
};
static struct config_item_type group_children_type = {
.ct_item_ops = &group_children_item_ops,
.ct_group_ops = &group_children_group_ops,
.ct_attrs = group_children_attrs,
.ct_owner = THIS_MODULE,
};
static struct configfs_subsystem group_children_subsys = {
.su_group = {
.cg_item = {
.ci_namebuf = "03-group-children",
.ci_type = &group_children_type,
},
},
};
/* ----------------------------------------------------------------- */
/*
* We're now done with our subsystem definitions.
* For convenience in this module, here's a list of them all. It
* allows the init function to easily register them. Most modules
* will only have one subsystem, and will only call register_subsystem
* on it directly.
*/
static struct configfs_subsystem *example_subsys[] = {
&childless_subsys.subsys,
&simple_children_subsys,
&group_children_subsys,
NULL,
};
static int __init configfs_example_init(void)
{
int ret;
int i;
struct configfs_subsystem *subsys;
for (i = 0; example_subsys[i]; i++) {
subsys = example_subsys[i];
config_group_init(&subsys->su_group);
mutex_init(&subsys->su_mutex);
ret = configfs_register_subsystem(subsys);
if (ret) {
printk(KERN_ERR "Error %d while registering subsystem %s\n",
ret,
subsys->su_group.cg_item.ci_namebuf);
goto out_unregister;
}
}
return 0;
out_unregister:
for (i--; i >= 0; i--)
configfs_unregister_subsystem(example_subsys[i]);
return ret;
}
static void __exit configfs_example_exit(void)
{
int i;
for (i = 0; example_subsys[i]; i++)
configfs_unregister_subsystem(example_subsys[i]);
}
module_init(configfs_example_init);
module_exit(configfs_example_exit);
MODULE_LICENSE("GPL");


@ -0,0 +1,76 @@
Cramfs - cram a filesystem onto a small ROM
cramfs is designed to be simple and small, and to compress things well.
It uses the zlib routines to compress a file one page at a time, and
allows random page access. The meta-data is not compressed, but is
expressed in a very terse representation to make it use much less
diskspace than traditional filesystems.
You can't write to a cramfs filesystem (making it compressible and
compact also makes it _very_ hard to update on-the-fly), so you have to
create the disk image with the "mkcramfs" utility.
Usage Notes
-----------
File sizes are limited to less than 16MB.
Maximum filesystem size is a little over 256MB. (The last file on the
filesystem is allowed to extend past 256MB.)
Only the low 8 bits of gid are stored. The current version of
mkcramfs simply truncates to 8 bits, which is a potential security
issue.
Hard links are supported, but hard linked files
will still have a link count of 1 in the cramfs image.
Cramfs directories have no `.' or `..' entries. Directories (like
every other file on cramfs) always have a link count of 1. (There's
no need to use -noleaf in `find', btw.)
No timestamps are stored in a cramfs, so these default to the epoch
(1970 GMT). Recently-accessed files may have updated timestamps, but
the update lasts only as long as the inode is cached in memory, after
which the timestamp reverts to 1970, i.e. moves backwards in time.
Currently, cramfs must be written and read with architectures of the
same endianness, and can be read only by kernels with PAGE_CACHE_SIZE
== 4096. At least the latter of these is a bug, but it hasn't been
decided what the best fix is. For the moment if you have larger pages
you can just change the #define in mkcramfs.c, so long as you don't
mind the filesystem becoming unreadable to future kernels.
For /usr/share/magic
--------------------
0 ulelong 0x28cd3d45 Linux cramfs offset 0
>4 ulelong x size %d
>8 ulelong x flags 0x%x
>12 ulelong x future 0x%x
>16 string >\0 signature "%.16s"
>32 ulelong x fsid.crc 0x%x
>36 ulelong x fsid.edition %d
>40 ulelong x fsid.blocks %d
>44 ulelong x fsid.files %d
>48 string >\0 name "%.16s"
512 ulelong 0x28cd3d45 Linux cramfs offset 512
>516 ulelong x size %d
>520 ulelong x flags 0x%x
>524 ulelong x future 0x%x
>528 string >\0 signature "%.16s"
>544 ulelong x fsid.crc 0x%x
>548 ulelong x fsid.edition %d
>552 ulelong x fsid.blocks %d
>556 ulelong x fsid.files %d
>560 string >\0 name "%.16s"
Hacker Notes
------------
See fs/cramfs/README for filesystem layout and implementation notes.


@ -0,0 +1,191 @@
Copyright 2009 Jonathan Corbet <corbet@lwn.net>
Debugfs exists as a simple way for kernel developers to make information
available to user space. Unlike /proc, which is only meant for information
about a process, or sysfs, which has strict one-value-per-file rules,
debugfs has no rules at all. Developers can put any information they want
there. The debugfs filesystem is also intended to not serve as a stable
ABI to user space; in theory, there are no stability constraints placed on
files exported there. The real world is not always so simple, though [1];
even debugfs interfaces are best designed with the idea that they will need
to be maintained forever.
Debugfs is typically mounted with a command like:
mount -t debugfs none /sys/kernel/debug
(Or an equivalent /etc/fstab line).
The debugfs root directory is accessible only to the root user by
default. To change access to the tree the "uid", "gid" and "mode" mount
options can be used.
Note that the debugfs API is exported GPL-only to modules.
Code using debugfs should include <linux/debugfs.h>. Then, the first order
of business will be to create at least one directory to hold a set of
debugfs files:
struct dentry *debugfs_create_dir(const char *name, struct dentry *parent);
This call, if successful, will make a directory called name underneath the
indicated parent directory. If parent is NULL, the directory will be
created in the debugfs root. On success, the return value is a struct
dentry pointer which can be used to create files in the directory (and to
clean it up at the end). A NULL return value indicates that something went
wrong. If ERR_PTR(-ENODEV) is returned, that is an indication that the
kernel has been built without debugfs support and none of the functions
described below will work.
The most general way to create a file within a debugfs directory is with:
struct dentry *debugfs_create_file(const char *name, umode_t mode,
struct dentry *parent, void *data,
const struct file_operations *fops);
Here, name is the name of the file to create, mode describes the access
permissions the file should have, parent indicates the directory which
should hold the file, data will be stored in the i_private field of the
resulting inode structure, and fops is a set of file operations which
implement the file's behavior. At a minimum, the read() and/or write()
operations should be provided; others can be included as needed. Again,
the return value will be a dentry pointer to the created file, NULL for
error, or ERR_PTR(-ENODEV) if debugfs support is missing.
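As a rough sketch (the mydrv names are invented and error handling is reduced
to the essentials), a module might create a directory and a single
seq_file-backed file like this:

	#include <linux/debugfs.h>
	#include <linux/err.h>
	#include <linux/fs.h>
	#include <linux/module.h>
	#include <linux/seq_file.h>

	static struct dentry *mydrv_dir;

	static int mydrv_state_show(struct seq_file *s, void *unused)
	{
		seq_printf(s, "state: %d\n", 42);	/* illustrative value */
		return 0;
	}

	static int mydrv_state_open(struct inode *inode, struct file *file)
	{
		return single_open(file, mydrv_state_show, inode->i_private);
	}

	static const struct file_operations mydrv_state_fops = {
		.owner		= THIS_MODULE,
		.open		= mydrv_state_open,
		.read		= seq_read,
		.llseek		= seq_lseek,
		.release	= single_release,
	};

	static int __init mydrv_debugfs_init(void)
	{
		mydrv_dir = debugfs_create_dir("mydrv", NULL);
		if (IS_ERR_OR_NULL(mydrv_dir))	/* no debugfs, or creation failed */
			return -ENODEV;

		debugfs_create_file("state", S_IRUGO, mydrv_dir, NULL,
				    &mydrv_state_fops);
		return 0;
	}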
In a number of cases, the creation of a set of file operations is not
actually necessary; the debugfs code provides a number of helper functions
for simple situations. Files containing a single integer value can be
created with any of:
struct dentry *debugfs_create_u8(const char *name, umode_t mode,
struct dentry *parent, u8 *value);
struct dentry *debugfs_create_u16(const char *name, umode_t mode,
struct dentry *parent, u16 *value);
struct dentry *debugfs_create_u32(const char *name, umode_t mode,
struct dentry *parent, u32 *value);
struct dentry *debugfs_create_u64(const char *name, umode_t mode,
struct dentry *parent, u64 *value);
These files support both reading and writing the given value; if a specific
file should not be written to, simply set the mode bits accordingly. The
values in these files are in decimal; if hexadecimal is more appropriate,
the following functions can be used instead:
struct dentry *debugfs_create_x8(const char *name, umode_t mode,
struct dentry *parent, u8 *value);
struct dentry *debugfs_create_x16(const char *name, umode_t mode,
struct dentry *parent, u16 *value);
struct dentry *debugfs_create_x32(const char *name, umode_t mode,
struct dentry *parent, u32 *value);
struct dentry *debugfs_create_x64(const char *name, umode_t mode,
struct dentry *parent, u64 *value);
These functions are useful as long as the developer knows the size of the
value to be exported. Some types can have different widths on different
architectures, though, complicating the situation somewhat. There is a
function meant to help out in one special case:
struct dentry *debugfs_create_size_t(const char *name, umode_t mode,
struct dentry *parent,
size_t *value);
As might be expected, this function will create a debugfs file to represent
a variable of type size_t.
Boolean values can be placed in debugfs with:
struct dentry *debugfs_create_bool(const char *name, umode_t mode,
struct dentry *parent, u32 *value);
A read on the resulting file will yield either Y (for non-zero values) or
N, followed by a newline. If written to, it will accept either upper- or
lower-case values, or 1 or 0. Any other input will be silently ignored.
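Continuing the hypothetical mydrv example above, the single-value helpers can
be pointed directly at module variables; no file_operations are needed:

	static u32 mydrv_retries;
	static u32 mydrv_verbose;	/* exposed as a boolean: Y/N from user space */

	static void mydrv_create_values(void)
	{
		debugfs_create_u32("retries", S_IRUGO | S_IWUSR, mydrv_dir,
				   &mydrv_retries);
		debugfs_create_bool("verbose", S_IRUGO | S_IWUSR, mydrv_dir,
				    &mydrv_verbose);
	}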
Another option is exporting a block of arbitrary binary data, with
this structure and function:
struct debugfs_blob_wrapper {
void *data;
unsigned long size;
};
struct dentry *debugfs_create_blob(const char *name, umode_t mode,
struct dentry *parent,
struct debugfs_blob_wrapper *blob);
A read of this file will return the data pointed to by the
debugfs_blob_wrapper structure. Some drivers use "blobs" as a simple way
to return several lines of (static) formatted text output. This function
can be used to export binary information, but there does not appear to be
any code which does so in the mainline. Note that all files created with
debugfs_create_blob() are read-only.
If you want to dump a block of registers (something that happens quite
often during development, even if little such code reaches mainline),
debugfs offers two functions: one to make a registers-only file, and
another to insert a register block in the middle of another sequential
file.
struct debugfs_reg32 {
char *name;
unsigned long offset;
};
struct debugfs_regset32 {
struct debugfs_reg32 *regs;
int nregs;
void __iomem *base;
};
struct dentry *debugfs_create_regset32(const char *name, umode_t mode,
struct dentry *parent,
struct debugfs_regset32 *regset);
int debugfs_print_regs32(struct seq_file *s, struct debugfs_reg32 *regs,
int nregs, void __iomem *base, char *prefix);
The "base" argument may be 0, but you may want to build the reg32 array
using __stringify if a number of register names (macros) are actually
byte offsets over a base for the register block.
There are a couple of other directory-oriented helper functions:
struct dentry *debugfs_rename(struct dentry *old_dir,
struct dentry *old_dentry,
struct dentry *new_dir,
const char *new_name);
struct dentry *debugfs_create_symlink(const char *name,
struct dentry *parent,
const char *target);
A call to debugfs_rename() will give a new name to an existing debugfs
file, possibly in a different directory. The new_name must not exist prior
to the call; the return value is old_dentry with updated information.
Symbolic links can be created with debugfs_create_symlink().
There is one important thing that all debugfs users must take into account:
there is no automatic cleanup of any directories created in debugfs. If a
module is unloaded without explicitly removing debugfs entries, the result
will be a lot of stale pointers and no end of highly antisocial behavior.
So all debugfs users - at least those which can be built as modules - must
be prepared to remove all files and directories they create there. A file
can be removed with:
void debugfs_remove(struct dentry *dentry);
The dentry value can be NULL, in which case nothing will be removed.
Once upon a time, debugfs users were required to remember the dentry
pointer for every debugfs file they created so that all files could be
cleaned up. We live in more civilized times now, though, and debugfs users
can call:
void debugfs_remove_recursive(struct dentry *dentry);
If this function is passed a pointer for the dentry corresponding to the
top-level directory, the entire hierarchy below that directory will be
removed.
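Continuing the hypothetical mydrv example, the module exit path might simply
be:

	static void __exit mydrv_debugfs_exit(void)
	{
		/* Removes "mydrv" and everything created underneath it. */
		debugfs_remove_recursive(mydrv_dir);
	}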
Notes:
[1] http://lwn.net/Articles/309298/


@ -0,0 +1,132 @@
To support containers, we now allow multiple instances of devpts filesystem,
such that indices of ptys allocated in one instance are independent of indices
allocated in other instances of devpts.
To preserve backward compatibility, this support for multiple instances is
enabled only if:
- CONFIG_DEVPTS_MULTIPLE_INSTANCES=y, and
- '-o newinstance' mount option is specified while mounting devpts
IOW, devpts now supports both single-instance and multi-instance semantics.
If CONFIG_DEVPTS_MULTIPLE_INSTANCES=n, there is no change in behavior and
this is referred to as the "legacy" mode. In this mode, the new mount options
(-o newinstance and -o ptmxmode) will be ignored with a 'bogus option' message
on console.
If CONFIG_DEVPTS_MULTIPLE_INSTANCES=y and devpts is mounted without the
'newinstance' option (as in current start-up scripts) the new mount binds
to the initial kernel mount of devpts. This mode is referred to as the
'single-instance' mode and the current, single-instance semantics are
preserved, i.e PTYs are common across the system.
The only difference between this single-instance mode and the legacy mode
is the presence of new, '/dev/pts/ptmx' node with permissions 0000, which
can safely be ignored.
If CONFIG_DEVPTS_MULTIPLE_INSTANCES=y and 'newinstance' option is specified,
the mount is considered to be in the multi-instance mode and a new instance
of the devpts fs is created. Any ptys created in this instance are independent
of ptys in other instances of devpts. Like in the single-instance mode, the
/dev/pts/ptmx node is present. To effectively use the multi-instance mode,
open of /dev/ptmx must be redirected to '/dev/pts/ptmx' using a symlink or
bind-mount.
Eg: A container startup script could do the following:
$ chmod 0666 /dev/pts/ptmx
$ rm /dev/ptmx
$ ln -s pts/ptmx /dev/ptmx
$ ns_exec -cm /bin/bash
# We are now in new container
$ umount /dev/pts
$ mount -t devpts -o newinstance lxcpts /dev/pts
$ sshd -p 1234
where 'ns_exec -cm /bin/bash' calls clone() with CLONE_NEWNS flag and execs
/bin/bash in the child process. A pty created by the sshd is not visible in
the original mount of /dev/pts.
User-space changes
------------------
In multi-instance mode (i.e. the '-o newinstance' mount option is specified at least
once), the following user-space issues should be noted.
1. If -o newinstance mount option is never used, /dev/pts/ptmx can be ignored
and no change is needed to system-startup scripts.
2. To effectively use multi-instance mode (i.e -o newinstance is specified)
administrators or startup scripts should "redirect" open of /dev/ptmx to
/dev/pts/ptmx using either a bind mount or symlink.
$ mount -t devpts -o newinstance devpts /dev/pts
followed by either
$ rm /dev/ptmx
$ ln -s pts/ptmx /dev/ptmx
$ chmod 666 /dev/pts/ptmx
or
$ mount -o bind /dev/pts/ptmx /dev/ptmx
3. The '/dev/ptmx -> pts/ptmx' symlink is the preferred method since it
enables better error-reporting and treats both single-instance and
multi-instance mounts similarly.
But this method requires that system-startup scripts set the mode of
/dev/pts/ptmx correctly (default mode is 0000). The scripts can set the
mode by, either
- adding ptmxmode mount option to devpts entry in /etc/fstab, or
- using 'chmod 0666 /dev/pts/ptmx'
4. If multi-instance mode mount is needed for containers, but the system
startup scripts have not yet been updated, container-startup scripts
should bind mount /dev/ptmx to /dev/pts/ptmx to avoid breaking single-
instance mounts.
Or, in general, container-startup scripts should use:
mount -t devpts -o newinstance -o ptmxmode=0666 devpts /dev/pts
if [ ! -L /dev/ptmx ]; then
mount -o bind /dev/pts/ptmx /dev/ptmx
fi
When all devpts mounts are multi-instance, /dev/ptmx can permanently be
a symlink to pts/ptmx and the bind mount can be ignored.
5. A multi-instance mount that is not accompanied by the /dev/ptmx to
/dev/pts/ptmx redirection would result in an unusable/unreachable pty.
mount -t devpts -o newinstance lxcpts /dev/pts
immediately followed by:
open("/dev/ptmx")
would create a pty, say /dev/pts/7, in the initial kernel mount.
But /dev/pts/7 would be invisible in the new mount.
6. The permissions for /dev/pts/ptmx node should be specified when mounting
/dev/pts, using the '-o ptmxmode=%o' mount option (default is 0000).
mount -t devpts -o newinstance -o ptmxmode=0644 devpts /dev/pts
The permissions can be later be changed as usual with 'chmod'.
chmod 666 /dev/pts/ptmx
7. A mount of devpts without the 'newinstance' option results in binding to
the initial kernel mount. This behavior, while preserving legacy semantics,
does not provide strict isolation in a container environment, i.e. by
mounting devpts without the 'newinstance' option, a container could
get visibility into the 'host' or root container's devpts.
To work around this and have strict isolation, all mounts of devpts,
including the mount in the root container, should use the newinstance
option.


@ -0,0 +1,127 @@
Locking scheme used for directory operations is based on two
kinds of locks - per-inode (->i_mutex) and per-filesystem
(->s_vfs_rename_mutex).
When taking the i_mutex on multiple non-directory objects, we
always acquire the locks in order by increasing address. We'll call
that "inode pointer" order in the following.
For our purposes all operations fall in 6 classes:
1) read access. Locking rules: caller locks directory we are accessing.
2) object creation. Locking rules: same as above.
3) object removal. Locking rules: caller locks parent, finds victim,
locks victim and calls the method.
4) rename() that is _not_ cross-directory. Locking rules: caller locks
the parent and finds source and target. If target already exists, lock
it. If source is a non-directory, lock it. If that means we need to
lock both, lock them in inode pointer order.
5) link creation. Locking rules:
* lock parent
* check that source is not a directory
* lock source
* call the method.
6) cross-directory rename. The trickiest in the whole bunch. Locking
rules:
* lock the filesystem
* lock parents in "ancestors first" order.
* find source and target.
* if old parent is equal to or is a descendent of target
fail with -ENOTEMPTY
* if new parent is equal to or is a descendent of source
fail with -ELOOP
* If target exists, lock it. If source is a non-directory, lock
it. In case that means we need to lock both source and target,
do so in inode pointer order.
* call the method.
The rules above obviously guarantee that all directories that are going to be
read, modified or removed by method will be locked by caller.
If no directory is its own ancestor, the scheme above is deadlock-free.
Proof:
First of all, at any moment we have a partial ordering of the
objects - A < B iff A is an ancestor of B.
That ordering can change. However, the following is true:
(1) if object removal or non-cross-directory rename holds lock on A and
attempts to acquire lock on B, A will remain the parent of B until we
acquire the lock on B. (Proof: only cross-directory rename can change
the parent of object and it would have to lock the parent).
(2) if cross-directory rename holds the lock on filesystem, order will not
change until rename acquires all locks. (Proof: other cross-directory
renames will be blocked on filesystem lock and we don't start changing
the order until we had acquired all locks).
(3) locks on non-directory objects are acquired only after locks on
directory objects, and are acquired in inode pointer order.
(Proof: all operations other than rename take a lock on at most one
non-directory object; renames take locks on source and
target in inode pointer order in the case they are not directories.)
Now consider the minimal deadlock. Each process is blocked on
attempt to acquire some lock and already holds at least one lock. Let's
consider the set of contended locks. First of all, filesystem lock is
not contended, since any process blocked on it is not holding any locks.
Thus all processes are blocked on ->i_mutex.
By (3), any process holding a non-directory lock can only be
waiting on another non-directory lock with a larger address. Therefore
the process holding the "largest" such lock can always make progress, and
non-directory objects are not included in the set of contended locks.
Thus link creation can't be a part of deadlock - it can't be
blocked on source and it means that it doesn't hold any locks.
Any contended object is either held by cross-directory rename or
has a child that is also contended. Indeed, suppose that it is held by
operation other than cross-directory rename. Then the lock this operation
is blocked on belongs to child of that object due to (1).
It means that one of the operations is cross-directory rename.
Otherwise the set of contended objects would be infinite - each of them
would have a contended child and we had assumed that no object is its
own descendent. Moreover, there is exactly one cross-directory rename
(see above).
Consider the object blocking the cross-directory rename. One
of its descendents is locked by cross-directory rename (otherwise we
would again have an infinite set of contended objects). But that
means that cross-directory rename is taking locks out of order. Due
to (2) the order hadn't changed since we had acquired filesystem lock.
But locking rules for cross-directory rename guarantee that we do not
try to acquire lock on descendent before the lock on ancestor.
Contradiction. I.e. deadlock is impossible. Q.E.D.
These operations are guaranteed to avoid loop creation. Indeed,
the only operation that could introduce loops is cross-directory rename.
Since the only new (parent, child) pair added by rename() is (new parent,
source), such loop would have to contain these objects and the rest of it
would have to exist before rename(). I.e. at the moment of loop creation
rename() responsible for that would be holding filesystem lock and new parent
would have to be equal to or a descendent of source. But that means that
new parent had been equal to or a descendent of source since the moment when
we had acquired filesystem lock and rename() would fail with -ELOOP in that
case.
While this locking scheme works for arbitrary DAGs, it relies on
ability to check that directory is a descendent of another object. Current
implementation assumes that directory graph is a tree. This assumption is
also preserved by all operations (cross-directory rename on a tree that would
not introduce a cycle will leave it a tree and link() fails for directories).
Notice that "directory" in the above == "anything that might have
children", so if we are going to introduce hybrid objects we will need
either to make sure that link(2) doesn't work for them or to make changes
in is_subdir() that would make it work even in presence of such beasts.


@ -0,0 +1,130 @@
dlmfs
==================
A minimal DLM userspace interface implemented via a virtual file
system.
dlmfs is built with OCFS2 as it requires most of its infrastructure.
Project web page: http://oss.oracle.com/projects/ocfs2
Tools web page: http://oss.oracle.com/projects/ocfs2-tools
OCFS2 mailing lists: http://oss.oracle.com/projects/ocfs2/mailman/
All code copyright 2005 Oracle except when otherwise noted.
CREDITS
=======
Some code taken from ramfs which is Copyright (C) 2000 Linus Torvalds
and Transmeta Corp.
Mark Fasheh <mark.fasheh@oracle.com>
Caveats
=======
- Right now it only works with the OCFS2 DLM, though support for other
DLM implementations should not be a major issue.
Mount options
=============
None
Usage
=====
If you're just interested in OCFS2, then please see ocfs2.txt. The
rest of this document will be geared towards those who want to use
dlmfs for easy to setup and easy to use clustered locking in
userspace.
Setup
=====
dlmfs requires that the OCFS2 cluster infrastructure be in
place. Please download ocfs2-tools from the above url and configure a
cluster.
You'll want to start heartbeating on a volume which all the nodes in
your lockspace can access. The easiest way to do this is via
ocfs2_hb_ctl (distributed with ocfs2-tools). Right now it requires
that an OCFS2 file system be in place so that it can automatically
find its heartbeat area, though it will eventually support heartbeat
against raw disks.
Please see the ocfs2_hb_ctl and mkfs.ocfs2 manual pages distributed
with ocfs2-tools.
Once you're heartbeating, DLM lock 'domains' can be easily created /
destroyed and locks within them accessed.
Locking
=======
Users may access dlmfs via standard file system calls, or they can use
'libo2dlm' (distributed with ocfs2-tools) which abstracts the file
system calls and presents a more traditional locking api.
dlmfs handles lock caching automatically for the user, so a lock
request for an already acquired lock will not generate another DLM
call. Userspace programs are assumed to handle their own local
locking.
Two levels of locks are supported - Shared Read, and Exclusive.
Also supported is a Trylock operation.
For information on the libo2dlm interface, please see o2dlm.h,
distributed with ocfs2-tools.
Lock value blocks can be read and written to a resource via read(2)
and write(2) against the fd obtained via your open(2) call. The
maximum currently supported LVB length is 64 bytes (though that is an
OCFS2 DLM limitation). Through this mechanism, users of dlmfs can share
small amounts of data amongst their nodes.
mkdir(2) signals dlmfs to join a domain (which will have the same name
as the resulting directory)
rmdir(2) signals dlmfs to leave the domain
Locks for a given domain are represented by regular inodes inside the
domain directory. Locking against them is done via the open(2) system
call.
The open(2) call will not return until your lock has been granted or
an error has occurred, unless it has been instructed to do a trylock
operation. If the lock succeeds, you'll get an fd.
Use open(2) with O_CREAT to ensure the resource inode is created - dlmfs does
not automatically create inodes for existing lock resources.
Open Flag Lock Request Type
--------- -----------------
O_RDONLY Shared Read
O_RDWR Exclusive
Open Flag Resulting Locking Behavior
--------- --------------------------
O_NONBLOCK Trylock operation
You must provide exactly one of O_RDONLY or O_RDWR.
If O_NONBLOCK is also provided and the trylock operation was valid but
the lock could not be granted, then open(2) will return ETXTBSY.
close(2) drops the lock associated with your fd.
Modes passed to mkdir(2) or open(2) are adhered to locally. Chown is
supported locally as well. This means you can use them to restrict
access to the resources via dlmfs on your local node only.
The resource LVB may be read from the fd in either Shared Read or
Exclusive modes via the read(2) system call. It can be written via
write(2) only when open in Exclusive mode.
Once written, an LVB will be visible to other nodes that obtain Read
Only or higher level locks on the resource.
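To make the calls described above concrete, here is a minimal sketch (an
illustrative addition, not part of the original text) that takes an
Exclusive lock on a resource, reads and updates its LVB, and drops the
lock on close. It assumes dlmfs is mounted at a hypothetical /dlm and
that the domain directory was created beforehand with mkdir(2); the
domain and resource names are invented for the example.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char lvb[64];	/* maximum LVB size supported by the OCFS2 DLM */
	int fd;

	/* O_RDWR requests an Exclusive lock; O_NONBLOCK makes it a trylock.
	 * The mount point, domain and resource names are hypothetical. */
	fd = open("/dlm/mydomain/myresource",
		  O_RDWR | O_CREAT | O_NONBLOCK, 0600);
	if (fd < 0) {
		if (errno == ETXTBSY)
			fprintf(stderr, "trylock failed: resource is busy\n");
		else
			perror("open");
		return 1;
	}

	/* Read the current lock value block. */
	if (read(fd, lvb, sizeof(lvb)) < 0)
		perror("read");

	/* Rewind and update the LVB; writing is only allowed while
	 * holding the Exclusive lock. */
	lseek(fd, 0, SEEK_SET);
	memset(lvb, 0, sizeof(lvb));
	strncpy(lvb, "hello from this node", sizeof(lvb) - 1);
	if (write(fd, lvb, sizeof(lvb)) < 0)
		perror("write");

	/* close(2) drops the lock. */
	close(fd);
	return 0;
}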
See Also
========
http://opendlm.sourceforge.net/cvsmirror/opendlm/docs/dlmbook_final.pdf
For more information on the VMS distributed locking API.


@ -0,0 +1,70 @@
Linux Directory Notification
============================
Stephen Rothwell <sfr@canb.auug.org.au>
The intention of directory notification is to allow user applications
to be notified when a directory, or any of the files in it, are changed.
The basic mechanism involves the application registering for notification
on a directory using a fcntl(2) call and the notifications themselves
being delivered using signals.
The application decides which "events" it wants to be notified about.
The currently defined events are:
DN_ACCESS A file in the directory was accessed (read)
DN_MODIFY A file in the directory was modified (write,truncate)
DN_CREATE A file was created in the directory
DN_DELETE A file was unlinked from directory
DN_RENAME A file in the directory was renamed
DN_ATTRIB A file in the directory had its attributes
changed (chmod,chown)
Usually, the application must reregister after each notification, but
if DN_MULTISHOT is or'ed with the event mask, then the registration will
remain until explicitly removed (by registering for no events).
By default, SIGIO will be delivered to the process and no other useful
information. However, if the F_SETSIG fcntl(2) call is used to let the
kernel know which signal to deliver, a siginfo structure will be passed to
the signal handler and the si_fd member of that structure will contain the
file descriptor associated with the directory in which the event occurred.
Preferably the application will choose one of the real time signals
(SIGRTMIN + <n>) so that the notifications may be queued. This is
especially important if DN_MULTISHOT is specified. Note that SIGRTMIN
is often blocked, so it is better to use (at least) SIGRTMIN + 1.
Implementation expectations (features and bugs :-))
---------------------------
The notification should work for any local access to files even if the
actual file system is on a remote server. This implies that remote
access to files served by local user mode servers should be notified.
Also, remote accesses to files served by a local kernel NFS server should
be notified.
In order to make the impact on the file system code as small as possible,
the problem of hard links to files has been ignored. So if a file (x)
exists in two directories (a and b) then a change to the file using the
name "a/x" should be notified to a program expecting notifications on
directory "a", but will not be notified to one expecting notifications on
directory "b".
Also, files that are unlinked will still cause notifications in the
last directory that they were linked to.
Configuration
-------------
Dnotify is controlled via the CONFIG_DNOTIFY configuration option. When
disabled, fcntl(fd, F_NOTIFY, ...) will return -EINVAL.
Example
-------
See Documentation/filesystems/dnotify_test.c for an example.
NOTE
----
Beginning with Linux 2.6.13, dnotify has been replaced by inotify.
See Documentation/filesystems/inotify.txt for more information on it.


@ -0,0 +1,34 @@
#define _GNU_SOURCE	/* needed to get the defines */
#include <fcntl.h>	/* in glibc 2.2 this has the needed
			   values defined */
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile int event_fd;

static void handler(int sig, siginfo_t *si, void *data)
{
	event_fd = si->si_fd;
}

int main(void)
{
	struct sigaction act;
	int fd;

	act.sa_sigaction = handler;
	sigemptyset(&act.sa_mask);
	act.sa_flags = SA_SIGINFO;
	sigaction(SIGRTMIN + 1, &act, NULL);

	fd = open(".", O_RDONLY);
	fcntl(fd, F_SETSIG, SIGRTMIN + 1);
	fcntl(fd, F_NOTIFY, DN_MODIFY|DN_CREATE|DN_MULTISHOT);
	/* we will now be notified if any of the files
	   in "." is modified or new files are created */
	while (1) {
		pause();
		printf("Got event on fd=%d\n", event_fd);
	}
}


@ -0,0 +1,77 @@
eCryptfs: A stacked cryptographic filesystem for Linux
eCryptfs is free software. Please see the file COPYING for details.
For documentation, please see the files in the doc/ subdirectory. For
building and installation instructions please see the INSTALL file.
Maintainer: Phillip Hellewell
Lead developer: Michael A. Halcrow <mhalcrow@us.ibm.com>
Developers: Michael C. Thompson
Kent Yoder
Web Site: http://ecryptfs.sf.net
This software is currently undergoing development. Make sure to
maintain a backup copy of any data you write into eCryptfs.
eCryptfs requires the userspace tools downloadable from the
SourceForge site:
http://sourceforge.net/projects/ecryptfs/
Userspace requirements include:
- David Howells' userspace keyring headers and libraries (version
1.0 or higher), obtainable from
http://people.redhat.com/~dhowells/keyutils/
- Libgcrypt
NOTES
In the beta/experimental releases of eCryptfs, when you upgrade
eCryptfs, you should copy the files to an unencrypted location and
then copy the files back into the new eCryptfs mount to migrate the
files.
MOUNT-WIDE PASSPHRASE
Create a new directory into which eCryptfs will write its encrypted
files (e.g., /root/crypt). Then, create the mount point directory
(e.g., /mnt/crypt). Now it's time to mount eCryptfs:
mount -t ecryptfs /root/crypt /mnt/crypt
You should be prompted for a passphrase and a salt (the salt may be
blank).
Try writing a new file:
echo "Hello, World" > /mnt/crypt/hello.txt
The operation will complete. Notice that there is a new file in
/root/crypt that is at least 12288 bytes in size (depending on your
host page size). This is the encrypted underlying file for what you
just wrote. To test reading, from start to finish, you need to clear
the user session keyring:
keyctl clear @u
Then umount /mnt/crypt and mount again per the instructions given
above.
cat /mnt/crypt/hello.txt
NOTES
eCryptfs version 0.1 should only be mounted on (1) empty directories
or (2) directories containing files only created by eCryptfs. If you
mount a directory that has pre-existing files not created by eCryptfs,
then behavior is undefined. Do not run eCryptfs in higher verbosity
levels unless you are doing so for the sole purpose of debugging or
development, since secret values will be written out to the system log
in that case.
Mike Halcrow
mhalcrow@us.ibm.com


@ -0,0 +1,16 @@
efivarfs - a (U)EFI variable filesystem
The efivarfs filesystem was created to address the shortcomings of
using entries in sysfs to maintain EFI variables. The old sysfs EFI
variables code only supported variables of up to 1024 bytes. This
limitation existed in version 0.99 of the EFI specification, but was
removed before any full releases. Since variables can now be larger
than a single page, sysfs isn't the best interface for this.
Variables can be created, deleted and modified with the efivarfs
filesystem.
efivarfs is typically mounted like this,
mount -t efivarfs none /sys/firmware/efi/efivars
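As a small illustration (an addition, not from the original text), the
sketch below reads one EFI variable through efivarfs. It assumes the
filesystem is mounted at /sys/firmware/efi/efivars as shown above, and
that each variable file starts with a 4-byte attributes word followed by
the raw variable data; the variable name and GUID are just an example.

#include <stdint.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	uint32_t attributes;
	unsigned char data[1024];
	ssize_t n;
	int fd;

	/* Example variable: BootOrder under the EFI global variable GUID. */
	fd = open("/sys/firmware/efi/efivars/"
		  "BootOrder-8be4df61-93ca-11d2-aa0d-00e098032b8c", O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* The first four bytes hold the variable's attribute flags. */
	if (read(fd, &attributes, sizeof(attributes)) != sizeof(attributes)) {
		perror("read attributes");
		close(fd);
		return 1;
	}

	/* The remainder is the variable data itself. */
	n = read(fd, data, sizeof(data));
	printf("attributes=0x%x, %zd bytes of data\n",
	       (unsigned)attributes, n);

	close(fd);
	return 0;
}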


@ -0,0 +1,185 @@
===============================================================================
WHAT IS EXOFS?
===============================================================================
exofs is a file system that uses an OSD and exports the API of a normal Linux
file system. Users access exofs like any other local file system, and exofs
will in turn issue commands to the local OSD initiator.
OSD is a new T10 command set that views storage devices not as a large/flat
array of sectors but as a container of objects, each having a length, quota,
time attributes and more. Each object is addressed by a 64bit ID, and is
contained in a 64bit ID partition. Each object has associated attributes
attached to it, which are an integral part of the object and provide metadata about
the object. The standard defines some common obligatory attributes, but user
attributes can be added as needed.
===============================================================================
ENVIRONMENT
===============================================================================
To use this file system, you need to have an object store to run it on. You
may download a target from:
http://open-osd.org
See Documentation/scsi/osd.txt for how to setup a working osd environment.
===============================================================================
USAGE
===============================================================================
1. Download and compile exofs and open-osd initiator:
You need an external kernel source tree or kernel headers from your
distribution (anything based on 2.6.26 or later).
a. download open-osd including exofs source using:
[parent-directory]$ git clone git://git.open-osd.org/open-osd.git
b. Build the library module like this:
[parent-directory]$ make -C KSRC=$(KER_DIR) open-osd
This will build both the open-osd initiator as well as the exofs kernel
module. Use whatever parameters you compiled your Kernel with and
$(KER_DIR) above pointing to the Kernel you compile against. See the file
open-osd/top-level-Makefile for an example.
2. Get the OSD initiator and target set up properly, and login to the target.
See Documentation/scsi/osd.txt for further instructions. Also see ./do-osd
for an example script that does all these steps.
3. Insmod the exofs.ko module:
[exofs]$ insmod exofs.ko
4. Make sure the directory where you want to mount exists. If not, create it.
(For example, mkdir /mnt/exofs)
5. At first run you will need to invoke the mkfs.exofs application
As an example, this will create the file system on:
/dev/osd0 partition ID 65536
mkfs.exofs --pid=65536 --format /dev/osd0
The --format is optional. If not specified, no OSD_FORMAT will be
performed and a clean file system will be created in the specified pid,
in the available space of the target. (Use --format=size_in_meg to limit
the total LUN space available)
If pid already exists, it will be deleted and a new one will be created in
its place. Be careful.
An exofs lives inside a single OSD partition. You can create multiple exofs
filesystems on the same device using multiple pids.
(run mkfs.exofs without any parameters for usage help message)
6. Mount the file system.
For example, to mount /dev/osd0, partition ID 0x10000 on /mnt/exofs:
mount -t exofs -o pid=65536 /dev/osd0 /mnt/exofs/
7. For reference (See do-exofs example script):
do-exofs start - an example of how to perform the above steps.
do-exofs stop - an example of how to unmount the file system.
do-exofs format - an example of how to format and mkfs a new exofs.
8. Extra compilation flags (uncomment in fs/exofs/Kbuild):
CONFIG_EXOFS_DEBUG - for debug messages and extra checks.
===============================================================================
exofs mount options
===============================================================================
Similar to any mount command:
mount -t exofs -o exofs_options /dev/osdX mount_exofs_directory
Where:
-t exofs: specifies the exofs file system
/dev/osdX: X is a decimal number. /dev/osdX was created after a successful
login into an OSD target.
mount_exofs_directory: The directory to mount the file system on
exofs specific options: Options are separated by commas (,)
pid=<integer> - The partition number to mount/create as
container of the filesystem.
This option is mandatory. The integer can be
given in hex by prepending 0x to the number.
osdname=<id> - Mount by a device's osdname.
osdname is usually a 36 character uuid of the
form "d2683732-c906-4ee1-9dbd-c10c27bb40df".
It is one of the device's uuid specified in the
mkfs.exofs format command.
If this option is specified then the /dev/osdX
above can be empty and is ignored.
to=<integer> - Timeout in ticks for a single command.
default is (60 * HZ) [for debugging only]
===============================================================================
DESIGN
===============================================================================
* The file system control block (AKA on-disk superblock) resides in an object
with a special ID (defined in common.h).
Information included in the file system control block is used to fill the
in-memory superblock structure at mount time. This object is created before
the file system is used by mkexofs.c. It contains information such as:
- The file system's magic number
- The next inode number to be allocated
* Each file resides in its own object and contains the data (and it will be
possible to extend the file over multiple objects, though this has not been
implemented yet).
* A directory is treated as a file, and essentially contains a list of <file
name, inode #> pairs for files that are found in that directory. The object
IDs correspond to the files' inode numbers and will eventually be allocated
according to a bitmap (stored in a separate object); currently they are
allocated using a counter.
* Each file's control block (AKA on-disk inode) is stored in its object's
attributes. This applies to both regular files and other types (directories,
device files, symlinks, etc.).
* Credentials are generated per object (inode and superblock) when they are
created in memory (read from disk or created). The credential works for all
operations and is used as long as the object remains in memory.
* Async OSD operations are used whenever possible, but the target may execute
them out of order. The operations that concern us are create, delete,
readpage, writepage, update_inode, and truncate. The following pairs of
operations should execute in the order written, and we need to prevent them
from executing in reverse order:
- The following are handled with the OBJ_CREATED and OBJ_2BCREATED
flags. OBJ_CREATED is set when we know the object exists on the OSD -
in create's callback function, and when we successfully do a
read_inode.
OBJ_2BCREATED is set in the beginning of the create function, so we
know that we should wait.
- create/delete: delete should wait until the object is created
on the OSD.
- create/readpage: readpage should be able to return a page
full of zeroes in this case. If there was a write already
en-route (i.e. create, writepage, readpage) then the page
would be locked, and so it would really be the same as
create/writepage.
- create/writepage: if writepage is called for a sync write, it
should wait until the object is created on the OSD.
Otherwise, it should just return.
- create/truncate: truncate should wait until the object is
created on the OSD.
- create/update_inode: update_inode should wait until the
object is created on the OSD.
- Handled by VFS locks:
- readpage/delete: shouldn't happen because of page lock.
- writepage/delete: shouldn't happen because of page lock.
- readpage/writepage: shouldn't happen because of page lock.
===============================================================================
LICENSE/COPYRIGHT
===============================================================================
The exofs file system is based on ext2 v0.5b (distributed with the Linux kernel
version 2.6.10). All files include the original copyrights, and the license
is GPL version 2 (only version 2, as is true for the Linux kernel). The
Linux kernel can be downloaded from www.kernel.org.


@ -0,0 +1,383 @@
The Second Extended Filesystem
==============================
ext2 was originally released in January 1993. Written by Rémy Card,
Theodore Ts'o and Stephen Tweedie, it was a major rewrite of the
Extended Filesystem. It is currently still (April 2001) the predominant
filesystem in use by Linux. There are also implementations available
for NetBSD, FreeBSD, the GNU HURD, Windows 95/98/NT, OS/2 and RISC OS.
Options
=======
Most defaults are determined by the filesystem superblock, and can be
set using tune2fs(8). Kernel-determined defaults are indicated by (*).
bsddf (*) Makes `df' act like BSD.
minixdf Makes `df' act like Minix.
check=none, nocheck (*) Don't do extra checking of bitmaps on mount
(check=normal and check=strict options removed)
debug Extra debugging information is sent to the
kernel syslog. Useful for developers.
errors=continue Keep going on a filesystem error.
errors=remount-ro Remount the filesystem read-only on an error.
errors=panic Panic and halt the machine if an error occurs.
grpid, bsdgroups Give objects the same group ID as their parent.
nogrpid, sysvgroups New objects have the group ID of their creator.
nouid32 Use 16-bit UIDs and GIDs.
oldalloc Enable the old block allocator. Orlov should
have better performance; we'd like to get some
feedback if it's the contrary for you.
orlov (*) Use the Orlov block allocator.
(See http://lwn.net/Articles/14633/ and
http://lwn.net/Articles/14446/.)
resuid=n The user ID which may use the reserved blocks.
resgid=n The group ID which may use the reserved blocks.
sb=n Use alternate superblock at this location.
user_xattr Enable "user." POSIX Extended Attributes
(requires CONFIG_EXT2_FS_XATTR).
See also http://acl.bestbits.at
nouser_xattr Don't support "user." extended attributes.
acl Enable POSIX Access Control Lists support
(requires CONFIG_EXT2_FS_POSIX_ACL).
See also http://acl.bestbits.at
noacl Don't support POSIX ACLs.
nobh Do not attach buffer_heads to file pagecache.
xip Use execute in place (no caching) if possible
grpquota,noquota,quota,usrquota Quota options are silently ignored by ext2.
Specification
=============
ext2 shares many properties with traditional Unix filesystems. It has
the concepts of blocks, inodes and directories. It has space in the
specification for Access Control Lists (ACLs), fragments, undeletion and
compression though these are not yet implemented (some are available as
separate patches). There is also a versioning mechanism to allow new
features (such as journalling) to be added in a maximally compatible
manner.
Blocks
------
The space in the device or file is split up into blocks. These are
a fixed size, of 1024, 2048 or 4096 bytes (8192 bytes on Alpha systems),
which is decided when the filesystem is created. Smaller blocks mean
less wasted space per file, but require slightly more accounting overhead,
and also impose other limits on the size of files and the filesystem.
Block Groups
------------
Blocks are clustered into block groups in order to reduce fragmentation
and minimise the amount of head seeking when reading a large amount
of consecutive data. Information about each block group is kept in a
descriptor table stored in the block(s) immediately after the superblock.
Two blocks near the start of each group are reserved for the block usage
bitmap and the inode usage bitmap which show which blocks and inodes
are in use. Since each bitmap is limited to a single block, this means
that the maximum size of a block group is 8 times the size of a block.
The block(s) following the bitmaps in each block group are designated
as the inode table for that block group and the remainder are the data
blocks. The block allocation algorithm attempts to allocate data blocks
in the same block group as the inode which contains them.
The Superblock
--------------
The superblock contains all the information about the configuration of
the filing system. The primary copy of the superblock is stored at an
offset of 1024 bytes from the start of the device, and it is essential
to mounting the filesystem. Since it is so important, backup copies of
the superblock are stored in block groups throughout the filesystem.
The first version of ext2 (revision 0) stores a copy at the start of
every block group, along with backups of the group descriptor block(s).
Because this can consume a considerable amount of space for large
filesystems, later revisions can optionally reduce the number of backup
copies by only putting backups in specific groups (this is the sparse
superblock feature). The groups chosen are 0, 1 and powers of 3, 5 and 7.
The information in the superblock contains fields such as the total
number of inodes and blocks in the filesystem and how many are free,
how many inodes and blocks are in each block group, when the filesystem
was mounted (and if it was cleanly unmounted), when it was modified,
what version of the filesystem it is (see the Revisions section below)
and which OS created it.
If the filesystem is revision 1 or higher, then there are extra fields,
such as a volume name, a unique identification number, the inode size,
and space for optional filesystem features to store configuration info.
All fields in the superblock (as in all other ext2 structures) are stored
on the disc in little endian format, so a filesystem is portable between
machines without having to know what machine it was created on.
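As a small illustration of the layout just described (an addition, not
part of the original text), the sketch below reads the primary
superblock from a hypothetical device and checks the little-endian
magic value 0xEF53, which in the standard ext2 superblock layout sits
at byte 56 of the superblock; the device path is made up.

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	unsigned char sb[1024];
	uint16_t magic;
	int fd;

	fd = open("/dev/sdb1", O_RDONLY);	/* hypothetical device */
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* The primary superblock lives 1024 bytes into the device. */
	if (pread(fd, sb, sizeof(sb), 1024) != (ssize_t)sizeof(sb)) {
		perror("pread");
		close(fd);
		return 1;
	}

	/* s_magic is a little-endian 16-bit field at offset 56. */
	magic = sb[56] | (sb[57] << 8);
	printf("magic = 0x%04x (%s)\n", magic,
	       magic == 0xEF53 ? "ext2/3/4" : "not ext2");

	close(fd);
	return 0;
}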
Inodes
------
The inode (index node) is a fundamental concept in the ext2 filesystem.
Each object in the filesystem is represented by an inode. The inode
structure contains pointers to the filesystem blocks which contain the
data held in the object and all of the metadata about an object except
its name. The metadata about an object includes the permissions, owner,
group, flags, size, number of blocks used, access time, change time,
modification time, deletion time, number of links, fragments, version
(for NFS) and extended attributes (EAs) and/or Access Control Lists (ACLs).
There are some reserved fields which are currently unused in the inode
structure and several which are overloaded. One field is reserved for the
directory ACL if the inode is a directory and alternately for the top 32
bits of the file size if the inode is a regular file (allowing file sizes
larger than 2GB). The translator field is unused under Linux, but is used
by the HURD to reference the inode of a program which will be used to
interpret this object. Most of the remaining reserved fields have been
used up for both Linux and the HURD for larger owner and group fields.
The HURD also has a larger mode field so it uses another of the remaining
fields to store the extra mode bits.
There are pointers to the first 12 blocks which contain the file's data
in the inode. There is a pointer to an indirect block (which contains
pointers to the next set of blocks), a pointer to a doubly-indirect
block (which contains pointers to indirect blocks) and a pointer to a
trebly-indirect block (which contains pointers to doubly-indirect blocks).
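As a quick worked example (an addition, not from the original text), the
1kB-block file size limit quoted in the Limitations section below falls
out of this pointer scheme: with 1024-byte blocks and 4-byte block
pointers there are 256 pointers per indirect block, giving roughly 16GB.

#include <stdio.h>

int main(void)
{
	const unsigned long long block = 1024;		/* 1kB block size      */
	const unsigned long long ptrs  = block / 4;	/* pointers per block  */
	/* 12 direct blocks, plus one indirect, one doubly-indirect and
	 * one trebly-indirect tree of blocks. */
	unsigned long long blocks =
		12 + ptrs + ptrs * ptrs + ptrs * ptrs * ptrs;

	printf("max file size with 1kB blocks: %llu bytes (~%llu GB)\n",
	       blocks * block, (blocks * block) >> 30);
	return 0;
}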
The flags field contains some ext2-specific flags which aren't catered
for by the standard chmod flags. These flags can be listed with lsattr
and changed with the chattr command, and allow specific filesystem
behaviour on a per-file basis. There are flags for secure deletion,
undeletable, compression, synchronous updates, immutability, append-only,
dumpable, no-atime, indexed directories, and data-journaling. Not all
of these are supported yet.
Directories
-----------
A directory is a filesystem object and has an inode just like a file.
It is a specially formatted file containing records which associate
each name with an inode number. Later revisions of the filesystem also
encode the type of the object (file, directory, symlink, device, fifo,
socket) to avoid the need to check the inode itself for this information
(support for taking advantage of this feature does not yet exist in
Glibc 2.2).
The inode allocation code tries to assign inodes which are in the same
block group as the directory in which they are first created.
The current implementation of ext2 uses a singly-linked list to store
the filenames in the directory; a pending enhancement uses hashing of the
filenames to allow lookup without the need to scan the entire directory.
The current implementation never removes empty directory blocks once they
have been allocated to hold more files.
Special files
-------------
Symbolic links are also filesystem objects with inodes. They deserve
special mention because the data for them is stored within the inode
itself if the symlink is less than 60 bytes long. It uses the fields
which would normally be used to store the pointers to data blocks.
This is a worthwhile optimisation as it avoids allocating a full
block for the symlink, and most symlinks are less than 60 characters long.
Character and block special devices never have data blocks assigned to
them. Instead, their device number is stored in the inode, again reusing
the fields which would be used to point to the data blocks.
Reserved Space
--------------
In ext2, there is a mechanism for reserving a certain number of blocks
for a particular user (normally the super-user). This is intended to
allow for the system to continue functioning even if non-privileged users
fill up all the space available to them (this is independent of filesystem
quotas). It also keeps the filesystem from filling up entirely which
helps combat fragmentation.
Filesystem check
----------------
At boot time, most systems run a consistency check (e2fsck) on their
filesystems. The superblock of the ext2 filesystem contains several
fields which indicate whether fsck should actually run (since checking
the filesystem at boot can take a long time if it is large). fsck will
run if the filesystem was not cleanly unmounted, if the maximum mount
count has been exceeded or if the maximum time between checks has been
exceeded.
Feature Compatibility
---------------------
The compatibility feature mechanism used in ext2 is sophisticated.
It safely allows features to be added to the filesystem, without
unnecessarily sacrificing compatibility with older versions of the
filesystem code. The feature compatibility mechanism is not supported by
the original revision 0 (EXT2_GOOD_OLD_REV) of ext2, but was introduced in
revision 1. There are three 32-bit fields, one for compatible features
(COMPAT), one for read-only compatible (RO_COMPAT) features and one for
incompatible (INCOMPAT) features.
These feature flags have specific meanings for the kernel as follows:
A COMPAT flag indicates that a feature is present in the filesystem,
but the on-disk format is 100% compatible with older on-disk formats, so
a kernel which didn't know anything about this feature could read/write
the filesystem without any chance of corrupting the filesystem (or even
making it inconsistent). This is essentially just a flag which says
"this filesystem has a (hidden) feature" that the kernel or e2fsck may
want to be aware of (more on e2fsck and feature flags later). The ext3
HAS_JOURNAL feature is a COMPAT flag because the ext3 journal is simply
a regular file with data blocks in it so the kernel does not need to
take any special notice of it if it doesn't understand ext3 journaling.
An RO_COMPAT flag indicates that the on-disk format is 100% compatible
with older on-disk formats for reading (i.e. the feature does not change
the visible on-disk format). However, an old kernel writing to such a
filesystem would/could corrupt the filesystem, so this is prevented. The
most common such feature, SPARSE_SUPER, is an RO_COMPAT feature because
sparse groups allow file data blocks where superblock/group descriptor
backups used to live, and ext2_free_blocks() refuses to free these blocks,
which would lead to inconsistent bitmaps. An old kernel would also
get an error if it tried to free a series of blocks which crossed a group
boundary, but this is a legitimate layout in a SPARSE_SUPER filesystem.
An INCOMPAT flag indicates the on-disk format has changed in some
way that makes it unreadable by older kernels, or would otherwise
cause a problem if an old kernel tried to mount it. FILETYPE is an
INCOMPAT flag because older kernels would think a filename was longer
than 256 characters, which would lead to corrupt directory listings.
The COMPRESSION flag is an obvious INCOMPAT flag - if the kernel
doesn't understand compression, you would just get garbage back from
read() instead of it automatically decompressing your data. The ext3
RECOVER flag is needed to prevent a kernel which does not understand the
ext3 journal from mounting the filesystem without replaying the journal.
e2fsck needs to be more strict in its handling of these
flags than the kernel. If it doesn't understand ANY of the COMPAT,
RO_COMPAT, or INCOMPAT flags it will refuse to check the filesystem,
because it has no way of verifying whether a given feature is valid
or not. Allowing e2fsck to succeed on a filesystem with an unknown
feature is a false sense of security for the user. Refusing to check
a filesystem with unknown features is a good incentive for the user to
update to the latest e2fsck. This also means that anyone adding feature
flags to ext2 also needs to update e2fsck to verify these features.
Metadata
--------
It is frequently claimed that the ext2 implementation of writing
asynchronous metadata is faster than the ffs synchronous metadata
scheme but less reliable. Both methods are equally resolvable by their
respective fsck programs.
If you're exceptionally paranoid, there are 3 ways of making metadata
writes synchronous on ext2:
per-file if you have the program source: use the O_SYNC flag to open()
per-file if you don't have the source: use "chattr +S" on the file
per-filesystem: add the "sync" option to mount (or in /etc/fstab)
The first and last are not ext2 specific but do force the metadata to
be written synchronously. See also Journaling below.
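A minimal sketch of the first (per-file, with source) option above, as
an illustrative addition with a made-up path: open the file with O_SYNC
so that each write completes synchronously.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* O_SYNC makes each write return only after data and the
	 * associated metadata have reached the disk. */
	int fd = open("/mnt/ext2/important.dat",
		      O_WRONLY | O_CREAT | O_SYNC, 0644);

	if (fd < 0) {
		perror("open");
		return 1;
	}

	if (write(fd, "critical record\n", 16) != 16)
		perror("write");

	close(fd);
	return 0;
}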
Limitations
-----------
There are various limits imposed by the on-disk layout of ext2. Other
limits are imposed by the current implementation of the kernel code.
Many of the limits are determined at the time the filesystem is first
created, and depend upon the block size chosen. The ratio of inodes to
data blocks is fixed at filesystem creation time, so the only way to
increase the number of inodes is to increase the size of the filesystem.
No tools currently exist which can change the ratio of inodes to blocks.
Most of these limits could be overcome with slight changes in the on-disk
format and using a compatibility flag to signal the format change (at
the expense of some compatibility).
Filesystem block size: 1kB 2kB 4kB 8kB
File size limit: 16GB 256GB 2048GB 2048GB
Filesystem size limit: 2047GB 8192GB 16384GB 32768GB
There is a 2.4 kernel limit of 2048GB for a single block device, so no
filesystem larger than that can be created at this time. There is also
an upper limit on the block size imposed by the page size of the kernel,
so 8kB blocks are only allowed on Alpha systems (and other architectures
which support larger pages).
There is an upper limit of 32000 subdirectories in a single directory.
There is a "soft" upper limit of about 10-15k files in a single directory
with the current linear linked-list directory implementation. This limit
stems from performance problems when creating and deleting (and also
finding) files in such large directories. Using a hashed directory index
(under development) allows 100k-1M+ files in a single directory without
performance problems (although RAM size becomes an issue at this point).
The (meaningless) absolute upper limit of files in a single directory
(imposed by the file size, the realistic limit is obviously much less)
is over 130 trillion files. It would be higher except there are not
enough 4-character names to make up unique directory entries, so they
have to be 8 character filenames; even then we are fairly close to
running out of unique filenames.
Journaling
----------
A journaling extension to the ext2 code has been developed by Stephen
Tweedie. It avoids the risks of metadata corruption and the need to
wait for e2fsck to complete after a crash, without requiring a change
to the on-disk ext2 layout. In a nutshell, the journal is a regular
file which stores whole metadata (and optionally data) blocks that have
been modified, prior to writing them into the filesystem. This means
it is possible to add a journal to an existing ext2 filesystem without
the need for data conversion.
When changes are made to the filesystem (e.g. a file is renamed), they are stored in
a transaction in the journal and can either be complete or incomplete at
the time of a crash. If a transaction is complete at the time of a crash
(or in the normal case where the system does not crash), then any blocks
in that transaction are guaranteed to represent a valid filesystem state,
and are copied into the filesystem. If a transaction is incomplete at
the time of the crash, then there is no guarantee of consistency for
the blocks in that transaction so they are discarded (which means any
filesystem changes they represent are also lost).
Check Documentation/filesystems/ext3.txt if you want to read more about
ext3 and journaling.
References
==========
The kernel source file:/usr/src/linux/fs/ext2/
e2fsprogs (e2fsck) http://e2fsprogs.sourceforge.net/
Design & Implementation http://e2fsprogs.sourceforge.net/ext2intro.html
Journaling (ext3) ftp://ftp.uk.linux.org/pub/linux/sct/fs/jfs/
Filesystem Resizing http://ext2resize.sourceforge.net/
Compression (*) http://e2compr.sourceforge.net/
Implementations for:
Windows 95/98/NT/2000 http://www.chrysocome.net/explore2fs
Windows 95 (*) http://www.yipton.net/content.html#FSDEXT2
DOS client (*) ftp://metalab.unc.edu/pub/Linux/system/filesystems/ext2/
OS/2 (+) ftp://metalab.unc.edu/pub/Linux/system/filesystems/ext2/
RISC OS client http://www.esw-heim.tu-clausthal.de/~marco/smorbrod/IscaFS/
(*) no longer actively developed/supported (as of Apr 2001)
(+) no longer actively developed/supported (as of Mar 2009)


@ -0,0 +1,215 @@
Ext3 Filesystem
===============
Ext3 was originally released in September 1999. Written by Stephen Tweedie
for the 2.2 branch, and ported to 2.4 kernels by Peter Braam, Andreas Dilger,
Andrew Morton, Alexander Viro, Ted Ts'o and Stephen Tweedie.
Ext3 is the ext2 filesystem enhanced with journalling capabilities.
Options
=======
When mounting an ext3 filesystem, the following options are accepted:
(*) == default
ro Mount filesystem read only. Note that ext3 will replay
the journal (and thus write to the partition) even when
mounted "read only". Mount options "ro,noload" can be
used to prevent writes to the filesystem.
journal=update Update the ext3 file system's journal to the current
format.
journal=inum When a journal already exists, this option is ignored.
Otherwise, it specifies the number of the inode which
will represent the ext3 file system's journal file.
journal_path=path
journal_dev=devnum When the external journal device's major/minor numbers
have changed, these options allow the user to specify
the new journal location. The journal device is
identified through either its new major/minor numbers
encoded in devnum, or via a path to the device.
norecovery, noload Don't load the journal on mounting. Note that this
forces the mount of a possibly inconsistent filesystem, which can lead to
various problems.
data=journal All data are committed into the journal prior to being
written into the main file system.
data=ordered (*) All data are forced directly out to the main file
system prior to its metadata being committed to the
journal.
data=writeback Data ordering is not preserved, data may be written
into the main file system after its metadata has been
committed to the journal.
commit=nrsec (*) Ext3 can be told to sync all its data and metadata
every 'nrsec' seconds. The default value is 5 seconds.
This means that if you lose your power, you will lose
as much as the latest 5 seconds of work (your
filesystem will not be damaged though, thanks to the
journaling). This default value (or any low value)
will hurt performance, but it's good for data-safety.
Setting it to 0 will have the same effect as leaving
it at the default (5 seconds).
Setting it to very large values will improve
performance.
barrier=<0|1(*)>, barrier (*), nobarrier
These options enable/disable the use of write barriers in
the jbd code. barrier=0 disables, barrier=1 enables.
Barriers also require an IO stack which can support
them, and if jbd gets an error on a barrier
write, it will disable barriers again with a warning.
Write barriers enforce proper on-disk ordering
of journal commits, making volatile disk write caches
safe to use, at some performance penalty. If
your disks are battery-backed in one way or another,
disabling barriers may safely improve performance.
The mount options "barrier" and "nobarrier" can
also be used to enable or disable barriers, for
consistency with other ext3 mount options.
user_xattr Enables Extended User Attributes. Additionally, you
need to have extended attribute support enabled in the
kernel configuration (CONFIG_EXT3_FS_XATTR). See the
attr(5) manual page and http://acl.bestbits.at/ to
learn more about extended attributes.
nouser_xattr Disables Extended User Attributes.
acl Enables POSIX Access Control Lists support.
Additionally, you need to have ACL support enabled in
the kernel configuration (CONFIG_EXT3_FS_POSIX_ACL).
See the acl(5) manual page and http://acl.bestbits.at/
for more information.
noacl This option disables POSIX Access Control List
support.
reservation
noreservation
bsddf (*) Make 'df' act like BSD.
minixdf Make 'df' act like Minix.
check=none Don't do extra checking of bitmaps on mount.
nocheck
debug Extra debugging information is sent to syslog.
errors=remount-ro Remount the filesystem read-only on an error.
errors=continue Keep going on a filesystem error.
errors=panic Panic and halt the machine if an error occurs.
(These mount options override the errors behavior
specified in the superblock, which can be
configured using tune2fs.)
data_err=ignore(*) Just print an error message if an error occurs
in a file data buffer in ordered mode.
data_err=abort Abort the journal if an error occurs in a file
data buffer in ordered mode.
grpid, bsdgroups Give objects the same group ID as their parent directory.
nogrpid, sysvgroups (*) New objects have the group ID of their creator.
resgid=n The group ID which may use the reserved blocks.
resuid=n The user ID which may use the reserved blocks.
sb=n Use alternate superblock at this location.
quota, noquota, grpquota, usrquota
These options are ignored by the filesystem. They
are used only by quota tools to recognize volumes
where quota should be turned on. See documentation
in the quota-tools package for more details
(http://sourceforge.net/projects/linuxquota).
jqfmt=<quota type>, usrjquota=<file>, grpjquota=<file>
These options tell filesystem details about quota
so that quota information can be properly updated
during journal replay. They replace the above
quota options. See documentation in the quota-tools
package for more details
(http://sourceforge.net/projects/linuxquota).
Specification
=============
Ext3 shares all disk implementation with the ext2 filesystem, and adds
transaction capabilities to ext2. Journaling is done by the Journaling Block
Device layer.
Journaling Block Device layer
-----------------------------
The Journaling Block Device layer (JBD) isn't ext3 specific. It was designed
to add journaling capabilities to a block device. The ext3 filesystem code
will inform the JBD of modifications it is performing (called a transaction).
The journal supports transaction start and stop, and in case of a crash,
the journal can replay the transactions to quickly put the partition back into
a consistent state.
Handles represent a single atomic update to a filesystem. JBD can handle an
external journal on a block device.
Data Mode
---------
There are 3 different data modes:
* writeback mode
In data=writeback mode, ext3 does not journal data at all. This mode provides
a similar level of journaling as that of XFS, JFS, and ReiserFS in its default
mode - metadata journaling. A crash+recovery can cause incorrect data to
appear in files which were written shortly before the crash. This mode will
typically provide the best ext3 performance.
* ordered mode
In data=ordered mode, ext3 only officially journals metadata, but it logically
groups metadata and data blocks into a single unit called a transaction. When
it's time to write the new metadata out to disk, the associated data blocks
are written first. In general, this mode performs slightly slower than
writeback but significantly faster than journal mode.
* journal mode
data=journal mode provides full data and metadata journaling. All new data is
written to the journal first, and then to its final location.
In the event of a crash, the journal can be replayed, bringing both data and
metadata into a consistent state. This mode is the slowest except when data
needs to be read from and written to disk at the same time where it
outperforms all other modes.
Compatibility
-------------
Ext2 partitions can be easily converted to ext3 with `tune2fs -j <dev>`.
Ext3 is fully compatible with Ext2. Ext3 partitions can easily be mounted as
Ext2.
External Tools
==============
See manual pages to learn more.
tune2fs: create an ext3 journal on an ext2 partition with the -j flag.
mke2fs: create an ext3 partition with the -j flag.
debugfs: ext2 and ext3 file system debugger.
ext2online: online (mounted) ext2 and ext3 filesystem resizer
References
==========
kernel source: <file:fs/ext3/>
<file:fs/jbd/>
programs: http://e2fsprogs.sourceforge.net/
http://ext2resize.sourceforge.net
useful links: http://www.ibm.com/developerworks/library/l-fs7/index.html
http://www.ibm.com/developerworks/library/l-fs8/index.html


@ -0,0 +1,625 @@
Ext4 Filesystem
===============
Ext4 is an advanced level of the ext3 filesystem which incorporates
scalability and reliability enhancements for supporting large filesystems
(64 bit) in keeping with increasing disk capacities and state-of-the-art
feature requirements.
Mailing list: linux-ext4@vger.kernel.org
Web site: http://ext4.wiki.kernel.org
1. Quick usage instructions:
===========================
Note: More extensive information for getting started with ext4 can be
found at the ext4 wiki site at the URL:
http://ext4.wiki.kernel.org/index.php/Ext4_Howto
- Compile and install the latest version of e2fsprogs (as of this
writing version 1.41.3) from:
http://sourceforge.net/project/showfiles.php?group_id=2406
or
ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs/
or grab the latest git repository from:
git://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git
- Note that it is highly important to install the mke2fs.conf file
that comes with the e2fsprogs 1.41.x sources in /etc/mke2fs.conf. If
you have edited the /etc/mke2fs.conf file installed on your system,
you will need to merge your changes with the version from e2fsprogs
1.41.x.
- Create a new filesystem using the ext4 filesystem type:
# mke2fs -t ext4 /dev/hda1
Or to configure an existing ext3 filesystem to support extents:
# tune2fs -O extents /dev/hda1
If the filesystem was created with 128 byte inodes, it can be
converted to use 256 byte inodes for greater efficiency via:
# tune2fs -I 256 /dev/hda1
(Note: we currently do not have tools to convert an ext4
filesystem back to ext3; so please do not try this on production
filesystems.)
- Mounting:
# mount -t ext4 /dev/hda1 /wherever
- When comparing performance with other filesystems, it's always
important to try multiple workloads; very often a subtle change in a
workload parameter can completely change the ranking of which
filesystems do well compared to others. When comparing versus ext3,
note that ext4 enables write barriers by default, while ext3 does
not enable write barriers by default. So it is useful to
explicitly specify whether barriers are enabled or not via the
'-o barriers=[0|1]' mount option for both ext3 and ext4 filesystems
for a fair comparison. When tuning ext3 for best benchmark numbers,
it is often worthwhile to try changing the data journaling mode; '-o
data=writeback' can be faster for some workloads. (Note however that
running mounted with data=writeback can potentially leave stale data
exposed in recently written files in case of an unclean shutdown,
which could be a security exposure in some situations.) Configuring
the filesystem with a large journal can also be helpful for
metadata-intensive workloads.
2. Features
===========
2.1 Currently available
* ability to use filesystems > 16TB (e2fsprogs support not available yet)
* extent format reduces metadata overhead (RAM, IO for access, transactions)
* extent format more robust in face of on-disk corruption due to magics and
internal redundancy in the tree
* improved file allocation (multi-block alloc)
* lift 32000 subdirectory limit imposed by i_links_count[1]
* nsec timestamps for mtime, atime, ctime, create time
* inode version field on disk (NFSv4, Lustre)
* reduced e2fsck time via uninit_bg feature
* journal checksumming for robustness, performance
* persistent file preallocation (e.g for streaming media, databases)
* ability to pack bitmaps and inode tables into larger virtual groups via the
flex_bg feature
* large file support
* Inode allocation using large virtual block groups via flex_bg
* delayed allocation
* large block (up to pagesize) support
* efficient new ordered mode in JBD2 and ext4 (avoid using buffer heads to force
the ordering)
[1] Filesystems with a block size of 1k may see a limit imposed by the
directory hash tree having a maximum depth of two.
2.2 Candidate features for future inclusion
* Online defrag (patches available but not well tested)
* reduced mke2fs time via lazy itable initialization in conjunction with
the uninit_bg feature (capability to do this is available in e2fsprogs
but a kernel thread to do lazy zeroing of unused inode table blocks
after filesystem is first mounted is required for safety)
There are several others under discussion, whether they all make it in is
partly a function of how much time everyone has to work on them. Features like
metadata checksumming have been discussed and planned for a bit but no patches
exist yet so I'm not sure they're in the near-term roadmap.
The big performance win will come with mballoc, delalloc and flex_bg
grouping of bitmaps and inode tables. Some test results available here:
- http://www.bullopensource.org/ext4/20080818-ffsb/ffsb-write-2.6.27-rc1.html
- http://www.bullopensource.org/ext4/20080818-ffsb/ffsb-readwrite-2.6.27-rc1.html
3. Options
==========
When mounting an ext4 filesystem, the following options are accepted:
(*) == default
ro Mount filesystem read only. Note that ext4 will
replay the journal (and thus write to the
partition) even when mounted "read only". The
mount options "ro,noload" can be used to prevent
writes to the filesystem.
journal_checksum Enable checksumming of the journal transactions.
This will allow the recovery code in e2fsck and the
kernel to detect corruption in the journal. It is a
compatible change and will be ignored by older kernels.
journal_async_commit Commit block can be written to disk without waiting
for descriptor blocks. If enabled older kernels cannot
mount the device. This will enable 'journal_checksum'
internally.
journal_path=path
journal_dev=devnum When the external journal device's major/minor numbers
have changed, these options allow the user to specify
the new journal location. The journal device is
identified through either its new major/minor numbers
encoded in devnum, or via a path to the device.
norecovery, noload Don't load the journal on mounting. Note that
if the filesystem was not unmounted cleanly,
skipping the journal replay will lead to the
filesystem containing inconsistencies that can
lead to any number of problems.
data=journal All data are committed into the journal prior to being
written into the main file system. Enabling
this mode will disable delayed allocation and
O_DIRECT support.
data=ordered (*) All data are forced directly out to the main file
system prior to its metadata being committed to the
journal.
data=writeback Data ordering is not preserved, data may be written
into the main file system after its metadata has been
committed to the journal.
commit=nrsec (*) Ext4 can be told to sync all its data and metadata
every 'nrsec' seconds. The default value is 5 seconds.
This means that if you lose your power, you will lose
as much as the latest 5 seconds of work (your
filesystem will not be damaged though, thanks to the
journaling). This default value (or any low value)
will hurt performance, but it's good for data-safety.
Setting it to 0 will have the same effect as leaving
it at the default (5 seconds).
Setting it to very large values will improve
performance.
barrier=<0|1(*)>, barrier(*), nobarrier
These options enable/disable the use of write barriers in
the jbd code. barrier=0 disables, barrier=1 enables.
Barriers also require an IO stack which can support
them, and if jbd gets an error on a barrier
write, it will disable barriers again with a warning.
Write barriers enforce proper on-disk ordering
of journal commits, making volatile disk write caches
safe to use, at some performance penalty. If
your disks are battery-backed in one way or another,
disabling barriers may safely improve performance.
The mount options "barrier" and "nobarrier" can
also be used to enable or disable barriers, for
consistency with other ext4 mount options.
inode_readahead_blks=n This tuning parameter controls the maximum
number of inode table blocks that ext4's inode
table readahead algorithm will pre-read into
the buffer cache. The default value is 32 blocks.
nouser_xattr Disables Extended User Attributes. See the
attr(5) manual page and http://acl.bestbits.at/
for more information about extended attributes.
noacl This option disables POSIX Access Control List
support. If ACL support is enabled in the kernel
configuration (CONFIG_EXT4_FS_POSIX_ACL), ACL is
enabled by default on mount. See the acl(5) manual
page and http://acl.bestbits.at/ for more information
about acl.
bsddf (*) Make 'df' act like BSD.
minixdf Make 'df' act like Minix.
debug Extra debugging information is sent to syslog.
abort Simulate the effects of calling ext4_abort() for
debugging purposes. This is normally used while
remounting a filesystem which is already mounted.
errors=remount-ro Remount the filesystem read-only on an error.
errors=continue Keep going on a filesystem error.
errors=panic Panic and halt the machine if an error occurs.
(These mount options override the errors behavior
specified in the superblock, which can be configured
using tune2fs)
data_err=ignore(*) Just print an error message if an error occurs
in a file data buffer in ordered mode.
data_err=abort Abort the journal if an error occurs in a file
data buffer in ordered mode.
grpid, bsdgroups Give objects the same group ID as their parent directory.
nogrpid, sysvgroups (*) New objects have the group ID of their creator.
resgid=n The group ID which may use the reserved blocks.
resuid=n The user ID which may use the reserved blocks.
sb=n Use alternate superblock at this location.
quota, noquota, grpquota, usrquota
These options are ignored by the filesystem. They
are used only by quota tools to recognize volumes
where quota should be turned on. See documentation
in the quota-tools package for more details
(http://sourceforge.net/projects/linuxquota).
jqfmt=<quota type>, usrjquota=<file>, grpjquota=<file>
These options tell filesystem details about quota
so that quota information can be properly updated
during journal replay. They replace the above
quota options. See documentation in the quota-tools
package for more details
(http://sourceforge.net/projects/linuxquota).
stripe=n Number of filesystem blocks that mballoc will try
to use for allocation size and alignment. For RAID5/6
systems this should be the number of data
disks * RAID chunk size in file system blocks.
delalloc (*) Defer block allocation until just before ext4
writes out the block(s) in question. This
allows ext4 to make better allocation decisions.
nodelalloc Disable delayed allocation. Blocks are allocated
when the data is copied from userspace to the
page cache, either via the write(2) system call
or when an mmap'ed page which was previously
unallocated is written for the first time.
max_batch_time=usec Maximum amount of time ext4 should wait for
additional filesystem operations to be batch
together with a synchronous write operation.
Since a synchronous write operation is going to
force a commit and then wait for the I/O to
complete, it doesn't cost much, and can be a
huge throughput win, to wait a small amount
of time to see if any other transactions can
piggyback on the synchronous write. The
algorithm used is designed to automatically tune
for the speed of the disk, by measuring the
amount of time (on average) that it takes to
finish committing a transaction. Call this time
the "commit time". If the time that the
transaction has been running is less than the
commit time, ext4 will try sleeping for the
commit time to see if other operations will join
the transaction. The commit time is capped by
the max_batch_time, which defaults to 15000us
(15ms). This optimization can be turned off
entirely by setting max_batch_time to 0.
min_batch_time=usec This parameter sets the commit time (as
described above) to be at least min_batch_time.
It defaults to zero microseconds. Increasing
this parameter may improve the throughput of
multi-threaded, synchronous workloads on very
fast disks, at the cost of increasing latency.
journal_ioprio=prio The I/O priority (from 0 to 7, where 0 is the
highest priority) which should be used for I/O
operations submitted by kjournald2 during a
commit operation. This defaults to 3, which is
a slightly higher priority than the default I/O
priority.
auto_da_alloc(*), noauto_da_alloc
Many broken applications don't use fsync() when
replacing existing files via patterns such as
fd = open("foo.new")/write(fd,..)/close(fd)/
rename("foo.new", "foo"), or worse yet,
fd = open("foo", O_TRUNC)/write(fd,..)/close(fd).
If auto_da_alloc is enabled, ext4 will detect
the replace-via-rename and replace-via-truncate
patterns and force that any delayed allocation
blocks are allocated such that at the next
journal commit, in the default data=ordered
mode, the data blocks of the new file are forced
to disk before the rename() operation is
committed. This provides roughly the same level
of guarantees as ext3, and avoids the
"zero-length" problem that can happen when a
system crashes before the delayed allocation
blocks are forced to disk.
noinit_itable Do not initialize any uninitialized inode table
blocks in the background. This feature may be
used by installation CD's so that the install
process can complete as quickly as possible; the
inode table initialization process would then be
deferred until the next time the file system
is unmounted.
init_itable=n The lazy itable init code will wait n times the
number of milliseconds it took to zero out the
previous block group's inode table. This
minimizes the impact on the system performance
while file system's inode table is being initialized.
discard, nodiscard(*)
Controls whether ext4 should issue discard/TRIM
commands to the underlying block device when
blocks are freed. This is useful for SSD devices
and sparse/thinly-provisioned LUNs, but it is off
by default until sufficient testing has been done.
nouid32 Disables 32-bit UIDs and GIDs. This is for
interoperability with older kernels which only
store and expect 16-bit values.
block_validity, noblock_validity
These options enable/disable the in-kernel
facility for tracking filesystem metadata blocks
within internal data structures. This allows the multi-
block allocator and other routines to quickly locate
extents which might overlap with filesystem metadata
blocks. This option is intended for debugging
purposes and since it negatively affects the
performance, it is off by default.
dioread_lock, dioread_nolock
Controls whether or not ext4 should use DIO read
locking. If the dioread_nolock option is specified
ext4 will allocate an uninitialized extent before buffer
write and convert the extent to initialized after IO
completes. This approach allows ext4 code to avoid
using inode mutex, which improves scalability on high
speed storages. However this does not work with
data journaling and dioread_nolock option will be
ignored with kernel warning. Note that dioread_nolock
code path is only used for extent-based files.
Because of the restrictions this option imposes,
it is off by default (i.e. dioread_lock is the default).
max_dir_size_kb=n This limits the size of directories so that any
attempt to expand them beyond the specified
limit in kilobytes will cause an ENOSPC error.
This is useful in memory constrained
environments, where a very large directory can
cause severe performance problems or even
provoke the Out Of Memory killer. (For example,
if there is only 512mb memory available, a 176mb
directory may seriously cramp the system's style.)
i_version Enable 64-bit inode version support. This option is
off by default.
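To make the auto_da_alloc discussion above concrete, here is a minimal
userspace sketch of the fsync-based replace-via-rename pattern that
applications should use instead of relying on that heuristic; the file
names are illustrative and error handling is abbreviated.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Durable replacement of "path": write the new contents to a temporary
     * file, fsync() it, then atomically rename() it over the old file. */
    int replace_file(const char *path, const char *tmp,
                     const void *buf, size_t len)
    {
            int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);

            if (fd < 0)
                    return -1;
            if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
                    close(fd);
                    return -1;
            }
            close(fd);
            return rename(tmp, path);  /* "path" now refers to the new data */
    }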
Data Mode
=========
There are 3 different data modes:
* writeback mode
In data=writeback mode, ext4 does not journal data at all. This mode provides
a similar level of journaling as that of XFS, JFS, and ReiserFS in its default
mode - metadata journaling. A crash+recovery can cause incorrect data to
appear in files which were written shortly before the crash. This mode will
typically provide the best ext4 performance.
* ordered mode
In data=ordered mode, ext4 only officially journals metadata, but it logically
groups metadata information related to data changes with the data blocks into a
single unit called a transaction. When it's time to write the new metadata
out to disk, the associated data blocks are written first. In general,
this mode performs slightly slower than writeback but significantly faster than journal mode.
* journal mode
data=journal mode provides full data and metadata journaling. All new data is
written to the journal first, and then to its final location.
In the event of a crash, the journal can be replayed, bringing both data and
metadata into a consistent state. This mode is the slowest except when data
needs to be read from and written to disk at the same time where it
outperforms all other modes. Enabling this mode will disable delayed
allocation and O_DIRECT support.
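A minimal sketch of selecting one of these modes at mount time via
mount(2); the device and mount point names are illustrative.

    #include <sys/mount.h>

    /* Mount an ext4 volume with full data journaling.  Returns 0 on
     * success, -1 on failure (errno set by mount(2)). */
    int mount_ext4_data_journal(void)
    {
            return mount("/dev/sdb1", "/mnt/data", "ext4", 0,
                         "data=journal");
    }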
/proc entries
=============
Information about mounted ext4 file systems can be found in
/proc/fs/ext4. Each mounted filesystem will have a directory in
/proc/fs/ext4 based on its device name (e.g., /proc/fs/ext4/hdc or
/proc/fs/ext4/dm-0). The files in each per-device directory are shown
in the table below.
Files in /proc/fs/ext4/<devname>
..............................................................................
File Content
mb_groups details of multiblock allocator buddy cache of free blocks
..............................................................................
/sys entries
============
Information about mounted ext4 file systems can be found in
/sys/fs/ext4. Each mounted filesystem will have a directory in
/sys/fs/ext4 based on its device name (e.g., /sys/fs/ext4/hdc or
/sys/fs/ext4/dm-0). The files in each per-device directory are shown
in the table below.
Files in /sys/fs/ext4/<devname>
(see also Documentation/ABI/testing/sysfs-fs-ext4)
..............................................................................
File Content
delayed_allocation_blocks This file is read-only and shows the number of
blocks that are dirty in the page cache, but
which do not have their location in the
filesystem allocated yet.
inode_goal Tuning parameter which (if non-zero) controls
the goal inode used by the inode allocator in
preference to all other allocation heuristics.
This is intended for debugging use only, and
should be 0 on production systems.
inode_readahead_blks Tuning parameter which controls the maximum
number of inode table blocks that ext4's inode
table readahead algorithm will pre-read into
the buffer cache
lifetime_write_kbytes This file is read-only and shows the number of
kilobytes of data that have been written to this
filesystem since it was created.
max_writeback_mb_bump The maximum number of megabytes the writeback
code will try to write out before moving on to
another inode.
mb_group_prealloc The multiblock allocator will round up allocation
requests to a multiple of this tuning parameter if
the stripe size is not set in the ext4 superblock
mb_max_to_scan The maximum number of extents the multiblock
allocator will search to find the best extent
mb_min_to_scan The minimum number of extents the multiblock
allocator will search to find the best extent
mb_order2_req Tuning parameter which controls the minimum size
for requests (as a power of 2) where the buddy
cache is used
mb_stats Controls whether the multiblock allocator should
collect statistics, which are shown during the
unmount. 1 means to collect statistics, 0 means
not to collect statistics
mb_stream_req Files which have fewer blocks than this tunable
parameter will have their blocks allocated out
of a block group specific preallocation pool, so
that small files are packed closely together.
Each large file will have its blocks allocated
out of its own unique preallocation pool.
session_write_kbytes This file is read-only and shows the number of
kilobytes of data that have been written to this
filesystem since it was mounted.
reserved_clusters This is a read-write file and contains the number
of reserved clusters in the file system which will
be used in specific situations to avoid costly
zeroout, unexpected ENOSPC, or possible data loss.
The default is 2% of the clusters or 4096 clusters,
whichever is smaller, and this can be changed;
however, it can never exceed the number of clusters
in the file system. If there is not enough space
for the reserved space when mounting the file
system, the mount will _not_ fail.
..............................................................................
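As an illustration, a small sketch that reads one of the statistics
above from sysfs; the device directory name "sda1" is hypothetical and
must match the mounted device.

    #include <stdio.h>

    /* Read the lifetime_write_kbytes counter for a (hypothetical) device. */
    int read_lifetime_write_kbytes(unsigned long long *kb)
    {
            FILE *f = fopen("/sys/fs/ext4/sda1/lifetime_write_kbytes", "r");
            int ok;

            if (!f)
                    return -1;
            ok = (fscanf(f, "%llu", kb) == 1);
            fclose(f);
            return ok ? 0 : -1;
    }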
Ioctls
======
There is some Ext4 specific functionality which can be accessed by applications
through the system call interfaces. The list of all Ext4 specific ioctls is
shown in the table below.
Table of Ext4 specific ioctls
..............................................................................
Ioctl Description
EXT4_IOC_GETFLAGS Get additional attributes associated with inode.
The ioctl argument is an integer bitfield, with
bit values described in ext4.h. This ioctl is an
alias for FS_IOC_GETFLAGS.
EXT4_IOC_SETFLAGS Set additional attributes associated with inode.
The ioctl argument is an integer bitfield, with
bit values described in ext4.h. This ioctl is an
alias for FS_IOC_SETFLAGS (see the example after this table).
EXT4_IOC_GETVERSION
EXT4_IOC_GETVERSION_OLD
Get the inode i_generation number stored for
each inode. The i_generation number is normally
changed only when a new inode is created and it is
particularly useful for network filesystems. The
'_OLD' version of this ioctl is an alias for
FS_IOC_GETVERSION.
EXT4_IOC_SETVERSION
EXT4_IOC_SETVERSION_OLD
Set the inode i_generation number stored for
each inode. The '_OLD' version of this ioctl
is an alias for FS_IOC_SETVERSION.
EXT4_IOC_GROUP_EXTEND This ioctl has the same purpose as the resize
mount option. It allows resizing the filesystem
to the end of the last existing block group;
further resizing has to be done with resize2fs,
either online or offline. The argument points
to the unsigned long number representing the
filesystem's new block count.
EXT4_IOC_MOVE_EXT Move the block extents from orig_fd (the one
this ioctl is pointing to) to the donor_fd (the
one specified in move_extent structure passed
as an argument to this ioctl). Then, exchange
inode metadata between orig_fd and donor_fd.
This is especially useful for online
defragmentation, because the allocator has the
opportunity to allocate moved blocks better,
ideally into one contiguous extent.
EXT4_IOC_GROUP_ADD Add a new group descriptor to an existing or
new group descriptor block. The new group
descriptor is described by ext4_new_group_input
structure, which is passed as an argument to
this ioctl. This is especially useful in
conjunction with EXT4_IOC_GROUP_EXTEND,
which allows online resize of the filesystem
to the end of the last existing block group.
These two ioctls combined are used by the
userspace online resize tool (e.g. resize2fs).
EXT4_IOC_MIGRATE This ioctl operates on the filesystem itself.
It converts (migrates) ext3 indirect block mapped
inode to ext4 extent mapped inode by walking
through indirect block mapping of the original
inode and converting contiguous block ranges
into ext4 extents of the temporary inode. Then,
inodes are swapped. This ioctl might help when
migrating from an ext3 to an ext4 filesystem;
however, the suggestion is to create a fresh ext4
filesystem and copy the data from a backup. Note
that the filesystem has to support extents for
this ioctl to work.
EXT4_IOC_ALLOC_DA_BLKS Force all of the delay allocated blocks to be
allocated to preserve application-expected ext3
behaviour. Note that this will also start
triggering a write of the data blocks, but this
behaviour may change in the future as it is
not necessary and has been done this way only
for sake of simplicity.
EXT4_IOC_RESIZE_FS Resize the filesystem to a new size. The number
of blocks of resized filesystem is passed in via
64 bit integer argument. The kernel allocates
bitmaps and inode table, the userspace tool thus
just passes the new number of blocks.
EXT4_IOC_SWAP_BOOT Swap i_blocks and associated attributes
(like i_blocks, i_size, i_flags, ...) from
the specified inode with inode
EXT4_BOOT_LOADER_INO (#5). This is typically
used to store a boot loader in a secure part of
the filesystem, where it can't be changed by a
normal user by accident.
The data blocks of the previous boot loader
will be associated with the given inode.
..............................................................................
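As a small illustration of the flag ioctls listed above, the following
sketch toggles an attribute through the FS_IOC_GETFLAGS/FS_IOC_SETFLAGS
aliases; the path is illustrative and error handling is abbreviated.

    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>

    /* Set the "no atime updates" attribute flag on an inode. */
    int set_noatime_flag(const char *path)
    {
            int fd = open(path, O_RDONLY);
            int flags, ret = -1;

            if (fd < 0)
                    return -1;
            if (ioctl(fd, FS_IOC_GETFLAGS, &flags) == 0) {
                    flags |= FS_NOATIME_FL;
                    ret = ioctl(fd, FS_IOC_SETFLAGS, &flags);
            }
            close(fd);
            return ret;
    }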
References
==========
kernel source: <file:fs/ext4/>
<file:fs/jbd2/>
programs: http://e2fsprogs.sourceforge.net/
useful links: http://fedoraproject.org/wiki/ext3-devel
http://www.bullopensource.org/ext4/
http://ext4.wiki.kernel.org/index.php/Main_Page
http://fedoraproject.org/wiki/Features/Ext4
@ -0,0 +1,557 @@
================================================================================
WHAT IS Flash-Friendly File System (F2FS)?
================================================================================
NAND flash memory-based storage devices, such as SSDs, eMMC, and SD cards, have
been used in a variety of systems ranging from mobile to server systems. Since
they are known to have different characteristics from the conventional rotating
disks, a file system, an upper layer to the storage device, should adapt to
those characteristics from scratch at the design level.
F2FS is a file system exploiting NAND flash memory-based storage devices, which
is based on Log-structured File System (LFS). The design has been focused on
addressing the fundamental issues in LFS, which are snowball effect of wandering
tree and high cleaning overhead.
Since a NAND flash memory-based storage device shows different characteristic
according to its internal geometry or flash memory management scheme, namely FTL,
F2FS and its tools support various parameters not only for configuring on-disk
layout, but also for selecting allocation and cleaning algorithms.
The following git tree provides the file system formatting tool (mkfs.f2fs),
a consistency checking tool (fsck.f2fs), and a debugging tool (dump.f2fs).
>> git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs-tools.git
For reporting bugs and sending patches, please use the following mailing list:
>> linux-f2fs-devel@lists.sourceforge.net
================================================================================
BACKGROUND AND DESIGN ISSUES
================================================================================
Log-structured File System (LFS)
--------------------------------
"A log-structured file system writes all modifications to disk sequentially in
a log-like structure, thereby speeding up both file writing and crash recovery.
The log is the only structure on disk; it contains indexing information so that
files can be read back from the log efficiently. In order to maintain large free
areas on disk for fast writing, we divide the log into segments and use a
segment cleaner to compress the live information from heavily fragmented
segments." from Rosenblum, M. and Ousterhout, J. K., 1992, "The design and
implementation of a log-structured file system", ACM Trans. Computer Systems
10, 1, 26-52.
Wandering Tree Problem
----------------------
In LFS, when file data is updated and written to the end of the log, its direct
pointer block is updated due to the changed location. Then the indirect pointer
block is also updated due to the direct pointer block update. In this manner,
the upper index structures such as the inode, inode map, and checkpoint block
are also updated recursively. This problem is called the wandering tree problem
[1], and in order to enhance performance, the update propagation should be
eliminated or relaxed as much as possible.
[1] Bityutskiy, A. 2005. JFFS3 design issues. http://www.linux-mtd.infradead.org/
Cleaning Overhead
-----------------
Since LFS is based on out-of-place writes, it produces so many obsolete blocks
scattered across the whole storage. In order to serve new empty log space, it
needs to reclaim these obsolete blocks seamlessly to users. This job is called
the cleaning process.
The process consists of the following operations.
1. A victim segment is selected through referencing segment usage table.
2. It loads parent index structures of all the data in the victim identified by
segment summary blocks.
3. It checks the cross-reference between the data and its parent index structure.
4. It moves valid data selectively.
This cleaning job may cause unexpected long delays, so the most important goal
is to hide the latencies to users. And also definitely, it should reduce the
amount of valid data to be moved, and move them quickly as well.
================================================================================
KEY FEATURES
================================================================================
Flash Awareness
---------------
- Enlarge the random write area for better performance, but provide the high
spatial locality
- Align FS data structures to the operational units in FTL as best efforts
Wandering Tree Problem
----------------------
- Use a term, “node”, that represents inodes as well as various pointer blocks
- Introduce Node Address Table (NAT) containing the locations of all the “node”
blocks; this will cut off the update propagation.
Cleaning Overhead
-----------------
- Support a background cleaning process
- Support greedy and cost-benefit algorithms for victim selection policies
- Support multi-head logs for static/dynamic hot and cold data separation
- Introduce adaptive logging for efficient block allocation
================================================================================
MOUNT OPTIONS
================================================================================
background_gc=%s Turn on/off cleaning operations, namely garbage
collection, triggered in background when I/O subsystem is
idle. If background_gc=on, it will turn on the garbage
collection and if background_gc=off, garbage collection
will be turned off.
Default value for this option is on. So garbage
collection is on by default.
disable_roll_forward Disable the roll-forward recovery routine
discard Issue discard/TRIM commands when a segment is cleaned.
no_heap Disable heap-style segment allocation which finds free
segments for data from the beginning of main area, while
for node from the end of main area.
nouser_xattr Disable Extended User Attributes. Note: xattr is enabled
by default if CONFIG_F2FS_FS_XATTR is selected.
noacl Disable POSIX Access Control List. Note: acl is enabled
by default if CONFIG_F2FS_FS_POSIX_ACL is selected.
active_logs=%u Support configuring the number of active logs. In the
current design, f2fs supports only 2, 4, and 6 logs.
Default number is 6.
disable_ext_identify Disable the extension list configured by mkfs, so f2fs
is not aware of cold files such as media files.
inline_xattr Enable the inline xattrs feature.
inline_data Enable the inline data feature: newly created small (<~3.4k)
files can be written into the inode block.
flush_merge Merge concurrent cache_flush commands as much as possible
to eliminate redundant command issues. If the underlying
device handles the cache_flush command relatively slowly,
it is recommended to enable this option.
nobarrier This option can be used if the underlying storage guarantees
that its cached data will be written to the non-volatile area.
If this option is set, no cache_flush commands are issued
but f2fs still guarantees the write ordering of all the
data writes.
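A minimal sketch of passing some of the options above via mount(2); the
device and mount point are illustrative.

    #include <sys/mount.h>

    /* Mount an f2fs volume with background GC, discard and inline data. */
    int mount_f2fs_example(void)
    {
            return mount("/dev/mmcblk0p1", "/mnt/f2fs", "f2fs", 0,
                         "background_gc=on,discard,active_logs=6,inline_data");
    }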
================================================================================
DEBUGFS ENTRIES
================================================================================
/sys/kernel/debug/f2fs/ contains information about all the partitions mounted as
f2fs. Each file shows the whole f2fs information.
/sys/kernel/debug/f2fs/status includes:
- major file system information managed by f2fs currently
- average SIT information about whole segments
- current memory footprint consumed by f2fs.
================================================================================
SYSFS ENTRIES
================================================================================
Information about mounted f2fs file systems can be found in
/sys/fs/f2fs. Each mounted filesystem will have a directory in
/sys/fs/f2fs based on its device name (i.e., /sys/fs/f2fs/sda).
The files in each per-device directory are shown in the table below.
Files in /sys/fs/f2fs/<devname>
(see also Documentation/ABI/testing/sysfs-fs-f2fs)
..............................................................................
File Content
gc_max_sleep_time This tuning parameter controls the maximum sleep
time for the garbage collection thread. Time is
in milliseconds.
gc_min_sleep_time This tuning parameter controls the minimum sleep
time for the garbage collection thread. Time is
in milliseconds.
gc_no_gc_sleep_time This tuning parameter controls the default sleep
time for the garbage collection thread. Time is
in milliseconds.
gc_idle This parameter controls the selection of victim
policy for garbage collection. Setting gc_idle = 0
(default) will disable this option. Setting
gc_idle = 1 will select the Cost Benefit approach
and setting gc_idle = 2 will select the greedy approach.
reclaim_segments This parameter controls the number of prefree
segments to be reclaimed. If the number of prefree
segments is larger than the number of segments
in the proportion to the percentage over total
volume size, f2fs tries to conduct checkpoint to
reclaim the prefree segments to free segments.
By default, 5% over total # of segments.
max_small_discards This parameter controls the number of discard
commands that consist of small blocks of less than 2MB.
The candidates to be discarded are cached until
checkpoint is triggered, and issued during the
checkpoint. By default, it is disabled with 0.
ipu_policy This parameter controls the policy of in-place
updates in f2fs. There are five policies:
0x01: F2FS_IPU_FORCE, 0x02: F2FS_IPU_SSR,
0x04: F2FS_IPU_UTIL, 0x08: F2FS_IPU_SSR_UTIL,
0x10: F2FS_IPU_FSYNC.
min_ipu_util This parameter controls the threshold to trigger
in-place-updates. The number indicates percentage
of the filesystem utilization, and used by
F2FS_IPU_UTIL and F2FS_IPU_SSR_UTIL policies.
min_fsync_blocks This parameter controls the threshold to trigger
in-place-updates when F2FS_IPU_FSYNC mode is set.
The number indicates the number of dirty pages
when fsync needs to flush on its call path. If
the number is less than this value, it triggers
in-place-updates.
max_victim_search This parameter controls the number of trials to
find a victim segment when conducting SSR and
cleaning operations. The default value is 4096
which covers 8GB block address range.
dir_level This parameter controls the directory level to
support large directory. If a directory has a
number of files, it can reduce the file lookup
latency by increasing this dir_level value.
Otherwise, it needs to decrease this value to
reduce the space overhead. The default value is 0.
ram_thresh This parameter controls the memory footprint used
by free nids and cached nat entries. By default,
10 is set, which indicates 10 MB / 1 GB RAM.
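A small sketch of adjusting one of these tunables from userspace; the
device directory "sda" is hypothetical, and the value is in milliseconds
as described above.

    #include <stdio.h>

    /* Set the minimum GC sleep time for a (hypothetical) f2fs device. */
    int set_gc_min_sleep_time(unsigned int msecs)
    {
            FILE *f = fopen("/sys/fs/f2fs/sda/gc_min_sleep_time", "w");

            if (!f)
                    return -1;
            fprintf(f, "%u\n", msecs);
            return fclose(f);
    }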
================================================================================
USAGE
================================================================================
1. Download userland tools and compile them.
2. Skip, if f2fs was compiled statically inside kernel.
Otherwise, insert the f2fs.ko module.
# insmod f2fs.ko
3. Create a directory trying to mount
# mkdir /mnt/f2fs
4. Format the block device, and then mount as f2fs
# mkfs.f2fs -l label /dev/block_device
# mount -t f2fs /dev/block_device /mnt/f2fs
mkfs.f2fs
---------
mkfs.f2fs formats a partition as an f2fs filesystem and builds a basic
on-disk layout.
The options consist of:
-l [label] : Give a volume label, up to 512 unicode characters.
-a [0 or 1] : Split start location of each area for heap-based allocation.
1 is set by default, which performs this.
-o [int] : Set overprovision ratio in percent over volume size.
5 is set by default.
-s [int] : Set the number of segments per section.
1 is set by default.
-z [int] : Set the number of sections per zone.
1 is set by default.
-e [str] : Set basic extension list. e.g. "mp3,gif,mov"
-t [0 or 1] : Disable discard command or not.
1 is set by default, which conducts discard.
fsck.f2fs
---------
The fsck.f2fs is a tool to check the consistency of an f2fs-formatted
partition, which examines whether the filesystem metadata and user-made data
are cross-referenced correctly or not.
Note that the initial version of the tool does not fix any inconsistency.
The options consist of:
-d debug level [default:0]
dump.f2fs
---------
dump.f2fs shows the information of a specific inode and dumps the SSA and SIT
to the files dump_ssa and dump_sit respectively.
dump.f2fs is used to debug on-disk data structures of the f2fs filesystem.
It shows on-disk inode information recognized by a given inode number, and is
able to dump all the SSA and SIT entries into predefined files, ./dump_ssa and
./dump_sit respectively.
The options consist of:
-d debug level [default:0]
-i inode no (hex)
-s [SIT dump segno from #1~#2 (decimal), for all 0~-1]
-a [SSA dump segno from #1~#2 (decimal), for all 0~-1]
Examples:
# dump.f2fs -i [ino] /dev/sdx
# dump.f2fs -s 0~-1 /dev/sdx (SIT dump)
# dump.f2fs -a 0~-1 /dev/sdx (SSA dump)
================================================================================
DESIGN
================================================================================
On-disk Layout
--------------
F2FS divides the whole volume into a number of segments, each of which is fixed
to 2MB in size. A section is composed of consecutive segments, and a zone
consists of a set of sections. By default, section and zone sizes are set to one
segment size identically, but users can easily modify the sizes by mkfs.
F2FS splits the entire volume into six areas, and all the areas except the
superblock consist of multiple segments as described below.
align with the zone size <-|
|-> align with the segment size
_________________________________________________________________________
| | | Segment | Node | Segment | |
| Superblock | Checkpoint | Info. | Address | Summary | Main |
| (SB) | (CP) | Table (SIT) | Table (NAT) | Area (SSA) | |
|____________|_____2______|______N______|______N______|______N_____|__N___|
. .
. .
. .
._________________________________________.
|_Segment_|_..._|_Segment_|_..._|_Segment_|
. .
._________._________
|_section_|__...__|_
. .
.________.
|__zone__|
- Superblock (SB)
: It is located at the beginning of the partition, and there exist two copies
to avoid file system crash. It contains basic partition information and some
default parameters of f2fs.
- Checkpoint (CP)
: It contains file system information, bitmaps for valid NAT/SIT sets, orphan
inode lists, and summary entries of current active segments.
- Segment Information Table (SIT)
: It contains segment information such as valid block count and bitmap for the
validity of all the blocks.
- Node Address Table (NAT)
: It is composed of a block address table for all the node blocks stored in
Main area.
- Segment Summary Area (SSA)
: It contains summary entries which contain the owner information of all the
data and node blocks stored in Main area.
- Main Area
: It contains file and directory data including their indices.
In order to avoid misalignment between file system and flash-based storage, F2FS
aligns the start block address of CP with the segment size. Also, it aligns the
start block address of Main area with the zone size by reserving some segments
in SSA area.
Reference the following survey for additional technical details.
https://wiki.linaro.org/WorkingGroups/Kernel/Projects/FlashCardSurvey
File System Metadata Structure
------------------------------
F2FS adopts the checkpointing scheme to maintain file system consistency. At
mount time, F2FS first tries to find the last valid checkpoint data by scanning
CP area. In order to reduce the scanning time, F2FS uses only two copies of CP.
One of them always indicates the last valid data, which is called the shadow copy
mechanism. In addition to CP, NAT and SIT also adopt the shadow copy mechanism.
For file system consistency, each CP points to which NAT and SIT copies are
valid, as shown as below.
+--------+----------+---------+
| CP | SIT | NAT |
+--------+----------+---------+
. . . .
. . . .
. . . .
+-------+-------+--------+--------+--------+--------+
| CP #0 | CP #1 | SIT #0 | SIT #1 | NAT #0 | NAT #1 |
+-------+-------+--------+--------+--------+--------+
| ^ ^
| | |
`----------------------------------------'
Index Structure
---------------
The key data structure to manage the data locations is a "node". Similar to
traditional file structures, F2FS has three types of node: inode, direct node,
indirect node. F2FS assigns 4KB to an inode block which contains 923 data block
indices, two direct node pointers, two indirect node pointers, and one double
indirect node pointer as described below. One direct node block contains 1018
data blocks, and one indirect node block also contains 1018 node blocks. Thus,
one inode block (i.e., a file) covers:
4KB * (923 + 2 * 1018 + 2 * 1018 * 1018 + 1018 * 1018 * 1018) := 3.94TB.
Inode block (4KB)
|- data (923)
|- direct node (2)
| `- data (1018)
|- indirect node (2)
| `- direct node (1018)
| `- data (1018)
`- double indirect node (1)
`- indirect node (1018)
`- direct node (1018)
`- data (1018)
Note that all the node blocks are mapped by the NAT, which means the location of
each node is translated by the NAT table. With this design, addressing the
wandering tree problem, F2FS is able to cut off the propagation of node updates
caused by leaf data writes.
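As a quick cross-check of the ~3.94TB figure above, a tiny standalone
sketch that evaluates the same expression (the result is in binary
terabytes):

    #include <stdio.h>

    int main(void)
    {
            /* Blocks addressable from one inode: direct indices plus the
             * (double) indirect node trees, as in the formula above. */
            unsigned long long blocks = 923ULL + 2 * 1018
                                      + 2 * 1018ULL * 1018
                                      + 1018ULL * 1018 * 1018;

            printf("%.2f TB\n", blocks * 4096.0 / (1ULL << 40)); /* ~3.94 */
            return 0;
    }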
Directory Structure
-------------------
A directory entry occupies 11 bytes, which consists of the following attributes.
- hash hash value of the file name
- ino inode number
- len the length of file name
- type file type such as directory, symlink, etc
A dentry block consists of 214 dentry slots and file names. Therein a bitmap is
used to represent whether each dentry is valid or not. A dentry block occupies
4KB with the following composition.
Dentry Block(4 K) = bitmap (27 bytes) + reserved (3 bytes) +
dentries(11 * 214 bytes) + file name (8 * 214 bytes)
[Bucket]
+--------------------------------+
|dentry block 1 | dentry block 2 |
+--------------------------------+
. .
. .
. [Dentry Block Structure: 4KB] .
+--------+----------+----------+------------+
| bitmap | reserved | dentries | file names |
+--------+----------+----------+------------+
[Dentry Block: 4KB] . .
. .
. .
+------+------+-----+------+
| hash | ino | len | type |
+------+------+-----+------+
[Dentry Structure: 11 bytes]
F2FS implements multi-level hash tables for directory structure. Each level has
a hash table with dedicated number of hash buckets as shown below. Note that
"A(2B)" means a bucket includes 2 data blocks.
----------------------
A : bucket
B : block
N : MAX_DIR_HASH_DEPTH
----------------------
level #0 | A(2B)
|
level #1 | A(2B) - A(2B)
|
level #2 | A(2B) - A(2B) - A(2B) - A(2B)
. | . . . .
level #N/2 | A(2B) - A(2B) - A(2B) - A(2B) - A(2B) - ... - A(2B)
. | . . . .
level #N | A(4B) - A(4B) - A(4B) - A(4B) - A(4B) - ... - A(4B)
The number of blocks and buckets are determined by,
,- 2, if n < MAX_DIR_HASH_DEPTH / 2,
# of blocks in level #n = |
`- 4, Otherwise
,- 2^(n + dir_level),
| if n + dir_level < MAX_DIR_HASH_DEPTH / 2,
# of buckets in level #n = |
`- 2^((MAX_DIR_HASH_DEPTH / 2) - 1),
Otherwise
When F2FS finds a file name in a directory, at first a hash value of the file
name is calculated. Then, F2FS scans the hash table in level #0 to find the
dentry consisting of the file name and its inode number. If not found, F2FS
scans the next hash table in level #1. In this way, F2FS scans hash tables in
each level incrementally from 1 to N. In each level F2FS needs to scan only
one bucket determined by the following equation, which shows O(log(# of files))
complexity.
bucket number to scan in level #n = (hash value) % (# of buckets in level #n)
In the case of file creation, F2FS finds empty consecutive slots that cover the
file name. F2FS searches the empty slots in the hash tables of whole levels from
1 to N in the same way as the lookup operation.
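As a small illustration of the sizing rules and the bucket-selection
equation above, here is a sketch in C; MAX_DIR_HASH_DEPTH is given an
illustrative value here, the real constant is defined in fs/f2fs/f2fs.h,
and these helpers are not the kernel implementation.

    /* Illustrative value only; see fs/f2fs/f2fs.h for the real constant. */
    #define MAX_DIR_HASH_DEPTH 64

    static unsigned int blocks_in_level(unsigned int n)
    {
            return (n < MAX_DIR_HASH_DEPTH / 2) ? 2 : 4;
    }

    static unsigned int buckets_in_level(unsigned int n, unsigned int dir_level)
    {
            if (n + dir_level < MAX_DIR_HASH_DEPTH / 2)
                    return 1U << (n + dir_level);
            return 1U << (MAX_DIR_HASH_DEPTH / 2 - 1);
    }

    /* Bucket that must be scanned for a given hash at level n. */
    static unsigned int bucket_to_scan(unsigned int hash, unsigned int n,
                                       unsigned int dir_level)
    {
            return hash % buckets_in_level(n, dir_level);
    }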
The following figure shows an example of two cases holding children.
--------------> Dir <--------------
| |
child child
child - child [hole] - child
child - child - child [hole] - [hole] - child
Case 1: Case 2:
Number of children = 6, Number of children = 3,
File size = 7 File size = 7
Default Block Allocation
------------------------
At runtime, F2FS manages six active logs inside "Main" area: Hot/Warm/Cold node
and Hot/Warm/Cold data.
- Hot node contains direct node blocks of directories.
- Warm node contains direct node blocks except hot node blocks.
- Cold node contains indirect node blocks
- Hot data contains dentry blocks
- Warm data contains data blocks except hot and cold data blocks
- Cold data contains multimedia data or migrated data blocks
LFS has two schemes for free space management: threaded log and
copy-and-compaction. The copy-and-compaction scheme, which is known as cleaning,
is well-suited
for devices showing very good sequential write performance, since free segments
are served all the time for writing new data. However, it suffers from cleaning
overhead under high utilization. Contrarily, the threaded log scheme suffers
from random writes, but no cleaning process is needed. F2FS adopts a hybrid
scheme where the copy-and-compaction scheme is adopted by default, but the
policy is dynamically changed to the threaded log scheme according to the file
system status.
In order to align F2FS with underlying flash-based storage, F2FS allocates a
segment in a unit of section. F2FS expects that the section size would be the
same as the unit size of garbage collection in FTL. Furthermore, with respect
to the mapping granularity in FTL, F2FS allocates each section of the active
logs from different zones as much as possible, since FTL can write the data in
the active logs into one allocation unit according to its mapping granularity.
Cleaning process
----------------
F2FS does cleaning both on demand and in the background. On-demand cleaning is
triggered when there are not enough free segments to serve VFS calls. Background
cleaner is operated by a kernel thread, and triggers the cleaning job when the
system is idle.
F2FS supports two victim selection policies: greedy and cost-benefit algorithms.
In the greedy algorithm, F2FS selects a victim segment having the smallest number
of valid blocks. In the cost-benefit algorithm, F2FS selects a victim segment
according to the segment age and the number of valid blocks in order to address
log block thrashing problem in the greedy algorithm. F2FS adopts the greedy
algorithm for on-demand cleaner, while background cleaner adopts cost-benefit
algorithm.
In order to identify whether the data in the victim segment are valid or not,
F2FS manages a bitmap. Each bit represents the validity of a block, and the
bitmap is composed of a bit stream covering whole blocks in main area.
@ -0,0 +1,228 @@
============
Fiemap Ioctl
============
The fiemap ioctl is an efficient method for userspace to get file
extent mappings. Instead of block-by-block mapping (such as bmap), fiemap
returns a list of extents.
Request Basics
--------------
A fiemap request is encoded within struct fiemap:
struct fiemap {
__u64 fm_start; /* logical offset (inclusive) at
* which to start mapping (in) */
__u64 fm_length; /* logical length of mapping which
* userspace cares about (in) */
__u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */
__u32 fm_mapped_extents; /* number of extents that were
* mapped (out) */
__u32 fm_extent_count; /* size of fm_extents array (in) */
__u32 fm_reserved;
struct fiemap_extent fm_extents[0]; /* array of mapped extents (out) */
};
fm_start, and fm_length specify the logical range within the file
which the process would like mappings for. Extents returned mirror
those on disk - that is, the logical offset of the 1st returned extent
may start before fm_start, and the range covered by the last returned
extent may end after fm_length. All offsets and lengths are in bytes.
Certain flags to modify the way in which mappings are looked up can be
set in fm_flags. If the kernel doesn't understand some particular
flags, it will return EBADR and the contents of fm_flags will contain
the set of flags which caused the error. If the kernel is compatible
with all flags passed, the contents of fm_flags will be unmodified.
It is up to userspace to determine whether rejection of a particular
flag is fatal to its operation. This scheme is intended to allow the
fiemap interface to grow in the future but without losing
compatibility with old software.
fm_extent_count specifies the number of elements in the fm_extents[] array
that can be used to return extents. If fm_extent_count is zero, then the
fm_extents[] array is ignored (no extents will be returned), and the
fm_mapped_extents count will hold the number of extents needed in
fm_extents[] to hold the file's current mapping. Note that there is
nothing to prevent the file from changing between calls to FIEMAP.
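For illustration, a userspace sketch of the two-call pattern this
enables: probe with fm_extent_count set to zero to learn how many
extents are needed, then allocate and repeat the call. Error handling
is abbreviated.

    #include <stdlib.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>
    #include <linux/fiemap.h>

    struct fiemap *map_whole_file(int fd)
    {
            struct fiemap probe, *fm;
            __u32 n;

            memset(&probe, 0, sizeof(probe));
            probe.fm_length = FIEMAP_MAX_OFFSET;    /* whole file */
            if (ioctl(fd, FS_IOC_FIEMAP, &probe) < 0)
                    return NULL;

            n = probe.fm_mapped_extents;
            fm = calloc(1, sizeof(*fm) + n * sizeof(struct fiemap_extent));
            if (!fm)
                    return NULL;
            fm->fm_length = FIEMAP_MAX_OFFSET;
            fm->fm_extent_count = n;
            if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
                    free(fm);
                    return NULL;
            }
            return fm;      /* fm->fm_mapped_extents extents follow */
    }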
The following flags can be set in fm_flags:
* FIEMAP_FLAG_SYNC
If this flag is set, the kernel will sync the file before mapping extents.
* FIEMAP_FLAG_XATTR
If this flag is set, the extents returned will describe the inode's
extended attribute lookup tree, instead of its data tree.
Extent Mapping
--------------
Extent information is returned within the embedded fm_extents array
which userspace must allocate along with the fiemap structure. The
number of elements in the fiemap_extents[] array should be passed via
fm_extent_count. The number of extents mapped by kernel will be
returned via fm_mapped_extents. If the number of fiemap_extents
allocated is less than would be required to map the requested range,
the maximum number of extents that can be mapped in the fm_extents[]
array will be returned and fm_mapped_extents will be equal to
fm_extent_count. In that case, the last extent in the array will not
complete the requested range and will not have the FIEMAP_EXTENT_LAST
flag set (see the next section on extent flags).
Each extent is described by a single fiemap_extent structure as
returned in fm_extents.
struct fiemap_extent {
__u64 fe_logical; /* logical offset in bytes for the start of
* the extent */
__u64 fe_physical; /* physical offset in bytes for the start
* of the extent */
__u64 fe_length; /* length in bytes for the extent */
__u64 fe_reserved64[2];
__u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */
__u32 fe_reserved[3];
};
All offsets and lengths are in bytes and mirror those on disk. It is valid
for an extent's logical offset to start before the request or its logical
length to extend past the request. Unless FIEMAP_EXTENT_NOT_ALIGNED is
returned, fe_logical, fe_physical, and fe_length will be aligned to the
block size of the file system. With the exception of extents flagged as
FIEMAP_EXTENT_MERGED, adjacent extents will not be merged.
The fe_flags field contains flags which describe the extent returned.
A special flag, FIEMAP_EXTENT_LAST is always set on the last extent in
the file so that the process making fiemap calls can determine when no
more extents are available, without having to call the ioctl again.
Some flags are intentionally vague and will always be set in the
presence of other more specific flags. This way a program looking for
a general property does not have to know all existing and future flags
which imply that property.
For example, if FIEMAP_EXTENT_DATA_INLINE or FIEMAP_EXTENT_DATA_TAIL
are set, FIEMAP_EXTENT_NOT_ALIGNED will also be set. A program looking
for inline or tail-packed data can key on the specific flag. Software
which simply cares not to try operating on non-aligned extents
however, can just key on FIEMAP_EXTENT_NOT_ALIGNED, and not have to
worry about all present and future flags which might imply unaligned
data. Note that the opposite is not true - it would be valid for
FIEMAP_EXTENT_NOT_ALIGNED to appear alone.
* FIEMAP_EXTENT_LAST
This is the last extent in the file. A mapping attempt past this
extent will return nothing.
* FIEMAP_EXTENT_UNKNOWN
The location of this extent is currently unknown. This may indicate
the data is stored on an inaccessible volume or that no storage has
been allocated for the file yet.
* FIEMAP_EXTENT_DELALLOC
- This will also set FIEMAP_EXTENT_UNKNOWN.
Delayed allocation - while there is data for this extent, its
physical location has not been allocated yet.
* FIEMAP_EXTENT_ENCODED
This extent does not consist of plain filesystem blocks but is
encoded (e.g. encrypted or compressed). Reading the data in this
extent via I/O to the block device will have undefined results.
Note that it is *always* undefined to try to update the data
in-place by writing to the indicated location without the
assistance of the filesystem, or to access the data using the
information returned by the FIEMAP interface while the filesystem
is mounted. In other words, user applications may only read the
extent data via I/O to the block device while the filesystem is
unmounted, and then only if the FIEMAP_EXTENT_ENCODED flag is
clear; user applications must not try reading or writing to the
filesystem via the block device under any other circumstances.
* FIEMAP_EXTENT_DATA_ENCRYPTED
- This will also set FIEMAP_EXTENT_ENCODED
The data in this extent has been encrypted by the file system.
* FIEMAP_EXTENT_NOT_ALIGNED
Extent offsets and length are not guaranteed to be block aligned.
* FIEMAP_EXTENT_DATA_INLINE
This will also set FIEMAP_EXTENT_NOT_ALIGNED
Data is located within a meta data block.
* FIEMAP_EXTENT_DATA_TAIL
This will also set FIEMAP_EXTENT_NOT_ALIGNED
Data is packed into a block with data from other files.
* FIEMAP_EXTENT_UNWRITTEN
Unwritten extent - the extent is allocated but its data has not been
initialized. This indicates the extent's data will be all zero if read
through the filesystem but the contents are undefined if read directly from
the device.
* FIEMAP_EXTENT_MERGED
This will be set when a file does not support extents, i.e., it uses a block
based addressing scheme. Since returning an extent for each block back to
userspace would be highly inefficient, the kernel will try to merge most
adjacent blocks into 'extents'.
VFS -> File System Implementation
---------------------------------
File systems wishing to support fiemap must implement a ->fiemap callback on
their inode_operations structure. The fs ->fiemap call is responsible for
defining its set of supported fiemap flags, and calling a helper function on
each discovered extent:
struct inode_operations {
...
int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
u64 len);
->fiemap is passed struct fiemap_extent_info which describes the
fiemap request:
struct fiemap_extent_info {
unsigned int fi_flags; /* Flags as passed from user */
unsigned int fi_extents_mapped; /* Number of mapped extents */
unsigned int fi_extents_max; /* Size of fiemap_extent array */
struct fiemap_extent *fi_extents_start; /* Start of fiemap_extent array */
};
It is intended that the file system should not need to access any of this
structure directly.
Flag checking should be done at the beginning of the ->fiemap callback via the
fiemap_check_flags() helper:
int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
The struct fieinfo should be passed in as received from ioctl_fiemap(). The
set of fiemap flags which the fs understands should be passed via fs_flags. If
fiemap_check_flags finds invalid user flags, it will place the bad values in
fieinfo->fi_flags and return -EBADR. If the file system gets -EBADR from
fiemap_check_flags(), it should immediately exit, returning that error back to
ioctl_fiemap().
For each extent in the request range, the file system should call
the helper function, fiemap_fill_next_extent():
int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
u64 phys, u64 len, u32 flags, u32 dev);
fiemap_fill_next_extent() will use the passed values to populate the
next free extent in the fm_extents array. 'General' extent flags will
automatically be set from specific flags on behalf of the calling file
system so that the userspace API is not broken.
fiemap_fill_next_extent() returns 0 on success, and 1 when the
user-supplied fm_extents array is full. If an error is encountered
while copying the extent to user memory, -EFAULT will be returned.
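Putting the pieces together, a rough kernel-side sketch of a ->fiemap
implementation is shown below. myfs_next_extent() is a hypothetical
per-filesystem extent lookup, and the helper calls follow the prototypes
quoted above; this is a sketch, not a reference implementation.

    #include <linux/fs.h>

    /* Hypothetical per-filesystem extent lookup, not part of the VFS API. */
    extern int myfs_next_extent(struct inode *inode, u64 start, u64 len,
                                u64 *logical, u64 *phys, u64 *elen, u32 *flags);

    static int myfs_fiemap(struct inode *inode,
                           struct fiemap_extent_info *fieinfo,
                           u64 start, u64 len)
    {
            u64 logical, phys, elen;
            u32 flags;
            int ret;

            /* Reject request flags this filesystem does not understand. */
            ret = fiemap_check_flags(fieinfo, FIEMAP_FLAG_SYNC);
            if (ret)
                    return ret;             /* -EBADR */

            while (myfs_next_extent(inode, start, len,
                                    &logical, &phys, &elen, &flags) == 0) {
                    ret = fiemap_fill_next_extent(fieinfo, logical, phys,
                                                  elen, flags, 0);
                    if (ret)                /* 1: array full, < 0: error */
                            break;
            }
            return ret < 0 ? ret : 0;
    }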
@ -0,0 +1,123 @@
File management in the Linux kernel
-----------------------------------
This document describes how locking for files (struct file)
and file descriptor table (struct files) works.
Up until 2.6.12, the file descriptor table has been protected
with a lock (files->file_lock) and reference count (files->count).
->file_lock protected accesses to all the file related fields
of the table. ->count was used for sharing the file descriptor
table between tasks cloned with CLONE_FILES flag. Typically
this would be the case for posix threads. As with the common
refcounting model in the kernel, the last task doing
a put_files_struct() frees the file descriptor (fd) table.
The files (struct file) themselves are protected using
reference count (->f_count).
In the new lock-free model of file descriptor management,
the reference counting is similar, but the locking is
based on RCU. The file descriptor table contains multiple
elements - the fd sets (open_fds and close_on_exec, the
array of file pointers, the sizes of the sets and the array
etc.). In order for the updates to appear atomic to
a lock-free reader, all the elements of the file descriptor
table are in a separate structure - struct fdtable.
files_struct contains a pointer to struct fdtable through
which the actual fd table is accessed. Initially the
fdtable is embedded in files_struct itself. On a subsequent
expansion of fdtable, a new fdtable structure is allocated
and files->fdtab points to the new structure. The fdtable
structure is freed with RCU and lock-free readers either
see the old fdtable or the new fdtable making the update
appear atomic. Here are the locking rules for
the fdtable structure -
1. All references to the fdtable must be done through
the files_fdtable() macro :
struct fdtable *fdt;
rcu_read_lock();
fdt = files_fdtable(files);
....
if (n <= fdt->max_fds)
....
...
rcu_read_unlock();
files_fdtable() uses rcu_dereference() macro which takes care of
the memory barrier requirements for lock-free dereference.
The fdtable pointer must be read within the read-side
critical section.
2. Reading of the fdtable as described above must be protected
by rcu_read_lock()/rcu_read_unlock().
3. For any update to the fd table, files->file_lock must
be held.
4. To look up the file structure given an fd, a reader
must use either fcheck() or fcheck_files() APIs. These
take care of barrier requirements due to lock-free lookup.
An example :
struct file *file;
rcu_read_lock();
file = fcheck(fd);
if (file) {
...
}
....
rcu_read_unlock();
5. Handling of the file structures is special. Since the look-up
of the fd (fget()/fget_light()) are lock-free, it is possible
that look-up may race with the last put() operation on the
file structure. This is avoided using atomic_long_inc_not_zero()
on ->f_count :
rcu_read_lock();
file = fcheck_files(files, fd);
if (file) {
if (atomic_long_inc_not_zero(&file->f_count))
*fput_needed = 1;
else
/* Didn't get the reference, someone's freed */
file = NULL;
}
rcu_read_unlock();
....
return file;
atomic_long_inc_not_zero() detects if the refcount is already zero or
goes to zero during the increment. If it does, we fail
fget()/fget_light().
6. Since both fdtable and file structures can be looked up
lock-free, they must be installed using rcu_assign_pointer()
API. If they are looked up lock-free, rcu_dereference()
must be used. However it is advisable to use files_fdtable()
and fcheck()/fcheck_files() which take care of these issues.
7. While updating, the fdtable pointer must be looked up while
holding files->file_lock. If ->file_lock is dropped, then
another thread may expand the files, thereby creating a new
fdtable and making the earlier fdtable pointer stale.
For example :
spin_lock(&files->file_lock);
fd = locate_fd(files, file, start);
if (fd >= 0) {
/* locate_fd() may have expanded fdtable, load the ptr */
fdt = files_fdtable(files);
__set_open_fd(fd, fdt);
__clear_close_on_exec(fd, fdt);
spin_unlock(&files->file_lock);
.....
Since locate_fd() can drop ->file_lock (and reacquire ->file_lock),
the fdtable pointer (fdt) must be loaded after locate_fd().
@ -0,0 +1,423 @@
Definitions
~~~~~~~~~~~
Userspace filesystem:
A filesystem in which data and metadata are provided by an ordinary
userspace process. The filesystem can be accessed normally through
the kernel interface.
Filesystem daemon:
The process(es) providing the data and metadata of the filesystem.
Non-privileged mount (or user mount):
A userspace filesystem mounted by a non-privileged (non-root) user.
The filesystem daemon is running with the privileges of the mounting
user. NOTE: this is not the same as mounts allowed with the "user"
option in /etc/fstab, which is not discussed here.
Filesystem connection:
A connection between the filesystem daemon and the kernel. The
connection exists until either the daemon dies, or the filesystem is
umounted. Note that detaching (or lazy umounting) the filesystem
does _not_ break the connection, in this case it will exist until
the last reference to the filesystem is released.
Mount owner:
The user who does the mounting.
User:
The user who is performing filesystem operations.
What is FUSE?
~~~~~~~~~~~~~
FUSE is a userspace filesystem framework. It consists of a kernel
module (fuse.ko), a userspace library (libfuse.*) and a mount utility
(fusermount).
One of the most important features of FUSE is allowing secure,
non-privileged mounts. This opens up new possibilities for the use of
filesystems. A good example is sshfs: a secure network filesystem
using the sftp protocol.
The userspace library and utilities are available from the FUSE
homepage:
http://fuse.sourceforge.net/
Filesystem type
~~~~~~~~~~~~~~~
The filesystem type given to mount(2) can be one of the following:
'fuse'
This is the usual way to mount a FUSE filesystem. The first
argument of the mount system call may contain an arbitrary string,
which is not interpreted by the kernel.
'fuseblk'
The filesystem is block device based. The first argument of the
mount system call is interpreted as the name of the device.
Mount options
~~~~~~~~~~~~~
'fd=N'
The file descriptor to use for communication between the userspace
filesystem and the kernel. The file descriptor must have been
obtained by opening the FUSE device ('/dev/fuse').
'rootmode=M'
The file mode of the filesystem's root in octal representation.
'user_id=N'
The numeric user id of the mount owner.
'group_id=N'
The numeric group id of the mount owner.
'default_permissions'
By default FUSE doesn't check file access permissions, the
filesystem is free to implement its access policy or leave it to
the underlying file access mechanism (e.g. in case of network
filesystems). This option enables permission checking, restricting
access based on file mode. It is usually useful together with the
'allow_other' mount option.
'allow_other'
This option overrides the security measure restricting file access
to the user mounting the filesystem. This option is by default only
allowed to root, but this restriction can be removed with a
(userspace) configuration option.
'max_read=N'
With this option the maximum size of read operations can be set.
The default is infinite. Note that the size of read requests is
limited anyway to 32 pages (which is 128kbyte on i386).
'blksize=N'
Set the block size for the filesystem. The default is 512. This
option is only valid for 'fuseblk' type mounts.
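For illustration, a sketch of what a (privileged) mounting helper such as
fusermount does with these options: it opens /dev/fuse and passes the
descriptor, root mode, and owner ids to mount(2). Paths and values are
illustrative.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mount.h>
    #include <unistd.h>

    int mount_fuse(const char *mountpoint)
    {
            char opts[128];
            int fd = open("/dev/fuse", O_RDWR);

            if (fd < 0)
                    return -1;
            /* rootmode=40000 is the octal mode of a directory. */
            snprintf(opts, sizeof(opts),
                     "fd=%d,rootmode=40000,user_id=%d,group_id=%d",
                     fd, (int)getuid(), (int)getgid());
            if (mount("myfs", mountpoint, "fuse",
                      MS_NOSUID | MS_NODEV, opts) < 0) {
                    close(fd);
                    return -1;
            }
            return fd;      /* the daemon reads requests from this fd */
    }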
Control filesystem
~~~~~~~~~~~~~~~~~~
There's a control filesystem for FUSE, which can be mounted by:
mount -t fusectl none /sys/fs/fuse/connections
Mounting it under the '/sys/fs/fuse/connections' directory makes it
backwards compatible with earlier versions.
Under the fuse control filesystem each connection has a directory
named by a unique number.
For each connection the following files exist within this directory:
'waiting'
The number of requests which are waiting to be transferred to
userspace or being processed by the filesystem daemon. If there is
no filesystem activity and 'waiting' is non-zero, then the
filesystem is hung or deadlocked.
'abort'
Writing anything into this file will abort the filesystem
connection. This means that all waiting requests will be aborted and an
error returned for all aborted and new requests.
Only the owner of the mount may read or write these files.
Interrupting filesystem operations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If a process issuing a FUSE filesystem request is interrupted, the
following will happen:
1) If the request is not yet sent to userspace AND the signal is
fatal (SIGKILL or unhandled fatal signal), then the request is
dequeued and returns immediately.
2) If the request is not yet sent to userspace AND the signal is not
fatal, then an 'interrupted' flag is set for the request. When
the request has been successfully transferred to userspace and
this flag is set, an INTERRUPT request is queued.
3) If the request is already sent to userspace, then an INTERRUPT
request is queued.
INTERRUPT requests take precedence over other requests, so the
userspace filesystem will receive queued INTERRUPTs before any others.
The userspace filesystem may ignore the INTERRUPT requests entirely,
or may honor them by sending a reply to the _original_ request, with
the error set to EINTR.
It is also possible that there's a race between processing the
original request and its INTERRUPT request. There are two possibilities:
1) The INTERRUPT request is processed before the original request is
processed
2) The INTERRUPT request is processed after the original request has
been answered
If the filesystem cannot find the original request, it should wait for
some timeout and/or a number of new requests to arrive, after which it
should reply to the INTERRUPT request with an EAGAIN error. In case
1) the INTERRUPT request will be requeued. In case 2) the INTERRUPT
reply will be ignored.
Aborting a filesystem connection
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
It is possible to get into certain situations where the filesystem is
not responding. Reasons for this may be:
a) Broken userspace filesystem implementation
b) Network connection down
c) Accidental deadlock
d) Malicious deadlock
(For more on c) and d) see later sections)
In either of these cases it may be useful to abort the connection to
the filesystem. There are several ways to do this:
- Kill the filesystem daemon. Works in case of a) and b)
- Kill the filesystem daemon and all users of the filesystem. Works
in all cases except some malicious deadlocks
- Use forced umount (umount -f). Works in all cases but only if
filesystem is still attached (it hasn't been lazy unmounted)
- Abort filesystem through the FUSE control filesystem. Most
powerful method, always works.
How do non-privileged mounts work?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Since the mount() system call is a privileged operation, a helper
program (fusermount) is needed, which is installed setuid root.
The implication of providing non-privileged mounts is that the mount
owner must not be able to use this capability to compromise the
system. Obvious requirements arising from this are:
A) mount owner should not be able to get elevated privileges with the
help of the mounted filesystem
B) mount owner should not get illegitimate access to information from
other users' and the super user's processes
C) mount owner should not be able to induce undesired behavior in
other users' or the super user's processes
How are requirements fulfilled?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A) The mount owner could gain elevated privileges by either:
1) creating a filesystem containing a device file, then opening
this device
2) creating a filesystem containing a suid or sgid application,
then executing this application
The solution is not to allow opening device files and ignore
setuid and setgid bits when executing programs. To ensure this
fusermount always adds "nosuid" and "nodev" to the mount options
for non-privileged mounts.
B) If another user is accessing files or directories in the
filesystem, the filesystem daemon serving requests can record the
exact sequence and timing of operations performed. This
information is otherwise inaccessible to the mount owner, so this
counts as an information leak.
The solution to this problem will be presented in point 2) of C).
C) There are several ways in which the mount owner can induce
undesired behavior in other users' processes, such as:
1) mounting a filesystem over a file or directory which the mount
owner could otherwise not be able to modify (or could only
make limited modifications).
This is solved in fusermount, by checking the access
permissions on the mountpoint and only allowing the mount if
the mount owner can do unlimited modification (has write
access to the mountpoint, and mountpoint is not a "sticky"
directory)
2) Even if 1) is solved the mount owner can change the behavior
of other users' processes.
i) It can slow down or indefinitely delay the execution of a
filesystem operation creating a DoS against the user or the
whole system. For example, a suid application locking a
system file and then accessing a file on the mount owner's
filesystem could be stopped, thus causing the system
file to be locked forever.
ii) It can present files or directories of unlimited length, or
directory structures of unlimited depth, possibly causing a
system process to eat up diskspace, memory or other
resources, again causing DoS.
The solution to this as well as B) is not to allow processes
to access the filesystem, which could otherwise not be
monitored or manipulated by the mount owner. Since if the
mount owner can ptrace a process, it can do all of the above
without using a FUSE mount, the same criteria as used in
ptrace can be used to check if a process is allowed to access
the filesystem or not.
Note that the ptrace check is not strictly necessary to
prevent B/2/i, it is enough to check if mount owner has enough
privilege to send signal to the process accessing the
filesystem, since SIGSTOP can be used to get a similar effect.
I think these limitations are unacceptable?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If a sysadmin trusts the users enough, or can ensure through other
measures, that system processes will never enter non-privileged
mounts, it can relax the last limitation with a "user_allow_other"
config option. If this config option is set, the mounting user can
add the "allow_other" mount option which disables the check for other
users' processes.
Kernel - userspace interface
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The following diagram shows how a filesystem operation (in this
example unlink) is performed in FUSE.
NOTE: everything in this description is greatly simplified
| "rm /mnt/fuse/file" | FUSE filesystem daemon
| |
| | >sys_read()
| | >fuse_dev_read()
| | >request_wait()
| | [sleep on fc->waitq]
| |
| >sys_unlink() |
| >fuse_unlink() |
| [get request from |
| fc->unused_list] |
| >request_send() |
| [queue req on fc->pending] |
| [wake up fc->waitq] | [woken up]
| >request_wait_answer() |
| [sleep on req->waitq] |
| | <request_wait()
| | [remove req from fc->pending]
| | [copy req to read buffer]
| | [add req to fc->processing]
| | <fuse_dev_read()
| | <sys_read()
| |
| | [perform unlink]
| |
| | >sys_write()
| | >fuse_dev_write()
| | [look up req in fc->processing]
| | [remove from fc->processing]
| | [copy write buffer to req]
| [woken up] | [wake up req->waitq]
| | <fuse_dev_write()
| | <sys_write()
| <request_wait_answer() |
| <request_send() |
| [add request to |
| fc->unused_list] |
| <fuse_unlink() |
| <sys_unlink() |
There are a couple of ways in which to deadlock a FUSE filesystem.
Since we are talking about unprivileged userspace programs,
something must be done about these.
Scenario 1 - Simple deadlock
-----------------------------
| "rm /mnt/fuse/file" | FUSE filesystem daemon
| |
| >sys_unlink("/mnt/fuse/file") |
| [acquire inode semaphore |
| for "file"] |
| >fuse_unlink() |
| [sleep on req->waitq] |
| | <sys_read()
| | >sys_unlink("/mnt/fuse/file")
| | [acquire inode semaphore
| | for "file"]
| | *DEADLOCK*
The solution for this is to allow the filesystem to be aborted.
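One way to do that, assuming the fusectl filesystem is mounted in its usual
place under /sys/fs/fuse/connections, is to write to the connection's abort
file (NNN below stands for the connection number and is only a placeholder):

  echo 1 > /sys/fs/fuse/connections/NNN/abort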
Scenario 2 - Tricky deadlock
----------------------------
This one needs a carefully crafted filesystem. It's a variation on
the above, only the call back to the filesystem is not explicit,
but is caused by a pagefault.
| Kamikaze filesystem thread 1 | Kamikaze filesystem thread 2
| |
| [fd = open("/mnt/fuse/file")] | [request served normally]
| [mmap fd to 'addr'] |
| [close fd] | [FLUSH triggers 'magic' flag]
| [read a byte from addr] |
| >do_page_fault() |
| [find or create page] |
| [lock page] |
| >fuse_readpage() |
| [queue READ request] |
| [sleep on req->waitq] |
| | [read request to buffer]
| | [create reply header before addr]
| | >sys_write(addr - headerlength)
| | >fuse_dev_write()
| | [look up req in fc->processing]
| | [remove from fc->processing]
| | [copy write buffer to req]
| | >do_page_fault()
| | [find or create page]
| | [lock page]
| | * DEADLOCK *
The solution is basically the same as above.
An additional problem is that while the write buffer is being copied
to the request, the request must not be interrupted/aborted. This is
because the destination address of the copy may not be valid after the
request has returned.
This is solved by doing the copy atomically, and by allowing abort
while the page(s) belonging to the write buffer are faulted in with
get_user_pages(). The 'req->locked' flag indicates when the copy is
taking place, and abort is delayed until this flag is unset.
@ -0,0 +1,231 @@
Glock internal locking rules
------------------------------
This documents the basic principles of the glock state machine
internals. Each glock (struct gfs2_glock in fs/gfs2/incore.h)
has two main (internal) locks:
1. A spinlock (gl_spin) which protects the internal state such
as gl_state, gl_target and the list of holders (gl_holders)
2. A non-blocking bit lock, GLF_LOCK, which is used to prevent other
threads from making calls to the DLM, etc. at the same time. If a
thread takes this lock, it must then call run_queue (usually via the
workqueue) when it releases it in order to ensure any pending tasks
are completed.
The gl_holders list contains all the queued lock requests (not
just the holders) associated with the glock. If there are any
held locks, then they will be contiguous entries at the head
of the list. Locks are granted in strictly the order that they
are queued, except for those marked LM_FLAG_PRIORITY which are
used only during recovery, and even then only for journal locks.
There are three lock states that users of the glock layer can request,
namely shared (SH), deferred (DF) and exclusive (EX). Those translate
to the following DLM lock modes:
Glock mode | DLM lock mode
------------------------------
UN | IV/NL Unlocked (no DLM lock associated with glock) or NL
SH | PR (Protected read)
DF | CW (Concurrent write)
EX | EX (Exclusive)
Thus DF is basically a shared mode which is incompatible with the "normal"
shared lock mode, SH. In GFS2 the DF mode is used exclusively for direct I/O
operations. The glocks are basically a lock plus some routines which deal
with cache management. The following rules apply for the cache:
Glock mode | Cache data | Cache Metadata | Dirty Data | Dirty Metadata
--------------------------------------------------------------------------
UN | No | No | No | No
SH | Yes | Yes | No | No
DF | No | Yes | No | No
EX | Yes | Yes | Yes | Yes
These rules are implemented using the various glock operations which
are defined for each type of glock. Not all types of glocks use
all the modes. Only inode glocks use the DF mode for example.
Table of glock operations and per type constants:
Field | Purpose
----------------------------------------------------------------------------
go_xmote_th | Called before remote state change (e.g. to sync dirty data)
go_xmote_bh | Called after remote state change (e.g. to refill cache)
go_inval | Called if remote state change requires invalidating the cache
go_demote_ok | Returns boolean value of whether it's ok to demote a glock
| (e.g. checks timeout, and that there is no cached data)
go_lock | Called for the first local holder of a lock
go_unlock | Called on the final local unlock of a lock
go_dump | Called to print content of object for debugfs file, or on
| error to dump glock to the log.
go_type | The type of the glock, LM_TYPE_.....
go_callback | Called if the DLM sends a callback to drop this lock
go_flags | GLOF_ASPACE is set, if the glock has an address space
| associated with it
The minimum hold time for each lock is the time after a remote lock
grant for which we ignore remote demote requests. This is in order to
prevent a situation where locks are being bounced around the cluster
from node to node with none of the nodes making any progress. This
tends to show up most with shared mmaped files which are being written
to by multiple nodes. By delaying the demotion in response to a
remote callback, that gives the userspace program time to make
some progress before the pages are unmapped.
There is a plan to try to remove the go_lock and go_unlock callbacks
if possible, in order to speed up the fast path through the locking.
Also, eventually we hope to make the glock "EX" mode locally shared
such that any local locking will be done with the i_mutex as required
rather than via the glock.
Locking rules for glock operations:
Operation | GLF_LOCK bit lock held | gl_spin spinlock held
-----------------------------------------------------------------
go_xmote_th | Yes | No
go_xmote_bh | Yes | No
go_inval | Yes | No
go_demote_ok | Sometimes | Yes
go_lock | Yes | No
go_unlock | Yes | No
go_dump | Sometimes | Yes
go_callback | Sometimes (N/A) | Yes
N.B. Operations must not drop either the bit lock or the spinlock
if it is held on entry. go_dump and go_demote_ok must never block.
Note that go_dump will only be called if the glock's state
indicates that it is caching uptodate data.
Glock locking order within GFS2:
1. i_mutex (if required)
2. Rename glock (for rename only)
3. Inode glock(s)
(Parents before children, inodes at "same level" with same parent in
lock number order)
4. Rgrp glock(s) (for (de)allocation operations)
5. Transaction glock (via gfs2_trans_begin) for non-read operations
6. Page lock (always last, very important!)
There are two glocks per inode. One deals with access to the inode
itself (locking order as above), and the other, known as the iopen
glock is used in conjunction with the i_nlink field in the inode to
determine the lifetime of the inode in question. Locking of inodes
is on a per-inode basis. Locking of rgrps is on a per rgrp basis.
In general we prefer to lock local locks prior to cluster locks.
Glock Statistics
------------------
The stats are divided into two sets: those relating to the
super block and those relating to an individual glock. The
super block stats are done on a per cpu basis in order to
try and reduce the overhead of gathering them. They are also
further divided by glock type. All timings are in nanoseconds.
In the case of both the super block and glock statistics,
the same information is gathered in each case. The super
block timing statistics are used to provide default values for
the glock timing statistics, so that newly created glocks
should have, as far as possible, a sensible starting point.
The per-glock counters are initialised to zero when the
glock is created. The per-glock statistics are lost when
the glock is ejected from memory.
The statistics are divided into three pairs of mean and
variance, plus two counters. The mean/variance pairs are
smoothed exponential estimates and the algorithm used is
one which will be very familiar to those used to the calculation
of round trip times in network code. See "TCP/IP Illustrated,
Volume 1", W. Richard Stevens, sect 21.3, "Round-Trip Time Measurement",
p. 299 and onwards. Also, Volume 2, Sect. 25.10, p. 838 and onwards.
Unlike the TCP/IP Illustrated case, the mean and variance are
not scaled, but are in units of integer nanoseconds.
The three pairs of mean/variance measure the following
things:
1. DLM lock time (non-blocking requests)
2. DLM lock time (blocking requests)
3. Inter-request time (again to the DLM)
A non-blocking request is one which will complete right
away, whatever the state of the DLM lock in question. That
currently means any requests when (a) the current state of
the lock is exclusive, i.e. a lock demotion (b) the requested
state is either null or unlocked (again, a demotion) or (c) the
"try lock" flag is set. A blocking request covers all the other
lock requests.
There are two counters. The first is there primarily to show
how many lock requests have been made, and thus how much data
has gone into the mean/variance calculations. The other counter
counts the queuing of holders at the top layer of the glock
code. Hopefully that number will be a lot larger than the number
of dlm lock requests issued.
So why gather these statistics? There are several reasons
we'd like to get a better idea of these timings:
1. To be able to better set the glock "min hold time"
2. To spot performance issues more easily
3. To improve the algorithm for selecting resource groups for
allocation (to base it on lock wait time, rather than blindly
using a "try lock")
Due to the smoothing action of the updates, a step change in
some input quantity being sampled will only fully be taken
into account after 8 samples (or 4 for the variance) and this
needs to be carefully considered when interpreting the
results.
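As a rough sketch (not the kernel's exact code) of the update implied by the
8/4 sample figures above, each new sample nudges the smoothed mean by 1/8 of
the error and the variance estimate by 1/4 of the deviation, with all values
kept in integer nanoseconds:

  static void glock_stat_update(long long *mean, long long *var,
                                long long sample)
  {
          long long delta = sample - *mean;

          *mean += delta / 8;             /* step response: ~8 samples */
          if (delta < 0)
                  delta = -delta;
          *var += (delta - *var) / 4;     /* step response: ~4 samples */
  }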
Knowing both the time it takes a lock request to complete and
the average time between lock requests for a glock means we
can compute the total percentage of the time for which the
node is able to use a glock vs. time that the rest of the
cluster has its share. That will be very useful when setting
the lock min hold time.
Great care has been taken to ensure that we
measure exactly the quantities that we want, as accurately
as possible. There are always inaccuracies in any
measuring system, but I hope this is as accurate as we
can reasonably make it.
Per sb stats can be found here:
/sys/kernel/debug/gfs2/<fsname>/sbstats
Per glock stats can be found here:
/sys/kernel/debug/gfs2/<fsname>/glstats
This assumes that debugfs is mounted on /sys/kernel/debug and that
<fsname> is replaced with the name of the gfs2 filesystem in
question.
The abbreviations used in the output are as follows:
srtt - Smoothed round trip time for non-blocking dlm requests
srttvar - Variance estimate for srtt
srttb - Smoothed round trip time for (potentially) blocking dlm requests
srttvarb - Variance estimate for srttb
sirt - Smoothed inter-request time (for dlm requests)
sirtvar - Variance estimate for sirt
dlm - Number of dlm requests made (dcnt in glstats file)
queue - Number of glock requests queued (qcnt in glstats file)
The sbstats file contains a set of these stats for each glock type (so 8 lines
for each type) and for each cpu (one column per cpu). The glstats file contains
a set of these stats for each glock in a similar format to the glocks file, but
using the format mean/variance for each of the timing stats.
The gfs2_glock_lock_time tracepoint prints out the current values of the stats
for the glock in question, along with some additional information on each dlm
reply that is received:
status - The status of the dlm request
flags - The dlm request flags
tdiff - The time taken by this specific request
(remaining fields as per above list)
@ -0,0 +1,100 @@
uevents and GFS2
==================
During the lifetime of a GFS2 mount, a number of uevents are generated.
This document explains what the events are and what they are used
for (by gfs_controld in gfs2-utils).
A list of GFS2 uevents
-----------------------
1. ADD
The ADD event occurs at mount time. It will always be the first
uevent generated by the newly created filesystem. If the mount
is successful, an ONLINE uevent will follow. If it is not successful
then a REMOVE uevent will follow.
The ADD uevent has two environment variables: SPECTATOR=[0|1]
and RDONLY=[0|1] that specify the spectator status (a read-only mount
with no journal assigned), and read-only (with journal assigned) status
of the filesystem respectively.
2. ONLINE
The ONLINE uevent is generated after a successful mount or remount. It
has the same environment variables as the ADD uevent. The ONLINE
uevent, along with the two environment variables for spectator and
RDONLY are a relatively recent addition (2.6.32-rc+) and will not
be generated by older kernels.
3. CHANGE
The CHANGE uevent is used in two places. One is when reporting the
successful mount of the filesystem by the first node (FIRSTMOUNT=Done).
This is used as a signal by gfs_controld that it is then ok for other
nodes in the cluster to mount the filesystem.
The other CHANGE uevent is used to inform of the completion
of journal recovery for one of the filesystem's journals. It has
two environment variables, JID= which specifies the journal id which
has just been recovered, and RECOVERY=[Done|Failed] to indicate the
success (or otherwise) of the operation. These uevents are generated
for every journal recovered, whether it is during the initial mount
process or as the result of gfs_controld requesting a specific journal
recovery via the /sys/fs/gfs2/<fsname>/lock_module/recovery file.
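For illustration only (this is normally done by gfs_controld rather than by
hand, and journal id 0 is just a placeholder), recovery of a specific journal
can be requested by writing its id to that file:

  echo 0 > /sys/fs/gfs2/<fsname>/lock_module/recovery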
Because the CHANGE uevent was used (in early versions of gfs_controld)
without checking the environment variables to discover the state, we
cannot add any more functions to it without running the risk of
someone using an older version of the user tools and breaking their
cluster. For this reason the ONLINE uevent was used when adding a new
uevent for a successful mount or remount.
4. OFFLINE
The OFFLINE uevent is only generated due to filesystem errors and is used
as part of the "withdraw" mechanism. Currently this doesn't give any
information about what the error is, which is something that needs to
be fixed.
5. REMOVE
The REMOVE uevent is generated at the end of an unsuccessful mount
or at the end of a umount of the filesystem. All REMOVE uevents will
have been preceded by at least an ADD uevent for the same filesystem,
and unlike the other uevents, it is generated automatically by the
kernel's kobject subsystem.
Information common to all GFS2 uevents (uevent environment variables)
----------------------------------------------------------------------
1. LOCKTABLE=
The LOCKTABLE is a string, as supplied on the mount command
line (locktable=) or via fstab. It is used as a filesystem label
as well as providing the information for a lock_dlm mount to be
able to join the cluster.
2. LOCKPROTO=
The LOCKPROTO is a string, and its value depends on what is set
on the mount command line, or via fstab. It will be either
lock_nolock or lock_dlm. In the future other lock managers
may be supported.
3. JOURNALID=
If a journal is in use by the filesystem (journals are not
assigned for spectator mounts) then this will give the
numeric journal id in all GFS2 uevents.
4. UUID=
With recent versions of gfs2-utils, mkfs.gfs2 writes a UUID
into the filesystem superblock. If it exists, this will
be included in every uevent relating to the filesystem.
@ -0,0 +1,45 @@
Global File System
------------------
https://fedorahosted.org/cluster/wiki/HomePage
GFS is a cluster file system. It allows a cluster of computers to
simultaneously use a block device that is shared between them (with FC,
iSCSI, NBD, etc). GFS reads and writes to the block device like a local
file system, but also uses a lock module to allow the computers to coordinate
their I/O so file system consistency is maintained. One of the nifty
features of GFS is perfect consistency -- changes made to the file system
on one machine show up immediately on all other machines in the cluster.
GFS uses interchangeable inter-node locking mechanisms, the currently
supported mechanisms are:
lock_nolock -- allows gfs to be used as a local file system
lock_dlm -- uses a distributed lock manager (dlm) for inter-node locking
The dlm is found at linux/fs/dlm/
Lock_dlm depends on user space cluster management systems found
at the URL above.
To use gfs as a local file system, no external clustering systems are
needed, simply:
$ mkfs -t gfs2 -p lock_nolock -j 1 /dev/block_device
$ mount -t gfs2 /dev/block_device /dir
If you are using Fedora, you need to install the gfs2-utils package
and, for lock_dlm, you will also need to install the cman package
and write a cluster.conf as per the documentation. For F17 and above
cman has been replaced by the dlm package.
GFS2 is not on-disk compatible with previous versions of GFS, but it
is pretty close.
The following man pages can be found at the URL above:
fsck.gfs2 to repair a filesystem
gfs2_grow to expand a filesystem online
gfs2_jadd to add journals to a filesystem online
tunegfs2 to manipulate, examine and tune a filesystem
gfs2_convert to convert a gfs filesystem to gfs2 in-place
mkfs.gfs2 to make a filesystem
@ -0,0 +1,82 @@
Note: This filesystem doesn't have a maintainer.
Macintosh HFS Filesystem for Linux
==================================
HFS stands for ``Hierarchical File System'' and is the filesystem used
by the Mac Plus and all later Macintosh models. Earlier Macintosh
models used MFS (``Macintosh File System''), which is not supported.
MacOS 8.1 and newer support a filesystem called HFS+ that's similar to
HFS but is extended in various areas. Use the hfsplus filesystem driver
to access such filesystems from Linux.
Mount options
=============
When mounting an HFS filesystem, the following options are accepted:
creator=cccc, type=cccc
Specifies the creator/type values as shown by the MacOS finder
used for creating new files. Default values: '????'.
uid=n, gid=n
Specifies the user/group that owns all files on the filesystems.
Default: user/group id of the mounting process.
dir_umask=n, file_umask=n, umask=n
Specifies the umask used for all files, all directories or all
files and directories. Defaults to the umask of the mounting process.
session=n
Select the CDROM session to mount as HFS filesystem. Defaults to
leaving that decision to the CDROM driver. This option will fail
with anything but a CDROM as the underlying device.
part=n
Select partition number n from the device. This only makes
sense for CDROMs because they can't be partitioned under Linux.
For disk devices the generic partition parsing code does this
for us. Defaults to not parsing the partition table at all.
quiet
Ignore invalid mount options instead of complaining.
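For example (a hedged illustration; the device and mount point are
placeholders), an HFS volume could be mounted with explicit ownership and
permissions:

  mount -t hfs -o uid=1000,gid=1000,umask=022 /dev/sdb1 /mnt/hfs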
Writing to HFS Filesystems
==========================
HFS is not a UNIX filesystem, thus it does not have the usual features you'd
expect:
o You can't modify the set-uid, set-gid, sticky or executable bits or the uid
and gid of files.
o You can't create hard- or symlinks, device files, sockets or FIFOs.
HFS does on the other hand have the concept of multiple forks per file. These
non-standard forks are represented as hidden additional files in the normal
filesystem namespace, which is kind of a kludge and makes the semantics
a little strange:
o You can't create, delete or rename resource forks of files or the
Finder's metadata.
o They are however created (with default values), deleted and renamed
along with the corresponding data fork or directory.
o Copying files to a different filesystem will lose those attributes
that are essential for MacOS to work.
Creating HFS filesystems
===================================
The hfsutils package from Robert Leslie contains a program called
hformat that can be used to create HFS filesystems. See
<http://www.mars.org/home/rob/proj/hfs/> for details.
Credits
=======
The HFS driver was written by Paul H. Hargrove (hargrove@sccm.Stanford.EDU).
Roman Zippel (roman@ardistech.com) rewrote large parts of the code and brought
in btree routines derived from Brad Boyer's hfsplus driver.
@ -0,0 +1,59 @@
Macintosh HFSPlus Filesystem for Linux
======================================
HFSPlus is a filesystem first introduced in MacOS 8.1.
HFSPlus has several extensions to HFS, including 32-bit allocation
blocks, 255-character unicode filenames, and file sizes of 2^63 bytes.
Mount options
=============
When mounting an HFSPlus filesystem, the following options are accepted:
creator=cccc, type=cccc
Specifies the creator/type values as shown by the MacOS finder
used for creating new files. Default values: '????'.
uid=n, gid=n
Specifies the user/group that owns all files on the filesystem
that have uninitialized permissions structures.
Default: user/group id of the mounting process.
umask=n
Specifies the umask (in octal) used for files and directories
that have uninitialized permissions structures.
Default: umask of the mounting process.
session=n
Select the CDROM session to mount as HFSPlus filesystem. Defaults to
leaving that decision to the CDROM driver. This option will fail
with anything but a CDROM as the underlying device.
part=n
Select partition number n from the devices. This option only makes
sense for CDROMs because they can't be partitioned under Linux.
For disk devices the generic partition parsing code does this
for us. Defaults to not parsing the partition table at all.
decompose
Decompose file name characters.
nodecompose
Do not decompose file name characters.
force
Used to force write access to volumes that are marked as journalled
or locked. Use at your own risk.
nls=cccc
Encoding to use when presenting file names.
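For example (a hedged illustration; the device and mount point are
placeholders), an HFSPlus volume could be mounted with explicit ownership and
a UTF-8 name encoding:

  mount -t hfsplus -o uid=1000,gid=1000,umask=022,nls=utf8 /dev/sdb2 /mnt/mac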
References
==========
kernel source: <file:fs/hfsplus>
Apple Technote 1150 https://developer.apple.com/legacy/library/technotes/tn/tn1150.html
@ -0,0 +1,296 @@
Read/Write HPFS 2.09
1998-2004, Mikulas Patocka
email: mikulas@artax.karlin.mff.cuni.cz
homepage: http://artax.karlin.mff.cuni.cz/~mikulas/vyplody/hpfs/index-e.cgi
CREDITS:
Chris Smith, 1993, original read-only HPFS, some code and hpfs structures file
is taken from it
Jacques Gelinas, MSDos mmap, Inspired by fs/nfs/mmap.c (Jon Tombs 15 Aug 1993)
Werner Almesberger, 1992, 1993, MSDos option parser & CR/LF conversion
Mount options
uid=xxx,gid=xxx,umask=xxx (default uid=gid=0 umask=default_system_umask)
Set owner/group/mode for files that do not have it specified in extended
attributes. Mode is inverted umask - for example umask 027 gives owner
all permission, group read permission and anybody else no access. Note
that for files mode is anded with 0666. If you want files to have 'x'
rights, you must use extended attributes.
case=lower,asis (default asis)
File name lowercasing in readdir.
conv=binary,text,auto (default binary)
CR/LF -> LF conversion, if auto, decision is made according to extension
- there is a list of text extensions (I think it's better to not convert
a text file than to damage a binary file). If you want to change that list,
change it in the source. Original readonly HPFS contained some strange
heuristic algorithm that I removed. I think it's dangerous to let the
computer decide whether a file is text or binary. For example, DJGPP
binaries contain a small text message at the beginning and they could be
misidentified and damaged under some circumstances.
check=none,normal,strict (default normal)
Check level. Selecting none will give only a little speedup but is
dangerous. I tried to write it so that it won't crash if check=normal on
corrupted filesystems. check=strict means many superfluous checks -
used for debugging (for example it checks if file is allocated in
bitmaps when accessing it).
errors=continue,remount-ro,panic (default remount-ro)
Behaviour when filesystem errors found.
chkdsk=no,errors,always (default errors)
When to mark filesystem dirty so that OS/2 checks it.
eas=no,ro,rw (default rw)
What to do with extended attributes. 'no' - ignore them and use always
values specified in uid/gid/mode options. 'ro' - read extended
attributes but do not create them. 'rw' - create extended attributes
when you use chmod/chown/chgrp/mknod/ln -s on the filesystem.
timeshift=(-)nnn (default 0)
Shifts the time by nnn seconds. For example, if you see one hour more
under Linux than under OS/2, use timeshift=-3600.
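For example (a hedged illustration; the device and mount point are
placeholders), several of the options above can be combined on one mount:

  mount -t hpfs -o uid=1000,gid=1000,umask=027,case=lower,conv=auto /dev/hda5 /mnt/hpfs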
File names
As in OS/2, filenames are case insensitive. However, the shell thinks that names
are case sensitive, so for example when you create a file FOO, you can use
'cat FOO', 'cat Foo', 'cat foo' or 'cat F*' but not 'cat f*'. Note that you
also won't be able to compile the linux kernel (and maybe other things) on HPFS
because the kernel creates different files with names like bootsect.S and
bootsect.s. When searching for a file whose name has characters >= 128, codepages
are used - see below.
OS/2 ignores dots and spaces at the end of file name, so this driver does as
well. If you create 'a. ...', the file 'a' will be created, but you can still
access it under names 'a.', 'a..', 'a . . . ' etc.
Extended attributes
On HPFS partitions, OS/2 can associate with each file special information called
extended attributes. Extended attributes are pairs of (key,value) where key is
an ascii string identifying that attribute and value is any string of bytes of
variable length. OS/2 stores window and icon positions and file types there. So
why not use it for unix-specific info like file owner or access rights? This
driver can do it. If you chown/chgrp/chmod on a hpfs partition, extended
attributes with keys "UID", "GID" or "MODE" and 2-byte values are created. Only
those extended attributes whose value differs from the defaults specified in the
mount options are created. Once created, the extended attributes are never deleted,
they're just changed. It means that when your default uid=0 and you type
something like 'chown luser file; chown root file' the file will contain
extended attribute UID=0. And when you umount the fs and mount it again with
uid=luser_uid, the file will still be owned by root! If you chmod file to 444,
extended attribute "MODE" will not be set, this special case is done by setting
read-only flag. When you mknod a block or char device, besides "MODE", the
special 4-byte extended attribute "DEV" will be created containing the device
number. Currently this driver cannot resize extended attributes - it means
that if somebody (I don't know who?) has set "UID", "GID", "MODE" or "DEV"
attributes with different sizes, they won't be rewritten and changing these
values doesn't work.
Symlinks
You can do symlinks on an HPFS partition; symlinks are achieved by setting an
extended attribute named "SYMLINK" with the symlink value. Like on ext2, you can
chown and chgrp symlinks but I don't know what it is good for. chmoding a symlink
results in chmoding the file the symlink points to. These symlinks are just for
Linux use and incompatible with OS/2. OS/2 PmShell symlinks are not supported
because they are stored in a very crazy way. They tried to do it so that the link
changes when the file is moved ... sometimes it works. But the link is partly
stored in directory extended attributes and partly in OS2SYS.INI. I don't want
(and don't know how) to analyze or change OS2SYS.INI.
Codepages
HPFS can contain several uppercasing tables for several codepages and each
file has a pointer to codepage its name is in. However OS/2 was created in
America where people don't care much about codepages and so multiple codepages
support is quite buggy. I have Czech OS/2 working in codepage 852 on my disk.
Once I booted English OS/2 working in cp 850 and I created a file on my 852
partition. It marked file name codepage as 850 - good. But when I again booted
Czech OS/2, the file was completely inaccessible under any name. It seems that
OS/2 uppercases the search pattern with its system code page (852) and file
name it's comparing to with its code page (850). These could never match. Is it
really what IBM developers wanted? But problems continued. When I created in
Czech OS/2 another file in that directory, that file was inaccessible too. OS/2
probably uses different uppercasing method when searching where to place a file
(note, that files in HPFS directory must be sorted) and when searching for
a file. Finally when I opened this directory in PmShell, PmShell crashed (the
funny thing was that, when rebooted, PmShell tried to reopen this directory
again :-). chkdsk happily ignores these errors and only low-level disk
modification saved me. Never mix different language versions of OS/2 on one
system although HPFS was designed to allow that.
OK, I could implement complex codepage support to this driver but I think it
would cause more problems than benefit with such buggy implementation in OS/2.
So this driver simply uses first codepage it finds for uppercasing and
lowercasing no matter what's file codepage index. Usually all file names are in
this codepage - if you don't try to do what I described above :-)
Known bugs
HPFS386 on OS/2 server is not supported. HPFS386 installed on normal OS/2 client
should work. If you have OS/2 server, use only read-only mode. I don't know how
to handle some HPFS386 structures like access control list or extended perm
list, I don't know how to delete them when file is deleted and how to not
overwrite them with extended attributes. Send me some info on these structures
and I'll make it. However, this driver should detect presence of HPFS386
structures, remount read-only and not destroy them (I hope).
When there's not enough space for extended attributes, they will be truncated
and no error is returned.
OS/2 can't access files if the path is longer than about 256 chars but this
driver allows you to do it. chkdsk ignores such errors.
Sometimes you won't be able to delete some files on a very full filesystem
(returning error ENOSPC). That's because a file in a non-leaf node of the directory
tree (one directory, if it's large, has its dirents in a tree on HPFS) must be
replaced with another node when deleted. And that new file might have a larger name
than the old one so the new name doesn't fit in the directory node (dnode). And that
would result in directory tree splitting, which takes disk space. The workaround is
to delete other files that are leaves (the probability that a file is non-leaf is
about 1/50) or to truncate the file first to make some space.
You encounter this problem only if you have many directories, so that the
preallocated directory band is full, i.e.
number_of_directories / size_of_filesystem_in_mb > 4.
You can't delete open directories.
You can't rename over directories (what is it good for?).
Renaming files so that only case changes doesn't work. This driver supports it
but vfs doesn't. Something like 'mv file FILE' won't work.
Atimes and directory mtimes are not updated. That's because of performance
reasons. If you extremely wish to update them, let me know, I'll write it (but
it will be slow).
When the system is out of memory and swap, it may slightly corrupt the filesystem
(lost files, unbalanced directories). (I guess all filesystems may do it).
When compiled, you get warning: function declaration isn't a prototype. Does
anybody know what it means?
What does "unbalanced tree" message mean?
Old versions of this driver sometimes created unbalanced dnode trees. OS/2
chkdsk doesn't scream if the tree is unbalanced (and sometimes creates
unbalanced trees too :-) but both HPFS and HPFS386 contain a bug that makes them
rarely crash when the tree is not balanced. This driver handles unbalanced trees
correctly and writes a warning if it finds them. If you see this message, it is
probably because of directories created with an old version of this driver.
The workaround is to move all files from that directory to another and then back
again. Do it in Linux, not OS/2! If you see this message in a directory that was
entirely created by this driver, it is a BUG - let me know about it.
Bugs in OS/2
When you have two (or more) lost directories pointing each to other, chkdsk
locks up when repairing filesystem.
Sometimes (I think it's random) when you create a file with one-char name under
OS/2, OS/2 marks it as 'long'. chkdsk then removes this flag saying "Minor fs
error corrected".
File names like "a .b" are marked as 'long' by OS/2 but chkdsk "corrects" it and
marks them as short (and writes "minor fs error corrected"). This bug is not in
HPFS386.
Codepage bugs described above.
If you don't install fixpacks, there are many, many more...
History
0.90 First public release
0.91 Fixed bug that caused shooting to memory when write_inode was called on
open inode (rarely happened)
0.92 Fixed a little memory leak in freeing directory inodes
0.93 Fixed bug that locked up the machine when there were too many filenames
with first 15 characters same
Fixed write_file to zero file when writing behind file end
0.94 Fixed a little memory leak when trying to delete busy file or directory
0.95 Fixed a bug that i_hpfs_parent_dir was not updated when moving files
1.90 First version for 2.1.1xx kernels
1.91 Fixed a bug that chk_sectors failed when sectors were at the end of disk
Fixed a race-condition when write_inode is called while deleting file
Fixed a bug that could possibly happen (with very low probability) when
using 0xff in filenames
Rewritten locking to avoid race-conditions
Mount option 'eas' now works
Fsync no longer returns error
Files beginning with '.' are marked hidden
Remount support added
Alloc is not so slow when filesystem becomes full
Atimes are no more updated because it slows down operation
Code cleanup (removed all commented debug prints)
1.92 Corrected a bug when sync was called just before closing file
1.93 Modified, so that it works with kernels >= 2.1.131, I don't know if it
works with previous versions
Fixed a possible problem with disks > 64G (but I don't have one, so I can't
test it)
Fixed a file overflow at 2G
Added new option 'timeshift'
Changed behaviour on HPFS386: It is now possible to operate on HPFS386 in
read-only mode
Fixed a bug that slowed down alloc and prevented allocating 100% space
(this bug was not destructive)
1.94 Added workaround for one bug in Linux
Fixed one buffer leak
Fixed some incompatibilities with large extended attributes (but it's still
not 100% ok, I have no info on it and OS/2 doesn't want to create them)
Rewritten allocation
Fixed a bug with i_blocks (du sometimes didn't display correct values)
Directories have no longer archive attribute set (some programs don't like
it)
Fixed a bug that it set badly one flag in large anode tree (it was not
destructive)
1.95 Fixed one buffer leak, that could happen on corrupted filesystem
Fixed one bug in allocation in 1.94
1.96 Added workaround for one bug in OS/2 (HPFS locked up, HPFS386 reported
error sometimes when opening directories in PMSHELL)
Fixed a possible bitmap race
Fixed possible problem on large disks
You can now delete open files
Fixed a nondestructive race in rename
1.97 Support for HPFS v3 (on large partitions)
Fixed a bug that it didn't allow creation of files > 128M (it should be 2G)
1.97.1 Changed names of global symbols
Fixed a bug when chmoding or chowning root directory
1.98 Fixed a deadlock when using old_readdir
Better directory handling; workaround for "unbalanced tree" bug in OS/2
1.99 Corrected a possible problem when there's not enough space while deleting
file
Now it tries to truncate the file if there's not enough space when deleting
Removed a lot of redundant code
2.00 Fixed a bug in rename (it was there since 1.96)
Better anti-fragmentation strategy
2.01 Fixed problem with directory listing over NFS
Directory lseek now checks for proper parameters
Fixed race-condition in buffer code - it is in all filesystems in Linux;
when reading device (cat /dev/hda) while creating files on it, files
could be damaged
2.02 Workaround for bug in breada in Linux. breada could cause accesses beyond
end of partition
2.03 Char, block devices and pipes are correctly created
Fixed non-crashing race in unlink (Alexander Viro)
Now it works with Japanese version of OS/2
2.04 Fixed error when ftruncate used to extend file
2.05 Fixed crash when got mount parameters without =
Fixed crash when allocation of anode failed due to full disk
Fixed some crashes when block io or inode allocation failed
2.06 Fixed some crash on corrupted disk structures
Better allocation strategy
Reschedule points added so that it doesn't lock CPU long time
It should work in read-only mode on Warp Server
2.07 More fixes for Warp Server. Now it really works
2.08 Creating new files is not so slow on large disks
An attempt to sync deleted file does not generate filesystem error
2.09 Fixed error on extremely fragmented files
vim: set textwidth=80:
@ -0,0 +1,270 @@
inotify
a powerful yet simple file change notification system
Document started 15 Mar 2005 by Robert Love <rml@novell.com>
(i) User Interface
Inotify is controlled by a set of three system calls and normal file I/O on a
returned file descriptor.
The first step in using inotify is to initialise an inotify instance:
int fd = inotify_init ();
Each instance is associated with a unique, ordered queue.
Change events are managed by "watches". A watch is an (object,mask) pair where
the object is a file or directory and the mask is a bit mask of one or more
inotify events that the application wishes to receive. See <linux/inotify.h>
for valid events. A watch is referenced by a watch descriptor, or wd.
Watches are added via a path to the file.
Watches on a directory will return events on any files inside of the directory.
Adding a watch is simple:
int wd = inotify_add_watch (fd, path, mask);
Where "fd" is the return value from inotify_init(), path is the path to the
object to watch, and mask is the watch mask (see <linux/inotify.h>).
You can update an existing watch in the same manner, by passing in a new mask.
An existing watch is removed via
int ret = inotify_rm_watch (fd, wd);
Events are provided in the form of an inotify_event structure that is read(2)
from a given inotify instance. The filename is of dynamic length and follows
the struct. It is of size len. The filename is padded with null bytes to
ensure proper alignment. This padding is reflected in len.
You can slurp multiple events by passing a large buffer, for example
size_t len = read (fd, buf, BUF_LEN);
Where "buf" is a pointer to an array of "inotify_event" structures at least
BUF_LEN bytes in size. The above example will return as many events as are
available and fit in BUF_LEN.
Each inotify instance fd is also select()- and poll()-able.
You can find the size of the current event queue via the standard FIONREAD
ioctl on the fd returned by inotify_init().
All watches are destroyed and cleaned up on close.
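A minimal, self-contained sketch of the interface described above (using the
glibc wrappers from <sys/inotify.h>; the watched path /tmp is only an example)
might look like this:

  #include <stdio.h>
  #include <unistd.h>
  #include <sys/inotify.h>

  int main(void)
  {
          /* Buffer aligned for struct inotify_event, as in inotify(7). */
          char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
          ssize_t len, i;
          int fd, wd;

          fd = inotify_init();
          if (fd < 0) {
                  perror("inotify_init");
                  return 1;
          }

          wd = inotify_add_watch(fd, "/tmp", IN_CREATE | IN_DELETE);
          if (wd < 0) {
                  perror("inotify_add_watch");
                  return 1;
          }

          /* One read() may return several packed inotify_event structures. */
          len = read(fd, buf, sizeof(buf));
          for (i = 0; i + (ssize_t)sizeof(struct inotify_event) <= len; ) {
                  struct inotify_event *ev = (struct inotify_event *)&buf[i];

                  if (ev->len)
                          printf("%s: %s\n",
                                 (ev->mask & IN_CREATE) ? "created" : "deleted",
                                 ev->name);
                  i += sizeof(struct inotify_event) + ev->len;
          }

          inotify_rm_watch(fd, wd);
          close(fd);
          return 0;
  }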
(ii) Prototypes:
int inotify_init (void);
int inotify_add_watch (int fd, const char *path, __u32 mask);
int inotify_rm_watch (int fd, __u32 wd);
(iii) Kernel Interface
Inotify's kernel API consists of a set of functions for managing watches and an
event callback.
To use the kernel API, you must first initialize an inotify instance with a set
of inotify_operations. You are given an opaque inotify_handle, which you use
for any further calls to inotify.
struct inotify_handle *ih = inotify_init(my_event_handler);
You must provide a function for processing events and a function for destroying
the inotify watch.
void handle_event(struct inotify_watch *watch, u32 wd, u32 mask,
u32 cookie, const char *name, struct inode *inode)
watch - the pointer to the inotify_watch that triggered this call
wd - the watch descriptor
mask - describes the event that occurred
cookie - an identifier for synchronizing events
name - the dentry name for affected files in a directory-based event
inode - the affected inode in a directory-based event
void destroy_watch(struct inotify_watch *watch)
You may add watches by providing a pre-allocated and initialized inotify_watch
structure and specifying the inode to watch along with an inotify event mask.
You must pin the inode during the call. You will likely wish to embed the
inotify_watch structure in a structure of your own which contains other
information about the watch. Once you add an inotify watch, it is immediately
subject to removal depending on filesystem events. You must grab a reference if
you depend on the watch hanging around after the call.
inotify_init_watch(&my_watch->iwatch);
inotify_get_watch(&my_watch->iwatch); // optional
s32 wd = inotify_add_watch(ih, &my_watch->iwatch, inode, mask);
inotify_put_watch(&my_watch->iwatch); // optional
You may use the watch descriptor (wd) or the address of the inotify_watch for
other inotify operations. You must not directly read or manipulate data in the
inotify_watch. Additionally, you must not call inotify_add_watch() more than
once for a given inotify_watch structure, unless you have first called either
inotify_rm_watch() or inotify_rm_wd().
To determine if you have already registered a watch for a given inode, you may
call inotify_find_watch(), which gives you both the wd and the watch pointer for
the inotify_watch, or an error if the watch does not exist.
wd = inotify_find_watch(ih, inode, &watchp);
You may use container_of() on the watch pointer to access your own data
associated with a given watch. When an existing watch is found,
inotify_find_watch() bumps the refcount before releasing its locks. You must
put that reference with:
put_inotify_watch(watchp);
Call inotify_find_update_watch() to update the event mask for an existing watch.
inotify_find_update_watch() returns the wd of the updated watch, or an error if
the watch does not exist.
wd = inotify_find_update_watch(ih, inode, mask);
An existing watch may be removed by calling either inotify_rm_watch() or
inotify_rm_wd().
int ret = inotify_rm_watch(ih, &my_watch->iwatch);
int ret = inotify_rm_wd(ih, wd);
A watch may be removed while executing your event handler with the following:
inotify_remove_watch_locked(ih, iwatch);
Call inotify_destroy() to remove all watches from your inotify instance and
release it. If there are no outstanding references, inotify_destroy() will call
your destroy_watch op for each watch.
inotify_destroy(ih);
When inotify removes a watch, it sends an IN_IGNORED event to your callback.
You may use this event as an indication to free the watch memory. Note that
inotify may remove a watch due to filesystem events, as well as by your request.
If you use IN_ONESHOT, inotify will remove the watch after the first event, at
which point you may call the final inotify_put_watch.
(iv) Kernel Interface Prototypes
struct inotify_handle *inotify_init(struct inotify_operations *ops);
inotify_init_watch(struct inotify_watch *watch);
s32 inotify_add_watch(struct inotify_handle *ih,
struct inotify_watch *watch,
struct inode *inode, u32 mask);
s32 inotify_find_watch(struct inotify_handle *ih, struct inode *inode,
struct inotify_watch **watchp);
s32 inotify_find_update_watch(struct inotify_handle *ih,
struct inode *inode, u32 mask);
int inotify_rm_wd(struct inotify_handle *ih, u32 wd);
int inotify_rm_watch(struct inotify_handle *ih,
struct inotify_watch *watch);
void inotify_remove_watch_locked(struct inotify_handle *ih,
struct inotify_watch *watch);
void inotify_destroy(struct inotify_handle *ih);
void get_inotify_watch(struct inotify_watch *watch);
void put_inotify_watch(struct inotify_watch *watch);
(v) Internal Kernel Implementation
Each inotify instance is represented by an inotify_handle structure.
Inotify's userspace consumers also have an inotify_device which is
associated with the inotify_handle, and on which events are queued.
Each watch is associated with an inotify_watch structure. Watches are chained
off of each associated inotify_handle and each associated inode.
See fs/notify/inotify/inotify_fsnotify.c and fs/notify/inotify/inotify_user.c
for the locking and lifetime rules.
(vi) Rationale
Q: What is the design decision behind not tying the watch to the open fd of
the watched object?
A: Watches are associated with an open inotify device, not an open file.
This solves the primary problem with dnotify: keeping the file open pins
the file and thus, worse, pins the mount. Dnotify is therefore infeasible
for use on a desktop system with removable media as the media cannot be
unmounted. Watching a file should not require that it be open.
Q: What is the design decision behind using an fd-per-instance as opposed to
an fd-per-watch?
A: An fd-per-watch quickly consumes more file descriptors than are allowed,
more fd's than are feasible to manage, and more fd's than are optimally
select()-able. Yes, root can bump the per-process fd limit and yes, users
can use epoll, but requiring both is a silly and extraneous requirement.
A watch consumes less memory than an open file; separating the number
spaces is thus sensible. The current design is what user-space developers
want: Users initialize inotify, once, and add n watches, requiring but one
fd and no twiddling with fd limits. Initializing an inotify instance two
thousand times is silly. If we can implement user-space's preferences
cleanly--and we can, the idr layer makes stuff like this trivial--then we
should.
There are other good arguments. With a single fd, there is a single
item to block on, which is mapped to a single queue of events. The single
fd returns all watch events and also any potential out-of-band data. If
every fd was a separate watch,
- There would be no way to get event ordering. Events on file foo and
file bar would pop poll() on both fd's, but there would be no way to tell
which happened first. A single queue trivially gives you ordering. Such
ordering is crucial to existing applications such as Beagle. Imagine
"mv a b ; mv b a" events without ordering.
- We'd have to maintain n fd's and n internal queues with state,
versus just one. It is a lot messier in the kernel. A single, linear
queue is the data structure that makes sense.
- User-space developers prefer the current API. The Beagle guys, for
example, love it. Trust me, I asked. It is not a surprise: Who'd want
to manage and block on 1000 fd's via select?
- No way to get out of band data.
- 1024 is still too low. ;-)
When you talk about designing a file change notification system that
scales to 1000s of directories, juggling 1000s of fd's just does not seem
the right interface. It is too heavy.
Additionally, it _is_ possible to have more than one instance and
juggle more than one queue and thus more than one associated fd. There
need not be a one-fd-per-process mapping; it is one-fd-per-queue and a
process can easily want more than one queue.
Q: Why the system call approach?
A: The poor user-space interface is the second biggest problem with dnotify.
Signals are a terrible, terrible interface for file notification. Or for
anything, for that matter. The ideal solution, from all perspectives, is a
file descriptor-based one that allows basic file I/O and poll/select.
Obtaining the fd and managing the watches could have been done either via a
device file or a family of new system calls. We decided to implement a
family of system calls because that is the preferred approach for new kernel
interfaces. The only real difference was whether we wanted to use open(2)
and ioctl(2) or a couple of new system calls. System calls beat ioctls.
@ -0,0 +1,48 @@
Mount options that are the same as for msdos and vfat partitions.
gid=nnn All files in the partition will be in group nnn.
uid=nnn All files in the partition will be owned by user id nnn.
umask=nnn The permission mask (see umask(1)) for the partition.
Mount options that are the same as for vfat partitions. These are only useful
when using discs encoded using Microsoft's Joliet extensions.
iocharset=name Character set to use for converting from Unicode to
ASCII. Joliet filenames are stored in Unicode format, but
Unix for the most part doesn't know how to deal with Unicode.
There is also an option of doing UTF-8 translations with the
utf8 option.
utf8 Encode Unicode names in UTF-8 format. Default is no.
Mount options unique to the isofs filesystem.
block=512 Set the block size for the disk to 512 bytes
block=1024 Set the block size for the disk to 1024 bytes
block=2048 Set the block size for the disk to 2048 bytes
check=relaxed Matches filenames with different cases
check=strict Matches only filenames with the exact same case
cruft Try to handle badly formatted CDs.
map=off Do not map non-Rock Ridge filenames to lower case
map=normal Map non-Rock Ridge filenames to lower case
map=acorn As map=normal but also apply Acorn extensions if present
mode=xxx Sets the permissions on files to xxx unless Rock Ridge
extensions set the permissions otherwise
dmode=xxx Sets the permissions on directories to xxx unless Rock Ridge
extensions set the permissions otherwise
overriderockperm Set permissions on files and directories according to
'mode' and 'dmode' even though Rock Ridge extensions are
present.
nojoliet Ignore Joliet extensions if they are present.
norock Ignore Rock Ridge extensions if they are present.
hide Completely strip hidden files from the file system.
showassoc Show files marked with the 'associated' bit
unhide Deprecated; showing hidden files is now default;
If given, it is a synonym for 'showassoc' which will
recreate previous unhide behavior
session=x Select number of session on multisession CD
sbsector=xxx Session begins from sector xxx
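For example (a hedged illustration; the device and mount point are
placeholders), a disc could be mounted read-only with fixed permissions:

  mount -t iso9660 -o ro,map=normal,mode=0444,dmode=0555 /dev/cdrom /mnt/cdrom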
Recommended documents about ISO 9660 standard are located at:
http://www.y-adagio.com/
ftp://ftp.ecma.ch/ecma-st/Ecma-119.pdf
Quoting from the PDF "This 2nd Edition of Standard ECMA-119 is technically
identical with ISO 9660.", so it is a valid and gratis substitute for the
official ISO specification.
@ -0,0 +1,52 @@
IBM's Journaled File System (JFS) for Linux
JFS Homepage: http://jfs.sourceforge.net/
The following mount options are supported:
(*) == default
iocharset=name Character set to use for converting from Unicode to
ASCII. The default is to do no conversion. Use
iocharset=utf8 for UTF-8 translations. This requires
CONFIG_NLS_UTF8 to be set in the kernel .config file.
iocharset=none specifies the default behavior explicitly.
resize=value Resize the volume to <value> blocks. JFS only supports
growing a volume, not shrinking it. This option is only
valid during a remount, when the volume is mounted
read-write. The resize keyword with no value will grow
the volume to the full size of the partition.
nointegrity Do not write to the journal. The primary use of this option
is to allow for higher performance when restoring a volume
from backup media. The integrity of the volume is not
guaranteed if the system abnormally abends.
integrity(*) Commit metadata changes to the journal. Use this option to
remount a volume where the nointegrity option was
previously specified in order to restore normal behavior.
errors=continue Keep going on a filesystem error.
errors=remount-ro(*) Remount the filesystem read-only on an error.
errors=panic Panic and halt the machine if an error occurs.
uid=value Override on-disk uid with specified value
gid=value Override on-disk gid with specified value
umask=value Override on-disk umask with specified octal value. For
directories, the execute bit will be set if the corresponding
read bit is set.
discard=minlen This enables/disables the use of discard/TRIM commands.
discard The discard/TRIM commands are sent to the underlying
nodiscard(*) block device when blocks are freed. This is useful for SSD
devices and sparse/thinly-provisioned LUNs. The FITRIM ioctl
command is also available together with the nodiscard option.
The value of minlen specifies the minimum block count for which
a TRIM command to the block device is considered useful.
When no value is given to the discard option, it defaults to
64 blocks, which means 256KiB in JFS.
The minlen value of discard overrides the minlen value given
on an FITRIM ioctl().
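For example (a hedged illustration; the device and mount point are
placeholders), a JFS volume could be mounted with UTF-8 name translation and
conservative error handling:

  mount -t jfs -o iocharset=utf8,errors=remount-ro /dev/sdb1 /mnt/jfs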
The JFS mailing list can be subscribed to by using the link labeled
"Mail list Subscribe" at our web page http://jfs.sourceforge.net/
@ -0,0 +1,68 @@
File Locking Release Notes
Andy Walker <andy@lysaker.kvaerner.no>
12 May 1997
1. What's New?
--------------
1.1 Broken Flock Emulation
--------------------------
The old flock(2) emulation in the kernel was swapped for proper BSD
compatible flock(2) support in the 1.3.x series of kernels. With the
release of the 2.1.x kernel series, support for the old emulation has
been totally removed, so that we don't need to carry this baggage
forever.
This should not cause problems for anybody, since everybody using a
2.1.x kernel should have updated their C library to a suitable version
anyway (see the file "Documentation/Changes".)
1.2 Allow Mixed Locks Again
---------------------------
1.2.1 Typical Problems - Sendmail
---------------------------------
Because sendmail was unable to use the old flock() emulation, many sendmail
installations use fcntl() instead of flock(). This is true of Slackware 3.0
for example. This gave rise to some other subtle problems if sendmail was
configured to rebuild the alias file. Sendmail tried to lock the aliases.dir
file with fcntl() at the same time as the GDBM routines tried to lock this
file with flock(). With pre 1.3.96 kernels this could result in deadlocks that,
over time, or under a very heavy mail load, would eventually cause the kernel
to lock solid with deadlocked processes.
1.2.2 The Solution
------------------
The solution I have chosen, after much experimentation and discussion,
is to make flock() and fcntl() locks oblivious to each other. Both can
exist, and neither will have any effect on the other.
I wanted the two lock styles to be cooperative, but there were so many
race and deadlock conditions that the current solution was the only
practical one. It puts us in the same position as, for example, SunOS
4.1.x and several other commercial Unices. The only OS's that support
cooperative flock()/fcntl() are those that emulate flock() using
fcntl(), with all the problems that implies.
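A small sketch illustrating this behaviour (the file path is only an example;
this is a demonstration program, not part of the kernel): an exclusive flock()
lock held by the parent does not prevent the child from obtaining an exclusive
fcntl() lock on the same file.

  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/file.h>
  #include <sys/wait.h>
  #include <unistd.h>

  int main(void)
  {
          int fd = open("/tmp/locktest", O_RDWR | O_CREAT, 0644);

          if (fd < 0 || flock(fd, LOCK_EX) < 0) {
                  perror("flock");
                  return 1;
          }

          if (fork() == 0) {
                  /* Child: place a POSIX (fcntl) write lock on the whole file. */
                  struct flock fl;
                  int cfd = open("/tmp/locktest", O_RDWR);

                  memset(&fl, 0, sizeof(fl));
                  fl.l_type = F_WRLCK;
                  fl.l_whence = SEEK_SET;   /* l_start = l_len = 0: whole file */

                  if (cfd >= 0 && fcntl(cfd, F_SETLK, &fl) == 0)
                          printf("fcntl lock granted despite the flock lock\n");
                  else
                          perror("fcntl");
                  _exit(0);
          }
          wait(NULL);
          return 0;
  }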
1.3 Mandatory Locking As A Mount Option
---------------------------------------
Mandatory locking, as described in
'Documentation/filesystems/mandatory-locking.txt' was prior to this release a
general configuration option that was valid for all mounted filesystems. This
had a number of inherent dangers, not the least of which was the ability to
freeze an NFS server by asking it to read a file for which a mandatory lock
existed.
From this release of the kernel, mandatory locking can be turned on and off
on a per-filesystem basis, using the mount options 'mand' and 'nomand'.
The default is to disallow mandatory locking. The intention is that
mandatory locking only be enabled on a local filesystem as the specific need
arises.
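For example (a hedged illustration; the device and mount point are
placeholders), mandatory locking could be enabled on a single filesystem with:

  mount -o remount,mand /dev/sda2 /home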
@ -0,0 +1,241 @@
The LogFS Flash Filesystem
==========================
Specification
=============
Superblocks
-----------
Two superblocks exist at the beginning and end of the filesystem.
Each superblock is 256 Bytes large, with another 3840 Bytes reserved
for future purposes, making a total of 4096 Bytes.
Superblock locations may differ for MTD and block devices. On MTD the
first non-bad block contains a superblock in the first 4096 Bytes and
the last non-bad block contains a superblock in the last 4096 Bytes.
On block devices, the first 4096 Bytes of the device contain the first
superblock and the last aligned 4096 Byte-block contains the second
superblock.
For the most part, the superblocks can be considered read-only. They
are written only to correct errors detected within the superblocks,
move the journal and change the filesystem parameters through tunefs.
As a result, the superblock does not contain any fields that require
constant updates, like the amount of free space, etc.
Segments
--------
The space in the device is split up into equal-sized segments.
Segments are the primary write unit of LogFS. Within each segment,
writes happen from front (low addresses) to back (high addresses). If
only a partial segment has been written, the segment number, the
current position within it and optionally a write buffer are stored in
the journal.
Segments are erased as a whole. Therefore Garbage Collection may be
required to completely free a segment before doing so.
Journal
--------
The journal contains all global information about the filesystem that
is subject to frequent change. At mount time, it has to be scanned
for the most recent commit entry, which contains a list of pointers to
all currently valid entries.
Object Store
------------
All space except for the superblocks and journal is part of the object
store. Each segment contains a segment header and a number of
objects, each consisting of the object header and the payload.
Objects are either inodes, directory entries (dentries), file data
blocks or indirect blocks.
Levels
------
Garbage collection (GC) may fail if all data is written
indiscriminately. One requirement of GC is that data is separated
roughly according to the distance between the tree root and the data.
Effectively that means all file data is on level 0, indirect blocks
are on levels 1, 2, 3, 4 or 5 for 1x, 2x, 3x, 4x or 5x indirect blocks,
respectively. Inode file data is on level 6 for the inodes and 7-11
for indirect blocks.
Each segment contains objects of a single level only. As a result,
each level requires its own separate segment to be open for writing.
Inode File
----------
All inodes are stored in a special file, the inode file. The single
exception is the inode file's own inode (the master inode), which for obvious
reasons is stored in the journal instead. Instead of data blocks, the
leaf nodes of the inode files are inodes.
Aliases
-------
Writes in LogFS are done by means of a wandering tree. A naïve
implementation would require that for each write of a block, all
parent blocks are written as well, since the block pointers have
changed. Such an implementation would not be very efficient.
In LogFS, the block pointer changes are cached in the journal by means
of alias entries. Each alias consists of its logical address - inode
number, block index, level and child number (index into block) - and
the changed data.  Any 8-byte word can be changed in this manner.
Currently aliases are used for block pointers, file size, file used
bytes and the height of an inode's indirect tree.
Segment Aliases
---------------
Related to regular aliases, these are used to handle bad blocks.
Initially, bad blocks are handled by moving the affected segment
content to a spare segment and noting this move in the journal with a
segment alias, a simple (to, from) tuple. GC will later empty this
segment and the alias can be removed again. This is used on MTD only.
Vim
---
By cleverly predicting the life time of data, it is possible to
separate long-living data from short-living data and thereby reduce
the GC overhead later. Each type of distinct life expectancy (vim) can
have a separate segment open for writing. Each (level, vim) tuple can
be open just once. If an open segment with unknown vim is encountered
at mount time, it is closed and ignored henceforth.
Indirect Tree
-------------
Inodes in LogFS are similar to FFS-style filesystems with direct and
indirect block pointers. One difference is that LogFS uses a single
indirect pointer that can be either a 1x, 2x, etc. indirect pointer.
A height field in the inode defines the height of the indirect tree
and thereby the indirection of the pointer.
Another difference is the addressing of indirect blocks. In LogFS,
the first 16 pointers in the first indirect block are left empty,
corresponding to the 16 direct pointers in the inode. In ext2 (maybe
others as well) the first pointer in the first indirect block
corresponds to logical block 12, skipping the 12 direct pointers.
So where ext2 is using arithmetic to better utilize space, LogFS keeps
arithmetic simple and uses compression to save space.
Compression
-----------
Both file data and metadata can be compressed. Compression for file
data can be enabled with chattr +c and disabled with chattr -c. Doing
so has no effect on existing data, but new data will be stored
accordingly. New inodes will inherit the compression flag of the
parent directory.
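As an illustration (the file and directory names are hypothetical), the
compression flag could be managed like this:

  $ chattr +c /mnt/logfs/bigfile    (compress data written to this file from now on)
  $ chattr -c /mnt/logfs/bigfile    (stop compressing newly written data)
  $ chattr +c /mnt/logfs/archive    (new files created here inherit the flag)

Remember that toggling the flag does not rewrite data that is already on the
medium.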
Metadata is always compressed. However, the space accounting ignores
this and charges for the uncompressed size. Failing to do so could
result in GC failures when, after moving some data, indirect blocks
compress worse than previously. Even on a 100% full medium, GC must
still be able to run without consuming extra space, so the space gained
by compression is lost to the user.
It is not lost to the filesystem internals, however. By hiding those
bytes from the user, the filesystem gains some slack space, so GC runs
less often and faster.
Garbage Collection and Wear Leveling
------------------------------------
Garbage collection is invoked whenever the number of free segments
falls below a threshold. The best (known) candidate is picked based
on the least amount of valid data contained in the segment. All
remaining valid data is copied elsewhere, thereby invalidating it.
The GC code also checks for aliases and writes them back if their
number gets too large.
Wear leveling is done by occasionally picking a suboptimal segment for
garbage collection. If a stale segment's erase count is significantly
lower than the active segments' erase counts, it will be picked. Wear
leveling is rate limited, so it will never monopolize the device for
more than one segment worth at a time.
Values for "occasionally" and "significantly lower" are compile-time
constants.
Hashed directories
------------------
To satisfy efficient lookup(), directory entries are hashed and
located based on the hash. In order to both support large directories
and not be overly inefficient for small directories, several hash
tables of increasing size are used. For each table, the hash value
modulo the table size gives the table index.
Table sizes are chosen to limit the number of indirect blocks with a
fully populated table to 0, 1, 2 or 3 respectively. So the first
table contains 16 entries, the second 512-16, etc.
The last table is special in several ways. First its size depends on
the effective 32bit limit on telldir/seekdir cookies. Since logfs
uses the upper half of the address space for indirect blocks, the size
is limited to 2^31. Secondly the table contains hash buckets with 16
entries each.
Using single-entry buckets would result in birthday "attacks". At
just 2^16 used entries, hash collisions would be likely (P >= 0.5).
My math skills are insufficient to do the combinatorics for the 17x
collisions necessary to overflow a bucket, but testing showed that in
10,000 runs the lowest directory fill before a bucket overflow was
188,057,130 entries with an average of 315,149,915 entries. So for
directory sizes of up to a million, bucket overflows should be
virtually impossible under normal circumstances.
With carefully chosen filenames, it is obviously possible to cause an
overflow with just 21 entries (4 higher tables + 16 entries + 1). So
there may be a security concern if a malicious user has write access
to a directory.
Open For Discussion
===================
Device Address Space
--------------------
A device address space is used for caching. Both block devices and
MTD provide functions to either read a single page or write a segment.
Partial segments may be written for data integrity, but where possible
complete segments are written for performance on simple block device
flash media.
Meta Inodes
-----------
Inodes are stored in the inode file, which is just a regular file for
most purposes. At umount time, however, the inode file needs to
remain open until all dirty inodes are written. So
generic_shutdown_super() may not close this inode, but shouldn't
complain about remaining inodes due to the inode file either. The same
goes for the mapping inode of the device address space.
Currently logfs uses a hack that essentially copies part of fs/inode.c
code over. A general solution would be preferred.
Indirect block mapping
----------------------
With compression, the block device (or mapping inode) cannot be used
to cache indirect blocks. Some other place is required. Currently
logfs uses the top half of each inode's address space. The low 8TB
(on 32bit) are filled with file data, the high 8TB are used for
indirect blocks.
One problem is that 16TB files created on 64bit systems actually have
data in the top 8TB. But files >16TB would cause problems anyway, so
only the limit has changed.


@ -0,0 +1,171 @@
Mandatory File Locking For The Linux Operating System
Andy Walker <andy@lysaker.kvaerner.no>
15 April 1996
(Updated September 2007)
0. Why you should avoid mandatory locking
-----------------------------------------
The Linux implementation is prey to a number of difficult-to-fix race
conditions which in practice make it not dependable:
- The write system call checks for a mandatory lock only once
at its start. It is therefore possible for a lock request to
be granted after this check but before the data is modified.
A process may then see file data change even while a mandatory
lock was held.
- Similarly, an exclusive lock may be granted on a file after
the kernel has decided to proceed with a read, but before the
read has actually completed, and the reading process may see
the file data in a state which should not have been visible
to it.
- Similar races make the claimed mutual exclusion between lock
and mmap similarly unreliable.
1. What is mandatory locking?
------------------------------
Mandatory locking is kernel enforced file locking, as opposed to the more usual
cooperative file locking used to guarantee sequential access to files among
processes. File locks are applied using the flock() and fcntl() system calls
(and the lockf() library routine which is a wrapper around fcntl().) It is
normally a process' responsibility to check for locks on a file it wishes to
update, before applying its own lock, updating the file and unlocking it again.
The most commonly used example of this (and in the case of sendmail, the most
troublesome) is access to a user's mailbox. The mail user agent and the mail
transfer agent must guard against updating the mailbox at the same time, and
prevent reading the mailbox while it is being updated.
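As a purely illustrative sketch of cooperative locking from the shell, using
the flock(1) utility from util-linux (which takes a BSD-style advisory lock
via flock()):

  $ flock -x /var/spool/mail/$USER -c 'cat newmsg >> /var/spool/mail/$USER'

This only protects the mailbox if every program that touches it agrees to
take the same lock first, which is exactly the cooperation that cannot be
assumed in practice.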
In a perfect world all processes would use and honour a cooperative, or
"advisory" locking scheme. However, the world isn't perfect, and there's
a lot of poorly written code out there.
In trying to address this problem, the designers of System V UNIX came up
with a "mandatory" locking scheme, whereby the operating system kernel would
block attempts by a process to write to a file that another process holds a
"read" -or- "shared" lock on, and block attempts to both read and write to a
file that a process holds a "write" -or- "exclusive" lock on.
The System V mandatory locking scheme was intended to have as little impact as
possible on existing user code. The scheme is based on marking individual files
as candidates for mandatory locking, and using the existing fcntl()/lockf()
interface for applying locks just as if they were normal, advisory locks.
Note 1: In saying "file" in the paragraphs above I am actually not telling
the whole truth. System V locking is based on fcntl(). The granularity of
fcntl() is such that it allows the locking of byte ranges in files, in addition
to entire files, so the mandatory locking rules also have byte level
granularity.
Note 2: POSIX.1 does not specify any scheme for mandatory locking, despite
borrowing the fcntl() locking scheme from System V. The mandatory locking
scheme is defined by the System V Interface Definition (SVID) Version 3.
2. Marking a file for mandatory locking
---------------------------------------
A file is marked as a candidate for mandatory locking by setting the group-id
bit in its file mode but removing the group-execute bit. This is an otherwise
meaningless combination, and was chosen by the System V implementors so as not
to break existing user programs.
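For example (the path is purely illustrative), either of the following marks
a file as a mandatory locking candidate:

  $ chmod g+s,g-x /tmp/mandatory-test
  $ chmod 2644 /tmp/mandatory-test

i.e. the set-group-id bit is set while the group-execute bit remains clear.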
Note that the group-id bit is usually automatically cleared by the kernel when
a setgid file is written to. This is a security measure. The kernel has been
modified to recognize the special case of a mandatory lock candidate and to
refrain from clearing this bit. Similarly the kernel has been modified not
to run mandatory lock candidates with setgid privileges.
3. Available implementations
----------------------------
I have considered the implementations of mandatory locking available with
SunOS 4.1.x, Solaris 2.x and HP-UX 9.x.
Generally I have tried to make the most sense out of the behaviour exhibited
by these three reference systems. There are many anomalies.
All the reference systems reject all calls to open() for a file on which
another process has outstanding mandatory locks. This is in direct
contravention of SVID 3, which states that only calls to open() with the
O_TRUNC flag set should be rejected. The Linux implementation follows the SVID
definition, which is the "Right Thing", since only calls with O_TRUNC can
modify the contents of the file.
HP-UX even disallows open() with O_TRUNC for a file with advisory locks, not
just mandatory locks. That would appear to contravene POSIX.1.
mmap() is another interesting case. All the operating systems mentioned
prevent mandatory locks from being applied to an mmap()'ed file, but HP-UX
also disallows advisory locks for such a file. SVID actually specifies the
paranoid HP-UX behaviour.
In my opinion only MAP_SHARED mappings should be immune from locking, and then
only from mandatory locks - that is what is currently implemented.
SunOS is so hopeless that it doesn't even honour the O_NONBLOCK flag for
mandatory locks, so reads and writes to locked files always block when they
should return EAGAIN.
I'm afraid that this is such an esoteric area that the semantics described
below are just as valid as any others, so long as the main points seem to
agree.
4. Semantics
------------
1. Mandatory locks can only be applied via the fcntl()/lockf() locking
interface - in other words the System V/POSIX interface. BSD style
locks using flock() never result in a mandatory lock.
2. If a process has locked a region of a file with a mandatory read lock, then
other processes are permitted to read from that region. If any of these
processes attempts to write to the region it will block until the lock is
released, unless the process has opened the file with the O_NONBLOCK
flag in which case the system call will return immediately with the error
status EAGAIN.
3. If a process has locked a region of a file with a mandatory write lock, all
attempts to read or write to that region block until the lock is released,
unless a process has opened the file with the O_NONBLOCK flag in which case
the system call will return immediately with the error status EAGAIN.
4. Calls to open() with O_TRUNC, or to creat(), on an existing file that has
any mandatory locks owned by other processes will be rejected with the
error status EAGAIN.
5. Attempts to apply a mandatory lock to a file that is memory mapped and
shared (via mmap() with MAP_SHARED) will be rejected with the error status
EAGAIN.
6. Attempts to create a shared memory map of a file (via mmap() with MAP_SHARED)
that has any mandatory locks in effect will be rejected with the error status
EAGAIN.
5. Which system calls are affected?
-----------------------------------
Those which modify a file's contents, not just the inode. That gives read(),
write(), readv(), writev(), open(), creat(), mmap(), truncate() and
ftruncate(). truncate() and ftruncate() are considered to be "write" actions
for the purposes of mandatory locking.
The affected region is usually defined as stretching from the current position
for the total number of bytes read or written. For the truncate calls it is
defined as the bytes of a file removed or added (we must also consider bytes
added, as a lock can specify just "the whole file", rather than a specific
range of bytes.)
Note 3: I may have overlooked some system calls that need mandatory lock
checking in my eagerness to get this code out the door. Please let me know, or
better still fix the system calls yourself and submit a patch to me or Linus.
6. Warning!
-----------
Not even root can override a mandatory lock, so runaway processes can wreak
havoc if they lock crucial files. The way around it is to change the file
permissions (remove the setgid bit) before trying to read or write to it.
Of course, that might be a bit tricky if the system is hung :-(


@ -0,0 +1,12 @@
The ncpfs filesystem understands the NCP protocol, designed by the
Novell Corporation for their NetWare(tm) product. NCP is functionally
similar to the NFS used in the TCP/IP community.
To mount a NetWare filesystem, you need a special mount program, which
can be found in the ncpfs package. The home site for ncpfs is
ftp.gwdg.de/pub/linux/misc/ncpfs, but sunsite and its many mirrors
will have it as well.
Related products are linware and mars_nwe, which will give Linux partial
NetWare server functionality.
mars_nwe can be found on ftp.gwdg.de/pub/linux/misc/ncpfs.


@ -0,0 +1,26 @@
00-INDEX
- this file (nfs-related documentation).
Exporting
- explanation of how to make filesystems exportable.
fault_injection.txt
- information for using fault injection on the server
knfsd-stats.txt
- statistics which the NFS server makes available to user space.
nfs.txt
- nfs client, and DNS resolution for fs_locations.
nfs41-server.txt
- info on the Linux server implementation of NFSv4 minor version 1.
nfs-rdma.txt
- how to install and setup the Linux NFS/RDMA client and server software
nfsd-admin-interfaces.txt
- Administrative interfaces for nfsd.
nfsroot.txt
- short guide on setting up a diskless box with NFS root filesystem.
pnfs.txt
- short explanation of some of the internals of the pnfs client code
rpc-cache.txt
- introduction to the caching mechanisms in the sunrpc layer.
idmapper.txt
- information for configuring request-keys to be used by idmapper
rpc-server-gss.txt
- Information on GSS authentication support in the NFS Server


@ -0,0 +1,162 @@
Making Filesystems Exportable
=============================
Overview
--------
All filesystem operations require a dentry (or two) as a starting
point. Local applications have a reference-counted hold on suitable
dentries via open file descriptors or cwd/root. However remote
applications that access a filesystem via a remote filesystem protocol
such as NFS may not be able to hold such a reference, and so need a
different way to refer to a particular dentry. As the alternative
form of reference needs to be stable across renames, truncates, and
server-reboot (among other things, though these tend to be the most
problematic), there is no simple answer like 'filename'.
The mechanism discussed here allows each filesystem implementation to
specify how to generate an opaque (outside of the filesystem) byte
string for any dentry, and how to find an appropriate dentry for any
given opaque byte string.
This byte string will be called a "filehandle fragment" as it
corresponds to part of an NFS filehandle.
A filesystem which supports the mapping between filehandle fragments
and dentries will be termed "exportable".
Dcache Issues
-------------
The dcache normally contains a proper prefix of any given filesystem
tree. This means that if any filesystem object is in the dcache, then
all of the ancestors of that filesystem object are also in the dcache.
As normal access is by filename this prefix is created naturally and
maintained easily (by each object maintaining a reference count on
its parent).
However when objects are included into the dcache by interpreting a
filehandle fragment, there is no automatic creation of a path prefix
for the object. This leads to two related but distinct features of
the dcache that are not needed for normal filesystem access.
1/ The dcache must sometimes contain objects that are not part of the
proper prefix, i.e. that are not connected to the root.
2/ The dcache must be prepared for a newly found (via ->lookup) directory
to already have a (non-connected) dentry, and must be able to move
that dentry into place (based on the parent and name in the
->lookup). This is particularly needed for directories as
it is a dcache invariant that directories only have one dentry.
To implement these features, the dcache has:
a/ A dentry flag DCACHE_DISCONNECTED which is set on
any dentry that might not be part of the proper prefix.
This is set when anonymous dentries are created, and cleared when a
dentry is noticed to be a child of a dentry which is in the proper
prefix.
b/ A per-superblock list "s_anon" of dentries which are the roots of
subtrees that are not in the proper prefix. These dentries, as
well as the proper prefix, need to be released at unmount time. As
these dentries will not be hashed, they are linked together on the
d_hash list_head.
c/ Helper routines to allocate anonymous dentries, and to help attach
loose directory dentries at lookup time. They are:
d_obtain_alias(inode) will return a dentry for the given inode.
If the inode already has a dentry, one of those is returned.
If it doesn't, a new anonymous (IS_ROOT and
DCACHE_DISCONNECTED) dentry is allocated and attached.
In the case of a directory, care is taken that only one dentry
can ever be attached.
d_splice_alias(inode, dentry) or d_materialise_unique(dentry, inode)
will introduce a new dentry into the tree; either the passed-in
dentry or a preexisting alias for the given inode (such as an
anonymous one created by d_obtain_alias), if appropriate. The two
functions differ in their handling of directories with preexisting
aliases:
d_splice_alias will use any existing IS_ROOT dentry, but it will
return -EIO rather than try to move a dentry with a different
parent. This is appropriate for local filesystems, which
should never see such an alias unless the filesystem is
corrupted somehow (for example, if two on-disk directory
entries refer to the same directory.)
d_materialise_unique will attempt to move any dentry. This is
appropriate for distributed filesystems, where finding a
directory other than where we last cached it may be a normal
consequence of concurrent operations on other hosts.
Both functions return NULL when the passed-in dentry is used,
following the calling convention of ->lookup.
Filesystem Issues
-----------------
For a filesystem to be exportable it must:
1/ provide the filehandle fragment routines described below.
2/ make sure that d_splice_alias is used rather than d_add
when ->lookup finds an inode for a given parent and name.
If inode is NULL, d_splice_alias(inode, dentry) is equivalent to
d_add(dentry, inode), NULL
Similarly, d_splice_alias(ERR_PTR(err), dentry) = ERR_PTR(err)
Typically the ->lookup routine will simply end with a:
return d_splice_alias(inode, dentry);
}
A file system implementation declares that instances of the filesystem
are exportable by setting the s_export_op field in the struct
super_block. This field must point to a "struct export_operations"
struct which has the following members:
encode_fh (optional)
Takes a dentry and creates a filehandle fragment which can later be used
to find or create a dentry for the same object. The default
implementation creates a filehandle fragment that encodes a 32bit inode
and generation number for the inode encoded, and if necessary the
same information for the parent.
fh_to_dentry (mandatory)
Given a filehandle fragment, this should find the implied object and
create a dentry for it (possibly with d_obtain_alias).
fh_to_parent (optional but strongly recommended)
Given a filehandle fragment, this should find the parent of the
implied object and create a dentry for it (possibly with
d_obtain_alias). May fail if the filehandle fragment is too small.
get_parent (optional but strongly recommended)
When given a dentry for a directory, this should return a dentry for
the parent. Quite possibly the parent dentry will have been allocated
by d_alloc_anon. The default get_parent function just returns an error
so any filehandle lookup that requires finding a parent will fail.
->lookup("..") is *not* used as a default as it can leave ".." entries
in the dcache which are too messy to work with.
get_name (optional)
When given a parent dentry and a child dentry, this should find a name
in the directory identified by the parent dentry, which leads to the
object identified by the child dentry. If no get_name function is
supplied, a default implementation is provided which uses vfs_readdir
to find potential names, and matches inode numbers to find the correct
match.
A filehandle fragment consists of an array of 1 or more 4-byte words,
together with a one-byte "type".
The decode_fh routine should not depend on the stated size that is
passed to it. This size may be larger than the original filehandle
generated by encode_fh, in which case it will have been padded with
nuls. Rather, the encode_fh routine should choose a "type" which
indicates to decode_fh how much of the filehandle is valid, and how
it should be interpreted.


@ -0,0 +1,69 @@
Fault Injection
===============
Fault injection is a method for forcing errors that may not normally occur, or
may be difficult to reproduce. Forcing these errors in a controlled environment
can help the developer find and fix bugs before their code is shipped in a
production system. Injecting an error on the Linux NFS server will allow us to
observe how the client reacts and if it manages to recover its state correctly.
NFSD_FAULT_INJECTION must be selected when configuring the kernel to use this
feature.
Using Fault Injection
=====================
On the client, mount the fault injection server through NFS v4.0+ and do some
work over NFS (open files, take locks, ...).
On the server, mount the debugfs filesystem to <debug_dir> and ls
<debug_dir>/nfsd. This will show a list of files that will be used for
injecting faults on the NFS server. As root, write a number n to the file
corresponding to the action you want the server to take. The server will then
process the first n items it finds. So if you want to forget 5 locks, echo '5'
to <debug_dir>/nfsd/forget_locks. A value of 0 will tell the server to forget
all corresponding items. A log message will be created containing the number
of items forgotten (check dmesg).
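As a short walk-through, assuming debugfs is mounted at the conventional
/sys/kernel/debug (substitute your own <debug_dir> if it differs):

  $ mount -t debugfs none /sys/kernel/debug    (only if not already mounted)
  $ ls /sys/kernel/debug/nfsd
  $ echo 5 > /sys/kernel/debug/nfsd/forget_locks
  $ dmesg | tail    (look for the message reporting how many items were forgotten)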
Go back to work on the client and check if the client recovered from the error
correctly.
Available Faults
================
forget_clients:
The NFS server keeps a list of clients that have placed a mount call. If
this list is cleared, the server will have no knowledge of who the client
is, forcing the client to reauthenticate with the server.
forget_openowners:
The NFS server keeps a list of what files are currently opened and who
they were opened by. Clearing this list will force the client to reopen
its files.
forget_locks:
The NFS server keeps a list of what files are currently locked in the VFS.
Clearing this list will force the client to reclaim its locks (files are
unlocked through the VFS as they are cleared from this list).
forget_delegations:
A delegation is used to assure the client that a file, or part of a file,
has not changed since the delegation was awarded. Clearing this list will
force the client to reacquire its delegation before accessing the file
again.
recall_delegations:
Delegations can be recalled by the server when another client attempts to
access a file. This test will notify the client that its delegation has
been revoked, forcing the client to reacquire the delegation before using
the file again.
tools/nfs/inject_faults.sh script
=================================
This script has been created to ease the fault injection process. This script
will detect the mounted debugfs directory and write to the files located there
based on the arguments passed by the user. For example, running
`inject_faults.sh forget_locks 1` as root will instruct the server to forget
one lock. Running `inject_faults.sh forget_locks` will instruct the server to
forget all locks.


@ -0,0 +1,75 @@
=========
ID Mapper
=========
Id mapper is used by NFS to translate user and group ids into names, and to
translate user and group names into ids. Part of this translation involves
performing an upcall to userspace to request the information. There are two
ways NFS could obtain this information: placing a call to /sbin/request-key
or by placing a call to the rpc.idmap daemon.
NFS will attempt to call /sbin/request-key first. If this succeeds, the
result will be cached using the generic request-key cache. This call should
only fail if /etc/request-key.conf is not configured for the id_resolver key
type, see the "Configuring" section below if you wish to use the request-key
method.
If the call to /sbin/request-key fails (if /etc/request-key.conf is not
configured with the id_resolver key type), then the idmapper will ask the
legacy rpc.idmap daemon for the id mapping. This result will be stored
in a custom NFS idmap cache.
===========
Configuring
===========
The file /etc/request-key.conf will need to be modified so /sbin/request-key can
direct the upcall. The following line should be added:
#OP TYPE DESCRIPTION CALLOUT INFO PROGRAM ARG1 ARG2 ARG3 ...
#====== ======= =============== =============== ===============================
create id_resolver * * /usr/sbin/nfs.idmap %k %d 600
This will direct all id_resolver requests to the program /usr/sbin/nfs.idmap.
The last parameter, 600, defines how many seconds into the future the key will
expire. This parameter is optional for /usr/sbin/nfs.idmap. When the timeout
is not specified, nfs.idmap will default to 600 seconds.
Id mapper uses the following key descriptions:
uid: Find the UID for the given user
gid: Find the GID for the given group
user: Find the user name for the given UID
group: Find the group name for the given GID
You can handle any of these individually, rather than using the generic upcall
program. If you would like to use your own program for a uid lookup then you
would edit your request-key.conf so it looks similar to this:
#OP TYPE DESCRIPTION CALLOUT INFO PROGRAM ARG1 ARG2 ARG3 ...
#====== ======= =============== =============== ===============================
create id_resolver uid:* * /some/other/program %k %d 600
create id_resolver * * /usr/sbin/nfs.idmap %k %d 600
Notice that the new line was added above the line for the generic program.
request-key will find the first matching line and corresponding program. In
this case, /some/other/program will handle all uid lookups and
/usr/sbin/nfs.idmap will handle gid, user, and group lookups.
See <file:Documentation/security/keys-request-key.txt> for more information
about the request-key function.
=========
nfs.idmap
=========
nfs.idmap is designed to be called by request-key, and should not be run "by
hand". This program takes two arguments, a serialized key and a key
description. The serialized key is first converted into a key_serial_t, and
then passed as an argument to keyctl_instantiate (both are part of keyutils.h).
The actual lookups are performed by functions found in nfsidmap.h. nfs.idmap
determines the correct function to call by looking at the first part of the
description string. For example, a uid lookup description will appear as
"uid:user@domain".
nfs.idmap will return 0 if the key was instantiated, and non-zero otherwise.


@ -0,0 +1,159 @@
Kernel NFS Server Statistics
============================
This document describes the format and semantics of the statistics
which the kernel NFS server makes available to userspace. These
statistics are available in several text form pseudo files, each of
which is described separately below.
In most cases you don't need to know these formats, as the nfsstat(8)
program from the nfs-utils distribution provides a helpful command-line
interface for extracting and printing them.
All the files described here are formatted as a sequence of text lines,
separated by newline '\n' characters. Lines beginning with a hash
'#' character are comments intended for humans and should be ignored
by parsing routines. All other lines contain a sequence of fields
separated by whitespace.
/proc/fs/nfsd/pool_stats
------------------------
This file is available in kernels from 2.6.30 onwards, if the
/proc/fs/nfsd filesystem is mounted (it almost always should be).
The first line is a comment which describes the fields present in
all the other lines. The other lines present the following data as
a sequence of unsigned decimal numeric fields. One line is shown
for each NFS thread pool.
All counters are 64 bits wide and wrap naturally. There is no way
to zero these counters, instead applications should do their own
rate conversion.
pool
The id number of the NFS thread pool to which this line applies.
This number does not change.
Thread pool ids are a contiguous set of small integers starting
at zero. The maximum value depends on the thread pool mode, but
currently cannot be larger than the number of CPUs in the system.
Note that in the default case there will be a single thread pool
which contains all the nfsd threads and all the CPUs in the system,
and thus this file will have a single line with a pool id of "0".
packets-arrived
Counts how many NFS packets have arrived. More precisely, this
is the number of times that the network stack has notified the
sunrpc server layer that new data may be available on a transport
(e.g. an NFS or UDP socket or an NFS/RDMA endpoint).
Depending on the NFS workload patterns and various network stack
effects (such as Large Receive Offload) which can combine packets
on the wire, this may be either more or less than the number
of NFS calls received (which statistic is available elsewhere).
However this is a more accurate and less workload-dependent measure
of how much CPU load is being placed on the sunrpc server layer
due to NFS network traffic.
sockets-enqueued
Counts how many times an NFS transport is enqueued to wait for
an nfsd thread to service it, i.e. no nfsd thread was considered
available.
The circumstance this statistic tracks indicates that there was NFS
network-facing work to be done but it couldn't be done immediately,
thus introducing a small delay in servicing NFS calls. The ideal
rate of change for this counter is zero; significantly non-zero
values may indicate a performance limitation.
This can happen either because there are too few nfsd threads in the
thread pool for the NFS workload (the workload is thread-limited),
or because the NFS workload needs more CPU time than is available in
the thread pool (the workload is CPU-limited). In the former case,
configuring more nfsd threads will probably improve the performance
of the NFS workload. In the latter case, the sunrpc server layer is
already choosing not to wake idle nfsd threads because there are too
many nfsd threads which want to run but cannot, so configuring more
nfsd threads will make no difference whatsoever. The overloads-avoided
statistic (see below) can be used to distinguish these cases.
threads-woken
Counts how many times an idle nfsd thread is woken to try to
receive some data from an NFS transport.
This statistic tracks the circumstance where incoming
network-facing NFS work is being handled quickly, which is a good
thing. The ideal rate of change for this counter will be close
to but less than the rate of change of the packets-arrived counter.
overloads-avoided
Counts how many times the sunrpc server layer chose not to wake an
nfsd thread, despite the presence of idle nfsd threads, because
too many nfsd threads had been recently woken but could not get
enough CPU time to actually run.
This statistic counts a circumstance where the sunrpc layer
heuristically avoids overloading the CPU scheduler with too many
runnable nfsd threads. The ideal rate of change for this counter
is zero. Significant non-zero values indicate that the workload
is CPU limited. Usually this is associated with heavy CPU usage
on all the CPUs in the nfsd thread pool.
If a sustained large overloads-avoided rate is detected on a pool,
the top(1) utility should be used to check for the following
pattern of CPU usage on all the CPUs associated with the given
nfsd thread pool.
- %us ~= 0 (as you're *NOT* running applications on your NFS server)
- %wa ~= 0
- %id ~= 0
- %sy + %hi + %si ~= 100
If this pattern is seen, configuring more nfsd threads will *not*
improve the performance of the workload. If this pattern is not
seen, then something more subtle is wrong.
threads-timedout
Counts how many times an nfsd thread triggered an idle timeout,
i.e. was not woken to handle any incoming network packets for
some time.
This statistic counts a circumstance where there are more nfsd
threads configured than can be used by the NFS workload. This is
a clue that the number of nfsd threads can be reduced without
affecting performance. Unfortunately, it's only a clue and not
a strong indication, for a couple of reasons:
- Currently the rate at which the counter is incremented is quite
slow; the idle timeout is 60 minutes. Unless the NFS workload
remains constant for hours at a time, this counter is unlikely
to be providing information that is still useful.
- It is usually a wise policy to provide some slack,
i.e. configure a few more nfsds than are currently needed,
to allow for future spikes in load.
Note that incoming packets on NFS transports will be dealt with in
one of three ways. An nfsd thread can be woken (threads-woken counts
this case), or the transport can be enqueued for later attention
(sockets-enqueued counts this case), or the packet can be temporarily
deferred because the transport is currently being used by an nfsd
thread. This last case is not very interesting and is not explicitly
counted, but can be inferred from the other counters thus:
packets-deferred = packets-arrived - ( sockets-enqueued + threads-woken )
More
----
Descriptions of the other statistics file should go here.
Greg Banks <gnb@sgi.com>
26 Mar 2009


@ -0,0 +1,273 @@
################################################################################
# #
# NFS/RDMA README #
# #
################################################################################
Author: NetApp and Open Grid Computing
Date: May 29, 2008
Table of Contents
~~~~~~~~~~~~~~~~~
- Overview
- Getting Help
- Installation
- Check RDMA and NFS Setup
- NFS/RDMA Setup
Overview
~~~~~~~~
This document describes how to install and setup the Linux NFS/RDMA client
and server software.
The NFS/RDMA client was first included in Linux 2.6.24. The NFS/RDMA server
was first included in the following release, Linux 2.6.25.
In our testing, we have obtained excellent performance results (full 10Gbit
wire bandwidth at minimal client CPU) under many workloads. The code passes
the full Connectathon test suite and operates over both Infiniband and iWARP
RDMA adapters.
Getting Help
~~~~~~~~~~~~
If you get stuck, you can ask questions on the
nfs-rdma-devel@lists.sourceforge.net
mailing list.
Installation
~~~~~~~~~~~~
These instructions are a step by step guide to building a machine for
use with NFS/RDMA.
- Install an RDMA device
Any device supported by the drivers in drivers/infiniband/hw is acceptable.
Testing has been performed using several Mellanox-based IB cards, the
Ammasso AMS1100 iWARP adapter, and the Chelsio cxgb3 iWARP adapter.
- Install a Linux distribution and tools
The first kernel release to contain both the NFS/RDMA client and server was
Linux 2.6.25. Therefore, a distribution compatible with this and subsequent
Linux kernel release should be installed.
The procedures described in this document have been tested with
distributions from Red Hat's Fedora Project (http://fedora.redhat.com/).
- Install nfs-utils-1.1.2 or greater on the client
An NFS/RDMA mount point can be obtained by using the mount.nfs command in
nfs-utils-1.1.2 or greater (nfs-utils-1.1.1 was the first nfs-utils
version with support for NFS/RDMA mounts, but for various reasons we
recommend using nfs-utils-1.1.2 or greater). To see which version of
mount.nfs you are using, type:
$ /sbin/mount.nfs -V
If the version is less than 1.1.2 or the command does not exist,
you should install the latest version of nfs-utils.
Download the latest package from:
http://www.kernel.org/pub/linux/utils/nfs
Uncompress the package and follow the installation instructions.
If you will not need the idmapper and gssd executables (you do not need
these to create an NFS/RDMA enabled mount command), the installation
process can be simplified by disabling these features when running
configure:
$ ./configure --disable-gss --disable-nfsv4
To build nfs-utils you will need the tcp_wrappers package installed. For
more information on this see the package's README and INSTALL files.
After building the nfs-utils package, there will be a mount.nfs binary in
the utils/mount directory. This binary can be used to initiate NFS v2, v3,
or v4 mounts. To initiate a v4 mount, the binary must be called
mount.nfs4. The standard technique is to create a symlink called
mount.nfs4 to mount.nfs.
This mount.nfs binary should be installed at /sbin/mount.nfs as follows:
$ sudo cp utils/mount/mount.nfs /sbin/mount.nfs
In this location, mount.nfs will be invoked automatically for NFS mounts
by the system mount command.
NOTE: mount.nfs and therefore nfs-utils-1.1.2 or greater is only needed
on the NFS client machine. You do not need this specific version of
nfs-utils on the server. Furthermore, only the mount.nfs command from
nfs-utils-1.1.2 is needed on the client.
- Install a Linux kernel with NFS/RDMA
The NFS/RDMA client and server are both included in the mainline Linux
kernel version 2.6.25 and later. This and other versions of the 2.6 Linux
kernel can be found at:
ftp://ftp.kernel.org/pub/linux/kernel/v2.6/
Download the sources and place them in an appropriate location.
- Configure the RDMA stack
Make sure your kernel configuration has RDMA support enabled. Under
Device Drivers -> InfiniBand support, update the kernel configuration
to enable InfiniBand support [NOTE: the option name is misleading. Enabling
InfiniBand support is required for all RDMA devices (IB, iWARP, etc.)].
Enable the appropriate IB HCA support (mlx4, mthca, ehca, ipath, etc.) or
iWARP adapter support (amso, cxgb3, etc.).
If you are using InfiniBand, be sure to enable IP-over-InfiniBand support.
- Configure the NFS client and server
Your kernel configuration must also have NFS file system support and/or
NFS server support enabled. These and other NFS related configuration
options can be found under File Systems -> Network File Systems.
- Build, install, reboot
The NFS/RDMA code will be enabled automatically if NFS and RDMA
are turned on. The NFS/RDMA client and server are configured via the
SUNRPC_XPRT_RDMA_CLIENT and SUNRPC_XPRT_RDMA_SERVER config options that both
depend on SUNRPC and INFINIBAND. The default value of both options will be:
- N if either SUNRPC or INFINIBAND are N, in this case the NFS/RDMA client
and server will not be built
- M if both SUNRPC and INFINIBAND are on (M or Y) and at least one is M,
in this case the NFS/RDMA client and server will be built as modules
- Y if both SUNRPC and INFINIBAND are Y, in this case the NFS/RDMA client
and server will be built into the kernel
Therefore, if you have followed the steps above and turned on NFS and RDMA,
the NFS/RDMA client and server will be built.
Build a new kernel, install it, boot it.
Check RDMA and NFS Setup
~~~~~~~~~~~~~~~~~~~~~~~~
Before configuring the NFS/RDMA software, it is a good idea to test
your new kernel to ensure that the kernel is working correctly.
In particular, it is a good idea to verify that the RDMA stack
is functioning as expected and standard NFS over TCP/IP and/or UDP/IP
is working properly.
- Check RDMA Setup
If you built the RDMA components as modules, load them at
this time. For example, if you are using a Mellanox Tavor/Sinai/Arbel
card:
$ modprobe ib_mthca
$ modprobe ib_ipoib
If you are using InfiniBand, make sure there is a Subnet Manager (SM)
running on the network. If your IB switch has an embedded SM, you can
use it. Otherwise, you will need to run an SM, such as OpenSM, on one
of your end nodes.
If an SM is running on your network, you should see the following:
$ cat /sys/class/infiniband/driverX/ports/1/state
4: ACTIVE
where driverX is mthca0, ipath5, ehca3, etc.
To further test the InfiniBand software stack, use IPoIB (this
assumes you have two IB hosts named host1 and host2):
host1$ ifconfig ib0 a.b.c.x
host2$ ifconfig ib0 a.b.c.y
host1$ ping a.b.c.y
host2$ ping a.b.c.x
For other device types, follow the appropriate procedures.
- Check NFS Setup
For the NFS components enabled above (client and/or server),
test their functionality over standard Ethernet using TCP/IP or UDP/IP.
NFS/RDMA Setup
~~~~~~~~~~~~~~
We recommend that you use two machines, one to act as the client and
one to act as the server.
One time configuration:
- On the server system, configure the /etc/exports file and
start the NFS/RDMA server.
Exports entries with the following formats have been tested:
/vol0 192.168.0.47(fsid=0,rw,async,insecure,no_root_squash)
/vol0 192.168.0.0/255.255.255.0(fsid=0,rw,async,insecure,no_root_squash)
The IP address(es) is(are) the client's IPoIB address for an InfiniBand
HCA or the client's iWARP address(es) for an RNIC.
NOTE: The "insecure" option must be used because the NFS/RDMA client does
not use a reserved port.
Each time a machine boots:
- Load and configure the RDMA drivers
For InfiniBand using a Mellanox adapter:
$ modprobe ib_mthca
$ modprobe ib_ipoib
$ ifconfig ib0 a.b.c.d
NOTE: use unique addresses for the client and server
- Start the NFS server
If the NFS/RDMA server was built as a module
(CONFIG_SUNRPC_XPRT_RDMA_SERVER=m in kernel config), load the RDMA
transport module:
$ modprobe svcrdma
Regardless of how the server was built (module or built-in), start the
server:
$ /etc/init.d/nfs start
or
$ service nfs start
Instruct the server to listen on the RDMA transport:
$ echo rdma 20049 > /proc/fs/nfsd/portlist
- On the client system
If the NFS/RDMA client was built as a module
(CONFIG_SUNRPC_XPRT_RDMA_CLIENT=m in kernel config), load the RDMA client
module:
$ modprobe xprtrdma
Regardless of how the client was built (module or built-in), use this
command to mount the NFS/RDMA server:
$ mount -o rdma,port=20049 <IPoIB-server-name-or-address>:/<export> /mnt
To verify that the mount is using RDMA, run "cat /proc/mounts" and check
the "proto" field for the given mount.
Congratulations! You're using NFS/RDMA!


@ -0,0 +1,136 @@
The NFS client
==============
The NFS version 2 protocol was first documented in RFC1094 (March 1989).
Since then two more major releases of NFS have been published, with NFSv3
being documented in RFC1813 (June 1995), and NFSv4 in RFC3530 (April
2003).
The Linux NFS client currently supports all the above published versions,
and work is in progress on adding support for minor version 1 of the NFSv4
protocol.
The purpose of this document is to provide information on some of the
special features of the NFS client that can be configured by system
administrators.
The nfs4_unique_id parameter
============================
NFSv4 requires clients to identify themselves to servers with a unique
string. File open and lock state shared between one client and one server
is associated with this identity. To support robust NFSv4 state recovery
and transparent state migration, this identity string must not change
across client reboots.
Without any other intervention, the Linux client uses a string that contains
the local system's node name. System administrators, however, often do not
take care to ensure that node names are fully qualified and do not change
over the lifetime of a client system. Node names can have other
administrative requirements that require particular behavior that does not
work well as part of an nfs_client_id4 string.
The nfs.nfs4_unique_id boot parameter specifies a unique string that can be
used instead of a system's node name when an NFS client identifies itself to
a server. Thus, if the system's node name is not unique, or it changes, its
nfs.nfs4_unique_id stays the same, preventing collision with other clients
or loss of state during NFS reboot recovery or transparent state migration.
The nfs.nfs4_unique_id string is typically a UUID, though it can contain
anything that is believed to be unique across all NFS clients. An
nfs4_unique_id string should be chosen when a client system is installed,
just as a system's root file system gets a fresh UUID in its label at
install time.
The string should remain fixed for the lifetime of the client. It can be
changed safely if care is taken that the client shuts down cleanly and all
outstanding NFSv4 state has expired, to prevent loss of NFSv4 state.
This string can be stored in an NFS client's grub.conf, or it can be provided
via a net boot facility such as PXE. It may also be specified as an nfs.ko
module parameter. Specifying a uniquifier string is not supported for NFS
clients running in containers.
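A hedged example of both forms (the UUID value shown is arbitrary):

  On the kernel command line, e.g. in the boot loader configuration:
    nfs.nfs4_unique_id=0f1a2b3c-4d5e-6f70-8192-a3b4c5d6e7f8

  Or as a module option, e.g. in a modprobe configuration file:
    options nfs nfs4_unique_id=0f1a2b3c-4d5e-6f70-8192-a3b4c5d6e7f8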
The DNS resolver
================
NFSv4 allows for one server to refer the NFS client to data that has been
migrated onto another server by means of the special "fs_locations"
attribute. See
http://tools.ietf.org/html/rfc3530#section-6
and
http://tools.ietf.org/html/draft-ietf-nfsv4-referrals-00
The fs_locations information can take the form of either an ip address and
a path, or a DNS hostname and a path. The latter requires the NFS client to
do a DNS lookup in order to mount the new volume, and hence the need for an
upcall to allow userland to provide this service.
Assuming that the user has the 'rpc_pipefs' filesystem mounted in the usual
/var/lib/nfs/rpc_pipefs, the upcall consists of the following steps:
(1) The process checks the dns_resolve cache to see if it contains a
valid entry. If so, it returns that entry and exits.
(2) If no valid entry exists, the helper script '/sbin/nfs_cache_getent'
(may be changed using the 'nfs.cache_getent' kernel boot parameter)
is run, with two arguments:
- the cache name, "dns_resolve"
- the hostname to resolve
(3) After looking up the corresponding ip address, the helper script
writes the result into the rpc_pipefs pseudo-file
'/var/lib/nfs/rpc_pipefs/cache/dns_resolve/channel'
in the following (text) format:
"<ip address> <hostname> <ttl>\n"
Where <ip address> is in the usual IPv4 (e.g. 192.168.78.90) or IPv6
(ffee:ddcc:bbaa:9988:7766:5544:3322:1100, ffee::1100, ...) format.
<hostname> is identical to the second argument of the helper
script, and <ttl> is the 'time to live' of this cache entry (in
units of seconds).
Note: If <ip address> is invalid, say the string "0", then a negative
entry is created, which will cause the kernel to treat the hostname
as having no valid DNS translation.
A basic sample /sbin/nfs_cache_getent
=====================================
#!/bin/bash
#
ttl=600
#
cut=/usr/bin/cut
getent=/usr/bin/getent
rpc_pipefs=/var/lib/nfs/rpc_pipefs
#
die()
{
echo "Usage: $0 cache_name entry_name"
exit 1
}
[ $# -lt 2 ] && die
cachename="$1"
cache_path=${rpc_pipefs}/cache/${cachename}/channel
case "${cachename}" in
dns_resolve)
name="$2"
result="$(${getent} hosts ${name} | ${cut} -f1 -d\ )"
[ -z "${result}" ] && result="0"
;;
*)
die
;;
esac
echo "${result} ${name} ${ttl}" >${cache_path}


@ -0,0 +1,180 @@
NFSv4.1 Server Implementation
Server support for minorversion 1 can be controlled using the
/proc/fs/nfsd/versions control file. The string output returned
by reading this file will contain either "+4.1" or "-4.1"
correspondingly.
Currently, server support for minorversion 1 is enabled by default.
It can be disabled at run time by writing the string "-4.1" to
the /proc/fs/nfsd/versions control file. Note that to write this
control file, the nfsd service must be taken down. You can use rpc.nfsd
for this; see rpc.nfsd(8).
(Warning: older servers will interpret "+4.1" and "-4.1" as "+4" and
"-4", respectively. Therefore, code meant to work on both new and old
kernels must turn 4.1 on or off *before* turning support for version 4
on or off; rpc.nfsd does this correctly.)
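A sketch of that sequence (the thread count of 8 is arbitrary):

  $ rpc.nfsd 0                               (take the nfsd service down)
  $ echo "-4.1" > /proc/fs/nfsd/versions
  $ rpc.nfsd 8                               (restart with 8 threads)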
The NFSv4 minorversion 1 (NFSv4.1) implementation in nfsd is based
on RFC 5661.
From the many new features in NFSv4.1 the current implementation
focuses on the mandatory-to-implement NFSv4.1 Sessions, providing
"exactly once" semantics and better control and throttling of the
resources allocated for each client.
Other NFSv4.1 features, Parallel NFS operations in particular,
are still under development out of tree.
See http://wiki.linux-nfs.org/wiki/index.php/PNFS_prototype_design
for more information.
The table below, taken from the NFSv4.1 document, lists
the operations that are mandatory to implement (REQ), optional
(OPT), and NFSv4.0 operations that are required not to implement (MNI)
in minor version 1. The first column indicates the operations that
are not supported yet by the linux server implementation.
The OPTIONAL features identified and their abbreviations are as follows:
pNFS Parallel NFS
FDELG File Delegations
DDELG Directory Delegations
The following abbreviations indicate the linux server implementation status.
I Implemented NFSv4.1 operations.
NS Not Supported.
NS* unimplemented optional feature.
P pNFS features implemented out of tree.
PNS pNFS features that are not supported yet (out of tree).
Operations
+----------------------+------------+--------------+----------------+
| Operation | REQ, REC, | Feature | Definition |
| | OPT, or | (REQ, REC, | |
| | MNI | or OPT) | |
+----------------------+------------+--------------+----------------+
| ACCESS | REQ | | Section 18.1 |
I | BACKCHANNEL_CTL | REQ | | Section 18.33 |
I | BIND_CONN_TO_SESSION | REQ | | Section 18.34 |
| CLOSE | REQ | | Section 18.2 |
| COMMIT | REQ | | Section 18.3 |
| CREATE | REQ | | Section 18.4 |
I | CREATE_SESSION | REQ | | Section 18.36 |
NS*| DELEGPURGE | OPT | FDELG (REQ) | Section 18.5 |
| DELEGRETURN | OPT | FDELG, | Section 18.6 |
| | | DDELG, pNFS | |
| | | (REQ) | |
I | DESTROY_CLIENTID | REQ | | Section 18.50 |
I | DESTROY_SESSION | REQ | | Section 18.37 |
I | EXCHANGE_ID | REQ | | Section 18.35 |
I | FREE_STATEID | REQ | | Section 18.38 |
| GETATTR | REQ | | Section 18.7 |
P | GETDEVICEINFO | OPT | pNFS (REQ) | Section 18.40 |
P | GETDEVICELIST | OPT | pNFS (OPT) | Section 18.41 |
| GETFH | REQ | | Section 18.8 |
NS*| GET_DIR_DELEGATION | OPT | DDELG (REQ) | Section 18.39 |
P | LAYOUTCOMMIT | OPT | pNFS (REQ) | Section 18.42 |
P | LAYOUTGET | OPT | pNFS (REQ) | Section 18.43 |
P | LAYOUTRETURN | OPT | pNFS (REQ) | Section 18.44 |
| LINK | OPT | | Section 18.9 |
| LOCK | REQ | | Section 18.10 |
| LOCKT | REQ | | Section 18.11 |
| LOCKU | REQ | | Section 18.12 |
| LOOKUP | REQ | | Section 18.13 |
| LOOKUPP | REQ | | Section 18.14 |
| NVERIFY | REQ | | Section 18.15 |
| OPEN | REQ | | Section 18.16 |
NS*| OPENATTR | OPT | | Section 18.17 |
| OPEN_CONFIRM | MNI | | N/A |
| OPEN_DOWNGRADE | REQ | | Section 18.18 |
| PUTFH | REQ | | Section 18.19 |
| PUTPUBFH | REQ | | Section 18.20 |
| PUTROOTFH | REQ | | Section 18.21 |
| READ | REQ | | Section 18.22 |
| READDIR | REQ | | Section 18.23 |
| READLINK | OPT | | Section 18.24 |
| RECLAIM_COMPLETE | REQ | | Section 18.51 |
| RELEASE_LOCKOWNER | MNI | | N/A |
| REMOVE | REQ | | Section 18.25 |
| RENAME | REQ | | Section 18.26 |
| RENEW | MNI | | N/A |
| RESTOREFH | REQ | | Section 18.27 |
| SAVEFH | REQ | | Section 18.28 |
| SECINFO | REQ | | Section 18.29 |
I | SECINFO_NO_NAME | REC | pNFS files | Section 18.45, |
| | | layout (REQ) | Section 13.12 |
I | SEQUENCE | REQ | | Section 18.46 |
| SETATTR | REQ | | Section 18.30 |
| SETCLIENTID | MNI | | N/A |
| SETCLIENTID_CONFIRM | MNI | | N/A |
NS | SET_SSV | REQ | | Section 18.47 |
I | TEST_STATEID | REQ | | Section 18.48 |
| VERIFY | REQ | | Section 18.31 |
NS*| WANT_DELEGATION | OPT | FDELG (OPT) | Section 18.49 |
| WRITE | REQ | | Section 18.32 |
Callback Operations
+-------------------------+-----------+-------------+---------------+
| Operation | REQ, REC, | Feature | Definition |
| | OPT, or | (REQ, REC, | |
| | MNI | or OPT) | |
+-------------------------+-----------+-------------+---------------+
| CB_GETATTR | OPT | FDELG (REQ) | Section 20.1 |
P | CB_LAYOUTRECALL | OPT | pNFS (REQ) | Section 20.3 |
NS*| CB_NOTIFY | OPT | DDELG (REQ) | Section 20.4 |
P | CB_NOTIFY_DEVICEID | OPT | pNFS (OPT) | Section 20.12 |
NS*| CB_NOTIFY_LOCK | OPT | | Section 20.11 |
NS*| CB_PUSH_DELEG | OPT | FDELG (OPT) | Section 20.5 |
| CB_RECALL | OPT | FDELG, | Section 20.2 |
| | | DDELG, pNFS | |
| | | (REQ) | |
NS*| CB_RECALL_ANY | OPT | FDELG, | Section 20.6 |
| | | DDELG, pNFS | |
| | | (REQ) | |
NS | CB_RECALL_SLOT | REQ | | Section 20.8 |
NS*| CB_RECALLABLE_OBJ_AVAIL | OPT | DDELG, pNFS | Section 20.7 |
| | | (REQ) | |
I | CB_SEQUENCE | OPT | FDELG, | Section 20.9 |
| | | DDELG, pNFS | |
| | | (REQ) | |
NS*| CB_WANTS_CANCELLED | OPT | FDELG, | Section 20.10 |
| | | DDELG, pNFS | |
| | | (REQ) | |
+-------------------------+-----------+-------------+---------------+
Implementation notes:
SSV:
* The spec claims this is mandatory, but we don't actually know of any
implementations, so we're ignoring it for now. The server returns
NFS4ERR_ENCR_ALG_UNSUPP on EXCHANGE_ID, which should be future-proof.
GSS on the backchannel:
* Again, theoretically required but not widely implemented (in
particular, the current Linux client doesn't request it). We return
NFS4ERR_ENCR_ALG_UNSUPP on CREATE_SESSION.
DELEGPURGE:
* mandatory only for servers that support CLAIM_DELEGATE_PREV and/or
CLAIM_DELEG_PREV_FH (which allows clients to keep delegations that
persist across client reboots). Thus we need not implement this for
now.
EXCHANGE_ID:
* implementation ids are ignored
CREATE_SESSION:
* backchannel attributes are ignored
SEQUENCE:
* no support for dynamic slot table renegotiation (optional)
Nonstandard compound limitations:
* No support for a session's fore channel RPC compound that requires both a
ca_maxrequestsize request and a ca_maxresponsesize reply, so we may
fail to live up to the promise we made in CREATE_SESSION fore channel
negotiation.
See also http://wiki.linux-nfs.org/wiki/index.php/Server_4.0_and_4.1_issues.

View file

@ -0,0 +1,41 @@
Administrative interfaces for nfsd
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Note that normally these interfaces are used only by the utilities in
nfs-utils.
nfsd is controlled mainly by pseudofiles under the "nfsd" filesystem,
which is normally mounted at /proc/fs/nfsd/.
The server is always started by the first write of a nonzero value to
nfsd/threads.
Before doing that, NFSD can be told which sockets to listen on by
writing to nfsd/portlist; that write may be:
- an ASCII-encoded file descriptor, which should refer to a
bound (and listening, for tcp) socket, or
- "transportname port", where transportname is currently either
"udp", "tcp", or "rdma".
If nfsd is started without doing any of these, then it will create one
udp and one tcp listener at port 2049 (see nfsd_init_socks).
On startup, nfsd and lockd grace periods start.
nfsd is shut down by a write of 0 to nfsd/threads. All locks and state
are thrown away at that point.
Between startup and shutdown, the number of threads may be adjusted up
or down by additional writes to nfsd/threads or by writes to
nfsd/pool_threads.
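As a concrete sketch (the thread counts and the TCP listener on port 2049
are only examples; rpc.nfsd from nfs-utils normally performs the equivalent
steps):
  # mount -t nfsd nfsd /proc/fs/nfsd        (if not already mounted)
  # echo "tcp 2049" > /proc/fs/nfsd/portlist
  # echo 8 > /proc/fs/nfsd/threads          (starts the server with 8 threads)
  # echo 16 > /proc/fs/nfsd/threads         (adjusts the thread count)
  # echo 0 > /proc/fs/nfsd/threads          (shuts the server down)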
For more detail about files under nfsd/ and what they control, see
fs/nfsd/nfsctl.c; most of them have detailed comments.
Implementation notes
^^^^^^^^^^^^^^^^^^^^
Note that the rpc server requires the caller to serialize addition and
removal of listening sockets, and startup and shutdown of the server.
For nfsd this is done using nfsd_mutex.

View file

@ -0,0 +1,302 @@
Mounting the root filesystem via NFS (nfsroot)
===============================================
Written 1996 by Gero Kuhlmann <gero@gkminix.han.de>
Updated 1997 by Martin Mares <mj@atrey.karlin.mff.cuni.cz>
Updated 2006 by Nico Schottelius <nico-kernel-nfsroot@schottelius.org>
Updated 2006 by Horms <horms@verge.net.au>
In order to use a diskless system, such as an X-terminal or printer server
for example, it is necessary for the root filesystem to be present on a
non-disk device. This may be an initramfs (see Documentation/filesystems/
ramfs-rootfs-initramfs.txt), a ramdisk (see Documentation/initrd.txt) or a
filesystem mounted via NFS. The following text describes on how to use NFS
for the root filesystem. For the rest of this text 'client' means the
diskless system, and 'server' means the NFS server.
1.) Enabling nfsroot capabilities
-----------------------------
In order to use nfsroot, NFS client support needs to be selected as
built-in during configuration. Once this has been selected, the nfsroot
option will become available, which should also be selected.
In the networking options, kernel level autoconfiguration can be selected,
along with the types of autoconfiguration to support. Selecting all of
DHCP, BOOTP and RARP is safe.
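As a sketch, a kernel configuration enabling all of this would typically
contain options like the following (the symbol names are listed only for
orientation and are assumptions about the standard configuration):
  CONFIG_NFS_FS=y
  CONFIG_ROOT_NFS=y
  CONFIG_IP_PNP=y
  CONFIG_IP_PNP_DHCP=y
  CONFIG_IP_PNP_BOOTP=y
  CONFIG_IP_PNP_RARP=y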
2.) Kernel command line
-------------------
When the kernel has been loaded by a boot loader (see below) it needs to be
told what root fs device to use. And in the case of nfsroot, where to find
both the server and the name of the directory on the server to mount as root.
This can be established using the following kernel command line parameters:
root=/dev/nfs
This is necessary to enable the pseudo-NFS-device. Note that it's not a
real device but just a synonym to tell the kernel to use NFS instead of
a real device.
nfsroot=[<server-ip>:]<root-dir>[,<nfs-options>]
If the `nfsroot' parameter is NOT given on the command line,
the default "/tftpboot/%s" will be used.
<server-ip> Specifies the IP address of the NFS server.
The default address is determined by the `ip' parameter
(see below). This parameter allows the use of different
servers for IP autoconfiguration and NFS.
<root-dir> Name of the directory on the server to mount as root.
If there is a "%s" token in the string, it will be
replaced by the ASCII-representation of the client's
IP address.
<nfs-options> Standard NFS options. All options are separated by commas.
The following defaults are used:
port = as given by server portmap daemon
rsize = 4096
wsize = 4096
timeo = 7
retrans = 3
acregmin = 3
acregmax = 60
acdirmin = 30
acdirmax = 60
flags = hard, nointr, noposix, cto, ac
ip=<client-ip>:<server-ip>:<gw-ip>:<netmask>:<hostname>:<device>:<autoconf>:
<dns0-ip>:<dns1-ip>
This parameter tells the kernel how to configure IP addresses of devices
and also how to set up the IP routing table. It was originally called
`nfsaddrs', but now the boot-time IP configuration works independently of
NFS, so it was renamed to `ip' and the old name remained as an alias for
compatibility reasons.
If this parameter is missing from the kernel command line, all fields are
assumed to be empty, and the defaults mentioned below apply. In general
this means that the kernel tries to configure everything using
autoconfiguration.
The <autoconf> parameter can appear alone as the value to the `ip'
parameter (without all the ':' characters before). If the value is
"ip=off" or "ip=none", no autoconfiguration will take place, otherwise
autoconfiguration will take place. The most common way to use this
is "ip=dhcp".
<client-ip> IP address of the client.
Default: Determined using autoconfiguration.
<server-ip> IP address of the NFS server. If RARP is used to determine
the client address and this parameter is NOT empty only
replies from the specified server are accepted.
Only required for NFS root. That is autoconfiguration
will not be triggered if it is missing and NFS root is not
in operation.
Default: Determined using autoconfiguration.
The address of the autoconfiguration server is used.
<gw-ip> IP address of a gateway if the server is on a different subnet.
Default: Determined using autoconfiguration.
<netmask> Netmask for local network interface. If unspecified
the netmask is derived from the client IP address assuming
classful addressing.
Default: Determined using autoconfiguration.
<hostname> Name of the client. May be supplied by autoconfiguration,
but its absence will not trigger autoconfiguration.
If specified and DHCP is used, the user provided hostname will
be carried in the DHCP request to hopefully update DNS record.
Default: Client IP address is used in ASCII notation.
<device> Name of network device to use.
Default: If the host only has one device, it is used.
Otherwise the device is determined using
autoconfiguration. This is done by sending
autoconfiguration requests out of all devices,
and using the device that received the first reply.
<autoconf> Method to use for autoconfiguration. In the case of options
which specify multiple autoconfiguration protocols,
requests are sent using all protocols, and the first one
to reply is used.
Only autoconfiguration protocols that have been compiled
into the kernel will be used, regardless of the value of
this option.
off or none: don't use autoconfiguration
(do static IP assignment instead)
on or any: use any protocol available in the kernel
(default)
dhcp: use DHCP
bootp: use BOOTP
rarp: use RARP
both: use both BOOTP and RARP but not DHCP
(old option kept for backwards compatibility)
Default: any
<dns0-ip> IP address of first nameserver.
The value is exported via /proc/net/pnp, to which
/etc/resolv.conf is often symlinked on embedded systems.
<dns1-ip>	IP address of second nameserver.
Same as above.
nfsrootdebug
This parameter enables debugging messages to appear in the kernel
log at boot time so that administrators can verify that the correct
NFS mount options, server address, and root path are passed to the
NFS client.
rdinit=<executable file>
To specify which file contains the program that starts system
initialization, administrators can use this command line parameter.
The default value of this parameter is "/init". If the specified
file exists and the kernel can execute it, root filesystem related
kernel command line parameters, including `nfsroot=', are ignored.
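Putting the above together, a complete command line for an NFS-rooted
client might look like this (the server address and export path are purely
illustrative):
  root=/dev/nfs nfsroot=192.168.1.1:/srv/nfsroot,rsize=8192,wsize=8192 ip=dhcp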
A description of the process of mounting the root file system can be
found in:
Documentation/early-userspace/README
3.) Boot Loader
----------
To get the kernel into memory different approaches can be used.
They depend on various facilities being available:
3.1) Booting from a floppy using syslinux
When building kernels, an easy way to create a boot floppy that uses
syslinux is to use the zdisk or bzdisk make targets which use zImage
and bzImage images respectively. Both targets accept the
FDARGS parameter which can be used to set the kernel command line.
e.g.
make bzdisk FDARGS="root=/dev/nfs"
Note that the user running this command will need to have
access to the floppy drive device, /dev/fd0
For more information on syslinux, including how to create bootdisks
for prebuilt kernels, see http://syslinux.zytor.com/
N.B: Previously it was possible to write a kernel directly to
a floppy using dd, configure the boot device using rdev, and
boot using the resulting floppy. Linux no longer supports this
method of booting.
3.2) Booting from a cdrom using isolinux
When building kernels, an easy way to create a bootable cdrom that
uses isolinux is to use the isoimage target which uses a bzimage
image. Like zdisk and bzdisk, this target accepts the FDARGS
parameter which can be used to set the kernel command line.
e.g.
make isoimage FDARGS="root=/dev/nfs"
The resulting iso image will be arch/<ARCH>/boot/image.iso
This can be written to a cdrom using a variety of tools including
cdrecord.
e.g.
cdrecord dev=ATAPI:1,0,0 arch/x86/boot/image.iso
For more information on isolinux, including how to create bootdisks
for prebuilt kernels, see http://syslinux.zytor.com/
3.3) Using LILO
When using LILO all the necessary command line parameters may be
specified using the 'append=' directive in the LILO configuration
file.
However, to use the 'root=' directive you also need to create
a dummy root device, which may be removed after LILO is run.
mknod /dev/boot255 c 0 255
For information on configuring LILO, please refer to its documentation.
3.4) Using GRUB
When using GRUB, kernel parameters are simply appended after the kernel
specification: kernel <kernel> <parameters>
3.5) Using loadlin
loadlin may be used to boot Linux from a DOS command prompt without
requiring a local hard disk to mount as root. This has not been
thoroughly tested by the authors of this document, but in general
it should be possible to configure the kernel command line similarly
to the configuration of LILO.
Please refer to the loadlin documentation for further information.
3.6) Using a boot ROM
This is probably the most elegant way of booting a diskless client.
With a boot ROM the kernel is loaded using the TFTP protocol. The
authors of this document are not aware of any commercial boot
ROMs that support booting Linux over the network. However, there
are two free implementations of a boot ROM, netboot-nfs and
etherboot, both of which are available on sunsite.unc.edu, and both
of which contain everything you need to boot a diskless Linux client.
3.7) Using pxelinux
Pxelinux may be used to boot linux using the PXE boot loader
which is present on many modern network cards.
When using pxelinux, the kernel image is specified using
"kernel <relative-path-below /tftpboot>". The nfsroot parameters
are passed to the kernel by adding them to the "append" line.
It is common to use a serial console in conjunction with pxelinux,
see Documentation/serial-console.txt for more information.
For more information on pxelinux, including how to create bootdisks
for prebuilt kernels, see http://syslinux.zytor.com/
4.) Credits
-------
The nfsroot code in the kernel and the RARP support have been written
by Gero Kuhlmann <gero@gkminix.han.de>.
The rest of the IP layer autoconfiguration code has been written
by Martin Mares <mj@atrey.karlin.mff.cuni.cz>.
In order to write the initial version of nfsroot I would like to thank
Jens-Uwe Mager <jum@anubis.han.de> for his help.

View file

@ -0,0 +1,109 @@
Reference counting in pnfs:
==========================
There are several inter-related caches. We have layouts which can
reference multiple devices, each of which can reference multiple data servers.
Each data server can be referenced by multiple devices. Each device
can be referenced by multiple layouts. To keep all of this straight,
we need to reference count.
struct pnfs_layout_hdr
----------------------
The on-the-wire command LAYOUTGET corresponds to struct
pnfs_layout_segment, usually referred to by the variable name lseg.
Each nfs_inode may hold a pointer to a cache of these layout
segments in nfsi->layout, of type struct pnfs_layout_hdr.
We reference the header for the inode pointing to it, across each
outstanding RPC call that references it (LAYOUTGET, LAYOUTRETURN,
LAYOUTCOMMIT), and for each lseg held within.
Each header is also (when non-empty) put on a list associated with
struct nfs_client (cl_layouts). Being put on this list does not bump
the reference count, as the layout is kept around by the lseg that
keeps it in the list.
deviceid_cache
--------------
lsegs reference device ids, which are resolved per nfs_client and
layout driver type. The device ids are held in a RCU cache (struct
nfs4_deviceid_cache). The cache itself is referenced across each
mount. The entries (struct nfs4_deviceid) themselves are held across
the lifetime of each lseg referencing them.
RCU is used because the deviceid is basically a write once, read many
data structure. The hlist size of 32 buckets needs better
justification, but seems reasonable given that we can have multiple
deviceid's per filesystem, and multiple filesystems per nfs_client.
The hash code is copied from the nfsd code base. A discussion of
hashing and variations of this algorithm can be found at:
http://groups.google.com/group/comp.lang.c/browse_thread/thread/9522965e2b8d3809
data server cache
-----------------
file driver devices refer to data servers, which are kept in a module
level cache. Its reference is held over the lifetime of the deviceid
pointing to it.
lseg
----
lseg maintains an extra reference corresponding to the NFS_LSEG_VALID
bit which holds it in the pnfs_layout_hdr's list. When the final lseg
is removed from the pnfs_layout_hdr's list, the NFS_LAYOUT_DESTROYED
bit is set, preventing any new lsegs from being added.
layout drivers
--------------
pNFS utilizes what are called layout drivers. The standard defines 3 basic
layout types: "files", "objects" and "blocks". For each of these types
there is a layout driver with a common function-vector table which
is called by the nfs-client pnfs core to implement the different layout
types.
Files-layout-driver code is in: fs/nfs/nfs4filelayout.c && nfs4filelayoutdev.c
Objects-layout-driver code is in: fs/nfs/objlayout/.. directory
Blocks-layout-driver code is in: fs/nfs/blocklayout/.. directory
objects-layout setup
--------------------
As part of the full standard implementation the objlayoutdriver.ko needs, at times,
to automatically log in to as-yet-undiscovered iSCSI/OSD devices. For this the
driver makes up-calls to a user-mode script called *osd_login*.
The path_name of the script to use is by default:
/sbin/osd_login.
This name can be overridden by the Kernel module parameter:
objlayoutdriver.osd_login_prog
If the kernel does not find the osd_login_prog path it will zero it out
and will not attempt further logins. An admin can then write a new value
to the objlayoutdriver.osd_login_prog kernel parameter to re-enable it.
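As a sketch, assuming the parameter is exposed writable through sysfs in
the usual way for module parameters, re-enabling it might look like:
  # echo /sbin/osd_login > /sys/module/objlayoutdriver/parameters/osd_login_prog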
The /sbin/osd_login is part of the nfs-utils package, and should usually
be installed on distributions that support this Kernel version.
The API to the login script is as follows:
Usage: $0 -u <URI> -o <OSDNAME> -s <SYSTEMID>
Options:
-u target uri e.g. iscsi://<ip>:<port>
(always exists)
(More protocols can be defined in the future.
The client does not interpret this string; it is
passed unchanged as received from the server)
-o osdname of the requested target OSD
(Might be empty)
(A string which denotes the OSD name, there is a
limit of 64 chars on this string)
-s systemid of the requested target OSD
(Might be empty)
(This string, if not empty, is always a hex
representation of the 20-byte osd_system_id)
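As a purely illustrative sketch, an up-call for a hypothetical target might
therefore invoke the script as:
  /sbin/osd_login -u iscsi://192.168.0.20:3260 -o osd-target-1 \
      -s 0102030405060708090a0b0c0d0e0f1011121314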
blocks-layout setup
-------------------
TODO: Document the setup needs of the blocks layout driver

View file

@ -0,0 +1,202 @@
This document gives a brief introduction to the caching
mechanisms in the sunrpc layer that is used, in particular,
for NFS authentication.
CACHES
======
The caching replaces the old exports table and allows for
a wide variety of values to be cached.
There are a number of caches that are similar in structure though
quite possibly very different in content and use. There is a corpus
of common code for managing these caches.
Examples of caches that are likely to be needed are:
- mapping from IP address to client name
- mapping from client name and filesystem to export options
- mapping from UID to list of GIDs, to work around NFS's limitation
of 16 gids.
- mappings between local UID/GID and remote UID/GID for sites that
do not have uniform uid assignment
- mapping from network identity to public key for crypto authentication.
The common code handles such things as:
- general cache lookup with correct locking
- supporting 'NEGATIVE' as well as positive entries
- allowing an EXPIRED time on cache items, and removing
items after they expire, and are no longer in-use.
- making requests to user-space to fill in cache entries
- allowing user-space to directly set entries in the cache
- delaying RPC requests that depend on as-yet incomplete
cache entries, and replaying those requests when the cache entry
is complete.
- clean out old entries as they expire.
Creating a Cache
----------------
1/ A cache needs a datum to store. This is in the form of a
structure definition that must contain a
struct cache_head
as an element, usually the first.
It will also contain a key and some content.
Each cache element is reference counted and contains
expiry and update times for use in cache management.
2/ A cache needs a "cache_detail" structure that
describes the cache. This stores the hash table, some
parameters for cache management, and some operations detailing how
to work with particular cache items.
The operations required are:
struct cache_head *alloc(void)
This simply allocates appropriate memory and returns
a pointer to the cache_head embedded within the
structure
void cache_put(struct kref *)
This is called when the last reference to an item is
dropped. The pointer passed is to the 'ref' field
in the cache_head. cache_put should release any
references created by 'cache_init' and, if CACHE_VALID
is set, any references created by cache_update.
It should then release the memory allocated by
'alloc'.
int match(struct cache_head *orig, struct cache_head *new)
test if the keys in the two structures match. Return
1 if they do, 0 if they don't.
void init(struct cache_head *orig, struct cache_head *new)
Set the 'key' fields in 'new' from 'orig'. This may
include taking references to shared objects.
void update(struct cache_head *orig, struct cache_head *new)
Set the 'content' fields in 'new' from 'orig'.
int cache_show(struct seq_file *m, struct cache_detail *cd,
struct cache_head *h)
Optional. Used to provide a /proc file that lists the
contents of a cache. This should show one item,
usually on just one line.
int cache_request(struct cache_detail *cd, struct cache_head *h,
char **bpp, int *blen)
Format a request to be sent to user-space for an item
to be instantiated. *bpp is a buffer of size *blen.
*bpp should be moved forward over the encoded message,
and *blen should be reduced to show how much free
space remains. Return 0 on success or <0 if not
enough room or other problem.
int cache_parse(struct cache_detail *cd, char *buf, int len)
A message from user space has arrived to fill out a
cache entry. It is in 'buf' of length 'len'.
cache_parse should parse this, find the item in the
cache with sunrpc_cache_lookup, and update the item
with sunrpc_cache_update.
3/ A cache needs to be registered using cache_register(). This
includes it on a list of caches that will be regularly
cleaned to discard old data.
Using a cache
-------------
To find a value in a cache, call sunrpc_cache_lookup passing a pointer
to the cache_head in a sample item with the 'key' fields filled in.
This will be passed to ->match to identify the target entry. If no
entry is found, a new entry will be created, added to the cache, and
marked as not containing valid data.
The item returned is typically passed to cache_check which will check
if the data is valid, and may initiate an up-call to get fresh data.
cache_check will return -ENOENT if the entry is negative or if an
upcall is needed but not possible, -EAGAIN if an upcall is pending,
or 0 if the data is valid.
cache_check can be passed a "struct cache_req *". This structure is
typically embedded in the actual request and can be used to create a
deferred copy of the request (struct cache_deferred_req). This is
done when the found cache item is not uptodate, but there is reason to
believe that userspace might provide information soon. When the cache
item does become valid, the deferred copy of the request will be
revisited (->revisit). It is expected that this method will
reschedule the request for processing.
The value returned by sunrpc_cache_lookup can also be passed to
sunrpc_cache_update to set the content for the item. A second item is
passed which should hold the content. If the item found by _lookup
has valid data, then it is discarded and a new item is created. This
saves any user of an item from worrying about content changing while
it is being inspected. If the item found by _lookup does not contain
valid data, then the content is copied across and CACHE_VALID is set.
Populating a cache
------------------
Each cache has a name, and when the cache is registered, a directory
with that name is created in /proc/net/rpc
This directory contains a file called 'channel' which is a channel
for communicating between kernel and user for populating the cache.
This directory may later contain other files for interacting
with the cache.
The 'channel' works a bit like a datagram socket. Each 'write' is
passed as a whole to the cache for parsing and interpretation.
Each cache can treat the write requests differently, but it is
expected that a message written will contain:
- a key
- an expiry time
- a content.
with the intention that an item in the cache with the given key
should be created or updated to have the given content, and the
expiry time should be set on that item.
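As a sketch only, for a hypothetical cache registered under the name
"example" whose items map a textual key to a textual value, such a write
might look like (real caches define their own formats in cache_parse):
  # echo "mykey 2147483647 myvalue" > /proc/net/rpc/example/channel
where 2147483647 is simply an expiry time far in the future.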
Reading from a channel is a bit more interesting. When a cache
lookup fails, or when it succeeds but finds an entry that may soon
expire, a request is lodged for that cache item to be updated by
user-space. These requests appear in the channel file.
Successive reads will return successive requests.
If there are no more requests to return, read will return EOF, but a
select or poll for read will block waiting for another request to be
added.
Thus a user-space helper is likely to:
open the channel.
select for readable
read a request
write a response
loop.
If it dies and needs to be restarted, any requests that have not been
answered will still appear in the file and will be read by the new
instance of the helper.
Each cache should define a "cache_parse" method which takes a message
written from user-space and processes it. It should return an error
(which propagates back to the write syscall) or 0.
Each cache should also define a "cache_request" method which
takes a cache item and encodes a request into the buffer
provided.
Note: If a cache has no active readers on the channel, and has had no
active readers for more than 60 seconds, further requests will not be
added to the channel but instead all lookups that do not find a valid
entry will fail. This is partly for backward compatibility: The
previous nfs exports table was deemed to be authoritative and a
failed lookup meant a definite 'no'.
request/response format
-----------------------
While each cache is free to use its own format for requests
and responses over channel, the following is recommended as
appropriate and support routines are available to help:
Each request or response record should be printable ASCII
with precisely one newline character which should be at the end.
Fields within the record should be separated by spaces, normally one.
If spaces, newlines, or nul characters are needed in a field they
must be quoted. Two mechanisms are available:
1/ If a field begins '\x' then it must contain an even number of
hex digits, and pairs of these digits provide the bytes in the
field.
2/ otherwise a \ in the field must be followed by 3 octal digits
which give the code for a byte. Other characters are treated
as themselves. At the very least, space, newline, nul, and
'\' must be quoted in this way.
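For example, a field containing the two words "hello world" could be sent
either entirely in hex as
  \x68656c6c6f20776f726c64
or with only the embedded space escaped as an octal code:
  hello\040world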

View file

@ -0,0 +1,91 @@
rpcsec_gss support for kernel RPC servers
=========================================
This document gives references to the standards and protocols used to
implement RPCGSS authentication in kernel RPC servers such as the NFS
server and the NFS client's NFSv4.0 callback server. (But note that
NFSv4.1 and higher don't require the client to act as a server for the
purposes of authentication.)
RPCGSS is specified in a few IETF documents:
- RFC2203 v1: http://tools.ietf.org/rfc/rfc2203.txt
- RFC5403 v2: http://tools.ietf.org/rfc/rfc5403.txt
and there is a 3rd version being proposed:
- http://tools.ietf.org/id/draft-williams-rpcsecgssv3.txt
(At draft n. 02 at the time of writing)
Background
----------
The RPCGSS Authentication method describes a way to perform GSSAPI
Authentication for NFS. Although GSSAPI is itself completely mechanism
agnostic, in many cases only the KRB5 mechanism is supported by NFS
implementations.
The Linux kernel, at the moment, supports only the KRB5 mechanism, and
depends on GSSAPI extensions that are KRB5 specific.
GSSAPI is a complex library, and implementing it completely in kernel is
unwarranted. However, GSSAPI operations are fundamentally separable into 2
parts:
- initial context establishment
- integrity/privacy protection (signing and encrypting of individual
packets)
The former is more complex and policy-independent, but less
performance-sensitive. The latter is simpler and needs to be very fast.
Therefore, we perform per-packet integrity and privacy protection in the
kernel, but leave the initial context establishment to userspace. We
need upcalls to request userspace to perform context establishment.
NFS Server Legacy Upcall Mechanism
----------------------------------
The classic upcall mechanism uses a custom text-based protocol
to talk to a custom daemon called rpc.svcgssd that is provided by the
nfs-utils package.
This upcall mechanism has 2 limitations:
A) It can handle tokens that are no bigger than 2KiB
In some Kerberos deployments GSSAPI tokens can be quite big, up to and
beyond 64KiB in size, due to various authorization extensions attached to
the Kerberos tickets, which need to be sent through the GSS layer in
order to perform context establishment.
B) It does not properly handle creds where the user is a member of more
than a few thousand groups (the current hard limit in the kernel is 65K
groups) due to a limitation on the size of the buffer that can be sent
back to the kernel (4KiB).
NFS Server New RPC Upcall Mechanism
-----------------------------------
The newer upcall mechanism uses RPC over a unix socket to a daemon
called gss-proxy, implemented by a userspace program called Gssproxy.
The gss_proxy RPC protocol is currently documented here:
https://fedorahosted.org/gss-proxy/wiki/ProtocolDocumentation
This upcall mechanism uses the kernel rpc client and connects to the gssproxy
userspace program over a regular unix socket. The gssproxy protocol does not
suffer from the size limitations of the legacy protocol.
Negotiating Upcall Mechanisms
-----------------------------
To provide backward compatibility, the kernel defaults to using the
legacy mechanism. To switch to the new mechanism, gss-proxy must bind
to /var/run/gssproxy.sock and then write "1" to
/proc/net/rpc/use-gss-proxy. If gss-proxy dies, it must repeat both
steps.
Once the upcall mechanism is chosen, it cannot be changed. To prevent
locking into the legacy mechanisms, the above steps must be performed
before starting nfsd. Whoever starts nfsd can guarantee this by reading
from /proc/net/rpc/use-gss-proxy and checking that it contains a
"1"--the read will block until gss-proxy has done its write to the file.

View file

@ -0,0 +1,270 @@
NILFS2
------
NILFS2 is a log-structured file system (LFS) supporting continuous
snapshotting. In addition to versioning capability of the entire file
system, users can even restore files mistakenly overwritten or
destroyed just a few seconds ago. Since NILFS2 can keep consistency
like conventional LFS, it achieves quick recovery after system
crashes.
NILFS2 creates a number of checkpoints every few seconds or per
synchronous write (unless there is no change). Users can select
significant versions among continuously created checkpoints, and can
change them into snapshots which will be preserved until they are
changed back to checkpoints.
There is no limit on the number of snapshots until the volume gets
full. Each snapshot is mountable as a read-only file system
concurrently with its writable mount, and this feature is convenient
for online backup.
The userland tools are included in nilfs-utils package, which is
available from the following download page. At least "mkfs.nilfs2",
"mount.nilfs2", "umount.nilfs2", and "nilfs_cleanerd" (so called
cleaner or garbage collector) are required. Details on the tools are
described in the man pages included in the package.
Project web page: http://nilfs.sourceforge.net/
Download page: http://nilfs.sourceforge.net/en/download.html
List info: http://vger.kernel.org/vger-lists.html#linux-nilfs
Caveats
=======
Features which NILFS2 does not support yet:
- atime
- extended attributes
- POSIX ACLs
- quotas
- fsck
- defragmentation
Mount options
=============
NILFS2 supports the following mount options:
(*) == default
barrier(*) This enables/disables the use of write barriers. This
nobarrier requires an IO stack which can support barriers, and
if nilfs gets an error on a barrier write, it will
disable the use of barriers with a warning.
errors=continue Keep going on a filesystem error.
errors=remount-ro(*) Remount the filesystem read-only on an error.
errors=panic Panic and halt the machine if an error occurs.
cp=n Specify the checkpoint-number of the snapshot to be
mounted. Checkpoints and snapshots are listed by lscp
user command. Only the checkpoints marked as snapshot
are mountable with this option. A snapshot is read-only,
so a read-only mount option must be specified as well.
order=relaxed(*) Apply relaxed order semantics that allows modified data
blocks to be written to disk without making a
checkpoint if no metadata update is pending. This mode
is equivalent to the ordered data mode of the ext3
filesystem except that updates to data blocks still
preserve atomicity. This improves synchronous
write performance for overwriting.
order=strict Apply strict in-order semantics that preserves sequence
of all file operations including overwriting of data
blocks. That means, it is guaranteed that no
overtaking of events occurs in the recovered file
system after a crash.
norecovery Disable recovery of the filesystem on mount.
This disables every write access on the device for
read-only mounts or snapshots. This option will fail
for r/w mounts on an unclean volume.
discard This enables/disables the use of discard/TRIM commands.
nodiscard(*) The discard/TRIM commands are sent to the underlying
block device when blocks are freed. This is useful
for SSD devices and sparse/thinly-provisioned LUNs.
Ioctls
======
There is some NILFS2 specific functionality which can be accessed by applications
through the system call interfaces. The list of all NILFS2 specific ioctls is
shown in the table below.
Table of NILFS2 specific ioctls
..............................................................................
Ioctl Description
NILFS_IOCTL_CHANGE_CPMODE Change mode of given checkpoint between
checkpoint and snapshot state. This ioctl is
used in chcp and mkcp utilities.
NILFS_IOCTL_DELETE_CHECKPOINT Remove checkpoint from NILFS2 file system.
This ioctl is used in rmcp utility.
NILFS_IOCTL_GET_CPINFO Return info about requested checkpoints. This
ioctl is used in lscp utility and by
nilfs_cleanerd daemon.
NILFS_IOCTL_GET_CPSTAT Return checkpoints statistics. This ioctl is
used by lscp, rmcp utilities and by
nilfs_cleanerd daemon.
NILFS_IOCTL_GET_SUINFO Return segment usage info about requested
segments. This ioctl is used in lssu,
nilfs_resize utilities and by nilfs_cleanerd
daemon.
NILFS_IOCTL_SET_SUINFO Modify segment usage info of requested
segments. This ioctl is used by
nilfs_cleanerd daemon to skip unnecessary
cleaning operation of segments and reduce
performance penalty or wear of flash device
due to redundant move of in-use blocks.
NILFS_IOCTL_GET_SUSTAT Return segment usage statistics. This ioctl
is used in lssu, nilfs_resize utilities and
by nilfs_cleanerd daemon.
NILFS_IOCTL_GET_VINFO Return information on virtual block addresses.
This ioctl is used by nilfs_cleanerd daemon.
NILFS_IOCTL_GET_BDESCS Return information about descriptors of disk
block numbers. This ioctl is used by
nilfs_cleanerd daemon.
NILFS_IOCTL_CLEAN_SEGMENTS Do garbage collection operation in the
environment of requested parameters from
userspace. This ioctl is used by
nilfs_cleanerd daemon.
NILFS_IOCTL_SYNC Make a checkpoint. This ioctl is used in
mkcp utility.
NILFS_IOCTL_RESIZE Resize NILFS2 volume. This ioctl is used
by nilfs_resize utility.
NILFS_IOCTL_SET_ALLOC_RANGE Define lower limit of segments in bytes and
upper limit of segments in bytes. This ioctl
is used by nilfs_resize utility.
NILFS2 usage
============
To use nilfs2 as a local file system, simply:
# mkfs -t nilfs2 /dev/block_device
# mount -t nilfs2 /dev/block_device /dir
This will also invoke the cleaner through the mount helper program
(mount.nilfs2).
Checkpoints and snapshots are managed by the following commands.
Their manpages are included in the nilfs-utils package above.
lscp list checkpoints or snapshots.
mkcp make a checkpoint or a snapshot.
chcp change an existing checkpoint to a snapshot or vice versa.
rmcp invalidate specified checkpoint(s).
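For example, to turn checkpoint number 12 into a snapshot and later back
into a plain checkpoint (the checkpoint number is only illustrative; see
the chcp man page for the exact syntax):
  # chcp ss 12
  # chcp cp 12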
To mount a snapshot,
# mount -t nilfs2 -r -o cp=<cno> /dev/block_device /snap_dir
where <cno> is the checkpoint number of the snapshot.
To unmount the NILFS2 mount point or snapshot, simply:
# umount /dir
Then, the cleaner daemon is automatically shut down by the umount
helper program (umount.nilfs2).
Disk format
===========
A nilfs2 volume is equally divided into a number of segments except
for the super block (SB) and segment #0. A segment is the container
of logs. Each log is composed of summary information blocks, payload
blocks, and an optional super root block (SR):
______________________________________________________
| |SB| | Segment | Segment | Segment | ... | Segment | |
|_|__|_|____0____|____1____|____2____|_____|____N____|_|
0 +1K +4K +8M +16M +24M +(8MB x N)
. . (Typical offsets for 4KB-block)
. .
.______________________.
| log | log |... | log |
|__1__|__2__|____|__m__|
. .
. .
. .
.______________________________.
| Summary | Payload blocks |SR|
|_blocks__|_________________|__|
The payload blocks are organized per file, and each file consists of
data blocks and B-tree node blocks:
|<--- File-A --->|<--- File-B --->|
_______________________________________________________________
| Data blocks | B-tree blocks | Data blocks | B-tree blocks | ...
_|_____________|_______________|_____________|_______________|_
Since only the modified blocks are written in the log, it may have
files without data blocks or B-tree node blocks.
The organization of the blocks is recorded in the summary information
blocks, which contains a header structure (nilfs_segment_summary), per
file structures (nilfs_finfo), and per block structures (nilfs_binfo):
_________________________________________________________________________
| Summary | finfo | binfo | ... | binfo | finfo | binfo | ... | binfo |...
|_blocks__|___A___|_(A,1)_|_____|(A,Na)_|___B___|_(B,1)_|_____|(B,Nb)_|___
The logs include regular files, directory files, symbolic link files
and several meta data files. The meta data files are the files used
to maintain file system meta data. The current version of NILFS2 uses
the following meta data files:
1) Inode file (ifile) -- Stores on-disk inodes
2) Checkpoint file (cpfile) -- Stores checkpoints
3) Segment usage file (sufile) -- Stores allocation state of segments
4) Data address translation file -- Maps virtual block numbers to usual
(DAT) block numbers. This file serves to
make on-disk blocks relocatable.
The following figure shows a typical organization of the logs:
_________________________________________________________________________
| Summary | regular file | file | ... | ifile | cpfile | sufile | DAT |SR|
|_blocks__|_or_directory_|_______|_____|_______|________|________|_____|__|
To stride over segment boundaries, this sequence of files may be split
into multiple logs. The sequence of logs that should be treated as
logically one log is delimited with flags marked in the segment
summary. The recovery code of nilfs2 looks at this boundary information
to ensure atomicity of updates.
The super root block is inserted for every checkpoint. It includes
three special inodes, inodes for the DAT, cpfile, and sufile. Inodes
of regular files, directories, symlinks and other special files, are
included in the ifile. The inode of ifile itself is included in the
corresponding checkpoint entry in the cpfile. Thus, the hierarchy
among NILFS2 files can be depicted as follows:
Super block (SB)
|
v
Super root block (the latest cno=xx)
|-- DAT
|-- sufile
`-- cpfile
|-- ifile (cno=c1)
|-- ifile (cno=c2) ---- file (ino=i1)
: : |-- file (ino=i2)
`-- ifile (cno=xx) |-- file (ino=i3)
: :
`-- file (ino=yy)
( regular file, directory, or symlink )
For detail on the format of each file, please see include/linux/nilfs2_fs.h.

View file

@ -0,0 +1,451 @@
The Linux NTFS filesystem driver
================================
Table of contents
=================
- Overview
- Web site
- Features
- Supported mount options
- Known bugs and (mis-)features
- Using NTFS volume and stripe sets
- The Device-Mapper driver
- The Software RAID / MD driver
- Limitations when using the MD driver
Overview
========
Linux-NTFS comes with a number of user-space programs known as ntfsprogs.
These include mkntfs, a full-featured ntfs filesystem format utility,
ntfsundelete used for recovering files that were unintentionally deleted
from an NTFS volume and ntfsresize which is used to resize an NTFS partition.
See the web site for more information.
To mount an NTFS 1.2/3.x (Windows NT4/2000/XP/2003) volume, use the file
system type 'ntfs'. The driver currently supports read-only mode (with no
fault-tolerance, encryption or journalling) and very limited, but safe, write
support.
For fault tolerance and raid support (i.e. volume and stripe sets), you can
use the kernel's Software RAID / MD driver. See section "Using Software RAID
with NTFS" for details.
Web site
========
There is plenty of additional information on the linux-ntfs web site
at http://www.linux-ntfs.org/
The web site has a lot of additional information, such as a comprehensive
FAQ, documentation on the NTFS on-disk format, information on the Linux-NTFS
userspace utilities, etc.
Features
========
- This is a complete rewrite of the NTFS driver that used to be in the 2.4 and
earlier kernels. This new driver implements NTFS read support and is
functionally equivalent to the old ntfs driver and it also implements limited
write support. The biggest limitation at present is that files/directories
cannot be created or deleted. See below for the list of write features that
are so far supported. Another limitation is that writing to compressed files
is not implemented at all. Also, neither read nor write access to encrypted
files is so far implemented.
- The new driver has full support for sparse files on NTFS 3.x volumes which
the old driver isn't happy with.
- The new driver supports execution of binaries due to mmap() now being
supported.
- The new driver supports loopback mounting of files on NTFS which is used by
some Linux distributions to enable the user to run Linux from an NTFS
partition by creating a large file while in Windows and then loopback
mounting the file while in Linux and creating a Linux filesystem on it that
is used to install Linux on it.
- A comparison of the two drivers using:
time find . -type f -exec md5sum "{}" \;
run three times in sequence with each driver (after a reboot) on a 1.4GiB
NTFS partition, showed the new driver to be 20% faster in total time elapsed
(from 9:43 minutes on average down to 7:53). The time spent in user space
was unchanged but the time spent in the kernel was decreased by a factor of
2.5 (from 85 CPU seconds down to 33).
- The driver does not support short file names in general. For backwards
compatibility, we implement access to files using their short file names if
they exist. The driver will not create short file names however, and a
rename will discard any existing short file name.
- The new driver supports exporting of mounted NTFS volumes via NFS.
- The new driver supports async io (aio).
- The new driver supports fsync(2), fdatasync(2), and msync(2).
- The new driver supports readv(2) and writev(2).
- The new driver supports access time updates (including mtime and ctime).
- The new driver supports truncate(2) and open(2) with O_TRUNC. But at present
only very limited support for highly fragmented files, i.e. ones which have
their data attribute split across multiple extents, is included. Another
limitation is that at present truncate(2) will never create sparse files,
since to mark a file sparse we need to modify the directory entry for the
file and we do not implement directory modifications yet.
- The new driver supports write(2) which can both overwrite existing data and
extend the file size so that you can write beyond the existing data. Also,
writing into sparse regions is supported and the holes are filled in with
clusters. But at present only limited support for highly fragmented files,
i.e. ones which have their data attribute split across multiple extents, is
included. Another limitation is that write(2) will never create sparse
files, since to mark a file sparse we need to modify the directory entry for
the file and we do not implement directory modifications yet.
Supported mount options
=======================
In addition to the generic mount options described by the manual page for the
mount command (man 8 mount, also see man 5 fstab), the NTFS driver supports the
following mount options:
iocharset=name Deprecated option. Still supported but please use
nls=name in the future. See description for nls=name.
nls=name Character set to use when returning file names.
Unlike VFAT, NTFS suppresses names that contain
unconvertible characters. Note that most character
sets contain insufficient characters to represent all
possible Unicode characters that can exist on NTFS.
To be sure you are not missing any files, you are
advised to use nls=utf8 which is capable of
representing all Unicode characters.
utf8=<bool> Option no longer supported. Currently mapped to
nls=utf8 but please use nls=utf8 in the future and
make sure utf8 is compiled either as module or into
the kernel. See description for nls=name.
uid=
gid=
umask= Provide default owner, group, and access mode mask.
These options work as documented in mount(8). By
default, the files/directories are owned by root and
he/she has read and write permissions, as well as
browse permission for directories. No one else has any
access permissions. I.e. the mode on all files is by
default rw------- and for directories rwx------, a
consequence of the default fmask=0177 and dmask=0077.
Using a umask of zero will grant all permissions to
everyone, i.e. all files and directories will have mode
rwxrwxrwx.
fmask=
dmask= Instead of specifying umask which applies both to
files and directories, fmask applies only to files and
dmask only to directories.
sloppy=<BOOL> If sloppy is specified, ignore unknown mount options.
Otherwise the default behaviour is to abort mount if
any unknown options are found.
show_sys_files=<BOOL> If show_sys_files is specified, show the system files
in directory listings. Otherwise the default behaviour
is to hide the system files.
Note that even when show_sys_files is specified, "$MFT"
will not be visible due to bugs/mis-features in glibc.
Further, note that irrespective of show_sys_files, all
files are accessible by name, i.e. you can always do
"ls -l \$UpCase" for example to specifically show the
system file containing the Unicode upcase table.
case_sensitive=<BOOL> If case_sensitive is specified, treat all file names as
case sensitive and create file names in the POSIX
namespace. Otherwise the default behaviour is to treat
file names as case insensitive and to create file names
in the WIN32/LONG name space. Note, the Linux NTFS
driver will never create short file names and will
remove them on rename/delete of the corresponding long
file name.
Note that files remain accessible via their short file
name, if it exists. If case_sensitive, you will need
to provide the correct case of the short file name.
disable_sparse=<BOOL> If disable_sparse is specified, creation of sparse
regions, i.e. holes, inside files is disabled for the
volume (for the duration of this mount only). By
default, creation of sparse regions is enabled, which
is consistent with the behaviour of traditional Unix
filesystems.
errors=opt What to do when critical filesystem errors are found.
Following values can be used for "opt":
continue: DEFAULT, try to clean-up as much as
possible, e.g. marking a corrupt inode as
bad so it is no longer accessed, and then
continue.
recover: At present only supported is recovery of
the boot sector from the backup copy.
If read-only mount, the recovery is done
in memory only and not written to disk.
Note that the options are additive, i.e. specifying:
errors=continue,errors=recover
means the driver will attempt to recover and if that
fails it will clean-up as much as possible and
continue.
mft_zone_multiplier= Set the MFT zone multiplier for the volume (this
setting is not persistent across mounts and can be
changed from mount to mount but cannot be changed on
remount). Values of 1 to 4 are allowed, 1 being the
default. The MFT zone multiplier determines how much
space is reserved for the MFT on the volume. If all
other space is used up, then the MFT zone will be
shrunk dynamically, so this has no impact on the
amount of free space. However, it can have an impact
on performance by affecting fragmentation of the MFT.
In general use the default. If you have a lot of small
files then use a higher value. The values have the
following meaning:
Value MFT zone size (% of volume size)
1 12.5%
2 25%
3 37.5%
4 50%
Note this option is irrelevant for read-only mounts.
Known bugs and (mis-)features
=============================
- The link count on each directory inode entry is set to 1, due to Linux not
supporting directory hard links. This may well confuse some user space
applications, since the directory names will have the same inode numbers.
This also speeds up ntfs_read_inode() immensely. And we haven't found any
problems with this approach so far. If you find a problem with this, please
let us know.
Please send bug reports/comments/feedback/abuse to the Linux-NTFS development
list at sourceforge: linux-ntfs-dev@lists.sourceforge.net
Using NTFS volume and stripe sets
=================================
For support of volume and stripe sets, you can either use the kernel's
Device-Mapper driver or the kernel's Software RAID / MD driver. The former is
the recommended one to use for linear raid. But the latter is required for
raid level 5. For striping and mirroring, either driver should work fine.
The Device-Mapper driver
------------------------
You will need to create a table of the components of the volume/stripe set and
how they fit together and load this into the kernel using the dmsetup utility
(see man 8 dmsetup).
Linear volume sets, i.e. linear raid, have been tested and work fine. Even
though untested, there is no reason why stripe sets, i.e. raid level 0, and
mirrors, i.e. raid level 1, should not work, too. Stripes with parity, i.e.
raid level 5, unfortunately cannot work yet because the current version of the
Device-Mapper driver does not support raid level 5. You may be able to use the
Software RAID / MD driver for raid level 5, see the next section for details.
To create the table describing your volume you will need to know each of its
components and their sizes in sectors, i.e. multiples of 512-byte blocks.
For NT4 fault tolerant volumes you can obtain the sizes using fdisk. So for
example if one of your partitions is /dev/hda2 you would do:
$ fdisk -ul /dev/hda
Disk /dev/hda: 81.9 GB, 81964302336 bytes
255 heads, 63 sectors/track, 9964 cylinders, total 160086528 sectors
Units = sectors of 1 * 512 = 512 bytes
Device Boot Start End Blocks Id System
/dev/hda1 * 63 4209029 2104483+ 83 Linux
/dev/hda2 4209030 37768814 16779892+ 86 NTFS
/dev/hda3 37768815 46170809 4200997+ 83 Linux
And you would know that /dev/hda2 has a size of 37768814 - 4209030 + 1 =
33559785 sectors.
For Win2k and later dynamic disks, you can for example use the ldminfo utility
which is part of the Linux LDM tools (the latest version at the time of
writing is linux-ldm-0.0.8.tar.bz2). You can download it from:
http://www.linux-ntfs.org/
Simply extract the downloaded archive (tar xvjf linux-ldm-0.0.8.tar.bz2), go
into it (cd linux-ldm-0.0.8) and change to the test directory (cd test). You
will find the precompiled (i386) ldminfo utility there. NOTE: You will not be
able to compile this yourself easily so use the binary version!
Then you would use ldminfo in dump mode to obtain the necessary information:
$ ./ldminfo --dump /dev/hda
This would dump the LDM database found on /dev/hda which describes all of your
dynamic disks and all the volumes on them. At the bottom you will see the
VOLUME DEFINITIONS section which is all you really need. You may need to look
further above to determine which of the disks in the volume definitions is
which device in Linux. Hint: Run ldminfo on each of your dynamic disks and
look at the Disk Id close to the top of the output for each (the PRIVATE HEADER
section). You can then find these Disk Ids in the VBLK DATABASE section in the
<Disk> components where you will get the LDM Name for the disk that is found in
the VOLUME DEFINITIONS section.
Note you will also need to enable the LDM driver in the Linux kernel. If your
distribution did not enable it, you will need to recompile the kernel with it
enabled. This will create the LDM partitions on each device at boot time. You
would then use those devices (for /dev/hda they would be /dev/hda1, 2, 3, etc)
in the Device-Mapper table.
You can also bypass using the LDM driver by using the main device (e.g.
/dev/hda) and then using the offsets of the LDM partitions into this device as
the "Start sector of device" when creating the table. Once again ldminfo would
give you the correct information to do this.
Assuming you know all your devices and their sizes things are easy.
For a linear raid the table would look like this (note all values are in
512-byte sectors):
--- cut here ---
# Offset into Size of this Raid type Device Start sector
# volume device of device
0 1028161 linear /dev/hda1 0
1028161 3903762 linear /dev/hdb2 0
4931923 2103211 linear /dev/hdc1 0
--- cut here ---
For a striped volume, i.e. raid level 0, you will need to know the chunk size
you used when creating the volume. Windows uses 64kiB as the default, so it
will probably be this unless you changed the defaults when creating the array.
For a raid level 0 the table would look like this (note all values are in
512-byte sectors):
--- cut here ---
# Offset Size Raid Number Chunk 1st Start 2nd Start
# into of the type of size Device in Device in
# volume volume stripes device device
0 2056320 striped 2 128 /dev/hda1 0 /dev/hdb1 0
--- cut here ---
If there are more than two devices, just add each of them to the end of the
line.
Finally, for a mirrored volume, i.e. raid level 1, the table would look like
this (note all values are in 512-byte sectors):
--- cut here ---
# Ofs Size Raid Log Number Region Should Number Source Start Target Start
# in of the type type of log size sync? of Device in Device in
# vol volume params mirrors Device Device
0 2056320 mirror core 2 16 nosync 2 /dev/hda1 0 /dev/hdb1 0
--- cut here ---
If you are mirroring to multiple devices you can specify further targets at the
end of the line.
Note the "Should sync?" parameter "nosync" means that the two mirrors are
already in sync which will be the case on a clean shutdown of Windows. If the
mirrors are not clean, you can specify the "sync" option instead of "nosync"
and the Device-Mapper driver will then copy the entirety of the "Source Device"
to the "Target Device" or if you specified multiple target devices to all of
them.
Once you have your table, save it in a file somewhere (e.g. /etc/ntfsvolume1),
and hand it over to dmsetup to work with, like so:
$ dmsetup create myvolume1 /etc/ntfsvolume1
You can obviously replace "myvolume1" with whatever name you like.
If it all worked, you will now have the device /dev/device-mapper/myvolume1
which you can then just use as an argument to the mount command as usual to
mount the ntfs volume. For example:
$ mount -t ntfs -o ro /dev/device-mapper/myvolume1 /mnt/myvol1
(You need to create the directory /mnt/myvol1 first and of course you can use
anything you like instead of /mnt/myvol1 as long as it is an existing
directory.)
It is advisable to do the mount read-only to see if the volume has been setup
correctly to avoid the possibility of causing damage to the data on the ntfs
volume.
The Software RAID / MD driver
-----------------------------
An alternative to using the Device-Mapper driver is to use the kernel's
Software RAID / MD driver, for which you need to set up your /etc/raidtab
appropriately (see man 5 raidtab).
Linear volume sets, i.e. linear raid, as well as stripe sets, i.e. raid level
0, have been tested and work fine (though see section "Limitations when using
the MD driver with NTFS volumes" especially if you want to use linear raid).
Even though untested, there is no reason why mirrors, i.e. raid level 1, and
stripes with parity, i.e. raid level 5, should not work, too.
You have to use the "persistent-superblock 0" option for each raid-disk in the
NTFS volume/stripe you are configuring in /etc/raidtab as the persistent
superblock used by the MD driver would damage the NTFS volume.
Windows by default uses a stripe chunk size of 64k, so you probably want the
"chunk-size 64k" option for each raid-disk, too.
For example, if you have a stripe set consisting of two partitions /dev/hda5
and /dev/hdb1 your /etc/raidtab would look like this:
raiddev /dev/md0
raid-level 0
nr-raid-disks 2
nr-spare-disks 0
persistent-superblock 0
chunk-size 64k
device /dev/hda5
raid-disk 0
device /dev/hdb1
raid-disk 1
For linear raid, just change the raid-level above to "raid-level linear", for
mirrors, change it to "raid-level 1", and for stripe sets with parity, change
it to "raid-level 5".
Note for stripe sets with parity you will also need to tell the MD driver
which parity algorithm to use by specifying the option "parity-algorithm
which", where you need to replace "which" with the name of the algorithm to
use (see man 5 raidtab for available algorithms) and you will have to try the
different available algorithms until you find one that works. Make sure you
are working read-only when playing with this as you may damage your data
otherwise. If you find which algorithm works please let us know (email the
linux-ntfs developers list linux-ntfs-dev@lists.sourceforge.net or drop in on
IRC in channel #ntfs on the irc.freenode.net network) so we can update this
documentation.
Once the raidtab is setup, run for example raid0run -a to start all devices or
raid0run /dev/md0 to start a particular md device, in this case /dev/md0.
Then just use the mount command as usual to mount the ntfs volume using for
example: mount -t ntfs -o ro /dev/md0 /mnt/myntfsvolume
It is advisable to do the mount read-only to see if the md volume has been set
up correctly, in order to avoid the possibility of causing damage to the data
on the ntfs volume.
Limitations when using the Software RAID / MD driver
-----------------------------------------------------
Using the md driver will not work properly if any of your NTFS partitions have
an odd number of sectors. This is especially important for linear raid, as all
data after the first partition with an odd number of sectors will be offset by
one or more sectors, so if you mount such a partition with write support you
will cause massive damage to the data on the volume, which will only become
apparent when you try to use the volume again under Windows.
So when using linear raid, make sure that all your partitions have an even
number of sectors BEFORE attempting to use it. You have been warned!
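One quick, hedged way to check this (the partition names are just the examples
used above; blockdev from util-linux prints the size in 512-byte sectors) is:
for part in /dev/hda5 /dev/hdb1; do
	echo -n "$part: "
	blockdev --getsz $part
done
Every number printed should be even before the partitions are used in a
linear raid set.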
Even better is to simply use the Device-Mapper for linear raid and then you do
not have this problem with odd numbers of sectors.
View file
@ -0,0 +1,102 @@
OCFS2 filesystem
==================
OCFS2 is a general purpose extent based shared disk cluster file
system with many similarities to ext3. It supports 64 bit inode
numbers, and has automatically extending metadata groups which may
also make it attractive for non-clustered use.
You'll want to install the ocfs2-tools package in order to at least
get "mount.ocfs2" and "ocfs2_hb_ctl".
Project web page: http://oss.oracle.com/projects/ocfs2
Tools web page: http://oss.oracle.com/projects/ocfs2-tools
OCFS2 mailing lists: http://oss.oracle.com/projects/ocfs2/mailman/
All code copyright 2005 Oracle except when otherwise noted.
CREDITS:
Lots of code taken from ext3 and other projects.
Authors in alphabetical order:
Joel Becker <joel.becker@oracle.com>
Zach Brown <zach.brown@oracle.com>
Mark Fasheh <mfasheh@suse.com>
Kurt Hackel <kurt.hackel@oracle.com>
Tao Ma <tao.ma@oracle.com>
Sunil Mushran <sunil.mushran@oracle.com>
Manish Singh <manish.singh@oracle.com>
Tiger Yang <tiger.yang@oracle.com>
Caveats
=======
Features which OCFS2 does not support yet:
- Directory change notification (F_NOTIFY)
- Distributed Caching (F_SETLEASE/F_GETLEASE/break_lease)
Mount options
=============
OCFS2 supports the following mount options:
(*) == default
barrier=1 This enables/disables barriers. barrier=0 disables it,
barrier=1 enables it.
errors=remount-ro(*) Remount the filesystem read-only on an error.
errors=panic Panic and halt the machine if an error occurs.
intr (*) Allow signals to interrupt cluster operations.
nointr Do not allow signals to interrupt cluster
operations.
noatime Do not update access time.
relatime(*) Update atime if the previous atime is older than
mtime or ctime
strictatime Always update atime, but the minimum update interval
is specified by atime_quantum.
atime_quantum=60(*) OCFS2 will not update atime unless this number
of seconds has passed since the last update.
Set to zero to always update atime. This option needs
to be used together with strictatime.
data=ordered (*) All data are forced directly out to the main file
system prior to its metadata being committed to the
journal.
data=writeback Data ordering is not preserved, data may be written
into the main file system after its metadata has been
committed to the journal.
preferred_slot=0(*) During mount, try to use this filesystem slot first. If
it is in use by another node, the first empty one found
will be chosen. Invalid values will be ignored.
commit=nrsec (*) Ocfs2 can be told to sync all its data and metadata
every 'nrsec' seconds. The default value is 5 seconds.
This means that if you lose your power, you will lose
as much as the latest 5 seconds of work (your
filesystem will not be damaged though, thanks to the
journaling). This default value (or any low value)
will hurt performance, but it's good for data-safety.
Setting it to 0 will have the same effect as leaving
it at the default (5 seconds).
Setting it to very large values will improve
performance.
localalloc=8(*) Allows custom localalloc size in MB. If the value is too
large, the fs will silently revert it to the default.
localflocks This disables cluster aware flock.
inode64 Indicates that Ocfs2 is allowed to create inodes at
any location in the filesystem, including those which
will result in inode numbers occupying more than 32
bits of significance.
user_xattr (*) Enables Extended User Attributes.
nouser_xattr Disables Extended User Attributes.
acl Enables POSIX Access Control Lists support.
noacl (*) Disables POSIX Access Control Lists support.
resv_level=2 (*) Set how aggressive allocation reservations will be.
Valid values range from 0 (reservations off) to 8
(maximum space for reservations).
dir_resv_level= (*) By default, directory reservations will scale with file
reservations - users should rarely need to change this
value. If allocation reservations are turned off, this
option will have no effect.
coherency=full (*) Disallow concurrent O_DIRECT writes; the cluster inode
lock will be taken to force other nodes to drop their
caches, so full cluster coherency is guaranteed even
for O_DIRECT writes.
coherency=buffered Allow concurrent O_DIRECT writes without taking the EX
lock among nodes, which gains high performance at the
risk of getting stale data on other nodes.
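As a minimal, hedged example (the device /dev/sdb1 and mount point /mnt/ocfs2
are placeholders, and the cluster stack is assumed to be configured already),
several of the options above can be combined on one mount line:
mount -t ocfs2 -o noatime,localalloc=10 /dev/sdb1 /mnt/ocfs2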
View file
@ -0,0 +1,106 @@
Optimized MPEG Filesystem (OMFS)
Overview
========
OMFS is a filesystem created by SonicBlue for use in the ReplayTV DVR
and Rio Karma MP3 player. The filesystem is extent-based, utilizing
block sizes from 2k to 8k, with hash-based directories. This
filesystem driver may be used to read and write disks from these
devices.
Note, it is not recommended that this FS be used in place of a general
filesystem for your own streaming media device. Native Linux filesystems
will likely perform better.
More information is available at:
http://linux-karma.sf.net/
Various utilities, including mkomfs and omfsck, are included with
omfsprogs, available at:
http://bobcopeland.com/karma/
Instructions are included in its README.
Options
=======
OMFS supports the following mount-time options:
uid=n - make all files owned by specified user
gid=n - make all files owned by specified group
umask=xxx - set permission umask to xxx
fmask=xxx - set umask to xxx for files
dmask=xxx - set umask to xxx for directories
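For example, a hedged sketch (the device name and mount point are placeholders)
that makes all files appear owned by uid 1000 and masks off group and other
permissions:
mount -t omfs -o uid=1000,umask=077 /dev/sdb1 /mnt/karma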
Disk format
===========
OMFS discriminates between "sysblocks" and normal data blocks. The sysblock
group consists of super block information, file metadata, directory structures,
and extents. Each sysblock has a header containing CRCs of the entire
sysblock, and may be mirrored in successive blocks on the disk. A sysblock may
have a smaller size than a data block, but since they are both addressed by the
same 64-bit block number, any remaining space in the smaller sysblock is
unused.
Sysblock header information:
struct omfs_header {
__be64 h_self; /* FS block where this is located */
__be32 h_body_size; /* size of useful data after header */
__be16 h_crc; /* crc-ccitt of body_size bytes */
char h_fill1[2];
u8 h_version; /* version, always 1 */
char h_type; /* OMFS_INODE_X */
u8 h_magic; /* OMFS_IMAGIC */
u8 h_check_xor; /* XOR of header bytes before this */
__be32 h_fill2;
};
Files and directories are both represented by omfs_inode:
struct omfs_inode {
struct omfs_header i_head; /* header */
__be64 i_parent; /* parent containing this inode */
__be64 i_sibling; /* next inode in hash bucket */
__be64 i_ctime; /* ctime, in milliseconds */
char i_fill1[35];
char i_type; /* OMFS_[DIR,FILE] */
__be32 i_fill2;
char i_fill3[64];
char i_name[OMFS_NAMELEN]; /* filename */
__be64 i_size; /* size of file, in bytes */
};
Directories in OMFS are implemented as a large hash table. Filenames are
hashed then prepended into the bucket list beginning at OMFS_DIR_START.
Lookup requires hashing the filename, then seeking across i_sibling pointers
until a match is found on i_name. Empty buckets are represented by block
pointers with all-1s (~0).
A file is an omfs_inode structure followed by an extent table beginning at
OMFS_EXTENT_START:
struct omfs_extent_entry {
__be64 e_cluster; /* start location of a set of blocks */
__be64 e_blocks; /* number of blocks after e_cluster */
};
struct omfs_extent {
__be64 e_next; /* next extent table location */
__be32 e_extent_count; /* total # extents in this table */
__be32 e_fill;
struct omfs_extent_entry e_entry; /* start of extent entries */
};
Each extent holds the block offset followed by number of blocks allocated to
the extent. The final extent in each table is a terminator with e_cluster
being ~0 and e_blocks being ones'-complement of the total number of blocks
in the table.
If this table overflows, a continuation inode is written and pointed to by
e_next. These have a header but lack the rest of the inode structure.
View file
@ -0,0 +1,198 @@
Written by: Neil Brown <neilb@suse.de>
Overlay Filesystem
==================
This document describes a prototype for a new approach to providing
overlay-filesystem functionality in Linux (sometimes referred to as
union-filesystems). An overlay-filesystem tries to present a
filesystem which is the result of overlaying one filesystem on top
of the other.
The result will inevitably fail to look exactly like a normal
filesystem for various technical reasons. The expectation is that
many use cases will be able to ignore these differences.
This approach is 'hybrid' because the objects that appear in the
filesystem do not all appear to belong to that filesystem. In many
cases an object accessed in the union will be indistinguishable
from accessing the corresponding object from the original filesystem.
This is most obvious from the 'st_dev' field returned by stat(2).
While directories will report an st_dev from the overlay-filesystem,
all non-directory objects will report an st_dev from the lower or
upper filesystem that is providing the object. Similarly st_ino will
only be unique when combined with st_dev, and both of these can change
over the lifetime of a non-directory object. Many applications and
tools ignore these values and will not be affected.
Upper and Lower
---------------
An overlay filesystem combines two filesystems - an 'upper' filesystem
and a 'lower' filesystem. When a name exists in both filesystems, the
object in the 'upper' filesystem is visible while the object in the
'lower' filesystem is either hidden or, in the case of directories,
merged with the 'upper' object.
It would be more correct to refer to an upper and lower 'directory
tree' rather than 'filesystem' as it is quite possible for both
directory trees to be in the same filesystem and there is no
requirement that the root of a filesystem be given for either upper or
lower.
The lower filesystem can be any filesystem supported by Linux and does
not need to be writable. The lower filesystem can even be another
overlayfs. The upper filesystem will normally be writable and if it
is it must support the creation of trusted.* extended attributes, and
must provide valid d_type in readdir responses, so NFS is not suitable.
A read-only overlay of two read-only filesystems may use any
filesystem type.
Directories
-----------
Overlaying mainly involves directories. If a given name appears in both
upper and lower filesystems and refers to a non-directory in either,
then the lower object is hidden - the name refers only to the upper
object.
Where both upper and lower objects are directories, a merged directory
is formed.
At mount time, the two directories given as mount options "lowerdir" and
"upperdir" are combined into a merged directory:
mount -t overlay overlay -olowerdir=/lower,upperdir=/upper,\
workdir=/work /merged
The "workdir" needs to be an empty directory on the same filesystem
as upperdir.
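Putting the pieces together, a minimal sketch (all directory names are
placeholders; upperdir and workdir must live on the same filesystem and
workdir must start out empty) might be:
mkdir -p /lower /upper /work /merged
mount -t overlay overlay -olowerdir=/lower,upperdir=/upper,\
workdir=/work /merged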
Then whenever a lookup is requested in such a merged directory, the
lookup is performed in each actual directory and the combined result
is cached in the dentry belonging to the overlay filesystem. If both
actual lookups find directories, both are stored and a merged
directory is created, otherwise only one is stored: the upper if it
exists, else the lower.
Only the lists of names from directories are merged. Other content
such as metadata and extended attributes are reported for the upper
directory only. These attributes of the lower directory are hidden.
whiteouts and opaque directories
--------------------------------
In order to support rm and rmdir without changing the lower
filesystem, an overlay filesystem needs to record in the upper filesystem
that files have been removed. This is done using whiteouts and opaque
directories (non-directories are always opaque).
A whiteout is created as a character device with 0/0 device number.
When a whiteout is found in the upper level of a merged directory, any
matching name in the lower level is ignored, and the whiteout itself
is also hidden.
A directory is made opaque by setting the xattr "trusted.overlay.opaque"
to "y". Where the upper filesystem contains an opaque directory, any
directory in the lower filesystem with the same name is ignored.
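As a hedged illustration of what these markers look like in the upper
filesystem (paths are placeholders; in normal operation overlayfs creates
them itself rather than the administrator):
# a whiteout for a removed name is a character device with 0/0 numbers
mknod /upper/removed_file c 0 0
# an opaque directory carries the trusted.overlay.opaque xattr set to "y"
setfattr -n trusted.overlay.opaque -v y /upper/some_dir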
readdir
-------
When a 'readdir' request is made on a merged directory, the upper and
lower directories are each read and the name lists merged in the
obvious way (upper is read first, then lower - entries that already
exist are not re-added). This merged name list is cached in the
'struct file' and so remains as long as the file is kept open. If the
directory is opened and read by two processes at the same time, they
will each have separate caches. A seekdir to the start of the
directory (offset 0) followed by a readdir will cause the cache to be
discarded and rebuilt.
This means that changes to the merged directory do not appear while a
directory is being read. This is unlikely to be noticed by many
programs.
seek offsets are assigned sequentially when the directories are read.
Thus if
- read part of a directory
- remember an offset, and close the directory
- re-open the directory some time later
- seek to the remembered offset
there may be little correlation between the old and new locations in
the list of filenames, particularly if anything has changed in the
directory.
Readdir on directories that are not merged is simply handled by the
underlying directory (upper or lower).
Non-directories
---------------
Objects that are not directories (files, symlinks, device-special
files etc.) are presented either from the upper or lower filesystem as
appropriate. When a file in the lower filesystem is accessed in a way
that requires write-access, such as opening for write access, changing
some metadata etc., the file is first copied from the lower filesystem
to the upper filesystem (copy_up). Note that creating a hard-link
also requires copy_up, though of course creation of a symlink does
not.
The copy_up may turn out to be unnecessary, for example if the file is
opened for read-write but the data is not modified.
The copy_up process first makes sure that the containing directory
exists in the upper filesystem - creating it and any parents as
necessary. It then creates the object with the same metadata (owner,
mode, mtime, symlink-target etc.) and then if the object is a file, the
data is copied from the lower to the upper filesystem. Finally any
extended attributes are copied up.
Once the copy_up is complete, the overlay filesystem simply
provides direct access to the newly created file in the upper
filesystem - future operations on the file are barely noticed by the
overlay filesystem (though an operation on the name of the file such as
rename or unlink will of course be noticed and handled).
Non-standard behavior
---------------------
The copy_up operation essentially creates a new, identical file and
moves it over to the old name. The new file may be on a different
filesystem, so both st_dev and st_ino of the file may change.
Any open files referring to this inode will access the old data and
metadata. Similarly any file locks obtained before copy_up will not
apply to the copied up file.
On a file opened with O_RDONLY, fchmod(2), fchown(2), futimesat(2) and
fsetxattr(2) will fail with EROFS.
If a file with multiple hard links is copied up, then this will
"break" the link. Changes will not be propagated to other names
referring to the same inode.
Symlinks in /proc/PID/ and /proc/PID/fd which point to a non-directory
object in overlayfs will not contain valid absolute paths, only
relative paths leading up to the filesystem's root. This will be
fixed in the future.
Some operations are not atomic, for example a crash during copy_up or
rename will leave the filesystem in an inconsistent state. This will
be addressed in the future.
Changes to underlying filesystems
---------------------------------
Offline changes, when the overlay is not mounted, are allowed to either
the upper or the lower trees.
Changes to the underlying filesystems while part of a mounted overlay
filesystem are not allowed. If the underlying filesystem is changed,
the behavior of the overlay is undefined, though it will not result in
a crash or deadlock.
View file
@ -0,0 +1,382 @@
Path walking and name lookup locking
====================================
Path resolution is the process of finding a dentry corresponding to a path
name string by performing a path walk. Typically, for every open(), stat()
etc., the path name
will be resolved. Paths are resolved by walking the namespace tree, starting
with the first component of the pathname (eg. root or cwd) with a known dentry,
then finding the child of that dentry, which is named the next component in the
path string. Then repeating the lookup from the child dentry and finding its
child with the next element, and so on.
Since it is a frequent operation for workloads like multiuser environments and
web servers, it is important to optimize this code.
Path walking synchronisation history:
Prior to 2.5.10, dcache_lock was acquired in d_lookup (dcache hash lookup) and
thus in every component during path look-up. Since 2.5.10 onwards, fast-walk
algorithm changed this by holding the dcache_lock at the beginning and walking
as many cached path component dentries as possible. This significantly
decreases the number of acquisitions of dcache_lock. However, it also increases
the lock hold time significantly and affects performance on large SMP machines.
Since 2.5.62 kernel, dcache has been using a new locking model that uses RCU to
make dcache look-up lock-free.
All the above algorithms required taking a lock and a reference count on the
dentry that was looked up, so that it may be used as the basis for walking the
next path element. This is inefficient and unscalable. It is inefficient
because the locks and atomic operations required for every dentry element
slow things down. It is not scalable because many parallel applications that
are path-walk intensive tend to do path lookups starting from a common dentry
(usually, the root "/" or current working directory). So contention on these
common path elements causes lock and cacheline queueing.
Since 2.6.38, RCU is used to make a significant part of the entire path walk
(including dcache look-up) completely "store-free" (so, no locks, atomics, or
even stores into cachelines of common dentries). This is known as "rcu-walk"
path walking.
Path walking overview
=====================
A name string specifies a start (root directory, cwd, fd-relative) and a
sequence of elements (directory entry names), which together refer to a path in
the namespace. A path is represented as a (dentry, vfsmount) tuple. The name
elements are sub-strings, separated by '/'.
Name lookups will want to find a particular path that a name string refers to
(usually the final element, or parent of final element). This is done by taking
the path given by the name's starting point (which we know in advance -- eg.
current->fs->cwd or current->fs->root) as the first parent of the lookup. Then
iteratively for each subsequent name element, look up the child of the current
parent with the given name and if it is not the desired entry, make it the
parent for the next lookup.
A parent, of course, must be a directory, and we must have appropriate
permissions on the parent inode to be able to walk into it.
Turning the child into a parent for the next lookup requires more checks and
procedures. Symlinks essentially substitute the symlink name for the target
name in the name string, and require some recursive path walking. Mount points
must be followed into (thus changing the vfsmount that subsequent path elements
refer to), switching from the mount point path to the root of the particular
mounted vfsmount. These behaviours are variously modified depending on the
exact path walking flags.
Path walking then must, broadly, do several particular things:
- find the start point of the walk;
- perform permissions and validity checks on inodes;
- perform dcache hash name lookups on (parent, name element) tuples;
- traverse mount points;
- traverse symlinks;
- lookup and create missing parts of the path on demand.
Safe store-free look-up of dcache hash table
============================================
Dcache name lookup
------------------
In order to lookup a dcache (parent, name) tuple, we take a hash on the tuple
and use that to select a bucket in the dcache-hash table. The list of entries
in that bucket is then walked, and we do a full comparison of each entry
against our (parent, name) tuple.
The hash lists are RCU protected, so list walking is not serialised with
concurrent updates (insertion, deletion from the hash). This is a standard RCU
list application with the exception of renames, which will be covered below.
Parent and name members of a dentry, as well as its membership in the dcache
hash, and its inode are protected by the per-dentry d_lock spinlock. A
reference is taken on the dentry (while the fields are verified under d_lock),
and this stabilises its d_inode pointer and actual inode. This gives a stable
point to perform the next step of our path walk against.
These members are also protected by d_seq seqlock, although this offers
read-only protection and no durability of results, so care must be taken when
using d_seq for synchronisation (see seqcount based lookups, below).
Renames
-------
Back to the rename case. In usual RCU protected lists, the only operations that
will happen to an object are insertion and then, eventually, removal from the
list. The object will not be reused until an RCU grace period is complete.
This ensures the RCU list traversal primitives can run over the object without
problems (see RCU documentation for how this works).
However when a dentry is renamed, its hash value can change, requiring it to be
moved to a new hash list. Allocating and inserting a new alias would be
expensive and also problematic for directory dentries. Latency would be far too
high to wait for a grace period after removing the dentry and before inserting
it in the new hash bucket. So what is done is to insert the dentry into the
new list immediately.
However, when the dentry's list pointers are updated to point to objects in the
new list before waiting for a grace period, this can result in a concurrent RCU
lookup of the old list veering off into the new (incorrect) list and missing
the remaining dentries on the list.
There is no fundamental problem with walking down the wrong list, because the
dentry comparisons will never match. However it is fatal to miss a matching
dentry. So a seqlock is used to detect when a rename has occurred, and so the
lookup can be retried.
1 2 3
+---+ +---+ +---+
hlist-->| N-+->| N-+->| N-+->
head <--+-P |<-+-P |<-+-P |
+---+ +---+ +---+
Rename of dentry 2 may require it deleted from the above list, and inserted
into a new list. Deleting 2 gives the following list.
1 3
+---+ +---+ (don't worry, the longer pointers do not
hlist-->| N-+-------->| N-+-> impose a measurable performance overhead
head <--+-P |<--------+-P | on modern CPUs)
+---+ +---+
^ 2 ^
| +---+ |
| | N-+----+
+----+-P |
+---+
This is a standard RCU-list deletion, which leaves the deleted object's
pointers intact, so a concurrent list walker that is currently looking at
object 2 will correctly continue to object 3 when it is time to traverse the
next object.
However, when inserting object 2 onto a new list, we end up with this:
1 3
+---+ +---+
hlist-->| N-+-------->| N-+->
head <--+-P |<--------+-P |
+---+ +---+
2
+---+
| N-+---->
<----+-P |
+---+
Because we didn't wait for a grace period, there may be a concurrent lookup
still at 2. Now when it follows 2's 'next' pointer, it will walk off into
another list without ever having checked object 3.
A related, but distinctly different, issue is that of rename atomicity versus
lookup operations. If a file is renamed from 'A' to 'B', a lookup must only
find either 'A' or 'B'. So if a lookup of 'A' returns NULL, a subsequent lookup
of 'B' must succeed (note the reverse is not true).
Between deleting the dentry from the old hash list, and inserting it on the new
hash list, a lookup may find neither 'A' nor 'B' matching the dentry. The same
rename seqlock is also used to cover this race in much the same way, by
retrying a negative lookup result if a rename was in progress.
Seqcount based lookups
----------------------
In refcount based dcache lookups, d_lock is used to serialise access to
the dentry, stabilising it while comparing its name and parent and then
taking a reference count (the reference count then gives a stable place to
start the next part of the path walk from).
As explained above, we would like to do path walking without taking locks or
reference counts on intermediate dentries along the path. To do this, a per
dentry seqlock (d_seq) is used to take a "coherent snapshot" of what the dentry
looks like (its name, parent, and inode). That snapshot is then used to start
the next part of the path walk. When loading the coherent snapshot under d_seq,
care must be taken to load the members up-front, and use those pointers rather
than reloading from the dentry later on (otherwise we'd have interesting things
like d_inode going NULL underneath us, if the name was unlinked).
Also important is to avoid performing any destructive operations (pretty much:
no non-atomic stores to shared data), and to recheck the seqcount when we are
"done" with the operation. Retry or abort if the seqcount does not match.
Avoiding destructive or changing operations means we can easily unwind from
failure.
What this means is that a caller, provided they are holding RCU lock to
protect the dentry object from disappearing, can perform a seqcount based
lookup which does not increment the refcount on the dentry or write to
it in any way. This returned dentry can be used for subsequent operations,
provided that d_seq is rechecked after that operation is complete.
Inodes are also rcu freed, so the seqcount lookup dentry's inode may also be
queried for permissions.
With these two parts of the puzzle, we can do path lookups without taking
locks or refcounts on dentry elements.
RCU-walk path walking design
============================
Path walking code now has two distinct modes, ref-walk and rcu-walk. ref-walk
is the traditional[*] way of performing dcache lookups using d_lock to
serialise concurrent modifications to the dentry and take a reference count on
it. ref-walk is simple and obvious, and may sleep, take locks, etc while path
walking is operating on each dentry. rcu-walk uses seqcount based dentry
lookups, and can perform lookup of intermediate elements without any stores to
shared data in the dentry or inode. rcu-walk can not be applied to all cases,
eg. if the filesystem must sleep or perform non trivial operations, rcu-walk
must be switched to ref-walk mode.
[*] RCU is still used for the dentry hash lookup in ref-walk, but not the full
path walk.
Where ref-walk uses a stable, refcounted ``parent'' to walk the remaining
path string, rcu-walk uses a d_seq protected snapshot. When looking up a
child of this parent snapshot, we open d_seq critical section on the child
before closing d_seq critical section on the parent. This gives an interlocking
ladder of snapshots to walk down.
proc 101
/----------------\
/ comm: "vi" \
/ fs.root: dentry0 \
\ fs.cwd: dentry2 /
\ /
\----------------/
So when vi wants to open("/home/npiggin/test.c", O_RDWR), then it will
start from current->fs->root, which is a pinned dentry. Alternatively,
"./test.c" would start from cwd; both names refer to the same path in
the context of proc101.
dentry 0
+---------------------+ rcu-walk begins here, we note d_seq, check the
| name: "/" | inode's permission, and then look up the next
| inode: 10 | path element which is "home"...
| children:"home", ...|
+---------------------+
|
dentry 1 V
+---------------------+ ... which brings us here. We find dentry1 via
| name: "home" | hash lookup, then note d_seq and compare name
| inode: 678 | string and parent pointer. When we have a match,
| children:"npiggin" | we now recheck the d_seq of dentry0. Then we
+---------------------+ check inode and look up the next element.
|
dentry2 V
+---------------------+ Note: if dentry0 is now modified, lookup is
| name: "npiggin" | not necessarily invalid, so we need only keep a
| inode: 543 | parent for d_seq verification, and grandparents
| children:"a.c", ... | can be forgotten.
+---------------------+
|
dentry3 V
+---------------------+ At this point we have our destination dentry.
| name: "a.c" | We now take its d_lock, verify d_seq of this
| inode: 14221 | dentry. If that checks out, we can increment
| children:NULL | its refcount because we're holding d_lock.
+---------------------+
Taking a refcount on a dentry from rcu-walk mode, by taking its d_lock,
re-checking its d_seq, and then incrementing its refcount is called
"dropping rcu" or dropping from rcu-walk into ref-walk mode.
It is, in some sense, a bit of a house of cards. If the seqcount check of the
parent snapshot fails, the house comes down, because we had closed the d_seq
section on the grandparent, so we have nothing left to stand on. In that case,
the path walk must be fully restarted (which we do in ref-walk mode, to avoid
live locks). It is costly to have a full restart, but fortunately they are
quite rare.
When we reach a point where sleeping is required, or a filesystem callout
requires ref-walk, then instead of restarting the walk, we attempt to drop rcu
at the last known good dentry we have. Avoiding a full restart in ref-walk in
these cases is fundamental for performance and scalability because blocking
operations such as creates and unlinks are not uncommon.
The detailed design for rcu-walk is like this:
* LOOKUP_RCU is set in nd->flags, which distinguishes rcu-walk from ref-walk.
* Take the RCU lock for the entire path walk, starting with the acquiring
of the starting path (eg. root/cwd/fd-path). So now dentry refcounts are
not required for dentry persistence.
* synchronize_rcu is called when unregistering a filesystem, so we can
access d_ops and i_ops during rcu-walk.
* Similarly take the vfsmount lock for the entire path walk. So now mnt
refcounts are not required for persistence. Also we are free to perform mount
lookups, and to assume dentry mount points and mount roots are stable up and
down the path.
* Have a per-dentry seqlock to protect the dentry name, parent, and inode,
so we can load this tuple atomically, and also check whether any of its
members have changed.
* Dentry lookups (based on parent, candidate string tuple) recheck the parent
sequence after the child is found in case anything changed in the parent
during the path walk.
* inode is also RCU protected so we can load d_inode and use the inode for
limited things.
* i_mode, i_uid, i_gid can be tested for exec permissions during path walk.
* i_op can be loaded.
* When the destination dentry is reached, drop rcu there (ie. take d_lock,
verify d_seq, increment refcount).
* If seqlock verification fails anywhere along the path, do a full restart
of the path lookup in ref-walk mode. -ECHILD tends to be used (for want of
a better errno) to signal an rcu-walk failure.
The cases where rcu-walk cannot continue are:
* NULL dentry (ie. any uncached path element)
* Following links
It may be possible eventually to make following links rcu-walk aware.
Uncached path elements will always require dropping to ref-walk mode, at the
very least because i_mutex needs to be grabbed, and objects allocated.
Final note:
"store-free" path walking is not strictly store free. We take vfsmount lock
and refcounts (both of which can be made per-cpu), and we also store to the
stack (which is essentially CPU-local), and we also have to take locks and
refcount on final dentry.
The point is that shared data, where practically possible, is not locked
or stored into. The result is massive improvements in performance and
scalability of path resolution.
Interesting statistics
======================
The following table gives rcu lookup statistics for a few simple workloads
(2s12c24t Westmere, debian non-graphical system). Ungraceful are attempts to
drop rcu that fail due to d_seq failure and require the entire path lookup
again. Other cases are successful rcu-drops that are required before the final
element, nodentry for missing dentry, revalidate for filesystem revalidate
routine requiring rcu drop, permission for permission check requiring drop,
and link for symlink traversal requiring drop.
             rcu-lookups  restart  nodentry         link           revalidate  permission
bootup             47121        0      4624         1010                10283        7852
dbench          25386793        0   6778659(26.7%)    55                  549        1156
kbuild           2696672       10     64442(2.3%) 108764(4.0%)              1        1590
git diff           39605        0        28            2                    0         106
vfstest         24185492     4945    708725(2.9%) 1076136(4.4%)             0        2651
What this shows is that failed rcu-walk lookups, ie. ones that are restarted
entirely with ref-walk, are quite rare. Even the "vfstest" case, which
specifically has concurrent renames/mkdir/rmdir/creat/unlink/etc. to exercise
such races, is not showing a huge amount of restarts.
Dropping from rcu-walk to ref-walk means that we have encountered a dentry where
the reference count needs to be taken for some reason. This is either because
we have reached the target of the path walk, or because we have encountered a
condition that can't be resolved in rcu-walk mode. Ideally, we drop rcu-walk
only when we have reached the target dentry, so the other statistics show where
this does not happen.
Note that a graceful drop from rcu-walk mode due to something such as the
dentry not existing (which can be common) is not necessarily a failure of
rcu-walk scheme, because some elements of the path may have been walked in
rcu-walk mode. The further we get from common path elements (such as cwd or
root), the less contended the dentry is likely to be. The closer we are to
common path elements, the more likely they will exist in dentry cache.
Papers and other documentation on dcache locking
================================================
1. Scaling dcache with RCU (http://linuxjournal.com/article.php?sid=7124).
2. http://lse.sourceforge.net/locking/dcache/dcache.html
View file
@ -0,0 +1,72 @@
POHMELFS: Parallel Optimized Host Message Exchange Layered File System.
Evgeniy Polyakov <zbr@ioremap.net>
Homepage: http://www.ioremap.net/projects/pohmelfs
POHMELFS first began as a network filesystem with coherent local data and
metadata caches but is now evolving into a parallel distributed filesystem.
Main features of this FS include:
* Locally coherent cache for data and metadata with (potentially) byte-range locks.
Since all Linux filesystems lock the whole inode during writing, the algorithm
is very simple and does not use byte-ranges, although they are sent in
locking messages.
* Completely async processing of all events except creation of hard and symbolic
links, and rename events.
Object creation and data reading and writing are processed asynchronously.
* Flexible object architecture optimized for network processing.
Ability to create long paths to objects and remove arbitrarily huge
directories with a single network command.
(like removing the whole kernel tree via a single network command).
* Very high performance.
* Fast and scalable multithreaded userspace server. Being in userspace it works
with any underlying filesystem and is still much faster than the async in-kernel NFS server.
* Client is able to switch between different servers (if one goes down, the client
automatically reconnects to the second, and so on).
* Transactions support. Full failover for all operations.
Resending transactions to different servers on timeout or error.
* Read request (data read, directory listing, lookup requests) balancing between multiple servers.
* Write requests are replicated to multiple servers and completed only when all of them are acked.
* Ability to add and/or remove servers from the working set at run-time.
* Strong authentication and possible data encryption in the network channel.
* Extended attributes support.
POHMELFS is based on transactions, which are potentially long-standing objects that live
in the client's memory. Each transaction contains all the information needed to process a given
command (or set of commands, which is frequently used during data writing: single transactions
can contain creation and data writing commands). Transactions are committed by all the servers
to which they are sent and, in case of failures, are eventually resent or dropped with an error.
For example, reading will return an error if no servers are available.
POHMELFS uses an asynchronous approach to data processing. Courtesy of transactions, it is
possible to detach replies from requests and, if the command requires data to be received, the
caller sleeps waiting for it. Thus, it is possible to issue multiple read commands to different
servers and async threads will pick up replies in parallel, find appropriate transactions in the
system and put the data where it belongs (like the page or inode cache).
The main feature of POHMELFS is its writeback data and metadata cache.
Only a few non-performance critical operations use the write-through cache and
are synchronous: hard and symbolic link creation, and object rename. Creation,
removal of objects and data writing are asynchronous and are sent to
the server during system writeback. Only one writer at a time is allowed for any
given inode, which is guarded by an appropriate locking protocol.
Because of this feature, POHMELFS is extremely fast at metadata intensive
workloads and can fully utilize the bandwidth to the servers when doing bulk
data transfers.
POHMELFS clients operate with a working set of servers and are capable of balancing read-only
operations (like lookups or directory listings) between them according to IO priorities.
Administrators can add or remove servers from the set at run-time via special commands (described
in Documentation/filesystems/pohmelfs/info.txt file). Writes are replicated to all servers, which
are connected with write permission turned on. IO priority and permissions can be changed in
run-time.
POHMELFS is capable of full data channel encryption and/or strong crypto hashing.
One can select any kernel supported cipher, encryption mode, hash type and operation mode
(hmac or digest). It is also possible to use both or neither (default). Crypto configuration
is checked during mount time and, if the server does not support it, appropriate capabilities
will be disabled or mount will fail (if 'crypto_fail_unsupported' mount option is specified).
Crypto performance heavily depends on the number of crypto threads, which asynchronously perform
crypto operations and send the resulting data to server or submit it up the stack. This number
can be controlled via a mount option.
View file
@ -0,0 +1,99 @@
POHMELFS usage information.
Mount options.
All but the index, the number of crypto threads and the maximum IO size can be changed via remount.
idx=%u
Each mountpoint is associated with a special index via this option.
Administrator can add or remove servers from the given index, so all mounts
which were attached to it are updated.
By default it is 0.
trans_scan_timeout=%u
This timeout, expressed in milliseconds, specifies how often to scan transaction
trees looking for stale requests, which have to be resent or, if the number of
retries exceeds the specified limit, dropped with an error.
Default is 5 seconds.
drop_scan_timeout=%u
Internal timeout, expressed in milliseconds, which specifies how frequently
inodes marked to be dropped are freed. It also specifies how frequently
the system checks whether servers have to be added to or removed from the current working set.
Default is 1 second.
wait_on_page_timeout=%u
Number of milliseconds to wait for reply from remote server for data reading command.
If this timeout is exceeded, reading returns an error.
Default is 5 seconds.
trans_retries=%u
This is the number of times that a transaction will be resent to a server that did
not answer for the last @trans_scan_timeout milliseconds.
When the number of resends exceeds this limit, the transaction is completed with an error.
Default is 5 resends.
crypto_thread_num=%u
Number of crypto processing threads. Threads are used both for RX and TX traffic.
Default is 2, or no threads if crypto operations are not supported.
trans_max_pages=%u
Maximum number of pages in a single transaction. This parameter also controls
the number of pages allocated for crypto processing (each crypto thread has a
pool of pages, the number of which is equal to 'trans_max_pages').
Default is 100 pages.
crypto_fail_unsupported
If specified, mount will fail if the server does not support requested crypto operations.
By default mount will disable non-matching crypto operations.
mcache_timeout=%u
Maximum number of milliseconds to wait for the mcache objects to be processed.
Mcache includes locks (a given lock should be granted by the server) and
attributes (they should be fully received in the given timeframe).
Default is 5 seconds.
Usage examples.
Add server server1.net:1025 into the working set with index $idx
with appropriate hash algorithm and key file and cipher algorithm, mode and key file:
$cfg A add -a server1.net -p 1025 -i $idx -K $hash_key -k $cipher_key
Mount filesystem with given index $idx to /mnt mountpoint.
Client will connect to all servers specified in the working set via previous command:
mount -t pohmel -o idx=$idx q /mnt
Change permissions to read-only (-I 1 option, '-I 2' - write-only, 3 - rw):
$cfg A modify -a server1.net -p 1025 -i $idx -I 1
Change IO priority to 123 (node with the highest priority gets read requests).
$cfg A modify -a server1.net -p 1025 -i $idx -P 123
One can check the current status of all connections in the mountstats file:
# cat /proc/$PID/mountstats
...
device none mounted on /mnt with fstype pohmel
idx addr(:port) socket_type protocol active priority permissions
0 server1.net:1026 1 6 1 250 1
0 server2.net:1025 1 6 1 123 3
Server installation.
Creating a server which listens on port 1025 and address 0.0.0.0.
The working root directory (note that the server chroots there, so you have to have appropriate permissions)
is set to /mnt; the server will negotiate hash/cipher with the client, in case the client requested it, using the
appropriate key files.
Number of working threads is set to 10.
# ./fserver -a 0.0.0.0 -p 1025 -r /mnt -w 10 -K hash_key -k cipher_key
-A 6 - listen on ipv6 address. Default: Disabled.
-r root - path to root directory. Default: /tmp.
-a addr - listen address. Default: 0.0.0.0.
-p port - listen port. Default: 1025.
-w workers - number of workers per connected client. Default: 1.
-K file - hash key size. Default: none.
-k file - cipher key size. Default: none.
-h - this help.
The number of worker threads specifies how many workers will be created for each client.
Bulk single-client transfers are usually better handled with a smaller number (like 1-3).
View file
@ -0,0 +1,227 @@
POHMELFS network protocol.
Basic structure used in network communication is following command:
struct netfs_cmd
{
__u16 cmd; /* Command number */
__u16 csize; /* Attached crypto information size */
__u16 cpad; /* Attached padding size */
__u16 ext; /* External flags */
__u32 size; /* Size of the attached data */
__u32 trans; /* Transaction id */
__u64 id; /* Object ID to operate on. Used for feedback.*/
__u64 start; /* Start of the object. */
__u64 iv; /* IV sequence */
__u8 data[0];
};
Commands can be embedded into transaction command (which in turn has own command),
so one can extend protocol as needed without breaking backward compatibility as long
as old commands are supported. All string lengths include the trailing 0 byte.
All commands are transferred over the network in big-endian. Native CPU endianness is used on the end peers.
@cmd - command number, which specifies command to be processed. Following
commands are used currently:
NETFS_READDIR = 1, /* Read directory for given inode number */
NETFS_READ_PAGE, /* Read data page from the server */
NETFS_WRITE_PAGE, /* Write data page to the server */
NETFS_CREATE, /* Create directory entry */
NETFS_REMOVE, /* Remove directory entry */
NETFS_LOOKUP, /* Lookup single object */
NETFS_LINK, /* Create a link */
NETFS_TRANS, /* Transaction */
NETFS_OPEN, /* Open intent */
NETFS_INODE_INFO, /* Metadata cache coherency synchronization message */
NETFS_PAGE_CACHE, /* Page cache invalidation message */
NETFS_READ_PAGES, /* Read multiple contiguous pages in one go */
NETFS_RENAME, /* Rename object */
NETFS_CAPABILITIES, /* Capabilities of the client, for example supported crypto */
NETFS_LOCK, /* Distributed lock message */
NETFS_XATTR_SET, /* Set extended attribute */
NETFS_XATTR_GET, /* Get extended attribute */
@ext - external flags. Used by different commands to specify some extra arguments
like partial size of the embedded objects or creation flags.
@size - size of the attached data. For NETFS_READ_PAGE and NETFS_READ_PAGES no data is attached,
but size of the requested data is incorporated here. It does not include size of the command
header (struct netfs_cmd) itself.
@id - id of the object this command operates on. Each command can use it for own purpose.
@start - start of the object this command operates on. Each command can use it for own purpose.
@csize, @cpad - size and padding size of the (attached if needed) crypto information.
Command specifications.
@NETFS_READDIR
This command is used to sync content of the remote dir to the client.
@ext - length of the path to object.
@size - the same.
@id - local inode number of the directory to read.
@start - zero.
@NETFS_READ_PAGE
This command is used to read data from remote server.
Data size does not exceed local page cache size.
@id - inode number.
@start - first byte offset.
@size - number of bytes to read plus length of the path to object.
@ext - object path length.
@NETFS_CREATE
Used to create object.
It does not require that all directories on top of the object were
already created; it will create them automatically. Each object has an
associated @netfs_path_entry data structure, which contains the creation
mode (permissions and type) and the length of the name, as well as the name itself.
@start - 0
@size - size of all the data structures needed to create a path
@id - local inode number
@ext - 0
@NETFS_REMOVE
Used to remove object.
@ext - length of the path to object.
@size - the same.
@id - local inode number.
@start - zero.
@NETFS_LOOKUP
Lookup information about object on server.
@ext - length of the path to object.
@size - the same.
@id - local inode number of the directory to look object in.
@start - local inode number of the object to look at.
@NETFS_LINK
Create hard or symbolic link.
Command is sent as "object_path|target_path".
@size - size of the above string.
@id - parent local inode number.
@start - 1 for symlink, 0 for hardlink.
@ext - size of the "object_path" above.
@NETFS_TRANS
Transaction header.
@size - incorporates all embedded command sizes including their header sizes.
@start - transaction generation number - unique id used to find transaction.
@ext - transaction flags. Unused at the moment.
@id - 0.
@NETFS_OPEN
Open intent for given transaction.
@id - local inode number.
@start - 0.
@size - path length to the object.
@ext - open flags (O_RDWR and so on).
@NETFS_INODE_INFO
Metadata update command.
It is sent to servers when attributes of the object are changed and received
when data or metadata were updated. It operates with the following structure:
struct netfs_inode_info
{
unsigned int mode;
unsigned int nlink;
unsigned int uid;
unsigned int gid;
unsigned int blocksize;
unsigned int padding;
__u64 ino;
__u64 blocks;
__u64 rdev;
__u64 size;
__u64 version;
};
It effectively mirrors stat(2) returned data.
@ext - path length to the object.
@size - the same plus size of the netfs_inode_info structure.
@id - local inode number.
@start - 0.
@NETFS_PAGE_CACHE
Command is only received by clients. It contains information about
page to be marked as not up-to-date.
@id - client's inode number.
@start - last byte of the page to be invalidated. If it is not equal to
current inode size, it will be vmtruncated().
@size - 0
@ext - 0
@NETFS_READ_PAGES
Used to read multiple contiguous pages in one go.
@start - first byte of the contiguous region to read.
@size - consists of two fields: the lower 8 bits represent the page cache shift
used by the client, and another 3 bytes hold the number of pages.
@id - local inode number.
@ext - path length to the object.
@NETFS_RENAME
Used to rename object.
Attached data is formed into following string: "old_path|new_path".
@id - local inode number.
@start - parent inode number.
@size - length of the above string.
@ext - length of the old path part.
@NETFS_CAPABILITIES
Used to exchange crypto capabilities with server.
If crypto capabilities are not supported by the server, then the client will disable them
or fail (if the 'crypto_fail_unsupported' mount option was specified).
@id - superblock index. Used to specify crypto information for group of servers.
@size - size of the attached capabilities structure.
@start - 0.
@size - 0.
@scsize - 0.
@NETFS_LOCK
Used to send lock request/release messages. Although it sends byte range request
and is capable of flushing pages based on that, it is not used, since all Linux
filesystems lock the whole inode.
@id - lock generation number.
@start - start of the locked range.
@size - size of the locked range.
@ext - lock type: read/write. Not actually used. The 15th bit is used to determine
whether it is a lock request (1) or a release (0).
@NETFS_XATTR_SET
@NETFS_XATTR_GET
Used to set/get extended attributes for given inode.
@id - attribute generation number or xattr setting type
@start - size of the attribute (request or attached)
@size - name length, path len and data size for given attribute
@ext - path length for given object
View file
@ -0,0 +1,465 @@
Changes since 2.5.0:
---
[recommended]
New helpers: sb_bread(), sb_getblk(), sb_find_get_block(), set_bh(),
sb_set_blocksize() and sb_min_blocksize().
Use them.
(sb_find_get_block() replaces 2.4's get_hash_table())
---
[recommended]
New methods: ->alloc_inode() and ->destroy_inode().
Remove inode->u.foo_inode_i
Declare
struct foo_inode_info {
/* fs-private stuff */
struct inode vfs_inode;
};
static inline struct foo_inode_info *FOO_I(struct inode *inode)
{
return list_entry(inode, struct foo_inode_info, vfs_inode);
}
Use FOO_I(inode) instead of &inode->u.foo_inode_i;
Add foo_alloc_inode() and foo_destroy_inode() - the former should allocate
foo_inode_info and return the address of ->vfs_inode, the latter should free
FOO_I(inode) (see in-tree filesystems for examples).
Make them ->alloc_inode and ->destroy_inode in your super_operations.
Keep in mind that now you need explicit initialization of private data
typically between calling iget_locked() and unlocking the inode.
At some point that will become mandatory.
---
[mandatory]
Change of file_system_type method (->read_super to ->get_sb)
->read_super() is no more. Ditto for DECLARE_FSTYPE and DECLARE_FSTYPE_DEV.
Turn your foo_read_super() into a function that would return 0 in case of
success and negative number in case of error (-EINVAL unless you have more
informative error value to report). Call it foo_fill_super(). Now declare
int foo_get_sb(struct file_system_type *fs_type,
int flags, const char *dev_name, void *data, struct vfsmount *mnt)
{
return get_sb_bdev(fs_type, flags, dev_name, data, foo_fill_super,
mnt);
}
(or similar with s/bdev/nodev/ or s/bdev/single/, depending on the kind of
filesystem).
Replace DECLARE_FSTYPE... with explicit initializer and have ->get_sb set as
foo_get_sb.
---
[mandatory]
Locking change: ->s_vfs_rename_sem is taken only by cross-directory renames.
Most likely there is no need to change anything, but if you relied on
global exclusion between renames for some internal purpose - you need to
change your internal locking. Otherwise exclusion warranties remain the
same (i.e. parents and victim are locked, etc.).
---
[informational]
Now we have the exclusion between ->lookup() and directory removal (by
->rmdir() and ->rename()). If you used to need that exclusion and do
it by internal locking (most of filesystems couldn't care less) - you
can relax your locking.
---
[mandatory]
->lookup(), ->truncate(), ->create(), ->unlink(), ->mknod(), ->mkdir(),
->rmdir(), ->link(), ->lseek(), ->symlink(), ->rename()
and ->readdir() are called without BKL now. Grab it on entry, drop upon return
- that will guarantee the same locking you used to have. If your method or its
parts do not need BKL - better yet, now you can shift lock_kernel() and
unlock_kernel() so that they would protect exactly what needs to be
protected.
---
[mandatory]
BKL is also moved from around sb operations. BKL should have been shifted into
individual fs sb_op functions. If you don't need it, remove it.
---
[informational]
check for ->link() target not being a directory is done by callers. Feel
free to drop it...
---
[informational]
->link() callers hold ->i_mutex on the object we are linking to. Some of your
problems might be over...
---
[mandatory]
new file_system_type method - kill_sb(superblock). If you are converting
an existing filesystem, set it according to ->fs_flags:
FS_REQUIRES_DEV - kill_block_super
FS_LITTER - kill_litter_super
neither - kill_anon_super
FS_LITTER is gone - just remove it from fs_flags.
---
[mandatory]
FS_SINGLE is gone (actually, that had happened back when ->get_sb()
went in - and hadn't been documented ;-/). Just remove it from fs_flags
(and see ->get_sb() entry for other actions).
---
[mandatory]
->setattr() is called without BKL now. Caller _always_ holds ->i_mutex, so
watch for ->i_mutex-grabbing code that might be used by your ->setattr().
Callers of notify_change() need ->i_mutex now.
---
[recommended]
New super_block field "struct export_operations *s_export_op" for
explicit support for exporting, e.g. via NFS. The structure is fully
documented at its declaration in include/linux/fs.h, and in
Documentation/filesystems/nfs/Exporting.
Briefly it allows for the definition of decode_fh and encode_fh operations
to encode and decode filehandles, and allows the filesystem to use
a standard helper function for decode_fh, and provide file-system specific
support for this helper, particularly get_parent.
It is planned that this will be required for exporting once the code
settles down a bit.
[mandatory]
s_export_op is now required for exporting a filesystem.
isofs, ext2, ext3, reiserfs, fat
can be used as examples of very different filesystems.
---
[mandatory]
iget4() and the read_inode2 callback have been superseded by iget5_locked()
which has the following prototype,
struct inode *iget5_locked(struct super_block *sb, unsigned long ino,
int (*test)(struct inode *, void *),
int (*set)(struct inode *, void *),
void *data);
'test' is an additional function that can be used when the inode
number is not sufficient to identify the actual file object. 'set'
should be a non-blocking function that initializes those parts of a
newly created inode to allow the test function to succeed. 'data' is
passed as an opaque value to both test and set functions.
When the inode has been created by iget5_locked(), it will be returned with the
I_NEW flag set and will still be locked. The filesystem then needs to finalize
the initialization. Once the inode is initialized it must be unlocked by
calling unlock_new_inode().
The filesystem is responsible for setting (and possibly testing) i_ino
when appropriate. There is also a simpler iget_locked function that
just takes the superblock and inode number as arguments and does the
test and set for you.
e.g.
inode = iget_locked(sb, ino);
if (inode->i_state & I_NEW) {
err = read_inode_from_disk(inode);
if (err < 0) {
iget_failed(inode);
return err;
}
unlock_new_inode(inode);
}
Note that if the process of setting up a new inode fails, then iget_failed()
should be called on the inode to render it dead, and an appropriate error
should be passed back to the caller.
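If the inode number alone is not enough to identify the object, a rough sketch
of the test/set pair might look like this (the "foofs" names, including the
FOOFS_I() accessor and the 64-bit object id, are made up):

	struct foofs_iget_args {
		u64 objid;
	};

	static int foofs_test(struct inode *inode, void *p)
	{
		struct foofs_iget_args *args = p;

		return FOOFS_I(inode)->objid == args->objid;
	}

	static int foofs_set(struct inode *inode, void *p)
	{
		struct foofs_iget_args *args = p;

		FOOFS_I(inode)->objid = args->objid;	/* must not block */
		inode->i_ino = args->objid;		/* setting i_ino is up to the fs */
		return 0;
	}

The lookup itself then becomes iget5_locked(sb, hash, foofs_test, foofs_set, &args).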
---
[recommended]
->getattr() finally getting used. See instances in nfs, minix, etc.
---
[mandatory]
->revalidate() is gone. If your filesystem had it - provide ->getattr()
and let it call whatever you had as ->revalidate() + (for symlinks that
had ->revalidate()) add calls in ->follow_link()/->readlink().
---
[mandatory]
->d_parent changes are not protected by BKL anymore. Read access is safe
if at least one of the following is true:
* filesystem has no cross-directory rename()
* we know that parent had been locked (e.g. we are looking at
->d_parent of ->lookup() argument).
* we are called from ->rename().
* the child's ->d_lock is held
Audit your code and add locking if needed. Notice that any place that is
not protected by the conditions above is risky even in the old tree - you
had been relying on BKL and that's prone to screwups. Old tree had quite
a few holes of that kind - unprotected access to ->d_parent leading to
anything from oops to silent memory corruption.
---
[mandatory]
FS_NOMOUNT is gone. If you use it - just set MS_NOUSER in flags
(see rootfs for one kind of solution and bdev/socket/pipe for another).
---
[recommended]
Use bdev_read_only(bdev) instead of is_read_only(kdev). The latter
is still alive, but only because of the mess in drivers/s390/block/dasd.c.
As soon as it gets fixed is_read_only() will die.
---
[mandatory]
->permission() is called without BKL now. Grab it on entry, drop upon
return - that will guarantee the same locking you used to have. If
your method or its parts do not need BKL - better yet, now you can
shift lock_kernel() and unlock_kernel() so that they would protect
exactly what needs to be protected.
---
[mandatory]
->statfs() is now called without BKL held. BKL should have been
shifted into individual fs sb_op functions where it's not clear that
it's safe to remove it. If you don't need it, remove it.
---
[mandatory]
is_read_only() is gone; use bdev_read_only() instead.
---
[mandatory]
destroy_buffers() is gone; use invalidate_bdev().
---
[mandatory]
fsync_dev() is gone; use fsync_bdev(). NOTE: lvm breakage is
deliberate; as soon as struct block_device * is propagated in a reasonable
way by that code fixing will become trivial; until then nothing can be
done.
[mandatory]
block truncation on error exit from ->write_begin, and ->direct_IO
moved from generic methods (block_write_begin, cont_write_begin,
nobh_write_begin, blockdev_direct_IO*) to callers. Take a look at
ext2_write_failed and callers for an example.
[mandatory]
->truncate is gone. The whole truncate sequence needs to be
implemented in ->setattr, which is now mandatory for filesystems
implementing on-disk size changes. Start with a copy of the old inode_setattr
and vmtruncate, and then reorder the vmtruncate + foofs_vmtruncate sequence so
that it zeroes blocks using block_truncate_page or similar helpers, updates the
size and finally performs the on-disk truncation, which should not fail.
inode_change_ok now includes the size checks for ATTR_SIZE and must be called
in the beginning of ->setattr unconditionally.
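A rough sketch of the resulting ->setattr() for a simple blockdev-backed
filesystem (the "foofs" helpers foofs_get_block and foofs_truncate_blocks are
made up; real filesystems differ in how they zero and free blocks):

	static int foofs_setattr(struct dentry *dentry, struct iattr *attr)
	{
		struct inode *inode = dentry->d_inode;
		int error;

		error = inode_change_ok(inode, attr);	/* includes the ATTR_SIZE checks */
		if (error)
			return error;

		if ((attr->ia_valid & ATTR_SIZE) && attr->ia_size != inode->i_size) {
			error = block_truncate_page(inode->i_mapping,
						    attr->ia_size, foofs_get_block);
			if (error)
				return error;
			truncate_setsize(inode, attr->ia_size);		/* size update */
			foofs_truncate_blocks(inode, attr->ia_size);	/* on-disk, must not fail */
		}

		setattr_copy(inode, attr);
		mark_inode_dirty(inode);
		return 0;
	}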
[mandatory]
->clear_inode() and ->delete_inode() are gone; ->evict_inode() should
be used instead. It gets called whenever the inode is evicted, whether it has
remaining links or not. Caller does *not* evict the pagecache or inode-associated
metadata buffers; the method has to use truncate_inode_pages_final() to get rid
of those. Caller makes sure async writeback cannot be running for the inode while
(or after) ->evict_inode() is called.
->drop_inode() returns int now; it's called on final iput() with
inode->i_lock held and it returns true if the filesystem wants the inode to be
dropped. As before, generic_drop_inode() is still the default and it's been
updated appropriately. generic_delete_inode() is also alive and it consists
simply of return 1. Note that all actual eviction work is done by caller after
->drop_inode() returns.
As before, clear_inode() must be called exactly once on each call of
->evict_inode() (as it used to be for each call of ->delete_inode()). Unlike
before, if you are using inode-associated metadata buffers (i.e.
mark_buffer_dirty_inode()), it's your responsibility to call
invalidate_inode_buffers() before clear_inode().
NOTE: checking i_nlink in the beginning of ->write_inode() and bailing out
if it's zero is not *and* *never* *had* *been* enough. Final unlink() and iput()
may happen while the inode is in the middle of ->write_inode(); e.g. if you blindly
free the on-disk inode, you may end up doing that while ->write_inode() is writing
to it.
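Put together, a typical ->evict_inode() for a filesystem using inode-associated
metadata buffers might look roughly like this (foofs_free_on_disk_inode() is a
made-up fs-specific helper):

	static void foofs_evict_inode(struct inode *inode)
	{
		truncate_inode_pages_final(&inode->i_data);
		if (!inode->i_nlink && !is_bad_inode(inode))
			foofs_free_on_disk_inode(inode);	/* fs-specific */
		invalidate_inode_buffers(inode);
		clear_inode(inode);
	}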
---
[mandatory]
.d_delete() now only advises the dcache as to whether or not to cache
unreferenced dentries, and is now only called when the dentry refcount goes to
0. Even on 0 refcount transition, it must be able to tolerate being called 0,
1, or more times (e.g. constant, idempotent).
---
[mandatory]
.d_compare() calling convention and locking rules are significantly
changed. Read updated documentation in Documentation/filesystems/vfs.txt (and
look at examples of other filesystems) for guidance.
---
[mandatory]
.d_hash() calling convention and locking rules are significantly
changed. Read updated documentation in Documentation/filesystems/vfs.txt (and
look at examples of other filesystems) for guidance.
---
[mandatory]
dcache_lock is gone, replaced by fine grained locks. See fs/dcache.c
for details of what locks to replace dcache_lock with in order to protect
particular things. Most of the time, a filesystem only needs ->d_lock, which
protects *all* the dcache state of a given dentry.
--
[mandatory]
Filesystems must RCU-free their inodes, if they can have been accessed
via rcu-walk path walk (basically, if the file can have had a path name in the
vfs namespace).
Even though i_dentry and i_rcu share storage in a union, we will
initialize the former in inode_init_always(), so just leave it alone in
the callback. It used to be necessary to clean it there, but not anymore
(starting at 3.2).
--
[recommended]
vfs now tries to do path walking in "rcu-walk mode", which avoids
atomic operations and scalability hazards on dentries and inodes (see
Documentation/filesystems/path-lookup.txt). d_hash and d_compare changes
(above) are examples of the changes required to support this. For more complex
filesystem callbacks, the vfs drops out of rcu-walk mode before the fs call, so
no changes are required to the filesystem. However, this is costly and loses
the benefits of rcu-walk mode. We will begin to add filesystem callbacks that
are rcu-walk aware, shown below. Filesystems should take advantage of this
where possible.
--
[mandatory]
d_revalidate is a callback that is made on every path element (if
the filesystem provides it), which requires dropping out of rcu-walk mode. This
may now be called in rcu-walk mode (nd->flags & LOOKUP_RCU). -ECHILD should be
returned if the filesystem cannot handle rcu-walk. See
Documentation/filesystems/vfs.txt for more details.
permission and check_acl are inode permission checks that are called
on many or all directory inodes on the way down a path walk (to check for
exec permission). These must now be rcu-walk aware (flags & IPERM_FLAG_RCU).
See Documentation/filesystems/vfs.txt for more details.
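A sketch of the usual d_revalidate pattern, written against the final
flags-based prototype from a later entry (the blocking part is left out):

	static int foofs_d_revalidate(struct dentry *dentry, unsigned int flags)
	{
		if (flags & LOOKUP_RCU)
			return -ECHILD;	/* can't do this without blocking */

		/* ... ordinary, possibly blocking, revalidation ... */
		return 1;
	}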
--
[mandatory]
In ->fallocate() you must check the mode option passed in. If your
filesystem does not support hole punching (deallocating space in the middle of a
file) you must return -EOPNOTSUPP if FALLOC_FL_PUNCH_HOLE is set in mode.
Currently you can only have FALLOC_FL_PUNCH_HOLE with FALLOC_FL_KEEP_SIZE set,
so the i_size should not change when hole punching, even when punching the end of
a file off.
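A sketch of the mode check for a filesystem that cannot punch holes
(foofs_preallocate() is a made-up fs-specific helper):

	static long foofs_fallocate(struct file *file, int mode,
				    loff_t offset, loff_t len)
	{
		/* no hole punching here - reject everything but plain
		 * preallocation (optionally with FALLOC_FL_KEEP_SIZE) */
		if (mode & ~FALLOC_FL_KEEP_SIZE)
			return -EOPNOTSUPP;

		return foofs_preallocate(file->f_mapping->host, offset, len,
					 mode & FALLOC_FL_KEEP_SIZE);
	}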
--
[mandatory]
->get_sb() is gone. Switch to use of ->mount(). Typically it's just
a matter of switching from calling get_sb_... to mount_... and changing the
function type. If you were doing it manually, just switch from setting ->mnt_root
to some pointer to returning that pointer. On errors return ERR_PTR(...).
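For a typical block-device based filesystem that used get_sb_bdev(), the result
is usually as simple as this (foofs_fill_super() being the existing fill_super
callback; the "foofs" names are made up):

	static struct dentry *foofs_mount(struct file_system_type *fs_type,
					  int flags, const char *dev_name, void *data)
	{
		return mount_bdev(fs_type, flags, dev_name, data, foofs_fill_super);
	}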
--
[mandatory]
	->permission() and generic_permission() have lost the flags
argument; instead of passing IPERM_FLAG_RCU we add MAY_NOT_BLOCK into mask.
generic_permission() has also lost the check_acl argument; ACL checking
has been taken to VFS and filesystems need to provide a non-NULL ->i_op->get_acl
to read an ACL from disk.
--
[mandatory]
If you implement your own ->llseek() you must handle SEEK_HOLE and
SEEK_DATA. You can handle this by returning -EINVAL, but it would be nicer to
support it in some way. The generic handler assumes that the entire file is
data and there is a virtual hole at the end of the file. So if the provided
offset is less than i_size and SEEK_DATA is specified, return the same offset.
If the above is true for the offset and you are given SEEK_HOLE, return the end
of the file. If the offset is i_size or greater return -ENXIO in either case.
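A sketch of that behaviour for a filesystem that tracks no holes of its own
(locking around the f_pos update is omitted; the "foofs" name is made up):

	static loff_t foofs_llseek(struct file *file, loff_t offset, int whence)
	{
		struct inode *inode = file->f_mapping->host;

		switch (whence) {
		case SEEK_DATA:
			if (offset >= i_size_read(inode))
				return -ENXIO;
			break;				/* everything below i_size is data */
		case SEEK_HOLE:
			if (offset >= i_size_read(inode))
				return -ENXIO;
			offset = i_size_read(inode);	/* virtual hole at EOF */
			break;
		default:
			return generic_file_llseek(file, offset, whence);
		}
		file->f_pos = offset;
		return offset;
	}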
[mandatory]
If you have your own ->fsync() you must make sure to call
filemap_write_and_wait_range() so that all dirty pages are synced out properly.
You must also keep in mind that ->fsync() is not called with i_mutex held
anymore, so if you require i_mutex locking you must make sure to take it and
release it yourself.
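A sketch of the resulting ->fsync() (foofs_commit_metadata() is a made-up
fs-specific helper):

	static int foofs_fsync(struct file *file, loff_t start, loff_t end,
			       int datasync)
	{
		struct inode *inode = file->f_mapping->host;
		int error;

		error = filemap_write_and_wait_range(file->f_mapping, start, end);
		if (error)
			return error;

		mutex_lock(&inode->i_mutex);	/* no longer taken by the caller */
		error = foofs_commit_metadata(inode);
		mutex_unlock(&inode->i_mutex);
		return error;
	}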
--
[mandatory]
d_alloc_root() is gone, along with a lot of bugs caused by code
misusing it. Replacement: d_make_root(inode). The difference is,
d_make_root() drops the reference to inode if dentry allocation fails.
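In a fill_super() this typically ends up looking like:

	sb->s_root = d_make_root(root_inode);
	if (!sb->s_root)
		return -ENOMEM;		/* root_inode has already been dropped */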
--
[mandatory]
The witch is dead! Well, 2/3 of it, anyway. ->d_revalidate() and
->lookup() do *not* take struct nameidata anymore; just the flags.
--
[mandatory]
->create() doesn't take struct nameidata *; unlike the previous
two, it gets "is it an O_EXCL or equivalent?" boolean argument. Note that
local filesystems can ignore that argument - they are guaranteed that the
object doesn't exist. It's remote/distributed ones that might care...
--
[mandatory]
FS_REVAL_DOT is gone; if you used to have it, add ->d_weak_revalidate()
in your dentry operations instead.
--
[mandatory]
vfs_readdir() is gone; switch to iterate_dir() instead
--
[mandatory]
->readdir() is gone now; switch to ->iterate()
[mandatory]
vfs_follow_link has been removed. Filesystems must use nd_set_link
from ->follow_link for normal symlinks, or nd_jump_link for magic
/proc/<pid> style links.
--
[mandatory]
iget5_locked()/ilookup5()/ilookup5_nowait() test() callback used to be
called with both ->i_lock and inode_hash_lock held; the former is *not*
taken anymore, so verify that your callbacks do not rely on it (none
of the in-tree instances did). inode_hash_lock is still held,
of course, so they are still serialized wrt removal from inode hash,
as well as wrt set() callback of iget5_locked().
@ -0,0 +1,174 @@
The QNX6 Filesystem
===================
The qnx6fs is used by newer QNX operating system versions. (e.g. Neutrino)
It got introduced in QNX 6.4.0 and has been used by default since 6.4.1.
Option
======
mmi_fs Mount filesystem as used for example by Audi MMI 3G system
Specification
=============
qnx6fs shares many properties with traditional Unix filesystems. It has the
concepts of blocks, inodes and directories.
On QNX it is possible to create little endian and big endian qnx6 filesystems.
This feature makes it possible to create and use a filesystem of a different
endianness for a target platform (QNX is used on quite a range of embedded
systems) that runs with a different endianness than the build host.
The Linux driver handles endianness transparently. (LE and BE)
Blocks
------
The space in the device or file is split up into blocks. These are a fixed
size of 512, 1024, 2048 or 4096, which is decided when the filesystem is
created.
Blockpointers are 32bit, so the maximum space that can be addressed is
2^32 * 4096 bytes or 16TB
The superblocks
---------------
The superblock contains all global information about the filesystem.
Each qnx6fs got two superblocks, each one having a 64bit serial number.
That serial number is used to identify the "active" superblock.
In write mode, with each new snapshot (after each synchronous write), the
serial of the new master superblock is increased (old superblock serial + 1).
So basically the snapshot functionality is realized by an atomic final
update of the serial number. Before updating that serial, all modifications
are done by copying all modified blocks during that specific write request
(or period) and building up a new (stable) filesystem structure under the
inactive superblock.
Each superblock holds a set of root inodes for the different filesystem
parts. (Inode, Bitmap and Longfilenames)
Each of these root nodes holds information like total size of the stored
data and the addressing levels in that specific tree.
If the level value is 0, up to 16 direct blocks can be addressed by each
node.
Level 1 adds an additional indirect addressing level where each indirect
addressing block holds up to blocksize / 4 pointers (4 bytes each) to data blocks.
Level 2 adds an additional indirect addressing block level (so, already up
to 16 * 256 * 256 = 1048576 blocks that can be addressed by such a tree).
Unused block pointers are always set to ~0 - regardless of root node,
indirect addressing blocks or inodes.
Data leaves are always on the lowest level. So no data is stored on upper
tree levels.
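As an illustration, the number of data blocks addressable from one root node
can be computed like this (a sketch; each pointer is 4 bytes, so an indirect
block holds blocksize / 4 of them):

	/* maximum data blocks addressable from one root node */
	static unsigned long long qnx6_max_blocks(unsigned int blocksize, int level)
	{
		unsigned long long n = 16;	/* 16 direct pointers in the root node */

		while (level-- > 0)
			n *= blocksize / 4;	/* one more indirect level */
		return n;	/* e.g. 16 * 256 * 256 for level 2 and 1k blocks */
	}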
The first Superblock is located at 0x2000. (0x2000 is the bootblock size)
The Audi MMI 3G first superblock directly starts at byte 0.
Second superblock position can either be calculated from the superblock
information (total number of filesystem blocks) or by taking the highest
device address, zeroing the last 3 bytes and then subtracting 0x1000 from
that address.
0x1000 is the size reserved for each superblock - regardless of the
blocksize of the filesystem.
Inodes
------
Each object in the filesystem is represented by an inode. (index node)
The inode structure contains pointers to the filesystem blocks which contain
the data held in the object and all of the metadata about an object except
its longname. (filenames longer than 27 characters)
The metadata about an object includes the permissions, owner, group, flags,
size, number of blocks used, access time, change time and modification time.
Object mode field is POSIX format. (which makes things easier)
There are also pointers to the first 16 blocks, if the object data can be
addressed with 16 direct blocks.
For more than 16 blocks an indirect addressing in form of another tree is
used. (scheme is the same as the one used for the superblock root nodes)
The filesize is stored 64bit. Inode counting starts with 1. (whilst long
filename inodes start with 0)
Directories
-----------
A directory is a filesystem object and has an inode just like a file.
It is a specially formatted file containing records which associate each
name with an inode number.
'.' inode number points to the directory inode
'..' inode number points to the parent directory inode
Each filename record additionally got a filename length field.
Long filenames and long subdirectory names are a special case.
These get a filename length field of 0xff in the corresponding directory
record, and the longfilename inode number is stored in that record as well.
With that longfilename inode number, the longfilename tree can be walked
starting with the superblock longfilename root node pointers.
Special files
-------------
Symbolic links are also filesystem objects with inodes. They got a specific
bit in the inode mode field identifying them as symbolic link.
The directory entry file inode pointer points to the target file inode.
Hard links got an inode, a directory entry, but a specific mode bit set,
no block pointers and the directory file record pointing to the target file
inode.
Character and block special devices do not exist in QNX as those files
are handled by the QNX kernel/drivers and created in /dev independent of the
underlying filesystem.
Long filenames
--------------
Long filenames are stored in a separate addressing tree. The starting point
is the longfilename root node in the active superblock.
Each data block (tree leaves) holds one long filename. That filename is
limited to 510 bytes. The first two starting bytes are used as length field
for the actual filename.
Since that structure has to fit even the smallest allowed blocksize (512 bytes),
the limit of 510 bytes for the stored filename follows directly.
Bitmap
------
The qnx6fs filesystem allocation bitmap is stored in a tree under bitmap
root node in the superblock and each bit in the bitmap represents one
filesystem block.
The first block is block 0, which starts 0x1000 after superblock start.
So for a normal qnx6fs 0x3000 (bootblock + superblock) is the physical
address at which block 0 is located.
Bits at the end of the last bitmap block are set to 1, if the device is
smaller than addressing space in the bitmap.
Bitmap system area
------------------
The bitmap itself is divided into three parts.
First the system area, which is split into two halves.
Then userspace.
The requirement for a static, fixed preallocated system area comes from how
qnx6fs deals with writes.
Each superblock got its own half of the system area. So superblock #1
always uses blocks from the lower half whilst superblock #2 just writes to
blocks represented by the upper half bitmap system area bits.
Bitmap blocks, Inode blocks and indirect addressing blocks for those two
tree structures are treated as system blocks.
The rationale behind that is that a write request can work on a new snapshot
(the system area of the inactive - resp. lower serial numbered - superblock) while
at the same time there is still a complete stable filesystem structure in the
other half of the system area.
When finished with writing (a sync write is completed, the maximum sync leap
time has elapsed or a filesystem sync is requested), the serial of the previously
inactive superblock is atomically increased and the fs switches over to that -
now declared stable - superblock.
For all data outside the system area, blocks are just copied while writing.
@ -0,0 +1,65 @@
Quota subsystem
===============
The quota subsystem allows the system administrator to set limits on used space
and the number of used inodes (an inode is a filesystem structure which is
associated with each file or directory) for users and/or groups. For both used
space and number of used inodes there are actually two limits. The first one is
called the softlimit and the second one the hardlimit. A user can never exceed a
hardlimit for any resource (unless he has the CAP_SYS_RESOURCE capability). A
user is allowed to exceed the softlimit, but only for a limited period of time.
This period is called the "grace period" or "grace time". When the grace time is
over, the user is not able to allocate more space/inodes until he frees enough
of them to get below the softlimit.
Quota limits (and amount of grace time) are set independently for each
filesystem.
For more details about quota design, see the documentation in quota-tools package
(http://sourceforge.net/projects/linuxquota).
Quota netlink interface
=======================
When a user exceeds a softlimit, runs out of grace time or reaches a hardlimit,
the quota subsystem traditionally printed a message to the controlling terminal
of the process which caused the excess. This method has the disadvantage that
when the user is working on a graphical desktop he usually cannot see the
message. Thus the quota netlink interface has been designed to pass information
about the above events to userspace. There they can be captured by an
application and processed accordingly.
The interface uses generic netlink framework (see
http://lwn.net/Articles/208755/ and http://people.suug.ch/~tgr/libnl/ for more
details about this layer). The name of the quota generic netlink interface
is "VFS_DQUOT". Definitions of constants below are in <linux/quota.h>.
Currently, the interface supports only one message type QUOTA_NL_C_WARNING.
This command is used to send a notification about any of the above mentioned
events. Each message has six attributes. These are (type of the argument is
in parentheses):
QUOTA_NL_A_QTYPE (u32)
- type of quota being exceeded (one of USRQUOTA, GRPQUOTA)
QUOTA_NL_A_EXCESS_ID (u64)
- UID/GID (depends on quota type) of user / group whose limit
is being exceeded.
QUOTA_NL_A_CAUSED_ID (u64)
- UID of a user who caused the event
QUOTA_NL_A_WARNING (u32)
- what kind of limit is exceeded:
QUOTA_NL_IHARDWARN - inode hardlimit
QUOTA_NL_ISOFTLONGWARN - inode softlimit is exceeded longer
than given grace period
QUOTA_NL_ISOFTWARN - inode softlimit
QUOTA_NL_BHARDWARN - space (block) hardlimit
QUOTA_NL_BSOFTLONGWARN - space (block) softlimit is exceeded
longer than given grace period.
QUOTA_NL_BSOFTWARN - space (block) softlimit
- four warnings are also defined for the event when user stops
exceeding some limit:
QUOTA_NL_IHARDBELOW - inode hardlimit
QUOTA_NL_ISOFTBELOW - inode softlimit
QUOTA_NL_BHARDBELOW - space (block) hardlimit
QUOTA_NL_BSOFTBELOW - space (block) softlimit
QUOTA_NL_A_DEV_MAJOR (u32)
- major number of a device with the affected filesystem
QUOTA_NL_A_DEV_MINOR (u32)
- minor number of a device with the affected filesystem
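For illustration, a userspace listener built on libnl-3 could parse these
attributes roughly as follows. This is only a sketch: error handling is
omitted, and the multicast group name "events" is an assumption that should be
checked against the running kernel.

	#include <netlink/netlink.h>
	#include <netlink/genl/genl.h>
	#include <netlink/genl/ctrl.h>
	#include <linux/quota.h>
	#include <stdio.h>

	static int quota_warning_cb(struct nl_msg *msg, void *arg)
	{
		struct nlattr *attrs[QUOTA_NL_A_MAX + 1];

		/* pull the attributes described above out of QUOTA_NL_C_WARNING */
		if (genlmsg_parse(nlmsg_hdr(msg), 0, attrs, QUOTA_NL_A_MAX, NULL) < 0)
			return NL_SKIP;

		printf("warning %u, id %llu, dev %u:%u, caused by uid %llu\n",
		       nla_get_u32(attrs[QUOTA_NL_A_WARNING]),
		       (unsigned long long)nla_get_u64(attrs[QUOTA_NL_A_EXCESS_ID]),
		       nla_get_u32(attrs[QUOTA_NL_A_DEV_MAJOR]),
		       nla_get_u32(attrs[QUOTA_NL_A_DEV_MINOR]),
		       (unsigned long long)nla_get_u64(attrs[QUOTA_NL_A_CAUSED_ID]));
		return NL_OK;
	}

	int main(void)
	{
		struct nl_sock *sk = nl_socket_alloc();

		genl_connect(sk);
		nl_socket_add_membership(sk,
			genl_ctrl_resolve_grp(sk, "VFS_DQUOT", "events"));
		nl_socket_disable_seq_check(sk);
		nl_socket_modify_cb(sk, NL_CB_VALID, NL_CB_CUSTOM,
				    quota_warning_cb, NULL);
		for (;;)
			nl_recvmsgs_default(sk);
	}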
@ -0,0 +1,359 @@
ramfs, rootfs and initramfs
October 17, 2005
Rob Landley <rob@landley.net>
=============================
What is ramfs?
--------------
Ramfs is a very simple filesystem that exports Linux's disk caching
mechanisms (the page cache and dentry cache) as a dynamically resizable
RAM-based filesystem.
Normally all files are cached in memory by Linux. Pages of data read from
backing store (usually the block device the filesystem is mounted on) are kept
around in case it's needed again, but marked as clean (freeable) in case the
Virtual Memory system needs the memory for something else. Similarly, data
written to files is marked clean as soon as it has been written to backing
store, but kept around for caching purposes until the VM reallocates the
memory. A similar mechanism (the dentry cache) greatly speeds up access to
directories.
With ramfs, there is no backing store. Files written into ramfs allocate
dentries and page cache as usual, but there's nowhere to write them to.
This means the pages are never marked clean, so they can't be freed by the
VM when it's looking to recycle memory.
The amount of code required to implement ramfs is tiny, because all the
work is done by the existing Linux caching infrastructure. Basically,
you're mounting the disk cache as a filesystem. Because of this, ramfs is not
an optional component removable via menuconfig, since there would be negligible
space savings.
ramfs and ramdisk:
------------------
The older "ram disk" mechanism created a synthetic block device out of
an area of RAM and used it as backing store for a filesystem. This block
device was of fixed size, so the filesystem mounted on it was of fixed
size. Using a ram disk also required unnecessarily copying memory from the
fake block device into the page cache (and copying changes back out), as well
as creating and destroying dentries. Plus it needed a filesystem driver
(such as ext2) to format and interpret this data.
Compared to ramfs, this wastes memory (and memory bus bandwidth), creates
unnecessary work for the CPU, and pollutes the CPU caches. (There are tricks
to avoid this copying by playing with the page tables, but they're unpleasantly
complicated and turn out to be about as expensive as the copying anyway.)
More to the point, all the work ramfs is doing has to happen _anyway_,
since all file access goes through the page and dentry caches. The RAM
disk is simply unnecessary; ramfs is internally much simpler.
Another reason ramdisks are semi-obsolete is that the introduction of
loopback devices offered a more flexible and convenient way to create
synthetic block devices, now from files instead of from chunks of memory.
See losetup (8) for details.
ramfs and tmpfs:
----------------
One downside of ramfs is you can keep writing data into it until you fill
up all memory, and the VM can't free it because the VM thinks that files
should get written to backing store (rather than swap space), but ramfs hasn't
got any backing store. Because of this, only root (or a trusted user) should
be allowed write access to a ramfs mount.
A ramfs derivative called tmpfs was created to add size limits, and the ability
to write the data to swap space. Normal users can be allowed write access to
tmpfs mounts. See Documentation/filesystems/tmpfs.txt for more information.
What is rootfs?
---------------
Rootfs is a special instance of ramfs (or tmpfs, if that's enabled), which is
always present in 2.6 systems. You can't unmount rootfs for approximately the
same reason you can't kill the init process; rather than having special code
to check for and handle an empty list, it's smaller and simpler for the kernel
to just make sure certain lists can't become empty.
Most systems just mount another filesystem over rootfs and ignore it. The
amount of space an empty instance of ramfs takes up is tiny.
If CONFIG_TMPFS is enabled, rootfs will use tmpfs instead of ramfs by
default. To force ramfs, add "rootfstype=ramfs" to the kernel command
line.
What is initramfs?
------------------
All 2.6 Linux kernels contain a gzipped "cpio" format archive, which is
extracted into rootfs when the kernel boots up. After extracting, the kernel
checks to see if rootfs contains a file "init", and if so it executes it as PID
1. If found, this init process is responsible for bringing the system the
rest of the way up, including locating and mounting the real root device (if
any). If rootfs does not contain an init program after the embedded cpio
archive is extracted into it, the kernel will fall through to the older code
to locate and mount a root partition, then exec some variant of /sbin/init
out of that.
All this differs from the old initrd in several ways:
- The old initrd was always a separate file, while the initramfs archive is
linked into the linux kernel image. (The directory linux-*/usr is devoted
to generating this archive during the build.)
- The old initrd file was a gzipped filesystem image (in some file format,
such as ext2, that needed a driver built into the kernel), while the new
initramfs archive is a gzipped cpio archive (like tar only simpler,
see cpio(1) and Documentation/early-userspace/buffer-format.txt). The
kernel's cpio extraction code is not only extremely small, it's also
__init text and data that can be discarded during the boot process.
- The program run by the old initrd (which was called /initrd, not /init) did
some setup and then returned to the kernel, while the init program from
initramfs is not expected to return to the kernel. (If /init needs to hand
off control it can overmount / with a new root device and exec another init
program. See the switch_root utility, below.)
- When switching another root device, initrd would pivot_root and then
umount the ramdisk. But initramfs is rootfs: you can neither pivot_root
rootfs, nor unmount it. Instead delete everything out of rootfs to
free up the space (find -xdev / -exec rm '{}' ';'), overmount rootfs
with the new root (cd /newmount; mount --move . /; chroot .), attach
stdin/stdout/stderr to the new /dev/console, and exec the new init.
Since this is a remarkably persnickety process (and involves deleting
commands before you can run them), the klibc package introduced a helper
program (utils/run_init.c) to do all this for you. Most other packages
(such as busybox) have named this command "switch_root".
Populating initramfs:
---------------------
The 2.6 kernel build process always creates a gzipped cpio format initramfs
archive and links it into the resulting kernel binary. By default, this
archive is empty (consuming 134 bytes on x86).
The config option CONFIG_INITRAMFS_SOURCE (in General Setup in menuconfig,
and living in usr/Kconfig) can be used to specify a source for the
initramfs archive, which will automatically be incorporated into the
resulting binary. This option can point to an existing gzipped cpio
archive, a directory containing files to be archived, or a text file
specification such as the following example:
dir /dev 755 0 0
nod /dev/console 644 0 0 c 5 1
nod /dev/loop0 644 0 0 b 7 0
dir /bin 755 1000 1000
slink /bin/sh busybox 777 0 0
file /bin/busybox initramfs/busybox 755 0 0
dir /proc 755 0 0
dir /sys 755 0 0
dir /mnt 755 0 0
file /init initramfs/init.sh 755 0 0
Run "usr/gen_init_cpio" (after the kernel build) to get a usage message
documenting the above file format.
One advantage of the configuration file is that root access is not required to
set permissions or create device nodes in the new archive. (Note that those
two example "file" entries expect to find files named "init.sh" and "busybox" in
a directory called "initramfs", under the linux-2.6.* directory. See
Documentation/early-userspace/README for more details.)
The kernel does not depend on external cpio tools. If you specify a
directory instead of a configuration file, the kernel's build infrastructure
creates a configuration file from that directory (usr/Makefile calls
scripts/gen_initramfs_list.sh), and proceeds to package up that directory
using the config file (by feeding it to usr/gen_init_cpio, which is created
from usr/gen_init_cpio.c). The kernel's build-time cpio creation code is
entirely self-contained, and the kernel's boot-time extractor is also
(obviously) self-contained.
The one thing you might need external cpio utilities installed for is creating
or extracting your own preprepared cpio files to feed to the kernel build
(instead of a config file or directory).
The following command line can extract a cpio image (either by the above script
or by the kernel build) back into its component files:
cpio -i -d -H newc -F initramfs_data.cpio --no-absolute-filenames
The following shell script can create a prebuilt cpio archive you can
use in place of the above config file:
#!/bin/sh

# Copyright 2006 Rob Landley <rob@landley.net> and TimeSys Corporation.
# Licensed under GPL version 2

if [ $# -ne 2 ]
then
  echo "usage: mkinitramfs directory imagename.cpio.gz"
  exit 1
fi

if [ -d "$1" ]
then
  echo "creating $2 from $1"
  (cd "$1"; find . | cpio -o -H newc | gzip) > "$2"
else
  echo "First argument must be a directory"
  exit 1
fi
Note: The cpio man page contains some bad advice that will break your initramfs
archive if you follow it. It says "A typical way to generate the list
of filenames is with the find command; you should give find the -depth option
to minimize problems with permissions on directories that are unwritable or not
searchable." Don't do this when creating initramfs.cpio.gz images, it won't
work. The Linux kernel cpio extractor won't create files in a directory that
doesn't exist, so the directory entries must go before the files that go in
those directories. The above script gets them in the right order.
External initramfs images:
--------------------------
If the kernel has initrd support enabled, an external cpio.gz archive can also
be passed into a 2.6 kernel in place of an initrd. In this case, the kernel
will autodetect the type (initramfs, not initrd) and extract the external cpio
archive into rootfs before trying to run /init.
This has the memory efficiency advantages of initramfs (no ramdisk block
device) but the separate packaging of initrd (which is nice if you have
non-GPL code you'd like to run from initramfs, without conflating it with
the GPL licensed Linux kernel binary).
It can also be used to supplement the kernel's built-in initramfs image. The
files in the external archive will overwrite any conflicting files in
the built-in initramfs archive. Some distributors also prefer to customize
a single kernel image with task-specific initramfs images, without recompiling.
Contents of initramfs:
----------------------
An initramfs archive is a complete self-contained root filesystem for Linux.
If you don't already understand what shared libraries, devices, and paths
you need to get a minimal root filesystem up and running, here are some
references:
http://www.tldp.org/HOWTO/Bootdisk-HOWTO/
http://www.tldp.org/HOWTO/From-PowerUp-To-Bash-Prompt-HOWTO.html
http://www.linuxfromscratch.org/lfs/view/stable/
The "klibc" package (http://www.kernel.org/pub/linux/libs/klibc) is
designed to be a tiny C library to statically link early userspace
code against, along with some related utilities. It is BSD licensed.
I use uClibc (http://www.uclibc.org) and busybox (http://www.busybox.net)
myself. These are LGPL and GPL, respectively. (A self-contained initramfs
package is planned for the busybox 1.3 release.)
In theory you could use glibc, but that's not well suited for small embedded
uses like this. (A "hello world" program statically linked against glibc is
over 400k. With uClibc it's 7k. Also note that glibc dlopens libnss to do
name lookups, even when otherwise statically linked.)
A good first step is to get initramfs to run a statically linked "hello world"
program as init, and test it under an emulator like qemu (www.qemu.org) or
User Mode Linux, like so:
cat > hello.c << EOF
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
  printf("Hello world!\n");
  sleep(999999999);
}
EOF
gcc -static hello.c -o init
echo init | cpio -o -H newc | gzip > test.cpio.gz

# Testing external initramfs using the initrd loading mechanism.
qemu -kernel /boot/vmlinuz -initrd test.cpio.gz /dev/zero
When debugging a normal root filesystem, it's nice to be able to boot with
"init=/bin/sh". The initramfs equivalent is "rdinit=/bin/sh", and it's
just as useful.
Why cpio rather than tar?
-------------------------
This decision was made back in December, 2001. The discussion started here:
http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1538.html
And spawned a second thread (specifically on tar vs cpio), starting here:
http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1587.html
The quick and dirty summary version (which is no substitute for reading
the above threads) is:
1) cpio is a standard. It's decades old (from the AT&T days), and already
widely used on Linux (inside RPM, Red Hat's device driver disks). Here's
a Linux Journal article about it from 1996:
http://www.linuxjournal.com/article/1213
It's not as popular as tar because the traditional cpio command line tools
require _truly_hideous_ command line arguments. But that says nothing
either way about the archive format, and there are alternative tools,
such as:
http://freecode.com/projects/afio
2) The cpio archive format chosen by the kernel is simpler and cleaner (and
thus easier to create and parse) than any of the (literally dozens of)
various tar archive formats. The complete initramfs archive format is
explained in buffer-format.txt, created in usr/gen_init_cpio.c, and
extracted in init/initramfs.c. All three together come to less than 26k
total of human-readable text.
3) The GNU project standardizing on tar is approximately as relevant as
Windows standardizing on zip. Linux is not part of either, and is free
to make its own technical decisions.
4) Since this is a kernel internal format, it could easily have been
something brand new. The kernel provides its own tools to create and
extract this format anyway. Using an existing standard was preferable,
but not essential.
5) Al Viro made the decision (quote: "tar is ugly as hell and not going to be
supported on the kernel side"):
http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1540.html
explained his reasoning:
http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1550.html
http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1638.html
and, most importantly, designed and implemented the initramfs code.
Future directions:
------------------
Today (2.6.16), initramfs is always compiled in, but not always used. The
kernel falls back to legacy boot code that is reached only if initramfs does
not contain an /init program. The fallback is legacy code, there to ensure a
smooth transition and allowing early boot functionality to gradually move to
"early userspace" (I.E. initramfs).
The move to early userspace is necessary because finding and mounting the real
root device is complex. Root partitions can span multiple devices (raid or
separate journal). They can be out on the network (requiring dhcp, setting a
specific MAC address, logging into a server, etc). They can live on removable
media, with dynamically allocated major/minor numbers and persistent naming
issues requiring a full udev implementation to sort out. They can be
compressed, encrypted, copy-on-write, loopback mounted, strangely partitioned,
and so on.
This kind of complexity (which inevitably includes policy) is rightly handled
in userspace. Both klibc and busybox/uClibc are working on simple initramfs
packages to drop into a kernel build.
The klibc package has now been accepted into Andrew Morton's 2.6.17-mm tree.
The kernel's current early boot code (partition detection, etc) will probably
be migrated into a default initramfs, automatically created and used by the
kernel build.
@ -0,0 +1,494 @@
relay interface (formerly relayfs)
==================================
The relay interface provides a means for kernel applications to
efficiently log and transfer large quantities of data from the kernel
to userspace via user-defined 'relay channels'.
A 'relay channel' is a kernel->user data relay mechanism implemented
as a set of per-cpu kernel buffers ('channel buffers'), each
represented as a regular file ('relay file') in user space. Kernel
clients write into the channel buffers using efficient write
functions; these automatically log into the current cpu's channel
buffer. User space applications mmap() or read() from the relay files
and retrieve the data as it becomes available. The relay files
themselves are files created in a host filesystem, e.g. debugfs, and
are associated with the channel buffers using the API described below.
The format of the data logged into the channel buffers is completely
up to the kernel client; the relay interface does however provide
hooks which allow kernel clients to impose some structure on the
buffer data. The relay interface doesn't implement any form of data
filtering - this also is left to the kernel client. The purpose is to
keep things as simple as possible.
This document provides an overview of the relay interface API. The
details of the function parameters are documented along with the
functions in the relay interface code - please see that for details.
Semantics
=========
Each relay channel has one buffer per CPU, each buffer has one or more
sub-buffers. Messages are written to the first sub-buffer until it is
too full to contain a new message, in which case it is written to
the next (if available). Messages are never split across sub-buffers.
At this point, userspace can be notified so it empties the first
sub-buffer, while the kernel continues writing to the next.
When notified that a sub-buffer is full, the kernel knows how many
bytes of it are padding i.e. unused space occurring because a complete
message couldn't fit into a sub-buffer. Userspace can use this
knowledge to copy only valid data.
After copying it, userspace can notify the kernel that a sub-buffer
has been consumed.
A relay channel can operate in a mode where it will overwrite data not
yet collected by userspace, and not wait for it to be consumed.
The relay channel itself does not provide for communication of such
data between userspace and kernel, allowing the kernel side to remain
simple and not impose a single interface on userspace. It does
provide a set of examples and a separate helper though, described
below.
The read() interface both removes padding and internally consumes the
read sub-buffers; thus in cases where read(2) is being used to drain
the channel buffers, special-purpose communication between kernel and
user isn't necessary for basic operation.
One of the major goals of the relay interface is to provide a low
overhead mechanism for conveying kernel data to userspace. While the
read() interface is easy to use, it's not as efficient as the mmap()
approach; the example code attempts to make the tradeoff between the
two approaches as small as possible.
klog and relay-apps example code
================================
The relay interface itself is ready to use, but to make things easier,
a couple simple utility functions and a set of examples are provided.
The relay-apps example tarball, available on the relay sourceforge
site, contains a set of self-contained examples, each consisting of a
pair of .c files containing boilerplate code for each of the user and
kernel sides of a relay application. When combined these two sets of
boilerplate code provide glue to easily stream data to disk, without
having to bother with mundane housekeeping chores.
The 'klog debugging functions' patch (klog.patch in the relay-apps
tarball) provides a couple of high-level logging functions to the
kernel which allow writing formatted text or raw data to a channel,
regardless of whether a channel to write into exists or not, or even
whether the relay interface is compiled into the kernel or not. These
functions allow you to put unconditional 'trace' statements anywhere
in the kernel or kernel modules; only when there is a 'klog handler'
registered will data actually be logged (see the klog and kleak
examples for details).
It is of course possible to use the relay interface from scratch,
i.e. without using any of the relay-apps example code or klog, but
you'll have to implement communication between userspace and kernel,
allowing both to convey the state of buffers (full, empty, amount of
padding). The read() interface both removes padding and internally
consumes the read sub-buffers; thus in cases where read(2) is being
used to drain the channel buffers, special-purpose communication
between kernel and user isn't necessary for basic operation. Things
such as buffer-full conditions would still need to be communicated via
some channel though.
klog and the relay-apps examples can be found in the relay-apps
tarball on http://relayfs.sourceforge.net
The relay interface user space API
==================================
The relay interface implements basic file operations for user space
access to relay channel buffer data. Here are the file operations
that are available and some comments regarding their behavior:
open() enables user to open an _existing_ channel buffer.
mmap() results in channel buffer being mapped into the caller's
memory space. Note that you can't do a partial mmap - you
must map the entire file, which is NRBUF * SUBBUFSIZE.
read() read the contents of a channel buffer. The bytes read are
'consumed' by the reader, i.e. they won't be available
again to subsequent reads. If the channel is being used
in no-overwrite mode (the default), it can be read at any
time even if there's an active kernel writer. If the
channel is being used in overwrite mode and there are
active channel writers, results may be unpredictable -
users should make sure that all logging to the channel has
ended before using read() with overwrite mode. Sub-buffer
padding is automatically removed and will not be seen by
the reader.
sendfile() transfer data from a channel buffer to an output file
descriptor. Sub-buffer padding is automatically removed
and will not be seen by the reader.
poll() POLLIN/POLLRDNORM/POLLERR supported. User applications are
notified when sub-buffer boundaries are crossed.
close() decrements the channel buffer's refcount. When the refcount
reaches 0, i.e. when no process or kernel client has the
buffer open, the channel buffer is freed.
In order for a user application to make use of relay files, the
host filesystem must be mounted. For example,
mount -t debugfs debugfs /sys/kernel/debug
NOTE: the host filesystem doesn't need to be mounted for kernel
clients to create or use channels - it only needs to be
mounted when user space applications need access to the buffer
data.
The relay interface kernel API
==============================
Here's a summary of the API the relay interface provides to in-kernel clients:
TBD(curr. line MT:/API/)
channel management functions:
relay_open(base_filename, parent, subbuf_size, n_subbufs,
callbacks, private_data)
relay_close(chan)
relay_flush(chan)
relay_reset(chan)
channel management typically called on instigation of userspace:
relay_subbufs_consumed(chan, cpu, subbufs_consumed)
write functions:
relay_write(chan, data, length)
__relay_write(chan, data, length)
relay_reserve(chan, length)
callbacks:
subbuf_start(buf, subbuf, prev_subbuf, prev_padding)
buf_mapped(buf, filp)
buf_unmapped(buf, filp)
create_buf_file(filename, parent, mode, buf, is_global)
remove_buf_file(dentry)
helper functions:
relay_buf_full(buf)
subbuf_start_reserve(buf, length)
Creating a channel
------------------
relay_open() is used to create a channel, along with its per-cpu
channel buffers. Each channel buffer will have an associated file
created for it in the host filesystem, which can be mmapped or
read from in user space. The files are named basename0...basenameN-1
where N is the number of online cpus, and by default will be created
in the root of the filesystem (if the parent param is NULL). If you
want a directory structure to contain your relay files, you should
create it using the host filesystem's directory creation function,
e.g. debugfs_create_dir(), and pass the parent directory to
relay_open(). Users are responsible for cleaning up any directory
structure they create, when the channel is closed - again the host
filesystem's directory removal functions should be used for that,
e.g. debugfs_remove().
In order for a channel to be created and the host filesystem's files
associated with its channel buffers, the user must provide definitions
for two callback functions, create_buf_file() and remove_buf_file().
create_buf_file() is called once for each per-cpu buffer from
relay_open() and allows the user to create the file which will be used
to represent the corresponding channel buffer. The callback should
return the dentry of the file created to represent the channel buffer.
remove_buf_file() must also be defined; it's responsible for deleting
the file(s) created in create_buf_file() and is called during
relay_close().
Here are some typical definitions for these callbacks, in this case
using debugfs:
	/*
	 * create_buf_file() callback.  Creates relay file in debugfs.
	 */
	static struct dentry *create_buf_file_handler(const char *filename,
						      struct dentry *parent,
						      int mode,
						      struct rchan_buf *buf,
						      int *is_global)
	{
		return debugfs_create_file(filename, mode, parent, buf,
					   &relay_file_operations);
	}

	/*
	 * remove_buf_file() callback.  Removes relay file from debugfs.
	 */
	static int remove_buf_file_handler(struct dentry *dentry)
	{
		debugfs_remove(dentry);
		return 0;
	}

	/*
	 * relay interface callbacks
	 */
	static struct rchan_callbacks relay_callbacks =
	{
		.create_buf_file = create_buf_file_handler,
		.remove_buf_file = remove_buf_file_handler,
	};
And an example relay_open() invocation using them:
chan = relay_open("cpu", NULL, SUBBUF_SIZE, N_SUBBUFS, &relay_callbacks, NULL);
If the create_buf_file() callback fails, or isn't defined, channel
creation and thus relay_open() will fail.
The total size of each per-cpu buffer is calculated by multiplying the
number of sub-buffers by the sub-buffer size passed into relay_open().
The idea behind sub-buffers is that they're basically an extension of
double-buffering to N buffers, and they also allow applications to
easily implement random-access-on-buffer-boundary schemes, which can
be important for some high-volume applications. The number and size
of sub-buffers is completely dependent on the application and even for
the same application, different conditions will warrant different
values for these parameters at different times. Typically, the right
values to use are best decided after some experimentation; in general,
though, it's safe to assume that having only 1 sub-buffer is a bad
idea - you're guaranteed to either overwrite data or lose events
depending on the channel mode being used.
The create_buf_file() implementation can also be defined in such a way
as to allow the creation of a single 'global' buffer instead of the
default per-cpu set. This can be useful for applications interested
mainly in seeing the relative ordering of system-wide events without
the need to bother with saving explicit timestamps for the purpose of
merging/sorting per-cpu files in a postprocessing step.
To have relay_open() create a global buffer, the create_buf_file()
implementation should set the value of the is_global outparam to a
non-zero value in addition to creating the file that will be used to
represent the single buffer. In the case of a global buffer,
create_buf_file() and remove_buf_file() will be called only once. The
normal channel-writing functions, e.g. relay_write(), can still be
used - writes from any cpu will transparently end up in the global
buffer - but since it is a global buffer, callers should make sure
they use the proper locking for such a buffer, either by wrapping
writes in a spinlock, or by copying a write function from relay.h and
creating a local version that internally does the proper locking.
The private_data passed into relay_open() allows clients to associate
user-defined data with a channel, and is immediately available
(including in create_buf_file()) via chan->private_data or
buf->chan->private_data.
Buffer-only channels
--------------------
These channels have no files associated and can be created with
relay_open(NULL, NULL, ...). Such channels are useful in scenarios such
as when doing early tracing in the kernel, before the VFS is up. In these
cases, one may open a buffer-only channel and then call
relay_late_setup_files() when the kernel is ready to handle files,
to expose the buffered data to the userspace.
Channel 'modes'
---------------
relay channels can be used in either of two modes - 'overwrite' or
'no-overwrite'. The mode is entirely determined by the implementation
of the subbuf_start() callback, as described below. The default if no
subbuf_start() callback is defined is 'no-overwrite' mode. If the
default mode suits your needs, and you plan to use the read()
interface to retrieve channel data, you can ignore the details of this
section, as it pertains mainly to mmap() implementations.
In 'overwrite' mode, also known as 'flight recorder' mode, writes
continuously cycle around the buffer and will never fail, but will
unconditionally overwrite old data regardless of whether it's actually
been consumed. In no-overwrite mode, writes will fail, i.e. data will
be lost, if the number of unconsumed sub-buffers equals the total
number of sub-buffers in the channel. It should be clear that if
there is no consumer or if the consumer can't consume sub-buffers fast
enough, data will be lost in either case; the only difference is
whether data is lost from the beginning or the end of a buffer.
As explained above, a relay channel is made of up one or more
per-cpu channel buffers, each implemented as a circular buffer
subdivided into one or more sub-buffers. Messages are written into
the current sub-buffer of the channel's current per-cpu buffer via the
write functions described below. Whenever a message can't fit into
the current sub-buffer, because there's no room left for it, the
client is notified via the subbuf_start() callback that a switch to a
new sub-buffer is about to occur. The client uses this callback to 1)
initialize the next sub-buffer if appropriate 2) finalize the previous
sub-buffer if appropriate and 3) return a boolean value indicating
whether or not to actually move on to the next sub-buffer.
To implement 'no-overwrite' mode, the userspace client would provide
an implementation of the subbuf_start() callback something like the
following:
	static int subbuf_start(struct rchan_buf *buf,
				void *subbuf,
				void *prev_subbuf,
				unsigned int prev_padding)
	{
		if (prev_subbuf)
			*((unsigned *)prev_subbuf) = prev_padding;

		if (relay_buf_full(buf))
			return 0;

		subbuf_start_reserve(buf, sizeof(unsigned int));

		return 1;
	}
If the current buffer is full, i.e. all sub-buffers remain unconsumed,
the callback returns 0 to indicate that the buffer switch should not
occur yet, i.e. until the consumer has had a chance to read the
current set of ready sub-buffers. For the relay_buf_full() function
to make sense, the consumer is responsible for notifying the relay
interface when sub-buffers have been consumed via
relay_subbufs_consumed(). Any subsequent attempts to write into the
buffer will again invoke the subbuf_start() callback with the same
parameters; only when the consumer has consumed one or more of the
ready sub-buffers will relay_buf_full() return 0, in which case the
buffer switch can continue.
The implementation of the subbuf_start() callback for 'overwrite' mode
would be very similar:
	static int subbuf_start(struct rchan_buf *buf,
				void *subbuf,
				void *prev_subbuf,
				unsigned int prev_padding)
	{
		if (prev_subbuf)
			*((unsigned *)prev_subbuf) = prev_padding;

		subbuf_start_reserve(buf, sizeof(unsigned int));

		return 1;
	}
In this case, the relay_buf_full() check is meaningless and the
callback always returns 1, causing the buffer switch to occur
unconditionally. It's also meaningless for the client to use the
relay_subbufs_consumed() function in this mode, as it's never
consulted.
The default subbuf_start() implementation, used if the client doesn't
define any callbacks, or doesn't define the subbuf_start() callback,
implements the simplest possible 'no-overwrite' mode, i.e. it does
nothing but return 0.
Header information can be reserved at the beginning of each sub-buffer
by calling the subbuf_start_reserve() helper function from within the
subbuf_start() callback. This reserved area can be used to store
whatever information the client wants. In the example above, room is
reserved in each sub-buffer to store the padding count for that
sub-buffer. This is filled in for the previous sub-buffer in the
subbuf_start() implementation; the padding value for the previous
sub-buffer is passed into the subbuf_start() callback along with a
pointer to the previous sub-buffer, since the padding value isn't
known until a sub-buffer is filled. The subbuf_start() callback is
also called for the first sub-buffer when the channel is opened, to
give the client a chance to reserve space in it. In this case the
previous sub-buffer pointer passed into the callback will be NULL, so
the client should check the value of the prev_subbuf pointer before
writing into the previous sub-buffer.
Writing to a channel
--------------------
Kernel clients write data into the current cpu's channel buffer using
relay_write() or __relay_write(). relay_write() is the main logging
function - it uses local_irq_save() to protect the buffer and should be
used if you might be logging from interrupt context. If you know
you'll never be logging from interrupt context, you can use
__relay_write(), which only disables preemption. These functions
don't return a value, so you can't determine whether or not they
failed - the assumption is that you wouldn't want to check a return
value in the fast logging path anyway, and that they'll always succeed
unless the buffer is full and no-overwrite mode is being used, in
which case you can detect a failed write in the subbuf_start()
callback by calling the relay_buf_full() helper function.
relay_reserve() is used to reserve a slot in a channel buffer which
can be written to later. This would typically be used in applications
that need to write directly into a channel buffer without having to
stage data in a temporary buffer beforehand. Because the actual write
may not happen immediately after the slot is reserved, applications
using relay_reserve() can keep a count of the number of bytes actually
written, either in space reserved in the sub-buffers themselves or as
a separate array. See the 'reserve' example in the relay-apps tarball
at http://relayfs.sourceforge.net for an example of how this can be
done. Because the write is under control of the client and is
separated from the reserve, relay_reserve() doesn't protect the buffer
at all - it's up to the client to provide the appropriate
synchronization when using relay_reserve().
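As a rough illustration, given the chan returned by relay_open(), kernel-side
logging might look like this (the event structure and the message are made up):

	/* simple logging, safe from any context including interrupts */
	relay_write(chan, &event, sizeof(event));

	/* reserve-then-fill; the client provides its own synchronization */
	{
		char *p = relay_reserve(chan, 64);

		if (p)	/* NULL if the sub-buffer switch was refused */
			snprintf(p, 64, "pid %d did something\n", current->pid);
	}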
Closing a channel
-----------------
The client calls relay_close() when it's finished using the channel.
The channel and its associated buffers are destroyed when there are no
longer any references to any of the channel buffers. relay_flush()
forces a sub-buffer switch on all the channel buffers, and can be used
to finalize and process the last sub-buffers before the channel is
closed.
Misc
----
Some applications may want to keep a channel around and re-use it
rather than open and close a new channel for each use. relay_reset()
can be used for this purpose - it resets a channel to its initial
state without reallocating channel buffer memory or destroying
existing mappings. It should however only be called when it's safe to
do so, i.e. when the channel isn't currently being written to.
Finally, there are a couple of utility callbacks that can be used for
different purposes. buf_mapped() is called whenever a channel buffer
is mmapped from user space and buf_unmapped() is called when it's
unmapped. The client can use this notification to trigger actions
within the kernel application, such as enabling/disabling logging to
the channel.
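Tying the callbacks together, a callback table using the mapping
notifications might be sketched as follows (the handler names and the
logging flag are made up for the example; subbuf_start_handler is the
sketch shown earlier):

	static atomic_t logging_enabled = ATOMIC_INIT(0);	/* hypothetical flag */

	static void buf_mapped_handler(struct rchan_buf *buf, struct file *filp)
	{
		atomic_inc(&logging_enabled);
	}

	static void buf_unmapped_handler(struct rchan_buf *buf, struct file *filp)
	{
		atomic_dec(&logging_enabled);
	}

	static struct rchan_callbacks relay_callbacks = {
		.subbuf_start	= subbuf_start_handler,
		.buf_mapped	= buf_mapped_handler,
		.buf_unmapped	= buf_unmapped_handler,
	};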
Resources
=========
For news, example code, mailing list, etc. see the relay interface homepage:
http://relayfs.sourceforge.net
Credits
=======
The ideas and specs for the relay interface came about as a result of
discussions on tracing involving the following:
Michel Dagenais <michel.dagenais@polymtl.ca>
Richard Moore <richardj_moore@uk.ibm.com>
Bob Wisniewski <bob@watson.ibm.com>
Karim Yaghmour <karim@opersys.com>
Tom Zanussi <zanussi@us.ibm.com>
Also thanks to Hubertus Franke for a lot of useful suggestions and bug
reports.

@ -0,0 +1,186 @@
ROMFS - ROM FILE SYSTEM
This is a quite dumb, read-only filesystem, mainly for the initial RAM
disks of installation disks. It grew out of the need to have modules
linked at boot time. Using this filesystem, you get a very similar
feature, and even the possibility of a small kernel, with a filesystem
which doesn't take up useful memory from the router functions in the
basement of your office.
For comparison, both the older minix and xiafs (the latter is now
defunct) filesystems, compiled as modules, need more than 20000 bytes,
while romfs is less than a page, about 4000 bytes (assuming i586
code). Under the same conditions, the msdos filesystem would need
about 30K (and does not support device nodes or symlinks), while the
nfs module with nfsroot is about 57K. Furthermore, as a somewhat unfair
comparison, an actual rescue disk used up 3202 blocks with ext2, while
with romfs it needed only 3079 blocks.
To create such a file system, you'll need a user program named
genromfs. It is available on http://romfs.sourceforge.net/
As the name suggests, romfs could also be used (space-efficiently) on
various read-only media, like (E)EPROM disks, if someone has the
motivation... :)
However, the main purpose of romfs is to have a very small kernel,
which has only this filesystem linked in, and then can load any module
later, with the current module utilities. It can also be used to run
some program to decide if you need SCSI devices, and even IDE or
floppy drives can be loaded later if you use the "initrd"--initial
RAM disk--feature of the kernel. This would not really be a news
flash, but with romfs you can even leave out your ext2 or minix or
maybe even affs filesystem until you really know that you need it.
For example, a distribution boot disk can contain only the cd disk
drivers (and possibly the SCSI drivers), and the ISO 9660 filesystem
module. The kernel can be small enough, since it doesn't have other
filesystems, like the quite large ext2fs module, which can then be
loaded off the CD at a later stage of the installation. Another use
would be for a recovery disk, when you are reinstalling a workstation
from the network, and you will have all the tools/modules available
from a nearby server, so you don't want to carry two disks for this
purpose, just because it won't fit into ext2.
romfs operates on block devices as you can expect, and the underlying
structure is very simple. Every accessible structure begins on 16
byte boundaries for fast access. The minimum space a file will take
is 32 bytes (this is an empty file, with a less than 16 character
name). The maximum overhead for any non-empty file is the header, and
the 16 byte padding for the name and the contents, also 16+14+15 = 45
bytes. This is quite rare however, since most file names are longer
than 3 bytes, and shorter than 15 bytes.
The layout of the filesystem is the following:
offset content
+---+---+---+---+
0 | - | r | o | m | \
+---+---+---+---+ The ASCII representation of those bytes
4 | 1 | f | s | - | / (i.e. "-rom1fs-")
+---+---+---+---+
8 | full size | The number of accessible bytes in this fs.
+---+---+---+---+
12 | checksum | The checksum of the FIRST 512 BYTES.
+---+---+---+---+
16 | volume name | The zero terminated name of the volume,
: : padded to 16 byte boundary.
+---+---+---+---+
xx | file |
: headers :
Every multi byte value (32 bit words, I'll use the longwords term from
now on) must be in big endian order.
The first eight bytes identify the filesystem, even for the casual
inspector. After that, in the 3rd longword, it contains the number of
bytes accessible from the start of this filesystem. The 4th longword
is the checksum of the first 512 bytes (or the number of bytes
accessible, whichever is smaller). The applied algorithm is the same
as in the AFFS filesystem, namely a simple sum of the longwords
(assuming bigendian quantities again). For details, please consult
the source. This algorithm was chosen because although it's not quite
reliable, it does not require any tables, and it is very simple.
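For illustration, the check amounts to something like the following
sketch (kernel-style C, not the actual romfs source): sum the first
min(512, image size) bytes as big-endian longwords; the checksum word
stored by genromfs is chosen so that this sum comes out to zero.

	static __u32 romfs_style_checksum(const void *data, int size)
	{
		const __be32 *ptr = data;
		__u32 sum = 0;

		size >>= 2;		/* whole longwords only */
		while (size-- > 0)
			sum += be32_to_cpu(*ptr++);

		return sum;		/* zero for a valid image */
	}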
The following bytes are now part of the file system; each file header
must begin on a 16 byte boundary.
offset content
+---+---+---+---+
0 | next filehdr|X| The offset of the next file header
+---+---+---+---+ (zero if no more files)
4 | spec.info | Info for directories/hard links/devices
+---+---+---+---+
8 | size | The size of this file in bytes
+---+---+---+---+
12 | checksum | Covering the meta data, including the file
+---+---+---+---+ name, and padding
16 | file name | The zero terminated name of the file,
: : padded to 16 byte boundary
+---+---+---+---+
xx | file data |
: :
Since the file headers always begin at a 16 byte boundary, the lowest
4 bits are always zero in the next filehdr pointer. These four bits
are used for the mode information. Bits 0..2 specify the type of the
file, while bit 3 shows whether the file is executable or not. The
permissions are assumed to be world readable and, if this bit is set,
world executable; the exception is character and block devices, which
are never accessible to anyone but the owner. The owner of every file
is user and group 0; this should never be a problem for the intended
use. The mapping of the 8 possible values to file types is the
following:
mapping spec.info means
0 hard link link destination [file header]
1 directory first file's header
2 regular file unused, must be zero [MBZ]
3 symbolic link unused, MBZ (file data is the link content)
4 block device 16/16 bits major/minor number
5 char device - " -
6 socket unused, MBZ
7 fifo unused, MBZ
Note that hard links are specifically marked in this filesystem, but
they will behave as you can expect (i.e. share the inode number).
Note also that it is your responsibility not to create hard link
loops, and to create all the . and .. links for directories. This is
normally done correctly by the genromfs program. Please refrain from
using the executable bits for special purposes on the socket and fifo
special files, they may have other uses in the future. Additionally,
please remember that only regular files, and symlinks are supposed to
have a nonzero size field; they contain the number of bytes available
directly after the (padded) file name.
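For illustration only, the file header described above can be sketched
as the following structure (the names are made up for this example;
the kernel's own definitions live in the romfs sources):

	struct romfs_file_header {
		__be32	next;		/* next header offset; low 4 bits hold the mode */
		__be32	spec_info;	/* link dest / first entry / major+minor */
		__be32	size;		/* file size in bytes */
		__be32	checksum;	/* covers the metadata, name and padding */
		char	name[0];	/* zero terminated, padded to 16 bytes */
	};

	#define ROMFS_TYPE_MASK	7	/* bits 0..2: file type (see table above) */
	#define ROMFS_EXEC_BIT	8	/* bit 3: world executable */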
Another thing to note is that romfs works on file headers and data
aligned to 16 byte boundaries, but most hardware devices and the block
device drivers are unable to cope with smaller than block-sized data.
To overcome this limitation, the whole size of the file system must be
padded to a 1024 byte boundary.
If you have any problems or suggestions concerning this file system,
please contact me. However, think twice before wanting me to add
features and code, because the primary and most important advantage of
this file system is the small code. On the other hand, don't be
alarmed, I'm not getting that much romfs related mail. Now I can
understand why Avery wrote poems in the ARCnet docs to get some more
feedback. :)
romfs also has a mailing list, and to date it hasn't received any
traffic, so you are welcome to join it to discuss your ideas. :)
It's run by ezmlm, so you can subscribe to it by sending a message
to romfs-subscribe@shadow.banki.hu, the content is irrelevant.
Pending issues:
- Permissions and owner information are pretty essential features of a
Un*x like system, but romfs does not provide the full possibilities.
I have never found this limiting, but others might.
- The file system is read only, so it can be very small, but in case
one would want to write _anything_ to a file system, he still needs
a writable file system, thus negating the size advantages. Possible
solutions: implement write access as a compile-time option, or a new,
similarly small writable filesystem for RAM disks.
- Since the files are only required to have alignment on a 16 byte
boundary, it is currently possibly suboptimal to read or execute files
from the filesystem. It might be resolved by reordering file data to
have most of it (i.e. except the start and the end) lying at "natural"
boundaries, thus it would be possible to directly map a big portion of
the file contents to the mm subsystem.
- Compression might be a useful feature, but memory is quite a
limiting factor in my eyes.
- Where is it used?
- Does it work on other architectures than intel and motorola?
Have fun,
Janos Farkas <chexum@shadow.banki.hu>

@ -0,0 +1,334 @@
The seq_file interface
Copyright 2003 Jonathan Corbet <corbet@lwn.net>
This file is originally from the LWN.net Driver Porting series at
http://lwn.net/Articles/driver-porting/
There are numerous ways for a device driver (or other kernel component) to
provide information to the user or system administrator. One useful
technique is the creation of virtual files, in debugfs, /proc or elsewhere.
Virtual files can provide human-readable output that is easy to get at
without any special utility programs; they can also make life easier for
script writers. It is not surprising that the use of virtual files has
grown over the years.
Creating those files correctly has always been a bit of a challenge,
however. It is not that hard to make a virtual file which returns a
string. But life gets trickier if the output is long - anything greater
than an application is likely to read in a single operation. Handling
multiple reads (and seeks) requires careful attention to the reader's
position within the virtual file - that position is, likely as not, in the
middle of a line of output. The kernel has traditionally had a number of
implementations that got this wrong.
The 2.6 kernel contains a set of functions (implemented by Alexander Viro)
which are designed to make it easy for virtual file creators to get it
right.
The seq_file interface is available via <linux/seq_file.h>. There are
three aspects to seq_file:
* An iterator interface which lets a virtual file implementation
step through the objects it is presenting.
* Some utility functions for formatting objects for output without
needing to worry about things like output buffers.
* A set of canned file_operations which implement most operations on
the virtual file.
We'll look at the seq_file interface via an extremely simple example: a
loadable module which creates a file called /proc/sequence. The file, when
read, simply produces a set of increasing integer values, one per line. The
sequence will continue until the user loses patience and finds something
better to do. The file is seekable, in that one can do something like the
following:
dd if=/proc/sequence of=out1 count=1
dd if=/proc/sequence skip=1 of=out2 count=1
Then concatenate the output files out1 and out2 and get the right
result. Yes, it is a thoroughly useless module, but the point is to show
how the mechanism works without getting lost in other details. (Those
wanting to see the full source for this module can find it at
http://lwn.net/Articles/22359/).
Deprecated create_proc_entry
Note that the above article uses create_proc_entry which was removed in
kernel 3.10. Current versions require the following update
- entry = create_proc_entry("sequence", 0, NULL);
- if (entry)
- entry->proc_fops = &ct_file_ops;
+ entry = proc_create("sequence", 0, NULL, &ct_file_ops);
The iterator interface
Modules implementing a virtual file with seq_file must implement a simple
iterator object that allows stepping through the data of interest.
Iterators must be able to move to a specific position - like the file they
implement - but the interpretation of that position is up to the iterator
itself. A seq_file implementation that is formatting firewall rules, for
example, could interpret position N as the Nth rule in the chain.
Positioning can thus be done in whatever way makes the most sense for the
generator of the data, which need not be aware of how a position translates
to an offset in the virtual file. The one obvious exception is that a
position of zero should indicate the beginning of the file.
The /proc/sequence iterator just uses the count of the next number it
will output as its position.
Four functions must be implemented to make the iterator work. The first,
called start() takes a position as an argument and returns an iterator
which will start reading at that position. For our simple sequence example,
the start() function looks like:
static void *ct_seq_start(struct seq_file *s, loff_t *pos)
{
loff_t *spos = kmalloc(sizeof(loff_t), GFP_KERNEL);
if (! spos)
return NULL;
*spos = *pos;
return spos;
}
The entire data structure for this iterator is a single loff_t value
holding the current position. There is no upper bound for the sequence
iterator, but that will not be the case for most other seq_file
implementations; in most cases the start() function should check for a
"past end of file" condition and return NULL if need be.
For more complicated applications, the private field of the seq_file
structure can be used. There is also a special value which can be returned
by the start() function called SEQ_START_TOKEN; it can be used if you wish
to instruct your show() function (described below) to print a header at the
top of the output. SEQ_START_TOKEN should only be used if the offset is
zero, however.
The next function to implement is called, amazingly, next(); its job is to
move the iterator forward to the next position in the sequence. The
example module can simply increment the position by one; more useful
modules will do what is needed to step through some data structure. The
next() function returns a new iterator, or NULL if the sequence is
complete. Here's the example version:
static void *ct_seq_next(struct seq_file *s, void *v, loff_t *pos)
{
loff_t *spos = v;
*pos = ++*spos;
return spos;
}
The stop() function is called when iteration is complete; its job, of
course, is to clean up. If dynamic memory is allocated for the iterator,
stop() is the place to free it.
static void ct_seq_stop(struct seq_file *s, void *v)
{
kfree(v);
}
Finally, the show() function should format the object currently pointed to
by the iterator for output. The example module's show() function is:
static int ct_seq_show(struct seq_file *s, void *v)
{
loff_t *spos = v;
seq_printf(s, "%lld\n", (long long)*spos);
return 0;
}
If all is well, the show() function should return zero. A negative error
code in the usual manner indicates that something went wrong; it will be
passed back to user space. This function can also return SEQ_SKIP, which
causes the current item to be skipped; if the show() function has already
generated output before returning SEQ_SKIP, that output will be dropped.
We will look at seq_printf() in a moment. But first, the definition of the
seq_file iterator is finished by creating a seq_operations structure with
the four functions we have just defined:
static const struct seq_operations ct_seq_ops = {
.start = ct_seq_start,
.next = ct_seq_next,
.stop = ct_seq_stop,
.show = ct_seq_show
};
This structure will be needed to tie our iterator to the /proc file in
a little bit.
It's worth noting that the iterator value returned by start() and
manipulated by the other functions is considered to be completely opaque by
the seq_file code. It can thus be anything that is useful in stepping
through the data to be output. Counters can be useful, but it could also be
a direct pointer into an array or linked list. Anything goes, as long as
the programmer is aware that things can happen between calls to the
iterator function. However, the seq_file code (by design) will not sleep
between the calls to start() and stop(), so holding a lock during that time
is a reasonable thing to do. The seq_file code will also avoid taking any
other locks while the iterator is active.
Formatted output
The seq_file code manages positioning within the output created by the
iterator and getting it into the user's buffer. But, for that to work, that
output must be passed to the seq_file code. Some utility functions have
been defined which make this task easy.
Most code will simply use seq_printf(), which works pretty much like
printk(), but which requires the seq_file pointer as an argument. It is
common to ignore the return value from seq_printf(), but a function
producing complicated output may want to check that value and quit if
something non-zero is returned; an error return means that the seq_file
buffer has been filled and further output will be discarded.
For straight character output, the following functions may be used:
int seq_putc(struct seq_file *m, char c);
int seq_puts(struct seq_file *m, const char *s);
int seq_escape(struct seq_file *m, const char *s, const char *esc);
The first two output a single character and a string, just like one would
expect. seq_escape() is like seq_puts(), except that any character in s
which is in the string esc will be represented in octal form in the output.
There is also a pair of functions for printing filenames:
int seq_path(struct seq_file *m, struct path *path, char *esc);
int seq_path_root(struct seq_file *m, struct path *path,
struct path *root, char *esc)
Here, path indicates the file of interest, and esc is a set of characters
which should be escaped in the output. A call to seq_path() will output
the path relative to the current process's filesystem root. If a different
root is desired, it can be used with seq_path_root(). Note that, if it
turns out that path cannot be reached from root, the value of root will be
changed in seq_path_root() to a root which *does* work.
Making it all work
So far, we have a nice set of functions which can produce output within the
seq_file system, but we have not yet turned them into a file that a user
can see. Creating a file within the kernel requires, of course, the
creation of a set of file_operations which implement the operations on that
file. The seq_file interface provides a set of canned operations which do
most of the work. The virtual file author still must implement the open()
method, however, to hook everything up. The open function is often a single
line, as in the example module:
static int ct_open(struct inode *inode, struct file *file)
{
return seq_open(file, &ct_seq_ops);
}
Here, the call to seq_open() takes the seq_operations structure we created
before, and gets set up to iterate through the virtual file.
On a successful open, seq_open() stores the struct seq_file pointer in
file->private_data. If you have an application where the same iterator can
be used for more than one file, you can store an arbitrary pointer in the
private field of the seq_file structure; that value can then be retrieved
by the iterator functions.
There is also a wrapper function to seq_open() called seq_open_private(). It
kmallocs a zero filled block of memory and stores a pointer to it in the
private field of the seq_file structure, returning 0 on success. The
block size is specified in a third parameter to the function, e.g.:
static int ct_open(struct inode *inode, struct file *file)
{
return seq_open_private(file, &ct_seq_ops,
sizeof(struct mystruct));
}
There is also a variant function, __seq_open_private(), which is functionally
identical except that, if successful, it returns the pointer to the allocated
memory block, allowing further initialisation e.g.:
static int ct_open(struct inode *inode, struct file *file)
{
struct mystruct *p =
__seq_open_private(file, &ct_seq_ops, sizeof(*p));
if (!p)
return -ENOMEM;
p->foo = bar; /* initialize my stuff */
...
p->baz = true;
return 0;
}
A corresponding close function, seq_release_private() is available which
frees the memory allocated in the corresponding open.
The other operations of interest - read(), llseek(), and release() - are
all implemented by the seq_file code itself. So a virtual file's
file_operations structure will look like:
static const struct file_operations ct_file_ops = {
.owner = THIS_MODULE,
.open = ct_open,
.read = seq_read,
.llseek = seq_lseek,
.release = seq_release
};
There is also a seq_release_private() which passes the contents of the
seq_file private field to kfree() before releasing the structure.
The final step is the creation of the /proc file itself. In the example
code, that is done in the initialization code in the usual way:
static int ct_init(void)
{
proc_create("sequence", 0, NULL, &ct_file_ops);
return 0;
}
module_init(ct_init);
And that is pretty much it.
seq_list
If your file will be iterating through a linked list, you may find these
routines useful:
struct list_head *seq_list_start(struct list_head *head,
loff_t pos);
struct list_head *seq_list_start_head(struct list_head *head,
loff_t pos);
struct list_head *seq_list_next(void *v, struct list_head *head,
loff_t *ppos);
These helpers will interpret pos as a position within the list and iterate
accordingly. Your start() and next() functions need only invoke the
seq_list_* helpers with a pointer to the appropriate list_head structure.
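A minimal sketch of start()/next()/stop() built on these helpers might
look like this (my_list and my_lock are hypothetical; the show()
function would use list_entry() on the returned list_head to reach the
containing structure):

	static LIST_HEAD(my_list);		/* hypothetical list of records */
	static DEFINE_MUTEX(my_lock);		/* protects my_list */

	static void *my_seq_start(struct seq_file *s, loff_t *pos)
	{
		mutex_lock(&my_lock);
		return seq_list_start(&my_list, *pos);
	}

	static void *my_seq_next(struct seq_file *s, void *v, loff_t *pos)
	{
		return seq_list_next(v, &my_list, pos);
	}

	static void my_seq_stop(struct seq_file *s, void *v)
	{
		mutex_unlock(&my_lock);
	}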
The extra-simple version
For extremely simple virtual files, there is an even easier interface. A
module can define only the show() function, which should create all the
output that the virtual file will contain. The file's open() method then
calls:
int single_open(struct file *file,
int (*show)(struct seq_file *m, void *p),
void *data);
When output time comes, the show() function will be called once. The data
value given to single_open() can be found in the private field of the
seq_file structure. When using single_open(), the programmer should use
single_release() instead of seq_release() in the file_operations structure
to avoid a memory leak.
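Putting the pieces together, a complete single-show virtual file might
look like this sketch (names are made up for the example):

	static int my_show(struct seq_file *m, void *p)
	{
		seq_printf(m, "the answer is %d\n", 42);
		return 0;
	}

	static int my_open(struct inode *inode, struct file *file)
	{
		return single_open(file, my_show, NULL);
	}

	static const struct file_operations my_file_ops = {
		.owner		= THIS_MODULE,
		.open		= my_open,
		.read		= seq_read,
		.llseek		= seq_lseek,
		.release	= single_release,	/* not seq_release */
	};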

@ -0,0 +1,939 @@
Shared Subtrees
---------------
Contents:
1) Overview
2) Features
3) Setting mount states
4) Use-case
5) Detailed semantics
6) Quiz
7) FAQ
8) Implementation
1) Overview
-----------
Consider the following situation:
A process wants to clone its own namespace, but still wants to access the CD
that got mounted recently. Shared subtree semantics provide the necessary
mechanism to accomplish the above.
It provides the necessary building blocks for features like per-user-namespace
and versioned filesystem.
2) Features
-----------
Shared subtree provides four different flavors of mounts; struct vfsmount to be
precise
a. shared mount
b. slave mount
c. private mount
d. unbindable mount
2a) A shared mount can be replicated to as many mountpoints as desired,
and all the replicas continue to be exactly the same.
Here is an example:
Let's say /mnt has a mount that is shared.
mount --make-shared /mnt
Note: mount(8) command now supports the --make-shared flag,
so the sample 'smount' program is no longer needed and has been
removed.
# mount --bind /mnt /tmp
The above command replicates the mount at /mnt to the mountpoint /tmp
and the contents of both the mounts remain identical.
#ls /mnt
a b c
#ls /tmp
a b c
Now let's say we mount a device at /tmp/a
# mount /dev/sd0 /tmp/a
#ls /tmp/a
t1 t2 t3
#ls /mnt/a
t1 t2 t3
Note that the mount has propagated to the mount at /mnt as well.
And the same is true even when /dev/sd0 is mounted on /mnt/a. The
contents will be visible under /tmp/a too.
2b) A slave mount is like a shared mount except that mount and umount events
only propagate towards it.
All slave mounts have a master mount which is a shared mount.
Here is an example:
Let's say /mnt has a mount which is shared.
# mount --make-shared /mnt
Let's bind mount /mnt to /tmp
# mount --bind /mnt /tmp
the new mount at /tmp becomes a shared mount and it is a replica of
the mount at /mnt.
Now let's make the mount at /tmp a slave of /mnt
# mount --make-slave /tmp
let's mount /dev/sd0 on /mnt/a
# mount /dev/sd0 /mnt/a
#ls /mnt/a
t1 t2 t3
#ls /tmp/a
t1 t2 t3
Note the mount event has propagated to the mount at /tmp
However let's see what happens if we mount something on the mount at /tmp
# mount /dev/sd1 /tmp/b
#ls /tmp/b
s1 s2 s3
#ls /mnt/b
Note how the mount event has not propagated to the mount at
/mnt
2c) A private mount does not forward or receive propagation.
This is the mount we are familiar with. It's the default type.
2d) An unbindable mount is an unbindable private mount
Let's say we have a mount at /mnt and we make it unbindable
# mount --make-unbindable /mnt
Let's try to bind mount this mount somewhere else.
# mount --bind /mnt /tmp
mount: wrong fs type, bad option, bad superblock on /mnt,
or too many mounted file systems
Binding an unbindable mount is an invalid operation.
3) Setting mount states
The mount command (util-linux package) can be used to set mount
states:
mount --make-shared mountpoint
mount --make-slave mountpoint
mount --make-private mountpoint
mount --make-unbindable mountpoint
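The same states can also be set programmatically with the mount(2)
system call, using the MS_SHARED, MS_SLAVE, MS_PRIVATE or
MS_UNBINDABLE flags (combined with MS_REC for a recursive change). A
small user-space sketch:

	#include <stdio.h>
	#include <sys/mount.h>

	/* equivalent of "mount --make-rslave /myprivatetree" */
	static int make_rslave(const char *mountpoint)
	{
		/* source, fstype and data are ignored for propagation changes */
		if (mount(NULL, mountpoint, NULL, MS_REC | MS_SLAVE, NULL) < 0) {
			perror("mount");
			return -1;
		}
		return 0;
	}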
4) Use cases
------------
A) A process wants to clone its own namespace, but still wants to
access the CD that got mounted recently.
Solution:
The system administrator can make the mount at /cdrom shared
mount --bind /cdrom /cdrom
mount --make-shared /cdrom
Now any process that clones off a new namespace will have a
mount at /cdrom which is a replica of the same mount in the
parent namespace.
So when a CD is inserted and mounted at /cdrom that mount gets
propagated to the other mount at /cdrom in all the other clone
namespaces.
B) A process wants its mounts invisible to any other process, but
still be able to see the other system mounts.
Solution:
To begin with, the administrator can mark the entire mount tree
as shareable.
mount --make-rshared /
A new process can clone off a new namespace. And mark some part
of its namespace as slave
mount --make-rslave /myprivatetree
Henceforth, any mounts within /myprivatetree done by the
process will not show up in any other namespace. However, mounts
done in the parent namespace under /myprivatetree still show
up in the process's namespace.
Apart from the above semantics this feature provides the
building blocks to solve the following problems:
C) Per-user namespace
The above semantics allows a way to share mounts across
namespaces. But namespaces are associated with processes. If
namespaces are made first class objects with user API to
associate/disassociate a namespace with userid, then each user
could have his/her own namespace and tailor it to his/her
requirements. Of course this needs support from PAM.
D) Versioned files
If the entire mount tree is visible at multiple locations, then
an underlying versioning filesystem can return different
versions of the file depending on the path used to access that
file.
An example is:
mount --make-shared /
mount --rbind / /view/v1
mount --rbind / /view/v2
mount --rbind / /view/v3
mount --rbind / /view/v4
and if /usr has a versioning filesystem mounted, then that
mount appears at /view/v1/usr, /view/v2/usr, /view/v3/usr and
/view/v4/usr too
A user can request v3 version of the file /usr/fs/namespace.c
by accessing /view/v3/usr/fs/namespace.c . The underlying
versioning filesystem can then decipher that v3 version of the
filesystem is being requested and return the corresponding
inode.
5) Detailed semantics:
-------------------
The section below explains the detailed semantics of
bind, rbind, move, mount, umount and clone-namespace operations.
Note: the word 'vfsmount' and the noun 'mount' have been used
to mean the same thing, throughout this document.
5a) Mount states
A given mount can be in one of the following states
1) shared
2) slave
3) shared and slave
4) private
5) unbindable
A 'propagation event' is defined as an event generated on a vfsmount
that leads to mount or unmount actions in other vfsmounts.
A 'peer group' is defined as a group of vfsmounts that propagate
events to each other.
(1) Shared mounts
A 'shared mount' is defined as a vfsmount that belongs to a
'peer group'.
For example:
mount --make-shared /mnt
mount --bind /mnt /tmp
The mount at /mnt and that at /tmp are both shared and belong
to the same peer group. Anything mounted or unmounted under
/mnt or /tmp reflect in all the other mounts of its peer
group.
(2) Slave mounts
A 'slave mount' is defined as a vfsmount that receives
propagation events and does not forward propagation events.
A slave mount as the name implies has a master mount from which
mount/unmount events are received. Events do not propagate from
the slave mount to the master. Only a shared mount can be made
a slave by executing the following command
mount --make-slave mount
A shared mount that is made a slave is no longer shared, unless
it is modified to become shared again.
(3) Shared and Slave
A vfsmount can be both shared as well as slave. This state
indicates that the mount is a slave of some vfsmount, and
has its own peer group too. This vfsmount receives propagation
events from its master vfsmount, and also forwards propagation
events to its 'peer group' and to its slave vfsmounts.
Strictly speaking, the vfsmount is shared having its own
peer group, and this peer-group is a slave of some other
peer group.
Only a slave vfsmount can be made 'shared and slave', either by
executing the following command
mount --make-shared mount
or by moving the slave vfsmount under a shared vfsmount.
(4) Private mount
A 'private mount' is defined as a vfsmount that does not
receive or forward any propagation events.
(5) Unbindable mount
An 'unbindable mount' is defined as a vfsmount that does not
receive or forward any propagation events and cannot
be bind mounted.
State diagram:
The state diagram below explains the state transition of a mount,
in response to various commands.
------------------------------------------------------------------------
| |make-shared | make-slave | make-private |make-unbindab|
--------------|------------|--------------|--------------|-------------|
|shared |shared |*slave/private| private | unbindable |
| | | | | |
|-------------|------------|--------------|--------------|-------------|
|slave |shared | **slave | private | unbindable |
| |and slave | | | |
|-------------|------------|--------------|--------------|-------------|
|shared |shared | slave | private | unbindable |
|and slave |and slave | | | |
|-------------|------------|--------------|--------------|-------------|
|private |shared | **private | private | unbindable |
|-------------|------------|--------------|--------------|-------------|
|unbindable |shared |**unbindable | private | unbindable |
------------------------------------------------------------------------
* if the shared mount is the only mount in its peer group, making it
a slave makes it private automatically. Note that there is no master
to which it can be slaved.
** slaving a non-shared mount has no effect on the mount.
Apart from the commands listed above, the 'move' operation also changes
the state of a mount depending on the type of the destination mount.
This is explained in section 5d.
5b) Bind semantics
Consider the following command
mount --bind A/a B/b
where 'A' is the source mount, 'a' is the dentry in the mount 'A', 'B'
is the destination mount and 'b' is the dentry in the destination mount.
The outcome depends on the type of mount of 'A' and 'B'. The table
below contains quick reference.
---------------------------------------------------------------------------
| BIND MOUNT OPERATION |
|**************************************************************************
|source(A)->| shared | private | slave | unbindable |
| dest(B) | | | | |
| | | | | | |
| v | | | | |
|**************************************************************************
| shared | shared | shared | shared & slave | invalid |
| | | | | |
|non-shared| shared | private | slave | invalid |
***************************************************************************
Details:
1. 'A' is a shared mount and 'B' is a shared mount. A new mount 'C'
which is clone of 'A', is created. Its root dentry is 'a' . 'C' is
mounted on mount 'B' at dentry 'b'. Also new mount 'C1', 'C2', 'C3' ...
are created and mounted at the dentry 'b' on all mounts where 'B'
propagates to. A new propagation tree containing 'C1',..,'Cn' is
created. This propagation tree is identical to the propagation tree of
'B'. And finally the peer-group of 'C' is merged with the peer group
of 'A'.
2. 'A' is a private mount and 'B' is a shared mount. A new mount 'C'
which is clone of 'A', is created. Its root dentry is 'a'. 'C' is
mounted on mount 'B' at dentry 'b'. Also new mount 'C1', 'C2', 'C3' ...
are created and mounted at the dentry 'b' on all mounts where 'B'
propagates to. A new propagation tree is set containing all new mounts
'C', 'C1', .., 'Cn' with exactly the same configuration as the
propagation tree for 'B'.
3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. A new
mount 'C' which is clone of 'A', is created. Its root dentry is 'a' .
'C' is mounted on mount 'B' at dentry 'b'. Also new mounts 'C1', 'C2',
'C3' ... are created and mounted at the dentry 'b' on all mounts where
'B' propagates to. A new propagation tree containing the new mounts
'C','C1',.. 'Cn' is created. This propagation tree is identical to the
propagation tree for 'B'. And finally the mount 'C' and its peer group
is made the slave of mount 'Z'. In other words, mount 'C' is in the
state 'slave and shared'.
4. 'A' is an unbindable mount and 'B' is a shared mount. This is an
invalid operation.
5. 'A' is a private mount and 'B' is a non-shared(private or slave or
unbindable) mount. A new mount 'C' which is clone of 'A', is created.
Its root dentry is 'a'. 'C' is mounted on mount 'B' at dentry 'b'.
6. 'A' is a shared mount and 'B' is a non-shared mount. A new mount 'C'
which is a clone of 'A' is created. Its root dentry is 'a'. 'C' is
mounted on mount 'B' at dentry 'b'. 'C' is made a member of the
peer-group of 'A'.
7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount. A
new mount 'C' which is a clone of 'A' is created. Its root dentry is
'a'. 'C' is mounted on mount 'B' at dentry 'b'. Also 'C' is set as a
slave mount of 'Z'. In other words 'A' and 'C' are both slave mounts of
'Z'. All mount/unmount events on 'Z' propagates to 'A' and 'C'. But
mount/unmount on 'A' do not propagate anywhere else. Similarly
mount/unmount on 'C' do not propagate anywhere else.
8. 'A' is an unbindable mount and 'B' is a non-shared mount. This is
an invalid operation. An unbindable mount cannot be bind mounted.
5c) Rbind semantics
rbind is the same as bind. Bind replicates the specified mount. Rbind
replicates all the mounts in the tree belonging to the specified mount.
An rbind mount is a bind mount applied to all the mounts in the tree.
If the source tree that is rbound has some unbindable mounts,
then the subtree under the unbindable mount is pruned in the new
location.
eg: let's say we have the following mount tree.
A
/ \
B C
/ \ / \
D E F G
Let's say all the mounts in the tree, except mount C, are
of a type other than unbindable.
If this tree is rbound to, say, Z,
we will have the following tree at the new location.
Z
|
A'
/
B' Note how the tree under C is pruned
/ \ in the new location.
D' E'
5d) Move semantics
Consider the following command
mount --move A B/b
where 'A' is the source mount, 'B' is the destination mount and 'b' is
the dentry in the destination mount.
The outcome depends on the type of the mount of 'A' and 'B'. The table
below is a quick reference.
---------------------------------------------------------------------------
| MOVE MOUNT OPERATION |
|**************************************************************************
| source(A)->| shared | private | slave | unbindable |
| dest(B) | | | | |
| | | | | | |
| v | | | | |
|**************************************************************************
| shared | shared | shared |shared and slave| invalid |
| | | | | |
|non-shared| shared | private | slave | unbindable |
***************************************************************************
NOTE: moving a mount residing under a shared mount is invalid.
Details follow:
1. 'A' is a shared mount and 'B' is a shared mount. The mount 'A' is
mounted on mount 'B' at dentry 'b'. Also new mounts 'A1', 'A2'...'An'
are created and mounted at dentry 'b' on all mounts that receive
propagation from mount 'B'. A new propagation tree is created in the
exact same configuration as that of 'B'. This new propagation tree
contains all the new mounts 'A1', 'A2'... 'An'. And this new
propagation tree is appended to the already existing propagation tree
of 'A'.
2. 'A' is a private mount and 'B' is a shared mount. The mount 'A' is
mounted on mount 'B' at dentry 'b'. Also new mount 'A1', 'A2'... 'An'
are created and mounted at dentry 'b' on all mounts that receive
propagation from mount 'B'. The mount 'A' becomes a shared mount and a
propagation tree is created which is identical to that of
'B'. This new propagation tree contains all the new mounts 'A1',
'A2'... 'An'.
3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. The
mount 'A' is mounted on mount 'B' at dentry 'b'. Also new mounts 'A1',
'A2'... 'An' are created and mounted at dentry 'b' on all mounts that
receive propagation from mount 'B'. A new propagation tree is created
in the exact same configuration as that of 'B'. This new propagation
tree contains all the new mounts 'A1', 'A2'... 'An'. And this new
propagation tree is appended to the already existing propagation tree of
'A'. Mount 'A' continues to be the slave mount of 'Z' but it also
becomes 'shared'.
4. 'A' is an unbindable mount and 'B' is a shared mount. The operation
is invalid. Because mounting anything on the shared mount 'B' can
create new mounts that get mounted on the mounts that receive
propagation from 'B'. And since the mount 'A' is unbindable, cloning
it to mount at other mountpoints is not possible.
5. 'A' is a private mount and 'B' is a non-shared(private or slave or
unbindable) mount. The mount 'A' is mounted on mount 'B' at dentry 'b'.
6. 'A' is a shared mount and 'B' is a non-shared mount. The mount 'A'
is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be a
shared mount.
7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount.
The mount 'A' is mounted on mount 'B' at dentry 'b'. Mount 'A'
continues to be a slave mount of mount 'Z'.
8. 'A' is an unbindable mount and 'B' is a non-shared mount. The mount
'A' is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be an
unbindable mount.
5e) Mount semantics
Consider the following command
mount device B/b
'B' is the destination mount and 'b' is the dentry in the destination
mount.
The above operation is the same as bind operation with the exception
that the source mount is always a private mount.
5f) Unmount semantics
Consider the following command
umount A
where 'A' is a mount mounted on mount 'B' at dentry 'b'.
If mount 'B' is shared, then all most-recently-mounted mounts at dentry
'b' on mounts that receive propagation from mount 'B' and do not have
sub-mounts within them are unmounted.
Example: Let's say 'B1', 'B2', 'B3' are shared mounts that propagate to
each other.
let's say 'A1', 'A2', 'A3' are first mounted at dentry 'b' on mount
'B1', 'B2' and 'B3' respectively.
let's say 'C1', 'C2', 'C3' are next mounted at the same dentry 'b' on
mount 'B1', 'B2' and 'B3' respectively.
if 'C1' is unmounted, all the mounts that are most-recently-mounted on
'B1' and on the mounts that 'B1' propagates-to are unmounted.
'B1' propagates to 'B2' and 'B3'. And the most recently mounted mount
on 'B2' at dentry 'b' is 'C2', and that of mount 'B3' is 'C3'.
So all 'C1', 'C2' and 'C3' should be unmounted.
If any of 'C2' or 'C3' has some child mounts, then that mount is not
unmounted, but all other mounts are unmounted. However if 'C1' is told
to be unmounted and 'C1' has some sub-mounts, the umount operation
fails entirely.
5g) Clone Namespace
A cloned namespace contains the same mounts as its parent
namespace.
Let's say 'A' and 'B' are the corresponding mounts in the parent and the
child namespace.
If 'A' is shared, then 'B' is also shared and 'A' and 'B' propagate to
each other.
If 'A' is a slave mount of 'Z', then 'B' is also the slave mount of
'Z'.
If 'A' is a private mount, then 'B' is a private mount too.
If 'A' is an unbindable mount, then 'B' is an unbindable mount too.
6) Quiz
A. What is the result of the following command sequence?
mount --bind /mnt /mnt
mount --make-shared /mnt
mount --bind /mnt /tmp
mount --move /tmp /mnt/1
What should the contents of /mnt, /mnt/1 and /mnt/1/1 be?
Should they all be identical? Or should only /mnt and /mnt/1 be
identical?
B. What is the result of the following command sequence?
mount --make-rshared /
mkdir -p /v/1
mount --rbind / /v/1
What should the content of /v/1/v/1 be?
C. What is the result of the following command sequence?
mount --bind /mnt /mnt
mount --make-shared /mnt
mkdir -p /mnt/1/2/3 /mnt/1/test
mount --bind /mnt/1 /tmp
mount --make-slave /mnt
mount --make-shared /mnt
mount --bind /mnt/1/2 /tmp1
mount --make-slave /mnt
At this point we have the first mount at /tmp and
its root dentry is 1. Let's call this mount 'A'
And then we have a second mount at /tmp1 with root
dentry 2. Let's call this mount 'B'
Next we have a third mount at /mnt with root dentry
mnt. Let's call this mount 'C'
'B' is the slave of 'A' and 'C' is a slave of 'B'
A -> B -> C
at this point if we execute the following command
mount --bind /bin /tmp/test
The mount is attempted on 'A'
Will the mount propagate to 'B' and 'C'?
What would the contents of /mnt/1/test be?
7) FAQ
Q1. Why is bind mount needed? How is it different from symbolic links?
symbolic links can get stale if the destination mount gets
unmounted or moved. Bind mounts continue to exist even if the
other mount is unmounted or moved.
Q2. Why can't the shared subtree be implemented using exportfs?
exportfs is a heavyweight way of accomplishing part of what
shared subtree can do. I cannot imagine a way to implement the
semantics of slave mounts using exportfs.
Q3. Why is an unbindable mount needed?
Let's say we want to replicate the mount tree at multiple
locations within the same subtree.
If one rbind mounts a tree within the same subtree 'n' times,
the number of mounts created grows factorially with 'n'.
Having an unbindable mount can help prune the unneeded bind
mounts. Here is an example.
step 1:
let's say the root tree has just two directories with
one vfsmount.
root
/ \
tmp usr
And we want to replicate the tree at multiple
mountpoints under /root/tmp
step2:
mount --make-shared /root
mkdir -p /tmp/m1
mount --rbind /root /tmp/m1
the new tree now looks like this:
root
/ \
tmp usr
/
m1
/ \
tmp usr
/
m1
it has two vfsmounts
step3:
mkdir -p /tmp/m2
mount --rbind /root /tmp/m2
the new tree now looks like this:
root
/ \
tmp usr
/ \
m1 m2
/ \ / \
tmp usr tmp usr
/ \ /
m1 m2 m1
/ \ / \
tmp usr tmp usr
/ / \
m1 m1 m2
/ \
tmp usr
/ \
m1 m2
it has 6 vfsmounts
step 4:
mkdir -p /tmp/m3
mount --rbind /root /tmp/m3
I won't draw the tree..but it has 24 vfsmounts
at step i the number of vfsmounts is V[i] = i*V[i-1], i.e. the
number of mounts grows factorially. And this tree has way more
mounts than what we really needed in the first place.
One could use a series of umount at each step to prune
out the unneeded mounts. But there is a better solution.
Unbindable mounts come in handy here.
step 1:
let's say the root tree has just two directories with
one vfsmount.
root
/ \
tmp usr
How do we set up the same tree at multiple locations under
/root/tmp
step2:
mount --bind /root/tmp /root/tmp
mount --make-rshared /root
mount --make-unbindable /root/tmp
mkdir -p /tmp/m1
mount --rbind /root /tmp/m1
the new tree now looks like this:
root
/ \
tmp usr
/
m1
/ \
tmp usr
step3:
mkdir -p /tmp/m2
mount --rbind /root /tmp/m2
the new tree now looks like this:
root
/ \
tmp usr
/ \
m1 m2
/ \ / \
tmp usr tmp usr
step4:
mkdir -p /tmp/m3
mount --rbind /root /tmp/m3
the new tree now looks like this:
root
/ \
tmp usr
/ \ \
m1 m2 m3
/ \ / \ / \
tmp usr tmp usr tmp usr
8) Implementation
8A) Datastructure
4 new fields are introduced to struct vfsmount
->mnt_share
->mnt_slave_list
->mnt_slave
->mnt_master
->mnt_share links together all the mounts to/from which this vfsmount
sends/receives propagation events.
->mnt_slave_list links all the mounts to which this vfsmount
propagates.
->mnt_slave links together all the slaves that its master vfsmount
propagates to.
->mnt_master points to the master vfsmount from which this vfsmount
receives propagation.
->mnt_flags takes two more flags to indicate the propagation status of
the vfsmount. MNT_SHARED indicates that the vfsmount is a shared
vfsmount. MNT_UNBINDABLE indicates that the vfsmount cannot be
replicated.
All the shared vfsmounts in a peer group form a cyclic list through
->mnt_share.
All vfsmounts with the same ->mnt_master form a cyclic list anchored
in ->mnt_master->mnt_slave_list and going through ->mnt_slave.
->mnt_master can point to arbitrary (and possibly different) members
of the master peer group. To find all immediate slaves of a peer group
you need to go through _all_ ->mnt_slave_list of its members.
Conceptually it's just a single set - distribution among the
individual lists does not affect propagation or the way propagation
tree is modified by operations.
All vfsmounts in a peer group have the same ->mnt_master. If it is
non-NULL, they form a contiguous (ordered) segment of slave list.
An example propagation tree is shown in the figure below.
[ NOTE: Though it looks like a forest, if we consider all the shared
mounts as a conceptual entity called 'pnode', it becomes a tree]
A <--> B <--> C <---> D
/|\ /| |\
/ F G J K H I
/
E<-->K
/|\
M L N
In the above figure A, B, C and D are all shared and propagate to each
other. 'A' has got 3 slave mounts 'E', 'F' and 'G'; 'C' has got 2 slave
mounts 'J' and 'K'; and 'D' has got two slave mounts 'H' and 'I'.
'E' is also shared with 'K' and they propagate to each other. And
'K' has 3 slaves 'M', 'L' and 'N'.
A's ->mnt_share links with the ->mnt_share of 'B' 'C' and 'D'
A's ->mnt_slave_list links with ->mnt_slave of 'E', 'K', 'F' and 'G'
E's ->mnt_share links with ->mnt_share of K
'E', 'K', 'F', 'G' have their ->mnt_master point to struct
vfsmount of 'A'
'M', 'L', 'N' have their ->mnt_master point to struct vfsmount of 'K'
K's ->mnt_slave_list links with ->mnt_slave of 'M', 'L' and 'N'
C's ->mnt_slave_list links with ->mnt_slave of 'J' and 'K'
J and K's ->mnt_master points to struct vfsmount of C
and finally D's ->mnt_slave_list links with ->mnt_slave of 'H' and 'I'
'H' and 'I' have their ->mnt_master pointing to struct vfsmount of 'D'.
NOTE: The propagation tree is orthogonal to the mount tree.
8B Locking:
->mnt_share, ->mnt_slave, ->mnt_slave_list, ->mnt_master are protected
by namespace_sem (exclusive for modifications, shared for reading).
Normally we have ->mnt_flags modifications serialized by vfsmount_lock.
There are two exceptions: do_add_mount() and clone_mnt().
The former modifies a vfsmount that has not been visible in any shared
data structures yet.
The latter holds namespace_sem and the only references to vfsmount
are in lists that can't be traversed without namespace_sem.
8C Algorithm:
The crux of the implementation resides in rbind/move operation.
The overall algorithm breaks the operation into 3 phases: (look at
attach_recursive_mnt() and propagate_mnt())
1. prepare phase.
2. commit phase.
3. abort phase.
Prepare phase:
for each mount in the source tree:
a) Create the necessary number of mount trees to
be attached to each of the mounts that receive
propagation from the destination mount.
b) Do not attach any of the trees to its destination.
However note down its ->mnt_parent and ->mnt_mountpoint
c) Link all the new mounts to form a propagation tree that
is identical to the propagation tree of the destination
mount.
If this phase is successful, there should be 'n' new
propagation trees; where 'n' is the number of mounts in the
source tree. Go to the commit phase
Also there should be 'm' new mount trees, where 'm' is
the number of mounts to which the destination mount
propagates.
if any memory allocations fail, go to the abort phase.
Commit phase
attach each of the mount trees to their corresponding
destination mounts.
Abort phase
delete all the newly created trees.
NOTE: all the propagation related functionality resides in the file
pnode.c
------------------------------------------------------------------------
version 0.1 (created the initial document, Ram Pai linuxram@us.ibm.com)
version 0.2 (Incorporated comments from Al Viro)

@ -0,0 +1,521 @@
SPUFS(2) Linux Programmer's Manual SPUFS(2)
NAME
spufs - the SPU file system
DESCRIPTION
The SPU file system is used on PowerPC machines that implement the Cell
Broadband Engine Architecture in order to access Synergistic Processor
Units (SPUs).
The file system provides a name space similar to posix shared memory or
message queues. Users that have write permissions on the file system
can use spu_create(2) to establish SPU contexts in the spufs root.
Every SPU context is represented by a directory containing a predefined
set of files. These files can be used for manipulating the state of the
logical SPU. Users can change permissions on those files, but not actu-
ally add or remove files.
MOUNT OPTIONS
uid=<uid>
set the user owning the mount point, the default is 0 (root).
gid=<gid>
set the group owning the mount point, the default is 0 (root).
FILES
The files in spufs mostly follow the standard behavior for regular sys-
tem calls like read(2) or write(2), but often support only a subset of
the operations supported on regular file systems. This list details the
supported operations and the deviations from the behaviour in the
respective man pages.
All files that support the read(2) operation also support readv(2) and
all files that support the write(2) operation also support writev(2).
All files support the access(2) and stat(2) family of operations, but
only the st_mode, st_nlink, st_uid and st_gid fields of struct stat
contain reliable information.
All files support the chmod(2)/fchmod(2) and chown(2)/fchown(2) opera-
tions, but will not be able to grant permissions that contradict the
possible operations, e.g. read access on the wbox file.
The current set of files is:
/mem
the contents of the local storage memory of the SPU. This can be
accessed like a regular shared memory file and contains both code and
data in the address space of the SPU. The possible operations on an
open mem file are:
read(2), pread(2), write(2), pwrite(2), lseek(2)
These operate as documented, with the exception that seek(2),
write(2) and pwrite(2) are not supported beyond the end of the
file. The file size is the size of the local storage of the SPU,
which normally is 256 kilobytes.
mmap(2)
Mapping mem into the process address space gives access to the
SPU local storage within the process address space. Only
MAP_SHARED mappings are allowed.
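As a user-space sketch (the context path /spu/myctx is just an example
created earlier with spu_create(2), and LS_SIZE is the 256 kilobyte
local storage size mentioned above):

	#include <fcntl.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#define LS_SIZE		(256 * 1024)	/* SPU local storage size */

	static void *map_local_store(const char *mem_path)
	{
		int fd = open(mem_path, O_RDWR);	/* e.g. "/spu/myctx/mem" */
		void *ls;

		if (fd < 0)
			return NULL;

		/* only MAP_SHARED mappings are allowed on /mem */
		ls = mmap(NULL, LS_SIZE, PROT_READ | PROT_WRITE,
			  MAP_SHARED, fd, 0);
		close(fd);	/* the mapping stays valid after close */

		return ls == MAP_FAILED ? NULL : ls;
	}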
/mbox
The first SPU to CPU communication mailbox. This file is read-only and
can be read in units of 32 bits. The file can only be used in non-
blocking mode - even poll() will not block on it. The possible
operations on an open mbox file are:
read(2)
If a count smaller than four is requested, read returns -1 and
sets errno to EINVAL. If there is no data available in the mail
box, the return value is set to -1 and errno becomes EAGAIN.
When data has been read successfully, four bytes are placed in
the data buffer and the value four is returned.
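A corresponding user-space sketch for polling the mailbox (remember
that reads on an open mbox file never block):

	#include <errno.h>
	#include <stdint.h>
	#include <unistd.h>

	/* returns 1 if a word was read, 0 if the mailbox was empty, -1 on error */
	static int try_read_mbox(int mbox_fd, uint32_t *value)
	{
		ssize_t n = read(mbox_fd, value, sizeof(*value));

		if (n == sizeof(*value))
			return 1;
		if (n < 0 && errno == EAGAIN)
			return 0;
		return -1;		/* e.g. EINVAL for a short read request */
	}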
/ibox
The second SPU to CPU communication mailbox. This file is similar to
the first mailbox file, but can be read in blocking I/O mode, and the
poll family of system calls can be used to wait for it. The possible
operations on an open ibox file are:
read(2)
If a count smaller than four is requested, read returns -1 and
sets errno to EINVAL. If there is no data available in the mail
box and the file descriptor has been opened with O_NONBLOCK, the
return value is set to -1 and errno becomes EAGAIN.
If there is no data available in the mail box and the file
descriptor has been opened without O_NONBLOCK, the call will
block until the SPU writes to its interrupt mailbox channel.
When data has been read successfully, four bytes are placed in
the data buffer and the value four is returned.
poll(2)
Poll on the ibox file returns (POLLIN | POLLRDNORM) whenever
data is available for reading.
/wbox
The CPU to SPU communication mailbox. It is write-only and can be written
in units of 32 bits. If the mailbox is full, write() will block and
poll can be used to wait for it becoming empty again. The possible
operations on an open wbox file are:
write(2)
If a count smaller than four is requested, write returns -1 and
sets errno to EINVAL. If there
is no space available in the mail box and the file descriptor has been
opened with O_NONBLOCK, the return value is set to -1 and errno becomes
EAGAIN.
If there is no space available in the mail box and the file descriptor
has been opened without O_NONBLOCK, the call will block until the SPU
reads from its PPE mailbox channel. When the data has been written
successfully, four bytes are copied from the data buffer and the value
four is returned.
poll(2)
Poll on the wbox file returns (POLLOUT | POLLWRNORM) whenever
space is available for writing.
/mbox_stat
/ibox_stat
/wbox_stat
Read-only files that contain the length of the current queue, i.e. how
many words can be read from mbox or ibox or how many words can be
written to wbox without blocking. The files can be read only in 4-byte
units and return a big-endian binary integer number. The possible
operations on an open *box_stat file are:
read(2)
If a count smaller than four is requested, read returns -1 and
sets errno to EINVAL. Otherwise, a four byte value is placed in
the data buffer, containing the number of elements that can be
read from (for mbox_stat and ibox_stat) or written to (for
wbox_stat) the respective mail box without blocking or resulting
in EAGAIN.
/npc
/decr
/decr_status
/spu_tag_mask
/event_mask
/srr0
Internal registers of the SPU. The representation is an ASCII string
with the numeric value of the next instruction to be executed. These
can be used in read/write mode for debugging, but normal operation of
programs should not rely on them because access to any of them except
npc requires an SPU context save and is therefore very inefficient.
The contents of these files are:
npc Next Program Counter
decr SPU Decrementer
decr_status Decrementer Status
spu_tag_mask MFC tag mask for SPU DMA
event_mask Event mask for SPU interrupts
srr0 Interrupt Return address register
The possible operations on an open npc, decr, decr_status,
spu_tag_mask, event_mask or srr0 file are:
read(2)
When the count supplied to the read call is shorter than the
required length for the pointer value plus a newline character,
subsequent reads from the same file descriptor will result in
completing the string, regardless of changes to the register by
a running SPU task. When a complete string has been read, all
subsequent read operations will return zero bytes and a new file
descriptor needs to be opened to read the value again.
write(2)
A write operation on the file results in setting the register to
the value given in the string. The string is parsed from the
beginning to the first non-numeric character or the end of the
buffer. Subsequent writes to the same file descriptor overwrite
the previous setting.
/fpcr
This file gives access to the Floating Point Status and Control Regis-
ter as a four byte long file. The operations on the fpcr file are:
read(2)
If a count smaller than four is requested, read returns -1 and
sets errno to EINVAL. Otherwise, a four byte value is placed in
the data buffer, containing the current value of the fpcr regis-
ter.
write(2)
If a count smaller than four is requested, write returns -1 and
sets errno to EINVAL. Otherwise, a four byte value is copied
from the data buffer, updating the value of the fpcr register.
/signal1
/signal2
The two signal notification channels of an SPU. These are read-write
files that operate on a 32 bit word. Writing to one of these files
triggers an interrupt on the SPU. The value written to the signal
files can be read from the SPU through a channel read or from host user
space through the file. After the value has been read by the SPU, it
is reset to zero. The possible operations on an open signal1 or sig-
nal2 file are:
read(2)
If a count smaller than four is requested, read returns -1 and
sets errno to EINVAL. Otherwise, a four byte value is placed in
the data buffer, containing the current value of the specified
signal notification register.
write(2)
If a count smaller than four is requested, write returns -1 and
sets errno to EINVAL. Otherwise, a four byte value is copied
from the data buffer, updating the value of the specified signal
notification register. The signal notification register will
either be replaced with the input data or will be updated to the
bitwise OR of the old value and the input data, depending on the
contents of the signal1_type or signal2_type file, respectively.
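For illustration, a hedged sketch of raising signal1 from the host side
(the context directory name is again hypothetical):

    #include <fcntl.h>
    #include <stdint.h>
    #include <unistd.h>

    int main(void)
    {
            int fd = open("/spu/example/signal1", O_WRONLY);
            uint32_t value = 1;     /* 32-bit word delivered to the SPU */

            if (fd < 0)
                    return 1;
            /* a count smaller than four would fail with EINVAL */
            if (write(fd, &value, 4) != 4) {
                    close(fd);
                    return 1;
            }
            close(fd);
            return 0;
    }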
/signal1_type
/signal2_type
These two files change the behavior of the signal1 and signal2 notifi-
cation files. They contain a numerical ASCII string which is read as
either "1" or "0". In mode 0 (overwrite), the hardware replaces the
contents of the signal channel with the data that is written to it. In
mode 1 (logical OR), the hardware accumulates the bits that are subse-
quently written to it. The possible operations on an open signal1_type
or signal2_type file are:
read(2)
When the count supplied to the read call is shorter than the
required length for the digit plus a newline character, subse-
quent reads from the same file descriptor will result in com-
pleting the string. When a complete string has been read, all
subsequent read operations will return zero bytes and a new file
descriptor needs to be opened to read the value again.
write(2)
A write operation on the file results in setting the register to
the value given in the string. The string is parsed from the
beginning to the first non-numeric character or the end of the
buffer. Subsequent writes to the same file descriptor overwrite
the previous setting.
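For example, selecting logical-OR mode for signal1 could look like the
following sketch (context directory hypothetical):

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
            int fd = open("/spu/example/signal1_type", O_WRONLY);

            if (fd < 0)
                    return 1;
            /* "1" selects mode 1 (logical OR); "0" selects mode 0 (overwrite) */
            write(fd, "1", 1);
            close(fd);
            return 0;
    }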
EXAMPLES
/etc/fstab entry
none /spu spufs gid=spu 0 0
AUTHORS
Arnd Bergmann <arndb@de.ibm.com>, Mark Nutter <mnutter@us.ibm.com>,
Ulrich Weigand <Ulrich.Weigand@de.ibm.com>
SEE ALSO
capabilities(7), close(2), spu_create(2), spu_run(2), spufs(7)
Linux 2005-09-28 SPUFS(2)
------------------------------------------------------------------------------
SPU_RUN(2) Linux Programmer's Manual SPU_RUN(2)
NAME
spu_run - execute an spu context
SYNOPSIS
#include <sys/spu.h>
int spu_run(int fd, unsigned int *npc, unsigned int *event);
DESCRIPTION
The spu_run system call is used on PowerPC machines that implement the
Cell Broadband Engine Architecture in order to access Synergistic Pro-
cessor Units (SPUs). It uses the fd that was returned from spu_cre-
ate(2) to address a specific SPU context. When the context gets sched-
uled to a physical SPU, it starts execution at the instruction pointer
passed in npc.
Execution of SPU code happens synchronously, meaning that spu_run does
not return while the SPU is still running. If there is a need to exe-
cute SPU code in parallel with other code on either the main CPU or
other SPUs, you need to create a new thread of execution first, e.g.
using the pthread_create(3) call.
When spu_run returns, the current value of the SPU instruction pointer
is written back to npc, so you can call spu_run again without updating
the pointers.
event can be a NULL pointer or point to an extended status code that
gets filled when spu_run returns. It can be one of the following con-
stants:
SPE_EVENT_DMA_ALIGNMENT
A DMA alignment error
SPE_EVENT_SPE_DATA_SEGMENT
A DMA segmentation error
SPE_EVENT_SPE_DATA_STORAGE
A DMA storage error
If NULL is passed as the event argument, these errors will result in a
signal delivered to the calling process.
RETURN VALUE
spu_run returns the value of the spu_status register or -1 to indicate
an error and set errno to one of the error codes listed below. The
spu_status register value contains a bit mask of status codes and
optionally a 14 bit code returned from the stop-and-signal instruction
on the SPU. The bit masks for the status codes are:
0x02 SPU was stopped by stop-and-signal.
0x04 SPU was stopped by halt.
0x08 SPU is waiting for a channel.
0x10 SPU is in single-step mode.
0x20 SPU has tried to execute an invalid instruction.
0x40 SPU has tried to access an invalid channel.
0x3fff0000
The bits masked with this value contain the code returned from
stop-and-signal.
One or more of the lower eight bits are always set, unless spu_run
returns an error code instead.
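As a rough sketch of how these bits might be decoded (the prototype is
taken from the SYNOPSIS above; fd is assumed to come from spu_create(2)):

    #include <sys/spu.h>    /* assumed to declare spu_run() as in the SYNOPSIS */

    int run_context(int fd)
    {
            unsigned int npc = 0;   /* start address in SPU local store */
            int status = spu_run(fd, &npc, NULL);

            if (status == -1)
                    return -1;      /* errno holds one of the codes below */
            if (status & 0x02)      /* stopped by stop-and-signal */
                    return (status & 0x3fff0000) >> 16;  /* 14 bit stop code */
            if (status & 0x20)      /* tried to execute an invalid instruction */
                    return -2;
            return 0;               /* e.g. stopped by halt (0x04) */
    }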
ERRORS
EAGAIN or EWOULDBLOCK
fd is in non-blocking mode and spu_run would block.
EBADF fd is not a valid file descriptor.
EFAULT npc is not a valid pointer or event is neither NULL nor a valid
pointer.
EINTR A signal occurred while spu_run was in progress. The npc value
has been updated to the new program counter value if necessary.
EINVAL fd is not a file descriptor returned from spu_create(2).
ENOMEM Insufficient memory was available to handle a page fault result-
ing from an MFC direct memory access.
ENOSYS the functionality is not provided by the current system, because
either the hardware does not provide SPUs or the spufs module is
not loaded.
NOTES
spu_run is meant to be used from libraries that implement a more
abstract interface to SPUs, not to be used from regular applications.
See http://www.bsc.es/projects/deepcomputing/linuxoncell/ for the rec-
ommended libraries.
CONFORMING TO
This call is Linux specific and only implemented by the ppc64 architec-
ture. Programs using this system call are not portable.
BUGS
The code does not yet fully implement all features outlined here.
AUTHOR
Arnd Bergmann <arndb@de.ibm.com>
SEE ALSO
capabilities(7), close(2), spu_create(2), spufs(7)
Linux 2005-09-28 SPU_RUN(2)
------------------------------------------------------------------------------
SPU_CREATE(2) Linux Programmer's Manual SPU_CREATE(2)
NAME
spu_create - create a new spu context
SYNOPSIS
#include <sys/types.h>
#include <sys/spu.h>
int spu_create(const char *pathname, int flags, mode_t mode);
DESCRIPTION
The spu_create system call is used on PowerPC machines that implement
the Cell Broadband Engine Architecture in order to access Synergistic
Processor Units (SPUs). It creates a new logical context for an SPU in
pathname and returns a handle associated with it. pathname must
point to a non-existing directory in the mount point of the SPU file
system (spufs). When spu_create is successful, a directory gets cre-
ated on pathname and it is populated with files.
The returned file handle can only be passed to spu_run(2) or closed,
other operations are not defined on it. When it is closed, all associ-
ated directory entries in spufs are removed. When the last file handle
pointing either inside of the context directory or to this file
descriptor is closed, the logical SPU context is destroyed.
The parameter flags can be zero or any bitwise or'd combination of the
following constants:
SPU_RAWIO
Allow mapping of some of the hardware registers of the SPU into
user space. This flag requires the CAP_SYS_RAWIO capability, see
capabilities(7).
The mode parameter specifies the permissions used for creating the new
directory in spufs. mode is modified with the user's umask(2) value
and then used for both the directory and the files contained in it. The
file permissions mask out some more bits of mode because they typically
support only read or write access. See stat(2) for a full list of the
possible mode values.
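As a sketch of typical use (the mount point /spu comes from the FILES
section below, the context name is hypothetical, and the prototype is the
one from the SYNOPSIS):

    #include <sys/types.h>
    #include <sys/stat.h>
    #include <sys/spu.h>    /* assumed to declare spu_create() as in the SYNOPSIS */

    int create_context(void)
    {
            /* creates the directory /spu/myctx and returns a handle for spu_run(2) */
            int ctx = spu_create("/spu/myctx", 0, S_IRWXU);

            if (ctx == -1)
                    return -1;      /* errno set to one of the codes listed below */
            /*
             * ... load a program into the context and pass ctx to spu_run(2);
             * closing ctx later removes the directory entries again.
             */
            return ctx;
    }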
RETURN VALUE
spu_create returns a new file descriptor. It may return -1 to indicate
an error condition and set errno to one of the error codes listed
below.
ERRORS
EACCES
The current user does not have write access on the spufs mount
point.
EEXIST An SPU context already exists at the given path name.
EFAULT pathname is not a valid string pointer in the current address
space.
EINVAL pathname is not a directory in the spufs mount point.
ELOOP Too many symlinks were found while resolving pathname.
EMFILE The process has reached its maximum open file limit.
ENAMETOOLONG
pathname was too long.
ENFILE The system has reached the global open file limit.
ENOENT Part of pathname could not be resolved.
ENOMEM The kernel could not allocate all resources required.
ENOSPC There are not enough SPU resources available to create a new
context or the user specific limit for the number of SPU con-
texts has been reached.
ENOSYS the functionality is not provided by the current system, because
either the hardware does not provide SPUs or the spufs module is
not loaded.
ENOTDIR
A part of pathname is not a directory.
NOTES
spu_create is meant to be used from libraries that implement a more
abstract interface to SPUs, not to be used from regular applications.
See http://www.bsc.es/projects/deepcomputing/linuxoncell/ for the rec-
ommended libraries.
FILES
pathname must point to a location beneath the mount point of spufs. By
convention, it gets mounted in /spu.
CONFORMING TO
This call is Linux specific and only implemented by the ppc64 architec-
ture. Programs using this system call are not portable.
BUGS
The code does not yet fully implement all features outlined here.
AUTHOR
Arnd Bergmann <arndb@de.ibm.com>
SEE ALSO
capabilities(7), close(2), spu_run(2), spufs(7)
Linux 2005-09-28 SPU_CREATE(2)
View file
@ -0,0 +1,259 @@
SQUASHFS 4.0 FILESYSTEM
=======================
Squashfs is a compressed read-only filesystem for Linux.
It uses zlib/lzo/xz compression to compress files, inodes and directories.
Inodes in the system are very small and all blocks are packed to minimise
data overhead. Block sizes greater than 4K are supported up to a maximum
of 1Mbytes (default block size 128K).
Squashfs is intended for general read-only filesystem use, for archival
use (i.e. in cases where a .tar.gz file may be used), and in constrained
block device/memory systems (e.g. embedded systems) where low overhead is
needed.
Mailing list: squashfs-devel@lists.sourceforge.net
Web site: www.squashfs.org
1. FILESYSTEM FEATURES
----------------------
Squashfs filesystem features versus Cramfs:
Squashfs Cramfs
Max filesystem size: 2^64 256 MiB
Max file size: ~ 2 TiB 16 MiB
Max files: unlimited unlimited
Max directories: unlimited unlimited
Max entries per directory: unlimited unlimited
Max block size: 1 MiB 4 KiB
Metadata compression: yes no
Directory indexes: yes no
Sparse file support: yes no
Tail-end packing (fragments): yes no
Exportable (NFS etc.): yes no
Hard link support: yes no
"." and ".." in readdir: yes no
Real inode numbers: yes no
32-bit uids/gids: yes no
File creation time: yes no
Xattr support: yes no
ACL support: no no
Squashfs compresses data, inodes and directories. In addition, inode and
directory data are highly compacted, and packed on byte boundaries. Each
compressed inode is on average 8 bytes in length (the exact length varies with
file type, i.e. regular file, directory, symbolic link, and block/char device
inodes have different sizes).
2. USING SQUASHFS
-----------------
As squashfs is a read-only filesystem, the mksquashfs program must be used to
create populated squashfs filesystems. This and other squashfs utilities
can be obtained from http://www.squashfs.org. Usage instructions can be
obtained from this site also.
The squashfs-tools development tree is now located on kernel.org
git://git.kernel.org/pub/scm/fs/squashfs/squashfs-tools.git
3. SQUASHFS FILESYSTEM DESIGN
-----------------------------
A squashfs filesystem consists of a maximum of nine parts, packed together on a
byte alignment:
---------------
| superblock |
|---------------|
| compression |
| options |
|---------------|
| datablocks |
| & fragments |
|---------------|
| inode table |
|---------------|
| directory |
| table |
|---------------|
| fragment |
| table |
|---------------|
| export |
| table |
|---------------|
| uid/gid |
| lookup table |
|---------------|
| xattr |
| table |
---------------
Compressed data blocks are written to the filesystem as files are read from
the source directory, and checked for duplicates. Once all file data has been
written the completed inode, directory, fragment, export, uid/gid lookup and
xattr tables are written.
3.1 Compression options
-----------------------
Compressors can optionally support compression specific options (e.g.
dictionary size). If non-default compression options have been used, then
these are stored here.
3.2 Inodes
----------
Metadata (inodes and directories) are compressed in 8Kbyte blocks. Each
compressed block is prefixed by a two byte length, the top bit is set if the
block is uncompressed. A block will be uncompressed if the -noI option is set,
or if the compressed block was larger than the uncompressed block.
Inodes are packed into the metadata blocks, and are not aligned to block
boundaries, therefore inodes overlap compressed blocks. Inodes are identified
by a 48-bit number which encodes the location of the compressed metadata block
containing the inode, and the byte offset into that block where the inode is
placed (<block, offset>).
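As a rough illustration of the two encodings described above (the two byte
metadata block header and the <block, offset> inode reference), a hedged C
sketch follows; the helper names are hypothetical and the authoritative
definitions live in fs/squashfs.

    /* Two byte metadata block header: the low 15 bits are the on-disk
     * length, the top bit is set when the block is stored uncompressed. */
    static inline int meta_block_is_uncompressed(unsigned short header)
    {
            return header & 0x8000;
    }

    static inline unsigned short meta_block_length(unsigned short header)
    {
            return header & 0x7fff;
    }

    /* 48-bit inode reference <block, offset>: the upper bits locate the
     * compressed metadata block, the low 16 bits are the byte offset into
     * the uncompressed block. */
    static inline unsigned int inode_ref_block(unsigned long long ref)
    {
            return (unsigned int)(ref >> 16);
    }

    static inline unsigned int inode_ref_offset(unsigned long long ref)
    {
            return (unsigned int)(ref & 0xffff);
    }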
To maximise compression there are different inodes for each file type
(regular file, directory, device, etc.), the inode contents and length
varying with the type.
To further maximise compression, two types of regular file inode and
directory inode are defined: inodes optimised for frequently occurring
regular files and directories, and extended types where extra
information has to be stored.
3.3 Directories
---------------
Like inodes, directories are packed into compressed metadata blocks, stored
in a directory table. Directories are accessed using the start address of
the metablock containing the directory and the offset into the
decompressed block (<block, offset>).
Directories are organised in a slightly complex way, and are not simply
a list of file names. The organisation takes advantage of the
fact that (in most cases) the inodes of the files will be in the same
compressed metadata block, and therefore, can share the start block.
Directories are therefore organised in a two level list, a directory
header containing the shared start block value, and a sequence of directory
entries, each of which shares that start block. A new directory header
is written once/if the inode start block changes. The directory
header/directory entry list is repeated as many times as necessary.
Directories are sorted, and can contain a directory index to speed up
file lookup. Directory indexes store one entry per metablock, each entry
storing the index/filename mapping to the first directory header
in each metadata block. Directories are sorted in alphabetical order,
and at lookup the index is scanned linearly looking for the first filename
alphabetically larger than the filename being looked up. At this point the
location of the metadata block the filename is in has been found.
The general idea of the index is to ensure only one metadata block needs to be
decompressed to do a lookup irrespective of the length of the directory.
This scheme has the advantage that it doesn't require extra memory overhead
and doesn't require much extra storage on disk.
3.4 File data
-------------
Regular files consist of a sequence of contiguous compressed blocks, and/or a
compressed fragment block (tail-end packed block). The compressed size
of each datablock is stored in a block list contained within the
file inode.
To speed up access to datablocks when reading 'large' files (256 Mbytes or
larger), the code implements an index cache that caches the mapping from
block index to datablock location on disk.
The index cache allows Squashfs to handle large files (up to 1.75 TiB) while
retaining a simple and space-efficient block list on disk. The cache
is split into slots, caching up to eight 224 GiB files (128 KiB blocks).
Larger files use multiple slots, with 1.75 TiB files using all 8 slots.
The index cache is designed to be memory efficient, and by default uses
16 KiB.
3.5 Fragment lookup table
-------------------------
Regular files can contain a fragment index which is mapped to a fragment
location on disk and compressed size using a fragment lookup table. This
fragment lookup table is itself stored compressed into metadata blocks.
A second index table is used to locate these. For speed of access (and
because it is small), this second index table is read at mount time and
cached in memory.
3.6 Uid/gid lookup table
------------------------
For space efficiency regular files store uid and gid indexes, which are
converted to 32-bit uids/gids using an id look up table. This table is
stored compressed into metadata blocks. A second index table is used to
locate these. For speed of access (and because it is small), this second
index table is read at mount time and cached in memory.
3.7 Export table
----------------
To enable Squashfs filesystems to be exportable (via NFS etc.) filesystems
can optionally (disabled with the -no-exports Mksquashfs option) contain
an inode number to inode disk location lookup table. This is required to
enable Squashfs to map inode numbers passed in filehandles to the inode
location on disk, which is necessary when the export code reinstantiates
expired/flushed inodes.
This table is stored compressed into metadata blocks. A second index table is
used to locate these. For speed of access (and because it is small), this
second index table is read at mount time and cached in memory.
3.8 Xattr table
---------------
The xattr table contains extended attributes for each inode. The xattrs
for each inode are stored in a list, each list entry containing a type,
name and value field. The type field encodes the xattr prefix
("user.", "trusted." etc) and it also encodes how the name/value fields
should be interpreted. Currently the type indicates whether the value
is stored inline (in which case the value field contains the xattr value),
or if it is stored out of line (in which case the value field stores a
reference to where the actual value is stored). This allows large values
to be stored out of line improving scanning and lookup performance and it
also allows values to be de-duplicated, the value being stored once, and
all other occurrences holding an out of line reference to that value.
The xattr lists are packed into compressed 8K metadata blocks.
To reduce overhead in inodes, rather than storing the on-disk
location of the xattr list inside each inode, a 32-bit xattr id
is stored. This xattr id is mapped into the location of the xattr
list using a second xattr id lookup table.
4. TODOS AND OUTSTANDING ISSUES
-------------------------------
4.1 Todo list
-------------
Implement ACL support.
4.2 Squashfs internal cache
---------------------------
Blocks in Squashfs are compressed. To avoid repeatedly decompressing
recently accessed data Squashfs uses two small metadata and fragment caches.
The cache is not used for file datablocks, these are decompressed and cached in
the page-cache in the normal way. The cache is used to temporarily cache
fragment and metadata blocks which have been read as a result of a metadata
(i.e. inode or directory) or fragment access. Because metadata and fragments
are packed together into blocks (to gain greater compression) the read of a
particular piece of metadata or fragment will retrieve other metadata/fragments
which have been packed with it; because of locality of reference these may be
read in the near future. Temporarily caching them ensures they are available
for near future access without requiring an additional read and decompress.
In the future this internal cache may be replaced with an implementation which
uses the kernel page cache. Because the page cache operates on page sized
units this may introduce additional complexity in terms of locking and
associated race conditions.
View file
@ -0,0 +1,120 @@
Accessing PCI device resources through sysfs
--------------------------------------------
sysfs, usually mounted at /sys, provides access to PCI resources on platforms
that support it. For example, a given bus might look like this:
/sys/devices/pci0000:17
|-- 0000:17:00.0
| |-- class
| |-- config
| |-- device
| |-- enable
| |-- irq
| |-- local_cpus
| |-- remove
| |-- resource
| |-- resource0
| |-- resource1
| |-- resource2
| |-- rom
| |-- subsystem_device
| |-- subsystem_vendor
| `-- vendor
`-- ...
The topmost element describes the PCI domain and bus number. In this case,
the domain number is 0000 and the bus number is 17 (both values are in hex).
This bus contains a single function device in slot 0. The domain and bus
numbers are reproduced for convenience. Under the device directory are several
files, each with their own function.
file function
---- --------
class PCI class (ascii, ro)
config PCI config space (binary, rw)
device PCI device (ascii, ro)
enable Whether the device is enabled (ascii, rw)
irq IRQ number (ascii, ro)
local_cpus nearby CPU mask (cpumask, ro)
remove remove device from kernel's list (ascii, wo)
resource PCI resource host addresses (ascii, ro)
resource0..N PCI resource N, if present (binary, mmap, rw[1])
resource0_wc..N_wc PCI WC map resource N, if prefetchable (binary, mmap)
rom PCI ROM resource, if present (binary, ro)
subsystem_device PCI subsystem device (ascii, ro)
subsystem_vendor PCI subsystem vendor (ascii, ro)
vendor PCI vendor (ascii, ro)
ro - read only file
rw - file is readable and writable
wo - write only file
mmap - file is mmapable
ascii - file contains ascii text
binary - file contains binary data
cpumask - file contains a cpumask type
[1] rw for RESOURCE_IO (I/O port) regions only
The read only files are informational; writes to them will be ignored, with
the exception of the 'rom' file. Writable files can be used to perform
actions on the device (e.g. changing config space, detaching a device).
mmapable files are available via an mmap of the file at offset 0 and can be
used to do actual device programming from userspace. Note that some platforms
don't support mmapping of certain resources, so be sure to check the return
value from any attempted mmap. The most notable of these are I/O port
resources, which also provide read/write access.
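As an illustration, a hedged sketch of mapping resource0 of the example
device above (the 4 KiB length is a placeholder; real code should take the
region size from the 'resource' file and check that mmap is supported):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            int fd = open("/sys/devices/pci0000:17/0000:17:00.0/resource0",
                          O_RDWR);
            void *bar;

            if (fd < 0)
                    return 1;
            /* map the first 4 KiB of BAR 0; always check the return value,
             * since some platforms or resource types cannot be mmapped */
            bar = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            if (bar == MAP_FAILED) {
                    close(fd);
                    return 1;
            }
            printf("first register word: 0x%x\n",
                   *(volatile unsigned int *)bar);
            munmap(bar, 4096);
            close(fd);
            return 0;
    }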
The 'enable' file provides a counter that indicates how many times the device
has been enabled. If the 'enable' file currently returns '4', and a '1' is
echoed into it, it will then return '5'. Echoing a '0' into it will decrease
the count. Even when it returns to 0, though, some of the initialisation
may not be reversed.
The 'rom' file is special in that it provides read-only access to the device's
ROM file, if available. It's disabled by default, however, so applications
should write the string "1" to the file to enable it before attempting a read
call, and disable it following the access by writing "0" to the file. Note
that the device must be enabled for a rom read to return data successfully.
In the event a driver is not bound to the device, it can be enabled using the
'enable' file, documented above.
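A sketch of that enable/read/disable sequence, assuming the 'rom' file
accepts the enabling write described above (path and buffer handling are
left to the caller):

    #include <fcntl.h>
    #include <unistd.h>

    /* returns the number of ROM bytes read, or -1 on error */
    static ssize_t read_rom(const char *rom_path, char *buf, size_t len)
    {
            ssize_t n;
            int fd = open(rom_path, O_RDWR);

            if (fd < 0)
                    return -1;
            write(fd, "1", 1);      /* enable access to the ROM image */
            lseek(fd, 0, SEEK_SET);
            n = read(fd, buf, len); /* device must be enabled to get data */
            write(fd, "0", 1);      /* disable access again */
            close(fd);
            return n;
    }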
The 'remove' file is used to remove the PCI device, by writing a non-zero
integer to the file. This does not involve any kind of hot-plug functionality,
e.g. powering off the device. The device is removed from the kernel's list of
PCI devices, the sysfs directory for it is removed, and the device will be
removed from any drivers attached to it. Removal of PCI root buses is
disallowed.
Accessing legacy resources through sysfs
----------------------------------------
Legacy I/O port and ISA memory resources are also provided in sysfs if the
underlying platform supports them. They're located in the PCI class hierarchy,
e.g.
/sys/class/pci_bus/0000:17/
|-- bridge -> ../../../devices/pci0000:17
|-- cpuaffinity
|-- legacy_io
`-- legacy_mem
The legacy_io file is a read/write file that can be used by applications to
do legacy port I/O. The application should open the file, seek to the desired
port (e.g. 0x3e8) and do a read or a write of 1, 2 or 4 bytes. The legacy_mem
file should be mmapped with an offset corresponding to the memory offset
desired, e.g. 0xa0000 for the VGA frame buffer. The application can then
simply dereference the returned pointer (after checking for errors of course)
to access legacy memory space.
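A sketch of the port-I/O pattern described above, assuming the platform
exposes the legacy files for this bus:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            int fd = open("/sys/class/pci_bus/0000:17/legacy_io", O_RDWR);
            unsigned char value;

            if (fd < 0)
                    return 1;
            /* seek to the port of interest, then do a 1, 2 or 4 byte access */
            lseek(fd, 0x3e8, SEEK_SET);
            if (read(fd, &value, 1) == 1)
                    printf("port 0x3e8: 0x%02x\n", value);
            close(fd);
            return 0;
    }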
Supporting PCI access on new platforms
--------------------------------------
In order to support PCI resource mapping as described above, Linux platform
code must define HAVE_PCI_MMAP and provide a pci_mmap_page_range function.
Platforms are free to only support subsets of the mmap functionality, but
useful return codes should be provided.
Legacy resources are protected by the HAVE_PCI_LEGACY define. Platforms
wishing to support legacy functionality should define it and provide
pci_legacy_read, pci_legacy_write and pci_mmap_legacy_page_range functions.
View file
@ -0,0 +1,42 @@
Sysfs tagging
-------------
(Taken almost verbatim from Eric Biederman's netns tagging patch
commit msg)
The problem. Network devices show up in sysfs and with the network
namespace active multiple devices with the same name can show up in
the same directory, ouch!
To avoid that problem and allow existing applications in network
namespaces to see the same interface that is currently presented in
sysfs, sysfs now has tagging directory support.
By using the network namespace pointers as tags to separate out the
sysfs directory entries we ensure that we don't have conflicts
in the directories and applications only see a limited set of
the network devices.
Each sysfs directory entry may be tagged with zero or one
namespaces. A sysfs_dirent is augmented with a void *s_ns. If a
directory entry is tagged, then sysfs_dirent->s_flags will have a
flag between KOBJ_NS_TYPE_NONE and KOBJ_NS_TYPES, and s_ns will
point to the namespace to which it belongs.
Each sysfs superblock's sysfs_super_info contains an array void
*ns[KOBJ_NS_TYPES]. When a task in a tagging namespace
kobj_nstype first mounts sysfs, a new superblock is created. It
will be differentiated from other sysfs mounts by having its
s_fs_info->ns[kobj_nstype] set to the new namespace. Note that
through bind mounting and mounts propagation, a task can easily view
the contents of other namespaces' sysfs mounts. Therefore, when a
namespace exits, it will call kobj_ns_exit() to invalidate any
sysfs_dirent->s_ns pointers pointing to it.
Users of this interface:
- define a type in the kobj_ns_type enumeration.
- call kobj_ns_type_register() with its kobj_ns_type_operations which has
- current_ns() which returns current's namespace
- netlink_ns() which returns a socket's namespace
- initial_ns() which returns the initial namespace
- call kobj_ns_exit() when an individual tag is no longer valid
View file
@ -0,0 +1,380 @@
sysfs - _The_ filesystem for exporting kernel objects.
Patrick Mochel <mochel@osdl.org>
Mike Murphy <mamurph@cs.clemson.edu>
Revised: 16 August 2011
Original: 10 January 2003
What it is:
~~~~~~~~~~~
sysfs is a ram-based filesystem initially based on ramfs. It provides
a means to export kernel data structures, their attributes, and the
linkages between them to userspace.
sysfs is tied inherently to the kobject infrastructure. Please read
Documentation/kobject.txt for more information concerning the kobject
interface.
Using sysfs
~~~~~~~~~~~
sysfs is always compiled in if CONFIG_SYSFS is defined. You can access
it by doing:
mount -t sysfs sysfs /sys
Directory Creation
~~~~~~~~~~~~~~~~~~
For every kobject that is registered with the system, a directory is
created for it in sysfs. That directory is created as a subdirectory
of the kobject's parent, expressing internal object hierarchies to
userspace. Top-level directories in sysfs represent the common
ancestors of object hierarchies; i.e. the subsystems the objects
belong to.
Sysfs internally stores a pointer to the kobject that implements a
directory in the sysfs_dirent object associated with the directory. In
the past this kobject pointer has been used by sysfs to do reference
counting directly on the kobject whenever the file is opened or closed.
With the current sysfs implementation the kobject reference count is
only modified directly by the function sysfs_schedule_callback().
Attributes
~~~~~~~~~~
Attributes can be exported for kobjects in the form of regular files in
the filesystem. Sysfs forwards file I/O operations to methods defined
for the attributes, providing a means to read and write kernel
attributes.
Attributes should be ASCII text files, preferably with only one value
per file. It is noted that it may not be efficient to contain only one
value per file, so it is socially acceptable to express an array of
values of the same type.
Mixing types, expressing multiple lines of data, and doing fancy
formatting of data is heavily frowned upon. Doing these things may get
you publicly humiliated and your code rewritten without notice.
An attribute definition is simply:
struct attribute {
char * name;
struct module *owner;
umode_t mode;
};
int sysfs_create_file(struct kobject * kobj, const struct attribute * attr);
void sysfs_remove_file(struct kobject * kobj, const struct attribute * attr);
A bare attribute contains no means to read or write the value of the
attribute. Subsystems are encouraged to define their own attribute
structure and wrapper functions for adding and removing attributes for
a specific object type.
For example, the driver model defines struct device_attribute like:
struct device_attribute {
struct attribute attr;
ssize_t (*show)(struct device *dev, struct device_attribute *attr,
char *buf);
ssize_t (*store)(struct device *dev, struct device_attribute *attr,
const char *buf, size_t count);
};
int device_create_file(struct device *, const struct device_attribute *);
void device_remove_file(struct device *, const struct device_attribute *);
It also defines this helper for defining device attributes:
#define DEVICE_ATTR(_name, _mode, _show, _store) \
struct device_attribute dev_attr_##_name = __ATTR(_name, _mode, _show, _store)
For example, declaring
static DEVICE_ATTR(foo, S_IWUSR | S_IRUGO, show_foo, store_foo);
is equivalent to doing:
static struct device_attribute dev_attr_foo = {
.attr = {
.name = "foo",
.mode = S_IWUSR | S_IRUGO,
},
.show = show_foo,
.store = store_foo,
};
Subsystem-Specific Callbacks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
When a subsystem defines a new attribute type, it must implement a
set of sysfs operations for forwarding read and write calls to the
show and store methods of the attribute owners.
struct sysfs_ops {
ssize_t (*show)(struct kobject *, struct attribute *, char *);
ssize_t (*store)(struct kobject *, struct attribute *, const char *, size_t);
};
[ Subsystems should have already defined a struct kobj_type as a
descriptor for this type, which is where the sysfs_ops pointer is
stored. See the kobject documentation for more information. ]
When a file is read or written, sysfs calls the appropriate method
for the type. The method then translates the generic struct kobject
and struct attribute pointers to the appropriate pointer types, and
calls the associated methods.
To illustrate:
#define to_dev(obj) container_of(obj, struct device, kobj)
#define to_dev_attr(_attr) container_of(_attr, struct device_attribute, attr)
static ssize_t dev_attr_show(struct kobject *kobj, struct attribute *attr,
char *buf)
{
struct device_attribute *dev_attr = to_dev_attr(attr);
struct device *dev = to_dev(kobj);
ssize_t ret = -EIO;
if (dev_attr->show)
ret = dev_attr->show(dev, dev_attr, buf);
if (ret >= (ssize_t)PAGE_SIZE) {
print_symbol("dev_attr_show: %s returned bad count\n",
(unsigned long)dev_attr->show);
}
return ret;
}
Reading/Writing Attribute Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To read or write attributes, show() or store() methods must be
specified when declaring the attribute. The method types should be as
simple as those defined for device attributes:
ssize_t (*show)(struct device *dev, struct device_attribute *attr, char *buf);
ssize_t (*store)(struct device *dev, struct device_attribute *attr,
const char *buf, size_t count);
IOW, they should take only an object, an attribute, and a buffer as parameters.
sysfs allocates a buffer of size (PAGE_SIZE) and passes it to the
method. Sysfs will call the method exactly once for each read or
write. This forces the following behavior on the method
implementations:
- On read(2), the show() method should fill the entire buffer.
Recall that an attribute should only be exporting one value, or an
array of similar values, so this shouldn't be that expensive.
This allows userspace to do partial reads and forward seeks
arbitrarily over the entire file at will. If userspace seeks back to
zero or does a pread(2) with an offset of '0' the show() method will
be called again, rearmed, to fill the buffer.
- On write(2), sysfs expects the entire buffer to be passed during the
first write. Sysfs then passes the entire buffer to the store()
method.
When writing sysfs files, userspace processes should first read the
entire file, modify the values it wishes to change, then write the
entire buffer back.
Attribute method implementations should operate on an identical
buffer when reading and writing values.
Other notes:
- Writing causes the show() method to be rearmed regardless of current
file position.
- The buffer will always be PAGE_SIZE bytes in length. On i386, this
is 4096.
- show() methods should return the number of bytes printed into the
buffer. This is the return value of scnprintf().
- show() should always use scnprintf().
- store() should return the number of bytes used from the buffer. If the
entire buffer has been used, just return the count argument.
- show() or store() can always return errors. If a bad value comes
through, be sure to return an error.
- The object passed to the methods will be pinned in memory via sysfs
reference counting of its embedded object. However, the physical
entity (e.g. device) the object represents may not be present. Be
sure to have a way to check this, if necessary.
A very simple (and naive) implementation of a device attribute is:
static ssize_t show_name(struct device *dev, struct device_attribute *attr,
char *buf)
{
return scnprintf(buf, PAGE_SIZE, "%s\n", dev->name);
}
static ssize_t store_name(struct device *dev, struct device_attribute *attr,
const char *buf, size_t count)
{
snprintf(dev->name, sizeof(dev->name), "%.*s",
(int)min(count, sizeof(dev->name) - 1), buf);
return count;
}
static DEVICE_ATTR(name, S_IRUGO, show_name, store_name);
(Note that the real implementation doesn't allow userspace to set the
name for a device.)
Top Level Directory Layout
~~~~~~~~~~~~~~~~~~~~~~~~~~
The sysfs directory arrangement exposes the relationship of kernel
data structures.
The top level sysfs directory looks like:
block/
bus/
class/
dev/
devices/
firmware/
net/
fs/
devices/ contains a filesystem representation of the device tree. It maps
directly to the internal kernel device tree, which is a hierarchy of
struct device.
bus/ contains flat directory layout of the various bus types in the
kernel. Each bus's directory contains two subdirectories:
devices/
drivers/
devices/ contains symlinks for each device discovered in the system
that point to the device's directory under root/.
drivers/ contains a directory for each device driver that is loaded
for devices on that particular bus (this assumes that drivers do not
span multiple bus types).
fs/ contains a directory for some filesystems. Currently each
filesystem wanting to export attributes must create its own hierarchy
below fs/ (see ./fuse.txt for an example).
dev/ contains two directories char/ and block/. Inside these two
directories there are symlinks named <major>:<minor>. These symlinks
point to the sysfs directory for the given device. /sys/dev provides a
quick way to lookup the sysfs interface for a device from the result of
a stat(2) operation.
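For instance, a user-space sketch of that lookup (the device node is just an
example):

    #include <stdio.h>
    #include <sys/stat.h>
    #include <sys/sysmacros.h>      /* major(), minor() */

    int main(void)
    {
            struct stat st;
            char path[64];

            if (stat("/dev/ttyS0", &st) != 0)
                    return 1;
            /* the <major>:<minor> symlink points at the device's sysfs directory */
            snprintf(path, sizeof(path), "/sys/dev/char/%u:%u",
                     major(st.st_rdev), minor(st.st_rdev));
            printf("%s\n", path);
            return 0;
    }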
More information on driver-model specific features can be found in
Documentation/driver-model/.
TODO: Finish this section.
Current Interfaces
~~~~~~~~~~~~~~~~~~
The following interface layers currently exist in sysfs:
- devices (include/linux/device.h)
----------------------------------
Structure:
struct device_attribute {
struct attribute attr;
ssize_t (*show)(struct device *dev, struct device_attribute *attr,
char *buf);
ssize_t (*store)(struct device *dev, struct device_attribute *attr,
const char *buf, size_t count);
};
Declaring:
DEVICE_ATTR(_name, _mode, _show, _store);
Creation/Removal:
int device_create_file(struct device *dev, const struct device_attribute * attr);
void device_remove_file(struct device *dev, const struct device_attribute * attr);
- bus drivers (include/linux/device.h)
--------------------------------------
Structure:
struct bus_attribute {
struct attribute attr;
ssize_t (*show)(struct bus_type *, char * buf);
ssize_t (*store)(struct bus_type *, const char * buf, size_t count);
};
Declaring:
BUS_ATTR(_name, _mode, _show, _store)
Creation/Removal:
int bus_create_file(struct bus_type *, struct bus_attribute *);
void bus_remove_file(struct bus_type *, struct bus_attribute *);
- device drivers (include/linux/device.h)
-----------------------------------------
Structure:
struct driver_attribute {
struct attribute attr;
ssize_t (*show)(struct device_driver *, char * buf);
ssize_t (*store)(struct device_driver *, const char * buf,
size_t count);
};
Declaring:
DRIVER_ATTR(_name, _mode, _show, _store)
Creation/Removal:
int driver_create_file(struct device_driver *, const struct driver_attribute *);
void driver_remove_file(struct device_driver *, const struct driver_attribute *);
Documentation
~~~~~~~~~~~~~
The sysfs directory structure and the attributes in each directory define an
ABI between the kernel and user space. As for any ABI, it is important that
this ABI is stable and properly documented. All new sysfs attributes must be
documented in Documentation/ABI. See also Documentation/ABI/README for more
information.
View file
@ -0,0 +1,197 @@
It implements all of
- Xenix FS,
- SystemV/386 FS,
- Coherent FS.
To install:
* Answer the 'System V and Coherent filesystem support' question with 'y'
when configuring the kernel.
* To mount a disk or a partition, use
mount [-r] -t sysv device mountpoint
The file system type names
-t sysv
-t xenix
-t coherent
may be used interchangeably, but the last two will eventually disappear.
Bugs in the present implementation:
- Coherent FS:
- The "free list interleave" n:m is currently ignored.
- Only file systems with no filesystem name and no pack name are recognized.
(See Coherent "man mkfs" for a description of these features.)
- SystemV Release 2 FS:
The superblock is only searched in the blocks 9, 15, 18, which
corresponds to the beginning of track 1 on floppy disks. No support
for this FS on hard disk yet.
These filesystems are rather similar. Here is a comparison with Minix FS:
* Linux fdisk reports on partitions
- Minix FS 0x81 Linux/Minix
- Xenix FS ??
- SystemV FS ??
- Coherent FS 0x08 AIX bootable
* Size of a block or zone (data allocation unit on disk)
- Minix FS 1024
- Xenix FS 1024 (also 512 ??)
- SystemV FS 1024 (also 512 and 2048)
- Coherent FS 512
* General layout: all have one boot block, one super block and
separate areas for inodes and for directories/data.
On SystemV Release 2 FS (e.g. Microport) the first track is reserved and
all the block numbers (including the super block) are offset by one track.
* Byte ordering of "short" (16 bit entities) on disk:
- Minix FS little endian 0 1
- Xenix FS little endian 0 1
- SystemV FS little endian 0 1
- Coherent FS little endian 0 1
Of course, this affects only the file system, not the data of files on it!
* Byte ordering of "long" (32 bit entities) on disk:
- Minix FS little endian 0 1 2 3
- Xenix FS little endian 0 1 2 3
- SystemV FS little endian 0 1 2 3
- Coherent FS PDP-11 2 3 0 1
Of course, this affects only the file system, not the data of files on it!
* Inode on disk: "short", 0 means non-existent, the root dir ino is:
- Minix FS 1
- Xenix FS, SystemV FS, Coherent FS 2
* Maximum number of hard links to a file:
- Minix FS 250
- Xenix FS ??
- SystemV FS ??
- Coherent FS >=10000
* Free inode management:
- Minix FS a bitmap
- Xenix FS, SystemV FS, Coherent FS
There is a cache of a certain number of free inodes in the super-block.
When it is exhausted, new free inodes are found using a linear search.
* Free block management:
- Minix FS a bitmap
- Xenix FS, SystemV FS, Coherent FS
Free blocks are organized in a "free list". Maybe a misleading term,
since it is not true that every free block contains a pointer to
the next free block. Rather, the free blocks are organized in chunks
of limited size, and every now and then a free block contains pointers
to the free blocks pertaining to the next chunk; the first of these
contains pointers and so on. The list terminates with a "block number"
0 on Xenix FS and SystemV FS, with a block zeroed out on Coherent FS.
* Super-block location:
- Minix FS block 1 = bytes 1024..2047
- Xenix FS block 1 = bytes 1024..2047
- SystemV FS bytes 512..1023
- Coherent FS block 1 = bytes 512..1023
* Super-block layout:
- Minix FS
unsigned short s_ninodes;
unsigned short s_nzones;
unsigned short s_imap_blocks;
unsigned short s_zmap_blocks;
unsigned short s_firstdatazone;
unsigned short s_log_zone_size;
unsigned long s_max_size;
unsigned short s_magic;
- Xenix FS, SystemV FS, Coherent FS
unsigned short s_firstdatazone;
unsigned long s_nzones;
unsigned short s_fzone_count;
unsigned long s_fzones[NICFREE];
unsigned short s_finode_count;
unsigned short s_finodes[NICINOD];
char s_flock;
char s_ilock;
char s_modified;
char s_rdonly;
unsigned long s_time;
short s_dinfo[4]; -- SystemV FS only
unsigned long s_free_zones;
unsigned short s_free_inodes;
short s_dinfo[4]; -- Xenix FS only
unsigned short s_interleave_m,s_interleave_n; -- Coherent FS only
char s_fname[6];
char s_fpack[6];
then they differ considerably:
Xenix FS
char s_clean;
char s_fill[371];
long s_magic;
long s_type;
SystemV FS
long s_fill[12 or 14];
long s_state;
long s_magic;
long s_type;
Coherent FS
unsigned long s_unique;
Note that Coherent FS has no magic.
* Inode layout:
- Minix FS
unsigned short i_mode;
unsigned short i_uid;
unsigned long i_size;
unsigned long i_time;
unsigned char i_gid;
unsigned char i_nlinks;
unsigned short i_zone[7+1+1];
- Xenix FS, SystemV FS, Coherent FS
unsigned short i_mode;
unsigned short i_nlink;
unsigned short i_uid;
unsigned short i_gid;
unsigned long i_size;
unsigned char i_zone[3*(10+1+1+1)];
unsigned long i_atime;
unsigned long i_mtime;
unsigned long i_ctime;
* Regular file data blocks are organized as
- Minix FS
7 direct blocks
1 indirect block (pointers to blocks)
1 double-indirect block (pointer to pointers to blocks)
- Xenix FS, SystemV FS, Coherent FS
10 direct blocks
1 indirect block (pointers to blocks)
1 double-indirect block (pointer to pointers to blocks)
1 triple-indirect block (pointer to pointers to pointers to blocks)
* Inode size, inodes per block
- Minix FS 32 32
- Xenix FS 64 16
- SystemV FS 64 16
- Coherent FS 64 8
* Directory entry on disk
- Minix FS
unsigned short inode;
char name[14/30];
- Xenix FS, SystemV FS, Coherent FS
unsigned short inode;
char name[14];
* Dir entry size, dir entries per block
- Minix FS 16/32 64/32
- Xenix FS 16 64
- SystemV FS 16 64
- Coherent FS 16 32
* How to implement symbolic links such that the host fsck doesn't scream:
- Minix FS normal
- Xenix FS kludge: as regular files with chmod 1000
- SystemV FS ??
- Coherent FS kludge: as regular files with chmod 1000
Notation: We often speak of a "block" but mean a zone (the allocation unit)
and not the disk driver's notion of "block".
View file
@ -0,0 +1,148 @@
Tmpfs is a file system which keeps all files in virtual memory.
Everything in tmpfs is temporary in the sense that no files will be
created on your hard drive. If you unmount a tmpfs instance,
everything stored therein is lost.
tmpfs puts everything into the kernel internal caches and grows and
shrinks to accommodate the files it contains and is able to swap
unneeded pages out to swap space. It has maximum size limits which can
be adjusted on the fly via 'mount -o remount ...'
If you compare it to ramfs (which was the template to create tmpfs)
you gain swapping and limit checking. Another similar thing is the RAM
disk (/dev/ram*), which simulates a fixed size hard disk in physical
RAM, where you have to create an ordinary filesystem on top. Ramdisks
cannot swap and you do not have the possibility to resize them.
Since tmpfs lives completely in the page cache and on swap, all tmpfs
pages currently in memory will show up as cached. It will not show up
as shared or something like that. Further on you can check the actual
RAM+swap use of a tmpfs instance with df(1) and du(1).
tmpfs has the following uses:
1) There is always a kernel internal mount which you will not see at
all. This is used for shared anonymous mappings and SYSV shared
memory.
This mount does not depend on CONFIG_TMPFS. If CONFIG_TMPFS is not
set, the user visible part of tmpfs is not built. But the internal
mechanisms are always present.
2) glibc 2.2 and above expects tmpfs to be mounted at /dev/shm for
POSIX shared memory (shm_open, shm_unlink). Adding the following
line to /etc/fstab should take care of this:
tmpfs /dev/shm tmpfs defaults 0 0
Remember to create the directory that you intend to mount tmpfs on
if necessary.
This mount is _not_ needed for SYSV shared memory. The internal
mount is used for that. (In the 2.3 kernel versions it was
necessary to mount the predecessor of tmpfs (shm fs) to use SYSV
shared memory)
3) Some people (including me) find it very convenient to mount it
e.g. on /tmp and /var/tmp and have a big swap partition. And now
loop mounts of tmpfs files do work, so mkinitrd shipped by most
distributions should succeed with a tmpfs /tmp.
4) And probably a lot more I do not know about :-)
tmpfs has three mount options for sizing:
size: The limit of allocated bytes for this tmpfs instance. The
default is half of your physical RAM without swap. If you
oversize your tmpfs instances the machine will deadlock
since the OOM handler will not be able to free that memory.
nr_blocks: The same as size, but in blocks of PAGE_CACHE_SIZE.
nr_inodes: The maximum number of inodes for this instance. The default
is half of the number of your physical RAM pages, or (on a
machine with highmem) the number of lowmem RAM pages,
whichever is the lower.
These parameters accept a suffix k, m or g for kilo, mega and giga and
can be changed on remount. The size parameter also accepts a suffix %
to limit this tmpfs instance to that percentage of your physical RAM:
the default, when neither size nor nr_blocks is specified, is size=50%
If nr_blocks=0 (or size=0), blocks will not be limited in that instance;
if nr_inodes=0, inodes will not be limited. It is generally unwise to
mount with such options, since it allows any user with write access to
use up all the memory on the machine; but enhances the scalability of
that instance in a system with many cpus making intensive use of it.
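The same limits can also be applied programmatically with mount(2); a
minimal sketch, assuming /mytmp already exists and the caller has the
required privilege:

    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
            /* equivalent to: mount -t tmpfs -o size=50%,nr_inodes=10k tmpfs /mytmp */
            if (mount("tmpfs", "/mytmp", "tmpfs", 0,
                      "size=50%,nr_inodes=10k") != 0) {
                    perror("mount");
                    return 1;
            }
            return 0;
    }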
tmpfs has a mount option to set the NUMA memory allocation policy for
all files in that instance (if CONFIG_NUMA is enabled) - which can be
adjusted on the fly via 'mount -o remount ...'
mpol=default use the process allocation policy
(see set_mempolicy(2))
mpol=prefer:Node prefers to allocate memory from the given Node
mpol=bind:NodeList allocates memory only from nodes in NodeList
mpol=interleave prefers to allocate from each node in turn
mpol=interleave:NodeList allocates from each node of NodeList in turn
mpol=local prefers to allocate memory from the local node
NodeList format is a comma-separated list of decimal numbers and ranges,
a range being two hyphen-separated decimal numbers, the smallest and
largest node numbers in the range. For example, mpol=bind:0-3,5,7,9-15
A memory policy with a valid NodeList will be saved, as specified, for
use at file creation time. When a task allocates a file in the file
system, the mount option memory policy will be applied with a NodeList,
if any, modified by the calling task's cpuset constraints
[See Documentation/cgroups/cpusets.txt] and any optional flags, listed
below. If the resulting NodeList is the empty set, the effective memory
policy for the file will revert to "default" policy.
NUMA memory allocation policies have optional flags that can be used in
conjunction with their modes. These optional flags can be specified
when tmpfs is mounted by appending them to the mode before the NodeList.
See Documentation/vm/numa_memory_policy.txt for a list of all available
memory allocation policy mode flags and their effect on memory policy.
=static is equivalent to MPOL_F_STATIC_NODES
=relative is equivalent to MPOL_F_RELATIVE_NODES
For example, mpol=bind=static:NodeList is the equivalent of an
allocation policy of MPOL_BIND | MPOL_F_STATIC_NODES.
Note that trying to mount a tmpfs with an mpol option will fail if the
running kernel does not support NUMA; and will fail if its nodelist
specifies a node which is not online. If your system relies on that
tmpfs being mounted, but from time to time runs a kernel built without
NUMA capability (perhaps a safe recovery kernel), or with fewer nodes
online, then it is advisable to omit the mpol option from automatic
mount options. It can be added later, when the tmpfs is already mounted
on MountPoint, by 'mount -o remount,mpol=Policy:NodeList MountPoint'.
To specify the initial root directory you can use the following mount
options:
mode: The permissions as an octal number
uid: The user id
gid: The group id
These options do not have any effect on remount. You can change these
parameters with chmod(1), chown(1) and chgrp(1) on a mounted filesystem.
So 'mount -t tmpfs -o size=10G,nr_inodes=10k,mode=700 tmpfs /mytmpfs'
will give you a tmpfs instance on /mytmpfs which can allocate 10GB of
RAM/swap in 10240 inodes, and it is only accessible by root.
Author:
Christoph Rohland <cr@sap.com>, 1.12.01
Updated:
Hugh Dickins, 4 June 2007
Updated:
KOSAKI Motohiro, 16 Mar 2010
View file
@ -0,0 +1,119 @@
Introduction
=============
UBIFS file-system stands for UBI File System. UBI stands for "Unsorted
Block Images". UBIFS is a flash file system, which means it is designed
to work with flash devices. It is important to understand that UBIFS
is completely different to any traditional file-system in Linux, like
Ext2, XFS, JFS, etc. UBIFS represents a separate class of file-systems
which work with MTD devices, not block devices. The other Linux
file-system of this class is JFFS2.
To make it more clear, here is a small comparison of MTD devices and
block devices.
1 MTD devices represent flash devices and they consist of eraseblocks of
rather large size, typically about 128KiB. Block devices consist of
small blocks, typically 512 bytes.
2 MTD devices support 3 main operations - read from some offset within an
eraseblock, write to some offset within an eraseblock, and erase a whole
eraseblock. Block devices support 2 main operations - read a whole
block and write a whole block.
3 The whole eraseblock has to be erased before it becomes possible to
re-write its contents. Blocks may be just re-written.
4 Eraseblocks become worn out after some number of erase cycles -
typically 100K-1G for SLC NAND and NOR flashes, and 1K-10K for MLC
NAND flashes. Blocks do not have the wear-out property.
5 Eraseblocks may become bad (only on NAND flashes) and software should
deal with this. Blocks on hard drives typically do not become bad,
because hardware has mechanisms to substitute bad blocks, at least in
modern LBA disks.
It should be quite obvious why UBIFS is very different to traditional
file-systems.
UBIFS works on top of UBI. UBI is a separate software layer which may be
found in drivers/mtd/ubi. UBI is basically a volume management and
wear-leveling layer. It provides so-called UBI volumes, which are a higher
level abstraction than an MTD device. The programming model of UBI devices
is very similar to MTD devices - they still consist of large eraseblocks,
they have read/write/erase operations, but UBI devices are devoid of
limitations like wear and bad blocks (items 4 and 5 in the above list).
In a sense, UBIFS is a next generation of JFFS2 file-system, but it is
very different from and incompatible with JFFS2. The following are the main
differences.
* JFFS2 works on top of MTD devices, UBIFS depends on UBI and works on
top of UBI volumes.
* JFFS2 does not have on-media index and has to build it while mounting,
which requires full media scan. UBIFS maintains the FS indexing
information on the flash media and does not require full media scan,
so it mounts many times faster than JFFS2.
* JFFS2 is a write-through file-system, while UBIFS supports write-back,
which makes UBIFS much faster on writes.
Similarly to JFFS2, UBIFS supports on-the-fly compression which makes
it possible to fit quite a lot of data to the flash.
Similarly to JFFS2, UBIFS is tolerant of unclean reboots and power-cuts.
It does not need stuff like fsck.ext2. UBIFS automatically replays its
journal and recovers from crashes, ensuring that the on-flash data
structures are consistent.
UBIFS scales logarithmically (most of the data structures it uses are
trees), so the mount time and memory consumption do not linearly depend
on the flash size, like in case of JFFS2. This is because UBIFS
maintains the FS index on the flash media. However, UBIFS depends on
UBI, which scales linearly. So overall UBI/UBIFS stack scales linearly.
Nevertheless, UBI/UBIFS scales considerably better than JFFS2.
The authors of UBIFS believe that it is possible to develop UBI2 which
would scale logarithmically as well. UBI2 would support the same API as UBI,
but it would be binary incompatible with UBI. So UBIFS would not need to be
changed to use UBI2.
Mount options
=============
(*) == default.
bulk_read read more in one go to take advantage of flash
media that read faster sequentially
no_bulk_read (*) do not bulk-read
no_chk_data_crc (*) skip checking of CRCs on data nodes in order to
improve read performance. Use this option only
if the flash media is highly reliable. The effect
of this option is that corruption of the contents
of a file can go unnoticed.
chk_data_crc do not skip checking CRCs on data nodes
compr=none override default compressor and set it to "none"
compr=lzo override default compressor and set it to "lzo"
compr=zlib override default compressor and set it to "zlib"
Quick usage instructions
========================
The UBI volume to mount is specified using "ubiX_Y" or "ubiX:NAME" syntax,
where "X" is UBI device number, "Y" is UBI volume number, and "NAME" is
UBI volume name.
Mount volume 0 on UBI device 0 to /mnt/ubifs:
$ mount -t ubifs ubi0_0 /mnt/ubifs
Mount "rootfs" volume of UBI device 0 to /mnt/ubifs ("rootfs" is volume
name):
$ mount -t ubifs ubi0:rootfs /mnt/ubifs
The following is an example of the kernel boot arguments to attach mtd0
to UBI and mount volume "rootfs":
ubi.mtd=0 root=ubi0:rootfs rootfstype=ubifs
References
==========
UBIFS documentation and FAQ/HOWTO at the MTD web site:
http://www.linux-mtd.infradead.org/doc/ubifs.html
http://www.linux-mtd.infradead.org/faq/ubifs.html