mirror of
https://github.com/AetherDroid/android_kernel_samsung_on5xelte.git
synced 2025-09-05 07:57:45 -04:00
Fixed MTP to work with TWRP
commit
f6dfaef42e
50820 changed files with 20846062 additions and 0 deletions
32
Documentation/powerpc/00-INDEX
Normal file
@@ -0,0 +1,32 @@
Index of files in Documentation/powerpc. If you think something about
Linux/PPC needs an entry here, needs correction or you've written one
please mail me.
Cort Dougan (cort@fsmlabs.com)

00-INDEX
        - this file
bootwrapper.txt
        - Information on how the powerpc kernel is wrapped for boot on various
          different platforms.
cpu_features.txt
        - info on how we support a variety of CPUs with minimal compile-time
          options.
cxl.txt
        - Overview of the CXL driver.
eeh-pci-error-recovery.txt
        - info on PCI Bus EEH Error Recovery
firmware-assisted-dump.txt
        - Documentation on the firmware assisted dump mechanism "fadump".
hvcs.txt
        - IBM "Hypervisor Virtual Console Server" Installation Guide
mpc52xx.txt
        - Linux 2.6.x on MPC52xx family
pmu-ebb.txt
        - Description of the API for using the PMU with Event Based Branches.
qe_firmware.txt
        - describes the layout of firmware binaries for the Freescale QUICC
          Engine and the code that parses and uploads the microcode therein.
ptrace.txt
        - Information on the ptrace interfaces for hardware debug registers.
transactional_memory.txt
        - Overview of the Power8 transactional memory support.
141
Documentation/powerpc/bootwrapper.txt
Normal file
@@ -0,0 +1,141 @@
The PowerPC boot wrapper
------------------------
Copyright (C) Secret Lab Technologies Ltd.

PowerPC image targets compress and wrap the kernel image (vmlinux) with
a boot wrapper to make it usable by the system firmware. There is no
standard PowerPC firmware interface, so the boot wrapper is designed to
be adaptable for each kind of image that needs to be built.

The boot wrapper can be found in the arch/powerpc/boot/ directory. The
Makefile in that directory has targets for all the available image types.
The different image types are used to support all of the various firmware
interfaces found on PowerPC platforms. OpenFirmware is the most commonly
used firmware type on general purpose PowerPC systems from Apple, IBM and
others. U-Boot is typically found on embedded PowerPC hardware, but there
are a handful of other firmware implementations which are also popular. Each
firmware interface requires a different image format.

The boot wrapper is built from the Makefile in arch/powerpc/boot/ and
it uses the wrapper script (arch/powerpc/boot/wrapper) to generate the target
image. The details of the build system are discussed in the next section.
Currently, the following image format targets exist:

   cuImage.%:           Backwards compatible uImage for older versions of
                        U-Boot (for versions that don't understand the device
                        tree). This image embeds a device tree blob inside
                        the image. The boot wrapper, kernel and device tree
                        are all embedded inside the U-Boot uImage file format
                        with boot wrapper code that extracts data from the old
                        bd_info structure and loads the data into the device
                        tree before jumping into the kernel.
                        Because of the series of #ifdefs found in the
                        bd_info structure used in the old U-Boot interfaces,
                        cuImages are platform specific. Each specific
                        U-Boot platform has a different platform init file
                        which populates the embedded device tree with data
                        from the platform specific bd_info file. The platform
                        specific cuImage platform init code can be found in
                        arch/powerpc/boot/cuboot.*.c. Selection of the correct
                        cuImage init code for a specific board can be found in
                        the wrapper structure.
   dtbImage.%:          Similar to zImage, except the device tree blob is
                        embedded inside the image instead of provided by
                        firmware. The output image file can be either an elf
                        file or a flat binary depending on the platform.
                        dtbImages are used on systems which do not have an
                        interface for passing a device tree directly.
                        dtbImages are similar to simpleImages except that
                        dtbImages have platform specific code for extracting
                        data from the board firmware, but simpleImages do not
                        talk to the firmware at all.
                        PlayStation 3 support uses dtbImage. So do Embedded
                        Planet boards using the PlanetCore firmware. Board
                        specific initialization code is typically found in a
                        file named arch/powerpc/boot/<platform>.c; but this
                        can be overridden by the wrapper script.
   simpleImage.%:       Firmware independent compressed image that does not
                        depend on any particular firmware interface and embeds
                        a device tree blob. This image is a flat binary that
                        can be loaded to any location in RAM and jumped to.
                        Firmware cannot pass any configuration data to the
                        kernel with this image type and it depends entirely on
                        the embedded device tree for all information.
                        The simpleImage is useful for booting systems with
                        an unknown firmware interface or for booting from
                        a debugger when no firmware is present (such as on
                        the Xilinx Virtex platform). The only assumption that
                        simpleImage makes is that RAM is correctly initialized
                        and that the MMU is either off or has RAM mapped to
                        base address 0.
                        simpleImage also supports inserting special platform
                        specific initialization code at the start of the bootup
                        sequence. The virtex405 platform uses this feature to
                        ensure that the cache is invalidated before caching
                        is enabled. Platform specific initialization code is
                        added as part of the wrapper script and is keyed on
                        the image target name. For example, all
                        simpleImage.virtex405-* targets will add the
                        virtex405-head.S initialization code (this also means
                        that the dts file for virtex405 targets should be
                        named virtex405-<board>.dts). Search the wrapper
                        script for 'virtex405' and see the file
                        arch/powerpc/boot/virtex405-head.S for details.
   treeImage.%:         Image format for use with OpenBIOS firmware found
                        on some ppc4xx hardware. This image embeds a device
                        tree blob inside the image.
   uImage:              Native image format used by U-Boot. The uImage target
                        does not add any boot code. It just wraps a compressed
                        vmlinux in the uImage data structure. This image
                        requires a version of U-Boot that is able to pass
                        a device tree to the kernel at boot. If using an older
                        version of U-Boot, then you need to use a cuImage
                        instead.
   zImage.%:            Image format which does not embed a device tree.
                        Used by OpenFirmware and other firmware interfaces
                        which are able to supply a device tree. This image
                        expects firmware to provide the device tree at boot.
                        Typically, if you have general purpose PowerPC
                        hardware then you want this image format.

Image types which embed a device tree blob (simpleImage, dtbImage, treeImage,
and cuImage) all generate the device tree blob from a file in the
arch/powerpc/boot/dts/ directory. The Makefile selects the correct device
tree source based on the name of the target. Therefore, if the kernel is
built with 'make treeImage.walnut simpleImage.virtex405-ml403', then the
build system will use arch/powerpc/boot/dts/walnut.dts to build
treeImage.walnut and arch/powerpc/boot/dts/virtex405-ml403.dts to build
the simpleImage.virtex405-ml403.
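
The target-name-to-dts mapping described above can be illustrated with a
small C sketch. This is only a model of the naming rule: the real selection
is done by pattern rules in arch/powerpc/boot/Makefile, and the function
name dts_path_for_target() is hypothetical, not part of the kernel tree.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Illustrative model only: derive the device tree source path for an
 * image target such as "treeImage.walnut".  Everything after the first
 * '.' in the target name is taken as the dts base name.
 *
 * Returns 0 on success, -1 if the target has no ".<board>" suffix or
 * the result does not fit in the output buffer.
 */
int dts_path_for_target(const char *target, char *out, size_t outlen)
{
	const char *dot = strchr(target, '.');
	int n;

	if (dot == NULL || dot[1] == '\0')
		return -1;

	n = snprintf(out, outlen, "arch/powerpc/boot/dts/%s.dts", dot + 1);
	if (n < 0 || (size_t)n >= outlen)
		return -1;
	return 0;
}
```

The sketch only makes sense for the targets that embed a device tree
(simpleImage, dtbImage, treeImage and cuImage); a plain 'uImage' target has
no board suffix and is rejected.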

Two special targets called 'zImage' and 'zImage.initrd' also exist. These
targets build all the default images as selected by the kernel configuration.
Default images are selected by the boot wrapper Makefile
(arch/powerpc/boot/Makefile) by adding targets to the $image-y variable. Look
at the Makefile to see which default image targets are available.

How it is built
---------------
arch/powerpc is designed to support multiplatform kernels, which means
that a single vmlinux image can be booted on many different target boards.
It also means that the boot wrapper must be able to wrap for many kinds of
images on a single build. The design decision was made to not use any
conditional compilation code (#ifdef, etc) in the boot wrapper source code.
All of the boot wrapper pieces are buildable at any time regardless of the
kernel configuration. Building all the wrapper bits on every kernel build
also ensures that obscure parts of the wrapper are at the very least compile
tested in a large variety of environments.

The wrapper is adapted for different image types at link time by linking in
just the wrapper bits that are appropriate for the image type. The 'wrapper
script' (found in arch/powerpc/boot/wrapper) is called by the Makefile and
is responsible for selecting the correct wrapper bits for the image type.
The arguments are well documented in the script's comment block, so they
are not repeated here. However, it is worth mentioning that the script
uses the -p (platform) argument as the main method of deciding which wrapper
bits to compile in. Look for the large 'case "$platform" in' block in the
middle of the script. This is also the place where platform specific fixups
can be selected by changing the link order.

In particular, care should be taken when working with cuImages. cuImage
wrapper bits are very board specific and care should be taken to make sure
the target you are trying to build is supported by the wrapper bits.
221
Documentation/powerpc/cpu_families.txt
Normal file
@@ -0,0 +1,221 @@
CPU Families
============

This document tries to summarise some of the different cpu families that exist
and are supported by arch/powerpc.


Book3S (aka sPAPR)
------------------

 - Hash MMU
 - Mix of 32 & 64 bit

   +--------------+                 +----------------+
   |  Old POWER   | --------------> | RS64 (threads) |
   +--------------+                 +----------------+
          |
          |
          v
   +--------------+                 +----------------+      +------+
   |     601      | --------------> |      603       | ---> | e300 |
   +--------------+                 +----------------+      +------+
          |                                 |
          |                                 |
          v                                 v
   +--------------+                 +----------------+      +-------+
   |     604      |                 |    750 (G3)    | ---> | 750CX |
   +--------------+                 +----------------+      +-------+
          |                                 |                   |
          |                                 |                   |
          v                                 v                   v
   +--------------+                 +----------------+      +-------+
   | 620 (64 bit) |                 |      7400      |      | 750CL |
   +--------------+                 +----------------+      +-------+
          |                                 |                   |
          |                                 |                   |
          v                                 v                   v
   +--------------+                 +----------------+      +-------+
   |  POWER3/630  |                 |      7410      |      | 750FX |
   +--------------+                 +----------------+      +-------+
          |                                 |
          |                                 |
          v                                 v
   +--------------+                 +----------------+
   |   POWER3+    |                 |      7450      |
   +--------------+                 +----------------+
          |                                 |
          |                                 |
          v                                 v
   +--------------+                 +----------------+
   |    POWER4    |                 |      7455      |
   +--------------+                 +----------------+
          |                                 |
          |                                 |
          v                                 v
   +--------------+     +-------+   +----------------+
   |   POWER4+    | --> |  970  |   |      7447      |
   +--------------+     +-------+   +----------------+
          |                 |               |
          |                 |               |
          v                 v               v
   +--------------+     +-------+   +----------------+
   |    POWER5    |     | 970FX |   |      7448      |
   +--------------+     +-------+   +----------------+
          |                 |               |
          |                 |               |
          v                 v               v
   +--------------+     +-------+   +----------------+
   |   POWER5+    |     | 970MP |   |      e600      |
   +--------------+     +-------+   +----------------+
          |
          |
          v
   +--------------+
   |   POWER5++   |
   +--------------+
          |
          |
          v
   +--------------+       +-------+
   |    POWER6    | <-?-> | Cell  |
   +--------------+       +-------+
          |
          |
          v
   +--------------+
   |    POWER7    |
   +--------------+
          |
          |
          v
   +--------------+
   |   POWER7+    |
   +--------------+
          |
          |
          v
   +--------------+
   |    POWER8    |
   +--------------+


   +---------------+
   | PA6T (64 bit) |
   +---------------+


IBM BookE
---------

 - Software loaded TLB.
 - All 32 bit

   +--------------+
   |     401      |
   +--------------+
          |
          |
          v
   +--------------+
   |     403      |
   +--------------+
          |
          |
          v
   +--------------+
   |     405      |
   +--------------+
          |
          |
          v
   +--------------+
   |     440      |
   +--------------+
          |
          |
          v
   +--------------+     +----------------+
   |     450      | --> |      BG/P      |
   +--------------+     +----------------+
          |
          |
          v
   +--------------+
   |     460      |
   +--------------+
          |
          |
          v
   +--------------+
   |     476      |
   +--------------+


Motorola/Freescale 8xx
----------------------

 - Software loaded with hardware assist.
 - All 32 bit

   +-------------+
   | MPC8xx Core |
   +-------------+


Freescale BookE
---------------

 - Software loaded TLB.
 - e6500 adds HW loaded indirect TLB entries.
 - Mix of 32 & 64 bit

   +--------------+
   |     e200     |
   +--------------+


   +--------------------------------+
   |              e500              |
   +--------------------------------+
                   |
                   |
                   v
   +--------------------------------+
   |             e500v2             |
   +--------------------------------+
                   |
                   |
                   v
   +--------------------------------+
   |        e500mc (Book3e)         |
   +--------------------------------+
                   |
                   |
                   v
   +--------------------------------+
   |         e5500 (64 bit)         |
   +--------------------------------+
                   |
                   |
                   v
   +--------------------------------+
   | e6500 (HW TLB) (Multithreaded) |
   +--------------------------------+


IBM A2 core
-----------

 - Book3E, software loaded TLB + HW loaded indirect TLB entries.
 - 64 bit

   +--------------+     +----------------+
   |   A2 core    | --> |      WSP       |
   +--------------+     +----------------+
          |
          |
          v
   +--------------+
   |     BG/Q     |
   +--------------+
56
Documentation/powerpc/cpu_features.txt
Normal file
@@ -0,0 +1,56 @@
Hollis Blanchard <hollis@austin.ibm.com>
5 Jun 2002

This document describes the system (including self-modifying code) used in the
PPC Linux kernel to support a variety of PowerPC CPUs without requiring
compile-time selection.

Early in the boot process the ppc32 kernel detects the current CPU type and
chooses a set of features accordingly. Some examples include Altivec support,
split instruction and data caches, and whether the CPU supports the DOZE and
NAP sleep modes.

Detection of the feature set is simple. A list of processors can be found in
arch/powerpc/kernel/cputable.c. The PVR register is masked and compared with
each value in the list. If a match is found, the cpu_features of cur_cpu_spec
is assigned to the feature bitmask for this processor and a __setup_cpu
function is called.

C code may test 'cur_cpu_spec[smp_processor_id()]->cpu_features' for a
particular feature bit. This is done in quite a few places, for example
in ppc_setup_l2cr().
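
The mask-and-compare lookup and the feature bit test described above might
look roughly like the following C sketch. It is not the contents of
cputable.c: the structure name, the feature bit names and the PVR values in
the table are all invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical feature bits, for illustration only. */
#define FTR_ALTIVEC      (1u << 0)
#define FTR_SPLIT_CACHE  (1u << 1)
#define FTR_CAN_NAP      (1u << 2)

/* Simplified stand-in for an entry in the kernel's processor list. */
struct cpu_spec_sketch {
	uint32_t pvr_mask;      /* bits of the PVR that identify the CPU */
	uint32_t pvr_value;     /* expected value of the masked PVR */
	const char *name;
	uint32_t cpu_features;  /* feature bitmask for this processor */
};

/* Example entries; the PVR values here are made up for the sketch. */
static const struct cpu_spec_sketch cpu_table[] = {
	{ 0xffff0000u, 0x00010000u, "cpu-a", FTR_SPLIT_CACHE },
	{ 0xffff0000u, 0x00020000u, "cpu-b",
	  FTR_SPLIT_CACHE | FTR_ALTIVEC | FTR_CAN_NAP },
	/* Catch-all entry: a mask of 0 matches any PVR. */
	{ 0x00000000u, 0x00000000u, "generic", 0 },
};

/* Mask the PVR and compare it with each table entry in order. */
const struct cpu_spec_sketch *identify_cpu(uint32_t pvr)
{
	size_t i;

	for (i = 0; i < sizeof(cpu_table) / sizeof(cpu_table[0]); i++)
		if ((pvr & cpu_table[i].pvr_mask) == cpu_table[i].pvr_value)
			return &cpu_table[i];
	return NULL;
}

/* Test a single feature bit, as C code does with cpu_features. */
int cpu_has(const struct cpu_spec_sketch *spec, uint32_t feature)
{
	return (spec->cpu_features & feature) != 0;
}
```

Because entries are tried in order, a catch-all entry has to come last;
more specific masks must appear before less specific ones.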

Implementing cpufeatures in assembly is a little more involved. There are
several paths that are performance-critical and would suffer if an array
index, structure dereference, and conditional branch were added. To avoid the
performance penalty but still allow for runtime (rather than compile-time) CPU
selection, unused code is replaced by 'nop' instructions. This nop'ing is
based on CPU 0's capabilities, so a multi-processor system with non-identical
processors will not work (but such a system would likely have other problems
anyway).

After detecting the processor type, the kernel patches out sections of code
that shouldn't be used by writing nop's over them. Using cpufeatures requires
just 2 macros (found in arch/powerpc/include/asm/cputable.h), as seen in head.S
transfer_to_handler:

	#ifdef CONFIG_ALTIVEC
	BEGIN_FTR_SECTION
		mfspr	r22,SPRN_VRSAVE		/* if G4, save vrsave register value */
		stw	r22,THREAD_VRSAVE(r23)
	END_FTR_SECTION_IFSET(CPU_FTR_ALTIVEC)
	#endif /* CONFIG_ALTIVEC */

If CPU 0 supports Altivec, the code is left untouched. If it doesn't, both
instructions are replaced with nop's.

The END_FTR_SECTION macro has two simpler variations: END_FTR_SECTION_IFSET
and END_FTR_SECTION_IFCLR. These simply test if a flag is set (in
cur_cpu_spec[0]->cpu_features) or is cleared, respectively. These two macros
should be used in the majority of cases.

The END_FTR_SECTION macros are implemented by storing information about this
code in the '__ftr_fixup' ELF section. When do_cpu_ftr_fixups
(arch/powerpc/kernel/misc.S) is invoked, it will iterate over the records in
__ftr_fixup, and if the required feature is not present it will loop writing
nop's from each BEGIN_FTR_SECTION to END_FTR_SECTION.
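
The fixup pass described in cpu_features.txt can be modelled in plain C as
a loop over fixup records. This is a sketch under assumptions, not the
kernel's do_cpu_ftr_fixups: the record layout, the names and the use of
array indices instead of addresses are all simplifications. The only
hardware fact relied on is that the PowerPC nop is encoded as 0x60000000
(ori 0,0,0).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PPC_NOP 0x60000000u		/* ori 0,0,0 */

/* Hypothetical feature bit used by the example record. */
#define FTR_ALTIVEC_SKETCH (1u << 0)

/*
 * Simplified stand-in for one record in the fixup section.  A section is
 * kept when (cpu_features & mask) == value; an "IFSET" style test uses
 * value == mask, an "IFCLR" style test uses value == 0.
 */
struct ftr_fixup_sketch {
	uint32_t mask;	/* feature bits to test */
	uint32_t value;	/* required value of those bits */
	size_t start;	/* index of the first instruction in the section */
	size_t end;	/* index one past the last instruction */
};

/* Overwrite every section whose feature test fails with nops. */
void apply_ftr_fixups(uint32_t cpu_features, uint32_t *text,
		      const struct ftr_fixup_sketch *fixups, size_t nfixups)
{
	size_t i, j;

	for (i = 0; i < nfixups; i++) {
		if ((cpu_features & fixups[i].mask) == fixups[i].value)
			continue;	/* feature test passed: keep the code */
		for (j = fixups[i].start; j < fixups[i].end; j++)
			text[j] = PPC_NOP;
	}
}
```

Patching once at boot keeps the hot paths free of loads and branches, at
the cost of assuming that every CPU in the system matches CPU 0.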
379
Documentation/powerpc/cxl.txt
Normal file
@@ -0,0 +1,379 @@
Coherent Accelerator Interface (CXL)
====================================

Introduction
============

    The coherent accelerator interface is designed to allow the
    coherent connection of accelerators (FPGAs and other devices) to a
    POWER system. These devices need to adhere to the Coherent
    Accelerator Interface Architecture (CAIA).

    IBM refers to this as the Coherent Accelerator Processor Interface
    or CAPI. In the kernel it's referred to by the name CXL to avoid
    confusion with the ISDN CAPI subsystem.

    Coherent in this context means that the accelerator and CPUs can
    both access system memory directly and with the same effective
    addresses.


Hardware overview
=================

          POWER8              FPGA
       +----------+        +---------+
       |          |        |         |
       |   CPU    |        |   AFU   |
       |          |        |         |
       |          |        |         |
       |          |        |         |
       +----------+        +---------+
       |   PHB    |        |         |
       |   +------+        |   PSL   |
       |   | CAPP |<------>|         |
       +---+------+  PCIE  +---------+

    The POWER8 chip has a Coherently Attached Processor Proxy (CAPP)
    unit which is part of the PCIe Host Bridge (PHB). This is managed
    by Linux by calls into OPAL. Linux doesn't directly program the
    CAPP.

    The FPGA (or coherently attached device) consists of two parts:
    the POWER Service Layer (PSL) and the Accelerator Function Unit
    (AFU). The AFU is used to implement specific functionality behind
    the PSL. The PSL, among other things, provides memory address
    translation services to allow each AFU direct access to userspace
    memory.

    The AFU is the core part of the accelerator (e.g. the compression
    or crypto function). The kernel has no knowledge of the function
    of the AFU. Only userspace interacts directly with the AFU.

    The PSL provides the translation and interrupt services that the
    AFU needs. This is what the kernel interacts with. For example, if
    the AFU needs to read a particular effective address, it sends
    that address to the PSL, the PSL then translates it, fetches the
    data from memory and returns it to the AFU. If the PSL has a
    translation miss, it interrupts the kernel and the kernel services
    the fault. The context in which this fault is serviced is based on
    who owns that acceleration function.


AFU Modes
=========

    The AFU supports two programming modes: dedicated and AFU
    directed. An AFU may support one or both modes.

    When using dedicated mode only one MMU context is supported. In
    this mode, only one userspace process can use the accelerator at
    a time.

    When using AFU directed mode, up to 16K simultaneous contexts can
    be supported. This means up to 16K simultaneous userspace
    applications may use the accelerator (although specific AFUs may
    support fewer). In this mode, the AFU sends a 16 bit context ID
    with each of its requests. This tells the PSL which context is
    associated with each operation. If the PSL can't translate an
    operation, the ID can also be accessed by the kernel so it can
    determine the userspace context associated with an operation.


MMIO space
==========

    A portion of the accelerator MMIO space can be directly mapped
    from the AFU to userspace. Either the whole space can be mapped or
    just a per context portion. The hardware is self describing, hence
    the kernel can determine the offset and size of the per context
    portion.


Interrupts
==========

    AFUs may generate interrupts that are destined for userspace. These
    are received by the kernel as hardware interrupts and passed on to
    userspace by a read syscall documented below.

    Data storage faults and error interrupts are handled by the kernel
    driver.


Work Element Descriptor (WED)
=============================

    The WED is a 64-bit parameter passed to the AFU when a context is
    started. Its format is up to the AFU, hence the kernel has no
    knowledge of what it represents. Typically it will be the
    effective address of a work queue or status block where the AFU
    and userspace can share control and status information.


User API
========

    For AFUs operating in AFU directed mode, two character device
    files will be created. /dev/cxl/afu0.0m will correspond to a
    master context and /dev/cxl/afu0.0s will correspond to a slave
    context. Master contexts have access to the full MMIO space an
    AFU provides. Slave contexts have access to only the per process
    MMIO space an AFU provides.

    For AFUs operating in dedicated process mode, the driver will
    only create a single character device per AFU called
    /dev/cxl/afu0.0d. This will have access to the entire MMIO space
    that the AFU provides (like master contexts in AFU directed).

    The types described below are defined in include/uapi/misc/cxl.h.

    The following file operations are supported on both slave and
    master devices.


open
----

    Opens the device and allocates a file descriptor to be used with
    the rest of the API.

    A dedicated mode AFU only has one context and only allows the
    device to be opened once.

    An AFU directed mode AFU can have many contexts; the device can be
    opened once for each context that is available.

    When all available contexts are allocated the open call will fail
    and return -ENOSPC.

    Note: IRQs need to be allocated for each context, which may limit
          the number of contexts that can be created, and therefore
          how many times the device can be opened. The POWER8 CAPP
          supports 2040 IRQs and 3 are used by the kernel, so 2037 are
          left. If 1 IRQ is needed per context, then only 2037
          contexts can be allocated. If 4 IRQs are needed per context,
          then only 2037/4 = 509 contexts can be allocated.
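
    The arithmetic in the note above can be written out as a small C
    helper. The function name max_contexts() is hypothetical and is not
    part of the driver; the inputs used in the note are 2040 total IRQs
    with 3 reserved for the kernel.

```c
#include <assert.h>

/*
 * Hypothetical helper: number of contexts that fit within the IRQ
 * budget when each context needs irqs_per_context interrupts.
 * Integer division rounds down, so a partial context is not counted.
 */
int max_contexts(int total_irqs, int kernel_irqs, int irqs_per_context)
{
	if (irqs_per_context <= 0 || total_irqs < kernel_irqs)
		return 0;
	return (total_irqs - kernel_irqs) / irqs_per_context;
}
```

    With the figures from the note, max_contexts(2040, 3, 1) gives 2037
    and max_contexts(2040, 3, 4) gives 509.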


ioctl
-----

    CXL_IOCTL_START_WORK:
        Starts the AFU context and associates it with the current
        process. Once this ioctl is successfully executed, all memory
        mapped into this process is accessible to this AFU context
        using the same effective addresses. No additional calls are
        required to map/unmap memory. The AFU memory context will be
        updated as userspace allocates and frees memory. This ioctl
        returns once the AFU context is started.

        Takes a pointer to a struct cxl_ioctl_start_work:

                struct cxl_ioctl_start_work {
                        __u64 flags;
                        __u64 work_element_descriptor;
                        __u64 amr;
                        __s16 num_interrupts;
                        __s16 reserved1;
                        __s32 reserved2;
                        __u64 reserved3;
                        __u64 reserved4;
                        __u64 reserved5;
                        __u64 reserved6;
                };

            flags:
                Indicates which optional fields in the structure are
                valid.

            work_element_descriptor:
                The Work Element Descriptor (WED) is a 64-bit argument
                defined by the AFU. Typically this is an effective
                address pointing to an AFU specific structure
                describing what work to perform.

            amr:
                Authority Mask Register (AMR), same as the powerpc
                AMR. This field is only used by the kernel when the
                corresponding CXL_START_WORK_AMR value is specified in
                flags. If not specified the kernel will use a default
                value of 0.

            num_interrupts:
                Number of userspace interrupts to request. This field
                is only used by the kernel when the corresponding
                CXL_START_WORK_NUM_IRQS value is specified in flags.
                If not specified the minimum number required by the
                AFU will be allocated. The min and max number can be
                obtained from sysfs.

            reserved fields:
                For ABI padding and future extensions

    CXL_IOCTL_GET_PROCESS_ELEMENT:
        Get the current context id, also known as the process element.
        The value is returned from the kernel as a __u32.
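
    As a sketch of how the START_WORK argument might be filled in, the
    following C fragment mirrors the structure listed above with fixed
    width types so that it compiles without the kernel header. The
    struct name, the helper prepare_start_work() and the two SKETCH_
    flag values are assumptions made for this example; real code should
    include <misc/cxl.h> and use struct cxl_ioctl_start_work and the
    CXL_START_WORK_* constants defined there.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of struct cxl_ioctl_start_work, for illustration only. */
struct start_work_sketch {
	uint64_t flags;
	uint64_t work_element_descriptor;
	uint64_t amr;
	int16_t  num_interrupts;
	int16_t  reserved1;
	int32_t  reserved2;
	uint64_t reserved3;
	uint64_t reserved4;
	uint64_t reserved5;
	uint64_t reserved6;
};

/* Assumed flag values; take the real ones from <misc/cxl.h>. */
#define SKETCH_START_WORK_AMR		0x1ULL
#define SKETCH_START_WORK_NUM_IRQS	0x2ULL

/*
 * Zero the whole structure (including the reserved fields), set the
 * WED, and request an explicit interrupt count only when num_irqs is
 * positive.  Leaving the NUM_IRQS flag clear asks for the AFU minimum.
 */
void prepare_start_work(struct start_work_sketch *w, uint64_t wed,
			int16_t num_irqs)
{
	memset(w, 0, sizeof(*w));
	w->work_element_descriptor = wed;
	if (num_irqs > 0) {
		w->flags |= SKETCH_START_WORK_NUM_IRQS;
		w->num_interrupts = num_irqs;
	}
}
```

    With the real header, the filled structure would then be passed to
    ioctl(fd, CXL_IOCTL_START_WORK, &work) on a descriptor obtained by
    opening one of the /dev/cxl/afuX.Y devices.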


mmap
----

    An AFU may have an MMIO space to facilitate communication with the
    AFU. If it does, the MMIO space can be accessed via mmap. The size
    and contents of this area are specific to the particular AFU. The
    size can be discovered via sysfs.

    In AFU directed mode, master contexts are allowed to map all of
    the MMIO space and slave contexts are allowed to only map the per
    process MMIO space associated with the context. In dedicated
    process mode the entire MMIO space can always be mapped.

    This mmap call must be done after the START_WORK ioctl.

    Care should be taken when accessing MMIO space. Only 32 and 64-bit
    accesses are supported by POWER8. Also, the AFU will be designed
    with a specific endianness, so all MMIO accesses should consider
    endianness (the endian(3) variants such as le64toh() and
    be64toh() are recommended). These endian issues equally apply to
    shared memory queues the WED may describe.
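
    A minimal sketch of endian-aware 64-bit register reads, assuming a
    glibc style <endian.h> that provides be64toh() and le64toh(). The
    helper names afu_read_be64() and afu_read_le64() are hypothetical;
    which one applies depends on the endianness the AFU was designed
    with.

```c
#define _DEFAULT_SOURCE	/* for be64toh()/le64toh() in glibc's <endian.h> */
#include <assert.h>
#include <endian.h>
#include <stdint.h>
#include <string.h>

/* Read a 64-bit register of a big-endian AFU in host byte order. */
uint64_t afu_read_be64(const volatile uint64_t *reg)
{
	uint64_t raw = *reg;	/* single 64-bit access */

	return be64toh(raw);
}

/* Read a 64-bit register of a little-endian AFU in host byte order. */
uint64_t afu_read_le64(const volatile uint64_t *reg)
{
	uint64_t raw = *reg;	/* single 64-bit access */

	return le64toh(raw);
}
```

    The pointer would normally come from mmap() on the AFU device; the
    volatile qualifier keeps the compiler from merging or dropping the
    accesses. The same conversions apply to fields read from a shared
    memory queue.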


read
----

    Reads events from the AFU. Blocks if no events are pending
    (unless O_NONBLOCK is supplied). Returns -EIO in the case of an
    unrecoverable error or if the card is removed.

    read() will always return an integral number of events.

    The buffer passed to read() must be at least 4K bytes.

    The result of the read will be a buffer of one or more events.
    Each event is of type struct cxl_event and varies in size.

            struct cxl_event {
                    struct cxl_event_header header;
                    union {
                            struct cxl_event_afu_interrupt irq;
                            struct cxl_event_data_storage fault;
                            struct cxl_event_afu_error afu_error;
                    };
            };

    The struct cxl_event_header is defined as:

            struct cxl_event_header {
                    __u16 type;
                    __u16 size;
                    __u16 process_element;
                    __u16 reserved1;
            };

        type:
            This defines the type of event. The type determines how
            the rest of the event is structured. These types are
            described below and defined by enum cxl_event_type.

        size:
            This is the size of the event in bytes including the
            struct cxl_event_header. The start of the next event can
            be found at this offset from the start of the current
            event.

        process_element:
            Context ID of the event.

        reserved field:
            For future extensions and padding.
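
    Because each event carries its own size, a buffer returned by read()
    can be walked header by header. The sketch below mirrors struct
    cxl_event_header with fixed width types so that it builds without
    the kernel header; the struct and function names are hypothetical,
    and the bounds checks are an addition of this example rather than
    something the driver documentation requires.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of struct cxl_event_header, for illustration only. */
struct event_header_sketch {
	uint16_t type;
	uint16_t size;		/* size of the whole event, header included */
	uint16_t process_element;
	uint16_t reserved1;
};

/*
 * Walk a buffer of events, recording up to max_types event types.
 * Returns the number of events found, or -1 if the buffer is malformed
 * (truncated header, or a size that is too small or overruns the buffer).
 */
int count_events(const unsigned char *buf, size_t len,
		 uint16_t *types, size_t max_types)
{
	size_t off = 0;
	int n = 0;

	while (off < len) {
		struct event_header_sketch hdr;

		if (len - off < sizeof(hdr))
			return -1;
		memcpy(&hdr, buf + off, sizeof(hdr));
		if (hdr.size < sizeof(hdr) || hdr.size > len - off)
			return -1;
		if ((size_t)n < max_types)
			types[n] = hdr.type;
		n++;
		off += hdr.size;	/* next event starts at this offset */
	}
	return n;
}
```

    In real use, len would be the return value of read() on the AFU file
    descriptor, and the type field would select which member of the
    union in struct cxl_event to interpret.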

    If the event type is CXL_EVENT_AFU_INTERRUPT then the event
    structure is defined as:

            struct cxl_event_afu_interrupt {
                    __u16 flags;
                    __u16 irq; /* Raised AFU interrupt number */
                    __u32 reserved1;
            };

        flags:
            These flags indicate which optional fields are present
            in this struct. Currently all fields are mandatory.

        irq:
            The IRQ number sent by the AFU.

        reserved field:
            For future extensions and padding.

    If the event type is CXL_EVENT_DATA_STORAGE then the event
    structure is defined as:

            struct cxl_event_data_storage {
                    __u16 flags;
                    __u16 reserved1;
                    __u32 reserved2;
                    __u64 addr;
                    __u64 dsisr;
                    __u64 reserved3;
            };

        flags:
            These flags indicate which optional fields are present in
            this struct. Currently all fields are mandatory.

        addr:
            The address that the AFU unsuccessfully attempted to
            access. Valid accesses will be handled transparently by the
            kernel but invalid accesses will generate this event.

        dsisr:
            This field gives information on the type of fault. It is a
            copy of the DSISR from the PSL hardware when the address
            fault occurred. The form of the DSISR is as defined in the
            CAIA.

        reserved fields:
            For future extensions

    If the event type is CXL_EVENT_AFU_ERROR then the event structure
    is defined as:

            struct cxl_event_afu_error {
                    __u16 flags;
                    __u16 reserved1;
                    __u32 reserved2;
                    __u64 error;
            };

        flags:
            These flags indicate which optional fields are present in
            this struct. Currently all fields are mandatory.

        error:
            Error status from the AFU. Defined by the AFU.

        reserved fields:
            For future extensions and padding

Sysfs Class
===========

    A cxl sysfs class is added under /sys/class/cxl to facilitate
    enumeration and tuning of the accelerators. Its layout is
    described in Documentation/ABI/testing/sysfs-class-cxl

Udev rules
==========

    The following udev rules could be used to create a symlink to the
    most logical chardev to use in any programming mode (afuX.Yd for
    dedicated, afuX.Ys for afu directed), since the API is virtually
    identical for each:

        SUBSYSTEM=="cxl", ATTRS{mode}=="dedicated_process", SYMLINK="cxl/%b"
        SUBSYSTEM=="cxl", ATTRS{mode}=="afu_directed", \
                   KERNEL=="afu[0-9]*.[0-9]*s", SYMLINK="cxl/%b"
334
Documentation/powerpc/eeh-pci-error-recovery.txt
Normal file
|
@ -0,0 +1,334 @@
|
|||
|
||||
|
||||
PCI Bus EEH Error Recovery
|
||||
--------------------------
|
||||
Linas Vepstas
|
||||
<linas@austin.ibm.com>
|
||||
12 January 2005
|
||||
|
||||
|
||||
Overview:
|
||||
---------
|
||||
The IBM POWER-based pSeries and iSeries computers include PCI bus
|
||||
controller chips that have extended capabilities for detecting and
|
||||
reporting a large variety of PCI bus error conditions. These features
|
||||
go under the name of "EEH", for "Extended Error Handling". The EEH
|
||||
hardware features allow PCI bus errors to be cleared and a PCI
|
||||
card to be "rebooted", without also having to reboot the operating
|
||||
system.
|
||||
|
||||
This is in contrast to traditional PCI error handling, where the
|
||||
PCI chip is wired directly to the CPU, and an error would cause
|
||||
a CPU machine-check/check-stop condition, halting the CPU entirely.
|
||||
Another "traditional" technique is to ignore such errors, which
|
||||
can lead to data corruption, both of user data or of kernel data,
|
||||
hung/unresponsive adapters, or system crashes/lockups. Thus,
|
||||
the idea behind EEH is that the operating system can become more
|
||||
reliable and robust by protecting it from PCI errors, and giving
|
||||
the OS the ability to "reboot"/recover individual PCI devices.
|
||||
|
||||
Future systems from other vendors, based on the PCI-E specification,
|
||||
may contain similar features.
|
||||
|
||||
|
||||
Causes of EEH Errors
|
||||
--------------------
|
||||
EEH was originally designed to guard against hardware failure, such
|
||||
as PCI cards dying from heat, humidity, dust, vibration and bad
|
||||
electrical connections. The vast majority of EEH errors seen in
|
||||
"real life" are due to either poorly seated PCI cards, or,
|
||||
unfortunately quite commonly, due to device driver bugs, device firmware
|
||||
bugs, and sometimes PCI card hardware bugs.
|
||||
|
||||
The most common software bug is one that causes the device to
attempt to DMA to a location in system memory that has not been
reserved for DMA access for that card. Detecting this is a powerful
feature, as it prevents what would otherwise have been silent memory
corruption caused by the bad DMA. A number of device driver
|
||||
bugs have been found and fixed in this way over the past few
|
||||
years. Other possible causes of EEH errors include data or
|
||||
address line parity errors (for example, due to poor electrical
|
||||
connectivity due to a poorly seated card), and PCI-X split-completion
|
||||
errors (due to software, device firmware, or device PCI hardware bugs).
|
||||
The vast majority of "true hardware failures" can be cured by
|
||||
physically removing and re-seating the PCI card.
|
||||
|
||||
|
||||
Detection and Recovery
|
||||
----------------------
|
||||
In the following discussion, a generic overview of how to detect
|
||||
and recover from EEH errors will be presented. This is followed
|
||||
by an overview of how the current implementation in the Linux
|
||||
kernel does it. The actual implementation is subject to change,
|
||||
and some of the finer points are still being debated. These
|
||||
may in turn be swayed if or when other architectures implement
|
||||
similar functionality.
|
||||
|
||||
When a PCI Host Bridge (PHB, the bus controller connecting the
|
||||
PCI bus to the system CPU electronics complex) detects a PCI error
|
||||
condition, it will "isolate" the affected PCI card. Isolation
|
||||
will block all writes (either to the card from the system, or
|
||||
from the card to the system), and it will cause all reads to
|
||||
return all-ff's (0xff, 0xffff, 0xffffffff for 8/16/32-bit reads).
|
||||
This value was chosen because it is the same value you would
|
||||
get if the device was physically unplugged from the slot.
|
||||
This includes access to PCI memory, I/O space, and PCI config
|
||||
space. Interrupts, however, will continue to be delivered.
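
The "all-ff's" test for the three read widths can be sketched as
follows. This is only an illustration of the check described above;
read_is_all_ones() is a hypothetical name and is not the kernel's
implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true if a read of the given width (1, 2 or 4 bytes) came back
 * as all ones, which is what an isolated slot (or an empty slot) returns.
 * Any other width is rejected.
 */
bool read_is_all_ones(uint32_t val, unsigned int width_bytes)
{
	switch (width_bytes) {
	case 1:
		return (val & 0xffu) == 0xffu;
	case 2:
		return (val & 0xffffu) == 0xffffu;
	case 4:
		return val == 0xffffffffu;
	default:
		return false;
	}
}
```

An all-ones result is only a hint: a healthy device may legitimately
return the same value, so the firmware still has to be asked.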
|
||||
|
||||
Detection and recovery are performed with the aid of ppc64
|
||||
firmware. The programming interfaces in the Linux kernel
|
||||
into the firmware are referred to as RTAS (Run-Time Abstraction
|
||||
Services). The Linux kernel does not (should not) access
|
||||
the EEH function in the PCI chipsets directly, primarily because
|
||||
there are a number of different chipsets out there, each with
|
||||
different interfaces and quirks. The firmware provides a
|
||||
uniform abstraction layer that will work with all pSeries
|
||||
and iSeries hardware (and be forwards-compatible).
|
||||
|
||||
If the OS or device driver suspects that a PCI slot has been
|
||||
EEH-isolated, there is a firmware call it can make to determine if
|
||||
this is the case. If so, then the device driver should put itself
|
||||
into a consistent state (given that it won't be able to complete any
|
||||
pending work) and start recovery of the card. Recovery normally
|
||||
would consist of resetting the PCI device (holding the PCI #RST
|
||||
line high for two seconds), followed by setting up the device
|
||||
config space (the base address registers (BARs), latency timer,
|
||||
cache line size, interrupt line, and so on). This is followed by a
|
||||
reinitialization of the device driver. In a worst-case scenario,
|
||||
the power to the card can be toggled, at least on hot-plug-capable
|
||||
slots. In principle, layers far above the device driver probably
|
||||
do not need to know that the PCI card has been "rebooted" in this
|
||||
way; ideally, there should be at most a pause in Ethernet/disk/USB
|
||||
I/O while the card is being reset.
|
||||
|
||||
If the card cannot be recovered after three or four resets, the
|
||||
kernel/device driver should assume the worst-case scenario, that the
|
||||
card has died completely, and report this error to the sysadmin.
|
||||
In addition, error messages are reported through RTAS and also through
|
||||
syslogd (/var/log/messages) to alert the sysadmin of PCI resets.
|
||||
The correct way to deal with failed adapters is to use the standard
|
||||
PCI hotplug tools to remove and replace the dead card.
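
The "give up after a few resets" policy can be sketched as a bounded
retry loop. The limit of 3 is an assumption (the text says "three or
four"), and recover_card(), reset_fn and stub_reset() are hypothetical
names used only for this illustration.

```c
#include <stdbool.h>

#define MAX_RESETS 3	/* assumption: the text says "three or four" */

/* Hypothetical callback: reset the card, return true if it came back. */
typedef bool (*reset_fn)(void *card);

/*
 * Try to recover a card. Returns the number of resets used on success,
 * or -1 if the card should be reported to the sysadmin as dead.
 */
int recover_card(void *card, reset_fn reset_and_check)
{
	for (int attempt = 1; attempt <= MAX_RESETS; attempt++) {
		if (reset_and_check(card))
			return attempt;
	}
	return -1;
}

/* Stub for illustration: card points at a count of resets that fail. */
bool stub_reset(void *card)
{
	int *failures_left = card;

	if (*failures_left > 0) {
		(*failures_left)--;
		return false;
	}
	return true;
}
```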
|
||||
|
||||
|
||||
Current PPC64 Linux EEH Implementation
|
||||
--------------------------------------
|
||||
At this time, a generic EEH recovery mechanism has been implemented,
|
||||
so that individual device drivers do not need to be modified to support
|
||||
EEH recovery. This generic mechanism piggy-backs on the PCI hotplug
|
||||
infrastructure, and percolates events up through the userspace/udev
|
||||
infrastructure. Following is a detailed description of how this is
|
||||
accomplished.
|
||||
|
||||
EEH must be enabled in the PHBs very early during the boot process,
|
||||
and if a PCI slot is hot-plugged. The former is performed by
|
||||
eeh_init() in arch/powerpc/platforms/pseries/eeh.c, and the latter by
|
||||
drivers/pci/hotplug/pSeries_pci.c calling in to the eeh.c code.
|
||||
EEH must be enabled before a PCI scan of the device can proceed.
|
||||
Current Power5 hardware will not work unless EEH is enabled,
|
||||
although older Power4 can run with it disabled. Effectively,
|
||||
EEH can no longer be turned off. PCI devices *must* be
|
||||
registered with the EEH code; the EEH code needs to know about
|
||||
the I/O address ranges of the PCI device in order to detect an
|
||||
error. Given an arbitrary address, the routine
|
||||
pci_get_device_by_addr() will find the pci device associated
|
||||
with that address (if any).
|
||||
|
||||
The default arch/powerpc/include/asm/io.h macros readb(), inb(), insb(),
|
||||
etc. include a check to see if the i/o read returned all-0xff's.
|
||||
If so, these make a call to eeh_dn_check_failure(), which in turn
|
||||
asks the firmware if the all-ff's value is the sign of a true EEH
|
||||
error. If it is not, processing continues as normal. The grand
|
||||
total number of these false alarms or "false positives" can be
|
||||
seen in /proc/ppc64/eeh (subject to change). Normally, almost
|
||||
all of these occur during boot, when the PCI bus is scanned, where
|
||||
a large number of 0xff reads are part of the bus scan procedure.
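
The check built into the I/O macros can be pictured roughly as below.
This is a user-space sketch under stated assumptions, not the code in
io.h: check_read8(), slot_frozen_fn, struct eeh_stats and the stub are
hypothetical, and the callback stands in for the firmware query made
through eeh_dn_check_failure().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for asking firmware whether the slot is frozen. */
typedef bool (*slot_frozen_fn)(void *slot);

struct eeh_stats {
	unsigned long false_positives;	/* all-ones reads that were no error */
	unsigned long slot_freezes;	/* reads confirmed as EEH errors */
};

/*
 * Classify the result of an 8-bit read. Returns true only if the value
 * is all ones and the firmware confirms that the slot is frozen.
 */
bool check_read8(uint8_t val, void *slot, slot_frozen_fn frozen,
		 struct eeh_stats *st)
{
	if (val != 0xff)
		return false;		/* fast path: no firmware call */
	if (frozen(slot)) {
		st->slot_freezes++;
		return true;
	}
	st->false_positives++;
	return false;
}

/* Stub for illustration: slot points at a bool holding the frozen state. */
bool stub_slot_frozen(void *slot)
{
	return *(bool *)slot;
}
```

The fast path matters because the firmware is only consulted for the
rare reads that return all ones.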
|
||||
|
||||
If a frozen slot is detected, code in
|
||||
arch/powerpc/platforms/pseries/eeh.c will print a stack trace to
|
||||
syslog (/var/log/messages). This stack trace has proven to be very
|
||||
useful to device-driver authors for finding out at what point the EEH
|
||||
error was detected, as the error itself usually occurs slightly
|
||||
beforehand.
|
||||
|
||||
Next, it uses the Linux kernel notifier chain/work queue mechanism to
|
||||
allow any interested parties to find out about the failure. Device
|
||||
drivers, or other parts of the kernel, can use
|
||||
eeh_register_notifier(struct notifier_block *) to find out about EEH
|
||||
events. The event will include a pointer to the pci device, the
|
||||
device node and some state info. Receivers of the event can "do as
|
||||
they wish"; the default handler will be described further in this
|
||||
section.
|
||||
|
||||
To assist in the recovery of the device, eeh.c exports the
|
||||
following functions:
|
||||
|
||||
rtas_set_slot_reset() -- assert the PCI #RST line for 1/8th of a second
|
||||
rtas_configure_bridge() -- ask firmware to configure any PCI bridges
|
||||
located topologically under the pci slot.
|
||||
eeh_save_bars() and eeh_restore_bars(): save and restore the PCI
|
||||
config-space info for a device and any devices under it.
|
||||
|
||||
|
||||
A handler for the EEH notifier_block events is implemented in
|
||||
drivers/pci/hotplug/pSeries_pci.c, called handle_eeh_events().
|
||||
It saves the device BARs and then calls rpaphp_unconfig_pci_adapter().
|
||||
This last call causes the device driver for the card to be stopped,
|
||||
which causes uevents to go out to user space. This triggers
|
||||
user-space scripts that might issue commands such as "ifdown eth0"
|
||||
for ethernet cards, and so on. This handler then sleeps for 5 seconds,
|
||||
hoping to give the user-space scripts enough time to complete.
|
||||
It then resets the PCI card, reconfigures the device BARs, and
|
||||
any bridges underneath. It then calls rpaphp_enable_pci_slot(),
|
||||
which restarts the device driver and triggers more user-space
|
||||
events (for example, calling "ifup eth0" for ethernet cards).
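
The order of operations in the handler can be summarised as a small
step table. This is a sketch of the sequence described above, not the
code of handle_eeh_events(); the enum, eeh_run_recovery() and the stub
are hypothetical names.

```c
#include <stddef.h>

/* Steps of the generic recovery path, in the order the text gives them. */
enum eeh_step {
	EEH_SAVE_BARS,
	EEH_UNCONFIG_ADAPTER,	/* stops the driver, emits uevents */
	EEH_WAIT_USERSPACE,	/* the 5 second sleep */
	EEH_RESET_SLOT,
	EEH_RESTORE_BARS,
	EEH_CONFIGURE_BRIDGES,
	EEH_ENABLE_SLOT,	/* restarts the driver, emits uevents */
	EEH_NR_STEPS
};

/* Hypothetical callback: performs one step, returns 0 on success. */
typedef int (*eeh_step_fn)(enum eeh_step step, void *ctx);

/*
 * Run the steps in order and stop at the first failure.
 * Returns the number of steps that completed.
 */
size_t eeh_run_recovery(eeh_step_fn do_step, void *ctx)
{
	size_t done = 0;

	for (int s = EEH_SAVE_BARS; s < EEH_NR_STEPS; s++) {
		if (do_step((enum eeh_step)s, ctx) != 0)
			break;
		done++;
	}
	return done;
}

/* Stub for illustration: ctx points at the step that should fail. */
int stub_step(enum eeh_step step, void *ctx)
{
	return step == *(enum eeh_step *)ctx ? -1 : 0;
}
```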
|
||||
|
||||
|
||||
Device Shutdown and User-Space Events
|
||||
-------------------------------------
|
||||
This section documents what happens when a pci slot is unconfigured,
|
||||
focusing on how the device driver gets shut down, and on how the
|
||||
events get delivered to user-space scripts.
|
||||
|
||||
Following is an example sequence of events that cause a device driver
|
||||
close function to be called during the first phase of an EEH reset.
|
||||
The following sequence is an example of the pcnet32 device driver.
|
||||
|
||||
rpaphp_unconfig_pci_adapter (struct slot *) // in rpaphp_pci.c
|
||||
{
|
||||
calls
|
||||
pci_remove_bus_device (struct pci_dev *) // in /drivers/pci/remove.c
|
||||
{
|
||||
calls
|
||||
pci_destroy_dev (struct pci_dev *)
|
||||
{
|
||||
calls
|
||||
device_unregister (&dev->dev) // in /drivers/base/core.c
|
||||
{
|
||||
calls
|
||||
device_del (struct device *)
|
||||
{
|
||||
calls
|
||||
bus_remove_device() // in /drivers/base/bus.c
|
||||
{
|
||||
calls
|
||||
device_release_driver()
|
||||
{
|
||||
calls
|
||||
struct device_driver->remove() which is just
|
||||
pci_device_remove() // in /drivers/pci/pci_driver.c
|
||||
{
|
||||
calls
|
||||
struct pci_driver->remove() which is just
|
||||
pcnet32_remove_one() // in /drivers/net/pcnet32.c
|
||||
{
|
||||
calls
|
||||
unregister_netdev() // in /net/core/dev.c
|
||||
{
|
||||
calls
|
||||
dev_close() // in /net/core/dev.c
|
||||
{
|
||||
calls dev->stop();
|
||||
which is just pcnet32_close() // in pcnet32.c
|
||||
{
|
||||
which does what you wanted
|
||||
to stop the device
|
||||
}
|
||||
}
|
||||
}
|
||||
which
|
||||
frees pcnet32 device driver memory
|
||||
}
|
||||
}}}}}}
|
||||
|
||||
|
||||
in drivers/pci/pci_driver.c,
|
||||
struct device_driver->remove() is just pci_device_remove()
|
||||
which calls struct pci_driver->remove() which is pcnet32_remove_one()
|
||||
which calls unregister_netdev() (in net/core/dev.c)
|
||||
which calls dev_close() (in net/core/dev.c)
|
||||
which calls dev->stop() which is pcnet32_close()
|
||||
which then does the appropriate shutdown.
|
||||
|
||||
---
|
||||
Following is the analogous stack trace for events sent to user-space
|
||||
when the pci device is unconfigured.
|
||||
|
||||
rpaphp_unconfig_pci_adapter() { // in rpaphp_pci.c
|
||||
calls
|
||||
pci_remove_bus_device (struct pci_dev *) { // in /drivers/pci/remove.c
|
||||
calls
|
||||
pci_destroy_dev (struct pci_dev *) {
|
||||
calls
|
||||
device_unregister (&dev->dev) { // in /drivers/base/core.c
|
||||
calls
|
||||
device_del(struct device * dev) { // in /drivers/base/core.c
|
||||
calls
|
||||
kobject_del() { // in /lib/kobject.c
|
||||
calls
|
||||
kobject_uevent() { // in /lib/kobject.c
|
||||
calls
|
||||
kset_uevent() { // in /lib/kobject.c
|
||||
calls
|
||||
kset->uevent_ops->uevent() // which is really just
|
||||
a call to
|
||||
dev_uevent() { // in /drivers/base/core.c
|
||||
calls
|
||||
dev->bus->uevent() which is really just a call to
|
||||
pci_uevent () { // in drivers/pci/hotplug.c
|
||||
which prints device name, etc....
|
||||
}
|
||||
}
|
||||
then kobject_uevent() sends a netlink uevent to userspace
|
||||
--> userspace uevent
|
||||
(during early boot, nobody listens to netlink events and
|
||||
kobject_uevent() executes uevent_helper[], which runs the
|
||||
event process /sbin/hotplug)
|
||||
}
|
||||
}
|
||||
kobject_del() then calls sysfs_remove_dir(), which would
cause any user-space daemon that was watching sysfs
to notice the delete event.
|
||||
|
||||
|
||||
Pros and Cons of the Current Design
-----------------------------------
|
||||
There are several issues with the current EEH software recovery design,
|
||||
which may be addressed in future revisions. But first, note that the
|
||||
big plus of the current design is that no changes need to be made to
|
||||
individual device drivers, so that the current design throws a wide net.
|
||||
The biggest negative of the design is that it potentially disturbs
|
||||
network daemons and file systems that didn't need to be disturbed.
|
||||
|
||||
-- A minor complaint is that resetting the network card causes
|
||||
user-space back-to-back ifdown/ifup burps that potentially disturb
|
||||
network daemons that didn't need to even know that the pci
|
||||
card was being rebooted.
|
||||
|
||||
-- A more serious concern is that the same reset, for SCSI devices,
|
||||
causes havoc to mounted file systems. Scripts cannot post-facto
|
||||
unmount a file system without flushing pending buffers, but this
|
||||
is impossible, because I/O has already been stopped. Thus,
|
||||
ideally, the reset should happen at or below the block layer,
|
||||
so that the file systems are not disturbed.
|
||||
|
||||
Reiserfs does not tolerate errors returned from the block device.
|
||||
Ext3fs seems to be tolerant, retrying reads/writes until it does
|
||||
succeed. Both have been only lightly tested in this scenario.
|
||||
|
||||
The SCSI-generic subsystem already has built-in code for performing
|
||||
SCSI device resets, SCSI bus resets, and SCSI host-bus-adapter
|
||||
(HBA) resets. These are cascaded into a chain of attempted
|
||||
resets if a SCSI command fails. These are completely hidden
|
||||
from the block layer. It would be very natural to add an EEH
|
||||
reset into this chain of events.
|
||||
|
||||
-- If a SCSI error occurs for the root device, all is lost unless
|
||||
the sysadmin had the foresight to run /bin, /sbin, /etc, /var
|
||||
and so on, out of ramdisk/tmpfs.
|
||||
|
||||
|
||||
Conclusions
|
||||
-----------
|
||||
There's forward progress ...
|
||||
|
||||
|
270
Documentation/powerpc/firmware-assisted-dump.txt
Normal file
|
@ -0,0 +1,270 @@
|
|||
|
||||
Firmware-Assisted Dump
|
||||
------------------------
|
||||
July 2011
|
||||
|
||||
The goal of firmware-assisted dump is to enable the dump of
|
||||
a crashed system, and to do so from a fully-reset system, and
|
||||
to minimize the total elapsed time until the system is back
|
||||
in production use.
|
||||
|
||||
- Firmware assisted dump (fadump) infrastructure is intended to replace
|
||||
the existing phyp assisted dump.
|
||||
- Fadump uses the same firmware interfaces and memory reservation model
|
||||
as phyp assisted dump.
|
||||
- Unlike phyp dump, fadump exports the memory dump through /proc/vmcore
|
||||
in the ELF format in the same way as kdump. This helps us reuse the
|
||||
kdump infrastructure for dump capture and filtering.
|
||||
- Unlike phyp dump, userspace tools do not need to refer to any sysfs
|
||||
interface while reading /proc/vmcore.
|
||||
- Unlike phyp dump, fadump allows the user to release all the memory reserved
|
||||
for dump, with a single operation of echo 1 > /sys/kernel/fadump_release_mem.
|
||||
- Once enabled through kernel boot parameter, fadump can be
|
||||
started/stopped through /sys/kernel/fadump_registered interface (see
|
||||
sysfs files section below) and can be easily integrated with kdump
|
||||
service start/stop init scripts.
|
||||
|
||||
Compared with kdump or other strategies, firmware-assisted
|
||||
dump offers several strong, practical advantages:
|
||||
|
||||
-- Unlike kdump, the system has been reset, and loaded
|
||||
with a fresh copy of the kernel. In particular,
|
||||
PCI and I/O devices have been reinitialized and are
|
||||
in a clean, consistent state.
|
||||
-- Once the dump is copied out, the memory that held the dump
|
||||
is immediately available to the running kernel. And therefore,
|
||||
unlike kdump, fadump doesn't need a 2nd reboot to get the
system back to the production configuration.
|
||||
|
||||
The above can only be accomplished by coordination with,
|
||||
and assistance from the Power firmware. The procedure is
|
||||
as follows:
|
||||
|
||||
-- The first kernel registers the sections of memory with the
|
||||
Power firmware for dump preservation during OS initialization.
|
||||
These registered sections of memory are reserved by the first
|
||||
kernel during early boot.
|
||||
|
||||
-- When a system crashes, the Power firmware will save
the low memory (boot memory, whose size is the larger of 5% of
system RAM or 256MB) to the previously registered region. It
will also save system registers and hardware PTEs.
|
||||
|
||||
NOTE: The term 'boot memory' means size of the low memory chunk
|
||||
that is required for a kernel to boot successfully when
|
||||
booted with restricted memory. By default, the boot memory
|
||||
size will be the larger of 5% of system RAM or 256MB.
|
||||
Alternatively, the user can also specify the boot memory size
through the boot parameter 'fadump_reserve_mem=', which will
override the default calculated size. Use this option
if the default boot memory size is not sufficient for the second
|
||||
kernel to boot successfully.
|
||||
|
||||
-- After the low memory (boot memory) area has been saved, the
|
||||
firmware will reset PCI and other hardware state. It will
|
||||
*not* clear the RAM. It will then launch the bootloader, as
|
||||
normal.
|
||||
|
||||
-- The freshly booted kernel will notice that there is a new
|
||||
node (ibm,dump-kernel) in the device tree, indicating that
|
||||
there is crash data available from a previous boot. During
early boot, the OS will reserve the rest of the memory above
the boot memory size, effectively booting with restricted memory
|
||||
size. This will make sure that the second kernel will not
|
||||
touch any of the dump memory area.
|
||||
|
||||
-- User-space tools will read /proc/vmcore to obtain the contents
|
||||
of memory, which holds the previous crashed kernel dump in ELF
|
||||
format. The userspace tools may copy this info to disk, or
|
||||
network, nas, san, iscsi, etc. as desired.
|
||||
|
||||
-- Once the userspace tool is done saving dump, it will echo
|
||||
'1' to /sys/kernel/fadump_release_mem to release the reserved
|
||||
memory back to general use, except the memory required for
|
||||
next firmware-assisted dump registration.
|
||||
|
||||
e.g.
|
||||
# echo 1 > /sys/kernel/fadump_release_mem
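
The default boot memory size rule from the NOTE above (the larger of 5%
of system RAM or 256MB, unless 'fadump_reserve_mem=' overrides it) can
be written out as a small calculation. This is a sketch of the stated
rule only; boot_mem_size() is a hypothetical name, and any rounding or
alignment the kernel may apply is not modelled.

```c
#include <stdint.h>

#define FADUMP_MIN_BOOT_MEM (256ULL << 20)	/* 256MB floor from the text */

/*
 * ram_bytes:     total system RAM
 * user_override: value of fadump_reserve_mem= in bytes, 0 if not given
 */
uint64_t boot_mem_size(uint64_t ram_bytes, uint64_t user_override)
{
	uint64_t five_percent = ram_bytes / 20;

	if (user_override)
		return user_override;
	return five_percent > FADUMP_MIN_BOOT_MEM ?
		five_percent : FADUMP_MIN_BOOT_MEM;
}
```

With 4GB of RAM, 5% is below the floor, so the result is 256MB; the
5% term only takes over above 5GB of RAM.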
|
||||
|
||||
Please note that the firmware-assisted dump feature
|
||||
is only available on Power6 and above systems with recent
|
||||
firmware versions.
|
||||
|
||||
Implementation details:
|
||||
----------------------
|
||||
|
||||
During boot, a check is made to see if firmware supports
|
||||
this feature on that particular machine. If it does, then
|
||||
we check to see if an active dump is waiting for us. If yes
|
||||
then everything but boot memory size of RAM is reserved during
|
||||
early boot (See Fig. 2). This area is released once we finish
|
||||
collecting the dump from user land scripts (e.g. kdump scripts)
|
||||
that are run. If there is dump data, then the
|
||||
/sys/kernel/fadump_release_mem file is created, and the reserved
|
||||
memory is held.
|
||||
|
||||
If there is no waiting dump data, then only the memory required
|
||||
to hold CPU state, HPTE region, boot memory dump and elfcore
|
||||
header, is reserved at the top of memory (see Fig. 1). This area
|
||||
is *not* released: this region will be kept permanently reserved,
|
||||
so that it can act as a receptacle for a copy of the boot memory
|
||||
content in addition to CPU state and HPTE region, in the case a
|
||||
crash does occur.
|
||||
|
||||
o Memory Reservation during first kernel
|
||||
|
||||
Low memory Top of memory
|
||||
0 boot memory size |
|
||||
| | |<--Reserved dump area -->|
|
||||
V V | Permanent Reservation V
|
||||
+-----------+----------/ /----------+---+----+-----------+----+
|
||||
| | |CPU|HPTE| DUMP |ELF |
|
||||
+-----------+----------/ /----------+---+----+-----------+----+
|
||||
| ^
|
||||
| |
|
||||
\ /
|
||||
-------------------------------------------
|
||||
Boot memory content gets transferred to
|
||||
reserved area by firmware at the time of
|
||||
crash
|
||||
Fig. 1
|
||||
|
||||
o Memory Reservation during second kernel after crash
|
||||
|
||||
Low memory Top of memory
|
||||
0 boot memory size |
|
||||
| |<------------- Reserved dump area ----------- -->|
|
||||
V V V
|
||||
+-----------+----------/ /----------+---+----+-----------+----+
|
||||
| | |CPU|HPTE| DUMP |ELF |
|
||||
+-----------+----------/ /----------+---+----+-----------+----+
|
||||
| |
|
||||
V V
|
||||
Used by second /proc/vmcore
|
||||
kernel to boot
|
||||
Fig. 2
|
||||
|
||||
Currently the dump will be copied from /proc/vmcore to
a new file upon user intervention. The dump data available through
|
||||
/proc/vmcore will be in ELF format. Hence the existing kdump
|
||||
infrastructure (kdump scripts) to save the dump works fine with
|
||||
minor modifications.
|
||||
|
||||
The tools to examine the dump will be same as the ones
|
||||
used for kdump.
|
||||
|
||||
How to enable firmware-assisted dump (fadump):
|
||||
-------------------------------------
|
||||
|
||||
1. Set config option CONFIG_FA_DUMP=y and build kernel.
|
||||
2. Boot into linux kernel with 'fadump=on' kernel cmdline option.
|
||||
3. Optionally, user can also set 'fadump_reserve_mem=' kernel cmdline
|
||||
to specify size of the memory to reserve for boot memory dump
|
||||
preservation.
|
||||
|
||||
NOTE: If firmware-assisted dump fails to reserve memory then it will
fall back to the existing kdump mechanism if the 'crashkernel=' option
|
||||
is set at kernel cmdline.
|
||||
|
||||
Sysfs/debugfs files:
|
||||
------------
|
||||
|
||||
The firmware-assisted dump feature uses the sysfs file system to hold
the control files and a debugfs file to display the reserved memory regions.
|
||||
|
||||
Here is the list of files under kernel sysfs:
|
||||
|
||||
/sys/kernel/fadump_enabled
|
||||
|
||||
This is used to display the fadump status.
|
||||
0 = fadump is disabled
|
||||
1 = fadump is enabled
|
||||
|
||||
This interface can be used by kdump init scripts to identify if
|
||||
fadump is enabled in the kernel and act accordingly.
|
||||
|
||||
/sys/kernel/fadump_registered
|
||||
|
||||
This is used to display the fadump registration status as well
|
||||
as to control (start/stop) the fadump registration.
|
||||
0 = fadump is not registered.
|
||||
1 = fadump is registered and ready to handle system crash.
|
||||
|
||||
To register fadump, echo 1 > /sys/kernel/fadump_registered; to
un-register and stop fadump, echo 0 > /sys/kernel/fadump_registered.
Once fadump is un-registered, a system crash will not
|
||||
be handled and vmcore will not be captured. This interface can be
|
||||
easily integrated with kdump service start/stop.
|
||||
|
||||
/sys/kernel/fadump_release_mem
|
||||
|
||||
This file is available only when fadump is active during
|
||||
second kernel. This is used to release the reserved memory
regions that are held for saving crash dump. To release the
|
||||
reserved memory echo 1 to it:
|
||||
|
||||
echo 1 > /sys/kernel/fadump_release_mem
|
||||
|
||||
After echo 1, the content of the /sys/kernel/debug/powerpc/fadump_region
|
||||
file will change to reflect the new memory reservations.
|
||||
|
||||
The existing userspace tools (kdump infrastructure) can be easily
|
||||
enhanced to use this interface to release the memory reserved for
|
||||
dump and continue without 2nd reboot.
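
A tool that prefers not to shell out to echo could drive these control
files with a pair of small helpers like the ones below. This is a
minimal sketch; write_flag() and read_flag() are hypothetical names,
and the path is passed in by the caller so nothing here depends on the
fadump files being present.

```c
#include <stdio.h>

/*
 * Write "0" or "1" to a control file, like "echo N > path".
 * Returns 0 on success, -1 on failure.
 */
int write_flag(const char *path, int value)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fprintf(f, "%d\n", value ? 1 : 0) > 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/* Read a flag back. Returns 0 or 1, or -1 if it cannot be read. */
int read_flag(const char *path)
{
	FILE *f = fopen(path, "r");
	int v = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &v) != 1)
		v = -1;
	fclose(f);
	return v < 0 ? -1 : (v != 0);
}
```

For example, write_flag("/sys/kernel/fadump_registered", 1) would
correspond to the register command shown above, and read_flag() on
/sys/kernel/fadump_enabled to the status check.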
|
||||
|
||||
Here is the list of files under powerpc debugfs:
|
||||
(Assuming debugfs is mounted on /sys/kernel/debug directory.)
|
||||
|
||||
/sys/kernel/debug/powerpc/fadump_region
|
||||
|
||||
This file shows the reserved memory regions if fadump is
enabled; otherwise this file is empty. The output format
|
||||
is:
|
||||
<region>: [<start>-<end>] <reserved-size> bytes, Dumped: <dump-size>
|
||||
|
||||
e.g.
|
||||
Contents when fadump is registered during first kernel
|
||||
|
||||
# cat /sys/kernel/debug/powerpc/fadump_region
|
||||
CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x0
|
||||
HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x0
|
||||
DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x0
|
||||
|
||||
Contents when fadump is active during second kernel
|
||||
|
||||
# cat /sys/kernel/debug/powerpc/fadump_region
|
||||
CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x40020
|
||||
HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x1000
|
||||
DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x10000000
|
||||
: [0x00000010000000-0x0000006ffaffff] 0x5ffb0000 bytes, Dumped: 0x5ffb0000
|
||||
|
||||
NOTE: Please refer to Documentation/filesystems/debugfs.txt on
|
||||
how to mount the debugfs filesystem.
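
A script or tool that wants to consume the fadump_region output could
parse each line along these lines. This is a sketch that assumes the
format shown above stays as documented; struct fadump_region and
parse_fadump_region() are hypothetical names.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct fadump_region {
	char name[8];		/* "CPU", "HPTE", "DUMP", or "" if unnamed */
	uint64_t start, end, size, dumped;
};

/*
 * Parse one line of the form
 *   <region>: [<start>-<end>] <reserved-size> bytes, Dumped: <dump-size>
 * Returns 0 on success, -1 if the line does not match.
 */
int parse_fadump_region(const char *line, struct fadump_region *r)
{
	const char *colon;
	size_t n;

	while (*line == ' ')
		line++;
	colon = strchr(line, ':');
	if (!colon)
		return -1;

	/* Region name is everything before the colon, minus padding. */
	n = (size_t)(colon - line);
	while (n > 0 && line[n - 1] == ' ')
		n--;
	if (n >= sizeof(r->name))
		return -1;
	memcpy(r->name, line, n);
	r->name[n] = '\0';

	/* %x accepts the optional 0x prefix used in the output. */
	if (sscanf(colon + 1,
		   " [%" SCNx64 "-%" SCNx64 "] %" SCNx64
		   " bytes, Dumped: %" SCNx64,
		   &r->start, &r->end, &r->size, &r->dumped) != 4)
		return -1;
	return 0;
}
```

In the sample output above the fields are consistent with each other:
for every line, end - start + 1 equals the reserved size.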
|
||||
|
||||
|
||||
TODO:
|
||||
-----
|
||||
o Need to come up with a better approach to find out a more
|
||||
accurate boot memory size that is required for a kernel to
|
||||
boot successfully when booted with restricted memory.
|
||||
o The fadump implementation introduces a fadump crash info structure
|
||||
in the scratch area before the ELF core header. The idea of introducing
|
||||
this structure is to pass some important crash info data to the second
|
||||
kernel which will help second kernel to populate ELF core header with
|
||||
correct data before it gets exported through /proc/vmcore. The current
|
||||
design implementation does not address a possibility of introducing
|
||||
additional fields (in future) to this structure without affecting
|
||||
compatibility. Need to come up with a better approach to address this.
|
||||
The possible approaches are:
|
||||
1. Introduce version field for version tracking, bump up the version
|
||||
whenever a new field is added to the structure in future. The version
|
||||
field can be used to find out what fields are valid for the current
|
||||
version of the structure.
|
||||
2. Reserve the area of predefined size (say PAGE_SIZE) for this
|
||||
structure and have unused area as reserved (initialized to zero)
|
||||
for future field additions.
|
||||
The advantage of approach 1 over 2 is we don't need to reserve extra space.
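
Approach 1 could look roughly like the sketch below. Everything in it
is hypothetical: the struct, its field names and the version numbers
are invented for illustration and do not describe the actual fadump
crash info structure.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical versioned layout; not the real fadump structure. */
struct crash_info_sketch {
	uint32_t version;	/* bumped whenever a field is added */
	uint64_t field_a;	/* hypothetical: present since version 1 */
	uint64_t field_b;	/* hypothetical: added in version 2 */
};

enum crash_field { CRASH_FIELD_A, CRASH_FIELD_B };

/* Tell the reader (the second kernel) which fields it may trust. */
bool crash_field_is_valid(const struct crash_info_sketch *ci,
			  enum crash_field f)
{
	switch (f) {
	case CRASH_FIELD_A:
		return ci->version >= 1;
	case CRASH_FIELD_B:
		return ci->version >= 2;
	}
	return false;
}
```

A reader built for version 2 can then still accept a version 1
structure by simply not touching the newer field.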
|
||||
---
|
||||
Author: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
|
||||
This document is based on the original documentation written for phyp
|
||||
assisted dump by Linas Vepstas and Manish Ahuja.
|
567
Documentation/powerpc/hvcs.txt
Normal file
|
@ -0,0 +1,567 @@
|
|||
===========================================================================
|
||||
HVCS
|
||||
IBM "Hypervisor Virtual Console Server" Installation Guide
|
||||
for Linux Kernel 2.6.4+
|
||||
Copyright (C) 2004 IBM Corporation
|
||||
|
||||
===========================================================================
|
||||
NOTE: Eight-space tabs are the optimum editor setting for reading this file.
|
||||
===========================================================================
|
||||
|
||||
Author(s) : Ryan S. Arnold <rsa@us.ibm.com>
|
||||
Date Created: March, 02, 2004
|
||||
Last Changed: August, 24, 2004
|
||||
|
||||
---------------------------------------------------------------------------
|
||||
Table of contents:
|
||||
|
||||
1. Driver Introduction:
|
||||
2. System Requirements
|
||||
3. Build Options:
|
||||
3.1 Built-in:
|
||||
3.2 Module:
|
||||
4. Installation:
|
||||
5. Connection:
|
||||
6. Disconnection:
|
||||
7. Configuration:
|
||||
8. Questions & Answers:
|
||||
9. Reporting Bugs:
|
||||
|
||||
---------------------------------------------------------------------------
|
||||
1. Driver Introduction:
|
||||
|
||||
This is the device driver for the IBM Hypervisor Virtual Console Server,
|
||||
"hvcs". The IBM hvcs provides a tty driver interface to allow Linux user
|
||||
space applications access to the system consoles of logically partitioned
|
||||
operating systems (Linux and AIX) running on the same partitioned Power5
|
||||
ppc64 system. Physical hardware consoles per partition are not practical
|
||||
on this hardware so system consoles are accessed by this driver using
|
||||
firmware interfaces to virtual terminal devices.
|
||||
|
||||
---------------------------------------------------------------------------
|
||||
2. System Requirements:
|
||||
|
||||
This device driver was written using 2.6.4 Linux kernel APIs and will only
|
||||
build and run on kernels of this version or later.
|
||||
|
||||
This driver was written to operate solely on IBM Power5 ppc64 hardware
|
||||
though some care was taken to abstract the architecture dependent firmware
|
||||
calls from the driver code.
|
||||
|
||||
Sysfs must be mounted on the system so that the user can determine which
|
||||
major and minor numbers are associated with each vty-server. Directions
|
||||
for sysfs mounting are outside the scope of this document.
|
||||
|
||||
---------------------------------------------------------------------------
|
||||
3. Build Options:
|
||||
|
||||
The hvcs driver registers itself as a tty driver. The tty layer
|
||||
dynamically allocates a block of major and minor numbers in a quantity
|
||||
requested by the registering driver. The hvcs driver asks the tty layer
|
||||
for 64 of these major/minor numbers by default to use for hvcs device node
|
||||
entries.
|
||||
|
||||
If the default number of device entries is adequate then this driver can be
|
||||
built into the kernel. If not, the default can be overridden by inserting
|
||||
the driver as a module with insmod parameters.
|
||||
|
||||
---------------------------------------------------------------------------
|
||||
3.1 Built-in:
|
||||
|
||||
The following menuconfig example demonstrates selecting to build this
|
||||
driver into the kernel.
|
||||
|
||||
Device Drivers --->
|
||||
Character devices --->
|
||||
<*> IBM Hypervisor Virtual Console Server Support
|
||||
|
||||
Begin the kernel make process.
|
||||
|
||||
---------------------------------------------------------------------------
|
||||
3.2 Module:
|
||||
|
||||
The following menuconfig example demonstrates selecting to build this
|
||||
driver as a kernel module.
|
||||
|
||||
Device Drivers --->
|
||||
Character devices --->
|
||||
<M> IBM Hypervisor Virtual Console Server Support
|
||||
|
||||
The make process will build the following kernel modules:
|
||||
|
||||
hvcs.ko
|
||||
hvcserver.ko
|
||||
|
||||
To insert the module with the default allocation execute the following
|
||||
commands in the order they appear:
|
||||
|
||||
insmod hvcserver.ko
|
||||
insmod hvcs.ko
|
||||
|
||||
The hvcserver module contains architecture specific firmware calls and must
|
||||
be inserted first, otherwise the hvcs module will not find some of the
|
||||
symbols it expects.
|
||||
|
||||
To override the default use an insmod parameter as follows (requesting 4
|
||||
tty devices as an example):
|
||||
|
||||
insmod hvcs.ko hvcs_parm_num_devs=4
|
||||
|
||||
There is a maximum number of dev entries that can be specified on insmod.
|
||||
We think that 1024 is currently a decent maximum number of server adapters
|
||||
to allow. This can always be changed by modifying the constant in the
|
||||
source file before building.
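
The load order and the parameter limit described above can be sketched as
a small helper. This is only an illustration: the name hvcs_load_commands
is hypothetical, the upper bound of 1024 is simply the maximum quoted in
this document rather than a value read from the driver source, and the
lower bound of 1 is an assumption.

```python
# Illustrative sketch only: it builds the insmod command lines and does
# not run them.
HVCS_MAX_DEVS = 1024  # maximum quoted in this document


def hvcs_load_commands(num_devs=None):
    """Return the insmod commands, hvcserver first, as argument lists."""
    commands = [["insmod", "hvcserver.ko"]]
    hvcs = ["insmod", "hvcs.ko"]
    if num_devs is not None:
        # Lower bound of 1 is an assumption; the text only gives a maximum.
        if not 1 <= num_devs <= HVCS_MAX_DEVS:
            raise ValueError("num_devs must be between 1 and %d" % HVCS_MAX_DEVS)
        hvcs.append("hvcs_parm_num_devs=%d" % num_devs)
    commands.append(hvcs)
    return commands


if __name__ == "__main__":
    for command in hvcs_load_commands(4):
        print(" ".join(command))
```

Returning argument lists rather than one string keeps the sketch easy to
hand to a process launcher without shell quoting concerns.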
|
||||
|
||||
NOTE: The length of time it takes to insmod the driver seems to be related
|
||||
to the number of tty interfaces the registering driver requests.
|
||||
|
||||
In order to remove the driver module execute the following command:
|
||||
|
||||
rmmod hvcs.ko
|
||||
|
||||
The recommended method for installing hvcs as a module is to use depmod to
|
||||
build a current modules.dep file in /lib/modules/`uname -r` and then
|
||||
execute:
|
||||
|
||||
modprobe hvcs hvcs_parm_num_devs=4
|
||||
|
||||
The modules.dep file indicates that hvcserver.ko needs to be inserted
|
||||
before hvcs.ko and modprobe uses this file to smartly insert the modules in
|
||||
the proper order.
|
||||
|
||||
The following modprobe command is used to remove hvcs and hvcserver in the
|
||||
proper order:
|
||||
|
||||
modprobe -r hvcs
|
||||
|
||||
---------------------------------------------------------------------------
|
||||
4. Installation:
|
||||
|
||||
The tty layer creates sysfs entries which contain the major and minor
|
||||
numbers allocated for the hvcs driver. The following snippet of "tree"
|
||||
output of the sysfs directory shows where these numbers are presented:
|
||||
|
||||
sys/
|
||||
|-- *other sysfs base dirs*
|
||||
|
|
||||
|-- class
|
||||
| |-- *other classes of devices*
|
||||
| |
|
||||
| `-- tty
|
||||
| |-- *other tty devices*
|
||||
| |
|
||||
| |-- hvcs0
|
||||
| | `-- dev
|
||||
| |-- hvcs1
|
||||
| | `-- dev
|
||||
| |-- hvcs2
|
||||
| | `-- dev
|
||||
| |-- hvcs3
|
||||
| | `-- dev
|
||||
| |
|
||||
| |-- *other tty devices*
|
||||
|
|
||||
|-- *other sysfs base dirs*
|
||||
|
||||
For the above examples the following output is a result of cat'ing the
|
||||
"dev" entry in the hvcs directory:
|
||||
|
||||
Pow5:/sys/class/tty/hvcs0/ # cat dev
|
||||
254:0
|
||||
|
||||
Pow5:/sys/class/tty/hvcs1/ # cat dev
|
||||
254:1
|
||||
|
||||
Pow5:/sys/class/tty/hvcs2/ # cat dev
|
||||
254:2
|
||||
|
||||
Pow5:/sys/class/tty/hvcs3/ # cat dev
|
||||
254:3
|
||||
|
||||
The output from reading the "dev" attribute is the char device major and
|
||||
minor numbers that the tty layer has allocated for this driver's use. Most
|
||||
systems running hvcs will already have the device entries created or udev
|
||||
will do it automatically.
|
||||
|
||||
Given the example output above, to manually create a /dev/hvcs* node entry
|
||||
mknod can be used as follows:
|
||||
|
||||
mknod /dev/hvcs0 c 254 0
|
||||
mknod /dev/hvcs1 c 254 1
|
||||
mknod /dev/hvcs2 c 254 2
|
||||
mknod /dev/hvcs3 c 254 3
|
||||
|
||||
Using mknod to manually create the device entries makes these device nodes
|
||||
persistent. Once created they will exist prior to the driver insmod.
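
As a rough sketch of the step above, the "dev" attribute can be split into
its major and minor numbers and turned into a mknod invocation. The helper
names parse_dev, mknod_command and make_node are hypothetical, and the
0o600 node mode is an assumption, not something this document prescribes.

```python
import os
import stat


def parse_dev(text):
    """Split the contents of a sysfs "dev" attribute, e.g. "254:3"."""
    major, minor = text.strip().split(":")
    return int(major), int(minor)


def mknod_command(node, dev_text):
    """Return the mknod command line for a character device node."""
    major, minor = parse_dev(dev_text)
    return "mknod %s c %d %d" % (node, major, minor)


def make_node(node, dev_text, mode=0o600):
    """Create the node directly; this needs root, so it is not called here."""
    major, minor = parse_dev(dev_text)
    os.mknod(node, stat.S_IFCHR | mode, os.makedev(major, minor))


if __name__ == "__main__":
    print(mknod_command("/dev/hvcs0", "254:0\n"))
```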
|
||||
|
||||
Attempting to connect an application to /dev/hvcs* prior to insertion of
|
||||
the hvcs module will result in an error message similar to the following:
|
||||
|
||||
"/dev/hvcs*: No such device".
|
||||
|
||||
NOTE: Just because there is a device node present doesn't mean that there
|
||||
is a vty-server device configured for that node.
|
||||
|
||||
---------------------------------------------------------------------------
|
||||
5. Connection
|
||||
|
||||
Since this driver controls devices that provide a tty interface a user can
|
||||
interact with the device node entries using any standard tty-interactive
|
||||
method (e.g. "cat", "dd", "echo"). The intent of this driver, however, is
|
||||
to provide real time console interaction with a Linux partition's console,
|
||||
which requires the use of applications that provide bi-directional,
|
||||
interactive I/O with a tty device.
|
||||
|
||||
Applications (e.g. "minicom" and "screen") that act as terminal emulators
|
||||
or perform terminal type control sequence conversion on the data being
|
||||
passed through them are NOT acceptable for providing interactive console
|
||||
I/O. These programs often emulate antiquated terminal types (vt100 and
|
||||
ANSI) and expect inbound data to take the form of one of these supported
|
||||
terminal types but they either do not convert, or do not _adequately_
|
||||
convert, outbound data into the terminal type of the terminal which invoked
|
||||
them (though screen makes an attempt and can apparently be configured with
|
||||
much termcap wrestling.)
|
||||
|
||||
For this reason kermit and cu are two of the recommended applications for
|
||||
interacting with a Linux console via an hvcs device. These programs simply
|
||||
act as a conduit for data transfer to and from the tty device. They do not
|
||||
require inbound data to take the form of a particular terminal type, nor do
|
||||
they cook outbound data to a particular terminal type.
|
||||
|
||||
In order to ensure proper functioning of console applications one must make
|
||||
sure that once connected to a /dev/hvcs console that the console's $TERM
|
||||
env variable is set to the exact terminal type of the terminal emulator
|
||||
used to launch the interactive I/O application. If one is using xterm and
|
||||
kermit to connect to /dev/hvcs0 when the console prompt becomes available
|
||||
one should "export TERM=xterm" on the console. This tells ncurses
|
||||
applications that are invoked from the console that they should output
|
||||
control sequences that xterm can understand.
|
||||
|
||||
As a precautionary measure an hvcs user should always "exit" from their
|
||||
session before disconnecting an application such as kermit from the device
|
||||
node. If this is not done, the next user to connect to the console will
|
||||
continue using the previous user's logged in session which includes
|
||||
using the $TERM variable that the previous user supplied.
|
||||
|
||||
Hotplug add and remove of vty-server adapters affects which /dev/hvcs* node
|
||||
is used to connect to each vty-server adapter. In order to determine which
|
||||
vty-server adapter is associated with which /dev/hvcs* node a special sysfs
|
||||
attribute has been added to each vty-server sysfs entry. This entry is
|
||||
called "index" and showing it reveals an integer that refers to the
|
||||
/dev/hvcs* entry to use to connect to that device. For instance, cat'ing the
index attribute of vty-server adapter 30000004 shows the following:
|
||||
|
||||
Pow5:/sys/bus/vio/drivers/hvcs/30000004 # cat index
|
||||
2
|
||||
|
||||
This index of '2' means that in order to connect to vty-server adapter
|
||||
30000004 the user should interact with /dev/hvcs2.
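
The lookup just described could be automated along these lines. This is a
minimal sketch: node_for_adapter is a hypothetical helper, and it assumes
the "index" attribute holds a single decimal integer as in the example
above.

```python
import os


def node_for_adapter(adapter_dir):
    """Map a vty-server sysfs directory to its /dev/hvcs* node via "index"."""
    with open(os.path.join(adapter_dir, "index")) as attr:
        index = int(attr.read().strip())
    return "/dev/hvcs%d" % index


if __name__ == "__main__":
    path = "/sys/bus/vio/drivers/hvcs/30000004"  # example adapter from the text
    if os.path.isdir(path):
        print(node_for_adapter(path))
```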
|
||||
|
||||
It should be noted that, due to the hotplug I/O capabilities of a system,
the /dev/hvcs* entry that interacts with a particular vty-server adapter is
not guaranteed to remain the same across system reboots. Look in the Q & A
section for more on this issue.
|
||||
|
||||
---------------------------------------------------------------------------
|
||||
6. Disconnection
|
||||
|
||||
As a security feature to prevent the delivery of stale data to an
|
||||
unintended target the Power5 system firmware disables the fetching of data
|
||||
and discards that data when a connection between a vty-server and a vty has
|
||||
been severed. As an example, when a vty-server is immediately disconnected
|
||||
from a vty following output of data to the vty the vty adapter may not have
|
||||
enough time between when it received the data interrupt and when the
|
||||
connection was severed to fetch the data from firmware before the fetch is
|
||||
disabled by firmware.
|
||||
|
||||
When hvcs is being used to serve consoles this behavior is not a huge issue
|
||||
because the adapter stays connected for large amounts of time following
|
||||
almost all data writes. When hvcs is being used as a tty conduit to tunnel
|
||||
data between two partitions [see Q & A below] this is a huge problem
|
||||
because the standard Linux behavior when cat'ing or dd'ing data to a device
|
||||
is to open the tty, send the data, and then close the tty. If this driver
|
||||
manually terminated vty-server connections on tty close this would close
|
||||
the vty-server and vty connection before the target vty has had a chance to
|
||||
fetch the data.
|
||||
|
||||
Additionally, disconnecting a vty-server and vty only on module removal or
|
||||
adapter removal is impractical because other vty-servers in other
|
||||
partitions may require the usage of the target vty at any time.
|
||||
|
||||
Due to this behavioral restriction, disconnection of vty-servers from the
connected vty is a manual procedure using a write to a sysfs attribute
outlined below. On the other hand, the initial vty-server connection to a
|
||||
vty is established automatically by this driver. Manual vty-server
|
||||
connection is never required.
|
||||
|
||||
In order to terminate the connection between a vty-server and vty the
|
||||
"vterm_state" sysfs attribute within each vty-server's sysfs entry is used.
|
||||
Reading this attribute reveals the current connection state of the
|
||||
vty-server adapter. A zero means that the vty-server is not connected to a
|
||||
vty. A one indicates that a connection is active.
|
||||
|
||||
Writing a '0' (zero) to the vterm_state attribute will disconnect the VTERM
|
||||
connection between the vty-server and target vty ONLY if the vterm_state
|
||||
previously read '1'. The write directive is ignored if the vterm_state
|
||||
read '0' or if any value other than '0' was written to the vterm_state
|
||||
attribute. The following example will show the method used for verifying
|
||||
the vty-server connection status and disconnecting a vty-server connection.
|
||||
|
||||
Pow5:/sys/bus/vio/drivers/hvcs/30000004 # cat vterm_state
|
||||
1
|
||||
|
||||
Pow5:/sys/bus/vio/drivers/hvcs/30000004 # echo 0 > vterm_state
|
||||
|
||||
Pow5:/sys/bus/vio/drivers/hvcs/30000004 # cat vterm_state
|
||||
0
|
||||
|
||||
All vty-server connections are automatically terminated when the device is
|
||||
hotplug removed and when the module is removed.
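
The check-then-write sequence shown above might be scripted as follows.
This is a sketch under the assumption that "vterm_state" reads back as the
single character '0' or '1'; the helper names vterm_connected and
disconnect_vterm are hypothetical.

```python
import os


def vterm_connected(adapter_dir):
    """Return True when "vterm_state" reads '1' (connection active)."""
    with open(os.path.join(adapter_dir, "vterm_state")) as attr:
        return attr.read().strip() == "1"


def disconnect_vterm(adapter_dir):
    """Write '0' only when a connection is active; return True if written."""
    if not vterm_connected(adapter_dir):
        return False  # a write would be ignored anyway, so skip it
    with open(os.path.join(adapter_dir, "vterm_state"), "w") as attr:
        attr.write("0")
    return True
```

Checking the state first mirrors the rule that the write is ignored
unless the attribute previously read '1'.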
|
||||
|
||||
---------------------------------------------------------------------------
|
||||
7. Configuration
|
||||
|
||||
Each vty-server has a sysfs entry in the /sys/devices/vio directory, which
|
||||
is symlinked in several other sysfs tree directories, notably under the
|
||||
hvcs driver entry, which looks like the following example:
|
||||
|
||||
Pow5:/sys/bus/vio/drivers/hvcs # ls
|
||||
. .. 30000003 30000004 rescan
|
||||
|
||||
By design, firmware notifies the hvcs driver of vty-server lifetimes and
|
||||
partner vty removals but not the addition of partner vtys. Since an HMC
|
||||
Super Admin can add partner info dynamically we have provided the hvcs
|
||||
driver sysfs directory with the "rescan" update attribute which will query
|
||||
firmware and update the partner info for all the vty-servers that this
|
||||
driver manages. Writing a '1' to the attribute triggers the update. An
|
||||
explicit example follows:
|
||||
|
||||
Pow5:/sys/bus/vio/drivers/hvcs # echo 1 > rescan
|
||||
|
||||
Reading the attribute will indicate a state of '1' or '0'. A one indicates
|
||||
that an update is in process. A zero indicates that an update has
|
||||
completed or was never executed.
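
One way to use the two states of "rescan" is to trigger the update and
then poll until it reads '0'. The sketch below assumes exactly that
behaviour; the helper names and the timeout and interval defaults are
hypothetical choices, not values taken from the driver.

```python
import os
import time


def start_rescan(driver_dir):
    """Ask the driver to refresh partner info by writing '1' to "rescan"."""
    with open(os.path.join(driver_dir, "rescan"), "w") as attr:
        attr.write("1")


def wait_for_rescan(driver_dir, timeout=5.0, interval=0.1):
    """Poll "rescan" until it reads '0'; return False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        with open(os.path.join(driver_dir, "rescan")) as attr:
            if attr.read().strip() == "0":
                return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```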
|
||||
|
||||
Vty-server entries in this directory are a 32 bit partition unique unit
|
||||
address that is created by firmware. An example vty-server sysfs entry
|
||||
looks like the following:
|
||||
|
||||
Pow5:/sys/bus/vio/drivers/hvcs/30000004 # ls
|
||||
. current_vty devspec name partner_vtys
|
||||
.. index partner_clcs vterm_state
|
||||
|
||||
Each entry is provided, by default, with a "name" attribute. Reading the
|
||||
"name" attribute will reveal the device type as shown in the following
|
||||
example:
|
||||
|
||||
Pow5:/sys/bus/vio/drivers/hvcs/30000003 # cat name
|
||||
vty-server
|
||||
|
||||
Each entry is also provided, by default, with a "devspec" attribute which
|
||||
reveals the full device specification when read, as shown in the following
|
||||
example:
|
||||
|
||||
Pow5:/sys/bus/vio/drivers/hvcs/30000004 # cat devspec
|
||||
/vdevice/vty-server@30000004
|
||||
|
||||
Each vty-server sysfs dir is provided with two read-only attributes that
|
||||
provide lists of easily parsed partner vty data: "partner_vtys" and
|
||||
"partner_clcs".
|
||||
|
||||
Pow5:/sys/bus/vio/drivers/hvcs/30000004 # cat partner_vtys
|
||||
30000000
|
||||
30000001
|
||||
30000002
|
||||
30000000
|
||||
30000000
|
||||
|
||||
Pow5:/sys/bus/vio/drivers/hvcs/30000004 # cat partner_clcs
|
||||
U5112.428.103048A-V3-C0
|
||||
U5112.428.103048A-V3-C2
|
||||
U5112.428.103048A-V3-C3
|
||||
U5112.428.103048A-V4-C0
|
||||
U5112.428.103048A-V5-C0
|
||||
|
||||
Reading partner_vtys returns a list of partner vtys. Vty unit address
|
||||
numbering is only per-partition-unique so entries will frequently repeat.
|
||||
|
||||
Reading partner_clcs returns a list of "converged location codes" which are
|
||||
composed of a system serial number followed by "-V*", where the '*' is the
|
||||
target partition number, and "-C*", where the '*' is the slot of the
|
||||
adapter. The first vty partner corresponds to the first clc item, the
|
||||
second vty partner to the second clc item, etc.
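
Because the two lists line up item by item, they can be paired and the
location codes split into their parts. The following is a sketch only:
the regular expression is inferred from the description above (a leading
system part, then "-V" and "-C" fields) and the helper names are
hypothetical.

```python
import re

# Pattern follows the description above: <system>-V<partition>-C<slot>.
CLC_PATTERN = re.compile(r"^(?P<system>.+)-V(?P<partition>\d+)-C(?P<slot>\d+)$")


def parse_clc(clc):
    """Split a converged location code into (system, partition, slot)."""
    match = CLC_PATTERN.match(clc.strip())
    if match is None:
        raise ValueError("unrecognised location code: %r" % clc)
    return (match.group("system"),
            int(match.group("partition")),
            int(match.group("slot")))


def pair_partners(vtys_text, clcs_text):
    """Pair each partner vty with the parsed clc at the same position."""
    vtys = vtys_text.split()
    clcs = clcs_text.split()
    if len(vtys) != len(clcs):
        raise ValueError("partner_vtys and partner_clcs differ in length")
    return [(vty,) + parse_clc(clc) for vty, clc in zip(vtys, clcs)]
```

Pairing by position is what makes the repeated vty unit addresses
distinguishable: the partition number in the clc tells them apart.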
|
||||
|
||||
A vty-server can only be connected to a single vty at a time. The entry,
|
||||
"current_vty" prints the clc of the currently selected partner vty when
|
||||
read.
|
||||
|
||||
The current_vty can be changed by writing a valid partner clc to the entry
|
||||
as in the following example:
|
||||
|
||||
Pow5:/sys/bus/vio/drivers/hvcs/30000004 # echo U5112.428.103048A-V4-C0 > current_vty
|
||||
|
||||
Changing the current_vty when a vty-server is already connected to a vty
|
||||
does not affect the current connection. The change takes effect when the
|
||||
currently open connection is freed.
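
A cautious script might confirm that the clc is actually listed in
"partner_clcs" before writing it. The helper below is a hypothetical
sketch of that idea; whether the driver itself rejects unknown codes is
not stated here, so the check is purely a convenience.

```python
import os


def select_partner(adapter_dir, clc):
    """Write clc to "current_vty" after checking it is a listed partner."""
    with open(os.path.join(adapter_dir, "partner_clcs")) as attr:
        partners = attr.read().split()
    if clc not in partners:
        raise ValueError("%s is not a partner of this vty-server" % clc)
    with open(os.path.join(adapter_dir, "current_vty"), "w") as attr:
        attr.write(clc)
```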
|
||||
|
||||
Information on the "vterm_state" attribute was covered earlier in the
section entitled "Disconnection".
|
||||
|
||||
---------------------------------------------------------------------------
|
||||
8. Questions & Answers:
|
||||
===========================================================================
|
||||
Q: What are the security concerns involving hvcs?
|
||||
|
||||
A: There are three main security concerns:
|
||||
|
||||
1. The creator of the /dev/hvcs* nodes has the ability to restrict
|
||||
the access of the device entries to certain users or groups. It
|
||||
may be best to create a special hvcs group privilege for providing
|
||||
access to system consoles.
|
||||
|
||||
2. To provide network security when grabbing the console it is
|
||||
suggested that the user connect to the console hosting partition
|
||||
using a secure method, such as SSH or sit at a hardware console.
|
||||
|
||||
3. Make sure to exit the user session when done with a console or
|
||||
the next vty-server connection (which may be from another
|
||||
partition) will experience the previously logged in session.
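
For the first concern, restricting a device node to one group could look
roughly like this. It is a sketch only: restrict_node is a hypothetical
helper, the 0o660 mode is an assumption, and any group name passed in
(for example a dedicated "hvcs" group) has to exist on the system
already.

```python
import os
import shutil
import stat


def restrict_node(path, group=None, mode=0o660):
    """Limit a device node to its owner and, optionally, one group."""
    if group is not None:
        shutil.chown(path, group=group)  # needs suitable privileges
    os.chmod(path, mode)
    # Return the resulting permission bits so the caller can verify them.
    return stat.S_IMODE(os.stat(path).st_mode)
```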
|
||||
|
||||
---------------------------------------------------------------------------
|
||||
Q: How do I multiplex a console that I grab through hvcs so that other
|
||||
people can see it?
|
||||
|
||||
A: You can use "screen" to directly connect to the /dev/hvcs* device and
|
||||
set up a session on your machine with the console group privileges. As
pointed out earlier, by default screen doesn't provide the termcap settings
|
||||
for most terminal emulators to provide adequate character conversion from
|
||||
term type "screen" to others. This means that curses based programs may
|
||||
not display properly in screen sessions.
|
||||
|
||||
---------------------------------------------------------------------------
|
||||
Q: Why are the colors all messed up?
|
||||
Q: Why are the control characters acting strange or not working?
|
||||
Q: Why is the console output all strange and unintelligible?
|
||||
|
||||
A: Please see the preceding section on "Connection" for a discussion of how
|
||||
applications can affect the display of character control sequences.
|
||||
Additionally, just because you logged into the console using an xterm
|
||||
doesn't mean someone else didn't log into the console with the HMC console
|
||||
(vt320) before you and leave the session logged in. The best thing to do
|
||||
is to export TERM to the terminal type of your terminal emulator when you
|
||||
get the console. Additionally make sure to "exit" the console before you
|
||||
disconnect from the console. This will ensure that the next user gets
|
||||
their own TERM type set when they login.
|
||||
|
||||
---------------------------------------------------------------------------
|
||||
Q: When I try to CONNECT kermit to an hvcs device I get:
"Sorry, can't open connection: /dev/hvcs*" What is happening?
|
||||
|
||||
A: Some other Power5 console mechanism has a connection to the vty and
|
||||
isn't giving it up. You can try to force disconnect the consoles from the
|
||||
HMC by right clicking on the partition and then selecting "close terminal".
|
||||
Otherwise you have to hunt down the people who have console authority. It
|
||||
is possible that you already have the console open using another kermit
|
||||
session and just forgot about it. Please review the console options for
|
||||
Power5 systems to determine the many ways a system console can be held.
|
||||
|
||||
OR
|
||||
|
||||
A: Another user may not have a connectivity method currently attached to a
|
||||
/dev/hvcs device but the vterm_state may reveal that they still have the
|
||||
vty-server connection established. They need to free this using the method
|
||||
outlined in the section on "Disconnection" in order for others to connect
|
||||
to the target vty.
|
||||
|
||||
OR
|
||||
|
||||
A: The user profile you are using to execute kermit probably doesn't have
|
||||
permissions to use the /dev/hvcs* device.
|
||||
|
||||
OR
|
||||
|
||||
A: You probably haven't inserted the hvcs.ko module yet but the /dev/hvcs*
|
||||
entry still exists (on systems without udev).
|
||||
|
||||
OR
|
||||
|
||||
A: There is not a corresponding vty-server device that maps to an existing
|
||||
/dev/hvcs* entry.
|
||||
|
||||
---------------------------------------------------------------------------
|
||||
Q: When I try to CONNECT kermit to an hvcs device I get:
|
||||
"Sorry, write access to UUCP lockfile directory denied."
|
||||
|
||||
A: The /dev/hvcs* entry you have specified doesn't exist where you said it
|
||||
does. Maybe you haven't inserted the module (on systems with udev).
|
||||
|
||||
---------------------------------------------------------------------------
|
||||
Q: If I already have one Linux partition installed can I use hvcs on said
|
||||
partition to provide the console for the install of a second Linux
|
||||
partition?
|
||||
|
||||
A: Yes, granted that you are connected to the /dev/hvcs* device using
|
||||
kermit or cu or some other program that doesn't provide terminal emulation.
|
||||
|
||||
---------------------------------------------------------------------------
|
||||
Q: Can I connect to more than one partition's console at a time using this
|
||||
driver?
|
||||
|
||||
A: Yes. Of course this means that there must be more than one vty-server
|
||||
configured for this partition and each must point to a disconnected vty.
|
||||
|
||||
---------------------------------------------------------------------------
|
||||
Q: Does the hvcs driver support dynamic (hotplug) addition of devices?
|
||||
|
||||
A: Yes, if you have dlpar and hotplug enabled for your system and it has
|
||||
been built into the kernel, the hvcs driver is configured to dynamically
|
||||
handle additions of new devices and removals of unused devices.
|
||||
|
||||
---------------------------------------------------------------------------
|
||||
Q: For some reason /dev/hvcs* doesn't map to the same vty-server adapter
|
||||
after a reboot. What happened?
|
||||
|
||||
A: Assignment of vty-server adapters to /dev/hvcs* entries is always done
|
||||
in the order that the adapters are exposed. Due to hotplug capabilities of
|
||||
this driver assignment of hotplug added vty-servers may be in a different
|
||||
order than how they would be exposed on module load. Rebooting or
|
||||
reloading the module after dynamic addition may result in the /dev/hvcs*
|
||||
and vty-server coupling changing if a vty-server adapter was added in a
|
||||
slot between two other vty-server adapters. Refer to the section above
|
||||
on how to determine which vty-server goes with which /dev/hvcs* node.
|
||||
Hint: look at the sysfs "index" attribute for the vty-server.
|
||||
|
||||
---------------------------------------------------------------------------
|
||||
Q: Can I use /dev/hvcs* as a conduit to another partition and use a tty
|
||||
device on that partition as the other end of the pipe?
|
||||
|
||||
A: Yes, on Power5 platforms the hvc_console driver provides a tty interface
|
||||
for extra /dev/hvc* devices (where /dev/hvc0 is most likely the console).
|
||||
In order to get a tty conduit working between the two partitions the HMC
|
||||
Super Admin must create an additional "serial server" for the target
|
||||
partition with the HMC GUI which will show up as /dev/hvc* when the target
|
||||
partition is rebooted.
|
||||
|
||||
The HMC Super Admin then creates an additional "serial client" for the
|
||||
current partition and points this at the target partition's newly created
|
||||
"serial server" adapter (remember the slot). This shows up as an
|
||||
additional /dev/hvcs* device.
|
||||
|
||||
Now a program on the target system can be configured to read or write to
|
||||
/dev/hvc* and another program on the current partition can be configured to
|
||||
read or write to /dev/hvcs*. Now you have a tty conduit between two
|
||||
partitions.
|
||||
|
||||
---------------------------------------------------------------------------
|
||||
9. Reporting Bugs:
|
||||
|
||||
The proper channel for reporting bugs is either through the Linux OS
|
||||
distribution company that provided your OS or by posting issues to the
|
||||
PowerPC development mailing list at:
|
||||
|
||||
linuxppc-dev@lists.ozlabs.org
|
||||
|
||||
This request is to provide a documented and searchable public exchange
|
||||
of the problems and solutions surrounding this driver for the benefit of
|
||||
all users.
|
39
Documentation/powerpc/mpc52xx.txt
Normal file
|
@ -0,0 +1,39 @@
|
|||
Linux 2.6.x on MPC52xx family
|
||||
-----------------------------
|
||||
|
||||
For the latest info, go to http://www.246tNt.com/mpc52xx/
|
||||
|
||||
To compile/use :
|
||||
|
||||
- U-Boot:
|
||||
# <edit Makefile to set ARCH=ppc & CROSS_COMPILE=... (also EXTRAVERSION
if you wish to)>
|
||||
# make lite5200_defconfig
|
||||
# make uImage
|
||||
|
||||
then, on U-Boot:
|
||||
=> tftpboot 200000 uImage
|
||||
=> tftpboot 400000 pRamdisk
|
||||
=> bootm 200000 400000
|
||||
|
||||
- DBug:
|
||||
# <edit Makefile to set ARCH=ppc & CROSS_COMPILE=... (also EXTRAVERSION
if you wish to)>
|
||||
# make lite5200_defconfig
|
||||
# cp your_initrd.gz arch/ppc/boot/images/ramdisk.image.gz
|
||||
# make zImage.initrd
|
||||
# make
|
||||
|
||||
then in DBug:
|
||||
DBug> dn -i zImage.initrd.lite5200
|
||||
|
||||
|
||||
Some remarks :
|
||||
- The port is named mpc52xx, and config options are PPC_MPC52xx. The MGT5100
is not supported, and I'm not sure anyone is interested in working on it.
I didn't take 5xxx because there are apparently a lot of 5xxx parts that have
nothing to do with the MPC5200. I also included the 'MPC' for the same
reason.
|
||||
- Of course, I took inspiration from the 2.4 port. If you think I forgot to
|
||||
mention you/your company in the copyright of some code, I'll correct it
|
||||
ASAP.
|
137
Documentation/powerpc/pmu-ebb.txt
Normal file
|
@ -0,0 +1,137 @@
|
|||
PMU Event Based Branches
|
||||
========================
|
||||
|
||||
Event Based Branches (EBBs) are a feature which allows the hardware to
|
||||
branch directly to a specified user space address when certain events occur.
|
||||
|
||||
The full specification is available in Power ISA v2.07:
|
||||
|
||||
https://www.power.org/documentation/power-isa-version-2-07/
|
||||
|
||||
One type of event for which EBBs can be configured is PMU exceptions. This
|
||||
document describes the API for configuring the Power PMU to generate EBBs,
|
||||
using the Linux perf_events API.
|
||||
|
||||
|
||||
Terminology
|
||||
-----------
|
||||
|
||||
Throughout this document we will refer to an "EBB event" or "EBB events". This
|
||||
just refers to a struct perf_event which has set the "EBB" flag in its
|
||||
attr.config. All events which can be configured on the hardware PMU are
|
||||
possible "EBB events".
|
||||
|
||||
|
||||
Background
|
||||
----------
|
||||
|
||||
When a PMU EBB occurs it is delivered to the currently running process. As such
|
||||
EBBs can only sensibly be used by programs for self-monitoring.
|
||||
|
||||
It is a feature of the perf_events API that events can be created on other
|
||||
processes, subject to standard permission checks. This is also true of EBB
|
||||
events, however unless the target process enables EBBs (via mtspr(BESCR)) no
|
||||
EBBs will ever be delivered.
|
||||
|
||||
This makes it possible for a process to enable EBBs for itself, but not
|
||||
actually configure any events. At a later time another process can come along
|
||||
and attach an EBB event to the process, which will then cause EBBs to be
|
||||
delivered to the first process. It's not clear if this is actually useful.
|
||||
|
||||
|
||||
When the PMU is configured for EBBs, all PMU interrupts are delivered to the
|
||||
user process. This means once an EBB event is scheduled on the PMU, no non-EBB
|
||||
events can be configured. This means that EBB events cannot be run
|
||||
concurrently with regular 'perf' commands, or any other perf events.
|
||||
|
||||
It is however safe to run 'perf' commands on a process which is using EBBs. The
|
||||
kernel will in general schedule the EBB event, and perf will be notified that
|
||||
its events could not run.
|
||||
|
||||
The exclusion between EBB events and regular events is implemented using the
|
||||
existing "pinned" and "exclusive" attributes of perf_events. This means EBB
|
||||
events will be given priority over other events, unless they are also pinned.
|
||||
If an EBB event and a regular event are both pinned, then whichever is enabled
|
||||
first will be scheduled and the other will be put in error state. See the
|
||||
section below titled "Enabling an EBB event" for more information.
|
||||
|
||||
|
||||
Creating an EBB event
|
||||
---------------------
|
||||
|
||||
To request that an event is counted using EBB, the event code should have bit
|
||||
63 set.
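
As a small illustration of the rule above, the flag can be applied to a
raw event code with a bitwise OR. The sketch assumes "bit 63" counts from
the least significant bit, so it is the top bit of a 64-bit value; the
names EBB_FLAG, with_ebb and wants_ebb are hypothetical, and 0x1001e is
just an arbitrary placeholder event code.

```python
EBB_FLAG = 1 << 63  # bit 63 of the event code requests EBB


def with_ebb(event_code):
    """Return the event code with the EBB request bit set."""
    return event_code | EBB_FLAG


def wants_ebb(event_code):
    """Return True if the event code already requests EBB."""
    return bool(event_code & EBB_FLAG)


if __name__ == "__main__":
    print(hex(with_ebb(0x1001e)))
```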
|
||||
|
||||
EBB events must be created with a particular, and restrictive, set of
|
||||
attributes - this is so that they interoperate correctly with the rest of the
|
||||
perf_events subsystem.
|
||||
|
||||
An EBB event must be created with the "pinned" and "exclusive" attributes set.
|
||||
Note that if you are creating a group of EBB events, only the leader can have
|
||||
these attributes set.
|
||||
|
||||
An EBB event must NOT set any of the "inherit", "sample_period", "freq" or
|
||||
"enable_on_exec" attributes.
|
||||
|
||||
An EBB event must be attached to a task. This is specified to perf_event_open()
|
||||
by passing a pid value, typically 0 indicating the current task.
|
||||
|
||||
All events in a group must agree on whether they want EBB. That is, all events
|
||||
must request EBB, or none may request EBB.
|
||||
|
||||
EBB events must specify the PMC they are to be counted on. This ensures
|
||||
userspace is able to reliably determine which PMC the event is scheduled on.
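
The rules in this section can be collected into a simple checker. This is
a sketch under stated assumptions, not kernel code: each event is a plain
dict standing in for a perf_event_attr, the "pmc" key is a hypothetical
stand-in for however the PMC is encoded, and a negative pid is taken to
mean the event is not attached to a task.

```python
EBB_FLAG = 1 << 63
FORBIDDEN = ("inherit", "sample_period", "freq", "enable_on_exec")


def check_ebb_group(events, pid=0):
    """Return a list of rule violations; events[0] is the group leader."""
    problems = []
    wants = [bool(event["config"] & EBB_FLAG) for event in events]
    if any(wants) and not all(wants):
        problems.append("events in a group must all request EBB, or none")
    if not any(wants):
        return problems  # not an EBB group, nothing more to check
    if pid < 0:
        problems.append("an EBB event must be attached to a task")
    for position, event in enumerate(events):
        leader = position == 0
        for name in ("pinned", "exclusive"):
            # Set on the leader, clear on every other group member.
            if bool(event.get(name)) != leader:
                problems.append("%s must be set on the leader only" % name)
        for name in FORBIDDEN:
            if event.get(name):
                problems.append("%s must not be set" % name)
        if not event.get("pmc"):
            problems.append("the PMC must be specified")
    return problems
```

An empty list means the modelled attributes satisfy every rule listed
above; it says nothing about whether the kernel will accept the event.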
|
||||
|
||||
|
||||
Enabling an EBB event
|
||||
---------------------
|
||||
|
||||
Once an EBB event has been successfully opened, it must be enabled with the
|
||||
perf_events API. This can be achieved either via the ioctl() interface, or the
|
||||
prctl() interface.
|
||||
|
||||
However, due to the design of the perf_events API, enabling an event does not
|
||||
guarantee that it has been scheduled on the PMU. To ensure that the EBB event
|
||||
has been scheduled on the PMU, you must perform a read() on the event. If the
|
||||
read() returns EOF, then the event has not been scheduled and EBBs are not
|
||||
enabled.
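
The read() check described above might be wrapped as follows. The sketch
assumes the event was opened with the default read format, so a
successful read returns a single 8-byte counter value; the helper name
ebb_event_scheduled is hypothetical and event_fd is whatever descriptor
perf_event_open() returned.

```python
import os


def ebb_event_scheduled(event_fd):
    """Return True if read() yields data, False if it reports EOF."""
    # An empty result means EOF: the event was not scheduled on the PMU.
    return len(os.read(event_fd, 8)) > 0
```

The value read is not interpreted here; only the distinction between
data and EOF matters for this check.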
|
||||
|
||||
This behaviour occurs because the EBB event is pinned and exclusive. When the
|
||||
EBB event is enabled it will force all other non-pinned events off the PMU. In
|
||||
this case the enable will be successful. However if there is already an event
|
||||
pinned on the PMU then the enable will not be successful.
|
||||
|
||||
|
||||
Reading an EBB event
|
||||
--------------------
|
||||
|
||||
It is possible to read() from an EBB event. However the results are
|
||||
meaningless. Because interrupts are being delivered to the user process the
|
||||
kernel is not able to count the event, and so will return a junk value.
|
||||
|
||||
|
||||
Closing an EBB event
|
||||
--------------------
|
||||
|
||||
When an EBB event is finished with, you can close it using close() as for any
|
||||
regular event. If this is the last EBB event the PMU will be deconfigured and
|
||||
no further PMU EBBs will be delivered.
|
||||
|
||||
|
||||
EBB Handler
|
||||
-----------
|
||||
|
||||
The EBB handler is just regular userspace code; however, it must be written in
the style of an interrupt handler. When the handler is entered, any register
may be live, so all registers must be saved before the handler can invoke
other code.
|
||||
|
||||
It's up to the program how to handle this. For C programs a relatively simple
|
||||
option is to create an interrupt frame on the stack and save registers there.
|
||||
|
||||
Fork
|
||||
----
|
||||
|
||||
EBB events are not inherited across fork. If the child process wishes to use
|
||||
EBBs it should open a new event for itself. Similarly the EBB state in
|
||||
BESCR/EBBHR/EBBRR is cleared across fork().
|
151
Documentation/powerpc/ptrace.txt
Normal file
|
@ -0,0 +1,151 @@
|
|||
GDB intends to support the following hardware debug features of BookE
|
||||
processors:
|
||||
|
||||
4 hardware breakpoints (IAC)
|
||||
2 hardware watchpoints (read, write and read-write) (DAC)
|
||||
2 value conditions for the hardware watchpoints (DVC)
|
||||
|
||||
For that, we need to extend ptrace so that GDB can query and set these
|
||||
resources. Since we're extending, we're trying to create an interface
|
||||
that's extendable and that covers both BookE and server processors, so
|
||||
that GDB doesn't need to special-case each of them. We added the
|
||||
following 3 new ptrace requests.
|
||||
|
||||
1. PTRACE_PPC_GETHWDEBUGINFO
|
||||
|
||||
Query for GDB to discover the hardware debug features. The main info to
|
||||
be returned here is the minimum alignment for the hardware watchpoints.
|
||||
BookE processors don't have restrictions here, but server processors have
|
||||
an 8-byte alignment restriction for hardware watchpoints. We'd like to avoid
|
||||
adding special cases to GDB based on what it sees in AUXV.
|
||||
|
||||
Since we're at it, we added other useful info that the kernel can return to
|
||||
GDB: this query will return the number of hardware breakpoints, hardware
|
||||
watchpoints and whether it supports a range of addresses and a condition.
|
||||
The query will fill the following structure provided by the requesting process:
|
||||
|
||||
struct ppc_debug_info {
    uint32_t version;
    uint32_t num_instruction_bps;
    uint32_t num_data_bps;
    uint32_t num_condition_regs;
    uint32_t data_bp_alignment;
    uint32_t sizeof_condition; /* size of the DVC register */
    uint64_t features; /* bitmask of the individual flags */
};
|
||||
|
||||
features will have bits indicating whether there is support for:
|
||||
|
||||
#define PPC_DEBUG_FEATURE_INSN_BP_RANGE 0x1
|
||||
#define PPC_DEBUG_FEATURE_INSN_BP_MASK 0x2
|
||||
#define PPC_DEBUG_FEATURE_DATA_BP_RANGE 0x4
|
||||
#define PPC_DEBUG_FEATURE_DATA_BP_MASK 0x8
|
||||
#define PPC_DEBUG_FEATURE_DATA_BP_DAWR 0x10
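
As a rough illustration of how a debugger might consume this information, the
sketch below tests individual feature bits and checks a candidate watchpoint
address against data_bp_alignment. The structure and flag values are repeated
from the definitions above so that the block stands alone. The helper names
are hypothetical, and treating an alignment of 0 or 1 as "no restriction" is
an assumption rather than documented kernel behaviour.

```c
#include <stdint.h>

/* Repeated from the text above so the example is self-contained. */
struct ppc_debug_info {
    uint32_t version;
    uint32_t num_instruction_bps;
    uint32_t num_data_bps;
    uint32_t num_condition_regs;
    uint32_t data_bp_alignment;
    uint32_t sizeof_condition;
    uint64_t features;
};

#define PPC_DEBUG_FEATURE_INSN_BP_RANGE 0x1
#define PPC_DEBUG_FEATURE_INSN_BP_MASK  0x2
#define PPC_DEBUG_FEATURE_DATA_BP_RANGE 0x4
#define PPC_DEBUG_FEATURE_DATA_BP_MASK  0x8
#define PPC_DEBUG_FEATURE_DATA_BP_DAWR  0x10

/* Hypothetical helper: 1 if the given feature bit is set, else 0. */
int dbg_has_feature(const struct ppc_debug_info *info, uint64_t flag)
{
    return (info->features & flag) != 0;
}

/*
 * Hypothetical helper: 1 if 'addr' satisfies the reported watchpoint
 * alignment. An alignment of 0 or 1 is assumed to mean "unrestricted".
 */
int dbg_watch_addr_aligned(const struct ppc_debug_info *info, uint64_t addr)
{
    if (info->data_bp_alignment <= 1)
        return 1;
    return (addr % info->data_bp_alignment) == 0;
}
```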
|
||||
|
||||
2. PTRACE_SETHWDEBUG
|
||||
|
||||
Sets a hardware breakpoint or watchpoint, according to the provided structure:
|
||||
|
||||
struct ppc_hw_breakpoint {
|
||||
uint32_t version;
|
||||
#define PPC_BREAKPOINT_TRIGGER_EXECUTE 0x1
|
||||
#define PPC_BREAKPOINT_TRIGGER_READ 0x2
|
||||
#define PPC_BREAKPOINT_TRIGGER_WRITE 0x4
|
||||
uint32_t trigger_type; /* only some combinations allowed */
|
||||
#define PPC_BREAKPOINT_MODE_EXACT 0x0
|
||||
#define PPC_BREAKPOINT_MODE_RANGE_INCLUSIVE 0x1
|
||||
#define PPC_BREAKPOINT_MODE_RANGE_EXCLUSIVE 0x2
|
||||
#define PPC_BREAKPOINT_MODE_MASK 0x3
|
||||
uint32_t addr_mode; /* address match mode */
|
||||
|
||||
#define PPC_BREAKPOINT_CONDITION_MODE 0x3
|
||||
#define PPC_BREAKPOINT_CONDITION_NONE 0x0
|
||||
#define PPC_BREAKPOINT_CONDITION_AND 0x1
|
||||
#define PPC_BREAKPOINT_CONDITION_EXACT 0x1 /* different name for the same thing as above */
|
||||
#define PPC_BREAKPOINT_CONDITION_OR 0x2
|
||||
#define PPC_BREAKPOINT_CONDITION_AND_OR 0x3
|
||||
#define PPC_BREAKPOINT_CONDITION_BE_ALL 0x00ff0000 /* byte enable bits */
|
||||
#define PPC_BREAKPOINT_CONDITION_BE(n) (1<<((n)+16))
|
||||
uint32_t condition_mode; /* break/watchpoint condition flags */
|
||||
|
||||
uint64_t addr;
|
||||
uint64_t addr2;
|
||||
uint64_t condition_value;
|
||||
};
|
||||
|
||||
A request specifies one event, not necessarily just one register to be set.
|
||||
For instance, if the request is for a watchpoint with a condition, both the
|
||||
DAC and DVC registers will be set in the same request.
|
||||
|
||||
With this GDB can ask for all kinds of hardware breakpoints and watchpoints
|
||||
that the BookE supports. COMEFROM breakpoints available in server processors
|
||||
are not contemplated, but that is out of the scope of this work.
|
||||
|
||||
ptrace will return an integer (handle) uniquely identifying the breakpoint or
|
||||
watchpoint just created. This integer will be used in the PTRACE_DELHWDEBUG
|
||||
request to ask for its removal. The request fails with -ENOSPC if the
requested breakpoint can't be allocated on the registers.
|
||||
|
||||
Some examples of using the structure to:
|
||||
|
||||
- set a breakpoint in the first breakpoint register
|
||||
|
||||
p.version = PPC_DEBUG_CURRENT_VERSION;
|
||||
p.trigger_type = PPC_BREAKPOINT_TRIGGER_EXECUTE;
|
||||
p.addr_mode = PPC_BREAKPOINT_MODE_EXACT;
|
||||
p.condition_mode = PPC_BREAKPOINT_CONDITION_NONE;
|
||||
p.addr = (uint64_t) address;
|
||||
p.addr2 = 0;
|
||||
p.condition_value = 0;
|
||||
|
||||
- set a watchpoint which triggers on reads in the second watchpoint register
|
||||
|
||||
p.version = PPC_DEBUG_CURRENT_VERSION;
|
||||
p.trigger_type = PPC_BREAKPOINT_TRIGGER_READ;
|
||||
p.addr_mode = PPC_BREAKPOINT_MODE_EXACT;
|
||||
p.condition_mode = PPC_BREAKPOINT_CONDITION_NONE;
|
||||
p.addr = (uint64_t) address;
|
||||
p.addr2 = 0;
|
||||
p.condition_value = 0;
|
||||
|
||||
- set a watchpoint which triggers only with a specific value
|
||||
|
||||
p.version = PPC_DEBUG_CURRENT_VERSION;
|
||||
p.trigger_type = PPC_BREAKPOINT_TRIGGER_READ;
|
||||
p.addr_mode = PPC_BREAKPOINT_MODE_EXACT;
|
||||
p.condition_mode = PPC_BREAKPOINT_CONDITION_AND | PPC_BREAKPOINT_CONDITION_BE_ALL;
|
||||
p.addr = (uint64_t) address;
|
||||
p.addr2 = 0;
|
||||
p.condition_value = (uint64_t) condition;
|
||||
|
||||
- set a ranged hardware breakpoint
|
||||
|
||||
p.version = PPC_DEBUG_CURRENT_VERSION;
|
||||
p.trigger_type = PPC_BREAKPOINT_TRIGGER_EXECUTE;
|
||||
p.addr_mode = PPC_BREAKPOINT_MODE_RANGE_INCLUSIVE;
|
||||
p.condition_mode = PPC_BREAKPOINT_CONDITION_NONE;
|
||||
p.addr = (uint64_t) begin_range;
|
||||
p.addr2 = (uint64_t) end_range;
|
||||
p.condition_value = 0;
|
||||
|
||||
- set a watchpoint in server processors (BookS)
|
||||
|
||||
p.version = 1;
|
||||
p.trigger_type = PPC_BREAKPOINT_TRIGGER_RW;
|
||||
p.addr_mode = PPC_BREAKPOINT_MODE_RANGE_INCLUSIVE;
|
||||
or
|
||||
p.addr_mode = PPC_BREAKPOINT_MODE_EXACT;
|
||||
|
||||
p.condition_mode = PPC_BREAKPOINT_CONDITION_NONE;
|
||||
p.addr = (uint64_t) begin_range;
|
||||
/* For PPC_BREAKPOINT_MODE_RANGE_INCLUSIVE addr2 needs to be specified, where
|
||||
* addr2 - addr <= 8 Bytes.
|
||||
*/
|
||||
p.addr2 = (uint64_t) end_range;
|
||||
p.condition_value = 0;
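
The BookS case above can be wrapped in a small helper that enforces the 8-byte
limit before the request is issued. This is a sketch under stated assumptions:
the structure and constants are repeated from the listing above so that the
block is self-contained, PPC_BREAKPOINT_TRIGGER_RW is defined here as the OR
of the READ and WRITE triggers, the version value 1 follows the example above,
and books_fill_watchpoint() is a hypothetical name. The filled structure would
then be handed to the PTRACE_SETHWDEBUG request.

```c
#include <stdint.h>
#include <string.h>

/* Repeated from the text above so the example is self-contained. */
#define PPC_BREAKPOINT_TRIGGER_READ         0x2
#define PPC_BREAKPOINT_TRIGGER_WRITE        0x4
/* Assumption: RW is the combination of the READ and WRITE triggers. */
#define PPC_BREAKPOINT_TRIGGER_RW \
    (PPC_BREAKPOINT_TRIGGER_READ | PPC_BREAKPOINT_TRIGGER_WRITE)
#define PPC_BREAKPOINT_MODE_EXACT           0x0
#define PPC_BREAKPOINT_MODE_RANGE_INCLUSIVE 0x1
#define PPC_BREAKPOINT_CONDITION_NONE       0x0

struct ppc_hw_breakpoint {
    uint32_t version;
    uint32_t trigger_type;
    uint32_t addr_mode;
    uint32_t condition_mode;
    uint64_t addr;
    uint64_t addr2;
    uint64_t condition_value;
};

/*
 * Hypothetical helper: fill 'p' for a read/write watchpoint on
 * [begin, end]. Returns -1 if the range is reversed or violates the
 * "addr2 - addr <= 8 bytes" rule, 0 on success.
 */
int books_fill_watchpoint(struct ppc_hw_breakpoint *p,
                          uint64_t begin, uint64_t end)
{
    if (end < begin || end - begin > 8)
        return -1;

    memset(p, 0, sizeof(*p));
    p->version = 1;
    p->trigger_type = PPC_BREAKPOINT_TRIGGER_RW;
    p->condition_mode = PPC_BREAKPOINT_CONDITION_NONE;
    p->addr = begin;
    if (end == begin) {
        /* Single address: exact match, addr2 stays 0. */
        p->addr_mode = PPC_BREAKPOINT_MODE_EXACT;
    } else {
        p->addr_mode = PPC_BREAKPOINT_MODE_RANGE_INCLUSIVE;
        p->addr2 = end;
    }
    return 0;
}
```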
|
||||
|
||||
3. PTRACE_DELHWDEBUG
|
||||
|
||||
Takes an integer which identifies an existing breakpoint or watchpoint
|
||||
(i.e., the value returned from PTRACE_SETHWDEBUG), and deletes the
|
||||
corresponding breakpoint or watchpoint.
|
295
Documentation/powerpc/qe_firmware.txt
Normal file
|
@ -0,0 +1,295 @@
|
|||
Freescale QUICC Engine Firmware Uploading
|
||||
-----------------------------------------
|
||||
|
||||
(c) 2007 Timur Tabi <timur at freescale.com>,
|
||||
Freescale Semiconductor
|
||||
|
||||
Table of Contents
|
||||
=================
|
||||
|
||||
I - Software License for Firmware
|
||||
|
||||
II - Microcode Availability
|
||||
|
||||
III - Description and Terminology
|
||||
|
||||
IV - Microcode Programming Details
|
||||
|
||||
V - Firmware Structure Layout
|
||||
|
||||
VI - Sample Code for Creating Firmware Files
|
||||
|
||||
Revision Information
|
||||
====================
|
||||
|
||||
November 30, 2007: Rev 1.0 - Initial version
|
||||
|
||||
I - Software License for Firmware
|
||||
=================================
|
||||
|
||||
Each firmware file comes with its own software license. For information on
|
||||
the particular license, please see the license text that is distributed with
|
||||
the firmware.
|
||||
|
||||
II - Microcode Availability
|
||||
===========================
|
||||
|
||||
Firmware files are distributed through various channels. Some are available on
|
||||
http://opensource.freescale.com. For other firmware files, please contact
|
||||
your Freescale representative or your operating system vendor.
|
||||
|
||||
III - Description and Terminology
|
||||
================================
|
||||
|
||||
In this document, the term 'microcode' refers to the sequence of 32-bit
|
||||
integers that compose the actual QE microcode.
|
||||
|
||||
The term 'firmware' refers to a binary blob that contains the microcode as
|
||||
well as other data that
|
||||
|
||||
1) describes the microcode's purpose
|
||||
2) describes how and where to upload the microcode
|
||||
3) specifies the values of various registers
|
||||
4) includes additional data for use by specific device drivers
|
||||
|
||||
Firmware files are binary files that contain only a firmware.
|
||||
|
||||
IV - Microcode Programming Details
|
||||
===================================
|
||||
|
||||
The QE architecture allows for only one microcode present in I-RAM for each
|
||||
RISC processor. To replace any current microcode, a full QE reset (which
|
||||
disables the microcode) must be performed first.
|
||||
|
||||
QE microcode is uploaded using the following procedure:
|
||||
|
||||
1) The microcode is placed into I-RAM at a specific location, using the
|
||||
IRAM.IADD and IRAM.IDATA registers.
|
||||
|
||||
2) The CERCR.CIR bit is set to 0 or 1, depending on whether the firmware
|
||||
needs split I-RAM. Split I-RAM is only meaningful for SOCs that have
|
||||
QEs with multiple RISC processors, such as the 8360. Splitting the I-RAM
|
||||
allows each processor to run a different microcode, effectively creating an
|
||||
asymmetric multiprocessing (AMP) system.
|
||||
|
||||
3) The TIBCR trap registers are loaded with the addresses of the trap handlers
|
||||
in the microcode.
|
||||
|
||||
4) The RSP.ECCR register is programmed with the value provided.
|
||||
|
||||
5) If necessary, device drivers that need the virtual traps and extended mode
|
||||
data will use them.
|
||||
|
||||
Virtual Microcode Traps
|
||||
|
||||
These virtual traps are conditional branches in the microcode. They are
"soft" provisions introduced in the ROM code to allow greater flexibility and
to save hardware traps. If new features are activated, or an issue is being
fixed in the RAM package that uses them, they should be activated. This data
structure signals to the microcode which of these virtual traps are active.
|
||||
|
||||
This structure contains 6 words that the application should copy to specific,
predefined locations in parameter RAM (PRAM). This table describes the
structure.
|
||||
|
||||
---------------------------------------------------------------
| Offset in |                  | Destination Offset | Size of |
|   array   |     Protocol     |   within PRAM      | Operand |
---------------------------------------------------------------
|     0     | Ethernet         |      0xF8          | 4 bytes |
|           | interworking     |                    |         |
---------------------------------------------------------------
|     4     | ATM              |      0xF8          | 4 bytes |
|           | interworking     |                    |         |
---------------------------------------------------------------
|     8     | PPP              |      0xF8          | 4 bytes |
|           | interworking     |                    |         |
---------------------------------------------------------------
|     12    | Ethernet RX      |      0x22          | 1 byte  |
|           | Distributor Page |                    |         |
---------------------------------------------------------------
|     16    | ATM Global       |      0x28          | 1 byte  |
|           | Params Table     |                    |         |
---------------------------------------------------------------
|     20    | Insert Frame     |      0xF8          | 4 bytes |
---------------------------------------------------------------
|
||||
|
||||
|
||||
Extended Modes
|
||||
|
||||
This is a double word bit array (64 bits) that defines special functionality
|
||||
which has an impact on the software drivers. Each bit has its own impact
|
||||
and has special instructions for the s/w associated with it. This structure is
|
||||
described in this table:
|
||||
|
||||
-----------------------------------------------------------------------
| Bit #  |     Name     |   Description                               |
-----------------------------------------------------------------------
|   0    | General      | Indicates that prior to each host command   |
|        | push command | given by the application, the software must |
|        |              | assert a special host command (push command)|
|        |              | CECDR = 0x00800000.                         |
|        |              | CECR = 0x01c1000f.                          |
-----------------------------------------------------------------------
|   1    | UCC ATM      | Indicates that after issuing ATM RX INIT    |
|        | RX INIT      | command, the host must issue another special|
|        | push command | command (push command) and immediately      |
|        |              | following that re-issue the ATM RX INIT     |
|        |              | command. (This makes the sequence of        |
|        |              | initializing the ATM receiver a sequence of |
|        |              | three host commands)                        |
|        |              | CECDR = 0x00800000.                         |
|        |              | CECR = 0x01c1000f.                          |
-----------------------------------------------------------------------
|   2    | Add/remove   | Indicates that following the specific host  |
|        | command      | command: "Add/Remove entry in Hash Lookup   |
|        | validation   | Table" used in Interworking setup, the user |
|        |              | must issue another command.                 |
|        |              | CECDR = 0xce000003.                         |
|        |              | CECR = 0x01c10f58.                          |
-----------------------------------------------------------------------
|   3    | General push | Indicates that the s/w has to initialize    |
|        | command      | some pointers in the Ethernet thread pages  |
|        |              | which are used when Header Compression is   |
|        |              | activated. The full details of these        |
|        |              | pointers is located in the software drivers.|
-----------------------------------------------------------------------
|   4    | General push | Indicates that after issuing Ethernet TX    |
|        | command      | INIT command, user must issue this command  |
|        |              | for each SNUM of Ethernet TX thread.        |
|        |              | CECDR = 0x00800003.                         |
|        |              | CECR = 0x7'b{0}, 8'b{Enet TX thread SNUM},  |
|        |              |        1'b{1}, 12'b{0}, 4'b{1}              |
-----------------------------------------------------------------------
| 5 - 31 | N/A          | Reserved, set to zero.                      |
-----------------------------------------------------------------------
|
||||
|
||||
V - Firmware Structure Layout
|
||||
==============================
|
||||
|
||||
QE microcode from Freescale is typically provided as a header file. This
|
||||
header file contains macros that define the microcode binary itself as well as
|
||||
some other data used in uploading that microcode. The format of these files
does not lend itself to simple inclusion into other code, hence the need for
a more portable format. This section defines that format.
|
||||
|
||||
Instead of distributing a header file, the microcode and related data are
|
||||
embedded into a binary blob. This blob is passed to the qe_upload_firmware()
|
||||
function, which parses the blob and performs everything necessary to upload
|
||||
the microcode.
|
||||
|
||||
All integers are big-endian. See the comments for function
|
||||
qe_upload_firmware() for up-to-date implementation information.
|
||||
|
||||
This structure supports versioning, where the version of the structure is
|
||||
embedded into the structure itself. To ensure forward and backwards
|
||||
compatibility, all versions of the structure must use the same 'qe_header'
|
||||
structure at the beginning.
|
||||
|
||||
'header' (type: struct qe_header):
|
||||
The 'length' field is the size, in bytes, of the entire structure,
|
||||
including all the microcode embedded in it, as well as the CRC (if
|
||||
present).
|
||||
|
||||
The 'magic' field is an array of three bytes that contains the letters
|
||||
'Q', 'E', and 'F'. This is an identifier that indicates that this
|
||||
structure is a QE Firmware structure.
|
||||
|
||||
The 'version' field is a single byte that indicates the version of this
|
||||
structure. If the layout of the structure should ever need to be
|
||||
changed to add support for additional types of microcode, then the
|
||||
version number should also be changed.
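
The three header fields described above can be checked before any further
parsing. The following is a minimal sketch, not the kernel's
qe_upload_firmware() implementation: it assumes 'length', 'magic', and
'version' sit at the very start of the blob in the order described (4 + 3 + 1
bytes), and qe_header_check() and be32_read() are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit big-endian integer from a byte buffer. */
static uint32_t be32_read(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/*
 * Hypothetical helper: validate the start of a QE firmware blob.
 * Assumed layout: bytes 0-3 length (big-endian), bytes 4-6 magic "QEF",
 * byte 7 version. Returns 0 and stores the version on success, -1 if
 * the blob is too small, the magic is wrong, or the recorded length is
 * implausible for the buffer supplied.
 */
int qe_header_check(const uint8_t *blob, size_t size, uint8_t *version)
{
    uint32_t length;

    if (size < 8)
        return -1;
    if (blob[4] != 'Q' || blob[5] != 'E' || blob[6] != 'F')
        return -1;

    length = be32_read(blob);
    if (length < 8 || length > size)
        return -1;

    *version = blob[7];
    return 0;
}
```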
|
||||
|
||||
The 'id' field is a null-terminated string (suitable for printing) that
|
||||
identifies the firmware.
|
||||
|
||||
The 'count' field indicates the number of 'microcode' structures. There
|
||||
must be one and only one 'microcode' structure for each RISC processor.
|
||||
Therefore, this field also represents the number of RISC processors for this
|
||||
SOC.
|
||||
|
||||
The 'soc' structure contains the SOC numbers and revisions used to match
|
||||
the microcode to the SOC itself. Normally, the microcode loader should
|
||||
check the data in this structure with the SOC number and revisions, and
|
||||
only upload the microcode if there's a match. However, this check is not
|
||||
made on all platforms.
|
||||
|
||||
Although it is not recommended, you can specify '0' in the soc.model
|
||||
field to skip matching SOCs altogether.
|
||||
|
||||
The 'model' field is a 16-bit number that matches the actual SOC. The
|
||||
'major' and 'minor' fields are the major and minor revision numbers,
|
||||
respectively, of the SOC.
|
||||
|
||||
For example, to match the 8323, revision 1.0:
|
||||
soc.model = 8323
|
||||
soc.major = 1
|
||||
soc.minor = 0
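
The matching rule just described, including the '0' wildcard for soc.model,
might look like the sketch below. The structure here is an illustrative
stand-in rather than the on-disk layout, qe_soc_matches() is a hypothetical
name, and the real loader may apply the check differently (or, as noted
above, not at all on some platforms).

```c
#include <stdint.h>

/* Illustrative stand-in for the 'soc' fields; not the packed layout. */
struct qe_soc_id {
    uint16_t model;
    uint8_t major;
    uint8_t minor;
};

/*
 * Hypothetical helper: 1 if firmware built for 'fw' may be uploaded to
 * the running SOC 'hw', else 0. A firmware model of 0 skips matching.
 */
int qe_soc_matches(const struct qe_soc_id *fw, const struct qe_soc_id *hw)
{
    if (fw->model == 0)
        return 1;
    return fw->model == hw->model &&
           fw->major == hw->major &&
           fw->minor == hw->minor;
}
```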
|
||||
|
||||
'padding' is necessary for structure alignment. This field ensures that the
|
||||
'extended_modes' field is aligned on a 64-bit boundary.
|
||||
|
||||
'extended_modes' is a bitfield that defines special functionality which has an
|
||||
impact on the device drivers. Each bit has its own impact and has special
|
||||
instructions for the driver associated with it. This field is stored in
|
||||
the QE library and available to any driver that calls qe_get_firmware_info().
|
||||
|
||||
'vtraps' is an array of 8 words that contain virtual trap values for each
virtual trap. As with 'extended_modes', this field is stored in the QE
library and available to any driver that calls qe_get_firmware_info().
|
||||
|
||||
'microcode' (type: struct qe_microcode):
|
||||
For each RISC processor there is one 'microcode' structure. The first
|
||||
'microcode' structure is for the first RISC, and so on.
|
||||
|
||||
The 'id' field is a null-terminated string suitable for printing that
|
||||
identifies this particular microcode.
|
||||
|
||||
'traps' is an array of 16 words that contain hardware trap values
|
||||
for each of the 16 traps. If trap[i] is 0, then this particular
|
||||
trap is to be ignored (i.e. not written to TIBCR[i]). The entire value
|
||||
is written as-is to the TIBCR[i] register, so be sure to set the EN
|
||||
and T_IBP bits if necessary.
|
||||
|
||||
'eccr' is the value to program into the ECCR register.
|
||||
|
||||
'iram_offset' is the offset into IRAM to start writing the
|
||||
microcode.
|
||||
|
||||
'count' is the number of 32-bit words in the microcode.
|
||||
|
||||
'code_offset' is the offset, in bytes, from the beginning of this
|
||||
structure where the microcode itself can be found. The first
|
||||
microcode binary should be located immediately after the 'microcode'
|
||||
array.
|
||||
|
||||
'major', 'minor', and 'revision' are the major, minor, and revision
|
||||
version numbers, respectively, of the microcode. If all values are 0,
|
||||
then these fields are ignored.
|
||||
|
||||
'reserved' is necessary for structure alignment. Since 'microcode'
|
||||
is an array, the 64-bit 'extended_modes' field needs to be aligned
|
||||
on a 64-bit boundary, and this can only happen if the size of
|
||||
'microcode' is a multiple of 8 bytes. To ensure that, we add
|
||||
'reserved'.
|
||||
|
||||
After the last microcode is a 32-bit CRC. It can be calculated using
|
||||
this algorithm:
|
||||
|
||||
u32 crc32(const u8 *p, unsigned int len)
{
    unsigned int i;
    u32 crc = 0;

    while (len--) {
        crc ^= *p++;
        for (i = 0; i < 8; i++)
            crc = (crc >> 1) ^ ((crc & 1) ? 0xedb88320 : 0);
    }
    return crc;
}
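
To show how this CRC might be used when validating a firmware blob, here is a
self-contained sketch in standard C. It restates the algorithm above with
fixed-width types under the hypothetical name qe_crc32(), and adds a
hypothetical qe_crc_ok() helper. It assumes that the CRC covers every byte
that precedes it and that it is stored big-endian in the final four bytes,
which follows from "all integers are big-endian" but should be confirmed
against qe_upload_firmware().

```c
#include <stddef.h>
#include <stdint.h>

/* Same bitwise algorithm as above: seed 0, polynomial 0xedb88320. */
uint32_t qe_crc32(const uint8_t *p, size_t len)
{
    uint32_t crc = 0;
    unsigned int i;

    while (len--) {
        crc ^= *p++;
        for (i = 0; i < 8; i++)
            crc = (crc >> 1) ^ ((crc & 1) ? 0xedb88320u : 0);
    }
    return crc;
}

/*
 * Hypothetical helper: 1 if the trailing 32-bit big-endian CRC matches
 * the CRC computed over all preceding bytes, else 0.
 */
int qe_crc_ok(const uint8_t *blob, size_t size)
{
    size_t body;
    uint32_t stored;

    if (size < 4)
        return 0;
    body = size - 4;
    stored = ((uint32_t)blob[body] << 24) |
             ((uint32_t)blob[body + 1] << 16) |
             ((uint32_t)blob[body + 2] << 8) |
              (uint32_t)blob[body + 3];
    return qe_crc32(blob, body) == stored;
}
```

Note that, because the seed is 0 and there is no final inversion, this value
differs from the CRC-32 produced by common tools such as zlib's crc32().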
|
||||
|
||||
VI - Sample Code for Creating Firmware Files
|
||||
============================================
|
||||
|
||||
A Python program that creates firmware binaries from the header files normally
|
||||
distributed by Freescale can be found on http://opensource.freescale.com.
|
198
Documentation/powerpc/transactional_memory.txt
Normal file
|
@ -0,0 +1,198 @@
|
|||
Transactional Memory support
|
||||
============================
|
||||
|
||||
POWER kernel support for this feature is currently limited to supporting
|
||||
its use by user programs. It is not currently used by the kernel itself.
|
||||
|
||||
This file aims to sum up how it is supported by Linux and what behaviour you
|
||||
can expect from your user programs.
|
||||
|
||||
|
||||
Basic overview
|
||||
==============
|
||||
|
||||
Hardware Transactional Memory is supported on POWER8 processors, and is a
|
||||
feature that enables a different form of atomic memory access. Several new
|
||||
instructions are presented to delimit transactions; transactions are
|
||||
guaranteed to either complete atomically or roll back and undo any partial
|
||||
changes.
|
||||
|
||||
A simple transaction looks like this:
|
||||
|
||||
begin_move_money:
  tbegin
  beq   abort_handler

  ld    r4, SAVINGS_ACCT(r3)
  ld    r5, CURRENT_ACCT(r3)
  subi  r5, r5, 1
  addi  r4, r4, 1
  std   r4, SAVINGS_ACCT(r3)
  std   r5, CURRENT_ACCT(r3)

  tend

  b     continue

abort_handler:
  ... test for odd failures ...

  /* Retry the transaction if it failed because it conflicted with
   * someone else: */
  b     begin_move_money
|
||||
|
||||
|
||||
The 'tbegin' instruction denotes the start point, and 'tend' the end point.
|
||||
Between these points the processor is in 'Transactional' state; any memory
|
||||
references will complete in one go if there are no conflicts with other
|
||||
transactional or non-transactional accesses within the system. In this
|
||||
example, the transaction completes as though it were normal straight-line code
|
||||
IF no other processor has touched SAVINGS_ACCT(r3) or CURRENT_ACCT(r3); an
|
||||
atomic move of money from the current account to the savings account has been
|
||||
performed. Even though the normal ld/std instructions are used (note no
|
||||
lwarx/stwcx), either *both* SAVINGS_ACCT(r3) and CURRENT_ACCT(r3) will be
|
||||
updated, or neither will be updated.
|
||||
|
||||
If, in the meantime, there is a conflict with the locations accessed by the
|
||||
transaction, the transaction will be aborted by the CPU. Register and memory
|
||||
state will roll back to that at the 'tbegin', and control will continue from
|
||||
'tbegin+4'. The branch to abort_handler will be taken this second time; the
|
||||
abort handler can check the cause of the failure, and retry.
|
||||
|
||||
Checkpointed registers include all GPRs, FPRs, VRs/VSRs, LR, CCR/CR, CTR, FPSCR
|
||||
and a few other status/flag regs; see the ISA for details.
|
||||
|
||||
Causes of transaction aborts
|
||||
============================
|
||||
|
||||
- Conflicts with cache lines used by other processors
|
||||
- Signals
|
||||
- Context switches
|
||||
- See the ISA for full documentation of everything that will abort transactions.
|
||||
|
||||
|
||||
Syscalls
|
||||
========
|
||||
|
||||
Performing syscalls from within a transaction is not recommended, and can lead
|
||||
to unpredictable results.
|
||||
|
||||
Syscalls do not by design abort transactions, but beware: The kernel code will
|
||||
not be running in transactional state. The effect of syscalls will always
|
||||
remain visible, but depending on the call they may abort your transaction as a
|
||||
side-effect, read soon-to-be-aborted transactional data that should remain
invisible, etc. If you constantly retry a transaction that constantly aborts
|
||||
itself by calling a syscall, you'll have a livelock & make no progress.
|
||||
|
||||
Simple syscalls (e.g. sigprocmask()) "could" be OK. Even things like write()
|
||||
from, say, printf() should be OK as long as the kernel does not access any
|
||||
memory that was accessed transactionally.
|
||||
|
||||
Consider any syscalls that happen to work as debug-only -- not recommended for
|
||||
production use. Best to queue them up till after the transaction is over.
|
||||
|
||||
|
||||
Signals
|
||||
=======
|
||||
|
||||
Delivery of signals (both sync and async) during transactions provides a second
|
||||
thread state (ucontext/mcontext) to represent the second transactional register
|
||||
state. Signal delivery 'treclaim's to capture both register states, so signals
|
||||
abort transactions. The usual ucontext_t passed to the signal handler
|
||||
represents the checkpointed/original register state; the signal appears to have
|
||||
arisen at 'tbegin+4'.
|
||||
|
||||
If the sighandler ucontext has uc_link set, a second ucontext has been
|
||||
delivered. For future compatibility the MSR.TS field should be checked to
|
||||
determine the transactional state -- if so, the second ucontext in uc->uc_link
|
||||
represents the active transactional registers at the point of the signal.
|
||||
|
||||
For 64-bit processes, uc->uc_mcontext.regs->msr is a full 64-bit MSR and its TS
|
||||
field shows the transactional mode.
|
||||
|
||||
For 32-bit processes, the mcontext's MSR register is only 32 bits; the top 32
|
||||
bits are stored in the MSR of the second ucontext, i.e. in
|
||||
uc->uc_link->uc_mcontext.regs->msr. The top word contains the transactional
|
||||
state TS.
|
||||
|
||||
However, basic signal handlers don't need to be aware of transactions
|
||||
and simply returning from the handler will deal with things correctly.
|
||||
|
||||
Transaction-aware signal handlers can read the transactional register state
|
||||
from the second ucontext. This will be necessary for crash handlers to
|
||||
determine, for example, the address of the instruction causing the SIGSEGV.
|
||||
|
||||
Example signal handler:
|
||||
|
||||
void crash_handler(int sig, siginfo_t *si, void *uc)
{
    ucontext_t *ucp = uc;
    ucontext_t *transactional_ucp = ucp->uc_link;

    if (transactional_ucp) {
        u64 msr = ucp->uc_mcontext.regs->msr;
        /* May have transactional ucontext! */
#ifndef __powerpc64__
        msr |= ((u64)transactional_ucp->uc_mcontext.regs->msr) << 32;
#endif
        if (MSR_TM_ACTIVE(msr)) {
            /* Yes, we crashed during a transaction. Oops. */
            fprintf(stderr, "Transaction to be restarted at 0x%llx, but "
                            "crashy instruction was at 0x%llx\n",
                    ucp->uc_mcontext.regs->nip,
                    transactional_ucp->uc_mcontext.regs->nip);
        }
    }

    fix_the_problem(ucp->uc_mcontext.regs->dar);
}
|
||||
|
||||
When in an active transaction that takes a signal, we need to be careful with
|
||||
the stack. It's possible that the stack has moved back up after the tbegin.
|
||||
The obvious case here is when the tbegin is called inside a function that
|
||||
returns before a tend. In this case, the stack is part of the checkpointed
|
||||
transactional memory state. If we write over this non transactionally or in
|
||||
suspend, we are in trouble because if we get a tm abort, the program counter and
|
||||
stack pointer will be back at the tbegin but our in memory stack won't be valid
|
||||
anymore.
|
||||
|
||||
To avoid this, when taking a signal in an active transaction, we need to use
|
||||
the stack pointer from the checkpointed state, rather than the speculated
|
||||
state. This ensures that the signal context (written tm suspended) will be
|
||||
written below the stack required for the rollback. The transaction is aborted
|
||||
because of the treclaim, so any memory written between the tbegin and the
|
||||
signal will be rolled back anyway.
|
||||
|
||||
For signals taken in non-TM or suspended mode, we use the
|
||||
normal/non-checkpointed stack pointer.
|
||||
|
||||
|
||||
Failure cause codes used by kernel
|
||||
==================================
|
||||
|
||||
These are defined in <asm/reg.h>, and distinguish different reasons why the
|
||||
kernel aborted a transaction:
|
||||
|
||||
TM_CAUSE_RESCHED       Thread was rescheduled.
TM_CAUSE_TLBI          Software TLB invalidate.
TM_CAUSE_FAC_UNAV      FP/VEC/VSX unavailable trap.
TM_CAUSE_SYSCALL       Currently unused; future syscalls that must abort
                       transactions for consistency will use this.
TM_CAUSE_SIGNAL        Signal delivered.
TM_CAUSE_MISC          Currently unused.
TM_CAUSE_ALIGNMENT     Alignment fault.
TM_CAUSE_EMULATE       Emulation that touched memory.
|
||||
|
||||
These can be checked by the user program's abort handler as TEXASR[0:7]. If
bit 7 is set, it indicates that the error is considered persistent. For example,
a TM_CAUSE_ALIGNMENT will be persistent while a TM_CAUSE_RESCHED will not.
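
Extracting the failure code and the persistent flag from a TEXASR value can be
sketched as below. This assumes the register has already been read into a
64-bit integer (the read itself is not shown) and that the Power ISA bit
numbering applies, where bit 0 is the most significant bit; TEXASR[0:7] is then
the top byte and bit 7 is that byte's least significant bit. The helper names
are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: TEXASR[0:7], i.e. the most significant byte. */
static inline uint8_t tm_failure_code(uint64_t texasr)
{
    return (uint8_t)(texasr >> 56);
}

/* Hypothetical helper: 1 if bit 7 (the persistent flag) is set. */
static inline int tm_failure_is_persistent(uint64_t texasr)
{
    return (tm_failure_code(texasr) & 0x01) != 0;
}
```

An abort handler could use the second helper to decide whether retrying the
transaction is worthwhile or whether it should fall back to a
non-transactional path.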
|
||||
|
||||
GDB
|
||||
===
|
||||
|
||||
GDB and ptrace are not currently TM-aware. If one stops during a transaction,
|
||||
it looks like the transaction has just started (the checkpointed state is
|
||||
presented). The transaction cannot then be continued and will take the failure
|
||||
handler route. Furthermore, the transactional 2nd register state will be
|
||||
inaccessible. GDB can currently be used on programs using TM, but not sensibly
|
||||
in parts within transactions.
|