Fixed MTP to work with TWRP

awab228 2018-06-19 23:16:04 +02:00
commit f6dfaef42e
50820 changed files with 20846062 additions and 0 deletions

@ -0,0 +1,32 @@
Index of files in Documentation/powerpc. If you think something about
Linux/PPC needs an entry here, needs correction or you've written one
please mail me.
Cort Dougan (cort@fsmlabs.com)
00-INDEX
- this file
bootwrapper.txt
- Information on how the powerpc kernel is wrapped for boot on various
different platforms.
cpu_features.txt
- info on how we support a variety of CPUs with minimal compile-time
options.
cxl.txt
- Overview of the CXL driver.
eeh-pci-error-recovery.txt
- info on PCI Bus EEH Error Recovery
firmware-assisted-dump.txt
- Documentation on the firmware assisted dump mechanism "fadump".
hvcs.txt
- IBM "Hypervisor Virtual Console Server" Installation Guide
mpc52xx.txt
- Linux 2.6.x on MPC52xx family
pmu-ebb.txt
- Description of the API for using the PMU with Event Based Branches.
qe_firmware.txt
- describes the layout of firmware binaries for the Freescale QUICC
Engine and the code that parses and uploads the microcode therein.
ptrace.txt
- Information on the ptrace interfaces for hardware debug registers.
transactional_memory.txt
- Overview of the Power8 transactional memory support.

@ -0,0 +1,141 @@
The PowerPC boot wrapper
------------------------
Copyright (C) Secret Lab Technologies Ltd.
PowerPC image targets compress and wrap the kernel image (vmlinux) with
a boot wrapper to make it usable by the system firmware. There is no
standard PowerPC firmware interface, so the boot wrapper is designed to
be adaptable for each kind of image that needs to be built.
The boot wrapper can be found in the arch/powerpc/boot/ directory. The
Makefile in that directory has targets for all the available image types.
The different image types are used to support all of the various firmware
interfaces found on PowerPC platforms. OpenFirmware is the most commonly
used firmware type on general purpose PowerPC systems from Apple, IBM and
others. U-Boot is typically found on embedded PowerPC hardware, but there
are a handful of other firmware implementations which are also popular. Each
firmware interface requires a different image format.
The boot wrapper is built by arch/powerpc/boot/Makefile, which uses the
wrapper script (arch/powerpc/boot/wrapper) to generate the target
image. The details of the build system are discussed in the next section.
Currently, the following image format targets exist:
cuImage.%: Backwards compatible uImage for older versions of
U-Boot (for versions that don't understand the device
tree). This image embeds a device tree blob inside
the image. The boot wrapper, kernel and device tree
are all embedded inside the U-Boot uImage file format
with boot wrapper code that extracts data from the old
bd_info structure and loads the data into the device
tree before jumping into the kernel.
Because of the series of #ifdefs found in the
bd_info structure used in the old U-Boot interfaces,
cuImages are platform specific. Each specific
U-Boot platform has a different platform init file
which populates the embedded device tree with data
from the platform specific bd_info file. The platform
specific cuImage platform init code can be found in
arch/powerpc/boot/cuboot.*.c. Selection of the correct
cuImage init code for a specific board can be found in
the wrapper structure.
dtbImage.%: Similar to zImage, except that a device tree blob is embedded
inside the image instead of provided by firmware. The
output image file can be either an elf file or a flat
binary depending on the platform.
dtbImages are used on systems which do not have an
interface for passing a device tree directly.
dtbImages are similar to simpleImages except that
dtbImages have platform specific code for extracting
data from the board firmware, but simpleImages do not
talk to the firmware at all.
PlayStation 3 support uses dtbImage. So do Embedded
Planet boards using the PlanetCore firmware. Board
specific initialization code is typically found in a
file named arch/powerpc/boot/<platform>.c; but this
can be overridden by the wrapper script.
simpleImage.%: Firmware independent compressed image that does not
depend on any particular firmware interface and embeds
a device tree blob. This image is a flat binary that
can be loaded to any location in RAM and jumped to.
Firmware cannot pass any configuration data to the
kernel with this image type and it depends entirely on
the embedded device tree for all information.
The simpleImage is useful for booting systems with
an unknown firmware interface or for booting from
a debugger when no firmware is present (such as on
the Xilinx Virtex platform). The only assumption that
simpleImage makes is that RAM is correctly initialized
and that the MMU is either off or has RAM mapped to
base address 0.
simpleImage also supports inserting special platform
specific initialization code to the start of the bootup
sequence. The virtex405 platform uses this feature to
ensure that the cache is invalidated before caching
is enabled. Platform specific initialization code is
added as part of the wrapper script and is keyed on
the image target name. For example, all
simpleImage.virtex405-* targets will add the
virtex405-head.S initialization code. (This also means
that the dts file for virtex405 targets should be
named virtex405-<board>.dts.) Search the wrapper
script for 'virtex405' and see the file
arch/powerpc/boot/virtex405-head.S for details.
treeImage.%: Image format for use with OpenBIOS firmware found
on some ppc4xx hardware. This image embeds a device
tree blob inside the image.
uImage: Native image format used by U-Boot. The uImage target
does not add any boot code. It just wraps a compressed
vmlinux in the uImage data structure. This image
requires a version of U-Boot that is able to pass
a device tree to the kernel at boot. If using an older
version of U-Boot, then you need to use a cuImage
instead.
zImage.%: Image format which does not embed a device tree.
Used by OpenFirmware and other firmware interfaces
which are able to supply a device tree. This image
expects firmware to provide the device tree at boot.
Typically, if you have general purpose PowerPC
hardware then you want this image format.
Image types which embed a device tree blob (simpleImage, dtbImage, treeImage,
and cuImage) all generate the device tree blob from a file in the
arch/powerpc/boot/dts/ directory. The Makefile selects the correct device
tree source based on the name of the target. Therefore, if the kernel is
built with 'make treeImage.walnut simpleImage.virtex405-ml403', then the
build system will use arch/powerpc/boot/dts/walnut.dts to build
treeImage.walnut and arch/powerpc/boot/dts/virtex405-ml403.dts to build
the simpleImage.virtex405-ml403.
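The name-based selection described above can be sketched as follows. This is only an illustration of the naming rule, not the Makefile's actual logic; the function name dts_for_target is hypothetical, and target variants with extra suffixes are not handled.

```python
DTS_DIR = "arch/powerpc/boot/dts"

# Image types that embed a device tree blob, as listed in the text above.
EMBEDDING_TYPES = ("simpleImage", "dtbImage", "treeImage", "cuImage")


def dts_for_target(target):
    """Return the .dts path implied by an image target name, or None.

    Hypothetical helper: it only models the "<type>.<board>" naming rule.
    """
    image_type, sep, board = target.partition(".")
    if not sep or not board or image_type not in EMBEDDING_TYPES:
        return None  # not a target that embeds a device tree
    return f"{DTS_DIR}/{board}.dts"
```

For example, dts_for_target("treeImage.walnut") yields arch/powerpc/boot/dts/walnut.dts, matching the walnut case in the text.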
Two special targets called 'zImage' and 'zImage.initrd' also exist. These
targets build all the default images as selected by the kernel configuration.
Default images are selected by the boot wrapper Makefile
(arch/powerpc/boot/Makefile) by adding targets to the $image-y variable. Look
at the Makefile to see which default image targets are available.
How it is built
---------------
arch/powerpc is designed to support multiplatform kernels, which means
that a single vmlinux image can be booted on many different target boards.
It also means that the boot wrapper must be able to wrap for many kinds of
images on a single build. The design decision was made to not use any
conditional compilation code (#ifdef, etc) in the boot wrapper source code.
All of the boot wrapper pieces are buildable at any time regardless of the
kernel configuration. Building all the wrapper bits on every kernel build
also ensures that obscure parts of the wrapper are at the very least compile
tested in a large variety of environments.
The wrapper is adapted for different image types at link time by linking in
just the wrapper bits that are appropriate for the image type. The 'wrapper
script' (found in arch/powerpc/boot/wrapper) is called by the Makefile and
is responsible for selecting the correct wrapper bits for the image type.
The arguments are well documented in the script's comment block, so they
are not repeated here. However, it is worth mentioning that the script
uses the -p (platform) argument as the main method of deciding which wrapper
bits to compile in. Look for the large 'case "$platform" in' block in the
middle of the script. This is also the place where platform specific fixups
can be selected by changing the link order.
In particular, care should be taken when working with cuImages. cuImage
wrapper bits are very board specific and care should be taken to make sure
the target you are trying to build is supported by the wrapper bits.

@ -0,0 +1,221 @@
CPU Families
============
This document tries to summarise some of the different cpu families that exist
and are supported by arch/powerpc.
Book3S (aka sPAPR)
------------------
- Hash MMU
- Mix of 32 & 64 bit
+--------------+ +----------------+
| Old POWER | --------------> | RS64 (threads) |
+--------------+ +----------------+
|
|
v
+--------------+ +----------------+ +------+
| 601 | --------------> | 603 | ---> | e300 |
+--------------+ +----------------+ +------+
| |
| |
v v
+--------------+ +----------------+ +-------+
| 604 | | 750 (G3) | ---> | 750CX |
+--------------+ +----------------+ +-------+
| | |
| | |
v v v
+--------------+ +----------------+ +-------+
| 620 (64 bit) | | 7400 | | 750CL |
+--------------+ +----------------+ +-------+
| | |
| | |
v v v
+--------------+ +----------------+ +-------+
| POWER3/630 | | 7410 | | 750FX |
+--------------+ +----------------+ +-------+
| |
| |
v v
+--------------+ +----------------+
| POWER3+ | | 7450 |
+--------------+ +----------------+
| |
| |
v v
+--------------+ +----------------+
| POWER4 | | 7455 |
+--------------+ +----------------+
| |
| |
v v
+--------------+ +-------+ +----------------+
| POWER4+ | --> | 970 | | 7447 |
+--------------+ +-------+ +----------------+
| | |
| | |
v v v
+--------------+ +-------+ +----------------+
| POWER5 | | 970FX | | 7448 |
+--------------+ +-------+ +----------------+
| | |
| | |
v v v
+--------------+ +-------+ +----------------+
| POWER5+ | | 970MP | | e600 |
+--------------+ +-------+ +----------------+
|
|
v
+--------------+
| POWER5++ |
+--------------+
|
|
v
+--------------+ +-------+
| POWER6 | <-?-> | Cell |
+--------------+ +-------+
|
|
v
+--------------+
| POWER7 |
+--------------+
|
|
v
+--------------+
| POWER7+ |
+--------------+
|
|
v
+--------------+
| POWER8 |
+--------------+
+---------------+
| PA6T (64 bit) |
+---------------+
IBM BookE
---------
- Software loaded TLB.
- All 32 bit
+--------------+
| 401 |
+--------------+
|
|
v
+--------------+
| 403 |
+--------------+
|
|
v
+--------------+
| 405 |
+--------------+
|
|
v
+--------------+
| 440 |
+--------------+
|
|
v
+--------------+ +----------------+
| 450 | --> | BG/P |
+--------------+ +----------------+
|
|
v
+--------------+
| 460 |
+--------------+
|
|
v
+--------------+
| 476 |
+--------------+
Motorola/Freescale 8xx
----------------------
- Software loaded with hardware assist.
- All 32 bit
+-------------+
| MPC8xx Core |
+-------------+
Freescale BookE
---------------
- Software loaded TLB.
- e6500 adds HW loaded indirect TLB entries.
- Mix of 32 & 64 bit
+--------------+
| e200 |
+--------------+
+--------------------------------+
| e500 |
+--------------------------------+
|
|
v
+--------------------------------+
| e500v2 |
+--------------------------------+
|
|
v
+--------------------------------+
| e500mc (Book3e) |
+--------------------------------+
|
|
v
+--------------------------------+
| e5500 (64 bit) |
+--------------------------------+
|
|
v
+--------------------------------+
| e6500 (HW TLB) (Multithreaded) |
+--------------------------------+
IBM A2 core
-----------
- Book3E, software loaded TLB + HW loaded indirect TLB entries.
- 64 bit
+--------------+ +----------------+
| A2 core | --> | WSP |
+--------------+ +----------------+
|
|
v
+--------------+
| BG/Q |
+--------------+

@ -0,0 +1,56 @@
Hollis Blanchard <hollis@austin.ibm.com>
5 Jun 2002
This document describes the system (including self-modifying code) used in the
PPC Linux kernel to support a variety of PowerPC CPUs without requiring
compile-time selection.
Early in the boot process the ppc32 kernel detects the current CPU type and
chooses a set of features accordingly. Some examples include Altivec support,
split instruction and data caches, and whether the CPU supports the DOZE and
NAP sleep modes.
Detection of the feature set is simple. A list of processors can be found in
arch/powerpc/kernel/cputable.c. The PVR register is masked and compared with
each value in the list. If a match is found, the cpu_features of cur_cpu_spec
is assigned to the feature bitmask for this processor and a __setup_cpu
function is called.
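A minimal sketch of the mask-and-compare lookup described above. The table entries, feature bit values, and names (CpuSpec, CPU_TABLE, identify_cpu) are made up for illustration and are not copied from cputable.c.

```python
from collections import namedtuple

CpuSpec = namedtuple("CpuSpec", "pvr_mask pvr_value cpu_name cpu_features")

# Hypothetical feature bits, for illustration only.
FTR_ALTIVEC = 0x1
FTR_CAN_DOZE = 0x2
FTR_CAN_NAP = 0x4

# Hypothetical table; the real list lives in arch/powerpc/kernel/cputable.c.
CPU_TABLE = [
    CpuSpec(0xFFFF0000, 0x00010000, "example-a", FTR_CAN_DOZE),
    CpuSpec(0xFFFF0000, 0x00020000, "example-b", FTR_ALTIVEC | FTR_CAN_NAP),
    CpuSpec(0x00000000, 0x00000000, "(generic)", 0),  # mask 0 matches any PVR
]


def identify_cpu(pvr):
    """Return the first table entry whose masked PVR matches, or None."""
    for spec in CPU_TABLE:
        if (pvr & spec.pvr_mask) == spec.pvr_value:
            return spec
    return None
```

Because the first match wins, more specific entries have to come before catch-all entries in such a table.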
C code may test 'cur_cpu_spec[smp_processor_id()]->cpu_features' for a
particular feature bit. This is done in quite a few places, for example
in ppc_setup_l2cr().
Implementing cpufeatures in assembly is a little more involved. There are
several paths that are performance-critical and would suffer if an array
index, structure dereference, and conditional branch were added. To avoid the
performance penalty but still allow for runtime (rather than compile-time) CPU
selection, unused code is replaced by 'nop' instructions. This nop'ing is
based on CPU 0's capabilities, so a multi-processor system with non-identical
processors will not work (but such a system would likely have other problems
anyway).
After detecting the processor type, the kernel patches out sections of code
that shouldn't be used by writing nop's over them. Using cpufeatures requires
just 2 macros (found in arch/powerpc/include/asm/cputable.h), as seen in head.S
transfer_to_handler:
#ifdef CONFIG_ALTIVEC
BEGIN_FTR_SECTION
        mfspr   r22,SPRN_VRSAVE         /* if G4, save vrsave register value */
        stw     r22,THREAD_VRSAVE(r23)
END_FTR_SECTION_IFSET(CPU_FTR_ALTIVEC)
#endif /* CONFIG_ALTIVEC */
If CPU 0 supports Altivec, the code is left untouched. If it doesn't, both
instructions are replaced with nop's.
The END_FTR_SECTION macro has two simpler variations: END_FTR_SECTION_IFSET
and END_FTR_SECTION_IFCLR. These simply test if a flag is set (in
cur_cpu_spec[0]->cpu_features) or is cleared, respectively. These two macros
should be used in the majority of cases.
The END_FTR_SECTION macros are implemented by storing information about this
code in the '__ftr_fixup' ELF section. When do_cpu_ftr_fixups
(arch/powerpc/kernel/misc.S) is invoked, it will iterate over the records in
__ftr_fixup, and if the required feature is not present it will loop writing
nop's from each BEGIN_FTR_SECTION to END_FTR_SECTION.
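The fixup pass can be modelled in a few lines. This is a simulation under stated assumptions, not the kernel's implementation: code is a list of 32-bit instruction words, each fixup record is a hypothetical (mask, value, start, end) tuple of word indices, and a section is kept only when (features & mask) == value. The nop encoding 0x60000000 is the standard PowerPC "ori 0,0,0".

```python
NOP = 0x60000000  # PowerPC "ori 0,0,0"


def ifset(mask, start, end):
    """Record for a section that is kept only if all bits in mask are set."""
    return (mask, mask, start, end)


def ifclr(mask, start, end):
    """Record for a section that is kept only if all bits in mask are clear."""
    return (mask, 0, start, end)


def apply_feature_fixups(code, fixups, cpu_features):
    """Return a copy of code with unwanted feature sections nop'ed out."""
    patched = list(code)
    for mask, value, start, end in fixups:
        if (cpu_features & mask) != value:
            # Required feature state not present: overwrite the section.
            for i in range(start, end):
                patched[i] = NOP
    return patched
```

With a single ifset record covering two words, a CPU lacking the feature ends up with two nops in place of those words, mirroring the Altivec example above.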

@ -0,0 +1,379 @@
Coherent Accelerator Interface (CXL)
====================================
Introduction
============
The coherent accelerator interface is designed to allow the
coherent connection of accelerators (FPGAs and other devices) to a
POWER system. These devices need to adhere to the Coherent
Accelerator Interface Architecture (CAIA).
IBM refers to this as the Coherent Accelerator Processor Interface
or CAPI. In the kernel it's referred to by the name CXL to avoid
confusion with the ISDN CAPI subsystem.
Coherent in this context means that the accelerator and CPUs can
both access system memory directly and with the same effective
addresses.
Hardware overview
=================
POWER8 FPGA
+----------+ +---------+
| | | |
| CPU | | AFU |
| | | |
| | | |
| | | |
+----------+ +---------+
| PHB | | |
| +------+ | PSL |
| | CAPP |<------>| |
+---+------+ PCIE +---------+
The POWER8 chip has a Coherently Attached Processor Proxy (CAPP)
unit which is part of the PCIe Host Bridge (PHB). This is managed
by Linux by calls into OPAL. Linux doesn't directly program the
CAPP.
The FPGA (or coherently attached device) consists of two parts.
The POWER Service Layer (PSL) and the Accelerator Function Unit
(AFU). The AFU is used to implement specific functionality behind
the PSL. The PSL, among other things, provides memory address
translation services to allow each AFU direct access to userspace
memory.
The AFU is the core part of the accelerator (eg. the compression,
crypto etc function). The kernel has no knowledge of the function
of the AFU. Only userspace interacts directly with the AFU.
The PSL provides the translation and interrupt services that the
AFU needs. This is what the kernel interacts with. For example, if
the AFU needs to read a particular effective address, it sends
that address to the PSL, the PSL then translates it, fetches the
data from memory and returns it to the AFU. If the PSL has a
translation miss, it interrupts the kernel and the kernel services
the fault. The context in which this fault is serviced depends on
who owns that acceleration function.
AFU Modes
=========
There are two programming modes supported by the AFU: dedicated
and AFU directed. An AFU may support one or both modes.
When using dedicated mode only one MMU context is supported. In
this mode, only one userspace process can use the accelerator at
a time.
When using AFU directed mode, up to 16K simultaneous contexts can
be supported. This means up to 16K simultaneous userspace
applications may use the accelerator (although specific AFUs may
support fewer). In this mode, the AFU sends a 16 bit context ID
with each of its requests. This tells the PSL which context is
associated with each operation. If the PSL can't translate an
operation, the ID can also be accessed by the kernel so it can
determine the userspace context associated with an operation.
MMIO space
==========
A portion of the accelerator MMIO space can be directly mapped
from the AFU to userspace. Either the whole space can be mapped or
just a per context portion. The hardware is self describing, hence
the kernel can determine the offset and size of the per context
portion.
Interrupts
==========
AFUs may generate interrupts that are destined for userspace. These
are received by the kernel as hardware interrupts and passed on to
userspace by a read syscall documented below.
Data storage faults and error interrupts are handled by the kernel
driver.
Work Element Descriptor (WED)
=============================
The WED is a 64-bit parameter passed to the AFU when a context is
started. Its format is up to the AFU hence the kernel has no
knowledge of what it represents. Typically it will be the
effective address of a work queue or status block where the AFU
and userspace can share control and status information.
User API
========
For AFUs operating in AFU directed mode, two character device
files will be created. /dev/cxl/afu0.0m will correspond to a
master context and /dev/cxl/afu0.0s will correspond to a slave
context. Master contexts have access to the full MMIO space an
AFU provides. Slave contexts have access to only the per process
MMIO space an AFU provides.
For AFUs operating in dedicated process mode, the driver will
only create a single character device per AFU called
/dev/cxl/afu0.0d. This will have access to the entire MMIO space
that the AFU provides (like master contexts in AFU directed).
The types described below are defined in include/uapi/misc/cxl.h.
The following file operations are supported on both slave and
master devices.
open
----
Opens the device and allocates a file descriptor to be used with
the rest of the API.
A dedicated mode AFU only has one context and only allows the
device to be opened once.
An AFU directed mode AFU can have many contexts; the device can be
opened once for each context that is available.
When all available contexts are allocated the open call will fail
and return -ENOSPC.
Note: IRQs need to be allocated for each context, which may limit
the number of contexts that can be created, and therefore
how many times the device can be opened. The POWER8 CAPP
supports 2040 IRQs and 3 are used by the kernel, so 2037 are
left. If 1 IRQ is needed per context, then only 2037
contexts can be allocated. If 4 IRQs are needed per context,
then only 2037/4 = 509 contexts can be allocated.
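The arithmetic in the note above can be written out as a small helper. The constants come from the note; the function name max_contexts is hypothetical.

```python
CAPP_IRQS = 2040   # IRQs supported by the POWER8 CAPP (from the note above)
KERNEL_IRQS = 3    # IRQs used by the kernel (from the note above)


def max_contexts(irqs_per_context):
    """Upper bound on contexts imposed by the IRQ budget alone."""
    if irqs_per_context < 1:
        raise ValueError("each context needs at least one IRQ in this model")
    return (CAPP_IRQS - KERNEL_IRQS) // irqs_per_context
```

Integer division is used because a context cannot be allocated with only part of its IRQs.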
ioctl
-----
CXL_IOCTL_START_WORK:
Starts the AFU context and associates it with the current
process. Once this ioctl is successfully executed, all memory
mapped into this process is accessible to this AFU context
using the same effective addresses. No additional calls are
required to map/unmap memory. The AFU memory context will be
updated as userspace allocates and frees memory. This ioctl
returns once the AFU context is started.
Takes a pointer to a struct cxl_ioctl_start_work:
struct cxl_ioctl_start_work {
    __u64 flags;
    __u64 work_element_descriptor;
    __u64 amr;
    __s16 num_interrupts;
    __s16 reserved1;
    __s32 reserved2;
    __u64 reserved3;
    __u64 reserved4;
    __u64 reserved5;
    __u64 reserved6;
};
flags:
Indicates which optional fields in the structure are
valid.
work_element_descriptor:
The Work Element Descriptor (WED) is a 64-bit argument
defined by the AFU. Typically this is an effective
address pointing to an AFU specific structure
describing what work to perform.
amr:
Authority Mask Register (AMR), same as the powerpc
AMR. This field is only used by the kernel when the
corresponding CXL_START_WORK_AMR value is specified in
flags. If not specified the kernel will use a default
value of 0.
num_interrupts:
Number of userspace interrupts to request. This field
is only used by the kernel when the corresponding
CXL_START_WORK_NUM_IRQS value is specified in flags.
If not specified the minimum number required by the
AFU will be allocated. The min and max number can be
obtained from sysfs.
reserved fields:
For ABI padding and future extensions
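As a rough illustration of the structure layout and of how flags gates the optional fields, the sketch below packs the structure with Python's struct module. The numeric flag values are assumptions and should be checked against include/uapi/misc/cxl.h; pack_start_work is a hypothetical helper, and the ioctl call itself is not shown.

```python
import struct

# Assumed flag values; verify against include/uapi/misc/cxl.h.
CXL_START_WORK_AMR = 0x1
CXL_START_WORK_NUM_IRQS = 0x2

# flags, work_element_descriptor, amr (u64 each); num_interrupts and
# reserved1 (s16 each); reserved2 (s32); reserved3..reserved6 (u64 each).
# "=" selects native byte order without implicit padding; the fields are
# already naturally aligned, so the total is 64 bytes.
START_WORK_FMT = "=QQQhhiQQQQ"


def pack_start_work(wed, amr=None, num_interrupts=None):
    """Build the argument buffer, setting a flag for each optional field."""
    flags = 0
    if amr is not None:
        flags |= CXL_START_WORK_AMR
    if num_interrupts is not None:
        flags |= CXL_START_WORK_NUM_IRQS
    return struct.pack(START_WORK_FMT, flags, wed, amr or 0,
                       num_interrupts or 0, 0, 0, 0, 0, 0, 0)
```

Leaving amr and num_interrupts unset keeps flags at 0, so the kernel defaults described above apply.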
CXL_IOCTL_GET_PROCESS_ELEMENT:
Get the current context id, also known as the process element.
The value is returned from the kernel as a __u32.
mmap
----
An AFU may have an MMIO space to facilitate communication with the
AFU. If it does, the MMIO space can be accessed via mmap. The size
and contents of this area are specific to the particular AFU. The
size can be discovered via sysfs.
In AFU directed mode, master contexts are allowed to map all of
the MMIO space and slave contexts are allowed to only map the per
process MMIO space associated with the context. In dedicated
process mode the entire MMIO space can always be mapped.
This mmap call must be done after the START_WORK ioctl.
Care should be taken when accessing MMIO space. Only 32 and 64-bit
accesses are supported by POWER8. Also, the AFU will be designed
with a specific endianness, so all MMIO accesses should consider
endianness (recommend endian(3) variants like: le64toh(),
be64toh() etc). These endian issues equally apply to shared memory
queues the WED may describe.
read
----
Reads events from the AFU. Blocks if no events are pending
(unless O_NONBLOCK is supplied). Returns -EIO in the case of an
unrecoverable error or if the card is removed.
read() will always return an integral number of events.
The buffer passed to read() must be at least 4K bytes.
The result of the read will be a buffer of one or more events.
Each event is of type struct cxl_event and varies in size.
struct cxl_event {
    struct cxl_event_header header;
    union {
        struct cxl_event_afu_interrupt irq;
        struct cxl_event_data_storage fault;
        struct cxl_event_afu_error afu_error;
    };
};
The struct cxl_event_header is defined as:
struct cxl_event_header {
    __u16 type;
    __u16 size;
    __u16 process_element;
    __u16 reserved1;
};
type:
This defines the type of event. The type determines how
the rest of the event is structured. These types are
described below and defined by enum cxl_event_type.
size:
This is the size of the event in bytes including the
struct cxl_event_header. The start of the next event can
be found at this offset from the start of the current
event.
process_element:
Context ID of the event.
reserved field:
For future extensions and padding.
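Since each header carries the size of its event, a buffer returned by read() can be walked event by event. The sketch below assumes native byte order and the four 16-bit header fields shown above; iter_events is a hypothetical helper, and the type values are passed through without being interpreted.

```python
import struct

HEADER_FMT = "=HHHH"  # type, size, process_element, reserved1
HEADER_LEN = struct.calcsize(HEADER_FMT)  # 8 bytes


def iter_events(buf):
    """Yield (type, process_element, payload) for each event in buf."""
    offset = 0
    while offset + HEADER_LEN <= len(buf):
        ev_type, size, pe, _reserved = struct.unpack_from(HEADER_FMT, buf, offset)
        # size includes the header, so it can never be smaller than it.
        if size < HEADER_LEN or offset + size > len(buf):
            raise ValueError("malformed event at offset %d" % offset)
        yield ev_type, pe, bytes(buf[offset + HEADER_LEN:offset + size])
        offset += size  # the next event starts size bytes after this one
```

A caller would typically obtain buf with something like os.read(fd, 4096), in line with the minimum buffer size stated above, and then decode each payload according to the event type.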
If the event type is CXL_EVENT_AFU_INTERRUPT then the event
structure is defined as:
struct cxl_event_afu_interrupt {
    __u16 flags;
    __u16 irq; /* Raised AFU interrupt number */
    __u32 reserved1;
};
flags:
These flags indicate which optional fields are present
in this struct. Currently all fields are mandatory.
irq:
The IRQ number sent by the AFU.
reserved field:
For future extensions and padding.
If the event type is CXL_EVENT_DATA_STORAGE then the event
structure is defined as:
struct cxl_event_data_storage {
    __u16 flags;
    __u16 reserved1;
    __u32 reserved2;
    __u64 addr;
    __u64 dsisr;
    __u64 reserved3;
};
flags:
These flags indicate which optional fields are present in
this struct. Currently all fields are mandatory.
addr:
The address that the AFU unsuccessfully attempted to
access. Valid accesses will be handled transparently by the
kernel but invalid accesses will generate this event.
dsisr:
This field gives information on the type of fault. It is a
copy of the DSISR from the PSL hardware when the address
fault occurred. The form of the DSISR is as defined in the
CAIA.
reserved fields:
For future extensions
If the event type is CXL_EVENT_AFU_ERROR then the event structure
is defined as:
struct cxl_event_afu_error {
    __u16 flags;
    __u16 reserved1;
    __u32 reserved2;
    __u64 error;
};
flags:
These flags indicate which optional fields are present in
this struct. Currently all fields are mandatory.
error:
Error status from the AFU. Defined by the AFU.
reserved fields:
For future extensions and padding
Sysfs Class
===========
A cxl sysfs class is added under /sys/class/cxl to facilitate
enumeration and tuning of the accelerators. Its layout is
described in Documentation/ABI/testing/sysfs-class-cxl.
Udev rules
==========
The following udev rules could be used to create a symlink to the
most logical chardev to use in any programming mode (afuX.Yd for
dedicated, afuX.Ys for AFU directed), since the API is virtually
identical for each:
SUBSYSTEM=="cxl", ATTRS{mode}=="dedicated_process", SYMLINK="cxl/%b"
SUBSYSTEM=="cxl", ATTRS{mode}=="afu_directed", \
KERNEL=="afu[0-9]*.[0-9]*s", SYMLINK="cxl/%b"

@ -0,0 +1,334 @@
PCI Bus EEH Error Recovery
--------------------------
Linas Vepstas
<linas@austin.ibm.com>
12 January 2005
Overview:
---------
The IBM POWER-based pSeries and iSeries computers include PCI bus
controller chips that have extended capabilities for detecting and
reporting a large variety of PCI bus error conditions. These features
go under the name of "EEH", for "Extended Error Handling". The EEH
hardware features allow PCI bus errors to be cleared and a PCI
card to be "rebooted", without also having to reboot the operating
system.
This is in contrast to traditional PCI error handling, where the
PCI chip is wired directly to the CPU, and an error would cause
a CPU machine-check/check-stop condition, halting the CPU entirely.
Another "traditional" technique is to ignore such errors, which
can lead to data corruption, of either user data or kernel data,
hung/unresponsive adapters, or system crashes/lockups. Thus,
the idea behind EEH is that the operating system can become more
reliable and robust by protecting it from PCI errors, and giving
the OS the ability to "reboot"/recover individual PCI devices.
Future systems from other vendors, based on the PCI-E specification,
may contain similar features.
Causes of EEH Errors
--------------------
EEH was originally designed to guard against hardware failure, such
as PCI cards dying from heat, humidity, dust, vibration and bad
electrical connections. The vast majority of EEH errors seen in
"real life" are due to either poorly seated PCI cards, or,
unfortunately quite commonly, due to device driver bugs, device firmware
bugs, and sometimes PCI card hardware bugs.
The most common software bug is one that causes the device to
attempt to DMA to a location in system memory that has not been
reserved for DMA access for that card. Catching this is a powerful
feature, as it prevents what would otherwise have been silent memory
corruption caused by the bad DMA. A number of device driver
bugs have been found and fixed in this way over the past few
years. Other possible causes of EEH errors include data or
address line parity errors (for example, due to poor electrical
connectivity due to a poorly seated card), and PCI-X split-completion
errors (due to software, device firmware, or device PCI hardware bugs).
The vast majority of "true hardware failures" can be cured by
physically removing and re-seating the PCI card.
Detection and Recovery
----------------------
In the following discussion, a generic overview of how to detect
and recover from EEH errors will be presented. This is followed
by an overview of how the current implementation in the Linux
kernel does it. The actual implementation is subject to change,
and some of the finer points are still being debated. These
may in turn be swayed if or when other architectures implement
similar functionality.
When a PCI Host Bridge (PHB, the bus controller connecting the
PCI bus to the system CPU electronics complex) detects a PCI error
condition, it will "isolate" the affected PCI card. Isolation
will block all writes (either to the card from the system, or
from the card to the system), and it will cause all reads to
return all-ff's (0xff, 0xffff, 0xffffffff for 8/16/32-bit reads).
This value was chosen because it is the same value you would
get if the device was physically unplugged from the slot.
This includes access to PCI memory, I/O space, and PCI config
space. Interrupts, however, will continue to be delivered.
Detection and recovery are performed with the aid of ppc64
firmware. The programming interfaces in the Linux kernel
into the firmware are referred to as RTAS (Run-Time Abstraction
Services). The Linux kernel does not (should not) access
the EEH function in the PCI chipsets directly, primarily because
there are a number of different chipsets out there, each with
different interfaces and quirks. The firmware provides a
uniform abstraction layer that will work with all pSeries
and iSeries hardware (and be forwards-compatible).
If the OS or device driver suspects that a PCI slot has been
EEH-isolated, there is a firmware call it can make to determine if
this is the case. If so, then the device driver should put itself
into a consistent state (given that it won't be able to complete any
pending work) and start recovery of the card. Recovery normally
would consist of resetting the PCI device (holding the PCI #RST
line high for two seconds), followed by setting up the device
config space (the base address registers (BAR's), latency timer,
cache line size, interrupt line, and so on). This is followed by a
reinitialization of the device driver. In a worst-case scenario,
the power to the card can be toggled, at least on hot-plug-capable
slots. In principle, layers far above the device driver probably
do not need to know that the PCI card has been "rebooted" in this
way; ideally, there should be at most a pause in Ethernet/disk/USB
I/O while the card is being reset.
If the card cannot be recovered after three or four resets, the
kernel/device driver should assume the worst-case scenario, that the
card has died completely, and report this error to the sysadmin.
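The retry policy in the preceding paragraph might be sketched as below. The callbacks reset_and_reinit and report_dead are hypothetical stand-ins for the reset, config space setup, driver reinitialization, and sysadmin reporting steps; the limit of 3 is one arbitrary reading of "three or four".

```python
MAX_RESETS = 3  # the text says "three or four"; 3 is an arbitrary pick


def recover_card(reset_and_reinit, report_dead, max_resets=MAX_RESETS):
    """Try to recover a card; return the attempt that worked, or None.

    reset_and_reinit() is a hypothetical callback returning True once the
    card is usable again; report_dead() is called if every attempt fails.
    """
    for attempt in range(1, max_resets + 1):
        if reset_and_reinit():
            return attempt
    report_dead()  # worst case: assume the card has died completely
    return None
```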
In addition, error messages are reported through RTAS and also through
syslogd (/var/log/messages) to alert the sysadmin of PCI resets.
The correct way to deal with failed adapters is to use the standard
PCI hotplug tools to remove and replace the dead card.
Current PPC64 Linux EEH Implementation
--------------------------------------
At this time, a generic EEH recovery mechanism has been implemented,
so that individual device drivers do not need to be modified to support
EEH recovery. This generic mechanism piggy-backs on the PCI hotplug
infrastructure, and percolates events up through the userspace/udev
infrastructure. Following is a detailed description of how this is
accomplished.
EEH must be enabled in the PHBs very early during the boot process,
and if a PCI slot is hot-plugged. The former is performed by
eeh_init() in arch/powerpc/platforms/pseries/eeh.c, and the latter by
drivers/pci/hotplug/pSeries_pci.c calling in to the eeh.c code.
EEH must be enabled before a PCI scan of the device can proceed.
Current Power5 hardware will not work unless EEH is enabled;
although older Power4 can run with it disabled. Effectively,
EEH can no longer be turned off. PCI devices *must* be
registered with the EEH code; the EEH code needs to know about
the I/O address ranges of the PCI device in order to detect an
error. Given an arbitrary address, the routine
pci_get_device_by_addr() will find the pci device associated
with that address (if any).
The default arch/powerpc/include/asm/io.h macros readb(), inb(), insb(),
etc. include a check to see if the i/o read returned all-0xff's.
If so, these make a call to eeh_dn_check_failure(), which in turn
asks the firmware if the all-ff's value is the sign of a true EEH
error. If it is not, processing continues as normal. The grand
total number of these false alarms or "false positives" can be
seen in /proc/ppc64/eeh (subject to change). Normally, almost
all of these occur during boot, when the PCI bus is scanned, where
a large number of 0xff reads are part of the bus scan procedure.
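The all-ff's check can be illustrated with a small user-space model. This
is a sketch under stated assumptions, not the kernel implementation: the
class name and the firmware_says_frozen callback are hypothetical, with the
callback standing in for the firmware query made on behalf of
eeh_dn_check_failure().

```python
class EehReadChecker:
    """User-space model of the all-0xff read check (not kernel code)."""

    def __init__(self, firmware_says_frozen):
        # Hypothetical callable that stands in for asking the firmware
        # whether the slot is really frozen.
        self.firmware_says_frozen = firmware_says_frozen
        self.false_positives = 0

    def check(self, value: int, width_bytes: int) -> bool:
        """Return True if the read indicates a real EEH freeze."""
        all_ones = (1 << (8 * width_bytes)) - 1
        if value != all_ones:
            return False  # ordinary data, no firmware call needed
        if self.firmware_says_frozen():
            return True  # genuine EEH error
        # The device legitimately returned all-ff's: a false alarm.
        self.false_positives += 1
        return False
```

The false_positives counter plays the role of the grand total that the
text says is visible in /proc/ppc64/eeh.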
If a frozen slot is detected, code in
arch/powerpc/platforms/pseries/eeh.c will print a stack trace to
syslog (/var/log/messages). This stack trace has proven to be very
useful to device-driver authors for finding out at what point the EEH
error was detected, as the error itself usually occurs slightly
beforehand.
Next, it uses the Linux kernel notifier chain/work queue mechanism to
allow any interested parties to find out about the failure. Device
drivers, or other parts of the kernel, can use
eeh_register_notifier(struct notifier_block *) to find out about EEH
events. The event will include a pointer to the pci device, the
device node and some state info. Receivers of the event can "do as
they wish"; the default handler will be described further in this
section.
To assist in the recovery of the device, eeh.c exports the
following functions:
rtas_set_slot_reset() -- assert the PCI #RST line for 1/8th of a second
rtas_configure_bridge() -- ask firmware to configure any PCI bridges
located topologically under the pci slot.
eeh_save_bars() and eeh_restore_bars(): save and restore the PCI
config-space info for a device and any devices under it.
A handler for the EEH notifier_block events is implemented in
drivers/pci/hotplug/pSeries_pci.c, called handle_eeh_events().
It saves the device BAR's and then calls rpaphp_unconfig_pci_adapter().
This last call causes the device driver for the card to be stopped,
which causes uevents to go out to user space. This triggers
user-space scripts that might issue commands such as "ifdown eth0"
for ethernet cards, and so on. This handler then sleeps for 5 seconds,
hoping to give the user-space scripts enough time to complete.
It then resets the PCI card, reconfigures the device BAR's, and
any bridges underneath. It then calls rpaphp_enable_pci_slot(),
which restarts the device driver and triggers more user-space
events (for example, calling "ifup eth0" for ethernet cards).
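The order of operations in the handler can be summarized as follows. This
is a sketch only: the slot object and its method names are hypothetical
stand-ins for the kernel routines named in the comments, and the sleep
function is injectable so that the sequence can be exercised without
waiting.

```python
import time


def handle_eeh_event(slot, sleep=time.sleep, settle_seconds=5):
    """Run the recovery steps in the order the text gives them."""
    slot.save_bars()          # cf. eeh_save_bars()
    slot.unconfig_adapter()   # cf. rpaphp_unconfig_pci_adapter();
                              # stops the driver, uevents go to user space
    sleep(settle_seconds)     # give user-space scripts time to complete
    slot.reset()              # cf. rtas_set_slot_reset()
    slot.restore_bars()       # cf. eeh_restore_bars()
    slot.configure_bridges()  # cf. rtas_configure_bridge()
    slot.enable_slot()        # cf. rpaphp_enable_pci_slot();
                              # restarts the driver, more uevents
```

The fixed 5 second sleep is the weak point of the sequence: nothing here
confirms that the user-space scripts have actually finished.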
Device Shutdown and User-Space Events
-------------------------------------
This section documents what happens when a pci slot is unconfigured,
focusing on how the device driver gets shut down, and on how the
events get delivered to user-space scripts.
The following example shows the sequence of calls that causes a device
driver's close function to be invoked during the first phase of an EEH
reset. The example uses the pcnet32 device driver.
rpaphp_unconfig_pci_adapter (struct slot *) // in rpaphp_pci.c
{
  calls
  pci_remove_bus_device (struct pci_dev *) // in /drivers/pci/remove.c
  {
    calls
    pci_destroy_dev (struct pci_dev *)
    {
      calls
      device_unregister (&dev->dev) // in /drivers/base/core.c
      {
        calls
        device_del (struct device *)
        {
          calls
          bus_remove_device() // in /drivers/base/bus.c
          {
            calls
            device_release_driver()
            {
              calls
              struct device_driver->remove() which is just
              pci_device_remove() // in /drivers/pci/pci_driver.c
              {
                calls
                struct pci_driver->remove() which is just
                pcnet32_remove_one() // in /drivers/net/pcnet32.c
                {
                  calls
                  unregister_netdev() // in /net/core/dev.c
                  {
                    calls
                    dev_close() // in /net/core/dev.c
                    {
                      calls dev->stop();
                      which is just pcnet32_close() // in pcnet32.c
                      {
                        which does what you wanted
                        to stop the device
                      }
                    }
                  }
                  which
                  frees pcnet32 device driver memory
                }
              }
            }
          }
        }
      }
    }
  }
}
in drivers/pci/pci_driver.c,
struct device_driver->remove() is just pci_device_remove()
which calls struct pci_driver->remove() which is pcnet32_remove_one()
which calls unregister_netdev() (in net/core/dev.c)
which calls dev_close() (in net/core/dev.c)
which calls dev->stop() which is pcnet32_close()
which then does the appropriate shutdown.
---
Following is the analogous stack trace for events sent to user-space
when the pci device is unconfigured.
rpaphp_unconfig_pci_adapter() { // in rpaphp_pci.c
  calls
  pci_remove_bus_device (struct pci_dev *) { // in /drivers/pci/remove.c
    calls
    pci_destroy_dev (struct pci_dev *) {
      calls
      device_unregister (&dev->dev) { // in /drivers/base/core.c
        calls
        device_del(struct device * dev) { // in /drivers/base/core.c
          calls
          kobject_del() { // in /lib/kobject.c
            calls
            kobject_uevent() { // in /lib/kobject.c
              calls
              kset_uevent() { // in /lib/kobject.c
                calls
                kset->uevent_ops->uevent() // which is really just
                a call to
                dev_uevent() { // in /drivers/base/core.c
                  calls
                  dev->bus->uevent() which is really just a call to
                  pci_uevent () { // in drivers/pci/hotplug.c
                    which prints device name, etc....
                  }
                }
                then kobject_uevent() sends a netlink uevent to userspace
                --> userspace uevent
                (during early boot, nobody listens to netlink events and
                kobject_uevent() executes uevent_helper[], which runs the
                event process /sbin/hotplug)
              }
            }
            kobject_del() then calls sysfs_remove_dir(), which would
            trigger any user-space daemon that was watching /sysfs
            to notice the delete event.
          }
        }
      }
    }
  }
}
Pros and Cons of the Current Design
-----------------------------------
There are several issues with the current EEH software recovery design,
which may be addressed in future revisions. But first, note that the
big plus of the current design is that no changes need to be made to
individual device drivers, so that the current design throws a wide net.
The biggest negative of the design is that it potentially disturbs
network daemons and file systems that didn't need to be disturbed.
-- A minor complaint is that resetting the network card causes
user-space back-to-back ifdown/ifup burps that potentially disturb
network daemons, that didn't need to even know that the pci
card was being rebooted.
-- A more serious concern is that the same reset, for SCSI devices,
causes havoc to mounted file systems. Scripts cannot post-facto
unmount a file system without flushing pending buffers, but this
is impossible, because I/O has already been stopped. Thus,
ideally, the reset should happen at or below the block layer,
so that the file systems are not disturbed.
Reiserfs does not tolerate errors returned from the block device.
Ext3fs seems to be tolerant, retrying reads/writes until it does
succeed. Both have been only lightly tested in this scenario.
The SCSI-generic subsystem already has built-in code for performing
SCSI device resets, SCSI bus resets, and SCSI host-bus-adapter
(HBA) resets. These are cascaded into a chain of attempted
resets if a SCSI command fails. These are completely hidden
from the block layer. It would be very natural to add an EEH
reset into this chain of events.
-- If a SCSI error occurs for the root device, all is lost unless
the sysadmin had the foresight to run /bin, /sbin, /etc, /var
and so on, out of ramdisk/tmpfs.
Conclusions
-----------
There's forward progress ...


@ -0,0 +1,270 @@
Firmware-Assisted Dump
------------------------
July 2011
The goal of firmware-assisted dump is to enable the dump of
a crashed system, and to do so from a fully-reset system, and
to minimize the total elapsed time until the system is back
in production use.
- Firmware assisted dump (fadump) infrastructure is intended to replace
the existing phyp assisted dump.
- Fadump uses the same firmware interfaces and memory reservation model
as phyp assisted dump.
- Unlike phyp dump, fadump exports the memory dump through /proc/vmcore
in the ELF format in the same way as kdump. This helps us reuse the
kdump infrastructure for dump capture and filtering.
- Unlike phyp dump, the userspace tool does not need to refer to any sysfs
interface while reading /proc/vmcore.
- Unlike phyp dump, fadump allows the user to release all the memory reserved
for dump, with a single operation of echo 1 > /sys/kernel/fadump_release_mem.
- Once enabled through kernel boot parameter, fadump can be
started/stopped through /sys/kernel/fadump_registered interface (see
sysfs files section below) and can be easily integrated with kdump
service start/stop init scripts.
Compared with kdump or other strategies, firmware-assisted
dump offers several strong, practical advantages:
-- Unlike kdump, the system has been reset, and loaded
with a fresh copy of the kernel. In particular,
PCI and I/O devices have been reinitialized and are
in a clean, consistent state.
-- Once the dump is copied out, the memory that held the dump
is immediately available to the running kernel. Therefore,
unlike kdump, fadump doesn't need a second reboot to get
the system back to the production configuration.
The above can only be accomplished by coordination with,
and assistance from the Power firmware. The procedure is
as follows:
-- The first kernel registers the sections of memory with the
Power firmware for dump preservation during OS initialization.
These registered sections of memory are reserved by the first
kernel during early boot.
-- When a system crashes, the Power firmware will save
the low memory (boot memory, sized as the larger of 5% of system
RAM or 256MB) to the previously registered region. It will
also save system registers, and hardware PTE's.
NOTE: The term 'boot memory' means size of the low memory chunk
that is required for a kernel to boot successfully when
booted with restricted memory. By default, the boot memory
size will be the larger of 5% of system RAM or 256MB.
Alternatively, user can also specify boot memory size
through boot parameter 'fadump_reserve_mem=' which will
override the default calculated size. Use this option
if default boot memory size is not sufficient for second
kernel to boot successfully.
-- After the low memory (boot memory) area has been saved, the
firmware will reset PCI and other hardware state. It will
*not* clear the RAM. It will then launch the bootloader, as
normal.
-- The freshly booted kernel will notice that there is a new
node (ibm,dump-kernel) in the device tree, indicating that
there is crash data available from a previous boot. During
early boot, the OS will reserve the rest of the memory above
the boot memory size, effectively booting with a restricted
memory size. This makes sure that the second kernel will not
touch any of the dump memory area.
-- User-space tools will read /proc/vmcore to obtain the contents
of memory, which holds the previous crashed kernel dump in ELF
format. The userspace tools may copy this info to disk, or
network, nas, san, iscsi, etc. as desired.
-- Once the userspace tool is done saving the dump, it will echo
'1' to /sys/kernel/fadump_release_mem to release the reserved
memory back to general use, except the memory required for
the next firmware-assisted dump registration.
e.g.
# echo 1 > /sys/kernel/fadump_release_mem
Please note that the firmware-assisted dump feature
is only available on Power6 and above systems with recent
firmware versions.
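The default boot memory size rule from the NOTE above is simple enough to
write down. This is a sketch of the rule as stated in the text only; the
function name is made up, and any alignment or rounding that the kernel may
apply is deliberately left out.

```python
MB = 1024 * 1024


def boot_memory_size(system_ram_bytes, fadump_reserve_mem=None):
    """Boot memory size in bytes, following the rule in the text.

    fadump_reserve_mem models the 'fadump_reserve_mem=' boot parameter
    (already converted to bytes); when given, it overrides the default.
    """
    if fadump_reserve_mem is not None:
        return fadump_reserve_mem
    # Default: the larger of 5% of system RAM or 256MB.
    return max(system_ram_bytes * 5 // 100, 256 * MB)
```

With this rule the 256MB floor applies to any system with less than 5GB
of RAM; above that, the 5% term takes over.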
Implementation details:
----------------------
During boot, a check is made to see if firmware supports
this feature on that particular machine. If it does, then
we check to see if an active dump is waiting for us. If so,
everything but the boot memory size of RAM is reserved during
early boot (see Fig. 2). This area is released once the user-land
scripts (e.g. kdump scripts) have finished collecting the dump.
If there is dump data, then the
/sys/kernel/fadump_release_mem file is created, and the reserved
memory is held.
If there is no waiting dump data, then only the memory required
to hold CPU state, HPTE region, boot memory dump and elfcore
header, is reserved at the top of memory (see Fig. 1). This area
is *not* released: this region will be kept permanently reserved,
so that it can act as a receptacle for a copy of the boot memory
content in addition to CPU state and HPTE region, in the case a
crash does occur.
o Memory Reservation during first kernel
Low memory                                        Top of memory
0      boot memory size                                       |
|           |                       |<--Reserved dump area -->|
V           V                       |   Permanent Reservation V
+-----------+----------/ /----------+---+----+-----------+----+
|           |                       |CPU|HPTE|  DUMP     |ELF |
+-----------+----------/ /----------+---+----+-----------+----+
      |                                           ^
      |                                           |
      \                                           /
       -------------------------------------------
        Boot memory content gets transferred to
        reserved area by firmware at the time of
        crash
                   Fig. 1
o Memory Reservation during second kernel after crash
Low memory                                        Top of memory
0      boot memory size                                       |
|           |<------------- Reserved dump area ----------- -->|
V           V                                                 V
+-----------+----------/ /----------+---+----+-----------+----+
|           |                       |CPU|HPTE|  DUMP     |ELF |
+-----------+----------/ /----------+---+----+-----------+----+
      |                                              |
      V                                              V
 Used by second                                 /proc/vmcore
 kernel to boot
                   Fig. 2
Currently the dump will be copied from /proc/vmcore to
a new file upon user intervention. The dump data available through
/proc/vmcore will be in ELF format. Hence the existing kdump
infrastructure (kdump scripts) to save the dump works fine with
minor modifications.
The tools to examine the dump will be same as the ones
used for kdump.
How to enable firmware-assisted dump (fadump):
-------------------------------------
1. Set config option CONFIG_FA_DUMP=y and build kernel.
2. Boot into linux kernel with 'fadump=on' kernel cmdline option.
3. Optionally, user can also set 'fadump_reserve_mem=' kernel cmdline
to specify size of the memory to reserve for boot memory dump
preservation.
NOTE: If firmware-assisted dump fails to reserve memory then it will
fall back to the existing kdump mechanism if the 'crashkernel=' option
is set at kernel cmdline.
Sysfs/debugfs files:
------------
Firmware-assisted dump feature uses sysfs file system to hold
the control files and debugfs file to display memory reserved region.
Here is the list of files under kernel sysfs:
/sys/kernel/fadump_enabled
This is used to display the fadump status.
0 = fadump is disabled
1 = fadump is enabled
This interface can be used by kdump init scripts to identify if
fadump is enabled in the kernel and act accordingly.
/sys/kernel/fadump_registered
This is used to display the fadump registration status as well
as to control (start/stop) the fadump registration.
0 = fadump is not registered.
1 = fadump is registered and ready to handle system crash.
To register fadump, echo 1 > /sys/kernel/fadump_registered; to
un-register and stop fadump, echo 0 > /sys/kernel/fadump_registered.
Once fadump is un-registered, a system crash will not
be handled and the vmcore will not be captured. This interface can be
easily integrated with kdump service start/stop.
/sys/kernel/fadump_release_mem
This file is available only when fadump is active during
the second kernel. It is used to release the reserved memory
regions that are held for saving the crash dump. To release the
reserved memory echo 1 to it:
echo 1 > /sys/kernel/fadump_release_mem
After echo 1, the content of the /sys/kernel/debug/powerpc/fadump_region
file will change to reflect the new memory reservations.
The existing userspace tools (kdump infrastructure) can be easily
enhanced to use this interface to release the memory reserved for
dump and continue without 2nd reboot.
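As a rough illustration of how a kdump-style script might use these three
files, here is a minimal sketch. The function names are hypothetical, and
the sysfs directory is a parameter so that the logic can be tried against
an ordinary directory instead of a live /sys/kernel.

```python
from pathlib import Path


def fadump_state(sysfs_kernel="/sys/kernel"):
    """Summarize the fadump control files described above."""
    base = Path(sysfs_kernel)

    def flag(name):
        p = base / name
        # None means the file is absent (e.g. kernel built without fadump).
        return p.read_text().strip() == "1" if p.exists() else None

    return {
        "enabled": flag("fadump_enabled"),
        "registered": flag("fadump_registered"),
        # This file exists only while a dump is held in the second kernel.
        "dump_active": (base / "fadump_release_mem").exists(),
    }


def release_dump_memory(sysfs_kernel="/sys/kernel"):
    """Write '1' to fadump_release_mem; return False if it is absent."""
    p = Path(sysfs_kernel) / "fadump_release_mem"
    if not p.exists():
        return False
    p.write_text("1")
    return True
```

A script would normally call release_dump_memory() only after the vmcore
has been saved, since the dump is no longer available afterwards.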
Here is the list of files under powerpc debugfs:
(Assuming debugfs is mounted on /sys/kernel/debug directory.)
/sys/kernel/debug/powerpc/fadump_region
This file shows the reserved memory regions if fadump is
enabled; otherwise this file is empty. The output format
is:
<region>: [<start>-<end>] <reserved-size> bytes, Dumped: <dump-size>
e.g.
Contents when fadump is registered during first kernel
# cat /sys/kernel/debug/powerpc/fadump_region
CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x0
HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x0
DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x0
Contents when fadump is active during second kernel
# cat /sys/kernel/debug/powerpc/fadump_region
CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x40020
HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x1000
DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x10000000
: [0x00000010000000-0x0000006ffaffff] 0x5ffb0000 bytes, Dumped: 0x5ffb0000
NOTE: Please refer to Documentation/filesystems/debugfs.txt on
how to mount the debugfs filesystem.
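The output format shown above is regular enough to parse with a single
regular expression. The following is a sketch, not an official tool; the
function name and the dictionary keys are made up for illustration. Note
that the region name may be empty, as in the last line of the second
example.

```python
import re

# <region>: [<start>-<end>] <reserved-size> bytes, Dumped: <dump-size>
_LINE = re.compile(
    r"^\s*(?P<region>\w*)\s*:\s*"
    r"\[(?P<start>0x[0-9a-fA-F]+)-(?P<end>0x[0-9a-fA-F]+)\]\s+"
    r"(?P<size>0x[0-9a-fA-F]+)\s+bytes,\s+"
    r"Dumped:\s+(?P<dumped>0x[0-9a-fA-F]+)\s*$"
)


def parse_fadump_region(text):
    """Parse the contents of the fadump_region debugfs file."""
    regions = []
    for line in text.splitlines():
        m = _LINE.match(line)
        if not m:
            continue  # skip blank or unrecognized lines
        regions.append({
            "region": m.group("region"),
            "start": int(m.group("start"), 16),
            "end": int(m.group("end"), 16),
            "size": int(m.group("size"), 16),
            "dumped": int(m.group("dumped"), 16),
        })
    return regions
```

In the examples above, end - start + 1 equals the reserved size for every
region, which makes a convenient sanity check on parsed output.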
TODO:
-----
o Need to come up with a better approach to find out a more
accurate boot memory size that is required for a kernel to
boot successfully when booted with restricted memory.
o The fadump implementation introduces a fadump crash info structure
in the scratch area before the ELF core header. The idea of introducing
this structure is to pass some important crash info data to the second
kernel which will help second kernel to populate ELF core header with
correct data before it gets exported through /proc/vmcore. The current
design does not address the possibility of introducing
additional fields (in future) to this structure without affecting
compatibility. Need to come up with a better approach to address this.
The possible approaches are:
1. Introduce version field for version tracking, bump up the version
whenever a new field is added to the structure in future. The version
field can be used to find out what fields are valid for the current
version of the structure.
2. Reserve the area of predefined size (say PAGE_SIZE) for this
structure and have unused area as reserved (initialized to zero)
for future field additions.
The advantage of approach 1 over 2 is we don't need to reserve extra space.
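Approach 1 can be sketched as follows. The layout below is purely
hypothetical and is not the kernel's crash info structure: it only shows
how a leading version field lets a reader decide which fields are valid.
All field names and the big-endian packing are assumptions made for the
example.

```python
import struct

# Hypothetical fields: (version that introduced it, name, struct code).
_FIELDS = [
    (1, "crashing_cpu", "I"),
    (1, "elfcorehdr_addr", "Q"),
    (2, "online_cpus", "I"),  # pretend this was added in version 2
]


def _layout(version):
    known = [(name, code) for since, name, code in _FIELDS if since <= version]
    # ">" selects big-endian with no padding; the version comes first.
    return ">I" + "".join(code for _, code in known), [n for n, _ in known]


def pack_info(version, **values):
    """Pack only the fields that exist in the given version."""
    fmt, names = _layout(version)
    return struct.pack(fmt, version, *(values[n] for n in names))


def unpack_info(blob):
    """Read the version first, then only the fields valid for it."""
    (version,) = struct.unpack_from(">I", blob, 0)
    fmt, names = _layout(version)
    fields = struct.unpack_from(fmt, blob, 0)[1:]
    return version, dict(zip(names, fields))
```

A reader built for version 2 can still decode a version 1 blob, because
the version field tells it to stop before the newer field.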
---
Author: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
This document is based on the original documentation written for phyp
assisted dump by Linas Vepstas and Manish Ahuja.


@ -0,0 +1,567 @@
===========================================================================
HVCS
IBM "Hypervisor Virtual Console Server" Installation Guide
for Linux Kernel 2.6.4+
Copyright (C) 2004 IBM Corporation
===========================================================================
NOTE:Eight space tabs are the optimum editor setting for reading this file.
===========================================================================
Author(s) : Ryan S. Arnold <rsa@us.ibm.com>
Date Created: March, 02, 2004
Last Changed: August, 24, 2004
---------------------------------------------------------------------------
Table of contents:
1. Driver Introduction:
2. System Requirements
3. Build Options:
3.1 Built-in:
3.2 Module:
4. Installation:
5. Connection:
6. Disconnection:
7. Configuration:
8. Questions & Answers:
9. Reporting Bugs:
---------------------------------------------------------------------------
1. Driver Introduction:
This is the device driver for the IBM Hypervisor Virtual Console Server,
"hvcs". The IBM hvcs provides a tty driver interface to allow Linux user
space applications access to the system consoles of logically partitioned
operating systems (Linux and AIX) running on the same partitioned Power5
ppc64 system. Physical hardware consoles per partition are not practical
on this hardware so system consoles are accessed by this driver using
firmware interfaces to virtual terminal devices.
---------------------------------------------------------------------------
2. System Requirements:
This device driver was written using 2.6.4 Linux kernel APIs and will only
build and run on kernels of this version or later.
This driver was written to operate solely on IBM Power5 ppc64 hardware
though some care was taken to abstract the architecture dependent firmware
calls from the driver code.
Sysfs must be mounted on the system so that the user can determine which
major and minor numbers are associated with each vty-server. Directions
for sysfs mounting are outside the scope of this document.
---------------------------------------------------------------------------
3. Build Options:
The hvcs driver registers itself as a tty driver. The tty layer
dynamically allocates a block of major and minor numbers in a quantity
requested by the registering driver. The hvcs driver asks the tty layer
for 64 of these major/minor numbers by default to use for hvcs device node
entries.
If the default number of device entries is adequate then this driver can be
built into the kernel. If not, the default can be over-ridden by inserting
the driver as a module with insmod parameters.
---------------------------------------------------------------------------
3.1 Built-in:
The following menuconfig example demonstrates selecting to build this
driver into the kernel.
Device Drivers --->
Character devices --->
<*> IBM Hypervisor Virtual Console Server Support
Begin the kernel make process.
---------------------------------------------------------------------------
3.2 Module:
The following menuconfig example demonstrates selecting to build this
driver as a kernel module.
Device Drivers --->
Character devices --->
<M> IBM Hypervisor Virtual Console Server Support
The make process will build the following kernel modules:
hvcs.ko
hvcserver.ko
To insert the module with the default allocation execute the following
commands in the order they appear:
insmod hvcserver.ko
insmod hvcs.ko
The hvcserver module contains architecture specific firmware calls and must
be inserted first, otherwise the hvcs module will not find some of the
symbols it expects.
To override the default use an insmod parameter as follows (requesting 4
tty devices as an example):
insmod hvcs.ko hvcs_parm_num_devs=4
There is a maximum number of dev entries that can be specified on insmod.
We think that 1024 is currently a decent maximum number of server adapters
to allow. This can always be changed by modifying the constant in the
source file before building.
NOTE: The length of time it takes to insmod the driver seems to be related
to the number of tty interfaces the registering driver requests.
In order to remove the driver module execute the following command:
rmmod hvcs.ko
The recommended method for installing hvcs as a module is to use depmod to
build a current modules.dep file in /lib/modules/`uname -r` and then
execute:
modprobe hvcs hvcs_parm_num_devs=4
The modules.dep file indicates that hvcserver.ko needs to be inserted
before hvcs.ko and modprobe uses this file to smartly insert the modules in
the proper order.
The following modprobe command is used to remove hvcs and hvcserver in the
proper order:
modprobe -r hvcs
---------------------------------------------------------------------------
4. Installation:
The tty layer creates sysfs entries which contain the major and minor
numbers allocated for the hvcs driver. The following snippet of "tree"
output of the sysfs directory shows where these numbers are presented:
sys/
|-- *other sysfs base dirs*
|
|-- class
| |-- *other classes of devices*
| |
| `-- tty
| |-- *other tty devices*
| |
| |-- hvcs0
| | `-- dev
| |-- hvcs1
| | `-- dev
| |-- hvcs2
| | `-- dev
| |-- hvcs3
| | `-- dev
| |
| |-- *other tty devices*
|
|-- *other sysfs base dirs*
For the above examples the following output is a result of cat'ing the
"dev" entry in the hvcs directory:
Pow5:/sys/class/tty/hvcs0/ # cat dev
254:0
Pow5:/sys/class/tty/hvcs1/ # cat dev
254:1
Pow5:/sys/class/tty/hvcs2/ # cat dev
254:2
Pow5:/sys/class/tty/hvcs3/ # cat dev
254:3
The output from reading the "dev" attribute is the char device major and
minor numbers that the tty layer has allocated for this driver's use. Most
systems running hvcs will already have the device entries created or udev
will do it automatically.
Given the example output above, to manually create a /dev/hvcs* node entry
mknod can be used as follows:
mknod /dev/hvcs0 c 254 0
mknod /dev/hvcs1 c 254 1
mknod /dev/hvcs2 c 254 2
mknod /dev/hvcs3 c 254 3
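The mknod commands above can be derived from the sysfs "dev" attributes
rather than typed by hand. The following is a minimal sketch, assuming the
sysfs layout shown earlier in this section; the function name is made up,
and it only builds the command strings instead of running them.

```python
from pathlib import Path


def hvcs_mknod_commands(tty_class_dir="/sys/class/tty", dev_dir="/dev"):
    """Build one mknod command per hvcs entry found in sysfs."""
    entries = [p for p in Path(tty_class_dir).glob("hvcs*")
               if p.name[4:].isdigit()]
    commands = []
    # Sort numerically so that hvcs10 comes after hvcs2.
    for entry in sorted(entries, key=lambda p: int(p.name[4:])):
        # The "dev" attribute holds "<major>:<minor>", e.g. "254:0".
        major, minor = (entry / "dev").read_text().strip().split(":")
        commands.append(
            f"mknod {dev_dir}/{entry.name} c {int(major)} {int(minor)}")
    return commands
```

On most systems udev makes this unnecessary, as noted above.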
Using mknod to manually create the device entries makes these device nodes
persistent. Once created they will exist prior to the driver insmod.
Attempting to connect an application to /dev/hvcs* prior to insertion of
the hvcs module will result in an error message similar to the following:
"/dev/hvcs*: No such device".
NOTE: Just because there is a device node present doesn't mean that there
is a vty-server device configured for that node.
---------------------------------------------------------------------------
5. Connection
Since this driver controls devices that provide a tty interface a user can
interact with the device node entries using any standard tty-interactive
method (e.g. "cat", "dd", "echo"). The intent of this driver however, is
to provide real time console interaction with a Linux partition's console,
which requires the use of applications that provide bi-directional,
interactive I/O with a tty device.
Applications (e.g. "minicom" and "screen") that act as terminal emulators
or perform terminal type control sequence conversion on the data being
passed through them are NOT acceptable for providing interactive console
I/O. These programs often emulate antiquated terminal types (vt100 and
ANSI) and expect inbound data to take the form of one of these supported
terminal types but they either do not convert, or do not _adequately_
convert, outbound data into the terminal type of the terminal which invoked
them (though screen makes an attempt and can apparently be configured with
much termcap wrestling.)
For this reason kermit and cu are two of the recommended applications for
interacting with a Linux console via an hvcs device. These programs simply
act as a conduit for data transfer to and from the tty device. They do not
require inbound data to take the form of a particular terminal type, nor do
they cook outbound data to a particular terminal type.
In order to ensure proper functioning of console applications one must make
sure that once connected to a /dev/hvcs console that the console's $TERM
env variable is set to the exact terminal type of the terminal emulator
used to launch the interactive I/O application. If one is using xterm and
kermit to connect to /dev/hvcs0 when the console prompt becomes available
one should "export TERM=xterm" on the console. This tells ncurses
applications that are invoked from the console that they should output
control sequences that xterm can understand.
As a precautionary measure an hvcs user should always "exit" from their
session before disconnecting an application such as kermit from the device
node. If this is not done, the next user to connect to the console will
continue using the previous user's logged in session which includes
using the $TERM variable that the previous user supplied.
Hotplug add and remove of vty-server adapters affects which /dev/hvcs* node
is used to connect to each vty-server adapter. In order to determine which
vty-server adapter is associated with which /dev/hvcs* node a special sysfs
attribute has been added to each vty-server sysfs entry. This entry is
called "index" and showing it reveals an integer that refers to the
/dev/hvcs* entry to use to connect to that device. For instance, cat'ing the
index attribute of vty-server adapter 30000004 shows the following.
Pow5:/sys/bus/vio/drivers/hvcs/30000004 # cat index
2
This index of '2' means that in order to connect to vty-server adapter
30000004 the user should interact with /dev/hvcs2.
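The lookup from a vty-server adapter to its device node can be scripted.
This is a sketch that assumes the sysfs paths shown in the example above;
the function name is hypothetical.

```python
from pathlib import Path


def hvcs_node_for_adapter(unit_address,
                          driver_dir="/sys/bus/vio/drivers/hvcs"):
    """Return the /dev/hvcs* node that serves the given vty-server."""
    index_file = Path(driver_dir) / unit_address / "index"
    index = int(index_file.read_text().strip())
    return f"/dev/hvcs{index}"
```

For the adapter in the example, hvcs_node_for_adapter("30000004") would
return "/dev/hvcs2".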
It should be noted that due to the hotplug I/O capabilities of a
system the /dev/hvcs* entry that interacts with a particular vty-server
adapter is not guaranteed to remain the same across system reboots. Look
in the Q & A section for more on this issue.
---------------------------------------------------------------------------
6. Disconnection
As a security feature to prevent the delivery of stale data to an
unintended target the Power5 system firmware disables the fetching of data
and discards that data when a connection between a vty-server and a vty has
been severed. As an example, when a vty-server is immediately disconnected
from a vty following output of data to the vty the vty adapter may not have
enough time between when it received the data interrupt and when the
connection was severed to fetch the data from firmware before the fetch is
disabled by firmware.
When hvcs is being used to serve consoles this behavior is not a huge issue
because the adapter stays connected for large amounts of time following
almost all data writes. When hvcs is being used as a tty conduit to tunnel
data between two partitions [see Q & A below] this is a huge problem
because the standard Linux behavior when cat'ing or dd'ing data to a device
is to open the tty, send the data, and then close the tty. If this driver
manually terminated vty-server connections on tty close this would close
the vty-server and vty connection before the target vty has had a chance to
fetch the data.
Additionally, disconnecting a vty-server and vty only on module removal or
adapter removal is impractical because other vty-servers in other
partitions may require the usage of the target vty at any time.
Due to this behavioral restriction, disconnection of vty-servers from the
connected vty is a manual procedure using a write to a sysfs attribute
outlined below. On the other hand, the initial vty-server connection to a
vty is established automatically by this driver. Manual vty-server
connection is never required.
In order to terminate the connection between a vty-server and vty the
"vterm_state" sysfs attribute within each vty-server's sysfs entry is used.
Reading this attribute reveals the current connection state of the
vty-server adapter. A zero means that the vty-server is not connected to a
vty. A one indicates that a connection is active.
Writing a '0' (zero) to the vterm_state attribute will disconnect the VTERM
connection between the vty-server and target vty ONLY if the vterm_state
previously read '1'. The write directive is ignored if the vterm_state
read '0' or if any value other than '0' was written to the vterm_state
attribute. The following example will show the method used for verifying
the vty-server connection status and disconnecting a vty-server connection.
Pow5:/sys/bus/vio/drivers/hvcs/30000004 # cat vterm_state
1
Pow5:/sys/bus/vio/drivers/hvcs/30000004 # echo 0 > vterm_state
Pow5:/sys/bus/vio/drivers/hvcs/30000004 # cat vterm_state
0
All vty-server connections are automatically terminated when the device is
hotplug removed and when the module is removed.
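The check-then-disconnect procedure shown above can be wrapped in a small
helper. This is a sketch only; the function name is made up, and the
adapter directory is passed in so that the logic can be tried on an
ordinary file as well as on the real sysfs attribute.

```python
from pathlib import Path


def disconnect_vty_server(adapter_dir):
    """Write '0' to vterm_state if the vty-server is connected.

    Returns True if a disconnect was requested, False if the adapter
    was already disconnected (the driver would ignore the write anyway).
    """
    state_file = Path(adapter_dir) / "vterm_state"
    if state_file.read_text().strip() != "1":
        return False
    state_file.write_text("0")
    return True
```

Reading vterm_state again afterwards, as in the example session, confirms
whether the connection was actually severed.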
---------------------------------------------------------------------------
7. Configuration
Each vty-server has a sysfs entry in the /sys/devices/vio directory, which
is symlinked in several other sysfs tree directories, notably under the
hvcs driver entry, which looks like the following example:
Pow5:/sys/bus/vio/drivers/hvcs # ls
. .. 30000003 30000004 rescan
By design, firmware notifies the hvcs driver of vty-server lifetimes and
partner vty removals but not the addition of partner vtys. Since an HMC
Super Admin can add partner info dynamically we have provided the hvcs
driver sysfs directory with the "rescan" update attribute which will query
firmware and update the partner info for all the vty-servers that this
driver manages. Writing a '1' to the attribute triggers the update. An
explicit example follows:
Pow5:/sys/bus/vio/drivers/hvcs # echo 1 > rescan
Reading the attribute will indicate a state of '1' or '0'. A one indicates
that an update is in process. A zero indicates that an update has
completed or was never executed.
Vty-server entries in this directory are named with a 32-bit, partition-unique
unit address that is created by firmware. An example vty-server sysfs entry
looks like the following:
Pow5:/sys/bus/vio/drivers/hvcs/30000004 # ls
. current_vty devspec name partner_vtys
.. index partner_clcs vterm_state
Each entry is provided, by default, with a "name" attribute. Reading the
"name" attribute will reveal the device type as shown in the following
example:
Pow5:/sys/bus/vio/drivers/hvcs/30000003 # cat name
vty-server
Each entry is also provided, by default, with a "devspec" attribute which
reveals the full device specification when read, as shown in the following
example:
Pow5:/sys/bus/vio/drivers/hvcs/30000004 # cat devspec
/vdevice/vty-server@30000004
Each vty-server sysfs dir is provided with two read-only attributes that
provide lists of easily parsed partner vty data: "partner_vtys" and
"partner_clcs".
Pow5:/sys/bus/vio/drivers/hvcs/30000004 # cat partner_vtys
30000000
30000001
30000002
30000000
30000000
Pow5:/sys/bus/vio/drivers/hvcs/30000004 # cat partner_clcs
U5112.428.103048A-V3-C0
U5112.428.103048A-V3-C2
U5112.428.103048A-V3-C3
U5112.428.103048A-V4-C0
U5112.428.103048A-V5-C0
Reading partner_vtys returns a list of partner vtys. Vty unit address
numbering is only per-partition-unique so entries will frequently repeat.
Reading partner_clcs returns a list of "converged location codes" which are
composed of a system serial number followed by "-V*", where the '*' is the
target partition number, and "-C*", where the '*' is the slot of the
adapter. The first vty partner corresponds to the first clc item, the
second vty partner to the second clc item, etc.
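Because the clc format is regular, a program that reads partner_clcs can split
each entry into its parts. The following C sketch shows one way to do this
under the "<serial>-V<partition>-C<slot>" form described above. The struct and
function names are hypothetical and do not exist in the driver.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the pieces of a converged location code. */
struct hvcs_clc {
	char serial[64];	/* serial number part, e.g. "U5112.428.103048A" */
	unsigned int partition;	/* number following "-V" */
	unsigned int slot;	/* number following "-C" */
};

/*
 * Split a clc such as "U5112.428.103048A-V4-C0" into its parts.
 * Returns 0 on success and -1 if the string does not have the
 * "<serial>-V<partition>-C<slot>" form.
 */
int hvcs_parse_clc(const char *clc, struct hvcs_clc *out)
{
	const char *v;
	size_t len;
	int consumed = 0;

	assert(clc != NULL && out != NULL);

	v = strstr(clc, "-V");
	if (!v)
		return -1;
	len = (size_t)(v - clc);
	if (len == 0 || len >= sizeof(out->serial))
		return -1;

	if (sscanf(v, "-V%u-C%u%n", &out->partition, &out->slot,
		   &consumed) != 2)
		return -1;
	if (v[consumed] != '\0' && v[consumed] != '\n')
		return -1;	/* trailing garbage after the slot number */

	memcpy(out->serial, clc, len);
	out->serial[len] = '\0';
	return 0;
}
```

Since the n-th line of partner_vtys corresponds to the n-th line of
partner_clcs, the parsed partition number can be used to tell apart vty unit
addresses that repeat across partitions.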
A vty-server can only be connected to a single vty at a time. The entry,
"current_vty" prints the clc of the currently selected partner vty when
read.
The current_vty can be changed by writing a valid partner clc to the entry
as in the following example:
Pow5:/sys/bus/vio/drivers/hvcs/30000004 # echo U5112.428.103048A-V4-C0 > current_vty
Changing the current_vty when a vty-server is already connected to a vty
does not affect the current connection. The change takes effect when the
currently open connection is freed.
Information on the "vterm_state" attribute was covered earlier in the
section entitled "Disconnection".
---------------------------------------------------------------------------
8. Questions & Answers:
===========================================================================
Q: What are the security concerns involving hvcs?
A: There are three main security concerns:
1. The creator of the /dev/hvcs* nodes has the ability to restrict
the access of the device entries to certain users or groups. It
may be best to create a special hvcs group privilege for providing
access to system consoles.
2. To provide network security when grabbing the console it is
suggested that the user connect to the console hosting partition
using a secure method, such as SSH, or sit at a hardware console.
3. Make sure to exit the user session when done with a console or
the next vty-server connection (which may be from another
partition) will experience the previously logged in session.
---------------------------------------------------------------------------
Q: How do I multiplex a console that I grab through hvcs so that other
people can see it?
A: You can use "screen" to directly connect to the /dev/hvcs* device and
set up a session on your machine with the console group privileges. As
pointed out earlier, by default screen doesn't provide the termcap settings
for most terminal emulators to provide adequate character conversion from
term type "screen" to others. This means that curses based programs may
not display properly in screen sessions.
---------------------------------------------------------------------------
Q: Why are the colors all messed up?
Q: Why are the control characters acting strange or not working?
Q: Why is the console output all strange and unintelligible?
A: Please see the preceding section on "Connection" for a discussion of how
applications can affect the display of character control sequences.
Additionally, just because you logged into the console using an xterm
doesn't mean someone else didn't log into the console with the HMC console
(vt320) before you and leave the session logged in. The best thing to do
is to export TERM to the terminal type of your terminal emulator when you
get the console. Additionally make sure to "exit" the console before you
disconnect from the console. This will ensure that the next user gets
their own TERM type set when they login.
---------------------------------------------------------------------------
Q: When I try to CONNECT kermit to an hvcs device I get:
"Sorry, can't open connection: /dev/hvcs*" What is happening?
A: Some other Power5 console mechanism has a connection to the vty and
isn't giving it up. You can try to force disconnect the consoles from the
HMC by right clicking on the partition and then selecting "close terminal".
Otherwise you have to hunt down the people who have console authority. It
is possible that you already have the console open using another kermit
session and just forgot about it. Please review the console options for
Power5 systems to determine the many ways a system console can be held.
OR
A: Another user may not have a connectivity method currently attached to a
/dev/hvcs device but the vterm_state may reveal that they still have the
vty-server connection established. They need to free this using the method
outlined in the section on "Disconnection" in order for others to connect
to the target vty.
OR
A: The user profile you are using to execute kermit probably doesn't have
permissions to use the /dev/hvcs* device.
OR
A: You probably haven't inserted the hvcs.ko module yet but the /dev/hvcs*
entry still exists (on systems without udev).
OR
A: There is not a corresponding vty-server device that maps to an existing
/dev/hvcs* entry.
---------------------------------------------------------------------------
Q: When I try to CONNECT kermit to an hvcs device I get:
"Sorry, write access to UUCP lockfile directory denied."
A: The /dev/hvcs* entry you have specified doesn't exist where you said it
does. Maybe you haven't inserted the module (on systems with udev).
---------------------------------------------------------------------------
Q: If I already have one Linux partition installed can I use hvcs on said
partition to provide the console for the install of a second Linux
partition?
A: Yes, granted that you are connected to the /dev/hvcs* device using
kermit or cu or some other program that doesn't provide terminal emulation.
---------------------------------------------------------------------------
Q: Can I connect to more than one partition's console at a time using this
driver?
A: Yes. Of course this means that there must be more than one vty-server
configured for this partition and each must point to a disconnected vty.
---------------------------------------------------------------------------
Q: Does the hvcs driver support dynamic (hotplug) addition of devices?
A: Yes, if you have dlpar and hotplug enabled for your system and it has
been built into the kernel, the hvcs driver is configured to dynamically
handle additions of new devices and removals of unused devices.
---------------------------------------------------------------------------
Q: For some reason /dev/hvcs* doesn't map to the same vty-server adapter
after a reboot. What happened?
A: Assignment of vty-server adapters to /dev/hvcs* entries is always done
in the order that the adapters are exposed. Due to hotplug capabilities of
this driver assignment of hotplug added vty-servers may be in a different
order than how they would be exposed on module load. Rebooting or
reloading the module after dynamic addition may result in the /dev/hvcs*
and vty-server coupling changing if a vty-server adapter was added in a
slot between two other vty-server adapters. Refer to the section above
on how to determine which vty-server goes with which /dev/hvcs* node.
Hint: look at the sysfs "index" attribute for the vty-server.
---------------------------------------------------------------------------
Q: Can I use /dev/hvcs* as a conduit to another partition and use a tty
device on that partition as the other end of the pipe?
A: Yes, on Power5 platforms the hvc_console driver provides a tty interface
for extra /dev/hvc* devices (where /dev/hvc0 is most likely the console).
In order to get a tty conduit working between the two partitions the HMC
Super Admin must create an additional "serial server" for the target
partition with the HMC gui which will show up as /dev/hvc* when the target
partition is rebooted.
The HMC Super Admin then creates an additional "serial client" for the
current partition and points this at the target partition's newly created
"serial server" adapter (remember the slot). This shows up as an
additional /dev/hvcs* device.
Now a program on the target system can be configured to read or write to
/dev/hvc* and another program on the current partition can be configured to
read or write to /dev/hvcs*. Now you have a tty conduit between two
partitions.
---------------------------------------------------------------------------
9. Reporting Bugs:
The proper channel for reporting bugs is either through the Linux OS
distribution company that provided your OS or by posting issues to the
PowerPC development mailing list at:
linuxppc-dev@lists.ozlabs.org
This request is to provide a documented and searchable public exchange
of the problems and solutions surrounding this driver for the benefit of
all users.

View file

@ -0,0 +1,39 @@
Linux 2.6.x on MPC52xx family
-----------------------------
For the latest info, go to http://www.246tNt.com/mpc52xx/
To compile/use :
- U-Boot:
# <edit Makefile to set ARCH=ppc & CROSS_COMPILE=... ( also EXTRAVERSION
if you wish to )>
# make lite5200_defconfig
# make uImage
then, on U-boot:
=> tftpboot 200000 uImage
=> tftpboot 400000 pRamdisk
=> bootm 200000 400000
- DBug:
# <edit Makefile to set ARCH=ppc & CROSS_COMPILE=... ( also EXTRAVERSION
if you wish to )>
# make lite5200_defconfig
# cp your_initrd.gz arch/ppc/boot/images/ramdisk.image.gz
# make zImage.initrd
# make
then in DBug:
DBug> dn -i zImage.initrd.lite5200
Some remarks :
- The port is named mpc52xx, and config options are PPC_MPC52xx. The MGT5100
is not supported, and I'm not sure anyone is interested in working on it.
I didn't use 5xxx because there are apparently a lot of 5xxx parts that have
nothing to do with the MPC5200. I also included the 'MPC' for the same
reason.
- Of course, I took inspiration from the 2.4 port. If you think I forgot to
mention you/your company in the copyright of some code, I'll correct it
ASAP.

View file

@ -0,0 +1,137 @@
PMU Event Based Branches
========================
Event Based Branches (EBBs) are a feature which allows the hardware to
branch directly to a specified user space address when certain events occur.
The full specification is available in Power ISA v2.07:
https://www.power.org/documentation/power-isa-version-2-07/
One type of event for which EBBs can be configured is PMU exceptions. This
document describes the API for configuring the Power PMU to generate EBBs,
using the Linux perf_events API.
Terminology
-----------
Throughout this document we will refer to an "EBB event" or "EBB events". This
just refers to a struct perf_event which has set the "EBB" flag in its
attr.config. All events which can be configured on the hardware PMU are
possible "EBB events".
Background
----------
When a PMU EBB occurs it is delivered to the currently running process. As such
EBBs can only sensibly be used by programs for self-monitoring.
It is a feature of the perf_events API that events can be created on other
processes, subject to standard permission checks. This is also true of EBB
events, however unless the target process enables EBBs (via mtspr(BESCR)) no
EBBs will ever be delivered.
This makes it possible for a process to enable EBBs for itself, but not
actually configure any events. At a later time another process can come along
and attach an EBB event to the process, which will then cause EBBs to be
delivered to the first process. It's not clear if this is actually useful.
When the PMU is configured for EBBs, all PMU interrupts are delivered to the
user process. This means once an EBB event is scheduled on the PMU, no non-EBB
events can be configured. This means that EBB events can not be run
concurrently with regular 'perf' commands, or any other perf events.
It is however safe to run 'perf' commands on a process which is using EBBs. The
kernel will in general schedule the EBB event, and perf will be notified that
its events could not run.
The exclusion between EBB events and regular events is implemented using the
existing "pinned" and "exclusive" attributes of perf_events. This means EBB
events will be given priority over other events, unless they are also pinned.
If an EBB event and a regular event are both pinned, then whichever is enabled
first will be scheduled and the other will be put in error state. See the
section below titled "Enabling an EBB event" for more information.
Creating an EBB event
---------------------
To request that an event is counted using EBB, the event code should have bit
63 set.
EBB events must be created with a particular, and restrictive, set of
attributes - this is so that they interoperate correctly with the rest of the
perf_events subsystem.
An EBB event must be created with the "pinned" and "exclusive" attributes set.
Note that if you are creating a group of EBB events, only the leader can have
these attributes set.
An EBB event must NOT set any of the "inherit", "sample_period", "freq" or
"enable_on_exec" attributes.
An EBB event must be attached to a task. This is specified to perf_event_open()
by passing a pid value, typically 0 indicating the current task.
All events in a group must agree on whether they want EBB. That is all events
must request EBB, or none may request EBB.
EBB events must specify the PMC they are to be counted on. This ensures
userspace is able to reliably determine which PMC the event is scheduled on.
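The attribute rules above can be collected into a small initialisation
routine. The following C sketch is a minimal illustration, not a definitive
implementation: the function name is hypothetical, the use of PERF_TYPE_RAW is
an assumption, and the raw event code passed in must already select the PMC
(how the PMC is encoded in the event code is not covered here).

```c
#include <assert.h>
#include <string.h>
#include <linux/perf_event.h>

/*
 * Hypothetical helper: fill *attr for an EBB group leader.
 * raw_event is the raw event code and must already select the PMC.
 */
void ebb_leader_attr_init(struct perf_event_attr *attr,
			  unsigned long long raw_event)
{
	assert(attr != NULL);

	memset(attr, 0, sizeof(*attr));
	attr->type = PERF_TYPE_RAW;	/* assumption: raw PMU event code */
	attr->size = sizeof(*attr);
	attr->config = raw_event | (1ULL << 63);	/* bit 63 requests EBB */
	attr->pinned = 1;		/* group leader only */
	attr->exclusive = 1;		/* group leader only */
	/*
	 * inherit, sample_period, freq and enable_on_exec must stay clear;
	 * the memset() above already guarantees that.
	 */
}
```

The resulting attr would then be passed to perf_event_open() with a pid of 0
to attach the event to the current task. Other events in the same group would
also set bit 63 but leave "pinned" and "exclusive" clear.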
Enabling an EBB event
---------------------
Once an EBB event has been successfully opened, it must be enabled with the
perf_events API. This can be achieved either via the ioctl() interface, or the
prctl() interface.
However, due to the design of the perf_events API, enabling an event does not
guarantee that it has been scheduled on the PMU. To ensure that the EBB event
has been scheduled on the PMU, you must perform a read() on the event. If the
read() returns EOF, then the event has not been scheduled and EBBs are not
enabled.
This behaviour occurs because the EBB event is pinned and exclusive. When the
EBB event is enabled it will force all other non-pinned events off the PMU. In
this case the enable will be successful. However if there is already an event
pinned on the PMU then the enable will not be successful.
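The read() check described above can be wrapped in a small helper. This is a
sketch only: the function name is hypothetical, and it assumes the file
descriptor came from a successful perf_event_open() call and that the event
has already been enabled.

```c
#include <assert.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if a read() on the event fd produced
 * data, 0 if it hit EOF (event not scheduled, EBBs not enabled), and
 * -1 on a read error.
 */
int ebb_event_is_scheduled(int event_fd)
{
	unsigned long long value;	/* content is not meaningful for EBB */
	ssize_t n;

	assert(event_fd >= 0);

	n = read(event_fd, &value, sizeof(value));
	if (n < 0)
		return -1;
	return n > 0;
}
```

A program would typically call this once after enabling the event and fall
back to a non-EBB code path when the result is 0.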
Reading an EBB event
--------------------
It is possible to read() from an EBB event. However the results are
meaningless. Because interrupts are being delivered to the user process the
kernel is not able to count the event, and so will return a junk value.
Closing an EBB event
--------------------
When an EBB event is finished with, you can close it using close() as for any
regular event. If this is the last EBB event the PMU will be deconfigured and
no further PMU EBBs will be delivered.
EBB Handler
-----------
The EBB handler is just regular userspace code, however it must be written in
the style of an interrupt handler. When the handler is entered all registers
are live (possibly) and so must be saved somehow before the handler can invoke
other code.
It's up to the program how to handle this. For C programs a relatively simple
option is to create an interrupt frame on the stack and save registers there.
Fork
----
EBB events are not inherited across fork. If the child process wishes to use
EBBs it should open a new event for itself. Similarly the EBB state in
BESCR/EBBHR/EBBRR is cleared across fork().

View file

@ -0,0 +1,151 @@
GDB intends to support the following hardware debug features of BookE
processors:
4 hardware breakpoints (IAC)
2 hardware watchpoints (read, write and read-write) (DAC)
2 value conditions for the hardware watchpoints (DVC)
For that, we need to extend ptrace so that GDB can query and set these
resources. Since we're extending, we're trying to create an interface
that's extendable and that covers both BookE and server processors, so
that GDB doesn't need to special-case each of them. We added the
following 3 new ptrace requests.
1. PTRACE_PPC_GETHWDEBUGINFO
Query for GDB to discover the hardware debug features. The main info to
be returned here is the minimum alignment for the hardware watchpoints.
BookE processors don't have restrictions here, but server processors have
an 8-byte alignment restriction for hardware watchpoints. We'd like to avoid
adding special cases to GDB based on what it sees in AUXV.
Since we're at it, we added other useful info that the kernel can return to
GDB: this query will return the number of hardware breakpoints, hardware
watchpoints and whether it supports a range of addresses and a condition.
The query will fill the following structure provided by the requesting process:
struct ppc_debug_info {
	uint32_t version;
	uint32_t num_instruction_bps;
	uint32_t num_data_bps;
	uint32_t num_condition_regs;
	uint32_t data_bp_alignment;
	uint32_t sizeof_condition; /* size of the DVC register */
	uint64_t features; /* bitmask of the individual flags */
};
features will have bits indicating whether there is support for:
#define PPC_DEBUG_FEATURE_INSN_BP_RANGE 0x1
#define PPC_DEBUG_FEATURE_INSN_BP_MASK 0x2
#define PPC_DEBUG_FEATURE_DATA_BP_RANGE 0x4
#define PPC_DEBUG_FEATURE_DATA_BP_MASK 0x8
#define PPC_DEBUG_FEATURE_DATA_BP_DAWR 0x10
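As an illustration of how a debugger might interpret the "features" bitmask,
the following C sketch turns it into a readable list. The flag values are
copied from the list above. The table and function names are hypothetical and
exist only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Flag values copied from the list above. */
#define PPC_DEBUG_FEATURE_INSN_BP_RANGE	0x1
#define PPC_DEBUG_FEATURE_INSN_BP_MASK	0x2
#define PPC_DEBUG_FEATURE_DATA_BP_RANGE	0x4
#define PPC_DEBUG_FEATURE_DATA_BP_MASK	0x8
#define PPC_DEBUG_FEATURE_DATA_BP_DAWR	0x10

/* Hypothetical lookup table used only by this example. */
static const struct {
	uint64_t flag;
	const char *name;
} ppc_debug_feature_names[] = {
	{ PPC_DEBUG_FEATURE_INSN_BP_RANGE, "INSN_BP_RANGE" },
	{ PPC_DEBUG_FEATURE_INSN_BP_MASK,  "INSN_BP_MASK"  },
	{ PPC_DEBUG_FEATURE_DATA_BP_RANGE, "DATA_BP_RANGE" },
	{ PPC_DEBUG_FEATURE_DATA_BP_MASK,  "DATA_BP_MASK"  },
	{ PPC_DEBUG_FEATURE_DATA_BP_DAWR,  "DATA_BP_DAWR"  },
};

/*
 * Write the names of all set feature flags into buf, separated by
 * spaces. Returns the number of characters written, not counting the
 * terminating NUL. Names that do not fit are left out.
 */
size_t ppc_debug_features_to_string(uint64_t features, char *buf, size_t len)
{
	size_t i, used = 0;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';

	for (i = 0; i < sizeof(ppc_debug_feature_names) /
			sizeof(ppc_debug_feature_names[0]); i++) {
		int n;

		if (!(features & ppc_debug_feature_names[i].flag))
			continue;
		n = snprintf(buf + used, len - used, "%s%s",
			     used ? " " : "", ppc_debug_feature_names[i].name);
		if (n < 0 || (size_t)n >= len - used) {
			buf[used] = '\0';	/* drop the partial name */
			break;
		}
		used += (size_t)n;
	}
	return used;
}
```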
2. PTRACE_SETHWDEBUG
Sets a hardware breakpoint or watchpoint, according to the provided structure:
struct ppc_hw_breakpoint {
	uint32_t version;
#define PPC_BREAKPOINT_TRIGGER_EXECUTE	0x1
#define PPC_BREAKPOINT_TRIGGER_READ	0x2
#define PPC_BREAKPOINT_TRIGGER_WRITE	0x4
	uint32_t trigger_type; /* only some combinations allowed */
#define PPC_BREAKPOINT_MODE_EXACT		0x0
#define PPC_BREAKPOINT_MODE_RANGE_INCLUSIVE	0x1
#define PPC_BREAKPOINT_MODE_RANGE_EXCLUSIVE	0x2
#define PPC_BREAKPOINT_MODE_MASK		0x3
	uint32_t addr_mode; /* address match mode */
#define PPC_BREAKPOINT_CONDITION_MODE	0x3
#define PPC_BREAKPOINT_CONDITION_NONE	0x0
#define PPC_BREAKPOINT_CONDITION_AND	0x1
#define PPC_BREAKPOINT_CONDITION_EXACT	0x1 /* different name for the same thing as above */
#define PPC_BREAKPOINT_CONDITION_OR	0x2
#define PPC_BREAKPOINT_CONDITION_AND_OR	0x3
#define PPC_BREAKPOINT_CONDITION_BE_ALL	0x00ff0000 /* byte enable bits */
#define PPC_BREAKPOINT_CONDITION_BE(n)	(1<<((n)+16))
	uint32_t condition_mode; /* break/watchpoint condition flags */
	uint64_t addr;
	uint64_t addr2;
	uint64_t condition_value;
};
A request specifies one event, not necessarily just one register to be set.
For instance, if the request is for a watchpoint with a condition, both the
DAC and DVC registers will be set in the same request.
With this GDB can ask for all kinds of hardware breakpoints and watchpoints
that the BookE supports. COMEFROM breakpoints available in server processors
are not contemplated, but that is out of the scope of this work.
ptrace will return an integer (handle) uniquely identifying the breakpoint or
watchpoint just created. This integer will be used in the PTRACE_DELHWDEBUG
request to ask for its removal. The request returns -ENOSPC if the requested
breakpoint can't be allocated on the registers.
Some examples of using the structure to:
- set a breakpoint in the first breakpoint register
p.version = PPC_DEBUG_CURRENT_VERSION;
p.trigger_type = PPC_BREAKPOINT_TRIGGER_EXECUTE;
p.addr_mode = PPC_BREAKPOINT_MODE_EXACT;
p.condition_mode = PPC_BREAKPOINT_CONDITION_NONE;
p.addr = (uint64_t) address;
p.addr2 = 0;
p.condition_value = 0;
- set a watchpoint which triggers on reads in the second watchpoint register
p.version = PPC_DEBUG_CURRENT_VERSION;
p.trigger_type = PPC_BREAKPOINT_TRIGGER_READ;
p.addr_mode = PPC_BREAKPOINT_MODE_EXACT;
p.condition_mode = PPC_BREAKPOINT_CONDITION_NONE;
p.addr = (uint64_t) address;
p.addr2 = 0;
p.condition_value = 0;
- set a watchpoint which triggers only with a specific value
p.version = PPC_DEBUG_CURRENT_VERSION;
p.trigger_type = PPC_BREAKPOINT_TRIGGER_READ;
p.addr_mode = PPC_BREAKPOINT_MODE_EXACT;
p.condition_mode = PPC_BREAKPOINT_CONDITION_AND | PPC_BREAKPOINT_CONDITION_BE_ALL;
p.addr = (uint64_t) address;
p.addr2 = 0;
p.condition_value = (uint64_t) condition;
- set a ranged hardware breakpoint
p.version = PPC_DEBUG_CURRENT_VERSION;
p.trigger_type = PPC_BREAKPOINT_TRIGGER_EXECUTE;
p.addr_mode = PPC_BREAKPOINT_MODE_RANGE_INCLUSIVE;
p.condition_mode = PPC_BREAKPOINT_CONDITION_NONE;
p.addr = (uint64_t) begin_range;
p.addr2 = (uint64_t) end_range;
p.condition_value = 0;
- set a watchpoint in server processors (BookS)
p.version = 1;
p.trigger_type = PPC_BREAKPOINT_TRIGGER_RW;
p.addr_mode = PPC_BREAKPOINT_MODE_RANGE_INCLUSIVE;
or
p.addr_mode = PPC_BREAKPOINT_MODE_EXACT;
p.condition_mode = PPC_BREAKPOINT_CONDITION_NONE;
p.addr = (uint64_t) begin_range;
/* For PPC_BREAKPOINT_MODE_RANGE_INCLUSIVE addr2 needs to be specified, where
* addr2 - addr <= 8 Bytes.
*/
p.addr2 = (uint64_t) end_range;
p.condition_value = 0;
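The value-condition example above enables all byte enable bits with
PPC_BREAKPOINT_CONDITION_BE_ALL. When only some bytes are of interest, the
condition_mode can be composed from PPC_BREAKPOINT_CONDITION_BE(n) instead.
The following C sketch shows the bit arithmetic only. The macro values are
copied from the structure definition above, and the helper name is
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the struct ppc_hw_breakpoint definition above. */
#define PPC_BREAKPOINT_CONDITION_MODE	0x3
#define PPC_BREAKPOINT_CONDITION_AND	0x1
#define PPC_BREAKPOINT_CONDITION_BE_ALL	0x00ff0000
#define PPC_BREAKPOINT_CONDITION_BE(n)	(1 << ((n) + 16))

/*
 * Hypothetical helper: build a condition_mode that uses
 * PPC_BREAKPOINT_CONDITION_AND and sets byte enable n for every bit n
 * (0..7) that is set in byte_mask.
 */
uint32_t ppc_condition_mode_and(uint8_t byte_mask)
{
	uint32_t mode = PPC_BREAKPOINT_CONDITION_AND;
	int n;

	for (n = 0; n < 8; n++)
		if (byte_mask & (1u << n))
			mode |= PPC_BREAKPOINT_CONDITION_BE(n);

	/* Only mode bits and byte enable bits may be set. */
	assert((mode & ~(uint32_t)(PPC_BREAKPOINT_CONDITION_BE_ALL |
				   PPC_BREAKPOINT_CONDITION_MODE)) == 0);
	return mode;
}
```

Passing a byte_mask of 0xff gives the same condition_mode as the example
above that uses PPC_BREAKPOINT_CONDITION_BE_ALL.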
3. PTRACE_DELHWDEBUG
Takes an integer which identifies an existing breakpoint or watchpoint
(i.e., the value returned from PTRACE_SETHWDEBUG), and deletes the
corresponding breakpoint or watchpoint.

View file

@ -0,0 +1,295 @@
Freescale QUICC Engine Firmware Uploading
-----------------------------------------
(c) 2007 Timur Tabi <timur at freescale.com>,
Freescale Semiconductor
Table of Contents
=================
I - Software License for Firmware
II - Microcode Availability
III - Description and Terminology
IV - Microcode Programming Details
V - Firmware Structure Layout
VI - Sample Code for Creating Firmware Files
Revision Information
====================
November 30, 2007: Rev 1.0 - Initial version
I - Software License for Firmware
=================================
Each firmware file comes with its own software license. For information on
the particular license, please see the license text that is distributed with
the firmware.
II - Microcode Availability
===========================
Firmware files are distributed through various channels. Some are available on
http://opensource.freescale.com. For other firmware files, please contact
your Freescale representative or your operating system vendor.
III - Description and Terminology
================================
In this document, the term 'microcode' refers to the sequence of 32-bit
integers that compose the actual QE microcode.
The term 'firmware' refers to a binary blob that contains the microcode as
well as other data that
1) describes the microcode's purpose
2) describes how and where to upload the microcode
3) specifies the values of various registers
4) includes additional data for use by specific device drivers
Firmware files are binary files that contain only a firmware.
IV - Microcode Programming Details
===================================
The QE architecture allows for only one microcode present in I-RAM for each
RISC processor. To replace any current microcode, a full QE reset (which
disables the microcode) must be performed first.
QE microcode is uploaded using the following procedure:
1) The microcode is placed into I-RAM at a specific location, using the
IRAM.IADD and IRAM.IDATA registers.
2) The CERCR.CIR bit is set to 0 or 1, depending on whether the firmware
needs split I-RAM. Split I-RAM is only meaningful for SOCs that have
QEs with multiple RISC processors, such as the 8360. Splitting the I-RAM
allows each processor to run a different microcode, effectively creating an
asymmetric multiprocessing (AMP) system.
3) The TIBCR trap registers are loaded with the addresses of the trap handlers
in the microcode.
4) The RSP.ECCR register is programmed with the value provided.
5) If necessary, device drivers that need the virtual traps and extended mode
data will use them.
Virtual Microcode Traps
These virtual traps are conditional branches in the microcode. They are
"soft" provisions introduced in the ROM code in order to enable greater
flexibility and to save h/w traps. If new features are activated or an issue
is being fixed in a RAM package that uses them, they should be activated. This
data structure signals to the microcode which of these virtual traps are
active.
This structure contains 6 words that the application should copy to specific
locations in PRAM. This table describes the structure.
---------------------------------------------------------------
| Offset in |                  | Destination Offset | Size of |
|   array   |     Protocol     |    within PRAM     | Operand |
---------------------------------------------------------------
|     0     | Ethernet         |        0xF8        | 4 bytes |
|           | interworking     |                    |         |
---------------------------------------------------------------
|     4     | ATM              |        0xF8        | 4 bytes |
|           | interworking     |                    |         |
---------------------------------------------------------------
|     8     | PPP              |        0xF8        | 4 bytes |
|           | interworking     |                    |         |
---------------------------------------------------------------
|    12     | Ethernet RX      |        0x22        | 1 byte  |
|           | Distributor Page |                    |         |
---------------------------------------------------------------
|    16     | ATM Global       |        0x28        | 1 byte  |
|           | Params Table     |                    |         |
---------------------------------------------------------------
|    20     | Insert Frame     |        0xF8        | 4 bytes |
---------------------------------------------------------------
Extended Modes
This is a double word bit array (64 bits) that defines special functionality
which has an impact on the software drivers. Each bit has its own impact
and has special instructions for the s/w associated with it. This structure is
described in this table:
-------------------------------------------------------------------------
| Bit #  | Name         | Description                                   |
-------------------------------------------------------------------------
|   0    | General      | Indicates that prior to each host command     |
|        | push command | given by the application, the software must   |
|        |              | assert a special host command (push command)  |
|        |              | CECDR = 0x00800000.                           |
|        |              | CECR = 0x01c1000f.                            |
-------------------------------------------------------------------------
|   1    | UCC ATM      | Indicates that after issuing ATM RX INIT      |
|        | RX INIT      | command, the host must issue another special  |
|        | push command | command (push command) and immediately        |
|        |              | following that re-issue the ATM RX INIT       |
|        |              | command. (This makes the sequence of          |
|        |              | initializing the ATM receiver a sequence of   |
|        |              | three host commands)                          |
|        |              | CECDR = 0x00800000.                           |
|        |              | CECR = 0x01c1000f.                            |
-------------------------------------------------------------------------
|   2    | Add/remove   | Indicates that following the specific host    |
|        | command      | command: "Add/Remove entry in Hash Lookup     |
|        | validation   | Table" used in Interworking setup, the user   |
|        |              | must issue another command.                   |
|        |              | CECDR = 0xce000003.                           |
|        |              | CECR = 0x01c10f58.                            |
-------------------------------------------------------------------------
|   3    | General push | Indicates that the s/w has to initialize      |
|        | command      | some pointers in the Ethernet thread pages    |
|        |              | which are used when Header Compression is     |
|        |              | activated. The full details of these          |
|        |              | pointers are located in the software drivers. |
-------------------------------------------------------------------------
|   4    | General push | Indicates that after issuing Ethernet TX      |
|        | command      | INIT command, user must issue this command    |
|        |              | for each SNUM of Ethernet TX thread.          |
|        |              | CECDR = 0x00800003.                           |
|        |              | CECR = 0x7'b{0}, 8'b{Enet TX thread SNUM},    |
|        |              | 1'b{1}, 12'b{0}, 4'b{1}                       |
-------------------------------------------------------------------------
| 5 - 31 | N/A          | Reserved, set to zero.                        |
-------------------------------------------------------------------------
V - Firmware Structure Layout
==============================
QE microcode from Freescale is typically provided as a header file. This
header file contains macros that define the microcode binary itself as well as
some other data used in uploading that microcode. The format of these files
does not lend itself to simple inclusion into other code. Hence,
the need for a more portable format. This section defines that format.
Instead of distributing a header file, the microcode and related data are
embedded into a binary blob. This blob is passed to the qe_upload_firmware()
function, which parses the blob and performs everything necessary to upload
the microcode.
All integers are big-endian. See the comments for function
qe_upload_firmware() for up-to-date implementation information.
This structure supports versioning, where the version of the structure is
embedded into the structure itself. To ensure forward and backwards
compatibility, all versions of the structure must use the same 'qe_header'
structure at the beginning.
'header' (type: struct qe_header):
The 'length' field is the size, in bytes, of the entire structure,
including all the microcode embedded in it, as well as the CRC (if
present).
The 'magic' field is an array of three bytes that contains the letters
'Q', 'E', and 'F'. This is an identifier that indicates that this
structure is a QE Firmware structure.
The 'version' field is a single byte that indicates the version of this
structure. If the layout of the structure should ever need to be
changed to add support for additional types of microcode, then the
version number should also be changed.
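As an illustration of the header fields just described, the following C sketch
validates the start of a firmware blob. It is not the code used by
qe_upload_firmware(). The struct and function names are hypothetical, and the
byte layout (4-byte big-endian length, then the 3-byte magic, then the 1-byte
version) is an assumption based on the order of the field descriptions above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Decoded form of the header; the struct name is hypothetical. */
struct qe_header_info {
	uint32_t length;	/* size of the whole structure, in bytes */
	uint8_t version;	/* layout version */
};

/*
 * Assumed layout: bytes 0-3 length (big-endian), bytes 4-6 magic,
 * byte 7 version.
 * Returns 0 on success, -1 if the blob is too short, the magic is
 * wrong, or the recorded length does not fit in the buffer.
 */
int qe_parse_header(const uint8_t *blob, size_t blob_size,
		    struct qe_header_info *info)
{
	assert(blob != NULL && info != NULL);

	if (blob_size < 8)
		return -1;
	if (memcmp(blob + 4, "QEF", 3) != 0)
		return -1;

	info->length = ((uint32_t)blob[0] << 24) | ((uint32_t)blob[1] << 16) |
		       ((uint32_t)blob[2] << 8) | (uint32_t)blob[3];
	info->version = blob[7];

	if (info->length < 8 || info->length > blob_size)
		return -1;
	return 0;
}
```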
The 'id' field is a null-terminated string (suitable for printing) that
identifies the firmware.
The 'count' field indicates the number of 'microcode' structures. There
must be one and only one 'microcode' structure for each RISC processor.
Therefore, this field also represents the number of RISC processors for this
SOC.
The 'soc' structure contains the SOC numbers and revisions used to match
the microcode to the SOC itself. Normally, the microcode loader should
check the data in this structure with the SOC number and revisions, and
only upload the microcode if there's a match. However, this check is not
made on all platforms.
Although it is not recommended, you can specify '0' in the soc.model
field to skip matching SOCs altogether.
The 'model' field is a 16-bit number that matches the actual SOC. The
'major' and 'minor' fields are the major and minor revision numbers,
respectively, of the SOC.
For example, to match the 8323, revision 1.0:
soc.model = 8323
soc.major = 1
soc.minor = 0
'padding' is necessary for structure alignment. This field ensures that the
'extended_modes' field is aligned on a 64-bit boundary.
'extended_modes' is a bitfield that defines special functionality which has an
impact on the device drivers. Each bit has its own impact and has special
instructions for the driver associated with it. This field is stored in
the QE library and available to any driver that calls qe_get_firmware_info().
'vtraps' is an array of 8 words that contain virtual trap values for each
virtual trap. As with 'extended_modes', this field is stored in the QE
library and available to any driver that calls qe_get_firmware_info().
'microcode' (type: struct qe_microcode):
For each RISC processor there is one 'microcode' structure. The first
'microcode' structure is for the first RISC, and so on.
The 'id' field is a null-terminated string suitable for printing that
identifies this particular microcode.
'traps' is an array of 16 words that contain hardware trap values
for each of the 16 traps. If traps[i] is 0, then this particular
trap is to be ignored (i.e. not written to TIBCR[i]). The entire value
is written as-is to the TIBCR[i] register, so be sure to set the EN
and T_IBP bits if necessary.
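The rule for 'traps' can be sketched as a simple loop. This is not the driver's code: QE_NUM_TRAPS and program_traps() are hypothetical names, the TIBCR registers are modelled as a plain array, and any endianness conversion of the on-disk values is assumed to have been done already.

```c
#include <stddef.h>
#include <stdint.h>

#define QE_NUM_TRAPS 16	/* hypothetical name; the text gives 16 traps */

/*
 * Copy the non-zero trap values into the TIBCR registers.  'tibcr' is a
 * plain array here; a real driver would use its platform's MMIO
 * accessors instead.  Returns the number of traps written.
 */
static unsigned int program_traps(const uint32_t traps[QE_NUM_TRAPS],
				  volatile uint32_t tibcr[QE_NUM_TRAPS])
{
	unsigned int written = 0;
	size_t i;

	for (i = 0; i < QE_NUM_TRAPS; i++) {
		if (traps[i] == 0)
			continue;	/* 0 means "ignore this trap" */
		tibcr[i] = traps[i];	/* written as-is, EN/T_IBP included */
		written++;
	}
	return written;
}
```

Registers whose trap value is 0 are left untouched, which is why the microcode must already carry the EN and T_IBP bits in every value it wants enabled.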
'eccr' is the value to program into the ECCR register.
'iram_offset' is the offset into IRAM to start writing the
microcode.
'count' is the number of 32-bit words in the microcode.
'code_offset' is the offset, in bytes, from the beginning of this
structure where the microcode itself can be found. The first
microcode binary should be located immediately after the 'microcode'
array.
'major', 'minor', and 'revision' are the major, minor, and revision
version numbers, respectively, of the microcode. If all values are 0,
then these fields are ignored.
'reserved' is necessary for structure alignment. Since 'microcode'
is an array, the 64-bit 'extended_modes' field needs to be aligned
on a 64-bit boundary, and this can only happen if the size of
'microcode' is a multiple of 8 bytes. To ensure that, we add
'reserved'.
After the last microcode is a 32-bit CRC. It can be calculated using
this algorithm:
u32 crc32(const u8 *p, unsigned int len)
{
	unsigned int i;
	u32 crc = 0;
	while (len--) {
		crc ^= *p++;
		for (i = 0; i < 8; i++)
			crc = (crc >> 1) ^ ((crc & 1) ? 0xedb88320 : 0);
	}
	return crc;
}
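As a rough sketch, a loader could check the trailing CRC as shown below. Two things here are assumptions, not statements from this document: that the CRC covers every byte that precedes it, and that it is stored big-endian. The names qe_crc32() and qe_firmware_crc_ok() are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Same bitwise algorithm as above, written with <stdint.h> types. */
static uint32_t qe_crc32(const uint8_t *p, size_t len)
{
	uint32_t crc = 0;
	unsigned int i;

	while (len--) {
		crc ^= *p++;
		for (i = 0; i < 8; i++)
			crc = (crc >> 1) ^ ((crc & 1) ? 0xedb88320u : 0);
	}
	return crc;
}

/*
 * Return 1 if the last four bytes of 'fw' hold the CRC of everything
 * before them, 0 otherwise.  Assumes a big-endian trailer.
 */
static int qe_firmware_crc_ok(const uint8_t *fw, size_t size)
{
	const uint8_t *t;
	uint32_t stored;

	if (size < 4)
		return 0;

	t = fw + size - 4;
	stored = ((uint32_t)t[0] << 24) | ((uint32_t)t[1] << 16) |
		 ((uint32_t)t[2] << 8) | (uint32_t)t[3];

	return stored == qe_crc32(fw, size - 4);
}
```

Note that this algorithm starts from 0 and applies no final inversion, so its results differ from the CRC-32 used by zlib and similar tools.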
VI - Sample Code for Creating Firmware Files
============================================
A Python program that creates firmware binaries from the header files normally
distributed by Freescale can be found on http://opensource.freescale.com.

View file

@ -0,0 +1,198 @@
Transactional Memory support
============================
POWER kernel support for this feature is currently limited to supporting
its use by user programs. It is not currently used by the kernel itself.
This file aims to sum up how it is supported by Linux and what behaviour you
can expect from your user programs.
Basic overview
==============
Hardware Transactional Memory is supported on POWER8 processors, and is a
feature that enables a different form of atomic memory access. Several new
instructions are presented to delimit transactions; transactions are
guaranteed to either complete atomically or roll back and undo any partial
changes.
A simple transaction looks like this:
begin_move_money:
  tbegin
  beq   abort_handler
  ld    r4, SAVINGS_ACCT(r3)
  ld    r5, CURRENT_ACCT(r3)
  subi  r5, r5, 1
  addi  r4, r4, 1
  std   r4, SAVINGS_ACCT(r3)
  std   r5, CURRENT_ACCT(r3)
  tend
  b     continue
abort_handler:
  ... test for odd failures ...
  /* Retry the transaction if it failed because it conflicted with
   * someone else: */
  b     begin_move_money
The 'tbegin' instruction denotes the start point, and 'tend' the end point.
Between these points the processor is in 'Transactional' state; any memory
references will complete in one go if there are no conflicts with other
transactional or non-transactional accesses within the system. In this
example, the transaction completes as though it were normal straight-line code
IF no other processor has touched SAVINGS_ACCT(r3) or CURRENT_ACCT(r3); an
atomic move of money from the current account to the savings account has been
performed. Even though the normal ld/std instructions are used (note no
lwarx/stwcx), either *both* SAVINGS_ACCT(r3) and CURRENT_ACCT(r3) will be
updated, or neither will be updated.
If, in the meantime, there is a conflict with the locations accessed by the
transaction, the transaction will be aborted by the CPU. Register and memory
state will roll back to that at the 'tbegin', and control will continue from
'tbegin+4'. The branch to abort_handler will be taken this second time; the
abort handler can check the cause of the failure, and retry.
Checkpointed registers include all GPRs, FPRs, VRs/VSRs, LR, CCR/CR, CTR, FPSCR
and a few other status/flag regs; see the ISA for details.
Causes of transaction aborts
============================
- Conflicts with cache lines used by other processors
- Signals
- Context switches
- See the ISA for full documentation of everything that will abort transactions.
Syscalls
========
Performing syscalls from within a transaction is not recommended, and can lead
to unpredictable results.
Syscalls do not by design abort transactions, but beware: the kernel code will
not be running in transactional state. The effects of syscalls will always
remain visible, but depending on the call they may abort your transaction as a
side effect, read soon-to-be-aborted transactional data that should remain
invisible, etc. If you constantly retry a transaction that constantly aborts
itself by calling a syscall, you'll have a livelock and make no progress.
Simple syscalls (e.g. sigprocmask()) "could" be OK. Even things like write()
from, say, printf() should be OK as long as the kernel does not access any
memory that was accessed transactionally.
Consider any syscalls that happen to work as debug-only -- not recommended for
production use. Best to queue them up till after the transaction is over.
Signals
=======
Delivery of signals (both sync and async) during transactions provides a second
thread state (ucontext/mcontext) to represent the second transactional register
state. Signal delivery 'treclaim's to capture both register states, so signals
abort transactions. The usual ucontext_t passed to the signal handler
represents the checkpointed/original register state; the signal appears to have
arisen at 'tbegin+4'.
If the sighandler ucontext has uc_link set, a second ucontext has been
delivered. For future compatibility the MSR.TS field should be checked to
determine the transactional state; if it shows that a transaction was active,
the second ucontext in uc->uc_link represents the active transactional
registers at the point of the signal.
For 64-bit processes, uc->uc_mcontext.regs->msr is a full 64-bit MSR and its TS
field shows the transactional mode.
For 32-bit processes, the mcontext's MSR register is only 32 bits; the top 32
bits are stored in the MSR of the second ucontext, i.e. in
uc->uc_link->uc_mcontext.regs->msr. The top word contains the transactional
state TS.
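The MSR handling described above can be sketched with two small helpers. The bit positions used for the TS field (bits 33 and 34, counting from the least significant bit) are an assumption taken from the kernel's register definitions rather than from this document, and the TM_MSR_* names, tm_full_msr() and tm_msr_active() are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed positions of the two TS bits within the 64-bit MSR. */
#define TM_MSR_TS_S	(1ULL << 33)	/* suspended */
#define TM_MSR_TS_T	(1ULL << 34)	/* transactional */
#define TM_MSR_TS_MASK	(TM_MSR_TS_S | TM_MSR_TS_T)

/*
 * Build the full MSR for a 32-bit process: 'low' comes from the first
 * ucontext, 'high' from the MSR slot of the second ucontext.
 */
static uint64_t tm_full_msr(uint32_t low, uint32_t high)
{
	return ((uint64_t)high << 32) | low;
}

/* True if the MSR shows transactional or suspended state. */
static bool tm_msr_active(uint64_t msr)
{
	return (msr & TM_MSR_TS_MASK) != 0;
}
```

A 64-bit process can pass uc->uc_mcontext.regs->msr straight to tm_msr_active(); only 32-bit processes need tm_full_msr() first.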
However, basic signal handlers don't need to be aware of transactions;
simply returning from the handler will deal with things correctly.
Transaction-aware signal handlers can read the transactional register state
from the second ucontext. This will be necessary for crash handlers to
determine, for example, the address of the instruction causing the SIGSEGV.
Example signal handler:
void crash_handler(int sig, siginfo_t *si, void *uc)
{
	ucontext_t *ucp = uc;
	ucontext_t *transactional_ucp = ucp->uc_link;
	/* A non-NULL uc_link means a second ucontext was delivered. */
	if (transactional_ucp) {
		u64 msr = ucp->uc_mcontext.regs->msr;
		/* May have transactional ucontext! */
#ifndef __powerpc64__
		/* 32-bit: the top half of the MSR lives in the second ucontext. */
		msr |= ((u64)transactional_ucp->uc_mcontext.regs->msr) << 32;
#endif
		if (MSR_TM_ACTIVE(msr)) {
			/* Yes, we crashed during a transaction. Oops. */
			fprintf(stderr, "Transaction to be restarted at 0x%llx, but "
				"crashy instruction was at 0x%llx\n",
				(unsigned long long)ucp->uc_mcontext.regs->nip,
				(unsigned long long)transactional_ucp->uc_mcontext.regs->nip);
		}
	}
	fix_the_problem(ucp->uc_mcontext.regs->dar);
}
When in an active transaction that takes a signal, we need to be careful with
the stack. It's possible that the stack has moved back up after the tbegin.
The obvious case here is when the tbegin is called inside a function that
returns before a tend. In this case, the stack is part of the checkpointed
transactional memory state. If we write over this non-transactionally or in
suspend, we are in trouble because if we get a TM abort, the program counter and
stack pointer will be back at the tbegin but our in-memory stack won't be valid
anymore.
To avoid this, when taking a signal in an active transaction, we need to use
the stack pointer from the checkpointed state, rather than the speculated
state. This ensures that the signal context (written while TM is suspended) will be
written below the stack required for the rollback. The transaction is aborted
because of the treclaim, so any memory written between the tbegin and the
signal will be rolled back anyway.
For signals taken in non-TM or suspended mode, we use the
normal/non-checkpointed stack pointer.
Failure cause codes used by kernel
==================================
These are defined in <asm/reg.h>, and distinguish different reasons why the
kernel aborted a transaction:
TM_CAUSE_RESCHED     Thread was rescheduled.
TM_CAUSE_TLBI        Software TLB invalidate.
TM_CAUSE_FAC_UNAV    FP/VEC/VSX unavailable trap.
TM_CAUSE_SYSCALL     Currently unused; future syscalls that must abort
                     transactions for consistency will use this.
TM_CAUSE_SIGNAL      Signal delivered.
TM_CAUSE_MISC        Currently unused.
TM_CAUSE_ALIGNMENT   Alignment fault.
TM_CAUSE_EMULATE     Emulation that touched memory.
These can be checked by the user program's abort handler as TEXASR[0:7]. If
bit 7 is set, it indicates that the error is considered persistent. For example,
a TM_CAUSE_ALIGNMENT will be persistent while a TM_CAUSE_RESCHED will not.
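An abort handler could decode the failure code along these lines. The sketch assumes that 'texasr' holds the full 64-bit register value and that TEXASR[0:7] uses IBM bit numbering, where bit 0 is the most significant bit. The helper names and the retry policy are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* TEXASR[0:7] is the top byte of the 64-bit value (bit 0 = MSB). */
static uint8_t tm_failure_code(uint64_t texasr)
{
	return (uint8_t)(texasr >> 56);
}

/* Bit 7 of TEXASR is the least significant bit of the failure code. */
static bool tm_failure_persistent(uint64_t texasr)
{
	return (tm_failure_code(texasr) & 0x01) != 0;
}

/*
 * Hypothetical retry policy: give up at once on a persistent failure,
 * otherwise retry until 'max_attempts' is reached.
 */
static bool tm_should_retry(uint64_t texasr, unsigned int attempts,
			    unsigned int max_attempts)
{
	if (tm_failure_persistent(texasr))
		return false;

	return attempts < max_attempts;
}
```

Bounding the number of retries also guards against the livelock described in the Syscalls section, where a transaction keeps aborting for the same reason.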
GDB
===
GDB and ptrace are not currently TM-aware. If one stops during a transaction,
it looks like the transaction has just started (the checkpointed state is
presented). The transaction cannot then be continued and will take the failure
handler route. Furthermore, the transactional 2nd register state will be
inaccessible. GDB can currently be used on programs using TM, but not sensibly
in parts within transactions.