Fixed MTP to work with TWRP

This commit is contained in:
awab228 2018-06-19 23:16:04 +02:00
commit f6dfaef42e
50820 changed files with 20846062 additions and 0 deletions

View file

@ -0,0 +1,69 @@
IRQ affinity on IA64 platforms
------------------------------
07.01.2002, Erich Focht <efocht@ess.nec.de>
By writing to /proc/irq/IRQ#/smp_affinity the interrupt routing can be
controlled. The behavior on IA64 platforms is slightly different from
that described in Documentation/IRQ-affinity.txt for i386 systems.
Because of the usage of SAPIC mode and physical destination mode the
IRQ target is one particular CPU and cannot be a mask of several
CPUs. Only the first non-zero bit is taken into account.
Usage examples:
The target CPU has to be specified as a hexadecimal CPU mask. The
first non-zero bit is the selected CPU. This format has been kept for
compatibility reasons with i386.
Set the delivery mode of interrupt 41 to fixed and route the
interrupts to CPU #3 (logical CPU number) (2^3=0x08):
echo "8" >/proc/irq/41/smp_affinity
Set the default route for IRQ number 41 to CPU 6 in lowest priority
delivery mode (redirectable):
echo "r 40" >/proc/irq/41/smp_affinity
The output of the command
cat /proc/irq/IRQ#/smp_affinity
gives the target CPU mask for the specified interrupt vector. If the CPU
mask is preceded by the character "r", the interrupt is redirectable
(i.e. lowest priority mode routing is used), otherwise its route is
fixed.
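The same file can also be driven from a program rather than the shell. The
following is a minimal sketch, not part of the original text; IRQ 41 and the
mask 0x8 are simply the values from the example above:

	/* Sketch: set IRQ 41's target to CPU #3 and read the setting back. */
	#include <stdio.h>

	int main(void)
	{
		char buf[64];
		FILE *f;

		/* Write the hexadecimal CPU mask; only the first non-zero bit counts. */
		f = fopen("/proc/irq/41/smp_affinity", "w");
		if (!f) {
			perror("smp_affinity (write)");
			return 1;
		}
		fputs("8\n", f);	/* 2^3 = 0x08 -> CPU #3 */
		fclose(f);

		/* Read it back; a leading "r" means the IRQ is redirectable. */
		f = fopen("/proc/irq/41/smp_affinity", "r");
		if (!f) {
			perror("smp_affinity (read)");
			return 1;
		}
		if (fgets(buf, sizeof(buf), f))
			printf("IRQ 41 affinity: %s", buf);
		fclose(f);
		return 0;
	}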
Initialization and default behavior:
If the platform features IRQ redirection (info provided by SAL) all
IO-SAPIC interrupts are initialized with CPU#0 as their default target
and the routing is the so called "lowest priority mode" (actually
fixed SAPIC mode with hint). The XTP chipset registers are used as hints
for the IRQ routing. Currently in Linux XTP registers can have three
values:
- minimal for an idle task,
- normal if any other task runs,
- maximal if the CPU is going to be switched off.
The IRQ is routed to the CPU with the lowest XTP register value; the
search begins at the default CPU. Therefore most of the interrupts
will be handled by CPU #0.
If the platform doesn't feature interrupt redirection, IOSAPIC fixed
routing is used. The target CPUs are distributed in a round robin
manner. IRQs will be routed only to the selected target CPUs. Check
with
cat /proc/interrupts
Comments:
On large (multi-node) systems it is recommended to route the IRQs to
the node to which the corresponding device is connected.
For systems like the NEC AzusA we get IRQ node-affinity for free. This
is because usually the chipsets on each node redirect the interrupts
only to their own CPUs (as they cannot see the XTP registers on the
other nodes).

View file

@ -0,0 +1,5 @@
# List of programs to build
hostprogs-y := aliasing-test
# Tell kbuild to always build the programs
always := $(hostprogs-y)

43
Documentation/ia64/README Normal file
View file

@ -0,0 +1,43 @@
Linux kernel release 2.4.xx for the IA-64 Platform
These are the release notes for Linux version 2.4 for the IA-64
platform. This document provides information specific to IA-64
ONLY; for additional information about the Linux kernel, also
read the original Linux README provided with the kernel.
INSTALLING the kernel:
- IA-64 kernel installation is the same as the other platforms, see
original README for details.
SOFTWARE REQUIREMENTS
Compiling and running this kernel requires an IA-64 compliant GCC
compiler, as well as various software packages that were also
compiled with an IA-64 compliant GCC compiler.
CONFIGURING the kernel:
Configuration is the same, see original README for details.
COMPILING the kernel:
- Compiling this kernel doesn't differ from other platforms, so read
the original README for details, BUT make sure you have an IA-64
compliant GCC compiler.
IA-64 SPECIFICS
- General issues:
o Hardly any performance tuning has been done. Obvious targets
include the library routines (IP checksum, etc.). Less
obvious targets include making sure we don't flush the TLB
needlessly, etc.
o SMP locks cleanup/optimization
o IA32 support. Currently experimental. It mostly works.

View file

@ -0,0 +1,263 @@
/*
* Exercise /dev/mem mmap cases that have been troublesome in the past
*
* (c) Copyright 2007 Hewlett-Packard Development Company, L.P.
* Bjorn Helgaas <bjorn.helgaas@hp.com>
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License version 2 as
* published by the Free Software Foundation.
*/
#include <stdlib.h>
#include <stdio.h>
#include <sys/types.h>
#include <dirent.h>
#include <fcntl.h>
#include <fnmatch.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <linux/pci.h>
int sum;
static int map_mem(char *path, off_t offset, size_t length, int touch)
{
int fd, rc;
void *addr;
int *c;
fd = open(path, O_RDWR);
if (fd == -1) {
perror(path);
return -1;
}
if (fnmatch("/proc/bus/pci/*", path, 0) == 0) {
rc = ioctl(fd, PCIIOC_MMAP_IS_MEM);
if (rc == -1)
perror("PCIIOC_MMAP_IS_MEM ioctl");
}
addr = mmap(NULL, length, PROT_READ|PROT_WRITE, MAP_SHARED, fd, offset);
if (addr == MAP_FAILED)
return 1;
if (touch) {
c = (int *) addr;
while (c < (int *) (addr + length))
sum += *c++;
}
rc = munmap(addr, length);
if (rc == -1) {
perror("munmap");
return -1;
}
close(fd);
return 0;
}
static int scan_tree(char *path, char *file, off_t offset, size_t length, int touch)
{
struct dirent **namelist;
char *name, *path2;
int i, n, r, rc = 0, result = 0;
struct stat buf;
n = scandir(path, &namelist, 0, alphasort);
if (n < 0) {
perror("scandir");
return -1;
}
for (i = 0; i < n; i++) {
name = namelist[i]->d_name;
if (fnmatch(".", name, 0) == 0)
goto skip;
if (fnmatch("..", name, 0) == 0)
goto skip;
path2 = malloc(strlen(path) + strlen(name) + 3);
strcpy(path2, path);
strcat(path2, "/");
strcat(path2, name);
if (fnmatch(file, name, 0) == 0) {
rc = map_mem(path2, offset, length, touch);
if (rc == 0)
fprintf(stderr, "PASS: %s 0x%lx-0x%lx is %s\n", path2, offset, offset + length, touch ? "readable" : "mappable");
else if (rc > 0)
fprintf(stderr, "PASS: %s 0x%lx-0x%lx not mappable\n", path2, offset, offset + length);
else {
fprintf(stderr, "FAIL: %s 0x%lx-0x%lx not accessible\n", path2, offset, offset + length);
return rc;
}
} else {
r = lstat(path2, &buf);
if (r == 0 && S_ISDIR(buf.st_mode)) {
rc = scan_tree(path2, file, offset, length, touch);
if (rc < 0)
return rc;
}
}
result |= rc;
free(path2);
skip:
free(namelist[i]);
}
free(namelist);
return result;
}
char buf[1024];
static int read_rom(char *path)
{
int fd, rc;
size_t size = 0;
fd = open(path, O_RDWR);
if (fd == -1) {
perror(path);
return -1;
}
rc = write(fd, "1", 2);
if (rc <= 0) {
close(fd);
perror("write");
return -1;
}
do {
rc = read(fd, buf, sizeof(buf));
if (rc > 0)
size += rc;
} while (rc > 0);
close(fd);
return size;
}
static int scan_rom(char *path, char *file)
{
struct dirent **namelist;
char *name, *path2;
int i, n, r, rc = 0, result = 0;
struct stat buf;
n = scandir(path, &namelist, 0, alphasort);
if (n < 0) {
perror("scandir");
return -1;
}
for (i = 0; i < n; i++) {
name = namelist[i]->d_name;
if (fnmatch(".", name, 0) == 0)
goto skip;
if (fnmatch("..", name, 0) == 0)
goto skip;
path2 = malloc(strlen(path) + strlen(name) + 3);
strcpy(path2, path);
strcat(path2, "/");
strcat(path2, name);
if (fnmatch(file, name, 0) == 0) {
rc = read_rom(path2);
/*
* It's OK if the ROM is unreadable. Maybe there
* is no ROM, or some other error occurred. The
* important thing is that no MCA happened.
*/
if (rc > 0)
fprintf(stderr, "PASS: %s read %d bytes\n", path2, rc);
else {
fprintf(stderr, "PASS: %s not readable\n", path2);
return rc;
}
} else {
r = lstat(path2, &buf);
if (r == 0 && S_ISDIR(buf.st_mode)) {
rc = scan_rom(path2, file);
if (rc < 0)
return rc;
}
}
result |= rc;
free(path2);
skip:
free(namelist[i]);
}
free(namelist);
return result;
}
int main(void)
{
int rc;
if (map_mem("/dev/mem", 0, 0xA0000, 1) == 0)
fprintf(stderr, "PASS: /dev/mem 0x0-0xa0000 is readable\n");
else
fprintf(stderr, "FAIL: /dev/mem 0x0-0xa0000 not accessible\n");
/*
* It's not safe to blindly read the VGA frame buffer. If you know
* how to poke the card the right way, it should respond, but it's
* not safe in general. Many machines, e.g., Intel chipsets, cover
* up a non-responding card by just returning -1, but others will
* report the failure as a machine check.
*/
if (map_mem("/dev/mem", 0xA0000, 0x20000, 0) == 0)
fprintf(stderr, "PASS: /dev/mem 0xa0000-0xc0000 is mappable\n");
else
fprintf(stderr, "FAIL: /dev/mem 0xa0000-0xc0000 not accessible\n");
if (map_mem("/dev/mem", 0xC0000, 0x40000, 1) == 0)
fprintf(stderr, "PASS: /dev/mem 0xc0000-0x100000 is readable\n");
else
fprintf(stderr, "FAIL: /dev/mem 0xc0000-0x100000 not accessible\n");
/*
* Often you can map all the individual pieces above (0-0xA0000,
* 0xA0000-0xC0000, and 0xC0000-0x100000), but can't map the whole
* thing at once. This is because the individual pieces use different
* attributes, and there's no single attribute supported over the
* whole region.
*/
rc = map_mem("/dev/mem", 0, 1024*1024, 0);
if (rc == 0)
fprintf(stderr, "PASS: /dev/mem 0x0-0x100000 is mappable\n");
else if (rc > 0)
fprintf(stderr, "PASS: /dev/mem 0x0-0x100000 not mappable\n");
else
fprintf(stderr, "FAIL: /dev/mem 0x0-0x100000 not accessible\n");
scan_tree("/sys/class/pci_bus", "legacy_mem", 0, 0xA0000, 1);
scan_tree("/sys/class/pci_bus", "legacy_mem", 0xA0000, 0x20000, 0);
scan_tree("/sys/class/pci_bus", "legacy_mem", 0xC0000, 0x40000, 1);
scan_tree("/sys/class/pci_bus", "legacy_mem", 0, 1024*1024, 0);
scan_rom("/sys/devices", "rom");
scan_tree("/proc/bus/pci", "??.?", 0, 0xA0000, 1);
scan_tree("/proc/bus/pci", "??.?", 0xA0000, 0x20000, 0);
scan_tree("/proc/bus/pci", "??.?", 0xC0000, 0x40000, 1);
scan_tree("/proc/bus/pci", "??.?", 0, 1024*1024, 0);
return rc;
}

View file

@ -0,0 +1,221 @@
MEMORY ATTRIBUTE ALIASING ON IA-64
Bjorn Helgaas
<bjorn.helgaas@hp.com>
May 4, 2006
MEMORY ATTRIBUTES
Itanium supports several attributes for virtual memory references.
The attribute is part of the virtual translation, i.e., it is
contained in the TLB entry. The ones of most interest to the Linux
kernel are:
WB Write-back (cacheable)
UC Uncacheable
WC Write-coalescing
System memory typically uses the WB attribute. The UC attribute is
used for memory-mapped I/O devices. The WC attribute is uncacheable
like UC is, but writes may be delayed and combined to increase
performance for things like frame buffers.
The Itanium architecture requires that we avoid accessing the same
page with both a cacheable mapping and an uncacheable mapping[1].
The design of the chipset determines which attributes are supported
on which regions of the address space. For example, some chipsets
support either WB or UC access to main memory, while others support
only WB access.
MEMORY MAP
Platform firmware describes the physical memory map and the
supported attributes for each region. At boot-time, the kernel uses
the EFI GetMemoryMap() interface. ACPI can also describe memory
devices and the attributes they support, but Linux/ia64 currently
doesn't use this information.
The kernel uses the efi_memmap table returned from GetMemoryMap() to
learn the attributes supported by each region of physical address
space. Unfortunately, this table does not completely describe the
address space because some machines omit some or all of the MMIO
regions from the map.
The kernel maintains another table, kern_memmap, which describes the
memory Linux is actually using and the attribute for each region.
This contains only system memory; it does not contain MMIO space.
The kern_memmap table typically contains only a subset of the system
memory described by the efi_memmap. Linux/ia64 can't use all memory
in the system because of constraints imposed by the identity mapping
scheme.
The efi_memmap table is preserved unmodified because the original
boot-time information is required for kexec.
KERNEL IDENTITY MAPPINGS
Linux/ia64 identity mappings are done with large pages, currently
either 16MB or 64MB, referred to as "granules." Cacheable mappings
are speculative[2], so the processor can read any location in the
page at any time, independent of the programmer's intentions. This
means that to avoid attribute aliasing, Linux can create a cacheable
identity mapping only when the entire granule supports cacheable
access.
Therefore, kern_memmap contains only full granule-sized regions that
can be referenced safely by an identity mapping.
Uncacheable mappings are not speculative, so the processor will
generate UC accesses only to locations explicitly referenced by
software. This allows UC identity mappings to cover granules that
are only partially populated, or populated with a combination of UC
and WB regions.
USER MAPPINGS
User mappings are typically done with 16K or 64K pages. The smaller
page size allows more flexibility because only 16K or 64K has to be
homogeneous with respect to memory attributes.
POTENTIAL ATTRIBUTE ALIASING CASES
There are several ways the kernel creates new mappings:
mmap of /dev/mem
This uses remap_pfn_range(), which creates user mappings. These
mappings may be either WB or UC. If the region being mapped
happens to be in kern_memmap, meaning that it may also be mapped
by a kernel identity mapping, the user mapping must use the same
attribute as the kernel mapping.
If the region is not in kern_memmap, the user mapping should use
an attribute reported as being supported in the EFI memory map.
Since the EFI memory map does not describe MMIO on some
machines, this should use an uncacheable mapping as a fallback.
mmap of /sys/class/pci_bus/.../legacy_mem
This is very similar to mmap of /dev/mem, except that legacy_mem
only allows mmap of the one megabyte "legacy MMIO" area for a
specific PCI bus. Typically this is the first megabyte of
physical address space, but it may be different on machines with
several VGA devices.
"X" uses this to access VGA frame buffers. Using legacy_mem
rather than /dev/mem allows multiple instances of X to talk to
different VGA cards.
The /dev/mem mmap constraints apply.
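As an illustration of the constraint above, a user mapping of the legacy
area could be obtained as in the sketch below (the bus path "0000:00" is
only an example; the kernel applies the attribute rules described above,
so the mmap may legitimately fail):

	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		const char *path = "/sys/class/pci_bus/0000:00/legacy_mem";
		void *p;
		int fd = open(path, O_RDWR);

		if (fd == -1) {
			perror(path);
			return 1;
		}
		/* Offset 0xA0000, length 0x20000: the VGA frame buffer window. */
		p = mmap(NULL, 0x20000, PROT_READ | PROT_WRITE, MAP_SHARED,
			 fd, 0xA0000);
		if (p == MAP_FAILED)
			perror("mmap");	/* expected if no safe attribute exists */
		else
			munmap(p, 0x20000);
		close(fd);
		return 0;
	}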
mmap of /proc/bus/pci/.../??.?
This is an MMIO mmap of PCI functions; the mapping may additionally
be requested with the WC attribute.
If WC is requested, and the region in kern_memmap is either WC
or UC, and the EFI memory map designates the region as WC, then
the WC mapping is allowed.
Otherwise, the user mapping must use the same attribute as the
kernel mapping.
read/write of /dev/mem
This uses copy_from_user(), which implicitly uses a kernel
identity mapping. This is obviously safe for things in
kern_memmap.
There may be corner cases of things that are not in kern_memmap,
but could be accessed this way. For example, registers in MMIO
space are not in kern_memmap, but could be accessed with a UC
mapping. This would not cause attribute aliasing. But
registers typically can be accessed only with four-byte or
eight-byte accesses, and the copy_from_user() path doesn't allow
any control over the access size, so this would be dangerous.
ioremap()
This returns a mapping for use inside the kernel.
If the region is in kern_memmap, we should use the attribute
specified there.
If the EFI memory map reports that the entire granule supports
WB, we should use that (granules that are partially reserved
or occupied by firmware do not appear in kern_memmap).
If the granule contains non-WB memory, but we can cover the
region safely with kernel page table mappings, we can use
ioremap_page_range() as most other architectures do.
Failing all of the above, we have to fall back to a UC mapping.
PAST PROBLEM CASES
mmap of various MMIO regions from /dev/mem by "X" on Intel platforms
The EFI memory map may not report these MMIO regions.
These must be allowed so that X will work. This means that
when the EFI memory map is incomplete, every /dev/mem mmap must
succeed. It may create either WB or UC user mappings, depending
on whether the region is in kern_memmap or the EFI memory map.
mmap of 0x0-0x9FFFF /dev/mem by "hwinfo" on HP sx1000 with VGA enabled
The EFI memory map reports the following attributes:
0x00000-0x9FFFF WB only
0xA0000-0xBFFFF UC only (VGA frame buffer)
0xC0000-0xFFFFF WB only
This mmap is done with user pages, not kernel identity mappings,
so it is safe to use WB mappings.
The kernel VGA driver may ioremap the VGA frame buffer at 0xA0000,
which uses a granule-sized UC mapping. This granule will cover some
WB-only memory, but since UC is non-speculative, the processor will
never generate an uncacheable reference to the WB-only areas unless
the driver explicitly touches them.
mmap of 0x0-0xFFFFF legacy_mem by "X"
If the EFI memory map reports that the entire range supports the
same attributes, we can allow the mmap (and we will prefer WB if
supported, as is the case with HP sx[12]000 machines with VGA
disabled).
If EFI reports the range as partly WB and partly UC (as on sx[12]000
machines with VGA enabled), we must fail the mmap because there's no
safe attribute to use.
If EFI reports some of the range but not all (as on Intel firmware
that doesn't report the VGA frame buffer at all), we should fail the
mmap and force the user to map just the specific region of interest.
mmap of 0xA0000-0xBFFFF legacy_mem by "X" on HP sx1000 with VGA disabled
The EFI memory map reports the following attributes:
0x00000-0xFFFFF WB only (no VGA MMIO hole)
This is a special case of the previous case, and the mmap should
fail for the same reason as above.
read of /sys/devices/.../rom
For VGA devices, this may cause an ioremap() of 0xC0000. This
used to be done with a UC mapping, because the VGA frame buffer
at 0xA0000 prevents use of a WB granule. The UC mapping causes
an MCA on HP sx[12]000 chipsets.
We should use WB page table mappings to avoid covering the VGA
frame buffer.
NOTES
[1] SDM rev 2.2, vol 2, sec 4.4.1.
[2] SDM rev 2.2, vol 2, sec 4.4.6.

View file

@ -0,0 +1,128 @@
EFI Real Time Clock driver
-------------------------------
S. Eranian <eranian@hpl.hp.com>
March 2000
I/ Introduction
This document describes the efirtc.c driver provided for
the IA-64 platform.
The purpose of this driver is to supply an API for kernel and user applications
to get access to the Time Service offered by EFI version 0.92.
EFI provides 4 calls one can make once the OS is booted: GetTime(),
SetTime(), GetWakeupTime(), SetWakeupTime() which are all supported by this
driver. We describe those calls as well as the design of the driver in the
following sections.
II/ Design Decisions
The original idea was to provide a very simple driver to get access to,
at first, the time of day service. This is required in order to access, in a
portable way, the CMOS clock. A program like /sbin/hwclock uses such a clock
to initialize the system view of the time during boot.
Because we wanted to minimize the impact on existing user-level apps using
the CMOS clock, we decided to expose an API that was very similar to the one
used today with the legacy RTC driver (driver/char/rtc.c). However, because
EFI provides simpler services, not all ioctl()s are available. Also,
new ioctl()s have been introduced for things that EFI provides but the
legacy RTC does not.
EFI uses a slightly different way of representing the time; notably,
the reference date is different. The year uses the full 4-digit format.
The Epoch is January 1st 1998. For backward compatibility reasons we don't
expose this new way of representing time. Instead we use something very
similar to the struct tm, i.e. struct rtc_time, as used by hwclock.
One of the reasons for doing it this way is to allow for EFI to still evolve
without necessarily impacting any of the user applications. The decoupling
enables flexibility and permits writing wrapper code in case things change.
The driver exposes two interfaces, one via the device file and a set of
ioctl()s. The other is read-only via the /proc filesystem.
As of today we don't offer a /proc/sys interface.
To allow for a uniform interface between the legacy RTC and EFI time service,
we have created the include/linux/rtc.h header file to contain only the
"public" API of the two drivers. The specifics of the legacy RTC are still
in include/linux/mc146818rtc.h.
III/ Time of day service
This part of the driver gives access to the time of day service of EFI.
Two ioctl()s, compatible with the legacy RTC calls:
Read the CMOS clock: ioctl(d, RTC_RD_TIME, &rtc);
Write the CMOS clock: ioctl(d, RTC_SET_TIME, &rtc);
The rtc is a pointer to a data structure defined in rtc.h which is close
to a struct tm:
struct rtc_time {
int tm_sec;
int tm_min;
int tm_hour;
int tm_mday;
int tm_mon;
int tm_year;
int tm_wday;
int tm_yday;
int tm_isdst;
};
The driver takes care of converting back and forth between the EFI time and
this format.
Those two ioctl()s can be exercised with the hwclock command:
For reading:
# /sbin/hwclock --show
Mon Mar 6 15:32:32 2000 -0.910248 seconds
For setting:
# /sbin/hwclock --systohc
Root privileges are required to be able to set the time of day.
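For illustration, a minimal user-level reader of the time of day service
could look like the sketch below. The device node name /dev/efirtc is an
assumption (it is not specified above); the ioctl and struct rtc_time come
from include/linux/rtc.h as described earlier:

	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/ioctl.h>
	#include <unistd.h>
	#include <linux/rtc.h>

	int main(void)
	{
		struct rtc_time rtc;
		int fd = open("/dev/efirtc", O_RDONLY);	/* assumed node name */

		if (fd == -1) {
			perror("/dev/efirtc");
			return 1;
		}
		if (ioctl(fd, RTC_RD_TIME, &rtc) == -1) {
			perror("RTC_RD_TIME");
			close(fd);
			return 1;
		}
		/* struct rtc_time follows the struct tm conventions. */
		printf("%04d-%02d-%02d %02d:%02d:%02d\n",
		       rtc.tm_year + 1900, rtc.tm_mon + 1, rtc.tm_mday,
		       rtc.tm_hour, rtc.tm_min, rtc.tm_sec);
		close(fd);
		return 0;
	}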
IV/ Wakeup Alarm service
EFI provides an API by which one can program when a machine should wake up,
i.e. reboot. This is very different from the alarm provided by the legacy
RTC which is some kind of interval timer alarm. For this reason we don't use
the same ioctl()s to get access to the service. Instead we have
introduced 2 new ioctl()s to the interface of an RTC.
We have added 2 new ioctl()s that are specific to the EFI driver:
Read the current state of the alarm
ioctl(d, RTC_WKALM_RD, &wkt)
Set the alarm or change its status
ioctl(d, RTC_WKALM_SET, &wkt)
The wkt structure encapsulates a struct rtc_time + 2 extra fields to get
status information:
struct rtc_wkalrm {
unsigned char enabled; /* =1 if alarm is enabled */
unsigned char pending; /* =1 if alarm is pending */
struct rtc_time time;
};
As of today, none of the existing user-level apps supports this feature.
However, writing such a program should not be hard, simply using those two
ioctl()s.
Root privileges are required to be able to set the alarm.
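A sketch of a reader for the alarm state, under the same assumption about
the device node name, would be:

	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/ioctl.h>
	#include <unistd.h>
	#include <linux/rtc.h>

	int main(void)
	{
		struct rtc_wkalrm wkt;
		int fd = open("/dev/efirtc", O_RDONLY);	/* assumed node name */

		if (fd == -1) {
			perror("/dev/efirtc");
			return 1;
		}
		if (ioctl(fd, RTC_WKALM_RD, &wkt) == -1) {
			perror("RTC_WKALM_RD");
			close(fd);
			return 1;
		}
		printf("alarm %s, %s, set for %02d:%02d:%02d\n",
		       wkt.enabled ? "enabled" : "disabled",
		       wkt.pending ? "pending" : "not pending",
		       wkt.time.tm_hour, wkt.time.tm_min, wkt.time.tm_sec);
		close(fd);
		return 0;
	}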
V/ References.
Checkout the following Web site for more information on EFI:
http://developer.intel.com/technology/efi/

File diff suppressed because it is too large

286
Documentation/ia64/fsys.txt Normal file
View file

@ -0,0 +1,286 @@
-*-Mode: outline-*-
Light-weight System Calls for IA-64
-----------------------------------
Started: 13-Jan-2003
Last update: 27-Sep-2003
David Mosberger-Tang
<davidm@hpl.hp.com>
Using the "epc" instruction effectively introduces a new mode of
execution to the ia64 linux kernel. We call this mode the
"fsys-mode". To recap, the normal states of execution are:
- kernel mode:
Both the register stack and the memory stack have been
switched over to kernel memory. The user-level state is saved
in a pt-regs structure at the top of the kernel memory stack.
- user mode:
Both the register stack and the kernel stack are in
user memory. The user-level state is contained in the
CPU registers.
- bank 0 interruption-handling mode:
This is the non-interruptible state which all
interruption-handlers start execution in. The user-level
state remains in the CPU registers and some kernel state may
be stored in bank 0 of registers r16-r31.
In contrast, fsys-mode has the following special properties:
- execution is at privilege level 0 (most-privileged)
- CPU registers may contain a mixture of user-level and kernel-level
state (it is the responsibility of the kernel to ensure that no
security-sensitive kernel-level state is leaked back to
user-level)
- execution is interruptible and preemptible (an fsys-mode handler
can disable interrupts and avoid all other interruption-sources
to avoid preemption)
- neither the memory-stack nor the register-stack can be trusted while
in fsys-mode (they point to the user-level stacks, which may
be invalid, or completely bogus addresses)
In summary, fsys-mode is much more similar to running in user-mode
than it is to running in kernel-mode. Of course, given that the
privilege level is at level 0, this means that fsys-mode requires some
care (see below).
* How to tell fsys-mode
Linux operates in fsys-mode when (a) the privilege level is 0 (most
privileged) and (b) the stacks have NOT been switched to kernel memory
yet. For convenience, the header file <asm-ia64/ptrace.h> provides
three macros:
user_mode(regs)
user_stack(task,regs)
fsys_mode(task,regs)
The "regs" argument is a pointer to a pt_regs structure. The "task"
argument is a pointer to the task structure to which the "regs"
pointer belongs. user_mode() returns TRUE if the CPU state pointed
to by "regs" was executing in user mode (privilege level 3).
user_stack() returns TRUE if the state pointed to by "regs" was
executing on the user-level stack(s). Finally, fsys_mode() returns
TRUE if the CPU state pointed to by "regs" was executing in fsys-mode.
The fsys_mode() macro is equivalent to the expression:
!user_mode(regs) && user_stack(task,regs)
* How to write an fsyscall handler
The file arch/ia64/kernel/fsys.S contains a table of fsyscall-handlers
(fsyscall_table). This table contains one entry for each system call.
By default, a system call is handled by fsys_fallback_syscall(). This
routine takes care of entering (full) kernel mode and calling the
normal Linux system call handler. For performance-critical system
calls, it is possible to write a hand-tuned fsyscall_handler. For
example, fsys.S contains fsys_getpid(), which is a hand-tuned version
of the getpid() system call.
The entry and exit-state of an fsyscall handler is as follows:
** Machine state on entry to fsyscall handler:
- r10 = 0
- r11 = saved ar.pfs (a user-level value)
- r15 = system call number
- r16 = "current" task pointer (in normal kernel-mode, this is in r13)
- r32-r39 = system call arguments
- b6 = return address (a user-level value)
- ar.pfs = previous frame-state (a user-level value)
- PSR.be = cleared to zero (i.e., little-endian byte order is in effect)
- all other registers may contain values passed in from user-mode
** Required machine state on exit to fsyscall handler:
- r11 = saved ar.pfs (as passed into the fsyscall handler)
- r15 = system call number (as passed into the fsyscall handler)
- r32-r39 = system call arguments (as passed into the fsyscall handler)
- b6 = return address (as passed into the fsyscall handler)
- ar.pfs = previous frame-state (as passed into the fsyscall handler)
Fsyscall handlers can execute with very little overhead, but with that
speed comes a set of restrictions:
o Fsyscall-handlers MUST check for any pending work in the flags
member of the thread-info structure and if any of the
TIF_ALLWORK_MASK flags are set, the handler needs to fall back on
doing a full system call (by calling fsys_fallback_syscall).
o Fsyscall-handlers MUST preserve incoming arguments (r32-r39, r11,
r15, b6, and ar.pfs) because they will be needed in case of a
system call restart. Of course, all "preserved" registers also
must be preserved, in accordance to the normal calling conventions.
o Fsyscall-handlers MUST check argument registers for containing a
NaT value before using them in any way that could trigger a
NaT-consumption fault. If a system call argument is found to
contain a NaT value, an fsyscall-handler may return immediately
with r8=EINVAL, r10=-1.
o Fsyscall-handlers MUST NOT use the "alloc" instruction or perform
any other operation that would trigger mandatory RSE
(register-stack engine) traffic.
o Fsyscall-handlers MUST NOT write to any stacked registers because
it is not safe to assume that user-level called a handler with the
proper number of arguments.
o Fsyscall-handlers need to be careful when accessing per-CPU variables:
unless proper safe-guards are taken (e.g., interruptions are avoided),
execution may be pre-empted and resumed on another CPU at any given
time.
o Fsyscall-handlers must be careful not to leak sensitive kernel
information back to user-level. In particular, before returning to
user-level, care needs to be taken to clear any scratch registers
that could contain sensitive information (note that the current
task pointer is not considered sensitive: it's already exposed
through ar.k6).
o Fsyscall-handlers MUST NOT access user-memory without first
validating access-permission (this can be done typically via
probe.r.fault and/or probe.w.fault) and without guarding against
memory access exceptions (this can be done with the EX() macros
defined by asmmacro.h).
The above restrictions may seem draconian, but remember that it's
possible to trade off some of the restrictions by paying a slightly
higher overhead. For example, if an fsyscall-handler could benefit
from the shadow register bank, it could temporarily disable PSR.i and
PSR.ic, switch to bank 0 (bsw.0) and then use the shadow registers as
needed. In other words, following the above rules yields extremely
fast system call execution (while fully preserving system call
semantics), but there is also a lot of flexibility in handling more
complicated cases.
* Signal handling
The delivery of (asynchronous) signals must be delayed until fsys-mode
is exited. This is accomplished with the help of the lower-privilege
transfer trap: arch/ia64/kernel/process.c:do_notify_resume_user()
checks whether the interrupted task was in fsys-mode and, if so, sets
PSR.lp and returns immediately. When fsys-mode is exited via the
"br.ret" instruction that lowers the privilege level, a trap will
occur. The trap handler clears PSR.lp again and returns immediately.
The kernel exit path then checks for and delivers any pending signals.
* PSR Handling
The "epc" instruction doesn't change the contents of PSR at all. This
is in contrast to a regular interruption, which clears almost all
bits. Because of that, some care needs to be taken to ensure things
work as expected. The following discussion describes how each PSR bit
is handled.
PSR.be Cleared when entering fsys-mode. A srlz.d instruction is used
to ensure the CPU is in little-endian mode before the first
load/store instruction is executed. PSR.be is normally NOT
restored upon return from an fsys-mode handler. In other
words, user-level code must not rely on PSR.be being preserved
across a system call.
PSR.up Unchanged.
PSR.ac Unchanged.
PSR.mfl Unchanged. Note: fsys-mode handlers must not write-registers!
PSR.mfh Unchanged. Note: fsys-mode handlers must not write-registers!
PSR.ic Unchanged. Note: fsys-mode handlers can clear the bit, if needed.
PSR.i Unchanged. Note: fsys-mode handlers can clear the bit, if needed.
PSR.pk Unchanged.
PSR.dt Unchanged.
PSR.dfl Unchanged. Note: fsys-mode handlers must not write-registers!
PSR.dfh Unchanged. Note: fsys-mode handlers must not write-registers!
PSR.sp Unchanged.
PSR.pp Unchanged.
PSR.di Unchanged.
PSR.si Unchanged.
PSR.db Unchanged. The kernel prevents user-level from setting a hardware
breakpoint that triggers at any privilege level other than 3 (user-mode).
PSR.lp Unchanged.
PSR.tb Lazy redirect. If a taken-branch trap occurs while in
fsys-mode, the trap-handler modifies the saved machine state
such that execution resumes in the gate page at
syscall_via_break(), with privilege level 3. Note: the
taken branch would occur on the branch invoking the
fsyscall-handler, at which point, by definition, a syscall
restart is still safe. If the system call number is invalid,
the fsys-mode handler will return directly to user-level. This
return will trigger a taken-branch trap, but since the trap is
taken _after_ restoring the privilege level, the CPU has already
left fsys-mode, so no special treatment is needed.
PSR.rt Unchanged.
PSR.cpl Cleared to 0.
PSR.is Unchanged (guaranteed to be 0 on entry to the gate page).
PSR.mc Unchanged.
PSR.it Unchanged (guaranteed to be 1).
PSR.id Unchanged. Note: the ia64 linux kernel never sets this bit.
PSR.da Unchanged. Note: the ia64 linux kernel never sets this bit.
PSR.dd Unchanged. Note: the ia64 linux kernel never sets this bit.
PSR.ss Lazy redirect. If set, "epc" will cause a Single Step Trap to
be taken. The trap handler then modifies the saved machine
state such that execution resumes in the gate page at
syscall_via_break(), with privilege level 3.
PSR.ri Unchanged.
PSR.ed Unchanged. Note: This bit could only have an effect if an fsys-mode
handler performed a speculative load that gets NaTted. If so, this
would be the normal & expected behavior, so no special treatment is
needed.
PSR.bn Unchanged. Note: fsys-mode handlers may clear the bit, if needed.
Doing so requires clearing PSR.i and PSR.ic as well.
PSR.ia Unchanged. Note: the ia64 linux kernel never sets this bit.
* Using fast system calls
To use fast system calls, userspace applications need simply call
__kernel_syscall_via_epc(). For example
-- example fgettimeofday() call --
-- fgettimeofday.S --
#include <asm/asmmacro.h>
GLOBAL_ENTRY(fgettimeofday)
.prologue
.save ar.pfs, r11
mov r11 = ar.pfs
.body
mov r2 = 0xa000000000020660;; // gate address
// found by inspection of System.map for the
// __kernel_syscall_via_epc() function. See
// below for how to do this for real.
mov b7 = r2
mov r15 = 1087 // gettimeofday syscall
;;
br.call.sptk.many b6 = b7
;;
.restore sp
mov ar.pfs = r11
br.ret.sptk.many rp;; // return to caller
END(fgettimeofday)
-- end fgettimeofday.S --
In reality, getting the gate address is accomplished by two extra
values passed via the ELF auxiliary vector (include/asm-ia64/elf.h)
o AT_SYSINFO : is the address of __kernel_syscall_via_epc()
o AT_SYSINFO_EHDR : is the address of the kernel gate ELF DSO
The ELF DSO is a pre-linked library that is mapped in by the kernel at
the gate page. It is a proper ELF shared object so, with a dynamic
loader that recognises the library, you should be able to make calls to
the exported functions within it as with any other shared library.
AT_SYSINFO points into the kernel DSO at the
__kernel_syscall_via_epc() function for historical reasons (it was
used before the kernel DSO) and as a convenience.
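For illustration, a user program can look the gate address up "for real"
from the auxiliary vector rather than hard-coding it from System.map. The
sketch below uses glibc's getauxval(), available since glibc 2.16 (on older
C libraries the aux vector can be walked by hand starting from envp):

	#include <stdio.h>
	#include <sys/auxv.h>
	#include <elf.h>

	int main(void)
	{
		unsigned long epc  = getauxval(AT_SYSINFO);
		unsigned long ehdr = getauxval(AT_SYSINFO_EHDR);

		printf("__kernel_syscall_via_epc(): 0x%lx\n", epc);
		printf("gate DSO ELF header:        0x%lx\n", ehdr);
		return 0;
	}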

View file

@ -0,0 +1,83 @@
Currently, the kvm module is in the EXPERIMENTAL stage on IA64. This means
that the interfaces are not stable enough to rely on. So please don't run
critical applications in a virtual machine.
We will try our best to improve it in future versions!
Guide: How to boot up guests on kvm/ia64
This guide is to describe how to enable kvm support for IA-64 systems.
1. Get the kvm source from git.kernel.org.
Userspace source:
git clone git://git.kernel.org/pub/scm/virt/kvm/kvm-userspace.git
Kernel Source:
git clone git://git.kernel.org/pub/scm/linux/kernel/git/xiantao/kvm-ia64.git
2. Compile the source code.
2.1 Compile userspace code:
(1)cd ./kvm-userspace
(2)./configure
(3)cd kernel
(4)make sync LINUX=$kernel_dir ($kernel_dir is the directory of the kernel source.)
(5)cd ..
(6)make qemu
(7)cd qemu; make install
2.2 Compile kernel source code:
(1) cd ./$kernel_dir
(2) make menuconfig
(3) Enter the virtualization option, and choose kvm.
(4) make
(5) Once (4) done, make modules_install
(6) Make an initrd, and use the new kernel to reboot the host machine.
(7) Once (6) done, cd $kernel_dir/arch/ia64/kvm
(8) insmod kvm.ko; insmod kvm-intel.ko
Note: For step 2, please make sure that the host page size == TARGET_PAGE_SIZE of qemu; otherwise, it may fail.
3. Get the guest firmware named Flash.fd, and put it in the right place:
(1) If you have the guest firmware (binary) released by Intel Corp for Xen, use it directly.
(2) If you have no firmware at hand, please download its source from
hg clone http://xenbits.xensource.com/ext/efi-vfirmware.hg
You can find the firmware binary in the efi-vfirmware.hg/binaries directory.
(3) Rename the firmware you obtained to Flash.fd, and copy it to /usr/local/share/qemu
4. Boot up Linux or Windows guests:
4.1 Create or install an image for the guest to boot. If you have xen experience, it should be easy.
4.2 Boot up guests using the following command.
/usr/local/bin/qemu-system-ia64 -smp xx -m 512 -hda $your_image
(xx is the number of virtual processors for the guest, now the maximum value is 4)
5. Known possible issues on some platforms with old firmware.
In the event of strange host crashes, try to solve them in either of the following ways:
(1): Upgrade your firmware to the latest version.
(2): Apply the patch below to the kernel source.
diff --git a/arch/ia64/kernel/pal.S b/arch/ia64/kernel/pal.S
index 0b53344..f02b0f7 100644
--- a/arch/ia64/kernel/pal.S
+++ b/arch/ia64/kernel/pal.S
@@ -84,7 +84,8 @@ GLOBAL_ENTRY(ia64_pal_call_static)
mov ar.pfs = loc1
mov rp = loc0
;;
- srlz.d // serialize restoration of psr.l
+ srlz.i // serialize restoration of psr.l
+ ;;
br.ret.sptk.many b0
END(ia64_pal_call_static)
6. Bug report:
If you find any issues when using kvm/ia64, please post the bug info to the kvm-ia64-devel mailing list.
https://lists.sourceforge.net/lists/listinfo/kvm-ia64-devel/
Thanks for your interest! Let's work together, and make kvm/ia64 stronger and stronger!
Xiantao Zhang <xiantao.zhang@intel.com>
2008.3.10

194
Documentation/ia64/mca.txt Normal file
View file

@ -0,0 +1,194 @@
An ad-hoc collection of notes on IA64 MCA and INIT processing. Feel
free to update it with notes about any area that is not clear.
---
MCA/INIT are completely asynchronous. They can occur at any time, when
the OS is in any state. Including when one of the cpus is already
holding a spinlock. Trying to get any lock from MCA/INIT state is
asking for deadlock. Also the state of structures that are protected
by locks is indeterminate, including linked lists.
---
The complicated ia64 MCA process. All of this is mandated by Intel's
specification for ia64 SAL, error recovery and unwind; it is not as
if we have a choice here.
* MCA occurs on one cpu, usually due to a double bit memory error.
This is the monarch cpu.
* SAL sends an MCA rendezvous interrupt (which is a normal interrupt)
to all the other cpus, the slaves.
* Slave cpus that receive the MCA interrupt call down into SAL, they
end up spinning disabled while the MCA is being serviced.
* If any slave cpu was already spinning disabled when the MCA occurred
then it cannot service the MCA interrupt. SAL waits ~20 seconds then
sends an unmaskable INIT event to the slave cpus that have not
already rendezvoused.
* Because MCA/INIT can be delivered at any time, including when the cpu
is down in PAL in physical mode, the registers at the time of the
event are _completely_ undefined. In particular the MCA/INIT
handlers cannot rely on the thread pointer, PAL physical mode can
(and does) modify TP. It is allowed to do that as long as it resets
TP on return. However MCA/INIT events expose us to these PAL
internal TP changes. Hence curr_task().
* If an MCA/INIT event occurs while the kernel was running (not user
space) and the kernel has called PAL then the MCA/INIT handler cannot
assume that the kernel stack is in a fit state to be used. Mainly
because PAL may or may not maintain the stack pointer internally.
Because the MCA/INIT handlers cannot trust the kernel stack, they
have to use their own, per-cpu stacks. The MCA/INIT stacks are
preformatted with just enough task state to let the relevant handlers
do their job.
* Unlike most other architectures, the ia64 struct task is embedded in
the kernel stack[1]. So switching to a new kernel stack means that
we switch to a new task as well. Because various bits of the kernel
assume that current points into the struct task, switching to a new
stack also means a new value for current.
* Once all slaves have rendezvoused and are spinning disabled, the
monarch is entered. The monarch now tries to diagnose the problem
and decide if it can recover or not.
* Part of the monarch's job is to look at the state of all the other
tasks. The only way to do that on ia64 is to call the unwinder,
as mandated by Intel.
* The starting point for the unwind depends on whether a task is
running or not. That is, whether it is on a cpu or is blocked. The
monarch has to determine whether or not a task is on a cpu before it
knows how to start unwinding it. The tasks that received an MCA or
INIT event are no longer running, they have been converted to blocked
tasks. But (and it's a big but), the cpus that received the MCA
rendezvous interrupt are still running on their normal kernel stacks!
* To distinguish between these two cases, the monarch must know which
tasks are on a cpu and which are not. Hence each slave cpu that
switches to an MCA/INIT stack, registers its new stack using
set_curr_task(), so the monarch can tell that the _original_ task is
no longer running on that cpu. That gives us a decent chance of
getting a valid backtrace of the _original_ task.
* MCA/INIT can be nested, to a depth of 2 on any cpu. In the case of a
nested error, we want diagnostics on the MCA/INIT handler that
failed, not on the task that was originally running. Again this
requires set_curr_task() so the MCA/INIT handlers can register their
own stack as running on that cpu. Then a recursive error gets a
trace of the failing handler's "task".
[1] My (Keith Owens) original design called for ia64 to separate its
struct task and the kernel stacks. Then the MCA/INIT data would be
chained stacks like i386 interrupt stacks. But that required
radical surgery on the rest of ia64, plus extra hard wired TLB
entries with its associated performance degradation. David
Mosberger vetoed that approach. Which meant that separate kernel
stacks meant separate "tasks" for the MCA/INIT handlers.
---
INIT is less complicated than MCA. Pressing the nmi button or using
the equivalent command on the management console sends INIT to all
cpus. SAL picks one of the cpus as the monarch and the rest are
slaves. All the OS INIT handlers are entered at approximately the same
time. The OS monarch prints the state of all tasks and returns, after
which the slaves return and the system resumes.
At least that is what is supposed to happen. Alas there are broken
versions of SAL out there. Some drive all the cpus as monarchs. Some
drive them all as slaves. Some drive one cpu as monarch, wait for that
cpu to return from the OS then drive the rest as slaves. Some versions
of SAL cannot even cope with returning from the OS, they spin inside
SAL on resume. The OS INIT code has workarounds for some of these
broken SAL symptoms, but some simply cannot be fixed from the OS side.
---
The scheduler hooks used by ia64 (curr_task, set_curr_task) are layer
violations. Unfortunately MCA/INIT start off as massive layer
violations (can occur at _any_ time) and they build from there.
At least ia64 makes an attempt at recovering from hardware errors, but
it is a difficult problem because of the asynchronous nature of these
errors. When processing an unmaskable interrupt we sometimes need
special code to cope with our inability to take any locks.
---
How is ia64 MCA/INIT different from x86 NMI?
* x86 NMI typically gets delivered to one cpu. MCA/INIT gets sent to
all cpus.
* x86 NMI cannot be nested. MCA/INIT can be nested, to a depth of 2
per cpu.
* x86 has a separate struct task which points to one of multiple kernel
stacks. ia64 has the struct task embedded in the single kernel
stack, so switching stack means switching task.
* x86 does not call the BIOS so the NMI handler does not have to worry
about any registers having changed. MCA/INIT can occur while the cpu
is in PAL in physical mode, with undefined registers and an undefined
kernel stack.
* i386 backtrace is not very sensitive to whether a process is running
or not. ia64 unwind is very, very sensitive to whether a process is
running or not.
---
What happens when MCA/INIT is delivered while a cpu is running user
space code?
The user mode registers are stored in the RSE area of the MCA/INIT on
entry to the OS and are restored from there on return to SAL, so user
mode registers are preserved across a recoverable MCA/INIT. Since the
OS has no idea what unwind data is available for the user space stack,
MCA/INIT never tries to backtrace user space. Which means that the OS
does not bother making the user space process look like a blocked task,
i.e. the OS does not copy pt_regs and switch_stack to the user space
stack. Also the OS has no idea how big the user space RSE and memory
stacks are, which makes it too risky to copy the saved state to a user
mode stack.
---
How do we get a backtrace on the tasks that were running when MCA/INIT
was delivered?
mca.c:::ia64_mca_modify_original_stack(). That identifies and
verifies the original kernel stack, copies the dirty registers from
the MCA/INIT stack's RSE to the original stack's RSE, copies the
skeleton struct pt_regs and switch_stack to the original stack, fills
in the skeleton structures from the PAL minstate area and updates the
original stack's thread.ksp. That makes the original stack look
exactly like any other blocked task, i.e. it now appears to be
sleeping. To get a backtrace, just start with thread.ksp for the
original task and unwind like any other sleeping task.
---
How do we identify the tasks that were running when MCA/INIT was
delivered?
If the previous task has been verified and converted to a blocked
state, then sos->prev_task on the MCA/INIT stack is updated to point to
the previous task. You can look at that field in dumps or debuggers.
To help distinguish between the handler and the original tasks,
handlers have _TIF_MCA_INIT set in thread_info.flags.
The sos data is always in the MCA/INIT handler stack, at offset
MCA_SOS_OFFSET. You can get that value from mca_asm.h or calculate it
as KERNEL_STACK_SIZE - sizeof(struct pt_regs) - sizeof(struct
ia64_sal_os_state), with 16 byte alignment for all structures.
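As a worked illustration of that formula (the sizes below are placeholders,
not the real kernel values, which live in mca_asm.h and the relevant headers):

	#define ALIGN16(n)		(((n) + 15UL) & ~15UL)

	#define KERNEL_STACK_SIZE	(32 * 1024)	/* placeholder */
	#define PT_REGS_SIZE		ALIGN16(400)	/* placeholder for sizeof(struct pt_regs) */
	#define SAL_OS_STATE_SIZE	ALIGN16(144)	/* placeholder for sizeof(struct ia64_sal_os_state) */

	/* The offset, per the formula quoted above. */
	#define MCA_SOS_OFFSET \
		(KERNEL_STACK_SIZE - PT_REGS_SIZE - SAL_OS_STATE_SIZE)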
Also the comm field of the MCA/INIT task is modified to include the pid
of the original task, for humans to use. For example, a comm field of
'MCA 12159' means that pid 12159 was running when the MCA was
delivered.

View file

@ -0,0 +1,137 @@
Paravirt_ops on IA64
====================
21 May 2008, Isaku Yamahata <yamahata@valinux.co.jp>
Introduction
------------
The aim of this documentation is to help with maintainability and/or to
encourage people to use paravirt_ops/IA64.
paravirt_ops (pv_ops in short) is a way for virtualization support of
Linux kernel on x86. Several ways for virtualization support were
proposed, and paravirt_ops is the winner.
On the other hand, now there are also several IA64 virtualization
technologies like kvm/IA64, xen/IA64 and many other academic IA64
hypervisors, so it is good to add generic virtualization
infrastructure to Linux/IA64.
What is paravirt_ops?
---------------------
It has been developed on x86 as virtualization support via API, not ABI.
It allows each hypervisor to override operations which are important for
hypervisors at API level. And it allows a single kernel binary to run on
all supported execution environments including native machine.
Essentially paravirt_ops is a set of function pointers which represent
operations corresponding to low level sensitive instructions and high
level functionalities in various area. But one significant difference
from usual function pointer table is that it allows optimization with
binary patch. It is because some of these operations are very
performance sensitive and indirect call overhead is not negligible.
With binary patch, indirect C function call can be transformed into
direct C function call or in-place execution to eliminate the overhead.
Thus, operations of paravirt_ops are classified into three categories.
- simple indirect call
These operations correspond to high level functionality so that the
overhead of indirect call isn't very important.
- indirect call which allows optimization with binary patch
Usually these operations correspond to low level instructions. They
are called frequently and performance critical. So the overhead is
very important.
- a set of macros for hand written assembly code
Hand written assembly codes (.S files) also need paravirtualization
because they include sensitive instructions or some of code paths in
them are very performance critical.
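To make the function-pointer-table idea concrete, here is a purely
illustrative C sketch; the names are invented for this example and are not
the real pv_ops definitions. The native kernel installs a table pointing at
its ordinary implementations, and a hypervisor port installs its own:

	/* Illustrative only; not the real IA64 pv_ops structures. */
	struct example_pv_irq_ops {
		void (*irq_enable)(void);
		void (*irq_disable)(void);
	};

	static void native_irq_enable(void)  { /* e.g. unmask interrupts natively */ }
	static void native_irq_disable(void) { /* e.g. mask interrupts natively */ }

	/* Shipped by the native kernel; a hypervisor port replaces the table. */
	static struct example_pv_irq_ops example_pv_irq_ops = {
		.irq_enable  = native_irq_enable,
		.irq_disable = native_irq_disable,
	};

	/* Callers go through the pointer; binary patching can later turn this
	 * indirect call into a direct call or in-place code on native. */
	static inline void example_local_irq_enable(void)
	{
		example_pv_irq_ops.irq_enable();
	}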
The relation to the IA64 machine vector
---------------------------------------
Linux/IA64 has the IA64 machine vector functionality which allows the
kernel to switch implementations (e.g. initialization, ipi, dma api...)
depending on executing platform.
We can replace some implementations very easily defining a new machine
vector. Thus another approach for virtualization support would be
enhancing the machine vector functionality.
But paravirt_ops approach was taken because
- virtualization support needs wider support than machine vector does.
e.g. low level instruction paravirtualization. It must be
initialized very early before platform detection.
- virtualization support needs more functionality like binary patch.
Probably the calling overhead might not be very large compared to the
emulation overhead of virtualization. However in the native case, the
overhead should be eliminated completely.
A single kernel binary should run on each environment including native,
and the overhead of paravirt_ops on native environment should be as
small as possible.
- for full virtualization technology, e.g. KVM/IA64 or
Xen/IA64 HVM domain, the result would be
(the emulated platform machine vector. probably dig) + (pv_ops).
This means that the virtualization support layer should be under
the machine vector layer.
Possibly it might be better to move some function pointers from
paravirt_ops to machine vector. In fact, Xen domU case utilizes both
pv_ops and machine vector.
IA64 paravirt_ops
-----------------
In this section, the concrete paravirt_ops will be discussed.
Because of the architecture difference between ia64 and x86, the
resulting set of functions is very different from x86 pv_ops.
- C function pointer tables
They are not very performance critical so that simple C indirect
function call is acceptable. The following structures are defined at
this moment. For details see linux/include/asm-ia64/paravirt.h
- struct pv_info
This structure describes the execution environment.
- struct pv_init_ops
This structure describes the various initialization hooks.
- struct pv_iosapic_ops
This structure describes hooks to iosapic operations.
- struct pv_irq_ops
This structure describes hooks to irq related operations
- struct pv_time_op
This structure describes hooks to steal time accounting.
- a set of indirect calls which need optimization
Currently this class of functions corresponds to a subset of IA64
intrinsics. At this moment the optimization with binary patch isn't
implemented yet.
struct pv_cpu_op is defined. For details see
linux/include/asm-ia64/paravirt_privop.h
Mostly they correspond to ia64 intrinsics 1-to-1.
Caveat: Now they are defined as C indirect function pointers, but in
order to support binary patch optimization, they will be changed to
use GCC extended inline assembly code.
- a set of macros for hand written assembly code (.S files)
For maintenance purpose, the taken approach for .S files is single
source code and compile multiple times with different macros definitions.
Each pv_ops instance must define those macros to compile.
The important thing here is that sensitive, but non-privileged
instructions must be paravirtualized and that some privileged
instructions also need paravirtualization for reasonable performance.
Developers who modify .S files must be aware of that. At this moment
an easy checker is implemented to detect paravirtualization breakage.
But it doesn't cover all the cases.
Sometimes this set of macros is called pv_cpu_asm_op. But there is no
corresponding structure in the source code.
Those macros mostly 1:1 correspond to a subset of privileged
instructions. See linux/include/asm-ia64/native/inst.h.
And some functions written in assembly also need to be overridden, so
each pv_ops instance has to define some macros. Again see
linux/include/asm-ia64/native/inst.h.
Those structures must be initialized very early before start_kernel.
Probably initialized in head.S using multi entry point or some other trick.
For native case implementation see linux/arch/ia64/kernel/paravirt.c.

View file

@ -0,0 +1,151 @@
SERIAL DEVICE NAMING
As of 2.6.10, serial devices on ia64 are named based on the
order of ACPI and PCI enumeration. The first device in the
ACPI namespace (if any) becomes /dev/ttyS0, the second becomes
/dev/ttyS1, etc., and PCI devices are named sequentially
starting after the ACPI devices.
Prior to 2.6.10, there were confusing exceptions to this:
- Firmware on some machines (mostly from HP) provides an HCDP
table[1] that tells the kernel about devices that can be used
as a serial console. If the user specified "console=ttyS0"
or the EFI ConOut path contained only UART devices, the
kernel registered the device described by the HCDP as
/dev/ttyS0.
- If there was no HCDP, we assumed there were UARTs at the
legacy COM port addresses (I/O ports 0x3f8 and 0x2f8), so
the kernel registered those as /dev/ttyS0 and /dev/ttyS1.
Any additional ACPI or PCI devices were registered sequentially
after /dev/ttyS0 as they were discovered.
With an HCDP, device names changed depending on EFI configuration
and "console=" arguments. Without an HCDP, device names didn't
change, but we registered devices that might not really exist.
For example, an HP rx1600 with a single built-in serial port
(described in the ACPI namespace) plus an MP[2] (a PCI device) has
these ports:
                     MMIO        pre-2.6.10    pre-2.6.10
                     address     (EFI console  (EFI console
                                 on builtin)   on MP port)   2.6.10
                     ==========  ============  ============  ======
 builtin             0xff5e0000  ttyS0         ttyS1         ttyS0
 MP UPS              0xf8031000  ttyS1         ttyS2         ttyS1
 MP Console          0xf8030000  ttyS2         ttyS0         ttyS2
 MP 2                0xf8030010  ttyS3         ttyS3         ttyS3
 MP 3                0xf8030038  ttyS4         ttyS4         ttyS4
CONSOLE SELECTION
EFI knows what your console devices are, but it doesn't tell the
kernel quite enough to actually locate them. The DIG64 HCDP
table[1] does tell the kernel where potential serial console
devices are, but not all firmware supplies it. Also, EFI supports
multiple simultaneous consoles and doesn't tell the kernel which
should be the "primary" one.
So how do you tell Linux which console device to use?
- If your firmware supplies the HCDP, it is simplest to
configure EFI with a single device (either a UART or a VGA
card) as the console. Then you don't need to tell Linux
anything; the kernel will automatically use the EFI console.
(This works only in 2.6.6 or later; prior to that you had
to specify "console=ttyS0" to get a serial console.)
- Without an HCDP, Linux defaults to a VGA console unless you
specify a "console=" argument.
NOTE: Don't assume that a serial console device will be /dev/ttyS0.
It might be ttyS1, ttyS2, etc. Make sure you have the appropriate
entries in /etc/inittab (for getty) and /etc/securetty (to allow
root login).
EARLY SERIAL CONSOLE
The kernel can't start using a serial console until it knows where
the device lives. Normally this happens when the driver enumerates
all the serial devices, which can happen a minute or more after the
kernel starts booting.
2.6.10 and later kernels have an "early uart" driver that works
very early in the boot process. The kernel will automatically use
this if the user supplies an argument like "console=uart,io,0x3f8",
or if the EFI console path contains only a UART device and the
firmware supplies an HCDP.
TROUBLESHOOTING SERIAL CONSOLE PROBLEMS
No kernel output after elilo prints "Uncompressing Linux... done":
- You specified "console=ttyS0" but Linux changed the device
to which ttyS0 refers. Configure exactly one EFI console
device[3] and remove the "console=" option.
- The EFI console path contains both a VGA device and a UART.
EFI and elilo use both, but Linux defaults to VGA. Remove
the VGA device from the EFI console path[3].
- Multiple UARTs selected as EFI console devices. EFI and
elilo use all selected devices, but Linux uses only one.
Make sure only one UART is selected in the EFI console
path[3].
- You're connected to an HP MP port[2] but have a non-MP UART
selected as EFI console device. EFI uses the MP as a
console device even when it isn't explicitly selected.
Either move the console cable to the non-MP UART, or change
the EFI console path[3] to the MP UART.
Long pause (60+ seconds) between "Uncompressing Linux... done" and
start of kernel output:
- No early console because you used "console=ttyS<n>". Remove
the "console=" option if your firmware supplies an HCDP.
- If you don't have an HCDP, the kernel doesn't know where
your console lives until the driver discovers serial
devices. Use "console=uart, io,0x3f8" (or appropriate
address for your machine).
Kernel and init script output works fine, but no "login:" prompt:
- Add getty entry to /etc/inittab for console tty. Look for
the "Adding console on ttyS<n>" message that tells you which
device is the console.
"login:" prompt, but can't login as root:
- Add entry to /etc/securetty for console tty.
No ACPI serial devices found in 2.6.17 or later:
- Turn on CONFIG_PNP and CONFIG_PNPACPI. Prior to 2.6.17, ACPI
serial devices were discovered by 8250_acpi. In 2.6.17,
8250_acpi was replaced by the combination of 8250_pnp and
CONFIG_PNPACPI.
[1] http://www.dig64.org/specifications/agreement
The table was originally defined as the "HCDP" for "Headless
Console/Debug Port." The current version is the "PCDP" for
"Primary Console and Debug Port Devices."
[2] The HP MP (management processor) is a PCI device that provides
several UARTs. One of the UARTs is often used as a console; the
EFI Boot Manager identifies it as "Acpi(HWP0002,700)/Pci(...)/Uart".
The external connection is usually a 25-pin connector, and a
special dongle converts that to three 9-pin connectors, one of
which is labelled "Console."
[3] EFI console devices are configured using the EFI Boot Manager
"Boot option maintenance" menu. You may have to interrupt the
boot sequence to use this menu, and you will have to reset the
box after changing console configuration.

183
Documentation/ia64/xen.txt Normal file
View file

@ -0,0 +1,183 @@
Recipe for getting/building/running Xen/ia64 with pv_ops
--------------------------------------------------------
This recipe describes how to get xen-ia64 source and build it,
and run domU with pv_ops.
============
Requirements
============
- python
- mercurial
it (aka "hg") is an open-source source code
management software. See the below.
http://www.selenic.com/mercurial/wiki/
- git
- bridge-utils
=================================
Getting and Building Xen and Dom0
=================================
My environment is:
Machine : Tiger4
Domain0 OS : RHEL5
DomainU OS : RHEL5
1. Download source
# hg clone http://xenbits.xensource.com/ext/ia64/xen-unstable.hg
# cd xen-unstable.hg
# hg clone http://xenbits.xensource.com/ext/ia64/linux-2.6.18-xen.hg
2. # make world
3. # make install-tools
4. copy kernels and xen
# cp xen/xen.gz /boot/efi/efi/redhat/
# cp build-linux-2.6.18-xen_ia64/vmlinux.gz \
/boot/efi/efi/redhat/vmlinuz-2.6.18.8-xen
5. make initrd for Dom0/DomU
# make -C linux-2.6.18-xen.hg ARCH=ia64 modules_install \
O=$(/bin/pwd)/build-linux-2.6.18-xen_ia64
# mkinitrd -f /boot/efi/efi/redhat/initrd-2.6.18.8-xen.img \
2.6.18.8-xen --builtin mptspi --builtin mptbase \
--builtin mptscsih --builtin uhci-hcd --builtin ohci-hcd \
--builtin ehci-hcd
================================
Making a disk image for guest OS
================================
1. make file
# dd if=/dev/zero of=/root/rhel5.img bs=1M seek=4096 count=0
# mke2fs -F -j /root/rhel5.img
# mount -o loop /root/rhel5.img /mnt
# cp -ax /{dev,var,etc,usr,bin,sbin,lib} /mnt
# mkdir /mnt/{root,proc,sys,home,tmp}
Note: Some device files may be missing. If so, please create them
with mknod, or use tar instead of cp.
2. modify DomU's fstab
# vi /mnt/etc/fstab
/dev/xvda1 / ext3 defaults 1 1
none /dev/pts devpts gid=5,mode=620 0 0
none /dev/shm tmpfs defaults 0 0
none /proc proc defaults 0 0
none /sys sysfs defaults 0 0
3. modify inittab
set runlevel to 3 to avoid X trying to start
# vi /mnt/etc/inittab
id:3:initdefault:
Start a getty on the hvc0 console
X0:2345:respawn:/sbin/mingetty hvc0
tty1-6 mingetty can be commented out
4. add hvc0 into /etc/securetty
# vi /mnt/etc/securetty (add hvc0)
5. umount
# umount /mnt
FYI, virt-manager can also make a disk image for the guest OS.
It is a GUI tool and makes this easy.
==================
Boot Xen & Domain0
==================
1. replace elilo
elilo of RHEL5 can boot Xen and Dom0.
If you use an old elilo (e.g. RHEL4), please download it from the site below
http://elilo.sourceforge.net/cgi-bin/blosxom
and copy into /boot/efi/efi/redhat/
# cp elilo-3.6-ia64.efi /boot/efi/efi/redhat/elilo.efi
2. modify elilo.conf (like the below)
# vi /boot/efi/efi/redhat/elilo.conf
prompt
timeout=20
default=xen
relocatable
image=vmlinuz-2.6.18.8-xen
label=xen
vmm=xen.gz
initrd=initrd-2.6.18.8-xen.img
read-only
append=" -- rhgb root=/dev/sda2"
The append options before "--" are for xen hypervisor,
the options after "--" are for dom0.
FYI, your machine may need console options like
"com1=19200,8n1 console=vga,com1". For example,
append="com1=19200,8n1 console=vga,com1 -- rhgb console=tty0 \
console=ttyS0 root=/dev/sda2"
=====================================
Getting and Building domU with pv_ops
=====================================
1. get pv_ops tree
# git clone http://people.valinux.co.jp/~yamahata/xen-ia64/linux-2.6-xen-ia64.git/
2. git branch (if necessary)
# cd linux-2.6-xen-ia64/
# git checkout -b your_branch origin/xen-ia64-domu-minimal-2008may19
(Note: The current branch is xen-ia64-domu-minimal-2008may19,
but newer branches may be available. Use "git branch -r" to
list the branches.
http://people.valinux.co.jp/~yamahata/xen-ia64/for_eagl/linux-2.6-ia64-pv-ops.git/
is also available. The tree is based on
git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6 test)
3. copy .config for pv_ops of domU
# cp arch/ia64/configs/xen_domu_wip_defconfig .config
4. make kernel with pv_ops
# make oldconfig
# make
5. install the kernel and initrd
# cp vmlinux.gz /boot/efi/efi/redhat/vmlinuz-2.6-pv_ops-xenU
# make modules_install
# mkinitrd -f /boot/efi/efi/redhat/initrd-2.6-pv_ops-xenU.img \
2.6.26-rc3xen-ia64-08941-g1b12161 --builtin mptspi \
--builtin mptbase --builtin mptscsih --builtin uhci-hcd \
--builtin ohci-hcd --builtin ehci-hcd
========================
Boot DomainU with pv_ops
========================
1. make config of DomU
# vi /etc/xen/rhel5
kernel = "/boot/efi/efi/redhat/vmlinuz-2.6-pv_ops-xenU"
ramdisk = "/boot/efi/efi/redhat/initrd-2.6-pv_ops-xenU.img"
vcpus = 1
memory = 512
name = "rhel5"
disk = [ 'file:/root/rhel5.img,xvda1,w' ]
root = "/dev/xvda1 ro"
extra= "rhgb console=hvc0"
2. After booting xen and dom0, start xend
# /etc/init.d/xend start
( In the debugging case, # XEND_DEBUG=1 xend trace_start )
3. start domU
# xm create -c rhel5
=========
Reference
=========
- Wiki of Xen/IA64 upstream merge
http://wiki.xensource.com/xenwiki/XenIA64/UpstreamMerge
Written by Akio Takebe <takebe_akio@jp.fujitsu.com> on 28 May 2008